# Module 12 - Programming Assignment

## General Directions

1. You must follow the Programming Requirements outlined on Canvas.
2. The Notebook should be cleanly and fully executed before submission.
3. You should change the name of this file to be your JHED id. For example, `jsmith299.ipynb` although Canvas will change it to something else...
4. You must follow the Programming Requirements for this course.

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        You should always read the entire assignment before beginning your work, so that you know in advance what the requested output will be and can plan your implementation accordingly.
    </p>
</div>

<div style="color: white; background: #C83F49; margin:20px; padding: 20px;">
    <strong>Academic Integrity and Copyright</strong>
    <p>You are not permitted to consult outside sources (Stackoverflow, YouTube, ChatGPT, etc.) or use "code assistance" (Co-Pilot, etc) to complete this assignment. By submitting this assignment for grading, you certify that the submission is 100% your own work, based on course materials, group interactions, and instructor guidance. You agree to comply with the requirements set forth in the Syllabus, including, by reference, the JHU KSAS/WSE Graduate Academic Misconduct Policy.</p>
    <p>Sharing this assignment either directly (e.g., email, github, homework site) or indirectly (e.g., ChatGPT, machine learning platform) is a violation of the copyright. Additionally, all such sharing is a violation of the Graduate Academic Misconduct Policy (facilitating academic dishonesty is itself academic dishonesty), even after you graduate.</p>
    <p>If you have questions or if you're unsure about the policy, ask via Canvas Inbox. In this case, being forgiven is <strong>not</strong> easier than getting permission and ignorance is not an excuse.</p>
    <p>This assignment is copyright (&copy; Johns Hopkins University &amp; Stephyn G. W. Butcher). All rights reserved.</p>
</div>

## k Nearest Neighbors and Model Evaluation

In this programming assignment you will use k Nearest Neighbors (kNN) to build a "model" that will estimate the compressive strength of various types of concrete. This assignment has several objectives:

1. Implement the kNN algorithm with k=9. Remember...the data + distance function is the model in kNN. In addition to asserts that unit test your code, you should "test drive" the model, showing output that a non-technical person could interpret.

2. You are going to compare the kNN model above against the baseline model described in the course notes (the mean of the training set's target variable). You should use 5 fold cross validation and Mean Squared Error (MSE):

$$MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

as the evaluation metric ("error"). Refer to the course notes for the format your output should take. Don't forget a discussion of the results.

3. Use validation curves to tune a *hyperparameter* of the model. 
In this case, the hyperparameter is *k*, the number of neighbors. Don't forget a discussion of the results.

4. Evaluate the *generalization error* of the new model.
Because you may have just created a new, better model, you need a sense of its generalization error, so calculate it. Again, what would you like to see as output here? Refer to the course notes. Don't forget a discussion of the results. Did the new model do better than either model in Q2?

5. Pick one of the "Choose Your Own Adventure" options.

Refer to the "course notes" for this module for most of this assignment.
Anytime you just need test/train split, use fold index 0 for the test set and the remainder as the training set.
Discuss any results.

## Load the Data

The function `parse_data` loads the data from the specified file and returns a List of Lists. The outer List is the data set and each element (List) is a specific observation. Each value of an observation is for a particular measurement. This is what we mean by "tidy" data.

The function also returns the *shuffled* data because the data might have been collected in a particular order that *might* bias training.

In [1]:
import random
from typing import List, Dict, Tuple, Callable

In [2]:
def parse_data(file_name: str) -> List[List]:
    data = []
    file = open(file_name, "r")
    for line in file:
        datum = [float(value) for value in line.rstrip().split(",")]
        data.append(datum)
    random.shuffle(data)
    return data

In [3]:
data = parse_data("concrete_compressive_strength.csv")

In [4]:
data[0]

[141.3, 212.0, 0.0, 203.5, 0.0, 971.8, 748.5, 28.0, 29.89]

In [5]:
len(data)

1030

There are 1,030 observations and each observation has 9 values: 8 feature measurements plus the target. The data dictionary for this data set tells us the definitions of the individual variables (columns/indices):

| Index | Variable | Definition |
|-------|----------|------------|
| 0     | cement   | kg in a cubic meter mixture |
| 1     | slag     | kg in a cubic meter mixture |
| 2     | ash      | kg in a cubic meter mixture |
| 3     | water    | kg in a cubic meter mixture |
| 4     | superplasticizer | kg in a cubic meter mixture |
| 5     | coarse aggregate | kg in a cubic meter mixture |
| 6     | fine aggregate | kg in a cubic meter mixture |
| 7     | age | days |
| 8     | concrete compressive strength | MPa |

The target ("y") variable is at Index 8, concrete compressive strength in megapascals (MPa; see [Pascals](https://en.wikipedia.org/wiki/Pascal_(unit))).
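Since every observation stores its eight features first and the target last, a row can be split with slicing. The following is a minimal sketch that uses the observation printed for `data[0]` above as a literal; the helper name `split_observation` is hypothetical and not part of the assignment code.

```python
from typing import List, Tuple


def split_observation(row: List[float]) -> Tuple[List[float], float]:
    """Return (features, target), assuming the target is the last value."""
    return row[:-1], row[-1]


# The observation shown for data[0] above, copied as a literal.
example_row = [141.3, 212.0, 0.0, 203.5, 0.0, 971.8, 748.5, 28.0, 29.89]
features, target = split_observation(example_row)
print(len(features), target)
```

The same `row[:-1]` / `row[-1]` convention is what the kNN code later in this notebook relies on.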

## Train/Test Splits - n folds

With n fold cross validation, we divide our data set into n subgroups called "folds" and then use those folds for training and testing. You pick n based on the size of your data set. If you have a small data set--100 observations--and you used n=10, each fold would only have 10 observations. That's probably too small. You want at least 30. At the other extreme, we generally don't use n > 10.

With 1,030 observations, n = 10 is fine so we will have 10 folds.
`create_folds` will take a list (xs) and split it into `n` equal folds with each fold containing one-tenth of the observations.

In [6]:
def create_folds(xs: List, n: int) -> List[List[List]]:
    k, m = divmod(len(xs), n)
    # be careful of generators...
    return list(xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

In [7]:
folds = create_folds(data, 10)

In [8]:
len(folds)

10
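With 1,030 observations the split into 10 folds is exact, but the `divmod` logic also handles lengths that do not divide evenly. As a small sketch on a toy list (the names `toy` and `toy_folds` are illustrative only), the remainder is handed out one item at a time to the first folds:

```python
from typing import List


def create_folds(xs: List, n: int) -> List[List]:
    # Same slicing logic as the cell above, repeated so this example runs on its own.
    k, m = divmod(len(xs), n)
    return [xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]


toy = list(range(10))
toy_folds = create_folds(toy, 3)
fold_sizes = [len(fold) for fold in toy_folds]
print(fold_sizes)  # the first fold absorbs the one leftover item
```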

We always use one of the n folds as a test set (and, sometimes, one of the folds as a *pruning* set but not for kNN), and the remaining folds as a training set.
We need a function that'll take our n folds and return the train and test sets:

In [9]:
def create_train_test(folds: List[List[List]], index: int) -> Tuple[List[List], List[List]]:
    training = []
    test = []
    for i, fold in enumerate(folds):
        if i == index:
            test = fold
        else:
            training = training + fold
    return training, test

We can test the function by asking for train and test datasets where the test set is the fold at index 0:

In [10]:
train, test = create_train_test(folds, 0)

In [11]:
len(train)

927

In [12]:
len(test)

103

## Answers

Answer the questions above in the space provided below, adding cells as you need to.
Put everything in the helper functions and document them.
Document everything (what you're doing and why).
If you're not sure what format the output should take, refer to the course notes and what they do for that particular topic/algorithm.

## Problem 1: kNN

Implement the k Nearest Neighbors algorithm with k = 9. (Do not confuse the algorithm with evaluating the algorithm. We just want the algorithm here.)

## `kNN` <a id="kNN"></a>

**Description:**  
This function implements the k-Nearest Neighbors (kNN) algorithm. Unlike many models, kNN does not require a training phase. Instead, the model processes both training and testing data simultaneously to make predictions. It calculates distances between the test points and the training points, identifies the k nearest neighbors, and predicts based on their values.

**Parameters:**  
- `training_data` (`List[List[float]]`): The training data for the model to utilize for predicting the test_data.
- `test_data` (`List[List[float]]`): The testing data for the model to predict.
- `k` (`int`): The number of nearest neighbors to consider for prediction.

**Returns:**
- `preds` (`List[float]`): The predicted values of test_data

In [13]:
def kNN(training_data: List[List[float]], test_data: List[List[float]], k: int = 9) -> List[float]: 
    
    if k > len(training_data):
        print(f"k of {k} is larger than training data, this will not result in good data, use a smaller k")
        return []

    preds = []
    
    for test_point in test_data:
        distances = [
            (train_point, sum((a - b) ** 2 for a, b in zip(test_point[:-1], train_point[:-1])) ** 0.5)
            for train_point in training_data
        ]
        neighbors = sorted(distances, key=lambda x: x[1])[:k]
        preds.append(sum(neighbor[0][-1] for neighbor in neighbors) / k)
    return preds

In [14]:
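training_data = [[1.0, 1.0], [4.0, 4.0], [8.0, 8.0]]

# Each test row carries a placeholder target so that row[:-1] still yields the feature
test_data = [[1.5, 0.0], [7.5, 0.0]]

predictions = kNN(training_data, test_data, k=4)
assert(predictions == []) # k too large shouldn't predict

predictions = kNN(training_data, test_data, k=1)
assert(len(predictions) == 2) # Should only predict for 2 datapoints
assert(type(predictions[0]) == float) # Should be returning floats
assert(predictions == [1.0, 8.0]) # Nearest neighbors are the rows with x=1.0 and x=8.0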

k of 4 is larger than training data, this will not result in good data, use a smaller k
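To "test drive" the idea in a way that is easy to read, it can help to look at a single query and list which neighbors were chosen. This is only a sketch on toy data, with hypothetical helper names (`euclidean`, `nearest_neighbors`); it is not the `kNN` function above, although it follows the same steps.

```python
from typing import List, Tuple


def euclidean(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def nearest_neighbors(training_data: List[List[float]],
                      query_features: List[float],
                      k: int) -> List[Tuple[float, float]]:
    """Return the k closest (distance, target) pairs, nearest first."""
    scored = [(euclidean(query_features, row[:-1]), row[-1]) for row in training_data]
    return sorted(scored, key=lambda pair: pair[0])[:k]


# Toy rows of the form [feature, target]; not the concrete data set.
toy_training = [[1.0, 10.0], [2.0, 20.0], [4.0, 40.0], [9.0, 90.0]]
neighbors = nearest_neighbors(toy_training, [2.4], k=2)
prediction = sum(target for _, target in neighbors) / len(neighbors)

for distance, target in neighbors:
    print(f"neighbor with target {target} at distance {distance:.2f}")
print(f"prediction: {prediction}")
```

The query at 2.4 sits between the rows at 2.0 and 1.0, so the prediction is the average of their targets.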


## Problem 2: Evaluation vs. The Mean

Using Mean Squared Error (MSE) as your evaluation metric, evaluate your implementation above and the Null model, the mean. See the notes for the format of the output.

For this part of the assignment, the Programming Requirements are a bit difficult (although not impossible) to follow in terms of *testing*. If you can figure out how to test your code, that's best but if you can't, for this part of the assignment, that's ok.

## `mean_squared_error` <a id="mean_squared_error"></a>

**Description:**  
This function calculates the Mean Squared Error (MSE) between the true values (`y_true`) and the predicted values (`y_pred`). This is important for kNN as it is a simple and effective evaluation metric. Smaller values indicate better performance.

**Parameters:**  
- `y_true` (`List[float]`): True values of y
- `y_pred` (`List[float]`): Predicted values of y to compare to true

**Returns:**
- `mse` (`float`): MSE between the model's predicted values and the true values

In [15]:
def mean_squared_error(y_true: List[float], y_pred: List[float]) -> float:
    return sum((true - pred) ** 2 for true, pred in zip(y_true, y_pred)) / len(y_true)

In [16]:
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 3.0]
mse = mean_squared_error(y_true, y_pred)
assert(mse == 0.0) # Exact same so should be 0

y_pred = [1.1, 2.1, 3.1]
mse = mean_squared_error(y_true, y_pred)
assert round(mse, 2) == 0.01 # Each error squared is 0.01, averaged over 3 samples

y_pred = [2.0, 3.0, 4.0]
mse = mean_squared_error(y_true, y_pred)
assert(mse == 1.0) # Every prediction is off by 1.0, so each squared error is 1.0 and their mean is 1.0

In [17]:
y_train = [row[-1] for row in train]  # Training targets
y_test = [row[-1] for row in test]   # Test targets

null_predictions = [sum(y_train) / len(y_train)] * len(y_test) # My understanding of null model is using the mean of the data

knn_predictions = kNN(train, test, 9)

null_mse = mean_squared_error(y_test, null_predictions)
knn_mse = mean_squared_error(y_test, knn_predictions)

# Output results
print(f"Mean Squared Error (MSE):")
print(f"  Null Model: {null_mse:.4f}")
print(f"  kNN (k={9}): {knn_mse:.4f}")

Mean Squared Error (MSE):
  Null Model: 233.1731
  kNN (k=9): 64.6342
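The comparison above uses a single train/test split. The assignment text asks for cross validation, which repeats the same comparison once per fold and then summarizes the per-fold errors. Below is a hedged sketch of that loop on synthetic data (y = 3x), not on the concrete data set; the helper names (`knn_predict`, `mean_predict`, `cross_validate`) and the choice of k=3 are assumptions made for illustration.

```python
import random
import statistics
from typing import Callable, List

Row = List[float]


def mse(y_true: List[float], y_pred: List[float]) -> float:
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def knn_predict(train: List[Row], test: List[Row], k: int = 3) -> List[float]:
    preds = []
    for query in test:
        # Rank training rows by squared Euclidean distance on the features.
        ranked = sorted(
            train,
            key=lambda row: sum((a - b) ** 2 for a, b in zip(query[:-1], row[:-1])),
        )
        preds.append(sum(row[-1] for row in ranked[:k]) / k)
    return preds


def mean_predict(train: List[Row], test: List[Row]) -> List[float]:
    # Baseline: always predict the mean target of the training rows.
    mean_y = sum(row[-1] for row in train) / len(train)
    return [mean_y] * len(test)


def cross_validate(folds: List[List[Row]], predict: Callable) -> List[float]:
    scores = []
    for index, test in enumerate(folds):
        # Every fold except the current one goes into the training set.
        train = [row for i, fold in enumerate(folds) if i != index for row in fold]
        y_test = [row[-1] for row in test]
        scores.append(mse(y_test, predict(train, test)))
    return scores


# Synthetic stand-in data, shuffled with a fixed seed for repeatability.
rng = random.Random(0)
rows = [[float(x), 3.0 * x] for x in range(50)]
rng.shuffle(rows)
toy_folds = [rows[i::5] for i in range(5)]

knn_scores = cross_validate(toy_folds, knn_predict)
baseline_scores = cross_validate(toy_folds, mean_predict)
for name, scores in (("kNN", knn_scores), ("mean", baseline_scores)):
    print(f"{name}: mean MSE={statistics.mean(scores):.2f}, std={statistics.stdev(scores):.2f}")
```

Reporting the mean and standard deviation across folds gives a sense of how much a single-split number can move from one split to another.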


## Problem 3: Hyperparameter Tuning

Tune the value of k.

## `hyperparameter_tuning` <a id="hyperparameter_tuning"></a>

**Description:**  
This function performs hyperparameter tuning for the k-Nearest Neighbors (kNN) algorithm by evaluating different values of k to find the one that minimizes the Mean Squared Error (MSE) on a given test dataset. Choosing k well matters for kNN: a k that is too small tends to overfit and a k that is too large tends to underfit, so a good k improves predictions.

**Parameters:**  
- `train` (`List[List[float]]`): Training dataset that kNN uses to predict each test point
- `test` (`List[List[float]]`): Testing dataset to predict values for
- `k_values` (`List[int]`): Different values of k to try
- `debug` (`bool`): If `True`, print the MSE for each k and the best k found (defaults to `False`)

**Returns:**
- `best_k` (`int`): The best found value of k for the dataset

In [18]:
def hyperparameter_tuning(train: List[List[float]], test: List[List[float]], k_values: List[int], debug: bool=False) -> int:

    best_k = None
    best_mse = float('inf') 
    y_test = [row[-1] for row in test]  # Targets do not depend on k, so extract them once
    for k in k_values:
        knn_predictions = kNN(train, test, k)
        if len(knn_predictions) != len(y_test):
            # kNN returns [] when k exceeds the training set size; skip that k,
            # otherwise the empty prediction list would score an MSE of 0.0
            continue
        mse = mean_squared_error(y_test, knn_predictions)
        if(debug):
            print(f"k={k}, MSE={mse:.4f}")
        
        if mse < best_mse:
            best_mse = mse
            best_k = k
    if(debug):
        print(f"Best k: {best_k} with MSE: {best_mse:.4f}")
    return best_k

In [19]:
train_data = [
    [1.0, 1.0],
    [2.0, 2.0], 
    [3.0, 3.0]   
]

test_data = [
    [1.5, 1.5],  
    [2.5, 2.5]
]

k_values = [1, 2, 3]

best_k = hyperparameter_tuning(train_data, test_data, k_values)
assert(best_k == 2) # Should be k = 2
assert(type(best_k) == int) # k should be an integer no matter what
k_values = [1, 2, 3, 4, 5]
best_k = hyperparameter_tuning(train_data, test_data, k_values)
assert(best_k is not None) # best k should return even if k values go over training data size 

k of 4 is larger than training data, this will not result in good data, use a smaller k
k of 5 is larger than training data, this will not result in good data, use a smaller k


In [20]:
# Hyperparameter tuning
k_values = list(range(1, 21))  
best_k = hyperparameter_tuning(train, test, k_values, True)
print(f"The best k value is {best_k}")

k=1, MSE=69.0425
k=2, MSE=72.5728
k=3, MSE=57.5925
k=4, MSE=63.8717
k=5, MSE=67.6430
k=6, MSE=71.5821
k=7, MSE=69.6825
k=8, MSE=66.9971
k=9, MSE=64.6342
k=10, MSE=66.3015
k=11, MSE=68.7963
k=12, MSE=72.0420
k=13, MSE=72.5542
k=14, MSE=74.2889
k=15, MSE=75.9650
k=16, MSE=78.4388
k=17, MSE=79.7309
k=18, MSE=79.5882
k=19, MSE=80.2391
k=20, MSE=81.5264
Best k: 3 with MSE: 57.5925
The best k value is 3
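A validation curve usually tracks the training error alongside the test error for each k, so that over- and underfitting can be read off together. The sketch below shows one way to collect both curves; it uses synthetic data (y = x²) with a fixed split rather than the concrete data set, and the helper names and the list of k values are assumptions for illustration.

```python
from typing import List

Row = List[float]


def mse(y_true: List[float], y_pred: List[float]) -> float:
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def knn_predict(train: List[Row], test: List[Row], k: int) -> List[float]:
    preds = []
    for query in test:
        # Rank training rows by squared Euclidean distance on the features.
        ranked = sorted(
            train,
            key=lambda row: sum((a - b) ** 2 for a, b in zip(query[:-1], row[:-1])),
        )
        preds.append(sum(row[-1] for row in ranked[:k]) / k)
    return preds


# Synthetic stand-in data: every fifth row is held out as the test set.
rows = [[float(x), float(x * x)] for x in range(30)]
toy_test = rows[::5]
toy_train = [row for i, row in enumerate(rows) if i % 5 != 0]

ks = [1, 3, 5, 7]
train_curve, test_curve = [], []
for k in ks:
    train_curve.append(mse([r[-1] for r in toy_train], knn_predict(toy_train, toy_train, k)))
    test_curve.append(mse([r[-1] for r in toy_test], knn_predict(toy_train, toy_test, k)))

print(" k | train MSE | test MSE")
for k, train_error, test_error in zip(ks, train_curve, test_curve):
    print(f"{k:2d} | {train_error:9.2f} | {test_error:8.2f}")
```

At k=1 the training error is exactly zero here, because each training row is its own nearest neighbor. That is why the training curve alone cannot be used to choose k.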


## Problem 4: Generalization Error

Analyze and discuss the generalization error of your model with the value of k from Problem 3.

## `generalization_error` <a id="generalization_error"></a>

**Description:**  
This function evaluates the generalization error of a k-Nearest Neighbors (kNN) model by comparing its performance on the training and test datasets. It calculates the Mean Squared Error (MSE) for both sets and prints the generalization gap, the absolute difference between the two. This matters for kNN because it shows how well the model predicts new data compared to the data it already holds (its "training" data).

**Parameters:**  
- `train` (`List[List[float]]`): Training data for the dataset
- `test` (`List[List[float]]`): Testing data for the dataset
- `best_k` (`int`): Best found value of k for the dataset

**Returns:**
- `train_mse, test_mse` (`tuple[float, float]`): The training MSE and testing MSE that we calculated

In [21]:
def generalization_error(train: List[List[float]], test: List[List[float]], best_k: int) -> tuple[float, float]:
    y_train = [row[-1] for row in train]
    y_test = [row[-1] for row in test]

    train_predictions = kNN(train, train, best_k)
    test_predictions = kNN(train, test, best_k)

    train_mse = mean_squared_error(y_train, train_predictions)
    test_mse = mean_squared_error(y_test, test_predictions)

    print(f"Generalization Error Analysis (k={best_k}):")
    print(f"  Training Error (MSE): {train_mse:.4f}")
    print(f"  Test Error (MSE): {test_mse:.4f}")
    print(f"  Generalization Gap: {abs(test_mse - train_mse):.4f}")
    return train_mse, test_mse

In [22]:
train_data = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
test_data = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]

# Pass k as a literal so the tuned `best_k` from Problem 3 is not overwritten by this test
train_mse, test_mse = generalization_error(train_data, test_data, 1)
assert(train_mse == 0.0) # With k=1 each training point is its own nearest neighbor
assert(test_mse == train_mse) # Should be exact same as train since same dataset
test_data = [[1.5, 1.0], [2.5, 4.0], [3.5, 3.0]]
train_mse, test_mse = generalization_error(train_data, test_data, 3)
assert(test_mse > train_mse) # With k=3 every prediction is the training mean (2.0), which fits the test targets worse

Generalization Error Analysis (k=1):
  Training Error (MSE): 0.0000
  Test Error (MSE): 0.0000
  Generalization Gap: 0.0000
Generalization Error Analysis (k=3):
  Training Error (MSE): 0.6667
  Test Error (MSE): 2.0000
  Generalization Gap: 1.3333


In [23]:
generalization_mse = generalization_error(train, test, best_k)

Generalization Error Analysis (k=3):
  Training Error (MSE): 35.8110
  Test Error (MSE): 57.5925
  Generalization Gap: 21.7815


## Analysis

As we can see, there is a generalization gap of 21.7815 between the training and test errors. The training error is reasonably low while the test error is higher, which suggests the model generalizes worse to unseen data than to the data it has already seen. This is an indication of overfitting. Other values of k give different results: in previous iterations I got k=1, which had a generalization gap of over 80 because its training error was near 0 and its test error was quite high. Each k changes how well the model generalizes, and more data may help or may make it worse.

## Problem 5: Choose your own adventure

You have two options for the next part:

1. You can implement mean normalization (also called "z-score standardization") of the *features*; **do not** normalize the target, y. See if this improves the generalization error of your model (middle).

2. You can implement *learning curves* to see if more data would likely improve your model (easiest).


## `standardize_features` <a id="standardize_features"></a>

**Description:**  
This function standardizes the features of a dataset by subtracting the mean and dividing by the standard deviation for each feature. This is done in the hope of improving the performance of kNN, since it puts all features on the same scale.

**Parameters:**  
- `data` (`List[List[float]]`): Data to standardize
- `means` (`List[float]`): Mean value for each feature
- `stds` (`List[float]`): Standard Deviation values for each feature

**Returns:**
- `standardized_data` (`List[List[float]]`): Dataset that has been standardized
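Written as a formula, the transformation described above is, for the value $x_{ij}$ of feature $j$ in row $i$ (the symbols are introduced here for illustration; $\mu_j$ and $\sigma_j$ correspond to the `means` and `stds` parameters):

$$z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}$$

When $\sigma_j = 0$ the implementation below returns 0 for that feature, and the target column is left unchanged.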

In [24]:
def standardize_features(data: List[List[float]], means: List[float], stds: List[float]) -> List[List[float]]:
    standardized_data = []
    for row in data:
        standardized_row = [
            (value - mean) / std if std > 0 else 0
            for value, mean, std in zip(row[:-1], means, stds)
        ]
        standardized_data.append(standardized_row + [row[-1]]) 
    return standardized_data

In [25]:
data = [
    [2.0, 4.0, 1.0], 
    [3.0, 6.0, 0.0] 
]

means = [2.5, 5.0]
stds = [0.5, 1.0]

expected = [
    [-1.0, -1.0, 1.0], 
    [1.0, 1.0, 0.0]    
]

standardized = standardize_features(data, means, stds)
assert(type(standardized) == list) # Should return a list
assert(standardized == expected) # Should equal expected based on my calcs
means = [2.0, 5.0]
standardized2 = standardize_features(data, means, stds)
assert(not standardized == standardized2) # As means have changed the standardization should as well
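To see why scale can matter for a distance-based method, consider the following sketch. A feature measured in the thousands dominates the Euclidean distance until both features are standardized, and the nearest neighbor changes as a result. The points, means, and standard deviations here are made up for illustration and are not taken from the concrete data set.

```python
from typing import Dict, List


def euclidean(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def nearest(points: Dict[str, List[float]], query: List[float]) -> str:
    """Return the name of the point closest to the query."""
    return min(points, key=lambda name: euclidean(points[name], query))


def standardize(values: List[float], means: List[float], stds: List[float]) -> List[float]:
    return [(v - m) / s for v, m, s in zip(values, means, stds)]


# Feature 0 lives on a scale of thousands, feature 1 on a scale of ones.
toy_points = {"P": [1000.0, 0.0], "Q": [1050.0, 1.0]}
toy_query = [1040.0, 0.0]
toy_means, toy_stds = [1000.0, 0.5], [100.0, 0.5]  # hypothetical statistics

raw_nearest = nearest(toy_points, toy_query)
scaled_points = {name: standardize(v, toy_means, toy_stds) for name, v in toy_points.items()}
scaled_nearest = nearest(scaled_points, standardize(toy_query, toy_means, toy_stds))
print(f"raw: {raw_nearest}, standardized: {scaled_nearest}")
```

On the raw values Q is closer because only feature 0 contributes meaningfully. After standardization the difference in feature 1 counts as well, and P becomes the nearest point. Whether this helps on a particular data set is an empirical question.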


## `mean_and_std` <a id="mean_and_std"></a>

**Description:**  
This function computes the mean and standard deviation for each feature in a dataset. The target column (assumed to be the last column) is excluded from the calculations. Needed to perform standardization.

**Parameters:**  
- `data` (`List[List[float]]`): dataset to get the mean and standard deviations of

**Returns:**
- `means, stds` (`Tuple[List[float], List[float]]`): The lists of means and standard deviations for each feature

In [26]:
import math

def mean_and_std(data: List[List[float]]) -> Tuple[List[float], List[float]]:
    num_features = len(data[0]) - 1  # Exclude target column
    means = [sum(row[i] for row in data) / len(data) for i in range(num_features)]
    stds = [
        math.sqrt(sum((row[i] - means[i]) ** 2 for row in data) / len(data))
        for i in range(num_features)
    ]
    return means, stds

In [27]:
data = [
    [2.0, 4.0, 1.0],
    [2.0, 4.0, 0.0]
]

expected_means = [2.0, 4.0]
expected_stds = [0.0, 0.0]  

means, stds = mean_and_std(data)
assert(type(means) == list and type(stds) == list) # They should both be a list
assert means == expected_means # Those are the means of the dataset minus y
assert stds == expected_stds # Should be no variation in dataset for stds
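Note that `mean_and_std` divides by `len(data)`, which is the population standard deviation. As a quick cross-check on a toy column (the values are illustrative only), the same formula agrees with `statistics.pstdev` from the standard library, while `statistics.stdev` divides by n - 1 and is therefore slightly larger:

```python
import math
import statistics

# A toy feature column whose population standard deviation is exactly 2.0.
column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = sum(column) / len(column)
population_std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))

print(f"manual population std: {population_std}")
print(f"statistics.pstdev:     {statistics.pstdev(column)}")
print(f"statistics.stdev:      {statistics.stdev(column):.4f}")
```

Either convention can be used for standardization as long as the same statistics, computed from the training data, are applied to both the training and test sets.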

In [28]:
means, stds = mean_and_std(train)  # Compute means and stds from training data
standardized_train = standardize_features(train, means, stds)  # Standardize training data
standardized_test = standardize_features(test, means, stds)  # Standardize test data

best_k_standardized = hyperparameter_tuning(standardized_train, standardized_test, k_values=list(range(1, 21)))

# Re-run kNN with standardized features
generalization_mse = generalization_error(standardized_train, standardized_test, best_k_standardized)

Generalization Error Analysis (k=4):
  Training Error (MSE): 41.7987
  Test Error (MSE): 60.1788
  Generalization Gap: 18.3800


## Comparison

Standardization did not help my training or test error; in fact, it made both worse. However, the generalization gap went down, which implies the two errors are closer together: better generalization to unseen data, just worse performance overall.

## Before You Submit...

1. Did you provide output exactly as requested?
2. Did you re-execute the entire notebook? ("Restart Kernel and Run All Cells...")
3. If you did not complete the assignment or had difficulty please explain what gave you the most difficulty in the Markdown cell below.
4. Did you change the name of the file to `jhed_id.ipynb`?

Do not submit any other files.