# Module 3 - Programming Assignment

## Directions

1. Change the name of this file to be your JHED id as in `jsmith299.ipynb`. Be sure you use your JHED ID (it's made out of your name and not your student id which is just letters and numbers).
2. Make sure the notebook you submit is cleanly and fully executed. I do not grade unexecuted notebooks.
3. Submit your notebook back in Blackboard where you downloaded this file.

*Provide the output **exactly** as requested*

## k Nearest Neighbors and Model Evaluation

In this programming assignment you will use k Nearest Neighbors (kNN) to build a "model" that will estimate the compressive strength of various types of concrete. This assignment has several objectives:

1. Implement the kNN algorithm with k=9. Remember...the data + distance function is the model in kNN. In addition to asserts that unit test your code, you should "test drive" the model, showing output that a non-technical person could interpret.

2. You are going to compare the kNN model above against the baseline model described in the course notes (the mean of the training set's target variable). You should use 10 fold cross validation and Mean Squared Error (MSE):

$$MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

as the evaluation metric ("error"). Refer to the course notes for the format your output should take. Don't forget a discussion of the results.

3. Use validation curves to tune a *hyperparameter* of the model. 
In this case, the hyperparameter is *k*, the number of neighbors. Don't forget a discussion of the results.

4. Evaluate the *generalization error* of the new model.
Because you may have just created a new, better model, you need a sense of its generalization error, so calculate it. Again, what would you like to see as output here? Refer to the course notes. Don't forget a discussion of the results. Did the new model do better than either model in Q2?

5. Pick one of the "Choose Your Own Adventure" options.

Refer to the "course notes" for this module for most of this assignment.
Anytime you just need test/train split, use fold index 0 for the test set and the remainder as the training set.
Discuss any results.

## Load the Data

The function `parse_data` loads the data from the specified file and returns a List of Lists. The outer List is the data set and each element (List) is a specific observation. Each value of an observation is for a particular measurement. This is what we mean by "tidy" data.

The function also returns the *shuffled* data because the data might have been collected in a particular order that *might* bias training.

In [320]:
import random
from typing import List, Dict, Tuple, Callable

In [321]:
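def parse_data(file_name: str) -> List[List]:
    data = []
    # Use a context manager so the file is closed even if parsing fails.
    with open(file_name, "r") as file:
        for line in file:
            datum = [float(value) for value in line.rstrip().split(",")]
            data.append(datum)
    # Shuffle so that any ordering in the collected data does not bias the folds.
    random.shuffle(data)
    return data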

In [322]:
data = parse_data("concrete_compressive_strength.csv")

In [323]:
data[0]

[312.9, 160.5, 0.0, 177.6, 9.6, 916.6, 759.5, 28.0, 52.45]

In [324]:
len(data)

1030

There are 1,030 observations and each observation has 8 feature measurements plus the target value. The data dictionary for this data set tells us the definitions of the individual variables (columns/indices):

| Index | Variable | Definition |
|-------|----------|------------|
| 0     | cement   | kg in a cubic meter mixture |
| 1     | slag     | kg in a cubic meter mixture |
| 2     | ash      | kg in a cubic meter mixture |
| 3     | water    | kg in a cubic meter mixture |
| 4     | superplasticizer | kg in a cubic meter mixture |
| 5     | coarse aggregate | kg in a cubic meter mixture |
| 6     | fine aggregate | kg in a cubic meter mixture |
| 7     | age | days |
| 8     | concrete compressive strength | MPa |

The target ("y") variable is at Index 8, concrete compressive strength, measured in megapascals (MPa), a multiple of the [pascal](https://en.wikipedia.org/wiki/Pascal_(unit)).

## Train/Test Splits - n folds

With n fold cross validation, we divide our data set into n subgroups called "folds" and then use those folds for training and testing. You pick n based on the size of your data set. If you have a small data set--100 observations--and you used n=10, each fold would only have 10 observations. That's probably too small. You want at least 30. At the other extreme, we generally don't use n > 10.

With 1,030 observations, n = 10 is fine so we will have 10 folds.
`create_folds` will take a list (xs) and split it into `n` folds of (nearly) equal size. With n = 10, each fold contains one-tenth of the observations.

In [325]:
def create_folds(xs: List, n: int) -> List[List[List]]:
    k, m = divmod(len(xs), n)
    # be careful of generators...
    return list(xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

In [326]:
folds = create_folds(data, 10)

In [327]:
len(folds)

10
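As a small illustration of what happens when the length does not divide evenly, the sketch below repeats the same `divmod` slicing under a hypothetical name, `split_into_folds` (not part of the assignment code), and reports the fold sizes. It assumes only that the input is a list.

```python
from typing import List


def split_into_folds(xs: List, n: int) -> List[List]:
    # Same slicing idea as create_folds: the first `m` folds get one extra item.
    k, m = divmod(len(xs), n)
    return [xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]


# 1,030 items divide evenly into 10 folds of 103.
sizes_even = [len(fold) for fold in split_into_folds(list(range(1030)), 10)]

# 10 items into 3 folds: the remainder of 1 goes to the first fold.
sizes_uneven = [len(fold) for fold in split_into_folds(list(range(10)), 3)]

print(sizes_even)
print(sizes_uneven)  # [4, 3, 3]
```

No observation is dropped or duplicated; the fold sizes differ by at most one.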

We always use one of the n folds as a test set (and, sometimes, one of the folds as a *pruning* set but not for kNN), and the remaining folds as a training set.
We need a function that'll take our n folds and return the train and test sets:

In [328]:
def create_train_test(folds: List[List[List]], index: int) -> Tuple[List[List], List[List]]:
    training = []
    test = []
    for i, fold in enumerate(folds):
        if i == index:
            test = fold
        else:
            training = training + fold
    return training, test

We can test the function by asking for train and test sets where the test set is the fold at index 0:

In [329]:
train, test = create_train_test(folds, 0)

In [330]:
len(train)

927

In [331]:
len(test)

103
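For full cross validation, each fold takes a turn as the test set. The sketch below shows one way that rotation might look. Both `train_test_for` (a stand-in for `create_train_test`) and `cross_validate` are hypothetical names introduced here for illustration, and `evaluate` is assumed to be any function of `(train, test)`.

```python
from typing import Callable, List, Tuple


def train_test_for(folds: List[List], index: int) -> Tuple[List, List]:
    # Hypothetical stand-in for create_train_test: fold `index` is the test set,
    # all other folds are concatenated into the training set.
    training = [row for i, fold in enumerate(folds) if i != index for row in fold]
    return training, folds[index]


def cross_validate(folds: List[List], evaluate: Callable) -> List:
    # Call `evaluate(train, test)` once per fold and collect the results.
    return [evaluate(*train_test_for(folds, i)) for i in range(len(folds))]


toy_folds = [[1, 2], [3, 4], [5, 6]]
sizes = cross_validate(toy_folds, lambda train, test: (len(train), len(test)))
print(sizes)  # [(4, 2), (4, 2), (4, 2)]
```

Looping over `len(folds)` rather than a fixed 10 keeps the helper usable with any number of folds.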

## Answers

Answer the questions above in the space provided below, adding cells as you need to.
Put everything in the helper functions and document them.
Document everything (what you're doing and why).
If you're not sure what format the output should take, refer to the course notes and what they do for that particular topic/algorithm.

## Problem 1: kNN

Implement k Nearest Neighbors with k = 9.

### <a id="knearestneighbors"></a> k nearest neighbors algorithm

Formal Parameters:
**xq** the data point whose target we want to predict

**data** the training set

**evaluation_metric** a function that measures the *closeness* of two observations. It is used together with distance_function and p, and applies the distance formula to the first 8 indices (the features) of each list.

**distance_function** a function that computes the per-feature distance term. In this notebook it is the Minkowski term without the root.

**p** the exponent used in the Minkowski distance. Defaults to 2.

**k** the number of neighbors whose target values are averaged. Defaults to 9.

**returns** the mean of the target values of the k nearest neighbors.

The k nearest neighbors algorithm predicts a target value by finding the k training observations whose features are closest to those of xq and averaging their target values.

In [332]:
def k_nearest_neighbors(xq,data,evaluation_metric,distance_function,p=2,k=9):
    nearest_neighbors = k_nearest_neighbors_list(xq,data,evaluation_metric,distance_function,p,k)
    total = 0
    for neighbor in nearest_neighbors:
        total += neighbor[8]
        
    return total/k

### <a id="knearestneighborslist"></a> k nearest neighbors list

Formal Parameters:
**xq** the data point whose target we want to predict

**data** the training set

**evaluation_metric** a function that measures the *closeness* of two observations. It is used together with distance_function and p, and applies the distance formula to the first 8 indices (the features) of each list.

**distance_function** a function that computes the per-feature distance term. In this notebook it is the Minkowski term without the root.

**p** the exponent used in the Minkowski distance. Defaults to 2.

**k** the number of neighbors whose target values are averaged. Defaults to 9.

**returns** A list of the k nearest neighbors

This function is used directly by the [k nearest neighbors algorithm](#knearestneighbors)

In [333]:
def k_nearest_neighbors_list(xq,data,evaluation_metric,distance_function,p=2,k=9):
    nearest_neighbors_values = [float('inf')]*k
    nearest_neighbors = [0]*k
    for data_point in data:
        distance = evaluation_metric(data_point,xq,distance_function,p)
        update_nearest(data_point,distance,nearest_neighbors,nearest_neighbors_values,k)
    return nearest_neighbors
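
An alternative way to select the neighbors, shown here only as a sketch, is to let the standard library's `heapq.nsmallest` pick the k rows with the smallest distance. The names `squared_euclidean` and `knn_predict` are hypothetical and not part of the notebook's code. The sketch assumes each training row stores its target in the last position.

```python
import heapq
from typing import List


def squared_euclidean(row: List[float], query: List[float]) -> float:
    # Compare feature columns only; the last value of `row` is the target.
    return sum((row[i] - query[i]) ** 2 for i in range(len(row) - 1))


def knn_predict(query: List[float], training: List[List[float]], k: int) -> float:
    # Pick the k rows with the smallest distance to the query.
    nearest = heapq.nsmallest(k, training, key=lambda row: squared_euclidean(row, query))
    # Average the targets of those rows.
    return sum(row[-1] for row in nearest) / len(nearest)


# One feature, one target per row.
toy = [[0.0, 10.0], [1.0, 20.0], [2.0, 30.0], [10.0, 100.0]]
prediction = knn_predict([1.1], toy, k=3)
print(prediction)  # 20.0
```

The far-away row with target 100 is ignored, so the prediction is the mean of 10, 20, and 30.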

### <a id="evaluation_metric"></a> evaluation metric

Formal Parameters:
**xq** the data point whose target we want to predict. A list with 8 feature values, usually followed by the target at the last index.

**data_point** a point from the training set. A list with 9 values (8 features and the target).

**distance_function** the per-feature distance term. In this notebook it is the Minkowski term without the root.

**p** the exponent used in the Minkowski distance. Defaults to 2.

**returns** the distance between xq and data_point (without the p-th root)

This function is used directly by [k nearest neighbors list](#knearestneighborslist). It applies the distance formula to the first 8 coordinates of xq and data_point. Since f(x) = x^p is a continuous, increasing function on R>=0 for all p>0, comparing the sums gives the same ordering as comparing their p-th roots, so we don't actually have to take the root. Skipping it should also save a little computation and some floating-point rounding error.

In [335]:
def evaluation_metric(data_point,xq,distance_function,p=2):
    total = 0
    for i in range(len(data_point)-1):
        total+=distance_function(data_point[i],xq[i],p)
    return total
    

In [336]:
assert(evaluation_metric([1,2,5],[3,4],minkowski_distance_no_root)==8)
assert(evaluation_metric([1,2,5],[3,4],minkowski_distance_no_root,1)==4)
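
The argument that the p-th root can be skipped can be checked on a small example. The sketch below uses two hypothetical helpers, `minkowski_sum` and `minkowski`, and confirms that sorting by the rootless sum gives the same neighbor order as sorting by the full distance for this toy data.

```python
def minkowski_sum(a, b, p=2):
    # Sum of |x - y|^p over coordinates, without the final root.
    return sum(abs(x - y) ** p for x, y in zip(a, b))


def minkowski(a, b, p=2):
    # Full Minkowski distance: the p-th root of the sum above.
    return minkowski_sum(a, b, p) ** (1 / p)


query = [0.0, 0.0]
points = [[3.0, 4.0], [1.0, 1.0], [6.0, 8.0], [0.0, 2.0]]

by_sum = sorted(points, key=lambda pt: minkowski_sum(pt, query))
by_root = sorted(points, key=lambda pt: minkowski(pt, query))

print(by_sum == by_root)  # True
```

The distances themselves differ (25 versus 5 for the first point), but only the ordering matters when choosing neighbors.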

### <a id="minkowski"></a> Minkowski distance

Formal Parameters:
**x** a number

**y** some other number

**p** the exponent applied to the absolute difference |x-y|. Defaults to 2.

**returns** |x-y|^p, the absolute difference raised to the power p

This function is not the full Minkowski distance but a helper that computes the term for a single coordinate. The terms are summed in [evaluation metric](#evaluation_metric).

In [337]:
def minkowski_distance_no_root(x,y,p=2):
    return abs(x-y)**p

In [338]:
assert(minkowski_distance_no_root(1,5,1)== 4)
assert(minkowski_distance_no_root(5,1,1)== 4)
assert(minkowski_distance_no_root(1,5)== 16)

### <a id="update_nearest"></a> update nearest

Formal Parameters:
**data_point** a data point from the training set

**distance** the calculated distance from data_point to xq

**nearest_neighbors** A list of the k nearest neighbors to xq (so far).  It is ordered, with closer neighbors at the end of the list, and further neighbors at the beginning

**nearest_neighbors_values** A list of the distances of the k nearest neighbors to xq (so far).  It is in decreasing order.

**k** how many neighbors

**returns** None

**modifies** nearest_neighbors and nearest_neighbors_values

If data_point is closer to xq than some neighbor(s) in nearest_neighbors, based on distance and nearest_neighbors_values, it will insert the distance and the data_point into the correct locations, and discard the furthest neighbor and distance.  Used to build the final list returned by [k_nearest_neighbors_list](#knearestneighborslist)

In [339]:
def update_nearest(data_point,distance,nearest_neighbors,nearest_neighbors_values,k):
    if distance < nearest_neighbors_values[0]:
        nearest_neighbors_values[0] = distance
        nearest_neighbors[0] = data_point
    i = 0
    while i<k-1 and nearest_neighbors_values[i] < nearest_neighbors_values[i+1]:
        swap(nearest_neighbors_values,i,i+1)
        swap(nearest_neighbors,i,i+1)
        i+=1 

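[k_nearest_neighbors_list](#knearestneighborslist) and update_nearest are tested here, after both are defined, to avoid errors on a first run. Note that update_nearest calls [swap](#swap), which is defined further down, so that cell must be run before these tests.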

In [None]:
xq = [1,1,1]
data1 = [[4,4,4,4],[0,0,0,0],[1,1,1,1],[2,2,2,2],[3,3,3,3]]
answer = [[2,2,2,2],[0,0,0,0],[1,1,1,1]]
assert(k_nearest_neighbors_list(xq,data1,evaluation_metric,minkowski_distance_no_root,p=1,k=3)==answer)

In [340]:
data1 = [1]*9
data_dist = [float('inf')]*9
answer_data = [1]*9
answer_dist = [float('inf')]*9
answer_dist.pop()
answer_dist.append(2)
answer_data.pop()
answer_data.append(0)

update_nearest(0,2,data1,data_dist,9)
assert(data1==answer_data)
assert(data_dist==answer_dist)

data1 = ["hola","hello","howdy","hi","greetings"]
data_dist = [6,5,3,2,1]
data_point = "hey"
dist = 4
answer_data = ["hello","hey","howdy","hi","greetings"]
answer_dist = [5,4,3,2,1]
update_nearest(data_point,dist,data1,data_dist,5)
assert(data1==answer_data)
assert(data_dist==answer_dist)


### <a id="swap"></a> swap

Formal Parameters:
**lst** the list to swap values on

**index1** an index of the value in lst to swap

**index2** the other index of the value in lst to swap

**returns** None

**modifies** lst

A simple helper function to do a swapping operation on a list.  Used by [update_nearest](#update_nearest)

In [341]:
def swap(lst,index1,index2):
    temp = lst[index1]
    lst[index1]=lst[index2]
    lst[index2] = temp

In [342]:
lst = [[0,1,2],[3,4],[5],[6]]
swap(lst,0,2)
assert(lst==[[5],[3,4],[0,1,2],[6]])

## Problem 2: Evaluation vs. The Mean

Using Mean Squared Error (MSE) as your evaluation metric, evaluate your implementation above and the Null model, the mean.

### <a id="nullmodel"></a> null model

Formal Parameters:
**data** the training set

The null model simply takes the mean of all target values of the training set.

In [343]:
def null_model(data):
    total = 0
    for data_point in data:
        total+= data_point[len(data_point)-1]
        
    mean = total/len(data)
    return mean

### <a id="mse"></a> mean squared error

Formal Parameters:
**train** the training set

**test** the set to test against

**model** the model to test on.  Model will only take the training set and a data point as formal parameters, so the user should input a lambda function with all the relevant parameters, not the actual function for the model.

The mean squared error function makes a prediction on a data point in the test set based on the model run on the training set and that data point.  It then calculates the difference squared, and averages this over all data points in the test set.

In [344]:
def mean_squared_error(train,test,model):
    total = 0
    for data_point in test:
        prediction = model(train,data_point)
        actual = data_point[len(data_point)-1]
        total+= (prediction - actual)**2
    
    mean = total/len(test)
    return mean
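
As a worked example of the arithmetic, the sketch below computes MSE for two sets of made-up predictions. The helper `mse` and all the numbers are hypothetical and chosen only so the sums are easy to follow by hand.

```python
def mse(actual, predicted):
    # Average of the squared differences between actual and predicted values.
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)


actual = [30.0, 40.0, 50.0]

# Predicting the mean (40) for every point: (100 + 0 + 100) / 3
baseline_error = mse(actual, [40.0, 40.0, 40.0])

# Predictions that track the actual values more closely: (4 + 1 + 9) / 3
model_error = mse(actual, [32.0, 39.0, 47.0])

print(baseline_error, model_error)
```

Because the differences are squared, one large miss contributes far more to the error than several small ones.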
        

### <a id="nullmodeleval"></a> null model evaluation

**formal parameters**

**folds** the folds for data

**returns** None

Prints the error for each of the folds and the mean error across all the folds in the [baseline model](#nullmodel).

In [345]:
def null_model_printer(folds):
    print("null model evaluation")
    # The null model ignores the query point and always predicts the training mean.
    null_lambda = lambda train, data_point: null_model(train)
    total = 0
    for i in range(len(folds)):
        train, test = create_train_test(folds, i)
        error = mean_squared_error(train,test,null_lambda)
        print("fold "+str(i)+ " error: " + str(error))
        total += error

    print("mean error: " + str(total/len(folds)))

null_model_printer(folds)
    

null model evaluation
fold 0 error: 322.90357000008146
fold 1 error: 248.84773030864778
fold 2 error: 238.35038767200925
fold 3 error: 304.2318391970946
fold 4 error: 220.53007860412012
fold 5 error: 307.6558775567913
fold 6 error: 356.68716016426777
fold 7 error: 260.70237138953763
fold 8 error: 292.32928143237336
fold 9 error: 247.98119463604732
mean error: 280.0219490960971


In [346]:
print("kNN evaluation")
kNN_lambda = lambda train, data_point: k_nearest_neighbors(data_point,train,evaluation_metric,minkowski_distance_no_root,p=2,k=9)
total = 0
for i in range(10):
    train, test = create_train_test(folds, i)
    error = mean_squared_error(train,test,kNN_lambda)
    print("fold "+str(i)+ " error: " + str(error))
    total += error
    
print("mean error: " + str(total/10))

kNN evaluation
fold 0 error: 66.91231301690036
fold 1 error: 105.95659733908666
fold 2 error: 82.83899000359582
fold 3 error: 112.39770213352509
fold 4 error: 63.77420997243199
fold 5 error: 90.41399377921611
fold 6 error: 85.49962636941149
fold 7 error: 84.46132404410882
fold 8 error: 101.27450837828118
fold 9 error: 93.6980547165288
mean error: 88.72273197530862


## Problem 3: Hyperparameter Tuning

Tune the value of k.

### <a id="kNNtest"></a> kNN_tuner


**formal parameters**

**upper_bound** k ranges from 1 to upper_bound-1

**folds** the folds for data

**returns** None

Prints the error for each of the folds and the mean error across all the folds for each k below upper_bound in the [kNN algorithm](#knearestneighbors).

In [347]:
def kNN_tuner(upper_bound,folds):
    print("kNN evaluation for different k")

    for j in range(1,upper_bound):
        print("k = " + str(j))
        # The lambda reads j when it is called, which happens inside this iteration.
        kNN_lambda = lambda train, data_point: k_nearest_neighbors(data_point,train,evaluation_metric,minkowski_distance_no_root,p=2,k=j)
        total = 0
        for i in range(len(folds)):
            train, test = create_train_test(folds, i)
            error = mean_squared_error(train,test,kNN_lambda)
            print("fold "+str(i)+ " error: " + str(error))
            total += error

        print("mean error: " + str(total/len(folds)))
        
kNN_tuner(15,folds)

kNN evaluation for different k
k = 1
fold 0 error: 69.42476407766993
fold 1 error: 73.88831747572816
fold 2 error: 80.9708067961165
fold 3 error: 86.01133980582529
fold 4 error: 61.62753883495146
fold 5 error: 64.6175990291262
fold 6 error: 86.05543980582523
fold 7 error: 69.428245631068
fold 8 error: 70.01612815533983
fold 9 error: 100.19667864077671
mean error: 76.22368582524273
k = 2
fold 0 error: 56.283593932038826
fold 1 error: 99.68995485436886
fold 2 error: 74.11385291262135
fold 3 error: 88.31174441747571
fold 4 error: 54.2759453883495
fold 5 error: 66.57155898058254
fold 6 error: 76.00316553398056
fold 7 error: 68.04372572815538
fold 8 error: 97.24017766990289
fold 9 error: 83.88065728155343
mean error: 76.44143766990291
k = 3
fold 0 error: 58.36190442286948
fold 1 error: 89.46732696871629
fold 2 error: 78.8826888888889
fold 3 error: 79.33382470334415
fold 4 error: 52.492071736785334
fold 5 error: 71.60590204962244
fold 6 error: 79.55219989212515
fold 7 error: 69.0530046386192

## Problem 4: Generalization Error

Analyze and discuss the generalization error of your model with the value of k from Problem 3.

After testing kNN on values of k in the range [[1,14]](#kNNtest), we see that the smallest mean error comes from k=1 (about 76.2), and the error generally increases with k (about 88.7 at k=9). kNN outperforms the [null model](#nullmodeleval), whose mean error is about 280.0, for all values of k.
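
A validation curve usually reports the error on the training set alongside the error on held-out data for each k. The sketch below is a minimal illustration on made-up one-feature data, with hypothetical helpers `knn_predict` and `mse_for`. It is not the notebook's implementation. One thing it shows is that the training error at k=1 is zero whenever each training point is its own nearest neighbor, which is why the held-out error is the one to compare across k.

```python
import heapq


def knn_predict(query, training, k):
    # Rows are [feature, target]; distance is the absolute feature difference.
    nearest = heapq.nsmallest(k, training, key=lambda row: abs(row[0] - query[0]))
    return sum(row[-1] for row in nearest) / len(nearest)


def mse_for(training, evaluation, k):
    # Mean squared error of kNN predictions on `evaluation`, using `training` as the model.
    errors = [(knn_predict(row, training, k) - row[-1]) ** 2 for row in evaluation]
    return sum(errors) / len(errors)


# Made-up data for illustration only.
train = [[0.0, 1.0], [1.0, 3.0], [2.0, 2.0], [3.0, 5.0], [4.0, 4.0]]
test = [[0.5, 2.0], [2.5, 3.0], [3.5, 5.0]]

# Map each k to (training error, held-out error).
curve = {k: (mse_for(train, train, k), mse_for(train, test, k)) for k in (1, 2, 3)}

for k, (train_error, test_error) in curve.items():
    print(k, train_error, test_error)
```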

## Q5: Choose your own adventure

You have three options for the next part:

1. You can implement mean normalization (also called "z-score standardization") of the *features*; do not normalize the target, y. See if this improves the generalization error of your model (middle).

2. You can implement *learning curves* to see if more data would likely improve your model (easiest).

3. You can implement *weighted* kNN and use the real valued GA to choose the weights. Weighted kNN assigns a weight to each item in the Euclidean distance calculation. For two points, j and k:
$$\sqrt{\sum w_i (x^k_i - x^j_i)^2}$$

You can think of normal Euclidean distance as the case where $w_i = 1$ for all features  (ambitious, but fun...you need to start EARLY because it takes a really long time to run).
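
The weighted distance formula can be sketched in a few lines. The function name `weighted_euclidean` and the example weights are hypothetical. Choosing good weights (for example with the GA) is not shown.

```python
import math
from typing import Sequence


def weighted_euclidean(a: Sequence[float], b: Sequence[float], weights: Sequence[float]) -> float:
    # Square root of the weighted sum of squared coordinate differences.
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


a = [0.0, 0.0]
b = [3.0, 4.0]

# All weights equal to 1 gives the ordinary Euclidean distance.
unweighted = weighted_euclidean(a, b, [1.0, 1.0])

# A weight of 0 removes the second feature from the comparison.
reweighted = weighted_euclidean(a, b, [1.0, 0.0])

print(unweighted, reweighted)  # 5.0 3.0
```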

The easier the adventure the more correct it must be...

### <a id="zscore"></a> z score standardization

Formal Parameters:
**data** The data

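**returns** a standardized deep copy of data; the input list is not modified

Standardizes the indices 0-7 of each row in data according to the formula Z=(X-mean)/standard deviation, where the standard deviation is the population standard deviation (dividing by the number of observations). The target at index 8 is left unchanged.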

In [348]:
# supporting code and discussion
import copy
def z_score_standardization(data):
    means= [0]*8
    standard_deviations = [0]*8
    standardized_data = copy.deepcopy(data)
    
    for observation in data:
        for i in range(len(observation)-1):
            means[i]+=observation[i]
            
    for i in range(len(means)):
        means[i]/=len(data)
        
    for observation in data:
        for i in range(len(observation)-1):
            standard_deviations[i]+=(observation[i]-means[i])**2
    
    for i in range(len(standard_deviations)):
        standard_deviations[i]/=len(data)
        standard_deviations[i] = standard_deviations[i]**.5
        
    for observation in standardized_data:
        for i in range(len(observation)-1):
            observation[i] = (observation[i]-means[i])/standard_deviations[i]
    return standardized_data

In [349]:
new_data = z_score_standardization(data)
folds = create_folds(new_data, 10)


In [350]:
null_model_printer(folds)

null model evaluation
fold 0 error: 322.90357000008146
fold 1 error: 248.84773030864778
fold 2 error: 238.35038767200925
fold 3 error: 304.2318391970946
fold 4 error: 220.53007860412012
fold 5 error: 307.6558775567913
fold 6 error: 356.68716016426777
fold 7 error: 260.70237138953763
fold 8 error: 292.32928143237336
fold 9 error: 247.98119463604732
mean error: 280.0219490960971


In [351]:
kNN_tuner(15,folds)

kNN evaluation for different k
k = 1
fold 0 error: 59.95056796116507
fold 1 error: 89.42234660194178
fold 2 error: 86.52787864077669
fold 3 error: 93.5237776699029
fold 4 error: 57.8706582524272
fold 5 error: 64.92879417475729
fold 6 error: 90.66249708737863
fold 7 error: 60.3011300970874
fold 8 error: 62.45907184466021
fold 9 error: 92.31745922330096
mean error: 75.79641815533981
k = 2
fold 0 error: 54.066966747572806
fold 1 error: 105.1210395631067
fold 2 error: 83.79595849514558
fold 3 error: 96.59036067961165
fold 4 error: 58.64883932038833
fold 5 error: 62.91974805825244
fold 6 error: 80.47399441747572
fold 7 error: 64.60396577669906
fold 8 error: 73.32246432038836
fold 9 error: 90.85974417475724
mean error: 77.04030815533977
k = 3
fold 0 error: 58.59115825242719
fold 1 error: 96.12142761596544
fold 2 error: 79.50537971952538
fold 3 error: 71.73220377562029
fold 4 error: 50.406561920172585
fold 5 error: 70.23700312837109
fold 6 error: 80.99009795037757
fold 7 error: 69.35562653721

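Comparing these results with the kNN test on the [original data](#kNNtest), the mean error changes only slightly and not in a consistent direction: it is lower for k=1 (75.80 vs. 76.22) but higher for k=2 (77.04 vs. 76.44). The [null model](#nullmodeleval) is unaffected because the target is not standardized. I was unsure whether to standardize the entire data set at once or to standardize each training and test set separately. I chose the former, which means statistics from the test fold influence the training features.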
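
On the question of whether to standardize the whole data set or each split, a common alternative is to compute the means and standard deviations from the training rows only and then apply them to both the training and the test rows. The sketch below shows that idea with two hypothetical helpers, `fit_standardizer` and `apply_standardizer`. It assumes the target is in the last column and uses the population standard deviation.

```python
from typing import List, Tuple


def fit_standardizer(training: List[List[float]]) -> Tuple[List[float], List[float]]:
    # Per-feature mean and population standard deviation, from training rows only.
    n_features = len(training[0]) - 1
    means = [sum(row[i] for row in training) / len(training) for i in range(n_features)]
    stds = [
        (sum((row[i] - means[i]) ** 2 for row in training) / len(training)) ** 0.5
        for i in range(n_features)
    ]
    return means, stds


def apply_standardizer(rows, means, stds):
    # Leave the target (last column) untouched; fall back to 1.0 if a std is zero.
    return [
        [(row[i] - means[i]) / (stds[i] or 1.0) for i in range(len(means))] + [row[-1]]
        for row in rows
    ]


train = [[1.0, 10.0], [3.0, 20.0]]
test = [[5.0, 30.0]]

means, stds = fit_standardizer(train)
train_z = apply_standardizer(train, means, stds)
test_z = apply_standardizer(test, means, stds)

print(train_z)  # [[-1.0, 10.0], [1.0, 20.0]]
print(test_z)   # [[3.0, 30.0]]
```

With this approach nothing about the test fold is used when preparing the training features, at the cost of repeating the computation for every fold.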

## Before You Submit...

1. Did you provide output exactly as requested?
2. Did you re-execute the entire notebook? ("Restart Kernel and Run All Cells...")
3. If you did not complete the assignment or had difficulty please explain what gave you the most difficulty in the Markdown cell below.
4. Did you change the name of the file to `jhed_id.ipynb`?

Do not submit any other files.

## Difficulty

My main concern is that the [error output](#kNNtest) is in actual values, not percentages, like in the book.  Usually, when I find percent error, I use the formula 100*(predicted-actual)/actual %.  However, I didn't see a place to use that in conjunction with [MSE](#mse).  I'm also unsure if I'm doing [z score standardization](#zscore) correctly.
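
One way to make the error easier to interpret, offered here only as a sketch, is to take the square root of the MSE. MSE is in squared units of the target (MPa squared), while the root mean squared error (RMSE) is back in MPa. The two mean errors below are copied from the k=9 and null model outputs above; the variable names are hypothetical.

```python
import math

# Mean MSE values reported above for kNN (k=9) and the null model.
knn_mse = 88.72273197530862
null_mse = 280.0219490960971

# RMSE is in the same units as the target (MPa).
knn_rmse = math.sqrt(knn_mse)    # about 9.4
null_rmse = math.sqrt(null_mse)  # about 16.7

print(round(knn_rmse, 2), round(null_rmse, 2))
```

Read this way, the k=9 model is typically off by roughly 9.4 MPa, compared with roughly 16.7 MPa for always predicting the mean.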