
# A KNN
_2 points_

- How many distances do you need to calculate if you have 60,000 samples in the training set and 50 samples to classify?
- How many distances do you need to calculate if you have n samples in the training set?

### Solution

A 1. The training set has 60,000 samples. <br>
50 test samples are classified, each by finding its closest training samples. <br>
For each test sample, every training sample is checked, so 60,000 distances are calculated per test sample. <br>
This gives 60,000 \* 50 = 3,000,000 distances in total. <br><br>
A 2. With n training samples and m test samples: n \* m = total distances (n distances per test sample).
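
The n \* m count can be checked with a small brute-force sketch. The function name `count_brute_force_distances` and the toy points are illustrative only, and the sketch assumes a plain exhaustive search (no tree index, no caching of distances).

```python
import math

def count_brute_force_distances(train, queries):
    """Exhaustive nearest-neighbour search that also counts distance computations."""
    count = 0
    nearest = []
    for q in queries:
        best_idx, best_dist = None, math.inf
        for idx, t in enumerate(train):
            dist = math.dist(q, t)  # one Euclidean distance computation
            count += 1
            if dist < best_dist:
                best_idx, best_dist = idx, dist
        nearest.append(best_idx)
    return count, nearest

train = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (9.0, 9.0)]  # n = 4 training samples
queries = [(0.9, 1.2), (8.0, 8.5), (4.0, 6.0)]             # m = 3 samples to classify
count, nearest = count_brute_force_distances(train, queries)
print(count)        # n * m distance computations
print(nearest)      # index of the closest training sample per query
print(60_000 * 50)  # the count for the exercise's numbers
```

The same loop structure applies to the exercise: every one of the 50 query samples is compared against all 60,000 training samples.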

# B KNN MNIST
_3 points_

- What is the error rate of KNN on the test set?
- What is the error rate for each label (number)?

Do this for k = 2, 4, 8

- How does the choice of k influence the results?

### Solution

In [70]:
import tensorflow as tf
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score
import pandas as pd

In [4]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data() # load dataset
# flatten each image into a 1D vector so the train and test sets become 2D arrays (samples x pixels)
x_train_flat = x_train.reshape([x_train.shape[0],
                              x_train.shape[1] * x_train.shape[2]])
x_test_flat = x_test.reshape([x_test.shape[0],
                           x_test.shape[1] * x_test.shape[2]])

In [None]:
# creating the KNN models with k = 2, 4, 8
# n_jobs = -1 makes sure that all cores are used for the model
knn_2 = KNeighborsClassifier(n_neighbors=2, n_jobs=-1)
knn_4 = KNeighborsClassifier(n_neighbors=4, n_jobs=-1)
knn_8 = KNeighborsClassifier(n_neighbors=8, n_jobs=-1)

In [7]:
# Fitting the train set into the knn models
knn_2.fit(x_train_flat, y_train)
knn_4.fit(x_train_flat, y_train)
knn_8.fit(x_train_flat, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=-1, n_neighbors=8, p=2,
           weights='uniform')

In [11]:
knn_2_predictions = knn_2.predict(x_test_flat)
knn_4_predictions = knn_4.predict(x_test_flat)
knn_8_predictions = knn_8.predict(x_test_flat)

In [24]:
# Assumption: we define the error rate as 1 - F1 score.
# We chose the F1 score because it combines precision and recall.
# Because this is a multiclass problem, we use average="micro" to get one global score
# (for single-label multiclass data, micro-averaged F1 equals accuracy).
knn_2_f1_score = f1_score(y_test, knn_2_predictions, average="micro")
knn_4_f1_score = f1_score(y_test, knn_4_predictions, average="micro")
knn_8_f1_score = f1_score(y_test, knn_8_predictions, average="micro")
print("Error rate for knn {}: {}".format(2, (1 - knn_2_f1_score)))
print("Error rate for knn {}: {}".format(4, (1 - knn_4_f1_score)))
print("Error rate for knn {}: {}".format(8, (1 - knn_8_f1_score)))

Error rate for knn 2: 0.0373
Error rate for knn 4: 0.03180000000000005
Error rate for knn 8: 0.03300000000000003
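
The definition used above (error rate = 1 - micro F1) coincides with the usual misclassification rate when every sample has exactly one label. The sketch below illustrates this in plain Python on made-up labels; `micro_f1` and `accuracy` are hypothetical helpers written for this illustration, not scikit-learn functions, and `micro_f1` assumes at least one correct prediction.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all labels, then compute F1 once."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label and t != label:
                fp += 1  # predicted this label, but the truth is another one
            elif t == label and p != label:
                fn += 1  # truth is this label, but another one was predicted
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels, invented for this example.
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]
print(micro_f1(y_true, y_pred), accuracy(y_true, y_pred))
```

Each misclassified sample counts once as a false positive (for the predicted label) and once as a false negative (for the true label), so pooled precision and recall are both equal to the accuracy.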


In [65]:
"""
To calculate the individual error rates, we will isolate the prediction results
by individual labels and define the error rate again as 1 - f1 score
This will be applied to all knn models through the outer loop
The results will be stored in the "results" array with the following structure: the first dimension
defines the knn model in order 2, 4, 8; the second dimension defines the label (0-9)
"""

prediction_array = [knn_2_predictions, knn_4_predictions, knn_8_predictions]
results = []
for predictions in prediction_array:  # one pass per model (k = 2, 4, 8)
    result_array = []
    for label in range(10):  # distinct name so the outer loop variable is not shadowed
        # np.where returns a tuple; its first element is the index array we need
        indices = np.where(y_test == label)[0]
        true_values = y_test[indices]  # array holding `label` repeated len(indices) times
        pred_values = predictions[indices]
        # on this subset, 1 - micro F1 is the share of true `label` samples that were misclassified
        result_array.append(1 - f1_score(true_values, pred_values, average="micro"))
    results.append(result_array)
print("Error rates by model and label")
pd.DataFrame(np.array(results).T, columns=["Knn_2", "Knn_4", "Knn_8"])

Error rates by model and label


| Label | Knn_2 | Knn_4 | Knn_8 |
|---|---|---|---|
| 0 | 0.004082 | 0.004082 | 0.007143 |
| 1 | 0.001762 | 0.001762 | 0.002643 |
| 2 | 0.035853 | 0.037791 | 0.050388 |
| 3 | 0.028713 | 0.030693 | 0.033663 |
| 4 | 0.023422 | 0.03055 | 0.039715 |
| 5 | 0.047085 | 0.033632 | 0.026906 |
| 6 | 0.019833 | 0.016701 | 0.015658 |
| 7 | 0.04572 | 0.038911 | 0.041829 |
| 8 | 0.100616 | 0.073922 | 0.065708 |
| 9 | 0.071358 | 0.053518 | 0.048563 |
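
Because the loop above keeps only the samples whose true label is the one being examined, each per-label value is the fraction of that label's test samples that were misclassified (1 - recall for that label). A minimal sketch of the same quantity in plain Python, on made-up labels; `per_label_error` is a hypothetical helper, not part of the notebook:

```python
from collections import Counter

def per_label_error(y_true, y_pred):
    """For each true label, the fraction of its samples that were predicted wrongly."""
    totals = Counter(y_true)  # number of samples per true label
    misses = Counter(t for t, p in zip(y_true, y_pred) if t != p)  # errors per true label
    return {label: misses[label] / totals[label] for label in sorted(totals)}

# Toy labels, invented for this example.
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 2, 2, 0, 1]
errors = per_label_error(y_true, y_pred)
for label, err in errors.items():
    print(label, round(err, 4))
```

Read this way, the table says how often a digit is missed when it is the true answer; it does not say how often a predicted digit is wrong (that would be 1 - precision).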


### Answer

In this exercise and all following examples, we chose the F1 score because it combines precision and recall. Because this is a multiclass model, we chose "micro" as the average parameter to get a global overview of the score.


The error rates (in our scenario defined as 1 - F1 score) differ by less than 0.006 across the three models.<br>
For k = 2 the error rate is: 0.0373<br>
For k = 4 the error rate is: 0.0318<br>
For k = 8 the error rate is: 0.0330<br>

The individual error rates can be seen in the DataFrame above.<br><br>

<b>Changing the value of k</b> does not have a big impact on the overall error rate; however, the individual error rates change noticeably, especially when comparing Knn_2 and Knn_8.<br>
For example, Knn_2 has an error rate of 0.0041 for the label "0" while Knn_8 has an error rate of 0.0071.<br>
This effect can also be seen at labels 1, 2, 5, 8 and 9: the error rate rises with k for labels 1 and 2 and falls for labels 5, 8 and 9.

In addition, we question the use of even values for k, because a tie between classes is possible, and may be very common with k = 2 whenever the labels of the two neighbors do not match. An odd number such as k = 3 might give a higher accuracy (we tested this in the cell below and calculated an f1_score).

In [1]:
# Requires the import and data-loading cells above to have run in the same kernel session.
knn_3 = KNeighborsClassifier(n_neighbors=3, n_jobs=-1).fit(x_train_flat, y_train)
knn_3_predictions = knn_3.predict(x_test_flat)
knn_3_f1_score = f1_score(y_test, knn_3_predictions, average="micro")
# report the k = 3 score next to the earlier results for comparison
print("Error rate for knn 3: {}".format(1 - knn_3_f1_score))
print("The error rates for knn2, knn4 and knn8 were: 0.0373; 0.0318; 0.0330")

NameError: name 'KNeighborsClassifier' is not defined
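
The tie concern raised for even k can be illustrated with a small majority-vote sketch. The `vote` helper is hypothetical, and its tie-break rule (take the smallest label) is an assumption made for this illustration; it is not claimed to be what any particular library does.

```python
from collections import Counter

def vote(neighbor_labels):
    """Uniform-weight majority vote; returns (winning_label, is_tie)."""
    counts = Counter(neighbor_labels)
    top = max(counts.values())
    winners = sorted(label for label, c in counts.items() if c == top)
    # Assumption for this sketch: a tie is resolved by taking the smallest label.
    return winners[0], len(winners) > 1

print(vote([4, 9]))     # k = 2 with disagreeing neighbors: a tie
print(vote([4, 9, 9]))  # k = 3: a clear majority
print(vote([4, 9, 7]))  # k = 3 with three different labels: still a tie
```

With ten classes, an odd k removes two-way ties between two labels but not every tie, as the last call shows, so the benefit of k = 3 over k = 2 or k = 4 would have to be confirmed by the measured error rate.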