# B KNN MNIST

- What is the error rate of KNN on the test set?
- What is the error rate for each label (number)?

Do this for k = 2, 4, and 8:
- How does the choice of k influence the results?

In [1]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE
import scipy.spatial.distance
from sklearn.metrics import f1_score
import pandas as pd

plt.set_cmap('gray')

# load dataset

In [2]:
# load_data() returns ((X_train, Y_train), (X_test, Y_test)), so one call is enough
(X_train, Y_train), (X_test, Y_test) = tf.keras.datasets.mnist.load_data()

n_train = X_train.shape[0]
n_test = X_test.shape[0]

# Number of pixels per image (height * width = 28 * 28 = 784)
m = X_train.shape[1] * X_train.shape[2]

In [3]:
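# Flatten each image into a vector of m pixel values
X_train = X_train.reshape([n_train, m])
X_test = X_test.reshape([n_test, m])

# Random subsample of 10000 training images (np.random.randint draws with replacement)
idx = np.random.randint(n_train, size=10000)
x_train_sample = X_train[idx]
y_train_sample = Y_train[idx]

# Random subsample of 1000 test images
idx = np.random.randint(n_test, size=1000)
x_test_sample = X_test[idx]
y_test_sample = Y_test[idx]

# Values of k to evaluate
karray = [2, 4, 8]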
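
The subsampling above uses `np.random.randint`, which is unseeded here and draws with replacement, so the same image can be picked more than once and every run yields a different sample. A minimal sketch of a reproducible draw without replacement follows; the seed value and the names `n_total`, `n_sample`, and `sample_idx` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# Seed 0 is an arbitrary choice made for this illustration
rng = np.random.default_rng(0)

n_total = 60000    # stand-in for n_train (size of the MNIST training set)
n_sample = 10000   # same sample size as in the cell above

# replace=False guarantees that every index appears at most once
sample_idx = rng.choice(n_total, size=n_sample, replace=False)

print(sample_idx.shape, len(np.unique(sample_idx)))
```

Indexing `X_train[sample_idx]` and `Y_train[sample_idx]` with such an array would then give a duplicate-free subsample that is identical on every run.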

Function that reports the error rate (1 - F1 score) for a given k

In [4]:
def f1score(k):
    # Pairwise Euclidean distances, shape (n_train_sample, n_test_sample)
    dists = scipy.spatial.distance.cdist(x_train_sample, x_test_sample, metric='euclidean')
    # Indices of the k nearest training points for each test point, shape (k, n_test_sample)
    idx_nearest = np.argpartition(dists, k, axis=0)[:k]

    # Labels of the k nearest neighbours, one row per test point
    pred = y_train_sample[idx_nearest].T
    actuals = y_test_sample

    # Majority vote per test point; on a tie np.argmax returns the smallest label
    prediction = np.array([np.argmax(np.bincount(row)) for row in pred])

    # We report 1 - f1_score. With average='micro' on single-label multiclass data,
    # F1 equals accuracy, so this value is exactly the error rate.
    # We are going to use it in further exercises, too.
    calcf1score = f1_score(actuals, prediction, average='micro')
    print("The error rate for KNN with k = {} on this Dataset is {}".format(k, 1 - calcf1score))
#f1score(3)
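
For single-label multiclass predictions, micro-averaged F1 pools true positives, false positives, and false negatives over all classes. Every misclassified sample counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so the score coincides with accuracy. The sketch below checks this with plain NumPy; the helper `micro_f1` and the toy labels are hypothetical and are not taken from the MNIST run.

```python
import numpy as np

def micro_f1(y_true, y_pred, n_classes):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for c in range(n_classes):
        tp += np.sum((y_pred == c) & (y_true == c))
        fp += np.sum((y_pred == c) & (y_true != c))
        fn += np.sum((y_pred != c) & (y_true == c))
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical toy labels for illustration only
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_pred = np.array([0, 2, 2, 2, 1, 0, 1, 1])

accuracy = np.mean(y_true == y_pred)
f1 = micro_f1(y_true, y_pred, n_classes=3)

print(accuracy, f1)
```

Both values agree, which is why `1 - f1_score(..., average='micro')` can be read directly as the error rate.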

In [5]:
results = []
def endresult():
    # The distances do not depend on k, so compute them once
    dists = scipy.spatial.distance.cdist(x_train_sample, x_test_sample, metric='euclidean')
    actuals = y_test_sample

    for k in karray:
        result_array = []
        # Indices of the k nearest training points for each test point
        idx_nearest = np.argpartition(dists, k, axis=0)[:k]

        # Labels of the k nearest neighbours, one row per test point
        pred = y_train_sample[idx_nearest].T

        # Majority vote per test point; on a tie np.argmax returns the smallest label
        prediction = np.array([np.argmax(np.bincount(row)) for row in pred])

        # With average='micro' on single-label multiclass data, F1 equals accuracy,
        # so 1 - f1_score is exactly the error rate.
        calcf1score = f1_score(actuals, prediction, average='micro')
        print("The error rate for KNN with k = {} on this Dataset is {}".format(k, 1 - calcf1score))

        # Per-label error: share of test images with true label i that were misclassified
        for i in range(10):
            indices = np.where(actuals == i)
            f1 = 1 - f1_score(actuals[indices], prediction[indices], average='micro')
            result_array.append(f1)

        results.append(result_array)

endresult()

The error rate for KNN with k = 2 on this Dataset is 0.07099999999999995
The error rate for KNN with k = 4 on this Dataset is 0.05700000000000005
The error rate for KNN with k = 8 on this Dataset is 0.07199999999999995


In [6]:
pd.DataFrame(np.array(results).T, columns=["k = 2", "k = 4", "k = 8"])

| label | k = 2 | k = 4 | k = 8 |
|---|---|---|---|
| 0 | 0.010204 | 0.010204 | 0.010204 |
| 1 | 0.0 | 0.0 | 0.0 |
| 2 | 0.095652 | 0.086957 | 0.113043 |
| 3 | 0.054545 | 0.036364 | 0.036364 |
| 4 | 0.070707 | 0.070707 | 0.10101 |
| 5 | 0.107143 | 0.083333 | 0.095238 |
| 6 | 0.0375 | 0.0375 | 0.0375 |
| 7 | 0.027027 | 0.036036 | 0.081081 |
| 8 | 0.242718 | 0.165049 | 0.184466 |
| 9 | 0.065934 | 0.043956 | 0.054945 |


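- On this sample, KNN works best with k = 4 (5.7% overall error, compared with 7.1% for k = 2 and 7.2% for k = 8).
- Most mistakes happen on 8s (16.5% to 24.3% error), followed by 2s and 5s (roughly 8% to 11% each).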
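
The per-label values in the table give the share of misclassified images for each true label. A confusion matrix carries the same information and also shows which label was predicted instead. The following is a minimal NumPy sketch under stated assumptions: the helper `per_label_error` and the toy labels are hypothetical and do not reproduce the MNIST numbers.

```python
import numpy as np

def per_label_error(y_true, y_pred, n_classes):
    """Return the confusion matrix and, per true label, the share of misclassified samples."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    # Rows are true labels, columns are predicted labels
    np.add.at(cm, (y_true, y_pred), 1)
    support = cm.sum(axis=1)   # number of samples per true label
    correct = np.diag(cm)      # correctly classified samples per label
    # Labels that never occur in y_true get NaN instead of a division by zero
    error = np.where(support > 0, 1 - correct / np.maximum(support, 1), np.nan)
    return cm, error

# Hypothetical toy labels for illustration only
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 2, 1, 2, 2, 1, 0])

cm, error = per_label_error(y_true, y_pred, n_classes=3)
print(cm)
print(error)
```

Applied to `actuals` and `prediction` with `n_classes=10`, the off-diagonal entries in the row for label 8 would show which digits the 8s are most often mistaken for.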

## Answer

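As the table above shows, most mistakes happen on 8s, followed by 2s and 5s. The overall error rate was between 5.7% and 7.2% for all chosen values of k, which seems to be a good result. The number of neighbours had only a small influence on the result: k = 4 was slightly better than k = 2 and k = 8, but with a random test sample of only 1000 images, differences of this size should be interpreted with caution.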
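
One way the choice of k can influence the result is through ties: all tested values of k are even, so two labels can receive the same number of votes. The vote in this notebook uses `np.bincount` followed by `np.argmax`, and `np.argmax` returns the first maximum, so a tie is resolved in favour of the smallest label. The sketch below illustrates this behaviour; the helpers `majority_vote` and `has_tie` and the example neighbour labels are hypothetical.

```python
import numpy as np

def majority_vote(neighbour_labels, n_classes=10):
    """Most frequent label; np.argmax resolves ties in favour of the smallest label."""
    counts = np.bincount(neighbour_labels, minlength=n_classes)
    return int(np.argmax(counts))

def has_tie(neighbour_labels, n_classes=10):
    """True if more than one label reaches the highest vote count."""
    counts = np.bincount(neighbour_labels, minlength=n_classes)
    return int(np.sum(counts == counts.max())) > 1

# Hypothetical neighbour labels for one test image
two_neighbours = np.array([8, 3])          # k = 2, one vote each
four_neighbours = np.array([8, 8, 3, 5])   # k = 4, clear majority for 8

print(majority_vote(two_neighbours), has_tie(two_neighbours))
print(majority_vote(four_neighbours), has_tie(four_neighbours))
```

In the first case the vote falls on 3 even though 8 has just as many votes, which is one possible reason why a very small even k can do worse on some labels. Counting how often `has_tie` is true for each k would show how much this matters in practice.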
