# **CS 4361/5361 Machine Learning**

**Classifying the MNIST datasets using k-nearest neighbors**

**Author:** Estevan Ramos<br>
**Last modified:** 2021/09/08<br>


# **Lab 1**

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import time
from scipy import stats
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

In [None]:
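def most_common(labels):
    # Most frequent label in each column of labels (one column per test example).
    # np.ravel handles both the (1, n) result of older SciPy versions and the (n,) result of newer ones.
    return np.ravel(stats.mode(labels,axis=0)[0])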

In [None]:
def accuracy(p,y):
    return np.mean(p==y)

In [None]:
def distance(x_test,x_train):
    # Returns 2D array dist of shape (n_train, n_test),
    # where dist[i,j] is the Euclidean distance from training example i to test example j
    dist = np.sum(x_train**2,axis=1).reshape(-1,1)        # ||x_train||^2
    dist = dist - 2*np.matmul(x_train,x_test.T)           # ||x_train||^2 - 2*x_train.x_test
    # The ||x_test||^2 term is constant for each test example, so it does not change
    # the neighbor ranking, but it is needed to obtain true distances
    dist = dist + np.sum(x_test**2,axis=1).reshape(1,-1)
    dist = np.sqrt(np.maximum(dist,0))                    # clip small negatives caused by rounding
    return dist
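
The vectorized computation relies on the expansion ‖a − b‖² = ‖a‖² − 2a·b + ‖b‖². The following is a minimal sketch, on small random arrays, that checks this expansion against an explicit loop. The names `pairwise_dist`, `x_train_demo`, and `x_test_demo` are illustrative and not part of the lab code.

```python
import numpy as np

def pairwise_dist(x_test, x_train):
    # Same expansion as above: ||a||^2 - 2 a.b + ||b||^2, result shape (n_train, n_test)
    sq = (np.sum(x_train**2, axis=1).reshape(-1, 1)
          - 2 * np.matmul(x_train, x_test.T)
          + np.sum(x_test**2, axis=1).reshape(1, -1))
    # Clip tiny negative values caused by rounding before taking the root
    return np.sqrt(np.maximum(sq, 0))

rng = np.random.default_rng(0)
x_train_demo = rng.random((5, 4))   # 5 training examples, 4 features
x_test_demo = rng.random((3, 4))    # 3 test examples, 4 features

fast = pairwise_dist(x_test_demo, x_train_demo)

# Reference: explicit loop over every (train, test) pair
slow = np.zeros((5, 3))
for i in range(5):
    for j in range(3):
        slow[i, j] = np.linalg.norm(x_train_demo[i] - x_test_demo[j])

print(fast.shape)
print(np.allclose(fast, slow))
```

The loop version needs one norm per pair, while the expanded form reduces most of the work to a single matrix product.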

In [None]:
def knn(x_train, y_train, x_test, k):
    d = distance(x_test,x_train) 
    neighbors = np.argsort(d,axis=0)[:k]
    pred = most_common(y_train[neighbors])
    return pred

**Exercise 1.** Write a function to compute the confusion matrix and use it to describe the results from the classification of the MNIST test set.

In [None]:
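def confusion_matrix(actual,pred):
  # Note: this definition replaces the confusion_matrix imported from sklearn.metrics
  # cm[i,j] counts the test examples of true class i that were predicted as class j
  cm = np.zeros((10,10),dtype=int)
  # np.add.at accumulates repeated (actual, pred) pairs;
  # plain fancy-index assignment would count each distinct pair only once
  np.add.at(cm,(actual,pred),1)
  return cm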

In [None]:
confusion_matrix(y_test,pred)
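
One possible way to describe the results is to derive per-class recall and precision from the matrix and to look for the largest off-diagonal entry. The sketch below uses a tiny made-up label set with three classes; the helper `count_matrix` and the arrays `actual_demo` and `pred_demo` are hypothetical and only serve the illustration.

```python
import numpy as np

def count_matrix(actual, pred, n_classes):
    # Hypothetical helper: rows are true classes, columns are predicted classes
    cm = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at accumulates repeated (row, column) pairs
    np.add.at(cm, (actual, pred), 1)
    return cm

# Made-up labels for illustration only
actual_demo = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])
pred_demo = np.array([0, 1, 1, 1, 2, 2, 2, 2, 0])

cm = count_matrix(actual_demo, pred_demo, 3)
print(cm)

# Recall: correct predictions divided by the number of examples of each true class
recall = np.diag(cm) / cm.sum(axis=1)
# Precision: correct predictions divided by the number of times each class was predicted
precision = np.diag(cm) / cm.sum(axis=0)
print('Recall   :', recall)
print('Precision:', precision)

# Largest off-diagonal entry: the most frequent confusion (first one found on ties)
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
true_cls, pred_cls = np.unravel_index(np.argmax(off_diag), off_diag.shape)
print('Most frequent confusion: true class', true_cls, 'predicted as', pred_cls)
```

On MNIST, the same quantities would show which digits are most often mistaken for each other.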

**Exercise 2.** Fashion MNIST is another simple and commonly used dataset to test machine learning algorithms.

* Display some randomly chosen images from Fashion-MNIST.
* Evaluate the accuracy of 3-nearest neighbor on Fashion-MNIST.

The code to download the data is as follows: 

In [1]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
#show image
im = np.random.randint(0,x_train.shape[0])
plt.imshow(x_train[im],cmap='gray')
print('Class:',y_train[im])
plt.show()
#reshape and convert
x_train = np.float32(x_train/255).reshape(x_train.shape[0],-1)
x_test = np.float32(x_test/255).reshape(x_test.shape[0],-1)
#get prediction
pred = knn(x_train, y_train, x_test, 3)
#print Accuracy
print('Accuracy = {:.4f}'.format(accuracy(pred,y_test)))


**Exercise 3.** We can speed up the computation significantly (at the cost of lower accuracy) by generating a new training set containing only ONE example of every class and applying 1-nearest neighbor. 

Usually we find the mean (average) example for every class and use that as the representative for that class.
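
As a rough illustration of the idea, the sketch below builds one mean example per class on synthetic 2D data and assigns each point to the nearest mean. It also compares the result with scikit-learn's `NearestCentroid`, which implements the same rule under its default settings. The data and the names `X_demo`, `y_demo`, and `centroids` are assumptions made for this example only.

```python
import numpy as np
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(0)
# Synthetic data: two well-separated Gaussian blobs, 50 points each
X_demo = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y_demo = np.repeat([0, 1], 50)

# One representative per class: the mean of all examples of that class
centroids = np.stack([X_demo[y_demo == c].mean(axis=0) for c in np.unique(y_demo)])

# 1-nearest neighbor against the class means
dists = np.linalg.norm(X_demo[:, None, :] - centroids[None, :, :], axis=2)
manual_pred = np.argmin(dists, axis=1)

# scikit-learn's nearest-centroid classifier (Euclidean metric by default)
sk_pred = NearestCentroid().fit(X_demo, y_demo).predict(X_demo)

print(centroids.shape)
print(np.array_equal(manual_pred, sk_pred))
```

With 10 classes the distance matrix has only 10 rows per test example, which is where the speed-up over full k-nearest neighbors comes from.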

In [None]:
def mean_class(x_train , y_train):
  mean = []
  # +1 so that the last class (9) is included
  for i in range(np.max(y_train)+1):
    # boolean mask selecting the training examples of class i
    ind = y_train == i
    mean.append(np.mean(x_train[ind], axis=0))
  return np.asarray(mean)

In [None]:
#download data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
#reshape and convert
x_train = np.float32(x_train/255).reshape(x_train.shape[0],-1)
x_test = np.float32(x_test/255).reshape(x_test.shape[0],-1)
#calculate the mean example of each class
mean = mean_class(x_train, y_train)

In [None]:
#make pred for mean with 1 nearest neighbor
pred = knn(mean,np.arange(10),x_test,1)
#print accuracy (stored as acc so that the accuracy() function defined earlier is not overwritten)
acc = accuracy_score(y_test, pred)
print(f'Accuracy = {acc:.4f}')



**Exercise 4.** Modify the knn function to also return an array C containing the number of neighbors of each test example that belong to the class that was predicted for that test example.
 
For example, if the k-nearest neighbors of x_test[i] belong to classes [7,3,7], pred[i] is 7 and C[i] is 2, since 2 of the neighbors belong to the class predicted for that example. Clearly, each C[i] must be an integer between 1 and k.

Use the function to evaluate the accuracy of the classifier for cases where all the neighbors belong to the same class and for all other cases. We expect accuracy to be higher when all neighbors belong to the same class; find out if this assumption is correct.
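
The [7,3,7] example can be worked through on a tiny array. This is only a sketch: `neighbor_labels` is a made-up array with the same layout as `y_train[neighbors]` (one row per neighbor, one column per test example), and the majority vote uses `np.bincount` rather than the lab's `most_common` function.

```python
import numpy as np

# Rows are the k neighbors, columns are test examples
neighbor_labels = np.array([[7, 1],
                            [3, 1],
                            [7, 1]])
k = neighbor_labels.shape[0]

# Majority vote for each test example (each column)
pred_demo = np.array([np.bincount(col).argmax() for col in neighbor_labels.T])

# Number of neighbors that agree with the prediction, for each test example
C_demo = np.sum(neighbor_labels == pred_demo, axis=0)

# True where all k neighbors belong to the predicted class
unanimous = C_demo == k

print('pred =', pred_demo)
print('C =', C_demo)
print('unanimous =', unanimous)
```

The boolean array `unanimous` and its negation can then be used to split the test set into the two groups whose accuracies are compared.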

In [None]:
def modified_knn(x_train, y_train, x_test, k):
    d = distance(x_test,x_train)
    neighbors = np.argsort(d,axis=0)[:k]
    #classes of the k nearest neighbors, shape (k, n_test)
    kn = y_train[neighbors]
    pred = most_common(kn)
    #c[i] = number of neighbors of test example i whose class equals pred[i]
    c = np.sum(kn == pred, axis=0)
    return pred , c

In [None]:
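n = 2000
k = 3
#get prediction and c
pred , c = modified_knn(x_train, y_train, x_test[:n], k)
#boolean mask: True where all k neighbors belong to the predicted class
unanimous = c == k
#labels of the first n test examples (y_test itself is left unchanged)
y_sub = y_test[:n]
#print accuracy for both groups
print('Test set size =',n)
print('Examples where all neighbors agree =',np.sum(unanimous))
print('Accuracy (all neighbors agree) = {:.4f}'.format(accuracy(pred[unanimous],y_sub[unanimous])))
print('Accuracy (all other cases) = {:.4f}'.format(accuracy(pred[~unanimous],y_sub[~unanimous])))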

**Exercise 5.** Use the sklearn implementation of k-nearest neighbors to classify the MNIST and Fashion-MNIST datasets.

See 
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html for documentation. 

Display accuracies and running times using default parameters.

Try to improve performance, either accuracy or running time, by using different parameters. In particular, answer the following questions:


* Does weighted or unweighted k-nn result in higher accuracy?
* What are the effects of the choice of k on the algorithm's accuracy and running time?
* Which algorithm to compute the nearest neighbors (ball tree, kd tree, or brute force) yields the best results?
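
One way to approach these questions is a small sweep over `weights`, `n_neighbors`, and `algorithm`. The sketch below is an assumption-laden illustration: it uses scikit-learn's small bundled digits dataset (8x8 images) as a stand-in for MNIST so that it runs in seconds, and the names `clf` and `results` are arbitrary. Timings and accuracies on MNIST or Fashion-MNIST may behave differently.

```python
import time
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled digits dataset is used as a quick stand-in for MNIST
X, y = load_digits(return_X_y=True)
# Pixel values range from 0 to 16, so divide by 16 to scale them to [0, 1]
X_tr, X_te, y_tr, y_te = train_test_split(X / 16.0, y, test_size=0.25, random_state=0)

results = []
for algorithm in ('brute', 'kd_tree', 'ball_tree'):
    for weights in ('uniform', 'distance'):
        for n_neighbors in (1, 3, 5):
            clf = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights,
                                       algorithm=algorithm)
            clf.fit(X_tr, y_tr)
            # Time only the prediction step, as in the cells of this exercise
            start = time.time()
            acc = clf.score(X_te, y_te)
            elapsed = time.time() - start
            results.append((algorithm, weights, n_neighbors, acc, elapsed))
            print(f'{algorithm:9s} {weights:8s} k={n_neighbors}  '
                  f'accuracy={acc:.4f}  time={elapsed:.4f} s')
```

Swapping in the MNIST arrays (possibly a subset of the test set) gives the numbers needed to answer the three questions above.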





In [None]:
#Mnist
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
print(x_train.shape)

#reshape and convert
x_train = np.float32(x_train/255).reshape(x_train.shape[0],-1)
x_test = np.float32(x_test/255).reshape(x_test.shape[0],-1)

#knn model
model =  KNeighborsClassifier(n_neighbors = 3, weights='distance', algorithm='kd_tree', n_jobs=-1)
#fit the model
model.fit(x_train[:], y_train[:])

(60000, 28, 28)


KNeighborsClassifier(algorithm='kd_tree', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=-1, n_neighbors=3, p=2,
                     weights='distance')

In [None]:
start = time.time()
pred = model.predict(x_test)
elapsed_time = time.time() - start
print('Accuracy = {:.4f}'.format(accuracy(pred,y_test)))
print('Elapsed time = {:.4f} secs'.format(elapsed_time))

In [None]:
#Fashion_Mnist
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
print(x_train.shape)

#reshape and convert
x_train = np.float32(x_train/255).reshape(x_train.shape[0],-1)
x_test = np.float32(x_test/255).reshape(x_test.shape[0],-1)
#knn model
model =  KNeighborsClassifier(n_neighbors = 5, weights='distance', algorithm='brute', n_jobs=-1)
#fit the model
model.fit(x_train[:], y_train[:])

In [None]:
start = time.time()
pred = model.predict(x_test)
elapsed_time = time.time() - start
print('Accuracy = {:.4f}'.format(accuracy(pred,y_test)))
print('Elapsed time = {:.4f} secs'.format(elapsed_time))

The settings above are the ones that gave the best results for both MNIST and Fashion-MNIST. Adding more neighbors on MNIST does not really increase accuracy; it stays about the same, especially with distance weighting, because the nearest neighbors make up the majority of the vote anyway. For Fashion-MNIST, more neighbors increase accuracy, probably because the classes are not as well separated from each other, which makes the additional neighbors the deciding factor. Using kd-trees instead of brute force makes no difference to the results, except that on the whole dataset kd-trees can take over 10x as long.