The goal of this lab is to implement and compare a K-Nearest Neighbours (KNN) classifier, a Decision
Tree (DT) classifier, and a Stochastic Gradient Descent (SGD) classifier. Below we provide a brief
overview of these classifiers before specifying the task for this lab.

In [31]:
# Import relevant packages
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split

In [32]:
# Step 2: load the Fashion-MNIST dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()

In [44]:
# Take 5000 samples for training and 1000 for testing
training_size = 5000
testing_size = 1000

x_train = x_train[0:training_size]
y_train = y_train[0:training_size]

x_test = x_test[0:testing_size]
y_test = y_test[0:testing_size]

x_train = x_train.reshape(x_train.shape[0], -1)
x_test = x_test.reshape(x_test.shape[0], -1)

KNN = KNeighborsClassifier(
    n_neighbors=3,
    algorithm='auto',
    leaf_size=30,
    metric='minkowski',
    p=1,
    metric_params=None,
    n_jobs=None
)

DT = DecisionTreeClassifier()

SGD = SGDClassifier(
    max_iter=250
)


KNN.fit(x_train, y_train)
result = KNN.predict(x_test)
print("KNN Classifier")
print('Accuracy: ', metrics.accuracy_score(y_test, result))
print('Precision: ', metrics.precision_score(y_test, result, average='weighted'))
print('Recall: ', metrics.recall_score(y_test, result, average='weighted'))
print('F1 Score: ', metrics.f1_score(y_test, result, average='weighted'))
cmKn = metrics.confusion_matrix(y_test, result)
# Fashion-MNIST has 10 classes, so label the display with the fitted classes rather than [False, True]
cmKnDisp = metrics.ConfusionMatrixDisplay(confusion_matrix=cmKn, display_labels=KNN.classes_)
print(cmKn)

DT.fit(x_train, y_train)
result = DT.predict(x_test)
print("\nDT Classifier")
print('Accuracy: ', metrics.accuracy_score(y_test, result))
print('Precision: ', metrics.precision_score(y_test, result, average='weighted'))
print('Recall: ', metrics.recall_score(y_test, result, average='weighted'))
print('F1 Score: ', metrics.f1_score(y_test, result, average='weighted'))
cmDt = metrics.confusion_matrix(y_test, result)
print(cmDt)

SGD.fit(x_train, y_train)
result = SGD.predict(x_test)
print("\nSGD Classifier")
print('Accuracy: ', metrics.accuracy_score(y_test, result))
print('Precision: ', metrics.precision_score(y_test, result, average='weighted'))
print('Recall: ', metrics.recall_score(y_test, result, average='weighted'))
print('F1 Score: ', metrics.f1_score(y_test, result, average='weighted'))
cmSgd = metrics.confusion_matrix(y_test, result)
print(cmSgd)


KNN Classifier
Accuracy:  0.82
Precision:  0.824949282236487
Recall:  0.82
F1 Score:  0.819827968618585
[[ 90   0   3   5   0   0   8   0   1   0]
 [  2 100   0   3   0   0   0   0   0   0]
 [  4   0  84   0  14   0   9   0   0   0]
 [ 11   1   2  76   1   0   2   0   0   0]
 [  0   1  18   0  83   0  13   0   0   0]
 [  0   0   0   0   0  71   0   7   1   8]
 [ 20   0  20   3   5   0  49   0   0   0]
 [  0   0   0   0   0   1   0  89   0   5]
 [  2   0   1   0   1   1   2   0  88   0]
 [  0   0   0   0   0   0   0   5   0  90]]

DT Classifier
Accuracy:  0.735
Precision:  0.7390250598253136
Recall:  0.735
F1 Score:  0.7351417554757588
[[70  1  5  7  0  2 20  0  2  0]
 [ 1 96  0  7  0  0  1  0  0  0]
 [ 5  1 69  3 15  2 13  1  2  0]
 [ 3  5  4 69  5  0  6  0  1  0]
 [ 2  2 22  4 62  0 18  0  5  0]
 [ 0  2  0  1  0 67  0 11  3  3]
 [12  3  9  7 13  0 50  0  2  1]
 [ 0  0  0  0  0  3  0 84  0  8]
 [ 1  1  1  0  1  1  8  0 81  1]
 [ 0  1  0  0  0  1  1  5  0 87]]

SGD Classifier
Accuracy: 

The paper has no KNN configuration that matches this script one-to-one, because this script uses k=3. The KNN accuracy here (0.82) is about 8% lower than the paper's, possibly because of the smaller number of nearest neighbours and the smaller training set the model is fitted on.
The DT accuracy (0.735) is also about 6% lower, probably because of the smaller training set as well.
Similarly, the Stochastic Gradient Descent accuracy in this script is around 3% lower, likely due to the smaller data set and different tuning parameters.
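As a sanity check on the metrics printed above, accuracy and the 'weighted' precision and recall can be recomputed by hand from the KNN confusion matrix. The sketch below is plain Python and copies the matrix from the output of this cell; the variable names are illustrative and not part of the lab code. It assumes the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
# KNN confusion matrix copied from the output above
# (rows = true class, columns = predicted class).
cm = [
    [90, 0, 3, 5, 0, 0, 8, 0, 1, 0],
    [2, 100, 0, 3, 0, 0, 0, 0, 0, 0],
    [4, 0, 84, 0, 14, 0, 9, 0, 0, 0],
    [11, 1, 2, 76, 1, 0, 2, 0, 0, 0],
    [0, 1, 18, 0, 83, 0, 13, 0, 0, 0],
    [0, 0, 0, 0, 0, 71, 0, 7, 1, 8],
    [20, 0, 20, 3, 5, 0, 49, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 89, 0, 5],
    [2, 0, 1, 0, 1, 1, 2, 0, 88, 0],
    [0, 0, 0, 0, 0, 0, 0, 5, 0, 90],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)

# Row sums: how many test samples truly belong to each class (the "support").
support = [sum(row) for row in cm]
# Column sums: how many test samples were predicted as each class.
predicted = [sum(row[j] for row in cm) for j in range(n_classes)]
# Diagonal: correctly classified samples per class.
correct = [cm[i][i] for i in range(n_classes)]

accuracy = sum(correct) / total

# average='weighted' means: per-class score, weighted by that class's support.
weighted_recall = sum(
    support[i] * (correct[i] / support[i]) for i in range(n_classes)
) / total
weighted_precision = sum(
    support[i] * (correct[i] / predicted[i]) for i in range(n_classes)
) / total

print('Accuracy: ', accuracy)
print('Weighted precision: ', round(weighted_precision, 4))
print('Weighted recall: ', round(weighted_recall, 4))
```

Weighted recall reduces to the same expression as accuracy, which is why the two values printed by the notebook always agree.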

In [48]:
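# Take 10000 samples for training and 1000 for testing
# NOTE: x_train and y_train were already cut to 5000 samples by the previous cell,
# so these slices cannot grow them; re-run the load_data cell first for them to take effect.
training_size = 10000
testing_size = 1000

x_train = x_train[0:training_size]
y_train = y_train[0:training_size]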

x_test = x_test[0:testing_size]
y_test = y_test[0:testing_size]

x_train = x_train.reshape(x_train.shape[0], -1)
x_test = x_test.reshape(x_test.shape[0], -1)

KNN = KNeighborsClassifier(
    n_neighbors=3,
    algorithm='auto',
    leaf_size=30,
    metric='minkowski',
    p=1,
    metric_params=None,
    n_jobs=None
)

DT = DecisionTreeClassifier()

SGD = SGDClassifier(
    max_iter=250
)


KNN.fit(x_train, y_train)
result = KNN.predict(x_test)
print("KNN Classifier")
print('Accuracy: ', metrics.accuracy_score(y_test, result))
print('Precision: ', metrics.precision_score(y_test, result, average='weighted'))
print('Recall: ', metrics.recall_score(y_test, result, average='weighted'))
print('F1 Score: ', metrics.f1_score(y_test, result, average='weighted'))
cmKn = metrics.confusion_matrix(y_test, result)
# Fashion-MNIST has 10 classes, so label the display with the fitted classes rather than [False, True]
cmKnDisp = metrics.ConfusionMatrixDisplay(confusion_matrix=cmKn, display_labels=KNN.classes_)
print(cmKn)

DT.fit(x_train, y_train)
result = DT.predict(x_test)
print("\nDT Classifier")
print('Accuracy: ', metrics.accuracy_score(y_test, result))
print('Precision: ', metrics.precision_score(y_test, result, average='weighted'))
print('Recall: ', metrics.recall_score(y_test, result, average='weighted'))
print('F1 Score: ', metrics.f1_score(y_test, result, average='weighted'))
cmDt = metrics.confusion_matrix(y_test, result)
print(cmDt)

SGD.fit(x_train, y_train)
result = SGD.predict(x_test)
print("\nSGD Classifier")
print('Accuracy: ', metrics.accuracy_score(y_test, result))
print('Precision: ', metrics.precision_score(y_test, result, average='weighted'))
print('Recall: ', metrics.recall_score(y_test, result, average='weighted'))
print('F1 Score: ', metrics.f1_score(y_test, result, average='weighted'))
cmSgd = metrics.confusion_matrix(y_test, result)
print(cmSgd)


KNN Classifier
Accuracy:  0.82
Precision:  0.824949282236487
Recall:  0.82
F1 Score:  0.819827968618585
[[ 90   0   3   5   0   0   8   0   1   0]
 [  2 100   0   3   0   0   0   0   0   0]
 [  4   0  84   0  14   0   9   0   0   0]
 [ 11   1   2  76   1   0   2   0   0   0]
 [  0   1  18   0  83   0  13   0   0   0]
 [  0   0   0   0   0  71   0   7   1   8]
 [ 20   0  20   3   5   0  49   0   0   0]
 [  0   0   0   0   0   1   0  89   0   5]
 [  2   0   1   0   1   1   2   0  88   0]
 [  0   0   0   0   0   0   0   5   0  90]]

DT Classifier
Accuracy:  0.73
Precision:  0.7338130353277903
Recall:  0.73
F1 Score:  0.7306245826408669
[[74  2  5  6  1  1 17  0  1  0]
 [ 1 98  0  5  0  1  0  0  0  0]
 [ 6  0 66  4 17  2 16  0  0  0]
 [ 5  5  5 64  3  2  8  0  1  0]
 [ 3  1 18  4 62  0 23  0  4  0]
 [ 0  0  0  1  0 67  0 11  3  5]
 [11  2  9 12 13  1 48  0  1  0]
 [ 0  0  0  0  0  3  0 85  0  7]
 [ 4  0  3  0  0  2  5  0 80  1]
 [ 0  0  0  0  0  3  0  6  0 86]]

SGD Classifier
Accuracy:  0

The KNN accuracy of 0.82 is 3% worse than the paper. It is identical to the previous run, down to the confusion matrix, which suggests the training set did not actually grow: x_train had already been cut to 5000 samples in the previous cell, so slicing it again to 10000 has no effect.
The DT accuracy of 0.73 is still 6% worse than the paper. This could be because the decision tree does not split enough on the available training data.
The SGD accuracy of 0.79 is still ~3% worse than the paper. There may not be enough training data for gradient descent to reach a more accurate model.
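The KNN output of this cell matches the 5000-sample run exactly, which is what one would expect if the training slice never grew. A minimal sketch of why, using zero-filled NumPy arrays as a hypothetical stand-in for the 60000 Fashion-MNIST training images: slicing past the end of an array is silently clamped, so once x_train has been overwritten with 5000 rows, asking for 10000 returns the same 5000. The helper take_subset is an illustrative name, not part of the lab code; it slices from the untouched full arrays instead.

```python
import numpy as np

# Hypothetical stand-in for the full training set (same shape as Fashion-MNIST).
x_full = np.zeros((60000, 28, 28), dtype=np.uint8)
y_full = np.zeros(60000, dtype=np.uint8)

# Pitfall: after overwriting x_train, a later and larger slice has nothing more to take.
x_train = x_full[0:5000]
x_train = x_train[0:10000]  # no error; the slice is clamped to the 5000 rows that exist
print(x_train.shape)        # (5000, 28, 28)


def take_subset(x, y, n):
    """Return the first n samples, flattened to one row per image, leaving x and y intact."""
    if n > len(x):
        raise ValueError(f"requested {n} samples but only {len(x)} are available")
    return x[:n].reshape(n, -1), y[:n]


# Each training size is cut from the full arrays, so every run sees the size it asks for.
for size in (5000, 10000, 20000):
    x_sub, y_sub = take_subset(x_full, y_full, size)
    print(size, x_sub.shape, y_sub.shape)
```

Keeping the full arrays under separate names also means the cells can be re-run in any order without reloading the dataset.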

In [49]:
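# Take 20000 samples for training and 1000 for testing
# NOTE: x_train and y_train were already cut to 5000 samples by the first cell above,
# so these slices cannot grow them; re-run the load_data cell first for them to take effect.
training_size = 20000
testing_size = 1000

x_train = x_train[0:training_size]
y_train = y_train[0:training_size]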

x_test = x_test[0:testing_size]
y_test = y_test[0:testing_size]

x_train = x_train.reshape(x_train.shape[0], -1)
x_test = x_test.reshape(x_test.shape[0], -1)

KNN = KNeighborsClassifier(
    n_neighbors=3,
    algorithm='auto',
    leaf_size=30,
    metric='minkowski',
    p=1,
    metric_params=None,
    n_jobs=None
)

DT = DecisionTreeClassifier()

SGD = SGDClassifier(
    max_iter=250
)


KNN.fit(x_train, y_train)
result = KNN.predict(x_test)
print("KNN Classifier")
print('Accuracy: ', metrics.accuracy_score(y_test, result))
print('Precision: ', metrics.precision_score(y_test, result, average='weighted'))
print('Recall: ', metrics.recall_score(y_test, result, average='weighted'))
print('F1 Score: ', metrics.f1_score(y_test, result, average='weighted'))
cmKn = metrics.confusion_matrix(y_test, result)
# Fashion-MNIST has 10 classes, so label the display with the fitted classes rather than [False, True]
cmKnDisp = metrics.ConfusionMatrixDisplay(confusion_matrix=cmKn, display_labels=KNN.classes_)
print(cmKn)

DT.fit(x_train, y_train)
result = DT.predict(x_test)
print("\nDT Classifier")
print('Accuracy: ', metrics.accuracy_score(y_test, result))
print('Precision: ', metrics.precision_score(y_test, result, average='weighted'))
print('Recall: ', metrics.recall_score(y_test, result, average='weighted'))
print('F1 Score: ', metrics.f1_score(y_test, result, average='weighted'))
cmDt = metrics.confusion_matrix(y_test, result)
print(cmDt)

SGD.fit(x_train, y_train)
result = SGD.predict(x_test)
print("\nSGD Classifier")
print('Accuracy: ', metrics.accuracy_score(y_test, result))
print('Precision: ', metrics.precision_score(y_test, result, average='weighted'))
print('Recall: ', metrics.recall_score(y_test, result, average='weighted'))
print('F1 Score: ', metrics.f1_score(y_test, result, average='weighted'))
cmSgd = metrics.confusion_matrix(y_test, result)
print(cmSgd)


KNN Classifier
Accuracy:  0.82
Precision:  0.824949282236487
Recall:  0.82
F1 Score:  0.819827968618585
[[ 90   0   3   5   0   0   8   0   1   0]
 [  2 100   0   3   0   0   0   0   0   0]
 [  4   0  84   0  14   0   9   0   0   0]
 [ 11   1   2  76   1   0   2   0   0   0]
 [  0   1  18   0  83   0  13   0   0   0]
 [  0   0   0   0   0  71   0   7   1   8]
 [ 20   0  20   3   5   0  49   0   0   0]
 [  0   0   0   0   0   1   0  89   0   5]
 [  2   0   1   0   1   1   2   0  88   0]
 [  0   0   0   0   0   0   0   5   0  90]]

DT Classifier
Accuracy:  0.73
Precision:  0.7305935295677743
Recall:  0.73
F1 Score:  0.7294354711730547
[[71  1  5  7  2  3 16  0  2  0]
 [ 1 97  1  6  0  0  0  0  0  0]
 [ 3  1 64  4 22  2 15  0  0  0]
 [ 3  6  3 69  4  2  5  0  1  0]
 [ 1  2 20  5 66  0 17  0  4  0]
 [ 0  2  0  0  0 69  0 10  3  3]
 [12  1  8  7 16  1 49  0  2  1]
 [ 0  0  0  0  0  3  0 83  0  9]
 [ 3  2  2  0  1  2  5  1 77  2]
 [ 0  0  0  0  0  5  0  5  0 85]]

SGD Classifier
Accuracy:  0

The KNN accuracy of 0.82 is still 3% worse than the paper. Further accuracy gains would have to come from tuning the parameters.
The DT accuracy of 0.73 is still 6% worse than the paper. This is still likely because the decision tree does not split enough on the available training data.
The SGD accuracy of 0.79 is still ~3% worse than the paper. There may not be enough training data for gradient descent to reach a more accurate model. Improving it will also require better tuning of the hyperparameters.
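One way the parameter tuning mentioned above could be approached is a cross-validated grid search. The sketch below is an illustration under stated assumptions, not the lab's method: it uses scikit-learn's small load_digits dataset as a stand-in for Fashion-MNIST so that it runs quickly, the grids are arbitrary, and the SGD model is wrapped in a pipeline with StandardScaler because SGD-based models tend to be sensitive to feature scale.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: load_digits (8x8 digit images) stands in for Fashion-MNIST.
x, y = load_digits(return_X_y=True)
x_tr, x_te, y_tr, y_te = train_test_split(
    x, y, test_size=0.25, random_state=0, stratify=y
)

# KNN: search over the two settings the notebook fixes by hand (n_neighbors and p).
knn_search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': [1, 3, 5, 7], 'p': [1, 2]},
    cv=3,
)
knn_search.fit(x_tr, y_tr)
knn_acc = knn_search.score(x_te, y_te)
print('Best KNN parameters: ', knn_search.best_params_)
print('KNN test accuracy: ', knn_acc)

# SGD: scale the features, then search over the regularisation strength alpha.
# make_pipeline names each step after its lowercased class name, hence 'sgdclassifier__alpha'.
sgd_search = GridSearchCV(
    make_pipeline(StandardScaler(), SGDClassifier(max_iter=250, random_state=0)),
    param_grid={'sgdclassifier__alpha': [1e-5, 1e-4, 1e-3]},
    cv=3,
)
sgd_search.fit(x_tr, y_tr)
sgd_acc = sgd_search.score(x_te, y_te)
print('Best SGD parameters: ', sgd_search.best_params_)
print('SGD test accuracy: ', sgd_acc)
```

The test split is held out from the search, so the reported accuracies are not inflated by the parameter selection itself.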