**Note to grader:** Each question consists of parts, e.g. Q1(i), Q1(ii), etc. Each part must be graded on a 0-4 scale, following the standard NJIT convention (A:4, B+: 3.5, B:3, C+: 2.5, C: 2, D:1, F:0). 
The total score must be re-scaled to 100 -- that should apply to all future assignments so that Canvas assigns the same weight on all assignments. 



In [None]:
# Grader's area
import numpy as np
G = np.zeros([10,10])  # grade matrix: G[question, part], summed in the last cell
maxScore = 0


# **Assignment 4**

The goal of this assignment is to run some experiments with scikit-learn on a fairly sizeable and interesting image data set. This is the MNIST data set that consists of lots of images, each having 28x28 pixels. By today's standards, this may seem relatively tiny, but only a few years ago it was quite challenging computationally, and it motivated the development of several ML algorithms and models that are now state-of-the-art solutions for much bigger data sets. 

The assignment is experimental. We will try to see whether a combination of PCA and kNN can yield any good results for the MNIST data set. Let's see if it can be made to work on this data set. 

Note: There are less difficult Python parts in this assignment. You can get things done by just repeating things from the class notebooks. But your participation and interaction via Canvas is always appreciated!

## Preparation Steps

In [1]:
# Import all necessary python packages
import numpy as np
import warnings
warnings.filterwarnings("ignore")
#import os
#import pandas as pd
#import matplotlib.pyplot as plt
#from matplotlib.colors import ListedColormap
#from sklearn.linear_model import LogisticRegression

In [2]:
# we load the data set directly from scikit learn 
# 
# note: this operation may take a few seconds. If for any reason it fails we 
# can revert back to loading from local storage. 

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split


X, y = fetch_openml('mnist_784', version=1, return_X_y=True)
y = y.astype(int)
X = ((X / 255.) - .5) * 2
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=10000, random_state=123, stratify=y)


## <font color = 'blue'> Question 1. Inspecting the Dataset </font>

**(i)** How many data points are in the training and test sets? <br>
**(ii)** How many attributes does the data set have?

Explain how you found the answer to the first two questions. 

[**Hint**: Use the 'shape' attribute of numpy arrays. ]

**(iii)** How many different labels does this data set have? Can you demonstrate how to read that number from the vector of labels *y_train*?  <br>
**(iv)** How does the number of attributes relate to the size of the images? <br>
**(v)** What is the role of line 12 in the above code? 





*(Please insert cells below for your answers. Clearly id the part of the question you answer)*

Q 1.1 Solution 

In [3]:
print("Number of datapoints in training dataset = ", X_train.shape[0])

Number of datapoints in training dataset =  60000


In [4]:
print("Number of datapoints in test dataset = ", X_test.shape[0])

Number of datapoints in test dataset =  10000


Q 1.2 Solution

In [5]:
print("Number of attributes in X = ", X_train.shape[1])

Number of attributes in X =  784


We observe that X_train has 60,000 rows and X_test has 10,000 rows. 
Each row holds the 784 pixels of one 28x28 handwritten digit picture as its attributes. 
Hence, we have 60,000 image samples in the training dataset and 10,000 image samples in the test dataset.

Q 1.3 Solution

Let us check the number of unique elements in y_train.

In [6]:
print("Total number of unique labels in y_train = ", y_train.nunique())

Total number of unique labels in y_train =  10


In [7]:
print("\nThose Unique labels are as follows: ", y_train.unique())


Those Unique labels are as follows:  [4 8 7 0 5 2 6 9 3 1]


Q 1.4 Solution

We know that the size of one image sample in the MNIST database is 28x28 pixels, and 28x28 = 784.
The pixels of each image are reshaped into one long row (1x784) in our given dataset. 
X_train and X_test have 784 columns (attributes), which correspond to the 28x28 pixels in each image.
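
To make the correspondence concrete, the short sketch below flattens a 28x28 array into 784 values and reshapes it back. It uses a synthetic array rather than a real MNIST image, and it assumes NumPy's default row-major ordering; the names `image`, `row` and `restored` are illustrative only.

```python
import numpy as np

# Synthetic stand-in for one 28x28 grayscale image (not real MNIST data).
image = np.arange(28 * 28).reshape(28, 28)

# Flatten the grid into one row of 784 attributes, like a row of X_train.
row = image.reshape(-1)
print(row.shape)  # (784,)

# Reshape the row back into the 2-D pixel grid.
restored = row.reshape(28, 28)
print(np.array_equal(image, restored))  # True
```

The same `reshape(28, 28)` call could be applied to a single row of the data to view it as an image.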

Q 1.5 Solution

The images fetched by fetch_openml are grayscale, 
with pixel values ranging from 0 to 255. 
Here 0 is the darkest value (black) and 255 is the brightest (white). 
Many learning algorithms, gradient-based optimizers in particular, tend to behave more stably when the data is scaled or normalized. 
Hence, in order to scale the data to the range [-1, 1], the line below is used:
X = ((X / 255.) - .5) * 2
For example, if a pixel is 0,
(0/255 - 0.5) * 2 = -1
and if a pixel is 255,
(255/255 - 0.5) * 2 = 1
Hence all pixel values in the range [0, 255] are mapped to [-1, 1], which puts every attribute on the same centered scale before modeling.
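
As a quick check of the arithmetic above, the following sketch applies the same transformation to a few hand-picked pixel values. The array `pixels` is hypothetical and only meant to cover both ends of the grayscale range.

```python
import numpy as np

# Hypothetical pixel values covering the full grayscale range.
pixels = np.array([0, 64, 128, 255], dtype=float)

# Same transformation as in the data-loading cell.
scaled = ((pixels / 255.) - .5) * 2

print(scaled.min(), scaled.max())  # -1.0 1.0
```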

In [None]:
# For grader use only

# in this case, make an exception and assign 0-2 points for each subquestion

# insert grade here  
# G[1,1] = 
# G[1,2] =
# G[1,3] = 
# G[1,4] = 
# G[1,5] =  


maxScore = maxScore + 10


##  <font color = 'blue'> Question 2. PCA on MNIST </font>

Because the number of attributes of the MNIST data set may be too big to apply kNN on it (due to the 'curse of dimensionality'), we want to compress the images down to a smaller number of 'fake' attributes. 

Use scikit-learn to output a data set *X_train_transformed* and *X_test_transformed*, with $l$ attributes. Here a reasonable choice of $l$ is 10, equal to the number of labels. But you can try slightly smaller or bigger values as well. 


**Hint**: Take a look at [this notebook](https://colab.research.google.com/drive/1DG5PjWejo8F7AhozHxj8329SuMtXZ874?usp=drive_fs) we used in the lecture, and imitate what we did there. Be careful though, to use only the scikit-learn demonstration, not the exhaustive PCA steps we did before it.

**Note**: This computation can take a while. If problems are encountered we can try the same experiment on a downsized data set. 

 Q 2 Solution

Let us fit the PCA on X_train, and use it to transform both the X_train and X_test datasets.

In [8]:
from sklearn.decomposition import PCA

pca = PCA(n_components=10).fit(X_train)
X_train_transformed = pca.transform(X_train)
X_test_transformed = pca.transform(X_test)


In [9]:
X_train_transformed

array([[ 0.03437929,  5.93042898, -1.32557712, ...,  0.8019018 ,
         0.79840204,  0.24801572],
       [ 1.71107136,  0.61959259, -5.61616874, ..., -1.90999584,
         4.2392874 ,  0.73459712],
       [-3.7001538 ,  6.08597899, -2.9244742 , ...,  3.9178675 ,
        -3.01684793,  1.97779327],
       ...,
       [-1.38016637, -7.54467952,  0.50226112, ...,  0.45941678,
         0.99681634,  0.99236185],
       [ 1.20194872,  7.38339737, -1.36960992, ..., -0.36764195,
         4.36452479,  2.01714674],
       [-1.08511158, -4.05124585, -6.87142438, ...,  5.19034749,
        -1.13512276, -0.03783238]])

In [10]:
X_test_transformed

array([[-3.03519080e+00, -1.98450372e+00,  1.13102652e+00, ...,
        -8.75749497e-02,  3.00985524e+00, -1.38450237e+00],
       [-2.40049602e+00,  6.22497703e+00,  3.36925073e-03, ...,
        -9.45233276e-01, -1.76151956e+00,  1.41992121e+00],
       [ 3.00133262e+00,  1.35338249e+00, -1.65937150e-01, ...,
         2.03883549e+00, -2.90634752e-01, -7.43722886e-01],
       ...,
       [ 5.16858165e+00, -3.04103594e+00,  2.89070441e+00, ...,
         5.93610708e+00, -4.42720363e+00, -1.31170901e+00],
       [-5.94360770e+00, -1.19969025e+00, -1.18195932e-01, ...,
        -2.89865177e-01, -2.00405449e+00, -2.30484225e+00],
       [-4.71396056e+00, -8.16037076e-01, -1.46922238e+00, ...,
        -9.14891156e-01, -3.15507579e+00,  3.32986193e+00]])
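
One way to judge a choice of $l$ is to look at how much of the total variance the retained components capture. The sketch below illustrates this on scikit-learn's small built-in digits data set (8x8 images, loaded with `load_digits`), which is used here only as a fast stand-in for MNIST; the values it prints will differ from those for MNIST, and the names `X_small`, `pca_small` and `ratios` are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Small built-in data set: 1797 images of 8x8 pixels (64 attributes).
X_small, _ = load_digits(return_X_y=True)

pca_small = PCA(n_components=10).fit(X_small)

# Fraction of the total variance captured by each retained component.
ratios = pca_small.explained_variance_ratio_
print(ratios.shape)  # (10,)

# Total fraction of variance kept by the 10 components (between 0 and 1).
print(ratios.sum())
```

Running the same two lines on the `pca` object fitted above would show how much variance 10 components retain on MNIST itself.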

In [None]:
# for grader use
maxScore = maxScore +4 


# insert grade here (out of 4)
# G[2,1] =



## <font color = 'blue'> Question 3. kNN on MNIST attributes from PCA </font>


Having calculated the *transformed* MNIST data set we can now apply a kNN approach to the MNIST classification data set. Here are the steps:

(i) Fit a $k$-NN classifier on the transformed data set. Here $k$ is a hyperparameter, and you can experiment with it. Be aware though, that larger $k$ can take more time to fit. 

(ii) Apply the classifier on the transformed test set. What is the classification accuracy? 

(iii) A theoretical question: if we skipped all the above steps and we just assigned a **random** label to each test point, what would the classification accuracy be on average?  Does your result (ii) beat the random expectation? 

(iv) Experiment with different settings of $k$, and other hyperparameters that are described in the scikit-learn manual of the kNN classifier. Report your findings in a separate cell. Also for **participation points**: report your best result on Canvas! 

[**Hint**: Imitate the steps from the classroom notebook]


Q 3.1  Solution

In [11]:
from sklearn.neighbors import KNeighborsClassifier


In [12]:
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2,
                           metric='minkowski', metric_params=None, n_jobs=None)
knn = knn.fit(X_train_transformed, y_train)

Q 3.2 Solution

In [13]:
y_pred_train = knn.predict(X_train_transformed)
y_pred_test = knn.predict(X_test_transformed)

In [14]:
from sklearn.metrics import accuracy_score

In [15]:
print("Training Accuracy = ", accuracy_score(y_train, y_pred_train))
print("\nTesting Accuracy = ", accuracy_score(y_test, y_pred_test))

Training Accuracy =  0.9532166666666667

Testing Accuracy =  0.9347


Q 3.3 Solution 

If we replaced every unique label with another random label consistently in both the train and test datasets 
(for example, if every image of the handwritten digit 4 were relabeled as 91, 
and similarly for all other labels), 
then the accuracy would not change, because the kNN classifier predicts from how similar the data samples are 
and only checks whether their labels match. As long as the relabeling is consistent, the accuracy stays the same. 
However, if every test sample is assigned a completely random label, independent of the image, the accuracy drops drastically. 
With 10 equally likely labels, a random label matches the true one with probability 1/10, so the expected accuracy is about 10% 
(any single run deviates slightly from this, because the matches occur by chance). 
Our result in (ii), about 93.5% on the test set, is far above this random expectation. 
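
For the case where every test point gets a completely random label, a short simulation gives a feel for the accuracy to expect. The sketch below assumes 10 classes and uniformly drawn labels, and it uses synthetic "true" labels instead of `y_test`; with these assumptions each guess matches with probability 1/10.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true labels: 10 classes, 10,000 test points.
n_classes = 10
true_labels = rng.integers(0, n_classes, size=10_000)

# Assign a uniformly random label to every test point.
random_labels = rng.integers(0, n_classes, size=10_000)

# Fraction of points where the random label happens to be correct.
accuracy = np.mean(true_labels == random_labels)
print(accuracy)  # close to 1 / n_classes = 0.1
```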

Q 3.4 Solution

In [16]:
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2,
                           metric='minkowski', metric_params=None, n_jobs=None)
knn = knn.fit(X_train_transformed, y_train)

y_pred_train = knn.predict(X_train_transformed)
y_pred_test = knn.predict(X_test_transformed)

from sklearn.metrics import accuracy_score

print("Training Accuracy = ", accuracy_score(y_train, y_pred_train))
print("\nTesting Accuracy = ", accuracy_score(y_test, y_pred_test))

Training Accuracy =  0.9532166666666667

Testing Accuracy =  0.9347


In [17]:
knn = KNeighborsClassifier(n_neighbors=3, weights='uniform', algorithm='kd_tree', leaf_size=30, p=2,
                           metric='minkowski', metric_params=None, n_jobs=None)
knn = knn.fit(X_train_transformed, y_train)

y_pred_train = knn.predict(X_train_transformed)
y_pred_test = knn.predict(X_test_transformed)

from sklearn.metrics import accuracy_score

print("Training Accuracy = ", accuracy_score(y_train, y_pred_train))
print("\nTesting Accuracy = ", accuracy_score(y_test, y_pred_test))

Training Accuracy =  0.9616333333333333

Testing Accuracy =  0.9321


In [18]:
knn = KNeighborsClassifier(n_neighbors=10, weights='uniform', algorithm='auto', leaf_size=30, p=2,
                           metric='minkowski', metric_params=None, n_jobs=None)
knn = knn.fit(X_train_transformed, y_train)

y_pred_train = knn.predict(X_train_transformed)
y_pred_test = knn.predict(X_test_transformed)

from sklearn.metrics import accuracy_score

print("Training Accuracy = ", accuracy_score(y_train, y_pred_train))
print("\nTesting Accuracy = ", accuracy_score(y_test, y_pred_test))

Training Accuracy =  0.9445333333333333

Testing Accuracy =  0.9329


In [19]:
knn = KNeighborsClassifier(n_neighbors=7, weights='uniform', algorithm='auto', leaf_size=30, p=2,
                           metric='minkowski', metric_params=None, n_jobs=None)
knn = knn.fit(X_train_transformed, y_train)

y_pred_train = knn.predict(X_train_transformed)
y_pred_test = knn.predict(X_test_transformed)

from sklearn.metrics import accuracy_score

print("Training Accuracy = ", accuracy_score(y_train, y_pred_train))
print("\nTesting Accuracy = ", accuracy_score(y_test, y_pred_test))

Training Accuracy =  0.9490833333333333

Testing Accuracy =  0.9344


In [20]:
knn = KNeighborsClassifier(n_neighbors=25, weights='uniform', algorithm='auto', leaf_size=30, p=2,
                           metric='minkowski', metric_params=None, n_jobs=None)
knn = knn.fit(X_train_transformed, y_train)

y_pred_train = knn.predict(X_train_transformed)
y_pred_test = knn.predict(X_test_transformed)

from sklearn.metrics import accuracy_score

print("Training Accuracy = ", accuracy_score(y_train, y_pred_train))
print("\nTesting Accuracy = ", accuracy_score(y_test, y_pred_test))

Training Accuracy =  0.9323833333333333

Testing Accuracy =  0.9276
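
The cells above repeat the same fit, predict and score steps for each value of $k$. A loop can express the same experiment more compactly. The sketch below is one possible way to write it; it runs on scikit-learn's small built-in digits data set (`load_digits`) as a fast stand-in for MNIST, so its accuracies are not comparable to the ones reported above, and all variable names in it are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small built-in stand-in for MNIST (8x8 digit images).
X_small, y_small = load_digits(return_X_y=True)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_small, y_small, test_size=0.25, random_state=123, stratify=y_small)

# Fit PCA on the training part only, then transform both parts.
pca_small = PCA(n_components=10).fit(Xs_train)
Xs_train_t = pca_small.transform(Xs_train)
Xs_test_t = pca_small.transform(Xs_test)

# Try several values of k and record the test accuracy for each.
results = {}
for k in [3, 5, 7, 10, 25]:
    knn_small = KNeighborsClassifier(n_neighbors=k).fit(Xs_train_t, ys_train)
    results[k] = accuracy_score(ys_test, knn_small.predict(Xs_test_t))

for k, acc in results.items():
    print(f"k={k}: test accuracy = {acc:.4f}")
```

Note that picking $k$ by test accuracy, as done here and in the cells above, tends to give a slightly optimistic estimate; holding out a separate validation split would be a more careful design.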


## Summary:


My best accuracy so far after optimizing the model is ~93.5% on the PCA-transformed test set. 

Out of curiosity, I also tried training on the non-transformed data. It took longer to run, but I was able to get around 97.48% accuracy on the test set. 

I guess this makes sense, because we lose information during dimensionality reduction. 

In [21]:
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2,
                           metric='minkowski', metric_params=None, n_jobs=None)
knn = knn.fit(X_train, y_train)

y_pred_train = knn.predict(X_train)
y_pred_test = knn.predict(X_test)

from sklearn.metrics import accuracy_score

print("Training Accuracy = ", accuracy_score(y_train, y_pred_train))
print("\nTesting Accuracy = ", accuracy_score(y_test, y_pred_test))

Training Accuracy =  0.9809833333333333

Testing Accuracy =  0.9748


In [None]:
# for grader use
maxScore = maxScore +12

# insert grade here (each item out of 4)
# G[3,1] =
# G[3,2] = 
# G[3,3] =
# G[3,4] = 

In [None]:
# for grader use

# Total Grade Calculation
rawScore = np.sum(G)
score = rawScore*100/maxScore