The goal of this project is to run some experiments with scikit-learn on a fairly sizeable and interesting image data set: MNIST, which consists of 70,000 images of handwritten digits, each 28x28 pixels. By today's standards this may seem relatively small, but only a few years ago it was quite challenging computationally, and it motivated the development of several ML algorithms and models that are now state-of-the-art solutions for much bigger data sets.

The project is experimental: I want to find out whether a combination of PCA and kNN can yield good results on the MNIST data set.

## Preparation Steps

In [25]:
# Import all necessary python packages
import numpy as np
#import os
#import pandas as pd
#import matplotlib.pyplot as plt
#from matplotlib.colors import ListedColormap
#from sklearn.linear_model import LogisticRegression

In [26]:
# Load the data set from OpenML through scikit-learn
#
# note: this operation may take a few seconds. If it fails for any reason, I
# can fall back to loading from local storage.

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split


X, y = fetch_openml('mnist_784', version=1, return_X_y=True)
y = y.astype(int)
X = ((X / 255.) - .5) * 2
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=10000, random_state=123, stratify=y)
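
The `stratify=y` argument above asks `train_test_split` to keep the class proportions the same in both splits. A minimal sketch of that behaviour, using made-up toy arrays (`X_toy`, `y_toy` are illustrative names, not part of the MNIST code):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 points, 3 classes with unequal frequencies
rng = np.random.default_rng(0)
y_toy = np.repeat([0, 1, 2], [600, 300, 100])
X_toy = rng.normal(size=(len(y_toy), 4))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=200, random_state=123, stratify=y_toy)

# With stratification, each split should mirror the overall class proportions
train_props = np.bincount(y_tr) / len(y_tr)
test_props = np.bincount(y_te) / len(y_te)
print("train proportions:", train_props)
print("test proportions: ", test_props)
```

Without `stratify`, the proportions in a small test set can drift away from those of the full data set purely by chance.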


## <font color = 'blue'> Inspecting the Dataset </font>

**(i)** How many data points are in the training and test sets? <br>
**(ii)** How many attributes does the data set have?

**(iii)** How many different labels does this data set have? Demonstrate how to read that number from the vector of labels *y_train*.  <br>
**(iv)** How does the number of attributes relate to the size of the images? <br>
**(v)** What is the role of line 12 in the above code? 







In [27]:
#(i)
#X_train is a pandas DataFrame (that is what fetch_openml returned here).
#Its shape attribute (an attribute, not a method) is a tuple:
#(number of rows, number of columns), i.e. (data points, attributes).


print("x_train shape is ",X_train.shape)

x_train shape is  (60000, 784)


In [28]:
#(i)
print("y_train shape is ",y_train.shape)

y_train shape is  (60000,)


In [29]:
#There are 60000 data points in the training set.

In [30]:
# (i)
print("X_test shape is ",X_test.shape)

X_test shape is  (10000, 784)


In [31]:
#(i)
print("y_test shape is ",y_test.shape)

y_test shape is  (10000,)


In [32]:
#There are 10000 data points in the test set.

In [33]:
# (ii)
#The info() method prints a summary of a DataFrame: the number of columns, column labels,
#column data types, memory usage, the index range, and the non-null count of each column.
#y_train and y_test are Series, for which info() did not work in the pandas version used here;
#describe() is the alternative.
#Note: `describe` is passed below without parentheses, so print shows the bound method's repr
#(which embeds the Series itself) rather than summary statistics; use y_train.describe() for those.

print('Following is the description of y_train',y_train.describe)

print('Following is the description of y_test',y_test.describe)


Following is the description of y_train <bound method NDFrame.describe of 21107    4
48688    8
41009    7
28301    8
44449    0
        ..
46256    1
3895     7
60280    2
47094    4
6545     3
Name: class, Length: 60000, dtype: int32>
Following is the description of y_test <bound method NDFrame.describe of 17739    5
362      9
6726     6
7137     5
50611    3
        ..
10580    1
67063    1
60503    2
61660    7
30906    1
Name: class, Length: 10000, dtype: int32>


In [34]:
# (ii)
print(X_train.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 60000 entries, 21107 to 6545
Columns: 784 entries, pixel1 to pixel784
dtypes: float64(784)
memory usage: 359.3 MB
None


In [35]:
#(ii)
print(X_test.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 17739 to 30906
Columns: 784 entries, pixel1 to pixel784
dtypes: float64(784)
memory usage: 59.9 MB
None


In [36]:
#There are 784 attributes in the dataset

In [37]:
# (iii)
unique_label =y_train.unique()
print('Unique labels ', unique_label)
print('Number of Unique labels ',len(unique_label))

Unique labels  [4 8 7 0 5 2 6 9 3 1]
Number of Unique labels  10
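
Beyond the number of distinct labels, the same vector also shows how often each label occurs. A small sketch with NumPy, where `labels` is a made-up array standing in for *y_train*:

```python
import numpy as np

# Hypothetical label vector standing in for y_train
labels = np.array([4, 8, 7, 8, 0, 1, 7, 2, 4, 3, 4])

# np.unique returns the sorted distinct values and, optionally, their counts
classes, counts = np.unique(labels, return_counts=True)
print("Number of unique labels:", len(classes))
for cls, cnt in zip(classes, counts):
    print(f"label {cls}: {cnt} occurrence(s)")
```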


In [38]:
#(iv)
# Each image is 28x28 pixels, i.e. 28 * 28 = 784 pixels in total.
# The data set also has 784 attributes: each attribute holds the intensity of one pixel,
# so every image is stored as a flattened vector of its pixel values.
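
The correspondence between the 784 attributes and the 28x28 grid can be sketched with a reshape. The array `flat` below is a made-up stand-in for one row of the data, and the sketch assumes the pixels are stored row by row (row-major order):

```python
import numpy as np

# Hypothetical flattened image: 784 values, one per pixel
flat = np.arange(784)

# Reshape the vector back into a 28x28 grid
image = flat.reshape(28, 28)

# Under row-major order, the pixel at (row r, column c) is entry r * 28 + c
# of the flat vector (0-based indices)
r, c = 3, 5
print(image.shape)
print(image[r, c] == flat[r * 28 + c])
```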

In [39]:
#(v)
#Line 12 is `X = ((X / 255.) - .5) * 2`; it rescales the pixel values.
#MNIST images are grayscale and each pixel is an 8-bit intensity, i.e. an integer from 0 to 255.
#Dividing by 255 maps the range 0-255 onto 0.0-1.0, where 0.0 means 0 (0x00) and 1.0 means 255 (0xFF).
#Subtracting the midpoint 0.5 centers the range at zero (-0.5 to 0.5), and multiplying by 2
#stretches it to -1.0 to 1.0.
#So line 12 is a min-max rescaling of every pixel to [-1, 1]; it is not a standardization
#to zero mean and unit variance.
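
A quick check of what the rescaling formula does to the smallest, middle, and largest possible pixel values. The array `raw` is made up for illustration:

```python
import numpy as np

# Hypothetical raw pixel intensities: minimum, midpoint, maximum
raw = np.array([0.0, 127.5, 255.0])

# Same formula as in the loading cell
scaled = ((raw / 255.) - .5) * 2
print(scaled)
```

The minimum maps to -1, the midpoint to 0, and the maximum to 1.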

##  <font color = 'blue'> PCA on MNIST </font>

Because the number of attributes of the MNIST data set may be too large to apply kNN to it directly (due to the 'curse of dimensionality'), I want to compress the images down to a smaller number of derived attributes (principal components). 

Use scikit-learn to produce the data sets *X_train_transformed* and *X_test_transformed*, each with $l$ attributes. A reasonable choice of $l$ is 10, equal to the number of labels, but slightly smaller or larger values are worth trying as well. 

This computation can take a while. If problems are encountered, I can try the same experiment on a downsized data set. 
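
As a quick dry run of the same idea on a downsized data set, the sketch below uses scikit-learn's bundled 8x8 digits data (64 attributes) as a stand-in for MNIST. All variable names with the `_small` or `Xs_`/`ys_` prefix are illustrative and not part of the notebook:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Stand-in data: bundled 8x8 digit images, not MNIST
X_small, y_small = load_digits(return_X_y=True)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_small, y_small, test_size=0.25, random_state=123, stratify=y_small)

# Fit PCA on the training set only, then project both sets onto the same components
pca_small = PCA(n_components=10).fit(Xs_train)
Xs_train_transformed = pca_small.transform(Xs_train)
Xs_test_transformed = pca_small.transform(Xs_test)

# Fraction of the training variance retained by the 10 components
retained = float(np.sum(pca_small.explained_variance_ratio_))
print(Xs_train_transformed.shape, Xs_test_transformed.shape)
print("retained variance ratio:", round(retained, 3))
```

Fitting on the training set alone and reusing the fitted components for the test set keeps test information out of the transformation.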

In [41]:


from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

#sc = StandardScaler()
#X_train_std = sc.fit_transform(X_train)
#X_test_std = sc.transform(X_test)


from sklearn.decomposition import PCA

pca = PCA(n_components=10).fit(X_train)
X_train_transformed = pca.transform(X_train)
X_test_transformed = pca.transform(X_test)
print(X_train_transformed)

#plt.bar(range(1, 11), pca.explained_variance_ratio_, alpha=0.5, align='center')
#plt.step(range(1, 11), np.cumsum(pca.explained_variance_ratio_), where='mid')
#plt.ylabel('Explained variance ratio')
#plt.xlabel('Principal components')


[[ 0.03437929  5.93042918 -1.32557813 ...  0.80195449  0.7978461
   0.24621453]
 [ 1.71107107  0.61959372 -5.61617337 ... -1.90770905  4.24306747
   0.74074022]
 [-3.70015368  6.08597826 -2.92447206 ...  3.91616948 -3.0181605
   1.97646634]
 ...
 [-1.38016625 -7.54468152  0.50227195 ...  0.45987978  0.99481983
   0.98419285]
 [ 1.20194888  7.38339717 -1.36960361 ... -0.36814754  4.36011049
   2.00813116]
 [-1.08511144 -4.05124468 -6.87142856 ...  5.18837101 -1.1371051
  -0.03534276]]



In [53]:
print(X_test_transformed)

[[-3.03519093e+00 -1.98450351e+00  1.13101908e+00 ... -8.78126774e-02
   3.01389637e+00 -1.37343158e+00]
 [-2.40049598e+00  6.22497719e+00  3.36945207e-03 ... -9.45531069e-01
  -1.76392609e+00  1.41181420e+00]
 [ 3.00133273e+00  1.35338220e+00 -1.65933361e-01 ...  2.03894281e+00
  -2.93231334e-01 -7.51156528e-01]
 ...
 [ 5.16858182e+00 -3.04103646e+00  2.89070906e+00 ...  5.93459701e+00
  -4.43076191e+00 -1.31899591e+00]
 [-5.94360770e+00 -1.19968920e+00 -1.18198943e-01 ... -2.90152657e-01
  -2.00506747e+00 -2.30567508e+00]
 [-4.71396057e+00 -8.16036413e-01 -1.46922398e+00 ... -9.15080000e-01
  -3.15628499e+00  3.32769098e+00]]


## <font color = 'blue'> kNN on MNIST attributes from PCA </font>


Having calculated the *transformed* MNIST data set, I can now apply a kNN classifier to it. Here are the steps:

(i) Fit a $k$-NN classifier on the transformed data set. Here $k$ is a hyperparameter, and I can experiment with it. Be aware though, that larger $k$ can make prediction slower, since more neighbors must be found for every query point. 

(ii) Apply the classifier on the transformed test set. What is the classification accuracy? 

(iii) A theoretical question: if I skipped all the above steps and just assigned a **random** label to each test point, what would the classification accuracy be on average? Does the result of (ii) beat the random expectation? 

(iv) Experiment with different settings of $k$, and other hyperparameters that are described in the scikit-learn manual of the kNN classifier. Report the findings in a separate cell. Also for **participation points**: report the best result on Canvas! 
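
Steps (i) to (iii) can be sketched end to end on a small stand-in data set before running them on MNIST. The sketch below chains PCA and kNN with `make_pipeline` on scikit-learn's bundled 8x8 digits data; the data set, the variable names, and the choice of `n_neighbors=5` are assumptions made for illustration only:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Stand-in data: bundled 8x8 digit images, not MNIST
X_small, y_small = load_digits(return_X_y=True)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_small, y_small, test_size=0.25, random_state=123, stratify=y_small)

# (i) PCA down to 10 components, followed by a kNN classifier
model = make_pipeline(
    PCA(n_components=10),
    KNeighborsClassifier(n_neighbors=5, weights='distance'))
model.fit(Xs_train, ys_train)

# (ii) accuracy on the held-out test set
acc = model.score(Xs_test, ys_test)

# (iii) expected accuracy of uniformly random guessing: 1 / number of classes
chance = 1 / len(set(ys_train))
print("test accuracy:", acc)
print("chance level: ", chance)
```

Wrapping both steps in one pipeline means the test data passes through the PCA fitted on the training data automatically.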




In [58]:
#(i)

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, 
                           #p=2, 
                           weights='distance',
                           algorithm='kd_tree'
                           #metric='minkowski'
                          )
knn.fit(X_train_transformed, y_train)


KNeighborsClassifier(algorithm='kd_tree', weights='distance')

In [59]:
#(ii)

predicted = knn.predict(X_test_transformed)
print(knn.score(X_test_transformed,y_test))

0.9354


In [46]:
#(iii)
# Draw a uniformly random label for every test point
from sklearn.metrics import accuracy_score
random_labels = np.random.choice(y.unique(), size=len(y_test))
# Accuracy of the random labels (y_true first, then y_pred)
print("Accuracy with Random labels:", accuracy_score(y_test, random_labels))




Accuracy with Random labels: 0.0974


In [None]:
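#(iii)-contd.
#In theory, a uniformly random guess among 10 labels is correct with probability 1/10,
#so the expected accuracy is 10%. The simulation agrees: random labels give only 9.74% accuracy,
#whereas PCA followed by kNN gives 93.54%, far above the random baseline.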
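
The 1/10 expectation does not depend on how the true labels are distributed: whatever the true label is, a uniform guess hits it with probability 1/10. A small simulation sketch, where the unbalanced distribution `p_true` is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_classes = 1_000_000, 10

# Hypothetical, deliberately unbalanced distribution of true labels
p_true = np.array([0.3, 0.2, 0.1, 0.1, 0.1, 0.05, 0.05, 0.05, 0.03, 0.02])
y_true = rng.choice(n_classes, size=n, p=p_true)

# Uniformly random guesses, independent of the true labels
y_guess = rng.integers(0, n_classes, size=n)

# Fraction of guesses that happen to match the true label
acc = np.mean(y_true == y_guess)
print("simulated accuracy:", acc)
print("theoretical value: ", 1 / n_classes)
```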

In [47]:
#(iv)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# (n_neighbors, weights, algorithm) settings, in the order they are reported
configs = [
    (10, 'distance', 'kd_tree'),
    (10, 'distance', 'ball_tree'),
    (10, 'distance', 'brute'),
    (10, 'uniform', 'kd_tree'),
    (10, 'uniform', 'ball_tree'),
    (10, 'uniform', 'brute'),
    (5, 'uniform', 'kd_tree'),
    (15, 'uniform', 'ball_tree'),
    (20, 'uniform', 'brute'),
    (5, 'distance', 'kd_tree'),
    (15, 'distance', 'ball_tree'),
    (20, 'distance', 'brute'),
]

for k, weights, algorithm in configs:
    # Fit on the PCA-transformed training set, then predict the test set
    knn = KNeighborsClassifier(n_neighbors=k,
                               weights=weights,
                               algorithm=algorithm)
    knn.fit(X_train_transformed, y_train)
    predicted = knn.predict(X_test_transformed)
    print(f'Accuracy for n_neighbors={k}, weights={weights}, algorithm={algorithm}',
          accuracy_score(y_test, predicted))




Accuracy for n_neighbors=10, weights=distance, algorithm=kd_tree 0.9346
Accuracy for n_neighbors=10, weights=distance, algorithm=ball_tree 0.9346
Accuracy for n_neighbors=10, weights=distance, algorithm=brute 0.9346
Accuracy for n_neighbors=10, weights=uniform, algorithm=kd_tree 0.9328
Accuracy for n_neighbors=10, weights=uniform, algorithm=ball_tree 0.9328
Accuracy for n_neighbors=10, weights=uniform, algorithm=brute 0.9328
Accuracy for n_neighbors=5, weights=uniform, algorithm=kd_tree 0.9347
Accuracy for n_neighbors=15, weights=uniform, algorithm=ball_tree 0.9313
Accuracy for n_neighbors=20, weights=uniform, algorithm=brute 0.9287
Accuracy for n_neighbors=5, weights=distance, algorithm=kd_tree 0.9354
Accuracy for n_neighbors=15, weights=distance, algorithm=ball_tree 0.9327
Accuracy for n_neighbors=20, weights=distance, algorithm=brute 0.9313
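
The settings above were compared by their accuracy on the test set. One possible alternative, sketched here under assumptions, is to pick the hyperparameters by cross-validation on the training set with `GridSearchCV` and to score the test set only once at the end. The sketch runs on scikit-learn's bundled 8x8 digits data as a stand-in for MNIST; the grid values mirror the ones tried above, and the 3-fold setting is an arbitrary choice:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Stand-in data: bundled 8x8 digit images, not MNIST
X_small, y_small = load_digits(return_X_y=True)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_small, y_small, test_size=0.25, random_state=123, stratify=y_small)

pipe = Pipeline([
    ('pca', PCA(n_components=10)),
    ('knn', KNeighborsClassifier()),
])

# Step-name prefixes route each parameter to the right pipeline step
param_grid = {
    'knn__n_neighbors': [5, 10, 15, 20],
    'knn__weights': ['uniform', 'distance'],
}

# 3-fold cross-validation on the training set only
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(Xs_train, ys_train)

test_acc = search.score(Xs_test, ys_test)
print("best parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
print("test accuracy:", test_acc)
```

The `algorithm` option is left out of the grid because, in the results above, it changed none of the accuracies at a fixed `n_neighbors` and `weights`.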
