**Note to grader:** Each question is assigned a score. The final score will be (sum of actual scores)/(sum of maximum scores)*100. The grading rubrics are shown in the TA guidelines.

# **Assignment 4**

The goal of this assignment is to run some experiments with scikit-learn on a fairly sizeable and interesting image data set. This is the MNIST data set, which consists of a large collection of images, each having 28x28 pixels. By today's standards this may seem relatively tiny, but only a few years ago it was quite challenging computationally, and it motivated the development of several ML algorithms and models that are now state-of-the-art solutions for much bigger data sets.

The assignment is experimental. We will try to see whether a combination of PCA and kNN can be made to yield good results on the MNIST data set.

Note: There are fewer difficult Python parts in this assignment. You can get things done by just repeating things from the class notebooks. But your participation and interaction via Canvas is always appreciated!

## Preparation Steps

In [94]:
# Import all necessary python packages
import numpy as np
#import os
#import pandas as pd
#import matplotlib.pyplot as plt
#from matplotlib.colors import ListedColormap
#from sklearn.linear_model import LogisticRegression

In [95]:
# we load the data set directly from scikit-learn
# 
# note: this operation may take a few seconds. If for any reason it fails we 
# can revert back to loading from local storage. 

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split


X, y = fetch_openml('mnist_784', version=1, return_X_y=True)
y = y.astype(int)
X = ((X / 255.) - .5) * 2
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=10000, random_state=123, stratify=y)


## <font color = 'blue'> Question 1. Inspecting the Dataset (50 pts = 10 pts by 5 questions)</font>

**(i)** How many data points are in the training and test sets ? <br>
**(ii)** How many attributes does the data set have ?

Explain how you found the answer to the first two questions.

[**Hint**: Use the 'shape' method associated with numpy arrays. ]

**(iii)** How many different labels does this data set have? Can you demonstrate how to read that number from the vector of labels *y_train*?  <br>
**(iv)** How does the number of attributes relate to the size of the images? <br>
**(v)** What is the role of line 12 (X = ((X / 255.) - .5) * 2) in the above code?





*(Please insert cells below for your answers. Clearly identify the part of the question you answer.)*

In [96]:

print(np.shape(X))
print(np.shape(X_train))
print(np.shape(y_train))
print(np.shape(X_test))
print(np.shape(y_test))

(70000, 784)
(60000, 784)
(60000,)
(10000, 784)
(10000,)


_____________________________________________________
i: 
60000 data points in training set
10000 data points in test set

ii:
784 attributes, plus one more value (the label) for the output.
Found by applying the numpy shape function to X_train, X_test, y_train and y_test.
_____________________________________________________

_____________________________________________________
iii:
The dataset has 10 different labels, one for each digit.
Each label is the digit the image represents (for example, label 0 means the image shows the digit zero).
_____________________________________________________
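
The number of distinct labels can also be read directly from a label vector. The following is a minimal sketch, assuming a short hypothetical stand-in array `y_demo` in place of the real *y_train*; `np.unique` returns the sorted distinct values, so its length is the label count.

```python
import numpy as np

# Hypothetical stand-in for y_train: a short vector of digit labels.
y_demo = np.array([3, 7, 0, 3, 9, 1, 7, 0, 5, 2, 8, 4, 6, 6])

# np.unique returns the sorted distinct values (and, optionally, their counts).
labels, counts = np.unique(y_demo, return_counts=True)
n_labels = len(labels)

print(f"Number of distinct labels: {n_labels}")
print(f"Labels: {labels}")
```

Applying the same call to *y_train* should work the same way, whether it is a numpy array or a pandas Series.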

_____________________________________________________
iv:
The images are 28 by 28 pixels. 28 times 28 is 784, so each feature represents a single pixel.
_____________________________________________________
_____________________________________________________
v: Each individual pixel is represented by a number from 0 to 255. That line rescales the values to the range -1 to 1 instead, which makes it more convenient to apply machine learning algorithms.
_____________________________________________________
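
The rescaling in part (v) can be checked on a few values. This is a small sketch, assuming a hypothetical array `pixels` of sample intensities rather than the real image data: dividing by 255 maps to [0, 1], subtracting 0.5 centres the range, and multiplying by 2 stretches it to [-1, 1].

```python
import numpy as np

# Hypothetical sample intensities covering the 0..255 pixel range.
pixels = np.array([0.0, 64.0, 127.5, 255.0])

# Same formula as in the loading cell.
scaled = ((pixels / 255.) - .5) * 2

print(scaled)
print(scaled.min(), scaled.max())
```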

In [97]:
# For grader use only
maxScore = 0
maxScore = maxScore + 50


##  <font color = 'blue'> Question 2. PCA on MNIST (10 pts) </font>

Because the number of attributes of the MNIST data set may be too big to apply kNN on it (due to the 'curse of dimensionality'), we want to compress the images down to a smaller number of 'fake' attributes. 

Use scikit-learn to output two data sets, *X_train_transformed* and *X_test_transformed*, each with $l$ attributes. Here a reasonable choice of $l$ is 10, equal to the number of labels. But you can try slightly smaller or bigger values as well.

Print out the shape of *X_train_transformed* and *X_test_transformed*.


**Hint**: Take a look at [this notebook](https://colab.research.google.com/drive/1DG5PjWejo8F7AhozHxj8329SuMtXZ874?usp=drive_fs), and imitate what we did there. Be careful though, to use only the scikit-learn demonstration, not the exhaustive PCA steps. 

**Note**: This computation can take a while. If problems are encountered we can try the same experiment on a downsized data set. 
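
For orientation, the scikit-learn route mentioned in the hint might look roughly like the sketch below. It is not taken from the linked notebook: it assumes small random stand-in arrays (`X_train_demo`, `X_test_demo`) with 784 columns instead of MNIST, and it standardizes before PCA, which is one common choice rather than a requirement.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in data with the same number of attributes as MNIST (assumption).
rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(200, 784))
X_test_demo = rng.normal(size=(50, 784))

# Fit the scaler on the training data only, then reuse it on the test data.
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train_demo)
X_test_std = sc.transform(X_test_demo)

# Keep l = 10 principal components; fit on train, apply to both sets.
l = 10
pca = PCA(n_components=l)
X_train_demo_transformed = pca.fit_transform(X_train_std)
X_test_demo_transformed = pca.transform(X_test_std)

print(X_train_demo_transformed.shape)
print(X_test_demo_transformed.shape)
```

Fitting both the scaler and the PCA on the training set alone keeps information from the test set out of the transformation.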

In [98]:
from sklearn.preprocessing import StandardScaler

# Mean of 0 std of 1
sc = StandardScaler(); 
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

# Eigendecomposition of covariance matrix
# Get direction and magnitude of principal 
#           components of data.
cov_mat = np.cov(X_train_std.T)
# eigen_vecs are the directions of the principal components
# eigen_vals are the variances along those directions.
eigen_vals, eigen_vecs = np.linalg.eig(cov_mat)

# eigenvalue, eigenvector tuples
eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:, i])
               for i in range(len(eigen_vals))]

# sort the tuples from high to low
# The first pair is the principal component
#         that captures the most variance in the data.
eigen_pairs.sort(key=lambda k: k[0], reverse=True)

# Choose the two top eigen pairs
"""
w = np.hstack((eigen_pairs[0][1][:, np.newaxis],
               eigen_pairs[1][1][:, np.newaxis]))
"""

# Top ten eigen pairs
w = np.hstack([eigen_pairs[i][1][:, np.newaxis] for i in range(10)])

X_train_transformed = np.dot(X_train_std,w)
X_test_transformed = np.dot(X_test_std,w)

print(f"Shape of transformed training dataset: {np.shape(X_train_transformed)}")
print(f"Shape of transformed test dataset: {np.shape(X_test_transformed)}")


Shape of transformed training dataset: (60000, 10)
Shape of transformed test dataset: (10000, 10)
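
As a sanity check on the manual eigendecomposition approach, the variance of each projected column should equal the corresponding eigenvalue. The sketch below illustrates this on a small random stand-in matrix (`X_demo`, an assumption, not MNIST). It uses `np.linalg.eigh`, a swapped-in alternative to `np.linalg.eig` that is intended for symmetric matrices such as a covariance matrix and always returns real eigenvalues.

```python
import numpy as np

# Small correlated stand-in data set (assumption): 500 samples, 6 attributes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))

# Standardize each column to mean 0 and standard deviation 1.
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Covariance matrix is symmetric, so eigh applies.
cov_mat = np.cov(X_std.T)
eigen_vals, eigen_vecs = np.linalg.eigh(cov_mat)

# Order components from largest to smallest eigenvalue and keep the top two.
order = np.argsort(eigen_vals)[::-1]
w = eigen_vecs[:, order[:2]]
Z = X_std @ w

# Sample variance of each projected column vs. the matching eigenvalue.
print(Z.var(axis=0, ddof=1))
print(eigen_vals[order[:2]])
```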


In [99]:
# for grader use
maxScore = maxScore + 10 



## <font color = 'blue'> Question 3. kNN on MNIST attributes from PCA (40 pts = 10 pts by 4 questions) </font>


Having calculated the *transformed* MNIST data set, we can now apply a kNN approach to the MNIST classification problem. Here are the steps:

(i) Fit a $k$-NN classifier on the transformed data set. Here $k$ is a hyperparameter, and you can experiment with it. Be aware, though, that larger $k$ can make the classifier slower to run.

(ii) Apply the classifier on the transformed test set. What is the classification accuracy? 

(iii) A theoretical question: if we skipped all the above steps and we just assigned a **random** label to each test point, what would the classification accuracy be on average?  Does your result (ii) beat the random expectation? (conduct 1000 trials to get the average accuracy)

(iv) Experiment with different settings of $k$. Experiment design: calculate the accuracy for increasing values of $k$; stop when the accuracy has decreased for 5 values of $k$; report your findings in a separate cell.

[**Hint**: Take a look at this [notebook](https://colab.research.google.com/drive/1Mh6I3bR8pE90kcs28JfKok59NtfV_7ct?usp=drive_fs)]
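
The 1000-trial average asked for in step (iii) could be estimated along the following lines. This is a minimal sketch, assuming a hypothetical label vector `y_true_demo` of 10000 random digits in place of the real *y_test*; the expected accuracy of uniform random guessing over 10 labels is 1/10 regardless of how the true labels are distributed.

```python
import numpy as np

rng = np.random.default_rng(123)
n_test, n_labels, n_trials = 10_000, 10, 1_000

# Hypothetical stand-in for y_test (assumption): random digits 0..9.
y_true_demo = rng.integers(0, n_labels, size=n_test)

# Each trial assigns a uniformly random label to every test point.
accuracies = np.empty(n_trials)
for t in range(n_trials):
    y_guess = rng.integers(0, n_labels, size=n_test)
    accuracies[t] = np.mean(y_guess == y_true_demo)

mean_accuracy = accuracies.mean()
print(f"Average random accuracy over {n_trials} trials: {mean_accuracy:.4f}")
```

Averaging over many trials smooths out the trial-to-trial noise that a single random assignment shows.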


In [100]:
# i.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn =  KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_transformed,y_train)

# ii.
# Use k nearest neighbors to predict the reduced dimensionality dataset.
y_pred = knn.predict(X_test_transformed)
accuracy = accuracy_score(y_test,y_pred)
print(f"Classification Accuracy: {accuracy}")
# The classification accuracy is 0.913.

# iii.
# set random label to each test point
y_random  = np.random.randint(0, 10, size=10000)
random_accuracy = accuracy_score(y_test,y_random)
print(f"Random Classification Accuracy: {random_accuracy}")

"""
Since there are 10 possible labels, if we just randomly assigned labels the accuracy should be around 0.1, since there is a 1/10 chance to randomly guess one of the 10 labels.
The classification accuracy (0.913) beats the random classification accuracy (about 0.105) by approximately 80 percentage points.
"""

# iv.
#  Experiment design: calculate accuracy for increasing values of k; stop when accuracy decreases for 5 values of k
for k in range(1,302,50):
    knn =  KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_transformed,y_train)

    y_pred = knn.predict(X_test_transformed)
    accuracy = accuracy_score(y_test,y_pred)
    print(f"Classification Accuracy for k value {k}: {accuracy}")

Classification Accuracy: 0.913
Random Classification Accuracy: 0.1054
Classification Accuracy for k value 1: 0.8956
Classification Accuracy for k value 51: 0.9033
Classification Accuracy for k value 101: 0.8913
Classification Accuracy for k value 151: 0.8849
Classification Accuracy for k value 201: 0.8801
Classification Accuracy for k value 251: 0.876
Classification Accuracy for k value 301: 0.8736


The classification accuracy increases from k = 1 to k = 51, then decreases at every subsequent 50-unit step up to k = 301. This is likely because the model becomes more and more general as k grows and misses small fluctuations in the data (underfitting is occurring).
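
The stopping rule in the experiment design could be sketched as below. This is one possible reading, not the graded solution: it assumes "decreases for 5 values of k" means five consecutive drops in accuracy, it steps through odd k only, and it uses scikit-learn's small bundled 8x8 digits data set as a stand-in for MNIST so that it runs quickly. The names `patience` and `max_k` are hypothetical.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's bundled 8x8 digits, not MNIST.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=123, stratify=y)

# Standardize and reduce to 10 principal components, fitting on train only.
sc = StandardScaler()
pca = PCA(n_components=10)
Z_tr = pca.fit_transform(sc.fit_transform(X_tr))
Z_te = pca.transform(sc.transform(X_te))

patience = 5   # assumed reading: stop after 5 consecutive accuracy drops
max_k = 199    # hard upper bound so the loop always terminates
history = []   # (k, accuracy) pairs in the order they were tried
drops = 0
prev_acc = None

for k in range(1, max_k + 1, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(Z_tr, y_tr)
    acc = knn.score(Z_te, y_te)
    history.append((k, acc))
    # Count consecutive decreases; any non-decrease resets the counter.
    if prev_acc is not None and acc < prev_acc:
        drops += 1
    else:
        drops = 0
    prev_acc = acc
    if drops >= patience:
        break

best_k, best_acc = max(history, key=lambda pair: pair[1])
print(f"Tried {len(history)} values of k; best k = {best_k}, accuracy = {best_acc:.4f}")
```

A finer step than 50 makes it easier to see where the accuracy peaks, at the cost of more fits.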

In [101]:
# for grader use
maxScore = maxScore + 40



In [102]:
# for grader use

score = actualScore*100/maxScore

NameError: name 'actualScore' is not defined