## MNIST Deep Dive Project

### By Daniyal Mufti

In this project we take a deep dive into the famous MNIST dataset, applying various classification models and augmenting the training data to see if we can achieve state-of-the-art (SOTA) performance on it.

### 1. Setup

In [17]:
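# Import the relevant libraries
from sklearn.datasets import fetch_openml
import numpy as np
import matplotlib as mpl          # needed for mpl.cm.binary in plot_digit
import matplotlib.pyplot as plt   # needed for plt.imshow in plot_digit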

In [18]:
#Let's fetch the data
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist["data"], mnist["target"]

In [19]:
print(X.shape, y.shape)

(70000, 784) (70000,)


In [20]:
#A function to plot a digit
def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap = mpl.cm.binary,
               interpolation="nearest")
    plt.axis("off")

In [21]:
#Let's split the data into Train and Test sets

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

##### Let's create an augmented dataset by adding shifted copies of each training image. We will test whether that helps lower our classification error.

##### Note that we only need to augment the training set; the test set is kept unchanged so we can see whether the augmentation lowers the classification error.
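
To make the shifting concrete, here is a minimal sketch on a 3x3 toy array instead of a 28x28 digit. It calls `scipy.ndimage.shift` with offsets in `[dy, dx]` order and a `cval` fill value, as the notebook's helper does. The array `toy` and the chosen offsets are assumptions made purely for illustration.

```python
import numpy as np
from scipy.ndimage import shift

# Toy 3x3 "image" (an assumption for illustration; MNIST digits are 28x28).
toy = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0],
                [7.0, 8.0, 9.0]])

# shift takes offsets in (row, column) order, i.e. [dy, dx].
# dx=1 moves the content one pixel to the right; vacated pixels are filled with cval.
shifted_right = shift(toy, [0, 1], cval=0)

# dy=1 moves the content one pixel down.
shifted_down = shift(toy, [1, 0], cval=0)

# Round for display, since spline interpolation can leave tiny floating-point residue.
print(np.round(shifted_right, 6))
print(np.round(shifted_down, 6))
```

The pixels pushed past the edge are dropped, which is harmless for MNIST because the digits sit inside a blank border.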

In [22]:
# scipy.ndimage.interpolation is a deprecated namespace; import shift from scipy.ndimage
from scipy.ndimage import shift

# A helper function to shift a single MNIST image by (dx, dy) pixels
def shift_digit(digit_array, dx, dy, new=0):
    return shift(digit_array.reshape(28, 28), [dy, dx], cval=new).reshape(784)

# Let's create our augmented training dataset

X_train_expanded = [X_train]
y_train_expanded = [y_train]
for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
    shifted_images = np.apply_along_axis(shift_digit, axis=1, arr=X_train, dx=dx, dy=dy)
    X_train_expanded.append(shifted_images)
    y_train_expanded.append(y_train)

X_train_expanded = np.concatenate(X_train_expanded)
y_train_expanded = np.concatenate(y_train_expanded)
print(X_train_expanded.shape, y_train_expanded.shape)

(300000, 784) (300000,)


### 2. Test Predictive Models

##### So now we have two training datasets. Let's try using some predictive models and see how they perform on both those datasets.

##### a. KNN Classifier

Let's start with the `weights` hyperparameter set to `'distance'` and the number of neighbours set to 4. Later on we can run a grid search to find the best hyperparameters.
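
As a rough illustration of what the `weights` setting changes, the sketch below compares `'uniform'` and `'distance'` voting on a hypothetical three-point, one-dimensional dataset. The data, labels and query point are invented for this example; only `KNeighborsClassifier` and its `weights` and `n_neighbors` parameters come from the notebook.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D toy data: one "A" point close to the query, two "B" points farther away.
X_toy = np.array([[0.0], [1.0], [1.1]])
y_toy = np.array(["A", "B", "B"])
query = np.array([[0.1]])

# Uniform weights: each of the 3 neighbours gets one vote, so "B" wins 2 to 1.
uniform_clf = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_toy, y_toy)
uniform_pred = uniform_clf.predict(query)[0]

# Distance weights: each vote is weighted by 1/distance, so the nearby "A" point dominates.
distance_clf = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_toy, y_toy)
distance_pred = distance_clf.predict(query)[0]

print(uniform_pred, distance_pred)
```

With distance weighting, a single very close neighbour can outvote several distant ones, which is one plausible reason it suits images that have near-duplicates in the training set.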

In [23]:
#import KNN Classifier model
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [24]:
#Regular Dataset
knn_clf = KNeighborsClassifier(weights='distance', n_neighbors=4)
knn_clf.fit(X_train, y_train)
y_knn_pred = knn_clf.predict(X_test)
accuracy_score(y_test, y_knn_pred)

0.9714

In [25]:
#Augmented Dataset
knn_clf_aug = KNeighborsClassifier(weights='distance', n_neighbors=4)
knn_clf_aug.fit(X_train_expanded, y_train_expanded)
y_knn_aug_pred = knn_clf_aug.predict(X_test)
accuracy_score(y_test, y_knn_aug_pred)

0.9763

##### Great! By the looks of it, augmenting the data gives us a higher accuracy score. Let's try scaling the feature set and see if that helps.

In [26]:
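from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float32))
X_test_scaled = scaler.transform(X_test.astype(np.float32))
# Note: fit_transform refits the scaler on the augmented data here, while X_test_scaled
# above was scaled with the X_train statistics, so the two are not scaled identically.
X_train_expanded_scaled = scaler.fit_transform(X_train_expanded.astype(np.float32))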

In [27]:
#Regular Dataset Scaled
knn_clf_scaled = KNeighborsClassifier(weights='distance', n_neighbors=4)
knn_clf_scaled.fit(X_train_scaled, y_train)
y_knn_scaled_pred = knn_clf_scaled.predict(X_test_scaled)
accuracy_score(y_test, y_knn_scaled_pred)

0.9489

In [28]:
#Augmented Dataset Scaled
knn_clf_aug_scaled = KNeighborsClassifier(weights='distance', n_neighbors=4)
knn_clf_aug_scaled.fit(X_train_expanded_scaled, y_train_expanded)
y_knn_aug_scaled_pred = knn_clf_aug_scaled.predict(X_test_scaled)
accuracy_score(y_test, y_knn_aug_scaled_pred)

0.9584

##### Unfortunately, scaling the data did not help our KNN classifier achieve greater accuracy (though augmenting still helped on the scaled datasets).
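
When comparing scaled runs, it is worth checking that each model's training and test inputs are standardized with the same fitted statistics. The sketch below shows that fit-once pattern on small synthetic arrays. The names `train`, `test` and `scaler_demo` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for a training and a test set (illustrative only, not MNIST).
train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
test = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

# Fit once on the training data, then reuse the same mean and std for the test data.
scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train)
test_scaled = scaler_demo.transform(test)

# The training set is standardized exactly; the test set only approximately,
# because it is scaled with the training statistics rather than its own.
print(train_scaled.mean(axis=0).round(3), train_scaled.std(axis=0).round(3))
print(test_scaled.mean(axis=0).round(3), test_scaled.std(axis=0).round(3))
```

Applied to the augmented run, this would mean transforming the test set with the scaler fitted on `X_train_expanded`, or transforming `X_train_expanded` with the scaler fitted on `X_train`. Whether that changes the scores reported here has not been checked.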

Let's try a grid search on the dataset that has been augmented but not scaled, and see if hyperparameter tuning gives us a greater accuracy score.

In [29]:
from sklearn.model_selection import GridSearchCV

param_grid = [{'weights': ["uniform", "distance"], 'n_neighbors': [3, 4, 5]}]

knn_clf_gridSearch = KNeighborsClassifier()
grid_search = GridSearchCV(knn_clf_gridSearch, param_grid, cv=5, verbose=3)
grid_search.fit(X_train_expanded, y_train_expanded)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV 1/5] END ....n_neighbors=3, weights=uniform;, score=0.995 total time= 4.1min
[CV 2/5] END ....n_neighbors=3, weights=uniform;, score=0.963 total time= 4.1min
[CV 3/5] END ....n_neighbors=3, weights=uniform;, score=0.959 total time= 4.1min
[CV 4/5] END ....n_neighbors=3, weights=uniform;, score=0.973 total time= 4.1min
[CV 5/5] END ....n_neighbors=3, weights=uniform;, score=0.969 total time= 4.1min
[CV 1/5] END ...n_neighbors=3, weights=distance;, score=0.995 total time= 4.1min
[CV 2/5] END ...n_neighbors=3, weights=distance;, score=0.964 total time= 4.1min
[CV 3/5] END ...n_neighbors=3, weights=distance;, score=0.961 total time= 4.1min
[CV 4/5] END ...n_neighbors=3, weights=distance;, score=0.976 total time= 4.1min
[CV 5/5] END ...n_neighbors=3, weights=distance;, score=0.972 total time= 4.1min
[CV 1/5] END ....n_neighbors=4, weights=uniform;, score=0.992 total time= 5.0min
[CV 2/5] END ....n_neighbors=4, weights=uniform;,

GridSearchCV(cv=5, estimator=KNeighborsClassifier(),
             param_grid=[{'n_neighbors': [3, 4, 5],
                          'weights': ['uniform', 'distance']}],
             verbose=3)

In [30]:
grid_search.best_params_

{'n_neighbors': 4, 'weights': 'distance'}

In [31]:
grid_search.best_score_

0.9747666666666668

In [32]:
y_pred_gridSearch = grid_search.predict(X_test)
accuracy_score(y_test, y_pred_gridSearch)

0.9763
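
One caveat about the cross-validation scores: the augmented training set contains four shifted copies of every original image, so a plain 5-fold split can place an image in the validation fold while its near-duplicates sit in the training folds. That may account for part of the spread in the fold scores in the log (0.959 to 0.995). The sketch below shows one possible way to keep all copies of an image in the same fold, using scikit-learn's `GroupKFold`. It runs on tiny synthetic data, and the `groups` array that tags every copy with the index of its source image is a hypothetical construction, not something the notebook builds.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_original, n_copies = 40, 5  # 1 original + 4 "shifted" copies per sample

# Synthetic stand-in for the augmented set: noisy copies stacked block by block,
# mirroring how X_train_expanded is assembled with np.concatenate.
X_base = rng.normal(size=(n_original, 8))
y_base = rng.integers(0, 2, size=n_original)
X_aug = np.concatenate(
    [X_base + 0.01 * rng.normal(size=X_base.shape) for _ in range(n_copies)]
)
y_aug = np.tile(y_base, n_copies)

# Hypothetical group labels: every copy of source sample i gets group id i.
groups = np.tile(np.arange(n_original), n_copies)

cv = GroupKFold(n_splits=5)

# Check that no source sample appears on both sides of any split.
leak_free = all(
    set(groups[train_idx]).isdisjoint(groups[val_idx])
    for train_idx, val_idx in cv.split(X_aug, y_aug, groups)
)

param_grid = {"weights": ["uniform", "distance"], "n_neighbors": [3, 4, 5]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=cv)
search.fit(X_aug, y_aug, groups=groups)  # groups is forwarded to the splitter

print(leak_free, search.best_params_)
```

For the real data, the equivalent group array would be `np.tile(np.arange(60000), 5)`, assuming the concatenation order used above. The held-out test accuracy of 0.9763 is unaffected by this, since the test set is never augmented.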