# MNIST Digit Recognition Workbook

### Starter Code

First, let's load the MNIST dataset and split it into training and testing sets.

In [2]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Load MNIST data from https://openml.org/d/554
X, y = fetch_openml('mnist_784', version=1, return_X_y=True)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

  warn(


Training set shape: (56000, 784)
Test set shape: (14000, 784)


### Visualization and Plotting

In [None]:
from matplotlib import pyplot as plt
print(X_test[X_test.columns[1]])

## Part 1: PCA + KNN

### Prelude

Principal Component Analysis (PCA) is a statistical technique to emphasize variation and bring out strong patterns in a dataset. It's often used to make data easy to explore and visualize. Here, you will use PCA to reduce the dimensionality of the MNIST dataset before applying the KNN algorithm for classification.

### Steps

1. **Load the Dataset:** Start by loading the MNIST dataset.
2. **Apply PCA:** Reduce the dimensionality of the dataset.
3. **KNN Classification:** Use the KNN algorithm to classify the digits.

# Tips
- Choosing n_components for PCA: Start with n_components=0.95 which keeps 95% of the variance. Experiment with other values to see how it changes the results.
- Choosing n_neighbors for KNN: Common starting points are 3, 5, and 7. Adjust based on the performance and try to avoid overfitting.
- Explore: Use visualizations like plotting some of the digits before and after PCA to understand what is retained and what is lost.

In [3]:
#@title Feature Standardization
#this is required because methods such as KNN, K-means, and regression are optimisation techniques that aim to
#minimise the distance (in some sense). This implies that each feature gets a different importance and therefore
#has a different effect on the data above -> more prominance to features with higher values
from sklearn.preprocessing import StandardScaler

X_test_cols = X_test.columns
X_test_num_cols= X_test.shape[1]
for i in range(X_test_num_cols):
  stdization_params = StandardScaler().fit(X_test[[X_test_cols[i]]])
  X_test[X_test_cols[i]] = stdization_params.transform(X_test[[X_test_cols[i]]])

X_train_cols = X_train.columns
X_train_num_cols= X_train.shape[1]
for i in range(X_train_num_cols):
  stdization_params = StandardScaler().fit(X_train[[X_train_cols[i]]])
  X_train[X_train_cols[i]] = stdization_params.transform(X_train[[X_train_cols[i]]])

In [14]:
#@title Performing PCA

from sklearn.decomposition import PCA
import numpy as np

#getting the cov matrix for train and test datasets
# X_train_cov = X_train.cov()
# X_test_cov = X_test.cov()

#finding the eigen values and corresponding vectors in order to induce dimensional reduction
# eigenvalues_X_train, eigenvectors_X_train = np.linalg.eig(X_train_cov)
# eigenvalues_X_test, eigenvectors_X_test = np.linalg.eig(X_test_cov)

#now using the sklearn library for PCA,
pca_params = PCA(n_components = 2).fit(X_train)
X_train_pca = pca_params.transform(X_train)
print(X_train_pca.shape)
X_test_pca = pca_params.transform(X_test)
print(X_test_pca.shape)
# print(pca_params.explained_variance_)

(56000, 2)
(14000, 784)


In [None]:
#@title Visualisation

#un-comment the following code snippets to visualise the MNIST data after PCA

# pca_params = PCA(n_components = 2).fit(X_train)
# X_train_pca = pca_params.transform(X_train)
# print(X_train_pca.shape)
# plt.scatter(X_train_pca[:,0], X_train_pca[:,1])

# X_test_pca = pca_params.transform(X_test)
# print(X_test_pca.shape)
# plt.scatter(X_test_pca[:,0], X_test_pca[:,1])

the score for number of neighbors =  3 is  100


In [None]:
#@title KNN Classification

from sklearn.neighbors import KNeighborsClassifier

#performing KNN algo for n_neighbors = 3, 5, and 7
for k in range(3, 8, 2):
  print(X_test.shape)
  knn_model = KNeighborsClassifier(n_neighbors = k).fit(X_train_pca, y_train)
  score = knn_model.score(X_test_pca, y_test)
  print("the score for number of neighbors = ", k, " is ", score*100)

(14000, 784)
0.9446428571428571


## Part 2: K-Means + SVM

### Prelude

K-Means is a popular clustering algorithm, and Support Vector Machines (SVMs) are a powerful classification method. In this part, you will use K-Means to extract features from the dataset and then use these features to train an SVM classifier.

### Steps

1. **K-Means Clustering:** Apply K-Means to find clusters in the dataset.
2. **Feature Extraction:** Use the distances from each point to the cluster centroids as features.
3. **SVM Classification:** Use the SVM classifier to classify the digits.

**Additional Tips for Students:**
- **Choosing the Number of Clusters (k) in K-Means:** Start with `k=10` since there are 10 digits (0-9). Experiment with different values to see if they improve the performance.
- **Selecting SVM Kernel:** Try different kernels like 'linear', 'poly', 'rbf', and 'sigmoid'. Observe how the choice of kernel affects accuracy.
- **Visualization:** Consider visualizing the centroids of the clusters. Each centroid is a point in the same space as the input data and can be viewed as an "average" digit if reshaped to 28x28 pixels.
- **Cross-Validation:** Use cross-validation to find the best parameters for both K-Means and SVM to further improve the model.

---

In [15]:
#@title K-Means Clustering

from sklearn.cluster import MiniBatchKMeans, KMeans
from sklearn.metrics import accuracy_score
import pandas as pd

for n_clusters in range(10, 11):
  #setting the initial params
  n_init = 10
  algo_init = "random"

  #K-means algo
  kmeans_params = KMeans(n_clusters = n_clusters, init = algo_init, n_init = n_init)
  kmeans_params.fit(X_train_pca)

  #finding cluster centres and followed by SS wrt centroids for feature selection from K-Means
  cluster_centroids = kmeans_params.__dict__['cluster_centers_']
  features_selected_train = np.column_stack([np.sqrt(np.sum((X_train_pca-ci)**2, axis = 1)) for ci in cluster_centroids])
  features_selected_test = np.column_stack([np.sqrt(np.sum((X_test_pca-ci)**2, axis = 1)) for ci in cluster_centroids])

10
[[-4.60747694  2.89817624]
 [-6.35429956 -1.8407708 ]
 [17.09118319  3.87304771]
 [ 6.5324151   2.66554109]
 [ 0.27865082 -0.11941424]
 [14.60508153 -7.55014095]
 [ 5.56942685 12.2793274 ]
 [ 5.25491862 -6.43747573]
 [-0.90498198  7.46636009]
 [-1.68228201 -5.68414525]]
(56000, 10)


In [16]:
#@title SVM Implementation

from sklearn.svm import SVC

svm_params = SVC(C = 1, kernel = 'poly', degree = 3)
svm_params.fit(features_selected_train, y_train)

In [18]:
#@title Accuracy Score before Cross Validation

from sklearn.metrics import accuracy_score

y_pred = svm_params.predict(features_selected_test)
print(accuracy_score(y_test, y_pred))

0.3712142857142857


In [None]:
#@title Cross Validation



## Part 3: SIFT + SVM

### Prelude

Scale-Invariant Feature Transform (SIFT) is an algorithm to detect and describe local features in images. After extracting these features, you will use an SVM classifier for the classification.

### Steps

1. **SIFT Feature Extraction:** Extract SIFT features from each image.
2. **Feature Description:** Use the features to describe the dataset.
3. **SVM Classification:** Use these descriptions to train and predict using SVM.

### Starter Code

#### SIFT Feature Extraction

First, let's define a function to extract SIFT features from an image.

**Additional Tips for Students:**
- **SIFT Feature Size:** SIFT descriptors are 128-dimensional; ensure all feature vectors are the same length.
- **Choosing SVM Kernel:** Try 'linear', 'poly', 'rbf', and 'sigmoid' kernels to observe their effects.
- **Regularization Parameter (C):** Experiment with different values of \(C\); smaller values specify stronger regularization.
- **Handling Missing Descriptors:** In case no keypoints are found in an image, use a zero vector for that image’s descriptors.

---