## Assignment 5
## Advanced Data Mining

### 1(a) (20 pts)

USPS is a handwritten digit database covering the ten digits 0, 1, 2, ..., 9. We will use a subset (USPS_sub.mat) containing 1K images in this experiment. The resolution of each image is 16 × 16, so each image vector has length 256. In this question, you will use SVM for digit classification. For each digit, you will use the first 5, 10, or 15 images as the training data and the rest as test data to solve this multi-class classification problem. In each setting, you are required to use: (1) a linear kernel, (2) a polynomial kernel, and (3) a radial basis function (RBF) kernel. As each kernel has a few model parameters, e.g., the bandwidth 𝛾 in the RBF kernel and the common slack penalty parameter 𝐶, you will need to use cross-validation on the training set to select the best model parameters. You can use "grid search" to find the best values for them. In total there are 9 results (recognition accuracy), from 3 training settings and 3 kernels.
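
The task statement asks for the first 5, 10, or 15 images of each digit as training data. A minimal sketch of such a deterministic split is shown here, assuming the samples of each class appear in the file in the intended order. The helper name `split_first_n_per_class` and the toy arrays are hypothetical and not part of the assignment code.

```python
import numpy as np

def split_first_n_per_class(X, y, n_train):
    """Use the first n_train samples of each class for training, the rest for testing."""
    y = np.ravel(y)
    train_mask = np.zeros(len(y), dtype=bool)
    for label in np.unique(y):
        # Indices of this class, in the order they appear in the data
        class_idx = np.flatnonzero(y == label)
        train_mask[class_idx[:n_train]] = True
    return X[train_mask], X[~train_mask], y[train_mask], y[~train_mask]

# Toy stand-in data (assumption): 3 classes with 4 samples each
X_demo = np.arange(12, dtype=float).reshape(12, 1)
y_demo = np.array([0, 1, 2] * 4)
X_tr, X_te, y_tr, y_te = split_first_n_per_class(X_demo, y_demo, n_train=2)
print(X_tr.ravel(), y_tr)
```

With the real data, the same call would be made once per training size, for example `split_first_n_per_class(X, y, 5)`, so that every kernel is evaluated on an identical split.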

In [12]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, svm, model_selection, preprocessing
from sklearn.decomposition import PCA
import scipy

#### Load the USPS dataset and preprocess the data

In [13]:
from scipy.io import loadmat

# Load the USPS_sub.mat file (replace the path with the actual path on your machine)
data = loadmat('/Users/bharath/Documents/ADM/Assignment5/Code&Data/USPS_sub.mat')

# Extract the feature matrix (X) and labels (y) from the loaded data
# The keys 'data' and 'label' below must match the variable names stored in your .mat file
X = data['data']
y = data['label']

# If necessary, reshape the labels to a 1D array
y = np.ravel(y)


#### Apply PCA if necessary

In [14]:
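# Keep the smallest number of principal components that explains 95% of the variance
pca = PCA(n_components=0.95)
# Note: PCA is fit on all images here, before any train/test split is made
X_pca = pca.fit_transform(X)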

#### Define train-test splits

In [15]:
train_sizes = [5, 10, 15]

#### SVM kernel types

In [16]:
kernels = ['linear', 'poly', 'rbf']

#### Model parameters for grid search

In [17]:
param_grid = {'C': np.logspace(-3, 3, 7), 'gamma': np.logspace(-3, 3, 7)}
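
The linear kernel does not use `gamma`, and the polynomial kernel has an extra `degree` parameter, so one possible refinement is a separate grid per kernel. The following is a hedged sketch of that idea, not the setup that produced the results in this notebook. It assumes scikit-learn's built-in 8 × 8 digits as stand-in data instead of USPS_sub.mat, and it uses deliberately small grids so that it runs quickly; `param_grids`, `X_demo`, and `y_demo` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Stand-in data (assumption): scikit-learn's 8x8 digits, not USPS_sub.mat
X_demo, y_demo = load_digits(return_X_y=True)
X_demo, y_demo = X_demo[:200] / 16.0, y_demo[:200]  # scale pixels to [0, 1]

C_values = np.logspace(-1, 2, 4)
gamma_values = np.logspace(-3, 0, 4)

# One grid per kernel, listing only the parameters that kernel actually uses
param_grids = {
    'linear': {'C': C_values},
    'poly': {'C': C_values, 'gamma': gamma_values, 'degree': [2, 3]},
    'rbf': {'C': C_values, 'gamma': gamma_values},
}

# Fixed folds so that all kernels are compared on the same cross-validation splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

best = {}
for kernel, grid in param_grids.items():
    search = GridSearchCV(SVC(kernel=kernel), grid, cv=cv)
    search.fit(X_demo, y_demo)
    best[kernel] = (search.best_params_, search.best_score_)
    print(kernel, search.best_params_, round(search.best_score_, 3))
```

Keeping the grids separate avoids refitting the linear SVM once for every `gamma` value, which changes nothing about its predictions.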

#### Loop over train sizes and kernels

In [18]:
for size in train_sizes:
    for kernel in kernels:
        # Stratified random split with size*10 training images in total.
        # Note: a new random split is drawn for every kernel; this is not the
        # "first `size` images per digit" split described in the task statement.
        X_train, X_test, y_train, y_test = model_selection.train_test_split(X_pca, y, train_size=size*10, stratify=y)

        # Create SVM classifier
        clf = svm.SVC(kernel=kernel)

        # Grid search over C and gamma with 5-fold cross-validation on the training set
        # (gamma has no effect for the linear kernel)
        grid_search = model_selection.GridSearchCV(clf, param_grid, cv=5)
        grid_search.fit(X_train, y_train)

        # Best parameters
        best_params = grid_search.best_params_

        # GridSearchCV (refit=True by default) has already retrained the best
        # configuration on the full training set
        clf_best = grid_search.best_estimator_

        # Calculate the recognition accuracy on the held-out test images
        accuracy = clf_best.score(X_test, y_test)
        print(f"Kernel: {kernel}, Training size: {size}, Accuracy: {accuracy}")

Kernel: linear, Training size: 5, Accuracy: 0.7252631578947368
Kernel: poly, Training size: 5, Accuracy: 0.3989473684210526
Kernel: rbf, Training size: 5, Accuracy: 0.19052631578947368
Kernel: linear, Training size: 10, Accuracy: 0.7666666666666667
Kernel: poly, Training size: 10, Accuracy: 0.6
Kernel: rbf, Training size: 10, Accuracy: 0.21555555555555556
Kernel: linear, Training size: 15, Accuracy: 0.8047058823529412
Kernel: poly, Training size: 15, Accuracy: 0.7470588235294118
Kernel: rbf, Training size: 15, Accuracy: 0.25176470588235295


### 1(b) (20 pts)

Please implement a k-nearest-neighbor (kNN) classifier yourself and then use the USPS dataset above for the same experiments. First, you will use 5, 10, or 15 images per digit as the training data and the rest as test data to solve this multi-class classification problem. Second, for each experiment, please try k = 1, 3, and 5 for your kNN classifier. Therefore, 3 × 3 = 9 experimental results will be reported.
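
Since the question asks for a hand-written classifier, a minimal NumPy sketch of kNN is given here. It assumes Euclidean distance and majority voting, with ties broken in favor of the smallest label; the function name `knn_predict` and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """Predict labels for X_test by majority vote among the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.ravel(y_train)

    # Squared Euclidean distances via ||a||^2 - 2 a.b + ||b||^2,
    # giving a matrix of shape (n_test, n_train)
    d2 = (
        (X_test ** 2).sum(axis=1)[:, None]
        - 2.0 * X_test @ X_train.T
        + (X_train ** 2).sum(axis=1)[None, :]
    )

    # Indices of the k closest training points for every test point
    nearest = np.argsort(d2, axis=1, kind="stable")[:, :k]
    neighbor_labels = y_train[nearest]

    preds = np.empty(len(X_test), dtype=y_train.dtype)
    for i, labels in enumerate(neighbor_labels):
        values, counts = np.unique(labels, return_counts=True)
        # np.unique returns sorted labels, so argmax picks the smallest label on a tie
        preds[i] = values[np.argmax(counts)]
    return preds

# Toy stand-in data (assumption): two well-separated 2-D clusters
X_train_demo = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train_demo = np.array([0, 0, 0, 1, 1, 1])
X_test_demo = np.array([[0.2, 0.1], [5.5, 5.5]])
y_test_demo = np.array([0, 1])

preds_demo = knn_predict(X_train_demo, y_train_demo, X_test_demo, k=3)
accuracy_demo = np.mean(preds_demo == y_test_demo)
print(preds_demo, accuracy_demo)
```

The recognition accuracy for each of the nine settings would then be `np.mean(knn_predict(X_train, y_train, X_test, k) == y_test)` for the chosen split and value of k.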

In [19]:
# Note: this cell uses scikit-learn's KNeighborsClassifier rather than a hand-written kNN
from sklearn.neighbors import KNeighborsClassifier

# Loop over train sizes and k values
k_values = [1, 3, 5]

for size in train_sizes:
    for k in k_values:
        # Stratified random split with size*10 training images in total
        # (a new random split is drawn for every value of k)
        X_train, X_test, y_train, y_test = model_selection.train_test_split(X_pca, y, train_size=size*10, stratify=y)

        # Train kNN classifier
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)

        # Calculate the recognition accuracy on the held-out test images
        accuracy = knn.score(X_test, y_test)
        print(f"k: {k}, Training size: {size}, Accuracy: {accuracy}")


k: 1, Training size: 5, Accuracy: 0.6168421052631579
k: 3, Training size: 5, Accuracy: 0.5631578947368421
k: 5, Training size: 5, Accuracy: 0.49473684210526314
k: 1, Training size: 10, Accuracy: 0.7155555555555555
k: 3, Training size: 10, Accuracy: 0.6866666666666666
k: 5, Training size: 10, Accuracy: 0.6422222222222222
k: 1, Training size: 15, Accuracy: 0.7223529411764706
k: 3, Training size: 15, Accuracy: 0.7176470588235294
k: 5, Training size: 15, Accuracy: 0.7011764705882353
