# PCA for dimensionality reduction before classification

In [1]:
import numpy as np
import random

from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Upload the `data.csv` file to this workspace, then read the features into a `numpy` array `X` and the labels into `y`. If you want to, you can add code to the following cell to explore `X` (for example, to check its shape).


In [2]:
dat = np.genfromtxt('data.csv', delimiter=',')
X = dat[:, :-1]  # every column except the last holds a feature
y = dat[:, -1]   # the last column holds the label
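
As one way to explore the data, the sketch below prints the array shapes, counts missing values, and tallies the labels. Because `data.csv` is not reproduced here, it reads a small made-up CSV string (hypothetical values) in place of the file; the column layout, with features first and the label last, follows the cell above.

```python
import io
import numpy as np

# Hypothetical stand-in for data.csv: 4 rows, 3 feature columns, label in the last column
csv_text = "0.1,1.2,3.4,0\n0.5,0.7,2.9,1\n0.3,1.1,3.0,0\n0.9,0.2,2.5,1\n"
dat = np.genfromtxt(io.StringIO(csv_text), delimiter=',')

X = dat[:, :-1]
y = dat[:, -1]

print(X.shape, y.shape)  # (n_samples, n_features) and (n_samples,)

# genfromtxt stores fields it cannot parse as NaN, so this counts missing entries
n_missing = int(np.isnan(dat).sum())

# Class balance: how many samples carry each label
labels, counts = np.unique(y, return_counts=True)
print(n_missing, dict(zip(labels, counts)))
```

On the real file, the same checks show how many candidate values of `n_comp` there are (one per feature column) and whether the classes are roughly balanced.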

Use `train_test_split` to split the data into training and test sets. Reserve 30% of the data for the test set. 

Make sure to shuffle the data, and pass `random_state = 42` so that your random split will match the auto-grader's.

In [3]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, shuffle=True)
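
To see what `test_size=0.3` and a fixed `random_state` do, here is a minimal sketch on randomly generated data. The arrays `X_demo` and `y_demo` and their sizes are assumptions for illustration, not the contents of `data.csv`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 20 features, binary labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 20))
y_demo = rng.integers(0, 2, size=100)

split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42, shuffle=True)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42, shuffle=True)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # 70 training rows and 30 test rows

# Repeating the call with the same random_state reproduces the same partition
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```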

You will use the training data to fit a support vector classifier. However, instead of fitting the classifier to the raw training data, you will first transform the data using PCA. Then, you will use only a subset of the transformed features - the first `n_comp` principal components - as input to your classifier.

You will use K-fold cross-validation to find the optimal value of `n_comp`. You should consider every possible value of `n_comp`, from 1 component (simplest possible model) to all of the components (most flexible model).

In the next cell,

* Use the `sklearn` implementation of `KFold` to iterate over candidate models. In your `KFold`, use 5 splits, and don't shuffle the data (you already shuffled it when splitting into training and test sets).
* Use the `sklearn` implementation of `PCA` to transform the data. Pass `random_state = 42` to `PCA` so that your result will match the auto-grader's.
* Use the `sklearn` implementation of `SVC` to classify the data using the first `n_comp` principal components. Pass `random_state = 42` to `SVC` so that your result will match the auto-grader's.
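
There are two ways to get "the first `n_comp` principal components": fit a full PCA and keep the leading columns of the transformed data, or fit `PCA(n_components=n_comp)` directly. The sketch below checks on made-up data that the two agree when an exact solver is used. `X_demo` is hypothetical, and `svd_solver='full'` is set explicitly here as an assumption so that both fits go through the same exact decomposition.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 8))  # hypothetical data: 60 samples, 8 features

n_comp = 3
# Same exact solver for both fits, so any difference would come from truncation alone
full = PCA(n_components=X_demo.shape[1], svd_solver='full').fit(X_demo)
small = PCA(n_components=n_comp, svd_solver='full').fit(X_demo)

Z_truncated = full.transform(X_demo)[:, :n_comp]  # leading columns of the full transform
Z_direct = small.transform(X_demo)                # transform from the smaller model

columns_match = np.allclose(Z_truncated, Z_direct)
print(columns_match)
```

Truncating a full fit means a single PCA fit per fold can serve every candidate `n_comp`.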

In [6]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
kf = KFold(n_splits=5, shuffle=False)
n_features = X_train.shape[1]
n_folds = kf.get_n_splits()
average_accuracy = np.zeros(n_features)    # mean validation accuracy for each n_comp
average_validation = np.zeros(n_features)  # standard error of that mean for each n_comp
# Fit a full PCA in each fold and keep only its first n_comp columns
pca = PCA(n_components=n_features, random_state=42)
for n_comp in range(1, n_features + 1):
    fold_acc = np.zeros(n_folds)
    for i, (train_index, test_index) in enumerate(kf.split(X_train)):
        X_train_fold, X_test_fold = X_train[train_index], X_train[test_index]
        y_train_fold, y_test_fold = y_train[train_index], y_train[test_index]
        # Fit PCA on the training fold only, so the validation fold does not leak into the transform
        pca.fit(X_train_fold)
        X_train_fold_pca = pca.transform(X_train_fold)[:, :n_comp]
        X_test_fold_pca = pca.transform(X_test_fold)[:, :n_comp]
        svc = SVC(random_state=42)
        svc.fit(X_train_fold_pca, y_train_fold)
        y_pred = svc.predict(X_test_fold_pca)
        fold_acc[i] = accuracy_score(y_test_fold, y_pred)
    average_accuracy[n_comp - 1] = fold_acc.mean()
    # Population std (ddof=0) over sqrt(n_folds - 1) equals sample std (ddof=1) over sqrt(n_folds)
    average_validation[n_comp - 1] = fold_acc.std() / np.sqrt(n_folds - 1)
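
For comparison, the same search can be written with a `Pipeline` and `cross_val_score`, which refit PCA inside each training fold automatically. This is an alternative sketch on made-up data, not the cell the auto-grader expects; `X_demo`, `y_demo`, and `acc_by_n_comp` are hypothetical names, and the numbers it produces are unrelated to the results in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical data: 70 samples, 6 features, labels driven mostly by the first feature
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(70, 6))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=70) > 0).astype(int)

kf = KFold(n_splits=5, shuffle=False)
acc_by_n_comp = {}
for n_comp in range(1, X_demo.shape[1] + 1):
    # The pipeline fits PCA and then SVC on each training fold, and scores on the held-out fold
    pipe = Pipeline([
        ('pca', PCA(n_components=n_comp, random_state=42)),
        ('svc', SVC(random_state=42)),
    ])
    scores = cross_val_score(pipe, X_demo, y_demo, cv=kf, scoring='accuracy')
    acc_by_n_comp[n_comp] = scores.mean()

print(acc_by_n_comp)
```

The pipeline form is shorter and makes it hard to fit PCA on validation data by accident, at the cost of refitting PCA once per candidate `n_comp`.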

In [7]:
average_validation

array([0.0451754 , 0.05248907, 0.03642157, 0.04164966, 0.06624013,
       0.05050763, 0.04164966, 0.04164966, 0.04164966, 0.05714286,
       0.04738035, 0.04738035, 0.04738035, 0.04738035, 0.04738035,
       0.04738035, 0.04738035, 0.04738035, 0.04738035, 0.04738035])

In [8]:
average_accuracy

array([0.64285714, 0.62857143, 0.67142857, 0.65714286, 0.68571429,
       0.71428571, 0.72857143, 0.72857143, 0.72857143, 0.7       ,
       0.7       , 0.7       , 0.7       , 0.7       , 0.7       ,
       0.7       , 0.7       , 0.7       , 0.7       , 0.7       ])

For each value of `n_comp`, compute the mean validation accuracy across the folds and the standard error of that mean. Save the results in `acc_mean` and `acc_se`, respectively.

In [9]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)

acc_mean = average_accuracy
acc_se = average_validation
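
The standard error of the mean is usually written as the sample standard deviation divided by the square root of the number of folds. The loop above instead divides NumPy's default (population) standard deviation by the square root of the number of folds minus one. The sketch below checks that the two forms agree; the fold accuracies are made-up values for illustration.

```python
import numpy as np

fold_acc = np.array([0.64, 0.71, 0.79, 0.57, 0.71])  # hypothetical fold accuracies
n = fold_acc.size

# Form used in the loop above: population std (ddof=0) over sqrt(n - 1)
se_notebook = fold_acc.std() / np.sqrt(n - 1)

# Textbook form: sample std (ddof=1) over sqrt(n)
se_textbook = fold_acc.std(ddof=1) / np.sqrt(n)

forms_agree = bool(np.isclose(se_notebook, se_textbook))
print(se_notebook, se_textbook, forms_agree)
```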

Then, compute the optimal value of `n_comp` (the one with the highest mean validation accuracy), and save this in `n_pca_opt`.

In [10]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)

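# np.argmax returns the first index of the maximum; add 1 to convert the 0-based index to a component count
n_pca_opt = np.argmax(average_accuracy) + 1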

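Finally, compute the optimal `n_comp` according to the one-SE rule, that is, the smallest `n_comp` whose mean validation accuracy is within one standard error of the best mean accuracy. Save this in `n_pca_one_se`.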

In [11]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)

best_idx = np.argmax(average_accuracy)
max_avg_accuracy = average_accuracy[best_idx]
max_se = average_validation[best_idx]  # standard error of the best model

# Threshold for the one-SE rule: one standard error below the best mean accuracy
threshold = max_avg_accuracy - max_se

# Simplest model (fewest components) whose mean accuracy reaches the threshold;
# add 1 to convert the 0-based index to a component count
n_pca_one_se = np.where(average_accuracy >= threshold)[0][0] + 1

In [12]:
n_pca_one_se

6
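
To trace how this result arises, the sketch below applies the one-SE rule to the `average_accuracy` and `average_validation` arrays printed earlier, copied by hand into `acc_mean` and `acc_se`. The helper name `one_se_choice` is hypothetical and introduced only for illustration.

```python
import numpy as np

# Values copied from the printed outputs above
acc_mean = np.array([0.64285714, 0.62857143, 0.67142857, 0.65714286, 0.68571429,
                     0.71428571, 0.72857143, 0.72857143, 0.72857143, 0.7,
                     0.7, 0.7, 0.7, 0.7, 0.7,
                     0.7, 0.7, 0.7, 0.7, 0.7])
acc_se = np.array([0.0451754, 0.05248907, 0.03642157, 0.04164966, 0.06624013,
                   0.05050763, 0.04164966, 0.04164966, 0.04164966, 0.05714286,
                   0.04738035, 0.04738035, 0.04738035, 0.04738035, 0.04738035,
                   0.04738035, 0.04738035, 0.04738035, 0.04738035, 0.04738035])

def one_se_choice(acc_mean, acc_se):
    """Smallest component count whose mean accuracy is within one SE of the best."""
    best = np.argmax(acc_mean)                  # first index of the best mean accuracy
    threshold = acc_mean[best] - acc_se[best]   # one SE below the best
    # argmax on a boolean array gives the first True, i.e. the simplest qualifying model
    return int(np.argmax(acc_mean >= threshold)) + 1

n_best = int(np.argmax(acc_mean)) + 1
n_one_se = one_se_choice(acc_mean, acc_se)
print(n_best, n_one_se)  # → 7 6
```

The best mean accuracy (0.72857143) first occurs at 7 components, with a standard error of 0.04164966, so the threshold is about 0.6869. Five components fall just short at 0.68571429, and six components are the first to clear it.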