In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import KFold, train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

Because the data has so many columns, the feature selection and model fitting we are about to do can be computationally intensive, so we'll only consider the first 2000 columns. The next cell loads that feature data into `X` and the labels into `y`.

In [2]:
X = np.load('X.npy', allow_pickle=True)[:,:2000]
y = np.load('y.npy', allow_pickle=True)


Then, you will set aside 25% of the data for evaluating the final model at the end. Save the result in `Xtr`, `ytr`, `Xts`, and `yts`.

Use `sklearn`'s `train_test_split` with shuffling, and specify `random_state = 42` so that your results will match the autograder's results.


In [4]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
Xtr, Xts, ytr, yts = train_test_split(X, y, test_size=0.25, random_state=42, shuffle=True)
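
As a quick sanity check of what the split does, here is a minimal sketch on synthetic stand-in data (the arrays `X_demo` and `y_demo` are made up for illustration and are not the assignment's data). It shows that `test_size=0.25` holds out a quarter of the rows and that a fixed `random_state` makes the shuffled split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical; not the assignment's X and y)
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20) % 2

# Two identical calls with the same random_state
a_tr, a_ts, b_tr, b_ts = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, shuffle=True)
c_tr, c_ts, d_tr, d_ts = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, shuffle=True)

print(a_tr.shape, a_ts.shape)        # 15 training rows, 5 test rows
print(np.array_equal(a_ts, c_ts))    # the same rows are held out both times
```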


Now, you will use 10-fold cross validation (with `sklearn`'s `KFold`, no additional shuffling since you have already shuffled the data) to evaluate model candidates as follows:

* First, within each fold, compute the *absolute value* of the correlation coefficient between each column of the feature data and the target variable. (You may use `numpy`'s `corrcoef` function.) Save the results in `score_ft`, which has one entry per column per fold.
* Then, iterate over the number of columns to include in the model (the `d` values in `d_list`). In each iteration, the model will use the `d` features with the highest absolute correlation coefficient.
* You will train an SVC model with a linear kernel, `C=10`, `random_state = 24`, and all other settings at their default values. You will evaluate the model on the validation data and save the accuracy score in `score_val`, which has one entry per `d` value per fold.

(Note: in many cases we would standardize the data before fitting an SVC, but we won't do that here.)
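
The feature scoring step in the first bullet can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: `abs_corr_per_column`, `X_demo`, and `y_demo` are hypothetical names, and the vectorized formula is one possible alternative to calling `np.corrcoef` once per column. The sketch checks that both approaches give the same scores and then picks the top `d` columns.

```python
import numpy as np

def abs_corr_per_column(X, y):
    """Absolute Pearson correlation between each column of X and y (hypothetical helper)."""
    Xc = X - X.mean(axis=0)          # center each column
    yc = y - y.mean()                # center the target
    num = Xc.T @ yc                  # one cross-product sum per column
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(num / den)         # a constant column would give 0/0 = nan

# Synthetic stand-in data: column 2 determines the label, the other columns are noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = (X_demo[:, 2] > 0).astype(float)

scores = abs_corr_per_column(X_demo, y_demo)

# Same quantity computed one column at a time with np.corrcoef
loop_scores = np.array([
    np.abs(np.corrcoef(X_demo[:, j], y_demo)[0, 1])
    for j in range(X_demo.shape[1])
])

# Indices of the d highest-scoring columns, best first
d = 2
top_d = np.argsort(-scores)[:d]
```

Inside cross-validation, the scores would be computed from the training part of each fold only, so that the validation data plays no role in choosing the features.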

Write your solution in the `#grade` cell.

In [5]:
d_list = np.arange(1, X.shape[1]+1) 
nd = len(d_list)
nfold = 10

score_ft = np.zeros((nd, nfold))
score_val = np.zeros((nd,nfold))

In [6]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
# Initialize KFold cross-validation (no shuffling: train_test_split already shuffled the data)
kf = KFold(n_splits=nfold)

# Iterate over the folds
for fold_idx, (train_index, val_index) in enumerate(kf.split(Xtr)):
    X_train_fold, y_train_fold = Xtr[train_index], ytr[train_index]
    X_val_fold, y_val_fold = Xtr[val_index], ytr[val_index]

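    # Compute the absolute value of the correlation coefficient for each feature,
    # using only this fold's training data so the validation data does not
    # influence feature selection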
    for i in range(X_train_fold.shape[1]):
        feature_data = X_train_fold[:, i]
        corr_coef = np.abs(np.corrcoef(feature_data, y_train_fold)[0, 1])
        score_ft[i, fold_idx] = corr_coef

    # Rank the features once per fold, from highest to lowest absolute correlation
    ranked_features = np.argsort(-score_ft[:, fold_idx])

    # Iterate over the number of features to include in the model
    for d in d_list:
        # Select the top d features based on the correlation coefficients
        top_features_indices = ranked_features[:d]
        X_train_fold_d = X_train_fold[:, top_features_indices]
        X_val_fold_d = X_val_fold[:, top_features_indices]

        # Train the SVC model
        model = SVC(kernel='linear', C=10, random_state=24)
        model.fit(X_train_fold_d, y_train_fold)

        # Evaluate the model on the validation fold
        y_pred = model.predict(X_val_fold_d)
        score_val[d-1, fold_idx] = accuracy_score(y_val_fold, y_pred)

Use `score_val` to find `best_d`, the optimal number of features to include in the model (best mean validation accuracy). (Compute the value - don't hard-code it.)

In [8]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
mean_val_accuracy = score_val.mean(axis=1)

# Find the index of the maximum mean validation accuracy
best_d_index = np.argmax(mean_val_accuracy)

# Look up the number of features at that index (d_list starts at 1, so this equals index + 1)
best_d = d_list[best_d_index]


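Then, find `best_d_one_se`, the optimal number of features to include according to the one-SE rule, which is the smallest `d` whose mean validation accuracy is within one standard error of the best mean validation accuracy: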

In [9]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
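# Compute the standard error of the mean validation accuracy for each d value
# (np.std defaults to ddof=0, i.e. the population standard deviation across folds)
std_val_accuracy = score_val.std(axis=1) / np.sqrt(nfold)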

# Find the best model (the one with the highest mean validation accuracy)
best_mean_accuracy = np.max(mean_val_accuracy)
best_model_index = np.argmax(mean_val_accuracy)

# Apply the one-SE rule: among the models whose mean accuracy is within one
# standard error of the best mean accuracy, pick the simplest one.
# d_list is in increasing order, so the first index that meets the threshold is the smallest d.
best_d_one_se_index = np.where(mean_val_accuracy >= best_mean_accuracy - std_val_accuracy[best_model_index])[0][0]
best_d_one_se = d_list[best_d_one_se_index]
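
To see how the two selection rules can differ, here is a small worked sketch with made-up accuracy scores (the array `score_demo` and the names ending in `_demo` are hypothetical and unrelated to the real results). It mirrors the cells above, including the default `ddof=0` in the standard deviation.

```python
import numpy as np

# Made-up validation accuracies: one row per d value, one column per fold
d_list_demo = np.arange(1, 6)
score_demo = np.array([
    [0.60, 0.62, 0.58, 0.60],   # d = 1, mean 0.60
    [0.78, 0.82, 0.80, 0.80],   # d = 2, mean 0.80
    [0.80, 0.84, 0.82, 0.82],   # d = 3, mean 0.82
    [0.79, 0.87, 0.83, 0.83],   # d = 4, mean 0.83 (best)
    [0.79, 0.83, 0.81, 0.81],   # d = 5, mean 0.81
])
nfold_demo = score_demo.shape[1]

mean_demo = score_demo.mean(axis=1)
se_demo = score_demo.std(axis=1) / np.sqrt(nfold_demo)

# Best mean accuracy
best_idx = np.argmax(mean_demo)
best_d_demo = d_list_demo[best_idx]

# One-SE rule: smallest d whose mean is within one SE of the best mean
threshold = mean_demo[best_idx] - se_demo[best_idx]
one_se_idx = np.where(mean_demo >= threshold)[0][0]
best_d_one_se_demo = d_list_demo[one_se_idx]

print(best_d_demo, best_d_one_se_demo)
```

In this toy case the highest mean accuracy is at `d = 4`, but `d = 3` is within one standard error of it, so the one-SE rule prefers the simpler model with three features.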