# **Frequentist Machine Learning, Project 3**
### **Anna Konvicka and David Stekol**
Re-implement the example in section 7.10.2 using any simple, out-of-the-box classifier (like K nearest neighbors from scikit-learn). Reproduce the results for the incorrect and correct ways of doing cross-validation.


In [None]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold

In [None]:
# Data generation: 50 samples, 5000 Gaussian predictors that are independent of the labels, so the true error rate is 50%

classes = np.round(np.random.rand(50))    # Assign classes/labels at random (about 50% in each, not exactly balanced)
predictors = np.random.normal(loc=0, scale=1, size=(50,5000))
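
The cell above draws each label independently and without a seed, so the class split and the printed results change from run to run. A minimal sketch of a reproducible variant with exactly 25 samples per class, assuming NumPy's `default_rng` generator; the seed value and the names `balanced_classes` and `seeded_predictors` are illustrative and not part of the original notebook:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed 0 is an arbitrary, illustrative choice

n_samples, n_predictors = 50, 5000

# Exactly n_samples / 2 labels per class, shuffled into random order
balanced_classes = rng.permutation(np.repeat([0.0, 1.0], n_samples // 2))

# Standard Gaussian predictors, generated independently of the labels
seeded_predictors = rng.standard_normal(size=(n_samples, n_predictors))

print(balanced_classes.sum(), seeded_predictors.shape)
```

These arrays could stand in for `classes` and `predictors` if repeatable output is wanted.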


# Wrong and Bad

From textbook: 
1. Screen the predictors: find a subset of "good" predictors that show fairly strong (univariate) correlation with the class labels
2. Using the above subset of predictors, build a multivariate classifier
3. Use cross-validation to estimate the unknown tuning parameters and to estimate the prediction error of the final model
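
The trouble starts in step 1: the screening sees every label, including those of samples that later act as test points. A rough sketch of the effect on pure-noise data like that generated above; the seed and the variable names are illustrative assumptions, and the correlation is computed column by column rather than with `np.corrcoef` to avoid building a 5001 x 5001 matrix:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
y = rng.permutation(np.repeat([0.0, 1.0], 25))
X = rng.standard_normal(size=(50, 5000))

# Pearson correlation of every predictor column with the labels
Xc = X - X.mean(axis=0)
yc = y - y.mean()
corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

# Indices of the 100 largest (signed) correlations, as in the screening step
top = np.argsort(corr)[::-1][:100]

print("mean correlation, all predictors:", corr.mean())
print("mean correlation, screened 100:  ", corr[top].mean())
```

Although no predictor carries real information about the labels, the screened subset is by construction the set with the largest sample correlations, so a classifier built on it looks better than it is when evaluated on those same samples.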

In [None]:
# STEP 1

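# Correlation of each predictor with the class labels (the last row/column of the matrix belongs to `classes`)
corrs = np.corrcoef(predictors, classes, rowvar=False)[:-1][:,-1]

# Sort predictor indices by correlation, largest first.
# The key function makes sorted() order the indices by their correlation value.
# Note: this ranks by signed correlation, so only positively correlated predictors are kept.
corrs_sorted = sorted(range(len(corrs)), key=lambda x:corrs[x], reverse=True)
hundred_most = corrs_sorted[:100]   # indices of the 100 most highly correlated predictors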
top_predictors = predictors[:, hundred_most]

In [None]:
# STEPS 2, 3

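# KFold usage follows the scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
# With 50 samples, n_splits=50 puts one sample in each fold (leave-one-out).

kfolds = KFold(n_splits=50, shuffle=True)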
bad_correctnesses = []

for train_index, test_index in kfolds.split(top_predictors):
  x_train, x_test = top_predictors[train_index], top_predictors[test_index]
  y_train, y_test = classes[train_index], classes[test_index]

  neigh = KNeighborsClassifier(n_neighbors=1)   # from KNeighborsClassifier doc
  neigh.fit(x_train, y_train)

  predictions = neigh.predict(x_test)
  correctness = (sum(predictions == y_test)/(len(y_test)))
  bad_correctnesses.append(correctness)

# RESULTS
bad_proportion_correct = np.mean(bad_correctnesses)
print("CV error rate: ", (1-bad_proportion_correct)*100, "%")

CV error rate:  4.0000000000000036 %


# Sane and Reasonable

From textbook:
1. Divide the samples into K cross-validation folds at random
2. For each fold k = 1, 2, ..., K:

  a. Find a subset of "good" predictors using all samples except those in fold k

  b. Build a multivariate classifier with these predictors, again using all samples except those in fold k

  c. Use the classifier to predict the class labels for the samples in fold k
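
One way to make it hard to screen outside the folds by accident is to put the screening inside a scikit-learn pipeline, so that `cross_val_score` refits it on each training split. This is a sketch of an alternative, not the approach used in this notebook: `SelectKBest` with `f_classif` is a swapped-in screening score (for two classes it ranks predictors by the magnitude of their correlation with the labels rather than by signed correlation), and the seed and variable names are illustrative.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)  # arbitrary seed
y = rng.permutation(np.repeat([0, 1], 25))
X = rng.standard_normal(size=(50, 5000))

# Screening and classification are fitted together on each training split
model = make_pipeline(
    SelectKBest(score_func=f_classif, k=100),
    KNeighborsClassifier(n_neighbors=1),
)

cv = KFold(n_splits=50, shuffle=True, random_state=0)
accuracy = cross_val_score(model, X, y, cv=cv)  # one accuracy value per fold
pipeline_error = 1 - accuracy.mean()
print("CV error rate: ", pipeline_error * 100, "%")
```

Because the selector is a pipeline step, the held-out sample in each fold never influences which predictors are kept.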

In [None]:
kfolds = KFold(n_splits=50, shuffle=True)     # fresh splitter, so this cell does not depend on the previous one
good_correctnesses = []

for train_index, test_index in kfolds.split(predictors):
  # Screen the predictors using the training samples only
  corrs = np.corrcoef(predictors[train_index], classes[train_index], rowvar=False)[:,-1][:-1]
  hundred_most = sorted(range(len(corrs)), key=lambda x:corrs[x], reverse=True)[:100]
  top_predictors = predictors[:, hundred_most]

  x_train, x_test = top_predictors[train_index], top_predictors[test_index]
  y_train, y_test = classes[train_index], classes[test_index]

  neigh = KNeighborsClassifier(n_neighbors=1)
  neigh.fit(x_train, y_train)

  predictions = neigh.predict(x_test)
  correctness = (sum(predictions == y_test)/(len(y_test)))
  good_correctnesses.append(correctness)

# RESULTS
good_proportion_correct = np.mean(good_correctnesses)
print("CV error rate: ", (1-good_proportion_correct)*100, "%")

CV error rate:  56.00000000000001 %
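
Each figure printed above comes from a single random dataset, so it is noisy. A hedged sketch of how both procedures could be repeated over several simulated datasets and averaged; the helper names `top_correlated` and `cv_error`, the seed, and the number of trials are illustrative assumptions, and plain leave-one-out folds are used (shuffling does not change which folds exist when every fold holds one sample):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

def top_correlated(X, y, n_keep):
    """Indices of the n_keep columns of X with the largest signed correlation with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(corr)[::-1][:n_keep]

def cv_error(X, y, n_keep, screen_inside):
    """Leave-one-out error of 1-NN, screening inside or outside the folds."""
    fixed_cols = None if screen_inside else top_correlated(X, y, n_keep)
    n_wrong = 0
    for train, test in KFold(n_splits=len(y)).split(X):
        if screen_inside:
            cols = top_correlated(X[train], y[train], n_keep)
        else:
            cols = fixed_cols
        knn = KNeighborsClassifier(n_neighbors=1)
        knn.fit(X[train][:, cols], y[train])
        n_wrong += np.sum(knn.predict(X[test][:, cols]) != y[test])
    return n_wrong / len(y)

rng = np.random.default_rng(0)  # arbitrary seed
n_trials = 5                    # small, illustrative number of repetitions
results = {"outside": [], "inside": []}
for _ in range(n_trials):
    y = rng.permutation(np.repeat([0, 1], 25))
    X = rng.standard_normal(size=(50, 5000))
    results["outside"].append(cv_error(X, y, 100, screen_inside=False))
    results["inside"].append(cv_error(X, y, 100, screen_inside=True))

print("mean CV error, screening outside folds:", np.mean(results["outside"]))
print("mean CV error, screening inside folds: ", np.mean(results["inside"]))
```

Since the labels are independent of the predictors, an honest estimate should sit near 50%, which is what screening inside the folds is meant to recover.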
