# CSE 572: Lab 10

In this lab, you will practice implementing techniques for model selection including cross validation and grid search.

To execute and make changes to this notebook, click File > Save a copy to save your own version in your Google Drive or GitHub. Read the step-by-step instructions below carefully. To execute the code, click on each cell below and either press SHIFT-ENTER or click the Play button.

When you finish executing all code/exercises, save your notebook and then download a copy (.ipynb file). Submit the following **three** things on Canvas:
1. a link to your Colab notebook,
2. the .ipynb file, and
3. a PDF of the executed notebook.

To generate a PDF of the notebook, click File > Print > Save as PDF.

In [1]:
# Import libraries
import numpy as np
import pandas as pd

# Set the random seed for reproducibility
seed = 0
np.random.seed(seed)

### Load the iris dataset

In [2]:
data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
data.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'class']

data.sample(5, random_state=seed)

| | sepal length | sepal width | petal length | petal width | class |
|---|---|---|---|---|---|
| 114 | 5.8 | 2.8 | 5.1 | 2.4 | Iris-virginica |
| 62 | 6.0 | 2.2 | 4.0 | 1.0 | Iris-versicolor |
| 33 | 5.5 | 4.2 | 1.4 | 0.2 | Iris-setosa |
| 107 | 7.3 | 2.9 | 6.3 | 1.8 | Iris-virginica |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |


In [3]:
data.shape

(150, 5)

Standardize the data by subtracting the feature-wise mean and dividing by the feature-wise standard deviation for each sample.

In [4]:
# YOUR CODE HERE
cols = data.columns.difference(['class'])
data[cols] = (data[cols] - data[cols].mean()) / data[cols].std()

In [5]:
data.sample(5, random_state=seed)

| | sepal length | sepal width | petal length | petal width | class |
|---|---|---|---|---|---|
| 114 | -0.052331 | -0.585801 | 0.760212 | 1.574155 | Iris-virginica |
| 62 | 0.189196 | -1.969583 | 0.136778 | -0.260321 | Iris-versicolor |
| 33 | -0.414621 | 2.643024 | -1.336794 | -1.308593 | Iris-setosa |
| 107 | 1.759119 | -0.355171 | 1.440322 | 0.787951 | Iris-virginica |
| 7 | -1.018437 | 0.797981 | -1.280118 | -1.308593 | Iris-setosa |
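A quick way to sanity-check a standardization step is to confirm that every feature ends up with mean 0 and standard deviation 1. The sketch below uses a small made-up DataFrame (`toy` is a hypothetical stand-in, not the iris data) and the same formula as the cell above. Note that pandas computes the sample standard deviation (`ddof=1`) by default, so the check uses the same convention.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the feature columns (values are made up for illustration)
toy = pd.DataFrame({
    'sepal length': [5.1, 6.0, 7.3, 5.5],
    'petal width': [0.2, 1.0, 1.8, 0.2],
})

# Same formula as the lab: subtract the column mean, divide by the column std.
# pandas uses the sample standard deviation (ddof=1) by default.
toy_std = (toy - toy.mean()) / toy.std()

print(toy_std.mean().round(6))  # each column mean should be approximately 0
print(toy_std.std().round(6))   # each column std (ddof=1) should be approximately 1
```

If you compare against `sklearn.preprocessing.StandardScaler`, expect slightly different values, since it divides by the population standard deviation (`ddof=0`).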


### k-fold Cross validation

We will use 5-fold cross validation to train and evaluate our classifier. We will not do any model selection/hyperparameter tuning in this step, so each iteration only needs two subsets: a training set and a held-out set used for evaluation.

To split the data into 5 folds we will shuffle the rows and then split them into $k$ equal groups.

In [6]:
k = 5

# Note: np.split raises error if indices_or_sections is 
# an integer and doesn't result in equal size splits
folds = np.split(data.sample(frac=1, random_state=seed), indices_or_sections=k)
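As an alternative sketch of the same shuffle-then-split idea, the rows can be split by shuffled integer positions rather than by passing the DataFrame itself to `np.split` (which, depending on the pandas version, may emit a deprecation warning). The names `demo`, `shuffled_positions`, and `demo_folds` are hypothetical, and the shuffle here uses a different random generator, so the folds will not match the ones produced by the cell above.

```python
import numpy as np
import pandas as pd

k = 5
seed = 0

# Hypothetical stand-in for the lab's `data` DataFrame (150 rows, 3 classes)
demo = pd.DataFrame({'x': np.arange(150),
                     'class': np.repeat(['a', 'b', 'c'], 50)})

# Shuffle the row positions, then cut them into k chunks.
# np.array_split also tolerates sizes that do not divide evenly.
rng = np.random.default_rng(seed)
shuffled_positions = rng.permutation(len(demo))
position_chunks = np.array_split(shuffled_positions, k)

# Each fold is a DataFrame selected by integer position
demo_folds = [demo.iloc[chunk] for chunk in position_chunks]

print([len(fold) for fold in demo_folds])  # [30, 30, 30, 30, 30]
```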

Use a for loop to print the number of samples and number of samples from each class in each fold.

In [7]:
# YOUR CODE HERE
class_names = ['Iris-virginica', 'Iris-versicolor', 'Iris-setosa']

for fold_number, fold in enumerate(folds, start=1):
  # Count how many samples of each class fall in this fold
  class_counts = fold['class'].value_counts()
  print('Fold %i' % fold_number)
  print('number of instances %i' % len(fold))
  for name in class_names:
    print('%s - %i' % (name, class_counts.get(name, 0)))

Fold 1
number of instances 30
Iris-virginica - 6
Iris-versicolor - 13
Iris-setosa - 11
Fold 2
number of instances 30
Iris-virginica - 15
Iris-versicolor - 10
Iris-setosa - 5
Fold 3
number of instances 30
Iris-virginica - 10
Iris-versicolor - 10
Iris-setosa - 10
Fold 4
number of instances 30
Iris-virginica - 10
Iris-versicolor - 6
Iris-setosa - 14
Fold 5
number of instances 30
Iris-virginica - 9
Iris-versicolor - 11
Iris-setosa - 10


### Train a k Nearest Neighbors classifier 

We will use the [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) in sklearn for our classification model. Use cross validation to train and evaluate the model. Set hyperparameters to `n_neighbors=5`, `metric='l2'`, and `weights='uniform'`.

Implement a for loop to iterate through each fold, training a new kNN model each iteration with one fold assigned to validation and the remaining folds assigned to training. Compute the validation accuracy for each iteration and append it to the `accuracies` list.

In [8]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.exceptions import NotFittedError

accuracies = []

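for i in range(len(folds)):
  # Fold i is held out for validation
  val_X = folds[i][cols]
  val_y = folds[i]['class']

  # The remaining k-1 folds are concatenated to form the training set
  train_folds = folds[:i] + folds[i+1:]
  train = pd.concat(train_folds, axis=0, ignore_index=True)

  # Train a fresh model for this iteration
  knn = KNeighborsClassifier(n_neighbors=5, metric='l2', weights='uniform')
  knn.fit(train[cols], train['class'])

  # Evaluate on the held-out fold
  y_pred = knn.predict(val_X)
  accuracies.append(accuracy_score(val_y, y_pred))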
    
# YOUR CODE HERE

Print the mean and standard deviation of the accuracy from cross validation (across all $k$ folds).

In [9]:
print('Mean accuracy: {:.2f}'.format(np.mean(accuracies)))
print('Standard deviation of accuracy: {:.2f}'.format(np.std(accuracies)))

Mean accuracy: 0.95
Standard deviation of accuracy: 0.06
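As a cross-check on the manual loop, the same procedure can be sketched with scikit-learn's `cross_val_score` and `KFold`. This is an illustrative sketch under a few assumptions: it loads the copy of iris bundled with scikit-learn, standardizes with `StandardScaler` (which uses `ddof=0`), and shuffles with its own random state, so the fold assignments and exact scores may differ from the numbers above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Bundled copy of the iris data (assumption: equivalent to the CSV used above)
X, y = load_iris(return_X_y=True)

# Mirror the lab by standardizing all features before splitting
X = StandardScaler().fit_transform(X)

knn_cv = KNeighborsClassifier(n_neighbors=5, metric='l2', weights='uniform')

# 5 shuffled folds; each fold serves as the validation set exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(knn_cv, X, y, cv=cv)

print('Mean accuracy: {:.2f}'.format(scores.mean()))
print('Standard deviation of accuracy: {:.2f}'.format(scores.std()))
```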


**Question 1: If you increased the number of folds, do you expect the standard deviation of the accuracy across $k$ folds to increase or decrease? Why?**

**Answer:**
- Standard deviation for $k = 5$: 0.06
- Standard deviation for $k = 10$: 0.08
- Standard deviation for $k = 15$: 0.09

So the standard deviation increases as the number of folds increases. With more folds, each training set grows and each validation set shrinks (the validation fold is $1/k$ of the data). A smaller validation fold means each accuracy estimate is computed from fewer samples, so the per-fold accuracies vary more from fold to fold.

YOUR ANSWER HERE
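One way to explore this question empirically is to repeat cross validation for several fold counts and compare the spread of the per-fold accuracies. The sketch below is only illustrative: it uses the bundled iris data without standardization, default kNN settings apart from `n_neighbors=5`, and a single random shuffle, so the exact values will differ from those reported above and the trend is not guaranteed for every seed.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn_q1 = KNeighborsClassifier(n_neighbors=5)

std_by_k = {}
for n_folds in (5, 10, 15):
    cv = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    scores = cross_val_score(knn_q1, X, y, cv=cv)
    # Spread of the per-fold validation accuracies for this fold count
    std_by_k[n_folds] = np.std(scores)
    print('k = {:2d}: validation fold size = {:3d}, std = {:.3f}'.format(
        n_folds, len(X) // n_folds, std_by_k[n_folds]))
```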

### Hyperparameter selection using cross validation and grid search

In this exercise, we will use the [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) again but this time we will perform hyperparameter selection using k-fold cross validation and Grid Search. 

We have three model choices (hyperparameters) for our kNN model:
- Number of neighbors ($k$ or `n_neighbors`). We will consider all integer values $k \in [1,10]$.
- Whether to treat all neighbors equally when taking majority vote, or weight them according to their distance from the query point (`weights='uniform'` or `weights='distance'`).
- The distance metric for computing distance between query point and neighbors (`metric` argument). We will consider three options for `metric`: `'l1'`, `'l2'`, and `'cosine'`.

**Question 2: How many total combinations of the above hyperparameter choices are there?**

**Answer:**
10 x 2 x 3 = 60

YOUR ANSWER HERE
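The count can also be checked programmatically. A minimal sketch, assuming the three lists of candidate values described above, uses scikit-learn's `ParameterGrid`, which enumerates every combination in a grid (`candidate_values` is a name chosen here for illustration):

```python
from sklearn.model_selection import ParameterGrid

# Candidate values for each hyperparameter, as listed above
candidate_values = {
    'n_neighbors': list(range(1, 11)),      # 10 options
    'weights': ['uniform', 'distance'],     # 2 options
    'metric': ['l1', 'l2', 'cosine'],       # 3 options
}

n_combinations = len(ParameterGrid(candidate_values))
print(n_combinations)  # 60
```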

Instead of implementing cross validation manually as we did in the previous example, we will use the [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) class in sklearn to perform grid search and cross validation simultaneously. 

First, we will split the data into a training (70\%) and test (30\%) set.

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data[data.columns[:-1]], 
                                                    data['class'], 
                                                    test_size=0.3, 
                                                    random_state=seed)

We will then use the training set for cross validation and grid search to select the optimal hyperparameter settings.

Next, we define the values for grid search using a dictionary in which the keys are the parameter names to be passed to the model function and each corresponding value is a list of possible values to try in grid search.

In [11]:
param_grid = {'n_neighbors': list(range(1, 11)), 
              'weights': ['uniform', 'distance'],
              'metric': ['l1', 'l2', 'cosine']
             }

Next, we instantiate a KNeighborsClassifier but do not specify the hyperparameter settings yet.

In [12]:
knn = KNeighborsClassifier()

We can then pass this classifier and our parameter grid to a new GridSearchCV object and fit the GridSearchCV using our training data.

In [13]:
from sklearn.model_selection import GridSearchCV

clf = GridSearchCV(knn, param_grid)

clf.fit(X_train, y_train)

GridSearchCV(estimator=KNeighborsClassifier(),
             param_grid={'metric': ['l1', 'l2', 'cosine'],
                         'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                         'weights': ['uniform', 'distance']})

The cross validation results are stored in the `cv_results_` attribute of the GridSearchCV object. This attribute is a dictionary whose keys are column headers and whose values are columns, so it can be loaded directly into a pandas DataFrame.

In [14]:
cv_results = pd.DataFrame(clf.cv_results_)

cv_results

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_metric | param_n_neighbors | param_weights | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.001736 | 0.000494 | 0.00218 | 0.000504 | l1 | 1 | uniform | {'metric': 'l1', 'n_neighbors': 1, 'weights': ... | 0.857143 | 0.904762 | 1.0 | 0.904762 | 0.952381 | 0.92381 | 0.048562 | 37 |
| 1 | 0.001453 | 1e-05 | 0.001506 | 3.9e-05 | l1 | 1 | distance | {'metric': 'l1', 'n_neighbors': 1, 'weights': ... | 0.857143 | 0.904762 | 1.0 | 0.904762 | 0.952381 | 0.92381 | 0.048562 | 37 |
| 2 | 0.001514 | 0.00015 | 0.002413 | 0.000627 | l1 | 2 | uniform | {'metric': 'l1', 'n_neighbors': 2, 'weights': ... | 0.857143 | 0.904762 | 1.0 | 0.857143 | 0.952381 | 0.914286 | 0.055533 | 44 |
| 3 | 0.001486 | 1.9e-05 | 0.001513 | 2.6e-05 | l1 | 2 | distance | {'metric': 'l1', 'n_neighbors': 2, 'weights': ... | 0.857143 | 0.904762 | 1.0 | 0.904762 | 0.952381 | 0.92381 | 0.048562 | 37 |
| 4 | 0.001546 | 0.000175 | 0.001935 | 5.5e-05 | l1 | 3 | uniform | {'metric': 'l1', 'n_neighbors': 3, 'weights': ... | 0.904762 | 1.0 | 1.0 | 0.857143 | 0.952381 | 0.942857 | 0.055533 | 10 |
| 5 | 0.00152 | 8.1e-05 | 0.001592 | 9.5e-05 | l1 | 3 | distance | {'metric': 'l1', 'n_neighbors': 3, 'weights': ... | 0.904762 | 1.0 | 1.0 | 0.904762 | 0.952381 | 0.952381 | 0.042592 | 2 |
| 6 | 0.001936 | 0.000641 | 0.002594 | 0.001105 | l1 | 4 | uniform | {'metric': 'l1', 'n_neighbors': 4, 'weights': ... | 0.857143 | 1.0 | 1.0 | 0.904762 | 1.0 | 0.952381 | 0.060234 | 2 |
| 7 | 0.001492 | 0.000136 | 0.001545 | 0.000113 | l1 | 4 | distance | {'metric': 'l1', 'n_neighbors': 4, 'weights': ... | 0.857143 | 1.0 | 1.0 | 0.904762 | 0.952381 | 0.942857 | 0.055533 | 10 |
| 8 | 0.001731 | 0.000269 | 0.002331 | 0.000516 | l1 | 5 | uniform | {'metric': 'l1', 'n_neighbors': 5, 'weights': ... | 0.809524 | 1.0 | 1.0 | 0.904762 | 0.952381 | 0.933333 | 0.07127 | 27 |
| 9 | 0.001456 | 1.6e-05 | 0.001513 | 1.9e-05 | l1 | 5 | distance | {'metric': 'l1', 'n_neighbors': 5, 'weights': ... | 0.809524 | 1.0 | 1.0 | 0.904762 | 0.952381 | 0.933333 | 0.07127 | 27 |
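The full results table has one row per hyperparameter combination, so it is often easier to read after sorting by `rank_test_score` and keeping only a few columns. The sketch below is self-contained and makes some assumptions: it uses the bundled iris data without standardization, and the names `search`, `results`, and `top` are chosen here for illustration, so the scores will not match the table above exactly.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

grid = {'n_neighbors': list(range(1, 11)),
        'weights': ['uniform', 'distance'],
        'metric': ['l1', 'l2', 'cosine']}

# GridSearchCV uses 5-fold cross validation by default
search = GridSearchCV(KNeighborsClassifier(), grid)
search.fit(X_tr, y_tr)

results = pd.DataFrame(search.cv_results_)

# Keep the most informative columns and show the best-ranked settings first
summary_cols = ['param_metric', 'param_n_neighbors', 'param_weights',
                'mean_test_score', 'std_test_score', 'rank_test_score']
top = results.sort_values('rank_test_score')[summary_cols].head(5)
print(top)
```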


Look at the [GridSearchCV documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) to read about the other attributes stored after fitting. Print the value of the attribute that gives the parameter settings for the best results on the held-out data.

In [15]:
# YOUR CODE HERE
clf.best_params_

{'metric': 'l1', 'n_neighbors': 6, 'weights': 'uniform'}

Train a new kNN classifier using the hyperparameter settings that GridSearchCV found to give the best results on the held-out data (the values printed in the last cell). Train it on the full training set.

In [16]:
# YOUR CODE HERE
knn = KNeighborsClassifier(n_neighbors=6, metric='l1', weights='uniform')
knn.fit(X_train, y_train)

KNeighborsClassifier(metric='l1', n_neighbors=6)

Apply the trained classifier to the test dataset and print the test accuracy.

In [17]:
# YOUR CODE HERE
y_pred = knn.predict(X_test)
print('Accuracy on test data %.2f' % (accuracy_score(y_test, y_pred)))

Accuracy on test data 0.98
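The last two steps can also be sketched without typing the best settings by hand. With the default `refit=True`, GridSearchCV retrains a model with the best settings on the full training set and exposes it as `best_estimator_`. Alternatively, `best_params_` can be unpacked into a new classifier. The sketch below assumes the bundled iris data without standardization and uses illustrative names (`search`, `manual_knn`), so the accuracy may differ from the value above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

grid = {'n_neighbors': list(range(1, 11)),
        'weights': ['uniform', 'distance'],
        'metric': ['l1', 'l2', 'cosine']}

search = GridSearchCV(KNeighborsClassifier(), grid)  # refit=True by default
search.fit(X_tr, y_tr)

# Option 1: unpack the best settings into a new classifier and train it
manual_knn = KNeighborsClassifier(**search.best_params_)
manual_knn.fit(X_tr, y_tr)
manual_acc = accuracy_score(y_te, manual_knn.predict(X_te))

# Option 2: reuse the model that GridSearchCV already refit on the training set
refit_acc = search.score(X_te, y_te)

print('Best settings:', search.best_params_)
print('Accuracy on test data (manual refit) %.2f' % manual_acc)
print('Accuracy on test data (best_estimator_) %.2f' % refit_acc)
```

Both options train the same model on the same data, so the two accuracies should agree.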
