# Python Machine Learning for Biology
# Hyperparameter Tuning

What is a hyperparameter?    

We'll go over some best practices for building machine learning models by fine-tuning hyperparameters and evaluating model performance.  

We'll cover:  
* Cross-Validation: Getting unbiased estimates of model performance
* Learning and Validation Curves: Diagnosing common problems
* GridSearch: Fine-tuning machine learning algorithms

# Independent Work (Review)
Perform a logistic regression on the cancer dataset
1. import the cancer dataset
2. create X and y variables
3. encode categorical variables
4. split data into testing and training datasets (80:20)
5. standardize the data
6. perform a logistic regression
7. report the accuracy score

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import matplotlib.pyplot as plt
%matplotlib inline
```

(Side note: we can figure out which integer the `LabelEncoder` assigned to each class of tumor.)
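
As a minimal sketch of that side note, the snippet below fits a `LabelEncoder` and reads the mapping back from its `classes_` attribute. The labels `"M"` and `"B"` are hypothetical stand-ins for the tumor classes; your dataset's diagnosis column may use different values.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for the tumor diagnosis column
diagnosis = ["M", "B", "B", "M", "B"]

le = LabelEncoder()
y = le.fit_transform(diagnosis)

# classes_ is sorted, and each class's position is its integer code
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# transform() answers the same question for specific labels
print(le.transform(["M", "B"]))
```

`le.inverse_transform` goes the other way, from integer codes back to the original labels.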

## Cross Validation

*Let's review*
* Why don't we evaluate using our training data?
* What is overfitting? 
* What is underfitting?  
* What are the drawbacks of train/test/split?

Two techniques to try to figure out our model's generalization error are **holdout validation** and **k-fold cross validation.** 

### Holdout validation (AKA Train/Test/Split)

We've been doing holdout validation, where we separate the dataset into training and testing datasets. But if we do lots of **model selection**, that is, tuning our hyperparameters to see which values give us the best model, we end up reusing that same test dataset over and over again. The test set then effectively becomes part of the training process, so the model is likely to overfit to it and our performance estimate will be too optimistic.  

A better way of using the holdout method is to divide the dataset into three parts: a training set, a validation set, and a test set. Use the training set to fit the model, use the validation set to compare performance among different models, and use the test set to estimate how well the final model generalizes. This gives a much less biased estimate because the model has never seen the test data before.  
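
One way to sketch the three-way split is to call `train_test_split` twice. This example assumes scikit-learn's built-in breast cancer data (`load_breast_cancer`) as a stand-in for the course's cancer dataset, and the 60/20/20 proportions are an illustrative choice, not a rule.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Assumption: the built-in breast cancer data stands in for our cancer dataset
X, y = load_breast_cancer(return_X_y=True)

# First carve off 20% of the data as the final test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# Then split the remaining 80% into training (60% overall)
# and validation (20% overall): 0.25 * 0.8 = 0.2
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=1
)

print(len(X_train), len(X_val), len(X_test))
```

Fit on `X_train`, compare candidate models on `X_val`, and touch `X_test` only once at the very end.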

<img src="assets/traintestsplit.png"/>

A disadvantage of this method is that it is sensitive to how we divide up the data. 

*But what if we created a bunch of train/test/splits, calculated the test accuracy for each, and averaged these?* That is the essence of **k-fold cross validation.**

### K-fold Cross Validation

1. Split the data into *k* sets (folds) without replacement. 
2. Use *k-1* sets on model training and use 1 for model testing. 
3. Repeat *k* times, using a different set for the testing set each time. We'll have *k* models and *k* performance estimates.  

Then we can calculate the average performance of the model across the *k* folds, so we have a performance estimate that is less sensitive to how we sliced and diced the data. 

The standard value of *k* that people use is 10 (it has been shown in experiments to give a good estimate of out-of-sample accuracy). It's a good idea to use a larger *k* if you are working with a smaller dataset, since the higher your *k*, the more data each model trains on and the lower the bias of the estimate. Larger values of *k* will have a slower runtime.  

<img src="assets/kfolds.png"/>

**Stratified k-fold cross validation** can give even better bias and variance estimates, especially if you have really unequal class proportions. This method preserves the class proportions in each fold. `cross_val_score` does this by default for classifiers when you pass an integer for `cv`.
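
To see what "preserves the class proportions" means, here is a small sketch using `StratifiedKFold` on made-up labels (20 observations of class 0 and 5 of class 1). The data is purely illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy, imbalanced labels: 80% class 0, 20% class 1
y = np.array([0] * 20 + [1] * 5)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5)

fold_counts = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    # count how many observations of each class land in the test fold
    counts = np.bincount(y[test_idx], minlength=2)
    fold_counts.append(tuple(int(c) for c in counts))
    print("Fold {}: class counts in test fold = {}".format(fold, counts))
```

Each test fold keeps the same 4:1 ratio as the full dataset, which a plain `KFold` without shuffling would not do on labels sorted like these.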

***Train/test/split may still be the better option if you need speed***

#### Simulate splitting a dataset of 25 observations into 5 folds

```python
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=False).split(range(25))

# print the contents of each training and testing set
print('{} {:^61} {}'.format('Iteration', 'Training set observations', 'Testing set observations'))
for iteration, data in enumerate(kf, start=1):
    print('{:^9} {} {:^25}'.format(iteration, data[0], str(data[1])))
```

```
Iteration                   Training set observations                   Testing set observations
    1     [ 5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]        [0 1 2 3 4]       
    2     [ 0  1  2  3  4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]        [5 6 7 8 9]       
    3     [ 0  1  2  3  4  5  6  7  8  9 15 16 17 18 19 20 21 22 23 24]     [10 11 12 13 14]     
    4     [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 20 21 22 23 24]     [15 16 17 18 19]     
    5     [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]     [20 21 22 23 24]     
```


#### Perform a stratified k-fold cross validation on the cancer dataset
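
A minimal sketch of how this could look, assuming scikit-learn's built-in `load_breast_cancer` data as a stand-in for our cancer dataset and a logistic regression as the model. The scaler goes inside a pipeline so that it is re-fit on the training portion of every fold rather than on the full dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the built-in breast cancer data stands in for our cancer dataset
X, y = load_breast_cancer(return_X_y=True)

# Standardize, then fit a logistic regression, all within each fold
pipe_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 10 stratified folds; shuffling with a fixed seed keeps results repeatable
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(pipe_lr, X, y, cv=cv, scoring="accuracy")

print("Accuracy per fold:", scores.round(3))
print("Mean accuracy: {:.3f} +/- {:.3f}".format(scores.mean(), scores.std()))
```

Reporting the standard deviation alongside the mean gives a feel for how much the estimate moves from fold to fold.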

## Independent Work
Select the best hyperparameter value (K) for a KNN model on the iris dataset using stratified cross validation scores.

**Bonus** Compare the best K of KNN to a Logistic Regression for the iris dataset to see which model performs better (with stratified cross validation).

## Grid Search: fine-tuning models

*Review:* 
Which parameters does the machine learning model "learn"? Which are parameters we have to tune?  

Validation curves help us figure out an optimal value of one hyperparameter. Grid search helps us find optimal combinations of hyperparameters.

This will be more efficient than the for loops we were using when trying to find the best K for KNN.

#### Perform a grid search on an SVM of our cancer data
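
Here is one way this might be set up with `GridSearchCV`. It assumes the built-in `load_breast_cancer` data as a stand-in for our cancer dataset, and the candidate values for `C` and `gamma` are illustrative choices rather than recommended defaults.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the built-in breast cancer data stands in for our cancer dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# make_pipeline names the SVC step "svc", hence the "svc__" prefixes below
pipe_svc = make_pipeline(StandardScaler(), SVC(random_state=1))

param_range = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
param_grid = [
    {"svc__C": param_range, "svc__kernel": ["linear"]},
    {"svc__C": param_range, "svc__gamma": param_range, "svc__kernel": ["rbf"]},
]

# Every combination in the grid is scored with 5-fold cross validation
gs = GridSearchCV(pipe_svc, param_grid, scoring="accuracy", cv=5)
gs.fit(X_train, y_train)

print("Best CV accuracy: {:.3f}".format(gs.best_score_))
print("Best parameters:", gs.best_params_)

# By default the best model is refit on all of X_train, so we can score it directly
test_acc = gs.score(X_test, y_test)
print("Test accuracy: {:.3f}".format(test_acc))
```

Using a list of two dictionaries avoids wasting time on `gamma` for the linear kernel, where it has no effect.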

## Independent Exercise
Perform a grid search to find the best K for a KNN model on the iris dataset

### Reducing Computational Expense Using `RandomizedSearchCV`
* Doing an exhaustive search over many different parameters at once can quickly become computationally infeasible
* `RandomizedSearchCV` searches a random subset of the parameter combinations, and you control the computational "budget" (via `n_iter`)

#### Specify parameter distributions rather than a parameter grid
We'll use the iris dataset and KNN for this just to demo. *Note: If we had a continuous parameter, we would need to specify a continuous distribution*
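
A rough sketch of the idea with KNN on iris follows. The range for `n_neighbors`, the `weights` options, and the budget of 10 iterations are illustrative assumptions. `scipy.stats.randint` provides a discrete distribution to sample from; a continuous parameter would need a continuous distribution instead.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Distributions (or lists) to sample from, instead of an exhaustive grid
param_dist = {
    "n_neighbors": randint(1, 31),       # integers from 1 to 30
    "weights": ["uniform", "distance"],  # lists are sampled uniformly
}

# n_iter is the "budget": only 10 sampled combinations are evaluated
rs = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions=param_dist,
    n_iter=10,
    cv=10,
    scoring="accuracy",
    random_state=1,
)
rs.fit(X, y)

print("Best CV accuracy: {:.3f}".format(rs.best_score_))
print("Best parameters:", rs.best_params_)
```

An exhaustive grid over the same space would evaluate 60 combinations, so this run does roughly one sixth of the work.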

Generally, we recommend starting with `GridSearchCV` and switching to `RandomizedSearchCV` only if things get computationally hairy. 

### Nested cross-validation

Earlier we combined k-fold cross validation and grid search to fine-tune our hyperparameters. A better way to do this, when we also want a trustworthy estimate of how the tuned model performs, is with **nested cross-validation.**  

**Nested cross-validation** is when we have an outer k-fold cross-validation loop to split the data into training and testing folds and an inner loop used to select a model using k-fold cross-validation on the training fold. After model selection, we evaluate model performance on our test fold. 

<img src="assets/nestedcv.png"/>

#### Nested cross-validation on our cancer dataset with an SVM (This is a 5x2 cross-validation)
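
A minimal sketch of the 5x2 setup: `GridSearchCV` with 2 folds is the inner loop, and passing it to `cross_val_score` with 5 folds supplies the outer loop. As before, `load_breast_cancer` is an assumed stand-in for our cancer dataset and the parameter values are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the built-in breast cancer data stands in for our cancer dataset
X, y = load_breast_cancer(return_X_y=True)

pipe_svc = make_pipeline(StandardScaler(), SVC(random_state=1))

param_range = [0.01, 0.1, 1.0, 10.0, 100.0]
param_grid = [
    {"svc__C": param_range, "svc__kernel": ["linear"]},
    {"svc__C": param_range, "svc__gamma": param_range, "svc__kernel": ["rbf"]},
]

# Inner loop: 2-fold grid search picks hyperparameters on each training fold
gs = GridSearchCV(pipe_svc, param_grid, scoring="accuracy", cv=2)

# Outer loop: 5-fold cross validation scores the whole tuning procedure
scores = cross_val_score(gs, X, y, scoring="accuracy", cv=5)

print("Nested CV accuracy: {:.3f} +/- {:.3f}".format(scores.mean(), scores.std()))
```

Each outer test fold is never seen by the grid search that selected the hyperparameters, which is what keeps the estimate from being overly optimistic.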

#### Use nested cross-validation to compare SVM to another algorithm
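
As a sketch, the same 5x2 procedure can be wrapped in a small helper and applied to two algorithms. The helper `nested_cv_accuracy` is a hypothetical name introduced here, the decision tree is just one possible comparison model, and `load_breast_cancer` is again an assumed stand-in for our cancer dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: the built-in breast cancer data stands in for our cancer dataset
X, y = load_breast_cancer(return_X_y=True)


def nested_cv_accuracy(estimator, param_grid):
    """Hypothetical helper: 2-fold inner grid search, 5-fold outer evaluation."""
    gs = GridSearchCV(estimator, param_grid, scoring="accuracy", cv=2)
    return cross_val_score(gs, X, y, scoring="accuracy", cv=5)


svm_scores = nested_cv_accuracy(
    make_pipeline(StandardScaler(), SVC(random_state=1)),
    {
        "svc__C": [0.1, 1.0, 10.0],
        "svc__gamma": [0.001, 0.01, 0.1],
        "svc__kernel": ["rbf"],
    },
)

tree_scores = nested_cv_accuracy(
    DecisionTreeClassifier(random_state=1),
    {"max_depth": [1, 2, 3, 4, 5, 6, 7, None]},
)

print("SVM:  {:.3f} +/- {:.3f}".format(svm_scores.mean(), svm_scores.std()))
print("Tree: {:.3f} +/- {:.3f}".format(tree_scores.mean(), tree_scores.std()))
```

Because both algorithms go through the same outer folds and the same tuning procedure, the two mean accuracies are directly comparable.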