# Classifier Optimization
**V.0.1 - Alpha testing, [contributions](#contributions)**

In earlier exercises, we explored a variety of classifiers and feature selection techniques. During the past exercises we didn't pay much attention to the parameters of these procedures and set them arbitrarily or based on intuition. In what follows we are going to investigate data-driven, unbiased techniques to optimize classification pipelines.

Cross-validation allows us to pick the best-performing parameters in an unbiased fashion. We will use scikit-learn to build up some cross-validation analyses. scikit-learn also offers a simple procedure for building and automating the various steps involved in classifier optimization (e.g. data scaling => feature selection => parameter tuning). This is part of the [Pipeline package](http://scikit-learn.org/stable/modules/pipeline.html#pipeline). We will also explore these methods in this exercise.

## Goal of this script
1. Build a pipeline of steps to optimize classifier performance.    
2. Use the pipeline to make optimal choices.

**Recap:** The localizer data we are working with ([Kim et al., 2017](https://doi.org/10.1523/JNEUROSCI.3272-16.2017)) consists of 3 runs with 5 blocks for each category. Each run was 310 TRs. In the MATLAB stimulus file, the first row has the stimulus labels for the 1st, 2nd and 3rd runs of the localizer, and the 4th row contains the time when the stimulus was presented for each of the runs. The stimulus labels and their corresponding categories are as follows: 1 = Faces, 2 = Scenes, 3 = Objects.


## Table of Contents
[1. Make preprocessing pipeline](#preprocessing)  

[2. How big should a training set be?](#training)  

[3. Cross-validation: Hyper-parameter selection](#cross_val)   

[4. How to avoid double dipping](#double_dipping)  
>[4.1 Example of double dipping](#example_double_dip)  

[5. Build a pipeline](#pipeline)  

Exercises
>[Exercise 1](#ex1)  
>[Exercise 2](#ex2)  
>[Exercise 3](#ex3)  
>[Exercise 4](#ex4)  
>[Exercise 5](#ex5)  
>[Exercise 6](#ex6)  
>[Exercise 7](#ex7)  
>[Exercise 8](#ex8)  
>[Exercise 9](#ex9)  

[Novel contribution](#novel)  

In [None]:
# Import fMRI and general analysis libraries
import nibabel as nib
import numpy as np
import scipy.io
from scipy import stats
import pandas as pd

# Import plotting library
import matplotlib.pyplot as plt
%matplotlib notebook

# Import machine learning libraries
from nilearn.input_data import NiftiMasker
from sklearn import preprocessing
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline


## 1. Make preprocessing pipeline <a id="preprocessing"></a>

In past notebooks we have preprocessed the fMRI data from [Kim et al. (2017)](https://doi.org/10.1523/JNEUROSCI.3272-16.2017) using the following steps:  
>Extract the BOLD data for a mask.  
>Get the stimulus labels.  
>Assign a label to every TR.  
>Shift the label time course to take account of the hemodynamic lag.  
>Extract BOLD data only for the conditions of interest (ignore the TRs corresponding to the baseline, i.e. when there was no task).  
>Average stimuli within blocks in order to reduce concerns around temporal autocorrelation.
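
To make the label-shifting step concrete, here is a minimal sketch on a toy label time course. It only illustrates the idea; it is not the implementation in *utils.py*, and the toy labels and the way the shift is applied (padding the start with rest and dropping the tail) are assumptions made for this example.

```python
import numpy as np

TR = 1.5        # seconds per volume (the value used in this notebook)
hrf_lag = 4.5   # assumed lag to the peak BOLD response, in seconds
shift_size = int(hrf_lag / TR)  # the lag expressed in TRs

# Toy label time course: 0 = rest, 1 = faces, 2 = scenes (one entry per TR)
labels = np.array([0, 0, 1, 1, 1, 0, 0, 2, 2, 2, 0, 0])

# Delay the labels by shift_size TRs: pad the start with rest and drop the tail
shifted = np.concatenate([np.zeros(shift_size, dtype=labels.dtype),
                          labels[:-shift_size]])

print(shift_size)
print(labels)
print(shifted)
```

After the shift, each label lines up with the TRs in which the BOLD response to that stimulus is expected to peak, rather than with the TRs in which the stimulus was on screen.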

In general it can be useful to make a script that contains all of the functions you might use across multiple scripts. That way, if you update a function, you only have to change it in one place rather than in every script that would otherwise define its own copy. Often these will be Python scripts called *utils.py*.

**Self-Study:** Explore the *utils.py* script to see how it is possible to make this kind of script.

In [None]:
# We still have to import the functions of interest
from utils import load_data, load_labels, label2TR, shift_timing, reshape_data, blockwise_sampling

Get the data from one participant ready for analysis.

In [None]:
# Preset variables
dir = '/opt/public_FMRI/vdc/'
num_runs=3
TR=1.5
hrf_lag = 4.5  # In seconds what is the lag between a stimulus onset and the peak bold response
shift_size = int(hrf_lag / TR)  # Convert the shift into TRs

sub_id = 1

# Convert the number into a zero-padded participant folder name (e.g. 1 -> '01')
sids = str(sub_id).zfill(2)

# Specify the subject name
sub = 'sub-' + sids

# Load subject labels
stim_label_allruns = load_labels(dir, sub)

# Load the fMRI data using a whole-brain mask
epi_mask_data_all, _ = load_data(directory=dir, subject_name=sub, mask_name='', zscore_data=True)

# This can differ per participant
print(sub, '= TRs: ', epi_mask_data_all.shape[1], '; Voxels: ', epi_mask_data_all.shape[0])
TRs_run = int(epi_mask_data_all.shape[1] / num_runs)

# Convert the timing into TR indexes
stim_label_TR = label2TR(stim_label_allruns, num_runs, TR, TRs_run)

# Shift the data some amount
stim_label_TR_shifted = shift_timing(stim_label_TR, shift_size)

# Perform the reshaping of the data
bold_data, labels = reshape_data(stim_label_TR_shifted, epi_mask_data_all)

# Down sample the data to be blockwise rather than trialwise
bold_data, labels = blockwise_sampling(bold_data, labels)

## 2. How big should a training set be? <a id="training"></a>

When we split up our data into training and test sets we are trying to strike a balance between giving our classifier enough data to train a model with precise parameter estimates and keeping enough data for testing so that our test statistic has low variance. But what is that balance? If you google that, the common answer that you will find is: It depends! Generally, we use a rule of thumb that between 10% and 20% of our dataset should be the test set. However, let's now investigate how different training set sizes affect classifier performance in a data-driven manner!

Aside: Not only do your training samples need to be independent, but so do your test samples. If the test samples are highly correlated then the effective number of test samples is lower and the test statistic variance will be higher.
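
The link between the number of independent test samples and the variance of the test statistic can be illustrated with a short simulation. This is only a sketch: the "true" accuracy of 0.7 and the test set sizes are made-up values, and each test sample is assumed to be an independent correct/incorrect outcome.

```python
import numpy as np

rng = np.random.default_rng(0)
true_accuracy = 0.7   # hypothetical accuracy of some classifier
n_repeats = 5000      # simulated experiments per test set size

spread = {}
for n_test in [5, 20, 100]:
    # Each simulated test set yields n_test independent correct/incorrect outcomes
    n_correct = rng.binomial(n_test, true_accuracy, size=n_repeats)
    observed_accuracy = n_correct / n_test
    spread[n_test] = observed_accuracy.std()
    # Standard deviation predicted for a binomial proportion
    theoretical = np.sqrt(true_accuracy * (1 - true_accuracy) / n_test)
    print(n_test, round(spread[n_test], 3), round(theoretical, 3))
```

The spread of the observed accuracy shrinks as the number of independent test samples grows. Correlated test samples behave like a smaller test set, so they give a noisier estimate than their raw count suggests.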

In [None]:
# Run a basic n fold classification
def classification(classifier, data, labels, n_folds=5, test_size=0.2):
    
    # How many folds of the classifier
    skfold = StratifiedShuffleSplit(n_splits=n_folds, test_size=test_size) 

    clf_score = np.array([])
    for train, test in skfold.split(data, labels):

        # Pull out the sample data
        train_data = data[train, :]
        test_data = data[test, :]
        
        # Train and test the classifier
        clf = classifier.fit(train_data, labels[train])
        clf_score = np.hstack((clf_score, clf.score(test_data, labels[test])))

    return clf_score.mean()

**Exercise 1:**<a id="ex1"></a> Use a Support Vector Machine Classifier to examine how the accuracy of the classifier changes with different test set sizes from 10% to 90% in 10% steps. Plot the results. Do this over 10 folds to decrease the variability of the results.

## 3. Cross-Validation: Hyper-parameter selection <a id="cross_val"></a>

Each of the classifiers we have used so far has one or more "hyper-parameters" used to configure and optimize the model based on the data and our goals. Read [this Machine Learning Mastery Article](https://machinelearningmastery.com/difference-between-a-parameter-and-a-hyperparameter/) for an explanation of the distinction between hyper-parameters and parameters. For instance, regularized logistic regression has a "penalty" hyper-parameter, which determines how much to emphasize the weight regularizing expression (e.g., L2 norm) when training the model.
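
As a small illustration of the distinction, the sketch below fits a regularized logistic regression twice on synthetic data (not the fMRI data from this notebook). It assumes scikit-learn's `LogisticRegression`, where the regularization strength is controlled by `C`; note that `C` is the *inverse* of the regularization strength, so a small `C` means strong regularization.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the fMRI data used in this notebook)
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

weight_norms = {}
for C in [0.01, 100]:
    # C is a hyper-parameter: we choose it before fitting
    model = LogisticRegression(C=C, max_iter=5000)
    model.fit(X, y)
    # coef_ holds parameters: they are learned from the data during fitting
    weight_norms[C] = np.linalg.norm(model.coef_)
    print(C, weight_norms[C])
```

The weights learned with strong regularization (small `C`) have a smaller norm than those learned with weak regularization, even though the data and the model class are identical.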

**Exercise 2:**<a id="ex2"></a> SVM has a "cost" hyper-parameter, aka soft-margin hyper-parameter. Briefly describe what it does:

**A:**

We want to pick the best cost hyper-parameter for our dataset and to do this we will use cross-validation. Each hyper-parameter can be considered as a dimension such that the set of hyper-parameters is a space to be searched for effective values. The [GridSearchCV method in scikit-learn](http://scikit-learn.org/stable/modules/grid_search.html#grid-search) explores this space by dividing it up into a grid of values to be searched exhaustively. 

To give you an intuition for how grid search works, imagine trying to figure out what climate you find most comfortable. Let's say that there are two (hyper-)parameters that seem relevant: temperature and humidity. A given climate can be defined by the combination of values of these two parameters and you could report how comfortable you find this climate. A grid search would involve changing the value of each parameter with respect to the other in some fixed step size (e.g., 60 degrees and 50% humidity, 60 degrees and 60% humidity, 65 degrees and 60% humidity, etc.) and evaluating your preference for each combination.  

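Note that the number of steps and hyper-parameters to search is up to you. But be aware of combinatorial explosion: the number of combinations to evaluate is the product of the number of values tried for each hyper-parameter, so the search time grows with the granularity of the search (smaller steps mean more values) and exponentially with the number of hyper-parameters considered.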
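
To get a feel for how quickly a grid grows, the sketch below counts the combinations in two made-up grids using scikit-learn's `ParameterGrid`. The specific hyper-parameter names and values are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grids, chosen only to show how the number of combinations grows
small_grid = {'C': [0.1, 1, 10]}
larger_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1], 'degree': [2, 3, 4]}

n_small = len(ParameterGrid(small_grid))
n_larger = len(ParameterGrid(larger_grid))
print(n_small, n_larger)

# Every combination is fit once per cross-validation split
n_splits = 3
n_fits = n_larger * n_splits
print(n_fits)
```

Going from one hyper-parameter to three, with three values each, raises the number of combinations from 3 to 27, and each of those is then fit once per cross-validation split.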

GridSearchCV is an *extremely* useful tool for [hyper-parameter optimization](http://scikit-learn.org/stable/modules/grid_search.html#grid-search) because it is very flexible. You can look at different values of a hyper-parameter, different [kernels](http://scikit-learn.org/stable/modules/svm.html), different training/test split sizes, etc. The input is a dictionary where the key is the parameter of interest (the sides of the grid) and the values are the parameter increments to search over (the steps of the grid).

**Exercise 3:**<a id="ex3"></a> Grid search can be slow because it evaluates every possible combination of hyper-parameters. Can you think of a more efficient way to find good hyper-parameter settings? (Hint: How can you narrow the search?)

**A:**


Below we are going to do a grid search over the SVM cost parameter and investigate the results. The output contains information about the best hyper-parameter.

In [None]:
# Search over different cost parameters
parameters = {'C':[0.01, 0.1, 1, 10]}
clf = GridSearchCV(SVC(kernel='linear'),
                   parameters,
                   cv=StratifiedShuffleSplit(n_splits=3, test_size=0.1),
                   return_train_score=True)
clf.fit(bold_data, labels);

# Print the results
print(clf.best_estimator_)  # What was the best classifier and cost?
print(clf.best_score_)  # What was the best classification score?

Want to see more details from the cross validation? All the results are stored in the dictionary cv\_results\_. Let's take a look at some of the important metrics stored here. For more details you can look at the [cv\_results\_ attribute in the scikit-learn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

You can print out cv\_results\_ directly or, for a nicer look, you can import it into a pandas dataframe and print that. Each row corresponds to one parameter combination.

([Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html) is a widely used data processing and analysis package. Some people love it more than the animal.)

In [None]:
# Ugly way
print(clf.cv_results_)
print("\nUh, that is not in a good human-readable format.\n")

# Nicer way (using pandas)
results = pd.DataFrame(clf.cv_results_)
print("It's much easier to read this way, after converting it to a pandas dataframe: \n")
print(results)
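
The full dataframe has many columns, most of them per-split scores and timings. A possible way to focus on the essentials is to keep a few summary columns and sort by rank. The sketch below does this on synthetic data (not the fMRI data from this notebook) so that it can be run on its own.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (not the fMRI data used in this notebook)
X, y = make_classification(n_samples=120, n_features=30, random_state=0)

search = GridSearchCV(SVC(kernel='linear'), {'C': [0.01, 0.1, 1, 10]}, cv=3)
search.fit(X, y)

# Keep only the columns that summarize each hyper-parameter setting
columns = ['param_C', 'mean_test_score', 'std_test_score', 'rank_test_score']
summary = pd.DataFrame(search.cv_results_)[columns].sort_values('rank_test_score')
print(summary)
```

The first row of the sorted table is the setting reported by `best_params_`, and its `mean_test_score` is the value reported by `best_score_`.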

We are now going to do some different types of cross-validation hyper-parameter tuning.

**Exercise 4:**<a id="ex4"></a> In machine learning, a kernel is a function that measures the similarity between pairs of samples, which lets a linear method such as SVM learn non-linear decision boundaries. The (Gaussian) radial basis function (RBF) kernel is very common in SVM classifiers. Briefly describe what it does:

**A:**

In [None]:
# Search over different cost and gamma parameters of a radial basis kernel
# Note: in Python, 10e-3 is 0.01, 10e0 is 10 and 10e3 is 10000
parameters = {'gamma':[10e-3, 10e0, 10e3], 'C':[10e-3, 10e0, 10e3]}
clf = GridSearchCV(SVC(kernel='rbf'),
                   parameters,
                   cv=StratifiedShuffleSplit(n_splits=3, test_size=0.1))
clf.fit(bold_data, labels)
print(clf.best_estimator_)  # What was the best classifier and parameters?
print(clf.best_score_)  # What was the best classification score?

**Exercise 5:**<a id="ex5"></a>  When would linear SVM be expected to outperform other kernels and why? Answer this question and run an analysis in which you compare linear, polynomial, and RBF kernels for SVM using GridSearchCV. This doesn't mean you run three separate GridSearchCV calls; it means you should treat the kernel as another hyper-parameter to search over (as well as fitting cost and gamma).

**A:**

In [None]:
# Insert code here

When we are writing a classification pipeline, nested cross validation can be very useful. As the name suggests, this procedure nests a second cross-validation within the folds of the first cross validation. As before, we will divide the data into training and test sets (outer loop), but additionally we will divide the training set itself into training and validation sets (inner loop) in order to set the hyper-parameters.

Thus, on each split we now have a training (inner), validation (inner), and test (outer) dataset; a typical dataset size distribution might be 60%, 20%, 20%. Within the inner loop we train the model and find the optimal hyper-parameters (i.e., that have the highest performance when tested on the validation data). The typical practice is to then retrain your model with these hyper-parameters on both the training AND validation datasets and then evaluate on your held-out test dataset to get a score.
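
The sketch below walks through a single outer split by hand, using a 60/20/20 division, to make the roles of the three datasets explicit. It uses synthetic data (not the fMRI data from this notebook), and the list of cost values is arbitrary. A full nested cross-validation would repeat this procedure over several outer splits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (not the fMRI data used in this notebook)
X, y = make_classification(n_samples=200, n_features=30, random_state=0)

# Outer split: hold out 20% as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Inner split: 25% of the remaining 80% gives a 60/20/20 division overall
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

# Pick the cost value that scores best on the validation set
costs = [0.01, 0.1, 1, 10]
val_scores = [SVC(kernel='linear', C=C).fit(X_train, y_train).score(X_val, y_val)
              for C in costs]
best_C = costs[int(np.argmax(val_scores))]

# Retrain on training + validation data, then score once on the test set
final_model = SVC(kernel='linear', C=best_C).fit(X_rest, y_rest)
test_score = final_model.score(X_test, y_test)
print(best_C, test_score)
```

The test set is touched exactly once, after the hyper-parameter has been chosen, which is what keeps the final score an unbiased estimate.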

![image](https://qph.ec.quoracdn.net/main-qimg-bb7689c141427db9ab8ab030745aa8bc)

This is turtles all the way down: you could have any number of inner loops. However, you will quickly run into data issues (not enough data for training) and you will also run the risk of over-fitting your data: you will find the optimal parameters for a small subset of your data but these might not generalize to the rest of your data. For more on the problem of overfitting, take a look at [this short and comprehensible EliteDataScience post](https://elitedatascience.com/overfitting-in-machine-learning).

For more description and a good summary of what you have learnt so far, check [here](http://www.predictiveanalyticsworld.com/patimes/nested-cross-validation-simple-cross-validation-isnt-enough/8952/).

**Self-study:** Some people discourage training on both training and validation data, saying you should only ever use the training data for fitting the model. Figure out why these people hold these views.

**Exercise 6:**<a id="ex6"></a> Set up a nested cross validation loop. In this loop you will perform hyper-parameter cross validation on a training dataset (which itself will be split into training and validation) and then score these optimized hyper-parameters on the test dataset. Perform 10 outer loop folds and 5 inner loop folds. Report the mean classification score for the outer loops.  
Things to watch out for: 
- The optimal hyper-parameter settings for each outer loop fold can be different; don't have a double nested analysis (easy mistake to make if you use GridSearchCV).
- You do not actually need to set up a for-loop for this.
- As always: when in doubt, check whether the [scikit-learn documentation](http://scikit-learn.org/stable/index.html) or [StackExchange Community](https://stackexchange.com/) is helpful.
- Running the nested cross validation will take a couple of minutes. Grab a snack.

## 4. How to avoid double dipping <a id="double_dipping"></a>

One of the good things about the GridSearchCV method is that it makes it easy to prevent double dipping (although double dipping is still possible!). In previous exercises we examined cases where double dipping is clear (e.g., training on all of the data and testing on a subset); however, double dipping can be a lot more subtle and harder to detect.

For instance, a common form of double dipping is Z scoring over both your training and testing datasets together, rather than Z scoring the two groups separately (in fact we are doing it in this exercise right now!). This is double dipping because information from one group affects the other. Imagine a scenario where you were using different runs as your test set. It might be that on a given run the variability in activity is much higher than in all the other runs. By Z scoring over all runs you are decreasing the variability in all the other runs, which could mask any patterns of variability.

In practice, Z scoring can be unavoidable: if we have different runs but we don't want to use them as the basis for our training/test splits (for instance because there are practice effects) then we need to combine samples from different runs. Without normalization, these may have wildly different scales due to scanner drift or other confounds, distorting the classifier. Hence we need to normalize within run but this could be considered double dipping because each run includes both training and test data. Even without these concerns about different scales between runs, we might also worry about Z scoring over small numbers of observations in our test set. In the end, Z scoring is double dipping like jaywalking is illegal.
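
For cases where the training and test sets can be normalized separately, one common approach is to estimate the mean and standard deviation on the training data only and reuse them for the test data. The sketch below shows the mechanics with scikit-learn's `StandardScaler` on simulated numbers; the data and the 30/10 split are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Simulated data: 40 samples x 5 features with an arbitrary offset and scale
data = rng.normal(loc=3.0, scale=2.0, size=(40, 5))
train, test = data[:30], data[30:]

# Learn the mean and standard deviation from the training samples only
scaler = StandardScaler().fit(train)
train_z = scaler.transform(train)
test_z = scaler.transform(test)  # reuses the training statistics

print(np.round(train_z.mean(axis=0), 3))  # zero by construction
print(np.round(test_z.mean(axis=0), 3))   # usually close to, but not exactly, zero
```

Because no statistic of the test samples enters the scaler, nothing about the test set can leak into training. The price is that the test data are no longer exactly zero-mean and unit-variance.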

**Self-study:** Simulate an example where double dipping with Z scoring affects the results. (Hint: make observations that are noisy samples of a given pattern of results where the amount of noise varies.)


**Exercise 7:** <a id="ex7"></a> If we do a 5 x 6 grid search is there a greater risk of double dipping than if we do a 3 x 2? Are there concerns with overfitting?

**A:**

### Example of double dipping<a id="example_double_dip"></a>

Below we work through an example of another common type of double dipping, in which we perform voxel selection on all of our data before splitting it into a training and test dataset.

In [None]:
n_folds = 100  # How many folds of the classifier
test_size = 0.2
skfold = StratifiedShuffleSplit(n_splits=n_folds, test_size=test_size) 

clf_score = np.array([])
for train, test in skfold.split(bold_data, labels):
    
    # Do voxel selection on all voxels
    mean_threshold = np.percentile(np.mean(bold_data, axis=0), 95)
    selected_voxels = np.where(mean_threshold <= np.mean(bold_data, axis = 0))
    
    # Pull out the sample data
    train_data = bold_data[train, :]
    test_data = bold_data[test, :]

    # Train and test the classifier
    classifier = SVC(kernel="linear", C=1)
    clf = classifier.fit(train_data[:, selected_voxels[0]], labels[train])
    score = clf.score(test_data[:, selected_voxels[0]], labels[test])
    clf_score = np.hstack((clf_score, score))

print(clf_score.mean())

**Exercise 8:**<a id="ex8"></a> Create a copy of this code that fixes the concerns about double dipping. 

In [None]:
# Insert code here

## 5. Build a Pipeline <a id="pipeline"></a>

scikit-learn has a class, [Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline), that simplifies running preprocessing steps in an automated fashion. Below we create a pipeline with the following steps:
>Scale the data.  
>Use PCA and choose the best option from a set of dimensions.  
>Choose the best cost hyperparameter value for an SVM.

It is then really easy to do cross validation at different levels of this pipeline.

The steps below are based on [this example in scikit-learn](http://scikit-learn.org/stable/auto_examples/plot_compare_reduction.html#illustration-of-pipeline-and-gridsearchcv).

In [None]:
# Set up the pipeline
pipe = Pipeline([
        ('scale', preprocessing.StandardScaler()),
        ('reduce_dim', PCA()),
        ('classify', SVC(kernel="linear")),
    ])

# PCA dimensions
component_steps = [20, 40]

# Classifier cost options (note: 10e-1 is 1, 10e0 is 10, 10e1 is 100, 10e2 is 1000)
c_steps = [10e-1, 10e0, 10e1, 10e2]

# Build the grid search dictionary
param_grid = [
    {
        'reduce_dim': [PCA(iterated_power=7)], 
        'reduce_dim__n_components': component_steps,
        'classify__C': c_steps,
    },
]
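
The keys of the grid dictionary follow scikit-learn's `<step name>__<parameter>` naming convention, where the step name is the label given to that step when the pipeline was created. The sketch below rebuilds the same three-step pipeline so that it can be run on its own, lists the names that can be tuned, and sets two of them directly.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('reduce_dim', PCA()),
    ('classify', SVC(kernel='linear')),
])

# List every tunable name; nested names use the pattern <step name>__<parameter>
param_names = sorted(pipe.get_params().keys())
print([name for name in param_names if name.startswith('classify__')])

# set_params accepts the same names that a GridSearchCV grid uses as keys
pipe.set_params(reduce_dim__n_components=20, classify__C=10)
print(pipe.named_steps['reduce_dim'].n_components, pipe.named_steps['classify'].C)
```

Printing the available names like this is a quick way to check for typos in a grid before starting a long search.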

Now we are going to put it all together and run the pipeline.

In [None]:
# parallelization parameter, will return to this later...
n_jobs=1

clf_pipe = GridSearchCV(pipe,
                        cv=3,
                        n_jobs=n_jobs,
                        param_grid=param_grid,
                        return_train_score=True
                       )
clf_pipe.fit(bold_data, labels)  # run the pipeline

print(clf_pipe.best_estimator_)  # What was the best classifier and parameters?
print()  # easy way to output a blank line to structure your output
print(clf_pipe.best_score_)  # What was the best classification score?
print()

# sort results with declining mean test score
cv_results = pd.DataFrame(clf_pipe.cv_results_)
print(cv_results.sort_values(by='mean_test_score', ascending=False))

**Exercise 9:**<a id="ex9"></a> Build a pipeline that takes the following steps:

1. Z score the data.  
2. Grid search over PCA and the VarianceThreshold method for reducing the number of features. In other words, test the pipeline with either PCA (dimensionality reduction) or VarianceThreshold (voxel selection) as the feature-reduction step.
3. Grid search over the linear and RBF SVM kernel.

Run this pipeline for at least 5 subjects and present your average results.

In [None]:
# Insert your code here

**Novel contribution:**<a id="novel"></a> be creative and make one new discovery by adding an analysis, visualization, or optimization.

## Contributions <a id="contributions"></a> 

M. Kumar, C. Ellis and N. Turk-Browne produced the initial notebook  
T. Meissner minor edits