# Cross validation in scikit-learn
### In this tutorial
- What not to do
- K-Fold cross validation
- Stratified K-Fold cross validation
- Nested stratified K-Fold cross validation

## What not to do
We will train a simple classifier on the iris dataset, one of the standard datasets included in scikit-learn. We will start by training and testing on the same data. Let's import the data, scikit-learn, and pandas:

In [None]:
# We will use an example dataset included in scikit-learn
from sklearn.datasets import load_iris

# Import scikit-learn's random forest classifier and the accuracy metric
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

import pandas as pd

We will now load the dataset and have a look at the data:

In [None]:
# Load the dataset
iris = load_iris()

# Prepare the data
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
features = df.columns[:4]

print(len(df),'samples')
df.head()

We have four features for every flower, and we are trying to recognize which species of iris it belongs to. Let's try a random forest as a classification algorithm:

In [None]:
# The classifier:
trees = 2
jobs = 20
forest = RandomForestClassifier(n_estimators=trees, n_jobs=jobs)

# Something we should never do: train and test on the same data

forest.fit(df[features],df['species'])
results = forest.predict(df[features])
# accuracy_score expects the ground truth first, then the predictions
train_accuracy = accuracy_score(df['species'], results)
print(results[48:52])
print('Train accuracy score:',train_accuracy)

Looks good! But how would it perform on new data? Because we used all of our data for training, we cannot answer this question. A common strategy is to split the data into a training set and a test set. However, when the dataset is small, we often want to use all of our data for both training and testing. To do this, we can use...

## K-Fold cross validation

The idea is simple: we split the data into K subsets, or **folds**, and select each one in turn. We test on that fold and train on the remaining data. Our final evaluation of the classifier is the average score across all folds.

For K = 5 we have:

<img src="folds.png">

Let's take a look at how we can do this in scikit-learn:

In [None]:
from sklearn.model_selection import KFold, StratifiedKFold

splitter = KFold(5,shuffle=False) # Let's see how this works

train, test = next(splitter.split(df))
print('Train',train,'\n\nTest',test)
df.iloc[test]
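
The cell above only looks at the first split. As a minimal sketch of what happens across all five folds, the snippet below loops over every split and records which classes end up in each test fold. It loads iris as plain arrays with `load_iris(return_X_y=True)`; the names `fold_sizes` and `fold_classes` are illustrative and not part of the tutorial. Because the iris samples are ordered by species, unshuffled folds can contain a single class:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

# Load iris as NumPy arrays (features X, integer class labels y)
X, y = load_iris(return_X_y=True)

kfold = KFold(n_splits=5, shuffle=False)

fold_sizes = []    # (train size, test size) for each fold
fold_classes = []  # classes present in each test fold

for fold, (train_idx, test_idx) in enumerate(kfold.split(X)):
    fold_sizes.append((len(train_idx), len(test_idx)))
    fold_classes.append(sorted(np.unique(y[test_idx]).tolist()))
    print(f'Fold {fold}: train={len(train_idx)}, test={len(test_idx)}, '
          f'test classes={fold_classes[-1]}')
```

Passing `shuffle=True` to `KFold` mixes the samples before splitting, but it still does not guarantee balanced classes in every fold.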

## Stratified K-Fold
Stratified K-Fold works in exactly the same way, except that the samples in each fold are chosen to preserve the percentage of samples from each class:

In [None]:
# Stratified k-fold
splitter = StratifiedKFold(5,shuffle=False) # Let's see the effects of shuffle

train, test = next(splitter.split(df,y = df['species']))
print('Train',train,'\n\nTest',test)
df.iloc[test]

As an exercise, let's implement 5-fold stratified cross validation using scikit-learn. We need to:

- Loop over all folds
- Train the random forest using the training indices
- Evaluate on the testing data
- Save the prediction and the ground truth, or alternatively record the accuracy
- Return the average accuracy across all folds

In [None]:
########## Exercise: implementing stratified K-fold CV

predictions = []
targets = []

for train, test in splitter.split(df,y = df['species']):
    pass
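
If you get stuck, the steps listed above could be sketched roughly as follows. This is one possible approach, not the only one. It works on plain arrays rather than the `df` DataFrame, and the choices of `n_estimators=10`, `shuffle=True`, and `random_state=0` are assumptions made here for a quick, repeatable run:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)

# Assumed settings: shuffled folds and a fixed seed for repeatability
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_accuracies = []
for train_idx, test_idx in splitter.split(X, y):
    # Train a fresh forest on the training indices of this fold
    forest = RandomForestClassifier(n_estimators=10, random_state=0)
    forest.fit(X[train_idx], y[train_idx])

    # Evaluate on the held-out fold and record the accuracy
    fold_predictions = forest.predict(X[test_idx])
    fold_accuracies.append(accuracy_score(y[test_idx], fold_predictions))

# Average accuracy across all folds
mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
print('Accuracy per fold:', fold_accuracies)
print('Mean accuracy:', mean_accuracy)
```

For routine use, `sklearn.model_selection.cross_val_score` wraps the same loop in a single call; writing it out by hand mainly helps to see what happens in each fold.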


## Nested stratified K-Fold CV

Sometimes we need to optimize for the hyperparameters of our classifiers, for example using a grid search. Scikit-learn makes it easy to combine a grid search with cross validation:

In [None]:
from sklearn.model_selection import GridSearchCV
parameters = {'n_estimators': [1,10,100,1000,10000], 'max_depth': (1,2,None)}


gs = GridSearchCV(estimator=RandomForestClassifier(n_jobs=jobs),
            param_grid=parameters,
            cv = StratifiedKFold(10,shuffle=True),
            verbose = 3
            )

gs.fit(df[features],df['species'])

In [None]:
print('Best score',gs.best_score_,'\nParameters:', gs.best_params_)

This comes with a problem: the data we just used to compute the score was also involved in the optimization process, as we kept testing on the same data through the whole grid. We need two hold-out datasets: a dev set, to test the algorithm as we explore the grid, and a final test set, to evaluate the configuration we selected as the optimum on new data the algorithm has not seen before.

We do so by nesting a stratified K-Fold CV inside another stratified K-Fold CV. We first split our data into a training set and a test set, and then further split the training set into a training set and a dev set. In practice, we just need to build a loop as we did before, but in each iteration we run a grid search with its own inner CV. Let's do this now:

In [None]:
########## Exercise: nested stratified K-fold CV
parameters = {'n_estimators': [10,100,1000], 'max_depth': (1,2,None)}
outer = StratifiedKFold(10,shuffle=True)
inner = StratifiedKFold(10,shuffle=True)

predictions = []
targets = []

import tqdm
for train, test in ...:
    pass

# accuracy_score expects the ground truth first, then the predictions
test_accuracy = accuracy_score(targets, predictions)
print('Test accuracy',test_accuracy)
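
For reference, one way the nested loop might look is sketched below. To keep the run short it assumes a much smaller grid (`small_grid`) and 3 folds at each level instead of the 10 used in the exercise. The variable names and `random_state` values are also choices made for this illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = load_iris(return_X_y=True)

# Assumed, deliberately tiny grid so that the sketch runs quickly
small_grid = {'n_estimators': [5, 20], 'max_depth': [1, None]}
outer = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

nested_predictions = []
nested_targets = []

for train_idx, test_idx in outer.split(X, y):
    # Inner CV: the grid search only ever sees the outer training data
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid=small_grid, cv=inner)
    search.fit(X[train_idx], y[train_idx])

    # GridSearchCV refits the best configuration on the full outer
    # training set by default, so predict() uses that refitted model
    nested_predictions.extend(search.predict(X[test_idx]))
    nested_targets.extend(y[test_idx])

nested_accuracy = accuracy_score(nested_targets, nested_predictions)
print('Nested CV test accuracy:', nested_accuracy)
```

Each outer test fold is touched only once, after the hyperparameters for that iteration have been chosen, so the resulting accuracy is not biased by the grid search.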