# K-Fold Cross-Validation

K-fold cross-validation works by:

* splitting the full dataset into k equal-length partitions,
* selecting k-1 partitions as the training set,
* selecting the remaining partition as the test set,
* training the model on the training set,
* using the trained model to predict labels on the test set,
* computing an error metric (e.g. simple accuracy) and setting aside the value for later,
* repeating the selection, training, and scoring steps k-1 more times, until each partition has been used as the test set exactly once,
* calculating the mean of the k error values.
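
The steps above can be sketched in plain NumPy. This is a minimal illustration, not a production implementation: the function names (`k_fold_indices`, `manual_k_fold`), the majority-class stand-in model, and the toy data are all made up for this example, and the folds are contiguous with no shuffling.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    return np.array_split(np.arange(n_samples), k)

def manual_k_fold(X, y, k, fit, predict):
    """Return the accuracy measured on each of the k held-out folds."""
    folds = k_fold_indices(len(X), k)
    scores = []
    for i, test_idx in enumerate(folds):
        # The other k-1 folds together form the training set.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train_idx], y[train_idx])
        predictions = predict(model, X[test_idx])
        # Simple accuracy: fraction of correct predictions on the test fold.
        scores.append(np.mean(predictions == y[test_idx]))
    return np.array(scores)

# Hypothetical stand-in model: always predict the most frequent training label.
def fit_majority(X_train, y_train):
    values, counts = np.unique(y_train, return_counts=True)
    return values[np.argmax(counts)]

def predict_majority(model, X_test):
    return np.full(len(X_test), model)

# Toy data, made up for this sketch.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))
y_demo = np.array([0, 1] * 10)

fold_scores = manual_k_fold(X_demo, y_demo, k=5,
                            fit=fit_majority, predict=predict_majority)
print(fold_scores, fold_scores.mean())
```

Passing `fit` and `predict` as functions keeps the splitting logic independent of any particular model, which is the same separation of concerns that scikit-learn's cross-validation tools follow.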

Using 5 or 10 folds is common for k-fold cross-validation. 


## scikit-learn package
In a production environment, you should use scikit-learn, which has a few different tools that make performing cross-validation easy. Just as you have to instantiate a LinearRegression or LogisticRegression object before you can train one of those models, you need to instantiate the KFold class before you can perform k-fold cross-validation:


**kf = KFold(n_splits=5, shuffle=False, random_state=None)**
where:

* n_splits is the number of folds you want to use,
* shuffle is used to toggle shuffling of the ordering of the observations in the dataset,
* random_state is used to specify a seed value if shuffle is set to True.

You'll notice that none of these parameters depends on the dataset. This is because a KFold object only generates the train and test indices for each fold (through its split method) and won't actually handle the training and testing of models. If we're primarily interested in accuracy and error metrics for each fold, we can use the KFold class in conjunction with the **cross_val_score** function, which will handle training and testing of the models in each fold.
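
To see what a KFold object produces on its own, the sketch below iterates over its splits. It assumes the `sklearn.model_selection` version of `KFold` (the one imported in the code cells later in this document), whose constructor takes `n_splits`. The array `X_demo` and the other variable names are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# 10 made-up observations with 2 features each.
X_demo = np.arange(20).reshape(10, 2)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=18)

test_indices_seen = []
# split() yields one (train indices, test indices) pair per fold;
# no model is trained or evaluated here.
for fold, (train_idx, test_idx) in enumerate(kf_demo.split(X_demo)):
    print(f"fold {fold}: train={train_idx}, test={test_idx}")
    test_indices_seen.extend(test_idx.tolist())
```

Each observation should appear in exactly one test fold, which is why the splitter only needs the data at the moment `split` is called.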

Here are the relevant parameters for the cross_val_score function:

**cross_val_score(estimator, X, y, scoring=None, cv=None)**
where:

* estimator is a sklearn model that implements the fit method (e.g. instance of LinearRegression or LogisticRegression),
* X is the list or 2D array containing the features you want to train on,
* y is a list containing the values you want to predict (target column),
* scoring is a string describing the scoring criteria (see more at: http://scikit-learn.org/stable/modules/model_evaluation.html),
* cv describes the number of folds. Here are some examples of accepted values:
  * an instance of the KFold class,
  * an integer representing the number of folds.

The function returns an array of scores, one value for each fold, computed with the scoring criteria you specify.
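
As a small illustration of the two `cv` options, the sketch below calls `cross_val_score` once with an integer and once with a KFold instance. The synthetic data and variable names are made up for this example. In current scikit-learn releases, `cross_val_score` returns an array with one score per fold in both cases. When `cv` is an integer and the estimator is a classifier, scikit-learn uses stratified folds rather than a plain KFold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic binary classification data, made up for this sketch.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

model = LogisticRegression()

# cv as an integer: 5 folds (stratified, because the estimator is a classifier).
scores_int = cross_val_score(model, X_demo, y_demo, scoring="accuracy", cv=5)

# cv as a KFold instance: full control over shuffling and the seed.
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)
scores_kf = cross_val_score(model, X_demo, y_demo, scoring="accuracy", cv=kf_demo)

print(scores_int.shape, scores_kf.shape)
```

Passing a KFold instance is the better choice when you need the same shuffled folds across several calls, for example when comparing two scoring criteria on identical splits.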

Here's the general workflow for performing k-fold cross-validation using the classes we just described:

* instantiate the model class you want to fit (e.g. LogisticRegression),
* instantiate the KFold class and use its parameters to specify the k-fold cross-validation attributes you want,
* use the cross_val_score function to return the scoring metric you're interested in.


In [1]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
import pandas as pd

In [2]:
# Read Smarket.csv into a DataFrame named stocks
stocks = pd.read_csv('Data/Smarket.csv')

# Create dummy variables (Down, Up) from the Direction column
stocks_up = pd.get_dummies(stocks['Direction'])
# Join the dummy variables to the main DataFrame
stocks_new = pd.concat([stocks, stocks_up], axis=1)

# Use the Up dummy column as the label to predict
stocks_new["actual_label"] = stocks_new["Up"]
stocks_new.head()

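| Unnamed: 0.1 | Unnamed: 0 | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | Direction | Down | Up | actual_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2001 | 0.381 | -0.192 | -2.624 | -1.055 | 5.01 | 1.1913 | 0.959 | Up | 0 | 1 | 1 |
| 1 | 2 | 2001 | 0.959 | 0.381 | -0.192 | -2.624 | -1.055 | 1.2965 | 1.032 | Up | 0 | 1 | 1 |
| 2 | 3 | 2001 | 1.032 | 0.959 | 0.381 | -0.192 | -2.624 | 1.4112 | -0.623 | Down | 1 | 0 | 0 |
| 3 | 4 | 2001 | -0.623 | 1.032 | 0.959 | 0.381 | -0.192 | 1.276 | 0.614 | Up | 0 | 1 | 1 |
| 4 | 5 | 2001 | 0.614 | -0.623 | 1.032 | 0.959 | 0.381 | 1.2057 | 0.213 | Up | 0 | 1 | 1 |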


In [3]:
# define a logistic regression classifier
reg = LogisticRegression()
# feature columns used as predictors
x_columns = ["Lag1", "Lag2", "Lag3", "Lag4", "Lag5", "Volume"]

In [4]:
# define a shuffled 5-fold splitter with a fixed seed
kf = KFold(n_splits=5, shuffle=True, random_state=18)

In [5]:
# calculate the accuracy score for each k-fold split
accuracy = cross_val_score(reg, stocks_new[x_columns], stocks_new["actual_label"], scoring="accuracy", cv=kf)
avg_accuracy = sum(accuracy) / len(accuracy)
print(accuracy)
print("Average Accuracy = ", avg_accuracy)

[0.512 0.488 0.476 0.488 0.5  ]
Average Accuracy =  0.4928


In [6]:
# calculate the AUC score for each k-fold split
roc_auc = cross_val_score(reg, stocks_new[x_columns], stocks_new["actual_label"], scoring="roc_auc", cv=kf)
avg_roc_auc = sum(roc_auc) / len(roc_auc)
print(roc_auc)
print("Average AUC = ", avg_roc_auc)

[0.51384615 0.47624232 0.46574273 0.46748671 0.51299283]
Average AUC =  0.4872621472436573
