# Cross Validation
Source: http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html
## Basic cross-validation
Cross-validation is a method of splitting your data set into chunks (folds), using one fold as the test/validation set and the rest as the training set. By fitting and scoring the model on different combinations of folds and averaging the results, you get a more reliable estimate of the model's performance than a single train/test split gives.

In [9]:
from sklearn import datasets, svm # import sample data and svm module

digits = datasets.load_digits() # load sample data
x_features = digits.data # features/predictors
y_labels = digits.target # labels/outcomes

svc = svm.SVC(C=1, kernel='linear') # define model as linear svm
svc.fit(x_features[:-100], y_labels[:-100]).score(x_features[-100:], y_labels[-100:]) # fit on all but the last 100 samples, score accuracy on the last 100

0.97999999999999998

## K-Folds
The code above fits the model once on all but the last 100 samples and tests it on those last 100. This gives an accuracy of about 98%, but it is a one-off run whose result depends on which samples happen to be held out. K-fold cross-validation gives a more reliable estimate of accuracy.

In [10]:
import numpy as np

# split x and y into 3 equal folds
x_folds = np.array_split(x_features, 3)
y_folds = np.array_split(y_labels, 3)
scores = list()

# iterate through folds
for k in range(3):
    x_train = list(x_folds) # create copy of folds
    x_test = x_train.pop(k) # extract fold for test
    x_train = np.concatenate(x_train) # use remaining folds for train
    y_train = list(y_folds) # same for y/outcomes
    y_test = y_train.pop(k)
    y_train = np.concatenate(y_train)
    
    scores.append(svc.fit(x_train, y_train).score(x_test, y_test)) # record each score

print(scores) # print all scores

[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]
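
The per-fold scores are usually summarised by their mean and spread. A minimal sketch, reusing the three scores printed above (the variable names are illustrative, not part of the original notebook):

```python
import numpy as np

# Per-fold accuracies copied from the output above
fold_scores = np.array([0.93489148580968284, 0.95659432387312182, 0.93989983305509184])

mean_score = fold_scores.mean()  # average accuracy across the folds
std_score = fold_scores.std()    # spread between folds (population standard deviation)

print('Mean accuracy: %.3f (+/- %.3f)' % (mean_score, std_score))
```

The mean is lower than the single-split result of about 98%, which suggests that the one-off split was a relatively easy one.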


## K-Folds and Generators/Customisation
Scikit-learn provides cross-validation generators that split the data in configurable ways. The `split` method of a `KFold` object yields the train and test indices for each fold. The `n_splits` parameter sets how many folds the data is divided into, and each fold is used as the test set exactly once.

In [31]:
from sklearn.model_selection import KFold, cross_val_score

x = ['a', 'a', 'b', 'c', 'c', 'c'] # 6 samples: with 3 folds, each split uses 4 for training and 2 for testing
k_fold = KFold(n_splits=3) # 3 equal folds

for train_indices, test_indices in k_fold.split(x): # yields the train and test indices for each fold of x
    print('Train: %s | Test: %s' % (train_indices, test_indices)) # show train, test indices for each fold

# fit and score the SVC on each fold of the digits data, then print the scores
print([svc.fit(x_features[train], y_labels[train]).score(x_features[test], y_labels[test]) for train, test in k_fold.split(x_features)])

Train: [2 3 4 5] | Test: [0 1]
Train: [0 1 4 5] | Test: [2 3]
Train: [0 1 2 3] | Test: [4 5]
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]
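
One common customisation is shuffling the samples before they are assigned to folds, which helps when the data is ordered (for example, sorted by class). The sketch below assumes the `shuffle` and `random_state` parameters of `KFold`; the seed value and variable names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

x = np.array(['a', 'a', 'b', 'c', 'c', 'c'])

# shuffle=True randomises which samples land in each fold;
# random_state fixes the shuffle so the split is repeatable
shuffled_k_fold = KFold(n_splits=3, shuffle=True, random_state=0)

test_folds = []
for train_indices, test_indices in shuffled_k_fold.split(x):
    print('Train: %s | Test: %s' % (train_indices, test_indices))
    test_folds.append(test_indices)

# every sample still appears in exactly one test fold
all_test_indices = np.sort(np.concatenate(test_folds))
```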


## cross_val_score
This helper function takes all of the inputs of your process: the estimator (svm/svc here), the cross-validation object (KFold here) and the input data set. It then splits the data set repeatedly, fits the estimator on each training set and scores it against the corresponding test set, returning an array of individual scores.

In [32]:
# n_jobs=-1 tells it to use all CPUs on your PC
cross_val_score(svc, x_features, y_labels, cv=k_fold, n_jobs=-1)

array([ 0.93489149,  0.95659432,  0.93989983])

You can see that this outputs the same results as the previous code. Essentially, this helper is shorthand for plugging together the objects and models you need to run cross-validation, saving you time and code.

By default, the scores are computed with the estimator's own `score` method (mean accuracy for a classifier such as SVC). You can select an alternative metric with the `scoring` parameter if you want a different view of your model.

In [34]:
cross_val_score(svc, x_features, y_labels, cv=k_fold, scoring='precision_macro')

array([ 0.93969761,  0.95911415,  0.94041254])

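The 'precision_macro' scoring method produces different results from the default accuracy because it averages the precision of each class rather than counting correct predictions overall. A list of the available scoring methods and their meaning can be found at the following link: http://scikit-learn.org/stable/modules/model_evaluation.html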
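
To make the difference between accuracy and macro-averaged precision concrete, here is a small hand calculation in NumPy. The labels are hypothetical toy data chosen for illustration; this is a sketch of the metric's definition, not of how scikit-learn implements it.

```python
import numpy as np

# Hypothetical toy labels, chosen only to illustrate the calculation
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])

accuracy = np.mean(y_true == y_pred)  # fraction of correct predictions

# precision per class = correct predictions of that class / all predictions of that class
per_class_precision = []
for label in np.unique(y_true):
    predicted_as_label = (y_pred == label)
    correct = np.sum(predicted_as_label & (y_true == label))
    per_class_precision.append(correct / np.sum(predicted_as_label))

precision_macro = np.mean(per_class_precision)  # unweighted mean over classes

print('Accuracy: %.3f | Macro precision: %.3f' % (accuracy, precision_macro))
```

Here one sample of class 1 is misclassified as class 2, so accuracy is 5/6 while the macro precision is the mean of 1, 1 and 2/3.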

As well as the standard KFold method which we have been using, there are many other methods of cross-validation. These can be found at the following link: http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html

## Grid Search
Grid search takes an estimator (e.g. svc), a grid of candidate parameter values and the data set. It fits the estimator for each combination of parameters in the grid and selects the combination that maximises the cross-validation score.
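
Conceptually, a grid search over one parameter is a loop over candidate values that keeps the one with the best mean cross-validation score. The sketch below spells that loop out by hand; the candidate values for `C` and the 500-sample subset are assumptions made to keep the example quick, and `GridSearchCV` additionally refits the best model on all the supplied data by default.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

digits = datasets.load_digits()
x_small, y_small = digits.data[:500], digits.target[:500]  # small subset to keep it quick

candidate_cs = [0.0001, 0.001, 0.01]  # hypothetical candidate values for C
mean_scores = []
for c in candidate_cs:
    model = svm.SVC(C=c, kernel='linear')
    # mean cross-validated accuracy for this value of C
    mean_scores.append(cross_val_score(model, x_small, y_small, cv=3).mean())

best_c = candidate_cs[int(np.argmax(mean_scores))]
print('Best C: %s | Mean CV scores: %s' % (best_c, mean_scores))
```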

In [36]:
from sklearn.model_selection import GridSearchCV, cross_val_score

params = np.logspace(-6, -1, 10) # 10 values from 1e-6 to 1e-1, evenly spaced on a log scale
clf = GridSearchCV(estimator=svc, param_grid=dict(C=params), n_jobs=-1)

clf.fit(x_features[:1000], y_labels[:1000])
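print('CV (first 1000 samples) | Best score: %s | Best C: %s' % (clf.best_score_, clf.best_estimator_.C))
print('HELD-OUT TEST | Score: %s' % (clf.score(x_features[1000:], y_labels[1000:])))

CV (first 1000 samples) | Best score: 0.925 | Best C: 0.00774263682681
HELD-OUT TEST | Score: 0.943538268507


Note that `best_score_` is the mean cross-validated score on the first 1000 samples used for fitting, while the second score is measured on the held-out samples from index 1000 onwards. Here the held-out score (0.944) is slightly higher than the cross-validated score (0.925).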

In the scikit-learn version used here, GridSearchCV defaults to 3-fold cross-validation (from version 0.22 onwards the default is 5-fold). If the estimator is a classifier (e.g. SVM) rather than a regressor, it uses stratified folds. Stratification assigns samples to folds so that each fold preserves the class proportions of the full data set (e.g. with 2 equally common classes, each fold contains roughly 50% of each class). This reduces the bias and variance that unrepresentative folds would otherwise introduce.
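
Stratification can be seen directly by counting the classes in each test fold. This is a minimal sketch using `StratifiedKFold` on hypothetical imbalanced labels; the placeholder feature array is only there because `split` expects one.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 8 samples of class 0, 4 of class 1
y = np.array([0] * 8 + [1] * 4)
x = np.zeros((len(y), 1))  # placeholder features; only the labels affect the split

stratified = StratifiedKFold(n_splits=2)
fold_class_counts = []
for train_indices, test_indices in stratified.split(x, y):
    counts = np.bincount(y[test_indices], minlength=2)  # samples per class in this test fold
    fold_class_counts.append(counts.tolist())
    print('Test: %s | Class counts: %s' % (test_indices, counts))
```

Each test fold keeps the 2:1 class ratio of the full label set, whereas a plain unshuffled `KFold` on these ordered labels would put only class 0 in the first test fold.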

In [37]:
cross_val_score(clf, x_features, y_labels)

array([ 0.93853821,  0.96327212,  0.94463087])

Here, we are performing nested cross-validation. `cross_val_score` estimates the prediction score on each of the 3 outer folds, and within each outer training set GridSearchCV runs its own cross-validation to choose the best parameters.

Because the parameters are never tuned on the data used for scoring, the resulting scores are unbiased estimates of the prediction accuracy when applied to new data.

# To Do:
* Finish off this source, there is one final section on cross-validated estimators being automatically calculated.