# 1. Why is Cross Validation important?

In machine learning, it is important to assess how well a model learned from training data generalizes to new, unseen data. In particular, overfitting occurs when a model makes good predictions on the training data but performs poorly on unseen data.

Cross-validation is frequently used to detect and reduce the risk of overfitting. It splits the training data into two parts: the proper training set, and a validation set that is used to check for overfitting and to select the best model according to some evaluation metric. There are three commonly used options:

1. *K-fold cross-validation:* the data is divided into k parts (called folds); each fold serves once as the validation set while the remaining k-1 folds are used for training, and the k scores are then averaged;
2. *Leave-one-out validation:* very similar to k-fold, except that each fold contains a single data point, so k equals the number of samples;
3. *Stratified K-fold cross-validation:* the folds are selected so that the mean response (or, for classification, the proportion of each class) is approximately the same in all folds.
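To make the three options concrete, here is a minimal sketch in plain Python of how the fold indices could be built. The helper names `k_fold_indices` and `stratified_fold_assignment` are hypothetical and introduced only for illustration; they are not part of scikit-learn, whose own splitters also support shuffling and other options.

```python
from collections import defaultdict


def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def stratified_fold_assignment(labels, k):
    """Return a fold number per sample, spreading each class evenly over folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    fold_of = [None] * len(labels)
    for members in by_class.values():
        # Deal the members of each class round-robin across the k folds.
        for position, idx in enumerate(members):
            fold_of[idx] = position % k
    return fold_of


# 4-fold split of 10 samples: validation folds of sizes 3, 3, 2, 2.
folds = list(k_fold_indices(10, 4))

# Leave-one-out is the special case k == n_samples.
loo = list(k_fold_indices(5, 5))

# Stratified assignment of 8 samples from two balanced classes into 2 folds.
strat = stratified_fold_assignment(["a", "a", "a", "a", "b", "b", "b", "b"], 2)
```

In this sketch the folds are contiguous blocks of indices; in practice the data is usually shuffled first so that any ordering in the dataset does not leak into the folds.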

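The first code fragment presents an example of cross-validation computed with the scikit-learn toolkit on a toy dataset called diabetes. The size of the test data is 20%, the model is an SVM (support vector machine) classifier, the evaluation metric is accuracy, and the number of folds is set to 4. Note that the diabetes target is a continuous value, so this is really a regression problem: the classifier treats every distinct target value as its own class, which is why the accuracy reported below is so low.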

In [1]:
import numpy as np
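# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection
# provides the same train_test_split and cross_val_score functions.
from sklearn import model_selection as cross_validation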
from sklearn import datasets
from sklearn import svm



In [4]:
diabetes = datasets.load_diabetes()

In [5]:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(diabetes.data,
                                                                     diabetes.target,
                                                                     test_size=0.2,
                                                                     random_state=0)

In [7]:
# Test size 20%
print(X_train.shape, y_train.shape)

(353, 10) (353,)


In [8]:
print(X_test.shape, y_test.shape)

(89, 10) (89,)


In [10]:
clf = svm.SVC(kernel='linear', C=1)
clf

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [12]:
scores = cross_validation.cross_val_score(clf, diabetes.data, diabetes.target, cv=4)
scores



array([ 0.00456621,  0.02272727,  0.03225806,  0.06896552])

In [17]:
print('Accuracy: %0.2f (+/- %0.2f)' % (scores.mean(), scores.std()))

Accuracy: 0.03 (+/- 0.02)
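Because the diabetes target is continuous, a regression model with a regression metric is a better fit for this dataset. The following is a minimal sketch of that variant, not part of the original run: it assumes scikit-learn 0.18 or later (where `sklearn.model_selection` is available), and swaps in `SVR` with the R^2 score in place of `SVC` with accuracy. No output is shown because the exact values depend on the library version.

```python
from sklearn import datasets, model_selection, svm

diabetes = datasets.load_diabetes()

# SVR is the regression counterpart of SVC; same kernel and C as in the cells above.
reg = svm.SVR(kernel="linear", C=1)

# scoring="r2" computes the coefficient of determination on each of the 4 folds.
r2_scores = model_selection.cross_val_score(
    reg, diabetes.data, diabetes.target, cv=4, scoring="r2"
)
print("R^2: %0.2f (+/- %0.2f)" % (r2_scores.mean(), r2_scores.std()))
```

An R^2 of 1 means perfect predictions, while values near or below 0 mean the model does no better than predicting the mean target, so the per-fold spread gives a quick view of how stable the model is across splits.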
