
### K-Fold
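Cross-validation is a resampling procedure used to evaluate machine learning models on a
limited data sample. The procedure has a single parameter, *k*, which is the number of groups
(folds) the data sample is split into. As such, the procedure is often called
k-fold cross-validation. Each fold is held out once as the test set while the model is trained
on the remaining *k* - 1 folds, and the *k* resulting scores are summarized, typically by their mean.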
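
To make the splitting step concrete, here is a minimal sketch of how fold indices could be generated with NumPy alone. The helper name `k_fold_indices` and its `seed` parameter are hypothetical, introduced only for illustration; this is not how scikit-learn implements `KFold` internally.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=None):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper for illustration only.
    """
    indices = np.arange(n_samples)
    if seed is not None:
        # Shuffle once so the folds are not tied to the original ordering
        rng = np.random.default_rng(seed)
        rng.shuffle(indices)
    # array_split also handles n_samples that is not divisible by k
    folds = np.array_split(indices, k)
    for i in range(k):
        test_idx = folds[i]
        # Every fold except the i-th one forms the training set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=6, k=3, seed=0))
for train_idx, test_idx in splits:
    print('train:', train_idx, 'test:', test_idx)
```

Each observation lands in exactly one test fold, so every data point is used for both training and evaluation across the *k* rounds.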

### Variations on Cross-Validation

*   **Stratified**: each fold preserves the proportion of observations in each category (for classification, each class label) found in the full dataset.
*   **LOOCV** (leave-one-out cross-validation): *k* is set to the total number of observations in the dataset, so each observation is held out exactly once as a single-item test set.
*   **Repeated**: the *k-fold* cross-validation procedure is repeated *n* times, and the data sample is reshuffled before each repetition so that every repetition produces a different split of the sample.
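
The three variations listed above map onto splitter classes in `sklearn.model_selection`. The following is a small sketch on a toy dataset; the arrays `X` and `y` are made up for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, RepeatedKFold, StratifiedKFold

# Toy data (assumed for illustration): 6 observations, two balanced classes
X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 0, 1, 1, 1])

# Stratified: each test fold keeps the 50/50 class balance of y
skf = StratifiedKFold(n_splits=3)
stratified_tests = [y[test] for _, test in skf.split(X, y)]
print('stratified test labels:', stratified_tests)

# LOOCV: one split per observation
loo = LeaveOneOut()
n_loo_splits = loo.get_n_splits(X)
print('LOOCV splits:', n_loo_splits)

# Repeated: 3 folds x 2 repetitions, reshuffled for each repetition
rkf = RepeatedKFold(n_splits=3, n_repeats=2, random_state=1)
n_repeated_splits = rkf.get_n_splits(X)
print('repeated k-fold splits:', n_repeated_splits)
```

Any of these splitter objects can be passed as the `cv` argument of `cross_val_score` in place of an integer.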


In [6]:
# scikit-learn k-fold cross-validation
from numpy import array
from sklearn.model_selection import KFold

In [11]:
# Data sample
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

# Define cross-validation folds
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

# Enumerate splits
for train, test in kfold.split(data):
    print('train: %s, test %s' % (data[train], data[test]))

train: [0.1 0.4 0.5 0.6], test [0.2 0.3]
train: [0.2 0.3 0.4 0.6], test [0.1 0.5]
train: [0.1 0.2 0.3 0.5], test [0.4 0.6]


### Scikit-learn Example (Traditional Train-Test-Split)

In [12]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

In [20]:
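# Load the iris dataset as feature matrix X and label vector y
X, y = datasets.load_iris(return_X_y=True)
# Hold out 40% of the observations as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)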

print(X.shape, y.shape)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(150, 4) (150,)
(90, 4) (90,)
(60, 4) (60,)


In [21]:
# Fit a linear SVM on the training split and report accuracy on the held-out split
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)

0.9666666666666667

### Scikit-learn Example (Using K-Fold Cross-Validation)

In [18]:
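from sklearn.model_selection import cross_val_score
# Estimate accuracy with 5-fold cross-validation (one score per fold).
# With an integer cv and a classifier, scikit-learn uses stratified folds.
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X, y, cv=5)
scores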

array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])

In [19]:
# Mean accuracy across folds, with a spread of two standard deviations
print('Accuracy: %0.2f (+/- %0.2f)' % (scores.mean(), scores.std() * 2))

Accuracy: 0.98 (+/- 0.03)
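
To show what `cross_val_score` does under the hood, the sketch below writes the same evaluation as an explicit loop. It assumes the default behaviour for a classifier with an integer `cv`, namely unshuffled `StratifiedKFold` splits scored with the estimator's own `score` method. The names `manual_scores` and `reference_scores` are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1)

# Explicit loop: fit a fresh copy of the model on each training fold
# and score it on the corresponding held-out fold
skf = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, test_idx in skf.split(X, y):
    model = clone(clf).fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

# The one-line helper, for comparison
reference_scores = cross_val_score(clf, X, y, cv=5)

print(manual_scores)
print(reference_scores)
```

Writing the loop out by hand is mainly useful when per-fold access to the fitted models or predictions is needed; otherwise the helper is shorter and less error-prone.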
