# Cross validation
---

**Cross validation** is a technique for assessing how well a model performs on data it was not trained on. There are many different cross validation methods, but they all share a few common steps:

1. Split the data into training and test sets.
2. Train the model on the training set and evaluate it on the test set.
3. Re-split the data into a new pair of training and test sets.
4. Repeat steps 1-3 until every subset of the data has served as the test set (this does not hold exactly for the shuffle-based methods, which draw random splits, but the idea is the same).
5. Average the evaluation scores over all rounds of training/test pairs.
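
The steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the `targets` values are made up, the "model" is simply the mean of the training targets, and the helper `k_fold_indices` is a hypothetical hand-rolled splitter, not scikit-learn's implementation.

```python
from statistics import mean

def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs over contiguous folds.

    Hypothetical helper; assumes n_samples is divisible by n_splits.
    """
    fold_size = n_samples // n_splits
    for k in range(n_splits):
        test_idx = list(range(k * fold_size, (k + 1) * fold_size))
        train_idx = [i for i in range(n_samples) if i not in test_idx]
        yield train_idx, test_idx

# Made-up toy targets, purely for illustration.
targets = [3.0, 5.0, 4.0, 6.0, 5.0, 7.0, 6.0, 8.0, 7.0, 9.0]

scores = []
for train_idx, test_idx in k_fold_indices(len(targets), n_splits=5):
    # Step 2a: "train" the model (here: the mean of the training targets).
    prediction = mean(targets[i] for i in train_idx)
    # Step 2b: evaluate on the held-out fold (mean absolute error).
    error = mean(abs(targets[i] - prediction) for i in test_idx)
    scores.append(error)

# Step 5: average the per-round scores.
cv_score = mean(scores)
print("per-fold errors:", scores)
print("cross-validated error:", cv_score)
```

Any real estimator and metric can take the place of the mean predictor and the absolute error; the loop structure stays the same.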

## K-Fold cross validation

The K-Fold method splits the data into k groups (folds). In each round, k-1 folds are used as training data and the remaining fold is held out as test data, so every fold serves as the test set exactly once.

<hr>
<font color=blue>**This method doesn't split the data directly; instead, it generates `indices` that can be used to build the dataset splits!**</font>

In [5]:
from sklearn.model_selection import KFold
import numpy as np

In [6]:
y = list('abcdefghij')
y

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

In [7]:
kf = KFold(n_splits=5)

In [8]:
for train_idx, test_idx in kf.split(y):
    print("train indices: %s; test indices: %s" %(train_idx, test_idx))

train indices: [2 3 4 5 6 7 8 9]; test indices: [0 1]
train indices: [0 1 4 5 6 7 8 9]; test indices: [2 3]
train indices: [0 1 2 3 6 7 8 9]; test indices: [4 5]
train indices: [0 1 2 3 4 5 8 9]; test indices: [6 7]
train indices: [0 1 2 3 4 5 6 7]; test indices: [8 9]
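
The index pattern printed above can be reproduced by hand. The following is a minimal sketch, assuming unshuffled, contiguous folds; `k_fold_indices` is a hypothetical helper written for illustration, and it returns plain lists rather than NumPy arrays.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) lists for contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for k in range(n_splits):
        stop = start + base + (1 if k < extra else 0)
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop

splits = list(k_fold_indices(10, 5))
for train_idx, test_idx in splits:
    print("train indices: %s; test indices: %s" % (train_idx, test_idx))
```

When the number of samples is not a multiple of k, the sketch gives the first few folds one extra sample each, so no sample is dropped.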


### Use indices to generate dataset splits

In [9]:
for train_idx, test_idx in kf.split(y):
    train = [y[i] for i in train_idx]
    test = [y[i] for i in test_idx]
    print("train: %s; test: %s" %(train, test))

train: ['c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']; test: ['a', 'b']
train: ['a', 'b', 'e', 'f', 'g', 'h', 'i', 'j']; test: ['c', 'd']
train: ['a', 'b', 'c', 'd', 'g', 'h', 'i', 'j']; test: ['e', 'f']
train: ['a', 'b', 'c', 'd', 'e', 'f', 'i', 'j']; test: ['g', 'h']
train: ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']; test: ['i', 'j']


## Leave One Out (LOO)

Take one sample out of the n samples as test data and use the remaining n-1 samples as training data. This is repeated n times, so each sample serves as the test set exactly once.

<hr>
<font color=blue>**This method doesn't split the data directly; instead, it generates `indices` that can be used to build the dataset splits!**</font>

In [10]:
from sklearn.model_selection import LeaveOneOut

In [11]:
y

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

In [12]:
loo = LeaveOneOut()

In [13]:
for train_idx, test_idx in loo.split(y):
    print("train indices: %s; test indices: %s" %(train_idx, test_idx))

train indices: [1 2 3 4 5 6 7 8 9]; test indices: [0]
train indices: [0 2 3 4 5 6 7 8 9]; test indices: [1]
train indices: [0 1 3 4 5 6 7 8 9]; test indices: [2]
train indices: [0 1 2 4 5 6 7 8 9]; test indices: [3]
train indices: [0 1 2 3 5 6 7 8 9]; test indices: [4]
train indices: [0 1 2 3 4 6 7 8 9]; test indices: [5]
train indices: [0 1 2 3 4 5 7 8 9]; test indices: [6]
train indices: [0 1 2 3 4 5 6 8 9]; test indices: [7]
train indices: [0 1 2 3 4 5 6 7 9]; test indices: [8]
train indices: [0 1 2 3 4 5 6 7 8]; test indices: [9]
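
Leave One Out can be viewed as K-Fold with k equal to the number of samples. A minimal hand-written sketch (the helper name `leave_one_out_indices` is hypothetical and not part of scikit-learn):

```python
def leave_one_out_indices(n_samples):
    """Yield (train_idx, test_idx) pairs with a single held-out sample each."""
    for i in range(n_samples):
        train_idx = [j for j in range(n_samples) if j != i]
        yield train_idx, [i]

y = list('abcdefghij')
splits = list(leave_one_out_indices(len(y)))

# One split (and therefore one model fit) per sample.
print("number of splits:", len(splits))
for train_idx, test_idx in splits[:2]:
    print("train: %s; test: %s" % ([y[i] for i in train_idx], [y[i] for i in test_idx]))
```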


## Leave P Out (LPO)

Take p samples out of the n samples as test data and use the rest as training data. This method is computationally expensive, because the number of ways to choose p samples out of n grows very quickly with n.
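
To get a feel for the cost, the number of splits is the binomial coefficient C(n, p). The sketch below counts the splits with the standard library only (it assumes Python 3.8+ for `math.comb`) and enumerates the test sets with `itertools.combinations`, which is one straightforward way to list them.

```python
from itertools import combinations
from math import comb

n_samples = 10
for p in (1, 2, 3, 5):
    print("p=%d -> %d splits" % (p, comb(n_samples, p)))

# Enumerate every possible test set for p=3.
test_sets = list(combinations(range(n_samples), 3))
print("number of test sets:", len(test_sets))
print("first three test sets:", test_sets[:3])

# The count explodes for larger datasets.
print("n=100, p=3 ->", comb(100, 3), "splits")
```

With only 10 samples and p=3 there are already 120 train/test pairs, each requiring a separate model fit.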

<hr>
<font color=blue>**This method doesn't split the data directly; instead, it generates `indices` that can be used to build the dataset splits!**</font>

In [14]:
from sklearn.model_selection import LeavePOut

In [15]:
lpo = LeavePOut(p=3)

In [16]:
for train_idx, test_idx in lpo.split(y):
    print("train indices: %s; test indices: %s" %(train_idx, test_idx))

train indices: [3 4 5 6 7 8 9]; test indices: [0 1 2]
train indices: [2 4 5 6 7 8 9]; test indices: [0 1 3]
train indices: [2 3 5 6 7 8 9]; test indices: [0 1 4]
train indices: [2 3 4 6 7 8 9]; test indices: [0 1 5]
train indices: [2 3 4 5 7 8 9]; test indices: [0 1 6]
train indices: [2 3 4 5 6 8 9]; test indices: [0 1 7]
train indices: [2 3 4 5 6 7 9]; test indices: [0 1 8]
train indices: [2 3 4 5 6 7 8]; test indices: [0 1 9]
train indices: [1 4 5 6 7 8 9]; test indices: [0 2 3]
train indices: [1 3 5 6 7 8 9]; test indices: [0 2 4]
train indices: [1 3 4 6 7 8 9]; test indices: [0 2 5]
train indices: [1 3 4 5 7 8 9]; test indices: [0 2 6]
train indices: [1 3 4 5 6 8 9]; test indices: [0 2 7]
train indices: [1 3 4 5 6 7 9]; test indices: [0 2 8]
train indices: [1 3 4 5 6 7 8]; test indices: [0 2 9]
train indices: [1 2 5 6 7 8 9]; test indices: [0 3 4]
train indices: [1 2 4 6 7 8 9]; test indices: [0 3 5]
train indices: [1 2 4 5 7 8 9]; test indices: [0 3 6]
train indices: [1 2 4 5 6 8 ...

## Random Permutation (Shuffle & Split)

The data is shuffled and split into train/test pairs. You have to specify the number of shuffling iterations (`n_splits`) and the proportion of the data assigned to the train and test sets (for example via `test_size`).

In [17]:
from sklearn.model_selection import ShuffleSplit

In [18]:
ss = ShuffleSplit(n_splits=10, test_size=0.25)

In [19]:
for train_idx, test_idx in ss.split(y):
    print("train indices: %s; test indices: %s" %(train_idx, test_idx))

train indices: [4 9 3 1 8 2 5]; test indices: [0 7 6]
train indices: [4 8 6 7 3 0 2]; test indices: [5 1 9]
train indices: [5 7 0 8 6 9 2]; test indices: [4 3 1]
train indices: [1 6 4 5 8 2 7]; test indices: [0 3 9]
train indices: [1 9 8 7 3 6 5]; test indices: [4 0 2]
train indices: [1 7 4 6 8 0 3]; test indices: [9 2 5]
train indices: [4 2 8 6 5 9 1]; test indices: [0 3 7]
train indices: [5 4 9 6 3 8 0]; test indices: [1 7 2]
train indices: [4 1 6 0 2 9 7]; test indices: [5 3 8]
train indices: [7 8 2 1 9 4 3]; test indices: [0 5 6]
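
The idea behind the output above can be sketched with the standard library. This is an illustration only: `shuffle_split_indices` is a hypothetical helper, it uses Python's `random` module rather than NumPy's generator (so it will not reproduce the exact numbers shown), and rounding the test size up is an assumption chosen to match the 3-sample test sets printed above.

```python
import math
import random

def shuffle_split_indices(n_samples, n_splits, test_size, seed=None):
    """Yield (train_idx, test_idx) pairs from independent random permutations."""
    rng = random.Random(seed)
    # Assumption: a fractional test size is rounded up (0.25 * 10 -> 3).
    n_test = math.ceil(test_size * n_samples)
    for _ in range(n_splits):
        permutation = list(range(n_samples))
        rng.shuffle(permutation)
        yield permutation[n_test:], permutation[:n_test]

splits = list(shuffle_split_indices(10, n_splits=10, test_size=0.25, seed=0))
for train_idx, test_idx in splits:
    print("train indices: %s; test indices: %s" % (train_idx, test_idx))
```

Because every iteration draws a fresh permutation, the same sample may appear in several test sets and some samples may never be tested, unlike with K-Fold.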


## Leave One Group Out

This method is similar to the `Leave One Out` method, except that the samples are grouped by the values of a grouping variable and one whole group at a time is held out as the test dataset.

In [20]:
y = list('aaa') + list('bbbbb') + list('cccc')

In [21]:
y

['a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c']

In [22]:
X = np.random.randn(len(y))
X

array([ 0.2413556 , -0.65333186,  0.84401647, -1.10281985,  1.01318204,
       -0.69043423, -0.19806153, -1.3683989 , -0.74623   ,  0.35248938,
        0.11474869,  0.53338053])

In [23]:
from sklearn.model_selection import LeaveOneGroupOut

In [24]:
logo = LeaveOneGroupOut()

In [25]:
for train_idx, test_idx in logo.split(X, groups=y):
    print("train indices: %s; test indices: %s" %(train_idx, test_idx))

train indices: [ 3  4  5  6  7  8  9 10 11]; test indices: [0 1 2]
train indices: [ 0  1  2  8  9 10 11]; test indices: [3 4 5 6 7]
train indices: [0 1 2 3 4 5 6 7]; test indices: [ 8  9 10 11]
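
The grouping logic can be written out by hand as follows. This is a minimal sketch, assuming the groups are visited in sorted order; `leave_one_group_out_indices` is a hypothetical helper, not scikit-learn's code.

```python
def leave_one_group_out_indices(groups):
    """Yield (train_idx, test_idx) pairs, holding out one whole group per split."""
    for held_out in sorted(set(groups)):
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        yield train_idx, test_idx

groups = list('aaa') + list('bbbbb') + list('cccc')
splits = list(leave_one_group_out_indices(groups))
for train_idx, test_idx in splits:
    print("train indices: %s; test indices: %s" % (train_idx, test_idx))
```

Note that the number of splits equals the number of distinct groups, not the number of samples.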


## Leave P Groups Out

This method is similar to `Leave One Group Out`. The samples are grouped by the values of a grouping variable, and every combination of P groups is held out in turn as the test dataset.

In [26]:
y

['a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c']

In [27]:
X

array([ 0.2413556 , -0.65333186,  0.84401647, -1.10281985,  1.01318204,
       -0.69043423, -0.19806153, -1.3683989 , -0.74623   ,  0.35248938,
        0.11474869,  0.53338053])

In [28]:
from sklearn.model_selection import LeavePGroupsOut

In [29]:
lpgo = LeavePGroupsOut(n_groups=2)

In [34]:
for train_idx, test_idx in lpgo.split(X, groups=y):
    print("train indices: %s; test indices: %s" %(train_idx, test_idx))

train indices: [ 8  9 10 11]; test indices: [0 1 2 3 4 5 6 7]
train indices: [3 4 5 6 7]; test indices: [ 0  1  2  8  9 10 11]
train indices: [0 1 2]; test indices: [ 3  4  5  6  7  8  9 10 11]
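
A hand-rolled version makes the "every combination of P groups" idea explicit. This is a sketch under the assumption that group combinations are visited in sorted order; the helper name `leave_p_groups_out_indices` is hypothetical.

```python
from itertools import combinations

def leave_p_groups_out_indices(groups, n_groups):
    """Yield (train_idx, test_idx) pairs, holding out n_groups groups per split."""
    unique_groups = sorted(set(groups))
    for held_out in combinations(unique_groups, n_groups):
        test_idx = [i for i, g in enumerate(groups) if g in held_out]
        train_idx = [i for i, g in enumerate(groups) if g not in held_out]
        yield train_idx, test_idx

groups = list('aaa') + list('bbbbb') + list('cccc')
splits = list(leave_p_groups_out_indices(groups, n_groups=2))
for train_idx, test_idx in splits:
    print("train indices: %s; test indices: %s" % (train_idx, test_idx))
```

With three groups and P=2 there are only three combinations, so each split trains on a single group and tests on the other two.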


## Group Shuffle Split

`Group Shuffle Split` is a combination of `ShuffleSplit` and `LeavePGroupsOut`. It shuffles the groups and then holds out x groups as test data, where x can be given as a proportion of the groups or as an absolute number of groups.

In [35]:
X = np.random.randn(20)

In [36]:
X

array([-0.70782569, -0.41241592, -0.13149337, -0.81185999,  0.27883507,
        0.89086179, -1.25196285,  0.21387114,  1.00030339, -1.45616003,
       -0.64170703,  0.16466327, -0.9789483 , -1.08461979,  0.12992165,
        1.20777846,  0.46672233,  0.16845777, -1.42556463, -0.14764408])

In [64]:
y = list('MF' * 10)

In [65]:
y

['M',
 'F',
 'M',
 'F',
 'M',
 'F',
 'M',
 'F',
 'M',
 'F',
 'M',
 'F',
 'M',
 'F',
 'M',
 'F',
 'M',
 'F',
 'M',
 'F']

In [66]:
g = ([1,2,3,4,5] * 4)
g.sort()

In [67]:
g

[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5]

In [68]:
from sklearn.model_selection import GroupShuffleSplit

In [69]:
gss = GroupShuffleSplit(n_splits=10, test_size=2)

In [70]:
for train_idx, test_idx in gss.split(X, y, groups=g):
    print("train indices: %s; test indices: %s" %(train_idx, test_idx))

train indices: [ 0  1  2  3  8  9 10 11 12 13 14 15]; test indices: [ 4  5  6  7 16 17 18 19]
train indices: [ 0  1  2  3  4  5  6  7 12 13 14 15]; test indices: [ 8  9 10 11 16 17 18 19]
train indices: [ 0  1  2  3 12 13 14 15 16 17 18 19]; test indices: [ 4  5  6  7  8  9 10 11]
train indices: [ 0  1  2  3  4  5  6  7 16 17 18 19]; test indices: [ 8  9 10 11 12 13 14 15]
train indices: [ 0  1  2  3 12 13 14 15 16 17 18 19]; test indices: [ 4  5  6  7  8  9 10 11]
train indices: [ 0  1  2  3  4  5  6  7  8  9 10 11]; test indices: [12 13 14 15 16 17 18 19]
train indices: [ 0  1  2  3  4  5  6  7  8  9 10 11]; test indices: [12 13 14 15 16 17 18 19]
train indices: [ 0  1  2  3  4  5  6  7  8  9 10 11]; test indices: [12 13 14 15 16 17 18 19]
train indices: [ 4  5  6  7  8  9 10 11 12 13 14 15]; test indices: [ 0  1  2  3 16 17 18 19]
train indices: [ 0  1  2  3  4  5  6  7 16 17 18 19]; test indices: [ 8  9 10 11 12 13 14 15]
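
The behaviour above can be approximated by sampling whole groups at random. The sketch below is an illustration only: `group_shuffle_split_indices` is a hypothetical helper, it takes the number of test groups as an absolute count, and it relies on Python's `random` module, so the splits will differ from the ones printed above.

```python
import random

def group_shuffle_split_indices(groups, n_splits, n_test_groups, seed=None):
    """Yield (train_idx, test_idx) pairs with randomly chosen held-out groups."""
    rng = random.Random(seed)
    unique_groups = sorted(set(groups))
    for _ in range(n_splits):
        # Pick the held-out groups independently for every iteration.
        held_out = set(rng.sample(unique_groups, n_test_groups))
        test_idx = [i for i, g in enumerate(groups) if g in held_out]
        train_idx = [i for i, g in enumerate(groups) if g not in held_out]
        yield train_idx, test_idx

g = sorted([1, 2, 3, 4, 5] * 4)
splits = list(group_shuffle_split_indices(g, n_splits=10, n_test_groups=2, seed=0))
for train_idx, test_idx in splits:
    print("train indices: %s; test indices: %s" % (train_idx, test_idx))
```

As in the printed output, a group is never divided between the train and test sides, and the same pair of groups can be drawn more than once across iterations.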
