Cross-Validation is a resampling technique for estimating how well a model will perform, in terms of efficiency and accuracy, on unseen data.

In this tutorial, I'll talk about 3 common Cross-Validation techniques:
1. Holdout cross-validation (traditional train-test split)
2. k-fold cross-validation (variants - LOOCV, LpOCV)
3. Stratified k-fold cross-validation

Many hybrid Cross-Validation techniques are available as well; see the References section.

# Read Dataset

In [7]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
df=pd.read_csv('cancer_dataset.csv')
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [8]:
X=df.iloc[:,2:-1]
y=df.iloc[:,1]

In [9]:
df.shape

(569, 33)

In [10]:
y.value_counts()

B    357
M    212
Name: diagnosis, dtype: int64

# Cross-Validation

## Holdout cross-validation

* The model is evaluated only once (1 iteration).
* The dataset is randomly split into training and validation data (controlled by test_size and random_state).
* Not suitable for an imbalanced dataset, because a purely random split may not preserve the class proportions.
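
On the imbalanced-data point above, `train_test_split` accepts a `stratify` argument that keeps the class ratio roughly the same in both parts of a holdout split. The sketch below is only an illustration: `X_demo` and `y_demo` are synthetic stand-ins (an assumption, not the tutorial's CSV) that mirror the 357/212 class counts reported by `y.value_counts()`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (assumption): 357 benign (0) and 212 malignant (1),
# mirroring the class counts of the tutorial's dataset.
y_demo = np.array([0] * 357 + [1] * 212)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo asks train_test_split to keep the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=4, stratify=y_demo
)

overall_ratio = y_demo.mean()
train_ratio = y_tr.mean()
test_ratio = y_te.mean()
print(f"malignant share: overall={overall_ratio:.3f}, "
      f"train={train_ratio:.3f}, test={test_ratio:.3f}")
```

With stratification, the three printed shares should be close to each other; without it, they can drift apart depending on `random_state`.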

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=4)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
result = model.score(X_test, y_test)
print(result)

0.8771929824561403


## k-fold cross-validation

* One of the most widely used techniques.
* Not suitable for an imbalanced dataset.

ALGORITHM: 
```
* No. of total records in dataset = n
* No. of evaluations/iterations = k

In each iteration
  * No. of records in test set = n/k
  * No. of records in train set = n-(n/k)

* Take average of all test accuracies
```

EXAMPLE:
```
Let k = 5 and n = 1000

In each of 5 iterations
  * No. of records in test set = n/k = 1000/5 = 200
  * No. of records in train set = n-(n/k) = 1000-200 = 800

Iteration 1 - first 200 test, last 800 train
Iteration 2 - 200 train , 200 test, 600 train
Iteration 3 - 400 train, 200 test, 400 train
Iteration 4 - 600 train, 200 test, 200 train
Iteration 5 - first 800 train, last 200 test
```
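
The same n = 1000, k = 5 layout can be reproduced with `KFold` itself. This is a minimal sketch, assuming `records` is just an array of positions standing in for real data; it relies on `KFold` not shuffling by default, so the test folds are consecutive blocks.

```python
import numpy as np
from sklearn.model_selection import KFold

n, k = 1000, 5
records = np.arange(n)  # stand-in for n records; only positions matter here

# Without shuffling, KFold assigns consecutive blocks of records to the test set.
folds = list(KFold(n_splits=k).split(records))
for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Iteration {i}: test = records {test_idx[0]}..{test_idx[-1]}, "
          f"train size = {len(train_idx)}, test size = {len(test_idx)}")
```

Passing `shuffle=True` (with a `random_state`) to `KFold` would randomise which records land in each fold while keeping the fold sizes the same.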


In [12]:
from sklearn.model_selection import KFold

model=DecisionTreeClassifier()
kfold_validation=KFold(10)
results=cross_val_score(model,X,y,cv=kfold_validation)
print(results)
print(np.mean(results))

[0.94736842 0.9122807  0.89473684 0.94736842 0.92982456 0.98245614
 0.9122807  0.94736842 0.9122807  0.92857143]
0.9314536340852129


## Leave-one-out cross-validation (LOOCV)

* Variant of k-fold cross-validation where k = n.
* Not recommended for large datasets.
* High computation time required, since the model is trained n times.

ALGORITHM: 
```
* No. of total records in dataset = n
* No. of evaluations/iterations = k = n

In each iteration
  * No. of records in test set = n/n = 1
  * No. of records in train set = n-(n/n) = n-1

* Take average of all test accuracies
```

EXAMPLE:
```
Let k = n = 1000

Iteration 1 - 1 test, 999 train
Iteration 2 - 1 train, 1 test, 998 train
Iteration 3 - 2 train, 1 test, 997 train
.
.
.
Iteration 999 - 998 train, 1 test, 1 train
Iteration 1000 - 999 train, 1 test
```
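
The iteration pattern above can be checked directly with `LeaveOneOut`. A small sketch, assuming `records` is a placeholder array of 1000 rows rather than real data:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

n = 1000
records = np.arange(n).reshape(-1, 1)  # placeholder data: one row per record

loo = LeaveOneOut()
n_iterations = loo.get_n_splits(records)  # one iteration per record
splits = list(loo.split(records))

first_train, first_test = splits[0]
last_train, last_test = splits[-1]
print(f"iterations: {n_iterations}")
print(f"first iteration: test index {first_test.tolist()}, train size {len(first_train)}")
print(f"last iteration:  test index {last_test.tolist()}, train size {len(last_train)}")
```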

In [13]:
from sklearn.model_selection import LeaveOneOut

model=DecisionTreeClassifier()
leave_validation=LeaveOneOut()
results=cross_val_score(model,X,y,cv=leave_validation)
print(results)
print(np.mean(results))

[1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 0. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1.
 1. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.
 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.

## Leave p-out cross-validation (LpOCV)

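* Exhaustive generalisation of LOOCV (LOOCV is the case p = 1).
* Tests on every distinct subset of p samples, while the remaining n - p samples form the training set in each iteration.
* LpOCV is NOT equivalent to k-fold with k=n/p, which creates non-overlapping test sets.
* Not recommended for anything but small datasets.
* High computation time required: the number of iterations is C(n, p), the number of ways to choose p samples out of n.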
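
To make the difference from k-fold concrete, the sketch below lists the test sets that `LeavePOut(2)` and `KFold` with k = n/p produce on a toy array of 6 records (the toy array is an assumption for illustration), and then counts the iterations LpOCV would need for a dataset of 569 rows.

```python
import math
import numpy as np
from sklearn.model_selection import KFold, LeavePOut

toy = np.arange(6).reshape(-1, 1)  # 6 placeholder records
p = 2

# LeavePOut tests on every pair, so the same record appears in several test sets.
lpo_tests = [test.tolist() for _, test in LeavePOut(p).split(toy)]
# KFold with k = n/p produces disjoint test sets instead.
kfold_tests = [test.tolist() for _, test in KFold(n_splits=len(toy) // p).split(toy)]

print(f"LeavePOut({p}): {len(lpo_tests)} test sets -> {lpo_tests}")
print(f"KFold(k={len(toy) // p}): {len(kfold_tests)} test sets -> {kfold_tests}")

# Number of model fits LpOCV needs for 569 rows and p = 2.
n_fits = LeavePOut(p).get_n_splits(np.zeros((569, 1)))
print(f"LeavePOut({p}) on 569 rows: {n_fits} iterations")
```

The iteration count grows as C(n, p), which is why even p = 2 becomes expensive on a few hundred rows.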

In [15]:
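from sklearn.model_selection import LeavePOut

# p=2 tests on every pair of rows: C(569, 2) = 161,596 model fits, so this cell is slow.
# `model` is reused from the previous cell.
leave_validation=LeavePOut(2)
results=cross_val_score(model,X,y,cv=leave_validation)
#print(results)
#print(np.mean(results))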

## Stratified k-fold cross-validation

* One of the most widely used techniques.
* A slight variation of the k-fold cross-validation technique.
* Ensures that each fold has approximately the same proportion of each target class label as the full dataset.
* Suitable for an imbalanced dataset.
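
The effect of stratification can be seen by comparing the minority-class share per test fold for plain `KFold` and `StratifiedKFold`. This is a sketch on synthetic labels (an assumption): `y_demo` mirrors the 357/212 class counts but is deliberately sorted by class, which is an exaggerated worst case for unshuffled `KFold`; the helper `minority_share_per_fold` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic, class-sorted labels (assumption): 357 zeros followed by 212 ones.
y_demo = np.array([0] * 357 + [1] * 212)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for the split itself

def minority_share_per_fold(splitter):
    """Return the fraction of class-1 records in each test fold."""
    return [float(y_demo[test_idx].mean())
            for _, test_idx in splitter.split(X_demo, y_demo)]

plain = minority_share_per_fold(KFold(n_splits=10))
stratified = minority_share_per_fold(StratifiedKFold(n_splits=10))

print("overall share:", round(float(y_demo.mean()), 3))
print("KFold:          ", [round(s, 3) for s in plain])
print("StratifiedKFold:", [round(s, 3) for s in stratified])
```

On these sorted labels, plain `KFold` yields folds that contain only one class, while every `StratifiedKFold` fold stays close to the overall share.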

In [16]:
from sklearn.model_selection import StratifiedKFold

skfold=StratifiedKFold(n_splits=10)
model=DecisionTreeClassifier()
results=cross_val_score(model,X,y,cv=skfold)
print(results)
print(np.mean(results))

[0.87719298 0.85964912 0.92982456 0.85964912 0.92982456 0.89473684
 0.9122807  0.94736842 0.92982456 0.98214286]
0.912249373433584


# References

* https://scikit-learn.org/stable/modules/cross_validation.html#
* https://towardsdatascience.com/understanding-8-types-of-cross-validation-80c935a4976d
* https://www.upgrad.com/blog/cross-validation-in-machine-learning/