# <center>Cross Validation Techniques</center>

In [17]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score, KFold, StratifiedKFold, LeavePOut, LeaveOneOut, ShuffleSplit, TimeSeriesSplit
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings("ignore")

### `Hold Out Cross Validation`

In this technique, the entire dataset is split once into two parts: a training set and a test/validation set.  
A common rule of thumb is a 70-30 split, where 70% of the data is used for training  
and the remaining 30% is held out for testing/validation.  
An alternative rule is 80-20.

<b>Advantages:</b>
- Quick to execute.

<b>Disadvantages:</b>
- Not suitable for an imbalanced dataset, because a single random split may not preserve the class proportions.
- Not suitable for a very small dataset: the held-out portion is never used to train the model.

In [2]:
iris = load_iris()
X= iris.data
y = iris.target

log = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

log.fit(X_train, y_train)

print("Accuracy on training data is", accuracy_score(y_train, log.predict(X_train)))
print("Accuracy on testing data is", accuracy_score(y_test, log.predict(X_test)))

Accuracy on training data is 0.9619047619047619
Accuracy on testing data is 1.0
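
The imbalance concern listed above can be softened with a stratified hold-out split. The following is a minimal sketch rather than part of the original run: it assumes the `stratify` argument of `train_test_split` and compares the class counts of the test set with and without it.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Plain random hold-out: class counts in the test set depend on the shuffle.
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.3, random_state=42)

# Stratified hold-out: the test set keeps the class proportions of y.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

plain_counts = np.bincount(y_test_plain, minlength=3)
strat_counts = np.bincount(y_test_strat, minlength=3)
print("Test class counts without stratify:", plain_counts)
print("Test class counts with stratify:   ", strat_counts)
```

Because iris has 50 samples per class, the stratified test set of 45 samples should contain 15 samples of each class.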


### `K-Fold Cross Validation`

In this technique, the entire dataset is partitioned into K parts.  
Each partition is called a fold.  
In each iteration, 1 fold is used for validation/testing  
and the remaining K - 1 folds are used for training.  

This procedure is repeated K times, so that each fold is used exactly once as validation data  
while the remaining folds are used as training data.  
The final accuracy of the model is the mean of the K validation accuracies.

<b>Advantages:</b>
- Every sample is used for both training and validation.  

<b>Disadvantages:</b>
- Not recommended for imbalanced datasets, since a fold may not reflect the overall class proportions.

In [3]:
iris = load_iris()
X= iris.data
y = iris.target

log = LogisticRegression()

# cross_val_score clones and refits the model on each training split,
# so no separate train_test_split or fit call is needed here.
kf = KFold(n_splits=5)

score = cross_val_score(log, X, y, cv=kf)

print("Cross Validation Scores are:", score)
print("Average Cross Validation Score is:", score.mean())

Cross Validation Scores are: [1.         1.         0.86666667 0.93333333 0.83333333]
Average Cross Validation Score is: 0.9266666666666665
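
To make the "train on K - 1 folds, validate on 1 fold, then average" procedure concrete, here is a sketch that writes the loop out by hand and compares it with `cross_val_score`. The use of `clone` and the raised `max_iter` are choices made for this illustration and do not appear in the cells above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)  # max_iter raised so the solver converges
kf = KFold(n_splits=5)

manual_scores = []
for train_index, test_index in kf.split(X):
    # Fit a fresh copy of the model on the K - 1 training folds.
    fold_model = clone(model)
    fold_model.fit(X[train_index], y[train_index])
    # Score it on the single held-out fold.
    predictions = fold_model.predict(X[test_index])
    manual_scores.append(accuracy_score(y[test_index], predictions))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(model, X, y, cv=kf)

print("Manual fold scores:", manual_scores)
print("cross_val_score:   ", library_scores)
print("Mean accuracy:     ", manual_scores.mean())
```

The two score arrays should agree, since `cross_val_score` performs the same clone, fit, and score steps internally.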


### `Stratified K-Fold Cross Validation`

This is an enhanced version of the K-Fold cross validation technique,  
generally used with imbalanced datasets.  
The entire dataset is partitioned into K folds of approximately equal size.  
Each fold preserves the class ratio of the original dataset.   

<b>Disadvantages:</b>
- Not suitable for time series data.

In [5]:
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])

skf = StratifiedKFold(n_splits=2)
skf.get_n_splits(X, y)

print(skf)


for train_index, test_index in skf.split(X, y):
    print("Train:", train_index, "Test:", test_index)
    
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

StratifiedKFold(n_splits=2, random_state=None, shuffle=False)
Train: [1 3] Test: [0 2]
Train: [0 2] Test: [1 3]
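
The effect of stratification is easier to see on a larger dataset. The sketch below is an added illustration (the helper `fold_class_counts` is hypothetical): it counts the classes in every test fold of iris for plain `KFold` and for `StratifiedKFold`, both without shuffling.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

X, y = load_iris(return_X_y=True)  # 50 samples per class, stored in class order


def fold_class_counts(splitter):
    """Return the per-class sample counts of every test fold."""
    return [
        np.bincount(y[test_index], minlength=3).tolist()
        for _, test_index in splitter.split(X, y)
    ]


kfold_counts = fold_class_counts(KFold(n_splits=5))
stratified_counts = fold_class_counts(StratifiedKFold(n_splits=5))

print("KFold test-fold class counts:          ", kfold_counts)
print("StratifiedKFold test-fold class counts:", stratified_counts)
```

Because iris is stored in class order, the unshuffled `KFold` test folds are dominated by one or two classes, whereas every `StratifiedKFold` test fold holds the same number of samples from each class.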


### `Leave P Out Cross Validation`

This is an exhaustive cross validation technique in which P samples are used as the validation (test) set  
and the remaining (n - P) samples are used as the training set.  
For example, in a dataset with n = 100 samples and P = 10, each iteration uses 10 samples for validation and the remaining 90 samples for training.  
This process is repeated for every possible combination of P samples drawn from the n samples, so the number of iterations is C(n, P).  

<b>Advantages:</b>
- All the data samples get selected as both training and validation samples.

<b>Disadvantages:</b>
- High computation time, since the number of splits grows combinatorially with n and P.
- Not suitable for imbalanced datasets.

In [8]:
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])

lpo = LeavePOut(2)
lpo.get_n_splits(X)

print(lpo)

for train_index, test_index in lpo.split(X, y):
    print("Train:", train_index, "Test:", test_index)
    
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

LeavePOut(p=2)
Train: [2 3] Test: [0 1]
Train: [1 3] Test: [0 2]
Train: [1 2] Test: [0 3]
Train: [0 3] Test: [1 2]
Train: [0 2] Test: [1 3]
Train: [0 1] Test: [2 3]
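
The computation cost can be checked without enumerating any split. As a rough sketch (the array `X_large` is a placeholder introduced here), `get_n_splits` is compared with the binomial coefficient for the small example above and then evaluated for the n = 100, P = 10 case from the text.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

lpo = LeavePOut(p=2)
n_splits_small = lpo.get_n_splits(X)
print("Splits for n=4, p=2:", n_splits_small, "and C(4, 2) =", comb(4, 2))

# The n=100, P=10 example from the text, counted without generating the splits.
X_large = np.zeros((100, 1))  # placeholder data; only the sample count matters
n_splits_large = LeavePOut(p=10).get_n_splits(X_large)
print("Splits for n=100, p=10:", n_splits_large)
```

The second number is large enough that fitting one model per split is impractical, which is why Leave P Out is normally limited to small datasets or small values of P.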


### `Leave One Out Cross Validation`

This technique is a special case of Leave P Out with P = 1: each iteration holds out a single sample for validation and trains on the remaining n - 1 samples.

In [12]:
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])

loo = LeaveOneOut()
loo.get_n_splits(X)

print(loo)

for train_index, test_index in loo.split(X, y):
    print("Train:", train_index, "Test:", test_index)
    
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

LeaveOneOut()
Train: [1 2 3] Test: [0]
Train: [0 2 3] Test: [1]
Train: [0 1 3] Test: [2]
Train: [0 1 2] Test: [3]
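
The relationship between the two techniques can be verified directly. The following sketch, added for illustration, generates the splits of `LeaveOneOut()` and of `LeavePOut(p=1)` on the same four samples and checks that they coincide.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, LeavePOut

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Collect (train, test) index pairs as plain lists so they can be compared.
loo_splits = [(tr.tolist(), te.tolist()) for tr, te in LeaveOneOut().split(X)]
lpo_splits = [(tr.tolist(), te.tolist()) for tr, te in LeavePOut(p=1).split(X)]

for train_index, test_index in loo_splits:
    print("Train:", train_index, "Test:", test_index)

print("Same splits as LeavePOut(p=1):", loo_splits == lpo_splits)
```

With n samples there are exactly n splits, and each test set contains a single sample.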


### `Monte-Carlo : Shuffle Split Cross Validation`

In this technique, the dataset is randomly partitioned into a training set and a validation set, and the random split is repeated a chosen number of times.  
We decide what percentage of the dataset to use as the training set and what percentage to use as the validation set.  
If the two percentages add up to less than 100%, the remaining data is used in neither the training nor the testing set.  

For example,  
let's say we have 100 samples: if 60% of the samples are used as the training set and 20% as the testing set, the remaining 20% will not be used.

<b>Advantages:</b>
- We are free to set the size of training and testing sets.
- We can also choose the number of repetitions.
- It is not dependent on the number of folds.

<b>Disadvantages:</b>
- Some samples may never be selected for either the training or the testing set.
- Not suitable for imbalanced data.

In [16]:
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [3, 4], [5, 6]])
y = np.array([1, 2, 1, 2, 1, 2])

rs = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
rs.get_n_splits(X)

print(rs)

for train_index, test_index in rs.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    
print("\n\n")

rs = ShuffleSplit(n_splits=5, train_size=0.5, test_size=0.25, random_state=0)

for train_index, test_index in rs.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)

ShuffleSplit(n_splits=5, random_state=0, test_size=0.25, train_size=None)
TRAIN: [1 3 0 4] TEST: [5 2]
TRAIN: [4 0 2 5] TEST: [1 3]
TRAIN: [1 2 4 0] TEST: [3 5]
TRAIN: [3 4 1 0] TEST: [5 2]
TRAIN: [3 5 1 0] TEST: [2 4]



TRAIN: [1 3 0] TEST: [5 2]
TRAIN: [4 0 2] TEST: [1 3]
TRAIN: [1 2 4] TEST: [3 5]
TRAIN: [3 4 1] TEST: [5 2]
TRAIN: [3 5 1] TEST: [2 4]
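
The samples left out when the two percentages add up to less than 100% can be listed explicitly. This is a small sketch under the same settings as the second splitter above (`train_size=0.5`, `test_size=0.25`); the names `all_indices` and `unused_per_split` are introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [3, 4], [5, 6]])
all_indices = np.arange(len(X))

rs = ShuffleSplit(n_splits=5, train_size=0.5, test_size=0.25, random_state=0)

unused_per_split = []
for train_index, test_index in rs.split(X):
    # Any index that is in neither set is simply ignored for this split.
    used = np.concatenate([train_index, test_index])
    unused = np.setdiff1d(all_indices, used)
    unused_per_split.append(unused.tolist())
    print("TRAIN:", train_index, "TEST:", test_index, "UNUSED:", unused)
```

With 6 samples, 3 go to training and 2 to testing, so one sample is unused in every split.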


### `Time Series Cross Validation`

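This technique performs cross validation on time series data, where the order of the samples matters.  
Each split trains on an initial segment of the series and validates on the samples that immediately follow it, so the model is never trained on data from the future.  
The training window grows from one split to the next.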

In [19]:
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])

tscv = TimeSeriesSplit()
print(tscv)

for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    
# With 12 samples and the default n_splits=5, each test fold holds 12 // (5 + 1) = 2 samples
X = np.random.randn(12, 2)
y = np.random.randint(0, 2, 12)

tscv = TimeSeriesSplit()
print(tscv)

for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None)
TRAIN: [0] TEST: [1]
TRAIN: [0 1] TEST: [2]
TRAIN: [0 1 2] TEST: [3]
TRAIN: [0 1 2 3] TEST: [4]
TRAIN: [0 1 2 3 4] TEST: [5]
TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None)
TRAIN: [0 1] TEST: [2 3]
TRAIN: [0 1 2 3] TEST: [4 5]
TRAIN: [0 1 2 3 4 5] TEST: [6 7]
TRAIN: [0 1 2 3 4 5 6 7] TEST: [8 9]
TRAIN: [0 1 2 3 4 5 6 7 8 9] TEST: [10 11]
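
The test fold size can also be set explicitly instead of being derived from the number of splits. The sketch below is an added illustration with placeholder data: it assumes a scikit-learn version whose `TimeSeriesSplit` accepts `test_size` (the repr printed above lists that parameter) and uses `n_splits=3` so that only the last six samples are ever used for testing.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(24).reshape(12, 2)  # 12 time-ordered samples (placeholder values)

# Three splits, each validating on exactly 2 consecutive samples.
tscv = TimeSeriesSplit(n_splits=3, test_size=2)

splits = []
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    splits.append((train_index, test_index))
```

In every split the largest training index is smaller than the smallest test index, which is the property that distinguishes this splitter from the shuffled techniques above.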
