**<h1><center>CROSS-VALIDATION</center></h1>**

Cross-validation is a technique for evaluating model performance by repeatedly splitting the data into two parts: the model is trained on one part and tested on the other. The process is repeated over different splits so that every part of the data is used for testing at some point.

<u>Cross Validation Techniques</u>

1. LOOCV (Leave One Out Cross Validation)
2. K-Fold Cross Validation
3. Stratified Cross Validation

**<u>1. LOOCV (Leave One Out Cross Validation) : </u>**

In this method, we train on the whole data set except for a single data point, which is held out for testing, and repeat this for every data point. The method has both advantages and disadvantages.
- An advantage is that almost all data points are used for training in each iteration, so the performance estimate has low bias.

- The major drawback is execution time, since the model is trained as many times as there are data points. In addition, each iteration is tested against only one data point, so the individual test results vary widely. If that data point is an outlier, the variation is even higher.
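The procedure described above can be sketched with scikit-learn's `LeaveOneOut` splitter. This is a minimal illustration rather than a full evaluation: the toy arrays are assumptions made up for the example, and no model is trained.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Toy data, assumed for illustration: 4 samples with 2 features each.
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([0, 0, 1, 1])

loo = LeaveOneOut()

# One iteration per data point: each sample is the test set exactly once.
n_iterations = loo.get_n_splits(X)
splits = list(loo.split(X))

print("Number of iterations:", n_iterations)
for train_index, test_index in splits:
    print("TRAIN:", train_index, "TEST:", test_index)
```

With 4 samples the loop runs 4 times, and every test set holds a single index. This is why the run time grows with the size of the data set.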

**<u>2. K-Fold Cross Validation : </u>**

In K-Fold CV, the data set is split into K sections (folds), and each fold is used as the testing set exactly once.

Let's take an example in which the data set is divided into 5 parts, called 5-fold cross-validation (`K=5`). In the first iteration, the first fold is used to test the model and the rest are used to train it. In the second iteration, the second fold is used as the testing set while the rest serve as the training set. This process is repeated until each of the 5 folds has been used as the testing set.
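As a rough sketch of the five iterations described above, the loop below trains and scores a model once per fold. The synthetic data set from `make_classification` and the choice of `LogisticRegression` are assumptions made only for this illustration; any estimator with `fit` and `score` could take its place.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data, assumed for illustration only: 100 samples, 5 features.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

kf = KFold(n_splits=5)
scores = []
test_fold_sizes = []

for train_index, test_index in kf.split(X):
    # Train on the 4 remaining folds, test on the held-out fold.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_index], y[train_index])
    scores.append(model.score(X[test_index], y[test_index]))
    test_fold_sizes.append(len(test_index))

print("Accuracy per fold:", np.round(scores, 3))
print("Mean accuracy:", np.mean(scores))
```

The mean of the five per-fold accuracies is the usual single-number summary of the model's cross-validated performance.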

<center><img src = "https://miro.medium.com/max/1400/1*9NosjiPCNNAhHfEdNYFfUQ.png" width = 450px></center>

**Evaluating a ML model using K-Fold CV**

In [1]:
import numpy as np
from sklearn.model_selection import KFold

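# Toy data set: 4 samples with 2 features each, and binary labels.
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
kf = KFold(n_splits=2)

# Each iteration yields the training indices and the test indices of one split.
for train, test in kf.split(X, y):
    print("%s %s" % (train, test))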

[2 3] [0 1]
[0 1] [2 3]


Here the labels are 0 and 1. In the first split, K-Fold selects indices 2 and 3 for training, and both have label 1, so the training set contains only one class while the test set contains only the other. This is an imbalanced split. To avoid it, we use stratified cross-validation.

**<u>3. Stratified Cross Validation</u>**

In some cases we have an imbalanced data set, and when splitting it into training and test sets we need to take the target classes into account. Stratified cross-validation is an extension of K-Fold cross-validation.

In this technique, each fold contains approximately the same percentage of samples of each target class as the complete set or, in the case of prediction problems, the mean response value is approximately equal in all the folds. This variation is also known as Stratified K-Fold.
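To make the "same percentage in each fold" property concrete, the sketch below compares the share of class 1 in every test fold with its share in the full data set. The imbalanced labels (8 samples of class 0, 4 of class 1) and the placeholder feature column are assumptions invented for this illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels, assumed for illustration: 8 of class 0, 4 of class 1.
y = np.array([0] * 8 + [1] * 4)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

skf = StratifiedKFold(n_splits=4)

# Share of class 1 in the complete data set.
overall_ratio = y.mean()
print("Share of class 1 overall:", overall_ratio)

fold_ratios = []
for train_index, test_index in skf.split(X, y):
    # Share of class 1 inside this test fold.
    fold_ratios.append(y[test_index].mean())
    print("TEST:", test_index, "share of class 1:", fold_ratios[-1])
```

Each test fold receives two samples of class 0 and one of class 1, so the share of class 1 in every fold matches its share in the full data set.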

In [None]:
import numpy as np
from sklearn.model_selection import StratifiedKFold


X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])

stratified_kfold = StratifiedKFold(n_splits=2)
print(stratified_kfold)
stratified_kfold.get_n_splits(X, y)

StratifiedKFold(n_splits=2, random_state=None, shuffle=False)


2

In [None]:

# Split the data so that each fold preserves the class proportions of y.
for train_index, test_index in stratified_kfold.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

TRAIN: [1 3] TEST: [0 2]
TRAIN: [0 2] TEST: [1 3]


Here the labels are 0 and 1. In the first split, stratified K-Fold selects indices 1 and 3 for training, which have labels 0 and 1 respectively, so the training set is balanced.
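The difference between the two splitters on this toy data set can be checked side by side by counting the class labels in each training set. The helper `training_class_counts` is a hypothetical name introduced only for this sketch.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])

def training_class_counts(splitter):
    """Hypothetical helper: per split, count training samples of class 0 and class 1."""
    return [np.bincount(y[train_index], minlength=2).tolist()
            for train_index, _ in splitter.split(X, y)]

kfold_counts = training_class_counts(KFold(n_splits=2))
stratified_counts = training_class_counts(StratifiedKFold(n_splits=2))

print("KFold:          ", kfold_counts)       # [[0, 2], [2, 0]]
print("StratifiedKFold:", stratified_counts)  # [[1, 1], [1, 1]]
```

With plain K-Fold, each training set contains only one class. With stratified K-Fold, each training set contains one sample of each class.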