# <span style = "color:coral">K-Fold Cross Validation</span>

***

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.

The general procedure is as follows:

1. Shuffle the dataset randomly.
2. Split the dataset into k groups (folds).
3. For each unique group:
    1. Take the group as the hold-out (test) data set.
    2. Take the remaining groups as the training data set.
    3. Fit a model on the training set and evaluate it on the test set.
    4. Retain the evaluation score and discard the model.
4. Summarize the skill of the model using the sample of model evaluation scores, for example by their mean.
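The steps above can be sketched in plain Python without any library. This is only an illustrative sketch: the helper names `k_fold_indices` and `cross_validate`, the toy `labels` list, and the majority-class "model" are all assumptions made for this example and are not part of sklearn.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # step 1: shuffle the dataset
    # Step 2: split into k groups; the first n_samples % k groups get one extra item
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]                  # step 3.1: hold-out group
        train_idx = indices[:start] + indices[start + size:]    # step 3.2: remaining groups
        yield train_idx, test_idx
        start += size


def cross_validate(fit_and_score, n_samples, k, seed=0):
    """Run fit_and_score once per fold and return the scores and their mean."""
    scores = [fit_and_score(train, test)                        # steps 3.3 and 3.4
              for train, test in k_fold_indices(n_samples, k, seed)]
    return scores, sum(scores) / len(scores)                    # step 4: summarize


# Hypothetical toy data: 7 samples of class 0 and 3 samples of class 1
labels = [0, 0, 0, 0, 1, 1, 0, 0, 1, 0]


def majority_score(train_idx, test_idx):
    """Toy 'model': predict the most common training label, return test accuracy."""
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == majority for i in test_idx) / len(test_idx)


scores, mean_score = cross_validate(majority_score, len(labels), k=5)
print("Fold scores:", scores)
print("Mean score:", mean_score)
```

Each sample lands in the test set exactly once, so every observation contributes to the final estimate.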

### Simple implementation of K-Fold Cross-Validation in Python

In [1]:
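from sklearn.model_selection import KFold

X = ["a", "b", "c", "d", "e", "f"]

# Split the six samples into 3 folds; each fold serves once as the test set
kf = KFold(n_splits=3)

# kf.split yields arrays of indices into X, not the elements themselves
for train, test in kf.split(X):
    print("Train data", train, "Test data", test)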

Train data [2 3 4 5] Test data [0 1]
Train data [0 1 4 5] Test data [2 3]
Train data [0 1 2 3] Test data [4 5]


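Note that `KFold(n_splits=3)` does not shuffle by default, which is why the folds above are consecutive blocks of indices. A minimal sketch of the shuffled variant, matching step 1 of the procedure, could look like this; the `random_state` value is an arbitrary choice made here for reproducibility.

```python
from sklearn.model_selection import KFold

X = ["a", "b", "c", "d", "e", "f"]

# shuffle=True randomizes which samples land in each fold;
# random_state (arbitrary value, chosen for this example) makes the split repeatable
kf = KFold(n_splits=3, shuffle=True, random_state=42)

splits = list(kf.split(X))
for train, test in splits:
    print("Train data", train, "Test data", test)
```

Every index still appears in exactly one test fold; only the assignment of samples to folds changes.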

### Example with real data

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

In [3]:
digits = load_digits()

In [4]:
def get_score(model, X_train, X_test, y_train, y_test):
    # Fit on the training fold and return the mean accuracy on the test fold
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

In [6]:
from sklearn.model_selection import StratifiedKFold

# StratifiedKFold keeps the class proportions roughly equal in every fold
folds = StratifiedKFold(n_splits=3)

scores_logistic = []
scores_svm = []
scores_rf = []

# Apply cross-validation: train and score each model once per fold
for train_index, test_index in folds.split(digits.data, digits.target):
    X_train, X_test = digits.data[train_index], digits.data[test_index]
    y_train, y_test = digits.target[train_index], digits.target[test_index]
    scores_logistic.append(get_score(LogisticRegression(solver='liblinear', multi_class='ovr'), X_train, X_test, y_train, y_test))
    scores_svm.append(get_score(SVC(gamma='auto'), X_train, X_test, y_train, y_test))
    scores_rf.append(get_score(RandomForestClassifier(n_estimators=40), X_train, X_test, y_train, y_test))

In [7]:
print('Logistic Regression scores are: ', scores_logistic)
print('SVM scores are: ', scores_svm)
print('Random Forest scores are: ', scores_rf)

Logistic Regression scores are:  [0.8948247078464107, 0.9532554257095158, 0.9098497495826378]
SVM scores are:  [0.3806343906510851, 0.41068447412353926, 0.5125208681135225]
Random Forest scores are:  [0.9232053422370617, 0.9348914858096828, 0.9315525876460768]
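The last step of the procedure, summarizing the scores, is not shown in the cells above. A minimal sketch using the fold scores printed above follows; the dictionary layout and the choice of the sample standard deviation are assumptions made for this example.

```python
from statistics import mean, stdev

# Fold scores copied from the output above
scores = {
    "Logistic Regression": [0.8948247078464107, 0.9532554257095158, 0.9098497495826378],
    "SVM": [0.3806343906510851, 0.41068447412353926, 0.5125208681135225],
    "Random Forest": [0.9232053422370617, 0.9348914858096828, 0.9315525876460768],
}

# Mean accuracy summarizes skill; the sample standard deviation shows the spread across folds
summary = {name: (mean(values), stdev(values)) for name, values in scores.items()}

for name, (avg, spread) in summary.items():
    print(f"{name}: mean={avg:.3f}, std={spread:.3f}")
```

Comparing means alongside the spread gives a fairer picture than looking at any single fold.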


### <span style = "color: blue"> In practice we do not need to write this loop by hand. sklearn provides a function that does it for us </span>

#### cross_val_score

In [8]:
from sklearn.model_selection import cross_val_score

#### For Logistic Regression

In [9]:
cross_val_score(LogisticRegression(solver='liblinear', multi_class='ovr'), digits.data, digits.target, cv=3)

array([0.89482471, 0.95325543, 0.90984975])

#### For SVM

In [10]:
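cross_val_score(SVC(), digits.data, digits.target, cv=3)

array([0.96494157, 0.97996661, 0.96494157])

These scores are much higher than those from the manual loop because `SVC()` here uses the default `gamma='scale'`, whereas the loop above used `gamma='auto'`.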

#### For Random Forest Classifier

In [11]:
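cross_val_score(RandomForestClassifier(n_estimators=40), digits.data, digits.target, cv=3)

array([0.93989983, 0.94657763, 0.93155259])

Random forest scores vary slightly from run to run because no `random_state` is set, so they do not match the manual loop exactly.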
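To make the link between `cross_val_score` and the manual loop explicit, the sketch below runs both with the same model and compares the results. It assumes that passing an integer `cv` with a classifier makes `cross_val_score` use unshuffled `StratifiedKFold` splits; `SVC()` with default settings is chosen here only because it has no random component.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

digits = load_digits()
X, y = digits.data, digits.target

# One-line version: sklearn handles splitting, fitting and scoring
auto_scores = cross_val_score(SVC(), X, y, cv=3)

# Manual version: the same three stratified folds, written out explicitly
manual_scores = []
for train_index, test_index in StratifiedKFold(n_splits=3).split(X, y):
    model = SVC().fit(X[train_index], y[train_index])
    manual_scores.append(model.score(X[test_index], y[test_index]))

print("cross_val_score:", auto_scores)
print("manual loop:    ", np.array(manual_scores))
print("Same scores:", np.allclose(auto_scores, manual_scores))
```

A `StratifiedKFold` or `KFold` object can also be passed directly as `cv` when shuffling or a different splitting strategy is needed.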

***

# <center><span style = "color:CornflowerBlue; font-family:Courier New;font-size:40px">EDURE LEARNING</span></center>