
Machine learning models are typically trained on one subset of the available data and evaluated on a complementary subset. In supervised machine learning, a portion of the dataset is usually held out as test data. The test data is excluded from training and can be used to evaluate the generalization performance of the model.

Herein, we demonstrate how a common method such as the train_test_split() function can give a misleading picture of model performance on imbalanced classification tasks. To tackle this issue, we implement Stratified K-Fold cross-validation, which splits the data into K folds while preserving the class ratio of the original dataset, and we train our classifier on the resulting stratified folds. The iris dataset is used to fit a support vector machine. In addition, K-Fold and Stratified K-Fold cross-validation are both implemented to assess the generalization ability of the trained model.

## Modules

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import datasets

## Data

In [2]:
iris = datasets.load_iris()

In [3]:
X = pd.DataFrame(iris.data)

In [4]:
Y = iris.target

## Split data into train and test set

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
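
The split above is purely random, so the class proportions in the two subsets are not guaranteed to match the full dataset. As a minimal sketch (the seed `random_state=0` and the variable names are assumptions, not part of the original notebook), the `stratify` argument of `train_test_split` can be used to keep the class ratio in both subsets:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, Y = iris.data, iris.target

# stratify=Y keeps the class proportions of Y in both subsets;
# random_state=0 is an arbitrary seed chosen only for reproducibility.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=0
)

train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print("train class counts:", train_counts)
print("test class counts: ", test_counts)
```

Because iris has 50 samples per class, a stratified 80/20 split puts 40 samples of each class in the training set and 10 of each in the test set.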

## Build Model

In [6]:
model = svm.SVC()

In [7]:
model.fit(X_train, y_train)

SVC()

In [8]:
model.score(X_train, y_train)

0.9833333333333333

In [9]:
model.score(X_test, y_test)

0.9333333333333333

## Note !!
During training, the dataset is split into a train and a test set by the train_test_split function of scikit-learn, and after training the model we obtain an accuracy score. But is this score a reliable estimate, and will the model perform equally well at deployment time? A single split cannot answer that, because the score depends on which samples happen to land in the test set. This is where cross-validation splitting techniques become important. Cross-validation is a technique used to detect overfitting, i.e. the model failing to generalize patterns beyond the training data.
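
One way to see why a single split is not enough is to repeat it with different random seeds and watch the test accuracy move. The following is only an illustrative sketch; the seeds 0 to 9 and the names `split_scores` and `clf` are assumptions introduced here:

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, Y = iris.data, iris.target

split_scores = []
for seed in range(10):  # arbitrary seeds, used only to vary the split
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, Y, test_size=0.2, random_state=seed
    )
    clf = svm.SVC().fit(X_tr, y_tr)
    split_scores.append(clf.score(X_te, y_te))

split_scores = np.array(split_scores)
print("min: %.3f  max: %.3f  mean: %.3f"
      % (split_scores.min(), split_scores.max(), split_scores.mean()))
```

The spread between the minimum and maximum gives a rough idea of how much a single hold-out score can depend on the particular split.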

## Cross-Validation (CV)

## K-fold CV
The most commonly used validation technique is K-Fold CV. It splits the dataset into k folds. In each round, k-1 folds are used for training and the remaining fold is held out for testing; this is repeated k times so that every fold serves as the test set exactly once. A total of k models are fitted and evaluated, and the score for each fold is returned.
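
To make the fold mechanics concrete, here is a small sketch on a toy array of six samples (the array `toy` and the choice of three folds are illustrative assumptions). It prints which indices go to training and testing in each round:

```python
import numpy as np
from sklearn.model_selection import KFold

toy = np.arange(6)      # six samples, purely illustrative
kf = KFold(n_splits=3)  # no shuffling: folds are consecutive blocks

folds = []
for train_idx, test_idx in kf.split(toy):
    folds.append((train_idx.tolist(), test_idx.tolist()))
    print("train:", train_idx, "test:", test_idx)
```

Each sample index appears in exactly one test fold, and in the training set of the other two rounds.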


In [10]:
from sklearn.model_selection import cross_val_score

In [11]:
scores = cross_val_score(model, X, Y, cv=10)
scores

array([1.        , 0.93333333, 1.        , 1.        , 1.        ,
       0.93333333, 0.93333333, 0.93333333, 1.        , 1.        ])

In [12]:
print("Average Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Average Accuracy: 0.97 (+/- 0.07)
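
Note that when `cv` is an integer and the estimator is a classifier with a multiclass target, `cross_val_score` uses `StratifiedKFold` internally, so the call above with `cv=10` is already stratified. To evaluate with plain K-Fold, a `KFold` object can be passed explicitly. The sketch below assumes shuffling with an arbitrary seed (`random_state=0`), since the iris samples are ordered by class:

```python
from sklearn import datasets, svm
from sklearn.model_selection import KFold, cross_val_score

iris = datasets.load_iris()
X, Y = iris.data, iris.target

# shuffle=True because iris is sorted by class; the seed is arbitrary.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
kf_scores = cross_val_score(svm.SVC(), X, Y, cv=kf)
print("Plain K-Fold accuracy: %0.2f (+/- %0.2f)"
      % (kf_scores.mean(), kf_scores.std() * 2))
```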


## Stratified K-Fold
K-Fold CV works well for balanced classification tasks. However, it can give misleading estimates on imbalanced datasets. This is because K-Fold splits the data without regard to the class labels, so some folds may contain few or no samples of a minority class. To overcome this, Stratified K-Fold cross-validation, an extension of K-Fold CV, is used. It splits the data in a stratified manner: each of the K folds keeps approximately the same class ratio as the original dataset.
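
The difference is easiest to see on a deliberately imbalanced label vector. The sketch below uses hypothetical toy labels (90 samples of class 0 followed by 10 of class 1, not taken from the iris data) and counts the minority-class samples that land in each test fold:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 samples of class 0, then 10 of class 1.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split

def minority_per_fold(splitter):
    """Count class-1 samples in each test fold."""
    return [int(y_toy[test_idx].sum())
            for _, test_idx in splitter.split(X_toy, y_toy)]

kfold_counts = minority_per_fold(KFold(n_splits=5))
strat_counts = minority_per_fold(StratifiedKFold(n_splits=5))
print("KFold minority counts:          ", kfold_counts)
print("StratifiedKFold minority counts:", strat_counts)
```

With unshuffled K-Fold, all ten minority samples fall into the last fold, while the stratified split places two in every fold.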

In [13]:
from sklearn.model_selection import StratifiedKFold

model_skf = svm.SVC()
accuracy = []

# 10 stratified folds; without shuffling the split is deterministic
skf = StratifiedKFold(n_splits=10)
for train_index, test_index in skf.split(X, Y):
    # X is a DataFrame (positional .iloc), Y is a NumPy array
    X1_train, X1_test = X.iloc[train_index], X.iloc[test_index]
    Y1_train, Y1_test = Y[train_index], Y[test_index]

    # Refit on the training folds and score on the held-out fold
    model_skf.fit(X1_train, Y1_train)
    scores1 = model_skf.score(X1_test, Y1_test)
    accuracy.append(scores1)
print(accuracy)

[1.0, 0.9333333333333333, 1.0, 1.0, 1.0, 0.9333333333333333, 0.9333333333333333, 0.9333333333333333, 1.0, 1.0]


In [14]:
Average_score = sum(accuracy) / len(accuracy)
print('Average_score:', Average_score)

Average_score: 0.9733333333333334
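
The manual loop can also be written more compactly by handing the splitter to `cross_val_score`. The sketch below re-creates both variants side by side so the per-fold scores can be compared; the names `manual` and `auto` are assumptions introduced for this illustration:

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()
X, Y = iris.data, iris.target

skf = StratifiedKFold(n_splits=10)

# Manual loop: fit a fresh SVC on each training split, score on the rest.
manual = []
for train_idx, test_idx in skf.split(X, Y):
    clf = svm.SVC().fit(X[train_idx], Y[train_idx])
    manual.append(clf.score(X[test_idx], Y[test_idx]))

# Same splitter passed to cross_val_score.
auto = cross_val_score(svm.SVC(), X, Y, cv=skf)

print("manual mean:         ", np.mean(manual))
print("cross_val_score mean:", auto.mean())
```

Since the unshuffled splitter is deterministic, both approaches should evaluate the same folds.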


## Conclusion
As shown above, we have a fairly robust model: evaluated with 10-fold stratified cross-validation, it reaches a mean accuracy of 97.3% on the held-out folds.