# KFold

##### Choosing which machine learning model to use for a given problem is often a dilemma. K-fold cross-validation evaluates a model by splitting the dataset into K folds, then training on K-1 folds and testing on the remaining one, once per fold. Because every sample is used for testing exactly once, this gives a more reliable estimate than a single train_test_split. This tutorial covers the basics of cross-validation and KFold, and then looks at the cross_val_score function from the sklearn library, which provides a convenient way to run cross-validation on a model.

In [1]:
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

In [2]:
digit=load_digits()
dir(digit)

['DESCR', 'data', 'images', 'target', 'target_names']

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(digit.data,digit.target,test_size=0.2)

In [5]:
ld_lr=LogisticRegression()
ld_svm=SVC()
ld_rf=RandomForestClassifier()

### Logistic regression

In [10]:
ld_lr.fit(X_train,y_train)
ld_lr.score(X_test,y_test)

0.9555555555555556

### SVM

In [7]:
ld_svm.fit(X_train,y_train)
ld_svm.score(X_test,y_test)

0.4583333333333333

### Random forest

In [8]:
ld_rf.fit(X_train,y_train)
ld_rf.score(X_test,y_test)

0.9305555555555556
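
The three scores above each come from one random split, so they change whenever the split changes. A minimal sketch of that effect, assuming fixed `random_state` values (0, 1, 2) and `max_iter=5000` for the logistic regression, neither of which is used in the cells above:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()

# Score the same model on three different random 80/20 splits.
split_scores = []
for seed in range(3):
    X_tr, X_te, y_tr, y_te = train_test_split(
        digits.data, digits.target, test_size=0.2, random_state=seed
    )
    # max_iter is raised here (an assumption) so the solver can converge
    model = LogisticRegression(max_iter=5000)
    model.fit(X_tr, y_tr)
    split_scores.append(model.score(X_te, y_te))

print(split_scores)
```

The spread between these numbers is what cross-validation averages out.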

### Using KFold

In [13]:
from sklearn.model_selection import KFold
kf=KFold(n_splits=3)
kf

KFold(n_splits=3, random_state=None, shuffle=False)

In [14]:
for train_index, test_index in kf.split([1,2,3,4,5,6,7,8,9]):
    print(train_index, test_index)

[3 4 5 6 7 8] [0 1 2]
[0 1 2 6 7 8] [3 4 5]
[0 1 2 3 4 5] [6 7 8]
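
To make the output above concrete, here is a rough pure-Python sketch of how unshuffled folds can be formed: the indices are cut into contiguous blocks, and each block takes one turn as the test set. The helper name `kfold_indices` is hypothetical and only illustrates the idea; it is not sklearn's implementation.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    # When n_samples is not divisible by n_splits, the first
    # n_samples % n_splits folds get one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        stop = start + size
        test = list(range(start, stop))
        train = [j for j in range(n_samples) if j < start or j >= stop]
        yield train, test
        start = stop

splits = list(kfold_indices(9, 3))
for train, test in splits:
    print(train, test)
```

For 9 samples and 3 splits this yields the same three train/test partitions as the `kf.split` call above.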


#### Using StratifiedKFold for digits data

In [21]:
from sklearn.model_selection import StratifiedKFold

def get_score(model, X_train, X_test, y_train, y_test):
    # Fit on the training fold and return accuracy on the held-out fold
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

folds = StratifiedKFold(n_splits=3)

scores_logistic = []
scores_svm = []
scores_rf = []

for train_index, test_index in folds.split(digit.data, digit.target):
    X_train, X_test = digit.data[train_index], digit.data[test_index]
    y_train, y_test = digit.target[train_index], digit.target[test_index]
    scores_logistic.append(get_score(LogisticRegression(solver='liblinear', multi_class='ovr'), X_train, X_test, y_train, y_test))
    scores_svm.append(get_score(SVC(gamma='auto'), X_train, X_test, y_train, y_test))
    scores_rf.append(get_score(RandomForestClassifier(n_estimators=40), X_train, X_test, y_train, y_test))
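
The cell above uses StratifiedKFold rather than plain KFold. The difference is that StratifiedKFold keeps the class proportions roughly equal in every fold. A small sketch on a toy label array (the 6-versus-3 class balance is made up for illustration) shows this by counting labels in each test fold:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy data: six samples of class 0 followed by three of class 1
y = np.array([0] * 6 + [1] * 3)
X = np.zeros((9, 1))  # features are irrelevant for the split itself

counts = {}
splitters = [("KFold", KFold(n_splits=3)),
             ("StratifiedKFold", StratifiedKFold(n_splits=3))]
for name, splitter in splitters:
    # Count how many samples of each class land in every test fold
    counts[name] = [np.bincount(y[test_idx], minlength=2)
                    for _, test_idx in splitter.split(X, y)]
    print(name, [c.tolist() for c in counts[name]])
```

With sorted labels like these, unshuffled KFold produces test folds that contain only one class, while StratifiedKFold puts two samples of class 0 and one of class 1 in every test fold.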

In [22]:
from sklearn.model_selection import cross_val_score


In [23]:
cross_val_score(LogisticRegression(), digit.data, digit.target, cv=3)

array([0.89534884, 0.94991653, 0.90939597])

In [24]:
cross_val_score(SVC(), digit.data, digit.target, cv=3)

array([0.39368771, 0.41068447, 0.45973154])

In [25]:
cross_val_score(RandomForestClassifier(), digit.data, digit.target, cv=3)

array([0.88372093, 0.91986644, 0.91275168])
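
When `cv` is an integer and the estimator is a classifier, cross_val_score splits the data with an unshuffled StratifiedKFold. The sketch below checks this by passing the splitter explicitly; the linear-kernel SVC is chosen here only because it gives the same result on every run, so the two calls can be compared directly.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

digits = load_digits()

# Integer cv: the splitter is chosen by cross_val_score
scores_int = cross_val_score(SVC(kernel="linear"), digits.data, digits.target, cv=3)

# Explicit splitter: the same folds, spelled out
explicit_folds = StratifiedKFold(n_splits=3)
scores_explicit = cross_val_score(
    SVC(kernel="linear"), digits.data, digits.target, cv=explicit_folds
)

print(np.allclose(scores_int, scores_explicit))
```

Passing a splitter object is also the way to request shuffling or a fixed random seed for the folds.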

In [26]:
import numpy as np 


In [33]:
s1=cross_val_score(RandomForestClassifier(n_estimators=5), digit.data, digit.target, cv=10)
print(np.average(s1))

0.8785977051138912


In [34]:
s4=cross_val_score(RandomForestClassifier(n_estimators=40), digit.data, digit.target, cv=10)
print(np.average(s4))

0.9538325337353033


In [36]:
s1=cross_val_score(SVC(kernel='linear'), digit.data, digit.target, cv=10)
print(np.average(s1))

0.9610800248897716
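
The comparisons above can also be run in one loop. This is a minimal sketch, assuming a hypothetical `candidates` dictionary and a fixed `random_state=0` for the random forests (not used in the cells above); it reports the mean and standard deviation of the 10 fold scores for each model.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

digits = load_digits()

# Candidate models to compare (names are illustrative labels only)
candidates = {
    "rf_5_trees": RandomForestClassifier(n_estimators=5, random_state=0),
    "rf_40_trees": RandomForestClassifier(n_estimators=40, random_state=0),
    "svm_linear": SVC(kernel="linear"),
}

summary = {}
for name, model in candidates.items():
    # One accuracy score per fold, 10 folds in total
    summary[name] = cross_val_score(model, digits.data, digits.target, cv=10)
    print(f"{name}: mean={summary[name].mean():.4f} std={summary[name].std():.4f}")
```

Looking at the standard deviation alongside the mean helps judge whether a gap between two models is larger than the fold-to-fold variation.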
