# CROSS VALIDATION

- TRAIN TEST SPLIT
- K FOLD CV
- LEAVE ONE OUT CV
- STRATIFIED CV
- TIME SERIES CV

**IMPORT LIBRARIES AND DATASET**

In [52]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [53]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

In [54]:
from sklearn.datasets import load_digits

digit = load_digits()

In [55]:
dir(digit)

['DESCR', 'data', 'feature_names', 'frame', 'images', 'target', 'target_names']

**TRAIN TEST SPLIT**

In [56]:
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(digit.data, digit.target, test_size=0.2, random_state=42)

**APPLYING LOGISTIC REGRESSION**

In [57]:
log_model = LogisticRegression()
log_model.fit(train_X, train_y)
log_pred = log_model.predict(test_X)

log_model.score(train_X, train_y), log_model.score(test_X, test_y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


(1.0, 0.9694444444444444)
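
The convergence warning above names two remedies: raising `max_iter` or scaling the features. A minimal sketch that applies both, assuming a `StandardScaler` step and `max_iter=1000` (neither choice comes from the original notebook, and the resulting scores may differ from the ones shown above):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digit = load_digits()
train_X, test_X, train_y, test_y = train_test_split(
    digit.data, digit.target, test_size=0.2, random_state=42
)

# Assumption: standardize the pixel features and give the solver more iterations.
scaled_log_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log_model.fit(train_X, train_y)

train_score = scaled_log_model.score(train_X, train_y)
test_score = scaled_log_model.score(test_X, test_y)
print(train_score, test_score)
```

Wrapping the scaler and the classifier in one pipeline means the scaler is fitted on the training data only, which matters once the same object is passed to a cross-validation helper.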

**APPLYING SVM**

In [58]:
svc_model = SVC()
svc_model.fit(train_X, train_y)
svc_pred = svc_model.predict(test_X)

svc_model.score(train_X, train_y), svc_model.score(test_X, test_y)

(0.9965205288796103, 0.9861111111111112)

**MODEL WITH SVM LINEAR KERNEL, RBF KERNEL AND POLY KERNEL**

In [59]:
svc_linear = SVC(kernel='linear')
svc_linear.fit(train_X, train_y)
svc_pred = svc_linear.predict(test_X)

svc_linear.score(train_X, train_y), svc_linear.score(test_X, test_y)

(1.0, 0.9777777777777777)

In [60]:
svc_rbf = SVC(kernel='rbf')
svc_rbf.fit(train_X, train_y)
svc_pred = svc_rbf.predict(test_X)

svc_rbf.score(train_X, train_y), svc_rbf.score(test_X, test_y)

(0.9965205288796103, 0.9861111111111112)

In [61]:
svc_poly = SVC(kernel='poly')
svc_poly.fit(train_X, train_y)
svc_pred = svc_poly.predict(test_X)

svc_poly.score(train_X, train_y), svc_poly.score(test_X, test_y)

(0.9979123173277662, 0.9916666666666667)

**RANDOM FOREST**

In [62]:
rf_clf = RandomForestClassifier()
rf_clf.fit(train_X, train_y)
rf_pred = rf_clf.predict(test_X)

rf_clf.score(train_X, train_y), rf_clf.score(test_X, test_y)

(1.0, 0.9805555555555555)

**K-FOLD CROSS VALIDATION**

In [63]:
from sklearn.model_selection import KFold

In [64]:
kf = KFold(n_splits=3)
kf

KFold(n_splits=3, random_state=None, shuffle=False)

In [65]:
for train_index, test_index in kf.split([1, 2, 3, 4, 5, 6, 7, 8, 9]):
    print(train_index, test_index)

[3 4 5 6 7 8] [0 1 2]
[0 1 2 6 7 8] [3 4 5]
[0 1 2 3 4 5] [6 7 8]
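
With `shuffle=False` the test folds are contiguous blocks, which can bias the estimate when the rows are ordered (for example, sorted by class). A small sketch of the shuffled variant on the same nine toy values; the seed `random_state=0` is an arbitrary assumption:

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(1, 10)  # the same nine toy values as above

# Assumption: shuffle with a fixed seed so the folds are random but reproducible.
shuffled_kf = KFold(n_splits=3, shuffle=True, random_state=0)

seen_test_indices = []
for train_index, test_index in shuffled_kf.split(samples):
    print(train_index, test_index)
    seen_test_indices.extend(test_index.tolist())

# Every sample still lands in exactly one test fold.
print(sorted(seen_test_indices))
```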


**SCORES ON VARIOUS MODELS**

In [66]:
def get_score(model, train_X, test_X, train_y, test_y):
    model.fit(train_X, train_y)
    model_train_score = model.score(train_X, train_y)
    model_test_score = model.score(test_X, test_y)
    return {'model': str(model), 'Train score': model_train_score, 'Test score': model_test_score}

In [67]:
models = [LogisticRegression(), SVC(kernel='linear'), SVC(kernel='poly'), SVC(kernel='rbf'), RandomForestClassifier(criterion='gini', n_estimators=40), RandomForestClassifier(criterion='entropy')]

model_and_scores = pd.DataFrame([get_score(model, train_X, test_X, train_y, test_y) for model in models])
model_and_scores.sort_values(['Test score'], ascending=False)

                                         model  Train score  Test score
2                           SVC(kernel='poly')     0.997912    0.991667
3                                        SVC()     0.996521    0.986111
4      RandomForestClassifier(n_estimators=40)     1.000000    0.980556
5  RandomForestClassifier(criterion='entropy')     1.000000    0.980556
1                         SVC(kernel='linear')     1.000000    0.977778
0                         LogisticRegression()     1.000000    0.969444


In [68]:
for train_index, test_index in kf.split(digit.data):
    train_x, test_x, train_y, test_y = digit.data[train_index], digit.data[test_index], digit.target[train_index], digit.target[test_index]
    
    model_and_scores = pd.DataFrame([get_score(model, train_x, test_x, train_y, test_y) for model in models])
    print(model_and_scores.sort_values(['Test score'], ascending=False))

                                         model  Train score  Test score
3                                        SVC()     0.997496    0.966611
2                           SVC(kernel='poly')     0.999165    0.958264
1                         SVC(kernel='linear')     1.000000    0.934891
5  RandomForestClassifier(criterion='entropy')     1.000000    0.924875
0                         LogisticRegression()     1.000000    0.923205
4      RandomForestClassifier(n_estimators=40)     1.000000    0.923205


                                         model  Train score  Test score
3                                        SVC()     0.995826    0.981636
2                           SVC(kernel='poly')     1.000000    0.979967
1                         SVC(kernel='linear')     1.000000    0.956594
5  RandomForestClassifier(criterion='entropy')     1.000000    0.954925
4      RandomForestClassifier(n_estimators=40)     1.000000    0.951586
0                         LogisticRegression()     1.000000    0.941569


                                         model  Train score  Test score
3                                        SVC()     0.996661    0.954925
2                           SVC(kernel='poly')     1.000000    0.951586
1                         SVC(kernel='linear')     1.000000    0.939900
5  RandomForestClassifier(criterion='entropy')     1.000000    0.918197
0                         LogisticRegression()     1.000000    0.914858
4      RandomForestClassifier(n_estimators=40)     1.000000    0.913189
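
Printing one table per fold makes the models hard to compare. One way to condense the loop above is to collect every fold's test score and average per model. A sketch under stated assumptions: the dictionary `candidates`, its keys, and the restriction to two SVC variants are illustrative choices made to keep the run short.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold
from sklearn.svm import SVC

digit = load_digits()
kf = KFold(n_splits=3)

# Assumption: two SVC variants only, to keep the sketch quick.
candidates = {'svc_rbf': SVC(kernel='rbf'), 'svc_linear': SVC(kernel='linear')}

rows = []
for fold, (train_index, test_index) in enumerate(kf.split(digit.data)):
    for name, model in candidates.items():
        model.fit(digit.data[train_index], digit.target[train_index])
        rows.append({
            'model': name,
            'fold': fold,
            'test_score': model.score(digit.data[test_index], digit.target[test_index]),
        })

fold_scores = pd.DataFrame(rows)
# Mean and spread of the test score across the three folds, per model.
summary = fold_scores.groupby('model')['test_score'].agg(['mean', 'std'])
print(summary.sort_values('mean', ascending=False))
```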


**STRATIFIED K FOLD CROSS VALIDATION**

In [76]:
from sklearn.model_selection import StratifiedKFold

In [77]:
folds = StratifiedKFold(n_splits=10)

In [78]:
for train_index, test_index in folds.split(digit.data, digit.target):
    train_x, test_x, train_y, test_y = digit.data[train_index], digit.data[test_index], digit.target[train_index], digit.target[test_index]
    
    model_and_scores = pd.DataFrame([get_score(model, train_x, test_x, train_y, test_y) for model in models])
    print(model_and_scores.sort_values(['Test score'], ascending=False))

                                         model  Train score  Test score
2                           SVC(kernel='poly')     0.999382    0.961111
3                                        SVC()     0.998145    0.944444
1                         SVC(kernel='linear')     1.000000    0.938889
0                         LogisticRegression()     1.000000    0.905556
4      RandomForestClassifier(n_estimators=40)     1.000000    0.905556
5  RandomForestClassifier(criterion='entropy')     1.000000    0.900000


                                         model  Train score  Test score
2                           SVC(kernel='poly')     0.999382    1.000000
1                         SVC(kernel='linear')     1.000000    0.994444
3                                        SVC()     0.996289    0.988889
4      RandomForestClassifier(n_estimators=40)     1.000000    0.977778
5  RandomForestClassifier(criterion='entropy')     1.000000    0.977778
0                         LogisticRegression()     1.000000    0.961111


                                         model  Train score  Test score
4      RandomForestClassifier(n_estimators=40)     1.000000    0.961111
1                         SVC(kernel='linear')     1.000000    0.933333
2                           SVC(kernel='poly')     0.999382    0.933333
5  RandomForestClassifier(criterion='entropy')     1.000000    0.933333
3                                        SVC()     0.996289    0.927778
0                         LogisticRegression()     1.000000    0.877778


                                         model  Train score  Test score
2                           SVC(kernel='poly')     0.998763    0.983333
3                                        SVC()     0.996289    0.966667
1                         SVC(kernel='linear')     1.000000    0.944444
4      RandomForestClassifier(n_estimators=40)     1.000000    0.933333
0                         LogisticRegression()     1.000000    0.927778
5  RandomForestClassifier(criterion='entropy')     1.000000    0.927778


                                         model  Train score  Test score
2                           SVC(kernel='poly')     0.999382    0.994444
3                                        SVC()     0.995671    0.983333
1                         SVC(kernel='linear')     1.000000    0.961111
5  RandomForestClassifier(criterion='entropy')     1.000000    0.961111
0                         LogisticRegression()     1.000000    0.944444
4      RandomForestClassifier(n_estimators=40)     1.000000    0.938889


                                         model  Train score  Test score
2                           SVC(kernel='poly')     0.999382    0.994444
1                         SVC(kernel='linear')     1.000000    0.988889
3                                        SVC()     0.996289    0.988889
4      RandomForestClassifier(n_estimators=40)     1.000000    0.977778
0                         LogisticRegression()     1.000000    0.966667
5  RandomForestClassifier(criterion='entropy')     1.000000    0.966667


                                         model  Train score  Test score
2                           SVC(kernel='poly')     0.999382    0.988889
3                                        SVC()     0.996908    0.988889
5  RandomForestClassifier(criterion='entropy')     1.000000    0.977778
1                         SVC(kernel='linear')     1.000000    0.966667
4      RandomForestClassifier(n_estimators=40)     1.000000    0.961111
0                         LogisticRegression()     1.000000    0.950000


                                         model  Train score  Test score
2                           SVC(kernel='poly')     0.999382    0.994413
3                                        SVC()     0.996292    0.994413
1                         SVC(kernel='linear')     1.000000    0.977654
4      RandomForestClassifier(n_estimators=40)     1.000000    0.972067
5  RandomForestClassifier(criterion='entropy')     1.000000    0.972067
0                         LogisticRegression()     1.000000    0.938547


                                         model  Train score  Test score
2                           SVC(kernel='poly')      1.00000    0.960894
3                                        SVC()      0.99691    0.960894
4      RandomForestClassifier(n_estimators=40)      1.00000    0.949721
5  RandomForestClassifier(criterion='entropy')      1.00000    0.944134
1                         SVC(kernel='linear')      1.00000    0.932961
0                         LogisticRegression()      1.00000    0.871508


                                         model  Train score  Test score
1                         SVC(kernel='linear')     1.000000    0.966480
2                           SVC(kernel='poly')     1.000000    0.966480
3                                        SVC()     0.998146    0.955307
0                         LogisticRegression()     1.000000    0.938547
5  RandomForestClassifier(criterion='entropy')     1.000000    0.938547
4      RandomForestClassifier(n_estimators=40)     1.000000    0.916201
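
What distinguishes `StratifiedKFold` from plain `KFold` is that each test fold keeps roughly the same class mix as the full dataset. A quick way to see this is to count the digit classes in every test fold; the variable name `fold_class_counts` is illustrative:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold

digit = load_digits()
folds = StratifiedKFold(n_splits=10)

# One row per fold, one column per digit class (0-9).
fold_class_counts = np.array([
    np.bincount(digit.target[test_index], minlength=10)
    for _, test_index in folds.split(digit.data, digit.target)
])
print(fold_class_counts)
```

Each column should hold nearly identical counts from row to row, meaning every digit is represented in every test fold.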


## Calculating cross validation score

In [79]:
from sklearn.model_selection import cross_val_score

In [83]:
for model in models:
    cvs = cross_val_score(model, digit.data, digit.target, cv=3)
    print('Model', model, cvs, np.mean(cvs))

Model LogisticRegression() [0.92153589 0.94156928 0.91652755] 0.9265442404006677
Model SVC(kernel='linear') [0.93489149 0.95826377 0.93823038] 0.9437952142459656
Model SVC(kernel='poly') [0.95826377 0.97829716 0.95325543] 0.9632721202003337
Model SVC() [0.96494157 0.97996661 0.96494157] 0.9699499165275459
Model RandomForestClassifier(n_estimators=40) [0.93155259 0.94991653 0.92153589] 0.9343350027824151
Model RandomForestClassifier(criterion='entropy') [0.93322204 0.95325543 0.92320534] 0.9365609348914857
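
When `cv` is an integer and the estimator is a classifier with a multiclass target, `cross_val_score` builds a `StratifiedKFold` splitter internally. The sketch below checks this for `SVC()` by passing the splitter explicitly and comparing the two results:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

digit = load_digits()

# cv=3 with a classifier: scikit-learn picks StratifiedKFold(n_splits=3) itself.
int_cv_scores = cross_val_score(SVC(), digit.data, digit.target, cv=3)

# The same splitter, passed explicitly.
explicit_cv_scores = cross_val_score(
    SVC(), digit.data, digit.target, cv=StratifiedKFold(n_splits=3)
)
print(int_cv_scores, explicit_cv_scores)
```

Passing a splitter object is also how one would switch to a shuffled or differently sized scheme without changing the rest of the call.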


**VOTING CLASSIFIER**

In [119]:
from sklearn.ensemble import VotingClassifier

In [121]:
# VotingClassifier expects a list of (name, estimator) pairs, not the array of CV scores in `cvs`.
named_models = [(f'model_{i}', model) for i, model in enumerate(models)]
hard_scores = VotingClassifier(estimators=named_models, voting='hard')
# Soft voting averages predict_proba, so every member must support it (SVC needs probability=True).
soft_scores = VotingClassifier(estimators=named_models, voting='soft')

In [122]:
hard_scores.voting, soft_scores.voting

('hard', 'soft')
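
`VotingClassifier` takes a list of `(name, estimator)` pairs and has to be fitted like any other model before it can score anything. A minimal sketch, assuming three illustrative members (the names `svc_rbf`, `svc_poly`, `rf` and the fixed seeds are not from the original notebook):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digit = load_digits()
train_X, test_X, train_y, test_y = train_test_split(
    digit.data, digit.target, test_size=0.2, random_state=42
)

# Assumption: member names and the choice of three members are illustrative.
# probability=True lets SVC expose predict_proba, which soft voting needs.
members = [
    ('svc_rbf', SVC(kernel='rbf', probability=True, random_state=0)),
    ('svc_poly', SVC(kernel='poly', probability=True, random_state=0)),
    ('rf', RandomForestClassifier(n_estimators=40, random_state=0)),
]

# Hard voting takes the majority label; soft voting averages class probabilities.
hard_vote = VotingClassifier(estimators=members, voting='hard').fit(train_X, train_y)
soft_vote = VotingClassifier(estimators=members, voting='soft').fit(train_X, train_y)

hard_score = hard_vote.score(test_X, test_y)
soft_score = soft_vote.score(test_X, test_y)
print(hard_score, soft_score)
```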

## LEAVE ONE OUT CROSS VALIDATION

In [84]:
from sklearn.model_selection import LeaveOneOut

In [85]:
loocv = LeaveOneOut()

In [88]:
results = []
model = SVC(kernel='rbf')

for train_index, test_index in loocv.split(digit.data):
    T_X, t_X, T_y, t_y = digit.data[train_index], digit.data[test_index], digit.target[train_index], digit.target[test_index]
    
    model.fit(T_X, T_y)
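    res = model.predict(t_X)  # predicted label for the single held-out sample
    results.append(res)
    
# NOTE: `results` holds predicted labels, so this is the mean predicted digit, not an accuracy.
print(np.mean(results))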

4.478575403450194


In [101]:
len(results)

1797

In [103]:
results[:10]

[array([0]),
 array([1]),
 array([2]),
 array([3]),
 array([4]),
 array([9]),
 array([6]),
 array([7]),
 array([8]),
 array([9])]

In [104]:
digit.target.shape

(1797,)

In [105]:
digit.target[:10]

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [112]:
model.score(t_X, t_y)  # scores only the last held-out sample, not the whole LOOCV run

1.0
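
Neither the mean of the predictions nor the score on the last held-out sample is the leave-one-out accuracy; in `results[:10]` above, the sixth prediction is 9 while the true label is 5. The accuracy is the fraction of held-out samples predicted correctly. A sketch of that computation, assuming only the first 200 samples are used so that it runs quickly (the full run needs 1797 fits):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

digit = load_digits()
# Assumption: a 200-sample subset, chosen only to keep the sketch fast.
X, y = digit.data[:200], digit.target[:200]

loocv = LeaveOneOut()
model = SVC(kernel='rbf')

# Store each held-out prediction at the position of its sample.
predictions = np.empty_like(y)
for train_index, test_index in loocv.split(X):
    model.fit(X[train_index], y[train_index])
    predictions[test_index] = model.predict(X[test_index])

# Fraction of held-out samples predicted correctly.
manual_accuracy = np.mean(predictions == y)

# cross_val_score yields one 0/1 score per held-out sample; their mean is the same accuracy.
loo_scores = cross_val_score(SVC(kernel='rbf'), X, y, cv=loocv)
print(manual_accuracy, loo_scores.mean())
```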