# Iris Data - KFold and Cross Val Score Exercise
This exercise demonstrates one of the caveats of the train_test_split function. Unless random_state is fixed, each call on a given data set splits the data differently, which leads to different test set scores, even when stratify is passed to train_test_split.

In [231]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()

In [232]:
# Instantiate the models
lr = LogisticRegression(max_iter=3000)
svm = SVC(C=0.7)
rfc = RandomForestClassifier(n_estimators=100)

In [233]:
# Counting target variables
unique, count = np.unique(iris.target, return_counts=True)
np.asarray((unique, count))

array([[ 0,  1,  2],
       [50, 50, 50]])

In [234]:
# Creating training and test split to highlight caveat and for comparison against KFold
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    test_size=0.3)

In [235]:
# Counting unique target variables in test split
unique, count = np.unique(y_test, return_counts=True)
np.asarray((unique, count))

array([[ 0,  1,  2],
       [19, 12, 14]])
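
The uneven class counts above are a product of the random split. As a minimal sketch, passing `stratify=iris.target` keeps the class proportions equal in the test set, and `random_state` (the value 0 is an arbitrary choice made for this illustration) makes the split repeatable. Stratifying fixes only the class counts; which rows land in the test set still changes from call to call unless `random_state` is set.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Stratified split: each class keeps its 1/3 share of the 45 test rows
_, _, _, y_test_strat = train_test_split(
    iris.data, iris.target, test_size=0.3, stratify=iris.target, random_state=0
)
strat_counts = np.bincount(y_test_strat)
print(strat_counts)

# Same random_state (arbitrary value) -> identical split on every call
_, X_test_a, _, _ = train_test_split(iris.data, iris.target, test_size=0.3, random_state=0)
_, X_test_b, _, _ = train_test_split(iris.data, iris.target, test_size=0.3, random_state=0)
same_split = np.array_equal(X_test_a, X_test_b)
print(same_split)
```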

In [236]:
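# Custom model fit and score function
# Note: as written, this scores on the training rows (X_train, y_train), so the
# values below are training accuracy; X_test and y_test are accepted but unused.
def get_score(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    return model.score(X_train, y_train)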

In [237]:
get_score(lr, X_train, X_test, y_train, y_test)

0.9809523809523809

In [238]:
get_score(svm, X_train, X_test, y_train, y_test)

0.9619047619047619

In [239]:
get_score(rfc, X_train, X_test, y_train, y_test)

1.0
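
The scores above come from a single split, and `get_score` evaluates on the rows it was fitted on. To see the caveat from the introduction directly, the sketch below repeats the split with several seeds and records accuracy on the held-out rows instead. The seeds 0 to 4 are arbitrary values chosen for illustration, and only the logistic regression model is used to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()

test_scores = []
for seed in range(5):  # seeds 0..4 are arbitrary illustrative values
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.3, random_state=seed
    )
    model = LogisticRegression(max_iter=3000)
    model.fit(X_train, y_train)
    # Score on the held-out rows, not on the rows used for fitting
    test_scores.append(model.score(X_test, y_test))

print(test_scores)
```

Any spread in the printed values comes purely from which rows fall into the test set, which is the motivation for averaging over several folds.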

## KFold cross validator
Provides train/test indices to split data into train/test sets. Splits the dataset into k consecutive folds (without shuffling by default). Each fold is then used once as the validation set while the k - 1 remaining folds form the training set.
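
A small sketch of what "consecutive folds" means in practice. The six-element toy array is an assumption made for illustration; the second part applies the same splitter to iris, whose target array is ordered by class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

# Toy data: six samples, so each of the 3 folds holds two consecutive indices
toy = np.arange(6)
toy_folds = [(train.tolist(), test.tolist())
             for train, test in KFold(n_splits=3).split(toy)]
for train, test in toy_folds:
    print("train:", train, "test:", test)

# iris.target is ordered by class, so each unshuffled test fold covers one class
iris = load_iris()
fold_classes = [np.unique(iris.target[test]).tolist()
                for _, test in KFold(n_splits=3).split(iris.data)]
print(fold_classes)
```

Because each unshuffled test fold on iris contains a single class, the corresponding training folds never see that class.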

In [240]:
from sklearn.model_selection import KFold, StratifiedKFold

In [250]:
kf = KFold(n_splits=3)
skf = StratifiedKFold(n_splits=3)

In [242]:
lr_scores = []
svm_scores = []
rfc_scores = []

for train_index, test_index in kf.split(iris.data):
    X_train, X_test, y_train, y_test = iris.data[train_index], iris.data[test_index], \
                                        iris.target[train_index], iris.target[test_index]
    lr_scores.append(get_score(lr, X_train, X_test, y_train, y_test))
    svm_scores.append(get_score(svm, X_train, X_test, y_train, y_test))
    rfc_scores.append(get_score(rfc, X_train, X_test, y_train, y_test))

In [243]:
print(lr_scores)
print(svm_scores)
print(rfc_scores)

[0.96, 1.0, 1.0]
[0.96, 1.0, 1.0]
[1.0, 1.0, 1.0]
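
The scores above are accuracies on the training rows, since `get_score` calls `model.score(X_train, y_train)`. The sketch below scores on the held-out fold instead, using a hypothetical helper `held_out_scores` that is not part of the original notebook. It compares the unshuffled `KFold` with a shuffled one; `random_state=0` is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

iris = load_iris()
X, y = iris.data, iris.target

def held_out_scores(cv):
    """Hypothetical helper: fit a fresh model per fold, score on the held-out fold."""
    scores = []
    for train_index, test_index in cv.split(X):
        model = LogisticRegression(max_iter=3000)
        model.fit(X[train_index], y[train_index])
        scores.append(model.score(X[test_index], y[test_index]))
    return scores

# Unshuffled: every test fold is a class the model never saw during fitting
unshuffled = held_out_scores(KFold(n_splits=3))
# Shuffled: every fold mixes all three classes
shuffled = held_out_scores(KFold(n_splits=3, shuffle=True, random_state=0))
print(unshuffled)
print(shuffled)
```

On iris the unshuffled folds score zero on held-out data, because the class in each test fold is absent from its training folds. Shuffling, or stratifying as in the next section, avoids this.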


## Stratified KFold 
This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.
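
As a quick check of that claim, the sketch below counts the classes in each test fold produced by `StratifiedKFold` on iris. With 50 samples per class and 3 folds, each class should contribute 16 or 17 samples to every fold.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

iris = load_iris()
skf = StratifiedKFold(n_splits=3)

# Class counts (class 0, 1, 2) inside each test fold
fold_counts = [np.bincount(iris.target[test_index], minlength=3).tolist()
               for _, test_index in skf.split(iris.data, iris.target)]
for counts in fold_counts:
    print(counts)
```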

In [244]:
lr_scores2 = []
svm_scores2 = []
rfc_scores2 = []

for train_index, test_index in skf.split(iris.data, iris.target):
    X_train, X_test, y_train, y_test = iris.data[train_index], iris.data[test_index], \
                                        iris.target[train_index], iris.target[test_index]
    
    lr_scores2.append(get_score(lr, X_train, X_test, y_train, y_test))
    svm_scores2.append(get_score(svm, X_train, X_test, y_train, y_test))
    rfc_scores2.append(get_score(rfc, X_train, X_test, y_train, y_test))
    

In [245]:
print(lr_scores2)
print(svm_scores2)
print(rfc_scores2)

[0.96, 0.98, 0.99]
[0.96, 0.97, 0.97]
[1.0, 1.0, 1.0]


## Cross val score
cross_val_score automates a loop like the one above: it splits the data into folds (5 by default), fits the model on the training folds and scores it on the held-out fold. A KFold or StratifiedKFold object can also be passed through the cv parameter.
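
To make the correspondence concrete, here is a sketch that writes the loop by hand and compares it with `cross_val_score`. Unlike the loops earlier in this notebook, it scores on the held-out fold, which is what `cross_val_score` reports. `clone` is used so each fold starts from an unfitted copy of the model, and the variable names `manual`, `auto` and `auto_int` are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = load_iris()
X, y = iris.data, iris.target
lr = LogisticRegression(max_iter=3000)
skf = StratifiedKFold(n_splits=3)

# Hand-written loop: fit on the training folds, score on the held-out fold
manual = []
for train_index, test_index in skf.split(X, y):
    model = clone(lr)
    model.fit(X[train_index], y[train_index])
    manual.append(model.score(X[test_index], y[test_index]))

# Same splitter passed through cv, and the integer shorthand
auto = cross_val_score(lr, X, y, cv=skf)
auto_int = cross_val_score(lr, X, y, cv=3)

matches_loop = np.allclose(manual, auto)
matches_int = np.allclose(auto, auto_int)
print(matches_loop, matches_int)
```

The second comparison reflects that, for a classifier, an integer `cv` is treated as an unshuffled `StratifiedKFold` with that many splits.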

In [246]:
# Cross val score
from sklearn.model_selection import cross_val_score
# cross_val_score(estimator, X, y=None, *, groups=None, scoring=None, cv=None, n_jobs=None,
    # verbose=0, fit_params=None, pre_dispatch='2*n_jobs', error_score=nan)

In [247]:
cross_val_score(lr, iris.data, iris.target)

array([0.96666667, 1.        , 0.93333333, 0.96666667, 1.        ])

In [257]:
# For a classifier, cv=3 is equivalent to StratifiedKFold(n_splits=3, shuffle=False)
cross_val_score(svm, iris.data, iris.target, cv=3)

array([0.96, 0.98, 0.94])

In [256]:
cross_val_score(rfc, iris.data, iris.target, cv=skf)

array([0.98, 0.94, 0.96])