# Digits Data - KFold and Cross Val Score Exercise
This exercise demonstartes one of the caveats of the train test split function. Each time the function is called on a given data set it splits the data differently leading to different test set scores - Including when stratify is called within the train_test_split funcrtion.

In [40]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
digits = load_digits()

In [41]:
# Count of unique target classes in the dataset
unique, counts = np.unique(digits.target, return_counts=True)
np.asarray((unique, counts))

array([[  0,   1,   2,   3,   4,   5,   6,   7,   8,   9],
       [178, 182, 177, 183, 181, 182, 181, 179, 174, 180]])

In [42]:
# Creating traing and test splits to highlight caveat and for score comparison 
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size = 0.3)


In [43]:
# Count of unique target classes in the test set
unique, counts = np.unique(y_test, return_counts=True)
np.asarray((unique, counts))

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [56, 57, 48, 59, 70, 47, 50, 66, 41, 46]])

In [44]:
# Model fit and score function 
# NOTE: SCORE IS NOT NECESSARILY THE BEST METRIC FOR PERCISION/F1 SCORE
def get_score(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

In [45]:
# Instantiating classifiers with basic hyper parameters
lr = LogisticRegression(max_iter=4000)
svm = SVC(C=0.7)
rfc = RandomForestClassifier(n_estimators=100)

In [46]:
get_score(lr, X_train, X_test, y_train, y_test)

0.9703703703703703

In [47]:
get_score(svm, X_train, X_test, y_train, y_test)

0.9888888888888889

In [48]:
get_score(rfc, X_train, X_test, y_train, y_test)

0.9740740740740741

## KFold cross validator
Provides train/test indices to split data in train/test sets. Split dataset into k consecutive folds (without shuffling by default). Each fold is then used once as a validation while the k - 1 remaining folds form the training set.

In [49]:
# KFold Cross Validation
from sklearn.model_selection import KFold, StratifiedKFold

In [50]:
# Instantiating KFold
kf = KFold(n_splits=3)
kf

KFold(n_splits=3, random_state=None, shuffle=False)

In [51]:
# For loop using train and test indexing of kfold splits
lr_scores = []
svm_scores = []
rfc_scores = []


for train_index, test_index in kf.split(digits.data):
    X_train, X_test, y_train, y_test = digits.data[train_index], digits.data[test_index], digits.target[train_index], digits.target[test_index]
    
    lr_scores.append(get_score(lr, X_train, X_test, y_train, y_test))
    svm_scores.append(get_score(svm, X_train, X_test, y_train, y_test))
    rfc_scores.append(get_score(rfc, X_train, X_test, y_train, y_test))  

In [52]:
print(lr_scores)
print(svm_scores)
print(rfc_scores)

[0.9282136894824707, 0.9415692821368948, 0.9165275459098498]
[0.9632721202003339, 0.9782971619365609, 0.9515859766277128]
[0.9332220367278798, 0.9632721202003339, 0.9198664440734557]


## Stratified KFold 
This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.

In [53]:
# Instantiating StratifiedKFold 
skf = StratifiedKFold(n_splits=3)
skf

StratifiedKFold(n_splits=3, random_state=None, shuffle=False)

## Cross val score
Cross val score will do the same thing as the for loop above by creating train and test splits in the data (default 5) though the KFold and StratifiedKfold can also be added as a parameter.

In [54]:
from sklearn.model_selection import cross_val_score
# cross_val_score(estimator, X, y=None, *, groups=None, scoring=None, cv=None, n_jobs=None,
    # verbose=0, fit_params=None, pre_dispatch='2*n_jobs', error_score=nan)

In [55]:
cross_val_score(lr, digits.data, digits.target, cv=skf,
                scoring = 'accuracy', n_jobs=2, verbose=1)

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   3 out of   3 | elapsed:    8.6s finished


array([0.92487479, 0.93823038, 0.92320534])

In [56]:
cross_val_score(svm, digits.data, digits.target)

array([0.95555556, 0.94166667, 0.98050139, 0.98885794, 0.93593315])

In [57]:
cross_val_score(rfc, digits.data, digits.target)

array([0.92777778, 0.92222222, 0.95264624, 0.96935933, 0.93593315])

## Cross val score with custom scorer
Using a combination of sklearns make_scorer, accuracy_score and classification_report to create a custom scorer which provides a basic accuracy score as well as recording the original and predicted classes in arrays to be called using classification report for a thorough breakdown of model accuracy.

In [58]:
from sklearn.metrics import make_scorer, accuracy_score, classification_report 

In [59]:
def get_class_report(y_pred, y_true):
    original_class.extend(y_true)
    predicted_class.extend(y_pred)
    return accuracy_score(y_pred, y_true)

In [60]:
original_class = []
predicted_class = []

cross_val_score(lr, digits.data, digits.target, cv=kf, 
                scoring=make_scorer(get_class_report))

array([0.92821369, 0.94156928, 0.91652755])

In [61]:
print(classification_report(original_class, predicted_class))

              precision    recall  f1-score   support

           0       0.98      0.97      0.97       181
           1       0.92      0.85      0.88       197
           2       0.93      0.98      0.96       168
           3       0.91      0.94      0.92       177
           4       0.94      0.96      0.95       178
           5       0.95      0.94      0.94       183
           6       0.97      0.95      0.96       185
           7       0.92      0.95      0.94       173
           8       0.89      0.89      0.89       175
           9       0.88      0.88      0.88       180

    accuracy                           0.93      1797
   macro avg       0.93      0.93      0.93      1797
weighted avg       0.93      0.93      0.93      1797

