# Cross validation

## The dataset and classifier

First, let's introduce the dataset and split it into a training set and a test set:

In [1]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=2, n_redundant=0, n_repeated=0,
    n_classes=2, n_clusters_per_class=1, weights=(0.7, 0.3), class_sep=0.99,
    random_state=14,
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [2]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(solver='liblinear')

## Applying cross validation

In [3]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(classifier, X_train, y_train, cv=5)
print('Accuracy scores: '+str(scores))

Accuracy scores: [0.97857143 1.         0.97857143 0.99285714 0.99285714]


In [4]:
outcomes = cross_val_score(classifier, X_train, y_train, cv=10, scoring = 'roc_auc')
print(outcomes)

[1.         0.99431818 1.         1.         0.99147727 1.
 1.         1.         0.9718173  1.        ]


In [6]:
from sklearn.model_selection import cross_validate

metrics = ['roc_auc','accuracy','precision']

# By default, cross_validate does not report training scores. To include them, we set return_train_score=True
outcomes = cross_validate(classifier, X_train, y_train,scoring=metrics, cv=10, return_train_score=True)
for metric in outcomes.keys():
    print(metric+' value: '+str(outcomes[metric]))

fit_time value: [0.00500107 0.00300336 0.00301099 0.00400186 0.00500321 0.00502777
 0.00300097 0.00299907 0.0039947  0.00399709]
score_time value: [0.01000071 0.00699234 0.00799656 0.00599599 0.01402617 0.00495982
 0.00699401 0.00599527 0.00499988 0.00500202]
test_roc_auc value: [1.         0.99431818 1.         1.         0.99147727 1.
 1.         1.         0.9718173  1.        ]
train_roc_auc value: [0.99679871 0.99722555 0.99684614 0.99691728 0.9973204  0.9969616
 0.99691431 0.99671333 0.99949163 0.99673697]
test_accuracy value: [0.95714286 0.98571429 1.         1.         0.97142857 1.
 1.         0.98571429 0.98571429 1.        ]
train_accuracy value: [0.99206349 0.99206349 0.98888889 0.98888889 0.99365079 0.98888889
 0.98888889 0.99047619 0.99047619 0.98888889]
test_precision value: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
train_precision value: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]


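The outcome is now a dictionary holding, for each metric, the per-fold scores on both the training folds and the held-out fold. Note that the scores labelled test_ are computed on the held-out fold of the training data: since we have set aside a separate test set, these held-out folds act as our validation set.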
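Per-fold arrays are hard to compare at a glance, so it can help to reduce each metric to a mean and a standard deviation. The snippet below is a minimal sketch of that idea; the `summary` dictionary is a name introduced here for illustration, and the dataset is rebuilt so that the cell runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, train_test_split

# Rebuild the dataset and classifier from the earlier cells
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=2, n_redundant=0, n_repeated=0,
    n_classes=2, n_clusters_per_class=1, weights=(0.7, 0.3), class_sep=0.99,
    random_state=14,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
classifier = LogisticRegression(solver='liblinear')

outcomes = cross_validate(classifier, X_train, y_train,
                          scoring=['roc_auc', 'accuracy'], cv=10,
                          return_train_score=True)

# Keep only the score entries (skip fit_time and score_time)
# and summarise each one as (mean, standard deviation) over the folds
summary = {}
for name, values in outcomes.items():
    if name.startswith(('test_', 'train_')):
        summary[name] = (np.mean(values), np.std(values))
        print(name + ': mean=' + str(round(summary[name][0], 4))
              + ', std=' + str(round(summary[name][1], 4)))
```

A large gap between the train_ and test_ means, or a large standard deviation over the folds, would be a reason to look more closely at the model.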

## Setting up a pipeline

Remember that when we talked about training, validation and test sets, we mentioned that any pre-processing (e.g., replacing missing values, transformations, etc.) must be fitted on the training set only. The same transformation, with the parameters learned from the training data, is then applied to the validation or test data. Resampling steps such as over- and under-sampling are applied to the training data only. Otherwise, information from the test set can 'leak' into the training process, while the testing stage needs to be completely independent.
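Done by hand for a single train/test split, this could look as follows. This is only a sketch using StandardScaler as the example transformation; the names `X_train_scaled` and `X_test_scaled` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Rebuild the dataset from the first cell
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=2, n_redundant=0, n_repeated=0,
    n_classes=2, n_clusters_per_class=1, weights=(0.7, 0.3), class_sep=0.99,
    random_state=14,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()

# Learn the mean and standard deviation from the training data only
X_train_scaled = scaler.fit_transform(X_train)

# Reuse those training-set parameters on the test data (no refitting)
X_test_scaled = scaler.transform(X_test)

# The stored means are those of the training set, not of the full dataset
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))
```

Within cross-validation, the same fit-then-transform step has to be repeated for every fold, which is tedious and easy to get wrong by hand.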

To simplify this, we can set up a pipeline containing the various steps that need to be applied, i.e., transformation and training a classifier:

In [12]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

metrics = ['accuracy']

pipeline = make_pipeline(StandardScaler(), classifier)
outcomes = cross_validate(pipeline, X_train, y_train, scoring=metrics, cv=10, return_train_score=True)
for metric in outcomes.keys():
    print(metric+' value: '+str(outcomes[metric]))

fit_time value: [0.00399804 0.00199819 0.0039947  0.00300407 0.00399184 0.00199652
 0.00299931 0.00299907 0.00199938 0.00300384]
score_time value: [0.         0.00099897 0.0010035  0.00100303 0.         0.00100064
 0.         0.0009973  0.00100541 0.00099778]
test_accuracy value: [0.97142857 0.98571429 1.         1.         0.97142857 0.98571429
 1.         0.98571429 0.98571429 1.        ]
train_accuracy value: [0.99206349 0.99206349 0.99047619 0.99047619 0.99206349 0.99047619
 0.98888889 0.99047619 0.99047619 0.98888889]


## Predictions for every sample

If you want to obtain, for every sample, the prediction made when that sample was in the held-out fold (in 10-fold CV, every sample ends up in the held-out fold exactly once), the following code can be used:

In [13]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score

predictions = cross_val_predict(pipeline, X_train, y_train, cv=10)
print(accuracy_score(y_train, predictions))

0.9885714285714285
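Since these out-of-fold predictions cover every training sample, they can be fed to any other metric as well. As an illustration, which is an addition and not part of the original cells, the sketch below builds a confusion matrix from them.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Rebuild the dataset and pipeline from the earlier cells
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=2, n_redundant=0, n_repeated=0,
    n_classes=2, n_clusters_per_class=1, weights=(0.7, 0.3), class_sep=0.99,
    random_state=14,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
pipeline = make_pipeline(StandardScaler(), LogisticRegression(solver='liblinear'))

# One out-of-fold prediction per training sample
predictions = cross_val_predict(pipeline, X_train, y_train, cv=10)

# Rows are the true classes, columns the predicted classes
cm = confusion_matrix(y_train, predictions)
print(cm)
```

With imbalanced classes, the matrix shows at a glance whether the errors are concentrated in the minority class, which a single accuracy value hides.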


Typically, we use cross-validation on the training set to see which classifier, or which parameters, work best. Only then do we refit the chosen model on the full training set and apply it to the test set for the final evaluation.
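A minimal sketch of that final step, assuming the scaled logistic regression pipeline is the model we settled on (the variable `test_accuracy` is a name introduced here):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Rebuild the dataset and pipeline from the earlier cells
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=2, n_redundant=0, n_repeated=0,
    n_classes=2, n_clusters_per_class=1, weights=(0.7, 0.3), class_sep=0.99,
    random_state=14,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
pipeline = make_pipeline(StandardScaler(), LogisticRegression(solver='liblinear'))

# Refit on the complete training set; the scaler is fitted on X_train only
pipeline.fit(X_train, y_train)

# Evaluate once on the test set, which was not used during cross-validation
test_accuracy = accuracy_score(y_test, pipeline.predict(X_test))
print('Test accuracy: ' + str(test_accuracy))
```

The test set should be used only once, at the very end; if we keep tuning after looking at this number, it stops being an independent estimate.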

## Using stratified folds

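Since our data is imbalanced, we want every fold to preserve the class proportions of the full training set. For a classifier, an integer cv value already yields stratified folds, but without shuffling. Passing a StratifiedKFold object explicitly lets us shuffle the data and fix the random seed: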

In [21]:
from sklearn.model_selection import StratifiedKFold

stratified_kfold = StratifiedKFold(n_splits=10, random_state=40, shuffle=True)
outcomes = cross_validate(pipeline, X_train, y_train, scoring=metrics, cv=stratified_kfold, return_train_score=True)
for metric in outcomes.keys():
    print(metric+' value: '+str(outcomes[metric]))

fit_time value: [0.00499368 0.00399804 0.00300169 0.00399947 0.00400329 0.00299883
 0.00399995 0.00401187 0.00299835 0.00399733]
score_time value: [0.00099826 0.00100183 0.00100493 0.00100017 0.00302482 0.00099945
 0.         0.00098896 0.00100327 0.        ]
test_accuracy value: [0.98571429 0.98571429 1.         1.         1.         0.98571429
 0.97142857 0.98571429 0.98571429 1.        ]
train_accuracy value: [0.99206349 0.99206349 0.98888889 0.98888889 0.98888889 0.99206349
 0.99206349 0.99206349 0.99047619 0.99047619]
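To check that the folds really preserve the class balance, we can compare the share of the positive class in each held-out fold with the share in the whole training set. The sketch below does this; `overall_rate` and `fold_rates` are names introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split

# Rebuild the dataset from the first cell
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=2, n_redundant=0, n_repeated=0,
    n_classes=2, n_clusters_per_class=1, weights=(0.7, 0.3), class_sep=0.99,
    random_state=14,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

stratified_kfold = StratifiedKFold(n_splits=10, random_state=40, shuffle=True)

# Share of class 1 in the whole training set (labels are 0/1, so the mean is the share)
overall_rate = y_train.mean()

# Share of class 1 in every held-out fold
fold_rates = [y_train[test_idx].mean()
              for _, test_idx in stratified_kfold.split(X_train, y_train)]

print('Overall: ' + str(round(overall_rate, 3)))
print('Per fold: ' + str([round(rate, 3) for rate in fold_rates]))
```

The per-fold shares should stay close to the overall share; with a plain, non-stratified KFold they can drift further apart, especially when the minority class is small.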
