# 1) Cross-validation

author: Mat Gilson, https://github.com/MatthieuGilson

This notebook introduces cross-validation, a key concept at the core of machine learning procedures. Cross-validation aims to quantify how well a trained classifier generalizes to unseen data. We also see how to obtain a baseline reference for the accuracy.

See also the documentation of the scikit-learn library (https://scikit-learn.org/).

In [None]:
# import libraries

import numpy as np

from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit, LeaveOneOut

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

from sklearn.metrics import confusion_matrix

import pandas as pd

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sb

font = {'family' : 'DejaVu Sans',
        'weight' : 'regular',
        'size'   : 18}
mpl.rc('font', **font)


In [None]:
# create a synthetic dataset with 2 classes (s0+s1 samples in total) of m-dimensional inputs with controlled contrast
def gen_inputs(m,        # input dimensionality
               s0,       # number of samples for class 0
               s1,       # number of samples for class 1
               scaling): # scaling factor to separate classes

    # labels
    lbl = np.zeros([s0+s1], dtype=int)
    # inputs
    X = np.zeros([s0+s1,m])

    # create s0 and s1 samples for the 2 classes
    for i in range(s0+s1):
        # label: 0 for the first s0 samples (class 0), 1 for the remaining s1 samples (class 1)
        lbl[i] = int(i>=s0)
        # inputs are random noise plus a shift
        for j in range(m):
            # positive/negative shift for 1st/2nd class
            if i<s0:
                a = -scaling
            else:
                a = scaling
            # the shift linearly depends on the feature index j
            X[i,j] = a*j/m + np.random.randn()
            
    return X, lbl

Let's have a first look at the data.

In [None]:
# input properties
m = 10 # input dimensionality
s0 = 100 # number of samples for class 0
s1 = 100 # number of samples for class 1
scaling = 1.0 # class contrast

# generate inputs
X, y = gen_inputs(m, s0, s1, scaling)

# bins for histograms
vbins = np.linspace(-3,3,30)

# plot
plt.figure(figsize=[6,7])
plt.subplot(411)
i = 0
plt.hist(X[:s0,i], histtype='step', bins=vbins, color='r')
plt.hist(X[s0:,i], histtype='step', bins=vbins, color='b')
plt.axis(xmin=-3, xmax=3)
plt.legend(['class 0', 'class 1'], fontsize=10)
plt.title('input {}'.format(i), loc='left')
plt.subplot(412)
i = int((m-1)*0.33)
plt.hist(X[:s0,i], histtype='step', bins=vbins, color='r')
plt.hist(X[s0:,i], histtype='step', bins=vbins, color='b')
plt.axis(xmin=-3, xmax=3)
plt.title('input {}'.format(i), loc='left')
plt.subplot(413)
i = int((m-1)*0.66)
plt.hist(X[:s0,i], histtype='step', bins=vbins, color='r')
plt.hist(X[s0:,i], histtype='step', bins=vbins, color='b')
plt.axis(xmin=-3, xmax=3)
plt.title('input {}'.format(i), loc='left')
plt.subplot(414)
i = m-1
plt.hist(X[:s0,i], histtype='step', bins=vbins, color='r')
plt.hist(X[s0:,i], histtype='step', bins=vbins, color='b')
plt.axis(xmin=-3, xmax=3)
plt.title('input {}'.format(i), loc='left')
plt.xlabel('X values')
plt.tight_layout()
plt.savefig('ex_contrast_X')
plt.show()

In [None]:
# create matrix rgb
mat_rgb = np.zeros([m, vbins.size-1, 3])
for i in range(m):
    mat_rgb[i,:,0] = np.histogram(X[:s0,i], bins=vbins)[0]
    mat_rgb[i,:,2] = np.histogram(X[s0:,i], bins=vbins)[0]
mat_rgb /= mat_rgb.max() / 2.0

plt.figure(figsize=[6,7])
plt.imshow(mat_rgb)
plt.xlabel('values')
plt.ylabel('input index')
plt.show()

Now let's see how to separate the 2 classes using a classifier.

In [None]:
# Classifiers and learning parameters
clf = make_pipeline(StandardScaler(), 
                    LogisticRegression(C=10000.0, penalty='l2', solver='lbfgs', max_iter=500) )

What is the reference "chance" level? Is it $50\%$ for $2$ classes?

In [None]:
acc = pd.DataFrame(columns=['score'])

# repetitions
n_rep = 20
for i_rep in range(n_rep):
    # generate data
    X, y = gen_inputs(m, s0, s1, scaling)
    
    # train the classifier on the full dataset
    clf.fit(X, y)
    # accuracy on train set
    d = {'score': [clf.score(X, y)]}
    acc = pd.concat((acc, pd.DataFrame(data=d)), ignore_index=True)

# plot
sb.violinplot(data=acc, y='score', scale='width', palette=['brown']) # cut=0
plt.text(0, 1.05, str(acc['score'].mean())[:4], horizontalalignment='center')
plt.yticks([0,1])
plt.axis(ymax=1.02)
plt.xlabel('classifier')
plt.ylabel('accuracy')
plt.show()

In [None]:
# theoretical chance level
chance_level = 1.0 / 2

sb.violinplot(data=acc, y='score', scale='width', palette=['brown']) # cut=0
plt.plot([-1,1], [chance_level]*2, '--k')
plt.yticks([0,1])
plt.axis(ymax=1.02)
plt.xlabel('classifier')
plt.ylabel('accuracy')
plt.show()

We can play with the contrast between the 2 classes, for example making the classification more difficult by lowering the contrast (i.e. the separability).

In [None]:
acc = pd.DataFrame(columns=['score'])

# change contrast
scaling = 0.5 # try 0.2, 0.1, 0.0

# repetitions
n_rep = 20
for i_rep in range(n_rep):
    # generate data
    X, y = gen_inputs(m, s0, s1, scaling)
    
    # train the classifier on the full dataset
    clf.fit(X, y)
    # accuracy on train set
    d = {'score': [clf.score(X, y)]}
    acc = pd.concat((acc, pd.DataFrame(data=d)), ignore_index=True)

# plot
sb.violinplot(data=acc, y='score', scale='width', palette=['brown']) # cut=0
plt.text(0, 1.05, str(acc['score'].mean())[:4], horizontalalignment='center')
plt.plot([-1,1], [chance_level]*2, '--k')
plt.yticks([0,1])
plt.axis(ymax=1.02)
plt.xlabel('classifier')
plt.ylabel('accuracy')
plt.show()

Even for `scaling=0`, the classification accuracy is above the expected chance level $0.5$...
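
The effect can be pushed to the extreme with a minimal sketch, assuming pure-noise inputs and more features than training samples (the names `X_noise`, `clf_noise` and the fixed seed are illustrative choices, not part of the notebook). In that regime the training samples can be separated perfectly although the labels carry no information, while the accuracy on fresh noise stays around $0.5$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)  # fixed seed, an assumption made for reproducibility

# more features than training samples (illustrative sizes)
n_train, n_new, m_noise = 50, 1000, 60

# pure noise: the labels carry no information about the inputs
X_noise = rng.standard_normal((n_train, m_noise))
y_noise = np.repeat([0, 1], n_train // 2)
X_noise_new = rng.standard_normal((n_new, m_noise))
y_noise_new = np.repeat([0, 1], n_new // 2)

# same kind of pipeline as in the notebook, with weak regularization
clf_noise = make_pipeline(StandardScaler(),
                          LogisticRegression(C=10000.0, max_iter=5000))
clf_noise.fit(X_noise, y_noise)

train_acc = clf_noise.score(X_noise, y_noise)
new_acc = clf_noise.score(X_noise_new, y_noise_new)
print('training accuracy:', train_acc)
print('accuracy on new data:', new_acc)
```

With fewer features than samples, as in the notebook ($m=10$ for $200$ samples), the same mechanism is at work but the inflation of the training accuracy is milder.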

Let's test the trained classifier on new data generated in the same way (i.e. with the same scaling).

In [None]:
X_new, y_new = gen_inputs(m, s0, s1, scaling) # also change the scaling to play with the code

print(clf.score(X_new, y_new))

In [None]:
acc = pd.DataFrame(columns=['type','score'])

# input dimensionality (number of features)
m = 10 # try 5, 20

# class contrast
scaling = 1.0 # try 0.5, 0.0

# loop with training on a dataset and testing on a new dataset
for i_rep in range(n_rep):
    # generate data
    X, y = gen_inputs(m, s0, s1, scaling)

    # train and calculate accuracy
    clf.fit(X, y)
    d = {'type': ['training'],
         'score': [clf.score(X, y)]}
    acc = pd.concat((acc, pd.DataFrame(data=d)), ignore_index=True)

    # generate new data
    X_new, y_new = gen_inputs(m, s0, s1, scaling)
    
    # only test the classifier, which was trained on other data
    d = {'type': ['new'],
         'score': [clf.score(X_new, y_new)]}
    acc = pd.concat((acc, pd.DataFrame(data=d)), ignore_index=True)
    
sb.violinplot(data=acc, x='type', y='score', scale='width', palette=['brown','orange'])
plt.text(0, 1.05, str(acc[acc['type']=='training']['score'].mean())[:4], horizontalalignment='center')
plt.text(1, 1.05, str(acc[acc['type']=='new']['score'].mean())[:4], horizontalalignment='center')
plt.plot([-1,2], [chance_level]*2, '--k')
plt.yticks([0,1])
plt.axis(ymax=1.02)
plt.xlabel('classifier')
plt.ylabel('accuracy')
plt.show()


The classifier tends to extract specific "information" from the data it is trained with, which corresponds to the notion of overfitting.

The "real" accuracy that should be taken into account is the accuracy for the new data, which quantifies the generalization capability of the classifier to new data from the same class.

## Cross-validation scheme

The idea is to generalize the previous observation by splitting the data into a training set and a testing set, for a number of repetitions. The relevant result to report is the test accuracy.
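
As a compact illustration of this idea, scikit-learn's `cross_val_score` runs the fit-on-train, score-on-test loop for every split of a scheme. The sketch below assumes toy data built in the spirit of `gen_inputs` (a class shift growing with the feature index); the names `X_demo`, `y_demo`, `clf_demo` and `cv_demo` are illustrative.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)  # fixed seed, an assumption made for reproducibility

# toy data: unit-variance noise plus a class-dependent shift that grows with the feature index
m_demo, s_demo = 10, 100
shift = np.arange(m_demo) / m_demo
X_demo = np.vstack((rng.standard_normal((s_demo, m_demo)) - shift,
                    rng.standard_normal((s_demo, m_demo)) + shift))
y_demo = np.repeat([0, 1], s_demo)

clf_demo = make_pipeline(StandardScaler(),
                         LogisticRegression(C=10000.0, max_iter=500))

# 10 random splits with 80% of the samples for training and 20% for testing
cv_demo = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

# one test accuracy per split
test_scores = cross_val_score(clf_demo, X_demo, y_demo, cv=cv_demo)
print('mean test accuracy:', test_scores.mean())
print('std of test accuracy:', test_scores.std())
```

The explicit loops used in this notebook do the same thing, but they additionally give access to the training accuracy and to the indices of each split.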

In [None]:
# number of repetitions and storage of results
n_rep = 10

# Cross-validation scheme
cvs0 = ShuffleSplit(n_splits=n_rep, test_size=0.2)

In [None]:
# generate n_rep splits
ind_split = np.zeros([n_rep,s0+s1])
i_rep = 0
for train_ind, test_ind in cvs0.split(X, y):
    ind_split[i_rep, test_ind] = 1
    i_rep += 1

# calculate the size of the test set for each split
test_size = np.vstack((ind_split[:,:s0].sum(axis=1),
                       ind_split[:,s0:].sum(axis=1)))

plt.figure()
plt.subplot(121)
plt.imshow(ind_split, cmap='binary', interpolation='nearest', aspect=40)
plt.xlabel('sample index')
plt.ylabel('split index')
plt.subplot(122)
plt.plot(test_size[0,::-1], np.arange(n_rep), 'b')
plt.plot(test_size[1,::-1], np.arange(n_rep), 'r')
plt.xlabel('test size per class')
plt.show()

In [None]:
# wrapper to test cross-validation scheme (cvs) 
def test_cvs(cvs, cvs_lbl):
    
    acc = pd.DataFrame(columns=['type', 'cv', 'score'])
    
    for train_ind, test_ind in cvs.split(X, y):
    
        # train and test classifier
        clf.fit(X[train_ind,:], y[train_ind])
        # accuracy on train set
        d = {'type': ['train'],
             'cv': [cvs_lbl], 
             'score': [clf.score(X[train_ind,:], y[train_ind])]}
        acc = pd.concat((acc, pd.DataFrame(data=d)), ignore_index=True)
        # accuracy on test set
        d = {'type': ['test'],
             'cv': [cvs_lbl],
             'score': [clf.score(X[test_ind,:], y[test_ind])]}
        acc = pd.concat((acc, pd.DataFrame(data=d)), ignore_index=True)

    return acc

In [None]:
# generate a dataset of features and labels
m = 10 # input dimensionality
s0 = 100 # number of samples for class 0
s1 = 100 # number of samples for class 1
scaling = 1.0 # class contrast
X, y = gen_inputs(m, s0, s1, scaling)

# evaluate classification accuracy on train and test sets
acc = test_cvs(cvs0, 'no strat')

# theoretical chance level
chance_level = 0.5

sb.violinplot(data=acc, x='cv', y='score', hue='type', split=True, scale='width', palette=['brown','orange'])
plt.plot([-1,2], [chance_level]*2, '--k')
plt.yticks([0,1])
plt.ylabel('accuracy')
plt.axis(ymax=1.02)
plt.show()

In [None]:
# matrix of predicted class versus true class in test set
cm = np.zeros([2,2])

# repeat classification
for train_ind, test_ind in cvs0.split(X, y):
    clf.fit(X[train_ind,:], y[train_ind])
    cm += confusion_matrix(y_true=y[test_ind], 
                           y_pred=clf.predict(X[test_ind,:]))        


plt.figure()
plt.imshow(cm, cmap='jet', vmin=0)
plt.colorbar(label='sample count')
plt.xticks([0,1], ['class 0','class 1'])
plt.yticks([0,1], ['class 0','class 1'])
plt.xlabel('predicted label')
plt.ylabel('true label')
plt.tight_layout()
plt.show()

# Baseline accuracy as reference 

How to interpret the test accuracy?

In [None]:
# wrapper to test cross-validation scheme (cvs) 
def test_cvs(cvs, cvs_y):
    
    acc = pd.DataFrame(columns=['type', 'cv', 'score'])
    
    for train_ind, test_ind in cvs.split(X, y):
    
        # train and test classifiers
        clf.fit(X[train_ind,:], y[train_ind])
        # accuracy on test set
        d = {'type': ['test'],
             'cv': [cvs_y],
             'score': [clf.score(X[test_ind,:], y[test_ind])]}
        acc = pd.concat((acc, pd.DataFrame(data=d)), ignore_index=True)

        # surrogate: shuffle the training labels to break the input-label mapping
        train_ind_rand = np.random.permutation(train_ind)
        clf.fit(X[train_ind,:], y[train_ind_rand])
        # accuracy on test set
        d = {'type': ['shuf'],
             'cv': [cvs_y],
             'score': [clf.score(X[test_ind,:], y[test_ind])]}
        acc = pd.concat((acc, pd.DataFrame(data=d)), ignore_index=True)

    return acc

In [None]:
acc = test_cvs(cvs0, 'no strat')

# theoretical chance level
chance_level = 0.5

sb.violinplot(data=acc, x='cv', y='score', hue='type', split=True, scale='width', palette=['orange','gray'])
plt.plot([-1,2], [chance_level]*2, '--k')
plt.yticks([0,1])
plt.ylabel('accuracy')
plt.axis(ymax=1.02)
plt.show()

The accuracies of the shuffling surrogates are distributed around the expected chance level. This looks fine, but does all the variability come from the classifier?
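
The shuffling surrogate is closely related to a permutation test, for which scikit-learn provides `permutation_test_score`. The sketch below is an alternative formulation rather than the procedure used in this notebook: it shuffles the labels of the whole dataset (not only of the training set) and returns an empirical p-value. The toy data and the names `X_demo`, `y_demo`, `clf_demo` and `cv_demo` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, permutation_test_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)  # fixed seed, an assumption made for reproducibility

# toy data: unit-variance noise plus a class-dependent shift that grows with the feature index
m_demo, s_demo = 10, 100
shift = np.arange(m_demo) / m_demo
X_demo = np.vstack((rng.standard_normal((s_demo, m_demo)) - shift,
                    rng.standard_normal((s_demo, m_demo)) + shift))
y_demo = np.repeat([0, 1], s_demo)

clf_demo = make_pipeline(StandardScaler(),
                         LogisticRegression(C=10000.0, max_iter=500))
cv_demo = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# score: cross-validated accuracy with the true labels
# perm_scores: cross-validated accuracy for each label permutation
# p_value: fraction of permutations scoring at least as well as the true labels
score, perm_scores, p_value = permutation_test_score(
    clf_demo, X_demo, y_demo, cv=cv_demo, n_permutations=50, random_state=0)

print('accuracy with true labels:', score)
print('mean accuracy with shuffled labels:', perm_scores.mean())
print('p-value:', p_value)
```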

# Unbalanced data

We now look into a different version of the dataset, a bit less artificial. We consider the case of unbalanced classes, with different numbers of samples in the two classes.

In [None]:
# generate inputs with unbalanced classes
m = 10 # input dimensionality
s0 = 100 # number of samples for class 0
s1 = 300 # number of samples for class 1
scaling = 1.0 # class contrast

X, y = gen_inputs(m, s0, s1, scaling) # also change the scaling to play with the code

In [None]:
acc = test_cvs(cvs0, 'no strat')

# theoretical chance level
naive_chance_level = 1.0 / 2 # equal probability for each category
greedy_chance_level = max(s0,s1) / (s0+s1) # always predict the larger class

sb.violinplot(data=acc, x='cv', y='score', hue='type', split=True, scale='width', palette=['orange','gray'])
plt.plot([-1,2], [naive_chance_level]*2, ':k')
plt.plot([-1,2], [greedy_chance_level]*2, '--k')
plt.yticks([0,1])
plt.ylabel('accuracy')
plt.axis(ymax=1.02)
plt.show()

The baseline accuracy has changed in this case... A "stupid" classifier that always predicts the larger class will perform above $50\%$ accuracy for unbalanced classes.
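
Such a classifier is available in scikit-learn as `DummyClassifier` with `strategy='most_frequent'`. The minimal sketch below assumes the same class sizes as above ($100$ versus $300$ samples) and placeholder inputs, since the dummy classifier ignores them; the names `X_unbal`, `y_unbal` and `dummy` are illustrative.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# unbalanced labels: 100 samples of class 0 and 300 samples of class 1
s0_demo, s1_demo = 100, 300
y_unbal = np.repeat([0, 1], [s0_demo, s1_demo])
# placeholder inputs: the dummy classifier ignores them
X_unbal = np.zeros((s0_demo + s1_demo, 1))

# always predict the most frequent class seen during training
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_unbal, y_unbal)

baseline_acc = dummy.score(X_unbal, y_unbal)
print('baseline accuracy:', baseline_acc)  # 300 / 400 = 0.75
```

This value matches `greedy_chance_level = max(s0,s1) / (s0+s1)` computed in the cell above.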

In [None]:
# generate n_rep splits
ind_split = np.zeros([n_rep,s0+s1])
i_rep = 0
for train_ind, test_ind in cvs0.split(X, y):
    ind_split[i_rep, test_ind] = 1
    i_rep += 1

# calculate the size of the test set for each split
test_size = np.vstack((ind_split[:,:s0].sum(axis=1),
                       ind_split[:,s0:].sum(axis=1)))

plt.figure()
plt.subplot(121)
plt.imshow(ind_split, cmap='binary', interpolation='nearest', aspect=40)
plt.xlabel('sample index')
plt.ylabel('split index')
plt.subplot(122)
plt.plot(test_size[0,::-1], np.arange(n_rep), 'b')
plt.plot(test_size[1,::-1], np.arange(n_rep), 'r')
plt.xlabel('test size per class')
plt.show()

Does the variability of the class ratio in the test set matter?

Let's compare with other possibilities for splitting the data into train and test sets.

## Stratification

Let's consider a different splitting scheme that preserves the ratio of classes from the original dataset in both the train and test sets of each split.
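
The difference between the two schemes can also be checked numerically by counting the samples of each class in every test set. This is a small sketch under the same class sizes ($100$ versus $300$ samples); the helper `count_test_classes` and the placeholder inputs are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

# unbalanced labels: 100 samples of class 0 and 300 samples of class 1
y_unbal = np.repeat([0, 1], [100, 300])
# placeholder inputs: only the labels matter for the splitting
X_unbal = np.zeros((y_unbal.size, 1))

def count_test_classes(cv):
    # number of class-0 and class-1 samples in the test set of each split
    return np.array([np.bincount(y_unbal[test_ind], minlength=2)
                     for _, test_ind in cv.split(X_unbal, y_unbal)])

counts_plain = count_test_classes(
    ShuffleSplit(n_splits=10, test_size=0.2, random_state=0))
counts_strat = count_test_classes(
    StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0))

print('non-stratified test counts per class:')
print(counts_plain)
print('stratified test counts per class:')
print(counts_strat)
```

With stratification every test set contains $20$ samples of class 0 and $60$ samples of class 1, whereas the counts fluctuate from split to split without it.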

In [None]:
# stratified shuffle split: preserving ratio of classes in train-test sets
cvs1 = StratifiedShuffleSplit(n_splits=n_rep, test_size=0.2)

# generate n_rep splits
ind_split = np.zeros([n_rep,s0+s1])
i_rep = 0
for train_ind, test_ind in cvs1.split(X, y):
    ind_split[i_rep, test_ind] = 1
    i_rep += 1

# calculate the size of the test set for each split
test_size = np.vstack((ind_split[:,:s0].sum(axis=1),
                       ind_split[:,s0:].sum(axis=1)))

plt.figure()
plt.subplot(121)
plt.imshow(ind_split, cmap='binary', interpolation='nearest', aspect=60)
plt.xlabel('sample index')
plt.ylabel('split index')
plt.subplot(122)
plt.plot(test_size[0,::-1], np.arange(n_rep), 'b')
plt.plot(test_size[1,::-1], np.arange(n_rep), 'r')
plt.xlabel('test size per class')
plt.show()

## Leave-one-out scheme

Another popular scheme, especially in the case of small datasets, consists in leaving out a single sample for testing and training on all the other samples.
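
A consequence of testing on a single sample is that each split yields an accuracy of either $0$ or $1$, so only the average over splits is informative. The sketch below illustrates this with `cross_val_score`, assuming a small toy dataset in the spirit of `gen_inputs`; the names `X_demo`, `y_demo` and `clf_demo` are illustrative.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)  # fixed seed, an assumption made for reproducibility

# small toy dataset: 50 samples per class, shift growing with the feature index
m_demo, s_demo = 10, 50
shift = np.arange(m_demo) / m_demo
X_demo = np.vstack((rng.standard_normal((s_demo, m_demo)) - shift,
                    rng.standard_normal((s_demo, m_demo)) + shift))
y_demo = np.repeat([0, 1], s_demo)

clf_demo = make_pipeline(StandardScaler(),
                         LogisticRegression(C=10000.0, max_iter=500))

# one split per sample: train on all samples but one, test on the left-out sample
loo_scores = cross_val_score(clf_demo, X_demo, y_demo, cv=LeaveOneOut())

print('number of splits:', loo_scores.size)
print('distinct accuracies per split:', np.unique(loo_scores))
print('mean accuracy:', loo_scores.mean())
```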

In [None]:
# leave-one-out scheme
cvs2 = LeaveOneOut()

# generate one split per sample (s0+s1 splits in total)
ind_split = np.zeros([s0+s1,s0+s1])
i_rep = 0
for train_ind, test_ind in cvs2.split(X, y):
    ind_split[i_rep, test_ind] = 1
    i_rep += 1

plt.figure()
plt.imshow(ind_split, cmap='binary', interpolation='nearest', aspect=1)
plt.xlabel('sample index')
plt.ylabel('split index')
plt.show()

## Comparison of cross-validation schemes

In [None]:
# labels for cvs
cvs_lbl = ['no strat', 'strat', 'loo']

# accuracy
acc = pd.DataFrame(columns=['type', 'cv', 'score'])

# loop over cross-validation schemes
for i, cvs in enumerate([cvs0, cvs1, cvs2]):
    acc_tmp = test_cvs(cvs, cvs_lbl[i])
    acc = pd.concat((acc, acc_tmp))
    
# theoretical chance level
naive_chance_level = 1.0 / 2 # equal probability for each category
greedy_chance_level = max(s0,s1) / (s0+s1) # always predict the larger class

# calculate mean of accuracy of the shuffling surrogate for leave-one-out scheme
sel_df = np.logical_and(acc['cv']=='loo', acc['type']=='shuf')
mean_small_test = acc[sel_df]['score'].mean()

# plot
sb.violinplot(data=acc, x='cv', y='score', hue='type', split=True, scale='width', palette=['orange','gray'], cut=0)
plt.plot([-1,3], [naive_chance_level]*2, ':k')
plt.plot([-1,3], [greedy_chance_level]*2, '--k')
plt.plot([2.5], [mean_small_test], marker='x', markersize=16, color='k', ls='')
plt.yticks([0,1])
plt.ylabel('accuracy')
plt.legend(loc='lower left')
plt.tight_layout()
plt.savefig('acc_cv')
plt.show()

## Conclusion

We see that the leave-one-out (loo) scheme only makes sense when averaging the results over the splits (see the cross), because each split yields an accuracy of either $0$ or $1$.

The non-stratified splitting inflates the distribution of chance-level accuracy.

A good default (for sufficient data) is $80\%$-$20\%$ split for the train-test sets, repeated $100$ times.

Importantly, this approach is *conservative*: the risk of finding "information" by chance where there is none (akin to a "false positive") can be controlled by comparing the classifier with the chance-level accuracy, in particular its spread. Instead, mistakes tend to yield a large variability in the accuracy distributions, with an absence of positive outcome. In other words, when you find some effect, you can trust it, but it is also likely that you have to try different classifiers to get the best accuracy. This is in fact not a problem, but can inform about the structure of the data (where there is information related to the predicted label).

*Refs:*
- cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html
- chance-level evaluation: https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html
