# Validation Schemas

## Holdout Method

In this method, we divide the dataset into three parts: train, validation and test. Normally, we use the train and validation sets to choose the best model (for example, to find the best combination of hyperparameters for a random forest), and keep the test set for the final check and metrics. To obtain the three parts, we call the *train_test_split* function twice.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

# load the dataset (feature matrix X and binary target y)
X, y = load_breast_cancer(return_X_y=True)

# print its size
print(X.shape)

# divide the dataset into train and a temporary variable (stratify=y keeps the
# same proportion of the 0 and 1 classes in both parts)
X_train, X_temp, y_train, y_temp = train_test_split(X,y,test_size=0.4,stratify=y)

print(f'\nTrain: {X_train.shape[0]}\nRest: {X_temp.shape[0]}\n\n')

X_val, X_test, y_val, y_test = train_test_split(X_temp,y_temp,test_size=0.5,stratify=y_temp)

print(f'Train: {X_train.shape[0]}\nValidation: {X_val.shape[0]} \nTest: {X_test.shape[0]}')


(569, 30)

Train: 341
Rest: 228


Train: 341
Validation: 114 
Test: 114
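
As a minimal sketch of how the three splits might be used together, the loop below tries a few random forest settings, keeps the one with the best validation accuracy, and only then touches the test set. The candidate values for `n_estimators` and the fixed `random_state` are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# 60% train, 20% validation, 20% test (same proportions as above)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=0)

# hypothetical candidate values for the number of trees
candidates = [10, 50, 100]

# fit each candidate on the train set and score it on the validation set
val_scores = {}
for n in candidates:
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    model.fit(X_train, y_train)
    val_scores[n] = model.score(X_val, y_val)

# keep the setting with the highest validation accuracy
best_n = max(val_scores, key=val_scores.get)

# the test set is used only once, for the final check
final_model = RandomForestClassifier(n_estimators=best_n, random_state=0)
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(f"Best n_estimators: {best_n}, test accuracy: {test_accuracy:.3f}")
```

Because the test set plays no part in choosing `best_n`, its accuracy is a fairer estimate of how the selected model behaves on unseen data.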


## $k$-fold Cross-Validation 

In this technique, a test set should still be held out for final evaluation, but a separate validation set is no longer needed. In the basic approach, called $k$-fold CV, the training set is split into $k$ smaller sets, and the following procedure is repeated for each of the $k$ “folds”:
- A model is trained using $k-1$ of the folds as training data;
- The resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

In [2]:
from sklearn.model_selection import cross_val_score
from sklearn import svm
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.4,stratify=y)

clf = svm.SVC(kernel='linear', C=1, random_state=42)
scores = cross_val_score(clf, X_train, y_train, cv=5)
print(scores)
# refit on the full training set and evaluate once on the held-out test set
clf.fit(X_train,y_train)
accuracy_score(y_test, clf.predict(X_test))

[0.94202899 0.97058824 0.94117647 0.94117647 0.94117647]


0.9517543859649122

In [3]:
from sklearn.model_selection import KFold

X = ["a", "b", "c", "d", "e", "f"]
kf = KFold(n_splits=4, shuffle=True)
for train, test in kf.split(X):
     print(f"{train}, {test}")

[1 2 4 5], [0 3]
[0 1 3 4], [2 5]
[0 2 3 4 5], [1]
[0 1 2 3 5], [4]
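
The two-step procedure described at the start of this section can be spelled out by hand with `KFold`. The sketch below is an illustration only: it swaps in a decision tree (not the SVM used above) because it trains quickly, and the fixed seeds are assumptions made so that the manual loop and `cross_val_score` see the same folds.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# a fixed random_state makes every call to kf.split() return the same folds
kf = KFold(n_splits=5, shuffle=True, random_state=0)
model = DecisionTreeClassifier(random_state=0)

manual_scores = []
for train_idx, val_idx in kf.split(X):
    # train a fresh copy of the model on k-1 folds
    fold_model = clone(model)
    fold_model.fit(X[train_idx], y[train_idx])
    # validate it on the remaining fold
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))
manual_scores = np.array(manual_scores)

# cross_val_score runs the same loop internally
library_scores = cross_val_score(model, X, y, cv=kf)

print(manual_scores)
print(library_scores)
```

The reported performance is usually the mean of the per-fold scores, for example `manual_scores.mean()`.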


## Leave-one-out Cross Validation

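This technique is $k$-fold cross-validation taken to the extreme: each sample is left out once and used as the validation set, while the model is trained on all the remaining samples. With $n$ samples, the model is therefore trained $n$ times.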

In [38]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
import pandas as pd

# Create the dataframe
df = pd.DataFrame({'y': [6, 8, 12, 14, 14, 15, 17, 22, 24, 23],
                   'x1': [2, 5, 4, 3, 4, 6, 7, 5, 8, 9],
                   'x2': [14, 12, 12, 13, 7, 8, 7, 4, 6, 5]})

In [41]:
#define predictor and response variables
X = df[['x1', 'x2']]
y = df['y']

#define cross-validation method to use
cv = LeaveOneOut()

#build multiple linear regression model
model = LinearRegression()

#use LOOCV to evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error',
                         cv=cv, n_jobs=-1)

print(scores)
print(len(scores))
print(len(X))

[-1.7159581  -4.91317829 -1.37391167 -6.76454294 -3.88354531 -2.18982218
 -2.22270204 -3.37927664 -4.43852243 -0.5800885 ]
10
10
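
Each entry in `scores` is the negated absolute error for one left-out sample, so a single summary number is usually more useful. A minimal sketch, rebuilding the same dataframe so that it runs on its own, averages the absolute values to obtain the LOOCV mean absolute error:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# same toy data as above
df = pd.DataFrame({'y': [6, 8, 12, 14, 14, 15, 17, 22, 24, 23],
                   'x1': [2, 5, 4, 3, 4, 6, 7, 5, 8, 9],
                   'x2': [14, 12, 12, 13, 7, 8, 7, 4, 6, 5]})
X = df[['x1', 'x2']]
y = df['y']

# one score per left-out sample; sklearn negates errors so that higher is better
scores = cross_val_score(LinearRegression(), X, y,
                         scoring='neg_mean_absolute_error',
                         cv=LeaveOneOut())

# flip the sign back and average over all left-out samples
mae = np.mean(np.abs(scores))
print(f"LOOCV mean absolute error: {mae:.3f}")
```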


## Leave-one-group-out cross validation


This method provides train/test indices to split the data such that each training set comprises all samples except those belonging to one specific group. Arbitrary domain-specific group information is provided as an array of integers that encodes the group of each sample.

In [43]:
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 1, 2])

# define the group that each sample belongs to
groups = np.array([1, 1, 2, 2])
logo = LeaveOneGroupOut()

for i, (train_index, test_index) in enumerate(logo.split(X, y, groups)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}, group={groups[train_index]}")
    print(f"  Test:  index={test_index}, group={groups[test_index]}")

Fold 0:
  Train: index=[2 3], group=[2 2]
  Test:  index=[0 1], group=[1 1]
Fold 1:
  Train: index=[0 1], group=[1 1]
  Test:  index=[2 3], group=[2 2]
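
The splitter can also be handed to `cross_val_score` through its `groups` argument. The sketch below assumes a synthetic dataset and a hypothetical grouping of four blocks of 30 samples (think of four patients or four recording sessions); neither comes from the example above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=120, n_features=5, random_state=0)

# hypothetical grouping: samples 0-29 belong to group 0, 30-59 to group 1, ...
groups = np.repeat([0, 1, 2, 3], 30)

logo = LeaveOneGroupOut()

# one score per group: the model never sees the held-out group during training
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         groups=groups, cv=logo)

print(scores)
print(logo.get_n_splits(groups=groups))
```

The number of folds equals the number of distinct groups, so there is no `n_splits` parameter to set.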


## Nested Cross-Validation

The k-fold cross-validation procedure is used to estimate the performance of machine learning models when making predictions on data not used during training.

This procedure can be used both when optimizing the hyperparameters of a model on a dataset, and when comparing and selecting a model for the dataset. When the same cross-validation procedure and dataset are used to both tune and select a model, it is likely to lead to an optimistically biased evaluation of the model performance.

One approach to overcoming this bias is to nest the hyperparameter optimization procedure under the model selection procedure. This is called double cross-validation or nested cross-validation and is the preferred way to evaluate and compare tuned machine learning models.

In [53]:
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# create dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=1, n_informative=10, n_redundant=10)

# configure the inner cross-validation procedure
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)

# define the model
rfc = RandomForestClassifier(random_state=1)

# define search space
space = {'n_estimators' : [10, 100, 500],
         'max_features' : [2, 4, 6]}

# define search
search = GridSearchCV(rfc, space, scoring='accuracy', n_jobs=1, cv=cv_inner, refit=True)

# configure the outer cross-validation procedure
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)

# execute the nested cross-validation
scores = cross_val_score(search, X, y, scoring='accuracy', cv=cv_outer, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (pd.Series(scores).mean(), pd.Series(scores).std()))

Accuracy: 0.927 (0.020)
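
To see the bias that nesting is meant to remove, one can compare the best score reported by the search itself with the nested estimate. The sketch below is an illustration under assumptions: it uses the iris data and a small SVM grid (both chosen only to keep it fast), and the size of the gap depends on the dataset and seeds.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# hypothetical search space, kept small for speed
param_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}

cv_inner = KFold(n_splits=4, shuffle=True, random_state=0)
cv_outer = KFold(n_splits=4, shuffle=True, random_state=0)

# non-nested: the same folds are used to tune and to report the score
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv_inner)
search.fit(X, y)
non_nested_score = search.best_score_

# nested: the whole search is repeated inside each outer training fold
nested_scores = cross_val_score(search, X, y, cv=cv_outer)
nested_score = nested_scores.mean()

print(f"Non-nested: {non_nested_score:.3f}")
print(f"Nested:     {nested_score:.3f}")
```

The non-nested score is the maximum over all grid points and therefore tends to be the more optimistic of the two, although on an easy dataset the difference can be very small.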


## Wilcoxon signed-rank test

As data scientists, we want to know whether one model is actually significantly more accurate than another. Fortunately, many methods exist that apply statistics to the selection of Machine Learning models.

The Wilcoxon signed-rank test is the non-parametric version of the paired Student’s t-test. It can be used when the sample size is small and the data does not follow a normal distribution.

We can apply this significance test for comparing two Machine Learning models. Using k-fold cross-validation we can create, for each model, k accuracy scores. This will result in two samples, one for each model.

Then, we can use the Wilcoxon signed-rank test to test if the two samples differ significantly from each other. If they do, then one is more accurate than the other.


In [57]:
from scipy.stats import wilcoxon
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score, KFold

# Load the dataset
X = load_iris().data
y = load_iris().target

# Prepare models and select your CV method
model1 = ExtraTreesClassifier()
model2 = RandomForestClassifier()
kf = KFold(n_splits=20)

# Extract results for each model on the same folds
results_model1 = cross_val_score(model1, X, y, cv=kf)
results_model2 = cross_val_score(model2, X, y, cv=kf)

# Calculate p value
stat, p = wilcoxon(results_model1, results_model2, zero_method='zsplit')
p




0.6766573217164245

## McNemar’s Test

McNemar’s test is used to check the extent to which the predictions of one model match those of another. This is referred to as the homogeneity of the contingency table, which counts the samples that both models, only one of them, or neither classified correctly. From that table, we can calculate $\chi^2$, which can be used to compute the p-value.
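
As a sketch of the usual statistic, assuming $b$ denotes the number of samples that only the first model classified correctly and $c$ the number that only the second model classified correctly:

$$\chi^2 = \frac{(b - c)^2}{b + c}$$

A continuity-corrected variant, $\chi^2 = \frac{(|b - c| - 1)^2}{b + c}$, is also widely used. Under the null hypothesis the statistic approximately follows a $\chi^2$ distribution with one degree of freedom.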

If the p-value is lower than 0.05, we can reject the null hypothesis that the two models perform equally well and conclude that one model is significantly better than the other.

We can use the mlxtend package to create the table and calculate the corresponding p-value:

In [58]:
import numpy as np
from mlxtend.evaluate import mcnemar_table, mcnemar

# The correct target (class) labels
y_target = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

# Class labels predicted by model 1
y_model1 = np.array([0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0,
                     0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1])

# Class labels predicted by model 2
y_model2 = np.array([0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0,
                     1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0])

# Calculate p value
tb = mcnemar_table(y_target=y_target, 
                   y_model1=y_model1, 
                   y_model2=y_model2)
chi2, p = mcnemar(ary=tb, exact=True)

print('chi-squared:', chi2)
print('p-value:', p)


chi-squared: 3
p-value: 0.5078125
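
The exact p-value can also be checked by hand. The sketch below does not use mlxtend: it counts the two kinds of disagreement with NumPy and applies a two-sided binomial test via `scipy.stats.binom`, which is the usual definition of the exact McNemar test. The names `b`, `c` and `p_exact` are introduced here for illustration.

```python
import numpy as np
from scipy.stats import binom

# same labels and predictions as above
y_target = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
y_model1 = np.array([0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0,
                     0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1])
y_model2 = np.array([0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0,
                     1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0])

correct1 = y_model1 == y_target
correct2 = y_model2 == y_target

# discordant pairs: samples on which exactly one model is right
b = int(np.sum(correct1 & ~correct2))   # only model 1 correct
c = int(np.sum(~correct1 & correct2))   # only model 2 correct

# under the null hypothesis each discordant sample is a fair coin flip,
# so the smaller count follows Binomial(b + c, 0.5); double for two sides
p_exact = min(1.0, 2 * binom.cdf(min(b, c), b + c, 0.5))

print(b, c, p_exact)
```

Only the samples on which the models disagree enter the test; samples that both models get right or both get wrong carry no information about which model is better.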


## 5x2CV paired t-test

The 5x2CV paired t-test is a method often used to compare Machine Learning models due to its strong statistical foundation.

The method works as follows. Let’s say we have two classifiers, A and B. We randomly split the data into 50% training and 50% test. Then, we train each model on the training data and compute the difference in accuracy between the two models on the test set, called DiffA. Then, the training and test splits are swapped and the difference is calculated again, giving DiffB.

This is repeated five times, after which the mean variance of the differences is computed (S²). This variance is then used to calculate the t-statistic.
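
In the standard formulation of the test, and with notation assumed here for illustration, let $p_i^{(1)}$ and $p_i^{(2)}$ be the two accuracy differences (DiffA and DiffB) obtained in repetition $i$. Then:

$$\bar{p}_i = \frac{p_i^{(1)} + p_i^{(2)}}{2}, \qquad s_i^2 = \left(p_i^{(1)} - \bar{p}_i\right)^2 + \left(p_i^{(2)} - \bar{p}_i\right)^2$$

$$t = \frac{p_1^{(1)}}{\sqrt{\frac{1}{5}\sum_{i=1}^{5} s_i^2}}$$

Under the null hypothesis that both classifiers perform equally well, $t$ approximately follows a t-distribution with 5 degrees of freedom.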

In [67]:
from mlxtend.evaluate import paired_ttest_5x2cv
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from mlxtend.data import iris_data

# Prepare data and clfs
X, y = iris_data()
clf1 = ExtraTreeClassifier()
clf2 = DecisionTreeClassifier()

# Calculate p-value
t, p = paired_ttest_5x2cv(estimator1=clf1,
                          estimator2=clf2,
                          X=X, y=y,
                          random_seed=1)
p

0.5623122704704893
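
For intuition, the procedure can be written out by hand. This is a minimal sketch of the standard 5x2CV recipe, not mlxtend's implementation: the helper name `five_by_two_cv_ttest`, the way splits are seeded, and the fixed `random_state` values are assumptions, so its p-value need not match the one above.

```python
import numpy as np
from scipy import stats
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier


def five_by_two_cv_ttest(est1, est2, X, y, seed=1):
    """Hypothetical helper: 5 repetitions of 2-fold CV, then a t-test."""
    variances = []
    first_diff = None
    for rep in range(5):
        # random 50/50 split of the data
        X_a, X_b, y_a, y_b = train_test_split(
            X, y, test_size=0.5, random_state=seed + rep)
        diffs = []
        # train on one half and test on the other, then swap the halves
        for X_tr, y_tr, X_te, y_te in ((X_a, y_a, X_b, y_b),
                                       (X_b, y_b, X_a, y_a)):
            acc1 = clone(est1).fit(X_tr, y_tr).score(X_te, y_te)
            acc2 = clone(est2).fit(X_tr, y_tr).score(X_te, y_te)
            diffs.append(acc1 - acc2)
        mean_diff = (diffs[0] + diffs[1]) / 2
        variances.append((diffs[0] - mean_diff) ** 2
                         + (diffs[1] - mean_diff) ** 2)
        if first_diff is None:
            first_diff = diffs[0]
    mean_variance = np.mean(variances)
    if mean_variance == 0:
        # degenerate case: the statistic is undefined
        return float("nan"), float("nan")
    t_stat = first_diff / np.sqrt(mean_variance)
    # two-sided p-value from a t-distribution with 5 degrees of freedom
    p_value = 2 * stats.t.sf(abs(t_stat), df=5)
    return t_stat, p_value


X, y = load_iris(return_X_y=True)
t_stat, p_value = five_by_two_cv_ttest(
    ExtraTreeClassifier(random_state=0),
    DecisionTreeClassifier(random_state=0),
    X, y)
print(t_stat, p_value)
```

Only the first difference of the first repetition appears in the numerator; the remaining nine differences contribute through the variance estimate in the denominator.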