# Hypothesis Testing for the Significance of the Difference in Performance Between Two Machine Learning Models

In this notebook, we will compare the performance of two machine learning algorithms on a binary classification task, then check whether the observed difference is statistically significant.

The two algorithms are both linear and are evaluated on the same synthetic dataset: a **Logistic Regression** algorithm and a **Linear Discriminant Analysis (LDA)** algorithm.

The evaluation procedure we will use is **repeated stratified k-fold cross-validation**, first with **10 folds and 3 repeats** and then with **2 folds and 5 repeats**. We will use this procedure to evaluate each algorithm and report the mean classification accuracy.


In [2]:
!pip install mlxtend



In [3]:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold

from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Use 5x2 statistical hypothesis testing procedure to compare two machine learning algorithms
from mlxtend.evaluate import paired_ttest_5x2cv

## Define Dataset

First, we can use the make_classification() function to create a synthetic dataset with 1,000 samples and 10 input variables.

In [4]:
X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1)

In [14]:
X.shape

(1000, 10)

In [15]:
y.shape

(1000,)

# First Try - with 3-Repeats 10-Fold-Cross-Validation

We first evaluate each model with 3 repeats of 10-fold cross-validation. We then call the paired_ttest_5x2cv() function, passing in the data and both models. It reports the t-statistic and the p-value, which tell us whether the difference in performance between the two algorithms is significant.

## Evaluate the First Model

In [5]:
model1 = LogisticRegression()
cv1 = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores1 = cross_val_score(model1, X, y, scoring='accuracy', cv=cv1, n_jobs=-1)
print('LogisticRegression Mean Accuracy: %.3f (%.3f)' % (mean(scores1), std(scores1)))

LogisticRegression Mean Accuracy: 0.892 (0.036)


## Evaluate the Second Model

In [6]:
model2 = LinearDiscriminantAnalysis()
cv2 = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores2 = cross_val_score(model2, X, y, scoring='accuracy', cv=cv2, n_jobs=-1)
print('LinearDiscriminantAnalysis Mean Accuracy: %.3f (%.3f)' % (mean(scores2), std(scores2)))

LinearDiscriminantAnalysis Mean Accuracy: 0.893 (0.033)


We can see that the mean accuracies are 89.3 percent for LDA and 89.2 percent for Logistic Regression, a difference of only 0.1 percentage points.

## Check if Difference between Algorithms is Real

In [7]:
t, p = paired_ttest_5x2cv(estimator1=model1, estimator2=model2, X=X, y=y, scoring='accuracy', random_seed=1)

## Summarize

In [8]:
print('P-value: %.3f, t-Statistic: %.3f' % (p, t))

P-value: 0.328, t-Statistic: 1.085
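
To make the test less of a black box, here is a minimal sketch of the 5×2 CV paired t-test as described by Dietterich (1998). The helper name `five_by_two_cv_ttest` and its `seed` parameter are hypothetical and introduced only for illustration. The sketch is not presented as mlxtend's implementation, so its numbers may differ from the output above.

```python
import numpy as np
from scipy import stats
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def five_by_two_cv_ttest(est_a, est_b, X, y, seed=1):
    """Hypothetical helper: 5x2 CV paired t-test on accuracy."""
    rng = np.random.RandomState(seed)
    variances = []
    first_diff = None
    for _ in range(5):
        # One repeat: split the data 50/50, then use each half once for testing.
        split_seed = rng.randint(0, 32767)
        X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=split_seed)
        diffs = []
        for X_tr, y_tr, X_te, y_te in ((X1, y1, X2, y2), (X2, y2, X1, y1)):
            acc_a = clone(est_a).fit(X_tr, y_tr).score(X_te, y_te)
            acc_b = clone(est_b).fit(X_tr, y_tr).score(X_te, y_te)
            diffs.append(acc_a - acc_b)
        mean_diff = (diffs[0] + diffs[1]) / 2.0
        # Variance estimate of the accuracy difference for this repeat.
        variances.append((diffs[0] - mean_diff) ** 2 + (diffs[1] - mean_diff) ** 2)
        if first_diff is None:
            first_diff = diffs[0]
    # t = first difference / sqrt(mean of the five variance estimates)
    t_stat = first_diff / np.sqrt(np.mean(variances))
    # Two-sided p-value from a t distribution with 5 degrees of freedom.
    p_value = 2.0 * stats.t.sf(abs(t_stat), df=5)
    return t_stat, p_value


X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1)
t_stat, p_value = five_by_two_cv_ttest(LogisticRegression(), LinearDiscriminantAnalysis(), X, y, seed=1)
print('P-value: %.3f, t-Statistic: %.3f' % (p_value, t_stat))
```

The numerator uses only the accuracy difference from the first fold of the first repeat, while the denominator pools the variance estimates from all five repeats.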


## Interpret the Result

The p-value must be interpreted using an alpha value, which is the significance level that you are willing to accept.

**If the p-value is less than or equal to the chosen alpha, we reject the null hypothesis** that the models have the same mean performance, which means the difference is probably real.

**If the p-value is greater than alpha, we fail to reject the null hypothesis** that the models have the same mean performance, and any observed difference in the mean accuracies is probably a statistical fluke.

A smaller alpha makes the test stricter: it lowers the chance of declaring a difference that is not real, but it also makes real differences harder to detect. A common value is 5 percent (0.05).

In [11]:
if p <= 0.05:
    print('Reject the Null Hypothesis : Difference between mean performance is probably real')
else:
    print('Fail to Reject the Null Hypothesis : Algorithms probably have the same performance')

Fail to Reject the Null Hypothesis : Algorithms probably have the same performance


#### **Note:** Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision.
Consider running the example a few times and comparing the average outcome.
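
One way to follow this advice is to repeat the 3×10 evaluation with several different seeds and look at how much the accuracy gap moves. The sketch below is only an illustration: the list of seeds is arbitrary, and the names `seeds` and `gaps` are not part of the original notebook.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1)

seeds = [1, 2, 3, 4, 5]  # arbitrary seeds chosen for illustration
gaps = []
for seed in seeds:
    # Same splitter for both models, so they are scored on identical folds.
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=seed)
    acc_lr = mean(cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=cv))
    acc_lda = mean(cross_val_score(LinearDiscriminantAnalysis(), X, y, scoring='accuracy', cv=cv))
    gaps.append(acc_lr - acc_lda)
    print('seed=%d  LR=%.3f  LDA=%.3f  gap=%+.3f' % (seed, acc_lr, acc_lda, acc_lr - acc_lda))

# Summarize how stable the gap is across seeds.
print('Mean gap over seeds: %+.3f (%.3f)' % (mean(gaps), std(gaps)))
```

If the sign of the gap flips between seeds, that is a further hint that neither model is clearly better on this dataset.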


# Second Try - with 5-Repeats and 2-Fold-Cross-Validation

Note that we reported performance using a different procedure (3×10 CV) than the one the statistical test uses to estimate performance (5×2 CV). Perhaps the results would look different if we scored the models using five repeats of two-fold cross-validation?

The example below is updated to report classification accuracy for each algorithm using 5×2 CV.

In [20]:
# Evaluate model 1
model1 = LogisticRegression()
cv1 = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=1)
scores1 = cross_val_score(model1, X, y, scoring='accuracy', cv=cv1, n_jobs=-1)
print('LogisticRegression Mean Accuracy: %.3f (%.3f)' % (mean(scores1), std(scores1)))

LogisticRegression Mean Accuracy: 0.894 (0.012)


In [21]:
# Evaluate model 2
model2 = LinearDiscriminantAnalysis()
cv2 = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=1)
scores2 = cross_val_score(model2, X, y, scoring='accuracy', cv=cv2, n_jobs=-1)
print('LinearDiscriminantAnalysis Mean Accuracy: %.3f (%.3f)' % (mean(scores2), std(scores2)))


LinearDiscriminantAnalysis Mean Accuracy: 0.890 (0.013)


In this case, the difference in mean performance is larger, 89.4 percent vs. 89.0 percent, and it now favors Logistic Regression rather than LDA, the reverse of what we saw with 3×10 CV.

In [23]:
# check if difference between algorithms is real
t, p = paired_ttest_5x2cv(estimator1=model1, estimator2=model2, X=X, y=y, scoring='accuracy', random_seed=1)
# summarize
print('P-value: %.3f, t-Statistic: %.3f' % (p, t))
# interpret the result
if p <= 0.05:
    print('Reject the Null Hypothesis : Difference between mean performance is probably real')
else:
    print('Fail to Reject the Null Hypothesis : Algorithms probably have the same performance')

P-value: 0.328, t-Statistic: 1.085
Fail to Reject the Null Hypothesis : Algorithms probably have the same performance


In this case, the p-value is about 0.33, which is much larger than 0.05. It is identical to the earlier result because the test is called with the same data, models, and random seed.
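
As a quick sanity check, the p-value can be recovered from the t-statistic, because the 5×2 CV test compares t against a Student's t distribution with 5 degrees of freedom (two-sided). The sketch below assumes SciPy is available and uses the rounded t-statistic copied from the output above, so the last digit may differ slightly from the reported p-value.

```python
from scipy import stats

t_stat = 1.085  # rounded t-statistic copied from the output above
alpha = 0.05

# Two-sided p-value under a t distribution with 5 degrees of freedom.
p_value = 2.0 * stats.t.sf(abs(t_stat), df=5)

# Smallest |t| that would be significant at the chosen alpha.
critical_t = stats.t.ppf(1.0 - alpha / 2.0, df=5)

print('P-value: %.3f' % p_value)
print('Critical |t| at alpha=%.2f: %.3f' % (alpha, critical_t))
```

Comparing the observed t-statistic with the critical value is an equivalent way of reading the test: the null hypothesis is rejected only when |t| exceeds the critical value.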

**This leads us to fail to reject the null hypothesis, suggesting that any observed difference between the algorithms is probably not real.**

We could just as easily choose logistic regression or LDA and both would perform about the same on average.

**This highlights that performing model selection based only on the mean performance may not be sufficient.**