Machine learning models are usually chosen based on their mean performance, often estimated with k-fold cross-validation. The algorithm with the best mean performance is expected to be better than those with worse mean performance. But sometimes a difference in mean performance is a statistical fluke. A statistical hypothesis test therefore helps evaluate whether the difference in mean performance between two algorithms is real.

The paired-sample t-test is a univariate test for a significant difference between two related variables. We use it here because each comparison pairs the per-fold cross-validation scores of two models.

The null hypothesis of the test is that there is no difference between the means of the two samples. Rejecting the null hypothesis indicates that there is enough evidence that the means differ.
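
For reference, the statistic behind the paired test is usually written as below. The symbols are notation chosen here, not taken from the notebook: $d_i$ is the difference between the two models' scores on fold $i$, and $n$ is the number of folds.

$$
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(d_i - \bar{d}\right)^2}, \qquad
t = \frac{\bar{d}}{s_d / \sqrt{n}}
$$

Under the null hypothesis, and assuming the differences are approximately normally distributed, $t$ follows a Student's t-distribution with $n - 1$ degrees of freedom. The p-value is the probability of observing a value at least as extreme as $|t|$ under that distribution.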

## Comparing with and without MLSMOTE for a Neural Network Model



In [2]:
# imports for evaluating an MLP classifier with k-fold cross-validation
from pandas import read_csv
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, cohen_kappa_score, \
    confusion_matrix
from sklearn.model_selection import train_test_split, StratifiedKFold, RepeatedKFold, learning_curve, GridSearchCV
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.python.keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score

In [None]:
X = read_csv('/content/drive/MyDrive/train_features.csv')
X = np.asarray(X)[:,4:]     #4

y = read_csv('/content/drive/MyDrive/train_targets_scored.csv')
y = np.asarray(y)[:,1:]     #1

n_layers = [64, 128, 64]    #[120, 84]
X_shape, y_shape = X.shape[1], y.shape[1]

In [4]:
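def get_model(learn_rate=0.0001):
    # NOTE: learn_rate is accepted but not used; the model is compiled with Adam's default settings.
    n_input, n_classes = X_shape, y_shape
    model = Sequential()

    # first hidden layer also defines the input dimension
    model.add(Dense(n_layers[0], input_dim=n_input, activation='relu'))
    for nb_neurons in n_layers[1:]:
        model.add(Dense(nb_neurons, activation='relu'))
    # one sigmoid unit per target column
    model.add(Dense(n_classes, activation='sigmoid'))

    # NOTE: sigmoid outputs are usually paired with 'binary_crossentropy';
    # the loss is left as in the run that produced the results below.
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    return model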

In [None]:
model = KerasClassifier(build_fn=get_model, epochs=150, batch_size=10, verbose=0)
# evaluate using 10-fold cross validation
kfold1 = RepeatedKFold(n_splits=10, n_repeats=1, random_state=123)
results1 = cross_val_score(model, np.asarray(X).astype(np.float64), np.asarray(y).astype(np.float64), cv=kfold1)
print(results1.mean())
print(results1)

0.017257916973903775
[0.0247691  0.01553317 0.01889169 0.02980689 0.01469971 0.01385972
 0.0100798  0.00713986 0.01805964 0.01973961]

In [3]:
X = read_csv('/content/drive/MyDrive/train_features_mlsmote_200.csv')
X = np.asarray(X)[:,4:]

y = read_csv('/content/drive/MyDrive/train_targets_scored_mlsmote_200.csv')
y = np.asarray(y)[:,1:]

n_layers = [64, 128, 64]    #[120, 84]
X_shape, y_shape = X.shape[1], y.shape[1]

In [5]:
model2 = KerasClassifier(build_fn=get_model, epochs=150, batch_size=10, verbose=0)
# evaluate using 10-fold cross validation
kfold = RepeatedKFold(n_splits=10, n_repeats=1, random_state=123)
results = cross_val_score(model2, np.asarray(X).astype(np.float64), np.asarray(y).astype(np.float64), cv=kfold)
print(results.mean())
print(results)

0.5264029383659363
[0.53111249 0.53991199 0.52671278 0.52482718 0.526564   0.52310592
 0.52750707 0.5243634  0.51933354 0.52059102]


In [8]:
from scipy.stats import ttest_rel
# fold accuracies of the model trained without MLSMOTE (copied from the output above)
results1 = [0.0247691,  0.01553317, 0.01889169, 0.02980689, 0.01469971, 0.01385972, 0.0100798,  0.00713986, 0.01805964, 0.01973961]
# paired t-test: with-MLSMOTE scores (results) against without-MLSMOTE scores (results1)
t, p = ttest_rel(results, results1)
print("p-value:", p)
print("t-statistic:", t)
if p <= 0.01:
    print('Since p<0.01, We reject the null-hypothesis that both models perform equally well on this dataset.')
else:
    print('Since p>0.01, we don\'t reject the null hypothesis.')

p-value: 2.455344331889295e-17
Since p<0.01, We reject the null-hypothesis that both models perform equally well on this dataset.


For alpha = 0.01, the p-value is below alpha, so we reject the null hypothesis **H0** at the 1% significance level. The difference in mean accuracy with and without MLSMOTE is statistically significant, in favour of MLSMOTE (mean accuracy 0.526 versus 0.017), even if the performance numbers themselves are not great.
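
The t-statistic behind this p-value can be recomputed by hand from the fold accuracies printed earlier. The sketch below uses only the standard library. The helper name `paired_t_statistic` is an assumption introduced here for illustration and is not part of the original notebook.

```python
import math
import statistics

# Fold accuracies copied from the cross-validation outputs printed earlier in this notebook.
acc_mlsmote = [0.53111249, 0.53991199, 0.52671278, 0.52482718, 0.526564,
               0.52310592, 0.52750707, 0.5243634, 0.51933354, 0.52059102]
acc_plain = [0.0247691, 0.01553317, 0.01889169, 0.02980689, 0.01469971,
             0.01385972, 0.0100798, 0.00713986, 0.01805964, 0.01973961]


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for two paired samples a and b."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


t_stat, dof = paired_t_statistic(acc_mlsmote, acc_plain)
print("t-statistic:", t_stat, "degrees of freedom:", dof)
```

`scipy.stats.ttest_rel` should report the same statistic for these inputs, together with the two-sided p-value.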

## Comparing a Neural Network Model with MLSMOTE to SVM with MLSMOTE


In [25]:
import pickle

# load the fitted SVM search object (it must expose best_index_ and cv_results_)
with open("/content/clf_mlsmote.pickle", "rb") as f1:
    svm_mlsmote = pickle.load(f1)
idx_mlsmote = svm_mlsmote.best_index_
# per-fold test scores of the best parameter setting
cv_scores_svm_mlsmote = [svm_mlsmote.cv_results_[f"split{i}_test_score"][idx_mlsmote] for i in range(10)]
# drop folds whose score is NaN
scores_mlsmote = [x for x in cv_scores_svm_mlsmote if not np.isnan(x)]
print(scores_mlsmote)
# NOTE: this pairs the first len(scores_mlsmote) neural-network folds with the remaining SVM folds
t, p = ttest_rel(results[0:len(scores_mlsmote)], scores_mlsmote)
print("p-value:", p)
print("t-statistic:", t)
if p <= 0.01:
    print('Since p<0.01, We reject the null-hypothesis that both models perform equally well on this dataset.')
else:
    print('Since p>0.01, we don\'t reject the null hypothesis.')

[0.36561743341404357, 0.3765133171912833, 0.3726392251815981, 0.3784503631961259, 0.3721549636803874, 0.3786924939467312]
p-value: 1.1211805771823691e-07
Since p<0.01, We reject the null-hypothesis that both models perform equally well on this dataset.




For alpha = 0.01, the p-value is below alpha, so we reject the null hypothesis **H0** at the 1% significance level. The difference in mean accuracy between the neural network and the SVM, both trained with MLSMOTE, is statistically significant, in favour of the neural network.
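
The comparison above drops the NaN folds from the SVM scores and then takes the first neural-network folds of matching length. If the NaN folds are not the last ones, the pairs no longer refer to the same fold index. A minimal sketch of an index-preserving alternative follows. The helper `paired_valid_scores` and the example scores are hypothetical, and the sketch assumes both models were scored on the same fold order.

```python
import math


def paired_valid_scores(scores_a, scores_b):
    """Keep only the folds where both models have a non-NaN score, preserving alignment."""
    pairs = [(a, b) for a, b in zip(scores_a, scores_b)
             if not (math.isnan(a) or math.isnan(b))]
    kept_a = [a for a, _ in pairs]
    kept_b = [b for _, b in pairs]
    return kept_a, kept_b


# Hypothetical fold scores: the values and NaN positions are made up for illustration.
nn_scores = [0.53, 0.54, 0.53, 0.52, 0.53]
svm_scores = [0.37, float("nan"), 0.38, float("nan"), 0.37]

nn_kept, svm_kept = paired_valid_scores(nn_scores, svm_scores)
print(nn_kept)   # scores of the first model on folds 0, 2 and 4
print(svm_kept)  # scores of the second model on the same folds
```

The two filtered lists can then be passed to `ttest_rel` in place of the sliced inputs.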