# Sheet 07

In [32]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from mlxtend.evaluate import (
    cochrans_q,
    mcnemar,
    mcnemar_table
)
from itertools import combinations

## Exercise 01: Significance Testing

### a) Types of significance tests

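We need to compare the accuracy of different classification models. Correlation tests such as Pearson's correlation are not suitable for this, because they measure the association between two variables rather than differences in classification performance.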


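Tests that can be used to compare classification models:
* McNemar's test: compares exactly two models based on their predictions and the true labels
* Cochran's Q test: compares more than two models at once based on their predictions and the true labels
* 5x2 cv paired t-test: compares two models, but requires refitting them on resampled data rather than working from stored predictions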

Since we have more than two models, we either need to run pair-wise McNemar tests or use Cochran's Q test. We opted to first test our hypothesis with Cochran's Q test and then investigate further with pair-wise McNemar tests.
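
To make the omnibus test less of a black box, the following is a minimal sketch of the Cochran's Q statistic written out by hand. It uses the standard textbook formula on made-up toy labels (the names `cochrans_q_statistic`, `toy_truth`, `toy_a`, `toy_b` and `toy_c` are hypothetical and not part of this sheet), and it has not been checked against `mlxtend`'s `cochrans_q` here.

```python
def cochrans_q_statistic(y_true, *model_predictions):
    """Cochran's Q statistic from scratch (statistic only, no p-value)."""
    k = len(model_predictions)  # number of models being compared
    # 1 where a model predicted the true label, 0 otherwise
    hits = [[int(p == t) for p, t in zip(preds, y_true)]
            for preds in model_predictions]
    per_model = [sum(row) for row in hits]         # correct predictions per model
    per_sample = [sum(col) for col in zip(*hits)]  # models that got each sample right
    total = sum(per_model)                         # total number of correct predictions
    numerator = (k - 1) * (k * sum(g * g for g in per_model) - total ** 2)
    denominator = k * total - sum(m * m for m in per_sample)
    # note: the denominator is zero if all models agree on every sample
    return numerator / denominator


# made-up toy labels, NOT the digits data used in this sheet
toy_truth = [0, 1, 1, 0, 1, 0, 1, 0]
toy_a = [0, 1, 1, 0, 1, 0, 1, 0]
toy_b = [0, 1, 0, 0, 1, 1, 1, 0]
toy_c = [1, 0, 0, 0, 1, 1, 0, 0]

q_toy = cochrans_q_statistic(toy_truth, toy_a, toy_b, toy_c)
print(f"Q = {q_toy:.2f}")
```

Under the null hypothesis, Q approximately follows a chi-squared distribution with (number of models - 1) degrees of freedom, which is where the p-value comes from.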

### b) Hypothesis

$H_{0}$: There is no difference between the classification accuracies of the models.

$H_{A}$: There is a difference between the classification accuracies of the models.

Significance threshold: $\alpha = 0.05$

The p-value of a test is compared against this threshold to decide whether the null hypothesis can be rejected:
* $p \geq \alpha$ &rarr; $H_{0}$ cannot be rejected
* $p < \alpha$ &rarr; $H_{0}$ can be rejected in favor of $H_{A}$
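
The decision rule can be sketched as a small helper. The function name `decide` and the two example p-values are hypothetical and only serve as illustration; the threshold is the one chosen above.

```python
ALPHA = 0.05  # significance threshold chosen in this section


def decide(p_value, alpha=ALPHA):
    """Return a short verdict for a p-value (hypothetical helper)."""
    if p_value < alpha:
        return "reject H0"
    return "cannot reject H0"


# illustrative p-values, not results from this sheet
print(decide(0.20))
print(decide(0.001))
```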

### c) Apply test

In [3]:
# load dataset
df = pd.read_json('../data/digits_classifiers.json')
df

Unnamed: 0,RandomForestClassifier(),LinearSVC(),KNeighborsClassifier(),MLPClassifier()
truth,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, ..."
predictions,"[0, 1, 8, 3, 4, 9, 6, 7, 8, 9, 0, 1, 2, 3, 4, ...","[0, 1, 2, 3, 4, 9, 6, 7, 8, 9, 0, 1, 2, 3, 4, ...","[0, 1, 8, 3, 4, 9, 6, 7, 8, 9, 0, 1, 2, 3, 4, ...","[0, 1, 2, 3, 4, 9, 6, 7, 8, 9, 0, 1, 2, 3, 4, ..."


In [4]:
# ground truth: these values are the same for every classification model
truth = df['RandomForestClassifier()'].iloc[0]

In [15]:
# calculate Cochran's Q over the predictions of all classification models
q, p_value = cochrans_q(
    np.array(truth),
    np.array(df.iloc[1, 0]),
    np.array(df.iloc[1, 1]),
    np.array(df.iloc[1, 2]),
    np.array(df.iloc[1, 3])
)

print(f"Q (or chi-squared): {q:.2f} \np-value: {p_value}")

Q (or chi-squared): 107.33 
p-value: 4.1168104053085647e-23


Since the p-value is smaller than $\alpha$ ($4.12 \times 10^{-23} < 0.05$), we can reject the null hypothesis and conclude that the classification accuracy differs between at least two of the models.

To investigate further, we now apply a pair-wise McNemar test to every pair of models. With this approach we can determine whether there is "a clear winner".
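
As a cross-check of what the pair-wise test computes, here is a minimal sketch of the continuity-corrected McNemar statistic written out by hand. It only needs the two off-diagonal counts of the 2x2 table: `b` (first model right, second wrong) and `c` (first model wrong, second right). The helper name `mcnemar_corrected` is hypothetical; the counts 111 and 40 are taken from the RandomForestClassifier()/LinearSVC() table in the output further down.

```python
import math


def mcnemar_corrected(b, c):
    """Continuity-corrected McNemar statistic and its p-value (1 degree of freedom)."""
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # survival function of a chi-squared distribution with 1 degree of freedom
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# off-diagonal counts of the RandomForestClassifier()/LinearSVC() table
chi2_rf_svc, p_rf_svc = mcnemar_corrected(b=111, c=40)
print(f"chi-squared: {chi2_rf_svc:.2f}\np-value: {p_rf_svc:.3e}")
```

The statistic is symmetric in `b` and `c`, so it only tells us that the models differ; which model is better has to be read off the table itself.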

In [51]:
# get all combinations for McNemar approach
list_labels = list(combinations(df.columns, 2))
list_values = list(combinations(df.iloc[1], 2))

print(f"Possible combinations labels: {len(list_labels)}")
print(f"Possible combinations values: {len(list_values)}")

Possible combinations labels: 6
Possible combinations values: 6


In [47]:
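def mcnemar_table_and_score(truth_values, combination_values, combination_labels):
    """Print the McNemar contingency table and test result for one pair of models."""
    print(f"Comparison of {combination_labels[0]} and {combination_labels[1]}\n")

    # 2x2 table of correct/wrong predictions of both models
    tb = mcnemar_table(
        np.array(truth_values),  # use the argument, not the global `truth`
        np.array(combination_values[0]),
        np.array(combination_values[1])
    )

    print(f"McNemar table:\n{tb}\n")

    # corrected=True applies the continuity correction
    chi2, p = mcnemar(tb, corrected=True)

    print(f"chi-squared: {chi2:.2f} \np-value: {p}")
    print("----------")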

#### Calculate statistically significant differences

In [48]:
# loop over every pair of models
for values, labels in zip(list_values, list_labels):
    mcnemar_table_and_score(truth, values, labels)

Comparison of RandomForestClassifier() and LinearSVC()

McNemar table:
[[1578  111]
 [  40   68]]

chi-squared: 32.45 
p-value: 1.2227811884233145e-08
----------
Comparison of RandomForestClassifier() and KNeighborsClassifier()

McNemar table:
[[1667   22]
 [  63   45]]

chi-squared: 18.82 
p-value: 1.4338726289171642e-05
----------
Comparison of RandomForestClassifier() and MLPClassifier()

McNemar table:
[[1637   52]
 [  42   66]]

chi-squared: 0.86 
p-value: 0.3532628012845984
----------
Comparison of LinearSVC() and KNeighborsClassifier()

McNemar table:
[[1594   24]
 [ 136   43]]

chi-squared: 77.01 
p-value: 1.7041780183157275e-18
----------
Comparison of LinearSVC() and MLPClassifier()

McNemar table:
[[1585   33]
 [  94   85]]

chi-squared: 28.35 
p-value: 1.014322925494309e-07
----------
Comparison of KNeighborsClassifier() and MLPClassifier()

McNemar table:
[[1654   76]
 [  25   42]]

chi-squared: 24.75 
p-value: 6.518503507391278e-07
----------


With the results from the McNemar analysis, we can now evaluate whether there is a clear winner in terms of prediction accuracy.
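
The tables above also contain each model's overall accuracy, which gives a quick sanity check for the ranking derived below. The sketch assumes the table layout in which rows refer to the first model (correct, wrong) and columns to the second model (correct, wrong); the variable names are hypothetical, the counts are copied from the output above.

```python
# 2x2 tables copied from the McNemar output above
tables = {
    ("RandomForestClassifier()", "LinearSVC()"): [[1578, 111], [40, 68]],
    ("RandomForestClassifier()", "KNeighborsClassifier()"): [[1667, 22], [63, 45]],
    ("RandomForestClassifier()", "MLPClassifier()"): [[1637, 52], [42, 66]],
    ("LinearSVC()", "KNeighborsClassifier()"): [[1594, 24], [136, 43]],
    ("LinearSVC()", "MLPClassifier()"): [[1585, 33], [94, 85]],
    ("KNeighborsClassifier()", "MLPClassifier()"): [[1654, 76], [25, 42]],
}

correct = {}      # model name -> number of correctly classified samples
n_samples = None  # total number of samples, identical for every table
for (model_1, model_2), ((a, b), (c, d)) in tables.items():
    n = a + b + c + d
    if n_samples is None:
        n_samples = n
    assert n == n_samples, "tables disagree on the number of samples"
    # assumed layout: first row = model 1 correct, first column = model 2 correct
    for name, hits in ((model_1, a + b), (model_2, a + c)):
        assert correct.setdefault(name, hits) == hits, "tables are inconsistent"

# models sorted from most to fewest correct predictions
ranking = sorted(correct, key=correct.get, reverse=True)
for name in ranking:
    print(f"{name}: {correct[name]}/{n_samples} = {correct[name] / n_samples:.4f}")
```

The internal assertions pass only if every table reports the same number of correct predictions per model, which supports the assumed layout.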

#### Identify best model

**RandomForestClassifier() and LinearSVC()**

The p-value is smaller than $\alpha$, so we can reject the null hypothesis. The 2x2 table shows that `RandomForestClassifier` correctly predicted 111 samples that `LinearSVC` got wrong, while `LinearSVC` correctly predicted only 40 samples that `RandomForestClassifier` got wrong. This 111:40 ratio favors `RandomForestClassifier`.

This concludes: $RandomForestClassifier > LinearSVC$

**RandomForestClassifier() and KNeighborsClassifier()**

The p-value is smaller than $\alpha$, so we can reject the null hypothesis. The 2x2 table shows that `KNeighborsClassifier` correctly predicted 63 samples that `RandomForestClassifier` got wrong, compared to 22 samples in the other direction. This 63:22 ratio favors `KNeighborsClassifier`.

This concludes: $KNeighborsClassifier > RandomForestClassifier > LinearSVC$

**RandomForestClassifier() and MLPClassifier()**

This comparison has a p-value of 0.35, which is larger than $\alpha$, so we cannot reject the null hypothesis. We find no statistically significant difference between these two models and therefore treat them as equivalent in the ranking.

This concludes: $KNeighborsClassifier > RandomForestClassifier = MLPClassifier > LinearSVC$

**LinearSVC() and KNeighborsClassifier()**

This ordering already follows from the comparisons above. The p-value is again smaller than $\alpha$, so we can reject the null hypothesis, and the 136:24 ratio favors `KNeighborsClassifier`.

The result stays the same: $KNeighborsClassifier > RandomForestClassifier = MLPClassifier > LinearSVC$

**LinearSVC() and MLPClassifier()**

The p-value is smaller than $\alpha$ and the 94:33 ratio favors `MLPClassifier`, which is consistent with the result already derived above: $KNeighborsClassifier > RandomForestClassifier = MLPClassifier > LinearSVC$

**KNeighborsClassifier() and MLPClassifier()**

The p-value is also smaller than $\alpha$ in this comparison and the 76:25 ratio favors `KNeighborsClassifier`, which supports the result we already have: $KNeighborsClassifier > RandomForestClassifier = MLPClassifier > LinearSVC$

So we can say that `KNeighborsClassifier` is the winner: it differs significantly from every other model (all p-values smaller than $\alpha$), and in every comparison it has the better ratio of samples predicted correctly where the other model was wrong.
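
The six pair-wise tests above are each evaluated at the uncorrected threshold $\alpha = 0.05$. As a robustness check that is not part of the original analysis, the sketch below applies a Bonferroni correction (dividing $\alpha$ by the number of tests), which is one common but conservative way to account for multiple comparisons. The variable names are hypothetical; the p-values are copied from the output above.

```python
ALPHA = 0.05  # significance threshold from exercise b)

# p-values copied from the pair-wise McNemar output
p_values = {
    ("RandomForestClassifier()", "LinearSVC()"): 1.2227811884233145e-08,
    ("RandomForestClassifier()", "KNeighborsClassifier()"): 1.4338726289171642e-05,
    ("RandomForestClassifier()", "MLPClassifier()"): 0.3532628012845984,
    ("LinearSVC()", "KNeighborsClassifier()"): 1.7041780183157275e-18,
    ("LinearSVC()", "MLPClassifier()"): 1.014322925494309e-07,
    ("KNeighborsClassifier()", "MLPClassifier()"): 6.518503507391278e-07,
}

# Bonferroni: each test is compared against alpha / number of tests
alpha_bonferroni = ALPHA / len(p_values)
significant = {pair: p < alpha_bonferroni for pair, p in p_values.items()}

print(f"corrected threshold: {alpha_bonferroni:.5f}")
for (model_1, model_2), is_significant in significant.items():
    verdict = "significant" if is_significant else "not significant"
    print(f"{model_1} vs {model_2}: {verdict}")
```

With these p-values the corrected threshold leads to the same decisions as the uncorrected one, so the ranking is unaffected.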