Imports and methods:

In [25]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from mlxtend.evaluate import (
    cochrans_q,
    mcnemar,
    mcnemar_table
)
from itertools import combinations

In [26]:
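def get_label_combinations(df):
    # all pairs of model column names (every column except 'truth')
    return list(combinations(df.drop(columns='truth'), 2))

def get_value_combinations(list_of_values):
    # all pairs of prediction lists; matches the label pairs when the list follows the column order
    return list(combinations(list_of_values, 2))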

In [27]:
def mcnemar_test(df: pd.DataFrame, values, labels):
    print(f"Comparison of {labels[0]} and {labels[1]}\n")

    tb = mcnemar_table(
        np.array(df['truth']),
        np.array(values[0]),
        np.array(values[1])
    )

    print(f"Confusion matrix:\n{tb}\n")

    chi2, p = mcnemar(tb, corrected=True)

    print(f"chi-squared: {chi2:.2f} \np-value: {p}")
    print("----------")
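
As a rough sketch of what `mcnemar(tb, corrected=True)` computes, the continuity-corrected statistic can be written out by hand from the two off-diagonal cells of the table. The helper name `mcnemar_corrected` and the example counts are illustrative assumptions, not part of the notebook; the p-value uses the fact that the chi-squared survival function with one degree of freedom equals `erfc(sqrt(x / 2))`.

```python
from math import erfc, sqrt

def mcnemar_corrected(b, c):
    """Continuity-corrected McNemar statistic and p-value (1 degree of freedom).

    b: samples that only the first model classified correctly
    c: samples that only the second model classified correctly
    Assumes b + c > 0 (hypothetical helper, not the mlxtend implementation).
    """
    chi2 = (abs(b - c) - 1.0) ** 2 / (b + c)
    # For one degree of freedom the chi-squared survival function
    # equals erfc(sqrt(x / 2)).
    p = erfc(sqrt(chi2 / 2.0))
    return chi2, p

# Illustrative counts: first model alone correct 7 times, second model never.
chi2, p = mcnemar_corrected(7, 0)
print(f"chi-squared: {chi2:.2f}, p-value: {p:.4f}")
```

Only the discordant cells `b` and `c` enter the statistic; samples that both models get right or both get wrong do not influence the test.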

# Hypothesis Testing
We test whether the classifier models applied in each approach differ significantly in accuracy. The same null and alternative hypotheses are used for every approach, because in each case we want to know whether the classification results differ.

## Null and alternative hypothesis

$H_{0}$: There is no difference in classification accuracy between the models.

$H_{A}$: There is a difference in classification accuracy between the models.

Significance threshold: $\alpha = 0.05$

This threshold is compared with the p-value to decide whether the null hypothesis can be rejected:
* $p > \alpha$ &rarr; we fail to reject $H_{0}$
* $p \le \alpha$ &rarr; we reject $H_{0}$ in favour of $H_{A}$
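
The decision rule above can be sketched as a small helper. The function name `decide` and the constant `ALPHA` are assumptions introduced for illustration; the notebook itself compares the values by eye.

```python
ALPHA = 0.05  # significance threshold used throughout this notebook

def decide(p_value, alpha=ALPHA):
    """Return the test decision for a given p-value (hypothetical helper)."""
    if p_value <= alpha:
        return "reject H0"
    return "fail to reject H0"

print(decide(0.002))  # a small p-value leads to rejecting H0
print(decide(0.48))   # a large p-value does not
```

Note that failing to reject $H_{0}$ is not proof that the models are equal; it only means the data do not show a significant difference.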

First, we load the data for hypothesis testing from the CSV exports created in "03_prediction.ipynb".

In [28]:
df_approach_1 = pd.read_csv("../data/classification_results_approach_1.csv")
df_approach_2 = pd.read_csv("../data/classification_results_approach_2.csv")
df_approach_3 = pd.read_csv("../data/classification_results_approach_3.csv")

### Approach 1

In [29]:
df_approach_1

Unnamed: 0,truth,svc_linear,svc_poly,poly_reg
0,30,1,1,62.485990
1,19,13,3,56.882629
2,115,13,13,60.267201
3,41,1,1,62.866345
4,3,13,13,62.268922
...,...,...,...,...
305,3,1,1,60.446078
306,23,1,1,58.477346
307,132,1,1,62.715579
308,1,1,1,61.271565


In [30]:
col_names_1 = df_approach_1.columns
print(f'Labels for approach 1: {col_names_1.values}')

Labels for approach 1: ['truth' 'svc_linear' 'svc_poly' 'poly_reg']


Since we used more than two classifier models, we first check for a statistically significant difference with Cochran's Q test (`cochrans_q`).

In [31]:
q_1, p_1 = cochrans_q(
        np.array(df_approach_1[col_names_1[0]]),
        np.array(df_approach_1[col_names_1[1]]),
        np.array(df_approach_1[col_names_1[2]]),
        np.array(df_approach_1[col_names_1[3]]),
)

print(f"Q (or chi-squared): {q_1:.2f} \np-value: {p_1}")

Q (or chi-squared): 12.25 
p-value: 0.002187491118182885


The result shows that $p \le \alpha$, so we reject the null hypothesis and conclude that the models differ in classification accuracy.
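
To make the Q value less of a black box, here is a minimal sketch that computes Cochran's Q by hand. The per-sample correctness rows are not loaded from the CSV; they are reconstructed from the counts in the pairwise confusion matrices reported for Approach 1 later in this section (6 samples where both SVC models are correct, 1 where only `svc_linear` is correct, 1 where only `svc_poly` is correct, 302 where no model is correct), under the assumption that `poly_reg` is never counted as correct. The function name `cochrans_q_stat` is hypothetical.

```python
from math import exp

def cochrans_q_stat(correct):
    """Cochran's Q for rows of 0/1 flags (1 = the model was correct on that sample)."""
    k = len(correct[0])  # number of models
    col_totals = [sum(row[j] for row in correct) for j in range(k)]
    row_totals = [sum(row) for row in correct]
    total = sum(row_totals)
    numerator = (k - 1) * (k * sum(g * g for g in col_totals) - total ** 2)
    denominator = k * total - sum(r * r for r in row_totals)
    return numerator / denominator

# Columns: svc_linear, svc_poly, poly_reg (reconstructed counts, see above).
rows = (
    [[1, 1, 0]] * 6      # both SVC models correct
    + [[1, 0, 0]] * 1    # only svc_linear correct
    + [[0, 1, 0]] * 1    # only svc_poly correct
    + [[0, 0, 0]] * 302  # no model correct
)

q = cochrans_q_stat(rows)
# With k - 1 = 2 degrees of freedom the chi-squared survival function is exp(-q / 2).
p = exp(-q / 2)
print(f"Q: {q:.2f}, p-value: {p:.6f}")
```

Under these assumptions the sketch arrives at the same Q and p-value as the `cochrans_q` output above.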

To find the "best" classifier model, we compare the models pairwise with McNemar's test (`mcnemar`).
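
The pairwise comparison relies on `itertools.combinations`, which yields every unordered pair exactly once, in the order of the input. A short sketch with the Approach 1 column names:

```python
from itertools import combinations

models = ["svc_linear", "svc_poly", "poly_reg"]

# Every unordered pair of models, in input order.
pairs = list(combinations(models, 2))
for first, second in pairs:
    print(f"{first} vs. {second}")

# The number of tests grows quickly: 3 models give 3 pairs, 4 models give 6.
print(len(pairs), len(list(combinations(range(4), 2))))
```

Because the label pairs and the value pairs are both built with `combinations` from inputs in the same column order, index `i` refers to the same pair of models in both lists.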

In [32]:
# prepare list of possible combinations
value_list = list()
value_list.append(list(df_approach_1[col_names_1[1]]))
value_list.append(list(df_approach_1[col_names_1[2]]))
value_list.append(list(df_approach_1[col_names_1[3]]))

labels = get_label_combinations(df_approach_1)
values = get_value_combinations(value_list)

In [33]:
for i in range(len(labels)):
    mcnemar_test(df_approach_1, values[i], labels[i])

Comparison of svc_linear and svc_poly

Confusion matrix:
[[  6   1]
 [  1 302]]

chi-squared: 0.50 
p-value: 0.47950012218695337
----------
Comparison of svc_linear and poly_reg

Confusion matrix:
[[  0   7]
 [  0 303]]

chi-squared: 5.14 
p-value: 0.02334220201289086
----------
Comparison of svc_poly and poly_reg

Confusion matrix:
[[  0   7]
 [  0 303]]

chi-squared: 5.14 
p-value: 0.02334220201289086
----------


**SVC linear vs. SVC polynomial**

Here, $p > \alpha$, so we cannot reject the null hypothesis: there is no statistically significant difference between these two models.

**SVC linear vs. Polynomial regression**

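With $p \le \alpha$, we reject the null hypothesis. The confusion matrix shows that SVC linear predicted 7 labels correctly where Polynomial regression was wrong, while the reverse never occurred. Polynomial regression outputs continuous values (for example 62.485990) that do not match the integer truth labels exactly, so it is never counted as correct in this comparison. We therefore consider SVC linear 'better'.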

**SVC polynomial vs. Polynomial regression**

The results are the same as for SVC linear vs. Polynomial regression. Since SVC linear and SVC polynomial do not differ significantly, we also conclude that SVC polynomial is 'better' than Polynomial regression.

**Result**

$SVClinear = SVCpolynomial > PolynomialRegression$: SVC linear and SVC polynomial tie for the best classification in Approach 1.

### Approach 2

In [34]:
df_approach_2

Unnamed: 0,truth,SVC_linear,SVC_rbf
0,54,8,8
1,15,10,1
2,43,9,9
3,20,54,5
4,11,13,10
...,...,...,...
305,5,17,17
306,1,20,20
307,136,85,9
308,14,23,23


In [35]:
col_names_2 = df_approach_2.columns
print(f'Labels for approach 2: {col_names_2.values}')

Labels for approach 2: ['truth' 'SVC_linear' 'SVC_rbf']


Here, we only have two classifiers to compare. For two models, Cochran's Q is equivalent to McNemar's test (without continuity correction), so we apply only `McNemar` for approach 2.
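
Why the two tests coincide for two models can be seen from a short derivation. The notation is an assumption made for this sketch: $a$ is the number of samples both models classify correctly, $b$ and $c$ are the samples only the first or only the second model classifies correctly, $G_j$ is the number of correct predictions of model $j$, $L_i$ is the number of models that are correct on sample $i$, and $T$ is the total number of correct predictions. With $k = 2$ models, $G_1 = a + b$, $G_2 = a + c$, $T = 2a + b + c$ and $\sum_i L_i^2 = 4a + b + c$, so

$$
Q = (k-1)\,\frac{k\sum_j G_j^2 - T^2}{kT - \sum_i L_i^2}
  = \frac{2\left[(a+b)^2 + (a+c)^2\right] - (2a+b+c)^2}{2(2a+b+c) - (4a+b+c)}
  = \frac{(b-c)^2}{b+c}
$$

This is McNemar's statistic without continuity correction. The corrected variant used in this notebook, $(|b-c|-1)^2 / (b+c)$, is slightly smaller, so the two tests agree exactly only when the correction is switched off.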

In [36]:
# prepare list for McNemar
value_list = list()
value_list.append(df_approach_2[col_names_2[1]])
value_list.append(df_approach_2[col_names_2[2]])

labels = get_label_combinations(df_approach_2)
values = get_value_combinations(value_list)

In [37]:
for i in range(len(labels)):
    mcnemar_test(df_approach_2, values[i], labels[i])

Comparison of SVC_linear and SVC_rbf

Confusion matrix:
[[  2   2]
 [  4 302]]

chi-squared: 0.17 
p-value: 0.6830913983096086
----------


**SVC linear vs. SVC rbf**

Here, $p > \alpha$, so we cannot reject the null hypothesis: there is no statistically significant difference between these two models. The confusion matrix shows that SVC linear was correct in 2 cases where SVC rbf was wrong, while SVC rbf was correct in 4 cases where SVC linear was wrong. This hints that SVC rbf is slightly better, but since the null hypothesis cannot be rejected, we treat the difference as not significant (also because both models get very few labels right overall).

**Result**

Both classifiers perform equally well, so approach 2 ends in a tie ($SVClinear = SVCrbf$).

### Approach 3

In [38]:
df_approach_3

Unnamed: 0,truth,svm_linear,svm_poly,logistic_regression,k_nearest_neighbour
0,MOVE_UP,MOVE_UP,MOVE_UP,MOVE_UP,MOVE_UP
1,NEW_ENTRY,MOVE_UP,MOVE_UP,MOVE_UP,MOVE_DOWN
2,MOVE_DOWN,MOVE_UP,MOVE_UP,MOVE_UP,MOVE_DOWN
3,MOVE_UP,MOVE_UP,MOVE_UP,MOVE_UP,MOVE_UP
4,MOVE_UP,MOVE_UP,MOVE_UP,MOVE_UP,MOVE_UP
...,...,...,...,...,...
305,MOVE_DOWN,MOVE_UP,MOVE_UP,MOVE_UP,MOVE_UP
306,MOVE_DOWN,MOVE_UP,MOVE_UP,MOVE_UP,MOVE_UP
307,MOVE_UP,MOVE_UP,MOVE_UP,MOVE_UP,MOVE_UP
308,MOVE_UP,MOVE_UP,MOVE_UP,MOVE_UP,MOVE_DOWN


In [39]:
col_names_3 = df_approach_3.columns
print(f'Labels for approach 3: {col_names_3.values}')

Labels for approach 3: ['truth' 'svm_linear' 'svm_poly' 'logistic_regression'
 'k_nearest_neighbour']


Again, we have more than two classifier models, so we apply `Cochrans Q` first and then use `McNemar` for further investigation.

In [40]:
# check Cochran's Q for multiple classifiers (approach 3)
q_3, p_3 = cochrans_q(
    np.array(df_approach_3[col_names_3[0]]),  # truth
    np.array(df_approach_3[col_names_3[1]]),
    np.array(df_approach_3[col_names_3[2]]),
    np.array(df_approach_3[col_names_3[3]]),
    np.array(df_approach_3[col_names_3[4]]),
)

print(f"Q (or chi-squared): {q_3:.2f} \np-value: {p_3}")

Q (or chi-squared): 45.37 
p-value: 7.718796379764298e-10


The analysis shows that $p \le \alpha$. We therefore reject the null hypothesis and conclude that the applied classifier models differ significantly in accuracy.

To see which model performs 'best', we again compare the models pairwise with `McNemar`.

In [41]:
# prepare list of values
value_list = list()
value_list.append(df_approach_3[col_names_3[1]])
value_list.append(df_approach_3[col_names_3[2]])
value_list.append(df_approach_3[col_names_3[3]])
value_list.append(df_approach_3[col_names_3[4]])

In [42]:
# mcnemar
labels = get_label_combinations(df_approach_3)
values = get_value_combinations(value_list)

for i in range(len(labels)):
    mcnemar_test(df_approach_3, values[i], labels[i])

Comparison of svm_linear and svm_poly

Confusion matrix:
[[176   0]
 [  0 134]]

chi-squared: inf 
p-value: 0.0
----------
Comparison of svm_linear and logistic_regression

Confusion matrix:
[[176   0]
 [  0 134]]

chi-squared: inf 
p-value: 0.0
----------
Comparison of svm_linear and k_nearest_neighbour

Confusion matrix:
[[118  58]
 [ 23 111]]

chi-squared: 14.27 
p-value: 0.00015823397209216498
----------
Comparison of svm_poly and logistic_regression

Confusion matrix:
[[176   0]
 [  0 134]]

chi-squared: inf 
p-value: 0.0
----------
Comparison of svm_poly and k_nearest_neighbour

Confusion matrix:
[[118  58]
 [ 23 111]]

chi-squared: 14.27 
p-value: 0.00015823397209216498
----------
Comparison of logistic_regression and k_nearest_neighbour

Confusion matrix:
[[118  58]
 [ 23 111]]

chi-squared: 14.27 
p-value: 0.00015823397209216498
----------


  chi2 = (abs(ary[0, 1] - ary[1, 0]) - 1.0) ** 2 / float(n)


**SVC linear vs. SVC polynomial**

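Here, the reported $p = 0$ is a numerical artifact, not a significant result. Both off-diagonal cells of the confusion matrix are 0, meaning both classifiers are correct and wrong on exactly the same samples. The corrected statistic then divides by $b + c = 0$, which yields `inf` and $p = 0$ (this is the division shown in the warning line printed after the test output). The two models have exactly the same performance.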
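
One way to avoid reading such a result as significant is to guard against the case without discordant pairs before computing the statistic. The following is a minimal sketch, assuming a hypothetical helper `mcnemar_guarded` that takes the two off-diagonal counts directly; it is not part of mlxtend.

```python
from math import erfc, isnan, nan, sqrt

def mcnemar_guarded(b, c):
    """Corrected McNemar test that returns NaN when there are no discordant pairs."""
    n = b + c
    if n == 0:
        # Both models are right and wrong on the same samples:
        # the statistic is undefined, so report NaN instead of inf / 0.
        return nan, nan
    chi2 = (abs(b - c) - 1.0) ** 2 / n
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    return chi2, erfc(sqrt(chi2 / 2.0))

print(mcnemar_guarded(0, 0))    # identical models: undefined result
print(mcnemar_guarded(58, 23))  # discordant counts from the comparison against KNN
```

Returning NaN makes it explicit that no test was carried out, so the pair can be reported as performing identically.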

**SVC linear vs. Logistic regression**

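Both off-diagonal cells of the confusion matrix are 0, so $p = 0$ is an artifact of the division by zero in the corrected statistic rather than a significant result. Both classifiers have exactly the same performance.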

**SVC linear vs. K nearest neighbour**

$p \le \alpha$, so we reject the null hypothesis: there is a performance difference. SVC linear was correct in 58 cases where K nearest neighbour was wrong, whereas K nearest neighbour was correct in only 23 cases where SVC linear was wrong, so SVC linear performs 'better'.
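
The same confusion matrix also gives the accuracy of each model directly. The sketch below assumes the cell layout that matches the interpretation used in this notebook: top-left is both models correct, top-right only the first model correct, bottom-left only the second model correct, bottom-right both wrong.

```python
# Confusion matrix printed for svm_linear vs. k_nearest_neighbour.
table = [[118, 58],
         [23, 111]]

(a, b), (c, d) = table
n = a + b + c + d

acc_svc = (a + b) / n  # samples the first model (SVC linear) got right
acc_knn = (a + c) / n  # samples the second model (KNN) got right

print(f"samples: {n}")
print(f"SVC linear accuracy: {acc_svc:.3f}")
print(f"KNN accuracy:        {acc_knn:.3f}")
```

The accuracy gap comes entirely from the discordant cells, which is why McNemar's test looks only at those two counts.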

**SVC polynomial vs. Logistic regression**

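Both off-diagonal cells of the confusion matrix are 0, so $p = 0$ is an artifact of the division by zero in the corrected statistic rather than a significant result. Both classifiers have exactly the same performance.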


**SVC polynomial vs. K nearest neighbour**

We already saw that SVC linear and SVC polynomial have the same performance. The results here (p-value and confusion matrix) are also identical to those of SVC linear vs. K nearest neighbour.

We conclude that SVC polynomial performs better than K nearest neighbour.

**Logistic regression vs. K nearest neighbour**

As in the comparison of SVC polynomial vs. K nearest neighbour, we get the same results (same p-value and confusion matrix), meaning that Logistic regression is 'better' than K nearest neighbour. This is consistent with SVC polynomial and Logistic regression having exactly the same performance.

**Result**

Three models perform equally well, and each of them outperforms K nearest neighbour ($SvcLinear = SvcPolynomial = LogisticRegression > KNN$). K nearest neighbour therefore has the worst performance in Approach 3.