# Statistical Hypothesis Testing II

In this notebook, we will implement and apply **statistical hypothesis tests** to make inferences about risks of learning algorithms.

At the start, we will compare two learning algorithms on one domain via the paired $t$-test.

Subsequently, we will compare the two learning algorithms across multiple domains via the Wilcoxon signed-rank test.

### **Table of Contents**
1. [Paired $t$-test](#paired-t-test)
2. [Wilcoxon Signed-rank Test](#wilcoxon-signed-rank-test)

In [1]:
%load_ext autoreload
%autoreload 2

import matplotlib.pyplot as plt
import numpy as np

from scipy import stats

### **1. Paired $t$-test** <a class="anchor" id="paired-t-test"></a>

We implement the function [`t_test_paired`](../e2ml/evaluation/_paired_tests.py) in the [`e2ml.evaluation`](../e2ml/evaluation) subpackage. Once the implementation has been completed, we check it for the different test types (right-tail, left-tail, and two-sided).

In [2]:
from e2ml.evaluation import t_test_paired
r_1 = np.round(stats.norm.rvs(loc=0.1, scale=0.03, size=40, random_state=0), 2)
r_2 = np.round(stats.norm.rvs(loc=0.12, scale=0.03, size=40, random_state=1), 2)
mu_0 = 0
t_statistic, p = t_test_paired(sample_data_1=r_1, sample_data_2=r_2, mu_0=mu_0, test_type="right-tail")
assert np.round(t_statistic, 4) == -1.4731 , 'The paired t-test statistic must be ca. -1.4731.' 
assert np.round(p, 4) == 0.9256, 'The p-value must be ca. 0.9256 for the one-sided right-tail test.' 
t_statistic, p = t_test_paired(sample_data_1=r_1, sample_data_2=r_2, mu_0=mu_0, test_type="left-tail")
assert np.round(t_statistic, 4) == -1.4731 , 'The paired t-test statistic must be ca. -1.4731.' 
assert np.round(p, 4) == 0.0744, 'The p-value must be ca. 0.0744 for the one-sided left-tail test.' 
t_statistic, p = t_test_paired(sample_data_1=r_1, sample_data_2=r_2, mu_0=mu_0, test_type="two-sided")
assert np.round(t_statistic, 4) == -1.4731 , 'The paired t-test statistic must be ca. -1.4731.' 
assert np.round(p, 4) == 0.1487, 'The p-value must be ca. 0.1487 for the two-sided test.' 
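
For orientation, the computation behind such a paired $t$-test can be sketched as follows. This is only a minimal sketch under the assumption that the differences are formed as `sample_data_1 - sample_data_2`; the function name `paired_t_test_sketch` is hypothetical and is not the `e2ml` implementation. The two-sided result is cross-checked against `scipy.stats.ttest_rel`.

```python
import numpy as np
from scipy import stats


def paired_t_test_sketch(sample_data_1, sample_data_2, mu_0=0.0, test_type="two-sided"):
    """Hypothetical sketch of a paired t-test on the differences sample_data_1 - sample_data_2."""
    d = np.asarray(sample_data_1, dtype=float) - np.asarray(sample_data_2, dtype=float)
    n = len(d)
    # Test statistic: standardized deviation of the mean difference from mu_0.
    t_statistic = (d.mean() - mu_0) / (d.std(ddof=1) / np.sqrt(n))
    df = n - 1
    if test_type == "left-tail":
        p = stats.t.cdf(t_statistic, df=df)
    elif test_type == "right-tail":
        p = stats.t.sf(t_statistic, df=df)
    elif test_type == "two-sided":
        p = 2 * stats.t.sf(np.abs(t_statistic), df=df)
    else:
        raise ValueError("test_type must be 'left-tail', 'right-tail', or 'two-sided'.")
    return t_statistic, p


# Same samples as in the checks above.
r_1 = np.round(stats.norm.rvs(loc=0.1, scale=0.03, size=40, random_state=0), 2)
r_2 = np.round(stats.norm.rvs(loc=0.12, scale=0.03, size=40, random_state=1), 2)

t_sketch, p_sketch = paired_t_test_sketch(r_1, r_2, mu_0=0, test_type="two-sided")
t_ref, p_ref = stats.ttest_rel(r_1, r_2)  # SciPy reference, two-sided by default
print(t_sketch, p_sketch)
print(t_ref, p_ref)
```

The left-tail and right-tail $p$-values of such a test always sum to one, which is a quick sanity check for any implementation.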

Next, we want to check whether a *support vector classifier* (SVC) significantly outperforms a *Gaussian process classifier* (GPC) on the breast cancer data set, where we use the zero-one loss as the performance measure and the paired $t$-test with $\alpha=0.01$. Design and perform the corresponding evaluation study.

In [19]:
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.datasets import load_breast_cancer

from e2ml.evaluation import cross_validation, zero_one_loss
from e2ml.preprocessing import StandardScaler

x, y = load_breast_cancer(return_X_y=True)

sample_idx = np.arange(len(y), dtype=int)
n_folds = 8
train, test = cross_validation(sample_indices=sample_idx, n_folds=n_folds, y=y, random_state=0)

risks_gpc = []
risks_svc = []

for train_, test_ in zip(train, test):
    scaler = StandardScaler().fit(x[train_])
    x_train = scaler.transform(x[train_])
    x_test = scaler.transform(x[test_])

    gpc = GaussianProcessClassifier(random_state=0)
    gpc = gpc.fit(x_train, y[train_])
    preds_gpc = gpc.predict(x_test)
    risks_gpc.append(zero_one_loss(y[test_], preds_gpc))

    svc = SVC(random_state=0)
    svc = svc.fit(x_train, y[train_])
    preds_svc = svc.predict(x_test)
    risks_svc.append(zero_one_loss(y[test_], preds_svc))

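alpha = 0.01

# Right-tail test on the paired differences risks_gpc - risks_svc:
# a positive mean difference indicates that the SVC has the lower risk.
_, p = t_test_paired(sample_data_1=risks_gpc, sample_data_2=risks_svc, mu_0=0, test_type="right-tail")

if p <= alpha:
    print("The null hypothesis can be rejected: the SVC has a significantly lower risk than the GPC.")
else:
    print("Not enough evidence to reject the null hypothesis.")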


#### **Questions:**
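1. (a) What are possible issues of your conducted evaluation study?
   
   The zero-one losses, and hence their paired differences, may not be normally distributed, although the paired $t$-test assumes this.
   
   The training and test sets of different folds may not be independent of each other, so the per-fold risks are not independent samples.

   Only one hyperparameter setting was evaluated per classifier.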
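
The normality concern could be probed with a Shapiro-Wilk test on the paired differences. The following is only a sketch: the per-fold risks are synthetic values drawn for illustration (they are not results of the study above), and the names `risks_a`, `risks_b`, and `d_synthetic` are hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic per-fold zero-one risks for two hypothetical classifiers:
# 8 folds with 71 test samples each, i.e., risks are multiples of 1/71.
rng = np.random.default_rng(0)
risks_a = rng.binomial(n=71, p=0.03, size=8) / 71
risks_b = rng.binomial(n=71, p=0.02, size=8) / 71
d_synthetic = risks_a - risks_b

# Shapiro-Wilk test with the null hypothesis that the differences are normally distributed.
shapiro_statistic, shapiro_p = stats.shapiro(d_synthetic)
print(f"W = {shapiro_statistic:.4f}, p = {shapiro_p:.4f}")
```

With only eight differences, such a test has little power, so a large $p$-value does not confirm normality; it merely fails to contradict it.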

### **2. Wilcoxon Signed-rank Test** <a class="anchor" id="wilcoxon-signed-rank-test"></a>

We implement the function [`wilcoxon_signed_rank_test`](../e2ml/evaluation/_paired_tests.py) in the [`e2ml.evaluation`](../e2ml/evaluation) subpackage. Once the implementation has been completed, we check it for the different test types (right-tail, left-tail, and two-sided).

In [25]:
from e2ml.evaluation import wilcoxon_signed_rank_test

# Test for exact computation.
r_1 = stats.norm.rvs(loc=0.1, scale=0.03, size=10, random_state=0)
r_2 = stats.norm.rvs(loc=0.15, scale=0.03, size=10, random_state=1)
d = r_2 - r_1
print(d)
w_statistic, p = wilcoxon_signed_rank_test(sample_data_1=d, test_type="right-tail")
print(w_statistic, p)
assert w_statistic == 47 , 'The positive rank sum statistic must be 47.' 
assert np.round(p, 4) == 0.0244, 'The p-value must be ca. 0.0244 for the one-sided right-tail test.' 
w_statistic, p = wilcoxon_signed_rank_test(sample_data_1=d, test_type="left-tail")
assert w_statistic == 47 , 'The positive rank sum statistic must be 47.' 
assert np.round(p, 4) == 0.9814, 'The p-value must be ca. 0.9814 for the one-sided left-tail test.' 
w_statistic, p = wilcoxon_signed_rank_test(sample_data_1=d, test_type="two-sided")
assert w_statistic == 47 , 'The positive rank sum statistic must be 47.' 
assert np.round(p, 4) == 0.0488, 'The p-value must be ca. 0.0488 for the two-sided test.' 

# Test for approximate computation.
r_1 = stats.norm.rvs(loc=2, scale=0.3, size=100, random_state=0)
r_2 = stats.norm.rvs(loc=2.1, scale=0.3, size=100, random_state=1)
d = r_2 - r_1
w_statistic, p = wilcoxon_signed_rank_test(sample_data_1=d, test_type="right-tail")
assert w_statistic == 3303 , 'The positive rank sum statistic must be 3303.' 
assert np.round(p, 4) == 0.0037, 'The p-value must be ca. 0.0037 for the one-sided right-tail test.' 
w_statistic, p = wilcoxon_signed_rank_test(sample_data_1=d, test_type="left-tail")
assert w_statistic == 3303 , 'The positive rank sum statistic must be 3303.' 
assert np.round(p, 4) == 0.9963, 'The p-value must be ca. 0.9963 for the one-sided left-tail test.' 
w_statistic, p = wilcoxon_signed_rank_test(sample_data_1=d, test_type="two-sided")
assert w_statistic == 3303 , 'The positive rank sum statistic must be 3303.' 
assert np.round(p, 4) == 0.0075, 'The p-value must be ca. 0.0075 for the two-sided test.' 

[ 0.04580879  0.01964259  0.00479271 -0.04941585  0.01993549  0.01027218
  0.0738417   0.03170451  0.06266774  0.03020093]
47 0.0244140625
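
The two computation modes checked above can be sketched as follows. This is a minimal sketch and not the `e2ml` implementation: the helper names are hypothetical, the exact variant assumes that there are no ties among the absolute differences, and the normal approximation is assumed to use no continuity correction.

```python
import numpy as np
from scipy import stats


def signed_rank_statistic(d):
    """Return the positive rank sum W+ and the number of non-zero differences."""
    d = np.asarray(d, dtype=float)
    d = d[d != 0]                      # zero differences carry no sign information
    ranks = stats.rankdata(np.abs(d))  # ranks of the absolute differences
    return ranks[d > 0].sum(), len(d)


def p_values_exact(w_plus, n):
    """Left- and right-tail p-values from the exact null distribution (no ties assumed)."""
    max_sum = n * (n + 1) // 2
    # counts[s] = number of subsets of the ranks {1, ..., n} whose sum equals s.
    counts = np.zeros(max_sum + 1)
    counts[0] = 1
    for r in range(1, n + 1):
        counts[r:] += counts[:-r].copy()
    pmf = counts / 2**n                # every sign pattern is equally likely under H0
    support = np.arange(max_sum + 1)
    return pmf[support <= w_plus].sum(), pmf[support >= w_plus].sum()


def p_values_normal(w_plus, n):
    """Left- and right-tail p-values from the normal approximation of W+."""
    mean = n * (n + 1) / 4
    std = np.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / std
    return stats.norm.cdf(z), stats.norm.sf(z)


# Differences as printed by the cell above.
d_printed = np.array([0.04580879, 0.01964259, 0.00479271, -0.04941585, 0.01993549,
                      0.01027218, 0.0738417, 0.03170451, 0.06266774, 0.03020093])
w_plus, n = signed_rank_statistic(d_printed)
p_left, p_right = p_values_exact(w_plus, n)
print(w_plus, p_right)

# Normal approximation for the statistic expected by the second group of checks.
p_left_approx, p_right_approx = p_values_normal(3303, 100)
print(p_right_approx)
```

For the ten printed differences, only the fourth one is negative and its absolute value has rank 8, so the positive rank sum is $55 - 8 = 47$, and the exact right-tail $p$-value is $25/1024 = 0.0244140625$, matching the printed output.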



Next, we want to check whether a *support vector classifier* (SVC) significantly outperforms a *Gaussian process classifier* (GPC) on ten artificially generated data sets, where we use the zero-one loss as the performance measure and the Wilcoxon signed-rank test with $\alpha=0.01$. Design and perform the corresponding evaluation study.

In [27]:
from sklearn.datasets import make_classification

# Create 10 artificial data sets.
data_sets = []
n_classes_list = np.arange(2, 12)
for n_classes in n_classes_list:
    X, y = make_classification(
        n_samples=500, n_classes=n_classes, class_sep=2, n_informative=10, random_state=n_classes
    )
    data_sets.append((X, y))

# Estimate one risk per data set and classifier by averaging over the folds of a cross-validation.
n_folds = 10
risks_gpc = []
risks_svc = []

for x, y in data_sets:
    sample_idx = np.arange(len(y), dtype=int)
    train, test = cross_validation(sample_indices=sample_idx, n_folds=n_folds, y=y, random_state=0)

    fold_risks_gpc = []
    fold_risks_svc = []
    for train_, test_ in zip(train, test):
        scaler = StandardScaler().fit(x[train_])
        x_train = scaler.transform(x[train_])
        x_test = scaler.transform(x[test_])

        gpc = GaussianProcessClassifier(random_state=0).fit(x_train, y[train_])
        fold_risks_gpc.append(zero_one_loss(y[test_], gpc.predict(x_test)))

        svc = SVC(random_state=0).fit(x_train, y[train_])
        fold_risks_svc.append(zero_one_loss(y[test_], svc.predict(x_test)))

    risks_gpc.append(np.mean(fold_risks_gpc))
    risks_svc.append(np.mean(fold_risks_svc))

alpha = 0.01

# Right-tail test on the paired differences d = risks_gpc - risks_svc:
# a large positive rank sum indicates that the SVC has the lower risk.
d = np.array(risks_gpc) - np.array(risks_svc)
_, p = wilcoxon_signed_rank_test(sample_data_1=d, test_type="right-tail")
print(p)
if p <= alpha:
    print("The null hypothesis can be rejected: the SVC has a significantly lower risk than the GPC.")
else:
    print("Not enough evidence to reject the null hypothesis.")


#### **Questions:**
2. (a) What are possible issues of your conducted evaluation study?
   
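   Both classifiers were only evaluated with their default hyperparameter values.

   The training sets overlap, so the risk estimates are not independent of each other.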