# Part 6: Cross-Validation & Hypothesis Testing

We evaluate a k-nearest neighbors (kNN) classifier with leave-one-out cross-validation on scikit-learn's breast cancer dataset, and then use the Friedman test to check whether the accuracies of several algorithms come from the same distribution.

## Leave-One-Out Cross-Validation

We start by importing the libraries we need.

From `sklearn`, we use `datasets` to load the dataset, `metrics` to evaluate the model, `StandardScaler` to standardize the features, `KNeighborsClassifier` to build the model, and `LeaveOneOut` to perform cross-validation.

We also use `numpy` for miscellaneous operations, `pandas` to load a table of algorithm accuracies, and `friedmanchisquare` from `scipy.stats` to test whether the algorithm accuracies differ significantly.

In [1]:
import numpy as np
from sklearn import datasets, metrics
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut
import pandas as pd
from scipy.stats import friedmanchisquare
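
As a rough illustration of what `LeaveOneOut` does, the sketch below generates the same kind of splits in plain Python: each sample is held out once as the test set while all the others form the training set. The helper name `leave_one_out_splits` is made up for this example and is not part of scikit-learn.

```python
def leave_one_out_splits(n_samples):
    """Yield (train_indices, test_indices) pairs, holding out one sample per split."""
    for held_out in range(n_samples):
        train = [i for i in range(n_samples) if i != held_out]
        yield train, [held_out]


for train, test in leave_one_out_splits(4):
    print(train, test)
# [1, 2, 3] [0]
# [0, 2, 3] [1]
# [0, 1, 3] [2]
# [0, 1, 2] [3]
```

With 569 samples, this means 569 separate model fits, each tested on a single sample.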

In [2]:
breastCancer_ds = datasets.load_breast_cancer()

We start by exploring our data.

In [3]:
breastCancer_ds

{'data': array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
         1.189e-01],
        [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
         8.902e-02],
        [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
         8.758e-02],
        ...,
        [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
         7.820e-02],
        [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
         1.240e-01],
        [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
         7.039e-02]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
        1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
        1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
        1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0

The target takes only two values, so this is a binary classification problem, which is what we need.

In [4]:
print(breastCancer_ds.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

We standardize the features and extract the targets from the dataset.

In [5]:
features = StandardScaler().fit_transform(breastCancer_ds.data)
targets = breastCancer_ds.target
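
To make the standardization step concrete, here is a minimal sketch of the per-column computation, using only the first three mean-radius values visible in the printed array above. This is an illustration only: the notebook standardizes each of the 30 columns over all 569 rows, so the actual scaled values differ.

```python
from statistics import fmean, pstdev

# First three "mean radius" values from the printed data array (illustration only)
radius = [17.99, 20.57, 19.69]

mu = fmean(radius)
sigma = pstdev(radius)  # population standard deviation (ddof=0), as StandardScaler uses

# z-score: subtract the column mean, divide by the column standard deviation
z = [(x - mu) / sigma for x in radius]
print([round(v, 3) for v in z])
```

After this transformation the column has zero mean and unit variance, which keeps features with large numeric ranges (such as area) from dominating the distances that kNN relies on.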

We run leave-one-out cross-validation on a k-nearest neighbors model, collect the prediction for each held-out sample, and count the true positives, true negatives, false positives and false negatives.

In [6]:
# LeaveOneOut() is equivalent to KFold(n_splits=targets.shape[0])
cv = LeaveOneOut()

y_pred = []
for train_idx, test_idx in cv.split(features):
    x_train, x_test = features[train_idx], features[test_idx]
    y_train = targets[train_idx]

    # k-nearest neighbors with the default n_neighbors=5
    clf = KNeighborsClassifier(n_jobs=4)
    clf.fit(x_train, y_train)

    # Each test fold holds exactly one sample, so keep its single prediction
    y_hat = clf.predict(x_test)
    y_pred.append(y_hat[0])

y_pred = np.array(y_pred)

# Confusion-matrix counts, treating label 1 as the positive class
TP = sum(np.logical_and(targets==1, y_pred==1))
TN = sum(np.logical_and(targets==0, y_pred==0))
FP = sum(np.logical_and(targets==0, y_pred==1))
FN = sum(np.logical_and(targets==1, y_pred==0))

In [7]:
TP, TN, FP, FN

(354, 198, 14, 3)
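
Because the features were standardized once on the full dataset, every held-out sample contributes to the mean and variance used to scale its own training fold. A minimal sketch of an alternative, assuming the scaler should instead be refit within each fold, wraps both steps in a pipeline and lets `cross_val_predict` run the loop. The names `model`, `y_pred_pipe` and the lowercase counts are illustrative, and the resulting counts may differ slightly from those above.

```python
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_breast_cancer(return_X_y=True)

# The pipeline refits the scaler on each training fold, so the held-out
# sample never influences the scaling statistics.
model = make_pipeline(StandardScaler(), KNeighborsClassifier())
y_pred_pipe = cross_val_predict(model, X, y, cv=LeaveOneOut())

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y, y_pred_pipe).ravel()
print(tp, tn, fp, fn)
```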

What is the accuracy of our model?

In [8]:
# sum(targets==y_pred) / len(targets)
# (TP + TN) / (TP + TN + FP + FN)
metrics.accuracy_score(targets, y_pred)

0.9701230228471002
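
The four counts reported above also give the per-class rates. The sketch below derives them by hand; note that in scikit-learn's copy of this dataset label 1 is the benign class, so "positive" here means benign.

```python
TP, TN, FP, FN = 354, 198, 14, 3  # counts reported above

accuracy = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)  # true positive rate (recall)
specificity = TN / (TN + FP)  # true negative rate
precision = TP / (TP + FP)    # share of positive predictions that are correct

print(f"accuracy={accuracy:.4f} sensitivity={sensitivity:.4f} "
      f"specificity={specificity:.4f} precision={precision:.4f}")
# accuracy=0.9701 sensitivity=0.9916 specificity=0.9340 precision=0.9620
```

The accuracy matches the value returned by `metrics.accuracy_score`, while the lower specificity shows that most errors are malignant samples predicted as benign.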

## Friedman test

We will now perform the Friedman test on the accuracies of 5 algorithms to check whether they differ significantly.

We load our dataset.

In [9]:
algoPerformance_ds = pd.read_csv('algo_performance.csv')
algoPerformance_ds

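|   | C4.5 | 1-NN | NaiveBayes | Kernel | CN2 |
|---|------|------|------------|--------|-----|
| 0 | 0.219 | 0.202 | 0.249 | 0.165 | 0.261 |
| 1 | 0.803 | 0.75 | 0.813 | 0.692 | 0.798 |
| 2 | 0.859 | 0.814 | 0.845 | 0.542 | 0.816 |
| 3 | 0.809 | 0.774 | 0.673 | 0.275 | 0.785 |
| 4 | 0.768 | 0.79 | 0.727 | 0.872 | 0.706 |
| 5 | 0.759 | 0.654 | 0.734 | 0.703 | 0.714 |
| 6 | 0.693 | 0.611 | 0.572 | 0.689 | 0.572 |
| 7 | 0.915 | 0.857 | 0.86 | 0.7 | 0.777 |
| 8 | 0.544 | 0.531 | 0.558 | 0.439 | 0.541 |
| 9 | 0.855 | 0.796 | 0.857 | 0.607 | 0.809 |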


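Looking at the table of Friedman critical values, we locate the entry for an alpha level of 0.05 (95% confidence) and k equal to 5, the number of algorithms being compared (the columns of our table). The test has k - 1 = 4 degrees of freedom. The tabulated critical value is 8.96; this is the table entry for 5 blocks (rows), and for a large number of blocks the chi-square approximation with 4 degrees of freedom gives a slightly larger threshold of about 9.49.

Therefore, if the statistic exceeds the critical value, we reject the null hypothesis that the algorithm accuracies come from the same distribution.

Equivalently, we reject the null hypothesis if the p-value is smaller than alpha, which is 0.05.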

![Friedman test values](https://image.slidesharecdn.com/a-160509180623/95/friedmans-test-15-638.jpg)
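
The lookup table covers small samples. For larger ones, `friedmanchisquare` relies on the chi-square approximation with k - 1 degrees of freedom, which is 4 for our 5 algorithms. As a sketch, the corresponding critical values can be computed without a table: for 4 degrees of freedom the chi-square upper tail has the closed form exp(-x/2)(1 + x/2), and a simple bisection inverts it. The function names are made up for this example.

```python
import math


def chi2_sf_df4(x):
    """Upper-tail probability of a chi-square variable with 4 degrees of freedom."""
    return math.exp(-x / 2) * (1 + x / 2)


def critical_value(alpha, lo=0.0, hi=100.0):
    """Find x with chi2_sf_df4(x) == alpha by bisection (the tail is decreasing)."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if chi2_sf_df4(mid) > alpha:
            lo = mid  # tail still too heavy, move right
        else:
            hi = mid
    return (lo + hi) / 2


print(round(critical_value(0.05), 3))  # 9.488
print(round(critical_value(0.01), 3))  # 13.277
```

These thresholds are slightly more conservative than the small-sample table entries.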

In [10]:
friedmanchisquare(*(algoPerformance_ds[col] for col in algoPerformance_ds.columns))

FriedmanchisquareResult(statistic=39.91275167785245, pvalue=4.512033059024698e-08)

In this case, the statistic is about 40 and the p-value is far smaller than 0.05, so we reject the null hypothesis.

So, we conclude the algorithm accuracies are statistically different.
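
The reported p-value can be reproduced from the statistic alone. A short check, assuming the chi-square approximation with k - 1 = 4 degrees of freedom (5 algorithms), whose upper tail is exp(-x/2)(1 + x/2):

```python
import math

statistic = 39.91275167785245  # value reported by friedmanchisquare above

# Upper tail of a chi-square distribution with 4 degrees of freedom
p_value = math.exp(-statistic / 2) * (1 + statistic / 2)
print(p_value)  # approximately 4.51e-08
```

The agreement with the reported p-value confirms that the test compared 5 groups, one per algorithm.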

Even with a stricter alpha, for example 0.01 (99% confidence), we still reject the null hypothesis, since the statistic is larger than the corresponding critical value of 11.68 from the Friedman lookup table.
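
To show where the statistic comes from, here is a minimal sketch of the computation on a small, made-up table of accuracies (4 data sets by 3 algorithms; these numbers are hypothetical and unrelated to `algo_performance.csv`). Each row is ranked separately, the ranks are summed per column, and the sums are combined into the chi-square statistic. The helper names are illustrative, and unlike SciPy this version applies no correction for ties.

```python
def average_ranks(row):
    """Rank values in ascending order (1 = smallest); ties share the mean rank."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and row[order[end + 1]] == row[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks


def friedman_statistic(table):
    """Friedman chi-square for rows (blocks) x columns (treatments), no tie correction."""
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    return 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)


# Hypothetical accuracies: 4 data sets (rows) x 3 algorithms (columns)
toy = [
    [0.80, 0.70, 0.60],
    [0.90, 0.85, 0.70],
    [0.75, 0.60, 0.65],
    [0.88, 0.80, 0.79],
]
print(friedman_statistic(toy))  # 6.5
```

The more consistently one algorithm outranks the others across rows, the further the rank sums drift apart and the larger the statistic becomes.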