# Yacine Mahdid May 18 2020
This notebook directly addresses [issue number 28](https://github.com/BIAPT/eeg-pain-detection/issues/28).
We need to develop the permutation test module to check whether the classifier performs better than chance.

To do so I will:
- [X] take out the parts of `ex_14` that are directly relevant to the classification and put them here
- [X] refactor them so that they are easier to work with
- [X] build the permutation module
- [X] apply it to our current classification and check the result

To read:
- [X] [Pipeline article](https://towardsdatascience.com/a-simple-example-of-pipeline-in-machine-learning-with-scikit-learn-e726ffbb6976)

In [None]:
import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.model_selection import permutation_test_score

# Helper dict
POPULATION_ID = {
    "MSK": 0,
    "HEALTHY": 1,
    "BOTH": 2
}


def pre_process(input_filename, population_id="BOTH"):
    """Load, reshape and clean up the data frame so that it is amenable to machine learning

        Args:
            input_filename (string): path to the data, which should be a CSV file
            population_id (string): population to keep, one of MSK, HEALTHY or BOTH

        Returns:
            X: the feature matrix for this data
            y: the label vector for the data, which in this analysis is 0 or 1
            group: the group vector for the data, which tells which participant each row belongs to; used for
            Leave-One-Subject-Out (LOSO) cross validation
    """

    # Read the CSV
    df = pd.read_csv(input_filename)

    # Get the right population
    if population_id != "BOTH":
        df = df[df.type == POPULATION_ID[population_id]]

    # A spurious 'Unnamed: 22' column appears in the data frame, so we drop it
    df = df.drop(['Unnamed: 22'], axis=1)

    # Extract the right information for ml part
    X = df.drop(['id', 'type', 'is_hot'], axis=1).to_numpy()
    y = df.is_hot.to_numpy()
    group = df.id.to_numpy()

    return X, y, group

def classify_loso(X, y, group, clf):
    """ Main classification function to train and test an ml model with Leave-One-Subject-Out (LOSO)

        Args:
            X (numpy matrix): the feature matrix, one row per data point
            y (numpy vector): the label vector, one entry per data point
            group (numpy vector): the group vector (the participant id of each row)
            clf (sklearn classifier): an sklearn estimator (or Pipeline) exposing fit and predict

        Returns:
            accuracies (list): the accuracy for each left-out participant
    """
    logo = LeaveOneGroupOut()

    accuracies = []
    for train_index, test_index in logo.split(X, y, group):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        clf.fit(X_train, y_train)
        y_hat = clf.predict(X_test)

        accuracy = accuracy_score(y_test, y_hat)
        accuracies.append(accuracy)
    return accuracies


def permutation_test(X, y, group, clf, num_permutation=1000, random_state=0):
    """ Helper function to validate that a classifier is performing higher than chance

        Args:
            X (numpy matrix): the feature matrix, one row per data point
            y (numpy vector): the label vector, one entry per data point
            group (numpy vector): the group vector (the participant id of each row)
            clf (sklearn classifier): an sklearn estimator (or Pipeline) exposing fit and predict
            num_permutation (int): the number of times to permute y
            random_state (int): seed used for reproducible output (0 is the sklearn default)

        Returns:
            accuracy (float): the cross-validated score obtained with the true labels
            permutation_scores (numpy vector): the score obtained for each permutation of y
            p_value (float): (C + 1) / (num_permutation + 1), where C is the number of
            permutations scoring at least as high as the true labels

    """

    # Pass the splitter itself: permutation_test_score calls its split method with `groups`
    logo = LeaveOneGroupOut()
    (accuracy, permutation_scores, p_value) = permutation_test_score(clf, X, y, groups=group, cv=logo,
                                                                     n_permutations=num_permutation,
                                                                     random_state=random_state,
                                                                     verbose=num_permutation, n_jobs=-1)
    return accuracy, permutation_scores, p_value


input_filename = '/home/yacine/Documents/BIAPT/data_window_10.csv'

pipe = Pipeline([
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler()),
    ('SVM', SVC())])

X,y,group = pre_process(input_filename, "MSK")
accuracies = classify_loso(X, y, group, pipe)
np.mean(accuracies)

The analysis is refactored to rely more heavily on sklearn concepts (pipelines, cross-validation splitters and permutation tests). It flows like this:
1. preprocessing of the data using `pre_process`
2. classification using LOSO with `classify_loso`
3. permutation test to validate the classifier with `permutation_test`
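
To make the LOSO step concrete, here is a minimal sketch of how `LeaveOneGroupOut` builds one fold per participant. The toy arrays (`X_toy`, `y_toy`, `group_toy`) are made up for illustration and are not part of the dataset.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Toy data (hypothetical): 6 rows belonging to 3 participants
X_toy = np.arange(12).reshape(6, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1])
group_toy = np.array([1, 1, 2, 2, 3, 3])

logo = LeaveOneGroupOut()
folds = list(logo.split(X_toy, y_toy, group_toy))

# Each fold holds out every row of exactly one participant
for train_index, test_index in folds:
    held_out = np.unique(group_toy[test_index])
    print("held-out participant:", held_out,
          "train rows:", train_index, "test rows:", test_index)
```

Because every row of a participant lands in the same test fold, the classifier is always evaluated on a participant it has never seen during training.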

In [5]:
(accuracy, permutation_scores, p_value) = permutation_test(X, y, group, pipe, num_permutation=1000)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
Pickling array (shape=(1079, 19), dtype=float64).
Pickling array (shape=(1079,), dtype=int64).
Pickling array (shape=(1079,), dtype=int64).
Pickling array (shape=(1070,), dtype=int64).
Pickling array (shape=(9,), dtype=int64).
Pickling array (shape=(1067,), dtype=int64).
Pickling array (shape=(12,), dtype=int64).
Pickling array (shape=(1066,), dtype=int64).
Pickling array (shape=(13,), dtype=int64).
Pickling array (shape=(1061,), dtype=int64).
Pickling array (shape=(18,), dtype=int64).
Pickling array (shape=(1062,), dtype=int64).
Pickling array (shape=(17,), dtype=int64).
Pickling array (shape=(1068,), dtype=int64).
Pickling array (shape=(11,), dtype=int64).
Pickling array (shape=(1065,), dtype=int64).
Pickling array (shape=(14,), dtype=int64).
Pickling array (shape=(1059,), dtype=int64).
Pickling array (shape=(20,), dtype=int64).
Pickling array (shape=(1057,), dtype=int64).
Pickling array (shape=(22,), dtype=i

KeyboardInterrupt: 

In [45]:
print("Accuracy: ", accuracy)
print("Permutation Score: ", np.mean(permutation_scores))
print("P value: ", p_value)

Accuracy:  0.5509481738477935
Permutation Score:  0.4916073266806892
P value:  0.000999000999000999
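
The p-value above equals 1/1001. As far as I understand the scikit-learn implementation, `permutation_test_score` computes the p-value as (C + 1) / (n_permutations + 1), where C is the number of permutations whose score is at least as high as the score on the true labels. With 1000 permutations, 1/1001 is therefore the smallest attainable value, which means none of the permuted runs reached the observed accuracy. Below is a minimal sketch of that computation; `permutation_p_value` is a hypothetical helper name and the permuted scores are made-up numbers, not the ones from this run.

```python
def permutation_p_value(score, permutation_scores):
    """Share of permutations scoring at least as high as `score`, with +1 smoothing."""
    count = sum(1 for s in permutation_scores if s >= score)
    return (count + 1) / (len(permutation_scores) + 1)


# Made-up permuted scores: 1000 values, all below the observed accuracy
fake_scores = [0.49] * 1000
p_min = permutation_p_value(0.55, fake_scores)
print(p_min)
```

To resolve smaller p-values, the number of permutations has to go up, since the floor is 1 / (n_permutations + 1).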
