## Leave-one-out classification
This notebook runs a leave-one-out classification within each cluster to check whether the discriminative performance within each cluster is better than the classification performance on the whole dataset.

The process goes like this: given a fixed clustering, we run leave-one-out classification tests within each cluster, for the following problems and with the following classifiers:
* The AD/MCI, AD/CN and MCI/CN tasks (MCI is labelled `LMCI` in the data).
* Linear SVM, decision tree, Gaussian naive Bayes and RBF SVM.

We use leave-one-out validation because other schemes, such as 10-fold CV, would not work: some of the clusters/problems do not have enough data.
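
To make the splitting scheme concrete, here is a minimal sketch on a hypothetical 6-subject array (not the study data). It shows that `LeaveOneOut` yields one split per sample, each holding out a single subject for testing and training on the rest.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Hypothetical toy data: 6 subjects, 2 features each
X_toy = np.arange(12).reshape(6, 2)

loo = LeaveOneOut()
n_splits = loo.get_n_splits(X_toy)

# Collect the size of every train/test partition
train_sizes = [len(train_idx) for train_idx, _ in loo.split(X_toy)]
test_sizes = [len(test_idx) for _, test_idx in loo.split(X_toy)]

print(n_splits)      # one split per subject
print(train_sizes)   # every split trains on all but one subject
print(test_sizes)    # every split tests on exactly one subject
```

Because every split keeps all but one sample for training, the scheme stays usable even when a cluster contains only a few dozen subjects.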

The resulting accuracy is of limited use, because the sample sizes of the labels are very disparate. The scores need to be weighted in some way.
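
One possible way to account for the imbalance is sketched below: compare accuracy against the majority-class baseline, and compute balanced accuracy over the pooled leave-one-out predictions. The class sizes (26 vs 124) mirror the AD vs LMCI counts printed for cluster 1. The features are synthetic and the choice of `GaussianNB` is an assumption made only for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic stand-in data: 26 samples of class 0, 124 of class 1
y = np.array([0] * 26 + [1] * 124)
X = rng.normal(size=(len(y), 5)) + 0.5 * y[:, None]

# Accuracy obtained by always predicting the most frequent label
majority_baseline = np.bincount(y).max() / len(y)

# Pool the single-sample predictions of all leave-one-out splits
y_pred = cross_val_predict(GaussianNB(), X, y, cv=LeaveOneOut())

acc = accuracy_score(y, y_pred)
bal_acc = balanced_accuracy_score(y, y_pred)  # mean of the per-class recalls

print('Majority baseline: ' + str(majority_baseline))
print('Accuracy: ' + str(acc))
print('Balanced accuracy: ' + str(bal_acc))
```

With this split, always predicting the majority label already reaches 124/150 ≈ 0.83 accuracy, so raw accuracies below that figure say little on their own. Balanced accuracy is computed on the pooled predictions because a single held-out sample cannot give a meaningful per-split class-balanced score.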

In [13]:
# Include and load packages, config files

import numpy as np
import simlr_ad
import pandas as pd
from utils.data_utils import load_all_data
from utils.utils import compute_simlr, feat_ranking

# Parameters of the procedure
clusters = 3
rd_seed = 1714                                          # Random seed for experiment replication

# Paths
existing_cluster = True                               # Compute the clustering again or use an existing one
cluster_path = "results/base_cluster/cluster_data.csv"   # Path of the existing cluster, if applicable
covariate_path = "data/full_data.csv"                 # Path of the covariate data frame (.csv)
feature_path = "data/UCSDVOL.csv"                     # Path of the feature data frame (.csv)

covariate_data, cov_names, feature_data, feature_names = load_all_data(covariate_path, feature_path)
feature_data['DX'] = covariate_data.DX_bl.values

if existing_cluster:
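    # Load the existing clustering
    c_data = pd.read_csv(cluster_path)
else:
    # Compute base clustering
    y_b, S, F, ydata, alpha = compute_simlr(
        np.array(covariate_data[cov_names]), clusters)
    # Assumption: y_b holds one cluster label per subject, numbered from 1 as in
    # the saved CSV; the classification loop reads the labels from column C
    c_data = pd.DataFrame({'C': y_b})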



We need two nested loops:
* over each cluster,
* over each classification problem within the cluster: AD/MCI, AD/CN and MCI/CN.

In each iteration, we run a leave-one-out classification with each of the classifiers:
* linear SVM
* decision tree
* Gaussian naive Bayes
* RBF SVM
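
The loop structure described above can be sketched compactly by keeping the classifiers in a dictionary and collecting the scores in a data frame instead of printing them. This is only an illustrative sketch: the toy data, the column names `f0` to `f3` and the `results` table are hypothetical, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
feature_cols = ['f0', 'f1', 'f2', 'f3']

# Hypothetical toy data: 3 clusters x 3 diagnoses x 6 subjects
rows = []
for c in (1, 2, 3):
    for shift, dx in enumerate(['AD', 'LMCI', 'CN']):
        for _ in range(6):
            feats = {name: rng.normal(loc=shift) for name in feature_cols}
            rows.append({'C': c, 'DX': dx, **feats})
toy = pd.DataFrame(rows)

problems = [('AD', 'LMCI'), ('CN', 'AD'), ('LMCI', 'CN')]
classifiers = {
    'linear SVM': svm.LinearSVC(class_weight='balanced'),
    'decision tree': DecisionTreeClassifier(max_depth=5, class_weight='balanced',
                                            random_state=0),
    'naive Bayes': GaussianNB(),
    'RBF SVM': svm.SVC(class_weight='balanced'),
}

records = []
for c in sorted(toy['C'].unique()):
    in_cluster = toy[toy['C'] == c]
    for first, second in problems:
        # Keep only the two diagnoses of the current problem
        subset = in_cluster[in_cluster['DX'].isin([first, second])]
        X = subset[feature_cols].to_numpy()
        y = (subset['DX'] == second).to_numpy().astype(int)
        for name, clf in classifiers.items():
            # cross_val_score clones the estimator, so instances can be reused
            scores = cross_val_score(clf, X, y, cv=LeaveOneOut(), scoring='accuracy')
            records.append({'cluster': int(c), 'problem': first + ' vs ' + second,
                            'classifier': name, 'accuracy': scores.mean()})

results = pd.DataFrame(records)
print(results)
```

Collecting the scores in one table makes it easier to compare clusters side by side, for example with `results.pivot_table(index=['cluster', 'problem'], columns='classifier', values='accuracy')`.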

In [45]:
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
for c in range(1,clusters+1):
    # Select data clusters
    data_c = feature_data[c_data.C.values == c]
    probs = [('AD', 'LMCI'), ('CN', 'AD'), ('LMCI', 'CN')]
    for p in probs:
        print('Results for cluster: ' + str(c))
        print('For the classification problem of ' + p[0] + ' vs ' + p[1])
        # For each problem
        x_1 = data_c[data_c.DX.values == p[0]]
        x_2 = data_c[data_c.DX.values == p[1]]
        x_1 = x_1[feature_names].values.tolist()
        x_2 = x_2[feature_names].values.tolist()
        X = x_1 + x_2
        print(p[0] + ' samples: ' + str(len(x_1)))
        print(p[1] + ' samples: ' + str(len(x_2)))
        Y = np.concatenate((np.zeros(len(x_1), dtype=np.float64), np.ones(len(x_2), dtype=np.float64)))
        loo = LeaveOneOut()
        splits = loo.get_n_splits(X)
        
        # Linear SVM
        clf = svm.LinearSVC(class_weight='balanced')
        sc1 = cross_val_score(clf, X, Y, cv=loo, scoring='accuracy')
        print('Accuracy of linear SVM: ' + str(np.average(sc1)))
        
        # Decision tree
        clf = DecisionTreeClassifier(max_depth=5, class_weight='balanced')
        sc2 = cross_val_score(clf, X, Y, cv=loo, scoring='accuracy')
        print('Accuracy of DecisionTreeClassifier: ' + str(np.average(sc2)))

        # Naive Bayes
        clf = GaussianNB()
        sc3 = cross_val_score(clf, X, Y, cv=loo, scoring='accuracy')
        print('Accuracy of Naive Bayes: ' + str(np.average(sc3)))

        # RBF SVM
        clf = svm.SVC(class_weight='balanced')
        sc4 = cross_val_score(clf, X, Y, cv=loo, scoring='accuracy')
        print('Accuracy of RBF SVM: ' + str(np.average(sc4)))


Results for cluster: 1
For the classification problem of AD vs LMCI
AD samples: 26
LMCI samples: 124
Accuracy of linear SVM: 0.7266666666666667
Accuracy of DecisionTreeClassifier: 0.72
Accuracy of Naive Bayes: 0.7866666666666666
Accuracy of RBF SVM: 0.74
Results for cluster: 1
For the classification problem of CN vs AD
CN samples: 19
AD samples: 26
Accuracy of linear SVM: 0.8444444444444444
Accuracy of DecisionTreeClassifier: 0.7111111111111111
Accuracy of Naive Bayes: 0.8222222222222222
Accuracy of RBF SVM: 0.8444444444444444
Results for cluster: 1
For the classification problem of LMCI vs CN
LMCI samples: 124
CN samples: 19
Accuracy of linear SVM: 0.6853146853146853
Accuracy of DecisionTreeClassifier: 0.7552447552447552
Accuracy of Naive Bayes: 0.6573426573426573
Accuracy of RBF SVM: 0.6293706293706294
Results for cluster: 2
For the classification problem of AD vs LMCI
AD samples: 34
LMCI samples: 53
Accuracy of linear SVM: 0.5402298850574713
Accuracy of DecisionTreeClassifier: 0.586