# Bootstrapping clustering procedure

We use the bootstrap to re-run the clustering many times (`niterations`, set to 100 below), each time on a different resample of the data. This gives a clearer picture of how stable the resulting cluster space is. For each iteration, the procedure is:
* Draw a random resample of the data, with replacement.
* Compute the clustering on the resample.
* Calculate the Jaccard coefficient between each original cluster and every new cluster, and record the highest value for each original cluster.

At the end, compute the mean of the Jaccard coefficients for each cluster. This procedure is similar to the `clusterboot()` function in the R package `fpc`: it assesses the stability of the clustering and indicates whether the clusters we find are meaningful.
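The matching step can be illustrated in isolation. The following is a minimal sketch, not the notebook's own code: the helper names `jaccard` and `best_match` and the subject IDs are hypothetical, and treating two empty sets as non-overlapping is an assumed convention.

```python
def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| between two collections of IDs."""
    a, b = set(a), set(b)
    union = a | b
    # Assumed convention: two empty sets score 0.0 instead of raising an error.
    return len(a & b) / len(union) if union else 0.0


def best_match(original_cluster, new_clusters):
    """Highest Jaccard coefficient between one original cluster and any new cluster."""
    return max(jaccard(original_cluster, new) for new in new_clusters)


# Hypothetical subject IDs, for illustration only.
original = ["s1", "s2", "s3", "s4"]
resampled = [["s1", "s2", "s3"], ["s4", "s5"], ["s6"]]

score = best_match(original, resampled)
print(score)  # 0.75: the first new cluster shares 3 of 4 distinct IDs
```

A cluster whose best match stays high across resamples is recovered consistently; one whose best match is often low is likely an artifact of the particular sample.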

In [3]:
import numpy as np
import simlr_ad
import pandas as pd
from utils.data_utils import load_all_data
from utils.utils import compute_simlr, feat_ranking

**Parameters**

In [4]:
# Parameters of the procedure
niterations = 100
clusters = 3
stab_limit = 0.5 # A cluster counts as dissolved in an iteration when its best Jaccard coefficient is below this limit
rd_seed = 1714                                          # Random seed for experiment replication

# Paths
existing_cluster = True                               # Compute the clustering again or use an existing one
cluster_path = "results/extendeddata_cluster/cluster_data.csv"   # Path of the existing cluster, if applicable
covariate_path = "data/useddata_homo_abeta_plasma_meta.csv"                 # Path of the covariate data frame (.csv)
feature_path = "data/UCSDVOL.csv"                     # Path of the feature data frame (.csv)

# Parameters of the cluster creation
config_file = "configs/config_base.ini"               # Configuration file for the clustering computation
output_directory_name = "bootstrap"

# Testing parameters


**Data loader**

In [5]:
covariate_data, cov_names, feature_data, feature_names = load_all_data(covariate_path, feature_path)

In [6]:
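if existing_cluster:
    # Load the existing cluster assignment (columns PTID and C are used below)
    c_data = pd.read_csv(cluster_path)
else:
    # Compute the base clustering on the full data set
    y_b, S, F, ydata, alpha = compute_simlr(
        np.array(covariate_data[cov_names]), clusters)
    # Build the table the main loop expects; this assumes y_b holds labels in
    # 1..clusters, in the same row order as covariate_data
    c_data = pd.DataFrame({"PTID": covariate_data.PTID.values, "C": y_b})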


In [7]:
## Test outlier detection
from sklearn import svm
clf = svm.OneClassSVM(kernel="rbf")
clf.fit(covariate_data[cov_names])
y_pred = clf.predict(covariate_data[cov_names])
n_error_outliers = y_pred[y_pred == -1].size
print(n_error_outliers)

149


### Main Loop

In [8]:
from sklearn.cluster import KMeans
# Number of times each cluster is dissolved (best Jaccard coefficient < stab_limit)
n_diss = np.zeros(clusters)
# Best Jaccard coefficient per cluster (rows) and iteration (columns)
j_coeff = np.zeros((clusters,niterations))
# Base labels
for i in range(niterations):
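    # Bootstrap resample: same size as the data, drawn with replacement.
    # Seeded per iteration so the resamples can be replicated.
    boot_data = covariate_data.sample(n=len(covariate_data), replace=True,
                                      random_state=rd_seed + i)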
    # Compute it
    y_it, S, F, ydata, alpha = compute_simlr(
       np.array(boot_data[cov_names]), clusters)
    # y_b = np.random.randint(1,clusters+1, size=len(boot_data))
    # km = KMeans(n_clusters=clusters, random_state = rd_seed).fit(covariate_data[cov_names])
    # y_b = km.labels_ + 1
    # Match each original cluster against the new clusters
    for c in range(1, clusters+1):
        # Members of original cluster c, restricted to the subjects (PTID)
        # that were drawn in this resample
        cond = (c_data.C.values == c)
        set_b = c_data[cond].PTID.values
        set_b = set_b[np.isin(set_b, boot_data.PTID.values)]
        max_js = 0.0
        for k in range(1, clusters+1):
            # Create new set of clusters
            cond = (y_it == k)
            set_it = boot_data[cond].PTID.values
            # set_it = set_it[np.in1d(set_it, boot_data.PTID.values)]
            # Jaccard coefficient between the original cluster and new cluster k
            inter = set(set_b) & set(set_it)
            union = set(set_b) | set(set_it)
            # Guard against an empty union (both clusters empty in this resample)
            js = len(inter) / len(union) if union else 0.0
            # If larger, get it
            if js > max_js:
                max_js = js
        # If it dissolves, we want to record it
        if max_js < stab_limit:
            n_diss[c-1] += 1
        # Save jaccard scores
        j_coeff[c-1,i] = max_js
    
print('Computation finished')
for c in range(1,clusters+1):
    print('Cluster ' + str(c) + ': ' + str(np.mean(j_coeff[c-1,:])) + " Jaccard score.")
    print("It got dissolved " + str(n_diss[c-1]) + ", " + str((n_diss[c-1]/niterations)* 100) + "% of the time.")


Computation finished
Cluster 1: 0.5511451669506124 Jaccard score.
It got dissolved 42.0, 42.0% of the time.
Cluster 2: 0.6020502050229517 Jaccard score.
It got dissolved 31.0, 31.0% of the time.
Cluster 3: 0.33888082055997193 Jaccard score.
It got dissolved 94.0, 94.0% of the time.
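
The per-cluster summary printed above can be sketched on its own for a single cluster. This is a minimal illustration under stated assumptions: the function name `summarize` and the example scores are hypothetical, not results from this experiment. It reports the median alongside the mean, since the two can differ when the scores are skewed.

```python
from statistics import mean, median


def summarize(scores, stab_limit=0.5):
    """Summarize one cluster's best Jaccard coefficients across iterations."""
    # An iteration counts as a dissolution when the score is strictly below the limit.
    n_dissolved = sum(s < stab_limit for s in scores)
    return {
        "mean": mean(scores),
        "median": median(scores),
        "dissolved_fraction": n_dissolved / len(scores),
    }


# Hypothetical scores for one cluster over five iterations.
scores = [0.75, 0.25, 0.5, 0.125, 0.875]
summary = summarize(scores)
print(summary)
```

Here a score exactly equal to `stab_limit` is not counted as dissolved, matching the strict `<` comparison used in the main loop.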
