# Evaluation of Representation Learning tools on a Clustering task

In this notebook we provide an example with K-means clustering.

Imports

In [None]:
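import os
import warnings
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from scipy.stats import chi2_contingency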

In [None]:
%run clustering_Functions.ipynb

Specify the path where the results will be saved

In [None]:
path_results = ...

## Pre-trained Representation Learning task and resulting embedding vectors

Load the learned patient representations here. They are the input data for the clustering task.

In [None]:
patient_representation = np.load(...)

In [None]:
patient_representation.shape

## Cross-Validation

### Gridsearch for the optimal number of clusters

Parameters

In [None]:
nb_folds = 10 # Number of folds for the Cross Validation
max_k = 20 # Maximal number of clusters to be tested

In [None]:
df_gridsearch_kmeans = gridsearch_cluster_hyperparametres(method='kmeans', data=patient_representation, nb_folds = nb_folds, max_clusters=max_k)

In [None]:
k_opt_kmeans = int(df_gridsearch_kmeans.iloc[np.argmax(df_gridsearch_kmeans.Silhouette_test)]['k'])
print('The optimal number of clusters is: ' + str(k_opt_kmeans))
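
The selection rule above (keep the `k` with the highest held-out silhouette) can be illustrated on synthetic data. The sketch below is not the implementation of `gridsearch_cluster_hyperparametres`, whose internals are defined in `clustering_Functions.ipynb`; `silhouette_gridsearch`, `toy_data`, `toy_grid` and `toy_k_opt` are hypothetical names, and the fold-wise procedure is an assumption about how such a search could work.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import KFold


def silhouette_gridsearch(data, nb_folds=5, max_clusters=6, seed=0):
    """Hypothetical helper: mean held-out silhouette for k = 2..max_clusters."""
    folds = KFold(n_splits=nb_folds, shuffle=True, random_state=seed)
    rows = []
    for k in range(2, max_clusters + 1):
        scores = []
        for train_idx, test_idx in folds.split(data):
            model = KMeans(n_clusters=k, n_init=10, random_state=seed)
            model.fit(data[train_idx])
            # Assign held-out points to the centroids learned on the training fold.
            test_labels = model.predict(data[test_idx])
            # The silhouette is undefined when only one cluster is represented.
            if len(np.unique(test_labels)) > 1:
                scores.append(silhouette_score(data[test_idx], test_labels))
        rows.append({"k": k, "Silhouette_test": np.mean(scores) if scores else np.nan})
    return pd.DataFrame(rows)


# Synthetic stand-in for the patient representations: 3 well-separated groups.
toy_data, _ = make_blobs(n_samples=300, centers=[[-10, -10], [0, 10], [10, -10]],
                         cluster_std=1.0, random_state=0)
toy_grid = silhouette_gridsearch(toy_data)
toy_k_opt = int(toy_grid.loc[toy_grid["Silhouette_test"].idxmax(), "k"])
print(toy_grid)
print("Selected k:", toy_k_opt)
```

Scoring on the held-out fold rather than on the training fold favours a `k` whose centroids generalize to unseen samples.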

### Train & Evaluate

In [None]:
nb_folds = 10

In [None]:
dfTrain_kmeans, dfTest_kmeans = train_cluster_CV(method='kmeans', data=patient_representation, k=k_opt_kmeans, nb_folds = nb_folds)
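
For readers without access to `clustering_Functions.ipynb`, the sketch below shows one way a cross-validated evaluation with a fixed `k` could produce one row of metrics per fold for the training and held-out parts. It is an assumption, not the actual `train_cluster_CV`: the helper name `kmeans_cv_metrics`, the toy data and the choice of metrics (silhouette and Davies-Bouldin) are all hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.model_selection import KFold


def kmeans_cv_metrics(data, k, nb_folds=5, seed=0):
    """Hypothetical helper: per-fold metrics on the training and held-out parts."""
    folds = KFold(n_splits=nb_folds, shuffle=True, random_state=seed)
    train_rows, test_rows = [], []
    for train_idx, test_idx in folds.split(data):
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data[train_idx])
        for idx, rows in ((train_idx, train_rows), (test_idx, test_rows)):
            # Labels come from the centroids fitted on the training fold only.
            labels = model.predict(data[idx])
            rows.append({
                "Silhouette": silhouette_score(data[idx], labels),
                "Davies_Bouldin": davies_bouldin_score(data[idx], labels),
            })
    return pd.DataFrame(train_rows), pd.DataFrame(test_rows)


# Synthetic stand-in for the patient representations: 2 well-separated groups.
toy_data, _ = make_blobs(n_samples=200, centers=[[-8, 0], [8, 0]],
                         cluster_std=1.0, random_state=0)
toy_train, toy_test = kmeans_cv_metrics(toy_data, k=2)
print(toy_train.mean())
print(toy_test.mean())
```

A large gap between the training and held-out averages would suggest that the clusters depend on the particular samples used for fitting.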

- Evaluation metrics on the training set

In [None]:
dfTrain_kmeans

Averaged results

In [None]:
dfTrain_kmeans.mean()

- Evaluation metrics on the validation set

In [None]:
dfTest_kmeans

Averaged results

In [None]:
dfTest_kmeans.mean()

- Final table containing the mean and standard deviation of each metric on both the training and test samples

In [None]:
# One row per metric, with consistently named mean/std columns for each sample
Eval_kmeans = pd.DataFrame(index=dfTrain_kmeans.columns.tolist(),
                           data={'Train_Mean': dfTrain_kmeans.mean(), 'Train_Std': dfTrain_kmeans.std(),
                                 'Test_Mean': dfTest_kmeans.mean(), 'Test_Std': dfTest_kmeans.std()})
print(Eval_kmeans)

## Compute k-means on the whole dataset

In [None]:
kmeans = KMeans(n_clusters=k_opt_kmeans, random_state=42)  # fixed seed so that the cluster labels are reproducible
kmeans.fit(patient_representation)
kmeans_clusters = kmeans.predict(patient_representation)

## Chi-squared validation

Load the attributes on which the statistical test will be computed

In [None]:
data_attributes = ... # The dataframe of attributes per sample
data_attributes['Clusters'] = kmeans_clusters

Compute the Chi-squared test of independence between the clusters and each attribute on 5 randomly generated sub-samples (80% of the data each)

In [None]:
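kmeans_CS_p_values = pd.DataFrame()
attribute_columns = data_attributes.columns.drop('Clusters')  # do not test the clusters against themselves
for i in range(5):
    # Each iteration draws a random 80% sub-sample; the remaining 20% is not used
    X_train, _ = train_test_split(data_attributes, test_size=0.2)
    for var in attribute_columns:
        contingency_table = pd.crosstab(X_train['Clusters'], X_train[var])
        chi2_stat, p_value, _, _ = chi2_contingency(contingency_table)
        kmeans_CS_p_values.loc[var, i] = round(p_value, 4)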

Averaged p-values

In [None]:
kmeans_CS_p_values.mean(axis=1)

Standard deviation of the p-values

In [None]:
kmeans_CS_p_values.std(axis=1)
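
To help read the p-values of this section, here is a small sketch of the same test on a made-up table. The data frame `toy` and its columns `linked` and `unrelated` are hypothetical: one attribute is constructed to depend strongly on the cluster, the other to be distributed identically in both clusters.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up data: 100 samples split evenly between two clusters.
toy = pd.DataFrame({
    "Clusters": [0] * 50 + [1] * 50,
    # 'linked' takes mostly value "a" in cluster 0 and mostly "b" in cluster 1.
    "linked": ["a"] * 45 + ["b"] * 5 + ["a"] * 5 + ["b"] * 45,
    # 'unrelated' has the same 25/25 split in both clusters.
    "unrelated": (["a"] * 25 + ["b"] * 25) * 2,
})

toy_p_values = {}
for var in ["linked", "unrelated"]:
    table = pd.crosstab(toy["Clusters"], toy[var])
    _, p_value, _, _ = chi2_contingency(table)
    toy_p_values[var] = p_value

print(toy_p_values)
```

A small p-value (as for `linked`) indicates that the attribute is not distributed independently of the clusters, whereas a p-value close to 1 (as for `unrelated`) gives no evidence of an association.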

## PCA evaluation

### Compute PCA on the whole dataset

In [None]:
pca = PCA(n_components=2)
X_PCA = pca.fit_transform(patient_representation)
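
Before interpreting a 2D projection, it can be useful to check how much of the variance the two components retain. The sketch below does so with the `explained_variance_ratio_` attribute of scikit-learn's `PCA`, on synthetic data; `toy_embeddings`, `toy_pca` and `toy_projection` are hypothetical names standing in for the objects of this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 samples, 10 features, variance concentrated in 2 directions.
latent = rng.normal(size=(200, 2)) * [5.0, 3.0]
toy_embeddings = latent @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(200, 10))

toy_pca = PCA(n_components=2)
toy_projection = toy_pca.fit_transform(toy_embeddings)

# Fraction of the total variance carried by each principal component.
explained = toy_pca.explained_variance_ratio_
print("Variance explained per component:", explained.round(3))
print("Total variance retained:", explained.sum().round(3))
```

If the retained variance is low, clusters that overlap in the plot may still be well separated in the original embedding space.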

### PCA according to clusters

Remark: the following cell assumes exactly 2 clusters and has to be adapted if there are more.

In [None]:
plt.figure(figsize=(4,4))
plt.scatter(X_PCA[kmeans_clusters == 0, 0], X_PCA[kmeans_clusters == 0, 1], color='deepskyblue', alpha=0.6, label='Cluster 1')
plt.scatter(X_PCA[kmeans_clusters == 1, 0], X_PCA[kmeans_clusters == 1, 1], color='gold', marker='s', alpha=0.6, label='Cluster 2')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA on K-means clusters')
plt.legend(['Cluster 1', 'Cluster 2'])
#plt.savefig('PCA_kmeans.PNG')
plt.show()
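
One possible adaptation to more than 2 clusters is to loop over the cluster labels instead of writing one `plt.scatter` call per cluster. The sketch below is only a suggestion: `plot_clusters` is a hypothetical helper, the `tab10` colour map is an arbitrary choice, and the random points stand in for `X_PCA` and `kmeans_clusters`.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_clusters(points_2d, labels, title):
    """Hypothetical helper: one scatter layer per cluster label."""
    fig, ax = plt.subplots(figsize=(4, 4))
    cluster_ids = np.unique(labels)
    cmap = plt.get_cmap("tab10")  # 10 distinct colours, reused cyclically beyond that
    for position, cluster_id in enumerate(cluster_ids):
        mask = labels == cluster_id
        ax.scatter(points_2d[mask, 0], points_2d[mask, 1],
                   color=cmap(position % 10), alpha=0.6,
                   label=f"Cluster {cluster_id + 1}")
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    ax.set_title(title)
    ax.legend()
    return ax


# Random stand-in for a 2D projection with 3 clusters of 30 points each.
rng = np.random.default_rng(0)
toy_points = rng.normal(size=(90, 2))
toy_labels = np.repeat([0, 1, 2], 30)
toy_ax = plot_clusters(toy_points, toy_labels, "PCA on K-means clusters")
```

In the notebook, the call would become `plot_clusters(X_PCA, kmeans_clusters, 'PCA on K-means clusters')`, followed by `plt.show()` if the figure is not displayed automatically.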

### PCA with respect to the attributes loaded above

In [None]:
principal_df = pd.DataFrame(data=X_PCA, columns=['PC1', 'PC2'])
principal_df['Cluster'] = kmeans_clusters
df_pca = pd.merge(principal_df, data_attributes, on = ...)
columns_list = ... # List of the columns of data_attributes

The following function draws the PCA visualizations for each attribute: for every value taken by the attribute under consideration, it displays a plot that highlights the samples with this value according to the cluster they belong to.

This function supports at most 5 clusters.

In [None]:
PCA_caracteristic(df_pca, columns_list)

## t-SNE evaluation

### Compute t-SNE on the whole dataset

In [None]:
tsne = TSNE(n_components=2, random_state=42)
X_TSNE = tsne.fit_transform(patient_representation)

### t-SNE according to clusters

Remark: the following cell assumes exactly 2 clusters and has to be adapted if there are more.

In [None]:
plt.figure(figsize=(4,4))
plt.scatter(X_TSNE[kmeans_clusters == 0, 0], X_TSNE[kmeans_clusters == 0, 1], color='deepskyblue', alpha=0.6, label='Cluster 1')
plt.scatter(X_TSNE[kmeans_clusters == 1, 0], X_TSNE[kmeans_clusters == 1, 1], color='gold', marker='s', alpha=0.6, label='Cluster 2')
plt.xlabel('t-SNE 1')  # t-SNE axes are embedding dimensions, not principal components
plt.ylabel('t-SNE 2')
plt.title('t-SNE on K-means clusters')
plt.legend(['Cluster 1', 'Cluster 2'])
#plt.savefig('TSNE_kmeans.PNG')
plt.show()

### t-SNE with respect to the attributes loaded above

In [None]:
principal_df = pd.DataFrame(data=X_TSNE, columns=['PC1', 'PC2'])
principal_df['Cluster'] = kmeans_clusters
df_tsne = pd.merge(principal_df, data_attributes, on = ...)
columns_list = ... # List of the columns of data_attributes

The following function draws the t-SNE visualizations for each attribute: for every value taken by the attribute under consideration, it displays a plot that highlights the samples with this value according to the cluster they belong to.

This function supports at most 5 clusters.

In [None]:
PCA_caracteristic(df_tsne, columns_list, path='CS', loc_legend2='lower right')