# Hands-on session: Clustering methods
In this example, we will apply several clustering techniques to dimensionality-reduced (PCA) neuronal data: k-means clustering, agglomerative clustering, and DBSCAN.

This exercise refers to [Chapter 3 "Clustering methods"](https://www.fabriziomusacchio.com/teaching/teaching_dimensionality_reduction_in_neuroscience/03_clustering) of the "[Dimensionality reduction in neuroscience](https://www.fabriziomusacchio.com/teaching/teaching_dimensionality_reduction_in_neuroscience/)" course (tutor: Fabrizio Musacchio, Oct 17, 2024).

## Acknowledgements
The dataset is from the 2023 course 'data analysis techniques in neuroscience' by the Chen Institute for Neuroscience at Caltech:

<https://github.com/cheninstitutecaltech/Caltech_DATASAI_Neuroscience_23>

and originally from the paper:

Remedios, R., Kennedy, A., Zelikowsky, M. et al. Social behaviour shapes hypothalamic neural ensemble representations of conspecific sex. Nature 550, 388–392 (2017). <https://doi.org/10.1038/nature23885>

## Dataset
We will work with the same calcium imaging data from the previous exercise (PCA). For details, please refer to the previous exercise.

## Environment setup
For reproducibility:

```bash
conda create -n dimredcution python=3.11 mamba -y
conda activate dimredcution
mamba install ipykernel matplotlib numpy scipy scikit-learn -y
```

We begin by loading the necessary libraries:

In [1]:
# %% IMPORTS
import os
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import time

from mpl_toolkits import mplot3d
from scipy.io import loadmat
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_samples, silhouette_score
from scipy.cluster.hierarchy import dendrogram, linkage


# set global properties for all plots:
plt.rcParams.update({'font.size': 14})
plt.rcParams["axes.spines.top"]    = False
plt.rcParams["axes.spines.bottom"] = False
plt.rcParams["axes.spines.left"]   = False
plt.rcParams["axes.spines.right"]  = False

Next, we define the path to the data. If you are running this script in a Google Colab environment, you need to upload the data file `hypothalamus_calcium_imaging_remedios_et_al.mat` from the GitHub repository to your Google Drive; please follow the further instructions [here](https://www.fabriziomusacchio.com/blog/2023-03-23-google_colab_file_access/).

In [2]:
DATA_PATH = '../data/'
DATA_FILENAME = 'hypothalamus_calcium_imaging_remedios_et_al.mat'
DATA_FILE = os.path.join(DATA_PATH, DATA_FILENAME)

RESULTSPATH = '../results/'
# create the results folder if it does not exist yet:
os.makedirs(RESULTSPATH, exist_ok=True)

We define a helper function that calculates the silhouette score for a given clustering result and draws the corresponding silhouette plot:

In [3]:
# define a helper function that computes the mean silhouette score of a fitted clustering model and draws the silhouette plot:
def plot_silhouette_score(fitted_model, PCA_model_S3_fit, method='kmeans'):
    
    # get the cluster labels:
    labels = fitted_model.labels_
    # labels = agglo_fit.labels_
    # labels = kmeans_fit.labels_
    n_clusters = len(np.unique(labels))

    # compute the silhouette scores for each sample:
    silhouette_vals = silhouette_samples(PCA_model_S3_fit, labels)
    """ 
    With silhouette_samples, we can compute the silhouette coefficient for each sample. The
    silhouette coefficient measures how similar a sample is to its own cluster (cohesion) compared
    to other clusters (separation). It ranges from -1 to 1, where a high value indicates that the
    sample is well matched to its own cluster and poorly matched to neighboring clusters.

    Clustering models with a high silhouette coefficient are said to be dense, where samples in
    the same cluster are similar to each other, and well separated, where samples in different
    clusters are not very similar to each other.

    The silhouette coefficient is calculated using the mean intra-cluster distance (a) and the
    mean nearest-cluster distance (b) for each sample. The silhouette coefficient for a sample is

            (b - a) / max(a, b).

    To clarify, b is the mean distance between a sample and the members of the nearest cluster
    that the sample is not a part of. Note that the silhouette coefficient is only defined if the
    number of labels satisfies

            2 <= n_labels <= n_samples - 1.

    """

    # compute the mean silhouette score:
    silhouette_avg = silhouette_score(PCA_model_S3_fit, labels)
    print(f"Mean silhouette score: {silhouette_avg}")
    """ 
    silhouette_score returns the mean Silhouette Coefficient over all samples.
    """

    # Create a silhouette plot
    fig, ax1 = plt.subplots(1, 1)
    fig.set_size_inches(7, 6)

    ax1.set_xlim([-0.1, 1]) # the silhouette coefficient can range from -1 to 1, but only [-0.1, 1] is shown here
    ax1.set_ylim([0, len(PCA_model_S3_fit) + (n_clusters + 1) * 10])
    # the (n_clusters+1)*10 is for inserting blank space between silhouette plots of individual clusters

    y_lower = 10
    for i in range(n_clusters):
        # aggregate the silhouette scores for samples belonging to cluster i, and sort them:
        ith_cluster_silhouette_vals = silhouette_vals[labels == i]
        ith_cluster_silhouette_vals.sort()

        size_cluster_i = ith_cluster_silhouette_vals.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.viridis(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                        0, ith_cluster_silhouette_vals,
                        facecolor=color, edgecolor=color, alpha=1.0)

        # label the silhouette plots with their cluster numbers at the middle:
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # compute the new y_lower for next plot:
        y_lower = y_upper + 10  # leave a gap of 10 rows before the next cluster

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # plot a vertical line for the average silhouette score of all the values:
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks(np.arange(-0.1, 1.1, 0.2))

    plt.tight_layout()
    plt.savefig(os.path.join(RESULTSPATH, f'{method} silhouette_plot ({n_clusters} clusters).png'), dpi=300)
    plt.show()
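
To make the formula `(b - a) / max(a, b)` from the helper function concrete, the following minimal sketch computes the silhouette coefficient of each sample by hand and compares the result with `silhouette_samples`. The toy points and the helper name `manual_silhouette` are illustrative assumptions and not part of the course material.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def manual_silhouette(X, labels):
    """Silhouette coefficient (b - a) / max(a, b) for every sample (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # pairwise Euclidean distances between all samples:
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # by convention, a singleton cluster gets a score of 0
        # a: mean distance to the other members of the sample's own cluster
        a = dists[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# toy data (assumption): two well-separated groups on a line
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels_toy = np.array([0, 0, 0, 1, 1, 1])

manual = manual_silhouette(X_toy, labels_toy)
reference = silhouette_samples(X_toy, labels_toy)
print(np.round(manual, 3))
print(np.allclose(manual, reference))
```

For the sample at position 1, for instance, a = 1 (mean distance to 0 and 2) and b = 10 (mean distance to 10, 11, and 12), which gives (10 - 1) / 10 = 0.9.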

Now, we load the dataset:

In [4]:
# %% LOAD THE DATA
hypothalamus_data = loadmat(DATA_FILE)

# Extract the three main data arrays into separate variables:
neural_data   = hypothalamus_data['neural_data']
attack_vector = hypothalamus_data['attack_vector']
gender_vector = hypothalamus_data['sex_vector']
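
`loadmat` returns a dictionary that holds the MATLAB variables alongside a few metadata keys, and it loads every array as at least two-dimensional. The sketch below illustrates this on a small in-memory stand-in file. The array contents and shapes are made up and say nothing about the real dataset.

```python
import io
import numpy as np
from scipy.io import loadmat, savemat

# build a tiny stand-in .mat file in memory (made-up values, NOT the real data):
buffer = io.BytesIO()
savemat(buffer, {
    'neural_data': np.arange(12, dtype=float).reshape(3, 4),
    'sex_vector':  np.array([0, 0, 1, 1]),
})
buffer.seek(0)

demo = loadmat(buffer)

# the variable names are the keys that do not start with a double underscore:
variable_names = sorted(k for k in demo if not k.startswith('__'))
print(variable_names)

# a 1D vector comes back as a 2D row vector; squeeze() restores the 1D shape:
print(demo['sex_vector'].shape)
sex_1d = demo['sex_vector'].squeeze()
print(sex_1d.shape)
```

Checking the `.shape` of each loaded array in the same way before plotting or clustering helps to avoid orientation mistakes.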

## 📝 Repeat the PCA analysis from the previous script. Perform the PCA for 3 components
1. Perform PCA with 3 components.
2. Plot the 3 principal components (PCs) in a 3D scatter plot.
3. Plot the 3 PCs pairwise in three 2D scatter plots (using `subplot`).

Useful tip: With `ax.view_init(elev=30, azim=45)`, you can change the view of the 3D plot.
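
As a rough orientation for the three steps above, here is a minimal sketch on synthetic data. The array `X_demo`, its orientation (samples x features), and all variable names are assumptions; with the real recording you would pass the neural data in the orientation used in the previous exercise.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# synthetic stand-in for the recording (assumption): 300 samples x 20 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 20))
X_demo[:150] += 3.0  # shift half of the samples so that some structure exists

# 1. PCA with 3 components:
pca_demo = PCA(n_components=3)
pcs_demo = pca_demo.fit_transform(X_demo)   # shape: (n_samples, 3)
print(pca_demo.explained_variance_ratio_)

# 2. 3D scatter plot of the three PCs:
fig = plt.figure(figsize=(7, 6))
ax = fig.add_subplot(projection='3d')
ax.scatter(pcs_demo[:, 0], pcs_demo[:, 1], pcs_demo[:, 2], s=10)
ax.set_xlabel('PC 1')
ax.set_ylabel('PC 2')
ax.set_zlabel('PC 3')
ax.view_init(elev=30, azim=45)

# 3. the three pairwise 2D projections:
fig2, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax2, (i, j) in zip(axes, [(0, 1), (0, 2), (1, 2)]):
    ax2.scatter(pcs_demo[:, i], pcs_demo[:, j], s=10)
    ax2.set_xlabel(f'PC {i + 1}')
    ax2.set_ylabel(f'PC {j + 1}')
fig2.tight_layout()
```

In a notebook the figures render when the cell finishes; in a plain script, add `plt.show()` at the end.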

In [None]:
# Your code goes here:

# n_components  =
# PCA_model_S3 = 


# fit the PCA model to the neural data:


# Plot the first three principal components in 3D space:



## 📝 k-means clustering
Apply k-means clustering to the three PCs of the PCAed neural data:

1. Start with 2 clusters. 
2. After the clustering, update the 3D and 2D plots by color-coding them with the cluster labels. What do you notice? 
3. Compare the clustering results with the gender-labeled PCAed data from the previous script. What do you notice?
4. Compute the silhouette score for the k-means clustering and interpret the results.
5. Repeat all steps for 3 and 6 clusters. What do you notice?
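
The workflow of this exercise can be sketched on synthetic data as follows. The three Gaussian blobs in `X_demo` stand in for the three PCs and are purely an assumption; `n_init` and `random_state` are set only to make the run reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# synthetic stand-in for the 3 PCs (assumption): three blobs in 3D
rng = np.random.default_rng(1)
centers = np.array([[0, 0, 0], [6, 6, 0], [0, 6, 6]], dtype=float)
X_demo = np.vstack([c + rng.normal(scale=0.8, size=(100, 3)) for c in centers])

results = {}
for k in (2, 3, 6):
    kmeans_fit = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    results[k] = {
        'labels':     kmeans_fit.labels_,            # one cluster index per sample
        'centers':    kmeans_fit.cluster_centers_,   # shape: (k, 3)
        'silhouette': silhouette_score(X_demo, kmeans_fit.labels_),
    }
    print(f"k={k}: cluster sizes={np.bincount(kmeans_fit.labels_)}, "
          f"mean silhouette={results[k]['silhouette']:.3f}")
```

Passing `kmeans_fit.labels_` as the `c=` argument of `scatter` color-codes the points by cluster. On this synthetic example the mean silhouette score is highest for the number of clusters that matches the number of blobs; whether the real data behave similarly is for you to find out.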

In [None]:
# Your code goes here:

# create a kmeans object with n_clusters clusters:
# n_clusters=
# kmeans    = 
# kmeans_fit = 


# verify that kmeans has indeed identified 2 clusters (`kmeans_fit.labels_`):


# inspect the cluster centers (`kmeans_fit.cluster_centers_`):


# re-plot the 3 PCs, now color-coded by the kmeans cluster labels and with the cluster centers:

# also re-plot the 3 PCs in separate 2D subplots, now color-coded by the kmeans cluster labels and with the cluster centers:



In [None]:
# Your code goes here:

# plot the silhouette plot by using our helper-function:
# plot_silhouette_score(...)



In [8]:
# Your answers go here:

# Answer to question 2:

# Answer to question 3:

# Answer to question 4:

# Answer to question 5:


## 📝 Agglomerative clustering
1. Repeat the clustering analysis with agglomerative clustering. Start with 2 clusters.
2. Compare with the 2-cluster k-means results and the gender-labeled PCAed data. What do you notice?
3. Plot the dendrogram of the agglomerative clustering. What do you notice?
4. Plot the silhouette plot for the agglomerative clustering. What do you notice?
5. Repeat all steps for 3 and 6 clusters. What do you notice?
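
A minimal sketch of steps 1 and 2 on synthetic data is shown below. The blobs in `X_demo` are an assumption, and the small contingency table is just one possible way to compare two labelings whose cluster indices are arbitrary.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

# synthetic stand-in for the 3 PCs (assumption): two blobs in 3D
rng = np.random.default_rng(2)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(120, 3)),
    rng.normal(loc=5.0, scale=1.0, size=(80, 3)),
])

n_clusters = 2
agglo_fit  = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward').fit(X_demo)
kmeans_fit = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_demo)

# contingency table: rows = agglomerative labels, columns = k-means labels
table = np.zeros((n_clusters, n_clusters), dtype=int)
for a, k in zip(agglo_fit.labels_, kmeans_fit.labels_):
    table[a, k] += 1
print(table)

# cluster indices are arbitrary, so take the better of the two possible matchings:
agreement = max(np.trace(table), np.trace(table[:, ::-1])) / len(X_demo)
print(f"fraction of samples grouped identically: {agreement:.2f}")
```

The same table can be built against the gender labels instead of the k-means labels, provided the gender labels are first mapped to the integers 0 and 1.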


In [None]:
# Your code goes here:

# create an agglomerative clustering object with n_clusters clusters:
# n_clusters=2
# agglo     =
# agglo_fit =



# verify that agglo has indeed identified 2 clusters (`agglo_fit.labels_`):


# re-plot the 3 PCs, now color-coded with the agglomerative cluster labels:


# also re-plot the 3 PCs in separate 2D subplots, now color-coded with the agglomerative cluster labels:



Calculating and plotting the dendrogram may take a while due to the computation of the linkage matrix. The dendrogram shows the hierarchical clustering of the data: the distances between the clusters are recalculated at each merge, so the tree reflects the overall structure of the data. The dendrogram needs to be created only once and is independent of the number of clusters chosen later.

The x-axis shows the samples and the y-axis shows the distance between the clusters. The height of the dendrogram at each merge represents the distance between the two clusters that are merged.

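The function `linkage` computes the hierarchical clustering of the input data based on the specified method (here: 'ward'). At each step, the 'ward' method merges the two clusters whose merge leads to the smallest increase in within-cluster variance. The returned linkage matrix encodes the complete merge history: each row lists the two clusters that were merged, the distance between them, and the number of samples in the newly formed cluster.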
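
The following sketch, again on synthetic data (an assumption), shows what the linkage matrix looks like and why it has to be computed only once: the same matrix can be cut into any number of clusters afterwards. `fcluster` and the `truncate_mode` option are not used in the course text and are included only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# synthetic stand-in for the 3 PCs (assumption): two blobs in 3D
rng = np.random.default_rng(3)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(60, 3)),
    rng.normal(loc=6.0, scale=1.0, size=(40, 3)),
])

linkage_matrix = linkage(X_demo, method='ward')
# one row per merge: [cluster index 1, cluster index 2, merge distance, new cluster size]
print(linkage_matrix.shape)    # (n_samples - 1, 4)
print(linkage_matrix[-1])      # the final merge joins everything into one cluster

# cut the same tree into different numbers of flat clusters:
for k in (2, 3, 6):
    flat_labels = fcluster(linkage_matrix, t=k, criterion='maxclust')
    print(k, np.bincount(flat_labels)[1:])   # fcluster labels start at 1

# show only the last 20 merges to keep the plot readable:
fig, ax = plt.subplots(1, 1, figsize=(10, 5))
dendrogram(linkage_matrix, truncate_mode='lastp', p=20, ax=ax)
ax.set_xlabel('samples (cluster sizes in parentheses)')
ax.set_ylabel('merge distance')
fig.tight_layout()
```

A large jump in merge distance near the top of the tree suggests that the clusters joined by that merge are well separated.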

In [None]:
# Your code goes here:

# plot the dendrogram:
# linkage_matrix = linkage(..., method='ward')
#
# fig, ax = plt.subplots(1, 1, figsize=(10, 5))
# dendrogram(linkage_matrix, ax=ax)
# plt. ...



In [None]:
# Your code goes here:

# plot the silhouette plot using our helper-function:
# plot_silhouette_score(fitted_model=..., PCA_model_S3_fit=..., method=...)



In [12]:
# Your answers go here:

# Answer to question 2:

# Answer to question 3:

# Answer to question 4:

# Answer to question 5:



## 📝 DBSCAN clustering
1. Repeat the clustering analysis with DBSCAN. Start with an `eps` of 0.5 and `min_samples` of 5.
2. Compare with the previous cluster results. What do you notice?
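
Unlike k-means and agglomerative clustering, DBSCAN determines the number of clusters itself and may label samples as noise. The sketch below illustrates this on synthetic data; the two tight blobs and the two hand-placed outliers in `X_demo` are assumptions chosen so that the suggested starting parameters give a clear result.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# synthetic stand-in for the 3 PCs (assumption): two tight blobs plus two far outliers
rng = np.random.default_rng(4)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(100, 3)),
    rng.normal(loc=5.0, scale=0.1, size=(100, 3)),
    [[10.0, -10.0, 10.0], [-10.0, 10.0, -10.0]],
])

eps         = 0.5
min_samples = 5
dbscan_fit  = DBSCAN(eps=eps, min_samples=min_samples).fit(X_demo)
labels      = dbscan_fit.labels_

# DBSCAN marks noise samples with the label -1, so exclude it when counting clusters:
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise    = int(np.sum(labels == -1))
print(f"clusters found: {n_clusters}, noise samples: {n_noise}")
```

Because `eps` is a distance in the units of the data, the value 0.5 may be far too small or too large for the real PCs. Looking at the scale of the PC axes first helps to choose a sensible range to try.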


In [None]:
# Your code goes here:

# create a DBSCAN object with eps and min_samples:
# eps        =0.5
# min_samples=5
# dbscan     =
# dbscan_fit =


# check how many clusters were found (`dbscan_fit.labels_`):


# plot the 3 PCs, now color-coded by the DBSCAN cluster labels:


# also re-plot the 3 PCs in separate 2D subplots, now color-coded with the DBSCAN cluster labels:


In [14]:
# Your answer goes here:

# Answer to question 2:

