# Spectral clustering

In this notebook we explore spectral clustering and apply it to an interesting problem: clustering neonatal brain MRI.

Our first step is straightforward. Run the cell below to import the basic libraries we'll need.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import StandardScaler

### Identify preterm and term scans

We now turn to a dataset of MRI scans of preterm babies. Each baby was scanned twice, and our task is to automatically identify the first and second scans in the database using a clustering method. The volumes of 86 structures are available to help us tell the scans apart.

Here's a bit more context on the patients and scans:

* **Preterm:** Babies born prior to 37 weeks Gestational Age (GA)

* **First scan:** Conducted within 4 weeks of birth

* **Second scan:** Conducted between 38 and 43 weeks GA

Next, we load the dataset and visualise its structure using PCA, a method you might recall from a previous notebook. Run the cell below. Can you see two distinct clusters?

In [None]:
# load data
data = pd.read_csv("datasets/structures_first_second_scan.csv",header=None)
structure_volumes = data.to_numpy()

# Create features
X = StandardScaler().fit_transform(structure_volumes[:,1:])

# We have information about the first or second scan for comparison
y = structure_volumes[:,0]

print('Number of samples: {}  Number of features: {}'.format(X.shape[0],X.shape[1]))

# Apply PCA to reduce to two dimensions and plot the data
from sklearn.decomposition import PCA
pca = PCA( n_components = 2)
X2 = pca.fit_transform(X)
plt.plot(X2[:,0],X2[:,1],'bo', alpha = 0.8)
plt.title('PCA', fontsize = 16)
plt.xlabel('component 1', fontsize = 12)
plt.ylabel('component 2', fontsize = 12)

We can now apply the **k-means** algorithm or a **Gaussian mixture model** to the entire dataset, or to the PCA-transformed dataset. Run the code below to see the result of k-means on the PCA-transformed data.

We also calculate the **accuracy** compared to the ground truth labels. This tells us how well the clustering worked.
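
Cluster indices carry no meaning of their own: a clustering algorithm may call the first-scan group `0` or `1`. For two clusters this means the accuracy should be checked for both possible assignments. The helper below is a minimal sketch of that idea; `best_binary_accuracy` and the `_demo` arrays are illustrative names, not part of the notebook's dataset.

```python
import numpy as np

def best_binary_accuracy(y_true, y_pred):
    """Accuracy of a two-cluster labelling, taking the better of the two label assignments."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    acc = np.mean(y_true == y_pred)
    # For binary labels, swapping 0 and 1 in y_pred turns every hit into a miss and vice versa
    return max(acc, 1.0 - acc)

# Made-up labels: the clustering is mostly right, but with the indices swapped
y_true_demo = np.array([0, 0, 0, 1, 1, 1])
y_pred_demo = np.array([1, 1, 0, 0, 0, 0])
print(round(best_binary_accuracy(y_true_demo, y_pred_demo), 2))
```

This shortcut only works for two clusters. With more clusters, the predicted labels would have to be matched to the true labels before computing accuracy.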

In [None]:
# predict using k-means
from sklearn.cluster import KMeans
y_pred = KMeans(n_clusters=2).fit_predict(X2)

# Calculate accuracy score (cluster labels are arbitrary, so check both assignments)
from sklearn.metrics import accuracy_score
print('Accuracy score: ', round(accuracy_score(y,y_pred),2))
print('Accuracy score (labels swapped): ', round(accuracy_score(y,1-y_pred),2))

# Plot
def PlotData(X,y):
    # plot
    plt.plot(X[y==0,0],X[y==0,1],'bo',alpha=0.8, label = 'Cluster 1')
    plt.plot(X[y==1,0],X[y==1,1],'r*',alpha=0.8, label = 'Cluster 2')
    # annotate
    plt.legend()
    plt.xlabel('Component 1')
    plt.ylabel('Component 2')
    plt.title('Clustering result')

# Plot reduced dataset
PlotData(X2,y_pred)
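
The cell above uses k-means. For comparison, here is a hedged sketch of the Gaussian mixture alternative on synthetic data. The two blobs from `make_blobs` and the names `X_blobs`, `gmm`, `gmm_labels` and `gmm_probs` are assumptions made for illustration only. On the real data you would pass `X` or `X2` instead.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the PCA-reduced data: two well-separated 2D blobs
X_blobs, y_blobs = make_blobs(
    n_samples=100, centers=[[-5, 0], [5, 0]], cluster_std=1.0, random_state=0
)

# Fit a two-component Gaussian mixture and assign each sample to a component
gmm = GaussianMixture(n_components=2, random_state=0)
gmm_labels = gmm.fit_predict(X_blobs)

# Unlike k-means, the mixture model also gives soft assignments
gmm_probs = gmm.predict_proba(X_blobs)
print(gmm_labels[:10])
print(gmm_probs[:3].round(3))
```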

Now we would like to explore how spectral clustering deals with this dataset. In scikit-learn, we can do that using the `SpectralClustering` class.

**Activity 1:**
* Create a `SpectralClustering` model.
* Set number of clusters `n_clusters` to 2
* Set number of components (dimension of the manifold embedding) `n_components` to 2
* Set affinity to `'nearest_neighbors'`

Take your time with this activity, and don't hesitate to ask if any questions arise.
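
If the `SpectralClustering` interface is new to you, the sketch below shows the general pattern on a toy dataset rather than on the MRI volumes. The two-moons data from `make_moons` and the names `X_demo`, `demo_model` and `demo_labels` are illustrative assumptions. The value `n_neighbors=10` is just scikit-learn's default written out explicitly.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

# Toy data: two interleaved half-circles, a classic case where k-means struggles
X_demo, y_demo = make_moons(n_samples=200, noise=0.05, random_state=0)

# The affinity between samples is built from a k-nearest-neighbours graph
demo_model = SpectralClustering(
    n_clusters=2,
    n_components=2,
    affinity='nearest_neighbors',
    n_neighbors=10,
    random_state=0,
)

# SpectralClustering has no separate predict method; fit_predict returns the labels
demo_labels = demo_model.fit_predict(X_demo)
print(demo_model.affinity, demo_model.n_neighbors, demo_labels[:10])
```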

In [None]:
# predict using spectral clustering
from sklearn.cluster import SpectralClustering
model = None

# check parameters
print(model.n_clusters)
print(model.n_components)
print(model.affinity)
print(model.n_neighbors)

**Activity 2:** Now put the spectral clustering model to work. Predict the labels using the model you defined above and check how well it does the job.

Remember, it's completely fine if you don't get it right the first time.

In [None]:
# fit spectral clustering model and predict labels
y_pred = None

# Calculate accuracy score (cluster labels are arbitrary, so check both assignments)
print('Accuracy score: ', round(accuracy_score(y,y_pred),2))
print('Accuracy score (labels swapped): ', round(accuracy_score(y,1-y_pred),2))

# Plot reduced dataset
PlotData(X2,y_pred)

Moving right along!

**Activity 3:**
Now experiment with the number of components (`n_components`) of the embedded space. Does changing this value affect the results? What is the smallest number that still yields good results?

Remember, it's okay to try different things and make adjustments as you go along. This is how we learn!
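
One way to organise such an experiment is a simple loop over candidate values. The sketch below does this on toy two-moons data, so the numbers it prints say nothing about the MRI dataset. The names `X_toy`, `y_toy` and `toy_scores` and the list of candidate values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

# Toy data with known labels, standing in for the real feature matrix
X_toy, y_toy = make_moons(n_samples=200, noise=0.05, random_state=0)

toy_scores = {}
for n in (1, 2, 3, 5):
    labels = SpectralClustering(
        n_clusters=2,
        n_components=n,
        affinity='nearest_neighbors',
        random_state=0,
    ).fit_predict(X_toy)
    acc = np.mean(labels == y_toy)
    # Cluster indices are arbitrary, so keep the better of the two assignments
    toy_scores[n] = max(acc, 1 - acc)
    print(n, round(toy_scores[n], 2))
```

scikit-learn may emit warnings for some settings, for example when the neighbour graph is not fully connected. They are worth reading but do not stop the loop.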

**Answer:**

<img src="pictures/brain1.png" width = "100" style="float: right;">

# Exercise 4 (optional)

### Spectral clustering from precomputed matrices

<img src="pictures/brain2.png" width = "100" style="float: right;">

Excellent, we're making great progress!

In this exercise, we illustrate how to cluster MRI images of babies scanned at 40 weeks GA. We have images from 68 term and preterm babies, all of which were first co-aligned to the same reference space. We then calculated the cross-correlation between all pairs of images to measure their similarity. The result is a matrix of similarities, also known as an affinity matrix. It is stored in the file `babies.csv`.
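
To make the idea of a pairwise similarity matrix concrete, here is a small sketch that builds one from random arrays. It assumes the similarity is a zero-mean normalised cross-correlation (the Pearson correlation of the flattened images). The text above does not specify the exact variant used for `babies.csv`, and `toy_images` is made-up data, not real MRI.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five hypothetical "images" of 8x8 pixels, standing in for co-aligned scans
toy_images = rng.normal(size=(5, 8, 8))

# Flatten each image into one row so that every row is one sample
flat = toy_images.reshape(len(toy_images), -1)

# np.corrcoef treats rows as variables and returns all pairwise correlations
toy_affinity = np.corrcoef(flat)
print(toy_affinity.shape)
print(np.allclose(toy_affinity, toy_affinity.T))
```

Random noise images can be negatively correlated, which a real affinity matrix for spectral clustering should not be. Co-aligned images of the same anatomy would be expected to correlate positively.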

### Load the affinity matrix

<img src="pictures/brain3.png" width = "100" style="float: right;">

To begin, let's load the affinity matrix. You can do this by running the cell below. Once that's done, take a moment to inspect the matrix. Pay special attention to the diagonal of the matrix. What value do you see there, and why do you think it's there?

**Answer:**

In [None]:
import pandas as pd

# read the file
df = pd.read_csv('datasets/babies.csv', header=None)

# print the affinity matrix
df

Next, we convert the data from a DataFrame to a NumPy array. This is a common step in Python, because NumPy arrays offer flexibility and functionality that is useful for data analysis and machine learning.

What is the dimension of this matrix and why?

**Answer:**

In [None]:
# convert to numpy array
NCC=df.to_numpy()

# print the shape
NCC.shape

### Visualise the dataset

Terrific, let's move on to visualizing our dataset!

**Task 4.1:** Your next task is to visualize the dataset as defined by the affinity matrix. Follow these steps to do so:

- Begin by calculating the `SpectralEmbedding` with 3 components and a precomputed affinity matrix. You might need to check the documentation to figure out how to create the embedding model.

- When fitting the model, make sure to use the affinity matrix instead of the feature matrix.

- After you've calculated the 3D feature matrix in the embedded space, it's time to bring it to life. Plot the dataset in 2D using the first two embedded coordinates.

Take your time with this task. It might seem a bit complex, but it's just a series of steps, one after the other.
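
As a pointer for the documentation lookup, the sketch below shows how `SpectralEmbedding` accepts a precomputed affinity matrix, using a toy matrix rather than the one from `babies.csv`. The RBF-kernel affinity, the choice of two components, and the names `toy_points`, `toy_affinity`, `toy_embedding` and `toy_coords` are assumptions made for illustration.

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)

# Toy samples, used only to build a symmetric, non-negative affinity matrix
toy_points = rng.normal(size=(30, 4))
toy_affinity = rbf_kernel(toy_points, gamma=0.5)

# With affinity='precomputed', the model expects the n_samples x n_samples matrix itself
toy_embedding = SpectralEmbedding(n_components=2, affinity='precomputed', random_state=0)
toy_coords = toy_embedding.fit_transform(toy_affinity)
print(toy_coords.shape)
```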

In [None]:
# Create the embedding
embedding = None

# Fit the model using the affinity matrix and calculate the feature matrix in the 3D embedded space
Xe = None

# Plot the first two dimensions of the embedded space
plt.plot(None,None,'bo', alpha = 0.8)

# Annotate the plot
plt.title('Spectral Embedding')
plt.xlabel('Embedded component 1')
plt.ylabel('Embedded component 2')

### Perform spectral clustering

Let's proceed to the next step: performing spectral clustering!

**Task 4.2:**  Here's your next set of tasks:

- Start by creating a **`SpectralClustering`** model. Set it up with 3 components and 3 clusters.

- Now, fit the model using the precomputed affinity matrix, and use this fitted model to predict the labels.

- Next, complete the function **`PlotData3`** that plots the first two dimensions of the data with 3 clusters.

- Finally, plot the result of the clustering.
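
The same `affinity='precomputed'` option exists for `SpectralClustering`. The sketch below demonstrates it on an artificial block-structured matrix in which the three groups are obvious by construction. The matrix values and the names `group`, `toy_affinity`, `toy_clustering` and `toy_labels` are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Three groups of 10 samples: high affinity within a group, low affinity between groups
group = np.repeat([0, 1, 2], 10)
toy_affinity = np.where(group[:, None] == group[None, :], 0.9, 0.1)
np.fill_diagonal(toy_affinity, 1.0)

# The precomputed matrix is passed to fit_predict in place of a feature matrix
toy_clustering = SpectralClustering(n_clusters=3, affinity='precomputed', random_state=0)
toy_labels = toy_clustering.fit_predict(toy_affinity)

# Samples from the same group should share a label; the label values themselves are arbitrary
print(toy_labels)
```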


In [None]:
# Create spectral clustering model
clustering = None

# Fit and predict using the affinity matrix
y_pred = None

# Function for plotting data with three clusters
def PlotData3(X,y):
    # plot
    plt.plot(None,None,'bo',alpha=0.8, label = 'Cluster 1')
    plt.plot(None,None,'r*',alpha=0.8, label = 'Cluster 2')
    plt.plot(None,None,'g^',alpha=0.8, label = 'Cluster 3')
    # annotate
    plt.legend()
    plt.xlabel('Component 1')
    plt.ylabel('Component 2')
    plt.title('Clustering result')

# Plot


### Interpret the clusters

Now, let's try to make sense of the clusters we've created.

**Task 4.3:** This time, we're going to load a file that stores the gestational age at birth for all 68 babies in our dataset. Your task involves a few steps:

- First, plot the first two dimensions of the embedded dataset using a `scatter` plot. However, we want this plot to be colour-coded by the GA at birth `ages`.

- You can refer to the function `PlotDataColourcoded` above to understand how to implement this colour-coding.

Once you've done that, take a moment to look at the clusters. Based on the color coding, what do you think the clusters represent? How do you interpret the clusters? What do they tell us about the gestational ages of the babies in our dataset?
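
For reference, colour-coding a scatter plot by a continuous variable in matplotlib comes down to the `c` argument of `scatter`, plus a colorbar. The sketch below uses random coordinates and random values. `toy_xy` and `toy_values` are made-up stand-ins, not the embedded coordinates or the real ages.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Made-up 2D coordinates and one continuous value per point
toy_xy = rng.normal(size=(40, 2))
toy_values = rng.uniform(24, 42, size=40)

fig, ax = plt.subplots()
# c sets the per-point value that is mapped to a colour through the colormap
points = ax.scatter(toy_xy[:, 0], toy_xy[:, 1], c=toy_values, cmap='viridis')

# The colorbar needs the object returned by scatter to know the value range
cbar = fig.colorbar(points, ax=ax)
cbar.set_label('value')
ax.set_xlabel('coordinate 1')
ax.set_ylabel('coordinate 2')
```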

**Answer:**

In [None]:
# Load GA and convert to numpy
df2 = pd.read_csv('datasets/ages.csv',header=None)
ages = df2.to_numpy()

# Scatterplot of the embedded space colour-coded by GA


# annotate the plot
plt.colorbar()
plt.xlabel('embedded coordinate 1')
plt.ylabel('embedded coordinate 2')
plt.title('Embedding colorcoded by age')