# Machine Learning - Task E

**Part of IAFIG-RMS *Python for Bioimage Analysis* Course.**

*M Kundegorski*

2019-12-13

In this task you will use image information to cluster images into classes based on similarity of colour. In reality we could use any feature but colour is very visual and easy to follow.

There are 6 'classes' (colours) of blob; each has a different hue but the same mean intensity.
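
To make "different hue, same mean intensity" concrete, here is a minimal sketch using the standard-library `colorsys` module. The two example colours and the helper name `hue_degrees` are made up for illustration and are not taken from the course data.

```python
import colorsys

def hue_degrees(r, g, b):
    """Return the hue (0-360 degrees) of an 8-bit RGB colour."""
    h, _s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h * 360.0

# Hypothetical mean colours for two blobs (illustrative values only)
example_colours = {"reddish": (200, 40, 40), "bluish": (40, 40, 200)}

# Both colours have the same mean intensity across channels...
mean_intensity = {name: sum(rgb) / 3 for name, rgb in example_colours.items()}

# ...but very different hues (0 degrees for "reddish", 240 for "bluish")
hues = {name: hue_degrees(*rgb) for name, rgb in example_colours.items()}

print(mean_intensity)
print(hues)
```

A clustering that only looked at mean intensity could not separate these two colours, whereas their hues (or their positions in RGB space) differ clearly.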

This task is based on a scikit-learn tutorial of k-means clustering: https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html#sphx-glr-auto-examples-cluster-plot-cluster-iris-py.

## The Task

In one experiment we acquired a 'Brainbow'-esque dataset, i.e. every cell expresses a combination of different fluorophores (colours), and each colour combination represents a type of cell. We imaged these cells with an RGB camera. The cells are sparse, so we were easily able to segment them, find their bounding boxes and create a database of images, each containing a single cell. We want to use clustering to automatically identify which cell belongs to which cell type. Luckily, we've convinced a PhD student to manually label our sample database so we can confirm our results at the end.

Given the colour of our cells we want to automatically, and in an unsupervised fashion, 'cluster' our cells into classes, ignoring other features like shape or size.

1. Let's start by loading our data and visualising the manual class labels in RGB space (other colour spaces would also work)
2. Use k-means clustering to cluster our data and compare it to our manual labels
3. Advanced: Use mean shift clustering to cluster our data and compare it to our manual labels

## Task E.1

Let's start by loading the data. Later on, you can change the parameters to see how the clustering results change with different numbers of samples and noise levels.

In [None]:
# utils is a custom module written to simplify these tutorials
# You do not need to understand its code for this practical
from utils.practice_data import generateBlobsData  # this loads data into a DataFrame
from utils.practice_data import showBlobs  # this allows quick visualisation of the data

import pandas as pd
import numpy as np

imageDir = './assets/simple_blobs/'
# Is 30 a sufficient number of samples? How much noise can our algorithms deal with?
number_of_samples = 30
noise = 10  # value from 0 to 250
problem = generateBlobsData(imageDir, 6, number_of_samples, imSize=100, colour=True, noiseSize=noise)

# Visualise
display(problem.describe())
showBlobs(problem.head(30))

## Task E.2

Time for some data wrangling: as usual, the data needs to have the datatype required by the algorithm. For this simple clustering example we're going to extract the mean 'red', the mean 'green' and the mean 'blue' value for each cell. To keep things simple, given we've only got one cell per image, we will use the means for the whole image. How might this affect our means and therefore our clustering? Should we use the mean or the median?
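
One way to explore the mean-versus-median question is with a toy image. The sketch below is an assumption-laden illustration, not the course data: it builds a hypothetical 100x100 single-channel image with a dark background and one bright square "cell", then compares whole-image statistics with the mean over the cell pixels only (found here with a naive `> 0` threshold).

```python
import numpy as np

# Hypothetical single-channel image: dark background, one bright blob
channel = np.zeros((100, 100), dtype=float)
channel[40:60, 40:60] = 200.0  # the "cell" covers 400 of 10,000 pixels (4%)

whole_mean = channel.mean()               # diluted by the background: 8.0
whole_median = np.median(channel)         # background value, as the blob is < 50% of pixels: 0.0
blob_mean = channel[channel > 0].mean()   # mean over blob pixels only: 200.0

print(whole_mean, whole_median, blob_mean)
```

In this toy case the whole-image mean is pulled towards the background in proportion to how little of the image the cell occupies, and the whole-image median reports the background alone. Whether either matters for clustering depends on how much the cell size varies between images.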

Run the following cell to get the mean colour channel values.

In [None]:
y=problem.loc[:,'class'].values.astype(int)  # Convert classes to int

for cell in problem.index:
    problem.loc[cell,'mean-R'] = problem.loc[cell,'raw_data'][0].mean()  # mean of the first (red) channel of our image data for this cell
    problem.loc[cell,'mean-G'] = problem.loc[cell,'raw_data'][1].mean()  # mean of the second (green) channel of our image data for this cell
    problem.loc[cell,'mean-B'] = problem.loc[cell,'raw_data'][2].mean()  # mean of the third (blue) channel of our image data for this cell

problem.drop(columns=['raw_data']).describe()  # summarise everything except the raw image data

Run the following cell to plot our cells on three axes: one for R, one for G and one for B. This code also colour-codes the cells by their class (the manual class at the moment).

In [None]:
%matplotlib widget
import matplotlib.pyplot as plt

f, axis = plt.subplots(1,1,subplot_kw={'projection':'3d'})  # create a figure with a single 3d axis (subplot)

blob_classes = problem['class'].unique()
for blob_class in blob_classes:
    axis.plot(problem[problem['class']==blob_class]['ch1'],
            problem[problem['class']==blob_class]['ch2'],
            problem[problem['class']==blob_class]['ch3'], 
            'o', label=blob_class)
axis.set_xlabel('Red')  # ch1 is plotted on the x axis
axis.set_xlim(0,255)
axis.set_ylabel('Green')  # ch2 is plotted on the y axis
axis.set_ylim(0,255)
axis.set_zlabel('Blue')
axis.set_zlim(0,255)
    
plt.legend(loc='upper left', numpoints=1, ncol=3, fontsize=8, bbox_to_anchor=(0, 0))  # add a legend

plt.show()

K-means clustering

In [None]:
from sklearn import cluster
k_means = cluster.KMeans(n_clusters=6)
k_means.fit(problem[['ch1','ch2','ch3']]) 
problem['cluster'] = k_means.labels_
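
Because k-means numbers its clusters arbitrarily, cluster `0` need not correspond to manual class `0`, so the two label columns cannot be compared element by element. The sketch below shows one possible way to compare them, using scikit-learn's `adjusted_rand_score` and a pandas cross-tabulation. It runs on synthetic data from `make_blobs` as a stand-in for the mean-colour features; the variable names and data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for three colour features and six manual classes
X, true_labels = make_blobs(n_samples=180, centers=6, n_features=3,
                            cluster_std=0.5, random_state=0)

pred_labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)

# Adjusted Rand index: 1.0 means identical groupings, regardless of label names
ari = adjusted_rand_score(true_labels, pred_labels)

# Renaming the labels does not change the grouping, so the score stays at 1.0
relabelled = (true_labels + 1) % 6
ari_relabelled = adjusted_rand_score(true_labels, relabelled)

# Rows: manual classes, columns: cluster ids; a good result has one
# dominant entry per row
table = pd.crosstab(true_labels, pred_labels,
                    rownames=["manual"], colnames=["cluster"])

print(ari, ari_relabelled)
print(table)
```

With the practical's DataFrame, the same two calls could be applied to `problem['class']` and `problem['cluster']`.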

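A simple way to estimate the number of clusters for k-means is to look for a "knee" (or "elbow") point in the Sum of Squared Errors (SSE) as the number of clusters increases: beyond the knee, adding more clusters reduces the SSE only slightly.
You can access the SSE of a fitted k-means model through its `.inertia_` attribute.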

In [None]:
# 1 Create a list to gather the SSE value for each candidate number of clusters

# 2 Create a `for` loop over the range of possible values
#    Fit k-means with that number of clusters and add its `.inertia_` attribute to the list

# 3 Plot the SSE values against the number of clusters

In [None]:
sse = {}
for k in range(1, 20):
    kmeans = cluster.KMeans(n_clusters=k, max_iter=5).fit(problem[['ch1','ch2','ch3']])
    #print(kmeans.inertia_)
    sse[k] = kmeans.inertia_  # Inertia: sum of squared distances of samples to their closest cluster centre
    
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.show()

Mean shift clustering: what happens if we change the `bandwidth` parameter?

In [None]:
from sklearn.cluster import MeanShift
ms = MeanShift(bandwidth=10).fit(problem[['ch1','ch2','ch3']])
problem['cluster_ms'] = ms.labels_
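
To get a feel for the `bandwidth` question, the sketch below fits `MeanShift` with a small, a data-derived and a very large bandwidth and counts the resulting clusters. It uses synthetic `make_blobs` data rather than the course images, and the specific bandwidth values and the `quantile=0.2` setting are arbitrary choices for illustration. `estimate_bandwidth` is scikit-learn's helper for suggesting a bandwidth from the data.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic stand-in for three colour features (not the course data)
X, _ = make_blobs(n_samples=180, centers=6, n_features=3,
                  cluster_std=0.5, random_state=0)

# A data-driven starting point for the bandwidth
suggested = estimate_bandwidth(X, quantile=0.2, random_state=0)

# Count clusters for a small, a suggested and a very large bandwidth
results = []
for name, bandwidth in [("small", 1.0), ("suggested", suggested), ("large", 100.0)]:
    labels = MeanShift(bandwidth=bandwidth).fit(X).labels_
    results.append((name, float(bandwidth), len(np.unique(labels))))

for name, bandwidth, n_found in results:
    print(f"{name:>9}: bandwidth={bandwidth:.2f} -> {n_found} cluster(s)")
```

A bandwidth larger than the spread of the whole dataset merges everything into a single cluster, while a very small one tends to split genuine groups. Unlike k-means, the number of clusters is an outcome of this choice rather than an input.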

Let's compare k-means and mean shift clustering with the ground-truth data.

In [None]:
# Show the ground-truth classes next to the k-means and mean shift clusters
fig = plt.figure(figsize=plt.figaspect(0.3))
ax1 = fig.add_subplot(1, 3, 1, projection='3d')
ax2 = fig.add_subplot(1, 3, 2, projection='3d')
ax3 = fig.add_subplot(1, 3, 3, projection='3d')

blob_classes = problem['class'].unique()
for blob_class in blob_classes:
    ax1.plot(problem[problem['class']==blob_class]['ch1'],
            problem[problem['class']==blob_class]['ch2'],
            problem[problem['class']==blob_class]['ch3'], 
            'o', label=blob_class)
ax1.set_title('Ground truth label')    
blob_classes = problem['cluster'].unique()
for blob_class in blob_classes:
    ax2.plot(problem[problem['cluster']==blob_class]['ch1'],
            problem[problem['cluster']==blob_class]['ch2'],
            problem[problem['cluster']==blob_class]['ch3'], 
            'o', label=blob_class)
ax2.set_title('k-means with {} clusters'.format(len(blob_classes)))
    
blob_classes = problem['cluster_ms'].unique()
for blob_class in blob_classes:
    ax3.plot(problem[problem['cluster_ms']==blob_class]['ch1'],
            problem[problem['cluster_ms']==blob_class]['ch2'],
            problem[problem['cluster_ms']==blob_class]['ch3'], 
            'o', label=blob_class)
ax3.set_title('Mean shift clustering with {} clusters'.format(len(blob_classes)))    
    
ax1.legend(loc='upper left', numpoints=1, ncol=3, fontsize=8, bbox_to_anchor=(0, 0))
ax2.legend(loc='upper left', numpoints=1, ncol=3, fontsize=8, bbox_to_anchor=(0, 0))
ax3.legend(loc='upper left', numpoints=1, ncol=3, fontsize=8, bbox_to_anchor=(0, 0))

plt.show()