# Gaussian Mixture

Welcome to this tutorial, where we're going to explore the world of Gaussian Mixture Models. This powerful statistical model allows us to model data that might not fit a single Gaussian distribution, but can be modeled as a mix of several Gaussian distributions.

We'll first apply it to the **high-dimensional Wisconsin breast cancer dataset** and compare its performance with K-means clustering. We've chosen this dataset because it's complex enough to show the strengths of Gaussian Mixture Models, but also well-studied enough that we can confidently interpret the results.

Next, we'll apply a Gaussian Mixture Model to a more challenging task: the segmentation of a **brain MRI** slice into three tissue classes, White Matter (WM), Gray Matter (GM), and Cerebrospinal Fluid (CSF). This will allow us to explore various components of the model, such as the image and class intensity distributions, also known as **likelihoods**, and the resulting probabilistic predictions, or **posteriors**.

Let's get started!

### Imports

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np

### Load Breast Cancer Dataset

Now, we will proceed to load the Breast Cancer dataset. We'll construct a feature matrix, denoted as `X`, and a label vector, `y`. It's essential to bear in mind that although we're using the label vector, it's primarily for the assessment of our clustering results. The reason behind this is that clustering falls into the realm of unsupervised machine learning tasks, where we usually don't have access to any labels.

In [None]:
from sklearn import datasets

bc = datasets.load_breast_cancer()

print('\n Features: \n', bc.feature_names)
print('\n Labels: ', bc.target_names)

# Feature matrix
X = bc.data
# Label vector
y = bc.target
print('\n We have {} features.'.format(X.shape[1]))

Throughout this module, you've learned how dimensionality reduction can help us make sense of high-dimensional datasets. `PCA` is our tool of choice for today to shed some light on our 30-dimensional Wisconsin Breast Cancer Dataset.
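As a reminder of how the scaling and projection steps fit together, here is a minimal sketch on synthetic data. The array `X_demo` and the other `_demo` names are made up for illustration and are not part of the exercise.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data: 100 samples, 5 features (not the breast cancer data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))

# standardise each feature to zero mean and unit variance
X_scaled_demo = StandardScaler().fit_transform(X_demo)

# project the standardised features onto the first two principal components
pca_demo = PCA(n_components=2)
X_demo_2d = pca_demo.fit_transform(X_scaled_demo)

print(X_demo_2d.shape)                      # (100, 2)
print(pca_demo.explained_variance_ratio_)   # fraction of variance kept by each component
```

Scaling first matters because PCA is driven by variance, so features measured on large scales would otherwise dominate the components.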

**Activity 1:** Complete the code below to explore the structure of the Wisconsin Breast Cancer Dataset.

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# apply standard scaler to the feature matrix
X = StandardScaler().fit_transform(X)

# create PCA feature transformer with 2 components
model = None

# transform features using PCA
X2 = None

# create function for plotting 2D two-class dataset
def PlotData(X,y):
    # plot
    plt.plot(X[y==0,0],X[y==0,1],'bo',alpha=0.5, label = bc.target_names[0])
    plt.plot(None,None,'r*',alpha=0.5, label = bc.target_names[1])
    # annotate
    plt.legend()
    plt.xlabel('Component 1')
    plt.ylabel('Component 2')
    plt.title('Wisconsin Breast Cancer Dataset')

# Plot reduced dataset


### K-means clustering

Now it's time for some K-means clustering! We'll see how well it can pick out the natural clusters in our dataset and check if these clusters line up with healthy and cancerous cells.

**Activity 2:** Ready for some coding? Let's get to it! Complete the code below to perform k-means clustering with 2 clusters. After that, we'll evaluate how accurate the results are compared to the actual labels. Do you think the 30D model will outperform the 2D models we discussed in the lecture? Let's find out!

*Note: We print two scores because the clustering assigns the labels 0 and 1 to the clusters arbitrarily, so they may be swapped relative to the ground truth. The higher score is the measure of performance.*
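To see why two scores are needed, consider this small sketch with hand-made labels. The arrays are hypothetical and chosen so that the clusters match the classes well but carry the opposite label values.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# hypothetical ground truth and cluster assignments
y_true_demo = np.array([0, 0, 0, 1, 1, 1])
# same grouping as the ground truth, but with swapped label values and one mistake
y_clusters_demo = np.array([1, 1, 1, 0, 0, 1])

acc_as_is = accuracy_score(y_true_demo, y_clusters_demo)
acc_swapped = accuracy_score(y_true_demo, 1 - y_clusters_demo)

print(round(acc_as_is, 2), round(acc_swapped, 2))   # 0.17 0.83
print('Clustering accuracy: ', round(max(acc_as_is, acc_swapped), 2))
```

For two clusters the two scores always add up to 1, so the larger one is never below 0.5.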

**Answer:**

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.cluster import KMeans

# Create k-means model with 2 clusters
model = None

# Fit the model and predict the labels
y_pred = None

# Print out accuracy score compared to ground truth labels
print('Accuracy score: ', round(accuracy_score(y,y_pred),2))
print('Accuracy score: ', round(accuracy_score(y,1-y_pred),2))

We will now plot the clustering result.

**Activity 3:** Call the function `PlotData` with the predicted labels. Compare to the ground truth labels plotted above.

In [None]:
# Plot the clustering result


## Exercise 1
### Clustering high-dimensional dataset using Gaussian Mixture

__Task 1.1:__ We're now going to cluster the standardised 30-dimensional feature matrix `X` using `GaussianMixture`. Set the number of components to `2`, one for each class, and the random state to `42` so that you get the same result every time you run your code. If you're unsure how to do that, check the documentation. Here are the steps you need to follow:

* Create the `GaussianMixture` model
* Fit the model and predict the labels
* Print out the accuracy score compared to the ground truth labels
* Plot the clustering result by calling `PlotData` with the 2D features `X2` and the predicted labels

Which method performed better, k-means or the Gaussian mixture? Can you reason why?
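If the `GaussianMixture` interface is new to you, the following sketch shows the basic calls on synthetic blobs rather than on the exercise data. All `_demo` and `_blobs` names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.mixture import GaussianMixture

# two well-separated synthetic clusters in 2D
X_blobs, y_blobs = make_blobs(n_samples=200, centers=[[-5, -5], [5, 5]],
                              cluster_std=1.0, random_state=0)

# n_components is the number of Gaussians; random_state fixes the initialisation
gmm_demo = GaussianMixture(n_components=2, random_state=42)
labels_demo = gmm_demo.fit_predict(X_blobs)

# cluster labels are arbitrary, so score both possible assignments
acc_demo = accuracy_score(y_blobs, labels_demo)
print('Accuracy: ', max(acc_demo, 1 - acc_demo))
print('Component means:\n', gmm_demo.means_)
```

Unlike k-means, the fitted model also stores a covariance and a weight for each component (`covariances_`, `weights_`), which lets it describe clusters of different shapes and sizes.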

**Answer:**

In [None]:
from sklearn.mixture import GaussianMixture

# Create GMM model with 2 clusters
model = None

# Fit the model and predict the labels
y_pred = None

# Print out accuracy score compared to ground truth labels
print('Accuracy score: ', round(accuracy_score(y,y_pred),2))
print('Accuracy score: ', round(accuracy_score(y,1-y_pred),2))

# Plot the clustering result


### Load brain MRI
Our 2D brain MRI image is saved in pickle format as `slice.p`. The non-brain tissue has been removed and the image has been padded with zeros. When we use GMM clustering to segment the WM, GM and CSF, we will need to exclude these zero values for the algorithm to work well.

So, run the following code to load the brain MRI slice and display it. Take a good look at it, because we're going to dig deep into it. Ready? Let's go!

In [None]:
import pickle
# use a context manager so the file is closed after loading
with open("datasets/slice.p", "rb") as f:
    slice = pickle.load(f)
print('Slice dimensions: ', slice.shape)

plt.imshow(slice)
plt.set_cmap('gray')

Alright, let's go ahead and create a histogram of our image. You'll notice there's a big spike at zero - that's just the padding in the background of the image.

*Note: don't worry about us using **`_=plt.hist`**. We're just doing that to prevent the histogram values from showing up in our output. Let's keep things nice and tidy!*

In [None]:
# display histogram
_=plt.hist(slice.flatten(), bins = 100, density = True)

To ensure that our GMM segmentation performs well, it's necessary to exclude the padding. That means we'll be focusing on and plotting a histogram of only the non-zero elements. This can be done by using the `np.where` function, but you could also use the logical array `slice>0` if that's more familiar to you.

**Note:** Just remember that `slice2`, the array where we've stored the selected pixels, will now be a 1D array.
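The two selection routes can be compared on a tiny hand-made array. This is a sketch with a hypothetical `img_demo`, not the MRI slice.

```python
import numpy as np

# tiny stand-in image with zero padding
img_demo = np.array([[0, 3, 0],
                     [5, 0, 7]])

# np.where returns a tuple of (row indices, column indices)
ind_demo = np.where(img_demo > 0)
by_index = img_demo[ind_demo]

# the logical array selects the same elements in the same order
by_mask = img_demo[img_demo > 0]

print(by_index, by_index.shape)            # [3 5 7] (3,)
print(np.array_equal(by_index, by_mask))   # True
```

Keeping the index tuple around is useful later, because it lets us write per-pixel results back into an array with the shape of the original image.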

Take a moment to look at the histogram. Can you identify the peaks that represent the white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF)?

In [None]:
# find indices of non-zero elements
ind = np.where(slice>0)
# select non-zero elements
slice2 = slice[ind]
# check the dimension
print('Shape of selected data is ', slice2.shape)
# plot histogram
_=plt.hist(slice2, bins = 100, density = True)

## Exercise 2
### Segmentation of brain MRI using Gaussian mixture

__Task 2.1:__ Now perform clustering of `slice2` using `GaussianMixture`. Set the number of clusters to `3` and the random state to `42` to get the same result every time you rerun it. Check the documentation to see exactly how to do that. Perform the following tasks:
* Create the `GaussianMixture` model
* Create the feature matrix `X` by reshaping `slice2` into a 2D array
* Fit the model and predict the labels
* Reshape the predicted labels to the original shape of `slice`
* Display using `imshow`
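The reshaping and write-back steps are the easiest to get wrong, so here is a sketch of the same workflow on a small synthetic image. The image, its intensity values and all `_demo` names are assumptions made for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# synthetic 20x20 image: zero background with a 10x10 patch of three intensity groups
patch_demo = np.concatenate([rng.normal(50, 3, 30),
                             rng.normal(100, 3, 40),
                             rng.normal(150, 3, 30)])
img_demo = np.zeros((20, 20))
img_demo[5:15, 5:15] = patch_demo.reshape(10, 10)

# keep only the non-zero pixels (a 1D array)
ind_demo = np.where(img_demo > 0)
values_demo = img_demo[ind_demo]

# scikit-learn expects a 2D array of shape (n_samples, n_features)
X_demo = values_demo.reshape(-1, 1)
gmm_demo = GaussianMixture(n_components=3, random_state=42)
labels_1d_demo = gmm_demo.fit_predict(X_demo)

# write the labels back into an image; add 1 so that 0 stays reserved for background
labels_img_demo = np.zeros(img_demo.shape)
labels_img_demo[ind_demo] = labels_1d_demo + 1

print(X_demo.shape)                  # (100, 1)
print(np.unique(labels_img_demo))    # [0. 1. 2. 3.]
```

The resulting array can be passed straight to `plt.imshow`.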

In [None]:
from sklearn.mixture import GaussianMixture

# select model
model=None

# create features - needs to be a 2D array
X = None

# fit the model and predict the cluster labels
y_pred = None

# Create array of 2D labels
labels2D = np.zeros(slice.shape)

# put the labels into fields with non-zero indices
# You need to add one so that labels start from 1
labels2D[None]=None+1

# display the label image

plt.set_cmap('plasma')

**Task 2.2:** Predict the probabilistic segmentations for each tissue class. Display the maps in a figure with three subplots.

To do that, perform the following steps:
* predict the probabilistic segmentations using function `predict_proba`
* check the size of the resulting predicted probability matrix
* write a `for` loop over the tissue types
* select the probability map for the current class from the predicted probability matrix
* create an array of zeros the same shape as `slice`
* insert the class-dependent probability into the right locations in this array
* display the array using `subplot` and `imshow`
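The steps above can be sketched on a synthetic image as follows, leaving out the plotting. The image and the `_demo` names are hypothetical and only serve to show the shapes involved.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# synthetic 20x20 image with a 10x10 patch of three intensity groups
values_demo = np.concatenate([rng.normal(50, 3, 30),
                              rng.normal(100, 3, 40),
                              rng.normal(150, 3, 30)])
img_demo = np.zeros((20, 20))
img_demo[5:15, 5:15] = values_demo.reshape(10, 10)
ind_demo = np.where(img_demo > 0)
X_demo = img_demo[ind_demo].reshape(-1, 1)

gmm_demo = GaussianMixture(n_components=3, random_state=42).fit(X_demo)

# one row per pixel, one column per class
proba_demo = gmm_demo.predict_proba(X_demo)
print(proba_demo.shape)   # (100, 3)

# build one probability map per class
maps_demo = []
for k in range(3):
    prob_map = np.zeros(img_demo.shape)
    prob_map[ind_demo] = proba_demo[:, k]
    maps_demo.append(prob_map)

# inside the patch the three maps sum to 1 at every pixel
print(np.allclose(sum(maps_demo)[ind_demo], 1.0))   # True
```

Each entry of `maps_demo` could then be shown with `plt.subplot` and `plt.imshow`.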

In [None]:
# predict probabilistic segmentations
proba = None

# check the dimensions
print('Dimensions of proba ', None)

In [None]:
#display
plt.figure(figsize = [15,4])
plt.set_cmap('gray')

for i in range(3):
    # take only posteriors for class i
    post = None

    # place the posteriors into a 2D image
    post2D = None
    post2D[None]=None

    # display
    plt.subplot(1,3,i+1)
    plt.imshow(post2D)

## Exercise 3 (optional)

### Explore Gaussian Mixture model

 In this exercise, we're going to unravel some fascinating theoretical concepts, including **likelihoods** and **posteriors**. Remember the `GaussianMixture` model from Exercise 2? We're going to use it again to perform some brain MRI segmentation magic!

### Posterior probabilities

Now, let's think about the probabilistic segmentation $p_{ik}$, which gives the probability that pixel $i$ belongs to class $k$. But wait, these are actually what we call posterior probabilities! In other words, they're $$p(z_i=k|x_i, \mu_k, \sigma_k,c_k)$$ probabilities for the labels $z_i$, given the intensity value $x_i$ and the parameters $\mu_k, \sigma_k,c_k$ of the Gaussian intensity distribution for class $k$.
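For reference, Bayes' rule gives the posterior explicitly. Here $G$ denotes the Gaussian density with mean $\mu_k$ and standard deviation $\sigma_k$, $c_k$ is the mixing proportion of class $k$, and $K$ is the number of classes ($K=3$ in our segmentation):

$$p_{ik}=\frac{G(x_i,\mu_k,\sigma_k)c_k}{\sum_{j=1}^{K}G(x_i,\mu_j,\sigma_j)c_j}$$

The denominator is the same for every class, so the posteriors of a pixel always sum to one over $k$.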

**Task 3.1:** Now let's plot how the posterior probability for each class changes with pixel intensity value. Ready to fill in the missing code below and bring those probability curves to life? Here we go!

In [None]:
# pixel intensity value range
intensity_range = np.linspace(0, np.max(slice2),200)

# predict posterior probabilities for the intensity range
# do not forget to reshape the intensity range to 2D array for the prediction
proba_curves = None

# display
plt.figure(figsize = [14,10])
# plot normalised histogram
# normalisation is achieved by parameter density
plt.subplot(211)
_=plt.hist(slice2, bins = 100, density = True)
plt.title('Normalised Intensity Histogram')

# plot posterior probabilities in a for loop
plt.subplot(212)
for i in range(0,3):
    plt.plot(None,None, linewidth = 3, label = 'Class {}'.format(i))

# annotate the subplot
plt.title('Posterior probabilities')
plt.xlabel('intensity')
plt.ylabel('posterior probability')
plt.legend(loc = 'upper left')

### Class-dependent likelihood

When we talk about class-dependent likelihoods, we're actually dealing with Gaussian distributions, and these are scaled by the mixing proportions. In mathematical terms, this looks like

$$p(x_i|z_i=k,\mu_k,\sigma_k, c_k)=G(x_i,\mu_k,\sigma_k)c_k$$

To visualize these distributions over a normalized histogram (make sure to use `density=True`), we'll need to extract some key parameters from our fitted model. These are `means_`, `covariances_`, and `weights_`.

Our next step is to calculate the Gaussian distributions using these parameters. The handy `norm.pdf` function from the `scipy.stats` module is perfect for this. Finally, we need to multiply these distributions by the weights and plot the results.
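As a sketch of the calculation, the snippet below evaluates weighted Gaussians with `norm.pdf` for made-up parameters. The values of `m_demo`, `s_demo` and `w_demo` are hypothetical and are not taken from a fitted model.

```python
import numpy as np
from scipy.stats import norm

# hypothetical parameters for three classes
m_demo = np.array([50.0, 100.0, 150.0])   # means
s_demo = np.array([5.0, 8.0, 6.0])        # standard deviations
w_demo = np.array([0.2, 0.5, 0.3])        # mixing proportions, sum to 1

x_demo = np.linspace(0, 200, 2001)

# one weighted Gaussian curve per class
weighted_demo = np.array([w * norm.pdf(x_demo, loc=m, scale=s)
                          for m, s, w in zip(m_demo, s_demo, w_demo)])

# the area under each weighted curve is approximately its mixing proportion
dx = x_demo[1] - x_demo[0]
areas_demo = weighted_demo.sum(axis=1) * dx
print(areas_demo.round(3))   # close to the weights 0.2, 0.5, 0.3
```

For a model fitted to 1D data with the default `covariance_type='full'`, `covariances_` has shape `(n_components, 1, 1)` and holds variances, which is why it needs to be flattened and square-rooted before being passed to `norm.pdf` as `scale`.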

**Task 3.2:** Time to bring these Gaussian intensity distributions to life! Let's plot them for each class $k$ over the normalized image histogram. Just fill in the missing code below and run the cell to see them in action. Ready to give it a shot? Here we go!


---


If you want more details:

In the equation

$$p(x_i|z_i=k,\mu_k,\sigma_k, c_k)=G(x_i,\mu_k,\sigma_k)c_k$$

we are dealing with several quantities related to Gaussian Mixture Models:

$x_i$ : This represents the data point or feature that we are considering. In this case, it can represent the intensity of a pixel in an image.

$z_i$ : This is the latent (or hidden) variable that denotes the class or cluster to which the data point $x_i$ belongs.

$k$ : This is the class or cluster index in the Gaussian Mixture Model. The model tries to determine to which class a data point belongs.

$\mu_k$ : This represents the mean of the Gaussian distribution for class $k$.

$\sigma_k$ : This stands for the standard deviation of the Gaussian distribution for class $k$ (its square, $\sigma_k^2$, is the variance). It determines the width of the Gaussian.

$c_k$ : This is the mixing proportion or weight for class $k$, which represents the prior probability of that class. It indicates how much each Gaussian component contributes to the overall model.

$G(x_i,\mu_k,\sigma_k)$ : This represents the Gaussian distribution for class $k$ with mean $\mu_k$ and standard deviation $\sigma_k$.

The equation itself gives the weighted likelihood of the intensity $x_i$ under class $k$: the Gaussian density for class $k$ scaled by its mixing proportion $c_k$. Strictly speaking, $G(x_i,\mu_k,\sigma_k)$ alone is the conditional density of $x_i$ given class $k$, and the product with $c_k$ is the joint density of $x_i$ and $z_i=k$. This notebook refers to the weighted version as the class-dependent likelihood.

In [None]:
# to calculate gaussian distribution
from scipy.stats import norm

# get parameters of GMM
# use flatten to make 1D arrays

# means
m = None

# standard deviation (you need to take sqrt of covariances)
s = None

# mixing proportions
w = None

# display
plt.figure(figsize = [14,5])

# histogram
_=plt.hist(None, bins = 100, density = None)

# class-dependent likelihoods - Gaussian PDFs
for i in range(0,3):
    likelihood = None
    plt.plot(intensity_range,likelihood, linewidth = 3, label = 'Class {}'.format(i))
plt.legend()
plt.title('Class specific likelihood functions')


### Likelihood

The likelihood for each pixel intensity $x_i$ given the Gaussian Mixture Model parameters $\phi = (\mu_k,\sigma_k,c_k),k=1,...,K$ can be evaluated as
$$p(x_i|\phi)=\sum_{k=1}^KG(x_i,\mu_k,\sigma_k)c_k $$

We can calculate this function by simply adding together the class-dependent likelihoods. Or, there's an alternative route: we can use the handy `score_samples` function provided by the `GaussianMixture` model, which returns the **log-likelihood**.
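The two routes can be checked against each other. The sketch below fits a model to synthetic 1D data (an assumption for illustration, not the MRI intensities) and compares `np.exp(score_samples(...))` with the sum of the weighted Gaussians.

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)

# synthetic 1D intensities drawn from three groups
values_demo = np.concatenate([rng.normal(50, 3, 300),
                              rng.normal(100, 3, 400),
                              rng.normal(150, 3, 300)])
gmm_demo = GaussianMixture(n_components=3, random_state=42)
gmm_demo.fit(values_demo.reshape(-1, 1))

x_demo = np.linspace(0, 200, 400)

# route 1: exponentiate the log-likelihood returned by score_samples
lik_from_score = np.exp(gmm_demo.score_samples(x_demo.reshape(-1, 1)))

# route 2: add up the weighted Gaussian densities of the components
m = gmm_demo.means_.flatten()
s = np.sqrt(gmm_demo.covariances_.flatten())
w = gmm_demo.weights_
lik_from_sum = sum(w[k] * norm.pdf(x_demo, m[k], s[k]) for k in range(3))

print(np.allclose(lik_from_score, lik_from_sum))   # True
```

Note that `score_samples` expects a 2D array, so the intensity range has to be reshaped with `reshape(-1, 1)` first.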

**Task 3.3:** Now, let's make the likelihood function come alive for the whole intensity range over the normalized image histogram. To do this, first evaluate log-likelihood over the intensity range using the `score_samples` function. Then, calculate the exponential using `np.exp` and plot the result. Excited to see how it looks? Let's get plotting!

In [None]:
# Compare histogram with fitted Gaussian mixture likelihood function
plt.figure(figsize = [14,5])
# histogram
_=plt.hist(None, None, None)
# calculate likelihood
log_likelihood = None
likelihood = np.exp(None)
# plot likelihood
plt.plot(intensity_range, None, linewidth = 3, c='k')
plt.title('Fitted likelihood function')