[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CAMM-UTK/acns-AI-tutorial/blob/main/Diffraction_ML/Gaussian_Mixture_Model.ipynb)



For questions/comments, please contact [Krishnanand Mallayya](https://www.linkedin.com/in/krishnanandmallayya/)

---


# 1. Unsupervised learning (GMM) for diffraction data

In this notebook, we will explore unsupervised machine learning through clustering. Clustering allows us to discover inherent structures or groupings within data without prior knowledge or labels.

We will introduce Gaussian mixture model (GMM) clustering with the goal of identifying clusters of distinct parametric dependencies (such as temperature dependence) in a dataset.


The dataset in this notebook is derived from X-ray diffraction (XRD) intensities measured as a function of temperature. The data is already preprocessed and serves as an illustration of how to implement GMM clustering using scikit-learn.  

In the second notebook ([XTEC_with_GMM](https://github.com/CAMM-UTK/acns-AI-tutorial/blob/main/Diffraction_ML/XTEC_with_GMM.ipynb)), we will explore the actual X-ray diffraction temperature series data. There, we will demonstrate the preprocessing, clustering, and visualization of clustered diffraction patterns using the Python package X-ray Diffraction Temperature Clustering (X-TEC). The same approach can be extended to other types of data that evolve with a parameter, such as time or magnetic field.


### Imports and setup for running in Colab

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cm

import time

import sys

## 1.1 Load the data

In [None]:
# Download the data
!wget -O Signal_vs_T.csv "https://www.dropbox.com/scl/fi/9pqmngxzjkmcwl1wwm8jg/Signal_vs_T.csv?rlkey=q9curpt809bj5nupvdelg2600"

In [None]:
df = pd.read_csv('Signal_vs_T.csv', sep=' ', index_col=0,header=0)
df.columns.name='Temperature (K)'

In [None]:
display(df)

The data contains 56868 different signals, each measured at 19 different temperatures. The signals are rescaled XRD intensities, as you will see in the second notebook.

In [None]:
Temperature = df.columns.values.astype('float')
Signals= df.values.astype('float')

## 1.2 Visualize the data

Let's plot a few of these signals.

In [None]:

plt.figure(figsize=(5,5))


# plotting the first 5000 signals
for i in range(0,5000):
    plt.plot(Temperature,Signals[i])

plt.xlabel('$T$ (K) ',size=20)
plt.ylabel(r'Signals',size=20)
plt.xticks(np.arange(30, 310, 50), fontsize=15)
plt.yticks(fontsize=15)
plt.ylim([-3,2.5])




We can see some groups of distinct temperature dependencies. To illustrate visually how this translates to clustering with a Gaussian mixture model, let us first simplify the problem.  


Each signal in this dataset is a 19-dimensional vector. Let us visualize the signals in 2D by taking a 2D cross-section of the 19-dimensional space.

### Let's select two temperatures  

In [None]:
a=1
b=4

Ta = Temperature[a]
Tb = Temperature[b]

print(Ta,Tb)


Sa=Signals[:,a]
Sb=Signals[:,b]


### Plot this 2D cross-section

In [None]:
plt.figure(figsize=(6,5))

plt.scatter(Sa,Sb,s=3,alpha=0.2)
plt.xlim([-1,2.5])
plt.ylim([-0.8,1.5])
plt.xticks(np.arange(-1.0,2.5,1),fontsize=15);
plt.yticks(np.arange(-1.0,1.5,1),fontsize=15);

plt.xlabel(f'Signal($T$ ={Ta}K)',size=20);
plt.ylabel(f'Signal($T$ ={Tb}K)',size=20);

However, manually inspecting 2D cross-sections is not practical for higher-dimensional data. Instead, we can use a Gaussian mixture model to identify clusters in the original 19-dimensional space and reveal the underlying structure in the data.


## 1.3 Gaussian Mixture Model


GMM is a probabilistic model that assumes the data is generated from a mixture of a finite number of Gaussian distributions. Each Gaussian distribution represents a cluster, and the goal is to find the parameters of these distributions that best fit the data.
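
In symbols, the model above can be written as follows. This is the standard textbook form of a Gaussian mixture; the notation ($K$ for the number of clusters, $\pi_k$ for the mixing weights, $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$ for the mean and covariance of cluster $k$) is introduced here for illustration and is not taken from the notebook.

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right),
\qquad \pi_k \ge 0, \qquad \sum_{k=1}^{K} \pi_k = 1
```

For our data, each $\mathbf{x}$ is one 19-dimensional signal, so each $\boldsymbol{\mu}_k$ is a 19-component vector (one value per temperature) and each $\boldsymbol{\Sigma}_k$ is a $19 \times 19$ matrix. Fitting the GMM means estimating the $\pi_k$, $\boldsymbol{\mu}_k$, and $\boldsymbol{\Sigma}_k$ from the data.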

To apply GMM clustering to our 19-dimensional data, we can use the scikit-learn library in Python.

Ref: https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html

In [None]:
from sklearn.mixture import GaussianMixture


In [None]:
print('(num of data, num of features) = ', Signals.shape)

## 1.4 Specify number of clusters

Choosing the number of clusters is not always obvious, and requires some experimentation. Setting n_clusters too large can lead to overfitting, where the clustering algorithm fits noise or other irrelevant and overly specific clusters. It can also result in fragmented clusters that are redundant and only capture minor variations. On the other hand, if it is set too small, we will miss physically important clusters. Try these different cases!



Let's start with 4 clusters. Later in this notebook, we will explore a heuristic method that often comes in handy for determining the number of clusters.

In [None]:
n_clusters = 4

## 1.5 Fit the GMM

In [None]:
# Create a Gaussian Mixture Model
GMM = GaussianMixture(n_components=n_clusters, random_state=13)  # fixing random_state makes the results reproducible across runs

# Fit the GMM
GMM.fit(Signals);

# Predict the cluster assignments (labels) for each signal
cluster_assigns = GMM.predict(Signals)

In [None]:
print(cluster_assigns)
print(cluster_assigns.shape)

`cluster_assigns` contains integer labels from 0 to n_clusters-1, representing the cluster assignment for each data point.
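
Because a GMM is probabilistic, these hard labels are derived from soft assignments: each point has a probability of belonging to each cluster, which scikit-learn exposes through `predict_proba`. The following is a minimal sketch on synthetic data (two 2D blobs made up for illustration, not the notebook's `Signals`), showing how the two outputs relate.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the notebook's data: two well-separated 2D blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=-3.0, scale=0.5, size=(200, 2)),
    rng.normal(loc=3.0, scale=0.5, size=(200, 2)),
])

gmm_demo = GaussianMixture(n_components=2, random_state=0).fit(X)

hard_labels = gmm_demo.predict(X)       # one integer label per point
soft_probs = gmm_demo.predict_proba(X)  # one probability per (point, cluster)

# The hard label is the cluster with the largest probability for that point.
labels_from_probs = soft_probs.argmax(axis=1)

print(hard_labels.shape, soft_probs.shape)
print(np.array_equal(hard_labels, labels_from_probs))
```

Points whose largest probability is well below 1 sit between clusters, which can be a useful flag when interpreting the assignments.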

## 1.6 Visualize the clustered data

We will first create a list of discrete colors, one for each cluster  

In [None]:
# Generate a set of discrete colors
cluster_colors = cm.tab10(np.linspace(0, 1, n_clusters))

Now, let's go back to the 2D cross-section again and plot the points, but this time, we will color each point according to its assigned cluster label.

In [None]:
plt.figure(figsize=(6,5))

for i in range(n_clusters):
    mask = cluster_assigns == i
    plt.scatter(Sa[mask], Sb[mask],s=3,alpha=0.2, color=cluster_colors[i], label=f'Cluster {i+1}')

plt.xlim([-1,2.5])
plt.ylim([-0.8,1.5])
plt.xticks(np.arange(-1.0,2.5,1),fontsize=15);
plt.yticks(np.arange(-1.0,1.5,1),fontsize=15);
plt.xlabel(f'Signals ($T$ ={Ta} K)',size=20);
plt.ylabel(f'Signals ($T$ ={Tb} K)',size=20);

We can see the different clusters distinguished by their colors. Note that the GMM separates the structures based not just on the data distribution in this 2D cross-section, but on the distribution in the entire 19-dimensional space.

Going back to the signal vs. temperature plot, we can now color each signal by its cluster:

In [None]:
plt.figure(figsize=(5,5))


# plotting the first 5000 signals
for i in range(0,5000):
    plt.plot(Temperature,Signals[i],color=cluster_colors[cluster_assigns[i]],alpha=0.5)

plt.xlabel('$T$ (K) ',size=20)
plt.ylabel(r'Signals',size=20)
plt.xticks(np.arange(30, 310, 50), fontsize=15)
plt.yticks(fontsize=15)
plt.ylim([-3,2.5])


### We now see the clusters of distinct temperature dependencies

Each cluster in the GMM is described by a Gaussian with a mean vector and a covariance matrix in the 19-dimensional space. We can plot the distinct temperature dependencies using each cluster's mean and standard deviation (the square root of the diagonal elements of its covariance matrix).
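
The layout of `covariances_` depends on the `covariance_type` argument of `GaussianMixture`. With the default `"full"`, each cluster has a full matrix and the per-feature spread is read from its diagonal; with `"diag"`, only the diagonal is stored. The sketch below uses synthetic random data (an assumption for illustration, not the notebook's `Signals`) to show both layouts.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data for illustration: 300 points with 5 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))

gmm_full = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
gmm_diag = GaussianMixture(n_components=2, covariance_type="diag", random_state=0).fit(X)

print(gmm_full.covariances_.shape)  # (2, 5, 5): one full matrix per cluster
print(gmm_diag.covariances_.shape)  # (2, 5): only the variances per cluster

# Per-feature standard deviation of every cluster, for each layout.
std_full = np.sqrt(np.diagonal(gmm_full.covariances_, axis1=1, axis2=2))
std_diag = np.sqrt(gmm_diag.covariances_)
print(std_full.shape, std_diag.shape)
```

Taking `np.diag` of `covariances_[i]` therefore relies on the default `"full"` setting.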

In [None]:
GMM_means=GMM.means_
GMM_cov=GMM.covariances_

In [None]:
plt.figure(figsize=(5,5))

for i in range(n_clusters):

    cluster_mean = GMM.means_[i]
    cluster_std = np.sqrt(np.diag(GMM.covariances_[i]))  # standard deviation: square root of the diagonal of the covariance matrix


    plt.plot(Temperature,cluster_mean,lw=3,color=cluster_colors[i])
    plt.gca().fill_between(Temperature,cluster_mean-cluster_std,cluster_mean+cluster_std,color=cluster_colors[i],alpha=0.5)


plt.xlabel('$T$ (K)',size=20)
plt.ylabel(r'Signals',size=20)
plt.xticks(np.arange(30, 310, 50), fontsize=15)
plt.yticks(fontsize=15)
plt.ylim([-3,2.5])


## 1.7 Experimenting with the number of clusters

Choosing the appropriate number of clusters requires some experimentation and visualization of the results. One heuristic approach is to use the Bayesian Information Criterion (BIC) score.



Let us use the BIC score to estimate the optimal number of clusters:

We'll fit GMMs with varying numbers of components (clusters) and calculate the BIC score of each model; a lower BIC score indicates a better trade-off between fit quality and model complexity. We'll then plot the BIC scores against the number of components. The optimal number of clusters is typically indicated by the 'elbow' point in this plot, where the rate of BIC score improvement begins to level off.
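
To make the score less of a black box: BIC combines the log-likelihood of the fitted model with a penalty that grows with the number of free parameters, BIC = -2 ln(L) + p ln(N), where L is the likelihood, p the number of free parameters, and N the number of data points. The sketch below recomputes it by hand on synthetic data (two 3D blobs made up for illustration) and compares it with `GaussianMixture.bic`. The parameter count assumes the default `covariance_type="full"`.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data for illustration: two blobs in 3 dimensions.
rng = np.random.default_rng(2)
X = np.vstack([
    rng.normal(loc=-2.0, scale=0.7, size=(150, 3)),
    rng.normal(loc=2.0, scale=0.7, size=(150, 3)),
])
n_samples, n_features = X.shape
k = 2  # number of mixture components

gmm_demo = GaussianMixture(n_components=k, covariance_type="full", random_state=0).fit(X)

# Free parameters: k mean vectors, k symmetric covariance matrices,
# and k - 1 independent mixing weights (the weights sum to 1).
n_params = k * n_features + k * n_features * (n_features + 1) // 2 + (k - 1)

# score() returns the mean log-likelihood per sample, so multiply by N.
log_likelihood = gmm_demo.score(X) * n_samples
bic_manual = -2.0 * log_likelihood + n_params * np.log(n_samples)

print(bic_manual, gmm_demo.bic(X))
```

Adding clusters always raises the likelihood, so the penalty term is what stops the score from favoring ever larger models.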



In [None]:
# Define the range of n_clusters
n_clusters_list = list(range(2, 10))

bic_scores = []

for n_c in n_clusters_list:
    gmm = GaussianMixture(n_components=n_c,random_state=11) # fixing the random_state ensures results are same in all runs
    gmm.fit(Signals)
    bic_scores.append(gmm.bic(Signals))  # this calculates the BIC score for this GMM



In [None]:
# Plot BIC scores
plt.figure(figsize=(10, 5))
plt.plot(n_clusters_list, bic_scores, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('BIC score')
plt.show()

The BIC score shows an "elbow": beyond this point, adding more clusters improves the score only marginally.

### Try the clustering with n_clusters = 3, 5, 6 etc


## 1.8 Next Notebook: [XTEC with GMM](https://github.com/CAMM-UTK/acns-AI-tutorial/blob/main/Diffraction_ML/XTEC_with_GMM.ipynb)