<p align="center">
    <img src="https://github.com/GeostatsGuy/GeostatsPy/blob/master/TCG_color_logo.png?raw=true" width="220" height="240" />

</p>

## Investigating the performance of K-means Clustering and Gaussian Mixture Models


#### Pallavi Sahu, first-year PhD student
#### Hildebrand Department of Petroleum and Geosystems Engineering, UT Austin
 

### Subsurface Machine Learning Course, The University of Texas at Austin
#### Hildebrand Department of Petroleum and Geosystems Engineering, Cockrell School of Engineering
#### Department of Geological Sciences, Jackson School of Geosciences

Workflow supervision and review by:

#### Instructor: Prof. Michael Pyrcz, Ph.D., P.Eng., Associate Professor, The University of Texas at Austin
##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig)  | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)

#### Course TA: Misael Morales, Graduate Student, The University of Texas at Austin
##### [LinkedIn](https://www.linkedin.com/in/misaelmmorales/)



### Executive Summary


This notebook investigates the point at which k-means clustering breaks down on data with dissimilar clusters. It is an interactive demonstration of two clustering methods on synthetically generated data consisting of two clusters. Two sliders control the second cluster relative to the first: the size ratio (number of samples) and the axis ratio (spread of the data along the second feature). The workflow compares the performance of k-means clustering and Gaussian mixture models as the relative size and spread of the clusters change. When the two clusters differ strongly in size or spread, the k-means algorithm fails to cluster the data accurately. We recommend plotting the data first to see its general distribution, and using clustering algorithms cautiously, with their limitations in mind.


#### K-Means Algorithm and Gaussian Mixture Models for Clustering

Here's a simple workflow that demonstrates the assumptions of k-means clustering. We use:
* the k-means algorithm and Gaussian mixture models for clustering
* synthetically generated data consisting of two clusters
* an assumption of equal prior probability for all clusters in the dataset

#### k-Means Clustering

K-means clustering is primarily applied as an unsupervised method for assigning samples to groups. For this workflow, however, we use a labelled dataset consisting of two clusters. We apply the k-means algorithm to find clusters in the dataset while ignoring the labels, and then review its performance by comparing the result with the labelled clusters. The aim is to check how well the method clusters the data when we depart from the assumptions of the algorithm.

Assumptions of k-means clustering:
* Spherical, convex, isotropic clusters. The algorithm minimizes the differences within clusters.
* Equal variance for all features
* Clusters of similar size (frequency)
* Equal prior probability for all clusters
 
Advantages
* Relatively simple to implement.
* Scales to large data sets.
* Guarantees convergence (to a local minimum, which depends on the initial centroids).
* Can warm-start the positions of centroids.
* Easily adapts to new examples.

Disadvantages
* k-means has trouble clustering data where clusters have varying sizes and densities.
* Centroids can be dragged by outliers, or outliers might get their own cluster instead of being ignored.


We use the KMeans class from the scikit-learn library to cluster the randomly generated data:

    Kmean = KMeans(n_clusters=2)
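
As a minimal sketch of how that call might be used, the example below fits k-means to two equally sized, isotropic clusters, the case that matches the assumptions listed above. The random seed and the `n_init` and `random_state` arguments are illustrative assumptions, not part of the original workflow.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)  # fixed seed (an assumption, for repeatability)

# two isotropic clusters of equal size and spread
x1 = rng.normal(loc=(5, 5), scale=(1, 1), size=(500, 2))
x2 = rng.normal(loc=(10, 1), scale=(1, 1), size=(500, 2))
X = np.vstack([x1, x2])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_              # cluster index (0 or 1) for every sample
centroids = kmeans.cluster_centers_  # array of shape (2, 2), one centroid per row

print(centroids)
print(kmeans.inertia_)               # within-cluster sum of squared distances
```

The cluster indices are arbitrary: cluster 0 in the output need not correspond to the first generated cluster.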

#### Gaussian Mixture Models

A Gaussian mixture is a function composed of several Gaussians, each identified by k ∈ {1,…, K}, where K is the number of clusters in our dataset. In one dimension the probability density function of a Gaussian distribution is given by:

\begin{equation}
G(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)
\end{equation}

where $\mu$ and $\sigma^2$ are respectively the mean and variance of the distribution.
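
As a quick numerical check, the sketch below codes the standard one-dimensional normal density, $\exp(-(x-\mu)^2 / (2\sigma^2)) / (\sigma\sqrt{2\pi})$, and compares it with `scipy.stats.norm.pdf`. The function name `gaussian_pdf` and the evaluation points are hypothetical choices for illustration.

```python
import numpy as np
from scipy.stats import norm

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density with mean mu and standard deviation sigma."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

x = np.linspace(-3.0, 3.0, 7)                 # a few evaluation points
manual = gaussian_pdf(x, mu=0.0, sigma=1.0)   # hand-coded density
reference = norm.pdf(x, loc=0.0, scale=1.0)   # SciPy's implementation

print(np.allclose(manual, reference))
```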


Each Gaussian k in the mixture has the following parameters:
* A mean μ that defines its centre.
* A covariance Σ that defines its width. In a multivariate scenario this is equivalent to the dimensions of an ellipsoid.
* A mixing probability π that defines how big or small the Gaussian function will be.

Because each Gaussian has its own centre, width and weight, the mixture works well on datasets whose clusters differ in size and shape.
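
These parameters combine into a single density. In standard notation (the symbol $\mathcal{N}$ for the Gaussian density is a choice made here, not taken from the original), the mixture is a weighted sum of the K Gaussians, with weights that are non-negative and sum to one:

\begin{equation}
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \geq 0
\end{equation}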


Advantages:

* Does not assume spherical clusters. With a full covariance matrix each Gaussian can take an elongated, ellipsoidal shape.
* Does not bias the clusters toward a specific structure, as k-means does (circular).

Disadvantages:

* Uses all the components it has access to, so the number of components must be chosen with care. Initializing the clusters is also difficult when the dimensionality of the data is high.
* Difficult to interpret.


The sklearn.mixture package is used to learn Gaussian mixture models (diagonal, spherical, tied and full covariance matrices are supported), sample from them, and estimate them from data.

                                  gmm = GMM(n_components=2).fit(X)
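
The sketch below shows one way the fitted model might be inspected, assuming two synthetic clusters of unequal size and spread. The seed, the cluster sizes and the explicit `covariance_type="full"` (the scikit-learn default) are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)  # fixed seed (an assumption, for repeatability)

# one compact cluster and one larger cluster stretched along the second axis
x1 = rng.normal(loc=(5, 5), scale=(1, 1), size=(500, 2))
x2 = rng.normal(loc=(10, 1), scale=(1, 3), size=(1500, 2))
X = np.vstack([x1, x2])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

print(gmm.weights_)             # mixing probabilities, one per component
print(gmm.means_)               # component centres, shape (2, 2)
print(gmm.covariances_.shape)   # (2, 2, 2): one 2x2 covariance matrix per component

resp = gmm.predict_proba(X)     # soft assignment: probability of each component per sample
hard = gmm.predict(X)           # hard assignment: the most probable component
```

Unlike k-means, the model returns a probability for each component, so samples between the clusters can be flagged as uncertain rather than forced into one group.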


  

##### Generating Synthetic Data
We use numpy.random.normal from the NumPy library to draw random samples from a normal (Gaussian) distribution. The synthetic data are drawn from two normal (Gaussian) distributions.

    numpy.random.normal(loc=0.0, scale=1.0, size=500)

Parameters:
* loc, float or array_like of floats: mean ("centre") of the distribution.
* scale, float or array_like of floats: standard deviation (spread or "width") of the distribution. Must be non-negative.
* size, int or tuple of ints: output shape, that is, the size of the dataset.

For this workflow:
* The two clusters are centred at (5,5) and (10,1). The user can change these values.
* The spread (standard deviation) of the first cluster is fixed at (1,1). The spread of the second cluster relative to the first is set with a slider.
* The size of the first cluster is fixed at 500 samples. The size of the second cluster relative to the first is set with a slider.
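
A minimal sketch of this generation step follows. The variable names `size_ratio` and `axis_ratio` and their values stand in for the slider settings and are assumptions for illustration. Passing a two-element `loc` and `scale` with an output shape of `(n, 2)` makes NumPy broadcast them, so each column gets its own mean and standard deviation.

```python
import numpy as np

np.random.seed(0)    # fixed seed (an assumption, for repeatability)

n = 500              # size of the first cluster (fixed)
size_ratio = 2.0     # hypothetical slider value: second cluster has twice as many samples
axis_ratio = 3.0     # hypothetical slider value: second cluster is 3x wider along axis 2
m = int(n * size_ratio)

x1 = np.random.normal((5, 5), (1, 1), (n, 2))            # centred at (5, 5), spread (1, 1)
x2 = np.random.normal((10, 1), (1, axis_ratio), (m, 2))  # centred at (10, 1), spread (1, 3)

print(x1.shape, x2.shape)
print(x2.std(axis=0))   # sample standard deviation of each column
```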

References:

* https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1
* https://scikit-learn.org/stable/modules/mixture.html
* https://github.com/GeostatsGuy
* https://medium.com/@yara.ahmed.amin/gaussian-mixture-model-4c71342b67d3

#### Load the Required Libraries

The following code loads the required libraries.

In [5]:
%matplotlib inline
from ipywidgets import interactive                        # widgets and interactivity
from ipywidgets import widgets                            # widgets and interactivity
from IPython.display import display                       # show the widgets and the plot output
import matplotlib                                         # plotting
import matplotlib.pyplot as plt                           # plotting
from matplotlib.ticker import (MultipleLocator, AutoMinorLocator) # control of axes ticks
plt.rc('axes', axisbelow=True)                            # set axes and grids in the background for all plots
import numpy as np                                        # working with arrays
import pandas as pd                                       # working with DataFrames
from sklearn.cluster import KMeans                        # k-means clustering
from scipy.spatial.distance import cdist                  # pairwise distances
from sklearn.mixture import GaussianMixture as GMM        # Gaussian mixture models
from sklearn.utils import shuffle                         # shuffle the samples
from sklearn import mixture                               # mixture models module
from sklearn.metrics import accuracy_score                # compare cluster labels with true labels

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

#### Make Our Interactive Plot 

For this demonstration we will: 

* declare a set of 2 widgets in a HBox (horizontal box of widgets). 


* define a function 'f' that will read the output from these widgets and make a plot

You may have some flicker and lag.  I have not tried to optimize performance for this demonstration. 

In [6]:
# 2 slider bars for the model input
# 1- The size of the first cluster is fixed. The size of the second cluster relative to the first is set by the size ratio.
# 2- The spread of the first cluster is fixed. The spread of the second cluster relative to the first is set by the axis ratio.
a = widgets.FloatSlider(min=1, max=100.0, value=1.0, description='Size Ratio', continuous_update=False)
b = widgets.FloatSlider(min=0.1, max=10.0, value=0.5, step=0.01, description='Axis Ratio', continuous_update=False)

ui = widgets.HBox([a,b],)

# function to make the plot  
def f(a, b):    
     
    n = 500                      # size of the first cluster (fixed)
    m = int(n * a)               # size of the second cluster, set by the Size Ratio slider
    spread_cluster_1 = (1, 1)    # standard deviation of the first cluster along each axis (fixed)
    spread_cluster_2 = (1, b)    # standard deviation of the second cluster; the Axis Ratio slider sets the second axis
    
    # generate the synthetic datasets
    x1 = np.random.normal((5, 5), spread_cluster_1, (n, 2))
    x2 = np.random.normal((10, 1), spread_cluster_2, (m, 2))

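    # append the true labels (1 for the first cluster, 0 for the second), then shuffle the samples
    X1 = np.append(x1, np.ones([len(x1), 1]), 1)
    X2 = np.append(x2, np.zeros([len(x2), 1]), 1)
    X = np.concatenate((X1, X2), axis=0)
    df = pd.DataFrame(X)
    shuffled = shuffle(df)
    df = shuffled.iloc[:, [0, 1]]    # keep only the two features; the labels are not passed to the clustering methods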
    
    # use the k-means algorithm to find clusters
    kmeans = KMeans(n_clusters=2)
    kmeans.fit(df)
    k_mean_labels = kmeans.labels_
    
    # use a Gaussian mixture model to find clusters
    gmm = GMM(n_components=2)
    gmm.fit(df)
    gmm_labels = gmm.predict(df)

    
    
    # plot the original data and the two clustering results
    plt.subplot(131)
    plt.scatter(x1[:,0], x1[:,1], label='x1')
    plt.scatter(x2[:,0], x2[:,1], label='x2')
    plt.title('Original Labelled Data')
    plt.xlabel('Property 1') 
    plt.ylabel('Property 2')
    
    plt.subplot(132)
    plt.scatter(df[0], df[1], c=kmeans.labels_)
    plt.title('Clusters by K-means Clustering')
    plt.xlabel('Property 1') 
    plt.ylabel('Property 2')
    
    plt.subplot(133)
    plt.scatter(df[0], df[1], c=gmm_labels)
    plt.title('Clusters by Gaussian Mixture Models')
    plt.xlabel('Property 1') 
    plt.ylabel('Property 2')

    
    plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=0.8, wspace=0.2, hspace=0.1)
    plt.show()


interactive_plot = widgets.interactive_output(f, {'a': a, 'b': b})
interactive_plot.clear_output(wait = True)                # reduce flickering by delaying plot updating


### Results

This is an interactive plot. The size and spread of the second cluster can be changed with the sliders. The three plots show:
1) the original labelled data
2) the clusters found by the k-means algorithm
3) the clusters found by Gaussian mixture models

In [7]:
display(ui, interactive_plot)


### Inference

We can change the relative size and spread of the two clusters with the sliders. When we violate the k-means assumption that the clusters have similar size (frequency) and spread, by increasing the size ratio and the axis ratio, the accuracy of the k-means clustering decreases. This illustrates the limitation of the k-means algorithm when clustering datasets with dissimilar clusters.

However, Gaussian mixture models, which do not assume spherical clusters of similar size, work well on dissimilar clusters with different sizes and non-spherical shapes.
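
The visual comparison can also be expressed as a number. The sketch below is one possible way to do so, not part of the original workflow: it regenerates the two clusters, keeps the true labels, and scores each method with `accuracy_score`. Because cluster indices are arbitrary, the helper `clustering_accuracy` (a hypothetical name) takes the better of the two possible label assignments. The function `compare`, the seed and the slider values tried are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import accuracy_score

def clustering_accuracy(true_labels, cluster_labels):
    """Accuracy for two clusters, using whichever label assignment matches better."""
    acc = accuracy_score(true_labels, cluster_labels)
    return max(acc, 1.0 - acc)   # swapping labels 0 and 1 turns acc into 1 - acc

def compare(size_ratio, axis_ratio, n=500, seed=0):
    """Return (k-means accuracy, GMM accuracy) for one setting of the two ratios."""
    rng = np.random.default_rng(seed)
    m = int(n * size_ratio)
    x1 = rng.normal((5, 5), (1, 1), (n, 2))
    x2 = rng.normal((10, 1), (1, axis_ratio), (m, 2))
    X = np.vstack([x1, x2])
    y = np.concatenate([np.ones(n, dtype=int), np.zeros(m, dtype=int)])  # true labels

    km_labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X)
    gm_labels = GaussianMixture(n_components=2, random_state=seed).fit_predict(X)
    return clustering_accuracy(y, km_labels), clustering_accuracy(y, gm_labels)

# similar clusters first, then a strongly dissimilar pair
for size_ratio, axis_ratio in [(1, 1), (10, 5)]:
    km_acc, gm_acc = compare(size_ratio, axis_ratio)
    print(f"size ratio {size_ratio}, axis ratio {axis_ratio}: "
          f"k-means {km_acc:.3f}, GMM {gm_acc:.3f}")
```

The printed values depend on the seed and on the ratios chosen, so they are best read as a trend across several settings rather than as fixed results.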



I hope this was helpful,

*Pallavi Sahu*


___________________

#### Work Supervised by:

### Michael Pyrcz, Associate Professor, University of Texas at Austin 
*Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions*

With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers' and geoscientists' impact in subsurface resource development. 

For more about Michael check out these links:

#### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig)  | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)

#### Want to Work Together?

I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.

* Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you! 

* Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!

* I can be reached at mpyrcz@austin.utexas.edu.

I'm always happy to discuss,

*Michael*

Michael Pyrcz, Ph.D., P.Eng. Associate Professor The Hildebrand Department of Petroleum and Geosystems Engineering, Bureau of Economic Geology, The Jackson School of Geosciences, The University of Texas at Austin
