

<p align="center">
    <img src="https://github.com/GeostatsGuy/GeostatsPy/blob/master/TCG_color_logo.png?raw=true" width="220" height="240" />

</p>


## Interactive Workflow for Understanding Different Clustering Methods (Educational Tool)

#### Syed Talha Tirmizi (UT EID: st35345)
#### Hildebrand Department of Petroleum and Geosystems Engineering, Cockrell School of Engineering

### Subsurface Machine Learning Course, The University of Texas at Austin





_____________________

Workflow supervision and review by:

#### Instructor: Prof. Michael Pyrcz, Ph.D., P.Eng., Associate Professor, The University of Texas at Austin
##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig)  | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)

#### Course TA: Ademide Mabadeje, Graduate Student, The University of Texas at Austin
##### [LinkedIn](https://www.linkedin.com/in/ademidemabadeje/)


### Executive Summary

* What is the gap, problem, opportunity, scientific question?

There are many clustering methods to choose from; the popular scikit-learn library alone offers around thirteen. As with most questions in data science and machine learning, the best choice depends on the data. This workflow is an educational tool that helps new learners understand how clustering methods differ, with the flexibility to change the data. Learners therefore see both how the methods differ from one another and how each method responds to different datasets.

* What was done to address the above?

An interactive workflow was developed as an educational tool for understanding clustering methods. It incorporates three clustering methods:

(a) DBSCAN (Density-Based Spatial Clustering of Applications with Noise), a density-based clustering method.

(b) K-means clustering, a centroid-based clustering method.

(c) Hierarchical Agglomerative Clustering, a connectivity-based clustering method. Ward linkage is used here: at each step it merges the pair of clusters that gives the smallest increase in within-cluster variance, rather than comparing distances between individual points. It is most suitable for quantitative variables.

* What was learned?

The choice of clustering method cannot be made without understanding the dataset. The number of data points can change which method is preferable. Each method also forms clusters according to its own assumptions: k-means assumes roughly spherical clusters, which is undesirable when the dataset contains moon-shaped and blob-shaped groups close to each other. DBSCAN handles such shapes better, but its results deteriorate as the moon-shaped clusters become more spread out.

* What are your recommendations?

Teaching different clustering methods with a single dataset does not give students a comprehensive understanding. Different datasets should therefore be used with each clustering method so that students can understand the methods in depth, helping them decide which one to use.

### Import Packages

Here we import several Python packages, each of which is a collection of modules. The purpose of each package is given in the comments below.

In [1]:
import numpy as np                        # ndarrays for gridded data
import pandas as pd                       # DataFrames for tabular data
import matplotlib.pyplot as plt           # for plotting
from sklearn.neighbors import NearestNeighbors # nearest neighbours function to calculate eps hyperparameter
from sklearn.preprocessing import MinMaxScaler # min/max normalization
from sklearn.cluster import KMeans        # k-means clustering
from sklearn.cluster import DBSCAN        # DBSCAN clustering
from sklearn.cluster import AgglomerativeClustering #Hierarchical Agglomerative Clustering
import scipy.cluster.hierarchy as sch     # for dendrogram
from sklearn.datasets import make_moons   # import moons dataset
from sklearn.datasets import make_blobs   # import blobs dataset
import warnings                           # to control warning messages
import time                               # to measure execution time
from ipywidgets import Button, Layout     # for interactive workflow development
import ipywidgets as widgets              # for interactive workflow development
# enable inline plotting in the notebook
%matplotlib inline
warnings.filterwarnings("ignore")         # suppress unnecessary warning messages

### Defining Functions

In brief, a function is a block of code that performs a particular task and can be called whenever required. Three functions are defined below.
* The first function creates a dataset based on the user's preferences. This gives the user the flexibility to modify the data and understand the clustering methods more comprehensively.
* The second function plots the points assigned to one cluster and labels them for the plot's legend.
* After a clustering method has been executed, its results are displayed as a plot. The third function creates this plot. It contains a while loop that calls the second function (cluster_grouping) once per cluster.

In [2]:
# def keyword is used to define the function 
# def function_name(arguments):

def make_data(Sld_blob_noOfdata, Sld_moon_noOfdata):
    # make_blobs draws isotropic Gaussian blobs, which appear as roughly circular groups of data points in the plots.
    blob = make_blobs(Sld_blob_noOfdata, centers = 2, center_box = (-1.7, 1.7), cluster_std = 0.1, random_state = 19) # function from sklearn, https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html
    # make_moons creates data points in two interleaving half-circle (moon) shapes
    moons = make_moons(Sld_moon_noOfdata, noise = 0.08, random_state = 19)
    # X is declared global so that the other functions and the widget callbacks can use it.
    # X is the user-generated dataset that is passed to the clustering methods
    global X
    # np.vstack() stacks the arrays vertically (row-wise); element [0] of each returned tuple holds the coordinates
    X = np.vstack((moons[0], blob[0]))
    # above function has been learnt from Saul, N.: https://github.com/scikit-learn-contrib/hdbscan/blob/master/docs/how_hdbscan_works.rst
    
    
def cluster_grouping(index, y_c):                          # plots the points of one cluster
    # cluster_label is displayed in the legend of the plot to identify each cluster by its color
    cluster_label = 'Cluster ' + str(index+1)
    # plt.scatter(x values, y values, s = marker size, c = color, label = legend entry)
    # a random RGB color is drawn for every cluster, so the colors change from run to run
    plt.scatter(X[y_c == index, 0], X[y_c == index, 1],
                s = 100, c = [np.random.rand(3,)], label = cluster_label)
    

def visualize_graphs(n_clusters, y_c):      # for creating the cluster graph
    plt.figure(figsize=((12,8)));           # creates figure object
    index = 0;                              # index is set to zero, index will be used for while loop till the total number of clusters
    while index < n_clusters[0]:
        cluster_grouping(index, y_c)        # function is called with two arguments
        index = index + 1                   # increment in index value by 1
    plt.title(dropdown.value)               # plot title is set to the name of the clustering method chosen in dropdown box
    plt.xlabel('X')                         # label for x axis
    plt.ylabel('Y')                         # label for y axis
    plt.text(1, -1, r"Number of predicted clusters: " + str(n_clusters)) # text will be displayed inside the plot area to indicate the number of predicted clusters
    plt.text(0.99,0.01,("%.2fs" % (t1 - t0)).lstrip("0"), transform=plt.gca().transAxes, size=15, horizontalalignment="right") # to display the execution time of the clustering method, Orbifold consulting (https://orbifold.net/default/outlier-detection/)
    plt.legend()                            # displays legend in the plot
    plt.show()                              # displays the plot
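
As a quick check of the data-generation step, the sketch below builds the same kind of dataset without a global variable. The function name `make_data_array`, the `seed` argument, and the sample counts are illustrative assumptions, not part of the workflow; the `make_blobs` and `make_moons` settings are copied from `make_data` above.

```python
import numpy as np
from sklearn.datasets import make_blobs, make_moons


def make_data_array(n_blob, n_moon, seed=19):
    """Hypothetical variant of make_data that returns the array instead of setting a global."""
    # two small Gaussian blobs with centers drawn inside the box (-1.7, 1.7)
    blob_xy, _ = make_blobs(n_samples=n_blob, centers=2, center_box=(-1.7, 1.7),
                            cluster_std=0.1, random_state=seed)
    # two interleaving half circles with a little Gaussian noise
    moon_xy, _ = make_moons(n_samples=n_moon, noise=0.08, random_state=seed)
    # stack moons first, then blobs, as the workflow does
    return np.vstack((moon_xy, blob_xy))


X_demo = make_data_array(n_blob=100, n_moon=200)
print(X_demo.shape)  # one row per sample, two coordinate columns
```

Returning the array makes the function easier to test in isolation; the workflow itself keeps the global `X` so that the button callbacks can share it.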

### Setting up the widgets and layout of the Interactive Workflow using ipywidgets

ipywidgets is a Python library of interactive HTML widgets for the Jupyter Notebook; they are also called Jupyter widgets or simply widgets.

Interactive widgets offer an immersive learning experience, which is important for an effective educational tool. They let the user control the data and visualize it with plots. The user can also see how changing the data affects the different clustering methods and how the methods differ in the way they form clusters.



Below, we use several types of widgets:
* slider
* dropdown
* buttons

The following code block uses code learned from videos by Qiusheng Wu in the playlist [Youtube: Geographic Software Design](https://www.youtube.com/playlist?list=PLAxJ4-o7ZoPeUqGpMhvJoVk5G-TrvMAd-) and from the Jupyter Widgets [Widget List](https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20List.html).

The code block below also includes the elbow method and the dendrogram method for finding the number of clusters for k-means clustering and hierarchical agglomerative clustering, respectively.

The elbow method finds a suitable number of clusters by fitting the model over a range of K values and plotting the sum of squared distances against K; a suitable K is where the curve bends and further increases in K give only small reductions.
The dendrogram method is used for Hierarchical Agglomerative Clustering (HAC) to find a suitable number of clusters from the tree diagram that represents the hierarchical relationships between subsets of the data. In it, the height at which two branches join shows the distance between the clusters being merged.
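
The two ideas can be tried outside the widgets with a minimal sketch. The three hand-picked blob centers, the sample count, and the names `X_demo`, `inertias`, `Z` and `labels_3` are assumptions made for illustration; the workflow itself runs these steps on the user-generated data.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# illustrative data: three well-separated blobs (centers chosen by hand)
X_demo, _ = make_blobs(n_samples=150, centers=[[-2, -2], [0, 2], [2, -2]],
                       cluster_std=0.3, random_state=19)

# elbow method: sum of squared distances (inertia) for K = 1..8
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42).fit(X_demo)
    inertias.append(km.inertia_)
for k, value in enumerate(inertias, start=1):
    print(f"K = {k}: inertia = {value:.1f}")  # look for the K after which the drop flattens

# dendrogram method: Ward linkage matrix, one row per merge
Z = sch.linkage(X_demo, method="ward", metric="euclidean")
print("last five merge heights:", np.round(Z[-5:, 2], 2))  # a large jump suggests where to cut

# cut the tree so that at most three clusters remain
labels_3 = sch.fcluster(Z, t=3, criterion="maxclust")
print("clusters after the cut:", len(set(labels_3)))
```

Printing the values instead of plotting keeps the sketch independent of the notebook display; `sch.dendrogram(Z)` draws the same tree that the workflow shows.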

In [3]:
widget_width = "800px" #widget width
padding = "0px 0px 0px 4px" #upper, right, bottom, left
layout = widgets.Layout(width='auto', height='30px') #layout used for buttons

blobs_slider = widgets.IntSlider( # slider used to define the number of data points in blob shapes
    min = 0,                      # minimum value of the slider
    max = 400,                    # maximum value of the slider
    description = "Number of samples in blobs shape :",  # description used to display for the slider
    readout = True,               # shows the present value of the slider beside it
    continuous_update = True,     # updates the value continuously while the slider is dragged (False would wait for mouse release)
    layout = widgets.Layout(width="400px", padding=padding), # sets the layout and padding value of the slider
    style={"description_width": "initial"},   # full description text is shown
)

moons_slider = widgets.IntSlider( # slider used to define the number of data points in moon shapes
    min = 0,                      # minimum value of the slider
    max = 400,                    # maximum value of the slider
    description = "Number of samples in moons shape :", # description used to display for the slider
    readout = True,               # shows the present value of the slider beside it
    continuous_update = True,     # updates the value continuously while the slider is dragged (False would wait for mouse release)
    layout = widgets.Layout(width="400px", padding=padding), # sets the layout and padding value of the slider
    style={"description_width": "initial"},   # full description text is shown
    
)

kmeans_int_slider = widgets.IntSlider(  # slider used to define the number of clusters to be used
    value = 1,                          # starting value of the slider
    min = 1,                            # minimum value of the slider (K-means and HAC need at least one cluster)
    max = 20,                           # maximum value of the slider
    description = "Number of clusters in K-means/HAC :",   # description used to display for the slider
    readout = True,                     # shows the present value of the slider beside it
    continuous_update = True,           # updates the value continuously while the slider is dragged (False would wait for mouse release)
    layout = widgets.Layout(width="400px", padding=padding), # sets the layout and padding value of the slider
    style={"description_width": "initial"}, # full description text is shown
    
)

dbscan_float_slider = widgets.FloatSlider(   # slider used to define the eps value to be used in DBSCAN
    value = 0.1,                        # starting value of the slider
    min = 0.01,                         # minimum value of the slider (DBSCAN requires eps > 0)
    max = 1,                            # maximum value of the slider
    step = 0.01,                        # increment of the slider
    description = "eps (DBSCAN only) :",  # description used to display for the slider
    readout = True,                     # shows the present value of the slider beside it
    continuous_update = True,           # updates the value continuously while the slider is dragged (False would wait for mouse release)
    layout = widgets.Layout(width="400px", padding=padding), # sets the layout and padding value of the slider
    style={"description_width": "initial"}, # full description text is shown
    
)

dropdown = widgets.Dropdown(             #used to define the dropdown menu
    options=["DBSCAN", "Kmeans", "Hierarchical Agglomerative Clustering"],  # set the value for the dropdown
    value=None,                # value to be shown in the dropdown initially
    description="Clustering Method: ",   # description used to display for the dropdown
    style={"description_width": "initial"}, # full description text is shown
    layout=widgets.Layout(width="400px")  # layout in which the width of the dropdown is defined
)


# Buttons are learnt from https://medium.com/@jdchipox/how-to-interact-with-jupyter-33a98686f24e

#initiating the button with its layout values
submit_data_btn = widgets.Button(description='Submit Data', button_style = "info", layout=layout, display='flex', flex_flow='column', align_items='stretch')

# function to be executed when the button is pressed
def submit_data_btn_clicked(_):
    with output:
        output.clear_output()
        Sld_blob_noOfdata = blobs_slider.value    # extract the value of the blob slider
        Sld_moon_noOfdata = moons_slider.value    # extract the value of the moon slider
        make_data(Sld_blob_noOfdata, Sld_moon_noOfdata);  #execute the function to generate the data
        print("Data has been submitted successfully")   # display message to confirm the successful execution

submit_data_btn.on_click(submit_data_btn_clicked)  # this links the button with its respective function

#initiating the button with its layout values
run_btn = widgets.Button(description='Run', button_style = "info", layout=layout, display='flex', flex_flow='column', align_items='stretch')

# function to be executed when the button is pressed
def run_btn_clicked(_):
    with output:
        output.clear_output()
        plt.clf()     # this clears the graphs if displayed previously
        global t1, t0 # set the global variable for the time of execution
        t0 = time.time() # store the initial time of execution
        n_clusters_submitted = kmeans_int_slider.value   # submits the number of clusters to be used in clustering methods
        #below clustering methods from https://cprosenjit.medium.com/8-clustering-methods-from-scikit-learn-we-should-know-e2ff7ee9ca18 
        if dropdown.value == "DBSCAN":    
            dbscan = DBSCAN(eps = dbscan_float_slider.value, metric='euclidean') # creates the scikit-learn DBSCAN estimator with the eps chosen on the slider
            y_c = dbscan.fit_predict(X)  # fit the data to DBSCAN   
            labels = dbscan.labels_ # load the generated cluster indices into labels 
            n_clusters = len(set(labels)) - (1 if -1 in labels else 0) # calculate the number of clusters
            t1 = time.time() # store the final time of execution
            visualize_graphs([n_clusters], y_c) # executes the function to generate the figure to display the clusters
                
        elif dropdown.value == "Kmeans":    
            kmeans = KMeans(n_clusters = n_clusters_submitted,  # number of clusters chosen with the slider
                            init = 'k-means++',                 # k-means++ seeding spreads out the initial centroids
                            random_state = 42)                  # fixed seed so the centroid initialization is reproducible
            y_c = kmeans.fit_predict(X)  # fit the data to K-means
            t1 = time.time() # store the final time of execution
            visualize_graphs([n_clusters_submitted], y_c) # executes the function to generate the figure to display the clusters
                
        elif dropdown.value == "Hierarchical Agglomerative Clustering": 
            # Ward linkage merges the pair of clusters that gives the smallest increase in within-cluster variance.
            # It only supports Euclidean distance, which is the default, so no distance argument is passed
            # (the affinity argument was renamed to metric in recent scikit-learn releases).
            hc = AgglomerativeClustering(n_clusters = n_clusters_submitted, linkage = 'ward')
            y_c = hc.fit_predict(X)  # fits the data to HAC
            t1 = time.time() # store the final time of execution
            visualize_graphs([n_clusters_submitted], y_c) # executes the function to generate the figure to display the clusters

            
run_btn.on_click(run_btn_clicked) # this links the button with its respective function

#initiating the button with its layout values
submit_nclusters_btn = widgets.Button(description='Submit No. of Clusters for K-means/HAC', button_style = "info", layout=layout, display='flex', flex_flow='column', align_items='stretch')


# function to be executed when the button is pressed
def submit_nclusters_btn_clicked(_):
    with output:
        output.clear_output()
        n_clusters = kmeans_int_slider.value
        print(kmeans_int_slider.value)

submit_nclusters_btn.on_click(submit_nclusters_btn_clicked) # this links the button with its respective function

#initiating the button with its layout values
find_nclusters_btn = widgets.Button(description='Find No. of Clusters', button_style = "info", layout=layout, display='flex', flex_flow='column', align_items='stretch')

# function to be executed when the button is pressed
def find_nclusters_btn_clicked(_):
    with output:
        output.clear_output()
        #elbow and dendrogram method from https://cprosenjit.medium.com/8-clustering-methods-from-scikit-learn-we-should-know-e2ff7ee9ca18 
        if dropdown.value == "Kmeans":
            sum_sqrd = [] # sum of squared distances for each number of clusters
            for i in range(1, 11):  # loop over 1 to 10 clusters (10 iterations)
                kmeans = KMeans(n_clusters = i,       # number of clusters for this iteration
                                init = 'k-means++',   # k-means++ seeding spreads out the initial centroids
                                random_state = 42)    # fixed seed so the centroid initialization is reproducible
                kmeans.fit(X) # fit the data to K-means
                sum_sqrd.append(kmeans.inertia_) # sum of squared distances of samples to their closest cluster center
            plt.plot(range(1, 11), sum_sqrd) 
            plt.title('K-Means - the Elbow method')
            plt.xlabel('Number of clusters')
            plt.ylabel('Sum of squared distances')
            plt.show()
        elif dropdown.value == "Hierarchical Agglomerative Clustering": 
            print("Please wait") # it takes long for the dendrogram figure to display
            plt.figure(figsize=(35,12))
            dendrogram = sch.dendrogram(\
                                        #Execute hierarchical agglomerative clustering.
                                        sch.linkage(X, \
                                                    # distance metric
                                                    metric='euclidean', \
                                                    # variance minimization algorithm
                                                    method = 'ward'))
            plt.title('Dendrogram')               
            plt.xticks(fontsize=14, rotation=90)
            plt.show()
        else:
            #in case the K-means or HAC is not selected in the dropdown menu, following text will be displayed.
            print("Please select the appropriate clustering method in the dropdown menu above") 

find_nclusters_btn.on_click(find_nclusters_btn_clicked) # this links the button with its respective function

#initiating the button with its layout values
submit_eps_btn = widgets.Button(description='Submit eps (DBSCAN)', button_style = "info", layout=layout, display='flex', flex_flow='column', align_items='stretch')

# function to be executed when the button is pressed
def submit_eps_btn_clicked(_):
    with output:
        output.clear_output()
        eps = dbscan_float_slider.value     #extract the value for the local radius for expanding clusters
        print(dbscan_float_slider.value)   # prints the extracted value

submit_eps_btn.on_click(submit_eps_btn_clicked) # this links the button with its respective function
        
output = widgets.Output() # to display output

def dropdown_change(change):
    if change['new']: #if the dropdown menu changes to a new value
        with output:
            output.clear_output() 
            print(change['new'])  # displays the selected dropdown value
            
dropdown.observe(dropdown_change, "value") #links the dropdown with its respective function

# display the widgets in the following arrangement with the help of VBox and HBox
# VBox: lays out all its children in one vertical column
# HBox: lays out all its children in one horizontal row
toolbar_widget = widgets.VBox()
toolbar_widget.children = [
    widgets.HBox([blobs_slider, moons_slider]),
    dropdown,
    widgets.HBox([kmeans_int_slider]),
    widgets.HBox([dbscan_float_slider]),
    widgets.HBox([submit_data_btn, find_nclusters_btn, submit_nclusters_btn]),
    widgets.HBox([submit_eps_btn, run_btn]),
    output,
]

### Results

ipywidgets is a Python library used to create interactive HTML widgets for the Jupyter notebook. These widgets make the workflow interactive through sliders, buttons, and a dropdown menu. The widgets remain active and update their values according to the user's interaction.
The workflow is designed to keep the interaction as simple as possible while staying focused on its main goal. The following instructions guide the user:

1. The workflow lets the user create their own data in specific shapes and increase or decrease the number of data points. Use the first two sliders to change the number of data samples in the blob and moon shapes, respectively. Then press the "Submit Data" button to generate and store the data.

2. Choose the desired clustering method from the dropdown menu.

3. If you have chosen "DBSCAN", set the epsilon (eps) parameter with the indicated slider and press the "Submit eps (DBSCAN)" button. You can now skip to step 6.

4. If you have chosen "Kmeans", press the "Find No. of Clusters" button first to pick a value using the elbow method; a graph is shown for this purpose. Then set that value with the "Number of clusters in K-means/HAC" slider.

5. If you have chosen "Hierarchical Agglomerative Clustering", press the "Find No. of Clusters" button first to pick a value using the dendrogram, which takes a few moments to display (depending on your computational power). Then set that value with the "Number of clusters in K-means/HAC" slider.

6. Press the "Run" button to execute the chosen clustering method. A graph is shown below, highlighting the different clusters in the data and the number of predicted clusters. The execution time of the clustering method is displayed in the bottom-right corner of the graph.
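
Step 3 leaves the choice of eps to the user. The import cell loads `NearestNeighbors` "to calculate eps hyperparameter", but the workflow never calls it. One common heuristic, sketched here as an assumption rather than as the author's method, is to sort every point's distance to its k-th nearest neighbor and look for the bend in that curve. The moons-only data, `min_samples = 5` (scikit-learn's DBSCAN default) and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

# illustrative data: the moon part of the workflow's dataset
X_demo, _ = make_moons(n_samples=300, noise=0.08, random_state=19)

min_samples = 5  # DBSCAN's default; DBSCAN counts the point itself as a neighbor

# querying the fitted data returns each point as its own nearest neighbor (distance 0),
# so the last column is the distance to the (min_samples - 1)-th other point
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)
k_dist = np.sort(distances[:, -1])

# the bend of the sorted curve is a candidate eps; upper percentiles give a rough range
print("median k-distance:", round(float(np.percentile(k_dist, 50)), 3))
print("90th percentile  :", round(float(np.percentile(k_dist, 90)), 3))
print("95th percentile  :", round(float(np.percentile(k_dist, 95)), 3))
```

Plotting `k_dist` with `plt.plot(k_dist)` shows the bend directly; the percentiles are only a coarse stand-in for reading the curve.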

In [4]:
toolbar_widget

VBox(children=(HBox(children=(IntSlider(value=0, description='Number of samples in blobs shape :', layout=Layo…

## Summary

The user has the flexibility to define their own dataset with different shapes. The interactive workflow provides the opportunity to understand the different clustering methods through well-illustrated plots. The workflow also offers a way to determine a suitable number of clusters for K-means and Hierarchical Agglomerative Clustering with the elbow method and the dendrogram method, respectively. By changing the eps value for DBSCAN, the user can see its impact on the number of clusters found in the data. The plot also shows the execution time of each clustering method in the bottom-right corner.
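
The effect of eps can also be checked without the widgets. The sketch below is a minimal illustration under assumed settings: 200 moon samples, 100 blob samples, `min_samples = 5`, and a hand-picked list of eps values, where 5.0 lies deliberately far outside the slider's range to show the extreme case. The helper `count_clusters` is a hypothetical name that wraps the counting expression used in the Run callback.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs, make_moons

# same generator settings as make_data, with assumed sample counts
moons_xy, _ = make_moons(n_samples=200, noise=0.08, random_state=19)
blobs_xy, _ = make_blobs(n_samples=100, centers=2, center_box=(-1.7, 1.7),
                         cluster_std=0.1, random_state=19)
X_demo = np.vstack((moons_xy, blobs_xy))


def count_clusters(labels):
    """Number of clusters, not counting the noise label -1."""
    return len(set(labels)) - (1 if -1 in labels else 0)


results = {}
for eps in (0.05, 0.1, 0.2, 0.6, 5.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X_demo)
    results[eps] = (count_clusters(labels), int(np.sum(labels == -1)))
    print(f"eps = {eps}: clusters = {results[eps][0]}, noise points = {results[eps][1]}")
```

With a very large eps every point is reachable from every other point, so all samples fall into one cluster; very small values tend to label many points as noise.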

Although it is relatively fast, K-means clustering cannot deal with non-convex cluster shapes, so it is not always an ideal clustering method. DBSCAN, on the other hand, can form clusters of arbitrary shape. For example, running k-means and DBSCAN on the same dataset can produce different clustering patterns. Another key point is that Hierarchical Agglomerative Clustering and K-means require the number of clusters to be defined in advance, whereas DBSCAN determines the number of clusters from the data. If the value of epsilon is too high, DBSCAN will not form appropriate clusters (try eps = 0.6). DBSCAN therefore needs careful selection of its parameters, and it does not work well on clusters with different densities.
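
The contrast between the two methods on non-convex shapes can be quantified with a small sketch. It assumes a moons-only dataset with known labels and scores each method with the adjusted Rand index (1.0 means perfect agreement with the true grouping); the noise level, `eps = 0.2`, and the variable names are illustrative choices, not values taken from the workflow.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# two interleaving moons with known membership
X_demo, y_true = make_moons(n_samples=300, noise=0.05, random_state=19)

# k-means is told the correct number of clusters but still splits the plane with a straight boundary
kmeans_labels = KMeans(n_clusters=2, init="k-means++", n_init=10,
                       random_state=42).fit_predict(X_demo)

# DBSCAN follows the dense arcs; eps is an assumed value suited to this noise level
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_demo)

ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f"adjusted Rand index, k-means: {ari_kmeans:.2f}")
print(f"adjusted Rand index, DBSCAN : {ari_dbscan:.2f}")
```

The comparison depends on the chosen eps: a poorly chosen value can make DBSCAN merge both moons or mark most points as noise, which is the parameter sensitivity described above.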


I hope this was helpful,

Syed Talha Tirmizi

___________________