[![Fixel Algorithms](https://fixelalgorithms.co/images/CCExt.png)](https://fixelalgorithms.gitlab.io)

# UnSupervised Learning Methods

## Clustering - Agglomerative (Hierarchical)

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 0.1.000 | 22/04/2023 | Royi Avital | First version                                                      |
|         |            |             |                                                                    |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/UnSupervisedLearningMethods/2023_03/0007ClusteringHierarchical.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.base import BaseEstimator, ClusterMixin

from scipy.cluster import hierarchy

# Miscellaneous
import os
from platform import python_version
import random

# Typing
from typing import Callable, List, Tuple, Union

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image, display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

In [None]:
# Configuration
%matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# sns.set_theme() #<! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2


In [None]:
# Fixel Algorithms Packages


## Clustering by Agglomerative (Bottom Up) Policy

In this notebook we'll use the agglomerative method for clustering.  
We'll use the SciPy `hierarchy` module to create a SciKit Learn compatible clustering class.

* <font color='brown'>(**#**)</font> SciKit Learn has a class for _agglomerative clustering_: [`AgglomerativeClustering`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html), which is largely based on SciPy.
* <font color='brown'>(**#**)</font> The key ingredient of these methods is the linkage: the definition of the dissimilarity between samples and between sub sets of samples.
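To make the role of the linkage concrete, here is a minimal sketch on a toy data set (the points, the threshold and the variable names `mX`, `dLinkage` are assumptions made for illustration, not part of the notebook's data). It builds the linkage matrix with several methods and then cuts one tree at a distance threshold to obtain flat labels.

```python
import numpy as np
from scipy.cluster import hierarchy

# Toy data (assumed for illustration): two tight pairs and one distant point
mX = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0], [20.0, 0.0]])

dLinkage = {}
for linkageMethod in ['single', 'complete', 'average', 'ward']:
    # Each row of the linkage matrix: [cluster i, cluster j, merge distance, merged size]
    mLinkage = hierarchy.linkage(mX, method = linkageMethod)
    dLinkage[linkageMethod] = mLinkage
    print(f'{linkageMethod:>8}: last merge distance = {mLinkage[-1, 2]:0.3f}')

# Cutting the tree at a distance threshold yields the flat cluster labels
vL = hierarchy.fcluster(dLinkage['single'], t = 2.0, criterion = 'distance')
print(vL)
```

The same samples are merged at different heights depending on the method, which is why the choice of linkage changes where a given threshold cuts the tree.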

In [None]:
# Parameters

# Data Generation
csvFileName = r'https://github.com/FixelAlgorithmsTeam/FixelCourses/raw/master/DataSets/ShoppingData.csv'

# Model
linkageMethod   = 'ward' 
thrLvl          = 200
clusterCriteria = 'distance'



In [None]:
# Auxiliary Functions

def PlotDendrogram( dfX: pd.DataFrame, linkageMethod: str, valP: int, thrLvl: int, hA:plt.Axes = None, figSize: Tuple[int, int] = FIG_SIZE_DEF, markerSize: int = MARKER_SIZE_DEF ):

    if hA is None:
        hF, hA = plt.subplots(1, 1, figsize = figSize)
    else:
        hF = hA.get_figure()

    mLinkage = hierarchy.linkage(dfX, method = linkageMethod)
    hierarchy.dendrogram(mLinkage, p = valP, truncate_mode = 'lastp', color_threshold = thrLvl, no_labels = True, ax = hA)
    hA.axhline(y = thrLvl, c = 'k', lw = 2, linestyle = '--')

    hA.set_title('Dendrogram of the Data')


## Generate / Load Data

We'll load the shopping data set (`ShoppingData.csv`), which describes customers by their sex, age, annual income and spending score.


In [None]:
# Loading / Generating Data

dfData = pd.read_csv(csvFileName)

print(f'The features data shape: {dfData.shape}')

In [None]:
# The Data Frame

dfData.head(10)

In [None]:
# Rename the 'Genre' column to 'Sex'
dfData = dfData.rename(columns = {'Genre': 'Sex'})
dfData

### Plot Data

In [None]:
# Display the Data
# Pair Plot of the data (Excluding ID)

sns.pairplot(dfData, vars = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)'], hue = 'Sex', height = 4, plot_kws = {'s': 20})
plt.show()

## Pre Process Data

In [None]:
# Remove ID data
dfX = dfData.drop(columns = ['CustomerID'], inplace = False)
dfX['Sex'] = dfX['Sex'].map({'Female': 0, 'Male': 1}) #<! Convert the 'Sex' column into {0, 1} values
dfX

## Cluster Data by Hierarchical Agglomerative (Bottom Up) Clustering Method


In [None]:
# Implement the Hierarchical Agglomerative clustering as an Estimator

class HierarchicalAgglomerativeCluster(ClusterMixin, BaseEstimator):
    def __init__(self, linkageMethod: str, thrLvl: Union[int, float], clusterCriteria: str):
        self.linkageMethod      = linkageMethod
        self.thrLvl             = thrLvl
        self.clusterCriteria    = clusterCriteria        
    
    def fit(self, mX, vY = None):

        numSamples  = mX.shape[0]
        featuresDim = mX.shape[1]

        mLinkage = hierarchy.linkage(mX, method = self.linkageMethod)
        vL       = hierarchy.fcluster(mLinkage, self.thrLvl, criterion = self.clusterCriteria)

        self.mLinkage           = mLinkage
        self.labels_            = vL
        self.n_features_in_     = featuresDim #<! SciKit Learn convention: fitted attributes end with `_`

        return self
    
    def transform(self, mX):

        return hierarchy.linkage(mX, method = self.linkageMethod)
    
    def predict(self, mX):

        vL = hierarchy.fcluster(self.mLinkage, self.thrLvl, criterion = self.clusterCriteria)

        return vL




* <font color='red'>(**?**)</font> In the context of new data, what is the limitation of this method?
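As a hint toward the question above, the following sketch (toy points, threshold and variable names are assumptions for illustration) shows that the flat labels exist only for the samples the linkage was computed on. Handling one additional sample here means recomputing the linkage on the enlarged set.

```python
import numpy as np
from scipy.cluster import hierarchy

# Toy data (assumed for illustration): two well separated groups of 3 samples
mX = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

mLinkage = hierarchy.linkage(mX, method = 'ward')
vL       = hierarchy.fcluster(mLinkage, t = 5.0, criterion = 'distance')
print(vL.shape) #<! One label per fitted sample

# A new sample is not mapped onto the existing tree; the linkage is rebuilt from scratch
mXNew       = np.vstack([mX, [[10.5, 10.5]]])
mLinkageNew = hierarchy.linkage(mXNew, method = 'ward')
vLNew       = hierarchy.fcluster(mLinkageNew, t = 5.0, criterion = 'distance')
print(vLNew.shape)
```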

In [None]:
# Interactive Visualization

# TODO: Add Criteria for `fcluster`

hPlotDendrogram = lambda linkageMethod, thrLvl: PlotDendrogram(dfX, linkageMethod, 200, thrLvl, figSize = (8, 8))
linkageMethodDropdown = Dropdown(description = 'Linkage Method', options = [('Single', 'single'), ('Complete', 'complete'), ('Average', 'average'), ('Weighted', 'weighted'), ('Centroid', 'centroid'), ('Median', 'median'), ('Ward', 'ward')], value = 'ward')
# criteriaMethodDropdown = Dropdown(description = 'Linkage Method', options = [('Single', 'single'), ('Complete', 'complete'), ('Average', 'average'), ('Weighted', 'weighted'), ('Centroid', 'centroid'), ('Median', 'median'), ('Ward', 'ward')], value = 'ward')
thrLvlSlider = IntSlider(min = 1, max = 1000, step = 1, value = 100, layout = Layout(width = '30%'))
interact(hPlotDendrogram, linkageMethod = linkageMethodDropdown, thrLvl = thrLvlSlider)

plt.show()

### Clustering as Feature

We can visualize the effect on the data by treating the clustering labels as a feature.
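Besides plotting, a label column also allows numeric summaries per cluster. The sketch below uses a made-up toy data frame (the values, the column names `Age` and `Income`, and the threshold are assumptions for illustration) and summarizes each cluster with a group-by.

```python
import pandas as pd
from scipy.cluster import hierarchy

# Toy stand-in for the notebook's data (values are made up for illustration)
dfToy = pd.DataFrame({'Age': [20, 22, 21, 60, 62, 61], 'Income': [30, 32, 31, 90, 88, 91]})

# Cluster on the raw features, then attach the labels as a new column
mLinkage       = hierarchy.linkage(dfToy.to_numpy(), method = 'ward')
dfToy['Label'] = hierarchy.fcluster(mLinkage, t = 10, criterion = 'distance')

# Per cluster mean and sample count of each feature
dfSummary = dfToy.groupby('Label').agg(['mean', 'count'])
print(dfSummary)
```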

In [None]:
oAggCluster = HierarchicalAgglomerativeCluster(linkageMethod = linkageMethod, thrLvl = thrLvl, clusterCriteria = clusterCriteria)
oAggCluster = oAggCluster.fit(dfX)

In [None]:
dfXX            = dfX.copy()
dfXX['Label']   = oAggCluster.labels_

In [None]:
sns.pairplot(dfXX, hue = 'Label', palette = sns.color_palette()[:oAggCluster.labels_.max()], height = 3, plot_kws = {'s': 20})
plt.show()