[![Fixel Algorithms](https://fixelalgorithms.co/images/CCExt.png)](https://fixelalgorithms.gitlab.io)

# UnSupervised Learning Methods

## Clustering - Hierarchical Density Based Spatial Clustering of Applications with Noise (HDBSCAN)

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 0.2.000 | 29/04/2023 | Royi Avital | Using SciKit Learn based HDBSCAN                                   |
| 0.1.000 | 29/04/2023 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/UnSupervisedLearningMethods/2023_03/0009ClusteringHDBSCAN.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
# from hdbscan import HDBSCAN

from sklearn.cluster import DBSCAN, HDBSCAN
from sklearn.datasets import load_digits

# Miscellaneous
import os
from platform import python_version
import random
import urllib.request

# Typing
from typing import Callable, List, Tuple, Union

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image, display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

In [None]:
# Configuration
#%matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# sns.set_theme() #>! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2

MNIST_IMG_SIZE = (8, 8)

DATA_FILE_URL   = r'https://drive.google.com/uc?export=download&confirm=9iBg&id=11YqtdWwZSNE-0KxWAf1ZPINi9-ar56Na'
DATA_FILE_NAME  = r'ClusteringData.npy'


In [None]:
# Fixel Algorithms Packages


## Clustering by Density

This notebook demonstrates clustering using the [_HDBSCAN_](https://hdbscan.readthedocs.io) algorithm.  

* <font color='brown'>(**#**)</font> The _DBSCAN_ method approximates the following idea: Estimate the density of the data with a (high dimensional) KDE, apply a threshold to it and find the connected components of what remains.
* <font color='brown'>(**#**)</font> The _HDBSCAN_ method adds a _hierarchy_ of density levels, mostly to handle the main weakness of _DBSCAN_: Clusters with different densities.
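
The "KDE, threshold, connected components" idea in the first note can be sketched directly. The snippet below is a minimal illustration, not the actual _DBSCAN_ implementation: the data, the radius `rVal` and the threshold `zVal` are assumptions chosen for the demo. It uses a flat kernel (counting neighbors within a radius) as the density estimate and compares the resulting components with the core samples of SciKit Learn's `DBSCAN`.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import cdist
from sklearn.cluster import DBSCAN

# Synthetic data (Assumption): Two well separated Gaussian blobs
oRng = np.random.default_rng(0)
mXa  = np.concatenate((oRng.normal(0.0, 0.5, size = (100, 2)), oRng.normal(10.0, 0.5, size = (100, 2))))

rVal = 1.0 #<! Radius of the flat kernel (`eps` in DBSCAN)
zVal = 5   #<! Density threshold (`min_samples` in DBSCAN)

mD       = cdist(mXa, mXa)                  #<! Pairwise Euclidean distances
vDensity = np.sum(mD <= rVal, axis = 1)     #<! Unnormalized flat kernel KDE (Counts the sample itself)
vCoreIdx = np.flatnonzero(vDensity >= zVal) #<! Thresholding the density keeps the core samples

# Connected components of the graph linking core samples within the radius
mA = csr_matrix(mD[np.ix_(vCoreIdx, vCoreIdx)] <= rVal)
numComp, vComp = connected_components(mA, directed = False)

# Reference: DBSCAN labels of the same core samples
oDbscan = DBSCAN(eps = rVal, min_samples = zVal).fit(mXa)
vRef    = oDbscan.labels_[vCoreIdx]

print(f'Number of connected components      : {numComp}')
print(f'Number of DBSCAN clusters (No noise): {len(np.unique(vRef))}')
```

Samples below the threshold are either attached to a nearby core sample (border samples) or marked as noise by _DBSCAN_; the sketch above only covers the core samples.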

In [None]:
# Parameters

# Data Generation

# Model
minNumSamplesCluster    = 20
minNumSamplesCore       = 5 #<! Like Z in DBSCAN



In [None]:
# Auxiliary Functions

def PlotScatterData(mX: np.ndarray, vL: np.ndarray, hA:plt.Axes = None, figSize: Tuple[int, int] = FIG_SIZE_DEF, markerSize: int = MARKER_SIZE_DEF, lineWidth: int = LINE_WIDTH_DEF, axisTitle: str = None):

    if hA is None:
        hF, hA = plt.subplots(figsize = figSize)
    else:
        hF = hA.get_figure()
    
    vU = np.unique(vL)
    numClusters = len(vU)

    for ii in range(numClusters):
        vIdx = vL == vU[ii]
        hA.scatter(mX[vIdx, 0], mX[vIdx, 1], s = ELM_SIZE_DEF, edgecolor = EDGE_COLOR, label = vU[ii]) #<! Use the actual label (Noise is shown as `-1`)
    
    hA.set_xlabel('${{x}}_{{1}}$')
    hA.set_ylabel('${{x}}_{{2}}$')
    if axisTitle is not None:
        hA.set_title(axisTitle)
    hA.grid()
    hA.legend()

    # return hF

def PlotMnistImages(mX: np.ndarray, vY: np.ndarray = None, numRows: int = 1, numCols: int = 1, imgSize = MNIST_IMG_SIZE, randomChoice = True, hF = None):

    numSamples  = mX.shape[0]

    numImg = numRows * numCols

    # tFigSize = (numRows * 3, numCols * 3)
    tFigSize = (numCols * 3, numRows * 3)

    if hF is None:
        hF, hA = plt.subplots(numRows, numCols, figsize = tFigSize)
    else:
        hA = hF.axes #<! List of the axes of the given figure
    
    hA = np.atleast_1d(hA) #<! To support numImg = 1
    hA = hA.flat

    if randomChoice:
        vIdx = np.random.choice(numSamples, numImg, replace = False)
    else:
        vIdx = range(numImg)

    
    for kk in range(numImg):
        
        idx = vIdx[kk]
        mI  = np.reshape(mX[idx, :], imgSize)
    
        hA[kk].imshow(mI, cmap = 'gray')
        hA[kk].tick_params(axis = 'both', left = False, top = False, right = False, bottom = False, labelleft = False, labeltop = False, labelright = False, labelbottom = False)
        labelStr = f', Label = {vY[idx]}' if vY is not None else ''
        hA[kk].set_title(f'Index = {idx}' + labelStr)
    
    plt.show()


## Generate / Load Data

We'll load a simple case of anisotropic data clusters from a pre generated file.
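
If the data file is not available, comparable anisotropic clusters can be synthesized. The snippet below is a minimal sketch and not necessarily the way `ClusteringData.npy` was created: the number of samples, the number of clusters and the matrix `mT` are illustrative assumptions. It draws isotropic blobs and applies a linear map which stretches and shears them.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Isotropic blobs (Assumption: 3 clusters, 500 samples)
mXIso, vLIso = make_blobs(n_samples = 500, centers = 3, random_state = 0)

# A linear map stretches and shears the blobs, which makes them anisotropic
mT    = np.array([[0.6, -0.6], [-0.4, 0.8]])
mXAni = mXIso @ mT

print(f'The synthetic data shape: {mXAni.shape}')
```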


In [None]:
# Download Data
# This section downloads data from the given URL if needed.

if not os.path.exists(DATA_FILE_NAME):
    urllib.request.urlretrieve(DATA_FILE_URL, DATA_FILE_NAME)

In [None]:
# Loading / Generating Data

mX = np.load(DATA_FILE_NAME)
vL = np.ones(shape = mX.shape[0])

print(f'The features data shape: {mX.shape}')

### Plot Data

In [None]:
# Display the Data

PlotScatterData(mX, vL)

## Cluster Data by HDBSCAN

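* <font color='brown'>(**#**)</font> Relatively robust to the choice of hyper parameters (`min_cluster_size`, `min_samples`). There is no global radius to tune as in _DBSCAN_.
* <font color='brown'>(**#**)</font> Slower than _DBSCAN_, yet still fast in absolute terms.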

In [None]:
oHDBSCAN = HDBSCAN(min_cluster_size = minNumSamplesCluster, min_samples = minNumSamplesCore)
vC       = oHDBSCAN.fit_predict(mX)

print(f'Number of clusters (Noise included): {len(np.unique(vC))}')

In [None]:
# Plot Scatter Data

PlotScatterData(mX, vC)
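
The count printed above includes the noise group, since SciKit Learn's `HDBSCAN` assigns the label `-1` to noise samples. A small helper can report the clusters and the noise separately. The function name `SummarizeLabels()` and the demo vector are hypothetical, for illustration only.

```python
import numpy as np

def SummarizeLabels(vLbl: np.ndarray) -> tuple:
    # Returns (Number of clusters without noise, Number of noise samples)
    vLbl        = np.asarray(vLbl)
    numNoise    = int(np.sum(vLbl == -1))
    numClusters = len(np.unique(vLbl[vLbl != -1]))
    return numClusters, numNoise

# Demo labels (Assumption): 3 clusters and 2 noise samples
vDemo = np.array([0, 0, 1, -1, 1, 2, -1, 0])
numClusters, numNoise = SummarizeLabels(vDemo)

print(f'Number of clusters (Noise excluded): {numClusters}')
print(f'Number of noise samples: {numNoise}')
```

In the notebook, the same call can be applied to the `vC` vector returned by `fit_predict()`.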

## Cluster Digits

In [None]:
# Load Digits Data

mX, vY = load_digits(return_X_y = True)

print(f'The features data shape: {mX.shape}')
print(f'The labels data shape: {vY.shape}')

In [None]:
# Plot Data
PlotMnistImages(mX, vY, 2, 5)

In [None]:
# Plot Data

hF, hAs = plt.subplots(nrows = 2, ncols = 10, figsize = (7, 2))
hAs = hAs.flat
for hA in hAs:
    idx  = np.random.randint(mX.shape[0])
    mI   = np.reshape(mX[idx, :], MNIST_IMG_SIZE)
    
    hA.imshow(mI, cmap = 'gray')
    hA.set_xticks([])
    hA.set_yticks([])

plt.tight_layout()
plt.show()

In [None]:
# Cluster Data

oHDBSCAN = HDBSCAN(min_cluster_size = 30, min_samples = minNumSamplesCore)
vC       = oHDBSCAN.fit_predict(mX)

print(f'Number of clusters (Noise included): {len(np.unique(vC))}')

* <font color='red'>(**?**)</font> Look at the labels. Do they match the digits?

In [None]:
# Plot Some of the Clusters' Samples

hF, hAs = plt.subplots(nrows = 10, ncols = 10, figsize = (6, 6))
for ii in range(10):
    mXi = mX[vC == ii] #<! Samples of the cluster labeled `ii` (Might be empty)
    for jj in range(10):
        hA = hAs[ii, jj]
        hA.set_xticks([])
        hA.set_yticks([])
        if mXi.shape[0] == 0:
            continue #<! No such cluster, leave the axes empty
        idx = np.random.randint(mXi.shape[0])
        mI  = np.reshape(mXi[idx], MNIST_IMG_SIZE)
        hA.imshow(mI, cmap = 'gray')

plt.tight_layout()
plt.show()

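* <font color='brown'>(**#**)</font> The [`hdbscan`](https://hdbscan.readthedocs.io) package supports Out of Sample Extension with the `approximate_predict()` function. See the `prediction_data` parameter. The SciKit Learn `HDBSCAN` class used in this notebook does not expose it.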
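
A minimal sketch of the Out of Sample Extension follows. It assumes the stand alone `hdbscan` package is installed (it is not imported elsewhere in this notebook) and is skipped otherwise. The training data and the new samples are synthetic assumptions for the demo.

```python
import numpy as np

try:
    import hdbscan #<! The stand alone package, not `sklearn.cluster.HDBSCAN`
except ImportError:
    hdbscan = None

# Synthetic data (Assumption): Two compact blobs for training, 3 new samples
oRng    = np.random.default_rng(2)
mXTrain = np.concatenate((oRng.normal(0.0, 0.3, size = (100, 2)), oRng.normal(5.0, 0.3, size = (100, 2))))
mXNew   = np.array([[0.1, -0.1], [5.1, 4.9], [20.0, 20.0]])

if hdbscan is not None:
    # `prediction_data = True` caches the data needed for later predictions
    oClusterer = hdbscan.HDBSCAN(min_cluster_size = 20, prediction_data = True).fit(mXTrain)
    # Labels of the new samples (`-1` for noise) and the membership strengths
    vLblNew, vStrength = hdbscan.approximate_predict(oClusterer, mXNew)
    print(f'Predicted labels: {vLblNew}')
    print(f'Strengths       : {vStrength}')
else:
    print('The `hdbscan` package is not installed, skipping the example')
```

The prediction does not refit the model: the existing clusters stay fixed and each new sample is either attached to one of them or labeled as noise.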