Illustration of **HDBSCAN** (Hierarchical Density-Based Spatial Clustering of Applications with Noise)

HDBSCAN uses a density-based approach which makes few implicit assumptions about the clusters. It is a non-parametric method that looks for a cluster hierarchy shaped by the multivariate modes of the underlying distribution. Rather than looking for clusters with a particular shape, it looks for regions of the data that are denser than the surrounding space.
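
To make "denser than the surrounding space" concrete, the sketch below uses the distance to a point's k-th nearest neighbour as a rough density proxy: the smaller that distance, the denser the neighbourhood. It is a simplified NumPy illustration on made-up data, not the library's implementation; the helper name `kth_neighbor_distance` and the toy point sets are assumptions introduced here.

```python
import numpy as np

def kth_neighbor_distance(points, k=5):
    """Distance from each point to its k-th nearest neighbour (itself excluded)."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k holds the k-th nearest neighbour.
    return np.sort(dists, axis=1)[:, k]

rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.2, size=(50, 2))     # tight group
sparse = rng.uniform(low=-5.0, high=5.0, size=(50, 2))   # scattered background
points = np.vstack([dense, sparse])

knn_dist = kth_neighbor_distance(points, k=5)
dense_median = np.median(knn_dist[:50])
sparse_median = np.median(knn_dist[50:])
print("median k-NN distance, dense group:", dense_median)
print("median k-NN distance, background: ", sparse_median)
```

Points in the tight group have much smaller k-NN distances than the scattered background, which is the kind of contrast a density-based method relies on.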

In [1]:
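from sklearn.datasets import make_blobs
import pandas as pd
# Generate 2000 synthetic samples with 10 features (make_blobs uses 3 centers by default)
blobs, labels = make_blobs(n_samples=2000, n_features=10)
# Preview the first five rows
pd.DataFrame(blobs).head()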

,0,1,2,3,4,5,6,7,8,9
0,8.379826,-3.431563,7.738508,-4.913754,-1.967531,0.872824,0.792221,-5.460465,-5.514017,-8.38731
1,-7.297782,-5.133098,9.946554,-5.491745,-4.65415,5.730699,-8.675504,3.27614,-6.592246,-3.086301
2,7.486794,-2.291226,7.914219,-6.133867,-0.036212,-0.658201,1.859775,-4.598181,-5.601568,-9.491371
3,-5.768523,-5.545639,9.736792,-5.152286,-7.23316,7.728697,-6.204335,2.170504,-5.560478,-2.241978
4,10.069958,-3.700054,6.170105,-5.879932,-3.19739,-0.142917,0.883956,-7.007234,-4.961165,-8.688909


In [2]:
# Install and import the hdbscan module
!pip install hdbscan
import hdbscan

Collecting hdbscan
  Downloading hdbscan-0.8.27.tar.gz (6.4 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: hdbscan
  Building wheel for hdbscan (PEP 517) ... done
  Created wheel for hdbscan: filename=hdbscan-0.8.27-cp37-cp37m-linux_x86_64.whl size=2311927 sha256=df947f1b7d214cd654f235bcde4604832ebcb372085a44d29a3b4e8eaf50b698
  Stored in directory: /root/.cache/pip/wheels/73/5f/2f/9a259b84003b84847c259779206acecabb25ab56f1506ee72b
Successfully built hdbscan
Installing collected packages: hdbscan
Successfully installed hdbscan-0.8.27


In [3]:
clusterer = hdbscan.HDBSCAN()
clusterer.fit(blobs)

HDBSCAN(algorithm='best', allow_single_cluster=False, alpha=1.0,
        approx_min_span_tree=True, cluster_selection_epsilon=0.0,
        cluster_selection_method='eom', core_dist_n_jobs=4,
        gen_min_span_tree=False, leaf_size=40,
        match_reference_implementation=False, memory=Memory(location=None),
        metric='euclidean', min_cluster_size=5, min_samples=None, p=None,
        prediction_data=False)

In [4]:
# The cluster label assigned to each sample (noise points are labelled -1)
clusterer.labels_

array([2, 0, 2, ..., 0, 2, 1])

In [5]:
# Labels start at 0 (noise is -1), so the number of clusters found is the largest label plus one
clusterer.labels_.max()

2
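
Because labels start at 0 and noise points carry the label -1, counting clusters directly can be less error-prone than reading off the maximum. The sketch below does this on a small hypothetical label array that follows the same convention as `clusterer.labels_`; the array values are invented for illustration.

```python
import numpy as np

# Hypothetical labels in the same convention as clusterer.labels_:
# non-negative integers are cluster ids, -1 marks noise.
labels = np.array([2, 0, 2, -1, 1, 0, -1, 1, 2])

cluster_ids = np.unique(labels[labels >= 0])   # ids of real clusters only
n_clusters = cluster_ids.size
n_noise = int((labels == -1).sum())
sizes = {int(c): int((labels == c).sum()) for c in cluster_ids}

print("clusters:", n_clusters)
print("noise points:", n_noise)
print("cluster sizes:", sizes)
```

In the notebook, the same three lines can be applied to `clusterer.labels_` in place of the toy array.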

In [6]:
"""Each data point is assigned a cluster membership score ranging from 0.0 to 1.0. A score of 0.0 represents 
a sample that is not in the cluster at all (all noise points will get this score) while a score of 1.0 represents 
a sample that is at the heart of the cluster (note that this is not the spatial centroid notion of core)."""


In [7]:
# Retrieve the cluster membership score of each sample
clusterer.probabilities_

array([0.90146952, 0.82010313, 0.65939305, ..., 0.82187308, 0.85422799,
       0.71990094])
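
One possible use of these scores is to keep only confidently assigned samples and treat the rest as noise. The sketch below shows that idea on small hypothetical arrays standing in for `clusterer.labels_` and `clusterer.probabilities_`; the threshold of 0.7 is an arbitrary choice for illustration, not a recommended value.

```python
import numpy as np

# Hypothetical stand-ins for clusterer.labels_ and clusterer.probabilities_.
labels = np.array([0, 0, 1, -1, 1, 0])
probabilities = np.array([0.95, 0.40, 0.88, 0.0, 0.72, 1.0])

threshold = 0.7  # arbitrary cut-off, chosen only for this example

# A sample is kept if it belongs to a cluster and its score is high enough.
confident = (labels >= 0) & (probabilities >= threshold)

# Demote weakly attached samples to noise (-1); keep the others unchanged.
strict_labels = np.where(confident, labels, -1)
print(strict_labels)
```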

In [8]:
# Which distance metrics does HDBSCAN support?
hdbscan.dist_metrics.METRIC_MAPPING

{'arccos': hdbscan.dist_metrics.ArccosDistance,
 'braycurtis': hdbscan.dist_metrics.BrayCurtisDistance,
 'canberra': hdbscan.dist_metrics.CanberraDistance,
 'chebyshev': hdbscan.dist_metrics.ChebyshevDistance,
 'cityblock': hdbscan.dist_metrics.ManhattanDistance,
 'cosine': hdbscan.dist_metrics.ArccosDistance,
 'dice': hdbscan.dist_metrics.DiceDistance,
 'euclidean': hdbscan.dist_metrics.EuclideanDistance,
 'hamming': hdbscan.dist_metrics.HammingDistance,
 'haversine': hdbscan.dist_metrics.HaversineDistance,
 'infinity': hdbscan.dist_metrics.ChebyshevDistance,
 'jaccard': hdbscan.dist_metrics.JaccardDistance,
 'kulsinski': hdbscan.dist_metrics.KulsinskiDistance,
 'l1': hdbscan.dist_metrics.ManhattanDistance,
 'l2': hdbscan.dist_metrics.EuclideanDistance,
 'mahalanobis': hdbscan.dist_metrics.MahalanobisDistance,
 'manhattan': hdbscan.dist_metrics.ManhattanDistance,
 'matching': hdbscan.dist_metrics.MatchingDistance,
 'minkowski': hdbscan.dist_metrics.MinkowskiDistance,
 'p': hdbscan.dis

In [9]:
# Cluster with Manhattan distance instead of the default Euclidean metric
clusterer = hdbscan.HDBSCAN(metric='manhattan')
clusterer.fit(blobs)
clusterer.labels_

array([1, 0, 1, ..., 0, 1, 2])
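
The label ids shown here differ from those of the earlier Euclidean run, but cluster ids are arbitrary, so different ids do not by themselves mean a different grouping. The sketch below checks whether two label arrays describe the same partition up to renaming. The helper `same_partition` and the short arrays are assumptions made for illustration; the truncated outputs above do not show whether the two runs actually agree on every sample.

```python
import numpy as np

def same_partition(a, b):
    """True if labelings a and b group the samples identically, up to renaming."""
    a, b = np.asarray(a), np.asarray(b)
    # Distinct (a_label, b_label) pairs that actually occur together.
    pairs = np.unique(np.stack([a, b], axis=1), axis=0)
    # A one-to-one renaming exists exactly when the number of distinct pairs
    # equals the number of distinct labels on each side.
    return len(pairs) == len(np.unique(a)) == len(np.unique(b))

# Hypothetical short label arrays, loosely modelled on the outputs above.
euclidean_labels = np.array([2, 0, 2, 0, 2, 1])
manhattan_labels = np.array([1, 0, 1, 0, 1, 2])
different_labels = np.array([1, 0, 1, 0, 2, 2])

print(same_partition(euclidean_labels, manhattan_labels))
print(same_partition(euclidean_labels, different_labels))
```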

In [10]:
"""What if you don’t have a nice set of points in a vector space, but only have a pairwise distance matrix providing 
the distance between each pair of points? This is a common situation."""


In [11]:
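from sklearn.metrics.pairwise import pairwise_distances
# Build the full pairwise distance matrix (Euclidean by default)
distance_matrix = pairwise_distances(blobs)
# metric='precomputed' tells HDBSCAN the input is a distance matrix, not raw features
clusterer = hdbscan.HDBSCAN(metric='precomputed')
clusterer.fit(distance_matrix)
clusterer.labels_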

array([1, 0, 1, ..., 0, 1, 2])
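
When the distance matrix comes from somewhere other than `pairwise_distances`, it can be worth sanity-checking it before clustering. The sketch below builds a small Euclidean distance matrix with NumPy and verifies that it is square, symmetric, non-negative, and zero on the diagonal. The helper `looks_like_distance_matrix` is hypothetical and reflects an assumption about what a sensible precomputed input looks like; the library's own input validation may differ.

```python
import numpy as np

def looks_like_distance_matrix(d, tol=1e-9):
    """Basic structural checks for a precomputed pairwise distance matrix."""
    d = np.asarray(d)
    if d.ndim != 2 or d.shape[0] != d.shape[1]:
        return False  # must be square: one row and one column per sample
    return bool(
        np.allclose(d, d.T, atol=tol)               # symmetric
        and np.allclose(np.diag(d), 0.0, atol=tol)  # zero self-distance
        and (d >= 0).all()                          # no negative distances
    )

rng = np.random.default_rng(0)
points = rng.normal(size=(6, 3))  # toy data: 6 samples, 3 features

# Pairwise Euclidean distances via broadcasting.
diffs = points[:, None, :] - points[None, :, :]
distance_matrix = np.sqrt((diffs ** 2).sum(axis=-1))

print(distance_matrix.shape)
print(looks_like_distance_matrix(distance_matrix))
```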