# Density-Based Clustering

## Objectives

- Explore the application of density-based clustering algorithms, specifically DBSCAN and OPTICS, to identify clusters based on data point densities.
- Assess how the choice of parameters such as eps, min_samples, and min_cluster_size affects the clustering results.
- Demonstrate the effectiveness of these algorithms on datasets with complex geometric shapes and varying densities.

## Background

Density-based clustering algorithms like DBSCAN and OPTICS group data points into clusters based on the density of points in a region, identifying areas of high density separated by areas of low density. These algorithms are beneficial for handling data with noise and clusters of varying shapes and sizes.

## Datasets Used

The notebook uses synthetic datasets, including two-dimensional blobs, half-moons, and concentric circles, to test the algorithms' effectiveness in various scenarios. The datasets were visualized and analyzed to understand the impact of algorithm parameters on clustering performance.

## DBSCAN

Density-based clustering is a type of clustering algorithm that groups data points into clusters based on the density of points in a region, typically identifying clusters as high-density areas separated by low-density regions.


DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering algorithm in data science and machine learning. Unlike algorithms such as k-means, DBSCAN does not require the user to specify the number of clusters in advance.

In [1]:
import numpy as np
import pandas as pd

import ClusterVisualizer as cv

### A Simple Example

Let's generate a simple dataset in 2D.

In [2]:
from sklearn.datasets import make_blobs

# Generating data
X, _ = make_blobs(random_state=0, centers=2, cluster_std=0.5, n_samples=50)

# Saving data to a DataFrame
df_X = pd.DataFrame(X, columns=["x", "y"])

In [3]:
cv_X = cv.ClusterVisualizer(df_X)

cv_X.plot_data()

In [4]:
from sklearn.cluster import DBSCAN

dbscan = DBSCAN()
labels = dbscan.fit_predict(X).astype(str)

Let's plot the resulting clusters!

In [5]:
cv_X.plot_density_based_clustering(labels, title='DBSCAN Clustering')

### Deeper Inside DBSCAN

In DBSCAN, the parameters `eps` and `min_samples` are crucial for defining the behavior of the algorithm:
- Epsilon (`eps`) is the maximum distance at which two data points are considered neighbors. A point that lies within `eps` of a core point is assigned to that core point's cluster.
- `min_samples` is the minimum number of data points (counting the point itself) that must lie within `eps` of a point for it to be a core point, that is, for its neighborhood to count as a dense region.


The default values are: `eps=0.5` and `min_samples=5`

DBSCAN categorizes the data points into three main types:
- **Core Points**: they have at least `min_samples` data points (including themselves) within a distance of `eps`.
- **Border Points**: they are data points within the `eps` distance of a core point but do not have enough neighboring data points (within `eps`) to be considered core points themselves.
- **Noise Points**: they are data points that do not meet the criteria for either a core or a border point. They are considered outliers.
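The three categories can be reproduced from the definitions alone. The sketch below is a minimal check, not scikit-learn's implementation: it assumes Euclidean distance and the default `eps=0.5` and `min_samples=5`, regenerates the blobs under the hypothetical name `X_demo`, and compares the hand-derived core and noise masks with what `DBSCAN` reports.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Same synthetic blobs as in the simple example (X_demo is a hypothetical name)
X_demo, _ = make_blobs(random_state=0, centers=2, cluster_std=0.5, n_samples=50)
eps, min_samples = 0.5, 5

# Indices of all points within eps of each point (each point counts itself)
nn = NearestNeighbors(radius=eps).fit(X_demo)
neighborhoods = nn.radius_neighbors(X_demo, return_distance=False)

# Core: at least min_samples points inside the eps-neighborhood
is_core = np.array([len(nbrs) >= min_samples for nbrs in neighborhoods])

# Border: not core, but at least one core point lies within eps
is_border = np.array([
    (not is_core[i]) and bool(is_core[nbrs].any())
    for i, nbrs in enumerate(neighborhoods)
])

# Noise: neither core nor border
is_noise = ~is_core & ~is_border

# Compare with scikit-learn's own bookkeeping
model = DBSCAN(eps=eps, min_samples=min_samples).fit(X_demo)
print("core points match: ", set(np.flatnonzero(is_core)) == set(model.core_sample_indices_))
print("noise points match:", np.array_equal(is_noise, model.labels_ == -1))
```

Note that a border point within `eps` of core points from two different clusters is assigned to only one of them, so the cluster label of a border point can depend on processing order even though its category does not.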

In [6]:
point_type = np.full_like(dbscan.labels_, 'Border', dtype=object)
point_type[dbscan.core_sample_indices_] = 'Core'
point_type[dbscan.labels_ == -1] = 'Noise'
point_type

array(['Core', 'Core', 'Core', 'Core', 'Core', 'Border', 'Noise',
       'Border', 'Core', 'Noise', 'Noise', 'Border', 'Core', 'Core',
       'Core', 'Core', 'Core', 'Core', 'Border', 'Core', 'Core', 'Core',
       'Core', 'Core', 'Core', 'Noise', 'Core', 'Noise', 'Core', 'Border',
       'Core', 'Core', 'Core', 'Border', 'Noise', 'Core', 'Core', 'Noise',
       'Border', 'Core', 'Core', 'Core', 'Noise', 'Core', 'Core', 'Core',
       'Noise', 'Border', 'Border', 'Core'], dtype=object)

In [7]:
dbscan = DBSCAN().fit(X)   

cv_X.plot_DBSCAN_categories(dbscan.labels_, dbscan.core_sample_indices_, 
                           title='DBSCAN Clustering')

### DBSCAN parameters: eps and min_samples

The `eps` parameter in DBSCAN sets the maximum distance between two points for them to belong to the same neighborhood. 

It directly influences how many points end up in each cluster and determines whether a point is classified as a core, border, or noise point.

Default value: `eps=0.5`

In [8]:
# Default value: eps=0.5
dbscan2 = DBSCAN().fit(X)   

cv_X.plot_DBSCAN_categories(dbscan2.labels_, dbscan2.core_sample_indices_, 
                            title='DBSCAN Clustering (eps=0.5)')

In [9]:
# A small eps value may lead to many noise points.
dbscan3 = DBSCAN(eps=0.35).fit(X)

cv_X.plot_DBSCAN_categories(dbscan3.labels_, dbscan3.core_sample_indices_, 
                            title='DBSCAN Clustering (eps=0.35)')

In [10]:
# A high eps value may lead to the disappearance of the noise points.
dbscan4 = DBSCAN(eps=1).fit(X)

cv_X.plot_DBSCAN_categories(dbscan4.labels_, dbscan4.core_sample_indices_, 
                            title='DBSCAN Clustering (eps=1)')

In [11]:
# A higher eps can cause distinct clusters to merge.
dbscan5 = DBSCAN(eps=2).fit(X)

cv_X.plot_DBSCAN_categories(dbscan5.labels_, dbscan5.core_sample_indices_, 
                            title='DBSCAN Clustering (eps=2)')

The `min_samples` parameter in DBSCAN defines the minimum number of points required to form a dense region. It determines the density threshold for clustering. 

Default value: `min_samples=5`

In [12]:
# Default value: min_samples=5
dbscan6 = DBSCAN(min_samples=5).fit(X)

cv_X.plot_DBSCAN_categories(dbscan6.labels_, dbscan6.core_sample_indices_, 
                            title='DBSCAN Clustering (min_samples=5)')

In [13]:
# A low min_samples value may result in too many points being classified as core points
dbscan7 = DBSCAN(min_samples=3).fit(X)

cv_X.plot_DBSCAN_categories(dbscan7.labels_, dbscan7.core_sample_indices_, 
                            title='DBSCAN Clustering (min_samples=3)')

In [14]:
# A high min_samples value may cause fewer core points and more noise.
dbscan8 = DBSCAN(min_samples=8).fit(X)

cv_X.plot_DBSCAN_categories(dbscan8.labels_, dbscan8.core_sample_indices_, 
                            title='DBSCAN Clustering (min_samples=8)')

Choosing appropriate values of `eps` and `min_samples` is essential for the algorithm to accurately capture the data's intrinsic clustering structure.
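To complement the plots, the effect of both parameters can also be summarized numerically. The following is a minimal sketch: `summarize` is a hypothetical helper (not part of scikit-learn or `ClusterVisualizer`), and `X_demo` regenerates the same blobs used in the cells above.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Same synthetic blobs as in the simple example (X_demo is a hypothetical name)
X_demo, _ = make_blobs(random_state=0, centers=2, cluster_std=0.5, n_samples=50)

def summarize(eps=0.5, min_samples=5):
    """Hypothetical helper: (n_clusters, n_core, n_noise) for one DBSCAN fit."""
    model = DBSCAN(eps=eps, min_samples=min_samples).fit(X_demo)
    labels = model.labels_
    # The label -1 marks noise and is not counted as a cluster
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_core = len(model.core_sample_indices_)
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_core, n_noise

# Vary eps with min_samples fixed at its default, then the other way around
eps_results = {eps: summarize(eps=eps) for eps in (0.35, 0.5, 1, 2)}
ms_results = {ms: summarize(min_samples=ms) for ms in (3, 5, 8)}

for eps, (k, core, noise) in eps_results.items():
    print(f"eps={eps}: clusters={k}, core={core}, noise={noise}")
for ms, (k, core, noise) in ms_results.items():
    print(f"min_samples={ms}: clusters={k}, core={core}, noise={noise}")
```

With `min_samples` fixed, the number of noise points can only stay the same or shrink as `eps` grows, and with `eps` fixed, the number of core points can only stay the same or shrink as `min_samples` grows. The number of clusters has no such guarantee, which is why it is worth inspecting for each setting.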

## OPTICS

OPTICS (Ordering Points To Identify the Clustering Structure) is a clustering algorithm that identifies variable-density clusters by ordering data points based on their spatial density, without requiring a predetermined number of clusters.

### A Simple Example

In [15]:
# Generate sample data with three clusters of different densities
np.random.seed(0)
n_samples = 100
centers = [(0, 0), (6, 3), (2, 10)]
cluster_std = [0.5, 0.1, 2.0]  # Different standard deviations (density)

Xs, _ = make_blobs(n_samples=n_samples, centers=centers, cluster_std=cluster_std)

In [16]:
# Saving data to a DataFrame
df_Xs = pd.DataFrame(Xs, columns=["x", "y"])
df_Xs.head()

| | x | y |
|---|---|---|
| 0 | 0.077474 | 0.189081 |
| 1 | 0.712763 | 5.553194 |
| 2 | 0.033259 | 0.151236 |
| 3 | -0.208767 | 10.10433 |
| 4 | 1.088935 | 10.034958 |


In [17]:
# Visualizing the data
cv_Xs = cv.ClusterVisualizer(df_Xs)

cv_Xs.plot_data()

In [18]:
# Applying DBSCAN
dbscan_s = DBSCAN(eps=0.5, min_samples=10).fit(Xs)

cv_Xs.plot_DBSCAN_categories(dbscan_s.labels_, dbscan_s.core_sample_indices_,
                             title='DBSCAN Clustering')

DBSCAN fails to detect all three clusters. It correctly captures the densest cluster, only partly detects the medium-density cluster, and misses the least dense cluster entirely: all of its points are marked as noise.
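One way to see why a single `eps` cannot suit all three clusters is to look at the distance from each point to its 10th closest point (counting the point itself). A point is a core point for `min_samples=10` exactly when that distance is at most `eps`. The sketch below is illustrative only: it regenerates the data under the hypothetical name `Xs_demo` and uses the true blob labels, which would not be available in a real clustering task.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Same three blobs as above (Xs_demo is a hypothetical name)
stds = [0.5, 0.1, 2.0]
Xs_demo, y_true = make_blobs(n_samples=100, centers=[(0, 0), (6, 3), (2, 10)],
                             cluster_std=stds, random_state=0)

k = 10  # matches min_samples in the DBSCAN call above

# The last column is the distance to the k-th closest point,
# where the query point itself is the first (distance 0)
distances, _ = NearestNeighbors(n_neighbors=k).fit(Xs_demo).kneighbors(Xs_demo)
k_dist = distances[:, -1]

# Median k-distance inside each true blob, keyed by that blob's standard deviation
median_k_dist = {
    std: float(np.median(k_dist[y_true == i])) for i, std in enumerate(stds)
}
for std, d in median_k_dist.items():
    print(f"cluster_std={std}: median {k}-distance = {d:.3f}")
```

The typical k-distance grows with the spread of the blob, so an `eps` that is small enough to suit the tightest cluster leaves the points of the widest cluster without enough neighbors to become core points.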

Let's apply the OPTICS algorithm!

In [19]:
from sklearn.cluster import OPTICS

optics = OPTICS(min_samples=10, xi=0.05).fit(Xs)

In [20]:
cv_Xs.plot_density_based_clustering(optics.labels_, title='OPTICS Clustering')

As you can see, OPTICS can handle varying densities and identify all clusters more effectively.

### OPTICS parameters: min_samples, xi and min_cluster_size

The OPTICS algorithm has several parameters. Let's analyze `min_samples`, `xi`, and `min_cluster_size`.

The `min_samples` parameter in the OPTICS algorithm specifies the minimum number of neighboring points required to form a core point, influencing the algorithm's sensitivity to cluster density and noise. It defines the minimum density required for a cluster.

In [21]:
# Increasing the min_samples value may lead to no clusters being detected.
labels2 = OPTICS(min_samples=50).fit_predict(Xs)

cv_Xs.plot_density_based_clustering(labels2, title='OPTICS Clustering (min_samples=50)')

In [22]:
# Decreasing the min_samples value may lead to more noise points and a wrong number of clusters.
labels3 = OPTICS(min_samples=5).fit_predict(Xs)

cv_Xs.plot_density_based_clustering(labels3, title='OPTICS Clustering (min_samples=5)')

The `xi` parameter is a threshold between 0 and 1 that sets the minimum steepness on the reachability plot that counts as a cluster boundary. In other words, it determines how sharp a relative change in density must be for the algorithm to start or end a cluster.

- A higher `xi` value makes the algorithm form fewer, larger clusters.
- A lower `xi` value results in more, smaller clusters, reflecting more subtle changes in density.
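The reachability plot mentioned above can be built from the attributes of a fitted `OPTICS` model. This is a minimal sketch that regenerates the three-density data under the hypothetical name `Xs_demo`; it only extracts the values and leaves the plotting itself out.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Same three blobs as above (Xs_demo is a hypothetical name)
Xs_demo, _ = make_blobs(n_samples=100, centers=[(0, 0), (6, 3), (2, 10)],
                        cluster_std=[0.5, 0.1, 2.0], random_state=0)

model = OPTICS(min_samples=10, xi=0.05).fit(Xs_demo)

# Reachability distances and labels arranged in the order OPTICS visited the points
reachability = model.reachability_[model.ordering_]
ordered_labels = model.labels_[model.ordering_]

# The first visited point has no predecessor, so its reachability is infinite
finite = reachability[np.isfinite(reachability)]
print(f"finite reachability range: {finite.min():.3f} to {finite.max():.3f}")

n_clusters = len(set(model.labels_)) - (1 if -1 in model.labels_ else 0)
print(f"clusters extracted with xi=0.05: {n_clusters}")
```

Plotting `reachability` against its position in the ordering gives the reachability plot: valleys correspond to dense regions, and `xi` controls how steep the sides of a valley must be before it is reported as a cluster.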

In [23]:
# An xi value close to 1 may lead to fewer clusters being detected.
labels4 = OPTICS(min_samples=10, xi=0.9).fit_predict(Xs)

cv_Xs.plot_density_based_clustering(labels4, title='OPTICS Clustering (xi=0.9)')

In [24]:
# A lower xi value (0.3 instead of 0.9) may allow more clusters to be detected.
labels5 = OPTICS(min_samples=10, xi=0.3).fit_predict(Xs)

cv_Xs.plot_density_based_clustering(labels5, title='OPTICS Clustering (xi=0.3)')

The `min_cluster_size` parameter defines the minimum number of points a group must contain to be recognized as a distinct cluster, thereby determining the smallest size of clusters that the algorithm will identify.

It can be an integer value or a fraction. The fraction represents the minimum size of a cluster as a proportion of the total number of points in the dataset. For example, setting it to 0.05 means a cluster must contain at least 5% of the total data points to be considered a valid cluster.
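The conversion from a fraction to a point count is simple arithmetic. In the sketch below, `min_points_for_cluster` is a hypothetical helper, and we assume that the product is truncated to an integer with a floor of 2 points; that rounding rule is an assumption about scikit-learn's behavior, so treat the exact counts as approximate.

```python
def min_points_for_cluster(min_cluster_size, n_samples):
    """Hypothetical helper: absolute point count implied by min_cluster_size.

    Assumption: values up to 1 are fractions of n_samples, truncated to an
    integer and never smaller than 2; larger values are taken as counts.
    """
    if min_cluster_size <= 1:
        return max(2, int(min_cluster_size * n_samples))
    return int(min_cluster_size)

# The three-density dataset has 100 points spread over three blobs
for fraction in (0.1, 0.3, 0.4):
    needed = min_points_for_cluster(fraction, n_samples=100)
    print(f"min_cluster_size={fraction}: at least {needed} points per cluster")
```

Since each blob in the three-density dataset holds about a third of the 100 points, a fraction above roughly one third asks for clusters larger than any single blob.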

In [25]:
# We know there are three clusters with different densities. 
# Let's use min_cluster_size = 0.3
labels6 = OPTICS(min_cluster_size=0.3).fit_predict(Xs)

cv_Xs.plot_density_based_clustering(labels6, title='OPTICS Clustering (min_cluster_size=0.3)')

In [26]:
# A higher min_cluster_size value may lead to fewer clusters.
labels7 = OPTICS(min_cluster_size=0.4).fit_predict(Xs)

cv_Xs.plot_density_based_clustering(labels7, title='OPTICS Clustering (min_cluster_size=0.4)')

In [27]:
# A low min_cluster_size value may lead to more, smaller clusters being detected.
labels8 = OPTICS(min_cluster_size=0.1).fit_predict(Xs)

cv_Xs.plot_density_based_clustering(labels8, title='OPTICS Clustering (min_cluster_size=0.1)')

## Other Examples

Let's analyze the behavior of DBSCAN and OPTICS methods with some well-known datasets.

### Half Moons Example

In [28]:
from sklearn.datasets import make_moons

Xm, _ = make_moons(500, noise=.05, random_state=0)

# Saving data to a DataFrame
df_Xm = pd.DataFrame(Xm, columns=["x", "y"])
df_Xm.head()

| | x | y |
|---|---|---|
| 0 | 0.391849 | 0.904123 |
| 1 | -0.095541 | 0.458993 |
| 2 | 0.111626 | 0.099991 |
| 3 | 1.761204 | -0.124941 |
| 4 | 1.907206 | -0.099912 |


In [29]:
# Plotting the data
cv_Xm = cv.ClusterVisualizer(df_Xm)

cv_Xm.plot_data(title='Half Moons')

In [30]:
# Standardizing the data for better results
from sklearn.preprocessing import StandardScaler

Xm = StandardScaler().fit_transform(Xm)
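Standardizing matters here because `eps` is a distance measured in the units of the features, so a feature with a larger spread would otherwise dominate the neighborhoods. The sketch below simply shows what `StandardScaler` does to the half-moons data; `Xm_raw` and `Xm_scaled` are hypothetical names used to keep both versions side by side.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Same half-moons data as above, kept under separate (hypothetical) names
Xm_raw, _ = make_moons(500, noise=0.05, random_state=0)
Xm_scaled = StandardScaler().fit_transform(Xm_raw)

# Before scaling, the two features have different spreads
print("std before:", Xm_raw.std(axis=0).round(3))

# After scaling, each feature has zero mean and unit standard deviation
print("mean after:", Xm_scaled.mean(axis=0).round(3))
print("std after: ", Xm_scaled.std(axis=0).round(3))
```

After scaling, a given `eps` covers the same number of standard deviations along each axis, which makes the default `eps=0.5` a more reasonable starting point.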

In [31]:
labels_db2 = DBSCAN(min_samples=10).fit_predict(Xm)   

cv_Xm.plot_density_based_clustering(labels_db2, title='Half Moons - DBSCAN Clustering')

In [32]:
cv_Xm.plot_DBSCAN_categories(labels_db2, title='Half Moons - DBSCAN Clustering')

In [33]:
labels_op = OPTICS(min_cluster_size=0.5).fit_predict(Xm)

cv_Xm.plot_density_based_clustering(labels_op, title='Half Moons - OPTICS Clustering')

### Concentric Circles Example

In [34]:
from sklearn.datasets import make_circles

Xc, _ = make_circles(n_samples=1000, random_state=0, noise=0.08, factor=0.2)

# Saving data to a DataFrame
df_Xc = pd.DataFrame(Xc, columns=["x", "y"])

In [35]:
# Plotting the data
cv_Xc = cv.ClusterVisualizer(df_Xc)

cv_Xc.plot_data(title='Concentric Circles')

In [36]:
# Standardizing the data for better results
Xc = StandardScaler().fit_transform(Xc)

In [37]:
labels_db3 = DBSCAN(min_samples=10).fit_predict(Xc)   

cv_Xc.plot_density_based_clustering(labels_db3, title='Concentric Circles - DBSCAN Clustering')

In [38]:
cv_Xc.plot_DBSCAN_categories(labels_db3, title='Concentric Circles - DBSCAN Clustering')

In [39]:
labels_c = OPTICS(min_cluster_size=0.5).fit_predict(Xc)

cv_Xc.plot_density_based_clustering(labels_c, title='Concentric Circles - OPTICS Clustering')

## Conclusions

Key Takeaways:
- Density-based methods can identify clusters with complex shapes, such as half-moons and concentric circles, and OPTICS additionally handled the dataset with non-uniform densities.
- The choice of parameters significantly influences the detection of dense regions and the designation of points as core, border, or noise.
- DBSCAN struggles with clusters of varying density because a single pair of `eps` and `min_samples` values applies to the whole dataset. OPTICS handles these variations better, and its additional parameters such as `xi` and `min_cluster_size` offer more flexibility in identifying clusters of different densities.

## References

- [DBSCAN - sklearn library](https://scikit-learn.org/stable/modules/clustering.html#dbscan)
- [OPTICS - sklearn library](https://scikit-learn.org/stable/modules/clustering.html#optics)
- Müller, A.C. & Guido, S. (2017). Introduction to Machine Learning with Python: A Guide for Data Scientists. USA: O’Reilly, chapter 3.