# Bridging OPTICS and DBSCAN Clustering

## Objectives

- Explore the integration of OPTICS and DBSCAN algorithms to enhance clustering analysis by leveraging their individual strengths.
- Examine the effects of parameter adjustments on clustering outcomes in hybrid setups.
- Evaluate the combined approach on datasets with varying densities and complexity.

## Background

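This notebook explores a hybrid clustering approach. OPTICS first computes a density-based ordering of the points together with their reachability distances; DBSCAN-style clusters are then extracted from that ordering for a chosen `eps`. Because the ordering is computed once, several `eps` values can be compared cheaply, which helps on data with varying densities and complex structures.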

## Datasets Used

- Synthetic datasets with clusters of different densities show the impact of varying `eps` and how well the hybrid approach separates the clusters.
- The 'quakes' dataset applies the techniques to real-world geospatial data and shows how they handle noise and outliers.

## A Simple Example

Here we will study a hybrid approach that combines the versatility of OPTICS with the simplicity of DBSCAN.

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 8)

import plotly.express as px
import plotly.graph_objects as go

from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN, OPTICS

from pydataset import data

import ClusterVisualizer as cv
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook_connected"

In [2]:
np.random.seed(0)
n_samples = 100
centers = [(0, 0), (3, 10), (12, 5)]
# Different standard deviations (density)
cluster_std = [0.6, 0.1, 3.0]  

X, _ = make_blobs(n_samples=n_samples, centers=centers, cluster_std=cluster_std)

In [3]:
# Save the data to a DataFrame
df_X = pd.DataFrame(X, columns=['x', 'y'])
df_X.head()

| | x | y |
|---|---|---|
| 0 | 0.092968 | 0.226898 |
| 1 | 10.069145 | -1.670209 |
| 2 | 0.03991 | 0.181483 |
| 3 | 8.68685 | 5.156495 |
| 4 | 10.633402 | 5.052437 |


In [4]:
cv_X = cv.ClusterVisualizer(df_X)

cv_X.plot_data()

In [5]:
opt_model = OPTICS(min_samples=10, min_cluster_size=.05).fit(X)

cv_X.plot_clusters(opt_model.labels_, title='Data - OPTICS Algorithm')

Remember, the labels come from the OPTICS clustering logic: the algorithm visits the points in a density-based order, records a reachability distance for each one, and extracts clusters from the resulting reachability plot according to the OPTICS parameters.
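As a minimal sketch of the fitted attributes involved, the block below fits OPTICS on a small illustrative dataset (`X_demo` is an assumption made for this example, not the notebook's `X`) and inspects `ordering_`, `reachability_`, and `labels_`.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Illustrative data (hypothetical, not the notebook's X): two blobs of different spread
X_demo, _ = make_blobs(n_samples=60, centers=[(0, 0), (6, 6)],
                       cluster_std=[0.4, 1.0], random_state=0)

model = OPTICS(min_samples=5).fit(X_demo)

order = model.ordering_       # permutation of sample indices, in visiting order
reach = model.reachability_   # reachability distance, indexed by sample
labels = model.labels_        # cluster label per sample, -1 marks noise

# The first visited point has no predecessor, so its reachability is infinite
print(reach[order[0]])
# Reachability listed in visiting order is what a reachability plot displays
print(reach[order][:5].round(2))
print(np.unique(labels))
```

Indexing `reachability_` with `ordering_` is the step that turns per-sample values into the sequence drawn in a reachability plot.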

In [6]:
# The reachability distance of each observation
opt_model.reachability_[opt_model.ordering_].round(2)

array([ inf, 0.52, 0.45, 0.45, 0.37, 0.37, 0.37, 0.37, 0.37, 0.37, 0.37,
       0.37, 0.39, 0.43, 0.45, 0.47, 0.49, 0.52, 0.53, 0.55, 0.55, 0.56,
       0.59, 0.62, 0.62, 0.62, 0.62, 0.7 , 0.74, 0.78, 0.78, 0.83, 0.83,
       0.88, 7.64, 3.53, 2.4 , 2.4 , 2.4 , 2.4 , 2.4 , 2.4 , 2.4 , 2.4 ,
       2.66, 2.66, 2.66, 2.68, 2.68, 2.7 , 2.7 , 2.7 , 2.87, 2.87, 2.87,
       2.87, 2.99, 2.99, 2.99, 3.07, 3.07, 3.08, 3.25, 3.25, 3.65, 4.19,
       4.23, 5.6 , 0.23, 0.12, 0.1 , 0.1 , 0.1 , 0.09, 0.08, 0.08, 0.08,
       0.08, 0.08, 0.08, 0.08, 0.08, 0.08, 0.08, 0.08, 0.08, 0.08, 0.09,
       0.09, 0.09, 0.09, 0.1 , 0.1 , 0.1 , 0.11, 0.11, 0.11, 0.13, 0.13,
       0.14])

In [7]:
# Reachability plot
fig_r = px.line(x=range(len(opt_model.reachability_)),   
                y=opt_model.reachability_[opt_model.ordering_],
                width=600, height=400, title='Reachability Plot')
fig_r.update_yaxes(title_text='Reachability')
fig_r.show()    

## Bridging OPTICS and DBSCAN Clustering

`cluster_optics_dbscan` is a scikit-learn function that extracts a DBSCAN-like clustering from the ordering, reachability distances, and core distances computed by OPTICS.

It allows users to apply the DBSCAN algorithm with a specified `eps` value on the results of an OPTICS clustering. 

It combines the strengths of both algorithms for more flexible cluster extraction.
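The function can also be called directly, without a plotting helper. The sketch below assumes an illustrative dataset similar to the one above (`X_demo` and `optics_demo` are names introduced here) and passes the fitted OPTICS attributes as keyword arguments.

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.datasets import make_blobs

# Illustrative data (hypothetical): three blobs with different densities
X_demo, _ = make_blobs(n_samples=100, centers=[(0, 0), (3, 10), (12, 5)],
                       cluster_std=[0.6, 0.1, 3.0], random_state=0)

# Fit OPTICS once; the ordering and distances are reused for any eps
optics_demo = OPTICS(min_samples=10).fit(X_demo)

# Extract DBSCAN-like labels for one eps value
labels_eps = cluster_optics_dbscan(
    reachability=optics_demo.reachability_,
    core_distances=optics_demo.core_distances_,
    ordering=optics_demo.ordering_,
    eps=3.0,
)

n_clusters = len(np.unique(labels_eps[labels_eps != -1]))  # exclude the noise label
n_noise = int(np.sum(labels_eps == -1))
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

No refit is needed to try another `eps`; only the extraction call is repeated.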

Let's use the method `plot_optics_dbscan` to visualize a scatter plot of the clusters and a reachability plot with reachability distances, enhancing understanding of the data's structure and clustering behavior.

In [8]:
# Plot the result of applying DBSCAN over the reachability distances.
cv_X.plot_optics_dbscan(opt_model, eps=0.5)

- With `eps = 0.5`, the algorithm detects two clusters. 
- Points that are close together and form a dense region are grouped into the same cluster.
- Points that do not fall into any cluster with sufficient density are colored grey and marked as noise (with the label -1). Notice there are too many noise points.

In [9]:
cv_X.plot_optics_dbscan(opt_model, eps=1)

- With `eps = 1.0`, the algorithm detects the red points as a cluster without noise.
- The more dispersed points are not considered a cluster; they are classified as noise points.

In [10]:
cv_X.plot_optics_dbscan(opt_model, eps=3)

The algorithm results with `eps = 3.0` are pretty good.
- The algorithm detects three clusters. 
- Points that are close together and form a dense region are grouped into the same cluster.
- Points that do not fall into any cluster with sufficient density are colored grey and marked as noise (with the label -1). Notice there are few noise points.

In [11]:
cv_X.plot_optics_dbscan(opt_model, eps=4)

With `eps = 4.0`, there are only two noise points, both far from the nearest cluster.

In [12]:
cv_X.plot_optics_dbscan(opt_model, eps=5)

With `eps = 5.0`, there are three clusters without noise points.
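The visual comparison across `eps` values can be complemented with counts. The following is a sketch under stated assumptions: `summarize_eps` is a hypothetical helper, and the data are regenerated here with `random_state=0`, so the numbers need not match the plots above exactly.

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.datasets import make_blobs

# Illustrative data (hypothetical): three blobs with different densities
X_demo, _ = make_blobs(n_samples=100, centers=[(0, 0), (3, 10), (12, 5)],
                       cluster_std=[0.6, 0.1, 3.0], random_state=0)
optics_demo = OPTICS(min_samples=10).fit(X_demo)

def summarize_eps(model, eps):
    """Return (number of clusters, number of noise points) for one eps."""
    labels = cluster_optics_dbscan(
        reachability=model.reachability_,
        core_distances=model.core_distances_,
        ordering=model.ordering_,
        eps=eps,
    )
    n_clusters = len(np.unique(labels[labels != -1]))
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise

eps_values = (0.5, 1, 3, 4, 5)
summary = {eps: summarize_eps(optics_demo, eps) for eps in eps_values}
for eps, (k, noise) in summary.items():
    print(f"eps={eps}: {k} clusters, {noise} noise points")
```

Since a point is labeled noise only when both its reachability and its core distance exceed `eps`, the noise count cannot grow as `eps` increases.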

## Example 2: Four Clusters with Different Densities

Let's generate a slightly more complicated dataset with four clusters of different densities.

In [13]:
# Generate data with different standard deviations
random_state = 2
centers = [(-2, 3), (0, 2), (3, -2), (7, 3)]
# Different standard deviations (density)
cluster_std = [0.3, 0.05, 1.2, 0.7]  

X2, _ = make_blobs(n_samples=200, centers=centers, cluster_std=cluster_std, random_state=random_state)

In [14]:
# Save the data to a DataFrame
df_X2 = pd.DataFrame(X2, columns=['x', 'y'])
df_X2.head()

| | x | y |
|---|---|---|
| 0 | 7.652321 | 4.027832 |
| 1 | 7.968647 | 3.103844 |
| 2 | -2.056841 | 2.976834 |
| 3 | -0.005721 | 1.975091 |
| 4 | 6.230492 | 2.522929 |


In [15]:
cv_X2 = cv.ClusterVisualizer(df_X2)

cv_X2.plot_data(title='Data 2 - Different Densities Clustering')

In [16]:
# Fit the OPTICS model
opt_model2 = OPTICS(min_cluster_size=0.2).fit(X2)

In [17]:
# Plot the clusters
cv_X2.plot_density_based_clustering(opt_model2.labels_, title='Data 2 - OPTICS Algorithm')

The OPTICS method effectively detects the four clusters with different densities. It marks only one point as noise. Notice that the noise point is far from all clusters, so the results are pretty good.


Let's apply the `cluster_optics_dbscan` function with different `eps` values and see how the choice of `eps` affects the results. Remember that `cluster_optics_dbscan` takes the OPTICS ordering and extracts clusters based on a specified `eps` value, similar to DBSCAN.
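To see how close the extraction is to DBSCAN itself, one can compare both labelings for the same `eps` and `min_samples`. This is a sketch on regenerated data; the parameter values are assumptions chosen for illustration, and the two labelings may differ on points at the periphery of clusters.

```python
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS, cluster_optics_dbscan
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Illustrative data (hypothetical): four blobs with different densities
X_demo, _ = make_blobs(n_samples=200,
                       centers=[(-2, 3), (0, 2), (3, -2), (7, 3)],
                       cluster_std=[0.3, 0.05, 1.2, 0.7], random_state=2)

eps, min_samples = 1.0, 5  # assumed values for this comparison

# DBSCAN-like labels extracted from an OPTICS fit
optics_demo = OPTICS(min_samples=min_samples).fit(X_demo)
labels_optics = cluster_optics_dbscan(
    reachability=optics_demo.reachability_,
    core_distances=optics_demo.core_distances_,
    ordering=optics_demo.ordering_,
    eps=eps,
)

# Labels from DBSCAN run directly with the same parameters
labels_dbscan = DBSCAN(eps=eps, min_samples=min_samples).fit(X_demo).labels_

# Agreement between the two partitions (1.0 means identical up to relabeling)
ari = adjusted_rand_score(labels_dbscan, labels_optics)
print(f"adjusted Rand index: {ari:.3f}")
print("noise (DBSCAN, OPTICS extraction):",
      int(np.sum(labels_dbscan == -1)), int(np.sum(labels_optics == -1)))
```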

In [18]:
# Starting with eps = 0.5
cv_X2.plot_optics_dbscan(opt_model2, eps=0.5)

- `eps = 0.5` is too small. The algorithm detects many clusters. 
- Points that are close together and form a dense region are grouped into the same cluster.
- Points that do not fall into any cluster with sufficient density are colored grey and marked as noise (with the label -1). Notice there are too many noise points.

In [19]:
# Increasing eps to 1
cv_X2.plot_optics_dbscan(opt_model2, eps=1)

- The algorithm using `eps = 1` gives pretty good results. It detects the four clusters. 
- Points that are close together and form a dense region are grouped into the same cluster.
- Points that do not fall into any cluster with sufficient density are colored grey and marked as noise (with the label -1). Notice there are few noise points.

In [20]:
# Increasing eps to 1.5
cv_X2.plot_optics_dbscan(opt_model2, eps=1.5)

- The algorithm using `eps = 1.5` gives the same result as the OPTICS method: `OPTICS(min_cluster_size=0.2).fit(X2)`

In [21]:
# Increasing eps to 2
cv_X2.plot_optics_dbscan(opt_model2, eps=2)

- The algorithm using `eps = 2` only detects two big clusters without noise points.

In [22]:
# Increasing eps to 4
cv_X2.plot_optics_dbscan(opt_model2, eps=4)

- The algorithm using `eps = 4` only detects one big cluster without noise points.

Remember, there is no one-size-fits-all value for `eps`; it is highly dependent on the scale and density variations in the data. Visual inspection of results often complements the quantitative approach to fine-tuning `eps`.
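One possible quantitative starting point is the distribution of core distances, since a point can only be a core point when `eps` is at least its core distance. The sketch below is a heuristic under stated assumptions (regenerated data, `min_samples=5`, arbitrarily chosen percentiles), not a rule for selecting `eps`.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Illustrative data (hypothetical): four blobs with different densities
X_demo, _ = make_blobs(n_samples=200,
                       centers=[(-2, 3), (0, 2), (3, -2), (7, 3)],
                       cluster_std=[0.3, 0.05, 1.2, 0.7], random_state=2)

optics_demo = OPTICS(min_samples=5).fit(X_demo)

# core_distances_[i] is the distance at which sample i becomes a core point
core_sorted = np.sort(optics_demo.core_distances_)

# Percentiles give candidate eps values to inspect visually
levels = [50, 75, 90, 95]
percentiles = np.percentile(core_sorted, levels)
for level, value in zip(levels, percentiles):
    print(f"{level}th percentile of core distance: {value:.2f}")
```

Candidates obtained this way still need the visual check described above, because clusters of different density have different natural scales.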

## Quakes Dataset

In [23]:
# Load the 'quakes' dataset
quakes = data('quakes')
print(quakes.shape)
quakes.head()

(1000, 5)


| | lat | long | depth | mag | stations |
|---|---|---|---|---|---|
| 1 | -20.42 | 181.62 | 562 | 4.8 | 41 |
| 2 | -20.62 | 181.03 | 650 | 4.2 | 15 |
| 3 | -26.0 | 184.1 | 42 | 5.4 | 43 |
| 4 | -17.97 | 181.66 | 626 | 4.1 | 19 |
| 5 | -20.42 | 181.96 | 649 | 4.0 | 11 |


In [24]:
# Selecting relevant features for clustering
features_o = quakes[['depth', 'mag']]

In [25]:
# Plotting the data
features_o_melted = features_o.melt(var_name='feature', value_name='value')

fig_o = px.box(features_o_melted, x='feature', y='value')
fig_o.update_layout(width=600, height=400,
                    title='Quakes - Boxplot of Original Data')
fig_o.show()

In [26]:
# Scaling the features
scaler = StandardScaler()
features_s = scaler.fit_transform(features_o)

features_s = pd.DataFrame(features_s, columns=features_o.columns)
features_s.head()

| | depth | mag |
|---|---|---|
| 0 | 1.163402 | 0.446132 |
| 1 | 1.571892 | -1.044286 |
| 2 | -1.250401 | 1.93655 |
| 3 | 1.460485 | -1.29269 |
| 4 | 1.56725 | -1.541093 |


In [27]:
# Plotting the data
features_s_melted = features_s.melt(var_name='feature', value_name='value')

fig_s = px.box(features_s_melted, x='feature', y='value')
fig_s.update_layout(width=600, height=400,
                    title='Quakes - Boxplot of Standardized Data')
fig_s.show()

In [28]:
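# Plotting the data using latitude and longitude
# Note: this rebinds the name `cv` from the ClusterVisualizer module to an instance,
# so the module is no longer reachable as `cv` in later cells.
cv = cv.ClusterVisualizer(quakes)

cv.plot_data(title='Quakes - Original Data')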

In [29]:
# Applying DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5).fit(features_s)

# Plotting the clusters
cv.plot_density_based_clustering(dbscan.labels_, title='Quakes - DBSCAN Algorithm')

Let's create a function for plotting the data with noise.

In [30]:
def plot_box_with_noise(data, cluster, title='Boxplots with Noise Points'):
    '''
    Plot boxplots for each feature in the data, with the noise points highlighted.
    data: DataFrame with the features
    cluster: Array with the cluster labels
    '''
    data = data.copy()
    data['cluster'] = cluster
    data_melted = data.melt(id_vars='cluster', var_name='feature', value_name='value')
    # Separate the noise points for the scatter plot
    noise_data = data_melted[data_melted['cluster'] == -1]
    fig = px.box(data_melted, x='feature', y='value', notched=True)
    fig.update_layout(width=600, height=400, title=title, legend=dict(x=0.8, y=1.12))
    # Adding scatter plot for noise
    fig.add_trace(go.Scatter(x=noise_data['feature'], y=noise_data['value'], mode='markers', 
                             marker=dict(color='grey', size=7), name='Noise'))
    fig.show()

In [31]:
plot_box_with_noise(features_s, dbscan.labels_, title='Quakes - Boxplots with Noise Points')

Notice that the noise points found by the DBSCAN algorithm correspond to the outliers shown in the standardized boxplot of earthquake magnitude.
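This correspondence can also be checked numerically. The sketch below uses hypothetical helpers (`iqr_outlier_mask`, `noise_outlier_overlap`) and toy values rather than the quakes data, and it assumes the common 1.5 × IQR fence rule; the quartile method used by the plotting library may place the fences slightly differently.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Flag values outside the [Q1 - k*IQR, Q3 + k*IQR] fences."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

def noise_outlier_overlap(values, labels, k=1.5):
    """Count noise points, boxplot-style outliers, and points that are both."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    outliers = iqr_outlier_mask(values, k)
    noise = labels == -1
    return {
        "n_noise": int(noise.sum()),
        "n_outliers": int(outliers.sum()),
        "n_both": int((noise & outliers).sum()),
    }

# Toy input (hypothetical): one unusually large value, labeled as noise
toy_values = np.array([4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 6.4])
toy_labels = np.array([0, 0, 0, 0, 0, 0, 0, -1])
overlap = noise_outlier_overlap(toy_values, toy_labels)
print(overlap)
```

With the notebook's objects, the call would take a feature column such as `features_s['mag']` together with `dbscan.labels_`.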

In [32]:
# Applying OPTICS
optics = OPTICS(min_samples=5, min_cluster_size=0.2).fit(features_s)

# Plotting the clusters
cv.plot_density_based_clustering(optics.labels_, title='Quakes - OPTICS Algorithm')

In [33]:
plot_box_with_noise(features_s, optics.labels_, title='Quakes - Boxplots with Noise Points')

The OPTICS algorithm reports more noise points. As with DBSCAN, those points correspond to the higher values of earthquake magnitude.
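The amount of noise each algorithm reports can be compared as a fraction of the data. This sketch uses synthetic standardized data rather than the quakes features, with the same parameter values as the cells above; `noise_fraction` is a hypothetical helper.

```python
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Illustrative data (hypothetical), standardized like the quakes features
X_raw, _ = make_blobs(n_samples=300, centers=3,
                      cluster_std=[0.5, 1.0, 2.0], random_state=0)
X_std = StandardScaler().fit_transform(X_raw)

def noise_fraction(labels):
    """Share of samples labeled as noise (-1)."""
    labels = np.asarray(labels)
    return float(np.mean(labels == -1))

models = {
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
    "OPTICS": OPTICS(min_samples=5, min_cluster_size=0.2),
}
fractions = {name: noise_fraction(model.fit(X_std).labels_)
             for name, model in models.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of points labeled as noise")
```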

## Conclusions

Key Takeaways:
- Extracting DBSCAN-style clusters from an OPTICS ordering identifies clusters effectively, and `eps` controls both the cluster boundaries and how many points are labeled as noise.
- In the synthetic examples, a single OPTICS fit supported many `eps` values: small values produced many noise points, intermediate values recovered the generated clusters, and, in the second example, large values merged clusters.
- On the 'quakes' dataset, DBSCAN and OPTICS were each applied to the standardized depth and magnitude. Both marked the high-magnitude events as noise, and OPTICS reported more noise points.
- The approach shows promise for exploratory data analysis, since parameter tuning adapts it to the scale and density of each dataset.

## References

- [DBSCAN - sklearn library](https://scikit-learn.org/stable/modules/clustering.html#dbscan)
- [OPTICS - sklearn library](https://scikit-learn.org/stable/modules/clustering.html#optics)
- Müller, A.C. & Guido, S. (2017) Introduction to Machine Learning with Python: A Guide for Data Scientists. USA: O’Reilly, chapter 3, page 187.