# Clustering: DBSCAN

Now let's try using DBSCAN for more challenging cluster shapes.

DBSCAN is not as popular as K-Means, but it is vital for irregular cluster shapes, which centroid-based algorithms like K-Means struggle with.
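
One way to see why K-Means struggles is that it labels each point by its closest centroid, so the boundary between two clusters is a straight line. The sketch below checks this on a small moons dataset. The `_demo` names, the fixed `random_state` and the sample size are assumptions for illustration and are not part of the exercises.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Fixed seed so the sketch is reproducible (the notebook itself does not set one)
X_demo, _ = make_moons(n_samples=200, noise=0.1, random_state=0)

km_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)

# Distance from every point to every centroid, shape (n_samples, n_clusters)
dists_demo = np.linalg.norm(
    X_demo[:, None, :] - km_demo.cluster_centers_[None, :, :], axis=2
)
nearest_demo = dists_demo.argmin(axis=1)

# Fraction of points whose K-Means label is simply "the closest centroid"
agreement = np.mean(nearest_demo == km_demo.labels_)
print(f"Labels matching the nearest centroid: {agreement:.0%}")
```

Because the assignment depends only on distance to a centroid, a curved cluster such as a moon tends to be cut in two rather than followed along its shape.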

References:<br>
Scikit-learn DBSCAN: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html <br>
Tutorial for DBSCAN Clustering in Python Sklearn: https://machinelearningknowledge.ai/tutorial-for-dbscan-clustering-in-python-sklearn/ <br>
Comparing different clustering algorithms on toy datasets: https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html#sphx-glr-auto-examples-cluster-plot-cluster-comparison-py

## Installation

In [None]:
%pip install numpy
%pip install -U matplotlib
%pip install scikit-learn
%pip install seaborn

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import silhouette_score, v_measure_score

## Identifying Moon Clusters

Let's make some moon clusters and try to cluster them.

### Viewing our moons

Let's create the data and see what it looks like.

In [None]:
# Generate the moons with 500 samples and a noise of 0.1


# plot the data
plt.scatter(X[:, 0], X[:, 1], c=y, label=y)
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
plt.title("Ground truth clusters")

<details><summary>Click to cheat</summary>

```python
# Generate the moons with 500 samples and a noise of 0.1
X, y = datasets.make_moons(500, noise=0.1)

# plot the data
plt.scatter(X[:, 0], X[:, 1], c=y, label=y)
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
plt.title("Ground truth clusters")
```
</details>

### Creating our clusterers

In [None]:
# Choose your hyperparameters
# k = 
# eps = 
# minPts = 

# Create the models
models = [
    # Create a K-means clusterer
    
    # Create a DBSCAN clusterer
    
]

# Train the models and store in a list called models2


<details><summary>Click to cheat</summary>

```python
# Choose your hyperparameters
k = 2
eps = 0.1639
minPts = 8

# Create the models
models = [
    # Create a K-means clusterer
    KMeans(n_clusters=k),
    # Create a DBSCAN clusterer
    DBSCAN(eps=eps, min_samples=minPts)
]

# Train the models and store in a list called models2
models2 = [model.fit(X) for model in models]
```
</details>

### Plotting the Clusters

In [None]:
titles = (
    f"K-Means with k = {k}",
    f"DBSCAN with eps = {eps} and minPts = {minPts}"
)

# Set-up 1xlen(models) grid for plotting.
fig, sub = plt.subplots(nrows=1, ncols=len(models2), figsize=(5 * len(models2), 5),
        constrained_layout=True)

x_min, x_max = X[:, 0].min() - 0.1, X[:, 0].max() + 0.1
y_min, y_max = X[:, 1].min() - 0.1, X[:, 1].max() + 0.1

for title, model, ax in zip(titles, models2, sub.flatten()):
    labels = model.labels_
    ax.scatter(X[:,0], X[:,1], c=labels, label=y)
    ax.set_xlim(x_min, x_max)
    ax.set_ylim(y_min, y_max)

    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = list(model.labels_).count(-1)
    ax.title.set_text(f"{title}\nNum clusters = {n_clusters}, Num noise = {n_noise}")
    
    v_meas = v_measure_score(y, labels)
    ax.text(
        0.9, 0.1,
        f"{v_meas:.2f}",
        size=15,
        ha="center",
        va="center",
        transform=ax.transAxes,
    )

plt.show()
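
The plotting loop above counts clusters and noise from `labels_`, where DBSCAN marks noise with the label `-1`. A fitted DBSCAN model also exposes `core_sample_indices_`, which lets us split the clustered points into core and border points. The following is a minimal sketch on freshly generated moons; the `_demo` names, the seed and the chosen `eps` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Fixed seed and hyperparameters chosen only for this sketch
X_demo, _ = make_moons(n_samples=500, noise=0.1, random_state=0)
db_demo = DBSCAN(eps=0.15, min_samples=8).fit(X_demo)

# Core points have at least min_samples points (including themselves) within eps
is_core = np.zeros(len(X_demo), dtype=bool)
is_core[db_demo.core_sample_indices_] = True

# Noise points get the label -1; border points are clustered but not core
is_noise = db_demo.labels_ == -1
is_border = ~is_core & ~is_noise

print(f"core = {is_core.sum()}, border = {is_border.sum()}, noise = {is_noise.sum()}")
```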

### Optimising the neighbourhood radius $\varepsilon$

DBSCAN depends on a hyperparameter $\varepsilon$, the radius of the neighbourhood around each point. The results are very sensitive to it, and tuning it by hand is tedious.
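
To get a feel for this sensitivity, the sketch below fits DBSCAN on the same kind of moons data for a handful of $\varepsilon$ values and reports how many clusters and noise points each one yields. The candidate values, the seed and the names `X_demo`, `eps_try` and `results` are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Fixed seed so every eps value sees exactly the same data
X_demo, _ = make_moons(n_samples=500, noise=0.1, random_state=0)

results = {}
for eps_try in (0.01, 0.05, 0.1, 0.15, 0.2, 0.3, 1.0):
    labels_try = DBSCAN(eps=eps_try, min_samples=8).fit(X_demo).labels_
    # Noise is labelled -1 and does not count as a cluster
    n_clusters_try = len(set(labels_try)) - (1 if -1 in labels_try else 0)
    n_noise_try = int(np.sum(labels_try == -1))
    results[eps_try] = (n_clusters_try, n_noise_try)
    print(f"eps = {eps_try:<4}: clusters = {n_clusters_try}, noise = {n_noise_try}")
```

A very small radius leaves every point as noise, while a very large one merges both moons into a single cluster; the useful range lies somewhere in between.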

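What we need is an automated way of finding a good $\varepsilon$. A common heuristic is to sort every point's distance to its $k$-th nearest neighbour and look for the "elbow point" (or knee), where the sorted curve starts to rise sharply.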
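
Before reaching for a dedicated library, here is a rough sketch of what "looking for the elbow" can mean. It sorts the distances to the 10th nearest neighbour, rescales both axes to $[0, 1]$, and picks the point that lies farthest below the straight line joining the two ends of the curve. This is a simplified heuristic chosen for illustration, not the algorithm implemented by `kneed`, and all `_demo` names and the seed are assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X_demo, _ = make_moons(n_samples=500, noise=0.1, random_state=0)

n_neighbors_demo = 10
# Each training point is returned as its own nearest neighbour (distance 0),
# so ask for one extra neighbour and keep the last column
nn_demo = NearestNeighbors(n_neighbors=n_neighbors_demo + 1).fit(X_demo)
dist_demo, _ = nn_demo.kneighbors(X_demo)
k_dist = np.sort(dist_demo[:, -1])

# Rescale both axes to [0, 1]; a convex increasing curve then lies below y = x
x_norm = np.linspace(0.0, 1.0, len(k_dist))
y_norm = (k_dist - k_dist[0]) / (k_dist[-1] - k_dist[0])

# The elbow estimate is the point with the largest gap below the diagonal
elbow_idx = int(np.argmax(x_norm - y_norm))
eps_guess = k_dist[elbow_idx]
print(f"Estimated eps = {eps_guess:.4f} at sorted index {elbow_idx}")
```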

In [None]:
%pip install kneed

In [None]:
from kneed import KneeLocator
from sklearn.neighbors import NearestNeighbors
import pandas as pd

# Put the data in a Pandas DataFrame with named columns
df = pd.DataFrame(X, columns=["X1", "X2"])

# Find each point's distance to its 10th nearest neighbour
# (n_neighbors=11 because each point is returned as its own nearest neighbour)
nn = NearestNeighbors(n_neighbors=11)
neighbors = nn.fit(df)
distances, indices = neighbors.kneighbors(df)
distances = np.sort(distances[:,10], axis=0)

# Find the knee
i = np.arange(len(distances))
knee = KneeLocator(i, distances, S=1, curve="convex", direction="increasing", interp_method="polynomial")
idealDist = distances[knee.knee]

# Plot the distances (plot_knee creates its own figure)
knee.plot_knee(figsize=(5, 5))
plt.xlabel("Points")
plt.ylabel("Distance")
plt.text(
    0.05, 0.8,
    f"Knee distance = {idealDist:.4f}",
    size=15,
    ha="left",
    va="center",
    transform=plt.gca().transAxes,  # use the knee plot's axes, not a stale `ax`
)
plt.show()


## Circle Clusters

Similarly, DBSCAN also works well for concentric circular clusters.

### Generating the data

First things first, we need to generate the data. Let's also see what it looks like.

In [None]:
# Generate the circular data using 1500 samples, a factor of 0.4, and noise of 0.1


# Plot the data
plt.scatter(X[:, 0], X[:, 1], c=y, label=y)
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
plt.title("Ground truth clusters")

<details><summary>Click to cheat</summary>

```python
# Generate the circular data using 1500 samples, a factor of 0.4, and noise of 0.1
X, y = datasets.make_circles(1500, factor=0.4, noise=0.1)

# Plot the data
plt.scatter(X[:, 0], X[:, 1], c=y, label=y)
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
plt.title("Ground truth clusters")
```
</details>

### Finding the Knee

This time, we'll find the knee first.

In [None]:
from kneed import KneeLocator
from sklearn.neighbors import NearestNeighbors
import pandas as pd

# Put the data in a Pandas DataFrame with named columns
df = pd.DataFrame(X, columns=["X1", "X2"])

# Find each point's distance to its 10th nearest neighbour
# (n_neighbors=11 because each point is returned as its own nearest neighbour)
nn = NearestNeighbors(n_neighbors=11)
neighbors = nn.fit(df)
distances, indices = neighbors.kneighbors(df)
distances = np.sort(distances[:,10], axis=0)

# Find the knee
i = np.arange(len(distances))
knee = KneeLocator(i, distances, S=1, curve="convex", direction="increasing", interp_method="polynomial")
idealDist = distances[knee.knee]

# Plot the distances (plot_knee creates its own figure)
knee.plot_knee(figsize=(5, 5))
plt.xlabel("Points")
plt.ylabel("Distance")
plt.text(
    0.05, 0.8,
    f"Knee distance = {idealDist:.4f}",
    size=15,
    ha="left",
    va="center",
    transform=plt.gca().transAxes,  # use the knee plot's axes, not a stale `ax`
)
plt.show()


### Create the clustering models

In [None]:
# Choose your hyperparameters
k =            # number of clusters
eps =     # Ideal distance from above
minPts =      # Pick a reasonable minPts

# Create the models
models = [
    # Create a K-means clusterer
    
    # Create a DBSCAN clusterer
    
]

# Train the models and store in a list called models2


<details><summary>Click to cheat</summary>

```python
# Choose your hyperparameters
k = 2           # number of clusters
eps = 0.1143    # Ideal distance from above
minPts = 14     # Pick a reasonable minPts

# Create the models
models = [
    # Create a K-means clusterer
    KMeans(n_clusters=k),
    # Create a DBSCAN clusterer
    DBSCAN(eps=eps, min_samples=minPts)
]

# Train the models and store in a list called models2
models2 = [model.fit(X) for model in models]
```
</details>

### Plot the identified clusters

In [None]:
titles = (
    f"K-Means with k = {k}",
    f"DBSCAN with eps = {eps} and minPts = {minPts}"
)

# Set-up 1xlen(models) grid for plotting.
fig, sub = plt.subplots(nrows=1, ncols=len(models2), figsize=(5 * len(models2), 5),
        constrained_layout=True)

x_min, x_max = X[:, 0].min() - 0.1, X[:, 0].max() + 0.1
y_min, y_max = X[:, 1].min() - 0.1, X[:, 1].max() + 0.1

for title, model, ax in zip(titles, models2, sub.flatten()):
    labels = model.labels_
    ax.scatter(X[:,0], X[:,1], c=labels, label=y)
    ax.set_xlim(x_min, x_max)
    ax.set_ylim(y_min, y_max)

    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = list(model.labels_).count(-1)
    ax.title.set_text(f"{title}\nNum clusters = {n_clusters}, Num noise = {n_noise}")
    
    v_meas = v_measure_score(y, labels)
    ax.text(
        0.9, 0.1,
        f"{v_meas:.2f}",
        size=15,
        ha="center",
        va="center",
        transform=ax.transAxes,
    )

plt.show()
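
The number shown in each plot is the V-measure, which compares the predicted labels with the ground truth. The notebook also imports `silhouette_score` but never calls it, so the sketch below prints both scores side by side for the circles problem. It uses a lower noise level (0.05), a fixed seed and hand-picked DBSCAN settings, all of which are assumptions made so the example is stable and are not the values from the exercise.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_circles
from sklearn.metrics import silhouette_score, v_measure_score

# Assumed settings for a stable illustration (not the exercise's values)
X_demo, y_demo = make_circles(n_samples=1500, factor=0.4, noise=0.05, random_state=0)

candidates = (
    ("K-Means", KMeans(n_clusters=2, n_init=10, random_state=0)),
    ("DBSCAN", DBSCAN(eps=0.15, min_samples=5)),
)

scores = {}
for name, model in candidates:
    labels_demo = model.fit(X_demo).labels_
    # V-measure needs the ground truth; silhouette only looks at the geometry
    v_demo = v_measure_score(y_demo, labels_demo)
    # Silhouette is undefined for a single label; noise (-1) counts as its own group here
    if len(set(labels_demo)) > 1:
        s_demo = silhouette_score(X_demo, labels_demo)
    else:
        s_demo = float("nan")
    scores[name] = (v_demo, s_demo)
    print(f"{name}: V-measure = {v_demo:.2f}, silhouette = {s_demo:.2f}")
```

The silhouette score tends to reward compact, convex clusters, so it may not agree with the V-measure on ring-shaped data. Treat it as a complementary view rather than a replacement.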