# DBSCAN Algorithm

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that can find clusters of any shape, not just round ones, and of different sizes.

In an algorithm like K-Means, every single data point is forced into a cluster, no matter how far away or out of place it is. DBSCAN works differently: for a point to be part of a cluster, it must sit in a "dense", crowded neighborhood.

DBSCAN classifies every point into one of three types:

*   **Core Point:** A point in the heart of a dense cluster.
*   **Border Point:** A point on the edge of a dense cluster.
*   **Outlier (Noise):** A point that doesn't belong to any dense cluster.

---

![](assets/core_border_outlier.png)

---

To decide which type each point is, DBSCAN uses two simple parameters:

*   **`epsilon` (ε):** A distance or radius. This defines the "neighborhood" around each point.
*   **`min_points`:** The minimum number of points that must fall inside a neighborhood for it to count as a dense region.

Now, we can define the rules for classifying each point:

*   A point is a **Core Point** if it has at least `min_points` neighbors (including itself) within its `epsilon` radius.
*   A point is a **Border Point** if it has fewer than `min_points` neighbors within its `epsilon` radius, but it lies within `epsilon` of at least one **Core Point**.
*   If a point is neither a Core nor a Border point, it is considered an **Outlier (Noise)**.
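
The three rules above can be sketched directly in NumPy. This is a minimal illustration, not how scikit-learn implements it: the function name `classify_points` and the six hand-made points are assumptions made for the example, and the brute-force distance matrix is only practical for small datasets.

```python
import numpy as np

def classify_points(X, epsilon, min_points):
    """Label each row of X as 'core', 'border', or 'noise' (illustrative helper)."""
    # Pairwise Euclidean distances (O(n^2) memory, fine for a toy example).
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))

    # neighbors[i, j] is True when j lies within epsilon of i (i counts itself).
    neighbors = dists <= epsilon
    is_core = neighbors.sum(axis=1) >= min_points

    # A border point is not core but has at least one core point as a neighbor.
    near_core = (neighbors & is_core[None, :]).any(axis=1)
    is_border = ~is_core & near_core

    kinds = np.full(len(X), "noise", dtype=object)
    kinds[is_border] = "border"
    kinds[is_core] = "core"
    return kinds

# Hypothetical toy data: a tight group of four, one point nearby, one far away.
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # dense group
    [0.25, 0.1],                                      # close to the group
    [2.0, 2.0],                                       # isolated
])
kinds = classify_points(points, epsilon=0.2, min_points=4)
print(list(kinds))
# → ['core', 'core', 'core', 'core', 'border', 'noise']
```

The fifth point has only three points in its neighborhood (itself and two others), so it is not a Core point, but two of those neighbors are Core points, which makes it a Border point.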

---

![](assets/recognizing_outliers_borders_cores.png)

---

Finally, a cluster is formed by connecting Core points that are neighbors of each other, either directly or through a chain of other Core points. Any Border point that is a neighbor of one of these Core points becomes part of that same cluster. Outliers belong to no cluster, so they are left unassigned.
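
As a rough sketch of that cluster-forming step, the function below grows each cluster outward from an unvisited Core point with a breadth-first search. The name `simple_dbscan` and the toy points are assumptions for illustration; scikit-learn's real implementation is more efficient, and a Border point that touches two clusters simply goes to whichever cluster reaches it first.

```python
from collections import deque

import numpy as np

def simple_dbscan(X, epsilon, min_points):
    """Return one label per row: 0, 1, ... for clusters, -1 for noise."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # For every point, the indices of all points within epsilon (itself included).
    neighbors = [np.flatnonzero(row <= epsilon) for row in dists]
    is_core = np.array([len(n) >= min_points for n in neighbors])

    labels = np.full(len(X), -1)
    cluster_id = 0
    for start in range(len(X)):
        # Only an unassigned core point can seed a new cluster.
        if not is_core[start] or labels[start] != -1:
            continue
        labels[start] = cluster_id
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for j in neighbors[current]:
                if labels[j] == -1:
                    labels[j] = cluster_id
                    # Only core points keep expanding the cluster;
                    # border points join it but stop the chain.
                    if is_core[j]:
                        queue.append(j)
        cluster_id += 1
    return labels

# Hypothetical toy data: two dense groups, one border point, one isolated point.
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.25, 0.1],
    [1.0, 1.0], [1.1, 1.0], [1.0, 1.1], [1.1, 1.1],
    [3.0, 3.0],
])
labels = simple_dbscan(points, epsilon=0.2, min_points=4)
print(labels.tolist())
# → [0, 0, 0, 0, 0, 1, 1, 1, 1, -1]
```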

Enough theory! Let's dive into the code to see it better.

### Step 1: Import Libraries and Generate Data

First, we'll import the necessary tools and create our dataset. We'll generate 1000 data points shaped like two interleaving crescent moons and add some Gaussian noise, which scatters a few points away from the moons where they can act as outliers.

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_moons

# n_samples: The total number of points to generate.
# noise: The standard deviation of Gaussian noise added to the data.
# random_state: Ensures we get the same 'random' points every time we run the code.
X, y = make_moons(n_samples=1000, noise=0.09, random_state=42)

df = pd.DataFrame(X, columns=['feature1', 'feature2'])

df.head()

### Step 2: Visualize the Raw Data

Before we do any clustering, let's look at our data. This helps us understand the challenge. You can clearly see the two "moon" shapes that an algorithm like K-Means, which assumes roughly round clusters, would struggle with.

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.scatter(df['feature1'], df['feature2'], c='gray')
plt.title('Raw, Unclustered "Moons" Data', fontsize=16)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
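
To make the K-Means remark concrete, here is a hedged sketch that runs K-Means on a moons dataset and scores it against the true moon membership with the adjusted Rand index (1.0 would be a perfect match). The variable names and the `random_state` values are assumptions chosen for reproducibility; the exact score depends on them.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Assumed stand-in data with the same settings as the tutorial's dataset.
X_demo, y_demo = make_moons(n_samples=1000, noise=0.09, random_state=0)

# K-Means splits the plane with a straight boundary between two centroids,
# so each moon typically ends up divided between the two clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(X_demo)

ari = adjusted_rand_score(y_demo, kmeans_labels)
print(f"K-Means adjusted Rand index on the moons: {ari:.2f}")
```

A score well below 1.0 here reflects the tips of each moon being assigned to the wrong cluster.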

### Step 3: Set Up and Run the DBSCAN Algorithm

Now, we create an instance of the `DBSCAN` model and fit it to our data. The most important part is choosing the `eps` and `min_samples` parameters, which are scikit-learn's names for the `epsilon` and `min_points` described earlier. Finding the best values often requires some experimentation.

*   `eps=0.1`: We'll define the neighborhood radius as 0.1.
*   `min_samples=5`: We'll say a point needs at least 5 points in its neighborhood (itself included) to be a Core point.
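
One common way to narrow down `eps` before experimenting is the k-distance heuristic: for each point, measure the distance to its `min_samples`-th nearest point and look at how those distances are distributed. The sketch below is an illustration under assumptions, not part of the original workflow; `X_demo` is stand-in data generated with an assumed `random_state`.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

# Assumed stand-in data with the same settings as the tutorial's dataset.
X_demo, _ = make_moons(n_samples=1000, noise=0.09, random_state=42)

min_samples = 5

# Querying with the training data returns each point as its own nearest
# neighbor (distance 0), which matches min_samples counting the point itself.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)

# Distance from each point to the farthest of its min_samples nearest points.
k_distances = np.sort(distances[:, -1])

for q in (50, 90, 95, 99):
    print(f"{q}th percentile of k-distance: {np.percentile(k_distances, q):.3f}")
```

Plotting `k_distances` usually shows a flat stretch followed by a sharp rise; an `eps` near that bend is a reasonable starting point, since a much smaller value leaves many points as noise and a much larger one starts merging clusters.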

In [None]:
# eps: The maximum distance between two samples for one to be considered as in the neighborhood of the other.
# min_samples: The number of samples in a neighborhood for a point to be considered as a core point.

from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.1, min_samples=5)

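# Fit the model on the two feature columns only, so that re-running this cell
# after Step 4 has added the 'cluster' column does not feed the labels back in.
dbscan.fit(df[['feature1', 'feature2']])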

### Step 4: Analyze the Clustering Results

After fitting, the results are stored in the model. The most important attribute is `labels_`, where each point is assigned a cluster ID. DBSCAN uses **-1** to label all the points it considers **outliers (noise)**.

In [None]:
labels = dbscan.labels_
df['cluster'] = labels

n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print(f"Estimated number of clusters: {n_clusters_}")
print(f"Estimated number of noise points: {n_noise_}")

df.head()
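
Beyond `labels_`, a fitted scikit-learn `DBSCAN` model also exposes `core_sample_indices_`, which is enough to recover the three point types from the theory section. The following sketch refits a model on stand-in data (the `X_demo` and `model` names and the `random_state` are assumptions) and counts each type along with the cluster sizes.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Assumed stand-in data and the same parameters as the tutorial's model.
X_demo, _ = make_moons(n_samples=1000, noise=0.09, random_state=42)
model = DBSCAN(eps=0.1, min_samples=5).fit(X_demo)

# Core points are listed explicitly by index.
is_core = np.zeros(len(X_demo), dtype=bool)
is_core[model.core_sample_indices_] = True

# Noise points carry the label -1; everything else that is not core is border.
is_noise = model.labels_ == -1
is_border = ~is_core & ~is_noise

print(f"Core points:   {is_core.sum()}")
print(f"Border points: {is_border.sum()}")
print(f"Noise points:  {is_noise.sum()}")

# Number of points in each cluster, ignoring noise.
cluster_ids, sizes = np.unique(model.labels_[~is_noise], return_counts=True)
for cid, size in zip(cluster_ids, sizes):
    print(f"Cluster {cid}: {size} points")
```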

### Step 5: Visualize the Final DBSCAN Clusters

This is the final and most revealing step. We'll plot the data again, but this time we will color the points according to the cluster they belong to. We will make the noise points small and black to show that they don't belong to any group.

We can also highlight the **Core points** to see the "hearts" of the clusters that DBSCAN identified.

In [None]:
plt.figure(figsize=(10, 8))

unique_labels = set(labels)

# Define colors for the clusters
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]

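# Boolean mask marking the Core points found by DBSCAN
core_samples_mask = np.zeros_like(labels, dtype=bool)
core_samples_mask[dbscan.core_sample_indices_] = True

for k, col in zip(unique_labels, colors):
    if k == -1:
        # Noise points are drawn in black
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    # Core points: large markers (noise never contains Core points)
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=10)

    # Border and noise points: small markers
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=4)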

plt.title(f'DBSCAN Clustering | Clusters: {n_clusters_}', fontsize=16)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()