# DBSCAN: Advanced Tutorial

**DBSCAN (Density-Based Spatial Clustering of Applications with Noise)** is an unsupervised clustering algorithm that groups together points that are closely packed, and marks outliers as noise.

In this notebook, we explore DBSCAN on synthetic datasets, visualize the resulting clusters, tune its parameters, and compare it with KMeans.

## 1. Import Required Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import make_moons, make_blobs
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# set_theme is the current name; sns.set is kept only as an alias
sns.set_theme(style="whitegrid")


## 2. What is DBSCAN?

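DBSCAN grows clusters outward from dense regions. A point with enough neighbors within a fixed radius is a core point, any point inside the radius of a core point joins that core point's cluster, and points that no core point reaches are labeled noise. Distances are measured with a chosen metric (Euclidean by default in scikit-learn).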

It works well when:
- Clusters are irregularly shaped
- There's noise or outliers
- You don’t know the number of clusters in advance

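Key Parameters:
- `eps`: maximum distance between two samples for one to count as a neighbor of the other (the neighborhood radius)
- `min_samples`: number of samples, including the point itself, that must lie within `eps` of a point for it to count as a core point of a dense region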
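To make the two parameters concrete, the sketch below fits DBSCAN to a six-point toy array (`X_toy`, invented here for illustration) and splits the points into three roles: core points (at least `min_samples` points within `eps`, counting the point itself), border points (within `eps` of a core point but not core themselves), and noise (label `-1`). It relies on scikit-learn's `core_sample_indices_` and `labels_` attributes; the variable names are assumptions of this sketch.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data made up for illustration
X_toy = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # tight group of four
    [0.35, 0.1],                                      # near the group's edge
    [5.0, 5.0],                                       # far from everything
])

model = DBSCAN(eps=0.3, min_samples=4).fit(X_toy)

# Core points are reported by index; noise points carry the label -1
core_mask = np.zeros(len(X_toy), dtype=bool)
core_mask[model.core_sample_indices_] = True
noise_mask = model.labels_ == -1
# Border points belong to a cluster without being core points
border_mask = ~core_mask & ~noise_mask

for point, label, is_core, is_border in zip(X_toy, model.labels_, core_mask, border_mask):
    role = "core" if is_core else "border" if is_border else "noise"
    print(point, "label:", label, "role:", role)
```

The fifth point has only two other points within `eps`, so it is not a core point, but it sits inside the radius of a core point and is therefore attached to the cluster as a border point.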


## 3. Clustering Synthetic Moon Data

In [None]:
X, _ = make_moons(n_samples=500, noise=0.1, random_state=42)
X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=5)
labels = db.fit_predict(X)

plt.figure(figsize=(8,6))
plt.scatter(X[:,0], X[:,1], c=labels, cmap='viridis', s=30)
plt.title("DBSCAN Clustering on Moon-Shaped Data")
plt.show()
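The scatter plot shows the grouping visually, but it can help to read the result off the labels as well. A minimal sketch, regenerating the same data and parameters as the cell above so that it runs on its own, counts the clusters and the points DBSCAN marked as noise. The names `cluster_sizes` and `n_noise` are introduced here for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Same data and parameters as the moon example
X, _ = make_moons(n_samples=500, noise=0.1, random_state=42)
X = StandardScaler().fit_transform(X)
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Count points per label; -1 is DBSCAN's noise label, not a cluster
unique_labels, counts = np.unique(labels, return_counts=True)
cluster_sizes = {int(l): int(c) for l, c in zip(unique_labels, counts) if l != -1}
n_noise = int(np.sum(labels == -1))

print("Clusters found:", len(cluster_sizes))
print("Cluster sizes:", cluster_sizes)
print("Noise points:", n_noise)
```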


## 4. Compare DBSCAN and KMeans on Blobs

In [None]:
X_blob, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
X_blob = StandardScaler().fit_transform(X_blob)

kmeans = KMeans(n_clusters=3, random_state=42).fit(X_blob)
dbscan = DBSCAN(eps=0.5, min_samples=5).fit(X_blob)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(X_blob[:, 0], X_blob[:, 1], c=kmeans.labels_, cmap='tab10')
axes[0].set_title("KMeans Clustering")

axes[1].scatter(X_blob[:, 0], X_blob[:, 1], c=dbscan.labels_, cmap='tab10')
axes[1].set_title("DBSCAN Clustering")

plt.show()
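The side-by-side plots can be complemented with a number. One option, sketched below, is the adjusted Rand index (`adjusted_rand_score` in scikit-learn), which measures how well two partitions agree regardless of how the labels are named. This sketch regenerates the blob data and, unlike the cell above, keeps the generating labels as `y_true`; treating DBSCAN's noise label as one more group is a simplifying assumption.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Same blob data as the comparison cell, but keep the generating labels
X_blob, y_true = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
X_blob = StandardScaler().fit_transform(X_blob)

# n_init is set explicitly so the result does not depend on the library default
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_blob)
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_blob)

# ARI is 1.0 for identical partitions and near 0 for unrelated ones.
# DBSCAN's noise label (-1) is treated here as just another group.
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
ari_between = adjusted_rand_score(kmeans_labels, dbscan_labels)

print("ARI, KMeans vs. generating labels:", round(ari_kmeans, 3))
print("ARI, DBSCAN vs. generating labels:", round(ari_dbscan, 3))
print("ARI, KMeans vs. DBSCAN:", round(ari_between, 3))
```

Ground-truth labels are only available because the data is synthetic; on real data only the KMeans-versus-DBSCAN comparison would be possible.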


## 5. Silhouette Score for DBSCAN

In [None]:
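# Noise points are labeled -1; exclude them so noise is not scored as a cluster
mask = dbscan.labels_ != -1
n_clusters = len(set(dbscan.labels_[mask]))
if n_clusters > 1:
    score = silhouette_score(X_blob[mask], dbscan.labels_[mask])
    print("Silhouette Score (DBSCAN, noise excluded):", round(score, 3))
else:
    print("Silhouette Score not available (fewer than two clusters)")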


## 6. Tuning eps and min_samples

In [None]:
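eps_values = np.linspace(0.1, 1.0, 10)
scores = []

for eps in eps_values:
    model = DBSCAN(eps=eps, min_samples=5)
    labels = model.fit_predict(X_blob)
    mask = labels != -1  # drop noise points (label -1) before scoring
    if len(set(labels[mask])) > 1:
        # Scored on clustered points only, so a high value can hide many noise points
        scores.append(silhouette_score(X_blob[mask], labels[mask]))
    else:
        # Fewer than two clusters: silhouette is undefined, so leave a gap in the plot
        scores.append(np.nan)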

plt.plot(eps_values, scores, marker='o')
plt.xlabel("eps")
plt.ylabel("Silhouette Score")
plt.title("DBSCAN Silhouette Score vs eps")
plt.show()
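The sweep above varies only `eps`. A small grid over both parameters, sketched below, records the number of clusters and the fraction of points labeled noise for each combination. The grid values are illustrative assumptions rather than recommendations, and the blob data is regenerated so the block runs on its own.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X_blob, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
X_blob = StandardScaler().fit_transform(X_blob)

rows = []
for eps in (0.2, 0.35, 0.5):           # illustrative values, not recommendations
    for min_samples in (3, 5, 10):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_blob)
        rows.append({
            "eps": eps,
            "min_samples": min_samples,
            "n_clusters": len(set(labels) - {-1}),   # -1 marks noise, not a cluster
            "noise_fraction": float((labels == -1).mean()),
        })

grid = pd.DataFrame(rows)
print(grid.to_string(index=False))
```

Reading the cluster count and noise fraction together with the silhouette curve helps avoid settings that score well only because most points were discarded as noise.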


## 7. Summary

- DBSCAN is well suited to noisy data and non-spherical clusters
- It doesn’t require the number of clusters upfront
- Results are sensitive to `eps` and `min_samples`
- Scale the features before clustering, and try different distance metrics
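
As a starting point for the last item, the sketch below reruns DBSCAN on the scaled moon data with three values of scikit-learn's `metric` parameter. Keeping `eps=0.3` fixed across metrics is an assumption made only to keep the comparison short; the `summary` dictionary is introduced here for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=500, noise=0.1, random_state=42)
X = StandardScaler().fit_transform(X)

summary = {}
for metric in ("euclidean", "manhattan", "chebyshev"):
    # Same eps for every metric, purely to show how the metric changes the result
    labels = DBSCAN(eps=0.3, min_samples=5, metric=metric).fit_predict(X)
    summary[metric] = {
        "n_clusters": len(set(labels) - {-1}),   # -1 marks noise
        "n_noise": int(np.sum(labels == -1)),
    }
    print(metric, summary[metric])
```

The same `eps` covers a different region under each metric (a Manhattan ball is smaller than a Euclidean one, which is smaller than a Chebyshev one), so `eps` generally needs retuning whenever the metric changes.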