# DBSCAN

This notebook demonstrates how to use **DBSCAN** (Density-Based Spatial Clustering of Applications with Noise) from the `rice_ml.unsupervised_learning` package, and compares its behavior with **K-Means** on the same dataset.

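DBSCAN groups points that lie in dense regions and labels low-density points as **noise** (`-1`). Unlike K-Means, it does **not** require choosing `k` in advance; instead, it is controlled by a neighborhood radius (`eps`) and a density threshold (`min_samples`).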

## 1. Setup

We first import the required modules. When running this notebook directly from the `examples/` folder, we also add the repository's `src/` directory to `sys.path` so that `rice_ml` can be imported without installing the package.

In [None]:
import sys
from pathlib import Path

import numpy as np
import matplotlib.pyplot as plt

# --- Make `rice_ml` importable when running the notebook directly ---
cwd = Path.cwd().resolve()
for p in [cwd] + list(cwd.parents):
    if (p / "src" / "rice_ml").exists():
        sys.path.insert(0, str(p / "src"))
        break
else:
    raise RuntimeError("Could not find 'src/rice_ml'. Run this notebook inside the repo, or install the package.")

from rice_ml.processing.preprocessing import standardize
from rice_ml.unsupervised_learning.dbscan import dbscan
from rice_ml.unsupervised_learning.k_means_clustering import kmeans

np.random.seed(42)


## 2. Load a dataset

For a simple and common clustering demo, we use the **Iris** dataset.

- It has **150** samples and **4** numeric features.
- It includes a ground-truth species label, which we will use only for **qualitative comparison** and an optional clustering score (Adjusted Rand Index).

In [None]:
# Iris is a classic toy dataset. scikit-learn is used here only to load the
# data and to compute the ARI metric; the clustering itself comes from rice_ml.
try:
    from sklearn.datasets import load_iris
    from sklearn.metrics import adjusted_rand_score
except ImportError as e:
    raise ImportError(
        "This notebook uses scikit-learn for the Iris dataset and an optional ARI metric. "
        "Install it with: pip install scikit-learn"
    ) from e

iris = load_iris()
X_raw = iris.data
y_true = iris.target
feature_names = iris.feature_names

print("X shape:", X_raw.shape)
print("Classes:", np.unique(y_true), "(n_classes =", len(np.unique(y_true)), ")")
print("Feature names:", feature_names)


## 3. Preprocessing

Both DBSCAN and K-Means rely on distances. To avoid one feature dominating due to scale, we **standardize** each feature:

\[ X_{\text{std}} = \frac{X - \mu}{\sigma} \]

Here \(\mu\) and \(\sigma\) are the per-feature (column-wise) mean and standard deviation.

We use the `standardize` helper from `rice_ml.processing.preprocessing`.
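
For intuition, the formula above can be sketched in plain NumPy. This is only an illustration: `standardize_sketch` is a hypothetical name, and the choices of population standard deviation (`ddof=0`) and of leaving constant columns unscaled are assumptions that may differ from what `rice_ml`'s `standardize` actually does.

```python
import numpy as np

def standardize_sketch(X):
    """Column-wise z-score. Hypothetical stand-in, not the rice_ml implementation."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)  # population std (ddof=0); an assumption
    # Assumption: constant columns are left unscaled to avoid dividing by zero.
    sigma = np.where(sigma == 0, 1.0, sigma)
    return (X - mu) / sigma

# Two features on very different scales end up comparable after scaling.
demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
Z = standardize_sketch(demo)
print("column means:", np.round(Z.mean(axis=0), 6))
print("column stds: ", np.round(Z.std(axis=0), 6))
```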

In [None]:
X = standardize(X_raw)

# For visualization, we will use the last two features (petal length/width),
# which usually separate the Iris species better than the first two.
X_vis = X[:, 2:4]

print("Standardized X mean (approx):", np.round(X.mean(axis=0), 3))
print("Standardized X std (approx):", np.round(X.std(axis=0), 3))


## 4. Helper: 2D scatter plot

To keep plots readable, we visualize clusters in 2D using the standardized petal features (`X[:, 2:4]`).

Noise points (label `-1`) are drawn with a different marker.

In [None]:
def plot_clusters_2d(X2, labels, title):
    labels = np.asarray(labels)
    unique = np.unique(labels)

    plt.figure(figsize=(7, 5))

    # Plot clusters (including noise)
    for lab in unique:
        mask = labels == lab
        if lab == -1:
            plt.scatter(X2[mask, 0], X2[mask, 1], marker="x", s=50, label="noise (-1)")
        else:
            plt.scatter(X2[mask, 0], X2[mask, 1], s=40, label=f"cluster {int(lab)}")

    plt.xlabel("petal length (standardized)")
    plt.ylabel("petal width (standardized)")
    plt.title(title)
    plt.legend(loc="best", fontsize=9)
    plt.show()


## 5. Run DBSCAN (rice_ml)

Key parameters:

- `eps`: neighborhood radius for density (larger `eps` → more points become connected)
- `min_samples`: minimum number of points in a point's `eps`-neighborhood for it to count as a **core** point (larger `min_samples` → stricter density requirement)

`dbscan` returns:

- `labels_db`: array of cluster labels (`-1` means noise)
- `n_clusters_db`: number of clusters (excluding noise)
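
To make the roles of `eps` and `min_samples` concrete, here is a minimal NumPy sketch of the DBSCAN idea. It is not the `rice_ml` implementation: `dbscan_sketch` is a hypothetical name, the return shape simply mirrors the description above, and two details are assumptions (a point counts as its own neighbor, and neighbors are points at distance `<= eps`).

```python
import numpy as np

def dbscan_sketch(X, eps, min_samples):
    """Minimal DBSCAN illustration (O(n^2) memory); not the rice_ml code."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    # Assumption: each point is included in its own neighborhood count.
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])

    labels = np.full(n, -1)  # everything starts as noise
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster outward from an unlabeled core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            j = stack.pop()
            if not is_core[j]:
                continue  # border points join a cluster but do not expand it
            for nb in neighbors[j]:
                if labels[nb] == -1:
                    labels[nb] = cluster
                    stack.append(nb)
        cluster += 1
    return labels, cluster

# Two tight groups plus one isolated point.
pts = np.array([[0, 0], [0, 0.1], [0.1, 0],
                [5, 5], [5, 5.1], [5.1, 5],
                [10, 0]])
labels, n_clusters = dbscan_sketch(pts, eps=0.5, min_samples=3)
print(labels.tolist())  # [0, 0, 0, 1, 1, 1, -1]
print(n_clusters)       # 2
```

The isolated point has no other point within `eps`, so it never becomes part of a cluster and keeps the noise label `-1`.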

In [None]:
eps = 0.6
min_samples = 6

labels_db, n_clusters_db = dbscan(X, eps=eps, min_samples=min_samples)

n_noise = int(np.sum(labels_db == -1))
print("DBSCAN params: eps =", eps, ", min_samples =", min_samples)
print("Clusters found (excluding noise):", n_clusters_db)
print("Noise points:", n_noise, f"({n_noise / len(labels_db):.1%})")

plot_clusters_2d(X_vis, labels_db, f"DBSCAN (rice_ml): eps={eps}, min_samples={min_samples}")


## 6. Compare with K-Means (rice_ml)

K-Means always produces exactly `k` clusters and assigns **every** point to some cluster.

Here we set `k=3` because Iris has three species. The species labels only inform this choice of `k`; K-Means itself never sees them, so it remains unsupervised.
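
As a rough illustration of why every point ends up in some cluster, here is a minimal sketch of Lloyd's algorithm in NumPy. It is not the `rice_ml` `kmeans` function: `kmeans_sketch`, its `n_iter` and `seed` parameters, and the random-point initialization are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm; hypothetical, not the rice_ml implementation."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: initialize centers at k distinct, randomly chosen data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center, no exceptions.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two tight groups plus one isolated point: the isolated point still gets a label.
pts = np.array([[0, 0], [0, 0.1], [0.1, 0],
                [5, 5], [5, 5.1], [5.1, 5],
                [10, 0]])
labels, centers = kmeans_sketch(pts, k=2)
print(labels.tolist())
```

Because the assignment step is a plain `argmin`, there is no notion of noise: even the isolated point is attached to whichever center is closest.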

In [None]:
k = 3
labels_km, centers_km = kmeans(X, k=k)

print("K-Means: k =", k)
print("Cluster sizes:", np.bincount(labels_km))

plot_clusters_2d(X_vis, labels_km, f"K-Means (rice_ml): k={k}")


## 7. Compare against ground truth labels

Clustering is unsupervised, so there is no single “correct” output. But since Iris provides true species labels, we can compute an external clustering metric:

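- **Adjusted Rand Index (ARI)**: 1.0 means a perfect match, values near 0 mean agreement no better than chance, and negative values (worse than chance) are possible. ARI ignores how the clusters are numbered.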

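For DBSCAN, we exclude noise points from the ARI calculation; otherwise all noise points would be scored as one extra cluster, which penalizes DBSCAN heavily. Keep in mind that this makes the DBSCAN score optimistic, since it is computed only on the points DBSCAN chose to cluster, so it is not directly comparable to the K-Means score.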
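
To show what the metric measures, here is a small NumPy sketch of ARI computed from the contingency table of the two labelings. The notebook itself uses scikit-learn's `adjusted_rand_score`; `adjusted_rand_index` below is a hypothetical helper written only for illustration, and its handling of degenerate inputs is an assumption.

```python
import numpy as np

def adjusted_rand_index(labels_a, labels_b):
    """ARI from the pair-counting formula; illustrative helper only."""
    a = np.asarray(labels_a).ravel()
    b = np.asarray(labels_b).ravel()
    if len(a) < 2:
        return 1.0  # assumption: treat degenerate input as perfect agreement
    # Map arbitrary label values to 0..n_labels-1 and build the contingency table.
    _, a_idx = np.unique(a, return_inverse=True)
    _, b_idx = np.unique(b, return_inverse=True)
    table = np.zeros((a_idx.max() + 1, b_idx.max() + 1), dtype=np.int64)
    np.add.at(table, (a_idx, b_idx), 1)

    def pairs(counts):
        # Number of unordered pairs that can be drawn from each count.
        return counts * (counts - 1) / 2.0

    sum_cells = pairs(table).sum()
    sum_rows = pairs(table.sum(axis=1)).sum()
    sum_cols = pairs(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / pairs(len(a))
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # both labelings are trivial (e.g. a single cluster each)
    return float((sum_cells - expected) / (max_index - expected))

# Same partition with swapped label names: perfect agreement.
print(round(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]), 3))  # 1.0
# A partition that cuts across the other one: worse than chance.
print(round(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]), 3))  # -0.5
```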

In [None]:
ari_km = adjusted_rand_score(y_true, labels_km)

# Exclude noise for DBSCAN ARI
mask = labels_db != -1
if mask.sum() > 0:
    ari_db = adjusted_rand_score(y_true[mask], labels_db[mask])
else:
    ari_db = np.nan

print("ARI (K-Means vs true labels):", round(float(ari_km), 4))
print("ARI (DBSCAN vs true labels, excluding noise):", (None if np.isnan(ari_db) else round(float(ari_db), 4)))


## 8. Interpretation

- DBSCAN can find clusters of **arbitrary shape** and can label ambiguous points as **noise**.
- K-Means forces exactly `k` **compact** clusters and assigns every point to a cluster.

On Iris, K-Means often produces a reasonable 3-way partition, while DBSCAN’s result can be more sensitive to `eps` and `min_samples` (because the Iris species overlap in feature space).

**Try exploring (recommended):**
- Increase `eps` to merge nearby dense regions (typically fewer clusters and less noise).
- Decrease `eps` to split clusters or create more noise points.
- Increase `min_samples` to demand denser clusters (typically more noise; clusters may shrink, split, or disappear).
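
When exploring `eps`, one commonly used heuristic is to look at the sorted distance from each point to its k-th nearest neighbor and pick `eps` near the point where that curve bends sharply upward. The sketch below is a plain NumPy illustration, not part of `rice_ml`; `k_distance` is a hypothetical helper, and how `k` should relate to `min_samples` depends on whether the implementation counts a point as its own neighbor.

```python
import numpy as np

def k_distance(X, k):
    """Sorted distance from each point to its k-th nearest other point."""
    X = np.asarray(X, dtype=float)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest other point.
    kth = np.sort(dists, axis=1)[:, k]
    return np.sort(kth)

# Two tight groups plus one isolated point.
pts = np.array([[0, 0], [0, 0.1], [0.1, 0],
                [5, 5], [5, 5.1], [5.1, 5],
                [10, 0]])
curve = k_distance(pts, k=2)
print(np.round(curve, 3))
```

In this toy example the values stay small for points inside the two groups and jump for the isolated point, so any `eps` between those two levels separates dense points from the outlier. On real data such as Iris the bend is usually less sharp, so treat it as a starting point rather than a rule.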

In practice, DBSCAN is often strongest when you expect **non-spherical** clusters and meaningful **outliers/noise**, while K-Means is strong when clusters are roughly **spherical** and well-separated.
