# Density-Based Clustering — DBSCAN & HDBSCAN (with K-Means comparison)

---

## 1) Big Picture

- **DBSCAN**: finds clusters as **dense regions** separated by low-density gaps; labels outliers as **noise**.
- **HDBSCAN**: generalises DBSCAN by exploring **all density levels** (all \( \varepsilon \)) and selecting **stable clusters**.
- **K-Means**: partitions space around **centroids**, assumes roughly **equal variance** and **convex** clusters; no noise label.

---

## 2) DBSCAN — Concept & Mechanics

### 2.1 Key parameters

- **Neighborhood radius:**
  $$
  \varepsilon \text{ (eps)}
  $$

- **Minimum neighbors (including the point) to be core:**
  $$
  k = \text{min\_samples}
  $$
  or equivalently, “a point is *core* if it has at least \( k \) points within \( \varepsilon \).”

---

### 2.2 Point types

- **Core**: at least \( k \) points within \( \varepsilon \).
- **Border**: not core, but within \( \varepsilon \) of a core.
- **Noise**: neither core nor border.
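
The three point types can be sketched with a brute-force distance matrix. This is a minimal illustration, assuming Euclidean distance and a neighbor count that includes the point itself; `classify_points` and the toy data are hypothetical names, not part of any library.

```python
import numpy as np

def classify_points(X, eps, k):
    """Label each row of X as 'core', 'border', or 'noise'."""
    # Pairwise Euclidean distances, shape (n, n).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    within = dist <= eps                 # each point is within eps of itself
    is_core = within.sum(axis=1) >= k    # count includes the point itself
    # Border: not core, but at least one core point lies within eps.
    near_core = (within & is_core[None, :]).any(axis=1)
    is_border = ~is_core & near_core
    kinds = np.full(len(X), "noise", dtype=object)
    kinds[is_border] = "border"
    kinds[is_core] = "core"
    return kinds

# Toy data (assumed): a tight triple, one straggler, one far-away point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.5, 0.0], [3.0, 3.0]])
kinds = classify_points(X, eps=0.45, k=3)
print(list(kinds))  # ['core', 'core', 'core', 'border', 'noise']
```

The straggler at \((0.5, 0)\) has only two points in its neighborhood, so it is not core, but it sits within \( \varepsilon \) of the core point at \((0.1, 0)\) and is therefore border.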

---

### 2.3 Algorithm (intuitive flow)

1. Pick an **unvisited** point.
2. If it is **core**, start a cluster and **expand** it: add every neighbor within \( \varepsilon \), and keep expanding from each added neighbor that is itself core (core chains).
3. If it is not core, provisionally mark it as **noise**; it is relabelled **border** if a later expansion reaches it from a core point.
4. Repeat until all points are visited.

**Density reachability:**
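A point \( q \) is density-reachable from a core point \( p \) if there is a chain \( p = p_1, \dots, p_n = q \) in which consecutive points are \( \le \varepsilon \) apart and every point except possibly \( q \) is core.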
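
The flow above can be sketched as a breadth-first expansion in which only core points push their neighbors onto the queue. This is a simplified illustration with a brute-force \( O(n^2) \) distance matrix, not the implementation used by any particular library; the function name `dbscan` and the toy data are assumptions.

```python
import numpy as np
from collections import deque

def dbscan(X, eps, k):
    """Return cluster labels 0, 1, ... with -1 for noise."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= k for nb in neighbors])
    labels = np.full(n, -1)              # everything starts as noise
    cluster = 0
    for start in range(n):
        if labels[start] != -1 or not is_core[start]:
            continue                     # already assigned, or cannot seed a cluster
        labels[start] = cluster
        queue = deque([start])           # the queue only ever holds core points
        while queue:
            p = queue.popleft()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster  # core or border, either way it joins
                    if is_core[q]:
                        queue.append(q)  # only core points keep expanding
        cluster += 1
    return labels

# Toy data (assumed): two tight triples, one border point, one isolated point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.5, 0.0],
              [3.0, 3.0],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = dbscan(X, eps=0.45, k=3)
print(labels.tolist())  # [0, 0, 0, 0, -1, 1, 1, 1]
```

A border point that is within \( \varepsilon \) of core points from two clusters goes to whichever cluster reaches it first, so border assignment can depend on visiting order.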

---

### 2.4 Strengths & caveats

- ✅ Arbitrary shapes, no need to preset the number of clusters, noise detection.
- ⚠️ **Single global \( \varepsilon \)** struggles with **varying densities**.
- ⚠️ In high dimensions, distances become less discriminative; consider dimensionality reduction first (e.g., UMAP or PCA).

---

## 3) HDBSCAN — Concept & Mechanics (Deep Dive)

### 3.1 Motivation

DBSCAN fixes one \( \varepsilon \); HDBSCAN scans **all \( \varepsilon \)** and keeps clusters that **persist** across density scales.

---

### 3.2 Core distance & mutual reachability

- **Core distance** of \( x \): distance to its \( k \)-th nearest neighbor
  $$
  \operatorname{core\_dist}(x) = \text{dist}\big(x,\ \text{k-th NN of }x\big)
  $$

- **Mutual-reachability distance** between \( a,b \):
  $$
  d_{\mathrm{mreach}}(a,b)
  = \max\Big(
    \operatorname{core\_dist}(a),\
    \operatorname{core\_dist}(b),\
    \|a-b\|
  \Big)
  $$

**Intuition:**
An edge is “cheap” only if **both points are dense** (small core distances) **and** **close**.
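
Both definitions fit in a few lines of NumPy. This sketch assumes Euclidean distance and takes the \( k \)-th nearest neighbor to mean the \( k \)-th nearest *other* point; whether a point counts itself is a convention that differs between implementations. `mutual_reachability` is a hypothetical helper name.

```python
import numpy as np

def mutual_reachability(X, k):
    """Return (core distances, mutual-reachability matrix). Requires k < len(X)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Each sorted row starts with 0 (the point itself), so column k holds
    # the distance to the k-th nearest other point (assumed convention).
    core = np.sort(dist, axis=1)[:, k]
    mreach = np.maximum(dist, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(mreach, 0.0)        # a point is at distance 0 from itself
    return core, mreach

# Toy 1-D data (assumed): three close points and one far away.
X = np.array([[0.0], [1.0], [2.0], [10.0]])
core, mreach = mutual_reachability(X, k=2)
print(core.tolist())   # [2.0, 1.0, 2.0, 9.0]
print(mreach[0, 1])    # 2.0: raw distance 1 is lifted to core_dist(0) = 2
print(mreach[2, 3])    # 9.0: raw distance 8 is lifted to core_dist(3) = 9
```

The isolated point at 10 has a large core distance, so every edge touching it is inflated even where the raw distance is smaller.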

---

### 3.3 One structure for all \( \varepsilon \)

1. Compute all \( d_{\mathrm{mreach}} \).
2. Build an **MST** on these weights.
3. Sweep density via \( \lambda = 1/\varepsilon \):
   - as \( \lambda \) increases (i.e., \( \varepsilon \) decreases), **cut** MST edges with weight \(> \varepsilon\);
   - clusters **split** at those cuts → a **hierarchy**.
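
The three steps can be sketched with Prim's algorithm on a dense weight matrix followed by a union-find cut. The matrix `W` below is an assumed toy mutual-reachability matrix for four points; `mst_edges` and `cut_labels` are hypothetical helper names, and a practical implementation would use a faster MST routine.

```python
import numpy as np

def mst_edges(W):
    """Prim's algorithm on a dense symmetric weight matrix; returns (i, j, w) edges."""
    n = len(W)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = W[0].astype(float)            # cheapest known edge from the tree to each node
    parent = np.zeros(n, dtype=int)      # tree endpoint of that cheapest edge
    edges = []
    for _ in range(n - 1):
        j = int(np.argmin(np.where(in_tree, np.inf, best)))
        edges.append((int(parent[j]), j, float(best[j])))
        in_tree[j] = True
        closer = W[j] < best
        best = np.where(closer, W[j], best)
        parent = np.where(closer, j, parent)
    return edges

def cut_labels(n, edges, eps):
    """Connected components after removing MST edges with weight > eps."""
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path halving
            i = parent[i]
        return i
    for a, b, w in edges:
        if w <= eps:
            parent[find(a)] = find(b)
    roots = [find(i) for i in range(n)]
    ids = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    return [ids[r] for r in roots]

# Assumed mutual-reachability matrix for points 0..3 (point 3 is isolated).
W = np.array([[0.0, 2.0, 2.0, 10.0],
              [2.0, 0.0, 2.0, 9.0],
              [2.0, 2.0, 0.0, 9.0],
              [10.0, 9.0, 9.0, 0.0]])
edges = mst_edges(W)
# Sweeping eps downward (lambda = 1/eps upward) splits the components.
components = {eps: len(set(cut_labels(len(W), edges, eps))) for eps in (10.0, 5.0, 1.5)}
print(components)  # {10.0: 1, 5.0: 2, 1.5: 4}
```

Recording the component structure at every distinct edge weight, rather than at three hand-picked thresholds, yields the full hierarchy.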

---

### 3.4 Condensed tree & stability

- Build a **condensed tree** (density dendrogram).
- For a cluster \( C \) living from \( \lambda_{\text{birth}} \) to \( \lambda_{\text{death}} \), define **stability**:
  $$
  \text{Stability}(C)
  =
  \int_{\lambda_{\text{birth}}}^{\lambda_{\text{death}}}
    |C(\lambda)|\, d\lambda
  $$
- Keep clusters with **high stability** (persistent + sizable); points that end up in no selected cluster are labelled **noise**.
- Optional: **soft membership** (probabilities).
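
Since \( |C(\lambda)| \) is a step function, the integral reduces to a sum over points of how long each one stays in the cluster. The sketch below computes stability both ways for a made-up cluster; the function names and the \( \lambda \) values are assumptions chosen for illustration.

```python
def stability_sum(lambda_birth, lambda_leave):
    """Sum over points of (lambda at which the point leaves - lambda at birth)."""
    return sum(lp - lambda_birth for lp in lambda_leave)

def stability_integral(lambda_birth, lambda_leave):
    """Piecewise integral of cluster size |C(lambda)| from birth until the last point leaves."""
    total, prev = 0.0, lambda_birth
    alive = len(lambda_leave)
    for lp in sorted(lambda_leave):
        total += alive * (lp - prev)     # size is constant between departures
        prev = lp
        alive -= 1
    return total

# Assumed example: a cluster born at lambda = 0.5 with five points;
# two fall out at lambda = 1.0, the other three persist until lambda = 2.0.
leave = [1.0, 1.0, 2.0, 2.0, 2.0]
s_sum = stability_sum(0.5, leave)
s_int = stability_integral(0.5, leave)
print(s_sum, s_int)  # 5.5 5.5
```

Both views give \( 5 \times 0.5 + 3 \times 1.0 = 5.5 \): a cluster scores highly when many points stay together over a wide range of \( \lambda \).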

---

## 4) Your reasoning (DBSCAN ↔ HDBSCAN via MST)

1. Ignore the “\( k \)-neighbor” rule: build an MST on **pairwise distances**; **cut** edges \(>\varepsilon\) → connected components match “DBSCAN-without-core-rule.”
2. DBSCAN’s extra constraint (must have \( k \) neighbors) is folded into HDBSCAN by redefining distances:
   $$
   d_{\mathrm{mreach}}(a,b)=\max(\operatorname{core\_dist}(a),\operatorname{core\_dist}(b),\|a-b\|)
   $$
   which **inflates** edges in sparse zones so tiny/isolated groups don’t become clusters.
3. Cutting the **HDBSCAN MST** at a single \( \varepsilon \) **recovers DBSCAN(\( \varepsilon,k \))** on **core points**; attach border points within \( \varepsilon \) to match DBSCAN labels.

→ **DBSCAN is a horizontal slice** of HDBSCAN’s hierarchy.
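
A short worked step shows why the slice claim holds, assuming the DBSCAN core rule and the core distance count neighbors under the same convention, so that "core at \( \varepsilon \)" means \( \operatorname{core\_dist}(x) \le \varepsilon \). A maximum is at most \( \varepsilon \) exactly when each of its arguments is:

$$
d_{\mathrm{mreach}}(a,b) \le \varepsilon
\iff
\operatorname{core\_dist}(a) \le \varepsilon
\ \wedge\
\operatorname{core\_dist}(b) \le \varepsilon
\ \wedge\
\|a-b\| \le \varepsilon
$$

So an edge survives the cut at \( \varepsilon \) only between two core points that are \( \varepsilon \)-neighbors, which is exactly DBSCAN's core-to-core connectivity.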

---

## 5) Parameter intuition & effects

### 5.1 DBSCAN
- **\( \varepsilon \)**:
  - Too small → many **noise** points; clusters fragment.
  - Too large → clusters **merge**; structure blurs.
- **\( k=\text{min\_samples} \)**:
  - Higher → more **conservative** (more noise, denser clusters).
  - Lower → more **sensitive** (may pick spurious small clusters).
- **Metric**: Euclidean by default; **cosine** is common for text embeddings; **L1** (Manhattan) is sometimes preferred for sparse, high-dimensional features.
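
The direction of both effects can be checked by counting noise points while sweeping one parameter at a time. This is a rough sketch on synthetic data: the blob parameters, the grids, and the helper `noise_count` are assumptions, and only the noise count is tracked, not the cluster structure.

```python
import numpy as np

def noise_count(X, eps, k):
    """Number of points that are neither core nor within eps of a core point."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    within = dist <= eps
    is_core = within.sum(axis=1) >= k                    # count includes the point itself
    reached = (within & is_core[None, :]).any(axis=1)    # core or border
    return int((~reached).sum())

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0.0, 0.0], 0.2, size=(100, 2)),   # tight blob
    rng.normal([4.0, 0.0], 1.1, size=(100, 2)),   # loose blob
])

eps_grid = [0.1, 0.3, 0.6, 1.2]
by_eps = [noise_count(X, eps, k=6) for eps in eps_grid]   # larger eps: never more noise
by_k = [noise_count(X, 0.3, k) for k in (3, 6, 12)]       # larger k: never less noise
print(dict(zip(eps_grid, by_eps)))
print(dict(zip((3, 6, 12), by_k)))
```

Noise can only shrink as \( \varepsilon \) grows and can only grow as \( k \) grows, because the set of core points changes monotonically in both cases.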

---

### 5.2 HDBSCAN
- **min\_cluster\_size**: minimum population to count as a **stable** cluster. ↑ → fewer/larger clusters, more noise.
- **min\_samples \( (=k) \)**: how conservative the density estimate is (higher → more points treated as noise); leaving it equal to min\_cluster\_size is often a good starting point.
- **Metric**: same options as DBSCAN. For embeddings, UMAP with a cosine metric often works well.

**Tip (embeddings):**
Use **UMAP → HDBSCAN** so distances reflect meaningful local structure; HDBSCAN adapts to **variable density** on that manifold.

---

## 6) Why HDBSCAN/DBSCAN beat K-Means on uneven densities

- **K-Means** minimises within-cluster variance:
  $$
  \min_{\{\mu_j\},\{c_i\}} \sum_i \big\|x_i - \mu_{c_i}\big\|^2
  $$
  → implicitly prefers **equal spread** and **convex** partitions, and **forces** every point into a cluster.
- **DBSCAN/HDBSCAN** rely on **local density**:
  - Retain **tight** and **loose** blobs simultaneously.
  - Mark **outliers** as **noise**.
  - **HDBSCAN** examines **all \( \varepsilon \)** and selects **stable** clusters.

**TL;DR:**
K-Means = **global scale** → fixed variance assumption.
HDBSCAN = **local scale** → variable density accepted.
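
To make the K-Means objective concrete, here is a minimal sketch of Lloyd's iterations, the standard alternating scheme for that sum of squares. The data, the initial centers, and the function names are assumptions; production code would normally use a library implementation with better initialisation.

```python
import numpy as np

def inertia(X, centers, assign):
    """The K-Means objective: sum of squared distances to assigned centroids."""
    return float(((X - centers[assign]) ** 2).sum())

def lloyd(X, centers, n_iter=10):
    """Alternate assignment and centroid update; record the objective each round."""
    history = []
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)       # every point gets a cluster, no noise label
        centers = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(len(centers))  # an empty cluster keeps its old center
        ])
        history.append(inertia(X, centers, assign))
    return centers, assign, history

rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal([0.0, 0.0], 0.2, size=(100, 2)),   # tight blob
    rng.normal([4.0, 0.0], 1.1, size=(100, 2)),   # loose blob
])
init = np.array([[0.0, 0.0], [1.0, 0.0]])         # arbitrary starting centers
centers, assign, history = lloyd(X, init)
print(history[0], history[-1])
```

Neither step can increase the objective, so `history` is non-increasing, and nothing in the procedure allows a point to be left unassigned.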

---

## 7) DBSCAN/HDBSCAN vs K-Means — Summary

| Aspect | K-Means | DBSCAN | HDBSCAN |
|---|---|---|---|
| \#Clusters | Must set \( k \) | Auto | Auto |
| Cluster definition | Centroid proximity | Density at \( \varepsilon \) | Stability across \( \varepsilon \) |
| Shape | Convex | Arbitrary | Arbitrary |
| Uneven density | ✗ | △ (single \( \varepsilon \)) | ✓ (all \( \varepsilon \)) |
| Outliers | ✗ | ✓ | ✓ |
| Soft membership | ✗ | ✗ | ✓ |
| Online | ✓ (mini-batch variants) | ✗ | ✗ |

---

## 8) Worked example (conceptual)

**Data:** two clusters with different density + a few outliers.
- Dense ~ 250 pts near \((0,0)\), std \( \approx 0.2 \)
- Sparse ~ 250 pts near \((4,0)\), std \( \approx 1.1 \)
- Outliers ~ 25 pts scattered

**K-Means (true \( k=2 \))**:
→ straight Voronoi boundary; **cuts into the sparse blob's near-side tail**, **absorbs outliers**.
**DBSCAN** (\( \varepsilon \approx 0.6,\ k \approx 6 \)):
→ two clusters + **noise (-1)**.
**HDBSCAN** (min\_cluster\_size \( \approx 20\!-\!40 \)):
→ two stable clusters + noise; optional **probabilities**.

---

## 9) Practical playbook (embeddings / incidents)

1. **Normalise** (unit length for cosine).
2. **Reduce dimension:** **UMAP** (local structure) or **PCA** (variance).
3. **Cluster:** use **HDBSCAN**; for production, one option is to compute per-cluster centroids and assign new points to the nearest one (K-Means-style) for faster assignment.
4. **Evaluate:** cluster count, noise %, coherence, stability.
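
Steps 1 and 4 are easy to sketch in NumPy. On unit-length vectors, squared Euclidean distance equals \( 2(1 - \cos) \), which is why normalising lets a Euclidean clusterer respect cosine similarity. The helper names, the random "embeddings", and the example labels below are all assumptions for illustration.

```python
import numpy as np

def unit_normalise(E):
    """Scale each row to unit length (guarding against zero vectors)."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    return E / np.clip(norms, 1e-12, None)

def summarise(labels):
    """Basic evaluation numbers: cluster count and percentage of noise points."""
    labels = np.asarray(labels)
    clusters = set(labels.tolist()) - {-1}
    return {"n_clusters": len(clusters),
            "noise_pct": 100.0 * float((labels == -1).mean())}

rng = np.random.default_rng(0)
E = rng.normal(size=(5, 8))              # stand-in for real embeddings
U = unit_normalise(E)

cos = U @ U.T                            # cosine similarity of unit vectors
sq_euclid = ((U[:, None, :] - U[None, :, :]) ** 2).sum(axis=-1)
print(np.allclose(sq_euclid, 2.0 * (1.0 - cos)))   # True

report = summarise([0, 0, 1, -1, 1, -1, 0, 2])     # made-up labels
print(report)  # {'n_clusters': 3, 'noise_pct': 25.0}
```

Coherence and stability checks depend on the data and the clustering library, so they are left out of this sketch.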

---

## 10) Equivalence note (single-threshold HDBSCAN ≡ DBSCAN)

Build \( d_{\mathrm{mreach}} \) → MST → **cut at one \( \varepsilon \)**.
Connected components over **core points** = **DBSCAN(\( \varepsilon,k \))**; attach border points within \( \varepsilon \) to match DBSCAN output.
Thus, **DBSCAN is a single slice** of HDBSCAN’s hierarchy.

---
