# 2.3.6 Hierarchical Clustering — scikit-learn Notes

---

## 🧩 What It Is
Hierarchical clustering builds **nested clusters** by successively merging or splitting them.
The resulting structure is a **tree (dendrogram)** where:
- The **root** is one big cluster containing all samples.
- The **leaves** are individual samples.
- The **internal nodes** represent intermediate merges.

By cutting the tree at different levels, we can obtain different numbers of clusters.
In scikit-learn, this is mainly implemented via **Agglomerative Clustering** (bottom-up).

---

## 🔽 Agglomerative (Bottom-Up) Approach
Each observation starts as its own cluster.
At each step, the algorithm merges the **two clusters that are most similar**, according to a linkage criterion.
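
A minimal sketch of the bottom-up procedure, assuming a made-up one-dimensional toy dataset (`X` is illustrative, not from the docs). It fits `AgglomerativeClustering` and inspects `children_`, which records the pair of nodes merged at each step.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy data (hypothetical): two tight groups on a line.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])

model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)

# children_[i] holds the two nodes merged at step i.
# Indices below n_samples are original samples; larger indices
# refer to clusters created by earlier merges.
print("labels:", labels)
print("merges:\n", model.children_)
```

With `n_samples` observations there are always `n_samples - 1` merges in the full tree, which is why `children_` has that many rows.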

### Linkage Types
| Linkage | Merge rule | Typical behaviour |
|----------|-------------|-------------------|
| **Ward** | Minimises within-cluster variance (sum of squared differences) | Produces regular, compact clusters; only works with Euclidean distance |
| **Complete** | Minimises the *maximum* distance between points in different clusters | Tends toward compact clusters of similar diameter; sensitive to outliers |
| **Average** | Minimises the *average* distance between points in different clusters | A compromise between single and complete; supports non-Euclidean distances |
| **Single** | Minimises the *minimum* distance between points in different clusters | Follows chain-like, non-convex shapes; sensitive to noise and the most prone to uneven (“rich-get-richer”) cluster sizes |
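
To make the table concrete, here is a small sketch that runs all four linkages on the same data and scores each against the known grouping. The two-moons dataset, the sample size, and the use of the adjusted Rand index are assumptions chosen for illustration. On chain-like shapes such as these, single linkage would be expected to score well, while the variance- and diameter-based rules may cut the moons apart.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: a non-convex, chain-like shape.
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)

scores = {}
for linkage in ("ward", "complete", "average", "single"):
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
    # Adjusted Rand index: 1.0 means perfect agreement with y_true.
    scores[linkage] = adjusted_rand_score(y_true, labels)
    print(f"{linkage:>8}: ARI = {scores[linkage]:.2f}")
```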

---

## ⚙️ Choosing Linkage

- **Ward** → most regular cluster sizes but only with Euclidean metric.
- **Single** → flexible shapes but high sensitivity to noise.
- **Average / Complete** → can handle non-Euclidean metrics like cosine or L1.
- The algorithm can exhibit “**rich-get-richer**” behavior — large clusters keep absorbing smaller ones.

Guideline, paraphrased from the scikit-learn docs:
> Single linkage is the worst strategy for uneven cluster sizes; Ward gives the most regular sizes.
> For non-Euclidean metrics, average linkage is a good alternative.
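
One way to observe the “rich-get-richer” effect is to compare cluster sizes per linkage on data with no real structure. The sketch below assumes uniformly random points (a made-up setup, not taken from the docs) and prints the sizes sorted from largest to smallest.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.RandomState(0)
X = rng.uniform(size=(500, 2))  # structureless data, for illustration only

sizes = {}
for linkage in ("ward", "complete", "average", "single"):
    labels = AgglomerativeClustering(n_clusters=5, linkage=linkage).fit_predict(X)
    # Count samples per cluster and sort from largest to smallest.
    sizes[linkage] = np.sort(np.bincount(labels))[::-1]
    print(f"{linkage:>8}: {sizes[linkage].tolist()}")
```

A very lopsided size list (one large cluster plus a few tiny ones) is the signature of the effect; a roughly even list indicates regular sizes.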

---

## 🧭 Connectivity Constraints
We can impose a **connectivity matrix** (often sparse) that defines which samples are allowed to merge.

Use cases:
- Spatial data (e.g., neighboring pixels in an image).
- Graph data (e.g., K-nearest neighbors).

Benefits:
- Adds locality structure.
- Can **speed up computation** by reducing possible merges.

Caveats:
- Can amplify “rich-get-richer” effect for single/average/complete linkages.
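
A short sketch of a k-nearest-neighbors connectivity constraint, assuming a swiss-roll dataset and `n_neighbors=10` (both arbitrary choices for illustration). `kneighbors_graph` returns a sparse matrix that is passed through the `connectivity` parameter.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph

X, _ = make_swiss_roll(n_samples=500, noise=0.05, random_state=0)

# Sparse graph: each sample is linked to its 10 nearest neighbors.
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)

# Only clusters that are adjacent in the graph may be merged.
model = AgglomerativeClustering(
    n_clusters=6, linkage="ward", connectivity=connectivity
)
labels = model.fit_predict(X)

print("stored edges:", connectivity.nnz, "of", X.shape[0] ** 2, "possible pairs")
```

Because merges must follow graph edges, clusters tend to stay on the rolled surface instead of jumping across its folds.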

---

## 📏 Distance Metric (Affinity)
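For linkages other than Ward, the metric can vary. In `AgglomerativeClustering` it is set with the `metric` parameter, which older scikit-learn releases called `affinity`.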

Common choices:
- **Euclidean** (default)
- **L1 (Manhattan)** → robust for sparse data
- **Cosine** → invariant to vector magnitude, good for text embeddings

Rule of thumb:
> Choose a metric that **maximizes inter-cluster distance** while **minimizing intra-cluster distance**.

Ward linkage supports only Euclidean distance.
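
The sketch below pairs cosine distance with average linkage and checks that Ward rejects a non-Euclidean metric. The data are made up: two directions with widely varying magnitudes. Since the keyword for the metric differs across scikit-learn versions, the code looks it up from the constructor signature; that lookup is a workaround assumed here, not something the docs prescribe.

```python
import inspect
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.RandomState(0)
# Hypothetical data: 20 points along each of two directions,
# with magnitudes spread between 0.5 and 10.
directions = np.array([[1.0, 0.0], [0.0, 1.0]])
X = np.vstack([d * s for d in directions for s in rng.uniform(0.5, 10.0, size=20)])
X += rng.normal(scale=0.01, size=X.shape)

# Use whichever keyword this scikit-learn version accepts.
params = inspect.signature(AgglomerativeClustering).parameters
metric_kwarg = "metric" if "metric" in params else "affinity"

model = AgglomerativeClustering(
    n_clusters=2, linkage="average", **{metric_kwarg: "cosine"}
)
labels = model.fit_predict(X)
print("labels:", labels)

# Ward only accepts Euclidean distance, so this fit should fail.
try:
    AgglomerativeClustering(
        n_clusters=2, linkage="ward", **{metric_kwarg: "cosine"}
    ).fit(X)
    ward_rejected = False
except ValueError:
    ward_rejected = True
print("Ward rejected cosine:", ward_rejected)
```

Cosine distance ignores vector length, so the points group by direction even though their magnitudes differ by an order of magnitude.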

---

## 🌳 Dendrogram Visualization
- Dendrograms show the **merge hierarchy**.
- For small datasets, they are helpful for visual diagnostics.
- Each link joins two clusters at the height (distance) at which they were merged.
- Cutting the tree with a horizontal line at a given height yields the flat clusters that exist at that distance; lower cuts give more clusters.
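
scikit-learn does not draw dendrograms itself, so a common approach is to convert the fitted tree into a SciPy linkage matrix. The sketch below assumes small blob data and a helper named `to_linkage_matrix` (a hypothetical name). It uses `no_plot=True` so it runs without matplotlib; drop that argument to draw the figure.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, cluster_std=0.5, random_state=0)

# distance_threshold=0 with n_clusters=None builds the full tree
# and populates distances_ with the merge heights.
model = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0, linkage="ward"
).fit(X)


def to_linkage_matrix(model):
    """Build a SciPy linkage matrix: [child_a, child_b, distance, size]."""
    n_samples = len(model.labels_)
    counts = np.zeros(model.children_.shape[0])
    for i, (left, right) in enumerate(model.children_):
        for child in (left, right):
            # Leaves count as one sample; merged nodes reuse their stored size.
            counts[i] += 1 if child < n_samples else counts[child - n_samples]
    return np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)


Z = to_linkage_matrix(model)
tree = dendrogram(Z, no_plot=True)               # layout only, no figure
flat = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
print("leaf order:", tree["leaves"])
print("flat labels:", flat)
```

Calling `fcluster` with `criterion="distance"` instead cuts at a height rather than at a cluster count.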

---

## ⚖️ Practical Considerations

- **Scalability**:
  Without connectivity, the algorithm considers *all possible merges* at each step, which costs \( O(n^3) \) time in a naive implementation and is slow for large datasets.
  With connectivity, candidate merges are limited to neighboring samples, which improves speed.

- **Memory**:
  Pairwise distances are \( O(n^2) \), which can be large.

- **Transductive**:
  `AgglomerativeClustering` has no `predict` method, so new samples cannot be assigned to clusters after fitting.
  To label new points, one usually must re-fit or approximate by assigning each point to its nearest existing cluster.
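
One possible approximation for labelling new points is to fit a `NearestCentroid` classifier on the cluster labels. This is a workaround assumed here, not part of `AgglomerativeClustering`, and it only matches the hierarchy well when clusters are roughly convex. The blob centers and `X_new` are made up.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestCentroid

X, _ = make_blobs(
    n_samples=200,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Treat the cluster labels as classes and learn one centroid per cluster.
assigner = NearestCentroid().fit(X, labels)

X_new = np.array([[0.5, -0.5], [9.0, 11.0]])  # hypothetical unseen points
new_labels = assigner.predict(X_new)
print("new labels:", new_labels)
```

For elongated or chain-like clusters, a k-nearest-neighbors classifier trained on the same labels may follow the cluster shapes more closely than centroids do.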

---

## ✂️ Divisive Variant — Bisecting K-Means
An alternative hierarchical approach (top-down):
- Start with one cluster containing all samples.
- Repeatedly **bisect one cluster** with K-Means until the desired number of clusters is reached.
- More efficient than agglomerative clustering when \( n_{\text{clusters}} \ll n_{\text{samples}} \).

---

## 🧠 When to Use Hierarchical Clustering

**Use it when:**
- You want **hierarchical structure** (not just flat labels).
- You need interpretability via dendrograms.
- You can work with small-to-medium datasets.
- You have graph/spatial data and can use connectivity constraints.

**Avoid it when:**
- You need scalability for very large datasets.
- You must assign new/unseen samples quickly (online or streaming context).
- You only need simple flat clusters.

---

## ⚙️ Key Takeaways

- Linkage choice controls **merge strategy** and **cluster shape**.
- Metric choice determines **which samples count as similar**.
- Connectivity can encode **locality** or **graph structure**.
- Dendrograms provide **interpretable visualization**.
- Complexity grows quickly with number of samples.
- Hierarchical clustering is **transductive**, not inductive.

---

## 💡 Relevance for MLE Practice

- **Preprocessing** (scaling, outlier handling) strongly affects results, because merges are driven by raw distances between samples.
- **Interpretability**: tree structure is valuable for explaining how clusters form.
- **Dimensionality reduction** (PCA/t-SNE/UMAP) is often used before hierarchical clustering to stabilize results.
- **Production caveats**: assigning new data points is non-trivial — not ideal for online systems.
- **Hyperparameters** (linkage, metric, connectivity) should be tuned thoughtfully.
- **Multi-resolution view**: dendrogram enables analysis at different cluster granularities.
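
The preprocessing and dimensionality-reduction points can be combined in a single pipeline. The sketch below assumes synthetic 10-feature blobs with one feature deliberately inflated in scale; the choice of `StandardScaler` and `PCA(n_components=2)` is illustrative, not a recommendation for every dataset.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=150, n_features=10, centers=3, random_state=0)
X[:, 0] *= 1000.0  # exaggerate one feature so it would dominate raw distances

pipe = make_pipeline(
    StandardScaler(),        # put all features on a comparable scale
    PCA(n_components=2),     # reduce dimensionality before clustering
    AgglomerativeClustering(n_clusters=3, linkage="ward"),
)
labels = pipe.fit_predict(X)
print("cluster sizes:", sorted((labels == k).sum() for k in set(labels)))
```

Because the final step has no `predict`, the pipeline supports `fit_predict` on the training data only.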

---

## 🧾 Summary Table

| Aspect | Description |
|--------|-------------|
| Algorithm | Agglomerative clustering (bottom-up) |
| Output | Dendrogram + flat cluster labels |
| Linkages | Ward, complete, average, single |
| Distance metric | Euclidean only for Ward; other metrics allowed for complete, average, and single |
| Constraints | Optional connectivity matrix |
| Complexity | \( O(n^3) \) naive, improved with constraints |
| Inductive? | No — transductive only |
| Visualization | Dendrogram (merge tree) |
| Alternative | Bisecting K-Means (top-down) |

---
