# 📜 Clustering Evaluation Metrics (AI/ML/DL)

---

## 🔹 1. Internal Metrics (Unsupervised Quality)

Evaluate clustering **without ground-truth labels**: focus on compactness, separation, and structure.

- **Within-Cluster Sum of Squares (WCSS / Inertia):**
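  - $WCSS = \sum_i \sum_{x \in C_i} \lVert x - c_i \rVert^2$, where $c_i$ is the centroid of cluster $C_i$.
  - Lower values = tighter clusters.
  - ❌ Not normalized, and it keeps decreasing as $k$ grows (reaching 0 when every point is its own cluster), so it cannot select $k$ by itself.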

- **Silhouette Score (Rousseeuw, 1987):**
  $$
  s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}
  $$
  - $a(i)$ = mean distance from $i$ to the other points in its own cluster.
  - $b(i)$ = smallest mean distance from $i$ to the points of any other cluster (the nearest neighboring cluster).
  - ✅ Balances cohesion vs separation.
  - ❌ Degrades in high dimensions.

- **Davies–Bouldin Index (DBI, 1979):**
  $$
  DBI = \frac{1}{k} \sum_{i=1}^k \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)}
  $$
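  - $\sigma_i$ = mean distance of the points in cluster $i$ to its centroid $c_i$; $d(c_i, c_j)$ = distance between centroids.
  - Lower is better (0 is the minimum).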

- **Dunn Index (1974):**
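  $$
  DI = \frac{\min_{i \neq j} d(C_i, C_j)}{\max_{l} \operatorname{diam}(C_l)}
  $$
  - $d(C_i, C_j)$ = distance between two clusters; $\operatorname{diam}(C_l)$ = diameter (largest within-cluster distance) of cluster $C_l$.
  - Higher = more compact, better-separated clusters.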

- **Calinski–Harabasz Index (1974):**
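  $$
  CH = \frac{\operatorname{tr}(B_k)}{\operatorname{tr}(W_k)} \cdot \frac{n-k}{k-1}
  $$
  - $B_k$, $W_k$ = between- and within-cluster dispersion matrices for $k$ clusters; $n$ = number of points.
  - Ratio of between-cluster to within-cluster variance; higher is better.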

- **Gap Statistic (Tibshirani, 2001):**
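  - Compares $\log(WCSS_k)$ with its expectation under a null reference distribution (e.g., points drawn uniformly over the data's range): $Gap(k) = \mathbb{E}^*[\log WCSS_k] - \log WCSS_k$.
  - Helps estimate the optimal $k$: choose the smallest $k$ with $Gap(k) \ge Gap(k+1) - s_{k+1}$, where $s_{k+1}$ accounts for the simulation error of the reference samples.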
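
The formulas above can be checked by hand on a tiny dataset. The following is a minimal from-scratch sketch using only the Python standard library. The toy points, the helper names (`group`, `centroid`, and so on), and the choice of closest-pair separation and largest pairwise distance as the diameter in the Dunn index are assumptions made for illustration, not part of the original definitions.

```python
import math

# Toy 2-D data: two well-separated unit squares (made-up values for illustration).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]


def group(points, labels):
    """Map each cluster label to the list of its points."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    return clusters


def centroid(pts):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coord) / len(pts) for coord in zip(*pts))


def wcss(clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(math.dist(p, centroid(pts)) ** 2
               for pts in clusters.values() for p in pts)


def silhouette(points, labels):
    """Mean silhouette score; singleton clusters score 0 by convention."""
    clusters = group(points, labels)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # dist(p, p) = 0, so summing over the whole cluster and dividing
        # by (size - 1) gives the mean distance to the *other* members.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in pts) / len(pts)
                for m, pts in clusters.items() if m != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def davies_bouldin(clusters):
    """Average over clusters of the worst (sigma_i + sigma_j) / d(c_i, c_j)."""
    cents = {l: centroid(pts) for l, pts in clusters.items()}
    sigma = {l: sum(math.dist(p, cents[l]) for p in pts) / len(pts)
             for l, pts in clusters.items()}
    worst = [max((sigma[i] + sigma[j]) / math.dist(cents[i], cents[j])
                 for j in clusters if j != i)
             for i in clusters]
    return sum(worst) / len(worst)


def dunn(clusters):
    """Smallest between-cluster point distance over largest cluster diameter."""
    keys = list(clusters)
    min_sep = min(math.dist(p, q)
                  for a in keys for b in keys if a < b
                  for p in clusters[a] for q in clusters[b])
    max_diam = max(math.dist(p, q)
                   for pts in clusters.values() for p in pts for q in pts)
    return min_sep / max_diam


def calinski_harabasz(points, clusters):
    """(between-cluster dispersion / WCSS) * (n - k) / (k - 1)."""
    n, k = len(points), len(clusters)
    overall = centroid(points)
    between = sum(len(pts) * math.dist(centroid(pts), overall) ** 2
                  for pts in clusters.values())
    return between / wcss(clusters) * (n - k) / (k - 1)


clusters = group(points, labels)
print("WCSS      ", round(wcss(clusters), 4))
print("Silhouette", round(silhouette(points, labels), 4))
print("DBI       ", round(davies_bouldin(clusters), 4))
print("Dunn      ", round(dunn(clusters), 4))
print("CH        ", round(calinski_harabasz(points, clusters), 4))
```

For this toy data the sketch should give WCSS = 4, DBI = 0.2, Dunn = 4, and CH = 150 (up to floating-point rounding), with a mean silhouette of roughly 0.84. All five agree that the two squares form tight, well-separated clusters.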

---

## 🔹 2. External Metrics (Supervised Evaluation)

Require **ground-truth labels** to compare clustering quality.

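- **Rand Index (RI, 1971):** Fraction of point pairs on which the clustering and the labels agree (both place the pair together, or both place it apart).
- **Adjusted Rand Index (ARI, 1985):** RI corrected for chance; about 0 for random assignments, 1 for a perfect match, and it can be negative.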
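- **Mutual Information (MI):** For cluster assignment $U$ and labels $V$, with $P(i,j)$ the fraction of points in cluster $i$ and class $j$, and $P(i)$, $P(j)$ the corresponding marginals: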
  $$
  MI(U,V) = \sum_{i,j} P(i,j) \log \frac{P(i,j)}{P(i)P(j)}
  $$
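- **Normalized Mutual Information (NMI):** MI scaled to $[0,1]$ by the entropies $H(U)$ and $H(V)$; shown here with the geometric-mean normalizer (arithmetic mean, min, and max are also used):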
  $$
  NMI = \frac{MI(U,V)}{\sqrt{H(U)H(V)}}
  $$
- **V-Measure:** Harmonic mean of homogeneity and completeness.
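- **Fowlkes–Mallows Index (1983):** Geometric mean of pairwise precision and recall, where $TP$, $FP$, $FN$ count pairs of points (same cluster and same class; same cluster only; same class only):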
  $$
  FMI = \sqrt{\frac{TP}{TP+FP} \cdot \frac{TP}{TP+FN}}
  $$
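
The pair-counting and information-theoretic metrics above can be sketched from scratch as follows. This is an illustrative standard-library implementation, not a reference one: the toy labelings and all function names are assumptions, the ARI is written in its pair-count form, the V-Measure uses the usual definitions of homogeneity ($MI / H(\text{labels})$) and completeness ($MI / H(\text{clusters})$), and degenerate cases (a single cluster, zero entropy) are not handled.

```python
import math
from collections import Counter
from itertools import combinations

# Toy labelings (made-up values): ground-truth classes vs. predicted cluster IDs.
truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 2, 2]


def pair_counts(truth, pred):
    """Count point pairs as TP, FP, FN, TN (same cluster / same class)."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_class = truth[i] == truth[j]
        same_cluster = pred[i] == pred[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def rand_index(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def adjusted_rand_index(tp, fp, fn, tn):
    """(index - expected index) / (max index - expected index)."""
    total = tp + fp + fn + tn
    expected = (tp + fp) * (tp + fn) / total
    maximum = ((tp + fp) + (tp + fn)) / 2
    return (tp - expected) / (maximum - expected)


def fowlkes_mallows(tp, fp, fn):
    return math.sqrt((tp / (tp + fp)) * (tp / (tp + fn)))


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def mutual_info(u, v):
    """MI from the empirical joint and marginal label distributions."""
    n = len(u)
    pu, pv, joint = Counter(u), Counter(v), Counter(zip(u, v))
    return sum(c / n * math.log((c / n) / ((pu[i] / n) * (pv[j] / n)))
               for (i, j), c in joint.items())


def nmi(u, v):
    """MI normalized by the geometric mean of the two entropies."""
    return mutual_info(u, v) / math.sqrt(entropy(u) * entropy(v))


def v_measure(truth, pred):
    """Harmonic mean of homogeneity and completeness."""
    mi = mutual_info(truth, pred)
    homogeneity = mi / entropy(truth)
    completeness = mi / entropy(pred)
    return 2 * homogeneity * completeness / (homogeneity + completeness)


tp, fp, fn, tn = pair_counts(truth, pred)
print("pairs (TP, FP, FN, TN):", (tp, fp, fn, tn))
print("RI  ", round(rand_index(tp, fp, fn, tn), 4))
print("ARI ", round(adjusted_rand_index(tp, fp, fn, tn), 4))
print("FMI ", round(fowlkes_mallows(tp, fp, fn), 4))
print("MI  ", round(mutual_info(truth, pred), 4))
print("NMI ", round(nmi(truth, pred), 4))
print("V   ", round(v_measure(truth, pred), 4))
```

On this example the Rand Index (about 0.67) looks much more favorable than the ARI (about 0.24), which illustrates why the chance correction matters when the clustering splits classes.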

---

## 🔹 3. Probabilistic & Distribution-Based Metrics

For probabilistic clustering (e.g., **GMMs, VAEs, LDA**):

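- **Log-Likelihood:** Higher $\log p(X \mid \theta)$ = better fit, but it generally increases with model size, so it cannot select the number of components by itself.
- **AIC (Akaike Information Criterion):** here $k$ is the number of free parameters (not the number of clusters) and $\hat{L}$ the maximized likelihood; lower is better.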
  $$
  AIC = 2k - 2\ln(\hat{L})
  $$
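- **BIC (Bayesian Information Criterion):** same $k$ and $\hat{L}$, with $n$ the number of samples; penalizes complexity more heavily than AIC once $n > e^2 \approx 7.4$. Lower is better.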
  $$
  BIC = k \ln(n) - 2\ln(\hat{L})
  $$
- **Perplexity:** Exponentiated negative mean log-likelihood per token, usually measured on held-out text; common in topic modeling, lower = better fit.
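
As a rough illustration of the criteria above, the sketch below evaluates AIC and BIC for the simplest possible "mixture", a single 1-D Gaussian fitted by maximum likelihood, and computes perplexity from per-token probabilities. The toy data, the function names, and the parameter-count helper for a full-covariance GMM are assumptions added for illustration; a real workflow would take the log-likelihood from a fitted model.

```python
import math


def gaussian_log_likelihood(data):
    """Log-likelihood of 1-D data under its own MLE Gaussian (mean, variance)."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n  # MLE uses 1/n, not 1/(n-1)
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
               for x in data)


def aic(log_likelihood, n_params):
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_samples):
    return n_params * math.log(n_samples) - 2 * log_likelihood


def gmm_n_params(n_components, n_features):
    """Free parameters of a full-covariance GMM: weights + means + covariances."""
    weights = n_components - 1
    means = n_components * n_features
    covariances = n_components * n_features * (n_features + 1) // 2
    return weights + means + covariances


def perplexity(token_probs):
    """exp of the negative mean log-probability assigned to each token."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))


data = [1.0, 2.0, 3.0, 4.0, 5.0]  # toy sample, illustration only
ll = gaussian_log_likelihood(data)
k = gmm_n_params(n_components=1, n_features=1)  # one mean + one variance = 2

print("log-likelihood", round(ll, 4))
print("AIC", round(aic(ll, k), 4))
print("BIC", round(bic(ll, k, len(data)), 4))
print("params for a 3-component GMM in 2-D:", gmm_n_params(3, 2))
print("perplexity of a uniform guess over 4 tokens:", perplexity([0.25] * 4))
```

To select the number of components, one would repeat the AIC/BIC computation for each candidate model and keep the one with the lowest value.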

---

## 🔹 4. Deep Learning & Modern Clustering Metrics

For **deep representation learning + clustering**:

- **Clustering Accuracy (ACC):** Align cluster IDs with true labels via Hungarian algorithm.
- **NMI (again):** Popular in DEC, SwAV, etc.
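- **Clustering Purity:** Fraction of points that belong to the majority class $L_j$ of their cluster $C_i$; trivially reaches 1 when every point is its own cluster: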
  $$
  Purity = \frac{1}{n} \sum_i \max_j |C_i \cap L_j|
  $$
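
A small sketch may help contrast ACC with purity. To stay within the standard library, the one-to-one matching below is found by brute force over permutations instead of the Hungarian algorithm; it yields the same optimum but is only feasible for a handful of clusters (in practice, `scipy.optimize.linear_sum_assignment` is a common choice). The toy labels and function names are assumptions for illustration, and the sketch assumes there are no more clusters than classes.

```python
from collections import Counter
from itertools import permutations

# Toy labelings (made-up values): ground-truth classes vs. predicted cluster IDs.
truth = [0, 0, 0, 0, 1, 1, 2, 2]
pred = [0, 0, 1, 1, 1, 2, 2, 2]


def clustering_accuracy(truth, pred):
    """Best accuracy over all one-to-one cluster -> class mappings.

    Brute-force stand-in for the Hungarian algorithm (small k only).
    """
    classes = sorted(set(truth))
    clusters = sorted(set(pred))
    if len(clusters) > len(classes):
        raise ValueError("sketch assumes no more clusters than classes")
    overlap = Counter(zip(pred, truth))  # (cluster, class) -> count
    best = 0
    for assigned in permutations(classes, len(clusters)):
        matched = sum(overlap[(c, l)] for c, l in zip(clusters, assigned))
        best = max(best, matched)
    return best / len(truth)


def purity(truth, pred):
    """Each cluster is credited with its majority class (mapping may repeat)."""
    overlap = Counter(zip(pred, truth))
    majority = {}
    for (cluster, _), count in overlap.items():
        majority[cluster] = max(majority.get(cluster, 0), count)
    return sum(majority.values()) / len(truth)


print("ACC   ", clustering_accuracy(truth, pred))
print("Purity", purity(truth, pred))
```

Here purity (0.75) exceeds ACC (0.625) because two clusters share the same majority class, which purity allows but a one-to-one matching does not.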

---

## 🔹 5. Task-Specific Metrics

- **Graphs / Community Detection:** Modularity, Conductance, Normalized Cut.
- **Computer Vision:** Purity, NMI, ARI.
- **NLP / Topic Models:** Perplexity, Coherence Score.
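
Of the graph metrics named above, modularity is the easiest to compute directly. The sketch below applies the standard definition $Q = \frac{1}{2m}\sum_{ij}\left(A_{ij} - \frac{k_i k_j}{2m}\right)\delta(c_i, c_j)$ to a toy undirected graph. The edge list, the community assignment, and the function name are assumptions for illustration, and the code assumes a simple graph without self-loops.

```python
from itertools import product

# Toy graph (made-up): two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}


def modularity(edges, community):
    """Newman modularity of a partition of an undirected simple graph."""
    m = len(edges)
    nodes = sorted(community)
    # Symmetric adjacency stored as a dict over ordered node pairs.
    adj = {pair: 0 for pair in product(nodes, repeat=2)}
    for u, v in edges:
        adj[(u, v)] += 1
        adj[(v, u)] += 1
    degree = {u: sum(adj[(u, v)] for v in nodes) for u in nodes}
    total = 0.0
    for i, j in product(nodes, repeat=2):
        if community[i] == community[j]:
            # Observed edge minus the expected count under a degree-preserving null.
            total += adj[(i, j)] - degree[i] * degree[j] / (2 * m)
    return total / (2 * m)


print("Q (two triangles)  ", round(modularity(edges, community), 4))
print("Q (single community)", round(modularity(edges, {n: 0 for n in community}), 4))
```

The two-triangle split scores $Q = 5/14 \approx 0.357$, while putting every node in one community scores 0, as expected for a partition that carries no information.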

---

## ✅ Summary Families

- **Internal metrics:** WCSS, Silhouette, DBI, Dunn, CH, Gap → no labels required.
- **External metrics:** ARI, NMI, V-Measure, Purity → when labels exist.
- **Probabilistic metrics:** Likelihood, AIC/BIC, Perplexity → for mixture/latent models.
- **Deep Learning metrics:** ACC (Hungarian), NMI, Purity → embedding-based clustering.



# 📊 Comparative Table: Clustering Evaluation Metrics (AI/ML/DL)

| Metric                  | Formula (simplified)                                                   | Intuition                                     | Pros                                     | Cons                                     | When to Use                                   |
|--------------------------|------------------------------------------------------------------------|-----------------------------------------------|------------------------------------------|------------------------------------------|-----------------------------------------------|
| **WCSS (Inertia)**       | $WCSS = \sum_{i=1}^n \lVert x_i - c_{z_i} \rVert^2$                  | Compactness: minimize within-cluster variance | Simple, intuitive (k-Means)              | Always decreases with $k$, not comparable across $k$ | Choosing $k$ in k-Means (elbow method)         |
| **Silhouette Score**     | $s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$                        | Balance cohesion ($a$) vs separation ($b$)    | Intuitive, normalized $[-1,1]$           | Poor in high dimensions                   | General-purpose clustering validation          |
| **Davies–Bouldin (DBI)** | $\frac{1}{k}\sum_i \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)}$ | Average worst-case similarity between clusters (compactness vs separation) | Cheap to compute from centroids          | Harder to interpret                        | Comparing cluster partitions                   |
| **Dunn Index**           | $DI = \frac{\min d(C_i, C_j)}{\max diam(C_k)}$                       | Ratio of min inter- to max intra-distance     | Encourages well-separated clusters       | Sensitive to noise, expensive             | Identifying compact, separated clusters        |
| **Calinski–Harabasz**    | $\frac{\text{between-cluster var}}{\text{within-cluster var}}$        | Variance ratio (separation vs compactness)    | Scales well, efficient                   | Favors convex, roughly spherical clusters  | Model selection in clustering                  |
| **Gap Statistic**        | $\text{Gap}(k) = \mathbb{E}^*[\log WCSS_k] - \log WCSS_k$             | Compare vs random baseline                    | Helps find optimal $k$                   | Computationally heavy                      | Estimating optimal # of clusters               |
| **Rand Index (RI)**      | Pairwise agreement fraction                                           | Agreement with ground truth                   | Simple, interpretable                    | Doesn’t adjust for chance                  | Basic external validation                      |
| **Adjusted Rand (ARI)**  | Corrected RI                                                          | Accounts for chance clustering                | Robust to random labeling                | Can be unstable for small data             | Evaluating clustering vs ground truth          |
| **Mutual Info (MI)**     | $MI(U,V) = \sum_{i,j} P(i,j)\log \frac{P(i,j)}{P(i)P(j)}$            | Info overlap between clusters & labels        | Works with any clustering                | No fixed upper bound (depends on the entropies), hard to compare | Comparing unsupervised clusters vs labels      |
| **Normalized MI (NMI)**  | $NMI = \frac{MI(U,V)}{\sqrt{H(U)H(V)}}$                              | Scaled MI $\in [0,1]$                         | Scale-independent, invariant to label permutations | Not corrected for chance; tends to rise with more clusters | Topic modeling, NLP clustering                 |
| **V-Measure**            | Harmonic mean of homogeneity & completeness                          | Balance cluster quality                       | Interpretable, standardized              | Needs labels                               | Document clustering, NLP                       |
| **Fowlkes–Mallows (FMI)**| $FMI = \sqrt{\frac{TP}{TP+FP} \cdot \frac{TP}{TP+FN}}$               | Geometric mean of cluster precision/recall    | Balanced like F1                         | Rare in DL                                 | Supervised clustering evaluation               |
| **Clustering Accuracy**  | $ACC = \max_{\pi}\frac{1}{n}\sum_i 1[y_i = \pi(c_i)]$                | Align clusters with labels (Hungarian match)  | Direct measure, intuitive                | Needs ground truth                         | Deep clustering with labels available          |
| **Purity**               | $\text{Purity} = \frac{1}{n}\sum_i \max_j \lvert C_i \cap L_j \rvert$ | Fraction of points in their cluster's majority class | Simple                                    | Inflates with many clusters                 | Multi-class clustering                         |
| **Log-Lik / AIC / BIC**  | $AIC = 2k - 2\ln(\hat L)$, $BIC = k\ln(n) - 2\ln(\hat L)$            | Probabilistic model fit                       | Likelihood-based, penalize complexity    | Assume distribution                        | GMMs, VAE-based clustering                     |
| **Perplexity (LDA)**     | $Perp = \exp(-\frac{1}{N}\sum \log p(x))$                            | Topic distribution uncertainty                | Standard in NLP                          | Doesn’t always align with human quality     | Topic/document clustering                      |
| **Modularity (Graphs)**  | $Q = \frac{1}{2m}\sum_{ij}(A_{ij} - \frac{k_ik_j}{2m})\delta(c_i,c_j)$ | Intra vs inter-edge density in graphs         | Strong for graph communities             | Not for non-graph data                      | Graph clustering, GNN tasks                    |
| **Coherence (NLP)**      | Semantic similarity of top words                                     | Human-interpretability for topics             | Domain-relevant                          | Task-specific                               | Topic modeling evaluation                      |

---

## ✅ Key Insights

- **Internal metrics** → no labels: Silhouette, DBI, Dunn, CH, Gap.  
- **External metrics** → require labels: ARI, NMI, V-Measure, Purity.  
- **Probabilistic metrics** → for mixture/latent models: Log-Likelihood, AIC, BIC, Perplexity.  
- **Domain-specific metrics** → NLP (Coherence), Graphs (Modularity), CV (Purity, ACC).  
- **Deep Learning clustering** → ACC, NMI, Purity are most used.  
