# 📈 Clustering Evaluation Metrics (Unsupervised Learning)

Clustering is an unsupervised learning method, so we often **don’t have ground truth labels**. Internal metrics score how compact and well-separated the clusters are using only the data and the cluster assignments, while external metrics compare the clustering against known labels.

---

## 1. Silhouette Score

**Equation:**

For each sample \( i \):
- \( a(i) \) = mean distance from \( i \) to the other points in its own cluster (cohesion)
- \( b(i) \) = mean distance from \( i \) to the points in the nearest other cluster (separation)

\[
s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}
\]

Overall silhouette score (ranges from -1 to 1; higher is better):
\[
S = \frac{1}{n} \sum_{i=1}^{n} s(i)
\]
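
The definitions above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: it assumes Euclidean distance, points given as tuples of numbers, and the common convention that a point in a singleton cluster scores 0. The name `silhouette_score_sketch` is hypothetical; in practice a library routine such as scikit-learn's `silhouette_score` would normally be used.

```python
import math


def silhouette_score_sketch(points, labels):
    """Mean silhouette coefficient S, assuming Euclidean distance."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least 2 clusters")

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a(i): mean distance to the other members of i's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Two tight pairs of points, far apart from each other.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score_sketch(points, [0, 0, 1, 1])  # close to +1
bad = silhouette_score_sketch(points, [0, 1, 0, 1])   # negative
```

The sketch computes all pairwise distances, so its cost grows quadratically with the number of samples.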

**✅ When to Use:**
- To assess cluster separation and cohesion.
- When you don't have ground truth labels.

**❌ When *Not* to Use:**
- For clusters with non-convex shapes.
- When clusters are of very different sizes/densities.

**📌 Example:**  
Evaluating K-Means clustering performance to choose the optimal number of clusters.

---

## 2. Davies–Bouldin Index (DBI)

**Equation:**

\[
DBI = \frac{1}{k} \sum_{i=1}^{k} \max_{j \ne i} \left( \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \right)
\]

Where:
- $ \sigma_i $: average distance of points in cluster $ i $ to its centroid.
- $ d(c_i, c_j) $: distance between the centroids of clusters $ i $ and $ j $.
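
As a rough sketch of the formula, the index can be computed in plain Python as below. The function name `davies_bouldin_sketch` is hypothetical, and the code assumes Euclidean distance, points given as numeric tuples, and distinct cluster centroids (coinciding centroids would cause a division by zero). Libraries such as scikit-learn offer `davies_bouldin_score` for real use.

```python
import math


def davies_bouldin_sketch(points, labels):
    """Davies-Bouldin index, assuming Euclidean distance. Lower is better."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("DBI needs at least 2 clusters")

    centroids = {}
    scatter = {}  # sigma_i: mean distance of a cluster's points to its centroid
    for label, members in clusters.items():
        centroid = tuple(sum(coord) / len(members) for coord in zip(*members))
        centroids[label] = centroid
        scatter[label] = sum(math.dist(p, centroid) for p in members) / len(members)

    total = 0.0
    for i in clusters:
        # Worst-case (largest) similarity ratio between cluster i and any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = davies_bouldin_sketch(points, [0, 0, 1, 1])  # small value: compact, well separated
bad = davies_bouldin_sketch(points, [0, 1, 0, 1])   # large value: clusters overlap
```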

**✅ When to Use:**
- Internal validation to compare different clustering results; lower values indicate more compact, better-separated clusters.

**❌ When *Not* to Use:**
- When clusters are highly non-spherical or vary greatly in size.

**📌 Example:**  
Selecting between clustering algorithms (e.g., K-Means vs. Agglomerative) on the same dataset.

---

## 3. Calinski–Harabasz Index (Variance Ratio Criterion)

**Equation:**

$$
CH = \frac{ \text{Tr}(B_k) / (k - 1) }{ \text{Tr}(W_k) / (n - k) }
$$

Where:
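- $ \text{Tr}(B_k) $: trace of the between-cluster dispersion matrix (size-weighted squared distances of cluster centroids to the overall centroid)
- $ \text{Tr}(W_k) $: trace of the within-cluster dispersion matrix (squared distances of points to their own cluster centroid)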
- $ k $: number of clusters
- $ n $: number of data points
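
The ratio can be illustrated with a short plain-Python sketch. It computes the two traces directly as sums of squared Euclidean distances instead of building the dispersion matrices. The name `calinski_harabasz_sketch` is hypothetical, and the code assumes numeric tuples as points and non-zero within-cluster dispersion. scikit-learn's `calinski_harabasz_score` is the usual choice in practice.

```python
def calinski_harabasz_sketch(points, labels):
    """Calinski-Harabasz index (variance ratio criterion). Higher is better."""
    n = len(points)
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if not 1 < k < n:
        raise ValueError("need 2 <= k <= n - 1 clusters")

    def mean(rows):
        return tuple(sum(col) / len(rows) for col in zip(*rows))

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    overall = mean(points)
    between = 0.0  # Tr(B_k)
    within = 0.0   # Tr(W_k)
    for members in clusters.values():
        centroid = mean(members)
        between += len(members) * sq_dist(centroid, overall)
        within += sum(sq_dist(p, centroid) for p in members)

    # Assumes within > 0; identical points inside every cluster would divide by zero.
    return (between / (k - 1)) / (within / (n - k))


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = calinski_harabasz_sketch(points, [0, 0, 1, 1])  # large ratio
bad = calinski_harabasz_sketch(points, [0, 1, 0, 1])   # ratio near zero
```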

**✅ When to Use:**
- When you need a cheap-to-compute score for comparing cluster compactness and separation; higher values indicate better-defined clusters.

**❌ When *Not* to Use:**
- When data is not well-suited to variance-based measures (e.g., categorical data).

**📌 Example:**  
Evaluating K-Means clustering results across different `k` values.

---

## 4. Adjusted Rand Index (ARI)

**Equation (conceptually):**

$$
ARI = \frac{\text{Index} - \text{Expected Index}}{\text{Max Index} - \text{Expected Index}}
$$

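Measures similarity between predicted clusters and true labels, adjusted for chance. Here "Index" counts the pairs of samples that the two labelings both place in the same cluster. ARI equals 1 for identical partitions, is close to 0 for random labelings, and can be negative when agreement is worse than chance.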
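
One common way to make the conceptual formula concrete is the pair-counting form built from a contingency table, sketched below in plain Python. The function name `adjusted_rand_sketch` is hypothetical, and returning 1.0 in the degenerate case where the denominator is zero is an assumed convention. scikit-learn provides `adjusted_rand_score` for production use.

```python
from collections import Counter
from math import comb


def adjusted_rand_sketch(labels_true, labels_pred):
    """Adjusted Rand Index from pair counts over the contingency table."""
    n = len(labels_true)
    if n != len(labels_pred) or n < 2:
        raise ValueError("need two labelings of the same length (at least 2 samples)")

    contingency = Counter(zip(labels_true, labels_pred))
    true_sizes = Counter(labels_true)
    pred_sizes = Counter(labels_pred)

    # "Index": pairs placed together in both labelings.
    index = sum(comb(c, 2) for c in contingency.values())
    pairs_true = sum(comb(c, 2) for c in true_sizes.values())
    pairs_pred = sum(comb(c, 2) for c in pred_sizes.values())

    expected = pairs_true * pairs_pred / comb(n, 2)  # "Expected Index"
    max_index = (pairs_true + pairs_pred) / 2        # "Max Index"
    if max_index == expected:
        return 1.0  # assumed convention, e.g. both labelings are a single cluster
    return (index - expected) / (max_index - expected)


truth = [0, 0, 1, 1]
same_partition = adjusted_rand_sketch(truth, [1, 1, 0, 0])  # label names do not matter
crossed = adjusted_rand_sketch(truth, [0, 1, 0, 1])         # worse than chance
```

Because only co-membership of pairs is counted, renaming the cluster labels does not change the score.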

**✅ When to Use:**
- You **have ground truth labels** (external validation).

**❌ When *Not* to Use:**
- No labeled data available.

**📌 Example:**  
Comparing clustering labels to known classes in benchmark datasets like Iris.

---

## 5. Normalized Mutual Information (NMI)

**Equation:**

$$
NMI(U, V) = \frac{2 \cdot I(U; V)}{H(U) + H(V)}
$$

Where:
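- $ I(U; V) $: mutual information between cluster assignment $ U $ and ground truth $ V $
- $ H(U), H(V) $: entropies of $ U $ and $ V $

This form normalizes by the arithmetic mean of the two entropies, so NMI ranges from 0 (independent labelings) to 1 (identical partitions).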
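
A minimal plain-Python sketch of this formula follows. The helper names (`entropy`, `mutual_information`, `nmi_sketch`) are hypothetical, natural logarithms are used (the base cancels in the ratio), and returning 1.0 when both entropies are zero is an assumed convention. scikit-learn's `normalized_mutual_info_score` covers this and other normalizations.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H of a labeling, in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """Mutual information I(U; V) between two labelings, in nats."""
    n = len(u)
    joint = Counter(zip(u, v))
    count_u = Counter(u)
    count_v = Counter(v)
    return sum(
        (c / n) * math.log(c * n / (count_u[a] * count_v[b]))
        for (a, b), c in joint.items()
    )


def nmi_sketch(u, v):
    """NMI(U, V) = 2 * I(U; V) / (H(U) + H(V))."""
    if len(u) != len(v) or not u:
        raise ValueError("need two non-empty labelings of the same length")
    h_u, h_v = entropy(u), entropy(v)
    if h_u + h_v == 0:
        return 1.0  # assumed convention: both labelings are a single cluster
    return 2 * mutual_information(u, v) / (h_u + h_v)


truth = [0, 0, 1, 1]
aligned = nmi_sketch(truth, [1, 1, 0, 0])      # same partition, renamed labels
independent = nmi_sketch(truth, [0, 1, 0, 1])  # labelings share no information
```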

**✅ When to Use:**
- You have ground truth labels and want to compare labelings.

**❌ When *Not* to Use:**
- Completely unsupervised scenarios.

**📌 Example:**  
Evaluating how well clustering aligns with known labels (e.g., digit labels in MNIST).

---

## 🔍 Summary Table

| Metric              | Type         | Needs Ground Truth? | Best For                         | Avoid When...                       |
|---------------------|--------------|----------------------|-----------------------------------|-------------------------------------|
| Silhouette Score    | Internal     | No                   | Evaluating cohesion/separation   | Clusters are oddly shaped           |
| Davies–Bouldin Index| Internal     | No                   | Comparing cluster quality         | High variance in cluster shape/size |
| Calinski–Harabasz   | Internal     | No                   | Fast, variance-based validation   | Non-numeric or categorical data     |
| Adjusted Rand Index | External     | Yes                  | Comparing with known labels       | No ground truth available           |
| Normalized MI       | External     | Yes                  | Label alignment                   | No labels present                   |
