# Clustering evaluation methods

Clustering evaluation methods assess the performance of clustering algorithms. These methods are generally categorized into **internal evaluation**, **external evaluation**, and **relative evaluation**.

---

### 1. **Internal Evaluation Methods**
These methods use information from the data itself to measure the quality of clustering without any ground truth labels.

#### (a) **Silhouette Coefficient**
- Measures the compactness and separation of clusters.
- Formula:
  
  $$ S(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} $$
  
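  - $a(i)$: Mean distance from point $i$ to the other points in its own cluster.
  - $b(i)$: Mean distance from point $i$ to the points of the nearest other cluster, i.e. the smallest such mean over all clusters that do not contain $i$.
- Range: $[-1, 1]$. Values near 1 indicate a well-placed point, values near 0 a point on the boundary between clusters, and negative values a point that is probably in the wrong cluster. The overall score is the mean of $S(i)$ over all points; higher values indicate better clustering.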
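
As a minimal sketch of the formula above, the following computes $S(i)$ for each point using only the Python standard library. The function name `silhouette_samples`, the Euclidean distance, and the toy data are assumptions made for illustration; the sketch also assumes at least two clusters and assigns 0 to a point that is alone in its cluster.

```python
import math

def silhouette_samples(points, labels):
    """Return S(i) for every point, using Euclidean distance."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Assumed convention: a singleton cluster scores 0.
            scores.append(0.0)
            continue
        # a(i): mean distance to the other members of the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return scores

# Toy data: two tight, well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
scores = silhouette_samples(points, labels)
mean_score = sum(scores) / len(scores)
print(round(mean_score, 3))  # about 0.802 for this toy data
```

Relabeling the same points as `[0, 1, 0, 1]` mixes the two groups and gives a negative mean score, which is the behavior the range description predicts.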

#### (b) **Davies-Bouldin Index (DBI)**
- Evaluates intra-cluster similarity and inter-cluster separation.
- Formula:

  $$ DBI = \frac{1}{N} \sum_{i=1}^{N} \max_{j \neq i} \left( \frac{s_i + s_j}{d_{i,j}} \right) $$
  
  - $N$: Number of clusters.
  - $s_i$: Average distance of points in cluster $i$ to that cluster's center.
  - $d_{i,j}$: Distance between the centers of clusters $i$ and $j$.
- Lower values indicate better clustering.
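
The index can be sketched directly from these definitions. The snippet below is an illustrative standard-library implementation, not a reference one: the name `davies_bouldin`, the use of centroids as cluster centers, and the Euclidean distance are assumptions, and it expects at least two clusters with distinct centers.

```python
import math

def davies_bouldin(points, labels):
    """Davies-Bouldin index with centroids as centers and Euclidean distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    keys = list(clusters)

    # Centroid of each cluster (coordinate-wise mean).
    centroids = {
        k: tuple(sum(coord) / len(clusters[k]) for coord in zip(*clusters[k]))
        for k in keys
    }
    # s_i: mean distance from the members of cluster i to its centroid.
    scatter = {
        k: sum(math.dist(p, centroids[k]) for p in clusters[k]) / len(clusters[k])
        for k in keys
    }

    total = 0.0
    for i in keys:
        # Worst-case (largest) similarity ratio of cluster i to any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in keys
            if j != i
        )
    return total / len(keys)

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
dbi = davies_bouldin(points, [0, 0, 1, 1])
print(dbi)  # 0.2: each scatter is 0.5 and the centroids are 5 apart
```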

#### (c) **Dunn Index**
- Measures the ratio of the minimum inter-cluster distance to the maximum intra-cluster distance.
- Formula:
  
  $$ \text{Dunn} = \frac{\min_{i \neq j} d(c_i, c_j)}{\max_k \delta(c_k)} $$
  
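  - $d(c_i, c_j)$: Distance between clusters $i$ and $j$, typically the smallest distance between a point in $c_i$ and a point in $c_j$; the distance between cluster centers is a common variant.
  - $\delta(c_k)$: Diameter of cluster $k$, i.e. the largest distance between two of its points.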
- Higher values indicate better clustering.
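
A small sketch of the ratio follows. It assumes one particular pair of definitions: the distance between two clusters is the smallest distance between their points, and the diameter is the largest distance between two points of the same cluster. Other choices (for example, distances between cluster centers) lead to different variants of the index. The name `dunn_index` is hypothetical, and the sketch needs at least two clusters and at least one cluster with two distinct points.

```python
import math
from itertools import combinations

def dunn_index(points, labels):
    """Dunn index: min between-cluster distance over max cluster diameter."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    groups = list(clusters.values())

    # Smallest distance between two points that lie in different clusters.
    min_separation = min(
        math.dist(p, q)
        for g1, g2 in combinations(groups, 2)
        for p in g1
        for q in g2
    )
    # Largest distance between two points that lie in the same cluster.
    max_diameter = max(
        (math.dist(p, q) for g in groups for p, q in combinations(g, 2)),
        default=0.0,
    )
    return min_separation / max_diameter

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
dunn = dunn_index(points, [0, 0, 1, 1])
print(dunn)  # 5.0: clusters are 5 apart and each has diameter 1
```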

---

### 2. **External Evaluation Methods**
These methods compare the clustering result to ground truth labels, if available.

#### (a) **Adjusted Rand Index (ARI)**
- Measures the similarity between the clustering result and ground truth, adjusted for chance.
- Formula:

  $$ ARI = \frac{\text{Index} - \text{Expected Index}}{\text{Max Index} - \text{Expected Index}} $$

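  - $\text{Index}$: Number of point pairs placed in the same cluster by both the clustering result and the ground truth. The expected and maximum values are computed for random labelings with the same cluster sizes.
- Range: At most 1, reached when the two partitions are identical up to relabeling. Values near 0 indicate agreement no better than chance, and negative values (bounded below by $-0.5$) indicate agreement worse than chance. Higher values indicate better clustering.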
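
One way to make the three terms concrete is the usual pair-counting form, sketched below with the standard library. Here the index counts pairs of points that share a cluster in both labelings, and the expected and maximum values are derived from the cluster sizes. The function name `adjusted_rand_index` and the return value of 1.0 in the degenerate case are assumptions of this sketch.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, pred_labels):
    """Pair-counting ARI computed from the contingency table."""
    n = len(true_labels)
    # Index: pairs that share a cluster in both labelings.
    index = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    # Pairs that share a cluster within each labeling separately.
    pairs_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())

    expected = pairs_true * pairs_pred / comb(n, 2)
    max_index = (pairs_true + pairs_pred) / 2
    if max_index == expected:
        # Degenerate case (e.g. every point in one cluster in both labelings);
        # returning 1.0 is a convention assumed for this sketch.
        return 1.0
    return (index - expected) / (max_index - expected)

truth = [0, 0, 1, 1]
print(adjusted_rand_index(truth, [1, 1, 0, 0]))  # 1.0: same partition, labels swapped
print(adjusted_rand_index(truth, [0, 1, 0, 1]))  # -0.5: no pair agrees
```

The first call shows that the score ignores the label names themselves; only the grouping matters.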

#### (b) **Normalized Mutual Information (NMI)**
- Measures the amount of information shared between the clustering result and ground truth.
- Formula:

  $$ NMI = \frac{2 \cdot I(U, V)}{H(U) + H(V)} $$

  - $I(U, V)$: Mutual information between the true labels ($U$) and cluster labels ($V$).
  - $H(U)$, $H(V)$: Entropies of $U$ and $V$.
- Range: $[0, 1]$; higher values indicate better clustering.
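
The formula can be evaluated from label counts alone. The sketch below is illustrative: the helper names `entropy`, `mutual_information`, and `nmi` are hypothetical, natural logarithms are used (the base cancels in the ratio), and returning 1.0 when both entropies are zero is an assumed convention.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H of a labeling, in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(u, v):
    """Mutual information I(U, V) between two labelings, in nats."""
    n = len(u)
    count_u, count_v = Counter(u), Counter(v)
    joint = Counter(zip(u, v))
    return sum(
        (c / n) * math.log(c * n / (count_u[a] * count_v[b]))
        for (a, b), c in joint.items()
    )

def nmi(u, v):
    """NMI normalized by the arithmetic mean of the two entropies."""
    denom = entropy(u) + entropy(v)
    if denom == 0:
        # Both labelings put everything in one cluster; assumed convention.
        return 1.0
    return 2 * mutual_information(u, v) / denom

truth = [0, 0, 1, 1]
same_partition = nmi(truth, [1, 1, 0, 0])   # identical grouping, swapped names
independent = nmi(truth, [0, 1, 0, 1])      # grouping carries no information
print(same_partition, independent)
```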

#### (c) **Fowlkes-Mallows Index (FMI)**
- Measures the geometric mean of precision and recall for clustering.
- Formula:

  $$ FMI = \sqrt{\frac{TP}{TP + FP} \cdot \frac{TP}{TP + FN}} $$

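  - $TP$: Number of point pairs that share a cluster in both the ground truth and the clustering result.
  - $FP$: Number of point pairs that share a cluster in the clustering result but not in the ground truth.
  - $FN$: Number of point pairs that share a cluster in the ground truth but not in the clustering result.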
- Range: $[0, 1]$; higher values indicate better clustering.
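
In clustering, these counts are taken over pairs of points rather than over individual points. The following sketch enumerates all pairs explicitly, which is quadratic in the number of points and meant only to illustrate the definition; the name `fowlkes_mallows` and the return value of 0.0 when no pair agrees are assumptions.

```python
import math
from itertools import combinations

def fowlkes_mallows(true_labels, pred_labels):
    """FMI from explicit pair counts (O(n^2), for illustration only)."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1   # pair together in both labelings
        elif same_pred:
            fp += 1   # pair together only in the clustering result
        elif same_true:
            fn += 1   # pair together only in the ground truth
    if tp == 0:
        return 0.0    # assumed convention when no pair agrees
    return math.sqrt(tp / (tp + fp) * tp / (tp + fn))

truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]
fmi = fowlkes_mallows(truth, pred)
# For this toy data: TP = 4, FP = 3, FN = 2, so FMI = sqrt(4/7 * 4/6).
print(round(fmi, 4))  # 0.6172
```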

---

### 3. **Relative Evaluation Methods**
These methods compare clustering results with varying parameters or algorithms to determine the best configuration.

#### (a) **Elbow Method**
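- Plots the within-cluster sum of squared distances (SSD) from points to their cluster centers against the number of clusters ($k$).
- SSD generally decreases as $k$ grows, so the "elbow point", where further increases in $k$ yield only small reductions, is taken as a heuristic choice of $k$.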
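
The elbow is usually read off a plot by eye. As a rough sketch of how it could be located automatically, the code below computes the SSD for a given labeling and then picks the $k$ with the largest discrete second difference of the SSD curve. Both helper names (`within_cluster_ssd`, `elbow_k`) and the second-difference rule are assumptions chosen for illustration, and the SSD values in the example are made-up numbers, not results from a real data set.

```python
def within_cluster_ssd(points, labels):
    """Sum of squared Euclidean distances from points to their cluster centroid."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(coord) / len(members) for coord in zip(*members)]
        total += sum(
            sum((x - m) ** 2 for x, m in zip(p, centroid)) for p in members
        )
    return total

def elbow_k(ks, ssd_values):
    """Pick the k with the sharpest bend (largest second difference) in SSD.

    Assumes ks are evenly spaced and sorted; the end points cannot be chosen.
    """
    best_k, best_bend = None, float("-inf")
    for idx in range(1, len(ks) - 1):
        bend = ssd_values[idx - 1] - 2 * ssd_values[idx] + ssd_values[idx + 1]
        if bend > best_bend:
            best_k, best_bend = ks[idx], bend
    return best_k

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
ssd_two_clusters = within_cluster_ssd(points, [0, 0, 1, 1])
print(ssd_two_clusters)  # 1.0

# Hypothetical SSD values for k = 1..5, flattening out after k = 3.
chosen_k = elbow_k([1, 2, 3, 4, 5], [100.0, 60.0, 12.0, 10.0, 9.0])
print(chosen_k)  # 3
```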

#### (b) **Gap Statistic**
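- Compares the within-cluster dispersion obtained for each $k$ with the dispersion expected under a null reference distribution that has no cluster structure (for example, points drawn uniformly over the range of the data).
- The $k$ at which the observed dispersion falls furthest below the reference value is taken as the estimated number of clusters.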
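- Formula (a sketch in the notation commonly used for this statistic; the symbols $W_k$, $W_{k,b}^{*}$, and $B$ are introduced here for illustration):

  $$ \text{Gap}(k) = \frac{1}{B} \sum_{b=1}^{B} \log W_{k,b}^{*} - \log W_k $$

  - $W_k$: Within-cluster dispersion of the data for $k$ clusters, for example the SSD used in the Elbow Method.
  - $W_{k,b}^{*}$: The same quantity computed on the $b$-th of $B$ data sets sampled from the reference distribution.
- A commonly used selection rule takes the smallest $k$ with $\text{Gap}(k) \geq \text{Gap}(k+1) - s_{k+1}$, where $s_{k+1}$ is the standard deviation of $\log W_{k+1,b}^{*}$ over the reference sets, multiplied by $\sqrt{1 + 1/B}$.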

#### (c) **Silhouette Analysis**
- Computes the mean Silhouette Coefficient for different numbers of clusters and selects the $k$ with the highest value.

---

### Choosing an Evaluation Method
- Use **internal metrics** if no ground truth labels are available.
- Use **external metrics** if ground truth labels are known.
- Use **relative metrics** to tune parameters like the number of clusters.

