
# Evaluation of K-Means Clustering

This notebook evaluates the quality of clustering results using **three commonly used metrics**:
1. **Silhouette Score**
2. **Inertia (Within-Cluster Sum of Squares)**
3. **Davies–Bouldin Index**

We will apply these metrics to synthetic clustered data generated with `make_blobs`.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

sns.set_theme(style="whitegrid")

In [None]:
# Generate synthetic data
data, _ = make_blobs(n_samples=300, centers=4, cluster_std=2.0, random_state=42)

# Fit KMeans
# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
kmeans.fit(data)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

In [None]:
# Visualize clustering result
plt.figure(figsize=(8, 6))
for i in range(4):
    plt.scatter(data[labels == i, 0], data[labels == i, 1], label=f'Cluster {i+1}')
plt.scatter(centroids[:, 0], centroids[:, 1], c='black', marker='X', s=200, label='Centroids')
plt.title("Cluster Visualization")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()

In [None]:
# Evaluation metrics
silhouette = silhouette_score(data, labels)
inertia = kmeans.inertia_
db_index = davies_bouldin_score(data, labels)

print(f"Silhouette Score: {silhouette:.3f}")
print(f"Inertia (WSS): {inertia:.2f}")
print(f"Davies–Bouldin Index: {db_index:.3f}")


## Evaluation Metrics Explained

### 1. **Silhouette Score**
- **Range**: [-1, 1] (higher is better)
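- **Interpretation**: For each point, compares the mean distance to the other points in its own cluster (a) with the mean distance to the points in the nearest other cluster (b), as (b - a) / max(a, b). The reported score is the average over all points; values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.
- **Ideal Use**: Works with any distance-based clustering and needs no ground-truth labels. The per-point values can also be plotted to inspect individual clusters visually.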
- Think of it like: "How confident are we that this item belongs in this group?"
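
To make the definition concrete, the sketch below recomputes the per-point silhouette value, s = (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the mean distance to the points in the nearest other cluster. It assumes Euclidean distance and the same synthetic data as above; `manual_silhouette` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_samples, silhouette_score

# Same synthetic setup as the notebook (assumed here so the cell is self-contained)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=2.0, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)


def manual_silhouette(X, labels):
    """Illustrative helper: per-sample silhouette with Euclidean distance."""
    dist = pairwise_distances(X)  # full n x n distance matrix
    unique = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton clusters get a score of 0 by convention
        # a: mean distance to the *other* members of the same cluster
        # (dist[i, i] is 0, so dividing by n_own - 1 excludes the point itself)
        a = dist[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == k].mean() for k in unique if k != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores


manual = manual_silhouette(X, labels)
print(f"Manual mean silhouette:   {manual.mean():.3f}")
print(f"sklearn silhouette_score: {silhouette_score(X, labels):.3f}")
```

The explicit loop is quadratic in the number of samples and is only meant to show the arithmetic; for real use, `silhouette_samples` and `silhouette_score` are the better choice.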

### 2. **Inertia (Within-Cluster Sum of Squares)**
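- **Range**: [0, ∞) (lower is better when comparing results with the same number of clusters)
- **Interpretation**: Sum of squared distances from each sample to its nearest cluster center. Measures compactness.
- **Ideal Use**: Useful for the elbow method of choosing the number of clusters. Because inertia generally keeps decreasing as clusters are added, look for the point where the improvement levels off rather than for the minimum.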
- Think of it like: "Are the people in each team sitting close together?"
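
As a quick check of what inertia measures, the sketch below recomputes it as the sum of squared Euclidean distances from each sample to its assigned centroid and compares the result with `kmeans.inertia_`. The data setup mirrors the notebook; the variable names are chosen here for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same synthetic setup as the notebook (assumed here so the cell is self-contained)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=2.0, random_state=42)
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# Look up the centroid each sample was assigned to
assigned_centroids = km.cluster_centers_[km.labels_]

# Squared Euclidean distance of every sample to its centroid, then the total
squared_dists = ((X - assigned_centroids) ** 2).sum(axis=1)
manual_inertia = squared_dists.sum()

print(f"Manual inertia:  {manual_inertia:.2f}")
print(f"kmeans.inertia_: {km.inertia_:.2f}")
```

Because the distances are squared, inertia depends on the scale of the features, so values are not comparable across datasets or across differently scaled versions of the same data.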

### 3. **Davies–Bouldin Index**
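- **Range**: [0, ∞) (lower is better)
- **Interpretation**: For each cluster, takes the worst-case ratio of within-cluster scatter (summed over that cluster and one other) to the distance between the two centroids, then averages these ratios over all clusters.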
- **Ideal Use**: Suitable for automated evaluation; no ground truth needed.
- Think of it like: "Are groups distinct and not overlapping?"
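
The ratio described above can be sketched directly in NumPy. The version below assumes the usual formulation: the scatter of a cluster is the mean Euclidean distance of its points to the cluster mean, and separation is the Euclidean distance between cluster means. `manual_davies_bouldin` is a hypothetical helper for illustration, compared against scikit-learn's `davies_bouldin_score`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Same synthetic setup as the notebook (assumed here so the cell is self-contained)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=2.0, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)


def manual_davies_bouldin(X, labels):
    """Illustrative helper: mean over clusters of the worst scatter/separation ratio."""
    ks = np.unique(labels)
    # Cluster means, computed from the data rather than taken from the model
    centroids = np.array([X[labels == k].mean(axis=0) for k in ks])
    # Scatter: mean distance of each cluster's points to its own centroid
    scatter = np.array([
        np.linalg.norm(X[labels == k] - c, axis=1).mean()
        for k, c in zip(ks, centroids)
    ])
    # Separation: pairwise distances between centroids
    sep = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    np.fill_diagonal(sep, np.inf)  # makes the i == j ratio 0 so it is never the max
    ratios = (scatter[:, None] + scatter[None, :]) / sep
    # Worst (largest) ratio per cluster, averaged over clusters
    return ratios.max(axis=1).mean()


manual_db = manual_davies_bouldin(X, labels)
print(f"Manual Davies-Bouldin:  {manual_db:.3f}")
print(f"sklearn Davies-Bouldin: {davies_bouldin_score(X, labels):.3f}")
```

A value of 0 would require every cluster to have zero scatter, which is why it serves as the lower bound.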

### Summary: When to Use Which?
| Metric               | What It Measures                | Use It To...                                               |
|----------------------|---------------------------------|------------------------------------------------------------|
| **Silhouette Score** | Separation & Cohesion           | Measure the overall clarity of the clusters                |
| **Inertia (WSS)**    | Cluster Compactness             | Find the best number of clusters using the "elbow method"  |
| **DB Index**         | Ratio of compactness/separation | Quickly check how well-separated your clusters are         |
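
One way to put the three metrics side by side is to refit K-Means over a range of cluster counts and tabulate the results. The sketch below does this for k from 2 to 8; that range is an arbitrary choice for illustration, and the data setup mirrors the notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Same synthetic setup as the notebook (assumed here so the cell is self-contained)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=2.0, random_state=42)

rows = []
for k in range(2, 9):  # silhouette and Davies-Bouldin both need at least 2 clusters
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    rows.append({
        "k": k,
        "inertia": km.inertia_,                                    # lower is better
        "silhouette": silhouette_score(X, km.labels_),             # higher is better
        "davies_bouldin": davies_bouldin_score(X, km.labels_),     # lower is better
    })

results = pd.DataFrame(rows).set_index("k")
print(results.round(3))
```

Plotting `results["inertia"]` against k gives the usual elbow curve, while the silhouette and Davies–Bouldin columns can be read directly: look for a k where the former peaks and the latter dips.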
