# 🌀 K-Means Clustering

## Intuition

K-Means partitions data into *K* groups so that:
- points within the same cluster are **as close as possible** to their cluster's center,
- which, for a fixed dataset, is equivalent to making the cluster centers **as spread out as possible**.

Each cluster is represented by its **centroid**, the mean of its members.

---

## Objective Function

We minimize the total **within-cluster sum of squares (WCSS)**:

$$
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2
$$

where:
- \( C_k \) = set of points in cluster \(k\)
- \( \mu_k \) = centroid of cluster \(k\)
- \( \|x_i - \mu_k\|^2 \) = squared Euclidean distance

Goal: find centroids \( \{\mu_k\}_{k=1}^{K} \) that minimize \(J\).
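
To make the objective concrete, here is a minimal NumPy sketch that evaluates \(J\) for a given assignment. The function name `wcss` and the toy data are illustrative assumptions, not part of any library.

```python
import numpy as np

def wcss(X, labels, centroids):
    """Total within-cluster sum of squares J for a given assignment."""
    # Difference between every point and the centroid of its assigned cluster.
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))

# Tiny hand-checkable example: two clusters on a line.
X = np.array([[0.0], [2.0], [10.0], [12.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.0], [11.0]])

# Every point is exactly 1 unit from its centroid, so J = 4 * 1^2 = 4.0.
J = wcss(X, labels, centroids)
```

Moving either centroid away from the mean of its two points would increase `J`, which is why the update step uses the mean.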

---

## Lloyd’s Algorithm

Iterate until convergence:

1. **Assignment step (E-step)**
   Assign each sample to the nearest centroid:
   $$
   c_i = \arg\min_k \|x_i - \mu_k\|^2
   $$

2. **Update step (M-step)**
   Recompute each centroid as the mean of its assigned points:
   $$
   \mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
   $$

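Neither step can increase \(J\), and there are only finitely many ways to assign the points to \(K\) clusters, so the algorithm terminates after a finite number of iterations.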
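
The two steps can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the initial centroids are passed in and that an empty cluster simply keeps its previous centroid; the function name `lloyd` and the `history` list are hypothetical helpers added to track \(J\) across iterations.

```python
import numpy as np

def lloyd(X, init_centroids, max_iter=100):
    """Plain Lloyd iterations; returns centroids, labels, and the J history."""
    centroids = init_centroids.astype(float).copy()
    labels = None
    history = []
    for _ in range(max_iter):
        # Assignment step: squared distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable, so the centroids would not move either
        labels = new_labels
        # Update step: mean of assigned points (assumption: keep the old
        # centroid when a cluster has no members).
        for k in range(len(centroids)):
            members = X[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
        # Record J after this full assignment + update round.
        history.append(float(((X - centroids[labels]) ** 2).sum()))
    return centroids, labels, history

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
init = X[rng.choice(len(X), 2, replace=False)]
centroids, labels, history = lloyd(X, init)
```

The values in `history` should never go up from one iteration to the next, which is the monotonicity property used in the convergence argument.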

---

## Convergence and Boundaries

After an update step moves the centroids, points near a cluster boundary can end up closer to a different centroid.
In the next assignment step, they are simply **reassigned** to that nearer cluster,
and centroids update again accordingly.

The algorithm stops when assignments no longer change.
At that point it has reached a **local minimum** of \(J\), which is not necessarily the global one.
To reduce the risk of a poor local minimum, use **K-Means++** initialization or multiple random restarts (`n_init`).
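
The restart idea can be sketched as follows: run Lloyd's algorithm from several random initializations and keep the run with the lowest \(J\). The helper names `run_kmeans` and `best_of_restarts` are hypothetical; the `n_init` argument only mirrors the parameter name mentioned above.

```python
import numpy as np

def run_kmeans(X, K, rng, max_iter=100):
    """One Lloyd run from K random data points; returns (J, centroids, labels)."""
    centroids = X[rng.choice(len(X), K, replace=False)].astype(float)
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Assumption: an empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    J = float(((X - centroids[labels]) ** 2).sum())
    return J, centroids, labels

def best_of_restarts(X, K, n_init=10, seed=0):
    """Run n_init independent initializations and keep the lowest-cost run."""
    rng = np.random.default_rng(seed)
    runs = [run_kmeans(X, K, rng) for _ in range(n_init)]
    best = min(runs, key=lambda r: r[0])
    return best, [r[0] for r in runs]

data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(c, 0.5, (40, 2))
               for c in [(0, 0), (5, 0), (0, 5), (5, 5)]])
(best_J, best_centroids, best_labels), all_J = best_of_restarts(X, K=4)
```

Inspecting `all_J` shows how much the final cost can vary between initializations on the same data.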

---

## K-Means++ Initialization

To pick smarter starting centroids:

1. Choose one data point uniformly at random as the first centroid.
2. For each remaining point, compute its squared distance to the **nearest** centroid chosen so far (here \(C\) denotes the set of chosen centroids, not a cluster):
   $$
   D(x_i)^2 = \min_{\mu \in C} \|x_i - \mu\|^2
   $$
3. Sample the next centroid from the data points with probability proportional to \(D(x_i)^2\):
   $$
   p_i = \frac{D(x_i)^2}{\sum_j D(x_j)^2}
   $$
4. Repeat until \(K\) centroids are chosen.
5. Proceed with normal Lloyd iterations.

✅ Each initial centroid is a **real data point** \(x_i\),
and the \(D^2\) weighting makes diverse, well-spread initial centers likely (though not guaranteed).
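
Steps 1 to 4 can be sketched as below. This is a minimal illustration, assuming the data contains at least \(K\) distinct points (otherwise the sampling weights would all be zero); the function name `kmeans_pp_init` is hypothetical.

```python
import numpy as np

def kmeans_pp_init(X, K, rng):
    """Pick K initial centroids from the rows of X using D^2 sampling."""
    n = len(X)
    # Step 1: first centroid uniformly at random.
    chosen = [int(rng.integers(n))]
    # Squared distance from each point to its nearest chosen centroid so far.
    d2 = ((X - X[chosen[0]]) ** 2).sum(axis=1)
    while len(chosen) < K:
        # Step 3: sample the next centroid with probability proportional to D(x)^2.
        # Already-chosen points have d2 = 0 and therefore cannot be picked again.
        probs = d2 / d2.sum()
        idx = int(rng.choice(n, p=probs))
        chosen.append(idx)
        # Step 2, done incrementally: only the new centroid can reduce
        # a point's distance to its nearest chosen centroid.
        d2 = np.minimum(d2, ((X - X[idx]) ** 2).sum(axis=1))
    return X[chosen].copy()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
init = kmeans_pp_init(X, K=5, rng=rng)
```

The returned array can be passed as the starting centroids for ordinary Lloyd iterations (step 5).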

---

## Mini-Batch K-Means

For large datasets, we can update centroids using only **small random batches** of samples.

For each mini-batch \(B\):

1. Assign each \(x_i \in B\) to its nearest centroid.
2. Update the corresponding centroid incrementally:

   $$
   \mu_k \leftarrow \mu_k + \eta (x_i - \mu_k)
   $$

If we set the learning rate \( \eta = \frac{1}{t} \),
where \(t\) is the number of points assigned to cluster \(k\) so far (including the current one),
then the update is **exactly equivalent** to maintaining the running mean of those points.

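**Proof:** by induction on \(t\). For \(t = 1\) the update gives \(\mu_k^{(1)} = x_1\). For \(t \ge 2\), assume \(\mu_k^{(t-1)}\) is the mean of the first \(t-1\) assigned points \(x_1, \dots, x_{t-1}\):

$$
\begin{aligned}
\mu_k^{(t)} &= \mu_k^{(t-1)} + \frac{1}{t}(x_t - \mu_k^{(t-1)}) \\
&= \left(1 - \frac{1}{t}\right)\mu_k^{(t-1)} + \frac{1}{t}x_t \\
&= \frac{t-1}{t} \cdot \frac{1}{t-1}\sum_{i=1}^{t-1} x_i + \frac{1}{t}x_t \\
&= \frac{1}{t}\sum_{i=1}^{t} x_i
\end{aligned}
$$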

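✅ So the incremental update keeps the **exact mean** of every point assigned to the centroid so far while avoiding full dataset passes. Because earlier points were assigned under older centroid positions, the result approximates full-batch K-Means rather than reproducing it.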
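
The update rule and the running-mean property can be checked numerically. The sketch below assumes per-cluster counts start at zero, so the first assigned point overwrites the initial centroid (the \(t = 1\) case); `minibatch_step` is a hypothetical helper, not a library function.

```python
import numpy as np

def minibatch_step(centroids, counts, batch):
    """One mini-batch update, in place, with per-cluster learning rate 1/count."""
    # Assign every batch point using the centroids as they are at batch start.
    d2 = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    for x, k in zip(batch, nearest):
        counts[k] += 1
        eta = 1.0 / counts[k]                     # learning rate 1/t
        centroids[k] += eta * (x - centroids[k])  # mu <- mu + eta * (x - mu)
    return centroids, counts

# With a single cluster, every point is assigned to it, so after streaming
# all batches the centroid should equal the plain mean of the data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
centroids = np.zeros((1, 2))
counts = np.zeros(1, dtype=int)
for start in range(0, len(X), 50):
    minibatch_step(centroids, counts, X[start:start + 50])

matches_mean = np.allclose(centroids[0], X.mean(axis=0))
```

With more than one cluster the same code applies, but each centroid then tracks the mean of the points that were routed to it over time, not the mean of its current cluster.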

---

## Choosing K

### Elbow Method

Compute total cost \(J_K\) for different values of \(K\):

$$
J_K = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2
$$

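Plot \(J_K\) vs \(K\).
\(J_K\) generally keeps decreasing as \(K\) increases (it reaches 0 when every distinct point has its own cluster), so its minimum is uninformative.
Instead, pick \(K\) at the **“elbow”**, where adding more clusters gives little improvement.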
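
A minimal sketch of the elbow computation, assuming synthetic data with three well-separated blobs and a small hypothetical helper `kmeans_cost` that returns the best \(J\) over a few random restarts:

```python
import numpy as np

def kmeans_cost(X, K, rng, n_init=5, max_iter=100):
    """Best J over a few random restarts of Lloyd's algorithm (sketch)."""
    best = np.inf
    for _run in range(n_init):
        centroids = X[rng.choice(len(X), K, replace=False)].astype(float)
        for _it in range(max_iter):
            d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Assumption: an empty cluster keeps its previous centroid.
            new = np.array([
                X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
                for k in range(K)
            ])
            if np.allclose(new, centroids):
                break
            centroids = new
        best = min(best, float(((X - centroids[labels]) ** 2).sum()))
    return best

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (60, 2)) for c in [(0, 0), (6, 0), (3, 5)]])

# J_K for K = 1..6; plot or print these and look for the bend.
costs = {K: kmeans_cost(X, K, rng) for K in range(1, 7)}
for K, J in costs.items():
    print(K, round(J, 1))
```

For data like this, the drop in cost is expected to be large up to the true number of blobs and small afterwards, but on real data the elbow is often less clear-cut.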

---

### Silhouette Method

For each point \(x_i\):

- \(a_i\): average distance to other points in its **own** cluster.
- \(b_i\): smallest average distance to points in any **other** cluster.

Then silhouette score:

$$
s_i = \frac{b_i - a_i}{\max(a_i, b_i)}
$$

- \(s_i \approx 1\): well-clustered
- \(s_i \approx 0\): near a boundary
- \(s_i < 0\): likely assigned to the wrong cluster

Overall silhouette = mean of \(s_i\).
Choose \(K\) that maximizes this value.
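
The definitions above translate directly into code. The sketch below assumes Euclidean distance, at least two clusters, and the common convention \(s_i = 0\) for a point that is alone in its cluster; it builds the full pairwise distance matrix, so it is only suitable for small datasets. The function name `silhouette_scores` is hypothetical.

```python
import numpy as np

def silhouette_scores(X, labels):
    """Per-point silhouette values s_i using Euclidean distance."""
    # Full pairwise distance matrix (O(n^2) memory).
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # convention: s_i = 0 for a singleton cluster
        # a_i: mean distance to the *other* members of the own cluster
        # (D[i, i] = 0, so dividing by n_own - 1 excludes the point itself).
        a = D[i, own].sum() / (n_own - 1)
        # b_i: smallest mean distance to the members of any other cluster.
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

X = np.array([[0.0], [1.0], [10.0], [11.0]])
good = np.array([0, 0, 1, 1])   # matches the two obvious groups
bad = np.array([0, 1, 0, 1])    # deliberately mixes the groups

s_good = silhouette_scores(X, good)
s_bad = silhouette_scores(X, bad)
```

Here `s_good.mean()` is close to 1 while `s_bad.mean()` is negative, matching the interpretation of the score given above.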

---

## Strengths & Weaknesses

| ✅ Strengths | ⚠️ Weaknesses |
|--------------|---------------|
| Fast and easy to implement | Assumes spherical clusters |
| Works well for large, dense datasets | Sensitive to initialization and outliers |
| Cost per iteration grows only linearly with samples, clusters, and dimensions (scale features first) | Requires predefined K |
| Interpretable centroids | Only finds convex clusters |

---

## When to Use

Use K-Means when:
- clusters are roughly compact, convex, and similar in size,
- distances are meaningful (e.g., Euclidean),
- and you want fast, scalable results.

Avoid K-Means for:
- elongated or non-convex shapes (use DBSCAN or spectral clustering instead),
- categorical or sparse data without a proper metric,
- or data with strong outliers.

---

## Summary

K-Means alternates between:
- **Assignment:** move points to the nearest centroid.
- **Update:** move centroids to the mean of assigned points.

K-Means++ improves initialization.
Mini-Batch lowers the cost per iteration on large datasets, at the price of an approximate solution.
Elbow and Silhouette help estimate a good K.
At convergence, every point is closest to its own centroid.



Elbow → “How much better does the fit get as I add more clusters?”

Silhouette → “Are clusters actually distinct and well-separated?”

In practice: start with Elbow for a rough range, then use Silhouette to fine-tune \(K\).