# Module 4 — Session 4 (K-Means)

These are my answers for the practical exercises.


## Exercise 1 — Trace K-Means (Conceptual)

We have points:

- A=(1,1), B=(1,2), C=(2,1)
- D=(5,4), E=(5,5), F=(6,4)

k = 2  
Initial centroids: **C1=(1,1)** and **C2=(5,4)**

I'll trace the first 2 iterations.


In [None]:
import numpy as np

points = {
    'A': (1, 1),
    'B': (1, 2),
    'C': (2, 1),
    'D': (5, 4),
    'E': (5, 5),
    'F': (6, 4),
}

C1 = np.array([1, 1], dtype=float)
C2 = np.array([5, 4], dtype=float)

def dist(p, c):
    p = np.array(p, dtype=float)
    return float(np.linalg.norm(p - c))

# Iteration 1: assignment step
rows = []
for name, p in points.items():
    d1 = dist(p, C1)
    d2 = dist(p, C2)
    assigned = 'C1' if d1 < d2 else 'C2'
    rows.append((name, p, d1, d2, assigned))

rows


**Iteration 1 (assignment result):**  
C1 gets A, B, C.  
C2 gets D, E, F.

**Iteration 1 (update step):** now compute the new centroids as the mean of each cluster's points.


In [None]:
import numpy as np

cluster1 = np.array([points['A'], points['B'], points['C']], dtype=float)
cluster2 = np.array([points['D'], points['E'], points['F']], dtype=float)

C1_new = cluster1.mean(axis=0)
C2_new = cluster2.mean(axis=0)

C1_new, C2_new


So after Iteration 1:

- **C1 = (4/3, 4/3) ≈ (1.33, 1.33)**
- **C2 = (16/3, 13/3) ≈ (5.33, 4.33)**
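
To double-check the fractions above, here is a small sketch that keeps the means exact with Python's `fractions.Fraction`. The helper name `exact_centroid` is my own and not part of the exercise.

```python
from fractions import Fraction

def exact_centroid(pts):
    """Return the coordinate-wise mean of 2-D points as exact fractions."""
    n = len(pts)
    return (Fraction(sum(x for x, _ in pts), n),
            Fraction(sum(y for _, y in pts), n))

c1_exact = exact_centroid([(1, 1), (1, 2), (2, 1)])   # A, B, C
c2_exact = exact_centroid([(5, 4), (5, 5), (6, 4)])   # D, E, F

print(c1_exact)  # (Fraction(4, 3), Fraction(4, 3))
print(c2_exact)  # (Fraction(16, 3), Fraction(13, 3))
```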

**Iteration 2:** re-assign using the new centroids.


In [None]:
C1 = C1_new
C2 = C2_new

rows2 = []
for name, p in points.items():
    d1 = dist(p, C1)
    d2 = dist(p, C2)
    assigned = 'C1' if d1 < d2 else 'C2'
    rows2.append((name, p, d1, d2, assigned))

rows2


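✅ The assignments **do not change** between Iteration 1 and Iteration 2 (A, B, C stay with C1; D, E, F stay with C2). Since the clusters are the same, the centroids would not move in the next update either, so K-Means has converged.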
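
The manual trace can also be written as one loop that alternates the assignment and update steps until nothing changes. This is only a minimal sketch in plain Python for the toy points; the function name `kmeans_trace`, the `max_iter` cap, and the handling of empty clusters are my own assumptions, not how scikit-learn does it.

```python
import math

def kmeans_trace(pts, centroids, max_iter=10):
    """Plain K-Means on a dict of named 2-D points.

    Returns (assignments, centroids, iteration at which it stopped).
    """
    assignments = None
    for iteration in range(1, max_iter + 1):
        # Assignment step: index of the nearest centroid for each point
        new_assignments = {
            name: min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
            for name, p in pts.items()
        }
        # Converged: no point changed cluster since the previous iteration
        if new_assignments == assignments:
            return assignments, centroids, iteration
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = []
        for i in range(len(centroids)):
            members = [pts[n] for n, c in assignments.items() if c == i]
            if not members:
                # Assumption: an empty cluster keeps its old centroid
                new_centroids.append(centroids[i])
                continue
            new_centroids.append(
                tuple(sum(coord) / len(members) for coord in zip(*members))
            )
        centroids = new_centroids
    return assignments, centroids, max_iter

toy_points = {'A': (1, 1), 'B': (1, 2), 'C': (2, 1),
              'D': (5, 4), 'E': (5, 5), 'F': (6, 4)}
final_assign, final_centroids, n_iter = kmeans_trace(toy_points, [(1, 1), (5, 4)])
print(final_assign)
print(final_centroids, n_iter)
```

With the initial centroids from this exercise, the loop should stop at the second iteration, which matches the trace above.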


## Exercise 2 — Implement K-Means (Scikit-learn)

Now I generate a synthetic dataset with `make_blobs`, train KMeans with k=4, and plot the result.


In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate synthetic data with 4 distinct clusters
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.7, random_state=42)

# Quick look at the raw (unlabeled) data
plt.figure()
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title('Raw Unlabeled Data')
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()


In [None]:
# Train K-Means with k=4 (we 'know' it's 4 here because we generated it that way)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
kmeans.fit(X)

labels = kmeans.labels_
centers = kmeans.cluster_centers_

# Plot clusters + centroids
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50)
plt.scatter(centers[:, 0], centers[:, 1], marker='*', s=400, c='red')
plt.title('K-Means Clustering Result (k=4)')
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()


## Exercise 3 — Elbow Method

Here I compute WCSS (inertia) for k from 1 to 10 and plot it.


In [None]:
wcss = []
K_range = range(1, 11)

for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    wcss.append(km.inertia_)

plt.figure()
plt.plot(list(K_range), wcss, marker='o')
plt.title('Elbow Method (WCSS / Inertia)')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS (inertia)')
plt.xticks(list(K_range))
plt.show()

# My eyeballing: the bend should be around k=4 for this dataset.
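
To make it concrete what inertia measures, here is a sketch that computes WCSS by hand on the six toy points from Exercise 1: the sum of squared distances from each point to the mean of its cluster. The helper `wcss_by_hand` is my own name; it is not a scikit-learn function.

```python
def wcss_by_hand(clusters):
    """Sum over clusters of squared distances from each point to its cluster mean."""
    total = 0.0
    for members in clusters:
        cx = sum(x for x, _ in members) / len(members)
        cy = sum(y for _, y in members) / len(members)
        total += sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in members)
    return total

abc = [(1, 1), (1, 2), (2, 1)]      # A, B, C
def_pts = [(5, 4), (5, 5), (6, 4)]  # D, E, F

wcss_k1 = wcss_by_hand([abc + def_pts])   # everything in one cluster
wcss_k2 = wcss_by_hand([abc, def_pts])    # the two clusters from Exercise 1
print(round(wcss_k1, 3), round(wcss_k2, 3))
```

Going from one cluster to two drops WCSS from about 40.17 to about 2.67 on the toy points. That kind of sharp drop followed by flattening is what the elbow plot is looking for.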


## Exercise 4 — Silhouette Method (Challenge)

### 4.1 Silhouette + Elbow on the same figure (with a right-side axis)

Silhouette score is between -1 and 1 (higher is better).
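
For one point, the silhouette is `(b - a) / max(a, b)`, where `a` is the mean distance to the other points in its own cluster and `b` is the smallest mean distance to the points of any other cluster. Below is a sketch of that definition on the toy points from Exercise 1, using Euclidean distance. `silhouette_by_hand` is my own helper and assumes every cluster has at least 2 points; I have not compared it against `silhouette_score` here.

```python
import math

def silhouette_by_hand(clusters):
    """Mean silhouette over all points; `clusters` is a list of lists of points."""
    scores = []
    for i, own in enumerate(clusters):
        for idx, p in enumerate(own):
            # a: mean distance to the other points in the same cluster
            others = own[:idx] + own[idx + 1:]
            a = sum(math.dist(p, q) for q in others) / len(others)
            # b: smallest mean distance to the points of any other cluster
            b = min(
                sum(math.dist(p, q) for q in other) / len(other)
                for j, other in enumerate(clusters) if j != i
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

toy_clusters = [
    [(1, 1), (1, 2), (2, 1)],   # A, B, C
    [(5, 4), (5, 5), (6, 4)],   # D, E, F
]
toy_silhouette = silhouette_by_hand(toy_clusters)
print(round(toy_silhouette, 3))
```

The two toy clusters are compact and far apart, so the score comes out well above 0.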


In [None]:
import numpy as np
from sklearn.metrics import silhouette_score

wcss = []
sil_scores = []
K_range = range(2, 11)  # silhouette needs at least 2 clusters

for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    wcss.append(km.inertia_)
    sil_scores.append(silhouette_score(X, km.labels_))

fig, ax1 = plt.subplots()

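# Left axis: WCSS. Explicit colors, because both axes would otherwise use the same default color.
ax1.plot(list(K_range), wcss, marker='o', color='tab:blue')
ax1.set_title('Elbow + Silhouette')
ax1.set_xlabel('Number of clusters (k)')
ax1.set_ylabel('WCSS (inertia)', color='tab:blue')

# Right axis: silhouette score, sharing the same x-axis
ax2 = ax1.twinx()
ax2.plot(list(K_range), sil_scores, marker='s', color='tab:orange')
ax2.set_ylabel('Silhouette score', color='tab:orange')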

ax1.set_xticks(list(K_range))
plt.show()

best_k = list(K_range)[int(np.argmax(sil_scores))]
best_k


If both methods agree, the elbow should fall at about the same k where the silhouette score peaks (`best_k` above).


### 4.2 Make it harder for Elbow (more overlap), but silhouette still helps

If I increase `cluster_std`, the blobs overlap more.  
Then inertia tends to decrease more smoothly, so the elbow can be hard to pin down, while silhouette often still shows a clearer peak.

(You can tweak `cluster_std` below to see how it changes.)


In [None]:
# Try a more overlapped dataset
X2, _ = make_blobs(
    n_samples=400,
    centers=3,
    cluster_std=2.2,   # <- bigger std = more overlap
    random_state=7
)

plt.figure()
plt.scatter(X2[:, 0], X2[:, 1], s=30)
plt.title('More Overlapped Data (harder case)')
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()


In [None]:
# Elbow + Silhouette again
wcss2 = []
sil2 = []
K_range2 = range(2, 11)

for k in K_range2:
    km = KMeans(n_clusters=k, random_state=7, n_init=10)
    km.fit(X2)
    wcss2.append(km.inertia_)
    sil2.append(silhouette_score(X2, km.labels_))

fig, ax1 = plt.subplots()

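# Left axis: WCSS. Explicit colors, because both axes would otherwise use the same default color.
ax1.plot(list(K_range2), wcss2, marker='o', color='tab:blue')
ax1.set_title('Overlapped Data: Elbow + Silhouette')
ax1.set_xlabel('Number of clusters (k)')
ax1.set_ylabel('WCSS (inertia)', color='tab:blue')
ax1.set_xticks(list(K_range2))

# Right axis: silhouette score, sharing the same x-axis
ax2 = ax1.twinx()
ax2.plot(list(K_range2), sil2, marker='s', color='tab:orange')
ax2.set_ylabel('Silhouette score', color='tab:orange')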

plt.show()

best_k2 = list(K_range2)[int(np.argmax(sil2))]
best_k2


My result here is the `best_k2` shown above, i.e. the k with the highest silhouette score.  
If the elbow curve is not very sharp, the silhouette peak is usually easier to justify as the choice of k.
