# Module 4 — Session 4: Practical Exercises (K-Means)

This notebook follows the instructions from your **M4, S4_ Practical Exercises** file.


## Exercise 1: Trace K-Means (Conceptual)

Points:
- A=(1,1), B=(1,2), C=(2,1), D=(5,4), E=(5,5), F=(6,4)

k = 2, initial centroids:
- C1 = (1,1)
- C2 = (5,4)

Trace **first two iterations**:
- Iteration 1: assignment step + update step (new centroids)
- Iteration 2: assignment step + update step
- Did assignments change?


### (Optional) Helper: distance calculator

You can use this code to compute distances quickly.


In [None]:
import numpy as np

points = {
    "A": np.array([1, 1]),
    "B": np.array([1, 2]),
    "C": np.array([2, 1]),
    "D": np.array([5, 4]),
    "E": np.array([5, 5]),
    "F": np.array([6, 4]),
}

C1 = np.array([1, 1])
C2 = np.array([5, 4])

def dist(p, c):
    return np.linalg.norm(p - c)

for name, p in points.items():
    print(name, "d(C1)=", round(dist(p, C1), 3), " d(C2)=", round(dist(p, C2), 3))
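
The two iterations can also be traced in code to check a hand calculation. The cell below is a minimal sketch using NumPy only; the names `pts`, `centroids`, and `history` are illustrative and are not part of the exercise file.

```python
import numpy as np

# Points and initial centroids from Exercise 1.
names = ["A", "B", "C", "D", "E", "F"]
pts = np.array([[1, 1], [1, 2], [2, 1], [5, 4], [5, 5], [6, 4]], dtype=float)
centroids = np.array([[1, 1], [5, 4]], dtype=float)

history = []  # one (labels, centroids) pair per iteration
for it in range(1, 3):
    # Assignment step: distance from every point to every centroid, shape (6, 2).
    d = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Update step: each centroid becomes the mean of the points assigned to it.
    centroids = np.array([pts[labels == j].mean(axis=0) for j in range(2)])
    history.append((labels.copy(), centroids.copy()))
    print(f"Iteration {it}: clusters={dict(zip(names, (labels + 1).tolist()))}")
    print(f"  centroids={centroids.round(3).tolist()}")

# Compare the assignments of iteration 1 and iteration 2.
changed = not np.array_equal(history[0][0], history[1][0])
print("Assignments changed between iterations:", changed)
```

If the assignments are identical in both iterations, the centroids stop moving and the algorithm has converged on this data.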


## Exercise 2: Implement K-Means (Coding)

Generate synthetic data with 4 clusters using `make_blobs`, visualize it, train KMeans with `n_clusters=4`,
then plot colored clusters and centroids.


In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans


In [None]:
# Generate synthetic data with 4 distinct clusters
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.7, random_state=42)

# Visualize the generated data
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title('Raw Unlabeled Data')
plt.show()


In [None]:
# Train K-Means
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(X)

# Visualize clustered results and centroids
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, s=50)
plt.scatter(
    kmeans.cluster_centers_[:, 0],
    kmeans.cluster_centers_[:, 1],
    s=200,
    marker='*',
    c='red'  # distinct colour so the centroids stand out from the points
)
plt.title('K-Means Clusters (k=4)')
plt.show()


## Exercise 3: The Elbow Method (Coding)

Compute WCSS (inertia) for k=1..10 and plot it.  
Answer in Markdown: **Where is the elbow? Does it suggest k=4?**


In [None]:
wcss = []
k_values = range(1, 11)

for k in k_values:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)

plt.plot(list(k_values), wcss, marker='o')  # markers make each k easy to read off
plt.title('Elbow Method')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('WCSS / Inertia')
plt.show()
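
To make clear what the y-axis of the elbow plot measures, WCSS can be computed by hand as the sum of squared distances from each point to its assigned centre. The sketch below uses the six points from Exercise 1 as toy data; `wcss_by_hand` and the `toy_*` names are hypothetical helpers, not part of scikit-learn.

```python
import numpy as np

def wcss_by_hand(X, labels, centers):
    """Sum of squared Euclidean distances from each point to its assigned centre."""
    diffs = X - centers[labels]  # centers[labels] picks the centre of each point
    return float((diffs ** 2).sum())

# Toy data: the six points from Exercise 1, split into their two obvious groups.
toy_X = np.array([[1, 1], [1, 2], [2, 1], [5, 4], [5, 5], [6, 4]], dtype=float)
toy_labels = np.array([0, 0, 0, 1, 1, 1])
toy_centers = np.array([toy_X[toy_labels == j].mean(axis=0) for j in range(2)])
wcss_k2 = wcss_by_hand(toy_X, toy_labels, toy_centers)

# k = 1 for comparison: a single centre at the overall mean.
one_center = toy_X.mean(axis=0, keepdims=True)
wcss_k1 = wcss_by_hand(toy_X, np.zeros(len(toy_X), dtype=int), one_center)

print("WCSS for k=1:", round(wcss_k1, 3))
print("WCSS for k=2:", round(wcss_k2, 3))
```

For a fitted model, `wcss_by_hand(X, km.labels_, km.cluster_centers_)` should agree with `km.inertia_` up to floating-point rounding. WCSS falls as k grows, which is why the elbow method looks for the point where the drop flattens, not for a minimum.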


## Exercise 4: The Silhouette Method (Challenge)

### 4.1
Implement the Silhouette Method and plot silhouette values on the same figure as the Elbow plot using a secondary (right) y-axis.
Then answer:
- Are the two plots coherent in the choice of optimal k?
- Does k=4 still seem optimal?

### 4.2
Regenerate blobs with different parameters (notably `cluster_std`) such that:
- Elbow method is ambiguous
- Silhouette method yields a clear result


In [None]:
from sklearn.metrics import silhouette_score


In [None]:
# 4.1 — Elbow + Silhouette on the same plot (secondary y-axis)
wcss = []
sil_scores = []
k_values = range(2, 11)  # silhouette is defined for k>=2

for k in k_values:
    km = KMeans(n_clusters=k, random_state=42)
    labels = km.fit_predict(X)
    wcss.append(km.inertia_)
    sil_scores.append(silhouette_score(X, labels))

fig, ax1 = plt.subplots()
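ax1.plot(list(k_values), wcss, color='tab:blue', marker='o')
ax1.set_xlabel('Number of Clusters (k)')
ax1.set_ylabel('WCSS / Inertia', color='tab:blue')

# Each axis restarts the colour cycle, so set distinct colours to tell the curves apart
ax2 = ax1.twinx()
ax2.plot(list(k_values), sil_scores, color='tab:orange', marker='s')
ax2.set_ylabel('Silhouette Score', color='tab:orange')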

plt.title('Elbow + Silhouette')
plt.show()

list(zip(k_values, sil_scores))[:5], (max(sil_scores), k_values[np.argmax(sil_scores)])
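
For reference, the silhouette coefficient of a point is `(b - a) / max(a, b)`, where `a` is its mean distance to the other points in its own cluster and `b` is the smallest mean distance to the points of any other cluster. The sketch below implements that definition with NumPy on the six points from Exercise 1. `silhouette_by_hand` is a hypothetical helper that assumes every cluster has at least two points; it is not how scikit-learn implements `silhouette_score`.

```python
import numpy as np

def silhouette_by_hand(X, labels):
    """Mean silhouette coefficient, assuming every cluster has at least two points."""
    # Pairwise Euclidean distances, shape (n, n).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        # a: mean distance to the other members of the same cluster (dist[i, i] is 0).
        a = dist[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

toy_X = np.array([[1, 1], [1, 2], [2, 1], [5, 4], [5, 5], [6, 4]], dtype=float)
good_labels = np.array([0, 0, 0, 1, 1, 1])  # the two natural groups
bad_labels = np.array([0, 1, 0, 1, 0, 1])   # deliberately mixed assignment

sil_good = silhouette_by_hand(toy_X, good_labels)
sil_bad = silhouette_by_hand(toy_X, bad_labels)
print("natural grouping:", round(sil_good, 3))
print("mixed grouping:  ", round(sil_bad, 3))
```

Scores close to 1 indicate compact, well-separated clusters, so the preferred k is the one that maximizes the mean score. The elbow plot, in contrast, is read by looking for a bend.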


In [None]:
# 4.2 — Try new blob parameters where elbow is ambiguous but silhouette is clearer
X2, y2 = make_blobs(n_samples=300, centers=4, cluster_std=2.3, random_state=42)

# Compute both curves again
wcss2 = []
sil2 = []
k_values2 = range(2, 11)

for k in k_values2:
    km = KMeans(n_clusters=k, random_state=42)
    labels = km.fit_predict(X2)
    wcss2.append(km.inertia_)
    sil2.append(silhouette_score(X2, labels))

fig, ax1 = plt.subplots()
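ax1.plot(list(k_values2), wcss2, color='tab:blue', marker='o')
ax1.set_xlabel('Number of Clusters (k)')
ax1.set_ylabel('WCSS / Inertia', color='tab:blue')

# Each axis restarts the colour cycle, so set distinct colours to tell the curves apart
ax2 = ax1.twinx()
ax2.plot(list(k_values2), sil2, color='tab:orange', marker='s')
ax2.set_ylabel('Silhouette Score', color='tab:orange')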

plt.title('Elbow + Silhouette (More Overlap)')
plt.show()

(max(sil2), k_values2[np.argmax(sil2)])
