# Debug Drill 07: The Wrong K

**Symptom:** Your colleague ran K-Means with K=20 clusters. Marketing says: "These segments don't make sense - cluster 7 and cluster 12 look identical!"

**Your task:** Find the right K, validate the clusters, and write a postmortem.

**Time:** 15 minutes

In [None]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')  # keep drill output readable; this hides all library warnings

In [None]:
# Load data
df = pd.read_csv('https://raw.githubusercontent.com/189investmentai/ml-foundations-interactive/main/data/streamcart_customers.csv')

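cluster_features = ['tenure_months', 'logins_last_30d', 'orders_last_30d', 'support_tickets_last_30d']
# Missing values are filled with 0 (this treats a missing value as "no activity")
X = df[cluster_features].fillna(0)

# Standardize so every feature contributes on the same scale to K-Means distances
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)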

In [None]:
# ===== COLLEAGUE'S CODE (CONTAINS BUG) =====

# "More clusters = more granular = better!"
k = 20
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
df['cluster'] = kmeans.fit_predict(X_scaled)

print(f"Created {k} clusters")
print("\nCluster sizes:")
print(df['cluster'].value_counts().sort_index())

In [None]:
# Look at cluster profiles
profiles = df.groupby('cluster')[cluster_features].mean().round(1)
print("\nCluster profiles (averages):")
print(profiles)

## Your Investigation

**Q1:** Look at the cluster profiles. Can you spot clusters that look nearly identical?

In [None]:
# TODO: Identify similar clusters
# Compare tenure_months, logins, orders across clusters
# Which clusters are redundant?
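
One way to make Q1 less of an eyeballing exercise is to rank pairs of cluster centroids by how close they are. The sketch below is a hint rather than the official solution: `closest_centroid_pairs` is a hypothetical helper, and the toy centroids are made-up values chosen so that two rows nearly coincide.

```python
import numpy as np

def closest_centroid_pairs(centers, top_n=3):
    """Return the top_n closest centroid pairs as (distance, i, j) tuples."""
    centers = np.asarray(centers, dtype=float)
    # Pairwise Euclidean distances via broadcasting: result has shape (k, k)
    diffs = centers[:, None, :] - centers[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Keep each unordered pair once (upper triangle, diagonal excluded)
    i_idx, j_idx = np.triu_indices(len(centers), k=1)
    order = np.argsort(dists[i_idx, j_idx])[:top_n]
    return [(float(dists[i_idx[o], j_idx[o]]), int(i_idx[o]), int(j_idx[o]))
            for o in order]

# Toy centroids (made-up values): rows 0 and 2 are nearly identical
toy_centers = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [2.0, -1.0, 0.5, 1.0],
    [0.05, 0.0, -0.05, 0.0],
    [-1.5, 2.0, 1.0, -1.0],
])

pairs = closest_centroid_pairs(toy_centers, top_n=2)
for dist, i, j in pairs:
    print(f"clusters {i} and {j}: distance {dist:.2f}")
```

In this notebook you could pass `kmeans.cluster_centers_` instead of the toy array. Those centroids live in the scaled feature space, so the distances are in standard-deviation units; pairs with a distance close to zero are candidates for the "identical" segments Marketing complained about.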

**Q2:** Use the Elbow Method to find a better K. Look for the point where adding another cluster stops reducing inertia by much.

In [None]:
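# Elbow method: fit K-Means for a range of K values and record the inertia of each fit
inertias = []
k_range = range(2, 15)

# Use a separate loop variable so the colleague's `k` is not overwritten
for k_candidate in k_range:
    km = KMeans(n_clusters=k_candidate, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertias.append(km.inertia_)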

plt.figure(figsize=(10, 5))
plt.plot(k_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method - Find the Bend')
plt.grid(True, alpha=0.3)
plt.show()

# TODO: What K shows the "elbow"?
# Your answer: 
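
Reading an elbow off a plot is subjective, so it can help to cross-check your answer with a simple numeric heuristic. The sketch below picks the K whose point lies farthest from the straight line joining the first and last points of the curve. It is one heuristic among several, not a definitive rule; `elbow_by_chord_distance` is a hypothetical helper, and the inertia values are made up for illustration.

```python
import numpy as np

def elbow_by_chord_distance(ks, inertias):
    """Return the K farthest from the line joining the curve's endpoints."""
    ks = np.asarray(ks, dtype=float)
    inertias = np.asarray(inertias, dtype=float)
    # Normalize both axes to [0, 1] so neither dominates the distance
    x = (ks - ks.min()) / (ks.max() - ks.min())
    y = (inertias - inertias.min()) / (inertias.max() - inertias.min())
    # Perpendicular distance from each point to the line through the endpoints
    x0, y0, x1, y1 = x[0], y[0], x[-1], y[-1]
    numerator = np.abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0)
    dist = numerator / np.hypot(x1 - x0, y1 - y0)
    return int(ks[np.argmax(dist)])

# Made-up inertia curve with a sharp bend at K=4
toy_ks = list(range(2, 11))
toy_inertias = [1000, 600, 300, 260, 230, 210, 195, 185, 180]

elbow_k = elbow_by_chord_distance(toy_ks, toy_inertias)
print(f"Suggested elbow: K={elbow_k}")
```

With the notebook's own variables the call would be `elbow_by_chord_distance(list(k_range), inertias)`. Treat the result as a second opinion on the plot: if the curve has no clear bend, the heuristic will still return some K, and that K deserves skepticism.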

## Fix the Bug

**Q3:** Rerun with the correct K and profile the clusters.

In [None]:
# TODO: Choose a better K based on your elbow analysis
k_optimal = 4  # placeholder; change it if your elbow plot suggests a different K

kmeans_fixed = KMeans(n_clusters=k_optimal, random_state=42, n_init=10)
df['cluster_fixed'] = kmeans_fixed.fit_predict(X_scaled)

print(f"\n=== {k_optimal} CLUSTERS ===")
print("\nCluster sizes:")
print(df['cluster_fixed'].value_counts().sort_index())

print("\nCluster profiles:")
profiles_fixed = df.groupby('cluster_fixed')[cluster_features + ['churn_30d']].mean().round(2)
print(profiles_fixed)

In [None]:
# TODO: Give each cluster a business name
cluster_names = {
    0: "???",  # TODO: Name based on profile
    1: "???",
    2: "???",
    3: "???"
}

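# Iterate over the clusters that actually exist, so this still runs if k_optimal != 4
for c in profiles_fixed.index:
    name = cluster_names.get(c, "???")
    profile = profiles_fixed.loc[c]
    print(f"\nCluster {c} ({name}):")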
    print(f"  Tenure: {profile['tenure_months']:.1f} months")
    print(f"  Logins: {profile['logins_last_30d']:.1f}/month")
    print(f"  Churn rate: {profile['churn_30d']:.1%}")

## Self-Check

In [None]:
# Verify fix
assert k_optimal < 10, "K is still too high"
assert k_optimal >= 3, "K is too low to be useful"

# Check clusters are meaningfully different:
# churn rate should vary by more than 5 percentage points across clusters
churn_rates = df.groupby('cluster_fixed')['churn_30d'].mean()
churn_spread = churn_rates.max() - churn_rates.min()
assert churn_spread > 0.05, "Churn rates should differ by more than 5 percentage points across clusters"

print("PASS: Clusters are distinct and actionable!")
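
The self-check above bounds K and compares churn rates, but it does not measure how well separated the clusters are in feature space. A silhouette comparison is one common complement. The sketch below is an illustration under stated assumptions: it runs on synthetic data from `make_blobs` (four well-separated groups) rather than the StreamCart data, and `silhouette_by_k` is a hypothetical helper.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (assumption): four well-separated blobs
X_demo, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=1.0,
    random_state=42,
)

def silhouette_by_k(X, k_values, random_state=42):
    """Fit K-Means for each K and return {K: mean silhouette score}."""
    scores = {}
    for n_clusters in k_values:
        model = KMeans(n_clusters=n_clusters, random_state=random_state, n_init=10)
        labels = model.fit_predict(X)
        # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters
        scores[n_clusters] = silhouette_score(X, labels)
    return scores

scores = silhouette_by_k(X_demo, [2, 4, 8, 20])
for n_clusters, score in scores.items():
    print(f"K={n_clusters:>2}: silhouette={score:.3f}")

best_k = max(scores, key=scores.get)
print(f"Highest silhouette at K={best_k}")
```

On the drill data you would call `silhouette_by_k(X_scaled, k_range)` and compare the result with your elbow choice. Real customer data rarely separates as cleanly as these blobs, so expect lower scores and treat agreement between the two methods as supporting evidence rather than proof.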

## Postmortem

Write 3 bullets:
1. **Root cause:** 
2. **How we detected it:** 
3. **Prevention for next time:** 