# Solution: The Wrong Clustering

This notebook provides the complete solution to the debug drill.

---

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_moons

np.random.seed(42)

In [None]:
# Generate synthetic customer data with non-spherical structure
X_moons, y_true = make_moons(n_samples=400, noise=0.08, random_state=42)

# Scale to realistic customer metrics
X_customers = X_moons.copy()
X_customers[:, 0] = X_moons[:, 0] * 150 + 200  # Spend: roughly 50-500 (moons x spans about -1 to 2)
X_customers[:, 1] = X_moons[:, 1] * 30 + 50     # Engagement: roughly 35-80 (moons y spans about -0.5 to 1)

df = pd.DataFrame(X_customers, columns=['monthly_spend', 'engagement_score'])
df['true_segment'] = y_true

## The Bugs

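1. **No feature scaling** - Monthly spend (range ~450) dominates engagement (range ~45) in every Euclidean distance
2. **K-means on non-spherical data** - K-means assumes compact, roughly spherical clusters, but the segments are crescent-shaped (the broken code also requests 3 clusters for 2 true segments)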
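
To make the first bug concrete, the sketch below estimates how much of the total squared Euclidean distance between customers comes from `monthly_spend`, before and after standardization. It rebuilds the same synthetic data so that it runs on its own; the helper name `spend_share` is hypothetical and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Rebuild the notebook's synthetic customers (same seed and scaling factors)
X_moons, _ = make_moons(n_samples=400, noise=0.08, random_state=42)
X_raw = X_moons * np.array([150.0, 30.0]) + np.array([200.0, 50.0])


def spend_share(X):
    """Share of total squared pairwise Euclidean distance due to column 0.

    Summed over all pairs, each feature's squared differences are
    proportional to its variance, so the share reduces to a variance ratio.
    (Hypothetical helper, for illustration only.)
    """
    variances = X.var(axis=0)
    return variances[0] / variances.sum()


raw_share = spend_share(X_raw)
scaled_share = spend_share(StandardScaler().fit_transform(X_raw))

print(f"Spend share of squared distance, raw:    {raw_share:.3f}")
print(f"Spend share of squared distance, scaled: {scaled_share:.3f}")
```

On the raw features, spend accounts for nearly all of the distance, so K-means effectively clusters on one column. After standardization both features have unit variance and contribute equally by construction.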

In [None]:
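# ===== BROKEN CODE =====
# Bug 1: raw, unscaled features go straight into a distance-based algorithm
X_broken = df[['monthly_spend', 'engagement_score']].values
# Bug 2: K-means (here with 3 clusters) cannot follow crescent-shaped segments
kmeans_broken = KMeans(n_clusters=3, random_state=42)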
df['cluster_broken'] = kmeans_broken.fit_predict(X_broken)
sil_broken = silhouette_score(X_broken, df['cluster_broken'])

print(f"Broken K-means silhouette: {sil_broken:.3f}")

In [None]:
# ===== FIXED CODE =====

# Fix 1: Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[['monthly_spend', 'engagement_score']])

# Fix 2: Use DBSCAN for non-spherical clusters
dbscan = DBSCAN(eps=0.3, min_samples=10)
df['cluster_fixed'] = dbscan.fit_predict(X_scaled)

# Evaluate
n_clusters = len(set(df['cluster_fixed'])) - (1 if -1 in df['cluster_fixed'].values else 0)
n_noise = (df['cluster_fixed'] == -1).sum()

print("Fixed DBSCAN results:")
print(f"  Clusters found: {n_clusters}")
print(f"  Noise points: {n_noise}")

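# Default to NaN so later cells can still reference sil_fixed if DBSCAN finds < 2 clusters
sil_fixed = float('nan')
if n_clusters >= 2:
    # Noise points (label -1) are not a cluster, so exclude them from the score
    mask = (df['cluster_fixed'] != -1).to_numpy()
    sil_fixed = silhouette_score(X_scaled[mask], df.loc[mask, 'cluster_fixed'])
    print(f"  Silhouette Score: {sil_fixed:.3f}")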

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# True segments
axes[0].scatter(df['monthly_spend'], df['engagement_score'], c=df['true_segment'], cmap='viridis', alpha=0.6)
axes[0].set_title('True Segments (2 groups)')
axes[0].set_xlabel('Monthly Spend ($)')
axes[0].set_ylabel('Engagement Score')

# Broken K-means
axes[1].scatter(df['monthly_spend'], df['engagement_score'], c=df['cluster_broken'], cmap='tab10', alpha=0.6)
axes[1].set_title(f'Broken: K-Means (sil={sil_broken:.2f})')
axes[1].set_xlabel('Monthly Spend ($)')

# Fixed DBSCAN
axes[2].scatter(df['monthly_spend'], df['engagement_score'], c=df['cluster_fixed'], cmap='viridis', alpha=0.6)
axes[2].set_title(f'Fixed: DBSCAN (sil={sil_fixed:.2f})')
axes[2].set_xlabel('Monthly Spend ($)')

plt.tight_layout()
plt.show()

## Solution Summary

**Bug 1: No Scaling**
- Problem: Monthly spend (range ~450) dominated engagement (range ~45) in the distance calculation
- Fix: Apply `StandardScaler` before clustering

**Bug 2: Wrong Algorithm**
- Problem: K-means assumes compact, roughly spherical clusters; the data is crescent-shaped
- Fix: Use DBSCAN, which groups points by density and can follow arbitrary cluster shapes
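
The fix uses `eps=0.3` and `min_samples=10` without saying how `eps` was chosen. One common rule of thumb, which the original notebook does not use, is the k-distance heuristic: compute each point's distance to its k-th nearest neighbor, sort those distances, and look for the elbow. The sketch below is a minimal version under that assumption; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Rebuild and standardize the notebook's synthetic customers
X_moons, _ = make_moons(n_samples=400, noise=0.08, random_state=42)
X_raw = X_moons * np.array([150.0, 30.0]) + np.array([200.0, 50.0])
X_scaled = StandardScaler().fit_transform(X_raw)

min_samples = 10  # same value as the DBSCAN call in the fixed code

# Querying the training points returns each point itself in column 0
# (distance 0), so the last column is the radius at which a point's
# neighborhood holds min_samples points including itself. That matches
# DBSCAN's core-point condition.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
k_dist = np.sort(distances[:, -1])

# Quantiles give a quick numeric view of where the sorted curve bends
for q in (50, 90, 95, 99):
    print(f"{q}th percentile of k-distance: {np.percentile(k_dist, q):.3f}")
```

Plotting `k_dist` against its index and reading off the bend is the usual way to choose `eps`. Values well below the bend tend to label many points as noise, while values well above it risk merging the two crescents.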

**Result:**
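- Silhouette rose from ~0.3 to ~0.5 (silhouette favors compact, convex clusters, so treat it as a rough indicator on crescent-shaped data)
- DBSCAN found 2 clusters matching the true segments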
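
Because the synthetic data carries ground-truth labels, the claim that the clusters match the true segments can be checked directly rather than inferred from silhouette alone. The sketch below scores both pipelines with the adjusted Rand index (ARI). This is an added check that is not in the original notebook, and it rebuilds the data so that it runs on its own.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Rebuild the notebook's synthetic customers and true labels
X_moons, y_true = make_moons(n_samples=400, noise=0.08, random_state=42)
X_raw = X_moons * np.array([150.0, 30.0]) + np.array([200.0, 50.0])

# Broken pipeline: K-means with 3 clusters on unscaled features
labels_broken = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_raw)

# Fixed pipeline: standardize, then DBSCAN with the notebook's parameters
X_scaled = StandardScaler().fit_transform(X_raw)
labels_fixed = DBSCAN(eps=0.3, min_samples=10).fit_predict(X_scaled)

# ARI is 1.0 for a perfect match and near 0 for random labelings.
# DBSCAN noise points (label -1) are scored as their own group here.
ari_broken = adjusted_rand_score(y_true, labels_broken)
ari_fixed = adjusted_rand_score(y_true, labels_fixed)

print(f"ARI, broken K-means: {ari_broken:.3f}")
print(f"ARI, fixed DBSCAN:   {ari_fixed:.3f}")
```

Unlike silhouette, ARI does not reward convex shapes, so it is a fairer comparison whenever true labels are available. In real customer data there are usually no such labels, which is why this check only works for the drill.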