# K-Means Clustering - Complete Guide

## From Theory to Implementation

K-Means is one of the most popular **unsupervised learning algorithms** for clustering. It groups similar data points together by finding cluster centers (centroids).

### What You'll Learn
1. What clustering is and when to use K-Means
2. K-Means algorithm step-by-step
3. Choosing the optimal number of clusters (K)
4. Implementation from scratch
5. Scikit-learn implementation
6. Initialization strategies (K-means++)
7. Limitations and alternatives
8. Real-world applications

---


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs, make_moons, load_iris
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, adjusted_rand_score
from scipy.spatial.distance import cdist

plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

print("K-Means Clustering Complete Guide")
print("=" * 50)


## 1. What is Clustering?

### Clustering Definition

**Clustering** is an unsupervised learning technique that groups similar data points together **without labeled training data**.

### Why Clustering?

- ✅ **Customer Segmentation**: Group customers by behavior
- ✅ **Image Segmentation**: Separate objects in images  
- ✅ **Anomaly Detection**: Identify outliers
- ✅ **Data Exploration**: Discover hidden patterns
- ✅ **Preprocessing**: Group similar features

### Types of Clustering

1. **Centroid-based** (K-Means, K-Medoids)
2. **Hierarchical** (Agglomerative, Divisive)
3. **Density-based** (DBSCAN)
4. **Distribution-based** (Gaussian Mixture Models)


In [None]:
# Visualize clustering concept
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

plt.figure(figsize=(15, 5))

# Before clustering
plt.subplot(1, 3, 1)
plt.scatter(X[:, 0], X[:, 1], c='gray', alpha=0.6, s=50)
plt.title('Before Clustering\n(Unlabeled Data)', fontsize=12, fontweight='bold')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True, alpha=0.3)

# After clustering (ground truth for demonstration)
plt.subplot(1, 3, 2)
plt.scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', alpha=0.6, s=50)
plt.title('True Clusters\n(Ground Truth)', fontsize=12, fontweight='bold')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True, alpha=0.3)

# After K-Means clustering
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
y_pred = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_

plt.subplot(1, 3, 3)
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis', alpha=0.6, s=50)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='x', 
            s=200, linewidths=3, label='Centroids')
plt.title('K-Means Clustering\n(4 Clusters)', fontsize=12, fontweight='bold')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Goal: Find {kmeans.n_clusters} clusters (groups) in unlabeled data")


## 2. K-Means Algorithm

### Algorithm Steps

**K-Means** partitions data into K clusters by:

1. **Initialize**: Choose K data points at random as the initial centroids
2. **Assign**: Assign each data point to its nearest centroid
3. **Update**: Recalculate each centroid as the mean of its assigned points
4. **Repeat**: Repeat steps 2-3 until convergence (centroids stop changing) or a maximum number of iterations is reached

### Mathematical Formulation

**Objective Function** (Within-Cluster Sum of Squares - WCSS):

$$J = \sum_{i=1}^{K} \sum_{x \in C_i} ||x - \mu_i||^2$$

Where:
- $K$ = number of clusters
- $C_i$ = set of points in cluster $i$
- $\mu_i$ = centroid of cluster $i$
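- $||x - \mu_i||^2$ = squared Euclidean distance from point $x$ to its centroid

**Goal**: Minimize $J$, the total squared distance between points and their cluster centroids. Each assign step and each update step can only lower $J$ or leave it unchanged, so the algorithm converges, though only to a local minimum that depends on the initial centroids.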
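
A minimal sketch of this objective, assuming a tiny hand-made 1-D dataset and deliberately poor starting centroids. The helper names `wcss`, `assign`, and `update` are illustrative, not library functions. It records $J$ before each update to show that the value never increases:

```python
import numpy as np

def wcss(X, labels, centroids):
    """J: sum of squared distances from each point to its assigned centroid."""
    return float(np.sum((X - centroids[labels]) ** 2))

def assign(X, centroids):
    """Label each point with the index of its nearest centroid."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def update(X, labels, centroids):
    """Move each centroid to the mean of its points (unchanged if it has none)."""
    new = centroids.copy()
    for i in range(len(centroids)):
        if np.any(labels == i):
            new[i] = X[labels == i].mean(axis=0)
    return new

# Illustrative data: two obvious groups on a line.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
centroids = np.array([[0.0], [1.0]])  # deliberately poor start

history = []
for _ in range(5):
    labels = assign(X, centroids)
    history.append(wcss(X, labels, centroids))
    centroids = update(X, labels, centroids)

print(history)            # J recorded at each iteration
print(centroids.ravel())  # final centroids
```

On this data the recorded values drop from 303 to about 50.3 and then settle at 4, with the centroids ending at 1 and 11.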


In [None]:
# Visualize K-Means steps
def visualize_kmeans_steps(X, k=3, max_iter=3):
    """Visualize K-Means algorithm steps"""
    np.random.seed(42)
    
    # Initialize random centroids
    centroids = X[np.random.choice(X.shape[0], k, replace=False)]
    
    fig, axes = plt.subplots(1, max_iter+1, figsize=(16, 4))
    
    for iteration in range(max_iter + 1):
        # Assign points to nearest centroid
        distances = cdist(X, centroids)
        labels = np.argmin(distances, axis=1)
        
        # Plot
        ax = axes[iteration]
        for i in range(k):
            cluster_points = X[labels == i]
            ax.scatter(cluster_points[:, 0], cluster_points[:, 1], 
                      alpha=0.6, s=50, label=f'Cluster {i+1}')
        
        ax.scatter(centroids[:, 0], centroids[:, 1], 
                  c='red', marker='x', s=200, linewidths=3, 
                  label='Centroids')
        
        ax.set_title(f'Iteration {iteration}\n(Centroid Update)', fontweight='bold')
        ax.set_xlabel('Feature 1')
        ax.set_ylabel('Feature 2')
        ax.legend()
        ax.grid(True, alpha=0.3)
        
        # Update centroids
        if iteration < max_iter:
            new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
            
            # Draw lines showing centroid movement
            for i in range(k):
                ax.plot([centroids[i, 0], new_centroids[i, 0]], 
                       [centroids[i, 1], new_centroids[i, 1]], 
                       'r--', alpha=0.5, linewidth=2)
            
            centroids = new_centroids
    
    plt.tight_layout()
    plt.show()

# Generate sample data
X_sample, _ = make_blobs(n_samples=150, centers=3, cluster_std=1.0, random_state=42)

print("K-Means Algorithm Visualization:")
print("Watch how centroids move and clusters form!")
visualize_kmeans_steps(X_sample, k=3, max_iter=3)


## 3. Implementation from Scratch

In [None]:
class KMeansScratch:
    """K-Means Clustering from Scratch"""
    
    def __init__(self, n_clusters=3, max_iters=100, random_state=None):
        self.n_clusters = n_clusters
        self.max_iters = max_iters
        self.random_state = random_state
        self.centroids = None
        self.labels_ = None
        self.inertia_ = None  # WCSS
    
    def _initialize_centroids(self, X):
        """Initialize centroids randomly"""
        if self.random_state is not None:  # also honor random_state=0
            np.random.seed(self.random_state)
        indices = np.random.choice(X.shape[0], self.n_clusters, replace=False)
        return X[indices].copy()
    
    def _assign_clusters(self, X, centroids):
        """Assign each point to nearest centroid"""
        distances = cdist(X, centroids)
        return np.argmin(distances, axis=1)
    
    def _update_centroids(self, X, labels):
        """Update centroids as mean of assigned points"""
        new_centroids = np.zeros((self.n_clusters, X.shape[1]))
        for i in range(self.n_clusters):
            cluster_points = X[labels == i]
            if len(cluster_points) > 0:
                new_centroids[i] = cluster_points.mean(axis=0)
            else:
                # Handle empty cluster
                new_centroids[i] = X[np.random.choice(X.shape[0])]
        return new_centroids
    
    def _calculate_inertia(self, X, labels, centroids):
        """Calculate within-cluster sum of squares (WCSS)"""
        inertia = 0
        for i in range(self.n_clusters):
            cluster_points = X[labels == i]
            if len(cluster_points) > 0:
                inertia += np.sum((cluster_points - centroids[i])**2)
        return inertia
    
    def fit(self, X):
        """Fit K-Means to data"""
        # Initialize centroids
        self.centroids = self._initialize_centroids(X)
        
        for iteration in range(self.max_iters):
            # Assign points to clusters
            labels = self._assign_clusters(X, self.centroids)
            
            # Update centroids
            new_centroids = self._update_centroids(X, labels)
            
            # Check for convergence
            if np.allclose(self.centroids, new_centroids):
                print(f"Converged after {iteration + 1} iterations")
                break
            
            self.centroids = new_centroids
        
        # Final assignment
        self.labels_ = self._assign_clusters(X, self.centroids)
        self.inertia_ = self._calculate_inertia(X, self.labels_, self.centroids)
        
        return self
    
    def predict(self, X):
        """Predict cluster for new points"""
        if self.centroids is None:
            raise ValueError("Model must be fitted first")
        return self._assign_clusters(X, self.centroids)

# Test our implementation
X_test, y_true = make_blobs(n_samples=200, centers=3, cluster_std=1.0, random_state=42)

kmeans_scratch = KMeansScratch(n_clusters=3, max_iters=100, random_state=42)
kmeans_scratch.fit(X_test)

print(f"\nNumber of clusters: {kmeans_scratch.n_clusters}")
print(f"Inertia (WCSS): {kmeans_scratch.inertia_:.2f}")
print(f"Centroids:\n{kmeans_scratch.centroids}")

# Visualize
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_true, cmap='viridis', alpha=0.6, s=50)
plt.title('True Clusters', fontweight='bold')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.scatter(X_test[:, 0], X_test[:, 1], c=kmeans_scratch.labels_, cmap='viridis', alpha=0.6, s=50)
plt.scatter(kmeans_scratch.centroids[:, 0], kmeans_scratch.centroids[:, 1], 
           c='red', marker='x', s=200, linewidths=3, label='Centroids')
plt.title('K-Means Clustering (From Scratch)', fontweight='bold')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
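
`KMeansScratch` above seeds its centroids uniformly at random, which can place two seeds in the same blob. K-means++ seeding, which scikit-learn's `KMeans` uses by default (`init='k-means++'`), spreads the seeds out by sampling each new centroid with probability proportional to its squared distance from the nearest centroid already chosen. The function below is a simplified sketch of that idea, not scikit-learn's implementation. The name `kmeans_plus_plus_init` and the toy data are assumptions for illustration, and the sketch assumes the data contains at least `n_clusters` distinct points.

```python
import numpy as np

def kmeans_plus_plus_init(X, n_clusters, rng):
    """Pick initial centroids with basic k-means++ style seeding."""
    n_samples = X.shape[0]
    # First centroid: a data point chosen uniformly at random.
    centroids = [X[rng.integers(n_samples)]]
    for _ in range(1, n_clusters):
        # Squared distance from each point to its nearest chosen centroid.
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to d2,
        # so far-away points are favored and chosen points have probability 0.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(n_samples, p=probs)])
    return np.array(centroids)

# Illustrative data: three tight, well-separated blobs.
rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([c + 0.1 * rng.normal(size=(20, 2)) for c in blob_centers])

seeds = kmeans_plus_plus_init(X_demo, n_clusters=3, rng=rng)
print(seeds.round(1))
```

To try it with the class above, one option is to have `_initialize_centroids` return the result of this function instead of a uniform random sample.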


## 4. Choosing the Optimal Number of Clusters (K)

In [None]:
# Elbow Method and Silhouette Score
X_elbow, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

inertias = []
silhouette_scores = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_elbow)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_elbow, kmeans.labels_))

# Plot Elbow Method
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
plt.plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
plt.axvline(x=4, color='r', linestyle='--', alpha=0.7, label='Optimal K=4')
plt.xlabel('Number of Clusters (K)', fontsize=12)
plt.ylabel('Inertia (WCSS)', fontsize=12)
plt.title('Elbow Method\n(Look for the "Elbow")', fontsize=12, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.legend()

# Plot Silhouette Score
plt.subplot(1, 3, 2)
plt.plot(K_range, silhouette_scores, 'go-', linewidth=2, markersize=8)
optimal_k = K_range[np.argmax(silhouette_scores)]
plt.axvline(x=optimal_k, color='r', linestyle='--', alpha=0.7, 
            label=f'Optimal K={optimal_k}')
plt.xlabel('Number of Clusters (K)', fontsize=12)
plt.ylabel('Silhouette Score', fontsize=12)
plt.title('Silhouette Score Method\n(Higher is Better)', fontsize=12, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.legend()

# Visualize the clustering for the chosen K=4
plt.subplot(1, 3, 3)
kmeans_best = KMeans(n_clusters=4, random_state=42, n_init=10)
y_best = kmeans_best.fit_predict(X_elbow)
plt.scatter(X_elbow[:, 0], X_elbow[:, 1], c=y_best, cmap='viridis', alpha=0.6, s=50)
plt.scatter(kmeans_best.cluster_centers_[:, 0], kmeans_best.cluster_centers_[:, 1],
           c='red', marker='x', s=200, linewidths=3, label='Centroids')
plt.title('K=4 Clusters\n(Optimal)', fontsize=12, fontweight='bold')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Optimal K (Elbow method): 4 (read visually from the plot)")
print(f"Optimal K (Silhouette): {optimal_k} (score: {max(silhouette_scores):.3f})")
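
The silhouette score used above compares, for each point, the mean distance to the other members of its own cluster ($a$) with the smallest mean distance to the members of any other cluster ($b$), giving $s = (b - a) / \max(a, b)$ in the range $[-1, 1]$. The sketch below computes the mean of $s$ by hand on a four-point dataset. The function name `silhouette_by_hand` and the data are illustrative assumptions; for real work, `sklearn.metrics.silhouette_score` is the function this notebook relies on.

```python
import numpy as np

def silhouette_by_hand(X, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    # Pairwise distance matrix, shape (n, n).
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # convention: a singleton cluster scores 0
        # a: mean distance to the other members of the point's own cluster.
        a = D[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

X_tiny = np.array([[0.0], [1.0], [10.0], [11.0]])
good = silhouette_by_hand(X_tiny, np.array([0, 0, 1, 1]))  # natural grouping
bad = silhouette_by_hand(X_tiny, np.array([0, 1, 0, 1]))   # mixed grouping

print(f"good labels: {good:.3f}")  # about 0.900
print(f"bad labels:  {bad:.3f}")   # -0.450
```

A score near 1 means points sit much closer to their own cluster than to any other; a negative score suggests many points would fit better in a neighboring cluster.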


## 5. Scikit-learn Implementation

In [None]:
# Load Iris dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# Scale features (important for K-Means!)
scaler = StandardScaler()
X_iris_scaled = scaler.fit_transform(X_iris)

# Apply K-Means
kmeans_iris = KMeans(n_clusters=3, random_state=42, n_init=10)
y_pred_iris = kmeans_iris.fit_predict(X_iris_scaled)

# Evaluate
print("Iris Dataset Clustering Results:")
print("=" * 50)
print(f"Number of clusters: {kmeans_iris.n_clusters}")
print(f"Inertia: {kmeans_iris.inertia_:.2f}")
print(f"Silhouette Score: {silhouette_score(X_iris_scaled, y_pred_iris):.3f}")
print(f"\nCluster sizes: {np.bincount(y_pred_iris)}")

# Visualize (using first 2 features)
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.scatter(X_iris[:, 0], X_iris[:, 1], c=y_iris, cmap='viridis', alpha=0.6, s=50)
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title('True Labels (Species)', fontweight='bold')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.scatter(X_iris[:, 0], X_iris[:, 1], c=y_pred_iris, cmap='viridis', alpha=0.6, s=50)
plt.scatter(scaler.inverse_transform(kmeans_iris.cluster_centers_)[:, 0], 
           scaler.inverse_transform(kmeans_iris.cluster_centers_)[:, 1],
           c='red', marker='x', s=200, linewidths=3, label='Centroids')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title('K-Means Clustering (K=3)', fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

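print(f"\nAdjusted Rand Index vs. species: {adjusted_rand_score(y_iris, y_pred_iris):.3f}")
print("Note: K=3 was chosen to match the 3 iris species; K-Means does not discover K on its own.")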
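
Cluster IDs returned by K-Means are arbitrary, so comparing predicted labels to true labels element by element is not meaningful. `adjusted_rand_score`, imported at the top of this notebook, compares the two partitions instead and ignores how the groups are numbered. The label arrays below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

y_true_demo = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
# Same grouping as y_true_demo, but with the cluster IDs permuted.
y_permuted = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])
# One point of the second group placed in the wrong cluster.
y_one_error = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1])

ari_permuted = adjusted_rand_score(y_true_demo, y_permuted)
ari_one_error = adjusted_rand_score(y_true_demo, y_one_error)
naive_accuracy = np.mean(y_true_demo == y_permuted)

print(ari_permuted)    # 1.0: identical partitions despite different IDs
print(ari_one_error)   # below 1.0: one point is misplaced
print(naive_accuracy)  # 0.0: element-wise comparison is misleading
```

An ARI of 1 means the partitions agree exactly, while values near 0 are what random labelings would produce on average.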


## 6. Limitations and Alternatives

In [None]:
# Examples where K-Means struggles

# 1. Non-spherical clusters (moons)
X_moons, y_moons = make_moons(n_samples=200, noise=0.1, random_state=42)
kmeans_moons = KMeans(n_clusters=2, random_state=42, n_init=10)
y_pred_moons = kmeans_moons.fit_predict(X_moons)

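# 2. Different sized clusters
# centers=None lets make_blobs take the number of centers from the n_samples list (an integer here raises a ValueError)
X_unequal, _ = make_blobs(n_samples=[50, 200, 50], centers=None, cluster_std=[1.5, 0.5, 1.5], random_state=42)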
kmeans_unequal = KMeans(n_clusters=3, random_state=42, n_init=10)
y_pred_unequal = kmeans_unequal.fit_predict(X_unequal)

plt.figure(figsize=(14, 5))

# Moons
plt.subplot(1, 2, 1)
plt.scatter(X_moons[:, 0], X_moons[:, 1], c=y_pred_moons, cmap='viridis', alpha=0.6, s=50)
plt.scatter(kmeans_moons.cluster_centers_[:, 0], kmeans_moons.cluster_centers_[:, 1],
           c='red', marker='x', s=200, linewidths=3)
plt.title('K-Means on Non-Spherical Data\n(Doesn\'t work well!)', fontweight='bold', color='red')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True, alpha=0.3)

# Unequal sizes
plt.subplot(1, 2, 2)
plt.scatter(X_unequal[:, 0], X_unequal[:, 1], c=y_pred_unequal, cmap='viridis', alpha=0.6, s=50)
plt.scatter(kmeans_unequal.cluster_centers_[:, 0], kmeans_unequal.cluster_centers_[:, 1],
           c='red', marker='x', s=200, linewidths=3)
plt.title('K-Means on Unequal Sized Clusters\n(Struggles with small cluster)', fontweight='bold', color='orange')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Alternative algorithms for these cases:")
print("- Non-spherical: DBSCAN, Hierarchical Clustering")
print("- Unequal sizes: DBSCAN, Gaussian Mixture Models")


## 7. Real-World Application: Customer Segmentation

In [None]:
# Customer Segmentation Example
np.random.seed(42)
n_customers = 300

# Generate synthetic customer data
annual_income = np.random.normal(50000, 15000, n_customers)
spending_score = np.random.normal(50, 20, n_customers)

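# Combine the two features into one matrix
# (this synthetic data is a single Gaussian blob with no built-in segments)
customer_data = np.column_stack([annual_income, spending_score])

# Scale features first: income (tens of thousands) would otherwise dominate
# the Euclidean distance over spending score (tens)
customer_scaler = StandardScaler()
customer_data_scaled = customer_scaler.fit_transform(customer_data)

# Apply K-Means
kmeans_customers = KMeans(n_clusters=5, random_state=42, n_init=10)
customer_segments = kmeans_customers.fit_predict(customer_data_scaled)

# Map segment centers back to the original units for plotting
segment_centers = customer_scaler.inverse_transform(kmeans_customers.cluster_centers_)

# Visualize
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.scatter(annual_income, spending_score, alpha=0.6, s=50, c='gray')
plt.xlabel('Annual Income ($)', fontsize=11)
plt.ylabel('Spending Score', fontsize=11)
plt.title('Customer Data (Unlabeled)', fontweight='bold')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.scatter(annual_income, spending_score, c=customer_segments, 
           cmap='viridis', alpha=0.6, s=50)
plt.scatter(segment_centers[:, 0], 
           segment_centers[:, 1],
           c='red', marker='x', s=200, linewidths=3, label='Segment Centers')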
plt.xlabel('Annual Income ($)', fontsize=11)
plt.ylabel('Spending Score', fontsize=11)
plt.title('Customer Segmentation (5 Segments)', fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Segment analysis
print("Customer Segment Analysis:")
print("=" * 50)
for i in range(5):
    segment_data = customer_data[customer_segments == i]
    print(f"\nSegment {i+1}:")
    print(f"  Size: {len(segment_data)} customers")
    print(f"  Avg Income: ${segment_data[:, 0].mean():.0f}")
    print(f"  Avg Spending Score: {segment_data[:, 1].mean():.1f}")


## 8. Practice Problems

### Problem 1: Find Optimal K

Given a dataset, determine the optimal number of clusters using both Elbow and Silhouette methods.

### Problem 2: Image Compression with K-Means

Use K-Means to compress an image by reducing the number of colors.
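
One possible starting point for Problem 2, sketched on a synthetic random image so that it runs without any image file. The function name `compress_colors` and the choice of 8 colors are assumptions for illustration; swapping in a real photo loaded as a `(height, width, 3)` array and comparing different color counts is left as the exercise.

```python
import numpy as np
from sklearn.cluster import KMeans

def compress_colors(image, n_colors, random_state=0):
    """Replace every pixel with the nearest of n_colors K-Means centroids."""
    h, w, c = image.shape
    # Treat each pixel as one sample with c color features.
    pixels = image.reshape(-1, c).astype(float)
    kmeans = KMeans(n_clusters=n_colors, n_init=10, random_state=random_state)
    labels = kmeans.fit_predict(pixels)
    # Look up each pixel's centroid color and restore the image shape.
    compressed = kmeans.cluster_centers_[labels]
    return compressed.reshape(h, w, c).round().astype(image.dtype)

# Synthetic stand-in for a real photo: 32x32 pixels of random RGB noise.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

compressed = compress_colors(image, n_colors=8)
n_unique = len(np.unique(compressed.reshape(-1, 3), axis=0))

print(image.shape, compressed.shape)
print(f"distinct colors after compression: {n_unique}")
```

The compressed image keeps its shape but uses at most `n_colors` distinct colors, so it can be stored as a small palette plus one index per pixel.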


## 9. Summary & Key Takeaways

### Key Concepts:

1. **K-Means Algorithm**:
   - Initialize centroids
   - Assign points to nearest centroid
   - Update centroids
   - Repeat until convergence

2. **Choosing K**:
   - Elbow Method: Look for "bend" in inertia plot
   - Silhouette Score: Higher is better (maximize)

3. **When to Use**:
   - Spherical clusters of similar size
   - Known or estimable number of clusters
   - Large datasets (efficient)

4. **When NOT to Use**:
   - Non-spherical clusters
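   - Unknown number of clusters (unless you first estimate K, e.g. with the Elbow or Silhouette methods)
   - Outliers present (centroids are means, so extreme points pull them away from the bulk of a cluster)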
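
The outlier point can be checked with a few lines of NumPy. This is a small sketch on made-up coordinates: it compares the centroid (mean) of a tight cluster with and without one extreme point, and shows the per-feature median only as a contrast, since K-Means itself always uses the mean.

```python
import numpy as np

# Four points tightly grouped around (1, 1), plus one far-away outlier.
cluster = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.0, 1.0]])
outlier = np.array([[21.0, 1.0]])
with_outlier = np.vstack([cluster, outlier])

centroid_clean = cluster.mean(axis=0)            # mean without the outlier
centroid_outlier = with_outlier.mean(axis=0)     # mean with the outlier
median_outlier = np.median(with_outlier, axis=0) # median with the outlier

print(centroid_clean)    # close to [1, 1]
print(centroid_outlier)  # close to [5, 1]: dragged toward the outlier
print(median_outlier)    # close to [1, 1]: barely affected
```

A single extreme point moves the centroid four units away from every genuine member of the cluster, which is why removing or down-weighting outliers before running K-Means is often worthwhile.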

### Time Complexity:
- Training: O(n × k × i × d)
  - n = number of points
  - k = number of clusters
  - i = number of iterations
  - d = number of dimensions
- Prediction: O(k × d) per point
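
A rough sketch of where these costs come from, using random data with illustrative sizes. The array shapes in one assignment pass mirror the n × k × d term, and repeating that pass for i iterations gives the training cost.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, d = 1000, 4, 3  # illustrative sizes
X = rng.normal(size=(n, d))
centroids = rng.normal(size=(k, d))

# One assignment pass: n * k distances, each summing over d dimensions,
# which is the O(n * k * d) cost of a single iteration.
diffs = X[:, None, :] - centroids[None, :, :]   # shape (n, k, d)
sq_dists = (diffs ** 2).sum(axis=2)             # shape (n, k)
labels = sq_dists.argmin(axis=1)                # shape (n,)

# Predicting one new point only needs k distances over d dimensions: O(k * d).
new_point = rng.normal(size=d)
new_label = ((centroids - new_point) ** 2).sum(axis=1).argmin()

print(diffs.shape, sq_dists.shape, labels.shape, int(new_label))
```

Materializing the full `(n, k, d)` array is convenient for illustration but memory-hungry for large data; the arithmetic cost is the same if the distances are computed in chunks.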

### Next Steps:
1. **Hierarchical Clustering**: For non-spherical clusters
2. **DBSCAN**: For clusters of arbitrary shape and outlier detection
3. **Gaussian Mixture Models**: For soft clustering (probabilistic)

---

**Resources:**
- Scikit-learn Documentation: https://scikit-learn.org/stable/modules/clustering.html#k-means
- "The Elements of Statistical Learning" by Hastie, Tibshirani, Friedman

