# Week 4, Day 1: K-means Clustering

## Learning Objectives
- Understand unsupervised learning concepts
- Learn the K-means clustering algorithm
- Master cluster analysis and evaluation
- Practice implementing K-means

## Topics Covered
1. Introduction to Clustering
2. K-means Algorithm
3. Cluster Evaluation
4. Applications
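
To make the "K-means Algorithm" topic concrete, here is a minimal NumPy sketch of Lloyd's algorithm, the iterative procedure usually meant by "K-means". The function name `lloyd_kmeans`, its parameters, and the toy data are illustrative assumptions, and the random initialization is simpler than the k-means++ scheme scikit-learn uses by default.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    # Simple random init: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Toy input (hypothetical): two well-separated blobs of 50 points each
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels_demo, centroids_demo = lloyd_kmeans(X_demo, k=2)
print(np.bincount(labels_demo))
print(centroids_demo.round(2))
```

The two steps inside the loop are the whole algorithm; `KMeans` in scikit-learn adds smarter initialization, multiple restarts, and faster distance computations on top of the same idea.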

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

## 1. Basic K-means Example

In [None]:
def basic_kmeans_example():
    # Generate synthetic data
    np.random.seed(42)
    
    # Create three clusters
    n_samples = 300
    
    cluster1 = np.random.normal(0, 1, (n_samples, 2))
    cluster2 = np.random.normal(5, 1, (n_samples, 2))
    cluster3 = np.random.normal(2.5, 1, (n_samples, 2)) + np.array([0, 5])
    
    # Combine clusters
    X = np.vstack([cluster1, cluster2, cluster3])
    
    # Scale features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    # Apply K-means (n_init set explicitly so results match across scikit-learn versions)
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
    clusters = kmeans.fit_predict(X_scaled)
    
    # Visualize results
    plt.figure(figsize=(12, 5))
    
    # Original data
    plt.subplot(121)
    plt.scatter(X[:, 0], X[:, 1], alpha=0.5)
    plt.title('Original Data')
    
    # Clustered data
    plt.subplot(122)
    scatter = plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis')
    # Centroids were fit in scaled space; map them back to the original units
    centers = scaler.inverse_transform(kmeans.cluster_centers_)
    plt.scatter(centers[:, 0], centers[:, 1],
                color='red', marker='x', s=200, linewidth=3, label='Centroids')
    plt.title('K-means Clustering')
    plt.colorbar(scatter)
    plt.legend()
    
    plt.show()
    
    # Print cluster information
    print("Cluster sizes:")
    print(pd.Series(clusters).value_counts())

basic_kmeans_example()

## 2. Finding Optimal K

In [None]:
def find_optimal_k():
    # Generate data
    np.random.seed(42)
    n_samples = 300
    
    # Create clusters
    X = np.vstack([
        np.random.normal(0, 1, (n_samples, 2)),
        np.random.normal(4, 1, (n_samples, 2)),
        np.random.normal(8, 1, (n_samples, 2))
    ])
    
    # Scale features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    # Calculate metrics for different k values
    k_range = range(1, 10)
    inertias = []
    silhouette_scores = []
    
    for k in k_range:
        kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
        kmeans.fit(X_scaled)
        inertias.append(kmeans.inertia_)
        
        if k > 1:  # Silhouette score requires at least 2 clusters
            silhouette_scores.append(silhouette_score(X_scaled, kmeans.labels_))
    
    # Plot elbow curve
    plt.figure(figsize=(12, 5))
    
    plt.subplot(121)
    plt.plot(k_range, inertias, 'bo-')
    plt.xlabel('k')
    plt.ylabel('Inertia')
    plt.title('Elbow Method')
    
    # Plot silhouette scores
    plt.subplot(122)
    plt.plot(list(k_range)[1:], silhouette_scores, 'ro-')
    plt.xlabel('k')
    plt.ylabel('Silhouette Score')
    plt.title('Silhouette Analysis')
    
    plt.tight_layout()
    plt.show()

find_optimal_k()
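
The plots above are read by eye. As a rough complement, the sketch below picks the k with the highest silhouette score programmatically. The data mirrors the three-blob setup above but is regenerated here so the snippet stands alone; the names `X_demo`, `scores`, and `best_k` are illustrative. Treat the result as a hint rather than a rule, since the silhouette maximum and the elbow do not always agree on real data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Three blobs along the diagonal, similar to the data used above
X_demo = np.vstack([rng.normal(c, 1, (300, 2)) for c in (0, 4, 8)])
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Silhouette score is only defined for k >= 2
scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo_scaled)
    scores[k] = silhouette_score(X_demo_scaled, labels)

best_k = max(scores, key=scores.get)
print(f"Best k by silhouette: {best_k}")
```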

## 3. Customer Segmentation Example

In [None]:
def customer_segmentation():
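    # Generate customer data
    # Note: the features are drawn independently, so there are no true clusters;
    # K-means still partitions the data, and the segments are purely illustrative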
    np.random.seed(42)
    n_customers = 1000
    
    # Create features
    age = np.random.normal(35, 10, n_customers)
    income = np.random.normal(50000, 20000, n_customers)
    spending_score = np.random.normal(50, 25, n_customers)
    
    # Create DataFrame
    customer_data = pd.DataFrame({
        'Age': age,
        'Income': income,
        'SpendingScore': spending_score
    })
    
    # Scale features
    scaler = StandardScaler()
    features_scaled = scaler.fit_transform(customer_data)
    
    # Apply K-means (k=5 is an arbitrary choice here; see Section 2 for selecting k)
    kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
    customer_data['Cluster'] = kmeans.fit_predict(features_scaled)
    
    # Visualize results
    plt.figure(figsize=(15, 5))
    
    # Age vs Income
    plt.subplot(131)
    scatter = plt.scatter(customer_data['Age'], customer_data['Income'], 
                         c=customer_data['Cluster'], cmap='viridis')
    plt.xlabel('Age')
    plt.ylabel('Income')
    plt.title('Age vs Income')
    plt.colorbar(scatter)
    
    # Income vs Spending Score
    plt.subplot(132)
    scatter = plt.scatter(customer_data['Income'], customer_data['SpendingScore'], 
                         c=customer_data['Cluster'], cmap='viridis')
    plt.xlabel('Income')
    plt.ylabel('Spending Score')
    plt.title('Income vs Spending Score')
    plt.colorbar(scatter)
    
    # Age vs Spending Score
    plt.subplot(133)
    scatter = plt.scatter(customer_data['Age'], customer_data['SpendingScore'], 
                         c=customer_data['Cluster'], cmap='viridis')
    plt.xlabel('Age')
    plt.ylabel('Spending Score')
    plt.title('Age vs Spending Score')
    plt.colorbar(scatter)
    
    plt.tight_layout()
    plt.show()
    
    # Analyze clusters
    print("\nCluster Analysis:")
    print(customer_data.groupby('Cluster').mean())

customer_segmentation()
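
Every example above standardizes the features before clustering. The sketch below illustrates why, using hypothetical data in which two groups differ only in age while income is unrelated noise on a much larger numeric scale. The helper `agreement` and all variable names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two groups separated only by age; income is noise with a far larger scale
age = np.concatenate([rng.normal(25, 2, 200), rng.normal(60, 2, 200)])
income = rng.normal(50000, 20000, 400)
X_demo = np.column_stack([age, income])
true_group = np.repeat([0, 1], 200)

def agreement(labels, truth):
    """Fraction of points matching truth, allowing the two labels to be swapped."""
    acc = np.mean(labels == truth)
    return max(acc, 1 - acc)

raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo)
scaled_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X_demo)
)

print("Agreement without scaling:", agreement(raw_labels, true_group))
print("Agreement with scaling:   ", agreement(scaled_labels, true_group))
```

Without scaling, Euclidean distance is dominated by income, so the split ignores the age groups; after standardization both features contribute on comparable scales.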

## Practical Exercises

In [None]:
# Exercise 1: Mall Customer Segmentation

def mall_customer_segmentation():
    # Generate mall customer data
    np.random.seed(42)
    n_customers = 500
    
    # Create features
    age = np.random.normal(35, 10, n_customers)
    annual_income = np.random.normal(60000, 20000, n_customers)
    spending_score = np.random.normal(50, 25, n_customers)
    visit_frequency = np.random.poisson(10, n_customers)
    
    # Create DataFrame
    mall_data = pd.DataFrame({
        'Age': age,
        'AnnualIncome': annual_income,
        'SpendingScore': spending_score,
        'VisitFrequency': visit_frequency
    })
    
    print("Sample of mall customer data:")
    print(mall_data.head())
    
    # Task: Segment customers using K-means
    # 1. Preprocess the data
    # 2. Find optimal number of clusters
    # 3. Apply K-means clustering
    # 4. Analyze and visualize segments
    
    # Your code here

mall_customer_segmentation()

In [None]:
# Exercise 2: Document Clustering

def document_clustering():
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    # Sample documents
    documents = [
        "Machine learning is fascinating",
        "Deep learning revolutionizes AI",
        "Python programming is fun",
        "Data science uses machine learning",
        "Programming in Python is easy",
        "AI transforms technology",
        "Learning Python programming",
        "Data analysis with machine learning"
    ]
    
    print("Sample documents:")
    for i, doc in enumerate(documents, 1):
        print(f"{i}. {doc}")
    
    # Task: Cluster similar documents
    # 1. Convert text to TF-IDF features
    # 2. Apply K-means clustering
    # 3. Analyze document clusters
    # 4. Visualize results
    
    # Your code here

document_clustering()

## MCQ Quiz

1. What type of learning is K-means?
   - a) Supervised
   - b) Unsupervised
   - c) Reinforcement
   - d) Semi-supervised

2. What does the 'k' in K-means represent?
   - a) Number of features
   - b) Number of clusters
   - c) Number of iterations
   - d) Number of samples

3. How is the optimal number of clusters typically determined?
   - a) Random selection
   - b) Elbow method
   - c) Fixed value
   - d) Maximum value

4. What is inertia in K-means?
   - a) Number of iterations
   - b) Cluster size
   - c) Sum of squared distances from each point to its assigned centroid
   - d) Number of features

5. Which preprocessing step is important for K-means?
   - a) Feature scaling
   - b) Feature selection
   - c) Dimensionality reduction
   - d) Feature engineering

6. What is a centroid in K-means?
   - a) First data point
   - b) Cluster center
   - c) Median point
   - d) Random point

7. What is the silhouette score used for?
   - a) Feature selection
   - b) Cluster evaluation
   - c) Data preprocessing
   - d) Model selection

8. Which statement about K-means is NOT true?
   - a) It is sensitive to outliers
   - b) It needs k chosen in advance
   - c) It handles categorical data natively
   - d) It assumes roughly spherical clusters

9. What is the time complexity of K-means?
   - a) O(n)
   - b) O(n²)
   - c) O(nkdi), for n samples, k clusters, d features, and i iterations
   - d) O(log n)

10. Which of these is the poorest fit for standard K-means?
    - a) Customer segmentation
    - b) Document clustering
    - c) Time series data
    - d) Image compression

Answers: 1-b, 2-b, 3-b, 4-c, 5-a, 6-b, 7-b, 8-c, 9-c, 10-c
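
The definition of inertia behind question 4 can be checked numerically. The sketch below, on hypothetical two-blob data, recomputes the sum of squared distances from each point to its assigned centroid and compares it with the `inertia_` attribute reported by scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical data: two blobs of 100 points each
X_demo = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)

# Look up the centroid assigned to each point, then sum squared distances
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = ((X_demo - assigned_centers) ** 2).sum()

print("Manual inertia:  ", manual_inertia)
print("kmeans.inertia_: ", km.inertia_)
```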