# Module 15: Unsupervised Learning

**Estimated Time**: 75 minutes

## Learning Objectives

By the end of this module, you will be able to apply core unsupervised learning techniques to unlabeled data.

Topics covered:
- Clustering Fundamentals
- K-Means Clustering
- DBSCAN for Density-Based Clustering
- Hierarchical Clustering
- Dimensionality Reduction with PCA
- t-SNE for Visualization
- Anomaly Detection
- Customer Segmentation Project

## Prerequisites

- Modules 00-11 completed
- Intermediate Python and ML knowledge

---

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import warnings

warnings.filterwarnings("ignore")

%matplotlib inline

print("Libraries loaded successfully!")

## 1. Clustering Fundamentals

**Clustering** is an unsupervised learning technique that groups similar data points together without predefined labels.

> **"Birds of a feather flock together"** - Clustering finds natural groupings in data

### What is Unsupervised Learning?

Unlike supervised learning (where we have labels), **unsupervised learning** finds patterns in unlabeled data:

- **No target variable** - We don't know the "correct answer"
- **Discover hidden patterns** - Find structure in data
- **Exploratory analysis** - Understand data better

### The Clustering Problem

**Given**: Dataset with features but no labels
**Goal**: Group similar items together

**Example Use Cases:**
- 🛍️ **Customer Segmentation**: Group customers by behavior
- 📰 **Document Clustering**: Organize articles by topic
- 🧬 **Gene Expression**: Find genes with similar patterns
- 🌍 **Image Segmentation**: Group pixels by similarity
- 🎵 **Music Recommendation**: Find similar songs

### How Clustering Works

1. **Choose similarity metric** (usually Euclidean distance)
2. **Select number of clusters** (k), if the algorithm requires it
3. **Algorithm assigns each point** to a cluster
4. **Evaluate clustering quality**

### Key Concepts

**Similarity/Distance Metrics:**
- **Euclidean Distance**: Straight-line distance (most common)
- **Manhattan Distance**: Sum of absolute differences
- **Cosine Similarity**: Angle between vectors

**Within-Cluster Sum of Squares (WCSS):**
- Measures compactness of clusters
- Lower is better
- Used to find optimal k (Elbow Method)

**Silhouette Score:**
- Measures how well-separated clusters are
- Range: -1 to 1 (higher is better)
- Rule of thumb: a score above 0.5 suggests reasonably well-separated clusters; values near 0 indicate overlapping clusters
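
For reference, the silhouette coefficient of a single point $i$ is commonly defined as below, where $a(i)$ is the mean distance from $i$ to the other points in its own cluster and $b(i)$ is the mean distance from $i$ to the points of the nearest other cluster. The reported score is the average of $s(i)$ over all points.

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$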

### Types of Clustering Algorithms

| Algorithm | Type | Strengths | Use When |
|-----------|------|-----------|----------|
| **K-Means** | Centroid-based | Fast, scalable | Spherical clusters, known k |
| **DBSCAN** | Density-based | Finds arbitrary shapes | Unknown k, outliers |
| **Hierarchical** | Hierarchical | Creates dendrogram | Small datasets, explore k |
| **Gaussian Mixture** | Probabilistic | Soft clustering | Overlapping clusters |

Let's visualize clustering with a simple example!

In [None]:
# Clustering Fundamentals - Visualization
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

print("=" * 60)
print("CLUSTERING FUNDAMENTALS DEMONSTRATION")
print("=" * 60)

# Generate synthetic data with 3 clear clusters
np.random.seed(42)
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

print(f"\nGenerated {X.shape[0]} samples with {X.shape[1]} features")
print(
    f"Data range: X1=[{X[:, 0].min():.2f}, {X[:, 0].max():.2f}], X2=[{X[:, 1].min():.2f}, {X[:, 1].max():.2f}]"
)

# Visualize the raw data
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# 1. Raw unlabeled data
axes[0].scatter(X[:, 0], X[:, 1], c="gray", alpha=0.6, edgecolors="k", s=50)
axes[0].set_title(
    "Raw Data (Unlabeled)\nCan you spot the clusters?", fontsize=12, fontweight="bold"
)
axes[0].set_xlabel("Feature 1")
axes[0].set_ylabel("Feature 2")
axes[0].grid(True, alpha=0.3)

# 2. True clusters (hidden in real scenarios)
axes[1].scatter(X[:, 0], X[:, 1], c=y_true, cmap="viridis", alpha=0.6, edgecolors="k", s=50)
axes[1].set_title(
    "True Clusters (Unknown in Real Life)\n3 distinct groups", fontsize=12, fontweight="bold"
)
axes[1].set_xlabel("Feature 1")
axes[1].set_ylabel("Feature 2")
axes[1].grid(True, alpha=0.3)

# 3. Clustered data with K-Means
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
y_pred = kmeans.fit_predict(X)
centers = kmeans.cluster_centers_

axes[2].scatter(X[:, 0], X[:, 1], c=y_pred, cmap="viridis", alpha=0.6, edgecolors="k", s=50)
axes[2].scatter(
    centers[:, 0],
    centers[:, 1],
    c="red",
    marker="X",
    s=300,
    edgecolors="black",
    linewidths=2,
    label="Centroids",
)
axes[2].set_title("K-Means Clustering Result\nFound 3 clusters!", fontsize=12, fontweight="bold")
axes[2].set_xlabel("Feature 1")
axes[2].set_ylabel("Feature 2")
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Calculate clustering quality metrics
inertia = kmeans.inertia_  # WCSS
silhouette = silhouette_score(X, y_pred)

print("\n" + "=" * 60)
print("CLUSTERING QUALITY METRICS")
print("=" * 60)
print(f"\nWithin-Cluster Sum of Squares (WCSS): {inertia:.2f}")
print(f"  → Lower is better (measures compactness)")
print(f"\nSilhouette Score: {silhouette:.4f}")
print(f"  → Range: -1 to 1 (higher is better)")
print(
    f"  → Interpretation: {'Excellent' if silhouette > 0.7 else 'Good' if silhouette > 0.5 else 'Fair' if silhouette > 0.3 else 'Poor'}"
)

# Distance metrics demonstration
print("\n" + "=" * 60)
print("DISTANCE METRICS COMPARISON")
print("=" * 60)

point_a = np.array([1, 2])
point_b = np.array([4, 6])

# Euclidean distance
euclidean = np.sqrt(np.sum((point_a - point_b) ** 2))
print(f"\nPoint A: {point_a}, Point B: {point_b}")
print(f"\n1. Euclidean Distance: {euclidean:.4f}")
print(f"   Formula: √[(x₁-x₂)² + (y₁-y₂)²]")

# Manhattan distance
manhattan = np.sum(np.abs(point_a - point_b))
print(f"\n2. Manhattan Distance: {manhattan:.4f}")
print(f"   Formula: |x₁-x₂| + |y₁-y₂|")

# Cosine similarity
cosine_sim = np.dot(point_a, point_b) / (np.linalg.norm(point_a) * np.linalg.norm(point_b))
cosine_dist = 1 - cosine_sim
print(f"\n3. Cosine Similarity: {cosine_sim:.4f}")
print(f"   Cosine Distance: {cosine_dist:.4f}")
print(f"   → Measures angle between vectors (0=identical direction, 1=orthogonal)")

print("\n✓ Clustering fundamentals demonstrated!")

## 2. K-Means Clustering

**K-Means** is the most popular clustering algorithm - simple, fast, and effective for many use cases.

### How K-Means Works

**Algorithm Steps:**

1. **Initialize**: Randomly select k centroids (cluster centers)
2. **Assignment**: Assign each point to nearest centroid
3. **Update**: Recalculate centroids as mean of assigned points
4. **Repeat**: Steps 2-3 until convergence (centroids stop moving)
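
The four steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, the empty-cluster handling, and the toy data are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain Lloyd iterations with random initial centroids (illustrative only)."""
    rng = np.random.default_rng(seed)
    # 1. Initialize: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assignment: index of the nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: mean of the assigned points (keep the old centroid if a cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Repeat until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two tight, well-separated groups of points (hypothetical toy data)
X_demo = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels_demo, centroids_demo = kmeans_sketch(X_demo, k=2)
print(labels_demo)
print(centroids_demo)
```

On this toy data the first three points end up in one cluster and the last three in the other; which cluster gets label 0 depends on the random initialization.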

### Mathematical Formulation

**Objective**: Minimize Within-Cluster Sum of Squares (WCSS)

$$\text{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} ||x - \mu_i||^2$$

Where:
- $k$ = number of clusters
- $C_i$ = cluster i
- $\mu_i$ = centroid of cluster i
- $||x - \mu_i||^2$ = squared Euclidean distance
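
As a quick check of the formula, the sketch below computes WCSS by hand from a fitted model's labels and centroids and compares it with scikit-learn's `inertia_` attribute. The synthetic data and the variable names (`X_toy`, `km`, `wcss_manual`) are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_toy, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.6, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_toy)

# Sum over clusters of squared distances from each point to its cluster centroid
wcss_manual = sum(
    np.sum((X_toy[km.labels_ == i] - km.cluster_centers_[i]) ** 2)
    for i in range(km.n_clusters)
)
print(f"manual WCSS: {wcss_manual:.4f}")
print(f"inertia_:    {km.inertia_:.4f}")
```

The two printed values should agree up to floating-point rounding.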

### The Elbow Method

**Problem**: How do we choose k?

**Solution**: Plot WCSS vs k and find the "elbow"

- **Too few clusters**: High WCSS (poor fit)
- **Too many clusters**: Overfitting, no meaningful groups
- **Elbow point**: Sweet spot where adding more clusters doesn't help much

### Pros & Cons

**Advantages:**
- ✓ Simple and intuitive
- ✓ Fast - O(n × k × i × d) where i = iterations and d = number of features
- ✓ Scales well to large datasets
- ✓ Guaranteed to converge (though possibly to a local optimum)

**Limitations:**
- ✗ Must specify k in advance
- ✗ Sensitive to initial centroids (use k-means++)
- ✗ Assumes spherical clusters of similar size
- ✗ Affected by outliers

### K-Means++ Initialization

**Problem**: Random initialization can lead to poor results

**Solution**: K-Means++ chooses initial centroids smartly:
1. Choose first centroid randomly
2. For each next centroid, sample a point with probability proportional to its squared distance from the nearest existing centroid
3. This spreads out initial centroids

**Result**: Faster convergence and better clusters
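
The seeding idea can be sketched as follows, assuming the standard k-means++ rule of sampling each new centroid with probability proportional to its squared distance from the nearest centroid already chosen. The function name `kmeans_pp_init` and the toy data are hypothetical; in practice `KMeans(init="k-means++")` handles this for you.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """D^2-weighted seeding in the style of k-means++ (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Squared distance from each point to its nearest chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = np.min(np.sum(diffs ** 2, axis=2), axis=1)
        # Sample the next centroid with probability proportional to that squared distance
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

# Hypothetical toy data: two pairs of nearby points, far apart from each other
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]])
init_centroids = kmeans_pp_init(X_demo, k=2)
print(init_centroids)
```

Because an already chosen point has squared distance zero, it can never be drawn again, and distant points are much more likely to be picked than nearby ones.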

### Best Practices

1. **Standardize features** - Different scales can distort distances
2. **Use K-Means++** initialization (`init='k-means++'`)
3. **Run multiple times** (`n_init=10`) and keep best result
4. **Validate with silhouette score** to confirm k is reasonable

Let's implement K-Means and find the optimal k!

In [None]:
# K-Means Clustering - Complete Implementation
from sklearn.preprocessing import StandardScaler

print("=" * 60)
print("K-MEANS CLUSTERING WITH ELBOW METHOD")
print("=" * 60)

# Load customer data for segmentation
df = pd.read_csv("../../data_advanced/feature_engineering.csv")
print(f"\nDataset: {df.shape[0]} customers, {df.shape[1]} features")

# Select features for clustering
features = ["age", "income", "education_years", "experience_years", "num_dependents"]
X = df[features].copy()

# Standardize features (crucial for K-Means!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"\nFeatures for clustering: {features}")
print(f"\nOriginal data range:")
print(X.describe().loc[["min", "max"]])
print(f"\nStandardized data range (mean=0, std=1):")
print(pd.DataFrame(X_scaled, columns=features).describe().loc[["mean", "std"]])

# Find optimal k using Elbow Method
print("\n" + "=" * 60)
print("ELBOW METHOD: Finding Optimal k")
print("=" * 60)

K_range = range(2, 11)
wcss = []
silhouette_scores = []

for k in K_range:
    kmeans = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_scaled, kmeans.labels_))
    print(f"k={k}: WCSS={kmeans.inertia_:.2f}, Silhouette={silhouette_scores[-1]:.4f}")

# Visualize Elbow Method and Silhouette Scores
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# 1. Elbow Plot
axes[0].plot(K_range, wcss, "bo-", linewidth=2, markersize=8)
axes[0].set_xlabel("Number of Clusters (k)", fontsize=12)
axes[0].set_ylabel("WCSS (Within-Cluster Sum of Squares)", fontsize=12)
axes[0].set_title('Elbow Method\nFind the "elbow" point', fontsize=14, fontweight="bold")
axes[0].grid(True, alpha=0.3)
axes[0].axvline(x=3, color="red", linestyle="--", alpha=0.5, label="Suggested k=3")
axes[0].legend()

# 2. Silhouette Score
axes[1].plot(K_range, silhouette_scores, "go-", linewidth=2, markersize=8)
axes[1].set_xlabel("Number of Clusters (k)", fontsize=12)
axes[1].set_ylabel("Silhouette Score", fontsize=12)
axes[1].set_title("Silhouette Score by k\nHigher is better", fontsize=14, fontweight="bold")
axes[1].axhline(y=0.5, color="orange", linestyle="--", alpha=0.5, label="Good threshold (0.5)")
axes[1].grid(True, alpha=0.3)
axes[1].legend()

plt.tight_layout()
plt.show()

# k=3 is fixed here for the walkthrough; confirm it against the elbow and silhouette plots above
optimal_k = 3
print(f"\n✓ Using k = {optimal_k} (check against the elbow and silhouette plots)")

# Fit final K-Means model
kmeans_final = KMeans(n_clusters=optimal_k, init="k-means++", n_init=10, random_state=42)
clusters = kmeans_final.fit_predict(X_scaled)

# Add cluster labels to dataframe
df["cluster"] = clusters

# Analyze clusters
print("\n" + "=" * 60)
print("CLUSTER ANALYSIS")
print("=" * 60)

for i in range(optimal_k):
    cluster_data = df[df["cluster"] == i]
    print(f"\n📊 Cluster {i} ({len(cluster_data)} customers, {len(cluster_data)/len(df)*100:.1f}%)")
    print(cluster_data[features].mean().to_string())

# Visualize clusters (using PCA for 2D)
from sklearn.decomposition import PCA

pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# 1. Clusters in PCA space
for i in range(optimal_k):
    mask = clusters == i
    axes[0].scatter(
        X_pca[mask, 0], X_pca[mask, 1], label=f"Cluster {i}", alpha=0.6, s=50, edgecolors="k"
    )

# Plot centroids
centroids_pca = pca.transform(kmeans_final.cluster_centers_)
axes[0].scatter(
    centroids_pca[:, 0],
    centroids_pca[:, 1],
    c="red",
    marker="X",
    s=300,
    edgecolors="black",
    linewidths=2,
    label="Centroids",
    zorder=5,
)
axes[0].set_xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)")
axes[0].set_ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)")
axes[0].set_title("K-Means Clustering Results\n(PCA Visualization)", fontsize=14, fontweight="bold")
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# 2. Feature comparison across clusters
cluster_means = df.groupby("cluster")[features].mean()
cluster_means_scaled = (cluster_means - cluster_means.mean()) / cluster_means.std()

im = axes[1].imshow(cluster_means_scaled.T, cmap="RdYlGn", aspect="auto", vmin=-2, vmax=2)
axes[1].set_xticks(range(optimal_k))
axes[1].set_xticklabels([f"Cluster {i}" for i in range(optimal_k)])
axes[1].set_yticks(range(len(features)))
axes[1].set_yticklabels(features)
axes[1].set_title("Cluster Profiles\n(Standardized Feature Means)", fontsize=14, fontweight="bold")
plt.colorbar(im, ax=axes[1], label="Standardized Value")

for i in range(optimal_k):
    for j in range(len(features)):
        text = axes[1].text(
            i,
            j,
            f"{cluster_means_scaled.iloc[i, j]:.2f}",
            ha="center",
            va="center",
            color="black",
            fontsize=10,
        )

plt.tight_layout()
plt.show()

print(f"\n✓ K-Means clustering complete with k={optimal_k}!")

## 3. DBSCAN for Density-Based Clustering

**DBSCAN** (Density-Based Spatial Clustering of Applications with Noise) discovers clusters of arbitrary shapes and identifies outliers.

### Why DBSCAN?

**K-Means Limitations:**
- Assumes spherical clusters
- Requires specifying k
- Sensitive to outliers

**DBSCAN Advantages:**
- ✓ No need to specify number of clusters
- ✓ Finds arbitrarily shaped clusters
- ✓ Identifies outliers as noise
- ✓ Works well with spatial data

### How DBSCAN Works

**Core Concepts:**

1. **ε (epsilon)**: Maximum distance between two points to be neighbors
2. **MinPts**: Minimum points to form a dense region

**Point Types:**
- **Core Point**: Has ≥ MinPts neighbors within ε
- **Border Point**: Within ε of a core point, but has < MinPts neighbors
- **Noise Point**: Neither core nor border (outlier)

**Algorithm:**
1. Pick an unvisited point
2. If it's a core point, start a cluster and expand it
3. Add all density-reachable points to the cluster
4. Repeat until all points visited
5. Points not in any cluster = noise
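
The procedure above can be sketched with a brute-force distance matrix. This is an O(n²) illustration, not scikit-learn's `DBSCAN`; the function name `dbscan_sketch` and the toy data are assumptions, and neighborhoods here count the point itself, as scikit-learn's `min_samples` does.

```python
import numpy as np

def dbscan_sketch(X, eps, min_pts):
    """Label points with cluster ids 0, 1, ... and -1 for noise (O(n^2) sketch)."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Neighborhoods include the point itself
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_pts for nb in neighbors])

    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue  # already assigned, or not a core point to start from
        labels[i] = cluster_id
        stack = [i]
        while stack:
            p = stack.pop()
            if not is_core[p]:
                continue  # border points join the cluster but do not expand it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster_id
                    stack.append(q)
        cluster_id += 1
    return labels

# Hypothetical toy data: two small squares of points plus one isolated point
X_demo = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],
                   [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],
                   [20.0, 20.0]])
labels_demo = dbscan_sketch(X_demo, eps=0.5, min_pts=3)
print(labels_demo)
```

The four points of each square form one cluster, and the isolated point keeps the noise label -1.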

### Choosing Parameters

**ε (epsilon):**
- Too small: Many small clusters and noise
- Too large: All points in one cluster
- **Method**: k-distance plot (look for "knee")

**MinPts:**
- Rule of thumb: `MinPts ≥ D + 1` where D = dimensions
- For 2D data: MinPts = 4 is common
- Higher values = more noise detected

### Pros & Cons

**Advantages:**
- ✓ No need to specify k
- ✓ Handles non-spherical clusters
- ✓ Robust to outliers (labels them as noise)
- ✓ Only 2 parameters to tune

**Limitations:**
- ✗ Struggles with varying densities
- ✗ Sensitive to ε and MinPts
- ✗ Border points reachable from two clusters are assigned by processing order
- ✗ Computationally expensive for large datasets (up to O(n²) without a spatial index)

### When to Use DBSCAN

- **Spatial clustering**: Geographic data, maps
- **Anomaly detection**: Outliers are valuable
- **Non-spherical clusters**: Moon shapes, rings
- **Unknown k**: Don't know number of clusters

Let's compare DBSCAN with K-Means on different cluster shapes!

In [None]:
# DBSCAN - Comprehensive Demonstration
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons, make_circles

print("=" * 60)
print("DBSCAN vs K-MEANS: CLUSTER SHAPE COMPARISON")
print("=" * 60)

# Create datasets with different shapes
np.random.seed(42)

# 1. Moons (non-linear separable)
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# 2. Circles (concentric)
X_circles, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=42)

# 3. Blobs (well-separated spherical)
X_blobs, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

datasets = [("Moons", X_moons), ("Circles", X_circles), ("Blobs", X_blobs)]

fig, axes = plt.subplots(3, 3, figsize=(18, 16))

for idx, (name, X) in enumerate(datasets):
    # Original data
    axes[idx, 0].scatter(X[:, 0], X[:, 1], c="gray", s=30, alpha=0.6, edgecolors="k")
    axes[idx, 0].set_title(f"{name}\n(Unlabeled)", fontsize=12, fontweight="bold")
    axes[idx, 0].set_ylabel(name, fontsize=14, fontweight="bold")
    if idx == 0:
        axes[idx, 0].set_xlabel("K-Means FAILS →", fontsize=11, color="red")

    # K-Means clustering (k set to the number of groups each dataset was generated with)
    n_k = 3 if name == "Blobs" else 2
    kmeans = KMeans(n_clusters=n_k, random_state=42, n_init=10)
    y_kmeans = kmeans.fit_predict(X)

    axes[idx, 1].scatter(
        X[:, 0], X[:, 1], c=y_kmeans, cmap="viridis", s=30, alpha=0.6, edgecolors="k"
    )
    axes[idx, 1].scatter(
        kmeans.cluster_centers_[:, 0],
        kmeans.cluster_centers_[:, 1],
        c="red",
        marker="X",
        s=200,
        edgecolors="black",
        linewidths=2,
    )
    axes[idx, 1].set_title(
        f"K-Means (k={n_k})\nSilhouette: {silhouette_score(X, y_kmeans):.3f}",
        fontsize=12,
        fontweight="bold",
    )

    # DBSCAN clustering
    if name == "Moons":
        eps, min_samples = 0.3, 5
    elif name == "Circles":
        eps, min_samples = 0.2, 5
    else:
        eps, min_samples = 0.5, 5

    dbscan = DBSCAN(eps=eps, min_samples=min_samples)
    y_dbscan = dbscan.fit_predict(X)

    # Count clusters and noise
    n_clusters = len(set(y_dbscan)) - (1 if -1 in y_dbscan else 0)
    n_noise = list(y_dbscan).count(-1)

    # Plot DBSCAN results
    unique_labels = set(y_dbscan)
    colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))

    for k, col in zip(unique_labels, colors):
        if k == -1:
            # Noise points in black
            col = "black"
            markersize = 20
        else:
            markersize = 30

        class_member_mask = y_dbscan == k
        xy = X[class_member_mask]
        axes[idx, 2].scatter(
            xy[:, 0],
            xy[:, 1],
            c=[col],
            s=markersize,
            alpha=0.6,
            edgecolors="k",
            label=f"Cluster {k}" if k != -1 else "Noise",
        )

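    # Score clustered points only; noise (label -1) is not a cluster of its own
    clustered = y_dbscan != -1
    silh = silhouette_score(X[clustered], y_dbscan[clustered]) if n_clusters > 1 else 0
    axes[idx, 2].set_title(
        f"DBSCAN (ε={eps}, MinPts={min_samples})\n"
        f"Clusters: {n_clusters}, Noise: {n_noise}, Silhouette: {silh:.3f}",
        fontsize=12,
        fontweight="bold",
    )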
    if idx == 0:
        axes[idx, 2].set_xlabel("DBSCAN SUCCEEDS →", fontsize=11, color="green")

for ax in axes.flat:
    ax.grid(True, alpha=0.3)
    ax.set_xticks([])
    ax.set_yticks([])

plt.tight_layout()
plt.show()

# Find optimal epsilon using k-distance plot
print("\n" + "=" * 60)
print("FINDING OPTIMAL EPSILON (ε)")
print("=" * 60)

from sklearn.neighbors import NearestNeighbors

# Use moons dataset
X = X_moons

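# Calculate k-distances (k=MinPts). Querying the fitted points returns each point
# as its own nearest neighbor (distance 0), which matches DBSCAN's min_samples
# counting the point itself.
k = 5
neighbors = NearestNeighbors(n_neighbors=k)
neighbors.fit(X)
distances, indices = neighbors.kneighbors(X)

# Sort the distance to the k-th neighbor (self included)
k_distances = np.sort(distances[:, k - 1], axis=0)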

plt.figure(figsize=(12, 5))

# Plot k-distance graph
plt.subplot(1, 2, 1)
plt.plot(k_distances, linewidth=2)
plt.axhline(y=0.3, color="red", linestyle="--", linewidth=2, label="Chosen ε = 0.3")
plt.xlabel("Points (sorted by distance)", fontsize=12)
plt.ylabel(f"{k}-th Nearest Neighbor Distance", fontsize=12)
plt.title('K-Distance Plot\nLook for the "knee" to find optimal ε', fontsize=14, fontweight="bold")
plt.grid(True, alpha=0.3)
plt.legend()

# Compare different epsilon values
plt.subplot(1, 2, 2)
eps_values = [0.2, 0.3, 0.5]
results = []

for eps in eps_values:
    dbscan = DBSCAN(eps=eps, min_samples=5)
    labels = dbscan.fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = list(labels).count(-1)
    results.append((eps, n_clusters, n_noise))

eps_list, clusters_list, noise_list = zip(*results)
x_pos = np.arange(len(eps_list))

plt.bar(x_pos - 0.2, clusters_list, 0.4, label="# Clusters", color="steelblue")
plt.bar(x_pos + 0.2, noise_list, 0.4, label="# Noise Points", color="coral")
plt.xlabel("Epsilon (ε)", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.title("Effect of Epsilon on Clustering", fontsize=14, fontweight="bold")
plt.xticks(x_pos, [f"{e}" for e in eps_list])
plt.legend()
plt.grid(True, alpha=0.3, axis="y")

plt.tight_layout()
plt.show()

print("\nOptimal ε selection:")
for eps, n_clusters, n_noise in results:
    print(f"  ε={eps}: {n_clusters} clusters, {n_noise} noise points")

print("\n✓ DBSCAN excels at non-spherical clusters!")

## 4. Hierarchical Clustering

**Hierarchical Clustering** builds a tree of clusters (dendrogram) showing relationships at multiple levels.

### Two Approaches

**1. Agglomerative (Bottom-Up) - Most Common**
- Start: Each point is its own cluster
- Repeat: Merge closest clusters
- Stop: Until one cluster remains

**2. Divisive (Top-Down) - Rare**
- Start: All points in one cluster
- Repeat: Split clusters
- Stop: Until each point is its own cluster

### Linkage Methods

How do we measure distance between clusters?

| Linkage | Description | Characteristics |
|---------|-------------|-----------------|
| **Single** | Min distance between any two points | Tends to chain, sensitive to noise |
| **Complete** | Max distance between any two points | Compact clusters, breaks chains |
| **Average** | Average distance between all pairs | Balanced approach |
| **Ward** | Minimizes within-cluster variance | Prefers equal-sized clusters (common default) |
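
To make the first three rows of the table concrete, the sketch below computes single, complete, and average linkage distances between two tiny clusters. The clusters are made-up toy points; Ward linkage is left out because it is based on the increase in within-cluster variance rather than on a summary of pairwise distances.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical toy clusters on a line
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 0.0], [6.0, 0.0]])

# All pairwise Euclidean distances between points of the two clusters
pairwise = cdist(cluster_a, cluster_b)

single = pairwise.min()     # closest pair: 3.0
complete = pairwise.max()   # farthest pair: 6.0
average = pairwise.mean()   # mean over all four pairs: 4.5
print(single, complete, average)
```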

### The Dendrogram

A **dendrogram** is a tree diagram showing cluster hierarchy:

- **Y-axis**: Distance (or dissimilarity) between clusters
- **X-axis**: Data points
- **Horizontal lines**: Merges (joins clusters)
- **Height of merge**: How different clusters are

**How to Read:**
- Cut horizontally to get desired number of clusters
- Longer vertical lines = more distinct clusters

### Choosing Number of Clusters

**Visual Method:**
1. Look at dendrogram
2. Find largest vertical gap
3. Draw horizontal line through gap
4. Count intersections = number of clusters
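
The visual method can also be approximated in code. The sketch below finds the largest gap between consecutive merge heights and cuts the tree in the middle of that gap with SciPy's `fcluster`. The synthetic data and variable names are assumptions for illustration, and the largest gap is only a heuristic.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X_toy, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=0)
Z = linkage(X_toy, method="ward")

# Gaps between consecutive merge heights; the largest gap marks where to cut
heights = Z[:, 2]
gaps = np.diff(heights)
i = int(np.argmax(gaps))
cut_height = (heights[i] + heights[i + 1]) / 2

# Points whose merges all happen below cut_height share a flat cluster label
labels = fcluster(Z, t=cut_height, criterion="distance")
print(f"cut at {cut_height:.2f} -> {labels.max()} clusters")
```

Cutting by height rather than passing a fixed k keeps the choice tied to the dendrogram itself.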

**Inconsistency Method:**
- Look for inconsistent merges (big jumps in distance)

### Pros & Cons

**Advantages:**
- ✓ No need to specify k upfront
- ✓ Dendrogram provides intuition
- ✓ Works with any distance metric
- ✓ Deterministic (same result every time)

**Limitations:**
- ✗ Computationally expensive: O(n³) for naive, O(n² log n) optimized
- ✗ Not suitable for large datasets (> 10,000 points)
- ✗ Cannot undo previous merges
- ✗ Sensitive to noise and outliers

### When to Use Hierarchical Clustering

- **Small datasets** (< 5,000 points)
- **Exploratory analysis** - Want to see hierarchy
- **Taxonomy** - Biological classification
- **Unknown k** - Dendrogram helps choose

Let's create dendrograms and compare linkage methods!

In [None]:
# Hierarchical Clustering - Dendrograms and Linkage Comparison
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering

print("=" * 60)
print("HIERARCHICAL CLUSTERING WITH DENDROGRAMS")
print("=" * 60)

# Generate sample data
np.random.seed(42)
X, y = make_blobs(n_samples=50, centers=3, cluster_std=0.7, random_state=42)

print(f"\nDataset: {X.shape[0]} points")
print("Goal: Discover hierarchy without knowing k=3")

# Compare different linkage methods
linkage_methods = ["ward", "complete", "average", "single"]

fig, axes = plt.subplots(2, 2, figsize=(18, 14))
axes = axes.ravel()

for idx, method in enumerate(linkage_methods):
    # Perform hierarchical clustering
    Z = linkage(X, method=method)

    # Create dendrogram
    dendrogram(Z, ax=axes[idx], color_threshold=7.5 if method == "ward" else None)
    axes[idx].set_title(
        f"Dendrogram - {method.upper()} Linkage\n" f'{"(RECOMMENDED)" if method == "ward" else ""}',
        fontsize=14,
        fontweight="bold",
    )
    axes[idx].set_xlabel("Data Point Index", fontsize=12)
    axes[idx].set_ylabel("Distance (Height)", fontsize=12)
    axes[idx].grid(True, alpha=0.3, axis="y")

    # Add horizontal line to show cut
    if method == "ward":
        axes[idx].axhline(y=7.5, color="red", linestyle="--", linewidth=2, label="Cut line (k=3)")
        axes[idx].legend()

plt.tight_layout()
plt.show()

# Detailed analysis with Ward linkage
print("\n" + "=" * 60)
print("DETAILED ANALYSIS: WARD LINKAGE")
print("=" * 60)

Z_ward = linkage(X, method="ward")

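# Find the largest jump between consecutive merge distances (last 10 merges)
merge_dists = Z_ward[:, 2]
start = max(0, len(Z_ward) - 10)
jump_idx = start + 1 + int(np.argmax(np.diff(merge_dists[start:])))
suggested_k = len(Z_ward) - jump_idx + 1  # clusters remaining just before that merge

print("\nMerge distances (last 10 merges):")
print("Step | Distance | Interpretation")
print("-" * 40)
for i in range(start, len(Z_ward)):
    note = f"← Big jump! Suggests k={suggested_k}" if i == jump_idx else ""
    print(f"{i+1:4d} | {merge_dists[i]:8.2f} | {note}")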

# Plot dendrogram with cut line
plt.figure(figsize=(16, 6))

plt.subplot(1, 2, 1)
dendrogram(Z_ward, color_threshold=7.5, above_threshold_color="gray")
plt.axhline(y=7.5, color="red", linestyle="--", linewidth=2, label="Cut for k=3")
plt.title("Dendrogram with Cut Line\nWard Linkage", fontsize=14, fontweight="bold")
plt.xlabel("Data Point Index", fontsize=12)
plt.ylabel("Distance (Ward)", fontsize=12)
plt.legend()
plt.grid(True, alpha=0.3, axis="y")

# Apply clustering with different k values
plt.subplot(1, 2, 2)
k_values = [2, 3, 4]
colors_list = ["viridis", "plasma", "coolwarm"]

for i, k in enumerate(k_values):
    hierarchical = AgglomerativeClustering(n_clusters=k, linkage="ward")
    labels = hierarchical.fit_predict(X)
    silh = silhouette_score(X, labels)

    plt.scatter(
        X[:, 0],
        X[:, 1],
        c=labels,
        cmap=colors_list[i],
        s=100,
        alpha=0.3,
        edgecolors="k",
        linewidth=0.5,
    )
    plt.text(
        X[:, 0].mean(),
        X[:, 1].max() - i * 0.5,
        f"k={k}, Silhouette={silh:.3f}",
        fontsize=11,
        bbox=dict(boxstyle="round", facecolor="white", alpha=0.8),
    )

plt.title("Hierarchical Clustering Results\nComparing k=2, 3, 4", fontsize=14, fontweight="bold")
plt.xlabel("Feature 1", fontsize=12)
plt.ylabel("Feature 2", fontsize=12)
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Apply final clustering with k=3
hierarchical_final = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels_final = hierarchical_final.fit_predict(X)

print(f"\n✓ Hierarchical clustering complete!")
print(f"Chosen k=3 based on dendrogram analysis")
print(f"Silhouette Score: {silhouette_score(X, labels_final):.4f}")

# Real-world example: Customer segmentation
print("\n" + "=" * 60)
print("REAL-WORLD APPLICATION: CUSTOMER HIERARCHY")
print("=" * 60)

# Load customer data (subset for speed)
df_cust = pd.read_csv("../../data_advanced/feature_engineering.csv").head(100)
features = ["age", "income", "education_years"]
X_cust = df_cust[features].values

# Standardize
scaler = StandardScaler()
X_cust_scaled = scaler.fit_transform(X_cust)

# Hierarchical clustering
Z_cust = linkage(X_cust_scaled, method="ward")

plt.figure(figsize=(16, 5))

plt.subplot(1, 2, 1)
dendrogram(Z_cust, truncate_mode="lastp", p=12, show_leaf_counts=True)
plt.title(
    "Customer Segmentation Hierarchy\n(Truncated: Last 12 Merges)", fontsize=14, fontweight="bold"
)
plt.xlabel("Cluster Size (in parentheses)", fontsize=12)
plt.ylabel("Ward Distance", fontsize=12)
plt.grid(True, alpha=0.3, axis="y")

# Cluster customers
hierarchical_cust = AgglomerativeClustering(n_clusters=4, linkage="ward")
df_cust["segment"] = hierarchical_cust.fit_predict(X_cust_scaled)

# Analyze segments
plt.subplot(1, 2, 2)
segment_means = df_cust.groupby("segment")[features].mean()
segment_means_norm = (segment_means - segment_means.mean()) / segment_means.std()

im = plt.imshow(segment_means_norm.T, cmap="RdYlGn", aspect="auto", vmin=-2, vmax=2)
plt.colorbar(im, label="Standardized Mean")
plt.xticks(range(4), [f"Segment {i}" for i in range(4)])
plt.yticks(range(len(features)), features)
plt.title("Customer Segment Profiles\n(4 Segments)", fontsize=14, fontweight="bold")

for i in range(4):
    for j in range(len(features)):
        text = plt.text(
            i,
            j,
            f"{segment_means_norm.iloc[i, j]:.2f}",
            ha="center",
            va="center",
            color="black",
            fontsize=11,
        )

plt.tight_layout()
plt.show()

print("\nSegment sizes:")
print(df_cust["segment"].value_counts().sort_index())

print("\n✓ Hierarchical clustering provides intuitive hierarchy visualization!")

## 5. Dimensionality Reduction with PCA

**PCA** (Principal Component Analysis) reduces high-dimensional data to fewer dimensions while preserving maximum variance.

### The Curse of Dimensionality

**Problem**: High-dimensional data is sparse and hard to visualize

- Distances between points become less informative (most points look similarly far apart)
- Models overfit easily
- Direct visualization is impossible beyond 3D
- Computational cost increases

**Solution**: Reduce dimensions intelligently

### What is PCA?

**Goal**: Find new axes (principal components) that:
1. Capture maximum variance in data
2. Are orthogonal (perpendicular) to each other
3. Are ordered by importance

**Think of it as:**
- Finding the "best camera angles" to view your data
- Compressing information with minimal loss

### How PCA Works

**Algorithm:**

1. **Standardize** data (mean=0, variance=1)
2. **Compute covariance matrix** (how features vary together)
3. **Find eigenvectors** (directions of maximum variance)
4. **Sort by eigenvalues** (variance explained by each direction)
5. **Select top k components** (keep most important)
6. **Transform data** to new coordinate system
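
The six steps can be followed line by line in NumPy. This is a sketch on made-up random data (`X_toy` and the other names are assumptions), using the covariance eigen-decomposition described above; scikit-learn's `PCA` uses an SVD internally.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 samples, 3 features, with the second feature correlated with the first
X_toy = rng.normal(size=(200, 3))
X_toy[:, 1] = 2.0 * X_toy[:, 0] + 0.1 * X_toy[:, 1]

# 1. Standardize each feature to mean 0, variance 1
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
# 2. Covariance matrix of the standardized features
cov = np.cov(X_std, rowvar=False)
# 3. Eigen-decomposition (eigh is for symmetric matrices, eigenvalues come back ascending)
eigvals, eigvecs = np.linalg.eigh(cov)
# 4. Sort directions by decreasing eigenvalue
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# 5. Keep the top k components
k = 2
components = eigvecs[:, :k]
# 6. Project the data onto the new axes
X_proj = X_std @ components

explained_ratio = eigvals / eigvals.sum()
print(explained_ratio)
print(X_proj.shape)
```

Because two of the three features are strongly correlated, most of the variance should fall on the first component.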

### Principal Components

**PC1** (First Principal Component):
- Direction of maximum variance
- Captures most information
- Example: In face data, might capture lighting

**PC2** (Second Principal Component):
- Orthogonal to PC1
- Next highest variance
- Example: Might capture face angle

**And so on...**

### Explained Variance

**How much information does each PC contain?**

- **Explained Variance Ratio**: Percentage of total variance
- **Cumulative Variance**: Running total
- **Rule of thumb**: Keep enough PCs to retain 95% variance
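
One way to apply the 95% rule of thumb is to pass a fraction as `n_components`, which asks scikit-learn's `PCA` to keep just enough components to exceed that share of variance. The sketch assumes the bundled digits dataset and hypothetical variable names.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_digits().data)

# A float in (0, 1) keeps the smallest number of components reaching that variance fraction
pca_95 = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca_95.fit_transform(X_std)

print(X_reduced.shape[1], "components kept")
print(f"cumulative variance: {pca_95.explained_variance_ratio_.sum():.3f}")
```

The 95% figure is a convention rather than a requirement; a lower threshold may be enough for visualization or for a quick baseline.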

### Pros & Cons

**Advantages:**
- ✓ Reduces overfitting (fewer features)
- ✓ Speeds up training (less data to process)
- ✓ Enables visualization (2D/3D plots)
- ✓ Removes noise (keep signal, drop noise)
- ✓ Decorrelates features

**Limitations:**
- ✗ Linear transformation only
- ✗ Assumes high variance = importance
- ✗ Components are harder to interpret than the original features
- ✗ Sensitive to scaling (must standardize)

### When to Use PCA

- **Visualization**: Plot high-D data in 2D/3D
- **Preprocessing**: Before clustering or classification
- **Compression**: Reduce storage/computation
- **Noise reduction**: Filter out minor variations
- **Multicollinearity**: Remove correlated features

Let's reduce dimensions and visualize high-D data!

In [None]:
# PCA - Comprehensive Demonstration
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

print("=" * 60)
print("PRINCIPAL COMPONENT ANALYSIS (PCA)")
print("=" * 60)

# Load high-dimensional dataset (8x8 pixel images = 64 dimensions)
digits = load_digits()
X_digits = digits.data  # 1797 samples, 64 features
y_digits = digits.target  # 0-9 digit labels

print(f"\nDigits dataset: {X_digits.shape[0]} images, {X_digits.shape[1]} features (pixels)")
print(f"Goal: Reduce from {X_digits.shape[1]}D to 2D for visualization")

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_digits)

# Apply PCA
pca = PCA(random_state=42)
X_pca_all = pca.fit_transform(X_scaled)

# Analyze explained variance
explained_var = pca.explained_variance_ratio_
cumulative_var = np.cumsum(explained_var)

print(f"\nFirst 10 components explain:")
for i in range(10):
    print(f"  PC{i+1}: {explained_var[i]:.1%} (Cumulative: {cumulative_var[i]:.1%})")

# Find components needed for 95% variance
n_components_95 = np.argmax(cumulative_var >= 0.95) + 1
print(f"\nComponents needed for 95% variance: {n_components_95}/{X_digits.shape[1]}")
print(
    f"Dimension reduction: {X_digits.shape[1]} → {n_components_95} ({n_components_95/X_digits.shape[1]:.1%})"
)

# Visualize explained variance
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# 1. Scree plot (explained variance by component)
axes[0].bar(range(1, 21), explained_var[:20], color="steelblue", alpha=0.7)
axes[0].set_xlabel("Principal Component", fontsize=12)
axes[0].set_ylabel("Explained Variance Ratio", fontsize=12)
axes[0].set_title(
    "Scree Plot\nVariance Explained by Each Component", fontsize=14, fontweight="bold"
)
axes[0].grid(True, alpha=0.3, axis="y")

# 2. Cumulative explained variance
axes[1].plot(range(1, len(cumulative_var) + 1), cumulative_var, "o-", linewidth=2, markersize=4)
axes[1].axhline(y=0.95, color="red", linestyle="--", linewidth=2, label="95% threshold")
axes[1].axvline(
    x=n_components_95,
    color="green",
    linestyle="--",
    linewidth=2,
    label=f"{n_components_95} components",
)
axes[1].set_xlabel("Number of Components", fontsize=12)
axes[1].set_ylabel("Cumulative Explained Variance", fontsize=12)
axes[1].set_title(
    "Cumulative Variance Explained\nHow many components to keep?", fontsize=14, fontweight="bold"
)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# 3. Visualize principal components as images
components_to_show = 9
axes[2].axis("off")
axes[2].set_title(
    "Top 9 Principal Components\n(Shown as 8x8 Images)", fontsize=14, fontweight="bold"
)

# Create a 3x3 grid of small axes inside the right third of the figure
for i in range(components_to_show):
    row, col = divmod(i, 3)
    plt.subplot(3, 9, row * 9 + 7 + col)  # Columns 7-9 of a 3x9 grid
    plt.imshow(pca.components_[i].reshape(8, 8), cmap="RdBu_r", aspect="auto")
    plt.title(f"PC{i+1}\n{explained_var[i]:.1%}", fontsize=9)
    plt.xticks([])
    plt.yticks([])

plt.tight_layout()
plt.show()

# Reduce to 2D for visualization
print("\n" + "=" * 60)
print("2D VISUALIZATION WITH PCA")
print("=" * 60)

pca_2d = PCA(n_components=2, random_state=42)
X_2d = pca_2d.fit_transform(X_scaled)

print(f"\nReduced to 2D")
print(f"Variance preserved: {pca_2d.explained_variance_ratio_.sum():.1%}")
print(f"  PC1: {pca_2d.explained_variance_ratio_[0]:.1%}")
print(f"  PC2: {pca_2d.explained_variance_ratio_[1]:.1%}")

# Visualize digits in 2D PCA space
plt.figure(figsize=(16, 6))

plt.subplot(1, 2, 1)
scatter = plt.scatter(
    X_2d[:, 0], X_2d[:, 1], c=y_digits, cmap="tab10", s=20, alpha=0.6, edgecolors="k", linewidth=0.5
)
plt.colorbar(scatter, label="Digit (0-9)")
plt.xlabel(f"PC1 ({pca_2d.explained_variance_ratio_[0]:.1%} variance)", fontsize=12)
plt.ylabel(f"PC2 ({pca_2d.explained_variance_ratio_[1]:.1%} variance)", fontsize=12)
plt.title("64D Data Projected to 2D\nColors = Digit Labels (0-9)", fontsize=14, fontweight="bold")
plt.grid(True, alpha=0.3)

# Show some actual digit images
plt.subplot(1, 2, 2)
plt.axis("off")
plt.title("Sample Digit Images\n(Original 8x8 pixels)", fontsize=14, fontweight="bold")

for i in range(25):
    row, col = divmod(i, 5)
    plt.subplot(5, 10, row * 10 + 6 + col)  # Columns 6-10 of a 5x10 grid (right half)
    plt.imshow(X_digits[i].reshape(8, 8), cmap="gray")
    plt.title(f"{y_digits[i]}", fontsize=10)
    plt.xticks([])
    plt.yticks([])

plt.tight_layout()
plt.show()

# PCA for data compression
print("\n" + "=" * 60)
print("PCA FOR DATA COMPRESSION")
print("=" * 60)

# Reconstruct data from reduced components
n_components_list = [2, 5, 10, 20, 64]
sample_idx = 0  # First digit

fig, axes = plt.subplots(1, len(n_components_list), figsize=(18, 4))

for idx, n_comp in enumerate(n_components_list):
    if n_comp < 64:
        # Reduce and reconstruct
        pca_temp = PCA(n_components=n_comp, random_state=42)
        X_reduced = pca_temp.fit_transform(X_scaled)
        X_reconstructed = pca_temp.inverse_transform(X_reduced)

        # Unscale
        X_reconstructed = scaler.inverse_transform(X_reconstructed)

        var_retained = pca_temp.explained_variance_ratio_.sum()
    else:
        # Original
        X_reconstructed = X_digits
        var_retained = 1.0

    # Plot
    axes[idx].imshow(X_reconstructed[sample_idx].reshape(8, 8), cmap="gray")
    axes[idx].set_title(
        f"{n_comp} components\n{var_retained:.1%} variance", fontsize=11, fontweight="bold"
    )
    axes[idx].axis("off")

plt.suptitle(
    "Image Reconstruction with Different # of Components\n"
    f"Original Digit: {y_digits[sample_idx]}",
    fontsize=14,
    fontweight="bold",
)
plt.tight_layout()
plt.show()

compression_ratio = n_components_95 / X_digits.shape[1]
print(f"\nCompression achieved: {X_digits.shape[1]} → {n_components_95} components")
print(f"Compression ratio: {compression_ratio:.1%}")
print(f"Space saved: {1 - compression_ratio:.1%}")
print(f"Information retained: {cumulative_var[n_components_95 - 1]:.1%}")

print("\n✓ PCA reduced dimensions while retaining at least 95% of the variance!")

## 6. t-SNE for Visualization

**t-SNE** (t-Distributed Stochastic Neighbor Embedding) is a powerful non-linear dimensionality reduction technique optimized for visualization.

### PCA vs t-SNE

**PCA (Linear):**
- Fast and deterministic
- Preserves global structure
- Works well for linear relationships
- Good for preprocessing

**t-SNE (Non-Linear):**
- Slow but powerful
- Preserves local structure (clusters)
- Captures non-linear relationships
- Best for visualization only

### How t-SNE Works

**Intuition**: Keep similar points close, push dissimilar points apart

**Algorithm:**
1. **Compute pairwise similarities** in high-D space
2. **Initialize** random low-D positions
3. **Iteratively adjust** positions to match high-D similarities
4. **Minimize** difference between high-D and low-D distributions
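
The objective behind these steps can be sketched in plain NumPy. The code below is a simplified illustration, not scikit-learn's implementation: it uses one fixed Gaussian width `sigma` for every point (real t-SNE tunes a width per point to match the perplexity) and it only evaluates the KL divergence for two hand-made layouts instead of optimizing one. All function and variable names are assumptions made for this example.

```python
import numpy as np

def pairwise_sq_dists(A):
    """Squared Euclidean distance between every pair of rows."""
    sq = np.sum(A ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2 * A @ A.T, 0.0)

def high_d_similarities(X, sigma=1.0):
    """Gaussian neighbor probabilities in the original space (fixed sigma: a simplification)."""
    P = np.exp(-pairwise_sq_dists(X) / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P = P / P.sum(axis=1, keepdims=True)  # conditional p(j | i)
    return (P + P.T) / (2 * len(X))       # symmetrized joint distribution

def low_d_similarities(Y):
    """Student-t (heavy-tailed) similarities in the embedding space."""
    Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q): the quantity t-SNE minimizes."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

rng = np.random.default_rng(0)
# Two well-separated groups of 20 points each in 10 dimensions
X_high = np.vstack([rng.normal(0.0, 0.5, size=(20, 10)),
                    rng.normal(5.0, 0.5, size=(20, 10))])
# A 2-D layout that keeps each group together ...
Y_good = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
                    rng.normal(10.0, 0.5, size=(20, 2))])
# ... and the same positions handed to the wrong points
Y_bad = Y_good[rng.permutation(len(Y_good))]

P = high_d_similarities(X_high)
kl_good = kl_divergence(P, low_d_similarities(Y_good))
kl_bad = kl_divergence(P, low_d_similarities(Y_bad))
print(f"KL for neighbor-preserving layout: {kl_good:.3f}")
print(f"KL for shuffled layout:            {kl_bad:.3f}")
```

The layout that keeps high-D neighbors close scores a lower KL divergence, which is the direction the iterative adjustment in step 3 moves toward.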

### Key Parameters

**perplexity** (5-50, default=30):
- Balances local vs global structure
- Think of it as "expected number of neighbors"
- Small perplexity (5-15): Focus on local clusters
- Large perplexity (30-50): More global structure
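- Rule of thumb: `5 < perplexity < n_samples/3` (scikit-learn itself requires `perplexity < n_samples`)

**learning_rate** (10-1000; default `'auto'` in scikit-learn 1.2+, 200 in older releases):
- Step size for optimization
- Too low: points stay compressed in a dense cloud
- Too high: embedding looks like a ball of roughly equidistant points
- Typical: 100-500

**n_iter** (at least 250, default=1000; renamed `max_iter` in scikit-learn 1.5):
- Maximum number of optimization iterations
- More iterations = better convergence
- Watch for "KL divergence" to stabilize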
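
As a hedged illustration of these parameters, the sketch below fits t-SNE on a small slice of the digits data with two perplexity values and reads the final KL divergence from the fitted estimator. It leaves the learning rate and iteration count at their defaults so that it runs across scikit-learn versions; the variable names are assumptions made for this example.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Small subset so the example runs in a few seconds
X_small = StandardScaler().fit_transform(load_digits().data[:200])

kl_by_perplexity = {}
for perplexity in (5, 30):
    tsne_demo = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=42)
    embedding = tsne_demo.fit_transform(X_small)
    # kl_divergence_ holds the value of the objective after optimization
    kl_by_perplexity[perplexity] = tsne_demo.kl_divergence_
    print(f"perplexity={perplexity}: shape={embedding.shape}, "
          f"final KL={tsne_demo.kl_divergence_:.3f}")
```

KL values from different perplexities are not directly comparable, because the perplexity changes the high-D distribution being matched. Use them to judge convergence within one setting, and judge the settings themselves by looking at the plots.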

### Important Caveats

**❌ DON'T:**
- Use t-SNE for anything except visualization
- Interpret distances between clusters or cluster sizes
- Expect identical plots across runs (it's stochastic)
- Use t-SNE output as features for ML

**✓ DO:**
- Use for exploratory data analysis
- Try different perplexity values
- Set random_state for reproducibility
- Use PCA first if data > 50 dimensions

### When to Use t-SNE

- **Visualizing clusters**: See how data naturally groups
- **Exploratory analysis**: Discover patterns
- **High-dimensional data**: Where a 2D PCA projection blurs the cluster structure
- **Non-linear relationships**: Manifold structures

Let's compare PCA and t-SNE on the same data!

In [None]:
# t-SNE - Comprehensive Comparison with PCA
from sklearn.manifold import TSNE
import time

print("=" * 60)
print("t-SNE vs PCA COMPARISON")
print("=" * 60)

# Use digits dataset (64 dimensions)
X = X_digits[:1000]  # Subset for speed (t-SNE is slow)
y = y_digits[:1000]

print(f"\nDataset: {X.shape[0]} samples, {X.shape[1]} features")
print("Reducing 64D → 2D for visualization")

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# PCA (fast)
print("\n" + "-" * 60)
print("Running PCA...")
start_time = time.time()
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)
pca_time = time.time() - start_time
print(f"PCA completed in {pca_time:.2f} seconds")
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.1%}")

# t-SNE (slow but better for visualization)
print("\n" + "-" * 60)
print("Running t-SNE (this takes time)...")
start_time = time.time()
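# Iteration count left at its default (1000); the argument is `n_iter` before scikit-learn 1.5 and `max_iter` after
tsne = TSNE(n_components=2, random_state=42, perplexity=30, verbose=0)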
X_tsne = tsne.fit_transform(X_scaled)
tsne_time = time.time() - start_time
print(f"t-SNE completed in {tsne_time:.2f} seconds")
print(f"t-SNE is {tsne_time/pca_time:.1f}x slower than PCA")

# Visualize both side by side
fig, axes = plt.subplots(1, 2, figsize=(18, 7))

# PCA visualization
scatter1 = axes[0].scatter(
    X_pca[:, 0], X_pca[:, 1], c=y, cmap="tab10", s=30, alpha=0.7, edgecolors="k", linewidth=0.5
)
axes[0].set_xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.1%})", fontsize=12)
axes[0].set_ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.1%})", fontsize=12)
axes[0].set_title(
    f"PCA Visualization\n" f"Linear, Fast ({pca_time:.2f}s), Preserves Global Structure",
    fontsize=14,
    fontweight="bold",
)
axes[0].grid(True, alpha=0.3)
plt.colorbar(scatter1, ax=axes[0], label="Digit")

# t-SNE visualization
scatter2 = axes[1].scatter(
    X_tsne[:, 0], X_tsne[:, 1], c=y, cmap="tab10", s=30, alpha=0.7, edgecolors="k", linewidth=0.5
)
axes[1].set_xlabel("t-SNE Dimension 1", fontsize=12)
axes[1].set_ylabel("t-SNE Dimension 2", fontsize=12)
axes[1].set_title(
    f"t-SNE Visualization\n" f"Non-Linear, Slow ({tsne_time:.1f}s), Preserves Local Structure",
    fontsize=14,
    fontweight="bold",
)
axes[1].grid(True, alpha=0.3)
plt.colorbar(scatter2, ax=axes[1], label="Digit")

plt.tight_layout()
plt.show()

# Effect of perplexity
print("\n" + "=" * 60)
print("EFFECT OF PERPLEXITY PARAMETER")
print("=" * 60)

# Use smaller subset for speed
X_small = X_scaled[:300]
y_small = y[:300]

perplexities = [5, 15, 30, 50]
fig, axes = plt.subplots(1, 4, figsize=(20, 5))

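# scikit-learn 1.5 renamed TSNE's `n_iter` to `max_iter`; use whichever name this version accepts
iter_param = "max_iter" if "max_iter" in TSNE().get_params() else "n_iter"

for idx, perp in enumerate(perplexities):
    print(f"\nRunning t-SNE with perplexity={perp}...")
    tsne_temp = TSNE(
        n_components=2, random_state=42, perplexity=perp, verbose=0, **{iter_param: 500}
    )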
    X_tsne_temp = tsne_temp.fit_transform(X_small)

    scatter = axes[idx].scatter(
        X_tsne_temp[:, 0],
        X_tsne_temp[:, 1],
        c=y_small,
        cmap="tab10",
        s=40,
        alpha=0.7,
        edgecolors="k",
        linewidth=0.5,
    )
    axes[idx].set_title(
        f"Perplexity = {perp}\n"
        f'{"Local focus" if perp <= 15 else "Balanced" if perp == 30 else "Global focus"}',
        fontsize=12,
        fontweight="bold",
    )
    axes[idx].set_xlabel("t-SNE 1")
    axes[idx].set_ylabel("t-SNE 2")
    axes[idx].grid(True, alpha=0.3)

plt.suptitle(
    "Effect of Perplexity on t-SNE\nLower = More local clusters, Higher = More global structure",
    fontsize=14,
    fontweight="bold",
)
plt.tight_layout()
plt.show()

# PCA + t-SNE pipeline for high-D data
print("\n" + "=" * 60)
print("BEST PRACTICE: PCA PREPROCESSING FOR t-SNE")
print("=" * 60)

print("\nFor high-dimensional data (>50D), use PCA first:")
print("  1. PCA: 64D → 30D (remove noise)")
print("  2. t-SNE: 30D → 2D (visualization)")

# Apply PCA first
pca_pre = PCA(n_components=30, random_state=42)
X_pca_pre = pca_pre.fit_transform(X_scaled)

print(f"\nPCA preprocessing: {X_scaled.shape[1]}D → {X_pca_pre.shape[1]}D")
print(f"Variance retained: {pca_pre.explained_variance_ratio_.sum():.1%}")

# Then t-SNE
print("\nApplying t-SNE on PCA-reduced data...")
start_time = time.time()
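# Iteration count left at its default (1000); the argument is `n_iter` before scikit-learn 1.5 and `max_iter` after
tsne_final = TSNE(n_components=2, random_state=42, perplexity=30, verbose=0)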
X_final = tsne_final.fit_transform(X_pca_pre)
final_time = time.time() - start_time

print(f"Completed in {final_time:.2f} seconds")
print(f"Direct t-SNE vs PCA + t-SNE: {tsne_time:.2f}s → {final_time:.2f}s")

# Visualize
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
scatter = plt.scatter(
    X_tsne[:, 0], X_tsne[:, 1], c=y, cmap="tab10", s=30, alpha=0.7, edgecolors="k", linewidth=0.5
)
plt.xlabel("t-SNE 1", fontsize=12)
plt.ylabel("t-SNE 2", fontsize=12)
plt.title(f"Direct t-SNE\n64D → 2D ({tsne_time:.2f}s)", fontsize=14, fontweight="bold")
plt.colorbar(scatter, label="Digit")
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
scatter = plt.scatter(
    X_final[:, 0], X_final[:, 1], c=y, cmap="tab10", s=30, alpha=0.7, edgecolors="k", linewidth=0.5
)
plt.xlabel("t-SNE 1", fontsize=12)
plt.ylabel("t-SNE 2", fontsize=12)
plt.title(f"PCA + t-SNE\n64D → 30D → 2D ({final_time:.2f}s)", fontsize=14, fontweight="bold")
plt.colorbar(scatter, label="Digit")
plt.grid(True, alpha=0.3)

plt.suptitle("Direct t-SNE vs PCA Preprocessing + t-SNE", fontsize=16, fontweight="bold")
plt.tight_layout()
plt.show()

print("\n✓ Key Takeaways:")
print("  • PCA: Fast, linear, good for preprocessing")
print("  • t-SNE: Slow, non-linear, excellent for visualization")
print("  • Best practice: PCA first, then t-SNE")
print("  • Always set random_state for reproducibility")
print("  • Try different perplexity values (5-50)")

## 7. Anomaly Detection

**Anomaly Detection** (Outlier Detection) identifies rare items, events, or observations that differ significantly from the majority.

### What are Anomalies?

**Anomalies** (outliers) are data points that deviate from normal patterns:

- **Point Anomalies**: Individual unusual instances
- **Contextual Anomalies**: Unusual in specific context (e.g., high temperature in winter)
- **Collective Anomalies**: Collection of related instances is anomalous

### Why Detect Anomalies?

**Real-World Applications:**
- 🔒 **Fraud Detection**: Unusual credit card transactions
- 🏥 **Healthcare**: Abnormal patient vitals
- 🏭 **Manufacturing**: Defective products
- 🌐 **Cybersecurity**: Network intrusions
- 📊 **Data Quality**: Corrupt or erroneous data

### Unsupervised Anomaly Detection Methods

**1. Isolation Forest**
- Isolates anomalies using random splits
- Anomalies are easier to isolate (fewer splits needed)
- Fast and effective

**2. Local Outlier Factor (LOF)**
- Compares local density to neighbors
- Outliers have lower density
- Good for varying densities

**3. One-Class SVM**
- Learns decision boundary around normal data
- Points outside boundary = anomalies
- Effective but slower

**4. Statistical Methods**
- Z-score: Points more than 3 standard deviations from the mean
- IQR: Points below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR
- Simple, but the Z-score rule assumes roughly normal data
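
The two statistical rules are short enough to write out by hand. The sketch below applies both to a made-up one-dimensional series (`readings` is hypothetical) with two planted outliers at the end.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical sensor readings: 200 normal values plus two planted outliers
readings = np.concatenate([rng.normal(20.0, 1.0, size=200), [35.0, 2.0]])

# Z-score rule: more than 3 standard deviations from the mean
z = np.abs((readings - readings.mean()) / readings.std())
z_outliers = z > 3

# IQR rule: outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
q1, q3 = np.percentile(readings, [25, 75])
iqr = q3 - q1
iqr_outliers = (readings < q1 - 1.5 * iqr) | (readings > q3 + 1.5 * iqr)

print("Z-score flags:", np.flatnonzero(z_outliers))
print("IQR flags:    ", np.flatnonzero(iqr_outliers))
```

The IQR rule relies on quartiles, so extreme values distort it less than they distort the mean and standard deviation used by the Z-score.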

### Isolation Forest - How It Works

**Algorithm:**
1. Randomly select feature and split value
2. Recursively partition data
3. Anomalies require fewer splits to isolate
4. **Anomaly score** is derived from the average path length (shorter path = more anomalous)

**Why it works**: Anomalies are "few and different", so they're easier to separate
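
The "fewer splits" idea can be demonstrated with a stripped-down version of a single isolation path. This is a teaching sketch under simplifying assumptions, not scikit-learn's `IsolationForest`: it skips subsampling and score normalization, and the helper names are hypothetical.

```python
import numpy as np

def isolation_path_length(X, target_idx, rng, max_depth=100):
    """Count random axis-aligned splits until the target point is alone."""
    idx = np.arange(len(X))
    depth = 0
    while len(idx) > 1 and depth < max_depth:
        feature = rng.integers(X.shape[1])   # 1. pick a random feature
        values = X[idx, feature]
        lo, hi = values.min(), values.max()
        if lo == hi:                         # no spread left on this feature: stop (sketch-level shortcut)
            break
        split = rng.uniform(lo, hi)          # ... and a random split value
        # 2. keep only the points on the same side of the split as the target
        same_side = (values < split) == (X[target_idx, feature] < split)
        idx = idx[same_side]
        depth += 1
    return depth

rng = np.random.default_rng(0)
# 256 points from a standard normal cloud plus one planted outlier in the last row
X_demo = np.vstack([rng.normal(size=(256, 2)), [[8.0, 8.0]]])

def average_path(target_idx, n_trees=200):
    """3. Average the path length over many random trees."""
    return float(np.mean([isolation_path_length(X_demo, target_idx, rng)
                          for _ in range(n_trees)]))

outlier_avg = average_path(len(X_demo) - 1)
inlier_avg = average_path(0)
print(f"Average splits to isolate the outlier: {outlier_avg:.1f}")
print(f"Average splits to isolate an inlier:   {inlier_avg:.1f}")
```

The planted outlier should need far fewer splits on average than a point inside the cloud.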

### Key Parameters

**contamination** (0.01-0.5, default='auto'):
- Expected proportion of outliers
- Set based on domain knowledge
- Example: 0.1 = expect 10% anomalies

**n_estimators** (50-500, default=100):
- Number of isolation trees
- More trees = more stable
- Similar to Random Forest
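
To make the role of `contamination` concrete, the sketch below fits `IsolationForest` on made-up Gaussian data with three settings and reports the share of points flagged. The data and variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))  # hypothetical data with no planted anomalies

flagged_share = {}
for contamination in (0.01, 0.05, 0.10):
    model = IsolationForest(contamination=contamination, n_estimators=100, random_state=42)
    labels = model.fit_predict(X_demo)  # -1 = anomaly, 1 = normal
    flagged_share[contamination] = float((labels == -1).mean())
    print(f"contamination={contamination:.2f} -> flagged {flagged_share[contamination]:.1%}")
```

The model flags roughly the requested share even though this data contains no real anomalies. `contamination` sets the threshold; it does not discover how many anomalies exist.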

### Evaluation Metrics

**With labels (semi-supervised):**
- Precision, Recall, F1-Score
- ROC-AUC

**Without labels (unsupervised):**
- Manual inspection
- Domain expert validation
- Silhouette score (for clustering-based)
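
When some labels are available, the metrics above can be computed with `sklearn.metrics`. The sketch below plants known anomalies in synthetic data (an assumption made so that ground truth exists; the sensor dataset used in this section has no such labels) and scores an Isolation Forest against them.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

rng = np.random.default_rng(7)
# Hypothetical ground truth: 475 normal points, 25 anomalies scattered on a distant ring
X_normal = rng.normal(0.0, 1.0, size=(475, 2))
angles = rng.uniform(0.0, 2 * np.pi, size=25)
radii = rng.uniform(6.0, 9.0, size=25)
X_anomalous = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

X_eval = np.vstack([X_normal, X_anomalous])
y_true = np.r_[np.zeros(475, dtype=int), np.ones(25, dtype=int)]  # 1 = anomaly

model = IsolationForest(contamination=0.05, random_state=42).fit(X_eval)
y_pred = (model.predict(X_eval) == -1).astype(int)  # map -1/1 to 1/0
anomaly_score = -model.score_samples(X_eval)        # flip sign: higher = more anomalous

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
roc_auc = roc_auc_score(y_true, anomaly_score)
print(f"Precision={precision:.2f}  Recall={recall:.2f}  F1={f1:.2f}  ROC-AUC={roc_auc:.3f}")
```

ROC-AUC is computed from the continuous score, so it does not depend on the `contamination` threshold, while precision and recall do.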

Let's detect anomalies in sensor data!

In [None]:
# Anomaly Detection - Comprehensive Implementation
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

print("=" * 60)
print("ANOMALY DETECTION WITH MULTIPLE METHODS")
print("=" * 60)

# Load sensor data
df_sensor = pd.read_csv("../../data_advanced/sensor_data.csv")
print(f"\nSensor dataset: {df_sensor.shape[0]} readings, {df_sensor.shape[1]} features")
print(f"\nFirst few rows:")
print(df_sensor.head())

# Select features
features = ["temperature", "pressure", "vibration"]
X_sensor = df_sensor[features].values

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_sensor)

# Method 1: Isolation Forest
print("\n" + "=" * 60)
print("METHOD 1: ISOLATION FOREST")
print("=" * 60)

iso_forest = IsolationForest(contamination=0.1, random_state=42, n_estimators=100)
y_pred_iso = iso_forest.fit_predict(X_scaled)
anomaly_scores_iso = iso_forest.score_samples(X_scaled)

n_anomalies_iso = (y_pred_iso == -1).sum()
print(
    f"\nIsolation Forest detected: {n_anomalies_iso} anomalies ({n_anomalies_iso/len(X_sensor)*100:.1f}%)"
)

# Method 2: Local Outlier Factor
print("\n" + "=" * 60)
print("METHOD 2: LOCAL OUTLIER FACTOR (LOF)")
print("=" * 60)

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
y_pred_lof = lof.fit_predict(X_scaled)
anomaly_scores_lof = lof.negative_outlier_factor_

n_anomalies_lof = (y_pred_lof == -1).sum()
print(f"\nLOF detected: {n_anomalies_lof} anomalies ({n_anomalies_lof/len(X_sensor)*100:.1f}%)")

# Method 3: One-Class SVM
print("\n" + "=" * 60)
print("METHOD 3: ONE-CLASS SVM")
print("=" * 60)

oc_svm = OneClassSVM(nu=0.1, gamma="auto")
y_pred_svm = oc_svm.fit_predict(X_scaled)
# decision_function is shifted so that 0 is the boundary (negative = anomaly)
anomaly_scores_svm = oc_svm.decision_function(X_scaled)

n_anomalies_svm = (y_pred_svm == -1).sum()
print(
    f"\nOne-Class SVM detected: {n_anomalies_svm} anomalies ({n_anomalies_svm/len(X_sensor)*100:.1f}%)"
)

# Method 4: Statistical (Z-score)
print("\n" + "=" * 60)
print("METHOD 4: STATISTICAL (Z-SCORE)")
print("=" * 60)

z_scores = np.abs((X_sensor - X_sensor.mean(axis=0)) / X_sensor.std(axis=0))
y_pred_stat = (z_scores > 3).any(axis=1).astype(int) * -2 + 1  # Convert to -1/1
n_anomalies_stat = (y_pred_stat == -1).sum()

print(
    f"\nZ-score method detected: {n_anomalies_stat} anomalies ({n_anomalies_stat/len(X_sensor)*100:.1f}%)"
)
print("(Points with any feature > 3 standard deviations)")

# Visualize all methods
fig, axes = plt.subplots(2, 2, figsize=(18, 14))

methods = [
    ("Isolation Forest", y_pred_iso, anomaly_scores_iso, axes[0, 0]),
    ("Local Outlier Factor", y_pred_lof, anomaly_scores_lof, axes[0, 1]),
    ("One-Class SVM", y_pred_svm, anomaly_scores_svm, axes[1, 0]),
    ("Z-Score (Statistical)", y_pred_stat, z_scores.max(axis=1), axes[1, 1]),
]

for name, predictions, scores, ax in methods:
    # Plot normal vs anomalies
    normal_mask = predictions == 1
    anomaly_mask = predictions == -1

    ax.scatter(
        X_sensor[normal_mask, 0],
        X_sensor[normal_mask, 1],
        c="blue",
        label="Normal",
        s=30,
        alpha=0.6,
        edgecolors="k",
        linewidth=0.5,
    )
    ax.scatter(
        X_sensor[anomaly_mask, 0],
        X_sensor[anomaly_mask, 1],
        c="red",
        label="Anomaly",
        s=100,
        alpha=0.8,
        edgecolors="black",
        linewidth=1.5,
        marker="X",
    )

    ax.set_xlabel("Temperature", fontsize=12)
    ax.set_ylabel("Pressure", fontsize=12)
    ax.set_title(f"{name}\n{anomaly_mask.sum()} anomalies detected", fontsize=14, fontweight="bold")
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.suptitle("Anomaly Detection Methods Comparison", fontsize=16, fontweight="bold")
plt.tight_layout()
plt.show()

# Analyze anomaly scores distribution
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Isolation Forest scores
axes[0].hist(anomaly_scores_iso, bins=50, color="steelblue", alpha=0.7, edgecolor="black")
axes[0].axvline(
    x=iso_forest.offset_,
    color="red",
    linestyle="--",
    linewidth=2,
    label=f"Threshold = {iso_forest.offset_:.3f}",
)
axes[0].set_xlabel("Anomaly Score", fontsize=12)
axes[0].set_ylabel("Frequency", fontsize=12)
axes[0].set_title(
    "Isolation Forest\nScore Distribution\n(Lower = More Anomalous)", fontsize=12, fontweight="bold"
)
axes[0].legend()
axes[0].grid(True, alpha=0.3, axis="y")

# LOF scores
axes[1].hist(anomaly_scores_lof, bins=50, color="coral", alpha=0.7, edgecolor="black")
axes[1].axvline(
    x=lof.offset_,  # Threshold actually implied by contamination=0.1
    color="red",
    linestyle="--",
    linewidth=2,
    label=f"Threshold = {lof.offset_:.3f}",
)
axes[1].set_xlabel("LOF Score (Negative)", fontsize=12)
axes[1].set_ylabel("Frequency", fontsize=12)
axes[1].set_title(
    "Local Outlier Factor\nScore Distribution\n(More Negative = More Anomalous)",
    fontsize=12,
    fontweight="bold",
)
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis="y")

# SVM scores
axes[2].hist(anomaly_scores_svm, bins=50, color="lightgreen", alpha=0.7, edgecolor="black")
axes[2].axvline(x=0, color="red", linestyle="--", linewidth=2, label="Decision Boundary")
axes[2].set_xlabel("Distance from Boundary", fontsize=12)
axes[2].set_ylabel("Frequency", fontsize=12)
axes[2].set_title(
    "One-Class SVM\nScore Distribution\n(Negative = Anomaly)", fontsize=12, fontweight="bold"
)
axes[2].legend()
axes[2].grid(True, alpha=0.3, axis="y")

plt.tight_layout()
plt.show()

# Agreement between methods
print("\n" + "=" * 60)
print("METHOD AGREEMENT ANALYSIS")
print("=" * 60)

agreement = pd.DataFrame(
    {
        "Isolation_Forest": y_pred_iso,
        "LOF": y_pred_lof,
        "One_Class_SVM": y_pred_svm,
        "Z_Score": y_pred_stat,
    }
)

# Count how many methods agree each point is anomaly
anomaly_votes = (agreement == -1).sum(axis=1)

print(f"\nConsensus anomalies (detected by all 4 methods): {(anomaly_votes == 4).sum()}")
print(f"Detected by 3 methods: {(anomaly_votes == 3).sum()}")
print(f"Detected by 2 methods: {(anomaly_votes == 2).sum()}")
print(f"Detected by 1 method: {(anomaly_votes == 1).sum()}")
print(f"Detected by 0 methods (all agree normal): {(anomaly_votes == 0).sum()}")

# Visualize consensus
plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)
scatter = plt.scatter(
    X_sensor[:, 0],
    X_sensor[:, 1],
    c=anomaly_votes,
    cmap="YlOrRd",
    s=50,
    alpha=0.7,
    edgecolors="k",
    linewidth=0.5,
)
plt.colorbar(scatter, label="# Methods Detecting as Anomaly")
plt.xlabel("Temperature", fontsize=12)
plt.ylabel("Pressure", fontsize=12)
plt.title(
    "Anomaly Detection Consensus\n(Darker = More methods agree)", fontsize=14, fontweight="bold"
)
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.bar(
    range(5),
    [(anomaly_votes == i).sum() for i in range(5)],
    color=["green", "yellow", "orange", "darkorange", "red"],
    edgecolor="black",
    linewidth=1.5,
)
plt.xlabel("Number of Methods Detecting as Anomaly", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.title("Distribution of Method Agreement", fontsize=14, fontweight="bold")
plt.xticks(range(5), ["0 (Normal)", "1", "2", "3", "4 (Consensus)"])
plt.grid(True, alpha=0.3, axis="y")

plt.tight_layout()
plt.show()

print("\n✓ Anomaly detection complete!")
print("\nRecommendations:")
print("  • Use Isolation Forest for general purpose (fast, effective)")
print("  • Use LOF for varying density clusters")
print("  • Use One-Class SVM for high-dimensional data")
print("  • Use consensus (multiple methods) for high-confidence detection")

## 8. Customer Segmentation Project

Let's apply what we've learned to a **customer segmentation project** that combines the unsupervised learning techniques from this module!

### Project Goal

**Segment customers** into meaningful groups for targeted marketing using:
- K-Means clustering
- Hierarchical clustering for comparison and dendrogram analysis
- PCA for dimensionality reduction
- t-SNE for visualization
- Anomaly detection for identifying unusual customers

### Business Questions to Answer

1. How many distinct customer segments exist?
2. What characterizes each segment?
3. Which segment should we target for:
   - Premium products?
   - Discount campaigns?
   - Retention efforts?
4. Are there any anomalous customers (potential fraud/high-value)?

### Project Workflow

```
1. Data Loading & EDA
   ↓
2. Feature Engineering & Scaling
   ↓
3. Dimensionality Reduction (PCA)
   ↓
4. Determine Optimal k (Elbow + Silhouette)
   ↓
5. Apply Clustering (K-Means + Hierarchical)
   ↓
6. Visualization (PCA + t-SNE)
   ↓
7. Anomaly Detection
   ↓
8. Business Insights & Recommendations
```
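
Steps 2 to 5 of this workflow can be sketched compactly on synthetic data. The example below is an illustration under stated assumptions: the blob centers and the `X_fake` dataset are made up, and k is picked by the highest silhouette score rather than by reading a plot.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical "customers": 4 well-separated groups in 5 features
centers = np.array([
    [0, 0, 0, 0, 0],
    [10, 0, 10, 0, 10],
    [0, 10, 10, 10, 0],
    [10, 10, 0, 10, 10],
], dtype=float)
X_fake, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0, random_state=0)

# Step 2: scale, Step 3: keep enough components for 95% of the variance
X_std = StandardScaler().fit_transform(X_fake)
X_red = PCA(n_components=0.95, svd_solver="full").fit_transform(X_std)

# Step 4: score each candidate k with the silhouette coefficient
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_red)
    scores[k] = silhouette_score(X_red, labels)
best_k = max(scores, key=scores.get)

# Step 5: final clustering with the selected k
final_labels = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit_predict(X_red)
print(f"Best k by silhouette: {best_k}, cluster sizes: {np.bincount(final_labels)}")
```

On real customer data the silhouette curve is rarely this clear-cut, so treat the automatic choice as a starting point and confirm it against the elbow plot and business sense.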

### Deliverables

- Cluster assignments for each customer
- Segment profiles and characteristics
- Visualization dashboard
- Actionable business recommendations

Let's build a complete customer segmentation solution!

In [None]:
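# Customer Segmentation - End-to-End Project
# Re-import the estimators used below so this cell also runs on its own
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.metrics import silhouette_score
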
print("=" * 70)
print(" CUSTOMER SEGMENTATION PROJECT - END-TO-END SOLUTION")
print("=" * 70)

# Step 1: Load and Explore Data
print("\n" + "=" * 70)
print("STEP 1: DATA LOADING & EXPLORATION")
print("=" * 70)

df_cust = pd.read_csv("../../data_advanced/feature_engineering.csv")
print(f"\nDataset: {df_cust.shape[0]} customers, {df_cust.shape[1]} features")
print(f"\nFeatures: {list(df_cust.columns)}")
print(f"\nBasic statistics:")
print(df_cust.describe())

# Step 2: Feature Engineering
print("\n" + "=" * 70)
print("STEP 2: FEATURE SELECTION & ENGINEERING")
print("=" * 70)

# Select relevant features for segmentation
features = ["age", "income", "education_years", "experience_years", "num_dependents"]
X = df_cust[features].copy()

print(f"\nSelected features: {features}")
print(f"Feature correlations:")
print(X.corr())

# Standardize (crucial for clustering!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"\n✓ Features standardized (mean=0, std=1)")

# Step 3: Dimensionality Reduction with PCA
print("\n" + "=" * 70)
print("STEP 3: PCA FOR DIMENSIONALITY REDUCTION")
print("=" * 70)

pca = PCA(random_state=42)
X_pca = pca.fit_transform(X_scaled)

# Find components for 95% variance
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_comp_95 = np.argmax(cumvar >= 0.95) + 1

print(f"\nComponents needed for 95% variance: {n_comp_95}/{len(features)}")
print(f"Explained variance by component:")
for i in range(len(features)):
    print(f"  PC{i+1}: {pca.explained_variance_ratio_[i]:.1%} (Cumulative: {cumvar[i]:.1%})")

# Use reduced dimensions
X_reduced = X_pca[:, :n_comp_95]

# Step 4: Determine Optimal k
print("\n" + "=" * 70)
print("STEP 4: FINDING OPTIMAL NUMBER OF CLUSTERS")
print("=" * 70)

K_range = range(2, 9)
wcss = []
silhouette_scores = []

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_reduced)
    wcss.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_reduced, kmeans.labels_))

optimal_k = 4  # Hard-coded here; confirm against the elbow and silhouette plots in Step 7
print(f"\n✓ Using k = {optimal_k} (check the elbow and silhouette plots for your data)")

# Step 5: Apply Clustering
print("\n" + "=" * 70)
print("STEP 5: CLUSTERING (K-MEANS + HIERARCHICAL)")
print("=" * 70)

# K-Means
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
df_cust["segment_kmeans"] = kmeans.fit_predict(X_reduced)

# Hierarchical (for comparison)
hierarchical = AgglomerativeClustering(n_clusters=optimal_k, linkage="ward")
df_cust["segment_hier"] = hierarchical.fit_predict(X_reduced)

print(f"K-Means Silhouette Score: {silhouette_score(X_reduced, df_cust['segment_kmeans']):.4f}")
print(f"Hierarchical Silhouette Score: {silhouette_score(X_reduced, df_cust['segment_hier']):.4f}")

# Use K-Means for final segmentation
df_cust["segment"] = df_cust["segment_kmeans"]

# Step 6: Segment Profiling
print("\n" + "=" * 70)
print("STEP 6: SEGMENT PROFILING & CHARACTERIZATION")
print("=" * 70)

for seg in range(optimal_k):
    segment_data = df_cust[df_cust["segment"] == seg]
    print(f"\n{'='*70}")
    print(
        f"SEGMENT {seg}: {len(segment_data)} customers ({len(segment_data)/len(df_cust)*100:.1f}%)"
    )
    print(f"{'='*70}")
    print(segment_data[features].describe().loc[["mean", "50%"]].T)

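# Segment means, used to decide which name fits which cluster
segment_means = df_cust.groupby("segment")[features].mean()

# Illustrative names only: K-Means numbers clusters arbitrarily, so compare
# against segment_means and reassign these labels to match the actual profiles
segment_names = {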
    0: "Entry-Level Young",
    1: "Mid-Career Family",
    2: "Senior Professionals",
    3: "Experienced Elite",
}

df_cust["segment_name"] = df_cust["segment"].map(segment_names)

print("\n" + "=" * 70)
print("SEGMENT NAMING & CHARACTERIZATION")
print("=" * 70)
for seg, name in segment_names.items():
    print(f"Segment {seg}: {name}")

# Step 7: Visualization Dashboard
print("\n" + "=" * 70)
print("STEP 7: VISUALIZATION")
print("=" * 70)

fig = plt.figure(figsize=(20, 12))

# 1. Elbow plot
ax1 = plt.subplot(3, 3, 1)
ax1.plot(K_range, wcss, "bo-", linewidth=2, markersize=8)
ax1.axvline(x=optimal_k, color="red", linestyle="--", label=f"Optimal k={optimal_k}")
ax1.set_xlabel("k", fontsize=11)
ax1.set_ylabel("WCSS", fontsize=11)
ax1.set_title("Elbow Method", fontsize=12, fontweight="bold")
ax1.legend()
ax1.grid(True, alpha=0.3)

# 2. Silhouette scores
ax2 = plt.subplot(3, 3, 2)
ax2.plot(K_range, silhouette_scores, "go-", linewidth=2, markersize=8)
ax2.axvline(x=optimal_k, color="red", linestyle="--", label=f"Optimal k={optimal_k}")
ax2.set_xlabel("k", fontsize=11)
ax2.set_ylabel("Silhouette Score", fontsize=11)
ax2.set_title("Silhouette Analysis", fontsize=12, fontweight="bold")
ax2.legend()
ax2.grid(True, alpha=0.3)

# 3. PCA visualization
ax3 = plt.subplot(3, 3, 3)
pca_2d = PCA(n_components=2, random_state=42)
X_pca_2d = pca_2d.fit_transform(X_scaled)
for seg in range(optimal_k):
    mask = df_cust["segment"] == seg
    ax3.scatter(
        X_pca_2d[mask, 0],
        X_pca_2d[mask, 1],
        label=segment_names[seg],
        s=30,
        alpha=0.6,
        edgecolors="k",
        linewidth=0.5,
    )
ax3.set_xlabel(f"PC1 ({pca_2d.explained_variance_ratio_[0]:.1%})", fontsize=11)
ax3.set_ylabel(f"PC2 ({pca_2d.explained_variance_ratio_[1]:.1%})", fontsize=11)
ax3.set_title("PCA: Segments in 2D", fontsize=12, fontweight="bold")
ax3.legend(fontsize=9)
ax3.grid(True, alpha=0.3)

# 4. Segment sizes
ax4 = plt.subplot(3, 3, 4)
segment_counts = df_cust["segment"].value_counts().sort_index()
colors = plt.cm.Set3(range(optimal_k))
ax4.bar(
    [segment_names[i] for i in range(optimal_k)],
    segment_counts,
    color=colors,
    edgecolor="black",
    linewidth=1.5,
)
ax4.set_ylabel("Number of Customers", fontsize=11)
ax4.set_title("Segment Distribution", fontsize=12, fontweight="bold")
ax4.tick_params(axis="x", rotation=15)
ax4.grid(True, alpha=0.3, axis="y")

# 5. Income by segment
ax5 = plt.subplot(3, 3, 5)
df_cust.boxplot(column="income", by="segment_name", ax=ax5)
ax5.set_xlabel("Segment", fontsize=11)
ax5.set_ylabel("Income", fontsize=11)
ax5.set_title("Income Distribution by Segment", fontsize=12, fontweight="bold")
plt.sca(ax5)
plt.xticks(rotation=15)
ax5.get_figure().suptitle("")  # Remove default title

# 6. Age by segment
ax6 = plt.subplot(3, 3, 6)
df_cust.boxplot(column="age", by="segment_name", ax=ax6)
ax6.set_xlabel("Segment", fontsize=11)
ax6.set_ylabel("Age", fontsize=11)
ax6.set_title("Age Distribution by Segment", fontsize=12, fontweight="bold")
plt.sca(ax6)
plt.xticks(rotation=15)
ax6.get_figure().suptitle("")

# 7. Segment profiles heatmap
ax7 = plt.subplot(3, 3, 7)
segment_profiles = df_cust.groupby("segment")[features].mean()
segment_profiles_norm = (segment_profiles - segment_profiles.mean()) / segment_profiles.std()
im = ax7.imshow(segment_profiles_norm.T, cmap="RdYlGn", aspect="auto", vmin=-2, vmax=2)
ax7.set_xticks(range(optimal_k))
ax7.set_xticklabels([segment_names[i] for i in range(optimal_k)], rotation=15)
ax7.set_yticks(range(len(features)))
ax7.set_yticklabels(features)
ax7.set_title("Segment Feature Profiles", fontsize=12, fontweight="bold")
plt.colorbar(im, ax=ax7, label="Z-Score")

# 8. Hierarchical dendrogram
ax8 = plt.subplot(3, 3, 8)
from scipy.cluster.hierarchy import dendrogram, linkage

Z = linkage(X_reduced[:100], method="ward")  # Subset for clarity
dendrogram(Z, ax=ax8, truncate_mode="lastp", p=12, show_leaf_counts=True)
ax8.set_title("Hierarchical Clustering\n(Sample)", fontsize=12, fontweight="bold")
ax8.set_xlabel("Cluster Size", fontsize=11)
ax8.set_ylabel("Distance", fontsize=11)

# 9. Anomaly detection
ax9 = plt.subplot(3, 3, 9)
iso_forest = IsolationForest(contamination=0.05, random_state=42)
anomalies = iso_forest.fit_predict(X_scaled)
df_cust["is_anomaly"] = anomalies
normal_mask = anomalies == 1
anomaly_mask = anomalies == -1
ax9.scatter(
    X_pca_2d[normal_mask, 0], X_pca_2d[normal_mask, 1], c="blue", label="Normal", s=20, alpha=0.5
)
ax9.scatter(
    X_pca_2d[anomaly_mask, 0],
    X_pca_2d[anomaly_mask, 1],
    c="red",
    label="Anomaly",
    s=100,
    marker="X",
    edgecolors="black",
    linewidth=1.5,
)
ax9.set_xlabel("PC1", fontsize=11)
ax9.set_ylabel("PC2", fontsize=11)
ax9.set_title(f"Anomaly Detection\n{anomaly_mask.sum()} anomalies", fontsize=12, fontweight="bold")
ax9.legend()
ax9.grid(True, alpha=0.3)

plt.suptitle("Customer Segmentation Dashboard", fontsize=16, fontweight="bold", y=0.995)
plt.tight_layout()
plt.show()

# Step 8: Business Insights
print("\n" + "=" * 70)
print("STEP 8: BUSINESS INSIGHTS & RECOMMENDATIONS")
print("=" * 70)

print("\nKEY FINDINGS:")
print("-" * 70)
for seg in range(optimal_k):
    seg_data = df_cust[df_cust["segment"] == seg]
    print(f"\n{segment_names[seg]} (n={len(seg_data)}):")
    print(f"  • Average Income: ${seg_data['income'].mean():,.0f}")
    print(f"  • Average Age: {seg_data['age'].mean():.1f} years")
    print(f"  • Average Experience: {seg_data['experience_years'].mean():.1f} years")

print("\n" + "=" * 70)
print("ACTIONABLE RECOMMENDATIONS:")
print("=" * 70)
print("\n1. PREMIUM PRODUCTS → Target 'Experienced Elite' and 'Senior Professionals'")
print("   - High income, established careers")
print("   - Focus on quality and exclusivity\n")
print("2. GROWTH PRODUCTS → Target 'Mid-Career Family'")
print("   - Growing income, family needs")
print("   - Focus on value and family benefits\n")
print("3. ENTRY OFFERS → Target 'Entry-Level Young'")
print("   - Price-sensitive, career starting")
print("   - Focus on affordability and long-term relationships\n")
print(f"4. SPECIAL ATTENTION → {anomaly_mask.sum()} anomalous customers")
print("   - Investigate for fraud OR high-value opportunities")
print("   - Manual review recommended\n")

print("=" * 70)
print("✓ CUSTOMER SEGMENTATION PROJECT COMPLETE!")
print("=" * 70)

## 9. Exercises

Practice what you've learned with these hands-on exercises!

### Exercise 1: Optimal k for Different Datasets
Create synthetic datasets with different numbers of true clusters (2, 3, 5) using `make_blobs`. For each:
- Apply the elbow method
- Calculate silhouette scores for k=2 to 10
- Compare whether the methods correctly identify the true number of clusters

### Exercise 2: DBSCAN Parameter Tuning
Generate a two-moons dataset with `make_moons` and:
- Create a k-distance plot to find optimal epsilon
- Try DBSCAN with eps values: [0.1, 0.2, 0.3, 0.4]
- Try min_samples values: [3, 5, 10]
- Find the combination that gives the best silhouette score
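
A practical wrinkle when scoring DBSCAN with silhouette: noise points carry the label `-1`, and the score is undefined when fewer than two clusters remain. The helper below is one possible way to handle both cases. The name `dbscan_silhouette` is hypothetical (it is not part of scikit-learn), and skipping noise points is an assumption you may want to revisit.

```python
import numpy as np
from sklearn.metrics import silhouette_score


def dbscan_silhouette(X, labels):
    """Silhouette score over non-noise points only.

    Hypothetical helper: returns NaN when the score is undefined
    (fewer than 2 clusters, or every remaining point is its own cluster).
    """
    X = np.asarray(X)
    labels = np.asarray(labels)
    mask = labels != -1  # DBSCAN marks noise with -1
    n_clusters = len(set(labels[mask]))
    if n_clusters < 2 or mask.sum() <= n_clusters:
        return float("nan")
    return silhouette_score(X[mask], labels[mask])


# Tiny hand-made example: two tight pairs plus one noise point
X_tiny = np.array([[0, 0], [0, 1], [10, 10], [10, 11], [50, 50]], dtype=float)
score_two_clusters = dbscan_silhouette(X_tiny, [0, 0, 1, 1, -1])
score_one_cluster = dbscan_silhouette(X_tiny, [0, 0, 0, 0, -1])
print(score_two_clusters, score_one_cluster)
```

Keep in mind that silhouette favors compact, convex clusters, so a high score on the moons data does not necessarily mean the two moons were recovered.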

### Exercise 3: PCA Reconstruction Error
Using the digits dataset:
- Apply PCA with different numbers of components (5, 10, 20, 30, 50)
- For each, calculate reconstruction error: `mean((X_original - X_reconstructed)**2)`
- Plot reconstruction error vs. number of components
- Determine the "sweet spot" for compression

### Exercise 4: Customer Segmentation with Different Features
Using the customer data:
- Try clustering with different feature combinations
- Compare results using all features vs. only income and age
- Does adding more features improve or worsen clustering?

### Exercise 5: Anomaly Detection Comparison
Apply all 4 anomaly detection methods (Isolation Forest, LOF, One-Class SVM, Z-score) to the customer dataset, then:
- Identify which customers are flagged by multiple methods
- Manually inspect the top 10 "most anomalous" customers - what makes them unusual?

In [None]:
# Exercise Solutions - Try these yourself!

# Exercise 1: Optimal k for Different Datasets
print("Exercise 1: Testing elbow method on datasets with known k")
print("=" * 60)

for true_k in [2, 3, 5]:
    X_test, y_test = make_blobs(n_samples=300, centers=true_k, cluster_std=0.6, random_state=42)

    # Your code here: Apply elbow method and find if it matches true_k
    # Hint: Use range(2, 11) for k values (k = 2 to 10)
    # Calculate WCSS and silhouette scores
    # Plot results

    print(f"True k = {true_k}")
    print("  TODO: Implement elbow method and compare\n")

# Exercise 2: DBSCAN Parameter Tuning
print("\nExercise 2: Finding optimal DBSCAN parameters")
print("=" * 60)

X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# Your code here: Try different eps and min_samples combinations
# Hint: Use nested loops to try all combinations
# Track which combination gives best silhouette score

print("TODO: Implement parameter grid search for DBSCAN\n")

# Exercise 3: PCA Reconstruction Error
print("\nExercise 3: PCA reconstruction error analysis")
print("=" * 60)

# Your code here: For each n_components in [5, 10, 20, 30, 50]:
# 1. Apply PCA
# 2. Reconstruct data
# 3. Calculate MSE
# 4. Plot error vs. components

print("TODO: Calculate and plot reconstruction error\n")

# Exercise 4: Feature Selection Impact
print("\nExercise 4: Impact of feature selection on clustering")
print("=" * 60)

# Your code here: Try clustering with:
# 1. All features
# 2. Only ['income', 'age']
# 3. Only ['education_years', 'experience_years']
# Compare silhouette scores

print("TODO: Compare clustering with different feature sets\n")

# Exercise 5: Anomaly Detection Consensus
print("\nExercise 5: Multi-method anomaly detection")
print("=" * 60)

# Your code here: Apply all 4 methods to customer data
# Find customers flagged by 3+ methods
# Print their characteristics

print("TODO: Find high-confidence anomalies\n")

print("\n✓ Complete these exercises to master unsupervised learning!")

## 10. Key Takeaways & Next Steps

Congratulations! You've mastered unsupervised learning - one of the most powerful tools in machine learning for discovering hidden patterns in data.

### What You've Learned

#### 1. **Clustering Fundamentals**
- ✓ Unsupervised learning finds patterns without labels
- ✓ Distance metrics (Euclidean, Manhattan, Cosine)
- ✓ Evaluation metrics (WCSS, Silhouette Score)
- ✓ Use cases: customer segmentation, document clustering, image segmentation

#### 2. **K-Means Clustering**
- ✓ Fast, scalable centroid-based clustering
- ✓ Elbow method for finding optimal k
- ✓ K-Means++ for better initialization
- ✓ Best for spherical, well-separated clusters
- ✓ **When to use**: General-purpose, known k, spherical clusters
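
As a compact reminder of how the elbow and silhouette checks fit together, here is a minimal sketch on synthetic data. The dataset, the variable names (`X_demo`, `wcss`, `silhouette`, `best_k`), and the k range are illustrative assumptions, not the customer data used in the project.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with 4 well-separated blobs (illustrative only)
X_demo, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=42)
X_demo = StandardScaler().fit_transform(X_demo)

wcss = {}        # within-cluster sum of squares (KMeans.inertia_), used for the elbow plot
silhouette = {}  # mean silhouette score per k
for k in range(2, 9):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42).fit(X_demo)
    wcss[k] = km.inertia_
    silhouette[k] = silhouette_score(X_demo, km.labels_)

# Pick the k with the highest silhouette score; compare against the elbow in wcss
best_k = max(silhouette, key=silhouette.get)
print(f"Best k by silhouette: {best_k}")
```

On real data the two criteria can disagree, so treat `best_k` as a starting point to check against domain knowledge.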

#### 3. **DBSCAN (Density-Based Clustering)**
- ✓ Finds arbitrary-shaped clusters
- ✓ No need to specify number of clusters
- ✓ Identifies outliers as noise
- ✓ Parameters: epsilon (ε) and MinPts
- ✓ **When to use**: Non-spherical clusters, unknown k, need outlier detection
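
The sketch below recaps the DBSCAN workflow on the two-moons data: compute a k-distance curve to guide the choice of ε, then cluster and count clusters and noise points. The value `eps=0.3` and the variable names are assumptions chosen for this standardized toy dataset, not recommended defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X_moons_demo, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
X_moons_demo = StandardScaler().fit_transform(X_moons_demo)

min_samples = 5

# k-distance curve: sorted distance to the min_samples-th neighbour.
# kneighbors() on the training data counts each point as its own first
# neighbour, which matches DBSCAN's min_samples (it also includes the point).
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_moons_demo)
distances, _ = nn.kneighbors(X_moons_demo)
k_dist = np.sort(distances[:, -1])  # plot this and look for the "knee"

# eps=0.3 is an assumed value for this standardized toy data
labels = DBSCAN(eps=0.3, min_samples=min_samples).fit_predict(X_moons_demo)
n_clusters_found = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters: {n_clusters_found}, noise points: {n_noise}")
```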

#### 4. **Hierarchical Clustering**
- ✓ Creates dendrogram showing cluster hierarchy
- ✓ Linkage methods: Ward (a common default for compact clusters), Complete, Average, Single
- ✓ Visual method for choosing k
- ✓ Deterministic results
- ✓ **When to use**: Small datasets, need hierarchy, exploratory analysis
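
A dendrogram is only half of the workflow; to obtain cluster labels the tree has to be cut. The following sketch uses SciPy's `linkage` and `fcluster` on a small synthetic dataset. The data and the choice of 3 clusters are assumptions for illustration.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Small synthetic dataset (illustrative only)
X_h, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=0)

# Build the merge tree with Ward linkage; Z_demo has one row per merge
Z_demo = linkage(X_h, method="ward")

# Cut the tree so that at most 3 flat clusters remain (labels start at 1)
labels_h = fcluster(Z_demo, t=3, criterion="maxclust")
print(sorted(set(labels_h)))
```

The same `Z_demo` matrix can be passed to `dendrogram` for plotting, so the tree is built once and reused for both the visual and the labels.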

#### 5. **PCA (Principal Component Analysis)**
- ✓ Linear dimensionality reduction
- ✓ Preserves maximum variance
- ✓ Components are orthogonal and ordered
- ✓ Enables visualization and compression
- ✓ **When to use**: Preprocessing, visualization, noise reduction, compression
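
The properties listed above can be checked directly. This sketch fits PCA on the digits dataset with a variance threshold (passing a float to `n_components` keeps just enough components to reach that fraction of variance), then verifies that the components are orthonormal and ordered. The 95% threshold and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits = load_digits().data  # 1797 samples, 64 pixel features

# Keep the smallest number of components explaining at least 95% of the variance
pca = PCA(n_components=0.95)
X_proj = pca.fit_transform(X_digits)
X_rec = pca.inverse_transform(X_proj)  # map back to the original 64-D space

mse = np.mean((X_digits - X_rec) ** 2)           # reconstruction error
gram = pca.components_ @ pca.components_.T       # should be close to the identity
print(f"components kept: {pca.n_components_}, reconstruction MSE: {mse:.3f}")
print("orthonormal:", np.allclose(gram, np.eye(pca.n_components_)))
```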

#### 6. **t-SNE (t-Distributed Stochastic Neighbor Embedding)**
- ✓ Non-linear dimensionality reduction
- ✓ Excellent for visualization
- ✓ Preserves local structure (clusters)
- ✓ Key parameter: perplexity (5-50)
- ✓ **When to use**: Visualization ONLY, not for ML features
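
A minimal sketch of the usual t-SNE recipe: reduce with PCA first, then embed in 2-D for plotting. It uses a 500-sample subset of the digits data to keep the runtime short; the subset size, the 30 PCA components, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X_digits, y_digits = load_digits(return_X_y=True)
X_small, y_small = X_digits[:500], y_digits[:500]  # subset to keep this quick

# Step 1: PCA to denoise and speed up the neighbour search
X_pca30 = PCA(n_components=30, random_state=42).fit_transform(X_small)

# Step 2: t-SNE down to 2-D; perplexity must be smaller than the sample count
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=42)
X_embedded = tsne.fit_transform(X_pca30)
print(X_embedded.shape)
```

`X_embedded` is meant for a scatter plot colored by `y_small`. Distances between far-apart groups in the embedding are not reliable, which is one reason it should not be fed to downstream models.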

#### 7. **Anomaly Detection**
- ✓ Isolation Forest: Fast, general-purpose
- ✓ LOF: Good for varying densities
- ✓ One-Class SVM: High-dimensional data
- ✓ Z-Score: Simple statistical method
- ✓ **Best practice**: Use consensus of multiple methods
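
One way to implement the consensus idea is to let each method vote and keep the points flagged by several of them. The sketch below does this on synthetic data with five planted outliers. The data, the 5% contamination setting, the |z| > 3 cutoff, and the "3 or more votes" rule are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
inliers = rng.normal(0, 1, size=(200, 2))
planted = rng.uniform(6, 8, size=(5, 2))  # planted far away from the bulk
X_a = StandardScaler().fit_transform(np.vstack([inliers, planted]))

# One boolean column per method: True means "flagged as anomaly"
flags = np.column_stack([
    IsolationForest(contamination=0.05, random_state=42).fit_predict(X_a) == -1,
    LocalOutlierFactor(n_neighbors=20, contamination=0.05).fit_predict(X_a) == -1,
    OneClassSVM(nu=0.05, gamma="scale").fit_predict(X_a) == -1,
    (np.abs(X_a) > 3).any(axis=1),  # Z-score rule on standardized features
])

votes = flags.sum(axis=1)
consensus = votes >= 3  # high-confidence anomalies: flagged by 3+ methods
print(f"flagged by 3+ methods: {consensus.sum()}")
print("votes for planted outliers:", votes[-5:])
```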

#### 8. **Customer Segmentation Project**
- ✓ End-to-end workflow: EDA → Feature Engineering → Clustering → Insights
- ✓ Combine multiple techniques (PCA + K-Means + Anomalies)
- ✓ Business-focused analysis and actionable recommendations

---

### Quick Reference Guide

| Task | Best Algorithm | Key Parameters |
|------|----------------|----------------|
| **General clustering** | K-Means | n_clusters, init='k-means++' |
| **Unknown # clusters** | DBSCAN | eps, min_samples |
| **See hierarchy** | Hierarchical | n_clusters, linkage='ward' |
| **Reduce dimensions** | PCA | n_components (or variance) |
| **Visualize high-D** | t-SNE | perplexity=30, max_iter=1000 (`n_iter` before scikit-learn 1.5) |
| **Find outliers** | Isolation Forest | contamination, n_estimators |

---

### Common Pitfalls & Best Practices

**❌ Common Mistakes:**
1. Forgetting to standardize features before clustering
2. Using t-SNE output as features for ML (it's for visualization only!)
3. Not validating clustering with silhouette score
4. Ignoring the curse of dimensionality
5. Choosing k without elbow/silhouette analysis

**✓ Best Practices:**
1. **Always standardize** features before distance-based algorithms
2. **Try multiple clustering methods** and compare
3. **Validate with metrics** (silhouette, WCSS) and domain knowledge
4. **PCA before t-SNE** for high-dimensional data (>50D)
5. **Use consensus** for anomaly detection (multiple methods)
6. **Profile segments** and create actionable business insights
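
To see why standardization comes first on the list, consider a toy case where the real grouping lives in a small-scale feature. In the sketch below two hypothetical customer groups differ only in age, while income (measured in much larger units) is pure noise. All data and names here are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
# Two hypothetical groups that differ in age only; income is identically distributed
age = np.concatenate([rng.normal(25, 2, n), rng.normal(55, 2, n)])
income = rng.normal(60_000, 15_000, 2 * n)
X_toy = np.column_stack([age, income])
true_group = np.repeat([0, 1], n)

# Without scaling, Euclidean distance is dominated by income
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_toy)

# With scaling inside a pipeline, both features contribute comparably
scaled_model = make_pipeline(
    StandardScaler(), KMeans(n_clusters=2, n_init=10, random_state=42)
)
scaled_labels = scaled_model.fit_predict(X_toy)

ari_raw = adjusted_rand_score(true_group, raw_labels)
ari_scaled = adjusted_rand_score(true_group, scaled_labels)
print(f"ARI without scaling: {ari_raw:.2f}, with scaling: {ari_scaled:.2f}")
```

Wrapping the scaler and the clusterer in one pipeline also ensures that new customers are scaled with the same parameters before `predict` is called.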

---

### Real-World Applications

**Customer Analytics:**
- Market segmentation
- Churn prediction preprocessing
- Customer lifetime value grouping

**Healthcare:**
- Patient stratification
- Disease subtype discovery
- Medical image segmentation

**Cybersecurity:**
- Intrusion detection
- Malware classification
- Fraud detection

**Image Processing:**
- Image compression
- Object detection preprocessing
- Facial recognition

**Finance:**
- Portfolio optimization
- Risk assessment
- Trading strategy grouping

---

### Resources for Further Learning

**Documentation:**
- [scikit-learn Clustering](https://scikit-learn.org/stable/modules/clustering.html)
- [PCA Tutorial](https://scikit-learn.org/stable/modules/decomposition.html#pca)
- [t-SNE FAQ](https://distill.pub/2016/misread-tsne/)

**Papers:**
- **K-Means**: MacQueen (1967) - "Some Methods for Classification and Analysis of Multivariate Observations"
- **DBSCAN**: Ester et al. (1996) - "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise"
- **t-SNE**: van der Maaten & Hinton (2008) - "Visualizing Data using t-SNE"
- **Isolation Forest**: Liu et al. (2008) - "Isolation Forest"

**Practical Tutorials:**
- [Customer Segmentation with Python](https://towardsdatascience.com)
- [Anomaly Detection in Practice](https://machinelearningmastery.com)
- [Clustering for Beginners](https://realpython.com)

---

### Next Steps

**Immediate:**
- Complete the exercises above
- Try clustering on your own datasets
- Experiment with different distance metrics

**Next Module:**
**Module 16**: `16_neural_networks.ipynb` - Neural Networks from Scratch
- Build neural networks from first principles
- Understand backpropagation
- Introduction to deep learning frameworks

**Advanced Topics:**
- Gaussian Mixture Models (soft clustering)
- Spectral Clustering
- Self-Organizing Maps (SOM)
- UMAP (alternative to t-SNE)
- Autoencoders for dimensionality reduction

---

### Module Complete! 🎉

You've successfully completed Module 15 on Unsupervised Learning!

**You can now:**
- ✓ Cluster data into meaningful groups
- ✓ Choose the right clustering algorithm for your problem
- ✓ Reduce high-dimensional data for visualization
- ✓ Detect anomalies in datasets
- ✓ Build end-to-end customer segmentation solutions
- ✓ Create actionable business insights from clustering

**Next**: `16_neural_networks.ipynb` - Neural Networks from Scratch

---

**Remember**: Unsupervised learning is about discovering hidden patterns. Always validate results with domain expertise and business context!

Keep exploring! 🚀