# **CHAPTER 8: UNSUPERVISED LEARNING**

*Patterns Without Supervision*

## **Chapter Overview**

While supervised learning dominates industry applications, unsupervised learning reveals the hidden architecture of data. From customer segmentation to fraud detection to recommendation systems, these algorithms find structure without ground truth. This chapter covers the full spectrum: clustering algorithms that group similar entities, association rules that uncover co-occurrence patterns, and anomaly detection that identifies the unusual.

**Estimated Time:** 40-50 hours (3 weeks)  
**Prerequisites:** Chapters 1-7 (especially dimensionality reduction and distance metrics)

---

## **8.0 Learning Objectives**

By the end of this chapter, you will be able to:
1. Implement and tune clustering algorithms (K-Means, Hierarchical, DBSCAN, GMM) selecting appropriate metrics and validation techniques
2. Apply association rule mining to discover frequent patterns and meaningful rules (Apriori, FP-Growth)
3. Build anomaly detection systems using statistical, distance-based, and isolation methods
4. Evaluate unsupervised results using internal metrics (Silhouette, Davies-Bouldin) and domain-specific validation
5. Handle high-dimensional clustering challenges and the curse of dimensionality
6. Deploy unsupervised pipelines for segmentation and recommendation preprocessing

---

## **8.1 Clustering: Grouping by Similarity**

#### **8.1.1 K-Means Clustering**

Partition $n$ samples into $k$ clusters $S = \{S_1, \dots, S_k\}$ by minimizing the within-cluster sum of squares (WCSS), where $\mu_i$ is the mean of the points in $S_i$:

$$\arg\min_S \sum_{i=1}^k \sum_{x \in S_i} \|x - \mu_i\|^2$$

**Algorithm (Lloyd's):**
1. Initialize $k$ centroids randomly
2. Assign each point to nearest centroid
3. Update centroids to mean of assigned points
4. Repeat until convergence
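
The four steps above can be sketched from scratch in NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `lloyd_kmeans`, the toy data `X_toy`, and the choice to seed centroids from random data points are all assumptions made for the example.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels, wcss)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    wcss = float(((X - centroids[labels]) ** 2).sum())
    return centroids, labels, wcss

# Toy data: two well-separated blobs (synthetic, for illustration only)
rng = np.random.default_rng(42)
X_toy = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.3, size=(50, 2)),
])
centroids, labels, wcss = lloyd_kmeans(X_toy, k=2)
print(centroids.round(2), wcss)
```

Random initialization can land in a poor local minimum on harder data, which is why the scikit-learn example that follows uses `k-means++` and several restarts.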

```python
from sklearn.cluster import KMeans
import numpy as np

# Initialize with k-means++ (smart initialization, better convergence)
kmeans = KMeans(
    n_clusters=5, 
    init='k-means++',  # Or 'random'
    n_init=10,         # Run 10 times, pick best
    max_iter=300,
    random_state=42
)

labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
inertia = kmeans.inertia_  # WCSS (sum of squared distances to centroids)

# Predict new data
new_labels = kmeans.predict(X_new)
```

**Choosing $k$ — The Elbow Method:**
Plot WCSS vs $k$, look for "elbow" where diminishing returns begin.

```python
import matplotlib.pyplot as plt

inertias = []
K_range = range(1, 11)
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia (WCSS)')
plt.title('Elbow Method')
plt.show()
```

**Limitations:**
- Assumes spherical clusters (isotropic)
- Sensitive to outliers (centroids pulled toward outliers)
- Requires specifying $k$
- Struggles with varying cluster densities

#### **8.1.2 Hierarchical Clustering**

Builds a tree of clusters (dendrogram) either bottom-up (agglomerative) or top-down (divisive).

**Linkage Criteria:**
- **Single:** Minimum distance between clusters (tends to create chains; good for non-elliptical shapes)
- **Complete:** Maximum distance (compact clusters)
- **Average:** Average distance between all pairs
- **Ward:** Merges the pair that least increases within-cluster variance (similar objective to K-means; a common default for compact clusters)

```python
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Create linkage matrix for visualization
Z = linkage(X, method='ward')  # method: 'single', 'complete', 'average', 'ward'
dendrogram(Z, truncate_mode='lastp', p=12)  # Show last 12 merges
plt.show()

# Actual clustering
hc = AgglomerativeClustering(
    n_clusters=5,
    linkage='ward',
    metric='euclidean'  # Ward requires euclidean; this parameter was named `affinity` before scikit-learn 1.2
)
labels = hc.fit_predict(X)
```

**When to use:** Small to medium datasets (roughly $n < 10000$; the naive algorithm takes $O(n^3)$ time and $O(n^2)$ memory), when the hierarchy itself matters (taxonomy creation), or when the cluster count is unknown (cut the dendrogram at a chosen height).
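
Cutting the dendrogram at a height can be done with SciPy's `fcluster`. The sketch below uses synthetic blobs (`X_toy`) and a threshold of `5.0` chosen by hand for this toy data; on real data the threshold would be read off the dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Three tight synthetic blobs, 20 points each (illustrative data)
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal([0, 0], 0.2, size=(20, 2)),
    rng.normal([4, 0], 0.2, size=(20, 2)),
    rng.normal([2, 4], 0.2, size=(20, 2)),
])

Z = linkage(X_toy, method='ward')

# Option 1: cut at a merge height; every merge above t is undone
labels_by_height = fcluster(Z, t=5.0, criterion='distance')

# Option 2: ask for at most a given number of clusters
labels_by_count = fcluster(Z, t=3, criterion='maxclust')

print(len(np.unique(labels_by_height)), len(np.unique(labels_by_count)))
```

The third column of `Z` holds the merge heights, so a large jump between consecutive values is a natural place to cut.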

#### **8.1.3 DBSCAN (Density-Based Spatial Clustering)**

Groups together points in high-density regions, marks outliers in low-density regions.

**Parameters:**
- **eps ($\epsilon$):** Maximum distance for neighborhood
- **min_samples:** Minimum points to form dense region (core point)

**Point Types:**
- **Core:** $\geq$ min_samples within eps
- **Border:** Within eps of core, but not core itself
- **Noise:** Neither core nor border (outliers)
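
The three point types can be derived directly from a pairwise distance matrix. This is a small sketch of the definitions only (it does not build the clusters themselves); `classify_points` and the 1-D toy data are hypothetical, and the neighborhood count includes the point itself, as scikit-learn's `min_samples` does.

```python
import numpy as np

def classify_points(X, eps, min_samples):
    """Return boolean masks (is_core, is_border, is_noise)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    within = d <= eps                      # neighborhood includes the point itself
    is_core = within.sum(axis=1) >= min_samples
    # Border: not core, but inside the eps-neighborhood of some core point
    near_core = (within & is_core[None, :]).any(axis=1)
    is_border = ~is_core & near_core
    is_noise = ~is_core & ~near_core
    return is_core, is_border, is_noise

# Five tightly packed points, one straggler near the group, one far away
X_line = np.array([[0.0], [0.1], [0.2], [0.3], [0.4], [0.8], [5.0]])
is_core, is_border, is_noise = classify_points(X_line, eps=0.45, min_samples=4)
print(is_core, is_border, is_noise)
```

Here the point at 0.8 is a border point because it has too few neighbors of its own but lies within `eps` of the core point at 0.4; the point at 5.0 is noise.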

```python
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.5, min_samples=5, metric='euclidean')
labels = dbscan.fit_predict(X)

# Labels == -1 are outliers (noise)
n_outliers = np.sum(labels == -1)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print(f"Estimated clusters: {n_clusters}")
print(f"Outliers: {n_outliers}")
```

**Advantages:**
- Discovers arbitrary shapes (not just spheres)
- Robust to outliers (explicitly models noise)
- No need to specify $k$

**Choosing eps:**
Use k-distance graph (distance to k-th nearest neighbor). Elbow indicates good eps.

```python
from sklearn.neighbors import NearestNeighbors

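k = 4  # Match DBSCAN's min_samples (sklearn counts the point itself, so k = min_samples - 1)
neigh = NearestNeighbors(n_neighbors=k)
neigh.fit(X)
distances, indices = neigh.kneighbors()  # No argument: each point is excluded from its own neighbors
distances = np.sort(distances[:, -1])  # Distance to the 4th nearest neighbor, ascending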

plt.plot(distances)
plt.ylabel('4-NN Distance')
plt.title('K-Distance Graph')
plt.show()  # Look for "elbow"
```

#### **8.1.4 Gaussian Mixture Models (GMM)**

Soft clustering with a probabilistic model. The data are assumed to be generated from a mixture of $k$ Gaussian distributions with mixing weights $\pi_i$ (where $\sum_i \pi_i = 1$):

$$P(x) = \sum_{i=1}^k \pi_i \mathcal{N}(x|\mu_i, \Sigma_i)$$

**Expectation-Maximization (EM) Algorithm:**
1. **E-step:** Compute probability each point belongs to each cluster (responsibilities)
2. **M-step:** Update Gaussian parameters to maximize likelihood
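
To make the two steps concrete, here is a minimal EM loop for a 1-D mixture in NumPy. It is a sketch under simplifying assumptions: one feature, a fixed number of iterations instead of a convergence check, and means initialized at evenly spaced quantiles. The name `em_gmm_1d` and the synthetic data are illustrative.

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=50):
    """EM for a 1-D Gaussian mixture; returns (weights, means, variances, resp)."""
    n = len(x)
    weights = np.full(k, 1.0 / k)
    means = np.quantile(x, (np.arange(k) + 0.5) / k)  # deterministic init (an assumption)
    variances = np.full(k, x.var())
    for _ in range(n_iter):
        # E-step: responsibility of component i for point n is
        # proportional to pi_i * N(x_n | mu_i, sigma_i^2)
        dens = np.exp(-0.5 * (x[:, None] - means) ** 2 / variances)
        dens /= np.sqrt(2 * np.pi * variances)
        resp = weights * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from responsibility-weighted data
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
    return weights, means, variances, resp

# Synthetic data: two components centered at -2 and 3
rng = np.random.default_rng(0)
x_toy = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 0.5, 200)])
weights, means, variances, resp = em_gmm_1d(x_toy, k=2)
print(weights.round(2), means.round(2), variances.round(2))
```

Each row of `resp` sums to 1, which is exactly the soft assignment that `predict_proba` exposes in scikit-learn.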

```python
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(
    n_components=3,
    covariance_type='full',  # 'spherical', 'diag', 'tied', 'full'
    random_state=42
)
gmm.fit(X)

# Soft assignments (probabilities)
probs = gmm.predict_proba(X)  # Shape (n_samples, n_components)

# Hard assignments
labels = gmm.predict(X)

# Model parameters
means = gmm.means_
covariances = gmm.covariances_

# Sample new data from learned distribution
X_new = gmm.sample(100)[0]
```

**Covariance Types:**
- **spherical:** Each cluster has a single variance, the same in all directions (closest to K-means)
- **diag:** Axis-aligned ellipses
- **tied:** All clusters share same covariance matrix
- **full:** Each cluster has own arbitrary covariance (most flexible)
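
One way to see the difference between the four options is to look at the shape of the fitted `covariances_` array. The sketch below fits each type on random 3-feature data with 2 components; the data carry no real structure and are only there to make the shapes visible.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))  # 200 samples, 3 features (synthetic)

shapes = {}
for cov_type in ['spherical', 'diag', 'tied', 'full']:
    gmm = GaussianMixture(n_components=2, covariance_type=cov_type, random_state=0)
    gmm.fit(X_toy)
    shapes[cov_type] = np.shape(gmm.covariances_)

# spherical: one variance per component
# diag:      one variance per component per feature
# tied:      a single (features x features) matrix shared by all components
# full:      one (features x features) matrix per component
print(shapes)
```

Fewer free parameters means less risk of overfitting on small samples, which is the usual reason to prefer `diag` or `tied` over `full`.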

**Selecting components:** Use the Bayesian Information Criterion (BIC) or Akaike Information Criterion (AIC). Lower values are better for both.

```python
bic_scores = []
for k in range(2, 10):
    gmm = GaussianMixture(n_components=k, random_state=42)
    gmm.fit(X)
    bic_scores.append(gmm.bic(X))

optimal_k = np.argmin(bic_scores) + 2
```

#### **8.1.5 Spectral Clustering**

Uses graph theory. Good for non-convex clusters (e.g., concentric circles).

Constructs similarity graph → Laplacian → Eigenvectors → K-means in eigenvector space.

```python
from sklearn.cluster import SpectralClustering

sc = SpectralClustering(
    n_clusters=2,
    affinity='rbf',  # 'nearest_neighbors' for sparse graph
    gamma=1.0,
    assign_labels='kmeans'  # or 'discretize'
)
labels = sc.fit_predict(X)
```

**When to use:** When clusters are connected but not compact (nested circles, moons), image segmentation.
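
The pipeline (similarity graph, Laplacian, eigenvectors, K-means) can be traced by hand on two concentric rings. This sketch uses the unnormalized Laplacian $L = D - W$ and a hand-picked RBF width of `5.0`; scikit-learn's `SpectralClustering` differs in its details, so treat this as an illustration of the idea rather than a reimplementation.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two concentric rings, 100 points each (synthetic, noise-free)
theta = np.linspace(0, 2 * np.pi, 100, endpoint=False)
inner = np.column_stack([np.cos(theta), np.sin(theta)])
outer = 3 * inner
X_rings = np.vstack([inner, outer])

# 1. Similarity graph: RBF affinity between every pair of points
sq_dists = ((X_rings[:, None, :] - X_rings[None, :, :]) ** 2).sum(axis=2)
W = np.exp(-5.0 * sq_dists)
np.fill_diagonal(W, 0.0)

# 2. Graph Laplacian (unnormalized): degree matrix minus affinity
L = np.diag(W.sum(axis=1)) - W

# 3. Eigenvectors belonging to the two smallest eigenvalues
eigvals, eigvecs = np.linalg.eigh(L)  # eigh returns eigenvalues in ascending order
embedding = eigvecs[:, :2]

# 4. K-means in the eigenvector space
spectral_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embedding)

print(spectral_labels[:5], spectral_labels[100:105])
```

In the embedding, all points of one ring collapse to nearly the same coordinates, so K-means separates the rings even though it could not do so in the original space.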

---

## **8.2 Evaluating Clustering**

Without ground truth labels, we use internal metrics.

#### **8.2.1 Silhouette Score**

For each sample: $s = \frac{b - a}{\max(a, b)}$

- $a$: Mean distance to the other points in the same cluster (cohesion)
- $b$: Mean distance to the points in the nearest other cluster (separation)

Range $[-1, 1]$. Close to 1 = well-clustered, 0 = on boundary, negative = wrong cluster.
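
A hand-rolled version of the formula on four points shows where the numbers come from. The function `silhouette_values` is a hypothetical helper written for this example; singleton clusters are given a score of 0, which is the usual convention.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-sample silhouette s = (b - a) / max(a, b), using Euclidean distance."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                      # exclude the point itself
        if not same.any():
            continue                         # singleton cluster: score stays 0
        a = d[i, same].mean()                # cohesion
        b = min(d[i, labels == c].mean()     # separation: nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

X_tiny = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_tiny = np.array([0, 0, 1, 1])
s = silhouette_values(X_tiny, labels_tiny)
print(s.round(4), s.mean().round(4))
```

For the first point, $a = 1$ and $b = (10 + 11)/2 = 10.5$, so $s = 9.5 / 10.5 \approx 0.905$.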

```python
from sklearn.metrics import silhouette_score, silhouette_samples

score = silhouette_score(X, labels)
print(f"Silhouette Score: {score:.3f}")

# Per-sample scores (for visualization)
sample_silhouette_values = silhouette_samples(X, labels)
```

#### **8.2.2 Davies-Bouldin Index**

Average similarity between each cluster and its most similar cluster. Lower is better (tight clusters, far apart).

```python
from sklearn.metrics import davies_bouldin_score
db_index = davies_bouldin_score(X, labels)
```

#### **8.2.3 Calinski-Harabasz Index (Variance Ratio)**

Ratio of between-cluster dispersion to within-cluster dispersion. Higher is better.

```python
from sklearn.metrics import calinski_harabasz_score
ch_score = calinski_harabasz_score(X, labels)
```

#### **8.2.4 Domain-Specific Validation**

**Stability Analysis:** Perturb the data slightly (e.g., bootstrap resampling or added noise) and check whether the clusters remain stable.

**Business Metrics:** 
- Customer segmentation: Revenue per cluster, churn rate per cluster
- Topic modeling: Coherence scores (NPMI)
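
One possible way to run the stability analysis mentioned above is to refit on bootstrap resamples and compare each refit against a reference clustering with the adjusted Rand index (which ignores label permutations). The helper `bootstrap_stability`, the use of K-Means, and the synthetic blobs are assumptions for this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def bootstrap_stability(X, k, n_rounds=10, seed=0):
    """Mean adjusted Rand index between a reference fit and bootstrap refits."""
    rng = np.random.default_rng(seed)
    ref_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    scores = []
    for _ in range(n_rounds):
        # Resample rows with replacement and refit
        idx = rng.choice(len(X), size=len(X), replace=True)
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[idx])
        # Compare both clusterings on the full, original data
        scores.append(adjusted_rand_score(ref_labels, km.predict(X)))
    return float(np.mean(scores))

# Three well-separated synthetic blobs: stability should be close to 1
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal([0, 0], 0.2, size=(20, 2)),
    rng.normal([4, 0], 0.2, size=(20, 2)),
    rng.normal([2, 4], 0.2, size=(20, 2)),
])
stability = bootstrap_stability(X_toy, k=3)
print(round(stability, 3))
```

A score that drops well below 1 for a given `k` suggests the partition depends on which samples happened to be drawn.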

---

## **8.3 Association Rule Learning**

Discover interesting relations between variables (market basket analysis).

**Key Metrics:**
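- **Support:** Fraction of transactions containing itemset $A$: $\text{supp}(A) = P(A)$
- **Confidence:** Conditional probability $P(B|A) = \frac{\text{supp}(A \cup B)}{\text{supp}(A)}$, where $A \cup B$ is the combined itemset, so $\text{supp}(A \cup B)$ counts transactions containing every item of both $A$ and $B$
- **Lift:** How much more likely $B$ is given $A$ vs baseline: $\frac{P(B|A)}{P(B)} = \frac{\text{supp}(A \cup B)}{\text{supp}(A)\,\text{supp}(B)}$. Lift $> 1$ indicates a positive association; lift $= 1$ indicates independence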

**Apriori Principle:** If itemset is frequent, all subsets are frequent (prunes search space).
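
The metrics are easy to verify by hand on a tiny basket set. The five transactions below are made up for illustration, and the helper functions `support`, `confidence`, and `lift` are plain-Python sketches rather than any library's API.

```python
# Hypothetical transactions, each a set of items
transactions = [
    {'milk', 'bread', 'butter'},
    {'milk', 'bread'},
    {'beer', 'chips'},
    {'milk', 'butter'},
    {'bread', 'butter'},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of the combined itemset divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence relative to the baseline frequency of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

supp_mb = support({'milk', 'bread'}, transactions)        # 2 of 5 transactions
conf_mb = confidence({'milk'}, {'bread'}, transactions)   # 0.4 / 0.6
lift_mb = lift({'milk'}, {'bread'}, transactions)         # 0.4 / (0.6 * 0.6)
print(supp_mb, round(conf_mb, 3), round(lift_mb, 3))
```

The Apriori principle is visible here too: the support of `{'milk', 'bread'}` cannot exceed the support of `{'milk'}` or of `{'bread'}` alone.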

```python
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder
import pandas as pd

# Data format: List of transactions (lists of items)
transactions = [['milk', 'bread', 'butter'], ['milk', 'bread'], ['beer', 'chips'], ...]

# Encode
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Find frequent itemsets (min_support=0.6 means 60% of transactions)
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)

# Generate rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
rules = rules[rules['lift'] > 1.2]  # Filter for interesting rules

print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])
```

**FP-Growth (Faster):** Uses frequent pattern tree, no candidate generation.

```python
from mlxtend.frequent_patterns import fpgrowth

frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True)
```

**Applications:**
- Market basket: "If diapers, then beer" (classic example)
- Recommendation: "Users who bought X also bought Y"
- Medical: "If symptoms A and B, likely diagnosis C"
- Web mining: Clickstream analysis

---

## **8.4 Anomaly Detection**

Identify rare items, events, or observations which raise suspicions.

#### **8.4.1 Statistical Methods**

**Z-Score:** $z = \frac{x - \mu}{\sigma}$, flag if $|z| > 3$ (assumes Gaussian)

**IQR Method:** Flag points outside $[Q1 - 1.5 \times IQR,\; Q3 + 1.5 \times IQR]$, where $IQR = Q3 - Q1$.

**Grubbs' Test:** For univariate outliers in normally distributed data.
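
The Z-score and IQR rules take only a few lines of NumPy. The sketch below uses synthetic Gaussian data with one injected extreme value (`120.0`); both the data and the variable names are illustrative.

```python
import numpy as np

# 200 synthetic readings around 50, plus one injected extreme value
rng = np.random.default_rng(0)
x = np.append(rng.normal(loc=50.0, scale=5.0, size=200), 120.0)

# Z-score rule: flag |z| > 3
z = (x - x.mean()) / x.std()
z_flags = np.abs(z) > 3

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_flags = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

print(z_flags.sum(), iqr_flags.sum())
```

Because the mean and standard deviation are themselves pulled by outliers, the Z-score rule can miss anomalies in small samples; the quartile-based IQR rule is less affected.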

#### **8.4.2 Isolation Forest**

Isolates anomalies by random splits. Anomalies are few and different → easier to isolate (shorter path length).

```python
from sklearn.ensemble import IsolationForest

iso_forest = IsolationForest(
    contamination=0.1,  # Expected proportion of outliers
    n_estimators=100,
    random_state=42
)

# Returns: 1 (inlier), -1 (outlier)
labels = iso_forest.fit_predict(X)
outlier_scores = iso_forest.decision_function(X)  # Lower = more anomalous; negative values are outliers
```

#### **8.4.3 Local Outlier Factor (LOF)**

Density-based. Compares local density of point to neighbors. Lower density than neighbors = outlier.

```python
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(
    n_neighbors=20,
    contamination=0.1,
    novelty=False  # True if you want to predict on new data (fit on clean data only)
)
labels = lof.fit_predict(X)  # -1 for outliers
```

#### **8.4.4 One-Class SVM**

Learns the boundary of the "normal" class. Everything outside it is treated as an anomaly.

```python
from sklearn.svm import OneClassSVM

ocsvm = OneClassSVM(
    kernel='rbf',
    gamma='scale',
    nu=0.1  # Upper bound on fraction of outliers, lower bound on support vectors
)
ocsvm.fit(X_clean)  # Train only on normal data
labels = ocsvm.predict(X_test)  # 1 (normal), -1 (outlier)
```

#### **8.4.5 Autoencoders (Preview of Deep Learning)**

Neural network learns to compress and reconstruct data. High reconstruction error = anomaly.

```python
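# Conceptual sketch (PyTorch)
import torch
from torch import nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        # Encoder compresses input_dim -> 32; decoder mirrors it back
        self.encoder = nn.Sequential(nn.Linear(input_dim, 64), nn.ReLU(), nn.Linear(64, 32))
        self.decoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, input_dim))

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

# Train on normal data only (minimize reconstruction MSE)
# Anomaly score = per-sample MSE(input, reconstruction)
def anomaly_score(model, x):
    model.eval()
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean(dim=1)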

---

## **8.5 Advanced Topics**

#### **8.5.1 HDBSCAN**

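Hierarchical DBSCAN. Handles clusters of varying density better than DBSCAN and removes the need to choose a single eps: it builds a hierarchy over all density levels and extracts the most stable clusters.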

```python
import hdbscan

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=10,
    min_samples=5,
    metric='euclidean'
)
labels = clusterer.fit_predict(X)
# -1 is noise; per-point membership strengths are in clusterer.probabilities_
```

#### **8.5.2 BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)**

For very large datasets ($n > 100k$). Incremental clustering, memory efficient.

```python
from sklearn.cluster import Birch

birch = Birch(
    threshold=0.5,  # Radius of subcluster
    branching_factor=50,  # Max subclusters per node
    n_clusters=5
)
labels = birch.fit_predict(X)  # For data arriving in chunks, call birch.partial_fit(chunk) instead
```

---

## **8.6 Workbook Labs**

### **Lab 1: Customer Segmentation**
Use RFM (Recency, Frequency, Monetary) analysis:
1. Calculate RFM scores from transaction data
2. Cluster customers using K-Means (elbow method to choose k)
3. Profile each cluster (avg spend, churn rate, preferred categories)
4. Validate with Silhouette score and business metrics (revenue per cluster)

**Deliverable:** Segmentation report with actionable personas ("High Value Frequent", "At Risk Big Spenders", etc.).
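
As a starting point for step 1, RFM features can be computed with a single pandas `groupby`. The column names (`customer_id`, `date`, `amount`), the six sample rows, and the choice of "one day after the last transaction" as the snapshot date are all assumptions for this sketch.

```python
import pandas as pd

# Hypothetical transaction log
tx = pd.DataFrame({
    'customer_id': ['a', 'a', 'b', 'c', 'c', 'c'],
    'date': pd.to_datetime([
        '2024-01-05', '2024-03-01', '2024-01-20',
        '2024-02-10', '2024-02-25', '2024-03-10',
    ]),
    'amount': [20.0, 35.0, 500.0, 15.0, 10.0, 25.0],
})

# Reference date for recency: the day after the last recorded transaction
snapshot = tx['date'].max() + pd.Timedelta(days=1)

rfm = tx.groupby('customer_id').agg(
    recency=('date', lambda d: (snapshot - d.max()).days),  # days since last purchase
    frequency=('date', 'count'),                            # number of transactions
    monetary=('amount', 'sum'),                             # total spend
)

# Standardize so no single feature dominates distance-based clustering
rfm_scaled = (rfm - rfm.mean()) / rfm.std()
print(rfm)
```

Monetary and frequency values are often heavily right-skewed in real data, so a log transform before standardizing is worth considering.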

### **Lab 2: Market Basket Analysis**
On grocery store data:
1. Use FP-Growth to find frequent itemsets (support > 1%)
2. Generate association rules with confidence > 50% and lift > 2
3. Visualize network graph of rules (antecedents → consequents)
4. Recommend cross-selling strategy based on highest lift rules

**Deliverable:** `market_basket_analyzer.py` with rule filtering and visualization.

### **Lab 3: Anomaly Detection System**
Network intrusion detection:
1. Compare Isolation Forest vs LOF vs One-Class SVM on time-series network features
2. Tune contamination parameter using validation set with labeled anomalies
3. Implement ensemble: flag if 2/3 methods agree
4. Measure precision@k (top k anomalies flagged)

**Deliverable:** `anomaly_detector.py` with ensemble logic and performance report.
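
The voting rule and precision@k from steps 3 and 4 might look like the sketch below. The helper names `majority_vote` and `precision_at_k` are hypothetical, and the detector outputs are hard-coded stand-ins for what Isolation Forest, LOF, and One-Class SVM would return.

```python
import numpy as np

def majority_vote(*label_arrays, min_votes=2):
    """Combine detector outputs in sklearn's convention (1 = inlier, -1 = outlier)."""
    votes = sum((np.asarray(labels) == -1).astype(int) for labels in label_arrays)
    return np.where(votes >= min_votes, -1, 1)

def precision_at_k(scores, y_true, k):
    """Fraction of true anomalies among the k highest-scoring points.

    scores: higher = more anomalous (negate sklearn's decision_function first).
    y_true: 1 for a labeled anomaly, 0 otherwise.
    """
    top_k = np.argsort(scores)[::-1][:k]
    return float(np.mean(np.asarray(y_true)[top_k]))

# Stand-in outputs from three detectors on five samples
iso_labels = [1, -1, -1, 1, 1]
lof_labels = [1, -1, 1, 1, -1]
svm_labels = [1, -1, -1, 1, 1]
ensemble = majority_vote(iso_labels, lof_labels, svm_labels)

# Stand-in anomaly scores and ground truth for the same samples
scores = np.array([0.1, 0.9, 0.8, 0.2, 0.7])
y_true = np.array([0, 1, 1, 0, 0])
p_at_2 = precision_at_k(scores, y_true, k=2)
p_at_3 = precision_at_k(scores, y_true, k=3)
print(ensemble, p_at_2, round(p_at_3, 3))
```

Only the last sample is flagged by a single detector, so the 2-of-3 rule leaves it as an inlier.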

### **Lab 4: Image Segmentation with Spectral Clustering**
Use sklearn's spectral clustering on image pixels (color + spatial coordinates):
1. Load image, convert to feature space [R, G, B, x, y]
2. Build similarity graph (RBF kernel on color + distance)
3. Cluster into 5-10 segments
4. Compare with K-means on color only

**Deliverable:** Side-by-side segmentation comparison showing spectral clustering respects boundaries better.

---

## **8.7 Common Pitfalls**

1. **Curse of Dimensionality:** Distances lose contrast in high dimensions (points become nearly equidistant). Consider reducing dimensions (e.g., PCA) before clustering when $d$ is large; $d > 50$ is a rough rule of thumb.

2. **Ignoring Scale:** Most clustering algorithms are distance-based, so features with large scales dominate. Standardize features unless they already share a meaningful scale.

3. **Choosing K Arbitrarily:** Elbow method is subjective. Use Silhouette analysis, BIC (GMM), or business constraints.

4. **Poorly Chosen DBSCAN Epsilon:** Too small and everything becomes noise; too large and everything merges into one cluster. Use the k-distance graph!

5. **Evaluating on Training Data:** Clustering overfits easily. Validate stability by clustering on bootstrap samples.

6. **Treating Clusters as Ground Truth:** Clusters are descriptive, not necessarily meaningful. Always validate with domain knowledge.
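
The first pitfall can be observed numerically. The sketch below draws uniform random points and measures the relative contrast (farthest minus nearest distance, divided by nearest) from a random query point; the function `relative_contrast` and the uniform data are assumptions for the demonstration.

```python
import numpy as np

def relative_contrast(d, n=500, seed=0):
    """(max distance - min distance) / min distance from a query to n random points."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, d))
    query = rng.uniform(size=d)
    dists = np.linalg.norm(X - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())

contrast = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrast.items():
    print(f"d={d:>4}: relative contrast = {c:.2f}")
```

As the dimension grows, the nearest and farthest neighbors end up at almost the same distance, which is why "nearest" carries less information for clustering in the raw feature space.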

---

## **8.8 Interview Questions**

**Q1:** When would you choose DBSCAN over K-Means?
*A: When cluster shapes are non-spherical/arbitrary, when you need to identify outliers as noise rather than forcing them into clusters, or when you don't know the number of clusters in advance. K-Means assumes roughly spherical, equal-variance clusters and requires specifying k. Note that DBSCAN uses a single eps, so it struggles when cluster densities vary significantly; HDBSCAN is the better choice in that case.*

**Q2:** How do you validate clustering when you don't have labels?
*A: Internal metrics: Silhouette score (cohesion vs separation), Davies-Bouldin index (compactness vs centroid distance), Calinski-Harabasz (variance ratio). Beyond internal metrics: stability analysis (perturb data, check cluster consistency), business metrics (revenue per cluster, conversion rates), or manual inspection of cluster profiles. If even a small labeled sample exists, external metrics such as the adjusted Rand index can be computed against it.*

**Q3:** Explain the difference between hard and soft clustering.
*A: Hard clustering assigns each point to exactly one cluster (K-Means, hierarchical). Soft clustering gives probabilities of membership in each cluster (GMM, fuzzy C-means). Soft clustering is useful when boundaries are ambiguous or when feeding downstream probabilistic models.*

**Q4:** What is the Apriori principle and why does it help?
*A: If an itemset is frequent, all its subsets must also be frequent. Conversely, if a subset is infrequent, no superset can be frequent. This allows Apriori algorithm to prune the search space dramatically, avoiding enumeration of all possible itemsets (which is exponential).*

**Q5:** How would you detect anomalies in high-dimensional data?
*A: First reduce dimensions (PCA) to avoid curse of dimensionality. Then use methods robust to high dimensions: Isolation Forest (tree-based, handles high D well), or autoencoders (neural networks learn compressed representation). Avoid distance-based methods (LOF, One-Class SVM) in original high-D space unless heavily preprocessed.*

---

## **8.9 Further Reading**

**Books:**
- *Introduction to Data Mining* (Tan, Steinbach, Kumar) - Comprehensive clustering and association rules
- *Mining of Massive Datasets* (Leskovec, Rajaraman, Ullman) - Free online, scalable algorithms

**Papers:**
- "A Density-Based Algorithm for Discovering Clusters" (Ester et al., 1996) - DBSCAN original paper
- "BIRCH: An Efficient Data Clustering Method" (Zhang et al., 1996)

**Libraries:**
- **HDBSCAN:** https://hdbscan.readthedocs.io/ (handles varying densities better than DBSCAN; scikit-learn 1.3+ also ships `sklearn.cluster.HDBSCAN`)
- **PyCaret:** Automated clustering comparison
- **mlxtend:** Association rules and frequent pattern mining

---

## **8.10 Checkpoint Project: Intelligent Customer Segmentation System**

Build an end-to-end customer segmentation platform for an e-commerce company.

**Dataset:** 12 months of transaction data with customer demographics.

**Requirements:**

1. **Feature Engineering:**
   - RFM analysis (Recency: days since last purchase, Frequency: transaction count, Monetary: total/average spend)
   - Behavioral features: Category diversity (entropy), return rate, discount sensitivity
   - Temporal features: Weekend vs weekday ratio, seasonality preferences

2. **Clustering Pipeline:**
   - Handle mixed data types (numerical RFM + categorical demographics)
   - Use GMM for soft clustering (probability of belonging to each segment)
   - Automatic selection of optimal clusters (BIC + Silhouette consensus)

3. **Segment Profiling:**
   - Automated generation of persona descriptions ("Bargain Hunter: High frequency, low monetary, high discount usage")
   - Statistical significance testing of differences between segments
   - Churn prediction per segment using survival analysis (optional advanced)

4. **Actionable Interface:**
   - API endpoint: Input customer ID → Returns segment membership probabilities + recommended actions
   - Marketing campaign simulator: Predict uplift if targeting specific segment with promotion

5. **Validation:**
   - Temporal stability: Segments should be stable over 3-month windows (high overlap)
   - Business impact: Quantify LTV differences between segments (e.g., verify whether the top segment has roughly 5x the LTV of the bottom segment)

**Deliverables:**
- `segmentation/` package with data processing, clustering, and profiling modules
- Streamlit dashboard showing segment visualization (t-SNE plot), statistics, and customer lookup
- Deployment-ready Docker container
- Report: "Segment X represents 15% of customers but 45% of revenue; recommended action: Loyalty program"

**Success Criteria:**
- Silhouette score > 0.3 (reasonable separation)
- >80% temporal stability (customers stay in same segment month-to-month)
- Actionable business insights (segments have distinct, meaningful behaviors)

---

**End of Chapter 8**

*You can now discover hidden patterns and detect anomalies without supervision. Chapter 9 will cover Model Evaluation, Validation & Selection — bringing together supervised and unsupervised concepts to ensure your models generalize.*

---