# Chapter 9: Unsupervised Learning Techniques

## 1. Chapter Overview
**Goal:** Previous chapters dealt with Supervised Learning, where every training instance came with a label. This chapter explores **Unsupervised Learning**, where the data is unlabeled and the algorithm must discover hidden structures, patterns, or anomalies on its own.

**Key Concepts:**
* **Clustering:** Grouping similar instances together.
    * **K-Means:** A widely used centroid-based algorithm.
    * **DBSCAN:** Density-based algorithm that handles arbitrary shapes.
* **Clustering Evaluation:** The Elbow Method and Silhouette Score.
* **Image Segmentation:** Using clustering to reduce colors in an image.
* **Gaussian Mixtures (GMM):** A probabilistic model that assumes data is generated from a mixture of Gaussian distributions.
* **Anomaly Detection:** Using GMM to detect outliers (e.g., fraud detection).

**Practical Skills:**
* Implementing K-Means and optimizing `k`.
* Visualizing decision boundaries (Voronoi tessellation).
* Performing color segmentation on images.
* Using Gaussian Mixtures for density estimation and outlier detection.

In [None]:
# Setup
import sys
import sklearn
import numpy as np
import os
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

np.random.seed(42)
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

## 2. Theoretical Explanation (In-Depth)

### 1. K-Means Clustering
K-Means is a simple and efficient algorithm. The idea is to find $k$ centroids (center points) and assign every data point to the nearest centroid. 

**The Algorithm Steps:**
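1.  **Initialize:** Place $k$ centroids randomly (for example, by picking $k$ instances at random). Scikit-Learn's `KMeans` defaults to the smarter k-means++ initialization.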
2.  **Assign:** For each instance, calculate the distance to all centroids and assign it to the closest cluster.
3.  **Update:** Move the centroids to the *mean* (center of mass) of the instances assigned to them.
4.  **Repeat:** Repeat steps 2 and 3 until the centroids stop moving (convergence).
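
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not Scikit-Learn's implementation: the function name `kmeans_sketch`, the toy data `X_demo`, and the choice to initialize from randomly picked instances are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Plain Lloyd-style loop; illustrative only (no k-means++ init, no restarts)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize the centroids by picking k distinct instances at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each instance to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned instances
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical toy data: two well-separated groups of 50 points each
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
                    rng.normal(5, 0.5, size=(50, 2))])
centroids, labels = kmeans_sketch(X_demo, k=2)
print(centroids)
```

On well-separated data like this, the centroids would typically settle near the two group means, although a poor random initialization can leave any loop of this kind in a worse local optimum.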

**Hard vs. Soft Clustering:**
* *Hard Clustering:* Each instance belongs to exactly one cluster.
* *Soft Clustering:* Each instance gets a score for every cluster, such as its distance to each centroid or a probability of membership.
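
With Scikit-Learn's `KMeans`, the two views map onto two methods: `predict` returns one cluster index per instance, and `transform` returns the distance from each instance to every centroid. The sketch below assumes a small synthetic dataset (`X_demo`) and two made-up query points (`X_new`) purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical demo data: 3 blobs in 2D
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

X_new = np.array([[0.0, 2.0], [3.0, 2.0]])
hard = km.predict(X_new)     # hard clustering: one cluster index per instance
soft = km.transform(X_new)   # soft scores: distance to each of the 3 centroids

print(hard)
print(soft.round(2))
```

The hard assignment is simply the column with the smallest distance in the soft output.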

**Limitations:**
* You must choose $k$ in advance.
* It assumes clusters are roughly spherical and of similar size. It struggles with elongated blobs or irregular shapes.

### 2. Selecting the Optimal Number of Clusters
Since we don't have labels, how do we know if 3 clusters is better than 5?

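* **Inertia:** The sum of squared distances between each instance and its closest centroid. Lower is better, but inertia keeps decreasing as $k$ increases (with $k$ equal to the number of instances, inertia is 0), so it cannot simply be minimized to choose $k$. Instead we look for an **Elbow**, the point where the decrease slows down sharply.
* **Silhouette Score:** The mean silhouette coefficient over all instances. For one instance the coefficient is $(b - a) / \max(a, b)$, where $a$ is the mean distance to the other instances in its own cluster and $b$ is the mean distance to the instances of the nearest other cluster. It ranges from -1 to +1: close to +1 means the instance sits well inside its own cluster, around 0 means it lies near a cluster boundary, and close to -1 means it was probably assigned to the wrong cluster.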
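
To make the two metrics concrete, the sketch below recomputes inertia by hand and compares it with the fitted model's `inertia_` attribute, then calls `silhouette_score`. The dataset (`X_demo`) and the choice of 3 clusters are assumptions for the example only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical demo data: 3 blobs in 2D
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Inertia: squared distance from each instance to its assigned centroid, summed
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = np.sum((X_demo - assigned_centroids) ** 2)
print(manual_inertia, km.inertia_)

# Silhouette score: mean silhouette coefficient over all instances
sil = silhouette_score(X_demo, km.labels_)
print(sil)
```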

### 3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN defines clusters as continuous regions of high density. 
* **Core Instance:** An instance that has at least `min_samples` instances (counting itself) within a distance `eps` (epsilon).
* **Border Instance:** An instance that lies within `eps` of a core instance but is not a core instance itself.
* **Noise/Outlier:** An instance that is neither core nor border.

**Advantages:** It can find clusters of *any shape* (unlike K-Means which likes spheres) and automatically detects outliers.
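
The three instance types can be read off a fitted `DBSCAN` model: `core_sample_indices_` lists the core instances, label `-1` marks noise, and whatever remains is border. The sketch below is illustrative; the `eps=0.2` setting and the mask variable names are assumptions chosen for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X_demo, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)
db = DBSCAN(eps=0.2, min_samples=5).fit(X_demo)  # eps=0.2 is an assumed setting

# Core instances: listed explicitly by the fitted model
core_mask = np.zeros(len(X_demo), dtype=bool)
core_mask[db.core_sample_indices_] = True

# Noise: instances labeled -1
noise_mask = db.labels_ == -1

# Border: belongs to a cluster but is not a core instance
border_mask = ~core_mask & ~noise_mask

print("core:", core_mask.sum(),
      "border:", border_mask.sum(),
      "noise:", noise_mask.sum())
```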

### 4. Gaussian Mixture Models (GMM)
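A GMM is a probabilistic model that assumes the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown. Where K-Means assigns each instance to exactly one cluster and favors round clusters, a GMM gives each instance a probability of belonging to each component and can model ellipsoidal clusters of different sizes and orientations.
* It uses the **Expectation-Maximization (EM)** algorithm to estimate the parameters of the Gaussians: each component's weight, mean, and covariance matrix.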
* **Anomaly Detection:** Any instance located in a low-density region (low probability) can be considered an anomaly.
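
After fitting, the parameters estimated by EM are available as attributes of `GaussianMixture`, and `predict_proba` returns the soft assignment of each instance. The following is a small sketch on assumed synthetic data (`X_demo`, 3 components); the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Hypothetical demo data: 3 blobs in 2D
X_demo, _ = make_blobs(n_samples=500, centers=3, random_state=0)
gm_demo = GaussianMixture(n_components=3, n_init=5, random_state=0).fit(X_demo)

print(gm_demo.converged_, gm_demo.n_iter_)   # did EM converge, and in how many steps
print(gm_demo.weights_)                      # one mixing weight per component
print(gm_demo.means_.shape)                  # (3, 2): one mean vector per component
print(gm_demo.covariances_.shape)            # (3, 2, 2): one full covariance matrix each

# Soft clustering: probability of each component for the first 5 instances
resp = gm_demo.predict_proba(X_demo[:5])
print(resp.round(3))
```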

## 3. Code Reproduction

### 3.1 K-Means on Blobs
We generate synthetic "blob" data to visualize how K-Means works.

In [None]:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate blobs
blob_centers = np.array(
    [[ 0.2,  2.3],
     [-1.5 ,  2.3],
     [-2.8,  1.8],
     [-2.8,  2.8],
     [-2.8,  1.3]])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
X, y = make_blobs(n_samples=2000, centers=blob_centers,
                  cluster_std=blob_std, random_state=7)

# Train K-Means with k=5
k = 5
kmeans = KMeans(n_clusters=k, random_state=42)
y_pred = kmeans.fit_predict(X)

# Check the centroids found
print("Cluster Centers:\n", kmeans.cluster_centers_)

# Visualization
def plot_clusters(X, y=None):
    plt.scatter(X[:, 0], X[:, 1], c=y, s=1)
    plt.xlabel("$x_1$", fontsize=14)
    plt.ylabel("$x_2$", fontsize=14, rotation=0)

plt.figure(figsize=(8, 4))
plot_clusters(X, y_pred)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=100, c='red', marker='x')
plt.title("K-Means Clustering (k=5)")
plt.show()

### 3.2 Finding the Optimal k (Elbow Method & Silhouette)
Let's see what happens if we don't know there are 5 blobs. We test k from 1 to 9.

In [None]:
from sklearn.metrics import silhouette_score

kmeans_per_k = [KMeans(n_clusters=k, random_state=42).fit(X)
                for k in range(1, 10)]

# Inertias
inertias = [model.inertia_ for model in kmeans_per_k]

# Silhouette scores (start at k=2: the score needs at least 2 clusters)
silhouette_scores = [silhouette_score(X, model.labels_)
                     for model in kmeans_per_k[1:]]

plt.figure(figsize=(12, 4))

plt.subplot(121)
plt.plot(range(1, 10), inertias, "bo-")
plt.xlabel("$k$", fontsize=14)
plt.ylabel("Inertia", fontsize=14)
plt.annotate('Elbow', xy=(4, inertias[3]), xytext=(4.5, 650),
             arrowprops=dict(facecolor='black', shrink=0.1))
plt.title("Inertia (Elbow Method)")

plt.subplot(122)
plt.plot(range(2, 10), silhouette_scores, "bo-")
plt.xlabel("$k$", fontsize=14)
plt.ylabel("Silhouette Score", fontsize=14)
plt.title("Silhouette Score")

plt.show()

### 3.3 Image Segmentation
We will load a sample image (flower), and use K-Means to cluster the colors. By replacing each pixel's color with the cluster's mean color, we segment the image.

In [None]:
from sklearn.datasets import load_sample_images

# Load sample images
images = load_sample_images() 
img = images.images[1]  # The flower image

# Reshape image to a list of RGB colors (pixels)
X_img = img.reshape(-1, 3)

# Segment into 4 colors
kmeans = KMeans(n_clusters=4, random_state=42).fit(X_img)
segmented_img = kmeans.cluster_centers_[kmeans.labels_]
segmented_img = segmented_img.reshape(img.shape)

plt.figure(figsize=(10, 5))
plt.subplot(121)
plt.imshow(img)
plt.title("Original Image")
plt.axis('off')

plt.subplot(122)
# Cast to uint8 for valid display
plt.imshow(segmented_img.astype(np.uint8))
plt.title("Segmented (4 Colors)")
plt.axis('off')
plt.show()

### 3.4 DBSCAN
We use the Moons dataset (which K-Means fails on) to show the power of density-based clustering.

In [None]:
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=1000, noise=0.05, random_state=42)

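# eps=0.05 is too small for this data and fragments the moons into many pieces;
# a larger neighborhood radius such as eps=0.2 recovers the two moons
dbscan = DBSCAN(eps=0.2, min_samples=5)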
dbscan.fit(X)

# Check how many clusters found (label -1 is noise)
print("Labels:", np.unique(dbscan.labels_))

plot_clusters(X, dbscan.labels_)
plt.title("DBSCAN Clustering")
plt.show()

### 3.5 Gaussian Mixture & Anomaly Detection
We will fit a GMM to the blob data and then estimate the density at each instance. Instances in the lowest-density regions (here, the 4% of instances with the lowest density) are flagged as anomalies.

In [None]:
from sklearn.mixture import GaussianMixture

# Use the blob data again
X, y = make_blobs(n_samples=1000, centers=blob_centers, cluster_std=blob_std, random_state=42)

gm = GaussianMixture(n_components=5, n_init=10, random_state=42)
gm.fit(X)

# score_samples returns the log of the probability density function (PDF) at each instance
densities = gm.score_samples(X)
density_threshold = np.percentile(densities, 4) # Bottom 4%
anomalies = X[densities < density_threshold]

plt.figure(figsize=(8, 4))
plot_clusters(X)
plt.scatter(anomalies[:, 0], anomalies[:, 1], color='r', marker='*', s=100, label='Anomalies')
plt.title("GMM Anomaly Detection")
plt.legend()
plt.show()

## 4. Step-by-Step Explanation

### 1. K-Means Elbow & Silhouette
**Input:** A range of cluster counts ($k=1$ to $9$).
**Analysis:** 
* **Inertia Plot:** We see a sharp drop from $k=1$ to $k=3$, and a noticeable bend at $k=4$ or $k=5$. After $k=5$, inertia goes down very slowly. This suggests $k=4$ or $5$ is optimal.
* **Silhouette Plot:** The score is highest at $k=4$ and $k=5$, confirming our observation.

### 2. Image Segmentation Logic
**Input:** A 3D array (height, width, RGB channels).
**Process:**
1.  Flatten the image into a long list of pixels (ignoring position, looking only at color).
2.  Run K-Means with $k=4$. It finds the 4 dominant colors (centers).
3.  Replace every pixel with its nearest dominant color.
**Output:** An image composed of only 4 colors. Fewer distinct colors make the image simpler to compress and to process, which can be useful as a preprocessing step before tasks such as object detection.
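
The same three steps can be checked on a tiny synthetic image by counting distinct colors before and after. This is a sketch under stated assumptions: `img_demo` is a randomly generated 20x30 RGB array standing in for a real photo, and the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for a real image: random 20x30 RGB pixels
rng = np.random.default_rng(0)
img_demo = rng.integers(0, 256, size=(20, 30, 3), dtype=np.uint8)

# 1. Flatten to a list of pixels (one row per pixel, one column per channel)
pixels = img_demo.reshape(-1, 3).astype(np.float64)

# 2. Find 4 dominant colors
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(pixels)

# 3. Replace each pixel with its cluster's center color, then restore the shape
quantized = km.cluster_centers_[km.labels_].reshape(img_demo.shape).astype(np.uint8)

n_before = len(np.unique(img_demo.reshape(-1, 3), axis=0))
n_after = len(np.unique(quantized.reshape(-1, 3), axis=0))
print(n_before, "->", n_after)
```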

### 3. DBSCAN vs. K-Means
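* K-Means with $k=2$ fails on the Moons dataset: because each instance goes to the nearest centroid, the boundary between the two clusters is a straight line, which cuts across both moons instead of separating them.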
* DBSCAN connects points that are close together. Since the two moons are dense regions separated by a gap (low density), DBSCAN correctly identifies them as two separate shapes.
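
One way to quantify this comparison is to score both clusterings against the true moon labels with the adjusted Rand index (1.0 means a perfect match). This is an illustrative sketch: using `adjusted_rand_score` as the yardstick and `eps=0.2` for DBSCAN are assumptions made for this example, and true labels are only available because the data is synthetic.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X_moons, y_moons = make_moons(n_samples=1000, noise=0.05, random_state=42)

# K-Means with 2 centroids versus density-based clustering
km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_moons)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_moons)  # assumed eps

# Agreement with the true moon membership
ari_km = adjusted_rand_score(y_moons, km_labels)
ari_db = adjusted_rand_score(y_moons, db_labels)
print("K-Means ARI:", round(ari_km, 3))
print("DBSCAN ARI:", round(ari_db, 3))
```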

### 4. Anomaly Detection
**Concept:** If a data point lies far from every Gaussian component, relative to that component's spread, its probability density under the model is low.
**Process:** We fit the Gaussian Mixture, compute the log-density of every point with `score_samples`, and define a threshold (e.g., "the lowest 4%").
**Output:** The red stars in the plot are the anomalies. In a manufacturing context, these could be defective products.
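
Once the threshold is fixed, the same fitted model can screen new observations. The sketch below assumes a simple two-blob dataset (`X_demo`) and two made-up query points (`X_new`); the 4% cut-off mirrors the one used above, and all names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Hypothetical training data: two blobs centered at (0, 0) and (4, 4)
X_demo, _ = make_blobs(n_samples=500, centers=[[0, 0], [4, 4]],
                       cluster_std=0.5, random_state=0)
gm_demo = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X_demo)

# Threshold: the 4th percentile of the training log-densities
log_dens = gm_demo.score_samples(X_demo)
threshold = np.percentile(log_dens, 4)

# Screen new observations against the threshold
X_new = np.array([[0.1, -0.2],     # close to a component mean
                  [10.0, -10.0]])  # far from both components
is_anomaly = gm_demo.score_samples(X_new) < threshold
print(is_anomaly)
```

Because the threshold is a percentile of the training scores, about 4% of the training instances are flagged by construction; whether that rate suits a given application is a judgment call.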

## 5. Chapter Summary

* **Clustering** allows us to segment data without labels.
* **K-Means:** Fast, scalable, but assumes spherical clusters. Requires picking $k$.
* **DBSCAN:** Great for arbitrary shapes and outlier detection. Parameters `eps` and `min_samples` control density sensitivity.
* **GMM (Gaussian Mixtures):** Generalizes K-Means to ellipsoidal shapes and provides probabilistic assignments (soft clustering).
* **Anomaly Detection:** Unsupervised learning is excellent for finding outliers (fraud, defects) by looking for instances in low-density regions.