# Day 36: Introduction to Unsupervised Learning and Clustering Basics

Welcome to Day 36 of the 100 Days of Machine Learning! Today marks the beginning of **Week 8**, where we shift our focus from supervised learning to **unsupervised learning**. Until now, every problem came with labeled data; from here on, we must discover hidden patterns and structure in unlabeled data.

## What Will We Learn Today?

In this lesson, we'll explore:
- The fundamentals of unsupervised learning and how it differs from supervised learning
- The concept of clustering and its real-world applications
- The K-means clustering algorithm, one of the most popular and widely-used clustering techniques
- How to implement K-means in Python using scikit-learn
- Methods for evaluating clustering quality

## Why Unsupervised Learning Matters

In practice, much of the data an organization collects has no labels. Consider a company with millions of customers: it may not know in advance how to categorize them, but it wants to discover natural groupings for targeted marketing. Or imagine analyzing genomic data to discover new disease subtypes without prior knowledge of what those subtypes might be. Both are natural applications of unsupervised learning.

Unsupervised learning allows us to:
- **Discover hidden patterns** in data without predefined labels
- **Reduce dimensionality** to visualize and understand complex datasets
- **Segment customers, products, or data** into meaningful groups
- **Detect anomalies** by identifying data points that don't fit any pattern
- **Preprocess data** for supervised learning tasks

## Learning Objectives

By the end of this lesson, you will be able to:
1. Explain the difference between supervised and unsupervised learning
2. Understand the principles behind clustering algorithms
3. Describe how the K-means algorithm works mathematically
4. Implement K-means clustering using scikit-learn
5. Visualize clustering results and evaluate cluster quality
6. Apply K-means to real-world datasets and interpret the results


```python
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs, load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Set random seed for reproducibility
np.random.seed(42)

# Set style for better-looking plots
plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")
```

```
Libraries imported successfully!
```


## Unsupervised Learning: A New Paradigm

### Supervised Learning (Review)

In the past weeks, we've been working extensively with **supervised learning**, where we had:
- **Input features** $(X)$: The data we use to make predictions
- **Target labels** $(y)$: The known outcomes we're trying to predict

For example, in classification:
- Predicting if an email is spam $(y=1)$ or not spam $(y=0)$ based on its content $(X)$
- Classifying images of digits based on pixel values
- Diagnosing diseases based on patient symptoms and test results

The key characteristic: **We always had labeled training data to learn from.**

### Unsupervised Learning: Learning Without Labels

In **unsupervised learning**, we have:
- **Input features** $(X)$: The data we want to analyze
- **No target labels** $(y)$: We don't know the "correct" answer

Instead of predicting a known output, unsupervised learning aims to:
1. **Find patterns and structure** in the data
2. **Group similar data points** together (clustering)
3. **Reduce complexity** while preserving information (dimensionality reduction)
4. **Identify anomalies** that don't fit the patterns

### Key Differences

| Aspect | Supervised Learning | Unsupervised Learning |
|--------|-------------------|----------------------|
| Data | Labeled (X, y) | Unlabeled (X only) |
| Goal | Predict y for new X | Find patterns in X |
| Examples | Classification, Regression | Clustering, Dimensionality Reduction |
| Evaluation | Compare predictions to true labels | Internal metrics (e.g., inertia, silhouette score) |
| Applications | Spam detection, Price prediction | Customer segmentation, Anomaly detection |
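
In scikit-learn, this difference shows up directly in the `fit` call. The following is a minimal sketch: the toy arrays and the choice of `LogisticRegression` as the supervised model are illustrative assumptions, not part of this lesson's datasets.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy data (illustrative only): two well-separated groups of 2-D points
X_toy = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
                  [5.0, 5.1], [5.2, 4.9], [4.9, 5.3]])
y_toy = np.array([0, 0, 0, 1, 1, 1])  # labels exist only in the supervised case

# Supervised: the model is given both the features and the labels
clf = LogisticRegression().fit(X_toy, y_toy)

# Unsupervised: the model is given the features only
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

print("Supervised predictions:", clf.predict(X_toy))
print("Cluster assignments:   ", km.labels_)
```

Note that the cluster IDs returned by K-means are arbitrary: the first group may come out as cluster 0 or cluster 1, since nothing ties the IDs to the values in `y_toy`.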

### Types of Unsupervised Learning

The two main categories of unsupervised learning are:

1. **Clustering**: Grouping similar data points together
   - K-means (this lesson)
   - Hierarchical clustering (Day 38)
   - DBSCAN (Day 39)
   - Gaussian Mixture Models (Day 40)

2. **Dimensionality Reduction**: Reducing the number of features while preserving information
   - Principal Component Analysis (PCA) - Week 9
   - t-SNE - Week 9
   - Autoencoders (in Deep Learning section)

Today, we focus on **clustering**, specifically the **K-means algorithm**.


## Introduction to Clustering

**Clustering** is the task of grouping a set of objects such that objects in the same group (called a **cluster**) are more similar to each other than to those in other groups.

### Mathematical Definition

Given a dataset $X = \{x_1, x_2, ..., x_n\}$ where each $x_i \in \mathbb{R}^d$ (d-dimensional data points), clustering (in the hard-assignment form used by K-means) aims to partition the data into $k$ non-empty groups $C = \{C_1, C_2, ..., C_k\}$ such that every point belongs to exactly one group:

$$C_1 \cup C_2 \cup ... \cup C_k = X$$

$$C_i \cap C_j = \emptyset \text{ for } i \neq j$$
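
The two conditions above can be checked mechanically for any label vector. This is a small sketch using a made-up assignment of six points to three clusters; the `labels` array is hypothetical and exists only for illustration.

```python
import numpy as np

# Hypothetical hard assignment of n = 6 points to k = 3 clusters
labels = np.array([0, 2, 1, 0, 1, 2])
n, k = len(labels), 3

# Build each cluster C_i as a set of point indices
clusters = [set(np.flatnonzero(labels == i).tolist()) for i in range(k)]

# Condition 1: the union of all clusters recovers every point
covers_all = set().union(*clusters) == set(range(n))

# Condition 2: every pair of distinct clusters is disjoint
pairwise_disjoint = all(
    clusters[i].isdisjoint(clusters[j])
    for i in range(k) for j in range(i + 1, k)
)

print("Covers all points:", covers_all)
print("Pairwise disjoint:", pairwise_disjoint)
```

Any clustering that assigns exactly one label per point satisfies both conditions by construction, which is why K-means output is always a valid partition.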

### Real-World Applications

Clustering has numerous practical applications:

1. **Customer Segmentation**: Group customers by purchasing behavior for targeted marketing
2. **Image Segmentation**: Separate different objects or regions in images
3. **Document Organization**: Group similar documents together
4. **Anomaly Detection**: Identify data points that don't belong to any cluster
5. **Genomics**: Group genes with similar expression patterns
6. **Social Network Analysis**: Detect communities in social networks


## K-Means Clustering Algorithm

**K-means** is one of the simplest and most widely used clustering algorithms. It partitions data into $k$ clusters by minimizing the within-cluster variance.

### Mathematical Foundation

#### Objective Function

K-means aims to minimize the **within-cluster sum of squares (WCSS)**, also called **inertia**:

$$J = \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|^2$$

Where:
- $k$ = number of clusters
- $C_i$ = the $i$-th cluster
- $\mu_i$ = the centroid (mean) of cluster $C_i$
- $\|x - \mu_i\|^2$ = squared Euclidean distance between point $x$ and centroid $\mu_i$
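
To make the objective concrete, the sketch below computes $J$ by hand with NumPy and compares it to the `inertia_` attribute that scikit-learn reports. The dataset and the variable names (`X_demo`, `wcss_manual`) are assumptions made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data, separate from the dataset used later in the lesson
X_demo, _ = make_blobs(n_samples=200, centers=3, n_features=2, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Look up the centroid each point was assigned to, shape (n_samples, n_features)
assigned_centroids = km.cluster_centers_[km.labels_]

# J = sum of squared Euclidean distances from each point to its centroid
wcss_manual = np.sum((X_demo - assigned_centroids) ** 2)

print(f"Manual WCSS: {wcss_manual:.4f}")
print(f"inertia_:    {km.inertia_:.4f}")
```

The two numbers should agree up to floating-point rounding.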

#### Euclidean Distance

The distance between two points $x = (x_1, x_2, ..., x_d)$ and $y = (y_1, y_2, ..., y_d)$ is:

$$d(x, y) = \sqrt{\sum_{j=1}^{d} (x_j - y_j)^2}$$
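
As a quick check of the formula, here is a small NumPy sketch with two made-up 3-dimensional points. It computes the distance once by translating the formula directly and once with `np.linalg.norm`.

```python
import numpy as np

# Two illustrative points in 3 dimensions
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# Direct translation of the formula: square root of the summed squared differences
d_formula = np.sqrt(np.sum((x - y) ** 2))

# Equivalent built-in: the L2 norm of the difference vector
d_norm = np.linalg.norm(x - y)

print(d_formula, d_norm)  # both 5.0, since sqrt(3^2 + 4^2 + 0^2) = 5
```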

#### Cluster Centroid

The centroid $\mu_i$ of cluster $C_i$ containing $n_i$ points is the mean of all points in the cluster:

$$\mu_i = \frac{1}{n_i} \sum_{x \in C_i} x$$
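
The centroid is a coordinate-wise mean, and it is also the point that minimizes the sum of squared distances to the cluster's members. The sketch below illustrates both facts on three made-up points; `sum_sq` is a helper defined here only for the demonstration.

```python
import numpy as np

# Three illustrative points belonging to one cluster
cluster_points = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])

# Centroid: mean over the points, taken separately for each coordinate
mu = cluster_points.mean(axis=0)
print(mu)  # [3. 2.]

def sum_sq(center):
    """Sum of squared Euclidean distances from the cluster points to `center`."""
    return np.sum((cluster_points - center) ** 2)

# Moving the center away from the mean increases the sum of squares
print(sum_sq(mu))                          # 16.0
print(sum_sq(mu + np.array([0.5, 0.0])))   # 16.75
```

This minimizing property is what makes the update step of K-means reduce (or at least never increase) the objective $J$.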

### The K-Means Algorithm

**Input**: Dataset $X = \{x_1, x_2, ..., x_n\}$, Number of clusters $k$

**Algorithm**:

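1. **Initialization**: Randomly select $k$ data points as initial centroids $\mu_1, \mu_2, ..., \mu_k$ (scikit-learn's `KMeans` uses k-means++ seeding by default, which tends to spread the initial centroids apart)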

2. **Repeat until convergence**:
   
   a. **Assignment Step**: Assign each point $x_i$ to the nearest centroid
   $$C_j = \{x_i : \|x_i - \mu_j\|^2 \leq \|x_i - \mu_l\|^2 \text{ for all } l = 1,...,k\}$$
   
   b. **Update Step**: Recalculate centroids as the mean of all points in each cluster
   $$\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x$$
   
3. **Convergence**: Stop when the centroids no longer change (or move less than a small tolerance), or when a maximum number of iterations is reached
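
The steps above can be sketched in NumPy. The block below performs the initialization and a single assignment/update iteration only; wrapping it in a loop with a convergence check is left for the Bonus Challenge at the end of this lesson. The data and all variable names are assumptions made for illustration, and this is not how scikit-learn implements the algorithm internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: two loose groups of 20 points each in 2-D
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)),
                    rng.normal(6.0, 1.0, size=(20, 2))])
k = 2

# Step 1: pick k distinct data points as the initial centroids
init_idx = rng.choice(len(X_demo), size=k, replace=False)
centroids = X_demo[init_idx]

# Step 2a (assignment): squared distance from every point to every centroid
sq_dists = ((X_demo[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)  # (n, k)
labels = sq_dists.argmin(axis=1)
J_before = sq_dists[np.arange(len(X_demo)), labels].sum()

# Step 2b (update): each centroid becomes the mean of its assigned points.
# No cluster is empty here because each initial centroid is itself a data point.
new_centroids = np.array([X_demo[labels == j].mean(axis=0) for j in range(k)])
J_after = ((X_demo - new_centroids[labels]) ** 2).sum()

print(f"Objective after assignment: {J_before:.2f}")
print(f"Objective after update:     {J_after:.2f}")
```

The update step can never increase the objective, because the mean of a set of points minimizes the sum of squared distances to them.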

### Key Properties

**Advantages**:
- Simple and easy to implement
- Computationally efficient: $O(nkd \cdot i)$ time, where $n$ is the number of points, $d$ the number of features, and $i$ the number of iterations
- Works well when clusters are spherical and similar in size
- Scales to large datasets

**Limitations**:
- Must specify $k$ (number of clusters) in advance
- Sensitive to initial centroid positions
- Assumes clusters are convex and isotropic
- Sensitive to outliers
- Converges to a local optimum of the objective, which is not guaranteed to be the global one
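
Sensitivity to initialization can be explored with a short experiment. The sketch below runs K-means several times with a single random initialization each (`init="random"`, `n_init=1`) and once with many initializations. The dataset, the number of clusters, and the seeds are arbitrary choices for illustration, and whether the single runs actually differ depends on the data and the seeds.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data with five blobs
X_demo, _ = make_blobs(n_samples=300, centers=5, n_features=2,
                       cluster_std=1.0, random_state=1)

# One random initialization per run: each run may end in a different local optimum
single_run_inertias = []
for seed in range(5):
    km = KMeans(n_clusters=5, init="random", n_init=1, random_state=seed).fit(X_demo)
    single_run_inertias.append(km.inertia_)

# Many initializations: scikit-learn keeps the run with the lowest inertia
km_multi = KMeans(n_clusters=5, init="random", n_init=20, random_state=0).fit(X_demo)

for seed, inertia in enumerate(single_run_inertias):
    print(f"seed={seed}, n_init=1:  inertia = {inertia:.2f}")
print(f"n_init=20:          inertia = {km_multi.inertia_:.2f}")
```

Running several initializations and keeping the best result is the standard mitigation, which is why the examples in this lesson pass `n_init=10`.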


```python
# Generate synthetic data with clear clusters
X, y_true = make_blobs(n_samples=300, centers=3, n_features=2, 
                       random_state=42, cluster_std=0.8)

# Create and fit K-means model
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
y_pred = kmeans.fit_predict(X)

# Print results
print("K-means clustering completed successfully!")
print(f"Number of clusters: {kmeans.n_clusters}")
print(f"Inertia (WCSS): {kmeans.inertia_:.2f}")
```

```
K-means clustering completed successfully!
Number of clusters: 3
Inertia (WCSS): 78.85
```


```python
# Visualize clustering results
plt.figure(figsize=(12, 5))

# Subplot 1: True clusters (for comparison)
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', s=50, 
           edgecolors='black', linewidth=0.5, alpha=0.7)
plt.title('True Clusters (Hidden in Real Scenarios)', fontsize=13, fontweight='bold')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

# Subplot 2: K-means results
plt.subplot(1, 2, 2)
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis', s=50, 
           edgecolors='black', linewidth=0.5, alpha=0.7)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
           c='red', marker='X', s=300, edgecolors='black', linewidth=2,
           label='Centroids')
plt.title('K-Means Clustering Results (k=3)', fontsize=13, fontweight='bold')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()

plt.tight_layout()
plt.show()

print("Clustering results visualized!")
```

```
Clustering results visualized!
```


## Evaluating Clustering Quality

Since unsupervised learning has no true labels to compare against, we rely on internal metrics that score a clustering using only the data and the cluster assignments:

### Silhouette Score

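The **silhouette score** measures how similar a sample is to its own cluster compared to the nearest other cluster. It ranges from -1 to 1:
- **Near 1**: The sample sits well inside its cluster and far from neighboring clusters
- **Near 0**: The sample lies on or near the boundary between two clusters
- **Near -1**: The sample has probably been assigned to the wrong cluster

For each sample $i$, the silhouette coefficient is defined below; the overall silhouette score is the mean of $s(i)$ over all samples: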

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$

Where:
- $a(i)$ = average distance between $i$ and all other points in the same cluster
- $b(i)$ = average distance between $i$ and all points in the nearest cluster that $i$ does not belong to
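
The definition can be verified by hand on a tiny example. The sketch below computes $a(i)$, $b(i)$, and $s(i)$ for one sample and compares the result with scikit-learn's `silhouette_samples`, which returns the per-sample coefficients. The six points and their labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Illustrative data: two tight groups of three points each
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

i = 0  # the sample to inspect
dists = np.linalg.norm(X_toy - X_toy[i], axis=1)

# a(i): mean distance to the other points in the same cluster (excluding i itself)
same_cluster = labels == labels[i]
same_cluster[i] = False
a_i = dists[same_cluster].mean()

# b(i): smallest mean distance to the points of any other cluster
b_i = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])

s_i = (b_i - a_i) / max(a_i, b_i)
s_sklearn = silhouette_samples(X_toy, labels)[i]

print(f"a(i) = {a_i:.4f}, b(i) = {b_i:.4f}")
print(f"s(i) by hand:      {s_i:.4f}")
print(f"s(i) from sklearn: {s_sklearn:.4f}")
```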

### Elbow Method

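The **elbow method** helps choose the number of clusters by plotting the inertia (WCSS) for different values of $k$. Inertia tends to decrease as $k$ grows, so the lowest value is not informative on its own. Instead, we look for the "elbow": the value of $k$ beyond which adding more clusters yields only small reductions in inertia. That point is often taken as a reasonable choice of $k$.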


```python
# Elbow Method: Test different k values
inertias = []
silhouette_scores = []
K_range = range(2, 11)

print("Testing different numbers of clusters...")
for k in K_range:
    kmeans_temp = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans_temp.fit(X)
    inertias.append(kmeans_temp.inertia_)
    silhouette_scores.append(silhouette_score(X, kmeans_temp.labels_))

# Plot results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Elbow curve
ax1.plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
ax1.set_xlabel('Number of Clusters (k)', fontsize=12)
ax1.set_ylabel('Inertia (WCSS)', fontsize=12)
ax1.set_title('Elbow Method', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)

# Silhouette scores
ax2.plot(K_range, silhouette_scores, 'ro-', linewidth=2, markersize=8)
ax2.set_xlabel('Number of Clusters (k)', fontsize=12)
ax2.set_ylabel('Silhouette Score', fontsize=12)
ax2.set_title('Silhouette Score vs. Number of Clusters', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Elbow method plot created!")
```

```
Testing different numbers of clusters...
Elbow method plot created!
```
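
Reading an elbow off a plot is subjective, so it can help to also pick $k$ programmatically from the silhouette scores. The sketch below is self-contained: it regenerates the same synthetic blobs and selects the $k$ with the highest mean silhouette score. Taking the maximum is one simple heuristic, not a definitive rule, and the names `k_values`, `scores`, and `best_k` are introduced here for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Same synthetic data settings as earlier in the lesson
X_demo, _ = make_blobs(n_samples=300, centers=3, n_features=2,
                       random_state=42, cluster_std=0.8)

k_values = list(range(2, 11))  # silhouette needs at least 2 clusters
scores = []
for k in k_values:
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    scores.append(silhouette_score(X_demo, labels))

# Select the k whose clustering has the highest mean silhouette score
best_k = k_values[int(np.argmax(scores))]
print(f"k with the highest silhouette score: {best_k}")
```

When the elbow and the silhouette maximum disagree, domain knowledge about how many groups are meaningful is usually the tiebreaker.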


## Hands-On Example: Clustering the Iris Dataset

Let's apply K-means to the famous Iris dataset. The Iris dataset contains measurements of 150 iris flowers from 3 different species. We'll use only the measurements (features) and see if K-means can discover the 3 species without being told what they are.


```python
# Load Iris dataset
iris = load_iris()
X_iris = iris.data
y_iris_true = iris.target

print(f"Iris dataset loaded: {X_iris.shape[0]} samples, {X_iris.shape[1]} features")

# Standardize features (important for K-means!)
scaler = StandardScaler()
X_iris_scaled = scaler.fit_transform(X_iris)

# Apply K-means
kmeans_iris = KMeans(n_clusters=3, random_state=42, n_init=10)
y_iris_pred = kmeans_iris.fit_predict(X_iris_scaled)

# Calculate silhouette score
sil_score = silhouette_score(X_iris_scaled, y_iris_pred)

print("K-means clustering performed (k=3)")
print(f"Silhouette Score: {sil_score:.2f}")
```

```
Iris dataset loaded: 150 samples, 4 features
K-means clustering performed (k=3)
Silhouette Score: 0.55
```
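
Standardization matters because K-means relies on Euclidean distance, so a feature with a larger spread contributes more to every distance. The sketch below shows what `StandardScaler` does to the Iris features: after scaling, each feature has mean 0 and standard deviation 1, so no single feature dominates. The variable names `X_raw` and `X_std` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_raw = load_iris().data
X_std = StandardScaler().fit_transform(X_raw)

# Before scaling, the features have different spreads in their raw units (cm)
print("Raw std per feature:    ", X_raw.std(axis=0).round(2))

# After scaling, every feature has mean ~0 and standard deviation ~1
print("Scaled mean per feature:", X_std.mean(axis=0).round(2))
print("Scaled std per feature: ", X_std.std(axis=0).round(2))
```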


```python
# Visualize using the first two (unscaled) features so the axes stay in cm
plt.figure(figsize=(14, 5))

# True labels
plt.subplot(1, 2, 1)
scatter1 = plt.scatter(X_iris[:, 0], X_iris[:, 1], c=y_iris_true, 
                       cmap='viridis', s=50, edgecolors='black', linewidth=0.5)
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.title('True Species Labels', fontsize=13, fontweight='bold')
plt.colorbar(scatter1, label='Species')

# K-means predictions
# Note: cluster IDs are arbitrary, so the colors need not match the species colors
plt.subplot(1, 2, 2)
scatter2 = plt.scatter(X_iris[:, 0], X_iris[:, 1], c=y_iris_pred, 
                       cmap='viridis', s=50, edgecolors='black', linewidth=0.5)
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.title('K-Means Clustering Results', fontsize=13, fontweight='bold')
plt.colorbar(scatter2, label='Cluster')

plt.tight_layout()
plt.show()

print("Iris clustering results visualized!")
```

```
Iris clustering results visualized!
```
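
Because Iris happens to come with species labels, we can check how well the discovered clusters line up with them. The sketch below is self-contained and uses a contingency table (`pd.crosstab`) plus the adjusted Rand index, which is insensitive to how the cluster IDs are numbered. This check is only possible for teaching datasets like this one; in a genuinely unlabeled problem there is nothing to compare against.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
clusters = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_scaled)

# Rows: true species, columns: cluster IDs (the numbering is arbitrary)
species = pd.Series(iris.target_names[iris.target], name="species")
table = pd.crosstab(species, pd.Series(clusters, name="cluster"))
print(table)

# Adjusted Rand index: 1.0 means identical groupings, ~0 means chance-level agreement
ari = adjusted_rand_score(iris.target, clusters)
print(f"Adjusted Rand Index: {ari:.2f}")
```

A row whose counts are spread across several columns indicates a species that K-means split or merged with another.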


## Key Takeaways

Congratulations! You've completed your introduction to unsupervised learning and K-means clustering. Here are the key points to remember:

1. **Unsupervised learning** works with unlabeled data to discover hidden patterns and structures
2. **Clustering** groups similar data points together based on their features
3. **K-means** is a simple yet powerful clustering algorithm that minimizes within-cluster variance
4. The K-means algorithm alternates between **assigning points to clusters** and **updating centroids**
5. **Feature scaling** is important for K-means since it uses distance-based calculations
6. The **elbow method** and **silhouette score** help evaluate clustering quality and choose a reasonable number of clusters
7. K-means works best when clusters are **spherical** and **similar in size**
8. Applications include customer segmentation, image processing, anomaly detection, and more

## What's Next?

In the coming days, we'll explore:
- **Day 37**: Implementing K-Means for Different Data Types
- **Day 38**: Hierarchical Clustering Techniques
- **Day 39**: Density-Based Clustering with DBSCAN
- **Day 40**: Gaussian Mixture Models and Expectation-Maximization


## Exercise for the Reader

To solidify your understanding of K-means clustering, try these exercises:

### Exercise 1: Experiment with Different K Values
Using the synthetic data we generated, try different values of k (2, 4, 5, 10) and observe:
- How the clusters change
- How the inertia and silhouette scores change
- Which value of k seems most appropriate for this dataset

### Exercise 2: Impact of Initialization
Run K-means multiple times with different `random_state` values, setting `n_init=1` so that each run uses a single initialization. Do you get the same results? Why or why not?

### Exercise 3: Feature Scaling
Apply K-means to the Iris dataset WITHOUT scaling the features. Compare the results to the scaled version. What differences do you observe?

### Exercise 4: Real-World Application
Find a dataset online (e.g., from Kaggle or UCI Machine Learning Repository) and apply K-means clustering:
1. Load and explore the data
2. Preprocess and scale features if necessary
3. Use the elbow method to determine optimal k
4. Apply K-means and visualize results
5. Interpret the clusters - what patterns did you discover?

### Bonus Challenge
Implement the K-means algorithm from scratch using only NumPy (without scikit-learn). This will deepen your understanding of how the algorithm works!


## Further Resources

To deepen your understanding of unsupervised learning and clustering, explore these resources:

### Documentation and Tutorials
1. **Scikit-learn Clustering Documentation**: https://scikit-learn.org/stable/modules/clustering.html
2. **K-Means Clustering Explained**: https://stanford.edu/~cpiech/cs221/handouts/kmeans.html
3. **Real Python K-Means Guide**: https://realpython.com/k-means-clustering-python/

### Academic Papers
4. MacQueen, J. (1967). "Some methods for classification and analysis of multivariate observations" - Original K-means paper
5. Arthur, D., & Vassilvitskii, S. (2007). "k-means++: The advantages of careful seeding" - K-means++ initialization

### Interactive Learning
6. **K-Means Visualization**: https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
7. **Seeing Theory - Clustering**: https://seeing-theory.brown.edu/

### Videos
8. **StatQuest: K-means clustering**: https://www.youtube.com/watch?v=4b5d3muPQmA
9. **3Blue1Brown - Understanding Clustering**: Various videos on mathematical intuition

### Books
10. "Pattern Recognition and Machine Learning" by Christopher Bishop (Chapter 9)
11. "The Elements of Statistical Learning" by Hastie, Tibshirani, and Friedman (Chapter 14)
