# Module 01: Supervised vs Unsupervised Learning

**Difficulty**: ⭐ Beginner  
**Estimated Time**: 50 minutes  
**Prerequisites**: 
- [Module 00: Introduction to ML and scikit-learn](00_introduction_to_ml_and_sklearn.ipynb)

## Learning Objectives

By the end of this notebook, you will be able to:
1. Distinguish between supervised and unsupervised learning
2. Identify when to use classification vs regression
3. Understand common supervised learning algorithms and their use cases
4. Understand common unsupervised learning algorithms and their use cases
5. Apply both paradigms to real datasets

## 1. The Two Main Learning Paradigms

Machine learning algorithms can be grouped into two main categories based on how they learn:

### Supervised Learning
**Definition**: Learning from labeled data where we know the correct answers.

**Analogy**: Like learning with a teacher who provides correct answers:
- You study example problems with solutions
- You learn patterns from these examples
- You apply learned patterns to solve new problems

**Key Characteristic**: Training data includes both features (X) and labels (y)

### Unsupervised Learning
**Definition**: Learning from unlabeled data where we don't know the answers.

**Analogy**: Like exploring a subject on your own:
- No teacher provides correct answers
- You find patterns and structure in the data yourself
- You group similar things together based on characteristics

**Key Characteristic**: Training data includes only features (X), no labels
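
To make the difference concrete, here is a minimal sketch on toy data. The points and the `"small"`/`"large"` labels are made up for illustration and are not part of the course datasets. The supervised estimator is fitted with `fit(X, y)`, while the unsupervised one only ever sees `X`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Toy data (made up for illustration): two well-separated groups of 2-D points
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])
y = np.array(["small", "small", "small", "large", "large", "large"])

# Supervised: the estimator receives features AND labels
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)
supervised_pred = clf.predict([[1.0, 1.0], [5.0, 5.0]])

# Unsupervised: the estimator receives features only
km = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = km.fit_predict(X)

print("Supervised predictions:", supervised_pred)
print("Cluster ids:", cluster_ids)
```

Note that the cluster ids are arbitrary integers: the algorithm can group the points, but it has no way of knowing that one group should be called "small".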

In [None]:
# Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Set random seed for reproducibility
np.random.seed(42)

# Visualization settings
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("✓ Setup complete!")

## 2. Supervised Learning: Classification vs Regression

Supervised learning has two main sub-types:

### Classification
**Task**: Predict discrete categories or classes

**Examples**:
- Email: spam or not spam (binary)
- Iris flower: setosa, versicolor, or virginica (multiclass)
- Image: contains cat, dog, bird, or none (multiclass)
- Medical diagnosis: disease present or absent (binary)

**Output**: Category label (discrete values)

### Regression
**Task**: Predict continuous numerical values

**Examples**:
- House price: $250,000
- Temperature tomorrow: 23.5°C
- Stock price: $142.35
- Age of a person: 34 years

**Output**: Numerical value (continuous)
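
As a small illustration of the two output types, the sketch below uses the same feature with two different targets. The study-hours data is invented for this example, and the choice of `KNeighborsClassifier` and `LinearRegression` is just one reasonable pairing.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Toy data (made up): hours studied by six students
hours = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]])

# Classification target: one discrete category per student
passed = np.array(["fail", "fail", "fail", "pass", "pass", "pass"])
# Regression target: one continuous number per student (exactly 50 + 2 * hours)
score = 50.0 + 2.0 * hours.ravel()

new_students = np.array([[2.5], [7.5]])

# Classifier output: category labels
class_pred = KNeighborsClassifier(n_neighbors=3).fit(hours, passed).predict(new_students)
# Regressor output: numerical values
reg_pred = LinearRegression().fit(hours, score).predict(new_students)

print("Classification output:", class_pred)
print("Regression output:", reg_pred)
```

The feature is identical in both cases; what makes the problem classification or regression is the type of the target.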

## 3. Classification Example: Iris Species

Let's demonstrate classification with the Iris dataset. We'll predict the species based on flower measurements.

In [None]:
# Load Iris dataset
iris_df = pd.read_csv('data/sample/iris.csv')

# Prepare features and target
feature_cols = ['sepal length (cm)', 'sepal width (cm)', 
                'petal length (cm)', 'petal width (cm)']
X = iris_df[feature_cols]
y = iris_df['species_name']

print("Classification Problem Setup:")
print(f"Features (X): {X.shape}")
print(f"Target (y): {y.shape}")
print(f"\nClasses to predict: {y.unique()}")
print(f"Number of classes: {y.nunique()}")

In [None]:
# Visualize the classification problem
# We'll use two features for easy visualization
plt.figure(figsize=(10, 6))
for species in iris_df['species_name'].unique():
    subset = iris_df[iris_df['species_name'] == species]
    plt.scatter(subset['petal length (cm)'], subset['petal width (cm)'], 
               label=species, s=100, alpha=0.6, edgecolors='black')

plt.xlabel('Petal Length (cm)', fontsize=12)
plt.ylabel('Petal Width (cm)', fontsize=12)
plt.title('Classification: Predicting Iris Species\n(Each color represents a different class)', 
         fontsize=14, fontweight='bold')
plt.legend(title='Species', fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nObservation: Setosa is clearly separated by petal measurements,")
print("while versicolor and virginica overlap slightly. Classification should still work well!")

In [None]:
# Build a simple classification model
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Train classifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

# Make predictions
y_pred = classifier.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print("Classification Results:")
print(f"Accuracy: {accuracy:.1%}\n")
print("Detailed Report:")
print(classification_report(y_test, y_pred))

## 4. Regression Example: Diabetes Progression

Now let's demonstrate regression by predicting disease progression (a continuous value) based on medical measurements.

In [None]:
# Load Diabetes dataset
diabetes_df = pd.read_csv('data/sample/diabetes.csv')

# Prepare features and target
feature_cols = [col for col in diabetes_df.columns if col != 'progression']
X_reg = diabetes_df[feature_cols]
y_reg = diabetes_df['progression']

print("Regression Problem Setup:")
print(f"Features (X): {X_reg.shape}")
print(f"Target (y): {y_reg.shape}")
print(f"\nTarget statistics:")
print(f"  Min: {y_reg.min():.2f}")
print(f"  Max: {y_reg.max():.2f}")
print(f"  Mean: {y_reg.mean():.2f}")
print(f"  Std: {y_reg.std():.2f}")

In [None]:
# Visualize the regression problem
# Show target distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
axes[0].hist(y_reg, bins=30, color='steelblue', edgecolor='black', alpha=0.7)
axes[0].axvline(y_reg.mean(), color='red', linestyle='--', linewidth=2, label='Mean')
axes[0].set_xlabel('Disease Progression', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Target Distribution (Continuous Values)', fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Scatter plot with one feature
axes[1].scatter(X_reg['bmi'], y_reg, alpha=0.5, s=50, edgecolors='black')
axes[1].set_xlabel('BMI (Body Mass Index)', fontsize=12)
axes[1].set_ylabel('Disease Progression', fontsize=12)
axes[1].set_title('Regression: Predicting Continuous Values', fontsize=13, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nObservation: The target is a continuous numerical value, not discrete classes.")
print("This is a regression problem!")

In [None]:
# Build a simple regression model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Split data
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=42
)

# Train regressor
regressor = LinearRegression()
regressor.fit(X_train_reg, y_train_reg)

# Make predictions
y_pred_reg = regressor.predict(X_test_reg)

# Evaluate
mse = mean_squared_error(y_test_reg, y_pred_reg)
rmse = np.sqrt(mse)
r2 = r2_score(y_test_reg, y_pred_reg)

print("Regression Results:")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R² Score: {r2:.3f}")
print("\nInterpretation:")
print(f"- Typical prediction error is about {rmse:.2f} units (RMSE weights large errors more heavily)")
print(f"- Model explains {r2*100:.1f}% of the variance in disease progression on the test set")
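
RMSE is the square root of the mean squared error, so it is not a plain average of the errors: one large miss raises it more than it raises the mean absolute error (MAE). The sketch below computes both by hand on made-up numbers (not the diabetes data) to show the gap.

```python
import numpy as np

# Made-up true values and predictions; the last prediction is a large miss
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 210.0, 200.0])

errors = y_hat - y_true                     # [10, -10, 10, -50]

# MAE: plain average of the absolute errors
mae = np.mean(np.abs(errors))               # (10 + 10 + 10 + 50) / 4

# RMSE: square, average, then take the square root
rmse_demo = np.sqrt(np.mean(errors ** 2))   # sqrt((100 + 100 + 100 + 2500) / 4)

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse_demo:.2f}")
```

When the two metrics are close, the errors are of similar size; a much larger RMSE hints at a few big misses.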

## 5. Unsupervised Learning: Clustering

Unlike supervised learning, unsupervised learning works with unlabeled data. A common task is **clustering** - grouping similar data points together.

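**Key Difference**: We don't tell the algorithm which points belong to which group; it infers the groups from the features alone. (Some algorithms, such as K-Means, still need to be told how many groups to look for.)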

In [None]:
# Load clustering dataset
blobs_df = pd.read_csv('data/sample/blobs_clustering.csv')

# For unsupervised learning, we use ONLY features (no labels)
X_cluster = blobs_df[['feature_1', 'feature_2']]

print("Unsupervised Learning Setup:")
print(f"Features (X): {X_cluster.shape}")
print("\nNotice: No target variable (y) is used!")
print("The algorithm will find patterns on its own.")

In [None]:
# Visualize the unlabeled data
plt.figure(figsize=(10, 6))
plt.scatter(X_cluster['feature_1'], X_cluster['feature_2'], 
           s=100, alpha=0.6, edgecolors='black', color='gray')
plt.xlabel('Feature 1', fontsize=12)
plt.ylabel('Feature 2', fontsize=12)
plt.title('Unsupervised Learning: Data Without Labels\n(Can you see natural groups?)', 
         fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nQuestion: Can you visually identify distinct groups in the data?")

In [None]:
# Apply K-Means clustering
from sklearn.cluster import KMeans

# Create clustering model
# We specify 4 clusters (in real scenarios, we'd need to determine this)
# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)

# Fit and predict clusters
# Note: For this exploratory demo we fit on all the data; there are no labels to hold out for scoring
cluster_labels = kmeans.fit_predict(X_cluster)

# Add predictions to dataframe
blobs_df['predicted_cluster'] = cluster_labels

print("Clustering complete!")
print(f"Found {len(np.unique(cluster_labels))} clusters")
print(f"\nCluster sizes:")
print(blobs_df['predicted_cluster'].value_counts().sort_index())

In [None]:
# Visualize the clustering results
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Original (unlabeled) data
axes[0].scatter(X_cluster['feature_1'], X_cluster['feature_2'], 
               s=100, alpha=0.6, edgecolors='black', color='gray')
axes[0].set_xlabel('Feature 1', fontsize=12)
axes[0].set_ylabel('Feature 2', fontsize=12)
axes[0].set_title('Before: Unlabeled Data', fontsize=13, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Clustered data
scatter = axes[1].scatter(X_cluster['feature_1'], X_cluster['feature_2'], 
                         c=cluster_labels, cmap='viridis',
                         s=100, alpha=0.6, edgecolors='black')
# Plot cluster centers
centers = kmeans.cluster_centers_
axes[1].scatter(centers[:, 0], centers[:, 1], 
               marker='X', s=300, c='red', edgecolors='black', linewidths=2,
               label='Cluster Centers')
axes[1].set_xlabel('Feature 1', fontsize=12)
axes[1].set_ylabel('Feature 2', fontsize=12)
axes[1].set_title('After: Discovered Clusters', fontsize=13, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.colorbar(scatter, ax=axes[1], label='Cluster')
plt.tight_layout()
plt.show()

print("\nKey Insight: The algorithm found these groupings without any labels; we only told it how many clusters to look for.")

## 6. Comparing the Paradigms

| Aspect | Supervised Learning | Unsupervised Learning |
|--------|--------------------|-----------------------|
| **Training Data** | Labeled (X and y) | Unlabeled (only X) |
| **Goal** | Predict labels for new data | Find hidden patterns/structure |
| **Evaluation** | Compare predictions to true labels | Measure cluster quality, coherence |
| **Examples** | Classification, Regression | Clustering, Dimensionality Reduction |
| **Use Case** | Spam detection, Price prediction | Customer segmentation, Anomaly detection |
| **Evaluation Difficulty** | Easier (ground truth is available) | Harder (no ground truth to compare against) |
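
The "Evaluation" row for unsupervised learning can be sketched with the silhouette score, one of several cluster-quality measures available in scikit-learn. The example below is a rough illustration under stated assumptions: it uses synthetic data from `make_blobs` with four hand-picked centers as a stand-in (not the course's `blobs_clustering.csv`), and it tries a small, arbitrary range of `k` values.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (an assumption): four well-separated blobs
demo_centers = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X_demo, _ = make_blobs(n_samples=300, centers=demo_centers,
                       cluster_std=0.6, random_state=42)

# Score each candidate number of clusters without using any labels
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# Higher silhouette means tighter, better-separated clusters
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("Best k by silhouette:", best_k)
```

The silhouette score ranges from -1 to 1. On real data the peak is often less clear-cut than on these clean blobs, so treat it as a guide rather than a definitive answer.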

### When to Use Each?

**Use Supervised Learning when:**
- You have labeled data (know the correct answers)
- You want to predict specific outcomes
- You can clearly define what you're looking for
- Examples: Medical diagnosis, fraud detection, price prediction

**Use Unsupervised Learning when:**
- You don't have labels (or labeling is too expensive)
- You want to explore and understand your data
- You're looking for hidden patterns
- Examples: Market segmentation, recommendation systems, data exploration

## Exercises

Practice identifying and working with different learning paradigms.

### Exercise 1: Identify the Problem Type

For each scenario below, identify:
1. Is it supervised or unsupervised?
2. If supervised, is it classification or regression?

Write your answers in the code cell:

**Scenarios:**
1. Predicting whether a loan application will be approved (historical data available)
2. Grouping news articles by topic without predefined categories
3. Estimating the number of sales for next month
4. Identifying different types of flowers from petal measurements (labeled dataset)
5. Finding groups of similar customers based on purchasing behavior
6. Predicting student exam scores based on study hours

In [None]:
# Your answers:
# 1. 
# 2. 
# 3. 
# 4. 
# 5. 
# 6. 


### Exercise 2: Classification on Wine Dataset

Load the wine dataset and build a classification model to predict wine types.

Steps:
1. Load data from `data/sample/wine.csv`
2. Separate features and target (target column name is 'target')
3. Split into train/test sets (70/30 split)
4. Train a KNeighborsClassifier with n_neighbors=7
5. Calculate and print the accuracy

In [None]:
# Your code here



### Exercise 3: Regression on Housing Data

Build a regression model to predict house values using the California housing dataset.

Steps:
1. Load data from `data/sample/california_housing.csv`
2. Separate features and target (target is 'median_house_value')
3. Split into train/test sets (70/30 split)
4. Train a LinearRegression model
5. Calculate and print the R² score

In [None]:
# Your code here



### Exercise 4: Clustering Analysis

Apply K-Means clustering to the Iris dataset (WITHOUT using the species labels).

Steps:
1. Load the Iris dataset
2. Use only the feature columns (ignore species)
3. Apply KMeans with n_clusters=3
4. Create a scatter plot of petal length vs petal width, colored by predicted clusters
5. Compare: How do the discovered clusters relate to the actual species?

In [None]:
# Your code here



## Summary

Congratulations! You've covered the two fundamental learning paradigms in machine learning.

### Key Concepts

1. **Supervised Learning**:
   - Learns from labeled data (features + correct answers)
   - Two types: Classification (categories) and Regression (continuous values)
   - Used when you want to predict specific outcomes
   - Evaluation: Compare predictions to true labels

2. **Classification**:
   - Predicts discrete categories/classes
   - Examples: Iris species, spam detection, disease diagnosis
   - Metrics: Accuracy, precision, recall, F1-score

3. **Regression**:
   - Predicts continuous numerical values
   - Examples: House prices, temperature, disease progression
   - Metrics: RMSE, MAE, R² score

4. **Unsupervised Learning**:
   - Learns from unlabeled data (only features)
   - Discovers hidden patterns and structure
   - Examples: Clustering, dimensionality reduction
   - Used for exploration and pattern discovery

5. **Choosing the Right Approach**:
   - Have labels + want predictions → Supervised
   - No labels + explore patterns → Unsupervised
   - Predict categories → Classification
   - Predict numbers → Regression

### What's Next?

In **Module 02: Data Preparation and Train/Test Split**, you'll learn:
- How to properly prepare data for machine learning
- Why and how to split data into training and testing sets
- Common data preprocessing techniques
- Avoiding data leakage and other pitfalls

### Additional Resources

- [Supervised vs Unsupervised Learning - Google ML Course](https://developers.google.com/machine-learning/crash-course/framing/ml-terminology)
- [Classification vs Regression - StatQuest](https://www.youtube.com/watch?v=i_LwzRVP7bg)
- [K-Means Clustering Explained](https://www.youtube.com/watch?v=4b5d3muPQmA)