# Session 1 Theory: Understanding Random Forest for Earth Observation

**CopPhil 4-Day Advanced Online Training**  
**DAY 2 - Session 1: Supervised Machine Learning - Part 1**

**INSTRUCTOR VERSION - Complete Solutions**

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Understand Decision Trees**: Explain how a single decision tree makes predictions through recursive splitting
2. **Grasp Ensemble Learning**: Describe how Random Forest combines multiple trees through bootstrap sampling and random feature selection
3. **Interpret Feature Importance**: Analyze which spectral bands or derived indices contribute most to classification
4. **Evaluate Model Performance**: Read and interpret confusion matrices to assess classification accuracy
5. **Apply to EO Context**: Connect these concepts to satellite image classification tasks

---

## Why Random Forest for Earth Observation?

Random Forest is one of the most popular algorithms for land cover classification because:

- **Handles high-dimensional data**: Works well with many spectral bands (Sentinel-2 has 13 bands)
- **More resistant to overfitting than a single tree**: Averaging many decorrelated trees reduces variance
- **Feature importance**: Reveals which bands are most informative
- **No feature scaling required**: Unlike neural networks
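- **Fast training**: Trees are built independently, so training parallelizes well on large datasets
- **Interpretable**: Individual trees' decision rules can be visualized, and feature importances summarize the forest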

---

**Estimated Time**: 70 minutes  
**Teaching Notes**: This notebook builds from simple concepts to complex applications. Encourage hands-on exploration of parameters.

## A. Introduction and Setup (5 minutes)

Let's start by importing the necessary libraries and setting up our environment for reproducible results.

In [None]:
# Core scientific computing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn for machine learning
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Configure plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("colorblind")  # Color-blind friendly palette
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11

print("✓ Libraries imported successfully!")
print(f"✓ Random state set to: {RANDOM_STATE}")
print(f"✓ NumPy version: {np.__version__}")
print(f"✓ Pandas version: {pd.__version__}")

---

## B. Decision Trees Interactive Demo (15 minutes)

### What is a Decision Tree?

A **Decision Tree** is a supervised learning algorithm that makes predictions by learning a series of if-then-else decision rules from data. Think of it like a flowchart:

```
Is NDVI > 0.3?
├─ Yes: Is NIR > 0.5?
│  ├─ Yes: Forest
│  └─ No: Grassland
└─ No: Is SWIR < 0.2?
   ├─ Yes: Water
   └─ No: Urban
```
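
The flowchart maps directly onto nested `if`/`else` statements. The sketch below hand-codes it; the function name `classify_pixel` is hypothetical and the thresholds are the illustrative ones from the diagram, not values learned from data.

```python
def classify_pixel(ndvi: float, nir: float, swir: float) -> str:
    """Hand-written version of the illustrative flowchart (thresholds are made up)."""
    if ndvi > 0.3:
        # Vegetated branch: NIR separates forest from grassland
        return "Forest" if nir > 0.5 else "Grassland"
    # Non-vegetated branch: SWIR separates water from urban
    return "Water" if swir < 0.2 else "Urban"


print(classify_pixel(ndvi=0.7, nir=0.6, swir=0.1))
print(classify_pixel(ndvi=0.1, nir=0.05, swir=0.05))
```

A trained decision tree produces rules of the same shape, but it chooses the features and thresholds from the training data.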

### Key Concepts:

- **Root Node**: The first decision point (top of the tree)
- **Internal Nodes**: Intermediate decision points
- **Leaf Nodes**: Final predictions (bottom of the tree)
- **Splitting**: How the algorithm decides which feature and threshold to use
- **Depth**: Number of levels in the tree (deeper = more complex)

### Let's Build a Simple Example

**Teaching Tip**: Use this section to explain the greedy, top-down approach of tree building.

In [None]:
# Create a simple 2D classification dataset
# This simulates two spectral bands (e.g., NIR and Red)
X, y = make_moons(n_samples=200, noise=0.25, random_state=RANDOM_STATE)

# Add feature names for EO context
feature_names = ['NIR Reflectance', 'Red Reflectance']
class_names = ['Water/Urban', 'Vegetation']

print(f"Dataset shape: {X.shape}")
print(f"Number of samples: {X.shape[0]}")
print(f"Number of features: {X.shape[1]}")
print(f"Classes: {np.unique(y)}")
print(f"Class distribution: {np.bincount(y)}")

In [None]:
# Visualize the dataset
plt.figure(figsize=(10, 6))
scatter = plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', 
                     s=50, alpha=0.7, edgecolors='k', linewidth=0.5)
plt.xlabel(feature_names[0], fontsize=12)
plt.ylabel(feature_names[1], fontsize=12)
plt.title('Training Data: Two Spectral Bands', fontsize=14, fontweight='bold')
plt.colorbar(scatter, label='Class', ticks=[0, 1])
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n💡 TIP: In real EO applications, each point would represent a pixel with its spectral reflectance values.")

### Train a Single Decision Tree

Let's train a decision tree and visualize how it splits the feature space.

**Teaching Point**: Emphasize that decision trees create rectangular partitions (axis-aligned splits).

In [None]:
# Train a decision tree with limited depth
tree = DecisionTreeClassifier(max_depth=3, random_state=RANDOM_STATE)
tree.fit(X, y)

# Calculate training accuracy
train_accuracy = tree.score(X, y)
print(f"Training Accuracy: {train_accuracy:.3f}")
print(f"Tree Depth: {tree.get_depth()}")
print(f"Number of Leaves: {tree.get_n_leaves()}")

In [None]:
# Visualize decision boundaries
def plot_decision_boundary(model, X, y, title="Decision Boundary"):
    """
    Plot decision boundary for a 2D classification problem.
    
    Parameters:
    -----------
    model : trained classifier
    X : array-like, shape (n_samples, 2)
    y : array-like, shape (n_samples,)
    title : str
    """
    # Create mesh grid
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    
    # Predict on mesh grid
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot
    plt.figure(figsize=(10, 6))
    plt.contourf(xx, yy, Z, alpha=0.3, cmap='viridis', levels=1)
    scatter = plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', 
                         s=50, alpha=0.8, edgecolors='k', linewidth=0.5)
    plt.xlabel(feature_names[0], fontsize=12)
    plt.ylabel(feature_names[1], fontsize=12)
    plt.title(title, fontsize=14, fontweight='bold')
    plt.colorbar(scatter, label='Class', ticks=[0, 1])
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

plot_decision_boundary(tree, X, y, 
                      title="Decision Tree: How It Splits the Feature Space")

print("\n💡 TIP: Notice the rectangular decision boundaries. Trees can only make")
print("   axis-aligned splits (e.g., 'NIR > 0.5'), not diagonal lines.")

### Visualize the Tree Structure

Let's look inside the tree to see the actual decision rules it learned.

**Teaching Tip**: Walk through the tree from root to leaf, explaining Gini impurity and sample counts.

In [None]:
# Plot the tree structure
plt.figure(figsize=(20, 10))
plot_tree(tree, 
         feature_names=feature_names,
         class_names=class_names,
         filled=True,
         rounded=True,
         fontsize=10)
plt.title('Decision Tree Structure', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

print("\nHow to Read This Tree:")
print("━" * 60)
print("• Each box is a node with a decision rule (e.g., 'NIR <= 0.5')")
print("• 'gini' measures impurity (0 = pure, 0.5 = evenly mixed for two classes)")
print("• 'samples' shows how many training points reach this node")
print("• 'value' shows class distribution [class 0, class 1]")
print("• Color intensity indicates class majority (darker = more confident)")
print("• Leaf nodes (bottom) make the final prediction")
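
The same rules can also be read as plain text, which is often easier to walk through line by line than the plot. A minimal sketch using scikit-learn's `export_text`; it retrains an equivalent tree so that it runs on its own, and the `_demo` variable names are placeholders.

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier, export_text

# Rebuild the same kind of toy dataset and depth-3 tree used above
X_demo, y_demo = make_moons(n_samples=200, noise=0.25, random_state=42)
tree_demo = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_demo, y_demo)

# One line per node; indentation shows depth, leaves show the predicted class
rules = export_text(tree_demo, feature_names=["NIR Reflectance", "Red Reflectance"])
print(rules)
```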

### 🎯 Interactive Exercise: Effect of Tree Depth (SOLUTION)

**Task**: Experiment with different `max_depth` values and observe how the decision boundary changes.

**Questions to consider**:
1. What happens with `max_depth=1` (a "decision stump")?
2. What happens with `max_depth=10` (very deep tree)?
3. Which depth seems to balance simplicity and accuracy?
4. Can you identify overfitting?

**Teaching Note**: Have students run this multiple times with different values. Discuss the bias-variance tradeoff.

In [None]:
# SOLUTION: Compare different max_depth values
depths_to_test = [1, 2, 3, 5, 10, None]

fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

for idx, depth in enumerate(depths_to_test):
    tree_experiment = DecisionTreeClassifier(max_depth=depth, 
                                            random_state=RANDOM_STATE)
    tree_experiment.fit(X, y)
    
    accuracy = tree_experiment.score(X, y)
    actual_depth = tree_experiment.get_depth()
    n_leaves = tree_experiment.get_n_leaves()
    
    # Create mesh grid
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    
    Z = tree_experiment.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot
    ax = axes[idx]
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='viridis', levels=1)
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', 
              s=30, alpha=0.7, edgecolors='k', linewidth=0.3)
    ax.set_xlabel(feature_names[0])
    ax.set_ylabel(feature_names[1])
    
    depth_str = str(depth) if depth is not None else "Unlimited"
    ax.set_title(f'max_depth={depth_str}\nAcc={accuracy:.3f}, Depth={actual_depth}, Leaves={n_leaves}',
                fontweight='bold')
    ax.grid(True, alpha=0.3)

plt.suptitle('Effect of Tree Depth on Decision Boundaries', 
            fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

print("\n📊 OBSERVATIONS:")
print("━" * 60)
print("• max_depth=1: Very simple (underfitting), straight line split")
print("• max_depth=2-3: Moderate complexity, smoother boundary")
print("• max_depth=10: High training accuracy, jagged boundaries (likely overfitting)")
print("• max_depth=None: Grows until every leaf is pure, memorizes training data")
print("\n⚠️ OVERFITTING SIGNS: Near-perfect training accuracy + complex boundaries.")
print("   Training accuracy alone cannot show generalization: confirm on held-out data.")
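
The accuracies in the figure are computed on the training data, so they can only rise as depth increases. One way to check for overfitting is to hold out a test set and compare the two scores. The sketch below assumes a 70/30 stratified split, which is an illustrative choice and not part of the original exercise.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_d, y_d = make_moons(n_samples=200, noise=0.25, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_d, y_d, test_size=0.3, random_state=42, stratify=y_d
)

depth_results = {}
for depth in [1, 2, 3, 5, 10, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    # Store (training accuracy, held-out accuracy) for each depth
    depth_results[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"max_depth={str(depth):>4}  train={depth_results[depth][0]:.3f}  "
          f"test={depth_results[depth][1]:.3f}")
```

A widening gap between the training and test columns as depth grows is the practical signature of overfitting.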

---

## C. Random Forest Voting Mechanism (15 minutes)

### The Power of Ensemble Learning

A single decision tree can be unstable:
- Small changes in data can lead to completely different trees
- Prone to overfitting (memorizing training data)
- High variance in predictions

**Random Forest** solves this by combining many trees:

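1. **Bootstrap Sampling**: Each tree trains on a bootstrap sample: as many rows as the original dataset, drawn at random with replacement
2. **Random Feature Selection**: Each split only considers a random subset of features
3. **Majority Voting**: Final prediction is the class chosen by most trees (scikit-learn averages each tree's class probabilities and picks the highest, which usually gives the same answer)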

**Analogy**: Instead of asking one expert (one tree), you ask a committee of experts (forest) and take a vote. This "wisdom of the crowd" is more robust!

**Teaching Tip**: Draw a diagram showing bootstrap sampling and aggregation on the board/screen.
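
The three steps can be sketched by hand with ordinary decision trees. This is an illustration of the idea under simple assumptions (5 trees, hard majority vote, hypothetical variable names), not a reproduction of scikit-learn's internal implementation.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X_b, y_b = make_moons(n_samples=200, noise=0.25, random_state=42)
rng = np.random.default_rng(42)
n_rows = X_b.shape[0]

manual_trees = []
for seed in range(5):
    # 1. Bootstrap sampling: draw n_rows row indices with replacement
    boot_idx = rng.integers(0, n_rows, size=n_rows)
    # 2. Random feature selection: each split considers sqrt(n_features) features
    member = DecisionTreeClassifier(max_depth=3, max_features="sqrt", random_state=seed)
    manual_trees.append(member.fit(X_b[boot_idx], y_b[boot_idx]))

# 3. Majority voting: with labels 0/1 and an odd number of trees, there are no ties
votes = np.stack([member.predict(X_b) for member in manual_trees])  # (n_trees, n_rows)
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)

print(f"Manual ensemble training accuracy: {(ensemble_pred == y_b).mean():.3f}")
```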

In [None]:
# Train a Random Forest with just 5 trees (for visualization)
n_trees = 5
rf_small = RandomForestClassifier(n_estimators=n_trees, 
                                 max_depth=3,
                                 random_state=RANDOM_STATE)
rf_small.fit(X, y)

rf_accuracy = rf_small.score(X, y)
print(f"Random Forest Training Accuracy (5 trees): {rf_accuracy:.3f}")
print(f"Single Tree Training Accuracy (from before): {train_accuracy:.3f}")
print(f"\nDifference (forest - single tree): {rf_accuracy - train_accuracy:+.3f}")

### Visualize Individual Trees in the Forest

Let's see how each tree makes different decisions.

In [None]:
# Plot decision boundaries for each individual tree
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

# Plot each individual tree
for idx, estimator in enumerate(rf_small.estimators_):  # 'estimator' avoids overwriting the earlier 'tree'
    ax = axes[idx]
    
    # Create mesh grid
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    
    # Predict
    Z = estimator.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='viridis', levels=1)
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', 
              s=30, alpha=0.6, edgecolors='k', linewidth=0.3)
    ax.set_xlabel(feature_names[0])
    ax.set_ylabel(feature_names[1])
    ax.set_title(f'Tree {idx + 1}', fontweight='bold')
    ax.grid(True, alpha=0.3)

# Plot the ensemble (Random Forest)
ax = axes[5]
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                     np.linspace(y_min, y_max, 200))
Z = rf_small.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

ax.contourf(xx, yy, Z, alpha=0.3, cmap='viridis', levels=1)
ax.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', 
          s=30, alpha=0.6, edgecolors='k', linewidth=0.3)
ax.set_xlabel(feature_names[0])
ax.set_ylabel(feature_names[1])
ax.set_title('Random Forest (Ensemble)', fontweight='bold', color='red')
ax.grid(True, alpha=0.3)

plt.suptitle('Individual Trees vs. Ensemble Decision', 
            fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

print("\n💡 TIP: Notice how each tree is slightly different due to bootstrap")
print("   sampling and random feature selection. The ensemble smooths out")
print("   individual errors and creates more stable boundaries.")

### Visualize Voting Confidence

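Random Forest can provide prediction probabilities. Conceptually these reflect the proportion of trees voting for each class; scikit-learn's `predict_proba` computes them as the mean of the per-tree class probabilities.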

In [None]:
# Get prediction probabilities
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                     np.linspace(y_min, y_max, 200))

# Predict probabilities for class 1 (Vegetation)
Z_proba = rf_small.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
Z_proba = Z_proba.reshape(xx.shape)

# Plot confidence
plt.figure(figsize=(12, 7))
contour = plt.contourf(xx, yy, Z_proba, levels=20, cmap='RdYlGn', alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', 
           s=50, alpha=0.7, edgecolors='k', linewidth=0.5)
plt.colorbar(contour, label='Confidence for Vegetation Class')
plt.xlabel(feature_names[0], fontsize=12)
plt.ylabel(feature_names[1], fontsize=12)
plt.title('Random Forest Prediction Confidence\n(Based on Voting Proportions)', 
         fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nInterpreting Confidence:")
print("━" * 60)
print("• Green (high values): Most trees vote for 'Vegetation'")
print("• Red (low values): Most trees vote for 'Water/Urban'")
print("• Yellow (middle values): Trees are uncertain (mixed votes)")
print("\n💡 TIP: Low confidence regions often indicate:")
print("   - Class boundaries")
print("   - Mixed pixels (in EO context)")
print("   - Need for more training data")

### 🎯 Interactive Exercise: Effect of Number of Trees (SOLUTION)

**Task**: Test how the number of trees affects model stability and accuracy.

**Hypothesis**: More trees → more stable predictions, but diminishing returns after a certain point.

In [None]:
# SOLUTION: Test different numbers of trees
tree_counts = [1, 5, 10, 50, 100, 200]
accuracies = []

for n in tree_counts:
    rf = RandomForestClassifier(n_estimators=n, 
                               max_depth=3,
                               random_state=RANDOM_STATE)
    rf.fit(X, y)
    acc = rf.score(X, y)
    accuracies.append(acc)
    print(f"n_estimators={n:3d} → Accuracy: {acc:.4f}")

# Plot accuracy vs. number of trees
plt.figure(figsize=(10, 6))
plt.plot(tree_counts, accuracies, marker='o', linewidth=2, markersize=8)
plt.xlabel('Number of Trees', fontsize=12)
plt.ylabel('Training Accuracy', fontsize=12)
plt.title('Effect of Ensemble Size on Accuracy', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n📊 OBSERVATIONS:")
print("━" * 60)
print("• Single tree (n=1): One tree fit to one bootstrap sample, sensitive to that sample")
print("• Few trees (n=5-10): Averaging starts to smooth out individual trees")
print("• Many trees (n=50-100): Accuracy stabilizes")
print("• More trees (n=200): Minimal additional improvement")
print("\n💡 PRACTICAL RECOMMENDATION: 100-500 trees balances accuracy and speed")
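
The curve above uses training accuracy, which tends to be optimistic. A less biased view that needs no separate test set is the out-of-bag (OOB) score, where each sample is scored only by the trees that did not see it during training. A minimal sketch, assuming the same toy data; very small forests are left out because some samples may never be out-of-bag.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

X_o, y_o = make_moons(n_samples=200, noise=0.25, random_state=42)

oob_scores = {}
for n in [50, 100, 200]:
    rf_o = RandomForestClassifier(
        n_estimators=n, max_depth=3, oob_score=True, random_state=42
    ).fit(X_o, y_o)
    # oob_score_ is the accuracy on samples left out of each tree's bootstrap sample
    oob_scores[n] = rf_o.oob_score_
    print(f"n_estimators={n:3d} → train={rf_o.score(X_o, y_o):.3f}  oob={rf_o.oob_score_:.3f}")
```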

---

## D. Feature Importance Analysis (10 minutes)

### Why Feature Importance Matters in EO

Feature importance tells us:
- Which spectral bands contribute most to classification
- Whether derived indices (NDVI, NDWI) are valuable
- If certain features are redundant
- How to optimize future data collection
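
For reference, the derived indices mentioned above are normalized band differences: NDVI = (NIR − Red) / (NIR + Red) and, in the McFeeters formulation, NDWI = (Green − NIR) / (Green + NIR). The sketch below computes both for three made-up pixels; the reflectance values and the helper name `normalized_difference` are illustrative assumptions.

```python
import numpy as np


def normalized_difference(a, b):
    """Return (a - b) / (a + b), using 0 where the denominator is 0."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = a + b
    out = np.zeros_like(denom)
    np.divide(a - b, denom, out=out, where=denom != 0)
    return out


# Hypothetical surface reflectance for three pixels: vegetation, water, urban
green = np.array([0.08, 0.06, 0.15])
red = np.array([0.05, 0.04, 0.18])
nir = np.array([0.45, 0.02, 0.22])

ndvi = normalized_difference(nir, red)    # high for vegetation
ndwi = normalized_difference(green, nir)  # high for open water

for name, v, w in zip(["vegetation", "water", "urban"], ndvi, ndwi):
    print(f"{name:10s}  NDVI={v:+.2f}  NDWI={w:+.2f}")
```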

**How Random Forest Calculates Importance**:
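- Measures how much each feature decreases impurity (Gini or entropy) at the splits where it is used, weighted by the number of samples reaching each split
- Averaged across all trees in the forest
- Higher values = more important for classification (scikit-learn normalizes the values to sum to 1)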

**Teaching Note**: Emphasize that importance ≠ causation, and correlated features share importance.

In [None]:
# Create a dataset mimicking Sentinel-2 spectral bands
np.random.seed(RANDOM_STATE)

# Simulate 1000 pixels with 8 "spectral bands"
n_samples = 1000
n_features = 8

# Feature names mimicking Sentinel-2 bands and indices
eo_feature_names = [
    'Blue (B2)',
    'Green (B3)',
    'Red (B4)',
    'NIR (B8)',
    'SWIR1 (B11)',
    'SWIR2 (B12)',
    'NDVI',
    'NDWI'
]

# Create synthetic data with realistic patterns
# Class 0: Water (low NIR, high Blue, high NDWI)
# Class 1: Vegetation (high NIR, low Red, high NDVI)
# Class 2: Urban (moderate all, low NDVI, low NDWI)

X_eo = np.random.rand(n_samples, n_features)
y_eo = np.random.choice([0, 1, 2], size=n_samples)

# Add class-specific patterns
for i in range(n_samples):
    if y_eo[i] == 0:  # Water
        X_eo[i, 0] += 0.3  # Higher Blue
        X_eo[i, 3] -= 0.3  # Lower NIR
        X_eo[i, 7] += 0.4  # Higher NDWI
    elif y_eo[i] == 1:  # Vegetation
        X_eo[i, 3] += 0.5  # Higher NIR
        X_eo[i, 2] -= 0.2  # Lower Red
        X_eo[i, 6] += 0.5  # Higher NDVI
    else:  # Urban
        X_eo[i, 4] += 0.2  # Higher SWIR1
        X_eo[i, 5] += 0.2  # Higher SWIR2

# Clip to [0, 1] range
X_eo = np.clip(X_eo, 0, 1)

print(f"EO Dataset shape: {X_eo.shape}")
print(f"Features: {eo_feature_names}")
print(f"Classes: 0=Water, 1=Vegetation, 2=Urban")
print(f"Class distribution: {np.bincount(y_eo)}")

In [None]:
# Train Random Forest on EO-like data
rf_eo = RandomForestClassifier(n_estimators=100, 
                              max_depth=10,
                              random_state=RANDOM_STATE)
rf_eo.fit(X_eo, y_eo)

# Extract feature importances
importances = rf_eo.feature_importances_
indices = np.argsort(importances)[::-1]  # Sort descending

print("Feature Importance Ranking:")
print("━" * 60)
for i, idx in enumerate(indices):
    print(f"{i+1}. {eo_feature_names[idx]:15s}: {importances[idx]:.4f}")

In [None]:
# Visualize feature importances
plt.figure(figsize=(12, 7))
bars = plt.barh(range(len(importances)), importances[indices], align='center')

# Color bars by importance
colors = plt.cm.viridis(importances[indices] / importances.max())
for bar, color in zip(bars, colors):
    bar.set_color(color)

plt.yticks(range(len(importances)), [eo_feature_names[i] for i in indices])
plt.xlabel('Importance (Mean Decrease in Impurity)', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.title('Feature Importance for Land Cover Classification', 
         fontsize=14, fontweight='bold')
plt.grid(True, axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

print("\n💡 TIP: High importance doesn't always mean causation!")
print("   - In real imagery NDVI is derived from NIR and Red, so they're correlated")
print("     (the synthetic columns here are generated independently)")
print("   - Consider domain knowledge alongside feature importance")
print("   - Importance can be unstable with correlated features")

### 🎯 Exercise: Interpret Feature Importance (SOLUTION)

**Questions**:
1. Which feature is most important? Why might this be?
2. Are the derived indices (NDVI, NDWI) more or less important than raw bands?
3. Which features could potentially be removed to simplify the model?
4. How does this align with your knowledge of land cover spectral signatures?

**SOLUTIONS**:

1. **Most important feature**:
   - Typically **NDVI** or **NIR (B8)** will rank highest
   - **Why**: Strong discriminator between vegetation and non-vegetation
   - Vegetation has very high NIR reflectance vs. water/urban
   - NDVI combines NIR and Red, enhancing this contrast

2. **Derived indices vs. raw bands**:
   - **Often more important** because they:
     - Normalize for illumination differences
     - Enhance specific spectral features
     - Reduce dimensionality while retaining information
   - However, raw bands still valuable for capturing additional variation

3. **Features that could be removed**:
   - Look for features with importance < 0.05
   - **Green (B3)** might be less important (redundant with Blue/Red)
   - **SWIR2 (B12)** might be redundant with SWIR1
   - Consider correlation analysis before removal
   - **Caution**: Don't remove without testing impact on validation accuracy

4. **Alignment with spectral signatures**:
   - **Expected patterns**:
     - NIR/NDVI: High importance (vegetation has unique NIR signature)
     - NDWI/Blue: Important for water detection
     - SWIR: Important for urban/soil/moisture
   - **Makes spectral sense**:
     - Each feature separates specific class pairs
     - Vegetation: High NIR, low Red → high NDVI
     - Water: High Blue, low NIR → high NDWI
     - Urban: Moderate all bands, high SWIR

**Teaching Discussion Points**:
- Connect to Sentinel-2 band selection rationale
- Discuss trade-offs: accuracy vs. computational cost vs. data volume
- Mention permutation importance as alternative (more stable with correlated features)
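
Permutation importance can be demonstrated with `sklearn.inspection.permutation_importance`, which shuffles one feature at a time on held-out data and records the drop in score. The sketch below uses a synthetic dataset from `make_classification` rather than the notebook's EO-like data, so the feature indices carry no spectral meaning.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 3 informative features, 2 redundant (correlated) ones, 3 noise
X_pi, y_pi = make_classification(
    n_samples=600, n_features=8, n_informative=3, n_redundant=2,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_pi, y_pi, test_size=0.3, random_state=42, stratify=y_pi
)

rf_pi = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the test set and measure the accuracy drop
perm = permutation_importance(rf_pi, X_te, y_te, n_repeats=10, random_state=42)

for i in np.argsort(perm.importances_mean)[::-1]:
    print(f"feature {i}: MDI={rf_pi.feature_importances_[i]:.3f}  "
          f"permutation={perm.importances_mean[i]:.3f} ± {perm.importances_std[i]:.3f}")
```

Comparing the two columns is a useful discussion prompt: features that rank high on impurity-based importance but near zero on permutation importance deserve a closer look.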

---

## E. Confusion Matrix Interpretation (15 minutes)

### Why Confusion Matrix?

Overall accuracy can be misleading! Consider:
- Dataset: 95% Forest, 5% Mangrove
- Model: Predicts everything as Forest
- Accuracy: 95% (sounds great!)
- Problem: Completely missed mangroves!
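
The scenario can be reproduced in a few lines. The sketch assumes 100 test pixels and encodes Forest as 0 and Mangrove as 1; both choices are arbitrary.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 95 Forest pixels (0) and 5 Mangrove pixels (1)
y_true_imb = np.array([0] * 95 + [1] * 5)
# A "model" that predicts Forest everywhere
y_pred_imb = np.zeros_like(y_true_imb)

acc = accuracy_score(y_true_imb, y_pred_imb)
recall_per_class = recall_score(y_true_imb, y_pred_imb, average=None)

print(f"Overall accuracy: {acc:.2f}")
print(f"Recall, Forest:   {recall_per_class[0]:.2f}")
print(f"Recall, Mangrove: {recall_per_class[1]:.2f}")
```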

**Confusion Matrix** reveals:
- Which classes are well-predicted
- Which classes are confused with each other
- Class-specific performance (precision, recall)

### Key Metrics:

- **Precision (User's Accuracy)**: Of all pixels predicted as class X, how many are actually class X?
  - Formula: TP / (TP + FP)
  - Important when false positives are costly

- **Recall (Producer's Accuracy)**: Of all actual class X pixels, how many did we correctly identify?
  - Formula: TP / (TP + FN)
  - Important when false negatives are costly

- **F1-Score**: Harmonic mean of precision and recall
  - Formula: 2 × (Precision × Recall) / (Precision + Recall)
  - Balances both metrics
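
These formulas can be applied directly to a confusion matrix. In the sketch below the matrix is invented for illustration (rows = actual, columns = predicted, matching scikit-learn's convention); true positives sit on the diagonal, false positives are the rest of each column, and false negatives are the rest of each row.

```python
import numpy as np

# Hypothetical 3-class confusion matrix: Water, Vegetation, Urban
cm_demo = np.array([
    [50,  2,  8],
    [ 3, 45,  2],
    [ 7,  3, 40],
])

tp = np.diag(cm_demo)
fp = cm_demo.sum(axis=0) - tp  # predicted as the class, but actually another
fn = cm_demo.sum(axis=1) - tp  # actually the class, but predicted as another

precision_demo = tp / (tp + fp)  # user's accuracy
recall_demo = tp / (tp + fn)     # producer's accuracy
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)

for name, p, r, f in zip(["Water", "Vegetation", "Urban"],
                         precision_demo, recall_demo, f1_demo):
    print(f"{name:10s}  precision={p:.3f}  recall={r:.3f}  f1={f:.3f}")
```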

**Teaching Tip**: Use concrete EO examples (e.g., mapping illegal logging, disaster damage) to illustrate when precision vs. recall matters.

In [None]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_eo, y_eo, test_size=0.3, random_state=RANDOM_STATE, stratify=y_eo
)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
print(f"Training class distribution: {np.bincount(y_train)}")
print(f"Test class distribution: {np.bincount(y_test)}")

print("\n💡 TIP: We use stratified split to maintain class proportions.")

In [None]:
# Train Random Forest
rf_final = RandomForestClassifier(n_estimators=100, 
                                 max_depth=10,
                                 random_state=RANDOM_STATE)
rf_final.fit(X_train, y_train)

# Make predictions
y_pred = rf_final.predict(X_test)

# Calculate overall accuracy
overall_accuracy = accuracy_score(y_test, y_pred)
print(f"Overall Test Accuracy: {overall_accuracy:.3f}")

In [None]:
# Generate confusion matrix
cm = confusion_matrix(y_test, y_pred)
class_labels = ['Water', 'Vegetation', 'Urban']

print("Confusion Matrix (raw counts):")
print("━" * 60)
print(cm)
print("\nRows = Actual class, Columns = Predicted class")

In [None]:
# Visualize confusion matrix as heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
           xticklabels=class_labels, 
           yticklabels=class_labels,
           cbar_kws={'label': 'Number of Samples'},
           linewidths=1, linecolor='gray')
plt.xlabel('Predicted Class', fontsize=12, fontweight='bold')
plt.ylabel('Actual Class', fontsize=12, fontweight='bold')
plt.title('Confusion Matrix: Land Cover Classification', 
         fontsize=14, fontweight='bold', pad=15)
plt.tight_layout()
plt.show()

print("\nHow to Read This Matrix:")
print("━" * 60)
print("• Diagonal (top-left to bottom-right): Correct predictions")
print("• Off-diagonal: Confusion between classes")
print("• Dark blue cells indicate high counts")
print("\n💡 TIP: Look for patterns in confusion:")
print("   - Are certain class pairs often confused?")
print("   - Do confusions make spectral sense?")

In [None]:
# Calculate normalized confusion matrix (percentages)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

plt.figure(figsize=(10, 8))
sns.heatmap(cm_normalized, annot=True, fmt='.2%', cmap='RdYlGn', 
           xticklabels=class_labels, 
           yticklabels=class_labels,
           vmin=0, vmax=1,
           cbar_kws={'label': 'Percentage'},
           linewidths=1, linecolor='gray')
plt.xlabel('Predicted Class', fontsize=12, fontweight='bold')
plt.ylabel('Actual Class', fontsize=12, fontweight='bold')
plt.title('Normalized Confusion Matrix (Row Percentages)', 
         fontsize=14, fontweight='bold', pad=15)
plt.tight_layout()
plt.show()

print("\n💡 TIP: Normalized matrix shows recall (producer's accuracy) for each class.")
print("   Diagonal values are the percentage correctly classified for each class.")

### Calculate Detailed Metrics

In [None]:
# Generate classification report
print("Classification Report:")
print("━" * 80)
report = classification_report(y_test, y_pred, 
                              target_names=class_labels,
                              digits=3)
print(report)

print("\nMetric Definitions:")
print("━" * 80)
print("• Precision (User's Accuracy): TP / (TP + FP)")
print("  → Of predictions for this class, how many were correct?")
print("  → Important when false alarms are costly")
print("")
print("• Recall (Producer's Accuracy): TP / (TP + FN)")
print("  → Of actual samples of this class, how many were found?")
print("  → Important when missing instances is costly")
print("")
print("• F1-Score: 2 × (Precision × Recall) / (Precision + Recall)")
print("  → Harmonic mean balancing precision and recall")
print("")
print("• Support: Number of actual samples in test set")

In [None]:
# Visualize per-class metrics
from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_test, y_pred, average=None)
recall = recall_score(y_test, y_pred, average=None)
f1 = f1_score(y_test, y_pred, average=None)

# Create DataFrame for easier plotting
metrics_df = pd.DataFrame({
    'Precision': precision,
    'Recall': recall,
    'F1-Score': f1
}, index=class_labels)

# Plot
ax = metrics_df.plot(kind='bar', figsize=(12, 7), width=0.8)
plt.xlabel('Land Cover Class', fontsize=12)
plt.ylabel('Score', fontsize=12)
plt.title('Per-Class Performance Metrics', fontsize=14, fontweight='bold')
plt.xticks(rotation=0)
plt.ylim([0, 1.05])
plt.axhline(y=0.8, color='r', linestyle='--', alpha=0.5, label='80% threshold')
plt.legend(loc='lower right', fontsize=11)  # after axhline so the threshold appears in the legend
plt.grid(True, axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

print("\n💡 TIP: In EO applications, the metric to prioritize depends on the task:")
print("   - Disaster mapping: High recall for affected areas (don't miss damage)")
print("   - Urban planning: High precision for built-up (avoid false alarms)")
print("   - Balanced: Use F1-score for overall assessment")

### 🎯 Exercise: Confusion Analysis (SOLUTION)

**Task**: Analyze the confusion matrix and answer these questions:

1. Which class has the highest recall (producer's accuracy)?
2. Which class has the lowest precision (user's accuracy)?
3. Which two classes are most often confused with each other?
4. Why might this confusion occur from a spectral perspective?
5. What could you do to improve classification of the weakest class?

**SOLUTIONS**:

1. **Highest recall class**:
   - Typically **Vegetation** (class 1)
   - **Why**: Strong, unique spectral signature (high NIR, high NDVI)
   - Model rarely misses vegetation pixels
   - High NDVI values create clear separation from other classes

2. **Lowest precision class**:
   - Often **Urban** (class 2)
   - **Why**: Heterogeneous class (buildings, roads, bare soil, sparse vegetation)
   - Spectral variability within class
   - Can overlap spectrally with water (shadows, dark roofs) or bare soil

3. **Most confused class pair**:
   - Commonly **Urban ↔ Water** or **Urban ↔ Bare Soil** (if present)
   - Look at off-diagonal elements in confusion matrix
   - Check which non-diagonal cell has highest count
   - **Example**: If cm[2,0] and cm[0,2] are high → Urban-Water confusion

4. **Spectral reason for confusion**:
   - **Urban ↔ Water**:
     - Shadows in urban areas (low reflectance, similar to water)
     - Dark impervious surfaces (asphalt, roofs)
     - Both have low NIR and low NDVI
   - **Urban ↔ Bare Soil**:
     - Construction sites, unpaved roads
     - Similar SWIR response
     - Both have moderate reflectance across bands
   - **Water ↔ Shadows**:
     - Very low reflectance in all bands
     - Without topographic correction, shadows misclassified as water

5. **Improvement strategies**:
   
   **Data-Centric Approaches**:
   - **More training samples** for confused classes
     - Especially at class boundaries
     - Ensure diversity (different urban types, water conditions)
   - **Better quality samples**
     - Remove mislabeled examples
     - Use higher resolution imagery for ground truth
     - Field validation of uncertain areas
   
   **Feature Engineering**:
   - **Add discriminative features**:
     - Texture metrics (urban is heterogeneous, water is smooth)
     - Temporal features (NDVI time series separates urban from bare soil)
     - Topographic features (slope, aspect to handle shadows)
   - **Remove redundant/noisy features**
   
   **Model-Centric Approaches**:
   - **Class balancing**: Adjust class_weight parameter
   - **Threshold tuning**: Adjust prediction probabilities for cost-sensitive classes
   - **Hyperparameter optimization**: Grid search for max_depth, min_samples_split
   - **Ensemble with other algorithms**: Combine RF with SVM or neural network
   
   **Post-Processing**:
   - **Spatial filtering**: Majority filter to remove salt-and-pepper noise
   - **Contextual rules**: Water is unlikely at high elevations
   - **Object-based approach**: Segment first, then classify
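
Two of the model-centric options, class weighting and hyperparameter search, can be combined in a single grid search. The sketch below is a small illustration on imbalanced synthetic data; the grid values, the macro-F1 scoring choice, and the 3-fold cross-validation are assumptions made to keep it fast, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Imbalanced synthetic stand-in for a 3-class land cover problem
X_g, y_g = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_classes=3,
    n_clusters_per_class=1, weights=[0.6, 0.3, 0.1], random_state=42,
)

param_grid = {
    "max_depth": [5, 10],
    "min_samples_split": [2, 10],
    "class_weight": [None, "balanced"],  # "balanced" upweights rare classes
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid,
    scoring="f1_macro",  # treats every class equally, unlike overall accuracy
    cv=3,
)
search.fit(X_g, y_g)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated macro F1: {search.best_score_:.3f}")
```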

**Teaching Discussion**:
- Emphasize **data-centric AI**: Often better to improve training data than tweak algorithms
- Connect to Philippine context: Confusion between mangrove and terrestrial forest?
- Discuss operational constraints: Can you collect more samples? Is real-time processing needed?

---

## F. Concept Check Quiz (10 minutes)

Test your understanding of Random Forest concepts!

**Teaching Note**: Can be done as:
- Individual work with class discussion
- Think-pair-share
- Kahoot/Mentimeter for interactive polling

### Question 1: Decision Tree Splitting

**Q**: How does a decision tree decide where to split at each node?

A) Randomly selects a feature and threshold  
B) Uses the feature and threshold that maximizes information gain (or minimizes impurity)  
C) Always splits at the median value of each feature  
D) Splits based on alphabetical order of feature names

**✓ ANSWER: B**

**Explanation**:
Decision trees use a **greedy algorithm** that evaluates all possible splits and chooses the one that best separates classes:

- **Classification**: Maximizes information gain or minimizes Gini impurity
- **Process**: For each feature, tests multiple threshold values
- **Criterion**: Gini impurity = 1 - Σ(p_i²), where p_i is proportion of class i
- **Goal**: Make child nodes as "pure" (single-class) as possible

**Teaching Tip**: Draw a simple 2D scatter plot and show how different split lines result in different purities.
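
The Gini criterion above can be worked through on a tiny example. The sketch below compares two candidate thresholds on a single feature. The NDVI values, labels, and helper names (`gini`, `weighted_child_gini`) are made up for illustration.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_child_gini(feature, labels, threshold):
    """Sample-weighted impurity of the two children produced by a split."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical NDVI values for 4 water (0) and 4 forest (1) pixels
ndvi = np.array([0.05, 0.10, 0.15, 0.20, 0.60, 0.70, 0.80, 0.90])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

parent = gini(labels)                                  # 50/50 node
good_split = weighted_child_gini(ndvi, labels, 0.40)   # both children pure
poor_split = weighted_child_gini(ndvi, labels, 0.12)   # right child still mixed
print(f"parent={parent:.3f}, split@0.40={good_split:.3f}, split@0.12={poor_split:.3f}")
```

The tree would choose the threshold at 0.40 because it yields the largest impurity decrease (0.5 → 0.0), whereas the threshold at 0.12 only reduces impurity to about 0.333.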

### Question 2: Bootstrap Sampling

**Q**: In Random Forest, what is bootstrap sampling?

A) Sampling pixels only from the edges of images  
B) Sampling with replacement to create training subsets for each tree  
C) Sampling only the most important features  
D) Sampling validation data separately from training data

**✓ ANSWER: B**

**Explanation**:
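**Bootstrap sampling** is the sampling step of **bagging** (Bootstrap AGGregatING):

- Randomly draws n samples **with replacement** from the n training samples
- Each tree's bootstrap sample contains on average ~63.2% of the unique training samples (the remaining draws are duplicates)
- The other ~36.8% are "out-of-bag" (OOB) samples for that tree and can be used for validation
- The 63.2% figure comes from 1 - (1 - 1/n)ⁿ → 1 - 1/e as n grows
- Creates training diversity → reduces correlation between trees
- Key to the ensemble's variance reduction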

**Example**:
```
Original dataset: [A, B, C, D, E]
Tree 1 bootstrap:  [A, A, C, D, E]  (B not selected)
Tree 2 bootstrap:  [A, B, B, C, C]  (D, E not selected)
Tree 3 bootstrap:  [B, C, D, D, E]  (A not selected)
```

Each tree sees slightly different data, so it learns slightly different patterns.
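
The ~63.2% in-bag and ~36.8% out-of-bag fractions quoted above can be checked with a short simulation. This is a sketch using NumPy only. The sample size and number of repetitions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000   # size of the (hypothetical) training set
n_trees = 20         # number of bootstrap samples to draw

unique_fractions = []
for _ in range(n_trees):
    # One bootstrap sample: n_samples indices drawn with replacement
    in_bag = rng.integers(0, n_samples, size=n_samples)
    unique_fractions.append(np.unique(in_bag).size / n_samples)

mean_in_bag = float(np.mean(unique_fractions))
print(f"Mean in-bag fraction: {mean_in_bag:.3f}")
print(f"Mean OOB fraction:    {1 - mean_in_bag:.3f}")
print(f"Theory (1 - 1/e):     {1 - np.exp(-1):.3f}")
```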

### Question 3: Random Feature Selection

**Q**: At each split in a Random Forest tree, what does "random feature selection" mean?

A) All features are considered for splitting  
B) Features are selected in alphabetical order  
C) Only a random subset of features is considered (typically √n or log₂n)  
D) The most important feature is always selected

**✓ ANSWER: C**

**Explanation**:
**Random feature selection** (controlled by `max_features` parameter):

- At each split, only consider a **random subset** of features
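- **Default for classification**: √n features (e.g., with 8 features, √8 ≈ 2.8, which scikit-learn rounds down to 2)
- **Default for regression**: n/3 features is the classic rule of thumb (used by R's `randomForest`), but scikit-learn's `RandomForestRegressor` defaults to all n features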
- **Purpose**: Decorrelates trees

**Why This Matters**:
Without it, if one feature is very strong (e.g., NDVI), nearly all trees would use it as the first split → highly correlated trees → ensemble doesn't help much.

With it, NDVI is missing from the candidate set at many splits (the subset is redrawn at every split, not once per tree), so trees must find alternative patterns using other features → diverse trees → better ensemble.

**Analogy**: If you ask 100 doctors for a diagnosis but they all read the same textbook page, you get redundant opinions. If each reads a random subset of pages, you get diverse insights.
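
To see how `max_features` translates into an actual number of candidate features, one option is to inspect the fitted trees. The sketch below uses synthetic data with 8 features as a stand-in for spectral bands. It reads the `max_features_` attribute of a fitted tree, which holds the resolved integer value. Note that scikit-learn truncates √8 ≈ 2.83 down to 2.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for 8 spectral features (not real EO data)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=42)

resolved_counts = {}
for setting in ["sqrt", "log2", None]:
    rf = RandomForestClassifier(n_estimators=10, max_features=setting,
                                random_state=42).fit(X, y)
    # max_features_ is the resolved number of candidate features per split
    resolved_counts[setting] = rf.estimators_[0].max_features_
    print(f"max_features={setting!r}: "
          f"{resolved_counts[setting]} of {X.shape[1]} features per split")
```

With `max_features=None` every split sees all 8 features, which reduces the forest to plain bagging.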

### Question 4: Feature Importance Interpretation

**Q**: You're classifying land cover and find that NDVI has the highest feature importance. What should you conclude?

A) NDVI is the only feature needed; remove all others  
B) NDVI contributes most to reducing impurity, but other features may still be valuable  
C) NDVI causes the land cover types (causal relationship)  
D) All other features are completely irrelevant

**✓ ANSWER: B**

**Explanation**:
**Feature importance ≠ Complete information**

High importance means NDVI is most useful **on average** across all splits, but:

1. **Other features capture complementary info**:
   - NDVI separates vegetation vs. non-vegetation
   - But SWIR might separate urban vs. water
   - And NDWI might separate water vs. bare soil

2. **Correlated features share importance**:
   - NDVI = (NIR - Red) / (NIR + Red)
   - NIR and Red importance is "stolen" by NDVI
   - But NIR/Red might still be needed for edge cases

3. **Importance ≠ Causation**:
   - NDVI doesn't *cause* land cover
   - It's just a good *predictor* (correlation)

4. **Context-dependent**:
   - In a pure urban study (no vegetation), NDVI would be useless
   - Importance reflects your specific dataset

**Best Practice**: Use importance for **feature understanding**, not feature elimination. Test impact of removing features on validation set.
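
The point about correlated features sharing importance can be explored with a small experiment. The sketch below builds hypothetical NIR and Red reflectances, derives NDVI from them, adds a pure-noise feature, and compares impurity-based importance with permutation importance on a validation split. All reflectance values are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 600
# Hypothetical reflectances: vegetation (1) has higher NIR and lower Red
y = rng.integers(0, 2, size=n)
nir = np.where(y == 1, 0.45, 0.20) + rng.normal(0, 0.05, n)
red = np.where(y == 1, 0.05, 0.15) + rng.normal(0, 0.03, n)
ndvi = (nir - red) / (nir + red)     # derived from NIR and Red
noise = rng.normal(0, 1, n)          # uninformative feature

feature_names = ["NIR", "Red", "NDVI", "Noise"]
X = np.column_stack([nir, red, ndvi, noise])
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Impurity-based importance (from training data, normalized to sum to 1)
impurity_imp = rf.feature_importances_
# Permutation importance: drop in validation accuracy when a feature is shuffled
perm = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=42)

for name, imp, pimp in zip(feature_names, impurity_imp, perm.importances_mean):
    print(f"{name:6s} impurity={imp:.3f}  permutation={pimp:.3f}")
```

Because NIR, Red, and NDVI carry overlapping information, shuffling any one of them may barely change accuracy even though each has a sizeable impurity-based score. That is a reason to test feature removal on a validation set rather than rely on a single ranking.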

### Question 5: Confusion Matrix - Precision vs. Recall

**Scenario**: You're mapping forest fire damage. The confusion matrix shows:
- Actual Burned: 100 pixels
- Predicted as Burned: 150 pixels
- Correctly identified Burned: 90 pixels

**Q**: Calculate precision and recall for the "Burned" class. Which is more important for this application?

**✓ SOLUTION**:

**Calculations**:
```
True Positives (TP) = 90
False Positives (FP) = 150 - 90 = 60
False Negatives (FN) = 100 - 90 = 10

Precision = TP / (TP + FP) = 90 / 150 = 0.60 (60%)
Recall = TP / (TP + FN) = 90 / 100 = 0.90 (90%)
```

**Interpretation**:
- **Precision = 60%**: Of pixels we labeled as burned, 60% actually were (40% false alarms)
- **Recall = 90%**: Of actual burned pixels, we found 90% (missed 10%)
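
The same numbers can be reproduced with scikit-learn by rebuilding label arrays that match the scenario. The count of true negatives (840) is an assumed value added only to complete the arrays. It does not affect precision or recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# 1 = Burned, 0 = Not burned; tn is an assumption (not given in the scenario)
tp, fn, fp, tn = 90, 10, 60, 840
y_true = np.array([1] * tp + [1] * fn + [0] * fp + [0] * tn)
y_pred = np.array([1] * tp + [0] * fn + [1] * fp + [0] * tn)

print(confusion_matrix(y_true, y_pred))   # rows = actual, columns = predicted
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
```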

**Which is More Important? → RECALL**

**Why**:
- **False Negatives (missing burned areas) are costly**:
  - Communities might not receive aid
  - Extent of disaster underestimated
  - Recovery efforts misdirected

- **False Positives (false alarms) are acceptable**:
  - Can be verified with field checks
  - Better to overestimate for safety
  - Not life-threatening if incorrect

**Strategy**: Lower the classification threshold for the "Burned" class to increase recall (accept more false positives in order to catch more true positives).
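
One way to apply this strategy with a Random Forest is to threshold `predict_proba` manually instead of calling `predict`. The sketch below uses synthetic imbalanced data in which class 1 stands in for "Burned". The thresholds 0.5 and 0.3 are illustrative choices, not recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: class 1 ("Burned") is roughly 15% of samples
X, y = make_classification(n_samples=2000, n_features=6, n_informative=4,
                           weights=[0.85, 0.15], flip_y=0.05, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
burned_proba = rf.predict_proba(X_test)[:, 1]   # P(class 1) for each sample

results = {}
for threshold in (0.5, 0.3):
    y_pred = (burned_proba >= threshold).astype(int)
    results[threshold] = (precision_score(y_test, y_pred, zero_division=0),
                          recall_score(y_test, y_pred))
    print(f"threshold={threshold}: precision={results[threshold][0]:.2f}, "
          f"recall={results[threshold][1]:.2f}")
```

Lowering the threshold can only add positive predictions, so recall never decreases. The cost usually shows up as lower precision.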

**Contrast with Different Application**:
If mapping **urban expansion for taxation**, **precision** is more important:
- False positives → incorrectly taxing agricultural land as urban
- Legal/financial consequences
- Better to be conservative

**Teaching Point**: The "right" metric depends on **application context and costs**. Always ask: "What's the cost of false positives vs. false negatives?"

### Question 6: Overfitting in Random Forest

**Q**: Which scenario is MOST likely to cause overfitting in Random Forest?

A) Using 100 trees instead of 10  
B) Setting max_depth=None (unlimited depth)  
C) Using bootstrap sampling  
D) Using random feature selection

**✓ ANSWER: B**

**Explanation**:

**B) Unlimited depth is the main overfitting risk**:
- Trees grow until leaves are pure or contain fewer than `min_samples_split` samples
- Creates very deep, complex trees
- **Memorizes training data** instead of learning patterns
- Signs: Training accuracy ≈ 100%, test accuracy much lower

**Why other options DON'T cause overfitting**:

**A) More trees (100 vs. 10)**:
- **Actually REDUCES overfitting!**
- More trees → better averaging → more stable predictions
- Random Forest rarely overfits from too many trees
- Only downside: computational cost

**C) Bootstrap sampling**:
- **Reduces overfitting!**
- Creates training diversity
- Each tree sees different data → less correlation
- Part of RF's strength

**D) Random feature selection**:
- **Reduces overfitting!**
- Prevents dominance of single strong feature
- Decorrelates trees
- Encourages learning diverse patterns

**Prevention Strategies**:
```python
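from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    max_depth=10,          # Limit tree depth
    min_samples_split=10,  # Minimum samples required to split a node
    min_samples_leaf=5,    # Minimum samples required in a leaf
    max_features='sqrt',   # Random feature subset at each split
    n_estimators=100,      # Many trees: more stable predictions, higher compute cost
    random_state=42        # Reproducible results
)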
```

**Detection**:
- Large gap between training and validation accuracy
- Very high training accuracy (>99%)
- Poor generalization to new areas
- Overly complex decision boundaries
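
The first detection signal, the train/validation gap, is easy to measure. The sketch below compares an unlimited-depth forest with a depth-limited one on noisy `make_moons` data. The noise level and `max_depth=3` are arbitrary values chosen for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Noisy two-class data so that memorizing the training set is possible
X, y = make_moons(n_samples=500, noise=0.35, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

scores = {}
for depth in (None, 3):
    rf = RandomForestClassifier(n_estimators=100, max_depth=depth,
                                random_state=42).fit(X_train, y_train)
    scores[depth] = (rf.score(X_train, y_train), rf.score(X_test, y_test))
    train_acc, test_acc = scores[depth]
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}, "
          f"gap={train_acc - test_acc:.3f}")
```

A training accuracy near 100% combined with a clearly lower test accuracy is the pattern to look for. The depth-limited model typically shows a smaller gap.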

**Teaching Analogy**: 
Unlimited depth is like a student who memorizes every exam question/answer instead of understanding concepts. They ace practice exams but fail on new questions.

---

## Summary and Key Takeaways

### Decision Trees
- Learn hierarchical decision rules through recursive splitting
- Create axis-aligned decision boundaries
- Prone to overfitting if too deep
- Easy to interpret and visualize

### Random Forest Ensemble
- Combines many trees to reduce variance and improve stability
- Uses bootstrap sampling (bagging) for training diversity
- Uses random feature selection to decorrelate trees
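- Final prediction by majority voting (classification) or averaging (regression); scikit-learn's classifier averages the trees' predicted class probabilities rather than counting hard votes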
- More robust than single trees, less prone to overfitting

### Feature Importance
- Measures contribution of each feature to reducing impurity
- Helps identify most informative spectral bands/indices
- Useful for model interpretation and as a starting point for feature selection (confirm on a validation set before dropping features)
- Should be interpreted with domain knowledge
- Can be unstable with correlated features

### Confusion Matrix & Metrics
- Overall accuracy can hide class-specific problems
- **Precision** (user's accuracy): Reliability of positive predictions
- **Recall** (producer's accuracy): Completeness of detection
- **F1-score**: Harmonic mean balancing precision and recall
- Choice of metric depends on application cost (false positives vs. false negatives)

### For Earth Observation
- Random Forest works well with multi-spectral data
- No feature scaling needed (unlike neural networks)
- Feature importance reveals spectral signature insights
- Confusion patterns often reflect spectral similarity
- Fast training enables rapid iteration

---

## Next Steps

In the **Hands-On Session**, you will:
1. Load real Sentinel-2 data for Palawan, Philippines
2. Extract training samples from land cover polygons
3. Train Random Forest for multi-class land cover classification
4. Optimize hyperparameters (n_estimators, max_depth, etc.)
5. Generate wall-to-wall land cover maps
6. Validate results and interpret errors

**Prepare by reviewing**:
- Sentinel-2 band characteristics (B2, B3, B4, B8, B11, B12)
- Philippine land cover types (forest, mangrove, agriculture, urban, water)
- Google Earth Engine Python API basics

---

## Teaching Notes & Tips

### Timing Breakdown
- **Section A (Setup)**: 5 min
- **Section B (Decision Trees)**: 15 min (10 min explanation + 5 min exercise)
- **Section C (Random Forest)**: 15 min (10 min voting + 5 min exercise)
- **Section D (Feature Importance)**: 10 min
- **Section E (Confusion Matrix)**: 15 min (10 min metrics + 5 min exercise)
- **Section F (Quiz)**: 10 min
- **Total**: 70 minutes

### Common Student Questions

1. **"Why not just use one deep tree?"**
   - Single tree = high variance, unstable
   - Ensemble averages out errors
   - Show the visualization of individual trees vs. ensemble

2. **"How many trees is enough?"**
   - Typically 100-500
   - Accuracy plateaus; adding more trees does not hurt accuracy, it only increases training and prediction time
   - Use OOB error to monitor convergence

3. **"Can RF handle imbalanced classes?"**
   - Yes, but may bias toward majority class
   - Use `class_weight='balanced'` parameter
   - Or oversample minority class / undersample majority

4. **"RF vs. Neural Networks for EO?"**
   - **RF strengths**: Fast, interpretable, no scaling needed, works with small data
   - **NN strengths**: Better with huge data, learns hierarchical features, handles spatial context
   - **Practical**: Try RF first, then NN if needed

5. **"Why is my Urban class performing poorly?"**
   - Urban is heterogeneous (diverse spectral signatures)
   - May need more training samples or sub-classes
   - Consider texture features, not just spectral
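
For the "How many trees is enough?" question, OOB error can be tracked as trees are added. The sketch below uses `warm_start=True` so that each call to `fit` only trains the additional trees. The data is synthetic and the tree counts are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for an 8-feature training set
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=42)

# warm_start=True keeps existing trees and only fits the newly added ones
rf = RandomForestClassifier(warm_start=True, oob_score=True, random_state=42)

oob_errors = {}
for n_trees in (25, 50, 100, 200):
    rf.set_params(n_estimators=n_trees)
    rf.fit(X, y)
    oob_errors[n_trees] = 1 - rf.oob_score_
    print(f"n_estimators={n_trees:3d}: OOB error = {oob_errors[n_trees]:.3f}")
```

Once the OOB error stops changing meaningfully between steps, adding more trees mainly costs time.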

### Troubleshooting

**If students have errors**:
- Check sklearn version (needs >= 0.24)
- Ensure random_state is set for reproducibility
- Check for NaN values in data

**If notebooks run slowly**:
- Reduce n_samples in synthetic data
- Reduce mesh grid resolution (200 → 100)
- Use n_estimators=50 instead of 100

### Extension Activities

For advanced students:
1. Implement cross-validation instead of single train/test split
2. Compare RF with other algorithms (SVM, XGBoost)
3. Experiment with class weights for imbalanced data
4. Calculate and plot OOB error vs. number of trees
5. Implement permutation importance (less biased than the default impurity-based importance)

---

## References

1. Breiman, L. (2001). Random Forests. *Machine Learning*, 45(1), 5-32. [Foundational paper]
2. Belgiu, M., & Drăguţ, L. (2016). Random forest in remote sensing: A review of applications and future directions. *ISPRS Journal of Photogrammetry and Remote Sensing*, 114, 24-31.
3. Scikit-learn Documentation: [Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
4. ESA Sentinel-2 User Handbook: [https://sentinels.copernicus.eu/documents/247904/685211/Sentinel-2_User_Handbook](https://sentinels.copernicus.eu/documents/247904/685211/Sentinel-2_User_Handbook)
5. Louppe, G. (2014). Understanding Random Forests. *PhD Thesis, University of Liège*. [Excellent theoretical treatment]

---

**End of Instructor Version**

*Developed for CopPhil 4-Day Advanced Online Training on AI/ML for Earth Observation*