# Random Forest Modelling Notebook for DEFRA

- This notebook trains a Random Forest on 2D flattened data.
- Inputs are read from the `data/defra/ml_prep` folder.
- It follows the same structure as the LAQN RF training notebook for direct comparison.
- The approach follows Géron's *Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow*, 3rd edition.

---

## What this notebook does

1. Load prepared data from ml_prep output.
2. Understand the X and y structure (following Géron Chapter 2).
3. Train a baseline Random Forest model.
4. Evaluate using RMSE, MAE, R² (Géron Chapter 2 evaluation approach).
5. Fine-tune hyperparameters with GridSearchCV (Géron Chapter 2).
6. Analyse feature importance.
7. Save the trained model.

---

## Why Random Forest?

From Géron (2023, Chapter 7), a Random Forest is an ensemble of Decision Trees, each trained on a different random subset of the training data. Each tree makes its own prediction, and the final output is the average of the trees (for regression) or the majority vote (for classification).

Key advantages for air quality prediction:
- Handles nonlinear relationships without feature scaling.
- Provides feature importance for interpretability.
- Robust against overfitting when properly tuned.
- Works well with tabular data like our flattened time series.
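
The averaging step described above can be checked directly. The following is a minimal sketch on synthetic data (not the DEFRA arrays); `X_demo`, `y_demo`, and `forest` are hypothetical names used only for illustration. It compares the mean of the individual tree predictions with the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic nonlinear regression data (illustrative only, not the DEFRA arrays)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 1, size=(200, 5))
y_demo = np.sin(4 * X_demo[:, 0]) + 0.1 * rng.normal(size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=42)
forest.fit(X_demo, y_demo)

# Each fitted tree is stored in estimators_; collect their predictions
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])

# For regression, the forest output is the mean over the trees
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X_demo)

print("Forest prediction equals mean of trees:",
      np.allclose(manual_average, forest_prediction))
```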

---

## DEFRA vs LAQN comparison context

| Metric | LAQN | DEFRA |
| --- | --- | --- |
| Training samples | 9,946 | 11,138 |
| Validation samples | 2,131 | 2,387 |
| Test samples | 2,132 | 2,387 |
| Features (flattened) | 468 | 288 |
| Original features | 39 | 24 |
| Target station | EN5_NO2 | London_Haringey_Priory_Park_South_NO2 |
| Distance between targets | ~3.3 km | - |

This comparison tests whether DEFRA's higher completeness (91.2% vs 87.1%) compensates for fewer stations.

In [1]:
# mandatory libraries for random forest training

import numpy as np
import pandas as pd
import joblib
import os
from pathlib import Path
import time

# scikit-learn for random forest and evaluation
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# before grid search, I decided to use kfold n_splits=5
from sklearn.model_selection import KFold

# modules for evaluation metrics - scikit-learn
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# gridsearch for hyperparameter tuning
from sklearn.model_selection import GridSearchCV

# visualisation
import matplotlib.pyplot as plt

### File paths

Loading from the ml_prep output folder where all prepared arrays are saved.

**CHANGE FROM LAQN:** Path changed from `data/laqn/ml_prep` to `data/defra/ml_prep`

In [2]:
# Paths setup matching ml_prep output 
base_dir = Path.cwd().parent.parent / "data" / "defra"
ml_prep_dir = base_dir / "ml_prep"

# Output folder for this notebook
rf_output_dir = base_dir / "rf_model"
rf_output_dir.mkdir(parents=True, exist_ok=True)

print(f"Loading data from: {ml_prep_dir}")
print(f"Saving results to: {rf_output_dir}")

Loading data from: /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/ml_prep
Saving results to: /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/rf_model


## 1) Load prepared data

The ml_prep notebook created:
- `X_train_rf.npy`: Flattened training features (11,138 samples × 288 features)
- `X_val_rf.npy`: Flattened validation features (2,387 samples × 288 features)
- `X_test_rf.npy`: Flattened test features (2,387 samples × 288 features)
- `y_train.npy`, `y_val.npy`, `y_test.npy`: Target values
- `rf_feature_names.joblib`: Feature names for interpretability
- `scaler.joblib`: MinMaxScaler to reverse normalisation

The flattening was necessary because Random Forest expects 2D input (samples, features), but the original sequences were 3D (samples, timesteps, features).
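
As an illustration of the flattening, here is a minimal sketch with a toy array. It assumes 12 timesteps per sequence (inferred from 288 / 24 = 12) and a plain C-order `reshape`; the actual ml_prep notebook may order the flattened columns differently, so the index mapping shown is an assumption.

```python
import numpy as np

# Toy dimensions; the DEFRA sequences would be (n_samples, 12, 24)
# if 12 timesteps are assumed (288 / 24 = 12)
n_samples, n_timesteps, n_features = 4, 12, 24
X_seq = np.arange(n_samples * n_timesteps * n_features, dtype=float).reshape(
    n_samples, n_timesteps, n_features
)

# Collapse the timestep and feature axes into a single feature axis
X_flat = X_seq.reshape(n_samples, n_timesteps * n_features)

# With a C-order reshape, column t * n_features + f holds feature f at timestep t
t, f = 3, 15
assert X_flat[0, t * n_features + f] == X_seq[0, t, f]

print(X_seq.shape, "->", X_flat.shape)
```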

### DEFRA vs LAQN data shapes:

| Dataset | LAQN | DEFRA |
| --- | --- | --- |
| X_train_rf | (9946, 468) | (11138, 288) |
| X_val_rf | (2131, 468) | (2387, 288) |
| X_test_rf | (2132, 468) | (2387, 288) |
| y_train | (9946, 39) | (11138, 24) |

DEFRA has ~12% more samples but fewer features (24 vs 39 original, 288 vs 468 flattened).

In [3]:
# load all prepared data
print("Loading data")

X_train = np.load(ml_prep_dir / "X_train_rf.npy")
X_val = np.load(ml_prep_dir / "X_val_rf.npy")
X_test = np.load(ml_prep_dir / "X_test_rf.npy")

y_train = np.load(ml_prep_dir / "y_train.npy")
y_val = np.load(ml_prep_dir / "y_val.npy")
y_test = np.load(ml_prep_dir / "y_test.npy")

rf_feature_names = joblib.load(ml_prep_dir / "rf_feature_names.joblib")
feature_names = joblib.load(ml_prep_dir / "feature_names.joblib")
scaler = joblib.load(ml_prep_dir / "scaler.joblib")

Loading data


In [4]:
# check loaded data shapes

print(f"X_train shape: {X_train.shape}")
print(f"X_val shape: {X_val.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_val shape: {y_val.shape}")
print(f"y_test shape: {y_test.shape}")
print(f"Number of RF features: {len(rf_feature_names)}")
print(f"Number of target features: {len(feature_names)}")

X_train shape: (11138, 288)
X_val shape: (2387, 288)
X_test shape: (2387, 288)
y_train shape: (11138, 24)
y_val shape: (2387, 24)
y_test shape: (2387, 24)
Number of RF features: 288
Number of target features: 24



## 2) Select target pollutant

The y array has 24 outputs, one for each feature (20 station-pollutant series plus 4 calendar features). To keep the evaluation easy to interpret, I will train a single-output model first.

### Why single-output?

Starting with one target keeps things simple:
- Easier to interpret evaluation metrics (RMSE, R² for one pollutant).
- Easier to understand feature importance (what predicts NO2 specifically).
- Can train separate models for PM10 and O3 later and compare.

### Which target to select?

From DEFRA ml_prep, the target station is **London_Haringey_Priory_Park_South_NO2**:
- Located ~3.3 km from LAQN's EN5 station
- 95.1% data coverage
- Enables direct comparison with LAQN RF results

In [5]:
# list targets
print("Available targets:")
for i, name in enumerate(feature_names):
    print(f"  {i:2d}: {name}")

Available targets:
   0: London_Hillingdon_O3
   1: London_Harlington_PM10
   2: London_Hillingdon_NO2
   3: London_Westminster_NO2
   4: London_Honor_Oak_Park_PM10
   5: London_N._Kensington_NO2
   6: Camden_Kerbside_NO2
   7: London_Hillingdon_PM10
   8: Southwark_A2_Old_Kent_Road_NO2
   9: Borehamwood_Meadow_Park_NO2
  10: London_Bloomsbury_NO2
  11: London_Harlington_NO2
  12: Borehamwood_Meadow_Park_PM10
  13: London_Bloomsbury_O3
  14: London_N._Kensington_O3
  15: London_Haringey_Priory_Park_South_NO2
  16: London_Bexley_NO2
  17: London_Westminster_O3
  18: London_Bloomsbury_PM10
  19: London_Marylebone_Road_NO2
  20: hour
  21: day_of_week
  22: month
  23: is_weekend



In [6]:
# find the target station index
# From ml_prep: London_Haringey_Priory_Park_South_NO2 is our target
target_name = "London_Haringey_Priory_Park_South_NO2"

# find index of target in feature_names
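try:
    target_idx = feature_names.index(target_name)
except ValueError:
    # if exact name not found, search for partial match
    for i, name in enumerate(feature_names):
        if 'Haringey' in name and 'NO2' in name:
            target_idx = i
            target_name = name
            break
    else:
        # no partial match either: fail here rather than with a NameError below
        raise ValueError("No Haringey NO2 target found in feature_names")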

y_train_single = y_train[:, target_idx]
y_val_single = y_val[:, target_idx]
y_test_single = y_test[:, target_idx]

print(f"Target: {target_name}")
print(f"Target index: {target_idx}")
print(f"y_train_single: {y_train_single.shape}")
print(f"Range: [{y_train_single.min():.4f}, {y_train_single.max():.4f}]")

Target: London_Haringey_Priory_Park_South_NO2
Target index: 15
y_train_single: (11138,)
Range: [0.0000, 1.0000]



## 3) Train baseline model (Géron Chapter 7)

From Géron (2023, Chapter 7 - Ensemble Learning and Random Forests):

> "A random forest is an ensemble of decision trees, generally trained via the bagging method (or sometimes pasting), typically with `max_samples` set to the size of the training set. Instead of building a `BaggingClassifier` and passing it a `DecisionTreeClassifier`, you can use the `RandomForestClassifier` class, which is more convenient and optimized for decision trees (similarly, there is a `RandomForestRegressor` class for regression tasks)."

Since I am predicting continuous pollution values (regression), I use `RandomForestRegressor`.

### Key parameters:

| Parameter | Value used | scikit-learn default | What it does |
| --- | --- | --- | --- |
| n_estimators | 100 | 100 | Number of trees in the forest |
| max_leaf_nodes | None | None | Maximum leaf nodes per tree (None = unlimited) |
| n_jobs | -1 | None | CPU cores to use (-1 = all available) |
| random_state | 42 | None | Seed for reproducibility |

Géron's example uses `n_estimators=500` and `max_leaf_nodes=16`, but I start with defaults to establish a baseline before tuning.
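
To show what `max_leaf_nodes` controls, the sketch below fits two small forests on synthetic data and counts the leaves in each tree. The data and the names `unconstrained` and `constrained` are hypothetical; this is not the tuning done later in the notebook, only an illustration of how the parameter caps tree size.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(500, 8))
y_demo = X_demo[:, 0] ** 2 + 0.05 * rng.normal(size=500)

# Default: trees grow until leaves are pure or cannot be split further
unconstrained = RandomForestRegressor(n_estimators=10, random_state=42)
unconstrained.fit(X_demo, y_demo)

# Constrained: each tree may have at most 16 leaves (the value in Géron's example)
constrained = RandomForestRegressor(
    n_estimators=10, max_leaf_nodes=16, random_state=42
)
constrained.fit(X_demo, y_demo)

leaves_unconstrained = [tree.get_n_leaves() for tree in unconstrained.estimators_]
leaves_constrained = [tree.get_n_leaves() for tree in constrained.estimators_]

print("Largest tree (default):", max(leaves_unconstrained), "leaves")
print("Largest tree (max_leaf_nodes=16):", max(leaves_constrained), "leaves")
```

Smaller trees fit the training data less closely, which is why this parameter is a common candidate when tuning to reduce overfitting.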

In [7]:
# train baseline Random Forest
# Using RandomForestRegressor for a regression task (predicting continuous values),
# following Géron's structure from Chapter 7

print("Training baseline Random Forest")
print("-" * 40)

start = time.time()

# baseline with default parameters
rf_baseline = RandomForestRegressor(
    n_estimators=100,      # default, Géron's example uses 500
    random_state=42,       # for reproducibility
    n_jobs=-1              # use all CPU cores
)

rf_baseline.fit(X_train, y_train_single)

baseline_time = time.time() - start

print(f"\nTraining complete in {baseline_time:.2f} seconds")
print(f"Number of trees: {rf_baseline.n_estimators}")
print(f"Max leaf nodes: {rf_baseline.max_leaf_nodes}")
print(f"Max depth: {rf_baseline.max_depth}")

Training baseline Random Forest
----------------------------------------

Training complete in 33.32 seconds
Number of trees: 100
Max leaf nodes: None
Max depth: None



## 4) Evaluate baseline model

To evaluate the baseline model, I use three metrics from scikit-learn's `sklearn.metrics` module.

### RMSE (Root Mean Square Error)

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$

Penalises large errors more heavily. Lower is better.

```python
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
```

### MAE (Mean Absolute Error)

$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$

Average absolute difference between actual and predicted. More interpretable than RMSE. Lower is better.
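
Because the targets were min-max scaled to [0, 1] in ml_prep, RMSE and MAE computed on them are in normalised units. The sketch below shows one way to express such an error in the original units. It uses a small synthetic `MinMaxScaler` rather than the saved `scaler.joblib`, and it assumes the default `feature_range=(0, 1)`; `demo_scaler`, `col`, and the value ranges are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Synthetic "raw" data with three columns (hypothetical values)
rng = np.random.default_rng(1)
raw = rng.uniform(0, 120, size=(100, 3))

demo_scaler = MinMaxScaler().fit(raw)
scaled = demo_scaler.transform(raw)

col = 1  # hypothetical index of the target column
y_true_scaled = scaled[:, col]
y_pred_scaled = y_true_scaled + rng.normal(0, 0.03, size=100)  # fake predictions

# Errors in normalised units
rmse_scaled = np.sqrt(np.mean((y_true_scaled - y_pred_scaled) ** 2))
mae_scaled = np.mean(np.abs(y_true_scaled - y_pred_scaled))

# Min-max scaling is linear, so an error rescales by the column's data range
col_range = demo_scaler.data_range_[col]
rmse_original = rmse_scaled * col_range
mae_original = mae_scaled * col_range

# Cross-check: invert the scaling explicitly and recompute RMSE
y_true_orig = y_true_scaled * col_range + demo_scaler.data_min_[col]
y_pred_orig = y_pred_scaled * col_range + demo_scaler.data_min_[col]
rmse_direct = np.sqrt(np.mean((y_true_orig - y_pred_orig) ** 2))

print(f"RMSE scaled: {rmse_scaled:.4f}, RMSE original units: {rmse_original:.4f}")
```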

### R² (Coefficient of Determination)

$$R^2 = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$

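Proportion of variance explained by the model. The maximum is 1 and higher is better; 0 means no better than always predicting the mean, and the score can be negative on held-out data if the model does worse than that.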
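
The MAE and R² formulas can be checked by hand against scikit-learn. This is a small sketch with made-up values; `mae_manual`, `r2_manual`, `y_good`, and `y_bad` are hypothetical names. The second set of predictions is deliberately worse than predicting the mean, to show that R² is not bounded below by 0.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

y_true = np.array([0.2, 0.4, 0.6, 0.8])
y_good = np.array([0.25, 0.35, 0.65, 0.75])  # close to the truth
y_bad = np.array([0.8, 0.6, 0.4, 0.2])       # reversed order, worse than the mean


def mae_manual(y, y_hat):
    """Mean absolute error, following the formula above."""
    return np.mean(np.abs(y - y_hat))


def r2_manual(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot


for label, y_hat in [("good", y_good), ("bad", y_bad)]:
    print(
        f"{label}: MAE = {mae_manual(y_true, y_hat):.3f}, "
        f"R^2 = {r2_manual(y_true, y_hat):.3f}, "
        f"sklearn R^2 = {r2_score(y_true, y_hat):.3f}"
    )
```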

In [8]:
def evaluate(model, X, y_true, name):
    """
    Evaluate model using RMSE, MAE, and R².
    
    Params:
    model : trained sklearn model
    X : feature matrix (n_samples, n_features)
    y_true : actual values (n_samples,)
    name : string for display
    
    Returns: dict with rmse, mae, r2, and y_pred
    """
    y_pred = model.predict(X)
    
    # RMSE using np.sqrt
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    
    # MAE avg absolute difference
    mae = mean_absolute_error(y_true, y_pred)
    
    # R^2 proportion of variance explained
    r2 = r2_score(y_true, y_pred)
    
    print(f"{name}:")
    print(f"  RMSE = {rmse:.6f}")
    print(f"  MAE  = {mae:.6f}")
    print(f"  R^2  = {r2:.6f}")
    
    return {'rmse': rmse, 'mae': mae, 'r2': r2, 'y_pred': y_pred}

In [9]:
# Evaluate baseline on three sets
print("Baseline Model Evaluation")
print("=" * 40)

base_train = evaluate(rf_baseline, X_train, y_train_single, "Training")
print()
base_val = evaluate(rf_baseline, X_val, y_val_single, "Validation")
print()
base_test = evaluate(rf_baseline, X_test, y_test_single, "Test")

Baseline Model Evaluation
========================================
Training:
  RMSE = 0.016983
  MAE  = 0.011019
  R^2  = 0.984685

Validation:
  RMSE = 0.042742
  MAE  = 0.027620
  R^2  = 0.865686

Test:
  RMSE = 0.033585
  MAE  = 0.022963
  R^2  = 0.851746



The gap between training R² (0.985) and validation R² (0.866) is **0.119**, which indicates some overfitting, at a level similar to LAQN (0.118 gap): the model fits the training data more closely than it generalises. DEFRA's validation R² (0.866) is marginally higher than LAQN's (0.861), which hints at slightly better generalisation, although a difference of 0.005 is small.

### Checking for overfitting

The training R² is higher than validation R², indicating some overfitting. This happens when the model memorises training data instead of learning general patterns.

Signs of overfitting:
- Training R² close to 1.0 (0.985), validation R² lower (0.866).
- Gap of 0.119 between training and validation R².

**DEFRA vs LAQN overfitting comparison:**

| Metric | LAQN | DEFRA |
|--------|------|-------|
| Training R² | 0.979 | 0.985 |
| Validation R² | 0.861 | 0.866 |
| Gap | 0.118 | 0.119 |

Both datasets show similar overfitting levels (~0.12 gap). DEFRA's validation R² is marginally higher (0.866 vs 0.861), so at the same level of overfitting it generalises slightly better. I'll tune hyperparameters to reduce the overfitting.