# Random Forest Modelling Notebook for DEFRA

- This notebook trains a Random Forest on 2D flattened data.
- Inputs are read from the `data/defra/ml_prep` folder.
- It follows the same structure as the LAQN RF training notebook for direct comparison.
- Methods follow Géron's *Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow*, 3rd edition.

---

## What this notebook does

1. Load prepared data from ml_prep output.
2. Understand the X and y structure (following Géron Chapter 2).
3. Train a baseline Random Forest model.
4. Evaluate using RMSE, MAE, R² (Géron Chapter 2 evaluation approach).
5. Fine-tune hyperparameters with GridSearchCV (Géron Chapter 2).
6. Analyse feature importance.
7. Save the trained model.
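
As a rough illustration of step 4, the three metrics can be computed with a small helper. This is a minimal sketch: `evaluate_regression` is a hypothetical helper name and the arrays are toy values, not results from this project.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


def evaluate_regression(y_true, y_pred):
    """Return RMSE, MAE and R² for a single target as a dict (hypothetical helper)."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "rmse": float(np.sqrt(mse)),  # root of the mean squared error
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "r2": float(r2_score(y_true, y_pred)),
    }


# toy values for illustration only
y_true_demo = np.array([0.2, 0.4, 0.6, 0.8])
y_pred_demo = np.array([0.25, 0.35, 0.65, 0.75])

metrics_demo = evaluate_regression(y_true_demo, y_pred_demo)
print(metrics_demo)
```

RMSE is taken as the square root of `mean_squared_error` so the sketch does not depend on which scikit-learn version is installed.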

---

## Why Random Forest?

Following Géron (2023, Chapter 7), a Random Forest is an ensemble of Decision Trees, each trained on a different random subset of the training data. Each tree makes its own prediction, and the final output is the average of those predictions (for regression) or the majority vote (for classification).

Key advantages for air quality prediction:
- Handles nonlinear relationships without feature scaling.
- Provides feature importance for interpretability.
- Robust against overfitting when properly tuned.
- Works well with tabular data like our flattened time series.
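
The averaging behaviour described above can be checked on synthetic data. The sketch below uses made-up inputs and small hypothetical settings (`n_estimators=20`), not this project's data or model. It compares the forest's prediction with the mean of its individual trees and shows where the feature importances live.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# synthetic regression data, for illustration only
rng = np.random.default_rng(42)
X_toy = rng.random((200, 5))
y_toy = np.sin(3 * X_toy[:, 0]) + 0.1 * rng.standard_normal(200)

forest = RandomForestRegressor(n_estimators=20, random_state=42)
forest.fit(X_toy, y_toy)

# predictions of each individual tree, shape (n_trees, n_samples)
per_tree = np.stack([tree.predict(X_toy[:10]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X_toy[:10])

print("Forest equals mean of trees:", np.allclose(manual_average, forest_prediction))

# impurity-based importances, one value per input column
importances = forest.feature_importances_
print("Importances:", np.round(importances, 3))
```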

---

## DEFRA vs LAQN comparison context

| Metric | LAQN | DEFRA |
| --- | --- | --- |
| Training samples | 9,946 | 11,138 |
| Validation samples | 2,131 | 2,387 |
| Test samples | 2,132 | 2,387 |
| Features (flattened) | 468 | 288 |
| Original features | 39 | 24 |
| Target station | EN5_NO2 | London_Haringey_Priory_Park_South_NO2 |
| Distance between targets | ~3.3 km | - |

This comparison tests whether DEFRA's higher data completeness (91.2% vs 87.1% for LAQN) compensates for its smaller number of stations.

In [1]:
# mandatory libraries for random forest training

import numpy as np
import pandas as pd
import joblib
import os
from pathlib import Path
import time

# scikit-learn for random forest and evaluation
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# KFold with n_splits=5 for cross-validation before the grid search
from sklearn.model_selection import KFold

# modules for evaluation metrics - scikit-learn
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# gridsearch for hyperparameter tuning
from sklearn.model_selection import GridSearchCV

# visualisation
import matplotlib.pyplot as plt

### File paths

Loading from the ml_prep output folder where all prepared arrays are saved.

**CHANGE FROM LAQN:** Path changed from `data/laqn/ml_prep` to `data/defra/ml_prep`

In [2]:
# Paths setup matching ml_prep output 
base_dir = Path.cwd().parent.parent / "data" / "defra"
ml_prep_dir = base_dir / "ml_prep"

# Output folder for this notebook
rf_output_dir = base_dir / "rf_model"
rf_output_dir.mkdir(parents=True, exist_ok=True)

print(f"Loading data from: {ml_prep_dir}")
print(f"Saving results to: {rf_output_dir}")

Loading data from: /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/ml_prep
Saving results to: /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/rf_model


## 1) Load prepared data

The ml_prep notebook created:
- `X_train_rf.npy`: Flattened training features (11,138 samples × 288 features)
- `X_val_rf.npy`: Flattened validation features (2,387 samples × 288 features)
- `X_test_rf.npy`: Flattened test features (2,387 samples × 288 features)
- `y_train.npy`, `y_val.npy`, `y_test.npy`: Target values
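- `rf_feature_names.joblib`: Names of the 288 flattened features, for interpretability
- `feature_names.joblib`: Names of the 24 original features (the columns of y)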
- `scaler.joblib`: MinMaxScaler to reverse normalisation
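
Reversing the normalisation for a single column can be sketched as follows. It assumes the saved scaler was fitted on the original feature columns in the same order as `feature_names`, which this notebook does not confirm. `inverse_single_column` is a hypothetical helper and the numbers are toy values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy data with two columns standing in for two pollutant series
raw = np.array([[10.0, 100.0], [20.0, 300.0], [40.0, 200.0]])
demo_scaler = MinMaxScaler().fit(raw)
scaled = demo_scaler.transform(raw)


def inverse_single_column(scaled_values, fitted_scaler, col_idx):
    """Undo MinMax scaling for one column (hypothetical helper).

    MinMaxScaler applies X * scale_ + min_, so the inverse is (X - min_) / scale_.
    """
    return (scaled_values - fitted_scaler.min_[col_idx]) / fitted_scaler.scale_[col_idx]


restored = inverse_single_column(scaled[:, 1], demo_scaler, col_idx=1)
print(restored)
```

Working per column avoids having to rebuild a full-width array just to call `inverse_transform` when only one target is predicted.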

The flattening was necessary because Random Forest expects 2D input (samples, features), but the original sequences were 3D (samples, timesteps, features).
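
A minimal sketch of that flattening step on a dummy array is shown below. The 12 timesteps are inferred from 288 / 24 and are an assumption. The ml_prep notebook may also order the flattened columns differently from the default C-order reshape used here.

```python
import numpy as np

# assumed dimensions: 12 timesteps is inferred from 288 / 24, not stated in the source
n_samples, n_timesteps, n_features = 4, 12, 24

# dummy 3D sequences (samples, timesteps, features)
X_seq = np.arange(n_samples * n_timesteps * n_features, dtype=float)
X_seq = X_seq.reshape(n_samples, n_timesteps, n_features)

# flatten each sample's (timesteps, features) window into one row
X_flat = X_seq.reshape(n_samples, -1)
print(X_flat.shape)

# with C-order reshape, flat column index = t * n_features + f
t, f = 3, 5
flat_col = t * n_features + f
print(X_flat[2, flat_col] == X_seq[2, t, f])
```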

### DEFRA vs LAQN data shapes

| Dataset | LAQN | DEFRA |
| --- | --- | --- |
| X_train_rf | (9946, 468) | (11138, 288) |
| X_val_rf | (2131, 468) | (2387, 288) |
| X_test_rf | (2132, 468) | (2387, 288) |
| y_train | (9946, 39) | (11138, 24) |

DEFRA has ~12% more samples but fewer features (24 vs 39 original, 288 vs 468 flattened).

In [3]:
# load all prepared data
print("Loading data")

X_train = np.load(ml_prep_dir / "X_train_rf.npy")
X_val = np.load(ml_prep_dir / "X_val_rf.npy")
X_test = np.load(ml_prep_dir / "X_test_rf.npy")

y_train = np.load(ml_prep_dir / "y_train.npy")
y_val = np.load(ml_prep_dir / "y_val.npy")
y_test = np.load(ml_prep_dir / "y_test.npy")

rf_feature_names = joblib.load(ml_prep_dir / "rf_feature_names.joblib")
feature_names = joblib.load(ml_prep_dir / "feature_names.joblib")
scaler = joblib.load(ml_prep_dir / "scaler.joblib")

Loading data


In [4]:
# check loaded data shapes

print(f"X_train shape: {X_train.shape}")
print(f"X_val shape: {X_val.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_val shape: {y_val.shape}")
print(f"y_test shape: {y_test.shape}")
print(f"Number of RF features: {len(rf_feature_names)}")
print(f"Number of target features: {len(feature_names)}")

X_train shape: (11138, 288)
X_val shape: (2387, 288)
X_test shape: (2387, 288)
y_train shape: (11138, 24)
y_val shape: (2387, 24)
y_test shape: (2387, 24)
Number of RF features: 288
Number of target features: 24



## 2) Select target pollutant

The y array has 24 output columns, one per original feature. To keep the evaluation easy to interpret, I will train a single-output model first.

### Why single-output?

Starting with one target keeps things simple:
- Easier to interpret evaluation metrics (RMSE, R² for one pollutant).
- Easier to understand feature importance (what predicts NO2 specifically).
- Can train separate models for PM10 and O3 later and compare.

### Which target to select?

From DEFRA ml_prep, the target station is **London_Haringey_Priory_Park_South_NO2**:
- Located ~3.3 km from LAQN's EN5 station
- 95.1% data coverage
- Enables direct comparison with LAQN RF results