# Random Forest Modelling Notebook for LAQN

- Starting ML, very excited.
- This notebook is for Random Forest training using 2D flattened data.
- Inputs are read from the `data/laqn/ml_prep` folder.
- I will use Géron's *Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow* (3rd edition) as the primary source, both to understand the X and y training sets better and to mirror the book's implementation structure.

---

## What this notebook does

1. Load prepared data from ml_prep output.
2. Understand the X and y structure (following Géron Chapter 2).
3. Train a baseline Random Forest model.
4. Evaluate using RMSE, MAE, R² (Géron Chapter 2 evaluation approach).
5. Fine-tune hyperparameters with GridSearchCV (Géron Chapter 2).
6. Analyse feature importance.
7. Save the trained model.

---

## Why Random Forest?

From Géron (2023, Chapter 7), Random Forest is an ensemble of Decision Trees trained on different random subsets of the training data. Each tree votes on the prediction, and the final output is the average (for regression) or majority vote (for classification).
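
The averaging step for regression can be checked directly in scikit-learn. The sketch below uses synthetic data from `make_regression` (an assumption for illustration, not the LAQN arrays) and compares the forest's prediction with the mean of its individual trees' predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (hypothetical, not the LAQN arrays)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=0.1, random_state=42
)

forest = RandomForestRegressor(n_estimators=20, random_state=42)
forest.fit(X_demo, y_demo)

# Each fitted tree is exposed through forest.estimators_
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])

# For regression, the forest output is the mean over the trees
manual_average = per_tree.mean(axis=0)
forest_pred = forest.predict(X_demo)

print("Per-tree prediction matrix:", per_tree.shape)
print("Matches forest.predict:", np.allclose(manual_average, forest_pred))
```

The same check on a classifier would not hold as written, because `RandomForestClassifier` combines class probabilities rather than averaging predicted values.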

Key advantages for air quality prediction:
- Handles nonlinear relationships without feature scaling.
- Provides feature importance for interpretability.
- Robust against overfitting when properly tuned.
- Works well with tabular data like our flattened time series.

In [20]:
# mandatory libraries for random forest training

import numpy as np
import pandas as pd
import joblib
import os
from pathlib import Path
import time

# scikit-learn for random forest and evaluation
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

# evaluation metrics from scikit-learn
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# visualisation
import matplotlib.pyplot as plt

### File paths

Loading from the ml_prep output folder where all prepared arrays are saved.

In [21]:
#Paths setup matching ml_prep output 
base_dir = Path.cwd().parent.parent / "data" / "laqn"
ml_prep_dir = base_dir / "ml_prep"

#Output folder for this notebook
rf_output_dir = base_dir / "rf_model"
rf_output_dir.mkdir(parents=True, exist_ok=True)

print(f"Loading data from: {ml_prep_dir}")
print(f"Saving results to: {rf_output_dir}")

Loading data from: /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/laqn/ml_prep
Saving results to: /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/laqn/rf_model


## 1) Load prepared data

The ml_prep notebook created:
- `X_train_rf.npy`: Flattened training features (9,946 samples × 468 features)
- `X_val_rf.npy`: Flattened validation features (2,131 samples × 468 features)
- `X_test_rf.npy`: Flattened test features (2,132 samples × 468 features)
- `y_train.npy`, `y_val.npy`, `y_test.npy`: Target values
- `rf_feature_names.joblib`: Feature names for interpretability
- `scaler.joblib`: MinMaxScaler to reverse normalisation

The flattening was necessary because Random Forest expects 2D input (samples, features), but the original sequences were 3D (samples, timesteps, features).
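
As a rough illustration of that flattening, the sketch below reshapes a small 3D array into 2D. The window length of 12 timesteps is an assumption inferred from 468 / 39 = 12, and the row-major ordering is also an assumption; the ml_prep notebook may have ordered the flattened columns differently.

```python
import numpy as np

# Hypothetical dimensions: 12 timesteps is inferred from 468 / 39, not confirmed
n_samples, n_timesteps, n_features = 5, 12, 39

# Toy 3D array with unique values so positions can be traced after reshaping
X_3d = np.arange(n_samples * n_timesteps * n_features, dtype=float).reshape(
    n_samples, n_timesteps, n_features
)

# Collapse the timestep and feature axes into a single feature axis
X_2d = X_3d.reshape(n_samples, -1)
print(X_3d.shape, "->", X_2d.shape)

# With row-major reshaping, flat column k maps back to
# (timestep, feature) = divmod(k, n_features)
k = 100
timestep, feature = divmod(k, n_features)
print(X_2d[0, k] == X_3d[0, timestep, feature])
```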

In [22]:
# load all prepared data
print("Loading data")

X_train = np.load(ml_prep_dir / "X_train_rf.npy")
X_val = np.load(ml_prep_dir / "X_val_rf.npy")
X_test = np.load(ml_prep_dir / "X_test_rf.npy")

y_train = np.load(ml_prep_dir / "y_train.npy")
y_val = np.load(ml_prep_dir / "y_val.npy")
y_test = np.load(ml_prep_dir / "y_test.npy")

rf_feature_names = joblib.load(ml_prep_dir / "rf_feature_names.joblib")
feature_names = joblib.load(ml_prep_dir / "feature_names.joblib")
scaler = joblib.load(ml_prep_dir / "scaler.joblib")


Loading data


In [12]:
#check loaded data shapes

print(f"X_train shape: {X_train.shape}")
print(f"X_val shape: {X_val.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_val shape: {y_val.shape}")
print(f"y_test shape: {y_test.shape}")
print(f"Number of RF features: {len(rf_feature_names)}")
print(f"Number of target features: {len(feature_names)}")

X_train shape: (9946, 468)
X_val shape: (2131, 468)
X_test shape: (2132, 468)
y_train shape: (9946, 39)
y_val shape: (2131, 39)
y_test shape: (2132, 39)
Number of RF features: 468
Number of target features: 39



## 2) Select target pollutant

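The y array has 39 outputs, one per column in `feature_names`: 35 pollutant series plus 4 time features (`hour`, `day_of_week`, `month`, `is_weekend`). To keep evaluation interpretable, I will train a single-output model first.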

### Why single-output?

Starting with one target keeps things simple:
- Easier to interpret evaluation metrics (RMSE, R² for one pollutant).
- Easier to understand feature importance (what predicts NO2 specifically).
- Can train separate models for PM10 and O3 later and compare.
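
Comparing several pollutants later is easier if a target is looked up by name rather than by a hard-coded index. A minimal sketch, assuming the name list is in the same order as the y columns; `select_target`, `demo_names` and `demo_y` are hypothetical and only illustrate the idea.

```python
import numpy as np

# Hypothetical subset of target names, in the same order as the y columns
demo_names = ["EN5_NO2", "WMD_NO2", "BT5_NO2", "HP1_PM10"]
demo_y = np.arange(12, dtype=float).reshape(3, 4)  # 3 samples, 4 targets


def select_target(y, names, target):
    """Return (column index, 1D target vector) for the named target."""
    idx = list(names).index(target)  # raises ValueError if the name is missing
    return idx, y[:, idx]


idx, y_single = select_target(demo_y, demo_names, "HP1_PM10")
print(idx, y_single.shape)
```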

### Which target to select?

In ml_prep notebook section 6.A, I selected 35 columns sorted by data coverage. The first column has the highest coverage, making it the most reliable target for initial training.

In [23]:
# list targets
print("Available targets:")
for i, name in enumerate(feature_names):
    print(f"  {i:2d}: {name}")

Available targets:
   0: EN5_NO2
   1: WMD_NO2
   2: BT5_NO2
   3: HP1_PM10
   4: EN1_NO2
   5: ME9_NO2
   6: BT6_NO2
   7: BT8_PM10
   8: HV1_NO2
   9: BT4_PM10
  10: KC1_NO2
  11: BT8_NO2
  12: EI1_NO2
  13: HP1_NO2
  14: BX2_PM10
  15: GN0_NO2
  16: WM6_NO2
  17: IS6_NO2
  18: RI1_NO2
  19: HP1_O3
  20: BT6_PM10
  21: SK5_NO2
  22: BX1_O3
  23: GR9_NO2
  24: EN4_NO2
  25: GN6_NO2
  26: GR7_PM10
  27: KC1_O3
  28: GR7_NO2
  29: GN3_PM10
  30: LB4_NO2
  31: GN4_PM10
  32: GN4_NO2
  33: EA8_NO2
  34: EA6_NO2
  35: hour
  36: day_of_week
  37: month
  38: is_weekend


In [24]:
# select EN5_NO2
target_idx= 0
target_name = feature_names[target_idx]

y_train_single = y_train[:, target_idx]
y_val_single = y_val[:, target_idx]
y_test_single = y_test[:, target_idx]

print(f"Target: {target_idx}")
print(f"y_train_single: {y_train_single.shape}")
print(f"Range: [{y_train_single.min():.4f}, {y_train_single.max():.4f}]")

Target: 0
y_train_single: (9946,)
Range: [0.0079, 1.0000]
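
The range above shows the target is on the MinMax-scaled [0, 1] scale. To report values in the original units, the scaling can be undone for one column. This is a sketch under the assumption that the saved scaler was fitted on the same columns, in the same order, as `feature_names`; the toy data and `inverse_single_column` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data standing in for the original unscaled columns (made-up values)
raw = np.array([[10.0, 1.0], [20.0, 3.0], [40.0, 5.0]])
demo_scaler = MinMaxScaler().fit(raw)
scaled = demo_scaler.transform(raw)


def inverse_single_column(scaler, values, col_idx):
    """Undo MinMax scaling for one column.

    MinMaxScaler computes x_scaled = x * scale_ + min_,
    so the inverse is x = (x_scaled - min_) / scale_.
    """
    values = np.asarray(values, dtype=float)
    return (values - scaler.min_[col_idx]) / scaler.scale_[col_idx]


restored = inverse_single_column(demo_scaler, scaled[:, 0], col_idx=0)
print(restored)
```

Working column by column avoids building a full 39-column array just to call `inverse_transform` on a single target.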




## 3) Train baseline model (Géron Chapter 7)

From Géron (2023, Chapter 7 - Ensemble Learning and Random Forests):

> "A random forest is an ensemble of decision trees, generally trained via the bagging method (or sometimes pasting), typically with `max_samples` set to the size of the training set. Instead of building a `BaggingClassifier` and passing it a `DecisionTreeClassifier`, you can use the `RandomForestClassifier` class, which is more convenient and optimized for decision trees (similarly, there is a `RandomForestRegressor` class for regression tasks)."

Since I am predicting continuous pollution values (regression), I use `RandomForestRegressor`.

### Key parameters:

| Parameter | Value used | What it does |
| --- | --- | --- |
| n_estimators | 100 (default) | Number of trees in the forest |
| max_leaf_nodes | None (default) | Maximum leaf nodes per tree; None means unlimited |
| n_jobs | -1 | CPU cores to use (-1 = all available) |
| random_state | 42 | Seed for reproducibility |

Géron's example uses `n_estimators=500` and `max_leaf_nodes=16`, but I start with defaults to establish a baseline before tuning.
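
For comparison with the baseline, the more constrained settings mentioned above could look like the sketch below. It applies `n_estimators=500` and `max_leaf_nodes=16` to a `RandomForestRegressor` on synthetic data; using these values for regression, and the data itself, are assumptions for illustration rather than results from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (hypothetical, not the LAQN arrays)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=0.5, random_state=42
)

rf_regularised = RandomForestRegressor(
    n_estimators=500,    # more trees than the default of 100
    max_leaf_nodes=16,   # caps the size of each tree
    n_jobs=-1,
    random_state=42,
)
rf_regularised.fit(X_demo, y_demo)

# Every tree should respect the leaf limit
leaf_counts = [tree.get_n_leaves() for tree in rf_regularised.estimators_]
print("Trees:", len(rf_regularised.estimators_))
print("Largest tree (leaves):", max(leaf_counts))
```

Limiting `max_leaf_nodes` trades some training fit for smaller trees, which is why it is worth comparing against the unconstrained baseline rather than assuming it is better.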

In [25]:
# train baseline Random Forest
# Using RandomForestRegressor for regression task predicting continuous values following Géron's structure from Chapter 7

print("Training baseline Random Forest")
print("-" * 40)

start = time.time()

# baseline with default parameters
rf_baseline = RandomForestRegressor(
    n_estimators=100,      # default, Géron's example uses 500
    random_state=42,       # for reproducibility
    n_jobs=-1              # use all CPU cores
)

rf_baseline.fit(X_train, y_train_single)

baseline_time = time.time() - start

print(f"\nTraining complete in {baseline_time:.2f} seconds")
print(f"Number of trees: {rf_baseline.n_estimators}")
print(f"Max leaf nodes: {rf_baseline.max_leaf_nodes}")
print(f"Max depth: {rf_baseline.max_depth}")

Training baseline Random Forest
----------------------------------------

Training complete in 97.59 seconds
Number of trees: 100
Max leaf nodes: None
Max depth: None



## 4) Evaluate baseline model

To evaluate the baseline model, I use three metrics from scikit-learn's `sklearn.metrics` module.

### RMSE (Root Mean Square Error)

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$

Penalises large errors more heavily. Lower is better.

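```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Newer scikit-learn releases no longer accept `squared=False`,
# so take the square root of the MSE explicitly.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
```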

Source: Stack Overflow (2013) *Is there a library function for root mean square error (RMSE) in python?* Available at: https://stackoverflow.com/questions/17197492 (Accessed: 23 December 2025).

### MAE (Mean Absolute Error)

$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$

Average absolute difference between actual and predicted. More interpretable than RMSE. Lower is better.

```python
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_true, y_pred)
```

Source: scikit-learn (no date) *sklearn.metrics.mean_absolute_error*. Available at: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html

### R² (Coefficient of Determination)

$$R^2 = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$

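Proportion of variance explained by the model. Higher is better, and a score of 1.0 means perfect predictions. A score of 0 matches a model that always predicts the mean of y, and the score can be negative when the model does worse than that.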

```python
from sklearn.metrics import r2_score

r2 = r2_score(y_true, y_pred)
```

Source: scikit-learn (no date) *sklearn.metrics.r2_score*. Available at: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
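
The three formulas can be verified by hand on a few made-up numbers. The sketch below computes each metric with NumPy and compares it with the scikit-learn function; the arrays are hypothetical and unrelated to the LAQN data. It also shows a case where R² falls below zero.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Made-up values for illustration only
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

residuals = y_true - y_pred

# Direct translations of the formulas
rmse_manual = np.sqrt(np.mean(residuals ** 2))
mae_manual = np.mean(np.abs(residuals))
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true, y_pred))))
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_pred)))
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))

# A constant prediction far from the mean of y_true gives a negative R²
y_bad = np.full_like(y_true, 10.0)
r2_bad = r2_score(y_true, y_bad)
print("R² for a poor constant prediction:", r2_bad)
```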

In [None]:
def evaluate(model, X, y_true, name):
    """
    Evaluate a model using RMSE, MAE, and R².
    
    Params:
    model : trained sklearn model
    X : feature matrix, shape (n_samples, n_features)
    y_true : actual values, shape (n_samples,)
    name : label printed above the metrics
    
    Returns: dict with rmse, mae, r2, and y_pred
    """
    y_pred = model.predict(X)
    
    # RMSE as the square root of MSE; this avoids the `squared` argument,
    # which the installed scikit-learn rejects
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    
    # MAE: average absolute difference
    mae = mean_absolute_error(y_true, y_pred)
    
    # R²: proportion of variance explained
    r2 = r2_score(y_true, y_pred)
    
    print(f"{name}:")
    print(f"  RMSE = {rmse:.6f}")
    print(f"  MAE  = {mae:.6f}")
    print(f"  R²   = {r2:.6f}")
    
    return {'rmse': rmse, 'mae': mae, 'r2': r2, 'y_pred': y_pred}

In [27]:
# Evaluate the baseline on all three sets
print("Baseline Model Evaluation")
print("=" * 40)

base_train = evaluate(rf_baseline, X_train, y_train_single, "Training")
print()
base_val = evaluate(rf_baseline, X_val, y_val_single, "Validation")
print()
base_test = evaluate(rf_baseline, X_test, y_test_single, "Test")