# Random Forest Modelling Notebook for DEFRA

- This notebook trains a Random Forest model on 2D flattened data.
- Inputs are read from the `data/defra/ml_prep` folder.
- It follows the same structure as the LAQN RF training notebook so the two can be compared directly.
- Using Géron's *Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow* 3rd edition.

---

## What this notebook does

1. Load prepared data from ml_prep output.
2. Understand the X and y structure (following Géron Chapter 2).
3. Train a baseline Random Forest model.
4. Evaluate using RMSE, MAE, R² (Géron Chapter 2 evaluation approach).
5. Fine-tune hyperparameters with GridSearchCV (Géron Chapter 2).
6. Analyse feature importance.
7. Save the trained model.
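
As a rough illustration of steps 3 and 4, the sketch below trains a baseline forest and reports RMSE, MAE and R². It runs on synthetic data from `make_regression`, not on the DEFRA arrays, and the helper name `evaluate` and the hyperparameter values are assumptions made for this example only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic stand-in for the prepared arrays (NOT the DEFRA data).
X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=42)
X_train, X_val = X[:240], X[240:]
y_train, y_val = y[:240], y[240:]

# Baseline model; n_estimators=50 is an arbitrary choice for the sketch.
baseline = RandomForestRegressor(n_estimators=50, random_state=42)
baseline.fit(X_train, y_train)


def evaluate(model, X, y):
    """Return RMSE, MAE and R² for a fitted model (hypothetical helper)."""
    pred = model.predict(X)
    return {
        "rmse": float(np.sqrt(mean_squared_error(y, pred))),
        "mae": float(mean_absolute_error(y, pred)),
        "r2": float(r2_score(y, pred)),
    }


val_scores = evaluate(baseline, X_val, y_val)
print(val_scores)
```

RMSE is computed as the square root of `mean_squared_error` so the sketch does not depend on a particular scikit-learn version.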

---

## Why Random Forest?

According to Géron (2023, Chapter 7), a Random Forest is an ensemble of Decision Trees, each trained on a different random subset of the training data. Every tree makes its own prediction, and the forest outputs the average of those predictions (for regression) or the majority vote (for classification).
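
The averaging behaviour for regression can be checked directly. The sketch below, on synthetic data with arbitrary settings, averages the predictions of the individual trees stored in `estimators_` and compares the result with the forest's own `predict` output.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only to illustrate the mechanism.
X, y = make_regression(n_samples=200, n_features=8, noise=1.0, random_state=0)
forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Each fitted tree is available in forest.estimators_.
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest prediction should agree with the mean over trees.
forest_pred = forest.predict(X)
matches = bool(np.allclose(manual_average, forest_pred))
print(matches)
```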

Key advantages for air quality prediction:
- Handles nonlinear relationships without feature scaling.
- Provides feature importance for interpretability.
- Averaging many trees reduces variance, so a properly tuned forest is more robust to overfitting than a single tree.
- Works well with tabular data like our flattened time series.
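
To illustrate the interpretability point, here is a minimal sketch of reading `feature_importances_` from a fitted forest. The data is synthetic: the target is constructed to depend nonlinearly on column 0 only, so that column would be expected to rank first. None of the values come from the DEFRA dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
# Target depends nonlinearly on feature 0; the other columns are pure noise.
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=400)

# No feature scaling is applied before fitting.
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, normalised to sum to 1.
importances = model.feature_importances_
ranking = np.argsort(importances)[::-1]  # most important first

for idx in ranking:
    print(f"feature {idx}: {importances[idx]:.3f}")
```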

---

## DEFRA vs LAQN comparison context

| Metric | LAQN | DEFRA |
| --- | --- | --- |
| Training samples | 9,946 | 11,138 |
| Validation samples | 2,131 | 2,387 |
| Test samples | 2,132 | 2,387 |
| Features (flattened) | 468 | 288 |
| Original features | 39 | 24 |
| Target station | EN5_NO2 | London_Haringey_Priory_Park_South_NO2 |
| Distance between targets | ~3.3 km | - |

This comparison tests whether DEFRA's higher completeness (91.2% vs 87.1%) compensates for fewer stations.
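
The flattened feature counts in the table are consistent with a 12-step input window: 468 / 39 = 12 and 288 / 24 = 12. The window length is inferred from those numbers rather than stated here, and the row-major flattening order is also an assumption. Under those assumptions, the sketch below shows how a 3D array of shape (samples, timesteps, features) becomes the 2D input, and how a flattened column index maps back to a (timestep, feature) pair. The helper `unflatten_index` is hypothetical.

```python
import numpy as np

# DEFRA-like shapes; n_timesteps=12 is inferred from 288 / 24, not confirmed.
n_samples, n_timesteps, n_features = 4, 12, 24

# Dummy values so every cell is unique and traceable.
X_3d = np.arange(n_samples * n_timesteps * n_features, dtype=float).reshape(
    n_samples, n_timesteps, n_features
)

# Row-major flatten: (samples, timesteps, features) -> (samples, timesteps * features).
X_2d = X_3d.reshape(n_samples, -1)


def unflatten_index(col, n_features=n_features):
    """Map a flattened column index to (timestep, feature). Hypothetical helper."""
    return divmod(col, n_features)


print(X_2d.shape)
print(unflatten_index(25))
```

A mapping like this is useful later when feature importances are reported per flattened column and need to be traced back to a pollutant and a lag.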

In [1]:
# required libraries for random forest training

import numpy as np
import pandas as pd
import joblib
import os
from pathlib import Path
import time

# scikit-learn for random forest and evaluation
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# KFold with n_splits=5 for cross-validation ahead of the grid search
from sklearn.model_selection import KFold

# modules for evaluation metrics - scikit-learn
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# gridsearch for hyperparameter tuning
from sklearn.model_selection import GridSearchCV

# visualisation
import matplotlib.pyplot as plt
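
As a sketch of how the imports above might fit together later in the notebook, the example below runs `GridSearchCV` with a 5-fold `KFold` splitter. The data is synthetic, and the parameter grid and scoring choice are assumptions for illustration, not the grid used for the DEFRA model.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (NOT the DEFRA arrays).
X, y = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=42)

# Small illustrative grid; the values are assumptions.
param_grid = {"n_estimators": [20, 40], "max_depth": [5, None]}

# shuffle=False (the default) keeps each fold as a contiguous block of rows.
cv = KFold(n_splits=5, shuffle=False)

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=cv,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

# The scorer is negated so that higher is better; flip the sign to get RMSE.
best_rmse = -search.best_score_
print(search.best_params_, best_rmse)
```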

### File paths

Data is loaded from the `ml_prep` output folder, where all prepared arrays are saved.

**CHANGE FROM LAQN:** Path changed from `data/laqn/ml_prep` to `data/defra/ml_prep`

In [2]:
# Paths setup matching ml_prep output 
base_dir = Path.cwd().parent.parent / "data" / "defra"
ml_prep_dir = base_dir / "ml_prep"

# Output folder for this notebook
rf_output_dir = base_dir / "rf_model"
rf_output_dir.mkdir(parents=True, exist_ok=True)

print(f"Loading data from: {ml_prep_dir}")
print(f"Saving results to: {rf_output_dir}")

Loading data from: /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/ml_prep
Saving results to: /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/rf_model
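
With an output folder in place, saving and reloading a trained model with `joblib` could look like the sketch below. A temporary directory stands in for `rf_output_dir`, the model is fitted on synthetic data, and the file name `rf_model.joblib` is a hypothetical choice for this example.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Tiny synthetic model so the sketch runs on its own.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X[:, 0] + 0.1 * rng.normal(size=100)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# A temporary directory stands in for rf_output_dir.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "rf_model"
    out_dir.mkdir(parents=True, exist_ok=True)

    model_path = out_dir / "rf_model.joblib"  # hypothetical file name
    joblib.dump(model, model_path)

    # Reload and confirm the restored model predicts identically.
    restored = joblib.load(model_path)
    same_predictions = bool(np.array_equal(model.predict(X), restored.predict(X)))

print(same_predictions)
```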
