# ML DEFRA Data Preparation for Air Quality Prediction

This notebook prepares cleaned DEFRA air quality measurements for use in machine learning models.

## What this notebook does

1. Loads cleaned DEFRA data from the yearly measurements folders, which are laid out as follows:

   ```text
   ├── 2023measurements/           # Year folders
   │   ├── London Bloomsbury/      # Station folders
   │   │   ├── NO2__2023_01.csv    # Pollutant files
   │   │   ├── PM10__2023_01.csv
   │   │   └── ...
   │   ├── London Eltham/
   │   └── ...
   ├── 2024measurements/
   └── 2025measurements/
   ```

2. Combines all measurements into a single dataset.
3. Creates temporal features (hour, day, month).
4. Creates sequences for ML training.
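
As a rough illustration of steps 3 and 4, the sketch below derives hour, day and month columns from a timestamp column and then cuts the data into sliding windows. The function names `add_temporal_features` and `make_sequences`, the window length, and the toy values are assumptions made for this example, not the notebook's actual implementation.

```python
import numpy as np
import pandas as pd


def add_temporal_features(df, date_col="date"):
    """Return a copy of df with hour, day and month columns derived from date_col."""
    out = df.copy()
    ts = pd.to_datetime(out[date_col])
    out["hour"] = ts.dt.hour
    out["day"] = ts.dt.day
    out["month"] = ts.dt.month
    return out


def make_sequences(values, window):
    """Split a 2-D array (time, features) into sliding windows.

    X[i] holds rows i .. i + window - 1 and y[i] is the row that follows them.
    """
    values = np.asarray(values, dtype=float)
    n = len(values) - window
    if n <= 0:
        raise ValueError("need more rows than the window length")
    X = np.stack([values[i:i + window] for i in range(n)])
    y = values[window:]
    return X, y


# Toy hourly readings, invented for illustration only.
demo = pd.DataFrame({
    "date": [f"2023-01-01 {h:02d}:00:00" for h in range(6)],
    "NO2": [10.0, 12.0, 11.0, 15.0, 14.0, 13.0],
})
demo = add_temporal_features(demo)
X, y = make_sequences(demo[["NO2", "hour"]].to_numpy(), window=3)
print(X.shape, y.shape)
```

Here each input window covers three consecutive hours and the target is the following hour. A real run would likely use a longer window and would need the series to be gap-free, or the gaps handled, before windowing.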

## Key differences from LAQN

| Aspect | LAQN | DEFRA |
|--------|------|-------|
| File structure | SiteCode_Species_Date.csv | Station/Pollutant__YYYY_MM.csv |
| Date column | @MeasurementDateGMT | date (or Date) |
| Value column | @Value | varies by pollutant name |
| Missing flags | NaN | -99 (maintenance), -1 (invalid) |
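
The DEFRA column of the table can be turned into two small helpers, sketched here under stated assumptions: one reads the station and pollutant from a `Station/Pollutant__YYYY_MM.csv` path, and the other converts the `-99` and `-1` flags to `NaN` so they match the LAQN convention. The names `parse_defra_path`, `flags_to_nan`, `FILE_PATTERN` and `DEFRA_FLAGS` are hypothetical, and the value column is passed in explicitly because it varies by pollutant.

```python
import re
from pathlib import Path

import pandas as pd

# Sentinel codes taken from the table above.
DEFRA_FLAGS = [-99, -1]

# Matches file names such as "NO2__2023_01.csv".
FILE_PATTERN = re.compile(
    r"^(?P<pollutant>.+?)__(?P<year>\d{4})_(?P<month>\d{2})\.csv$"
)


def parse_defra_path(path):
    """Extract station, pollutant, year and month from a DEFRA file path."""
    path = Path(path)
    match = FILE_PATTERN.match(path.name)
    if match is None:
        raise ValueError(f"unexpected file name: {path.name}")
    return {
        "station": path.parent.name,  # station is the enclosing folder
        "pollutant": match["pollutant"],
        "year": int(match["year"]),
        "month": int(match["month"]),
    }


def flags_to_nan(df, value_col):
    """Return a copy of df with DEFRA sentinel codes in value_col set to NaN."""
    out = df.copy()
    col = out[value_col].astype(float)
    out[value_col] = col.mask(col.isin(DEFRA_FLAGS))
    return out


info = parse_defra_path("2023measurements/London Bloomsbury/NO2__2023_01.csv")

# Toy readings, invented for illustration only.
raw = pd.DataFrame({"NO2": [21.5, -99, -1, 18.0]})
clean = flags_to_nan(raw, "NO2")
print(info)
print(clean["NO2"].isna().sum())
```

Only the two flag values listed in the table are replaced; any other negative readings are left untouched, since the source does not say how they should be treated.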

## Output path

Data will be saved to: `data/defra/ml_prep/`

```python
# Standard imports same as LAQN
import pandas as pd
import numpy as np
import os
from pathlib import Path

# Save section
import joblib

# Visualisation
import matplotlib.pyplot as plt

# Preprocessing libraries
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
```

## File Paths

- As usual, all file paths are defined in the cell below to keep the notebook organised.