# ML DEFRA Data Preparation for Air Quality Prediction

This notebook prepares DEFRA data for machine learning models.

## What this notebook does

1. Loads cleaned DEFRA data from the measurements folder.

   ```text
   ├── 2023measurements/           # Year folders
   │   ├── London Bloomsbury/      # Station folders
   │   │   ├── NO2__2023_01.csv   # Pollutant files
   │   │   ├── PM10__2023_01.csv
   │   │   └── ...
   │   ├── London Eltham/
   │   └── ...
   ├── 2024measurements/
   └── 2025measurements/
   ```

2. Combines all measurements into a single dataset.
3. Creates temporal features (hour, day, month).
4. Creates sequences for ML training.
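
Steps 3 and 4 can be sketched as follows. This is a minimal illustration rather than the notebook's actual implementation: the synthetic hourly series, the `timestamp`/`value` column names, the `make_sequences` helper, the 24-hour window and the choice of day of week for the "day" feature are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series standing in for one station and pollutant.
timestamps = pd.date_range("2023-01-01 00:00", periods=48, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"timestamp": timestamps, "value": np.arange(48, dtype=float)})

# Step 3: temporal features derived from the timestamp.
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek  # assumption: "day" means day of week
df["month"] = df["timestamp"].dt.month


def make_sequences(values, window, horizon=1):
    """Build sliding windows: X[i] holds `window` consecutive values,
    y[i] is the value `horizon` steps after the end of that window."""
    values = np.asarray(values, dtype=float)
    n = len(values) - window - horizon + 1
    if n <= 0:
        return np.empty((0, window)), np.empty((0,))
    X = np.stack([values[i:i + window] for i in range(n)])
    start = window + horizon - 1
    y = values[start:start + n]
    return X, y


# Step 4: 24 hours of history are used to predict the next hour.
X, y = make_sequences(df["value"], window=24)
print(X.shape, y.shape)
```

Each row of `X` is one training example and the matching entry of `y` is its target, so the two arrays can be passed straight to a scaler or a train/test split.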

## Key differences from LAQN

| Aspect | LAQN | DEFRA |
|--------|------|-------|
| File structure | SiteCode_Species_Date.csv | Station/Pollutant__YYYY_MM.csv |
| Date column | @MeasurementDateGMT | date (or Date) |
| Value column | @Value | named after the pollutant (e.g. Nitrogen dioxide) |
| Missing flags | NaN | -99 (maintenance), -1 (invalid) |
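
Because DEFRA marks missing readings with sentinel values instead of NaN, the flags usually have to be converted before any scaling or modelling. The sketch below shows one way to do that. The sample frame and the `DEFRA_FLAGS` name are hypothetical, and it assumes the flags appear as numeric values in the pollutant column, which may no longer be the case if the data has already been cleaned.

```python
import numpy as np
import pandas as pd

# Sentinel values from the table above: -99 (maintenance), -1 (invalid).
DEFRA_FLAGS = [-99, -1]

# Hypothetical sample in the DEFRA layout: a date column plus one pollutant column.
raw = pd.DataFrame({
    "date": pd.to_datetime([
        "2023-01-01 00:00", "2023-01-01 01:00",
        "2023-01-01 02:00", "2023-01-01 03:00",
    ]),
    "Nitrogen dioxide": [21.4, -99, 18.0, -1],
})

clean = raw.copy()
# Coerce to numeric first so stray text becomes NaN, then map the flags to NaN.
clean["Nitrogen dioxide"] = (
    pd.to_numeric(clean["Nitrogen dioxide"], errors="coerce")
    .replace(DEFRA_FLAGS, np.nan)
)

print(clean["Nitrogen dioxide"].isna().sum(), "missing values after conversion")
```

After this step both sources use the same missing-value convention, so downstream code does not need a DEFRA-specific branch.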

## Output path

Data will be saved to: `data/defra/ml_prep/`

In [1]:
# Standard imports, same as in the LAQN notebook
import pandas as pd
import numpy as np
import os
from pathlib import Path

# Saving outputs
import joblib

# Visualisation
import matplotlib.pyplot as plt

# Preprocessing libraries
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

## File Paths

- As usual, all paths are defined in the cell below to keep things organised.

In [5]:
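# DEFRA data directory (assumes the notebook runs two levels below the project root)
base_dir = Path.cwd().parent.parent / "data" / "defra"
# Path to this notebook file (note: a file path, not a root directory)
project_root = Path.cwd() / "defra_ml_prep.ipynb"

# DEFRA's optimised measurements path (the folder itself is named "optimased")
optimased_path = base_dir / "optimased"  # Contains 2023measurements, 2024measurements, etc.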

# Output paths
output_path = base_dir / "ml_prep"
output_path.mkdir(parents=True, exist_ok=True)

# Visualisation output path
visualisation_path = output_path / "visualisation"
visualisation_path.mkdir(parents=True, exist_ok=True)

print(f"Base directory: {base_dir}")
print(f"Output path: {output_path}")

Base directory: /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra
Output path: /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/ml_prep


## 1) Load DEFRA data

**CHANGE FROM LAQN:**
- LAQN uses flat monthly folders that hold all files.
- DEFRA uses a nested structure: year > station > pollutant files.

DEFRA file naming: `{POLLUTANT}__{YYYY_MM}.csv`

DEFRA columns typically include:
- `date` or `Date`: the timestamp
- A value column named after the pollutant (e.g., `Nitrogen dioxide`, `PM10 particulate matter`)
- Values of -99 (maintenance) and -1 (invalid data) instead of NaN
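
The naming convention above can be parsed with the standard library alone. The helper below is a hypothetical sketch (the name `parse_defra_filename` is not part of the notebook) that splits a file name into pollutant, year and month, and rejects names that do not follow the `{POLLUTANT}__{YYYY_MM}.csv` pattern.

```python
from pathlib import Path


def parse_defra_filename(path):
    """Split 'POLLUTANT__YYYY_MM.csv' into (pollutant, year, month).

    Raises ValueError when the name does not follow the convention.
    """
    stem = Path(path).stem  # drops only the final '.csv' suffix
    pollutant, sep, period = stem.partition("__")
    if not sep or not pollutant:
        raise ValueError(f"Unexpected file name: {path}")

    year_text, _, month_text = period.partition("_")
    if not (year_text.isdigit() and month_text.isdigit()):
        raise ValueError(f"Unexpected date part in file name: {path}")

    year, month = int(year_text), int(month_text)
    if not 1 <= month <= 12:
        raise ValueError(f"Month out of range in file name: {path}")
    return pollutant, year, month


print(parse_defra_filename("NO2__2023_01.csv"))
```

Validating the date part up front means a misnamed file is reported by name rather than surfacing later as a confusing error or a silently mislabelled month.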

In [None]:
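def load_defra_data(optimased_path):
    """
    Load the optimased DEFRA measurement files into a single DataFrame.

    Parameters
    ----------
    optimased_path : str or Path
        Path to the data/defra/optimased directory.

    Returns
    -------
    pd.DataFrame
        All measurements combined, with LAQN-style column names.
    """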
    optimased_path = Path(optimased_path)
    all_files = []
    file_count = 0
    
    # Get all year measurement folders
    year_folders = sorted([f for f in optimased_path.glob('*measurements') if f.is_dir()])
    
    print(f"Found {len(year_folders)} year folders")
    
    for year_dir in year_folders:
        year = year_dir.name.replace('measurements', '')
        print(f"\nProcessing {year}...")
        
        # Iterate through station folders
        for station_dir in sorted(year_dir.iterdir()):
            if not station_dir.is_dir():
                continue
            
            station_name = station_dir.name
            
            # Process each CSV file in station folder
            for csv_file in station_dir.glob('*.csv'):
                try:
                    # Parse filename: POLLUTANT__YYYY_MM.csv
                    parts = csv_file.stem.split('__')
                    if len(parts) != 2:
                        print(f"  Skipping {csv_file.name}: unexpected format")
                        continue
                    
                    pollutant_name = parts[0]
                    
                    # Read the CSV
                    df = pd.read_csv(csv_file)
                    
                    # DEFRA files name the value column after the pollutant,
                    # so find the date column by name and treat the rest as values.
                    date_col = next((c for c in df.columns if c.lower() == 'date'), None)
                    value_cols = [c for c in df.columns if c.lower() != 'date']
                    # Assumes one value column per file; if there are several, the last is used
                    value_col = value_cols[-1] if value_cols else None
                    if date_col is None or value_col is None:
                        print(f"  Skipping {csv_file.name}: missing date or value column")
                        continue
                    
                    # Standardise column names to match LAQN format
                    df_standard = pd.DataFrame({
                        '@MeasurementDateGMT': df[date_col],
                        '@Value': df[value_col],
                        'SpeciesCode': pollutant_name,  # Will standardise later
                        'SiteCode': station_name.replace(' ', '_'),  # Create site code
                        'SiteName': station_name,
                        'Source': 'DEFRA'
                    })
                    
                    all_files.append(df_standard)
                    file_count += 1
                    
                except Exception as e:
                    print(f"  Error reading {csv_file.name}: {e}")
        
        print(f"  Loaded from {year_dir.name}")
    
    # Combine all dataframes
    if not all_files:
        raise ValueError(f"No CSV files could be loaded from {optimased_path}")
    
    combined_df = pd.concat(all_files, ignore_index=True)
    
    print("\n" + "=" * 40)
    print(f"Total files loaded: {file_count}")
    print(f"Total rows: {len(combined_df):,}")
    print(f"Columns: {list(combined_df.columns)}")
    
    return combined_df