# ML LAQN Data Preparation for Air Quality Prediction

In this notebook I prepare the LAQN data for machine learning models.

## What this notebook does

1. Loads cleaned data from the optimised folder.

   ```bash
   ├── optimased/                              # Optimased validated measurements folder; its files are used for ML.
   │   ├── 2023_jan/                           # Monthly folders.
   │   ├── 2023_feb/
   │   ├── ...
   │   └── 2025_nov/
   │       └── {SiteCode}_{SpeciesCode}_{StartDate}_{EndDate}.csv # file naming pattern of the optimased files
   ```

2. Combines all measurements into a single dataset.
3. Adds temporal features (hour, day, month).

## Output path:

Data will be saved to: `data/ml/`

- As usual, I add my paths under this markdown cell to keep myself organised.

In [12]:
# Start by importing the mandatory and most helpful Python modules.
import pandas as pd
import numpy as np
import os
from pathlib import Path

# preprocessing libraries
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split


- The dataset and file paths are defined below.

In [16]:
# ml prep file path
base_dir = Path.cwd().parent.parent / "data" / "laqn"
project_rooth = Path.cwd() / "ml_prep.ipynb"

# laqn optimased data files path
optimased_path = base_dir / "optimased"



## 1) Load LAQN data:

1. Loads cleaned data from the optimised folder.

   ```bash
   ├── optimased/                              # Optimased validated measurements folder; its files are used for ML.
   │   ├── 2023_jan/                           # Monthly folders.
   │   ├── 2023_feb/
   │   ├── ...
   │   └── 2025_nov/
   │       └── {SiteCode}_{SpeciesCode}_{StartDate}_{EndDate}.csv # file naming pattern of the optimased files
   ```

   Column header of the optimased files:
   `@MeasurementDateGMT,@Value,SpeciesCode,SiteCode,SpeciesName,SiteName,SiteType,Latitude,Longitude`

 - A `load_data` function combines all the files into one dataframe.

   Why does everything need to be in one dataframe?

   - Each file contains hourly measurements for one station, one pollutant and one month.
   - Time continuity: machine learning relies on identifying patterns over time. If January and February sit in separate files, the model cannot learn that pollution on 31 January affects 1 February.
   - Train/test split: the 70% training, 15% validation and 15% test split has to be made over the complete dataset.
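
The 70/15/15 split mentioned above can be sketched as follows. This is only a minimal illustration, assuming the split is made in time order so that the test period comes after the training period. The helper `chronological_split` and the toy dataframe are hypothetical and are not part of this notebook.

```python
import pandas as pd


def chronological_split(df, datetime_col, train_frac=0.70, val_frac=0.15):
    """Split a dataframe into train/validation/test parts by time order (hypothetical helper)."""
    ordered = df.sort_values(datetime_col).reset_index(drop=True)
    n = len(ordered)
    train_end = int(round(n * train_frac))
    val_end = int(round(n * (train_frac + val_frac)))
    train = ordered.iloc[:train_end]
    val = ordered.iloc[train_end:val_end]
    test = ordered.iloc[val_end:]
    return train, val, test


# Toy hourly series standing in for the combined LAQN dataframe (assumption)
toy = pd.DataFrame({
    "@MeasurementDateGMT": pd.date_range("2023-01-01", periods=100, freq="h"),
    "@Value": range(100),
})

train, val, test = chronological_split(toy, "@MeasurementDateGMT")
print(len(train), len(val), len(test))
```

A random split with `train_test_split` would mix future and past hours, which is why a time-ordered split is shown here. Whether that is the right choice depends on the model used later.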




In [17]:
def load_data(optimased_path):
    """
    Load all optimased LAQN CSV files and combine them into one dataframe.

    param:
        optimased_path: path to the data/laqn/optimased directory.
    return:
        pandas.DataFrame with the rows of every CSV file that could be read.
    """
    optimased_path = Path(optimased_path)
    all_files = []
    file_count = 0

    # Get all monthly folders, sorted by name (alphabetical, not chronological)
    monthly_folders = sorted([f for f in optimased_path.iterdir() if f.is_dir()])
    
    print(f"Found {len(monthly_folders)} monthly folders")
    
    # Iterate through each monthly folder
    for folder in monthly_folders:
        # Get all CSV files in this folder
        csv_files = list(folder.glob("*.csv"))
        
        for csv_file in csv_files:
            try:
                df = pd.read_csv(csv_file)
                all_files.append(df)
                file_count += 1
            except Exception as e:
                print(f"Error reading {csv_file.name}: {e}")
        
        # Progress update (counts files found, including any that failed to read)
        print(f"  Loaded {folder.name}: {len(csv_files)} files")
    
    # Combine all dataframes
    if not all_files:
        raise ValueError(f"No CSV files found in {optimased_path}")
    
    combined_df = pd.concat(all_files, ignore_index=True)
    
    print("\n" + "="*40)
    print(f"Total files loaded: {file_count}")
    print(f"Total rows: {len(combined_df):,}")
    print(f"Columns: {list(combined_df.columns)}")
    
    return combined_df

In [18]:
# load all data
df_raw = load_data(optimased_path)

# preview data
df_raw.head(10)

Found 36 monthly folders
  Loaded 2023_apr: 141 files
  Loaded 2023_aug: 141 files
  Loaded 2023_dec: 141 files
  Loaded 2023_feb: 141 files
  Loaded 2023_jan: 141 files
  Loaded 2023_jul: 141 files
  Loaded 2023_jun: 141 files
  Loaded 2023_mar: 141 files
  Loaded 2023_may: 141 files
  Loaded 2023_nov: 138 files
  Loaded 2023_oct: 141 files
  Loaded 2023_sep: 141 files
  Loaded 2024_apr: 141 files
  Loaded 2024_aug: 141 files
  Loaded 2024_dec: 141 files
  Loaded 2024_feb: 141 files
  Loaded 2024_jan: 141 files
  Loaded 2024_jul: 141 files
  Loaded 2024_jun: 141 files
  Loaded 2024_mar: 141 files
  Loaded 2024_may: 141 files
  Loaded 2024_nov: 141 files
  Loaded 2024_oct: 141 files
  Loaded 2024_sep: 141 files
  Loaded 2025_apr: 141 files
  Loaded 2025_aug: 141 files
  Loaded 2025_feb: 141 files
  Loaded 2025_jan: 141 files
  Loaded 2025_jul: 141 files
  Loaded 2025_jun: 141 files
  Loaded 2025_mar: 141 files
  Loaded 2025_may: 141 files
  Loaded 2025_nov: 141 files
  Loaded 2025_oct:

|     | @MeasurementDateGMT | @Value | SpeciesCode | SiteCode | SpeciesName      | SiteName               | SiteType | Latitude | Longitude |
| --- | ------------------- | ------ | ----------- | -------- | ---------------- | ---------------------- | -------- | -------- | --------- |
| 0   | 2023-04-01 00:00:00 | 5.1    | PM10        | GB6      | PM10 Particulate | Greenwich - Falconwood | Roadside | 51.4563  | 0.085606  |
| 1   | 2023-04-01 01:00:00 | 4.4    | PM10        | GB6      | PM10 Particulate | Greenwich - Falconwood | Roadside | 51.4563  | 0.085606  |
| 2   | 2023-04-01 02:00:00 | 3.5    | PM10        | GB6      | PM10 Particulate | Greenwich - Falconwood | Roadside | 51.4563  | 0.085606  |
| 3   | 2023-04-01 03:00:00 | 5.3    | PM10        | GB6      | PM10 Particulate | Greenwich - Falconwood | Roadside | 51.4563  | 0.085606  |
| 4   | 2023-04-01 04:00:00 | 3.9    | PM10        | GB6      | PM10 Particulate | Greenwich - Falconwood | Roadside | 51.4563  | 0.085606  |
| 5   | 2023-04-01 05:00:00 | 4.3    | PM10        | GB6      | PM10 Particulate | Greenwich - Falconwood | Roadside | 51.4563  | 0.085606  |
| 6   | 2023-04-01 06:00:00 | 4.2    | PM10        | GB6      | PM10 Particulate | Greenwich - Falconwood | Roadside | 51.4563  | 0.085606  |
| 7   | 2023-04-01 07:00:00 | 5.5    | PM10        | GB6      | PM10 Particulate | Greenwich - Falconwood | Roadside | 51.4563  | 0.085606  |
| 8   | 2023-04-01 08:00:00 | 8.0    | PM10        | GB6      | PM10 Particulate | Greenwich - Falconwood | Roadside | 51.4563  | 0.085606  |
| 9   | 2023-04-01 09:00:00 | 9.4    | PM10        | GB6      | PM10 Particulate | Greenwich - Falconwood | Roadside | 51.4563  | 0.085606  |
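
The load log above lists the folders alphabetically (`2023_apr` comes before `2023_jan`) because `sorted` compares the folder names as strings. If a chronological order is wanted, one option is a sort key built from the year and month parts of the name. The sketch below assumes every folder follows the `{year}_{mon}` pattern shown in the tree. `MONTH_ORDER` and `month_folder_key` are hypothetical names, not part of this notebook.

```python
# Hypothetical lookup: three-letter month abbreviation -> month number
MONTH_ORDER = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}


def month_folder_key(folder_name):
    """Turn a name such as '2023_jan' into a (year, month) tuple for sorting."""
    year, month = folder_name.split("_")
    return int(year), MONTH_ORDER[month.lower()]


names = ["2023_apr", "2023_aug", "2023_feb", "2023_jan", "2024_jan"]
ordered = sorted(names, key=month_folder_key)
print(ordered)
# ['2023_jan', '2023_feb', '2023_apr', '2023_aug', '2024_jan']
```

With `Path` objects the same key can be applied as `key=lambda f: month_folder_key(f.name)`. Sorting the combined dataframe by `@MeasurementDateGMT` after loading is an alternative that does not depend on folder names.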



In [19]:
# Check data info
print("Data shape:", df_raw.shape)
print("\nColumn types:")
print(df_raw.dtypes)
print("\nMissing values:")
print(df_raw.isnull().sum())

Data shape: (3446208, 9)

Column types:
@MeasurementDateGMT     object
@Value                 float64
SpeciesCode             object
SiteCode                object
SpeciesName             object
SiteName                object
SiteType                object
Latitude               float64
Longitude              float64
dtype: object

Missing values:
@MeasurementDateGMT         0
@Value                 464791
SpeciesCode                 0
SiteCode                    0
SpeciesName                 0
SiteName                    0
SiteType                    0
Latitude                    0
Longitude                   0
dtype: int64
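
The output above shows 464,791 missing `@Value` entries out of 3,446,208 rows, which is about 13.5% of the data. To see how the gaps are spread across pollutants, the missing share can be computed per species. This is a minimal sketch on a toy dataframe with made-up values. On the real data the same expression would use `df_raw`.

```python
import numpy as np
import pandas as pd

# Toy data standing in for df_raw (values are invented for illustration)
toy = pd.DataFrame({
    "SpeciesCode": ["NO2", "NO2", "NO2", "NO2", "PM10", "PM10"],
    "@Value": [10.0, np.nan, 12.0, 11.0, np.nan, np.nan],
})

# Fraction of missing measurements for each pollutant
missing_share = toy["@Value"].isna().groupby(toy["SpeciesCode"]).mean()
print(missing_share)
```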



## 2) Data exploration:
I have already checked the data several times, but it is worth repeating the key facts here:

- How many unique sites? 64
- Which pollutants (species)? 6 pollutants
- Date range? 01.01.2023 to 18.11.2025 (last hourly reading at 23:00)

In [20]:
# Define column names based on the optimased data structure
date_col = '@MeasurementDateGMT'
value_col = '@Value'
site_col = 'SiteCode'
species_col = 'SpeciesCode'

# Convert datetime
df_raw[date_col] = pd.to_datetime(df_raw[date_col])

# Print the summary statistics.
print(f"Unique sites: {df_raw[site_col].nunique()}")
print(f"Unique species: {df_raw[species_col].nunique()}")
print(f"\nDate range: {df_raw[date_col].min()} to {df_raw[date_col].max()}")
print(f"\nSpecies in data:")
print(df_raw[species_col].value_counts())

Unique sites: 64
Unique species: 6

Date range: 2023-01-01 00:00:00 to 2025-11-18 23:00:00

Species in data:
SpeciesCode
NO2      1417752
PM10     1026456
PM2.5     586944
O3        268320
SO2        97824
CO         48912
Name: count, dtype: int64



## 3) Selecting target pollutants:
As noted in `/docs/LAQN_DEFRA_benchmark.md` and `/docs/LAQN_data_quality.md`, these pollutants have the highest site coverage:
- NO2: 60 sites
- PM2.5: 53 sites
- PM10: 43 sites
- O3: 11 sites
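
Before filtering, it helps to confirm that every requested species code really appears in the data, because `isin` silently ignores codes that do not match. The species counts printed earlier list fine particulates as `PM2.5`, so a code written as `PM25` would match nothing. The sketch below uses a toy dataframe. The site code `BX1` and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Toy data standing in for df_raw ("BX1" is a made-up site code)
toy = pd.DataFrame({
    "SpeciesCode": ["NO2", "NO2", "PM2.5", "PM10", "O3"],
    "SiteCode": ["GB6", "BX1", "GB6", "GB6", "BX1"],
})

requested = ["NO2", "PM25", "PM10", "O3"]

# Codes that were requested but do not exist in the data
available = set(toy["SpeciesCode"].unique())
not_found = [code for code in requested if code not in available]
print(not_found)
# ['PM25']

# Number of distinct sites measuring each pollutant
sites_per_species = toy.groupby("SpeciesCode")["SiteCode"].nunique()
print(sites_per_species)
```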




In [21]:
# target pollutants (the species code in the data is 'PM2.5', not 'PM25')
target_pollutants = ['NO2', 'PM2.5', 'PM10', 'O3']

# Filter data
df_filtered = df_raw[df_raw[species_col].isin(target_pollutants)].copy()

print(f"Rows before filtering: {len(df_raw):,}")
print(f"Rows after filtering: {len(df_filtered):,}")
print(f"\nPollutants included:")
print(df_filtered[species_col].value_counts())

Rows before filtering: 3,446,208
Rows after filtering: 3,299,472

Pollutants included:
SpeciesCode
NO2      1417752
PM10     1026456
PM2.5     586944
O3        268320
Name: count, dtype: int64


## 4) Adding temporal features:

I already added these features in the analysis part.

- Hour of the day: shows how pollutant concentrations move with the time of day.
- Day of the week: captures the effect of traffic.
- Month of the year: captures seasonal patterns.

These features help the model learn when pollution is typically high or low.
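
One optional refinement, which this notebook does not use, is a cyclical encoding of the hour. With the plain integer, hour 23 and hour 0 look far apart even though they are neighbours on the clock. A sine/cosine pair places them next to each other. The helper `add_cyclical_hour` below is a hypothetical sketch, assuming an integer `hour` column like the one created in this section.

```python
import numpy as np
import pandas as pd


def add_cyclical_hour(df, hour_col="hour"):
    """Add sine/cosine encodings of the hour so that 23:00 and 00:00 are close (hypothetical helper)."""
    out = df.copy()
    angle = 2 * np.pi * out[hour_col] / 24
    out["hour_sin"] = np.sin(angle)
    out["hour_cos"] = np.cos(angle)
    return out


toy = pd.DataFrame({"hour": [0, 6, 12, 23]})
encoded = add_cyclical_hour(toy)
print(encoded.round(3))
```

Whether this helps depends on the model: tree-based models can usually cope with the raw integer, while linear models and neural networks tend to benefit more from the cyclical form.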

In [22]:
def temporal_features(df, datetime_col):
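    """
    Add temporal features derived from a datetime column.

    Param:
        df : pandas.DataFrame
            Input data containing the datetime column.
        datetime_col : str
            Name of the datetime column.

    Returns a copy of df with hour, day_of_week, month and is_weekend columns.
    """
    df = df.copy()
    
    # Ensure the column has datetime type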
    df[datetime_col] = pd.to_datetime(df[datetime_col])
    
    # Extract features
    df['hour'] = df[datetime_col].dt.hour
    df['day_of_week'] = df[datetime_col].dt.dayofweek  # 0=Monday, 6=Sunday
    df['month'] = df[datetime_col].dt.month
    df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
    
    print("Added temporal features: hour, day_of_week, month, is_weekend")
    
    return df

In [24]:
# Add temporal features
df_temporal = temporal_features(df_filtered, date_col)

# Preview
df_temporal[['hour', 'day_of_week', 'month', 'is_weekend']].head()

Added temporal features: hour, day_of_week, month, is_weekend


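|     | hour | day_of_week | month | is_weekend |
| --- | ---- | ----------- | ----- | ---------- |
| 0   | 0    | 5           | 4     | 1          |
| 1   | 1    | 5           | 4     | 1          |
| 2   | 2    | 5           | 4     | 1          |
| 3   | 3    | 5           | 4     | 1          |
| 4   | 4    | 5           | 4     | 1          |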


| Column      | Value         | Meaning                       |
| ----------- | ------------- | ----------------------------- |
| hour        | 0, 1, 2, 3, 4 | Midnight, 1am, 2am, 3am, 4am  |
| day_of_week | 5             | Saturday (0=Monday, 6=Sunday) |
| month       | 4             | April                         |
| is_weekend  | 1             | Yes, it is a weekend          |
