# ML LAQN Data Preparation for Air Quality Prediction

In this notebook I prepare LAQN data for machine learning models.

## What this notebook does

1. Loads cleaned data from the optimised folder.

   ```bash
   ├── optimased/                              # Optimised, validated measurements; the files in this folder are used for ML.
   │   ├── 2023_jan/                           # Monthly folders.
   │   ├── 2023_feb/
   │   ├── ...
   │   └── 2025_nov/
   │       └── {SiteCode}_{SpeciesCode}_{StartDate}_{EndDate}.csv  # File naming pattern in the optimased folder.
   ```

2. Combines all measurements into a single dataset.
3. Adds temporal features (hour, day, month).

## Output path:

Data will be saved to `data/ml/`.

- As usual, I add my paths under this markdown cell to keep myself organised.

In [12]:
# Start by importing the required and most helpful Python modules.
import pandas as pd
import numpy as np
import os
from pathlib import Path

# preprocessing libraries
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split


- The dataset and file paths are defined below.

In [16]:
# ml prep file path
base_dir = Path.cwd().parent.parent / "data" / "laqn"
project_rooth = Path.cwd() / "ml_prep.ipynb"

# laqn optimased data files path
optimased_path = base_dir / "optimased"



## 1) Load LAQN data:

1. Loads cleaned data from the optimised folder.

   ```bash
   ├── optimased/                              # Optimised, validated measurements; the files in this folder are used for ML.
   │   ├── 2023_jan/                           # Monthly folders.
   │   ├── 2023_feb/
   │   ├── ...
   │   └── 2025_nov/
   │       └── {SiteCode}_{SpeciesCode}_{StartDate}_{EndDate}.csv  # File naming pattern in the optimased folder.
   ```

   Column headers of the optimased files:
   `@MeasurementDateGMT,@Value,SpeciesCode,SiteCode,SpeciesName,SiteName,SiteType,Latitude,Longitude`

 - The `load_data` function combines all the files into one DataFrame.

   Why does the data need to be in one DataFrame?

   - Each file contains hourly measurements for one station, one pollutant and one month.
   - Time continuity: machine learning requires identifying patterns over time. If January and February are in separate files, the model can't learn that pollution on 31 January affects 1 February.
   - Train/test split: a single dataset can be split into 70% for training, 15% for validation and 15% for testing.
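
The 70/15/15 split mentioned above can be sketched as follows. This is a minimal sketch, assuming the split is chronological (no shuffling) so that the time continuity is preserved; the helper name `chronological_split` and the toy data are hypothetical and not part of the notebook.

```python
import pandas as pd

def chronological_split(df, train_frac=0.70, val_frac=0.15):
    """Hypothetical helper: split a time-indexed DataFrame without shuffling."""
    df = df.sort_index()
    n = len(df)
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    return df.iloc[:train_end], df.iloc[train_end:val_end], df.iloc[val_end:]

# Toy hourly series standing in for the combined LAQN data
idx = pd.date_range("2023-01-01", periods=100, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"value": range(100)}, index=idx)

train, val, test = chronological_split(toy)
print(len(train), len(val), len(test))
```

Keeping the rows in time order means the validation and test sets always lie after the training period.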




In [17]:
def load_data(optimased_path):
    """
    Load the optimased data files from the LAQN dataset into one DataFrame.

    Param:
        optimased_path: path to the data/laqn/optimased directory.

    Returns:
        pandas.DataFrame with the rows of every CSV file combined.
    """
    optimased_path = Path(optimased_path)
    all_files = []
    file_count = 0

    # Get all monthly folders, sorted by folder name (alphabetical, so not chronological)
    monthly_folders = sorted([f for f in optimased_path.iterdir() if f.is_dir()])
    
    print(f"Found {len(monthly_folders)} monthly folders")
    
    # Iterate through each monthly folder
    for folder in monthly_folders:
        # Get all CSV files in this folder
        csv_files = list(folder.glob("*.csv"))
        
        for csv_file in csv_files:
            try:
                df = pd.read_csv(csv_file)
                all_files.append(df)
                file_count += 1
            except Exception as e:
                print(f"Error reading {csv_file.name}: {e}")
        
        # Progress update
        print(f"  Loaded {folder.name}: {len(csv_files)} files")
    
    # Combine all dataframes
    if not all_files:
        raise ValueError(f"No CSV files found in {optimased_path}")
    
    combined_df = pd.concat(all_files, ignore_index=True)
    
    print("\n" + "=" * 40)
    print(f"Total files loaded: {file_count}")
    print(f"Total rows: {len(combined_df):,}")
    print(f"Columns: {list(combined_df.columns)}")
    
    return combined_df

In [18]:
# load all data
df_raw = load_data(optimased_path)

# preview data
df_raw.head(10)

Found 36 monthly folders
  Loaded 2023_apr: 141 files
  Loaded 2023_aug: 141 files
  Loaded 2023_dec: 141 files
  Loaded 2023_feb: 141 files
  Loaded 2023_jan: 141 files
  Loaded 2023_jul: 141 files
  Loaded 2023_jun: 141 files
  Loaded 2023_mar: 141 files
  Loaded 2023_may: 141 files
  Loaded 2023_nov: 138 files
  Loaded 2023_oct: 141 files
  Loaded 2023_sep: 141 files
  Loaded 2024_apr: 141 files
  Loaded 2024_aug: 141 files
  Loaded 2024_dec: 141 files
  Loaded 2024_feb: 141 files
  Loaded 2024_jan: 141 files
  Loaded 2024_jul: 141 files
  Loaded 2024_jun: 141 files
  Loaded 2024_mar: 141 files
  Loaded 2024_may: 141 files
  Loaded 2024_nov: 141 files
  Loaded 2024_oct: 141 files
  Loaded 2024_sep: 141 files
  Loaded 2025_apr: 141 files
  Loaded 2025_aug: 141 files
  Loaded 2025_feb: 141 files
  Loaded 2025_jan: 141 files
  Loaded 2025_jul: 141 files
  Loaded 2025_jun: 141 files
  Loaded 2025_mar: 141 files
  Loaded 2025_may: 141 files
  Loaded 2025_nov: 141 files
  Loaded 2025_oct:

Unnamed: 0,@MeasurementDateGMT,@Value,SpeciesCode,SiteCode,SpeciesName,SiteName,SiteType,Latitude,Longitude
0,2023-04-01 00:00:00,5.1,PM10,GB6,PM10 Particulate,Greenwich - Falconwood,Roadside,51.4563,0.085606
1,2023-04-01 01:00:00,4.4,PM10,GB6,PM10 Particulate,Greenwich - Falconwood,Roadside,51.4563,0.085606
2,2023-04-01 02:00:00,3.5,PM10,GB6,PM10 Particulate,Greenwich - Falconwood,Roadside,51.4563,0.085606
3,2023-04-01 03:00:00,5.3,PM10,GB6,PM10 Particulate,Greenwich - Falconwood,Roadside,51.4563,0.085606
4,2023-04-01 04:00:00,3.9,PM10,GB6,PM10 Particulate,Greenwich - Falconwood,Roadside,51.4563,0.085606
5,2023-04-01 05:00:00,4.3,PM10,GB6,PM10 Particulate,Greenwich - Falconwood,Roadside,51.4563,0.085606
6,2023-04-01 06:00:00,4.2,PM10,GB6,PM10 Particulate,Greenwich - Falconwood,Roadside,51.4563,0.085606
7,2023-04-01 07:00:00,5.5,PM10,GB6,PM10 Particulate,Greenwich - Falconwood,Roadside,51.4563,0.085606
8,2023-04-01 08:00:00,8.0,PM10,GB6,PM10 Particulate,Greenwich - Falconwood,Roadside,51.4563,0.085606
9,2023-04-01 09:00:00,9.4,PM10,GB6,PM10 Particulate,Greenwich - Falconwood,Roadside,51.4563,0.085606



In [19]:
# Check data info
print("Data shape:", df_raw.shape)
print("\nColumn types:")
print(df_raw.dtypes)
print("\nMissing values:")
print(df_raw.isnull().sum())

Data shape: (3446208, 9)

Column types:
@MeasurementDateGMT     object
@Value                 float64
SpeciesCode             object
SiteCode                object
SpeciesName             object
SiteName                object
SiteType                object
Latitude               float64
Longitude              float64
dtype: object

Missing values:
@MeasurementDateGMT         0
@Value                 464791
SpeciesCode                 0
SiteCode                    0
SpeciesName                 0
SiteName                    0
SiteType                    0
Latitude                    0
Longitude                   0
dtype: int64



## 2) Data exploration:
I have already checked the data many times, but I think it is beneficial to summarise it here again:

- How many unique sites? 64
- Which pollutants (species)? 6 pollutants
- Date range? 01.01.2023 to 18.11.2025

In [20]:
# Define column names based on the optimased data structure
date_col = '@MeasurementDateGMT'
value_col = '@Value'
site_col = 'SiteCode'
species_col = 'SpeciesCode'

# Convert datetime
df_raw[date_col] = pd.to_datetime(df_raw[date_col])

# Print summary statistics.
print(f"Unique sites: {df_raw[site_col].nunique()}")
print(f"Unique species: {df_raw[species_col].nunique()}")
print(f"\nDate range: {df_raw[date_col].min()} to {df_raw[date_col].max()}")
print(f"\nSpecies in data:")
print(df_raw[species_col].value_counts())

Unique sites: 64
Unique species: 6

Date range: 2023-01-01 00:00:00 to 2025-11-18 23:00:00

Species in data:
SpeciesCode
NO2      1417752
PM10     1026456
PM2.5     586944
O3        268320
SO2        97824
CO         48912
Name: count, dtype: int64



## 3) Selecting target pollutants:
As I mentioned in `/docs/LAQN_DEFRA_benchmark.md` and `/docs/LAQN_data_quality.md`, these pollutants have the highest coverage:
- NO2: 60 sites
- PM25: 53 sites
- PM10: 43 sites
- O3: 11 sites
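
Before filtering, it may be worth checking that every target code actually occurs in the data, since the value counts in section 2 list the species code as `PM2.5` rather than `PM25`. The following is a minimal sketch of such a check; the helper name `missing_codes` is hypothetical.

```python
import pandas as pd

def missing_codes(requested, available):
    """Hypothetical helper: return requested species codes not found in the data."""
    available = set(available)
    return [code for code in requested if code not in available]

# Species codes as they appear in the value counts printed in section 2
codes_in_data = pd.Series(["NO2", "PM10", "PM2.5", "O3", "SO2", "CO"])
requested = ["NO2", "PM25", "PM10", "O3"]

unmatched = missing_codes(requested, codes_in_data.unique())
print(unmatched)  # ['PM25']
```

Any code returned here would be dropped silently by an `isin` filter.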




In [21]:
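# target pollutants
# Note: the data uses the species code 'PM2.5' (see the value counts in section 2),
# so 'PM25' matches no rows and PM2.5 is missing from the filtered output below.
target_pollutants = ['NO2', 'PM25', 'PM10', 'O3']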

# Filter data
df_filtered = df_raw[df_raw[species_col].isin(target_pollutants)].copy()

print(f"Rows before filtering: {len(df_raw):,}")
print(f"Rows after filtering: {len(df_filtered):,}")
print(f"\nPollutants included:")
print(df_filtered[species_col].value_counts())

Rows before filtering: 3,446,208
Rows after filtering: 2,712,528

Pollutants included:
SpeciesCode
NO2     1417752
PM10    1026456
O3       268320
Name: count, dtype: int64


## 4) Adding temporal features:

I already added these features in the analysis part.

- Hour of the day: helps to see how pollutant concentrations move with the time of day.
- Day of the week: captures the traffic effect.
- Season/month of the year.

These features help the model learn when pollution is typically high or low.
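
As an illustration of why the hour is useful, the sketch below builds a toy long-format table and averages the value per hour of day. The data is made up (the value simply equals the hour); only the column names `@MeasurementDateGMT` and `@Value` come from the optimased files.

```python
import pandas as pd

# Toy data: two days of hourly readings for one hypothetical site
idx = pd.date_range("2023-04-01", periods=48, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({
    "@MeasurementDateGMT": idx,
    "@Value": [float(h % 24) for h in range(48)],
})

toy["hour"] = toy["@MeasurementDateGMT"].dt.hour
toy["is_weekend"] = (toy["@MeasurementDateGMT"].dt.dayofweek >= 5).astype(int)

# Mean value per hour of day: the daily profile the model can pick up on
hourly_profile = toy.groupby("hour")["@Value"].mean()
print(hourly_profile.head())
```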

In [22]:
def temporal_features(df, datetime_col):
    """
    Add temporal features derived from the datetime column.

    Params:
        df : pandas.DataFrame
        datetime_col : str, name of the datetime column
    """
    df = df.copy()
    
    #  datetime type needs to be ensured
    df[datetime_col] = pd.to_datetime(df[datetime_col])
    
    # Extract features
    df['hour'] = df[datetime_col].dt.hour
    df['day_of_week'] = df[datetime_col].dt.dayofweek  # 0=Monday, 6=Sunday
    df['month'] = df[datetime_col].dt.month
    df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
    
    print("Added temporal features: hour, day_of_week, month, is_weekend")
    
    return df

In [24]:
# Add temporal features
df_temporal = temporal_features(df_filtered, date_col)

# Preview
df_temporal[['hour', 'day_of_week', 'month', 'is_weekend']].head()

Added temporal features: hour, day_of_week, month, is_weekend


Unnamed: 0,hour,day_of_week,month,is_weekend
0,0,5,4,1
1,1,5,4,1
2,2,5,4,1
3,3,5,4,1
4,4,5,4,1


| Column      | Value         | Meaning                       |
| ----------- | ------------- | ----------------------------- |
| hour        | 0, 1, 2, 3, 4 | Midnight, 1am, 2am, 3am, 4am  |
| day_of_week | 5             | Saturday (0=Monday, 6=Sunday) |
| month       | 4             | April                         |
| is_weekend  | 1             | Yes, it is a weekend          |


## 5) Wide formatting

For the ML input format I follow 'Air quality prediction using CNN+LSTM-based hybrid deep learning architecture', *Environmental Science and Pollution Research*, 29(8), pp. 11920-11938.

According to their research, these are the approaches they compare:

| Method      | Input                                    | Output           |
| ----------- | ---------------------------------------- | ---------------- |
| UNI/UNI     | Historical info of target pollutant only | Single pollutant |
| MULTI/UNI   | Historical info of all pollutants        | Single pollutant |
| MULTI/MULTI | Historical info of all pollutants        | All pollutants   |

page 11922 (Results section):

> "The multivariate model without using meteorological data revealed the best results."

So, in order to create a multivariate input, I pivot the data to wide format, so that each station/species combination becomes a column of the table.
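
The long-to-wide step can be illustrated on a tiny table. This is a minimal sketch with toy rows; only the column names follow the optimased files.

```python
import pandas as pd

# Toy long-format rows: two sites, two timestamps, one missing value
long_df = pd.DataFrame({
    "@MeasurementDateGMT": pd.to_datetime([
        "2023-01-01 00:00", "2023-01-01 00:00",
        "2023-01-01 01:00", "2023-01-01 01:00",
    ]),
    "SiteCode": ["BG1", "BG2", "BG1", "BG2"],
    "SpeciesCode": ["NO2", "PM10", "NO2", "PM10"],
    "@Value": [7.6, 16.5, 4.4, None],
})

# One column per site-species pair, one row per timestamp
long_df["site_species"] = long_df["SiteCode"] + "_" + long_df["SpeciesCode"]
wide = long_df.pivot_table(
    index="@MeasurementDateGMT",
    columns="site_species",
    values="@Value",
    aggfunc="mean",
)
print(wide)
```

A missing measurement in the long table shows up as `NaN` in the corresponding cell of the wide table.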




In [25]:
def wide_format(df, datetime_col, site_col, species_col, value_col):
    """
    Pivot data from long to wide format. 
    Each site-species combination becomes a column.
    Each row represents one timestamp.
    
    Params:
        datetime_col, site_col, species_col, value_col 
    """
    df = df.copy()
    
    # Create site_species identifier
    df['site_species'] = df[site_col] + '_' + df[species_col]
    
    # Pivot table
    # If duplicate datetime-site_species combinations exist, take mean
    pivoted = df.pivot_table(
        index=datetime_col,
        columns='site_species',
        values=value_col,
        aggfunc='mean'
    )
    
    # Sort by datetime
    pivoted = pivoted.sort_index()
    
    print(f"Created wide format:")
    print(f"Timestamps: {len(pivoted):,}")
    print(f"Features (site-species): {len(pivoted.columns)}")
    print(f"Date range: {pivoted.index.min()} to {pivoted.index.max()}")
    
    return pivoted

In [26]:
# Create wide format
df_wide = wide_format(df_temporal, date_col, site_col, species_col, value_col)

# Preview
print("\nFirst 10 columns:")
print(list(df_wide.columns)[:10])
df_wide.head()

Created wide format:
Timestamps: 24,456
Features (site-species): 111
Date range: 2023-01-01 00:00:00 to 2025-11-18 23:00:00

First 10 columns:
['BG1_NO2', 'BG2_NO2', 'BG2_PM10', 'BQ7_NO2', 'BQ7_O3', 'BQ7_PM10', 'BQ9_PM10', 'BT4_NO2', 'BT4_PM10', 'BT5_NO2']


@MeasurementDateGMT,BG1_NO2,BG2_NO2,BG2_PM10,BQ7_NO2,BQ7_O3,BQ7_PM10,BQ9_PM10,BT4_NO2,BT4_PM10,BT5_NO2,...,WA9_PM10,WAA_NO2,WAA_PM10,WAB_NO2,WAB_PM10,WAC_PM10,WM5_NO2,WM6_NO2,WM6_PM10,WMD_NO2
2023-01-01 00:00:00,7.6,4.6,16.5,,74.2,10.2,8.1,11.4,39.0,8.1,...,9.0,9.0,12.7,,24.0,19.9,5.7,17.9,14.0,8.8
2023-01-01 01:00:00,4.4,4.4,,,74.4,6.8,7.1,14.5,15.0,7.1,...,3.0,7.0,6.9,,7.0,8.5,6.2,15.1,17.0,9.1
2023-01-01 02:00:00,4.2,4.1,5.7,,76.7,9.3,9.7,14.0,18.0,5.3,...,11.0,6.0,11.0,,13.0,12.1,9.1,16.0,16.0,8.4
2023-01-01 03:00:00,4.5,3.7,10.2,,76.4,11.8,13.1,12.6,15.0,3.7,...,11.0,5.0,12.5,,16.0,14.7,6.2,19.7,24.0,3.2
2023-01-01 04:00:00,2.7,2.9,13.8,,77.1,12.7,14.5,7.9,17.0,6.1,...,13.0,5.0,15.6,,19.0,18.4,5.3,16.5,20.0,4.0



5 rows × 111 columns

## 6) NaN value handling

- Find the optimal strategies.
- Maybe interpolate small gaps of 2 to 5 hours, etc.
- Or, as usual, drop the remaining rows with NaN.

In [27]:
# Check missing values before handling
missing_pct = (df_wide.isnull().sum() / len(df_wide) * 100).sort_values(ascending=False)

print("Missing value percentage by column (top 20):")
print(missing_pct.head(20))

print(f"\nTotal cells: {df_wide.size:,}")
print(f"Missing cells: {df_wide.isnull().sum().sum():,}")
print(f"Missing percentage: {df_wide.isnull().sum().sum() / df_wide.size * 100:.2f}%")

Missing value percentage by column (top 20):
site_species
WM6_PM10    62.798495
CE3_NO2     46.589794
TL4_NO2     40.407262
RI2_O3      39.818449
WA7_NO2     37.765783
CD1_PM10    31.534184
WAA_NO2     31.505561
CE3_PM10    31.288845
TH4_PM10    30.528296
HG4_O3      28.999019
CD1_NO2     28.557409
MY1_O3      27.171246
SK5_PM10    26.300294
BG1_NO2     26.226693
BG2_NO2     24.873242
GB6_O3      24.345764
CE2_O3      24.124959
TH4_O3      23.994112
RI2_NO2     23.769218
TH4_NO2     22.894177
dtype: float64

Total cells: 2,714,616
Missing cells: 330,540
Missing percentage: 12.18%



- `max_gap=5` is used: linear interpolation is applied to fill in missing values.

'Air quality prediction using CNN+LSTM-based hybrid deep learning architecture', *Environmental Science and Pollution Research*, 29(8), pp. 11920-11938.
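
One detail worth knowing about pandas `interpolate`: the `limit` argument caps how many consecutive NaN values are filled, but longer gaps are still filled partially rather than skipped. The sketch below shows this on a toy series, next to a hypothetical helper `interpolate_short_gaps` that fills only gaps no longer than `max_gap`. The helper is an assumption for illustration, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy series: one gap of 2 NaN and one gap of 6 NaN
s = pd.Series([1.0, np.nan, np.nan, 4.0] + [np.nan] * 6 + [11.0])

# `limit` caps how many consecutive NaN are filled from each side,
# so the 6-value gap is filled partially rather than skipped
partly_filled = s.interpolate(method="linear", limit=2, limit_direction="both")

def interpolate_short_gaps(series, max_gap):
    """Hypothetical helper: fill only NaN runs of length <= max_gap."""
    is_na = series.isna()
    # Label each run of consecutive NaN / non-NaN values
    run_id = is_na.astype(int).diff().ne(0).cumsum()
    # Length of the NaN run each position belongs to (0 for valid values)
    run_len = is_na.astype(int).groupby(run_id).transform("sum")
    filled = series.interpolate(method="linear", limit_area="inside")
    # Keep interpolated values only inside short gaps
    return filled.where(~is_na | (run_len <= max_gap))

short_only = interpolate_short_gaps(s, max_gap=2)
print(partly_filled.tolist())
print(short_only.tolist())
```

With `limit=2` and `limit_direction="both"`, the long gap keeps NaN only in its middle; the helper leaves the whole long gap untouched.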

In [30]:
def handle_missing_values(df, max_gap=5, min_coverage=0.8):
    """
    Handle NaN values in the wide-format data.

    Params:
        max_gap : int, max consecutive NaN values to interpolate.
        min_coverage : float, min. proportion of non-null values to keep a column.
    """
    df = df.copy()
    print(f"Before: {df.shape}")
    
    # 1. Remove columns with too many missing values
    coverage = df.notna().sum() / len(df)
    cols_to_keep = coverage[coverage >= min_coverage].index
    cols_removed = len(df.columns) - len(cols_to_keep)
    df = df[cols_to_keep]
    print(f"Removed {cols_removed} columns with <{min_coverage*100:.0f}% coverage")
    
    # 2. Interpolate small gaps
    df = df.interpolate(method='linear', limit=max_gap, limit_direction='both')
    print(f"Interpolated gaps up to {max_gap} consecutive values")
    
    # 3. Drop remaining rows with NaN
    rows_before = len(df)
    df = df.dropna()
    rows_dropped = rows_before - len(df)
    print(f"Dropped {rows_dropped:,} rows with remaining NaN")
    
    print(f"After: {df.shape}")
    
    return df

In [32]:
# Handle missing values
df_clean = handle_missing_values(df_wide, max_gap=5, min_coverage=0.8)

print(f"\nMissing values remaining: {df_clean.isnull().sum().sum()}")

Before: (24456, 111)
Removed 25 columns with <80% coverage
Interpolated gaps up to 5 consecutive values
Dropped 24,398 rows with remaining NaN
After: (58, 86)

Missing values remaining: 0
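
The output above shows that almost all rows are dropped (24,398 of 24,456). `dropna()` removes a row when any single column is still NaN, so gaps at different timestamps in different columns add up. The sketch below uses toy data and hypothetical column names to show how one might compare per-column missingness with the number of rows lost.

```python
import numpy as np
import pandas as pd

# Toy wide table: each column is missing at different timestamps
idx = pd.date_range("2023-01-01", periods=6, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame(
    {
        "A_NO2": [1, np.nan, 3, 4, 5, 6],
        "B_NO2": [1, 2, np.nan, 4, 5, 6],
        "C_O3": [1, 2, 3, np.nan, np.nan, 6],
    },
    index=idx,
)

per_column_missing = toy.isna().mean()             # share of NaN per column
rows_with_any_nan = toy.isna().any(axis=1).sum()   # rows that dropna() removes

print(per_column_missing.round(2).to_dict())
print(f"Rows removed by dropna(): {rows_with_any_nan} of {len(toy)}")
```

Each column here is at most a third missing, yet two thirds of the rows are removed.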



## 7) Adding temporal features to the wide data:
 - Extract hour, day and month as columns to make the data wider.

In [33]:
def temporal_wide(df):
    """
    Add temporal features to wide format data.
    """
    df = df.copy()
    
    df['hour'] = df.index.hour
    df['day_of_week'] = df.index.dayofweek
    df['month'] = df.index.month
    df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
    
    print(f"Added temporal features")
    print(f"Total features: {len(df.columns)}")
    
    return df

In [34]:
# Add temporal features
df_features = temporal_wide(df_clean)

# Preview
df_features[['hour', 'day_of_week', 'month', 'is_weekend']].head()

Added temporal features
Total features: 90


@MeasurementDateGMT,hour,day_of_week,month,is_weekend
2023-08-01 09:00:00,9,1,8,0
2023-08-01 10:00:00,10,1,8,0
2023-08-01 11:00:00,11,1,8,0
2023-08-01 12:00:00,12,1,8,0
2023-08-01 13:00:00,13,1,8,0

