# ML LAQN Data Preparation for Air Quality Prediction

I will prepare LAQN data for machine learning models in this notebook.

## What this notebook does

1. Loads cleaned data from the optimised folder.

   ```bash
   ├── optimased/                              # Validated, optimised measurements; these files are used for ML.
   │   ├── 2023_jan/                           # Monthly folders.
   │   ├── 2023_feb/
   │   ├── ...
   │   └── 2025_nov/
   │       └── {SiteCode}_{SpeciesCode}_{StartDate}_{EndDate}.csv  # File naming pattern inside each monthly folder.
   ```

2. Combines all measurements into a single dataset.
3. Adds temporal features (hour, day, month).

## Output path:

Data will be saved to `data/ml/`.

- As usual, I will add my paths below this markdown cell to keep myself organised.

In [3]:
# Start by importing the mandatory and most helpful Python modules.
import pandas as pd
import numpy as np
import os
from pathlib import Path

# preprocessing libraries
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split


- The dataset/file paths will be below.

In [4]:
# ml prep file path
base_dir = Path.cwd().parent.parent / "data" / "laqn"
project_rooth = Path.cwd() / "ml_prep.ipynb"

# laqn optimased data files path
optimased_path = base_dir / "optimased"



## 1) Load LAQN data:

1. Loads cleaned data from the optimised folder.

   ```bash
   ├── optimased/                              # Validated, optimised measurements; these files are used for ML.
   │   ├── 2023_jan/                           # Monthly folders.
   │   ├── 2023_feb/
   │   ├── ...
   │   └── 2025_nov/
   │       └── {SiteCode}_{SpeciesCode}_{StartDate}_{EndDate}.csv  # File naming pattern inside each monthly folder.
   ```

   Column header of the optimased files:
   `@MeasurementDateGMT,@Value,SpeciesCode,SiteCode,SpeciesName,SiteName,SiteType,Latitude,Longitude`

 - A `load_data` function will combine all the files into one DataFrame.

   Why does everything need to be in one DataFrame?

   - Each file contains hourly measurements for one station, one pollutant and one month.
   - Time continuity: machine learning requires identifying patterns over time. If January and February are in separate files, the model cannot learn that pollution on 31 January affects 1 February.
   - Train/test split: a single dataset can be split into 70% for training, 15% for validation and 15% for testing.
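
The 70/15/15 split can be sketched as below. This is only a minimal sketch on a toy hourly series: `chronological_split` is a hypothetical helper name, and splitting by time order (no shuffling) is my assumption, since the notebook has not fixed the split method yet.

```python
import pandas as pd


def chronological_split(df, train_frac=0.70, val_frac=0.15):
    """Split a time-indexed frame into train/validation/test without shuffling."""
    df = df.sort_index()
    n = len(df)
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    return df.iloc[:train_end], df.iloc[train_end:val_end], df.iloc[val_end:]


# Toy hourly series standing in for the combined LAQN data
idx = pd.date_range("2023-01-01", periods=100, freq="h")
toy = pd.DataFrame({"value": range(100)}, index=idx)

train, val, test = chronological_split(toy)
print(len(train), len(val), len(test))
```

Keeping the three parts in time order means the validation and test periods always come after the training period.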




In [5]:
def load_data(optimased_path):
    """
    Load the optimased data files from the LAQN dataset into one DataFrame.

    Param:
        optimased_path: path to the data/laqn/optimased directory.
    Returns:
        pandas.DataFrame with the rows of all CSV files combined.
    """
    optimased_path = Path(optimased_path)
    all_files = []
    file_count = 0

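    # Get all monthly folders, sorted by name (alphabetical, so 2023_apr comes before 2023_jan);
    # the rows are put in time order later, when the wide table is sorted by timestamp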
    monthly_folders = sorted([f for f in optimased_path.iterdir() if f.is_dir()])
    
    print(f"Found {len(monthly_folders)} monthly folders")
    
    # Iterate through each monthly folder
    for folder in monthly_folders:
        # Get all CSV files in this folder
        csv_files = list(folder.glob("*.csv"))
        
        for csv_file in csv_files:
            try:
                df = pd.read_csv(csv_file)
                all_files.append(df)
                file_count += 1
            except Exception as e:
                print(f"Error reading {csv_file.name}: {e}")
        
        # Progress update
        print(f"  Loaded {folder.name}: {len(csv_files)} files")
    
    # Combine all dataframes
    if not all_files:
        raise ValueError(f"No CSV files found in {optimased_path}")
    
    combined_df = pd.concat(all_files, ignore_index=True)
    
    print(f"\n" + "="*40)
    print(f"Total files loaded: {file_count}")
    print(f"Total rows: {len(combined_df):,}")
    print(f"Columns: {list(combined_df.columns)}")
    
    return combined_df
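
Note that `sorted()` on the folder names gives alphabetical order (`2023_apr` comes before `2023_jan`). This does not affect the result here because the data is sorted by timestamp later, but if chronological loading is wanted, a minimal sketch could look like this. `MONTHS` and `month_key` are hypothetical names of mine, and I assume every folder follows the `{year}_{mon}` pattern.

```python
# Three-letter month abbreviations as used in the folder names
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def month_key(folder_name):
    """Turn a name such as '2023_jan' into a sortable (year, month_index) tuple."""
    year, month = folder_name.split("_")
    return int(year), MONTHS.index(month.lower())


names = ["2023_apr", "2023_jan", "2024_feb", "2023_dec"]
alphabetical = sorted(names)
chronological = sorted(names, key=month_key)
print(alphabetical)
print(chronological)
```

Inside `load_data` the same key could be applied to `f.name` for each folder `f`.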

In [6]:
# load all data
df_raw = load_data(optimased_path)

# preview data
df_raw.head(10)

Found 36 monthly folders
  Loaded 2023_apr: 141 files
  Loaded 2023_aug: 141 files
  Loaded 2023_dec: 141 files
  Loaded 2023_feb: 141 files
  Loaded 2023_jan: 141 files
  Loaded 2023_jul: 141 files
  Loaded 2023_jun: 141 files
  Loaded 2023_mar: 141 files
  Loaded 2023_may: 141 files
  Loaded 2023_nov: 138 files
  Loaded 2023_oct: 141 files
  Loaded 2023_sep: 141 files
  Loaded 2024_apr: 141 files
  Loaded 2024_aug: 141 files
  Loaded 2024_dec: 141 files
  Loaded 2024_feb: 141 files
  Loaded 2024_jan: 141 files
  Loaded 2024_jul: 141 files
  Loaded 2024_jun: 141 files
  Loaded 2024_mar: 141 files
  Loaded 2024_may: 141 files
  Loaded 2024_nov: 141 files
  Loaded 2024_oct: 141 files
  Loaded 2024_sep: 141 files
  Loaded 2025_apr: 141 files
  Loaded 2025_aug: 141 files
  Loaded 2025_feb: 141 files
  Loaded 2025_jan: 141 files
  Loaded 2025_jul: 141 files
  Loaded 2025_jun: 141 files
  Loaded 2025_mar: 141 files
  Loaded 2025_may: 141 files
  Loaded 2025_nov: 141 files
  Loaded 2025_oct:

Unnamed: 0,@MeasurementDateGMT,@Value,SpeciesCode,SiteCode,SpeciesName,SiteName,SiteType,Latitude,Longitude
0,2023-04-01 00:00:00,5.1,PM10,GB6,PM10 Particulate,Greenwich - Falconwood,Roadside,51.4563,0.085606
1,2023-04-01 01:00:00,4.4,PM10,GB6,PM10 Particulate,Greenwich - Falconwood,Roadside,51.4563,0.085606
2,2023-04-01 02:00:00,3.5,PM10,GB6,PM10 Particulate,Greenwich - Falconwood,Roadside,51.4563,0.085606
3,2023-04-01 03:00:00,5.3,PM10,GB6,PM10 Particulate,Greenwich - Falconwood,Roadside,51.4563,0.085606
4,2023-04-01 04:00:00,3.9,PM10,GB6,PM10 Particulate,Greenwich - Falconwood,Roadside,51.4563,0.085606
5,2023-04-01 05:00:00,4.3,PM10,GB6,PM10 Particulate,Greenwich - Falconwood,Roadside,51.4563,0.085606
6,2023-04-01 06:00:00,4.2,PM10,GB6,PM10 Particulate,Greenwich - Falconwood,Roadside,51.4563,0.085606
7,2023-04-01 07:00:00,5.5,PM10,GB6,PM10 Particulate,Greenwich - Falconwood,Roadside,51.4563,0.085606
8,2023-04-01 08:00:00,8.0,PM10,GB6,PM10 Particulate,Greenwich - Falconwood,Roadside,51.4563,0.085606
9,2023-04-01 09:00:00,9.4,PM10,GB6,PM10 Particulate,Greenwich - Falconwood,Roadside,51.4563,0.085606



In [7]:
# Check data info
print("Data shape:", df_raw.shape)
print("\nColumn types:")
print(df_raw.dtypes)
print("\nMissing values:")
print(df_raw.isnull().sum())

Data shape: (3446208, 9)

Column types:
@MeasurementDateGMT     object
@Value                 float64
SpeciesCode             object
SiteCode                object
SpeciesName             object
SiteName                object
SiteType                object
Latitude               float64
Longitude              float64
dtype: object

Missing values:
@MeasurementDateGMT         0
@Value                 464791
SpeciesCode                 0
SiteCode                    0
SpeciesName                 0
SiteName                    0
SiteType                    0
Latitude                    0
Longitude                   0
dtype: int64



## 2) Data exploration:
I have already checked the data many times, but I think it is beneficial to add it here again:

- How many unique sites? 64
- Which pollutants (species)? 6 pollutants
- Date range? 01.01.2023 to 18.11.2025

In [8]:
# Define column names based on the optimased data structure
date_col = '@MeasurementDateGMT'
value_col = '@Value'
site_col = 'SiteCode'
species_col = 'SpeciesCode'

# Convert datetime
df_raw[date_col] = pd.to_datetime(df_raw[date_col])

# Print the summary statistics.
print(f"Unique sites: {df_raw[site_col].nunique()}")
print(f"Unique species: {df_raw[species_col].nunique()}")
print(f"\nDate range: {df_raw[date_col].min()} to {df_raw[date_col].max()}")
print(f"\nSpecies in data:")
print(df_raw[species_col].value_counts())

Unique sites: 64
Unique species: 6

Date range: 2023-01-01 00:00:00 to 2025-11-18 23:00:00

Species in data:
SpeciesCode
NO2      1417752
PM10     1026456
PM2.5     586944
O3        268320
SO2        97824
CO         48912
Name: count, dtype: int64



## 3) Selecting target pollutants:
As I mentioned in `/docs/LAQN_DEFRA_benchmark.md` and `/docs/LAQN_data_quality.md`, these pollutants have the highest coverage:
- NO2 : 60 sites
- PM25 : 53 sites
- PM10: 43 sites
- O3: 11 sites




In [9]:
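# target pollutants
# Note: the data spells fine particulates as 'PM2.5' (see the species counts above),
# so 'PM25' matches no rows and only NO2, PM10 and O3 remain after filtering.
target_pollutants = ['NO2', 'PM25', 'PM10', 'O3']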

# Filter data
df_filtered = df_raw[df_raw[species_col].isin(target_pollutants)].copy()

print(f"Rows before filtering: {len(df_raw):,}")
print(f"Rows after filtering: {len(df_filtered):,}")
print(f"\nPollutants included:")
print(df_filtered[species_col].value_counts())

Rows before filtering: 3,446,208
Rows after filtering: 2,712,528

Pollutants included:
SpeciesCode
NO2     1417752
PM10    1026456
O3       268320
Name: count, dtype: int64
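
The output above lists only NO2, PM10 and O3, because the data uses the code `PM2.5` while the target list uses `PM25`. A quick check like the one below would surface such a mismatch before filtering. It is only a sketch on a toy frame that reuses the six species codes printed in section 2; `present` and `unmatched` are names I made up.

```python
import pandas as pd

# Toy frame with the species codes reported by value_counts() in section 2
toy = pd.DataFrame({"SpeciesCode": ["NO2", "PM10", "PM2.5", "O3", "SO2", "CO"]})
target_pollutants = ["NO2", "PM25", "PM10", "O3"]

# Codes that are requested but never occur in the data
present = set(toy["SpeciesCode"].unique())
unmatched = [p for p in target_pollutants if p not in present]
print(f"Targets with no rows in the data: {unmatched}")
```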


## 4) Adding temporal features:

I already added these features in the analysis part.

- They help to show how pollutant concentrations change with the time of day
- Day of the week (traffic effect)
- Season/month of the year.

These features help the model learn when pollution is typically high or low.

In [10]:
def temporal_features(df, datetime_col):
    """
    Add temporal features derived from a datetime column.

    Params:
        df : pandas.DataFrame
        datetime_col : str, name of the datetime column
    """
    df = df.copy()
    
    #  datetime type needs to be ensured
    df[datetime_col] = pd.to_datetime(df[datetime_col])
    
    # Extract features
    df['hour'] = df[datetime_col].dt.hour
    df['day_of_week'] = df[datetime_col].dt.dayofweek  # 0=Monday, 6=Sunday
    df['month'] = df[datetime_col].dt.month
    df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
    
    print("Added temporal features: hour, day_of_week, month, is_weekend")
    
    return df

In [11]:
# Add temporal features
df_temporal = temporal_features(df_filtered, date_col)

# Preview
df_temporal[['hour', 'day_of_week', 'month', 'is_weekend']].head()

Added temporal features: hour, day_of_week, month, is_weekend


Unnamed: 0,hour,day_of_week,month,is_weekend
0,0,5,4,1
1,1,5,4,1
2,2,5,4,1
3,3,5,4,1
4,4,5,4,1


| Column      | Value         | Meaning                       |
| ----------- | ------------- | ----------------------------- |
| hour        | 0, 1, 2, 3, 4 | Midnight, 1am, 2am, 3am, 4am  |
| day_of_week | 5             | Saturday (0=Monday, 6=Sunday) |
| month       | 4             | April                         |
| is_weekend  | 1             | Yes, it is a weekend          |
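
The mapping in the table can be checked on two timestamps. This is just a small sketch: the first timestamp is the one from the preview above, the second (a Monday morning) is one I picked for contrast.

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(["2023-04-01 00:00:00", "2023-04-03 08:00:00"]))

features = pd.DataFrame({
    "hour": ts.dt.hour,
    "day_of_week": ts.dt.dayofweek,  # 0=Monday, 6=Sunday
    "month": ts.dt.month,
})
features["is_weekend"] = features["day_of_week"].isin([5, 6]).astype(int)
print(features)
```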


## 5) Wide formatting

For the ML input format I follow 'Air quality prediction using CNN+LSTM-based hybrid deep learning architecture', *Environmental Science and Pollution Research*, 29(8), pp. 11920-11938.

According to their research, these are the approaches they compare:

| Method      | Input                                    | Output           |
| ----------- | ---------------------------------------- | ---------------- |
| UNI/UNI     | Historical info of target pollutant only | Single pollutant |
| MULTI/UNI   | Historical info of all pollutants        | Single pollutant |
| MULTI/MULTI | Historical info of all pollutants        | All pollutants   |

Page 11922 (Results section):

> "The multivariate model without using meteorological data revealed the best results."

So, in order to create a multivariate input, I will pivot the data to wide format, so that each station/species combination becomes its own column.
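
To make the long-to-wide idea concrete, here is a minimal sketch on four toy rows. The column names follow the optimased files; the toy frame itself is only an illustration, not the real data.

```python
import pandas as pd

# Long format: one row per timestamp, site and species
long_df = pd.DataFrame({
    "@MeasurementDateGMT": pd.to_datetime(
        ["2023-01-01 00:00", "2023-01-01 00:00",
         "2023-01-01 01:00", "2023-01-01 01:00"]
    ),
    "SiteCode": ["BG1", "BQ7", "BG1", "BQ7"],
    "SpeciesCode": ["NO2", "O3", "NO2", "O3"],
    "@Value": [7.6, 74.2, 4.4, 74.4],
})

# Wide format: one row per timestamp, one column per site-species pair
long_df["site_species"] = long_df["SiteCode"] + "_" + long_df["SpeciesCode"]
wide = long_df.pivot_table(
    index="@MeasurementDateGMT",
    columns="site_species",
    values="@Value",
    aggfunc="mean",
)
print(wide)
```

Four long rows become two wide rows with the columns `BG1_NO2` and `BQ7_O3`.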




In [12]:
def wide_format(df, datetime_col, site_col, species_col, value_col):
    """
    Pivot data from long to wide format. 
    Each site-species combination becomes a column.
    Each row represents one timestamp.
    
    Params:
        datetime_col, site_col, species_col, value_col 
    """
    df = df.copy()
    
    # Create site_species identifier
    df['site_species'] = df[site_col] + '_' + df[species_col]
    
    # Pivot table
    # If duplicate datetime-site_species combinations exist, take mean
    pivoted = df.pivot_table(
        index=datetime_col,
        columns='site_species',
        values=value_col,
        aggfunc='mean'
    )
    
    # Sort by datetime
    pivoted = pivoted.sort_index()
    
    print(f"Created wide format:")
    print(f"Timestamps: {len(pivoted):,}")
    print(f"Features (site-species): {len(pivoted.columns)}")
    print(f"Date range: {pivoted.index.min()} to {pivoted.index.max()}")
    
    return pivoted

In [13]:
# Create wide format
df_wide = wide_format(df_temporal, date_col, site_col, species_col, value_col)

# Preview
print("\nFirst 10 columns:")
print(list(df_wide.columns)[:10])
df_wide.head()

Created wide format:
Timestamps: 24,456
Features (site-species): 111
Date range: 2023-01-01 00:00:00 to 2025-11-18 23:00:00

First 10 columns:
['BG1_NO2', 'BG2_NO2', 'BG2_PM10', 'BQ7_NO2', 'BQ7_O3', 'BQ7_PM10', 'BQ9_PM10', 'BT4_NO2', 'BT4_PM10', 'BT5_NO2']


@MeasurementDateGMT,BG1_NO2,BG2_NO2,BG2_PM10,BQ7_NO2,BQ7_O3,BQ7_PM10,BQ9_PM10,BT4_NO2,BT4_PM10,BT5_NO2,...,WA9_PM10,WAA_NO2,WAA_PM10,WAB_NO2,WAB_PM10,WAC_PM10,WM5_NO2,WM6_NO2,WM6_PM10,WMD_NO2
2023-01-01 00:00:00,7.6,4.6,16.5,,74.2,10.2,8.1,11.4,39.0,8.1,...,9.0,9.0,12.7,,24.0,19.9,5.7,17.9,14.0,8.8
2023-01-01 01:00:00,4.4,4.4,,,74.4,6.8,7.1,14.5,15.0,7.1,...,3.0,7.0,6.9,,7.0,8.5,6.2,15.1,17.0,9.1
2023-01-01 02:00:00,4.2,4.1,5.7,,76.7,9.3,9.7,14.0,18.0,5.3,...,11.0,6.0,11.0,,13.0,12.1,9.1,16.0,16.0,8.4
2023-01-01 03:00:00,4.5,3.7,10.2,,76.4,11.8,13.1,12.6,15.0,3.7,...,11.0,5.0,12.5,,16.0,14.7,6.2,19.7,24.0,3.2
2023-01-01 04:00:00,2.7,2.9,13.8,,77.1,12.7,14.5,7.9,17.0,6.1,...,13.0,5.0,15.6,,19.0,18.4,5.3,16.5,20.0,4.0


5 rows × 111 columns

## 6) NaN value handling

- Find the optimal strategy.
- Maybe interpolate small gaps (2 to 5 hours, etc.).
- Or, as usual, drop the remaining rows with NaN.
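
One thing to keep in mind: in pandas, the `limit` argument of `interpolate` is the maximum number of consecutive NaNs to fill, so a gap longer than the limit is partly filled rather than skipped. If only gaps of at most `max_gap` hours should be filled, a sketch could look like this. `interpolate_short_gaps` is a hypothetical helper of mine and not part of pandas.

```python
import numpy as np
import pandas as pd


def interpolate_short_gaps(series, max_gap=5):
    """Linearly interpolate only NaN runs of length <= max_gap."""
    is_nan = series.isna()
    # Give every run of consecutive NaN / non-NaN values its own id
    run_id = is_nan.astype(int).diff().ne(0).cumsum()
    # Length of the run each element belongs to
    run_len = is_nan.groupby(run_id).transform("size")
    # Fill all interior gaps, then restore NaN where the gap was too long
    filled = series.interpolate(method="linear", limit_area="inside")
    return filled.mask(is_nan & (run_len > max_gap))


s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, 7.0])

partial = s.interpolate(method="linear", limit=2)  # long gap is partly filled
strict = interpolate_short_gaps(s, max_gap=2)      # long gap is left untouched
print(partial.tolist())
print(strict.tolist())
```

With `limit=2` the three-hour gap gets two filled values and one NaN, while the helper leaves all three as NaN and still fills the one-hour gap.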

In [14]:
# Check missing values before handling
missing_pct = (df_wide.isnull().sum() / len(df_wide) * 100).sort_values(ascending=False)

print("Missing value percentage by column (top 20):")
print(missing_pct.head(20))

print(f"\nTotal cells: {df_wide.size:,}")
print(f"Missing cells: {df_wide.isnull().sum().sum():,}")
print(f"Missing percentage: {df_wide.isnull().sum().sum() / df_wide.size * 100:.2f}%")

Missing value percentage by column (top 20):
site_species
WM6_PM10    62.798495
CE3_NO2     46.589794
TL4_NO2     40.407262
RI2_O3      39.818449
WA7_NO2     37.765783
CD1_PM10    31.534184
WAA_NO2     31.505561
CE3_PM10    31.288845
TH4_PM10    30.528296
HG4_O3      28.999019
CD1_NO2     28.557409
MY1_O3      27.171246
SK5_PM10    26.300294
BG1_NO2     26.226693
BG2_NO2     24.873242
GB6_O3      24.345764
CE2_O3      24.124959
TH4_O3      23.994112
RI2_NO2     23.769218
TH4_NO2     22.894177
dtype: float64

Total cells: 2,714,616
Missing cells: 330,540
Missing percentage: 12.18%



- `max_gap=5`: linear interpolation is applied to fill in missing values, for up to 5 consecutive hours.

Reference: 'Air quality prediction using CNN+LSTM-based hybrid deep learning architecture', *Environmental Science and Pollution Research*, 29(8), pp. 11920-11938.

With `min_coverage = 0.8`, the filter leaves my dataset with only 58 rows, so I will test other thresholds below.

In [None]:
# # Test different thresholds to find what works
# for threshold in [0.5, 0.6, 0.7, 0.8, 0.9, 0.95]:
#     coverage = df_wide.notna().sum() / len(df_wide)
#     cols_kept = coverage[coverage >= threshold].index
#     temp_df = df_wide[cols_kept].interpolate(method='linear', limit=5, limit_direction='both').dropna()
#     print(f"Threshold {threshold}: {len(temp_df):,} rows, {len(cols_kept)} columns")


Threshold 0.5: 0 rows, 110 columns
Threshold 0.6: 0 rows, 108 columns
Threshold 0.7: 0 rows, 102 columns
Threshold 0.8: 58 rows, 86 columns
Threshold 0.9: 4,069 rows, 62 columns
Threshold 0.95: 13,190 rows, 39 columns


## 6.A) Select optimal columns and handle missing values

### The problem

I have 111 site-species columns, but their missing values occur at different times.
When any column has a gap at a timestamp, that entire row is dropped. I want to use as much of my data as possible, and threshold = 0.8 does not allow that, so I need to run some tests to find out what works best.

- With threshold 0.8 (86 of the 111 columns kept): only 58 complete rows, which unfortunately is unusable
- Need to find a balance between columns (features) and rows (samples)

### Where is the sweet spot?

I'll test different numbers of columns to maximise the total data points (rows × columns):

In [None]:
# # Testing different numbers of columns to find optimal balance
# coverage = df_wide.notna().sum() / len(df_wide)
# coverage_sorted = coverage.sort_values(ascending=False)

# print("Columns vs Rows")
# print("-" * 40)

# results = []
# for n_cols in [10, 15, 20, 25, 30, 35, 40, 45, 50]:
#     best_cols = coverage_sorted.head(n_cols).index
#     df_test = df_wide[best_cols].copy()
#     df_test = df_test.interpolate(method='linear', limit=5, limit_direction='both')
#     df_test = df_test.dropna()
    
#     total_data = len(df_test) * n_cols
#     results.append({'columns': n_cols, 'rows': len(df_test), 'total': total_data})
    
#     print(f"{n_cols} columns: {len(df_test):,} rows → {total_data:,} total data points")

# # Find sweetest spot 
# best = max(results, key=lambda x: x['total'])
# print(f"\nOptimal: {best['columns']} columns with {best['rows']:,} rows")

Columns vs Rows
----------------------------------------
10 columns: 23,660 rows → 236,600 total data points
15 columns: 22,122 rows → 331,830 total data points
20 columns: 20,324 rows → 406,480 total data points
25 columns: 17,944 rows → 448,600 total data points
30 columns: 16,580 rows → 497,400 total data points
35 columns: 14,221 rows → 497,735 total data points
40 columns: 12,416 rows → 496,640 total data points
45 columns: 10,070 rows → 453,150 total data points
50 columns: 8,208 rows → 410,400 total data points

Optimal: 35 columns with 14,221 rows


I will also check the pollutants in the selected columns, to see whether the columns I chose are actually a good mix.

In [None]:
# # pollutant check  in selected columns
# n_cols = best['columns'] # use 35 colmn with 14,221 row up.

# best_cols = coverage_sorted.head(n_cols).index.tolist()

# print(f"Selected {n_cols} columns:\n")
# for col in best_cols:
#     print(f"  {col}: {coverage_sorted[col]*100:.1f}% coverage")

# # Count by pollutant type
# no2 = [c for c in best_cols if 'NO2' in c]
# pm10 = [c for c in best_cols if 'PM10' in c]
# o3 = [c for c in best_cols if 'O3' in c]

# print(f"\nPollutant mix:")
# print(f"NO2 stations: {len(no2)}")
# print(f"PM10 stations: {len(pm10)}")
# print(f"O3 stations: {len(o3)}")

Selected 35 columns:

  EN5_NO2: 99.6% coverage
  WMD_NO2: 99.6% coverage
  BT5_NO2: 99.5% coverage
  HP1_PM10: 99.5% coverage
  EN1_NO2: 99.4% coverage
  ME9_NO2: 99.0% coverage
  BT6_NO2: 99.0% coverage
  BT8_PM10: 98.6% coverage
  HV1_NO2: 98.3% coverage
  BT4_PM10: 98.3% coverage
  KC1_NO2: 98.1% coverage
  BT8_NO2: 97.9% coverage
  EI1_NO2: 97.8% coverage
  HP1_NO2: 97.7% coverage
  BX2_PM10: 97.7% coverage
  GN0_NO2: 97.6% coverage
  WM6_NO2: 97.5% coverage
  IS6_NO2: 97.5% coverage
  RI1_NO2: 97.4% coverage
  HP1_O3: 97.3% coverage
  BT6_PM10: 97.3% coverage
  SK5_NO2: 97.0% coverage
  BX1_O3: 96.8% coverage
  GR9_NO2: 96.6% coverage
  EN4_NO2: 96.6% coverage
  GN6_NO2: 96.2% coverage
  GR7_PM10: 96.1% coverage
  KC1_O3: 96.0% coverage
  GR7_NO2: 96.0% coverage
  GN3_PM10: 95.9% coverage
  LB4_NO2: 95.9% coverage
  GN4_PM10: 95.7% coverage
  GN4_NO2: 95.7% coverage
  EA8_NO2: 95.7% coverage
  EA6_NO2: 95.6% coverage

Pollutant mix:
NO2 stations: 24
PM10 stations: 8
O3 stations: 

In [None]:
# The function below is commented out because filtering by min_coverage
# does not give me the maximum rows × columns data points.

# def handle_missing_values(df, max_gap=5, min_coverage=0.8):
#     """
#     Handle NaN 
    
#     Param
#     max_gap : int max consecutive NaN values to interpolate
#     min_coverage : Min. proportion of non-null values to keep a column.
    
#     """
#     df = df.copy()
#     print(f"Before: {df.shape}")
    
#     # 1. rm columns with too many missing values
#     coverage = df.notna().sum() / len(df)
#     cols_to_keep = coverage[coverage >= min_coverage].index
#     cols_removed = len(df.columns) - len(cols_to_keep)
#     df = df[cols_to_keep]
#     print(f"Removed {cols_removed} columns with <{min_coverage*100:.0f}% coverage")
    
#     # 2 Interpolate small gaps
#     df = df.interpolate(method='linear', limit=max_gap, limit_direction='both')
#     print(f"Interpolated gaps up to {max_gap} consecutive values")
    
#     # 3. Drop remaining rows with NaN
#     rows_before = len(df)
#     df = df.dropna()
#     rows_dropped = rows_before - len(df)
#     print(f"Dropped {rows_dropped:,} rows with remaining NaN")
    
#     print(f"After: {df.shape}")
    
#     return df

In [None]:
# The call below is commented out because filtering by min_coverage
# does not give me the maximum rows × columns data points.

# # Handle missing values
# df_clean = handle_missing_values(df_wide, max_gap=5, min_coverage=0.8)

# print(f"\nMissing values remaining: {df_clean.isnull().sum().sum()}")

Before: (24456, 111)
Removed 25 columns with <80% coverage
Interpolated gaps up to 5 consecutive values
Dropped 24,398 rows with remaining NaN
After: (58, 86)

Missing values remaining: 0


## 6.B) Reselect optimal columns and handle NaN

### The problem

I have 111 site-species columns but their missing values occur at different times. 
When any column has a gap at a timestamp, that entire row is dropped.

With the 0.8 coverage threshold (86 of the 111 columns kept): only 58 complete rows (unusable)

### Finding the sweet spot

Tested different column counts to maximise total data points (rows × columns):

| Columns | Rows | Total data points |
|---------|------|-------------------|
| 10 | 23,660 | 236,600 |
| 20 | 20,324 | 406,480 |
| 30 | 16,580 | 497,400 |
| **35** | **14,221** | **497,735** (optimal) |
| 40 | 12,416 | 496,640 |
| 50 | 8,208 | 410,400 |

### Selected columns

35 columns with coverage between 95.6% and 99.6%:
- NO2: 24 stations
- PM10: 8 stations  
- O3: 3 stations




In [35]:
# Select the 35 columns with the fewest NaN values (tests for this were run in 6.A)
coverage = df_wide.notna().sum() / len(df_wide)
coverage_sorted = coverage.sort_values(ascending=False)

n_cols = 35
best_cols = coverage_sorted.head(n_cols).index.tolist()

df_selected = df_wide[best_cols].copy()
df_clean = df_selected.interpolate(method='linear', limit=5, limit_direction='both')
df_clean = df_clean.dropna()

print(f"Final dataset: {len(df_clean):,} rows × {len(df_clean.columns)} columns")

# Verify pollutant mix
no2 = sum(1 for c in best_cols if 'NO2' in c)
pm10 = sum(1 for c in best_cols if 'PM10' in c)
o3 = sum(1 for c in best_cols if 'O3' in c)
print(f"Pollutant mix: NO2={no2}, PM10={pm10}, O3={o3}")

Final dataset: 14,221 rows × 35 columns
Pollutant mix: NO2=24, PM10=8, O3=3



## 7) Adding temporal features:
 - Extract hour, day of week and month as columns and add them to the wide data.

In [36]:
def temporal_wide(df):
    """
    Add temporal features to wide format data.
    """
    df = df.copy()
    
    df['hour'] = df.index.hour
    df['day_of_week'] = df.index.dayofweek
    df['month'] = df.index.month
    df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
    
    print(f"Added temporal features")
    print(f"Total features: {len(df.columns)}")
    
    return df

In [45]:
# Add temporal features
df_features = temporal_wide(df_clean)

# Preview
df_features[['hour', 'day_of_week', 'month', 'is_weekend']].head()

#  print(f"df_features: {df_features.shape}")



Added temporal features
Total features: 39


| @MeasurementDateGMT | hour | day_of_week | month | is_weekend |
|---------------------|------|-------------|-------|------------|
| 2023-01-01 00:00:00 | 0 | 6 | 1 | 1 |
| 2023-01-01 01:00:00 | 1 | 6 | 1 | 1 |
| 2023-01-01 02:00:00 | 2 | 6 | 1 | 1 |
| 2023-01-01 03:00:00 | 3 | 6 | 1 | 1 |
| 2023-01-01 04:00:00 | 4 | 6 | 1 | 1 |





## 8) Normalise the data

Different features have different scales:
- NO2: 0-200 µg/m³
- PM25: 0-100 µg/m³
- Hour: 0-23

Neural networks work best when all inputs are on the same scale (0 to 1).

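I use `MinMaxScaler` from scikit-learn, which applies the following to each column (`min` and `max` are the bounds of `feature_range`, here 0 and 1):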
```
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
```

This transforms every value to a number between 0 and 1 (Géron, 2022).

**Important:** Save the scaler object to reverse the transformation later when interpreting predictions.
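
As a worked example of the formula and of why the fitted minimum and maximum must be kept, here is a minimal plain-Python sketch for a single column. It is an illustration of the arithmetic, not scikit-learn's implementation; the helper names and the NO2 readings are hypothetical.

```python
def min_max_fit(values):
    """Store the column minimum and maximum (what the scaler learns)."""
    return min(values), max(values)

def min_max_transform(values, lo, hi):
    """Scale values to [0, 1] with the stored minimum and maximum."""
    return [(v - lo) / (hi - lo) for v in values]

def min_max_inverse(scaled, lo, hi):
    """Map scaled values back to the original units."""
    return [s * (hi - lo) + lo for s in scaled]

no2 = [10.0, 60.0, 110.0, 210.0]   # hypothetical NO2 readings in µg/m³
lo, hi = min_max_fit(no2)
scaled = min_max_transform(no2, lo, hi)
restored = min_max_inverse(scaled, lo, hi)

print(scaled)    # [0.0, 0.25, 0.5, 1.0]
print(restored)  # [10.0, 60.0, 110.0, 210.0]
```

Without `lo` and `hi` the scaled values cannot be converted back to µg/m³. With scikit-learn, the fitted scaler object holds these values and `scaler.inverse_transform(...)` performs the same reversal.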

- `MinMaxScaler` is part of the scikit-learn library.

Scikit-learn (no date) sklearn.preprocessing.MinMaxScaler. Available at: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

Géron, A. (2022) Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow. 3rd edn. Sebastopol: O'Reilly Media. Chapter 2.


In [37]:
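def normalise_data(df):
    """
    Normalise all columns to the 0-1 range using scikit-learn MinMaxScaler.

    Each column is scaled independently. Returns the normalised dataframe,
    the fitted scaler (needed to reverse the scaling later) and the list
    of feature names.
    """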
    feature_names = df.columns.tolist()
    index = df.index
    
    # Fit and transform
    scaler = MinMaxScaler(feature_range=(0, 1))
    normalised_values = scaler.fit_transform(df.values)
    
    # Create dataframe
    normalised_df = pd.DataFrame(
        normalised_values,
        columns=feature_names,
        index=index
    )
    
    print(f"Data normalised to range 0, 1")
    print(f"Original range [{df.values.min():.2f}, {df.values.max():.2f}]")
    print(f"Normalised range [{normalised_values.min():.2f}, {normalised_values.max():.2f}]")
    
    return normalised_df, scaler, feature_names

In [44]:
# Normalise the data
df_normalised, scaler, feature_names = normalise_data(df_features)

# Preview
df_normalised.head()

# print(f"df_normalised: {df_normalised.shape}")


Data normalised to range 0, 1
Original range [0.00, 587.00]
Normalised range [0.00, 1.00]


| @MeasurementDateGMT | EN5_NO2 | WMD_NO2 | BT5_NO2 | HP1_PM10 | EN1_NO2 | ME9_NO2 | BT6_NO2 | BT8_PM10 | HV1_NO2 | BT4_PM10 | ... | GN3_PM10 | LB4_NO2 | GN4_PM10 | GN4_NO2 | EA8_NO2 | EA6_NO2 | hour | day_of_week | month | is_weekend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2023-01-01 00:00:00 | 0.061135 | 0.050996 | 0.069828 | 0.075868 | 0.067465 | 0.073740 | 0.062230 | 0.095833 | 0.066303 | 0.066440 | ... | 0.208388 | 0.094123 | 0.055256 | 0.017544 | 0.058506 | 0.071708 | 0.000000 | 1.0 | 0.0 | 1.0 |
| 2023-01-01 01:00:00 | 0.055022 | 0.052755 | 0.061207 | 0.064568 | 0.056583 | 0.095561 | 0.048401 | 0.033333 | 0.042688 | 0.025554 | ... | 0.041940 | 0.244261 | 0.013477 | 0.023684 | 0.026103 | 0.085398 | 0.043478 | 1.0 | 0.0 | 1.0 |
| 2023-01-01 02:00:00 | 0.056769 | 0.048652 | 0.045690 | 0.078289 | 0.046790 | 0.069977 | 0.035436 | 0.045833 | 0.030881 | 0.030664 | ... | 0.103539 | 0.175849 | 0.045148 | 0.019298 | 0.025203 | 0.073664 | 0.086957 | 1.0 | 0.0 | 1.0 |
| 2023-01-01 03:00:00 | 0.041921 | 0.018171 | 0.031897 | 0.103309 | 0.038085 | 0.064710 | 0.041487 | 0.054167 | 0.042688 | 0.025554 | ... | 0.142857 | 0.137741 | 0.073450 | 0.012281 | 0.022502 | 0.058018 | 0.130435 | 1.0 | 0.0 | 1.0 |
| 2023-01-01 04:00:00 | 0.039301 | 0.022860 | 0.052586 | 0.120258 | 0.043526 | 0.048157 | 0.052723 | 0.062500 | 0.035422 | 0.028961 | ... | 0.176933 | 0.145546 | 0.097709 | 0.007895 | 0.036004 | 0.080183 | 0.173913 | 1.0 | 0.0 | 1.0 |

5 rows × 39 columns




## 9) Create sequences

### Why sequences are needed

A single row of data is just one moment with no context. A forecasting model needs to see what happened before in order to predict what comes next.

By creating sequences, I give the model historical context. Instead of seeing one timestamp, it sees the last 12 hours of measurements and can learn patterns like pollution rising or falling.

### What is the sliding window method?

The sliding window method restructures time series data for supervised learning (Brownlee, 2017). With `n_past=12`, I use the last 12 hours to predict the next hour:
```
Input (X):  [hour1, hour2, hour3, ..., hour12]  → shape: (12, num_features)
Output (y): [hour13]                            → shape: (num_features,)
```

The window slides forward to create many training samples:
```
Sample 1: hours 1-12  → predict hour 13
Sample 2: hours 2-13  → predict hour 14
Sample 3: hours 3-14  → predict hour 15
...
```
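
The same idea on a tiny series, as a minimal plain-Python sketch. It uses a hypothetical helper `sliding_windows` and a window of 3 instead of 12 so the output fits on screen; the indexing mirrors the `create_sequences` function defined in the next cell.

```python
def sliding_windows(series, n_past, n_future=1):
    """Pair each window of n_past values with the value n_future steps after it."""
    X, y = [], []
    for i in range(n_past, len(series) - n_future + 1):
        X.append(series[i - n_past:i])      # the n_past values before position i
        y.append(series[i + n_future - 1])  # the target value
    return X, y

hours = list(range(1, 9))   # stand-in for hourly readings 1..8
X, y = sliding_windows(hours, n_past=3)

for window, target in zip(X, y):
    print(window, "->", target)
```

The number of samples is `len(series) - n_past - n_future + 1`, so 8 values give 5 samples here, and 14,221 rows with `n_past=12` give 14,209 samples.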

### Why 12 hours?

Gilik, Ogrenci and Ozmen (2021) tested frame sizes between 8 and 15 hours for air quality prediction. I chose 12 because it captures half a day of patterns, including rush hour variations.

Brownlee, J. (2017) Introduction to Time Series Forecasting with Python: How to Prepare Data and Develop Models to Predict the Future. Available at: https://www.inf.szte.hu/~korosig/teach/books/Jason%20Brownlee%20-%20Introduction%20to%20Time%20Series%20Forecasting%20with%20Python%20-%20How%20to%20Prepare%20Data%20and%20Develop%20Models%20to%20Predict%20the%20Future-v1.9%20(2020).pdf

In [None]:
def create_sequences(data, n_past=12, n_future=1):
    """
    Create input/target sequences for time series prediction using a sliding window.

    The sliding window method restructures a time series as a supervised
    learning problem (Brownlee, 2017). The window size is based on Gilik,
    Ogrenci and Ozmen (2021), who tested values between 8 and 15 hours.

    Params

    data : numpy.ndarray  Normalised data of shape (timestamps, features).
    n_past : int  Number of past timesteps to use as input.
    n_future : int  Forecast horizon: the target is the single row n_future steps after the window.

    Returns

    tuple: (X, y)
        X: Input sequences, shape (samples, n_past, features) (X is capitalised by convention)
        y: Target values, shape (samples, features)
    """
    X, y = [], []
    
    for i in range(n_past, len(data) - n_future + 1):
        #  past n_past timesteps
        X.append(data[i - n_past:i])
        # Output value at n_future steps ahead
        y.append(data[i + n_future - 1])
    
    X = np.array(X)
    y = np.array(y)
    
    print(f"Created sequences:")
    print(f" n_past (history): {n_past} hours")
    print(f"n_future (predict): {n_future} hour")
    print(f"Samples: {len(X):,}")
    print(f"X shape: {X.shape} (samples, timesteps, features)")
    print(f"y shape: {y.shape} (samples, features)")
    
    return X, y

In [47]:
# Configuration
N_PAST = 12      # Use last 12 hours as input
N_FUTURE = 1     # Predict 1 hour ahead

# Convert to numpy array
data_array = df_normalised.values

# Create sequences
X, y = create_sequences(data_array, n_past=N_PAST, n_future=N_FUTURE)

Created sequences:
 n_past (history): 12 hours
n_future (predict): 1 hour
Samples: 14,209
X shape: (14209, 12, 39) (samples, timesteps, features)
y shape: (14209, 39) (samples, features)



In [48]:
# Check how many rows at each step
print(f"After pivot (wide format): {len(df_wide)}")
print(f"After handle_missing_values: {len(df_clean)}")
print(f"After normalisation: {len(df_normalised)}")

After pivot (wide format): 24456
After handle_missing_values: 14221
After normalisation: 14221


## 10) Split into train/validation/test

### Why split the data?

Split data into three parts:
- Training (70%): model learns patterns from this
- Validation (15%): tune model and check for overfitting during training
- Test (15%): final evaluation, model never sees this until the end
```
|-------- Training (70%) --------|--- Val (15%) ---|--- Test (15%) ---|
Jan 2023                                                        Nov 2025
```

### Why sequential split, not random?

For time series, I split sequentially (oldest to newest), not randomly.
A random split causes data leakage: the model would train on samples that
come after the ones it is evaluated on, so it effectively sees the future (Brownlee, 2017).
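
A minimal sketch of what a sequential split means in terms of sample indices. The helper `sequential_split` is hypothetical and returns index ranges only; 14209 is the sample count reported by the sequence step above.

```python
def sequential_split(n_samples, train_ratio=0.7, val_ratio=0.15):
    """Return index ranges for train, validation and test, oldest first."""
    train_end = int(n_samples * train_ratio)
    val_end = int(n_samples * (train_ratio + val_ratio))
    return (
        range(0, train_end),
        range(train_end, val_end),
        range(val_end, n_samples),  # test is whatever remains
    )

train, val, test = sequential_split(14209)
print(len(train), len(val), len(test))

# Every training index is older than every validation index,
# and every validation index is older than every test index.
assert train[-1] < val[0] and val[-1] < test[0]
```

Because the ranges never interleave, no sample used for training is newer than a sample used for validation or testing.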

### Why 70/15/15 ratio?

Gilik, Ogrenci and Ozmen (2021) used this split for air quality prediction.

In [None]:
def split_time_series(X, y, train_ratio=0.7, val_ratio=0.15, test_ratio=0.15):
    """
    Split time series data sequentially into train, validation and test sets.

    A sequential split avoids data leakage (Brownlee, 2017).
    Ratios are based on Gilik, Ogrenci and Ozmen (2021).
    The test set is whatever remains after train and validation, so
    test_ratio is only used in the printed summary.
    """
    n_samples = len(X)
    
    train_end = int(n_samples * train_ratio)
    val_end = int(n_samples * (train_ratio + val_ratio))
    
    splits = {
        'X_train': X[:train_end],
        'y_train': y[:train_end],
        'X_val': X[train_end:val_end],
        'y_val': y[train_end:val_end],
        'X_test': X[val_end:],
        'y_test': y[val_end:]
    }
    
    print(f"Data split (sequential):")
    print(f"Training: {len(splits['X_train']):,} samples ({train_ratio*100:.0f}%)")
    print(f"Validation: {len(splits['X_val']):,} samples ({val_ratio*100:.0f}%)")
    print(f"Test: {len(splits['X_test']):,} samples ({test_ratio*100:.0f}%)")
    
    return splits

In [50]:
# Split the data
splits = split_time_series(X, y, train_ratio=0.7, val_ratio=0.15, test_ratio=0.15)

Data split (sequential):
Training: 9,946 samples (70%)
Validation: 2,131 samples (15%)
Test: 2,132 samples (15%)

