# Beijing Air Quality
## 📘 Notebook 04 – Feature Engineering


| Field             | Description                                      |
| ----------------- | ------------------------------------------------ |
| **Author:**       | Robert Steven Elliott                            |
| **Course:**       | Code Institute – Data Analytics with AI Bootcamp |
| **Project Type:** | Capstone                                         |
| **Date:**         | December 2025                                    |


## Objectives

This notebook creates all engineered features required for hypothesis validation (Milestone 5) and machine learning modelling (Milestone 6).

Specifically, this notebook will:

- Generate lag features for PM2.5 and selected meteorological variables
- Create rolling window statistics
- Add seasonal features
- Add cyclical datetime encodings
- Integrate spatial metadata (station latitude, longitude, area type)
- Create derived weather features such as dew point spread
- Validate each transformation
- Export the final feature-engineered dataset for modelling

## Inputs
- `data/cleaned/beijing_cleaned.csv`
- Station metadata file:
    - `data/metadata/station_metadata.csv` including:
        - station
        - latitude
        - longitude
        - area_type (e.g. urban / suburban / residential / industrial / traffic-heavy)

## Outputs
- `data/engineered/beijing_feature_engineered.csv`
- Engineered features including:
    - lag variables
    - rolling averages
    - season
    - cyclical time encodings
    - spatial metadata
    - derived meteorological features

---

## Notebook Setup

### Import Required Libraries

We import the Python tools needed for feature engineering:

- `pathlib` – to handle file paths
- `pandas` – to manipulate tabular data
- `matplotlib` – for plotting
- `numpy` – numeric operations
- `seaborn` – enhanced data visualisation

In [1]:
import pandas as pd # data analysis and manipulation
import numpy as np # numerical computing
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # statistical data visualization
from pathlib import Path # filesystem paths

Set up the matplotlib and seaborn themes.

In [2]:

plt.style.use("seaborn-v0_8")
sns.set_theme()

### Set Up Project Paths

We define paths for input and output datasets to ensure the notebook is portable and reproducible.

In [3]:
project_root = Path.cwd().parent # Assuming this script is in a subdirectory of the project root
data_path = project_root / "data" # Path to the data directory

input_path = data_path / "cleaned" / "beijing_cleaned.csv" # Input file path
metadata_path = data_path / "metadata" / "station_metadata.csv" # Metadata file path
output_path = data_path / "engineered" / "beijing_feature_engineered.csv" # Output file path


print("Input path :", input_path) # Print input path
print("Metadata path :", metadata_path) # Print metadata path
print("Output path :", output_path) # Print output path

Input path : /home/robert/Projects/beijing-air-quality/data/cleaned/beijing_cleaned.csv
Metadata path : /home/robert/Projects/beijing-air-quality/data/metadata/station_metadata.csv
Output path : /home/robert/Projects/beijing-air-quality/data/engineered/beijing_feature_engineered.csv


### Load Dataset

Load the cleaned dataset created in Notebook 02, parsing datetime and converting object columns to category type.

In [4]:
df = pd.read_csv(input_path) # Load cleaned data
datetime_cols = ['datetime'] # List of datetime columns
object_cols = df.select_dtypes(include=['object']).columns.difference(datetime_cols) # Identify object type columns
df[datetime_cols] = df[datetime_cols].apply(pd.to_datetime) # Convert datetime columns to datetime type
df[object_cols] = df[object_cols].astype('category') # Convert object columns to category type
df.info() # Display column dtypes, non-null counts and memory usage

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 420768 entries, 0 to 420767
Data columns (total 9 columns):
 #   Column    Non-Null Count   Dtype         
---  ------    --------------   -----         
 0   datetime  420768 non-null  datetime64[ns]
 1   pm2.5     420768 non-null  float64       
 2   temp      420768 non-null  float64       
 3   pres      420768 non-null  float64       
 4   dewp      420768 non-null  float64       
 5   rain      420768 non-null  float64       
 6   wd        420768 non-null  category      
 7   wspm      420768 non-null  float64       
 8   station   420768 non-null  category      
dtypes: category(2), datetime64[ns](1), float64(6)
memory usage: 23.3 MB


In [5]:
df.head()

Unnamed: 0,datetime,pm2.5,temp,pres,dewp,rain,wd,wspm,station
0,2013-03-01,4.0,-0.7,1023.0,-18.8,0.0,NNW,4.4,guanyuan
1,2013-03-01,3.0,-2.3,1020.8,-19.7,0.0,E,0.5,changping
2,2013-03-01,6.0,0.1,1021.1,-18.6,0.0,NW,4.4,gucheng
3,2013-03-01,8.0,-0.7,1023.0,-18.8,0.0,NNW,4.4,wanliu
4,2013-03-01,4.0,-0.7,1023.0,-18.8,0.0,NNW,4.4,aotizhongxin


### Load Station Metadata

Spatial EDA showed strong differences between stations. Latitude, longitude, and area type help us encode spatial variability into the model.

In [6]:
station_metadata = pd.read_csv(metadata_path) # Load station metadata
df = df.merge(station_metadata, on='station', how='left') # Merge metadata with main dataframe
df.head()

Unnamed: 0,datetime,pm2.5,temp,pres,dewp,rain,wd,wspm,station,latitude,longitude,area_type
0,2013-03-01,4.0,-0.7,1023.0,-18.8,0.0,NNW,4.4,guanyuan,39.941746,116.361478,urban
1,2013-03-01,3.0,-2.3,1020.8,-19.7,0.0,E,0.5,changping,40.220772,116.231204,suburban
2,2013-03-01,6.0,0.1,1021.1,-18.6,0.0,NW,4.4,gucheng,39.908156,116.239596,residential
3,2013-03-01,8.0,-0.7,1023.0,-18.8,0.0,NNW,4.4,wanliu,39.990376,116.287252,residential
4,2013-03-01,4.0,-0.7,1023.0,-18.8,0.0,NNW,4.4,aotizhongxin,40.003388,116.407613,urban
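
A left merge silently produces NaN metadata for any station name that does not match. The sketch below shows one way to guard against that, using small stand-in tables with illustrative values rather than the project files. `validate="many_to_one"` is a standard `DataFrame.merge` option that raises an error if a station appears more than once in the metadata.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are illustrative only)
readings = pd.DataFrame({
    "station": ["guanyuan", "changping", "guanyuan"],
    "pm2.5": [4.0, 3.0, 8.0],
})
metadata = pd.DataFrame({
    "station": ["guanyuan", "changping"],
    "area_type": ["urban", "suburban"],
})

# Raises pandas.errors.MergeError if a station is duplicated in the metadata
merged = readings.merge(metadata, on="station", how="left", validate="many_to_one")

# A many-to-one left join must not change the number of rows
rows_preserved = len(merged) == len(readings)

# Stations missing from the metadata would show up as NaN here
unmatched = int(merged["area_type"].isna().sum())

print("rows preserved:", rows_preserved)
print("rows without metadata:", unmatched)
```

The same two checks could be run on `df` right after the merge in the cell above.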


### Sort Dataset

Sorting by station and time ensures lag features and rolling windows operate correctly without leakage across stations.

In [7]:
df = df.sort_values(["station", "datetime"]).reset_index(drop=True)
df.head()

Unnamed: 0,datetime,pm2.5,temp,pres,dewp,rain,wd,wspm,station,latitude,longitude,area_type
0,2013-03-01 00:00:00,4.0,-0.7,1023.0,-18.8,0.0,NNW,4.4,aotizhongxin,40.003388,116.407613,urban
1,2013-03-01 01:00:00,8.0,-1.1,1023.2,-18.2,0.0,N,4.7,aotizhongxin,40.003388,116.407613,urban
2,2013-03-01 02:00:00,7.0,-1.1,1023.5,-18.2,0.0,NNW,5.6,aotizhongxin,40.003388,116.407613,urban
3,2013-03-01 03:00:00,6.0,-1.4,1024.5,-19.4,0.0,NW,3.1,aotizhongxin,40.003388,116.407613,urban
4,2013-03-01 04:00:00,3.0,-2.0,1025.2,-19.5,0.0,N,2.0,aotizhongxin,40.003388,116.407613,urban
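
Row-based lags and rolling windows only correspond to hours if every station has one row per hour with no gaps. A minimal sketch of a check for this, on a toy frame with a deliberately missing hour (the frame and variable names are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "station": ["a"] * 3 + ["b"] * 3,
    "datetime": pd.to_datetime([
        "2013-03-01 00:00", "2013-03-01 01:00", "2013-03-01 02:00",
        "2013-03-01 00:00", "2013-03-01 01:00", "2013-03-01 03:00",  # 02:00 is missing
    ]),
})
toy = toy.sort_values(["station", "datetime"]).reset_index(drop=True)

# Time difference to the previous row within each station (first row per station is NaT)
step = toy.groupby("station")["datetime"].diff()

# Rows whose gap to the previous reading is not exactly one hour
irregular = toy[step.notna() & (step != pd.Timedelta(hours=1))]
n_irregular = len(irregular)

print(irregular)
```

An empty `irregular` frame on the real data would confirm that shifting by `n` rows is the same as shifting by `n` hours.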


## Temporal Feature Engineering

### Extract Datetime Components

EDA showed clear seasonal and hourly patterns. Extracting components helps create interpretable calendar features.

In [8]:
df["year"] = df["datetime"].dt.year
df["month"] = df["datetime"].dt.month
df["day"] = df["datetime"].dt.day
df["hour"] = df["datetime"].dt.hour
df["dayofweek"] = df["datetime"].dt.dayofweek

### Create Seasonal Categories

Monthly averages showed strong seasonality. A categorical season feature makes that pattern explicit and easy to interpret, and it may improve model accuracy.

In [9]:
def assign_season(month):
    if month in [12, 1, 2]:
        return "winter"
    elif month in [3, 4, 5]:
        return "spring"
    elif month in [6, 7, 8]:
        return "summer"
    else:
        return "autumn"

df["season"] = df["month"].apply(assign_season).astype("category")
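
As a hedged alternative, the same month-to-season rule can be written as a dictionary lookup, which avoids a Python-level function call per row and is easy to validate. The names `season_by_month`, `months` and `seasons` are illustrative and not part of the notebook.

```python
import pandas as pd

# Same rule as assign_season, expressed as a lookup table
season_by_month = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}

months = pd.Series(range(1, 13))
seasons = months.map(season_by_month).astype("category")

# Validation: every month is mapped and each season covers three months
all_mapped = bool(seasons.notna().all())
months_per_season = seasons.value_counts()

print(months_per_season)
```

On the real data, `df["month"].map(season_by_month)` would be expected to give the same labels as `assign_season`.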

### Cyclical Encoding for Time Features

Hours and months are cyclical (23 → 0, December → January).
Sine/cosine encoding preserves this continuity, so a model sees 23:00 and 00:00 as neighbours rather than 23 units apart.

In [10]:
df["hour_sin"]  = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"]  = np.cos(2 * np.pi * df["hour"] / 24)

df["month_sin"] = np.sin(2 * np.pi * df["month"] / 12)
df["month_cos"] = np.cos(2 * np.pi * df["month"] / 12)
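
To illustrate why the encoding helps, the sketch below compares distances between hours in the encoded (sin, cos) space. `encode_hour` is a hypothetical helper that applies the same formula as the cell above.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) to a point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

# On the raw scale, 23:00 and 00:00 look maximally far apart
raw_gap = abs(23 - 0)

# In the encoded space, adjacent hours are equally close, including across midnight
d_23_0 = np.linalg.norm(encode_hour(23) - encode_hour(0))
d_0_1 = np.linalg.norm(encode_hour(0) - encode_hour(1))

# Opposite hours sit on opposite sides of the circle
d_0_12 = np.linalg.norm(encode_hour(0) - encode_hour(12))

print(raw_gap, d_23_0, d_0_1, d_0_12)
```

Both columns are needed: the sine alone cannot distinguish 00:00 from 12:00, since it is zero for both.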

## Lag Feature Engineering

### Create Lag Features for PM2.5

EDA showed PM2.5 values depend strongly on previous hours.

Lag features allow the model to incorporate short- and medium-term pollutant persistence.

In [11]:
lags = [1, 3, 6, 12, 24]

for lag in lags:
    df[f"pm2.5_lag_{lag}h"] = df.groupby("station")["pm2.5"].shift(lag)
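
The `groupby("station")` step matters: a plain `shift` would carry the last reading of one station into the first row of the next. A small sketch on toy data (values are illustrative; `naive_lag_1h` is a hypothetical column added only for comparison):

```python
import pandas as pd

toy = pd.DataFrame({
    "station": ["a", "a", "a", "b", "b", "b"],
    "pm2.5": [10.0, 20.0, 30.0, 100.0, 200.0, 300.0],
})

# Grouped shift: each station starts with NaN
toy["pm2.5_lag_1h"] = toy.groupby("station")["pm2.5"].shift(1)

# Ungrouped shift: the first row of station "b" receives a value from station "a"
toy["naive_lag_1h"] = toy["pm2.5"].shift(1)

print(toy)
```

With the grouped version, the first `lag` rows of every station are NaN, so the modelling notebook needs to drop or impute them.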

## Rolling Feature Engineering

### Rolling Means

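Rolling averages smooth sudden spikes and capture sustained pollution episodes, which can be useful predictors of near-term PM2.5. Each window includes the current hour, and the first `w - 1` rows of every station are NaN until the window fills.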

In [12]:
windows = [3, 6, 12, 24]

for w in windows:
    df[f"pm2.5_roll_{w}h_mean"] = (
        df.groupby("station")["pm2.5"].rolling(window=w).mean().reset_index(level=0, drop=True)
    )
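
Because `rolling(window=w)` ends each window at the current row, these means contain the current hour's PM2.5. If the modelling target is that same value (an assumption; the target is not defined in this section), the feature would leak the target. One possible variant, sketched on toy data, shifts by one hour within each station before rolling; the column name `pm2.5_prev_roll_3h_mean` is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "station": ["a"] * 5,
    "pm2.5": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Window ends at the current hour (same pattern as the cell above)
toy["pm2.5_roll_3h_mean"] = (
    toy.groupby("station")["pm2.5"].rolling(window=3).mean().reset_index(level=0, drop=True)
)

# Window ends at the previous hour: shift first, then roll, all within the station
toy["pm2.5_prev_roll_3h_mean"] = (
    toy.groupby("station")["pm2.5"]
    .transform(lambda s: s.shift(1).rolling(window=3).mean())
)

print(toy)
```

In the last row, the current-inclusive mean averages 30, 40 and 50, while the shifted version averages 20, 30 and 40 and so uses only past readings.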