# Beijing Air Quality
## 📘 Notebook 04 – Feature Engineering


| Field         | Description                                        |
|:--------------|:---------------------------------------------------|
| Author:       | Robert Steven Elliott                              |
| Course:       | Code Institute – Data Analytics with AI Bootcamp   |
| Project Type: | Capstone                                           |
| Date:         | December 2025                                      |

This project complies with the CC BY 4.0 licence by including proper attribution.


## Objectives

This notebook creates all engineered features required for hypothesis validation (Milestone 5) and machine learning modelling (Milestone 6).

Specifically, this notebook will:

- Generate lag features for PM2.5 and selected meteorological variables
- Create rolling window statistics
- Add seasonal features
- Add cyclical datetime encodings
- Integrate spatial metadata (station latitude, longitude, area type)
- Create derived weather features such as dew point spread
- Validate each transformation
- Export the final feature-engineered dataset for modelling

## Inputs
- `data/cleaned/beijing_cleaned.parquet`
- Station metadata file:
    - `data/metadata/station_metadata.csv` including:
        - station
        - latitude
        - longitude
        - area_type (urban / suburban / industrial / traffic-heavy)

## Outputs
- `data/engineered/beijing_feature_engineered.parquet` (also exported as `.csv`)
- Engineered features including:
    - lag variables
    - rolling averages
    - season
    - cyclical time encodings
    - spatial metadata
    - derived meteorological features

## Citation  
This project uses data from:

Chen, Song (2017). *Beijing Multi-Site Air Quality.*  
UCI Machine Learning Repository. Licensed under **CC BY 4.0**.  
DOI: https://doi.org/10.24432/C5RK5G  
Kaggle mirror by Manu Siddhartha.

---

## Notebook Setup

### Import Required Libraries

We import the Python tools needed for feature engineering:

- `sys` – to add the project root to the import path
- `pathlib` – to handle file paths
- `pandas` – to manipulate tabular data
- `numpy` – for numeric operations
- `matplotlib` – for plotting
- `seaborn` – for enhanced data visualisation

In [None]:
import sys # system-specific parameters and functions
import pandas as pd # data analysis and manipulation
import numpy as np # numerical computing
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # statistical data visualization
from pathlib import Path # filesystem paths

Set up the matplotlib and seaborn themes.

In [None]:

plt.style.use("seaborn-v0_8")
sns.set_theme()

### Set Up Project Paths

We define paths for input and output datasets to ensure the notebook is portable and reproducible.

In [None]:
PROJECT_ROOT = Path.cwd().parent # Assuming this script is in a subdirectory of the project root
DATA_PATH = PROJECT_ROOT / "data" # Path to the data directory
sys.path.append(str(PROJECT_ROOT)) # Add project root to sys.path

INPUT_PATH = DATA_PATH / "cleaned" / "beijing_cleaned.parquet" # Input file path
OUTPUT_PATH = DATA_PATH / "engineered" / "beijing_feature_engineered" # Output file path

print("Input path :", INPUT_PATH) # Print input path
print("Output path :", OUTPUT_PATH) # Print output path

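## Initialise Metadata Builder

`MetadataBuilder` is a project helper imported from `src.metadata_builder`. It records each processing step so that the exported dataset is accompanied by a metadata file.
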
In [None]:
from src.metadata_builder import MetadataBuilder

builder = MetadataBuilder(
    "data/engineered/beijing_feature_engineered.parquet",
    "Beijing Air Quality â€“ Feature Engineered Dataset",
    "Dataset with lag features, rolling windows, seasonal categories, cyclical encodings, and spatial metadata."
)

builder.add_creation_script("notebooks/04_feature_engineering.ipynb")

### Load Dataset

Load the cleaned dataset created in Notebook 02. Parquet stores column dtypes alongside the data, so datetime and category columns should come back with the types they were saved with and need no re-parsing.

In [None]:
df = pd.read_parquet(INPUT_PATH) # Load cleaned data
df.info() # Display information about the dataframe
df.head() # Display first few rows of the dataframe

### Sort Dataset

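Sorting by station and time puts each station's rows in chronological order, which the lag and rolling features below rely on. Grouping by station in those steps then prevents values leaking from one station into another.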

In [None]:
df = df.sort_values(["station", "datetime"]).reset_index(drop=True)
df.head()
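
Because `shift` and `rolling` count rows rather than hours, a lag of 3 only means "three hours ago" if each station's timestamps are contiguous. A minimal sketch of a gap check on a toy frame follows; the helper name `count_hourly_gaps` is hypothetical and not part of the project, and the column names `station` and `datetime` are taken from the sort above.

```python
import pandas as pd

def count_hourly_gaps(frame: pd.DataFrame) -> pd.Series:
    """Count, per station, consecutive rows that are not exactly one hour apart."""
    ordered = frame.sort_values(["station", "datetime"])
    previous = ordered.groupby("station", observed=True)["datetime"].shift(1)
    step = ordered["datetime"] - previous
    # The first row of each station has no predecessor (NaT) and is not a gap.
    is_gap = step.notna() & (step != pd.Timedelta(hours=1))
    return is_gap.groupby(ordered["station"], observed=True).sum()

# Toy data: station A skips 02:00, station B is contiguous.
toy = pd.DataFrame({
    "station": ["A", "A", "A", "B", "B"],
    "datetime": pd.to_datetime([
        "2017-01-01 00:00", "2017-01-01 01:00", "2017-01-01 03:00",
        "2017-01-01 00:00", "2017-01-01 01:00",
    ]),
})
gaps = count_hourly_gaps(toy)
print(gaps)
```

If any station reports gaps, one option is to reindex it to a full hourly range before creating lags, so that missing hours become explicit `NaN` rows.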

## Temporal Feature Engineering

### Extract Datetime Components

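EDA showed clear seasonal and hourly patterns. Extracting components helps create interpretable calendar features. The cleaned dataset is assumed to already carry `hour` and `month` (both are used below), so only the day of week (0 = Monday) is added here.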

In [None]:
df["dayofweek"] = df["datetime"].dt.dayofweek
builder.add_step("Extracted datetime feature: dayofweek") # Add step to metadata
df.head()
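
The objectives list a seasonal feature. If the cleaned dataset does not already carry a `season` column, one possible sketch maps each month to a meteorological season. The mapping and the name `SEASON_BY_MONTH` are assumptions for illustration, not taken from the project.

```python
import pandas as pd

# Assumed convention: northern-hemisphere meteorological seasons.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}

toy = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2016-01-15", "2016-04-15", "2016-07-15", "2016-10-15", "2016-12-15",
    ]),
})
# Map the month number to a season label and store it as a category.
toy["season"] = toy["datetime"].dt.month.map(SEASON_BY_MONTH).astype("category")
print(toy)
```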

### Cyclical Encoding for Time Features

Hours and months are cyclical (23 → 0, December → January).
Sin/cos encoding preserves that continuity, which can help models that would otherwise treat hour 23 and hour 0 as far apart.

In [None]:
df["hour_sin"]  = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"]  = np.cos(2 * np.pi * df["hour"] / 24)

df["month_sin"] = np.sin(2 * np.pi * df["month"] / 12)
df["month_cos"] = np.cos(2 * np.pi * df["month"] / 12)
builder.add_step("Encoded hour and month as cyclical sin/cos") # Add step to metadata
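
To illustrate why the encoding helps, the sketch below places a few hours on the unit circle and compares distances. The helper names `encode_cyclical` and `circle_distance` are hypothetical; the formula is the same one used in the cell above.

```python
import numpy as np

def encode_cyclical(values, period):
    """Return (sin, cos) coordinates of values on a circle with the given period."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

hours = np.array([0, 1, 12, 23])
hour_sin, hour_cos = encode_cyclical(hours, 24)

def circle_distance(i, j):
    """Euclidean distance between two encoded hours, by position in `hours`."""
    return float(np.hypot(hour_sin[i] - hour_sin[j], hour_cos[i] - hour_cos[j]))

d_23_0 = circle_distance(3, 0)  # 23:00 vs 00:00
d_1_0 = circle_distance(1, 0)   # 01:00 vs 00:00
d_12_0 = circle_distance(2, 0)  # 12:00 vs 00:00
print(d_23_0, d_1_0, d_12_0)
```

In the encoded space 23:00 is exactly as close to midnight as 01:00 is, while noon sits on the opposite side of the circle. With the raw integer hour, 23 and 0 would be the two most distant values.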

## Lag Feature Engineering

### Create Lag Features for PM2.5

EDA showed PM2.5 values depend strongly on previous hours.

Lag features allow the model to incorporate short- and medium-term pollutant persistence.

In [None]:
lags = [1, 3, 6, 12, 24]

for lag in lags:
    df[f"pm25_lag_{lag}h"] = df.groupby("station", observed=False)["pm25"].shift(lag)

builder.add_step("Created lag features (1h, 3h, 6h, 12h, 24h)") # Add step to metadata
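
The toy example below shows why the shift is applied per station rather than on the whole column. It uses invented values and two made-up stations purely for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "station": ["A", "A", "A", "B", "B", "B"],
    "pm25": [10.0, 20.0, 30.0, 100.0, 200.0, 300.0],
})

# Grouped shift: each station starts with a missing lag value.
toy["pm25_lag_1h"] = toy.groupby("station")["pm25"].shift(1)

# Ungrouped shift: the first row of station B inherits station A's last value.
toy["naive_lag_1h"] = toy["pm25"].shift(1)
print(toy)
```

With the grouped version, the first `lag` rows of every station are `NaN`. Those rows need to be dropped or imputed before modelling.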

## Rolling Feature Engineering

### Rolling Means

Rolling averages smooth sudden spikes and capture sustained pollution episodes, which are highly predictive.

In [None]:
windows = [3, 6, 12, 24]

for w in windows:
    df[f"pm25_roll_{w}h_mean"] = (
        df.groupby("station", observed=False)["pm25"].rolling(window=w).mean().reset_index(level=0, drop=True)
    )
builder.add_step("Created rolling mean features (3h, 6h, 12h, 24h)") # Add step to metadata
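
Note that `rolling(window=w).mean()` includes the current hour's PM2.5 in each window. If PM2.5 at the same timestamp is later used as the prediction target, the feature would contain part of the target. One possible variant shifts by one hour before rolling, so each window covers only past values. This is an assumption about downstream use rather than the notebook's method, and the column name `pm25_roll_3h_mean_past` is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "station": ["A"] * 5 + ["B"] * 5,
    "pm25": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})

# Shift by one row within each station, then average the previous three values.
toy["pm25_roll_3h_mean_past"] = (
    toy.groupby("station")["pm25"]
    .transform(lambda s: s.shift(1).rolling(window=3).mean())
)
print(toy)
```

The price is one extra missing row per station: a window of 3 now needs three earlier observations, so the first three rows of each station are `NaN`.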

## Derived Meteorological Features

### Dew Point Spread

EDA showed strong TEMP–DEWP correlations.

Their difference indicates moisture levels and stability, which affect PM2.5 dispersion.

In [None]:
df["dewpoint_spread"] = df["temperature"] - df["dew_point"]
builder.add_step("Created dewpoint spread feature") # Add step to metadata

### Temperature–Pressure Interaction

High pressure + low temperature often leads to stagnant air and high PM2.5.

An interaction term helps the model learn this relationship.

In [None]:
df["temp_pres_interaction"] = df["temperature"] * df["pressure"]
builder.add_step("Created temperature-pressure interaction feature") # Add step to metadata
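
One caveat worth checking: pressure varies by only about one percent around its mean, so the raw product can end up almost perfectly correlated with temperature itself. The sketch below illustrates this on synthetic data (the means and spreads are invented, not taken from the dataset) and compares it with a mean-centred product, a common alternative that is offered here as a suggestion rather than as the notebook's method.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "temperature": rng.normal(13.0, 11.0, size=1000),  # synthetic, degrees C
    "pressure": rng.normal(1010.0, 10.0, size=1000),   # synthetic, hPa
})

# Raw product, as in the cell above.
raw = toy["temperature"] * toy["pressure"]

# Centred product: subtract each variable's mean before multiplying.
centred = (
    (toy["temperature"] - toy["temperature"].mean())
    * (toy["pressure"] - toy["pressure"].mean())
)

corr_raw = toy["temperature"].corr(raw)
corr_centred = toy["temperature"].corr(centred)
print(f"raw: {corr_raw:.4f}, centred: {corr_centred:.4f}")
```

Tree-based models are largely indifferent to this, but for linear models a centred interaction tends to carry information that the temperature column does not already provide.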

### Rainfall Binary Indicator

Rainfall events cleanse the air but occur infrequently, so the raw values are mostly zero.

A binary indicator can capture the presence of rain more robustly than the sparse raw amounts.

In [None]:
df["rain_binary"] = (df["rain"] > 0).astype(int)
builder.add_step("Created binary rain feature") # Add step to metadata
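
Before exporting, the objectives call for validating each transformation. A minimal sketch of such checks is shown below on a toy frame. The function name `validate_features` is hypothetical, and the three checks are examples rather than a complete validation suite.

```python
import numpy as np
import pandas as pd

def validate_features(frame: pd.DataFrame) -> dict:
    """Run a few sanity checks on engineered columns and return pass/fail flags."""
    checks = {}
    # sin^2 + cos^2 must equal 1 for a valid cyclical pair.
    checks["hour_on_unit_circle"] = bool(
        np.allclose(frame["hour_sin"] ** 2 + frame["hour_cos"] ** 2, 1.0)
    )
    # The rain indicator may only take the values 0 and 1.
    checks["rain_is_binary"] = bool(frame["rain_binary"].isin([0, 1]).all())
    # The first row of each station can never have a 1-hour lag value.
    first_rows = frame.groupby("station", observed=True).head(1)
    checks["lag_starts_missing"] = bool(first_rows["pm25_lag_1h"].isna().all())
    return checks

# Toy frame built with the same transformations as the cells above.
toy = pd.DataFrame({
    "station": ["A", "A", "B", "B"],
    "hour": [0, 1, 0, 1],
    "pm25": [10.0, 12.0, 50.0, 55.0],
    "rain": [0.0, 0.4, 0.0, 0.0],
})
toy["hour_sin"] = np.sin(2 * np.pi * toy["hour"] / 24)
toy["hour_cos"] = np.cos(2 * np.pi * toy["hour"] / 24)
toy["pm25_lag_1h"] = toy.groupby("station")["pm25"].shift(1)
toy["rain_binary"] = (toy["rain"] > 0).astype(int)

results = validate_features(toy)
print(results)
```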

## Export Final Feature Engineered Dataset

We export the final dataset for hypothesis testing and modelling.

In [None]:
df.to_parquet(OUTPUT_PATH.with_suffix('.parquet'), index=False) # Save the feature-engineered dataframe to Parquet
builder.add_step("Saved dataset as Parquet for GitHub compliance") # Add step to metadata
print("Feature-engineered data saved to :", OUTPUT_PATH.with_suffix('.parquet')) # Print confirmation message

df.to_csv(OUTPUT_PATH.with_suffix('.csv'), index=False) # Save the feature-engineered dataframe to CSV
builder.add_step("Saved dataset as CSV for GitHub compliance") # Add step to metadata
print("Feature-engineered data saved to :", OUTPUT_PATH.with_suffix('.csv')) # Print confirmation message


## Save Metadata File

In [None]:
builder.add_columns(df.columns) # Add the dataframe's columns
builder.add_record_count_from_df(df) # Set record count from the engineered dataframe
builder.add_record_stats(OUTPUT_PATH) # Add record statistics

builder.write(PROJECT_ROOT / "data" /"engineered" / "_metadata.yml") # Write metadata to YAML