# Beijing Air Quality
## 📘 Notebook 08 – Feature Engineering


| Field         | Description                                        |
|:--------------|:---------------------------------------------------|
| Author:       | Robert Steven Elliott                              |
| Course:       | Code Institute – Data Analytics with AI Bootcamp   |
| Project Type: | Capstone                                           |
| Date:         | December 2025                                      |

This project complies with the CC BY 4.0 licence by including proper attribution.

## Purpose of This Notebook

This notebook prepares the dataset for machine learning modelling used in Hypothesis 5, which evaluates whether lag features improve PM2.5 forecasting performance.


## Objectives

- Load the cleaned dataset and construct modelling-ready features.

- Engineer the following feature classes:

### 1. Interaction Features
Enhance meteorological effects:

- temp_pres_interaction
- dew_point_spread
- rain_binary
- relative_humidity

### 2. Cyclical Time Encodings

Preserve periodic structure:

- hour_sin, hour_cos
- month_sin, month_cos

### 3. Final Modelling Dataset Output

- Save engineered dataset to data/engineered/
- Validate shape, missing data, and feature distributions

The final engineered dataset will be used in Notebook 10.


## Inputs
- Cleaned dataset: `data/cleaned/beijing_cleaned.csv`


## Outputs
- `data/engineered/beijing_engineered.csv`
- `data/engineered/_metadata.yml`


## Citation  
This project uses data from:

Chen, Song (2017). *Beijing Multi-Site Air Quality.*  
UCI Machine Learning Repository. Licensed under **CC BY 4.0**.  
DOI: https://doi.org/10.24432/C5RK5G  
Kaggle mirror by Manu Siddhartha.

---

## Notebook Setup

### Import Required Libraries

(The following libraries support path handling, numerical operations, and data manipulation.)

In [None]:
import sys # system-specific parameters and functions
import numpy as np # numerical operations
import pandas as pd # data analysis and manipulation
import os # operating system dependent functionality
from pathlib import Path # filesystem paths

### Set Up Project Paths

In [None]:
PROJECT_ROOT = Path.cwd().parent # Assuming this script is in a subdirectory of the project root
DATA_PATH = PROJECT_ROOT / "data" # Path to the data directory
sys.path.append(str(PROJECT_ROOT)) # Add project root to sys.path

INPUT_PATH = DATA_PATH / "cleaned" / "beijing_cleaned.csv" # Input file path
OUTPUT_PATH = DATA_PATH / "engineered" / "beijing_engineered.csv" # Output file path

OUTPUT_PATH.parent.mkdir(parents=True, exist_ok=True) # Create output directory if it doesn't exist

print("Input path :", INPUT_PATH) # Print input path
print("Output path :", OUTPUT_PATH) # Print output path

## Initialise the Metadata Builder

In [None]:
from utils.metadata_builder import MetadataBuilder
from utils.data_loader import _load_csv

builder = MetadataBuilder(
    "data/engineered/beijing_feature_engineered.parquet",
    "Beijing Air Quality – Feature Engineered Dataset",
    "Dataset with lag features, rolling windows, seasonal categories, cyclical encodings, and spatial metadata."
)

builder.add_creation_script("notebooks/08_feature_engineering.ipynb")

### Load Dataset

Load the cleaned dataset created in Notebook 02, parsing datetime and converting object columns to category type.

In [None]:
df = _load_csv(INPUT_PATH) # Load cleaned data
print("Initial dataset shape:", df.shape) # Print initial dataset shape
df.info() # Display information about the dataframe
df.head() # Display first few rows (last expression in the cell, so the notebook renders it)

### Sort Dataset

Sorting by station and time ensures lag features and rolling windows operate correctly without leakage across stations.

In [None]:
df = df.sort_values(["station", "datetime"]).reset_index(drop=True) # Sort by station and datetime
df.head() # Display first few rows after sorting
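
To illustrate why the sort order matters, here is a minimal sketch on a toy frame. The column name `pm25` and all values are assumptions made up for illustration, not taken from the dataset. It contrasts a per-station lag (`groupby("station").shift(1)`) with a plain `shift(1)`, which carries the last reading of one station into the first row of the next.

```python
import pandas as pd

# Toy data: two stations, three hourly readings each (values are made up)
toy = pd.DataFrame({
    "station": ["B", "A", "B", "A", "A", "B"],
    "datetime": pd.to_datetime([
        "2017-01-01 01:00", "2017-01-01 00:00", "2017-01-01 00:00",
        "2017-01-01 02:00", "2017-01-01 01:00", "2017-01-01 02:00",
    ]),
    "pm25": [21.0, 10.0, 20.0, 12.0, 11.0, 22.0],
})

# Same sort as the notebook: station first, then time
toy = toy.sort_values(["station", "datetime"]).reset_index(drop=True)

# Lag computed within each station: the first row of every station is NaN
toy["pm25_lag1"] = toy.groupby("station")["pm25"].shift(1)

# Plain shift ignores station boundaries: station B's first row
# receives station A's last reading
toy["pm25_naive_lag1"] = toy["pm25"].shift(1)

print(toy)
```

Sorting alone does not prevent leakage; it only guarantees that rows within a station are in time order, so the grouped shift returns the previous hour.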

## Temporal Feature Engineering

### Cyclical Encoding for Time Features

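Hours and months are cyclical (23 → 0, December → January).
Sin/cos encoding preserves this continuity: hour 23 and hour 0 become neighbouring points on a circle instead of values 23 units apart, which helps ML models that rely on numeric distance.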

In [None]:
df["hour_sin"]  = np.sin(2 * np.pi * df["hour"] / 24) # Encode hour as cyclical feature using sine function
df["hour_cos"]  = np.cos(2 * np.pi * df["hour"] / 24) # Encode hour as cyclical feature using cosine function
builder.add_step("Encoded hour as cyclical sin/cos") # Add step to metadata
df["month_sin"] = np.sin(2 * np.pi * df["month"] / 12) # Encode month as cyclical feature using sine function
df["month_cos"] = np.cos(2 * np.pi * df["month"] / 12) # Encode month as cyclical feature using cosine function
builder.add_step("Encoded month as cyclical sin/cos") # Add step to metadata
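
The continuity claim can be checked numerically. The sketch below is illustrative only: `encode_cyclical` is a hypothetical helper, not part of this project. It maps hours onto the unit circle and compares the distance between 23:00 and 00:00 with the distance between 00:00 and 01:00.

```python
import numpy as np

def encode_cyclical(values, period):
    """Hypothetical helper: return (sin, cos) coordinates for a periodic value."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

hours = np.array([0, 1, 23])
hour_sin, hour_cos = encode_cyclical(hours, 24)
points = np.column_stack([hour_sin, hour_cos])

# Euclidean distance between encoded hours
dist_23_to_0 = np.linalg.norm(points[2] - points[0])
dist_0_to_1 = np.linalg.norm(points[1] - points[0])

print("23 -> 0:", dist_23_to_0)
print(" 0 -> 1:", dist_0_to_1)
```

Both distances come out equal, whereas on the raw scale 23 and 0 are the two most distant hours.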

## Derived Meteorological Features

### Dew Point Spread

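EDA showed strong TEMP–DEWP correlations.

Their difference (the dew point spread) indicates how close the air is to saturation: a small spread means moist air, a large spread means dry air. Moisture levels and stability both affect PM2.5 dispersion.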

In [None]:
df["dew_point_spread"] = df["temperature"] - df["dew_point"] # Dew point spread feature
builder.add_step("Created dew_point_spread feature") # Add step to metadata

### Temperature–Pressure Interaction

High pressure + low temperature often leads to stagnant air and high PM2.5.

An interaction term helps the model learn this relationship.

In [None]:
df["temp_pres_interaction"] = df["temperature"] * df["pressure"] # Temperature-pressure interaction feature
builder.add_step("Created temperature-pressure interaction feature") # Add step to metadata

### Rainfall Binary Indicator

A binary indicator captures the occurrence of any rainfall, which often has
a more meaningful effect on PM2.5 cleansing than raw rainfall amounts.

In [None]:
df["rain_binary"] = (df["rain"] > 0).astype(int) # Binary rain feature
builder.add_step("Created binary rain feature") # Add step to metadata
print("Dataset shape after adding rain_binary:", df.shape) # Print dataset shape at this stage (relative humidity is added below)
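
One detail worth knowing about this pattern is how it treats missing values. The sketch below uses a made-up `rain` series and assumes nothing about the cleaned dataset, which may contain no gaps. It shows that `NaN > 0` evaluates to `False`, so a missing reading silently becomes 0 ("no rain"), and gives one possible alternative that keeps the gap.

```python
import numpy as np
import pandas as pd

# Made-up rainfall amounts, including one missing reading
rain = pd.Series([0.0, 0.3, np.nan, 2.1])

# Same expression as the notebook: NaN > 0 is False, so NaN maps to 0
rain_binary = (rain > 0).astype(int)
print(rain_binary.tolist())

# Alternative (an assumption, not the notebook's method): keep missing as NaN
rain_flag = rain.gt(0).astype(float).where(rain.notna())
print(rain_flag.tolist())
```

If the cleaned data has no missing `rain` values, the two variants agree and the simpler expression is sufficient.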

### Humidity

The relative humidity (RH) is calculated from `temperature` and `dew_point` using the Magnus approximation below (formula suggested by ChatGPT).

$$
RH = 100 \times \frac{\exp\left(\frac{17.625 \times DEWP}{243.04 + DEWP}\right)}{\exp\left(\frac{17.625 \times TEMP}{243.04 + TEMP}\right)}
$$

where:
- TEMP = temperature in $^{\circ}C$
- DEWP = dew point in $^{\circ}C$

This calculates humidity as a percentage (0–100%); the code clips the result to that range.

In [None]:
def compute_relative_humidity(temp: pd.Series, dewp: pd.Series) -> pd.Series:
    """
    Compute relative humidity from temperature and dew point (Magnus approximation).

    Args:
        temp (pd.Series): Temperature in degrees Celsius.
        dewp (pd.Series): Dew point in degrees Celsius.

    Returns:
        pd.Series: Relative humidity as a percentage, clipped to [0, 100].
    """
    temp = temp.astype(float)
    dewp = dewp.astype(float)

    a = 17.625  # Magnus coefficient (dimensionless)
    b = 243.04  # Magnus coefficient (degrees Celsius)

    alpha = (a * dewp) / (b + dewp)
    beta = (a * temp) / (b + temp)

    rh = 100 * np.exp(alpha - beta)
    # Clip humidity to the valid range
    rh = np.clip(rh, 0, 100)

    return rh

In [None]:
# Compute relative humidity and add it as a new feature
df["relative_humidity"] = compute_relative_humidity(df["temperature"], df["dew_point"])
builder.add_step("Computed relative humidity from temperature and dew point") # Add step to metadata
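
As a quick sanity check on the formula, the sketch below evaluates it for a few hand-picked temperature and dew point pairs. The inputs are illustrative values, not rows from the dataset, and `rh_from_dew_point` is a standalone stand-in for the notebook function without the clipping step, so that out-of-range results stay visible.

```python
import numpy as np

A, B = 17.625, 243.04  # Same coefficients as in the notebook

def rh_from_dew_point(temp_c, dewp_c):
    """Unclipped relative humidity (%) from temperature and dew point in Celsius."""
    alpha = A * dewp_c / (B + dewp_c)
    beta = A * temp_c / (B + temp_c)
    return 100 * np.exp(alpha - beta)

rh_mild = rh_from_dew_point(20.0, 10.0)      # 10 degree spread
rh_saturated = rh_from_dew_point(5.0, 5.0)   # zero spread
rh_invalid = rh_from_dew_point(10.0, 12.0)   # dew point above temperature

print(rh_mild, rh_saturated, rh_invalid)
```

A 10 °C spread at 20 °C gives roughly 52 to 53%, a zero spread gives 100%, and a dew point above the air temperature (physically inconsistent input) produces a value above 100%, which is the case the `np.clip` call guards against.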

## Export Final Feature Engineered Dataset

We export the final dataset for hypothesis testing and modelling.

In [None]:
df.to_csv(OUTPUT_PATH, index=False) # Save the feature-engineered dataframe to CSV
builder.add_step("Saved dataset as CSV") # Add step to metadata
print("Feature-engineered data saved to :", OUTPUT_PATH) # Print confirmation message
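
The objectives list validation of shape, missing data, and feature distributions. A minimal sketch of such checks is given below, assuming the engineered column names used in this notebook. `validate_engineered` is a hypothetical helper written for illustration and is demonstrated on a toy frame, not on the real dataset.

```python
import numpy as np
import pandas as pd

def validate_engineered(frame: pd.DataFrame) -> dict:
    """Hypothetical helper: basic sanity checks on the engineered columns."""
    bounded = ["hour_sin", "hour_cos", "month_sin", "month_cos"]
    return {
        "shape": frame.shape,
        "missing_per_column": frame.isna().sum().to_dict(),
        # sin/cos encodings must lie in [-1, 1] (small tolerance for rounding)
        "cyclical_in_range": bool(frame[bounded].abs().le(1.0 + 1e-9).all().all()),
        # between() treats NaN as out of range
        "rh_in_range": bool(frame["relative_humidity"].between(0, 100).all()),
        "rain_is_binary": bool(frame["rain_binary"].isin([0, 1]).all()),
    }

# Toy frame with made-up values, mimicking the engineered columns
hours = np.array([0, 6, 12])
months = np.array([1, 6, 12])
toy = pd.DataFrame({
    "hour_sin": np.sin(2 * np.pi * hours / 24),
    "hour_cos": np.cos(2 * np.pi * hours / 24),
    "month_sin": np.sin(2 * np.pi * months / 12),
    "month_cos": np.cos(2 * np.pi * months / 12),
    "relative_humidity": [35.0, 80.5, 100.0],
    "rain_binary": [0, 1, 0],
})

report = validate_engineered(toy)
print(report)
```

On the real data the same function could be called as `validate_engineered(df)` before or after the export; `df.describe()` would cover the distribution check.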

## Save Metadata file

In [None]:
builder.add_columns(df.columns) # Add the dataframe's columns
builder.add_record_count_from_df(df) # Set record count from the engineered dataframe
builder.add_record_stats(OUTPUT_PATH) # Add record statistics

builder.write(PROJECT_ROOT / "data" / "engineered" / "_metadata.yml") # Write metadata to YAML

## Next Steps

The feature-engineered dataset is now ready for modelling.
In the next notebook (Notebook 10), Hypothesis 5 will evaluate whether lag features
improve the predictive performance of PM2.5 forecasting models, using the dataset engineered here.

---
### AI Assistance Note
Some narrative text and minor formatting or wording improvements in this notebook were supported by AI-assisted tools (ChatGPT for documentation clarity, Copilot for small routine code suggestions, and Grammarly for proofreading). All analysis, code logic, feature engineering, modelling, and interpretations were independently created by the author.