# Beijing Air Quality
## 📘 Notebook 04 – Feature Engineering


| Field         | Description                                        |
|:--------------|:---------------------------------------------------|
| Author:       |	Robert Steven Elliott                            |
| Course:       |	Code Institute – Data Analytics with AI Bootcamp   |
| Project Type: |	Capstone                                         |
| Date:         |	December 2025                                    |

This project complies with the CC BY 4.0 licence by including proper attribution.


## Objectives

This notebook creates all engineered features required for hypothesis validation (Milestone 5) and machine learning modelling (Milestone 6).

Specifically, this notebook will:

- Generate lag features for PM2.5 and selected meteorological variables
- Create rolling window statistics
- Add seasonal features
- Add cyclical datetime encodings
- Integrate spatial metadata (station latitude, longitude, area type)
- Create derived weather features such as dew point spread
- Validate each transformation
- Export the final feature-engineered dataset for modelling

## Inputs
- `data/cleaned/beijing_cleaned.csv`
- Station metadata file:
    - `data/metadata/station_metadata.csv` including:
        - station
        - latitude
        - longitude
        - area_type (urban / suburban / industrial / traffic-heavy)

## Outputs
- `data/engineered/beijing_feature_engineered.parquet` and `data/engineered/beijing_feature_engineered.csv`
- Engineered features including:
    - lag variables
    - rolling averages
    - season
    - cyclical time encodings
    - spatial metadata
    - derived meteorological features

## Citation  
This project uses data from:

Chen, Song (2017). *Beijing Multi-Site Air Quality.*  
UCI Machine Learning Repository. Licensed under **CC BY 4.0**.  
DOI: https://doi.org/10.24432/C5RK5G  
Kaggle mirror by Manu Siddhartha.

---

## Notebook Setup

### Import Required Libraries

We import the Python tools needed for feature engineering:

- `sys` – to extend the module search path
- `pandas` – to manipulate tabular data
- `numpy` – numeric operations
- `matplotlib` – for plotting
- `seaborn` – enhanced data visualisation
- `pathlib` – to handle file paths

In [1]:
import sys # system-specific parameters and functions
import pandas as pd # data analysis and manipulation
import numpy as np # numerical computing
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # statistical data visualization
from pathlib import Path # filesystem paths

Set up the matplotlib and seaborn themes.

In [2]:

plt.style.use("seaborn-v0_8")
sns.set_theme()

### Set Up Project Paths

We define paths for input and output datasets to ensure the notebook is portable and reproducible.

In [3]:
PROJECT_ROOT = Path.cwd().parent # Assuming this script is in a subdirectory of the project root
DATA_PATH = PROJECT_ROOT / "data" # Path to the data directory
sys.path.append(str(PROJECT_ROOT)) # Add project root to sys.path

INPUT_PATH = DATA_PATH / "cleaned" / "beijing_cleaned.csv" # Input file path
OUTPUT_PATH = DATA_PATH / "engineered" / "beijing_feature_engineered" # Output file path

print("Input path :", INPUT_PATH) # Print input path
print("Output path :", OUTPUT_PATH) # Print output path

Input path : /home/robert/Projects/beijing-air-quality/data/cleaned/beijing_cleaned.csv
Output path : /home/robert/Projects/beijing-air-quality/data/engineered/beijing_feature_engineered


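## Initialise the Metadata Builder

`MetadataBuilder` is a project helper imported from `src.metadata_builder`. It records each processing step so that the dataset's provenance can be written to a YAML file at the end of the notebook.
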
In [4]:
from src.metadata_builder import MetadataBuilder

builder = MetadataBuilder(
    "data/engineered/beijing_feature_engineered.parquet",
    "Beijing Air Quality – Feature Engineered Dataset",
    "Dataset with lag features, rolling windows, seasonal categories, cyclical encodings, and spatial metadata."
)

builder.add_creation_script("notebooks/04_feature_engineering.ipynb")

### Load Dataset

Load the cleaned dataset created in Notebook 02, parsing datetime and converting object columns to category type.

In [5]:
df = pd.read_csv(INPUT_PATH) # Load cleaned data
datetime_cols = ['datetime'] # List of datetime columns
object_cols = df.select_dtypes(include=['object']).columns.difference(datetime_cols) # Identify object type columns
df[datetime_cols] = df[datetime_cols].apply(pd.to_datetime) # Convert datetime columns to datetime type
df[object_cols] = df[object_cols].astype('category') # Convert object columns to category type
df.info() # Display column dtypes, non-null counts and memory usage

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 403776 entries, 0 to 403775
Data columns (total 12 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   datetime        403776 non-null  datetime64[ns]
 1   pm25            403776 non-null  float64       
 2   temperature     403776 non-null  float64       
 3   pressure        403776 non-null  float64       
 4   dew_point       403776 non-null  float64       
 5   rain            403776 non-null  float64       
 6   wind_direction  403776 non-null  category      
 7   wind_speed      403776 non-null  float64       
 8   station         403776 non-null  category      
 9   latitude        403776 non-null  float64       
 10  longitude       403776 non-null  float64       
 11  area_type       403776 non-null  category      
dtypes: category(3), datetime64[ns](1), float64(8)
memory usage: 28.9 MB


In [6]:
df.head()

Unnamed: 0,datetime,pm25,temperature,pressure,dew_point,rain,wind_direction,wind_speed,station,latitude,longitude,area_type
0,2013-03-01,4.0,-0.7,1023.0,-18.8,0.0,NNW,4.4,guanyuan,39.941746,116.361478,urban
1,2013-03-01,3.0,-2.3,1020.8,-19.7,0.0,E,0.5,changping,40.220772,116.231204,suburban
2,2013-03-01,6.0,0.1,1021.1,-18.6,0.0,NW,4.4,gucheng,39.908156,116.239596,residential
3,2013-03-01,8.0,-0.7,1023.0,-18.8,0.0,NNW,4.4,wanliu,39.990376,116.287252,residential
4,2013-03-01,4.0,-0.7,1023.0,-18.8,0.0,NNW,4.4,aotizhongxin,40.003388,116.407613,urban


### Sort Dataset

Sorting by station and time ensures lag features and rolling windows operate correctly without leakage across stations.

In [7]:
df = df.sort_values(["station", "datetime"]).reset_index(drop=True)
df.head()

Unnamed: 0,datetime,pm25,temperature,pressure,dew_point,rain,wind_direction,wind_speed,station,latitude,longitude,area_type
0,2013-03-01 00:00:00,4.0,-0.7,1023.0,-18.8,0.0,NNW,4.4,aotizhongxin,40.003388,116.407613,urban
1,2013-03-01 01:00:00,8.0,-1.1,1023.2,-18.2,0.0,N,4.7,aotizhongxin,40.003388,116.407613,urban
2,2013-03-01 02:00:00,7.0,-1.1,1023.5,-18.2,0.0,NNW,5.6,aotizhongxin,40.003388,116.407613,urban
3,2013-03-01 03:00:00,6.0,-1.4,1024.5,-19.4,0.0,NW,3.1,aotizhongxin,40.003388,116.407613,urban
4,2013-03-01 04:00:00,3.0,-2.0,1025.2,-19.5,0.0,N,2.0,aotizhongxin,40.003388,116.407613,urban


## Temporal Feature Engineering

### Extract Datetime Components

EDA showed clear seasonal and hourly patterns. Extracting components helps create interpretable calendar features.

In [8]:
df["year"] = df["datetime"].dt.year
df["month"] = df["datetime"].dt.month
df["day"] = df["datetime"].dt.day
df["hour"] = df["datetime"].dt.hour
df["dayofweek"] = df["datetime"].dt.dayofweek
builder.add_step("Extracted datetime features: year, month, day, hour, dayofweek") # Add step to metadata
df.head()

Unnamed: 0,datetime,pm25,temperature,pressure,dew_point,rain,wind_direction,wind_speed,station,latitude,longitude,area_type,year,month,day,hour,dayofweek
0,2013-03-01 00:00:00,4.0,-0.7,1023.0,-18.8,0.0,NNW,4.4,aotizhongxin,40.003388,116.407613,urban,2013,3,1,0,4
1,2013-03-01 01:00:00,8.0,-1.1,1023.2,-18.2,0.0,N,4.7,aotizhongxin,40.003388,116.407613,urban,2013,3,1,1,4
2,2013-03-01 02:00:00,7.0,-1.1,1023.5,-18.2,0.0,NNW,5.6,aotizhongxin,40.003388,116.407613,urban,2013,3,1,2,4
3,2013-03-01 03:00:00,6.0,-1.4,1024.5,-19.4,0.0,NW,3.1,aotizhongxin,40.003388,116.407613,urban,2013,3,1,3,4
4,2013-03-01 04:00:00,3.0,-2.0,1025.2,-19.5,0.0,N,2.0,aotizhongxin,40.003388,116.407613,urban,2013,3,1,4,4


### Create Seasonal Categories

Monthly averages showed strong seasonality. A categorical season feature makes this pattern easy to interpret and may improve model accuracy.

In [9]:
def assign_season(month):
    if month in [12, 1, 2]:
        return "winter"
    elif month in [3, 4, 5]:
        return "spring"
    elif month in [6, 7, 8]:
        return "summer"
    else:
        return "autumn"

df["season"] = df["month"].apply(assign_season).astype("category")
builder.add_step("Added season variable") # Add step to metadata
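
As an alternative to a row-wise `apply`, the same month-to-season rule can be expressed as a dictionary lookup, which pandas applies in a vectorised way. The sketch below is illustrative only: `SEASON_BY_MONTH` and `add_season` are hypothetical names, not part of this project, and the toy frame stands in for `df`.

```python
import pandas as pd

# Hypothetical lookup table: month number -> season (same rule as assign_season)
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}

def add_season(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with a categorical `season` column."""
    out = frame.copy()
    out["season"] = out["month"].map(SEASON_BY_MONTH).astype("category")
    return out

# Toy data standing in for the real dataframe
demo = add_season(pd.DataFrame({"month": [1, 4, 7, 10, 12]}))
print(demo["season"].tolist())
```

On a dataset of this size the two approaches give the same result; the lookup mainly avoids calling a Python function once per row.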

### Cyclical Encoding for Time Features

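Hours and months are cyclical (23 → 0, December → January).
Encoding them as sin/cos pairs keeps adjacent values close together across the wrap-around point, which a raw integer encoding does not, and this can help ML models that rely on numeric distance.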

In [10]:
df["hour_sin"]  = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"]  = np.cos(2 * np.pi * df["hour"] / 24)

df["month_sin"] = np.sin(2 * np.pi * df["month"] / 12)
df["month_cos"] = np.cos(2 * np.pi * df["month"] / 12)
builder.add_step("Encoded hour and month as cyclical sin/cos") # Add step to metadata
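
To illustrate the continuity argument, the sketch below encodes three hours and compares distances in the encoded space. It is a standalone toy example: `encode_cyclical` is a hypothetical helper, not a function used elsewhere in this notebook.

```python
import numpy as np

def encode_cyclical(values, period):
    """Map values on a cycle of length `period` to (sin, cos) pairs."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])

# Rows correspond to 00:00, 12:00 and 23:00
hours = encode_cyclical([0, 12, 23], period=24)

# Euclidean distance between encoded points
dist_23_to_0 = np.linalg.norm(hours[2] - hours[0])
dist_12_to_0 = np.linalg.norm(hours[1] - hours[0])

print(f"23h vs 0h: {dist_23_to_0:.3f}")
print(f"12h vs 0h: {dist_12_to_0:.3f}")
```

As raw integers, 23 and 0 are 23 units apart. In the encoded space 23:00 sits right next to midnight, while noon is the farthest point on the circle.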

## Lag Feature Engineering

### Create Lag Features for PM2.5

EDA showed PM2.5 values depend strongly on previous hours.

Lag features allow the model to incorporate short- and medium-term pollutant persistence.

In [11]:
lags = [1, 3, 6, 12, 24]

for lag in lags:
    df[f"pm25_lag_{lag}h"] = df.groupby("station", observed=False)["pm25"].shift(lag)

builder.add_step("Created lag features (1h, 3h, 6h, 12h, 24h)") # Add step to metadata
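
A quick way to validate lag features is to check two things on a small example: the first row of every station has no lag value, and consecutive rows are exactly one hour apart (a row-based `shift` only equals a time-based lag when there are no gaps). The sketch below uses a toy frame with two made-up stations; the variable names are illustrative.

```python
import pandas as pd

# Toy frame with two hypothetical stations, already sorted by station and time
toy = pd.DataFrame({
    "station": ["a", "a", "a", "b", "b", "b"],
    "datetime": pd.to_datetime(
        ["2013-03-01 00:00", "2013-03-01 01:00", "2013-03-01 02:00"] * 2
    ),
    "pm25": [10.0, 20.0, 30.0, 100.0, 200.0, 300.0],
})

toy["pm25_lag_1h"] = toy.groupby("station")["pm25"].shift(1)

# Each station's first row has no earlier reading, so its lag must be missing
first_rows = toy.groupby("station").head(1)
lag_missing_at_start = bool(first_rows["pm25_lag_1h"].isna().all())

# Row-based shifts only match hourly lags if every step is exactly one hour
gaps = toy.groupby("station")["datetime"].diff().dropna()
hourly_without_gaps = bool((gaps == pd.Timedelta(hours=1)).all())

print(toy["pm25_lag_1h"].tolist())
print(lag_missing_at_start, hourly_without_gaps)
```

Station `b` never receives a value from station `a`, which is the cross-station leakage the earlier sort and `groupby` are meant to prevent.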

## Rolling Feature Engineering

### Rolling Means

Rolling averages smooth sudden spikes and capture sustained pollution episodes. Each window ends at the current hour, and the first `w - 1` rows of every station are NaN because the window is not yet full.

In [12]:
windows = [3, 6, 12, 24]

for w in windows:
    df[f"pm25_roll_{w}h_mean"] = (
        df.groupby("station", observed=False)["pm25"].rolling(window=w).mean().reset_index(level=0, drop=True)
    )
builder.add_step("Created rolling mean features (3h, 6h, 12h, 24h)") # Add step to metadata
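
Because a default pandas rolling window includes the current row, a rolling mean of `pm25` contains the current reading. If a later model predicts the current hour's PM2.5 (an assumption about the modelling notebook, not something established here), that would leak the target into the feature. One possible variant, sketched below on toy data, shifts the series by one hour before rolling so that only past readings are averaged. `past_rolling_mean` is a hypothetical helper.

```python
import pandas as pd

# Toy frame with two hypothetical stations, already sorted by station and time
toy = pd.DataFrame({
    "station": ["a"] * 4 + ["b"] * 4,
    "pm25": [10.0, 20.0, 30.0, 40.0, 100.0, 200.0, 300.0, 400.0],
})

def past_rolling_mean(frame, column, window):
    """Rolling mean of the `window` readings before each row, per station."""
    return (
        frame.groupby("station")[column]
        .transform(lambda s: s.shift(1).rolling(window=window).mean())
    )

toy["pm25_prev_3h_mean"] = past_rolling_mean(toy, "pm25", window=3)
print(toy["pm25_prev_3h_mean"].tolist())
```

The fourth row of station `a` averages 10, 20 and 30 and excludes its own value of 40. The cost is one extra missing row per station compared with the unshifted version.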

## Derived Meteorological Features

### Dew Point Spread

EDA showed strong TEMP–DEWP correlations.

Their difference indicates moisture levels and stability, which affect PM2.5 dispersion.

In [13]:
df["dewpoint_spread"] = df["temperature"] - df["dew_point"]
builder.add_step("Created dewpoint spread feature") # Add step to metadata

### Temperature–Pressure Interaction

High pressure + low temperature often leads to stagnant air and high PM2.5.

An interaction term helps the model learn this relationship.

In [14]:
df["temp_pres_interaction"] = df["temperature"] * df["pressure"]
builder.add_step("Created temperature-pressure interaction feature") # Add step to metadata
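
One caveat with a raw product: surface pressure varies by only a few percent around 1000 hPa, so `temperature * pressure` behaves almost like a rescaled copy of temperature. A common remedy is to centre both variables before multiplying. The sketch below compares the two on made-up readings; the values and variable names are illustrative assumptions, not project data.

```python
import pandas as pd

# Hypothetical readings: temperature in °C, pressure in hPa
toy = pd.DataFrame({
    "temperature": [-5.0, 0.0, 10.0, 25.0],
    "pressure": [1030.0, 1025.0, 1012.0, 1000.0],
})

# Raw product, as used for temp_pres_interaction
raw = toy["temperature"] * toy["pressure"]

# Centre each variable on its mean before multiplying
temp_c = toy["temperature"] - toy["temperature"].mean()
pres_c = toy["pressure"] - toy["pressure"].mean()
centred = temp_c * pres_c

raw_corr = raw.corr(toy["temperature"])
centred_corr = centred.corr(toy["temperature"])

print("corr(raw product, temperature):    ", round(raw_corr, 4))
print("corr(centred product, temperature):", round(centred_corr, 4))
```

Whether centring helps depends on the model: tree-based models are largely insensitive to it, whereas linear models tend to benefit from the reduced collinearity.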

### Rainfall Binary Indicator

Rainfall events cleanse the air but occur infrequently.

Because most hours record zero rainfall, a binary wet/dry indicator gives a simpler signal for this effect than the raw amount.

In [15]:
df["rain_binary"] = (df["rain"] > 0).astype(int)
builder.add_step("Created binary rain feature") # Add step to metadata

## Export Final Feature Engineered Dataset

We export the final dataset for hypothesis testing and modelling.

In [16]:
df.to_parquet(OUTPUT_PATH.with_suffix('.parquet'), index=False) # Save the feature-engineered dataframe to Parquet
builder.add_step("Saved dataset as Parquet for GitHub compliance") # Add step to metadata
print("Feature-engineered data saved to :", OUTPUT_PATH.with_suffix('.parquet')) # Print confirmation message
df.to_csv(OUTPUT_PATH.with_suffix('.csv'), index=False) # Save the feature-engineered dataframe to CSV
builder.add_step("Saved dataset as CSV for local use") # Add step to metadata
print("Feature-engineered data saved to :", OUTPUT_PATH.with_suffix('.csv')) # Print confirmation message


Feature-engineered data saved to : /home/robert/Projects/beijing-air-quality/data/engineered/beijing_feature_engineered.parquet
Feature-engineered data saved to : /home/robert/Projects/beijing-air-quality/data/engineered/beijing_feature_engineered.csv


### 📘 Why the Feature-Engineered Dataset Is Saved in Parquet Format

The feature-engineered dataset is saved in **Parquet** format as its primary output (a CSV copy is also written locally) for several technical and practical reasons:

#### **1. File Size and GitHub Limitations**
- The engineered dataset contains many additional columns (lags, rolling windows, cyclical encodings, spatial metadata), causing the CSV version to exceed **117 MB**.
- GitHub has a strict **100 MB file size limit**, meaning the CSV cannot be uploaded or version-controlled.
- Parquet uses columnar compression and encoding, reducing the file size to a fraction of the CSV and keeping it well within GitHub's limits.

#### **2. Improved Performance**
- Parquet loads significantly faster than CSV because:
  - It is a binary columnar format
  - Data types are preserved
  - No need for expensive type inference during loading  
- This results in faster processing in Jupyter notebooks and in the Streamlit dashboard.

#### **3. Data Type Preservation**
- CSV files do not store data types; Parquet does.  
- This ensures that:
  - datetimes remain datetimes  
  - categorical fields remain categorical  
  - numerical precision is maintained  
- This improves consistency and prevents type-related errors in later machine learning steps.
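
The dtype loss on the CSV side can be seen with an in-memory round trip. The sketch below uses a small made-up frame and needs only pandas; it does not write Parquet, so it shows only the CSV half of the comparison.

```python
import io
import pandas as pd

# Small made-up frame with a datetime, a categorical and a float column
frame = pd.DataFrame({
    "datetime": pd.to_datetime(["2013-03-01 00:00", "2013-03-01 01:00"]),
    "station": pd.Categorical(["aotizhongxin", "changping"]),
    "pm25": [4.0, 8.0],
})

# Round-trip through CSV text held in memory
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Restoring the dtypes requires telling the reader about them explicitly
buffer.seek(0)
reparsed = pd.read_csv(
    buffer, parse_dates=["datetime"], dtype={"station": "category"}
)

print(frame.dtypes.astype(str).to_dict())
print(restored.dtypes.astype(str).to_dict())
print(reparsed.dtypes.astype(str).to_dict())
```

After a plain `read_csv`, the datetime and station columns come back as generic text columns, which is why the load step earlier in this notebook converts them by hand. A Parquet file stores the schema alongside the data, so that conversion step is not needed.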

#### **4. Professional Data Engineering Practice**
- Parquet is a widely used format for analytical datasets, especially those involving:
  - time-series features  
  - large dimensionality  
  - engineered ML-ready columns  
- Using Parquet demonstrates an understanding of scalable data storage in real-world ML pipelines.

#### **5. Fully Compatible with Pandas and Streamlit**
- pandas reads Parquet with `read_parquet()`, provided a Parquet engine such as `pyarrow` or `fastparquet` is installed.
- A Streamlit dashboard can load the Parquet file through pandas.
- No loss of functionality compared to CSV.

Overall, Parquet provides a **compact, reliable, fast-loading, and GitHub-compliant** storage format for the final machine learning dataset, making it the most suitable choice for this stage of the Capstone project.

## Save Metadata file

In [17]:
builder.add_columns(df.columns) # Add the dataframe's columns
builder.add_record_count_from_df(df) # Set record count from the engineered dataframe
builder.add_record_stats(OUTPUT_PATH) # Add record statistics

builder.write(PROJECT_ROOT / "data" /"engineered" / "_metadata.yml") # Write metadata to YAML

📄 Metadata written to: /home/robert/Projects/beijing-air-quality/data/engineered/_metadata.yml
