# 🔬 Beijing Air Quality
## 📘 Notebook 12 – Forecast Simulation & Scenario Modelling

| Field         | Description                                        |
|:--------------|:---------------------------------------------------|
| Author:       | Robert Steven Elliott                              |
| Course:       | Code Institute – Data Analytics with AI Bootcamp   |
| Project Type: | Capstone                                           |
| Date:         | December 2025                                      |

This project complies with the CC BY 4.0 licence by including proper attribution.


## Objectives

This notebook introduces scenario-based forecasting using the best-performing prediction model from Notebook 11.

Specifically, it:
- Generates a synthetic (fake) 24-hour day with realistic meteorology + PM2.5
- Applies the full feature-engineering pipeline
- Uses a trained ML model to predict the next 24 hours recursively
- Saves reproducible forecast outputs for the Streamlit dashboard
- Enables user-driven forecasting scenarios (cold day / rainy day / high wind day / etc.)
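
As a rough illustration of the first objective, the sketch below builds one hypothetical 24-hour day of hourly weather and PM2.5. The function name `make_synthetic_day`, its parameters, and every numeric range are assumptions chosen for illustration; they are not taken from the project code or from the dataset's real value ranges.

```python
import numpy as np
import pandas as pd


def make_synthetic_day(start="2017-01-01", base_temp=2.0, temp_swing=5.0,
                       base_pm25=80.0, rain_prob=0.0, seed=0):
    """Build a hypothetical 24-hour day of hourly weather and PM2.5.

    All defaults are illustrative assumptions, not values from the dataset.
    """
    rng = np.random.default_rng(seed)
    times = pd.Timestamp(start) + pd.to_timedelta(np.arange(24), unit="h")
    hours = times.hour.to_numpy()

    # Smooth daily temperature cycle peaking at 15:00, plus small noise
    temperature = base_temp + temp_swing * np.cos(2 * np.pi * (hours - 15) / 24)
    temperature = temperature + rng.normal(0, 0.5, size=24)

    # Dew point is kept 2 to 6 degrees below temperature
    dew_point = temperature - rng.uniform(2, 6, size=24)

    # Pressure fluctuates around an assumed 1015 hPa
    pressure = 1015 + rng.normal(0, 1.5, size=24)

    # Each hour rains with probability rain_prob; otherwise rainfall is zero
    rain = np.where(rng.random(24) < rain_prob, rng.exponential(1.0, 24), 0.0)

    # Noisy PM2.5 around a base level, clipped so it never goes negative
    pm25 = np.clip(base_pm25 + rng.normal(0, 15, size=24), 0, None)

    return pd.DataFrame({
        "datetime": times,
        "hour": hours,
        "month": times.month,
        "temperature": temperature,
        "dew_point": dew_point,
        "pressure": pressure,
        "rain": rain,
        "pm25": pm25,
    })


# A "cold day" scenario is then just a different set of arguments
cold_day = make_synthetic_day(base_temp=-8.0, temp_swing=3.0, seed=42)
```

Scenarios such as a rainy day or a high-pollution day would follow the same pattern by changing `rain_prob` or `base_pm25`.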


## Inputs

- Best trained forecasting model from Notebook 11 (loaded here as `best_regression_model.joblib`)
- Feature list (loaded here from `forecasting_feature_names.joblib`)
- Dataset ranges (for realistic randomisation)
- No raw or cleaned datasets are required; this notebook generates its own inputs.


## Outputs

- forecast_fake_day.csv: synthetic input day
- forecast_next_24h.csv: forecast results
- Figure: fake_day_forecast_plot.png
- Ready-to-load files for the Streamlit dashboard


## Citation  
This project uses data from:

Chen, Song (2017). *Beijing Multi-Site Air Quality.*  
UCI Machine Learning Repository. Licensed under **CC BY 4.0**.  
DOI: https://doi.org/10.24432/C5RK5G  
Kaggle mirror by Manu Siddhartha.

---

## Notebook Setup

### Import Required Libraries

(The following libraries support analysis, plotting, and data manipulation.)

In [1]:
import sys # system-level operations
import pandas as pd # data manipulation
import numpy as np # numerical operations
import matplotlib.pyplot as plt # plotting
import seaborn as sns # statistical data visualization
import plotly.express as px # interactive plotting
import joblib # model serialization
from pathlib import Path # filesystem paths
import warnings # warning control
warnings.filterwarnings("ignore") # ignore warnings for cleaner output


### Configure Visual Settings

In [2]:

plt.style.use("seaborn-v0_8") # set matplotlib style
sns.set_theme() # set seaborn theme

### Set Up Project Paths

In [3]:
PROJECT_ROOT = Path.cwd().parent # Assumes the notebook runs from a subdirectory of the project root
DATA_PATH = PROJECT_ROOT / "data" / "engineered" / "beijing_engineered.csv"
MODELS_PATH = PROJECT_ROOT / "models"
OUTPUT_PATH = PROJECT_ROOT / "data" / "model_outputs" / "forecasts"
OUTPUT_PATH.mkdir(parents=True, exist_ok=True)

FIG_PATH = PROJECT_ROOT / "figures" / "forecasting"
FIG_PATH.mkdir(parents=True, exist_ok=True)
sys.path.append(str(PROJECT_ROOT)) # Add project root to sys.path

from src.feature_engineering import apply_forecasting_features # feature engineering functions

## Load Model and Saved Dtypes

In [4]:
model = joblib.load(MODELS_PATH / "best_regression_model.joblib")
season_dtype = joblib.load(MODELS_PATH / "season_dtype.joblib")
area_dtype   = joblib.load(MODELS_PATH / "area_dtype.joblib")
station_dtype = joblib.load(MODELS_PATH / "station_dtype.joblib")
features = joblib.load(MODELS_PATH / "forecasting_feature_names.joblib")

In [5]:
df = pd.read_csv(DATA_PATH)
df["datetime"] = pd.to_datetime(df["datetime"])

# Apply saved metadata dtypes
df["season"] = df["season"].astype(season_dtype)
df["season"] = df["season"].cat.codes

df["area_type"] = df["area_type"].astype(area_dtype)
df["area_type"] = df["area_type"].cat.codes

df["station"] = df["station"].astype(station_dtype)
df["station"] = df["station"].cat.codes

# Recreate lag + rolling features exactly as in Notebook 11
df = apply_forecasting_features(df, add_lags=True, add_rollings=True)
df = df.dropna()
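
The reason for reloading the saved dtypes rather than calling `astype("category")` afresh is that the integer codes depend on the category order. The minimal sketch below uses a hypothetical three-station dtype (the real categories come from `station_dtype.joblib`) to show that a fixed `CategoricalDtype` gives stable codes, and that a value outside the known categories becomes `-1`.

```python
import pandas as pd

# Hypothetical station list for illustration only
station_dtype = pd.CategoricalDtype(
    categories=["aotizhongxin", "changping", "dingling"]
)

new_data = pd.Series(["dingling", "aotizhongxin", "unknown_station"])

# Codes follow the position of each value in the saved category list;
# values missing from that list become NaN, which is encoded as -1
codes = new_data.astype(station_dtype).cat.codes
print(codes.tolist())  # [2, 0, -1]

# The same dtype also gives the reverse mapping from code to name
code_to_name = dict(enumerate(station_dtype.categories))
print(code_to_name[2])  # dingling
```

A `-1` code reaching the model is worth checking for, since it would indicate a station the model never saw during training.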


## Apply Feature Engineering

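The function below recreates the engineered features (cyclical time encodings, interaction terms, rolling means, and lags). Note that it shadows the `apply_forecasting_features` imported from `src.feature_engineering` above: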

In [6]:
def apply_forecasting_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Cyclical
    df["hour_sin"]  = np.sin(2 * np.pi * df["hour"] / 24)
    df["hour_cos"]  = np.cos(2 * np.pi * df["hour"] / 24)
    df["month_sin"] = np.sin(2 * np.pi * df["month"] / 12)
    df["month_cos"] = np.cos(2 * np.pi * df["month"] / 12)

    # Interaction features
    df["dew_point_spread"] = df["temperature"] - df["dew_point"]
    df["temp_pres_interaction"] = df["temperature"] * df["pressure"]
    df["rain_binary"] = (df["rain"] > 0).astype(int)

    # Rolling windows
    for w in [3, 6, 12, 18]:
        df[f"pm25_roll_{w}h_mean"] = (
            df["pm25"].shift(1).rolling(w).mean()
        )

    # Lag features
    for lag in [1, 3, 6, 12, 18]:
        df[f"pm25_lag_{lag}h"] = df["pm25"].shift(lag)

    return df
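
To see why the hour and month are encoded with sine and cosine rather than used raw, the short sketch below compares distances between hours. The helper `encode_hour` is a hypothetical name introduced for this illustration; it applies the same formula as the cyclical block above.

```python
import numpy as np


def encode_hour(hour):
    """Map an hour (0-23) to a point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])


# As raw integers, 23:00 and 00:00 look 23 units apart
raw_gap = abs(23 - 0)

# On the circle they are neighbours, while midnight and noon are opposite
encoded_gap = np.linalg.norm(encode_hour(23) - encode_hour(0))
noon_gap = np.linalg.norm(encode_hour(12) - encode_hour(0))

print(raw_gap, round(encoded_gap, 3), round(noon_gap, 3))
```

The encoded distance between 23:00 and 00:00 is small, which lets the model treat late evening and early morning as adjacent.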

Apply it:

In [7]:
df = apply_forecasting_features(df) # Re-apply feature engineering to the loaded dataset
df.tail()

Unnamed: 0,datetime,year,month,day,hour,pm25,temperature,pressure,dew_point,rain,...,rain_binary,pm25_lag_1h,pm25_lag_3h,pm25_lag_6h,pm25_lag_12h,pm25_lag_18h,pm25_roll_3h_mean,pm25_roll_6h_mean,pm25_roll_12h_mean,pm25_roll_18h_mean
403771,2016-12-31 19:00:00,2016,12,31,19,449.0,-1.9,1022.0,-6.1,0.0,...,0,392.0,440.0,468.0,311.0,350.0,403.333333,421.0,394.416667,377.0
403772,2016-12-31 20:00:00,2016,12,31,20,460.0,-2.5,1022.4,-5.5,0.0,...,0,449.0,378.0,399.0,332.0,361.0,406.333333,417.833333,405.916667,382.5
403773,2016-12-31 21:00:00,2016,12,31,21,463.0,-3.0,1022.1,-5.3,0.0,...,0,460.0,392.0,449.0,358.0,364.0,433.666667,428.0,416.583333,388.0
403774,2016-12-31 22:00:00,2016,12,31,22,493.0,-3.0,1022.7,-5.0,0.0,...,0,463.0,449.0,440.0,407.0,316.0,457.333333,430.333333,425.333333,393.5
403775,2016-12-31 23:00:00,2016,12,31,23,464.0,-4.0,1022.6,-5.7,0.0,...,0,493.0,460.0,378.0,398.0,325.0,472.0,439.166667,432.5,403.333333
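
One caveat: `shift` and `rolling` operate on row order, so when several stations are stacked in one frame, lag values can bleed from one station into the next. The toy sketch below (made-up numbers, hypothetical column names `lag_plain`, `lag_grouped`, and `roll2_grouped`) shows the difference between a plain shift and a per-station shift, assuming rows are sorted by time within each station.

```python
import pandas as pd

# Two toy stations stacked in one frame, already sorted by time
toy = pd.DataFrame({
    "station": [0, 0, 0, 1, 1, 1],
    "pm25": [10.0, 20.0, 30.0, 100.0, 200.0, 300.0],
})

# Plain shift: the first row of station 1 inherits station 0's last value
toy["lag_plain"] = toy["pm25"].shift(1)

# Grouped shift: each station starts with NaN, so nothing leaks across
toy["lag_grouped"] = toy.groupby("station")["pm25"].shift(1)

# Grouped rolling mean of the previous two hours, again per station
toy["roll2_grouped"] = toy.groupby("station")["pm25"].transform(
    lambda s: s.shift(1).rolling(2).mean()
)

print(toy)
```

In the plain version the first row of station 1 receives `30.0` from station 0, whereas the grouped version leaves it as `NaN`, which a later `dropna()` would remove.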


## Recursive 24h Forecast

This takes the most recent row for each station and repeatedly predicts one hour ahead, feeding each prediction back in as the newest lag value:

In [8]:
def forecast_next_24h(df_station, model):
    """
    Recursive 24h forecast for a single station.
    df_station must already contain engineered features and at least
    18 hourly pm25 values. Weather inputs stay at their last observed values.
    """
    df_station = df_station.sort_values("datetime")
    last = df_station.iloc[-1].copy()

    # Most recent 18 hourly PM2.5 values, oldest first; predictions are appended
    history = df_station["pm25"].tail(18).tolist()

    forecasts = []

    for step in range(24):
        # Advance time
        new_time = last["datetime"] + pd.Timedelta(hours=1)
        last["datetime"] = new_time
        # Update temporal encodings
        last["hour"] = new_time.hour
        last["month"] = new_time.month
        last["day_of_week"] = new_time.dayofweek
        last["year"] = new_time.year

        last["hour_sin"] = np.sin(2*np.pi*last["hour"]/24)
        last["hour_cos"] = np.cos(2*np.pi*last["hour"]/24)
        last["month_sin"] = np.sin(2*np.pi*last["month"]/12)
        last["month_cos"] = np.cos(2*np.pi*last["month"]/12)

        # Rebuild lag features relative to new_time (history[-1] is one hour back)
        for lag in [1, 3, 6, 12, 18]:
            last[f"pm25_lag_{lag}h"] = history[-lag]

        # Rolling means over the previous w hours, matching shift(1).rolling(w)
        for w in [3, 6, 12, 18]:
            last[f"pm25_roll_{w}h_mean"] = np.mean(history[-w:])

        # Prepare input
        X = last[features].astype("float32").to_numpy().reshape(1,-1)

        # Predict, then feed the prediction back in as the newest observation
        pred = model.predict(X)[0]
        history.append(pred)

        forecasts.append({"datetime": new_time, "pm25_predicted": pred})

    return pd.DataFrame(forecasts)
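
The core idea of recursive forecasting can be shown in isolation with a stand-in model. In the sketch below, `MeanOfLagsModel` and `recursive_forecast` are hypothetical names invented for this example, and the stub simply averages its inputs; the point is only to show how a history buffer supplies each lag and how every prediction becomes an input for later steps.

```python
import numpy as np


class MeanOfLagsModel:
    """Hypothetical stand-in for a trained regressor: averages its inputs."""

    def predict(self, X):
        return np.asarray(X).mean(axis=1)


def recursive_forecast(history, model, steps=24, lags=(1, 3, 6)):
    """Predict `steps` values ahead, feeding each prediction back in."""
    history = list(history)
    if len(history) < max(lags):
        raise ValueError("history must cover the longest lag")

    preds = []
    for _ in range(steps):
        # history[-lag] is the value `lag` hours before the hour being predicted
        X = np.array([[history[-lag] for lag in lags]], dtype="float32")
        pred = float(model.predict(X)[0])
        preds.append(pred)
        # The prediction becomes the newest "observation" for the next step
        history.append(pred)
    return preds


observed = [50.0, 52.0, 55.0, 60.0, 58.0, 57.0]
forecast = recursive_forecast(observed, MeanOfLagsModel(), steps=3)
print(forecast)
```

Because each step consumes earlier predictions, errors can compound over the horizon, so later hours in a recursive forecast are generally less reliable than earlier ones.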


Run forecast:

In [9]:
stations = df["station"].unique()
forecast_results = {}


for st_code in stations: 
    df_stn = df[df["station"] == st_code]
    fc = forecast_next_24h(df_stn, model)
    fc["station_code"] = st_code
    forecast_results[st_code] = fc


In [10]:
combined = pd.concat(forecast_results.values(), ignore_index=True)
station_map = dict(enumerate(station_dtype.categories))
combined["station_name"] = combined["station_code"].map(station_map)
combined["datetime"] = pd.to_datetime(combined["datetime"]).dt.to_pydatetime()

fig = px.line(
    combined,
    x="datetime",
    y="pm25_predicted",
    color="station_name",
    title="Next 24h PM2.5 Forecast per Station",
)
fig.show()


### Save Results for Dashboard

In [11]:
for stn, df_fc in forecast_results.items():
    station_name = station_dtype.categories[stn]
    outfile = OUTPUT_PATH / f"forecast_24h_{station_name}.csv"
    print(f"Saving forecast for station {station_name} to {outfile}")
    df_fc.to_csv(outfile, index=False)

Saving forecast for station aotizhongxin to /home/robert/Projects/beijing-air-quality/data/model_outputs/forecasts/forecast_24h_aotizhongxin.csv
Saving forecast for station changping to /home/robert/Projects/beijing-air-quality/data/model_outputs/forecasts/forecast_24h_changping.csv
Saving forecast for station dingling to /home/robert/Projects/beijing-air-quality/data/model_outputs/forecasts/forecast_24h_dingling.csv
Saving forecast for station dongsi to /home/robert/Projects/beijing-air-quality/data/model_outputs/forecasts/forecast_24h_dongsi.csv
Saving forecast for station guanyuan to /home/robert/Projects/beijing-air-quality/data/model_outputs/forecasts/forecast_24h_guanyuan.csv
Saving forecast for station gucheng to /home/robert/Projects/beijing-air-quality/data/model_outputs/forecasts/forecast_24h_gucheng.csv
Saving forecast for station huairou to /home/robert/Projects/beijing-air-quality/data/model_outputs/forecasts/forecast_24h_huairou.csv
Saving forecast for station nongzhangua
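
On the dashboard side, the per-station files could be read back along these lines. This is a sketch under assumptions: `load_forecasts` is a hypothetical helper, not part of the project, and it relies only on the `forecast_24h_<station>.csv` naming and the `datetime`, `pm25_predicted`, and `station_code` columns written above. The demo writes two small made-up files to a temporary folder so the example runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

PREFIX = "forecast_24h_"


def load_forecasts(folder):
    """Read every forecast_24h_<station>.csv in `folder` into one DataFrame."""
    frames = []
    for path in sorted(Path(folder).glob(f"{PREFIX}*.csv")):
        frame = pd.read_csv(path, parse_dates=["datetime"])
        # Recover the station name from the file name
        frame["station_name"] = path.stem[len(PREFIX):]
        frames.append(frame)
    if not frames:
        raise FileNotFoundError(f"No forecast files found in {folder}")
    return pd.concat(frames, ignore_index=True)


# Demo with made-up values in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    for code, name in enumerate(["changping", "dongsi"]):
        pd.DataFrame({
            "datetime": pd.to_datetime(["2017-01-01 00:00", "2017-01-01 01:00"]),
            "pm25_predicted": [80.0, 82.5],
            "station_code": [code, code],
        }).to_csv(Path(tmp) / f"{PREFIX}{name}.csv", index=False)

    demo = load_forecasts(tmp)

print(demo)
```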

## Summary

This notebook demonstrates:

- How the model reacts to realistic but synthetic atmospheric conditions
- Predictive capability under user-defined scenarios
- A method to forecast 24 hours ahead using recursive prediction
- Dashboard-ready output files

Together, these steps show how the trained model can be used for multi-step forecasting and scenario exploration.

---

### AI Assistance Note

Some narrative text and minor formatting or wording improvements in this notebook were supported by AI-assisted tools (ChatGPT for documentation clarity, Copilot for small routine code suggestions, and Grammarly for proofreading). All analysis, code logic, feature engineering, modelling, and interpretations were independently created by the author.