# 🔬 Beijing Air Quality
## 📘 Notebook 12 – Forecast Simulation & Scenario Modelling

| Field         | Description                                        |
|:--------------|:---------------------------------------------------|
| Author:       | Robert Steven Elliott                              |
| Course:       | Code Institute – Data Analytics with AI Bootcamp   |
| Project Type: | Capstone                                           |
| Date:         | December 2025                                      |

This project complies with the CC BY 4.0 licence by including proper attribution.


## Objectives

This notebook introduces scenario-based forecasting using the best-performing prediction model from Notebook 11.

Specifically, it:
- Generates a synthetic (fake) 24-hour day with realistic meteorology + PM2.5
- Applies the full feature-engineering pipeline
- Uses a trained ML model to predict the next 24 hours recursively
- Saves reproducible forecast outputs for the Streamlit dashboard
- Enables user-driven forecasting scenarios (cold day / rainy day / high wind day / etc.)
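
As a rough illustration of the first objective, the sketch below builds one synthetic 24-hour day with NumPy and pandas. The helper name `make_synthetic_day`, its parameters, and every numeric constant are assumptions made for illustration; they are not taken from the project's `utils` package or from the dataset's observed ranges. Only the column names follow the features used later in this notebook.

```python
import numpy as np
import pandas as pd


def make_synthetic_day(start="2017-01-01 00:00", base_temp=2.0,
                       temp_swing=5.0, base_pm25=80.0, seed=42):
    """Build one synthetic 24-hour day (hypothetical helper, not part of utils)."""
    rng = np.random.default_rng(seed)  # seeded so the scenario is reproducible
    hours = np.arange(24)
    # Noise-free diurnal cycle: coldest at 03:00, warmest at 15:00
    temperature = base_temp + temp_swing * np.sin(2 * np.pi * (hours - 9) / 24)
    return pd.DataFrame({
        "datetime": pd.date_range(start, periods=24, freq=pd.Timedelta(hours=1)),
        "temperature": temperature,
        "pressure": 1015.0 + rng.normal(0.0, 1.5, size=24),
        "wind_speed": rng.gamma(shape=2.0, scale=1.0, size=24),
        "rain": np.zeros(24),
        # PM2.5 cannot be negative, so clip the noisy series at zero
        "pm25": np.clip(base_pm25 + rng.normal(0.0, 15.0, size=24), 0.0, None),
    })


cold_day = make_synthetic_day(base_temp=-8.0)  # a "cold day" scenario
print(cold_day.head())
```

Changing `base_temp`, `base_pm25`, or the rain column is one simple way to express the cold, rainy, or windy scenarios listed above. Realistic values would need to come from the dataset ranges named under Inputs.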


## Inputs

- Best trained forecasting model from Notebook 11 (e.g., xgb_best_model.joblib or similar)
- Feature list (model.feature_names_in_)
- Dataset ranges (for realistic randomisation)
- No raw or cleaned datasets are required; this notebook generates its own inputs.


## Outputs

- forecast_fake_day.csv: synthetic input day
- forecast_next_24h.csv: forecast results
- Figure: fake_day_forecast_plot.png
- Ready-to-load files for the Streamlit dashboard


## Citation  
This project uses data from:

Chen, Song (2017). *Beijing Multi-Site Air Quality.*  
UCI Machine Learning Repository. Licensed under **CC BY 4.0**.  
DOI: https://doi.org/10.24432/C5RK5G  
Kaggle mirror by Manu Siddhartha.

---

## Notebook Setup

### Import Required Libraries

The following libraries support data manipulation, plotting, and model loading.

In [1]:
import sys # system-level operations
import pandas as pd # data manipulation
import numpy as np # numerical operations
import matplotlib.pyplot as plt # plotting
import seaborn as sns # statistical data visualization
import plotly.express as px # interactive plotting
import joblib # model serialization
from pathlib import Path # filesystem paths
import warnings # warning control
warnings.filterwarnings("ignore") # ignore warnings for cleaner output


### Configure Visual Settings

In [2]:

plt.style.use("seaborn-v0_8") # set matplotlib style
sns.set_theme() # set seaborn theme

### Set Up Project Paths

In [3]:
PROJECT_ROOT = Path.cwd().parent # Assuming this script is in a subdirectory of the project root
DATA_PATH = PROJECT_ROOT / "data" / "engineered" / "beijing_engineered.csv"
MODELS_PATH = PROJECT_ROOT / "models" / "regression"
TYPES_PATH = PROJECT_ROOT / "model_outputs" / "regression"
FIG_PATH = PROJECT_ROOT / "figures" / "forecasting"
FIG_PATH.mkdir(parents=True, exist_ok=True)
sys.path.append(str(PROJECT_ROOT)) # Add project root to sys.path

from utils.feature_engineering import apply_forecasting_features # feature engineering functions
from utils.load_csv import load_csv # data loading functions
from utils.forcast import forecast_horizon # forecasting functions

## Load Model and Saved Dtypes

In [4]:
model = joblib.load(MODELS_PATH / "best_regression_model.joblib")
season_dtype = joblib.load(TYPES_PATH / "season_dtype.joblib")
area_dtype   = joblib.load(TYPES_PATH / "area_dtype.joblib")
station_dtype = joblib.load(TYPES_PATH / "station_dtype.joblib")
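
A likely reason for persisting the dtypes is that the integer codes produced by `.cat.codes` depend on the category list, so reusing the saved dtype keeps the codes stable between notebooks. The sketch below illustrates this with a made-up category order; the real order is stored in the saved dtype files and is not shown here.

```python
import pandas as pd

# Hypothetical category order, for illustration only
season_dtype_demo = pd.CategoricalDtype(
    categories=["spring", "summer", "autumn", "winter"]
)

new_rows = pd.Series(["winter", "spring"])

# With the fixed dtype, each label always maps to the same code
fixed_codes = new_rows.astype(season_dtype_demo).cat.codes
print(fixed_codes.tolist())  # [3, 0]

# Without it, categories are inferred (and sorted) from the data present,
# so the same label can receive a different code
inferred_codes = new_rows.astype("category").cat.codes
print(inferred_codes.tolist())  # [1, 0]
```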

In [5]:
df = load_csv(DATA_PATH)
df["datetime"] = pd.to_datetime(df["datetime"])

# Apply saved metadata dtypes
df["season"] = df["season"].astype(season_dtype)
df["season"] = df["season"].cat.codes

df["area_type"] = df["area_type"].astype(area_dtype)
df["area_type"] = df["area_type"].cat.codes

df["station"] = df["station"].astype(station_dtype)
df["station"] = df["station"].cat.codes

# Recreate lag + rolling features exactly as in Notebook 11
df = apply_forecasting_features(df, add_lags=True, add_rollings=True)
df = df.dropna()
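
The internals of `apply_forecasting_features` are not shown in this notebook. The sketch below is one plausible way to build the lag and rolling columns, written as a hypothetical stand-in called `add_lag_and_rolling`. Grouping by station is an assumption. The rolling mean is taken over the previous hours only, which is consistent with the rows displayed in this notebook (for example, 403.333333 is the mean of 392, 378, and 440).

```python
import pandas as pd


def add_lag_and_rolling(df, target="pm25", lags=(1, 3), windows=(3,)):
    """Hypothetical stand-in for apply_forecasting_features."""
    out = df.sort_values(["station", "datetime"]).copy()
    grouped = out.groupby("station")[target]
    for lag in lags:
        # Value observed `lag` hours earlier at the same station
        out[f"{target}_lag_{lag}h"] = grouped.shift(lag)
    for window in windows:
        # shift(1) excludes the current hour, so the target never leaks in
        out[f"{target}_roll_{window}h_mean"] = grouped.transform(
            lambda s: s.shift(1).rolling(window).mean()
        )
    return out


demo = pd.DataFrame({
    "station": [0, 0, 0, 0],
    "datetime": pd.date_range("2016-12-31 16:00", periods=4,
                              freq=pd.Timedelta(hours=1)),
    "pm25": [440.0, 378.0, 392.0, 449.0],
})
demo_features = add_lag_and_rolling(demo)
print(demo_features.iloc[-1])
```

The first rows of each station have no history, so their lag and rolling columns are `NaN`; that is why a `dropna()` follows the feature step.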


## Apply Feature Engineering

The cell above rebuilds the lag and rolling features on the engineered dataset. The final rows are shown below as a check:

In [6]:
df.tail()

| Unnamed: 0 | datetime | year | month | day | hour | pm25 | temperature | pressure | dew_point | rain | ... | relative_humidity | pm25_lag_1h | pm25_lag_3h | pm25_lag_6h | pm25_lag_12h | pm25_lag_18h | pm25_roll_3h_mean | pm25_roll_6h_mean | pm25_roll_12h_mean | pm25_roll_18h_mean |
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
| 403771 | 2016-12-31 19:00:00 | 2016 | 12 | 31 | 19 | 449.0 | -1.9 | 1022.0 | -6.1 | 0.0 | ... | 72.987463 | 392.0 | 440.0 | 468.0 | 311.0 | 350.0 | 403.333333 | 421.0 | 394.416667 | 377.0 |
| 403772 | 2016-12-31 20:00:00 | 2016 | 12 | 31 | 20 | 460.0 | -2.5 | 1022.4 | -5.5 | 0.0 | ... | 79.859003 | 449.0 | 378.0 | 399.0 | 332.0 | 361.0 | 406.333333 | 417.833333 | 405.916667 | 382.5 |
| 403773 | 2016-12-31 21:00:00 | 2016 | 12 | 31 | 21 | 463.0 | -3.0 | 1022.1 | -5.3 | 0.0 | ... | 84.1438 | 460.0 | 392.0 | 449.0 | 358.0 | 364.0 | 433.666667 | 428.0 | 416.583333 | 388.0 |
| 403774 | 2016-12-31 22:00:00 | 2016 | 12 | 31 | 22 | 493.0 | -3.0 | 1022.7 | -5.0 | 0.0 | ... | 86.076384 | 463.0 | 449.0 | 440.0 | 407.0 | 316.0 | 457.333333 | 430.333333 | 425.333333 | 393.5 |
| 403775 | 2016-12-31 23:00:00 | 2016 | 12 | 31 | 23 | 464.0 | -4.0 | 1022.6 | -5.7 | 0.0 | ... | 87.95407 | 493.0 | 460.0 | 378.0 | 398.0 | 325.0 | 472.0 | 439.166667 | 432.5 | 403.333333 |


In [7]:
features = model.feature_names_in_

features

array(['temperature', 'dew_point', 'pressure', 'rain', 'wind_speed',
       'temp_pres_interaction', 'dew_point_spread', 'rain_binary',
       'area_type', 'hour_sin', 'hour_cos', 'month_sin', 'month_cos',
       'season', 'day_of_week', 'month', 'year', 'station',
       'relative_humidity', 'pm25_lag_1h', 'pm25_lag_3h', 'pm25_lag_6h',
       'pm25_lag_12h', 'pm25_lag_18h', 'pm25_roll_3h_mean',
       'pm25_roll_6h_mean', 'pm25_roll_12h_mean', 'pm25_roll_18h_mean'],
      dtype='<U21')

In [8]:
len(features)

28
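
The feature list includes sine and cosine pairs such as `hour_sin` and `hour_cos`. The sketch below shows the usual cyclical encoding behind names like these. The exact formula used in the project's feature engineering is not shown in this notebook, so the period of 24 and the scaling are assumptions. The helper `circular_gap` is hypothetical and exists only to show why the encoding helps.

```python
import numpy as np
import pandas as pd

hours = pd.DataFrame({"hour": [0, 6, 12, 18, 23]})
# Map each hour onto a point on the unit circle
hours["hour_sin"] = np.sin(2 * np.pi * hours["hour"] / 24)
hours["hour_cos"] = np.cos(2 * np.pi * hours["hour"] / 24)


def circular_gap(a, b):
    """Euclidean distance between two hours in (sin, cos) space."""
    cols = ["hour_sin", "hour_cos"]
    pa = hours.loc[hours["hour"] == a, cols].to_numpy()[0]
    pb = hours.loc[hours["hour"] == b, cols].to_numpy()[0]
    return float(np.linalg.norm(pa - pb))


# 23:00 sits next to 00:00 on the circle, unlike in the raw integer encoding
print(circular_gap(23, 0) < circular_gap(12, 0))  # True
```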

## Recursive 24h Forecast



Run forecast:

In [10]:
stations = df["station"].unique()
forecast_results = {}


for st_code in stations: 
    df_stn = df[df["station"] == st_code]
    fc = forecast_horizon(df_stn, model, features)
    fc["station_code"] = st_code
    forecast_results[st_code] = fc
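
The body of `forecast_horizon` lives in the project's `utils` package and is not reproduced here. The sketch below illustrates the general idea of recursive forecasting with a toy model: each prediction is appended to the history and becomes a lag input for the next step. `MeanOfLagsModel`, `recursive_forecast`, and the `pm25_lag_2h` column are hypothetical and only stand in for the trained regressor and its 28 features. The output column names match those used in the plotting cell.

```python
import pandas as pd


class MeanOfLagsModel:
    """Toy stand-in for the trained regressor (hypothetical)."""

    def predict(self, X):
        return X[["pm25_lag_1h", "pm25_lag_2h"]].mean(axis=1).to_numpy()


def recursive_forecast(history, model, horizon=24):
    """Predict `horizon` hours ahead, feeding each prediction back as a lag."""
    values = list(history["pm25"])  # observed values, oldest first
    last_time = history["datetime"].iloc[-1]
    rows = []
    for step in range(1, horizon + 1):
        X = pd.DataFrame({
            "pm25_lag_1h": [values[-1]],
            "pm25_lag_2h": [values[-2]],
        })
        y_hat = float(model.predict(X)[0])
        values.append(y_hat)  # the prediction becomes the next step's lag
        rows.append({
            "datetime": last_time + pd.Timedelta(hours=step),
            "pm25_predicted": y_hat,
        })
    return pd.DataFrame(rows)


history = pd.DataFrame({
    "datetime": pd.to_datetime(["2016-12-31 22:00", "2016-12-31 23:00"]),
    "pm25": [100.0, 200.0],
})
forecast = recursive_forecast(history, MeanOfLagsModel(), horizon=3)
print(forecast)
```

Because later steps rely on earlier predictions rather than observations, errors can compound across the horizon. A full version would also need to supply future values for the meteorological and calendar features, which this sketch leaves out.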


In [11]:
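# Stack the per-station forecasts into one long table
combined = pd.concat(forecast_results.values(), ignore_index=True)
# Map integer station codes back to station names (code i -> i-th category)
station_map = dict(enumerate(station_dtype.categories))
combined["station_name"] = combined["station_code"].map(station_map)
# Ensure the x-axis column is datetime64 for Plotly
combined["datetime"] = pd.to_datetime(combined["datetime"])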

fig = px.line(
    combined,
    x="datetime",
    y="pm25_predicted",
    color="station_name",
    title="Next 24h PM2.5 Forecast per Station",
)
fig.show()
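
The Outputs section lists forecast_next_24h.csv, but no cell in this notebook writes it. The sketch below shows one way the combined forecast table could be saved and reloaded. The helper `save_forecast` is hypothetical, and the demo table and temporary directory are placeholders so the example has no side effects; in the notebook, `combined` and a project output folder would take their place.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_forecast(combined, out_dir, filename="forecast_next_24h.csv"):
    """Write the forecast table to CSV and return the path (hypothetical helper)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / filename
    # index=False avoids an extra "Unnamed: 0" column when the file is reloaded
    combined.to_csv(out_file, index=False)
    return out_file


# Small stand-in for the `combined` table built above (demo values only)
demo_forecast = pd.DataFrame({
    "datetime": pd.to_datetime(["2017-01-01 00:00", "2017-01-01 01:00"]),
    "pm25_predicted": [150.0, 175.0],
    "station_name": ["Aotizhongxin", "Changping"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = save_forecast(demo_forecast, tmp)
    # parse_dates restores the datetime dtype that CSV cannot store
    reloaded = pd.read_csv(path, parse_dates=["datetime"])

print(reloaded.dtypes)
```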


## Summary

This notebook demonstrates:

- How the model reacts to realistic but synthetic atmospheric conditions
- Predictive capability under user-defined scenarios
- A method to forecast 24 hours ahead using recursive prediction
- Dashboard-ready output files

Together, these steps show how the trained model can be applied recursively to produce short-horizon PM2.5 forecasts for each station.

---

### AI Assistance Note

Some narrative text and minor formatting or wording improvements in this notebook were supported by AI-assisted tools (ChatGPT for documentation clarity, Copilot for small routine code suggestions, and Grammarly for proofreading). All analysis, code logic, feature engineering, modelling, and interpretations were independently created by the author.