# ðŸ”¬ Beijing Air Quality
## ðŸ“˜ Notebook 12 â€“ Forecast Simulation & Scenario Modelling

| Field         | Description                                        |
|:--------------|:---------------------------------------------------|
| Author:       |	Robert Steven Elliott                            |
| Course:       |	Code Institute â€“ Data Analytics with AI Bootcamp |
| Project Type: |	Capstone                                         |
| Date:         |	December 2025                                    |

This project complies with the CC BY 4.0 licence by including proper attribution.


## Objectives

This notebook introduces scenario-based forecasting using your best-performing prediction model from Notebook 11.

Specifically, it:
- Generates a synthetic (fake) 24-hour day with realistic meteorology + PM2.5
- Applies the full feature-engineering pipeline
- Uses a trained ML model to predict the next 24 hours recursively
- Saves reproducible forecast outputs for the Streamlit dashboard
- Enables user-driven forecasting scenarios (cold day / rainy day / high wind day / etc.)


## Inputs

- Best trained forecasting model from Notebook 11 (e.g., xgb_best_model.joblib or similar)
- Feature list (model.feature_names_in_)
- Dataset ranges (for realistic randomisation)
- No raw or cleaned datasets are required â€” this notebook generates its own inputs.


## Outputs

- forecast_fake_day.csv â€” synthetic input day
- forecast_next_24h.csv â€” forecast results
- Figure: fake_day_forecast_plot.png
- Ready-to-load files for the Streamlit dashboard


## Citation  
This project uses data from:

Chen, Song (2017). *Beijing Multi-Site Air Quality.*  
UCI Machine Learning Repository â€” Licensed under **CC BY 4.0**.  
DOI: https://doi.org/10.24432/C5RK5G  
Kaggle mirror by Manu Siddhartha.

---

## Notebook Setup

### Import Required Libraries

(The following libraries support analysis, plotting, and data manipulation.)

In [11]:
import sys # system-level operations
import pandas as pd # data manipulation
import numpy as np # numerical operations
import matplotlib.pyplot as plt # plotting
import seaborn as sns # statistical data visualization
import plotly.express as px # interactive plotting
import joblib # model serialization
from pathlib import Path # filesystem paths

### Configure Visual Settings

In [12]:

plt.style.use("seaborn-v0_8") # set matplotlib style
sns.set_theme() # set seaborn theme

### Set Up Project Paths

In [13]:
PROJECT_ROOT = Path.cwd().parent # Assuming this script is in a subdirectory of the project root
DATA_PATH = PROJECT_ROOT / "data" # Path to the data directory
DERIVED_PATH = DATA_PATH / "derived" # Path to derived data
MODELS_PATH = PROJECT_ROOT / "models" # Path to models directory

sys.path.append(str(PROJECT_ROOT)) # Add project root to sys.path

FIGURES_PATH = PROJECT_ROOT / "figures" / "h2" # Path to save figures
FIGURES_PATH.mkdir(parents=True, exist_ok=True) # Create directory if it doesn't exist

## Load saved dtypes

In [None]:
season_dtype = joblib.load(MODELS_PATH / "season_dtype.joblib")
area_dtype = joblib.load(MODELS_PATH / "area_dtype.joblib")

print("Loaded season and area_type dtypes")

### Load Model

In [14]:
model = joblib.load(MODELS_PATH / "best_regression_model.joblib") # Load the best forecasting model
model

## Generate a Fake 24-Hour Day

This produces realistic synthetic data based on value ranges in your Beijing dataset.

In [None]:
def generate_fake_day(start_date: str = "2025-01-01", area_type: str = "urban") -> pd.DataFrame:
    """
    Generates synthetic 24h of PM2.5 + weather using realistic ranges.
    Ensures area_type and season values match training categorical dtypes.
    """

    # --- VALIDATE AREA TYPE AGAINST TRAINING CATEGORIES ---
    if area_type not in area_dtype.categories:
        raise ValueError(
            f"Invalid area_type '{area_type}'. Must be one of: {list(area_dtype.categories)}"
        ) # Validate area_type input

    hours = pd.date_range(start=start_date, periods=24, freq="H") # Generate hourly datetime range
    rng = np.random.default_rng() # Random number generator

    df = pd.DataFrame({
        "datetime": hours,
        "pm25": rng.normal(80, 20, 24).clip(5, 300),
        "temperature": rng.normal(10, 7, 24).clip(-20, 35),
        "dew_point": rng.normal(0, 10, 24).clip(-25, 25),
        "pressure": rng.normal(1010, 7, 24).clip(980, 1040),
        "rain": rng.choice([0, 0, 0, 1, 2, 5], 24),
        "wind_speed": rng.normal(2.5, 1.5, 24).clip(0, 10),
    }) # Create DataFrame with synthetic data

    df["area_type"] = area_type # Set area type
    df["area_type"] = pd.Categorical(df["area_type"], dtype=area_dtype) # Convert to categorical
    df["area_type_code"] = df["area_type"].cat.codes # Get area type codes

    df["year"] = df["datetime"].dt.year # Extract year from datetime
    df["day"] = df["datetime"].dt.day # Extract day from datetime
    df["hour"] = df["datetime"].dt.hour # Extract hour from datetime
    df["month"] = df["datetime"].dt.month # Extract month from datetime
    df["day_of_week"] = df["datetime"].dt.dayofweek # Extract day of week from datetime

    def season(m : int) -> str:
        """
        Determines the season based on the month.
        Args:
            m (int): _description_
        Returns:
            str: _description_
        """
        if m in [12, 1, 2]: # Winter months
            return "winter"
        elif m in [3, 4, 5]: # Spring months
            return "spring"
        elif m in [6, 7, 8]: # Summer months
            return "summer"
        else: # Autumn months
            return "autumn"

    df["season"] = df["month"].apply(season) # Determine season from month
    df["season"] = pd.Categorical(df["season"], dtype=season_dtype) # Convert to categorical
    df["season_code"] = df["season"].cat.codes # Get season codes

    if (df["season_code"] == -1).any(): # Check for unknown season codes
        raise ValueError("Season mapping produced unknown categories!")

    if (df["area_type_code"] == -1).any(): # Check for unknown area type codes
        raise ValueError("Area type mapping produced an unknown category!")

    return df


Unnamed: 0,datetime,pm25,temperature,dew_point,pressure,rain,wind_speed,year,day,hour,month,day_of_week,season
0,2025-01-01 00:00:00,54.728807,11.919859,-7.465163,1017.885668,0,1.281176,2025,1,0,1,2,0
1,2025-01-01 01:00:00,67.377892,15.173769,6.647801,1001.615625,0,2.258348,2025,1,1,1,2,0
2,2025-01-01 02:00:00,63.461647,17.277592,10.314784,996.49742,0,4.006704,2025,1,2,1,2,0
3,2025-01-01 03:00:00,94.10825,7.298002,8.019444,1001.395422,5,1.042618,2025,1,3,1,2,0
4,2025-01-01 04:00:00,63.181457,9.426475,2.089835,1003.842333,0,4.398618,2025,1,4,1,2,0


## Apply Feature Engineering

This reproduces engineered dataset features:

In [None]:
def apply_feature_engineering(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy() # Create a copy of the dataframe to avoid modifying the original

    # Cyclical encodings
    df["hour_sin"]  = np.sin(2 * np.pi * df["hour"] / 24) # Sine transformation for hour
    df["hour_cos"]  = np.cos(2 * np.pi * df["hour"] / 24) # Cosine transformation for hour
    df["month_sin"] = np.sin(2 * np.pi * df["month"] / 12) # Sine transformation for month
    df["month_cos"] = np.cos(2 * np.pi * df["month"] / 12) # Cosine transformation for month

    # Interaction features
    df["dew_point_spread"] = df["temperature"] - df["dew_point"] # Dew point spread
    df["temp_pres_interaction"] = df["temperature"] * df["pressure"] # Temperature and pressure interaction
    df["rain_binary"] = (df["rain"] > 0).astype(int) # Binary rain indicator

    # Rolling statistics
   for rolling in [3, 6, 12, 24]:
        df[f"pm25_roll_{rolling}h_mean"] = (
            df["pm25"].shift(1).rolling(window=rolling).mean() # Rolling mean
        )# Lagged values

    for lag in [1,3,6,12,24]:
        df[f"pm25_lag_{lag}h"] = df["pm25"].shift(lag) # Lagged PM2.5 values

    return df # Return the dataframe with engineered features


Apply it:

In [17]:
df_eng = apply_feature_engineering(df) # Apply feature engineering to the synthetic data
df_eng.tail()

Unnamed: 0,datetime,pm25,temperature,dew_point,pressure,rain,wind_speed,year,day,hour,...,month_sin,month_cos,dew_point_spread,temp_pres_interaction,rain_binary,pm25_lag_1h,pm25_lag_3h,pm25_lag_6h,pm25_lag_12h,pm25_lag_24h
19,2025-01-01 19:00:00,65.208677,0.431484,11.799239,999.992186,0,3.400333,2025,1,19,...,0.5,0.866025,-11.367754,431.481077,0,29.345309,61.495684,122.127459,78.90429,
20,2025-01-01 20:00:00,87.254652,12.494346,-0.90169,1019.112921,5,0.173788,2025,1,20,...,0.5,0.866025,13.396036,12733.148971,1,65.208677,102.314042,67.095938,48.451574,
21,2025-01-01 21:00:00,54.649334,11.410768,-5.109551,1020.604415,1,0.29174,2025,1,21,...,0.5,0.866025,16.520319,11645.880528,1,87.254652,29.345309,73.48454,95.080591,
22,2025-01-01 22:00:00,95.446739,13.1685,-10.680697,997.108539,5,1.730414,2025,1,22,...,0.5,0.866025,23.849197,13130.423958,1,54.649334,65.208677,61.495684,110.236034,
23,2025-01-01 23:00:00,83.10228,10.682678,-12.778404,1005.895559,0,1.030885,2025,1,23,...,0.5,0.866025,23.461082,10745.65862,0,95.446739,87.254652,102.314042,93.876671,


## Recursive 24h Forecast

This uses the last row of the fake day and repeatedly predicts forward:

In [None]:
def forecast_next_24h(df_fake: pd.DataFrame, model: any) -> pd.DataFrame:

    df_eng = apply_feature_engineering(df_fake) # Apply feature engineering
    last = df_eng.iloc[-1].copy() # Get the last row for recursive forecasting
    preds = [] # List to store predictions

    # Recursive forecasting for the next 24 hours
    for i in range(24):
        next_time = last["datetime"] + pd.Timedelta(hours=1) # Calculate next hour datetime
        
        fr = last.copy() # Start with last known features
        fr["datetime"] = next_time # Update datetime
        fr["hour"] = next_time.hour # Update hour
        fr["month"] = next_time.month # Update month
        fr["day"] = next_time.day # Update day
        fr["day_of_week"] = next_time.dayofweek # Update day of week
        fr["year"] = next_time.year # Update year
        
        # cyclical recalc
        fr["hour_sin"]  = np.sin(2 * np.pi * fr["hour"] / 24) # Sine transformation for hour
        fr["hour_cos"]  = np.cos(2 * np.pi * fr["hour"] / 24) # Cosine transformation for hour
        fr["month_sin"] = np.sin(2 * np.pi * fr["month"] / 12) # Sine transformation for month
        fr["month_cos"] = np.cos(2 * np.pi * fr["month"] / 12) # Cosine transformation for month

        fr["season"] = pd.Categorical(fr["season"], dtype=season_dtype)
        fr["season_code"] = fr["season"].cat.codes

        fr["area_type"] = pd.Categorical(fr["area_type"], dtype=area_dtype)
        fr["area_type_code"] = fr["area_type"].cat.codes

        fr["season"] = last["season"] # Update season
        
        fr["rolling_mean_1h"] = last["pm25"] # 1-hour rolling mean

        for w in [3, 6, 12, 24]:
            fr[f"pm25_roll_{w}h_mean"] = last.get(f"pm25_roll_{w-1}h_mean", last["pm25"]) # Shift rolling means
        
        # recursive lags
        fr["pm25_lag_1h"] = last["pm25"] # Lag 1 hour is last pm25
        for lag in [3, 6, 12, 24]:
            fr[f"pm25_lag_{lag}h"] = last.get(f"pm25_lag_{lag-1}h", last["pm25"]) # Shift lags

        X = fr[model.feature_names_in_].values.reshape(1,-1) # Prepare features for prediction
        pred = model.predict(X)[0] # Make prediction
 
        preds.append({"datetime": next_time, "pm25_pred": pred}) # Store prediction

        fr["pm25"] = pred # Update pm25 with prediction
        last = fr.copy()    # Update last for next iteration

    return pd.DataFrame(preds) # Return DataFrame of predictions


Run forecast:

In [19]:
forecast_df = forecast_next_24h(df_eng, model)
forecast_df.head()


Unnamed: 0,datetime,pm25_pred
0,2025-01-02 00:00:00,83.629646
1,2025-01-02 01:00:00,82.307869
2,2025-01-02 02:00:00,72.420067
3,2025-01-02 03:00:00,64.173599
4,2025-01-02 04:00:00,57.301754


### Save Results for Dashboard

In [20]:
df_eng.to_csv(DERIVED_PATH / "forecast_fake_day.csv", index=False)
forecast_df.to_csv(DERIVED_PATH / "forecast_next_24h.csv", index=False)

print("Saved forecast input + output for dashboard.")


Saved forecast input + output for dashboard.


## Summary

This notebook demonstrates:

- How the model reacts to realistic but synthetic atmospheric conditions
- Predictive capability under user-defined scenarios
- A method to forecast 24 hours ahead using recursive prediction
- Dashboard-ready output files

This is a powerful demonstration of model interpretability and forecasting ability.

---

### AI Assistance Note

Some narrative text and minor formatting or wording improvements in this notebook were supported by AI-assisted tools (ChatGPT for documentation clarity, Copilot for small routine code suggestions, and Grammarly for proofreading). All analysis, code logic, feature engineering, modelling, and interpretations were independently created by the author.