# 🔬 Beijing Air Quality
## 📘 Notebook 12 – Forecast Simulation & Scenario Modelling

| Field         | Description                                      |
|:--------------|:-------------------------------------------------|
| Author:       | Robert Steven Elliott                            |
| Course:       | Code Institute – Data Analytics with AI Bootcamp |
| Project Type: | Capstone                                         |
| Date:         | December 2025                                    |

This project complies with the CC BY 4.0 licence by including proper attribution.


## Objectives

This notebook introduces scenario-based forecasting using the best-performing prediction model from Notebook 11.

Specifically, it:
- Generates a synthetic (fake) 24-hour day with realistic meteorology + PM2.5
- Applies the full feature-engineering pipeline
- Uses a trained ML model to predict the next 24 hours recursively
- Saves reproducible forecast outputs for the Streamlit dashboard
- Enables user-driven forecasting scenarios (cold day / rainy day / high wind day / etc.)


## Inputs

- Best trained forecasting model from Notebook 11 (e.g., xgb_best_model.joblib or similar)
- Feature list (model.feature_names_in_)
- Dataset ranges (for realistic randomisation)
- No raw or cleaned datasets are required; this notebook generates its own inputs.


## Outputs

- forecast_fake_day.csv: synthetic input day
- forecast_next_24h.csv: forecast results
- Figure: fake_day_forecast_plot.png
- Ready-to-load files for the Streamlit dashboard


## Citation  
This project uses data from:

Chen, Song (2017). *Beijing Multi-Site Air Quality.*  
UCI Machine Learning Repository. Licensed under **CC BY 4.0**.  
DOI: https://doi.org/10.24432/C5RK5G  
Kaggle mirror by Manu Siddhartha.

---

## Notebook Setup

### Import Required Libraries

(The following libraries support analysis, plotting, and data manipulation.)

In [32]:
import sys # system-level operations
import pandas as pd # data manipulation
import numpy as np # numerical operations
import matplotlib.pyplot as plt # plotting
import seaborn as sns # statistical data visualization
import plotly.express as px # interactive plotting
import joblib # model serialization
from pathlib import Path # filesystem paths

### Configure Visual Settings

In [33]:

plt.style.use("seaborn-v0_8") # set matplotlib style
sns.set_theme() # set seaborn theme

### Set Up Project Paths

In [34]:
PROJECT_ROOT = Path.cwd().parent # Assuming this script is in a subdirectory of the project root
DATA_PATH = PROJECT_ROOT / "data" # Path to the data directory
DERIVED_PATH = DATA_PATH / "derived" # Path to derived data
MODELS_PATH = PROJECT_ROOT / "models" # Path to models directory

sys.path.append(str(PROJECT_ROOT)) # Add project root to sys.path

FIGURES_PATH = PROJECT_ROOT / "figures" / "h2" # Path to save figures
FIGURES_PATH.mkdir(parents=True, exist_ok=True) # Create directory if it doesn't exist

### Load Saved Dtypes

In [35]:
season_dtype = joblib.load(MODELS_PATH / "season_dtype.joblib")
area_dtype = joblib.load(MODELS_PATH / "area_dtype.joblib")

print("Loaded season and area_type dtypes")

Loaded season and area_type dtypes
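
The saved dtypes matter because the model works with integer category codes, so the name-to-code mapping has to match training. As a minimal sketch (the category order below is a made-up stand-in, not the actual contents of season_dtype.joblib), this shows how a fixed `CategoricalDtype` maps names to codes and marks unseen values with -1:

```python
import pandas as pd

# Hypothetical category order; the real one is whatever was saved in Notebook 11.
season_dtype_demo = pd.CategoricalDtype(
    categories=["winter", "spring", "summer", "autumn"], ordered=False
)

# "monsoon" is not a known category, so it becomes NaN with code -1.
values = ["winter", "summer", "monsoon"]
coded = pd.Categorical(values, dtype=season_dtype_demo)
codes = coded.codes.tolist()

print(codes)
```

This is why the generator further down raises an error whenever a code of -1 appears: it signals a value the model never saw during training.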


### Load Model

In [36]:
model = joblib.load(MODELS_PATH / "best_regression_model.joblib") # Load the best forecasting model
model

## Generate a Fake 24-Hour Day

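This produces synthetic data: PM2.5 and the continuous weather variables are drawn from normal distributions and clipped to ranges plausible for the Beijing dataset, while rain is sampled from a small set of discrete values.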

In [37]:
def generate_fake_day(start_date: str = "2025-01-01", area_type: str = "urban") -> pd.DataFrame:
    """
    Generates synthetic 24h of PM2.5 + weather using realistic ranges.
    Ensures area_type and season values match training categorical dtypes.
    """

    # --- VALIDATE AREA TYPE AGAINST TRAINING CATEGORIES ---
    if area_type not in area_dtype.categories:
        raise ValueError(
            f"Invalid area_type '{area_type}'. Must be one of: {list(area_dtype.categories)}"
        ) # Validate area_type input

    hours = pd.date_range(start=start_date, periods=24, freq="h") # Generate hourly datetime range
    rng = np.random.default_rng() # Random number generator

    df = pd.DataFrame({
        "datetime": hours,
        "pm25": rng.normal(80, 20, 24).clip(5, 300),
        "temperature": rng.normal(10, 7, 24).clip(-20, 35),
        "dew_point": rng.normal(0, 10, 24).clip(-25, 25),
        "pressure": rng.normal(1010, 7, 24).clip(980, 1040),
        "rain": rng.choice([0, 0, 0, 1, 2, 5], 24),
        "wind_speed": rng.normal(2.5, 1.5, 24).clip(0, 10),
    }) # Create DataFrame with synthetic data

    df["area_type"] = area_type # Set area type
    df["area_type"] = pd.Categorical(df["area_type"], dtype=area_dtype) # Convert to categorical
    df["area_type_code"] = df["area_type"].cat.codes # Get area type codes

    df["year"] = df["datetime"].dt.year # Extract year from datetime
    df["day"] = df["datetime"].dt.day # Extract day from datetime
    df["hour"] = df["datetime"].dt.hour # Extract hour from datetime
    df["month"] = df["datetime"].dt.month # Extract month from datetime
    df["day_of_week"] = df["datetime"].dt.dayofweek # Extract day of week from datetime

    def season(m: int) -> str:
        """
        Maps a month number to its meteorological season.
        Args:
            m (int): Month number (1-12).
        Returns:
            str: One of "winter", "spring", "summer", or "autumn".
        """
        if m in [12, 1, 2]: # Winter months
            return "winter"
        elif m in [3, 4, 5]: # Spring months
            return "spring"
        elif m in [6, 7, 8]: # Summer months
            return "summer"
        else: # Autumn months
            return "autumn"

    df["season"] = df["month"].apply(season) # Determine season from month
    df["season"] = pd.Categorical(df["season"], dtype=season_dtype) # Convert to categorical
    df["season_code"] = df["season"].cat.codes # Get season codes

    if (df["season_code"] == -1).any(): # Check for unknown season codes
        raise ValueError("Season mapping produced unknown categories!")

    if (df["area_type_code"] == -1).any(): # Check for unknown area type codes
        raise ValueError("Area type mapping produced an unknown category!")

    return df


In [38]:
df = generate_fake_day(start_date="2025-01-01", area_type="urban") # Generate synthetic data
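
Because `generate_fake_day` creates an unseeded random generator, each run yields a different synthetic day. If repeatable outputs are wanted, one option is to pass a seed through to `np.random.default_rng`. The sketch below assumes a hypothetical helper `draw_pm25` with a `seed` parameter; neither is part of the notebook's function:

```python
import numpy as np

def draw_pm25(seed=None, n=24):
    """Hypothetical helper: draw n clipped PM2.5 values, optionally seeded."""
    rng = np.random.default_rng(seed)  # same seed -> same sequence of draws
    return rng.normal(80, 20, n).clip(5, 300)

a = draw_pm25(seed=42)
b = draw_pm25(seed=42)

# Two calls with the same seed produce identical arrays.
same = bool(np.array_equal(a, b))
print(same)
```

The same idea could be applied to `generate_fake_day` by adding an optional `seed` argument and forwarding it to the generator.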

## Apply Feature Engineering

This reproduces the engineered features the model expects (cyclical time encodings, interaction terms, and PM2.5 rolling means and lags):

In [39]:
def apply_forecasting_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Cyclical
    df["hour_sin"]  = np.sin(2 * np.pi * df["hour"] / 24)
    df["hour_cos"]  = np.cos(2 * np.pi * df["hour"] / 24)
    df["month_sin"] = np.sin(2 * np.pi * df["month"] / 12)
    df["month_cos"] = np.cos(2 * np.pi * df["month"] / 12)

    # Interaction features
    df["dew_point_spread"] = df["temperature"] - df["dew_point"]
    df["temp_pres_interaction"] = df["temperature"] * df["pressure"]
    df["rain_binary"] = (df["rain"] > 0).astype(int)

    # Rolling windows
    for w in [3, 6, 12, 18]:
        df[f"pm25_roll_{w}h_mean"] = (
            df["pm25"].shift(1).rolling(w).mean()
        )

    # Lag features
    for lag in [1, 3, 6, 12, 18]:
        df[f"pm25_lag_{lag}h"] = df["pm25"].shift(lag)

    return df
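
To illustrate why hour and month are encoded with sine and cosine, the following sketch compares distances between hours in the encoded space. The helper names `encode_hour` and `hour_distance` are illustrative only and do not appear in the notebook:

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) to a point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

def hour_distance(h1, h2):
    """Euclidean distance between two hours in (sin, cos) space."""
    return float(np.linalg.norm(encode_hour(h1) - encode_hour(h2)))

d_23_0 = hour_distance(23, 0)  # one hour apart across midnight
d_1_0 = hour_distance(1, 0)    # one hour apart within the same day
d_12_0 = hour_distance(12, 0)  # opposite sides of the clock

print(d_23_0, d_1_0, d_12_0)
```

With the raw hour value, 23 and 0 would look 23 units apart; in the encoded space they are exactly as close as 1 and 0, which is the behaviour a model needs around midnight.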


Apply it:

In [40]:
df_eng = apply_forecasting_features(df) # Apply feature engineering to the synthetic data
df_eng.tail()

|    | datetime | pm25 | temperature | dew_point | pressure | rain | wind_speed | area_type | area_type_code | year | ... | rain_binary | pm25_roll_3h_mean | pm25_roll_6h_mean | pm25_roll_12h_mean | pm25_roll_18h_mean | pm25_lag_1h | pm25_lag_3h | pm25_lag_6h | pm25_lag_12h | pm25_lag_18h |
|---:|:---------|-----:|------------:|----------:|---------:|-----:|-----------:|:----------|---------------:|-----:|:---:|------------:|------------------:|------------------:|-------------------:|-------------------:|------------:|------------:|------------:|-------------:|-------------:|
| 19 | 2025-01-01 19:00:00 | 81.070365 | 11.583486 | 11.883598 | 1016.723832 | 0 | 2.39188 | urban | 2 | 2025 | ... | 0 | 72.506685 | 79.016435 | 85.944449 | 78.83088 | 97.523937 | 39.698509 | 91.894161 | 93.879168 | 38.417463 |
| 20 | 2025-01-01 20:00:00 | 56.565427 | 5.409058 | 0.224427 | 1016.116103 | 0 | 5.582016 | urban | 2 | 2025 | ... | 0 | 86.297304 | 77.212469 | 84.877049 | 81.200485 | 81.070365 | 80.297609 | 93.569114 | 96.492338 | 70.598677 |
| 21 | 2025-01-01 21:00:00 | 59.380359 | 23.052506 | 3.017506 | 1009.076 | 0 | 2.293546 | urban | 2 | 2025 | ... | 0 | 78.386576 | 71.045188 | 81.549806 | 80.42086 | 56.565427 | 97.523937 | 71.115283 | 107.51294 | 75.863987 |
| 22 | 2025-01-01 22:00:00 | 113.062832 | 1.049522 | 1.585371 | 1004.791421 | 1 | 0.444254 | urban | 2 | 2025 | ... | 1 | 65.67205 | 69.089368 | 77.538758 | 79.505103 | 59.380359 | 81.070365 | 39.698509 | 108.48567 | 80.041567 |
| 23 | 2025-01-01 23:00:00 | 90.449596 | 7.738894 | -8.734115 | 1006.142523 | 2 | 4.49243 | urban | 2 | 2025 | ... | 1 | 76.336206 | 81.316755 | 77.920188 | 81.339618 | 113.062832 | 56.565427 | 80.297609 | 90.703037 | 53.497647 |
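
The rolling means in `apply_forecasting_features` are computed on `pm25.shift(1)`, so each window only contains values from earlier hours. A small worked example on a toy series (values chosen purely for illustration) makes the effect visible:

```python
import pandas as pd

pm25 = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# shift(1) moves every value one step later, so row i sees only rows < i.
roll_3 = pm25.shift(1).rolling(3).mean()

# The first three rows are NaN: a full window of three past values
# is not available until index 3.
n_missing = int(roll_3.isna().sum())
print(roll_3.tolist())  # [nan, nan, nan, 2.0, 3.0]
```

At index 3 the window covers the values 1, 2, and 3 (mean 2.0), and the current value 4 is excluded. On the 24-hour fake day this also means the first `w` rows of each `pm25_roll_{w}h_mean` column are missing, which is why the cell above inspects the tail rather than the head.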


## Recursive 24h Forecast

This starts from the last row of the fake day and predicts forward one hour at a time, feeding each prediction back into the lag and rolling features for the next step:

In [41]:
def forecast_next_24h(df_fake: pd.DataFrame, model) -> pd.DataFrame:

    # Apply feature engineering to the initial fake day
    history = apply_forecasting_features(df_fake).reset_index(drop=True)

    preds = []

    for _ in range(24):

        last = history.iloc[-1].copy()
        next_time = last["datetime"] + pd.Timedelta(hours=1)

        # Start next row
        fr = pd.Series(dtype="object")  # object dtype: the row mixes timestamps, strings, and numbers
        fr["datetime"] = next_time
        fr["year"] = next_time.year
        fr["month"] = next_time.month
        fr["day"] = next_time.day
        fr["hour"] = next_time.hour
        fr["day_of_week"] = next_time.dayofweek

        # Carry forward weather (no change in scenario mode)
        for col in ["temperature", "dew_point", "pressure", "rain", "wind_speed"]:
            fr[col] = last[col]

        # Categorical encodings
        def month_to_season(m):
            if m in [12,1,2]: return "winter"
            if m in [3,4,5]: return "spring"
            if m in [6,7,8]: return "summer"
            return "autumn"

        season_name = month_to_season(fr["month"])
        season_code = season_dtype.categories.get_loc(season_name)

        # MUST match model expectation: "season" is the numeric column
        # expected by the model, so store the category code, not the name.
        fr["season"] = season_code

        fr["area_type"] = last["area_type"]
        fr["area_type_code"] = last["area_type_code"]

        # Cyclical encodings
        fr["hour_sin"]  = np.sin(2 * np.pi * fr["hour"] / 24)
        fr["hour_cos"]  = np.cos(2 * np.pi * fr["hour"] / 24)
        fr["month_sin"] = np.sin(2 * np.pi * fr["month"] / 12)
        fr["month_cos"] = np.cos(2 * np.pi * fr["month"] / 12)

        # Interaction terms
        fr["dew_point_spread"] = fr["temperature"] - fr["dew_point"]
        fr["temp_pres_interaction"] = fr["temperature"] * fr["pressure"]
        fr["rain_binary"] = int(fr["rain"] > 0)  # int, matching apply_forecasting_features

        # PM2.5 history for rolling + lag features
        pm25_history = history["pm25"].tolist()

        # Rolling windows
        for w in [3, 6, 12, 18]:
            fr[f"pm25_roll_{w}h_mean"] = (
                np.mean(pm25_history[-w:]) if len(pm25_history) >= w else np.mean(pm25_history)
            )

        # Lag features
        for lag in [1, 3, 6, 12, 18]:
            fr[f"pm25_lag_{lag}h"] = (
                pm25_history[-lag] if len(pm25_history) >= lag else pm25_history[0]
            )

        # Predict next PM2.5
        # Cast to float so the model receives a numeric array rather than object dtype
        X = fr[model.feature_names_in_].astype(float).values.reshape(1, -1)
        pm25_pred = model.predict(X)[0]

        preds.append({"datetime": next_time, "pm25_pred": pm25_pred})

        # Add predicted pm25 to fr and append to history
        fr["pm25"] = pm25_pred
        history = pd.concat([history, fr.to_frame().T], ignore_index=True)

    return pd.DataFrame(preds)
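
The core of the recursive loop can be seen in isolation with a toy example. The sketch below is not the notebook's model: `MeanOfLagsModel` is a hypothetical stand-in that simply averages its inputs, and only two features (a 1-hour lag and a 3-hour rolling mean) are used:

```python
import numpy as np

class MeanOfLagsModel:
    """Hypothetical stand-in for the trained model: averages its input features."""
    def predict(self, X):
        return np.asarray(X, dtype=float).mean(axis=1)

def recursive_forecast(history, model, steps=3):
    """Predict `steps` values, appending each prediction to the history."""
    history = list(history)
    preds = []
    for _ in range(steps):
        lag_1 = history[-1]                    # most recent value (observed or predicted)
        roll_3 = float(np.mean(history[-3:]))  # mean of the last three values
        X = np.array([[lag_1, roll_3]])
        y_hat = float(model.predict(X)[0])
        preds.append(y_hat)
        history.append(y_hat)                  # the prediction becomes future input
    return preds

preds = recursive_forecast([60.0, 80.0, 100.0], MeanOfLagsModel(), steps=3)
print(preds)
```

From the second step onward the features are built partly from earlier predictions rather than observations, so any error in one step carries into the next. That is the main trade-off of recursive forecasting compared with training a separate model per horizon.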


Run forecast:

In [42]:
print(model.feature_names_in_)

forecast_df = forecast_next_24h(df_eng, model)
forecast_df.head()


['temperature' 'dew_point' 'pressure' 'rain' 'wind_speed'
 'temp_pres_interaction' 'dew_point_spread' 'rain_binary' 'hour_sin'
 'hour_cos' 'month_sin' 'month_cos' 'season' 'day_of_week' 'month' 'year'
 'pm25_lag_1h' 'pm25_lag_3h' 'pm25_lag_6h' 'pm25_lag_12h' 'pm25_lag_18h'
 'pm25_roll_3h_mean' 'pm25_roll_6h_mean' 'pm25_roll_12h_mean'
 'pm25_roll_18h_mean']


|    | datetime            | pm25_pred |
|---:|:--------------------|----------:|
|  0 | 2025-01-02 00:00:00 | 76.161758 |
|  1 | 2025-01-02 01:00:00 | 61.156464 |
|  2 | 2025-01-02 02:00:00 | 52.787884 |
|  3 | 2025-01-02 03:00:00 | 47.635162 |
|  4 | 2025-01-02 04:00:00 | 44.379475 |
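
The Outputs section lists a figure, fake_day_forecast_plot.png. One possible way to produce it is sketched below. The two DataFrames are made-up stand-ins for `df_eng` and `forecast_df`, and a temporary directory is used in place of `FIGURES_PATH` so the example runs on its own:

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Illustrative stand-ins for df_eng and forecast_df (values are made up).
fake_day = pd.DataFrame({
    "datetime": pd.date_range("2025-01-01", periods=24, freq="h"),
    "pm25": np.linspace(60, 100, 24),
})
forecast = pd.DataFrame({
    "datetime": pd.date_range("2025-01-02", periods=24, freq="h"),
    "pm25_pred": np.linspace(95, 45, 24),
})

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(fake_day["datetime"], fake_day["pm25"], label="Synthetic day (input)")
ax.plot(forecast["datetime"], forecast["pm25_pred"], linestyle="--",
        label="Recursive forecast")
ax.set_xlabel("Datetime")
ax.set_ylabel("PM2.5 (µg/m³)")
ax.set_title("Fake day and next-24h forecast")
ax.legend()
fig.tight_layout()

# In the notebook this would be FIGURES_PATH; a temp dir keeps the sketch standalone.
out_file = Path(tempfile.mkdtemp()) / "fake_day_forecast_plot.png"
fig.savefig(out_file, dpi=150)
plt.close(fig)
saved_ok = out_file.exists()
```

Plotting the input day and the forecast on a shared time axis makes it easy to see where observed-style data ends and model output begins.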


### Save Results for Dashboard

In [43]:
df_eng.to_csv(DERIVED_PATH / "forecast_fake_day.csv", index=False)
forecast_df.to_csv(DERIVED_PATH / "forecast_next_24h.csv", index=False)

print("Saved forecast input + output for dashboard.")


Saved forecast input + output for dashboard.
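
On the dashboard side, the CSV has to be read back with its datetime column parsed, since CSV stores timestamps as plain text. The sketch below is an assumption about how that loading step might look; it writes a small made-up frame to a temporary directory rather than touching `DERIVED_PATH`:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for forecast_df.
demo = pd.DataFrame({
    "datetime": pd.date_range("2025-01-02", periods=3, freq="h"),
    "pm25_pred": [76.2, 61.2, 52.8],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "forecast_next_24h.csv"
    demo.to_csv(csv_path, index=False)
    # parse_dates restores the datetime dtype lost in the CSV round trip.
    loaded = pd.read_csv(csv_path, parse_dates=["datetime"])

is_datetime = pd.api.types.is_datetime64_any_dtype(loaded["datetime"])
print(is_datetime, loaded.shape)
```

Without `parse_dates`, the `datetime` column would come back as strings and time-axis plotting in the dashboard would need an extra conversion step.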


## Summary

This notebook demonstrates:

- How the model reacts to realistic but synthetic atmospheric conditions
- Predictive capability under user-defined scenarios
- A method to forecast 24 hours ahead using recursive prediction
- Dashboard-ready output files

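Together, these steps show how the trained model can drive scenario-based forecasts. Because weather is carried forward unchanged and each prediction feeds the next step's lag and rolling features, errors can accumulate over the 24-hour horizon, so the output is best read as an illustrative scenario rather than a validated forecast.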

---

### AI Assistance Note

Some narrative text and minor formatting or wording improvements in this notebook were supported by AI-assisted tools (ChatGPT for documentation clarity, Copilot for small routine code suggestions, and Grammarly for proofreading). All analysis, code logic, feature engineering, modelling, and interpretations were independently created by the author.