# 03_model_regression – Weekly Bookings Forecast (Regression)

## Objectives
- Train baseline and simple regression models to forecast weekly bookings by region.
- Establish benchmark metrics (MAE, MAPE, R²).
- Produce evaluation plots (actual vs predicted, residuals).
- Assess whether performance meets business requirements (KPIs).

## Inputs
- `data/processed/train_regression.csv`
- `data/processed/test_regression.csv`

## Outputs
- Baseline metrics
- Linear/ElasticNet model metrics
- Evaluation plots saved to `reports/figures/`
- (Advanced boosted model + hyperparameter tuning will be added in Part 2)


In [None]:
import pandas as pd
import numpy as np
from pathlib import Path

from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, r2_score
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

import matplotlib.pyplot as plt
import seaborn as sns

BASE_DIR = Path("..").resolve()
DATA = BASE_DIR / "data" / "processed"
FIG_DIR = BASE_DIR / "reports" / "figures"
FIG_DIR.mkdir(parents=True, exist_ok=True)

sns.set_theme(style="whitegrid")


In [None]:
train = pd.read_csv(DATA / "train_regression.csv", parse_dates=["week_start"])
test = pd.read_csv(DATA / "test_regression.csv", parse_dates=["week_start"])

train.head(), test.head()


In [None]:
TARGET = "bookings_count"

FEATURES = [
    "region",
    "week_number",
    "month",
    "is_bank_holiday_week",
    "is_peak_winter",
    "mean_temp_c",
    "precip_mm",
    "snowfall_flag",
    "wind_speed_kph",
    "visibility_km",
    "lag_1w_bookings",
    "lag_4w_mean",
    "lag_52w_bookings"
]

X_train = train[FEATURES]
y_train = train[TARGET]
X_test = test[FEATURES]
y_test = test[TARGET]


In [None]:
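# Naive baseline: predict each week's bookings as the previous week's value
# (lag_1w_bookings). Assumes the lag column has no missing values in the test set.
baseline_pred = X_test["lag_1w_bookings"]

baseline_mae = mean_absolute_error(y_test, baseline_pred)
# scikit-learn returns MAPE as a fraction (0.10 means 10%), not a percentage.
baseline_mape = mean_absolute_percentage_error(y_test, baseline_pred)
baseline_r2 = r2_score(y_test, baseline_pred)

baseline_mae, baseline_mape, baseline_r2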
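
For reference, the three metrics used throughout this notebook can be written out in plain NumPy. This is a minimal sketch that mirrors the scikit-learn definitions, not a replacement for them; the function name `regression_metrics` and the toy numbers are illustrative assumptions, not values from the bookings data.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, MAPE (as a fraction) and R² for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred

    mae = np.mean(np.abs(err))

    # Guard against division by zero the same way scikit-learn does:
    # clip the denominator at machine epsilon. Near-zero actuals still
    # produce a very large MAPE, so treat MAPE with care on low-volume weeks.
    eps = np.finfo(np.float64).eps
    mape = np.mean(np.abs(err) / np.maximum(np.abs(y_true), eps))

    # R² compares squared error against a constant "predict the mean" model.
    # This sketch does not handle a constant y_true (ss_tot == 0).
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot

    return {"mae": mae, "mape": mape, "r2": r2}


# Toy values for illustration only.
toy_actual = [100.0, 200.0, 400.0]
toy_pred = [110.0, 180.0, 400.0]
toy_metrics = regression_metrics(toy_actual, toy_pred)
print(toy_metrics)
```

Note that MAPE is a fraction here, as in scikit-learn, so multiply by 100 before reporting it as a percentage.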


In [None]:
categorical = ["region"]
numeric = [col for col in FEATURES if col not in categorical]

preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("num", "passthrough", numeric)
    ]
)

linreg = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("model", LinearRegression())
    ]
)

linreg.fit(X_train, y_train)

pred_lin = linreg.predict(X_test)

lin_mae = mean_absolute_error(y_test, pred_lin)
lin_mape = mean_absolute_percentage_error(y_test, pred_lin)
lin_r2 = r2_score(y_test, pred_lin)

lin_mae, lin_mape, lin_r2


In [None]:
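from sklearn.preprocessing import StandardScaler

# ElasticNet's penalty depends on feature scale, so standardise the numeric
# columns. A separate transformer also avoids reusing (and refitting) the
# `preprocess` object that is already part of the `linreg` pipeline.
preprocess_scaled = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("num", StandardScaler(), numeric)
    ]
)

elastic = Pipeline(
    steps=[
        ("preprocess", preprocess_scaled),
        ("model", ElasticNet(random_state=42))
    ]
)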

elastic.fit(X_train, y_train)
pred_elastic = elastic.predict(X_test)

el_mae = mean_absolute_error(y_test, pred_elastic)
el_mape = mean_absolute_percentage_error(y_test, pred_elastic)
el_r2 = r2_score(y_test, pred_elastic)

el_mae, el_mape, el_r2
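
For context, the `ElasticNet` model fitted above uses scikit-learn's defaults, `alpha=1.0` and `l1_ratio=0.5`. As documented by scikit-learn, it minimises the objective below, where $n$ is the number of training rows, $\alpha$ is `alpha` and $\rho$ is `l1_ratio`:

$$
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 \;+\; \alpha\,\rho\,\lVert w \rVert_1 \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
$$

Both penalty terms act on the raw coefficient values, so the amount of shrinkage each feature receives depends on that feature's scale. The default `alpha` is not tuned to this dataset and should be read as a starting point only.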


In [None]:
# Actual vs predicted: points on the dashed red line are perfect predictions
plt.figure(figsize=(6, 6))
sns.scatterplot(x=y_test, y=pred_lin, alpha=0.6)
lims = [y_test.min(), y_test.max()]
plt.plot(lims, lims, "r--")
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.title("Linear Regression – Actual vs Predicted")

fig_path = FIG_DIR / "regression_lin_actual_vs_pred.png"
plt.savefig(fig_path, dpi=120)
fig_path


In [None]:
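# Residuals = actual - predicted; positive values mean the model under-predicted
residuals = y_test - pred_lin

plt.figure(figsize=(6, 4))
sns.histplot(residuals, kde=True)
plt.xlabel("Residual (actual - predicted)")
plt.title("Linear Regression – Residual Distribution")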

fig_path = FIG_DIR / "regression_lin_residuals.png"
plt.savefig(fig_path, dpi=120)
fig_path


### Model Evaluation Summary (Before Advanced Tuning)

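- The naive baseline (lag-1: this week's bookings equal last week's) is a simple benchmark and can be hard to beat when bookings change slowly from week to week.
- Linear Regression should be judged against that baseline using the test-set MAE, MAPE and R² computed above. It can capture additive calendar and weather effects but cannot represent non-linear structure or interactions unless they are engineered as features.
- ElasticNet regularisation stabilises coefficients by shrinking them, but with default settings it may still underfit relative to the non-linear models planned for the next phase.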
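
One way to make the comparison against the baseline concrete is a relative improvement score, i.e. the fraction by which a model's error is lower than the baseline's. The sketch below is an illustrative helper, not part of the original notebook; the name `relative_improvement` and the example MAE values are placeholders.

```python
def relative_improvement(model_error: float, baseline_error: float) -> float:
    """Fraction by which a model's error is lower than the baseline's.

    Positive: the model beats the baseline. Zero: a tie.
    Negative: the model is worse than the baseline.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 1.0 - model_error / baseline_error


# Placeholder MAE values for illustration only (not results from this notebook).
example_baseline_mae = 50.0
example_model_mae = 40.0
example_gain = relative_improvement(example_model_mae, example_baseline_mae)
print(f"Improvement over baseline: {example_gain:.0%}")
```

In this notebook the same call could be made with `lin_mae` or `el_mae` as the model error and `baseline_mae` as the baseline error.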

These results set the benchmark that later models must beat and feed into the MAE/MAPE targets (KPIs) of the ML Business Case.
The next modelling phase will introduce non-linear models (Random Forest, XGBoost) and
hyperparameter tuning, with the aim of meeting the business requirements.
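
The check against business requirements can be expressed as a small pass/fail gate. The sketch below assumes the KPIs are stated as upper bounds on MAE and MAPE; the class `KpiTargets`, the function `check_kpis` and every threshold shown are hypothetical and should be replaced with the values agreed in the business case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KpiTargets:
    """Upper bounds a model must stay under (hypothetical structure)."""
    max_mae: float   # in bookings per week
    max_mape: float  # as a fraction, e.g. 0.15 for 15%


def check_kpis(mae: float, mape: float, targets: KpiTargets) -> dict:
    """Return a pass/fail flag per KPI plus an overall flag."""
    result = {
        "mae_ok": mae <= targets.max_mae,
        "mape_ok": mape <= targets.max_mape,
    }
    result["all_ok"] = all(result.values())
    return result


# Placeholder thresholds and metric values, for illustration only.
example_targets = KpiTargets(max_mae=50.0, max_mape=0.15)
example_check = check_kpis(mae=42.0, mape=0.18, targets=example_targets)
print(example_check)
```

Because scikit-learn reports MAPE as a fraction, the threshold is expressed as a fraction as well so the two can be compared directly.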
