
# Influenza Forecasting with Machine Learning
## Weekly Influenza Cases – Germany (RKI Data)


---
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ShamsaraE/time-series-medicine-biology-2026/blob/main/notebooks/05_Influenza_Forecasting_ML.ipynb)

---
We demonstrate:

- Log transformation
- Lag-based supervised learning
- Ridge regression with scaling
- Proper recursive multi-step forecasting
- Seasonal naive baseline
- MASE evaluation



# 1. Import Required Libraries


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Machine learning tools
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error



# 2. Load and Prepare Data


We load weekly influenza case counts from the RKI repository and:

1. Filter for Germany
2. Convert ISO week labels to calendar dates (the Monday of each week)
3. Aggregate weekly counts
4. Enforce weekly frequency
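
Step 2 is the least obvious of these. As a minimal sketch using only the standard library, the ISO directives `%G` (ISO year), `%V` (ISO week) and `%u` (weekday, 1 = Monday) map a week label to the Monday that starts it. The sample labels are illustrative and are not taken from the RKI file.

```python
from datetime import datetime

# Illustrative ISO week labels (assumed format "YYYY-Www", as implied by the
# format string used for the Meldewoche column).
labels = ["2020-W01", "2020-W53", "2021-W01"]

# Appending "-1" selects Monday; %G/%V/%u are the ISO year/week/weekday directives.
mondays = [datetime.strptime(label + "-1", "%G-W%V-%u").date() for label in labels]

for label, monday in zip(labels, mondays):
    print(label, "->", monday)
```

ISO week 1 of 2020 starts in December 2019, and 2020 has 53 ISO weeks, so week labels do not map one-to-one onto calendar years.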


In [None]:

# Load influenza data
url = "https://raw.githubusercontent.com/robert-koch-institut/Influenzafaelle_in_Deutschland/main/IfSG_Influenzafaelle.tsv"
df = pd.read_csv(url, sep="\t")

# Filter for Germany
df = df[df["Region"] == "Deutschland"].copy()

# Convert ISO week to actual Monday date
df["date"] = pd.to_datetime(df["Meldewoche"] + "-1", format="%G-W%V-%u")

# Sort and set index
df = df.sort_values("date").set_index("date")

# Aggregate weekly cases
ts = df.groupby(df.index)["Fallzahl"].sum().astype(float)

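# Ensure a regular weekly index (weeks anchored on Monday).
# Note: asfreq inserts NaN for any week missing from the source,
# so check ts.isna().sum() before building lag features.
ts = ts.asfreq("W-MON")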



# 3. Log Transformation

## Definition
The log transformation

$\log(1 + y)$

stabilizes the variance of skewed count data. Adding 1 keeps the transform defined for weeks with zero cases.

## Why?
Weekly case counts can range from 0 in the off-season to more than 100,000 at an epidemic peak.
A linear model fits such data better on a scale where the variance is roughly constant.
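
As a small illustration of the transform and its inverse, the sketch below applies `np.log1p` to toy counts and recovers them with `np.expm1`. The values are made up for the example and are not RKI data.

```python
import numpy as np

# Toy weekly counts spanning several orders of magnitude (illustrative only).
counts = np.array([0.0, 9.0, 99.0, 9999.0])

# log1p(y) = log(1 + y) is defined at y = 0 and compresses large peaks.
log_counts = np.log1p(counts)

# expm1 is the inverse of log1p and maps log-scale values back to case counts.
recovered = np.expm1(log_counts)

print(np.round(log_counts, 3))
print(np.round(recovered, 1))
```

The same `np.expm1` call can be applied to log-scale forecasts when results need to be reported as case counts.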


In [None]:

ts_log = np.log1p(ts)

plt.figure(figsize=(10, 4))
plt.plot(ts_log)
plt.title("Log-Transformed Influenza Cases")
plt.xlabel("Week")
plt.ylabel("log(1 + cases)")
plt.show()



## Interpretation

Variance is now more stable across seasons.
Large peaks no longer dominate the scale.



# 4. Lag Feature Engineering

## Definition
Lag features convert a time series into a supervised learning problem:

$y_t = f(y_{t-1}, y_{t-2}, \dots, y_{t-p})$

## What we include
- Short memory: lags 1, 2, 3
- Seasonal memory: lag 52 (approximately one year; ISO years have 52 or 53 weeks)
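
To make the construction concrete, here is a minimal sketch on a six-point toy series with lags 1 and 3 instead of 1, 2, 3 and 52. The names `toy`, `toy_lags` and `frame` are hypothetical and chosen so they do not clash with the notebook's variables.

```python
import pandas as pd

# Toy series with values 10..15 (illustrative only).
toy = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0], name="y")
toy_lags = [1, 3]

frame = pd.DataFrame({"y": toy})
for lag in toy_lags:
    # shift(lag) moves values down by `lag` rows, so row t holds y_{t-lag}.
    frame[f"lag{lag}"] = frame["y"].shift(lag)

# Rows whose longest lag is undefined are dropped: max(toy_lags) rows are lost.
frame = frame.dropna()
print(frame)
```

With lag 52 in the feature set, the same logic discards the first 52 weeks of the series.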


In [None]:

lags = [1, 2, 3, 52]

data = pd.DataFrame({"y": ts_log})

# Create lagged features
for lag in lags:
    data[f"lag{lag}"] = data["y"].shift(lag)

# Remove rows with missing values
data = data.dropna()

X = data.drop(columns="y")
y = data["y"]
data.head()


## Interpretation

We transformed the time series into a regression dataset.
Each row now pairs the current value `y` with the past values used to predict it. The first 52 rows are dropped because their `lag52` is undefined.



# 5. Train-Test Split

## Definition
A time series split must preserve temporal order: the model is trained on the past and tested on the future, never the reverse.

## What we do
We reserve the last 52 weeks for testing.
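
A minimal sketch of such a chronological holdout on a toy array; `holdout = 3` is a hypothetical stand-in for the 52 weeks used here.

```python
import numpy as np

# Toy series of 10 time-ordered observations (illustrative only).
series = np.arange(10)
holdout = 3  # hypothetical test length; the notebook uses 52

# Chronological split: no shuffling, the test block is strictly the most recent data.
train_part = series[:-holdout]
test_part = series[-holdout:]

print("train:", train_part)
print("test: ", test_part)
```

A shuffled split would mix future observations into the training set and make the test error look better than it is.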


In [None]:

test_size = 52

X_train = X.iloc[:-test_size]
X_test = X.iloc[-test_size:]

y_train = y.iloc[:-test_size]
y_test = y.iloc[-test_size:]



# 6. Ridge Regression with Standardization

## Definition
Ridge regression minimizes:

$\|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$

Standardization ensures features are on comparable scales.
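
For intuition, the ridge objective has the closed-form solution $\hat\beta = (Z^\top Z + \lambda I)^{-1} Z^\top y_c$, where $Z$ holds the standardized features and $y_c$ is the centered target. The sketch below implements this in NumPy on synthetic data. The helper `ridge_closed_form` and the toy arrays are hypothetical, and the sketch assumes an unpenalized intercept handled by centering.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve min ||y_c - Z b||^2 + lam * ||b||^2 on standardized features Z.

    Hypothetical helper for illustration; the intercept is left unpenalized
    by centering y.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)  # population std (ddof=0)
    y_c = y - y.mean()
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y_c)

# Synthetic features on very different scales (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])
y_toy = X_toy @ np.array([2.0, 0.1, 0.01]) + rng.normal(scale=0.1, size=50)

beta_ols = ridge_closed_form(X_toy, y_toy, lam=0.0)
beta_ridge = ridge_closed_form(X_toy, y_toy, lam=10.0)
print("lambda = 0 :", np.round(beta_ols, 3))
print("lambda = 10:", np.round(beta_ridge, 3))
```

With $\lambda = 0$ this reduces to ordinary least squares, and increasing $\lambda$ shrinks the coefficient vector toward zero.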


In [None]:

# -----------------------------------------------------------
# Build a Machine Learning Pipeline
# -----------------------------------------------------------

# A Pipeline chains multiple processing steps together.
# This ensures that scaling is applied correctly
# during both training and prediction.

model = Pipeline([
    
    # Step 1: Standardize features
    # StandardScaler transforms each feature to:
    #    (x - mean) / standard deviation
    #
    # Why?
    # The L2 penalty acts on coefficient size, which depends on
    # each feature's units. Scaling puts all lags on an equal
    # footing so the penalty shrinks them comparably.
    ("scaler", StandardScaler()),
    
    
    # Step 2: Ridge Regression
    # Ridge minimizes:
    #     ||y - Xβ||² + λ||β||²
    #
    # alpha = λ controls the strength of regularization.
    # alpha = 1.0 is scikit-learn's default and is not tuned here.
    ("ridge", Ridge(alpha=1.0))
])


# -----------------------------------------------------------
# Fit the Model
# -----------------------------------------------------------

# The model learns coefficients β from the training data.
# Only past information (X_train) is used.
model.fit(X_train, y_train)


# -----------------------------------------------------------
# One-Step-Ahead Prediction
# -----------------------------------------------------------

# We now predict the test set.
# These are one-step predictions because
# each row in X_test uses true past values.
y_pred_one = model.predict(X_test)


# -----------------------------------------------------------
# Evaluate Performance
# -----------------------------------------------------------

# Mean Absolute Error (MAE):
#
# MAE = mean(|y_true - y_pred|)
#
# On the log1p scale, an MAE of m means predictions are
# typically off by a factor of about exp(m) in (1 + cases).
mae_one = mean_absolute_error(y_test, y_pred_one)

print("One-step MAE (log scale):", mae_one)


# 7. Recursive Multi-Step Forecasting

## Definition
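Recursive forecasting feeds each prediction back into the model as a lag feature for the next step, so multi-step forecasts are built without using observed future values.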
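
The mechanism is easiest to see on a toy model. In the sketch below, `one_step` is a hypothetical stand-in for a fitted model that predicts half of the previous value. It is not the Ridge pipeline.

```python
def one_step(last_value):
    """Hypothetical one-step model: the next value is half the previous one."""
    return 0.5 * last_value

toy_history = [8.0]   # last observed value
toy_forecasts = []

for _ in range(3):
    # Predict from the most recent entry, which after the first step is itself a forecast.
    nxt = one_step(toy_history[-1])
    toy_forecasts.append(nxt)
    toy_history.append(nxt)  # feed the prediction back in

print(toy_forecasts)
```

Only the first forecast is based on an observed value; every later one depends on the forecasts before it.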



In [None]:
# -----------------------------------------------------------
# Recursive Multi-Step Forecasting
# -----------------------------------------------------------

# Definition:
# Recursive forecasting means that predicted values
# are fed back into the model to generate future forecasts.
#
# Unlike one-step forecasting, we no longer use
# the true observed future values.



# -----------------------------------------------------------
# Step 1: Initialize Forecasting History
# -----------------------------------------------------------

# We start with the FULL training history.
# This ensures that lag52 (seasonal lag) remains valid.
#
# IMPORTANT:
# We must keep at least 52 past values,
# otherwise lag52 would not exist.
history = list(ts_log.iloc[:-test_size].values)


# This list will store our multi-step predictions
recursive_preds = []


# -----------------------------------------------------------
# Step 2: Iteratively Forecast Each Future Week
# -----------------------------------------------------------

# We forecast one week at a time.
# Each new prediction is appended to history
# and used to predict the next step.

for i in range(test_size):
    
    # -------------------------------------------------------
    # Construct Feature Vector Using Latest Available History
    # -------------------------------------------------------
    
    # At time t, we build:
    #   lag1  = y_{t-1}
    #   lag2  = y_{t-2}
    #   lag3  = y_{t-3}
    #   lag52 = y_{t-52}
    #
    # lag52 is always a real observation over a 52-week horizon;
    # lag1-lag3 become previous predictions after the first few steps.
    features = {
        "lag1": history[-1],
        "lag2": history[-2],
        "lag3": history[-3],
        "lag52": history[-52]
    }
    
    
    # Convert dictionary into DataFrame
    # This preserves feature names expected by the pipeline
    X_input = pd.DataFrame([features])
    
    
    # -------------------------------------------------------
    # Step 3: Predict Next Value
    # -------------------------------------------------------
    
    # model.predict returns an array → take first element
    pred = model.predict(X_input)[0]
    
    
    # Store prediction
    recursive_preds.append(pred)
    
    
    # -------------------------------------------------------
    # Step 4: Update History
    # -------------------------------------------------------
    
    # The predicted value becomes part of the time series.
    # Future lags will depend on this value.
    history.append(pred)


# Convert predictions to NumPy array
recursive_preds = np.array(recursive_preds)


# -----------------------------------------------------------
# Step 5: Evaluate Recursive Forecast
# -----------------------------------------------------------

# MAE measures average absolute deviation
# between true log-values and predicted log-values.
mae_recursive = mean_absolute_error(y_test, recursive_preds)

print("Recursive MAE (log scale):", mae_recursive)


## Interpretation

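Recursive forecasting is harder than one-step prediction.
After the first few steps the short lags are themselves forecasts, so errors accumulate over the horizon; comparing `mae_recursive` with `mae_one` shows the size of the gap.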



# 8. Seasonal Naive Baseline

## Definition
Seasonal naive forecast:

$\hat{y}_t = y_{t-52}$

This is often a very strong benchmark in epidemiology.
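
A minimal sketch of the idea with a toy season length of 4 instead of 52. All values are made up for illustration.

```python
import numpy as np

season = 4  # toy season length (the notebook uses 52 weeks)

# Two observed seasons, followed by one season to forecast (illustrative values).
observed = np.array([1.0, 5.0, 9.0, 3.0, 2.0, 6.0, 8.0, 2.0])
actual_next = np.array([2.0, 7.0, 9.0, 1.0])

# Seasonal naive: each forecast repeats the value from one season earlier.
naive_forecast = observed[-season:]

naive_mae = np.mean(np.abs(actual_next - naive_forecast))
print("forecast:", naive_forecast)
print("MAE:", naive_mae)
```

Because the horizon here equals one season, every forecast is a real observation, so the baseline needs no fitted model and no recursion.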


In [None]:

# -----------------------------------------------------------
# Seasonal Naive Baseline
# -----------------------------------------------------------

# Definition:
# The seasonal naive forecast assumes that
# the best prediction for this week
# is the value observed exactly one year ago.
#
# Mathematically:
#     ŷ_t = y_{t-52}
#
# Why 52?
# Because influenza exhibits strong annual seasonality
# (weekly data → 52 weeks per year).


# -----------------------------------------------------------
# Step 1: Generate Seasonal Naive Forecast
# -----------------------------------------------------------

# shift(52) moves the time series forward by 52 weeks,
# so each observation becomes aligned with the value
# from the same week last year.
seasonal_naive = ts_log.shift(52).iloc[-test_size:]


# -----------------------------------------------------------
# Step 2: Evaluate Baseline Performance
# -----------------------------------------------------------

# We compare the baseline forecast
# with the true observed test values.
#
# MAE measures average absolute deviation.
mae_naive = mean_absolute_error(y_test, seasonal_naive)

print("Seasonal Naive MAE (log scale):", mae_naive)



# 9. MASE Evaluation

## Definition
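MASE = Model MAE / Seasonal Naive MAE

Here both MAEs are computed on the test period, so the ratio is a relative MAE against the seasonal naive forecast. The standard MASE (Hyndman and Koehler, 2006) instead scales by the in-sample seasonal naive MAE on the training data.

- If MASE < 1 → model beats baseline.
- If MASE > 1 → baseline is stronger.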
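
A minimal sketch of the in-sample-scaled variant of MASE, in which the denominator is the seasonal naive MAE on the training data. The helper `mase_in_sample` and the toy arrays are hypothetical, with a season length of 4 for readability.

```python
import numpy as np

def mase_in_sample(y_true, y_pred, y_train, season):
    """MASE scaled by the in-sample seasonal naive MAE (hypothetical helper)."""
    # In-sample seasonal naive errors: |y_t - y_{t-season}| over the training data.
    scale = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return np.mean(np.abs(y_true - y_pred)) / scale

# Illustrative values only.
train_toy = np.array([1.0, 5.0, 9.0, 3.0, 2.0, 6.0, 8.0, 2.0])
true_toy = np.array([2.0, 7.0, 9.0, 1.0])
pred_toy = np.array([2.0, 6.5, 9.0, 1.5])

score = mase_in_sample(true_toy, pred_toy, train_toy, season=4)
print("MASE:", score)
```

Applied to this notebook, the training portion of `ts_log` would play the role of `y_train` with `season=52`. The interpretation of values below and above 1 is unchanged.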


In [None]:

mase = mae_recursive / mae_naive
print("MASE (recursive):", mase)
