
# Session 1 Project: Variant-Aware Epidemic Time Series Analysis

---

## Learning Goals

In this session you will:

1. Decide whether epidemic dynamics are additive or multiplicative.
2. Understand why `period = 7` is biologically and statistically meaningful.
3. Apply STL decomposition.
4. Interpret trend, seasonal and residual components.
5. Detect anomalies using:
   - Standard Z-score
   - Robust MAD
6. Investigate the association of anomalies with:
   - Variants
   - Stringency index
   - Vaccination
7. Detect epidemic peaks using `scipy.signal.find_peaks`.

---


## 1 Import Required Packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from statsmodels.tsa.seasonal import STL
from scipy.signal import find_peaks
from scipy.stats import ttest_ind
import statsmodels.formula.api as smf

# Default figure size and grid for every plot in this session
plt.rcParams["figure.figsize"] = (10, 4)
plt.rcParams["axes.grid"] = True


## 2 Load OWID Case Data (Provided)

country = "Germany"
owid_url = "https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv"

df = pd.read_csv(owid_url, parse_dates=["date"])

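df = df[df["location"] == country][[
    "date",
    "new_cases_smoothed",
    "stringency_index",
    "people_vaccinated_per_hundred"
]]

# Drop only rows without case counts. A blanket dropna() would also discard
# every day lacking a vaccination or stringency value (for example, all dates
# before the vaccination rollout) and leave gaps in the daily series.
df = df.dropna(subset=["new_cases_smoothed"])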

df = df[df["date"] >= "2020-03-01"].reset_index(drop=True)



## 3 Why is `period = 7` Appropriate?

### Statistical Reason

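COVID-19 case reporting follows a weekly cycle:
- Lower reporting during weekends
- Higher reporting mid-week

This creates artificial oscillations every 7 days. The `new_cases_smoothed` column loaded above is already a 7-day average, which damps most of this cycle; the raw `new_cases` column shows it more clearly.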

---

### Biological Relevance

Although weekly seasonality mainly reflects reporting,
it also aligns with:

- Human weekly behavior patterns
- Testing availability
- Administrative reporting cycles

Thus, `period = 7` captures systematic short-term structure.
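
A quick way to check for a weekly cycle before fixing `period = 7` is to compare day-of-week averages and the lag-7 autocorrelation. The sketch below uses made-up counts (1000 on weekdays, 600 on weekends), not OWID data; `reported`, `dow_mean` and `lag7` are illustrative names.

```python
import numpy as np
import pandas as pd

# Eight full weeks of synthetic daily reports, starting on a Monday
dates = pd.date_range("2021-03-01", periods=56, freq="D")
is_weekend = dates.dayofweek >= 5  # 5 = Saturday, 6 = Sunday
reported = pd.Series(np.where(is_weekend, 600.0, 1000.0), index=dates)

# Mean reported count per day of week (0 = Monday ... 6 = Sunday)
dow_mean = reported.groupby(reported.index.dayofweek).mean()

# Correlation of the series with itself shifted by 7 days
lag7 = reported.autocorr(lag=7)

print(dow_mean)
print("lag-7 autocorrelation:", lag7)
```

On real data the weekend dip is noisier, but a clearly lower weekend mean together with a high lag-7 autocorrelation points to a 7-day period.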

---

### Questions

1. What would happen if we used `period = 30`?
2. Would STL confuse reporting effects with long-term trend?



## 4 Additive vs Multiplicative Structure

Tasks:

1. Plot raw cases vs date.
2. Plot log(cases) vs date (exclude or offset zero counts first, since log(0) is undefined).
3. Decide whether multiplicative structure is appropriate.
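
One possible diagnostic for this decision, sketched on synthetic data: if the size of the weekly swing grows with the level of the series, the structure is multiplicative, and taking logs should make the swing roughly constant. The growth rate, the 20% weekly effect and all variable names below are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

t = np.arange(210)
level = 100 * np.exp(0.02 * t)                 # exponentially growing level
weekly = 1 + 0.2 * np.sin(2 * np.pi * t / 7)   # multiplicative weekly effect
cases = pd.Series(level * weekly)

# Size of the within-week swing on the raw and on the log scale
raw_swing = cases.rolling(7).std()
log_swing = np.log(cases).rolling(7).std()

# Compare the last week with the first complete week
raw_ratio = raw_swing.iloc[-1] / raw_swing.iloc[6]
log_ratio = log_swing.iloc[-1] / log_swing.iloc[6]

print("raw swing grew by a factor of", round(raw_ratio, 1))
print("log swing grew by a factor of", round(log_ratio, 3))
```

A raw-scale factor far above 1 combined with a log-scale factor near 1 supports a multiplicative model, which is why the later steps work with log cases.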



## 5 STL Decomposition

Tasks:

1. Apply STL with `period = 7` to log cases.
2. Plot trend, seasonal, and residual components.
3. Interpret:

- Trend → what does it represent epidemiologically?
- Seasonal → reporting or biology?
- Residual → what do spikes represent?



## 6 Anomaly Detection

We compare two methods.

---

### A) Standard Z-score

Formula:

    z = (residual - mean) / standard_deviation

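A point is flagged as an anomaly if:

    |z| > 2

Limitation:
The mean and standard deviation are themselves sensitive to extreme values, so large spikes inflate the threshold and can hide smaller anomalies.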

---

### B) Robust MAD (Median Absolute Deviation)

Step 1: Compute median of residuals.

Step 2: Compute MAD:

    MAD = median(|residual - median|)

Step 3: Convert to robust z-score:

    robust_z = (residual - median) / (1.4826 × MAD)

Why multiply by 1.4826?
For normally distributed data, 1.4826 × MAD estimates the standard deviation, so `robust_z` is on the same scale as the standard z-score and the same threshold can be used.
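
The constant can be checked numerically. For a normal distribution, half of the values lie within 0.6745 standard deviations of the median (the 75th percentile of the standard normal), so the MAD equals about 0.6745 standard deviations and its reciprocal gives the scaling factor. The name `k` is illustrative.

```python
from scipy.stats import norm

# 75th percentile of the standard normal = MAD of a unit-variance normal
mad_of_unit_normal = norm.ppf(0.75)

# Factor that rescales MAD to the standard deviation
k = 1.0 / mad_of_unit_normal
print(round(k, 4))  # 1.4826
```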

---

### Why is MAD More Robust?

- Uses the median instead of the mean and standard deviation
- Not strongly influenced by extreme spikes
- Better suited to heavy-tailed epidemic residuals
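
Both detectors can be sketched in a few lines of NumPy. The residuals below are simulated (normal noise with three injected spikes), and the function names `zscore_flags` and `robust_flags` are hypothetical helpers, not part of any library.

```python
import numpy as np

def zscore_flags(resid, threshold=2.0):
    """Flag points whose standard z-score exceeds the threshold."""
    resid = np.asarray(resid, dtype=float)
    z = (resid - resid.mean()) / resid.std()
    return np.abs(z) > threshold

def robust_flags(resid, threshold=2.0):
    """Flag points whose MAD-based robust z-score exceeds the threshold."""
    resid = np.asarray(resid, dtype=float)
    med = np.median(resid)
    mad = np.median(np.abs(resid - med))
    robust_z = (resid - med) / (1.4826 * mad)
    return np.abs(robust_z) > threshold

# Simulated residuals: small noise plus three large spikes
rng = np.random.default_rng(42)
resid = rng.normal(0.0, 0.1, size=500)
spikes = [50, 200, 350]
resid[spikes] += 3.0

z_flags = zscore_flags(resid)
r_flags = robust_flags(resid)
print("z-score anomalies:", int(z_flags.sum()))
print("robust anomalies: ", int(r_flags.sum()))
```

The spikes inflate the standard deviation, so the z-score threshold widens and fewer moderate deviations are flagged; the MAD scale stays close to the noise level. If the MAD is zero (more than half of the residuals identical), the robust score is undefined and needs separate handling.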

---

### Tasks

1. Implement both methods.
2. Compare number of detected anomalies.
3. Which seems more reasonable?



## 7 Integrate ECDC Variant Data

We use weekly variant proportions from ECDC.

Steps:
1. Load weekly variant data.
2. Filter for Germany.
3. Keep only SARS-CoV-2 rows with the `proportion` indicator and the `total` age group.
4. Convert ISO week to date.
5. Identify dominant variant per week.
6. Merge into daily dataset.


variant_url = "https://raw.githubusercontent.com/EU-ECDC/Respiratory_viruses_weekly_data/refs/heads/main/data/variants.csv"

var = pd.read_csv(variant_url)

# Filter Germany + SARS-CoV-2 + proportion + total age
var = var[
    (var["countryname"] == "Germany") &
    (var["pathogen"] == "SARS-CoV-2") &
    (var["indicator"] == "proportion") &
    (var["age"] == "total")
].copy()

# Convert ISO week format (YYYY-Www) to date (Monday of that week)
var["date"] = pd.to_datetime(var["yearweek"] + "-1", format="%G-W%V-%u")

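# Identify dominant variant per week (ignore rows without a reported value)
var = var.dropna(subset=["value"])
dominant = var.loc[var.groupby("date")["value"].idxmax()]
dominant = dominant[["date", "variant"]]

# Merge into daily dataset: weekly rows fall on Mondays, so forward-fill
# carries each Monday's dominant variant through the rest of its week
df = df.merge(dominant, on="date", how="left")
df["variant"] = df["variant"].ffill()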
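
The ISO-week conversion and the weekly-to-daily merge can be checked on a toy table first. The proportions below are made up and the frame names `weekly` and `daily` are illustrative. As one possible variation, `ffill(limit=6)` is used here so that a missing week stays empty instead of inheriting the previous week's variant.

```python
import pandas as pd

# Toy weekly proportions (made-up values)
weekly = pd.DataFrame({
    "yearweek": ["2021-W01", "2021-W01", "2021-W02", "2021-W02"],
    "variant": ["Other", "B.1.1.7", "Other", "B.1.1.7"],
    "value": [70.0, 30.0, 40.0, 60.0],
})

# Monday of each ISO week: %G = ISO year, %V = ISO week, %u = weekday (1 = Monday)
weekly["date"] = pd.to_datetime(weekly["yearweek"] + "-1", format="%G-W%V-%u")

# Row with the largest proportion in each week
dominant = weekly.loc[weekly.groupby("date")["value"].idxmax(), ["date", "variant"]]

# Daily frame covering both weeks; only Mondays match, the rest is forward-filled
daily = pd.DataFrame({"date": pd.date_range("2021-01-04", periods=14, freq="D")})
daily = daily.merge(dominant, on="date", how="left")
daily["variant"] = daily["variant"].ffill(limit=6)

print(daily)
```

ISO week 1 of 2021 starts on Monday 2021-01-04, so the first seven days carry the dominant variant of `2021-W01` and the next seven that of `2021-W02`.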



### Tasks

1. Plot anomalies on log-case plot and color by variant.
2. Compute anomaly frequency per variant.
3. Are certain variants associated with more residual spikes?
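
For the frequency task, a grouped summary is usually enough. The sketch assumes a boolean `is_anomaly` column (a hypothetical name for the flags produced in the anomaly detection step) next to the `variant` column; the eight rows and the labels `A` and `B` are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "variant": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "is_anomaly": [False, True, False, False, True, True, False, True],
})

# Days observed, anomalies counted, and anomaly share for each variant
summary = toy.groupby("variant")["is_anomaly"].agg(
    n_days="size", n_anomalies="sum", anomaly_rate="mean"
)
print(summary)
```

Comparing `anomaly_rate` rather than raw counts matters because variants were dominant for very different numbers of days.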



## 8 Statistical Testing

Perform:

A) Compare stringency during anomaly vs non-anomaly (t-test).

B) Compare vaccination during anomaly vs non-anomaly.

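C) Logistic regression:

    is_anomaly ~ stringency + vaccination + variant

Here `stringency` and `vaccination` stand for the `stringency_index` and `people_vaccinated_per_hundred` columns.

Interpret carefully: these tests show association, not causation.
Does biological shift (variant) matter more than policy?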
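
A sketch of tests A and C on simulated data, assuming the anomaly flags are stored as a 0/1 column named `is_anomaly` (a hypothetical name). The data-generating coefficients and the variant labels are invented for illustration. Welch's version of the t-test (`equal_var=False`) is an assumption made here because the two groups usually differ in size and variance.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import ttest_ind

# Simulated daily table (all values invented)
rng = np.random.default_rng(0)
n = 600
toy = pd.DataFrame({
    "stringency_index": rng.uniform(20, 80, n),
    "people_vaccinated_per_hundred": rng.uniform(0, 75, n),
    "variant": rng.choice(["Alpha", "Delta", "Omicron"], n),
})
logit_p = (-3.0 + 0.03 * toy["stringency_index"]
           + 0.8 * (toy["variant"] == "Omicron"))
toy["is_anomaly"] = (rng.random(n) < 1 / (1 + np.exp(-logit_p))).astype(int)

# A) Welch t-test: stringency on anomaly days vs other days
welch = ttest_ind(
    toy.loc[toy["is_anomaly"] == 1, "stringency_index"],
    toy.loc[toy["is_anomaly"] == 0, "stringency_index"],
    equal_var=False,
)

# C) Logistic regression with variant as a categorical predictor
fit = smf.logit(
    "is_anomaly ~ stringency_index + people_vaccinated_per_hundred + C(variant)",
    data=toy,
).fit(disp=False)

print("Welch t-test p-value:", welch.pvalue)
print(fit.params)
```

Test B follows the same pattern as A with the vaccination column. On the real data, missing covariate values need attention: `ttest_ind` accepts `nan_policy="omit"`, and the formula interface drops rows with missing values by default.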



## 9 Peak Detection Using `find_peaks`

Function: `scipy.signal.find_peaks`

It detects local maxima.

Parameter:

    distance = 60

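What does it control?

- The minimum number of samples between neighbouring peaks; with daily data, 60 samples are 60 days.
- When two candidate peaks are closer than this, only the higher one is kept, so short-lived bumps near a wave crest are not counted as new waves.
- Detected waves are therefore at least about 2 months apart. Small isolated bumps far from any crest can still be returned; the `prominence` or `height` arguments filter those.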
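
The effect of `distance` can be seen on a synthetic curve with two waves and a small weekly ripple. The wave shapes, the ripple size and the `prominence=100` value are assumptions chosen for this illustration; a suitable prominence for real case counts has to be picked from the data.

```python
import numpy as np
from scipy.signal import find_peaks

t = np.arange(365)
# Two synthetic waves (crests near day 100 and day 250) plus a small weekly ripple
signal = (1000 * np.exp(-(t - 100) ** 2 / (2 * 20 ** 2))
          + 800 * np.exp(-(t - 250) ** 2 / (2 * 20 ** 2))
          + 10 * np.sin(2 * np.pi * t / 7))

# No constraints: every local maximum, including ripple bumps
all_peaks, _ = find_peaks(signal)

# distance only: peaks at least 60 samples apart, higher peaks win
spaced_peaks, _ = find_peaks(signal, distance=60)

# distance plus prominence: only peaks that stand out from their surroundings
wave_peaks, _ = find_peaks(signal, distance=60, prominence=100)

print("all local maxima:", len(all_peaks))
print("distance=60:", spaced_peaks)
print("distance=60, prominence=100:", wave_peaks)
```

With `distance` alone, ripple bumps in the flat stretches between waves can survive as long as they are at least 60 days from a higher peak; adding `prominence` leaves only the two wave crests.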

---

### Tasks

1. Detect peaks with distance=60.
2. Plot peaks on raw case data.
3. Compare with STL trend waves.


