# Baseline (observed) dataset: COVID-19 cases vs time (England, UKHSA)

## What this notebook is doing (in plain English)

We want a dataset that is basically **“time vs infected”**.

In real life, we usually *cannot* measure “how many people are infected right now” perfectly. Instead, we use a **proxy** (a measurable quantity that is related to infections). In this notebook, we use:

- **confirmed COVID-19 cases by day** (England)

This is a good first *observed* time series to work with because it’s a simple “counts over time” dataset.

## Why this is called “baseline (observed)”

- **Observed**: it comes from a real surveillance source (UKHSA dashboard), not a simulation.
- **Baseline**: it’s our starting point before we build “what-if” *intervention scenarios* using a model.

Important: “baseline” does **not** mean that no interventions happened in the real world (the observed counts already include the effects of whatever measures were in place). It is simply the cleanest pre-existing dataset that matches “time vs infected (proxy)”.

## What you will produce

- A CSV file saved to:
  - `data/processed/observed/ukhsa_covid19_cases_by_day_england.csv`
- A plot of **date vs cases** (and optionally a smoothed version)

## How to read the main columns

- **date**: the day of observation
- **cases**: number of confirmed cases reported for that day (proxy for “infected”)


## Alternative ways to call the UKHSA API (without our custom helper)

In this notebook we use a small helper from `epidemiology_project.ukhsa` so that the *data science part* stays simple.

But there **are** other approaches, especially because UKHSA publishes an **OpenAPI/Swagger** schema:

- Swagger UI: `https://api.ukhsa-dashboard.data.gov.uk/api/swagger`
- OpenAPI schema JSON: `https://api.ukhsa-dashboard.data.gov.uk/api/schema`

### Option A — Use the UKHSA API directly with `requests` (simplest “no helper”)

This is the simplest *library-based* approach because `requests` is the most widely used third-party HTTP library for Python (it is not part of the standard library, so it must be installed).

You:
- build the metric URL
- call `requests.get(...)`
- follow the `next` links to get all pages

(We do exactly this logic inside the helper, so you don’t have to re-write it in every notebook.)

### Option B — Use a runtime OpenAPI client (Swagger client)

There are libraries that can load the OpenAPI schema and let you call endpoints more “object-like”.

Example candidates:
- `bravado` (a runtime client for Swagger/OpenAPI 2.0 schemas; check which version the UKHSA schema uses before relying on it)

This can feel convenient, but you still need to understand:
- which endpoint you want
- pagination (`next` links)

### Option C — Generate a Python client from OpenAPI (most “SDK-like”)

Tools like these can generate a typed client from the schema:
- `openapi-python-client`
- OpenAPI Generator

Typical workflow:
1. Download the schema (`/api/schema`)
2. Generate client code
3. Import the generated client in notebooks

This can be great in bigger projects, but it adds setup/maintenance overhead for a first-year project.

### What we recommend for this project

- Start with the **simple helper** (`fetch_timeseries` / `fetch_metric`) so you can focus on the maths and plots.
- If you later want to explore the OpenAPI approach, see `docs/GETTING_DATA_FROM_THE_WEB.md`.


In [None]:
import requests

url = "https://api.ukhsa-dashboard.data.gov.uk/v2/themes/infectious_disease/sub_themes/respiratory/topics/COVID-19/geography_types/Nation/geographies/England/metrics/COVID-19_cases_casesByDay"

rows = []
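while url:
    response = requests.get(url, timeout=30)  # avoid hanging forever on a slow connection
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    data = response.json()
    rows += data["results"]
    url = data["next"]  # None on the last page, which ends the loop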

rows[:3]

In [None]:
from pathlib import Path

import pandas as pd
import matplotlib.pyplot as plt

from epidemiology_project.paths import processed_data_dir, figures_dir
from epidemiology_project.ukhsa import UksHaMetricQuery, fetch_metric


## Download from UKHSA API

### What is the UKHSA API?

UKHSA (UK Health Security Agency) publishes a public dashboard. Behind the dashboard is an **API** (Application Programming Interface): a set of URLs that return data in a structured format (JSON).

We use a small helper (`fetch_metric`) that downloads the data and handles pagination (when the API splits results into multiple pages).
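
To make “handles pagination” concrete, here is a minimal sketch of the idea. It is not the actual code of `fetch_metric`: the function name `fetch_all_pages`, the `get_json` argument, and the two fake pages are made up for illustration. Only the `results` and `next` keys are taken from the `requests` cell earlier in this notebook.

```python
from typing import Callable, Optional


def fetch_all_pages(first_url: str, get_json: Callable[[str], dict]) -> list:
    """Collect the `results` of every page by following the `next` links."""
    rows = []
    url: Optional[str] = first_url
    while url:
        page = get_json(url)
        rows.extend(page["results"])
        url = page.get("next")  # None on the last page, which ends the loop
    return rows


# Hypothetical two-page "API" held in memory, so the sketch runs offline.
FAKE_PAGES = {
    "page-1": {"results": [{"date": "2020-03-01", "metric_value": 5}], "next": "page-2"},
    "page-2": {"results": [{"date": "2020-03-02", "metric_value": 8}], "next": None},
}

all_rows = fetch_all_pages("page-1", FAKE_PAGES.__getitem__)
print(len(all_rows))
```

Passing the download function in as an argument keeps the loop testable without a network connection; with the real API you would pass a function that calls `requests.get(url).json()`.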

### What is this “query” object?

`UksHaMetricQuery(...)` just stores the choices that identify *which* dataset we want:

- **theme/sub_theme/topic**: what area of public health (here: infectious disease → respiratory → COVID-19)
- **geography_type/geography**: what area (here: Nation → England)
- **metric**: what measure (here: confirmed cases by day)

If you change these values, you get a different dataset.
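
One way to picture the query object is as a set of labelled values that get slotted into the API URL. The sketch below is an assumption about how such an object *could* work, not the real `UksHaMetricQuery` implementation: the class name `MetricQuerySketch` and its `url()` method are hypothetical. The URL pattern itself is the one used in the `requests` cell earlier in this notebook.

```python
from dataclasses import dataclass

BASE_URL = "https://api.ukhsa-dashboard.data.gov.uk"


@dataclass(frozen=True)
class MetricQuerySketch:
    """Hypothetical stand-in for the project's query object."""

    theme: str
    sub_theme: str
    topic: str
    geography_type: str
    geography: str
    metric: str

    def url(self) -> str:
        # Each field fills one segment of the path, from broad to specific.
        return (
            f"{BASE_URL}/v2/themes/{self.theme}"
            f"/sub_themes/{self.sub_theme}"
            f"/topics/{self.topic}"
            f"/geography_types/{self.geography_type}"
            f"/geographies/{self.geography}"
            f"/metrics/{self.metric}"
        )


sketch = MetricQuerySketch(
    theme="infectious_disease",
    sub_theme="respiratory",
    topic="COVID-19",
    geography_type="Nation",
    geography="England",
    metric="COVID-19_cases_casesByDay",
)
print(sketch.url())
```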


In [None]:
q = UksHaMetricQuery(
    theme='infectious_disease',
    sub_theme='respiratory',
    topic='COVID-19',
    geography_type='Nation',
    geography='England',
    metric='COVID-19_cases_casesByDay',
)

df_raw = fetch_metric(q)
df_raw[['date','metric_value']].head()


## Keep only the time series we need

The API returns a table with many columns (theme, geography codes, age bands, etc.). For a first “infected vs time” plot we only need:

- the **date** column (time)
- the **metric_value** column (the number for that day)

We also:

- drop rows with missing values (`dropna`) so they don’t leave gaps in the plotted line
- sort by date so the line goes left-to-right correctly
- rename `metric_value` to `cases` so the meaning is obvious
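
The three steps above can be tried on a tiny made-up table first, which makes it easier to see what each one does. The values and the extra `geography` column below are invented for illustration; only the column names `date` and `metric_value` come from this notebook.

```python
import pandas as pd

# Toy stand-in for the API table: out of order, one missing value, one extra column.
toy_raw = pd.DataFrame(
    {
        "date": ["2020-03-03", "2020-03-01", "2020-03-02"],
        "metric_value": [12.0, 5.0, None],
        "geography": ["England"] * 3,
    }
)

toy = (
    toy_raw[["date", "metric_value"]]           # keep only the two columns we need
    .dropna()                                   # the 2020-03-02 row has no value, so it goes
    .sort_values("date")                        # ISO dates (YYYY-MM-DD) sort chronologically
    .rename(columns={"metric_value": "cases"})  # clearer column name
    .reset_index(drop=True)
)
print(toy)
```

Two rows survive, in date order: 2020-03-01 with 5 cases and 2020-03-03 with 12 cases.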


In [None]:
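df = df_raw[['date', 'metric_value']].dropna().copy()
df['date'] = pd.to_datetime(df['date'])  # make sure dates are real dates (no-op if already parsed)
df = df.sort_values('date').rename(columns={'metric_value': 'cases'})
df.head()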


## Save to `data/processed/` (observed baseline)

### Why do we save a CSV?

Notebooks are great for exploring, but it’s good practice to also save the *dataset you actually used* as a file. That way:

- everyone in the group can load the same data later
- you can rerun plots without re-downloading every time
- you have a clear “checkpoint” for reproducibility

### Why `data/processed/observed/`?

- `data/raw/` is for original downloads you never change.
- `data/processed/` is for cleaned, ready-to-plot datasets.
- `observed/` means this dataset comes from **real surveillance**, not a simulation.
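
The point of the checkpoint is that someone else can load it back later. The sketch below shows a save-and-reload round trip on a toy table in a temporary folder (the file name `toy_cases.csv` and the values are made up); for the real file you would point `pd.read_csv` at the path listed under “What you will produce”.

```python
import tempfile
from pathlib import Path

import pandas as pd

saved = pd.DataFrame(
    {"date": pd.to_datetime(["2020-03-01", "2020-03-02"]), "cases": [5, 8]}
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "observed" / "toy_cases.csv"
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    saved.to_csv(csv_path, index=False)

    # parse_dates turns the text in the CSV back into real dates.
    reloaded = pd.read_csv(csv_path, parse_dates=["date"])

print(reloaded.dtypes)
```

Without `parse_dates`, the `date` column comes back as plain text, because a CSV file does not store column types.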


In [None]:
out_dir = processed_data_dir() / 'observed'
out_dir.mkdir(parents=True, exist_ok=True)
out_path = out_dir / 'ukhsa_covid19_cases_by_day_england.csv'
df.to_csv(out_path, index=False)
out_path


## Plot: time vs infected (proxy)

This is the key picture: **a time series**.

- **x-axis**: time (date)
- **y-axis**: daily confirmed cases (our proxy for “infected”)

### What should you look for?

- **Waves/peaks**: times when cases rise and fall
- **Skewness**: a fast rise and slow fall is common in epidemics
- **Noise**: daily data has reporting effects (weekends, backlogs)

This is why we also show a smoothed curve next.
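
One quick way to check for a weekend reporting effect is to average the counts by day of the week. The sketch below uses synthetic numbers (100 on weekdays, 60 at weekends) purely to show the technique; it says nothing about the real UKHSA data.

```python
import pandas as pd

dates = pd.date_range("2021-01-04", periods=14, freq="D")  # 2021-01-04 is a Monday
# Synthetic counts: lower on Saturdays and Sundays (dayofweek 5 and 6).
cases = [60 if d.dayofweek >= 5 else 100 for d in dates]
toy = pd.DataFrame({"date": dates, "cases": cases})

# Mean count for each day of the week: 0 = Monday ... 6 = Sunday.
by_weekday = toy.groupby(toy["date"].dt.dayofweek)["cases"].mean()
print(by_weekday)
```

If the same calculation on the real data shows consistently lower values on some weekdays, that dip reflects when cases get reported rather than when people were infected.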


In [None]:
fig, ax = plt.subplots(figsize=(11,5))
ax.plot(df['date'], df['cases'], linewidth=1)
ax.set_title('England: confirmed COVID-19 cases by day (UKHSA)')
ax.set_xlabel('Date')
ax.set_ylabel('Cases')
ax.grid(True, alpha=0.3)
fig.tight_layout()

fig_path = figures_dir(final=True) / 'ukhsa_observed_baseline_covid_cases_by_day_england.png'
fig.savefig(fig_path, dpi=200)
fig_path


## Optional: smoother curve (7-day mean)

Daily case counts are often “bumpy” because of reporting patterns.

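A simple way to reduce noise is a **rolling mean**. A 7-day rolling mean replaces each day’s value with the average of that day and the six days before it, so the first six days have no smoothed value. The code below counts rows rather than calendar days, so it assumes one row per day.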

Mathematically, if daily cases are \(c_t\), the 7-day rolling mean is:

\[
\bar{c}_t = \frac{1}{7}\sum_{k=0}^{6} c_{t-k}
\]

This keeps the overall shape (waves) but makes it easier to compare peaks and trends.
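
To check the formula against pandas, here is a small worked example with made-up numbers (multiples of 7, so the averages come out as whole numbers).

```python
import pandas as pd

cases = pd.Series(
    [7, 14, 21, 28, 35, 42, 49, 56],
    index=pd.date_range("2021-01-01", periods=8, freq="D"),
)

mean_7d = cases.rolling(7).mean()

# Day 7 by hand: (7 + 14 + 21 + 28 + 35 + 42 + 49) / 7 = 196 / 7 = 28
manual_day7 = cases.iloc[0:7].sum() / 7

print(mean_7d)
print(manual_day7)
```

The first six entries of `mean_7d` are `NaN` because a full 7-day window is not available yet. The seventh is 28.0, matching the hand calculation, and the eighth is 35.0, because the window has slid forward by one day to cover 14 through 56.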


In [None]:
df2 = df.set_index('date').copy()
df2['cases_7d_mean'] = df2['cases'].rolling(7).mean()

fig, ax = plt.subplots(figsize=(11,5))
ax.plot(df2.index, df2['cases_7d_mean'], linewidth=1.5)
ax.set_title('England: confirmed COVID-19 cases (7-day mean)')
ax.set_xlabel('Date')
ax.set_ylabel('Cases (7-day mean)')
ax.grid(True, alpha=0.3)
fig.tight_layout()

fig_path = figures_dir(final=True) / 'ukhsa_observed_baseline_covid_cases_by_day_england_7dmean.png'
fig.savefig(fig_path, dpi=200)
fig_path
