## Problem 1: Mathematical model of an infectious disease

### Background
The SIR model (developed by W. O. Kermack and A. G. McKendrick in 1927) is one of the simplest [compartmental models](https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology) used in theoretical biology to describe how an infectious disease spreads in a population and to estimate the likely outcome of an epidemic. The variables S, I and R represent the number of people in each compartment of the population, and the differential equations describe the transition rates, and thus the flow, between the compartments over time. Models based on ODEs are deterministic. (An ODE, or [ordinary differential equation](https://en.wikipedia.org/wiki/Ordinary_differential_equation), is an equation involving a function of one independent variable and its derivatives; these may be of higher order, but there are no partial derivatives.) For small populations, stochastic models that use Monte Carlo methods are more realistic.

### The SIR model
Consider the [SIR model](https://de.wikipedia.org/wiki/SIR-Modell), defined by the following coupled system of first-order ordinary differential equations:
* ${\frac {dS}{dt}}=-c{\frac {SI}{N}}$ 
* ${\frac {dI}{dt}}=c{\frac {SI}{N}}-wI$
* ${\frac {dR}{dt}}=wI$

with the constraint that the total number of individuals $N$, which is the sum of $S$, $I$ and $R$, is constant:
* $N=S+I+R$

(This constraint is fulfilled automatically because the three derivatives sum to zero, i.e. no births or deaths are included in this version of the model.)
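
That $N$ stays constant can also be checked numerically. The following minimal sketch writes the right-hand side of the system as a Python function and confirms that the three derivatives sum to zero. The function name `sir_rhs` and all parameter values are illustrative assumptions, not part of the exercise.

```python
import numpy as np

def sir_rhs(t, y, c, w):
    """Right-hand side of the SIR equations; y = (S, I, R)."""
    S, I, R = y
    N = S + I + R                      # total population
    new_infections = c * S * I / N     # flow from S to I
    recoveries = w * I                 # flow from I to R
    return np.array([-new_infections, new_infections - recoveries, recoveries])

# Illustrative values (assumptions): 1000 people, 10 of them infected
y0 = np.array([990.0, 10.0, 0.0])
derivs = sir_rhs(0.0, y0, c=0.5, w=0.1)
print(derivs)          # dS/dt, dI/dt, dR/dt at t = 0
print(derivs.sum())    # zero up to rounding, so N does not change
```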

The variables and parameters of the model are:
* $S$: number of susceptible individuals
* $I$: number of infected individuals
* $R$: number of recovered (resistant / removed) individuals
* $c$: rate of disease transmission
* $w$: rate of recovery 
* $t$: time (independent variable)

Assumptions:
* every individual is infected only once (and either dies or becomes immune -- these are treated alike)
* infected individuals are infectious
* rates $c$ and $w$ are constant
* birth and general death rates can be neglected on the time scales considered

#### Basic reproduction number
The basic reproduction number $R_0$ is given by $R_0 = c/w$. It describes the expected number of new infections caused by one infected individual (in a population where all subjects are susceptible, i.e. at the beginning of the epidemic).
$1/w$ can be interpreted as the average time for which an individual is infectious.
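
The role of $R_0$ becomes clearer when the equation for $I$ is rewritten using only the symbols defined above:

$$
\frac{dI}{dt} = c\,\frac{SI}{N} - wI
             = wI\left(\frac{c}{w}\,\frac{S}{N} - 1\right)
             = wI\left(R_0\,\frac{S}{N} - 1\right)
$$

The number of infected individuals therefore grows as long as $R_0\,S/N > 1$. At the beginning of an epidemic $S \approx N$, so the number of infections can only increase if $R_0 > 1$.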

#### Variations
A more complex model with an interactive interface can be found [here](https://gabgoh.github.io/COVID/index.html). 
It is based on the SEIR model, which adds a compartment E (exposed), and includes deaths and hospitalization. Besides using different sets of compartments, one can also consider a spatial dependence in addition to the time dependence.

Your tasks:
* Solve this system of equations using methods presented in the course and plot S, I and R as a function of time.
* Change the parameters c and w and study their impact.
* Optional: Make the model more dynamic by e.g. introducing a birth and a death rate, or vaccination.

Homework 👆 Try to solve at least the non-optional part. Solutions to this will be discussed tomorrow morning.

## Problem 2: Understanding and working with a real-life dataset

Here, we are going to work with two datasets that contain the daily new cases of reported COVID-19 infections. The purpose of this exercise is to practice using the Python tools that we encountered in the course, to apply them to real-life data, and to gain experience with the problems that might occur.


In [None]:
### imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from requests_cache import CachedSession # if you get an error, install this with: pip install requests_cache

To avoid downloading the data every time we run the notebook, we will use a caching module that keeps a local copy of the URLs we retrieve and refetches them only after a certain time has passed.

In [None]:
### set up cached session
# cached responses expire after 86400 s (one day)
session = CachedSession('../data/cache_plot_corona_cases', backend='sqlite', expire_after=86400)

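### helper functions
def GetURL(url):
  """Fetch url through the cached session; return the body text, or None on failure."""
  response = session.get(url)
  print(f"Loaded {url} from cache: {response.from_cache}")
  if response.status_code != 200:
    print(f"Request failed with code {response.status_code}")
    return None
  return response.text

def SaveURL(url, path):
  """Download url and write its content to path; raise if the download failed."""
  text = GetURL(url)
  if text is None:
    # fail before opening the file, so an existing local copy is not truncated
    raise RuntimeError(f"Could not download {url}")
  with open(path, "w", encoding="utf-8") as outf:
    outf.write(text)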

We will use two datasets, one provided by the Robert-Koch-Institut (RKI) and one from Johns Hopkins University (JHU). Both are in CSV format.

In [None]:
def LoadDataset(url, local_path, date_column, cases_column):
  SaveURL(url, local_path)
  df = pd.read_csv(local_path, parse_dates = [date_column])\
         .set_index(date_column)
  print("Last data point:")
  print(df.tail(1)[cases_column])
  return df
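
To see what kind of object `LoadDataset` returns without downloading anything, the sketch below applies the same `pd.read_csv` call to a tiny in-memory CSV. The column names `date`, `Germany` and `France` mimic the layout of the JHU file, and the numbers are made up for illustration.

```python
import io
import pandas as pd

# Made-up sample in the shape of the JHU file (illustration only)
csv_text = """date,Germany,France
2021-01-01,100,200
2021-01-02,150,250
2021-01-03,120,
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"]).set_index("date")

print(df.index)                                        # a DatetimeIndex
print(df.loc[pd.Timestamp("2021-01-02"), "Germany"])   # look up one day by date
print(df["France"].isna().sum())                       # empty fields are read as NaN
```

Parsing the date column and using it as the index is what later allows plotting against time and selecting date ranges directly.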

In [None]:
# RKI, documentation: https://github.com/robert-koch-institut/SARS-CoV-2-Nowcasting_und_-R-Schaetzung/#readme
dfr = LoadDataset("https://raw.githubusercontent.com/robert-koch-institut/SARS-CoV-2-Nowcasting_und_-R-Schaetzung/main/Nowcast_R_aktuell.csv", 
                  "/tmp/Nowcast_R_aktuell.csv",
                  "Datum",
                  "PS_COVID_Faelle")
# JHU, documentation: https://github.com/owid/covid-19-data/tree/master/public/data#readme
dfj = LoadDataset("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/jhu/new_cases.csv", 
                  "/tmp/new_cases.csv",
                  "date",
                  "Germany")

Your tasks:
1. As a first step, plot the data to see whether it makes sense intuitively. What differences between the datasets do you observe?
2. Have a look at the README files to understand the features of the datasets:
  * [README for RKI data](https://github.com/robert-koch-institut/SARS-CoV-2-Nowcasting_und_-R-Schaetzung/#readme)
    * "SARS-CoV-2-Nowcasting und -R-Schätzung"
    * PS_COVID_Faelle: "Punktschätzer der Anzahl an Neuerkrankungen (ohne Glättung)" (point estimate of the number of new cases of illness, without smoothing)
  * [README for JHU data](https://github.com/owid/covid-19-data/tree/master/public/data#readme)
    * "Data on COVID-19 (coronavirus) by Our World in Data"
    * new_cases: "New confirmed cases of COVID-19. Counts can include probable cases, where reported. In rare cases where our source reports a negative daily change due to a data correction, we set this metric to NA."
3. Both datasets show strong fluctuations. Think about possible reasons. Do you think they are real? How could you correct for them?
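
One common way to look at fluctuations of this kind, offered here as a sketch rather than as the intended answer, is a centred rolling mean. The example uses synthetic data with an artificial weekly pattern (an assumption made purely for illustration); it does not use the real datasets.

```python
import numpy as np
import pandas as pd

# Synthetic daily counts: constant level of 1000 with a weekend dip (illustration only)
dates = pd.date_range("2021-01-04", periods=28, freq="D")   # starts on a Monday
weekday_factor = np.array([1.2, 1.2, 1.1, 1.1, 1.0, 0.7, 0.7])
cases = pd.Series(1000 * weekday_factor[dates.dayofweek.to_numpy()], index=dates)

# A 7-day window always contains each weekday exactly once,
# so a purely weekly pattern averages out
smoothed = cases.rolling(window=7, center=True).mean()

print(cases.head(7).round().tolist())        # raw values for one week
print(smoothed.dropna().round(1).unique())   # smoothed values
```

The first and last three days have no smoothed value because the centred window would extend beyond the data.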

### More numbers: per state
To compare numbers in the constituent states of Germany we can use this dataset ([README](https://github.com/jgehrcke/covid-19-germany-gae/#readme)):

In [None]:
dfbl = LoadDataset("https://raw.githubusercontent.com/jgehrcke/covid-19-germany-gae/master/cases-rl-crowdsource-by-state.csv",
            "/tmp/cases-rl-crowdsource-by-state.csv",
            "time_iso8601",
            "sum_cases")

In [None]:
dfbl.columns

In [None]:
dfbl.loc[:, "DE-BB":"DE-TH"].plot(figsize=(15, 8));

Why does DE-NW have the highest number of cases?