# Chapter 14

*Modeling and Simulation in Python*

Copyright 2021 Allen Downey

License: [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-nc-sa/4.0/)

In [None]:
# install Pint if necessary

try:
    import pint
except ImportError:
    !pip install pint

In [1]:
# download modsim.py if necessary

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)
    
download('https://raw.githubusercontent.com/AllenDowney/' +
         'ModSimPy/master/modsim.py')

In [None]:
# import functions from modsim

from modsim import *

[Click here to run this chapter on Colab](https://colab.research.google.com/github/AllenDowney/ModSimPy/blob/master//chapters/chap14.ipynb)

In [4]:
# import code from previous notebooks

from chap11 import make_system
from chap11 import update_func
from chap11 import run_simulation

from chap12 import calc_total_infected

from chap13 import sweep_beta
from chap13 import sweep_parameters

In the previous chapters we used simulation to predict the effect of an infectious disease in a susceptible population and to design
interventions that would minimize the effect.

In this chapter we use analysis to investigate the relationship between the parameters, `beta` and `gamma`, and the outcome of the simulation.

## Nondimensionalization

The figures in the previous chapter suggest that there is a relationship between the parameters of the SIR model, `beta` and `gamma`, that determines the outcome of the simulation, the fraction of students infected. Let's think about what that relationship might be.

-   When `beta` exceeds `gamma`, that means there are more contacts
    (that is, potential infections) than recoveries during each day (or other unit of time). The difference between `beta` and `gamma` might be called the "excess contact rate", in units of contacts per day.

-   As an alternative, we might consider the ratio `beta/gamma`, which
    is the number of contacts per recovery. Because the numerator and
    denominator are in the same units, this ratio is **dimensionless**, which means it has no units.

Describing physical systems using dimensionless parameters is often a
useful move in the modeling and simulation game. It is so useful, in
fact, that it has a name: **nondimensionalization** (see
<http://modsimpy.com/nondim>).

So we'll try the second option first.
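As a quick check of what "dimensionless" buys us, here is a minimal sketch with made-up values of `beta` and `gamma` (they are illustrative assumptions, not results from the model). Converting both rates from per-day to per-week changes their difference but leaves their ratio alone:

```python
from math import isclose

# Hypothetical rates, chosen only for illustration
beta_per_day = 0.3    # contacts per day
gamma_per_day = 0.1   # recoveries per day

# The same rates expressed per week
days_per_week = 7
beta_per_week = beta_per_day * days_per_week
gamma_per_week = gamma_per_day * days_per_week

# The difference has units of 1/time, so it depends on the unit of time
diff_per_day = beta_per_day - gamma_per_day
diff_per_week = beta_per_week - gamma_per_week

# The ratio has no units, so it is the same either way
# (up to floating-point rounding)
ratio_per_day = beta_per_day / gamma_per_day
ratio_per_week = beta_per_week / gamma_per_week

print(diff_per_day, diff_per_week)
print(ratio_per_day, ratio_per_week)
```

This is one reason to prefer the ratio: a result expressed in terms of `beta/gamma` does not depend on whether we measure time in days, weeks, or anything else.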

## Exploring the results

Suppose we have a `SweepFrame` with one row for each value of `beta` and one column for each value of `gamma`. Each element in the `SweepFrame` is the fraction of students infected in a simulation with a given pair of parameters.

We can print the values in the `SweepFrame` like this:

In [5]:
beta_array = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1]
gamma_array = [0.2, 0.4, 0.6, 0.8]
frame = sweep_parameters(beta_array, gamma_array)
frame.head()

In [6]:
for gamma in frame.columns:
    column = frame[gamma]
    for beta in column.index:
        frac_infected = column[beta]
        print(beta, gamma, frac_infected)

This is the first example we've seen with one `for` loop inside another:

-   Each time the outer loop runs, it selects a value of `gamma` from
    the columns of the `DataFrame` and extracts the corresponding
    column.

-   Each time the inner loop runs, it selects a value of `beta` from the
    column and selects the corresponding element, which is the fraction
    of students infected.

In the example from the previous chapter, `frame` has 4 columns, one for
each value of `gamma`, and 11 rows, one for each value of `beta`. So
these loops print 44 lines, one for each pair of parameters.
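To see where the number 44 comes from, here is a sketch of the same loop structure using plain Python lists instead of a `SweepFrame`. The counter `num_pairs` is added here only for illustration:

```python
beta_array = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1]
gamma_array = [0.2, 0.4, 0.6, 0.8]

num_pairs = 0
for gamma in gamma_array:        # outer loop: one pass per column
    for beta in beta_array:      # inner loop: one pass per row
        num_pairs += 1           # the inner body runs once per pair

print(num_pairs)
```

The body of the inner loop runs once for every combination of an outer value and an inner value, so the count is the product of the two lengths.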

The following function encapsulates the previous loop and plots the
fraction infected as a function of the ratio `beta/gamma`:

In [7]:
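from matplotlib.pyplot import plot

def plot_sweep_frame(frame):
    """Plot fraction infected versus the ratio beta/gamma.

    frame: SweepFrame with one row per beta and one column per gamma
    """
    for gamma in frame.columns:
        series = frame[gamma]
        for beta in series.index:
            frac_infected = series[beta]
            plot(beta/gamma, frac_infected, 'o',
                 color='C1', alpha=0.4)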

In [8]:
plot_sweep_frame(frame)

decorate(xlabel='Contact number (beta/gamma)',
         ylabel='Fraction infected')

The results fall on a single curve, at least approximately. That means that we can predict the fraction of students who will be infected based on a single parameter, the ratio `beta/gamma`. We don't need to know the values of `beta` and `gamma` separately.
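One way to test this claim directly is to run two simulations with different parameters but the same ratio. The following is a self-contained sketch, not the `run_simulation` function from the earlier chapters. It assumes 90 students, one of them initially infected, and one time step per day for 98 days. The function name `simulate_fraction_infected` is made up for this example.

```python
def simulate_fraction_infected(beta, gamma, num_days=98):
    """Run a discrete-time SIR model and return the total fraction infected.

    Assumed setup: 90 students, one initially infected, one step per day.
    """
    s, i, r = 89 / 90, 1 / 90, 0.0
    s_init = s
    for _ in range(num_days):
        infected = beta * i * s      # new infections during this step
        recovered = gamma * i        # recoveries during this step
        s -= infected
        i += infected - recovered
        r += recovered
    # Everyone who left the susceptible group was infected at some point
    return s_init - s

# Two different parameter pairs with the same ratio, beta/gamma = 2
frac_a = simulate_fraction_infected(beta=0.4, gamma=0.2)
frac_b = simulate_fraction_infected(beta=0.8, gamma=0.4)

print(frac_a, frac_b)
```

Under these assumptions the two results should be close but not identical, which is consistent with points that fall on a single curve only approximately.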

## Contact number

From Section xxx, recall that the number of new infections in a
given day is $\beta s i N$, and the number of recoveries is
$\gamma i N$. If we divide these quantities, the result is
$\beta s / \gamma$, which is the number of new infections per recovery
(as a fraction of the population).

When a new disease is introduced to a susceptible population, $s$ is
approximately 1, so the number of people infected by each sick person is $\beta / \gamma$. This ratio is called the "contact number" or "basic reproduction number" (see <http://modsimpy.com/contact>). By convention it is usually denoted $R_0$, but in the context of an SIR model, this notation is confusing, so we'll use $c$ instead.

The results in the previous section suggest that there is a relationship between $c$ and the total number of infections. We can derive this relationship by analyzing the differential equations from
Section xxx:

$$\begin{aligned}
\frac{ds}{dt} &= -\beta s i \\
\frac{di}{dt} &= \beta s i - \gamma i\\
\frac{dr}{dt} &= \gamma i\end{aligned}$$ 

In the same way we divided the
contact rate by the recovery rate to get the dimensionless quantity
$c$, now we'll divide $di/dt$ by $ds/dt$ to get a ratio of rates:

$$\frac{di}{ds} = -1 + \frac{1}{cs}$$ 
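To make the algebra explicit, the division goes as follows, using $c = \beta / \gamma$:

$$\begin{aligned}
\frac{di}{ds} &= \frac{di/dt}{ds/dt} = \frac{\beta s i - \gamma i}{-\beta s i} \\
&= -1 + \frac{\gamma}{\beta s} = -1 + \frac{1}{cs}\end{aligned}$$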

Dividing one differential equation by another is not an obvious move, but in this case it is useful because it gives us a relationship between $i$, $s$ and $c$ that does not depend on time. From that relationship, we can derive an equation that relates $c$ to the final value of $s$. In theory, this equation makes it possible to infer $c$ by observing the course of an epidemic.

Here's how the derivation goes. We multiply both sides of the previous
equation by $ds$: 

$$di = \left( -1 + \frac{1}{cs} \right) ds$$ 

And then integrate both sides: 

$$i = -s + \frac{1}{c} \log s + q$$ 

where $q$ is a constant of integration. Rearranging terms yields:

$$q = i + s - \frac{1}{c} \log s$$ 

Now let's see if we can figure out what $q$ is. At the beginning of an epidemic, if the fraction infected is small and nearly everyone is susceptible, we can use the approximations $i(0) = 0$ and $s(0) = 1$ to compute $q$:

$$q = 0 + 1 - \frac{1}{c} \log 1$$ 

Since $\log 1 = 0$, we get $q = 1$.

At the end of the epidemic, let's assume that $i(\infty) = 0$, and $s(\infty)$ is an unknown quantity, $s_{\infty}$. Then we have:

$$q = 1 = 0 + s_{\infty}- \frac{1}{c} \log s_{\infty}$$ 

Solving for $c$, we get

$$c = \frac{\log s_{\infty}}{s_{\infty}- 1}$$

By relating $c$ and $s_{\infty}$, this equation makes it possible to estimate $c$ based on data, and possibly predict the behavior of future epidemics.
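As a sanity check on the algebra, here is a small sketch that computes $c$ from a hypothetical value of $s_{\infty}$ and substitutes it back into the expression for $q$. The function name `contact_number` and the value 0.5 are assumptions made for this example.

```python
from math import log, isclose

def contact_number(s_inf):
    """Compute c from the final susceptible fraction, 0 < s_inf < 1."""
    return log(s_inf) / (s_inf - 1)

s_inf = 0.5          # hypothetical: half the population is never infected
c = contact_number(s_inf)

# Substitute back into q = i + s - (1/c) log s with i = 0;
# if the derivation is right, q should come out to 1
q = s_inf - log(s_inf) / c

print(c, q)
```

If the value of `q` were not 1 (up to rounding), that would indicate a sign error somewhere in the derivation.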

## Analysis and simulation

Let's compare this analytic result to the results from simulation. I'll create an array of values for $s_{\infty}$:

In [9]:
from numpy import linspace

s_inf_array = linspace(0.0001, 0.999, 31)

And compute the corresponding values of $c$:

In [10]:
from numpy import log

c_array = log(s_inf_array) / (s_inf_array - 1)

To get the total infected, we compute the difference between $s(0)$ and
$s(\infty)$:

In [20]:
frac_infected = 1 - s_inf_array

We can use `make_series` to put `c_array`
and `frac_infected` in a Pandas `Series`.

In [21]:
frac_infected_series = make_series(c_array, frac_infected)

Now we can plot the results:

In [22]:
plot_sweep_frame(frame)
frac_infected_series.plot(label='analysis')

decorate(xlabel='Contact number (c)',
         ylabel='Fraction infected')

When the contact number exceeds 1, analysis and simulation agree. When
the contact number is less than 1, they do not: analysis indicates there should be no infections, but in the simulations a small number of students are infected.

The reason for the discrepancy is that the simulation divides time into a discrete series of days, whereas the analysis treats time as a
continuous quantity. In other words, the two methods are actually based on different models. So which model is better?

Probably neither. When the contact number is small, the early progress
of the epidemic depends on details of the scenario. If we are lucky, the original infected person, "patient zero", infects no one and there is no epidemic. If we are unlucky, patient zero might have a large number of close friends, or might work in the dining hall (and fail to observe safe food handling procedures).

For contact numbers near or less than 1, we might need a more detailed
model. But for higher contact numbers the SIR model might be good
enough.

## Estimating contact number

Figure xxx shows that if we know the contact number, we can compute the fraction infected. But we can also read the figure the other way; that is, at the end of an epidemic, if we can estimate the fraction of the population that was ever infected, we can use it to estimate the contact number.
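Both directions can be sketched in code. Going from the fraction infected to $c$ uses the equation derived earlier directly. Going from $c$ to the fraction infected requires solving that equation for $s_{\infty}$ numerically. The version below uses bisection, which is a choice made for this example rather than a method from the book, and the function names are hypothetical.

```python
from math import log

def contact_number_from_fraction(frac_infected):
    """Estimate c from the total fraction infected (between 0 and 1)."""
    s_inf = 1 - frac_infected
    return log(s_inf) / (s_inf - 1)

def final_susceptible(c, tol=1e-12):
    """Solve 1 = s - log(s)/c for s by bisection.

    For c <= 1 the analysis predicts no epidemic, so s_inf = 1.
    """
    if c <= 1:
        return 1.0

    def g(s):
        return s - 1 - log(s) / c

    # For c > 1, g is positive near 0 and negative at s = 1/c,
    # so the root we want lies between them
    lo, hi = 1e-12, 1 / c
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Forward: contact number -> fraction infected
frac = 1 - final_susceptible(2.0)

# Backward: fraction infected -> contact number
c_back = contact_number_from_fraction(frac)

print(frac, c_back)
```

The round trip should recover the contact number we started with, which is a useful check that the two functions are inverses of each other for $c > 1$.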

Well, in theory we can. In practice, it might not work very well,
because of the shape of the curve. When the contact number is between 1 and 2,
the curve is quite steep, which means that small changes in $c$ yield
big changes in the number of infections. If we observe that the total
fraction infected is anywhere from 20% to 80%, we would conclude that
$c$ is between about 1.1 and 2.

On the other hand, for larger contact numbers, nearly the entire
population is infected, so the curve is nearly flat. In that case we
would not be able to estimate $c$ precisely, because any value greater
than 3 would yield effectively the same results. Fortunately, this is
unlikely to happen in the real world; very few epidemics affect anything close to 90% of the population.

So the SIR model has limitations; nevertheless, it provides insight into the behavior of infectious disease, especially the phenomenon of herd immunity. As we saw in Chapter xxx, if we know the parameters of the model, we can use it to evaluate possible interventions. And as we saw in this chapter, we might be able to use data from earlier outbreaks to estimate the parameters.


## Exercises

**Exercise:**  If we didn't know about contact numbers, we might have explored other possibilities, like the difference between `beta` and `gamma`, rather than their ratio.

Write a version of `plot_sweep_frame`, called `plot_sweep_frame_difference`, that plots the fraction infected versus the difference `beta-gamma`.

What do the results look like, and what does that imply? 

In [23]:
# Solution goes here

In [24]:
# Solution goes here

In [25]:
# Solution goes here

**Exercise:** Suppose you run a survey at the end of the semester and find that 26% of students had the Freshman Plague at some point.

What is your best estimate of `c`?

Hint: if you print `frac_infected_series`, you can read off the answer. 

In [26]:
# Solution goes here

In [17]:
# Solution goes here