## The Most Dangerous Equation Demonstration

This notebook illustrates the phenomenon described in the Estimation Cautions slides: the most extreme rates (both high and low) tend to appear in the counties with the smallest populations.

In this notebook, we'll perform a simulation to see this in action.

In [None]:
import pandas as pd
from scipy.stats import binom
import matplotlib.pyplot as plt

We'll start by reading in a dataset which contains county-level population estimates for the year 2019.

In [None]:
pops = pd.read_csv('county_populations.csv')

In [None]:
pops.sort_values('POPESTIMATE2019')

We'll set an overall incidence rate for some type of cancer.

In [None]:
# Overall cancer incidence rate (0.5%, i.e. 500 cases per 100,000 people)
p = 0.005

Now we need to simulate the number of cases in each county. For this, we can use the `rvs` method of `scipy.stats.binom`, which takes the number of trials, `n`, and the probability of success, `p`.

For example, we could simulate for Kalawao County, the smallest county in the dataset.

In [None]:
# Simulating for Kalawao County
binom.rvs(n = 86, p = p)

We can actually use an array to specify the number of trials. For example, if we wanted to generate our numbers for the three smallest counties, we could do so as follows:

In [None]:
binom.rvs(n = [86, 169, 272], p = p)
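
Because the draws are random, the numbers above change on every run. As a minimal sketch, assuming you want repeatable output, `rvs` also accepts a `random_state` argument. The seed value 42 is arbitrary, and the names `sizes`, `first`, and `second` are only for illustration.

```python
import numpy as np
from scipy.stats import binom

p = 0.005
sizes = np.array([86, 169, 272])  # populations of the three smallest counties

# Passing the same seed twice reproduces the same draws.
# One value is drawn for each entry of `sizes`.
first = binom.rvs(n=sizes, p=p, random_state=42)
second = binom.rvs(n=sizes, p=p, random_state=42)

print(first)
print(second)
```

Leaving out `random_state`, as the rest of this notebook does, gives a fresh simulation each time the cell is run.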

Let's simulate for each county and save the results as a new column.

In [None]:
pops['cancer_instances'] = binom.rvs(n = pops['POPESTIMATE2019'], p = p)

In [None]:
pops.sort_values('POPESTIMATE2019')

We can calculate a "cancer rate" by dividing the number of cases by the population. We'll express it as cases per 100,000 people by multiplying the result by 100,000.

In [None]:
pops['cancer_rate'] = pops['cancer_instances'] / pops['POPESTIMATE2019'] * 100000

In [None]:
pops.sort_values('POPESTIMATE2019')

Now, let's look at the highest cancer rates.

In [None]:
pops.nlargest(25, 'cancer_rate')

And the lowest cancer rates.

In [None]:
pops.nsmallest(25, 'cancer_rate')
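
Very small counties can easily record zero cases, which is one reason they tend to show up among the lowest rates. As a rough check, the probability of zero cases in a county of size `n` is `(1 - p) ** n`, which `binom.pmf` computes directly. In this sketch, 86 is Kalawao County's population; the two larger values are illustrative and not taken from the dataset.

```python
from scipy.stats import binom

p = 0.005

# Probability that a county of each size records no cases at all.
zero_probs = {n: binom.pmf(0, n, p) for n in [86, 1_000, 10_000]}

for n, prob in zero_probs.items():
    print(f"n = {n:>6}: P(zero cases) = {prob:.4f}")
```

The same small counties also produce the highest rates: a single case among 86 people already works out to more than 1,000 per 100,000, over twice the true rate.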

Finally, let's plot population size against cancer rate. Since there are some very large counties, we'll limit the x-axis to counties with a population below 100,000.

You'll see a funnel shape: the larger a county is, the closer its rate tends to be to the overall rate of 500 per 100,000 (`p = 0.005`).

In [None]:
pops.plot(kind = 'scatter',
         x = 'POPESTIMATE2019',
         y = 'cancer_rate',
         figsize = (12,6),
         alpha = 0.5)

plt.xlim(-100, 100000);
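
The funnel follows from the standard error of an observed proportion, `sqrt(p * (1 - p) / n)`, which shrinks as the population `n` grows. The sketch below computes approximate bands of two standard errors around the true rate, in cases per 100,000. The population values are illustrative rather than taken from the dataset, and the normal approximation behind the "two standard errors" rule is rough for the smallest counties.

```python
import numpy as np

p = 0.005
populations = np.array([100, 1_000, 10_000, 100_000])  # illustrative sizes

# Standard error of the observed proportion, scaled to cases per 100,000.
se_per_100k = np.sqrt(p * (1 - p) / populations) * 100_000

expected = p * 100_000  # true rate per 100,000
lower = np.clip(expected - 2 * se_per_100k, 0, None)  # a rate cannot be negative
upper = expected + 2 * se_per_100k

for n, lo, hi in zip(populations, lower, upper):
    print(f"n = {int(n):>7}: roughly {lo:7.1f} to {hi:7.1f} per 100,000")
```

Each 100-fold increase in population narrows the band by a factor of 10, which is why the scatter tightens toward the right of the plot.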