# Calculating state-by-state implied infection numbers

This notebook estimates what the full infection numbers in each state likely were in the past and likely are at present.

For past dates, it does so by combining two assumptions, "median days from infection to death" (days_to_death) and "infection fatality rate" (IFR), with smoothed daily deaths. In other words, to end up with a given number of deaths on date D, there must have been roughly (deaths_on_date_D / IFR) infections days_to_death days before D.
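
As a rough illustration of that back-calculation, here is a minimal sketch in plain Python. It is not the notebook's implementation (that lives in `common.py`); the function name `implied_infections` and the toy numbers are assumptions made up for this example, and a single constant IFR is used for simplicity.

```python
def implied_infections(smoothed_deaths, ifr, days_to_death):
    """Shift deaths back by days_to_death days and scale by 1 / IFR.

    smoothed_deaths is a list of daily death counts, oldest first.
    The last days_to_death entries of the result are None, because the
    deaths that would reveal those infections have not happened yet.
    """
    n = len(smoothed_deaths)
    infections = [None] * n
    for day in range(n - days_to_death):
        # Deaths seen on (day + days_to_death) imply infections on day
        infections[day] = smoothed_deaths[day + days_to_death] / ifr
    return infections


# Toy data (hypothetical): deaths double on the last two days
deaths = [10, 10, 10, 10, 20, 20]
print(implied_infections(deaths, ifr=0.01, days_to_death=3))
```

The trailing `None` values are exactly the "most recent days_to_death days" that need a different method.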

For the most recent days_to_death days, where the resulting deaths have not happened yet, it looks up what percentage of infections were confirmed on the last day calculated from deaths, and applies that percentage to the new confirmed cases reported since then. It normalizes a bit by the amount of testing done on each day to try to handle significant ramping up/down of testing during that time, but the recent projections are admittedly sketchy.
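
A minimal sketch of that idea, in plain Python, might look like the following. The function name, its parameters, and especially the testing adjustment (scaling the confirmed share by a power of relative testing volume, controlled by the hypothetical `testing_exponent`) are assumptions for illustration only; the notebook's actual normalization is in `common.py` and may differ.

```python
def estimate_recent_infections(last_infections, last_confirms, last_tests,
                               recent_confirms, recent_tests,
                               testing_exponent=0.5):
    """Estimate infections for recent days from confirmed cases.

    last_* describe the last day whose infections were derived from deaths.
    """
    # Share of infections that were confirmed on that last day
    confirmed_share = last_confirms / last_infections
    estimates = []
    for confirms, tests in zip(recent_confirms, recent_tests):
        # ASSUMPTION: more testing finds a larger share of infections,
        # scaled by (relative testing volume) ** testing_exponent
        adjusted_share = confirmed_share * (tests / last_tests) ** testing_exponent
        adjusted_share = min(adjusted_share, 1.0)  # cannot confirm more than 100%
        estimates.append(confirms / adjusted_share)
    return estimates


# Toy data (hypothetical): confirmed cases double while testing quadruples
print(estimate_recent_infections(
    last_infections=1000, last_confirms=200, last_tests=10_000,
    recent_confirms=[200, 400], recent_tests=[10_000, 40_000]))
```

Setting `testing_exponent=0` turns the adjustment off, which reduces this to simply dividing confirmed cases by the last known confirmed share.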

The principal source of death data is files from the NY Times, supplemented by a more accurate DateOfDeath.xlsx from Massachusetts. The source of testing data is The COVID Tracking Project, maintained by The Atlantic.

NOTE: Prior to running this notebook, you should retrieve the latest DateOfDeath.xlsx file by:

1. going to https://www.mass.gov/info-details/covid-19-response-reporting,
2. downloading the raw data zip from the line saying "Raw data used to create the dashboard is available here:",
3. copying the DateofDeath.xlsx from that zip file to the same directory as the notebook.

Yeah, that could potentially be automated, but MA made that really hard the way they implemented it.
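
Step 3, at least, can be scripted once the zip has been downloaded by hand. The following is a minimal sketch using only the standard library; the function name `extract_date_of_death` and the zip file name in the usage comment are hypothetical, and the match is case-insensitive on the assumption that the capitalization of the workbook name may vary.

```python
import zipfile
from pathlib import Path


def extract_date_of_death(zip_path, dest_dir=".", target="dateofdeath.xlsx"):
    """Copy the date-of-death workbook out of the raw data zip into dest_dir."""
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            # Compare on the base name only, ignoring case and any folders
            if Path(member).name.lower() == target:
                out_path = Path(dest_dir) / Path(member).name
                out_path.write_bytes(archive.read(member))
                return out_path
    raise FileNotFoundError(f"No {target} found in {zip_path}")


# Usage (hypothetical file name for the manually downloaded zip):
# extract_date_of_death("covid-19-raw-data.zip")
```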

In [None]:
%matplotlib inline
import numpy
import pandas
import matplotlib
import matplotlib.pyplot as plt

from common import load_data, smooth_series, calc_mid_weekly_average
from common import calc_state_stats, get_infections_df, find_smooth_dates

In [None]:
# Earliest date that there is sufficient data for all states, including MA
EARLIEST_DATE = pandas.Period('2020-03-10', freq='D')

# Set a latest date when the most recent days have garbage (like on or after holidays).
# Only the last assignment takes effect; earlier cutoffs are kept as commented-out alternatives.
# LATEST_DATE = pandas.Period('2020-12-23', freq='D')
# LATEST_DATE = pandas.Period('2021-01-03', freq='D')
LATEST_DATE = None

# Set a number of recent days to not display in the graphs for lack of future days to smooth them
NON_DISPLAY_DAYS = 0

In [None]:
latest_date, meta, nyt_stats, ct_stats = load_data(EARLIEST_DATE, LATEST_DATE)
latest_displayed = latest_date - NON_DISPLAY_DAYS
print(f"Latest date = {str(latest_date)}; latest displayed = {str(latest_displayed)}")

### Put the two datasets together

In [None]:
ct1 = ct_stats.set_index(['ST', 'Date']).sort_index()[['Pos', 'Neg']]
nyt1 = nyt_stats.set_index(['ST', 'Date']).sort_index()[['Deaths']]
both = ct1.join(nyt1)
meta_tmp = meta.set_index('ST')

In [None]:
states = [calc_state_stats(state, df, meta_tmp, latest_date)
          for state, df in both.reset_index().groupby('ST')]

In [None]:
stats = pandas.concat(states).reset_index()

### Calculate new stats, state by state

In [None]:
# Median number of days between being exposed and developing illness
INCUBATION = 4

# Number of days one is infectious (this isn't actually used yet)
INFECTIOUS = 10

# Median days in between exposure and death
DEATH_LAG = 19

In [None]:
# Here is where you set variables for IFR assumptions

# Note that this IFR represents a country-wide average on any given day, but the IFRs
# are actually adjusted up/down based on median age and nursing home residents per capita

# This set represents my worst case scenario (the high end of my 95% confidence interval)
# Start by setting the initial and final IFRs
IFR_S, IFR_E = 0.013, 0.006
# Then set dates in between by which it linearly scales to various targets
IFR_BREAKS = [['2020-04-30', 0.0095], ['2020-07-31', 0.007], ['2020-09-15', 0.006]]

# This set is my optimistic scenario
IFR_S, IFR_E = 0.01, 0.0025
IFR_BREAKS = [['2020-04-30', 0.0075], ['2020-07-31', 0.0045], ['2020-09-15', 0.0025]]

# This set is a highly optimistic scenario that matches the recent CDC data
IFR_S, IFR_E = 0.009, 0.002
IFR_BREAKS = [['2020-04-30', 0.007], ['2020-07-31', 0.003], ['2020-09-15', 0.002]]

# This is my expected scenario
IFR_S, IFR_E = 0.01, 0.004
IFR_BREAKS = [['2020-04-30', 0.0085], ['2020-07-31', 0.005], ['2020-09-15', 0.004]]

IFR_PARAMS = {
    "High IFR (1.3%->0.6%)": (
        0.013, 0.006,
        [['2020-04-30', 0.0095], ['2020-07-31', 0.007], ['2020-09-15', 0.006]]),
    "Expected IFR (1.0%->0.4%)": (
        0.01, 0.004,
        [['2020-07-31', 0.005], ['2020-09-15', 0.004]]),
    "Low IFR (1.0%->0.25%)": (
        0.01, 0.0025,
        [['2020-04-30', 0.0075], ['2020-07-31', 0.0045], ['2020-09-15', 0.0025]]),
    "CDC Estimates (0.9%->0.2%)": (
        0.009, 0.002,
        [['2020-04-30', 0.007], ['2020-07-31', 0.003], ['2020-09-15', 0.002]]),
}
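
To make the "linearly scales to various targets" comment concrete, here is a minimal sketch of how a start IFR, an end IFR, and a list of dated targets could be turned into an IFR for any given day. This is an assumed reading of the parameters, not the code in `get_infections_df`; the function name `ifr_on` is hypothetical, and using 2020-03-10 (the notebook's EARLIEST_DATE) as the start of the schedule is also an assumption.

```python
from datetime import date


def ifr_on(day, start_day, ifr_start, ifr_end, breaks):
    """Piecewise-linear IFR: ifr_start on start_day, each target reached
    by its date, and ifr_end after the last break."""
    points = [(start_day, ifr_start)]
    points += [(date.fromisoformat(d), value) for d, value in breaks]
    if day <= start_day:
        return ifr_start
    for (d0, v0), (d1, v1) in zip(points, points[1:]):
        if d0 <= day <= d1:
            # Fraction of the way from one target date to the next
            frac = (day - d0).days / (d1 - d0).days
            return v0 + frac * (v1 - v0)
    return ifr_end


# The "High IFR" scenario, evaluated on a few dates
high = (0.013, 0.006,
        [['2020-04-30', 0.0095], ['2020-07-31', 0.007], ['2020-09-15', 0.006]])
for d in (date(2020, 3, 10), date(2020, 4, 30), date(2020, 12, 1)):
    print(d, ifr_on(d, date(2020, 3, 10), *high))
```

Under this reading, the state-level adjustments for median age and nursing home residents mentioned in the comments would be applied on top of the country-wide value returned here.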

In [None]:
RESULTS = {}
for title, (IFR_S, IFR_E, IFR_BREAKS) in IFR_PARAMS.items():
    IFR_S_S, IFR_E_S = f'{100*IFR_S:.1f}%', f'{100*IFR_E:.2f}%'
    infected_states = get_infections_df(states, meta, DEATH_LAG, IFR_S, IFR_E, IFR_BREAKS, INCUBATION, INFECTIOUS)
    foo = infected_states.reset_index()[['Date', 'NewInf']]
    foo = foo[foo.Date <= latest_displayed].groupby('Date').sum().NewInf
    foo = foo.cumsum()
    foo = pandas.Series([int(x) for x in foo], index=foo.index)
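    # The series is cut off at latest_displayed, so report that date and label the scenario
    print(f"{title}: total infected by {latest_displayed}: {int(foo.iloc[-1]):,}")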
    RESULTS[title] = foo

In [None]:
fam = infected_states.reset_index()[['Date', 'Confirms7']].groupby('Date').sum().cumsum()
fam = fam.loc[:latest_displayed]
fam = pandas.Series([int(x) for x in fam.Confirms7], index=fam.index)
RESULTS["Confirmed"] = fam

In [None]:
EST_LINE = str(latest_date - (DEATH_LAG - 1))
print(f"Vertical line marking recent estimations set at {EST_LINE}")

In [None]:
columns = list(RESULTS.keys())
values = list(RESULTS.values())
df = pandas.concat(values, axis=1)
df.columns = columns
df.tail() / 1000000

In [None]:
# Plot cumulative estimates in millions of infections
fam = (df/1000000).plot(title="Estimates of Total Infections in the US",
                        figsize=(13,5), ylim=0)