# Calculating implied infection numbers

This notebook tries to compute what the full infection numbers in the past and present likely were/are.

It does so in the past by combining assumed values for "median days from infection to death" and "infection fatality rate" (IFR) with smoothed death rates. In other words, days_to_death days before date D, there must have been roughly (deaths_on_date_D / IFR) infections to end up with the given number of deaths on date D.
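
As a rough illustration of that back-calculation, here is a minimal sketch on made-up numbers. The death counts, the 3-day lag, and the 1% IFR are all assumptions chosen to keep the toy series small; they are not the values used later in the notebook.

```python
import pandas

# Hypothetical smoothed daily deaths for ten consecutive days (made-up numbers)
dates = pandas.period_range('2020-04-01', periods=10, freq='D')
deaths7 = pandas.Series([10., 12., 15., 18., 20., 22., 25., 27., 30., 32.], index=dates)

DAYS_TO_DEATH = 3   # assumed lag, shortened so the toy series stays small
IFR = 0.01          # assumed infection fatality rate of 1%

# Deaths on date D imply (deaths / IFR) infections on date D - DAYS_TO_DEATH,
# so shift the death series back in time before dividing
implied_infections = deaths7.shift(-DAYS_TO_DEATH) / IFR
print(implied_infections)
```

The last DAYS_TO_DEATH entries come out as NaN, because the deaths that would reveal those infections have not happened yet.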

It does so in the present by looking at what percentage of infections were confirmed on the last day calculated in the past, and applying that percentage to the new confirmed infections reported since then. That doesn't fully account for a significant ramp-up in testing during that time, but it should be close enough.
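
A minimal sketch of that projection, again on made-up numbers: the series below are hypothetical, with infections "known" from deaths only for the first four days.

```python
import pandas

dates = pandas.period_range('2020-04-01', periods=6, freq='D')
# Hypothetical values: implied infections are available only for the first 4 days
infections = pandas.Series([2000., 2200., 2400., 2500., None, None], index=dates)
daily_confirms = pandas.Series([200., 230., 260., 250., 300., 350.], index=dates)

# Last date on which infections could be implied from deaths
last_date = dates[3]
infections_per_confirm = infections.loc[last_date] / daily_confirms.loc[last_date]

# Assume the same share of infections keeps being confirmed on the later dates
recent = dates[4:]
infections.loc[recent] = daily_confirms.loc[recent] * infections_per_confirm

print(1 / infections_per_confirm)  # fraction of infections that were confirmed
print(infections)
```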

The principal source of death data is files from the NY Times, supplemented by a more accurate DateOfDeath.csv from Massachusetts. The source of testing data is The COVID Tracking Project, maintained by The Atlantic.

NOTE: Prior to running this notebook, you should retrieve the latest DateOfDeath.csv file by:

1. going to https://www.mass.gov/info-details/covid-19-response-reporting,
2. downloading the raw data zip from the line saying "Raw data used to create the dashboard is available here:",
3. copying the DateofDeath.csv from that zip to the same directory as the notebook

Yeah, that could be automated. Just haven't done it yet...
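
In the meantime, here is a minimal sketch that automates only step 3, assuming the zip has already been downloaded by hand. The helper name `extract_date_of_death` and its parameters are hypothetical, and the download itself is left out because the zip's URL is not recorded here.

```python
import zipfile
from pathlib import Path

def extract_date_of_death(zip_path, dest_dir='.'):
    """Copy the date-of-death CSV out of a manually downloaded raw-data zip."""
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            # Match case-insensitively, since the casing of the file name may vary
            if Path(member).name.lower() == 'dateofdeath.csv':
                # Save under the name the rest of the notebook reads
                target = Path(dest_dir) / 'DateOfDeath.csv'
                target.write_bytes(archive.read(member))
                return target
    raise FileNotFoundError('No DateOfDeath.csv found in ' + str(zip_path))
```

Calling it with the path of the downloaded zip writes DateOfDeath.csv next to the notebook, or into `dest_dir` if one is given.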

In [None]:
%matplotlib inline
import numpy
import pandas
import matplotlib
import matplotlib.pyplot as plt

In [None]:
# Earliest date that there is sufficient data for all states, including MA
EARLIEST_DATE = pandas.Period('2020-03-10', freq='D')

In [None]:
# Get the state metadata
meta = pandas.read_csv('nyt_states_meta.csv')
meta['Country'] = 'USA'
# Keep the preview as the last expression so the notebook displays it
meta.sort_values('Pop', ascending=False).head()

### Pull in state data from NY Times and reduce it to interesting columns, joined with the data above

In [None]:
def read_nyt_csv(uri):
    stats = pandas.read_csv(uri)
    stats = stats[stats.state.isin(meta.State)][['date', 'state', 'deaths']]
    stats.columns = ['Date', 'State', 'Deaths']
    stats.Date = [pandas.Period(str(v)) for v in stats.Date]
    stats = stats[stats.Date >= EARLIEST_DATE]

    stats = stats.set_index(['State', 'Date']).sort_index()
    # Pull in the statistics for states
    stats = stats.join(meta.set_index('State'))

    # Remove territories
    stats = stats[~stats.ST.isin(['AS', 'GU', 'MP', 'PR', 'VI'])]

    return stats.reset_index()

In [None]:
nyt_stats = read_nyt_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv')
nyt_stats_live = read_nyt_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/live/us-states.csv')
nyt_stats[nyt_stats.State == 'New York'].tail()

In [None]:
# Attach the live stats if the daily file has not yet rolled
if nyt_stats.Date.max() < nyt_stats_live.Date.max():
    print("Pulling in live stats")
    nyt_stats = pandas.concat([nyt_stats, nyt_stats_live], sort=True)
    nyt_stats.index = list(range(len(nyt_stats)))

### Improve the Massachusetts data by using the DateOfDeath.csv from MA site

In [None]:
# Since the latest MA data is very incomplete, replace the most recent three days with
# a projection based on the average daily deaths over the prior five days
days = 3
cur_date = pandas.Period(nyt_stats.Date.max(), freq='D')
cutoff_date = cur_date - days
ma = pandas.read_csv('DateOfDeath.csv').iloc[:, [0, 2, 4]]
ma.columns = ['Date', 'Confirmed', 'Probable']
ma['Deaths'] = ma.Confirmed + ma.Probable
ma.Date = [pandas.Period(str(v)) for v in ma.Date]
ma = ma[(ma.Date >= EARLIEST_DATE) & (ma.Date <= cutoff_date)]
ma = ma.set_index('Date').sort_index()[['Deaths']]
extra_dates = pandas.period_range(end=cur_date, periods=days, freq='D')
avg_deaths = (ma.loc[cutoff_date].Deaths - ma.loc[cutoff_date-5].Deaths) / 5
# Use .iloc for positional access; ma is indexed by date, not by integer position
new_deaths = [ma.Deaths.iloc[-1] + (avg_deaths * (i+1)) for i in range(days)]
ma = pandas.concat([ma, pandas.DataFrame(new_deaths, index=extra_dates, columns=['Deaths'])])
ma.tail()

In [None]:
indices = nyt_stats[nyt_stats.State == 'Massachusetts'].index.copy()
spork = ma.copy()
spork.index = indices
nyt_stats.loc[indices, 'Deaths'] = spork.Deaths
nyt_stats[nyt_stats.State == 'Massachusetts'].tail()

### Pull in the testing information from the COVID Tracking Project

In [None]:
ct_stats = pandas.read_csv('https://covidtracking.com/api/v1/states/daily.csv')

# Remove territories
ct_stats = ct_stats[~ct_stats.state.isin(['AS', 'GU', 'MP', 'PR', 'VI'])]

# Choose and rename a subset of columns
ct_stats = ct_stats[['date', 'state', 'positive', 'negative']]
ct_stats.columns = ['Date', 'ST', 'Pos', 'Neg']

# Set the index to state and date
ct_stats.Date = [pandas.Period(str(v)) for v in ct_stats.Date]
ct_stats = ct_stats[ct_stats.Date >= EARLIEST_DATE]
ct_stats = ct_stats.set_index(['ST', 'Date'])

# Pull in the statistics for states
ct_stats = ct_stats.join(meta.set_index('ST')).reset_index().sort_values(['ST', 'Date'])

### Correct for various jumps in the data

NOTE: Various states have had days on which they dramatically scaled up their number of deaths for various reasons (usually starting to include probable deaths). The function and code below correct somewhat for this.

For example, on June 25, NJ started reporting "probable" deaths (much later than other states) and lumped 1854 past ones on that day, throwing off a lot of trend analysis. This code distributes those over the past in proportion to the confirmed deaths. Technically, the probables should be weighted earlier, but this should mostly suffice.
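
To show the idea of proportional redistribution on its own, here is a simplified standalone sketch with made-up numbers. It is not the notebook's function, and the 5-day series and the lump of 100 are assumptions.

```python
import pandas

dates = pandas.period_range('2020-06-01', periods=5, freq='D')
# Hypothetical cumulative deaths; the last day includes a lump of 100 backlogged deaths
cumulative = pandas.Series([100., 200., 300., 400., 600.], index=dates)
LUMP = 100.

# Daily deaths, with the lump removed from the day it was reported
daily = cumulative.diff()
daily.iloc[0] = cumulative.iloc[0]
daily.iloc[-1] -= LUMP

# Give each day a share of the lump in proportion to its own deaths
adjusted_daily = daily + LUMP * daily / daily.sum()
adjusted_cumulative = adjusted_daily.cumsum()

print(adjusted_cumulative)
```

The final cumulative total is unchanged; only the timing of the backlogged deaths moves earlier.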

In [None]:
def spread_deaths(stats, state, num_deaths, deaths_date, realloc_end_date=None):
    realloc_end_date = realloc_end_date or deaths_date
    st = stats[(stats.State == state) & (stats.Date <= deaths_date)]
    indices = st.index.copy()
    st = st.set_index('Date')[['Deaths']].copy()
    orig_total = st.loc[deaths_date, 'Deaths']
    st.loc[deaths_date, 'Deaths'] -= num_deaths
    new_total = st.loc[deaths_date, 'Deaths']
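    # Daily deaths after removing the lump from deaths_date
    st['Daily'] = st.Deaths.diff()
    # Extra deaths to credit to each day, in proportion to that day's deaths
    st['DailyAdj'] = (st.Daily * (orig_total / new_total)) - st.Daily
    # Running total of the extra deaths; the first day has no daily value, so treat it as 0
    st['CumAdj'] = st.DailyAdj.fillna(0.).cumsum()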
    st.loc[deaths_date, 'CumAdj'] = 0.
    st = st.reset_index()
    st.index = indices
    stats.loc[indices, 'Deaths'] += st.CumAdj

In [None]:
STATE_ADJUSTMENTS = (
    ('New Jersey', 1854, '2020-06-25'),
    ('New York', 608, '2020-06-30'),  # technically, most of these apparently happened at least three weeks earlier
    ('Illinois', 123, '2020-06-08'),
    ('Michigan', 220, '2020-06-05'),
)

for state, deaths, deaths_date in STATE_ADJUSTMENTS:
    spread_deaths(nyt_stats, state, deaths, deaths_date)

### Group on date and calculate new stats

In [None]:
# Select the numeric columns before summing so text columns are never aggregated
nyt = nyt_stats.groupby('Date')[['Deaths']].sum().sort_index()
ct = ct_stats.groupby('Date')[['Pos', 'Neg']].sum().sort_index()

# Calculate the cumulative fraction of tests that came back positive
ct['PctPos'] = ct.Pos / (ct.Pos + ct.Neg)

# Calculate daily deaths and smoothed (avg of trailing 7 days) deaths
nyt['Daily'] = (nyt.Deaths - nyt.shift().Deaths)
nyt['Deaths7'] = (nyt.Deaths - nyt.shift(7).Deaths) / 7

# Calculate confirmed tests based on smoothed weekly data
ct7 = ct.shift(7)[['Pos', 'Neg']]
ct['NRatio'] = (ct.Neg - ct7.Neg) / (ct.Pos - ct7.Pos)
ct['DailyConfirms'] = (ct.Pos - ct7.Pos) / 7

ct.tail()

## Now for the charts...

In [None]:
def get_infections_df(scenarios):
    data = {}
    for name, death_lag, ifr_high, ifr_low in scenarios:
        # Calculate the IFR to apply for each day
        ifr = pandas.Series(numpy.linspace(ifr_high, ifr_low, len(nyt)), index=nyt.index)
        # Calculate the infections in the past
        infections = nyt.shift(-death_lag).Deaths7 / ifr
        
        # Find the ratio of estimated infections to confirmed cases on the last date in the past
        last_date = infections.index[-(death_lag+1)]
        last_ratio = infections.loc[last_date] / ct.loc[last_date, 'DailyConfirms']
        
        # Apply that ratio to the dates since that date
        infections.iloc[-death_lag:] = ct.DailyConfirms.iloc[-death_lag:] * last_ratio

        # Show the fraction of infections that were confirmed under this scenario
        print(1 / last_ratio)
        data[name] = infections

    return pandas.DataFrame(data)

In [None]:
SCENARIOS = (('20', 20, 0.01, 0.01), ('18', 18, 0.01, 0.01), ('16', 16, 0.01, 0.01), )

df = get_infections_df(SCENARIOS)
foo = df.plot(title="New Infections Estimates, varying average days to death, IFR = 1.0%", figsize=(10,5))

In [None]:
SCENARIOS = (('1.3%', 18, 0.013, 0.013), ('1.0%', 18, 0.01, 0.01), ('0.7%', 18, 0.007, 0.007), )

df = get_infections_df(SCENARIOS)
foo = df.plot(title="New Infections Estimates, varying IFR, days to death = 18", figsize=(10,5))

In [None]:
SCENARIOS = (('1.2% - 0.8%', 18, 0.012, 0.008), ('1.0% - 0.7%', 18, 0.01, 0.007), ('0.9% - 0.6%', 18, 0.009, 0.006), )

df = get_infections_df(SCENARIOS)
foo = df.plot(title="Infection Estimations, improving IFR, days to death = 18", figsize=(10,5))

In [None]:
SCENARIOS = (('1.0% - 0.6%', 18, 0.01, 0.006), )

df = get_infections_df(SCENARIOS)
foo = df.plot(title="Infection Estimations, my hunch", figsize=(10,5))

In [None]:
df.sum()

In [None]:
SCENARIOS = (('1.2% - 0.5%', 20, 0.012, 0.005), )

df = get_infections_df(SCENARIOS)
foo = df.plot(title="Worst case? 20 days to death, improving IFR", figsize=(10,5))