# Calculating state-by-state implied infection numbers

This notebook estimates what the true infection numbers likely were in the past and are at present, state by state.

For past dates, it combines two assumptions, the median number of days from infection to death (`days_to_death`) and the infection fatality rate (IFR), with smoothed daily death counts. In other words, `days_to_death` days before date D, there must have been roughly `deaths_on_date_D / IFR` infections to end up with the given number of deaths on date D.
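
A minimal sketch of that back-projection, assuming a constant IFR and a smoothed daily death series indexed by daily periods. The function name `implied_infections` and the sample numbers are hypothetical; the notebook's actual calculation happens inside `get_infections_df` from `common` and may differ in detail.

```python
import pandas

def implied_infections(smoothed_deaths, days_to_death, ifr):
    """Shift each day's deaths back by days_to_death and divide by the IFR."""
    shifted = smoothed_deaths.copy()
    # Deaths on date D imply infections on date D - days_to_death
    shifted.index = shifted.index - days_to_death
    return shifted / ifr

# Illustrative numbers only, not real data
dates = pandas.period_range('2020-04-20', periods=3, freq='D')
deaths = pandas.Series([10.0, 20.0, 30.0], index=dates)
infections = implied_infections(deaths, days_to_death=19, ifr=0.01)
```

With these made-up inputs, 10 deaths on 2020-04-20 at a 1% IFR imply roughly 1,000 infections on 2020-04-01.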

For the most recent `days_to_death` days, where the resulting deaths have not happened yet, it looks up what percentage of infections were confirmed on the last day that could be back-calculated, and applies that percentage to the new infections confirmed since then. It normalizes a bit by the amount of testing done on each day to try to handle significant ramping up or down of testing during that time, but the recent projections are admittedly sketchy.
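
A rough sketch of the recent-days extrapolation, under the assumption that the confirmed share stays fixed at its last back-calculated value. The function name `estimate_recent_infections` and the numbers are hypothetical, and the testing normalization is left out because its exact form is not spelled out here.

```python
import pandas

def estimate_recent_infections(recent_confirmed, last_confirmed, last_infections):
    """Scale recent confirmed cases by the last known share of infections found."""
    # Share of infections that were confirmed on the last back-calculated day
    pct_found = last_confirmed / last_infections
    # If only pct_found of infections are confirmed, divide to recover the total
    return recent_confirmed / pct_found

# Illustrative numbers only, not real data
recent = pandas.Series([500.0, 600.0])
estimated = estimate_recent_infections(recent, last_confirmed=400.0, last_infections=2000.0)
```

Here 20% of infections were confirmed on the last back-calculated day, so 500 confirmed cases translate to an estimated 2,500 infections.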

The principal source of death data is files from the NY Times, supplemented by a more accurate DateOfDeath.xlsx from Massachusetts. The source of testing data is The COVID Tracking Project, maintained by The Atlantic.

NOTE: Before running this notebook, you should retrieve the latest DateOfDeath.xlsx file by:

1. going to https://www.mass.gov/info-details/covid-19-response-reporting,
2. downloading the raw data zip from the line saying "Raw data used to create the dashboard is available here:", and
3. copying the DateofDeath.xlsx in that zip to the same directory as the notebook.

Yeah, that could potentially be automated, but MA made that really hard the way they implemented it.
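
The download itself still has to be done by hand, but the copy in step 3 can be scripted. This is only a sketch: the helper name `extract_date_of_death` is hypothetical, and it assumes the workbook sits somewhere inside the zip under a name that matches `DateofDeath.xlsx` when case is ignored.

```python
import zipfile
from pathlib import Path

def extract_date_of_death(zip_path, dest_dir='.', wanted='dateofdeath.xlsx'):
    """Copy the date-of-death workbook out of the raw-data zip into dest_dir."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Match on the file name only, ignoring folders and letter case
            if Path(name).name.lower() == wanted:
                target = Path(dest_dir) / Path(name).name
                target.write_bytes(archive.read(name))
                return target
    raise FileNotFoundError(f"No {wanted} found in {zip_path}")
```

Call it with the path of whatever zip was downloaded, for example `extract_date_of_death('raw-data.zip')`, where `raw-data.zip` is a placeholder name.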

In [None]:
%matplotlib inline
import numpy
import pandas
import matplotlib
import matplotlib.pyplot as plt

from common import load_data, smooth_series, calc_mid_weekly_average
from common import calc_state_stats, get_infections_df, find_smooth_dates

In [None]:
# Earliest date that there is sufficient data for all states, including MA
EARLIEST_DATE = pandas.Period('2020-03-10', freq='D')

# Set a latest date when the most recent days have garbage (like on or after holidays)
LATEST_DATE = pandas.Period('2020-12-23', freq='D')
LATEST_DATE = pandas.Period('2021-01-03', freq='D')
LATEST_DATE = None

# Set a number of recent days to not display in the graphs for lack of future days to smooth them
NON_DISPLAY_DAYS = 2

In [None]:
latest_date, meta, nyt_stats, ct_stats = load_data(EARLIEST_DATE, LATEST_DATE)
latest_displayed = latest_date - NON_DISPLAY_DAYS
print(f"Latest date = {str(latest_date)}; latest displayed = {str(latest_displayed)}")

In [None]:
# nyt_stats.tail(2)

In [None]:
# ct_stats.tail(2)

### Put the two datasets together

In [None]:
ct1 = ct_stats.set_index(['ST', 'Date']).sort_index()[['Pos', 'Neg']]
nyt1 = nyt_stats.set_index(['ST', 'Date']).sort_index()[['Deaths']]
both = ct1.join(nyt1)
meta_tmp = meta.set_index('ST')

In [None]:
states = [calc_state_stats(state, df, meta_tmp, latest_date)
          for state, df in both.reset_index().groupby('ST')]
# states[-17].tail(2)

In [None]:
stats = pandas.concat(states).reset_index()
# stats[stats.ST == 'WV'].tail(5)[['Date', 'RawDeaths', 'Deaths', 'Deaths7']]

### Calculate new stats, state by state

In [None]:
# Median number of days between being exposed and developing illness
INCUBATION = 4

# Number of days one is infectious (this isn't actually used yet)
INFECTIOUS = 10

# Median days in between exposure and death
DEATH_LAG = 19

In [None]:
# Here is where you set variables for IFR assumptions

# Note that this IFR represents a country-wide average on any given day, but the IFRs
# are actually adjusted up/down based on median age and nursing home residents per capita

# This set represents my worst-case scenario (within my 95% CI)
# Start by setting the initial and final IFRs
IFR_S, IFR_E = 0.013, 0.006
# Then set dates in between by which it linearly scales to various targets
IFR_BREAKS = [['2020-04-30', 0.0095], ['2020-07-31', 0.007], ['2020-09-15', 0.006]]

# This set is my optimistic scenario
IFR_S, IFR_E = 0.01, 0.0025
IFR_BREAKS = [['2020-04-30', 0.0075], ['2020-07-31', 0.0045], ['2020-09-15', 0.0025]]

# This set is a highly optimistic scenario that matches the recent CDC data
IFR_S, IFR_E = 0.009, 0.002
IFR_BREAKS = [['2020-04-30', 0.007], ['2020-07-31', 0.003], ['2020-09-15', 0.002]]

# This is my expected scenario
IFR_S, IFR_E = 0.01, 0.004
IFR_BREAKS = [['2020-04-30', 0.0085], ['2020-07-31', 0.005], ['2020-09-15', 0.004]]
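
The linear scaling between break dates described in the comments above can be sketched as follows. The helper name `ifr_schedule` is hypothetical, the date range is only illustrative, and the sketch assumes the breaks are sorted and fall strictly inside the range. It leaves out the per-state adjustments for median age and nursing home residents, so it is not a substitute for what `get_infections_df` does.

```python
import numpy
import pandas

def ifr_schedule(dates, ifr_start, ifr_end, breaks):
    """Interpolate the IFR linearly from the start, through each break, to the end."""
    knots = [dates[0]] + [pandas.Period(d, freq='D') for d, _ in breaks] + [dates[-1]]
    values = [ifr_start] + [v for _, v in breaks] + [ifr_end]
    # Period ordinals give evenly spaced integers for daily periods
    knot_x = [p.ordinal for p in knots]
    date_x = [p.ordinal for p in dates]
    return pandas.Series(numpy.interp(date_x, knot_x, values), index=dates)

# Illustrative date range with the "expected scenario" numbers
dates = pandas.period_range('2020-03-10', '2021-01-03', freq='D')
breaks = [['2020-04-30', 0.0085], ['2020-07-31', 0.005], ['2020-09-15', 0.004]]
ifr = ifr_schedule(dates, 0.01, 0.004, breaks)
```

The resulting series starts at the initial IFR, hits each target exactly on its break date, and stays flat once the final IFR is reached.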

In [None]:
IFR_S_S, IFR_E_S = f'{100*IFR_S:.1f}%', f'{100*IFR_E:.2f}%'
infected_states = get_infections_df(states, meta, DEATH_LAG, IFR_S, IFR_E, IFR_BREAKS, INCUBATION, INFECTIOUS)
EST_LINE = str(latest_date - (DEATH_LAG - 1))
print(f"Total infected by {latest_date}: {int(infected_states.NewInf.sum()):,}")
print(f"Vertical line marking recent estimations set at {EST_LINE}")
# infected_states.tail(3)

In [None]:
# Checking infection totals by an arbitrary date
INF_DATE = '2021-01-02'
fizz = infected_states.reset_index()
fizz = fizz[fizz.Date <= INF_DATE]
print(f"Total infected by {INF_DATE}: {int(fizz.NewInf.sum()):,}")

In [None]:
# raise ValueError()

## Now for the charts

In [None]:
# Just nicking off the values we don't want to display here
fazzy = infected_states.reset_index()
fazzy = fazzy[fazzy.Date <= latest_displayed]
fazzy = fazzy.set_index(['ST', 'Date'])
infected_states = fazzy

In [None]:
foozle = infected_states.reset_index()[['Date', 'NewInf', 'Deaths7']].groupby('Date').sum()
foozle.columns = ['Infections', 'Deaths']
foozle = foozle.loc['2020-09-05':, :]
fam = foozle.plot(
    title=f"Daily Infections vs. Deaths, {DEATH_LAG} median days to death, "
          f"IFR improving {IFR_S_S} - {IFR_E_S}",
    secondary_y='Deaths', figsize=(13,5), ylim=0)
__ = fam.axvline(EST_LINE, color="red", linestyle="--")
__ = fam.get_figure().get_axes()[1].set_ylim(0)

In [None]:
foo = infected_states.reset_index()[['Date', 'Region', 'NewInf', 'Deaths7', 'Pop']]
foo = foo.groupby(['Region', 'Date']).sum()
foo['NIPerM'] = foo.NewInf / foo.Pop
foo['DPerM'] = foo.Deaths7 / foo.Pop

In [None]:
zzz = foo.reset_index()
# zzz = zzz[zzz.Date > '2020-09-01']
fam = pandas.pivot_table(zzz, values = 'NIPerM', index=['Date'],
                         columns = 'Region').plot(title="New Daily Infections per Million", figsize=(15,5))
__ = fam.axvline(EST_LINE, color="red", linestyle="--")

In [None]:
# was foo.reset_index()
fam = pandas.pivot_table(zzz, values = 'DPerM', index=['Date'],
                         columns = 'Region').plot(title="Daily Deaths per Million", figsize=(15,5))

In [None]:
foo = infected_states.reset_index().set_index(['Date', 'ST']).sort_index()
foo = foo[['Pop', 'Confirms7', 'Deaths7', 'DPerM', 'NIPerM', 'NewInf', 'AIPer1000', 'AUPer1000', 'PctFound']]
faz = foo.loc[latest_displayed, :].sort_values('AUPer1000', ascending=False).copy()
faz = faz.reset_index()[['ST', 'Pop', 'Confirms7', 'Deaths7', 'DPerM', 'NIPerM', 'AIPer1000', 'AUPer1000', 'PctFound']]
faz.columns = ['ST', 'Pop', 'Cases', 'Deaths', 'DPerM', 'NIPerM', 'AIPer1000', 'ActUnk1000', 'PctFound']
faz.sort_values('NIPerM', ascending=False)

In [None]:
fam = infected_states[['Pop', 'Confirms7', 'Deaths7', 'NewInf']].copy()
fam['C7Per'] = fam.Confirms7 / fam.Pop
fam['D7Per'] = fam.Deaths7 / fam.Pop
fam['NIPer'] = fam.NewInf / fam.Pop
fam = fam.reset_index()[['ST', 'NIPer', 'C7Per', 'D7Per']]
fam.columns = ['ST', 'Infections', 'Confirms', 'Deaths']
fam = fam.groupby('ST').max().copy()
print("Maximum deaths/M/day states ever had")
fam.sort_values('Deaths', ascending=False).head(15)

In [None]:
# list(infected_states.index.get_level_values(0).unique())

In [None]:
# This is where I noodle around to investigate particular states of interest

# This next line lists all 51 (DC included)
st_names = list(infected_states.index.get_level_values(0).unique())

st_names = ['SD', 'ND', 'IA', 'TN']
st_names = ['NM', 'WY']
st_names = list(infected_states.index.get_level_values(0).unique())
st_names = ['AZ', 'NM', 'PA', 'TX', 'VT',]
st_names = ['SD', 'ND', 'IA',]
st_names = ['CA', 'TX', 'PA', 'NY', 'AZ', 'IL', 'FL', 'MI', 'NJ', 'TN', 'NC', 'IA', 'OH', 'MA', 'GA', 'IN', ]
st_names = ['AZ', 'PA', 'WV', 'NM', 'MS', 'KS', 'TN', 'SD', 'NV', 'AL',
            'AR', 'RI', 'IL', 'IN', 'SC', 'MI', 'MA', 'CA', 'NJ', 'TX', ]
st_names = ['DC', 'NM', 'MA', 'VA',]
num_plots = max(len(st_names), 2)
fig, axes = plt.subplots(num_plots, figsize=(15, 5*num_plots))
for i, st in enumerate(st_names):
    data = infected_states.loc[st, :].reset_index()[['Date', 'NIPerM', 'DPerM']].copy()
    data = data[data.Date >= '2020-11-01']
    data.columns = ['Date', 'Infections/M', 'Deaths/M']
    fam = data.groupby('Date').sum().plot(
        ax=axes[i], title=st, ylim=0, secondary_y='Deaths/M',
    )
    fam.axvline(EST_LINE, color="red", linestyle="--")

# Force every axis in the figure (including the secondary y-axes) to start at zero
for ax in fig.get_axes():
    ax.set_ylim(0)

In [None]:
# This lists the states with the highest percentage ever infected by a given date
# I usually will set this back about 10 days because I don't trust the estimated infections too much
DT = '2021-01-01'
term = 'NIPerM'
divisor = 10000 # 10000 to convert NIPerM to total percentage ever infected
ni = infected_states.reset_index()[['ST', 'Date', term]].copy()
ni = ni[ni.Date < DT].copy()
ni = (ni.groupby('ST').sum()[term].sort_values(ascending=False) / divisor)
ni.head(20)
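
The divisor of 10000 follows from simple unit conversion: a cumulative count per million divided by 1,000,000 gives a fraction, and multiplying by 100 gives a percentage, so the net effect is dividing by 10,000. A tiny worked example, with a hypothetical helper name and made-up daily values:

```python
def per_million_sum_to_percent(daily_per_million):
    """Turn daily infections per million into the percentage ever infected."""
    # sum / 1_000_000 is the fraction infected; * 100 makes it a percentage
    return sum(daily_per_million) / 10000

# Illustrative numbers only: 5,000 cumulative infections per million
pct = per_million_sum_to_percent([1500.0, 2500.0, 1000.0])
```

In this example 5,000 infections per million works out to 0.5% of the population.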

In [None]:
# Stopping the processing of this notebook
raise ValueError()

## Detritus

In [None]:
infected_states.reset_index().columns

In [None]:
df = infected_states.loc['AL', :][['RawInc', 'Daily', 'Deaths7', 'DPerM', 'Confirms7', 'NIPerM']]
df = df.loc['2020-12-10':, :].copy()
df

In [None]:
df = pandas.concat(states)[['DTests7']].reset_index()
st_names = list(df.ST.unique())
fig, axes = plt.subplots(len(st_names), figsize=(10, 4*len(st_names)))
for i, state in enumerate(st_names):
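    try:
        df[df.ST == state].set_index('Date').DTests7.plot(ax=axes[i], title=state)
    except Exception as exc:
        # Keep plotting the remaining states, but report what went wrong
        print(f"Could not plot {state}: {exc}")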


In [None]:
foo = {}
for st in ['WY', 'MA']:
    data = infected_states.loc[st, :].loc['2020-07-01':, :]
    # foo[st] = data.NIPerM
    foo[st] = data.DPerM
foo = pandas.DataFrame(foo)
fam = foo.plot(figsize=(15,5), legend=True, ylim=0)

In [None]:
spaz = nyt_stats[['ST', 'Nursing', 'Pop', 'Median']].drop_duplicates().copy()
spaz['NPerM'] = spaz.Nursing / spaz.Pop
spaz.sort_values('Median', ascending=False)

In [None]:
fizz = infected_states.reset_index()
fizz = fizz[fizz.Date <= '2020-12-01']
fizz.NewInf.sum()

In [None]:
foo = infected_states.loc['NM', :]
foo.Daily.tail(60)

In [None]:
#infected_states.columns

In [None]:
foo = infected_states[['Deaths7', 'DPerM', 'Pop']].reset_index().copy()
ma = foo[foo.ST.isin(['MA'])].copy()
us = foo.groupby('Date').sum().reset_index()
us['ST'] = 'US'
us['DPerM'] = us.Deaths7 / us.Pop
both = pandas.concat([ma, us]).sort_values(['Date', 'ST'])
both.tail()
fam = pandas.pivot_table(both, values = 'DPerM', index=['Date'],
                         columns = 'ST').plot(title="US vs. MA Deaths/Million", figsize=(15,5))

In [None]:
# state = states[34]
# st, start = state.index[0]
# spans = []
# start_amt = IFR_S
# for end, end_amt in IFR_BREAKS:
#     end = pandas.Period(end, 'D')
#     idx = pandas.period_range(start=start, end=end, freq='D')
#     spans.append(pandas.Series(numpy.linspace(start_amt, end_amt, len(idx)), index=idx).iloc[0:-1])
#     start, start_amt = end, end_amt

# st, end = state.index[-1]
# idx = pandas.period_range(start=start, end=end, freq='D')
# spans.append(pandas.Series(numpy.linspace(start_amt, IFR_E, len(idx)), index=idx))
# span = pandas.concat(spans)
# span = pandas.Series(span.values, index=state.index)
# span
# # ifr = pandas.Series(numpy.linspace(IFR_S, IFR_E, len(state)), index=state.index)
# # ifr[0], ifr[-1]

In [None]:
# fam = infected_states.reset_index()[['Date', 'NewInf']].groupby('Date').sum().plot(
#     title=f"Infection Estimations, 19 median days to death, "
#           f"IFR improving {IFR_S_S} - {IFR_E_S}",
#     figsize=(13,5), legend=None, ylim=0
# )
# __ = fam.axvline(EST_LINE, color="red", linestyle="--")

In [None]:
# fam = infected_states.reset_index()[['Date', 'Deaths7']].groupby('Date').sum().plot(
#     title="Deaths", figsize=(13,5),
#     legend=None, ylim=0, secondary_y='Deaths7'
# )

In [None]:
fizz = infected_states.reset_index().groupby('Date').agg({'DPerM': [numpy.mean, numpy.std]}).dropna()
fizz.columns = ['Mean', 'StdDev']
fizz['Ratio'] = fizz.StdDev / fizz.Mean
fizz.sort_values('Ratio').head(20)

In [None]:
fizz = infected_states.reset_index().groupby('Date').agg({'DPerM': lambda x: numpy.std(x) / numpy.mean(x)}).dropna()
fizz