# COVID-19 Tracker - USA

This notebook provides basic information about trends in COVID-19 cases and deaths in the US at the country, state, and county levels. The data is provided by the New York Times, which compiles it from reports by state and local health agencies. More information on the NYT dataset is available [here](https://github.com/nytimes/covid-19-data).

## How to Use this Notebook
The key is to make sure that the NYT dataset is up to date. The data is provided as a GitHub repository, which is included as a submodule of this repo. To update the data to the latest version, run the following command from within the top-level directory:

> git pull --recurse-submodules

If this is the first time you have run the notebook, you may first need to run the following command to properly set up the dataset:

> git submodule update --init --recursive

## Module Imports

In [0]:
import pandas as pd
import numpy as np

import plotly.offline as po
import plotly.graph_objects as go
import plotly.express as px

import requests

import io

from statsmodels.tsa.seasonal import STL
from scipy import stats

from huntlib.util import benfords

from datetime import datetime

import json

import ipywidgets as widgets
from IPython.display import display

from tqdm.auto import tqdm

## Parameters & Setup
Some basic notebook configuration parameters.

___ROLLING_AVERAGE_DAYS___ is the number of days over which to compute rolling averages for the timeline view. The default is 7 days, which gives 1-week averages.

___RECENT_DAYS___ is the number of days of history to consider when creating the activity heatmaps. The default is 14 days, so the maps reflect the most recent 2 weeks' worth of data.

In [0]:
ROLLING_AVERAGE_DAYS = 7
RECENT_DAYS = 14
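
As a rough illustration of what the rolling-average window does, the sketch below applies a 7-day window to a short series. The counts are invented for the example, and `window` simply stands in for `ROLLING_AVERAGE_DAYS`.

```python
import pandas as pd

# Hypothetical daily counts, for illustration only (not real data).
daily = pd.Series(
    [0, 7, 14, 21, 28, 35, 42, 49],
    index=pd.date_range("2020-03-01", periods=8, freq="D"),
)

window = 7  # stands in for ROLLING_AVERAGE_DAYS
rolling = daily.rolling(window).mean()

# The first (window - 1) entries are NaN because a full window is not yet available.
print(rolling)
```

Note that the baseline only starts on the seventh day of data, which is why the rolling-average line in the timeline plots begins slightly later than the raw series.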

## Load and Prepare the Data

### Load Geographic Data
First we load the data we'll use to create the maps. This is a set of coordinates for drawing the outlines of the states and counties, tied to [FIPS location codes](https://en.wikipedia.org/wiki/FIPS_county_code). The file we're using originally came from https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json.

In [0]:
# Parse the GeoJSON file directly into a Python dict
with open('geojson-counties-fips.json', 'r') as f:
    counties_geojson = json.load(f)
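
To make the link between the outlines and the FIPS codes more concrete, here is a minimal sketch using a hand-written FeatureCollection. It assumes each feature carries its 5-digit FIPS code as a string in an `id` field and a county name in `properties`. The sample geometry and the `fips_to_name` helper are hypothetical, not taken from the real file.

```python
import json

# Hand-written sample that mimics the assumed layout of the real file.
sample = json.loads("""
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature", "id": "01001",
     "properties": {"NAME": "Autauga"},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]]}},
    {"type": "Feature", "id": "51059",
     "properties": {"NAME": "Fairfax"},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[2, 2], [3, 2], [3, 3], [2, 2]]]}}
  ]
}
""")

# Build a lookup from FIPS code to county name (hypothetical helper).
fips_to_name = {f["id"]: f["properties"]["NAME"] for f in sample["features"]}
print(fips_to_name)
```

The same kind of lookup can be run against `counties_geojson` to check that the codes in the COVID-19 data have matching outlines.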

### Load the New York Times COVID-19 Data
Now we'll load the county-level COVID-19 data into a DataFrame.  We'll also fix it up a little by converting types in some of the important columns.

In [0]:
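# Read FIPS codes as strings so leading zeros (e.g. '01001') are preserved;
# the GeoJSON feature ids used for the maps are zero-padded strings.
us_counties = pd.read_csv('covid-19-data/us-counties.csv', parse_dates=['date'], dtype={'fips': str})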
us_counties['county'] = us_counties['county'].astype('category')
us_counties['state'] = us_counties['state'].astype('category')

us_counties.info()
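
One detail worth checking when converting types is how the `fips` column is parsed. The sketch below uses made-up rows in the same column layout as `us-counties.csv` to compare pandas' default parsing with an explicit string dtype. It assumes the map outlines are keyed by zero-padded 5-digit strings.

```python
import io
import pandas as pd

# Illustrative rows only; the values are made up.
csv_text = """date,county,state,fips,cases,deaths
2020-03-01,Autauga,Alabama,01001,1,0
2020-03-01,Unknown,Alabama,,2,0
"""

default = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
as_text = pd.read_csv(
    io.StringIO(csv_text), parse_dates=["date"], dtype={"fips": str}
)

print(default["fips"].tolist())  # parsed as floats, leading zero lost
print(as_text["fips"].tolist())  # parsed as strings, leading zero kept
```

Rows without a FIPS code (such as "Unknown" counties) come through as missing values in both cases.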

### Collate Data
The NYT dataset provides cumulative counts for each day, since the first case in the US. For our purposes, we're more interested in the per-day counts (i.e., the actual number of new cases or deaths for a single day).  

In order to compute these from the cumulative counts, we first find all the unique _(state, county)_ pairs, then extract them each as temporary DataFrames.  Within each of these temporary frames, we can then subtract adjacent rows from each other to determine the delta, which is the count of new cases/deaths for each day.  

Once we've done that, we concatenate all the temporary frames back into one big dataframe again.

In [0]:
local_dfs = list()

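# observed=True restricts the grouping to (state, county) pairs that actually occur in the data
for state, county in tqdm(us_counties.groupby(['state', 'county'], observed=True).groups.keys(), desc="Collating geographic info"):
    ldf = us_counties[(us_counties.state == state) & (us_counties.county == county)].copy()
    # diff() turns cumulative totals into per-day deltas; the first row has no predecessor, so it is set to 0
    ldf['daily_cases'] = ldf['cases'].diff().fillna(0).astype('int')
    ldf['daily_deaths'] = ldf['deaths'].diff().fillna(0).astype('int')
    local_dfs.append(ldf)

# reset_index() renumbers the rows after sorting (reindex() with no arguments is a no-op)
full_df = pd.concat(local_dfs, ignore_index=True).sort_values(by=['date', 'state', 'county']).reset_index(drop=True)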

full_df
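
The same cumulative-to-daily conversion can also be sketched without temporary frames by using a grouped `diff()`. This is an alternative shown on a tiny made-up frame, not the approach used in the cell above.

```python
import pandas as pd

# Made-up cumulative counts for two counties, for illustration only.
cum = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"] * 2),
    "state": ["Virginia"] * 3 + ["Ohio"] * 3,
    "county": ["Fairfax"] * 3 + ["Franklin"] * 3,
    "cases": [1, 4, 9, 2, 2, 5],
})

# Rows must be in date order within each county before differencing.
cum = cum.sort_values(["state", "county", "date"])

# diff() within each (state, county) group gives the per-day delta;
# the first row of each group has no predecessor and is set to 0.
cum["daily_cases"] = (
    cum.groupby(["state", "county"])["cases"].diff().fillna(0).astype(int)
)
print(cum)
```

Because the differencing happens inside each group, the first day of one county is never subtracted from the last day of another.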
    

## Visualizations

### Visualization Functions
Define some functions to create the visualizations we want.

In [0]:
def plot_daily_cumulative_summary(df, state=None, county=None):
    if state:
        df = df[df.state == state]
        title = f"{state} Cumulative Cases & Deaths"
        if county:
            df = df[df.county == county]
            title = f"{county} County, {state} Cumulative Cases & Deaths"
    else:
        title = "US Cumulative Cases & Deaths"
        
    cum_sum_df = df[['date', 'cases', 'deaths']].groupby('date').sum()
    
    fig = go.Figure(
        data=go.Scatter(
            x=cum_sum_df.index,
            y=cum_sum_df['cases'],
            mode='lines',
            name='Total Cases'
        )
    )

    fig.add_trace(
        go.Scatter(
            x=cum_sum_df.index,
            y=cum_sum_df['deaths'],
            mode='lines',
            name='Deaths'
        )
    )

    fig.update_layout(
        title=title
    )

    fig.show()

In [0]:
def plot_timeseries(df, column, state=None, county=None, anomaly_threshold=3, baseline_type='ra'):

    
    if state:
        df = df[df.state == state]
        title = f"COVID-19 {column} vs {ROLLING_AVERAGE_DAYS} Day Average: {state}"

        if county and county != '--':
            df = df[df.county == county]
            title = f"COVID-19 {column} vs {ROLLING_AVERAGE_DAYS} Day Average: {county} County, {state}"
    else:
        title = f"COVID-19 {column} vs {ROLLING_AVERAGE_DAYS} Day Average: US"

            
            
    local_series = df[['date', column]].groupby('date').sum().sort_index()[column]
    
    fig = go.Figure(
        data=go.Scatter(
            x=local_series.index,
            y=local_series,
            mode='lines',
            name=column
        )
    )

    seasonal = STL(local_series, robust=True)
    
    seasonal = seasonal.fit()
    
    if baseline_type == 'trend' or baseline_type == 'trend_seasonal':
        fig.add_trace(
            go.Scatter(
                x=local_series.index,
                y=seasonal.trend + seasonal.seasonal,
                mode='lines',
                marker_color='green',
                name='Baseline (Trend + Seasonal)'
            )
        )
    else:
        fig.add_trace(
            go.Scatter(
                x=local_series.index,
                y=local_series.rolling(ROLLING_AVERAGE_DAYS).mean(),
                mode='lines',
                marker_color='green',
                name="Baseline (Rolling Average)"
            )
        )
    
    anomalies = seasonal.resid[abs(seasonal.resid - seasonal.resid.mean()) >= (anomaly_threshold * seasonal.resid.std())]

    fig.add_trace(
        go.Scatter(
            x=anomalies.index,
            y=local_series.loc[anomalies.index],
            mode='markers',
            marker=dict(
                symbol='x',
                color='red',
                size=10
            ),
            name='Anomalies'
        )
    )
    
    fig.update_layout(
        title=title
    )

    fig.show()
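
The anomaly rule used in `plot_timeseries` flags any point whose residual lies at least `anomaly_threshold` standard deviations from the mean residual. The sketch below applies the same rule to a made-up residual series. It skips the STL decomposition entirely, so `resid` is only a stand-in for `seasonal.resid`.

```python
import pandas as pd

# Made-up residuals: small noise plus one large spike at the end.
resid = pd.Series(
    [0.1, -0.1] * 10 + [10.0],
    index=pd.date_range("2020-04-01", periods=21, freq="D"),
)

anomaly_threshold = 3

# Same test as in plot_timeseries: distance from the mean residual,
# compared against a multiple of the residual standard deviation.
deviation = (resid - resid.mean()).abs()
anomalies = resid[deviation >= anomaly_threshold * resid.std()]
print(anomalies)
```

Since the spike itself inflates the standard deviation, a single extreme value in a very short series can fail to reach the threshold. Longer series make the rule more reliable.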


In [0]:
def plot_heatmap(df, geojson, column=None, locations='fips', height=1000, width=None, color_scale='viridis'):

    fig = px.choropleth(
        df,
        geojson=geojson,
        locations=locations,
        color=column,
        scope='usa',
        hover_data=['county', 'fips', column],
        height=height,
        width=width,
        color_continuous_scale=color_scale,
        title=f"{column} in Last {RECENT_DAYS} Days by County"
    )

    fig.update_geos(fitbounds='locations')
    
    fig.show()

## US
First, show information about the country as a whole.

### US Cumulative Cases & Deaths

In [0]:
plot_daily_cumulative_summary(full_df)

### US Timelines

In [0]:
plot_timeseries(full_df, column='daily_cases')
plot_timeseries(full_df, column='daily_deaths')

## Geographic Drill-Down
In this section, you can use the widgets below to select a state (and optionally, a county) and examine cases and deaths in more detail.  

The timeline views will show either state- or county-level data, depending on your selections. The heatmaps, however, always show the whole state, broken down by county.

In [0]:
states = sorted(full_df.state.unique())
counties = list(full_df.county.unique())
counties.append('')
counties = sorted(counties)


widget_state_chooser = widgets.Dropdown(
    options=states,
    value='Virginia',
    description="State: "
)

widget_county_chooser = widgets.Dropdown(
    options=counties,
    value='',
    description='County: '
)

@widgets.interact(state=widget_state_chooser, county=widget_county_chooser)
def make_plots(state, county):
    plot_timeseries(full_df, column='daily_cases', state=state, county=county)
    plot_timeseries(full_df, column='daily_deaths', state=state, county=county)
    
    state_df = full_df[full_df.state == state]
    # The window ends at the most recent date available for this state
    start_time = state_df['date'].max() - pd.Timedelta(days=RECENT_DAYS)

    recent_df = state_df[state_df['date'] >= start_time]
    # Sum only the per-day columns for each county present in this state
    recent_df = recent_df.groupby(['fips', 'county'], as_index=False, observed=True)[['daily_cases', 'daily_deaths']].sum()
    
    plot_heatmap(recent_df, counties_geojson, column='daily_cases')
    plot_heatmap(recent_df, counties_geojson, column='daily_deaths')
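
The recent-activity aggregation that feeds the heatmaps can be sketched on a small made-up frame: keep the rows from the last `recent_days` days, then total the per-day counts for each county. The data and the names `recent_days`, `cutoff`, and `totals` are illustrative only.

```python
import pandas as pd

# Made-up per-day counts for two Virginia counties.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-04-01", "2020-04-10", "2020-04-20", "2020-04-01", "2020-04-20"]
    ),
    "fips": ["51059"] * 3 + ["51013"] * 2,
    "county": ["Fairfax"] * 3 + ["Arlington"] * 2,
    "daily_cases": [5, 7, 3, 2, 4],
})

recent_days = 14  # stands in for RECENT_DAYS

# The window is anchored to the latest date in the data, not to today.
cutoff = df["date"].max() - pd.Timedelta(days=recent_days)
recent = df[df["date"] >= cutoff]

# One row per county, holding the total for the window.
totals = recent.groupby(["fips", "county"], as_index=False)["daily_cases"].sum()
print(totals)
```

Anchoring the window to the latest date in the data keeps the maps meaningful even when the dataset has not been updated for a while.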