# US Data Charting
This workbook analyzes and plots the latest US states data from the [COVID Tracking Project](https://covidtracking.com/).

In [0]:
from datetime import datetime, timedelta, timezone
import dateutil.parser as du_parser
import pandas as pd
import eloader as el
import eplotter as ep

# load from the data loader helper
(df_world_daily) = el.load_opencovid19_data()
(df_it_daily, df_it_regional_daily) = el.load_pcmdpc_it_data()
(df_us_daily, df_us_states_daily, df_us_states_latest) = el.load_covidtracking_us_data()

df_fused_daily = el.fuse_daily_sources(df_world_daily, df_us_daily, df_it_daily)

## US Aggregate

Confirmed cases in the US. Other countries shown as references.

In [0]:
# plot, ranked by Confirmed cases
ranked_countries_by_confirmed_cases = ep.rank_data_by_metric(df_fused_daily, metric='Confirmed', unique_key='CountryCode')
#ranked_countries_by_population = ep.rank_data_by_metric(df_fused_daily, metric='Population', unique_key='CountryCode', max_results=10)
highlight_countries = ['United States of America', 'China', 'Italy', 'United Kingdom', 'South Korea', 'Japan', 'Brazil', 'India', 'Mexico', 'Nigeria', 'Russia']
ep.scatter_plot_by_series(
    df_fused_daily,
    x_key='X', y_key='Confirmed',
    series_key='CountryName', series_names=ranked_countries_by_confirmed_cases['CountryName'],
    y_log=True,
    # series_is_secondary=lambda df: df['CountryName'].any() not in list(ranked_countries_by_population['CountryName']),
    # .iloc[0] reads the series' country name; .any() on a string column is pandas-version dependent
    series_is_secondary=lambda df: df['CountryName'].iloc[0] not in highlight_countries,
    series_secondary_width=1,
    # bounds=[64, el.date_to_day_of_year(datetime.now()) - 64 + 7, 100, None],
    bounds=[64, None, 100, None],
    data_labels="legend", data_labels_align='right',
    line_style_non_first_series='dotted',
    title='US: Confirmed Cases',
    label_x="Day of 2020",
    stamp_1='Highlighted: highly populated or reference countries'
)

Death count in the US versus the rest of the world, with highly populated and reference countries highlighted.

In [0]:
ep.scatter_plot_by_series(
    df_fused_daily,
    x_key='X', y_key='Deaths',
    series_key='CountryName', series_names=ranked_countries_by_confirmed_cases['CountryName'],
    y_log=True,
    # .iloc[0] reads the series' country name; .any() on a string column is pandas-version dependent
    series_is_secondary=lambda df: df['CountryName'].iloc[0] not in highlight_countries,
    series_secondary_width=1,
    bounds=[64, None, 10, None],
    data_labels="legend", data_labels_align='right',
    line_style_non_first_series='dotted',
    title='US: Confirmed Deaths',
    label_x="Day of 2020",
    stamp_1='Highlighted: highly populated or reference countries'
)

### Table for US Aggregate data
Last 10 days of aggregated data.

In [0]:
df_fused_daily[df_fused_daily['CountryCode'] == 'US'][-10:]

## US States Charts

The following charts show Confirmed cases: the total number of people declared 'Positive' (which can happen even after death). Note that these statistics exclude numbers that cannot be observed, such as deaths that were not tested for the virus, or people who had mild symptoms and did not get tested.


In [0]:
# states with the highest Confirmed
regions_by_cases = ep.rank_data_by_metric(df_us_states_daily, metric='Confirmed', unique_key='RegionName')
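confirmed_top_count = int(regions_by_cases['Confirmed'].iloc[0])
# series whose latest count is below 1/20 of the top-ranked count are drawn as secondary
# (int(round(...)) avoids calling .astype on a plain Python int)
confirmed_sec_threshold = int(round(confirmed_top_count / 20))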
secondary_function = lambda df: df['Confirmed'].iloc[-1] < confirmed_sec_threshold

# [plot] Days since case 500, log
case_intersection = 500
ep.scatter_plot_by_series(
    df_us_states_daily,
    x_key='X', y_key='Confirmed',
    series_key='RegionName', series_names=regions_by_cases['RegionName'],
    series_is_secondary=secondary_function,
    shift_x_to_intersect_y=case_intersection,
    y_log=True,
    bounds=[None, (el.date_to_day_of_year(datetime.now()) - 66) * 1.5, None, None],
    data_labels="series", data_labels_align="center",
    title='Trend since case ' + str(case_intersection) + ', for US States',
    label_x='Days since case ' + str(case_intersection),
    stamp_1="Grayed-out: low case count"
)

The chart above aligns the curves at case #500, to show how regional trends diverge after reaching that number of confirmed cases.
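The alignment idea can be sketched with plain pandas. This is only an illustration on toy data: `shift_to_threshold` is a hypothetical helper, not part of `eplotter`, and the `shift_x_to_intersect_y` option used in the plot may work differently (for example, by interpolating the crossing point). Here each series is simply shifted so that the first day at or above the threshold becomes day 0.

```python
import pandas as pd


def shift_to_threshold(df, threshold, x_key="X", y_key="Confirmed", series_key="RegionName"):
    """Shift x so that each series first reaches `threshold` at x = 0 (no interpolation)."""
    shifted = []
    for _, group in df.groupby(series_key):
        group = group.sort_values(x_key)
        reached = group[group[y_key] >= threshold]
        if reached.empty:
            continue  # this series never reached the threshold, so it is dropped
        x0 = reached[x_key].iloc[0]
        shifted.append(group.assign(**{x_key: group[x_key] - x0}))
    if not shifted:
        return df.iloc[0:0].copy()
    return pd.concat(shifted, ignore_index=True)


# toy data (made up for illustration, not real case counts)
toy = pd.DataFrame({
    "RegionName": ["A"] * 4 + ["B"] * 4,
    "X": [70, 71, 72, 73, 75, 76, 77, 78],
    "Confirmed": [100, 400, 600, 900, 450, 520, 800, 1200],
})
aligned = shift_to_threshold(toy, 500)
print(aligned)
```

After the shift, both toy regions cross 500 cases at `X == 0`, so their growth from that point on can be compared directly.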

In [0]:
# [plot] Day of the year, all series, higher than 100
first_day = el.date_to_day_of_year(datetime(2020, 3, 6))
last_day = el.date_to_day_of_year(datetime.now())
ep.scatter_plot_by_series(
    df_us_states_daily,
    x_key='X', y_key='Confirmed',
    series_key='RegionName', series_names=regions_by_cases['RegionName'],
    series_is_secondary=secondary_function,
    series_secondary_width=1,
    y_log=True,
    bounds=[first_day, last_day, 100, None],
    data_labels="series", data_labels_align='right',
    title='Confirmed cases by US State, over time',
    label_x="Day of 2020",
    stamp_1="Grayed-out: low case count"
)

The chart below shows the mortality rate, defined as Deaths / Total Positives. Several factors affect both the numerator (in particular, non-attributed deaths) and the denominator (for example, mildly symptomatic and untested cases), so the real death rates are probably different.

For now, this is a baseline estimate given the numbers we have.
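As a rough sketch of the definition above, the rate can be computed per row with pandas. The data here is made up, and the way `eloader` actually fills the `Death_rate` column is an assumption; the percent scale is inferred from the chart's axis label.

```python
import pandas as pd

# toy data (made up for illustration, not real counts)
toy = pd.DataFrame({
    "RegionName": ["A", "A", "B", "B"],
    "X": [90, 91, 90, 91],
    "Confirmed": [1000, 1250, 0, 40],
    "Deaths": [20, 30, 0, 1],
})

# Deaths / Confirmed, as a percentage; left undefined (NaN) where there are no confirmed cases
confirmed = toy["Confirmed"].where(toy["Confirmed"] > 0)
toy["Death_rate"] = 100.0 * toy["Deaths"] / confirmed
print(toy)
```

Masking zero denominators keeps rows with no confirmed cases from producing infinite or misleading rates.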

In [0]:
# Mortality
regions_by_death_rate = ep.rank_data_by_metric(df_us_states_daily, metric='Death_rate', unique_key='RegionName')
ep.scatter_plot_by_series(
    df_us_states_daily,
    x_key='X', y_key='Death_rate',
    series_key='RegionName', series_names=regions_by_death_rate['RegionName'],
    series_is_secondary=secondary_function, series_secondary_width=1,
    y_filter='expo',
    bounds=[80, None, 0, 10],
    legend_decimals=1, legend_suffix='%',
    data_labels="legend", data_labels_align='right',
    title="Death rate by US State, in the last weeks",
    label_x="Day of 2020", label_y="Reported deaths / Confirmed cases (percent)",
    stamp_1="Grayed-out: states with a low case count so far"
)

### Tables for US Regions
States ranked by Confirmed cases, highest first.

In [0]:
regions_by_cases

All regions ranked by Mortality rate, highest first.

NOTE: when the number of confirmed cases ('Confirmed') is low, the 'Death_rate' in the table below is not statistically meaningful, so filter and interpret the data accordingly.
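One possible way to apply such a filter is sketched below on toy data. The `MIN_CONFIRMED` cut-off is an arbitrary assumption, and the column names are taken from the tables in this notebook.

```python
import pandas as pd

MIN_CONFIRMED = 1000  # assumed cut-off; pick a value that suits the analysis

# toy data (made up for illustration, not real counts)
latest = pd.DataFrame({
    "RegionName": ["A", "B", "C"],
    "Confirmed": [25000, 120, 4000],
    "Death_rate": [3.1, 8.3, 1.9],
})

# keep only regions with enough cases, then rank by death rate, highest first
significant = (
    latest[latest["Confirmed"] >= MIN_CONFIRMED]
    .sort_values("Death_rate", ascending=False)
    .reset_index(drop=True)
)
print(significant)
```

Region B has the highest rate in the toy data but only 120 cases, so it is excluded from the ranking.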

In [0]:
regions_by_death_rate