# tables_us_data.ipynb

Build tables of the latest COVID-19 statistics for U.S. counties.

Inputs:
* `outputs/us_counties_clean.csv`: The contents of `data/us_counties.csv` after data cleaning by [clean_us_data.ipynb](./clean_us_data.ipynb)
* `outputs/us_counties_clean_meta.json`: Column type metadata for reading `outputs/us_counties_clean.csv` with `pd.read_csv()`

**Note:** You can redirect these input files by setting the environment variable `COVID_OUTPUTS_DIR` to a replacement for the prefix `outputs` in the above paths.

In [1]:
# Initialization boilerplate
import os
import json
import pandas as pd
import numpy as np
from typing import *

import text_extensions_for_pandas as tp

# Local file of utility functions
import util

# Allow environment variables to override data file locations.
_OUTPUTS_DIR = os.getenv("COVID_OUTPUTS_DIR", "outputs")
util.ensure_dir_exists(_OUTPUTS_DIR)  # create if necessary

# Read and Reformat Input Data

In [2]:
# Read in the CSV file and apply the saved type information
csv_file = os.path.join(_OUTPUTS_DIR, "us_counties_clean.csv")
meta_file = os.path.join(_OUTPUTS_DIR, "us_counties_clean_meta.json")

# Read column type metadata
with open(meta_file) as f:
    cases_meta = json.load(f)

# pd.read_csv() does not accept datetime64 in its dtype argument.
# As a workaround, read the "Date" column as objects and let the
# parse_dates argument below convert it to datetime64.
cases_meta["Date"] = "object"

cases_vertical = (
    pd
    .read_csv(csv_file, dtype=cases_meta, parse_dates=["Date"])   
    .set_index(["FIPS", "Date"], verify_integrity=True)
)
cases_vertical

FIPS,Date,State,County,Population,Confirmed,Deaths,Recovered,Confirmed_Outlier,Deaths_Outlier,Recovered_Outlier
1001,2020-01-22,Alabama,Autauga,55869,0,0,0,False,False,False
1001,2020-01-23,Alabama,Autauga,55869,0,0,0,False,False,False
1001,2020-01-24,Alabama,Autauga,55869,0,0,0,False,False,False
1001,2020-01-25,Alabama,Autauga,55869,0,0,0,False,False,False
1001,2020-01-26,Alabama,Autauga,55869,0,0,0,False,False,False
...,...,...,...,...,...,...,...,...,...,...
56045,2020-09-12,Wyoming,Weston,6927,23,0,0,False,False,False
56045,2020-09-13,Wyoming,Weston,6927,23,0,0,False,False,False
56045,2020-09-14,Wyoming,Weston,6927,23,0,0,False,False,False
56045,2020-09-15,Wyoming,Weston,6927,23,0,0,False,False,False
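
The dtype workaround in the cell above can be tried on a small in-memory CSV. This is a minimal sketch, not the notebook's actual data: the sample rows and the `sample_meta` dictionary are hypothetical stand-ins for the cleaned CSV file and the saved metadata.

```python
import io

import pandas as pd

# Hypothetical stand-in for the cleaned CSV file
sample_csv = io.StringIO(
    "FIPS,Date,State,Confirmed\n"
    "1001,2020-09-15,Alabama,1600\n"
    "1001,2020-09-16,Alabama,1619\n"
)

# Hypothetical stand-in for the saved column type metadata.
# "Date" is declared as object because read_csv() rejects datetime64 here.
sample_meta = {
    "FIPS": "int64",
    "Date": "object",
    "State": "string",
    "Confirmed": "int64",
}

# parse_dates converts the "Date" column to datetime64 after reading
sample = (
    pd.read_csv(sample_csv, dtype=sample_meta, parse_dates=["Date"])
    .set_index(["FIPS", "Date"], verify_integrity=True)
)
date_level = sample.index.get_level_values("Date")
print(date_level.dtype)
```

Passing `verify_integrity=True` makes `set_index()` raise an error if any (FIPS, Date) pair appears more than once, which is a cheap way to catch duplicated rows early.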


## Normalize the Confirmed and Deaths counts by population

The populations of U.S. counties vary by several orders of magnitude, so it's useful to 
normalize each county's case counts by the county's population. Compute confirmed cases
and deaths per 100 residents.

In [3]:
cases = cases_vertical.copy()
cases["Confirmed_per_100"] = cases["Confirmed"] / cases["Population"] * 100
cases["Deaths_per_100"] = cases["Deaths"] / cases["Population"] * 100
cases

FIPS,Date,State,County,Population,Confirmed,Deaths,Recovered,Confirmed_Outlier,Deaths_Outlier,Recovered_Outlier,Confirmed_per_100,Deaths_per_100
1001,2020-01-22,Alabama,Autauga,55869,0,0,0,False,False,False,0.000000,0.0
1001,2020-01-23,Alabama,Autauga,55869,0,0,0,False,False,False,0.000000,0.0
1001,2020-01-24,Alabama,Autauga,55869,0,0,0,False,False,False,0.000000,0.0
1001,2020-01-25,Alabama,Autauga,55869,0,0,0,False,False,False,0.000000,0.0
1001,2020-01-26,Alabama,Autauga,55869,0,0,0,False,False,False,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
56045,2020-09-12,Wyoming,Weston,6927,23,0,0,False,False,False,0.332034,0.0
56045,2020-09-13,Wyoming,Weston,6927,23,0,0,False,False,False,0.332034,0.0
56045,2020-09-14,Wyoming,Weston,6927,23,0,0,False,False,False,0.332034,0.0
56045,2020-09-15,Wyoming,Weston,6927,23,0,0,False,False,False,0.332034,0.0
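
As a quick check of the arithmetic, the per-100 figure for a single row can be reproduced by hand. The helper `per_100` below is a hypothetical name introduced only for this sketch; the inputs are the Weston County, Wyoming values from the table above.

```python
def per_100(count: int, population: int) -> float:
    """Return the number of cases (or deaths) per 100 residents."""
    return count / population * 100


# Weston County, Wyoming: 23 confirmed cases among 6,927 residents
weston_rate = per_100(23, 6927)
print(round(weston_rate, 6))  # matches the 0.332034 shown in the table
```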


## Extract the most recent element of each time series

Most of the tables below focus on the most recent day's data, so we generate a DataFrame with
just the last element of each time series.

In [4]:
cases_without_index = cases.reset_index()
last_date = cases_without_index["Date"].max()
cases_by_county = (
    cases_without_index[cases_without_index["Date"] == last_date]
    .set_index("FIPS")
    .drop(columns=["Confirmed_Outlier", "Deaths_Outlier", "Recovered_Outlier"]))
cases_by_county

FIPS,Date,State,County,Population,Confirmed,Deaths,Recovered,Confirmed_per_100,Deaths_per_100
1001,2020-09-16,Alabama,Autauga,55869,1619,24,0,2.897850,0.042958
1003,2020-09-16,Alabama,Baldwin,223234,5003,47,0,2.241146,0.021054
1005,2020-09-16,Alabama,Barbour,24686,809,7,0,3.277161,0.028356
1007,2020-09-16,Alabama,Bibb,22394,612,9,0,2.732875,0.040189
1009,2020-09-16,Alabama,Blount,57826,1487,13,0,2.571508,0.022481
...,...,...,...,...,...,...,...,...,...
56037,2020-09-16,Wyoming,Sweetwater,42343,317,2,0,0.748648,0.004723
56039,2020-09-16,Wyoming,Teton,23464,481,1,0,2.049949,0.004262
56041,2020-09-16,Wyoming,Uinta,20226,323,2,0,1.596954,0.009888
56043,2020-09-16,Wyoming,Washakie,7805,111,6,0,1.422165,0.076874
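
The cell above selects the rows whose date equals the global maximum, which works when every county's time series ends on the same day. If some series ended earlier, those counties would be dropped. One possible alternative, sketched here on hypothetical toy data, is to take the last row of each county's series with `groupby().tail(1)`.

```python
import pandas as pd

# Hypothetical toy data: county 1003 is missing the final day
toy = pd.DataFrame({
    "FIPS": [1001, 1001, 1001, 1003, 1003],
    "Date": pd.to_datetime([
        "2020-09-14", "2020-09-15", "2020-09-16",
        "2020-09-14", "2020-09-15",
    ]),
    "Confirmed": [10, 12, 15, 3, 4],
})

# Sort each series by date, then keep the final row per county
latest_per_county = (
    toy.sort_values(["FIPS", "Date"])
    .groupby("FIPS")
    .tail(1)
    .set_index("FIPS")
)

# For comparison: filtering on the global maximum date drops county 1003
latest_by_max_date = toy[toy["Date"] == toy["Date"].max()].set_index("FIPS")
print(latest_per_county)
```

The trade-off is that the per-county version can mix dates in one table, so the `Date` column should stay visible in any table built from it.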


# Generate tables

Now that we have read and formatted the input data, we can use Pandas to generate summary tables of the 
latest COVID-19 data.

## Table: COVID-19 Cases and Deaths by State

Aggregate the most recent county-level numbers by state to build up a table of statewide totals.

In [5]:
cases_by_state = (cases_by_county
 .groupby("State")
 .aggregate({
     "Population": "sum",
     "Confirmed": "sum",
     "Deaths": "sum"
 }))
cases_by_state["Confirmed_per_100"] = cases_by_state["Confirmed"] / cases_by_state["Population"] * 100
cases_by_state["Deaths_per_100"] = cases_by_state["Deaths"] / cases_by_state["Population"] * 100

cases_by_state = cases_by_state[["Population", "Confirmed", "Deaths",
                                 "Confirmed_per_100", "Deaths_per_100"]]
cases_by_state

State,Population,Confirmed,Deaths,Confirmed_per_100,Deaths_per_100
Alabama,4903185,141087,2392,2.877456,0.048785
Alaska,731545,6431,44,0.879098,0.006015
Arizona,7278717,209904,5370,2.883805,0.073777
Arkansas,3017804,70761,1157,2.344784,0.038339
California,39512223,771321,14691,1.952107,0.037181
Colorado,5758736,62666,2002,1.08819,0.034765
Connecticut,3565287,55034,4487,1.543606,0.125852
Delaware,973764,18756,619,1.926134,0.063568
District of Columbia,705749,14743,617,2.088986,0.087425
Florida,21477737,669635,12939,3.11781,0.060244


In [6]:
# Now our data prep is done and we can start analyzing.

# The latest nationwide totals, computed from the state-level data
cases_by_state[["Confirmed", "Deaths"]].sum()

Confirmed    6534746
Deaths        195046
dtype: int64

In [7]:
# The latest nationwide totals, computed from the county-level data
cases_by_county[["Confirmed", "Deaths"]].sum()

Confirmed    6534746
Deaths        195046
dtype: int64
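
The two cells above compare the state-level and county-level totals by eye. As a sketch of how that comparison could be turned into an automated check, the following uses `pd.testing.assert_series_equal` on hypothetical toy data; the `toy_` names are placeholders, not variables from this notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for cases_by_county
toy_by_county = pd.DataFrame({
    "State": ["A", "A", "B"],
    "Confirmed": [10, 20, 5],
    "Deaths": [1, 2, 0],
})

# Aggregate to state level, as in the state table above
toy_by_state = toy_by_county.groupby("State")[["Confirmed", "Deaths"]].sum()

state_totals = toy_by_state.sum()
county_totals = toy_by_county[["Confirmed", "Deaths"]].sum()

# Raises AssertionError if the aggregation dropped or duplicated rows
pd.testing.assert_series_equal(state_totals, county_totals)
print(state_totals)
```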

## Table: Top 10 states by total confirmed cases

In [8]:
cases_by_state.sort_values("Confirmed", ascending=False).head(10)

State,Population,Confirmed,Deaths,Confirmed_per_100,Deaths_per_100
California,39512223,771321,14691,1.952107,0.037181
Texas,28995881,696807,14738,2.403124,0.050828
Florida,21477737,669635,12939,3.11781,0.060244
New York,19453561,446366,33025,2.294521,0.169763
Georgia,10617423,279937,6272,2.636581,0.059073
Illinois,12671821,266093,8367,2.09988,0.066028
Arizona,7278717,209904,5370,2.883805,0.073777
New Jersey,8882190,197472,16054,2.223235,0.180744
North Carolina,10488084,188024,3149,1.792739,0.030025
Tennessee,6829174,171625,2121,2.513115,0.031058
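
The `sort_values(...).head(n)` pattern used for these top-N tables can also be written with `DataFrame.nlargest`. A minimal sketch on hypothetical toy data, assuming the ranking column has no ties (with ties, the two approaches may order the tied rows differently):

```python
import pandas as pd

# Hypothetical toy data standing in for cases_by_state
toy_states = pd.DataFrame(
    {"Confirmed": [300, 100, 200]},
    index=pd.Index(["A", "B", "C"], name="State"),
)

# Pattern used in this notebook
top2_sorted = toy_states.sort_values("Confirmed", ascending=False).head(2)

# Equivalent one-call form
top2_nlargest = toy_states.nlargest(2, "Confirmed")

print(top2_nlargest)
```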


## Table: Top 10 states by confirmed cases per 100 residents

In [9]:
cases_by_state.sort_values("Confirmed_per_100", ascending=False).head(10)

State,Population,Confirmed,Deaths,Confirmed_per_100,Deaths_per_100
Louisiana,4648794,158634,5126,3.412369,0.110265
Florida,21477737,669635,12939,3.11781,0.060244
Mississippi,2976149,91234,2756,3.065505,0.092603
Arizona,7278717,209904,5370,2.883805,0.073777
Alabama,4903185,141087,2392,2.877456,0.048785
Georgia,10617423,279937,6272,2.636581,0.059073
South Carolina,5148714,134122,3132,2.604961,0.060831
Tennessee,6829174,171625,2121,2.513115,0.031058
Iowa,3155070,76630,1248,2.428789,0.039555
Nevada,3080156,74248,1494,2.410527,0.048504


## Table: Top 10 states by deaths per 100 residents

In [10]:
# Top 10 states by deaths per 100 residents
cases_by_state.sort_values("Deaths_per_100", ascending=False).head(10)

State,Population,Confirmed,Deaths,Confirmed_per_100,Deaths_per_100
New Jersey,8882190,197472,16054,2.223235,0.180744
New York,19453561,446366,33025,2.294521,0.169763
Massachusetts,6892503,123248,9239,1.788146,0.134044
Connecticut,3565287,55034,4487,1.543606,0.125852
Louisiana,4648794,158634,5126,3.412369,0.110265
Rhode Island,1059361,21374,1066,2.017631,0.100627
Mississippi,2976149,91234,2756,3.065505,0.092603
District of Columbia,705749,14743,617,2.088986,0.087425
Arizona,7278717,209904,5370,2.883805,0.073777
Michigan,9986857,119795,6865,1.199527,0.06874


## Table: Top 20 counties by total confirmed cases

In [11]:
cases_by_county.sort_values("Confirmed", ascending=False).head(20)

FIPS,Date,State,County,Population,Confirmed,Deaths,Recovered,Confirmed_per_100,Deaths_per_100
6037,2020-09-16,California,Los Angeles,10039107,256148,6303,0,2.551502,0.062784
12086,2020-09-16,Florida,Miami-Dade,2716940,165147,2955,0,6.078419,0.108762
4013,2020-09-16,Arizona,Maricopa,4485414,138151,3197,0,3.080006,0.071275
17031,2020-09-16,Illinois,Cook,5150233,136246,5146,0,2.645434,0.099918
48201,2020-09-16,Texas,Harris,4713325,120771,2458,0,2.562331,0.05215
48113,2020-09-16,Texas,Dallas,2635516,76149,1056,0,2.889339,0.040068
12011,2020-09-16,Florida,Broward,1952778,74832,1297,0,3.832079,0.066418
36081,2020-09-16,New York,Queens,2253858,71309,7235,0,3.163864,0.321005
36047,2020-09-16,New York,Kings,2559903,66468,7311,0,2.596505,0.285597
32003,2020-09-16,Nevada,Clark,2266715,63077,1298,0,2.782749,0.057263


## Table: Top 20 counties by confirmed cases per 100 residents

In [12]:
cases_by_county.sort_values("Confirmed_per_100", ascending=False).head(20)

FIPS,Date,State,County,Population,Confirmed,Deaths,Recovered,Confirmed_per_100,Deaths_per_100
13053,2020-09-16,Georgia,Chattahoochee,10907,1596,1,0,14.632805,0.009168
47169,2020-09-16,Tennessee,Trousdale,11284,1644,7,0,14.569302,0.062035
12067,2020-09-16,Florida,Lafayette,8422,1227,11,0,14.568986,0.13061
5079,2020-09-16,Arkansas,Lincoln,13024,1802,15,0,13.835995,0.115172
47095,2020-09-16,Tennessee,Lake,7016,864,2,0,12.314709,0.028506
5077,2020-09-16,Arkansas,Lee,8857,1041,16,0,11.753415,0.180648
31043,2020-09-16,Nebraska,Dakota,20026,2110,43,0,10.536303,0.214721
19021,2020-09-16,Iowa,Buena Vista,19620,1913,12,0,9.750255,0.061162
5017,2020-09-16,Arkansas,Chicot,10118,986,17,0,9.745009,0.168017
27105,2020-09-16,Minnesota,Nobles,21629,1912,16,0,8.839983,0.073975


## Table: Top 20 counties by deaths per 100 residents

In [13]:
cases_by_county.sort_values("Deaths_per_100", ascending=False).head(20)

FIPS,Date,State,County,Population,Confirmed,Deaths,Recovered,Confirmed_per_100,Deaths_per_100
13141,2020-09-16,Georgia,Hancock,8457,383,41,0,4.528793,0.484805
51640,2020-09-16,Virginia,Galax,6347,406,30,0,6.396723,0.472664
51595,2020-09-16,Virginia,Emporia,5346,244,23,0,4.56416,0.430228
13243,2020-09-16,Georgia,Randolph,6778,329,28,0,4.853939,0.413101
13273,2020-09-16,Georgia,Terrell,8531,325,31,0,3.809635,0.363381
28099,2020-09-16,Mississippi,Neshoba,29118,1513,103,0,5.196099,0.353733
35031,2020-09-16,New Mexico,McKinley,71367,4280,252,0,5.99717,0.353104
36005,2020-09-16,New York,Bronx,1418207,52371,4938,0,3.692761,0.348186
28051,2020-09-16,Mississippi,Holmes,17010,1076,58,0,6.325691,0.340976
22037,2020-09-16,Louisiana,East Feliciana,19135,1690,65,0,8.831983,0.339692
