# COVID-19 MIDS Collaboration 

## Data Sourcing: COVID Tracking Project 

This Jupyter Notebook downloads raw CSV data from the COVID Tracking Project API, converts the column datatypes, and exports the result as a [pickle file for faster loading](https://medium.com/better-programming/load-fast-load-big-with-compressed-pickles-5f311584507e). 

This code was adapted from a script provided to us by Professor Kevin Crook of the Berkeley MIDS program during our W205 (Data Engineering) class. 

### Data sources

US COVID-19 data (historical, at state level) from the COVID tracking project: https://covidtracking.com/api

Note: the COVID Tracking Project is a new website and its file formats have changed several times, so the parsing below may need to change until the formats settle down.
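
Because the file format may change, one possible safeguard is to compare the CSV header against the columns the notebook relies on before parsing. The sketch below is an assumption about how that could be done, not part of the original script: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names, and the column set is only a subset of the header shown later in this notebook.

```python
import csv
import io

# Hypothetical subset of the columns this notebook relies on (an assumption).
EXPECTED_COLUMNS = {"date", "state", "positive", "negative", "fips", "dateChecked"}


def missing_columns(csv_text, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return sorted(set(expected) - set(header))


# Made-up sample in which the "fips" column has been dropped.
sample = (
    "date,state,positive,negative,dateChecked\n"
    "20200425,AK,339,15393,2020-04-25T20:00:00Z\n"
)
print(missing_columns(sample))  # ['fips']
```

An empty list means every expected column is present; anything else signals that the parsing cells need a second look.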

### Set up environment 

In [1]:
# Import packages
import pandas as pd
import io
import requests
import pickle 

### Retrieve data

In [2]:
# get data at URL - this URL is for the state historical data, updated daily at 4pm ET
r = requests.get("https://covidtracking.com/api/v1/states/daily.csv")

In [3]:
# check HTTP request status
r.status_code

200
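
Checking `r.status_code` by eye works interactively. If the notebook is run unattended, the request could instead be made to fail loudly. This is a minimal sketch, assuming a hypothetical helper named `fetch_text` and an arbitrary 30-second timeout; `timeout` and `raise_for_status()` are standard parts of `requests`.

```python
import requests


def fetch_text(url, timeout=30, get=requests.get):
    """Download `url` and return the body, raising on HTTP error statuses.

    `get` is injectable so the helper can be exercised without network access.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx
    return response.text
```

It could be called as `fetch_text("https://covidtracking.com/api/v1/states/daily.csv")`, with the returned text passed to `io.StringIO` in the same way as `r.text`.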

In [4]:
# just show the first 2000 characters, the text is really long otherwise
r.text[0:2000]

'date,state,positive,negative,pending,hospitalizedCurrently,hospitalizedCumulative,inIcuCurrently,inIcuCumulative,onVentilatorCurrently,onVentilatorCumulative,recovered,hash,dateChecked,death,hospitalized,total,totalTestResults,posNeg,fips,deathIncrease,hospitalizedIncrease,negativeIncrease,positiveIncrease,totalTestResultsIncrease\n20200425,AK,339,15393,,32,,,,,,217,ab748b658b14f06c13e30194100f210b2136340e,2020-04-25T20:00:00Z,9,,15732,15732,15732,02,0,0,3451,0,3451\n20200425,AL,6137,65207,,,839,,288,,170,,7bf5132c6fc6ee52a0ff04abc4cd5430c20b506c,2020-04-25T20:00:00Z,212,839,71344,71344,71344,01,15,71,18344,305,18649\n20200425,AR,2829,35224,,104,291,,,25,57,964,58b9bd7f87ceaa7eb18f988e56f5ff87f50faf84,2020-04-25T20:00:00Z,47,291,38053,38053,38053,05,2,0,2387,88,2475\n20200425,AS,0,3,17,,,,,,,,9bbbe48f731360bdff6ba5de6550e13b34e278d0,2020-04-25T20:00:00Z,0,,20,3,3,60,0,0,0,0,0\n20200425,AZ,6280,56228,,697,1022,313,,191,,1345,c61443f5a5a14bd10e15d579fdf0ff8872ddb036,2020-04-25T20:00:00Z

In [5]:
# load into a Pandas dataframe
covid_df = pd.read_csv(io.StringIO(r.text)).add_prefix('o_') 

covid_df

Unnamed: 0,o_date,o_state,o_positive,o_negative,o_pending,o_hospitalizedCurrently,o_hospitalizedCumulative,o_inIcuCurrently,o_inIcuCumulative,o_onVentilatorCurrently,...,o_hospitalized,o_total,o_totalTestResults,o_posNeg,o_fips,o_deathIncrease,o_hospitalizedIncrease,o_negativeIncrease,o_positiveIncrease,o_totalTestResultsIncrease
0,20200425,AK,339.0,15393.0,,32.0,,,,,...,,15732.0,15732.0,15732.0,2,0.0,0.0,3451.0,0.0,3451.0
1,20200425,AL,6137.0,65207.0,,,839.0,,288.0,,...,839.0,71344.0,71344.0,71344.0,1,15.0,71.0,18344.0,305.0,18649.0
2,20200425,AR,2829.0,35224.0,,104.0,291.0,,,25.0,...,291.0,38053.0,38053.0,38053.0,5,2.0,0.0,2387.0,88.0,2475.0
3,20200425,AS,0.0,3.0,17.0,,,,,,...,,20.0,3.0,3.0,60,0.0,0.0,0.0,0.0,0.0
4,20200425,AZ,6280.0,56228.0,,697.0,1022.0,313.0,,191.0,...,1022.0,62508.0,62508.0,62508.0,4,0.0,38.0,1559.0,235.0,1794.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2876,20200126,WA,1.0,,,,,,,,...,,1.0,1.0,1.0,53,0.0,0.0,0.0,0.0,0.0
2877,20200125,WA,1.0,,,,,,,,...,,1.0,1.0,1.0,53,0.0,0.0,0.0,0.0,0.0
2878,20200124,WA,1.0,,,,,,,,...,,1.0,1.0,1.0,53,0.0,0.0,0.0,0.0,0.0
2879,20200123,WA,1.0,,,,,,,,...,,1.0,1.0,1.0,53,0.0,0.0,0.0,0.0,0.0


#### Alternative: pull data in JSON format

The API also serves JSON. Because the data are "flat", the CSV loaded into Pandas is much easier to work with, but here is an example of downloading JSON data and loading it into Python objects.

Since the full historical response would be really large, this example uses the smallest JSON file, which holds the current US values. Note that `r.json()` returns a list containing a single dictionary.

In [6]:
r = requests.get("https://covidtracking.com/api/v1/us/current.json")

In [7]:
r.status_code

200

In [8]:
r.json()

[{'positive': 931698,
  'negative': 4252937,
  'pending': 5315,
  'hospitalizedCurrently': 56312,
  'hospitalizedCumulative': 94743,
  'inIcuCurrently': 15020,
  'inIcuCumulative': 2516,
  'onVentilatorCurrently': 5266,
  'onVentilatorCumulative': 227,
  'recovered': 90445,
  'hash': '88b95c3bf06a7da67491cd0b5e62e81489e348b6',
  'lastModified': '2020-04-26T20:13:00.659Z',
  'death': 47980,
  'hospitalized': 94743,
  'total': 5189950,
  'totalTestResults': 5184635,
  'posNeg': 5184635,
  'notes': 'NOTE: "total", "posNeg", "hospitalized" will be removed in the future.'}]
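
If the JSON response is needed in tabular form after all, the list of dictionaries returned by `r.json()` can be handed to Pandas directly. This is a minimal sketch that uses a few of the values shown above in place of a live request; the variable names are illustrative.

```python
import pandas as pd

# A few fields copied from the output above, standing in for r.json()
records = [{"positive": 931698, "negative": 4252937, "death": 47980}]

# Each dictionary in the list becomes one row
us_current_df = pd.DataFrame.from_records(records)
print(us_current_df)
```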

### Convert datatypes

#### Count records

In [9]:
# count rows and columns
covid_df.shape

(2881, 25)

#### Check and convert datatypes

We have:

* o_date: date the cases are from; should be converted from int64 to a date object
* o_dateChecked: date when data were validated; should be converted from string to a date object
* o_hash: unique ID for every data update; string 
* o_state: state the cases are from; string; could be converted to category 
* o_fips: state FIPS code; should be converted from int64 to category; can be used to join to other datasets 
* o_positive/ negative/ pending/ hospitalizedCurrently/ hospitalizedCumulative/ inIcuCurrently/ inIcuCumulative/ onVentilatorCurrently/ onVentilatorCumulative/ recovered/ death/ hospitalized/ total: COVID-19 counts; should be converted from float to integer 
* o_totalTestResults/ posNeg: test results; posNeg is deprecated and was renamed to totalTestResults; should be converted from float to integer 
* o_deathIncrease/ hospitalizedIncrease/ negativeIncrease/ positiveIncrease/ totalTestResultsIncrease: increase in COVID-19 counts since the previous date; should be converted from float to integer
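
One caveat for joins on the FIPS code: state FIPS codes are conventionally written as two-digit strings, and `pd.read_csv` infers the column as an integer by default, so `02` becomes `2` (as in the table above). The sketch below assumes the dataset being joined stores codes as zero-padded strings, and shows how the leading zeros could be kept by reading the column as a string. The sample rows are abbreviated from the raw text shown earlier.

```python
import io

import pandas as pd

# Two rows abbreviated from the raw CSV text, keeping only a few columns
sample_csv = "date,state,positive,fips\n20200425,AK,339,02\n20200425,AL,6137,01\n"

# Default type inference turns the codes into integers
inferred_df = pd.read_csv(io.StringIO(sample_csv))
print(inferred_df["fips"].tolist())  # [2, 1]

# Forcing a string dtype keeps the zero padding
string_df = pd.read_csv(io.StringIO(sample_csv), dtype={"fips": str})
print(string_df["fips"].tolist())  # ['02', '01']
```

Whether this matters depends on the format of the key in the other dataset; if that key is also an integer, the inferred type joins without changes.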


In [10]:
# check data types 
covid_df.dtypes

o_date                          int64
o_state                        object
o_positive                    float64
o_negative                    float64
o_pending                     float64
o_hospitalizedCurrently       float64
o_hospitalizedCumulative      float64
o_inIcuCurrently              float64
o_inIcuCumulative             float64
o_onVentilatorCurrently       float64
o_onVentilatorCumulative      float64
o_recovered                   float64
o_hash                         object
o_dateChecked                  object
o_death                       float64
o_hospitalized                float64
o_total                       float64
o_totalTestResults            float64
o_posNeg                      float64
o_fips                          int64
o_deathIncrease               float64
o_hospitalizedIncrease        float64
o_negativeIncrease            float64
o_positiveIncrease            float64
o_totalTestResultsIncrease    float64
dtype: object

In [11]:
covid_df.head()

Unnamed: 0,o_date,o_state,o_positive,o_negative,o_pending,o_hospitalizedCurrently,o_hospitalizedCumulative,o_inIcuCurrently,o_inIcuCumulative,o_onVentilatorCurrently,...,o_hospitalized,o_total,o_totalTestResults,o_posNeg,o_fips,o_deathIncrease,o_hospitalizedIncrease,o_negativeIncrease,o_positiveIncrease,o_totalTestResultsIncrease
0,20200425,AK,339.0,15393.0,,32.0,,,,,...,,15732.0,15732.0,15732.0,2,0.0,0.0,3451.0,0.0,3451.0
1,20200425,AL,6137.0,65207.0,,,839.0,,288.0,,...,839.0,71344.0,71344.0,71344.0,1,15.0,71.0,18344.0,305.0,18649.0
2,20200425,AR,2829.0,35224.0,,104.0,291.0,,,25.0,...,291.0,38053.0,38053.0,38053.0,5,2.0,0.0,2387.0,88.0,2475.0
3,20200425,AS,0.0,3.0,17.0,,,,,,...,,20.0,3.0,3.0,60,0.0,0.0,0.0,0.0,0.0
4,20200425,AZ,6280.0,56228.0,,697.0,1022.0,313.0,,191.0,...,1022.0,62508.0,62508.0,62508.0,4,0.0,38.0,1559.0,235.0,1794.0


##### Integer/String --> Date columns

In [12]:
# create new version of column as a datetime object - with ymd
covid_df["date"] = pd.to_datetime(covid_df["o_date"], format='%Y%m%d')

# create new version of column as a datetime object - with ymdhms
covid_df["dateChecked"] = pd.to_datetime(covid_df["o_dateChecked"])
# check conversion 
covid_df[["o_date", "date", "o_dateChecked", "dateChecked"]].head()

Unnamed: 0,o_date,date,o_dateChecked,dateChecked
0,20200425,2020-04-25,2020-04-25T20:00:00Z,2020-04-25 20:00:00+00:00
1,20200425,2020-04-25,2020-04-25T20:00:00Z,2020-04-25 20:00:00+00:00
2,20200425,2020-04-25,2020-04-25T20:00:00Z,2020-04-25 20:00:00+00:00
3,20200425,2020-04-25,2020-04-25T20:00:00Z,2020-04-25 20:00:00+00:00
4,20200425,2020-04-25,2020-04-25T20:00:00Z,2020-04-25 20:00:00+00:00


##### String/Integer --> Categorical columns

In [13]:
# identify which columns to convert 
str_to_category = ["o_state", "o_fips"]
# create new column names by stripping the 'o_' prefix
str_to_category_new = [col.replace('o_', '', 1) for col in str_to_category]
# add new converted columns
covid_df[str_to_category_new] = covid_df[str_to_category].astype('category')
# check conversion
covid_df[str_to_category + str_to_category_new]


Unnamed: 0,o_state,o_fips,state,fips
0,AK,2,AK,2
1,AL,1,AL,1
2,AR,5,AR,5
3,AS,60,AS,60
4,AZ,4,AZ,4
...,...,...,...,...
2876,WA,53,WA,53
2877,WA,53,WA,53
2878,WA,53,WA,53
2879,WA,53,WA,53


In [14]:
# check conversion
covid_df[str_to_category + str_to_category_new].dtypes

o_state      object
o_fips        int64
state      category
fips       category
dtype: object

##### Float --> Integer columns

In [15]:
# identify which columns to convert 
to_int = list(covid_df.select_dtypes(include = ["float64"]).columns)
# create new column names by stripping the 'o_' prefix
converted_to_int = [col.replace('o_', '', 1) for col in to_int]
# add new converted columns (nullable Int64 keeps missing values)
covid_df[converted_to_int] = covid_df[to_int].astype('Int64')
# check conversion
covid_df[to_int].head()


Unnamed: 0,o_positive,o_negative,o_pending,o_hospitalizedCurrently,o_hospitalizedCumulative,o_inIcuCurrently,o_inIcuCumulative,o_onVentilatorCurrently,o_onVentilatorCumulative,o_recovered,o_death,o_hospitalized,o_total,o_totalTestResults,o_posNeg,o_deathIncrease,o_hospitalizedIncrease,o_negativeIncrease,o_positiveIncrease,o_totalTestResultsIncrease
0,339.0,15393.0,,32.0,,,,,,217.0,9.0,,15732.0,15732.0,15732.0,0.0,0.0,3451.0,0.0,3451.0
1,6137.0,65207.0,,,839.0,,288.0,,170.0,,212.0,839.0,71344.0,71344.0,71344.0,15.0,71.0,18344.0,305.0,18649.0
2,2829.0,35224.0,,104.0,291.0,,,25.0,57.0,964.0,47.0,291.0,38053.0,38053.0,38053.0,2.0,0.0,2387.0,88.0,2475.0
3,0.0,3.0,17.0,,,,,,,,0.0,,20.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0
4,6280.0,56228.0,,697.0,1022.0,313.0,,191.0,,1345.0,266.0,1022.0,62508.0,62508.0,62508.0,0.0,38.0,1559.0,235.0,1794.0


In [16]:
# check conversion
covid_df[converted_to_int].head()

Unnamed: 0,positive,negative,pending,hospitalizedCurrently,hospitalizedCumulative,inIcuCurrently,inIcuCumulative,onVentilatorCurrently,onVentilatorCumulative,recovered,death,hospitalized,total,totalTestResults,posNeg,deathIncrease,hospitalizedIncrease,negativeIncrease,positiveIncrease,totalTestResultsIncrease
0,339,15393,,32.0,,,,,,217.0,9,,15732,15732,15732,0,0,3451,0,3451
1,6137,65207,,,839.0,,288.0,,170.0,,212,839.0,71344,71344,71344,15,71,18344,305,18649
2,2829,35224,,104.0,291.0,,,25.0,57.0,964.0,47,291.0,38053,38053,38053,2,0,2387,88,2475
3,0,3,17.0,,,,,,,,0,,20,3,3,0,0,0,0,0
4,6280,56228,,697.0,1022.0,313.0,,191.0,,1345.0,266,1022.0,62508,62508,62508,0,38,1559,235,1794


In [17]:
# check conversion
covid_df[to_int + converted_to_int].dtypes

o_positive                    float64
o_negative                    float64
o_pending                     float64
o_hospitalizedCurrently       float64
o_hospitalizedCumulative      float64
o_inIcuCurrently              float64
o_inIcuCumulative             float64
o_onVentilatorCurrently       float64
o_onVentilatorCumulative      float64
o_recovered                   float64
o_death                       float64
o_hospitalized                float64
o_total                       float64
o_totalTestResults            float64
o_posNeg                      float64
o_deathIncrease               float64
o_hospitalizedIncrease        float64
o_negativeIncrease            float64
o_positiveIncrease            float64
o_totalTestResultsIncrease    float64
positive                        Int64
negative                        Int64
pending                         Int64
hospitalizedCurrently           Int64
hospitalizedCumulative          Int64
inIcuCurrently                  Int64
inIcuCumulat

### Check missingness

We have a lot of missing data in the columns that track COVID-19 counts, but none in the ID columns (date, state, fips). Given the difficulty of collecting these data ([as documented on the COVID Tracking Project website](https://covidtracking.com/data)), one would expect at least some missing data in the count columns. 

Comparing the missingness counts of each original column with those of its converted counterpart further validates that the datatype conversions did not introduce additional NAs. 

In [18]:
covid_df.isnull().sum(axis = 0)

o_date                           0
o_state                          0
o_positive                      15
o_negative                     193
o_pending                     2256
o_hospitalizedCurrently       2019
o_hospitalizedCumulative      1896
o_inIcuCurrently              2451
o_inIcuCumulative             2709
o_onVentilatorCurrently       2541
o_onVentilatorCumulative      2828
o_recovered                   2137
o_hash                           0
o_dateChecked                    0
o_death                        739
o_hospitalized                1896
o_total                          2
o_totalTestResults               2
o_posNeg                         2
o_fips                           0
o_deathIncrease                 56
o_hospitalizedIncrease          56
o_negativeIncrease              56
o_positiveIncrease              56
o_totalTestResultsIncrease      56
date                             0
dateChecked                      0
state                            0
fips                
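
The comparison described above can also be made programmatically rather than by reading the table. This is a minimal sketch on a made-up three-row frame (not the real data), assuming the same `o_` prefix convention and the nullable `Int64` dtype used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the float columns of covid_df
sample = pd.DataFrame({
    "o_positive": [339.0, 6137.0, np.nan],
    "o_pending": [np.nan, np.nan, 17.0],
})

# Nullable Int64 keeps missing values as <NA> instead of failing on NaN
converted = sample.astype("Int64").rename(columns=lambda c: c.replace("o_", "", 1))

# Count missing values per column before and after the conversion
na_before = sample.isna().sum().tolist()
na_after = converted.isna().sum().tolist()
print(na_before, na_after)  # [1, 2] [1, 2]
```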

### Generate summary statistics

The summary statistics of the original and converted columns are another way to validate the column type conversion. The conversion does not appear to have changed the summary statistics, so from now on we will use the converted columns. 

In [19]:
# display summary statistics of original columns
covid_df.loc[:, covid_df.columns.str.startswith('o_')].describe(include = 'all')

Unnamed: 0,o_date,o_state,o_positive,o_negative,o_pending,o_hospitalizedCurrently,o_hospitalizedCumulative,o_inIcuCurrently,o_inIcuCumulative,o_onVentilatorCurrently,...,o_hospitalized,o_total,o_totalTestResults,o_posNeg,o_fips,o_deathIncrease,o_hospitalizedIncrease,o_negativeIncrease,o_positiveIncrease,o_totalTestResultsIncrease
count,2881.0,2881,2866.0,2688.0,625.0,862.0,985.0,430.0,172.0,340.0,...,985.0,2879.0,2879.0,2879.0,2881.0,2825.0,2825.0,2825.0,2825.0,2825.0
unique,,56,,,,,,,,,...,,,,,,,,,,
top,,WA,,,,,,,,,...,,,,,,,,,,
freq,,95,,,,,,,,,...,,,,,,,,,,
mean,20200360.0,,5390.375436,24698.565104,1512.3488,1466.431555,1711.577665,720.813953,225.127907,336.652941,...,1711.577665,28754.351164,28426.036471,28426.036471,32.201666,16.984071,35.247788,1505.098761,329.766372,1834.865133
std,51.78699,,20434.225767,48083.57659,7667.013322,3100.638022,7001.974126,1157.295186,222.239205,445.080255,...,7001.974126,64138.249628,63862.788101,63862.788101,18.417664,66.684907,235.416836,4360.269069,1004.373163,4888.489829
min,20200120.0,,0.0,0.0,0.0,2.0,0.0,2.0,6.0,2.0,...,0.0,0.0,0.0,0.0,1.0,-201.0,-655.0,-5086.0,-383.0,-4714.0
25%,20200320.0,,32.25,420.75,6.0,74.25,57.0,74.0,43.0,26.0,...,57.0,292.0,264.5,264.5,17.0,0.0,0.0,9.0,5.0,32.0
50%,20200330.0,,476.0,7504.5,32.0,295.0,241.0,162.5,176.0,81.5,...,241.0,6674.0,6633.0,6633.0,32.0,1.0,0.0,394.0,48.0,467.0
75%,20200410.0,,2883.25,28201.0,236.0,1410.5,760.0,1047.5,290.25,507.0,...,760.0,29316.5,29109.0,29109.0,46.0,7.0,3.0,1563.0,225.0,1851.0


In [20]:
# extract converted columns & unconverted hash column
original_columns = [col for col in covid_df.columns if col.startswith('o_')]
converted_columns = [col if col == 'o_hash' else col.replace('o_', '', 1) for col in original_columns]
converted_covid_df = covid_df[converted_columns]
         
# show summary statistics
converted_covid_df.describe(include = 'all')

Unnamed: 0,date,state,positive,negative,pending,hospitalizedCurrently,hospitalizedCumulative,inIcuCurrently,inIcuCumulative,onVentilatorCurrently,...,hospitalized,total,totalTestResults,posNeg,fips,deathIncrease,hospitalizedIncrease,negativeIncrease,positiveIncrease,totalTestResultsIncrease
count,2881,2881,2866.0,2688.0,625.0,862.0,985.0,430.0,172.0,340.0,...,985.0,2879.0,2879.0,2879.0,2881.0,2825.0,2825.0,2825.0,2825.0,2825.0
unique,95,56,,,,,,,,,...,,,,,56.0,,,,,
top,2020-03-16 00:00:00,WA,,,,,,,,,...,,,,,53.0,,,,,
freq,56,95,,,,,,,,,...,,,,,95.0,,,,,
first,2020-01-22 00:00:00,,,,,,,,,,...,,,,,,,,,,
last,2020-04-25 00:00:00,,,,,,,,,,...,,,,,,,,,,
mean,,,5390.375436,24698.565104,1512.3488,1466.431555,1711.577665,720.813953,225.127907,336.652941,...,1711.577665,28754.351164,28426.036471,28426.036471,,16.984071,35.247788,1505.098761,329.766372,1834.865133
std,,,20434.225767,48083.57659,7667.013322,3100.638022,7001.974126,1157.295186,222.239205,445.080255,...,7001.974126,64138.249628,63862.788101,63862.788101,,66.684907,235.416836,4360.269069,1004.373163,4888.489829
min,,,0.0,0.0,0.0,2.0,0.0,2.0,6.0,2.0,...,0.0,0.0,0.0,0.0,,-201.0,-655.0,-5086.0,-383.0,-4714.0
25%,,,32.25,420.75,6.0,74.25,57.0,74.0,43.0,26.0,...,57.0,292.0,264.5,264.5,,0.0,0.0,9.0,5.0,32.0
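
Rather than comparing the two tables by eye, the same check could be scripted. The sketch below runs on made-up values and assumes that matching count, minimum, maximum and mean are enough to flag a faulty conversion; `stats_match` is a hypothetical helper, not part of the original notebook.

```python
import math

import numpy as np
import pandas as pd


def stats_match(original, converted):
    """Compare count, min, max and mean of two Series, ignoring missing values."""
    for stat in ("count", "min", "max", "mean"):
        a = float(getattr(original, stat)())
        b = float(getattr(converted, stat)())
        if not math.isclose(a, b):
            return False
    return True


# Made-up float column with one missing value, and its Int64 counterpart
o_positive = pd.Series([339.0, 6137.0, np.nan, 2829.0])
positive = o_positive.astype("Int64")
print(stats_match(o_positive, positive))
```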


#### Check duplicates

There are no perfectly duplicated rows. 

In [21]:
sum(converted_covid_df.duplicated())

0

There are no duplicates of the primary key: as expected, there is at most one row per state per day (whether the state is identified via state name or FIPS code).

In [22]:
# check duplication w/ date and state name
converted_covid_df[converted_covid_df.duplicated(subset=['date','state'], keep=False)]

Unnamed: 0,date,state,positive,negative,pending,hospitalizedCurrently,hospitalizedCumulative,inIcuCurrently,inIcuCumulative,onVentilatorCurrently,...,hospitalized,total,totalTestResults,posNeg,fips,deathIncrease,hospitalizedIncrease,negativeIncrease,positiveIncrease,totalTestResultsIncrease


In [23]:
# check duplication w/ date and state fips code
converted_covid_df[converted_covid_df.duplicated(subset=['date','fips'], keep=False)]

Unnamed: 0,date,state,positive,negative,pending,hospitalizedCurrently,hospitalizedCumulative,inIcuCurrently,inIcuCumulative,onVentilatorCurrently,...,hospitalized,total,totalTestResults,posNeg,fips,deathIncrease,hospitalizedIncrease,negativeIncrease,positiveIncrease,totalTestResultsIncrease
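
The two checks above return empty frames, which has to be read off the output. This small sketch, assuming a hypothetical helper `is_unique_key`, turns the same check into a boolean that can be asserted. The rows are made up for illustration.

```python
import pandas as pd


def is_unique_key(df, key_columns):
    """Return True if no two rows share the same values in key_columns."""
    return not df.duplicated(subset=key_columns).any()


# Made-up rows for illustration
sample = pd.DataFrame({
    "date": pd.to_datetime(["2020-04-25", "2020-04-25", "2020-04-24"]),
    "state": ["AK", "AL", "AK"],
    "fips": [2, 1, 2],
})
print(is_unique_key(sample, ["date", "state"]))  # True
print(is_unique_key(sample, ["state"]))          # False
```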


#### Recount records

In [24]:
# count rows 
covid_df.shape[0] == converted_covid_df.shape[0]

True

### Pickle data 

In [25]:
with open("../Data_pkl/covid19/covidtrackingproject_df.pkl", "wb") as f:
    pickle.dump(converted_covid_df, f)
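
To confirm that the converted dtypes survive the export, the pickle can be read back and inspected. This sketch uses a temporary directory and a made-up frame rather than the project path above.

```python
import os
import pickle
import tempfile

import pandas as pd

# Made-up frame with the two converted dtypes used in this notebook
sample = pd.DataFrame({
    "state": pd.Series(["AK", "AL"], dtype="category"),
    "positive": pd.Series([339, None], dtype="Int64"),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_df.pkl")
    with open(path, "wb") as f:
        pickle.dump(sample, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# category and Int64 are preserved, which a CSV round trip would not do
print(restored.dtypes)
```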
