# COVID-19 MIDS Collaboration 

## Data Sourcing: New York Times COVID-19 Data

This Jupyter Notebook reads in raw data as a CSV file from a website and exports it as [pickle files for faster loading](https://medium.com/better-programming/load-fast-load-big-with-compressed-pickles-5f311584507e). 

This code was adapted from a script provided to us by Professor Kevin Crook of the Berkeley MIDS program during our W205 (Data Engineering) class. 

### Data sources

US COVID-19 data (cumulative, at county level) from the New York Times covid-19 GitHub repo: https://github.com/nytimes/covid-19-data

### Set up environment 

In [1]:
# Import packages
import pandas as pd
import numpy as np
import io
import requests
import pickle 

### Retrieve data

In [2]:
# get data at URL - this URL is for the NYT cumulative county-level data
r = requests.get("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv")

In [3]:
# check HTTP request status
r.status_code

200

In [4]:
# just show the first 2000 characters, the text is really long otherwise
r.text[0:2000]

'date,county,state,fips,cases,deaths\n2020-01-21,Snohomish,Washington,53061,1,0\n2020-01-22,Snohomish,Washington,53061,1,0\n2020-01-23,Snohomish,Washington,53061,1,0\n2020-01-24,Cook,Illinois,17031,1,0\n2020-01-24,Snohomish,Washington,53061,1,0\n2020-01-25,Orange,California,06059,1,0\n2020-01-25,Cook,Illinois,17031,1,0\n2020-01-25,Snohomish,Washington,53061,1,0\n2020-01-26,Maricopa,Arizona,04013,1,0\n2020-01-26,Los Angeles,California,06037,1,0\n2020-01-26,Orange,California,06059,1,0\n2020-01-26,Cook,Illinois,17031,1,0\n2020-01-26,Snohomish,Washington,53061,1,0\n2020-01-27,Maricopa,Arizona,04013,1,0\n2020-01-27,Los Angeles,California,06037,1,0\n2020-01-27,Orange,California,06059,1,0\n2020-01-27,Cook,Illinois,17031,1,0\n2020-01-27,Snohomish,Washington,53061,1,0\n2020-01-28,Maricopa,Arizona,04013,1,0\n2020-01-28,Los Angeles,California,06037,1,0\n2020-01-28,Orange,California,06059,1,0\n2020-01-28,Cook,Illinois,17031,1,0\n2020-01-28,Snohomish,Washington,53061,1,0\n2020-01-29,Maricopa,Arizon

In [5]:
# load into a Pandas dataframe
covid_df = pd.read_csv(io.StringIO(r.text)).add_prefix('o_') 

covid_df

Unnamed: 0,o_date,o_county,o_state,o_fips,o_cases,o_deaths
0,2020-01-21,Snohomish,Washington,53061.0,1,0
1,2020-01-22,Snohomish,Washington,53061.0,1,0
2,2020-01-23,Snohomish,Washington,53061.0,1,0
3,2020-01-24,Cook,Illinois,17031.0,1,0
4,2020-01-24,Snohomish,Washington,53061.0,1,0
...,...,...,...,...,...,...
89767,2020-04-25,Sublette,Wyoming,56035.0,1,0
89768,2020-04-25,Sweetwater,Wyoming,56037.0,10,0
89769,2020-04-25,Teton,Wyoming,56039.0,64,1
89770,2020-04-25,Uinta,Wyoming,56041.0,6,0


### Convert data types

#### Count records

In [6]:
# count rows and columns
covid_df.shape

(89772, 6)

#### Check and convert datatypes

We have:

* 1 string column tracking the date (which needs to be converted to a date object)
* 2 string columns tracking location (county, state) 
* 1 float column tracking location (fips), which should be converted into a category; we should also add two columns that hold the state and county portions of the FIPS code separately
* 2 integer columns tracking COVID-19 cases and deaths

In [7]:
# check data types 
covid_df.dtypes

o_date       object
o_county     object
o_state      object
o_fips      float64
o_cases       int64
o_deaths      int64
dtype: object
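The `float64` type for the FIPS column also means leading zeros are lost (the raw text shows `06059` for Orange, California). As a minimal sketch, assuming a small inline sample in the same layout as the NYT file (the last row, with an empty `fips` field, is a hypothetical example), the column can be read as text and the date parsed at load time:

```python
import io

import pandas as pd

# small inline sample in the same layout as the NYT file; the first two rows are
# copied from the raw text above, the third row is a hypothetical example
sample_csv = (
    "date,county,state,fips,cases,deaths\n"
    "2020-01-25,Orange,California,06059,1,0\n"
    "2020-01-26,Maricopa,Arizona,04013,1,0\n"
    "2020-03-01,Unknown,Rhode Island,,2,0\n"
)

# read fips as text so leading zeros survive; parse the date column while loading
sample_df = pd.read_csv(
    io.StringIO(sample_csv),
    dtype={"fips": str},
    parse_dates=["date"],
)

print(sample_df.dtypes)
print(sample_df["fips"].tolist())
```

Reading the code as text is only one option; it avoids the float-to-integer detour but leaves the empty field as a missing value that still has to be handled.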

In [8]:
covid_df.head()

Unnamed: 0,o_date,o_county,o_state,o_fips,o_cases,o_deaths
0,2020-01-21,Snohomish,Washington,53061.0,1,0
1,2020-01-22,Snohomish,Washington,53061.0,1,0
2,2020-01-23,Snohomish,Washington,53061.0,1,0
3,2020-01-24,Cook,Illinois,17031.0,1,0
4,2020-01-24,Snohomish,Washington,53061.0,1,0


In [9]:
# create new version of column as a datetime object
covid_df["date"] = pd.to_datetime(covid_df["o_date"])
# check conversion 
covid_df.dtypes

o_date              object
o_county            object
o_state             object
o_fips             float64
o_cases              int64
o_deaths             int64
date        datetime64[ns]
dtype: object

In [10]:
# create new version of fips column as a category (must convert to integer as in-between step)
covid_df["fips"] = covid_df["o_fips"].astype('Int64').astype('category')
# check conversion 
covid_df.dtypes

o_date              object
o_county            object
o_state             object
o_fips             float64
o_cases              int64
o_deaths             int64
date        datetime64[ns]
fips              category
dtype: object

In [11]:
# visual check of column conversions - these worked 
covid_df.head()

Unnamed: 0,o_date,o_county,o_state,o_fips,o_cases,o_deaths,date,fips
0,2020-01-21,Snohomish,Washington,53061.0,1,0,2020-01-21,53061
1,2020-01-22,Snohomish,Washington,53061.0,1,0,2020-01-22,53061
2,2020-01-23,Snohomish,Washington,53061.0,1,0,2020-01-23,53061
3,2020-01-24,Cook,Illinois,17031.0,1,0,2020-01-24,17031
4,2020-01-24,Snohomish,Washington,53061.0,1,0,2020-01-24,53061


### Check missingness

We have some missing FIPS codes, but no other missing data (at least that is coded as NaNs).

In [12]:
covid_df.isnull().sum(axis = 0)

o_date         0
o_county       0
o_state        0
o_fips      1097
o_cases        0
o_deaths       0
date           0
fips        1097
dtype: int64

Three county labels account for the missing FIPS codes.

In [13]:
covid_df["o_county"][covid_df["o_fips"].isnull()].unique()

array(['New York City', 'Unknown', 'Kansas City'], dtype=object)

New York City comprises 5 counties (1 for each of the 5 boroughs); these each have their own FIPS codes: New York County (Manhattan), Kings County (Brooklyn), Bronx County (The Bronx), Richmond County (Staten Island), and Queens County (Queens). 

This is documented in the NYT GitHub repo: All cases for the five boroughs of New York City (New York, Kings, Queens, Bronx and Richmond counties) are assigned to a single area called New York City. There is a large jump in the number of deaths on April 6th due to switching from data from New York City to data from New York state for deaths. We are not currently including the probable deaths reported by New York City.

In [14]:
# show unique states for counties coded as "New York City"
covid_df["o_state"][covid_df["o_county"] == "New York City"].unique()

array(['New York'], dtype=object)

In [15]:
# show unique fips codes for counties coded as "New York City"
covid_df["o_fips"][covid_df["o_county"] == "New York City"].unique()

array([nan])

Kansas City, Missouri also straddles multiple counties, which would explain why it does not have a FIPS code. 

This is documented in the NYT GitHub repo: Four counties (Cass, Clay, Jackson and Platte) overlap the municipality of Kansas City, Mo. The cases and deaths that we show for these four counties are only for the portions exclusive of Kansas City. Cases and deaths for Kansas City are reported as their own line.

In [16]:
# show unique states for counties coded as "Kansas City"
covid_df["o_state"][covid_df["o_county"] == "Kansas City"].unique()

array(['Missouri'], dtype=object)

In [17]:
# show unique fips codes for counties coded as "Kansas City"
covid_df["o_fips"][covid_df["o_county"] == "Kansas City"].unique()

array([nan])

Other rows are missing FIPS codes because they lack a county (it is marked as "Unknown"). Some of these make sense: for example, Puerto Rico's rows carry no county, so only a state-level FIPS code could apply to them. 

This is the documentation from the NYT: Many state health departments choose to report cases separately when the patient’s county of residence is unknown or pending determination. In these instances, we record the county name as “Unknown.” As more information about these cases becomes available, the cumulative number of cases in “Unknown” counties may fluctuate.

Sometimes, cases are first reported in one county and then moved to another county. As a result, the cumulative number of cases may change for a given county.

In [18]:
# show unique states for counties coded as "Unknown"
covid_df["o_state"][covid_df["o_county"] == "Unknown"].unique()

array(['Rhode Island', 'New Jersey', 'Puerto Rico', 'Virgin Islands',
       'Guam', 'Maine', 'Massachusetts', 'Louisiana', 'Kentucky',
       'Nevada', 'Tennessee', 'Arkansas', 'Georgia', 'Missouri',
       'Minnesota', 'California', 'Colorado', 'Florida', 'Hawaii',
       'Illinois', 'Vermont', 'Idaho', 'Michigan', 'Ohio', 'Utah',
       'Mississippi', 'Northern Mariana Islands', 'Nebraska',
       'Connecticut', 'Indiana', 'Alaska', 'Arizona', 'New Mexico',
       'New York', 'Pennsylvania', 'Virginia', 'New Hampshire',
       'South Dakota', 'Washington', 'Delaware', 'Kansas', 'North Dakota',
       'Maryland', 'Iowa'], dtype=object)

In [19]:
# show unique fips codes for counties coded as "Unknown"
covid_df["o_fips"][covid_df["o_county"] == "Unknown"].unique()

array([nan])

The missingness in FIPS codes is a major issue since this is the geographic identifier we would use to join these data to other datasets. We will likely need to rectify this missingness in order to combine our data accurately. For the most part, these do not seem to be erroneous data. Rather, the missing values occur when a data point is at the city (rather than county) level and/or when no county is recorded for that region. 
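One way to make the city-level rows joinable is to give them an explicit key. The sketch below is only an illustration under stated assumptions: the placeholder codes `NYC` and `KCMO`, the `join_key` column, and the helper `fill_city_codes` are all hypothetical, are not official FIPS codes, and would need matching rows in whatever dataset is being joined.

```python
import pandas as pd

# hypothetical placeholder codes for the two city-level areas; these are NOT
# official FIPS codes and any lookup table used in a join would need matching rows
CITY_PLACEHOLDERS = {
    ("New York City", "New York"): "NYC",
    ("Kansas City", "Missouri"): "KCMO",
}

toy_df = pd.DataFrame(
    {
        "county": ["New York City", "Kansas City", "Snohomish", "Unknown"],
        "state": ["New York", "Missouri", "Washington", "Rhode Island"],
        "fips": [None, None, "53061", None],
    }
)


def fill_city_codes(df):
    """Return a copy with a join_key column: the FIPS code if present, else a city placeholder."""
    out = df.copy()
    keys = zip(out["county"], out["state"])
    placeholders = pd.Series(
        [CITY_PLACEHOLDERS.get(key) for key in keys], index=out.index
    )
    # keep the existing code where there is one, otherwise fall back to the placeholder
    out["join_key"] = out["fips"].where(out["fips"].notna(), placeholders)
    return out


filled_df = fill_city_codes(toy_df)
print(filled_df)
```

Rows with an "Unknown" county stay missing in this sketch, since there is no county to assign them to.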

In [20]:
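# replace the "Unknown" county placeholder with a missing value
# caution: with None as the value, older pandas versions fill from the previous row instead of inserting NaN;
# the summary statistics below (a county count of 89,772, i.e. no missing values) suggest that happened here
covid_df["county"] = covid_df["o_county"].replace("Unknown", None)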
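If the intent is for "Unknown" counties to become true missing values, passing `np.nan` explicitly is one unambiguous way to express it, because some pandas versions treat a `None` value as a request to fill from the previous row. A minimal sketch on a toy Series (the county names are only illustrative):

```python
import numpy as np
import pandas as pd

toy_county = pd.Series(["Providence", "Unknown", "Kent", "Unknown"])

# passing np.nan explicitly as the replacement value marks the placeholder rows as missing
cleaned_county = toy_county.replace("Unknown", np.nan)

# an equivalent formulation that does not rely on replace() at all
masked_county = toy_county.mask(toy_county == "Unknown")

print(cleaned_county.isna().sum())
```

A quick `isna().sum()` after the replacement is a cheap check that the placeholder rows really ended up missing.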

### Generate summary statistics

The minimum and maximum values look reasonable for COVID-19 counts. The distributions are heavily skewed but plausible: there are clear outliers on the high end (New York is likely one of them). 

We have no missing data in the text columns. We have data for 96 unique days, in 55 "states" (so the data also include territories), and 1,672 counties. The original county column has 1,673 unique values because it still includes the "Unknown" label.

The FIPS code column has some missing data and 2,807 unique codes. 

Our earliest date is January 21st, 2020. Our latest is April 25th, 2020. 

In [21]:
# display summary statistics
covid_df.describe(include = 'all')

Unnamed: 0,o_date,o_county,o_state,o_fips,o_cases,o_deaths,date,fips,county
count,89772,89772,89772,88675.0,89772.0,89772.0,89772,88675.0,89772
unique,96,1673,55,,,,96,2807.0,1672
top,2020-04-25,Washington,Texas,,,,2020-04-25 00:00:00,53061.0,Washington
freq,2820,1072,5665,,,,2820,96.0,1077
first,,,,,,,2020-01-21 00:00:00,,
last,,,,,,,2020-04-25 00:00:00,,
mean,,,,29773.602729,172.927104,7.172593,,,
std,,,,15466.678766,2079.570699,131.810249,,,
min,,,,1001.0,0.0,0.0,,,
25%,,,,18013.0,2.0,0.0,,,


Now that we have verified that our column conversions worked, we drop the original columns that now have converted counterparts and remove the "o_" prefix, since we no longer need to compare original and converted columns. Columns are reordered so that the date and geographic identifiers come first.

In [22]:
# drop, rename, and reorder columns
converted_covid_df = covid_df.drop(columns = ["o_date", "o_fips", "o_county"]).rename(columns = lambda x: x.replace('o_', ''))[["date", "fips", "state", "county", "cases", "deaths"]]
# visual inspection
converted_covid_df.head()

Unnamed: 0,date,fips,state,county,cases,deaths
0,2020-01-21,53061,Washington,Snohomish,1,0
1,2020-01-22,53061,Washington,Snohomish,1,0
2,2020-01-23,53061,Washington,Snohomish,1,0
3,2020-01-24,17031,Illinois,Cook,1,0
4,2020-01-24,53061,Washington,Snohomish,1,0


### Check duplicates

There are no perfectly duplicated rows. 

In [23]:
sum(converted_covid_df.duplicated())

0

In [24]:
# create column to store county-state pairs
converted_covid_df["county_state"] = (converted_covid_df["county"] + ", " + converted_covid_df["state"])
# count unique county-state pairs
len(converted_covid_df["county_state"].unique())

2820

Suffolk, Massachusetts appears the most frequently, which is unexpected. It has 30 more entries (126) than there are days in the dataset (96). The next county is Snohomish, Washington, with 96 entries, which corresponds to one record per day. Some areas appear only once. 

In [25]:
converted_covid_df["county_state"].value_counts()

Suffolk, Massachusetts    126
Snohomish, Washington      96
Cook, Illinois             93
Union, New Jersey          93
Orange, California         92
                         ... 
San Juan, Colorado          1
Pembina, North Dakota       1
Gilmer, West Virginia       1
Hidalgo, New Mexico         1
Butte, South Dakota         1
Name: county_state, Length: 2820, dtype: int64

Suffolk, Massachusetts has multiple rows for some dates: its 126 rows cover only 85 unique dates.

In [26]:
len(converted_covid_df["date"][converted_covid_df["county_state"] == "Suffolk, Massachusetts"].unique())

85

We can see that the maximum number of unique dates belongs to Snohomish, Washington. We will need to flatten our file so that there is one row per date for each county-state pair. 

In [27]:
# show first 5 rows of dataframe
converted_covid_df.groupby('county_state').date.nunique().reset_index().sort_values('date', ascending = 0).head(5)

Unnamed: 0,county_state,date
2342,"Snohomish, Washington",96
567,"Cook, Illinois",93
1917,"Orange, California",92
1518,"Los Angeles, California",91
1586,"Maricopa, Arizona",91


In [28]:
# show last 5 rows of dataframe
converted_covid_df.groupby('county_state').date.nunique().reset_index().sort_values('date', ascending = 0).tail(5)

Unnamed: 0,county_state,date
1118,"Hidalgo, New Mexico",1
662,"Day, South Dakota",1
658,"Dawes, Nebraska",1
1569,"Madison, North Carolina",1
928,"Gilmer, West Virginia",1


In [29]:
converted_covid_df[converted_covid_df.duplicated(subset=['county_state', 'date'], keep=False)]

Unnamed: 0,date,fips,state,county,cases,deaths,county_state
1510,2020-03-12,34039,New Jersey,Union,1,0,"Union, New Jersey"
1511,2020-03-12,,New Jersey,Union,1,0,"Union, New Jersey"
1803,2020-03-13,34039,New Jersey,Union,1,0,"Union, New Jersey"
1804,2020-03-13,,New Jersey,Union,1,0,"Union, New Jersey"
2150,2020-03-14,34039,New Jersey,Union,1,0,"Union, New Jersey"
...,...,...,...,...,...,...,...
89042,2020-04-25,,Rhode Island,Providence,1198,96,"Providence, Rhode Island"
89448,2020-04-25,49047,Utah,Uintah,6,0,"Uintah, Utah"
89449,2020-04-25,,Utah,Uintah,0,1,"Uintah, Utah"
89464,2020-04-25,50021,Vermont,Rutland,44,1,"Rutland, Vermont"


If each duplicated county-state and date pair has exactly one extra row, we would expect to drop the following number of rows when we flatten:

In [30]:
converted_covid_df[converted_covid_df.duplicated(subset=['county_state', 'date'], keep=False)].shape[0]/2

822.0
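Dividing by two assumes every duplicated pair has exactly two rows. As a hedged alternative, `duplicated` with `keep="first"` flags only the surplus rows, which gives the count directly even when a pair appears three or more times. The toy frame below is made up for illustration:

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "county_state": ["A, X", "A, X", "A, X", "B, Y", "B, Y"],
        "date": pd.to_datetime(
            ["2020-03-12", "2020-03-12", "2020-03-12", "2020-03-12", "2020-03-13"]
        ),
        "cases": [1, 1, 2, 5, 6],
    }
)

key = ["county_state", "date"]

# keep=False flags every row that belongs to a duplicated group (3 rows here)
rows_in_duplicated_groups = toy_df.duplicated(subset=key, keep=False).sum()

# keep="first" flags only the surplus rows that flattening would remove (2 here)
surplus_rows = toy_df.duplicated(subset=key, keep="first").sum()

print(rows_in_duplicated_groups, surplus_rows)
```

In this toy case the halving estimate would give 1.5, while two rows would actually be removed.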

We can see that there are discrepancies in case counts on some days in some counties. It is unclear whether we should add these counts together or whether only one of the rows is accurate. 

In [31]:
# pull out rows for Thurston County, Washington
converted_covid_df[(converted_covid_df['county'] == 'Thurston') & (converted_covid_df['state'] == 'Washington')]

Unnamed: 0,date,fips,state,county,cases,deaths,county_state
1356,2020-03-11,53067.0,Washington,Thurston,1,0,"Thurston, Washington"
1610,2020-03-12,53067.0,Washington,Thurston,1,0,"Thurston, Washington"
1921,2020-03-13,53067.0,Washington,Thurston,3,0,"Thurston, Washington"
2295,2020-03-14,53067.0,Washington,Thurston,3,0,"Thurston, Washington"
2724,2020-03-15,53067.0,Washington,Thurston,4,0,"Thurston, Washington"
3198,2020-03-16,53067.0,Washington,Thurston,4,0,"Thurston, Washington"
3740,2020-03-17,53067.0,Washington,Thurston,5,0,"Thurston, Washington"
4375,2020-03-18,53067.0,Washington,Thurston,6,0,"Thurston, Washington"
5132,2020-03-19,53067.0,Washington,Thurston,6,0,"Thurston, Washington"
6036,2020-03-20,53067.0,Washington,Thurston,8,0,"Thurston, Washington"


### Convert to appropriate level

Here, we flatten the file so that there is one row per county-state pair for each day. For each pair and day, we collect the unique values of the FIPS code, cases, and deaths into new columns. 

In [32]:
# create new dataframe flattened to county-state pair and date level 
flattened_covid_df = converted_covid_df.groupby(['county_state', 'date']).agg(
    {
     'state': 'first',
     'county': 'first',
     'fips': np.unique,
     'cases':np.unique, 
     'deaths':np.unique
    }).reset_index()

In [33]:
flattened_covid_df.head()

Unnamed: 0,county_state,date,state,county,fips,cases,deaths
0,"Abbeville, South Carolina",2020-03-19,South Carolina,Abbeville,[45001],1,0
1,"Abbeville, South Carolina",2020-03-20,South Carolina,Abbeville,[45001],1,0
2,"Abbeville, South Carolina",2020-03-21,South Carolina,Abbeville,[45001],1,0
3,"Abbeville, South Carolina",2020-03-22,South Carolina,Abbeville,[45001],1,0
4,"Abbeville, South Carolina",2020-03-23,South Carolina,Abbeville,[45001],1,0


We can see that for counties with discrepancies in case counts, the different case counts are put into a list. These may need to be cleaned manually.

In [34]:
flattened_covid_df[(flattened_covid_df['county'] == 'Thurston') & (flattened_covid_df['state'] == 'Washington')]

Unnamed: 0,county_state,date,state,county,fips,cases,deaths
78468,"Thurston, Washington",2020-03-11,Washington,Thurston,[53067],1,0
78469,"Thurston, Washington",2020-03-12,Washington,Thurston,[53067],1,0
78470,"Thurston, Washington",2020-03-13,Washington,Thurston,[53067],3,0
78471,"Thurston, Washington",2020-03-14,Washington,Thurston,[53067],3,0
78472,"Thurston, Washington",2020-03-15,Washington,Thurston,[53067],4,0
78473,"Thurston, Washington",2020-03-16,Washington,Thurston,[53067],4,0
78474,"Thurston, Washington",2020-03-17,Washington,Thurston,[53067],5,0
78475,"Thurston, Washington",2020-03-18,Washington,Thurston,[53067],6,0
78476,"Thurston, Washington",2020-03-19,Washington,Thurston,[53067],6,0
78477,"Thurston, Washington",2020-03-20,Washington,Thurston,[53067],8,0
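To find the county-state pairs and dates that need manual cleaning, one option is to count distinct values per group before flattening. This is a sketch on made-up rows, not the notebook's own method; the first two rows mimic a pair whose duplicate rows disagree.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "county_state": ["A, X", "A, X", "B, Y", "B, Y", "C, Z"],
        "date": pd.to_datetime(["2020-04-25"] * 5),
        "cases": [6, 0, 44, 44, 10],
        "deaths": [0, 1, 1, 1, 0],
    }
)

# number of distinct values per county-state pair and date
distinct = toy_df.groupby(["county_state", "date"])[["cases", "deaths"]].nunique()

# keep only the groups where either column holds more than one distinct value
conflicts = distinct[(distinct["cases"] > 1) | (distinct["deaths"] > 1)]

print(conflicts)
```

Only the "A, X" group is reported here; "B, Y" is duplicated but its rows agree, so nothing needs to be reconciled.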


The fips column can be split into a column of fips and a column of NAs because some rows had the county-state identifier and date but did not have a fips code.

In [35]:
# split columns
flattened_covid_df[['fips','fips0']] = pd.DataFrame(flattened_covid_df.fips.values.tolist(), index= flattened_covid_df.index)

# change back data type of fips column
flattened_covid_df['fips'] = flattened_covid_df['fips'].astype('Int64').astype('category')

# check that new column is only NAs 
sum(flattened_covid_df['fips0'].notna())

0

In [36]:
# create new version of fips column as string 
flattened_covid_df["fips_str"] = flattened_covid_df["fips"].astype('str')

In [37]:
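# extract state FIPS and convert to category 
# caution: the output below shows values such as "4500", so the string form appears to end in ".0" and x[:-3] does not isolate the 2-digit state code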
flattened_covid_df["state_fips"] = flattened_covid_df["fips_str"].apply(lambda x: x[:-3]).astype('category')
flattened_covid_df["state_fips"].head()

0    4500
1    4500
2    4500
3    4500
4    4500
Name: state_fips, dtype: category
Categories (682, object): [, 100, 1000, 101, ..., 811, 812, 900, 901]

In [38]:
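# extract county FIPS and convert to category 
# caution: the output below shows values such as "1.0" rather than 3-digit county codes such as "001"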
flattened_covid_df["county_fips"] = flattened_covid_df["fips_str"].apply(lambda x: x[-3:]).astype('category')
flattened_covid_df["county_fips"].head()

0    1.0
1    1.0
2    1.0
3    1.0
4    1.0
Name: county_fips, dtype: category
Categories (11, object): [0.0, 1.0, 2.0, 3.0, ..., 7.0, 8.0, 9.0, nan]
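The values `4500` and `1.0` above suggest the string form of the code ends in `.0`, so slicing three characters from the right does not separate state from county. A county FIPS code has 2 state digits followed by 3 county digits, so a minimal sketch of the split, using a hypothetical helper `split_fips`, is to zero-pad to five characters first:

```python
def split_fips(code):
    """Split a county FIPS code into (state_fips, county_fips) strings.

    Accepts an int, a float such as 45001.0, or a numeric string.
    Returns (None, None) for None or float NaN.
    Assumes the standard 5-digit layout: 2 state digits + 3 county digits.
    """
    if code is None or code != code:  # the second test catches float NaN
        return (None, None)
    padded = str(int(float(code))).zfill(5)  # e.g. 6059.0 -> "06059"
    return (padded[:2], padded[2:])


print(split_fips(45001.0))
print(split_fips("06059"))
print(split_fips(float("nan")))
```

Applied to a column, this could be used with `Series.apply`, but keeping the code as a zero-padded string from the start would make the helper unnecessary.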

In [39]:
# drop superfluous, newly added columns
flattened_covid_df = flattened_covid_df.drop(columns = ['fips0', 'fips_str'])

In [40]:
# recheck data types
flattened_covid_df.dtypes

county_state            object
date            datetime64[ns]
state                   object
county                  object
fips                  category
cases                   object
deaths                  object
state_fips            category
county_fips           category
dtype: object
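The `cases` and `deaths` columns are now `object` because each cell holds the result of `np.unique`. If numeric columns are wanted, a scalar aggregation keeps the integer dtype. The sketch below uses the maximum, which is an assumption for illustration only: whether the maximum, the sum, or manual review is appropriate depends on why the duplicate rows differ.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "county_state": ["A, X", "A, X", "B, Y"],
        "date": pd.to_datetime(["2020-04-25", "2020-04-25", "2020-04-25"]),
        "fips": ["49047", None, "50021"],
        "cases": [6, 0, 44],
        "deaths": [0, 1, 1],
    }
)

# ASSUMPTION: taking the maximum is one possible rule for reconciling duplicate rows;
# "first" picks the first non-missing FIPS code within each group
flat_df = toy_df.groupby(["county_state", "date"], as_index=False).agg(
    fips=("fips", "first"),
    cases=("cases", "max"),
    deaths=("deaths", "max"),
)

print(flat_df.dtypes)
print(flat_df)
```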

We can see that flattening the file dropped the number of rows we expected (822). 

In [41]:
# check new shape
covid_df.shape[0] - flattened_covid_df.shape[0]


822

In [42]:
# inspect new dataframe 
flattened_covid_df.head()

Unnamed: 0,county_state,date,state,county,fips,cases,deaths,state_fips,county_fips
0,"Abbeville, South Carolina",2020-03-19,South Carolina,Abbeville,45001,1,0,4500,1.0
1,"Abbeville, South Carolina",2020-03-20,South Carolina,Abbeville,45001,1,0,4500,1.0
2,"Abbeville, South Carolina",2020-03-21,South Carolina,Abbeville,45001,1,0,4500,1.0
3,"Abbeville, South Carolina",2020-03-22,South Carolina,Abbeville,45001,1,0,4500,1.0
4,"Abbeville, South Carolina",2020-03-23,South Carolina,Abbeville,45001,1,0,4500,1.0


In [43]:
# reorder columns
flattened_covid_df = flattened_covid_df[['date', 'county_state', 'state', 'county', 'fips', 'state_fips', 'county_fips', 
                                       'cases', 'deaths']]
flattened_covid_df.head()

Unnamed: 0,date,county_state,state,county,fips,state_fips,county_fips,cases,deaths
0,2020-03-19,"Abbeville, South Carolina",South Carolina,Abbeville,45001,4500,1.0,1,0
1,2020-03-20,"Abbeville, South Carolina",South Carolina,Abbeville,45001,4500,1.0,1,0
2,2020-03-21,"Abbeville, South Carolina",South Carolina,Abbeville,45001,4500,1.0,1,0
3,2020-03-22,"Abbeville, South Carolina",South Carolina,Abbeville,45001,4500,1.0,1,0
4,2020-03-23,"Abbeville, South Carolina",South Carolina,Abbeville,45001,4500,1.0,1,0


In [44]:
# check Thurston, Washington again
flattened_covid_df[(flattened_covid_df['county'] == 'Thurston') & (flattened_covid_df['state'] == 'Washington')]

Unnamed: 0,date,county_state,state,county,fips,state_fips,county_fips,cases,deaths
78468,2020-03-11,"Thurston, Washington",Washington,Thurston,53067,5306,7.0,1,0
78469,2020-03-12,"Thurston, Washington",Washington,Thurston,53067,5306,7.0,1,0
78470,2020-03-13,"Thurston, Washington",Washington,Thurston,53067,5306,7.0,3,0
78471,2020-03-14,"Thurston, Washington",Washington,Thurston,53067,5306,7.0,3,0
78472,2020-03-15,"Thurston, Washington",Washington,Thurston,53067,5306,7.0,4,0
78473,2020-03-16,"Thurston, Washington",Washington,Thurston,53067,5306,7.0,4,0
78474,2020-03-17,"Thurston, Washington",Washington,Thurston,53067,5306,7.0,5,0
78475,2020-03-18,"Thurston, Washington",Washington,Thurston,53067,5306,7.0,6,0
78476,2020-03-19,"Thurston, Washington",Washington,Thurston,53067,5306,7.0,6,0
78477,2020-03-20,"Thurston, Washington",Washington,Thurston,53067,5306,7.0,8,0


In [45]:
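# expand the cases column into one column per value to see whether any row holds more than one case count
pd.DataFrame(flattened_covid_df.cases.values.tolist(), index= flattened_covid_df.index)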


Unnamed: 0,0
0,1
1,1
2,1
3,1
4,1
...,...
88945,1
88946,1
88947,1
88948,1


### Pickle data

In [46]:
# pickle both flattened & unflattened version (the context managers close each file after writing)
with open("../Data_pkl/covid19/nyt_converted_covid_df.pkl", "wb") as f:
    pickle.dump(converted_covid_df, f)
with open("../Data_pkl/covid19/nyt_flattened_covid_df.pkl", "wb") as f:
    pickle.dump(flattened_covid_df, f)
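A downstream notebook would read these files back. As a small sketch of the round trip, assuming a toy dataframe and a temporary directory standing in for the `../Data_pkl/covid19/` folder, pandas' own `to_pickle` and `read_pickle` preserve the converted data types:

```python
import os
import tempfile

import pandas as pd

toy_df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-03-19", "2020-03-20"]),
        "fips": pd.Series(["45001", "45001"], dtype="category"),
        "cases": [1, 1],
    }
)

# write to a temporary directory (a stand-in for the real output folder)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_covid_df.pkl")
    toy_df.to_pickle(path)
    restored_df = pd.read_pickle(path)

# dtypes such as datetime64 and category survive the round trip
print(restored_df.dtypes)
print(restored_df.equals(toy_df))
```

This is the main practical advantage over re-reading the CSV, where the date and category conversions would have to be repeated.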