# COVID-19 MIDS Collaboration 

## Data Import: COVID-19 Data

This Jupyter Notebook reads in raw data as csv files and exports them as [pickle files for faster loading](https://medium.com/better-programming/load-fast-load-big-with-compressed-pickles-5f311584507e). 

### Data sources

COVID-19 data: 

### Set up environment 

In [1]:
# Import packages
import pandas as pd # to read in csv
import numpy as np # to process data
import pickle # to pickle files


Here, we import our dataset into a pandas dataframe and add an "o_" prefix to all columns to mark them as the originals. 

In [2]:
#Read in data & add "o_" prefix to demarcate which columns were the originals 
# covid cases data
covid_df = pd.read_csv("../Data_raw/us-counties.csv").add_prefix('o_') 
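As a possible alternative, the FIPS codes could be kept as text at read time so they never pass through a float representation. The sketch below is a minimal illustration on an in-memory toy CSV, not the notebook's actual import. The rows, the leading zero in the Alabama code, and the name `toy_df` are assumptions made for the example.

```python
import io

import pandas as pd

# Toy stand-in for the raw file (illustrative rows, not the real data)
raw_csv = io.StringIO(
    "date,county,state,fips,cases,deaths\n"
    "2020-01-21,Snohomish,Washington,53061,1,0\n"
    "2020-03-01,New York City,New York,,1,0\n"
    "2020-03-13,Autauga,Alabama,01001,1,0\n"
)

# Read fips as text (keeps any leading zeros) and parse dates on import
toy_df = pd.read_csv(
    raw_csv, dtype={"fips": str}, parse_dates=["date"]
).add_prefix("o_")

print(toy_df.dtypes)
```

Reading the codes as strings leaves missing values as NaN, so the missingness checks later in the notebook would still apply.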


### Inspect data

#### Count records

Our dataset tracking COVID-19 cases and deaths by county has 61,971 rows and 6 columns. 

In [3]:
# count rows and columns
covid_df.shape

(61971, 6)

#### Check and convert datatypes

We have:

* one string column tracking the date (which needs to be converted to a date object)
* 3 columns tracking location 
    * county and state are correctly stored as strings
    * fips is a float and should be converted to a category - it should also be split into county and state codes 
* 2 integer columns tracking COVID-19 cases and deaths

In [4]:
# check data types 
covid_df.dtypes

o_date       object
o_county     object
o_state      object
o_fips      float64
o_cases       int64
o_deaths      int64
dtype: object

In [5]:
covid_df.head()

Unnamed: 0,o_date,o_county,o_state,o_fips,o_cases,o_deaths
0,2020-01-21,Snohomish,Washington,53061.0,1,0
1,2020-01-22,Snohomish,Washington,53061.0,1,0
2,2020-01-23,Snohomish,Washington,53061.0,1,0
3,2020-01-24,Cook,Illinois,17031.0,1,0
4,2020-01-24,Snohomish,Washington,53061.0,1,0


In [6]:
# create new version of column as a datetime object
covid_df["date"] = pd.to_datetime(covid_df["o_date"])
# check conversion 
covid_df.dtypes

o_date              object
o_county            object
o_state             object
o_fips             float64
o_cases              int64
o_deaths             int64
date        datetime64[ns]
dtype: object

In [7]:
# create new version of fips column as a category
covid_df["fips"] = covid_df["o_fips"].astype('category')
# check conversion 
covid_df.dtypes

o_date              object
o_county            object
o_state             object
o_fips             float64
o_cases              int64
o_deaths             int64
date        datetime64[ns]
fips              category
dtype: object

In [8]:
# visual check of column conversions - these worked 
covid_df.head()

Unnamed: 0,o_date,o_county,o_state,o_fips,o_cases,o_deaths,date,fips
0,2020-01-21,Snohomish,Washington,53061.0,1,0,2020-01-21,53061.0
1,2020-01-22,Snohomish,Washington,53061.0,1,0,2020-01-22,53061.0
2,2020-01-23,Snohomish,Washington,53061.0,1,0,2020-01-23,53061.0
3,2020-01-24,Cook,Illinois,17031.0,1,0,2020-01-24,17031.0
4,2020-01-24,Snohomish,Washington,53061.0,1,0,2020-01-24,53061.0
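The visual check above could be complemented by a programmatic one. The following is a minimal sketch on toy data, assuming the original dates are ISO-formatted strings and that there are no missing values in the compared columns (NaN never equals NaN, so missing FIPS codes would need separate handling). The names `toy_df`, `dates_match`, and `fips_match` are illustrative.

```python
import pandas as pd

# Toy frame mimicking the original columns (illustrative values)
toy_df = pd.DataFrame(
    {"o_date": ["2020-01-21", "2020-01-24"], "o_fips": [53061.0, 17031.0]}
)

# Same conversions as in the notebook
toy_df["date"] = pd.to_datetime(toy_df["o_date"])
toy_df["fips"] = toy_df["o_fips"].astype("category")

# Round-trip each converted column and compare it with its original
dates_match = (toy_df["date"].dt.strftime("%Y-%m-%d") == toy_df["o_date"]).all()
fips_match = (toy_df["fips"].astype("float64") == toy_df["o_fips"]).all()

print(dates_match, fips_match)
```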


### Check missingness

We have 807 missing FIPS codes, but no other missing data (at least that is coded as NaNs).

In [9]:
covid_df.isnull().sum(axis = 0)

o_date        0
o_county      0
o_state       0
o_fips      807
o_cases       0
o_deaths      0
date          0
fips        807
dtype: int64

Three county labels have missing FIPS codes.

In [10]:
covid_df["o_county"][covid_df["o_fips"].isnull()].unique()

array(['New York City', 'Unknown', 'Kansas City'], dtype=object)
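To see how the missing codes are distributed across these labels, the rows with a missing FIPS code can be tabulated by county. This is a small sketch on toy data. The counts are made up for illustration, and `missing_by_county` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy rows: four with a missing FIPS code, one with a valid code
toy_df = pd.DataFrame(
    {
        "o_county": ["New York City", "Unknown", "Unknown", "Kansas City", "Cook"],
        "o_fips": [np.nan, np.nan, np.nan, np.nan, 17031.0],
    }
)

# Count rows with a missing FIPS code for each county label
missing_by_county = toy_df.loc[toy_df["o_fips"].isna(), "o_county"].value_counts()

print(missing_by_county)
```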

New York City comprises 5 counties (1 for each of the 5 boroughs); these each have their own FIPS codes: New York County (Manhattan), Kings County (Brooklyn), Bronx County (The Bronx), Richmond County (Staten Island), and Queens County (Queens). 

In [11]:
# show unique state for counties coded as "New York City"
covid_df["o_state"][covid_df["o_county"] == "New York City"].unique()

array(['New York'], dtype=object)

In [12]:
# show unique fips codes for counties coded as "New York City"
covid_df["o_fips"][covid_df["o_county"] == "New York City"].unique()

array([nan])
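For reference, the five borough counties mentioned above can be written down as a small lookup. The FIPS codes below are the standard county codes, added here for illustration. `nyc_borough_fips` is a hypothetical name, and how (or whether) to distribute the city-level counts across these counties is left open.

```python
# Standard 5-digit county FIPS codes for the five NYC boroughs
# (state prefix 36 = New York); added for illustration only
nyc_borough_fips = {
    "Bronx County (The Bronx)": "36005",
    "Kings County (Brooklyn)": "36047",
    "New York County (Manhattan)": "36061",
    "Queens County (Queens)": "36081",
    "Richmond County (Staten Island)": "36085",
}

for county_name, fips_code in nyc_borough_fips.items():
    print(fips_code, county_name)
```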

Kansas City, Missouri, also straddles multiple counties, which would explain why it does not have a FIPS code. 

In [13]:
# show unique states for counties coded as "Kansas City"
covid_df["o_state"][covid_df["o_county"] == "Kansas City"].unique()

array(['Missouri'], dtype=object)

In [14]:
# show unique fips codes for counties coded as "Kansas City"
covid_df["o_fips"][covid_df["o_county"] == "Kansas City"].unique()

array([nan])

Other rows are missing FIPS codes because they lack a county (it is marked as "Unknown"). Some of these make sense: for example, Puerto Rico appears to be reported only at the territory level, so its rows carry no county. 

In [15]:
# show unique states for counties coded as "Unknown"
covid_df["o_state"][covid_df["o_county"] == "Unknown"].unique()

array(['Rhode Island', 'New Jersey', 'Puerto Rico', 'Virgin Islands',
       'Guam', 'Maine', 'Massachusetts', 'Louisiana', 'Kentucky',
       'Nevada', 'Tennessee', 'Arkansas', 'Georgia', 'Missouri',
       'Minnesota', 'California', 'Colorado', 'Florida', 'Hawaii',
       'Illinois', 'Vermont', 'Idaho', 'Michigan', 'Ohio', 'Utah',
       'Mississippi', 'Northern Mariana Islands', 'Nebraska',
       'Connecticut', 'South Dakota', 'Indiana', 'Alaska', 'Arizona',
       'New Mexico', 'New York', 'Pennsylvania', 'Virginia',
       'New Hampshire', 'Washington', 'Delaware', 'Kansas',
       'North Dakota', 'Maryland'], dtype=object)

In [16]:
# show unique fips codes for counties coded as "Unknown"
covid_df["o_fips"][covid_df["o_county"] == "Unknown"].unique()

array([nan])

The missingness in FIPS codes is a major issue since this is the geographic identifier we would use to join these data to other datasets. We will likely need to rectify this missingness in order to combine our data accurately. For the most part, these don't seem to be erroneous data. Rather, the missing points reflect cases where a data point is at the city (rather than county) level or where no county is recorded for that region. 

In [17]:
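# replace the "Unknown" county label with a missing value
# (caution: in some pandas versions, replace(x, None) forward-fills instead of inserting a missing value)
covid_df["county"] = covid_df["o_county"].replace("Unknown", None)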
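One caveat about the cell above: in some older pandas releases, calling `replace` with a scalar and `None` as the value falls back to forward-filling, so "Unknown" is replaced by the previous row's county rather than by a missing value. That would be consistent with the non-null county count of 61,971 in the summary statistics. A minimal sketch of an unambiguous alternative, using `np.nan` on a toy series (the names `o_county` and `county` here are local to the example):

```python
import numpy as np
import pandas as pd

# Toy county labels (illustrative values)
o_county = pd.Series(["Snohomish", "Unknown", "Cook"])

# np.nan is an explicit replacement value, so no fill method is involved
county = o_county.replace("Unknown", np.nan)
# An equivalent alternative: o_county.mask(o_county == "Unknown")

print(county.isna().sum())
```

With an explicit `np.nan`, the "Unknown" rows become missing values instead of inheriting a neighbouring county name.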

### Generate summary statistics

The minimum and maximum values look reasonable for COVID-19 case and death counts. The distributions are heavily right-skewed but plausible: we clearly have some outliers on the high end (New York is likely one of those). 

We have no missing data in the text columns. We have data for 86 unique days, in 55 "states" (so the data also include territories), and 1,627 counties. In the original county column, we see there are 1,628 unique values since the NaNs used to be coded as "Unknown".

We can see that we have some missing data in our FIPS codes column and that we have 2,708 unique codes. 

Our earliest date was January 21st, 2020. Our last was April 15th, 2020. 

In [18]:
# display summary statistics
covid_df.describe(include = 'all')

Unnamed: 0,o_date,o_county,o_state,o_fips,o_cases,o_deaths,date,fips,county
count,61971,61971,61971,61164.0,61971.0,61971.0,61971,61164.0,61971
unique,86,1628,55,,,,86,2708.0,1627
top,2020-04-15,Washington,Texas,,,,2020-04-15 00:00:00,53061.0,Washington
freq,2722,772,3685,,,,2722,86.0,777
first,,,,,,,2020-01-21 00:00:00,,
last,,,,,,,2020-04-15 00:00:00,,
mean,,,,29601.48898,121.685353,4.025415,,,
std,,,,15528.488936,1520.051958,82.937968,,,
min,,,,1001.0,0.0,0.0,,,
25%,,,,17179.0,2.0,0.0,,,


Now that we have verified that the column conversions worked, we drop the original columns that were superseded by converted versions and remove the "o_" prefix, since we no longer need to compare original and converted columns. Columns are reordered to match the original ordering in the dataset.

In [21]:
# drop, rename, and reorder columns
converted_covid_df = (
    covid_df.drop(columns=["o_date", "o_fips", "o_county"])
    .rename(columns=lambda x: x.replace("o_", ""))
    [["date", "fips", "county", "state", "cases", "deaths"]]
)
# visual inspection
converted_covid_df.head()

Unnamed: 0,date,fips,county,state,cases,deaths
0,2020-01-21,53061.0,Snohomish,Washington,1,0
1,2020-01-22,53061.0,Snohomish,Washington,1,0
2,2020-01-23,53061.0,Snohomish,Washington,1,0
3,2020-01-24,17031.0,Cook,Illinois,1,0
4,2020-01-24,53061.0,Snohomish,Washington,1,0


### Check duplicates

There are no perfectly duplicated rows. 

In [22]:
sum(converted_covid_df.duplicated())

0

There are 2,721 unique county-state pairs. 

In [23]:
# create column to store county-state pairs
converted_covid_df["county_state"] = (converted_covid_df["county"] + ", " + converted_covid_df["state"])
# count unique county-state pairs
len(converted_covid_df["county_state"].unique())

2721

Suffolk, Massachusetts appears the most frequently, which is unexpected. It has 20 more entries than there are days in the dataset. The next most frequent pair, Snohomish, Washington, occurs 86 times, which could represent one record per day. Some areas appear only once. 

In [32]:
converted_covid_df["county_state"].value_counts()

Suffolk, Massachusetts     106
Snohomish, Washington       86
Cook, Illinois              83
Orange, California          82
Los Angeles, California     81
                          ... 
Norman, Minnesota            1
Harrison, Ohio               1
Benzie, Michigan             1
Alcona, Michigan             1
Bosque, Texas                1
Name: county_state, Length: 2721, dtype: int64

Suffolk, Massachusetts has only 75 unique reported dates. 

In [33]:
len(converted_covid_df["date"][converted_covid_df["county_state"] == "Suffolk, Massachusetts"].unique())

75

We can see that the maximum number of unique dates is for Snohomish, Washington (86). We will need to flatten our file so that there is one row per date for each county-state pair. 

In [34]:
# show the 5 county-state pairs with the most unique dates
converted_covid_df.groupby('county_state').date.nunique().reset_index().sort_values('date', ascending = 0).head(5)

Unnamed: 0,county_state,date
2255,"Snohomish, Washington",86
548,"Cook, Illinois",83
1845,"Orange, California",82
1462,"Los Angeles, California",81
1528,"Maricopa, Arizona",81


In [35]:
# show the 5 county-state pairs with the fewest unique dates
converted_covid_df.groupby('county_state').date.nunique().reset_index().sort_values('date', ascending = 0).tail(5)

Unnamed: 0,county_state,date
1799,"Nome Census Area, Alaska",1
27,"Alcona, Michigan",1
1731,"Morrill, Nebraska",1
1802,"Norman, Minnesota",1
210,"Boundary, Idaho",1


In [36]:
converted_covid_df[converted_covid_df.duplicated(subset=['county_state', 'date'], keep=False)]

Unnamed: 0,date,fips,county,state,cases,deaths,county_state
1510,2020-03-12,34039.0,Union,New Jersey,1,0,"Union, New Jersey"
1511,2020-03-12,,Union,New Jersey,1,0,"Union, New Jersey"
1803,2020-03-13,34039.0,Union,New Jersey,1,0,"Union, New Jersey"
1804,2020-03-13,,Union,New Jersey,1,0,"Union, New Jersey"
2150,2020-03-14,34039.0,Union,New Jersey,1,0,"Union, New Jersey"
...,...,...,...,...,...,...,...
61668,2020-04-15,,Rutland,Vermont,12,0,"Rutland, Vermont"
61786,2020-04-15,51185.0,Tazewell,Virginia,4,0,"Tazewell, Virginia"
61787,2020-04-15,,Tazewell,Virginia,0,59,"Tazewell, Virginia"
61830,2020-04-15,53067.0,Thurston,Washington,82,1,"Thurston, Washington"


If these duplicated rows have about 1 extra duplication each, we would expect to drop 592 rows once we flatten. 

In [49]:
converted_covid_df[converted_covid_df.duplicated(subset=['county_state', 'date'], keep=False)].shape[0]/2

592.0
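Dividing by two assumes that every duplicated county-state/date pair has exactly two rows. A sketch of a count that does not rely on that assumption uses `keep="first"`, which flags only the extra rows. The toy data and the names `key` and `extra_rows` are illustrative.

```python
import pandas as pd

# Toy data: one key appears three times, the other keys appear once
toy_df = pd.DataFrame(
    {
        "county_state": ["Union, New Jersey"] * 3 + ["Cook, Illinois"] * 2,
        "date": pd.to_datetime(
            ["2020-03-12", "2020-03-12", "2020-03-12", "2020-03-12", "2020-03-13"]
        ),
        "cases": [1, 1, 2, 5, 6],
    }
)
key = ["county_state", "date"]

# Every row belonging to a duplicated key (what keep=False counts)
rows_in_duplicated_keys = toy_df.duplicated(subset=key, keep=False).sum()

# Only the rows beyond the first per key, i.e. the rows flattening removes
extra_rows = toy_df.duplicated(subset=key, keep="first").sum()

print(rows_in_duplicated_keys, extra_rows)
```

In this toy case, halving the `keep=False` count would give 1.5, while two rows are actually removed by flattening.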

We can see that there are discrepancies in case counts on some days in some counties. It is unclear whether we should add these counts together or whether only one of the rows is accurate. 

In [42]:
# pull out rows for Thurston County, Washington
converted_covid_df[converted_covid_df['county'] == 'Thurston']

Unnamed: 0,date,fips,county,state,cases,deaths,county_state
1356,2020-03-11,53067.0,Thurston,Washington,1,0,"Thurston, Washington"
1610,2020-03-12,53067.0,Thurston,Washington,1,0,"Thurston, Washington"
1921,2020-03-13,53067.0,Thurston,Washington,3,0,"Thurston, Washington"
2295,2020-03-14,53067.0,Thurston,Washington,3,0,"Thurston, Washington"
2724,2020-03-15,53067.0,Thurston,Washington,4,0,"Thurston, Washington"
3198,2020-03-16,53067.0,Thurston,Washington,4,0,"Thurston, Washington"
3740,2020-03-17,53067.0,Thurston,Washington,5,0,"Thurston, Washington"
4375,2020-03-18,53067.0,Thurston,Washington,6,0,"Thurston, Washington"
5132,2020-03-19,53067.0,Thurston,Washington,6,0,"Thurston, Washington"
6036,2020-03-20,53067.0,Thurston,Washington,8,0,"Thurston, Washington"


### Convert to appropriate level

Here, we flatten the file so that there is one row per county-state pair for each day. For each pair and day, we keep the first state and county values and collect the unique FIPS, cases, and deaths values; where the duplicated rows agree, this yields a single value. 

In [24]:
# create new dataframe flattened to county-state pair and date level 
flattened_covid_df = converted_covid_df.groupby(['county_state', 'date']).agg(
    {
     'state': 'first',
     'county': 'first',
     'fips': np.unique,
     'cases':np.unique, 
     'deaths':np.unique
    }).reset_index()

In [25]:
flattened_covid_df.head()

Unnamed: 0,county_state,date,state,county,fips,cases,deaths
0,"Abbeville, South Carolina",2020-03-19,South Carolina,Abbeville,[45001.0],1,0
1,"Abbeville, South Carolina",2020-03-20,South Carolina,Abbeville,[45001.0],1,0
2,"Abbeville, South Carolina",2020-03-21,South Carolina,Abbeville,[45001.0],1,0
3,"Abbeville, South Carolina",2020-03-22,South Carolina,Abbeville,[45001.0],1,0
4,"Abbeville, South Carolina",2020-03-23,South Carolina,Abbeville,[45001.0],1,0


We can see that for counties with discrepancies in case counts, the different case counts are put into a list. These may need to be cleaned manually.

In [43]:
flattened_covid_df[flattened_covid_df['county'] == 'Thurston']

Unnamed: 0,county_state,date,state,county,fips,cases,deaths,fips0,fips_str,state_fips,county_fips
54111,"Thurston, Washington",2020-03-11,Washington,Thurston,53067.0,1,0,,53067,53,67
54112,"Thurston, Washington",2020-03-12,Washington,Thurston,53067.0,1,0,,53067,53,67
54113,"Thurston, Washington",2020-03-13,Washington,Thurston,53067.0,3,0,,53067,53,67
54114,"Thurston, Washington",2020-03-14,Washington,Thurston,53067.0,3,0,,53067,53,67
54115,"Thurston, Washington",2020-03-15,Washington,Thurston,53067.0,4,0,,53067,53,67
54116,"Thurston, Washington",2020-03-16,Washington,Thurston,53067.0,4,0,,53067,53,67
54117,"Thurston, Washington",2020-03-17,Washington,Thurston,53067.0,5,0,,53067,53,67
54118,"Thurston, Washington",2020-03-18,Washington,Thurston,53067.0,6,0,,53067,53,67
54119,"Thurston, Washington",2020-03-19,Washington,Thurston,53067.0,6,0,,53067,53,67
54120,"Thurston, Washington",2020-03-20,Washington,Thurston,53067.0,8,0,,53067,53,67
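Before cleaning manually, it may help to list exactly which county-state/date pairs carry conflicting counts. The sketch below is one possible way to do that on toy data, counting distinct values per key. The values and the names `per_key` and `conflicts` are assumptions for the example.

```python
import pandas as pd

# Toy data: the first key has conflicting counts, the second has an exact repeat
toy_df = pd.DataFrame(
    {
        "county_state": ["A, X", "A, X", "B, Y", "B, Y"],
        "date": pd.to_datetime(["2020-04-15"] * 4),
        "cases": [4, 0, 7, 7],
        "deaths": [0, 59, 1, 1],
    }
)

# Number of distinct case and death values per county-state/date key
per_key = toy_df.groupby(["county_state", "date"])[["cases", "deaths"]].nunique()

# Keys where the duplicated rows disagree on cases or deaths
conflicts = per_key[(per_key["cases"] > 1) | (per_key["deaths"] > 1)]

print(conflicts)
```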


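Each entry in the fips column is an array holding the FIPS code and, for some rows, a missing value, because some duplicated rows had the county-state identifier and date but no FIPS code. We split it into a column of FIPS codes and a second column that should contain only NAs.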

In [27]:
# split columns
flattened_covid_df[['fips','fips0']] = pd.DataFrame(flattened_covid_df.fips.values.tolist(), index= flattened_covid_df.index)

# check that new column is only NAs 
sum(flattened_covid_df['fips0'].notna())



0

In [28]:
# create new version of fips column as string 
flattened_covid_df["fips_str"] = flattened_covid_df["fips"].astype('Int64').astype('str')

In [29]:
# extract state FIPS and convert to category 
flattened_covid_df["state_fips"] = flattened_covid_df["fips_str"].apply(lambda x: x[:-3]).astype('category')
flattened_covid_df["state_fips"].head()

0    45
1    45
2    45
3    45
4    45
Name: state_fips, dtype: category
Categories (52, object): [, 1, 10, 11, ..., 56, 6, 8, 9]

In [30]:
# extract county FIPS and convert to category 
flattened_covid_df["county_fips"] = flattened_covid_df["fips_str"].apply(lambda x: x[-3:]).astype('category')
flattened_covid_df["county_fips"].head()

0    001
1    001
2    001
3    001
4    001
Name: county_fips, dtype: category
Categories (284, object): [001, 003, 005, 006, ..., 820, 830, 840, nan]
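The category listings above show state codes without leading zeros (for example "1" and "6") and string artifacts for missing codes (an empty state code and a "nan" county code). A possible refinement, sketched here on toy values, is to zero-pad each code to five digits before slicing and to keep missing codes as missing. `format_fips` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd


def format_fips(value):
    """Return a zero-padded 5-character FIPS string, or None if missing."""
    if pd.isna(value):
        return None
    return f"{int(value):05d}"


# Toy FIPS values: South Carolina, Alabama (needs a leading zero), missing
fips = pd.Series([45001.0, 1001.0, np.nan])

fips_str = fips.map(format_fips)

# First two characters are the state code, last three the county code
state_fips = fips_str.str[:2]
county_fips = fips_str.str[2:]

print(state_fips.tolist(), county_fips.tolist())
```

Fixed-width codes make the state/county split positional and keep the result comparable with other datasets that store FIPS codes as five-character strings.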

We can see that we have dropped 592 rows, which matches what we expected from flattening the file. 

In [47]:
# count rows dropped by flattening
covid_df.shape[0] - flattened_covid_df.shape[0]


592

### Pickle data

In [50]:
with open("../Data_pkl/flattened_covid_df.pkl", "wb") as f:
    pickle.dump(flattened_covid_df, f)
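A quick round-trip check can confirm that the pickled file reloads to an identical dataframe. The sketch below uses a toy dataframe and a temporary directory rather than the project's `Data_pkl` folder. The names `toy_df`, `pkl_path`, and `round_trip_ok` are illustrative.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe standing in for the flattened data
toy_df = pd.DataFrame(
    {
        "county_state": ["Cook, Illinois"],
        "date": pd.to_datetime(["2020-01-24"]),
        "cases": [1],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "toy_covid_df.pkl")
    # pandas wrappers around pickle; they open and close the file themselves
    toy_df.to_pickle(pkl_path)
    reloaded_df = pd.read_pickle(pkl_path)

# equals() compares values, index, and dtypes
round_trip_ok = reloaded_df.equals(toy_df)
print(round_trip_ok)
```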