# COVID-19 MIDS Collaboration 

## Data Import 

This Jupyter Notebook reads in raw data as csv files and exports them as [pickle files for faster loading](https://medium.com/better-programming/load-fast-load-big-with-compressed-pickles-5f311584507e). 

### Data sources

State and county FIPS codes: https://github.com/kjhealy/fips-codes

### Set up environment 

In [1]:
# Import packages
import pandas as pd # to read in csv
import numpy as np # to process data
import pickle # to pickle files


Here, we import each dataset into its own pandas DataFrame and add an "o_" prefix to all columns to mark which columns were the originals.

In [2]:
#Read in data & add "o_" prefix to demarcate which columns were the originals 
# state fips codes and names 
state_fips_df = pd.read_csv("../Data_raw/fips-codes/state_fips_master.csv").add_prefix('o_')
# state and county fips codes and names 
fips_df = pd.read_csv("../Data_raw/fips-codes/county_fips_master.csv", encoding='cp1252').add_prefix('o_') 
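
As a small illustration of the prefixing step, the sketch below applies `add_prefix` to a toy frame built in memory. The frame and its column names are assumptions for illustration only; nothing is read from the raw files.

```python
import pandas as pd

# Toy stand-in for a raw CSV (hypothetical columns, not read from disk)
raw = pd.DataFrame({"state_name": ["Alabama", "Alaska"], "fips": [1, 2]})

# add_prefix returns a new frame with every column label prefixed;
# the original frame is left unchanged
prefixed = raw.add_prefix("o_")
print(list(prefixed.columns))  # ['o_state_name', 'o_fips']
```

Because every original column carries the prefix, any column added later without it can be recognised as derived.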


### Inspect data

#### State FIPS codes and names

Our dataset tracking state FIPS codes has 50 rows and 10 columns. There is one row for each of the 50 states. We would probably want to merge the above file onto this one to obtain the correct state FIPS codes and to limit our data to only these 50 states for easier comparison.

In [3]:
# display rows & columns
state_fips_df.shape

(50, 10)

We have:

* 5 string columns tracking the state name, abbreviation, long name, region name, and division name
* 5 integer columns tracking FIPS codes and identifiers we can use to match to the county file
    * These should be converted to categories


In [4]:
# show data types 
state_fips_df.dtypes

o_state_name       object
o_state_abbr       object
o_long_name        object
o_fips              int64
o_sumlev            int64
o_region            int64
o_division          int64
o_state             int64
o_region_name      object
o_division_name    object
dtype: object

In [5]:
# visual inspection
state_fips_df.head()

Unnamed: 0,o_state_name,o_state_abbr,o_long_name,o_fips,o_sumlev,o_region,o_division,o_state,o_region_name,o_division_name
0,Alabama,AL,Alabama AL,1,40,3,6,1,South,East South Central
1,Alaska,AK,Alaska AK,2,40,4,9,2,West,Pacific
2,Arizona,AZ,Arizona AZ,4,40,4,8,4,West,Mountain
3,Arkansas,AR,Arkansas AR,5,40,3,7,5,South,West South Central
4,California,CA,California CA,6,40,4,9,6,West,Pacific


#### Check and convert datatypes

In [6]:
# convert integers to categories
state_fips_df["state_fips"] = state_fips_df["o_state"].astype('category')
state_fips_df["fips"] = state_fips_df["o_fips"].astype('category')
state_fips_df["region"] = state_fips_df["o_region"].astype('category')
state_fips_df["division"] = state_fips_df["o_division"].astype('category')
state_fips_df["sumlev"] = state_fips_df["o_sumlev"].astype('category')

In [7]:
# visual check of column conversions - these worked 
state_fips_df.head()

Unnamed: 0,o_state_name,o_state_abbr,o_long_name,o_fips,o_sumlev,o_region,o_division,o_state,o_region_name,o_division_name,state_fips,fips,region,division,sumlev
0,Alabama,AL,Alabama AL,1,40,3,6,1,South,East South Central,1,1,3,6,40
1,Alaska,AK,Alaska AK,2,40,4,9,2,West,Pacific,2,2,4,9,40
2,Arizona,AZ,Arizona AZ,4,40,4,8,4,West,Mountain,4,4,4,8,40
3,Arkansas,AR,Arkansas AR,5,40,3,7,5,South,West South Central,5,5,3,7,40
4,California,CA,California CA,6,40,4,9,6,West,Pacific,6,6,4,9,40


#### Generate summary statistics

State FIPS codes range from 1-56. Since there are only 50 states in the data, some numbers in this range are not used as state FIPS codes. Region IDs range from 1-4 and division IDs from 1-9. The o_sumlev column is constant (40) and is likely not needed.

We can see that we have 50 states as expected. These are divided into 4 regions and 9 divisions.

In [8]:
# display summary statistics
state_fips_df.describe(include = 'all')

Unnamed: 0,o_state_name,o_state_abbr,o_long_name,o_fips,o_sumlev,o_region,o_division,o_state,o_region_name,o_division_name,state_fips,fips,region,division,sumlev
count,50,50,50,50.0,50.0,50.0,50.0,50.0,50,50,50.0,50.0,50.0,50.0,50.0
unique,50,50,50,,,,,,4,9,50.0,50.0,4.0,9.0,1.0
top,Georgia,RI,Pennsylvania PA,,,,,,South,Mountain,56.0,56.0,3.0,8.0,40.0
freq,1,1,1,,,,,,16,8,1.0,1.0,16.0,8.0,50.0
mean,,,,29.32,40.0,2.66,5.12,29.32,,,,,,,
std,,,,15.782243,0.0,1.061574,2.560612,15.782243,,,,,,,
min,,,,1.0,40.0,1.0,1.0,1.0,,,,,,,
25%,,,,17.25,40.0,2.0,3.0,17.25,,,,,,,
50%,,,,29.5,40.0,3.0,5.0,29.5,,,,,,,
75%,,,,41.75,40.0,3.75,7.75,41.75,,,,,,,
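
The gaps in the FIPS range can be listed directly. The sketch below does this for the five codes visible in the `head()` output earlier (1, 2, 4, 5, 6); running the same logic on the full `o_fips` column is an assumption about how one might extend it.

```python
import pandas as pd

# First five state FIPS codes as shown in the head() output
codes = pd.Series([1, 2, 4, 5, 6], name="o_fips")

# Every integer between the smallest and largest code, minus the codes in use
used = set(codes.tolist())
unused = sorted(set(range(min(used), max(used) + 1)) - used)
print(unused)  # [3]
```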


#### Check missingness

There is no missing data.

In [30]:
state_fips_df.isnull().sum(axis = 0)

o_state_name       0
o_state_abbr       0
o_long_name        0
o_fips             0
o_sumlev           0
o_region           0
o_division         0
o_state            0
o_region_name      0
o_division_name    0
state_fips         0
fips               0
region             0
division           0
sumlev             0
dtype: int64

#### Check duplicates

There are no perfectly duplicated rows. 

In [31]:
sum(state_fips_df.duplicated())

0

This dataset is at the state level (50 rows for 50 unique states).
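
Since the state table has exactly one row per state, it can act as a lookup when restricting another table to the 50 states. The following is a minimal sketch on toy data; the `other` frame and its `value` column are hypothetical, and joining on `o_state_abbr` is an assumption about which key the other table shares.

```python
import pandas as pd

# Toy state table (two states) and a toy table that also contains D.C.
states = pd.DataFrame({"o_state_abbr": ["AL", "AK"], "o_fips": [1, 2]})
other = pd.DataFrame({"o_state_abbr": ["AL", "AK", "DC"], "value": [10, 20, 30]})

# An inner merge keeps only rows whose abbreviation appears in the state table;
# validate raises if the state table has repeated keys
merged = other.merge(states, on="o_state_abbr", how="inner", validate="many_to_one")
print(merged["o_state_abbr"].tolist())  # ['AL', 'AK']
```

The `validate` argument is optional, but it turns the one-row-per-state assumption into a check that fails loudly if the lookup table ever contains duplicates.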

#### State & County FIPS codes and names

Our dataset tracking state and county FIPS codes has 3,146 rows and 13 columns. 

In [9]:
# display rows & columns
fips_df.shape

(3146, 13)

We have:

* 6 string columns tracking the county and state name, abbreviation, long name, region name, and division name
* 1 integer column tracking fips codes
* 5 float columns (o_sumlev, o_region, o_division, o_state, o_county) tracking identifiers we can use to match to the state file (these need to be converted to integers)
* 1 string column (o_crosswalk) that stores an additional format of the identifier columns that can be used for matching


In [10]:
# show data types
fips_df.dtypes

o_fips               int64
o_county_name       object
o_state_abbr        object
o_state_name        object
o_long_name         object
o_sumlev           float64
o_region           float64
o_division         float64
o_state            float64
o_county           float64
o_crosswalk         object
o_region_name       object
o_division_name     object
dtype: object

Since FIPS codes are stored as integers, we can see that we lose the leading zeros. This is something to be careful with when it comes to matching. These are already split into state and county FIPS codes in the o_state and o_county columns. We could add the county name and FIPS codes to the merged file with COVID-19 and state data. The match would not be perfect given the issues we saw in the COVID-19 data. 
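
One way to guard against the lost leading zeros when matching is to rebuild the 5-character string form of the code. The sketch below uses two `o_fips` values that appear in this notebook (1001 and 11001); whether the other dataset stores FIPS codes as zero-padded strings is an assumption.

```python
import pandas as pd

# Integer county FIPS codes as stored in o_fips
fips = pd.Series([1001, 11001])

# Cast to string and left-pad with zeros to the 5-character form
fips_str = fips.astype(str).str.zfill(5)
print(fips_str.tolist())  # ['01001', '11001']

# The first two characters are the state code, the last three the county code
state_part = fips_str.str[:2]
county_part = fips_str.str[2:]
```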

In [11]:
# visual inspection
fips_df.head()

Unnamed: 0,o_fips,o_county_name,o_state_abbr,o_state_name,o_long_name,o_sumlev,o_region,o_division,o_state,o_county,o_crosswalk,o_region_name,o_division_name
0,1001,Autauga County,AL,Alabama,Autauga County AL,50.0,3.0,6.0,1.0,1.0,3-6-1-1,South,East South Central
1,1003,Baldwin County,AL,Alabama,Baldwin County AL,50.0,3.0,6.0,1.0,3.0,3-6-1-3,South,East South Central
2,1005,Barbour County,AL,Alabama,Barbour County AL,50.0,3.0,6.0,1.0,5.0,3-6-1-5,South,East South Central
3,1007,Bibb County,AL,Alabama,Bibb County AL,50.0,3.0,6.0,1.0,7.0,3-6-1-7,South,East South Central
4,1009,Blount County,AL,Alabama,Blount County AL,50.0,3.0,6.0,1.0,9.0,3-6-1-9,South,East South Central


#### Check and convert datatypes

In [12]:
# convert floats to categories
fips_df["state_fips"] = fips_df["o_state"].astype('category')
fips_df["county_fips"] = fips_df["o_county"].astype('category')
fips_df["region"] = fips_df["o_region"].astype('category')
fips_df["division"] = fips_df["o_division"].astype('category')
fips_df["sumlev"] = fips_df["o_sumlev"].astype('category')
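
The conversion above goes straight from float to category, so the category labels stay as floats (1.0 rather than 1). If integer labels are preferred, one option is to pass through pandas' nullable `Int64` dtype first, which tolerates missing values. This is a sketch of that alternative on a toy column, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy float column with one missing value, mirroring o_state
o_state = pd.Series([1.0, 11.0, np.nan])

# A plain int conversion would raise on NaN; nullable "Int64" keeps it as <NA>
as_int = o_state.astype("Int64")

# Categories built from the integer values are 1 and 11 rather than 1.0 and 11.0
as_cat = as_int.astype("category")
print(as_cat.cat.categories.tolist())
```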

In [13]:
# visual check of column conversions - these worked 
fips_df.head()

Unnamed: 0,o_fips,o_county_name,o_state_abbr,o_state_name,o_long_name,o_sumlev,o_region,o_division,o_state,o_county,o_crosswalk,o_region_name,o_division_name,state_fips,county_fips,region,division,sumlev
0,1001,Autauga County,AL,Alabama,Autauga County AL,50.0,3.0,6.0,1.0,1.0,3-6-1-1,South,East South Central,1.0,1.0,3.0,6.0,50.0
1,1003,Baldwin County,AL,Alabama,Baldwin County AL,50.0,3.0,6.0,1.0,3.0,3-6-1-3,South,East South Central,1.0,3.0,3.0,6.0,50.0
2,1005,Barbour County,AL,Alabama,Barbour County AL,50.0,3.0,6.0,1.0,5.0,3-6-1-5,South,East South Central,1.0,5.0,3.0,6.0,50.0
3,1007,Bibb County,AL,Alabama,Bibb County AL,50.0,3.0,6.0,1.0,7.0,3-6-1-7,South,East South Central,1.0,7.0,3.0,6.0,50.0
4,1009,Blount County,AL,Alabama,Blount County AL,50.0,3.0,6.0,1.0,9.0,3-6-1-9,South,East South Central,1.0,9.0,3.0,6.0,50.0


#### Generate summary statistics

State FIPS codes range from 1-56, as expected. County FIPS codes range from 0-840. Region IDs range from 1-4 and division IDs from 1-9, as expected. The o_sumlev column is not constant: it takes the value 40 or 50, and 40 occurs only once. Three rows are missing sumlev, region, division, and state and county FIPS codes.

We can see that we have 51 unique state names, 1 more than expected. These are divided into 4 regions and 9 divisions.

In [14]:
# display summary statistics for all columns
fips_df.describe(include = 'all')

Unnamed: 0,o_fips,o_county_name,o_state_abbr,o_state_name,o_long_name,o_sumlev,o_region,o_division,o_state,o_county,o_crosswalk,o_region_name,o_division_name,state_fips,county_fips,region,division,sumlev
count,3146.0,3146,3146,3146,3146,3143.0,3143.0,3143.0,3143.0,3143.0,3146,3143,3143,3143.0,3143.0,3143.0,3143.0,3143.0
unique,,1879,51,51,3145,,,,,,3144,4,9,51.0,325.0,4.0,9.0,2.0
top,,Washington County,TX,Texas,District of Columbia DC,,,,,,NA-NA-NA-NA,South,West North Central,48.0,1.0,3.0,4.0,50.0
freq,,30,254,254,2,,,,,,3,1423,618,254.0,49.0,1423.0,618.0,3142.0
mean,30380.268595,,,,,49.996818,2.668788,5.192491,30.273942,103.53993,,,,,,,,
std,15172.365155,,,,,0.178372,0.803043,1.96375,15.145834,107.702765,,,,,,,,
min,1001.0,,,,,40.0,1.0,1.0,1.0,0.0,,,,,,,,
25%,18175.5,,,,,50.0,2.0,4.0,18.0,35.0,,,,,,,,
50%,29176.0,,,,,50.0,3.0,5.0,29.0,79.0,,,,,,,,
75%,45082.5,,,,,50.0,3.0,7.0,45.0,133.0,,,,,,,,


The 1 extra state in the county dataset is Washington, D.C. 

In [15]:
fips_df["o_state_name"].unique()

array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
       'Colorado', 'Connecticut', 'Delaware', 'District of Columbia',
       'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana',
       'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland',
       'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',
       'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',
       'New Jersey', 'New Mexico', 'New York', 'North Carolina',
       'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania',
       'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee',
       'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin', 'Wyoming'], dtype=object)

The state dataset, by contrast, does not include Washington, D.C.

In [16]:
state_fips_df["o_state_name"].unique()

array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
       'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia',
       'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas',
       'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts',
       'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana',
       'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico',
       'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma',
       'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina',
       'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont',
       'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming'],
      dtype=object)

Washington, D.C. also accounts for the discrepancy in the o_sumlev column. 

In [17]:
fips_df["o_state_name"][fips_df["o_sumlev"] == 40]

320    District of Columbia
Name: o_state_name, dtype: object

#### Check missing data

Three rows are missing sumlev, region, division, and state and county FIPS codes.

In [29]:
fips_df.isnull().sum(axis = 0)

o_fips             0
o_county_name      0
o_state_abbr       0
o_state_name       0
o_long_name        0
o_sumlev           3
o_region           3
o_division         3
o_state            3
o_county           3
o_crosswalk        0
o_region_name      3
o_division_name    3
state_fips         3
county_fips        3
region             3
division           3
sumlev             3
dtype: int64

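It is unclear from the file alone why these 3 rows are missing these values. All three correspond to entities that have since been renamed or dissolved (Wade Hampton Census Area is now Kusilvak Census Area, Shannon County is now Oglala Lakota County, and Bedford city was absorbed into Bedford County), which may explain the gaps. This may pose an issue if other datasets still use these codes. We can retain these rows for now and drop them later.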

In [28]:
fips_df[fips_df['county_fips'].isnull()]

Unnamed: 0,o_fips,o_county_name,o_state_abbr,o_state_name,o_long_name,o_sumlev,o_region,o_division,o_state,o_county,o_crosswalk,o_region_name,o_division_name,state_fips,county_fips,region,division,sumlev
93,2270,Wade Hampton Census Area,AK,Alaska,Wade Hampton Census Area AK,,,,,,NA-NA-NA-NA,,,,,,,
2420,46113,Shannon County,SD,South Dakota,Shannon County SD,,,,,,NA-NA-NA-NA,,,,,,,
2919,51515,Bedford city,VA,Virginia,Bedford city VA,,,,,,NA-NA-NA-NA,,,,,,,


#### Check duplicates

There are no perfectly duplicated rows. 

In [18]:
sum(fips_df.duplicated())

0

There should be one row per county-state pair. 

In [20]:
fips_df["o_long_name"].value_counts()

District of Columbia DC    2
Gates County NC            1
Stevens County MN          1
King George County VA      1
Osborne County KS          1
                          ..
Comanche County OK         1
Coles County IL            1
DoÐa Ana County NM         1
Wells County IN            1
Suwannee County FL         1
Name: o_long_name, Length: 3145, dtype: int64

There appears to be an additional row for Washington, D.C. - one for each value of sumlev (40 and 50). 

In [21]:
fips_df[fips_df["o_long_name"] == "District of Columbia DC"]

Unnamed: 0,o_fips,o_county_name,o_state_abbr,o_state_name,o_long_name,o_sumlev,o_region,o_division,o_state,o_county,o_crosswalk,o_region_name,o_division_name,state_fips,county_fips,region,division,sumlev
320,11001,District of Columbia,DC,District of Columbia,District of Columbia DC,40.0,3.0,5.0,11.0,0.0,3-5-11-0,South,South Atlantic,11.0,0.0,3.0,5.0,40.0
321,11001,District of Columbia,DC,District of Columbia,District of Columbia DC,50.0,3.0,5.0,11.0,1.0,3-5-11-1,South,South Atlantic,11.0,1.0,3.0,5.0,50.0
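
The full-row duplicate check returned 0 while the name count found a repeat because the two D.C. rows differ in other columns. The sketch below reproduces that pattern on a toy frame and shows a key-level check with `duplicated(subset=...)`. Using `o_long_name` as the key is an assumption carried over from the cell above.

```python
import pandas as pd

# Toy rows mirroring the D.C. pattern: same name, different o_sumlev
df = pd.DataFrame({
    "o_long_name": ["District of Columbia DC", "District of Columbia DC", "Gates County NC"],
    "o_sumlev": [40.0, 50.0, 50.0],
})

# The full-row check finds nothing because o_sumlev differs
print(df.duplicated().sum())  # 0

# The key-level check flags every row whose name occurs more than once
dupes = df[df.duplicated(subset=["o_long_name"], keep=False)]
print(len(dupes))  # 2
```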


We can remove the state-level row (o_sumlev of 40) to deduplicate.

In [22]:
clean_fips_df = fips_df[fips_df["o_sumlev"] != 40]
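
Note that a `!= 40` filter keeps rows where `o_sumlev` is missing, because a comparison of NaN with `!=` evaluates to True. This is consistent with the 3 incomplete rows still being present after the filter. A toy sketch of the behaviour:

```python
import numpy as np
import pandas as pd

# Toy o_sumlev column: a state-level row, a county-level row, and a missing value
df = pd.DataFrame({"o_sumlev": [40.0, 50.0, np.nan]})

# NaN != 40 evaluates to True, so the missing-value row passes the filter
kept = df[df["o_sumlev"] != 40]
print(len(kept))  # 2

# Filtering on == 50 instead would drop the missing-value row as well
only_counties = df[df["o_sumlev"] == 50]
print(len(only_counties))  # 1
```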

In [24]:
clean_fips_df.describe(include = 'all')

Unnamed: 0,o_fips,o_county_name,o_state_abbr,o_state_name,o_long_name,o_sumlev,o_region,o_division,o_state,o_county,o_crosswalk,o_region_name,o_division_name,state_fips,county_fips,region,division,sumlev
count,3145.0,3145,3145,3145,3145,3142.0,3142.0,3142.0,3142.0,3142.0,3145,3142,3142,3142.0,3142.0,3142.0,3142.0,3142.0
unique,,1879,51,51,3145,,,,,,3143,4,9,51.0,324.0,4.0,9.0,1.0
top,,Washington County,TX,Texas,Gates County NC,,,,,,NA-NA-NA-NA,South,West North Central,48.0,1.0,3.0,4.0,50.0
freq,,30,254,254,1,,,,,,3,1422,618,254.0,49.0,1422.0,618.0,3142.0
mean,30386.430525,,,,,50.0,2.668682,5.192553,30.280076,103.572884,,,,,,,,
std,15170.840246,,,,,0.0,0.803149,1.96406,15.144339,107.70406,,,,,,,,
min,1001.0,,,,,50.0,1.0,1.0,1.0,1.0,,,,,,,,
25%,18177.0,,,,,50.0,2.0,4.0,18.0,35.0,,,,,,,,
50%,29177.0,,,,,50.0,3.0,5.0,29.0,79.0,,,,,,,,
75%,45083.0,,,,,50.0,3.0,7.0,45.0,133.0,,,,,,,,


### Pickle data

In [32]:
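with open("../Data_pkl/flattened_covid_df.pkl", "wb") as f:  # closes the file handle after writing
    pickle.dump(flattened_covid_df, f)  # note: flattened_covid_df is not defined in the cells above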
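
For reference, a pickle round trip can be sketched as follows. The frame, the temporary directory, and the file name `example_df.pkl` are all hypothetical and chosen only to keep the example self-contained; they are not the project's actual outputs.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy frame standing in for one of the cleaned tables
df = pd.DataFrame({
    "o_fips": [1001, 1003],
    "o_state_abbr": ["AL", "AL"],
    "region": pd.Categorical([3, 3]),
})

# Hypothetical output location inside a temporary directory
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, "example_df.pkl")

# The context manager closes the file even if dump raises
with open(path, "wb") as f:
    pickle.dump(df, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored.equals(df))       # True
print(restored["region"].dtype)  # category
```

Unlike a CSV export, the pickle preserves dtypes such as `category`, so the conversions made earlier do not need to be repeated after loading.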