# COVID-19 MIDS Collaboration 

## Data Import: Income data

This Jupyter Notebook reads in raw data as csv files and exports them as [pickle files for faster loading](https://medium.com/better-programming/load-fast-load-big-with-compressed-pickles-5f311584507e). 

### Data sources

Income data: 

### Set up environment 

In [1]:
# Import packages
import pandas as pd # to read in csv
import numpy as np # to process data
import pickle # to pickle files


Here, we import the dataset into a pandas dataframe and add a "o_" prefix to all columsn to demarcate which columns were the originals. 

In [2]:
#Read in data & add "o_" prefix to demarcate which columns were the originals 
income_df = pd.read_csv("../Data_raw/Kaggle_USHousehold_Income/kaggle_income.csv", encoding='latin-1').add_prefix('o_') 


### Inspect data

Our dataset tracking income has 32,526 rows and 19 columns. 

In [3]:
# count rows and columns
income_df.shape

(32526, 19)

We have:

* 2 integer columns tracking the ID, state codes, and zip codes; information about land and water; and mean/median/standard deviation of income.
* 3 float columns tracking latitude & longitude and sum_w (unknown column)
* 8 string columns tracking state name and abbreviation; county and city name; place, type and primary (unknown column); and area code.


In [4]:
# check data types 
income_df.dtypes

o_id              int64
o_State_Code      int64
o_State_Name     object
o_State_ab       object
o_County         object
o_City           object
o_Place          object
o_Type           object
o_Primary        object
o_Zip_Code        int64
o_Area_Code      object
o_ALand           int64
o_AWater          int64
o_Lat           float64
o_Lon           float64
o_Mean            int64
o_Median          int64
o_Stdev           int64
o_sum_w         float64
dtype: object

In [5]:
income_df.head()

Unnamed: 0,o_id,o_State_Code,o_State_Name,o_State_ab,o_County,o_City,o_Place,o_Type,o_Primary,o_Zip_Code,o_Area_Code,o_ALand,o_AWater,o_Lat,o_Lon,o_Mean,o_Median,o_Stdev,o_sum_w
0,1011000,1,Alabama,AL,Mobile County,Chickasaw,Chickasaw city,City,place,36611,251,10894952,909156,30.77145,-88.079697,38773,30506,33101,1638.260513
1,1011010,1,Alabama,AL,Barbour County,Louisville,Clio city,City,place,36048,334,26070325,23254,31.708516,-85.611039,37725,19528,43789,258.017685
2,1011020,1,Alabama,AL,Shelby County,Columbiana,Columbiana city,City,place,35051,205,44835274,261034,33.191452,-86.615618,54606,31930,57348,926.031
3,1011030,1,Alabama,AL,Mobile County,Satsuma,Creola city,City,place,36572,251,36878729,2374530,30.874343,-88.009442,63919,52814,47707,378.114619
4,1011040,1,Alabama,AL,Mobile County,Dauphin Island,Dauphin Island,Town,place,36528,251,16204185,413605152,30.250913,-88.171268,77948,67225,54270,282.320328


# CODE BELOW TO PULL FROM TO FINISH THE REST

# CONVERT DATA TYPES

# CHECK CONVERSION

# CHECK SUMMARY STATS 

# CHECK DUPLICATES

# CONVERT TO APPROPRIATE LEVEL

# PICKLE AND EXPORT NEW DATASET

In [6]:
# create new version of column as a datetime object
covid_df["date"] = pd.to_datetime(covid_df["o_date"])
# check conversion 
covid_df.dtypes

o_date              object
o_county            object
o_state             object
o_fips             float64
o_cases              int64
o_deaths             int64
date        datetime64[ns]
dtype: object

In [68]:
# create new version of fips column as nullable integer
covid_df["fips"] = covid_df["o_fips"].astype('category')
# create new version of fips column as string 
covid_df["fips_str"] = covid_df["o_fips"].astype('Int64').astype('str')
# check conversion 
covid_df.dtypes

o_date                 object
o_county               object
o_state                object
o_fips                float64
o_cases                 int64
o_deaths                int64
date           datetime64[ns]
fips                 category
county                 object
fips_str               object
state_fips             object
county_fips          category
dtype: object

In [65]:
# extract state FIPS and convert to category 
covid_df["state_fips"] = covid_df["fips_str"].apply(lambda x: x[:-3]).astype('category')
covid_df["state_fips"].head()

0    53
1    53
2    53
3    17
4    53
Name: state_fips, dtype: object

In [66]:
# extract county FIPS and convert to category 
covid_df["county_fips"] = covid_df["fips_str"].apply(lambda x: x[-3:]).astype('category')
covid_df["county_fips"].head()

0    061
1    061
2    061
3    031
4    061
Name: county_fips, dtype: category
Categories (284, object): [001, 003, 005, 006, ..., 820, 830, 840, nan]

In [8]:
# visual check of column conversions - these worked 
covid_df.head()

Unnamed: 0,o_date,o_county,o_state,o_fips,o_cases,o_deaths,date,fips
0,2020-01-21,Snohomish,Washington,53061.0,1,0,2020-01-21,53061.0
1,2020-01-22,Snohomish,Washington,53061.0,1,0,2020-01-22,53061.0
2,2020-01-23,Snohomish,Washington,53061.0,1,0,2020-01-23,53061.0
3,2020-01-24,Cook,Illinois,17031.0,1,0,2020-01-24,17031.0
4,2020-01-24,Snohomish,Washington,53061.0,1,0,2020-01-24,53061.0


We have 807 missing FIPS codes, but no other missing data (at least that is coded as NaNs).

In [9]:
covid_df.isnull().sum(axis = 0)

o_date        0
o_county      0
o_state       0
o_fips      807
o_cases       0
o_deaths      0
date          0
fips        807
dtype: int64

3 counties have missing FIPS codes.

In [10]:
covid_df["o_county"][covid_df["o_fips"].isnull()].unique()

array(['New York City', 'Unknown', 'Kansas City'], dtype=object)

New York City is comprised of 5 counties (1 for each of the 5 boroughs); these each have their own FIPS codes: New York County (Manhattan), Kings County (Brooklyn), Bronx County (The Bronx), Richmond County (Staten Island), and Queens County (Queens). 

In [11]:
# show unique state for counties coded as "Unknown"
covid_df["o_state"][covid_df["o_county"] == "New York City"].unique()

array(['New York'], dtype=object)

In [12]:
# show unique fips codes for counties coded as "New York City"
covid_df["o_fips"][covid_df["o_county"] == "New York City"].unique()

array([nan])

Kansas City, Missouri also straddles multiple counties which would explain why it does not have a FIPS code. 

In [13]:
# show unique states for counties coded as "Kansas City"
covid_df["o_state"][covid_df["o_county"] == "Kansas City"].unique()

array(['Missouri'], dtype=object)

In [14]:
# show unique fips codes for counties coded as "Kansas City"
covid_df["o_fips"][covid_df["o_county"] == "Kansas City"].unique()

array([nan])

Other rows are missing FIPS codes because they lack a county (it is marked as "Unknown"). Some would make sense - for example, Puerto Rico only has a state FIPS code. 

In [15]:
# show unique states for counties coded as "Unknown"
covid_df["o_state"][covid_df["o_county"] == "Unknown"].unique()

array(['Rhode Island', 'New Jersey', 'Puerto Rico', 'Virgin Islands',
       'Guam', 'Maine', 'Massachusetts', 'Louisiana', 'Kentucky',
       'Nevada', 'Tennessee', 'Arkansas', 'Georgia', 'Missouri',
       'Minnesota', 'California', 'Colorado', 'Florida', 'Hawaii',
       'Illinois', 'Vermont', 'Idaho', 'Michigan', 'Ohio', 'Utah',
       'Mississippi', 'Northern Mariana Islands', 'Nebraska',
       'Connecticut', 'South Dakota', 'Indiana', 'Alaska', 'Arizona',
       'New Mexico', 'New York', 'Pennsylvania', 'Virginia',
       'New Hampshire', 'Washington', 'Delaware', 'Kansas',
       'North Dakota', 'Maryland'], dtype=object)

In [16]:
# show unique fips codes for counties coded as "Unknown"
covid_df["o_fips"][covid_df["o_county"] == "Unknown"].unique()

array([nan])

The missingness in FIPS codes is a major issue since this is the geographic identifier we would use to join these data to other datsets. We will likely need to accurately rectify this missingness in order to accurately combine our data. These don't seem to be erroneous data for the most part. Rather, the missing points reflect when a data point is at the city (rather than county) level and/or if there is no county for that region. 

In [17]:
# replace rows with missing county with NaN 
covid_df["county"] = covid_df["o_county"].replace("Unknown", None)

The mins and maxes look reasonable for COVID-19 mins and maxes. The distributions are remarkable but also reasonable - we clearly have some outliers on the high end (New York is likely one of those). 

In [18]:
# display numerical data summary statistics
covid_df.describe()

Unnamed: 0,o_fips,o_cases,o_deaths
count,61164.0,61971.0,61971.0
mean,29601.48898,121.685353,4.025415
std,15528.488936,1520.051958,82.937968
min,1001.0,0.0,0.0
25%,17179.0,2.0,0.0
50%,28143.0,6.0,0.0
75%,42131.0,25.0,1.0
max,56043.0,118302.0,8215.0


We have no missing data in the text columns. We have data for 86 unique days, in 55 "states" (so the data also include territories), and 1,627 counties. In the original county column, we see there are 1,628 unique values since the NaNs used to be coded as "Unknown".

In [19]:
# display text data summary statistics
covid_df.describe(include=['O'])

Unnamed: 0,o_date,o_county,o_state,county
count,61971,61971,61971,61971
unique,86,1628,55,1627
top,2020-04-15,Washington,Texas,Washington
freq,2722,772,3685,777


We can see that we have some missing data in our FIPS codes column and that we have 2,708 unique codes. 

In [20]:
# display text data summary statistics
covid_df.describe(include=['category'])

Unnamed: 0,fips
count,61164.0
unique,2708.0
top,53061.0
freq,86.0


Now that we have verified our column conversion worked, we drop the columns for which we had added additional columns with the correct data type and remove the "o_" prefix since we no longer need to compare original versus converted columns. Columns are reordered to match the original ordering in the dataset.

In [70]:
# drop, rename, and reorder columns
converted_covid_df = covid_df.drop(columns = ["o_date", "o_fips", "o_county", "fips_str"]).rename(columns = lambda x: x.replace('o_', ''))[["date", "fips", "state_fips", "county_fips","county", "state", "cases", "deaths"]]
# visual inspection
converted_covid_df.head()

Unnamed: 0,date,fips,state_fips,county_fips,county,state,cases,deaths
0,2020-01-21,53061.0,53,61,Snohomish,Washington,1,0
1,2020-01-22,53061.0,53,61,Snohomish,Washington,1,0
2,2020-01-23,53061.0,53,61,Snohomish,Washington,1,0
3,2020-01-24,17031.0,17,31,Cook,Illinois,1,0
4,2020-01-24,53061.0,53,61,Snohomish,Washington,1,0


#### State FIPS codes and names

Our dataset tracking state and county FIPS codes has 50 rows and 10 columns. There is one row for each of the 50 states - we probably would want to merge the above file on this one to obtain the correct state FIPS codes and limit our data to only these 50 states for easier comparison. 

In [22]:
# display rows & columns
state_fips_df.shape

(50, 10)

We have:

* 5 string columns tracking the county name, abbreviation, region, and division
* 5 integer columns tracking fips codes and tracking identifiers we can use to match to the county file 
    * These should be converted to categories 


In [23]:
# show data types 
state_fips_df.dtypes

o_state_name       object
o_state_abbr       object
o_long_name        object
o_fips              int64
o_sumlev            int64
o_region            int64
o_division          int64
o_state             int64
o_region_name      object
o_division_name    object
dtype: object

In [24]:
# visual inspection
state_fips_df.head()

Unnamed: 0,o_state_name,o_state_abbr,o_long_name,o_fips,o_sumlev,o_region,o_division,o_state,o_region_name,o_division_name
0,Alabama,AL,Alabama AL,1,40,3,6,1,South,East South Central
1,Alaska,AK,Alaska AK,2,40,4,9,2,West,Pacific
2,Arizona,AZ,Arizona AZ,4,40,4,8,4,West,Mountain
3,Arkansas,AR,Arkansas AR,5,40,3,7,5,South,West South Central
4,California,CA,California CA,6,40,4,9,6,West,Pacific


State FIPS codes range from 1-56. Since there are 50 states in the data, this means that some numbers in this range do not represent FIPS codes. Region IDs range from 1-4. Division names range from 1-9. The o_sumlev column is a constant and is likely not needed. 

In [25]:
# display numerical data summary statistics
state_fips_df.describe()

Unnamed: 0,o_fips,o_sumlev,o_region,o_division,o_state
count,50.0,50.0,50.0,50.0,50.0
mean,29.32,40.0,2.66,5.12,29.32
std,15.782243,0.0,1.061574,2.560612,15.782243
min,1.0,40.0,1.0,1.0,1.0
25%,17.25,40.0,2.0,3.0,17.25
50%,29.5,40.0,3.0,5.0,29.5
75%,41.75,40.0,3.75,7.75,41.75
max,56.0,40.0,4.0,9.0,56.0


We can see that we have 50 states as expected. These are divided into 4 regions and 9 divisions.

In [26]:
# display text data summary statistics
state_fips_df.describe(include=['O'])

Unnamed: 0,o_state_name,o_state_abbr,o_long_name,o_region_name,o_division_name
count,50,50,50,50,50
unique,50,50,50,4,9
top,Pennsylvania,SD,Kansas KS,South,South Atlantic
freq,1,1,1,16,8


#### State FIPS codes and names

Our dataset tracking state and county FIPS codes has 3,146 rows and 13 columns. 

In [27]:
# display rows & columns
fips_df.shape

(3146, 13)

We have:

* 6 string columns tracking the county and state name, abbreviation, region, and division
* 1 integer column tracking fips codes 
* 4 float columns tracking identifiers we can use to match to the county file (these need to be converted to integers)
* 1 string column that stores an additional format of the integer columns that can be used for matching


In [28]:
# show data types
fips_df.dtypes

o_fips               int64
o_county_name       object
o_state_abbr        object
o_state_name        object
o_long_name         object
o_sumlev           float64
o_region           float64
o_division         float64
o_state            float64
o_county           float64
o_crosswalk         object
o_region_name       object
o_division_name     object
dtype: object

Since FIPS codes are stored as integers, we can see that we lose the leading zeros. This is something to be careful with when it comes to matching. These are already split into state and county FIPS codes in the o_state and o_county columns. We could add the county name and FIPS codes to the merged file with COVID-19 and state data. The match would not be perfect given the issues we saw in the COVID-19 data. 

In [29]:
# visual inspection
fips_df.head()

Unnamed: 0,o_fips,o_county_name,o_state_abbr,o_state_name,o_long_name,o_sumlev,o_region,o_division,o_state,o_county,o_crosswalk,o_region_name,o_division_name
0,1001,Autauga County,AL,Alabama,Autauga County AL,50.0,3.0,6.0,1.0,1.0,3-6-1-1,South,East South Central
1,1003,Baldwin County,AL,Alabama,Baldwin County AL,50.0,3.0,6.0,1.0,3.0,3-6-1-3,South,East South Central
2,1005,Barbour County,AL,Alabama,Barbour County AL,50.0,3.0,6.0,1.0,5.0,3-6-1-5,South,East South Central
3,1007,Bibb County,AL,Alabama,Bibb County AL,50.0,3.0,6.0,1.0,7.0,3-6-1-7,South,East South Central
4,1009,Blount County,AL,Alabama,Blount County AL,50.0,3.0,6.0,1.0,9.0,3-6-1-9,South,East South Central


In [30]:
# convert floats to integers
fips_df["state_fips"] = fips_df["o_state"].astype('category')
fips_df["county_fips"] = fips_df["o_county"].astype('category')
fips_df["region"] = fips_df["o_region"].astype('category')
fips_df["division"] = fips_df["o_division"].astype('category')

State FIPS codes range from 1-56, as expected. The county FIPS codes range from 0-840. Region IDs range from 1-4 and division names from 1-9, as expected. The o_sumlev column is not a constant - it seems to take the value of either 40 or 50 (but that 40 is rare). 

In [31]:
# display numerical data summary statistics
fips_df.describe()

Unnamed: 0,o_fips,o_sumlev,o_region,o_division,o_state,o_county
count,3146.0,3143.0,3143.0,3143.0,3143.0,3143.0
mean,30380.268595,49.996818,2.668788,5.192491,30.273942,103.53993
std,15172.365155,0.178372,0.803043,1.96375,15.145834,107.702765
min,1001.0,40.0,1.0,1.0,1.0,0.0
25%,18175.5,50.0,2.0,4.0,18.0,35.0
50%,29176.0,50.0,3.0,5.0,29.0,79.0
75%,45082.5,50.0,3.0,7.0,45.0,133.0
max,56045.0,50.0,4.0,9.0,56.0,840.0


We can see that we have 51 states - 1 more than expected. These are divided into 4 regions and 9 divisions.

In [32]:
# display text data summary statistics
fips_df.describe(include=['O'])

Unnamed: 0,o_county_name,o_state_abbr,o_state_name,o_long_name,o_crosswalk,o_region_name,o_division_name
count,3146,3146,3146,3146,3146,3143,3143
unique,1879,51,51,3145,3144,4,9
top,Washington County,TX,Texas,District of Columbia DC,NA-NA-NA-NA,South,West North Central
freq,30,254,254,2,3,1423,618


In [33]:
# display categorical data summary statistics
fips_df.describe(include=['category'])

Unnamed: 0,state_fips,county_fips,region,division
count,3143.0,3143.0,3143.0,3143.0
unique,51.0,325.0,4.0,9.0
top,48.0,1.0,3.0,4.0
freq,254.0,49.0,1423.0,618.0


The 1 extra state in the county dataset is Washington, D.C. 

In [34]:
fips_df["o_state_name"].unique()

array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
       'Colorado', 'Connecticut', 'Delaware', 'District of Columbia',
       'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana',
       'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland',
       'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',
       'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',
       'New Jersey', 'New Mexico', 'New York', 'North Carolina',
       'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania',
       'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee',
       'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin', 'Wyoming'], dtype=object)

The additional state FIPS is for Washington, D.C. 

In [35]:
state_fips_df["o_state_name"].unique()

array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
       'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia',
       'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas',
       'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts',
       'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana',
       'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico',
       'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma',
       'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina',
       'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont',
       'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming'],
      dtype=object)

Washington, D.C. also accounts for the discrepancy in the o_sumlev column. 

In [69]:
fips_df["o_state_name"][fips_df["o_sumlev"] == 40]

320    District of Columbia
Name: o_state_name, dtype: object