## Categorizing US counties

In [24]:
import pandas as pd

### Import county category data
Source: NCHS Urban-Rural Classification Scheme for Counties

##### The NCHS has developed a six-level urban-rural classification scheme for U.S. counties and county-equivalent entities.
1. Large central metro: counties in MSAs (metropolitan statistical areas) of 1 million or more population that 1) contain the entire population of the largest principal city of the MSA, or 2) are completely contained within the largest principal city of the MSA, or 3) contain at least 250,000 residents of any principal city in the MSA.
2. Large fringe metro: counties in MSAs of 1 million or more population that do not qualify as large central metro.
3. Medium metro: counties in MSAs of 250,000-999,999 population.
4. Small metro: counties in MSAs of less than 250,000 population.
5. Micropolitan: counties in micropolitan statistical areas.
6. Noncore: counties not in micropolitan statistical areas.
    
##### We can further collapse these six categories into urban, suburban, and rural:
- Urban: 1 (Large central metro)
- Suburban: 2, 3, 4 (Large fringe metro counties, medium metro, and small metro counties)
- Rural: 5 and 6 (Micropolitan counties and Noncore counties)
    

In [25]:
county_categories = pd.read_excel("NCHSURCodes2013.xlsx")
print(county_categories.shape)
county_categories.head()

(3149, 9)


Unnamed: 0,FIPS code,State Abr.,County name,CBSA title,CBSA 2012 pop,County 2012 pop,2013 code,2006 code,1990-based code
0,1001,AL,Autauga County,"Montgomery, AL",377149,55514,3,3,3
1,1003,AL,Baldwin County,"Daphne-Fairhope-Foley, AL",190790,190790,4,5,3
2,1005,AL,Barbour County,,.,27201,6,5,5
3,1007,AL,Bibb County,"Birmingham-Hoover, AL",1136650,22597,2,2,6
4,1009,AL,Blount County,"Birmingham-Hoover, AL",1136650,57826,2,2,3


In [26]:
selected = county_categories[["FIPS code", "2013 code"]]

### Import County-level Covid Data
Source: NY Times (https://github.com/nytimes/covid-19-data)

In [27]:
url = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"

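# on_bad_lines="skip" replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in 2.0.
covid_by_counties = pd.read_csv(url, on_bad_lines="skip")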

print(covid_by_counties.shape)
covid_by_counties.head()

(1160632, 6)


Unnamed: 0,date,county,state,fips,cases,deaths
0,2020-01-21,Snohomish,Washington,53061.0,1,0.0
1,2020-01-22,Snohomish,Washington,53061.0,1,0.0
2,2020-01-23,Snohomish,Washington,53061.0,1,0.0
3,2020-01-24,Cook,Illinois,17031.0,1,0.0
4,2020-01-24,Snohomish,Washington,53061.0,1,0.0


### Merging and Fixing

The first merge attempt shows that, apart from rows labeled "Unknown", three entries in the county column of this dataset have no FIPS county code to identify them: New York City, Kansas City, and Joplin.

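We will fill in the FIPS code for New York City with that of New York County (36061). The NY Times reports all five boroughs as a single New York City entry, so New York County serves here as a stand-in for the city as a whole.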

Joplin (Missouri) is a city in both Jasper and Newton County. Since both are designated by the NCHS system as small metro counties and Joplin is mostly in Jasper County, we will fill in the Joplin FIPS code with the Jasper County FIPS code.

Kansas City (Missouri) sits on the Kansas-Missouri state line and spans four Missouri counties. The bulk of it is in Jackson County, but parts of it lie in Clay, Cass, and Platte Counties. We will fill in the Kansas City FIPS code with that of Jackson County.
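
The diagnosis described above (finding which entries fail to match a county code) can also be sketched with the `indicator` option of `pd.merge`. This is a minimal illustration on toy data, not the notebook's own method; the frames `covid` and `codes` and their values are made up for the example.

```python
import pandas as pd

# Toy stand-ins for the Covid and NCHS tables (illustrative values only).
covid = pd.DataFrame({
    "county": ["Snohomish", "New York City", "Unknown"],
    "fips": [53061.0, float("nan"), float("nan")],
    "cases": [1, 10, 5],
})
codes = pd.DataFrame({"FIPS code": [53061, 36061], "2013 code": [2, 1]})

# indicator=True adds a "_merge" column saying where each row found a match.
merged = pd.merge(covid, codes, left_on="fips", right_on="FIPS code",
                  how="left", indicator=True)

# Rows marked "left_only" had no matching county code.
unmatched = sorted(merged.loc[merged["_merge"] == "left_only", "county"].unique())
print(unmatched)
```

Unlike filtering on a null `fips`, this check would also flag rows whose FIPS code is present but absent from the classification table.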

In [28]:
with_categories_left_1 = pd.merge(covid_by_counties, selected, left_on = "fips", right_on = "FIPS code", how = "left")
print(with_categories_left_1.shape)
with_categories_left_1[with_categories_left_1["fips"].isnull()].groupby("county").sum()

(1160632, 8)


county,fips,cases,deaths,FIPS code,2013 code
Joplin,0.0,732534,13820.0,0.0,0.0
Kansas City,0.0,5686326,71182.0,0.0,0.0
New York City,0.0,121106975,8357276.0,0.0,0.0
Unknown,0.0,13118064,550794.0,0.0,0.0


In [29]:
covid_by_counties.loc[covid_by_counties["county"] == "New York City", "fips"] = 36061
covid_by_counties.loc[covid_by_counties["county"] == "Joplin", "fips"] = 29097
covid_by_counties.loc[covid_by_counties["county"] == "Kansas City", "fips"] = 29095
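
The three assignments above could also be driven by a single lookup table. The sketch below is one possible variant on toy data; the names `fips_fixes` and `toy` are hypothetical, and unlike the cell above it fills only rows whose `fips` is missing rather than overwriting unconditionally.

```python
import pandas as pd

# Substitute FIPS codes, copied from the assignments above.
fips_fixes = {"New York City": 36061, "Joplin": 29097, "Kansas City": 29095}

# Toy frame standing in for covid_by_counties (illustrative values only).
toy = pd.DataFrame({
    "county": ["New York City", "Joplin", "Cook", "Unknown"],
    "fips": [float("nan"), float("nan"), 17031.0, float("nan")],
})

# Map county names to substitute codes, then use them only where fips is missing.
toy["fips"] = toy["fips"].fillna(toy["county"].map(fips_fixes))
print(toy)
```

Rows with no entry in the lookup table, such as "Unknown", stay missing.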

In [30]:
with_categories_left_2 = pd.merge(covid_by_counties, selected, left_on = "fips", right_on = "FIPS code", how = "left")
print(with_categories_left_2.shape)
with_categories_left_2[with_categories_left_2["fips"].isnull()].groupby("county").sum()

(1160632, 8)


county,fips,cases,deaths,FIPS code,2013 code
Unknown,0.0,13118064,550794.0,0.0,0.0


In [31]:
# Split rows by whether a FIPS code is still missing after the fixes.
missing = with_categories_left_2[with_categories_left_2["fips"].isnull()]
present = with_categories_left_2[with_categories_left_2["fips"].notnull()]
print(missing.shape[0])                                     # rows without a FIPS code
print(missing.shape[0] / with_categories_left_2.shape[0])  # share of all rows
print(missing.groupby("state").sum().shape)                 # states with rows lacking a FIPS code
print(present.groupby("state").sum().shape)                 # states with identified counties

9593
0.008265324409459674
(53, 5)
(54, 5)


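After fixing the FIPS codes above, we are missing county information on only 9593 entries in the 1160632-entry dataset. That is only 0.83% of the data. We also check how many states are represented: the rows with a FIPS code cover 54 distinct values of `state`, more than 50, presumably because the dataset also includes the District of Columbia and some U.S. territories.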
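
The same three quantities (missing rows, their share, and state coverage) can be computed from a boolean mask. The following is a small sketch on toy data; the frame `toy` and its values are invented for illustration.

```python
import pandas as pd

# Toy frame standing in for the merged dataset (illustrative values only).
toy = pd.DataFrame({
    "state": ["Washington", "Illinois", "Illinois", "Puerto Rico"],
    "fips": [53061.0, 17031.0, float("nan"), float("nan")],
})

missing = toy["fips"].isna()            # True where the FIPS code is absent
n_missing = int(missing.sum())          # number of rows without a FIPS code
share_missing = float(missing.mean())   # fraction of rows without a FIPS code
n_states_with_fips = toy.loc[~missing, "state"].nunique()

print(n_missing, share_missing, n_states_with_fips)
```

Counting distinct `state` values with `nunique()` avoids summing every column just to read off the number of groups.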

### Dividing into Urban, Suburban, and Rural

In [32]:
with_categories_left_2.loc[with_categories_left_2["2013 code"] == 1, "category"] = "Urban"
with_categories_left_2.loc[with_categories_left_2["2013 code"] == 2, "category"] = "Suburban"
with_categories_left_2.loc[with_categories_left_2["2013 code"] == 3, "category"] = "Suburban"
with_categories_left_2.loc[with_categories_left_2["2013 code"] == 4, "category"] = "Suburban"
with_categories_left_2.loc[with_categories_left_2["2013 code"] == 5, "category"] = "Rural"
with_categories_left_2.loc[with_categories_left_2["2013 code"] == 6, "category"] = "Rural"
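
The six assignments above amount to a lookup from code to category, which could be written more compactly with `Series.map`. This is a sketch on toy data, with `category_by_code` and `toy` as hypothetical names.

```python
import pandas as pd

# Collapse the six NCHS codes into the three categories described earlier.
category_by_code = {
    1: "Urban",
    2: "Suburban", 3: "Suburban", 4: "Suburban",
    5: "Rural", 6: "Rural",
}

# After the left merge the code column is float because unmatched rows hold NaN.
toy = pd.DataFrame({"2013 code": [1.0, 2.0, 6.0, float("nan")]})

# Codes missing from the dict (including NaN) map to NaN, as with the .loc version.
toy["category"] = toy["2013 code"].map(category_by_code)
print(toy)
```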

In [33]:
with_categories_left_2.head()

Unnamed: 0,date,county,state,fips,cases,deaths,FIPS code,2013 code,category
0,2020-01-21,Snohomish,Washington,53061.0,1,0.0,53061.0,2.0,Suburban
1,2020-01-22,Snohomish,Washington,53061.0,1,0.0,53061.0,2.0,Suburban
2,2020-01-23,Snohomish,Washington,53061.0,1,0.0,53061.0,2.0,Suburban
3,2020-01-24,Cook,Illinois,17031.0,1,0.0,17031.0,1.0,Urban
4,2020-01-24,Snohomish,Washington,53061.0,1,0.0,53061.0,2.0,Suburban


In [38]:
# Keep only rows with a FIPS code, then store it as a zero-padded 5-character string.
no_na = with_categories_left_2[with_categories_left_2["fips"].notnull()]
no_na = no_na.astype({"fips": "int64"})  # float -> int drops the trailing ".0"
no_na = no_na.astype({"fips": "string"})
no_na["fips"] = no_na["fips"].str.zfill(5)

no_na = no_na.astype({"date": "string"})
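
To see why the zero padding matters, consider a FIPS code from a state whose prefix starts with 0 (Alabama's Autauga County is 1001 as a number but "01001" as a five-character code). The sketch below reproduces the conversion on a toy frame; `toy` and `padded` are illustrative names.

```python
import pandas as pd

# Toy column: FIPS codes stored as floats, with one missing value.
toy = pd.DataFrame({"fips": [1001.0, 53061.0, float("nan")]})

padded = (
    toy.dropna(subset=["fips"])["fips"]  # drop rows without a FIPS code
    .astype("int64")                     # 1001.0 -> 1001
    .astype("string")                    # 1001 -> "1001"
    .str.zfill(5)                        # "1001" -> "01001"
)
print(padded.tolist())
```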

### Data visualization