## Categorizing US counties

In [1]:
import pandas as pd

### Import county category data
Source: NCHS Urban-Rural Classification Scheme for Counties

##### The NCHS has developed a six-level urban-rural classification scheme for U.S. counties and county-equivalent entities.
1. Large central metro: counties in metropolitan statistical areas (MSAs) of 1 million or more population that 1) contain the entire population of the largest principal city of the MSA, or 2) are completely contained within the largest principal city of the MSA, or 3) contain at least 250,000 residents of any principal city in the MSA.
2. Large fringe metro: counties in MSAs of 1 million or more population that do not qualify as large central metro.
3. Medium metro: counties in MSAs of 250,000-999,999 population.
4. Small metro: counties in MSAs of less than 250,000 population.
5. Micropolitan: counties in micropolitan statistical areas.
6. Noncore: counties not in micropolitan statistical areas.
    
##### We can further collapse these six categories into urban, suburban, and rural:
- Urban: 1 (large central metro)
- Suburban: 2, 3, and 4 (large fringe metro, medium metro, and small metro)
- Rural: 5 and 6 (micropolitan and noncore)
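
The collapse described above amounts to a lookup from the six NCHS codes to three labels. A minimal sketch follows; the names `NCHS_TO_CATEGORY` and `collapse_code` are illustrative and not part of the original notebook.

```python
# Lookup from the six NCHS 2013 codes to the three collapsed categories.
# The names used here are illustrative, not taken from the notebook.
NCHS_TO_CATEGORY = {
    1: "Urban",     # large central metro
    2: "Suburban",  # large fringe metro
    3: "Suburban",  # medium metro
    4: "Suburban",  # small metro
    5: "Rural",     # micropolitan
    6: "Rural",     # noncore
}


def collapse_code(code):
    """Return the collapsed category, or None if the code is missing or unknown."""
    try:
        return NCHS_TO_CATEGORY.get(int(code))
    except (TypeError, ValueError):
        # None and NaN cannot be converted to int, so they have no category.
        return None


print(collapse_code(1))     # Urban
print(collapse_code(4.0))   # Suburban
print(collapse_code(None))  # None
```

Accepting floats matters because a left merge turns the code column into floats whenever some rows have no match.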
    

In [2]:
county_categories = pd.read_excel("NCHSURCodes2013.xlsx")
print(county_categories.shape)
county_categories.head()

(3149, 9)


Unnamed: 0,FIPS code,State Abr.,County name,CBSA title,CBSA 2012 pop,County 2012 pop,2013 code,2006 code,1990-based code
0,1001,AL,Autauga County,"Montgomery, AL",377149,55514,3,3,3
1,1003,AL,Baldwin County,"Daphne-Fairhope-Foley, AL",190790,190790,4,5,3
2,1005,AL,Barbour County,,.,27201,6,5,5
3,1007,AL,Bibb County,"Birmingham-Hoover, AL",1136650,22597,2,2,6
4,1009,AL,Blount County,"Birmingham-Hoover, AL",1136650,57826,2,2,3


In [3]:
selected = county_categories[["FIPS code", "2013 code"]]

### Import County-level Covid Data
Source: NY Times (https://github.com/nytimes/covid-19-data)

In [4]:
url = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"

covid_by_counties = pd.read_csv(url, on_bad_lines="skip")  # skip malformed rows instead of raising

print(covid_by_counties.shape)
covid_by_counties.head()

(1212589, 6)


Unnamed: 0,date,county,state,fips,cases,deaths
0,2020-01-21,Snohomish,Washington,53061.0,1,0.0
1,2020-01-22,Snohomish,Washington,53061.0,1,0.0
2,2020-01-23,Snohomish,Washington,53061.0,1,0.0
3,2020-01-24,Cook,Illinois,17031.0,1,0.0
4,2020-01-24,Snohomish,Washington,53061.0,1,0.0


In [5]:
# Drop US territories so that only the 50 states and DC remain.
territories = ["Guam", "Virgin Islands", "Puerto Rico", "Northern Mariana Islands"]
covid_by_counties = covid_by_counties[~covid_by_counties["state"].isin(territories)]

### Merging and Fixing

The first merge attempt shows that three named places in this dataset come without the FIPS code that identifies a county: New York City, Kansas City, and Joplin. Rows whose county is "Unknown" also lack a code and cannot be assigned to a county.

We will fill in the FIPS code for New York City with that of New York County (Manhattan), one of the five counties that make up the city.

Joplin (Missouri) is a city in both Jasper and Newton County. Since both are designated by the NCHS system as small metro counties and Joplin is mostly in Jasper County, we will fill in the Joplin FIPS code with the Jasper County FIPS code.

Kansas City (Missouri) spans several counties. The bulk of it is in Jackson County, but parts of it lie in Clay, Cass, and Platte Counties. We will fill in the Kansas City FIPS code with that of Jackson County.
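
One way to surface unmatched rows in a merge like the one below is the `indicator=True` option of `pd.merge`, which adds a `_merge` column recording whether each row found a partner. The sketch uses small made-up frames (`covid_toy`, `codes_toy`) rather than the real data, and is an alternative to the null check used in this notebook, not what the notebook itself does.

```python
import pandas as pd

# Toy stand-ins for the Covid table and the NCHS code table (values are illustrative).
covid_toy = pd.DataFrame({
    "county": ["Snohomish", "New York City", "Joplin"],
    "fips": [53061.0, None, None],
    "cases": [1, 10, 2],
})
codes_toy = pd.DataFrame({"FIPS code": [53061.0, 36061.0], "2013 code": [2, 1]})

merged = pd.merge(covid_toy, codes_toy, left_on="fips", right_on="FIPS code",
                  how="left", indicator=True)

# Rows marked "left_only" found no match in the code table.
unmatched = merged.loc[merged["_merge"] == "left_only", "county"].tolist()
print(unmatched)
```

Unlike checking `fips` for nulls, this also catches rows whose FIPS code is present but absent from the code table.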

In [6]:
with_categories_left_1 = pd.merge(covid_by_counties, selected, left_on = "fips", right_on = "FIPS code", how = "left")
print(with_categories_left_1.shape)
with_categories_left_1[with_categories_left_1["fips"].isnull()].groupby("county").sum(numeric_only=True)

(1183224, 8)


county,fips,cases,deaths,FIPS code,2013 code
Joplin,0.0,828153,15496.0,0.0,0.0
Kansas City,0.0,6369699,80066.0,0.0,0.0
New York City,0.0,134845544,8859621.0,0.0,0.0
Unknown,0.0,12053242,263734.0,0.0,0.0


In [7]:
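# Places reported without a county FIPS code: assign one representative county to each.
covid_by_counties.loc[covid_by_counties["county"] == "New York City", "fips"] = 36061  # New York County, NY
covid_by_counties.loc[covid_by_counties["county"] == "Joplin", "fips"] = 29097  # Jasper County, MO
covid_by_counties.loc[covid_by_counties["county"] == "Kansas City", "fips"] = 29095  # Jackson County, MO
# Combined Alaska areas: assign the code of one constituent area to each.
covid_by_counties.loc[covid_by_counties["county"] == "Bristol Bay plus Lake and Peninsula", "fips"] = 2164  # Lake and Peninsula Borough
covid_by_counties.loc[covid_by_counties["county"] == "Yakutat plus Hoonah-Angoon", "fips"] = 2105  # Hoonah-Angoon Census Area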

In [8]:
with_categories_left_2 = pd.merge(covid_by_counties, selected, left_on = "fips", right_on = "FIPS code", how = "left")
print(with_categories_left_2.shape)
with_categories_left_2[with_categories_left_2["fips"].isnull()].groupby("county").sum(numeric_only=True)

(1183224, 8)


county,fips,cases,deaths,FIPS code,2013 code
Unknown,0.0,12053242,263734.0,0.0,0.0


In [9]:
missing_fips = with_categories_left_2["fips"].isnull()
print(missing_fips.sum())  # number of rows without a FIPS code
print(missing_fips.sum() / with_categories_left_2.shape[0])  # share of rows without a FIPS code
print(with_categories_left_2[missing_fips].groupby("state").sum(numeric_only=True).shape)
print(with_categories_left_2[~missing_fips].groupby("state").sum(numeric_only=True).shape)

8945
0.007559853417442513
(49, 5)
(51, 5)


After fixing the FIPS codes above, only 8,945 of the 1,183,224 rows still lack a county code, about 0.756% of the data. These are the rows whose county is "Unknown". The rows that do have a code fall into 51 state-level groups, which is consistent with all 50 states plus the District of Columbia being represented.

### Import US Population Data at the County Level

Source: https://www.ers.usda.gov/data-products/county-level-data-sets/download-data/



In [11]:
population = pd.read_excel("PopulationEstimates.xls")
print(population.shape)
population.head()

(3273, 165)


Unnamed: 0,FIPStxt,State,Area_Name,Rural-urban_Continuum Code_2003,Rural-urban_Continuum Code_2013,Urban_Influence_Code_2003,Urban_Influence_Code_2013,Economic_typology_2015,CENSUS_2010_POP,ESTIMATES_BASE_2010,...,R_DOMESTIC_MIG_2019,R_NET_MIG_2011,R_NET_MIG_2012,R_NET_MIG_2013,R_NET_MIG_2014,R_NET_MIG_2015,R_NET_MIG_2016,R_NET_MIG_2017,R_NET_MIG_2018,R_NET_MIG_2019
0,0,US,United States,,,,,,308745538,308758105,...,,,,,,,,,,
1,1000,AL,Alabama,,,,,,4779736,4780125,...,1.917501,0.578434,1.186314,1.522549,0.563489,0.626357,0.745172,1.090366,1.773786,2.483744
2,1001,AL,Autauga County,2.0,2.0,2.0,2.0,0.0,54571,54597,...,4.84731,6.018182,-6.226119,-3.902226,1.970443,-1.712875,4.777171,0.849656,0.540916,4.560062
3,1003,AL,Baldwin County,4.0,3.0,5.0,2.0,5.0,182265,182265,...,24.017829,16.64187,17.488579,22.751474,20.184334,17.725964,21.279291,22.398256,24.727215,24.380567
4,1005,AL,Barbour County,6.0,6.0,6.0,6.0,3.0,27457,27455,...,-5.690302,0.292676,-6.897817,-8.132185,-5.140431,-15.724575,-18.238016,-24.998528,-8.754922,-5.165664


In [12]:
# Drop Puerto Rico and the national total, then keep only rows with a 2013 rural-urban code.
population = population[~population["State"].isin(["PR", "US"])]
population = population[population["Rural-urban_Continuum Code_2013"].notnull()]

In [13]:
pop_select = population[["FIPStxt", "POP_ESTIMATE_2019"]]

In [14]:
pop_merge = pd.merge(with_categories_left_2, pop_select, left_on = "fips", right_on = "FIPStxt", how = "left")

### Dividing into Urban, Suburban, and Rural

In [15]:
pop_merge.loc[pop_merge["2013 code"] == 1, "category"] = "Urban"
pop_merge.loc[pop_merge["2013 code"] == 2, "category"] = "Suburban"
pop_merge.loc[pop_merge["2013 code"] == 3, "category"] = "Suburban"
pop_merge.loc[pop_merge["2013 code"] == 4, "category"] = "Suburban"
pop_merge.loc[pop_merge["2013 code"] == 5, "category"] = "Rural"
pop_merge.loc[pop_merge["2013 code"] == 6, "category"] = "Rural"

In [16]:
pop_merge = pop_merge.drop(["FIPStxt"], axis = 1)

In [17]:
# Convert fips from float to a zero-padded, five-character string (e.g. 1001.0 -> "01001").
no_na = pop_merge[pop_merge["fips"].notnull()].copy()
no_na["fips"] = no_na["fips"].astype("int64").astype("string").str.zfill(5)

no_na = no_na.astype({"date": "string"})
no_na.head()

Unnamed: 0,date,county,state,fips,cases,deaths,FIPS code,2013 code,POP_ESTIMATE_2019
0,2020-01-21,Snohomish,Washington,53061,1,0.0,53061.0,2.0,822083.0
1,2020-01-22,Snohomish,Washington,53061,1,0.0,53061.0,2.0,822083.0
2,2020-01-23,Snohomish,Washington,53061,1,0.0,53061.0,2.0,822083.0
3,2020-01-24,Cook,Illinois,17031,1,0.0,17031.0,1.0,5150233.0
4,2020-01-24,Snohomish,Washington,53061,1,0.0,53061.0,2.0,822083.0
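
The zero-padding step in the cell above can be checked in isolation on a few values. The helper name `fips_to_text` is illustrative and does not appear in the notebook.

```python
def fips_to_text(fips):
    """Convert a numeric FIPS code such as 1001.0 to a five-character string."""
    # int() drops the ".0" left by the float column; zfill restores leading zeros.
    return str(int(fips)).zfill(5)


print(fips_to_text(1001.0))   # 01001
print(fips_to_text(53061.0))  # 53061
print(fips_to_text(2164))     # 02164
```

The leading zero matters for states whose FIPS prefix is below 10, such as Alabama (01) and Alaska (02).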


In [21]:
#no_na["percent_covid"] = no_na["cases"]/no_na["POP_ESTIMATE_2019"]
#no_na.head()

### Analysis

In [32]:
before_classes = no_na[no_na["date"] < "2020-08-01"][["date", "fips", "cases", "deaths"]]
before_classes.groupby("fips").sum(numeric_only=True)
#before_classes.tail()

fips,cases,deaths
01001,39785,931.0
01003,77258,992.0
01005,24648,155.0
01007,13685,103.0
01009,19359,82.0
...,...,...
56037,7598,34.0
56039,14958,100.0
56041,9787,4.0
56043,3102,291.0
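
The `cases` and `deaths` columns in the NYT county file are cumulative totals as of each date, so summing them across dates counts earlier cases repeatedly. If the goal is each county's total up to the cutoff date, one option is to take the last reported value before the cutoff. A sketch on made-up numbers, where `toy` and `latest_before` are illustrative names:

```python
import pandas as pd

# Toy cumulative counts for two counties (values are illustrative).
toy = pd.DataFrame({
    "date": ["2020-07-30", "2020-07-31", "2020-08-01", "2020-07-31"],
    "fips": ["01001", "01001", "01001", "01003"],
    "cases": [10, 12, 15, 7],
    "deaths": [1, 1, 2, 0],
})

# ISO-formatted date strings sort chronologically, so string comparison works here.
before = toy[toy["date"] < "2020-08-01"]

# Take each county's most recent cumulative value before the cutoff.
latest_before = before.sort_values("date").groupby("fips")[["cases", "deaths"]].last()
print(latest_before)
```

In this toy data, summing the rows before the cutoff would give 22 cases for county 01001, while its cumulative count on the last day before the cutoff is 12.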
