# Census Population Data

The goal of this notebook is to pull in census population data and make it available to the dashboard, specifically so that the dashboard can report cases per capita and deaths per capita.

Sections:
 - [Notebook Setup](#1---Notebook-Setup) - Change directory and import modules.
 - [Import Data](#2---Import-Data) - Import data directly from the data source.
 - [Cleaning and Data Checks](#3---Cleaning-and-Data-Checks) - Clean columns and check consistency.
 - [Explore Data](#4---Explore-Data) - Explore the data and build a function to export the census dataframe.

---

## 1 - Notebook Setup

Let's start by checking the current working directory.

In [1]:
pwd

'/Users/DanOvadia/Projects/covid-hotspots/notebooks'

To import the custom modules, we need to be in the root directory of the project. These notebooks are stored in `notebooks/` for cleanliness, so we move up one level.

In [2]:
cd ..

/Users/DanOvadia/Projects/covid-hotspots


### Import Libraries

In [3]:
import pandas as pd

### Import Custom Modules

In [4]:
from modules import data_processing
from modules import plotting

---

## 2 - Import Data

Import the data directly from census.gov.

The data come from the 2019 population estimates for all counties published on census.gov, which are available for direct download. The CSV is encoded as ISO-8859-1, so we pass `encoding = "ISO-8859-1"` when reading it.

 - [Census.gov datasets](https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/totals/)
 - [Estimated 2019 Population direct download CSV (3.5MB)](https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/totals/co-est2019-alldata.csv)

In [5]:
URL = 'https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/totals/co-est2019-alldata.csv'
CENSUS_DF = pd.read_csv(URL, encoding = "ISO-8859-1")
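Because `pd.read_csv` infers `STATE` and `COUNTY` as integers, any leading zeros in those codes are lost on import. As a hedged alternative, the `dtype` argument can keep both columns as strings. The sketch below runs on a small in-memory sample instead of the real download: the rows and population values are made up for illustration, and it assumes the raw file stores zero-padded codes.

```python
import io

import pandas as pd

# Made-up sample rows that mimic the census column layout (illustrative values only).
sample_text = (
    "SUMLEV,STATE,COUNTY,STNAME,CTYNAME,POPESTIMATE2019\n"
    "050,01,001,Alabama,Autauga County,100\n"
    "050,35,013,New Mexico,Doña Ana County,200\n"
)

# Encode the sample as ISO-8859-1 to mimic the encoding of the census download.
sample_bytes = io.BytesIO(sample_text.encode("ISO-8859-1"))

sample_df = pd.read_csv(
    sample_bytes,
    encoding="ISO-8859-1",
    dtype={"STATE": str, "COUNTY": str},  # keep the codes as zero-padded strings
)

# With string codes, the FIPS code is a plain concatenation.
sample_df["FIPS"] = sample_df["STATE"] + sample_df["COUNTY"]
print(sample_df[["CTYNAME", "FIPS"]])
```

Reading the codes as strings would make the padding function later in this notebook unnecessary, at the cost of a slightly longer `read_csv` call.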

### Assess raw data

Let's take a look at the shape and the first few rows.

In [6]:
CENSUS_DF.shape

(3193, 164)

In [7]:
CENSUS_DF.head(10)

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2019,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015,RNETMIG2016,RNETMIG2017,RNETMIG2018,RNETMIG2019
0,40,3,6,1,0,Alabama,Alabama,4779736,4780125,4785437,...,1.917501,0.578434,1.186314,1.522549,0.563489,0.626357,0.745172,1.090366,1.773786,2.483744
1,50,3,6,1,1,Alabama,Autauga County,54571,54597,54773,...,4.84731,6.018182,-6.226119,-3.902226,1.970443,-1.712875,4.777171,0.849656,0.540916,4.560062
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183112,...,24.017829,16.64187,17.488579,22.751474,20.184334,17.725964,21.279291,22.398256,24.727215,24.380567
3,50,3,6,1,5,Alabama,Barbour County,27457,27455,27327,...,-5.690302,0.292676,-6.897817,-8.132185,-5.140431,-15.724575,-18.238016,-24.998528,-8.754922,-5.165664
4,50,3,6,1,7,Alabama,Bibb County,22915,22915,22870,...,1.385134,-4.998356,-3.787545,-5.797999,1.331144,1.329817,-0.708717,-3.234669,-6.857092,1.831952
5,50,3,6,1,9,Alabama,Blount County,57322,57322,57376,...,1.020788,0.208812,-1.650165,-0.347225,-2.04959,-1.338525,-1.391062,6.193562,-0.069229,1.124597
6,50,3,6,1,11,Alabama,Bullock County,10914,10911,10876,...,-7.102343,-21.994339,-6.95456,-6.3342,9.617198,-24.592888,-2.212709,-19.936786,-1.474201,-7.200986
7,50,3,6,1,13,Alabama,Butler County,20947,20940,20932,...,-7.216152,-3.636538,-7.896764,-14.478623,0.442445,-6.223913,-6.272714,-4.962406,-10.071105,-6.294941
8,50,3,6,1,15,Alabama,Calhoun County,118572,118526,118408,...,-4.167837,-6.148582,-4.588523,-5.255477,-4.027747,-3.215406,-3.471589,-1.654454,-0.488995,-4.044995
9,50,3,6,1,17,Alabama,Chambers County,34215,34169,34122,...,-7.927723,-1.379209,4.490952,2.520405,-3.787656,1.324055,-5.786747,2.134851,-1.515444,-7.748227


---

## 3 - Cleaning and Data Checks

### FIPS Codes

In order to incorporate these data into the COVID dashboard, we'll need to tie them to counties via FIPS codes. The state and county codes are available above as integers, but we need them combined and stored as a five-digit string.

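Below I create a function to apply to the dataframe to produce the new column. Note that it does not validate its inputs: a state code longer than two digits or a county code longer than three digits would produce a string longer than five characters.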

In [8]:
# Function to generate a FIPS string from a state int and a county int.
def generate_fips(state_fips, county_fips):
    state_str, county_str = str(state_fips), str(county_fips)

    # Left-pad the state code with a zero if it has only one digit.
    if len(state_str) == 1:
        state_str = "0"+state_str

    # Left-pad the county code with zeros if it has fewer than three digits.
    if len(county_str) == 1:
        county_str = "00"+county_str
    elif len(county_str) == 2:
        county_str = "0"+county_str

    return state_str+county_str

# Quick test
state = 1
county = 22
generate_fips(state, county)

'01022'
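Since `generate_fips` above does not guard against codes that are too long, a stricter variant may be useful. The sketch below is a hypothetical alternative, not part of the project's modules: the name `generate_fips_strict` is made up here, and it uses `str.zfill` for the padding plus a range check that raises on unexpected input.

```python
def generate_fips_strict(state_fips, county_fips):
    """Hypothetical stricter variant: pad with zfill and reject out-of-range codes."""
    state_int, county_int = int(state_fips), int(county_fips)

    # A state code must fit in two digits and a county code in three.
    if not (0 <= state_int <= 99 and 0 <= county_int <= 999):
        raise ValueError(
            f"Code out of range: state={state_fips}, county={county_fips}"
        )

    # zfill left-pads the string with zeros up to the requested width.
    return str(state_int).zfill(2) + str(county_int).zfill(3)


print(generate_fips_strict(1, 22))  # '01022'
```

Raising an error early keeps a malformed code from silently producing a six-character key that would fail to match anything in a later join.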

In [9]:
# Use apply across rows to generate fips codes.
CENSUS_DF['FIPS'] = CENSUS_DF[['STATE','COUNTY']].apply(
    lambda x:generate_fips(
        x['STATE'],x['COUNTY']), axis = 1)
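The row-wise `apply` works fine at this size. For reference, the same column can also be built with pandas' vectorized string methods, which avoids calling a Python function once per row. This is a minimal sketch on a made-up three-row frame named `demo`, not the census dataframe itself.

```python
import pandas as pd

# Small stand-in frame with integer codes, as in the imported census data.
demo = pd.DataFrame({"STATE": [1, 1, 48], "COUNTY": [0, 3, 201]})

# Convert each column to string, left-pad with zeros, then concatenate.
demo["FIPS"] = (
    demo["STATE"].astype(str).str.zfill(2)
    + demo["COUNTY"].astype(str).str.zfill(3)
)

print(demo["FIPS"].tolist())  # ['01000', '01003', '48201']
```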

### Length Check

Let's check that our transformation was clean from a length perspective.

In [10]:
# Compute the length of each FIPS value, producing a series,
#  and check the max.
CENSUS_DF['FIPS'].map(lambda x:len(str(x))).max()

5

In [11]:
# Do the same for the min.
CENSUS_DF['FIPS'].map(lambda x:len(str(x))).min()

5
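The two checks above confirm the length but not that every character is a digit. A slightly stronger check, sketched below on a made-up series, uses `.str.len()` for the lengths and `.str.fullmatch` (available in pandas 1.1 and later) to flag any value that is not exactly five digits.

```python
import pandas as pd

# Made-up values; the last one is deliberately malformed.
fips = pd.Series(["01000", "01001", "1003"])

# Vectorized length check, equivalent to mapping len over each value.
lengths = fips.str.len()
print(lengths.min(), lengths.max())  # 4 5

# Flag anything that is not exactly five digits.
is_valid = fips.str.fullmatch(r"\d{5}")
print(fips[~is_valid].tolist())  # ['1003']
```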

So we've got 3193 rows (counties plus state-level total rows), let's make sure we have the same number of unique FIPS codes.

### Uniqueness Check

In [12]:
CENSUS_DF.shape

(3193, 165)

In [13]:
len(set(CENSUS_DF['FIPS']))

3193

Yup, we look good to go.
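Comparing the row count with the size of a set works. If the counts ever disagree, it helps to see which codes are repeated. The sketch below shows one way to do that with `Series.is_unique` and `Series.duplicated`, using a made-up frame with one deliberate duplicate.

```python
import pandas as pd

# Made-up frame with a deliberately repeated FIPS code.
demo = pd.DataFrame({
    "FIPS": ["01001", "01003", "01003"],
    "CTYNAME": ["Autauga County", "Baldwin County", "Baldwin County"],
})

# True only when every FIPS value appears exactly once.
print(demo["FIPS"].is_unique)  # False

# keep=False marks every occurrence of a repeated value, not just the later ones.
dupes = demo[demo["FIPS"].duplicated(keep=False)]
print(dupes)
```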

---

## 4 - Explore Data

Selecting a handful of columns to persist.

In [14]:
FEATURES = [
    'SUMLEV',
    'REGION',
    'DIVISION',
    'STATE',
    'COUNTY',
    'STNAME',
    'CTYNAME',
    'POPESTIMATE2019',
    'CENSUS2010POP', 
    'FIPS' # calculated column
]

In [15]:
CENSUS_DF[FEATURES].head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,POPESTIMATE2019,CENSUS2010POP,FIPS
0,40,3,6,1,0,Alabama,Alabama,4903185,4779736,01000
1,50,3,6,1,1,Alabama,Autauga County,55869,54571,01001
2,50,3,6,1,3,Alabama,Baldwin County,223234,182265,01003
3,50,3,6,1,5,Alabama,Barbour County,24686,27457,01005
4,50,3,6,1,7,Alabama,Bibb County,22394,22915,01007
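Note that the first row above is a state-level total (`SUMLEV` 40, `COUNTY` 0) rather than a county, so it should probably be excluded before any county-level join. The sketch below shows how the per capita figures might then be computed. The `cases` frame, its column names, and its counts are hypothetical stand-ins, not the dashboard's actual schema; the population values are the ones shown above.

```python
import pandas as pd

# Population rows taken from the output above.
census = pd.DataFrame({
    "SUMLEV": [40, 50, 50],
    "FIPS": ["01000", "01001", "01003"],
    "POPESTIMATE2019": [4903185, 55869, 223234],
})

# Hypothetical case counts keyed by FIPS (made-up numbers).
cases = pd.DataFrame({"FIPS": ["01001", "01003"], "cases": [100, 500]})

# Keep county-level rows only; SUMLEV 40 rows are state totals.
counties = census[census["SUMLEV"] == 50]

# validate raises if FIPS is not unique on either side of the join.
merged = counties.merge(cases, on="FIPS", how="left", validate="one_to_one")

# Cases per 100,000 residents.
merged["cases_per_100k"] = merged["cases"] / merged["POPESTIMATE2019"] * 100_000
print(merged[["FIPS", "cases_per_100k"]])
```

A left join keeps every county even when no case data exist for it; those rows end up with `NaN` in the per capita column rather than disappearing.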


Below is a list of all the columns that come from the data.

In [16]:
# list out all columns for reference and availability
CENSUS_DF.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3193 entries, 0 to 3192
Data columns (total 165 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   SUMLEV                 int64  
 1   REGION                 int64  
 2   DIVISION               int64  
 3   STATE                  int64  
 4   COUNTY                 int64  
 5   STNAME                 object 
 6   CTYNAME                object 
 7   CENSUS2010POP          int64  
 8   ESTIMATESBASE2010      int64  
 9   POPESTIMATE2010        int64  
 10  POPESTIMATE2011        int64  
 11  POPESTIMATE2012        int64  
 12  POPESTIMATE2013        int64  
 13  POPESTIMATE2014        int64  
 14  POPESTIMATE2015        int64  
 15  POPESTIMATE2016        int64  
 16  POPESTIMATE2017        int64  
 17  POPESTIMATE2018        int64  
 18  POPESTIMATE2019        int64  
 19  NPOPCHG_2010           int64  
 20  NPOPCHG_2011           int64  
 21  NPOPCHG_2012           int64  
 22  NPOPCHG_2013           