### Data Cleaning: Workers Residence Area Characteristics

By ADA Group 1

In this Jupyter Notebook, we clean the data obtained from OnTheMap (U.S. Census Bureau, Center for Economic Studies): https://onthemap.ces.census.gov/

The specific dataset we retrieved from OnTheMap is the workers' Residence Area Characteristics file from the Longitudinal Employer-Household Dynamics (LEHD) Origin-Destination Employment Statistics. https://lehd.ces.census.gov/data/

We chose this dataset because it is aggregated to Census Blocks, the smallest geographic aggregation available.

The raw files contain workers' age, income, race, and educational attainment, divided into categories and aggregated at the Census Block level for 2002 and 2010. Each value is the number of workers in that category in a given Block. This notebook keeps only the relevant variables, the income and educational attainment categories, and then calculates the percentage change between 2002 and 2010.

The raw files cover all Blocks in New York State. We filter them down to New York City by performing an inner join with our previously cleaned NYC Census Blocks dataset.


### Import Packages

In [1]:
# visualization
%pylab inline
# import the packages
# numpy for array and matrix computation
import numpy as np

# pandas for data analysis
import pandas as pd

# matplotlib and seaborn are the data visualization packages
import matplotlib.pyplot as plt
import seaborn as sns

# sqlalchemy and psycopg2 are SQL connection packages
from sqlalchemy import create_engine

# configure pandas display: set the maximum number of columns displayed to 25
pd.options.display.max_columns = 25

# use the __future__ version of division and print
from __future__ import division, print_function

# gzip and csv for unzipping compressed files
import gzip
import csv

import warnings
warnings.filterwarnings('ignore')

Populating the interactive namespace from numpy and matplotlib




### Workers Data Extract and Import

In [2]:
# Unzip the data and save it in the shared folder
# Data for Workers Residence in 2002 at the Block level (the earliest available)
with gzip.open("Data/ny_rac_S000_JT00_2002.csv.gz", 'rt') as f_in:
    data = f_in.read()
    # Use a separate name for the output handle so it does not shadow the input handle
    with open("Data/ny_rac_S000_JT00_2002.csv", 'wt') as f_out:
        f_out.write(data)

In [3]:
# Data for Workers Residence in 2010 at the Block level
with gzip.open("Data/ny_rac_S000_JT01_2010.csv.gz", 'rt') as f_in:
    data = f_in.read()
    # Use a separate name for the output handle so it does not shadow the input handle
    with open("Data/ny_rac_S000_JT01_2010.csv", 'wt') as f_out:
        f_out.write(data)

In [4]:
#Import the data for 2002
ny_worker_2002 = pd.read_csv("Data/ny_rac_S000_JT00_2002.csv", dtype= {'h_geocode': str})
ny_worker_2002.shape
#238,021 observations, 43 variables
#Data contains all NY State Blocks

(238021, 43)
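
The unzip-then-read steps above could also be collapsed into one call, since `pd.read_csv` can read gzip-compressed files directly. The following is a minimal sketch, not the approach used in this notebook; the file name `example_rac.csv.gz` and its single row are made up so that the example runs without the real data.

```python
import gzip
import os
import tempfile

import pandas as pd

# Build a tiny gzip-compressed CSV so the sketch runs without the real file.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "example_rac.csv.gz")  # hypothetical file name
with gzip.open(gz_path, "wt") as gz_file:
    gz_file.write("h_geocode,CE01,CE02,CE03\n360850323001012,30,52,27\n")

# pandas infers gzip compression from the ".gz" extension,
# so no intermediate uncompressed copy is needed.
example_rac = pd.read_csv(gz_path, dtype={"h_geocode": str})
print(example_rac)
```

Reading the compressed file directly avoids keeping a second, uncompressed copy of each file in the shared folder.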

In [5]:
ny_worker_2010 = pd.read_csv("Data/ny_rac_S000_JT01_2010.csv", dtype= {'h_geocode': str})
ny_worker_2010.shape
#238,684 observations, 43 variables

(238684, 43)
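
Both files are read with `h_geocode` as a string. One reason this helps is that the 15-character Census Block code is a concatenation of fixed-width parts (2-digit state, 3-digit county, 6-digit tract, 4-digit block), which are easy to slice from a string. The sketch below illustrates this with two codes taken from the outputs in this notebook; the `parts` table is only an illustration and is not used later.

```python
import pandas as pd

# Two block codes that appear in the outputs of this notebook.
geocodes = pd.Series(["361231505004017", "360850323001012"], name="h_geocode")

# Slice the fixed-width components out of the string identifier.
parts = pd.DataFrame({
    "state": geocodes.str[:2],
    "county": geocodes.str[2:5],
    "tract": geocodes.str[5:11],
    "block": geocodes.str[11:],
})
print(parts)

# Converting a component to an integer drops its leading zero,
# which is why the identifier is kept as a string.
county_as_int = int(parts.loc[1, "county"])
print(county_as_int)
```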

### Subset variables
Select the relevant variables first, and then subset to the NYC-only observations.

#### Variables of Interest in Workers Data


1. CE01 : Number of jobs with earnings of 1,250 USD/month or less
2. CE02 : Number of jobs with earnings of 1,251 USD/month to 3,333 USD/month
3. CE03 : Number of jobs with earnings greater than 3,333 USD/month
4. CD01 : Number of jobs for workers with Educational Attainment: Less than high school
5. CD02 : Number of jobs for workers with Educational Attainment: High School or Equivalent
6. CD03 : Number of jobs for workers with Educational Attainment: Some college or Associate Degree
7. CD04 : Number of jobs for workers with Educational Attainment: Bachelor's degree or Advanced Degree

In [6]:
# Create a subset vector to be referenced afterwards.
subset = ['h_geocode', 'CE01', 'CE02', 'CE03', 'CD01', 'CD02', 'CD03', 'CD04']

In [7]:
# Use the vector [subset] created above
ny_worker_2002_working = ny_worker_2002[subset]
ny_worker_2002_working.tail()

Unnamed: 0,h_geocode,CE01,CE02,CE03,CD01,CD02,CD03,CD04
238016,361231505004017,2,9,2,0,0,0,0
238017,361231505004019,3,6,4,0,0,0,0
238018,361231505004020,0,0,1,0,0,0,0
238019,361231505004021,1,1,0,0,0,0,0
238020,361231505004022,0,1,2,0,0,0,0


In [8]:
ny_worker_2010_working = ny_worker_2010[subset]
ny_worker_2010_working.tail()

Unnamed: 0,h_geocode,CE01,CE02,CE03,CD01,CD02,CD03,CD04
238679,361231505004016,0,2,1,0,2,0,1
238680,361231505004017,1,4,1,0,3,2,0
238681,361231505004019,2,2,2,1,3,0,0
238682,361231505004020,1,1,3,0,2,2,0
238683,361231505004022,2,1,1,0,2,1,0


### Subset Observations
Perform an inner join of the workers' data with our previously cleaned New York City Census Blocks data.

In [9]:
#Import only Block unique identifier from clean NYC Blocks data
blocks_nyc = pd.read_csv("Data/blocks_clean.csv", usecols=range(1,2), dtype= {'BLOCKID': str})
blocks_nyc.tail()

Unnamed: 0,BLOCKID
29348,360850134001015
29349,360850146042016
29350,360850121002001
29351,360470015003000
29352,360470015003000


In [10]:
#Inner join so that only observations that are in both datasets will be in the output
#Run for 2002
nyc_blocks_workers02 = ny_worker_2002_working.merge(blocks_nyc, left_on = 'h_geocode', 
                                                    right_on = 'BLOCKID', how = 'inner' )
nyc_blocks_workers02.shape
# 29,249 observations, 9 variables.

(29249, 9)

In [11]:
#Repeat inner join for 2010
nyc_blocks_workers10 = ny_worker_2010_working.merge(blocks_nyc, left_on = 'h_geocode', 
                                                    right_on = 'BLOCKID', how = 'inner' )
nyc_blocks_workers10.shape
# 29,266 observations, 9 variables.

(29266, 9)
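
The `blocks_nyc.tail()` output above shows the same `BLOCKID` on its last two rows. An inner join emits one output row per matching pair, so a repeated identifier on one side repeats the matching worker row. The sketch below uses toy stand-ins (`workers_toy`, `blocks_toy`, with made-up counts) to show how this could be checked and avoided; it is one possible safeguard, not a step this notebook performs.

```python
import pandas as pd

# Toy stand-ins for the notebook's tables; one BLOCKID value repeats,
# as in the blocks_nyc.tail() output. The CE03 counts are made up.
workers_toy = pd.DataFrame({
    "h_geocode": ["360470015003000", "360850121002001", "361231505004017"],
    "CE03": [5, 2, 1],
})
blocks_toy = pd.DataFrame({
    "BLOCKID": ["360850121002001", "360470015003000", "360470015003000"],
})

# Count repeated identifiers before joining.
n_duplicates = int(blocks_toy["BLOCKID"].duplicated().sum())

# A plain inner join keeps every matching pair, so the repeated block appears twice.
joined_raw = workers_toy.merge(
    blocks_toy, left_on="h_geocode", right_on="BLOCKID", how="inner"
)

# De-duplicating first, and asking merge to validate the key relationship,
# keeps one row per block and raises an error if duplicates remain.
joined_unique = workers_toy.merge(
    blocks_toy.drop_duplicates(subset="BLOCKID"),
    left_on="h_geocode", right_on="BLOCKID", how="inner", validate="one_to_one",
)
print(n_duplicates, len(joined_raw), len(joined_unique))
```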

In [13]:
# Create a subset vector to only keep one Block ID.
# Run for 2002.
subset2 = ['BLOCKID', 'CE01', 'CE02', 'CE03', 'CD01', 'CD02', 'CD03', 'CD04']
nyc_workers02_clean = nyc_blocks_workers02[subset2]
nyc_workers02_clean.tail()

Unnamed: 0,BLOCKID,CE01,CE02,CE03,CD01,CD02,CD03,CD04
29244,360850323001012,30,52,27,0,0,0,0
29245,360850323001013,1,2,0,0,0,0,0
29246,360850323001014,0,2,5,0,0,0,0
29247,360850323001015,2,4,2,0,0,0,0
29248,360850323001019,4,4,0,0,0,0,0


In [14]:
# Run for 2010.
nyc_workers10_clean = nyc_blocks_workers10[subset2]
nyc_workers10_clean.tail()

Unnamed: 0,BLOCKID,CE01,CE02,CE03,CD01,CD02,CD03,CD04
29261,360850323001012,38,53,69,15,25,41,45
29262,360850323001013,2,1,2,1,1,0,2
29263,360850323001014,1,1,6,1,1,3,2
29264,360850323001015,1,3,7,1,0,2,4
29265,360850323001019,0,5,7,0,6,3,2


### Join the Two Datasets
So far we have two datasets with workers' information: 2002 and 2010.
Here we join them, but first we rename the variables to better describe their contents and append the year to each of them in the format variableName_YYYY.


#### Rename Variables

In [15]:
# Rename all columns except BLOCKID, appending _YYYY (the four-digit observation year)
# Run for 2002
nyc_workers02_clean.columns = ['BLOCKID', 'Inc01_2002', 'Inc02_2002', 'Inc03_2002', 'Ed01_2002', 'Ed02_2002', 'Ed03_2002', 'Ed04_2002']
nyc_workers02_clean.tail()

Unnamed: 0,BLOCKID,Inc01_2002,Inc02_2002,Inc03_2002,Ed01_2002,Ed02_2002,Ed03_2002,Ed04_2002
29244,360850323001012,30,52,27,0,0,0,0
29245,360850323001013,1,2,0,0,0,0,0
29246,360850323001014,0,2,5,0,0,0,0
29247,360850323001015,2,4,2,0,0,0,0
29248,360850323001019,4,4,0,0,0,0,0
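
Assigning a full list to `.columns` depends on the column order staying exactly as expected. As an alternative sketch, the same renaming can be written as a mapping from the original codes to the new names; `rename_for_year` is a hypothetical helper introduced here for illustration, and the single row is taken from the 2010 output above.

```python
import pandas as pd

# Toy frame with the column codes used in this notebook.
toy = pd.DataFrame({
    "BLOCKID": ["360850323001012"],
    "CE01": [38], "CE02": [53], "CE03": [69],
    "CD01": [15], "CD02": [25], "CD03": [41], "CD04": [45],
})


def rename_for_year(frame, year):
    """Return a new frame with CE*/CD* columns renamed to Inc*/Ed* plus a _YYYY suffix."""
    mapping = {}
    for col in frame.columns:
        if col.startswith("CE"):
            mapping[col] = "Inc" + col[2:] + "_" + str(year)
        elif col.startswith("CD"):
            mapping[col] = "Ed" + col[2:] + "_" + str(year)
    # rename returns a new DataFrame and leaves the input untouched.
    return frame.rename(columns=mapping)


toy_2010 = rename_for_year(toy, 2010)
print(list(toy_2010.columns))
```

Because the mapping is keyed by name, a reordered or missing column cannot silently receive the wrong label.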


In [16]:
# Run for 2010
nyc_workers10_clean.columns = ['BLOCKID', 'Inc01_2010', 'Inc02_2010', 'Inc03_2010', 'Ed01_2010', 'Ed02_2010', 'Ed03_2010', 'Ed04_2010']
nyc_workers10_clean.tail()

Unnamed: 0,BLOCKID,Inc01_2010,Inc02_2010,Inc03_2010,Ed01_2010,Ed02_2010,Ed03_2010,Ed04_2010
29261,360850323001012,38,53,69,15,25,41,45
29262,360850323001013,2,1,2,1,1,0,2
29263,360850323001014,1,1,6,1,1,3,2
29264,360850323001015,1,3,7,1,0,2,4
29265,360850323001019,0,5,7,0,6,3,2


#### Join Datasets

In [17]:
#Inner join the two datasets
workers_02_10 = nyc_workers02_clean.merge(nyc_workers10_clean, left_on = 'BLOCKID', 
                                                    right_on = 'BLOCKID', how = 'inner' )
workers_02_10.shape
# 29,192 observations, 15 variables

(29192, 15)

In [18]:
workers_02_10.tail()

Unnamed: 0,BLOCKID,Inc01_2002,Inc02_2002,Inc03_2002,Ed01_2002,Ed02_2002,Ed03_2002,Ed04_2002,Inc01_2010,Inc02_2010,Inc03_2010,Ed01_2010,Ed02_2010,Ed03_2010,Ed04_2010
29187,360850323001012,30,52,27,0,0,0,0,38,53,69,15,25,41,45
29188,360850323001013,1,2,0,0,0,0,0,2,1,2,1,1,0,2
29189,360850323001014,0,2,5,0,0,0,0,1,1,6,1,1,3,2
29190,360850323001015,2,4,2,0,0,0,0,1,3,7,1,0,2,4
29191,360850323001019,4,4,0,0,0,0,0,0,5,7,0,6,3,2
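
The inner join keeps only Blocks present in both years, which is why the joined table has fewer rows than either yearly table. A minimal sketch of how the unmatched Blocks could be audited is shown below, using an outer join with pandas' `indicator` option. The frames `toy_2002` and `toy_2010` and their single-letter IDs are made up for illustration.

```python
import pandas as pd

# Toy stand-ins with made-up IDs: "A" appears only in 2002, "D" only in 2010.
toy_2002 = pd.DataFrame({"BLOCKID": ["A", "B", "C"], "Inc03_2002": [27, 0, 5]})
toy_2010 = pd.DataFrame({"BLOCKID": ["B", "C", "D"], "Inc03_2010": [2, 6, 7]})

# An outer join with indicator=True adds a "_merge" column recording whether
# each row came from the left table only, the right table only, or both.
audit = toy_2002.merge(toy_2010, on="BLOCKID", how="outer", indicator=True)
counts = audit["_merge"].value_counts()
print(counts)

# Rows flagged "both" are the rows an inner join would keep.
matched = audit[audit["_merge"] == "both"].drop(columns="_merge")
print(matched)
```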


### Create new variables
To measure gentrification, it is relevant to add the change in educational attainment and income of the population.
We focus on calculating the change for the higher-income and more highly educated workers, so we are measuring changes in the "gentrifying group". These variables will be used as outcome variables for our modelling.

In [19]:
# Create new variables with the change in the highly educated and high-income-bracket worker populations.
# The 0.1 added to each denominator avoids division by zero when the 2002 count is 0.
workers_02_10['pct_ch_hInc']  = (workers_02_10['Inc03_2010'] - workers_02_10['Inc03_2002'])/ (workers_02_10['Inc03_2002']+0.1)

workers_02_10['pct_ch_hEduc']  = (workers_02_10['Ed04_2010']-workers_02_10['Ed04_2002'])/(workers_02_10['Ed04_2002']+0.1)

workers_02_10.tail()

Unnamed: 0,BLOCKID,Inc01_2002,Inc02_2002,Inc03_2002,Ed01_2002,Ed02_2002,Ed03_2002,Ed04_2002,Inc01_2010,Inc02_2010,Inc03_2010,Ed01_2010,Ed02_2010,Ed03_2010,Ed04_2010,pct_ch_hInc,pct_ch_hEduc
29187,360850323001012,30,52,27,0,0,0,0,38,53,69,15,25,41,45,1.549815,450.0
29188,360850323001013,1,2,0,0,0,0,0,2,1,2,1,1,0,2,20.0,20.0
29189,360850323001014,0,2,5,0,0,0,0,1,1,6,1,1,3,2,0.196078,20.0
29190,360850323001015,2,4,2,0,0,0,0,1,3,7,1,0,2,4,2.380952,40.0
29191,360850323001019,4,4,0,0,0,0,0,0,5,7,0,6,3,2,70.0,20.0
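
The change variables are computed as (count in 2010 - count in 2002) / (count in 2002 + 0.1). When the 2002 count is 0, the result is simply ten times the 2010 count, which explains values such as 450.0 and 70.0 in the output above. The sketch below reproduces the last five rows and, as a comparison, shows one alternative that returns a missing value when the baseline is zero. `relative_change_nan` is an assumption added for illustration, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# The last five rows of the joined table, high-income and high-education columns only.
sample = pd.DataFrame({
    "Inc03_2002": [27, 0, 5, 2, 0],
    "Inc03_2010": [69, 2, 6, 7, 7],
    "Ed04_2002": [0, 0, 0, 0, 0],
    "Ed04_2010": [45, 2, 2, 4, 2],
})


def relative_change(new, old, offset=0.1):
    """(new - old) / (old + offset), the formula used in the cell above."""
    return (new - old) / (old + offset)


def relative_change_nan(new, old):
    """Alternative (an assumption): return NaN when the baseline count is zero."""
    return (new - old) / old.replace(0, np.nan)


sample["pct_ch_hInc"] = relative_change(sample["Inc03_2010"], sample["Inc03_2002"])
sample["pct_ch_hEduc"] = relative_change(sample["Ed04_2010"], sample["Ed04_2002"])
sample["pct_ch_hInc_nan"] = relative_change_nan(sample["Inc03_2010"], sample["Inc03_2002"])
print(sample.round(6))
```

Which treatment of zero baselines is appropriate depends on how the outcome variable is used in the later modelling.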


In [20]:
workers_02_10.to_csv("Data/workers_02_10_Clean.csv", encoding='utf8')
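
By default, `to_csv` also writes the row index as an unnamed first column, and a Block identifier read back without a `dtype` is parsed as an integer. The sketch below shows a round trip that avoids both; it is an optional variation, and the temporary path and single-row frame are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Single made-up row standing in for the cleaned table.
toy = pd.DataFrame({"BLOCKID": ["360850323001012"], "pct_ch_hInc": [1.549815]})
out_path = os.path.join(tempfile.mkdtemp(), "workers_example.csv")  # hypothetical path

# index=False leaves the row index out of the file, so no unnamed
# first column appears when the file is read back.
toy.to_csv(out_path, index=False, encoding="utf8")

# Re-read with BLOCKID as a string so the identifier is not parsed as an integer.
reloaded = pd.read_csv(out_path, dtype={"BLOCKID": str})
print(reloaded.dtypes)
```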