### Data Cleaning: Racial Composition by Blocks

By ADA Group 1

In this Jupyter Notebook, we clean data obtained from the Center for Urban Research at The Graduate Center, City University of New York (CUNY):
http://www.urbanresearchmaps.org/plurality/blockmaps.htm

The raw file contains population-by-race data at the census block level for 2000 and 2010. There are 39,011 observations, each with a unique identifier. The file reports the number of people in each racial category.

The file also contains information on Business Improvement Districts (BIDs) that was joined in GIS. It indicates whether each Census Block is within a BID, the name of the BID, and how much of each Block's area lies within that BID.

This notebook computes each racial category's share of the population in 2000 and the percentage-point change in those shares between 2000 and 2010. We then discard Census Blocks with at most one resident in either year (e.g., parks and airports). Finally, we create dummy variables for Borough and Neighborhood Tabulation Area (NTA), which we will use to control for neighborhood characteristics in our logistic regression model.




### Data Dictionary

#### Block and Total Population
* **BLOCKID :**       Unique Census Block Identifier
* **Pop10 :**         Total Census Block population in 2010
* **Pop00 :**         Total Census Block population in 2000
* **blckArea_ft :**   Total area of the Census Block in sqft.

#### Population by Race in 2010
* **WHITE10 :**       Total White non-hispanic population per Block in 2010
* **LATINO10 :**      Total Hispanic population per Block in 2010
* **BLACK10 :**       Total Black non-hispanic population per Block in 2010
* **ASIAN10 :**       Total Asian non-hispanic population per Block in 2010
* **OTHERS10 :**      Total Other Race non-hispanic population per Block in 2010.

#### Population by Race in 2000
* **WHITE00 :**       Total White non-hispanic population per Block in 2000
* **LATINO00 :**      Total Hispanic population per Block in 2000
* **BLACK00 :**       Total Black non-hispanic population per Block in 2000
* **ASIAN00 :**       Total Asian non-hispanic population per Block in 2000
* **OTHERS00 :**      Total Other Race non-hispanic population per Block in 2000.

#### Change in Population by Race in 2000-2010
* **CHGTot0010 :**    Numeric change in Total Population per Block between 2000-2010
* **CHGWhite0010 :**  Numeric change in White non-hispanic Population per Block between 2000-2010
* **CHGBlack0010 :**  Numeric change in Black non-hispanic Population per Block between 2000-2010
* **CHGAsian0010 :**  Numeric change in Asian non-hispanic Population per Block between 2000-2010
* **CHGHisp0010 :**   Numeric change in Hispanic Population per Block between 2000-2010
* **CHGOther0010 :**  Numeric change in Other Race non-hispanic Population per Block between 2000-2010.

#### Geographic references
* **BoroName :**      Borough 
* **NTACode :**       Neighborhood Tabulation Area Code
* **NTAName :**       Neighborhood Tabulation Area Name.

#### Business Improvement Districts Information
* **A_poly :**        Area of the Census Block that is within a BID in sqft (if the Block is not within a BID, A_poly is equal to the total area of the Census Block)
* **bid_id :**        Business Improvement District (BID) unique identifier (if the Block is within a BID)
* **bid_name :**      Name of the BID
* **areaBID_ft :**    Area of the BID in sqft
* **a_weight :**      Share of the Block Area that is within a BID (if the Block is not within a BID, a_weight is equal to 1)
* **BID_dummy :**     Binary variable that indicates whether the Census Block is within a BID or not.
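
The definition of `a_weight` can be sanity-checked against the other area columns. The sketch below assumes that `a_weight` equals `A_poly / blckArea_ft`; this relationship is inferred from the sample rows printed later in this notebook, not stated by the data provider. The three rows are copied from that output, and the column `a_weight_check` is a name introduced here for illustration.

```python
import pandas as pd

# Three sample rows copied from the tail() output shown later in this notebook
sample = pd.DataFrame({
    "blckArea_ft": [74371, 16487, 87555],
    "A_poly": [48723, 12622, 87555],
    "a_weight": [0.655134, 0.765573, 1.0],
})

# Assumption: a_weight is the share of the block's area that lies inside the BID
sample["a_weight_check"] = sample["A_poly"] / sample["blckArea_ft"]

# Largest absolute gap between the stored and the recomputed weight
max_gap = (sample["a_weight"] - sample["a_weight_check"]).abs().max()
print(sample)
print("max gap:", max_gap)
```

On these rows the two columns agree up to the rounding of the printed values; running the same comparison on the full file would show whether the relationship holds everywhere.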

### Import Packages

In [1]:
# visualization
%pylab inline
# import the packages
# numpy for array and matrix computation
import numpy as np

# pandas for data analysis
import pandas as pd

# matplotlib and seaborn are the data visualization packages
import matplotlib.pyplot as plt
import seaborn as sns

# sqlalchemy and psycopg2 are SQL connection packages (only sqlalchemy is imported here)
from sqlalchemy import create_engine

# configure pandas display: set the maximum number of columns displayed to 25
pd.options.display.max_columns = 25

# use the __future__ version of division and print
from __future__ import division, print_function
import warnings
warnings.filterwarnings('ignore')

Populating the interactive namespace from numpy and matplotlib




### Import Data

In [2]:
#Import Block ID as character to preserve full number
blocks = pd.read_csv("/nfshome/mf3435/projects/ada_pub_1/shared/Data/Blocks_Full.csv",  index_col=[0], dtype= {'BLOCKID': str})
blocks.shape
#39,011 observations, 29 variables

(39011, 29)
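
To illustrate why the identifier is read as a string, here is a minimal sketch on a hypothetical two-row CSV (the second ID is invented and starts with a zero on purpose). When pandas infers a numeric type, any leading zero is lost; with `dtype={'BLOCKID': str}` every character is kept.

```python
import io
import pandas as pd

# Hypothetical CSV text; the second BLOCKID is made up to show a leading zero
csv_text = "BLOCKID,Pop10\n360850134001015,38\n060850134001015,4\n"

# Default parsing infers an integer column
as_number = pd.read_csv(io.StringIO(csv_text))

# Explicit string dtype keeps the identifier exactly as written
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"BLOCKID": str})

print(as_number["BLOCKID"].tolist())  # integers: the leading zero is dropped
print(as_text["BLOCKID"].tolist())    # strings: all 15 characters are kept
```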

In [56]:
blocks.tail(10)

Unnamed: 0,BLOCKID,Pop10,Pop00,WHITE10,LATINO10,BLACK10,ASIAN10,OTHERS10,WHITE00,LATINO00,BLACK00,ASIAN00,...,CHGHisp0010,CHGOther0010,BoroName,NTACode,NTAName,blckArea_ft,A_poly,bid_id,bid_name,areaBID_ft,a_weight,BID_dummy
39002,360850134001015,38,38.0,26,12,0,0,0,34,3,1,0,...,9,0,Staten Island,SI45,New Dorp-Midland Beach,74371,48723,3941.0,New Dorp BID,1465187.0,0.655134,1
39003,360850291021056,0,0.0,0,0,0,0,0,0,0,0,0,...,0,0,Staten Island,SI05,New Springville-Bloomfield-Travis,16487,12622,1094.0,West Shore BID,10925605.0,0.765573,1
39004,360850146042016,4,7.0,0,0,0,4,0,3,0,0,4,...,0,0,Staten Island,SI54,Great Kills,87555,87555,1109.0,South Shore BID,4342132.0,1.0,1
39005,360850121002001,65,87.0,38,27,0,0,0,49,35,0,1,...,-8,-2,Staten Island,SI22,West New Brighton-New Brighton-St. George,209130,110127,98.0,Forest Avenue?BID,1367217.0,0.526596,1
39006,360470015003000,469,44.0,222,56,113,58,20,19,3,14,0,...,53,12,Brooklyn,BK38,DUMBO-Vinegar Hill-Downtown Brooklyn-Boerum Hill,104562,63639,80.0,MetroTech?BID,5137361.0,0.608625,1
39007,360470015003000,469,44.0,222,56,113,58,20,19,3,14,0,...,53,12,Brooklyn,BK38,DUMBO-Vinegar Hill-Downtown Brooklyn-Boerum Hill,104562,63639,100.0,Myrtle Avenue Brooklyn Partnership?,5011985.0,0.608625,1
39008,360470015003011,35,0.0,14,6,1,13,1,0,0,0,0,...,6,1,Brooklyn,BK38,DUMBO-Vinegar Hill-Downtown Brooklyn-Boerum Hill,62576,40850,80.0,MetroTech?BID,5137361.0,0.652806,1
39009,360470015003011,35,0.0,14,6,1,13,1,0,0,0,0,...,6,1,Brooklyn,BK38,DUMBO-Vinegar Hill-Downtown Brooklyn-Boerum Hill,62576,40850,100.0,Myrtle Avenue Brooklyn Partnership?,5011985.0,0.652806,1
39010,360470015003008,0,96.0,0,0,0,0,0,0,15,79,0,...,-15,-2,Brooklyn,BK68,Fort Greene,302591,189425,80.0,MetroTech?BID,5137361.0,0.62601,1
39011,360470015003008,0,96.0,0,0,0,0,0,0,15,79,0,...,-15,-2,Brooklyn,BK68,Fort Greene,302591,189425,79.0,Fulton Mall Improvement Association,1445650.0,0.62601,1
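
In the rows above, some `BLOCKID` values appear on two consecutive rows with different `bid_id` values, which suggests that a block is listed once per BID it overlaps. The sketch below, on a toy frame built from a few of those values, shows one way to list such repeated blocks. Whether they should be merged or kept is a modelling decision that this notebook does not settle.

```python
import pandas as pd

# Toy frame mimicking the tail above: one block listed under two BIDs
toy = pd.DataFrame({
    "BLOCKID": ["360470015003000", "360470015003000", "360850146042016"],
    "bid_id": [80.0, 100.0, 1109.0],
    "Pop10": [469, 469, 4],
})

# keep=False flags every row of a repeated BLOCKID, not only the later copies
dupes = toy[toy.duplicated(subset="BLOCKID", keep=False)]
print(dupes)

# Number of distinct BIDs each block is listed under
bids_per_block = toy.groupby("BLOCKID")["bid_id"].nunique()
print(bids_per_block)
```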


### Create New Variables
* Share of each racial group in 2000
* Percentage change 2000-2010 per each racial group
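
Written out, the two quantities computed in the next cell are as follows. The symbols are notation introduced here: $N_g^{t}$ is the number of residents of group $g$ in a block in year $t$, and $P^{t}$ is the block's total population in that year.

$$
\mathrm{sh}_g^{t} = 100 \cdot \frac{N_g^{t}}{P^{t}}, \qquad \Delta_g = \mathrm{sh}_g^{10} - \mathrm{sh}_g^{00}
$$

For block 360850134001015, with 34 of 38 residents White in 2000 and 26 of 38 in 2010:

$$
\mathrm{sh}_{\text{white}}^{00} = 100 \cdot \frac{34}{38} \approx 89.47, \qquad \mathrm{sh}_{\text{white}}^{10} = 100 \cdot \frac{26}{38} \approx 68.42, \qquad \Delta_{\text{white}} \approx -21.05
$$

The change is therefore measured in percentage points of the block's population, not as a relative change in the group's size.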

In [3]:
# Share of each race population per every census block in 2000
blocks['shWhite00']  = 100 * blocks['WHITE00']/blocks['Pop00']
blocks['shLatino00'] = 100 * blocks['LATINO00']/blocks['Pop00']
blocks['shBlack00']  = 100 * blocks['BLACK00']/blocks['Pop00']
blocks['shAsian00']  = 100 * blocks['ASIAN00']/blocks['Pop00']
blocks['shOther00']  = 100 * blocks['OTHERS00']/blocks['Pop00']

# Share of each race population per every census block in 2010
blocks['shWhite10']  = 100 * blocks['WHITE10']/blocks['Pop10']
blocks['shLatino10'] = 100 * blocks['LATINO10']/blocks['Pop10']
blocks['shBlack10']  = 100 * blocks['BLACK10']/blocks['Pop10']
blocks['shAsian10']  = 100 * blocks['ASIAN10']/blocks['Pop10']
blocks['shOther10']  = 100 * blocks['OTHERS10']/blocks['Pop10']

# Change in racial composition of each block, in percentage points (share in 2010 minus share in 2000)
blocks['pct_ch_white'] = blocks['shWhite10'] - blocks['shWhite00']  
blocks['pct_ch_hisp']  = blocks['shLatino10'] - blocks['shLatino00'] 
blocks['pct_ch_black'] = blocks['shBlack10'] - blocks['shBlack00'] 
blocks['pct_ch_asian'] = blocks['shAsian10'] - blocks['shAsian00'] 
blocks['pct_ch_other'] = blocks['shOther10'] - blocks['shOther00'] 

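# Population in 2010 as a percentage of the population in 2000
# (100 means no change; this is a ratio, not a percentage change)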
blocks['pop_pct_ch'] = 100 * blocks['Pop10']/blocks['Pop00']

blocks.tail(10)

Unnamed: 0,BLOCKID,Pop10,Pop00,WHITE10,LATINO10,BLACK10,ASIAN10,OTHERS10,WHITE00,LATINO00,BLACK00,ASIAN00,...,shOther00,shWhite10,shLatino10,shBlack10,shAsian10,shOther10,pct_ch_white,pct_ch_hisp,pct_ch_black,pct_ch_asian,pct_ch_other,pop_pct_ch
39002,360850134001015,38,38.0,26,12,0,0,0,34,3,1,0,...,0.0,68.421053,31.578947,0.0,0.0,0.0,-21.052632,23.684211,-2.631579,0.0,0.0,100.0
39003,360850291021056,0,0.0,0,0,0,0,0,0,0,0,0,...,,,,,,,,,,,,
39004,360850146042016,4,7.0,0,0,0,4,0,3,0,0,4,...,0.0,0.0,0.0,0.0,100.0,0.0,-42.857143,0.0,0.0,42.857143,0.0,57.142857
39005,360850121002001,65,87.0,38,27,0,0,0,49,35,0,1,...,2.298851,58.461538,41.538462,0.0,0.0,0.0,2.139699,1.308576,0.0,-1.149425,-2.298851,74.712644
39006,360470015003000,469,44.0,222,56,113,58,20,19,3,14,0,...,18.181818,47.334755,11.940299,24.093817,12.366738,4.264392,4.152937,5.122117,-7.724365,12.366738,-13.917426,1065.909091
39007,360470015003000,469,44.0,222,56,113,58,20,19,3,14,0,...,18.181818,47.334755,11.940299,24.093817,12.366738,4.264392,4.152937,5.122117,-7.724365,12.366738,-13.917426,1065.909091
39008,360470015003011,35,0.0,14,6,1,13,1,0,0,0,0,...,,40.0,17.142857,2.857143,37.142857,2.857143,,,,,,inf
39009,360470015003011,35,0.0,14,6,1,13,1,0,0,0,0,...,,40.0,17.142857,2.857143,37.142857,2.857143,,,,,,inf
39010,360470015003008,0,96.0,0,0,0,0,0,0,15,79,0,...,2.083333,,,,,,,,,,,0.0
39011,360470015003008,0,96.0,0,0,0,0,0,0,15,79,0,...,2.083333,,,,,,,,,,,0.0
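
The output above contains `NaN` and `inf` wherever a block had zero population in one of the two years. As a sketch of how such values arise and one possible way to neutralise them (the column `pop_pct_ch_clean` is a name introduced here, and replacing `inf` with `NaN` is an assumption about how they should be treated):

```python
import numpy as np
import pandas as pd

# Toy populations covering the four cases seen in the output above
toy = pd.DataFrame({"Pop10": [38, 35, 0, 0], "Pop00": [38.0, 0.0, 96.0, 0.0]})

# Same formula as the notebook: x/0 gives inf, 0/0 gives NaN
toy["pop_pct_ch"] = 100 * toy["Pop10"] / toy["Pop00"]

# Treat infinite ratios as missing so that summary statistics stay finite
toy["pop_pct_ch_clean"] = toy["pop_pct_ch"].replace([np.inf, -np.inf], np.nan)

# Count the blocks whose ratio is undefined
n_undefined = int(toy["pop_pct_ch_clean"].isna().sum())
print(toy)
print("undefined ratios:", n_undefined)
```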


### Subset the Data

In [4]:
#Subset
blocks_clean = blocks[['BLOCKID', 'Pop10', 'Pop00', 'shWhite00', 'shLatino00', 
                       'shBlack00', 'shAsian00', 'shOther00', 'pct_ch_white', 
                       'pct_ch_hisp', 'pct_ch_black', 'pct_ch_asian', 'pct_ch_other', 'pop_pct_ch', 
                       'BoroName', 'NTACode', 'NTAName', 'A_poly', 'bid_id', 'bid_name', 'a_weight', 'BID_dummy']]

#Keep only Blocks with more than 1 person in both 2000 and 2010 (drops Blocks with 0 or 1 residents)
blocks_clean = blocks_clean[(blocks_clean['Pop10'] > 1)&
    (blocks_clean['Pop00'] > 1)]    

blocks_clean = blocks_clean.dropna(subset = ['BLOCKID'])
# shape of a dataframe (row number, column number)
blocks_clean.shape
#29,353 Observations and 22 columns

(29353, 22)
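
The threshold in the filter matters: `> 1` removes blocks with a single resident as well as empty ones. A minimal sketch on toy values (not the full dataset) compares it with a `> 0` alternative; the names `strict` and `lenient` are introduced here for illustration.

```python
import pandas as pd

# Toy populations (illustrative values, not the full dataset)
toy = pd.DataFrame({
    "Pop10": [38, 0, 4, 35, 0, 1],
    "Pop00": [38.0, 0.0, 7.0, 0.0, 96.0, 5.0],
})

# Filter used in this notebook: more than one resident in both years
strict = toy[(toy["Pop10"] > 1) & (toy["Pop00"] > 1)]

# Alternative: at least one resident in both years
lenient = toy[(toy["Pop10"] > 0) & (toy["Pop00"] > 0)]

print(len(strict), len(lenient))
```

Either filter removes the rows that produce `NaN` or `inf` in the share and ratio columns, because both require a non-zero population in each year.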

### Save the results as csv

In [5]:
blocks_clean.to_csv("/nfshome/mf3435/projects/ada_pub_1/shared/Data/blocks_clean.csv", encoding='utf8')

### Create Borough and Neighborhood Dummies

In [6]:
#Create dummy variables for boroughs and neighborhoods (NTAs)
blocks_dummies = pd.get_dummies(blocks_clean, columns=['BoroName', 'NTACode'])
blocks_dummies.count()


BLOCKID                   29353
Pop10                     29353
Pop00                     29353
shWhite00                 29353
shLatino00                29353
shBlack00                 29353
shAsian00                 29353
shOther00                 29353
pct_ch_white              29353
pct_ch_hisp               29353
pct_ch_black              29353
pct_ch_asian              29353
pct_ch_other              29353
pop_pct_ch                29353
NTAName                   29353
A_poly                    29353
bid_id                      719
bid_name                    719
a_weight                  29353
BID_dummy                 29353
BoroName_Bronx            29353
BoroName_Brooklyn         29353
BoroName_Manhattan        29353
BoroName_Queens           29353
BoroName_Staten Island    29353
NTACode_BK09              29353
NTACode_BK17              29353
NTACode_BK19              29353
NTACode_BK21              29353
NTACode_BK23              29353
                          ...  
NTACode_
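
When these dummies feed a regression with an intercept, keeping one column per category makes the dummy columns sum to one within each variable, which is perfectly collinear with the intercept. One common option, shown here as a sketch on toy rows and not as what this notebook does, is `drop_first=True`, which omits the first category of each variable as the reference level.

```python
import pandas as pd

# Toy rows using borough and NTA codes that appear in the sample output
toy = pd.DataFrame({
    "BoroName": ["Brooklyn", "Staten Island", "Brooklyn", "Staten Island"],
    "NTACode": ["BK38", "SI54", "BK68", "SI45"],
    "Pop10": [469, 4, 96, 38],
})

# One column per category, as in the cell above
full = pd.get_dummies(toy, columns=["BoroName", "NTACode"])

# Drop the first category of each variable to use it as the reference level
reduced = pd.get_dummies(toy, columns=["BoroName", "NTACode"], drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```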

In [7]:
blocks_dummies.tail()
#The original row labels are still shown: filtering rows does not renumber the index
#(calling reset_index(drop=True) would renumber it)

Unnamed: 0,BLOCKID,Pop10,Pop00,shWhite00,shLatino00,shBlack00,shAsian00,shOther00,pct_ch_white,pct_ch_hisp,pct_ch_black,pct_ch_asian,...,NTACode_SI22,NTACode_SI24,NTACode_SI25,NTACode_SI28,NTACode_SI32,NTACode_SI35,NTACode_SI36,NTACode_SI37,NTACode_SI45,NTACode_SI48,NTACode_SI54,NTACode_SI99
39002,360850134001015,38,38.0,89.473684,7.894737,2.631579,0.0,0.0,-21.052632,23.684211,-2.631579,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
39004,360850146042016,4,7.0,42.857143,0.0,0.0,57.142857,0.0,-42.857143,0.0,0.0,42.857143,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
39005,360850121002001,65,87.0,56.321839,40.229885,0.0,1.149425,2.298851,2.139699,1.308576,0.0,-1.149425,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
39006,360470015003000,469,44.0,43.181818,6.818182,31.818182,0.0,18.181818,4.152937,5.122117,-7.724365,12.366738,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
39007,360470015003000,469,44.0,43.181818,6.818182,31.818182,0.0,18.181818,4.152937,5.122117,-7.724365,12.366738,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
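
The row labels above (39002, 39004, ...) skip numbers because boolean filtering keeps the labels of the surviving rows. A minimal sketch on a toy frame shows how `reset_index(drop=True)` renumbers them from zero; the variable names are introduced here for illustration.

```python
import pandas as pd

# Toy frame that reuses three of the original row labels
toy = pd.DataFrame(
    {"Pop10": [38, 0, 4], "Pop00": [38.0, 0.0, 7.0]},
    index=[39002, 39003, 39004],
)

# Filtering keeps the labels of the rows that survive
kept = toy[(toy["Pop10"] > 1) & (toy["Pop00"] > 1)]
print(kept.index.tolist())

# drop=True discards the old labels and does not add them as a column
renumbered = kept.reset_index(drop=True)
print(renumbered.index.tolist())
```

If the original labels are needed later (for example to join back to the raw file), omitting `drop=True` keeps them as a regular column.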


### Save the results with neighborhood and borough dummies as csv

In [8]:
blocks_dummies.to_csv("/nfshome/mf3435/projects/ada_pub_1/shared/Data/block_dummies.csv", encoding='utf8')