### Data Cleaning: DOB Certificate of Occupancy 

By ADA Group 1

In this Jupyter Notebook, we call the NYC Open Data API to obtain the DOB Certificate of Occupancy dataset (2012 onwards). The dataset comes at the Borough-Block-Lot (BBL) level, that is, the tax lot level. In this notebook we aggregate the data to the block level to obtain the number of new approved-for-occupancy units per Census Block.


#### Data Sources

* **DOB Certificate of Occupancy** NYC Open Data. https://data.cityofnewyork.us/Housing-Development/DOB-Certificate-Of-Occupancy/bs8b-p36w
* **BBL to Census Block Crosswalk** Created in GIS by the team based on PLUTO and Blocks shapefiles from NYC Open Data. https://www1.nyc.gov/site/planning/data-maps/open-data/dwn-pluto-mappluto.page https://data.cityofnewyork.us/City-Government/2010-Census-Blocks/v2h8-6mxf/data

For the Data Dictionary, please refer to the notebook 00_ReadMe.

### Import Packages

In [1]:
# visualization
%pylab inline
# import the packages
# numpy for array and matrix computation
import numpy as np

# pandas for data analysis
import pandas as pd

# matplotlib and seaborn are the data visualization packages
import matplotlib.pyplot as plt
import seaborn as sns

# sqlalchemy and psycopg2 are SQL connection packages
from sqlalchemy import create_engine

# configure pandas display: set the maximum number of columns displayed to 25
pd.options.display.max_columns = 25

# use the __future__ version of division and print
from __future__ import division, print_function

# gzip and csv for reading compressed files
import gzip
import csv

import warnings
warnings.filterwarnings('ignore')

Populating the interactive namespace from numpy and matplotlib




### API call to NYC Open Data
We call the NYC Open Data API to download the DOB Certificate of Occupancy data.


In [2]:
#API call to NYC Open Data 
dob_occupancy = pd.read_csv("https://data.cityofnewyork.us/api/views/2vyb-t2nz/rows.csv?accessType=DOWNLOAD", dtype= {'BBL': str})
dob_occupancy.shape

(83458, 30)

In [30]:
#Old code that reads a gzipped copy of the data.
#Uncomment and run it in case of any issues with the API call.
#Unzip the data and save it in the shared folder
#with gzip.open("Data/DOB_Certificate_Of_Occupancy.csv.gz", 'rt') as f_in:
#    data = f_in.read()
#    with open("Data/dob_certificate_of_occupancy.csv", 'wt') as f_out:
#        f_out.write(data)
#dob_occupancy = pd.read_csv("Data/dob_certificate_of_occupancy.csv", dtype= {'BBL': str})

In [3]:
# Visualize all the columns in the dataset
dob_occupancy.columns

Index([u'JOB_NUMBER', u'JOB_TYPE', u'C_O_ISSUE_DATE', u'BIN_NUMBER',
       u'BOROUGH', u'NUMBER', u'STREET', u'BLOCK', u'LOT', u'POSTCODE',
       u'PR_DWELLING_UNIT', u'EX_DWELLING_UNIT', u'APPLICATION_STATUS_RAW',
       u'FILING_STATUS_RAW', u'ITEM_NUMBER', u'ISSUE_TYPE', u'LATITUDE',
       u'LONGITUDE', u'COMMUNITY_BOARD', u'COUNCIL_DISTRICT', u'CENSUS_TRACT',
       u'BIN', u'BBL', u'NTA', u'LOCATION', u'Zip Codes',
       u'Community Districts', u'Borough Boundaries',
       u'City Council Districts', u'Police Precincts'],
      dtype='object')

In [12]:
dob_occupancy.head()

Unnamed: 0,JOB_NUMBER,JOB_TYPE,C_O_ISSUE_DATE,BIN_NUMBER,BOROUGH,NUMBER,STREET,BLOCK,LOT,POSTCODE,PR_DWELLING_UNIT,EX_DWELLING_UNIT,...,COMMUNITY_BOARD,COUNCIL_DISTRICT,CENSUS_TRACT,BIN,BBL,NTA,LOCATION,Zip Codes,Community Districts,Borough Boundaries,City Council Districts,Police Precincts
0,100032215,A1,10/17/2013,1020493,Manhattan,352,EAST 19 STREET,924,39,10003.0,8,7.0,...,6.0,2.0,48.0,1020493.0,1009240039,Gramercy,"(40.734903, -73.98146)",11724.0,71.0,4.0,50.0,7.0
1,100032215,A1,07/18/2013,1020493,Manhattan,352,EAST 19 STREET,924,39,10003.0,8,7.0,...,6.0,2.0,48.0,1020493.0,1009240039,Gramercy,"(40.734903, -73.98146)",11724.0,71.0,4.0,50.0,7.0
2,100032215,A1,01/29/2014,1020493,Manhattan,352,EAST 19 STREET,924,39,10003.0,8,7.0,...,6.0,2.0,48.0,1020493.0,1009240039,Gramercy,"(40.734903, -73.98146)",11724.0,71.0,4.0,50.0,7.0
3,100032215,A1,06/11/2014,1020493,Manhattan,352,EAST 19 STREET,924,39,10003.0,8,7.0,...,6.0,2.0,48.0,1020493.0,1009240039,Gramercy,"(40.734903, -73.98146)",11724.0,71.0,4.0,50.0,7.0
4,100032215,A1,09/10/2014,1020493,Manhattan,352,EAST 19 STREET,924,39,10003.0,8,7.0,...,6.0,2.0,48.0,1020493.0,1009240039,Gramercy,"(40.734903, -73.98146)",11724.0,71.0,4.0,50.0,7.0
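
The BBL in the rows above is consistent with the usual 10-digit layout: one borough digit, a five-digit tax block, and a four-digit lot. The following is a minimal sketch, on one hand-typed row rather than the real download, of rebuilding that key from its parts as a sanity check. The helper `build_bbl` and the `BOROUGH_CODES` mapping are illustrative names and are not part of this notebook.

```python
import pandas as pd

# Standard NYC borough codes; assumed to match the BOROUGH labels in the data.
BOROUGH_CODES = {"Manhattan": 1, "Bronx": 2, "Brooklyn": 3,
                 "Queens": 4, "Staten Island": 5}


def build_bbl(borough, block, lot):
    """Hypothetical helper: compose a 10-character BBL string."""
    return "{}{:05d}{:04d}".format(BOROUGH_CODES[borough], int(block), int(lot))


# One row copied by hand from the head() output above.
sample = pd.DataFrame({
    "BOROUGH": ["Manhattan"],
    "BLOCK": [924],
    "LOT": [39],
    "BBL": ["1009240039"],
})

sample["BBL_REBUILT"] = [
    build_bbl(b, k, l)
    for b, k, l in zip(sample["BOROUGH"], sample["BLOCK"], sample["LOT"])
]

# True when every rebuilt key equals the BBL shipped with the dataset.
bbl_matches = bool((sample["BBL"] == sample["BBL_REBUILT"]).all())
print(bbl_matches)
```

Reading BBL as a string, as the API call does, keeps the key comparable character by character with the crosswalk.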


### BBL to Census Blocks Crosswalk Import
Using GIS, we created a geographic crosswalk from BBL to Census Block so that we can aggregate the BBL-level data to blocks and add the new building occupancy permit data to our predictive model.

In [4]:
# Decompress the crosswalk and save it as a plain CSV
with gzip.open("../Data/BBL_BLOCKS.csv.gz", 'rt') as f_in:
    data = f_in.read()
    with open("../Data/bbl_blocks.csv", 'wt') as f_out:
        f_out.write(data)

In [5]:
bbl_blocks = pd.read_csv("../Data/bbl_blocks.csv", dtype= {'BLOCKID': str, 'BBL': str})
bbl_blocks.shape

(857513, 2)

In [6]:
bbl_blocks.tail(5)

Unnamed: 0,BBL,BLOCKID
857508,2042590015,360050230002004
857509,2053930011,360050264002003
857510,2099998900,360050024001008
857511,2042260001,360050296001018
857512,2033557501,360050429011004
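
The BLOCKID values look like 15-character Census block identifiers: state FIPS (2), county FIPS (3), tract (6), and block (4). As a rough sketch, assuming that layout holds for the whole crosswalk, the identifier can be split into its parts to check that each key has the expected length. The two rows are copied by hand from the output above, and the `parts` frame is illustrative only.

```python
import pandas as pd

# Two rows typed in from the tail() output; not the full crosswalk.
crosswalk = pd.DataFrame({
    "BBL": ["2042590015", "2053930011"],
    "BLOCKID": ["360050230002004", "360050264002003"],
})

# Every identifier should be exactly 15 characters when read as a string.
all_15_chars = bool(crosswalk["BLOCKID"].str.len().eq(15).all())

# Slice the identifier into its assumed components.
parts = pd.DataFrame({
    "state": crosswalk["BLOCKID"].str[0:2],
    "county": crosswalk["BLOCKID"].str[2:5],
    "tract": crosswalk["BLOCKID"].str[5:11],
    "block": crosswalk["BLOCKID"].str[11:15],
})
print(all_15_chars)
print(parts)
```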


### Joining DOB to Block-BBL Crosswalk

In [7]:
dob_block_level = dob_occupancy.merge(bbl_blocks, on='BBL', how="inner")
dob_block_level.shape
#78,269 observations, 31 variables as of April 22, 2018

(78385, 31)
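
The inner join drops certificates whose BBL is missing from the crosswalk (83,458 rows before the join, 78,385 after in the run above), and it would duplicate rows if a BBL appeared more than once in the crosswalk. The sketch below shows one way to check both on toy data. All rows are made up for illustration, including the unmatched BBL `9999999999`.

```python
import pandas as pd

# Toy certificate rows; the last BBL is deliberately absent from the crosswalk.
permits = pd.DataFrame({
    "BBL": ["1009240039", "1009240039", "2042590015", "9999999999"],
    "PR_DWELLING_UNIT": [8, 8, 2, 5],
})
crosswalk = pd.DataFrame({
    "BBL": ["1009240039", "2042590015"],
    "BLOCKID": ["360610048006000", "360050230002004"],
})

# validate="many_to_one" makes pandas raise an error if any BBL
# occurs more than once in the crosswalk (the right-hand table).
joined = permits.merge(crosswalk, on="BBL", how="inner", validate="many_to_one")

# Rows that the inner join silently drops.
unmatched = permits.loc[~permits["BBL"].isin(crosswalk["BBL"])]

print(len(permits), len(joined), len(unmatched))
```

Inspecting `unmatched` on the real data would show which tax lots the crosswalk does not cover.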

In [8]:
dob_block_level.head(5)

Unnamed: 0,JOB_NUMBER,JOB_TYPE,C_O_ISSUE_DATE,BIN_NUMBER,BOROUGH,NUMBER,STREET,BLOCK,LOT,POSTCODE,PR_DWELLING_UNIT,EX_DWELLING_UNIT,...,COUNCIL_DISTRICT,CENSUS_TRACT,BIN,BBL,NTA,LOCATION,Zip Codes,Community Districts,Borough Boundaries,City Council Districts,Police Precincts,BLOCKID
0,100032215,A1,10/17/2013,1020493,Manhattan,352,EAST 19 STREET,924,39,10003.0,8,7.0,...,2.0,48.0,1020493.0,1009240039,Gramercy,"(40.734903, -73.98146)",11724.0,71.0,4.0,50.0,7.0,360610048006000
1,100032215,A1,07/18/2013,1020493,Manhattan,352,EAST 19 STREET,924,39,10003.0,8,7.0,...,2.0,48.0,1020493.0,1009240039,Gramercy,"(40.734903, -73.98146)",11724.0,71.0,4.0,50.0,7.0,360610048006000
2,100032215,A1,01/29/2014,1020493,Manhattan,352,EAST 19 STREET,924,39,10003.0,8,7.0,...,2.0,48.0,1020493.0,1009240039,Gramercy,"(40.734903, -73.98146)",11724.0,71.0,4.0,50.0,7.0,360610048006000
3,100032215,A1,06/11/2014,1020493,Manhattan,352,EAST 19 STREET,924,39,10003.0,8,7.0,...,2.0,48.0,1020493.0,1009240039,Gramercy,"(40.734903, -73.98146)",11724.0,71.0,4.0,50.0,7.0,360610048006000
4,100032215,A1,09/10/2014,1020493,Manhattan,352,EAST 19 STREET,924,39,10003.0,8,7.0,...,2.0,48.0,1020493.0,1009240039,Gramercy,"(40.734903, -73.98146)",11724.0,71.0,4.0,50.0,7.0,360610048006000


### Subset the data

We will aggregate the data at the Census Block level to obtain the number of units with an occupancy permit per block.

In [9]:
# Replace NaNs with zeros
dob_block_level['PR_DWELLING_UNIT']= dob_block_level['PR_DWELLING_UNIT'].fillna("0")

In [10]:
# Convert to numeric, assigning NaN to problematic observations (e.g., a badly coded character).
dob_block_level['PR_DWELLING_UNIT'] = pd.to_numeric(dob_block_level['PR_DWELLING_UNIT'], errors='coerce')

Note: The *PR_DWELLING_UNIT* variable must contain a non-numeric character somewhere, because it cannot be read as numeric or converted to numeric without coercing errors.
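
To see which values are responsible, the entries that turn into NaN can be listed after coercion. This is a small sketch on a made-up series; the strings `"N/A"` and `"3x"` are invented stand-ins for whatever bad values the real column contains.

```python
import pandas as pd

# Invented raw values: two valid numbers, one missing value, two bad strings.
raw = pd.Series(["8", "12", None, "N/A", "3x"], name="PR_DWELLING_UNIT")

# Same two steps as the cells above: fill missing values, then coerce.
filled = raw.fillna("0")
numeric = pd.to_numeric(filled, errors="coerce")

# Values that could not be parsed and were therefore set to NaN.
bad_values = filled[numeric.isna()]

print(numeric.tolist())
print(bad_values.tolist())
```

Values coerced to NaN are skipped by the `sum()` in the aggregation step, so they effectively count as zero units.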

In [11]:
# Aggregate the data and pass it into a data frame
new_units=dob_block_level.groupby('BLOCKID')['PR_DWELLING_UNIT'].sum() 
new_units_df = pd.DataFrame(new_units)
# Move BLOCKID from the index back into a column
new_units_df = new_units_df.reset_index()

new_units_df.shape
#12,888 observations, 2 variables.

(12898, 2)

In [12]:
new_units_df.head(5)

Unnamed: 0,BLOCKID,PR_DWELLING_UNIT
0,360050001001012,0.0
1,360050002001000,0.0
2,360050002001004,48.0
3,360050002002002,2.0
4,360050002002003,0.0
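
One thing that may be worth checking: in the sample rows shown earlier, job 100032215 has five certificates with different issue dates, each reporting 8 units, so a plain sum counts that job as 40 units. The sketch below compares the plain sum with a sum that keeps only the latest certificate per job. It assumes that repeated certificates for one job describe the same units, which this notebook does not confirm, and the second job (`200000001`) is invented for illustration.

```python
import pandas as pd

# Five certificates for one real job from the sample output, plus one invented job.
certs = pd.DataFrame({
    "JOB_NUMBER": [100032215] * 5 + [200000001],
    "C_O_ISSUE_DATE": ["10/17/2013", "07/18/2013", "01/29/2014",
                       "06/11/2014", "09/10/2014", "03/02/2015"],
    "PR_DWELLING_UNIT": [8, 8, 8, 8, 8, 2],
    "BLOCKID": ["360610048006000"] * 5 + ["360050230002004"],
})
certs["C_O_ISSUE_DATE"] = pd.to_datetime(certs["C_O_ISSUE_DATE"], format="%m/%d/%Y")

# Plain sum, as in the cell above: every certificate contributes its units.
naive = certs.groupby("BLOCKID")["PR_DWELLING_UNIT"].sum()

# Alternative: keep the most recent certificate per job, then sum per block.
latest = certs.sort_values("C_O_ISSUE_DATE").drop_duplicates("JOB_NUMBER", keep="last")
per_job = latest.groupby("BLOCKID")["PR_DWELLING_UNIT"].sum()

print(naive)
print(per_job)
```

Whether the per-job version is the right measure depends on how DOB records temporary and final certificates, so it is offered as a check rather than a replacement.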


### Match with our existing Block Dataset and fill NaN

As there are fewer blocks in the DOB data than in our clean dataset, we merge the two and then replace NaN values with zeros. When a block has no observation, there is no permit, so most likely 0 units were built.

In [13]:
#Import only Block unique identifier from clean NYC Blocks data
blocks_nyc = pd.read_csv("../Data/blocks_clean.csv", usecols=range(1,2), dtype= {'BLOCKID': str})
blocks_nyc.tail()

Unnamed: 0,BLOCKID
29348,360850134001015
29349,360850146042016
29350,360850121002001
29351,360470015003000
29352,360470015003000


In [14]:
#Right join to full blocks dataset
new_units_full = new_units_df.merge(blocks_nyc, on='BLOCKID', how='right')
new_units_full.shape
#29,353 observations. Dataset is complete.

(29353, 2)

In [15]:
new_units_full.head()

Unnamed: 0,BLOCKID,PR_DWELLING_UNIT
0,360050002001000,0.0
1,360050002001004,48.0
2,360050002002002,2.0
3,360050002002003,0.0
4,360050004002003,4.0


In [16]:
new_units_full.tail()

Unnamed: 0,BLOCKID,PR_DWELLING_UNIT
29348,360850122002014,
29349,360850146042016,
29350,360850121002001,
29351,360470015003000,
29352,360470015003000,


* We see above that the blocks without observations were coded as NaN. We want these to be zeros.

In [17]:
#Fill NaNs with zeros
new_units_full['PR_DWELLING_UNIT']= new_units_full['PR_DWELLING_UNIT'].fillna(0)
new_units_full.tail()

Unnamed: 0,BLOCKID,PR_DWELLING_UNIT
29348,360850122002014,0.0
29349,360850146042016,0.0
29350,360850121002001,0.0
29351,360470015003000,0.0
29352,360470015003000,0.0
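
The right join followed by the zero fill can also report how many blocks had no DOB record at all. The following is a minimal sketch using the `indicator` option of `merge`; the three-row block list is a toy stand-in for the full blocks file.

```python
import pandas as pd

# Toy aggregated units and a toy block list (one block has no DOB record).
new_units = pd.DataFrame({
    "BLOCKID": ["360050002001004", "360050002002002"],
    "PR_DWELLING_UNIT": [48.0, 2.0],
})
blocks = pd.DataFrame({
    "BLOCKID": ["360050002001004", "360050002002002", "360850122002014"],
})

# indicator=True adds a _merge column that says where each row came from.
merged = new_units.merge(blocks, on="BLOCKID", how="right", indicator=True)
blocks_without_permit = int((merged["_merge"] == "right_only").sum())

# Blocks with no record get zero units, then the helper column is dropped.
merged["PR_DWELLING_UNIT"] = merged["PR_DWELLING_UNIT"].fillna(0)
merged = merged.drop(columns="_merge")

print(blocks_without_permit)
print(merged)
```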


### Export Dataset
Our data is now ready for the predictive model.

In [18]:
new_units_full.to_csv("../Data/DOB_new_units_Clean.csv", encoding='utf8')
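
Because `to_csv` writes the row index by default, the exported file gains an unnamed first column, and BLOCKID is parsed as an integer unless a string dtype is requested when reading it back. The sketch below illustrates the round trip with a temporary file; the path and the one-row frame are placeholders, not the project's real output.

```python
import os
import tempfile

import pandas as pd

# One-row stand-in for the cleaned dataset.
out = pd.DataFrame({"BLOCKID": ["360050002001004"], "PR_DWELLING_UNIT": [48.0]})

# Write to a throwaway location with the same options as the export cell.
path = os.path.join(tempfile.mkdtemp(), "units.csv")
out.to_csv(path, encoding="utf8")

# Read it back, keeping BLOCKID as a string.
back = pd.read_csv(path, dtype={"BLOCKID": str})

print(back.columns.tolist())
print(back["BLOCKID"].iloc[0])
```

Downstream notebooks can skip the extra index column with `usecols`, or the export can pass `index=False` if that column is not needed.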