# Active New Building Construction Sites

## Introduction

In this notebook we attempt to get a handle on data for active new building construction sites in New York City.

In [1]:
import requests
import pandas as pd
pd.set_option("max_columns", 500)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import geopandas as gpd
from datetime import datetime
import co_reader

## Download

In [2]:
def download_file(url, filename):
    """
    Helper method handling downloading large files from `url` to `filename`. Returns a pointer to `filename`.
    """
    r = requests.get(url, stream=True)
    with open(filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024): 
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
    return filename

In [3]:
permits = download_file("https://data.cityofnewyork.us/api/views/ipu4-2q9a/rows.csv?accessType=DOWNLOAD",
                        "data/DOB Permit Issuance.csv")
permits = pd.read_csv(permits)

  interactivity=interactivity, compiler=compiler, result=result)


## Preprocessing

We need to find construction permits corresponding with new building jobs which have yet to expire.

We start by filtering those down and converting the issuance and expiration dates from strings to intelligent datetimes.

In [4]:
nb_permits = permits[(permits['Job Type'] == 'NB') &
                     (permits['Permit Type'] == 'NB') &
                     (permits['Permit Status'] == 'ISSUED')]

In [5]:
nb_permits['Issuance Date'] = nb_permits['Issuance Date'].map(lambda date: pd.to_datetime(date))
nb_permits['Expiration Date'] = nb_permits['Expiration Date'].map(lambda date: pd.to_datetime(date))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


Because of technical differences in PDF link name formatting, `co_reader` requires that you pass it the DOB borough code for the building. To simplify this operation let's remap a borough code column for the entire dataset of interest, using the existing borough column.

In [6]:
borough_mapper = {
    "MANHATTAN": "M",
    "BROOKLYN": "B",
    "QUEENS": "Q",
    "STATEN ISLAND": "R",
    "BRONX": "X"
}

nb_permits['Borough Code'] = nb_permits['BOROUGH'].map(lambda b: borough_mapper[b])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


How many new building construction permits are active right now?

In [8]:
now = pd.to_datetime(datetime.now())

In [9]:
(nb_permits['Expiration Date'] > now).astype(int).sum()

4961

In [10]:
bins_with_nonexpired_permits = nb_permits[nb_permits['Expiration Date'] > now]['Bin #'].astype(int).unique()

How many unique lots have currently-active new building construction permits?

This filters out lots which have recieved multiple permits (reissuance etc.) which are still valid.

In [11]:
len(bins_with_nonexpired_permits)

4083

We will need to start time for each of these permits, as this is what we will be comparing against in order to determine whether or not a building has finished construction. We will also need the borough code that we just generated.

The loop that follows the selects and takes this information off of the most recent new building document on record&mdash;the one with the highest permit number.

Even we do not filter the data this way we will recirculate each of the thousand outstanding "additional" permits, increasing runtime by 20%. Even though the end result would be the same, given how long certificate data reads take, this is wasteful, so it is worth the additional work of removing these explicitly beforehand.

In [12]:
now = pd.to_datetime(datetime.now())

In [13]:
active_nb_permits = nb_permits[nb_permits['Expiration Date'] > now]

In [14]:
active_nb_permits.head(0)

Unnamed: 0,BOROUGH,Bin #,House #,Street Name,Job #,Job doc. #,Job Type,Self_Cert,Block,Lot,Community Board,Zip Code,Bldg Type,Residential,Special District 1,Special District 2,Work Type,Permit Status,Filing Status,Permit Type,Permit Sequence #,Permit Subtype,Oil Gas,Site Fill,Filing Date,Issuance Date,Expiration Date,Job Start Date,Permittee's First Name,Permittee's Last Name,Permittee's Business Name,Permittee's Phone #,Permittee's License Type,Permittee's License #,Act as Superintendent,Permittee's Other Title,HIC License,Site Safety Mgr's First Name,Site Safety Mgr's Last Name,Site Safety Mgr Business Name,Superintendent First & Last Name,Superintendent Business Name,Owner's Business Type,Non-Profit,Owner's Business Name,Owner's First Name,Owner's Last Name,Owner's House #,Owner's House Street Name,Owner’s House City,Owner’s House State,Owner’s House Zip Code,Owner's Phone #,DOBRunDate,Borough Code


In [15]:
most_recent_docs = []
nb_permits['Bin #'] = nb_permits['Bin #'].astype(int)
bins_with_nonexpired_permits = nb_permits[nb_permits['Expiration Date'] > now]['Bin #'].unique()
active_nb_permits = nb_permits[nb_permits['Expiration Date'] > now]

for BIN in bins_with_nonexpired_permits:
    docs = active_nb_permits[active_nb_permits['Bin #'] == BIN]
    doc = docs.iloc[np.argmax(docs['Permit Sequence #'].values)]
    most_recent_docs.append(doc)

active_latest_nb_permits = pd.concat(most_recent_docs, axis=1).T

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


In [16]:
len(active_latest_nb_permits)

4083

That this number is exactly the same as the number of unique BINs before confirms that the routine fired successfully.

In [17]:
pd.to_pickle(active_latest_nb_permits, "data/Latest Active New Building Permits.p")

## Processing

4083 Certificate of Occupancy reads will take a long time to process. This step must be handled in segments.

In [3]:
active_latest_nb_permits = pd.read_pickle("data/Latest Active New Building Permits.p")

In [5]:
active_latest_nb_permits.head(1)

Unnamed: 0,BOROUGH,Bin #,House #,Street Name,Job #,Job doc. #,Job Type,Self_Cert,Block,Lot,Community Board,Zip Code,Bldg Type,Residential,Special District 1,Special District 2,Work Type,Permit Status,Filing Status,Permit Type,Permit Sequence #,Permit Subtype,Oil Gas,Site Fill,Filing Date,Issuance Date,Expiration Date,Job Start Date,Permittee's First Name,Permittee's Last Name,Permittee's Business Name,Permittee's Phone #,Permittee's License Type,Permittee's License #,Act as Superintendent,Permittee's Other Title,HIC License,Site Safety Mgr's First Name,Site Safety Mgr's Last Name,Site Safety Mgr Business Name,Superintendent First & Last Name,Superintendent Business Name,Owner's Business Type,Non-Profit,Owner's Business Name,Owner's First Name,Owner's Last Name,Owner's House #,Owner's House Street Name,Owner’s House City,Owner’s House State,Owner’s House Zip Code,Owner's Phone #,DOBRunDate,Borough Code
566401,BROOKLYN,3034837,329,STERLING ST.,320708000.0,1,NB,,1316,72,309,11225,2,YES,,,,ISSUED,RENEWAL,NB,4,,,OFF-SITE,07/08/2016,2016-07-08 00:00:00,2016-11-25 00:00:00,07/07/2014,ZEV,CHASKELSON,HML DEVELOPMENTS LLC,7187021530,GENERAL CONTRACTOR,613324,,,,,,,ZEV CHASKELSON,ZENCO GROUP INC,PARTNERSHIP,,JACQUELYN 327 LLC,AL,LIEBER,146,SPENCER STREET,BROOKLYN,NY,11205,3472274450,07/09/2016 12:00:00 AM,B


In [2]:
def latest_co_date(srs):
    """
    DataFrame apply function which retrieves and stores the most recent found C/O date in the DataFrame.
    """
    try:
        return co_reader.get_co_date(srs['Bin #'], srs['Borough Code'])
    except Exception as e:
        print("WARNING: Error raised:\n", e)
        return None

    
def is_active(srs):
    """
    DataFrame apply function which retrieves and stores whether or not a construction site is active.
    
    Uses the "Latest C/O Date" field specified by the `latest_co_date` function above.
    """
    if srs['Latest C/O Date']:
        if srs['Latest C/O Date'].replace(tzinfo=None) > srs['Issuance Date'].replace(tzinfo=None):
            return False
        else:
            return True
    else:
        return True

Now we run the primary algorithm---in 100-permit segments, for runtime splitting sake.

<!-- Because of the verbocity of the logging output, while this script was running I temporarily commented out the print statements in the `co_reader` script. -->

Important note for those reading this file: the log below (which I may edit out at a later date due to its verbocity) claims that optical character recognition is used. This is false. The logfile includes these lines to do an error on my part, due to my uncommenting a print line in the source code that I should not have uncommented. This has now been fixed, but the output remains in that form here.

In [4]:
active_sample_1 = active_latest_nb_permits.iloc[0:100]
active_sample_1['Latest C/O Date'] = active_sample_1.apply(latest_co_date, axis='columns')
active_sample_1['Active Construction Site'] = active_sample_1.apply(is_active, axis='columns')

Requested BIN 3034837 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3418394 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3000305 data from BIS, awaiting response...
Discovered 3 Certificates of Occupancy.
Scanning B000119597.PDF...
PDF Certificate of Occupancy 'B000119597.PDF' retrieved.
Copying text using optical character recognition...
Harvesting dates...
[]
No date found. Continuing...
Scanning B000130178.PDF...
PDF Certificate of Occupancy 'B000130178.PDF' retrieved.
Copying text using optical character recognition...
Harvesting dates...
[]
No date found. Continuing...
Scanning B000000529.PDF...
Got the wait page. Trying to retrieve the PDF Certificate of Occupancy 'B000000529.PDF' again in five seconds...
After some delay, PDF Certificate of Occupancy 'B000000529.PDF' retrieved.
Copying text using optical character recognition...
Harvesting dates...
[]
No date found. Continuing...
Requested B

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


In [7]:
active_sample_1.to_csv("data/active_sample_1.csv")

In [5]:
active_sample_2 = active_latest_nb_permits.iloc[100:200]
active_sample_2['Latest C/O Date'] = active_sample_2.apply(latest_co_date, axis='columns')
active_sample_2['Active Construction Site'] = active_sample_2.apply(is_active, axis='columns')

Requested BIN 3413948 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3046578 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3414080 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3413814 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3341057 data from BIS, awaiting response...
Discovered 2 Certificates of Occupancy.
Scanning B000074362.PDF...
PDF Certificate of Occupancy 'B000074362.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Scanning B300843336.PDF...
PDF Certificate of Occupancy 'B300843336.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Requested BIN 3412984 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3256406 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3044622 data from BIS, await

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


In [21]:
active_sample_2.to_csv("data/active_sample_2.csv")

In [6]:
active_sample_3 = active_latest_nb_permits.iloc[200:300]
active_sample_3['Latest C/O Date'] = active_sample_3.apply(latest_co_date, axis='columns')
active_sample_3['Active Construction Site'] = active_sample_3.apply(is_active, axis='columns')

Requested BIN 4605206 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 4605205 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 4602517 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 4602518 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 1076762 data from BIS, awaiting response...
Discovered 4 Certificates of Occupancy.
Scanning M000043184.PDF...
PDF Certificate of Occupancy 'M000043184.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Scanning M000043355.PDF...
PDF Certificate of Occupancy 'M000043355.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Scanning M000050070.PDF...
PDF Certificate of Occupancy 'M000050070.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Scanning M000057947.PDF...
PDF Certificate of Occupancy 'M000057947.PDF' retrieved.
Harvesting dates..

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


In [20]:
active_sample_3.to_csv("data/active_sample_3.csv")

In [13]:
active_sample_4 = active_latest_nb_permits.iloc[300:400]
active_sample_4['Latest C/O Date'] = active_sample_4.apply(latest_co_date, axis='columns')
active_sample_4['Active Construction Site'] = active_sample_4.apply(is_active, axis='columns')

Requested BIN 3010209 data from BIS, awaiting response...
Discovered 1 Certificates of Occupancy.
Scanning B000170491.PDF...
Got the wait page. Trying to retrieve the PDF Certificate of Occupancy 'B000170491.PDF' again in five seconds...
After some delay, PDF Certificate of Occupancy 'B000170491.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Requested BIN 4016095 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3336304 data from BIS, awaiting response...
Discovered 3 Certificates of Occupancy.
Scanning B000074044.PDF...
PDF Certificate of Occupancy 'B000074044.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Scanning 320590829-01.PDF...
PDF Certificate of Occupancy '320590829-01.PDF' retrieved.
Harvesting dates...
['03/15/2016', '06/13/2016']
Date(s) found!
Scanning 320590829-02.PDF...
Got the wait page. Trying to retrieve the PDF Certificate of Occupancy '320590829-02.PDF' again in five seconds...
After s

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


ValueError: Wrong number of items passed 56, placement implies 1

In [19]:
active_sample_4.to_csv("data/active_sample_4.csv")

In [17]:
active_sample_5 = active_latest_nb_permits.iloc[400:500]
active_sample_5['Latest C/O Date'] = active_sample_5.apply(latest_co_date, axis='columns')
active_sample_5['Active Construction Site'] = active_sample_5

Requested BIN 3418622 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3184172 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 2124618 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 2124619 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3417817 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3211211 data from BIS, awaiting response...
Discovered 12 Certificates of Occupancy.
Scanning B000116477.PDF...
PDF Certificate of Occupancy 'B000116477.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Scanning B000105351.PDF...
PDF Certificate of Occupancy 'B000105351.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Scanning B000105352.PDF...
Got the wait page. Trying to retrieve the PDF Certificate of Occupancy 'B000105352.PDF' again in five seconds...

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


ValueError: Wrong number of items passed 56, placement implies 1

In [23]:
active_sample_5.to_csv("data/active_sample_5.csv")

**FIXME**

In [4]:
active_sample_6 = active_latest_nb_permits.iloc[500:600]
active_sample_6['Latest C/O Date'] = active_sample_6.apply(latest_co_date, axis='columns')
active_sample_6['Active Construction Site'] = active_sample_6

Requested BIN 2081548 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 4436850 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 4251704 data from BIS, awaiting response...
Discovered 2 Certificates of Occupancy.
Scanning 421185666F.PDF...
PDF Certificate of Occupancy '421185666F.PDF' retrieved.
Harvesting dates...
['05/19/2016']
Date(s) found!
Scanning 421146138F.PDF...
PDF Certificate of Occupancy '421146138F.PDF' retrieved.
Harvesting dates...
['05/19/2016']
Date(s) found!
Requested BIN 4129673 data from BIS, awaiting response...
Discovered 1 Certificates of Occupancy.
Scanning Q000119656.PDF...
PDF Certificate of Occupancy 'Q000119656.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Requested BIN 1006102 data from BIS, awaiting response...
Discovered 5 Certificates of Occupancy.
Scanning M000036430.PDF...
PDF Certificate of Occupancy 'M000036430.PDF' retrieved.
Harvesting dates...
[]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


ValueError: Wrong number of items passed 56, placement implies 1

In [11]:
active_sample_6.to_csv("data/active_sample_6.csv")

In [6]:
active_sample_7 = active_latest_nb_permits.iloc[600:700]
active_sample_7['Latest C/O Date'] = active_sample_7.apply(latest_co_date, axis='columns')
active_sample_7['Active Construction Site'] = active_sample_7.apply(is_active, axis='columns')

Requested BIN 5042101 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 5802686 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 5158715 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 1087661 data from BIS, awaiting response...
Discovered 2 Certificates of Occupancy.
Scanning 100853765-T-2.PDF...
PDF Certificate of Occupancy '100853765-T-2.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Scanning 100853765-T-1.PDF...
PDF Certificate of Occupancy '100853765-T-1.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Requested BIN 3005962 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3035806 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3421390 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3421391 data fro

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


ValueError: Wrong number of items passed 56, placement implies 1

In [9]:
active_sample_7.to_csv("data/active_sample_7.csv")

In [12]:
active_sample_8 = active_latest_nb_permits.iloc[700:800]
active_sample_8['Latest C/O Date'] = active_sample_6.apply(latest_co_date, axis='columns')
active_sample_8['Active Construction Site'] = active_sample_8.apply(is_active, axis='columns')

Requested BIN 2081548 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 4436850 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 4251704 data from BIS, awaiting response...
Discovered 2 Certificates of Occupancy.
Scanning 421185666F.PDF...
Got the wait page. Trying to retrieve the PDF Certificate of Occupancy '421185666F.PDF' again in five seconds...
After some delay, PDF Certificate of Occupancy '421185666F.PDF' retrieved.
Harvesting dates...
['05/19/2016']
Date(s) found!
Scanning 421146138F.PDF...
PDF Certificate of Occupancy '421146138F.PDF' retrieved.
Harvesting dates...
['05/19/2016']
Date(s) found!
Requested BIN 4129673 data from BIS, awaiting response...
Discovered 1 Certificates of Occupancy.
Scanning Q000119656.PDF...
Got the wait page. Trying to retrieve the PDF Certificate of Occupancy 'Q000119656.PDF' again in five seconds...
After some delay, PDF Certificate of Occupancy 'Q000119656.PDF' retri

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


In [13]:
active_sample_8.to_csv("data/active_sample_8.csv")

In [14]:
active_sample_9 = active_latest_nb_permits.iloc[800:900]
active_sample_9['Latest C/O Date'] = active_sample_9.apply(latest_co_date, axis='columns')
active_sample_9['Active Construction Site'] = active_sample_9.apply(is_active, axis='columns')

Requested BIN 4010134 data from BIS, awaiting response...
Discovered 1 Certificates of Occupancy.
Scanning Q000010134.PDF...
PDF Certificate of Occupancy 'Q000010134.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Requested BIN 4007730 data from BIS, awaiting response...
Discovered 5 Certificates of Occupancy.
Scanning Q000008148.PDF...
PDF Certificate of Occupancy 'Q000008148.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Scanning Q000077203.PDF...
PDF Certificate of Occupancy 'Q000077203.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Scanning Q000095343.PDF...
PDF Certificate of Occupancy 'Q000095343.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Scanning Q000178080.PDF...
PDF Certificate of Occupancy 'Q000178080.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Scanning Q000187548.PDF...
PDF Certificate of Occupancy 'Q000187548.PDF' retrieved.
Harvesting dates...
[]
No date found. Co

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


In [15]:
active_sample_9.to_csv("data/active_sample_9.csv")

In [25]:
active_sample_10 = active_latest_nb_permits.iloc[900:1000]
active_sample_10['Latest C/O Date'] = active_sample_10.apply(latest_co_date, axis='columns')
active_sample_10['Active Construction Site'] = active_sample_10.apply(is_active, axis='columns')

Requested BIN 4056350 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 1084666 data from BIS, awaiting response...
Discovered 2 Certificates of Occupancy.
Scanning 121235289-02.PDF...
PDF Certificate of Occupancy '121235289-02.PDF' retrieved.
Harvesting dates...
['07/07/2016', '10/05/2016', '04/08/2013']
Date(s) found!
Scanning 121235289T001.PDF...
PDF Certificate of Occupancy '121235289T001.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Requested BIN 4101236 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 1014279 data from BIS, awaiting response...
Discovered 10 Certificates of Occupancy.
Scanning M000088684.PDF...
PDF Certificate of Occupancy 'M000088684.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Scanning M000096082.PDF...
PDF Certificate of Occupancy 'M000096082.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Scanning 103305022-T-1.PDF.

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


In [26]:
active_sample_10.to_csv("active_sample_10.csv")

In [27]:
active_sample_11 = active_latest_nb_permits.iloc[1000:1100]
active_sample_11['Latest C/O Date'] = active_sample_11.apply(latest_co_date, axis='columns')
active_sample_11['Active Construction Site'] = active_sample_11.apply(is_active, axis='columns')

Requested BIN 2124464 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 4466143 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 4466681 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 1046284 data from BIS, awaiting response...
Discovered 3 Certificates of Occupancy.
Scanning M000063559.PDF...
PDF Certificate of Occupancy 'M000063559.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Scanning M000016169.PDF...
PDF Certificate of Occupancy 'M000016169.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Scanning M000063559.PDF...
PDF Certificate of Occupancy 'M000063559.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Requested BIN 3064482 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 5113073 data from BIS, awaiting response...
Discovered 6 Certificates of Occupancy.
Scan

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


In [28]:
active_sample_11.to_csv("data/active_sample_11.csv")

In [20]:
active_sample_12 = active_latest_nb_permits.iloc[1100:1200]
active_sample_12['Latest C/O Date'] = active_sample_12.apply(latest_co_date, axis='columns')
active_sample_12['Active Construction Site'] = active_sample_12.apply(is_active, axis='columns')

Requested BIN 2127131 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3336560 data from BIS, awaiting response...
Discovered 9 Certificates of Occupancy.
Scanning B000219271.PDF...
PDF Certificate of Occupancy 'B000219271.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Scanning B000000004.PDF...
Got the wait page. Trying to retrieve the PDF Certificate of Occupancy 'B000000004.PDF' again in five seconds...
After some delay, PDF Certificate of Occupancy 'B000000004.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Scanning B000221667.PDF...
PDF Certificate of Occupancy 'B000221667.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Scanning B000062682.PDF...
PDF Certificate of Occupancy 'B000062682.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Scanning 301365314.PDF...
PDF Certificate of Occupancy '301365314.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuin

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


In [21]:
active_sample_12.to_csv("data/active_sample_12.csv")

In [22]:
active_sample_13 = active_latest_nb_permits.iloc[1200:1300]
active_sample_13['Latest C/O Date'] = active_sample_13.apply(latest_co_date, axis='columns')
active_sample_13['Active Construction Site'] = active_sample_13.apply(is_active, axis='columns')

Requested BIN 3398269 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3398264 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3398250 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3398247 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3398314 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3398315 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3398259 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3398266 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3398268 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3398248 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 339831

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


In [23]:
active_sample_13.to_csv("data/active_sample_13.csv")

In [18]:
active_sample_14 = active_latest_nb_permits.iloc[1300:1400]
active_sample_14['Latest C/O Date'] = active_sample_14.apply(latest_co_date, axis='columns')
active_sample_14['Active Construction Site'] = active_sample_14.apply(is_active, axis='columns')

Requested BIN 2120541 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3413514 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3049349 data from BIS, awaiting response...
Discovered 1 Certificates of Occupancy.
Scanning 320914196.PDF...
PDF Certificate of Occupancy '320914196.PDF' retrieved.
Harvesting dates...
['05/09/2016']
Date(s) found!
Requested BIN 3418222 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 4017171 data from BIS, awaiting response...
Discovered 1 Certificates of Occupancy.
Scanning Q000191087.PDF...
Got the wait page. Trying to retrieve the PDF Certificate of Occupancy 'Q000191087.PDF' again in five seconds...
After some delay, PDF Certificate of Occupancy 'Q000191087.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Requested BIN 3116023 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


In [19]:
active_sample_14.to_csv("data/active_sample_14.csv")

In [24]:
active_sample_14to21 = active_latest_nb_permits.iloc[1400:2100]
active_sample_14to21['Latest C/O Date'] = active_sample_14to21.apply(latest_co_date, axis='columns')
active_sample_14to21['Active Construction Site'] = active_sample_14to21.apply(is_active, axis='columns')

Requested BIN 5164132 data from BIS, awaiting response...
Discovered 1 Certificates of Occupancy.
Scanning 520218096F.PDF...
PDF Certificate of Occupancy '520218096F.PDF' retrieved.
Harvesting dates...
['06/17/2016']
Date(s) found!
Requested BIN 5164131 data from BIS, awaiting response...
Discovered 1 Certificates of Occupancy.
Scanning 520218103F.PDF...
PDF Certificate of Occupancy '520218103F.PDF' retrieved.
Harvesting dates...
['06/20/2016']
Date(s) found!
Requested BIN 5164130 data from BIS, awaiting response...
Discovered 1 Certificates of Occupancy.
Scanning 520218112F.PDF...
PDF Certificate of Occupancy '520218112F.PDF' retrieved.
Harvesting dates...
['07/13/2016']
Date(s) found!
Requested BIN 5164129 data from BIS, awaiting response...
Discovered 1 Certificates of Occupancy.
Scanning 520218121F.PDF...
PDF Certificate of Occupancy '520218121F.PDF' retrieved.
Harvesting dates...
['07/15/2016']
Date(s) found!
Requested BIN 5164128 data from BIS, awaiting response...
Discovered 1 C

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


In [25]:
active_sample_14to21.to_csv("data/active_sample_14to21.csv")

In [6]:
active_sample_21to25 = active_latest_nb_permits.iloc[2100:2500]
active_sample_21to25['Latest C/O Date'] = active_sample_21to25.apply(latest_co_date, axis='columns')
active_sample_21to25['Active Construction Site'] = active_sample_21to25.apply(is_active, axis='columns')

Requested BIN 2075802 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3069650 data from BIS, awaiting response...
Discovered 2 Certificates of Occupancy.
Scanning 320577666-01.PDF...
PDF Certificate of Occupancy '320577666-01.PDF' retrieved.
Harvesting dates...
['04/04/2016', '05/05/2016']
Date(s) found!
Scanning 320577666-02.PDF...
PDF Certificate of Occupancy '320577666-02.PDF' retrieved.
Harvesting dates...
['04/25/2016', '07/24/2016']
Date(s) found!
Requested BIN 5165104 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3071683 data from BIS, awaiting response...
Discovered 1 Certificates of Occupancy.
Scanning B000040066.PDF...
PDF Certificate of Occupancy 'B000040066.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Requested BIN 1043841 data from BIS, awaiting response...
Discovered 1 Certificates of Occupancy.
Scanning M000032172.PDF...
PDF Certificate of Occupancy 'M000032172.PD

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


In [7]:
active_sample_21to25.to_csv("data/active_sample_21to25.csv")

^ Latest.

In [4]:
active_sample_25to30 = active_latest_nb_permits.iloc[2500:3000]
active_sample_25to30['Latest C/O Date'] = active_sample_25to30.apply(latest_co_date, axis='columns')
active_sample_25to30['Active Construction Site'] = active_sample_25to30.apply(is_active, axis='columns')

Requested BIN 3054243 data from BIS, awaiting response...
Discovered 2 Certificates of Occupancy.
Scanning B000109432.PDF...
PDF Certificate of Occupancy 'B000109432.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Scanning B000194269.PDF...
PDF Certificate of Occupancy 'B000194269.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Requested BIN 5164385 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 5159042 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3819496 data from BIS, awaiting response...
Discovered 1 Certificates of Occupancy.
Scanning B000139650.PDF...
PDF Certificate of Occupancy 'B000139650.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Requested BIN 5158973 data from BIS, awaiting response...
Discovered 2 Certificates of Occupancy.
Scanning 520113234T002.PDF...
PDF Certificate of Occupancy '520113234T002.PDF' retrieved.
Harvesting d

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


In [5]:
active_sample_25to30.to_csv("data/active_sample_25to30.csv")

In [6]:
active_sample_30to38 = active_latest_nb_permits.iloc[3000:3800]
active_sample_30to38['Latest C/O Date'] = active_sample_30to38.apply(latest_co_date, axis='columns')
active_sample_30to38['Active Construction Site'] = active_sample_30to38.apply(is_active, axis='columns')

Requested BIN 5166388 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3413703 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 5166019 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 5166387 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 5164518 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 5164517 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 5164519 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 5166389 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 2117981 data from BIS, awaiting response...
Discovered 1 Certificates of Occupancy.
Scanning X000001983.PDF...
PDF Certificate of Occupancy 'X000001983.PDF' retrieved.
Harvesting dates...
[]
No date fou

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


In [7]:
active_sample_30to38.to_csv("data/active_sample_30to38.csv")

In [8]:
active_sample_30to38['Active Construction Site'].value_counts()

True     776
False     24
Name: Active Construction Site, dtype: int64

In [9]:
len(active_latest_nb_permits.iloc[3800:])

283

In [10]:
active_sample_38_plus = active_latest_nb_permits.iloc[3800:]
active_sample_38_plus['Latest C/O Date'] = active_sample_38_plus.apply(latest_co_date, axis='columns')
active_sample_38_plus['Active Construction Site'] = active_sample_38_plus.apply(is_active, axis='columns')

Requested BIN 4536149 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 4605309 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 4607324 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3050879 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3402017 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 4541571 data from BIS, awaiting response...
Discovered 0 Certificates of Occupancy.
Requested BIN 3132055 data from BIS, awaiting response...
Discovered 2 Certificates of Occupancy.
Scanning B000007685.PDF...
PDF Certificate of Occupancy 'B000007685.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Scanning B000168397.PDF...
PDF Certificate of Occupancy 'B000168397.PDF' retrieved.
Harvesting dates...
[]
No date found. Continuing...
Requested BIN 4116381 data from BIS, await

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


In [11]:
active_sample_38_plus.to_csv("data/active_sample_38_plus.csv")

Finally we merge!

In [29]:
pd.concat([pd.read_csv("data/active_sample_1.csv"),
           pd.read_csv("data/active_sample_2.csv"),
           pd.read_csv("data/active_sample_3.csv"),
           pd.read_csv("data/active_sample_4.csv"),
           pd.read_csv("data/active_sample_5.csv"),
           pd.read_csv("data/active_sample_6.csv"),
           pd.read_csv("data/active_sample_7.csv"),
           pd.read_csv("data/active_sample_8.csv"),
           pd.read_csv("data/active_sample_9.csv"),
           pd.read_csv("data/active_sample_10.csv"),
           pd.read_csv("data/active_sample_11.csv"),
           pd.read_csv("data/active_sample_12.csv"),
           pd.read_csv("data/active_sample_13.csv"),
           pd.read_csv("data/active_sample_14.csv"),
           pd.read_csv("data/active_sample_14to21.csv"),
           pd.read_csv("data/active_sample_21to25.csv"),
           pd.read_csv("data/active_sample_25to30.csv"),
           pd.read_csv("data/active_sample_30to38.csv"),
           pd.read_csv("data/active_sample_38_plus.csv")]).to_csv("data/active_construction_sites.csv")

In [30]:
%ls

Active New Building Construction Site Data Join.ipynb
Active New Building Construction Site Spatial Map.ipynb
DOB Permit Issuance.csv
Latest Active New Building Permits.p
NYC Community Districts.geojson
active_construction_sites.csv
active_sample_1.csv
active_sample_10.csv
active_sample_11.csv
active_sample_12.csv
active_sample_13.csv
active_sample_14.csv
active_sample_14to21.csv
active_sample_2.csv
active_sample_21to25.csv
active_sample_25to30.csv
active_sample_3.csv
active_sample_30to38.csv
active_sample_38_plus.csv
active_sample_4.csv
active_sample_5.csv
active_sample_6.csv
active_sample_7.csv
active_sample_8.csv
active_sample_9.csv
co_reader.py
co_reader.pyc
[34mdata[m[m/
environment.yml
ghostdriver.log
temp.pdf


In [40]:
# active_sample_2['Latest C/O Date'] = active_sample_2.apply(latest_co_date, axis='columns')