## CLEAN APPROVED BUILDING PERMITS
This notebook cleans the approved building permit data. Exploratory cleaning analysis has been removed (it remains available in the repository history) and replaced with comments or markdown explanations to aid readability. Column descriptions are documented in the data_insights document.

In [1]:
import pandas as pd
import numpy as np
import re
import Levenshtein
# (pip install python-Levenshtein)

pd.set_option('display.max_columns', 100)

# from google.colab import drive
# drive.mount('/content/drive')
# directory = "/content/drive/MyDrive/City of Boston: Permitting D/Project Files/data/abp.csv"
directory = '../data/raw_abp.csv' # interchangeable with above code

# Estimated runtime ~1 minute

Data Import and basic check of the columns

In [7]:
df = pd.read_csv(directory)
df.head()

Unnamed: 0,object_id,permitnumber,worktype,permittypedescr,description,comments,applicant,declared_valuation,total_fees,issued_date,expiration_date,status,owner,occupancytype,sq_feet,address,city,state,zip,property_id,parcel_id,gpsy,gpsx,geom_2249,lat,long,geom_4326
0,1,A1000569,INTEXT,Amendment to a Long Form,Interior/Exterior Work,This work is to Amend Permit ALT347244. Elimin...,Patrick Sharkey,"$36,500.00",$390.00,2021-01-28 16:29:26+00,2021-07-28 04:00:00+00,Open,ONE 83 STATE ST CONDO TR,Mixed,0.0,181-183 State ST,Boston,MA,2109.0,130392.0,303807000.0,2956235.0,777000.467775,0101000020C9080000014080EF50B6274128B89653E58D...,42.35919,-71.052924,0101000020E6100000A703291D63C351C074AD05ECF92D...
1,2,A100071,COB,Amendment to a Long Form,City of Boston,Change connector link layout from attached enc...,Renee Santeusanio,"$40,000.00",$429.00,2011-11-04 15:04:58+00,2012-05-04 04:00:00+00,Open,CITY OF BOSTON,Comm,170.0,175 W Boundary RD,West Roxbury,MA,2132.0,17268.0,2012032000.0,2920239.0,751016.119559,0101000020C908000081DB363D50EB264164AA649F9747...,42.26075,-71.149611,0101000020E61000005F23793993C951C071ECAA3E6021...
2,3,A1001012,OTHER,Amendment to a Long Form,Other,Amend Alt943748 to erect a roof deck as per pl...,Jusimar Oliveria,"$5,000.00",$70.00,2020-06-01 18:08:47+00,,Open,15 PROSPECT STREET CONDOMINIUM TRUST,1-3FAM,0.0,15 Prospect ST,Charlestown,MA,2129.0,113443.0,202837000.0,2962078.0,775710.380542,0101000020C90800007E6BD6C23CAC2741422F500F4F99...,42.375243,-71.057585,0101000020E6100000F053B47AAFC351C0A6BB62F20730...
3,4,A1001201,INTEXT,Amendment to a Long Form,Interior/Exterior Work,Build steel balcony over garden level with sta...,Andreas Hwang,"$74,295.75",$803.00,2019-11-13 18:38:56+00,2020-05-13 04:00:00+00,Closed,LEDERMAN US REAL ESTATE CORP,Multi,0.0,211 W Springfield ST,Roxbury,MA,2118.0,129994.0,402558000.0,2949423.0,769648.312793,0101000020C9080000025726A0E07C274183505E499780...,42.3406,-71.080251,0101000020E6100000D72A24D322C551C044521DC4982B...
4,5,A100137,EXTREN,Amendment to a Long Form,Renovations - Exterior,Landscaping/stonework - amending permit #2801/...,,"$15,000.00",$206.00,2013-01-03 19:13:09+00,2013-07-03 04:00:00+00,Open,MIARA SIMON,1-2FAM,0.0,14 William Jackson AVE,Brighton,MA,2135.0,149852.0,2204944000.0,2950791.0,749690.29879,0101000020C9080000FCFDFA98F4E02641F6694F594383...,42.3446,-71.154051,0101000020E61000009DED6FF7DBC951C0929A5BD71B2C...


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 622276 entries, 0 to 622275
Data columns (total 27 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   object_id           622276 non-null  int64  
 1   permitnumber        622276 non-null  object 
 2   worktype            617722 non-null  object 
 3   permittypedescr     622276 non-null  object 
 4   description         617722 non-null  object 
 5   comments            622066 non-null  object 
 6   applicant           599396 non-null  object 
 7   declared_valuation  622276 non-null  object 
 8   total_fees          622276 non-null  object 
 9   issued_date         622276 non-null  object 
 10  expiration_date     597182 non-null  object 
 11  status              622276 non-null  object 
 12  owner               605742 non-null  object 
 13  occupancytype       602785 non-null  object 
 14  sq_feet             622276 non-null  float64
 15  address             622275 non-nul

The cells below drop unneeded columns and clean the remaining ones. For consistency, we keep variable names lowercase and underscored.

In [18]:
# Initial dropping columns
df = df.drop(columns=[
    'applicant',    # not useful for analysis
    'owner',        # not useful for analysis
    'address',      # location data already available
    'state',        # only one state
    'property_id',  # too many distinct values for analysis
    'parcel_id',    # too many distinct values for analysis
    'gpsy',         # location data already available
    'gpsx',         # location data already available
    'geom_2249',    # location data already available
    'geom_4326',    # location data already available
])

In [19]:
# Initial renaming columns
df = df.rename(columns={
    'object_id': 'id',
    'permitnumber': 'permit',
    'worktype': 'class',
    'permittypedescr': 'type',
    'comments': 'text',
    'sq_feet': 'sqft',
    'long': 'lon',
    'declared_valuation': 'value',
    'total_fees': 'fee',
    'zip': 'zipcode',
})

The declared_valuation (value) column is stored as currency strings such as "$36,500.00", so the dollar sign and thousands separators are removed before casting to float. <br>
The same is true for the total_fees (fee) column.

In [20]:
# Raw strings avoid Python's invalid-escape warning for "\$"
df['value'] = df['value'].replace(r'[\$,]', '', regex=True).astype(float)
df['fee'] = df['fee'].replace(r'[\$,]', '', regex=True).astype(float)
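
As a quick sanity check of the pattern, the same substitution can be tried on single strings without pandas. This is a minimal sketch: `parse_currency` is a hypothetical helper name, not part of the notebook, and the last sample value is invented for illustration.

```python
import re

def parse_currency(raw):
    """Strip '$' and ',' from a currency string and return a float (hypothetical helper)."""
    return float(re.sub(r'[$,]', '', raw))

# The first two values come from the data preview above; the third is invented
print(parse_currency("$36,500.00"))
print(parse_currency("$390.00"))
print(parse_currency("$1,234,567.89"))
```

Inside a character class the dollar sign is literal, so `[$,]` and `[\$,]` match the same characters.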

Permit numbers are reduced to their digits (dropping letter prefixes such as the "A" in A1000569) so that we can merge with the other datasets down the line

In [21]:
df['permit'] = df['permit'].apply(lambda x: ''.join(filter(str.isdigit, str(x))))
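
Keeping only the digits removes every letter, not just a leading prefix, so two permits that differ only in their letters end up with the same key. The sketch below applies the same expression to a few strings and counts collisions. `digits_only` is a hypothetical helper name, and "SF347244" is an invented permit number used only to show a collision.

```python
from collections import Counter

def digits_only(permit):
    """Same expression as the cell above: keep digit characters only."""
    return ''.join(filter(str.isdigit, str(permit)))

# 'A1000569' and 'ALT347244' appear in the data preview; 'SF347244' is invented
samples = ["A1000569", "ALT347244", "SF347244"]
stripped = [digits_only(p) for p in samples]
print(stripped)

# Digit strings that occur more than once would be ambiguous merge keys
collisions = {key: n for key, n in Counter(stripped).items() if n > 1}
print(collisions)
```

Running a similar count on the real column before merging would show whether any such collisions exist in practice.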

Dates needed to be extracted into year, month and day

In [22]:
df['issued_date'] = pd.to_datetime(df['issued_date'])
df = df.assign(year=df['issued_date'].dt.year,
               month=df['issued_date'].dt.month,
               day=df['issued_date'].dt.day
               ).drop(columns=['issued_date'])

df['expiration_date'] = pd.to_datetime(df['expiration_date'])
df = df.assign(end_year=df['expiration_date'].dt.year.astype('Int64'),
               end_month=df['expiration_date'].dt.month.astype('Int64'),
               end_day=df['expiration_date'].dt.day.astype('Int64')
               ).drop(columns=['expiration_date'])
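
The end_* columns use the nullable `Int64` dtype because expiration_date has missing values (597182 of 622276 rows are non-null), and a missing date has no year, month or day. The plain-Python sketch below mirrors that behaviour on single values, with `None` standing in for a missing entry. `split_date` is a hypothetical helper, and the code assumes the offset is always written as two digits, as in the preview.

```python
from datetime import datetime

def split_date(raw):
    """Return (year, month, day) for a timestamp string, or (None, None, None) if missing."""
    if raw is None:
        return (None, None, None)
    # Assumption: the offset is a two-digit "+HH"; appending "00" yields "+HHMM" for %z
    parsed = datetime.strptime(raw + "00", "%Y-%m-%d %H:%M:%S%z")
    return (parsed.year, parsed.month, parsed.day)

print(split_date("2021-01-28 16:29:26+00"))  # issued_date from the first preview row
print(split_date(None))                       # missing expiration date
```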

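Zip codes needed processing: dropping everything after a hyphen (ZIP+4 suffixes and second zip codes), removing the trailing .0 left by float parsing, restoring the leading zero on four-digit values, and marking values of three digits or fewer as missing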

In [23]:
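def clean_zip(value):
    # Leave missing values untouched
    if pd.isna(value):
        return value

    value = str(value)
    # Drop everything after a hyphen and a trailing ".0" from float parsing
    value = re.sub(r'-.*|\.0$', '', value)
    # Restore the leading zero lost to numeric parsing (e.g. 2109 -> 02109)
    value = '0' + value if len(value) == 4 else value
    # Numeric values of three digits or fewer are incomplete: mark as missing
    value = pd.NA if value.isdigit() and len(value) <= 3 else value

    return value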

df['zipcode'] = df['zipcode'].apply(clean_zip)
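
To see what each rule does, the same steps can be run on a handful of values without pandas. This is a sketch, not the notebook's function: `clean_zip_plain` is a hypothetical stand-in that uses `None` for missing values, and every sample except 2109.0 (from the data preview) is invented.

```python
import re

def clean_zip_plain(value):
    """Plain-Python stand-in for clean_zip; None plays the role of a missing value."""
    if value is None:
        return None
    value = re.sub(r'-.*|\.0$', '', str(value))
    if len(value) == 4:
        value = '0' + value
    if value.isdigit() and len(value) <= 3:
        return None
    return value

for raw in [2109.0, "02109-1234", "02110-02111", "21", "MA"]:
    print(repr(raw), '->', repr(clean_zip_plain(raw)))
```

Note that non-numeric strings such as "MA" pass through unchanged, since only all-digit values are tested for length.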

Cities needed more extensive processing: we strip non-letter characters and then map each value to its closest expected name by Levenshtein distance

In [24]:
expected_cities = ["Boston", "West Roxbury", "Charlestown", "Roxbury", "Brighton", "Allston", "Jamaica Plain", "East Boston",
                   "Dorchester", "Hyde Park", "South Boston", "Roslindale", "Brighton/Allston", "Mission Hill", "Mattapan",
                   "Longwood", "Bay Village", "Chestnut Hill", "North End", "Leather District", "Chinatown",
                   "South Boston Waterfront", "West End", "Fenway", "South End", "Back Bay", "Downtown", "Beacon Hill",
                   "Theater District"]

def clean_and_match(city):
    if pd.isna(city):
        return city
    cleaned_city = ''.join(filter(str.isalpha, city))
    closest_match = min(expected_cities, key=lambda x: Levenshtein.distance(cleaned_city.lower(), x.lower()))
    return closest_match

df['city'] = df['city'].apply(clean_and_match)
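
Two properties of this matching are worth keeping in mind: `min` always returns some candidate, however distant, and stripping non-letters also strips spaces from the input. The sketch below illustrates both with a pure-Python edit distance, written out only to keep the example self-contained (the notebook itself uses `Levenshtein.distance`). `match_city` and its `max_distance` guard are hypothetical, and the city list is a shortened subset of `expected_cities`.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                        # deletion
                               current[j - 1] + 1,                     # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]

# Subset of expected_cities, shortened for the example
cities = ["Boston", "West Roxbury", "Roxbury", "Dorchester", "Jamaica Plain"]

def match_city(city, max_distance=None):
    """Closest city by edit distance; None when max_distance is set and exceeded."""
    cleaned = ''.join(filter(str.isalpha, city)).lower()
    best = min(cities, key=lambda c: edit_distance(cleaned, c.lower()))
    if max_distance is not None and edit_distance(cleaned, best.lower()) > max_distance:
        return None
    return best

print(match_city("Dorchestr"))                  # one missing letter
print(match_city("W. Roxbury"))                 # abbreviation: nearest name is Roxbury
print(match_city("Cambridge"))                  # not in the list, still gets some match
print(match_city("Cambridge", max_distance=2))  # hypothetical guard rejects it
```

Whether a distance cutoff is appropriate depends on how noisy the raw city values are; the value 2 here is only an illustration.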

The text column gets a basic cleaning (letters and whitespace only, lowercased) to save time in later analysis

In [25]:
def process_string(input_string):
    only_alphabetical = re.sub(r'[^a-zA-Z\s]', '', str(input_string))
    lowercased = only_alphabetical.lower()
    return lowercased

df['text'] = df['text'].apply(process_string)
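
This cleaning has two side effects that show up in the outputs. Digits are removed, so an ordinal like "2nd" becomes "nd", and missing comments are passed through `str()`, so they become the literal string 'nan' (which is why text reports no nulls in the final `df.info()` even though comments had some). The sketch below reproduces both on single values; the first sample sentence is invented.

```python
import re

def process_string(input_string):
    only_alphabetical = re.sub(r'[^a-zA-Z\s]', '', str(input_string))
    return only_alphabetical.lower()

print(repr(process_string("Adding readers on 2nd floor")))
print(repr(process_string(float('nan'))))  # a missing comment becomes the string 'nan'
```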

In [26]:
df.sample(5)

Unnamed: 0,id,permit,class,type,description,text,value,fee,status,occupancytype,sqft,city,zipcode,lat,lon,year,month,day,end_year,end_month,end_day
534017,533999,342653,OTHER,Short Form Bldg Permit,Other,replace air handlers,31076.0,340.0,Open,1-4FAM,0.0,Boston,2116,42.35303,-71.077181,2014,3,11,2014,9,11
458956,458940,1051807,SOL,Short Form Bldg Permit,Solar Panels,rooftop install of solar panels,10637.0,130.0,Open,1-2FAM,0.0,Jamaica Plain,2130,42.28966,-71.112621,2020,5,18,2020,11,18
80442,80410,1134223,ELECTRICAL,Electrical Permit,Electrical,refeed existing panelboards from new distribut...,900000.0,142.0,Closed,Comm,0.0,Boston,2116,42.348044,-71.074245,2020,11,9,2021,5,9
244745,244717,1461297,LVOLT,Electrical Low Voltage,Low Voltage,adding readers on nd floor and on th floor ...,8000.0,100.0,Open,Comm,0.0,Brighton,2135,42.357074,-71.14446,2023,4,13,2023,10,13
109911,109887,1482517,ELECTRICAL,Electrical Permit,Electrical,electrical fit out of th floor lab space light...,180000.0,610.0,Open,Comm,0.0,Boston,2114,42.363093,-71.067268,2023,6,6,2023,12,6


In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 622276 entries, 0 to 622275
Data columns (total 21 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   id             622276 non-null  int64  
 1   permit         622276 non-null  object 
 2   class          617722 non-null  object 
 3   type           622276 non-null  object 
 4   description    617722 non-null  object 
 5   text           622276 non-null  object 
 6   value          622276 non-null  float64
 7   fee            622276 non-null  float64
 8   status         622276 non-null  object 
 9   occupancytype  602785 non-null  object 
 10  sqft           622276 non-null  float64
 11  city           622062 non-null  object 
 12  zipcode        621829 non-null  object 
 13  lat            607822 non-null  float64
 14  lon            607822 non-null  float64
 15  year           622276 non-null  int32  
 16  month          622276 non-null  int32  
 17  day            622276 non-nul

In [28]:
# df.to_csv('/content/drive/MyDrive/City of Boston: Permitting D/Project Files/data/abp_cleaned.csv', index=False, encoding='utf-8')
df.to_csv('../data/cleaned_abp.csv', index=False, encoding='utf-8')