## CLEANING ARTICLE80 DEVELOPMENT PROJECTS

This notebook documents the cleaning process for the Article 80 data. Exploratory cleaning analysis has been removed (it remains available in the repository history) and replaced with comments or markdown explanations to aid readability. Column descriptions are documented in the data_insights document.

In [1]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 100)

# from google.colab import drive
# drive.mount('/content/drive')
# directory = '/content/drive/MyDrive/City of Boston: Permitting D/Project Files/data/a80.csv'
directory = '../data/raw_a80.csv' # interchangeable with above code

# Estimated runtime ~15 seconds

In [2]:
df = pd.read_csv(directory)
df.head(4)

Unnamed: 0,X,Y,OBJECTID,Project_ID,Project_Name,Project_Street_Number,Project_Street_Name,Project_Street_Suffix,Project_Zip_Code,Neighborhood,Project_Record_Type,Project_Status,Filed_Date,BPDA_Board_Approval,First_Building_Permit,COO_Permit_Date,Last_Project_Update_Date,Gross_Square_Footage,Description,Website_URL,Lat,Lon,Gross_Floor_Area,RnD_sqft,Shape
0,764853.067664,2942946.0,21411,2501,Jackson Square Recreation Center,1522,Columbus,Avenue,2119.0,Roxbury,NPC,Board Approved,2016/12/15 00:00:00+00,2011/06/16 00:00:00+00,,,,38500.0,The proposed project as described in the NPC c...,http://www.bostonplans.org/projects/developmen...,42.3229,-71.0981,75000.0,0.0,
1,766147.926951,2932274.0,21412,2502,Brooke Charter High School,198-260,American Legion,Highway,2124.0,Mattapan,Large Project,Construction Complete,2016/12/09 00:00:00+00,2017/03/16 00:00:00+00,2017/06/22 00:00:00+00,2018/08/21 00:00:00+00,2017/04/20 00:00:00+00,95000.0,The Brooke Charter High School proposed the co...,http://www.bostonplans.org/projects/developmen...,42.2936,-71.0935,95000.0,0.0,
2,767973.616213,2951926.0,21413,2508,1000 Boylston Street,1000,Boylston,Street,2115.0,Back Bay,Large Project,Board Approved,2017/01/05 00:00:00+00,2018/03/15 00:00:00+00,,,2019/05/28 00:00:00+00,513000.0,The Proposed Project consists of a single cond...,http://www.bostonplans.org/projects/developmen...,42.3475,-71.0864,439500.0,0.0,
3,753172.933872,2955175.0,21414,2509,Allston Yards Building B,400,Guest,Street,2134.0,Allston,Large Project,Board Approved,2018/01/22 00:00:00+00,2019/12/12 00:00:00+00,,,2023/01/27 00:00:00+00,636500.0,400 Guest Street (Building B) in the Allston Y...,http://www.bostonplans.org/projects/developmen...,42.3566,-71.1411,634500.0,350000.0,


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1769 entries, 0 to 1768
Data columns (total 25 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   X                         1760 non-null   float64
 1   Y                         1760 non-null   float64
 2   OBJECTID                  1769 non-null   int64  
 3   Project_ID                1769 non-null   int64  
 4   Project_Name              1769 non-null   object 
 5   Project_Street_Number     1727 non-null   object 
 6   Project_Street_Name       1759 non-null   object 
 7   Project_Street_Suffix     1708 non-null   object 
 8   Project_Zip_Code          1352 non-null   float64
 9   Neighborhood              1767 non-null   object 
 10  Project_Record_Type       1769 non-null   object 
 11  Project_Status            1769 non-null   object 
 12  Filed_Date                897 non-null    object 
 13  BPDA_Board_Approval       1198 non-null   object 
 14  First_Bu

The following cells drop unneeded columns and clean the remaining ones. We try to stay consistent with lowercase, underscore-separated column names.

In [4]:
# Initial column drop: remove all unneeded columns in a single call
df.drop(
    columns=[
        'X',
        'Y',
        'Project_Street_Number',
        'Project_Street_Name',
        'Project_Street_Suffix',
        'Website_URL',
        'Gross_Floor_Area',
        'RnD_sqft',
        'Shape',
    ],
    inplace=True,
)

In [5]:
# Initial column rename: map the original names to short lowercase names
df = df.rename(columns={
    'OBJECTID': 'id',
    'Project_ID': 'project_id',
    'Project_Name': 'name',
    'Project_Zip_Code': 'zipcode',
    'Neighborhood': 'city',
    'Project_Record_Type': 'type',
    'Project_Status': 'status',
    'Gross_Square_Footage': 'sqft',
    'Description': 'text',
    'Lat': 'lat',
    'Lon': 'lon',
})

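Zip codes needed more intensive formatting: the raw column is read as a float (e.g. 2119.0), which drops the leading zero and appends a decimal part.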

In [6]:
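def clean_zip(value):
    # Leave missing values untouched
    if pd.isna(value):
        return value

    value = str(value)
    # Strip a hyphenated suffix ("-1234") or the trailing ".0" left by the float dtype
    value = re.sub(r'-.*|\.0$', '', value)
    # Restore the leading zero dropped when the zip code was read as a number
    value = '0' + value if len(value) == 4 else value
    # Treat all-digit values of three characters or fewer as invalid
    value = pd.NA if value.isdigit() and len(value) <= 3 else value

    return value

df['zipcode'] = df['zipcode'].apply(clean_zip)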
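
To make the rules in `clean_zip` concrete, the sketch below runs the same function on a few sample values. The sample inputs are illustrative assumptions, not rows taken from the dataset.

```python
import re
import pandas as pd

def clean_zip(value):
    # Same logic as the notebook cell, repeated so this example runs on its own
    if pd.isna(value):
        return value
    value = str(value)
    value = re.sub(r'-.*|\.0$', '', value)
    value = '0' + value if len(value) == 4 else value
    value = pd.NA if value.isdigit() and len(value) <= 3 else value
    return value

# Hypothetical inputs: a float zip, a hyphenated zip, a too-short value, a missing value
samples = [2119.0, "02134-1234", 210.0, float("nan")]
cleaned = [clean_zip(v) for v in samples]

for raw, result in zip(samples, cleaned):
    print(repr(raw), "->", repr(result))
```

The float and hyphenated values come back as five-character strings, the three-digit value becomes `pd.NA`, and the missing value is passed through unchanged.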

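We collapsed the five date columns into a single date, the most recent non-null one for each project, and split it into year, month, and day to simplify date handling later in the process.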

In [7]:
def get_date_components(row):
    date_columns = ['Filed_Date', 'BPDA_Board_Approval', 'First_Building_Permit', 'COO_Permit_Date', 'Last_Project_Update_Date']
    dates = [row[col] for col in date_columns]
    non_null_dates = [pd.to_datetime(date) for date in dates if pd.notnull(date)]

    if non_null_dates:
        chosen_date = max(non_null_dates)
        return chosen_date.year, chosen_date.month, chosen_date.day
    else:
        return pd.NaT, pd.NaT, pd.NaT

df[['year', 'month', 'day']] = df.apply(get_date_components, axis=1, result_type='expand')

# The original date columns are no longer needed once year/month/day exist
df.drop(
    columns=[
        'Filed_Date',
        'BPDA_Board_Approval',
        'First_Building_Permit',
        'COO_Permit_Date',
        'Last_Project_Update_Date',
    ],
    inplace=True,
)
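
The same "latest non-null date" rule can also be expressed column-wise. The following is a minimal sketch rather than the notebook's method: the `demo` frame, its simplified date strings, and the use of only three of the five date columns are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical subset of the date columns, with simplified date strings
date_columns = ["Filed_Date", "BPDA_Board_Approval", "Last_Project_Update_Date"]

demo = pd.DataFrame({
    "Filed_Date": ["2016-12-09", None],
    "BPDA_Board_Approval": ["2017-03-16", None],
    "Last_Project_Update_Date": ["2017-04-20", None],
})

# Parse each column, then take the row-wise maximum; rows with no dates stay NaT
parsed = demo[date_columns].apply(pd.to_datetime)
latest = parsed.max(axis=1)

# Nullable integers keep missing components as <NA> instead of NaT or NaN
demo["year"] = latest.dt.year.astype("Int64")
demo["month"] = latest.dt.month.astype("Int64")
demo["day"] = latest.dt.day.astype("Int64")

print(demo[["year", "month", "day"]])
```

One design difference to be aware of: this variant stores missing components as `<NA>` in an integer column, whereas the row-wise function above returns `pd.NaT` for them.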

We left 'name' as is and preprocessed the 'text' field by removing non-alphabetic characters and lowercasing what remains.

In [8]:
def process_string(input_string):
    # Remove non-alphabetical characters (excluding whitespace) using regex
    only_alphabetical = re.sub(r'[^a-zA-Z\s]', '', str(input_string))

    # Convert all words to lowercase
    lowercase_result = only_alphabetical.lower()
    
    return lowercase_result

df.text = df.text.apply(process_string)
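
Because `process_string` wraps its input in `str()`, a missing description (NaN) is turned into the literal string `nan`. If missing values should stay missing, a guard can be added first. The sketch below is one possible variant; the function name `process_string_keep_missing` and the sample inputs are hypothetical.

```python
import re
import pandas as pd

def process_string_keep_missing(input_string):
    # Pass missing values through instead of converting them to the text "nan"
    if pd.isna(input_string):
        return input_string
    # Remove non-alphabetical characters (excluding whitespace), then lowercase
    return re.sub(r'[^a-zA-Z\s]', '', str(input_string)).lower()

# Hypothetical inputs: one description and one missing value
demo = pd.Series(["400 Guest Street (Building B)", float("nan")])
processed = demo.apply(process_string_keep_missing)
print(processed.tolist())
```

Note that removing digits and punctuation leaves the surrounding whitespace in place, so the first result keeps a leading space.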

## Save csv

In [9]:
df.head()

Unnamed: 0,id,project_id,name,zipcode,city,type,status,sqft,text,lat,lon,year,month,day
0,21411,2501,Jackson Square Recreation Center,2119,Roxbury,NPC,Board Approved,38500.0,the proposed project as described in the npc c...,42.3229,-71.0981,2016,12,15
1,21412,2502,Brooke Charter High School,2124,Mattapan,Large Project,Construction Complete,95000.0,the brooke charter high school proposed the co...,42.2936,-71.0935,2018,8,21
2,21413,2508,1000 Boylston Street,2115,Back Bay,Large Project,Board Approved,513000.0,the proposed project consists of a single cond...,42.3475,-71.0864,2019,5,28
3,21414,2509,Allston Yards Building B,2134,Allston,Large Project,Board Approved,636500.0,guest street building b in the allston yards ...,42.3566,-71.1411,2023,1,27
4,21415,2510,Wentworth Multipurpose Academic Building,2215,Mission Hill,Large Project,Construction Complete,69000.0,the mpa building will contain laboratories stu...,42.3359,-71.0948,2018,11,8


In [10]:
# sanity check (additional checks and plots removed for conciseness)
print(df.zipcode.unique)
print(df.city.unique)
print(df.type.unique)
print(df.status.unique)
print(df.year.unique)

df.info()

<bound method Series.unique of 0       02119
1       02124
2       02115
3       02134
4       02215
        ...  
1764      NaN
1765      NaN
1766      NaN
1767      NaN
1768      NaN
Name: zipcode, Length: 1769, dtype: object>
<bound method Series.unique of 0                     Roxbury
1                    Mattapan
2                    Back Bay
3                     Allston
4                Mission Hill
                ...          
1764               Dorchester
1765    Longwood Medical Area
1766                 Brighton
1767             South Boston
1768                  Roxbury
Name: city, Length: 1769, dtype: object>
<bound method Series.unique of 0                 NPC
1       Large Project
2       Large Project
3       Large Project
4       Large Project
            ...      
1764    Small Project
1765    Large Project
1766              NPC
1767    Small Project
1768    Small Project
Name: type, Length: 1769, dtype: object>
<bound method Series.unique of 0              Board App
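
The output above shows `<bound method Series.unique of ...>` followed by the full Series because `unique` is referenced without parentheses, so the method object is printed instead of being called. A minimal sketch of the intended check, run on a small hypothetical frame rather than the real data:

```python
import pandas as pd

# Hypothetical stand-in for the cleaned frame
demo = pd.DataFrame({
    "status": ["Board Approved", "Construction Complete", "Board Approved"],
    "zipcode": ["02119", "02124", None],
})

# Calling unique() returns the distinct values in order of first appearance
distinct_status = demo["status"].unique()
print(distinct_status)

# value_counts(dropna=False) also reports how many values are missing
zip_counts = demo["zipcode"].value_counts(dropna=False)
print(zip_counts)
```

Bracket access (`demo["status"]`) is used here instead of attribute access, which avoids any ambiguity between column names and DataFrame attributes.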

In [11]:
# df.to_csv('/content/drive/MyDrive/City of Boston: Permitting D/Project Files/data/a80_cleaned.csv', index=False, encoding='utf-8')
df.to_csv('../data/cleaned_a80.csv', index=False, encoding='utf-8')
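
One caveat when loading the saved file later: `pd.read_csv` infers numeric types by default, so zip codes such as `02119` would be read back as the integer 2119. The sketch below illustrates this with an in-memory buffer and a small hypothetical frame instead of the real file; passing `dtype={'zipcode': str}` keeps the leading zero.

```python
import io
import pandas as pd

# Hypothetical stand-in for the cleaned frame
demo = pd.DataFrame({"zipcode": ["02119", "02124"], "sqft": [38500.0, 95000.0]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Default read: zip codes are inferred as integers and lose the leading zero
buffer.seek(0)
naive = pd.read_csv(buffer)

# Reading the column as a string preserves the five-character form
buffer.seek(0)
kept = pd.read_csv(buffer, dtype={"zipcode": str})

print(naive["zipcode"].tolist())
print(kept["zipcode"].tolist())
```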