# Finding Property in City Council District 7

### Overview
The goal of this notebook is to extend the Property Assessment Dataset by adding another column called `IS_D7` that indicates whether the property is in District 7.

Datasets used in this notebook:
- Property Assessment Dataset: [link](https://data.boston.gov/dataset/property-assessment)
- Boston Live Street Address Management: [link](https://data.boston.gov/dataset/live-street-address-management-sam-addresses)
- Boston City Council 2023-2032 Shapefile: [link](https://data.boston.gov/dataset/city-council-districts-2023-2032)
- Boston ZIP Code Shapefile: [link](https://data.boston.gov/dataset/zip-codes/resource/a9b44fec-3a21-42ac-a919-06ec4ac20ab8)

### Summary
There are many missing values for street number and incorrect street address formatting, along with some streets that isn't in Live Street Address Management's dataset. We did our best to maintain data integrity and correctness as it is assigned to either in District 7 or not, but please be mindful that there will be some mistakes with our approach. We assign District 7 to the property that we are mostly sure of its belonging and mark any address whose coordinate cannot be determined as not in District 7.

In the process, we assign XY coordinates (longitude and latitude) according to this order:
1. Exact full street address matches, which includes treet number, street body, and street suffix (no unit number, assuming that property with the same street address will be in the same building)
2. Street body and suffix match
3. Street body match
4. Remove suffix from street body where included and match

At the end of the coordinate assigning process, 994,085 properties out of the total 1,006,669 have coordinate. 12584 property still doesn't have coordinate with main reasons being that the address is not found in Live Street Address Management Dataset. We deem that the number of property being left out is small and can be safely excluded from the dataset.

After coordinates are assigned, it is used to compare against City Council Shapefile to determine whether the location falls within District 7's boundary. `IS_D7` is set to `True` if it is and `False` if not. To reduce errors that may stem from comparing street body without street numbers and suffix, out of the rows where `IS_D7 = True`, if the ZIP code falls outside of District 7 ZIP Code, modify the value to `False`.

The resulting dataframe is export as `d7-property-new.csv` in the `data` folder.

In [398]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import csv
import geopandas as gpd
import fiona
from shapely.geometry import Point
from fuzzywuzzy import process, fuzz

In [399]:
df_2019 = pd.read_csv("../data/property/property_2019.csv", low_memory=False)
df_2020 = pd.read_csv("../data/property/property_2020.csv", low_memory=False)
df_2021 = pd.read_csv("../data/property/property_2021.csv", low_memory=False)
df_2022 = pd.read_csv("../data/property/property_2022.csv", low_memory=False)
df_2023 = pd.read_csv("../data/property/property_2023.csv", low_memory=False)
df_2024 = pd.read_csv("../data/property/property_2024.csv", low_memory=False)

df_2019["YEAR"] = 2019
df_2020["YEAR"] = 2020
df_2021["YEAR"] = 2021
df_2022["YEAR"] = 2022
df_2023["YEAR"] = 2023
df_2024["YEAR"] = 2024

df_property = pd.concat([df_2019, df_2020, df_2021, df_2022, df_2023, df_2024], ignore_index=True)

## Data Cleaning

The number of columns explodes after combing the datasets of 6 years. This is because new columns are introduced as the year goes by, and some columns are renamed, causing many missing values in some columns where values are distributed across many columns that serve the same purpose.

Especially the dataset from 2019, the column names in this dataset are different from later years. We assume that this is due to data re-structuring. All data on residential unit is prefixed with `R_`, while condo main and condo unit is prefixed with `S_` and `U_` respectively. However, starting from 2020, the characteristics of property are not separatedly labelled.

We'll be using the dataset to analyze property tax and property value trend, so we're going to combine the values of some columns into one.

### Property Value

Columns for property values have multiple different names.
- Total property value: `AV_TOTAL`, `TOTAL_VALUE`
- Building value: `AV_BLDG`, `BLDG_VALUE`
- Land value: `AV_LAND`, `LAND_VALUE`

In [400]:
df_property["AV_TOTAL"] = df_property["AV_TOTAL"].fillna(df_property["TOTAL_VALUE"])
df_property["AV_BLDG"] = df_property["AV_BLDG"].fillna(df_property["BLDG_VALUE"])
df_property["AV_LAND"] = df_property["AV_LAND"].fillna(df_property["LAND_VALUE"])
df_property.drop(columns=["TOTAL_VALUE", "BLDG_VALUE", "LAND_VALUE"], inplace=True)

In [401]:
df_property["AV_TOTAL"] = df_property["AV_TOTAL"].replace({"\\$": "", ",": ""}, regex=True).astype(float)
df_property["AV_BLDG"] = df_property["AV_BLDG"].replace({"\\$": "", ",": ""}, regex=True).astype(float)
df_property["AV_LAND"] = df_property["AV_LAND"].replace({"\\$": "", ",": ""}, regex=True).astype(float)

### Gross Tax

Gross property tax is broken down into 2 columns: `GROSS_TAX` and `...GROSS_TAX...`. where `...` is a blank space.

In [402]:
df_property["GROSS_TAX"] = df_property["GROSS_TAX"].fillna(df_property[" GROSS_TAX "])
df_property.drop(columns=[" GROSS_TAX "], inplace=True)

#### Reformat values from `GROSS_TAX` column
- In 2019 dataset, `GROSS_TAX` contains incorrect value because many rows show property tax higher than property value. After manually calculating tax according to the year's tax rate, we found that the current value is the value of tax but they're missing decimals. Instead of XXXXXX, it should be XXXX.XX.
- In 2020 - 2024 dataset, values for `GROSS_TAX` are strings with dollar sign ($) and have leading and trailing space.

In [403]:
df_property.loc[df_property["YEAR"] == 2019, "GROSS_TAX"] = df_property.loc[df_property["YEAR"] == 2019, "GROSS_TAX"].astype(float) / 100
df_property.loc[df_property["YEAR"] == 2021, "GROSS_TAX"] = df_property.loc[df_property["YEAR"] == 2021, "GROSS_TAX"].str.strip().replace({"\\$": "", ",": ""}, regex=True)
df_property.loc[df_property["YEAR"] == 2022, "GROSS_TAX"] = df_property.loc[df_property["YEAR"] == 2022, "GROSS_TAX"].str.strip().replace({"\\$": "", ",": ""}, regex=True)
df_property.loc[df_property["YEAR"] == 2024, "GROSS_TAX"] = df_property.loc[df_property["YEAR"] == 2024, "GROSS_TAX"].str.strip().replace({"\\$": "", ",": ""}, regex=True)
df_property["GROSS_TAX"] = df_property["GROSS_TAX"].replace({"-": "0.00"}, regex=True)

In [404]:
df_property["GROSS_TAX"] = df_property["GROSS_TAX"].astype(float)

In [405]:
df_property.loc[df_property["AV_TOTAL"].isna(), "GROSS_TAX"] = np.nan

### Drop Condominium Main property
Condominium main is explained in the data key as "physical structure housing all related condo units with no assessed value." Inspecting the values in columns where land use type is `CM` reveals that the columns related to property values is is mostly populated with 0 or `NaN`. We'll drop all Condominium Main property 

In [406]:
cm_df = df_property[df_property["LU"] == "CM"]
print(f"Number of condominium main property with assessed total property value: {cm_df[(cm_df["AV_TOTAL"] != 0) & (cm_df["AV_TOTAL"] != "0") & (~cm_df["AV_TOTAL"].isna())]["AV_TOTAL"].shape[0]}")
print(f"Number of condominium main property with assessed building value: {cm_df[(cm_df["AV_BLDG"] != 0) & (cm_df["AV_BLDG"] != "0") & (~cm_df["AV_BLDG"].isna())]["AV_BLDG"].shape[0]}")
print(f"Number of condominium main property with assessed land value: {cm_df[(cm_df["AV_LAND"] != 0) & (cm_df["AV_LAND"] != "0") & (~cm_df["AV_LAND"].isna())]["AV_LAND"].shape[0]}")

Number of condominium main property with assessed total property value: 174
Number of condominium main property with assessed building value: 170
Number of condominium main property with assessed land value: 4


We deem that Condominium Main property isn't relevant to the analysis of housing and can be dropped altogether due to the small number of property value available.

In [407]:
df_property = df_property[df_property["LU"] != "CM"]

Also drop the columns related to Condominium Main property's characteristics.

In [408]:
df_property.drop(columns=[column for column in df_property.columns if column.startswith("S_")], inplace=True)

### Combine property characteristics columns

The dataset for 2019 has a column name `R_BLDG_STYL` which refers to building style.  The dataset for 2020 to 2024 doesn't have `R_BLDG_STYL` columns, but it is renamed to `BLDG_TYPE`, which refers to builing style and type. However, the value in `R_BLDG_STYL` from 2019 is formatted differently from other years, so we'll have to clean that first.

#### Building Style and Building Type
Both `R_BLDG_STYL` and `BLDG_TYPE` represent the same data.

In [409]:
r_bldg_styl_dict = {
    "CV": "CV - Conventional",
    "RE": "RE - Row End",
    "RM": "RM - Row Middle",
    "DK": "DK - Decker",
    "TF": "TF - Two Fam Stack",
    "SD": "SD - Semi-Det",
    "CL": "CL - Colonial",
    "CP": "CP - Cape",
    "DX": "DX - Duplex",
    "VT": "VT - Victorian",
    "RR": "RR - Raised Ranch",
    "OT": "OT - Other",
    "RN": "RN - Ranch",
    "BW": "BW - Bungalow",
    "CN": "CN - Contemporary",
    "SL": "SL - Split Level",
    "TL": "TL - Tri-Level",
    "TD": "TD - Tudor",
    "BL": "BL - Bi-Level",
    "104 - TWO-FAM DWELLI": "104 - TWO-FAM DWELLING",
    "105 - THREE-FAM DWEL": "105 - THREE-FAM DWELLING"
}

In [410]:
df_property["R_BLDG_STYL"] = df_property["R_BLDG_STYL"].replace(r_bldg_styl_dict)

In [411]:
df_property["BLDG_TYPE"] = df_property["BLDG_TYPE"].fillna(df_property["R_BLDG_STYL"])
df_property.drop(columns=["R_BLDG_STYL"], inplace=True)

#### Drop `R_KITCH` for 2020
According to the data ket documentation, `R_KITCH` is the number of kitchens in the property, but the data from 2020 put kitchen type in the column instead.

In [412]:
df_property.loc[df_property["YEAR"] == 2020, "R_KITCH"] = np.nan

### Drop columns that are related to the property's cosmetic and ownership

In [413]:
cosmetic_columns = ["R_EXT_FIN", "EXT_FNISHED", "R_BTH_STYLE", "R_BTH_STYLE2", "R_BTH_STYLE3", "R_KITCH_STYLE", 
                    "R_KITCH_STYLE2", "R_KITCH_STYLE3", "R_ROOF_TYP", "R_INT_FIN", "U_BASE_FLOOR", "U_BTH_STYLE",
                    "U_BTH_STYLE2", "U_BTH_STYLE3", "U_KITCH_TYPE", "U_KITCH_STYLE", "U_INT_FIN", "EXT_FINISHED",
                    "BTHRM_STYLE1", "BTHRM_STYLE2", "BTHRM_STYLE3", "KITCHEN_TYPE", "KITCHEN_STYLE1", "KITCHEN_STYLE2",
                    "KITCHEN_STYLE3", "EXT_FNISHED", "SFYI_VALUE", "STRUCTURE_CLASS", "U_CORNER", "U_ORIENT", "ROOF_STRUCTURE",
                    "ROOF_COVER", "BDRM_COND", "INT_WALL", "CORNER_UNIT", "ORIENTATION"]

ownership_columns = ["OWNER", "MAIL_ADDRESSEE", "MAIL_ADDRESS", "MAIL CS", "MAIL_ZIPCODE", "MAIL_CITY", "MAIL_STATE", "OWNER MAIL ADDRESS", "MAIL_STREET_ADDRESS",
                    "MAIL_ZIP_CODE"]

In [414]:
df_property.drop(columns=cosmetic_columns + ownership_columns, inplace=True)

### Merge columns with the same function
These columns serve the same functions:
- `ZIP_CODE` and `ZIPCODE`

- `PTYPE` and `LUC`
- `YR_BUILT`
- `YR_REMOD` and `YR_REMODEL`
- `R_TOTAL_RMS`, `U_TOT_RMS`, and `TT_RMS`
- `R_BDRMS`, `U_BDRMS`, and `BED_RMS`
- `R_FULL_BTH`, `U_FULL_BTH`, and `FULL_BTH`
- `R_HALF_BTH`, `U_HALF_BTH`, and `HLF_BTH`
- `R_KITCH`, `KITCHENS`, and `KITCHEN`
- `R_HEAT_TYP`, `U_HEAT_TYP`, and `HEAT_TYPE`
- `R_AC`, `U_AC`, and `AC_TYPE`
- `R_FPLACE`, `U_FPLACE`, `FIRE_PLACE`, and `FIREPLACES`
- `R_EXT_CND` and  `EXT_COND`
- `R_OVRALL_CND` and `OVERALL_COND`
- `R_INT_CND`, `U_INT_CND`, and `INT_COND`
- `R_VIEW`, `U_VIEW`, and `PROP_VIEW`
- `U_NUM_PARK` and `NUM_PARKING`
- `NUM_FLOORS`, `RES_FLOOR`, and `CD_FLOOR`
- `RC_UNITS` and `RES_UNITS`

In [415]:
df_property["ZIPCODE"] = df_property["ZIPCODE"].fillna(df_property["ZIP_CODE"])
df_property["LUC"] = df_property["LUC"].fillna(df_property["PTYPE"])
df_property["YR_REMODEL"] = df_property["YR_REMODEL"].fillna(df_property["YR_REMOD"])
df_property["TT_RMS"] = df_property["TT_RMS"].fillna(df_property["R_TOTAL_RMS"])
df_property["TT_RMS"] = df_property["TT_RMS"].fillna(df_property["U_TOT_RMS"])
df_property["BED_RMS"] = df_property["BED_RMS"].fillna(df_property["R_BDRMS"])
df_property["BED_RMS"] = df_property["BED_RMS"].fillna(df_property["U_BDRMS"])
df_property["FULL_BTH"] = df_property["FULL_BTH"].fillna(df_property["R_FULL_BTH"])
df_property["FULL_BTH"] = df_property["FULL_BTH"].fillna(df_property["U_FULL_BTH"])
df_property["HLF_BTH"] = df_property["HLF_BTH"].fillna(df_property["R_HALF_BTH"])
df_property["HLF_BTH"] = df_property["HLF_BTH"].fillna(df_property["U_HALF_BTH"])
df_property["KITCHEN"] = df_property["KITCHEN"].fillna(df_property["R_KITCH"])
df_property["KITCHEN"] = df_property["KITCHEN"].fillna(df_property["KITCHENS"])
df_property["HEAT_TYPE"] = df_property["HEAT_TYPE"].fillna(df_property["R_HEAT_TYP"])
df_property["HEAT_TYPE"] = df_property["HEAT_TYPE"].fillna(df_property["U_HEAT_TYP"])
df_property["AC_TYPE"] = df_property["AC_TYPE"].fillna(df_property["R_AC"])
df_property["AC_TYPE"] = df_property["AC_TYPE"].fillna(df_property["U_AC"])
df_property["FIREPLACES"] = df_property["FIREPLACES"].fillna(df_property["FIRE_PLACE"])
df_property["FIREPLACES"] = df_property["FIREPLACES"].fillna(df_property["R_FPLACE"])
df_property["FIREPLACES"] = df_property["FIREPLACES"].fillna(df_property["U_FPLACE"])
df_property["EXT_COND"] = df_property["EXT_COND"].fillna(df_property["R_EXT_CND"])
df_property["OVERALL_COND"] = df_property["OVERALL_COND"].fillna(df_property["R_OVRALL_CND"])
df_property["INT_COND"] = df_property["INT_COND"].fillna(df_property["R_INT_CND"])
df_property["INT_COND"] = df_property["INT_COND"].fillna(df_property["U_INT_CND"])
df_property["PROP_VIEW"] = df_property["PROP_VIEW"].fillna(df_property["R_VIEW"])
df_property["PROP_VIEW"] = df_property["PROP_VIEW"].fillna(df_property["U_VIEW"])
df_property["NUM_PARKING"] = df_property["NUM_PARKING"].fillna(df_property["U_NUM_PARK"])
df_property["NUM_FLOORS"] = df_property["NUM_FLOORS"].fillna(df_property["RES_FLOOR"])
df_property["NUM_FLOORS"] = df_property["NUM_FLOORS"].fillna(df_property["CD_FLOOR"])
df_property["RES_UNITS"] = df_property["RES_UNITS"].fillna(df_property["RC_UNITS"])
df_property["HEAT_TYPE"] = df_property["HEAT_TYPE"].fillna(df_property["U_HEAT_TYP"])

  df_property["KITCHEN"] = df_property["KITCHEN"].fillna(df_property["R_KITCH"])


In [416]:
df_property.drop(columns=["ZIP_CODE", "PTYPE", "YR_REMOD", "R_TOTAL_RMS", "U_TOT_RMS", "R_BDRMS", "U_BDRMS", "R_FULL_BTH", "U_FULL_BTH", "U_HEAT_TYP",
                            "R_HALF_BTH", "U_HALF_BTH", "R_KITCH", "R_HEAT_TYP", "R_AC", "U_AC", "FIRE_PLACE", "R_FPLACE", "U_FPLACE", "KITCHENS",
                            "R_EXT_CND", "R_OVRALL_CND", "R_INT_CND", "U_INT_CND", "R_VIEW", "U_VIEW", "U_NUM_PARK", "RES_FLOOR", "CD_FLOOR", "RC_UNITS",], inplace=True)

In [417]:
df_property.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1006677 entries, 0 to 1068277
Data columns (total 45 columns):
 #   Column        Non-Null Count    Dtype  
---  ------        --------------    -----  
 0   PID           1006677 non-null  int64  
 1   CM_ID         555749 non-null   object 
 2   GIS_ID        1006664 non-null  float64
 3   ST_NUM        978330 non-null   object 
 4   ST_NAME       1006677 non-null  object 
 5   ST_NAME_SUF   326961 non-null   object 
 6   UNIT_NUM      459239 non-null   object 
 7   ZIPCODE       1006655 non-null  float64
 8   LU            1006677 non-null  object 
 9   OWN_OCC       1006501 non-null  object 
 10  AV_LAND       927492 non-null   float64
 11  AV_BLDG       998441 non-null   float64
 12  AV_TOTAL      1006294 non-null  float64
 13  GROSS_TAX     1006294 non-null  float64
 14  LAND_SF       830674 non-null   object 
 15  YR_BUILT      886435 non-null   float64
 16  GROSS_AREA    914529 non-null   float64
 17  LIVING_AREA   914497 non-null   

### Format `ZIPCODE` column

In [418]:
df_property["ZIPCODE"] = df_property["ZIPCODE"].astype(pd.Int64Dtype()).astype(str).str.zfill(5)

## Convert all float type columns that should be int type
Columns that is intuitively integer type but is displayed as float should be concerted to integer.

In [419]:
df_property = df_property[(df_property["YR_REMODEL"] > 1000) | (df_property["YR_REMODEL"] == 0) | (df_property["YR_REMODEL"].isna())]

# Instead of using 0 as a value for property that has not been remodeled, use NaN.
df_property["YR_REMODEL"] = df_property["YR_REMODEL"].replace(0.0, np.nan)

In [420]:
df_property["LAND_SF"] = df_property["LAND_SF"].replace({",": ""}, regex=True).astype(float)
df_property["YR_BUILT"] = df_property["YR_BUILT"].astype("Int64")
df_property["BLDG_SEQ"] = df_property["BLDG_SEQ"].astype("Int64")
df_property["NUM_BLDGS"] = df_property["NUM_BLDGS"].astype("Int64")
df_property["LUC"] = pd.to_numeric(df_property["LUC"], errors="coerce").astype("Int64")
df_property["RES_UNITS"] = df_property["RES_UNITS"].astype("Int64")
df_property["COM_UNITS"] = df_property["COM_UNITS"].astype("Int64")
df_property["YR_REMODEL"] = df_property["YR_REMODEL"].astype("Int64")
df_property["BED_RMS"] = df_property["BED_RMS"].astype("Int64")
df_property["FULL_BTH"] = df_property["FULL_BTH"].astype("Int64")
df_property["HLF_BTH"] = df_property["HLF_BTH"].astype("Int64")
df_property["KITCHEN"] = df_property["KITCHEN"].astype("Int64")
df_property["TT_RMS"] = pd.to_numeric(df_property["TT_RMS"], errors="coerce").astype("Int64")
df_property["NUM_PARKING"] = df_property["NUM_PARKING"].astype("Int64")
df_property["FIREPLACES"] = df_property["FIREPLACES"].astype("Int64")
df_property["CM_ID"] = pd.to_numeric(df_property["CM_ID"], errors="coerce").astype("Int64").astype(str)
df_property["GIS_ID"] = pd.to_numeric(df_property["GIS_ID"], errors="coerce").astype("Int64").astype(str)

In [421]:
condition_dict = {
    "G": "G - Good",
    "A": "A - Average",
    "E": "E - Excellent",
    "P": "P - Poor",
    "F": "F - Fair",
    "AVG - Default - Average": "A - Average",
    "EX - Excellent": "E - Excellent",
    "S": "S - Special"
}

In [422]:
df_property["INT_COND"] = df_property["INT_COND"].replace(condition_dict)
df_property["EXT_COND"] = df_property["EXT_COND"].replace(condition_dict)
df_property["OVERALL_COND"] = df_property["OVERALL_COND"].replace(condition_dict)
df_property["PROP_VIEW"] = df_property["PROP_VIEW"].replace(condition_dict)

In [423]:
heat_type_dict = {
    "W": "W - Ht Water/Steam",
    "F": "F - Forced Hot Air",
    "S": "S - Space Heat",
    "E": "E - Electric",
    "N": "N - None",
    "P": "P - Heat Pump",
    "O": "O - Other"
}

In [424]:
df_property["HEAT_TYPE"] = df_property["HEAT_TYPE"].replace(heat_type_dict)

In [425]:
AC_type_dict = {
    "N": "N - None",
    "C": "C - Central AC",
    "D": "D - Ductless AC"
}

In [426]:
df_property["AC_TYPE"] = df_property["AC_TYPE"].replace(AC_type_dict)

In [427]:
df_property.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1006669 entries, 0 to 1068277
Data columns (total 45 columns):
 #   Column        Non-Null Count    Dtype  
---  ------        --------------    -----  
 0   PID           1006669 non-null  int64  
 1   CM_ID         1006669 non-null  object 
 2   GIS_ID        1006669 non-null  object 
 3   ST_NUM        978322 non-null   object 
 4   ST_NAME       1006669 non-null  object 
 5   ST_NAME_SUF   326959 non-null   object 
 6   UNIT_NUM      459234 non-null   object 
 7   ZIPCODE       1006669 non-null  object 
 8   LU            1006669 non-null  object 
 9   OWN_OCC       1006493 non-null  object 
 10  AV_LAND       927484 non-null   float64
 11  AV_BLDG       998433 non-null   float64
 12  AV_TOTAL      1006286 non-null  float64
 13  GROSS_TAX     1006286 non-null  float64
 14  LAND_SF       830666 non-null   float64
 15  YR_BUILT      886428 non-null   Int64  
 16  GROSS_AREA    914521 non-null   float64
 17  LIVING_AREA   914489 non-null   

# Live Street Address Dataset

In [428]:
df_street_address = pd.read_csv("../data/Boston_SAM.csv", low_memory=False)

### Total street address: 400197

There are some property that will not have the street address corresponding to it? What to do?

In [429]:
df_street_address[["POINT_X", "POINT_Y", "FULL_ADDRESS", "STREET_NUMBER", "FULL_STREET_NAME"]].sort_values(by="FULL_ADDRESS")

Unnamed: 0,POINT_X,POINT_Y,FULL_ADDRESS,STREET_NUMBER,FULL_STREET_NAME
29697,-71.120547,42.278381,0 Cliffmont St,0,Cliffmont St
399204,-71.075982,42.338042,0 Deacon St,0,Deacon St
41758,-71.063681,42.373852,0 Devens St,0,Devens St
41757,-71.063681,42.373852,0 Devens St 1,0,Devens St
43255,-71.166560,42.280650,0 Dow Rd,0,Dow Rd
...,...,...,...,...,...
134864,-71.075107,42.347132,,20-48,
136137,-71.056711,42.361799,,116,Blackstone St
140399,-71.074028,42.347213,,10-12,
399800,-71.054734,42.359777,,34,


In [430]:
df_street_address["FULL_ADDRESS"] = df_street_address["FULL_ADDRESS"].str.strip()
df_street_address["STREET_NUMBER"] = df_street_address["STREET_NUMBER"].str.strip()
df_street_address["FULL_STREET_NAME"] = df_street_address["FULL_STREET_NAME"].str.strip()

### Number of rows from live street address that don't have both street name and full address: 8
These rows are unusable and will be dropped.

In [431]:
df_street_address[df_street_address["FULL_ADDRESS"].isna() & df_street_address["FULL_STREET_NAME"].isna()][["POINT_X", "POINT_Y", "FULL_ADDRESS", "STREET_NUMBER", "FULL_STREET_NAME"]]

Unnamed: 0,POINT_X,POINT_Y,FULL_ADDRESS,STREET_NUMBER,FULL_STREET_NAME
134864,-71.075107,42.347132,,20-48,
140399,-71.074028,42.347213,,10-12,
399800,-71.054734,42.359777,,34,
400039,-71.054865,42.360526,,8,


In [432]:
df_street_address = df_street_address.dropna(subset=['FULL_ADDRESS', 'FULL_STREET_NAME'], how='all')
df_street_address[["POINT_X", "POINT_Y", "FULL_ADDRESS", "STREET_NUMBER", "FULL_STREET_NAME"]].sort_values(by="FULL_ADDRESS")

Unnamed: 0,POINT_X,POINT_Y,FULL_ADDRESS,STREET_NUMBER,FULL_STREET_NAME
29697,-71.120547,42.278381,0 Cliffmont St,0,Cliffmont St
399204,-71.075982,42.338042,0 Deacon St,0,Deacon St
41758,-71.063681,42.373852,0 Devens St,0,Devens St
41757,-71.063681,42.373852,0 Devens St 1,0,Devens St
43255,-71.166560,42.280650,0 Dow Rd,0,Dow Rd
...,...,...,...,...,...
111307,-71.050772,42.376113,C-8 Shipway Pl C-8,C-8,Shipway Pl
111308,-71.050772,42.376113,C-9 Shipway Pl C-9,C-9,Shipway Pl
310515,-71.051407,42.371981,Pier 4 Eighth St,Pier 4,Eighth St
377920,-71.055699,42.357256,TEN Post Office Sq,TEN,Post Office Sq


### Number of rows without full address in live street address: 1

In [433]:
df_street_address[df_street_address["FULL_ADDRESS"].isna()].shape

(1, 32)

### Number of rows without street number: 0

In [434]:
df_street_address[df_street_address["STREET_NUMBER"].isna()].shape

(0, 32)

### Number of rows without street name: 0

In [435]:
df_street_address[df_street_address["FULL_STREET_NAME"].isna()].shape

(0, 32)

## Create full street address without unit number

In [436]:
df_street_address["FULL_STREET_ADDRESS"] = df_street_address["STREET_NUMBER"].str.lower().str.strip() + " " + df_street_address["FULL_STREET_NAME"].str.lower().str.strip()
df_street_address["FULL_STREET_ADDRESS"] = df_street_address["FULL_STREET_ADDRESS"].str.strip()
df_street_address[["FULL_STREET_ADDRESS", "FULL_ADDRESS", "STREET_NUMBER", "FULL_STREET_NAME", "POINT_X", "POINT_Y"]].sort_values(by="FULL_ADDRESS")

Unnamed: 0,FULL_STREET_ADDRESS,FULL_ADDRESS,STREET_NUMBER,FULL_STREET_NAME,POINT_X,POINT_Y
29697,0 cliffmont st,0 Cliffmont St,0,Cliffmont St,-71.120547,42.278381
399204,0 deacon st,0 Deacon St,0,Deacon St,-71.075982,42.338042
41758,0 devens st,0 Devens St,0,Devens St,-71.063681,42.373852
41757,0 devens st,0 Devens St 1,0,Devens St,-71.063681,42.373852
43255,0 dow rd,0 Dow Rd,0,Dow Rd,-71.166560,42.280650
...,...,...,...,...,...,...
111307,c-8 shipway pl,C-8 Shipway Pl C-8,C-8,Shipway Pl,-71.050772,42.376113
111308,c-9 shipway pl,C-9 Shipway Pl C-9,C-9,Shipway Pl,-71.050772,42.376113
310515,pier 4 eighth st,Pier 4 Eighth St,Pier 4,Eighth St,-71.051407,42.371981
377920,ten post office sq,TEN Post Office Sq,TEN,Post Office Sq,-71.055699,42.357256


### How many of the full street address is NA: 0

In [437]:
df_street_address[df_street_address["FULL_STREET_ADDRESS"].isna()].shape

(0, 33)

### Number of duplicates in full street address column: 263971

In [438]:
df_street_address["FULL_STREET_ADDRESS"].count() - df_street_address["FULL_STREET_ADDRESS"].nunique()

np.int64(263930)

# Property Assessment Dataset

In [439]:
df_property_columns = df_property.columns

### Total property assessed: 1,006,669

In [440]:
df_property[["ST_NUM", "ST_NAME", "ST_NAME_SUF"]].sort_values(by="ST_NAME")

Unnamed: 0,ST_NUM,ST_NAME,ST_NAME_SUF
230020,85,A,ST
227741,36,A,ST
60290,319,A,ST
60289,319,A,ST
60288,319,A,ST
...,...,...,...
791123,12.0,Zamora ST,
791292,33.0,Zamora ST,
791118,18.0,Zamora ST,
972777,12.0,Zamora ST,


In [441]:
df_property["ST_NUM"] = df_property["ST_NUM"].str.strip()
df_property["ST_NAME"] = df_property["ST_NAME"].str.strip()
df_property["ST_NAME_SUF"] = df_property["ST_NAME_SUF"].str.strip()

### Number of rows with no street number: 372436

In [442]:
df_property[df_property["ST_NUM"].isna()].shape

(351086, 45)

### Number of rows with no street name: 0

In [443]:
df_property[df_property["ST_NAME"].isna()].shape

(0, 45)

### Number of rows with no street suffix: 679710

In [444]:
df_property[df_property["ST_NAME_SUF"].isna()].shape

(679710, 45)

### Number of rows with no unit number: 547435
Will treat the address as if there is no unit number for geographical plotting purposes, assuming that property in the on the same street address but different unit number are still in the same building.

In [445]:
df_property[df_property["UNIT_NUM"].isna()].shape

(547435, 45)

### Combine columns into full address

In [446]:
import re

def extract_numeric_part(st_num):
    if pd.isna(st_num):
        return ""  # Return an empty string if ST_NUM is NaN
    elif isinstance(st_num, (int, float)):
        return str(int(st_num))  # Convert numeric ST_NUM to integer and then to string
    elif isinstance(st_num, str):
        match = re.match(r"(\d+)\.?\d*", st_num)  # Matches the numeric part in strings
        if match:
            return str(int(float(match.group(1))))  # Convert to integer
    return st_num.strip()  # Return as is if it doesn't match any numeric part

In [447]:
df_property["FULL_STREET_ADDRESS"] = df_property.apply(
    lambda row: (
        extract_numeric_part(row["ST_NUM"]) + " " + row["ST_NAME"] + " " + row["ST_NAME_SUF"]
    ).lower() if pd.notna(row["ST_NUM"]) and pd.notna(row["ST_NAME_SUF"])
    else (
        extract_numeric_part(row["ST_NUM"]) + " " + row["ST_NAME"]
    ).lower() if pd.notna(row["ST_NUM"])
    else (
        row["ST_NAME"] + " " + row["ST_NAME_SUF"]
    ).lower() if pd.notna(row["ST_NAME_SUF"])
    else row["ST_NAME"].lower(),
    axis=1
)
df_property["FULL_STREET_ADDRESS"] = df_property["FULL_STREET_ADDRESS"].str.strip()

In [448]:
df_property[["FULL_STREET_ADDRESS", "ST_NUM", "ST_NAME", "ST_NAME_SUF"]]

Unnamed: 0,FULL_STREET_ADDRESS,ST_NUM,ST_NAME,ST_NAME_SUF
0,87 beacon st,87,BEACON,ST
1,87 beacon st,87,BEACON,ST
2,87 beacon st,87,BEACON,ST
3,87 beacon st,87,BEACON,ST
4,87 beacon st,87,BEACON,ST
...,...,...,...,...
1068273,knowles st,,KNOWLES ST,
1068274,lake st,,Lake ST,
1068275,lake st,,Lake ST,
1068276,commonwealth av,,COMMONWEALTH AV,


In [449]:
df_property[df_property["FULL_STREET_ADDRESS"].isna()]

Unnamed: 0,PID,CM_ID,GIS_ID,ST_NUM,ST_NAME,ST_NAME_SUF,UNIT_NUM,ZIPCODE,LU,OWN_OCC,...,TT_RMS,HEAT_TYPE,HEAT_FUEL,AC_TYPE,PlUMBING,NUM_PARKING,PROP_VIEW,FIREPLACES,HEAT_SYSTEM,FULL_STREET_ADDRESS


### Number of duplicate full street addresses: 972,126
This is OK! We treat property with the same street address with different unit number the same because they are in the same building and makes no geographical difference.

In [450]:
df_property["FULL_STREET_ADDRESS"].count() - df_property["FULL_STREET_ADDRESS"].nunique()

np.int64(910863)

# Joining Property Assessment Dataset with Live Street Address Management Dataset

### Drop duplicates in full street address

In [451]:
df_street_address_unique = df_street_address.drop_duplicates(subset=["FULL_STREET_ADDRESS"])

In [452]:
df_street_address_unique[["FULL_STREET_ADDRESS", "FULL_ADDRESS", "STREET_NUMBER", "FULL_STREET_NAME", "POINT_X", "POINT_Y"]].sort_values(by="FULL_ADDRESS")

Unnamed: 0,FULL_STREET_ADDRESS,FULL_ADDRESS,STREET_NUMBER,FULL_STREET_NAME,POINT_X,POINT_Y
29697,0 cliffmont st,0 Cliffmont St,0,Cliffmont St,-71.120547,42.278381
399204,0 deacon st,0 Deacon St,0,Deacon St,-71.075982,42.338042
41757,0 devens st,0 Devens St 1,0,Devens St,-71.063681,42.373852
43255,0 dow rd,0 Dow Rd,0,Dow Rd,-71.166560,42.280650
393984,0 emerson pl,0 Emerson Pl,0,Emerson Pl,-71.068738,42.364346
...,...,...,...,...,...,...
111307,c-8 shipway pl,C-8 Shipway Pl C-8,C-8,Shipway Pl,-71.050772,42.376113
111308,c-9 shipway pl,C-9 Shipway Pl C-9,C-9,Shipway Pl,-71.050772,42.376113
310515,pier 4 eighth st,Pier 4 Eighth St,Pier 4,Eighth St,-71.051407,42.371981
377920,ten post office sq,TEN Post Office Sq,TEN,Post Office Sq,-71.055699,42.357256


In [453]:
df_property[["FULL_STREET_ADDRESS", "ST_NUM", "ST_NAME", "ST_NAME_SUF"]].sort_values(by="FULL_STREET_ADDRESS")

Unnamed: 0,FULL_STREET_ADDRESS,ST_NUM,ST_NAME,ST_NAME_SUF
269323,-8b-8c greenwood st,-8B-8C,GREENWOOD,ST
448877,-8b-8c greenwood st,-8B-8C,GREENWOOD ST,
627166,-8b-8c greenwood st,-8B-8C,GREENWOOD ST,
589327,0 harbor st,0,HARBOR ST,
233545,0 harbor st,0,HARBOR,ST
...,...,...,...,...
859128,zeller st,,ZELLER ST,
859127,zeller st,,ZELLER ST,
859126,zeller st,,ZELLER ST,
1041007,zeller st,,ZELLER ST,


### Number of property that is now assigned with XY coordinate: 469413

In [454]:
df_property_with_coord = pd.merge(df_property, df_street_address_unique, left_on="FULL_STREET_ADDRESS", right_on="FULL_STREET_ADDRESS", how="inner")

In [455]:
df_property_with_coord.shape

(469413, 78)

In [456]:
df_property_with_coord

Unnamed: 0,PID,CM_ID,GIS_ID,ST_NUM,ST_NAME,ST_NAME_SUF,UNIT_NUM,ZIPCODE,LU,OWN_OCC,...,Y_COORD,SAM_STREET_ID,WARD,PRECINCT_WARD,PARCEL,created_date,last_edited_date,shape_wkt,POINT_X,POINT_Y
0,502550008,502550000,502550000,87,BEACON,ST,2-F,02108,CD,Y,...,2.955013e+06,332.0,5,505,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910
1,502550010,502550000,502550000,87,BEACON,ST,2-R,02108,CD,N,...,2.955013e+06,332.0,5,505,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910
2,502550012,502550000,502550000,87,BEACON,ST,3-F,02108,CD,Y,...,2.955013e+06,332.0,5,505,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910
3,502550014,502550000,502550000,87,BEACON,ST,3-R,02108,CD,N,...,2.955013e+06,332.0,5,505,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910
4,502550016,502550000,502550000,87,BEACON,ST,4,02108,CD,Y,...,2.955013e+06,332.0,5,505,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
469408,2205663001,,2205663001,20,LAKE ST,,,02135,R1,Y,...,2.949498e+06,2334.0,22,2208,2205663001,9/25/2009 10:14:59,9/29/2009 17:33:08,POINT (-71.166339999999934 42.341090000000065),-71.166340,42.341090
469409,2205664000,,2205664000,18 16,LAKE ST,,,02135,R2,Y,...,2.949442e+06,2334.0,22,2208,2205664000,9/28/2009 1:28:37,9/29/2009 17:33:08,POINT (-71.166349937999939 42.340936078000027),-71.166350,42.340936
469410,2205665002,2205665000,2205665000,14,LAKE ST,,2,02135,CD,N,...,2.949394e+06,2334.0,22,2208,2205665000,9/25/2009 10:14:59,9/29/2009 17:33:08,POINT (-71.166337999999939 42.340804000000048),-71.166338,42.340804
469411,2205665004,2205665000,2205665000,12,LAKE ST,,1,02135,CD,N,...,2.949385e+06,2334.0,22,2208,2205665000,9/25/2009 10:14:59,9/29/2009 17:33:08,POINT (-71.166339999999934 42.340780000000052),-71.166340,42.340780


### Examines the property that found no matching address on Live Street Address Management Dataset

In [457]:
print(f"{df_property.shape[0] - df_property_with_coord.shape[0]} addresses found no match.")

537256 addresses found no match.


In [458]:
df_missing_coord_addresses = df_property[~df_property['FULL_STREET_ADDRESS'].isin(df_property_with_coord['FULL_STREET_ADDRESS'])].copy()

In [459]:
df_missing_coord_addresses[["FULL_STREET_ADDRESS", "ST_NUM", "ST_NAME", "ST_NAME_SUF"]]

Unnamed: 0,FULL_STREET_ADDRESS,ST_NUM,ST_NAME,ST_NAME_SUF
60,104 putnam st,104 A 104,PUTNAM,ST
111,198 princeton st,198 200,PRINCETON,ST
121,399 saratoga st,399 401,SARATOGA,ST
175,4 lawson pl,4,LAWSON,PL
176,3 lawson pl,3,LAWSON,PL
...,...,...,...,...
1068273,knowles st,,KNOWLES ST,
1068274,lake st,,Lake ST,
1068275,lake st,,Lake ST,
1068276,commonwealth av,,COMMONWEALTH AV,


In [460]:
df_missing_coord_addresses["FULL_STREET_NAME"] = df_missing_coord_addresses.apply(
    lambda row: (
        row["ST_NAME"] + " " + row["ST_NAME_SUF"]
    ).lower() if pd.notna(row["ST_NAME_SUF"])
    else row["ST_NAME"].lower(),
    axis=1
)

df_missing_coord_addresses["FULL_STREET_NAME"] = df_missing_coord_addresses["FULL_STREET_NAME"].str.strip()

df_missing_coord_addresses[["FULL_STREET_ADDRESS", "FULL_STREET_NAME", "ST_NUM", "ST_NAME", "ST_NAME_SUF"]]

Unnamed: 0,FULL_STREET_ADDRESS,FULL_STREET_NAME,ST_NUM,ST_NAME,ST_NAME_SUF
60,104 putnam st,putnam st,104 A 104,PUTNAM,ST
111,198 princeton st,princeton st,198 200,PRINCETON,ST
121,399 saratoga st,saratoga st,399 401,SARATOGA,ST
175,4 lawson pl,lawson pl,4,LAWSON,PL
176,3 lawson pl,lawson pl,3,LAWSON,PL
...,...,...,...,...,...
1068273,knowles st,knowles st,,KNOWLES ST,
1068274,lake st,lake st,,Lake ST,
1068275,lake st,lake st,,Lake ST,
1068276,commonwealth av,commonwealth av,,COMMONWEALTH AV,


### Drop duplicate in the full street name column in Street Address Dataset

In [461]:
df_street_name_unique = df_street_address[["FULL_STREET_ADDRESS", "FULL_ADDRESS", "STREET_NUMBER", "STREET_BODY", "FULL_STREET_NAME", "POINT_X", "POINT_Y"]].copy()
df_street_name_unique["FULL_STREET_NAME"] = df_street_name_unique["FULL_STREET_NAME"].str.strip().str.lower()
df_street_name_unique = df_street_name_unique.drop_duplicates(subset=["FULL_STREET_NAME"])

### Fix misspelling and abbreviation
- Replace with "msgr" with "monsignor"
- Remove all special characters from street name
- Replace "abbott" with "abbot"
- Replace "wy" with "way"
- Replace "wm" with "William"
- Replace "hw" with "hwy"
- Replace "oneil" with "o'neil"
- Replace "mt" with "mount"
- Replace "dr mary m beatty" with "dr mary moore beatty cir"
- Replace "commercial wharf east" with "commercial whf r"
- Replace "commonweatlh" with "commonwealth"
- Replace "battery wharf" with "battery whf"
- Replace leading "st" with "saint"
- Replace "crescent circuit" with "crescent cirt"
- Replace "fr francis gilday" with "father francis j gilday"
- Replace "pw" with "pkwy"
- Replace "wm card oconnell" with "william cardinal oconnell way"
- Replace "gen wm h devine" with "general william h devine way"
- Replace "w roxbury pkwy" with "west roxbury pkwy"
- Replace "park lane" with "park ln"
- Replace "gen jozef pilsudski way" with "general jozef pilsudski way"
- Replace "fan pier bl" with "fan pier  blvd"
- Replace "oconnor way" with "major michael j oconnor way"
- Replace "soldiers field rd xt" with "soldiers field rd"

In [462]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("msgr", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("msgr", "monsignor", case=False, regex=False)

In [463]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("abbott", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("abbott", "abbot", case=False, regex=False)

In [464]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("wy", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("wy", "way", case=False, regex=False)

In [465]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("wm", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("wm", "william", case=False, regex=False)

In [466]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("hw", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("hw", "hwy", case=False, regex=False)

In [467]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains(r"[^a-zA-Z0-9\s]", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace(r"[^a-zA-Z0-9\s]", "", case=False, regex=False)

In [468]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("oneil", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("oneil", "o'neil", case=False, regex=False)

In [469]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("mt", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("mt", "mount", case=False, regex=False)

In [470]:
df_missing_coord_addresses['FULL_STREET_NAME'] = df_missing_coord_addresses['FULL_STREET_NAME'].str.replace(r'^\bst\b', 'saint', case=False, regex=True)

In [471]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("crescent circuit", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("crescent circuit", "crescent cirt", case=False, regex=False)

In [472]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("fr francis gilday", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("fr francis gilday", "father francis j gilday", case=False, regex=False)

In [473]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("pw", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("pw", "pkwy", case=False, regex=False)

In [474]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("card oconnell", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("card oconnell", "cardinal oconnell", case=False, regex=False)

In [475]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("gen", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("gen", "general", case=False, regex=False)

In [476]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("w roxbury pkwy", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("w roxbury pkwy", "west roxbury pkwy", case=False, regex=False)

In [477]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("park lane dr", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("park lane", "park ln", case=False, regex=False)

In [478]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("fan pier bl", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("fan pier bl", "fan pier  blvd", case=False, regex=False)

In [479]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("oconnor way", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("oconnor way", "major michael j oconnor way", case=False, regex=False)

In [480]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("soldiers field rd xt", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("soldiers field rd xt", "soldiers field rd", case=False, regex=False)

### Match just the street name and it's suffix
We will estimate the coordinate of the property based on street name.

In [481]:
print(f"{df_missing_coord_addresses.shape[0]} addresses found no match because the street numbers in Live Street Address are recorded as a range.")

537256 addresses found no match because the street numbers in Live Street Address are recorded as a range.


In [482]:
df_missing_coord_addresses[["FULL_STREET_NAME", "FULL_STREET_ADDRESS", "ST_NUM", "ST_NAME", "ST_NAME_SUF"]].sort_values(by="FULL_STREET_NAME")

Unnamed: 0,FULL_STREET_NAME,FULL_STREET_ADDRESS,ST_NUM,ST_NAME,ST_NAME_SUF
945802,a st,a st,,A ST,
234037,a st,a st,,A,ST
233780,a st,141 a st,141,A,ST
233779,a st,a st,,A,ST
233778,a st,a st,,A,ST
...,...,...,...,...,...
859125,zeller st,zeller st,,ZELLER ST,
859124,zeller st,zeller st,,ZELLER ST,
859123,zeller st,zeller st,,ZELLER ST,
859129,zeller st,zeller st,,ZELLER ST,


In [483]:
df_missing_coord_addresses["FULL_STREET_NAME"].nunique()

4247

In [484]:
df_missing_property_street_name_with_coord = pd.merge(df_missing_coord_addresses, df_street_name_unique, left_on="FULL_STREET_NAME", right_on="FULL_STREET_NAME", how="inner", suffixes=('', '_right'))

In [485]:
df_missing_property_street_name_with_coord.shape

(344583, 53)

In [486]:
df_missing_property_street_name_with_coord

Unnamed: 0,PID,CM_ID,GIS_ID,ST_NUM,ST_NAME,ST_NAME_SUF,UNIT_NUM,ZIPCODE,LU,OWN_OCC,...,FIREPLACES,HEAT_SYSTEM,FULL_STREET_ADDRESS,FULL_STREET_NAME,FULL_STREET_ADDRESS_right,FULL_ADDRESS,STREET_NUMBER,STREET_BODY,POINT_X,POINT_Y
0,100001000,,100001000,104 A 104,PUTNAM,ST,,02128,R3,Y,...,0,,104 putnam st,putnam st,10 putnam st,10 Putnam St,10,Putnam,-71.059864,42.373542
1,100051000,,100051000,198 200,PRINCETON,ST,,02128,R4,N,...,,,198 princeton st,princeton st,10 princeton st,10 Princeton St,10,Princeton,-71.038800,42.376590
2,100061000,,100061000,399 401,SARATOGA,ST,,02128,RC,N,...,,,399 saratoga st,saratoga st,100-104 saratoga st,100-104 Saratoga St,100-104,Saratoga,-71.036213,42.376800
3,100116000,,100116000,4,LAWSON,PL,,02128,RL,N,...,,,4 lawson pl,lawson pl,1 lawson pl,1 Lawson Pl,1,Lawson,-71.028470,42.380200
4,100117000,,100117000,3,LAWSON,PL,,02128,RL,N,...,,,3 lawson pl,lawson pl,1 lawson pl,1 Lawson Pl,1,Lawson,-71.028470,42.380200
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
344578,2205664000,,2205664000,,Lake ST,,,02135,R2,Y,...,0,,lake st,lake st,102 lake st,102 Lake St,102,Lake,-71.164330,42.344050
344579,2205665002,2205665000,2205665000,,Lake ST,,2,02135,CD,N,...,1,I - Indiv. Cntrl,lake st,lake st,102 lake st,102 Lake St,102,Lake,-71.164330,42.344050
344580,2205665004,2205665000,2205665000,,Lake ST,,1,02135,CD,N,...,1,I - Indiv. Cntrl,lake st,lake st,102 lake st,102 Lake St,102,Lake,-71.164330,42.344050
344581,2205667000,,2205667000,,Lake ST,,,02135,RL - RL,N,...,,,lake st,lake st,102 lake st,102 Lake St,102,Lake,-71.164330,42.344050


### Number of property that is now assigned with XY coordinate: 866,271

In [487]:
df_property_with_coord = pd.concat([df_property_with_coord, df_missing_property_street_name_with_coord], ignore_index=True)

In [488]:
df_property_with_coord.shape

(813996, 79)

In [489]:
print(f"A total of {df_property_with_coord.shape[0]} properties have coordinates out of the total {df_property.shape[0]}.")
print(f"{df_property.shape[0] - df_property_with_coord.shape[0]} property still doesn't have coordinate, which is a lot.")

A total of 813996 properties have coordinates out of the total 1006669.
192673 property still doesn't have coordinate, which is a lot.


### Examine the property where match are not found for both full street address and full street name

In [490]:
df_missing_coord_addresses = df_missing_coord_addresses[~df_missing_coord_addresses['FULL_STREET_NAME'].isin(df_missing_property_street_name_with_coord['FULL_STREET_NAME'])].copy()

In [491]:
df_missing_coord_addresses[["FULL_STREET_NAME", "FULL_STREET_ADDRESS", "ST_NUM", "ST_NAME", "ST_NAME_SUF"]]

Unnamed: 0,FULL_STREET_NAME,FULL_STREET_ADDRESS,ST_NUM,ST_NAME,ST_NAME_SUF
521,swift te,1 swift te,1,SWIFT,TE
522,swift te,5 swift te,5,SWIFT,TE
523,swift te,9 swift te,9,SWIFT,TE
524,swift te,15 swift te,15,SWIFT,TE
525,swift te,19 swift te,19,SWIFT,TE
...,...,...,...,...,...
1068182,undine st,undine st,,UNDINE ST,
1068183,undine st,undine st,,UNDINE ST,
1068273,knowles st,knowles st,,KNOWLES ST,
1068276,commonwealth av,commonwealth av,,COMMONWEALTH AV,


### Match just the street name (body)

After manually searching for the `FULL_STREET_NAME` against Live Street Address, the common problems are misspellings, wrong street name suffix, and colloquial name.

There too many of these cases to manually replace, so we will match the substring to of street name to Live Street Address and replace it with the correct name for matching.

In [492]:
df_missing_coord_addresses["ST_NAME_NEW"] = df_missing_coord_addresses["ST_NAME"].str.lower()
df_street_name_unique["STREET_BODY"] = df_street_name_unique["STREET_BODY"].str.lower()

In [493]:
df_street_body_unique = df_street_name_unique.drop_duplicates(subset=["STREET_BODY"])

In [494]:
df_missing_coord_addresses.shape

(192673, 48)

In [495]:
# Ignore the street name suffix and only use street name
df_missing_property_street_name_with_coord = pd.merge(df_missing_coord_addresses, df_street_body_unique, left_on="ST_NAME_NEW", right_on="STREET_BODY", how="inner", suffixes=('', '_right'))

In [496]:
df_missing_property_street_name_with_coord.shape

(57832, 55)

In [497]:
df_missing_property_street_name_with_coord

Unnamed: 0,PID,CM_ID,GIS_ID,ST_NUM,ST_NAME,ST_NAME_SUF,UNIT_NUM,ZIPCODE,LU,OWN_OCC,...,FULL_STREET_ADDRESS,FULL_STREET_NAME,ST_NAME_NEW,FULL_STREET_ADDRESS_right,FULL_ADDRESS,STREET_NUMBER,STREET_BODY,FULL_STREET_NAME_right,POINT_X,POINT_Y
0,100382000,,100382000,1,SWIFT,TE,,02128,R1,Y,...,1 swift te,swift te,swift,1 swift ter,1 Swift Ter,1,swift,swift ter,-71.021937,42.380540
1,100383000,,100383000,5,SWIFT,TE,,02128,R1,Y,...,5 swift te,swift te,swift,1 swift ter,1 Swift Ter,1,swift,swift ter,-71.021937,42.380540
2,100384000,,100384000,9,SWIFT,TE,,02128,R1,Y,...,9 swift te,swift te,swift,1 swift ter,1 Swift Ter,1,swift,swift ter,-71.021937,42.380540
3,100385000,,100385000,15,SWIFT,TE,,02128,R3,N,...,15 swift te,swift te,swift,1 swift ter,1 Swift Ter,1,swift,swift ter,-71.021937,42.380540
4,100386000,,100386000,19,SWIFT,TE,,02128,R1,N,...,19 swift te,swift te,swift,1 swift ter,1 Swift Ter,1,swift,swift ter,-71.021937,42.380540
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57827,1809911000,,1809911000,,VAN BRUNT,,,02136,R1,Y,...,van brunt,van brunt,van brunt,10 van brunt st,10 Van Brunt St,10,van brunt,van brunt st,-71.124490,42.242760
57828,1811885010,,1811885010,,ADAMS,,,02136,R1,Y,...,adams,adams,adams,1 adams st,1 Adams St,1,adams,adams st,-71.060040,42.374840
57829,1900635011,1900635000,1900635000,,LAMARTINE,,5,02130,CD,Y,...,lamartine,lamartine,lamartine,1 lamartine pl,1 Lamartine Pl,1,lamartine,lamartine pl,-71.106670,42.313510
57830,2200470000,,2200470000,,CHARLES RIVER,,,02135,E,N,...,charles river,charles river,charles river,44 charles river ave,44 Charles River Ave,44,charles river,charles river ave,-71.060502,42.370607


In [498]:
df_property_with_coord = pd.concat([df_property_with_coord, df_missing_property_street_name_with_coord], ignore_index=True)
df_property_with_coord

Unnamed: 0,PID,CM_ID,GIS_ID,ST_NUM,ST_NAME,ST_NAME_SUF,UNIT_NUM,ZIPCODE,LU,OWN_OCC,...,PRECINCT_WARD,PARCEL,created_date,last_edited_date,shape_wkt,POINT_X,POINT_Y,FULL_STREET_ADDRESS_right,ST_NAME_NEW,FULL_STREET_NAME_right
0,502550008,502550000,502550000,87,BEACON,ST,2-F,02108,CD,Y,...,505.0,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,
1,502550010,502550000,502550000,87,BEACON,ST,2-R,02108,CD,N,...,505.0,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,
2,502550012,502550000,502550000,87,BEACON,ST,3-F,02108,CD,Y,...,505.0,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,
3,502550014,502550000,502550000,87,BEACON,ST,3-R,02108,CD,N,...,505.0,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,
4,502550016,502550000,502550000,87,BEACON,ST,4,02108,CD,Y,...,505.0,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
871823,1809911000,,1809911000,,VAN BRUNT,,,02136,R1,Y,...,,,,,,-71.124490,42.242760,10 van brunt st,van brunt,van brunt st
871824,1811885010,,1811885010,,ADAMS,,,02136,R1,Y,...,,,,,,-71.060040,42.374840,1 adams st,adams,adams st
871825,1900635011,1900635000,1900635000,,LAMARTINE,,5,02130,CD,Y,...,,,,,,-71.106670,42.313510,1 lamartine pl,lamartine,lamartine pl
871826,2200470000,,2200470000,,CHARLES RIVER,,,02135,E,N,...,,,,,,-71.060502,42.370607,44 charles river ave,charles river,charles river ave


### Number of property that is now assigned with XY coordinate: 871828

In [499]:
print(f"A total of {df_property_with_coord.shape[0]} properties have coordinates out of the total {df_property.shape[0]}.")
print(f"{df_property.shape[0] - df_property_with_coord.shape[0]} property still doesn't have coordinate.")

A total of 871828 properties have coordinates out of the total 1006669.
134841 property still doesn't have coordinate.


In [500]:
df_missing_coord_addresses = df_missing_coord_addresses[~df_missing_coord_addresses['ST_NAME_NEW'].isin(df_missing_property_street_name_with_coord['STREET_BODY'])].copy()
df_missing_coord_addresses[["FULL_STREET_NAME", "FULL_STREET_ADDRESS", "ST_NUM", "ST_NAME_NEW", "ST_NAME_SUF"]]

Unnamed: 0,FULL_STREET_NAME,FULL_STREET_ADDRESS,ST_NUM,ST_NAME_NEW,ST_NAME_SUF
533,vienna st,vienna st,,vienna,ST
534,vienna st,vienna st,,vienna,ST
535,vienna st,3 vienna st,3,vienna,ST
536,vienna st,5 vienna st,5,vienna,ST
537,vienna st,7 vienna st,7,vienna,ST
...,...,...,...,...,...
1068182,undine st,undine st,,undine st,
1068183,undine st,undine st,,undine st,
1068273,knowles st,knowles st,,knowles st,
1068276,commonwealth av,commonwealth av,,commonwealth av,


### Some street name contain wrong street name suffix
Assume that the shortest word is the suffix, remove the suffix.

In [501]:
def remove_shortest_word(text):
    words = text.split()  # Split the string into words
    if len(words) <= 1:
        return text  # Return the original string if it's the only word
    shortest_word = min(words, key=len)  # Find the shortest word
    words.remove(shortest_word)  # Remove the shortest word
    return ' '.join(words)

In [502]:
df_missing_coord_addresses['ST_NAME_NEW'] = df_missing_coord_addresses['ST_NAME_NEW'].apply(remove_shortest_word)
df_missing_coord_addresses[["FULL_STREET_NAME", "FULL_STREET_ADDRESS", "ST_NUM", "ST_NAME", "ST_NAME_SUF", 'ST_NAME_NEW']]

Unnamed: 0,FULL_STREET_NAME,FULL_STREET_ADDRESS,ST_NUM,ST_NAME,ST_NAME_SUF,ST_NAME_NEW
533,vienna st,vienna st,,VIENNA,ST,vienna
534,vienna st,vienna st,,VIENNA,ST,vienna
535,vienna st,3 vienna st,3,VIENNA,ST,vienna
536,vienna st,5 vienna st,5,VIENNA,ST,vienna
537,vienna st,7 vienna st,7,VIENNA,ST,vienna
...,...,...,...,...,...,...
1068182,undine st,undine st,,UNDINE ST,,undine
1068183,undine st,undine st,,UNDINE ST,,undine
1068273,knowles st,knowles st,,KNOWLES ST,,knowles
1068276,commonwealth av,commonwealth av,,COMMONWEALTH AV,,commonwealth


In [503]:
df_missing_property_street_name_with_coord = pd.merge(df_missing_coord_addresses, df_street_body_unique, left_on="ST_NAME_NEW", right_on="STREET_BODY", how="inner", suffixes=('', '_right'))

In [504]:
df_missing_property_street_name_with_coord.shape

(122257, 55)

In [505]:
df_missing_property_street_name_with_coord[["FULL_STREET_NAME", "FULL_STREET_ADDRESS", "ST_NUM", "ST_NAME", "ST_NAME_SUF", "STREET_BODY", "FULL_ADDRESS", "STREET_NUMBER", "POINT_X", "POINT_Y"]]

Unnamed: 0,FULL_STREET_NAME,FULL_STREET_ADDRESS,ST_NUM,ST_NAME,ST_NAME_SUF,STREET_BODY,FULL_ADDRESS,STREET_NUMBER,POINT_X,POINT_Y
0,chelsea creek,chelsea creek,,CHELSEA CREEK,,chelsea,55 Chelsea St B,55,-71.059256,42.372731
1,fort strong,fort strong,,FORT STRONG,,strong,1 Strong Pl,1,-71.068335,42.360817
2,fort strong,fort strong,,FORT STRONG,,strong,1 Strong Pl,1,-71.068335,42.360817
3,fort strong,fort strong,,FORT STRONG,,strong,1 Strong Pl,1,-71.068335,42.360817
4,fort strong,fort strong,,FORT STRONG,,strong,1 Strong Pl,1,-71.068335,42.360817
...,...,...,...,...,...,...,...,...,...,...
122252,lake shore te,lake shore te,,LAKE SHORE TE,,lake shore,14 Lake Shore Ct 1,14,-71.170468,42.345856
122253,undine st,undine st,,UNDINE ST,,undine,100-98 Undine Rd,100-98,-71.166046,42.342942
122254,undine st,undine st,,UNDINE ST,,undine,100-98 Undine Rd,100-98,-71.166046,42.342942
122255,commonwealth av,commonwealth av,,COMMONWEALTH AV,,commonwealth,1 Commonwealth Ave A,1,-71.072054,42.353996


In [506]:
df_property_with_coord = pd.concat([df_property_with_coord, df_missing_property_street_name_with_coord], ignore_index=True)
df_property_with_coord

Unnamed: 0,PID,CM_ID,GIS_ID,ST_NUM,ST_NAME,ST_NAME_SUF,UNIT_NUM,ZIPCODE,LU,OWN_OCC,...,PRECINCT_WARD,PARCEL,created_date,last_edited_date,shape_wkt,POINT_X,POINT_Y,FULL_STREET_ADDRESS_right,ST_NAME_NEW,FULL_STREET_NAME_right
0,502550008,502550000,502550000,87,BEACON,ST,2-F,02108,CD,Y,...,505.0,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,
1,502550010,502550000,502550000,87,BEACON,ST,2-R,02108,CD,N,...,505.0,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,
2,502550012,502550000,502550000,87,BEACON,ST,3-F,02108,CD,Y,...,505.0,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,
3,502550014,502550000,502550000,87,BEACON,ST,3-R,02108,CD,N,...,505.0,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,
4,502550016,502550000,502550000,87,BEACON,ST,4,02108,CD,Y,...,505.0,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
994080,2205550632,2205550001,2205550001,,LAKE SHORE TE,,6-4,02135,CD,Y,...,,,,,,-71.170468,42.345856,14 lake shore ct,lake shore,lake shore ct
994081,2205589002,2205589000,2205589000,,UNDINE ST,,1,02135,CD,N,...,,,,,,-71.166046,42.342942,100-98 undine rd,undine,undine rd
994082,2205589004,2205589000,2205589000,,UNDINE ST,,2,02135,CD,N,...,,,,,,-71.166046,42.342942,100-98 undine rd,undine,undine rd
994083,2205669000,,2205669000,,COMMONWEALTH AV,,,02135,C,N,...,,,,,,-71.072054,42.353996,1 commonwealth ave,commonwealth,commonwealth ave


### Number of property that is now assigned with XY coordinate: 994085

In [507]:
print(f"A total of {df_property_with_coord.shape[0]} properties have coordinates out of the total {df_property.shape[0]}.")
print(f"{df_property.shape[0] - df_property_with_coord.shape[0]} property still doesn't have coordinate.")

A total of 994085 properties have coordinates out of the total 1006669.
12584 property still doesn't have coordinate.


In [508]:
df_missing_coord_addresses = df_missing_coord_addresses[~df_missing_coord_addresses['ST_NAME'].isin(df_missing_property_street_name_with_coord['ST_NAME'])].copy()
df_missing_coord_addresses[["FULL_STREET_NAME", "FULL_STREET_ADDRESS", "ST_NUM", "ST_NAME", "ST_NAME_SUF"]]

Unnamed: 0,FULL_STREET_NAME,FULL_STREET_ADDRESS,ST_NUM,ST_NAME,ST_NAME_SUF
533,vienna st,vienna st,,VIENNA,ST
534,vienna st,vienna st,,VIENNA,ST
535,vienna st,3 vienna st,3,VIENNA,ST
536,vienna st,5 vienna st,5,VIENNA,ST
537,vienna st,7 vienna st,7,VIENNA,ST
...,...,...,...,...,...
1060810,leo m birmingham pkway,leo m birmingham pkwy,,Leo M Birmingham PKWY,
1060811,leo m birmingham pkway,leo m birmingham pkwy,,Leo M Birmingham PKWY,
1060812,leo m birmingham pkway,leo m birmingham pkwy,,Leo M Birmingham PKWY,
1060813,leo m birmingham pkway,leo m birmingham pkwy,,Leo M Birmingham PKWY,


Stop here because the rest will need manual inspection and will take too long. Will not include the rest of the rows in the new CSV file.

In [509]:
df_property_with_coord

Unnamed: 0,PID,CM_ID,GIS_ID,ST_NUM,ST_NAME,ST_NAME_SUF,UNIT_NUM,ZIPCODE,LU,OWN_OCC,...,PRECINCT_WARD,PARCEL,created_date,last_edited_date,shape_wkt,POINT_X,POINT_Y,FULL_STREET_ADDRESS_right,ST_NAME_NEW,FULL_STREET_NAME_right
0,502550008,502550000,502550000,87,BEACON,ST,2-F,02108,CD,Y,...,505.0,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,
1,502550010,502550000,502550000,87,BEACON,ST,2-R,02108,CD,N,...,505.0,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,
2,502550012,502550000,502550000,87,BEACON,ST,3-F,02108,CD,Y,...,505.0,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,
3,502550014,502550000,502550000,87,BEACON,ST,3-R,02108,CD,N,...,505.0,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,
4,502550016,502550000,502550000,87,BEACON,ST,4,02108,CD,Y,...,505.0,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
994080,2205550632,2205550001,2205550001,,LAKE SHORE TE,,6-4,02135,CD,Y,...,,,,,,-71.170468,42.345856,14 lake shore ct,lake shore,lake shore ct
994081,2205589002,2205589000,2205589000,,UNDINE ST,,1,02135,CD,N,...,,,,,,-71.166046,42.342942,100-98 undine rd,undine,undine rd
994082,2205589004,2205589000,2205589000,,UNDINE ST,,2,02135,CD,N,...,,,,,,-71.166046,42.342942,100-98 undine rd,undine,undine rd
994083,2205669000,,2205669000,,COMMONWEALTH AV,,,02135,C,N,...,,,,,,-71.072054,42.353996,1 commonwealth ave,commonwealth,commonwealth ave


# Plot onto shapefile

In [510]:
geometry = [Point(xy) for xy in zip(df_property_with_coord['POINT_X'], df_property_with_coord['POINT_Y'])]
gdf = gpd.GeoDataFrame(df_property_with_coord, geometry=geometry, crs="EPSG:3857")
gdf = gdf.to_crs("EPSG:3857")
gdf

Unnamed: 0,PID,CM_ID,GIS_ID,ST_NUM,ST_NAME,ST_NAME_SUF,UNIT_NUM,ZIPCODE,LU,OWN_OCC,...,PARCEL,created_date,last_edited_date,shape_wkt,POINT_X,POINT_Y,FULL_STREET_ADDRESS_right,ST_NAME_NEW,FULL_STREET_NAME_right,geometry
0,502550008,502550000,502550000,87,BEACON,ST,2-F,02108,CD,Y,...,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,POINT (-71.072 42.356)
1,502550010,502550000,502550000,87,BEACON,ST,2-R,02108,CD,N,...,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,POINT (-71.072 42.356)
2,502550012,502550000,502550000,87,BEACON,ST,3-F,02108,CD,Y,...,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,POINT (-71.072 42.356)
3,502550014,502550000,502550000,87,BEACON,ST,3-R,02108,CD,N,...,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,POINT (-71.072 42.356)
4,502550016,502550000,502550000,87,BEACON,ST,4,02108,CD,Y,...,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,POINT (-71.072 42.356)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
994080,2205550632,2205550001,2205550001,,LAKE SHORE TE,,6-4,02135,CD,Y,...,,,,,-71.170468,42.345856,14 lake shore ct,lake shore,lake shore ct,POINT (-71.17 42.346)
994081,2205589002,2205589000,2205589000,,UNDINE ST,,1,02135,CD,N,...,,,,,-71.166046,42.342942,100-98 undine rd,undine,undine rd,POINT (-71.166 42.343)
994082,2205589004,2205589000,2205589000,,UNDINE ST,,2,02135,CD,N,...,,,,,-71.166046,42.342942,100-98 undine rd,undine,undine rd,POINT (-71.166 42.343)
994083,2205669000,,2205669000,,COMMONWEALTH AV,,,02135,C,N,...,,,,,-71.072054,42.353996,1 commonwealth ave,commonwealth,commonwealth ave,POINT (-71.072 42.354)


In [511]:
district_shapefile = gpd.read_file("../data/City-Council-District")

count = 0
is_D7_addresses = []

for row in df_property_with_coord.itertuples(index=True, name="Row"):
    address_point = Point(row.POINT_X, row.POINT_Y)
    address_gdf = gpd.GeoDataFrame(geometry=[address_point], crs="EPSG:4326")
    address_gdf = address_gdf.to_crs(district_shapefile.crs)
    result = gpd.sjoin(address_gdf, district_shapefile, how="left", predicate="intersects")

    if result['DISTRICT'].values[0] == 7:
        count += 1
        is_D7_addresses.append(True)
    else:
        is_D7_addresses.append(False)

In [512]:
print(f"{count} properties may be in District 7.")

76046 properties may be in District 7.


In [513]:
df_property_with_coord['IS_D7'] = is_D7_addresses
df_property_with_coord

Unnamed: 0,PID,CM_ID,GIS_ID,ST_NUM,ST_NAME,ST_NAME_SUF,UNIT_NUM,ZIPCODE,LU,OWN_OCC,...,PARCEL,created_date,last_edited_date,shape_wkt,POINT_X,POINT_Y,FULL_STREET_ADDRESS_right,ST_NAME_NEW,FULL_STREET_NAME_right,IS_D7
0,502550008,502550000,502550000,87,BEACON,ST,2-F,02108,CD,Y,...,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,False
1,502550010,502550000,502550000,87,BEACON,ST,2-R,02108,CD,N,...,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,False
2,502550012,502550000,502550000,87,BEACON,ST,3-F,02108,CD,Y,...,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,False
3,502550014,502550000,502550000,87,BEACON,ST,3-R,02108,CD,N,...,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,False
4,502550016,502550000,502550000,87,BEACON,ST,4,02108,CD,Y,...,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
994080,2205550632,2205550001,2205550001,,LAKE SHORE TE,,6-4,02135,CD,Y,...,,,,,-71.170468,42.345856,14 lake shore ct,lake shore,lake shore ct,False
994081,2205589002,2205589000,2205589000,,UNDINE ST,,1,02135,CD,N,...,,,,,-71.166046,42.342942,100-98 undine rd,undine,undine rd,False
994082,2205589004,2205589000,2205589000,,UNDINE ST,,2,02135,CD,N,...,,,,,-71.166046,42.342942,100-98 undine rd,undine,undine rd,False
994083,2205669000,,2205669000,,COMMONWEALTH AV,,,02135,C,N,...,,,,,-71.072054,42.353996,1 commonwealth ave,commonwealth,commonwealth ave,False


In [514]:
df_property_with_coord_and_d7 = df_property_with_coord[list(df_property_columns) + ["POINT_X", "POINT_Y", "IS_D7"]].copy()

In [515]:
df_property_with_coord_and_d7

Unnamed: 0,PID,CM_ID,GIS_ID,ST_NUM,ST_NAME,ST_NAME_SUF,UNIT_NUM,ZIPCODE,LU,OWN_OCC,...,HEAT_FUEL,AC_TYPE,PlUMBING,NUM_PARKING,PROP_VIEW,FIREPLACES,HEAT_SYSTEM,POINT_X,POINT_Y,IS_D7
0,502550008,502550000,502550000,87,BEACON,ST,2-F,02108,CD,Y,...,,N - None,,1,A - Average,1,,-71.071690,42.355910,False
1,502550010,502550000,502550000,87,BEACON,ST,2-R,02108,CD,N,...,,N - None,,1,A - Average,1,,-71.071690,42.355910,False
2,502550012,502550000,502550000,87,BEACON,ST,3-F,02108,CD,Y,...,,N - None,,0,G - Good,1,,-71.071690,42.355910,False
3,502550014,502550000,502550000,87,BEACON,ST,3-R,02108,CD,N,...,,N - None,,1,G - Good,1,,-71.071690,42.355910,False
4,502550016,502550000,502550000,87,BEACON,ST,4,02108,CD,Y,...,,C - Central AC,,2,G - Good,2,,-71.071690,42.355910,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
994080,2205550632,2205550001,2205550001,,LAKE SHORE TE,,6-4,02135,CD,Y,...,,C - Central AC,,1,A - Average,0,Y - Self Contained,-71.170468,42.345856,False
994081,2205589002,2205589000,2205589000,,UNDINE ST,,1,02135,CD,N,...,,N - None,,1,A - Average,0,I - Indiv. Cntrl,-71.166046,42.342942,False
994082,2205589004,2205589000,2205589000,,UNDINE ST,,2,02135,CD,N,...,,N - None,,1,A - Average,0,I - Indiv. Cntrl,-71.166046,42.342942,False
994083,2205669000,,2205669000,,COMMONWEALTH AV,,,02135,C,N,...,,,,,,,,-71.072054,42.353996,False


In [516]:
df_property_with_coord_and_d7[df_property_with_coord_and_d7["IS_D7"] == True].shape

(76046, 48)

To reduce error that may stem from comparing street body without street numbers and suffix, out of the rows where `IS_D7 = True`, if the ZIP code falls outside of District 7 ZIP Code, modify the value to `False`.

In [517]:
zip_shapefile = gpd.read_file("../data/ZIP_Codes")

In [518]:
print(zip_shapefile.crs)

EPSG:2249


In [519]:
if zip_shapefile.crs != district_shapefile.crs:
    zip_shapefile = zip_shapefile.to_crs(district_shapefile.crs)
district_7 = district_shapefile[district_shapefile["DISTRICT"] == 7]
zip_in_district_7 = gpd.sjoin(zip_shapefile, district_7, how="inner", predicate="intersects")
unique_zip_codes = zip_in_district_7["ZIP5"].unique()

In [520]:
unique_zip_codes

array(['02125', '02118', '02130', '02121', '02119', '02115', '02116',
       '02120', '02215'], dtype=object)

In [521]:
df_property_with_coord_and_d7.loc[
    (df_property_with_coord_and_d7["IS_D7"] == True) & 
    (~df_property_with_coord_and_d7["ZIPCODE"].apply(lambda x: x in unique_zip_codes)),
    "IS_D7"
] = False

In [522]:
df_property_with_coord_and_d7[df_property_with_coord_and_d7["IS_D7"] == True].shape

(72932, 48)

In [523]:
df_property_with_coord_and_d7.to_csv("../data/property-cleaned.csv", index=False, quotechar='"', quoting=csv.QUOTE_NONNUMERIC)