# Finding property in City Council District 7

### Overview
The goal of this notebook is to extend the Property Assessment Dataset by adding another column called `is_d7` that indicates whether the property is in District 7.

Datasets used in this notebook:
- Property Assessment Dataset: [link](https://data.boston.gov/dataset/property-assessment)
- Boston Live Street Address Management: [link](https://data.boston.gov/dataset/live-street-address-management-sam-addresses)
- Boston City Council 2023-2032 Shapefile: [link](https://data.boston.gov/dataset/city-council-districts-2023-2032)
- Boston ZIP Code Shapefile: [link](https://data.boston.gov/dataset/zip-codes/resource/a9b44fec-3a21-42ac-a919-06ec4ac20ab8)

### Summary
Many street numbers are missing, many street addresses are formatted incorrectly, and some streets do not appear in the Live Street Address Management dataset. We did our best to maintain data integrity and correctness when assigning each property to District 7 or not, but please be mindful that this approach will produce some mistakes. We assign District 7 only to properties that we are reasonably sure belong to it, and we mark any address whose coordinates cannot be determined as not in District 7.

In the process, we assign XY coordinates (longitude and latitude) according to this order:
1. Exact full street address match, which includes street number, street body, and street suffix (no unit number, assuming that properties with the same street address are in the same building)
2. Street body and suffix match
3. Street body match
4. Remove suffix from street body where included and match
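
The fallback order above can be sketched as a loop over progressively looser join keys. This is a minimal illustration on made-up rows covering the first three levels, not the notebook's actual cells: the lower-case column names, the helper `assign_coordinates`, and the `match_level` column are all assumptions introduced here.

```python
import pandas as pd

# Hypothetical stand-in for the Live Street Address Management table.
sam = pd.DataFrame({
    "full_street_address": ["87 beacon st", "1 swift ter"],
    "full_street_name": ["beacon st", "swift ter"],
    "street_body": ["beacon", "swift"],
    "POINT_X": [-71.071690, -71.021937],
    "POINT_Y": [42.355910, 42.380540],
})

# Hypothetical property rows, already lower-cased and stripped.
props = pd.DataFrame({
    "full_street_address": ["87 beacon st", "90 beacon st", "5 swift te", "1 nowhere rd"],
    "full_street_name": ["beacon st", "beacon st", "swift te", "nowhere rd"],
    "street_body": ["beacon", "beacon", "swift", "nowhere"],
})


def assign_coordinates(props, sam, keys):
    """Fill POINT_X/POINT_Y using the first key in `keys` that finds a match."""
    out = props.copy()
    out["POINT_X"] = float("nan")
    out["POINT_Y"] = float("nan")
    out["match_level"] = None
    for key in keys:
        # One reference row per key value (the first one encountered).
        lookup = sam.drop_duplicates(subset=[key]).set_index(key)
        # Only rows that are still unmatched are eligible at this level.
        hit = out["POINT_X"].isna() & out[key].isin(lookup.index)
        out.loc[hit, "POINT_X"] = out.loc[hit, key].map(lookup["POINT_X"])
        out.loc[hit, "POINT_Y"] = out.loc[hit, key].map(lookup["POINT_Y"])
        out.loc[hit, "match_level"] = key
    return out


matched = assign_coordinates(
    props, sam, ["full_street_address", "full_street_name", "street_body"]
)
print(matched[["full_street_address", "match_level", "POINT_X", "POINT_Y"]])
```

Recording the level at which each row matched makes it possible to audit the looser matches later, since a coordinate found by street body alone is only an estimate of where the property sits.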

At the end of the coordinate-assignment process, 1,055,424 of the 1,068,278 properties have coordinates. The remaining 12,854 properties still have no coordinates, mainly because their addresses are not found in the Live Street Address Management dataset. We consider this number small enough that these properties can be safely excluded from the dataset.
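
As a quick check of the figures quoted above, the arithmetic can be reproduced directly. The variable names are illustrative; the two input numbers are the ones stated in this summary.

```python
total_properties = 1_068_278
with_coordinates = 1_055_424

missing = total_properties - with_coordinates
coverage = with_coordinates / total_properties

print(f"missing: {missing:,}")      # missing: 12,854
print(f"coverage: {coverage:.2%}")  # coverage: 98.80%
```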

After coordinates are assigned, each one is compared against the City Council shapefile to determine whether the location falls within District 7's boundary. `is_d7` is set to `True` if it does and `False` if it does not. To reduce errors that may stem from matching on street body without street number and suffix, any row with `is_d7 = True` whose ZIP code falls outside the District 7 ZIP codes is reset to `False`.
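
The two-stage rule described above (boundary test first, ZIP check second) can be sketched as follows. With the real shapefile one would normally rely on `geopandas` or `shapely` for the geometry test; this sketch swaps in a hand-written ray-casting check so that it runs with the standard library alone. The square ring, the ZIP set, and the function names are hypothetical and are not the real District 7 boundary or ZIP codes. Treating a missing ZIP as "outside" is also an assumption of this sketch.

```python
def point_in_ring(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the closed ring of (x, y) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x position where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical square "district" and ZIP set, for illustration only.
DISTRICT_RING = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
DISTRICT_ZIPS = {"99991", "99992"}


def flag_is_d7(x, y, zipcode, ring=DISTRICT_RING, zips=DISTRICT_ZIPS):
    # No coordinate: marked as not in the district.
    if x is None or y is None:
        return False
    # Stage 1: the point must fall inside the district boundary.
    if not point_in_ring(x, y, ring):
        return False
    # Stage 2: a point inside the boundary is reset to False when its ZIP
    # is not one of the district's ZIP codes (assumption: missing ZIP too).
    return zipcode in zips


print(flag_is_d7(1.0, 1.0, "99991"))   # True
print(flag_is_d7(1.0, 1.0, "12345"))   # False
print(flag_is_d7(3.0, 1.0, "99991"))   # False
print(flag_is_d7(None, None, "99991")) # False
```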

The resulting dataframe is exported as `d7-property-new.csv` in the `data` folder.

In [1]:
from geopy.geocoders import Nominatim
import geopandas as gpd
from shapely.geometry import Point
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [2]:
df_2019 = pd.read_csv("../data/property/property_2019.csv", low_memory=False)
df_2020 = pd.read_csv("../data/property/property_2020.csv", low_memory=False)
df_2021 = pd.read_csv("../data/property/property_2021.csv", low_memory=False)
df_2022 = pd.read_csv("../data/property/property_2022.csv", low_memory=False)
df_2023 = pd.read_csv("../data/property/property_2023.csv", low_memory=False)
df_2024 = pd.read_csv("../data/property/property_2024.csv", low_memory=False)

df_2019["year"] = 2019
df_2020["year"] = 2020
df_2021["year"] = 2021
df_2022["year"] = 2022
df_2023["year"] = 2023
df_2024["year"] = 2024

df_property = pd.concat([df_2019, df_2020, df_2021, df_2022, df_2023, df_2024], ignore_index=True)

In [3]:
df_street_address = pd.read_csv("../data/Boston_SAM.csv", low_memory=False)

# Live Street Address Dataset

### Total street addresses: 400197

Some properties will not have a corresponding street address in this dataset. How should they be handled?

In [4]:
df_street_address[["POINT_X", "POINT_Y", "FULL_ADDRESS", "STREET_NUMBER", "FULL_STREET_NAME"]].sort_values(by="FULL_ADDRESS")

Unnamed: 0,POINT_X,POINT_Y,FULL_ADDRESS,STREET_NUMBER,FULL_STREET_NAME
29697,-71.120547,42.278381,0 Cliffmont St,0,Cliffmont St
399204,-71.075982,42.338042,0 Deacon St,0,Deacon St
41758,-71.063681,42.373852,0 Devens St,0,Devens St
41757,-71.063681,42.373852,0 Devens St 1,0,Devens St
43255,-71.166560,42.280650,0 Dow Rd,0,Dow Rd
...,...,...,...,...,...
134864,-71.075107,42.347132,,20-48,
136137,-71.056711,42.361799,,116,Blackstone St
140399,-71.074028,42.347213,,10-12,
399800,-71.054734,42.359777,,34,


In [5]:
df_street_address["FULL_ADDRESS"] = df_street_address["FULL_ADDRESS"].str.strip()
df_street_address["STREET_NUMBER"] = df_street_address["STREET_NUMBER"].str.strip()
df_street_address["FULL_STREET_NAME"] = df_street_address["FULL_STREET_NAME"].str.strip()

### Number of rows from live street address that have neither a street name nor a full address: 4
These rows are unusable and will be dropped.

In [6]:
df_street_address[df_street_address["FULL_ADDRESS"].isna() & df_street_address["FULL_STREET_NAME"].isna()][["POINT_X", "POINT_Y", "FULL_ADDRESS", "STREET_NUMBER", "FULL_STREET_NAME"]]

Unnamed: 0,POINT_X,POINT_Y,FULL_ADDRESS,STREET_NUMBER,FULL_STREET_NAME
134864,-71.075107,42.347132,,20-48,
140399,-71.074028,42.347213,,10-12,
399800,-71.054734,42.359777,,34,
400039,-71.054865,42.360526,,8,


In [7]:
df_street_address = df_street_address.dropna(subset=['FULL_ADDRESS', 'FULL_STREET_NAME'], how='all')
df_street_address[["POINT_X", "POINT_Y", "FULL_ADDRESS", "STREET_NUMBER", "FULL_STREET_NAME"]].sort_values(by="FULL_ADDRESS")

Unnamed: 0,POINT_X,POINT_Y,FULL_ADDRESS,STREET_NUMBER,FULL_STREET_NAME
29697,-71.120547,42.278381,0 Cliffmont St,0,Cliffmont St
399204,-71.075982,42.338042,0 Deacon St,0,Deacon St
41758,-71.063681,42.373852,0 Devens St,0,Devens St
41757,-71.063681,42.373852,0 Devens St 1,0,Devens St
43255,-71.166560,42.280650,0 Dow Rd,0,Dow Rd
...,...,...,...,...,...
111307,-71.050772,42.376113,C-8 Shipway Pl C-8,C-8,Shipway Pl
111308,-71.050772,42.376113,C-9 Shipway Pl C-9,C-9,Shipway Pl
310515,-71.051407,42.371981,Pier 4 Eighth St,Pier 4,Eighth St
377920,-71.055699,42.357256,TEN Post Office Sq,TEN,Post Office Sq


### Number of rows without full address in live street address: 1

In [8]:
df_street_address[df_street_address["FULL_ADDRESS"].isna()]

Unnamed: 0,_id,OID_,SAM_ADDRESS_ID,BUILDING_ID,RELATIONSHIP_TYPE,FULL_ADDRESS,STREET_NUMBER,IS_RANGE,RANGE_FROM,RANGE_TO,...,Y_COORD,SAM_STREET_ID,WARD,PRECINCT_WARD,PARCEL,created_date,last_edited_date,shape_wkt,POINT_X,POINT_Y
136137,136138,136138,157009,153268,2,,116,0,,,...,2957180.0,437.0,3,306,303337000,9/25/2009 17:56:02,8/5/2022 14:46:20,POINT (-71.056711005999944 42.361798776000057),-71.056711,42.361799


### Number of rows without street number: 0

In [9]:
df_street_address[df_street_address["STREET_NUMBER"].isna()]

Unnamed: 0,_id,OID_,SAM_ADDRESS_ID,BUILDING_ID,RELATIONSHIP_TYPE,FULL_ADDRESS,STREET_NUMBER,IS_RANGE,RANGE_FROM,RANGE_TO,...,Y_COORD,SAM_STREET_ID,WARD,PRECINCT_WARD,PARCEL,created_date,last_edited_date,shape_wkt,POINT_X,POINT_Y


### Number of rows without street name: 0

In [10]:
df_street_address[df_street_address["FULL_STREET_NAME"].isna()]

Unnamed: 0,_id,OID_,SAM_ADDRESS_ID,BUILDING_ID,RELATIONSHIP_TYPE,FULL_ADDRESS,STREET_NUMBER,IS_RANGE,RANGE_FROM,RANGE_TO,...,Y_COORD,SAM_STREET_ID,WARD,PRECINCT_WARD,PARCEL,created_date,last_edited_date,shape_wkt,POINT_X,POINT_Y


## Create full street address without unit number

In [11]:
df_street_address["FULL_STREET_ADDRESS"] = df_street_address["STREET_NUMBER"].str.lower().str.strip() + " " + df_street_address["FULL_STREET_NAME"].str.lower().str.strip()
df_street_address["FULL_STREET_ADDRESS"] = df_street_address["FULL_STREET_ADDRESS"].str.strip()
df_street_address[["FULL_STREET_ADDRESS", "FULL_ADDRESS", "STREET_NUMBER", "FULL_STREET_NAME", "POINT_X", "POINT_Y"]].sort_values(by="FULL_ADDRESS")

Unnamed: 0,FULL_STREET_ADDRESS,FULL_ADDRESS,STREET_NUMBER,FULL_STREET_NAME,POINT_X,POINT_Y
29697,0 cliffmont st,0 Cliffmont St,0,Cliffmont St,-71.120547,42.278381
399204,0 deacon st,0 Deacon St,0,Deacon St,-71.075982,42.338042
41758,0 devens st,0 Devens St,0,Devens St,-71.063681,42.373852
41757,0 devens st,0 Devens St 1,0,Devens St,-71.063681,42.373852
43255,0 dow rd,0 Dow Rd,0,Dow Rd,-71.166560,42.280650
...,...,...,...,...,...,...
111307,c-8 shipway pl,C-8 Shipway Pl C-8,C-8,Shipway Pl,-71.050772,42.376113
111308,c-9 shipway pl,C-9 Shipway Pl C-9,C-9,Shipway Pl,-71.050772,42.376113
310515,pier 4 eighth st,Pier 4 Eighth St,Pier 4,Eighth St,-71.051407,42.371981
377920,ten post office sq,TEN Post Office Sq,TEN,Post Office Sq,-71.055699,42.357256


### Number of rows where the full street address is NA: 0

In [12]:
df_street_address[df_street_address["FULL_STREET_ADDRESS"].isna()]

Unnamed: 0,_id,OID_,SAM_ADDRESS_ID,BUILDING_ID,RELATIONSHIP_TYPE,FULL_ADDRESS,STREET_NUMBER,IS_RANGE,RANGE_FROM,RANGE_TO,...,SAM_STREET_ID,WARD,PRECINCT_WARD,PARCEL,created_date,last_edited_date,shape_wkt,POINT_X,POINT_Y,FULL_STREET_ADDRESS


### Number of duplicates in full street address column: 263930

In [13]:
df_street_address["FULL_STREET_ADDRESS"].count() - df_street_address["FULL_STREET_ADDRESS"].nunique()

263930

In [14]:
df_street_address["FULL_STREET_ADDRESS"].nunique()

136262

# Property Assessment Dataset

In [15]:
df_property_columns = df_property.columns

### Total property assessed: 1,068,278

In [16]:
df_property[["ST_NUM", "ST_NAME", "ST_NAME_SUF"]].sort_values(by="ST_NAME")

Unnamed: 0,ST_NUM,ST_NAME,ST_NAME_SUF
60333,319,A,ST
233763,326,A,ST
233762,326,A,ST
233761,326,A,ST
233760,326,A,ST
...,...,...,...
791118,18.0,Zamora ST,
972941,15.0,Zamora ST,
972942,15.0,Zamora ST,
972949,35.0,Zamora ST,


In [17]:
df_property["ST_NUM"] = df_property["ST_NUM"].str.strip()
df_property["ST_NAME"] = df_property["ST_NAME"].str.strip()
df_property["ST_NAME_SUF"] = df_property["ST_NAME_SUF"].str.strip()

### Number of rows with no street number: 372436

In [18]:
df_property[df_property["ST_NUM"].isna()].shape

(372436, 132)

### Number of rows with no street name: 0

In [19]:
df_property[df_property["ST_NAME"].isna()].shape

(0, 132)

### Number of rows with no street suffix: 721438

In [20]:
df_property[df_property["ST_NAME_SUF"].isna()].shape

(721438, 132)

### Number of rows with no unit number: 609024
For geographical plotting purposes, we treat each address as if it had no unit number, assuming that properties at the same street address but with different unit numbers are in the same building.

In [21]:
df_property[df_property["UNIT_NUM"].isna()].shape

(609024, 132)

### Combine columns into full address

In [22]:
import re

def extract_numeric_part(st_num):
    if pd.isna(st_num):
        return ""  # Return an empty string if ST_NUM is NaN
    elif isinstance(st_num, (int, float)):
        return str(int(st_num))  # Convert numeric ST_NUM to integer and then to string
    elif isinstance(st_num, str):
        match = re.match(r"(\d+)\.?\d*", st_num)  # Matches the numeric part in strings
        if match:
            return str(int(float(match.group(1))))  # Convert to integer
    return st_num.strip()  # Return as is if it doesn't match any numeric part
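
To make the behavior of `extract_numeric_part` concrete, the snippet below runs it on a few `ST_NUM` values of the kinds that appear in this notebook's outputs. The function body is repeated only so the snippet runs on its own; the `examples` list is an illustrative selection.

```python
import re
import pandas as pd


def extract_numeric_part(st_num):
    # Same logic as the notebook cell above, repeated so this snippet is self-contained.
    if pd.isna(st_num):
        return ""
    elif isinstance(st_num, (int, float)):
        return str(int(st_num))
    elif isinstance(st_num, str):
        match = re.match(r"(\d+)\.?\d*", st_num)
        if match:
            return str(int(float(match.group(1))))
    return st_num.strip()


examples = [87.0, "18.0", "104 A 104", "198 200", "-8B-8C", float("nan")]
results = [extract_numeric_part(value) for value in examples]
print(results)  # ['87', '18', '104', '198', '-8B-8C', '']
```

Note that a range such as `"198 200"` is reduced to its first number, and a value that does not start with a digit, such as `"-8B-8C"`, passes through unchanged.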

In [23]:
df_property["FULL_STREET_ADDRESS"] = df_property.apply(
    lambda row: (
        extract_numeric_part(row["ST_NUM"]) + " " + row["ST_NAME"] + " " + row["ST_NAME_SUF"]
    ).lower() if pd.notna(row["ST_NUM"]) and pd.notna(row["ST_NAME_SUF"])
    else (
        extract_numeric_part(row["ST_NUM"]) + " " + row["ST_NAME"]
    ).lower() if pd.notna(row["ST_NUM"])
    else (
        row["ST_NAME"] + " " + row["ST_NAME_SUF"]
    ).lower() if pd.notna(row["ST_NAME_SUF"])
    else row["ST_NAME"].lower(),
    axis=1
)
df_property["FULL_STREET_ADDRESS"] = df_property["FULL_STREET_ADDRESS"].str.strip()

In [24]:
df_property[["FULL_STREET_ADDRESS", "ST_NUM", "ST_NAME", "ST_NAME_SUF"]]

Unnamed: 0,FULL_STREET_ADDRESS,ST_NUM,ST_NAME,ST_NAME_SUF
0,87 beacon st,87,BEACON,ST
1,87 beacon st,87,BEACON,ST
2,87 beacon st,87,BEACON,ST
3,87 beacon st,87,BEACON,ST
4,87 beacon st,87,BEACON,ST
...,...,...,...,...
1068273,knowles st,,KNOWLES ST,
1068274,lake st,,Lake ST,
1068275,lake st,,Lake ST,
1068276,commonwealth av,,COMMONWEALTH AV,


In [25]:
df_property[df_property["FULL_STREET_ADDRESS"].isna()]

Unnamed: 0,PID,CM_ID,GIS_ID,ST_NUM,ST_NAME,ST_NAME_SUF,UNIT_NUM,ZIPCODE,PTYPE,LU,...,OWNER MAIL ADDRESS,EXT_FNISHED,KITCHENS,FIREPLACES,MAIL_STREET_ADDRESS,MAIL_ZIP_CODE,SFYI_VALUE,GROSS_TAX,HEAT_SYSTEM,FULL_STREET_ADDRESS


### Number of duplicate full street addresses: 972,126
This is OK! We treat properties that share a street address but have different unit numbers as the same location, because they are in the same building and the unit makes no geographical difference.

In [26]:
df_property["FULL_STREET_ADDRESS"].count() - df_property["FULL_STREET_ADDRESS"].nunique()

972126

# Joining Property Assessment Dataset with Live Street Address Management Dataset

### Drop duplicates in full street address

In [27]:
df_street_address_unique = df_street_address.drop_duplicates(subset=["FULL_STREET_ADDRESS"])

In [28]:
df_street_address_unique[["FULL_STREET_ADDRESS", "FULL_ADDRESS", "STREET_NUMBER", "FULL_STREET_NAME", "POINT_X", "POINT_Y"]].sort_values(by="FULL_ADDRESS")

Unnamed: 0,FULL_STREET_ADDRESS,FULL_ADDRESS,STREET_NUMBER,FULL_STREET_NAME,POINT_X,POINT_Y
29697,0 cliffmont st,0 Cliffmont St,0,Cliffmont St,-71.120547,42.278381
399204,0 deacon st,0 Deacon St,0,Deacon St,-71.075982,42.338042
41757,0 devens st,0 Devens St 1,0,Devens St,-71.063681,42.373852
43255,0 dow rd,0 Dow Rd,0,Dow Rd,-71.166560,42.280650
393984,0 emerson pl,0 Emerson Pl,0,Emerson Pl,-71.068738,42.364346
...,...,...,...,...,...,...
111307,c-8 shipway pl,C-8 Shipway Pl C-8,C-8,Shipway Pl,-71.050772,42.376113
111308,c-9 shipway pl,C-9 Shipway Pl C-9,C-9,Shipway Pl,-71.050772,42.376113
310515,pier 4 eighth st,Pier 4 Eighth St,Pier 4,Eighth St,-71.051407,42.371981
377920,ten post office sq,TEN Post Office Sq,TEN,Post Office Sq,-71.055699,42.357256


In [29]:
df_property[["FULL_STREET_ADDRESS", "ST_NUM", "ST_NAME", "ST_NAME_SUF"]].sort_values(by="FULL_STREET_ADDRESS")

Unnamed: 0,FULL_STREET_ADDRESS,ST_NUM,ST_NAME,ST_NAME_SUF
627166,-8b-8c greenwood st,-8B-8C,GREENWOOD ST,
448877,-8b-8c greenwood st,-8B-8C,GREENWOOD ST,
269323,-8b-8c greenwood st,-8B-8C,GREENWOOD,ST
589328,0 harbor st,0,HARBOR ST,
233545,0 harbor st,0,HARBOR,ST
...,...,...,...,...
859127,zeller st,,ZELLER ST,
859128,zeller st,,ZELLER ST,
859129,zeller st,,ZELLER ST,
1041002,zeller st,,ZELLER ST,


### Number of properties now assigned XY coordinates: 501,449

In [30]:
df_property_with_coord = pd.merge(df_property, df_street_address_unique, left_on="FULL_STREET_ADDRESS", right_on="FULL_STREET_ADDRESS", how="inner")

In [31]:
df_property_with_coord.shape

(501449, 165)
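
The inner join above keeps only the matched rows, and the unmatched count is then derived by subtraction. One alternative, sketched here on made-up rows, is a left join with `indicator=True`, which labels every property row as matched or unmatched in a single pass. The frames `props` and `sam_unique` are illustrative stand-ins for `df_property` and `df_street_address_unique`.

```python
import pandas as pd

props = pd.DataFrame({
    "PID": [1, 2, 3],
    "FULL_STREET_ADDRESS": ["87 beacon st", "4 lawson pl", "lake st"],
})
sam_unique = pd.DataFrame({
    "FULL_STREET_ADDRESS": ["87 beacon st", "1 lawson pl"],
    "POINT_X": [-71.071690, -71.028470],
    "POINT_Y": [42.355910, 42.380200],
})

merged = props.merge(
    sam_unique,
    on="FULL_STREET_ADDRESS",
    how="left",
    indicator=True,           # adds a `_merge` column: "both" or "left_only"
    validate="many_to_one",   # raises if the reference side has duplicate keys
)

matched = merged[merged["_merge"] == "both"]
unmatched = merged[merged["_merge"] == "left_only"]
print(len(matched), len(unmatched))  # 1 2
```

The `validate` argument is a cheap safeguard: if duplicates were ever left in the reference table, the join would fail instead of silently multiplying property rows.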

In [32]:
df_property_with_coord

Unnamed: 0,PID,CM_ID,GIS_ID,ST_NUM,ST_NAME,ST_NAME_SUF,UNIT_NUM,ZIPCODE,PTYPE,LU,...,Y_COORD,SAM_STREET_ID,WARD,PRECINCT_WARD,PARCEL,created_date,last_edited_date,shape_wkt,POINT_X,POINT_Y
0,502550008,502550000.0,5.025500e+08,87,BEACON,ST,2-F,2108.0,102.0,CD,...,2.955013e+06,332.0,5,505,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910
1,502550010,502550000.0,5.025500e+08,87,BEACON,ST,2-R,2108.0,102.0,CD,...,2.955013e+06,332.0,5,505,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910
2,502550012,502550000.0,5.025500e+08,87,BEACON,ST,3-F,2108.0,102.0,CD,...,2.955013e+06,332.0,5,505,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910
3,502550014,502550000.0,5.025500e+08,87,BEACON,ST,3-R,2108.0,102.0,CD,...,2.955013e+06,332.0,5,505,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910
4,502550016,502550000.0,5.025500e+08,87,BEACON,ST,4,2108.0,102.0,CD,...,2.955013e+06,332.0,5,505,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501444,2205664000,,2.205664e+09,18 16,LAKE ST,,,2135.0,,R2,...,2.949442e+06,2334.0,22,2208,2205664000,9/28/2009 1:28:37,9/29/2009 17:33:08,POINT (-71.166349937999939 42.340936078000027),-71.166350,42.340936
501445,2205665000,2205665000,2.205665e+09,14 12,LAKE ST,,,2135.0,,CM,...,2.949394e+06,2334.0,22,2208,2205665000,9/25/2009 10:14:59,9/29/2009 17:33:08,POINT (-71.166337999999939 42.340804000000048),-71.166338,42.340804
501446,2205665002,2205665000,2.205665e+09,14,LAKE ST,,2,2135.0,,CD,...,2.949394e+06,2334.0,22,2208,2205665000,9/25/2009 10:14:59,9/29/2009 17:33:08,POINT (-71.166337999999939 42.340804000000048),-71.166338,42.340804
501447,2205665004,2205665000,2.205665e+09,12,LAKE ST,,1,2135.0,,CD,...,2.949385e+06,2334.0,22,2208,2205665000,9/25/2009 10:14:59,9/29/2009 17:33:08,POINT (-71.166339999999934 42.340780000000052),-71.166340,42.340780


### Examine the properties with no matching address in the Live Street Address Management Dataset

In [33]:
print(f"{df_property.shape[0] - df_property_with_coord.shape[0]} addresses found no match.")

566829 addresses found no match.


In [34]:
df_missing_coord_addresses = df_property[~df_property['FULL_STREET_ADDRESS'].isin(df_property_with_coord['FULL_STREET_ADDRESS'])].copy()

In [35]:
df_missing_coord_addresses[["FULL_STREET_ADDRESS", "ST_NUM", "ST_NAME", "ST_NAME_SUF"]]

Unnamed: 0,FULL_STREET_ADDRESS,ST_NUM,ST_NAME,ST_NAME_SUF
60,104 putnam st,104 A 104,PUTNAM,ST
111,198 princeton st,198 200,PRINCETON,ST
121,399 saratoga st,399 401,SARATOGA,ST
175,4 lawson pl,4,LAWSON,PL
176,3 lawson pl,3,LAWSON,PL
...,...,...,...,...
1068273,knowles st,,KNOWLES ST,
1068274,lake st,,Lake ST,
1068275,lake st,,Lake ST,
1068276,commonwealth av,,COMMONWEALTH AV,


In [36]:
df_missing_coord_addresses["FULL_STREET_NAME"] = df_missing_coord_addresses.apply(
    lambda row: (
        row["ST_NAME"] + " " + row["ST_NAME_SUF"]
    ).lower() if pd.notna(row["ST_NAME_SUF"])
    else row["ST_NAME"].lower(),
    axis=1
)

df_missing_coord_addresses["FULL_STREET_NAME"] = df_missing_coord_addresses["FULL_STREET_NAME"].str.strip()

df_missing_coord_addresses[["FULL_STREET_ADDRESS", "FULL_STREET_NAME", "ST_NUM", "ST_NAME", "ST_NAME_SUF"]]

Unnamed: 0,FULL_STREET_ADDRESS,FULL_STREET_NAME,ST_NUM,ST_NAME,ST_NAME_SUF
60,104 putnam st,putnam st,104 A 104,PUTNAM,ST
111,198 princeton st,princeton st,198 200,PRINCETON,ST
121,399 saratoga st,saratoga st,399 401,SARATOGA,ST
175,4 lawson pl,lawson pl,4,LAWSON,PL
176,3 lawson pl,lawson pl,3,LAWSON,PL
...,...,...,...,...,...
1068273,knowles st,knowles st,,KNOWLES ST,
1068274,lake st,lake st,,Lake ST,
1068275,lake st,lake st,,Lake ST,
1068276,commonwealth av,commonwealth av,,COMMONWEALTH AV,


### Drop duplicates in the full street name column of the Street Address Dataset

In [37]:
df_street_name_unique = df_street_address[["FULL_STREET_ADDRESS", "FULL_ADDRESS", "STREET_NUMBER", "STREET_BODY", "FULL_STREET_NAME", "POINT_X", "POINT_Y"]].copy()
df_street_name_unique["FULL_STREET_NAME"] = df_street_name_unique["FULL_STREET_NAME"].str.strip().str.lower()
df_street_name_unique = df_street_name_unique.drop_duplicates(subset=["FULL_STREET_NAME"])

### Fix misspelling and abbreviation
- Replace "msgr" with "monsignor"
- Remove all special characters from street name
- Replace "abbott" with "abbot"
- Replace "wy" with "way"
- Replace "wm" with "william"
- Replace "hw" with "hwy"
- Replace "oneil" with "o'neil"
- Replace "mt" with "mount"
- Replace "dr mary m beatty" with "dr mary moore beatty cir"
- Replace "commercial wharf east" with "commercial whf r"
- Replace "commonweatlh" with "commonwealth"
- Replace "battery wharf" with "battery whf"
- Replace leading "st" with "saint"
- Replace "crescent circuit" with "crescent cirt"
- Replace "fr francis gilday" with "father francis j gilday"
- Replace "pw" with "pkwy"
- Replace "wm card oconnell" with "william cardinal oconnell way"
- Replace "gen wm h devine" with "general william h devine way"
- Replace "w roxbury pkwy" with "west roxbury pkwy"
- Replace "park lane" with "park ln"
- Replace "gen jozef pilsudski way" with "general jozef pilsudski way"
- Replace "fan pier bl" with "fan pier  blvd"
- Replace "oconnor way" with "major michael j oconnor way"
- Replace "soldiers field rd xt" with "soldiers field rd"

orton marotta = Orton-Marotta
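
A plain substring replacement of a short pattern such as "gen" or "hw" also fires inside longer words: replacing "gen" with "general" turns an existing "general" into "generaleral", and replacing "hw" with "hwy" turns "hwy" into "hwyy". The sketch below shows one way to restrict the single-word substitutions from the list above to whole tokens. The dictionary, the function name, and the sample inputs are illustrative and do not appear in the notebook; multi-word replacements (for example "soldiers field rd xt") would need a separate phrase-level pass.

```python
import re

# Whole-token substitutions taken from the list above.
TOKEN_REPLACEMENTS = {
    "msgr": "monsignor",
    "abbott": "abbot",
    "wy": "way",
    "wm": "william",
    "hw": "hwy",
    "mt": "mount",
    "pw": "pkwy",
    "gen": "general",
}


def normalize_street_name(name):
    """Lower-case, drop special characters, and replace whole tokens only."""
    name = name.lower().strip()
    name = re.sub(r"[^a-z0-9\s]", "", name)  # remove special characters
    name = re.sub(r"^st\b", "saint", name)   # a leading "st" is read as "saint"
    tokens = [TOKEN_REPLACEMENTS.get(token, token) for token in name.split()]
    return " ".join(tokens)


print(normalize_street_name("GEN WM H DEVINE WY"))  # general william h devine way
print(normalize_street_name("General Heath Sq"))    # general heath sq
print(normalize_street_name("St Botolph St"))       # saint botolph st
```

Because the name is split on whitespace and rejoined, runs of spaces collapse to one space, which matters if the reference dataset itself contains double spaces in a street name.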


In [38]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("msgr", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("msgr", "monsignor", case=False, regex=False)

In [39]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("abbott", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("abbott", "abbot", case=False, regex=False)

In [40]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("wy", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("wy", "way", case=False, regex=False)

In [41]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("wm", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("wm", "william", case=False, regex=False)

In [42]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("hw", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("hw", "hwy", case=False, regex=False)

In [43]:
# regex=True is required here: with regex=False the character class would be matched as a literal string.
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains(r"[^a-zA-Z0-9\s]", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace(r"[^a-zA-Z0-9\s]", "", case=False, regex=True)

In [44]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("oneil", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("oneil", "o'neil", case=False, regex=False)

In [45]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("mt", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("mt", "mount", case=False, regex=False)

In [46]:
df_missing_coord_addresses['FULL_STREET_NAME'] = df_missing_coord_addresses['FULL_STREET_NAME'].str.replace(r'^\bst\b', 'saint', case=False, regex=True)

In [47]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("crescent circuit", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("crescent circuit", "crescent cirt", case=False, regex=False)

In [48]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("fr francis gilday", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("fr francis gilday", "father francis j gilday", case=False, regex=False)

In [49]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("pw", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("pw", "pkwy", case=False, regex=False)

In [50]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("card oconnell", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("card oconnell", "cardinal oconnell", case=False, regex=False)

In [51]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("gen", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("gen", "general", case=False, regex=False)

In [52]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("w roxbury pkwy", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("w roxbury pkwy", "west roxbury pkwy", case=False, regex=False)

In [53]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("park lane dr", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("park lane", "park ln", case=False, regex=False)

In [54]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("fan pier bl", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("fan pier bl", "fan pier  blvd", case=False, regex=False)

In [55]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("oconnor way", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("oconnor way", "major michael j oconnor way", case=False, regex=False)

In [56]:
df_missing_coord_addresses.loc[df_missing_coord_addresses["FULL_STREET_NAME"].str.contains("soldiers field rd xt", case=False, na=False), "FULL_STREET_NAME"] = \
    df_missing_coord_addresses["FULL_STREET_NAME"].str.replace("soldiers field rd xt", "soldiers field rd", case=False, regex=False)

### Match just the street name and its suffix
We will estimate the coordinates of each property based on its street name.
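
Because the street-name lookup table is built with `drop_duplicates`, the coordinate used for a street is whichever address on that street appears first in the reference data. The sketch below contrasts that with a per-street mean, which is one possible alternative and not what this notebook does. The three rows and the frame names are made up for illustration.

```python
import pandas as pd

# Hypothetical reference rows for a single street.
sam = pd.DataFrame({
    "FULL_STREET_NAME": ["lake st", "lake st", "lake st"],
    "POINT_X": [-71.1643, -71.1663, -71.1653],
    "POINT_Y": [42.3440, 42.3408, 42.3424],
})

# What drop_duplicates keeps: the first row seen for each street name.
first_per_street = sam.drop_duplicates(subset=["FULL_STREET_NAME"])

# Alternative (assumption, not the notebook's method): average all points on the street.
mean_per_street = sam.groupby("FULL_STREET_NAME", as_index=False)[["POINT_X", "POINT_Y"]].mean()

print(first_per_street[["POINT_X", "POINT_Y"]].iloc[0].tolist())
print(mean_per_street[["POINT_X", "POINT_Y"]].iloc[0].round(4).tolist())
```

Either choice is only an approximation for a long street that crosses a district boundary, which is why the later ZIP code check is useful.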

In [57]:
print(f"{df_missing_coord_addresses.shape[0]} addresses found no match because the street numbers in Live Street Address are recorded as a range.")

566829 addresses found no match because the street numbers in Live Street Address are recorded as a range.


In [58]:
df_missing_coord_addresses[["FULL_STREET_NAME", "FULL_STREET_ADDRESS", "ST_NUM", "ST_NAME", "ST_NAME_SUF"]].sort_values(by="FULL_STREET_NAME")

Unnamed: 0,FULL_STREET_NAME,FULL_STREET_ADDRESS,ST_NUM,ST_NAME,ST_NAME_SUF
769532,a st,a st,,A ST,
950845,a st,a st,,A ST,
405061,a st,19 a st,19,A ST,
951137,a st,a st,,A ST,
951148,a st,a st,,A ST,
...,...,...,...,...,...
1040995,zeller st,zeller st,,ZELLER ST,
1040994,zeller st,zeller st,,ZELLER ST,
859113,zeller st,zeller st,,ZELLER ST,
859106,zeller st,zeller st,,ZELLER ST,


In [59]:
df_missing_coord_addresses["FULL_STREET_NAME"].nunique()

4253

In [60]:
df_missing_property_street_name_with_coord = pd.merge(df_missing_coord_addresses, df_street_name_unique, left_on="FULL_STREET_NAME", right_on="FULL_STREET_NAME", how="inner", suffixes=('', '_right'))

In [61]:
df_missing_property_street_name_with_coord.shape

(364822, 140)

In [62]:
df_missing_property_street_name_with_coord

Unnamed: 0,PID,CM_ID,GIS_ID,ST_NUM,ST_NAME,ST_NAME_SUF,UNIT_NUM,ZIPCODE,PTYPE,LU,...,GROSS_TAX,HEAT_SYSTEM,FULL_STREET_ADDRESS,FULL_STREET_NAME,FULL_STREET_ADDRESS_right,FULL_ADDRESS,STREET_NUMBER,STREET_BODY,POINT_X,POINT_Y
0,100001000,,1.000010e+08,104 A 104,PUTNAM,ST,,2128.0,105.0,R3,...,,,104 putnam st,putnam st,10 putnam st,10 Putnam St,10,Putnam,-71.059864,42.373542
1,100051000,,1.000510e+08,198 200,PRINCETON,ST,,2128.0,111.0,R4,...,,,198 princeton st,princeton st,10 princeton st,10 Princeton St,10,Princeton,-71.038800,42.376590
2,100061000,,1.000610e+08,399 401,SARATOGA,ST,,2128.0,13.0,RC,...,,,399 saratoga st,saratoga st,100-104 saratoga st,100-104 Saratoga St,100-104,Saratoga,-71.036213,42.376800
3,100116000,,1.001160e+08,4,LAWSON,PL,,2128.0,130.0,RL,...,,,4 lawson pl,lawson pl,1 lawson pl,1 Lawson Pl,1,Lawson,-71.028470,42.380200
4,100117000,,1.001170e+08,3,LAWSON,PL,,2128.0,130.0,RL,...,,,3 lawson pl,lawson pl,1 lawson pl,1 Lawson Pl,1,Lawson,-71.028470,42.380200
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
364817,2205665000,2205665000.0,2.205665e+09,,Lake ST,,,,,CM,...,$-,,lake st,lake st,102 lake st,102 Lake St,102,Lake,-71.164330,42.344050
364818,2205665002,2205665000.0,2.205665e+09,,Lake ST,,2,,,CD,...,"$5,941.59",I - Indiv. Cntrl,lake st,lake st,102 lake st,102 Lake St,102,Lake,-71.164330,42.344050
364819,2205665004,2205665000.0,2.205665e+09,,Lake ST,,1,,,CD,...,"$5,393.32",I - Indiv. Cntrl,lake st,lake st,102 lake st,102 Lake St,102,Lake,-71.164330,42.344050
364820,2205667000,,2.205667e+09,,Lake ST,,,,,RL - RL,...,$793.52,,lake st,lake st,102 lake st,102 Lake St,102,Lake,-71.164330,42.344050


### Number of properties now assigned XY coordinates: 866,271

In [63]:
df_property_with_coord = pd.concat([df_property_with_coord, df_missing_property_street_name_with_coord], ignore_index=True)

In [64]:
df_property_with_coord.shape

(866271, 167)

In [65]:
print(f"A total of {df_property_with_coord.shape[0]} properties have coordinates out of the total {df_property.shape[0]}.")
print(f"{df_property.shape[0] - df_property_with_coord.shape[0]} property still doesn't have coordinate, which is a lot.")

A total of 866271 properties have coordinates out of the total 1068278.
202007 property still doesn't have coordinate, which is a lot.


### Examine the properties with no match on either full street address or full street name

In [66]:
df_missing_coord_addresses = df_missing_coord_addresses[~df_missing_coord_addresses['FULL_STREET_NAME'].isin(df_missing_property_street_name_with_coord['FULL_STREET_NAME'])].copy()

In [67]:
df_missing_coord_addresses[["FULL_STREET_NAME", "FULL_STREET_ADDRESS", "ST_NUM", "ST_NAME", "ST_NAME_SUF"]]

Unnamed: 0,FULL_STREET_NAME,FULL_STREET_ADDRESS,ST_NUM,ST_NAME,ST_NAME_SUF
521,swift te,1 swift te,1,SWIFT,TE
522,swift te,5 swift te,5,SWIFT,TE
523,swift te,9 swift te,9,SWIFT,TE
524,swift te,15 swift te,15,SWIFT,TE
525,swift te,19 swift te,19,SWIFT,TE
...,...,...,...,...,...
1068182,undine st,undine st,,UNDINE ST,
1068183,undine st,undine st,,UNDINE ST,
1068273,knowles st,knowles st,,KNOWLES ST,
1068276,commonwealth av,commonwealth av,,COMMONWEALTH AV,


### Match just the street name (body)

After manually searching for the `FULL_STREET_NAME` values in Live Street Address, the common problems are misspellings, wrong street name suffixes, and colloquial names.

There are too many of these cases to replace manually, so we match on the street name alone, ignoring the suffix, against the street body in Live Street Address and take the coordinates of the matching record.
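
Matching on street body alone can be ambiguous when one body is shared by several streets with different suffixes. A small check like the one below can list such bodies before the merge so they can be reviewed. The rows are made up for illustration, and `streets`, `names_per_body`, and `ambiguous_bodies` are names introduced here.

```python
import pandas as pd

# Hypothetical slice of the street reference table.
streets = pd.DataFrame({
    "STREET_BODY": ["lamartine", "lamartine", "swift", "swift"],
    "FULL_STREET_NAME": ["lamartine pl", "lamartine st", "swift ter", "swift ter"],
})

# Count how many distinct full street names share each street body.
names_per_body = streets.groupby("STREET_BODY")["FULL_STREET_NAME"].nunique()
ambiguous_bodies = names_per_body[names_per_body > 1].index.tolist()
print(ambiguous_bodies)  # ['lamartine']
```

Rows matched through an ambiguous body are the ones most likely to land on the wrong street, so they are good candidates for the ZIP code cross-check.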

In [68]:
df_missing_coord_addresses["ST_NAME_NEW"] = df_missing_coord_addresses["ST_NAME"].str.lower()
df_street_name_unique["STREET_BODY"] = df_street_name_unique["STREET_BODY"].str.lower()

In [69]:
df_street_body_unique = df_street_name_unique.drop_duplicates(subset=["STREET_BODY"])

In [70]:
df_missing_coord_addresses.shape

(202007, 135)

In [71]:
# Ignore the street name suffix and only use street name
df_missing_property_street_name_with_coord = pd.merge(df_missing_coord_addresses, df_street_body_unique, left_on="ST_NAME_NEW", right_on="STREET_BODY", how="inner", suffixes=('', '_right'))

In [72]:
df_missing_property_street_name_with_coord.shape

(60791, 142)

In [73]:
df_missing_property_street_name_with_coord

Unnamed: 0,PID,CM_ID,GIS_ID,ST_NUM,ST_NAME,ST_NAME_SUF,UNIT_NUM,ZIPCODE,PTYPE,LU,...,FULL_STREET_ADDRESS,FULL_STREET_NAME,ST_NAME_NEW,FULL_STREET_ADDRESS_right,FULL_ADDRESS,STREET_NUMBER,STREET_BODY,FULL_STREET_NAME_right,POINT_X,POINT_Y
0,100382000,,1.003820e+08,1,SWIFT,TE,,2128.0,101.0,R1,...,1 swift te,swift te,swift,1 swift ter,1 Swift Ter,1,swift,swift ter,-71.021937,42.380540
1,100383000,,1.003830e+08,5,SWIFT,TE,,2128.0,101.0,R1,...,5 swift te,swift te,swift,1 swift ter,1 Swift Ter,1,swift,swift ter,-71.021937,42.380540
2,100384000,,1.003840e+08,9,SWIFT,TE,,2128.0,101.0,R1,...,9 swift te,swift te,swift,1 swift ter,1 Swift Ter,1,swift,swift ter,-71.021937,42.380540
3,100385000,,1.003850e+08,15,SWIFT,TE,,2128.0,105.0,R3,...,15 swift te,swift te,swift,1 swift ter,1 Swift Ter,1,swift,swift ter,-71.021937,42.380540
4,100386000,,1.003860e+08,19,SWIFT,TE,,2128.0,101.0,R1,...,19 swift te,swift te,swift,1 swift ter,1 Swift Ter,1,swift,swift ter,-71.021937,42.380540
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60786,1809911000,,1.809911e+09,,VAN BRUNT,,,,,R1,...,van brunt,van brunt,van brunt,10 van brunt st,10 Van Brunt St,10,van brunt,van brunt st,-71.124490,42.242760
60787,1811885010,,1.811885e+09,,ADAMS,,,,,R1,...,adams,adams,adams,1 adams st,1 Adams St,1,adams,adams st,-71.060040,42.374840
60788,1900635011,1900635000.0,1.900635e+09,,LAMARTINE,,5,,,CD,...,lamartine,lamartine,lamartine,1 lamartine pl,1 Lamartine Pl,1,lamartine,lamartine pl,-71.106670,42.313510
60789,2200470000,,2.200470e+09,,CHARLES RIVER,,,,,E,...,charles river,charles river,charles river,44 charles river ave,44 Charles River Ave,44,charles river,charles river ave,-71.060502,42.370607


In [74]:
df_property_with_coord = pd.concat([df_property_with_coord, df_missing_property_street_name_with_coord], ignore_index=True)
df_property_with_coord

Unnamed: 0,PID,CM_ID,GIS_ID,ST_NUM,ST_NAME,ST_NAME_SUF,UNIT_NUM,ZIPCODE,PTYPE,LU,...,PARCEL,created_date,last_edited_date,shape_wkt,POINT_X,POINT_Y,ZIP_CODE,FULL_STREET_ADDRESS_right,ST_NAME_NEW,FULL_STREET_NAME_right
0,502550008,502550000.0,5.025500e+08,87,BEACON,ST,2-F,2108.0,102.0,CD,...,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,
1,502550010,502550000.0,5.025500e+08,87,BEACON,ST,2-R,2108.0,102.0,CD,...,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,
2,502550012,502550000.0,5.025500e+08,87,BEACON,ST,3-F,2108.0,102.0,CD,...,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,
3,502550014,502550000.0,5.025500e+08,87,BEACON,ST,3-R,2108.0,102.0,CD,...,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,
4,502550016,502550000.0,5.025500e+08,87,BEACON,ST,4,2108.0,102.0,CD,...,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
927057,1809911000,,1.809911e+09,,VAN BRUNT,,,,,R1,...,,,,,-71.124490,42.242760,2136.0,10 van brunt st,van brunt,van brunt st
927058,1811885010,,1.811885e+09,,ADAMS,,,,,R1,...,,,,,-71.060040,42.374840,2136.0,1 adams st,adams,adams st
927059,1900635011,1900635000.0,1.900635e+09,,LAMARTINE,,5,,,CD,...,,,,,-71.106670,42.313510,2130.0,1 lamartine pl,lamartine,lamartine pl
927060,2200470000,,2.200470e+09,,CHARLES RIVER,,,,,E,...,,,,,-71.060502,42.370607,2135.0,44 charles river ave,charles river,charles river ave


### Number of properties now assigned XY coordinates: 927,062

In [75]:
print(f"A total of {df_property_with_coord.shape[0]} properties have coordinates out of the total {df_property.shape[0]}.")
print(f"{df_property.shape[0] - df_property_with_coord.shape[0]} properties still don't have coordinates.")

A total of 927062 properties have coordinates out of the total 1068278.
141216 properties still don't have coordinates.


In [76]:
df_missing_coord_addresses = df_missing_coord_addresses[~df_missing_coord_addresses['ST_NAME_NEW'].isin(df_missing_property_street_name_with_coord['STREET_BODY'])].copy()
df_missing_coord_addresses[["FULL_STREET_NAME", "FULL_STREET_ADDRESS", "ST_NUM", "ST_NAME_NEW", "ST_NAME_SUF"]]

Unnamed: 0,FULL_STREET_NAME,FULL_STREET_ADDRESS,ST_NUM,ST_NAME_NEW,ST_NAME_SUF
533,vienna st,vienna st,,vienna,ST
534,vienna st,vienna st,,vienna,ST
535,vienna st,3 vienna st,3,vienna,ST
536,vienna st,5 vienna st,5,vienna,ST
537,vienna st,7 vienna st,7,vienna,ST
...,...,...,...,...,...
1068182,undine st,undine st,,undine st,
1068183,undine st,undine st,,undine st,
1068273,knowles st,knowles st,,knowles st,
1068276,commonwealth av,commonwealth av,,commonwealth av,
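
The inner merge followed by a separate `isin` filter could also be expressed as a single left merge. The sketch below is only an illustration on toy data: the `PID` values are made up, the two coordinates are copied from the output above, and `validate="many_to_one"` assumes the street table has already been deduplicated on `STREET_BODY`, as `df_street_body_unique` is.

```python
import pandas as pd

# Toy stand-ins for df_missing_coord_addresses and df_street_body_unique.
properties = pd.DataFrame({
    "PID": [1, 2, 3],  # made-up identifiers
    "ST_NAME_NEW": ["swift", "vienna", "adams"],
})
streets = pd.DataFrame({
    "STREET_BODY": ["swift", "adams"],
    "POINT_X": [-71.021937, -71.060040],
    "POINT_Y": [42.380540, 42.374840],
})

# A left merge keeps every property; `_merge` records whether a street was found.
# validate="many_to_one" raises if STREET_BODY is not unique in `streets`.
merged = properties.merge(
    streets,
    left_on="ST_NAME_NEW",
    right_on="STREET_BODY",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Rows that found a street body, with coordinates attached.
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
# Rows that did not, reduced back to the original property columns.
unmatched = merged.loc[merged["_merge"] == "left_only", properties.columns]
```

Splitting on `_merge` keeps the matched and unmatched sets consistent by construction, since both come from the same merge result.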


### Some street names contain the wrong street name suffix
Assume that the shortest word is the suffix and remove it.

In [77]:
def remove_shortest_word(text):
    words = text.split()  # Split the string into words
    if len(words) <= 1:
        return text  # Return the original string if it's the only word
    shortest_word = min(words, key=len)  # Find the shortest word
    words.remove(shortest_word)  # Remove the shortest word
    return ' '.join(words)

In [78]:
df_missing_coord_addresses['ST_NAME_NEW'] = df_missing_coord_addresses['ST_NAME_NEW'].apply(remove_shortest_word)
df_missing_coord_addresses[["FULL_STREET_NAME", "FULL_STREET_ADDRESS", "ST_NUM", "ST_NAME", "ST_NAME_SUF", 'ST_NAME_NEW']]

Unnamed: 0,FULL_STREET_NAME,FULL_STREET_ADDRESS,ST_NUM,ST_NAME,ST_NAME_SUF,ST_NAME_NEW
533,vienna st,vienna st,,VIENNA,ST,vienna
534,vienna st,vienna st,,VIENNA,ST,vienna
535,vienna st,3 vienna st,3,VIENNA,ST,vienna
536,vienna st,5 vienna st,5,VIENNA,ST,vienna
537,vienna st,7 vienna st,7,VIENNA,ST,vienna
...,...,...,...,...,...,...
1068182,undine st,undine st,,UNDINE ST,,undine
1068183,undine st,undine st,,UNDINE ST,,undine
1068273,knowles st,knowles st,,KNOWLES ST,,knowles
1068276,commonwealth av,commonwealth av,,COMMONWEALTH AV,,commonwealth


In [79]:
df_missing_property_street_name_with_coord = pd.merge(df_missing_coord_addresses, df_street_body_unique, left_on="ST_NAME_NEW", right_on="STREET_BODY", how="inner", suffixes=('', '_right'))

In [80]:
df_missing_property_street_name_with_coord.shape

(128362, 142)

In [81]:
df_missing_property_street_name_with_coord[["FULL_STREET_NAME", "FULL_STREET_ADDRESS", "ST_NUM", "ST_NAME", "ST_NAME_SUF", "STREET_BODY", "FULL_ADDRESS", "STREET_NUMBER", "POINT_X", "POINT_Y"]]

Unnamed: 0,FULL_STREET_NAME,FULL_STREET_ADDRESS,ST_NUM,ST_NAME,ST_NAME_SUF,STREET_BODY,FULL_ADDRESS,STREET_NUMBER,POINT_X,POINT_Y
0,chelsea creek,chelsea creek,,CHELSEA CREEK,,chelsea,55 Chelsea St B,55,-71.059256,42.372731
1,fort strong,fort strong,,FORT STRONG,,strong,1 Strong Pl,1,-71.068335,42.360817
2,fort strong,fort strong,,FORT STRONG,,strong,1 Strong Pl,1,-71.068335,42.360817
3,fort strong,fort strong,,FORT STRONG,,strong,1 Strong Pl,1,-71.068335,42.360817
4,fort strong,fort strong,,FORT STRONG,,strong,1 Strong Pl,1,-71.068335,42.360817
...,...,...,...,...,...,...,...,...,...,...
128357,lake shore te,lake shore te,,LAKE SHORE TE,,lake shore,14 Lake Shore Ct 1,14,-71.170468,42.345856
128358,undine st,undine st,,UNDINE ST,,undine,100-98 Undine Rd,100-98,-71.166046,42.342942
128359,undine st,undine st,,UNDINE ST,,undine,100-98 Undine Rd,100-98,-71.166046,42.342942
128360,commonwealth av,commonwealth av,,COMMONWEALTH AV,,commonwealth,1 Commonwealth Ave A,1,-71.072054,42.353996
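
The output above shows a side effect of the shortest-word heuristic: `fort strong` becomes `strong` and `chelsea creek` becomes `chelsea`, so those rows pick up coordinates from Strong Pl and Chelsea St. One possible alternative, sketched below, strips only a trailing word that appears in a list of known suffixes. The `KNOWN_SUFFIXES` set and the function name `strip_trailing_suffix` are assumptions for illustration and are not part of the notebook; rows it leaves unchanged would stay unmatched instead of being matched to a different street.

```python
# Hypothetical list of suffix abbreviations; in practice it could be built
# from the distinct values of ST_NAME_SUF.
KNOWN_SUFFIXES = {"st", "av", "ave", "rd", "te", "ter", "pl", "ct", "pkwy"}


def remove_shortest_word(text):
    """The notebook's heuristic, repeated here for comparison."""
    words = text.split()
    if len(words) <= 1:
        return text
    words.remove(min(words, key=len))
    return " ".join(words)


def strip_trailing_suffix(street_name):
    """Drop the last word only when it is a known street suffix."""
    words = street_name.split()
    if len(words) > 1 and words[-1].lower() in KNOWN_SUFFIXES:
        return " ".join(words[:-1])
    return street_name


# Compare both approaches on street names taken from the output above.
for name in ["undine st", "commonwealth av", "lake shore te", "fort strong", "chelsea creek"]:
    print(f"{name!r}: shortest-word -> {remove_shortest_word(name)!r}, "
          f"suffix-aware -> {strip_trailing_suffix(name)!r}")
```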


In [82]:
df_property_with_coord = pd.concat([df_property_with_coord, df_missing_property_street_name_with_coord], ignore_index=True)
df_property_with_coord

Unnamed: 0,PID,CM_ID,GIS_ID,ST_NUM,ST_NAME,ST_NAME_SUF,UNIT_NUM,ZIPCODE,PTYPE,LU,...,PARCEL,created_date,last_edited_date,shape_wkt,POINT_X,POINT_Y,ZIP_CODE,FULL_STREET_ADDRESS_right,ST_NAME_NEW,FULL_STREET_NAME_right
0,502550008,502550000.0,5.025500e+08,87,BEACON,ST,2-F,2108.0,102.0,CD,...,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,
1,502550010,502550000.0,5.025500e+08,87,BEACON,ST,2-R,2108.0,102.0,CD,...,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,
2,502550012,502550000.0,5.025500e+08,87,BEACON,ST,3-F,2108.0,102.0,CD,...,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,
3,502550014,502550000.0,5.025500e+08,87,BEACON,ST,3-R,2108.0,102.0,CD,...,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,
4,502550016,502550000.0,5.025500e+08,87,BEACON,ST,4,2108.0,102.0,CD,...,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1055419,2205550632,2205550001.0,2.205550e+09,,LAKE SHORE TE,,6-4,,,CD,...,,,,,-71.170468,42.345856,2135.0,14 lake shore ct,lake shore,lake shore ct
1055420,2205589002,2205589000.0,2.205589e+09,,UNDINE ST,,1,,,CD,...,,,,,-71.166046,42.342942,2135.0,100-98 undine rd,undine,undine rd
1055421,2205589004,2205589000.0,2.205589e+09,,UNDINE ST,,2,,,CD,...,,,,,-71.166046,42.342942,2135.0,100-98 undine rd,undine,undine rd
1055422,2205669000,,2.205669e+09,,COMMONWEALTH AV,,,,,C,...,,,,,-71.072054,42.353996,2135.0,1 commonwealth ave,commonwealth,commonwealth ave


### Number of properties now assigned XY coordinates: 1,055,424

In [83]:
print(f"A total of {df_property_with_coord.shape[0]} properties have coordinates out of the total {df_property.shape[0]}.")
print(f"{df_property.shape[0] - df_property_with_coord.shape[0]} properties still don't have coordinates.")

A total of 1055424 properties have coordinates out of the total 1068278.
12854 properties still don't have coordinates.


In [84]:
df_missing_coord_addresses = df_missing_coord_addresses[~df_missing_coord_addresses['ST_NAME'].isin(df_missing_property_street_name_with_coord['ST_NAME'])].copy()
df_missing_coord_addresses[["FULL_STREET_NAME", "FULL_STREET_ADDRESS", "ST_NUM", "ST_NAME", "ST_NAME_SUF"]]

Unnamed: 0,FULL_STREET_NAME,FULL_STREET_ADDRESS,ST_NUM,ST_NAME,ST_NAME_SUF
533,vienna st,vienna st,,VIENNA,ST
534,vienna st,vienna st,,VIENNA,ST
535,vienna st,3 vienna st,3,VIENNA,ST
536,vienna st,5 vienna st,5,VIENNA,ST
537,vienna st,7 vienna st,7,VIENNA,ST
...,...,...,...,...,...
1060810,leo m birmingham pkway,leo m birmingham pkwy,,Leo M Birmingham PKWY,
1060811,leo m birmingham pkway,leo m birmingham pkwy,,Leo M Birmingham PKWY,
1060812,leo m birmingham pkway,leo m birmingham pkwy,,Leo M Birmingham PKWY,
1060813,leo m birmingham pkway,leo m birmingham pkwy,,Leo M Birmingham PKWY,


We stop here because the remaining rows need manual inspection, which would take too long. These rows are not included in the new CSV file.

In [85]:
df_property_with_coord

Unnamed: 0,PID,CM_ID,GIS_ID,ST_NUM,ST_NAME,ST_NAME_SUF,UNIT_NUM,ZIPCODE,PTYPE,LU,...,PARCEL,created_date,last_edited_date,shape_wkt,POINT_X,POINT_Y,ZIP_CODE,FULL_STREET_ADDRESS_right,ST_NAME_NEW,FULL_STREET_NAME_right
0,502550008,502550000.0,5.025500e+08,87,BEACON,ST,2-F,2108.0,102.0,CD,...,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,
1,502550010,502550000.0,5.025500e+08,87,BEACON,ST,2-R,2108.0,102.0,CD,...,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,
2,502550012,502550000.0,5.025500e+08,87,BEACON,ST,3-F,2108.0,102.0,CD,...,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,
3,502550014,502550000.0,5.025500e+08,87,BEACON,ST,3-R,2108.0,102.0,CD,...,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,
4,502550016,502550000.0,5.025500e+08,87,BEACON,ST,4,2108.0,102.0,CD,...,0502550000,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1055419,2205550632,2205550001.0,2.205550e+09,,LAKE SHORE TE,,6-4,,,CD,...,,,,,-71.170468,42.345856,2135.0,14 lake shore ct,lake shore,lake shore ct
1055420,2205589002,2205589000.0,2.205589e+09,,UNDINE ST,,1,,,CD,...,,,,,-71.166046,42.342942,2135.0,100-98 undine rd,undine,undine rd
1055421,2205589004,2205589000.0,2.205589e+09,,UNDINE ST,,2,,,CD,...,,,,,-71.166046,42.342942,2135.0,100-98 undine rd,undine,undine rd
1055422,2205669000,,2.205669e+09,,COMMONWEALTH AV,,,,,C,...,,,,,-71.072054,42.353996,2135.0,1 commonwealth ave,commonwealth,commonwealth ave


# Plot onto shapefile

In [86]:
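# POINT_X / POINT_Y are longitude / latitude in degrees, so the points are in
# EPSG:4326 (WGS 84), the same CRS used for these coordinates in the district check below.
geometry = [Point(xy) for xy in zip(df_property_with_coord['POINT_X'], df_property_with_coord['POINT_Y'])]
gdf = gpd.GeoDataFrame(df_property_with_coord, geometry=geometry, crs="EPSG:4326")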
gdf

Unnamed: 0,PID,CM_ID,GIS_ID,ST_NUM,ST_NAME,ST_NAME_SUF,UNIT_NUM,ZIPCODE,PTYPE,LU,...,created_date,last_edited_date,shape_wkt,POINT_X,POINT_Y,ZIP_CODE,FULL_STREET_ADDRESS_right,ST_NAME_NEW,FULL_STREET_NAME_right,geometry
0,502550008,502550000.0,5.025500e+08,87,BEACON,ST,2-F,2108.0,102.0,CD,...,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,,POINT (-71.072 42.356)
1,502550010,502550000.0,5.025500e+08,87,BEACON,ST,2-R,2108.0,102.0,CD,...,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,,POINT (-71.072 42.356)
2,502550012,502550000.0,5.025500e+08,87,BEACON,ST,3-F,2108.0,102.0,CD,...,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,,POINT (-71.072 42.356)
3,502550014,502550000.0,5.025500e+08,87,BEACON,ST,3-R,2108.0,102.0,CD,...,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,,POINT (-71.072 42.356)
4,502550016,502550000.0,5.025500e+08,87,BEACON,ST,4,2108.0,102.0,CD,...,9/25/2009 10:14:59,1/27/2022 10:44:10,POINT (-71.071689999999933 42.355910000000051),-71.071690,42.355910,,,,,POINT (-71.072 42.356)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1055419,2205550632,2205550001.0,2.205550e+09,,LAKE SHORE TE,,6-4,,,CD,...,,,,-71.170468,42.345856,2135.0,14 lake shore ct,lake shore,lake shore ct,POINT (-71.17 42.346)
1055420,2205589002,2205589000.0,2.205589e+09,,UNDINE ST,,1,,,CD,...,,,,-71.166046,42.342942,2135.0,100-98 undine rd,undine,undine rd,POINT (-71.166 42.343)
1055421,2205589004,2205589000.0,2.205589e+09,,UNDINE ST,,2,,,CD,...,,,,-71.166046,42.342942,2135.0,100-98 undine rd,undine,undine rd,POINT (-71.166 42.343)
1055422,2205669000,,2.205669e+09,,COMMONWEALTH AV,,,,,C,...,,,,-71.072054,42.353996,2135.0,1 commonwealth ave,commonwealth,commonwealth ave,POINT (-71.072 42.354)


In [2791]:
district_shapefile = gpd.read_file("../data/City-Council-District")

count = 0
is_D7_addresses = []

for row in df_property_with_coord.itertuples(index=True, name="Row"):
    address_point = Point(row.POINT_X, row.POINT_Y)
    address_gdf = gpd.GeoDataFrame(geometry=[address_point], crs="EPSG:4326")
    address_gdf = address_gdf.to_crs(district_shapefile.crs)
    result = gpd.sjoin(address_gdf, district_shapefile, how="left", predicate="intersects")

    if result['DISTRICT'].values[0] == 7:
        count += 1
        is_D7_addresses.append(True)
    else:
        is_D7_addresses.append(False)
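
The loop above builds a one-row GeoDataFrame and runs a spatial join for every property, which can be slow at this row count. A single spatial join over all points should give the same flags. The sketch below is a minimal illustration on synthetic square "districts"; the helper name `flag_district` and the toy polygons are assumptions, while the `DISTRICT`, `POINT_X`, and `POINT_Y` column names follow the notebook.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box


def flag_district(df, districts, district_id=7, x_col="POINT_X", y_col="POINT_Y"):
    """Return a boolean Series: True where the row's point falls in `district_id`."""
    points = gpd.GeoDataFrame(
        df[[x_col, y_col]].copy(),
        geometry=gpd.points_from_xy(df[x_col], df[y_col]),
        crs="EPSG:4326",  # longitude / latitude
    ).to_crs(districts.crs)

    # One join for all rows; rows outside every polygon get NaN for DISTRICT.
    joined = gpd.sjoin(
        points, districts[["DISTRICT", "geometry"]], how="left", predicate="intersects"
    )
    # A point on a shared boundary can match two polygons; keep the first match,
    # mirroring `values[0]` in the loop.
    joined = joined[~joined.index.duplicated(keep="first")]
    return joined["DISTRICT"].eq(district_id).reindex(df.index)


# Synthetic data: two adjacent unit squares standing in for council districts.
toy_districts = gpd.GeoDataFrame(
    {"DISTRICT": [7, 8]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:4326",
)
toy_points = pd.DataFrame({"POINT_X": [0.5, 1.5, 5.0], "POINT_Y": [0.5, 0.5, 5.0]})
toy_flags = flag_district(toy_points, toy_districts)
```

With the notebook's data, the equivalent call would be along the lines of `df_property_with_coord["is_d7"] = flag_district(df_property_with_coord, district_shapefile)`, and the count is then `df_property_with_coord["is_d7"].sum()`.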

In [None]:
print(count)

In [None]:
df_property_with_coord['is_d7'] = is_D7_addresses
df_property_with_coord

In [2794]:
df_property_with_coord_and_d7 = df_property_with_coord[list(df_property_columns) + ["POINT_X", "POINT_Y", "is_d7"]].copy()
df_property_with_coord_and_d7["ZIPCODE"] = df_property_with_coord_and_d7["ZIPCODE"].apply(
    lambda x: f"{int(x):05}" if not pd.isna(x) else np.nan
)

In [None]:
df_property_with_coord_and_d7

In [None]:
df_property_with_coord_and_d7[df_property_with_coord_and_d7["is_d7"] == True].shape

To reduce errors from matching on street body alone (without street number or suffix), we set `is_d7` to `False` for any row currently marked `True` whose ZIP code is not among the ZIP codes that intersect District 7.

In [2797]:
zip_shapefile = gpd.read_file("../data/ZIP_Codes")

In [None]:
print(zip_shapefile.crs)

In [2799]:
if zip_shapefile.crs != district_shapefile.crs:
    zip_shapefile = zip_shapefile.to_crs(district_shapefile.crs)
district_7 = district_shapefile[district_shapefile["DISTRICT"] == 7]
zip_in_district_7 = gpd.sjoin(zip_shapefile, district_7, how="inner", predicate="intersects")
unique_zip_codes = zip_in_district_7["ZIP5"].unique()

In [None]:
unique_zip_codes

In [2801]:
# Clear the flag for rows marked as District 7 whose ZIP code does not intersect the district.
df_property_with_coord_and_d7.loc[
    df_property_with_coord_and_d7["is_d7"]
    & ~df_property_with_coord_and_d7["ZIPCODE"].isin(unique_zip_codes),
    "is_d7"
] = False
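
One detail of the filter above is worth making explicit: a row whose `ZIPCODE` is missing is never found in `unique_zip_codes`, so it appears that its `is_d7` flag is also cleared, even when its coordinates fall inside the district. The sketch below uses toy data to contrast that strict behaviour with a more lenient variant that keeps the flag when the ZIP code is missing. The ZIP codes listed in `d7_zip_codes` are placeholders, not the actual result of the spatial join.

```python
import numpy as np
import pandas as pd

# Toy rows: ZIP inside the district list, ZIP outside it, and ZIP missing.
toy = pd.DataFrame({
    "ZIPCODE": ["02119", "02135", np.nan],
    "is_d7": [True, True, True],
})
d7_zip_codes = ["02119", "02121"]  # placeholder values for illustration only

zip_in_d7 = toy["ZIPCODE"].isin(d7_zip_codes)  # NaN is treated as "not in the list"
zip_missing = toy["ZIPCODE"].isna()

# Strict: a row needs a known District 7 ZIP code to stay flagged.
strict_flags = toy["is_d7"] & zip_in_d7

# Lenient: rows without a ZIP code keep whatever the spatial join decided.
lenient_flags = toy["is_d7"] & (zip_in_d7 | zip_missing)
```

Which variant is appropriate depends on how much the street-body matches are trusted for rows that have no ZIP code to cross-check against.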

In [None]:
df_property_with_coord_and_d7[df_property_with_coord_and_d7["is_d7"] == True].shape

In [2803]:
df_property_with_coord_and_d7.to_csv("../data/d7-property-new.csv", index=False)