# Predicting Redevelopment Potential for Boston Parcels

**Authors:** Milo Margolis, Dhruv Rokkam

**Problem Statement:** This project develops a binary classification model to predict the redevelopment potential of Boston properties using parcel data. Properties are labeled as high potential based on indicators such as low building-to-land value ratios and underutilized FAR (floor area ratio). They are then classified using logistic regression, KNN, and decision trees, with training, validation, and hyperparameter tuning designed to prevent overfitting.
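
As a rough illustration of the labeling idea, the sketch below flags a parcel as high potential when its FAR utilization is low and its land value is high relative to its building value. The column names come from the dataset, but the two thresholds, the choice to combine the conditions with AND, and the helper name `label_high_potential` are assumptions made for this example, not the cutoffs used later in the notebook.

```python
import pandas as pd

# Hypothetical thresholds, chosen only for illustration.
FAR_UTILIZATION_MAX = 0.5      # parcel uses less than half of its allowed FAR
LAND_TO_BUILDING_MIN = 1.0     # land is worth more than the building on it


def label_high_potential(parcels: pd.DataFrame) -> pd.Series:
    """Return 1 for parcels meeting both (assumed) criteria, else 0."""
    underbuilt = parcels["far_utilization"] < FAR_UTILIZATION_MAX
    land_heavy = parcels["land_to_building_value_ratio"] > LAND_TO_BUILDING_MIN
    return (underbuilt & land_heavy).astype(int)


# Toy parcels, not real Boston records.
toy = pd.DataFrame({
    "far_utilization": [0.2, 0.9, 0.3],
    "land_to_building_value_ratio": [2.5, 0.4, 0.8],
})
toy["high_potential"] = label_high_potential(toy)
print(toy)
```

Only the first toy parcel satisfies both conditions; the third is underbuilt but its building is worth more than its land.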


### Section 1: Importing Libraries and Setting the Random Seed

In [8]:
# import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                            f1_score, confusion_matrix, roc_curve, auc)
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# set the random seed 
np.random.seed(42)

# set the plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")


### Section 2: Data Cleaning

In [9]:
# Load CSV file
data_path = Path("../data/raw/boston_properties.csv")
df = pd.read_csv(data_path)

# Display basic information about the dataset
print("Dataset Shape:")
print(df.shape)
print("\n" + "="*50)
print("Dataset Info:")
print("="*50)
df.info()


Dataset Shape:
(182393, 24)

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 182393 entries, 0 to 182392
Data columns (total 24 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   property_id                   182393 non-null  object 
 1   address                       182393 non-null  object 
 2   parcel_id                     182393 non-null  int64  
 3   attributes                    182393 non-null  object 
 4   neighborhood                  182393 non-null  object 
 5   created_at                    182393 non-null  object 
 6   updated_at                    182393 non-null  object 
 7   actual_far                    148349 non-null  float64
 8   building_value                182393 non-null  float64
 9   classification_code           182393 non-null  int64  
 10  far_utilization               127498 non-null  float64
 11  fiscal_year                   182393 non-null  int64  
 12  l

In [None]:
# Check missing values
print("Missing Values:")
print("="*50)
missing_values = df.isnull().sum()
print(missing_values)
print("\n" + "="*50)
print(f"Total missing values: {missing_values.sum()}")
print(f"Columns with missing values: {(missing_values > 0).sum()}")


Missing Values:
property_id                         0
address                             0
parcel_id                           0
attributes                          0
neighborhood                        0
created_at                          0
updated_at                          0
actual_far                      34044
building_value                      0
classification_code                 0
far_utilization                 54895
fiscal_year                         0
land_to_building_value_ratio    20442
land_value                          0
living_area                     34070
gross_area                      33754
lot_size                         7371
owner_address                       5
owner_name                          5
permits                             0
total_assessed_value                0
vacant_lot                          0
year_built                      22734
zoning                              0
dtype: int64

Total missing values: 207320
Columns with missing values: 
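
Raw counts are easier to weigh against each other as a share of all rows. The following is a minimal sketch, using a toy frame and a hypothetical helper named `missing_summary`, of how the same `isna().sum()` result could be turned into a per-column percentage table.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values for columns that have any."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only affected columns, largest gap first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data standing in for the parcel table.
toy = pd.DataFrame({
    "lot_size": [1000.0, np.nan, 2500.0, np.nan],
    "year_built": [1900.0, 1950.0, np.nan, 2001.0],
    "zoning": ["R1", "R2", "R1", "B2"],
})
summary = missing_summary(toy)
print(summary)
```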

In [None]:
# Handle missing values
print("Handling missing values...")
print(f"Initial shape: {df.shape}")

# Decision 1: Drop rows with missing owner_address or owner_name (only 5 rows)
# Reason: Very few missing values (0.003% of data), and these are categorical text fields
# that cannot be reliably imputed. Dropping is cleaner than imputing placeholder text.
df = df.dropna(subset=['owner_address', 'owner_name'])
print(f"After dropping rows with missing owner info: {df.shape}")

# Decision 2: Drop rows missing far_utilization or actual_far
# Reason: These columns are critical for creating our target variable (high_potential).
# We cannot reliably impute these values as they are key features for redevelopment potential.
# This removes ~30% of data but ensures data quality for modeling.
df = df.dropna(subset=['far_utilization', 'actual_far'])
print(f"After dropping rows with missing FAR data: {df.shape}")

# Decision 3: Calculate land_to_building_value_ratio where possible
# Reason: When the ratio is missing, it can be derived as land_value / building_value.
# Only calculate when the ratio is missing and both source values are positive,
# which also avoids division by zero.
mask_missing_ratio = df['land_to_building_value_ratio'].isna()
mask_has_values = (df['land_value'] > 0) & (df['building_value'] > 0)
calculated_count = (mask_missing_ratio & mask_has_values).sum()
df.loc[mask_missing_ratio & mask_has_values, 'land_to_building_value_ratio'] = \
    df.loc[mask_missing_ratio & mask_has_values, 'land_value'] / \
    df.loc[mask_missing_ratio & mask_has_values, 'building_value']
print(f"Calculated {calculated_count} missing ratios where possible")

# Decision 4: Impute remaining numeric columns with median
# Reason: Median is robust to outliers and appropriate for continuous variables.
# We impute: living_area, gross_area, year_built, lot_size, land_to_building_value_ratio
numeric_cols_to_impute = ['living_area', 'gross_area', 'year_built', 'lot_size', 
                          'land_to_building_value_ratio']
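for col in numeric_cols_to_impute:
    if df[col].isna().any():
        median_value = df[col].median()
        # Assign the result back rather than calling fillna(inplace=True) on a
        # column selection, which is deprecated and may not update df in newer pandas.
        df[col] = df[col].fillna(median_value)
        print(f"Imputed {col} with median: {median_value:.2f}")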

print(f"\nFinal shape after handling missing values: {df.shape}")
print(f"Remaining missing values: {df.isnull().sum().sum()}")


**Missing Values Summary:**
- Dropped 5 rows with missing owner information (negligible impact, <0.003% of data).
- Dropped rows missing `far_utilization` or `actual_far` (~30% of data) as these are critical for target creation and cannot be reliably imputed.
- Calculated `land_to_building_value_ratio` from source values where possible.
- Imputed remaining numeric columns (`living_area`, `gross_area`, `year_built`, `lot_size`, `land_to_building_value_ratio`) with their column medians, since the median is robust to outliers in continuous variables.
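
The median imputation summarized above uses the full dataset. If the medians should instead reflect only the training rows (an assumption about how the later split is used, not something this section specifies), one option is to compute them on the training split and apply them to both splits. The sketch below does this on toy data; the split size and values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy numeric columns with gaps, standing in for the parcel features.
toy = pd.DataFrame({
    "lot_size": [1000.0, np.nan, 2500.0, 4000.0, np.nan, 1800.0, 3200.0, 2100.0],
    "year_built": [1900.0, 1950.0, np.nan, 2001.0, 1925.0, np.nan, 1988.0, 1910.0],
})

train, test = train_test_split(toy, test_size=0.25, random_state=42)

# Medians come from the training rows only.
train_medians = train.median()

# DataFrame.fillna with a Series fills each column with its matching value.
train_filled = train.fillna(train_medians)
test_filled = test.fillna(train_medians)

print(train_medians)
```

Both splits end up free of missing values, while the test rows never influence the fill values.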


In [None]:
# Check and remove duplicate rows
print("Checking for duplicate rows...")
print(f"Shape before removing duplicates: {df.shape}")

# Check for exact duplicate rows (all columns match)
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

# Decision: Remove duplicate rows if any exist
# Reason: Duplicate rows provide no additional information and can bias the model
# by giving more weight to certain property records. We keep the first occurrence.
if duplicate_count > 0:
    df = df.drop_duplicates()
    print(f"Removed {duplicate_count} duplicate rows")
else:
    print("No duplicate rows found")

print(f"Shape after removing duplicates: {df.shape}")


**Duplicate Removal Summary:**
- Checked for exact duplicate rows across all columns.
- Removed duplicate rows (if any) to prevent data bias in modeling, keeping only the first occurrence of each unique row.
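
An exact-match check compares every column, so it finds nothing if each row carries its own identifier. As a hedged sketch, assuming that `parcel_id` together with `fiscal_year` identifies a record (a hypothetical key, not confirmed by the data dictionary), duplicates could also be checked on that subset:

```python
import pandas as pd

# Toy rows: the first two share a parcel and year but differ in property_id.
toy = pd.DataFrame({
    "property_id": ["a1", "a2", "a3", "a4"],
    "parcel_id": [100, 100, 200, 300],
    "fiscal_year": [2024, 2024, 2024, 2024],
    "lot_size": [1000.0, 1000.0, 2500.0, 1800.0],
})

key_cols = ["parcel_id", "fiscal_year"]  # hypothetical record key

exact_dupes = int(toy.duplicated().sum())
key_dupes = int(toy.duplicated(subset=key_cols).sum())

# Keep the first row for each key, mirroring the keep-first rule above.
deduped = toy.drop_duplicates(subset=key_cols, keep="first")

print(f"Exact duplicates: {exact_dupes}, key duplicates: {key_dupes}")
print(deduped)
```

Whether a repeated key is a true duplicate or a legitimate second record (for example, several units on one parcel) depends on the data, so this check is a prompt for inspection rather than a rule for automatic removal.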
