# Investigating Factors of Rising House Values in New York City: Step 3: Machine Learning

Team Members: Francisco Brady (fbrady), Zhonghan Xie (jonasxie), Michael Garner (mngarner)  
Date: 2024-12-02

## Introduction

In this section, we apply clustering techniques to the combined New York City dataset of housing prices, education, evictions, and demographics in order to identify patterns and relationships among these measures.

In [17]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder, scale

# Dimensionality reduction
from sklearn.decomposition import PCA
from sklearn.manifold import MDS

# Clustering and evaluation
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm

In [2]:
# Import merged neighborhood-level dataset
allData = pd.read_csv('./data/analytic_dataset.csv')

## Preprocessing
- Data cleaning and handling of issues: 20 points
- Feature scaling and normalization; one-hot encoding/categorical variable handling: 20 points
- Dimensionality reduction techniques: 20 points
- Preprocessing justification and impact analysis: 20 points

In [15]:
# Streamline dataset to keep columns of interest for this analysis
keepCols = ['year', 'nta_name', 'borough', 'pct_attendance',
       'pct_chronically_absent', 'EVICTION_COUNT', 'GEOID',
       'median_sale_price', 'hvi', 'total_population',
       'median_income', 'white_pct', 'black_pct',
       'american_indian_alaska_native_pct', 'asian_pct',
       'hawaiian_pacific_islander_pct', 'multiple_race_pct', 'other_race_pct']
# Take an explicit copy so later edits do not raise SettingWithCopyWarning
data = allData[keepCols].copy()

# Rename column to match the snake_case format of the other columns
data = data.rename(columns={'EVICTION_COUNT': 'num_evictions'})

# Drop rows with missing values
data = data.dropna()
data.sample(10)


Unnamed: 0,year,nta_name,borough,pct_attendance,pct_chronically_absent,num_evictions,GEOID,median_sale_price,hvi,total_population,median_income,white_pct,black_pct,american_indian_alaska_native_pct,asian_pct,hawaiian_pacific_islander_pct,multiple_race_pct,other_race_pct
44545,2020.0,South Ozone Park,Queens,91.928398,28.483606,1.0,36081010600,565000.0,598014.6374,3764.0,20700.0,9.511,9.777,0.0,47.21,0.0,19.129,23.937
51650,2022.0,Jamaica,Queens,96.721779,6.923077,3.0,36081044602,882000.0,604426.160646,5016.0,19383.0,8.393,19.896,0.478,52.652,0.857,8.533,13.457
2388,2022.0,Bushwick (East),Brooklyn,86.639107,54.749569,1.0,36047043900,1280000.0,877179.605395,3424.0,27616.0,22.225,26.811,1.636,8.002,0.0,15.421,33.616
16942,2020.0,Clinton Hill,Brooklyn,92.519287,28.26087,1.0,36047019700,2950000.0,909536.48701,3669.0,84316.0,57.999,29.736,0.0,4.688,0.0,8.776,3.189
34615,2019.0,Bellerose,Queens,95.844391,8.823529,4.0,36081161700,680000.0,603056.00316,4610.0,49548.0,42.017,16.009,1.041,25.38,0.0,6.941,12.082
16719,2020.0,East New York (North),Brooklyn,90.519112,38.103161,9.0,36047115200,732500.0,536893.640202,3049.0,23621.0,12.266,53.099,0.0,1.935,0.0,26.238,19.58
54912,2019.0,Mott Haven-Port Morris,Bronx,89.454346,42.131149,19.0,36005002500,715000.0,551343.572632,5694.0,15423.0,18.458,29.259,0.228,0.105,0.773,11.872,45.241
60105,2020.0,Melrose,Bronx,91.340584,29.25926,2.0,36005007500,557500.0,357176.773464,6345.0,22083.0,18.156,35.697,1.513,0.709,0.0,20.331,33.759
25217,2022.0,Bedford-Stuyvesant (East),Brooklyn,87.396118,51.931332,5.0,36047038500,1650000.0,986130.726784,4324.0,47668.0,12.882,63.298,0.671,4.648,0.0,30.065,3.469
64411,2020.0,Williamsbridge-Olinville,Bronx,82.136734,53.5,3.0,36005040400,625276.0,524673.078599,3514.0,43088.0,7.968,80.734,1.394,0.0,0.0,0.057,9.875


In [21]:
# Add column flagging rows where non-white residents make up more than half of the population
data['minority_dominated'] = np.where((100 - data['white_pct']) > 50, "Yes", "No")
data

Unnamed: 0,year,nta_name,borough,pct_attendance,pct_chronically_absent,num_evictions,GEOID,median_sale_price,hvi,total_population,median_income,white_pct,black_pct,american_indian_alaska_native_pct,asian_pct,hawaiian_pacific_islander_pct,multiple_race_pct,other_race_pct,minority_dominated
3,2019.0,Sunset Park (West),Brooklyn,94.443169,17.731630,9.0,36047007400,1217500.0,643949.214602,5328.0,23235.0,33.296,0.638,1.126,30.593,0.0,9.685,29.505,Yes
4,2019.0,Sunset Park (West),Brooklyn,94.443169,17.731630,14.0,36047007600,1217500.0,643949.214602,5431.0,24473.0,41.300,3.701,0.110,19.923,0.0,29.571,20.180,Yes
5,2019.0,Sunset Park (West),Brooklyn,94.443169,17.731630,1.0,36047007800,1217500.0,643949.214602,4873.0,26467.0,51.365,2.298,0.882,8.496,0.0,6.321,33.798,No
6,2019.0,Sunset Park (West),Brooklyn,94.443169,17.731630,3.0,36047008000,1217500.0,643949.214602,3627.0,30063.0,59.112,0.744,0.000,5.735,0.0,14.613,27.102,No
7,2019.0,Sunset Park (West),Brooklyn,94.443169,17.731630,2.0,36047008200,1217500.0,643949.214602,3627.0,35182.0,32.506,11.938,0.000,13.372,0.0,25.145,29.611,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65448,2022.0,Mott Haven-Port Morris,Bronx,56.050163,88.120949,5.0,36005005100,820000.0,685357.770663,5858.0,15515.0,15.961,32.127,0.000,0.666,0.0,7.614,47.439,Yes
65449,2022.0,Mott Haven-Port Morris,Bronx,56.050163,88.120949,2.0,36005003500,820000.0,685357.770663,3738.0,20773.0,15.008,43.526,2.461,0.000,0.0,15.195,31.407,Yes
65450,2022.0,Mott Haven-Port Morris,Bronx,56.050163,88.120949,1.0,36005004100,820000.0,685357.770663,5865.0,16763.0,18.227,40.341,2.728,0.597,0.0,25.814,25.200,Yes
65451,2022.0,Mott Haven-Port Morris,Bronx,56.050163,88.120949,9.0,36005004300,820000.0,685357.770663,5421.0,17007.0,21.878,31.396,0.221,1.623,0.0,29.404,30.179,Yes


In [None]:
# Set aside the census tract identifier (GEOID): it identifies rows but is not a clustering feature
y = data['GEOID']
X = data.drop(['GEOID'], axis=1)

Index(['year', 'nta_name', 'borough', 'pct_attendance',
       'pct_chronically_absent', 'num_evictions', 'GEOID', 'median_sale_price',
       'hvi', 'total_population', 'median_income', 'white_pct', 'black_pct',
       'american_indian_alaska_native_pct', 'asian_pct',
       'hawaiian_pacific_islander_pct', 'multiple_race_pct', 'other_race_pct',
       'minority_dominated'],
      dtype='object')

In [None]:
# Build preprocessing pipeline column transformer
catCols = ['year', 'nta_name', 'borough', 'minority_dominated']
# Take the numeric columns from X rather than data so that GEOID is excluded
numCols = X.columns.drop(catCols)

# Numerical variable pipeline
numPipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical variable pipeline: ordinally encode the year, one-hot encode all other categorical variables


# Define column transformer


## Clustering Analysis
- Clear definition and justification of goals: 10 points
- Implementation of chosen techniques: 20 points
- Parameter optimization and method tuning: 20 points
- Performance evaluation and metric selection: 20 points
- Interpretation and discussion of results: 30 points

In [None]:
# Agglomerative Clustering
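
One way this cell might be filled in is sketched below. It assumes synthetic data from `make_blobs` in place of the preprocessed feature matrix, and the choices of three clusters and Ward linkage are illustrative assumptions, not results from this analysis.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for the preprocessed feature matrix
demo_features, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.8,
                              random_state=0)

# Ward linkage merges the pair of clusters that least increases
# within-cluster variance at each step
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
agg_labels = agg.fit_predict(demo_features)

# Silhouette score lies in [-1, 1]; higher means better-separated clusters
agg_score = silhouette_score(demo_features, agg_labels)

print(np.bincount(agg_labels), round(agg_score, 3))
```

Agglomerative clustering without a connectivity constraint needs memory that grows roughly with the square of the number of rows, so on a dataset with tens of thousands of rows it may be necessary to cluster a subsample or aggregate to the neighborhood level first.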

In [None]:
# K-means Clustering
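
A possible starting point for this cell, including a simple way to tune the number of clusters, is sketched below. It again assumes synthetic `make_blobs` data in place of the preprocessed features; the candidate range of 2 to 6 clusters and the names `scores` and `best_k` are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for the preprocessed feature matrix
demo_features, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.7,
                              random_state=42)

# Fit K-means for each candidate k and record the mean silhouette score
scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(demo_features)
    scores[k] = silhouette_score(demo_features, labels)

# Pick the k with the highest mean silhouette score
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(k, round(score, 3))
print('best k:', best_k)
```

The silhouette score is only one criterion; an elbow plot of `km.inertia_` against k, or per-sample values from `silhouette_samples`, could be used alongside it before settling on a final number of clusters.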