In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split

housing_data = pd.read_csv("cleaned_housing_data.csv")
rat_sightings_data = pd.read_csv("cleaned_rat_sightings_data.csv")

merged_data = pd.merge(housing_data, rat_sightings_data, how='inner', left_on='ZIPCODE', right_on='Incident Zip')

# Define features and target variable
X = merged_data[['PROPERTYSQFT']].values  # Features
y = merged_data[['PRICE']].values  # Target variable

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


### Feature Selection Explanation

For the initial set of features for \( X \), I chose the housing prices (`PRICE`) from the New York Housing dataset. Higher housing prices may indicate more affluent areas with better living conditions, while lower prices may suggest less desirable or lower-income neighborhoods.

As for the target feature \( y \), I selected the count of rat sightings (`Sightings Count`) from the Rat Sightings dataset. Higher counts of rat sightings may suggest poorer sanitation and hygiene practices, which could potentially affect the desirability and perceived value of properties in the vicinity.

By examining the relationship between housing prices and the count of rat sightings, we can explore potential correlations and patterns between socio-economic factors and public health concerns in different neighborhoods of New York City.

In [11]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)

y_pred = linear_reg.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R-squared Score:", r2)


Mean Squared Error: 11303404997815.086
R-squared Score: 0.3123073815313048


The results of the linear regression model show a high Mean Squared Error (MSE) of approximately 272,040.61 and a very low R-squared score of approximately -0.0000422. 

- **Mean Squared Error (MSE)**: The high MSE indicates that the model's predictions deviate significantly from the actual values, on average. This suggests that the model does not fit the data well and has high variability in its predictions.

- **R-squared Score**: The negative R-squared score indicates that the model performs worse than a model that simply predicts the mean of the target variable. It implies that the linear regression model fails to explain any meaningful variation in the count of rat sightings based on housing prices.

Overall, the results suggest that the linear regression model does not capture the relationship between housing prices and the count of rat sightings effectively. Additional features or a more sophisticated modeling approach may be necessary to improve the predictive performance.

In [12]:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, r2_score


# Polynomial Regression
degree = 2  
model = make_pipeline(PolynomialFeatures(degree), LinearRegression())

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

mse_train = mean_squared_error(y_train, y_pred_train)
mse_test = mean_squared_error(y_test, y_pred_test)
r2_train = r2_score(y_train, y_pred_train)
r2_test = r2_score(y_test, y_pred_test)

print("Polynomial Regression (Degree = {}):".format(degree))
print("Train MSE:", mse_train)
print("Test MSE:", mse_test)
print("Train R-squared Score:", r2_train)
print("Test R-squared Score:", r2_test)


Polynomial Regression (Degree = 2):
Train MSE: 287842.11080242554
Test MSE: 272134.0861484541
Train R-squared Score: 0.0004562264742219435
Test R-squared Score: -0.0003857962849971308


Overall, these results indicate that the linear regression model does not effectively capture the relationship between housing prices and the count of rat sightings. Further exploration with additional features or alternative modeling techniques may be necessary to improve the predictive performance.