In [13]:
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

file_path = "NY-House-Dataset.csv"
try:
    df = pd.read_csv(file_path)
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print(f"File '{file_path}' not found.")
    raise  # df is required by every later step, so stop here instead of failing with a NameError

# Haversine formula: great-circle distance in kilometers between two points
# whose latitude and longitude are given in degrees
def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # mean Earth radius in km
    dlat = np.radians(lat2 - lat1)
    dlon = np.radians(lon2 - lon1)
    a = np.sin(dlat / 2) ** 2 + np.cos(np.radians(lat1)) * np.cos(np.radians(lat2)) * np.sin(dlon / 2) ** 2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))  # central angle in radians
    return R * c

iconic_landmark_coords = (40.7580, -73.9855)  # Times Square coordinates

# haversine is built from NumPy ufuncs, so it accepts whole columns; no row-wise apply is needed
df['DISTANCE_TO_ICONIC_LANDMARK'] = haversine(df['LATITUDE'], df['LONGITUDE'], *iconic_landmark_coords)

X = df[['BEDS', 'BATH', 'PROPERTYSQFT', 'LATITUDE', 'LONGITUDE', 'DISTANCE_TO_ICONIC_LANDMARK']]
y = df['PRICE']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)


Dataset loaded successfully.
Shape of X_train: (3840, 6)
Shape of X_test: (961, 6)
Shape of y_train: (3840,)
Shape of y_test: (961,)


For the initial set of features, I have chosen 'BEDS', 'BATH', 'PROPERTYSQFT', 'LATITUDE', 'LONGITUDE', and the engineered 'DISTANCE_TO_ICONIC_LANDMARK' for X. These represent the number of bedrooms, number of bathrooms, property square footage, latitude and longitude of each house, and the haversine distance in kilometers from each house to Times Square, respectively.

- **Number of Bedrooms (BEDS)**: This feature can have a significant impact on the price of a house. Generally, houses with more bedrooms tend to have higher prices.

- **Number of Bathrooms (BATH)**: Similar to the number of bedrooms, the number of bathrooms is an important factor influencing house prices. Houses with more bathrooms are often considered more desirable and thus command higher prices.

- **Property Square Footage (PROPERTYSQFT)**: The size of the property is a crucial determinant of its price. Larger properties typically have higher prices compared to smaller ones.

- **Latitude and Longitude (LATITUDE, LONGITUDE)**: Geographic location plays a crucial role in determining house prices. These coordinates can capture the spatial aspect of the dataset, allowing the model to learn spatial patterns and variations in house prices across different areas.

For the target feature y, I have chosen 'PRICE', which represents the price of each house. The goal of the predictive model is to estimate house prices based on the selected features.
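
The distance feature is easier to trust after a quick sanity check of the haversine formula. The sketch below is a scalar re-implementation that uses only the standard library; the name `haversine_km` and the test points are illustrative and not part of the notebook. It relies on the fact that one degree of latitude spans roughly 111 km for an Earth radius of 6371 km.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) * math.sin(dlon / 2) ** 2)
    return 2 * radius_km * math.atan2(math.sqrt(a), math.sqrt(1 - a))

times_square = (40.7580, -73.9855)  # same coordinates as in the notebook

# A point compared with itself should be at distance zero
same_point_km = haversine_km(*times_square, *times_square)

# Moving one degree north should give about 6371 * pi / 180 km
one_degree_north_km = haversine_km(times_square[0], times_square[1],
                                   times_square[0] + 1.0, times_square[1])

print(f"same point: {same_point_km:.3f} km")
print(f"one degree north: {one_degree_north_km:.2f} km")
```

If both checks look right, distances of a few kilometers to a few tens of kilometers are what one would expect for listings around New York City.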

In [15]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

linear_reg = LinearRegression()

linear_reg.fit(X_train, y_train)

y_train_pred = linear_reg.predict(X_train)

# Taking the square root of the MSE gives RMSE; recent scikit-learn releases deprecate squared=False
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
train_mae = mean_absolute_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)

print("Training RMSE:", train_rmse)
print("Training MAE:", train_mae)
print("Training R^2 Score:", train_r2)

Training RMSE: 34695582.59705792
Training MAE: 2796614.264018483
Training R^2 Score: 0.015375262370970977


The training evaluation metrics for the linear regression model are as follows:
- **Training RMSE**: 34,695,582.60
- **Training MAE**: 2,796,614.26
- **Training R^2 Score**: 0.015

These metrics provide insights into how well the linear regression model fits the training data. However, the results indicate that the model's performance on the training set is not satisfactory.

1. **RMSE (Root Mean Squared Error)**: This metric is the square root of the average squared difference between the predicted and actual values, expressed in the same units as the price. Because errors are squared before averaging, large errors are penalized more heavily than small ones. In this case, the RMSE is very high (about 34.7 million), suggesting that at least some of the model's predictions deviate enormously from the actual house prices in the training set.

2. **MAE (Mean Absolute Error)**: MAE represents the average absolute difference between the predicted and actual values. Similar to RMSE, a higher MAE indicates larger errors in predictions. The MAE value is also quite high, indicating substantial errors in the model's predictions.

3. **R^2 Score (Coefficient of Determination)**: The R^2 score measures the proportion of the variance in the target variable that is predictable from the features. A score closer to 1 indicates a better fit of the model to the data. The R^2 score here is very low (0.015), meaning the model explains only about 1.5% of the variance in the target variable and captures almost none of the relationship between the features and house prices.

Overall, based on these results, it appears that the linear regression model is not able to effectively capture the underlying patterns in the training data, and its predictive performance is inadequate. Further analysis and potentially more sophisticated modeling techniques may be necessary to improve the model's performance.
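
To make the three metrics concrete, here is a small hand computation on made-up numbers (the prices below are invented for illustration and do not come from the dataset). It also hints at why RMSE can be far larger than MAE: replacing one prediction with a badly wrong one inflates RMSE much more than MAE.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (rmse, mae, r2) computed directly from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e ** 2 for e in errors)                 # sum of squared residuals
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total squared deviation from the mean
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    r2 = 1 - ss_res / ss_tot
    return rmse, mae, r2

# Invented prices (in thousands) and predictions with modest errors
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 330, 380]
rmse, mae, r2 = regression_metrics(y_true, y_pred)

# Same data, but the last prediction is badly wrong
y_pred_outlier = [110, 190, 330, 100]
rmse_out, mae_out, r2_out = regression_metrics(y_true, y_pred_outlier)

print(f"modest errors: RMSE={rmse:.2f}, MAE={mae:.2f}, R^2={r2:.3f}")
print(f"one outlier:   RMSE={rmse_out:.2f}, MAE={mae_out:.2f}, R^2={r2_out:.3f}")
```

Note that R^2 can become negative, as in the second case, when the predictions fit worse than simply predicting the mean of the target.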

In [16]:
from sklearn.ensemble import RandomForestRegressor

random_forest = RandomForestRegressor(random_state=42)

random_forest.fit(X_train, y_train)

y_train_pred_rf = random_forest.predict(X_train)

# Taking the square root of the MSE gives RMSE; recent scikit-learn releases deprecate squared=False
train_rmse_rf = np.sqrt(mean_squared_error(y_train, y_train_pred_rf))
train_mae_rf = mean_absolute_error(y_train, y_train_pred_rf)
train_r2_rf = r2_score(y_train, y_train_pred_rf)

print("Random Forest Model - Training RMSE:", train_rmse_rf)
print("Random Forest Model - Training MAE:", train_mae_rf)
print("Random Forest Model - Training R^2 Score:", train_r2_rf)

Random Forest Model - Training RMSE: 13248201.900374904
Random Forest Model - Training MAE: 596846.2013254217
Random Forest Model - Training R^2 Score: 0.8564390743052879


In [19]:
y_test_pred_rf = random_forest.predict(X_test)

# Taking the square root of the MSE gives RMSE; recent scikit-learn releases deprecate squared=False
test_rmse_rf = np.sqrt(mean_squared_error(y_test, y_test_pred_rf))
test_mae_rf = mean_absolute_error(y_test, y_test_pred_rf)
test_r2_rf = r2_score(y_test, y_test_pred_rf)

print("Random Forest Model - Test RMSE:", test_rmse_rf)
print("Random Forest Model - Test MAE:", test_mae_rf)
print("Random Forest Model - Test R^2 Score:", test_r2_rf)


Random Forest Model - Test RMSE: 3334708.8642089046
Random Forest Model - Test MAE: 814564.6523938109
Random Forest Model - Test R^2 Score: 0.5586284452473593


**Markdown: Look at the parameters you found and discuss what you have learned.**

The Random Forest model was trained and evaluated with the following results:

- **Training RMSE**: 13,248,201.90
- **Training MAE**: 596,846.20
- **Training R^2 Score**: 0.856

- **Test RMSE**: 3,334,708.86
- **Test MAE**: 814,564.65
- **Test R^2 Score**: 0.559

**Training Evaluation:**
- The training RMSE is about 13.25 million, far lower than the linear regression's 34.70 million but still large in absolute terms. Because RMSE squares each error before averaging, a few badly mispredicted listings can dominate it.
- The training MAE is about 596,846, roughly one-fifth of the linear regression's 2.80 million, so the typical absolute error is much smaller than the RMSE alone would suggest.
- The R^2 score of 0.856 suggests that the model explains approximately 85.6% of the variance in the training data.

**Test Evaluation:**
- The test results are mixed rather than uniformly worse: MAE and R^2 deteriorate, while RMSE is lower than on the training set.
- The test RMSE is approximately 3.33 million, about a quarter of the training RMSE (13.25 million). Since RMSE is dominated by the largest errors, this plausibly reflects the most extreme-priced listings falling in the training split rather than better predictions on unseen data.
- The test MAE is around 814,565, about 36% higher than the training MAE (596,846), so the typical absolute error grows on unseen data.
- The R^2 score on the test set is 0.559, indicating that the model explains approximately 55.9% of the variance in the test data.

**Discussion:**
- The model fits the training set far better than linear regression did: R^2 rises from 0.015 to 0.856 and MAE falls from about 2.80 million to about 0.60 million.
- There is clear evidence of overfitting: R^2 drops from 0.856 on the training set to 0.559 on the test set, and MAE rises from 596,846 to 814,565. The lower test RMSE does not contradict this, because RMSE is sensitive to a handful of extreme prices and the two splits need not contain equally extreme listings.
- Further tuning of hyperparameters or exploring different algorithms may help improve the model's performance on the test set and mitigate overfitting. Additionally, collecting more data or refining features could also enhance the model's predictive capability.
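
One rough diagnostic that follows from the reported numbers is the ratio of RMSE to MAE. The two metrics are equal when every error has the same magnitude, and the ratio grows as a few errors become much larger than the rest. The sketch below simply plugs in the Random Forest metrics printed earlier; the helper name `rmse_to_mae_ratio` is illustrative.

```python
def rmse_to_mae_ratio(rmse, mae):
    """Ratio >= 1; larger values mean a few errors dominate the squared-error total."""
    return rmse / mae

# Metrics copied from the Random Forest outputs in this notebook
train_rmse_rf, train_mae_rf = 13248201.900374904, 596846.2013254217
test_rmse_rf, test_mae_rf = 3334708.8642089046, 814564.6523938109

train_ratio = rmse_to_mae_ratio(train_rmse_rf, train_mae_rf)
test_ratio = rmse_to_mae_ratio(test_rmse_rf, test_mae_rf)

print(f"train RMSE/MAE: {train_ratio:.1f}")
print(f"test RMSE/MAE:  {test_ratio:.1f}")
```

A much larger ratio on the training split than on the test split would be consistent with the most extreme errors sitting in the training data, which is one possible reason the training RMSE exceeds the test RMSE here.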