In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Load and split data
df = pd.read_csv("cleaned_data_with_transaction_year.csv")
X = df.drop("price_per_unit_area", axis=1)
y = df["price_per_unit_area"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
rf_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
rf_r2 = r2_score(y_test, y_pred)

print("Random Forest RMSE:", rf_rmse)
print("Random Forest R²:", rf_r2)


Random Forest RMSE: 5.687590539891411
Random Forest R²: 0.8071725356880382


**Why Random Forest?**

Random Forest is an ensemble learning method that trains many decision trees, each on a bootstrap sample of the training data (and, optionally, a random subset of features at each split), and averages their predictions. Averaging reduces the variance of any single tree, so the forest usually overfits less and generalizes better than one deep decision tree.
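
The averaging step can be checked directly. The following is a minimal sketch on synthetic data (the `make_regression` dataset and the `*_demo` names are assumptions for illustration, not part of the housing analysis): it fits a small forest and compares its prediction with the mean of the individual trees' predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data stands in for the housing features (illustrative assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted tree is exposed in forest.estimators_; collect their predictions
# for the first five samples in an array of shape (n_trees, n_samples).
per_tree = np.stack([tree.predict(X_demo[:5]) for tree in forest.estimators_])

forest_pred = forest.predict(X_demo[:5])
mean_of_trees = per_tree.mean(axis=0)

print("Forest prediction equals tree average:", np.allclose(forest_pred, mean_of_trees))
print("Std of tree predictions for the first sample:", per_tree[:, 0].std())
```

The spread across trees gives a rough sense of how much the individual trees disagree; the forest's output is the average of those predictions.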

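**Performance Comparison**

- Model used: Random Forest Regressor
- RMSE: 5.69
- R² score: 0.81

**Interpretation:**

- An R² of 0.81 means the model explains about 81% of the variance in the target variable (price_per_unit_area) on the held-out test set.
- The RMSE of 5.69 is expressed in the same units as price_per_unit_area, so predictions are off by roughly 5.7 units in the root-mean-square sense. Whether that counts as low depends on the scale of the target, so it is best judged against the spread of the actual prices.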
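
To make the two metrics concrete, here is a small sketch that computes them by hand. The numbers are made up for illustration and are not taken from the housing data, and the helper names `rmse` and `r2` are hypothetical. It also scores a baseline that always predicts the mean, which is a useful reference point when judging whether an RMSE is low.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Share of target variance explained: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up values for illustration only.
y_true_demo = np.array([10.0, 20.0, 30.0, 40.0])
y_pred_demo = np.array([12.0, 18.0, 33.0, 37.0])

# Baseline: always predict the mean of the observed values.
baseline_demo = np.full_like(y_true_demo, y_true_demo.mean())

print("Model    RMSE:", rmse(y_true_demo, y_pred_demo), "R²:", r2(y_true_demo, y_pred_demo))
print("Baseline RMSE:", rmse(y_true_demo, baseline_demo), "R²:", r2(y_true_demo, baseline_demo))
```

The mean baseline has an R² of 0 by construction, and its RMSE equals the standard deviation of the target, so a model's RMSE is only meaningful relative to that spread.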



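**Comparison with Previous Models:**

Compared to simpler models such as linear regression, Random Forest often performs better because:

- It captures nonlinear relationships and interactions between features.
- It tends to be less sensitive to outliers and noise, because splits are threshold-based and the errors of individual trees are averaged out.
- Averaging many trees lowers variance, which usually makes its predictions more stable than those of a single model.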
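
The first point can be illustrated with a sketch on synthetic data. Everything here is an assumption for demonstration (a sine-shaped target, the `*_syn` names, the noise level); it does not reproduce the earlier models from this notebook. A linear model and a Random Forest are trained on the same split and scored with R².

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic nonlinear relationship: y = sin(2x) + small Gaussian noise.
rng = np.random.default_rng(0)
X_syn = rng.uniform(-3, 3, size=(500, 1))
y_syn = np.sin(2 * X_syn[:, 0]) + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

candidates = [
    ("Linear Regression", LinearRegression()),
    ("Random Forest", RandomForestRegressor(random_state=42)),
]

# Fit each model on the same training split and score it on the same test split.
scores = {}
for name, estimator in candidates:
    estimator.fit(X_tr, y_tr)
    scores[name] = r2_score(y_te, estimator.predict(X_te))
    print(f"{name} R²: {scores[name]:.3f}")
```

On data like this a straight line cannot follow the curve, so the forest scores much higher. On data that really is close to linear, the gap can shrink or reverse, which is why the comparison should be run on the actual dataset with the same train/test split.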

**Conclusion:**
The Random Forest model performed strongly in predicting housing prices per unit area, reaching an R² of 0.81 and an RMSE of 5.69 on the test set. It outperformed the earlier approaches, which suggests that ensemble methods can improve prediction accuracy on real estate datasets.