# Random Forest Regression

In [14]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

In [15]:
# Reading the encoded dataset into pandas
file_path = '../data/dataset_with_encoded_location.zip'
df = pd.read_csv(file_path, compression='zip')
df.head()


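| Unnamed: 0 | bath | balcony | price | House_size | new_total_sqft | L1 | L2 | L3 | L4 | L5 | ... | L7 | L8 | L9 | L10 | L11 | L12 | L13 | L14 | L15 | L16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 3 | 62 | 3 | 1440 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 3 | 1 | 95 | 3 | 1521 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 1 | 51 | 2 | 1200 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | 1 | 63 | 3 | 1310 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2 | 2 | 70 | 3 | 1800 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |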
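
The `Unnamed: 0` column in the preview is most likely a row index that was written out when the file was saved. The sketch below shows how such a column typically arises and how `index_col=0` avoids it. It uses a hypothetical two-row frame held in memory, not the actual dataset file, and it is only an assumption that the dataset was saved this way.

```python
import io
import pandas as pd

# Hypothetical miniature frame; values copied from the first two preview rows.
original = pd.DataFrame({"bath": [2, 3], "balcony": [3, 1], "price": [62, 95]})

# Writing with the default index=True stores the row index as an unnamed first column.
buffer = io.StringIO()
original.to_csv(buffer)
csv_text = buffer.getvalue()

# A plain read turns that unnamed column into a regular column named 'Unnamed: 0'.
with_index_column = pd.read_csv(io.StringIO(csv_text))

# Passing index_col=0 reads it back as the index instead of as a feature.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_index_column.columns))
print(list(clean.columns))
```

If the column really is a saved index, reading the file this way (or dropping the column before building `X`) would keep it out of the feature set; the metrics reported in this notebook were computed with it included.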


## Splitting the data

In [16]:
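# initializing X (features) and y (target)
# note: 'Unnamed: 0' looks like a saved row index; it stays in X here
# because the results below were produced with it included
X = df.drop(columns='price')
y = df['price']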

# splitting the data into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Training the Model

In [17]:
# defining the hyperparameters
param_grid = {
    'n_estimators': [100, 200, 300],  # Number of trees in the forest
    'max_depth': [None, 10, 20],       # Maximum depth of the trees
    'min_samples_split': [2, 5, 10],   # Minimum number of samples required to split a node
    'min_samples_leaf': [1, 2, 4]      # Minimum number of samples required at each leaf node
}

# initializing the random forest regressor
model = RandomForestRegressor(random_state=42)

# initializing the gridsearch
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, scoring='neg_mean_squared_error', cv=5)

# performing the grid search (81 parameter combinations x 5 folds = 405 fits, plus a final refit)
grid_search.fit(X_train, y_train)

# Getting the best parameters and the best estimator
best_params = grid_search.best_params_
best_estimator = grid_search.best_estimator_

print("Best Parameters:", best_params)

# Making predictions using the best estimator
y_pred = best_estimator.predict(X_test)

Best Parameters: {'max_depth': 20, 'min_samples_leaf': 2, 'min_samples_split': 10, 'n_estimators': 300}
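
To get a sense of how much work this search involves, the grid can be enumerated with scikit-learn's `ParameterGrid`. The sketch below copies the grid from the training cell; the fold count of 5 mirrors `cv=5`, and the extra fit assumes the default `refit=True`.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the training cell above
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 3 * 3 * 3 combinations
n_folds = 5                                    # matches cv=5 above
n_fits = n_candidates * n_folds + 1            # +1 for the final refit on all training data

print("Candidates:", n_candidates)
print("Total fits:", n_fits)
```

Because the individual fits are independent, passing `n_jobs=-1` to `GridSearchCV` is one way to spread them across the available CPU cores. Note also that with `scoring='neg_mean_squared_error'`, `grid_search.best_score_` is a negative number; `-grid_search.best_score_` gives the cross-validated MSE of the best candidate.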


## Evaluating the Model's Performance

In [18]:
# Mean Squared Error
mse = mean_squared_error(y_test, y_pred)

# Root Mean Squared Error (square root of the MSE computed above)
rmse = np.sqrt(mse)

# R-Squared
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("RMSE:", rmse)
print("R-squared:", r2)

Mean Squared Error: 8.450318175569452
RMSE: 2.9069430980962547
R-squared: 0.9897097408525412
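
Error values are easier to judge next to a naive baseline. The sketch below shows one possible way to obtain one with `DummyRegressor`, which always predicts the mean of the training target. It runs on synthetic stand-in data with hypothetical names (`X_demo`, `y_demo`), so its numbers say nothing about the housing dataset; on the notebook's data the baseline would be fitted on `X_train`, `y_train` and scored on `X_test`, `y_test`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (hypothetical values)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = 50.0 + 10.0 * X_demo[:, 0] + rng.normal(scale=2.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Baseline that ignores the features and predicts the training mean
baseline = DummyRegressor(strategy='mean').fit(X_tr, y_tr)
baseline_pred = baseline.predict(X_te)

baseline_rmse = np.sqrt(mean_squared_error(y_te, baseline_pred))
baseline_r2 = r2_score(y_te, baseline_pred)

print("Baseline RMSE:", baseline_rmse)
print("Baseline R-squared:", baseline_r2)
```

A constant prediction can never score above zero on R-squared, so the gap between the baseline RMSE and the model's RMSE indicates how much the features contribute.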


## Conclusion

A grid search over 81 hyperparameter combinations, each scored by 5-fold cross-validation on negative mean squared error, selected `{'max_depth': 20, 'min_samples_leaf': 2, 'min_samples_split': 10, 'n_estimators': 300}` as the best configuration for the Random Forest Regressor. Two of these values, `n_estimators=300` and `min_samples_split=10`, are the largest options in the grid, so a wider search might find a slightly better setting.

The test-set metrics support the effectiveness of the model. The `Mean Squared Error (MSE)` is approximately `8.45`, which is the average squared difference between predicted and actual prices. Its square root, the `Root Mean Squared Error (RMSE)`, is approximately `2.91` and is expressed in the same units as `price`, so a typical prediction is off by roughly 2.9 price units; for comparison, the five rows previewed earlier have prices between 51 and 95. The `R-squared` value of approximately `0.99` means the model accounts for about 99% of the variance in the test-set prices.
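
The three metrics are closely related, which a small worked example can make concrete. The sketch below computes them by hand with NumPy and compares the results with scikit-learn; the five price values are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical prices, for illustration only (not taken from the dataset)
y_true = np.array([60.0, 95.0, 50.0, 65.0, 70.0])
y_hat = np.array([62.0, 93.0, 51.0, 63.0, 71.0])

residuals = y_true - y_hat
mse_manual = np.mean(residuals ** 2)             # mean of the squared errors
rmse_manual = np.sqrt(mse_manual)                # back in the units of the target
ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
r2_manual = 1.0 - ss_res / ss_tot                # share of variance explained

# The manual values agree with scikit-learn's implementations
assert np.isclose(mse_manual, mean_squared_error(y_true, y_hat))
assert np.isclose(r2_manual, r2_score(y_true, y_hat))

print("MSE:", mse_manual)
print("RMSE:", rmse_manual)
print("R-squared:", r2_manual)
```

Because MSE squares each error, a few large misses raise it sharply; RMSE is usually the easier number to read against the price scale.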

In summary, the tuned **Random Forest Regressor** achieves low error and high explanatory power on the held-out test set (20% of the data). These results suggest that the model can predict house prices in Bengaluru accurately from the provided features.
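
One way to examine how the model uses the provided features is to inspect the fitted forest's `feature_importances_`. The sketch below runs on synthetic data with hypothetical column names (`signal`, `noise`); on the notebook's model the same attribute would be read from `best_estimator`, with `X_train.columns` as the index.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: 'signal' drives the target, 'noise' does not
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    'signal': rng.normal(size=300),
    'noise': rng.normal(size=300),
})
y_demo = 5.0 * X_demo['signal'] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances: one value per column, normalized to sum to 1
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

Applied to this notebook, such a ranking would also show whether the `Unnamed: 0` column carries any weight in the predictions.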