# House Price Prediction: Advanced Regression Techniques - Modeling

### Introduction

In this data science project, we predict home sale prices from features such as size, location, number of rooms, and amenities. Our primary objective is to develop robust predictive models that accurately estimate the sale prices of houses. This notebook covers the modeling stage and works with the PCA-transformed training and test sets loaded below.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV

%load_ext nb_black


In [2]:
# Load the datasets
X_train_pca = pd.read_csv("X_train_pca.csv")
X_test_pca = pd.read_csv("X_test_pca.csv")
y_train = pd.read_csv("y_train.csv")
y_test = pd.read_csv("y_test.csv")
final_eval_pca = pd.read_csv("final_eval_pca.csv")


### Hyperparameter tuning

In this section, we tune the hyperparameters of the selected models with GridSearchCV. For each model, we define a parameter grid listing the hyperparameter values to search over. GridSearchCV then fits the model for every combination in the grid, scores each combination with 5-fold cross-validation, and reports the combination with the best mean score. Because no `scoring` argument is passed, the regressors are ranked by their default score, R-squared, so the "Best Score" values printed below are mean cross-validated R-squared values.

Let's start by tuning the hyperparameters for each of the following models:
1. Decision Tree Regression
2. Random Forest Regression
3. XGBoost Regression

After hyperparameter tuning, we will have the best hyperparameters for each model, which we will use for the final modeling step.
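To get a feel for what "exhaustive" means in practice, the sketch below counts how many model fits each search performs. The value counts are read off the three parameter grids used in this notebook; the helper name `count_fits` is made up for this illustration.

```python
from math import prod

# Number of candidate values per hyperparameter, read off the grids in this notebook.
grid_sizes = {
    "Decision Tree": {"max_depth": 12, "min_samples_split": 3, "min_samples_leaf": 3},
    "Random Forest": {
        "n_estimators": 5,
        "max_depth": 5,
        "min_samples_split": 3,
        "min_samples_leaf": 3,
    },
    "XGBoost": {"n_estimators": 10, "min_child_weight": 3, "gamma": 5},
}
CV_FOLDS = 5


def count_fits(sizes, folds=CV_FOLDS):
    """Return (number of hyperparameter combinations, number of cross-validation fits)."""
    candidates = prod(sizes.values())
    return candidates, candidates * folds


for name, sizes in grid_sizes.items():
    candidates, fits = count_fits(sizes)
    print(f"{name}: {candidates} candidates x {CV_FOLDS} folds = {fits} fits")
```

With the default `refit=True`, each search also performs one additional fit of the winning combination on the full training set. The Random Forest search is the most expensive of the three because every one of its fits trains between 50 and 250 trees.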

In [3]:
# Decision tree grid search

param_grid = {
    "max_depth": np.arange(3, 15),
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

dt = DecisionTreeRegressor()
dt_cv = GridSearchCV(dt, param_grid, cv=5)
dt_cv.fit(X_train_pca, y_train)

GridSearchCV(cv=5, estimator=DecisionTreeRegressor(),
             param_grid={'max_depth': array([ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]),
                         'min_samples_leaf': [1, 2, 4],
                         'min_samples_split': [2, 5, 10]})


In [4]:
print("Best Score:" + str(dt_cv.best_score_))
print("Best Parameters: " + str(dt_cv.best_params_))

Best Score:0.7789338961363504
Best Parameters: {'max_depth': 6, 'min_samples_leaf': 2, 'min_samples_split': 2}



In [5]:
# Random Forest grid search

param_grid = {
    "n_estimators": np.arange(50, 251, 50),
    "max_depth": [3, 6, 9, 12, 15],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

rf = RandomForestRegressor()
rf_cv = GridSearchCV(rf, param_grid, cv=5)
rf_cv.fit(X_train_pca, y_train.values.ravel())

GridSearchCV(cv=5, estimator=RandomForestRegressor(),
             param_grid={'max_depth': [3, 6, 9, 12, 15],
                         'min_samples_leaf': [1, 2, 4],
                         'min_samples_split': [2, 5, 10],
                         'n_estimators': array([ 50, 100, 150, 200, 250])})


In [6]:
print("Best Score:" + str(rf_cv.best_score_))
print("Best Parameters: " + str(rf_cv.best_params_))

Best Score:0.8370923325899223
Best Parameters: {'max_depth': 9, 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 150}



In [14]:
# XGBoost grid search

param_grid = {
    "n_estimators": np.arange(50, 501, 50),
    "min_child_weight": [1, 5, 10],
    "gamma": [0.5, 1, 1.5, 2, 5],
}

xgb = XGBRegressor()
xgb_cv = GridSearchCV(xgb, param_grid, cv=5)
xgb_cv.fit(X_train_pca, y_train)

GridSearchCV(cv=5,
             estimator=XGBRegressor(base_score=None, booster=None,
                                    callbacks=None, colsample_bylevel=None,
                                    colsample_bynode=None,
                                    colsample_bytree=None,
                                    early_stopping_rounds=None,
                                    enable_categorical=False, eval_metric=None,
                                    feature_types=None, gamma=None, gpu_id=None,
                                    grow_policy=None, importance_type=None,
                                    interaction_constraints=None,
                                    learning_rate=None, m...
                                    max_cat_to_onehot=None, max_delta_step=None,
                                    max_depth=None, max_leaves=None,
                                    min_child_weight=None, missing=nan,
                                    monotone_constraints=None, n_estim


In [8]:
print("Best Score:" + str(xgb_cv.best_score_))
print("Best Parameters: " + str(xgb_cv.best_params_))

Best Score:0.8272512103175584
Best Parameters: {'gamma': 0.5, 'min_child_weight': 1, 'n_estimators': 50}



### Modeling

Now that we have the best hyperparameters found by the grid search for our Decision Tree, Random Forest, and XGBoost models, we can proceed to the modeling phase. In this section, we build and evaluate these models with those hyperparameters to predict housing prices. An untuned Linear Regression is fitted first as a reference point.
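As an alternative to typing the winning values into a new estimator by hand, a fitted `GridSearchCV` object already holds the best configuration refit on the whole training set (this is the behavior of the default `refit=True`). The sketch below shows the idea on synthetic stand-in data from `make_regression`; the data, the small grid, and the variable names are assumptions for illustration, not the notebook's actual inputs.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; the notebook itself uses the PCA-transformed housing features.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A deliberately small grid so the example runs quickly.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [3, 6], "min_samples_leaf": [1, 2]},
    cv=5,
)
search.fit(X_train, y_train)

# With refit=True (the default), best_estimator_ has been refit on all of X_train
# using best_params_, so it can predict on the held-out test set directly.
best_tree = search.best_estimator_
y_pred = best_tree.predict(X_test)

print("Best parameters:", search.best_params_)
print("Number of test predictions:", len(y_pred))
```

Reusing `best_estimator_` avoids transcription mistakes when the grid or the data changes. Building a fresh estimator from the printed values, as done in the cells that follow, gives an equivalent configuration.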

We will start by fitting each model to the training data with the respective hyperparameters, and then we will evaluate their performance on the test set using appropriate regression metrics such as Root Mean Squared Error (RMSE) and R-squared (R2). The models' predictions will be compared to the actual target values to assess their accuracy and predictive capabilities.

After evaluating each model's performance, we will identify the best-performing model based on the metrics' results. The model with the highest R2 and the lowest RMSE will be chosen as our top candidate for predicting housing prices.

Let's dive into the modeling phase and witness the predictive power of these algorithms on our dataset!
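For reference, the two metrics used in this section can be written out by hand. The sketch below is a plain-Python illustration of the standard definitions, applied to three made-up prices; the notebook itself relies on `mean_squared_error` and `r2_score` from scikit-learn.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Square root of the mean squared difference between actual and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """1 minus the ratio of residual sum of squares to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up prices, for illustration only.
actual = [100_000, 200_000, 300_000]
predicted = [110_000, 190_000, 330_000]

print("RMSE:", round(rmse(actual, predicted), 2))
print("R-squared:", round(r_squared(actual, predicted), 3))
```

RMSE is expressed in the same unit as the target (dollars here) and penalizes large errors more heavily than small ones. R-squared is unitless and compares the model against always predicting the mean price.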

### Linear Regression

In [9]:
# Create a linear regression object
regr = LinearRegression()
model = regr.fit(X_train_pca, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test_pca)

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error (RMSE) for Linear Regression:", rmse)

# Calculate R-squared for Linear Regression
r2 = r2_score(y_test, y_pred)
print("R-squared for Linear Regression:", r2)

Root Mean Squared Error (RMSE) for Linear Regression: 36577.16617143117
R-squared for Linear Regression: 0.825575986080112



### Decision Tree

In [20]:
# Create a decision tree regressor object
dt = DecisionTreeRegressor(max_depth=6, min_samples_leaf=2, min_samples_split=2)
model_dt = dt.fit(X_train_pca, y_train)

# Make predictions on the test set
y_pred_dt = model_dt.predict(X_test_pca)

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred_dt))
print("Root Mean Squared Error (RMSE) for Decision Tree Regression:", rmse)

# Calculate R-squared
r2 = r2_score(y_test, y_pred_dt)
print("R-squared for Decision Tree Regression:", r2)

Root Mean Squared Error (RMSE) for Decision Tree Regression: 36466.165965387
R-squared for Decision Tree Regression: 0.8266330239033199



### Random Forest

In [19]:
# Create a random forest regressor object
rf = RandomForestRegressor(
    n_estimators=150, max_depth=9, min_samples_split=5, min_samples_leaf=2
)
model_rf = rf.fit(X_train_pca, y_train.values.ravel())

# Make predictions on the test set
y_pred_rf = model_rf.predict(X_test_pca)

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred_rf))
print("Root Mean Squared Error (RMSE) for Random Forest Regression:", rmse)

# Calculate R-squared
r2 = r2_score(y_test, y_pred_rf)
print("R-squared for Random Forest Regression:", r2)

Root Mean Squared Error (RMSE) for Random Forest Regression: 32328.491669653704
R-squared for Random Forest Regression: 0.8637435559578188



### XGBoost

In [15]:
# Create an XGBoost regressor object with the best hyperparameters from the grid search
xgb = XGBRegressor(n_estimators=50, min_child_weight=1, gamma=0.5)

# Fit the XGBoost model on the training data
model_xgb = xgb.fit(X_train_pca, y_train)

# Make predictions on the test set
y_pred_xgb = model_xgb.predict(X_test_pca)

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred_xgb))
print("Root Mean Squared Error (RMSE) for XGBoost Regression:", rmse)

# Calculate R-squared
r2 = r2_score(y_test, y_pred_xgb)
print("R-squared for XGBoost Regression:", r2)

Root Mean Squared Error (RMSE) for XGBoost Regression: 34418.5366007608
R-squared for XGBoost Regression: 0.8455560259447763



| Model                   | RMSE          | R-squared     | RMSE as % of Mean Price |
|-------------------------|---------------|---------------|-------------------------|
| Linear Regression       | 36,577.17     | 0.8256        | 20.22%                  |
| Decision Tree Regression| 36,466.17     | 0.8266        | 20.16%                  |
| Random Forest Regression| 32,328.49     | 0.8637        | 17.87%                  |
| XGBoost Regression      | 34,418.54     | 0.8456        | 19.02%                  |
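The last column of the table can be reproduced with a short calculation. The sketch below assumes a mean sale price of $180,921, the approximate figure quoted in the conclusion of this notebook; the exact percentages depend on the mean that is used. The function name `rmse_pct_of_mean` is made up for this illustration.

```python
# RMSE values copied (rounded to cents) from the model outputs in this notebook.
rmse_by_model = {
    "Linear Regression": 36_577.17,
    "Decision Tree Regression": 36_466.17,
    "Random Forest Regression": 32_328.49,
    "XGBoost Regression": 34_418.54,
}

# Assumption: the approximate mean sale price quoted in the conclusion.
MEAN_SALE_PRICE = 180_921


def rmse_pct_of_mean(rmse, mean_price=MEAN_SALE_PRICE):
    """Express an RMSE as a percentage of the mean sale price."""
    return 100 * rmse / mean_price


for model, rmse in rmse_by_model.items():
    print(f"{model}: {rmse_pct_of_mean(rmse):.2f}%")
```

Scaling RMSE by the mean price makes the error easier to interpret, but it is not the same as a mean percentage error: it relates one overall error figure to one overall price level rather than averaging per-house percentage errors.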


# Findings and Model Evaluation

In this analysis, we evaluated four different regression models to predict house sale prices based on various features. The models we explored were Linear Regression, Decision Tree Regression, Random Forest Regression, and XGBoost Regression. Our main evaluation metrics were Root Mean Squared Error (RMSE) and R-squared, which provided insights into prediction accuracy and model fit.

### Overall Performance


After tuning the Decision Tree, Random Forest, and XGBoost models (Linear Regression was fitted without tuning), we obtained the following results on the test set:

Linear Regression: RMSE = 36,577.17, R-squared = 0.826 <br>
Decision Tree Regression: RMSE = 36,466.17, R-squared = 0.827 <br>
Random Forest Regression: RMSE = 32,328.49, R-squared = 0.864 <br>
XGBoost Regression: RMSE = 34,418.54, R-squared = 0.846

### Best Performing Model


Among the evaluated models, Random Forest Regression achieved the lowest RMSE (32,328.49) and the highest R-squared (0.864) on the test set, making it the most accurate of the four. Its R-squared means the model explains around 86.4% of the variance in test-set sale prices.

### Model Insights


Random Forest and XGBoost Regression models outperformed Linear Regression and Decision Tree Regression. These ensemble models are known for their ability to capture complex interactions between features and handle non-linear relationships, which likely contributed to their improved performance. The single Decision Tree was only marginally better than Linear Regression (RMSE of 36,466.17 versus 36,577.17).

### Feature Importance


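The Random Forest and XGBoost models expose feature importance scores that highlight which inputs contribute most to the predictions. Because the models in this notebook were trained on PCA components, these scores rank components rather than original features, so they have to be traced back through the PCA loadings before they can tell us which factors influence property prices the most.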
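A minimal sketch of how such importances can be read from a fitted Random Forest is shown below. It uses synthetic data from `make_regression` and hypothetical column labels (`PC1`, `PC2`, ...) in place of the notebook's PCA components, so the printed ranking says nothing about the housing data itself.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: 4 inputs, only 2 of which actually drive the target.
X, y = make_regression(n_samples=200, n_features=4, n_informative=2, random_state=0)

# Hypothetical labels standing in for PCA component names.
component_names = [f"PC{i + 1}" for i in range(X.shape[1])]

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# feature_importances_ holds one impurity-based score per input column.
ranked = sorted(
    zip(component_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

The scores are normalized to sum to 1, so each can be read as a share of the total impurity reduction across the forest.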



### Baseline Model Comparison


We also compared the full-feature Linear Regression against a baseline Linear Regression model that uses only the square footage feature. The full-feature model outperformed the baseline, indicating that the additional features improve the model's predictive power.



### Potential Improvements


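While our models have shown promising results, there are areas for further improvement. We could explore other advanced regression techniques, widen the hyperparameter search, and engineer additional features to enhance model performance. For example, the best XGBoost values (n_estimators=50, gamma=0.5, min_child_weight=1) all sit at the lower edge of their grids, so extending those ranges is a natural next step.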



### Final Recommendations


Based on our findings, we recommend using the Random Forest Regression model for predicting house sale prices due to its superior accuracy and ability to handle complex relationships. However, it is essential to consider factors such as model interpretability and computational complexity when making the final decision.



# Conclusion


In conclusion, the Random Forest Regression model has shown promising performance in predicting the sale prices of houses in the dataset. Its RMSE (Root Mean Squared Error) of approximately $32,328.49 measures the typical size of the gap between the model's predictions and the actual sale prices, with large errors weighted more heavily than small ones. Considering that the average sale price of homes in the dataset is around $180,921, this RMSE corresponds to approximately 17.87% of the average sale price.

In non-technical terms, the model's predictions are typically off by an amount equal to roughly 17.87% of the average sale price. Lower RMSE values indicate better model performance, so reducing this value further would make the model's estimates of home prices more reliable.

As we move forward, it may be worth exploring additional data preprocessing techniques, feature engineering, or experimenting with more complex models to further improve the accuracy of the predictions. Additionally, evaluating the impact of other potential features on the model's performance could help refine the predictions even further. Overall, continuous fine-tuning and refinement of the model would contribute to more precise and reliable estimations of home prices in future analyses.