# House Price Prediction - Evaluation

This analysis predicts house prices from the properties of each house. The predictions are based on a sample of houses from Ames, Iowa. The dataset was obtained from [Kaggle](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques) as part of a competition.

## Extract-Transform-Load (ETL)

In [None]:
import pandas as pd

In [None]:
houses_train = pd.read_csv("../data/processed/train.csv")
houses_validation = pd.read_csv("../data/processed/validation.csv")

## Evaluation

In [None]:
import joblib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error, mean_squared_log_error

sns.set_theme()

In [None]:
pd.concat([houses_train["SalePrice"], houses_validation["SalePrice"]]).describe()

### Linear Regression

As a baseline, we trained a linear regression model. The model was fitted on the training set and is evaluated here on both the training set and the validation set.

In [None]:
lm_model = joblib.load("../models/linear_regression.joblib")

In [None]:
lm_prediction_train = lm_model.predict(houses_train.drop(columns=["SalePrice"]))
lm_rmse_train = round(mean_squared_error(houses_train["SalePrice"], lm_prediction_train, squared=False), 2)
lm_rmsle_train = mean_squared_log_error(houses_train["SalePrice"], lm_prediction_train, squared=False)

lm_prediction_validation = lm_model.predict(houses_validation.drop(columns=["SalePrice"]))
lm_rmse_validation = round(mean_squared_error(houses_validation["SalePrice"], lm_prediction_validation, squared=False), 2)
lm_rmsle_validation = mean_squared_log_error(houses_validation["SalePrice"], lm_prediction_validation, squared=False)

In [None]:
print(f"Linear regression train error (RMSE): {lm_rmse_train}")
print(f"Linear regression train error (RMSLE): {lm_rmsle_train}")
print(f"Linear regression validation error (RMSE): {lm_rmse_validation}")
print(f"Linear regression validation error (RMSLE): {lm_rmsle_validation}")
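The evaluation pattern used in this section (predict on a split, then compute RMSE and RMSLE) recurs for every model in this notebook. As a minimal sketch, it could be factored into a helper. The names `evaluate_model` and `ConstantModel` are hypothetical and not part of the original notebook, and the metrics are computed with NumPy instead of the scikit-learn functions so that the sketch does not depend on a particular scikit-learn version.

```python
import numpy as np


def evaluate_model(model, features, target):
    """Return RMSE (rounded to 2 decimals) and RMSLE for one model on one split.

    Hypothetical helper, not part of the original notebook.
    """
    target = np.asarray(target, dtype=float)
    prediction = np.asarray(model.predict(features), dtype=float)
    rmse = float(np.sqrt(np.mean((target - prediction) ** 2)))
    # RMSLE compares log(1 + value), so targets and predictions must be > -1.
    rmsle = float(np.sqrt(np.mean((np.log1p(target) - np.log1p(prediction)) ** 2)))
    return {"RMSE": round(rmse, 2), "RMSLE": rmsle}


class ConstantModel:
    """Stand-in for a fitted model: always predicts the same price."""

    def __init__(self, value):
        self.value = value

    def predict(self, features):
        return [self.value] * len(features)


# Toy data for illustration only, not taken from the Ames dataset.
toy_features = [[1], [2]]
toy_target = [100.0, 200.0]
toy_scores = evaluate_model(ConstantModel(150.0), toy_features, toy_target)
print(toy_scores)
```

In the notebook, such a helper would presumably be called as `evaluate_model(lm_model, houses_train.drop(columns=["SalePrice"]), houses_train["SalePrice"])`, and likewise for the validation split.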

### Linear Ridge Regression

As a second model, we trained a linear regression model with ridge (L2) regularization on the training set.

In [None]:
lm_ridge_model = joblib.load("../models/linear_ridge_regression.joblib")

In [None]:
lm_ridge_prediction_train = lm_ridge_model.predict(houses_train.drop(columns=["SalePrice"]))
lm_ridge_rmse_train = round(mean_squared_error(houses_train["SalePrice"], lm_ridge_prediction_train, squared=False), 2)
lm_ridge_rmsle_train = mean_squared_log_error(houses_train["SalePrice"], lm_ridge_prediction_train, squared=False)

lm_ridge_prediction_validation = lm_ridge_model.predict(houses_validation.drop(columns=["SalePrice"]))
lm_ridge_rmse_validation = round(mean_squared_error(houses_validation["SalePrice"], lm_ridge_prediction_validation, squared=False), 2)
lm_ridge_rmsle_validation = mean_squared_log_error(houses_validation["SalePrice"], lm_ridge_prediction_validation, squared=False)

In [None]:
print(f"Linear ridge regression train error (RMSE): {lm_ridge_rmse_train}")
print(f"Linear ridge regression train error (RMSLE): {lm_ridge_rmsle_train}")
print(f"Linear ridge regression validation error (RMSE): {lm_ridge_rmse_validation}")
print(f"Linear ridge regression validation error (RMSLE): {lm_ridge_rmsle_validation}")

### Random Forest

As a further model, we trained a random forest on the training set.

In [None]:
random_forest_model = joblib.load("../models/random_forest.joblib")

In [None]:
random_forest_prediction_train = random_forest_model.predict(houses_train.drop(columns=["SalePrice"]))
random_forest_rmse_train = round(mean_squared_error(houses_train["SalePrice"], random_forest_prediction_train, squared=False), 2)
random_forest_rmsle_train = mean_squared_log_error(houses_train["SalePrice"], random_forest_prediction_train, squared=False)

random_forest_prediction_validation = random_forest_model.predict(houses_validation.drop(columns=["SalePrice"]))
random_forest_rmse_validation = round(mean_squared_error(houses_validation["SalePrice"], random_forest_prediction_validation, squared=False), 2)
random_forest_rmsle_validation = mean_squared_log_error(houses_validation["SalePrice"], random_forest_prediction_validation, squared=False)

In [None]:
print(f"Random forest train error (RMSE): {random_forest_rmse_train}")
print(f"Random forest train error (RMSLE): {random_forest_rmsle_train}")
print(f"Random forest validation error (RMSE): {random_forest_rmse_validation}")
print(f"Random forest validation error (RMSLE): {random_forest_rmsle_validation}")

### Gradient Boosting

In [None]:
gradient_boosting_model = joblib.load("../models/gradient_boosting.joblib")

In [None]:
gradient_boosting_prediction_train = gradient_boosting_model.predict(houses_train.drop(columns=["SalePrice"]))
gradient_boosting_rmse_train = round(mean_squared_error(houses_train["SalePrice"], gradient_boosting_prediction_train, squared=False), 2)
gradient_boosting_rmsle_train = mean_squared_log_error(houses_train["SalePrice"], gradient_boosting_prediction_train, squared=False)

gradient_boosting_prediction_validation = gradient_boosting_model.predict(houses_validation.drop(columns=["SalePrice"]))
gradient_boosting_rmse_validation = round(mean_squared_error(houses_validation["SalePrice"], gradient_boosting_prediction_validation, squared=False), 2)
gradient_boosting_rmsle_validation = mean_squared_log_error(houses_validation["SalePrice"], gradient_boosting_prediction_validation, squared=False)

In [None]:
print(f"Gradient boosting train error (RMSE): {gradient_boosting_rmse_train}")
print(f"Gradient boosting train error (RMSLE): {gradient_boosting_rmsle_train}")
print(f"Gradient boosting validation error (RMSE): {gradient_boosting_rmse_validation}")
print(f"Gradient boosting validation error (RMSLE): {gradient_boosting_rmsle_validation}")

### Feed Forward Neural Network

In [None]:
feed_forward_nn_model = joblib.load("../models/feed_forward_nn.joblib")

In [None]:
feed_forward_nn_prediction_train = feed_forward_nn_model.predict(houses_train.drop(columns=["SalePrice"]))
feed_forward_nn_rmse_train = round(mean_squared_error(houses_train["SalePrice"], feed_forward_nn_prediction_train, squared=False), 2)
feed_forward_nn_rmsle_train = mean_squared_log_error(houses_train["SalePrice"], feed_forward_nn_prediction_train, squared=False)

feed_forward_nn_prediction_validation = feed_forward_nn_model.predict(houses_validation.drop(columns=["SalePrice"]))
feed_forward_nn_rmse_validation = round(mean_squared_error(houses_validation["SalePrice"], feed_forward_nn_prediction_validation, squared=False), 2)
feed_forward_nn_rmsle_validation = mean_squared_log_error(houses_validation["SalePrice"], feed_forward_nn_prediction_validation, squared=False)

In [None]:
print(f"Feed forward neural network train error (RMSE): {feed_forward_nn_rmse_train}")
print(f"Feed forward neural network train error (RMSLE): {feed_forward_nn_rmsle_train}")
print(f"Feed forward neural network validation error (RMSE): {feed_forward_nn_rmse_validation}")
print(f"Feed forward neural network validation error (RMSLE): {feed_forward_nn_rmsle_validation}")

## Conclusion

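Finally, we compare the results of all trained models. As metrics, we use the root mean squared error (RMSE) and the root mean squared logarithmic error (RMSLE). The plot orders the models by their validation RMSLE.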
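
For reference, the RMSE and RMSLE computed in this notebook can be written as follows, where $y_i$ is the observed sale price, $\hat{y}_i$ the predicted price, and $n$ the number of houses in the split. The RMSLE form assumes the convention of scikit-learn's `mean_squared_log_error`, which applies the natural logarithm to $1 + y$ rather than to $y$ itself.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
$$

$$
\mathrm{RMSLE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \log(1 + y_i) - \log(1 + \hat{y}_i) \right)^2}
$$

RMSE is expressed in the same unit as the sale price, so errors on expensive houses dominate it. RMSLE measures errors on a logarithmic scale and therefore reflects relative rather than absolute deviations.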

In [None]:
model_performance = pd.DataFrame({
    "Model": ["Linear Regression", "Linear Regression", "Linear Ridge Regression", "Linear Ridge Regression",
              "Random Forest", "Random Forest", "Gradient Boosting", "Gradient Boosting", "Feed Forward Neural Network",
              "Feed Forward Neural Network"],
    "RMSE": [lm_rmse_train, lm_rmse_validation, lm_ridge_rmse_train, lm_ridge_rmse_validation,
             random_forest_rmse_train, random_forest_rmse_validation,
             gradient_boosting_rmse_train, gradient_boosting_rmse_validation, feed_forward_nn_rmse_train,
             feed_forward_nn_rmse_validation],
    "RMSLE": [lm_rmsle_train, lm_rmsle_validation, lm_ridge_rmsle_train, lm_ridge_rmsle_validation,
              random_forest_rmsle_train, random_forest_rmsle_validation,
              gradient_boosting_rmsle_train, gradient_boosting_rmsle_validation, feed_forward_nn_rmsle_train,
              feed_forward_nn_rmsle_validation],
    "Dataset": ["Train", "Validation", "Train", "Validation", "Train", "Validation", "Train", "Validation", "Train",
                "Validation"]
})

model_performance

In [None]:
fig = plt.figure(figsize=(10, 5))

ax = fig.add_subplot(1, 1, 1)
ax.set_title("Comparison of model performance")
sns.barplot(x="Model", y="RMSLE", hue="Dataset", data=model_performance, ax=ax,
            order=model_performance[model_performance["Dataset"] == "Validation"].sort_values(by="RMSLE")["Model"])
ax.set_xlabel("Model")
ax.set_ylabel("RMSLE")
ax.tick_params(axis="x", rotation=90)

plt.show()

## Submission

In a final step, we take the best model and predict the house prices for the submission dataset. The submission dataset is the test dataset from Kaggle. The predictions are saved in a CSV file and uploaded to Kaggle.

In [None]:
best_model = gradient_boosting_model
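
The assignment above names the best model by hand. As a minimal sketch, the same choice could be derived from a table shaped like `model_performance` by picking the model with the lowest validation RMSLE. The function `select_best_model` and the scores in `example_performance` are hypothetical placeholders, not results from this notebook.

```python
import pandas as pd

# Placeholder scores for illustration only; they are NOT results from this notebook.
example_performance = pd.DataFrame({
    "Model": ["Model A", "Model A", "Model B", "Model B"],
    "RMSLE": [0.10, 0.15, 0.05, 0.12],
    "Dataset": ["Train", "Validation", "Train", "Validation"],
})


def select_best_model(performance, metric="RMSLE"):
    """Return the name of the model with the lowest validation score (hypothetical helper)."""
    validation_rows = performance[performance["Dataset"] == "Validation"]
    # idxmin returns the index label of the row with the smallest metric value.
    return validation_rows.loc[validation_rows[metric].idxmin(), "Model"]


best_model_name = select_best_model(example_performance)
print(best_model_name)
```

Selecting on the validation rows rather than the training rows avoids rewarding a model that merely fits the training data closely.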

In [None]:
houses_test = pd.read_csv("../data/processed/test.csv")

In [None]:
best_model_prediction_test = best_model.predict(houses_test)

submission = pd.DataFrame({
    "Id": houses_test["Id"],
    "SalePrice": best_model_prediction_test
})

submission.to_csv("../data/submission.csv", index=False)