# House Price Prediction - Modelling

This analysis predicts house sale prices from each house's properties, using a sample of houses from Ames, Iowa. The dataset comes from a [Kaggle](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques) competition.

## Extract-Transform-Load (ETL)

In [None]:
import pandas as pd

In [None]:
houses_train = pd.read_csv("../data/modelling/train.csv")
houses_validation = pd.read_csv("../data/modelling/validation.csv")

## Evaluation

In [None]:
import joblib
from sklearn.metrics import mean_squared_error, mean_squared_log_error

In [None]:
pd.concat([houses_train["SalePrice"], houses_validation["SalePrice"]]).describe()

### Linear Regression

As a baseline, we use a linear regression model. The fitted model is loaded from disk and evaluated on both the training set and the validation set.
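
The notebook loads an already fitted model from `../models/lm.joblib`, and the training step itself is not shown here. As a rough illustration of that fit, persist, and reload workflow, the sketch below uses synthetic, noise-free data and a hypothetical file name (`demo_lm.joblib`); it is not the preprocessing or model configuration actually used for the Ames data.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the housing data (an assumption for illustration):
# price = 100 * area + 5000, with no noise.
rng = np.random.default_rng(0)
area = rng.uniform(50, 250, size=(40, 1))
price = 100.0 * area[:, 0] + 5000.0

# Fit on the first 30 rows and hold out the last 10 for validation.
X_train, y_train = area[:30], price[:30]
X_val, y_val = area[30:], price[30:]
model = LinearRegression().fit(X_train, y_train)

# Persist the fitted model and load it back, mirroring joblib.load in the notebook.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_lm.joblib"  # hypothetical file name
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model predicts the held-out rows without being refitted.
val_pred = restored.predict(X_val)
max_abs_error = float(np.max(np.abs(val_pred - y_val)))
```

Because the synthetic target is exactly linear, the restored model reproduces the held-out prices up to floating-point error; on real data the validation error would of course be larger.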

In [None]:
lm_model = joblib.load("../models/lm.joblib")

In [None]:
# Evaluate on the training set. The square root is taken explicitly because the
# `squared=False` argument is deprecated in recent scikit-learn releases.
lm_prediction_train = lm_model.predict(houses_train.drop(columns=["SalePrice"]))
lm_rmse_train = mean_squared_error(houses_train["SalePrice"], lm_prediction_train) ** 0.5
lm_rmsle_train = mean_squared_log_error(houses_train["SalePrice"], lm_prediction_train) ** 0.5

# Evaluate on the held-out validation set.
lm_prediction_validation = lm_model.predict(houses_validation.drop(columns=["SalePrice"]))
lm_rmse_validation = mean_squared_error(houses_validation["SalePrice"], lm_prediction_validation) ** 0.5
lm_rmsle_validation = mean_squared_log_error(houses_validation["SalePrice"], lm_prediction_validation) ** 0.5

In [None]:
print(f"Linear regression train error (RMSE): {lm_rmse_train}")
print(f"Linear regression train error (RMSLE): {lm_rmsle_train}")
print(f"Linear regression validation error (RMSE): {lm_rmse_validation}")
print(f"Linear regression validation error (RMSLE): {lm_rmsle_validation}")
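
The two error measures reported above differ in what they penalize: RMSE is expressed in the units of the sale price, while RMSLE compares the logarithms of prices and therefore reflects relative error. The following minimal sketch spells out both definitions in plain Python, using the `log(1 + y)` form that scikit-learn's `mean_squared_log_error` is based on. The helper names `rmse` and `rmsle` and the example prices are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def rmsle(y_true, y_pred):
    """Root mean squared log error: sqrt(mean((log(1 + y) - log(1 + y_hat))^2))."""
    squared_log_errors = [
        (math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Two made-up houses, each over-predicted by 10 %.
actual = [100_000.0, 200_000.0]
predicted = [110_000.0, 220_000.0]

example_rmse = rmse(actual, predicted)    # dominated by the more expensive house
example_rmsle = rmsle(actual, predicted)  # close to log(1.1) for both houses
```

In this example the absolute errors are 10,000 and 20,000, so the expensive house dominates the RMSE, whereas both houses contribute almost equally to the RMSLE because each is off by the same ratio.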

### Linear Ridge Regression

The second model is a linear regression with ridge (L2) regularization. It is loaded from disk and evaluated in the same way as the baseline.
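
To make the effect of the ridge penalty concrete, the sketch below fits a one-feature model in closed form, minimizing the squared error plus `alpha` times the squared slope (the intercept is left unpenalized). The function name `fit_ridge_1d`, the toy data, and the `alpha` values are assumptions for illustration; the regularization strength of the stored model is not stated in this notebook.

```python
def fit_ridge_1d(x, y, alpha):
    """Closed-form ridge fit for a single feature with an unpenalized intercept.

    Minimizes sum((y - slope * x - intercept)^2) + alpha * slope^2.
    With alpha = 0 this reduces to ordinary least squares.
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / (s_xx + alpha)  # the penalty enlarges the denominator
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Toy data lying exactly on the line y = 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]

ols_slope, ols_intercept = fit_ridge_1d(x, y, alpha=0.0)
ridge_slope, ridge_intercept = fit_ridge_1d(x, y, alpha=5.0)
```

With `alpha=0.0` the fit recovers the slope of 2, while `alpha=5.0` shrinks the slope to 1 and shifts the intercept to 2.5 to compensate. The same shrinkage applies to each coefficient in the multi-feature case, which can reduce overfitting at the cost of some bias.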

In [None]:
lm_ridge_model = joblib.load("../models/lm_ridge.joblib")

In [None]:
# Evaluate on the training set. The square root is taken explicitly because the
# `squared=False` argument is deprecated in recent scikit-learn releases.
lm_ridge_prediction_train = lm_ridge_model.predict(houses_train.drop(columns=["SalePrice"]))
lm_ridge_rmse_train = mean_squared_error(houses_train["SalePrice"], lm_ridge_prediction_train) ** 0.5
lm_ridge_rmsle_train = mean_squared_log_error(houses_train["SalePrice"], lm_ridge_prediction_train) ** 0.5

# Evaluate on the held-out validation set.
lm_ridge_prediction_validation = lm_ridge_model.predict(houses_validation.drop(columns=["SalePrice"]))
lm_ridge_rmse_validation = mean_squared_error(houses_validation["SalePrice"], lm_ridge_prediction_validation) ** 0.5
lm_ridge_rmsle_validation = mean_squared_log_error(houses_validation["SalePrice"], lm_ridge_prediction_validation) ** 0.5

In [None]:
print(f"Linear ridge regression train error (RMSE): {lm_ridge_rmse_train}")
print(f"Linear ridge regression train error (RMSLE): {lm_ridge_rmsle_train}")
print(f"Linear ridge regression validation error (RMSE): {lm_ridge_rmse_validation}")
print(f"Linear ridge regression validation error (RMSLE): {lm_ridge_rmsle_validation}")

## Conclusion

In [None]:
import matplotlib.pyplot as plt

In [None]:
model_performance_train = pd.DataFrame({
    "model": ["Linear Regression", "Linear Ridge Regression"],
    "rmse": [lm_rmse_train, lm_ridge_rmse_train],
    "rmsle": [lm_rmsle_train, lm_ridge_rmsle_train]
}).sort_values(ascending=True, by="rmse")

model_performance_validation = pd.DataFrame({
    "model": ["Linear Regression", "Linear Ridge Regression"],
    "rmse": [lm_rmse_validation, lm_ridge_rmse_validation],
    "rmsle": [lm_rmsle_validation, lm_ridge_rmsle_validation]
}).sort_values(ascending=True, by="rmse")

# Plot the RMSE of each model side by side for the train and validation sets.
fig = plt.figure(figsize=(15, 5))

ax1 = fig.add_subplot(1, 2, 1)
ax1.set_title("Comparison of model performance on train set")
ax1.set_ylabel("RMSE")
ax1.bar(model_performance_train["model"], model_performance_train["rmse"])
ax1.tick_params(axis="x", rotation=90)

ax2 = fig.add_subplot(1, 2, 2)
ax2.set_title("Comparison of model performance on validation set")
ax2.set_ylabel("RMSE")
ax2.bar(model_performance_validation["model"], model_performance_validation["rmse"])
ax2.tick_params(axis="x", rotation=90)

plt.show()