# House Price Prediction - Modelling

This analysis predicts house sale prices from the properties of each house, using a sample of houses from Ames, Iowa. The dataset comes from a [Kaggle](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques) competition.

## Extract-Transform-Load (ETL)

In [None]:
import pandas as pd

In [None]:
houses_train = pd.read_csv("../data/modelling/train.csv")
houses_validation = pd.read_csv("../data/modelling/validation.csv")

## Evaluation

In [None]:
import joblib
from sklearn.metrics import mean_squared_error, mean_squared_log_error

In [None]:
pd.concat([houses_train["SalePrice"], houses_validation["SalePrice"]]).describe()

### Linear Regression

As a baseline, we trained a linear regression model on the training set. Here the saved model is loaded and its error is measured on both the training set and the validation set.
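
The evaluation below reports two metrics, RMSE and RMSLE. As a reminder of what they measure, here is a minimal sketch that writes both out by hand with NumPy. The function names `rmse` and `rmsle` and the toy prices are illustrative assumptions, not part of the original notebook. The `log1p` transform mirrors the `log(1 + y)` form that scikit-learn's `mean_squared_log_error` is documented to use.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target (here: dollars)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, which reflects relative deviations."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(y) computes log(1 + y), so a value of zero is still well defined
    return float(np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)))


# Toy prices, made up for illustration (not taken from the Ames data)
actual = [100_000, 200_000, 300_000]
predicted = [110_000, 190_000, 330_000]
print(f"RMSE:  {rmse(actual, predicted)}")
print(f"RMSLE: {rmsle(actual, predicted)}")
```

RMSE is dominated by errors on expensive houses, while RMSLE treats a 10% miss on a cheap house roughly like a 10% miss on an expensive one, which is why reporting both can be informative.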

In [None]:
lm_model = joblib.load("../models/lm.joblib")

In [None]:
# The square root is taken explicitly because the `squared=False` argument
# was deprecated in scikit-learn 1.4 and is no longer accepted by recent releases.
lm_prediction_train = lm_model.predict(houses_train.drop(columns=["SalePrice"]))
lm_rmse_train = mean_squared_error(houses_train["SalePrice"], lm_prediction_train) ** 0.5
lm_rmsle_train = mean_squared_log_error(houses_train["SalePrice"], lm_prediction_train) ** 0.5

# Same metrics on the held-out validation set
lm_prediction_validation = lm_model.predict(houses_validation.drop(columns=["SalePrice"]))
lm_rmse_validation = mean_squared_error(houses_validation["SalePrice"], lm_prediction_validation) ** 0.5
lm_rmsle_validation = mean_squared_log_error(houses_validation["SalePrice"], lm_prediction_validation) ** 0.5

In [None]:
print(f"Error of linear regression model on train set (RMSE): {lm_rmse_train}")
print(f"Error of linear regression model on train set (RMSLE): {lm_rmsle_train}")
print(f"Error of linear regression model on validation set (RMSE): {lm_rmse_validation}")
print(f"Error of linear regression model on validation set (RMSLE): {lm_rmsle_validation}")
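
The train and validation evaluations above follow the same steps, so they could be wrapped in one function, which may help once further models are compared against the baseline. The sketch below is one possible way to do that. The helper name `evaluate`, the toy data frame, and the decision to clip negative predictions to zero before the log-based metric are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_squared_log_error


def evaluate(model, frame, target="SalePrice"):
    """Return RMSE and RMSLE of `model` on `frame` (hypothetical helper)."""
    y_true = frame[target]
    y_pred = model.predict(frame.drop(columns=[target]))
    # A linear model can predict negative prices, which mean_squared_log_error
    # can reject with a ValueError. Clipping at zero is an assumption of this sketch.
    y_pred_nonneg = np.clip(y_pred, 0, None)
    return {
        "rmse": float(mean_squared_error(y_true, y_pred) ** 0.5),
        "rmsle": float(mean_squared_log_error(y_true, y_pred_nonneg) ** 0.5),
    }


# Toy data with a perfectly linear relationship (values are made up)
toy = pd.DataFrame(
    {
        "GrLivArea": [1000, 1500, 2000, 2500],
        "SalePrice": [100_000, 150_000, 200_000, 250_000],
    }
)
toy_model = LinearRegression().fit(toy[["GrLivArea"]], toy["SalePrice"])
scores = evaluate(toy_model, toy)
print(scores)
```

With the notebook's own objects, the call would look like `evaluate(lm_model, houses_train)` and `evaluate(lm_model, houses_validation)`. Note that clipping changes the RMSLE whenever a prediction is negative, so it is worth checking how often that happens before relying on the number.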