# Model building for the house prices dataset
In this notebook we build three different models for the house prices dataset:
- Linear Regression
- Random Forest Regressor
- XGBoost Regressor

In [1]:
# import libraries
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# import data
orig_data = pd.read_csv("data/train.csv")
house_data = orig_data.copy()
X_train = pd.read_csv("data/X_train.csv", index_col="Id")
X_test = pd.read_csv("data/X_test.csv", index_col="Id")
y_train = pd.read_csv("data/y_train.csv", index_col="Id")
y_test = pd.read_csv("data/y_test.csv", index_col="Id")

In [None]:
reg = LinearRegression(fit_intercept=False).fit(X_train, y_train)
reg.score(X_train, y_train)

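This is a good R-squared score: around 80% of the variation in the target is explained by the model. Note that it is measured on the training data, so it says nothing yet about performance on unseen houses.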
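
To make the R-squared figure concrete, here is a minimal sketch of how the score is computed: one minus the ratio of residual variation to total variation around the mean. The helper `r_squared` and the synthetic data are illustrative assumptions, not part of the notebook or of the real dataset.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in for the house data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

# Least-squares fit without an intercept, mirroring fit_intercept=False
coef, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
r2_demo = r_squared(y_demo, X_demo @ coef)
print(r2_demo)
```

This is the same definition that `LinearRegression.score` uses, so applying `r_squared(y_train, reg.predict(X_train))` should reproduce the value reported above.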

In [None]:
y_predict = reg.predict(X_test)

In [None]:
# mean squared error between log-transformed actual and predicted prices
mean_squared_error(np.log(y_test + 1), np.log(y_predict + 1))
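
The metric above is the mean squared logarithmic error (MSLE). As a hedged sketch, it can also be obtained from scikit-learn's `mean_squared_log_error`, which applies `log(1 + x)` internally. The price arrays below are made-up values, and clipping predictions at zero is an assumption added here because a linear model fitted on raw prices can produce negative predictions, for which the logarithm is undefined.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Made-up prices for illustration (not taken from the dataset)
y_true_demo = np.array([100000.0, 150000.0, 200000.0, 250000.0])
y_pred_demo = np.array([110000.0, 140000.0, 210000.0, 240000.0])

# Guard against negative predictions before taking logs (assumed safeguard)
y_pred_demo = np.clip(y_pred_demo, 0, None)

# Manual version: MSE between log1p-transformed values
manual_msle = mean_squared_error(np.log1p(y_true_demo), np.log1p(y_pred_demo))

# Built-in version: same quantity in one call
sklearn_msle = mean_squared_log_error(y_true_demo, y_pred_demo)

# The square root gives RMSLE, which is on the same scale as the log prices
rmsle = np.sqrt(manual_msle)
print(manual_msle, sklearn_msle, rmsle)
```

`np.log1p(x)` is equivalent to `np.log(x + 1)` but is numerically more accurate for small `x`.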

A more detailed summary of the same linear regression, using statsmodels:

In [None]:
mod = sm.OLS(y_train, X_train)
res = mod.fit()
res.summary()

Most of the variables have low p-values, which indicates that their association with the house price is statistically significant.

Given the shape of the target variable's distribution, it may be more beneficial to predict its logarithm and then transform the predictions back to the original price scale.
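
One way to sketch this idea is scikit-learn's `TransformedTargetRegressor`, which fits the regressor on a transformed target and applies the inverse transform at prediction time. The synthetic data and the names `X_skew`, `y_skew`, and `log_model` are assumptions for illustration; unlike the notebook's model, this sketch keeps the default intercept because the synthetic features contain no constant column.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, right-skewed positive target (hypothetical stand-in for SalePrice)
rng = np.random.default_rng(42)
X_skew = rng.normal(size=(300, 2))
y_skew = np.exp(
    11.0 + 0.4 * X_skew[:, 0] - 0.2 * X_skew[:, 1]
    + rng.normal(scale=0.1, size=300)
)

# Fit on log1p(y); predictions are mapped back to prices with expm1
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(X_skew, y_skew)

# Predictions come out on the original price scale
pred_skew = log_model.predict(X_skew)
print(pred_skew[:3])
```

Wrapping the transform this way keeps the forward and inverse steps together, so the predictions cannot accidentally be left on the log scale.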

In [None]:
sns.histplot(data=house_data["SalePrice"])

In [None]:
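# fit on the log-transformed target
scaled_y_train = np.log(y_train + 1)
reg = LinearRegression(fit_intercept=False).fit(X_train, scaled_y_train)
# score against the same transformed target the model was fitted on
reg.score(X_train, scaled_y_train)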

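As we can see, the corresponding model has an even greater R-squared value. Note that this score is measured on the log scale, so it is not directly comparable with the R-squared of the model fitted on raw prices.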
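
For a like-for-like comparison, both models can be scored on the same scale by transforming the log-model's predictions back to prices before computing R-squared. The following is a rough sketch on synthetic data; the variable names and the data are assumptions, and the resulting numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic skewed target (hypothetical, for illustration only)
rng = np.random.default_rng(1)
X_cmp = rng.normal(size=(300, 2))
y_cmp = np.exp(11.0 + 0.4 * X_cmp[:, 0] + rng.normal(scale=0.2, size=300))

# Model A: fitted on raw prices; Model B: fitted on log1p(prices)
raw_model = LinearRegression().fit(X_cmp, y_cmp)
log_model_cmp = LinearRegression().fit(X_cmp, np.log1p(y_cmp))

# R-squared of model A on the price scale
r2_raw = r2_score(y_cmp, raw_model.predict(X_cmp))

# R-squared of model B on the log scale (what .score on log targets reports)
r2_log_scale = r2_score(np.log1p(y_cmp), log_model_cmp.predict(X_cmp))

# R-squared of model B after mapping predictions back to prices
r2_log_back = r2_score(y_cmp, np.expm1(log_model_cmp.predict(X_cmp)))

print(r2_raw, r2_log_scale, r2_log_back)
```

Only `r2_raw` and `r2_log_back` are measured against the same target, so those are the two values to compare.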

In [None]:
y_predict = reg.predict(X_test)

In [None]:
# y_predict is already on the log scale, so only y_test is transformed
mean_squared_error(np.log(y_test + 1), y_predict)