## House Prices dataset: Model building

In the following cells, we build our machine learning model using the engineered data and the pre-selected features.


### Setting the seed

Note that we are engineering variables and pre-processing data with the aim of deploying the model. Therefore, from now on, every step that involves some element of randomness must **set the seed**. This makes our results reproducible between the research code and the development code.

This is perhaps one of the most important lessons that you need to take away from this course: **Always set the seeds**.
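
To make the point concrete, here is a minimal sketch of what seeding buys us. It uses NumPy's random generator; the helper name `draw_sample` is made up for this illustration and is not part of the course code.

```python
import numpy as np

def draw_sample(seed, size=5):
    """Return `size` pseudo-random numbers from a generator seeded with `seed`."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=size)

first_run = draw_sample(seed=0)
second_run = draw_sample(seed=0)
other_seed = draw_sample(seed=1)

# same seed: the two runs produce identical numbers
print(np.array_equal(first_run, second_run))
# different seed: the numbers differ
print(np.array_equal(first_run, other_seed))
```

The same idea applies to any step that draws random numbers: fixing the seed in both the research notebook and the deployed code lets us compare the two outputs directly.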

Let's go ahead and load the dataset.

In [1]:
# to handle datasets
import pandas as pd
import numpy as np

# to build the model
from sklearn.linear_model import Lasso

# to evaluate the model
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt

# to display all the columns in the dataframe
pd.set_option('display.max_columns', None)

In [2]:
# load the train and test set with the engineered variables
# we built and saved these datasets in a previous notebook.
X_train = pd.read_csv('xtrain.csv')
X_test = pd.read_csv('xtest.csv')

X_train.head()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,LotFrontage_na,MasVnrArea_na,GarageYrBlt_na
0,0.0,0.461171,0.377048,0.777778,0.5,0.978261,0.95,0.0,0.002835,0.0,0.673479,0.239935,0.55976,0.0,0.0,0.52325,0.0,0.0,0.666667,0.0,0.375,0.333333,0.416667,0.0,0.972727,0.75,0.430183,0.116686,0.032907,0.0,0.0,0.0,0.0,0.0,0.545455,0.75,0.0,0.0,0.0
1,0.0,0.456066,0.399443,0.444444,0.75,0.630435,0.933333,0.03375,0.142807,0.0,0.114724,0.17234,0.434539,0.0,0.0,0.406196,0.333333,0.0,0.333333,0.5,0.375,0.333333,0.25,0.0,0.536364,0.25,0.220028,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.636364,0.5,0.0,0.0,0.0
2,0.588235,0.394699,0.347082,0.888889,0.5,0.963768,0.916667,0.2575,0.080794,0.0,0.601951,0.286743,0.627205,0.0,0.0,0.586296,0.333333,0.0,0.666667,0.0,0.25,0.333333,0.333333,0.333333,0.954545,0.5,0.406206,0.228705,0.149909,0.0,0.0,0.0,0.0,0.0,0.090909,1.0,0.0,0.0,0.0
3,0.0,0.388581,0.493677,0.666667,0.5,0.913043,0.8,0.0,0.25567,0.0,0.018114,0.242553,0.56692,0.0,0.0,0.529943,0.333333,0.0,0.666667,0.0,0.375,0.333333,0.25,0.333333,0.890909,0.5,0.362482,0.469078,0.045704,0.0,0.0,0.0,0.0,0.0,0.636364,0.25,1.0,0.0,0.0
4,0.0,0.577658,0.402702,0.555556,0.5,0.666667,0.233333,0.17,0.086818,0.0,0.434278,0.233224,0.549026,0.0,0.0,0.513216,0.0,0.0,0.666667,0.0,0.375,0.333333,0.416667,0.333333,0.581818,0.5,0.406206,0.0,0.0,0.0,0.801181,0.0,0.0,0.0,0.545455,0.5,0.0,0.0,0.0


In [3]:
# capture the target (remember that it is log-transformed)
y_train = X_train['SalePrice']
y_test = X_test['SalePrice']

KeyError: 'SalePrice'
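
The `KeyError` above means the loaded files contain no `SalePrice` column, so every later cell that needs `y_train` or `y_test` fails as well. Where the target actually lives is not shown in this notebook, so the following is only a sketch of a guard that fails with a clearer message. The helper `pop_target` and the toy values are hypothetical and used purely for illustration.

```python
import pandas as pd

def pop_target(df, target="SalePrice"):
    """Hypothetical helper: split `df` into (features, target) or fail clearly."""
    if target not in df.columns:
        raise KeyError(
            f"'{target}' not found; available columns start with: "
            f"{list(df.columns[:5])}"
        )
    return df.drop(columns=[target]), df[target]

# Toy data standing in for the engineered dataset (illustrative values only).
toy = pd.DataFrame({"LotArea": [0.38, 0.40], "SalePrice": [12.2, 12.1]})
X_toy, y_toy = pop_target(toy)
print(list(X_toy.columns), y_toy.tolist())

# Calling the helper on a frame without the target raises a descriptive error.
try:
    pop_target(X_toy)
    error_message = None
except KeyError as err:
    error_message = str(err)
    print(error_message)
```

Failing early with the list of available columns makes it obvious that the saved files need to be regenerated with the target, or that the target has to be loaded from wherever it was stored.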

In [4]:
# pre-selected features
features = pd.read_csv('selected_features.csv')
features = features['0'].to_list() 
features

['MSSubClass',
 'LotArea',
 'OverallQual',
 'OverallCond',
 'YearBuilt',
 'YearRemodAdd',
 'MasVnrArea',
 'BsmtFinSF1',
 'BsmtFinSF2',
 'BsmtUnfSF',
 'TotalBsmtSF',
 '1stFlrSF',
 '2ndFlrSF',
 'LowQualFinSF',
 'BsmtFullBath',
 'TotRmsAbvGrd',
 'Fireplaces',
 'GarageYrBlt',
 'GarageCars',
 'GarageArea',
 'WoodDeckSF',
 'OpenPorchSF',
 'EnclosedPorch',
 '3SsnPorch',
 'ScreenPorch',
 'PoolArea',
 'MiscVal',
 'YrSold']

In [5]:
# reduce the train and test set to the selected features
X_train = X_train[features]
X_test = X_test[features]

### Regularised linear regression: Lasso

Remember to set the seed.

In [41]:
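# remember to set the random_state / seed
# (Lasso only uses random_state when selection='random';
# setting it anyway keeps the run reproducible if that changes)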
lin_model = Lasso(alpha=0.005, random_state=0)

# train the model
lin_model.fit(X_train, y_train)

NameError: name 'y_train' is not defined

In [None]:
# make predictions for train set
pred = lin_model.predict(X_train)

# determine mse, rmse and r2 on the original price scale
print('train mse: {}'.format(int(
    mean_squared_error(np.exp(y_train), np.exp(pred)))))
print('train rmse: {}'.format(int(
    sqrt(mean_squared_error(np.exp(y_train), np.exp(pred))))))
print('train r2: {}'.format(
    r2_score(np.exp(y_train), np.exp(pred))))
print()

# make predictions for test set
pred = lin_model.predict(X_test)

# determine mse, rmse and r2 on the original price scale
print('test mse: {}'.format(int(
    mean_squared_error(np.exp(y_test), np.exp(pred)))))
print('test rmse: {}'.format(int(
    sqrt(mean_squared_error(np.exp(y_test), np.exp(pred))))))
print('test r2: {}'.format(
    r2_score(np.exp(y_test), np.exp(pred))))
print()

print('Median house price: ', int(np.exp(y_train).median()))
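
The train and test blocks above repeat the same three computations. As a sketch, they can be wrapped in one function that undoes the log transform and returns all three metrics. The helper `report_metrics` is hypothetical, and it is written with NumPy only so that the formulas are visible; it computes the same quantities that `mean_squared_error` and `r2_score` return. The prices below are made up.

```python
import numpy as np

def report_metrics(y_log, pred_log):
    """Return mse, rmse and r2 after undoing the log transform with np.exp."""
    y_true = np.exp(np.asarray(y_log, dtype=float))
    y_pred = np.exp(np.asarray(pred_log, dtype=float))
    mse = np.mean((y_true - y_pred) ** 2)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {"mse": mse, "rmse": np.sqrt(mse), "r2": 1.0 - ss_res / ss_tot}

# Made-up log prices and predictions, for illustration only.
y_log = np.log([100_000.0, 200_000.0, 300_000.0])
perfect = report_metrics(y_log, y_log)        # predictions equal the target
shifted = report_metrics(y_log, y_log + 0.1)  # predictions too high on the log scale
print(perfect)
print(shifted)
```

Note that a constant offset on the log scale becomes a multiplicative error after `np.exp`, so expensive houses contribute more to the mse than cheap ones.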

In [None]:
# let's evaluate our predictions with respect to the real sale price
# (both axes are on the log scale, since the target is log-transformed)
import matplotlib.pyplot as plt

plt.scatter(y_test, lin_model.predict(X_test))
plt.xlabel('True House Price')
plt.ylabel('Predicted House Price')
plt.title('Evaluation of Lasso Predictions')

In [None]:
# let's evaluate the distribution of the errors: 
# they should be fairly normally distributed
errors = y_test - lin_model.predict(X_test)
errors.hist(bins=30)
