# ImmoEliza - XGBoost

After exploring several models, XGBoost gave the best results, so this notebook contains a separate workflow for this model.

If you want to rerun the model for a different property type (see below), you will need to import the dataframe again.

**Caveats:**
* At the moment this model only includes numerical variables (such as the number of bedrooms or the plot area), not categorical variables (such as building condition, heating type or EPC).
* The model does not use any form of scaling.
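
The first caveat could be addressed by one-hot encoding the categorical columns before training. The following is only a minimal sketch on a toy dataframe: the helper name `encode_categoricals` and the toy values are assumptions for illustration and are not part of this notebook's pipeline.

```python
import pandas as pd


def encode_categoricals(frame, categorical_columns):
    """Return a copy of `frame` with the given columns one-hot encoded."""
    # pd.get_dummies replaces each listed column by one indicator column per category
    return pd.get_dummies(frame, columns=categorical_columns, dtype=int)


# Toy data using column names that also occur in the real dataset (values are illustrative)
toy = pd.DataFrame({
    "bedrooms": [3, 5, 2],
    "building_condition": ["Good", "As new", "Good"],
    "heating_type": ["Fuel oil", "Electric", "Electric"],
})

encoded = encode_categoricals(toy, ["building_condition", "heating_type"])
print(encoded.columns.tolist())
```

The encoded columns could then be appended to the numerical feature list, at the cost of a wider feature matrix.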

In [1]:
#import libraries 
import re
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xg

In [2]:
#import dataframe and display first rows and the total number of columns and rows
df = pd.read_csv('../data/saved_data.csv', index_col=0)
display(df.head())
df.shape

Unnamed: 0,property_type,property_subtype,immo_code,location,postcode,province,region,price,address,energy_class,...,construction_year,building_condition,bedrooms,bathrooms,shower_rooms,office,toilets,kitchen_type,heating_type,price_orig
0,house,house,10666428,essen,2910,Antwerpen,Flanders,469000,Heidestatiestraat 26,D,...,0,0,3,1,0,0,2,Installed,0,€469000
1,house,house,10666426,bierges,1301,Brabant wallon,Wallonia,1395000,"Rue de Wavre, 27",C,...,1988,Good,5,2,3,1,4,Hyper equipped,Fuel oil,€1395000
2,apartment,apartment,10666424,arlon,6700,Luxembourg,Wallonia,285000,Avenue de Longwy 340,D,...,0,Good,2,2,0,0,2,Installed,Electric,€285000
3,house,house,10666423,arlon,6700,Luxembourg,Wallonia,285000,Avenue de Longwy 340,D,...,0,Good,2,2,0,0,2,Installed,Electric,€285000
4,apartment,apartment,10666422,antwerp,2000,Antwerpen,Flanders,425000,"Grote steenweg, 511",C,...,0,As new,2,1,0,0,0,Installed,0,€425000


(6545, 27)

### Select properties by (sub)type
A kot (student room) can hardly be compared to a castle, so the model can be run for different property (sub)types. It is limited to residential properties; commercial and industrial buildings are excluded. It uses the types and subtypes defined by Immoweb. There are two main types, 'house' and 'apartment', with the following subtypes:

* house: house, villa, mixed-use-building, exceptional-property, other-property, country-cottage, mansion, town-house, bungalow, farmhouse, new-real-estate-project-houses, chalet, manor-house, castle
* apartment: apartment, service-flat, duplex, triplex, apartment-block, loft, ground-floor, new-real-estate-project-apartments, kot, penthouse, flat-studio

By changing the code block below, you can select which (sub)types to include (==) or exclude (!=). All three filters are commented out here, so the model runs on the full dataset; uncomment the last line to exclude castles. If you want to change the selection after applying a filter, first reload the original data by rerunning the code block above.

In [3]:

#df = df[df.property_type == 'apartment']
#df = df[df.property_type == 'house']
#df = df[df.property_subtype != 'castle'] # uncomment to exclude castles from the model
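
As an alternative to overwriting `df`, the selection could be written as a small helper that returns a filtered copy, which avoids having to reload the CSV between runs. This is a sketch under assumptions: `select_subtypes` is a hypothetical helper, and the toy rows are made up for illustration.

```python
import pandas as pd


def select_subtypes(frame, include=None, exclude=None):
    """Keep rows whose property_subtype is in `include` and not in `exclude`."""
    mask = pd.Series(True, index=frame.index)
    if include is not None:
        mask &= frame["property_subtype"].isin(include)
    if exclude is not None:
        mask &= ~frame["property_subtype"].isin(exclude)
    # Return a copy so the original dataframe stays untouched
    return frame[mask].copy()


# Toy data (illustrative values only)
toy = pd.DataFrame({
    "property_subtype": ["house", "castle", "kot", "villa"],
    "price": [469000, 2500000, 95000, 1395000],
})

no_castles = select_subtypes(toy, exclude=["castle"])
print(no_castles["property_subtype"].tolist())
```

Because the helper never modifies its input, different selections can be compared in the same session.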

In [4]:
print(df.shape)
display(df.head())

(6545, 27)


Unnamed: 0,property_type,property_subtype,immo_code,location,postcode,province,region,price,address,energy_class,...,construction_year,building_condition,bedrooms,bathrooms,shower_rooms,office,toilets,kitchen_type,heating_type,price_orig
0,house,house,10666428,essen,2910,Antwerpen,Flanders,469000,Heidestatiestraat 26,D,...,0,0,3,1,0,0,2,Installed,0,€469000
1,house,house,10666426,bierges,1301,Brabant wallon,Wallonia,1395000,"Rue de Wavre, 27",C,...,1988,Good,5,2,3,1,4,Hyper equipped,Fuel oil,€1395000
2,apartment,apartment,10666424,arlon,6700,Luxembourg,Wallonia,285000,Avenue de Longwy 340,D,...,0,Good,2,2,0,0,2,Installed,Electric,€285000
3,house,house,10666423,arlon,6700,Luxembourg,Wallonia,285000,Avenue de Longwy 340,D,...,0,Good,2,2,0,0,2,Installed,Electric,€285000
4,apartment,apartment,10666422,antwerp,2000,Antwerpen,Flanders,425000,"Grote steenweg, 511",C,...,0,As new,2,1,0,0,0,Installed,0,€425000


### Define model variables


In [5]:
#int_columns=['furnished', 'terrace', 'terrace_surface', 'plot_surface', 'living_room_surface', 'frontages', 'construction_year', 'bedrooms', 'bathrooms', 'shower_rooms', 'office', 'toilets']
int_columns=['primary_energy_consumption', 'furnished', 'terrace', 'terrace_surface', 'plot_surface', 'living_room_surface', 'frontages', 'construction_year', 'bedrooms', 'bathrooms', 'shower_rooms', 'office', 'toilets']
#int_columns=['bedrooms', 'bathrooms', 'toilets']

X = df[int_columns].to_numpy()
y = df.price.to_numpy().reshape(-1,1)
print(type(X))
print(X)


<class 'numpy.ndarray'>
[[390   0   1 ...   0   0   2]
 [178   0   1 ...   3   1   4]
 [299   0   0 ...   0   0   2]
 ...
 [120   0   1 ...   0   0   0]
 [161   0   1 ...   0   0   0]
 [197   0   1 ...   3   1   4]]


### Split the dataset

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=41)
print("Shape of X_train: ", X_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of y_test: ", y_test.shape)

Shape of X_train:  (5236, 13)
Shape of X_test:  (1309, 13)
Shape of y_train:  (5236, 1)
Shape of y_test:  (1309, 1)


### Load and fit the model (with Sklearn)

In [7]:
XG_model = "XG Boost"

# 'reg:squarederror' is the current name of the deprecated 'reg:linear' objective
XG_regressor = xg.XGBRegressor(objective='reg:squarederror', n_estimators=10, random_state=20)
XG_regressor.fit(X_train, y_train)
# score() returns the coefficient of determination (R^2)
XG_train_score = XG_regressor.score(X_train, y_train)
XG_test_score = XG_regressor.score(X_test, y_test)
XG_y_pred = XG_regressor.predict(X_test)
# Root mean squared error; the squared=False argument is deprecated in recent scikit-learn releases
XG_R_MSE = np.sqrt(mean_squared_error(y_test, XG_y_pred))



In [8]:
# Plan for serving the model (pseudocode):
# 1. Save XG_regressor.
# 2. Import the saved model in the API.
# 3. Apply XG_regressor.predict to the API input.

### Scores

In [9]:
model_score_list = ["model", "train_score", "test_score", "R_MSE"]

# One row per model: name, R^2 on train and test, and root mean squared error
data = [[XG_model, XG_train_score, XG_test_score, XG_R_MSE]]
score_df = pd.DataFrame(data, columns=model_score_list)
display(score_df)

Unnamed: 0,model,train_score,test_score,R_MSE
0,XG Boost,0.771752,0.524155,326735.959876
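
To make the columns of this table concrete: `train_score` and `test_score` come from `XG_regressor.score`, which for scikit-learn style regressors returns the coefficient of determination (R²), and `R_MSE` is the root mean squared error in euros. The sketch below recomputes both with plain NumPy on three made-up prices; the function name `r2_and_rmse` and the numbers are assumptions for illustration.

```python
import numpy as np


def r2_and_rmse(y_true, y_pred):
    """Compute R^2 and root mean squared error for two equally long sequences."""
    # Flatten both inputs: subtracting an (n, 1) array from an (n,) array
    # would silently broadcast to an (n, n) matrix
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residual_ss = np.sum((y_true - y_pred) ** 2)
    total_ss = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - residual_ss / total_ss
    rmse = np.sqrt(residual_ss / y_true.size)
    return r2, rmse


# Made-up prices and predictions in euros
r2, rmse = r2_and_rmse([200000, 300000, 400000], [210000, 290000, 420000])
print(r2, rmse)
```

A train score well above the test score, as in the table above, suggests that the model fits the training data better than it generalises to unseen listings.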


### Save Model

In [10]:
XG_regressor.save_model('../models/xg_model.json')

### Get a random sample

In [22]:
test_columns = ["immo_code", "price"] + int_columns
# Draw one random row, keeping the listing id and price for reference
test_sample_full = df[test_columns].sample()
test_sample_full_dict = test_sample_full.to_dict(orient="dict")
display(test_sample_full_dict)
# Keep only the model features for prediction
test_sample = test_sample_full[int_columns].to_dict(orient="dict")
print(test_sample)


{'immo_code': {521: 10666229},
 'price': {521: 825000},
 'primary_energy_consumption': {521: 108},
 'furnished': {521: 0},
 'terrace': {521: 1},
 'terrace_surface': {521: 0},
 'plot_surface': {521: 0},
 'living_room_surface': {521: 0},
 'frontages': {521: 0},
 'construction_year': {521: 2003},
 'bedrooms': {521: 2},
 'bathrooms': {521: 2},
 'shower_rooms': {521: 0},
 'office': {521: 0},
 'toilets': {521: 1}}

{'primary_energy_consumption': {521: 108}, 'furnished': {521: 0}, 'terrace': {521: 1}, 'terrace_surface': {521: 0}, 'plot_surface': {521: 0}, 'living_room_surface': {521: 0}, 'frontages': {521: 0}, 'construction_year': {521: 2003}, 'bedrooms': {521: 2}, 'bathrooms': {521: 2}, 'shower_rooms': {521: 0}, 'office': {521: 0}, 'toilets': {521: 1}}


In [20]:

test_df = pd.DataFrame.from_dict(test_sample)
print(XG_regressor.predict(test_df))
display(test_df)

[356036.9]


Unnamed: 0,primary_energy_consumption,furnished,terrace,terrace_surface,plot_surface,living_room_surface,frontages,construction_year,bedrooms,bathrooms,shower_rooms,office,toilets
7594,286,0,0,0,580,30,3,1930,4,1,1,0,2
