In [1]:
import pandas as pd
import numpy as np
import pickle

from preprocessing import *
from scipy import spatial
from sklearn.ensemble import RandomForestRegressor as RFR
import forestci as fci

Failed to import duecredit due to No module named 'duecredit'


# Data
The data contains only columns that we determined to be relevant for the randomforestregressor as well as a couple of potential comparable features, each of these are specified below.

Since the model had some disappointing results for logarithmic price, we also see how well it performs on regular price

In [23]:
df = pickle.load(open('../Data/reduced_df.p','rb'))
df = df.rename(columns={'V1.x': 'Postcode5'})
df2 = df.copy(deep=True)
df2['endprice'] = np.exp(df['endprice'])

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 66239 entries, 57585 to 43850
Data columns (total 16 columns):
livingspace                             66239 non-null float64
Gemiddelde woningwaarde:x 1 000 euro    66239 non-null float64
housetype                               66239 non-null category
Postcode5                               66239 non-null int64
lotsurface                              66239 non-null float64
yearofconstruction                      66239 non-null float64
longitude                               66239 non-null float64
latitude                                66239 non-null float64
housesubtype                            66239 non-null category
rooms                                   66239 non-null float64
bathroom.badkamer                       66239 non-null int64
feature.zwembad                         66239 non-null bool
bathroom.aparte toilet                  66239 non-null float64
balcony.balkon                          66239 non-null bool
feature.sauna

# Relevant Features
These are the 10 features that are most relevant for the random forest regressor as determined in previous notebooks. The user can enter these features manually, but for the purpose of this test, we pick them from the data.

In [7]:
RF = ['livingspace', 'Gemiddelde woningwaarde:x 1 000 euro', 'housetype', 'Postcode5',
      'lotsurface', 'yearofconstruction', 'longitude', 'latitude', 'housesubtype', 'rooms']

For the purpose of this notebook, we only look at 1 potential upgradable feature, namely the presence of a seperate toilet in the bathroom.

In [8]:
PUF = ['bathroom.aparte toilet']

Finally of course, we need the endprice as a target feature, to train our model and check our performance

In [9]:
TF = ['endprice']

We only keep the relevant features in our dataframe

In [11]:
df = df[RF+PUF+TF]
df2 = df[RF+PUF+TF]

## Calculating values from user input
One of the most important features for our predictive algorithm is de average value of houses in the neighbourhood. The user might not be aware of this value, but fortunately, we can predict it quite accurately by finding the house with the closest longitude, latitude and Postcode5 and simply taking their value instead.

In [14]:
CF = ['longitude', 'latitude', 'Postcode5']

def calculate_avg_housevalue(long, lat, post, data=df):
    tree = spatial.KDTree(data[CF].values)
    _, index = tree.query(np.array([long, lat, post]))
    return data.iloc[index]['Gemiddelde woningwaarde:x 1 000 euro']

# The model
We use a a random forest regression which was also used in previous notebooks to find the most important features.

In [20]:
m = RFR(n_estimators=500, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m2 = RFR(n_estimators=500, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)

In [16]:
X, y, _ = proc_df(df, 'endprice')

n_trn = len(X) // 2
n_valid = n_trn + (len(X) // 2) - 100
X_train, X_valid, X_test = split_vals_test(X, n_trn, n_valid)
y_train, y_valid, y_test = split_vals_test(y, n_trn, n_valid)

In [17]:
m.fit(X_train, y_train)
print_score(m, X_train, y_train, X_valid, y_valid)

[0.10236443300165127, 0.19644268035462187, 0.952838679255162, 0.8358577581207968, 0.8674103630259098]


In [18]:
X2, y2, _ = proc_df(df2, 'endprice')

n_trn2 = len(X2) // 2
n_valid2 = n_trn2 + (len(X2) // 2) - 100
X_train2, X_valid2, X_test2 = split_vals_test(X2, n_trn2, n_valid2)
y_train2, y_valid2, y_test2 = split_vals_test(y2, n_trn2, n_valid2)

In [19]:
m2.fit(X_train2, y_train2)
print_score(m2, X_train2, y_train2, X_valid2, y_valid2)

[0.10235963920338738, 0.19640841574229861, 0.9528430963472534, 0.8359150143114131, 0.8673557990912257]


# Predicting
Given the initial 10 values of our house + the number of seperate toilets in bathrooms, we make a prediction of the end price with the current number of toilets and with one extra toilet.

Our validation set has 76 houses with 0 toilets and 25 houses with 1 toilet.

In [13]:
X_test['bathroom.aparte toilet'].value_counts()

0.0    76
1.0    25
Name: bathroom.aparte toilet, dtype: int64

Below code demonstrates what would happen for a single house (e.g. one entered by the user)

First we look at our prediction for the house with the actual number of toilets it has (0)

In [14]:
user_house = X_test.iloc[0]
test = X_valid.append(user_house).copy()

In [15]:
pred_log_price = m.predict(np.array([user_house]))[0]
log_price = y_test[0]
error_margin = fci.random_forest_error(m, X_train, test)[-1]

In [16]:
print(np.mean(error_margin))
print(pred_log_price)

0.18525688557275355
13.300286513922472


In [17]:
np.exp(pred_log_price-log_price)

1.197127741813447

In [18]:
np.exp(log_price)

499000.0000000004

In [19]:
np.exp(pred_log_price)

597366.7431649105

We now expect our prediction to be somewhere between pred_price + error_margin and pred_price - error_margin. Note that these error margins correspond to single sigma so a ~65% confidence interval, for a ~95% confidence interval, we would need to do the error_margins times 2.

In [20]:
print("Lower bound: ", np.exp(pred_log_price-error_margin))
print("Actual price: ", np.exp(log_price))
print("Upper bound: ", np.exp(pred_log_price+error_margin))

Lower bound:  496346.5386318834
Actual price:  499000.0000000004
Upper bound:  718947.3443756774


Next we look at our predicion if it the user house had an extra toilet (1)

In [21]:
user_house_et = user_house.copy()
user_house_et['bathroom.aparte toilet'] += 1
test_et = X_test.append(user_house_et).copy()

In [22]:
pred_log_price = m.predict(np.array([user_house_et]))[0]
error_margin = fci.random_forest_error(m, X_train, test_et)[-1]

In [23]:
pred_log_price-error_margin

13.208052644825097

In [24]:
print("Lower bound: ", np.exp(pred_log_price-error_margin))
print("Predicted price: ", np.exp(pred_log_price))
print("Upper bound: ", np.exp(pred_log_price+error_margin))

Lower bound:  544733.8712907014
Predicted price:  603085.6258470296
Upper bound:  667688.0055971507
