# Moscow housing prediction [Short]

Written by<br> 

Hauk Aleksander Olaussen - **Student ID: *504903*** <br>

Charbel Badr - **Student ID: *557389***<br>

Noran Baskaran - **Student ID: *504877*** <br>

Kaggle competition name: **Moscow housing**

Kaggle team name: **Team 69**

## Introduction

This short notebook contains the code and comments needed to reproduce our best-performing model on the public test set. It uses the `Preprocessor` and `TestModel` classes from `Preprocessor.py` and `TestModel.py`, just as in **long.ipynb**. We do no exploration or feature engineering here beyond calling the functions in `Preprocessor.py` that do this work. We hope this makes for a readable model with as little "code gore" as possible.

In [3]:
%load_ext autoreload

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [4]:
%autoreload
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from Preprocessor import Preprocessor
from TestModel import TestModel
import lightgbm as lgbm
import pandas as pd
import numpy as np
import xgboost

preprocessor = Preprocessor()

## Extracting the data

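As the data is stored in *.csv* files, it has to be read into a `pandas` DataFrame, which makes the later preprocessing easier. This is done in the same way as you saw in the long notebook: the `Preprocessor` exposes the result as `merged` and `merged_test`, and below we concatenate the two so that training and test data go through the same preprocessing.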

In [5]:
merged_final = pd.concat([preprocessor.merged.copy(), preprocessor.merged_test.copy()], ignore_index=True)
print(f"The dataframe now has {len(merged_final)} entries")

The dataframe now has 33222 entries
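As a rough illustration of the reading step, the sketch below loads two small CSV sources and joins apartments to their buildings. The column names, the join key and the in-memory CSV contents are assumptions made for the example; the actual logic lives in `Preprocessor.py` and may differ.

```python
import io
import pandas as pd

# Tiny in-memory stand-ins for the competition CSV files (hypothetical contents).
apartments_csv = io.StringIO(
    "id,price,area_total,building_id\n"
    "0,7139520.0,59.2,4076\n"
    "1,10500000.0,88.0,1893\n"
)
buildings_csv = io.StringIO(
    "id,latitude,longitude\n"
    "4076,55.544046,37.478055\n"
    "1893,55.861282,37.666647\n"
)

apartments = pd.read_csv(apartments_csv)
buildings = pd.read_csv(buildings_csv)

# Assumed join: each apartment row points at its building through building_id.
merged = apartments.merge(
    buildings.rename(columns={"id": "building_id"}),
    on="building_id",
    how="left",
)
print(merged)
```

A left join keeps every apartment row even if its building were missing, which is usually what you want before imputing values.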


Let's run the preprocessing functions.

In [6]:
general_removed = preprocessor.general_removal(merged_final.copy()) # Removes "layout", "ceiling", "balconies", "loggias", "elevator_without", "street", "address"
data = preprocessor.remove_NaNs(general_removed.copy())             # Fills in NaN values in different ways depending on the entry
data = preprocessor.impute_nans_with_building_id(data.copy(), ["condition"]) # Fills in NaNs for condition
# Removing outliers in lat-lon, as some of them are very far away
is_outlier = (data["longitude"] > 55) | (data["latitude"] < 54)
outliers = data.copy()[is_outlier]
data = preprocessor.fix_latlon_outliers(data.copy(), outliers)
data.head()

Unnamed: 0,id,seller,price,area_total,area_kitchen,area_living,floor,rooms,bathrooms_shared,bathrooms_private,...,longitude,district,constructed,material,stories,elevator_passenger,elevator_service,parking,garbage_chute,heating
0,0,3.0,7139520.0,59.2,12.5,31.0,2.0,2.0,0.0,2.0,...,37.478055,11.0,2021.0,3.0,9.0,1.0,1.0,1.0,-1.0,-1.0
1,1,-1.0,10500000.0,88.0,14.2,48.0,18.0,3.0,2.0,0.0,...,37.666647,2.0,2010.0,3.0,25.0,1.0,1.0,1.0,-1.0,0.0
2,2,3.0,9019650.0,78.5,22.5,40.8,12.0,3.0,0.0,2.0,...,37.515335,6.0,2021.0,3.0,15.0,1.0,1.0,1.0,-1.0,-1.0
3,3,-1.0,10500000.0,88.0,14.0,48.0,18.0,3.0,0.0,2.0,...,37.666647,2.0,2010.0,3.0,25.0,1.0,1.0,1.0,-1.0,0.0
4,4,-1.0,13900000.0,78.0,17.0,35.0,7.0,2.0,1.0,0.0,...,37.451438,11.0,2017.0,2.0,15.0,1.0,1.0,1.0,0.0,0.0
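One possible way to repair coordinate outliers like those selected by the mask above is to replace them with the median coordinates of the in-range rows in the same district. This is only a sketch under that assumption; `fix_latlon_by_district` is a hypothetical helper, and `fix_latlon_outliers` in `Preprocessor.py` may use a different strategy.

```python
import pandas as pd

def fix_latlon_by_district(df):
    """Replace out-of-range coordinates with the district median (assumed strategy)."""
    out = df.copy()
    # Same mask as in the notebook cell above.
    is_outlier = (out["longitude"] > 55) | (out["latitude"] < 54)
    # Median coordinates per district, computed from in-range rows only.
    medians = (
        out.loc[~is_outlier]
        .groupby("district")[["latitude", "longitude"]]
        .median()
    )
    for col in ["latitude", "longitude"]:
        out.loc[is_outlier, col] = out.loc[is_outlier, "district"].map(medians[col])
    return out

# Toy data: the third row is far away from Moscow.
toy = pd.DataFrame({
    "district": [2.0, 2.0, 2.0],
    "latitude": [55.80, 55.90, 17.14],
    "longitude": [37.60, 37.70, 132.76],
})
fixed = fix_latlon_by_district(toy)
print(fixed)
```

Rows whose district has no in-range neighbours would end up as NaN with this approach and would need a fallback.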


## Adding features

Now let's start adding the features we want for the model. We start with the distance features.

In [7]:
# Adding distance features
distance_df = preprocessor.combine_latlon_subway(data.copy(), read_from_file=True)  # Subway
distance_df = preprocessor.closest_hospital(distance_df.copy())                     # Hospital
distance_df = preprocessor.closest_park(distance_df.copy())                         # Parks 
distance_df = preprocessor.distance_luxury_village(distance_df.copy())              # Luxury village
distance_df = preprocessor.closest_uni(distance_df.copy())                          # University
distance_df = preprocessor.distance_from_golden_mile(distance_df.copy())            # Golden mile, distance
distance_df = preprocessor.distance_from_khamovniki_center(distance_df.copy())      # Khamovniki, distance
distance_df = preprocessor.combine_latlon(distance_df.copy())                       # Red square
distance_df = preprocessor.closest_powerplant(distance_df.copy())                   # Power plant
print("Distance features added")

Distance features added
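Distance features of this kind are typically great-circle distances between an apartment and a point of interest. The sketch below uses the haversine formula; the Red Square coordinates are approximate and the function name is hypothetical, so the result will not necessarily match the `distance_center` column produced by `Preprocessor.py`.

```python
import math

# Approximate (latitude, longitude) of Red Square; an assumption for this example.
RED_SQUARE = (55.7539, 37.6208)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

# Coordinates of the first apartment shown in the tables of this notebook.
distance_center = haversine_km(55.544046, 37.478055, *RED_SQUARE)
print(f"Distance to Red Square: {distance_center:.2f} km")
```

The same function can be reused for subways, parks or hospitals by taking the minimum distance over a list of candidate coordinates.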


We continue by adding combined features.

In [8]:
# Creating the new features
featured = preprocessor.combine_windows(distance_df.copy(), boolean=True) # has_windows feature
featured = preprocessor.redo_new(featured.copy())                         # Fixing new feature before combining
featured = preprocessor.combine_new_constructed_distance(featured.copy()) # scaled_constructed feature
featured = preprocessor.floor_fraction(featured.copy())                   # floor_fraction feature
featured = preprocessor.combine_floor_stories(featured.copy())            # floor_stories feature
featured = preprocessor.area_distance(featured.copy())                    # area_distance feature
featured = preprocessor.combine_baths(featured.copy())                    # bathroom_amount feature
featured.head()

Unnamed: 0,id,seller,price,area_total,area_kitchen,area_living,floor,rooms,bathrooms_shared,bathrooms_private,...,distance_golden_mile,distance_khamovniki_center,distance_center,closest_powerplant,has_windows,scaled_constructed,floor_fraction,floor_stories,area_distance,bathroom_amount
0,0,3.0,7139520.0,59.2,12.5,31.0,2.0,2.0,0.0,2.0,...,22.936124,21.990803,25.02208,10.364623,True,2021.0,0.222222,18.0,0.171074,2.0
1,1,-1.0,10500000.0,88.0,14.2,48.0,18.0,3.0,2.0,0.0,...,14.282081,15.264384,12.267029,7.957621,True,1999.059871,0.72,450.0,0.106986,2.0
2,2,3.0,9019650.0,78.5,22.5,40.8,12.0,3.0,0.0,2.0,...,9.839135,8.719216,12.060134,6.032235,True,2021.0,0.8,180.0,0.119716,2.0
3,3,-1.0,10500000.0,88.0,14.0,48.0,18.0,3.0,0.0,2.0,...,14.282081,15.264384,12.267029,7.957621,True,1999.059871,0.72,450.0,0.106986,2.0
4,4,-1.0,13900000.0,78.0,17.0,35.0,7.0,2.0,1.0,0.0,...,18.83956,17.732445,21.041782,11.287607,True,2013.007917,0.466667,105.0,0.127619,1.0
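Judging from the printed rows, `floor_fraction` appears to be `floor / stories`, `floor_stories` appears to be `floor * stories`, and `bathroom_amount` appears to be the sum of the two bathroom columns. The sketch below reproduces the first three rows under that assumption; the real functions in `Preprocessor.py` may treat missing values (encoded as -1) differently.

```python
import pandas as pd

# First three rows of the relevant columns, copied from the table above.
rows = pd.DataFrame({
    "floor": [2.0, 18.0, 12.0],
    "stories": [9.0, 25.0, 15.0],
    "bathrooms_shared": [0.0, 2.0, 0.0],
    "bathrooms_private": [2.0, 0.0, 2.0],
})

# Relative height of the apartment within its building.
rows["floor_fraction"] = rows["floor"] / rows["stories"]
# Interaction term between floor and building height.
rows["floor_stories"] = rows["floor"] * rows["stories"]
# Total number of bathrooms, regardless of type.
rows["bathroom_amount"] = rows["bathrooms_shared"] + rows["bathrooms_private"]

print(rows[["floor_fraction", "floor_stories", "bathroom_amount"]])
```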


## Removing features

Let's now remove the features we do not intend to keep.

In [9]:
to_be_removed = ["rooms", "floor", "garbage_chute", "seller", "phones", "area_kitchen", "area_living", "id", "constructed", "bathrooms_private", "bathrooms_shared", "new", "windows_street", "stories", "windows_court", "elevator_passenger", "elevator_service"]
removed = preprocessor.remove_labels(featured.copy(), to_be_removed)
removed.head()

Unnamed: 0,price,area_total,condition,building_id,latitude,longitude,district,material,parking,heating,...,distance_golden_mile,distance_khamovniki_center,distance_center,closest_powerplant,has_windows,scaled_constructed,floor_fraction,floor_stories,area_distance,bathroom_amount
0,7139520.0,59.2,-1.0,4076,55.544046,37.478055,11.0,3.0,1.0,-1.0,...,22.936124,21.990803,25.02208,10.364623,True,2021.0,0.222222,18.0,0.171074,2.0
1,10500000.0,88.0,3.0,1893,55.861282,37.666647,2.0,3.0,1.0,0.0,...,14.282081,15.264384,12.267029,7.957621,True,1999.059871,0.72,450.0,0.106986,2.0
2,9019650.0,78.5,-1.0,5176,55.663299,37.515335,6.0,3.0,1.0,-1.0,...,9.839135,8.719216,12.060134,6.032235,True,2021.0,0.8,180.0,0.119716,2.0
3,10500000.0,88.0,2.0,1893,55.861282,37.666647,2.0,3.0,1.0,0.0,...,14.282081,15.264384,12.267029,7.957621,True,1999.059871,0.72,450.0,0.106986,2.0
4,13900000.0,78.0,3.0,6604,55.590785,37.451438,11.0,2.0,1.0,0.0,...,18.83956,17.732445,21.041782,11.287607,True,2013.007917,0.466667,105.0,0.127619,1.0


## Logifying skewed data

We take the base-10 logarithm of `area_total` and `price`, as both are quite skewed originally.

In [10]:
finished = preprocessor.log10ify(removed.copy(), "area_total")
finished = preprocessor.log10ify(finished.copy(), "price")
finished.head()

Unnamed: 0,price,area_total,condition,building_id,latitude,longitude,district,material,parking,heating,...,distance_golden_mile,distance_khamovniki_center,distance_center,closest_powerplant,has_windows,scaled_constructed,floor_fraction,floor_stories,area_distance,bathroom_amount
0,6.853669,1.772322,-1.0,4076,55.544046,37.478055,11.0,3.0,1.0,-1.0,...,22.936124,21.990803,25.02208,10.364623,True,2021.0,0.222222,18.0,0.171074,2.0
1,7.021189,1.944483,3.0,1893,55.861282,37.666647,2.0,3.0,1.0,0.0,...,14.282081,15.264384,12.267029,7.957621,True,1999.059871,0.72,450.0,0.106986,2.0
2,6.95519,1.89487,-1.0,5176,55.663299,37.515335,6.0,3.0,1.0,-1.0,...,9.839135,8.719216,12.060134,6.032235,True,2021.0,0.8,180.0,0.119716,2.0
3,7.021189,1.944483,2.0,1893,55.861282,37.666647,2.0,3.0,1.0,0.0,...,14.282081,15.264384,12.267029,7.957621,True,1999.059871,0.72,450.0,0.106986,2.0
4,7.143015,1.892095,3.0,6604,55.590785,37.451438,11.0,2.0,1.0,0.0,...,18.83956,17.732445,21.041782,11.287607,True,2013.007917,0.466667,105.0,0.127619,1.0
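The transform itself is a plain base-10 logarithm, which can be undone by raising 10 to the transformed value. A minimal sketch using the first two prices from the tables above (how `log10ify` handles zeros or missing values is not shown here and is not assumed):

```python
import numpy as np

# Prices of the first two apartments, taken from the tables above.
price = np.array([7139520.0, 10500000.0])

log_price = np.log10(price)   # compresses the long right tail
restored = 10 ** log_price    # inverse transform back to the original scale

print(log_price)
print(restored)
```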


## Preparing for the model

We will now split the finished DataFrame back into training and test data. The training data contains prices; we divide them by the total area to get a price per unit of area, stored in the variable *labels* below, and use this as the learning target. The test data has no prices, as predicting them is the task at hand. We will use a combination of *XGBoost* and *LGBM* for our prediction. The configurations were found using Microsoft's AutoML library *FLAML*.

#### XGBoost

In [42]:
train_data = finished[:23285].copy()                 # Data used for fitting
test_data = finished[23285:].copy()                  # Actual data we want to predict

train_data = preprocessor.fix_categorical(train_data.copy()) # Changes the type of categorical data to "category" instead of float
test_data = preprocessor.fix_categorical(test_data.copy())   # Changes the type of categorical data to "category" instead of float

train_area = data[:23285]["area_total"]              # Extracting the area_total from data. It will be used to convert back to actual prices later
test_area = data[23285:]["area_total"]               # Same as above, but for test set

labels = data[:23285]["price"] / train_area          # Learning labels

train_data.drop("price", axis=1, inplace=True)       # Removing labels from training set
test_data.drop("price", axis=1, inplace=True)        # Removing the price column from the test set, even though it is all NaN
print("Number of features:", train_data.shape[1])

Number of features: 24
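The target used here is the price divided by the total area, and a prediction is turned back into a price by multiplying with the area again. The round trip can be sketched as follows, with a stand-in "prediction" that simply copies the labels (an assumption for illustration, not a model):

```python
import pandas as pd

# Raw (not log-transformed) values for the first two apartments.
price = pd.Series([7139520.0, 10500000.0])
area_total = pd.Series([59.2, 88.0])

# Learning target: price per unit of area.
labels = price / area_total

# Stand-in for model output; a real model would predict these values.
predicted_per_area = labels.copy()

# Convert back to absolute prices.
predicted_price = predicted_per_area * area_total
print(predicted_price)
```

Note that the area must come from the untransformed `data` frame, as in the cell above, because `area_total` in `finished` has already been log-transformed.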


In [30]:
xgboost_model = TestModel(train_data, labels)
xgb_config = {'n_estimators': 357, 'max_leaves': 129, 'min_child_weight': 0.976312019733583, 'learning_rate': 0.08053621844316501, 'subsample': 0.8781423948465318, 'colsample_bylevel': 0.24865656018408627, 'colsample_bytree': 0.5602094440842953, 'reg_alpha': 0.050673629557973825, 'reg_lambda': 0.03207420675586682}
xgb_model = xgboost_model.xgboost_model(xgb_config)
xgb_pipe = Pipeline([("scaler", StandardScaler()), ("xgb", xgb_model)])

In [31]:
print("Fitting XGBoost model...")
xgb_fitted = xgb_pipe.fit(train_data, labels)
print("XGBoost fitting complete!")

Fitting XGBoost model...
XGBoost fitting complete!


In [32]:
xgb_prediction = xgb_fitted.predict(test_data)
xgb_predicted_prices = xgb_prediction * test_area
print("XGBoost prediction finished!")
xgb_predicted_prices

XGBoost prediction finished!


23285    2.799290e+07
23286    9.985424e+06
23287    6.395052e+06
23288    8.791958e+06
23289    5.267879e+06
             ...     
33217    2.802333e+07
33218    1.973350e+07
33219    1.008596e+07
33220    1.030423e+07
33221    6.369004e+06
Name: area_total, Length: 9937, dtype: float64

#### LGBM

In [37]:
lgbm_model = TestModel(train_data, labels)
lgbm_config = {'n_estimators': 787, 'num_leaves': 361, 'min_child_samples': 5, 'learning_rate': 0.046914757916267216, 'log_max_bin': 10, 'colsample_bytree': 0.6840631739505604, 'reg_alpha': 0.006372690788270238, 'reg_lambda': 0.0016192368145899812}
lgbm_model = lgbm_model.lgbm_model(lgbm_config)
lgbm_pipe = Pipeline([("scaler", StandardScaler()), ("lgbm", lgbm_model)])

In [38]:
print("Fitting LGBM model...")
lgbm_fitted = lgbm_pipe.fit(train_data, labels)
print("LGBM fitting complete!")

Fitting LGBM model...
LGBM fitting complete!


In [39]:
lgbm_prediction = lgbm_fitted.predict(test_data)
lgbm_predicted_prices = lgbm_prediction * test_area
print("LGBM  prediction finished!")
lgbm_predicted_prices

LGBM  prediction finished!


23285    2.963969e+07
23286    1.000110e+07
23287    6.174860e+06
23288    8.402287e+06
23289    5.167484e+06
             ...     
33217    3.361625e+07
33218    2.171927e+07
33219    9.362633e+06
33220    1.002877e+07
33221    7.198080e+06
Name: area_total, Length: 9937, dtype: float64

#### Combining predictions

We now combine the results from the two models by averaging their predicted prices.

In [40]:
combined_prediction = (xgb_predicted_prices + lgbm_predicted_prices) / 2
xgboost_model.save_predictions(list(combined_prediction))

Unnamed: 0,id,price_prediction
0,23285,2.881629e+07
1,23286,9.993262e+06
2,23287,6.284956e+06
3,23288,8.597122e+06
4,23289,5.217682e+06
...,...,...
9932,33217,3.081979e+07
9933,33218,2.072638e+07
9934,33219,9.724298e+06
9935,33220,1.016650e+07
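The averaging step can be checked by hand on the first two predictions printed above. The sketch also shows the equivalent weighted form, which would make it possible to favour one model; the equal weights are the only setting used in this notebook.

```python
import numpy as np

# First two predicted prices from each model, copied from the outputs above.
xgb_prices = np.array([2.799290e+07, 9.985424e+06])
lgbm_prices = np.array([2.963969e+07, 1.000110e+07])

# Plain average, as used in the notebook.
combined = (xgb_prices + lgbm_prices) / 2

# Equivalent weighted average; other weights are a possible variation.
weights = np.array([0.5, 0.5])
combined_weighted = np.average(
    np.vstack([xgb_prices, lgbm_prices]), axis=0, weights=weights
)

print(combined)
```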


# Current

best model xgboost configs {'n_estimators': 911, 'max_leaves': 275, 'min_child_weight': 8.554398319481574, 'learning_rate': 0.13797523454204058, 'subsample': 0.7689953576765899, 'colsample_bylevel': 0.3710959011675493, 'colsample_bytree': 0.4323890189027554, 'reg_alpha': 0.016828526535595058, 'reg_lambda': 2.800495230922604} 

best model lgbm configs {'n_estimators': 1025, 'num_leaves': 89, 'min_child_samples': 6, 'learning_rate': 0.07207849734387042, 'log_max_bin': 7, 'colsample_bytree': 0.5911916499145919, 'reg_alpha': 0.0009765625, 'reg_lambda': 0.017810670616715204}

In [45]:
train_data = finished[:23285].copy()
test_data = finished[23285:].copy()
test_data.drop("price", axis=1, inplace=True)


train_area = data[:23285]["area_total"]
test_area = data[23285:]["area_total"]

labels = data[:23285]["price"] / train_area 

# Splitting training data into training and validation sets; the price column is removed from both afterwards.
# Note: test_size=0.0002 leaves only about 5 of the 23285 rows for validation.
x_train, x_test, y_train, y_test = train_test_split(train_data, labels, stratify=train_data.price.round(), test_size=0.0002)
x_train.drop("price", axis=1, inplace=True)
x_test.drop("price", axis=1, inplace=True)
x_train = preprocessor.fix_categorical(x_train.copy()) # Changes the type of categorical data to "category" instead of float
x_test = preprocessor.fix_categorical(x_test.copy()) # Changes the type of categorical data to "category" instead of float

print("Number of features for LGBM:", x_train.shape[1])

Number of features for LGBM: 24


In [54]:
xgboost_model = TestModel(x_train, y_train)
best_xgboost = xgboost_model.autoMLfit(x_train, y_train, ["xgboost"], time=60, metric="auto")
xgboost_pred = [pred for pred in xgboost_model.autoMLpredict(x_test)] * (10 ** x_test["area_total"])

lgbm_model = TestModel(y_train, x_train)
best_lgbm = lgbm_model.autoMLfit(x_train, y_train, ["lgbm"], time=60, metric="auto")
lgbm_pred = [pred for pred in lgbm_model.autoMLpredict(x_test)] * (10 ** x_test["area_total"])

[flaml.automl: 11-19 04:38:44] {1485} INFO - Data split method: uniform
[flaml.automl: 11-19 04:38:44] {1489} INFO - Evaluation method: holdout
[flaml.automl: 11-19 04:38:44] {1540} INFO - Minimizing error metric: 1-r2
[flaml.automl: 11-19 04:38:44] {1577} INFO - List of ML learners in AutoML Run: ['xgboost']
[flaml.automl: 11-19 04:38:44] {1826} INFO - iteration 0, current learner xgboost
[flaml.automl: 11-19 04:38:44] {1943} INFO - Estimated sufficient time budget=480s. Estimated necessary time budget=0s.
[flaml.automl: 11-19 04:38:44] {2023} INFO -  at 0.2s,	estimator xgboost's best error=1.7146,	best estimator xgboost's best error=1.7146
[flaml.automl: 11-19 04:38:44] {1826} INFO - iteration 1, current learner xgboost
[flaml.automl: 11-19 04:38:44] {2023} INFO -  at 0.3s,	estimator xgboost's best error=1.7146,	best estimator xgboost's best error=1.7146
[flaml.automl: 11-19 04:38:44] {1826} INFO - iteration 2, current learner xgboost
[flaml.automl: 11-19 04:38:44] {2023} INFO -  at 

In [57]:
xgboost_model.autoML_print_best_model()
lgbm_model.autoML_print_best_model()

best model xgboost
configs {'n_estimators': 88, 'max_leaves': 557, 'min_child_weight': 1.860805925804934, 'learning_rate': 0.11127517916769668, 'subsample': 0.9377924838515587, 'colsample_bylevel': 0.6649678342694664, 'colsample_bytree': 0.7031411822491804, 'reg_alpha': 0.001461499947717485, 'reg_lambda': 5.861545273673945}
best model lgbm
configs {'n_estimators': 156, 'num_leaves': 849, 'min_child_samples': 2, 'learning_rate': 0.08127231638894393, 'log_max_bin': 9, 'colsample_bytree': 0.37372100070460673, 'reg_alpha': 0.008393316190801551, 'reg_lambda': 0.006602183664100191}


In [58]:
combined = (xgboost_pred + lgbm_pred) / 2
test_labels = [lab for lab in y_test] * (10 ** x_test["area_total"])

print("RMSLE for XGBoost model: %s" % xgboost_model.root_mean_squared_log_error(test_labels, xgboost_pred))
print("RMSLE for LGBM model: %s\n" % xgboost_model.root_mean_squared_log_error(test_labels, lgbm_pred))

print("RMSLE for combined models: %s\n" % xgboost_model.root_mean_squared_log_error(test_labels, combined))

# Predictions on the actual test set, converted back to prices
lgbm_test_pred = [pred for pred in lgbm_model.autoMLpredict(test_data)] * (10 ** test_data["area_total"])
xgb_test_pred = [pred for pred in xgboost_model.autoMLpredict(test_data)] * (10 ** test_data["area_total"])
c = (lgbm_test_pred + xgb_test_pred) / 2

RMSLE for XGBoost model: 0.1181547251369067
RMSLE for LGBM model: 0.08117782501252357

RMSLE for combined models: 0.07620114963276778
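For reference, the root mean squared logarithmic error is commonly defined as the RMSE between `log(1 + prediction)` and `log(1 + truth)`. The sketch below follows that common definition; whether `TestModel.root_mean_squared_log_error` is implemented exactly this way is an assumption.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x) on both inputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    squared_log_error = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    return float(np.sqrt(np.mean(squared_log_error)))

# Toy usage with two prices and slightly different predictions.
score = rmsle([7139520.0, 10500000.0], [7000000.0, 11000000.0])
print(f"RMSLE: {score:.4f}")
```

Because the error is measured on a log scale, it penalises relative rather than absolute deviations, which suits prices that span several orders of magnitude.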



In [59]:
xgboost_model.save_predictions(list(c))

Unnamed: 0,id,price_prediction
0,23285,2.888993e+07
1,23286,9.725354e+06
2,23287,6.150077e+06
3,23288,8.316935e+06
4,23289,5.116972e+06
...,...,...
9932,33217,3.164303e+07
9933,33218,2.048860e+07
9934,33219,9.672864e+06
9935,33220,9.306851e+06
