# Moscow housing prediction [Short]

Written by<br> 

Hauk Aleksander Olaussen - **Student ID: *504903*** <br>

Charbel Badr - **Student ID: *557389***<br>

Noran Baskaran - **Student ID: *504877*** <br>

Kaggle competition name: **Moscow housing**

Kaggle team name: **Team 69**

## Introduction

This short notebook contains the code and comments needed to reproduce our best-performing model on the public test set. It uses the `Preprocessor.py` and `TestModel.py` classes, just as in **long.ipynb**. We do no exploration or feature engineering here beyond calling the functions in `Preprocessor.py` that do this work. We hope this makes for a readable model with as little "code gore" as possible.

In [24]:
%load_ext autoreload

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [25]:
%autoreload
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from Preprocessor import Preprocessor
from TestModel import TestModel
import lightgbm as lgbm
import pandas as pd
import numpy as np
import xgboost

preprocessor = Preprocessor()

## Extracting the data

As the data is stored in *.csv* files, we need to read it into a `pandas` DataFrame, which makes the later preprocessing easier. This is done in the same way as in the long notebook.

In [26]:
merged_final = pd.concat([preprocessor.merged.copy(), preprocessor.merged_test.copy()], ignore_index=True)
print(f"The dataframe now has {len(merged_final)} entries")

The dataframe now has 33222 entries


Let's run the preprocessing functions.

In [27]:
general_removed = preprocessor.general_removal(merged_final.copy()) # Removes "layout", "ceiling", "balconies", "loggias", "elevator_without", "street", "address"
data = preprocessor.remove_NaNs(general_removed.copy())             # Fills in NaN values in different ways depending on the entry
data = preprocessor.impute_nans_with_building_id(data.copy(), ["condition"]) # Fills in NaNs for condition
# Removing outliers in lat-lon, as some of them are very far away
is_outlier = (data["longitude"] > 55) | (data["latitude"] < 54)
outliers = data.copy()[is_outlier]
data = preprocessor.fix_latlon_outliers(data.copy(), outliers)
data.head()

Unnamed: 0,id,seller,price,area_total,area_kitchen,area_living,floor,rooms,bathrooms_shared,bathrooms_private,...,longitude,district,constructed,material,stories,elevator_passenger,elevator_service,parking,garbage_chute,heating
0,0,3.0,7139520.0,59.2,12.5,31.0,2.0,2.0,0.0,2.0,...,37.478055,11.0,2021.0,3.0,9.0,1.0,1.0,1.0,-1.0,-1.0
1,1,-1.0,10500000.0,88.0,14.2,48.0,18.0,3.0,2.0,0.0,...,37.666647,2.0,2010.0,3.0,25.0,1.0,1.0,1.0,-1.0,0.0
2,2,3.0,9019650.0,78.5,22.5,40.8,12.0,3.0,0.0,2.0,...,37.515335,6.0,2021.0,3.0,15.0,1.0,1.0,1.0,-1.0,-1.0
3,3,-1.0,10500000.0,88.0,14.0,48.0,18.0,3.0,0.0,2.0,...,37.666647,2.0,2010.0,3.0,25.0,1.0,1.0,1.0,-1.0,0.0
4,4,-1.0,13900000.0,78.0,17.0,35.0,7.0,2.0,1.0,0.0,...,37.451438,11.0,2017.0,2.0,15.0,1.0,1.0,1.0,0.0,0.0
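The outlier mask in the cell above can be illustrated without the dataset. The sketch below applies the same two thresholds to plain tuples; the second coordinate pair is a made-up far-away point, and how `fix_latlon_outliers` repairs the flagged rows is not shown here.

```python
def is_latlon_outlier(latitude, longitude):
    """Same thresholds as the mask above: Moscow lies near 55.75 N, 37.6 E."""
    return longitude > 55 or latitude < 54

coords = [
    (55.544046, 37.478055),  # first apartment in the data
    (17.14, -61.79),         # hypothetical far-away point, for illustration only
]
flags = [is_latlon_outlier(lat, lon) for lat, lon in coords]
print(flags)
```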


## Adding features

Now let's start adding the features we want for the model. We start with the distance features.

In [28]:
# Adding distance features
distance_df = preprocessor.combine_latlon_subway(data.copy(), read_from_file=True)  # Subway
distance_df = preprocessor.closest_hospital(distance_df.copy())                     # Hospital
distance_df = preprocessor.closest_park(distance_df.copy())                         # Parks 
distance_df = preprocessor.distance_luxury_village(distance_df.copy())              # Luxury village
distance_df = preprocessor.closest_uni(distance_df.copy())                          # University
distance_df = preprocessor.distance_from_golden_mile(distance_df.copy())            # Golden mile, distance
distance_df = preprocessor.distance_from_khamovniki_center(distance_df.copy())      # Khamovniki, distance
distance_df = preprocessor.combine_latlon(distance_df.copy())                       # Red square
distance_df = preprocessor.closest_powerplant(distance_df.copy())                   # Power plant
print("Distance features added")

Distance features added
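For readers without access to `Preprocessor.py`, a distance feature of this kind can be sketched with the haversine formula. This is only an illustration: the helper name `haversine_km` and the Red Square coordinates are assumptions, and the preprocessor may compute its distances differently.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Approximate coordinates of Red Square (an assumption, not taken from Preprocessor.py)
RED_SQUARE = (55.7539, 37.6208)

# First apartment in the data: latitude 55.544046, longitude 37.478055
distance_center = haversine_km(55.544046, 37.478055, *RED_SQUARE)
print(round(distance_center, 1), "km")
```

The same function can be reused for any point of interest (subway station, park, hospital) by taking the minimum distance over a list of candidate coordinates.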


We continue by adding combined features.

In [29]:
# Creating the new features
featured = preprocessor.combine_windows(distance_df.copy(), boolean=True) # has_windows feature
featured = preprocessor.redo_new(featured.copy())                         # Fixing new feature before combining
featured = preprocessor.combine_new_constructed_distance(featured.copy()) # scaled_constructed feature
featured = preprocessor.floor_fraction(featured.copy())                   # floor_fraction feature
featured = preprocessor.combine_floor_stories(featured.copy())            # floor_stories feature
featured = preprocessor.area_distance(featured.copy())                    # area_distance feature
featured = preprocessor.combine_baths(featured.copy())                    # bathroom_amount feature
featured = preprocessor.combine_area_rooms(featured.copy())               # avg_room_size, kitchen fraction and living_fraction features
featured.head()

Unnamed: 0,id,seller,price,area_total,area_kitchen,area_living,floor,rooms,bathrooms_shared,bathrooms_private,...,closest_powerplant,has_windows,scaled_constructed,floor_fraction,floor_stories,area_distance,bathroom_amount,avg_room_size,living_fraction,kitchen_fraction
0,0,3.0,7139520.0,59.2,12.5,31.0,2.0,2.0,0.0,2.0,...,10.364623,True,2021.0,0.222222,18.0,0.171074,2.0,7.4,0.523649,0.211149
1,1,-1.0,10500000.0,88.0,14.2,48.0,18.0,3.0,2.0,0.0,...,7.957621,True,1999.059871,0.72,450.0,0.106986,2.0,3.259259,0.545455,0.161364
2,2,3.0,9019650.0,78.5,22.5,40.8,12.0,3.0,0.0,2.0,...,6.032235,True,2021.0,0.8,180.0,0.119716,2.0,2.907407,0.519745,0.286624
3,3,-1.0,10500000.0,88.0,14.0,48.0,18.0,3.0,0.0,2.0,...,7.957621,True,1999.059871,0.72,450.0,0.106986,2.0,3.259259,0.545455,0.159091
4,4,-1.0,13900000.0,78.0,17.0,35.0,7.0,2.0,1.0,0.0,...,11.287607,True,2013.007917,0.466667,105.0,0.127619,1.0,9.75,0.448718,0.217949
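Several of the combined features can be checked by hand against the output above. The sketch below recomputes five of them for the first row; the formulas are inferred from the displayed values, not copied from `Preprocessor.py`, and the helper name `add_ratio_features` is hypothetical.

```python
def add_ratio_features(row):
    """Return a copy of `row` with a few ratio and combination features added."""
    out = dict(row)
    out["floor_fraction"] = row["floor"] / row["stories"]
    out["floor_stories"] = row["floor"] * row["stories"]
    out["living_fraction"] = row["area_living"] / row["area_total"]
    out["kitchen_fraction"] = row["area_kitchen"] / row["area_total"]
    out["bathroom_amount"] = row["bathrooms_shared"] + row["bathrooms_private"]
    return out

# Raw values of the first row in the table above
first_row = {
    "area_total": 59.2, "area_kitchen": 12.5, "area_living": 31.0,
    "floor": 2.0, "stories": 9.0,
    "bathrooms_shared": 0.0, "bathrooms_private": 2.0,
}
features = add_ratio_features(first_row)
for name in ("floor_fraction", "floor_stories", "living_fraction",
             "kitchen_fraction", "bathroom_amount"):
    print(name, round(features[name], 6))
```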


## Removing features

Let's now remove the features we do not intend to keep.

In [30]:
# "garbage_chute" is not in this list, so it stays in the data: it appears in the output below
# and counts toward the 28 features the models are trained on.
to_be_removed = ["rooms", "floor", "area_living", "area_kitchen", "seller", "phones", "id", "constructed", "bathrooms_private", "bathrooms_shared", "new", "windows_street", "stories", "windows_court", "elevator_passenger", "elevator_service"]
removed = preprocessor.remove_labels(featured.copy(), to_be_removed)
removed.head()

Unnamed: 0,price,area_total,condition,building_id,latitude,longitude,district,material,parking,garbage_chute,...,closest_powerplant,has_windows,scaled_constructed,floor_fraction,floor_stories,area_distance,bathroom_amount,avg_room_size,living_fraction,kitchen_fraction
0,7139520.0,59.2,-1.0,4076,55.544046,37.478055,11.0,3.0,1.0,-1.0,...,10.364623,True,2021.0,0.222222,18.0,0.171074,2.0,7.4,0.523649,0.211149
1,10500000.0,88.0,3.0,1893,55.861282,37.666647,2.0,3.0,1.0,-1.0,...,7.957621,True,1999.059871,0.72,450.0,0.106986,2.0,3.259259,0.545455,0.161364
2,9019650.0,78.5,-1.0,5176,55.663299,37.515335,6.0,3.0,1.0,-1.0,...,6.032235,True,2021.0,0.8,180.0,0.119716,2.0,2.907407,0.519745,0.286624
3,10500000.0,88.0,2.0,1893,55.861282,37.666647,2.0,3.0,1.0,-1.0,...,7.957621,True,1999.059871,0.72,450.0,0.106986,2.0,3.259259,0.545455,0.159091
4,13900000.0,78.0,3.0,6604,55.590785,37.451438,11.0,2.0,1.0,0.0,...,11.287607,True,2013.007917,0.466667,105.0,0.127619,1.0,9.75,0.448718,0.217949


## Logifying skewed data

We apply a base-10 logarithm to some of the columns, as they are quite skewed originally. We do this for the price and for the area and distance features.

In [31]:
finished = preprocessor.log10ify(removed.copy(), "price")
finished = preprocessor.log10ify(finished.copy(), "area_total")
finished = preprocessor.log10ify(finished.copy(), "closest_powerplant")
finished = preprocessor.log10ify(finished.copy(), "distance_khamovniki_center")
finished = preprocessor.log10ify(finished.copy(), "distance_golden_mile")
finished = preprocessor.log10ify(finished.copy(), "closest_uni")
finished = preprocessor.log10ify(finished.copy(), "distance_luxury_village")
finished = preprocessor.log10ify(finished.copy(), "closest_park")
finished = preprocessor.log10ify(finished.copy(), "closest_hospital")
finished = preprocessor.log10ify(finished.copy(), "closest_subway_distance")
finished = preprocessor.log10ify(finished.copy(), "distance_center")
finished.head()

Unnamed: 0,price,area_total,condition,building_id,latitude,longitude,district,material,parking,garbage_chute,...,closest_powerplant,has_windows,scaled_constructed,floor_fraction,floor_stories,area_distance,bathroom_amount,avg_room_size,living_fraction,kitchen_fraction
0,6.853669,1.772322,-1.0,4076,55.544046,37.478055,11.0,3.0,1.0,-1.0,...,1.015554,True,2021.0,0.222222,18.0,0.171074,2.0,7.4,0.523649,0.211149
1,7.021189,1.944483,3.0,1893,55.861282,37.666647,2.0,3.0,1.0,-1.0,...,0.900783,True,1999.059871,0.72,450.0,0.106986,2.0,3.259259,0.545455,0.161364
2,6.95519,1.89487,-1.0,5176,55.663299,37.515335,6.0,3.0,1.0,-1.0,...,0.780478,True,2021.0,0.8,180.0,0.119716,2.0,2.907407,0.519745,0.286624
3,7.021189,1.944483,2.0,1893,55.861282,37.666647,2.0,3.0,1.0,-1.0,...,0.900783,True,1999.059871,0.72,450.0,0.106986,2.0,3.259259,0.545455,0.159091
4,7.143015,1.892095,3.0,6604,55.590785,37.451438,11.0,2.0,1.0,0.0,...,1.052602,True,2013.007917,0.466667,105.0,0.127619,1.0,9.75,0.448718,0.217949
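The effect of the transform can be verified on the first few prices. The sketch below is a dependency-free stand-in for `log10ify` (the helper names `to_log10` and `from_log10` are hypothetical); it assumes strictly positive inputs, since the logarithm is undefined otherwise.

```python
import math

def to_log10(values):
    """Base-10 logarithm of each value; inputs must be strictly positive."""
    return [math.log10(v) for v in values]

def from_log10(values):
    """Inverse transform: raise 10 to each value."""
    return [10 ** v for v in values]

prices = [7139520.0, 10500000.0, 9019650.0]  # first three prices in the data
log_prices = to_log10(prices)
restored = from_log10(log_prices)

print([round(v, 6) for v in log_prices])
```

Compressing the range this way keeps a handful of very expensive apartments from dominating the scale of the feature.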


## Preparing for the model

We will now split the finished DataFrame back into training and test data. The training data contains prices, which we divide by the total area and store in a variable called *labels* below, so the models learn the price per unit of area. The test data has no prices, as predicting them is the task at hand. We use a combination of *XGBoost* and *LGBM* for our prediction. The configurations were found using Microsoft's AutoML library *flaml*, and our original predictions were made with the `TestModel.autoMLpredict` method. That method uses the actual model trained by *flaml*, while the models below are initialised from the configs *flaml* provided, so the results may deviate slightly from the score reported below. If you run the code below, you train two models created from these configs; if you run the *AutoML* section at the bottom, *AutoML* searches for new models. Predictions may therefore deviate somewhat because of how the models are found, but the resulting models perform on par with one another. We use *StandardScaler* in these pipelines, which *AutoML* does not.

The following combined model configurations for *XGBoost* and *LGBM* landed a score of ***0.16286*** on the public leaderboard.

#### XGBoost

In [40]:
train_data = finished[:23285].copy()                 # Data used for fitting
test_data = finished[23285:].copy()                  # Actual data we want to predict

train_data = preprocessor.fix_categorical(train_data.copy()) # Changes the type of categorical data to "category" instead of float
test_data = preprocessor.fix_categorical(test_data.copy())   # Changes the type of categorical data to "category" instead of float

train_area = data[:23285]["area_total"]              # Extracting the area_total from data. It will be used to convert back to actual prices later
test_area = data[23285:]["area_total"]               # Same as above, but for test set

labels = data[:23285]["price"] / train_area          # Learning labels

train_data.drop(columns="price", inplace=True)       # Removing labels from training set
test_data.drop(columns="price", inplace=True)        # Removing the price column from the test set, even though it is all NaN
print("Number of features:", train_data.shape[1])

Number of features: 28
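The label construction above (price divided by `area_total`) and the later multiplication by `test_area` form a simple round trip. A minimal sketch with plain lists, using hypothetical helper names, shows the idea:

```python
import math

def price_per_area(prices, areas):
    """Training target: price divided by total area."""
    return [p / a for p, a in zip(prices, areas)]

def to_total_price(per_area_predictions, areas):
    """Convert per-area predictions back into full prices."""
    return [p * a for p, a in zip(per_area_predictions, areas)]

train_prices = [7139520.0, 10500000.0]  # first two prices in the data
train_areas = [59.2, 88.0]              # matching area_total values

labels = price_per_area(train_prices, train_areas)
# A perfect model would predict the labels exactly; multiplying by the area recovers the price
recovered = to_total_price(labels, train_areas)
assert all(math.isclose(r, p) for r, p in zip(recovered, train_prices))
print([round(v, 2) for v in labels])
```

Predicting a per-area price puts apartments of very different sizes on a comparable scale, which is presumably why the raw area is kept aside in `train_area` and `test_area`.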


In [41]:
xgboost_model = TestModel(train_data, labels)
xgb_config = {'n_estimators': 182, 'max_leaves': 1221, 'min_child_weight': 4.321656525884641, 'learning_rate': 0.03469596869532235, 'subsample': 0.827351358517848, 'colsample_bylevel': 0.5127241764310982, 'colsample_bytree': 0.722539217922852, 'reg_alpha': 0.0016952814239432799, 'reg_lambda': 1.0399794993312943}
xgb_model = xgboost_model.xgboost_model(xgb_config)
xgb_pipe = Pipeline([("scaler", StandardScaler()), ("xgb", xgb_model)])

In [42]:
print("Fitting XGBoost model...")
xgb_fitted = xgb_pipe.fit(train_data, labels)
print("XGBoost fitting complete!")

Fitting XGBoost model...
XGBoost fitting complete!


In [43]:
xgb_prediction = xgb_fitted.predict(test_data)
xgb_predicted_prices = xgb_prediction * test_area
print("XGBoost prediction finished!")
xgb_predicted_prices

XGBoost prediction finished!


23285    2.954891e+07
23286    1.109818e+07
23287    6.169507e+06
23288    8.887791e+06
23289    4.971776e+06
             ...     
33217    2.978534e+07
33218    2.140090e+07
33219    9.872597e+06
33220    9.737242e+06
33221    7.168858e+06
Name: area_total, Length: 9937, dtype: float64

#### LGBM

In [44]:
lgbm_model = TestModel(train_data, labels)
lgbm_config = {'n_estimators': 280, 'num_leaves': 1340, 'min_child_samples': 6, 'learning_rate': 0.05127888901330433, 'log_max_bin': 10, 'colsample_bytree': 0.5981670145528535, 'reg_alpha': 0.0034922118383222253, 'reg_lambda': 0.003300958692930056}
lgbm_model = lgbm_model.lgbm_model(lgbm_config)
lgbm_pipe = Pipeline([("scaler", StandardScaler()), ("lgbm", lgbm_model)])

In [48]:
print("Fitting LGBM model...")
lgbm_fitted = lgbm_pipe.fit(train_data, labels)
print("LGBM fitting complete!")

Fitting LGBM model...
LGBM fitting complete!


In [49]:
lgbm_prediction = lgbm_fitted.predict(test_data)
lgbm_predicted_prices = lgbm_prediction * test_area
print("LGBM  prediction finished!")
lgbm_predicted_prices

LGBM  prediction finished!


23285    3.038332e+07
23286    9.350997e+06
23287    6.167971e+06
23288    8.602383e+06
23289    5.225081e+06
             ...     
33217    3.291198e+07
33218    2.183107e+07
33219    8.511842e+06
33220    9.117586e+06
33221    7.196214e+06
Name: area_total, Length: 9937, dtype: float64

#### Combining predictions

We will now combine the results from these two models by taking the average of their predictions.

In [50]:
combined_prediction = (xgb_predicted_prices + lgbm_predicted_prices) / 2
xgboost_model.save_predictions(list(combined_prediction))

Unnamed: 0,id,price_prediction
0,23285,2.996612e+07
1,23286,1.022459e+07
2,23287,6.168739e+06
3,23288,8.745087e+06
4,23289,5.098428e+06
...,...,...
9932,33217,3.134866e+07
9933,33218,2.161599e+07
9934,33219,9.192219e+06
9935,33220,9.427414e+06
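The equal-weight average can be checked against the first rows of the outputs above. The sketch below uses the displayed (rounded) predictions and a hypothetical helper name:

```python
def average_predictions(first, second):
    """Element-wise mean of two equally long prediction lists."""
    if len(first) != len(second):
        raise ValueError("prediction lists must have the same length")
    return [(a + b) / 2 for a, b in zip(first, second)]

xgb_prices = [2.954891e07, 1.109818e07, 6.169507e06]   # first XGBoost predictions shown above
lgbm_prices = [3.038332e07, 9.350997e06, 6.167971e06]  # first LGBM predictions shown above

combined = average_predictions(xgb_prices, lgbm_prices)
print(combined)
```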


# How we predicted using *flaml* AutoML

This section only shows how we found the best models. *AutoML* may not return the same models on every run, even when the data is the same. Because of this, we saved the best configs for both *XGBoost* and *LGBM* below, so that the models can be recreated.

#### Best XGBoost model

{'n_estimators': 182, 'max_leaves': 1221, 'min_child_weight': 4.321656525884641, 'learning_rate': 0.03469596869532235, 'subsample': 0.827351358517848, 'colsample_bylevel': 0.5127241764310982, 'colsample_bytree': 0.722539217922852, 'reg_alpha': 0.0016952814239432799, 'reg_lambda': 1.0399794993312943}

#### Best LGBM model

{'n_estimators': 280, 'num_leaves': 1340, 'min_child_samples': 6, 'learning_rate': 0.05127888901330433, 'log_max_bin': 10, 'colsample_bytree': 0.5981670145528535, 'reg_alpha': 0.0034922118383222253, 'reg_lambda': 0.003300958692930056}

In [None]:
train_data = finished[:23285].copy()
test_data = finished[23285:].copy()
test_data.drop(columns="price", inplace=True)
test_data = preprocessor.fix_categorical(test_data.copy()) # Same categorical dtypes as the training split, as in the cells above


train_area = data[:23285]["area_total"]
test_area = data[23285:]["area_total"]

labels = data[:23285]["price"] / train_area 

# Splitting training data into training and validation, removed the price for each of them afterwards
x_train, x_test, y_train, y_test = train_test_split(train_data, labels, stratify=train_data.price.round(), test_size=0.0002)
x_train.drop(columns="price", inplace=True)
x_test.drop(columns="price", inplace=True)
x_train = preprocessor.fix_categorical(x_train.copy()) # Changes the type of categorical data to "category" instead of float
x_test = preprocessor.fix_categorical(x_test.copy()) # Changes the type of categorical data to "category" instead of float

print("Number of features for LGBM:", x_train.shape[1])

In [None]:
# The models predict price per unit of area. "area_total" holds log10(area),
# so 10 ** area_total recovers the area needed to convert back to prices.
xgboost_model = TestModel(y_train, x_train)
best_xgboost = xgboost_model.autoMLfit(x_train, y_train, ["xgboost"], time=60, metric="auto")
xgboost_pred = np.asarray(xgboost_model.autoMLpredict(x_test)) * (10 ** x_test["area_total"])

lgbm_model = TestModel(y_train, x_train)
best_lgbm = lgbm_model.autoMLfit(x_train, y_train, ["lgbm"], time=60, metric="auto")
lgbm_pred = np.asarray(lgbm_model.autoMLpredict(x_test)) * (10 ** x_test["area_total"])

In [None]:
xgboost_model.autoML_print_best_model()
lgbm_model.autoML_print_best_model()

In [None]:
combined = (xgboost_pred + lgbm_pred) / 2
test_labels = np.asarray(y_test) * (10 ** x_test["area_total"])  # Validation labels converted back to prices

print("RMSLE for XGBoost model: %s" % xgboost_model.root_mean_squared_log_error(test_labels, xgboost_pred))
print("RMSLE for LGBM model: %s\n" % xgboost_model.root_mean_squared_log_error(test_labels, lgbm_pred))

print("RMSLE for combined models: %s\n" % xgboost_model.root_mean_squared_log_error(test_labels, combined))

# Predicting the actual test set; the names avoid shadowing the lightgbm import (lgbm)
lgbm_test_pred = np.asarray(lgbm_model.autoMLpredict(test_data)) * (10 ** test_data["area_total"])
xgb_test_pred = np.asarray(xgboost_model.autoMLpredict(test_data)) * (10 ** test_data["area_total"])
combined_test_pred = (lgbm_test_pred + xgb_test_pred) / 2
combined_test_pred
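The validation metric printed above is the root mean squared logarithmic error. The sketch below follows the standard definition and assumes that `TestModel.root_mean_squared_log_error` does the same; the function name `rmsle` is hypothetical.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x) on both arguments."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    squared_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

true_prices = [7139520.0, 10500000.0, 9019650.0]  # first three prices in the data
predictions = [7500000.0, 10000000.0, 9019650.0]  # made-up predictions, for illustration only
print(rmsle(true_prices, predictions))
```

Because the error is computed on logarithms, it penalises relative rather than absolute deviations, so a 10% miss on a cheap apartment counts about as much as a 10% miss on an expensive one.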