# Moscow housing prediction [Short]

Written by<br> 

Hauk Aleksander Olaussen - **Student ID: *504903*** <br>

Charbel Badr - **Student ID: *557389***<br>

Noran Baskaran - **Student ID: *504877*** <br>

Kaggle competition name: **Moscow housing**

Kaggle team name: **Team 69**

## Introduction

This short notebook contains the code and comments needed to reproduce our best-performing model on the public test set. It uses the `Preprocessor` and `TestModel` classes from `Preprocessor.py` and `TestModel.py`, just as in **long.ipynb**. We do no exploration or feature engineering here beyond calling the functions in `Preprocessor.py` that do this work. We hope this keeps the model readable, with as little "code gore" as possible.

In [1]:
%load_ext autoreload

In [2]:
%autoreload
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from Preprocessor import Preprocessor
from TestModel import TestModel
import lightgbm as lgbm
import pandas as pd
import numpy as np
import xgboost

preprocessor = Preprocessor()

## Extracting the data

The data is stored in *.csv* files, so it has to be read into `pandas` DataFrames, which the `Preprocessor` object already holds as `merged` and `merged_test`. We concatenate the training and test sets into a single DataFrame so that the same preprocessing is applied to both. This is done in the same way as in the long notebook.

In [3]:
merged_final = pd.concat([preprocessor.merged.copy(), preprocessor.merged_test.copy()], ignore_index=True)
print(f"The dataframe now has {len(merged_final)} entries")

The dataframe now has 33222 entries
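
The concatenation relies on row order to separate training and test rows again later. A minimal sketch with toy frames (the names `train`, `test` and `n_train` and all values are made up for illustration and are not the competition data):

```python
import pandas as pd

# Toy stand-ins for preprocessor.merged and preprocessor.merged_test (hypothetical values)
train = pd.DataFrame({"area_total": [59.2, 88.0, 78.5],
                      "price": [7139520.0, 10500000.0, 9019650.0]})
test = pd.DataFrame({"area_total": [78.0, 45.0]})  # test rows carry no price

# ignore_index=True renumbers the rows 0..n-1, with the training rows first
merged = pd.concat([train, test], ignore_index=True)
n_train = len(train)  # split point needed to recover the two parts later

print(len(merged))
print(merged["price"].isna().sum())           # test rows get NaN prices
print(merged.iloc[n_train:].index.tolist())   # positions of the test rows
```

Because the test frame has no `price` column, `pd.concat` fills it with NaN for those rows.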


Let's run the preprocessing functions.

In [4]:
general_removed = preprocessor.general_removal(merged_final.copy()) # Removes "layout", "ceiling", "balconies", "loggias", "condition", "elevator_without", "street", "address"
data = preprocessor.remove_NaNs(general_removed.copy())             # Fills in NaN values in different ways depending on the entry
data = preprocessor.fix_categorical(data.copy())                    # Changes the type of categorical data to "category" instead of float
data.head()

Unnamed: 0,id,seller,price,area_total,area_kitchen,area_living,floor,rooms,bathrooms_shared,bathrooms_private,...,longitude,district,constructed,material,stories,elevator_passenger,elevator_service,parking,garbage_chute,heating
0,0,3.0,7139520.0,59.2,12.5,31.0,2.0,2.0,0.0,2.0,...,37.478055,11.0,2021.0,3.0,9.0,1.0,1.0,1.0,-1.0,-1.0
1,1,-1.0,10500000.0,88.0,14.2,48.0,18.0,3.0,2.0,0.0,...,37.666647,2.0,2010.0,3.0,25.0,1.0,1.0,1.0,-1.0,0.0
2,2,3.0,9019650.0,78.5,22.5,40.8,12.0,3.0,0.0,2.0,...,37.515335,6.0,2021.0,3.0,15.0,1.0,1.0,1.0,-1.0,-1.0
3,3,-1.0,10500000.0,88.0,14.0,48.0,18.0,3.0,0.0,2.0,...,37.666647,2.0,2010.0,3.0,25.0,1.0,1.0,1.0,-1.0,0.0
4,4,-1.0,13900000.0,78.0,17.0,35.0,7.0,2.0,1.0,0.0,...,37.451438,11.0,2017.0,2.0,15.0,1.0,1.0,1.0,0.0,0.0
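
As a rough illustration of the last two steps, the sketch below fills missing categorical values with a sentinel and then converts the columns to the `category` dtype. The sentinel `-1` is an assumption inferred from the `-1.0` values in the table above; the actual logic lives in `Preprocessor.py` and may differ per column.

```python
import numpy as np
import pandas as pd

# Made-up values for two categorical columns
df = pd.DataFrame({"seller": [3.0, np.nan, 3.0, np.nan],
                   "district": [11.0, 2.0, 6.0, 2.0]})

# Assumption: missing categories become their own level, -1
df["seller"] = df["seller"].fillna(-1)

# Tell pandas (and later the tree models) that these are categories, not quantities
for column in ["seller", "district"]:
    df[column] = df[column].astype("category")

print(df.dtypes)
print(df["seller"].cat.categories.tolist())
```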


## Adding features

Now let's add the features we want for the model. We start with the distance features.

In [5]:
# Adding distance features
distance_df = preprocessor.combine_latlon_subway(data.copy(), read_from_file=True)  # Subway
distance_df = preprocessor.closest_hospital(distance_df.copy())                     # Hospitals
distance_df = preprocessor.closest_park(distance_df.copy())                         # Parks 
distance_df = preprocessor.distance_luxury_village(distance_df.copy())              # Luxury village
distance_df = preprocessor.closest_uni(distance_df.copy())                          # University
distance_df = preprocessor.inside_golden_mile(distance_df.copy())                   # Golden mile, boolean
distance_df = preprocessor.distance_from_golden_mile(distance_df.copy())            # Golden mile, distance
distance_df = preprocessor.inside_khamovniki(distance_df.copy())                    # Khamovniki, boolean
distance_df = preprocessor.distance_from_khamovniki_center(distance_df.copy())      # Khamovniki, distance
distance_df = preprocessor.combine_latlon(distance_df.copy())                       # Red square
print("Distance features added")

Distance features added
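
The distance features all reduce to the distance between an apartment and some reference point. One common way to compute this from latitude and longitude is the haversine formula, sketched below. Whether `Preprocessor.py` uses exactly this formula is an assumption, and the Red Square coordinates are approximate values chosen for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Approximate coordinates of Red Square (an assumption, not taken from Preprocessor.py)
RED_SQUARE = (55.7539, 37.6208)

# Latitude and longitude of the first apartment in the tables of this notebook
apartment = (55.544046, 37.478055)

distance_center = haversine_km(*apartment, *RED_SQUARE)
print(round(distance_center, 1))
```

For this apartment the result is roughly 25 km, which is in line with the `distance_center` value shown for the same row further down.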


We continue by adding combined features.

In [6]:
# Creating the new features
featured = preprocessor.combine_windows(distance_df.copy(), boolean=True) # has_windows feature
featured = preprocessor.redo_new(featured.copy())                         # Fixing new feature before combining
featured = preprocessor.combine_new_constructed_distance(featured.copy()) # scaled_constructed feature
featured = preprocessor.floor_fraction(featured.copy())                   # floor_fraction feature
featured = preprocessor.combine_floor_stories(featured.copy())            # floor_stories feature
featured = preprocessor.area_distance(featured.copy())                    # area_distance feature
featured.head()

Unnamed: 0,id,seller,price,area_total,area_kitchen,area_living,floor,rooms,bathrooms_shared,bathrooms_private,...,inside_golden_mile,distance_golden_mile,inside_khamovniki,distance_khamovniki_center,distance_center,has_windows,scaled_constructed,floor_fraction,floor_stories,area_distance
0,0,3.0,7139520.0,59.2,12.5,31.0,2.0,2.0,0.0,2.0,...,0,22.936124,0,21.990803,25.02208,True,2021.0,0.222222,18.0,0.171074
1,1,-1.0,10500000.0,88.0,14.2,48.0,18.0,3.0,2.0,0.0,...,0,14.282081,0,15.264384,12.267029,True,1999.059871,0.72,450.0,0.106986
2,2,3.0,9019650.0,78.5,22.5,40.8,12.0,3.0,0.0,2.0,...,0,9.839135,0,8.719216,12.060134,True,2021.0,0.8,180.0,0.119716
3,3,-1.0,10500000.0,88.0,14.0,48.0,18.0,3.0,0.0,2.0,...,0,14.282081,0,15.264384,12.267029,True,1999.059871,0.72,450.0,0.106986
4,4,-1.0,13900000.0,78.0,17.0,35.0,7.0,2.0,1.0,0.0,...,0,18.83956,0,17.732445,21.041782,True,2013.007917,0.466667,105.0,0.127619
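
Two of the combined features can be checked directly against the table above: `floor_fraction` matches `floor / stories`, and `floor_stories` matches `floor * stories`. The sketch below reproduces them for the five displayed rows. The formulas are inferred from the displayed values, not copied from `Preprocessor.py`.

```python
import pandas as pd

# floor and stories for the first five rows shown in the earlier table
df = pd.DataFrame({"floor": [2.0, 18.0, 12.0, 18.0, 7.0],
                   "stories": [9.0, 25.0, 15.0, 25.0, 15.0]})

df["floor_fraction"] = df["floor"] / df["stories"]  # relative height within the building
df["floor_stories"] = df["floor"] * df["stories"]   # interaction term

print(df.round(6))
```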


## Removing features

Let's now remove the features we do not intend to keep.

In [7]:
to_be_removed = ["rooms", "material", "seller", "floor", "garbage_chute", "heating", "parking", "phones","area_kitchen", "area_living", "id", "constructed", "bathrooms_private", "bathrooms_shared", "new", "windows_street", "stories", "windows_court", "elevator_passenger", "elevator_service"]
removed = preprocessor.remove_labels(featured.copy(), to_be_removed)
removed.head()

Unnamed: 0,price,area_total,building_id,latitude,longitude,district,closest_subway_distance,closest_hospital,closest_park,distance_luxury_village,...,inside_golden_mile,distance_golden_mile,inside_khamovniki,distance_khamovniki_center,distance_center,has_windows,scaled_constructed,floor_fraction,floor_stories,area_distance
0,7139520.0,59.2,4076,55.544046,37.478055,11.0,1.911344,14.087087,7.678067,25.577801,...,0,22.936124,0,21.990803,25.02208,True,2021.0,0.222222,18.0,0.171074
1,10500000.0,88.0,1893,55.861282,37.666647,2.0,0.913793,8.433725,4.215548,28.890185,...,0,14.282081,0,15.264384,12.267029,True,1999.059871,0.72,450.0,0.106986
2,9019650.0,78.5,5176,55.663299,37.515335,6.0,1.643014,6.470263,7.331505,18.074328,...,0,9.839135,0,8.719216,12.060134,True,2021.0,0.8,180.0,0.119716
3,10500000.0,88.0,1893,55.861282,37.666647,2.0,0.913793,8.433725,4.215548,28.890185,...,0,14.282081,0,15.264384,12.267029,True,1999.059871,0.72,450.0,0.106986
4,13900000.0,78.0,6604,55.590785,37.451438,11.0,1.220328,13.319938,8.930833,20.333005,...,0,18.83956,0,17.732445,21.041782,True,2013.007917,0.466667,105.0,0.127619


## Logifying skewed data

We log-transform ("logify") some of the columns, as they are originally quite skewed. The predicted log-prices are later mapped back to prices with `np.expm1`.

In [8]:
finished = preprocessor.logify(removed.copy(), "area_total")
finished = preprocessor.logify(finished.copy(), "distance_center")
finished = preprocessor.logify(finished.copy(), "price")
finished.head()

Unnamed: 0,price,area_total,building_id,latitude,longitude,district,closest_subway_distance,closest_hospital,closest_park,distance_luxury_village,...,inside_golden_mile,distance_golden_mile,inside_khamovniki,distance_khamovniki_center,distance_center,has_windows,scaled_constructed,floor_fraction,floor_stories,area_distance
0,15.781156,4.097672,4076,55.544046,37.478055,11.0,1.911344,14.087087,7.678067,25.577801,...,0,22.936124,0,21.990803,3.258945,True,2021.0,0.222222,18.0,0.171074
1,16.166886,4.488636,1893,55.861282,37.666647,2.0,0.913793,8.433725,4.215548,28.890185,...,0,14.282081,0,15.264384,2.585282,True,1999.059871,0.72,450.0,0.106986
2,16.014916,4.375757,5176,55.663299,37.515335,6.0,1.643014,6.470263,7.331505,18.074328,...,0,9.839135,0,8.719216,2.569564,True,2021.0,0.8,180.0,0.119716
3,16.166886,4.488636,1893,55.861282,37.666647,2.0,0.913793,8.433725,4.215548,28.890185,...,0,14.282081,0,15.264384,2.585282,True,1999.059871,0.72,450.0,0.106986
4,16.447399,4.369448,6604,55.590785,37.451438,11.0,1.220328,13.319938,8.930833,20.333005,...,0,18.83956,0,17.732445,3.09294,True,2013.007917,0.466667,105.0,0.127619
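
The values in the table are consistent with a `log(1 + x)` transform: for example, `area_total = 59.2` becomes `4.097672`. A minimal sketch of the transform and its inverse, assuming `logify` is equivalent to `np.log1p` (inferred from the numbers, not from `Preprocessor.py`):

```python
import numpy as np

# area_total for the first three rows before the transform
area_total = np.array([59.2, 88.0, 78.5])

logged = np.log1p(area_total)   # log(1 + x), well defined at x = 0
restored = np.expm1(logged)     # exp(y) - 1, the exact inverse

print(np.round(logged, 6))
print(np.allclose(restored, area_total))
```

Using `np.expm1` on the model output is what makes the predictions comparable to the original prices again.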


## Preparing for the model

We will now split the finished DataFrame back into training and test data. The training data contains prices, which we put into a variable called *labels* below. The test data does not have prices, since predicting them is the task at hand. We will use a combination of *XGBoost* and *LGBM* for our prediction. The configurations were found using Microsoft's AutoML library *flaml*.

#### XGBoost

In [9]:
train_data = finished[:23285].copy()                 # Data used for fitting
test_data = finished[23285:].copy()                  # Actual data we want to predict

train_area = removed["area_total"][:23285]           # We will need this later to convert back to price from the actual prediction
test_area = removed["area_total"][23285:]            # Same as above, but for test data

labels = train_data["price"]                         # Learning labels

train_data.drop(columns="price", inplace=True)       # Removing labels from training set
test_data.drop(columns="price", inplace=True)        # Removing the price column from the test set, even though it is all NaN
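
The split depends on the hard-coded boundary `23285`. The sketch below shows the same split on a toy frame, with two sanity checks that would catch a wrong boundary. The frame, its values and the name `n_train` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame: three training rows with a price, two test rows without (made-up values)
finished = pd.DataFrame({"area_total": [4.1, 4.5, 4.4, 4.4, 3.8],
                         "price": [15.8, 16.2, 16.0, np.nan, np.nan]})
n_train = 3  # plays the role of 23285 in the cell above

train_data = finished.iloc[:n_train].copy()
test_data = finished.iloc[n_train:].copy()

# Sanity checks: every training row has a label, no test row does
assert train_data["price"].notna().all()
assert test_data["price"].isna().all()

labels = train_data.pop("price")             # removes the column and returns it
test_data = test_data.drop(columns="price")  # keyword form works across pandas versions

print(train_data.shape, test_data.shape, labels.shape)
```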

In [10]:
xgboost_model = TestModel(train_data, labels)
xgb_config = {'n_estimators': 2080, 'max_leaves': 33, 'min_child_weight': 0.03571272484753037, 'learning_rate': 0.017255426626148745, 'subsample': 0.6574197936285714, 'colsample_bylevel': 0.6579405013150642, 'colsample_bytree': 0.8926698031133763, 'reg_alpha': 0.0012147040662016035, 'reg_lambda': 1.0953681459256441}
xgb_model = xgboost_model.xgboost_model(xgb_config)
xgb_pipe = Pipeline([("scaler", StandardScaler()), ("xgb", xgb_model)])

In [11]:
print("Fitting XGBoost model...")
xgb_fitted = xgb_pipe.fit(train_data, labels)
print("XGBoost fitting complete!")

Fitting XGBoost model...
XGBoost fitting complete!


In [12]:
xgb_prediction = xgb_fitted.predict(test_data)
xgb_predicted_prices = np.expm1(xgb_prediction)
print(" XGBoost prediction finished!")
xgb_predicted_prices

 XGBoost prediction finished!


array([28010884.,  8814755.,  6062424., ..., 10051129.,  9205929.,
        7128650.], dtype=float32)

#### LGBM

In [13]:
lgbm_model = TestModel(train_data, labels)
lgbm_config = {'n_estimators': 1650, 'num_leaves': 28, 'min_child_samples': 3, 'learning_rate': 0.017116105449605335, 'colsample_bytree': 0.27135557685367967, 'reg_alpha': 0.04684224113991213, 'reg_lambda': 0.006296390009447546}
lgbm_model = lgbm_model.lgbm_model(lgbm_config)
lgbm_pipe = Pipeline([("scaler", StandardScaler()), ("lgbm", lgbm_model)])

In [14]:
print("Fitting LGBM model...")
lgbm_fitted = lgbm_pipe.fit(train_data, labels)
print("LGBM fitting complete!")

Fitting LGBM model...
LGBM fitting complete!


In [15]:
lgbm_prediction = lgbm_fitted.predict(test_data)
lgbm_predicted_prices = np.expm1(lgbm_prediction)
print(" LGBM  prediction finished!")
lgbm_predicted_prices

 LGBM  prediction finished!


array([31002115.41201915,  8688252.10832712,  6100758.16391392, ...,
        9577352.51591064,  9182774.86239288,  6781475.0037736 ])

#### Combining predictions

We will now combine the results from these two models by taking the average of their predicted prices.

In [16]:
combined_prediction = (xgb_predicted_prices + lgbm_predicted_prices) / 2
xgboost_model.save_predictions(combined_prediction)

Unnamed: 0,id,price_prediction
0,23285,2.950650e+07
1,23286,8.751504e+06
2,23287,6.081591e+06
3,23288,9.260575e+06
4,23289,5.128161e+06
...,...,...
9932,33217,2.634256e+07
9933,33218,2.056892e+07
9934,33219,9.814241e+06
9935,33220,9.194352e+06
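
The averaging step can be verified by hand on the first three predictions printed earlier. The sketch below also writes the average as a weighted sum; the weight names `w_xgb` and `w_lgbm` are hypothetical and only show how unequal weights could be tried, which this notebook does not do.

```python
import numpy as np

# First three predictions from each model, copied from the outputs above
xgb_prices = np.array([28010884.0, 8814755.0, 6062424.0], dtype=np.float32)
lgbm_prices = np.array([31002115.41201915, 8688252.10832712, 6100758.16391392])

# Plain average, as in the cell above
combined = (xgb_prices + lgbm_prices) / 2

# Same result written as a weighted sum (equal weights reproduce the average)
w_xgb, w_lgbm = 0.5, 0.5
weighted = w_xgb * xgb_prices + w_lgbm * lgbm_prices

print(combined)
print(combined.dtype)  # float32 + float64 is promoted to float64
```

The first three values agree with the first three rows of `price_prediction` in the saved output.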
