# Group 14
## Competition Name : Moscow Housing
## Elias Elfarri    , ID: 473700
## Nora Valen       , ID: 490606
## Muhammad Sarmad  , ID: 190729

# Table of Contents
1. [Introduction](#1)
1. [Importing Libraries and Reading the Dataset](#2)
1. [Exploratory Data Analysis](#3) 
    * [3.1 General Property Valuation Domain Knowledge](#3.1)
    * [3.2 Moscow Housing Dataset Domain Knowledge](#3.2)
    * [3.3 Descriptive statistics](#3.3)
    * [3.4 Distribution of outcome variable](#3.4)
    * [3.5 Distance from city center](#3.5)
    * [3.6 Preventing Data Leakage](#3.6)
1. [Data Cleaning](#4)
    * [4.1 LongLat Outliers & NaN Values](#4.1)
    * [4.2 District Outliers & NaN Values](#4.2)
    * [4.3 General NaN values](#4.3)
    * [4.4 Seller Outliers & NaN Values](#4.4)
    * [4.5 Area Kitchen and Area Living Outliers & NaN Values](#4.5)
    * [4.6 Ceiling Outliers & NaN Values](#4.6)
    * [4.7 Condition Outliers & NaN Values](#4.7)
    * [4.8 Floor and Stories Outliers & NaN Values](#4.8)
    * [4.9 Material Outliers & NaN Values](#4.9)
    * [4.10 Constructed and New Outliers & NaN Values](#4.10) 
    * [4.11 Balconies and Loggias NaN Values](#4.11) 
    * [4.12 Elevator NaN Values](#4.12) 
    * [4.13 Remaining NaN Values After Cleaning](#4.13) 
1. [Feature Engineering](#5)
    * [5.1 Euclidean Distance From City Center](#5.1)
    * [5.2 Combining Elevator Features](#5.2)
    * [5.3 Projection-based Distance From City Center](#5.3)
    * [5.4 Area_per_room](#5.4)
    * [5.5 Bathrooms_total](#5.5)
    * [5.6 Balconies_total](#5.6)
    * [5.7 Distance to Financial City Center](#5.7)
    * [5.8 Floor_per_stories Ratio](#5.8)
    * [5.9 Log Features](#5.10)
    * [5.10 Squared Features](#5.11)
    * [5.11 Other Possible Features](#5.12)
    
1. [Feature Extraction](#6)
    * [6.1 Principal Component Analysis (PCA)](#6.1)
    * [6.2 Target Encoding](#6.2)
1. [Feature Selection](#7)
    * [7.1 Mutual Information (MI)](#7.1)
    * [7.2 ANOVA F-value](#7.2)
    * [7.3 Variance Threshold](#7.3)
    * [7.4 Conclusion Of Feature Selection](#7.4)
1. [Models](#8)
    * [8.1 Model Overview](#8.1)
    * [8.2 Choosing Prediction Target](#8.2)
    * [8.3 Validation Split](#8.3)
    * [8.4 Xgboost](#8.4)
    * [8.5 Lightgbm](#8.5)
    * [8.6 Catboost](#8.6)
    * [8.7 Stacking Model](#8.7)
    * [8.8 Weight Averaging](#8.8)
    * [8.9 Other Tried Models](#8.9)
1. [Optuna Optimization](#9)
    * [9.1 Hyperparameter Tuning Xgboost](#9.1)
    * [9.2 Hyperparameter Tuning Lgbm](#9.2)
    * [9.3 Hyperparameter Tuning Catboost](#9.3)
    * [9.4 Other Model Tuning](#9.4)
1. [Attempted and Documented Model Pipelines](#10)
    * [10.1 Attempt 1](#10.1)
    * [10.2 Attempt 2](#10.2)
    * [10.3 Attempt 3](#10.3)
    * [10.4 Attempt 4](#10.4)
    * [10.5 Attempt 5](#10.5)
    * [10.6 Group Kfolding Based on Building Split](#10.6)

1. [Model Interpretation](#11)
    * [11.1 LIME Interpretation](#11.1)
    * [11.2 Model Feature Importance](#11.2)

1. [Second Final Submission - Short Notebook](#12)

1. [Conclusion](#13)
<hr/>

<a id="1"></a> <br>
# 1. Introduction

This is the final long notebook for TDT4173 Machine Learning. It documents the work we did over the course of the project: exploratory data analysis (EDA), data cleaning, feature engineering, models, optimization strategy, the final submissions, and model interpretation.

<a id="2"></a> <br>
# 2. Importing Libraries and Reading the Dataset

In [5]:
import json
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

np.random.seed(123)
sns.set_style('darkgrid')
pd.set_option('display.max_colwidth', None)

#Model imports
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from mlxtend.regressor import StackingCVRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import ExtraTreesRegressor




#for Feature Extraction
from sklearn.decomposition import PCA

#for Feature Selection
from sklearn.feature_selection import mutual_info_regression
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import VarianceThreshold


#Model Optimization
import optuna

#Sklearn misc
import sklearn.model_selection as model_selection
from numpy.random import choice

#Other misc
import pyproj

In [6]:
!ln -s /kaggle/input/moscow-housing-tdt4173 ./data
!ls ./data | sort

apartments = pd.read_csv('data/apartments_train.csv')
buildings = pd.read_csv('data/buildings_train.csv')
data = pd.merge(apartments, buildings.set_index('id'), how='left', left_on='building_id', right_index=True)

apartments_test = pd.read_csv('data/apartments_test.csv')
buildings_test = pd.read_csv('data/buildings_test.csv')
data_test = pd.merge(apartments_test, buildings_test.set_index('id'), how='left', left_on='building_id', right_index=True)

In [7]:
def describe_column(meta):
    """
    Utility function for describing a dataset column (see below for usage)
    """
    def f(x):
        d = pd.Series(name=x.name, dtype=object)
        m = next(m for m in meta if m['name'] == x.name)
        d['Type'] = m['type']
        d['#NaN'] = x.isna().sum()
        d['Description'] = m['desc']
        if m['type'] == 'categorical':
            counts = x.dropna().map(dict(enumerate(m['cats']))).value_counts().sort_index()
            d['Statistics'] = ', '.join(f'{c}({n})' for c, n in counts.items())
        elif m['type'] == 'real' or m['type'] == 'integer':
            stats = x.dropna().agg(['mean', 'std', 'min', 'max'])
            d['Statistics'] = ', '.join(f'{s}={v :.1f}' for s, v in stats.items())
        elif m['type'] == 'boolean':
            counts = x.dropna().astype(bool).value_counts().sort_index()
            d['Statistics'] = ', '.join(f'{c}({n})' for c, n in counts.items())
        else:
            d['Statistics'] = f'#unique={x.nunique()}'
        return d
    return f

def describe_data(data, meta):
    desc = data.apply(describe_column(meta)).T
    desc = desc.style.set_properties(**{'text-align': 'left'})
    desc = desc.set_table_styles([ dict(selector='th', props=[('text-align', 'left')])])
    return desc 



def plot_mi_scores(scores):
    scores = scores.sort_values(ascending=True)
    width = np.arange(len(scores))
    ticks = list(scores.index)
    plt.barh(width, scores)
    plt.yticks(width, ticks)
    plt.title("Mutual Information Scores")

In [8]:
buildings = pd.read_csv('data/buildings_train.csv')
print(f'Loaded {len(buildings)} buildings')
with open('data/buildings_meta.json') as f: 
    buildings_meta = json.load(f)
buildings.head()
describe_data(buildings, buildings_meta)

In [9]:
# Helper functions
def root_mean_squared_log_error(y_true, y_pred):
    # Alternatively: sklearn.metrics.mean_squared_log_error(y_true, y_pred) ** 0.5
    assert (y_true >= 0).all() 
    assert (y_pred >= 0).all()
    log_error = np.log1p(y_pred) - np.log1p(y_true)  # Note: log1p(x) = log(1 + x)
    return np.mean(log_error ** 2) ** 0.5


def plot_map(data, ax=None, s=5, a=0.75, q_lo=0.0, q_hi=0.9, cmap='autumn', column='price', title='Moscow apartment price by location'):
    data = data[["latitude", "longitude", column]].sort_values(by=column, ascending=True)
    backdrop = plt.imread('data/moscow.png')
    backdrop = np.einsum('hwc, c -> hw', backdrop, [0, 1, 0, 0]) ** 2
    if ax is None:
        plt.figure(figsize=(12, 8), dpi=100)
        ax = plt.gca()
    discrete = data[column].nunique() <= 20
    if not discrete:
        lo, hi = data[column].quantile([q_lo, q_hi])
        hue_norm = plt.Normalize(lo, hi)
        sm = plt.cm.ScalarMappable(cmap=cmap, norm=plt.Normalize(lo, hi))
        sm.set_array([])
    else:
        hue_norm = None 
    ax.imshow(backdrop, alpha=0.5, extent=[37, 38, 55.5, 56], aspect='auto', cmap='bone', norm=plt.Normalize(0.0, 2))
    sns.scatterplot(x="longitude", y="latitude", hue=data[column].tolist(), ax=ax, s=s, alpha=a, palette=cmap,linewidth=0, hue_norm=hue_norm, data=data)
    ax.set_xlim(37, 38)    # min/max longitude of image 
    ax.set_ylim(55.5, 56)  # min/max latitude of image
    if not discrete:
        ax.legend().remove()
        ax.figure.colorbar(sm)
    ax.set_title(title)
    return ax, hue_norm


#helper functions for PCA
def make_mi_scores(X, y, discrete_features):
    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores


def plot_variance(pca, width=8, dpi=100):
    # Create figure
    fig, axs = plt.subplots(1, 2)
    n = pca.n_components_
    grid = np.arange(1, n + 1)
    # Explained variance
    evr = pca.explained_variance_ratio_
    axs[0].bar(grid, evr)
    axs[0].set(
        xlabel="Component", title="% Explained Variance", ylim=(0.0, 1.0)
    )
    # Cumulative Variance
    cv = np.cumsum(evr)
    axs[1].plot(np.r_[0, grid], np.r_[0, cv], "o-")
    axs[1].set(
        xlabel="Component", title="% Cumulative Variance", ylim=(0.0, 1.0)
    )
    # Set up figure
    fig.set(figwidth=width, dpi=dpi)  # use the function arguments instead of hard-coded values
    return axs


In [10]:
#installing packages
# !pip install lazypredict
# !pip install snapml

# import lazypredict
# from lazypredict import Supervised
# from lazypredict.Supervised import LazyRegressor
# from snapml import BoostingMachineRegressor 

# !pip install pandas --upgrade #to fix dependencies

<a id="3"></a> <br>
# 3. Exploratory Data Analysis
We wish to predict the price of a set of apartments, so our target is the variable 'price'. In this section we inspect this variable more closely, examine its correlation with other variables, and build a better understanding of the dataset we are working with.

<a id="3.1"></a> <br>
## 3.1 General Property Valuation Domain Knowledge 

To begin analyzing the Moscow Housing dataset, we first wanted to understand which elements matter in the domain of property valuation. Many economic factors shape housing prices: the state of the economy, supply and demand in the region where the predictions are made, consumer behavior, current mortgage interest rates, and much more. While these factors are very important for house price prediction in general, for us it is more important to look into domain knowledge that helps us predict prices from a set of features specific to each home. To simplify our task we therefore focus on Fair Market Value (FMV), both to better understand feature importance in the Moscow Housing dataset and to guide the creation of new features that can improve our predictions.

In real estate, FMV is defined as the price that a property will sell for in an open market. These values are usually estimated by a professional real estate appraiser. An appraisal is generally influenced by a visual inspection, but the key factors that professionals look for can be summed up as follows:
* Recent neighborhood sales of similar properties (in a 6 to 12 month period)
* Number of bedrooms
* Number of bathrooms
* Floor plan functionality / Layout
* Size and square footage
* Property conditions such as potentially needed repairs, aesthetics, etc
* Location of the property (Location, Location, Location!)
* Materials used in floors, walls, trims, exterior-walls, roof, windows
* On-site and nearby property characteristics
* Appraisers don't consider movable features or decor
* Heating and cooling systems and quality
* Energy-efficient features (for example energy-efficiency certifications, tankless water heaters, insulated ducts)
* If the home sits in some risk zone (nearby environmental hazards, flood zones, earthquake zones, clay soil under foundation, etc)

<a id="3.2"></a> <br>
## 3.2 Moscow Housing Dataset Domain Knowledge

Domain knowledge is required in this project to perform feature engineering. Some of it was given as is, while the rest was derived from careful observation of the data and online research. Based on this analysis, we identified features that might have an impact on the price as well as features that likely do not.
Here we list the most important observations:
1.	Buildings close to the city center should have a higher price due to their location.
2.	Apartments with a larger area should in general be pricier than those with a smaller area.
3.	Apartments with more area per room should be pricier than those with less area per room.
4.	Apartments with more bathrooms, balconies and windows facing the street should be more expensive.
5.	Apartments located near the top of a high-rise building should be more expensive than those on the lower floors.
6.	Apartments that are new and recently constructed should be more expensive than old apartments.
7.	Apartments with unusually high ceilings should be more expensive.
8.	Apartments located in a building with an unusually high number of stories should be more expensive.
9.	The condition of the apartment should help in assessing the price.
10.	Access to parking, heating and an elevator can also be an important contributor to the price.
11.	An accurate, precise and easy-to-interpret location of the apartment can be a very important factor in predicting the price.
12.	The material used for construction can also be an important contributor to the price of the apartment.
13.	The type of seller can be important, e.g., property dealers tend to sell at a higher price than private sellers since dealers need to keep their margins.
14.	Since Moscow gets cold in the winter, heating and insulation also matter for price predictions.

Unimportant Features
On the other hand, some features are unlikely to matter, e.g.:
1.	The presence of a phone in the building is probably not a deciding factor in the price, as phone connections are easy to install and sometimes not even needed.
2.	The name of the street should have no impact on the price of the apartment, since the location is already clearly given by the longitude and latitude.
3.	The address of the building should likewise have no impact on the price, given that longitude and latitude are included.


<a id="3.3"></a> <br>
## 3.3. Descriptive statistics

In [11]:
significant_features = ["price","area_total","rooms","stories"]
data[significant_features].describe().transpose()

We can tell from this summary that the 'price' variable has a very large range. 

<a id="3.4"></a> <br>
## 3.4. Distribution of outcome variable
We want to examine the distribution of the outcome variable. With the spread of the 'price' variable in mind, we apply a log transform and plot the transformed variable as well. In the plot of the raw training-set prices, we can see a sharp peak close to zero and a long tail. In the plot of the transformed variable, there is a clear peak but a much more tapered tail. The distribution still deviates from the normal distribution, as it is clearly positively skewed and has a relatively sharp peak.

In [13]:
data['log(price)'] = np.log10(data['price'])

fig, (ax1, ax2) = plt.subplots(figsize=(16, 4), ncols=2, dpi=100)
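# histplot with a KDE overlay replaces the deprecated sns.distplot
sns.histplot(data['price'], kde=True, stat='density', ax=ax1);
ax1.set_title('Distribution of raw train set prices');
sns.histplot(data['log(price)'], kde=True, stat='density', ax=ax2);
ax2.set_title('Distribution of train set prices after log transform');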

We want to examine the shape of the distribution. A normal distribution has skewness equal to 0 and kurtosis equal to 3. Kurtosis is a measure of the weight of the tails relative to the center of the distribution, and gives an impression of how extreme the values in the tails are. Note that pandas' `kurt()` reports excess kurtosis (Fisher's definition), for which a normal distribution scores 0, so the values printed below should be compared against 0 rather than 3.

We see that the skew is positive for both distributions, as observed. However, the kurtosis of 'price' is extremely high, meaning that we are likely to observe extreme values. This corresponds to the large range of the variable, which we have already observed. On the other hand, the kurtosis of 'log(price)' is lower than that of the normal distribution. This means that we are not likely to see a lot of extreme values.

In [14]:
print("Skewness of 'log(price)': %f" % data['log(price)'].skew())
print("Kurtosis of 'log(price)': %f \n" % data['log(price)'].kurt())

print("Skewness of 'price': %f" % data['price'].skew())
print("Kurtosis of 'price': %f" % data['price'].kurt())
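As a sanity check on how to read these numbers, the sketch below uses synthetic samples (not the housing data) to show that pandas reports excess kurtosis, so a normal sample scores close to 0 rather than 3. The sample sizes, the seed and all variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic samples (illustrative only, not the Moscow data)
normal_sample = pd.Series(rng.normal(size=200_000))
lognormal_sample = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=200_000))

# pandas uses Fisher's definition: a normal distribution has kurtosis ~0
excess_kurtosis = normal_sample.kurt()
pearson_kurtosis = excess_kurtosis + 3  # the "normal = 3" convention

print(f"normal:     skew={normal_sample.skew():.3f}, excess kurtosis={excess_kurtosis:.3f}")
print(f"log-normal: skew={lognormal_sample.skew():.3f}, excess kurtosis={lognormal_sample.kurt():.3f}")

# Taking the log of a log-normal sample recovers a normal-shaped distribution
logged = np.log(lognormal_sample)
print(f"log of log-normal: skew={logged.skew():.3f}, excess kurtosis={logged.kurt():.3f}")
```

The heavy right tail of the log-normal sample mirrors the behaviour of raw prices, while its logarithm behaves like the normal reference.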

<a id="3.4"></a> <br>
## 3.4. Distance from city center
The dataset includes the variables 'latitude' and 'longitude'. We want to visualize the relationship between location and price. From experience, we know that living costs tend to be higher close to the city center, and our assumption is that this is true for the Moscow dataset as well. The plot below places apartments by their longitude and latitude and color-codes them according to price. We can easily see that apartments close to the city center are in a higher price range (yellow), and apartments further out are in a lower price range (red).

In [15]:
plot_map(data);

Although the plot above does show the price decreasing when moving away from the city center, we also see some high-price (yellow) outliers in areas that otherwise have low-price (red) apartments. In addition, the apartments are mostly high or low price, and there are few orange (mid-range) dots. Because the plot assigns color based on the total price, it does not take into account that a very large apartment has a high price regardless of its geographical location. In reality, we know that apartment prices are relative to the size of the apartment. We therefore repeat the plot, but this time color-code based on price per square meter. We now see a more gradual decrease in price, moving from yellow, through orange, and into red. We also observe many more mid-range apartments that were missed by the first plot. This plot agrees well with our prior knowledge of apartment pricing.

In [16]:
data['price/area'] = data['price'] / data['area_total']

plot_map(data, column='price/area', title='Moscow apartment price per square meter by location');

<a id="3.5"></a> <br>
## 3.5. Correlation with outcome variable
We would also like to examine the correlation between features in the dataset and the target. This may give us an indication on which features are important.

In [17]:
# Should probably reduce this slightly and not include all variables 
ignore_cols = ['id', 'latitude', 'longitude', 'street', 'address', 'building_id']
corrmat = (data.drop(ignore_cols, axis=1)).corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);

In [18]:
k = 10
cols = corrmat.nlargest(k, 'price')['price'].index
corrmat_price = data[cols].drop(['price/area','log(price)'], axis=1).corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat_price, vmax=.8, square=True,annot=True);

In [19]:
cols = corrmat.nlargest(k, 'price/area')['price/area'].index

corrmat_reduced = data[cols].drop(['price', 'log(price)'], axis=1).corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat_reduced, vmax=.8, square=True,annot=True);

The heatmaps are reduced to the 10 features most correlated with the outcome. Although 'area_total' and 'area_living' are highly correlated with the outcome, few other features show a strong correlation. However, correlation only measures the strength of a linear relationship, so weakly correlated features may still relate to the price non-linearly, and there may also be multicollinearity between variables. This suggests that a non-linear model is better suited for the prediction. It also encourages feature engineering in order to obtain features with a stronger correlation.
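To illustrate why a low linear correlation does not rule out a strong relationship, here is a minimal sketch on synthetic data (not the housing data). It compares the Pearson correlation with a rank-based (Spearman) correlation, computed here as the Pearson correlation of the ranks. The exponential relationship and all names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic feature with a monotonic but strongly non-linear effect on the target
x = rng.uniform(0, 10, size=5_000)
y = np.exp(x) + rng.normal(scale=1.0, size=x.size)
toy = pd.DataFrame({"x": x, "y": y})

# Pearson measures linear association
pearson = toy["x"].corr(toy["y"])

# Spearman is the Pearson correlation of the ranks (monotonic association)
spearman = toy["x"].rank().corr(toy["y"].rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A large gap between the two values hints that a transformation of the feature, or a non-linear model, could capture the relationship better.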

In [20]:
# scatterplot to see relationship between variables
lon1 =  37.621390
lat1 = 55.753098
geodesic = pyproj.Geod(ellps='WGS84')
distance_arr = []
for i in range(len(data["longitude"])):
    fwd_azimuth,back_azimuth,distance = geodesic.inv(lon1, lat1, data["longitude"][i], data["latitude"][i])
    distance_arr.append(distance)


data['distance'] = distance_arr

sns.set()
columns = ['price/area', 'area_total', 'area_living', 'area_kitchen','distance']
sns.pairplot(data[columns], height=2.5)  # `size` was renamed to `height` in seaborn
plt.show();

We see that there appears to be a linear relationship between 'area_total' and 'area_living', which is sensible, as an increase in the total area of an apartment also increases the living area. We see a somewhat similar trend between 'area_kitchen' and both of these variables. However, there is a clear cone shape to the plot, meaning that the spread grows with apartment size.

In the top right graph, between price per square meter and distance, we can see that the price decreases as the distance increases. A similar trend can also be seen on the map.
Overall, domain knowledge tells us that distance from the city center, total area, etc. are all important features, and it is not surprising that the most central apartments are the most expensive ones. We will now consider each feature more closely and either clean it or use it for feature engineering.
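The distance above is computed row by row with `pyproj`. As a rough cross-check, the sketch below computes a vectorized great-circle (haversine) distance with NumPy. It assumes a spherical Earth with a mean radius of 6371 km, so its results will differ slightly from the WGS84 ellipsoid distances; the function name and the two sample points are made up for illustration.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000  # mean Earth radius (spherical approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres; inputs in degrees, arrays are broadcast."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

center_lat, center_lon = 55.753098, 37.621390  # city-center coordinates used above

# Two made-up points: the center itself and a point 0.1 degrees further north
lats = np.array([55.753098, 55.853098])
lons = np.array([37.621390, 37.621390])
distances = haversine_m(center_lat, center_lon, lats, lons)
print(distances)
```

Because the function broadcasts over arrays, whole latitude and longitude columns can be passed in at once instead of looping over rows.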



<a id="3.6"></a> <br>
## 3.6 Preventing Data Leakage

We noticed early on that the apartments in the test set and the training set do not share any building ids. Therefore, we have to be careful when creating the validation and training split in order to prevent data leakage.

We therefore decided that it is important to split on building ids rather than on individual apartments. To demonstrate this, we look at a simple model.

<b>First we train a simple model with an apartment-level split:</b>

In [21]:
apartments = pd.read_csv('data/apartments_train.csv')
buildings = pd.read_csv('data/buildings_train.csv')
data = pd.merge(apartments, buildings.set_index('id'), how='left', left_on='building_id', right_index=True)

 

data_train, data_valid = model_selection.train_test_split(data, test_size=0.33, stratify=np.log(data.price).round())

features = ["ceiling",  "rooms", "area_total", "area_kitchen", "area_living", "floor", "new", "bathrooms_shared", "bathrooms_private", 'parking', 'heating',
            "latitude", "longitude","district", "constructed", "condition", "seller", "material", "stories"]

train_x = data_train[features]
train_y = np.log1p(data_train['price'])
test_x = data_valid[features]
test_y = np.log1p(data_valid['price'])


model =XGBRegressor()

model.fit(train_x,train_y, verbose=False)

preds = model.predict(test_x)
rmsle_apartments = root_mean_squared_log_error(y_true=np.expm1(test_y), y_pred=np.expm1(preds))

print("RMSLE:", rmsle_apartments)

<b>Split model based on Building_id:</b>

In [22]:
gs = model_selection.GroupShuffleSplit(n_splits=2, test_size=.33, random_state=0)
train_index, valid_index = next(gs.split(data, groups=data.building_id))

data_train = data.loc[train_index]
data_valid = data.loc[valid_index]


train_x = data_train[features]
train_y = np.log1p(data_train['price'])
test_x = data_valid[features]
test_y = np.log1p(data_valid['price'])


model =XGBRegressor()

model.fit(train_x,train_y, verbose=False)

preds = model.predict(test_x)
rmsle_building = root_mean_squared_log_error(y_true=np.expm1(test_y), y_pred=np.expm1(preds))

print("RMSLE:", rmsle_building)

The RMSLE from the building split is much higher, but also much more realistic, as it shows how the model performs when the training and validation sets contain no apartments from the same building. This matches how the test set is separated from the training set, so a building-based split should narrow the gap between validation and test performance.
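The leakage itself can be made visible by counting how many buildings end up on both sides of each split. The following is a minimal sketch on a made-up frame (the building ids, prices and the helper `shared_buildings` are assumptions for illustration, not the actual dataset).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit, train_test_split

rng = np.random.default_rng(0)

# Toy stand-in for the merged data: 1000 apartments spread over 100 buildings
toy = pd.DataFrame({
    "building_id": rng.integers(0, 100, size=1_000),
    "price": rng.lognormal(mean=16, sigma=0.5, size=1_000),
})

def shared_buildings(train, valid):
    """Number of buildings that appear on both sides of a split."""
    return len(set(train["building_id"]) & set(valid["building_id"]))

# Apartment-level split: rows are shuffled without regard to building
train_a, valid_a = train_test_split(toy, test_size=0.33, random_state=0)

# Building-level split: every building lands entirely on one side
gss = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
train_idx, valid_idx = next(gss.split(toy, groups=toy["building_id"]))
train_b, valid_b = toy.iloc[train_idx], toy.iloc[valid_idx]

overlap_apartment_split = shared_buildings(train_a, valid_a)
overlap_building_split = shared_buildings(train_b, valid_b)
print("Shared buildings, apartment split:", overlap_apartment_split)
print("Shared buildings, building split: ", overlap_building_split)
```

`GroupShuffleSplit` returns positional indices, so `iloc` is used here; with a default `RangeIndex`, `loc` selects the same rows.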

<a id="4"></a> <br>
# 4. Data Cleaning

Before starting the data cleaning it is important to get a perspective of how many NaN values we have in the dataset as well as proceed with trying to understand what types of problems we might expect to encounter. Important questions to ask during cleaning:

* <b>Is a feature and its data intuitive on its own?</b>
* <b>Is a feature consistent in regards to closely related features?</b> For example area_kitchen, area_living and area_total
* <b>How prevalent are the NaN values?</b>
* <b>Are NaN values placed randomly or does it have a significant pattern?</b>

In [23]:
nan_percentage = data.isna().sum()/len(data) *100
nan_percentage = nan_percentage.sort_values(ascending=False)

nan_percentage_test = data_test.isna().sum()/len(data_test) *100  # divide by the test set size, not the training set size
nan_percentage_test = nan_percentage_test.sort_values(ascending=False)

# Percent missing data by feature
f, ax = plt.subplots(1, 2, sharex=True, figsize=(20,10))
sns.barplot(ax=ax[0], x=nan_percentage.index, y=nan_percentage)
ax[0].set_xlabel('Features', fontsize=15)
ax[0].set_ylabel('Percent of missing values', fontsize=15)
ax[0].set_title('Percent missing data by feature - TRAINING DATA', fontsize=10)
ax[0].set_xticklabels(nan_percentage.sort_values(ascending=False).index, rotation=90)



# Percent missing data by feature
sns.barplot(ax=ax[1], x=nan_percentage_test.index, y=nan_percentage_test)
ax[1].set_xlabel('Features', fontsize=15)
ax[1].set_ylabel('Percent of missing values', fontsize=15)
ax[1].set_title('Percent missing data by feature - TEST DATA', fontsize=10)
ax[1].set_xticklabels(nan_percentage_test.sort_values(ascending=False).index, rotation=90)

print("histograms of NaN values")

It is interesting to see that the NaN values follow a pattern in both the training and test sets, and that the distribution of missing values is fairly similar overall. This suggests that the split was not designed to make the test set harder than the training set, but rather that the test set is meant to represent a wide variety of housing and features, much like the training set. From this we conclude that the train and test sets were collected and composed in the same fashion, and that it is therefore safe to treat NaN values the same way in both datasets.

Most NaN values can be set to the mode or mean if no better imputation method is found. For "Layout", however, we concluded that although it could carry some signal for prediction, it can safely be dropped, as over 70% of its values are missing in the training set.
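The train/test comparison of missing values can also be done numerically rather than visually. Below is a small sketch with a hypothetical helper, `missing_rate_comparison`, applied to made-up frames that stand in for `data` and `data_test`; each rate is computed relative to the length of its own frame.

```python
import numpy as np
import pandas as pd

def missing_rate_comparison(train, test):
    """Percent of missing values per shared column, side by side."""
    shared = [c for c in train.columns if c in test.columns]
    table = pd.DataFrame({
        "train_pct": train[shared].isna().mean() * 100,
        "test_pct": test[shared].isna().mean() * 100,
    })
    table["abs_diff"] = (table["train_pct"] - table["test_pct"]).abs()
    return table.sort_values("train_pct", ascending=False)

# Tiny made-up frames standing in for `data` and `data_test`
toy_train = pd.DataFrame({"layout": [np.nan, np.nan, np.nan, 1.0],
                          "ceiling": [2.7, np.nan, 3.0, 2.5],
                          "price": [1.0, 2.0, 3.0, 4.0]})
toy_test = pd.DataFrame({"layout": [np.nan, 0.0],
                         "ceiling": [np.nan, 2.6]})

comparison = missing_rate_comparison(toy_train, toy_test)
print(comparison)
```

Columns with a large `abs_diff` would be candidates for a closer look before applying the same imputation to both sets.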

In the next sections of this chapter we take you through the handling of different outliers by analyzing whether the data is intuitive on its own and/or consistent with closely related features.

<a id="4.1"></a> <br>
## 4.1 LongLat Outliers & NaN Values

We observed three different anomalies in the longitude and latitude test data:
* NaN values
* Negative values
* Blown-up values outside of Moscow


In [24]:
data_test[["latitude","longitude", "district", "building_id", "address","street"]].query("longitude < 36.8 | longitude > 38.88 | latitude < 55.18 | latitude > 56.082 | longitude != longitude | latitude != latitude")

Here we use the address and street to find the correct longitude and latitude values in Google Maps and adjust the coordinates accordingly.

In [25]:
# Positional column indices, so that values are written into data_test itself
# (chained assignment such as data_test.latitude.iloc[i] = ... may not modify the frame)
lat_col = data_test.columns.get_loc('latitude')
lon_col = data_test.columns.get_loc('longitude')

def set_coordinates(rows, lat, lon):
    """Overwrite latitude/longitude for the given row positions of data_test."""
    data_test.iloc[rows, lat_col] = lat
    data_test.iloc[rows, lon_col] = lon

#Fixing the NaN values
set_coordinates([90, 23], 55.568139, 37.481831)

#Fixing negative values
set_coordinates([2511, 6959, 5090, 8596], 55.544066, 37.482317)

#Fixing blown up values
set_coordinates([2529], 55.764335, 37.907556)
set_coordinates([4719, 9547], 55.765430, 37.928284)
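After such manual fixes it can be useful to re-check that no anomalous coordinates remain. The sketch below wraps the bounding box from the query above in a hypothetical helper, `bad_coordinates`, and runs it on made-up rows covering the three anomaly types.

```python
import numpy as np
import pandas as pd

# Bounding box taken from the query above
LON_MIN, LON_MAX = 36.8, 38.88
LAT_MIN, LAT_MAX = 55.18, 56.082

def bad_coordinates(df):
    """Boolean mask of rows whose coordinates are missing or outside the box."""
    in_box = (df["longitude"].between(LON_MIN, LON_MAX)
              & df["latitude"].between(LAT_MIN, LAT_MAX))
    return ~in_box  # NaN never satisfies `between`, so missing values are flagged too

# Made-up rows: one valid, one NaN, one negative, one far outside the box
toy = pd.DataFrame({
    "latitude":  [55.75, np.nan, -55.54, 55.60],
    "longitude": [37.62, 37.48, 37.48, 132.76],
})
mask = bad_coordinates(toy)
print(toy[mask])
```

Applied to the corrected test set, an empty result would indicate that all coordinates now fall inside the expected region.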


<a id="4.2"></a> <br>
## 4.2 District Outliers & NaN Values

In this section we look into handling District Outliers and NaN values.

In [26]:
data_central = data.copy().drop(data[(data['longitude'] < 37.1) & (data['latitude'] < 55.6)].index)
sns.set(rc={'figure.figsize':(30,20)})
ax = sns.scatterplot(data= data_central[["latitude","longitude"]],x="longitude", y="latitude",  hue=data_central["district"], palette="Paired")

ax.set_title("tdt4173 dataset districts - WITH OUTLIERS")


From this graph we can observe that some districts are assigned in the middle of other districts, so we decided to treat these as possible outliers. One example is a green dot (district category 3) in a cluster of light blue dots (district category 0).

In [27]:
# Fixing training data districts (.loc avoids chained assignment)
data.loc[data.building_id == 2029, 'district'] = 0
data.loc[data.building_id == 1255, 'district'] = 0
data.loc[data.building_id == 4162, 'district'] = 5

In [28]:
#Note: value != value in query is the way to identify NaN values
data_test[["latitude","longitude", "district", "building_id", "address","street"]].query("district != district")

We filled in all the NaN districts by using their longitude and latitude coordinates to work out which district cluster each building most likely belongs to.

In [29]:
# fixing test_data districts (.loc avoids chained assignment):
data_test.loc[data_test.building_id.isin([3803, 4636, 4412]), 'district'] = 11

data_test.loc[data_test.building_id.isin([926, 4202, 8811, 6879, 5667]), 'district'] = 3

data_test.loc[data_test.building_id.isin([2265, 6403, 7317, 1647, 183]), 'district'] = 5
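The districts above were assigned by hand. As an alternative that was not used in this notebook, the same idea of looking at the surrounding cluster could be automated with a nearest-neighbour vote on the coordinates. The sketch below is only an illustration on made-up points: the helper name and `n_neighbors` are assumptions, and raw latitude/longitude differences are used as an approximate distance.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

def fill_district_by_neighbors(df, n_neighbors=5):
    """Fill missing `district` from the nearest rows with a known district."""
    df = df.copy()
    known = df["district"].notna()
    if known.all():
        return df
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(df.loc[known, ["latitude", "longitude"]], df.loc[known, "district"])
    df.loc[~known, "district"] = knn.predict(df.loc[~known, ["latitude", "longitude"]])
    return df

# Made-up points: two well-separated clusters plus one unlabeled row near each
toy = pd.DataFrame({
    "latitude":  [55.70, 55.71, 55.72, 55.90, 55.91, 55.92, 55.705, 55.915],
    "longitude": [37.50, 37.51, 37.52, 37.80, 37.81, 37.82, 37.505, 37.815],
    "district":  [3, 3, 3, 5, 5, 5, np.nan, np.nan],
})
filled = fill_district_by_neighbors(toy, n_neighbors=3)
print(filled["district"].tolist())
```

Near district boundaries such a vote can be wrong, which is one reason a manual check against a map remains a reasonable choice for a handful of rows.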

In [30]:
data_central = data_test.copy().drop(data_test[(data_test['longitude'] < 37.1) & (data_test['latitude'] < 55.6)].index)
sns.set(rc={'figure.figsize':(10,10)})
ax = sns.scatterplot(data= data_central[["latitude","longitude"]],x="longitude", y="latitude",  hue=data_central["district"], palette="Paired")
ax.set_title("tdt4173 dataset districts - FIXED")


This shows the resulting cleaned version of the test set. There are still overlaps here and there between the district clusters, but we weren't confident enough about the district boundaries of Moscow to decide whether these are outliers or correct values.

<a id="4.3"></a> <br>
## 4.3 General NaN values

The bathrooms_shared and bathrooms_private features contain NaN values. After some analysis we decided to fill the NaN values in both features with 1. While many old apartments might not have any bathrooms, we considered it more reasonable to assume at least one bathroom. Another option would have been to fill the NaN values with 99999, a common Kaggle trick for signalling to the model that the NaN values form their own category.

For parking we decided to create a new class called 'No Parking' and filled all the NaN values with it.
For heating we assume that, since Moscow is really cold, all apartments probably have some form of heating. Therefore, we filled all the NaN values with the most common class.



In [31]:
data["bathrooms_shared"] = data["bathrooms_shared"].fillna(1)
data["bathrooms_private"] = data["bathrooms_private"].fillna(1)

data_test["bathrooms_shared"] = data_test["bathrooms_shared"].fillna(1)
data_test["bathrooms_private"] = data_test["bathrooms_private"].fillna(1)


data["parking"] =  data["parking"].fillna(3.0)
data_test["parking"] =  data_test["parking"].fillna(3.0)

data["heating"] =  data["heating"].fillna(0.0)
data_test["heating"] =  data_test["heating"].fillna(0.0)

<a id="4.4"></a> <br>
## 4.4 Seller Outliers & NaN Values

In this section we handle the seller outliers and NaN values. We decided to fill the NaN values in a way that does not affect the distribution of this variable.

Essentially, we used the frequency distribution of the categories and filled the NaN values at random according to that probability distribution. For seller there is no dependable feature from which we can derive the missing data.

In [32]:
fig, (ax1, ax2) = plt.subplots(figsize=(12, 4), ncols=2, dpi=100)
print(f'{len(data)} training samples and {len(data_test)} test samples')

sns.histplot((data.seller).rename('Seller'), ax=ax1);
sns.histplot((data_test.seller).rename('Seller'), ax=ax2);
ax1.set_title('Training set sellers'); ax2.set_title('Test set sellers');

We decided to split the NaN values based on the probability distribution of the sellers. 
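
A minimal sketch of the same idea that derives the probabilities from the data instead of typing them in by hand. The helper name `fill_by_observed_frequencies` and the toy series are assumptions made for illustration; only the seller codes 0 to 3 come from the dataset.

```python
import numpy as np
import pandas as pd

def fill_by_observed_frequencies(reference, target, seed=0):
    """Fill NaNs in `target` by sampling categories with the shares seen in `reference`."""
    # value_counts ignores NaN by default, so the shares sum to 1
    shares = reference.value_counts(normalize=True).sort_index()
    rng = np.random.default_rng(seed)
    filled = target.copy()
    missing = filled.isna()
    filled.loc[missing] = rng.choice(
        shares.index.to_numpy(), size=int(missing.sum()), p=shares.to_numpy()
    )
    return filled

# Toy stand-ins for data["seller"] and data_test["seller"]
toy_train = pd.Series([0, 1, 1, 3, 3, 3, np.nan, 2, np.nan])
toy_test = pd.Series([np.nan, 1, 3, np.nan])

# The training shares are used for both sets, as in the notebook
toy_train_filled = fill_by_observed_frequencies(toy_train, toy_train)
toy_test_filled = fill_by_observed_frequencies(toy_train, toy_test)
```

Computing the shares from the training column keeps the fill consistent if the cleaning steps before it change.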

In [33]:
# Share of each seller category among the non-NaN values
n_known = data["seller"].notna().sum()
for i in range(4):
    print(i, " ", (data["seller"] == i).sum() / n_known)

In [34]:
list_of_candidates = [0,1,2,3]
# 14455 , owener 0, 
probability_distribution  = [0.11, 0.33, 0.13, 0.43]
number_of_items_to_pick = data['seller'].isna().sum()
number_of_items_to_pick_test = data_test['seller'].isna().sum()

np.random.seed(0)

draw = choice(list_of_candidates, number_of_items_to_pick,
              p=probability_distribution)
draw_test = choice(list_of_candidates, number_of_items_to_pick_test,
              p=probability_distribution)

# Use .loc instead of chained indexing, which may fail to update the DataFrame
data.loc[data['seller'].isna(), 'seller'] = draw
data_test.loc[data_test['seller'].isna(), 'seller'] = draw_test

<a id="4.5"></a> <br>
## 4.5 Area Kitchen and Area Living Outliers & NaN Values

Here we observe that area_living and area_kitchen not only contain NaN values but also have multiple outliers. For instance, it does not make sense that the sum of the living and kitchen areas is greater than or equal to area_total, especially since the total also has to account for loggias, balconies and bathrooms.

In [35]:
#Outliers
print(data[["area_total","area_living", "area_kitchen"]].query("area_living + area_kitchen >= area_total"))

#NaN values
print(data[["area_total","area_living", "area_kitchen"]].query("area_living != area_living | area_kitchen != area_kitchen"))

We decided to fix this issue by looking at the mean kitchen and living shares of area_total among the non-outlier rows. Note that we use the means computed on the training set to populate the outlier and NaN values of both the training and the test set. In many Kaggle competitions where the test set is available, it is used for applications such as this one, with the mean of a feature computed over both the training and the test set. Even though this might give a boost in performance, we decided not to do this, as in a real-world application test data is usually not accessible.
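
The procedure can be sketched on a toy frame as follows. The helper names `invalid_area_mask` and `impute_areas` and the numbers are illustrative assumptions; the column names are the ones used in the dataset.

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({
    "area_total":   [50.0, 80.0, 40.0, 60.0],
    "area_living":  [30.0, 40.0, 45.0, np.nan],
    "area_kitchen": [10.0, 16.0, 10.0, 12.0],
})

def invalid_area_mask(df):
    """Rows where living + kitchen >= total, or where either area is missing."""
    too_large = df["area_living"] + df["area_kitchen"] >= df["area_total"]
    missing = df["area_living"].isna() | df["area_kitchen"].isna()
    return too_large | missing

# Mean share of the total area, computed on valid training rows only
valid = toy_train[~invalid_area_mask(toy_train)]
kitchen_share = (valid["area_kitchen"] / valid["area_total"]).mean()
living_share = (valid["area_living"] / valid["area_total"]).mean()

def impute_areas(df, kitchen_share, living_share):
    """Replace invalid kitchen and living areas by a fixed share of area_total."""
    out = df.copy()
    bad = invalid_area_mask(out)
    out.loc[bad, "area_kitchen"] = out.loc[bad, "area_total"] * kitchen_share
    out.loc[bad, "area_living"] = out.loc[bad, "area_total"] * living_share
    return out

toy_fixed = impute_areas(toy_train, kitchen_share, living_share)
```

The same `kitchen_share` and `living_share` would then be passed to `impute_areas` for the test set, so no test statistics enter the imputation.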

In [36]:
percentage_area_data = pd.DataFrame()
percentage_area_data["area_kitchen"] = data["area_kitchen"][data.area_living + data.area_kitchen < data.area_total]/data["area_total"][data.area_living + data.area_kitchen < data.area_total]
percentage_area_data["area_living"] = data["area_living"][data.area_living + data.area_kitchen < data.area_total]/data["area_total"][data.area_living + data.area_kitchen < data.area_total]

mean_kitchen = percentage_area_data["area_kitchen"].mean()
mean_living = percentage_area_data["area_living"].mean()
print("Mean kitchen area of the Non-outlier values:", mean_kitchen)
print("Mean living area of the Non-outlier values:", mean_living)

In [37]:

#to omit bugs
data["area_kitchen_edit"] = data["area_kitchen"].copy()
data["area_living_edit"] = data["area_living"].copy()

data["area_kitchen_edit"][(data.area_living + data.area_kitchen >= data.area_total) | (data.area_living.isna() | data.area_kitchen.isna())] = data.area_total*mean_kitchen
data["area_living_edit"][(data.area_living + data.area_kitchen >= data.area_total) | (data.area_living.isna() | data.area_kitchen.isna())] = data.area_total*mean_living

data["area_kitchen"] = data["area_kitchen_edit"].copy()
data["area_living"] = data["area_living_edit"].copy()

#test_set
data_test["area_kitchen_edit"] = data_test["area_kitchen"].copy()
data_test["area_living_edit"] = data_test["area_living"].copy()

data_test["area_kitchen_edit"][(data_test.area_living + data_test.area_kitchen >= data_test.area_total) | (data_test.area_living.isna() | data_test.area_kitchen.isna())] = data_test.area_total*mean_kitchen
data_test["area_living_edit"][(data_test.area_living + data_test.area_kitchen >= data_test.area_total) | (data_test.area_living.isna() | data_test.area_kitchen.isna())] = data_test.area_total*mean_living


data_test["area_kitchen"] = data_test["area_kitchen_edit"].copy()
data_test["area_living"] = data_test["area_living_edit"].copy()


<a id="4.6"></a> <br>
## 4.6 Ceiling Outliers & NaN Values

We could observe certain outliers in the ceiling data, where the ceiling is lower than 1 meter or higher than 9 meters. There might be some very high ceilings, but heights above 9 meters are very unlikely. For these outliers as well as the NaN values we use the mode, 2.64 m, as the replacement value.

In [38]:
data[['ceiling']].query("ceiling < 1 | ceiling > 9")

In [39]:
maxc = 9
minc = 1
# Compute the training-set mode once instead of once per row
ceiling_mode = data["ceiling"].mode()[0]
for df in (data, data_test):
    # Outliers and NaN values are both replaced by the training-set mode
    invalid = (df["ceiling"] < minc) | (df["ceiling"] > maxc) | df["ceiling"].isna()
    df.loc[invalid, "ceiling"] = ceiling_mode


<a id="4.7"></a> <br>
## 4.7 Condition Outliers & NaN Values

Condition is a feature split into 4 categories: Decorated, Euro_repair, Undecorated and Special Design. We decided to fill the NaN values with a new category, as the condition is unknown and deriving it from other features is not as clear-cut as it was for some of the other features.

In [40]:
unknown =  4.0
data['condition'] = data['condition'].fillna(unknown)
data_test['condition'] = data_test['condition'].fillna(unknown)

<a id="4.8"></a> <br>
## 4.8 Floor and Stories Outliers & NaN Values

We collect the building IDs of all buildings where the floor of any apartment is greater than the total number of
stories in that building. We then replace the stories of each such building with the maximum floor number among its apartments.
In short, we chose to trust the floor number of each apartment over the stories value of the building.
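
A vectorized sketch of this rule on toy data, assuming that `stories` is constant within a building. The frame `toy` is made up for illustration; `building_id`, `floor` and `stories` are the dataset's column names.

```python
import pandas as pd

toy = pd.DataFrame({
    "building_id": [1, 1, 2, 2],
    "floor":       [3, 12, 2, 4],
    "stories":     [10, 10, 5, 5],
})

# Highest floor observed in each building, broadcast back to every row
max_floor = toy.groupby("building_id")["floor"].transform("max")

# Raise stories to the highest observed floor where it is too small
toy["stories"] = toy["stories"].mask(toy["stories"] < max_floor, max_floor)
```

Building 1 contains an apartment on floor 12, so its `stories` value is raised from 10 to 12, while building 2 is left unchanged.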

In [41]:
data[["floor", "stories"]][data.floor > data.stories]

In [42]:
idss = data[["building_id"]][data.floor > data.stories].sort_values("building_id").drop_duplicates()
#data['storiesnew'] = data['stories'].copy()
for i in range(idss.size):
    max_floor = data['floor'][data["building_id"] == idss["building_id"].iloc[i]].max()
    data['stories'][data["building_id"] == idss["building_id"].iloc[i]] =  max_floor

idss_test = data_test[["building_id"]][data_test.floor > data_test.stories].sort_values("building_id").drop_duplicates()
#data['storiesnew'] = data['stories'].copy()
for i in range(idss_test.size):
    max_floor_test = data_test['floor'][data_test["building_id"] == idss_test["building_id"].iloc[i]].max()
    data_test['stories'][data_test["building_id"] == idss_test["building_id"].iloc[i]] =  max_floor_test


<a id="4.9"></a> <br>
## 4.9 Material Outliers & NaN Values

We hypothesize that the class Monolith brick is the same as Monolith, so we merge class 5 into class 2 as part of the cleaning process.
We also move Stalin from class 6 to class 5, as class 5 is empty after the merge. Then we use the mode of material as the replacement for NaN values.

In [43]:
data.material[data.material==5] = 2.0 # merging monolith brick with monolith
data.material[data.material==6] = 5.0 #stalin to 5

data_test.material[data_test.material==5] = 2.0
data_test.material[data_test.material==6] = 5.0

data['material'][data.material.isna()] = data['material'].mode()[0]
data_test['material'][data_test.material.isna() ] = data['material'].mode()[0]

<a id="4.10"></a> <br>
## 4.10 Constructed and New Outliers & NaN Values

If `constructed` is NaN while `new` is known and equal to 0.0 (the apartment is not new, which is the condition checked in the code below), we fill `constructed` with 2019, since 2019 is the most prevalent construction year among the apartments. All other missing `constructed` values are filled with 2021. After filling `constructed`, we fill the NaN values of `new`: 0.0 (old) if the apartment was constructed before 2020, and 1.0 otherwise. Values that are not NaN are left unchanged.
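
The row-wise logic of the code cell below can also be written with boolean masks. This is a sketch on a made-up frame `toy` that mirrors the conditions of that cell (2019 when `new == 0.0`, 2021 otherwise, then `new` derived from the year); it is not the notebook's own implementation.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "constructed": [np.nan, np.nan, 1990.0, 2021.0, np.nan],
    "new":         [0.0,    1.0,    np.nan, np.nan, np.nan],
})

# Step 1: missing year and new == 0.0 -> 2019; every other missing year -> 2021
toy.loc[toy["constructed"].isna() & (toy["new"] == 0.0), "constructed"] = 2019
toy.loc[toy["constructed"].isna(), "constructed"] = 2021

# Step 2: missing `new` -> 0.0 if constructed before 2020, else 1.0
missing_new = toy["new"].isna()
toy.loc[missing_new, "new"] = (toy.loc[missing_new, "constructed"] >= 2020).astype(float)
```

A comparison against NaN is always false, so `toy["new"] == 0.0` already excludes rows where `new` is missing.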

In [44]:
data['constructed'] = data.apply(
                lambda row: 2019 if (np.isnan(row['constructed']) and ~np.isnan(row['new']) and row['new'] == 0.0) else( 2021 if (np.isnan(row['constructed'])) else row['constructed']),
                axis=1
            )  
data['new'] = data.apply(
    lambda row: 0.0 if (np.isnan(row['new']) and row['constructed'] < 2020) else( 1.0 if (np.isnan(row['new'])) else row['new']),
    axis=1
)      


data_test['constructed'] = data_test.apply(
    lambda row: 2019 if (np.isnan(row['constructed']) and ~np.isnan(row['new']) and row['new'] == 0.0) else( 2021 if (np.isnan(row['constructed'])) else row['constructed']),
    axis=1
)  
data_test['new'] = data_test.apply(
    lambda row: 0.0 if (np.isnan(row['new']) and row['constructed'] < 2020) else( 1.0 if (np.isnan(row['new'])) else row['new']),
    axis=1
)     

<a id="4.11"></a> <br>
## 4.11 Balconies and Loggias NaN Values

Since the data does not contain any apartments with both zero balconies and zero loggias, we assume that NaN represents the absence of a balcony or loggia and fill the NaN values with 0.

In [45]:
data[["balconies","loggias"]].query("balconies == 0 & loggias == 0 ")

In [46]:
# Therefore, we fill all the NANs with 0
data['balconies'] = data['balconies'].fillna(0)
data['loggias'] = data['loggias'].fillna(0)

data_test['balconies'] = data_test['balconies'].fillna(0)
data_test['loggias'] = data_test['loggias'].fillna(0)

<a id="4.12"></a> <br>
## 4.12 Elevator NaN Values

The elevator features are interesting to look at together as they are connected to each other. We therefore decided to clean the missing elevator values during feature engineering, as it made more sense to treat the NaN values of the engineered elevator feature. In this section we will, however, look into possible outliers and data combinations that could be unintuitive or illogical.

In [47]:
d = data[["elevator_without", "elevator_passenger", "elevator_service"]]
d

In [48]:
d.query("elevator_without == 0 & elevator_passenger == 0 & elevator_service == 0")

It is good to see that the data here is intuitive, as the combination <b>"elevator_without == 0 & elevator_passenger == 0 & elevator_service == 0"</b> does not exist. Such a row would mean that the apartment has neither a passenger nor a service elevator, while elevator_without == 0 claims that it does have elevator access. Since this combination is contradictory, it is good that no apartments contain it. All the other binary combinations do make sense; they are broken down into categories and treated in section 5.2 under Feature Engineering.

<a id="4.13"></a> <br>
## 4.13 Remaining NaN Values After Cleaning


In [49]:
data.isna().sum()

As observed there are still missing values for the following (similarly for the test data as well):
* Layout
* Windows
* Phones
* Elevator Features
* Garbage Chutes


For the elevators we decided to combine the elevator features into a new categorical feature and then fill the NaN values of that engineered feature. The other features were dropped for various reasons. For instance, layout had over 70% missing values, and phones does not add any particular importance to the price predictions. For windows and garbage chutes, we did not find a sensible way to impute the NaN values, so we decided to drop them altogether. It is always possible to assign a new category that stands for "unknown", but we decided that these features were not significant enough to warrant the effort.

<a id="5"></a> <br>
# 5. Feature Engineering
The information available in the dataset is not necessarily presented in a way a machine learning model is able to make good use of. In this section, we use the data available to engineer features that are more meaningful and more strongly correlated to the outcome. 

<a id="5.1"></a> <br>
## 5.1 Euclidean Distance From City Center
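We add the distance from the city center as a feature. Note that this Euclidean distance is computed directly on longitude and latitude, so it is expressed in degrees rather than meters.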

In [50]:
origin_coordinates = (37.621390,55.753098)
distance_from_city_center = np.sqrt((origin_coordinates[0] - data["longitude"])**2+(origin_coordinates[1] - data["latitude"])**2)
data["distance_from_city_center"] = distance_from_city_center

distance_from_city_center_t = np.sqrt((origin_coordinates[0] - data_test["longitude"])**2+(origin_coordinates[1] - data_test["latitude"])**2)
data_test["distance_from_city_center"] = distance_from_city_center_t

<a id="5.2"></a> <br>
## 5.2 Combining Elevator Features
***The logic behind this feature***
Each combination of the three elevator flags is mapped to a category, interpreted as follows:
* elevator_without == 1 & elevator_passenger == 0 & elevator_service == 0 => <b>No elevator access for all apartments => Category 0</b>
* elevator_without == 1 & elevator_passenger == 0 & elevator_service == 1 => <b>Some apartments have service elevators and some do not => Category 1</b>
* elevator_without == 1 & elevator_passenger == 1 & elevator_service == 0 => <b>Some apartments have passenger elevators some do not => Category 2</b>
* elevator_without == 0 & elevator_passenger == 0 & elevator_service == 1 => <b>All apartments have service elevator => Category 3</b>
* elevator_without == 0 & elevator_passenger == 1 & elevator_service == 0 => <b>All apartments have passenger elevator => Category 4</b>
* elevator_without == 0 & elevator_passenger == 1 & elevator_service == 1 => <b>All apartments have passenger elevator and service elevator => Category 5</b>
* elevator_without == 1 & elevator_passenger == 1 & elevator_service == 1 => <b>Some apartments have service elevators and passenger elevators and some do not => Category 6</b>
* elevator_without == 0 & elevator_passenger == 0 & elevator_service == 0 => <b>Does not make sense, but this combination does not exist in the data either</b>




In [51]:
# Map each (elevator_without, elevator_passenger, elevator_service) combination to a category.
# Any other combination, including rows with NaN values, maps to NaN.
elevator_categories = {
    (1, 0, 0): 0,
    (1, 0, 1): 1,
    (1, 1, 0): 2,
    (0, 0, 1): 3,
    (0, 1, 0): 4,
    (0, 1, 1): 5,
    (1, 1, 1): 6,
}
elevator_cols = ["elevator_without", "elevator_passenger", "elevator_service"]

def elevator_category(row):
    return elevator_categories.get(tuple(row[elevator_cols]), np.nan)

data['elevatern'] = data.apply(elevator_category, axis=1)
data_test['elevatern'] = data_test.apply(elevator_category, axis=1)

# Fill the remaining NaN values with the training-set mode
mod = data['elevatern'].mode()
data['elevatern'] = data['elevatern'].fillna(mod[0])
data_test['elevatern'] = data_test['elevatern'].fillna(mod[0])

<a id="5.3"></a> <br>
## 5.3 Projection-based Distance From City Center
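We add the distance to the city center as a feature. A projection-based (geodesic) distance on the WGS84 ellipsoid is used rather than the Euclidean distance in order to account for the curvature of the Earth's surface. The forward and back azimuths are stored as features as well.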
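
As a rough cross-check of such distances, the haversine formula gives the great-circle distance on a sphere using NumPy only. This is a spherical approximation and not the WGS84 geodesic computed in the cell below; the function name `haversine_m` and the mean Earth radius are assumptions of this sketch, while the two coordinate pairs are the ones used in this notebook.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between points given in degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

center = (37.621390, 55.753098)         # (longitude, latitude) of the city center
financial = (37.535497858, 55.741330368)  # (longitude, latitude) of the financial center

# Works on whole arrays at once, so no Python loop over rows is needed
lons = np.array([center[0], financial[0]])
lats = np.array([center[1], financial[1]])
dist_m = haversine_m(center[0], center[1], lons, lats)
```

At Moscow's latitude one degree of longitude covers only about 56% of the ground distance of one degree of latitude, which is why a distance in meters differs from a Euclidean distance in raw degrees by more than a constant factor.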

In [52]:
lon1 =  37.621390
lat1 = 55.753098
geodesic = pyproj.Geod(ellps='WGS84')
distance_arr = []
back_azimuth_arr = []
fwd_azimuth_arr = []
for i in range(len(data["longitude"])):
    fwd_azimuth,back_azimuth,distance = geodesic.inv(lon1, lat1, data["longitude"][i], data["latitude"][i])
    distance_arr.append(distance)
    back_azimuth_arr.append(back_azimuth)
    fwd_azimuth_arr.append(fwd_azimuth)

data['fwd_azi'] = fwd_azimuth_arr
data['distance'] = distance_arr
data['back_azi'] = back_azimuth_arr


geodesic = pyproj.Geod(ellps='WGS84')
distance_arr = []
back_azimuth_arr = []
fwd_azimuth_arr = []
for i in range(len(data_test["longitude"])):
    fwd_azimuth,back_azimuth,distance = geodesic.inv(lon1, lat1, data_test["longitude"][i], data_test["latitude"][i])
    distance_arr.append(distance)
    back_azimuth_arr.append(back_azimuth)
    fwd_azimuth_arr.append(fwd_azimuth)

data_test['fwd_azi'] = fwd_azimuth_arr
data_test['distance'] = distance_arr
data_test['back_azi'] = back_azimuth_arr

<a id="5.4"></a> <br>
## 5.4 Area_per_room
Dividing the total area by the number of rooms to get the average area per room.

In [53]:
data['area_per_room'] = data['area_total']/data['rooms']
data_test['area_per_room'] = data_test['area_total']/data_test['rooms']

<a id="5.5"></a> <br>
## 5.5 Bathrooms_total
Adding the 'bathrooms_shared' and 'bathrooms_private' features to get the total number of bathrooms of the apartment.

In [54]:
data["bathrooms_total"] = data.bathrooms_shared + data.bathrooms_private
data_test["bathrooms_total"] = data_test.bathrooms_shared + data_test.bathrooms_private

<a id="5.6"></a> <br>
## 5.6 Balconies_total

Combining loggias and balconies to a common feature makes sense as these two features have somewhat the same functionality in an apartment.

In [55]:
data["total_balconies"] = data["balconies"] + data["loggias"]
data_test["total_balconies"] = data_test["balconies"] + data_test["loggias"]

<a id="5.7"></a> <br>
## 5.7 Distance to Financial City Center

Moscow also has a financial center, and since domain knowledge stresses the importance of location, it is a good idea to include the distance to it in the feature engineering.

In [56]:
financial_coords = (37.535497858, 55.741330368)
distance_from_city_center = np.sqrt((financial_coords[0] - data["longitude"])**2+(financial_coords[1] - data["latitude"])**2)
data["distance_from_financial_center"] = distance_from_city_center

distance_from_city_center_t = np.sqrt((financial_coords[0] - data_test["longitude"])**2+(financial_coords[1] - data_test["latitude"])**2)
data_test["distance_from_financial_center"] = distance_from_city_center_t

<a id="5.8"></a> <br>
## 5.8 Floor_per_stories Ratio


The floor/stories ratio tells the model how high up in the building the apartment is located, on a scale between 0 and 1.

In [57]:
data["floor/stories"] = data["floor"]/data["stories"]
data_test["floor/stories"] = data_test["floor"]/data_test["stories"]

<a id="5.10"></a> <br>
## 5.9 Log Features

We also provide log features to the model for some selected features. 

In [58]:
data['area_per_room_log']        = np.log1p(data['area_per_room'])
data['area_total_log']           = np.log1p(data['area_total'])
data['area_kitchen_log']         = np.log1p(data['area_kitchen'])
data['area_living_log']          = np.log1p(data['area_living'])




data_test['area_per_room_log']   = np.log1p(data_test['area_per_room'])
data_test['area_total_log']      = np.log1p(data_test['area_total'])
data_test['area_kitchen_log']    = np.log1p(data_test['area_kitchen'])
data_test['area_living_log']     = np.log1p(data_test['area_living'])


<a id="5.11"></a> <br>
## 5.10 Squared Features

It is also possible to square some selected numerical features 

In [59]:
data['area_per_room_squared']        = data['area_per_room']*data['area_per_room']
data['area_total_squared']           = data['area_total']*data['area_total']
data['area_kitchen_squared']         = data['area_kitchen']*data['area_kitchen']
data['area_living_squared']          = data['area_living']*data['area_living']



data_test['area_per_room_squared']   = data_test['area_per_room']*data_test['area_per_room']
data_test['area_total_squared']      = data_test['area_total']*data_test['area_total']
data_test['area_kitchen_squared']    = data_test['area_kitchen']*data_test['area_kitchen']
data_test['area_living_squared']     = data_test['area_living']*data_test['area_living']

<a id="5.12"></a> <br>
## 5.11 Other Possible Features

Other possible features include the square root of the original numerical values or the reciprocal 1/x of a feature. These are all transformations that could help further minimize the RMSLE.
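
A short sketch of how these two transformations could be added, using a made-up frame `toy`. The `_sqrt` and `_inv` suffixes are naming assumptions that follow the `_log` and `_squared` pattern used above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "area_total":   [36.0, 64.0, 100.0],
    "area_kitchen": [9.0, 0.0, 16.0],
})

for col in ["area_total", "area_kitchen"]:
    toy[f"{col}_sqrt"] = np.sqrt(toy[col])
    # Replace zeros by NaN first so that 1/x never divides by zero
    toy[f"{col}_inv"] = 1.0 / toy[col].replace(0, np.nan)
```

Rows with a zero value end up with NaN in the reciprocal column and would need the same kind of NaN handling as in section 4.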

<a id="6"></a> <br>
# 6. Feature Extraction

<a id="6.1"></a> <br>
## 6.1 Principal Component Analysis (PCA)

PCA is a method for partitioning the variation in the data. This feature extraction method could help discover important relationships in the Moscow dataset and help create more informative features. There are two ways to use it. The first is to discover relationships between features so that one can manually combine them into a new, more informative feature. The second is to use the principal components directly as features, because they may better describe the variational structure of the data.

It is good practice to consider the following in PCA:
* PCA only works with numerical features
* PCA is sensitive to scale and one should therefore consider standardizing the data before applying PCA
* Consider removing or constraining outliers, since they can have an undue influence on the results
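
To make the scaling point concrete, here is a sketch of PCA on standardized data using only NumPy's singular value decomposition. The toy matrix `X` is randomly generated for illustration; the notebook itself uses scikit-learn's `PCA` in the next cell.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=500)

# Two correlated features on very different scales plus one independent feature
X = np.column_stack([
    1000.0 * base + rng.normal(scale=100.0, size=500),
    base + rng.normal(scale=0.1, size=500),
    rng.normal(size=500),
])

# Standardize each column to zero mean and unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA through the SVD of the standardized data
U, S, Vt = np.linalg.svd(X_scaled, full_matrices=False)
components = Vt                   # rows are the principal directions (loadings)
scores = X_scaled @ Vt.T          # data expressed in principal components
explained_variance_ratio = S**2 / np.sum(S**2)
```

Without the standardization step, the first column would dominate the first component purely because of its larger scale.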

In [60]:
features = ["area_total", "distance_from_financial_center", "distance_from_city_center", 'distance','back_azi','fwd_azi']


# standardize the data (zero mean, unit variance)
data_scaled = (data[features] - data[features].mean(axis=0)) / data[features].std(axis=0)

# Create principal components
pca = PCA()
data_pca = pca.fit_transform(data_scaled)

# Convert to dataframe
component_names = [f"PC{i+1}" for i in range(data_pca.shape[1])]
data_pca = pd.DataFrame(data_pca, columns=component_names)

data_pca.head()

In [61]:
loadings = pd.DataFrame(
    pca.components_.T,  # transpose the matrix of loadings
    columns=component_names,  # so the columns are the principal components
    index=data[features].columns,  # and the rows are the original features
)
loadings

The signs and magnitudes of a component's loadings tell us what kind of variation it has captured. The first component shows a contrast between area_total and the different distance metrics.

In [62]:
plot_variance(pca) #Shows the explained variance

In [64]:
mi_scores = make_mi_scores(data_pca, data.price/data.area_total, discrete_features=False)
plt.figure(dpi=100, figsize=(5, 5))
plot_mi_scores(mi_scores)

From the MI scores we can see that PC1 and PC2 seem to be highly informative and could therefore be considered as features. MI scores are explained further in section 7.1.

There are other feature extraction techniques that would have been interesting to explore given more time:

For Linear cases:
* <b>ICA</b>
* <b>LDA</b>

For Non-linear cases:
* <b>LLE</b>
* <b>t-SNE</b>
* <b>Autoencoders</b>

<a id="6.2"></a> <br>
## 6.2 Target Encoding

Target encoding is a way of encoding categorical features. Much like one-hot encoding or label encoding, it transforms categorical values into numbers, except that target encoding uses the target to create the encoding. This technique could be a useful way to encode the street and address features.
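
The idea behind an m-estimate target encoding can be sketched in a few lines of pandas: each category is replaced by a blend of its own target mean and the global mean, where `m` controls how strongly rare categories are pulled towards the global mean. The function `m_estimate_encode` and the toy data are assumptions for illustration, and the library encoder used below may differ in details.

```python
import pandas as pd

def m_estimate_encode(fit_categories, fit_target, apply_categories, m=5.0):
    """Encode categories as (sum + m * prior) / (count + m)."""
    prior = fit_target.mean()
    stats = fit_target.groupby(fit_categories).agg(["sum", "count"])
    encoding = (stats["sum"] + m * prior) / (stats["count"] + m)
    # Categories never seen while fitting fall back to the prior
    return apply_categories.map(encoding).fillna(prior)

fit_street = pd.Series(["A", "A", "B"])
fit_target = pd.Series([10.0, 20.0, 30.0])
new_street = pd.Series(["A", "B", "C"])

encoded = m_estimate_encode(fit_street, fit_target, new_street, m=1.0)
```

Fitting the encoding on one split and applying it to another, as in the cell below, keeps the encoded feature from leaking the target of the rows it is applied to.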

In [65]:
X = data.copy()
y = X.pop('area_total')

X_encode = X.sample(frac=0.25)
y_encode = y[X_encode.index]
X_pretrain = X.drop(X_encode.index)
y_train = y[X_pretrain.index]


from category_encoders import MEstimateEncoder

# Create the encoder instance. Choose m to control noise.
encoder = MEstimateEncoder(cols=["street"], m=5.0,random_state=20)

# Fit the encoder on the encoding split.
encoder.fit(X_encode, y_encode)

# Encode the street column to create the final training data
X_train = encoder.transform(X_pretrain)

plt.figure(dpi=90)
ax = sns.distplot(y, kde=False, norm_hist=True)
ax = sns.kdeplot(X_train.street, color='r', ax=ax)
ax.set_xlabel("area_total")
ax.legend(labels=['street', 'area_total']);

In [66]:
make_mi_scores(X_train[["street"]], X_train.price,discrete_features=False)

<a id="7"></a> <br>
# 7. Feature Selection

<a id="7.1"></a> <br>
## 7.1 Mutual Information (MI)


Mutual information is similar to correlation in that it measures the relationship between two quantities. Its advantage is that it can detect any kind of relationship, whereas correlation detects linear relationships only. It measures the extent to which knowing one feature reduces the uncertainty about another. In our case we would like to know how each feature in our dataset scores in terms of MI with price.
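
The difference from correlation can be illustrated with a simple histogram-based MI estimate on synthetic data. The estimator `binned_mutual_information` is a rough sketch written for this example and is not the estimator behind `make_mi_scores`.

```python
import numpy as np

def binned_mutual_information(x, y, bins=16):
    """Rough MI estimate (in nats) from a 2D histogram of x and y."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nonzero = pxy > 0
    return float(np.sum(pxy[nonzero] * np.log(pxy[nonzero] / (px @ py)[nonzero])))

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 5000)
y_quadratic = x**2 + rng.normal(0, 0.05, x.size)  # depends on x, but not linearly
y_noise = rng.normal(0, 1, x.size)                # independent of x

corr = np.corrcoef(x, y_quadratic)[0, 1]
mi_quadratic = binned_mutual_information(x, y_quadratic)
mi_noise = binned_mutual_information(x, y_noise)
```

Here the correlation between `x` and `y_quadratic` is close to zero even though one determines the other almost exactly, while the MI estimate is clearly larger than for the independent noise.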


In [67]:
features = ["ceiling", "area_per_room" ,  "area_per_room_log", "rooms", "area_total", "area_kitchen", "area_living", "area_total_log", "area_kitchen_log", "area_living_log",
            "floor", "new", "elevatern", "bathrooms_total", "bathrooms_shared", "bathrooms_private", 'parking', 'heating',
            "latitude", "longitude","district", "constructed", "condition", "seller", "total_balconies", "material", "stories",'distance','back_azi','fwd_azi',"floor/stories",
           "distance_from_financial_center", "distance_from_city_center"]

# All categorical features are treated differently in MI
categorical = ["new", "parking", "heating", "constructed", "condition", "seller", "material"]
data_mi = data.copy()
# Cast to plain integers so that the dtype check below flags these columns as discrete
data_mi[categorical] = data_mi[categorical].astype(int)
discrete_features = data_mi[features].dtypes == int

In [68]:
mi_scores = make_mi_scores(data_mi[features], data_mi.price,discrete_features)
plt.figure(dpi=100, figsize=(10, 12))
plot_mi_scores(mi_scores)

From this it looks like heating is the least important of all the features. It makes sense that area is the most important feature, as this is common domain knowledge in real estate. It is nevertheless interesting to see that MI captures this importance.

<a id="7.2"></a> <br>
## 7.2 ANOVA F-value

This method estimates the degree of linearity between an input feature and the output. A high F-value means a high degree of linearity and vice versa. The metric only captures linear relationships, which is the main disadvantage of this selection method: it cannot detect nonlinear relationships, something that MI can.
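
For a continuous target such as price, the linear F-statistic of a single feature can be derived from the Pearson correlation r as F = r² / (1 - r²) · (n - 2). The sketch below implements this with NumPy on synthetic data; `linear_f_statistic` is a name made up for this example. In scikit-learn, `f_regression` is the scorer intended for continuous targets, whereas `f_classif` treats every distinct target value as a class.

```python
import numpy as np

def linear_f_statistic(x, y):
    """F-statistic of a one-feature linear regression of y on x."""
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    return r**2 / (1.0 - r**2) * (n - 2)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 2000)
y_linear = 3.0 * x + rng.normal(0, 0.5, x.size)    # linear dependence on x
y_quadratic = x**2 + rng.normal(0, 0.05, x.size)   # strong but nonlinear dependence

f_linear = linear_f_statistic(x, y_linear)
f_quadratic = linear_f_statistic(x, y_quadratic)
```

The quadratic target gets a small F-value despite depending strongly on `x`, which is the blind spot described above.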

In [69]:
f_value = f_classif(data[features], data.price)

# Create a bar chart for visualizing the F-values

plt.figure(figsize=(10,10))
plt.bar(x=features, height=f_value[0])
plt.xticks(rotation='vertical')
plt.ylabel('F-value')
plt.title('F-value Comparison')
plt.show()

As one can see, area_total has the strongest linear relationship with price, but many relationships are left out because this metric only captures linear ones.

In [70]:
f_value = f_classif(data[features], data.price/data.area_total)

# Create a bar chart for visualizing the F-values

plt.figure(figsize=(10,10))
plt.bar(x=features, height=f_value[0])
plt.xticks(rotation='vertical')
plt.ylabel('F-value')
plt.title('F-value Comparison')
plt.show()

Price/area_total gives higher F-values for many more features 

<a id="7.3"></a> <br>
## 7.3 Variance Threshold

This method looks at individual features and removes those whose variance is below a set threshold. The filtering assumes that features that do not vary much have low predictive power. However, it does not consider the connection between input and output features, so it should be used in combination with either the F-value or MI-score selection methods, as well as general domain knowledge, so that one does not eliminate important features.
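
Because variance depends on the unit of each feature, one option is to rescale every feature to [0, 1] before applying a threshold. A small pandas sketch follows; the frame `toy` and the threshold of 0.01 are arbitrary values chosen for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "area_total": [30.0, 60.0, 90.0, 120.0],
    "latitude":   [55.70, 55.75, 55.80, 55.85],
    "constant":   [1.0, 1.0, 1.0, 1.0],
})

# Raw variances are dominated by the scale of each feature
raw_variance = toy.var(ddof=0)

# Min-max scale to [0, 1]; constant columns get a span of 1 to avoid dividing by zero
span = (toy.max() - toy.min()).replace(0, 1.0)
scaled_variance = ((toy - toy.min()) / span).var(ddof=0)

threshold = 0.01
selected = scaled_variance[scaled_variance > threshold].index.tolist()
```

On the raw scale `latitude` would fall below the threshold, while after scaling only the truly constant column is removed.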

In [71]:
features = ["ceiling", "area_per_room" ,  "area_per_room_log", "rooms", "area_total", "area_kitchen", "area_living", "area_total_log", "area_kitchen_log", "area_living_log",
            "floor", "new", "elevatern", "bathrooms_total", "bathrooms_shared", "bathrooms_private", 'parking', 'heating',
            "latitude", "longitude","district", "constructed", "condition", "seller", "total_balconies", "material", "stories",'back_azi','fwd_azi']
# Create VarianceThreshold object to perform variance thresholding
selector = VarianceThreshold()

# Perform variance thresholding
selector.fit_transform(data[features])


# Create a bar chart for visualizing the variances
plt.figure(figsize=(10,10))
plt.bar(x=features, height=selector.variances_)
plt.xticks(rotation='vertical')
plt.ylabel('Variance')
plt.title('Variance Comparison')

plt.show()

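As one can see, the bathroom features, longitude, latitude, district and material, as well as other features, vary very little. For district, material and similar features this is because they are categorical, and variance is only meaningful for numerical values. Variance also depends on the scale of a feature, so features with a narrow numeric range such as longitude and latitude show a small variance regardless of their importance.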

<a id="7.4"></a> <br>
## 7.4 Conclusion Of Feature Selection

The methods we explored are called filter methods. There are, however, other families called wrapper and embedded methods, where one looks at, for instance, feature importance to decide on the feature selection.

Wrapper Methods:
* <b>Exhaustive Feature Selection</b>
* <b>Sequential Forward Selection</b>
* <b>Sequential Backward Selection</b>

Embedded Methods:
* <b>Feature Importance From Model</b>
* <b>Using Selector Object for Selecting Features</b>

From the methods we explored, MI gave the best overall information about the significance of the features in relation to the domain knowledge. The F-value gave good information but is limited to linear relationships, so it must be used in combination with other methods or omitted entirely. Finally, variance thresholding on a portion of the feature set shows that area_total has the highest variance of the selected features, but even the low-variance features are important, as shown by the MI scores, so it is only used to verify which features are really significant. It is worth noting that all the filter methods used (MI, F-value and variance thresholding) indicated that area_total is a really important feature, which agrees well with the domain knowledge of real estate.

<a id="8"></a> <br>
# 8. Models


<a id="8.1"></a> <br>
## 8.1 Model Overview

Using the Lazypredict package we can train a lot of different models included in the package and get a quick overview of the training time, the RMSE performance as well as other metrics to quickly get an idea of what models are potentially a good fit for the problem we are working on.

In [72]:
# data_train, data_valid = model_selection.train_test_split(data, test_size=0.33, stratify=np.log(data.price).round())

# X_train = data_train[features]
# y_train = np.log1p(data_train['price'])
# X_test = data_valid[features]
# y_test = np.log1p(data_valid['price'])

# reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)
# models, predictions = reg.fit(X_train, X_test, y_train, y_test)
# print(models)

From this result it is easy to see that Xgboost and LGBM are both a good start. The only drawback is that the package does not include Catboost, which is also a SOTA model worth checking out.

<a id="8.2"></a> <br>
## 8.2 Choosing Prediction Target

In terms of setting up a model it is also worth mapping what types of prediction targets one should use. In particular there are a few targets to consider:
* <b>Price Target</b>
* <b>Log_Price Target</b>
* <b>Price/sqm Target</b>
* <b>Log_Price/sqm Target</b>

The reasoning is motivated in the EDA; in short, taking the log of the price makes the target less skewed and closer to a normal distribution. Predicting the price per square meter generally makes more sense than predicting the price itself, because an apartment can have a high price for various reasons (a large area, many rooms, etc.), whereas the price per square meter indicates the value of the apartment regardless of its size. Many apartments in the city center are expensive because of their square meter price, but this is not visible in the total price, where a big apartment with a low price per square meter can appear to be the more expensive one.
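As a minimal sketch of the Log_Price/sqm target, the snippet below shows the forward transform, its inverse and the RMSLE metric on the price scale. The two apartments are made up, and the helper names `to_target`, `to_price` and `rmsle` are hypothetical stand-ins (the notebook itself uses its own `root_mean_squared_log_error` helper).

```python
import numpy as np

def to_target(price, area_total):
    """Log of price per square metre, used as the training target."""
    return np.log1p(price / area_total)

def to_price(pred, area_total):
    """Invert the target transform to get a price back."""
    return np.expm1(pred) * area_total

def rmsle(y_true, y_pred):
    """Root mean squared log error on the price scale."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Made-up apartments: a large cheap one and a small expensive one
price = np.array([12_000_000.0, 9_000_000.0])
area_total = np.array([120.0, 30.0])

target = to_target(price, area_total)
recovered = to_price(target, area_total)

print(price / area_total)        # price per square metre
print(rmsle(price, recovered))   # close to zero: the transform is invertible
```

The first apartment has the higher total price, but the second has the higher price per square meter, so the two targets rank them differently.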


<a id="8.3"></a> <br>
## 8.3 Validation Split
In terms of validation split there are two possibilities:
* <b>Split based on apartments</b>
* <b>Split based on buildings</b>

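The apartment-based split leaks information: apartments from the same building can end up in both the training and the validation set, so the validation score looks better than the score on the test set. Some attempts in this notebook still use it, so this is worth keeping in mind before submitting models against the test set. The cell below splits on buildings.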
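The difference between the two splits can be checked with a small sketch. It uses made-up building ids and a hand-written NumPy split (a stand-in for `GroupShuffleSplit`, not the same implementation) and counts how many buildings appear on both sides of the split.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 1000 apartments spread over roughly 100 buildings
building_id = rng.integers(0, 100, size=1000)

# Apartment-level split: shuffle rows and ignore buildings
rows = rng.permutation(len(building_id))
n_valid = int(0.33 * len(rows))
valid_rows, train_rows = rows[:n_valid], rows[n_valid:]
shared_apartment_split = np.intersect1d(building_id[train_rows], building_id[valid_rows])

# Building-level split: shuffle buildings, then every apartment follows its building
buildings = rng.permutation(np.unique(building_id))
valid_buildings = buildings[: int(0.33 * len(buildings))]
is_valid = np.isin(building_id, valid_buildings)
shared_building_split = np.intersect1d(building_id[~is_valid], building_id[is_valid])

print("buildings in both sets (apartment split):", len(shared_apartment_split))
print("buildings in both sets (building split): ", len(shared_building_split))
```

With the apartment split almost every building is seen during training, so building-level features such as coordinates make validation easier than the real test set.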

In [73]:
gs = model_selection.GroupShuffleSplit(n_splits=2, test_size=.33, random_state=0)
train_index, valid_index = next(gs.split(data, groups=data.building_id))

#GroupShuffleSplit yields positional indices, so use iloc rather than loc
data_train = data.iloc[train_index]
data_valid = data.iloc[valid_index]

In [74]:
#features that are used
features = ["ceiling", "area_per_room" ,  "area_per_room_log", "rooms", "area_total", "area_kitchen", "area_living", "area_total_log", "area_kitchen_log", "area_living_log",
            "floor", "new", "elevatern", "bathrooms_total", "bathrooms_shared", "bathrooms_private", 'parking', 'heating',
            "latitude", "longitude","district", "constructed", "condition", "seller", "total_balconies", "material", "stories",'distance','back_azi','fwd_azi',"floor/stories",
           "distance_from_financial_center", "distance_from_city_center"]


train_x = data_train[features]
train_y = np.log1p(data_train['price']/data_train["area_total"])
test_x = data_valid[features]
test_y = data_valid['price']

<a id="8.4"></a> <br>
## 8.4 Xgboost

In [75]:
param = {
    'base_score': 0.5,
    'booster': 'gbtree',
    'colsample_bylevel': 1,
    'gamma': 0,
    'max_delta_step': 0,
    'n_jobs': -1,
    'nthread': None,
    'objective': 'reg:squarederror',
    'scale_pos_weight': 1,
    'seed': None,
    'lambda': 0.0024064014952485785,
    'alpha': 0.001541503784279617,
    'colsample_bytree': 0.43152225018148443,
    'subsample': 0.8078473020517652,
    'learning_rate': 0.013367834721822036,
    'n_estimators': 5235,
    'random_state': 291,
    'max_depth': 9,
    'min_child_weight': 13
}

model_xgb = XGBRegressor(**param)

model_xgb.fit(train_x,train_y)

xgb_preds = model_xgb.predict(test_x)
root_mean_squared_log_error(test_y, y_pred=np.expm1(xgb_preds)*data_valid["area_total"])

<a id="8.5"></a> <br>
## 8.5 Lightgbm

In [76]:
best_params = {
    'objective': 'regression',
    "metric": "root_mean_squared_error",
    'random_state': 2020,
    'boosting_type': 'gbdt', #better than dart
    "n_jobs": -1,
    'learning_rate': 0.009902216010560466,
    'num_iterations': 9853, #alias of n_estimators in LightGBM, so only one of the two takes effect
    'n_estimators': 2200,
    'max_bin': 1145,
    'num_leaves': 992,
    'min_data_in_leaf': 21,
    'min_sum_hessian_in_leaf': 6,
    'bagging_fraction': 0.7553160099162841,
    'bagging_freq': 1,
    'max_depth': 5,
    'lambda_l1': 0.001047756084491848,
    'lambda_l2': 0.5231817241800534,
    'min_gain_to_split': 0.01715842845568677
}




model_lgbm = LGBMRegressor(**best_params)  

model_lgbm.fit(train_x,train_y,verbose=False)

lgbm_preds = model_lgbm.predict(test_x)
root_mean_squared_log_error(test_y, y_pred=np.expm1(lgbm_preds)*data_valid["area_total"])

<a id="8.6"></a> <br>
## 8.6 Catboost

In [77]:
param = {
    "objective": "RMSE",
    'depth': 8,
    'reg_lambda': 0.6424630162452156,
    'learning_rate': 0.008856338969505724,
    'n_estimators': 5356,
    'max_bin': 1042,
    'random_state': 1695,
    'subsample': 0.4474582804576312
}




model_catb = CatBoostRegressor(**param)  

model_catb.fit(train_x,train_y, verbose=False)

catb_preds = model_catb.predict(test_x)
root_mean_squared_log_error(test_y, y_pred=np.expm1(catb_preds)*data_valid["area_total"])

<a id="8.7"></a> <br>
## 8.7 Stacking Model

In [78]:
#Takes 1 hour and 40 minutes to run so commented out
from mlxtend.regressor import StackingCVRegressor

stacked_model = StackingCVRegressor(regressors=(model_xgb,model_lgbm,model_catb),
                                meta_regressor=model_xgb, #our best individual model becomes the META
                                use_features_in_secondary=True,
                                   verbose=0)



#stacked_model.fit(train_x,train_y)

#stacked_preds = stacked_model.predict(test_x)
#root_mean_squared_log_error(test_y, y_pred=np.expm1(stacked_preds)*data_valid["area_total"])

<a id="8.8"></a> <br>
## 8.8 Weight Averaging

The weights are based on the validation performance of each base model: a lower RMSLE gives a higher weight. When the stacked model is run, its predictions are treated as one more base model.

In [79]:
weighting_preds = np.average(
    [np.expm1(xgb_preds)*data_valid["area_total"],
     np.expm1(lgbm_preds)*data_valid["area_total"],
     np.expm1(catb_preds)*data_valid["area_total"]
     #np.expm1(stacked_preds)*data_valid["area_total"]
    ],
    weights = 1 / np.array([0.1932,  0.1961, 0.1968]) ** 6,   
    axis=0
)


In [80]:
rmsle_xgb = root_mean_squared_log_error(test_y, y_pred=np.expm1(xgb_preds)*data_valid["area_total"])
rmsle_lgb = root_mean_squared_log_error(test_y, y_pred=np.expm1(lgbm_preds)*data_valid["area_total"])
rmsle_cat = root_mean_squared_log_error(test_y, y_pred=np.expm1(catb_preds)*data_valid["area_total"])
#rmsle_stacking = root_mean_squared_log_error(test_y, y_pred=np.expm1(stacked_preds)*data_valid["area_total"])
rmsle_weighting = root_mean_squared_log_error(test_y, y_pred=weighting_preds)
print("RMSLE xgb:", rmsle_xgb)
print("RMSLE lgbm:", rmsle_lgb)
print("RMSLE catb:", rmsle_cat)
#print("RMSLE stacking:", rmsle_stacking)
print("RMSLE weighting:", rmsle_weighting)

One can see that weighting improves on the base models, and stacking does too when it is run. Stacking takes a long time to run, however (about 1 hour and 40 minutes), so it is commented out; feel free to uncomment it and try it.
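To make the weighting scheme more concrete, the sketch below computes the normalised weights that follow from the `1 / rmsle ** 6` expression in the cell above. The helper `inverse_error_weights` is a hypothetical name, and the other exponents are included only for comparison. `np.average` normalises the weights internally, so the explicit normalisation here is purely for readability.

```python
import numpy as np

# Validation RMSLE per base model (the values used in the weighting cell above)
scores = {"xgb": 0.1932, "lgbm": 0.1961, "catb": 0.1968}

def inverse_error_weights(rmsle_values, power):
    """Weights proportional to 1 / rmsle**power, normalised to sum to 1."""
    raw = 1.0 / np.asarray(rmsle_values, dtype=float) ** power
    return raw / raw.sum()

for power in (1, 6, 50):
    weights = inverse_error_weights(list(scores.values()), power)
    print(power, dict(zip(scores, weights.round(3))))
```

A small exponent gives an almost plain average, while a large exponent concentrates the weight on the best model. The exponent is therefore a tuning knob in its own right.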

<a id="8.9"></a> <br>
## 8.9 Other Tried Models
There are a few other models we tried that did not have much success in the limited time we spent on them. Here is the full list:
* <b>Snapboost</b>
* <b>Bagging a model (for example xgboost)</b>
* <b>Random Forest</b>
* <b>Extra Trees</b>
* <b>Support Vector Machine Regressor</b>
* <b>Lasso and Ridge Regressors</b>

In [82]:
#Snapboost
param = {"objective": "mse",
          'random_state': 830, 
          'learning_rate': 0.0036201545027461845,
          'num_round': 4988, 
          'hist_nbins': 246, 
          'colsample_bytree': 0.9028656800598056, 
          'subsample': 0.2080280241767715, 
          'tree_select_probability': 0.5206241285414699, 
          'lambda_l2': 0.06806535901164175, 
          'regularizer': 0.7392140536568328, 
          'max_depth': 241}

#model_snap = BoostingMachineRegressor(**param)
#model_snap.fit(np.array(train_x),np.array(train_y))

#snap_preds = model_snap.predict(np.array(test_x))

In [83]:
#Random Forest
param = {
    "n_jobs": -1,
    "random_state": 2000,
    'n_estimators': 1000,
    'max_depth': 4,
    'min_samples_split': 4, #from 2 or error
    'min_samples_leaf': 2,
    }
    

#model =RandomForestRegressor(**param)
#model.fit(train_x,train_y)

#preds = model.predict(test_x)

In [84]:
#Bagging Model

param = {
   "base_estimator":  model_xgb,
    'random_state': 2020,

    "n_estimators": 2,
    "max_samples": 0.5, #higher variance (0.1-0.99), higher bias (1, inf)
    #"bootstrap": False,
    #"max_features": train_x.shape[1],
    #"oob_score": False,
    #"bootstrap_features": False
}

#model = BaggingRegressor(**param)  

#model.fit(train_x,train_y)

#preds = model.predict(test_x)

In [85]:
#Extra Trees
from sklearn.ensemble import ExtraTreesRegressor
et_model = ExtraTreesRegressor(n_estimators=1000, max_depth=10, max_features=0.3, n_jobs=-1, random_state=0)

#et_model.fit(train_x,train_y)

##preds = et_model.predict(test_x)

<a id="9"></a> <br>
# 9. Optuna Optimization

As one can observe in section 8, all the models have very specific hyperparameter values. These are the result of the hyperparameter optimization described in this section, which was done with Optuna.

<a id="9.1"></a> <br>
## 9.1 Hyperparameter Tuning Xgboost

In [86]:
def objective(trial,data=data):
    
    train_x = data_train[features]
    train_y = np.log1p(data_train['price']/data_train["area_total"])
    test_x = data_valid[features]
    test_y = data_valid['price']

    

    
    
    param = {
      'base_score' : 0.5,
        'booster' : 'gbtree',
        'colsample_bylevel' : 1,
        'gamma' : 0,
        'max_delta_step' : 0,
        'n_jobs' : -1,
        'nthread' : None,
        'objective' : 'reg:squarederror',
        'scale_pos_weight' : 1,
        'seed' : None,



        #'tree_method':'gpu_hist',  # this parameter means using the GPU when training our model to speedup the training process        
        'lambda': trial.suggest_loguniform('lambda', 1e-3, 1),
        'alpha': trial.suggest_loguniform('alpha',  1e-3, 1),
        'colsample_bytree': trial.suggest_uniform('colsample_bytree', 0.3,1.0),
        'subsample': trial.suggest_uniform('subsample', 0.3,1.0),
        'learning_rate': trial.suggest_uniform('learning_rate', 0.008, 0.018),
        'n_estimators': trial.suggest_int('n_estimators', 500,1000), #should be between 500-4000 taken down for faster running in the long notebook
        'random_state': trial.suggest_int('random_state',0, 2000),
        'max_depth': trial.suggest_int('max_depth', 2, 30),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 300),
    }
    
    model = XGBRegressor(**param)  
    
    model.fit(train_x,train_y)
    
    preds = model.predict(test_x)
    
    rmse = root_mean_squared_log_error(y_true=test_y, y_pred=np.expm1(preds)*test_x["area_total"])
    
    return rmse

In [87]:
#Trial is usually higher than 1 but for this case it is set low for kaggle to run faster
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=1)
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
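The search space above mixes `suggest_uniform` and `suggest_loguniform`. The sketch below illustrates the difference between the two distributions with plain NumPy random draws. It is a stand-in for illustration only, not Optuna's sampler, and the helper names are hypothetical. In recent Optuna releases both calls are, as far as we know, deprecated in favour of `suggest_float` (with `log=True` for the log-uniform case).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_uniform(low, high, size):
    """Uniform on [low, high): every value is equally likely."""
    return rng.uniform(low, high, size)

def sample_loguniform(low, high, size):
    """Uniform in log space: every order of magnitude is equally likely."""
    return np.exp(rng.uniform(np.log(low), np.log(high), size))

uniform_draws = sample_uniform(1e-3, 1.0, 10_000)
loguniform_draws = sample_loguniform(1e-3, 1.0, 10_000)

# Share of draws below 0.01, i.e. in the lowest of the three decades
share_uniform = np.mean(uniform_draws < 0.01)
share_loguniform = np.mean(loguniform_draws < 0.01)
print(f"uniform:     {share_uniform:.3f}")
print(f"log-uniform: {share_loguniform:.3f}")
```

For regularization terms such as `lambda` and `alpha`, whose useful values span several orders of magnitude, the log-uniform range explores small values far more often than a uniform range would.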

<a id="9.2"></a> <br>
## 9.2 Hyperparameter Tuning Lgbm

In [88]:
def objective(trial,data=data):
    
    train_x = data_train[features]
    train_y = np.log1p(data_train['price']/data_train["area_total"])
    test_x = data_valid[features]
    test_y = data_valid['price']
    
    
    param = {
    'objective' : 'regression',
    "metric": "root_mean_squared_error",
    'random_state': 2020,
    'boosting_type': 'gbdt', #better than dart
    "n_jobs": -1,
        
    #Increases accuracy:
    'learning_rate' : trial.suggest_uniform('learning_rate',0.009, 0.01), #small learning rate and large iterations
    "num_iterations": trial.suggest_int("num_iterations",1000, 3000),#should be between 500-5000 taken down for faster running in the long notebook
    "n_estimators": trial.suggest_int("n_estimators", 500,1000), #should be between 500-5000 taken down for faster running in the long notebook
    "max_bin" : trial.suggest_int("max_bin",50,2000), #large max-bin - may be slower for large values
    "num_leaves" : trial.suggest_int("num_leaves", 50, 2000), #may cause overfitting for large values
        
    
    #deal with overfitting:
    "min_data_in_leaf" : trial.suggest_int("min_data_in_leaf", 20, 60),
    "min_sum_hessian_in_leaf" : trial.suggest_int("min_sum_hessian_in_leaf", 1, 10),
        
    #also set against overfitting:
    "bagging_fraction" : trial.suggest_uniform("bagging_fraction", 0.1, 1),
    "bagging_freq" : trial.suggest_int("bagging_freq", 1, 10),
    "max_depth" : trial.suggest_int("max_depth",5,30),

    #Regularization:
    "lambda_l1" :  trial.suggest_uniform("lambda_l1", 0.001, 1),
    "lambda_l2" :  trial.suggest_uniform("lambda_l2", 0.001, 1),
    "min_gain_to_split" :trial.suggest_uniform("min_gain_to_split", 0.001, 0.1),
        
        

    }
    
    

    model =LGBMRegressor(**param)
    model.fit(train_x,train_y)

    preds = model.predict(test_x)
    rmse = root_mean_squared_log_error(y_true=test_y, y_pred=np.expm1(preds)*test_x["area_total"])



    
    return rmse

In [89]:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=2)
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)

<a id="9.3"></a> <br>
## 9.3 Hyperparameter Tuning Catboost

In [90]:
def objective(trial,data=data):

    
    train_x = data_train[features]
    train_y = np.log1p(data_train['price']/data_train["area_total"])
    test_x = data_valid[features]
    test_y = data_valid['price']


    param = {
    "objective": "RMSE",
    #"bootstrap_type": "Bayesian",
    "thread_count": -1, #basically n_jobs
        
    #regularization/preventing overfitting
    'depth': trial.suggest_int('depth',4,10), 
    'reg_lambda': trial.suggest_loguniform('lambda', 0.008, 2),

    #Increases accuracy:
    'learning_rate' : trial.suggest_uniform('learning_rate',0.0001,0.01),
    'n_estimators': trial.suggest_int('n_estimators', 500,1000), #should be between 500-4000 taken down for faster running in the long notebook
    "max_bin": trial.suggest_int("max_bin",50,2000),
    'random_state': trial.suggest_int('random_state', 0,2000),
        
     #misc:   
    'subsample': trial.suggest_uniform('subsample', 0.3,1.0),
    
    }

        

    model = CatBoostRegressor(**param)  

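    #the eval set must be on the same scale as train_y (log price per square metre)
    eval_y = np.log1p(test_y/test_x["area_total"])
    model.fit(train_x,train_y,eval_set=[(test_x,eval_y)],early_stopping_rounds=100, verbose=False)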

    preds = model.predict(test_x)

    rmse = root_mean_squared_log_error(y_true=test_y, y_pred=np.expm1(preds)*test_x["area_total"])
       
    
    return rmse

In [91]:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=1)
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)

<a id="9.4"></a> <br>
## 9.4 Other Model Tuning

Here are some basic setups used for Snapboost and Random Forest.

In [92]:
#Snapboost
def objective(trial,data=data):

    train_x = data_train[features]
    train_y = np.log1p(data_train['price'])
    test_x = data_valid[features]
    test_y = np.log1p(data_valid['price'])

    
    
    
    param = {
    "objective": "mse",
    "n_jobs": 1,
        
    #increase accuracy
    "random_state":  trial.suggest_int("random_state",1,1000),
    "learning_rate": trial.suggest_uniform("learning_rate",0.0001, 0.01),
    "num_round": trial.suggest_int("num_round",500,5000),

    #misc  
    "hist_nbins": trial.suggest_int("hist_nbins",1,256),
    "colsample_bytree": trial.suggest_uniform("colsample_bytree",0,0.99),
    "subsample": trial.suggest_uniform("subsample",0.1,1.0),
    "tree_select_probability": trial.suggest_uniform("tree_select_probability", 0.0,1.0),
    
   
        
    #regularization
    "lambda_l2": trial.suggest_uniform("lambda_l2",0.0,1.0),
    "regularizer": trial.suggest_uniform("regularizer",0.0,10.0),
     "max_depth": trial.suggest_int("max_depth",1,1000),

    }
    

    model = BoostingMachineRegressor(**param)
    model.fit(np.array(train_x),np.array(train_y))

    preds = model.predict(np.array(test_x))
    rmse = root_mean_squared_log_error(y_true=np.expm1(test_y), y_pred=np.expm1(preds))

    
    return rmse

In [93]:
#Random Forest
def objective(trial,data=data):

    train_x = data_train[features]
    train_y = np.log1p(data_train['price'])
    test_x = data_valid[features]
    test_y = np.log1p(data_valid['price'])

    
    
    
    param = {
    "n_jobs": -1,
    "random_state": trial.suggest_int("random_state", 0, 2000),
    'n_estimators': trial.suggest_int('n_estimators', 50, 10000),
    'max_depth': trial.suggest_int('max_depth', 4, 50),
    'min_samples_split': trial.suggest_int('min_samples_split', 2, 150), #from 2 or error
    'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 100),
    }
    

    model =RandomForestRegressor(**param)
    model.fit(train_x,train_y)

    preds = model.predict(test_x)
    rmse = root_mean_squared_log_error(y_true=np.expm1(test_y), y_pred=np.expm1(preds))
    
    
    return rmse

<a id="10"></a> <br>
# 10. Attempted and Documented Model Pipelines

NOTE: the fitting in the attempts has been commented out so that the notebook runs faster. Either skip this part and go to the model interpretation, or uncomment the models to run the different attempts.

<a id="10.1"></a> <br>
## 10.1 Attempt 1

In [94]:
#Attempt 1 - 11.10.2021 - 0.16080 on test set
apartments = pd.read_csv('data/apartments_train.csv')
buildings = pd.read_csv('data/buildings_train.csv')
data = pd.merge(apartments, buildings.set_index('id'), how='left', left_on='building_id', right_index=True)

apartments_test = pd.read_csv('data/apartments_test.csv')
buildings_test = pd.read_csv('data/buildings_test.csv')
data_test = pd.merge(apartments_test, buildings_test.set_index('id'), how='left', left_on='building_id', right_index=True)


#Adding city center as origin
origin_coordinates = (37.6, 55.75)
distance_from_city_center = np.sqrt((origin_coordinates[0] - data["longitude"])**2+(origin_coordinates[1] - data["latitude"])**2)
data["distance_from_city_center"] = distance_from_city_center

#Adding city center as origin
origin_coordinates = (37.6, 55.75)
distance_from_city_center_t = np.sqrt((origin_coordinates[0] - data_test["longitude"])**2+(origin_coordinates[1] - data_test["latitude"])**2)
data_test["distance_from_city_center"] = distance_from_city_center_t


# features = [ "area_total", "rooms", "floor","new","distance_from_city_center",
#         "latitude", "longitude","district", "constructed", "material", "stories"]
features = ["ceiling", "rooms", "area_total", "area_kitchen", "area_living", "floor", "condition","new", "elevatern","distance_from_city_center",
            "latitude", "longitude","district", "constructed", "seller", "windows_court", "balconies", "material", "stories"]
# DONE: floor, ceiling, rooms, new, constructed, condition
# TODO: area_kitchen, area_living

# Encode the three elevator flags as one categorical feature.
# Key order: (elevator_without, elevator_passenger, elevator_service).
# Any other combination, including rows with missing flags, becomes NaN.
# Scratch note from exploration: codes 0, 1, 4, 6 expensive; 2, 3, 5 cheap.
elevator_codes = {
    (1, 0, 0): 0,
    (1, 0, 1): 1,
    (1, 1, 0): 2,
    (0, 0, 1): 3,
    (0, 1, 0): 4,
    (0, 1, 1): 5,
    (1, 1, 1): 6,
}

def encode_elevator(row):
    key = (row["elevator_without"], row["elevator_passenger"], row["elevator_service"])
    return elevator_codes.get(key, np.nan)

data['elevatern'] = data.apply(encode_elevator, axis=1)
data_test['elevatern'] = data_test.apply(encode_elevator, axis=1)

for feature in features:
    if   feature == 'elevatern': #or feature == "elevator_service" or features == "condition" or feature == "constructed" or features == "material" or features == "seller"
        #print('Categorical',feature)
        mod = data[feature].mode()
        data[feature] = data[feature].fillna(mod[0])
        data_test[feature] = data_test[feature].fillna(mod[0])
        
    elif feature == 'ceiling':
        maxc = 9
        minc = 1
        data['ceiling'] = data.apply(lambda row: data["ceiling"].mode()[0] if (row["ceiling"] < minc or row["ceiling"] > maxc ) else( row["ceiling"]) ,axis=1)     
        data_test['ceiling'] = data_test.apply(lambda row: data["ceiling"].mode()[0] if (row["ceiling"] < minc or row["ceiling"] > maxc ) else( row["ceiling"]) ,axis=1)     
        
    elif feature == 'condition' :
        var =  4.0
        data[feature] = data[feature].fillna(var)
        data_test[feature] = data_test[feature].fillna(var)
        
    elif feature == 'constructed' or feature == 'new':
        if feature == 'new':
            pass
        else:
            data['constructed'] = data.apply(
                lambda row: 2019 if (np.isnan(row['constructed']) and ~np.isnan(row['new']) and row['new'] == 0.0) else( 2021 if (np.isnan(row['constructed'])) else row['constructed']),
                axis=1
            )  
            data['new'] = data.apply(
                lambda row: 0.0 if (np.isnan(row['new']) and row['constructed'] < 2020) else( 1.0 if (np.isnan(row['new'])) else row['new']),
                axis=1
            )      
                        
            data_test['constructed'] = data_test.apply(
                lambda row: 2019 if (np.isnan(row['constructed']) and ~np.isnan(row['new']) and row['new'] == 0.0) else( 2021 if (np.isnan(row['constructed'])) else row['constructed']),
                axis=1
            )  
            data_test['new'] = data_test.apply(
                lambda row: 0.0 if (np.isnan(row['new']) and row['constructed'] < 2020) else( 1.0 if (np.isnan(row['new'])) else row['new']),
                axis=1
            )     
    else:
        mean = data[feature].mean()
        data[feature] = data[feature].fillna(mean)
        
        data_test[feature] = data_test[feature].fillna(mean)

data['area_total'] = np.log1p(data['area_total'])
data['area_kitchen'] = np.log1p(data['area_kitchen'])
data['area_living'] = np.log1p(data['area_living'])

data_test['area_total'] = np.log1p(data_test['area_total'])
data_test['area_kitchen'] = np.log1p(data_test['area_kitchen'])
data_test['area_living'] = np.log1p(data_test['area_living'])

import sklearn.model_selection as model_selection
data_train, data_valid = model_selection.train_test_split(data, test_size=0.33, stratify=np.log(data.price).round())


from xgboost import XGBRegressor

 


X_train = data_train[features]
y_train = np.log1p(data_train['price'])
X_valid = data_valid[features]
y_valid = np.log1p(data_valid['price'])
#Trial 26 finished with value: 0.12977325584218719 and parameters: {'lambda': 0.2287729489989326, 'alpha': 0.021096319890667407, 'colsample_bytree': 0.5144984086781564, 
     #'subsample': 0.42023355655422495, 'learning_rate': 0.01693820796093592, 'n_estimators': 3977, 'max_depth': 23, 'random_state': 2020, 'min_child_weight': 5}. Best is trial 26 with value: 0.12977325584218719.
    
model = XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.5144984086781564, gamma=0, learning_rate=0.01693820796093592, max_delta_step=0,
       max_depth=23, min_child_weight=5, n_estimators=3977,
       n_jobs=1, nthread=None, objective='reg:squarederror', random_state=2020, # squarederror  reg:squaredlogerror   reg:squarederror
       reg_alpha=0.021096319890667407, reg_lambda=0.2287729489989326, scale_pos_weight=1, seed=None, subsample=0.42023355655422495)



#model.fit(X_train, y_train)
#y_train_hat = model.predict(X_train)
#y_valid_hat = model.predict(X_valid)

<a id="10.2"></a> <br>
## 10.2 Attempt 2

In [95]:
#Attempt 2 - 12.10.2021 - 0.15712 on test set
apartments = pd.read_csv('data/apartments_train.csv')
buildings = pd.read_csv('data/buildings_train.csv')
data = pd.merge(apartments, buildings.set_index('id'), how='left', left_on='building_id', right_index=True)

apartments_test = pd.read_csv('data/apartments_test.csv')
buildings_test = pd.read_csv('data/buildings_test.csv')
data_test = pd.merge(apartments_test, buildings_test.set_index('id'), how='left', left_on='building_id', right_index=True)




#Adding city center as origin
origin_coordinates = (37.6, 55.75)
distance_from_city_center = np.sqrt((origin_coordinates[0] - data["longitude"])**2+(origin_coordinates[1] - data["latitude"])**2)
data["distance_from_city_center"] = distance_from_city_center

#Adding city center as origin
origin_coordinates = (37.6, 55.75)
distance_from_city_center_t = np.sqrt((origin_coordinates[0] - data_test["longitude"])**2+(origin_coordinates[1] - data_test["latitude"])**2)
data_test["distance_from_city_center"] = distance_from_city_center_t


# features = [ "area_total", "rooms", "floor","new","distance_from_city_center",
#         "latitude", "longitude","district", "constructed", "material", "stories"]
features = ["ceiling", "rooms", "area_total", "area_kitchen", "area_living", "floor", "condition","new", "elevatern","distance_from_city_center",
            "latitude", "longitude","district", "constructed", "seller", "windows_court", "balconies", "material", "stories"]
# DONE: floor, ceiling, rooms, new, constructed, condition
# TODO: area_kitchen, area_living

# Encode the three elevator flags as one categorical feature.
# Key order: (elevator_without, elevator_passenger, elevator_service).
# Any other combination, including rows with missing flags, becomes NaN.
# Scratch note from exploration: codes 0, 1, 4, 6 expensive; 2, 3, 5 cheap.
elevator_codes = {
    (1, 0, 0): 0,
    (1, 0, 1): 1,
    (1, 1, 0): 2,
    (0, 0, 1): 3,
    (0, 1, 0): 4,
    (0, 1, 1): 5,
    (1, 1, 1): 6,
}

def encode_elevator(row):
    key = (row["elevator_without"], row["elevator_passenger"], row["elevator_service"])
    return elevator_codes.get(key, np.nan)

data['elevatern'] = data.apply(encode_elevator, axis=1)
data_test['elevatern'] = data_test.apply(encode_elevator, axis=1)

for feature in features:
    if   feature == 'elevatern': #or feature == "elevator_service" or features == "condition" or feature == "constructed" or features == "material" or features == "seller"
        #print('Categorical',feature)
        mod = data[feature].mode()
        data[feature] = data[feature].fillna(mod[0])
        data_test[feature] = data_test[feature].fillna(mod[0])
        
    elif feature == 'ceiling':
        maxc = 9
        minc = 1
        data['ceiling'] = data.apply(lambda row: data["ceiling"].mode()[0] if (row["ceiling"] < minc or row["ceiling"] > maxc ) else( row["ceiling"]) ,axis=1)     
        data_test['ceiling'] = data_test.apply(lambda row: data["ceiling"].mode()[0] if (row["ceiling"] < minc or row["ceiling"] > maxc ) else( row["ceiling"]) ,axis=1)     
        
    elif feature == 'condition' :
        var =  4.0
        data[feature] = data[feature].fillna(var)
        data_test[feature] = data_test[feature].fillna(var)
        
    elif feature == 'constructed' or feature == 'new':
        # Both columns are imputed together when 'constructed' is reached
        if feature == 'constructed':
            for df in (data, data_test):
                # Missing year: 2019 if the flat is known not to be new, otherwise 2021
                year_missing = df['constructed'].isna()
                df.loc[year_missing & (df['new'] == 0.0), 'constructed'] = 2019
                df.loc[df['constructed'].isna(), 'constructed'] = 2021
                # Missing 'new' flag: 1.0 only if built in 2020 or later
                new_missing = df['new'].isna()
                df.loc[new_missing, 'new'] = (df.loc[new_missing, 'constructed'] >= 2020).astype(float)
    else:
        mean = data[feature].mean()
        data[feature] = data[feature].fillna(mean)
        
        data_test[feature] = data_test[feature].fillna(mean)

data['area_total'] = np.log1p(data['area_total'])
data['area_kitchen'] = np.log1p(data['area_kitchen'])
data['area_living'] = np.log1p(data['area_living'])

data_test['area_total'] = np.log1p(data_test['area_total'])
data_test['area_kitchen'] = np.log1p(data_test['area_kitchen'])
data_test['area_living'] = np.log1p(data_test['area_living'])

#best xgboost model
from xgboost import XGBRegressor

train_x = data[features]
train_y = np.log1p(data['price'])
test_x = data_test[features]

    
model = XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.5144984086781564, gamma=0, learning_rate=0.01693820796093592, max_delta_step=0,
       max_depth=23, min_child_weight=5, n_estimators=3977,
       n_jobs=1, nthread=None, objective='reg:squarederror', random_state=2020, # squarederror  reg:squaredlogerror   reg:squarederror
       reg_alpha=0.021096319890667407, reg_lambda=0.2287729489989326, scale_pos_weight=1, seed=None, subsample=0.42023355655422495)


#model.fit(train_x,train_y)
#xgb_preds = model.predict(test_x)
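
# Sketch (not part of the original pipeline): the model above is trained on
# np.log1p(price), so its raw predictions are on the log scale and would need
# np.expm1 to get back to prices. `to_price` is a hypothetical helper name.
import numpy as np

def to_price(log_predictions):
    """Invert the log1p transform applied to the target."""
    return np.expm1(np.asarray(log_predictions, dtype=float))

# e.g. prices = to_price(model.predict(test_x))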




#next model
#Second best Xgboost model pipeline
apartments = pd.read_csv('data/apartments_train.csv')
buildings = pd.read_csv('data/buildings_train.csv')
data = pd.merge(apartments, buildings.set_index('id'), how='left', left_on='building_id', right_index=True)

apartments_test = pd.read_csv('data/apartments_test.csv')
buildings_test = pd.read_csv('data/buildings_test.csv')
data_test = pd.merge(apartments_test, buildings_test.set_index('id'), how='left', left_on='building_id', right_index=True)


# Manual coordinate fixes for the test set, keyed by row position
coordinate_fixes = {
    # NaN coordinates
    90: (55.568139, 37.481831),
    23: (55.568139, 37.481831),
    # Negative coordinates
    2511: (55.544066, 37.482317),
    6959: (55.544066, 37.482317),
    5090: (55.544066, 37.482317),
    8596: (55.544066, 37.482317),
    # Coordinates far outside Moscow
    2529: (55.764335, 37.907556),
    4719: (55.765430, 37.928284),
    9547: (55.765430, 37.928284),
}
for position, (lat, lon) in coordinate_fixes.items():
    # One .loc write instead of chained `.latitude.iloc[...] = ...`, which may write to a copy
    data_test.loc[data_test.index[position], ["latitude", "longitude"]] = [lat, lon]

# Fixing test-set districts (.loc avoids chained assignment)
data_test.loc[data_test.building_id.isin([3803, 4636, 4412]), "district"] = 11
data_test.loc[data_test.building_id.isin([926, 4202, 8811, 6879, 5667]), "district"] = 3
data_test.loc[data_test.building_id.isin([2265, 6403, 7317, 1647, 183]), "district"] = 5

# Fixing training-set districts
data.loc[data.building_id.isin([2029, 1255]), "district"] = 0
data.loc[data.building_id == 4162, "district"] = 5


# data[["area_total","area_kitchen","area_living","bathrooms_private", "bathrooms_shared","balconies","loggias"]][data.area_living + data.area_kitchen > data.area_total]

# Add a new feature of distance from center
#Adding city center as origin
origin_coordinates = (37.6, 55.75)
distance_from_city_center = np.sqrt((origin_coordinates[0] - data["longitude"])**2+(origin_coordinates[1] - data["latitude"])**2)
data["distance_from_city_center"] = distance_from_city_center

origin_coordinates = (37.6, 55.75)
distance_from_city_center_t = np.sqrt((origin_coordinates[0] - data_test["longitude"])**2+(origin_coordinates[1] - data_test["latitude"])**2)
data_test["distance_from_city_center"] = distance_from_city_center_t
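
# Sketch (an assumption, not the original method): the feature above is a
# Euclidean distance in raw degrees, and at Moscow's latitude one degree of
# longitude covers less ground than one degree of latitude. A great-circle
# distance in kilometres is one alternative. `haversine_km` is a hypothetical
# helper; its default origin reuses the city-center coordinates above.
import numpy as np

def haversine_km(lat, lon, lat0=55.75, lon0=37.6, radius_km=6371.0):
    """Great-circle distance in km from (lat0, lon0); accepts scalars or Series."""
    lat, lon = np.radians(lat), np.radians(lon)
    lat0, lon0 = np.radians(lat0), np.radians(lon0)
    a = np.sin((lat - lat0) / 2) ** 2 + np.cos(lat) * np.cos(lat0) * np.sin((lon - lon0) / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# e.g. data["distance_km"] = haversine_km(data["latitude"], data["longitude"])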



# Add a combined elevator feature for both train and test.
# Keys are (elevator_without, elevator_passenger, elevator_service);
# any other combination (including missing values) maps to NaN.
# Scratch note kept from the original: codes 0, 1, 4, 6 looked expensive; 2, 3, 5 cheap.
ELEVATOR_CODES = {
    (1, 0, 0): 0,
    (1, 0, 1): 1,
    (1, 1, 0): 2,
    (0, 0, 1): 3,
    (0, 1, 0): 4,
    (0, 1, 1): 5,
    (1, 1, 1): 6,
}

def encode_elevator(row):
    key = (row["elevator_without"], row["elevator_passenger"], row["elevator_service"])
    return ELEVATOR_CODES.get(key, np.nan)

data['elevatern'] = data.apply(encode_elevator, axis=1)
data_test['elevatern'] = data_test.apply(encode_elevator, axis=1)

# features = [ "area_total", "rooms", "floor","new","distance_from_city_center",
#         "latitude", "longitude","district", "constructed", "material", "stories"]
features = ["ceiling", "rooms", "area_total", "area_kitchen", "area_living", "floor", "condition","new", "elevatern","distance_from_city_center",
            "latitude", "longitude","district", "constructed", "seller", "windows_court", "balconies", "material", "stories"]
# Done: floor, ceiling, rooms, new, constructed, condition
# TODO: area_kitchen, area_living

for feature in features:
    if feature == 'elevatern':
        # Categorical: fill with the training-set mode
        mod = data[feature].mode()
        data[feature] = data[feature].fillna(mod[0])
        data_test[feature] = data_test[feature].fillna(mod[0])

    elif feature == 'ceiling':
        # Replace implausible ceiling heights with the training-set mode (NaNs are left as they are)
        maxc = 9
        minc = 1
        ceiling_mode = data["ceiling"].mode()[0]  # computed once instead of once per row
        for df in (data, data_test):
            out_of_range = (df["ceiling"] < minc) | (df["ceiling"] > maxc)
            df.loc[out_of_range, "ceiling"] = ceiling_mode

        
    elif feature == 'condition' :
        var =  4.0
        data[feature] = data[feature].fillna(var)
        data_test[feature] = data_test[feature].fillna(var)

        
    elif feature == 'stories':
        # If any apartment sits above the reported number of stories,
        # set that building's stories to its highest observed floor
        for df in (data, data_test):
            bad_buildings = df.loc[df.floor > df.stories, "building_id"].unique()
            in_bad_building = df.building_id.isin(bad_buildings)
            df.loc[in_bad_building, "stories"] = (
                df.loc[in_bad_building].groupby("building_id")["floor"].transform("max")
            )
            
    elif feature == 'material':
        # Merge monolith-brick (5) into monolith (2) and relabel Stalin (6) as 5
        for df in (data, data_test):
            df["material"] = df["material"].replace({5: 2.0, 6: 5.0})
        
    elif feature == 'constructed' or feature == 'new':
        # Both columns are imputed together when 'constructed' is reached
        if feature == 'constructed':
            for df in (data, data_test):
                # Missing year: 2019 if the flat is known not to be new, otherwise 2021
                year_missing = df['constructed'].isna()
                df.loc[year_missing & (df['new'] == 0.0), 'constructed'] = 2019
                df.loc[df['constructed'].isna(), 'constructed'] = 2021
                # Missing 'new' flag: 1.0 only if built in 2020 or later
                new_missing = df['new'].isna()
                df.loc[new_missing, 'new'] = (df.loc[new_missing, 'constructed'] >= 2020).astype(float)

    else:
        mean = data[feature].mean()
        #print('Not Categorical',feature)
        data[feature] = data[feature].fillna(mean)
        
        data_test[feature] = data_test[feature].fillna(mean)


data['area_total'] = np.log1p(data['area_total'])
data['area_kitchen'] = np.log1p(data['area_kitchen'])
data['area_living'] = np.log1p(data['area_living'])

data_test['area_total'] = np.log1p(data_test['area_total'])
data_test['area_kitchen'] = np.log1p(data_test['area_kitchen'])
data_test['area_living'] = np.log1p(data_test['area_living'])





from xgboost import XGBRegressor

train_x = data[features]
train_y = np.log1p(data['price'])
test_x = data_test[features]


model = XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.6185033815828214, gamma=0, learning_rate=0.015430795026041527, max_delta_step=0,
       max_depth=12, min_child_weight=2, n_estimators=3210,
       n_jobs=1, nthread=None, objective='reg:squarederror', random_state=2020, # squarederror  reg:squaredlogerror   reg:squarederror
       reg_alpha=0.13226917041287767, reg_lambda=0.9825323223526006, scale_pos_weight=1, seed=None, subsample=0.7155648654032974)

 
#model.fit(train_x,train_y)
#xgb_preds_scnd_best_model = model.predict(test_x)


apartments = pd.read_csv('data/apartments_train.csv')
buildings = pd.read_csv('data/buildings_train.csv')
data = pd.merge(apartments, buildings.set_index('id'), how='left', left_on='building_id', right_index=True)

apartments_test = pd.read_csv('data/apartments_test.csv')
buildings_test = pd.read_csv('data/buildings_test.csv')
data_test = pd.merge(apartments_test, buildings_test.set_index('id'), how='left', left_on='building_id', right_index=True)




from numpy.random import choice

# Manual coordinate fixes for the test set, keyed by row position
coordinate_fixes = {
    # NaN coordinates
    90: (55.568139, 37.481831),
    23: (55.568139, 37.481831),
    # Negative coordinates
    2511: (55.544066, 37.482317),
    6959: (55.544066, 37.482317),
    5090: (55.544066, 37.482317),
    8596: (55.544066, 37.482317),
    # Coordinates far outside Moscow
    2529: (55.764335, 37.907556),
    4719: (55.765430, 37.928284),
    9547: (55.765430, 37.928284),
}
for position, (lat, lon) in coordinate_fixes.items():
    # One .loc write instead of chained `.latitude.iloc[...] = ...`, which may write to a copy
    data_test.loc[data_test.index[position], ["latitude", "longitude"]] = [lat, lon]

# Fixing test-set districts (.loc avoids chained assignment)
data_test.loc[data_test.building_id.isin([3803, 4636, 4412]), "district"] = 11
data_test.loc[data_test.building_id.isin([926, 4202, 8811, 6879, 5667]), "district"] = 3
data_test.loc[data_test.building_id.isin([2265, 6403, 7317, 1647, 183]), "district"] = 5

# Fixing training-set districts
data.loc[data.building_id.isin([2029, 1255]), "district"] = 0
data.loc[data.building_id == 4162, "district"] = 5


# data[["area_total","area_kitchen","area_living","bathrooms_private", "bathrooms_shared","balconies","loggias"]][data.area_living + data.area_kitchen > data.area_total]

# Add a new feature of distance from center
#Adding city center as origin
origin_coordinates = (37.6, 55.75)
distance_from_city_center = np.sqrt((origin_coordinates[0] - data["longitude"])**2+(origin_coordinates[1] - data["latitude"])**2)
data["distance_from_city_center"] = distance_from_city_center

origin_coordinates = (37.6, 55.75)
distance_from_city_center_t = np.sqrt((origin_coordinates[0] - data_test["longitude"])**2+(origin_coordinates[1] - data_test["latitude"])**2)
data_test["distance_from_city_center"] = distance_from_city_center_t



# Add a combined elevator feature for both train and test.
# Keys are (elevator_without, elevator_passenger, elevator_service);
# any other combination (including missing values) maps to NaN.
# Scratch note kept from the original: codes 0, 1, 4, 6 looked expensive; 2, 3, 5 cheap.
ELEVATOR_CODES = {
    (1, 0, 0): 0,
    (1, 0, 1): 1,
    (1, 1, 0): 2,
    (0, 0, 1): 3,
    (0, 1, 0): 4,
    (0, 1, 1): 5,
    (1, 1, 1): 6,
}

def encode_elevator(row):
    key = (row["elevator_without"], row["elevator_passenger"], row["elevator_service"])
    return ELEVATOR_CODES.get(key, np.nan)

data['elevatern'] = data.apply(encode_elevator, axis=1)
data_test['elevatern'] = data_test.apply(encode_elevator, axis=1)

data["bathrooms_shared"] = data["bathrooms_shared"].fillna(1)
data["bathrooms_private"] = data["bathrooms_private"].fillna(1)
data["bathrooms_total"] = data.bathrooms_shared + data.bathrooms_private

data_test["bathrooms_shared"] = data_test["bathrooms_shared"].fillna(1)
data_test["bathrooms_private"] = data_test["bathrooms_private"].fillna(1)
data_test["bathrooms_total"] = data_test.bathrooms_shared + data_test.bathrooms_private


data["parking"] =  data["parking"].fillna(3.0)
data_test["parking"] =  data_test["parking"].fillna(3.0)

data["heating"] =  data["heating"].fillna(0.0)
data_test["heating"] =  data_test["heating"].fillna(0.0)


data["total_balconies"] = data["balconies"] + data["loggias"]
data_test["total_balconies"] = data_test["balconies"] + data_test["loggias"]


data["total_balconies"] =  data["total_balconies"].fillna(1.0)
data_test["total_balconies"] =  data_test["total_balconies"].fillna(1.0)


features = ["ceiling", "rooms", "area_total", "area_kitchen", "area_living", "floor", "elevatern","distance_from_city_center", "bathrooms_total", "bathrooms_shared", 
            "bathrooms_private", 'parking',"latitude", "longitude","district", "constructed", "condition", "seller", "material", "stories"]




# Revisit:
# - windows_court and windows_street were removed, as they did not seem to correlate with price
# - consider filling total_balconies with 0 instead of 1

for feature in features:
    if feature == 'elevatern':
        # Categorical: fill with the training-set mode
        mod = data[feature].mode()
        data[feature] = data[feature].fillna(mod[0])
        data_test[feature] = data_test[feature].fillna(mod[0])
    
    elif feature == 'seller':
        # Impute missing sellers by sampling from a fixed categorical distribution
        list_of_candidates = [0, 1, 2, 3]
        # 14455, owner 0
        probability_distribution = [0.11, 0.33, 0.13, 0.43]
        number_of_items_to_pick = data['seller'].isna().sum()
        number_of_items_to_pick_test = data_test['seller'].isna().sum()

        np.random.seed(0)  # reproducible draws

        draw = choice(list_of_candidates, number_of_items_to_pick,
                      p=probability_distribution)
        draw_test = choice(list_of_candidates, number_of_items_to_pick_test,
                      p=probability_distribution)

        # .loc avoids chained assignment, which may write to a copy
        data.loc[data.seller.isna(), 'seller'] = draw
        data_test.loc[data_test.seller.isna(), 'seller'] = draw_test
        
    elif feature == 'area_kitchen' or feature == 'area_living':
        # Both area columns are repaired together when 'area_kitchen' is reached
        if feature == 'area_kitchen':
            # Mean kitchen and living shares of the total area, from consistent training rows
            consistent = data.area_living + data.area_kitchen < data.area_total
            mean_kitchen = (data.loc[consistent, "area_kitchen"] / data.loc[consistent, "area_total"]).mean()
            mean_living = (data.loc[consistent, "area_living"] / data.loc[consistent, "area_total"]).mean()

            for df in (data, data_test):
                # Rows with a missing part, or whose parts add up to at least the total area
                bad = (df.area_living + df.area_kitchen >= df.area_total) | df.area_living.isna() | df.area_kitchen.isna()
                df.loc[bad, "area_kitchen"] = df.loc[bad, "area_total"] * mean_kitchen
                df.loc[bad, "area_living"] = df.loc[bad, "area_total"] * mean_living
    elif feature == 'ceiling':
        # Replace implausible or missing ceiling heights with the training-set mode
        maxc = 9
        minc = 1
        ceiling_mode = data["ceiling"].mode()[0]  # computed once instead of once per row
        for df in (data, data_test):
            out_of_range = (df["ceiling"] < minc) | (df["ceiling"] > maxc)
            df.loc[out_of_range | df["ceiling"].isna(), "ceiling"] = ceiling_mode
        
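    elif feature == 'condition':
        # Missing condition becomes its own class (4.0); merging it with 0.0 (undecorated) is an alternative
        var = 4.0
        data[feature] = data[feature].fillna(var)
        data_test[feature] = data_test[feature].fillna(var)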

        
    elif feature == 'stories':
        # If any apartment sits above the reported number of stories,
        # set that building's stories to its highest observed floor
        for df in (data, data_test):
            bad_buildings = df.loc[df.floor > df.stories, "building_id"].unique()
            in_bad_building = df.building_id.isin(bad_buildings)
            df.loc[in_bad_building, "stories"] = (
                df.loc[in_bad_building].groupby("building_id")["floor"].transform("max")
            )
            
    elif feature == 'material':
        # Merge monolith-brick (5) into monolith (2) and relabel Stalin (6) as 5
        for df in (data, data_test):
            df["material"] = df["material"].replace({5: 2.0, 6: 5.0})

        # Fill what is still missing with the training-set mode
        material_mode = data['material'].mode()[0]
        data['material'] = data['material'].fillna(material_mode)
        data_test['material'] = data_test['material'].fillna(material_mode)
        
    elif feature == 'constructed' or feature == 'new':
        # Both columns are imputed together when 'constructed' is reached
        if feature == 'constructed':
            for df in (data, data_test):
                # Missing year: 2019 if the flat is known not to be new, otherwise 2021
                year_missing = df['constructed'].isna()
                df.loc[year_missing & (df['new'] == 0.0), 'constructed'] = 2019
                df.loc[df['constructed'].isna(), 'constructed'] = 2021
                # Missing 'new' flag: 1.0 only if built in 2020 or later
                new_missing = df['new'].isna()
                df.loc[new_missing, 'new'] = (df.loc[new_missing, 'constructed'] >= 2020).astype(float)

    # Missing 'condition' is already handled by the branch above as its own class (4.0);
    # merging it with 0.0, the undecorated class, would be an alternative.
    else:
        pass  # remaining features keep their NaNs



data['area_total_log'] = np.log1p(data['area_total'])
data['area_kitchen_log'] = np.log1p(data['area_kitchen'])
data['area_living_log'] = np.log1p(data['area_living'])

data_test['area_total_log'] = np.log1p(data_test['area_total'])
data_test['area_kitchen_log'] = np.log1p(data_test['area_kitchen'])
data_test['area_living_log'] = np.log1p(data_test['area_living'])




#Submission
from lightgbm import LGBMRegressor

train_x = data[features]
train_y = np.log1p(data['price'])
test_x = data_test[features]

best_params = {'objective': 'regression',
               'metric': 'root_mean_squared_error',
               'random_state': 2020,
               'boosting_type': 'gbdt',
               'learning_rate': 0.009545503382688678,
               # Note: 'num_iterations' is a LightGBM alias of 'n_estimators',
               # so only one of the two values below takes effect
               'num_iterations': 6000,
               'n_estimators': 3467,
               'max_bin': 1329,
               'num_leaves': 84,
               'min_data_in_leaf': 21,
               'min_sum_hessian_in_leaf': 10,
               'bagging_fraction': 0.6138769224842795,
               'bagging_freq': 1,
               'max_depth': 24,
               'lambda_l1': 0.03828165214380662,
               'lambda_l2': 0.9439756453034776,
               'min_gain_to_split': 0.006068058433178841}



model = LGBMRegressor(**best_params)  

#model.fit(train_x,train_y,verbose=False)

#lgbm_preds = model.predict(test_x)



#model 3

apartments = pd.read_csv('data/apartments_train.csv')
buildings = pd.read_csv('data/buildings_train.csv')
data = pd.merge(apartments, buildings.set_index('id'), how='left', left_on='building_id', right_index=True)

apartments_test = pd.read_csv('data/apartments_test.csv')
buildings_test = pd.read_csv('data/buildings_test.csv')
data_test = pd.merge(apartments_test, buildings_test.set_index('id'), how='left', left_on='building_id', right_index=True)


from numpy.random import choice

# Manual coordinate fixes for the test set, keyed by row position
coordinate_fixes = {
    # NaN coordinates
    90: (55.568139, 37.481831),
    23: (55.568139, 37.481831),
    # Negative coordinates
    2511: (55.544066, 37.482317),
    6959: (55.544066, 37.482317),
    5090: (55.544066, 37.482317),
    8596: (55.544066, 37.482317),
    # Coordinates far outside Moscow
    2529: (55.764335, 37.907556),
    4719: (55.765430, 37.928284),
    9547: (55.765430, 37.928284),
}
for position, (lat, lon) in coordinate_fixes.items():
    # One .loc write instead of chained `.latitude.iloc[...] = ...`, which may write to a copy
    data_test.loc[data_test.index[position], ["latitude", "longitude"]] = [lat, lon]

# Fixing test-set districts (.loc avoids chained assignment)
data_test.loc[data_test.building_id.isin([3803, 4636, 4412]), "district"] = 11
data_test.loc[data_test.building_id.isin([926, 4202, 8811, 6879, 5667]), "district"] = 3
data_test.loc[data_test.building_id.isin([2265, 6403, 7317, 1647, 183]), "district"] = 5

# Fixing training-set districts
data.loc[data.building_id.isin([2029, 1255]), "district"] = 0
data.loc[data.building_id == 4162, "district"] = 5


# data[["area_total","area_kitchen","area_living","bathrooms_private", "bathrooms_shared","balconies","loggias"]][data.area_living + data.area_kitchen > data.area_total]

# Add a new feature of distance from center
#Adding city center as origin
origin_coordinates = (37.6, 55.75)
distance_from_city_center = np.sqrt((origin_coordinates[0] - data["longitude"])**2+(origin_coordinates[1] - data["latitude"])**2)
data["distance_from_city_center"] = distance_from_city_center

origin_coordinates = (37.6, 55.75)
distance_from_city_center_t = np.sqrt((origin_coordinates[0] - data_test["longitude"])**2+(origin_coordinates[1] - data_test["latitude"])**2)
data_test["distance_from_city_center"] = distance_from_city_center_t



# Combine the three elevator flags into a single categorical feature for both train and test.
# Keys are (elevator_without, elevator_passenger, elevator_service); any other combination,
# including rows with missing flags, maps to NaN.
elevator_codes = {
    (1, 0, 0): 0,
    (1, 0, 1): 1,
    (1, 1, 0): 2,
    (0, 0, 1): 3,
    (0, 1, 0): 4,
    (0, 1, 1): 5,
    (1, 1, 1): 6,
}
elevator_cols = ["elevator_without", "elevator_passenger", "elevator_service"]
data['elevatern'] = data[elevator_cols].apply(lambda row: elevator_codes.get(tuple(row), np.nan), axis=1)
data_test['elevatern'] = data_test[elevator_cols].apply(lambda row: elevator_codes.get(tuple(row), np.nan), axis=1)

data["bathrooms_shared"] = data["bathrooms_shared"].fillna(1)
data["bathrooms_private"] = data["bathrooms_private"].fillna(1)
data["bathrooms_total"] = data.bathrooms_shared + data.bathrooms_private

data_test["bathrooms_shared"] = data_test["bathrooms_shared"].fillna(1)
data_test["bathrooms_private"] = data_test["bathrooms_private"].fillna(1)
data_test["bathrooms_total"] = data_test.bathrooms_shared + data_test.bathrooms_private


data["parking"] =  data["parking"].fillna(3.0)
data_test["parking"] =  data_test["parking"].fillna(3.0)

data["heating"] =  data["heating"].fillna(0.0)
data_test["heating"] =  data_test["heating"].fillna(0.0)


data["total_balconies"] = data["balconies"] + data["loggias"]
data_test["total_balconies"] = data_test["balconies"] + data_test["loggias"]


data["total_balconies"] =  data["total_balconies"].fillna(1.0)
data_test["total_balconies"] =  data_test["total_balconies"].fillna(1.0)


features = ["ceiling", "rooms", "area_total", "area_kitchen", "area_living", "area_total_log", "area_kitchen_log", "area_living_log", "floor", "new", "elevatern","distance_from_city_center", 
            "bathrooms_total", "bathrooms_shared", "bathrooms_private", 'parking', 'heating',
            "latitude", "longitude","district", "constructed", "condition", "seller", "total_balconies", "material", "stories"]


# Revisit
# We removed windows_court and windows_street as they did not seem to correlate with price.
# Try filling missing total_balconies with 0 instead of 1.
#

for feature in features:
    if   feature == 'elevatern': #or feature == "elevator_service" or features == "condition" or feature == "constructed" or features == "material" or features == "seller"
        #print('Categorical',feature)
        mod = data[feature].mode()
        data[feature] = data[feature].fillna(mod[0])
        data_test[feature] = data_test[feature].fillna(mod[0])
    
    elif feature == 'seller':
        # Impute missing sellers by sampling from a fixed probability distribution
        list_of_candidates = [0,1,2,3]
        # 14455, owner 0
        probability_distribution  = [0.11, 0.33, 0.13, 0.43]
        number_of_items_to_pick = data['seller'].isna().sum()
        number_of_items_to_pick_test = data_test['seller'].isna().sum()

        np.random.seed(0)

        draw = choice(list_of_candidates, number_of_items_to_pick,
                      p=probability_distribution)
        draw_test = choice(list_of_candidates, number_of_items_to_pick_test,
                      p=probability_distribution)

        # .loc writes into the frame itself; chained indexing may write to a temporary copy
        data.loc[data.seller.isna(), 'seller'] = draw
        data_test.loc[data_test.seller.isna(), 'seller'] = draw_test
        
    elif feature == 'area_kitchen' or feature == 'area_living':
        if feature == 'area_kitchen':
            # Mean share of kitchen and living area in the total area, estimated on
            # training rows where the two parts fit inside the total
            consistent = data.area_living + data.area_kitchen < data.area_total
            mean_kitchen = (data.loc[consistent, "area_kitchen"] / data.loc[consistent, "area_total"]).mean()
            mean_living = (data.loc[consistent, "area_living"] / data.loc[consistent, "area_total"]).mean()

            # Replace inconsistent or missing values with the mean share of the total area.
            # The mask is computed once, before either column is overwritten.
            invalid = (data.area_living + data.area_kitchen >= data.area_total) | (data.area_living.isna() | data.area_kitchen.isna())
            data.loc[invalid, "area_kitchen"] = data.area_total * mean_kitchen
            data.loc[invalid, "area_living"] = data.area_total * mean_living

            # Test set: same rule, using the shares estimated on the training set
            invalid_test = (data_test.area_living + data_test.area_kitchen >= data_test.area_total) | (data_test.area_living.isna() | data_test.area_kitchen.isna())
            data_test.loc[invalid_test, "area_kitchen"] = data_test.area_total * mean_kitchen
            data_test.loc[invalid_test, "area_living"] = data_test.area_total * mean_living

        else:
            pass
    elif feature == 'ceiling':
        maxc = 9
        minc = 1
        # Replace out-of-range and missing ceiling heights with the training mode (computed once)
        ceiling_mode = data["ceiling"].mode()[0]
        data['ceiling'] = data['ceiling'].mask((data.ceiling < minc) | (data.ceiling > maxc), ceiling_mode).fillna(ceiling_mode)
        data_test['ceiling'] = data_test['ceiling'].mask((data_test.ceiling < minc) | (data_test.ceiling > maxc), ceiling_mode).fillna(ceiling_mode)
        
    elif feature == 'condition' :
        var =  4.0
        data[feature] = data[feature].fillna(var)
        data_test[feature] = data_test[feature].fillna(var)

        
    elif feature == 'stories':
        # Where an apartment's floor exceeds the building's stories, raise the stories
        # of that building to its highest observed floor.
        for building_id in data.loc[data.floor > data.stories, "building_id"].unique():
            in_building = data["building_id"] == building_id
            data.loc[in_building, 'stories'] = data.loc[in_building, 'floor'].max()

        for building_id in data_test.loc[data_test.floor > data_test.stories, "building_id"].unique():
            in_building = data_test["building_id"] == building_id
            data_test.loc[in_building, 'stories'] = data_test.loc[in_building, 'floor'].max()
            
    elif feature == 'material':
        data.loc[data.material == 5, 'material'] = 2.0  # merging monolith-brick with monolith
        data.loc[data.material == 6, 'material'] = 5.0  # Stalin to 5

        data_test.loc[data_test.material == 5, 'material'] = 2.0
        data_test.loc[data_test.material == 6, 'material'] = 5.0

        # Fill missing materials with the training mode
        material_mode = data['material'].mode()[0]
        data['material'] = data['material'].fillna(material_mode)
        data_test['material'] = data_test['material'].fillna(material_mode)
        
    elif feature == 'constructed' or feature == 'new':
        if feature == 'new':
            pass
        else:
            data['constructed'] = data.apply(
                lambda row: 2019 if (np.isnan(row['constructed']) and ~np.isnan(row['new']) and row['new'] == 0.0) else( 2021 if (np.isnan(row['constructed'])) else row['constructed']),
                axis=1
            )  
            data['new'] = data.apply(
                lambda row: 0.0 if (np.isnan(row['new']) and row['constructed'] < 2020) else( 1.0 if (np.isnan(row['new'])) else row['new']),
                axis=1
            )      
            
            
            data_test['constructed'] = data_test.apply(
                lambda row: 2019 if (np.isnan(row['constructed']) and ~np.isnan(row['new']) and row['new'] == 0.0) else( 2021 if (np.isnan(row['constructed'])) else row['constructed']),
                axis=1
            )  
            data_test['new'] = data_test.apply(
                lambda row: 0.0 if (np.isnan(row['new']) and row['constructed'] < 2020) else( 1.0 if (np.isnan(row['new'])) else row['new']),
                axis=1
            )     

    # Note: missing 'condition' values get their own class (4.0) in the 'condition' branch above;
    # they could alternatively be merged into 0.0, i.e. the undecorated class.
    # (A second, unreachable 'condition' branch that repeated this fill was removed here.)
    else:
        pass
        # Mean imputation for the remaining features is disabled:
        # mean = data[feature].mean()
        # data[feature] = data[feature].fillna(mean)
        # data_test[feature] = data_test[feature].fillna(mean)



data['area_total_log'] = np.log1p(data['area_total'])
data['area_kitchen_log'] = np.log1p(data['area_kitchen'])
data['area_living_log'] = np.log1p(data['area_living'])

data_test['area_total_log'] = np.log1p(data_test['area_total'])
data_test['area_kitchen_log'] = np.log1p(data_test['area_kitchen'])
data_test['area_living_log'] = np.log1p(data_test['area_living'])


#model
from catboost import CatBoostRegressor

train_x = data[features]
train_y = np.log1p(data['price'])
test_x = data_test[features]



param = {
"objective": "RMSE",
'random_state': 2263, 
'learning_rate': 0.025133301103588284, 
'n_estimators': 3326, 
'reg_lambda': 0.01621262044795105, 
'subsample': 0.909304956841248, 
'depth': 9}


model = CatBoostRegressor(**param)  

#model.fit(train_x,train_y,early_stopping_rounds=100,verbose=False)
#catboost_preds = model.predict(test_x)


#  final_preds = np.average(
#     [np.expm1(xgb_preds),
#      np.expm1(lgbm_preds),
#      np.expm1(xgb_preds_scnd_best_model),
#      np.expm1(catboost_preds)
#     ],
#     weights = 1 / np.array([0.16080,0.16638,0.16084,0.16460]) ** 6,  #Should be 4 by standard and then increase to 6 to squeeze more juice
#     axis=0
# )
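
The commented-out blend above averages the back-transformed predictions with weights proportional to the inverse of each model's score raised to a power. A minimal sketch of that weighting, assuming the predictions are on the `log1p` scale and that lower scores are better; the helper name `blend_predictions` and its `power` argument are illustrative and not part of the original notebook:

```python
import numpy as np


def blend_predictions(log_preds, scores, power=6):
    """Weighted average of expm1-transformed predictions.

    log_preds: list of arrays, one per model, on the log1p scale.
    scores:    one error score per model (lower is better).
    power:     exponent applied to the inverse scores (hypothetical knob).
    """
    weights = 1.0 / np.asarray(scores, dtype=float) ** power
    preds = np.expm1(np.asarray(log_preds, dtype=float))
    # np.average normalises the weights internally
    return np.average(preds, axis=0, weights=weights)
```

Raising `power` concentrates the weight on the best-scoring model; with `power=0` the blend reduces to a plain mean.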


<a id="10.3"></a> <br>
## 10.3 Attempt 3

#Attempt 3 - 17.10.2021 - 0.15570 on test set
apartments = pd.read_csv('data/apartments_train.csv')
buildings = pd.read_csv('data/buildings_train.csv')
data = pd.merge(apartments, buildings.set_index('id'), how='left', left_on='building_id', right_index=True)

apartments_test = pd.read_csv('data/apartments_test.csv')
buildings_test = pd.read_csv('data/buildings_test.csv')
data_test = pd.merge(apartments_test, buildings_test.set_index('id'), how='left', left_on='building_id', right_index=True)




#Adding city center as origin
origin_coordinates = (37.6, 55.75)
distance_from_city_center = np.sqrt((origin_coordinates[0] - data["longitude"])**2+(origin_coordinates[1] - data["latitude"])**2)
data["distance_from_city_center"] = distance_from_city_center

#Adding city center as origin
origin_coordinates = (37.6, 55.75)
distance_from_city_center_t = np.sqrt((origin_coordinates[0] - data_test["longitude"])**2+(origin_coordinates[1] - data_test["latitude"])**2)
data_test["distance_from_city_center"] = distance_from_city_center_t


# features = [ "area_total", "rooms", "floor","new","distance_from_city_center",
#         "latitude", "longitude","district", "constructed", "material", "stories"]
features = ["ceiling", "rooms", "area_total", "area_kitchen", "area_living", "floor", "condition","new", "elevatern","distance_from_city_center",
            "latitude", "longitude","district", "constructed", "seller", "windows_court", "balconies", "material", "stories"]
# DONE: floor, ceiling, rooms, new, constructed, condition
# TODO: area_kitchen, area_living

# Combine the three elevator flags into a single categorical feature for both train and test.
# Keys are (elevator_without, elevator_passenger, elevator_service); any other combination,
# including rows with missing flags, maps to NaN.
elevator_codes = {
    (1, 0, 0): 0,
    (1, 0, 1): 1,
    (1, 1, 0): 2,
    (0, 0, 1): 3,
    (0, 1, 0): 4,
    (0, 1, 1): 5,
    (1, 1, 1): 6,
}
elevator_cols = ["elevator_without", "elevator_passenger", "elevator_service"]
data['elevatern'] = data[elevator_cols].apply(lambda row: elevator_codes.get(tuple(row), np.nan), axis=1)
data_test['elevatern'] = data_test[elevator_cols].apply(lambda row: elevator_codes.get(tuple(row), np.nan), axis=1)

for feature in features:
    if   feature == 'elevatern': #or feature == "elevator_service" or features == "condition" or feature == "constructed" or features == "material" or features == "seller"
        #print('Categorical',feature)
        mod = data[feature].mode()
        data[feature] = data[feature].fillna(mod[0])
        data_test[feature] = data_test[feature].fillna(mod[0])
        
    elif feature == 'ceiling':
        maxc = 9
        minc = 1
        data['ceiling'] = data.apply(lambda row: data["ceiling"].mode()[0] if (row["ceiling"] < minc or row["ceiling"] > maxc ) else( row["ceiling"]) ,axis=1)     
        data_test['ceiling'] = data_test.apply(lambda row: data["ceiling"].mode()[0] if (row["ceiling"] < minc or row["ceiling"] > maxc ) else( row["ceiling"]) ,axis=1)     
        
    elif feature == 'condition' :
        var =  4.0
        data[feature] = data[feature].fillna(var)
        data_test[feature] = data_test[feature].fillna(var)
        
    elif feature == 'constructed' or feature == 'new':
        if feature == 'new':
            pass
        else:
            data['constructed'] = data.apply(
                lambda row: 2019 if (np.isnan(row['constructed']) and ~np.isnan(row['new']) and row['new'] == 0.0) else( 2021 if (np.isnan(row['constructed'])) else row['constructed']),
                axis=1
            )  
            data['new'] = data.apply(
                lambda row: 0.0 if (np.isnan(row['new']) and row['constructed'] < 2020) else( 1.0 if (np.isnan(row['new'])) else row['new']),
                axis=1
            )      
                        
            data_test['constructed'] = data_test.apply(
                lambda row: 2019 if (np.isnan(row['constructed']) and ~np.isnan(row['new']) and row['new'] == 0.0) else( 2021 if (np.isnan(row['constructed'])) else row['constructed']),
                axis=1
            )  
            data_test['new'] = data_test.apply(
                lambda row: 0.0 if (np.isnan(row['new']) and row['constructed'] < 2020) else( 1.0 if (np.isnan(row['new'])) else row['new']),
                axis=1
            )     
    else:
        mean = data[feature].mean()
        data[feature] = data[feature].fillna(mean)
        
        data_test[feature] = data_test[feature].fillna(mean)

data['area_total'] = np.log1p(data['area_total'])
data['area_kitchen'] = np.log1p(data['area_kitchen'])
data['area_living'] = np.log1p(data['area_living'])

data_test['area_total'] = np.log1p(data_test['area_total'])
data_test['area_kitchen'] = np.log1p(data_test['area_kitchen'])
data_test['area_living'] = np.log1p(data_test['area_living'])

#best xgboost model
from xgboost import XGBRegressor

train_x = data[features]
train_y = np.log1p(data['price'])
test_x = data_test[features]

    
model_xgb1 = XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.5144984086781564, gamma=0, learning_rate=0.01693820796093592, max_delta_step=0,
       max_depth=23, min_child_weight=5, n_estimators=3977,
       n_jobs=1, nthread=None, objective='reg:squarederror', random_state=2020, # squarederror  reg:squaredlogerror   reg:squarederror
       reg_alpha=0.021096319890667407, reg_lambda=0.2287729489989326, scale_pos_weight=1, seed=None, subsample=0.42023355655422495)


#model_xgb1.fit(train_x,train_y)
#xgb_preds = model_xgb1.predict(test_x)
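
The row-wise `apply` calls used in this attempt to impute `constructed` and `new` encode a small set of rules: a missing construction year becomes 2019 when the apartment is known not to be new and 2021 otherwise, and a missing `new` flag is then derived from the (now complete) year. The same rules can be sketched with column-wise operations. This is only an illustration under the assumption that both columns are numeric with NaN for missing values; the function name `impute_constructed_and_new` is hypothetical:

```python
import numpy as np
import pandas as pd


def impute_constructed_and_new(df):
    """Return a copy of df with 'constructed' and 'new' imputed by the rules above."""
    out = df.copy()

    # Missing year and known-not-new -> 2019; any remaining missing year -> 2021
    missing_year = out["constructed"].isna()
    out.loc[missing_year & (out["new"] == 0.0), "constructed"] = 2019
    out["constructed"] = out["constructed"].fillna(2021)

    # Missing 'new' flag -> 0.0 if built before 2020, else 1.0
    derived_new = (out["constructed"] >= 2020).astype(float)
    out["new"] = out["new"].fillna(derived_new)
    return out
```

Working on whole columns avoids calling a Python lambda once per row, which tends to matter on data sets of this size.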


apartments = pd.read_csv('data/apartments_train.csv')
buildings = pd.read_csv('data/buildings_train.csv')
data = pd.merge(apartments, buildings.set_index('id'), how='left', left_on='building_id', right_index=True)

apartments_test = pd.read_csv('data/apartments_test.csv')
buildings_test = pd.read_csv('data/buildings_test.csv')
data_test = pd.merge(apartments_test, buildings_test.set_index('id'), how='left', left_on='building_id', right_index=True)


# SIGMA
# Filling missing long lat in the Test set
# Coordinate fixes keyed by row position: (latitude, longitude)
coordinate_fixes = {
    # 55.568139, 37.481831 - fixing NaNs
    90: (55.568139, 37.481831),
    23: (55.568139, 37.481831),
    # 55.544066, 37.482317 - fixing negative numbers
    2511: (55.544066, 37.482317),
    6959: (55.544066, 37.482317),
    5090: (55.544066, 37.482317),
    8596: (55.544066, 37.482317),
    # Blown-up coordinates outside Moscow fixed:
    2529: (55.764335, 37.907556),
    4719: (55.765430, 37.928284),
    9547: (55.765430, 37.928284),
}
lat_col = data_test.columns.get_loc("latitude")
lon_col = data_test.columns.get_loc("longitude")
for row_pos, (lat, lon) in coordinate_fixes.items():
    # One positional assignment instead of chained indexing, so the write reaches data_test
    data_test.iloc[row_pos, lat_col] = lat
    data_test.iloc[row_pos, lon_col] = lon

# Fixing test-set districts (.loc avoids chained assignment):
data_test.loc[data_test.building_id.isin([3803, 4636, 4412]), "district"] = 11
data_test.loc[data_test.building_id.isin([926, 4202, 8811, 6879, 5667]), "district"] = 3
data_test.loc[data_test.building_id.isin([2265, 6403, 7317, 1647, 183]), "district"] = 5

# Fixing training-set districts
data.loc[data.building_id.isin([2029, 1255]), "district"] = 0
data.loc[data.building_id == 4162, "district"] = 5


# data[["area_total","area_kitchen","area_living","bathrooms_private", "bathrooms_shared","balconies","loggias"]][data.area_living + data.area_kitchen > data.area_total]

# Add a new feature of distance from center
#Adding city center as origin
origin_coordinates = (37.6, 55.75)
distance_from_city_center = np.sqrt((origin_coordinates[0] - data["longitude"])**2+(origin_coordinates[1] - data["latitude"])**2)
data["distance_from_city_center"] = distance_from_city_center

origin_coordinates = (37.6, 55.75)
distance_from_city_center_t = np.sqrt((origin_coordinates[0] - data_test["longitude"])**2+(origin_coordinates[1] - data_test["latitude"])**2)
data_test["distance_from_city_center"] = distance_from_city_center_t



# Combine the three elevator flags into a single categorical feature for both train and test.
# Keys are (elevator_without, elevator_passenger, elevator_service); any other combination,
# including rows with missing flags, maps to NaN.
elevator_codes = {
    (1, 0, 0): 0,
    (1, 0, 1): 1,
    (1, 1, 0): 2,
    (0, 0, 1): 3,
    (0, 1, 0): 4,
    (0, 1, 1): 5,
    (1, 1, 1): 6,
}
elevator_cols = ["elevator_without", "elevator_passenger", "elevator_service"]
data['elevatern'] = data[elevator_cols].apply(lambda row: elevator_codes.get(tuple(row), np.nan), axis=1)
data_test['elevatern'] = data_test[elevator_cols].apply(lambda row: elevator_codes.get(tuple(row), np.nan), axis=1)

# features = [ "area_total", "rooms", "floor","new","distance_from_city_center",
#         "latitude", "longitude","district", "constructed", "material", "stories"]
features = ["ceiling", "rooms", "area_total", "area_kitchen", "area_living", "floor", "condition","new", "elevatern","distance_from_city_center",
            "latitude", "longitude","district", "constructed", "seller", "windows_court", "balconies", "material", "stories"]
# DONE: floor, ceiling, rooms, new, constructed, condition
# TODO: area_kitchen, area_living

for feature in features:
    if   feature == 'elevatern': #or feature == "elevator_service" or features == "condition" or feature == "constructed" or features == "material" or features == "seller"
        #print('Categorical',feature)
        mod = data[feature].mode()
        data[feature] = data[feature].fillna(mod[0])
        data_test[feature] = data_test[feature].fillna(mod[0])

        
    elif feature == 'ceiling':
        maxc = 9
        minc = 1
        data['ceiling'] = data.apply(lambda row: data["ceiling"].mode()[0] if (row["ceiling"] < minc or row["ceiling"] > maxc ) else( row["ceiling"]) ,axis=1) 
        data_test['ceiling'] = data_test.apply(lambda row: data["ceiling"].mode()[0] if (row["ceiling"] < minc or row["ceiling"] > maxc ) else( row["ceiling"]) ,axis=1)     

        
    elif feature == 'condition' :
        var =  4.0
        data[feature] = data[feature].fillna(var)
        data_test[feature] = data_test[feature].fillna(var)

        
    elif feature == 'stories':
        # Where an apartment's floor exceeds the building's stories, raise the stories
        # of that building to its highest observed floor.
        for building_id in data.loc[data.floor > data.stories, "building_id"].unique():
            in_building = data["building_id"] == building_id
            data.loc[in_building, 'stories'] = data.loc[in_building, 'floor'].max()

        for building_id in data_test.loc[data_test.floor > data_test.stories, "building_id"].unique():
            in_building = data_test["building_id"] == building_id
            data_test.loc[in_building, 'stories'] = data_test.loc[in_building, 'floor'].max()
            
    elif feature == 'material':
        data.loc[data.material == 5, 'material'] = 2.0  # merging monolith-brick with monolith
        data.loc[data.material == 6, 'material'] = 5.0  # Stalin to 5

        data_test.loc[data_test.material == 5, 'material'] = 2.0
        data_test.loc[data_test.material == 6, 'material'] = 5.0
        
    elif feature == 'constructed' or feature == 'new':
        if feature == 'new':
            pass
        else:
            data['constructed'] = data.apply(
                lambda row: 2019 if (np.isnan(row['constructed']) and ~np.isnan(row['new']) and row['new'] == 0.0) else( 2021 if (np.isnan(row['constructed'])) else row['constructed']),
                axis=1
            )  
            data['new'] = data.apply(
                lambda row: 0.0 if (np.isnan(row['new']) and row['constructed'] < 2020) else( 1.0 if (np.isnan(row['new'])) else row['new']),
                axis=1
            )      
            
            
            data_test['constructed'] = data_test.apply(
                lambda row: 2019 if (np.isnan(row['constructed']) and ~np.isnan(row['new']) and row['new'] == 0.0) else( 2021 if (np.isnan(row['constructed'])) else row['constructed']),
                axis=1
            )  
            data_test['new'] = data_test.apply(
                lambda row: 0.0 if (np.isnan(row['new']) and row['constructed'] < 2020) else( 1.0 if (np.isnan(row['new'])) else row['new']),
                axis=1
            )     

    else:
        mean = data[feature].mean()
        #print('Not Categorical',feature)
        data[feature] = data[feature].fillna(mean)
        
        data_test[feature] = data_test[feature].fillna(mean)


data['area_total'] = np.log1p(data['area_total'])
data['area_kitchen'] = np.log1p(data['area_kitchen'])
data['area_living'] = np.log1p(data['area_living'])

data_test['area_total'] = np.log1p(data_test['area_total'])
data_test['area_kitchen'] = np.log1p(data_test['area_kitchen'])
data_test['area_living'] = np.log1p(data_test['area_living'])





from xgboost import XGBRegressor

train_x = data[features]
train_y = np.log1p(data['price'])
test_x = data_test[features]


model_xgb2 = XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.6185033815828214, gamma=0, learning_rate=0.015430795026041527, max_delta_step=0,
       max_depth=12, min_child_weight=2, n_estimators=3210,
       n_jobs=1, nthread=None, objective='reg:squarederror', random_state=2020, # squarederror  reg:squaredlogerror   reg:squarederror
       reg_alpha=0.13226917041287767, reg_lambda=0.9825323223526006, scale_pos_weight=1, seed=None, subsample=0.7155648654032974)

 
#model_xgb2.fit(train_x,train_y)
#xgb_preds_scnd_best_model = model_xgb2.predict(test_x)


#lightgbm pipeline
apartments = pd.read_csv('data/apartments_train.csv')
buildings = pd.read_csv('data/buildings_train.csv')
data = pd.merge(apartments, buildings.set_index('id'), how='left', left_on='building_id', right_index=True)

apartments_test = pd.read_csv('data/apartments_test.csv')
buildings_test = pd.read_csv('data/buildings_test.csv')
data_test = pd.merge(apartments_test, buildings_test.set_index('id'), how='left', left_on='building_id', right_index=True)




from numpy.random import choice

# Fix missing and implausible coordinates in the test set.
# Keys are positional row indices; values are (latitude, longitude).
coordinate_fixes = {
    # NaN coordinates
    90: (55.568139, 37.481831),
    23: (55.568139, 37.481831),
    # Negative coordinates
    2511: (55.544066, 37.482317),
    6959: (55.544066, 37.482317),
    5090: (55.544066, 37.482317),
    8596: (55.544066, 37.482317),
    # Blown-up coordinates far outside Moscow
    2529: (55.764335, 37.907556),
    4719: (55.765430, 37.928284),
    9547: (55.765430, 37.928284),
}
lat_col = data_test.columns.get_loc("latitude")
lon_col = data_test.columns.get_loc("longitude")
for row, (lat, lon) in coordinate_fixes.items():
    # Positional assignment on the frame itself avoids chained indexing.
    data_test.iloc[row, lat_col] = lat
    data_test.iloc[row, lon_col] = lon

# Fix districts in the test set (selected by building id).
data_test.loc[data_test.building_id.isin([3803, 4636, 4412]), "district"] = 11
data_test.loc[data_test.building_id.isin([926, 4202, 8811, 6879, 5667]), "district"] = 3
data_test.loc[data_test.building_id.isin([2265, 6403, 7317, 1647, 183]), "district"] = 5

# Fix districts in the training set.
data.loc[data.building_id.isin([2029, 1255]), "district"] = 0
data.loc[data.building_id == 4162, "district"] = 5


# data[["area_total","area_kitchen","area_living","bathrooms_private", "bathrooms_shared","balconies","loggias"]][data.area_living + data.area_kitchen > data.area_total]

# Add a new feature of distance from center
#Adding city center as origin
origin_coordinates = (37.6, 55.75)
distance_from_city_center = np.sqrt((origin_coordinates[0] - data["longitude"])**2+(origin_coordinates[1] - data["latitude"])**2)
data["distance_from_city_center"] = distance_from_city_center

origin_coordinates = (37.6, 55.75)
distance_from_city_center_t = np.sqrt((origin_coordinates[0] - data_test["longitude"])**2+(origin_coordinates[1] - data_test["latitude"])**2)
data_test["distance_from_city_center"] = distance_from_city_center_t



# Combine the three elevator flags into one categorical feature for both train and test.
# Any combination not listed (including missing flags) maps to NaN.
# Rough exploration note: codes 0, 1, 4, 6 looked expensive; 2, 3, 5 looked cheap.
ELEVATOR_CODES = {
    # (elevator_without, elevator_passenger, elevator_service): code
    (1, 0, 0): 0,
    (1, 0, 1): 1,
    (1, 1, 0): 2,
    (0, 0, 1): 3,
    (0, 1, 0): 4,
    (0, 1, 1): 5,
    (1, 1, 1): 6,
}


def combine_elevator_flags(row):
    key = (row["elevator_without"], row["elevator_passenger"], row["elevator_service"])
    return ELEVATOR_CODES.get(key, np.nan)


data['elevatern'] = data.apply(combine_elevator_flags, axis=1)
data_test['elevatern'] = data_test.apply(combine_elevator_flags, axis=1)
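
# Sketch (not part of the original pipeline): the row-wise elevator encoding above
# can also be computed without a Python-level loop over rows by using np.select.
# `encode_elevator` is a hypothetical helper name; the code table mirrors the
# mapping above, and every other combination (including NaN flags) becomes NaN.
import numpy as np
import pandas as pd


def encode_elevator(frame):
    w = frame["elevator_without"]
    p = frame["elevator_passenger"]
    s = frame["elevator_service"]
    conditions = [
        (w == 1) & (p == 0) & (s == 0),  # code 0
        (w == 1) & (p == 0) & (s == 1),  # code 1
        (w == 1) & (p == 1) & (s == 0),  # code 2
        (w == 0) & (p == 0) & (s == 1),  # code 3
        (w == 0) & (p == 1) & (s == 0),  # code 4
        (w == 0) & (p == 1) & (s == 1),  # code 5
        (w == 1) & (p == 1) & (s == 1),  # code 6
    ]
    codes = np.select(conditions, [0, 1, 2, 3, 4, 5, 6], default=np.nan)
    return pd.Series(codes, index=frame.index)

# Possible usage (left commented out so the pipeline above is unchanged):
# data['elevatern'] = encode_elevator(data)
# data_test['elevatern'] = encode_elevator(data_test)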

data["bathrooms_shared"] = data["bathrooms_shared"].fillna(1)
data["bathrooms_private"] = data["bathrooms_private"].fillna(1)
data["bathrooms_total"] = data.bathrooms_shared + data.bathrooms_private

data_test["bathrooms_shared"] = data_test["bathrooms_shared"].fillna(1)
data_test["bathrooms_private"] = data_test["bathrooms_private"].fillna(1)
data_test["bathrooms_total"] = data_test.bathrooms_shared + data_test.bathrooms_private


data["parking"] =  data["parking"].fillna(3.0)
data_test["parking"] =  data_test["parking"].fillna(3.0)

data["heating"] =  data["heating"].fillna(0.0)
data_test["heating"] =  data_test["heating"].fillna(0.0)


data["total_balconies"] = data["balconies"] + data["loggias"]
data_test["total_balconies"] = data_test["balconies"] + data_test["loggias"]


data["total_balconies"] =  data["total_balconies"].fillna(1.0)
data_test["total_balconies"] =  data_test["total_balconies"].fillna(1.0)


features = ["ceiling", "rooms", "area_total", "area_kitchen", "area_living", "floor", "elevatern","distance_from_city_center", "bathrooms_total", "bathrooms_shared", 
            "bathrooms_private", 'parking',"latitude", "longitude","district", "constructed", "condition", "seller", "material", "stories"]




# Revisit:
# We removed windows court and windows street as they did not seem to correlate with price.
# Consider filling total_balconies with 0 instead of 1.

for feature in features:
    if feature == 'elevatern':
        # Categorical feature: fill missing values with the training-set mode.
        mod = data[feature].mode()
        data[feature] = data[feature].fillna(mod[0])
        data_test[feature] = data_test[feature].fillna(mod[0])
    
    elif feature == 'seller':
        
        # Draw missing sellers at random from fixed class probabilities.
        list_of_candidates = [0, 1, 2, 3]
        # 14455, owner 0
        probability_distribution = [0.11, 0.33, 0.13, 0.43]
        number_of_items_to_pick = data['seller'].isna().sum()
        number_of_items_to_pick_test = data_test['seller'].isna().sum()

        np.random.seed(0)

        draw = choice(list_of_candidates, number_of_items_to_pick,
                      p=probability_distribution)
        draw_test = choice(list_of_candidates, number_of_items_to_pick_test,
                           p=probability_distribution)

        # .loc writes to the frame itself instead of relying on chained assignment.
        data.loc[data.seller.isna(), 'seller'] = draw
        data_test.loc[data_test.seller.isna(), 'seller'] = draw_test
        
    elif feature == 'area_kitchen' or feature == 'area_living':
        if feature == 'area_kitchen':
           
            # Mean share of kitchen and living area in the total area, estimated on
            # training rows where the two parts fit inside the total.
            consistent = data.area_living + data.area_kitchen < data.area_total
            mean_kitchen = (data.loc[consistent, "area_kitchen"] / data.loc[consistent, "area_total"]).mean()
            mean_living = (data.loc[consistent, "area_living"] / data.loc[consistent, "area_total"]).mean()

            # Replace missing or inconsistent areas with the mean share of the total area.
            for frame in (data, data_test):
                invalid = (
                    (frame.area_living + frame.area_kitchen >= frame.area_total)
                    | frame.area_living.isna()
                    | frame.area_kitchen.isna()
                )
                frame.loc[invalid, "area_kitchen"] = frame.loc[invalid, "area_total"] * mean_kitchen
                frame.loc[invalid, "area_living"] = frame.loc[invalid, "area_total"] * mean_living
        
        else:
            pass
    elif feature == 'ceiling':
        # Ceilings outside [minc, maxc] are treated as errors and replaced,
        # together with missing values, by the training-set mode.
        maxc = 9
        minc = 1
        ceiling_mode = data["ceiling"].mode()[0]
        for frame in (data, data_test):
            implausible = (frame["ceiling"] < minc) | (frame["ceiling"] > maxc)
            frame.loc[implausible | frame["ceiling"].isna(), "ceiling"] = ceiling_mode
        
    elif feature == 'condition' :
        var =  4.0
        data[feature] = data[feature].fillna(var)
        data_test[feature] = data_test[feature].fillna(var)

        
    elif feature == 'stories':
        
        # Where an apartment's floor exceeds the building's stories, raise the
        # stories of that building to the highest floor observed in it.
        for frame in (data, data_test):
            bad_ids = frame.loc[frame.floor > frame.stories, "building_id"].unique()
            for building_id in bad_ids:
                in_building = frame["building_id"] == building_id
                frame.loc[in_building, "stories"] = frame.loc[in_building, "floor"].max()
            
    elif feature == 'material':
        for frame in (data, data_test):
            frame.loc[frame.material == 5, 'material'] = 2.0  # merge monolith-brick into monolith
            frame.loc[frame.material == 6, 'material'] = 5.0  # Stalin moves to code 5

        # Fill missing materials with the training-set mode.
        material_mode = data['material'].mode()[0]
        data['material'] = data['material'].fillna(material_mode)
        data_test['material'] = data_test['material'].fillna(material_mode)
        
    elif feature == 'constructed' or feature == 'new':
        if feature == 'new':
            pass
        else:
            # Missing construction year: 2019 if the flat is known not to be new, otherwise 2021.
            # Missing 'new' flag: derived from the (now complete) construction year.
            for frame in (data, data_test):
                year_missing = frame['constructed'].isna()
                frame.loc[year_missing & (frame['new'] == 0.0), 'constructed'] = 2019
                frame.loc[frame['constructed'].isna(), 'constructed'] = 2021
                new_missing = frame['new'].isna()
                frame.loc[new_missing, 'new'] = (frame.loc[new_missing, 'constructed'] >= 2020).astype(float)

    else:
        # Note on 'condition' (handled in its branch above): missing values get their
        # own class 4.0; merging them with 0.0, the undecorated class, is an alternative.
        # Remaining features are left unchanged (mean imputation is disabled here).
        pass



data['area_total_log'] = np.log1p(data['area_total'])
data['area_kitchen_log'] = np.log1p(data['area_kitchen'])
data['area_living_log'] = np.log1p(data['area_living'])

data_test['area_total_log'] = np.log1p(data_test['area_total'])
data_test['area_kitchen_log'] = np.log1p(data_test['area_kitchen'])
data_test['area_living_log'] = np.log1p(data_test['area_living'])




#Submission
from lightgbm import LGBMRegressor

train_x = data[features]
train_y = np.log1p(data['price'])
test_x = data_test[features]

best_params = {
    'objective': 'regression',
    'metric': 'root_mean_squared_error',
    'random_state': 2020,
    'boosting_type': 'gbdt',
    'learning_rate': 0.009545503382688678,
    # num_iterations and n_estimators are aliases in LightGBM;
    # only one of the two values takes effect.
    'num_iterations': 6000,
    'n_estimators': 3467,
    'max_bin': 1329,
    'num_leaves': 84,
    'min_data_in_leaf': 21,
    'min_sum_hessian_in_leaf': 10,
    'bagging_fraction': 0.6138769224842795,
    'bagging_freq': 1,
    'max_depth': 24,
    'lambda_l1': 0.03828165214380662,
    'lambda_l2': 0.9439756453034776,
    'min_gain_to_split': 0.006068058433178841,
}



model_lgbm = LGBMRegressor(**best_params)  

#model_lgbm.fit(train_x,train_y,verbose=False)
#lgbm_preds = model_lgbm.predict(test_x)



apartments = pd.read_csv('data/apartments_train.csv')
buildings = pd.read_csv('data/buildings_train.csv')
data = pd.merge(apartments, buildings.set_index('id'), how='left', left_on='building_id', right_index=True)

apartments_test = pd.read_csv('data/apartments_test.csv')
buildings_test = pd.read_csv('data/buildings_test.csv')
data_test = pd.merge(apartments_test, buildings_test.set_index('id'), how='left', left_on='building_id', right_index=True)


from numpy.random import choice

# Fix missing and implausible coordinates in the test set.
# Keys are positional row indices; values are (latitude, longitude).
coordinate_fixes = {
    # NaN coordinates
    90: (55.568139, 37.481831),
    23: (55.568139, 37.481831),
    # Negative coordinates
    2511: (55.544066, 37.482317),
    6959: (55.544066, 37.482317),
    5090: (55.544066, 37.482317),
    8596: (55.544066, 37.482317),
    # Blown-up coordinates far outside Moscow
    2529: (55.764335, 37.907556),
    4719: (55.765430, 37.928284),
    9547: (55.765430, 37.928284),
}
lat_col = data_test.columns.get_loc("latitude")
lon_col = data_test.columns.get_loc("longitude")
for row, (lat, lon) in coordinate_fixes.items():
    # Positional assignment on the frame itself avoids chained indexing.
    data_test.iloc[row, lat_col] = lat
    data_test.iloc[row, lon_col] = lon

# Fix districts in the test set (selected by building id).
data_test.loc[data_test.building_id.isin([3803, 4636, 4412]), "district"] = 11
data_test.loc[data_test.building_id.isin([926, 4202, 8811, 6879, 5667]), "district"] = 3
data_test.loc[data_test.building_id.isin([2265, 6403, 7317, 1647, 183]), "district"] = 5

# Fix districts in the training set.
data.loc[data.building_id.isin([2029, 1255]), "district"] = 0
data.loc[data.building_id == 4162, "district"] = 5


# data[["area_total","area_kitchen","area_living","bathrooms_private", "bathrooms_shared","balconies","loggias"]][data.area_living + data.area_kitchen > data.area_total]

# Add a new feature of distance from center
#Adding city center as origin
origin_coordinates = (37.6, 55.75)
distance_from_city_center = np.sqrt((origin_coordinates[0] - data["longitude"])**2+(origin_coordinates[1] - data["latitude"])**2)
data["distance_from_city_center"] = distance_from_city_center

origin_coordinates = (37.6, 55.75)
distance_from_city_center_t = np.sqrt((origin_coordinates[0] - data_test["longitude"])**2+(origin_coordinates[1] - data_test["latitude"])**2)
data_test["distance_from_city_center"] = distance_from_city_center_t



# Combine the three elevator flags into one categorical feature for both train and test.
# Any combination not listed (including missing flags) maps to NaN.
# Rough exploration note: codes 0, 1, 4, 6 looked expensive; 2, 3, 5 looked cheap.
ELEVATOR_CODES = {
    # (elevator_without, elevator_passenger, elevator_service): code
    (1, 0, 0): 0,
    (1, 0, 1): 1,
    (1, 1, 0): 2,
    (0, 0, 1): 3,
    (0, 1, 0): 4,
    (0, 1, 1): 5,
    (1, 1, 1): 6,
}


def combine_elevator_flags(row):
    key = (row["elevator_without"], row["elevator_passenger"], row["elevator_service"])
    return ELEVATOR_CODES.get(key, np.nan)


data['elevatern'] = data.apply(combine_elevator_flags, axis=1)
data_test['elevatern'] = data_test.apply(combine_elevator_flags, axis=1)

data["bathrooms_shared"] = data["bathrooms_shared"].fillna(1)
data["bathrooms_private"] = data["bathrooms_private"].fillna(1)
data["bathrooms_total"] = data.bathrooms_shared + data.bathrooms_private

data_test["bathrooms_shared"] = data_test["bathrooms_shared"].fillna(1)
data_test["bathrooms_private"] = data_test["bathrooms_private"].fillna(1)
data_test["bathrooms_total"] = data_test.bathrooms_shared + data_test.bathrooms_private


data["parking"] =  data["parking"].fillna(3.0)
data_test["parking"] =  data_test["parking"].fillna(3.0)

data["heating"] =  data["heating"].fillna(0.0)
data_test["heating"] =  data_test["heating"].fillna(0.0)


data["total_balconies"] = data["balconies"] + data["loggias"]
data_test["total_balconies"] = data_test["balconies"] + data_test["loggias"]


data["total_balconies"] =  data["total_balconies"].fillna(1.0)
data_test["total_balconies"] =  data_test["total_balconies"].fillna(1.0)


features = ["ceiling", "rooms", "area_total", "area_kitchen", "area_living", "area_total_log", "area_kitchen_log", "area_living_log", "floor", "new", "elevatern","distance_from_city_center", 
            "bathrooms_total", "bathrooms_shared", "bathrooms_private", 'parking', 'heating',
            "latitude", "longitude","district", "constructed", "condition", "seller", "total_balconies", "material", "stories"]


# Revisit:
# We removed windows court and windows street as they did not seem to correlate with price.
# Consider filling total_balconies with 0 instead of 1.

for feature in features:
    if feature == 'elevatern':
        # Categorical feature: fill missing values with the training-set mode.
        mod = data[feature].mode()
        data[feature] = data[feature].fillna(mod[0])
        data_test[feature] = data_test[feature].fillna(mod[0])
    
    elif feature == 'seller':
        
        # Draw missing sellers at random from fixed class probabilities.
        list_of_candidates = [0, 1, 2, 3]
        # 14455, owner 0
        probability_distribution = [0.11, 0.33, 0.13, 0.43]
        number_of_items_to_pick = data['seller'].isna().sum()
        number_of_items_to_pick_test = data_test['seller'].isna().sum()

        np.random.seed(0)

        draw = choice(list_of_candidates, number_of_items_to_pick,
                      p=probability_distribution)
        draw_test = choice(list_of_candidates, number_of_items_to_pick_test,
                           p=probability_distribution)

        # .loc writes to the frame itself instead of relying on chained assignment.
        data.loc[data.seller.isna(), 'seller'] = draw
        data_test.loc[data_test.seller.isna(), 'seller'] = draw_test
        
    elif feature == 'area_kitchen' or feature == 'area_living':
        if feature == 'area_kitchen':
           
            # Mean share of kitchen and living area in the total area, estimated on
            # training rows where the two parts fit inside the total.
            consistent = data.area_living + data.area_kitchen < data.area_total
            mean_kitchen = (data.loc[consistent, "area_kitchen"] / data.loc[consistent, "area_total"]).mean()
            mean_living = (data.loc[consistent, "area_living"] / data.loc[consistent, "area_total"]).mean()

            # Replace missing or inconsistent areas with the mean share of the total area.
            for frame in (data, data_test):
                invalid = (
                    (frame.area_living + frame.area_kitchen >= frame.area_total)
                    | frame.area_living.isna()
                    | frame.area_kitchen.isna()
                )
                frame.loc[invalid, "area_kitchen"] = frame.loc[invalid, "area_total"] * mean_kitchen
                frame.loc[invalid, "area_living"] = frame.loc[invalid, "area_total"] * mean_living
        
        else:
            pass
    elif feature == 'ceiling':
        # Ceilings outside [minc, maxc] are treated as errors and replaced,
        # together with missing values, by the training-set mode.
        maxc = 9
        minc = 1
        ceiling_mode = data["ceiling"].mode()[0]
        for frame in (data, data_test):
            implausible = (frame["ceiling"] < minc) | (frame["ceiling"] > maxc)
            frame.loc[implausible | frame["ceiling"].isna(), "ceiling"] = ceiling_mode
        
    elif feature == 'condition' :
        var =  4.0
        data[feature] = data[feature].fillna(var)
        data_test[feature] = data_test[feature].fillna(var)

        
    elif feature == 'stories':
        
        # Where an apartment's floor exceeds the building's stories, raise the
        # stories of that building to the highest floor observed in it.
        for frame in (data, data_test):
            bad_ids = frame.loc[frame.floor > frame.stories, "building_id"].unique()
            for building_id in bad_ids:
                in_building = frame["building_id"] == building_id
                frame.loc[in_building, "stories"] = frame.loc[in_building, "floor"].max()
            
    elif feature == 'material':
        for frame in (data, data_test):
            frame.loc[frame.material == 5, 'material'] = 2.0  # merge monolith-brick into monolith
            frame.loc[frame.material == 6, 'material'] = 5.0  # Stalin moves to code 5

        # Fill missing materials with the training-set mode.
        material_mode = data['material'].mode()[0]
        data['material'] = data['material'].fillna(material_mode)
        data_test['material'] = data_test['material'].fillna(material_mode)
        
    elif feature == 'constructed' or feature == 'new':
        if feature == 'new':
            pass
        else:
            # Missing construction year: 2019 if the flat is known not to be new, otherwise 2021.
            # Missing 'new' flag: derived from the (now complete) construction year.
            for frame in (data, data_test):
                year_missing = frame['constructed'].isna()
                frame.loc[year_missing & (frame['new'] == 0.0), 'constructed'] = 2019
                frame.loc[frame['constructed'].isna(), 'constructed'] = 2021
                new_missing = frame['new'].isna()
                frame.loc[new_missing, 'new'] = (frame.loc[new_missing, 'constructed'] >= 2020).astype(float)

    else:
        # Note on 'condition' (handled in its branch above): missing values get their
        # own class 4.0; merging them with 0.0, the undecorated class, is an alternative.
        # Remaining features are left unchanged (mean imputation is disabled here).
        pass



data['area_total_log'] = np.log1p(data['area_total'])
data['area_kitchen_log'] = np.log1p(data['area_kitchen'])
data['area_living_log'] = np.log1p(data['area_living'])

data_test['area_total_log'] = np.log1p(data_test['area_total'])
data_test['area_kitchen_log'] = np.log1p(data_test['area_kitchen'])
data_test['area_living_log'] = np.log1p(data_test['area_living'])


#model
from catboost import CatBoostRegressor

train_x = data[features]
train_y = np.log1p(data['price'])
test_x = data_test[features]



param = {
"objective": "RMSE",
'random_state': 2263, 
'learning_rate': 0.025133301103588284, 
'n_estimators': 3326, 
'reg_lambda': 0.01621262044795105, 
'subsample': 0.909304956841248, 
'depth': 9}


model_cat = CatBoostRegressor(**param)  

#model_cat.fit(train_x,train_y,early_stopping_rounds=100,verbose=False)
#catboost_preds = model_cat.predict(test_x)


apartments = pd.read_csv('./data/apartments_train.csv')
buildings = pd.read_csv('./data/buildings_train.csv')
data = pd.merge(apartments, buildings.set_index('id'), how='left', left_on='building_id', right_index=True)

apartments_test = pd.read_csv('./data/apartments_test.csv')
buildings_test = pd.read_csv('./data/buildings_test.csv')
data_test = pd.merge(apartments_test, buildings_test.set_index('id'), how='left', left_on='building_id', right_index=True)

def clean_NaN_values(data, features):
    """Fill NaNs with the mode for binary/categorical features and the mean otherwise.

    The statistics come from the frame passed in, so train and test are each
    imputed with their own mode and mean.
    """
    categorical = {"elevator_service", "condition", "constructed", "material", "seller"}
    for feature in features:
        if data[feature].max() == 1.0 or feature in categorical:
            data[feature] = data[feature].fillna(data[feature].mode()[0])
        else:
            data[feature] = data[feature].fillna(data[feature].mean())

    return data


def add_euclidean_distance_feature(data):
    origin_coordinates = (37.6, 55.75)
    X, Y = 0,1
    distance_from_city_center = np.sqrt((origin_coordinates[X] - data["longitude"])**2+(origin_coordinates[Y] - data["latitude"])**2)
    data["distance_from_city_center"] = distance_from_city_center
    return data



features = ["ceiling","rooms", "area_total", "area_kitchen", "area_living", "floor", "condition","new",
            "latitude", "longitude","district", "constructed", "seller", "balconies", "material", "stories"]

data      = clean_NaN_values(data, features)
data_test = clean_NaN_values(data_test, features)


data      = add_euclidean_distance_feature(data)
data_test = add_euclidean_distance_feature(data_test)

features = ["ceiling","rooms", "area_total", "area_kitchen", "area_living", "floor", "condition","new",
            "latitude", "longitude","district", "constructed", "seller", "balconies", "material", "stories", "distance_from_city_center"]



from catboost import CatBoostRegressor

train_x = data[features]
train_y = np.log1p(data['price'])
test_x = data_test[features]

#validation: 0.12496821678247572  
 

param = {
"objective": "RMSE",
'random_state': 2020, 
'learning_rate': 0.027775682386650822, 
'n_estimators': 9561, 
'reg_lambda': 0.02942773134248745, 
'subsample': 0.6452052083779029,
'depth': 8,
'bagging_temperature': 56.77037557663241}


model_cat2 = CatBoostRegressor(**param)  

#model_cat2.fit(train_x,train_y,early_stopping_rounds=100,verbose=False)
#catboost2_preds = model_cat2.predict(test_x)



#Adding city center as origin
origin_coordinates = (37.6, 55.75)
distance_from_city_center = np.sqrt((origin_coordinates[0] - data["longitude"])**2+(origin_coordinates[1] - data["latitude"])**2)
data["distance_from_city_center"] = distance_from_city_center

#Adding city center as origin
origin_coordinates = (37.6, 55.75)
distance_from_city_center_t = np.sqrt((origin_coordinates[0] - data_test["longitude"])**2+(origin_coordinates[1] - data_test["latitude"])**2)
data_test["distance_from_city_center"] = distance_from_city_center_t


# features = [ "area_total", "rooms", "floor","new","distance_from_city_center",
#         "latitude", "longitude","district", "constructed", "material", "stories"]
features = ["ceiling", "rooms", "area_total", "area_kitchen", "area_living", "floor", "condition","new", "elevatern","distance_from_city_center",
            "latitude", "longitude","district", "constructed", "seller", "windows_court", "balconies", "material", "stories"]
# DONE: floor, ceiling, rooms, new, constructed, condition
# TODO: area_kitchen, area_living

# Encode the three elevator flags as one categorical code.
# Key order: (elevator_without, elevator_passenger, elevator_service)
elevator_codes = {(1, 0, 0): 0, (1, 0, 1): 1, (1, 1, 0): 2, (0, 0, 1): 3,
                  (0, 1, 0): 4, (0, 1, 1): 5, (1, 1, 1): 6}

def encode_elevators(row):
    # Any other combination (including missing flags) stays NaN and is imputed below.
    key = (row["elevator_without"], row["elevator_passenger"], row["elevator_service"])
    return elevator_codes.get(key, np.nan)

data['elevatern'] = data.apply(encode_elevators, axis=1)
data_test['elevatern'] = data_test.apply(encode_elevators, axis=1)

for feature in features:
    if   feature == 'elevatern': #or feature == "elevator_service" or features == "condition" or feature == "constructed" or features == "material" or features == "seller"
        #print('Categorical',feature)
        mod = data[feature].mode()
        data[feature] = data[feature].fillna(mod[0])
        data_test[feature] = data_test[feature].fillna(mod[0])
        
    elif feature == 'ceiling':
        maxc = 9
        minc = 1
        # Replace out-of-range ceiling values with the training-set mode (computed once, not per row).
        ceiling_mode = data["ceiling"].mode()[0]
        data.loc[(data["ceiling"] < minc) | (data["ceiling"] > maxc), "ceiling"] = ceiling_mode
        data_test.loc[(data_test["ceiling"] < minc) | (data_test["ceiling"] > maxc), "ceiling"] = ceiling_mode
        
    elif feature == 'condition' :
        var =  4.0
        data[feature] = data[feature].fillna(var)
        data_test[feature] = data_test[feature].fillna(var)
        
    elif feature == 'constructed' or feature == 'new':
        if feature == 'new':
            pass
        else:
            data['constructed'] = data.apply(
                lambda row: 2019 if (np.isnan(row['constructed']) and ~np.isnan(row['new']) and row['new'] == 0.0) else( 2021 if (np.isnan(row['constructed'])) else row['constructed']),
                axis=1
            )  
            data['new'] = data.apply(
                lambda row: 0.0 if (np.isnan(row['new']) and row['constructed'] < 2020) else( 1.0 if (np.isnan(row['new'])) else row['new']),
                axis=1
            )      
                        
            data_test['constructed'] = data_test.apply(
                lambda row: 2019 if (np.isnan(row['constructed']) and ~np.isnan(row['new']) and row['new'] == 0.0) else( 2021 if (np.isnan(row['constructed'])) else row['constructed']),
                axis=1
            )  
            data_test['new'] = data_test.apply(
                lambda row: 0.0 if (np.isnan(row['new']) and row['constructed'] < 2020) else( 1.0 if (np.isnan(row['new'])) else row['new']),
                axis=1
            )     
    else:
        mean = data[feature].mean()
        data[feature] = data[feature].fillna(mean)
        
        data_test[feature] = data_test[feature].fillna(mean)

data['area_total'] = np.log1p(data['area_total'])
data['area_kitchen'] = np.log1p(data['area_kitchen'])
data['area_living'] = np.log1p(data['area_living'])

data_test['area_total'] = np.log1p(data_test['area_total'])
data_test['area_kitchen'] = np.log1p(data_test['area_kitchen'])
data_test['area_living'] = np.log1p(data_test['area_living'])


train_x = data[features]
train_y = np.log1p(data['price'])
test_x = data_test[features]



from mlxtend.regressor import StackingCVRegressor

stacked_model = StackingCVRegressor(regressors=(model_xgb1, model_xgb2, model_lgbm, model_cat, model_cat2),
                                meta_regressor=model_xgb1,  # our best individual model becomes the meta-regressor
                                use_features_in_secondary=True,
                                   verbose=0)



#stacked_model.fit(train_x,train_y)
#stacked_preds = stacked_model.predict(test_x)

# final_preds = np.average(
#     [np.expm1(xgb_preds),
#      np.expm1(lgbm_preds),
#      np.expm1(xgb_preds_scnd_best_model),
#      np.expm1(catboost_preds),
#      np.expm1(stacked_preds),
#      np.expm1(catboost2_preds)
#     ],
#     weights = 1 / np.array([0.12887,  0.13364,  0.12631,  0.13242,  0.12871,0.12496]) ** 6,  #Should be 4 by standard and then increase to 6 to squeeze more juice
#     axis=0
# )

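The commented-out averaging above weights each model by the inverse of its validation score raised to a power, and averages in price space after undoing the `log1p` transform. A minimal sketch of that idea follows; the prediction arrays and scores are made-up placeholders (not results from this notebook), and `inverse_score_weights` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical log1p-scale predictions from three models for four apartments.
log_preds = np.array([
    [15.9, 16.4, 17.1, 16.0],
    [16.0, 16.3, 17.0, 16.1],
    [15.8, 16.5, 17.2, 15.9],
])
# Hypothetical validation scores, one per model (lower is better).
val_scores = np.array([0.129, 0.134, 0.125])


def inverse_score_weights(scores, power=6):
    """Return normalized weights proportional to 1 / score**power."""
    raw = 1.0 / np.asarray(scores, dtype=float) ** power
    return raw / raw.sum()


weights = inverse_score_weights(val_scores)
# Undo the log1p transform first, then take the weighted mean per apartment.
blended_prices = np.average(np.expm1(log_preds), axis=0, weights=weights)
```

A larger `power` concentrates the weight on the best-scoring model, while `power=0` reduces to a plain mean.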


<a id="10.4"></a> <br>
## 10.4 Attempt 4

#Attempt 4 - 27.10.2021 - 0.15178 on test set
apartments = pd.read_csv('data/apartments_train.csv')
buildings = pd.read_csv('data/buildings_train.csv')
data = pd.merge(apartments, buildings.set_index('id'), how='left', left_on='building_id', right_index=True)

apartments_test = pd.read_csv('data/apartments_test.csv')
buildings_test = pd.read_csv('data/buildings_test.csv')
data_test = pd.merge(apartments_test, buildings_test.set_index('id'), how='left', left_on='building_id', right_index=True)



import pyproj
from numpy.random import choice


#FEATURE CLEANING
# Manual latitude/longitude fixes for specific test rows, keyed by row position.
# Positional .iloc on the frame avoids chained assignment, which may silently modify a copy.
coordinate_fixes = {
    90: (55.568139, 37.481831),
    23: (55.568139, 37.481831),
    2511: (55.544066, 37.482317),
    6959: (55.544066, 37.482317),
    5090: (55.544066, 37.482317),
    8596: (55.544066, 37.482317),
    2529: (55.764335, 37.907556),
    4719: (55.765430, 37.928284),
    9547: (55.765430, 37.928284),
}
lat_col = data_test.columns.get_loc("latitude")
lon_col = data_test.columns.get_loc("longitude")
for position, (lat, lon) in coordinate_fixes.items():
    data_test.iloc[position, lat_col] = lat
    data_test.iloc[position, lon_col] = lon

# Manual district fixes per building: {district: [building_id, ...]}
district_fixes_test = {
    11: [3803, 4636, 4412],
    3: [926, 4202, 8811, 6879, 5667],
    5: [2265, 6403, 7317, 1647, 183],
}
for district, building_ids in district_fixes_test.items():
    data_test.loc[data_test.building_id.isin(building_ids), "district"] = district

data.loc[data.building_id.isin([2029, 1255]), "district"] = 0
data.loc[data.building_id == 4162, "district"] = 5


#cleaning/engineering all elevators as one feature
# Key order: (elevator_without, elevator_passenger, elevator_service)
elevator_codes = {(1, 0, 0): 0, (1, 0, 1): 1, (1, 1, 0): 2, (0, 0, 1): 3,
                  (0, 1, 0): 4, (0, 1, 1): 5, (1, 1, 1): 6}

def encode_elevators(row):
    # Any other combination (including missing flags) stays NaN and is imputed below.
    key = (row["elevator_without"], row["elevator_passenger"], row["elevator_service"])
    return elevator_codes.get(key, np.nan)

data['elevatern'] = data.apply(encode_elevators, axis=1)
data_test['elevatern'] = data_test.apply(encode_elevators, axis=1)
mod = data['elevatern'].mode()
data['elevatern'] = data['elevatern'].fillna(mod[0])
data_test['elevatern'] = data_test['elevatern'].fillna(mod[0])


data["bathrooms_shared"] = data["bathrooms_shared"].fillna(1)
data["bathrooms_private"] = data["bathrooms_private"].fillna(1)
data_test["bathrooms_shared"] = data_test["bathrooms_shared"].fillna(1)
data_test["bathrooms_private"] = data_test["bathrooms_private"].fillna(1)



data["parking"] =  data["parking"].fillna(3.0)
data_test["parking"] =  data_test["parking"].fillna(3.0)
data["heating"] =  data["heating"].fillna(0.0)
data_test["heating"] =  data_test["heating"].fillna(0.0)

#engineering and cleaning a feature
data["total_balconies"] = data["balconies"] + data["loggias"]
data_test["total_balconies"] = data_test["balconies"] + data_test["loggias"]
data["total_balconies"] =  data["total_balconies"].fillna(1.0)
data_test["total_balconies"] =  data_test["total_balconies"].fillna(1.0)


#seller
list_of_candidates = [0,1,2,3]
probability_distribution  = [0.11, 0.33, 0.13, 0.43]
number_of_items_to_pick = data['seller'].isna().sum()
number_of_items_to_pick_test = data_test['seller'].isna().sum()

np.random.seed(0)

draw = choice(list_of_candidates, number_of_items_to_pick,
              p=probability_distribution)
draw_test = choice(list_of_candidates, number_of_items_to_pick_test,
              p=probability_distribution)

data.loc[data.seller.isna(), 'seller'] = draw
data_test.loc[data_test.seller.isna(), 'seller'] = draw_test


#area_kitchen and living
# Average share of the total area taken by the kitchen and the living area,
# estimated from training rows where the two add up to less than area_total.
valid_area = data.area_living + data.area_kitchen < data.area_total
mean_kitchen = (data.loc[valid_area, "area_kitchen"] / data.loc[valid_area, "area_total"]).mean()
mean_living = (data.loc[valid_area, "area_living"] / data.loc[valid_area, "area_total"]).mean()

# Replace missing or inconsistent values with that share of area_total.
# The mask is computed once per frame, before either column is modified.
for frame in (data, data_test):
    invalid_area = ((frame.area_living + frame.area_kitchen >= frame.area_total)
                    | frame.area_living.isna() | frame.area_kitchen.isna())
    frame.loc[invalid_area, "area_kitchen"] = frame.loc[invalid_area, "area_total"] * mean_kitchen
    frame.loc[invalid_area, "area_living"] = frame.loc[invalid_area, "area_total"] * mean_living
        

#ceiling    
maxc = 9
minc = 1
# Out-of-range and missing ceiling values are set to the training-set mode (computed once).
ceiling_mode = data["ceiling"].mode()[0]
for frame in (data, data_test):
    out_of_range = (frame["ceiling"] < minc) | (frame["ceiling"] > maxc)
    frame.loc[out_of_range | frame["ceiling"].isna(), "ceiling"] = ceiling_mode


#condition
var =  4.0
data['condition'] = data['condition'].fillna(var)
data_test['condition'] = data_test['condition'].fillna(var)


#stories
# Where a listed floor exceeds the building's stories, raise stories
# to the highest floor listed for that building.
for frame in (data, data_test):
    bad_buildings = frame.loc[frame.floor > frame.stories, "building_id"].unique()
    for building_id in bad_buildings:
        in_building = frame["building_id"] == building_id
        frame.loc[in_building, "stories"] = frame.loc[in_building, "floor"].max()


#material
# Merge monolith-brick (5) into monolith (2), then relabel Stalin (6) as 5.
data["material"] = data["material"].replace({5: 2.0, 6: 5.0})
data_test["material"] = data_test["material"].replace({5: 2.0, 6: 5.0})

# Missing materials get the training-set mode.
material_mode = data["material"].mode()[0]
data["material"] = data["material"].fillna(material_mode)
data_test["material"] = data_test["material"].fillna(material_mode)

#constructed, new:
# A missing construction year becomes 2019 for listings flagged as not new, otherwise 2021.
# A missing `new` flag is then derived from the (now complete) construction year.
for frame in (data, data_test):
    missing_year = frame['constructed'].isna()
    frame.loc[missing_year & (frame['new'] == 0.0), 'constructed'] = 2019
    frame.loc[frame['constructed'].isna(), 'constructed'] = 2021
    frame.loc[frame['new'].isna(), 'new'] = (frame['constructed'] >= 2020).astype(float)




#FEATURE ENGINEERING:

# Geodesic distance and azimuths from the city center on the WGS84 ellipsoid.
lon1 = 37.621390
lat1 = 55.753098
geodesic = pyproj.Geod(ellps='WGS84')

def add_geodesic_features(frame):
    n = len(frame)
    # Geod.inv accepts arrays, so all rows are handled in one call
    # and no label-based indexing on the frame is needed.
    fwd_azimuth, back_azimuth, distance = geodesic.inv(
        np.full(n, lon1), np.full(n, lat1),
        frame["longitude"].to_numpy(), frame["latitude"].to_numpy())
    frame['fwd_azi'] = fwd_azimuth
    frame['distance'] = distance
    frame['back_azi'] = back_azimuth
    return frame

data = add_geodesic_features(data)
data_test = add_geodesic_features(data_test)




data['area_per_room'] = data['area_total']/data['rooms']
data_test['area_per_room'] = data_test['area_total']/data_test['rooms']

data['area_per_room_log'] = np.log1p(data['area_per_room'])
data_test['area_per_room_log'] = np.log1p(data_test['area_per_room'])

data['area_total_log'] = np.log1p(data['area_total'])
data['area_kitchen_log'] = np.log1p(data['area_kitchen'])
data['area_living_log'] = np.log1p(data['area_living'])

data_test['area_total_log'] = np.log1p(data_test['area_total'])
data_test['area_kitchen_log'] = np.log1p(data_test['area_kitchen'])
data_test['area_living_log'] = np.log1p(data_test['area_living'])


data["bathrooms_total"] = data.bathrooms_shared + data.bathrooms_private
data_test["bathrooms_total"] = data_test.bathrooms_shared + data_test.bathrooms_private




#floor/stories
data["floor/stories"] = data["floor"]/data["stories"]
data_test["floor/stories"] = data_test["floor"]/data_test["stories"]


#euclidean distance from the financial center
financial_coords = (37.535497858, 55.741330368)
distance_from_financial_center = np.sqrt((financial_coords[0] - data["longitude"])**2+(financial_coords[1] - data["latitude"])**2)
data["distance_from_financial_center"] = distance_from_financial_center

distance_from_financial_center_t = np.sqrt((financial_coords[0] - data_test["longitude"])**2+(financial_coords[1] - data_test["latitude"])**2)
data_test["distance_from_financial_center"] = distance_from_financial_center_t


#euclidean distance from city center
origin_coordinates = (37.621390,55.753098)
distance_from_city_center = np.sqrt((origin_coordinates[0] - data["longitude"])**2+(origin_coordinates[1] - data["latitude"])**2)
data["distance_from_city_center"] = distance_from_city_center

distance_from_city_center_t = np.sqrt((origin_coordinates[0] - data_test["longitude"])**2+(origin_coordinates[1] - data_test["latitude"])**2)
data_test["distance_from_city_center"] = distance_from_city_center_t

#FEATURES INCLUDED:
features = ["ceiling", "area_per_room" ,  "area_per_room_log", "rooms", "area_total", "area_kitchen", "area_living", "area_total_log", "area_kitchen_log", "area_living_log",
            "floor", "new", "elevatern", "bathrooms_total", "bathrooms_shared", "bathrooms_private", 'parking', 'heating',
            "latitude", "longitude","district", "constructed", "condition", "seller", "total_balconies", "material", "stories",'distance','back_azi','fwd_azi',"floor/stories",
           "distance_from_financial_center", "distance_from_city_center"]





#MODEL:
train_x = data[features]
train_y = np.log1p(data['price']/data["area_total"])
test_x = data_test[features]

param = {
    'base_score': 0.5,
    'booster': 'gbtree',
    'colsample_bylevel': 1,
    'gamma': 0,
    'max_delta_step': 0,
    'n_jobs': -1,
    'nthread': None,
    'objective': 'reg:squarederror',
    'scale_pos_weight': 1,
    'seed': None,
    'lambda': 0.0024064014952485785,
    'alpha': 0.001541503784279617,
    'colsample_bytree': 0.43152225018148443,
    'subsample': 0.8078473020517652,
    'learning_rate': 0.013367834721822036,
    'n_estimators': 5235,
    'random_state': 291,
    'max_depth': 9,
    'min_child_weight': 13,
}



model_xgb3 = XGBRegressor(**param)

#model_xgb3.fit(train_x,train_y)
#xgb3_preds = model_xgb3.predict(test_x)

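Unlike the other attempts, the XGBoost model in this attempt is trained on `log1p(price / area_total)`, so its predictions are on a log price-per-area scale and have to be mapped back before submission. A minimal sketch of the round trip on a toy frame; the helper names `to_target` and `to_price` and the numbers are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the merged apartment frame.
toy = pd.DataFrame({
    "price": [7.0e6, 1.2e7, 2.5e7],
    "area_total": [35.0, 60.0, 100.0],
})


def to_target(price, area_total):
    """Training target: log1p of the price per unit of area."""
    return np.log1p(price / area_total)


def to_price(pred, area_total):
    """Invert the target: undo log1p, then scale by the total area."""
    return np.expm1(pred) * area_total


target = to_target(toy["price"], toy["area_total"])
recovered = to_price(target, toy["area_total"])
```

With the commented-out prediction enabled, the corresponding call would be `to_price(xgb3_preds, data_test["area_total"])`, which relies on `area_total` being kept untransformed in this attempt.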

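The seller imputation in this attempt draws missing values at random from a fixed probability vector. The sketch below derives those probabilities from the observed category frequencies instead of hard-coding them; whether the hard-coded values were obtained this way is an assumption, and `impute_from_distribution` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def impute_from_distribution(reference, target, seed=0):
    """Fill NaNs in `target` by sampling categories with the
    relative frequencies observed in `reference`."""
    freqs = reference.dropna().value_counts(normalize=True).sort_index()
    rng = np.random.default_rng(seed)
    filled = target.copy()
    missing = filled.isna()
    filled.loc[missing] = rng.choice(
        freqs.index.to_numpy(), size=int(missing.sum()), p=freqs.to_numpy())
    return filled


# Toy column with two missing entries.
toy_seller = pd.Series([0, 1, 1, 3, 3, 3, np.nan, 2, np.nan, 1])
toy_filled = impute_from_distribution(toy_seller, toy_seller)
```

Passing the training column as `reference` for both the training and the test column keeps the two sets on the same distribution.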
<a id="10.5"></a> <br>
## 10.5 Attempt 5

#Attempt 5 - 08.11.2021 - 0.14871 on test set
#Xgboost 1 PIPELINE  
apartments = pd.read_csv('data/apartments_train.csv')
buildings = pd.read_csv('data/buildings_train.csv')
data = pd.merge(apartments, buildings.set_index('id'), how='left', left_on='building_id', right_index=True)

apartments_test = pd.read_csv('data/apartments_test.csv')
buildings_test = pd.read_csv('data/buildings_test.csv')
data_test = pd.merge(apartments_test, buildings_test.set_index('id'), how='left', left_on='building_id', right_index=True)


#Feature Cleaning
maxc = 9
minc = 1
# Replace out-of-range ceiling values with the training-set mode (computed once, not per row).
ceiling_mode = data["ceiling"].mode()[0]
data.loc[(data["ceiling"] < minc) | (data["ceiling"] > maxc), "ceiling"] = ceiling_mode
data_test.loc[(data_test["ceiling"] < minc) | (data_test["ceiling"] > maxc), "ceiling"] = ceiling_mode


var =  4.0
data['condition'] = data['condition'].fillna(var)
data_test['condition'] = data_test['condition'].fillna(var)


# A missing construction year becomes 2019 for listings flagged as not new, otherwise 2021.
# A missing `new` flag is then derived from the (now complete) construction year.
for frame in (data, data_test):
    missing_year = frame['constructed'].isna()
    frame.loc[missing_year & (frame['new'] == 0.0), 'constructed'] = 2019
    frame.loc[frame['constructed'].isna(), 'constructed'] = 2021
    frame.loc[frame['new'].isna(), 'new'] = (frame['constructed'] >= 2020).astype(float)

        
for feature in ["rooms", "area_total", "area_kitchen", "area_living", "floor","latitude", 
                "longitude","district", "seller", "windows_court", "balconies", "material", "stories"]:
    mean = data[feature].mean()
    data[feature] = data[feature].fillna(mean)    
    data_test[feature] = data_test[feature].fillna(mean)


        
        
#FEATURE ENGINEERING DONE:
data['area_total'] = np.log1p(data['area_total'])
data['area_kitchen'] = np.log1p(data['area_kitchen'])
data['area_living'] = np.log1p(data['area_living'])

data_test['area_total'] = np.log1p(data_test['area_total'])
data_test['area_kitchen'] = np.log1p(data_test['area_kitchen'])
data_test['area_living'] = np.log1p(data_test['area_living'])


# Encode the three elevator flags as one categorical code.
# Key order: (elevator_without, elevator_passenger, elevator_service)
elevator_codes = {(1, 0, 0): 0, (1, 0, 1): 1, (1, 1, 0): 2, (0, 0, 1): 3,
                  (0, 1, 0): 4, (0, 1, 1): 5, (1, 1, 1): 6}

def encode_elevators(row):
    # Any other combination (including missing flags) stays NaN and is imputed below.
    key = (row["elevator_without"], row["elevator_passenger"], row["elevator_service"])
    return elevator_codes.get(key, np.nan)

data['elevatern'] = data.apply(encode_elevators, axis=1)
data_test['elevatern'] = data_test.apply(encode_elevators, axis=1)
mod = data['elevatern'].mode()
data['elevatern'] = data['elevatern'].fillna(mod[0])
data_test['elevatern'] = data_test['elevatern'].fillna(mod[0])


#Adding city center as origin
origin_coordinates = (37.6, 55.75)
distance_from_city_center = np.sqrt((origin_coordinates[0] - data["longitude"])**2+(origin_coordinates[1] - data["latitude"])**2)
data["distance_from_city_center"] = distance_from_city_center

distance_from_city_center_t = np.sqrt((origin_coordinates[0] - data_test["longitude"])**2+(origin_coordinates[1] - data_test["latitude"])**2)
data_test["distance_from_city_center"] = distance_from_city_center_t




#FEATURES INCLUDED:
features = ["ceiling", "rooms", "area_total", "area_kitchen", "area_living", "floor", "condition","new", "elevatern","distance_from_city_center",
            "latitude", "longitude","district", "constructed", "seller", "windows_court", "balconies", "material", "stories"]



#Model
from xgboost import XGBRegressor

train_x = data[features]
train_y = np.log1p(data['price'])
test_x = data_test[features]

    
model_xgb1 = XGBRegressor(
    base_score=0.5, booster='gbtree', colsample_bylevel=1,
    colsample_bytree=0.5144984086781564, gamma=0,
    learning_rate=0.01693820796093592, max_delta_step=0,
    max_depth=23, min_child_weight=5, n_estimators=3977,
    n_jobs=1, nthread=None, objective='reg:squarederror', random_state=2020,
    reg_alpha=0.021096319890667407, reg_lambda=0.2287729489989326,
    scale_pos_weight=1, seed=None, subsample=0.42023355655422495)


#model_xgb1.fit(train_x,train_y)
#xgb_preds = model_xgb1.predict(test_x)

#Catboost PIPELINE  
apartments = pd.read_csv('./data/apartments_train.csv')
buildings = pd.read_csv('./data/buildings_train.csv')
data = pd.merge(apartments, buildings.set_index('id'), how='left', left_on='building_id', right_index=True)

apartments_test = pd.read_csv('./data/apartments_test.csv')
buildings_test = pd.read_csv('./data/buildings_test.csv')
data_test = pd.merge(apartments_test, buildings_test.set_index('id'), how='left', left_on='building_id', right_index=True)


#FEATURE CLEANING:
for feature in ["ceiling","rooms", "area_total", "area_kitchen", "area_living", "floor", "condition","new",
            "latitude", "longitude","district", "constructed", "seller", "balconies", "material", "stories"]:
        if data[feature].max() == 1.0 or feature == "elevator_service" or feature == "condition" or feature == "constructed" or feature == "material" or feature == "seller" :
            #print('Categorical',feature)
            mod = data[feature].mode()
            data[feature] = data[feature].fillna(mod[0])
            
            mod_t = data_test[feature].mode()
            data_test[feature] = data_test[feature].fillna(mod_t[0])          
        else:
            mean = data[feature].mean()
            data[feature] = data[feature].fillna(mean)
            
            mean_t = data_test[feature].mean()
            data_test[feature] = data_test[feature].fillna(mean_t)



#FEATURE ENGINEERING:
origin_coordinates = (37.6, 55.75)
distance_from_city_center = np.sqrt((origin_coordinates[0] - data["longitude"])**2+(origin_coordinates[1] - data["latitude"])**2)
data["distance_from_city_center"] = distance_from_city_center

distance_from_city_center_t = np.sqrt((origin_coordinates[0] - data_test["longitude"])**2+(origin_coordinates[1] - data_test["latitude"])**2)
data_test["distance_from_city_center"] = distance_from_city_center_t

#FEATURES INCLUDED:
features = ["ceiling","rooms", "area_total", "area_kitchen", "area_living", "floor", "condition","new",
            "latitude", "longitude","district", "constructed", "seller", "balconies", "material", "stories", "distance_from_city_center"]


#Model
from catboost import CatBoostRegressor

train_x = data[features]
train_y = np.log1p(data['price'])
test_x = data_test[features]

 
 

param = {
"objective": "RMSE",
'random_state': 2020, 
'learning_rate': 0.027775682386650822, 
'n_estimators': 9561, 
'reg_lambda': 0.02942773134248745, 
'subsample': 0.6452052083779029,
'depth': 8,
'bagging_temperature': 56.77037557663241}


model_cat2 = CatBoostRegressor(**param)  

#model_cat2.fit(train_x,train_y,early_stopping_rounds=100,verbose=False)
#catboost2_preds = model_cat2.predict(test_x)


#LGBM 2 PIPELINE
apartments = pd.read_csv('data/apartments_train.csv')
buildings = pd.read_csv('data/buildings_train.csv')
data = pd.merge(apartments, buildings.set_index('id'), how='left', left_on='building_id', right_index=True)

apartments_test = pd.read_csv('data/apartments_test.csv')
buildings_test = pd.read_csv('data/buildings_test.csv')
data_test = pd.merge(apartments_test, buildings_test.set_index('id'), how='left', left_on='building_id', right_index=True)


from lightgbm import LGBMRegressor
import pyproj
from numpy.random import choice


#FEATURE CLEANING
#patch the outlier/NaN coordinates by row position; assigning through .loc on the frame
#avoids chained indexing (data_test.latitude.iloc[i] = ...), which may write to a copy
coord_fixes = [
    ([90, 23], 55.568139, 37.481831),
    ([2511, 6959, 5090, 8596], 55.544066, 37.482317),
    ([2529], 55.764335, 37.907556),
    ([4719, 9547], 55.765430, 37.928284),
]
for rows, lat, lon in coord_fixes:
    data_test.loc[data_test.index[rows], "latitude"] = lat
    data_test.loc[data_test.index[rows], "longitude"] = lon

#assign through .loc; chained indexing (data_test.district[mask] = ...) may write to a copy
data_test.loc[data_test.building_id.isin([3803, 4636, 4412]), "district"] = 11
data_test.loc[data_test.building_id.isin([926, 4202, 8811, 6879, 5667]), "district"] = 3
data_test.loc[data_test.building_id.isin([2265, 6403, 7317, 1647, 183]), "district"] = 5

data.loc[data.building_id.isin([2029, 1255]), "district"] = 0
data.loc[data.building_id == 4162, "district"] = 5


#cleaning/engineering all elevators as one feature
data['elevatern'] = data.apply(lambda row: 0 if (row["elevator_without"] == 1 and row["elevator_passenger"] == 0.0 and row["elevator_service"] == 0.0 ) # 
                               else( 1 if(row["elevator_without"] == 1 and row["elevator_passenger"] == 0.0 and row["elevator_service"] == 1.0) # 
                                else(2 if(row["elevator_without"] == 1 and row["elevator_passenger"] == 1.0 and row["elevator_service"] == 0.0) # 
                                else(3 if(row["elevator_without"] == 0 and row["elevator_passenger"] == 0.0 and row["elevator_service"] == 1.0) # 
                                else(4 if(row["elevator_without"] == 0 and row["elevator_passenger"] == 1.0 and row["elevator_service"] == 0.0) # 
                                else(5 if(row["elevator_without"] == 0 and row["elevator_passenger"] == 1.0 and row["elevator_service"] == 1.0) # 
                                else(6 if(row["elevator_without"] == 1 and row["elevator_passenger"] == 1.0 and row["elevator_service"] == 1.0)
                                else(np.nan)
                                )    
                                )
                                )
                                )   
                                )
                                )
                                ,axis=1)
data_test['elevatern'] = data_test.apply(lambda row: 0 if (row["elevator_without"] == 1 and row["elevator_passenger"] == 0.0 and row["elevator_service"] == 0.0 ) # 
                               else( 1 if(row["elevator_without"] == 1 and row["elevator_passenger"] == 0.0 and row["elevator_service"] == 1.0) # 
                                else(2 if(row["elevator_without"] == 1 and row["elevator_passenger"] == 1.0 and row["elevator_service"] == 0.0) # 
                                else(3 if(row["elevator_without"] == 0 and row["elevator_passenger"] == 0.0 and row["elevator_service"] == 1.0) # 
                                else(4 if(row["elevator_without"] == 0 and row["elevator_passenger"] == 1.0 and row["elevator_service"] == 0.0) # 
                                else(5 if(row["elevator_without"] == 0 and row["elevator_passenger"] == 1.0 and row["elevator_service"] == 1.0) # 
                                else(6 if(row["elevator_without"] == 1 and row["elevator_passenger"] == 1.0 and row["elevator_service"] == 1.0)
                                else(np.nan)
                                ) 
                                )
                                )
                                )   
                                )
                                )
                                ,axis=1)  
mod = data['elevatern'].mode()
data['elevatern'] = data['elevatern'].fillna(mod[0])
data_test['elevatern'] = data_test['elevatern'].fillna(mod[0])


data["bathrooms_shared"] = data["bathrooms_shared"].fillna(1)
data["bathrooms_private"] = data["bathrooms_private"].fillna(1)
data_test["bathrooms_shared"] = data_test["bathrooms_shared"].fillna(1)
data_test["bathrooms_private"] = data_test["bathrooms_private"].fillna(1)



data["parking"] =  data["parking"].fillna(3.0)
data_test["parking"] =  data_test["parking"].fillna(3.0)
data["heating"] =  data["heating"].fillna(0.0)
data_test["heating"] =  data_test["heating"].fillna(0.0)

#engineering and cleaning a feature
data["total_balconies"] = data["balconies"] + data["loggias"]
data_test["total_balconies"] = data_test["balconies"] + data_test["loggias"]
data["total_balconies"] =  data["total_balconies"].fillna(1.0)
data_test["total_balconies"] =  data_test["total_balconies"].fillna(1.0)


#seller
list_of_candidates = [0,1,2,3]
probability_distribution  = [0.11, 0.33, 0.13, 0.43]
number_of_items_to_pick = data['seller'].isna().sum()
number_of_items_to_pick_test = data_test['seller'].isna().sum()

np.random.seed(0)

draw = choice(list_of_candidates, number_of_items_to_pick,
              p=probability_distribution)
draw_test = choice(list_of_candidates, number_of_items_to_pick_test,
              p=probability_distribution)

data.loc[data.seller.isna(), 'seller'] = draw
data_test.loc[data_test.seller.isna(), 'seller'] = draw_test


#area_kitchen and living
percentage_area_data = pd.DataFrame()
percentage_area_data["area_kitchen"] = data["area_kitchen"][data.area_living + data.area_kitchen < data.area_total]/data["area_total"][data.area_living + data.area_kitchen < data.area_total]
percentage_area_data["area_living"] = data["area_living"][data.area_living + data.area_kitchen < data.area_total]/data["area_total"][data.area_living + data.area_kitchen < data.area_total]

mean_kitchen = percentage_area_data["area_kitchen"].mean()
mean_living = percentage_area_data["area_living"].mean()

#evaluate the mask once on the untouched columns, then assign through .loc (no chained indexing)
bad_area = (data.area_living + data.area_kitchen >= data.area_total) | data.area_living.isna() | data.area_kitchen.isna()
data["area_kitchen_edit"] = data["area_kitchen"].copy()
data["area_living_edit"] = data["area_living"].copy()

data.loc[bad_area, "area_kitchen_edit"] = data.area_total*mean_kitchen
data.loc[bad_area, "area_living_edit"] = data.area_total*mean_living

data["area_kitchen"] = data["area_kitchen_edit"].copy()
data["area_living"] = data["area_living_edit"].copy()

#test_set (uses the ratios estimated on the training set)
bad_area_test = (data_test.area_living + data_test.area_kitchen >= data_test.area_total) | data_test.area_living.isna() | data_test.area_kitchen.isna()
data_test["area_kitchen_edit"] = data_test["area_kitchen"].copy()
data_test["area_living_edit"] = data_test["area_living"].copy()

data_test.loc[bad_area_test, "area_kitchen_edit"] = data_test.area_total*mean_kitchen
data_test.loc[bad_area_test, "area_living_edit"] = data_test.area_total*mean_living

data_test["area_kitchen"] = data_test["area_kitchen_edit"].copy()
data_test["area_living"] = data_test["area_living_edit"].copy()
        

#ceiling
maxc = 9
minc = 1
#compute the training-set mode once instead of once per row inside apply
ceiling_mode = data["ceiling"].mode()[0]
for df in (data, data_test):
    out_of_range = (df["ceiling"] < minc) | (df["ceiling"] > maxc)
    df.loc[out_of_range | df["ceiling"].isna(), "ceiling"] = ceiling_mode


#condition
var =  4.0
data['condition'] = data['condition'].fillna(var)
data_test['condition'] = data_test['condition'].fillna(var)


#stories
#a building cannot have fewer stories than its highest listed floor
for df in (data, data_test):
    bad_ids = df.loc[df.floor > df.stories, "building_id"].unique()
    in_bad = df.building_id.isin(bad_ids)
    df.loc[in_bad, "stories"] = df.loc[in_bad].groupby("building_id")["floor"].transform("max")


#material
#merge monolithic brick (5) into monolith (2) and move Stalin-era (6) to 5
material_map = {5: 2.0, 6: 5.0}
data["material"] = data["material"].replace(material_map)
data_test["material"] = data_test["material"].replace(material_map)

material_mode = data["material"].mode()[0]
data["material"] = data["material"].fillna(material_mode)
data_test["material"] = data_test["material"].fillna(material_mode)

#constructed, new:
data['constructed'] = data.apply(
    lambda row: 2019 if (np.isnan(row['constructed']) and ~np.isnan(row['new']) and row['new'] == 0.0) else( 2021 if (np.isnan(row['constructed'])) else row['constructed']),
    axis=1
)  
data['new'] = data.apply(
    lambda row: 0.0 if (np.isnan(row['new']) and row['constructed'] < 2020) else( 1.0 if (np.isnan(row['new'])) else row['new']),
    axis=1
)      


data_test['constructed'] = data_test.apply(
    lambda row: 2019 if (np.isnan(row['constructed']) and ~np.isnan(row['new']) and row['new'] == 0.0) else( 2021 if (np.isnan(row['constructed'])) else row['constructed']),
    axis=1
)  
data_test['new'] = data_test.apply(
    lambda row: 0.0 if (np.isnan(row['new']) and row['constructed'] < 2020) else( 1.0 if (np.isnan(row['new'])) else row['new']),
    axis=1
)  




#FEATURE ENGINEERING:

lon1 =  37.621390
lat1 = 55.753098
geodesic = pyproj.Geod(ellps='WGS84')

#Geod.inv accepts arrays, so each frame is processed in one call instead of a row-by-row loop
for df in (data, data_test):
    n = len(df)
    fwd_azimuth, back_azimuth, distance = geodesic.inv(
        np.full(n, lon1), np.full(n, lat1),
        df["longitude"].to_numpy(), df["latitude"].to_numpy())
    df['fwd_azi'] = fwd_azimuth
    df['distance'] = distance
    df['back_azi'] = back_azimuth




data['area_per_room'] = data['area_total']/data['rooms']
data_test['area_per_room'] = data_test['area_total']/data_test['rooms']

data['area_per_room_log'] = np.log1p(data['area_per_room'])
data_test['area_per_room_log'] = np.log1p(data_test['area_per_room'])

data['area_total_log'] = np.log1p(data['area_total'])
data['area_kitchen_log'] = np.log1p(data['area_kitchen'])
data['area_living_log'] = np.log1p(data['area_living'])

data_test['area_total_log'] = np.log1p(data_test['area_total'])
data_test['area_kitchen_log'] = np.log1p(data_test['area_kitchen'])
data_test['area_living_log'] = np.log1p(data_test['area_living'])


data["bathrooms_total"] = data.bathrooms_shared + data.bathrooms_private
data_test["bathrooms_total"] = data_test.bathrooms_shared + data_test.bathrooms_private



#FEATURES INCLUDED:
features = ["ceiling", "area_per_room" ,  "area_per_room_log", "rooms", "area_total", "area_kitchen", "area_living", "area_total_log", "area_kitchen_log", "area_living_log", "floor", "new", "elevatern", "bathrooms_total", "bathrooms_shared", "bathrooms_private", 'parking', 'heating',
            "latitude", "longitude","district", "constructed", "condition", "seller", "total_balconies", "material", "stories",'distance','back_azi','fwd_azi']

    
#Model
train_x = data[features]
train_y = np.log1p(data['price']/data["area_total"])
test_x = data_test[features]



param = {
            'boosting_type': 'gbdt',
            'num_leaves': 35,
            'min_child_weight': 1,
            'subsample': 0.32415121173658534,
            'colsample_bytree':  0.4768205472451884,
            'reg_lambda': 0.14916991373512928,
            'reg_alpha': 0.006696476138868112,
            'learning_rate': 0.01747572661694792,
            'max_depth': 46,
            'n_estimators': 9775,
            'n_jobs' : 1,
            'objective' : 'regression',


        }


model_lgbm2 = LGBMRegressor(**param)


#model_lgbm2.fit(train_x,train_y)
#lgbm2_preds = model_lgbm2.predict(test_x)


#Xgb2 Pipeline
train_x = data[features]
train_y = np.log1p(data['price']/data["area_total"])
test_x = data_test[features]

model_xgb2 = XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=0.5728285533310635, gamma=0, learning_rate=0.016003653491882115, max_delta_step=0,
max_depth=8, min_child_weight=1, n_estimators=3515,
n_jobs=1, nthread=None, objective='reg:squarederror', random_state=5, # squarederror reg:squaredlogerror reg:squarederror
reg_alpha=0.005800171239325761, reg_lambda=0.48110648627756064, scale_pos_weight=1, seed=None, subsample=0.7155884414918227)


#model_xgb2.fit(train_x,train_y)
#xgb2_preds = model_xgb2.predict(test_x)


#XGB3 (0.15178) on its own
apartments = pd.read_csv('data/apartments_train.csv')
buildings = pd.read_csv('data/buildings_train.csv')
data = pd.merge(apartments, buildings.set_index('id'), how='left', left_on='building_id', right_index=True)

apartments_test = pd.read_csv('data/apartments_test.csv')
buildings_test = pd.read_csv('data/buildings_test.csv')
data_test = pd.merge(apartments_test, buildings_test.set_index('id'), how='left', left_on='building_id', right_index=True)



import pyproj
from numpy.random import choice


#FEATURE CLEANING
#patch the outlier/NaN coordinates by row position; assigning through .loc on the frame
#avoids chained indexing (data_test.latitude.iloc[i] = ...), which may write to a copy
coord_fixes = [
    ([90, 23], 55.568139, 37.481831),
    ([2511, 6959, 5090, 8596], 55.544066, 37.482317),
    ([2529], 55.764335, 37.907556),
    ([4719, 9547], 55.765430, 37.928284),
]
for rows, lat, lon in coord_fixes:
    data_test.loc[data_test.index[rows], "latitude"] = lat
    data_test.loc[data_test.index[rows], "longitude"] = lon

#assign through .loc; chained indexing (data_test.district[mask] = ...) may write to a copy
data_test.loc[data_test.building_id.isin([3803, 4636, 4412]), "district"] = 11
data_test.loc[data_test.building_id.isin([926, 4202, 8811, 6879, 5667]), "district"] = 3
data_test.loc[data_test.building_id.isin([2265, 6403, 7317, 1647, 183]), "district"] = 5

data.loc[data.building_id.isin([2029, 1255]), "district"] = 0
data.loc[data.building_id == 4162, "district"] = 5


#cleaning/engineering all elevators as one feature
data['elevatern'] = data.apply(lambda row: 0 if (row["elevator_without"] == 1 and row["elevator_passenger"] == 0.0 and row["elevator_service"] == 0.0 ) # 
                               else( 1 if(row["elevator_without"] == 1 and row["elevator_passenger"] == 0.0 and row["elevator_service"] == 1.0) # 
                                else(2 if(row["elevator_without"] == 1 and row["elevator_passenger"] == 1.0 and row["elevator_service"] == 0.0) # 
                                else(3 if(row["elevator_without"] == 0 and row["elevator_passenger"] == 0.0 and row["elevator_service"] == 1.0) # 
                                else(4 if(row["elevator_without"] == 0 and row["elevator_passenger"] == 1.0 and row["elevator_service"] == 0.0) # 
                                else(5 if(row["elevator_without"] == 0 and row["elevator_passenger"] == 1.0 and row["elevator_service"] == 1.0) # 
                                else(6 if(row["elevator_without"] == 1 and row["elevator_passenger"] == 1.0 and row["elevator_service"] == 1.0)
                                else(np.nan)
                                )    
                                )
                                )
                                )   
                                )
                                )
                                ,axis=1)
data_test['elevatern'] = data_test.apply(lambda row: 0 if (row["elevator_without"] == 1 and row["elevator_passenger"] == 0.0 and row["elevator_service"] == 0.0 ) # 
                               else( 1 if(row["elevator_without"] == 1 and row["elevator_passenger"] == 0.0 and row["elevator_service"] == 1.0) # 
                                else(2 if(row["elevator_without"] == 1 and row["elevator_passenger"] == 1.0 and row["elevator_service"] == 0.0) # 
                                else(3 if(row["elevator_without"] == 0 and row["elevator_passenger"] == 0.0 and row["elevator_service"] == 1.0) # 
                                else(4 if(row["elevator_without"] == 0 and row["elevator_passenger"] == 1.0 and row["elevator_service"] == 0.0) # 
                                else(5 if(row["elevator_without"] == 0 and row["elevator_passenger"] == 1.0 and row["elevator_service"] == 1.0) # 
                                else(6 if(row["elevator_without"] == 1 and row["elevator_passenger"] == 1.0 and row["elevator_service"] == 1.0)
                                else(np.nan)
                                ) 
                                )
                                )
                                )   
                                )
                                )
                                ,axis=1)  
mod = data['elevatern'].mode()
data['elevatern'] = data['elevatern'].fillna(mod[0])
data_test['elevatern'] = data_test['elevatern'].fillna(mod[0])


data["bathrooms_shared"] = data["bathrooms_shared"].fillna(1)
data["bathrooms_private"] = data["bathrooms_private"].fillna(1)
data_test["bathrooms_shared"] = data_test["bathrooms_shared"].fillna(1)
data_test["bathrooms_private"] = data_test["bathrooms_private"].fillna(1)



data["parking"] =  data["parking"].fillna(3.0)
data_test["parking"] =  data_test["parking"].fillna(3.0)
data["heating"] =  data["heating"].fillna(0.0)
data_test["heating"] =  data_test["heating"].fillna(0.0)

#engineering and cleaning a feature
data["total_balconies"] = data["balconies"] + data["loggias"]
data_test["total_balconies"] = data_test["balconies"] + data_test["loggias"]
data["total_balconies"] =  data["total_balconies"].fillna(1.0)
data_test["total_balconies"] =  data_test["total_balconies"].fillna(1.0)


#seller
list_of_candidates = [0,1,2,3]
probability_distribution  = [0.11, 0.33, 0.13, 0.43]
number_of_items_to_pick = data['seller'].isna().sum()
number_of_items_to_pick_test = data_test['seller'].isna().sum()

np.random.seed(0)

draw = choice(list_of_candidates, number_of_items_to_pick,
              p=probability_distribution)
draw_test = choice(list_of_candidates, number_of_items_to_pick_test,
              p=probability_distribution)

data.loc[data.seller.isna(), 'seller'] = draw
data_test.loc[data_test.seller.isna(), 'seller'] = draw_test


#area_kitchen and living
percentage_area_data = pd.DataFrame()
percentage_area_data["area_kitchen"] = data["area_kitchen"][data.area_living + data.area_kitchen < data.area_total]/data["area_total"][data.area_living + data.area_kitchen < data.area_total]
percentage_area_data["area_living"] = data["area_living"][data.area_living + data.area_kitchen < data.area_total]/data["area_total"][data.area_living + data.area_kitchen < data.area_total]

mean_kitchen = percentage_area_data["area_kitchen"].mean()
mean_living = percentage_area_data["area_living"].mean()

#evaluate the mask once on the untouched columns, then assign through .loc (no chained indexing)
bad_area = (data.area_living + data.area_kitchen >= data.area_total) | data.area_living.isna() | data.area_kitchen.isna()
data["area_kitchen_edit"] = data["area_kitchen"].copy()
data["area_living_edit"] = data["area_living"].copy()

data.loc[bad_area, "area_kitchen_edit"] = data.area_total*mean_kitchen
data.loc[bad_area, "area_living_edit"] = data.area_total*mean_living

data["area_kitchen"] = data["area_kitchen_edit"].copy()
data["area_living"] = data["area_living_edit"].copy()

#test_set (uses the ratios estimated on the training set)
bad_area_test = (data_test.area_living + data_test.area_kitchen >= data_test.area_total) | data_test.area_living.isna() | data_test.area_kitchen.isna()
data_test["area_kitchen_edit"] = data_test["area_kitchen"].copy()
data_test["area_living_edit"] = data_test["area_living"].copy()

data_test.loc[bad_area_test, "area_kitchen_edit"] = data_test.area_total*mean_kitchen
data_test.loc[bad_area_test, "area_living_edit"] = data_test.area_total*mean_living

data_test["area_kitchen"] = data_test["area_kitchen_edit"].copy()
data_test["area_living"] = data_test["area_living_edit"].copy()
        

#ceiling
maxc = 9
minc = 1
#compute the training-set mode once instead of once per row inside apply
ceiling_mode = data["ceiling"].mode()[0]
for df in (data, data_test):
    out_of_range = (df["ceiling"] < minc) | (df["ceiling"] > maxc)
    df.loc[out_of_range | df["ceiling"].isna(), "ceiling"] = ceiling_mode


#condition
var =  4.0
data['condition'] = data['condition'].fillna(var)
data_test['condition'] = data_test['condition'].fillna(var)


#stories
#a building cannot have fewer stories than its highest listed floor
for df in (data, data_test):
    bad_ids = df.loc[df.floor > df.stories, "building_id"].unique()
    in_bad = df.building_id.isin(bad_ids)
    df.loc[in_bad, "stories"] = df.loc[in_bad].groupby("building_id")["floor"].transform("max")


#material
#merge monolithic brick (5) into monolith (2) and move Stalin-era (6) to 5
material_map = {5: 2.0, 6: 5.0}
data["material"] = data["material"].replace(material_map)
data_test["material"] = data_test["material"].replace(material_map)

material_mode = data["material"].mode()[0]
data["material"] = data["material"].fillna(material_mode)
data_test["material"] = data_test["material"].fillna(material_mode)

#constructed, new:
data['constructed'] = data.apply(
    lambda row: 2019 if (np.isnan(row['constructed']) and ~np.isnan(row['new']) and row['new'] == 0.0) else( 2021 if (np.isnan(row['constructed'])) else row['constructed']),
    axis=1
)  
data['new'] = data.apply(
    lambda row: 0.0 if (np.isnan(row['new']) and row['constructed'] < 2020) else( 1.0 if (np.isnan(row['new'])) else row['new']),
    axis=1
)      


data_test['constructed'] = data_test.apply(
    lambda row: 2019 if (np.isnan(row['constructed']) and ~np.isnan(row['new']) and row['new'] == 0.0) else( 2021 if (np.isnan(row['constructed'])) else row['constructed']),
    axis=1
)  
data_test['new'] = data_test.apply(
    lambda row: 0.0 if (np.isnan(row['new']) and row['constructed'] < 2020) else( 1.0 if (np.isnan(row['new'])) else row['new']),
    axis=1
)  




#FEATURE ENGINEERING:

lon1 =  37.621390
lat1 = 55.753098
geodesic = pyproj.Geod(ellps='WGS84')

#Geod.inv accepts arrays, so each frame is processed in one call instead of a row-by-row loop
for df in (data, data_test):
    n = len(df)
    fwd_azimuth, back_azimuth, distance = geodesic.inv(
        np.full(n, lon1), np.full(n, lat1),
        df["longitude"].to_numpy(), df["latitude"].to_numpy())
    df['fwd_azi'] = fwd_azimuth
    df['distance'] = distance
    df['back_azi'] = back_azimuth




data['area_per_room'] = data['area_total']/data['rooms']
data_test['area_per_room'] = data_test['area_total']/data_test['rooms']

data['area_per_room_log'] = np.log1p(data['area_per_room'])
data_test['area_per_room_log'] = np.log1p(data_test['area_per_room'])

data['area_total_log'] = np.log1p(data['area_total'])
data['area_kitchen_log'] = np.log1p(data['area_kitchen'])
data['area_living_log'] = np.log1p(data['area_living'])

data_test['area_total_log'] = np.log1p(data_test['area_total'])
data_test['area_kitchen_log'] = np.log1p(data_test['area_kitchen'])
data_test['area_living_log'] = np.log1p(data_test['area_living'])


data["bathrooms_total"] = data.bathrooms_shared + data.bathrooms_private
data_test["bathrooms_total"] = data_test.bathrooms_shared + data_test.bathrooms_private




#floor/stories
data["floor/stories"] = data["floor"]/data["stories"]
data_test["floor/stories"] = data_test["floor"]/data_test["stories"]


#euclidean distance (in degrees) from the financial center
financial_coords = (37.535497858, 55.741330368)
distance_from_financial_center = np.sqrt((financial_coords[0] - data["longitude"])**2+(financial_coords[1] - data["latitude"])**2)
data["distance_from_financial_center"] = distance_from_financial_center

distance_from_financial_center_t = np.sqrt((financial_coords[0] - data_test["longitude"])**2+(financial_coords[1] - data_test["latitude"])**2)
data_test["distance_from_financial_center"] = distance_from_financial_center_t


#euclidean distance from city center
origin_coordinates = (37.621390,55.753098)
distance_from_city_center = np.sqrt((origin_coordinates[0] - data["longitude"])**2+(origin_coordinates[1] - data["latitude"])**2)
data["distance_from_city_center"] = distance_from_city_center

distance_from_city_center_t = np.sqrt((origin_coordinates[0] - data_test["longitude"])**2+(origin_coordinates[1] - data_test["latitude"])**2)
data_test["distance_from_city_center"] = distance_from_city_center_t

#FEATURES INCLUDED:
features = ["ceiling", "area_per_room" ,  "area_per_room_log", "rooms", "area_total", "area_kitchen", "area_living", "area_total_log", "area_kitchen_log", "area_living_log",
            "floor", "new", "elevatern", "bathrooms_total", "bathrooms_shared", "bathrooms_private", 'parking', 'heating',
            "latitude", "longitude","district", "constructed", "condition", "seller", "total_balconies", "material", "stories",'distance','back_azi','fwd_azi',"floor/stories",
           "distance_from_financial_center", "distance_from_city_center"]





#MODEL:
train_x = data[features]
train_y = np.log1p(data['price']/data["area_total"])
test_x = data_test[features]

param = {
        'base_score' : 0.5,
        'booster' : 'gbtree',
        'colsample_bylevel' : 1,
        'gamma' : 0,
        'max_delta_step' : 0,
        'n_jobs' : -1,
        'nthread' : None,
        'objective' : 'reg:squarederror',
        'scale_pos_weight' : 1,
        'seed' : None,
        'lambda': 0.0024064014952485785, 
         'alpha': 0.001541503784279617, 
        'colsample_bytree': 0.43152225018148443, 
       'subsample': 0.8078473020517652, 
       'learning_rate': 0.013367834721822036, 
       'n_estimators': 5235, 
     'random_state': 291, 
      'max_depth': 9, 
    'min_child_weight': 13
}



model_xgb3 = XGBRegressor(**param)

#model_xgb3.fit(train_x,train_y)
#xgb3_preds = model_xgb3.predict(test_x)


#Stacked Model Pipeline
train_x = data[features]
train_y = np.log1p(data['price']/data["area_total"])
test_x = data_test[features]



from mlxtend.regressor import StackingCVRegressor

stacked_model = StackingCVRegressor(regressors=(model_xgb1, model_xgb2, model_cat2, model_lgbm2,model_xgb3),
                                meta_regressor=model_xgb3, #our best individual model becomes the META
                                use_features_in_secondary=True,
                                   verbose=0)



#stacked_model.fit(train_x,train_y)
#stacked_preds = stacked_model.predict(test_x)


#Weighted Averaging/Blending Model Pipeline
# final_preds = np.average(
#     [np.expm1(xgb_preds),
#      np.expm1(xgb2_preds)*data_test["area_total"],
#      np.expm1(stacked_preds)*data_test["area_total"],
#      np.expm1(catboost2_preds),
#      np.expm1(lgbm2_preds)*data_test["area_total"],
#      np.expm1(xgb3_preds)*data_test["area_total"]
#     ],
#     weights = 1 / np.array([0.12887,  0.12631, 0.12,0.12492,0.12866,0.1225]) ** 6,  #Exponent is 4 by default; raised to 6 to weight the best models more heavily
#     axis=0
# )

#Submission

# Construct submission dataframe
# submission = pd.DataFrame()
# submission['id'] = data_test.id
# submission['price_prediction'] = final_preds # *0.99839    
# print(f'Generated {len(submission)} predictions')

# # Export submission to csv with headers
# submission.to_csv('submission.csv', index=False)

# # Look at submitted csv
# print('\nLine count of submission')
# !wc -l submission.csv

# print('\nFirst 10 lines of submission')
# !head -n 10 submission.csv


<a id="10.6"></a> <br>
## 10.6 Group KFolding Based on Building Split
This method is used to calculate the weights in the averaging method. Note that the models from section 10.5 are used here for demonstration. Cross-validation matters because it can reveal overfitting to the public part of the test set: if a model performs well both on the public test set and in cross-validation, the risk that it is merely overfitting the public test set is smaller. It is important to do this split based on `building_id`, so that apartments from the same building never end up in both the training and the validation fold. It is equally important to extract the cross-validation predictions, because the models were trained on different targets and their predictions must be converted back to prices before they can be compared.
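
To make the building-based split concrete, here is a minimal sketch on synthetic data. The `building_id` values, the toy feature matrix, and the fold count are assumptions for illustration only; the notebook itself uses 10 folds on the real training set.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)

# Hypothetical toy data: 12 apartments spread over 4 buildings
building_id = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
X = rng.normal(size=(12, 2))
y = rng.normal(size=12)

gkf = GroupKFold(n_splits=4)
leaks = []
for train_idx, valid_idx in gkf.split(X, y, groups=building_id):
    train_buildings = set(building_id[train_idx].tolist())
    valid_buildings = set(building_id[valid_idx].tolist())
    # Buildings that appear on both sides of the split (should be none)
    leaks.append(len(train_buildings & valid_buildings))
    print("validation buildings:", sorted(valid_buildings))

print("buildings leaking across folds:", sum(leaks))
```

Because every apartment of a building lands in the same fold, the validation score reflects performance on unseen buildings, which is closer to the situation in the test set.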

In [None]:
#Repeated random group-shuffle validation (split by building) - does not work very well but worth demonstrating
from sklearn.model_selection import KFold, cross_val_score, GroupShuffleSplit, cross_val_predict
def cv_rmsle(model):
    rmset = 0
    splits = 20
    #cross validates on 20 different random splits
    for i in range(splits):
        gs = GroupShuffleSplit(n_splits=1, test_size=.33)
        train_index, valid_index = next(gs.split(data, groups=data.building_id))

        #split() yields positional indices, so select rows with iloc
        data_train = data.iloc[train_index]
        data_valid = data.iloc[valid_index]

        train_x = data_train[features]
        train_y = np.log1p(data_train['price']/data_train["area_total"])
        test_x = data_valid[features]
        test_y = data_valid['price']


        model.fit(train_x,train_y)

        preds = model.predict(test_x)

        rmse = root_mean_squared_log_error(y_true=test_y, y_pred=np.expm1(preds)*test_x.area_total)

        rmset = rmset + rmse
    
    return rmset / splits

print("CV scores used to derive the blending weights:")
print("CROSS VALIDATED XGBOOST1:", cv_rmsle(model_xgb1))
print("CROSS VALIDATED XGBOOST2:", cv_rmsle(model_xgb2))
print("CROSS VALIDATED CATBOOST2:", cv_rmsle(model_cat2))
print("CROSS VALIDATED LGBM3:", cv_rmsle(model_lgbm3)) 
print("CROSS VALIDATED XGBOOST3:", cv_rmsle(model_xgb3)) 
print("CROSS VALIDATED CATBOOST3:", cv_rmsle(model_catb3)) 
#Group KFold Cross_validation based on building split
from sklearn.model_selection import GroupKFold, cross_val_score, cross_val_predict


kfolds = 10
gkf = GroupKFold( n_splits=kfolds)
groups = data.building_id

def group_cv(model, y, data=data[features], cv=gkf, groups=groups):
    #Out-of-fold predictions; all rows of a building stay in the same fold
    model_preds = cross_val_predict(model, X=data, y=y, cv=cv, groups=groups)
    return model_preds


print("LOGGING GROUP KFOLDING 10 SPLITS")

#uncomment to run
#xgb1_cv_preds = group_cv(model_xgb1, np.log1p(data.price))
#print("XGBOOST1 SCORE:", root_mean_squared_log_error(data.price, np.expm1(xgb1_cv_preds)))


#cat2_cv_preds = group_cv(model_cat2, np.log1p(data.price))
#print("CATBOOST2 SCORE:", root_mean_squared_log_error(data.price, np.expm1(cat2_cv_preds)))

#xgb2_cv_preds = group_cv(model_xgb2, np.log1p(data.price/data.area_total))
#print("XGBOOST2 SCORE:", root_mean_squared_log_error(data.price, np.expm1(xgb2_cv_preds)*data.area_total))

#lgbm3_cv_preds = group_cv(model_lgbm3, np.log1p(data.price/data.area_total))
#print("LGBM3 SCORE:", root_mean_squared_log_error(data.price, np.expm1(lgbm3_cv_preds)*data.area_total))

#xgb3_cv_preds = group_cv(model_xgb3, np.log1p(data.price/data.area_total))
#print("XGBOOST3 SCORE:", root_mean_squared_log_error(data.price, np.expm1(xgb3_cv_preds)*data.area_total))


#cat3_cv_preds = group_cv(model_catb3, np.log1p(data.price/data.area_total))
#print("CATBOOST3 SCORE:", root_mean_squared_log_error(data.price, np.expm1(cat3_cv_preds)*data.area_total))

Results from one possible run of the 10-fold Group KFold cross-validation (RMSLE, lower is better):

* CATBOOST2 SCORE: 0.19819805228618034
* CATBOOST3 SCORE: 0.19596734972841498
* XGBOOST1 SCORE : 0.1955884082590347
* XGBOOST3 SCORE : 0.1908350746986002
* LGBM3 SCORE : 0.18979521325245455
* XGBOOST2 SCORE : 0.18874589461384148

With the scores sorted from worst to best, the later models generally beat their earlier counterparts within each model family. XGBOOST3 and XGBOOST2 are close in cross-validation, with XGBOOST2 slightly ahead, even though XGBOOST3 performs better on the test set; this may hint at overfitting, but the gap is too small to draw a conclusion. CATBOOST3, however, does better in cross-validation than CATBOOST2, while CATBOOST2 performs better on the public part of the test set, which indicates that CATBOOST2 is overfitting to the public part of the test set.
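
The blending pipeline turns scores into weights with an inverse-power rule (`1 / score ** 6` in the commented `np.average` call). The sketch below applies that rule to the cross-validation scores listed above. The exponent of 6 comes from that pipeline; applying it to these particular scores and normalising the result are assumptions for illustration.

```python
import numpy as np

# CV scores (RMSLE) as listed above; lower is better
cv_scores = {
    "CATBOOST2": 0.19819805228618034,
    "CATBOOST3": 0.19596734972841498,
    "XGBOOST1": 0.1955884082590347,
    "XGBOOST3": 0.1908350746986002,
    "LGBM3": 0.18979521325245455,
    "XGBOOST2": 0.18874589461384148,
}

power = 6  # assumption: same exponent as in the blending pipeline
scores = np.array(list(cv_scores.values()))
raw_weights = 1.0 / scores ** power
weights = raw_weights / raw_weights.sum()  # normalise so the weights sum to 1

for name, w in zip(cv_scores, weights):
    print(f"{name:10s} weight = {w:.3f}")
```

`np.average` normalises the weights itself, so the explicit normalisation only makes the numbers easier to read. A larger exponent shifts more weight towards the best-scoring models.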

<a id="10.7"></a> <br>
## 10.7 Final Submission Pipeline 1

<a id="10.8"></a> <br>
## 10.8 Final Submission Pipeline 2

<a id="11"></a> <br>
# 11. Model Interpretation

<a id="11.1"></a> <br>
## 11.1 LIME Interpretation

The LIME explanations help us check the reasoning behind individual model decisions. Based on the earlier feature engineering, we expect distance from the city center and total area to be the most important features, so one of them should show up as the most influential feature in the explanations.
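
The idea behind LIME can be illustrated without the library: perturb one instance, query the model on the perturbations, and fit a proximity-weighted linear model whose coefficients act as local feature effects. The sketch below is a simplified stand-in, not the `lime` package's actual algorithm; the toy `black_box` function, the perturbation scale, and the kernel width are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(X):
    # Hypothetical model: strongly driven by feature 0, weakly by feature 1
    return 5.0 * X[:, 0] + 0.5 * X[:, 1] ** 2

def local_surrogate(predict, x, n_samples=2000, scale=0.1, kernel_width=0.25):
    # Sample perturbations around the instance x
    Z = x + rng.normal(scale=scale, size=(n_samples, x.size))
    y = predict(Z)
    # Weight each sample by its proximity to x (RBF kernel)
    dist = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(dist ** 2) / kernel_width ** 2)
    # Weighted least squares on [1, Z - x]; the coefficients are local slopes
    A = np.hstack([np.ones((n_samples, 1)), Z - x])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[0], coef[1:]

x0 = np.array([1.0, 2.0])
intercept, effects = local_surrogate(black_box, x0)
print("local intercept:", intercept)
print("local feature effects:", effects)
```

The fitted effects approximate the slope of the model around `x0`, which is the kind of per-instance ranking that LIME reports for each explained apartment.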

In [99]:
#CURRENTLY BEST MODEL - 22.10.2021 - 0.15831
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

pd.options.mode.chained_assignment = None

apartments = pd.read_csv('data/apartments_train.csv')
buildings = pd.read_csv('data/buildings_train.csv')
data = pd.merge(apartments, buildings.set_index('id'), how='left', left_on='building_id', right_index=True)

apartments_test = pd.read_csv('data/apartments_test.csv')
buildings_test = pd.read_csv('data/buildings_test.csv')
data_test = pd.merge(apartments_test, buildings_test.set_index('id'), how='left', left_on='building_id', right_index=True)


# Formation of Objective for Optuna
import pyproj
from numpy.random import choice

# Filling missing long lat in the Test set
#55.568139, 37.481831 - fixing nans

data_test.latitude.iloc[90] = 55.568139
data_test.longitude.iloc[90]= 37.481831



data_test.latitude.iloc[23] = 55.568139
data_test.longitude.iloc[23]= 37.481831


#55.544066, 37.482317 - Fixing negative numbers
data_test.latitude.iloc[2511] = 55.544066
data_test.longitude.iloc[2511]= 37.482317



data_test.latitude.iloc[6959] = 55.544066
data_test.longitude.iloc[6959]= 37.482317



data_test.latitude.iloc[5090] = 55.544066
data_test.longitude.iloc[5090]= 37.482317



data_test.latitude.iloc[8596] = 55.544066
data_test.longitude.iloc[8596]= 37.482317


#Blown up coordinates outside moscow fixed:
data_test.latitude.iloc[2529] = 55.764335
data_test.longitude.iloc[2529]= 37.907556



data_test.latitude.iloc[4719] = 55.765430
data_test.longitude.iloc[4719]= 37.928284



data_test.latitude.iloc[9547] = 55.765430
data_test.longitude.iloc[9547]= 37.928284

#fixing test_data districts:

data_test.district[data_test.building_id == 3803] = 11
data_test.district[data_test.building_id == 4636] = 11
data_test.district[data_test.building_id == 4412] = 11



data_test.district[data_test.building_id == 926] = 3
data_test.district[data_test.building_id == 4202] = 3
data_test.district[data_test.building_id == 8811] = 3
data_test.district[data_test.building_id == 6879] = 3
data_test.district[data_test.building_id == 5667] = 3



data_test.district[data_test.building_id == 2265] = 5
data_test.district[data_test.building_id == 6403] = 5
data_test.district[data_test.building_id == 7317] = 5
data_test.district[data_test.building_id == 1647] = 5
data_test.district[data_test.building_id == 183] = 5

# Fixing training data districts
data.district[data.building_id == 2029] = 0
data.district[data.building_id == 1255] = 0
data.district[data.building_id == 4162] = 5


# data[["area_total","area_kitchen","area_living","bathrooms_private", "bathrooms_shared","balconies","loggias"]][data.area_living + data.area_kitchen > data.area_total]

# Add a new feature of distance from center
#Adding city center as origin
#origin_coordinates = (37.6, 55.75)

# https://stackoverflow.com/questions/24617013/convert-latitude-and-longitude-to-x-and-y-grid-system-using-python
# data["dx"]= (lon1-data["longitude"])*40000*np.cos((lat1+data["latitude"])*np.pi/360)/360
# data["dy"] = (lat1-data["latitude"])*40000/360


lon1 =  37.621390
lat1 = 55.753098


geodesic = pyproj.Geod(ellps='WGS84')
distance_arr = []
back_azimuth_arr = []
fwd_azimuth_arr = []
for i in range(len(data["longitude"])):
    fwd_azimuth,back_azimuth,distance = geodesic.inv(lon1, lat1, data["longitude"][i], data["latitude"][i])
    distance_arr.append(distance)
    back_azimuth_arr.append(back_azimuth)
    fwd_azimuth_arr.append(fwd_azimuth)

data['fwd_azi'] = fwd_azimuth_arr
data['distance'] = distance_arr
data['back_azi'] = back_azimuth_arr


geodesic = pyproj.Geod(ellps='WGS84')
distance_arr = []
back_azimuth_arr = []
fwd_azimuth_arr = []
for i in range(len(data_test["longitude"])):
    fwd_azimuth,back_azimuth,distance = geodesic.inv(lon1, lat1, data_test["longitude"][i], data_test["latitude"][i])
    distance_arr.append(distance)
    back_azimuth_arr.append(back_azimuth)
    fwd_azimuth_arr.append(fwd_azimuth)

data_test['fwd_azi'] = fwd_azimuth_arr
data_test['distance'] = distance_arr
data_test['back_azi'] = back_azimuth_arr







# Combine the three elevator flags into one categorical feature for both train and test.
# Any combination not listed (including rows with missing flags) maps to NaN.
ELEVATOR_CODES = {
    # (elevator_without, elevator_passenger, elevator_service): code
    (1, 0, 0): 0,
    (1, 0, 1): 1,
    (1, 1, 0): 2,
    (0, 0, 1): 3,
    (0, 1, 0): 4,
    (0, 1, 1): 5,
    (1, 1, 1): 6,
}

def encode_elevator(row):
    key = (row["elevator_without"], row["elevator_passenger"], row["elevator_service"])
    return ELEVATOR_CODES.get(key, np.nan)

data['elevatern'] = data.apply(encode_elevator, axis=1)
data_test['elevatern'] = data_test.apply(encode_elevator, axis=1)

data["bathrooms_shared"] = data["bathrooms_shared"].fillna(1)
data["bathrooms_private"] = data["bathrooms_private"].fillna(1)
data["bathrooms_total"] = data.bathrooms_shared + data.bathrooms_private

data_test["bathrooms_shared"] = data_test["bathrooms_shared"].fillna(1)
data_test["bathrooms_private"] = data_test["bathrooms_private"].fillna(1)
data_test["bathrooms_total"] = data_test.bathrooms_shared + data_test.bathrooms_private


data["parking"] =  data["parking"].fillna(3.0)
data_test["parking"] =  data_test["parking"].fillna(3.0)

data["heating"] =  data["heating"].fillna(0.0)
data_test["heating"] =  data_test["heating"].fillna(0.0)


data["total_balconies"] = data["balconies"] + data["loggias"]
data_test["total_balconies"] = data_test["balconies"] + data_test["loggias"]


data["total_balconies"] =  data["total_balconies"].fillna(1.0)
data_test["total_balconies"] =  data_test["total_balconies"].fillna(1.0)



features = ["ceiling", "area_per_room" ,  "area_per_room_log", "rooms", "area_total", "area_kitchen", "area_living", "area_total_log", "area_kitchen_log", "area_living_log", "floor", "new", "elevatern", "bathrooms_total", "bathrooms_shared", "bathrooms_private", 'parking', 'heating',
            "latitude", "longitude","district", "constructed", "condition", "seller", "total_balconies", "material", "stories",'distance','back_azi','fwd_azi']


# Revisit
# We removed windows court and windows street as they did not seem to correlate with price,
# total_balconies == 0 instead of 1.
# 
for feature in features:
    if   feature == 'elevatern': #or feature == "elevator_service" or features == "condition" or feature == "constructed" or features == "material" or features == "seller"
        #print('Categorical',feature)
        mod = data[feature].mode()
        data[feature] = data[feature].fillna(mod[0])
        data_test[feature] = data_test[feature].fillna(mod[0])
    
    elif feature == 'seller':
        
        list_of_candidates = [0,1,2,3]
        # 14455 , owener 0, 
        probability_distribution  = [0.11, 0.33, 0.13, 0.43]
        number_of_items_to_pick = data['seller'].isna().sum()
        number_of_items_to_pick_test = data_test['seller'].isna().sum()

        np.random.seed(0)

        draw = choice(list_of_candidates, number_of_items_to_pick,
                      p=probability_distribution)
        draw_test = choice(list_of_candidates, number_of_items_to_pick_test,
                      p=probability_distribution)

        data['seller'][data.seller.isna()] = draw
        data_test['seller'][data_test.seller.isna()] = draw_test
        
    elif feature == 'area_kitchen' or feature == 'area_living':
        if feature == 'area_kitchen':
           
            percentage_area_data = pd.DataFrame()
            percentage_area_data["area_kitchen"] = data["area_kitchen"][data.area_living + data.area_kitchen < data.area_total]/data["area_total"][data.area_living + data.area_kitchen < data.area_total]
            percentage_area_data["area_living"] = data["area_living"][data.area_living + data.area_kitchen < data.area_total]/data["area_total"][data.area_living + data.area_kitchen < data.area_total]

            mean_kitchen = percentage_area_data["area_kitchen"].mean()
            mean_living = percentage_area_data["area_living"].mean()

            #to omit bugs
            data["area_kitchen_edit"] = data["area_kitchen"].copy()
            data["area_living_edit"] = data["area_living"].copy()

            data["area_kitchen_edit"][(data.area_living + data.area_kitchen >= data.area_total) | (data.area_living.isna() | data.area_kitchen.isna())] = data.area_total*mean_kitchen
            data["area_living_edit"][(data.area_living + data.area_kitchen >= data.area_total) | (data.area_living.isna() | data.area_kitchen.isna())] = data.area_total*mean_living

            data["area_kitchen"] = data["area_kitchen_edit"].copy()
            data["area_living"] = data["area_living_edit"].copy()

            #test_set
            data_test["area_kitchen_edit"] = data_test["area_kitchen"].copy()
            data_test["area_living_edit"] = data_test["area_living"].copy()

            data_test["area_kitchen_edit"][(data_test.area_living + data_test.area_kitchen >= data_test.area_total) | (data_test.area_living.isna() | data_test.area_kitchen.isna())] = data_test.area_total*mean_kitchen
            data_test["area_living_edit"][(data_test.area_living + data_test.area_kitchen >= data_test.area_total) | (data_test.area_living.isna() | data_test.area_kitchen.isna())] = data_test.area_total*mean_living


            data_test["area_kitchen"] = data_test["area_kitchen_edit"].copy()
            data_test["area_living"] = data_test["area_living_edit"].copy()
        
        else:
            pass
    elif feature == 'ceiling':
        maxc = 9
        minc = 1
        data['ceiling'] = data.apply(lambda row: data["ceiling"].mode()[0] if (row["ceiling"] < minc or row["ceiling"] > maxc ) else( row["ceiling"]) ,axis=1) 
        data_test['ceiling'] = data_test.apply(lambda row: data["ceiling"].mode()[0] if (row["ceiling"] < minc or row["ceiling"] > maxc ) else( row["ceiling"]) ,axis=1)  
        
        data['ceiling'][data.ceiling.isna() ] = data["ceiling"].mode()[0]
        data_test['ceiling'][data_test.ceiling.isna() ] = data["ceiling"].mode()[0]
        
    elif feature == 'condition' :
        var =  4.0
        data[feature] = data[feature].fillna(var)
        data_test[feature] = data_test[feature].fillna(var)

        
    elif feature == 'stories':
        
        idss = data[["building_id"]][data.floor > data.stories].sort_values("building_id").drop_duplicates()
        #data['storiesnew'] = data['stories'].copy()
        for i in range(idss.size):
            max_floor = data['floor'][data["building_id"] == idss["building_id"].iloc[i]].max()
            data['stories'][data["building_id"] == idss["building_id"].iloc[i]] =  max_floor
        
        idss_test = data_test[["building_id"]][data_test.floor > data_test.stories].sort_values("building_id").drop_duplicates()
        #data['storiesnew'] = data['stories'].copy()
        for i in range(idss_test.size):
            max_floor_test = data_test['floor'][data_test["building_id"] == idss_test["building_id"].iloc[i]].max()
            data_test['stories'][data_test["building_id"] == idss_test["building_id"].iloc[i]] =  max_floor_test
            
    elif feature == 'material':
        data.material[data.material==5] = 2.0 #merging monlith brick with monolith
        data.material[data.material==6] = 5.0 #stalin to 5
        
        data_test.material[data_test.material==5] = 2.0
        data_test.material[data_test.material==6] = 5.0
        
        data['material'][data.material.isna() ] = data['material'].mode()[0]
        data_test['material'][data_test.material.isna() ] = data['material'].mode()[0]

        
    elif feature == 'constructed' or feature == 'new':
        if feature == 'new':
            pass
        else:
            data['constructed'] = data.apply(
                lambda row: 2019 if (np.isnan(row['constructed']) and ~np.isnan(row['new']) and row['new'] == 0.0) else( 2021 if (np.isnan(row['constructed'])) else row['constructed']),
                axis=1
            )  
            data['new'] = data.apply(
                lambda row: 0.0 if (np.isnan(row['new']) and row['constructed'] < 2020) else( 1.0 if (np.isnan(row['new'])) else row['new']),
                axis=1
            )      
            
            
            data_test['constructed'] = data_test.apply(
                lambda row: 2019 if (np.isnan(row['constructed']) and ~np.isnan(row['new']) and row['new'] == 0.0) else( 2021 if (np.isnan(row['constructed'])) else row['constructed']),
                axis=1
            )  
            data_test['new'] = data_test.apply(
                lambda row: 0.0 if (np.isnan(row['new']) and row['constructed'] < 2020) else( 1.0 if (np.isnan(row['new'])) else row['new']),
                axis=1
            )     

    # 'condition' NaNs are already filled with a new class (4.0) in the branch above.
    # Alternatively they could be merged into 0.0, i.e. the undecorated class.
    else:
        pass
        #         mean = data[feature].mean()
#         #print('Not Categorical',feature)
#         data[feature] = data[feature].fillna(mean)
        
#         data_test[feature] = data_test[feature].fillna(mean)

data['area_per_room'] = data['area_total']/data['rooms']
data_test['area_per_room'] = data_test['area_total']/data_test['rooms']

data['area_per_room_log'] = np.log1p(data['area_per_room'])
data_test['area_per_room_log'] = np.log1p(data_test['area_per_room'])

data['area_total_log'] = np.log1p(data['area_total'])
data['area_kitchen_log'] = np.log1p(data['area_kitchen'])
data['area_living_log'] = np.log1p(data['area_living'])

data_test['area_total_log'] = np.log1p(data_test['area_total'])
data_test['area_kitchen_log'] = np.log1p(data_test['area_kitchen'])
data_test['area_living_log'] = np.log1p(data_test['area_living'])



# train_x = data[features]
# train_y = np.log1p(data['price']/data["area_total"])
# test_x = data_test[features]

model_xgb2 = XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=0.5728285533310635, gamma=0, learning_rate=0.016003653491882115, max_delta_step=0,
max_depth=8, min_child_weight=1, n_estimators=3515,
n_jobs=1, nthread=None, objective='reg:squarederror', random_state=5, # squarederror reg:squaredlogerror reg:squarederror
reg_alpha=0.005800171239325761, reg_lambda=0.48110648627756064, scale_pos_weight=1, seed=None, subsample=0.7155884414918227)


data_train, data_valid = model_selection.train_test_split(data, test_size=0.33, random_state = 0, stratify=np.log(data.price).round())
train_X = data_train[features]
train_y = np.log1p(data_train['price']/data_train["area_total"]) #data['price']/data["area_total"]
test_X = data_valid[features]
test_y = np.log1p(data_valid['price']/data_valid["area_total"])
        
model_xgb2.fit(train_X,train_y)

test_pred = model_xgb2.predict(test_X)
# final_preds = final_preds *data_test["area_total"]


errors = test_pred - test_y
errors = errors.to_numpy()
sorted_errors = np.argsort(abs(errors))

worse_5 = sorted_errors[-5:]
best_5 = sorted_errors[:5]

print(pd.DataFrame({'worse':errors[worse_5]}))
print()
print(pd.DataFrame({'best':errors[best_5]}))




LIME first creates an explainer object from the training data. We can then explain and visualize individual instances of the data set under consideration.

In [100]:
import lime
import lime.lime_tabular
train_Xnp = train_X.to_numpy()
explainer = lime.lime_tabular.LimeTabularExplainer(train_Xnp, feature_names=train_X.columns, class_names=['Price'], verbose=True, mode='regression')

i = worse_5[0]
print('Error =', errors[i])
test_Xnp = test_X.to_numpy()
exp = explainer.explain_instance(test_Xnp[i], model_xgb2.predict, num_features=10)
exp.show_in_notebook(show_table=True)

The graph above shows the explanation for one of the worst-predicted data points, with the most influential features listed in the table: distance, latitude and so on. It makes sense that the model relies on these features, and seeing them makes its decision more interpretable. The explanation also suggests that the prediction could be brought closer to the ground truth by increasing the weight of the distance feature.

In [101]:
i = worse_5[1]
print('Error =', errors[i])
exp = explainer.explain_instance(test_Xnp[i], model_xgb2.predict, num_features=10)
exp.show_in_notebook(show_table=True)

The graph above shows another of the worst-predicted data points. Here the most influential features are distance, parking and so on, and the interpretation is similar to that of the previous data point.

In [102]:
i = best_5[0]
print('Error =', errors[i])
exp = explainer.explain_instance(test_Xnp[i], model_xgb2.predict, num_features=10)
exp.show_in_notebook(show_table=True)

The graph above shows the explanation for one of the best-predicted data points.

In [103]:
i = best_5[1]
print('Error =', errors[i])
exp = explainer.explain_instance(test_Xnp[i], model_xgb2.predict, num_features=10)
exp.show_in_notebook(show_table=True)

Apart from LIME, we also look at the model's feature importance and find that the most important features are largely the ones we considered most important during the feature engineering and data analysis stages.

<a id="11.2"></a> <br>
## 11.2 Model Feature Importance

LIME gives an instance-level view of how much each feature contributes to a single prediction. We also compute feature importance to see how much each feature matters for the data set as a whole. 
The most important features are more or less the ones we considered most important during the feature engineering and data analysis stages. 
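
As a complement to the built-in `feature_importances_` of the tree model, a model-agnostic global importance can be estimated by permuting one feature at a time and measuring how much the error grows. The sketch below shows this permutation approach on synthetic data. It is a different technique from the importance plotted in this section, and the toy linear model and the feature names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: the target depends strongly on "distance", weakly on "floor"
n = 1000
X = rng.normal(size=(n, 3))
feature_names = ["distance", "floor", "noise"]
y = -3.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Fit a plain least-squares model as the stand-in predictor
A = np.hstack([X, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(X):
    return np.hstack([X, np.ones((len(X), 1))]) @ coef

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

baseline = rmse(y, predict(X))
importance = {}
for j, name in enumerate(feature_names):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the feature-target link
    importance[name] = rmse(y, predict(X_perm)) - baseline

for name, imp in sorted(importance.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:10s} increase in RMSE = {imp:.3f}")
```

Features whose permutation barely changes the error contribute little to the model as a whole, which gives a useful cross-check against the instance-level LIME explanations.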

In [104]:
col_sorted_by_importance=model_xgb2.feature_importances_.argsort()
feat_imp=pd.DataFrame({
    'cols':train_X.columns[col_sorted_by_importance],
    'imps':model_xgb2.feature_importances_[col_sorted_by_importance]
})

import plotly.express as px
px.bar(feat_imp, x='cols', y='imps')

<a id="12"></a> <br>
# 12. Second Final Submission - Short Notebook

In the following sections, we document and provide the code for the second final submission. This submission takes 40 minutes to run. 
The short notebook has the other final submission.

In [None]:
import json
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

np.random.seed(123)
sns.set_style('darkgrid')
pd.set_option('display.max_colwidth', None)

!ln -s /kaggle/input/moscow-housing-tdt4173 ./data
!ls ./data | sort

In [None]:
def root_mean_squared_log_error(y_true, y_pred):
    # Alternatively: sklearn.metrics.mean_squared_log_error(y_true, y_pred) ** 0.5
    assert (y_true >= 0).all() 
    assert (y_pred >= 0).all()
    log_error = np.log1p(y_pred) - np.log1p(y_true)  # Note: log1p(x) = log(1 + x)
    return np.mean(log_error ** 2) ** 0.5
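
To see how the metric interacts with the per-square-metre log target used in this notebook, the sketch below scores a fabricated prediction. The prices, areas, and prediction errors are made-up illustrative values, and the metric is redefined only so that the example runs on its own.

```python
import numpy as np

def root_mean_squared_log_error(y_true, y_pred):
    log_error = np.log1p(y_pred) - np.log1p(y_true)
    return np.mean(log_error ** 2) ** 0.5

# Hypothetical apartments: true prices and total areas
price = np.array([8_000_000.0, 12_500_000.0, 30_000_000.0])
area_total = np.array([40.0, 62.5, 120.0])

# A model trained on log1p(price / area_total) predicts in that transformed space;
# here its output is faked by adding a small error to the true target.
pred_log_ppsm = np.log1p(price / area_total) + np.array([0.05, -0.02, 0.10])

# Undo the transform: expm1 gives price per square metre, then scale by area
pred_price = np.expm1(pred_log_ppsm) * area_total

score = root_mean_squared_log_error(price, pred_price)
print("RMSLE on price:", score)
```

Models trained directly on `log1p(price)` only need the `expm1` step, which is why the blending code multiplies by `area_total` for some models and not for others.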

In [None]:
apartments = pd.read_csv('data/apartments_train.csv')
buildings = pd.read_csv('data/buildings_train.csv')
data = pd.merge(apartments, buildings.set_index('id'), how='left', left_on='building_id', right_index=True)

apartments_test = pd.read_csv('data/apartments_test.csv')
buildings_test = pd.read_csv('data/buildings_test.csv')
data_test = pd.merge(apartments_test, buildings_test.set_index('id'), how='left', left_on='building_id', right_index=True)

In [None]:
import pyproj
from numpy.random import choice


#FEATURE CLEANING
data_test.latitude.iloc[90] = 55.568139
data_test.longitude.iloc[90]= 37.481831
data_test.latitude.iloc[23] = 55.568139
data_test.longitude.iloc[23]= 37.481831
data_test.latitude.iloc[2511] = 55.544066
data_test.longitude.iloc[2511]= 37.482317
data_test.latitude.iloc[6959] = 55.544066
data_test.longitude.iloc[6959]= 37.482317
data_test.latitude.iloc[5090] = 55.544066
data_test.longitude.iloc[5090]= 37.482317
data_test.latitude.iloc[8596] = 55.544066
data_test.longitude.iloc[8596]= 37.482317
data_test.latitude.iloc[2529] = 55.764335
data_test.longitude.iloc[2529]= 37.907556
data_test.latitude.iloc[4719] = 55.765430
data_test.longitude.iloc[4719]= 37.928284
data_test.latitude.iloc[9547] = 55.765430
data_test.longitude.iloc[9547]= 37.928284

data_test.district[data_test.building_id == 3803] = 11
data_test.district[data_test.building_id == 4636] = 11
data_test.district[data_test.building_id == 4412] = 11
data_test.district[data_test.building_id == 926] = 3
data_test.district[data_test.building_id == 4202] = 3
data_test.district[data_test.building_id == 8811] = 3
data_test.district[data_test.building_id == 6879] = 3
data_test.district[data_test.building_id == 5667] = 3
data_test.district[data_test.building_id == 2265] = 5
data_test.district[data_test.building_id == 6403] = 5
data_test.district[data_test.building_id == 7317] = 5
data_test.district[data_test.building_id == 1647] = 5
data_test.district[data_test.building_id == 183] = 5

data.district[data.building_id == 2029] = 0
data.district[data.building_id == 1255] = 0
data.district[data.building_id == 4162] = 5


#cleaning/engineering all elevators as one feature
#any combination not listed (including rows with missing flags) maps to NaN
ELEVATOR_CODES = {
    #(elevator_without, elevator_passenger, elevator_service): code
    (1, 0, 0): 0,
    (1, 0, 1): 1,
    (1, 1, 0): 2,
    (0, 0, 1): 3,
    (0, 1, 0): 4,
    (0, 1, 1): 5,
    (1, 1, 1): 6,
}

def encode_elevator(row):
    key = (row["elevator_without"], row["elevator_passenger"], row["elevator_service"])
    return ELEVATOR_CODES.get(key, np.nan)

data['elevatern'] = data.apply(encode_elevator, axis=1)
data_test['elevatern'] = data_test.apply(encode_elevator, axis=1)
mod = data['elevatern'].mode()
data['elevatern'] = data['elevatern'].fillna(mod[0])
data_test['elevatern'] = data_test['elevatern'].fillna(mod[0])


data["bathrooms_shared"] = data["bathrooms_shared"].fillna(1)
data["bathrooms_private"] = data["bathrooms_private"].fillna(1)
data_test["bathrooms_shared"] = data_test["bathrooms_shared"].fillna(1)
data_test["bathrooms_private"] = data_test["bathrooms_private"].fillna(1)



data["parking"] =  data["parking"].fillna(3.0)
data_test["parking"] =  data_test["parking"].fillna(3.0)
data["heating"] =  data["heating"].fillna(0.0)
data_test["heating"] =  data_test["heating"].fillna(0.0)

#engineering and cleaning a feature
data["total_balconies"] = data["balconies"] + data["loggias"]
data_test["total_balconies"] = data_test["balconies"] + data_test["loggias"]
data["total_balconies"] =  data["total_balconies"].fillna(1.0)
data_test["total_balconies"] =  data_test["total_balconies"].fillna(1.0)


#seller
list_of_candidates = [0,1,2,3]
probability_distribution  = [0.11, 0.33, 0.13, 0.43]
number_of_items_to_pick = data['seller'].isna().sum()
number_of_items_to_pick_test = data_test['seller'].isna().sum()

np.random.seed(0)

draw = choice(list_of_candidates, number_of_items_to_pick,
              p=probability_distribution)
draw_test = choice(list_of_candidates, number_of_items_to_pick_test,
              p=probability_distribution)

# .loc avoids chained assignment, which may silently write to a temporary copy
data.loc[data['seller'].isna(), 'seller'] = draw
data_test.loc[data_test['seller'].isna(), 'seller'] = draw_test


#area_kitchen and living
# rows where kitchen + living area fits inside the total area are treated as consistent
valid_area = data.area_living + data.area_kitchen < data.area_total
mean_kitchen = (data.loc[valid_area, "area_kitchen"] / data.loc[valid_area, "area_total"]).mean()
mean_living = (data.loc[valid_area, "area_living"] / data.loc[valid_area, "area_total"]).mean()

# inconsistent or missing areas are replaced by the mean share of the total area
# (the shares come from the training set and are reused for the test set)
for df in (data, data_test):
    invalid_area = (df.area_living + df.area_kitchen >= df.area_total) | df.area_living.isna() | df.area_kitchen.isna()
    df.loc[invalid_area, "area_kitchen"] = df.loc[invalid_area, "area_total"] * mean_kitchen
    df.loc[invalid_area, "area_living"] = df.loc[invalid_area, "area_total"] * mean_living
        

#ceiling: values outside [minc, maxc] and NaN values are replaced by the training-set mode
maxc = 9
minc = 1
ceiling_mode = data["ceiling"].mode()[0]  # computed once instead of once per row
for df in (data, data_test):
    bad_ceiling = (df["ceiling"] < minc) | (df["ceiling"] > maxc) | df["ceiling"].isna()
    df.loc[bad_ceiling, "ceiling"] = ceiling_mode


#condition
var =  4.0
data['condition'] = data['condition'].fillna(var)
data_test['condition'] = data_test['condition'].fillna(var)


#stories: if a listing's floor exceeds the building's stories,
# set stories to the highest floor listed for that building
for df in (data, data_test):
    bad_building_ids = df.loc[df["floor"] > df["stories"], "building_id"].unique()
    for building_id in bad_building_ids:
        in_building = df["building_id"] == building_id
        df.loc[in_building, "stories"] = df.loc[in_building, "floor"].max()


#material
for df in (data, data_test):
    df.loc[df["material"] == 5, "material"] = 2.0  # merging monolith brick with monolith
    df.loc[df["material"] == 6, "material"] = 5.0  # stalin category is moved to code 5

material_mode = data["material"].mode()[0]
data["material"] = data["material"].fillna(material_mode)
data_test["material"] = data_test["material"].fillna(material_mode)

#constructed, new:
for df in (data, data_test):
    # missing construction year: 2019 if the listing is marked as not new, otherwise 2021
    df['constructed'] = df.apply(
        lambda row: (2019 if row['new'] == 0.0 else 2021) if np.isnan(row['constructed']) else row['constructed'],
        axis=1
    )
    # missing "new" flag: not new if constructed before 2020, otherwise new
    df['new'] = df.apply(
        lambda row: (0.0 if row['constructed'] < 2020 else 1.0) if np.isnan(row['new']) else row['new'],
        axis=1
    )




#FEATURE ENGINEERING:

lon1 = 37.621390
lat1 = 55.753098
geodesic = pyproj.Geod(ellps='WGS84')

# Geod.inv accepts arrays, so the forward azimuth, back azimuth and distance (in meters)
# from the city center (lon1, lat1) to every apartment are computed in one call per dataset
for df in (data, data_test):
    n = len(df)
    fwd_azimuth, back_azimuth, distance = geodesic.inv(
        np.full(n, lon1), np.full(n, lat1),
        df["longitude"].to_numpy(), df["latitude"].to_numpy())
    df['fwd_azi'] = fwd_azimuth
    df['distance'] = distance
    df['back_azi'] = back_azimuth




data['area_per_room'] = data['area_total']/data['rooms']
data_test['area_per_room'] = data_test['area_total']/data_test['rooms']

data['area_per_room_log'] = np.log1p(data['area_per_room'])
data_test['area_per_room_log'] = np.log1p(data_test['area_per_room'])

data['area_total_log'] = np.log1p(data['area_total'])
data['area_kitchen_log'] = np.log1p(data['area_kitchen'])
data['area_living_log'] = np.log1p(data['area_living'])

data_test['area_total_log'] = np.log1p(data_test['area_total'])
data_test['area_kitchen_log'] = np.log1p(data_test['area_kitchen'])
data_test['area_living_log'] = np.log1p(data_test['area_living'])


data["bathrooms_total"] = data.bathrooms_shared + data.bathrooms_private
data_test["bathrooms_total"] = data_test.bathrooms_shared + data_test.bathrooms_private






#floor/stories
data["floor/stories"] = data["floor"]/data["stories"]
data_test["floor/stories"] = data_test["floor"]/data_test["stories"]


#euclidean distance (in degrees of longitude/latitude) from the financial center
financial_coords = (37.535497858, 55.741330368)
distance_from_financial_center = np.sqrt((financial_coords[0] - data["longitude"])**2+(financial_coords[1] - data["latitude"])**2)
data["distance_from_financial_center"] = distance_from_financial_center

distance_from_financial_center_t = np.sqrt((financial_coords[0] - data_test["longitude"])**2+(financial_coords[1] - data_test["latitude"])**2)
data_test["distance_from_financial_center"] = distance_from_financial_center_t


#euclidean distance (in degrees of longitude/latitude) from the city center
origin_coordinates = (37.621390,55.753098)
distance_from_city_center = np.sqrt((origin_coordinates[0] - data["longitude"])**2+(origin_coordinates[1] - data["latitude"])**2)
data["distance_from_city_center"] = distance_from_city_center

distance_from_city_center_t = np.sqrt((origin_coordinates[0] - data_test["longitude"])**2+(origin_coordinates[1] - data_test["latitude"])**2)
data_test["distance_from_city_center"] = distance_from_city_center_t






#FEATURES INCLUDED:
features = ["ceiling", "area_per_room" ,  "area_per_room_log", "rooms", "area_total", "area_kitchen", "area_living", "area_total_log", "area_kitchen_log", "area_living_log",
            "floor", "new", "elevatern", "bathrooms_total", "bathrooms_shared", "bathrooms_private", 'parking', 'heating',
            "latitude", "longitude","district", "constructed", "condition", "seller", "total_balconies", "material", "stories",'distance','back_azi','fwd_azi',"floor/stories",
           "distance_from_financial_center", "distance_from_city_center"]

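The two Euclidean distance features in the cell above are computed directly on longitude/latitude degrees, whereas the `distance` feature from `pyproj` is a geodesic distance in meters. At Moscow's latitude one degree of longitude covers a much shorter stretch than one degree of latitude, so the degree-based features are not simply a rescaled version of the geodesic one. The sketch below illustrates this with the haversine formula. It is only an illustration: `haversine_m` is a hypothetical helper, and it assumes a spherical Earth with a mean radius of 6,371 km rather than the WGS84 ellipsoid used by `pyproj`.

```python
import numpy as np

def haversine_m(lon_a, lat_a, lon_b, lat_b, radius_m=6_371_000.0):
    """Great-circle distance in meters on a sphere (hypothetical helper, spherical approximation)."""
    lon_a, lat_a, lon_b, lat_b = map(np.radians, (lon_a, lat_a, lon_b, lat_b))
    h = (np.sin((lat_b - lat_a) / 2) ** 2
         + np.cos(lat_a) * np.cos(lat_b) * np.sin((lon_b - lon_a) / 2) ** 2)
    return 2 * radius_m * np.arcsin(np.sqrt(h))

# Coordinates taken from the feature engineering code: (longitude, latitude)
city_center = (37.621390, 55.753098)
financial_center = (37.535497858, 55.741330368)

# Distance between the two reference points, in meters
center_to_financial_m = haversine_m(*city_center, *financial_center)

# Length of one degree of longitude vs one degree of latitude near Moscow
one_deg_lon_m = haversine_m(37.0, 55.75, 38.0, 55.75)
one_deg_lat_m = haversine_m(37.0, 55.25, 37.0, 56.25)
lon_lat_ratio = one_deg_lon_m / one_deg_lat_m

print(f"center to financial center: {center_to_financial_m:.0f} m")
print(f"1 deg longitude / 1 deg latitude: {lon_lat_ratio:.2f}")
```

Tree-based models can still use the degree-based features, since they only need a consistent ordering along each split, but the ratio above is worth keeping in mind when the features are interpreted as distances.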

In [None]:
#GroupKfold cross validation based on building split
from sklearn.model_selection import GroupKFold, cross_val_score, cross_val_predict

kfolds = 10
gkf = GroupKFold( n_splits=kfolds)
groups = data.building_id

def group_cv(model, y, data=data[features], cv=gkf, groups=groups):
    # out-of-fold predictions; the cv argument is passed through instead of the global gkf
    model_preds = cross_val_predict(model, X=data, y=y, cv=cv, groups=groups)
    return model_preds
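
The grouped split above keeps all apartments from one building in the same fold, so a model is always validated on buildings it has not seen during training. A minimal sketch on toy data shows the property that `GroupKFold` guarantees. The arrays and the names `X_toy`, `y_toy` and `building_ids` are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data (hypothetical): 8 apartments spread over 4 buildings
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.arange(8, dtype=float)
building_ids = np.array([10, 10, 11, 11, 12, 12, 13, 13])

gkf_toy = GroupKFold(n_splits=4)
shared_buildings = []
for train_idx, val_idx in gkf_toy.split(X_toy, y_toy, groups=building_ids):
    # Buildings that appear on both sides of the split would be a leak
    overlap = set(building_ids[train_idx]) & set(building_ids[val_idx])
    shared_buildings.append(overlap)

no_leak = all(len(overlap) == 0 for overlap in shared_buildings)
print("no building shared between train and validation:", no_leak)
```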

In [None]:
#CROSS VALIDATION IS DONE HERE:
#LGBM
from lightgbm import LGBMRegressor
best_params = {
    'objective': 'regression',
    "metric": "root_mean_squared_error",
    'random_state': 2020,
    'boosting_type': 'gbdt', #better than dart
    "n_jobs": -1,
    'learning_rate': 0.009902216010560466,
    # note: num_iterations is a LightGBM alias of n_estimators, so these two settings overlap
    'num_iterations': 9853,
    'n_estimators': 2200,
    'max_bin': 1145,
    'num_leaves': 992,
    'min_data_in_leaf': 21,
    'min_sum_hessian_in_leaf': 6,
    'bagging_fraction': 0.7553160099162841,
    'bagging_freq': 1,
    'max_depth': 5,
    'lambda_l1': 0.001047756084491848,
    'lambda_l2': 0.5231817241800534,
    'min_gain_to_split': 0.01715842845568677
}
model_lgbm3 = LGBMRegressor(**best_params)



#Catboost
from catboost import CatBoostRegressor
param = {
    "objective": "RMSE",
    'depth': 8,
    'reg_lambda': 0.6424630162452156,
    'learning_rate': 0.008856338969505724,
    'n_estimators': 5356,
    'max_bin': 1042,
    'random_state': 1695,
    'subsample': 0.4474582804576312,
    "verbose": False}
model_catb3 = CatBoostRegressor(**param)  



#Xgboost
from xgboost import XGBRegressor
param = {
    'base_score': 0.5,
    'booster': 'gbtree',
    'colsample_bylevel': 1,
    'gamma': 0,
    'max_delta_step': 0,
    'n_jobs': -1,
    'nthread': None,
    'objective': 'reg:squarederror',
    'scale_pos_weight': 1,
    'seed': None,
    'lambda': 0.0024064014952485785,
    'alpha': 0.001541503784279617,
    'colsample_bytree': 0.43152225018148443,
    'subsample': 0.8078473020517652,
    'learning_rate': 0.013367834721822036,
    'n_estimators': 5235,
    'random_state': 291,
    'max_depth': 9,
    'min_child_weight': 13
}
model_xgb3 = XGBRegressor(**param)



#From GroupKFolding on 10 ksplits on building_id
#LGBM3 SCORE: 0.18979521325245455
#XGBOOST3 SCORE: 0.1908350746986002
#CATBOOST3 SCORE: 0.19596734972841498

#comment these out when submitting:
#print("LOGGING GROUP KFOLDING 10 SPLITS")

#lgbm3_cv_preds = group_cv(model_lgbm3, np.log1p(data.price/data.area_total))
#print("LGBM3 SCORE:", root_mean_squared_log_error(data.price, np.expm1(lgbm3_cv_preds)*data.area_total))

#xgb3_cv_preds = group_cv(model_xgb3, np.log1p(data.price/data.area_total))
#print("XGBOOST3 SCORE:", root_mean_squared_log_error(data.price, np.expm1(xgb3_cv_preds)*data.area_total))

#cat3_cv_preds = group_cv(model_catb3, np.log1p(data.price/data.area_total))
#print("CATBOOST3 SCORE:", root_mean_squared_log_error(data.price, np.expm1(cat3_cv_preds)*data.area_total))
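
All three models are trained on the log of the price per square meter, `np.log1p(price / area_total)`, and the predictions are mapped back with `np.expm1(prediction) * area_total` before scoring. The sketch below checks that this round trip recovers the original prices. The prices and areas are invented for illustration, and `rmsle` is a hypothetical stand-in that assumes the usual definition of the root mean squared logarithmic error; it is not necessarily identical to the `root_mean_squared_log_error` helper used in the notebook.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error (assumed standard definition)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Hypothetical prices and total areas (square meters)
price = np.array([7_500_000.0, 12_000_000.0, 30_000_000.0])
area_total = np.array([38.0, 55.0, 120.0])

# Forward transform: log1p of the price per square meter (the training target)
target = np.log1p(price / area_total)

# Inverse transform: expm1 gives back the price per square meter, then scale by area
recovered_price = np.expm1(target) * area_total

round_trip_ok = bool(np.allclose(recovered_price, price))
perfect_score = rmsle(price, recovered_price)
print(round_trip_ok, perfect_score)
```

Predicting the price per square meter removes most of the dependence on apartment size from the target, and the log transform matches the logarithmic error used for scoring.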
    

In [None]:
#STACKED MODEL
train_x = data[features]
train_y = np.log1p(data['price']/data["area_total"])
test_x = data_test[features]



from mlxtend.regressor import StackingCVRegressor
stacked_model = StackingCVRegressor(regressors=(model_lgbm3, model_catb3, model_xgb3),
                                    meta_regressor=model_xgb3, #XGBoost as meta-regressor (LGBM had the lowest CV error of the three above)
                                    use_features_in_secondary=True,
                                    verbose=0)



stacked_model.fit(train_x,train_y)
stacked_preds = stacked_model.predict(test_x)

final_preds = np.expm1(stacked_preds)*data_test["area_total"]


#submission
submission = pd.DataFrame()
submission['id'] = data_test.id
submission['price_prediction'] = final_preds     
print(f'Generated {len(submission)} predictions')

# Export submission to csv with headers
submission.to_csv('submission.csv', index=False)

# Look at submitted csv
print('\nLine count of submission')
!wc -l submission.csv

print('\nFirst 10 lines of submission')
!head -n 10 submission.csv

<a id="13"></a> <br>
# 13. Conclusion

In conclusion, domain knowledge is a very important starting point for any machine learning project. Cleaning features and generating new ones should be motivated by domain knowledge, as in our experience this has led to better performance for similar models. For this project, it suggests that a data-driven approach has greater value than a model-driven approach: investigating the dataset, checking whether it is consistent, and understanding the patterns in the data gives a better understanding both of how the data was generated and of how it needs to be altered so that a model can learn from it. In terms of modeling, the state-of-the-art methods XGBoost, LightGBM and CatBoost gave the best performance, and they performed even better after tuning with an Optuna optimization pipeline.