## Lab 1.3 - Predicting Real Estate Data in St. Petersburg
We have data from Yandex.Realty classified https://realty.yandex.ru containing real estate listings for apartments in St. Petersburg and Leningrad Oblast from 2016 till the middle of August 2018. In this Lab you'll learn how to apply machine learning algorithms to solve business problems. Accurate price prediction can help to find fraudsters automatically and help Yandex.Realty users to make better decisions when buying and selling real estate.

Using python with machine learning algotithms is the #1 option for prototyping solutions among data scientists today. We'll take a look at it in this lab.

### Main objectives
After successful completion of the lab work students will be able to:
-	Apply machine learning for solving price prediction problem
-   Calculate metrics which can help us find out whether our machine learning model is ready for production

### Tasks
-	Encode dataset
-	Split dataset to train and validation datasets
-	Apply decision tree algorithm to build ML (machine learning) model for price predictions
-   Calculate metrics
-   Try other algorithms and factors to get a better solution 


### 1. Load data with real estate prices

In [1]:
# let's import pandas library and set options to be able to view data right in the browser
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stats
import matplotlib.style as style
from matplotlib import pyplot as plt
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 1000)
style.use('fivethirtyeight')


**Data preparation**

In [2]:
rent_df_cleaned = pd.read_csv('cleaned_dataset.csv')

In [3]:
rent_df_cleaned.head()

Unnamed: 0,first_day_exposition,last_day_exposition,last_price,floor,open_plan,rooms,studio,area,kitchen_area,living_area,renovation,building_id,last_price_log
0,2015-01-24T00:00:00+03:00,2016-01-19T00:00:00+03:00,20000.0,7,0,1,0,28.0,8.0,20.0,3.0,6719200042030752141,9.903488
1,2015-11-17T00:00:00+03:00,2016-03-04T00:00:00+03:00,24000.0,4,0,2,0,59.0,10.0,35.0,3.0,6719200042030752141,10.085809
2,2015-11-17T00:00:00+03:00,2016-04-24T00:00:00+03:00,18000.0,6,0,1,0,36.0,9.0,17.0,3.0,6719200042030752141,9.798127
3,2016-02-04T00:00:00+03:00,2016-02-28T00:00:00+03:00,18000.0,8,0,1,0,39.0,10.0,18.0,0.0,6719200042030752141,9.798127
4,2016-02-28T00:00:00+03:00,2016-04-02T00:00:00+03:00,19000.0,3,0,1,0,36.0,8.0,18.0,11.0,6719200042030752141,9.852194


In [4]:
rent_df_cleaned.describe()

Unnamed: 0,last_price,floor,open_plan,rooms,studio,area,kitchen_area,living_area,renovation,building_id,last_price_log
count,155392.0,155392.0,155392.0,155392.0,155392.0,155392.0,118364.0,123015.0,155392.0,155392.0,155392.0
mean,31889.961349,6.698498,5.1e-05,1.599297,0.063768,53.358246,11.597837,30.419019,2.442565,4.566682e+18,10.24364
std,20303.545282,4.978305,0.007175,0.87241,0.24434,24.09781,82.238109,17.405261,4.047309,2.696169e+18,0.460827
min,9990.0,1.0,0.0,0.0,0.0,11.0,2.0,2.0,0.0,458573000000000.0,9.20934
25%,20000.0,3.0,0.0,1.0,0.0,38.0,8.0,18.0,0.0,2.15354e+18,9.903488
50%,25000.0,5.0,0.0,1.0,0.0,46.0,10.0,27.0,0.0,4.69763e+18,10.126631
75%,35000.0,9.0,0.0,2.0,0.0,62.0,12.0,37.0,3.0,6.829723e+18,10.463103
max,199000.0,92.0,1.0,5.0,1.0,200.0,25000.0,2015.0,11.0,9.220297e+18,12.20106


In [5]:
rent_df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155392 entries, 0 to 155391
Data columns (total 13 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   first_day_exposition  155392 non-null  object 
 1   last_day_exposition   155392 non-null  object 
 2   last_price            155392 non-null  float64
 3   floor                 155392 non-null  int64  
 4   open_plan             155392 non-null  int64  
 5   rooms                 155392 non-null  int64  
 6   studio                155392 non-null  int64  
 7   area                  155392 non-null  float64
 8   kitchen_area          118364 non-null  float64
 9   living_area           123015 non-null  float64
 10  renovation            155392 non-null  float64
 11  building_id           155392 non-null  int64  
 12  last_price_log        155392 non-null  float64
dtypes: float64(6), int64(5), object(2)
memory usage: 15.4+ MB


In [6]:
rent_df_cleaned['renovation'] = rent_df_cleaned['renovation'].astype(int)

In [7]:
rent_df_cleaned.head()

Unnamed: 0,first_day_exposition,last_day_exposition,last_price,floor,open_plan,rooms,studio,area,kitchen_area,living_area,renovation,building_id,last_price_log
0,2015-01-24T00:00:00+03:00,2016-01-19T00:00:00+03:00,20000.0,7,0,1,0,28.0,8.0,20.0,3,6719200042030752141,9.903488
1,2015-11-17T00:00:00+03:00,2016-03-04T00:00:00+03:00,24000.0,4,0,2,0,59.0,10.0,35.0,3,6719200042030752141,10.085809
2,2015-11-17T00:00:00+03:00,2016-04-24T00:00:00+03:00,18000.0,6,0,1,0,36.0,9.0,17.0,3,6719200042030752141,9.798127
3,2016-02-04T00:00:00+03:00,2016-02-28T00:00:00+03:00,18000.0,8,0,1,0,39.0,10.0,18.0,0,6719200042030752141,9.798127
4,2016-02-28T00:00:00+03:00,2016-04-02T00:00:00+03:00,19000.0,3,0,1,0,36.0,8.0,18.0,11,6719200042030752141,9.852194


In [8]:
rent_df_cleaned = rent_df_cleaned.drop(['first_day_exposition', 'last_day_exposition', 'floor', 'studio' ,'kitchen_area', 'living_area','building_id'], axis=1)

In [9]:
rent_df_cleaned.head()

Unnamed: 0,last_price,open_plan,rooms,area,renovation,last_price_log
0,20000.0,0,1,28.0,3,9.903488
1,24000.0,0,2,59.0,3,10.085809
2,18000.0,0,1,36.0,3,9.798127
3,18000.0,0,1,39.0,0,9.798127
4,19000.0,0,1,36.0,11,9.852194


In [12]:
x = rent_df_cleaned.drop(['last_price', 'last_price_log'], axis=1)
y = rent_df_cleaned['last_price']

In [13]:
from sklearn.preprocessing import StandardScaler

In [15]:
from sklearn.model_selection import train_test_split

x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2, random_state=42)

In [16]:
x.shape , y.shape

((155392, 4), (155392,))

In [17]:
y_train= y_train.values.reshape(-1,1)
y_val= y_val.values.reshape(-1,1)

In [18]:
sc_x = StandardScaler()
sc_y = StandardScaler()

x_train = sc_x.fit_transform(x_train)
x_val = sc_x.fit_transform(x_val)
y_train = sc_y.fit_transform(y_train)
y_val = sc_y.fit_transform(y_val)

In [19]:
x_train.shape, y_train.shape

((124313, 4), (124313, 1))

In [20]:
y_train = y_train.ravel()
y_val = y_val.ravel()

**Models creation**

In [21]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn import metrics
import numpy as np

Decision Tree model

In [22]:
dt = DecisionTreeRegressor(random_state=42, max_depth=10, min_samples_leaf=20, max_features=22)
dt.fit(x_train, y_train)

predict_dt = dt.predict(x_val)

print('MAE:', metrics.mean_absolute_error(y_val, predict_dt))
print('MSE:', metrics.mean_squared_error(y_val, predict_dt))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_val, predict_dt)))

MAE: 0.3592829865304456
MSE: 0.3494678645398351
RMSE: 0.591158070688234


Random forest

In [23]:
rf = RandomForestRegressor(random_state=42, n_estimators=100, max_depth = 20, min_samples_split = 4, min_samples_leaf =8)

rf.fit(x_train, y_train)

predict_rf = rf.predict(x_val)

print('MAE:', metrics.mean_absolute_error(y_val, predict_rf))
print('MSE:', metrics.mean_squared_error(y_val, predict_rf))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_val, predict_rf)))

MAE: 0.35975153450258196
MSE: 0.34820278374307556
RMSE: 0.5900870984380827


Gradient boosting

In [24]:
gb = GradientBoostingRegressor(random_state=42, n_estimators=100, max_depth = 3, min_samples_split = 2, min_samples_leaf =1)

gb.fit(x_train, y_train)

predict_gb = gb.predict(x_val)

print('MAE:', metrics.mean_absolute_error(y_val, predict_gb))
print('MSE:', metrics.mean_squared_error(y_val, predict_gb))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_val, predict_gb)))

MAE: 0.35952524012846615
MSE: 0.34581558655370054
RMSE: 0.5880608697691937


In [25]:
import joblib

model_file = 'model1.pkl'
scaler_x = 'scaler_x1.pkl'
scaler_y = 'scaler_y1.pkl'
joblib.dump(sc_x, scaler_x)
joblib.dump(sc_y, scaler_y)
joblib.dump(gb, model_file)

['model1.pkl']