## Lab 1.3 - Predicting Real Estate Data in St. Petersburg
We have data from Yandex.Realty classified https://realty.yandex.ru containing real estate listings for apartments in St. Petersburg and Leningrad Oblast from 2016 till the middle of August 2018. In this Lab you'll learn how to apply machine learning algorithms to solve business problems. Accurate price prediction can help to find fraudsters automatically and help Yandex.Realty users to make better decisions when buying and selling real estate.

Using python with machine learning algotithms is the #1 option for prototyping solutions among data scientists today. We'll take a look at it in this lab.

### Main objectives
After successful completion of the lab work students will be able to:
-	Apply machine learning for solving price prediction problem
-   Calculate metrics which can help us find out whether our machine learning model is ready for production

### Tasks
-	Encode dataset
-	Split dataset to train and validation datasets
-	Apply decision tree algorithm to build ML (machine learning) model for price predictions
-   Calculate metrics
-   Try other algorithms and factors to get a better solution 


### 1. Load data with real estate prices

In [52]:
# let's import pandas library and set options to be able to view data right in the browser
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import matplotlib.style as style
from matplotlib import pyplot as plt
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 1000)
style.use('fivethirtyeight')

In [59]:
rent_df_cleaned = pd.read_csv('cleaned_dataset.csv')

In [60]:
rent_df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155391 entries, 0 to 155390
Data columns (total 8 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   first_day_exposition  155391 non-null  object 
 1   last_day_exposition   155391 non-null  object 
 2   last_price            155391 non-null  float64
 3   open_plan             155391 non-null  int64  
 4   rooms                 155391 non-null  int64  
 5   area                  155391 non-null  float64
 6   renovation            155391 non-null  float64
 7   last_price_log        155391 non-null  float64
dtypes: float64(4), int64(2), object(2)
memory usage: 9.5+ MB


In [58]:
rent_df_cleaned.drop(columns=['first_day_exposition', 'last_day_exposition', 'last_price_log'], inplace=True)

In [21]:
rent_df_cleaned['renovation'].value_counts()

0.0     102910
3.0      16407
11.0     14076
10.0     13876
7.0       3882
1.0       3125
6.0        698
8.0        225
2.0        159
4.0         32
5.0          1
Name: renovation, dtype: int64

Let's create a new feature:

In [22]:
rent_df_cleaned['length_exposition'] = (pd.to_datetime(rent_df_cleaned.last_day_exposition) - pd.to_datetime(rent_df_cleaned.first_day_exposition)).dt.days

In [23]:
rent_df_cleaned.length_exposition.value_counts()

7       14712
1       10387
30       6307
0        4670
6        4534
        ...  
1340        1
1346        1
1331        1
1326        1
1332        1
Name: length_exposition, Length: 1349, dtype: int64

Let's encode 'open_plan', 'rooms', 'renovation' with dummy variables

In [24]:
rent_df_cleaned = pd.get_dummies(rent_df_cleaned, columns=['open_plan', 'rooms', 'renovation'])

In [25]:
rent_df_cleaned.head()

Unnamed: 0,first_day_exposition,last_day_exposition,last_price,area,last_price_log,length_exposition,open_plan_0,open_plan_1,rooms_0,rooms_1,rooms_2,rooms_3,rooms_4,rooms_5,renovation_0.0,renovation_1.0,renovation_2.0,renovation_3.0,renovation_4.0,renovation_5.0,renovation_6.0,renovation_7.0,renovation_8.0,renovation_10.0,renovation_11.0
0,2015-01-24T00:00:00+03:00,2016-01-19T00:00:00+03:00,20000.0,28.0,9.903488,360,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,2015-11-17T00:00:00+03:00,2016-03-04T00:00:00+03:00,24000.0,59.0,10.085809,108,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0
2,2015-11-17T00:00:00+03:00,2016-04-24T00:00:00+03:00,18000.0,36.0,9.798127,159,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
3,2016-02-04T00:00:00+03:00,2016-02-28T00:00:00+03:00,18000.0,39.0,9.798127,24,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
4,2016-02-28T00:00:00+03:00,2016-04-02T00:00:00+03:00,19000.0,36.0,9.852194,34,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


Let's split to train and test samples as 9:1

In [26]:
train_df = rent_df_cleaned[(rent_df_cleaned.first_day_exposition >= '2018-01-01') 
                          & (rent_df_cleaned.first_day_exposition < '2018-04-01')]

# select all offers added in april and may 2018 as test dataset.
test_df = rent_df_cleaned[(rent_df_cleaned.first_day_exposition >= '2018-04-01') 
                          & (rent_df_cleaned.first_day_exposition < '2018-06-01')]

# let's use latest data from 2018-06-01 as a hodout dataset to simulate how algorithms would
# behave in production
holdout_df = rent_df_cleaned[rent_df_cleaned.first_day_exposition >= '2018-06-01']

In [27]:
train_df.drop(['first_day_exposition', 'last_day_exposition', 'last_price_log'], axis=1, inplace=True)
test_df.drop(['first_day_exposition', 'last_day_exposition', 'last_price_log'], axis=1, inplace=True)
holdout_df.drop(['first_day_exposition', 'last_day_exposition', 'last_price_log'], axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


Let's scale samples

In [28]:
from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
sc_y = StandardScaler()

X_train = train_df.drop('last_price', axis=1)
y_train = train_df['last_price']
X_valid = test_df.drop('last_price', axis=1)
y_valid = test_df['last_price']
X_test = holdout_df.drop('last_price', axis=1)
y_test = holdout_df['last_price']
y_train= y_train.values.reshape(-1,1)
y_valid= y_valid.values.reshape(-1,1)
y_test = y_test.values.reshape(-1,1)

X_train = sc_X.fit_transform(X_train)
X_valid = sc_X.fit_transform(X_valid)
X_test = sc_X.fit_transform(X_test)
y_train = sc_y.fit_transform(y_train)
y_valid = sc_y.fit_transform(y_valid)
y_test = sc_y.fit_transform(y_test)

In [29]:
train_df

Unnamed: 0,last_price,area,length_exposition,open_plan_0,open_plan_1,rooms_0,rooms_1,rooms_2,rooms_3,rooms_4,rooms_5,renovation_0.0,renovation_1.0,renovation_2.0,renovation_3.0,renovation_4.0,renovation_5.0,renovation_6.0,renovation_7.0,renovation_8.0,renovation_10.0,renovation_11.0
8,26000.0,36.0,11,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
24,17500.0,32.0,39,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
25,16000.0,38.0,22,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
26,22000.0,32.0,30,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
27,20000.0,32.0,4,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
154884,21000.0,46.0,143,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
154921,24000.0,50.0,176,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1
155028,28000.0,66.0,165,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
155079,25000.0,52.0,132,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0


And build a decision tree algorithm with use of tuning hyperparametres

In [30]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
import numpy as np

In [31]:
first_tree = DecisionTreeRegressor(random_state=17)
tree_params = {'max_depth': np.arange(1, 20, 1),
              'min_samples_leaf': np.arange(1, 20, 2),
              'max_features': [.85, .9, .95, .99, 1]}
tree_grid = GridSearchCV(first_tree, tree_params, cv = 5, n_jobs=-1)
%time
tree_grid.fit(X_train, y_train);

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.25 µs


Print the best hyperparametres after tuning

In [32]:
tree_grid.best_score_, tree_grid.best_params_

(0.5887354331556119,
 {'max_depth': 7, 'max_features': 0.85, 'min_samples_leaf': 17})

In [33]:
from sklearn.metrics import mean_absolute_percentage_error as mape

In [38]:
best_tree=tree_grid.best_estimator_

In [40]:
best_tree.fit(X_train, y_train)
y_pred_valid = sc_y.inverse_transform(best_tree.predict(X_valid))
y_pred = sc_y.inverse_transform(best_tree.predict(X_test))
mape_1=mape(sc_y.inverse_transform(y_valid), y_pred_valid)
mape_2=mape(sc_y.inverse_transform(y_test), y_pred)
print("MAPE_valid = {:.2f}".format((mape_1*100)))
print("MAPE_test = {:.2f}".format((mape_2*100)))

MAPE_valid = 21.84
MAPE_test = 24.25


### _**Here I created pkl file used in a service building**_

In [37]:
import joblib

In [41]:
model_file = 'model.pkl'
scaler_x = 'scaler_x.pkl'
scaler_y = 'scaler_y.pkl'
joblib.dump(sc_X, scaler_x)
joblib.dump(sc_y, scaler_y)
joblib.dump(best_tree, model_file)

['model.pkl']

### Self-control stops
1. Compete with other teams to create the best solution. You can play with factors and algorithm parameters to come up with it.

Another approach with regularized regression model

In [21]:
import numpy as np
from tqdm import tqdm
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import mean_absolute_percentage_error

In [16]:
param_grid_log = dict(alpha=np.logspace(start=-3, stop=0, num=5).tolist(), 
                      l1=np.logspace(start=-3, stop=0, num=5).tolist())
param_grid_log

{'alpha': [0.001,
  0.005623413251903491,
  0.03162277660168379,
  0.1778279410038923,
  1.0],
 'l1': [0.001,
  0.005623413251903491,
  0.03162277660168379,
  0.1778279410038923,
  1.0]}

In [23]:
results = []

for params in tqdm(ParameterGrid(param_grid_log)):
    
    pipe_reg = Pipeline(steps = [('en', ElasticNet(alpha=params['alpha'], l1_ratio=params['l1']))])
    
    pipe_reg.fit(X_train, y_train)
   
    results.append(dict(params=params,
                        train_score=mean_absolute_percentage_error(y_train,
                                                       pipe_reg.predict(X_train)),
        
                        test_score=mean_absolute_percentage_error(y_test,
                                                      pipe_reg.predict(X_test))
    ))

  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
100%|██████████| 25/25 [02:26<00:00,  5.86s/it]


In [24]:
pd.DataFrame(results).sort_values('test_score').style.bar(subset=['train_score','test_score'], vmin=0)

Unnamed: 0,params,train_score,test_score
10,"{'alpha': 0.03162277660168379, 'l1': 0.001}",0.232948,0.234501
11,"{'alpha': 0.03162277660168379, 'l1': 0.005623413251903491}",0.232949,0.234503
12,"{'alpha': 0.03162277660168379, 'l1': 0.03162277660168379}",0.232956,0.234514
13,"{'alpha': 0.03162277660168379, 'l1': 0.1778279410038923}",0.233007,0.234592
5,"{'alpha': 0.005623413251903491, 'l1': 0.001}",0.233528,0.235246
6,"{'alpha': 0.005623413251903491, 'l1': 0.005623413251903491}",0.233529,0.235247
7,"{'alpha': 0.005623413251903491, 'l1': 0.03162277660168379}",0.233535,0.235254
8,"{'alpha': 0.005623413251903491, 'l1': 0.1778279410038923}",0.233567,0.235292
0,"{'alpha': 0.001, 'l1': 0.001}",0.233724,0.235477
1,"{'alpha': 0.001, 'l1': 0.005623413251903491}",0.233724,0.235477


According to MAPE, the best linear model is:

In [25]:
best = {'alpha': 0.03162277660168379, 'l1': 0.001}

In [26]:
final_model = Pipeline(steps = [('en', ElasticNet(alpha=best['alpha'], l1_ratio=best['l1']))])
final_model.fit(X_train, y_train)
best_pred = final_model.predict(X_test)
best_pred

array([35349.30598362, 33467.99171747, 19256.97568969, ...,
       19677.28282843, 54182.13891571, 30396.10138268])

So, MAPE is better in DecisionTree, so that we should use this algorithm for better predicting