## Decision trees

In [None]:
!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1OKFSv2GpuUFDphO0r8LdM7bl6MAWwBfX' -O data.csv


In this assignment you will predict house prices from the properties' characteristics.

Quality metric: `RMSE`
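
For reference, RMSE is the square root of the mean squared difference between the true and predicted prices. A minimal dependency-free sketch follows; the helper name `rmse` and the toy numbers are illustrative and not part of the assignment.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy prices: the errors are 3 and -4, so RMSE = sqrt((9 + 16) / 2)
toy_rmse = rmse([10.0, 20.0], [7.0, 24.0])
print(toy_rmse)
```

The notebook computes the same quantity as `mean_squared_error(y_test, y_pred)**0.5`.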

### Dataset description

Short description of the data:
```
price: sale price (this is the target variable)
id: transaction id
timestamp: date of transaction
full_sq: total area in square meters, including loggias, balconies and other non-residential areas
life_sq: living area in square meters, excluding loggias, balconies and other non-residential areas
floor: for apartments, floor of the building
max_floor: number of floors in the building
material: wall material
build_year: year built
num_room: number of living rooms
kitch_sq: kitchen area
state: apartment condition
product_type: owner-occupier purchase or investment
sub_area: name of the district

The dataset also includes a collection of features about each property's surrounding neighbourhood, and some features that are constant across each sub area (known as a Raion). Most of the feature names are self explanatory, with the following notes. See below for a complete list.

full_all: subarea population
male_f, female_f: subarea population by gender
young_*: population younger than working age
work_*: working-age population
ekder_*: retirement-age population
n_m_{all|male|female}: population between n and m years old
build_count_*: buildings in the subarea by construction type or year
x_count_500: the number of x within 500m of the property
x_part_500: the share of x within 500m of the property
_sqm_: square meters
cafe_count_d_price_p: number of cafes within d meters of the property that have an average bill under p RUB
trc_: shopping malls
prom_: industrial zones
green_: green zones
metro_: subway
_avto_: distances by car
mkad_: Moscow Circle Auto Road
ttk_: Third Transport Ring
sadovoe_: Garden Ring
bulvar_ring_: Boulevard Ring
kremlin_: City center
zd_vokzaly_: Train station
oil_chemistry_: Dirty industry
ts_: Power plant
```

### Setup

In [None]:
pip install catboost



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from tqdm.notebook import tqdm
from sklearn.ensemble import RandomForestRegressor
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
plt.style.use(style='ggplot')
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import HuberRegressor

In [None]:
df = pd.read_csv("data.csv", parse_dates=["timestamp"])

Split the data into training and test sets. Use the first 80% of the rows as the training set and the last 20% as the test set.
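
One way to express this split without hard-coding the row count is to derive the cut-off from the data length. The sketch below assumes the rows are already in chronological order; `chronological_split` is a hypothetical helper that works on anything supporting `len` and slicing (lists, or pandas DataFrames via row slicing).

```python
def chronological_split(data, train_frac=0.8):
    """Return (first train_frac of the rows, the remaining rows) without shuffling."""
    if not 0.0 < train_frac < 1.0:
        raise ValueError("train_frac must be strictly between 0 and 1")
    n_train = int(len(data) * train_frac)
    return data[:n_train], data[n_train:]

rows = list(range(10))  # stand-in for rows ordered by timestamp
train_rows, test_rows = chronological_split(rows)
print(len(train_rows), len(test_rows))
```

Keeping the order intact matters here because the test set is meant to be the most recent part of the data.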

In [None]:
product_type = df['product_type']

In [None]:
drop_columns = ['id', 'timestamp']
cat_columns = ['product_type', 'material', 'state', 'sub_area', 'culture_objects_top_25', 'thermal_power_plant_raion', 'incineration_raion',        
               'oil_chemistry_raion', 'radiation_raion', 'railroad_terminal_raion', 'big_market_raion', 'nuclear_reactor_raion', 'detention_facility_raion',  
               'ID_metro', 'ID_railroad_station_walk', 'ID_railroad_station_avto', 'water_1line', 'ID_big_road1', 'big_road1_1line', 'ID_big_road2',              
               'railroad_1line', 'ID_railroad_terminal', 'ID_bus_terminal', 'ecology']
num_columns = list(set(df.columns).difference(set(cat_columns + drop_columns)))

In [None]:
df_baseline = df.drop(cat_columns + drop_columns, axis=1)

In [None]:
columns = list(df_baseline.columns)

In [None]:
# Fill missing values in each numeric column with that column's mean
for column in columns:
    df_baseline[column] = df_baseline[column].fillna(df_baseline[column].mean())

In [None]:
df_train = df_baseline[:16000]
df_test = df_baseline[16000:]

In [None]:
df_test = df_test.reset_index(drop=True)

Some models (for example, the boosting libraries) may require you to specify which columns are categorical. To keep things simple, split the columns as follows:
```
drop_columns = [
    'id',           # May leak information
    'timestamp',    # May leak information
]
cat_columns = [
    'product_type',              #
    'material',                  # Material of the wall
    'state',                     # Satisfaction level
    'sub_area',                  # District name
    'culture_objects_top_25',    #
    'thermal_power_plant_raion', #
    'incineration_raion',        #
    'oil_chemistry_raion',       #
    'radiation_raion',           #
    'railroad_terminal_raion',   #
    'big_market_raion',          #
    'nuclear_reactor_raion',     #
    'detention_facility_raion',  #
    'ID_metro',                  #
    'ID_railroad_station_walk',  #
    'ID_railroad_station_avto',  #
    'water_1line',               #
    'ID_big_road1',              #
    'big_road1_1line',           #
    'ID_big_road2',              #
    'railroad_1line',            #
    'ID_railroad_terminal',      #
    'ID_bus_terminal',           #
    'ecology',                   #
]
num_columns = list(set(df.columns).difference(set(cat_columns + drop_columns)))
```

### Baseline

As a baseline, train a `DecisionTreeRegressor` from `sklearn`.

In [None]:
X_train = df_train.drop('price', axis=1)
y_train = df_train['price']
X_test = df_test.drop('price', axis=1)
y_test = df_test['price']

In [None]:
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=42, splitter='best')

Check the quality on the held-out set.

In [None]:
y_pred = model.predict(X_test)
print(f'RMSE: {(mean_squared_error(y_test, y_pred))**0.5}')

RMSE: 3830081.9700675923


### Feature Engineering

Careful feature engineering can often improve a model.

Let's add a few extra features to the model:
* "How often listings appeared in this year and month"
* "How often listings appeared in this year and week"

In [None]:
month_year = (df.timestamp.dt.month + df.timestamp.dt.year * 100)
month_year_cnt_map = month_year.value_counts().to_dict()
df["month_year_cnt"] = month_year.map(month_year_cnt_map)

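# `dt.weekofyear` was removed in pandas 2.0; `dt.isocalendar().week` returns the same ISO week number
week_year = (df.timestamp.dt.isocalendar().week.astype(int) + df.timestamp.dt.year * 100)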
week_year_cnt_map = week_year.value_counts().to_dict()
df["week_year_cnt"] = week_year.map(week_year_cnt_map)
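
To make the encoding concrete: each date is mapped to an integer key `year * 100 + month`, and the feature is the number of rows that share that key. Here is a plain-Python sketch of the same idea on a handful of made-up dates (the dates and variable names are illustrative).

```python
from collections import Counter
from datetime import date

timestamps = [date(2011, 8, 20), date(2011, 8, 23), date(2011, 9, 1), date(2012, 8, 5)]

# Same key as in the cell above: year * 100 + month, e.g. 201108
month_year_keys = [ts.year * 100 + ts.month for ts in timestamps]
key_counts = Counter(month_year_keys)

# Each row receives the number of rows that fall in its year-month
month_year_cnt = [key_counts[key] for key in month_year_keys]
print(month_year_cnt)
```

August 2011 and August 2012 get different keys, so the counts are not pooled across years.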

Add the following extra features:
* Month (from the `timestamp` column)
* Day of the week (from the `timestamp` column)
* The ratio "floor / number of floors in the building" (columns `floor` and `max_floor`)
* The ratio "kitchen area / total apartment area" (columns `kitch_sq` and `full_sq`)

You may add other features as well.
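
Both ratios divide by a column that can contain zeros; the notebook handles this by dropping rows where `max_floor` or `full_sq` is zero before dividing. An alternative is to keep those rows and mark the ratio as missing. The sketch below uses a hypothetical `safe_ratio` helper to show that option.

```python
import math

def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the ratio is undefined."""
    if numerator is None or denominator is None:
        return math.nan
    if denominator == 0 or math.isnan(numerator) or math.isnan(denominator):
        return math.nan
    return numerator / denominator

floor_ratio = safe_ratio(3, 9)      # third floor of a nine-storey building
undefined_ratio = safe_ratio(3, 0)  # max_floor recorded as 0
print(floor_ratio, undefined_ratio)
```

Whether to drop such rows or keep them with a missing ratio is a modelling choice; tree-based boosting libraries can usually handle missing values directly.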

In [None]:
# Vectorized equivalent of looping over the rows
month = df['timestamp'].dt.month.tolist()

In [None]:
# Note: the task asks for the day of the week (`dt.dayofweek`); this cell takes the day of the month,
# and the results below were obtained with it
day = df['timestamp'].dt.day.tolist()

In [None]:
df['month'] = month
df['day'] = day

In [None]:
df = df.drop(cat_columns + drop_columns, axis=1)

In [None]:
columns = list(df.columns)

In [None]:
# Fill missing values in each column with its most frequent value
for column in columns:
    df[column] = df[column].fillna(df[column].value_counts().index[0])

In [None]:
df = df.loc[df['max_floor'] != 0.0]

In [None]:
df = df.reset_index(drop=True)

In [None]:
df['floor_ratio'] = df['floor'] / df['max_floor']

In [None]:
df = df.loc[df['full_sq'] != 0.0]

In [None]:
df = df.reset_index(drop=True)

In [None]:
df['sq_ratio'] = df['kitch_sq'] / df['full_sq']

Split the data into training and test sets again (because the extra features were created on the full dataset).

In [None]:
df_train_eng = df[:15726]
df_test_eng = df[15726:]

### Model Selection

See what quality you can reach with different models:
* `DecisionTreeRegressor` from `sklearn`
* `RandomForestRegressor` from `sklearn`
* `CatBoostRegressor`

You can also try linear models and other boosting libraries (`LightGBM` and `XGBoost`).

Almost every library offers a convenient way to tune hyperparameters: see how to do it in [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) or in [catboost](https://catboost.ai/docs/concepts/python-reference_catboostregressor_grid_search.html).

Evaluate each model on the test set and pick the best one.
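
A grid search evaluates every combination of the listed values and keeps the best one. The sketch below enumerates a small grid with `itertools.product`; `toy_score` is a made-up stand-in for a cross-validated error, not a real model evaluation, and the grid values are illustrative.

```python
from itertools import product

param_grid = {
    "max_depth": [4, 6, 8],
    "min_samples_leaf": [1, 2],
}

def iter_candidates(grid):
    """Yield one dict per combination of parameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

def toy_score(params):
    """Hypothetical error to minimise; lowest at max_depth=6, min_samples_leaf=2."""
    return abs(params["max_depth"] - 6) + abs(params["min_samples_leaf"] - 2)

candidates = list(iter_candidates(param_grid))
best_params = min(candidates, key=toy_score)
print(len(candidates), best_params)
```

With `cv=5`, `GridSearchCV` fits each candidate once per fold, so the search cost grows as five times the number of candidates; this is why the larger grids below take so long.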

In [None]:
X_train_eng = df_train_eng.drop('price', axis=1)
y_train_eng = df_train_eng['price']
X_test_eng = df_test_eng.drop('price', axis=1)
y_test_eng = df_test_eng['price']

#### DecisionTreeRegressor

In [None]:
model1 = DecisionTreeRegressor(random_state=42)
model1.fit(X_train_eng, y_train_eng)
y_pred1 = model1.predict(X_test_eng)
print(f'DecisionTreeRegressor RMSE: {(mean_squared_error(y_test_eng, y_pred1))**0.5}')

DecisionTreeRegressor RMSE: 3749391.0250316425


In [None]:
param_grid = {
    'criterion' :['mse', 'friedman_mse', 'poisson'],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth' : [4, 6, 8]
}

dec_cv = GridSearchCV(model1, param_grid, cv=5, scoring='neg_mean_squared_error', verbose=10, n_jobs=-1)
dec_cv.fit(X_train_eng, y_train_eng)

best_parameters = dec_cv.best_params_
print('The best parameters for using this model is', best_parameters)

Fitting 5 folds for each of 27 candidates, totalling 135 fits



The best parameters for using this model is {'criterion': 'mse', 'max_depth': 6, 'max_features': 'auto'}


In [None]:
model1_best = DecisionTreeRegressor(max_depth=6, max_features='auto')
model1_best.fit(X_train_eng, y_train_eng)
y_pred1_best = model1_best.predict(X_test_eng)
print(f'DecisionTreeRegressor RMSE: {(mean_squared_error(y_test_eng, y_pred1_best))**0.5}')

DecisionTreeRegressor RMSE: 3323020.017271318


For DecisionTreeRegressor, the hyperparameter search noticeably improved RMSE (3749391.03 -> 3323020.02)!

#### RandomForestRegressor

In [None]:
model2 = RandomForestRegressor(random_state=42, n_jobs=-1)
model2.fit(X_train_eng, y_train_eng)
y_pred2 = model2.predict(X_test_eng)
print(f'RandomForestRegressor RMSE: {(mean_squared_error(y_test_eng, y_pred2))**0.5}')

RandomForestRegressor RMSE: 2780613.432684806


In [None]:
param_grid = {
    'n_estimators': [10, 100],
    'max_depth' : [None, 6],
    'min_samples_split': [2, 4],
    'min_samples_leaf': [1, 2]
}

rf_cv = GridSearchCV(model2, param_grid, cv=5, scoring='neg_mean_squared_error', verbose=10, n_jobs=-1)
rf_cv.fit(X_train_eng, y_train_eng)

best_parameters = rf_cv.best_params_
print('The best parameters for using this model is', best_parameters)

Fitting 5 folds for each of 16 candidates, totalling 80 fits




The best parameters for using this model is {'max_depth': None, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 100}


In [None]:
model2 = RandomForestRegressor(random_state=42, min_samples_leaf=2, n_jobs=-1)
model2.fit(X_train_eng, y_train_eng)
y_pred2 = model2.predict(X_test_eng)
print(f'RandomForestRegressor RMSE: {(mean_squared_error(y_test_eng, y_pred2))**0.5}')

RandomForestRegressor RMSE: 2775781.1415436487


For RandomForestRegressor, the hyperparameter search also improved RMSE, though only slightly (2780613.43 -> 2775781.14).

#### CatBoostRegressor

In [None]:
model3 = CatBoostRegressor(random_state=42, silent=True)
model3.fit(X_train_eng, y_train_eng)
y_pred3 = model3.predict(X_test_eng)
print(f'CatBoostRegressor RMSE: {(mean_squared_error(y_test_eng, y_pred3))**0.5}')

CatBoostRegressor RMSE: 2652518.6627344983


In [None]:
param_grid = {
    'learning_rate': [0.01, 0.001, 0.1],
    'depth' : [4, 6, 8]
}

cb_cv = GridSearchCV(model3, param_grid, cv=5, scoring='neg_mean_squared_error', verbose=10, n_jobs=-1)
cb_cv.fit(X_train_eng, y_train_eng)

best_parameters = cb_cv.best_params_
print('The best parameters for using this model is', best_parameters)

Fitting 5 folds for each of 9 candidates, totalling 45 fits




The best parameters for using this model is {'depth': 8, 'learning_rate': 0.01}


In [None]:
model3_best = CatBoostRegressor(depth=8, learning_rate=0.01, random_state=42, silent=True)
model3_best.fit(X_train_eng, y_train_eng)
y_pred3_best = model3_best.predict(X_test_eng)
print(f'CatBoostRegressor RMSE: {(mean_squared_error(y_test_eng, y_pred3_best))**0.5}')

CatBoostRegressor RMSE: 2718032.042930009


For CatBoostRegressor, the hyperparameter search did not improve RMSE (2652518.66 with the defaults vs 2718032.04 after tuning); the initial fit with default parameters remains the best.

#### LGBMRegressor

In [None]:
model4 = LGBMRegressor(random_state=42)
model4.fit(X_train_eng, y_train_eng)
y_pred4 = model4.predict(X_test_eng)
print(f'LGBMRegressor RMSE: {(mean_squared_error(y_test_eng, y_pred4))**0.5}')

LGBMRegressor RMSE: 2755589.764282879


In [None]:
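param_grid = {
    'learning_rate': [0.01, 0.001, 0.1],
    # Note: LightGBM's tree-depth parameter is `max_depth`; `depth` does not appear to be
    # a recognized LGBMRegressor parameter, so this axis of the grid likely has no effect
    'depth' : [-1, 6],
    'n_estimators': [100, 1000]
}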

lgb_cv = GridSearchCV(model4, param_grid, cv=5, scoring='neg_mean_squared_error', verbose=10, n_jobs=-1)
lgb_cv.fit(X_train_eng, y_train_eng)

best_parameters = lgb_cv.best_params_
print('The best parameters for using this model is', best_parameters)

Fitting 5 folds for each of 12 candidates, totalling 60 fits




The best parameters for using this model is {'depth': -1, 'learning_rate': 0.01, 'n_estimators': 1000}


In [None]:
model4_best = LGBMRegressor(learning_rate=0.01, n_estimators=1000, random_state=42)
model4_best.fit(X_train_eng, y_train_eng)
y_pred4_best = model4_best.predict(X_test_eng)
print(f'LGBMRegressor RMSE: {(mean_squared_error(y_test_eng, y_pred4_best))**0.5}')

LGBMRegressor RMSE: 2734973.215789504


For LGBMRegressor, tuning improved RMSE (2755589.76 -> 2734973.22).

#### XGBRegressor

In [None]:
model5 = XGBRegressor(random_state=42)
model5.fit(X_train_eng, y_train_eng)
y_pred5 = model5.predict(X_test_eng)
print(f'XGBRegressor RMSE: {(mean_squared_error(y_test_eng, y_pred5))**0.5}')

XGBRegressor RMSE: 2798649.1789921722


In [None]:
param_grid = {'gamma':[0, 4, 8],
              'subsample':[1, 0.5],
              'max_depth': [2, 4, 6]
}

xgb_cv = GridSearchCV(model5, param_grid, cv=5, scoring='neg_mean_squared_error', verbose=10, n_jobs=-1)
xgb_cv.fit(X_train_eng, y_train_eng)

best_parameters = xgb_cv.best_params_
print('The best parameters for using this model is', best_parameters)

Fitting 5 folds for each of 18 candidates, totalling 90 fits




The best parameters for using this model is {'gamma': 0, 'max_depth': 6, 'subsample': 1}


The best hyperparameters turned out to be the same as the defaults.

### Ensemble v.1

Ensembles sometimes work better than one large model.

The `product_type` column indicates the type of listing: `Investment` (the apartment is sold as an investment) or `OwnerOccupier` (the apartment is sold to live in). It is reasonable to assume that training a separate model for each type will improve quality.

Train your best models separately on `Investment` and `OwnerOccupier` (i.e. you will have `model_invest` trained on `(invest_train_X, invest_train_Y)` and `model_owner` trained on `(owner_train_X, owner_train_Y)`) and check the quality on the held-out set (i.e. on the original `test_split`).
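
A sketch of the per-type setup on toy data: one model is fit per `product_type` using training rows only, and each test row is scored by the model for its own type. `MeanModel` is a deliberately trivial stand-in for the real regressors, and all names and numbers here are illustrative.

```python
from collections import defaultdict

class MeanModel:
    """Trivial stand-in regressor: always predicts the mean training target."""
    def fit(self, targets):
        self.mean_ = sum(targets) / len(targets)
        return self

    def predict(self, n_rows):
        return [self.mean_] * n_rows

# (product_type, price) pairs; the last two rows play the role of the test split
rows = [("Investment", 5.0), ("OwnerOccupier", 3.0), ("Investment", 7.0),
        ("OwnerOccupier", 5.0), ("Investment", 6.5), ("OwnerOccupier", 4.5)]
train_rows, test_rows = rows[:4], rows[4:]

# Fit one model per product type on the training rows only
targets_by_type = defaultdict(list)
for product_type, price in train_rows:
    targets_by_type[product_type].append(price)
models = {ptype: MeanModel().fit(prices) for ptype, prices in targets_by_type.items()}

# Route every test row to the model that matches its product type
predictions = [models[ptype].predict(1)[0] for ptype, _ in test_rows]
print(predictions)
```

Routing by type is one possible design; averaging the two models' predictions for every test row is another, and the two can give quite different scores.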

In [None]:
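# Caution: `df` was filtered and re-indexed above, so this index-based assignment
# may no longer line up row-for-row with the original `product_type` column
df['product_type'] = product_type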

In [None]:
# Note: rows are selected from the whole `df`, so the test rows (15726 onward) are part of this training set
invest_mask = df['product_type'] == 'Investment'
invest_train_X = df[invest_mask].drop(['price', 'product_type'], axis=1)
invest_train_Y = df.loc[invest_mask, 'price']

In [None]:
X_test = df.drop(['price', 'product_type'], axis=1)[15726:]
y_test = df['price'][15726:]

In [None]:
model_invest = CatBoostRegressor(random_state=42, silent=True)
model_invest.fit(invest_train_X, invest_train_Y)
y_pred_invest = model_invest.predict(X_test)

In [None]:
# Note: rows are selected from the whole `df`, so the test rows (15726 onward) are part of this training set
owner_mask = df['product_type'] == 'OwnerOccupier'
owner_train_X = df[owner_mask].drop(['price', 'product_type'], axis=1)
owner_train_Y = df.loc[owner_mask, 'price']

In [None]:
model_owner = CatBoostRegressor(random_state=42, silent=True)
model_owner.fit(owner_train_X, owner_train_Y)
y_pred_owner = model_owner.predict(X_test)

In [None]:
y_pred_mean = (y_pred_invest + y_pred_owner) / 2
print(f'RMSE Ensembling: {mean_squared_error(y_test, y_pred_mean)**0.5}')

RMSE Ensembling: 1969791.4261275425


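Ensembling does noticeably improve RMSE: 2652518.66 -> **1969791.43**. Keep in mind that `model_invest` and `model_owner` were fit on rows selected from the whole `df`, test rows included, so part of this gain may come from leakage rather than from the ensemble itself.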

### (*) Ensemble v.2

Try building a more complex model for `Investment`: train a `CatBoostRegressor` and a `HuberRegressor` from `sklearn`, then add their predictions with weights `w_1` and `w_2` (choose the weights yourself; the weights sum to 1).
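
The weights apply to the predictions, and the metric is then computed on the blended vector. The toy sketch below (all numbers are made up) also contrasts this with averaging the two RMSE values, which is a different quantity.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [10.0, 20.0, 30.0]
pred_a = [12.0, 18.0, 33.0]  # e.g. predictions of the first model
pred_b = [7.0, 25.0, 26.0]   # e.g. predictions of the second model
w_1, w_2 = 0.8, 0.2          # weights sum to 1

# Blend the predictions element-wise, then score the blend
blended = [w_1 * a + w_2 * b for a, b in zip(pred_a, pred_b)]
rmse_of_blend = rmse(y_true, blended)

# Weighted average of the individual RMSEs: an upper bound, not the blend's RMSE
avg_of_rmses = w_1 * rmse(y_true, pred_a) + w_2 * rmse(y_true, pred_b)
print(rmse_of_blend, avg_of_rmses)
```

Because the two models' errors can partly cancel, the blend's RMSE is never larger than the weighted average of the individual RMSEs and is often noticeably smaller.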

In [None]:
model_invest2 = CatBoostRegressor(random_state=42, silent=True)
model_invest2.fit(invest_train_X, invest_train_Y)
y_pred_invest2 = model_invest2.predict(X_test)
rmse_invest_cbr = mean_squared_error(y_test, y_pred_invest2)**0.5
print(f'RMSE Investment: {rmse_invest_cbr}')

RMSE Investment: 2005368.158347318


In [None]:
model_invest3 = HuberRegressor()
model_invest3.fit(invest_train_X, invest_train_Y)
y_pred_invest3 = model_invest3.predict(X_test)
rmse_invest_hbr = mean_squared_error(y_test, y_pred_invest3)**0.5
print(f'RMSE Investment: {rmse_invest_hbr}')

RMSE Investment: 5127101.249236857


In [None]:
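w1 = 0.8
w2 = 0.2
# Note: this is the weighted average of the two RMSE values, not the RMSE of the blended
# predictions w1 * y_pred_invest2 + w2 * y_pred_invest3 that the task describes
rmse = rmse_invest_cbr*w1 + rmse_invest_hbr*w2
print(f'RMSE Investment CatBoostRegressor + HuberRegressor: {rmse}')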

RMSE Investment CatBoostRegressor + HuberRegressor: 2629714.7765252255
