# Часть 1 Бустинг (5 баллов)

В этой части будем предсказывать зарплату data scientist-ов в зависимости  от ряда факторов с помощью градиентного бустинга.

В датасете есть следующие признаки:



* work_year: The number of years of work experience in the field of data science.

* experience_level: The level of experience, such as Junior, Senior, or Lead.

* employment_type: The type of employment, such as Full-time or Contract.

* job_title: The specific job title or role, such as Data Analyst or Data Scientist.

* salary: The salary amount for the given job.

* salary_currency: The currency in which the salary is denoted.

* salary_in_usd: The equivalent salary amount converted to US dollars (USD) for comparison purposes.

* employee_residence: The country or region where the employee resides.

* remote_ratio: The percentage of remote work offered in the job.

* company_location: The location of the company or organization.

* company_size: The company's size is categorized as Small, Medium, or Large.

In [12]:
import pandas as pd

df = pd.read_csv("ds_salaries.csv")
df.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,SE,FT,Principal Data Scientist,80000,EUR,85847,ES,100,ES,L
1,2023,MI,CT,ML Engineer,30000,USD,30000,US,100,US,S
2,2023,MI,CT,ML Engineer,25500,USD,25500,US,100,US,S
3,2023,SE,FT,Data Scientist,175000,USD,175000,CA,100,CA,M
4,2023,SE,FT,Data Scientist,120000,USD,120000,CA,100,CA,M


## Задание 1 (0.5 балла) Подготовка



*   Разделите выборку на train, val, test (80%, 10%, 10%)
*   Выдерите salary_in_usd в качестве таргета
*   Найдите и удалите признак, из-за которого возможен лик в данных


In [13]:
from sklearn.model_selection import train_test_split

y = df['salary_in_usd']
X = df.drop(columns='salary_in_usd')

X = X.drop(columns='salary') #вынесем отдельно, чтобы подчеркнуть

X_train, X_temp, y_train, y_temp = train_test_split(X,y, random_state=42, train_size=0.8)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print("Share of train: ", len(X_train)/len(X))
print("Share of test: ", len(X_test)/len(X))
print("Share of val: ", len(X_val)/len(X))

X_train.head()

Share of train:  0.8
Share of test:  0.09986684420772303
Share of val:  0.10013315579227697


Unnamed: 0,work_year,experience_level,employment_type,job_title,salary_currency,employee_residence,remote_ratio,company_location,company_size
2238,2022,SE,FT,Data Engineer,EUR,ES,0,ES,M
485,2023,MI,FT,Research Scientist,USD,US,100,US,M
2177,2022,SE,FT,Data Analyst,USD,US,0,US,M
3305,2022,SE,FT,Data Engineer,USD,US,100,US,M
1769,2023,SE,FT,Data Engineer,USD,US,100,US,M


В датафрейме есть столбец 'salary', отвечающий за размер зарплаты в той валюте, в которой ему предлагается оффер. Эта информация напрямую влияет на тренировочную выборку, поэтому ее надо удалить из всех выборок. P.s. сделано в самом начале

## Задание 2 (0.5 балла) Линейная модель


*   Закодируйте категориальные  признаки с помощью OneHotEncoder
*   Обучите модель линейной регрессии
*   Оцените  качество через MAPE и RMSE


In [14]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error, root_mean_squared_error
from sklearn.preprocessing import OneHotEncoder
import numpy as np


categorical_features = X_train.select_dtypes(exclude=[np.number]).columns.tolist()
numeric_features = X_train.select_dtypes(include=[np.number]).columns.tolist()


enc = OneHotEncoder(drop='first',sparse_output=False, handle_unknown='ignore')


X_train_encounded = enc.fit_transform(X_train[categorical_features])
X_test_encounded = enc.transform(X_test[categorical_features])
X_val_encounded = enc.transform(X_val[categorical_features])

X_train_num = X_train[numeric_features].values
X_test_num  = X_test[numeric_features].values
X_val_num   = X_val[numeric_features].values


X_train_final = np.hstack([X_train_num, X_train_encounded])
X_test_final  = np.hstack([X_test_num,  X_test_encounded])
X_val_final   = np.hstack([X_val_num,   X_val_encounded])

lg = LinearRegression()
lg.fit(X_train_final,y_train)
y_test_res = lg.predict(X_test_final)


print('MAPE: ', mean_absolute_percentage_error(y_true=y_test, y_pred=y_test_res))
print('RMSE: ', root_mean_squared_error(y_true=y_test, y_pred=y_test_res))

MAPE:  0.4316404767175547
RMSE:  46984.19170454946




## Задание 3 (0.5 балла) XGboost

Начнем с библиотеки xgboost.

Обучите модель `XGBRegressor` на тех же данных, что линейную модель, подобрав оптимальные гиперпараметры (`max_depth, learning_rate, n_estimators, gamma`, etc.) по валидационной выборке. Оцените качество итоговой модели (MAPE, RMSE), скорость обучения и скорость предсказания.

In [15]:
import xgboost as xgb
from xgboost.sklearn import XGBRegressor

params = {
    'max_depth': [2, 4, 6],
    'learning_rate': [0.1, 0.2, 0.3],
    'n_estimators': [100, 150, 200],
    'gamma': [0, 1, 2],
    'min_child_weight': [1, 3, 5],
    'reg_alpha': [0, 0.1, 1, 5],                  
    'reg_lambda': [1, 2, 5, 10] 
}
best_params = None
best_rmse = float('inf')
best_model = None

for max_depth in [2, 4, 6]:
    for learning_rate in [0.1, 0.2, 0.3]:
            for gamma in [0,1,2]:
                for min_child_weight in [1,3,5]:
                    model_xgb = XGBRegressor(
                            learning_rate=learning_rate,
                            n_estimators=1000,
                            max_depth= max_depth,
                            gamma= gamma,
                            min_child_weight=min_child_weight,
                            random_state= 42,
                            early_stopping_rounds=50
                    )
                    model_xgb.fit(X_train_final, y_train, eval_set=[(X_val_final, y_val)])

                    temp_res = model_xgb.predict(X_val_final)

                    rmse = root_mean_squared_error(y_val, temp_res)

                    if (rmse < best_rmse):
                        best_rmse = rmse
                        best_params = {
                            'max_depth': max_depth,
                            'learning_rate': learning_rate,
                            'n_estimators': model_xgb.best_iteration, #моделька сама за счет early_stop понимает, какой оптимальный n
                            'gamma': gamma,
                            'min_child_weight': min_child_weight,
                        }   
                        best_model = model_xgb
                    
print("Лучшие параметры:", best_params)
print("Лучший RMSE на val:", best_rmse)


[0]	validation_0-rmse:62406.46529
[1]	validation_0-rmse:60909.44088
[2]	validation_0-rmse:59699.00137
[3]	validation_0-rmse:58772.30421
[4]	validation_0-rmse:57863.93935
[5]	validation_0-rmse:57204.11684
[6]	validation_0-rmse:56500.83181
[7]	validation_0-rmse:55924.06074
[8]	validation_0-rmse:55496.57424
[9]	validation_0-rmse:55001.64452
[10]	validation_0-rmse:54695.48068
[11]	validation_0-rmse:54168.24009
[12]	validation_0-rmse:53799.72821
[13]	validation_0-rmse:53547.23111
[14]	validation_0-rmse:53205.49354
[15]	validation_0-rmse:52929.11527
[16]	validation_0-rmse:52763.00601
[17]	validation_0-rmse:52589.37186
[18]	validation_0-rmse:52481.12176
[19]	validation_0-rmse:52314.34996
[20]	validation_0-rmse:52213.85918
[21]	validation_0-rmse:52103.83208
[22]	validation_0-rmse:52071.98908
[23]	validation_0-rmse:52012.91111
[24]	validation_0-rmse:51947.42887
[25]	validation_0-rmse:51901.70291
[26]	validation_0-rmse:51853.85109
[27]	validation_0-rmse:51798.89595
[28]	validation_0-rmse:51755.9

Получили какие то оптимальные гиперпараметры, теперь попробуем убедиться, что нет переобучения

In [16]:
y_train_pred_best = best_model.predict(X_train_final)
y_val_pred_best   = best_model.predict(X_val_final)

rmse_train = root_mean_squared_error(y_train, y_train_pred_best)
rmse_val   = root_mean_squared_error(y_val,   y_val_pred_best)

print(f"Лучшая модель | Train RMSE: {rmse_train:.3f}, Val RMSE: {rmse_val:.3f}")

Лучшая модель | Train RMSE: 44686.895, Val RMSE: 49379.395


Данные примерно равны=> не словили переобучение

In [17]:
xg_y_pred = best_model.predict(X_test_final)

mape = mean_absolute_percentage_error(y_test,xg_y_pred)
rmse = root_mean_squared_error(y_test,xg_y_pred)


y_baseline = [y_train.mean()] * len(y_val) #здесь возьмем y_val, типо имитируя обучение
baseline_mape = mean_absolute_percentage_error(y_val, y_baseline)
print(f"Baseline MAPE: {baseline_mape:.3f} ({baseline_mape*100:.1f}%)")

print('MAPE: ', mape)
print('RMSE: ', rmse)

Baseline MAPE: 0.641 (64.1%)
MAPE:  0.40950098633766174
RMSE:  46300.54296875


## Задание 4 (1 балл) CatBoost

Теперь библиотека CatBoost.

Обучите модель `CatBoostRegressor`, подобрав оптимальные гиперпараметры (`depth, learning_rate, iterations`, etc.) по валидационной выборке. Оцените качество итоговой модели (MAPE, RMSE), скорость обучения и скорость предсказания.

In [18]:

from catboost import CatBoostRegressor
from catboost import *
from catboost import datasets
best_ct_model = None
best_ct_params = None
best_ct_rmse = float('inf')

for depth in [2,4,6]:
        for learning_rate in [0.1, 0.2, 0.3]:
            model_ct = CatBoostRegressor(
                iterations=1000,
                max_depth=depth,
                learning_rate=learning_rate,
                random_seed= 42
            )

            model_ct.fit(X_train_final, y_train, eval_set=(X_val_final, y_val) ,early_stopping_rounds=50) #если 50 деревьев подряд наша rmse не будет улучшаться, то работа прекращается. Замеры на X_val

            ct_y_pred = model_ct.predict(X_val_final)

            ct_rmse = root_mean_squared_error(y_true=y_val, y_pred=ct_y_pred)

            if (ct_rmse < best_ct_rmse):
                best_ct_rmse = ct_rmse
                best_ct_params = {
                    'learning_rate':learning_rate,
                    'max_depth':depth,
                    'iterations': model_ct.best_iteration_
                }
            best_ct_model = model_ct


print("Лучшие параметры:", best_ct_params)
print("Лучший RMSE на val:", best_ct_rmse)

0:	learn: 61686.8223944	test: 62664.3935773	best: 62664.3935773 (0)	total: 153ms	remaining: 2m 33s
1:	learn: 60273.7381721	test: 61391.4388920	best: 61391.4388920 (1)	total: 158ms	remaining: 1m 18s
2:	learn: 58959.0961839	test: 60152.8135541	best: 60152.8135541 (2)	total: 160ms	remaining: 53s
3:	learn: 57872.2058470	test: 59156.9112210	best: 59156.9112210 (3)	total: 162ms	remaining: 40.2s
4:	learn: 57133.7856640	test: 58472.0443464	best: 58472.0443464 (4)	total: 163ms	remaining: 32.5s
5:	learn: 56488.9489493	test: 57888.5888821	best: 57888.5888821 (5)	total: 165ms	remaining: 27.4s
6:	learn: 55698.1738186	test: 57221.7729528	best: 57221.7729528 (6)	total: 167ms	remaining: 23.7s
7:	learn: 55109.6798235	test: 56636.3904444	best: 56636.3904444 (7)	total: 169ms	remaining: 21s
8:	learn: 54486.0036704	test: 56068.5665237	best: 56068.5665237 (8)	total: 170ms	remaining: 18.7s
9:	learn: 53996.0864080	test: 55646.4402791	best: 55646.4402791 (9)	total: 170ms	remaining: 16.9s
10:	learn: 53583.35809

In [19]:
best_ct_res = best_ct_model.predict(X_test_final)

ct_mape = mean_absolute_percentage_error(y_true=y_test, y_pred=best_ct_res)
ct_rmse = root_mean_squared_error(y_true=y_test, y_pred=best_ct_res)

print('MAPE: ', ct_mape)
print('RMSE: ', ct_rmse)

MAPE:  0.3880664713635183
RMSE:  45058.580307698525


Для применения catboost моделей не обязательно сначала кодировать категориальные признаки, модель может кодировать их сама. Обучите catboost с подбором оптимальных гиперпараметров снова, используя pool для передачи данных в модель с указанием какие признаки категориальные, а какие нет с помощью параметра cat_features. Оцените качество и время. Стало ли лучше?

In [23]:
from catboost import Pool

pool_train = Pool(X_train, label=y_train, cat_features=categorical_features)
pool_val = Pool(X_val, label=y_val, cat_features=categorical_features)
best_ct_model_with_pool = None
best_ct_params_with_pool = None
best_ct_rmse_with_pool = float('inf')

for depth in [2,4,6]:
        for learning_rate in [0.1, 0.2, 0.3]:
            model_ct_with_pool = CatBoostRegressor(
                iterations=1000,
                max_depth=depth,
                learning_rate=learning_rate,
                random_seed= 42
            )

            model_ct_with_pool.fit(pool_train ,eval_set=pool_val,early_stopping_rounds=50) #если 50 деревьев подряд наша rmse не будет улучшаться, то работа прекращается. Замеры на X_val

            ct_with_pool_y_pred = model_ct_with_pool.predict(pool_val)

            ct_rmse_with_pool = root_mean_squared_error(y_true=y_val, y_pred=ct_with_pool_y_pred)

            if (ct_rmse_with_pool < best_ct_rmse_with_pool):
                best_ct_rmse_with_pool = ct_rmse_with_pool
                best_ct_params_with_pool = {
                    'learning_rate':learning_rate,
                    'max_depth':depth,
                    'iterations': model_ct_with_pool.best_iteration_
                }
            best_ct_model_with_pool = model_ct_with_pool


print("Лучшие параметры:", best_ct_params_with_pool)
print("Лучший RMSE на val:", best_ct_rmse_with_pool)


0:	learn: 61323.6981927	test: 62372.2681613	best: 62372.2681613 (0)	total: 44.6ms	remaining: 44.6s
1:	learn: 59881.0701449	test: 60974.8513333	best: 60974.8513333 (1)	total: 64.7ms	remaining: 32.3s
2:	learn: 58648.7976531	test: 59821.8333687	best: 59821.8333687 (2)	total: 78.7ms	remaining: 26.2s
3:	learn: 57629.2999988	test: 58871.3889899	best: 58871.3889899 (3)	total: 96.2ms	remaining: 23.9s
4:	learn: 56827.6757300	test: 58090.9074091	best: 58090.9074091 (4)	total: 114ms	remaining: 22.6s
5:	learn: 56138.5203794	test: 57451.6216967	best: 57451.6216967 (5)	total: 139ms	remaining: 23.1s
6:	learn: 55621.6238067	test: 56934.6120896	best: 56934.6120896 (6)	total: 163ms	remaining: 23.2s
7:	learn: 54924.5967453	test: 56360.1464562	best: 56360.1464562 (7)	total: 180ms	remaining: 22.3s
8:	learn: 54422.7872304	test: 55982.5584693	best: 55982.5584693 (8)	total: 195ms	remaining: 21.5s
9:	learn: 53915.6003976	test: 55507.4201542	best: 55507.4201542 (9)	total: 210ms	remaining: 20.8s
10:	learn: 53292

In [21]:
test_pool = Pool(X_test, y_test, cat_features=categorical_features)

best_ct_res_with_pool = best_ct_model_with_pool.predict(test_pool)

ct_with_pool_mape = mean_absolute_percentage_error(y_true=y_test, y_pred=best_ct_res_with_pool)
ct_with_pool_rmse = root_mean_squared_error(y_true=y_test, y_pred=best_ct_res_with_pool)

print('MAPE: ', ct_with_pool_mape)
print('RMSE: ', ct_with_pool_rmse)

MAPE:  0.4429915303794044
RMSE:  46230.005634686124


**Ответ:** По времени и точности лучше самому кодировать данные, однако CatBoost с использованием Pool, как вариант "из коробки", вполне неплох

## Задание 5 (0.5 балла) LightGBM

И наконец библиотека LightGBM - используйте `LGBMRegressor`, снова подберите гиперпараметры, оцените качество и скорость.


In [8]:
from lightgbm import LGBMRegressor




best_gbm_params = None
best_gbm_model = None
best_gbm_rmse = float('inf')

for depth in [2,4,6]:
    for learning_rate in [0.1,0.2, 0.3]:
        for num_leaves in [20,40, 60]:
            model_gbm = LGBMRegressor(
                max_depth=depth,
                learning_rate=learning_rate,
                num_leaves=num_leaves,
                n_estimators=1000,
                early_stopping_rounds=50
            )

            model_gbm.fit(X_train_final, y_train, eval_set=(X_val_final, y_val))

            gbm_y_res = model_gbm.predict(X_val_final)

            gbm_rmse = root_mean_squared_error(y_true = y_val, y_pred= gbm_y_res)

            if (gbm_rmse < best_gbm_rmse):
                best_gbm_rmse = gbm_rmse
                best_gbm_params = {
                    "max_depth":depth,
                    "n_estimators":  model_gbm.best_iteration_,
                    "num_leaves": num_leaves,
                    "learning_rate": learning_rate
                }
                best_gbm_model = model_gbm

print("Лучшие параметры: ", best_gbm_params)
print("Лучший RMSE: ", best_gbm_rmse)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000522 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 82
[LightGBM] [Info] Number of data points in the train set: 3004, number of used features: 39
[LightGBM] [Info] Start training from score 138055.989348
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[316]	valid_0's l2: 2.50992e+09
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000383 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 82
[LightGBM] [Info] Number of data points in the train set: 3004, number of used features: 39
[LightGBM] [Info] Start training from score 138055.989348
Training until validation scores don't improve for 50 ro



Early stopping, best iteration is:
[135]	valid_0's l2: 2.48896e+09
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001326 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 82
[LightGBM] [Info] Number of data points in the train set: 3004, number of used features: 39
[LightGBM] [Info] Start training from score 138055.989348
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[135]	valid_0's l2: 2.48896e+09
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000310 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 82
[LightGBM] [Info] Number of data points in the train set: 3004, number of used features: 39
[LightGBM] [Info] Start training from score 138



[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000289 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 82
[LightGBM] [Info] Number of data points in the train set: 3004, number of used features: 39
[LightGBM] [Info] Start training from score 138055.989348
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[65]	valid_0's l2: 2.50186e+09
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000203 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 82
[LightGBM] [Info] Number of data points in the train set: 3004, number of used features: 39
[LightGBM] [Info] Start training from score 138055.989348
Training until validation scores don't improve for 50 rou



Early stopping, best iteration is:
[63]	valid_0's l2: 2.47679e+09
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000209 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 82
[LightGBM] [Info] Number of data points in the train set: 3004, number of used features: 39
[LightGBM] [Info] Start training from score 138055.989348
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[55]	valid_0's l2: 2.47798e+09
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000554 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 82
[LightGBM] [Info] Number of data points in the train set: 3004, number of used features: 39
[LightGBM] [Info] Start training from score 13805



[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000305 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 82
[LightGBM] [Info] Number of data points in the train set: 3004, number of used features: 39
[LightGBM] [Info] Start training from score 138055.989348
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[20]	valid_0's l2: 2.4604e+09
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000207 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 82
[LightGBM] [Info] Number of data points in the train set: 3004, number of used features: 39
[LightGBM] [Info] Start training from score 138055.989348
Training until validation scores don't improve for 50 roun



Early stopping, best iteration is:
[48]	valid_0's l2: 2.43997e+09
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001158 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 82
[LightGBM] [Info] Number of data points in the train set: 3004, number of used features: 39
[LightGBM] [Info] Start training from score 138055.989348
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[27]	valid_0's l2: 2.44623e+09
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000303 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 82
[LightGBM] [Info] Number of data points in the train set: 3004, number of used features: 39
[LightGBM] [Info] Start training from score 13805



In [10]:
best_gbm_res = best_gbm_model.predict(X_test_final)

gbm_test_rmse = root_mean_squared_error(y_true=y_test, y_pred=best_gbm_res)
gbm_test_mape = mean_absolute_percentage_error(y_true=y_test, y_pred=best_gbm_res)
print('MAPE: ', gbm_test_mape )
print('RMSE: ', gbm_test_rmse)

MAPE:  0.40109252410679197
RMSE:  46377.81863089044




## Задание 6 (2 балла) Сравнение и выводы

Сравните модели бустинга и сделайте про них выводы, какая из моделей показала лучший/худший результат по качеству, скорости обучения и скорости предсказания? Как отличаются гиперпараметры для разных моделей?

**Ответ:** 
|Вид модели|RMSE на тестовой выборке|MAPE на тестовой выборке|Суммарное время на побор гиперпараметров(s)|
|-|--------|---|---|
|XgBoost|46300.54296875|0.40950098633766174|94.4|
|CatBoost with Pool|46230.005634686124|0.4429915303794044|76.8|
|CatBoost without Pool|45058.580307698525|0.3880664713635183|6.4|
|LightGBM|49346.24547450481|0.40109252410679197|2.0|


#### Вывод:

* CatBoost без использования Pool для кодировки категориальных признаков показал наиболее оптимальные результаты как по времени работы, так и по результатам. Он опережает в более чем 10 раз быстрее аналога с таким же подбором гиперпараметров, но с Pool. Это объясняется кодировкой данных под капотом CatBoost и работой с лишней абстракцией Pool, которая может тормозить работу программы. Однако ручная предобработка данных может требовать больше знаний, опыта и точности в выборе тех, или иных решений. Стоит учитывать и человческий фактор
* Самую долгую работу показала XGboost. Предположу, что это связано с работой со вторыми частными производными для более точного построения деревьев
* LightGBM - самый быстрый ансамбль. Связано это со спецификой построения деревьев: если количество листьев равно num_leaves, то мы больше не разбиваем". Это позволяет довольно быстро получить результаты. Однако жертвуем точностью - у LightGBM самый высокий показатель RMSE, а MAPE все еще меньше чем у CatBoost без Pool
* Переобучения среди моделей нет, так как ошибка валидационной выборке (см. ниже) и на тестовой выборке (см. таблицу) примерно похожи на результаты на тренировочной выборке

In [22]:
y_train_pred_best = best_model.predict(X_train_final)
y_val_pred_best   = best_model.predict(X_val_final)

rmse_train = root_mean_squared_error(y_train, y_train_pred_best)
rmse_val   = root_mean_squared_error(y_val,   y_val_pred_best)

print(f"Лучшая модель XG | Train RMSE: {rmse_train:.3f}, Val RMSE: {rmse_val:.3f}")

y_train_pred_best = best_gbm_model.predict(X_train_final)
y_val_pred_best   = best_gbm_model.predict(X_val_final)

rmse_train = root_mean_squared_error(y_train, y_train_pred_best)
rmse_val   = root_mean_squared_error(y_val,   y_val_pred_best)

print(f"Лучшая модель GBM | Train RMSE: {rmse_train:.3f}, Val RMSE: {rmse_val:.3f}")


y_train_pred_best = best_ct_model.predict(X_train_final)
y_val_pred_best   = best_ct_model.predict(X_val_final)

rmse_train = root_mean_squared_error(y_train, y_train_pred_best)
rmse_val   = root_mean_squared_error(y_val,   y_val_pred_best)

print(f"Лучшая модель GBM | Train RMSE: {rmse_train:.3f}, Val RMSE: {rmse_val:.3f}")

Лучшая модель XG | Train RMSE: 44686.895, Val RMSE: 49379.395
Лучшая модель GBM | Train RMSE: 46348.666, Val RMSE: 49346.245




Лучшая модель GBM | Train RMSE: 43489.317, Val RMSE: 49277.322
