# Estimating car prices

A used-car sales service is developing an app to attract new customers. In the app, users can quickly find out the market value of their car. We have historical data at our disposal: technical specifications, trim levels and car prices. We need to build a model that estimates the price.

The customer cares about:

- prediction quality;
- prediction speed;
- training time.

# Work plan
- Data preparation
- Model training
- Model analysis
- Conclusion



# Features
DateCrawled: date the listing was downloaded from the database

VehicleType: body type

RegistrationYear: year the car was registered

Gearbox: gearbox type

Power: engine power (hp)

Model: car model

Kilometer: mileage (km)

RegistrationMonth: month the car was registered

FuelType: fuel type

Brand: car make

Repaired: whether the car has been repaired

DateCreated: date the listing was created

NumberOfPictures: number of photos of the car

PostalCode: postal code of the listing owner (user)

LastSeen: date of the user's last activity

Target feature

Price: price (euros)

## Data preparation

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
import lightgbm as lgb
from catboost import CatBoostRegressor, Pool
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import cross_val_score, TimeSeriesSplit

In [None]:
!pip install lightgbm



In [None]:
try:
    df = pd.read_csv('/datasets/autos.csv')
except FileNotFoundError:
    # Fall back to a local copy when the hosted dataset path is unavailable
    df = pd.read_csv('C:/Users/www/autos.csv')

In [None]:
df

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,Repaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,2016-03-24 11:52:17,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
354364,2016-03-21 09:50:58,0,,2005,manual,0,colt,150000,7,petrol,mitsubishi,yes,2016-03-21 00:00:00,0,2694,2016-03-21 10:42:49
354365,2016-03-14 17:48:27,2200,,2005,,0,,20000,1,,sonstige_autos,,2016-03-14 00:00:00,0,39576,2016-04-06 00:46:52
354366,2016-03-05 19:56:21,1199,convertible,2000,auto,101,fortwo,125000,3,petrol,smart,no,2016-03-05 00:00:00,0,26135,2016-03-11 18:17:12
354367,2016-03-19 18:57:12,9200,bus,1996,manual,102,transporter,150000,3,gasoline,volkswagen,no,2016-03-19 00:00:00,0,87439,2016-04-07 07:15:26


Convert all column names to lowercase snake_case.

In [None]:
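# regex=True is passed explicitly because the pattern is a regular expression;
# the trailing .str[1:] strips the underscore added before the first capital
df.columns = df.columns.str.replace(r"([A-Z])", r" \1", regex=True).str.lower().str.replace(' ', '_').str[1:]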

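The same renaming rule can be checked on a few column names without pandas. This is a minimal standard-library sketch; the helper name `to_snake_case` is hypothetical and not part of the notebook.

```python
import re


def to_snake_case(name: str) -> str:
    """Insert an underscore before every inner capital letter, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


# A few of the original column names from the dataset
columns = ["DateCrawled", "RegistrationYear", "NumberOfPictures", "Price"]
snake = [to_snake_case(c) for c in columns]
print(snake)
```

The lookbehind `(?<!^)` skips the first character, so no leading underscore has to be stripped afterwards.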


In [None]:
df

Unnamed: 0,date_crawled,price,vehicle_type,registration_year,gearbox,power,model,kilometer,registration_month,fuel_type,brand,repaired,date_created,number_of_pictures,postal_code,last_seen
0,2016-03-24 11:52:17,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
354364,2016-03-21 09:50:58,0,,2005,manual,0,colt,150000,7,petrol,mitsubishi,yes,2016-03-21 00:00:00,0,2694,2016-03-21 10:42:49
354365,2016-03-14 17:48:27,2200,,2005,,0,,20000,1,,sonstige_autos,,2016-03-14 00:00:00,0,39576,2016-04-06 00:46:52
354366,2016-03-05 19:56:21,1199,convertible,2000,auto,101,fortwo,125000,3,petrol,smart,no,2016-03-05 00:00:00,0,26135,2016-03-11 18:17:12
354367,2016-03-19 18:57:12,9200,bus,1996,manual,102,transporter,150000,3,gasoline,volkswagen,no,2016-03-19 00:00:00,0,87439,2016-04-07 07:15:26


In [None]:
df.isna().mean().sort_values(ascending=False)

repaired              0.200791
vehicle_type          0.105794
fuel_type             0.092827
gearbox               0.055967
model                 0.055606
date_crawled          0.000000
price                 0.000000
registration_year     0.000000
power                 0.000000
kilometer             0.000000
registration_month    0.000000
brand                 0.000000
date_created          0.000000
number_of_pictures    0.000000
postal_code           0.000000
last_seen             0.000000
dtype: float64

The repaired feature has the most missing values (about 20%). This may be because the user did not state that the car had never been repaired, or because the car had in fact been repaired. The probability of a missing value could be modelled from other attributes, but the dataset does not contain them.

vehicle_type, fuel_type, gearbox and model are fairly important features with roughly 5 to 11% missing values: body type, fuel type, gearbox type and car model all affect the final price. The gaps probably appeared because users skipped these fields when filling in the listing, either deliberately or because they did not know the answer.

We will not drop rows with missing values. Instead we replace the gaps with unknown, which keeps every row and lets the models treat a missing value as a category of its own.

In [None]:
nan = ['repaired', 'vehicle_type', 'fuel_type', 'gearbox', 'model']
df[nan] = df[nan].fillna('unknown')
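The idea of measuring the share of gaps and then turning them into an explicit category can be illustrated on a handful of rows. This is a sketch in plain Python with hypothetical sample rows, not the notebook's pandas code.

```python
# Toy listings; None marks a missing value (hypothetical sample rows)
rows = [
    {"model": "golf", "gearbox": "manual", "repaired": None},
    {"model": None, "gearbox": "manual", "repaired": "yes"},
    {"model": "grand", "gearbox": "auto", "repaired": None},
    {"model": "fabia", "gearbox": None, "repaired": "no"},
]
columns = ["model", "gearbox", "repaired"]

# Share of missing values per column, analogous to df.isna().mean()
missing_share = {
    col: sum(row[col] is None for row in rows) / len(rows) for col in columns
}

# Replace every gap with the explicit category "unknown"
filled = [
    {col: ("unknown" if row[col] is None else row[col]) for col in columns}
    for row in rows
]

print(missing_share)
print(filled[0])
```

After filling, no row is lost and "unknown" behaves like any other category value during encoding.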

In [None]:
df

Unnamed: 0,date_crawled,price,vehicle_type,registration_year,gearbox,power,model,kilometer,registration_month,fuel_type,brand,repaired,date_created,number_of_pictures,postal_code,last_seen
0,2016-03-24 11:52:17,480,unknown,1993,manual,0,golf,150000,0,petrol,volkswagen,unknown,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,18300,coupe,2011,manual,190,unknown,125000,5,gasoline,audi,yes,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,unknown,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
354364,2016-03-21 09:50:58,0,unknown,2005,manual,0,colt,150000,7,petrol,mitsubishi,yes,2016-03-21 00:00:00,0,2694,2016-03-21 10:42:49
354365,2016-03-14 17:48:27,2200,unknown,2005,unknown,0,unknown,20000,1,unknown,sonstige_autos,unknown,2016-03-14 00:00:00,0,39576,2016-04-06 00:46:52
354366,2016-03-05 19:56:21,1199,convertible,2000,auto,101,fortwo,125000,3,petrol,smart,no,2016-03-05 00:00:00,0,26135,2016-03-11 18:17:12
354367,2016-03-19 18:57:12,9200,bus,1996,manual,102,transporter,150000,3,gasoline,volkswagen,no,2016-03-19 00:00:00,0,87439,2016-04-07 07:15:26


In [None]:
df.describe()

Unnamed: 0,price,registration_year,power,kilometer,registration_month,number_of_pictures,postal_code
count,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0
mean,4416.656776,2004.234448,110.094337,128211.172535,5.714645,0.0,50508.689087
std,4514.158514,90.227958,189.850405,37905.34153,3.726421,0.0,25783.096248
min,0.0,1000.0,0.0,5000.0,0.0,0.0,1067.0
25%,1050.0,1999.0,69.0,125000.0,3.0,0.0,30165.0
50%,2700.0,2003.0,105.0,150000.0,6.0,0.0,49413.0
75%,6400.0,2008.0,143.0,150000.0,9.0,0.0,71083.0
max,20000.0,9999.0,20000.0,150000.0,12.0,0.0,99998.0


There are anomalies in registration_year: the minimum is 1000 and the maximum is 9999. Likewise, power has an anomalous maximum of 20,000 hp, which is impossible for a passenger car.

In [None]:
print('Number of duplicates before removal:', df.duplicated().sum())

Number of duplicates before removal: 4


In [None]:
df = df.drop_duplicates()
print('Number of duplicates after removal:', df.duplicated().sum())

Number of duplicates after removal: 0


Drop the features that carry no useful information for training the model:

In [None]:
df_with_date = df.copy()
df = df.drop(["date_crawled","date_created","last_seen","number_of_pictures","postal_code", "registration_month"],axis = 1)

In [None]:
df.columns

Index(['price', 'vehicle_type', 'registration_year', 'gearbox', 'power',
       'model', 'kilometer', 'fuel_type', 'brand', 'repaired'],
      dtype='object')

In [None]:
# RegistrationYear: keep values within [1980, 2016]
df["registration_year"] = df["registration_year"].clip(lower=1980, upper=2016)
# Power: cap at 3500 hp
df["power"] = df["power"].clip(upper=3500)



In [None]:
def get_outlier_index(df, column):
    # NB: the lower quantile is 0.45 rather than the usual 0.25, and the lower
    # bound uses 0.5 of the spread instead of 1.5, so this rule is stricter
    # than the standard IQR rule.
    q_low = df[column].quantile(0.45)
    q_high = df[column].quantile(0.75)
    spread = q_high - q_low
    lower_bound = q_low - 0.5 * spread
    upper_bound = q_high + 1.5 * spread
    # Rows on or outside either bound are marked for deletion
    mask = (df[column] >= upper_bound) | (df[column] <= lower_bound)
    del_index = df.index[mask].tolist()

    print('Number of rows selected for removal by ' + str(column) + ":", len(del_index))
    return del_index

In [None]:
array_num_col = ["price", "power"]
count = 0
for column in array_num_col:
    index_del = get_outlier_index(df, column)
    count += len(index_del)
    df = df.drop(index_del, axis=0)
print("Rows removed:", count)

Number of rows selected for removal by price: 50183
Number of rows selected for removal by power: 125483
Rows removed: 175666
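For comparison, the conventional Tukey rule uses the 0.25 and 0.75 quantiles and a factor of 1.5 on both sides. The sketch below implements that standard rule in plain Python; it is not the rule applied in the cell above, and the `quantile` helper (linear interpolation between neighbouring order statistics) and the sample data are assumptions made for illustration.

```python
import math


def quantile(values, q):
    """Quantile with linear interpolation between neighbouring order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def tukey_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = quantile(values, 0.25), quantile(values, 0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sample with one obvious outlier
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
outliers = tukey_outliers(data)
print(outliers)
```

With the standard rule only the extreme value is flagged, whereas the notebook's tighter thresholds removed about half of the rows.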


In [None]:
df.describe()

Unnamed: 0,price,registration_year,power,kilometer
count,178699.0,178699.0,178699.0,178699.0
mean,4247.68166,2003.17276,126.844644,134390.035758
std,3140.030418,6.054909,28.695155,31562.637583
min,251.0,1980.0,81.0,5000.0
25%,1650.0,1999.0,102.0,125000.0
50%,3400.0,2003.0,122.0,150000.0
75%,6300.0,2007.0,150.0,150000.0
max,12549.0,2016.0,199.0,150000.0


The registration_year column contains anomalous values, so we restrict its range: the registration year cannot be later than 2016 or earlier than 1980.

power also contains anomalous values (20,000 hp). We cap them at the power of a BelAZ truck, i.e. 3500 hp (the feature still needs outlier detection). The minimum values of power and price were 0.
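The bounding described above amounts to clipping each value to a fixed interval. A minimal plain-Python sketch follows; the `clip` helper and the sample values are illustrative, with the extremes taken from the describe() output.

```python
def clip(value, lower=None, upper=None):
    """Return value limited to the [lower, upper] interval; None means no bound."""
    if lower is not None and value < lower:
        return lower
    if upper is not None and value > upper:
        return upper
    return value


years = [1000, 1993, 2016, 9999]
powers = [0, 190, 20000]

clipped_years = [clip(y, lower=1980, upper=2016) for y in years]
clipped_powers = [clip(p, upper=3500) for p in powers]

print(clipped_years)
print(clipped_powers)
```

Clipping keeps the row and only replaces the implausible value, unlike outlier removal, which drops the row entirely.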

## Обучение моделей

Split the data into training and test samples and encode the categorical features.

In [None]:
trainX,testX,trainY,testY = train_test_split(df.drop("price",axis = 1),
                                             df["price"],
                                             test_size = 0.25,
                                             random_state = 44)

trainX_wo_ohe,testX_wo_ohe,trainY_wo_ohe,testY_wo_ohe = train_test_split(df.drop("price",axis = 1),
                                             df["price"],
                                             test_size = 0.25,
                                             random_state = 44)
(trainX_wo_ohe_light,
 testX_wo_ohe_light,
 trainY_wo_ohe_light,
 testY_wo_ohe_light) = train_test_split(df.drop("price",axis = 1),
                                             df["price"],
                                             test_size = 0.25,
                                             random_state = 44)
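All three calls above receive the same data, test_size and random_state, so they produce the same partition. The sketch below mimics a seeded shuffle split with the standard library; it is not scikit-learn's implementation, and rounding the test size up is an assumption chosen because it reproduces the training-set size of 134024 reported in the LightGBM log.

```python
import math
import random


def split_indices(n_rows, test_size=0.25, seed=44):
    """Shuffle row indices with a fixed seed and cut off the test part."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)  # assumption: test size rounded up
    return indices[n_test:], indices[:n_test]


# 178699 rows remain after cleaning (see the describe() output)
train_a, test_a = split_indices(178699)
train_b, test_b = split_indices(178699)

print(len(train_a), len(test_a))
print(train_a == train_b and test_a == test_b)
```

Because the seed fixes the shuffle, repeating the call gives identical train and test sets, so one split could be reused for all models.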


In [None]:
# Create the OHE object; handle_unknown='ignore' avoids an error on
# categories that occur only in the test set
ohe = OneHotEncoder(handle_unknown='ignore')

# Encode trainX and testX
# (the encoder is applied to every column, numeric ones included)
trainX_encoded = ohe.fit_transform(trainX)
testX_encoded = ohe.transform(testX)

# Encode trainX_wo_ohe and testX_wo_ohe
trainX_wo_ohe_encoded = ohe.fit_transform(trainX_wo_ohe)
testX_wo_ohe_encoded = ohe.transform(testX_wo_ohe)

# Encode trainX_wo_ohe_light and testX_wo_ohe_light
trainX_wo_ohe_light_encoded = ohe.fit_transform(trainX_wo_ohe_light)
testX_wo_ohe_light_encoded = ohe.transform(testX_wo_ohe_light)

In [None]:
# List of categorical features
categorical_features = ["vehicle_type","gearbox","model","fuel_type","brand","repaired"]

# Apply OHE
# (in scikit-learn >= 1.2 the `sparse` argument is named `sparse_output`)
encoder = OneHotEncoder(categories='auto', handle_unknown='ignore', sparse=False)
trainX_ohe = pd.DataFrame(encoder.fit_transform(trainX_wo_ohe_light[categorical_features]))
testX_ohe = pd.DataFrame(encoder.transform(testX_wo_ohe_light[categorical_features]))

# Drop the original categorical columns from the train and test frames
# NB: after this step trainX_wo_ohe_light and testX_wo_ohe_light contain
# only the numeric columns
trainX_wo_ohe_light = trainX_wo_ohe_light.drop(categorical_features, axis=1)
testX_wo_ohe_light = testX_wo_ohe_light.drop(categorical_features, axis=1)

# Concatenate the numeric and the encoded features for train and test
trainX_light = pd.concat([trainX_wo_ohe_light.reset_index(drop=True), trainX_ohe], axis=1)
testX_light = pd.concat([testX_wo_ohe_light.reset_index(drop=True), testX_ohe], axis=1)

# Build the LightGBM datasets
train_data = lgb.Dataset(trainX_light, label=trainY_wo_ohe_light)
test_data = lgb.Dataset(testX_light, label=testY_wo_ohe_light)
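What one-hot encoding with ignored unknown categories does can be shown on a single column. This is a plain-Python sketch of the idea, not scikit-learn's OneHotEncoder; the helper names and the "semi-auto" test value are hypothetical.

```python
def fit_categories(values):
    """Learn the sorted list of distinct categories from the training column."""
    return sorted(set(values))


def one_hot(value, categories):
    """Encode one value; a category unseen in training becomes all zeros."""
    return [1 if value == category else 0 for category in categories]


train_gearbox = ["manual", "auto", "manual", "unknown"]
test_gearbox = ["auto", "semi-auto"]  # "semi-auto" never appears in training

categories = fit_categories(train_gearbox)
encoded_test = [one_hot(value, categories) for value in test_gearbox]

print(categories)
print(encoded_test)
```

The categories are learned from the training column only, which is why the encoder is fitted on the training sample and merely applied to the test sample.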

In [None]:
%%time
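# LightGBM
# NB: trainX_wo_ohe_light holds only the numeric columns here, because the
# categorical ones were dropped in the encoding cell
lgb_train = lgb.Dataset(trainX_wo_ohe_light, trainY_wo_ohe_light)
lgb_test = lgb.Dataset(testX_wo_ohe_light, testY_wo_ohe_light, reference=lgb_train)
parameters_light = {'metric': 'l2', 'max_depth':10,"random_state": 44,"learning_rate":0.1}
# log_evaluation replaces the verbose_eval argument removed in LightGBM 4.0
light = lgb.train(parameters_light,
                lgb_train,
                num_boost_round=1000,
                valid_sets=[lgb_train, lgb_test],
                callbacks=[lgb.log_evaluation(period=100)])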



You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 172
[LightGBM] [Info] Number of data points in the train set: 134024, number of used features: 3
[LightGBM] [Info] Start training from score 4244.199994
[100]	training's l2: 2.81249e+06	valid_1's l2: 2.87986e+06
[200]	training's l2: 2.73886e+06	valid_1's l2: 2.83029e+06
[300]	training's l2: 2.69808e+06	valid_1's l2: 2.81135e+06
[400]	training's l2: 2.67093e+06	valid_1's l2: 2.80287e+06
[500]	training's l2: 2.65099e+06	valid_1's l2: 2.79855e+06
[600]	training's l2: 2.63464e+06	valid_1's l2: 2.79597e+06
[700]	training's l2: 2.62164e+06	valid_1's l2: 2.79474e+06
[800]	training's l2: 2.61075e+06	valid_1's l2: 2.79396e+06
[900]	training's l2: 2.60214e+06	valid_1's l2: 2.79463e+06
[1000]	training's l2: 2.59301e+06	valid_1's l2: 2.79482e+06
CPU times: total: 23.8 s
Wall time: 2.08 s


Train the models.

In [None]:
%%time

catboost = CatBoostRegressor(loss_function='RMSE')
parameters_cat = {'depth':[5,10], 'learning_rate':np.arange(0.1,1,0.2)}
catboost_grid = catboost.grid_search(parameters_cat,
            Pool(trainX_wo_ohe, trainY_wo_ohe, cat_features=["vehicle_type","gearbox",
                                                             "model","fuel_type",
                                                             "brand","repaired"]),
            cv=3,
            verbose=True,
            plot=False)

0:	learn: 4834.5359703	test: 4835.4878887	best: 4835.4878887 (0)	total: 209ms	remaining: 3m 29s
1:	learn: 4442.8700644	test: 4443.4451348	best: 4443.4451348 (1)	total: 244ms	remaining: 2m 1s
2:	learn: 4096.2675677	test: 4096.5037928	best: 4096.5037928 (2)	total: 304ms	remaining: 1m 40s
3:	learn: 3787.1837810	test: 3788.8374524	best: 3788.8374524 (3)	total: 359ms	remaining: 1m 29s
4:	learn: 3515.1585596	test: 3517.4958786	best: 3517.4958786 (4)	total: 408ms	remaining: 1m 21s
5:	learn: 3276.8547708	test: 3279.7816856	best: 3279.7816856 (5)	total: 465ms	remaining: 1m 17s
6:	learn: 3057.9762036	test: 3060.3277403	best: 3060.3277403 (6)	total: 526ms	remaining: 1m 14s
7:	learn: 2869.5197399	test: 2871.0046767	best: 2871.0046767 (7)	total: 562ms	remaining: 1m 9s
8:	learn: 2700.4746683	test: 2701.4443774	best: 2701.4443774 (8)	total: 614ms	remaining: 1m 7s
9:	learn: 2556.3642229	test: 2558.5470876	best: 2558.5470876 (9)	total: 653ms	remaining: 1m 4s
10:	learn: 2432.0029952	test: 2434.9328497	b

In [None]:
catboost_grid["params"]

{'depth': 10, 'learning_rate': 0.1}
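The grid above combines two depths with the five learning rates produced by np.arange(0.1, 1, 0.2), i.e. ten candidate settings. The sketch below enumerates them with the standard library; rounding the rates to one decimal is an assumption made for readable output.

```python
from itertools import product

depths = [5, 10]
# np.arange(0.1, 1, 0.2) yields approximately 0.1, 0.3, 0.5, 0.7, 0.9
learning_rates = [round(0.1 + 0.2 * i, 1) for i in range(5)]

grid = [
    {"depth": depth, "learning_rate": rate}
    for depth, rate in product(depths, learning_rates)
]

print(len(grid))
for params in grid:
    print(params)
```

With cv=3 each of these combinations is evaluated on three folds, which explains why the search is by far the slowest step in the notebook.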

Measure the training time of each model.

In [None]:
%%time
final_cat = CatBoostRegressor(depth=catboost_grid["params"]["depth"],
                              learning_rate=catboost_grid["params"]['learning_rate'],
                              loss_function='RMSE',verbose=100)
final_cat.fit(Pool(trainX_wo_ohe,trainY_wo_ohe,
                   cat_features=["vehicle_type","gearbox","model","fuel_type","brand","repaired"]))

0:	learn: 2923.6273064	total: 155ms	remaining: 2m 34s
100:	learn: 1291.9777481	total: 13.3s	remaining: 1m 58s
200:	learn: 1227.9066944	total: 27.1s	remaining: 1m 47s
300:	learn: 1190.8888067	total: 41.2s	remaining: 1m 35s
400:	learn: 1163.1763125	total: 55.4s	remaining: 1m 22s
500:	learn: 1141.8208551	total: 1m 9s	remaining: 1m 9s
600:	learn: 1123.8129450	total: 1m 23s	remaining: 55.5s
700:	learn: 1107.5811035	total: 1m 37s	remaining: 41.8s
800:	learn: 1094.8237103	total: 1m 52s	remaining: 27.9s
900:	learn: 1081.8767338	total: 2m 7s	remaining: 14s
999:	learn: 1070.7666280	total: 2m 21s	remaining: 0us
CPU times: total: 20min 19s
Wall time: 2min 22s


<catboost.core.CatBoostRegressor at 0x1d794049b20>

In [None]:
%%time
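# LightGBM
# NB: trainX_wo_ohe_light holds only the numeric columns here, because the
# categorical ones were dropped in the encoding cell
lgb_train = lgb.Dataset(trainX_wo_ohe_light, trainY_wo_ohe_light)
lgb_test = lgb.Dataset(testX_wo_ohe_light, testY_wo_ohe_light, reference=lgb_train)
parameters_light = {'metric': 'l2', 'max_depth':10,"random_state": 44,"learning_rate":0.1}
# log_evaluation replaces the verbose_eval argument removed in LightGBM 4.0
light = lgb.train(parameters_light,
                lgb_train,
                num_boost_round=1000,
                valid_sets=[lgb_train, lgb_test],
                callbacks=[lgb.log_evaluation(period=100)])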

You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 172
[LightGBM] [Info] Number of data points in the train set: 134024, number of used features: 3
[LightGBM] [Info] Start training from score 4244.199994




[100]	training's l2: 2.81249e+06	valid_1's l2: 2.87986e+06
[200]	training's l2: 2.73886e+06	valid_1's l2: 2.83029e+06
[300]	training's l2: 2.69808e+06	valid_1's l2: 2.81135e+06
[400]	training's l2: 2.67093e+06	valid_1's l2: 2.80287e+06
[500]	training's l2: 2.65099e+06	valid_1's l2: 2.79855e+06
[600]	training's l2: 2.63464e+06	valid_1's l2: 2.79597e+06
[700]	training's l2: 2.62164e+06	valid_1's l2: 2.79474e+06
[800]	training's l2: 2.61075e+06	valid_1's l2: 2.79396e+06
[900]	training's l2: 2.60214e+06	valid_1's l2: 2.79463e+06
[1000]	training's l2: 2.59301e+06	valid_1's l2: 2.79482e+06
CPU times: total: 22.8 s
Wall time: 1.95 s
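The LightGBM log reports l2, i.e. mean squared error, while CatBoost reports RMSE. Taking the square root puts the two on the same scale. The values below are copied from the log above; the variable names are illustrative.

```python
import math

# valid_1 l2 values copied from the LightGBM log
final_valid_l2 = 2.79482e6  # iteration 1000
best_valid_l2 = 2.79396e6   # iteration 800, the lowest value in the log

final_rmse = math.sqrt(final_valid_l2)
best_rmse = math.sqrt(best_valid_l2)

print(f"RMSE at iteration 1000: {final_rmse:.1f}")
print(f"RMSE at iteration 800:  {best_rmse:.1f}")
```

Both values are close to 1672 euros, and the validation error stops improving after roughly 800 iterations.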


In [None]:
%%time
# NB: TimeSeriesSplit trains each fold on an expanding prefix of the rows;
# the rows are not ordered by time here
my_cv = TimeSeriesSplit(n_splits=3).split(trainX_wo_ohe_light)
cat_model = CatBoostRegressor()
cat_cv_scores = cross_val_score(cat_model, trainX_wo_ohe_light, trainY_wo_ohe_light, scoring='neg_root_mean_squared_error', cv=my_cv)
cat_cv_rmse = -cat_cv_scores.mean()
print('Mean cross-validation RMSE of CatBoostRegressor:', cat_cv_rmse)

Learning rate set to 0.07131
0:	learn: 2992.9381086	total: 4.2ms	remaining: 4.2s
1:	learn: 2878.3339957	total: 7.78ms	remaining: 3.88s
2:	learn: 2772.4524401	total: 11.5ms	remaining: 3.82s
3:	learn: 2681.2942110	total: 15.4ms	remaining: 3.83s
4:	learn: 2593.1343882	total: 19.2ms	remaining: 3.82s
5:	learn: 2513.2113388	total: 22.9ms	remaining: 3.79s
6:	learn: 2442.5796814	total: 26.5ms	remaining: 3.75s
7:	learn: 2377.4913343	total: 30.5ms	remaining: 3.78s
8:	learn: 2318.0583509	total: 34.2ms	remaining: 3.76s
9:	learn: 2265.3444353	total: 37.7ms	remaining: 3.73s
10:	learn: 2219.6450864	total: 41.5ms	remaining: 3.73s
11:	learn: 2176.1430811	total: 45ms	remaining: 3.7s
12:	learn: 2138.5690438	total: 48.5ms	remaining: 3.68s
13:	learn: 2107.5468889	total: 52ms	remaining: 3.67s
14:	learn: 2075.6481949	total: 55.7ms	remaining: 3.66s
15:	learn: 2047.5228526	total: 59.2ms	remaining: 3.64s
16:	learn: 2022.4065362	total: 63ms	remaining: 3.64s
17:	learn: 2001.3755194	total: 66.7ms	remaining: 3.64s


In [None]:
%%time
my_cv = TimeSeriesSplit(n_splits=3).split(trainX_wo_ohe_light)
# LGBMRegressor is accessed through the lgb alias imported at the top
lgb_model = lgb.LGBMRegressor()
lgb_cv_scores = cross_val_score(lgb_model, trainX_wo_ohe_light, trainY_wo_ohe_light, scoring='neg_root_mean_squared_error', cv=my_cv)
lgb_cv_rmse = -lgb_cv_scores.mean()
print('Mean cross-validation RMSE of LGBMRegressor:', lgb_cv_rmse)

Mean cross-validation RMSE of LGBMRegressor: 1696.8549365511274
CPU times: total: 7.8 s
Wall time: 675 ms
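Both cross-validation cells use TimeSeriesSplit with three splits, which trains on a growing prefix of the rows and tests on the block that follows. The sketch below reproduces that fold layout in plain Python, assuming the default settings (no gap, no maximum train size); it is an illustration, not scikit-learn's code.

```python
def time_series_folds(n_samples, n_splits=3):
    """Expanding-window folds: train on rows before the test block."""
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    folds = []
    for test_start in range(first_test_start, n_samples, test_size):
        train_idx = list(range(test_start))
        test_idx = list(range(test_start, test_start + test_size))
        folds.append((train_idx, test_idx))
    return folds


folds = time_series_folds(8, n_splits=3)
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

The first fold therefore trains on only about a quarter of the rows. Since the rows here have no time order, a shuffled K-fold split would be the more usual choice.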


## Model analysis

Mean cross-validation RMSE of CatBoostRegressor: 1688.0292544740926
CPU times: total: 2min 21s
Wall time: 15.2 s

Mean cross-validation RMSE of LGBMRegressor: 1696.8549365511274
CPU times: total: 7.8 s
Wall time: 675 ms

## Conclusion

According to the task, the deciding factors when choosing a model are:

- training time;
- prediction time;
- prediction quality.

We conclude that quality is comparable (mean cross-validation RMSE of about 1688 for CatBoostRegressor versus about 1697 for LGBMRegressor), while LGBMRegressor trains and makes predictions considerably faster (675 ms versus 15.2 s of wall time for cross-validation).
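The trade-off can be put into numbers using the figures reported in the model analysis section. This is a small arithmetic sketch; the variable names are illustrative.

```python
# Cross-validation results reported in the model analysis section
cat_rmse, lgb_rmse = 1688.0292544740926, 1696.8549365511274
cat_wall_s, lgb_wall_s = 15.2, 0.675  # wall time in seconds

# How much worse LightGBM is, relative to CatBoost
rmse_gap = (lgb_rmse - cat_rmse) / cat_rmse
# How many times faster the LightGBM cross-validation ran
speedup = cat_wall_s / lgb_wall_s

print(f"Relative RMSE gap: {rmse_gap:.2%}")
print(f"Wall-time ratio: {speedup:.1f}x")
```

The RMSE gap is about half a percent, while the cross-validation wall time differs by more than a factor of twenty.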