# Определение стоимости автомобилей

Сервис по продаже автомобилей с пробегом «Не бит, не крашен» разрабатывает приложение для привлечения новых клиентов. В нём можно быстро узнать рыночную стоимость своего автомобиля. В моём распоряжении исторические данные: технические характеристики, комплектации и цены автомобилей. Построю модель для определения стоимости. 

Ключевые метрики:

- качество предсказания;
- скорость предсказания;
- время обучения.

<h1>Содержание<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка-данных" data-toc-modified-id="Подготовка-данных-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Подготовка данных</a></span><ul class="toc-item"><li><span><a href="#Обработаем-выбросы" data-toc-modified-id="Обработаем-выбросы-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Обработаем выбросы</a></span></li><li><span><a href="#Удалим-пустые-признаки-и-не-имеющие-отношения-к-машинам." data-toc-modified-id="Удалим-пустые-признаки-и-не-имеющие-отношения-к-машинам.-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Удалим пустые признаки и не имеющие отношения к машинам.</a></span></li></ul></li><li><span><a href="#Обучение-моделей" data-toc-modified-id="Обучение-моделей-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Обучение моделей</a></span><ul class="toc-item"><li><span><a href="#Используем-LightGBM" data-toc-modified-id="Используем-LightGBM-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Используем LightGBM</a></span></li><li><span><a href="#Перебор-гиперпараметров" data-toc-modified-id="Перебор-гиперпараметров-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Перебор гиперпараметров</a></span></li><li><span><a href="#Предикт-на-тесте" data-toc-modified-id="Предикт-на-тесте-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Предикт на тесте</a></span></li><li><span><a href="#Используем-CatBoost" data-toc-modified-id="Используем-CatBoost-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Используем CatBoost</a></span></li><li><span><a href="#Предикт-на-тесте" data-toc-modified-id="Предикт-на-тесте-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Предикт на тесте</a></span></li><li><span><a href="#Интерпретация-признаков-модели" data-toc-modified-id="Интерпретация-признаков-модели-2.6"><span class="toc-item-num">2.6&nbsp;&nbsp;</span>Интерпретация признаков модели</a></span></li><li><span><a href="#Используем-LinearRegression" data-toc-modified-id="Используем-LinearRegression-2.7"><span class="toc-item-num">2.7&nbsp;&nbsp;</span>Используем LinearRegression</a></span></li></ul></li><li><span><a href="#Анализ-моделей" data-toc-modified-id="Анализ-моделей-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Анализ моделей</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Чек-лист проверки</a></span></li></ul></div>

## Подготовка данных

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
import lightgbm as lgb
from sklearn.preprocessing import OrdinalEncoder
from catboost import CatBoostRegressor
import time
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

In [2]:
try:
    auto = pd.read_csv(
        '/Users/bogda/anaconda3/projects/praktikum/project9_auto/autos.csv'
    )
except FileNotFoundError as e:
    print(e)
    auto = pd.read_csv('/datasets/autos.csv')

In [3]:
auto.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,2016-03-24 11:52:17,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


In [4]:
auto.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Kilometer          354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

In [5]:
auto.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Price,354369.0,4416.656776,4514.158514,0.0,1050.0,2700.0,6400.0,20000.0
RegistrationYear,354369.0,2004.234448,90.227958,1000.0,1999.0,2003.0,2008.0,9999.0
Power,354369.0,110.094337,189.850405,0.0,69.0,105.0,143.0,20000.0
Kilometer,354369.0,128211.172535,37905.34153,5000.0,125000.0,150000.0,150000.0,150000.0
RegistrationMonth,354369.0,5.714645,3.726421,0.0,3.0,6.0,9.0,12.0
NumberOfPictures,354369.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
PostalCode,354369.0,50508.689087,25783.096248,1067.0,30165.0,49413.0,71083.0,99998.0


**Видим выбросы в RegistrationYear и Power. Есть пропущенные значения в VehicleType, Model, FeulType, NotRepaired обработаем ниже.**

In [6]:
auto['VehicleType'].unique()

array([nan, 'coupe', 'suv', 'small', 'sedan', 'convertible', 'bus',
       'wagon', 'other'], dtype=object)

In [7]:
auto['VehicleType'] = auto['VehicleType'].fillna('unknow')

In [8]:
auto['Gearbox'].unique()

array(['manual', 'auto', nan], dtype=object)

In [9]:
auto['Gearbox'].isnull().sum()

19833

In [76]:
auto['Gearbox'].duplicated().mean()

0.9999915342481989

In [80]:
auto['Gearbox'].isnull().sum() / len(auto['Gearbox'])

0.0

In [10]:
auto['Gearbox'] = auto['Gearbox'].fillna('unknow')

In [11]:
auto['Model'] = auto['Model'].fillna('unknow')

In [12]:
auto['FuelType'] = auto['FuelType'].fillna('unknow')

In [13]:
auto['NotRepaired'] = auto['NotRepaired'].fillna('unknow')

### Обработаем выбросы

**Автомобилей с мощностью 0 л.с. и 20000 не бывает, примем минимальную мощность за мощность "Оки" - 33 л.с. а максимальную за мощность Porsche 911 GT2 RS - 700 л.с., остльные значения заменим медианой.**

In [14]:
median_p = auto['Power'].median()

In [15]:
auto.loc[auto['Power'] < 33]['Power'] = median_p

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  auto.loc[auto['Power'] < 33]['Power'] = median_p


In [16]:
auto.loc[auto['Power'] > 700]['Power'] = median_p

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  auto.loc[auto['Power'] > 700]['Power'] = median_p


**Примем 1937 г. регистрации автомобиля за минимальное значение и текущий год за максимальное, остальное заменим медианой.**

In [17]:
median_r = auto['RegistrationYear'].median()

In [18]:
auto.loc[auto['RegistrationYear'] < 1970]['RegistrationYear'] = median_r

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  auto.loc[auto['RegistrationYear'] < 1970]['RegistrationYear'] = median_r


In [19]:
auto.loc[auto['RegistrationYear'] > 2022]['RegistrationYear'] = median_r

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  auto.loc[auto['RegistrationYear'] > 2022]['RegistrationYear'] = median_r


### Удалим пустые признаки и не имеющие отношения к машинам.

**NumberOfPictures - пустой столбец. DateCrawled, DateCreated дата скачивания анкеты и дата создания кароточки не влияют на цену. Lastseen,PostalCode последнее время активности пользователя в сети и почтовый код тоже.**

In [20]:
auto_pre = auto.drop(['NumberOfPictures', 'DateCrawled', 'DateCreated', 'LastSeen', 'PostalCode'], axis=1)

In [21]:
auto_pre

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,NotRepaired
0,480,unknow,1993,manual,0,golf,150000,0,petrol,volkswagen,unknow
1,18300,coupe,2011,manual,190,unknow,125000,5,gasoline,audi,yes
2,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,unknow
3,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no
4,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no
...,...,...,...,...,...,...,...,...,...,...,...
354364,0,unknow,2005,manual,0,colt,150000,7,petrol,mitsubishi,yes
354365,2200,unknow,2005,unknow,0,unknow,20000,1,unknow,sonstige_autos,unknow
354366,1199,convertible,2000,auto,101,fortwo,125000,3,petrol,smart,no
354367,9200,bus,1996,manual,102,transporter,150000,3,gasoline,volkswagen,no


## Обучение моделей

### Используем LightGBM

* кодируем...

In [22]:
encoder = OrdinalEncoder()

In [23]:
auto_ord = pd.DataFrame(encoder.fit_transform(auto_pre), columns=auto_pre.columns)

In [24]:
auto_ord

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,NotRepaired
0,256.0,7.0,86.0,1.0,0.0,116.0,12.0,0.0,6.0,38.0,1.0
1,3587.0,2.0,104.0,1.0,190.0,228.0,11.0,5.0,2.0,1.0,2.0
2,2590.0,6.0,97.0,0.0,163.0,117.0,11.0,8.0,2.0,14.0,1.0
3,696.0,5.0,94.0,1.0,75.0,116.0,12.0,6.0,6.0,38.0,0.0
4,1333.0,5.0,101.0,1.0,69.0,101.0,9.0,7.0,2.0,31.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
354364,0.0,7.0,98.0,1.0,0.0,78.0,12.0,7.0,6.0,22.0,2.0
354365,923.0,7.0,98.0,2.0,0.0,228.0,2.0,1.0,7.0,33.0,1.0
354366,568.0,1.0,93.0,0.0,101.0,106.0,11.0,3.0,6.0,32.0,0.0
354367,2492.0,0.0,89.0,1.0,102.0,224.0,12.0,3.0,2.0,38.0,0.0


In [25]:
auto_train, auto_valid = train_test_split(auto_ord, test_size=0.4, random_state=77)
auto_valid, auto_test = train_test_split(auto_valid, test_size=0.5, random_state=77)

**Проверим размер выборок**

In [26]:
auto_train

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,NotRepaired
336806,909.0,4.0,98.0,1.0,136.0,10.0,12.0,3.0,6.0,25.0,0.0
252536,1115.0,5.0,95.0,0.0,75.0,83.0,2.0,1.0,6.0,24.0,0.0
199374,611.0,4.0,86.0,1.0,115.0,11.0,12.0,3.0,6.0,2.0,0.0
189687,0.0,7.0,110.0,1.0,170.0,15.0,12.0,0.0,4.0,2.0,1.0
184728,339.0,3.0,94.0,1.0,64.0,62.0,12.0,4.0,2.0,38.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...
138904,1252.0,7.0,90.0,2.0,0.0,228.0,11.0,12.0,6.0,2.0,0.0
107813,1862.0,5.0,102.0,1.0,174.0,8.0,11.0,2.0,6.0,25.0,0.0
215275,1954.0,5.0,102.0,1.0,99.0,123.0,8.0,1.0,6.0,11.0,0.0
74335,2626.0,8.0,101.0,1.0,109.0,77.0,11.0,1.0,2.0,21.0,0.0


In [27]:
len(auto_train) / len(auto)

0.5999988712330931

In [28]:
auto_valid

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,NotRepaired
34765,1537.0,4.0,100.0,0.0,150.0,95.0,12.0,4.0,2.0,20.0,0.0
3473,1439.0,0.0,99.0,1.0,90.0,150.0,10.0,6.0,6.0,24.0,0.0
35654,1447.0,4.0,98.0,1.0,136.0,103.0,12.0,5.0,2.0,10.0,0.0
213254,1690.0,3.0,92.0,1.0,102.0,166.0,12.0,1.0,2.0,38.0,0.0
128016,1943.0,2.0,101.0,2.0,71.0,106.0,9.0,8.0,6.0,32.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
1978,2469.0,8.0,100.0,1.0,102.0,224.0,12.0,1.0,2.0,38.0,0.0
150583,242.0,5.0,83.0,1.0,52.0,228.0,12.0,7.0,6.0,35.0,0.0
294070,222.0,7.0,110.0,1.0,0.0,102.0,12.0,0.0,7.0,10.0,2.0
84331,1028.0,2.0,95.0,1.0,0.0,214.0,11.0,1.0,6.0,9.0,0.0


In [29]:
len(auto_valid) / len(auto)

0.2000005643834534

In [30]:
auto_test

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,NotRepaired
46010,1690.0,4.0,105.0,1.0,60.0,8.0,6.0,2.0,6.0,25.0,1.0
255055,0.0,3.0,91.0,1.0,102.0,11.0,12.0,10.0,6.0,2.0,0.0
24666,1674.0,8.0,96.0,0.0,170.0,95.0,12.0,4.0,2.0,20.0,0.0
78915,502.0,1.0,87.0,1.0,192.0,11.0,12.0,3.0,6.0,2.0,1.0
141315,2014.0,4.0,98.0,1.0,185.0,4.0,11.0,12.0,6.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
255713,169.0,5.0,92.0,1.0,75.0,103.0,12.0,2.0,6.0,10.0,2.0
350593,1663.0,7.0,110.0,2.0,125.0,154.0,12.0,0.0,7.0,10.0,1.0
311687,2293.0,7.0,111.0,1.0,110.0,149.0,12.0,9.0,7.0,27.0,0.0
323125,923.0,1.0,88.0,1.0,150.0,20.0,12.0,9.0,6.0,1.0,0.0


In [31]:
len(auto_test) / len(auto)

0.2000005643834534

In [32]:
features_train = auto_train.drop(['Price'], axis=1)
target_train = auto_train['Price']
features_valid = auto_valid.drop(['Price'], axis=1)
target_valid = auto_valid['Price']
features_test = auto_test.drop(['Price'], axis=1)
target_test = auto_test['Price']

In [33]:
booster = lgb.LGBMRegressor(n_estimators=1201, max_depth=21, random_state=77)

In [34]:
start_time = time.time()
booster.fit(features_train, target_train, eval_set=[(features_valid, target_valid)], eval_metric='rmse')
print('\\\%s seconds///' % (time.time() - start_time))

[1]	valid_0's rmse: 895.115	valid_0's l2: 801231
[2]	valid_0's rmse: 837.562	valid_0's l2: 701510
[3]	valid_0's rmse: 786.714	valid_0's l2: 618920
[4]	valid_0's rmse: 743.504	valid_0's l2: 552799
[5]	valid_0's rmse: 705.19	valid_0's l2: 497293
[6]	valid_0's rmse: 671.892	valid_0's l2: 451439
[7]	valid_0's rmse: 642.259	valid_0's l2: 412496
[8]	valid_0's rmse: 617.367	valid_0's l2: 381142
[9]	valid_0's rmse: 595.293	valid_0's l2: 354374
[10]	valid_0's rmse: 576.622	valid_0's l2: 332493
[11]	valid_0's rmse: 559.689	valid_0's l2: 313252
[12]	valid_0's rmse: 545.382	valid_0's l2: 297441
[13]	valid_0's rmse: 532.823	valid_0's l2: 283900
[14]	valid_0's rmse: 521.933	valid_0's l2: 272414
[15]	valid_0's rmse: 512.181	valid_0's l2: 262330
[16]	valid_0's rmse: 503.721	valid_0's l2: 253734
[17]	valid_0's rmse: 495.523	valid_0's l2: 245543
[18]	valid_0's rmse: 488.381	valid_0's l2: 238516
[19]	valid_0's rmse: 482.345	valid_0's l2: 232657
[20]	valid_0's rmse: 476.189	valid_0's l2: 226756
[21]	valid

[240]	valid_0's rmse: 390.303	valid_0's l2: 152337
[241]	valid_0's rmse: 390.276	valid_0's l2: 152315
[242]	valid_0's rmse: 390.261	valid_0's l2: 152304
[243]	valid_0's rmse: 390.241	valid_0's l2: 152288
[244]	valid_0's rmse: 390.159	valid_0's l2: 152224
[245]	valid_0's rmse: 390.103	valid_0's l2: 152180
[246]	valid_0's rmse: 390.016	valid_0's l2: 152112
[247]	valid_0's rmse: 389.958	valid_0's l2: 152068
[248]	valid_0's rmse: 389.917	valid_0's l2: 152036
[249]	valid_0's rmse: 389.822	valid_0's l2: 151961
[250]	valid_0's rmse: 389.808	valid_0's l2: 151950
[251]	valid_0's rmse: 389.75	valid_0's l2: 151905
[252]	valid_0's rmse: 389.728	valid_0's l2: 151888
[253]	valid_0's rmse: 389.678	valid_0's l2: 151849
[254]	valid_0's rmse: 389.617	valid_0's l2: 151801
[255]	valid_0's rmse: 389.606	valid_0's l2: 151793
[256]	valid_0's rmse: 389.6	valid_0's l2: 151788
[257]	valid_0's rmse: 389.535	valid_0's l2: 151738
[258]	valid_0's rmse: 389.476	valid_0's l2: 151692
[259]	valid_0's rmse: 389.413	vali

[406]	valid_0's rmse: 384.787	valid_0's l2: 148061
[407]	valid_0's rmse: 384.755	valid_0's l2: 148036
[408]	valid_0's rmse: 384.769	valid_0's l2: 148047
[409]	valid_0's rmse: 384.754	valid_0's l2: 148035
[410]	valid_0's rmse: 384.737	valid_0's l2: 148023
[411]	valid_0's rmse: 384.702	valid_0's l2: 147996
[412]	valid_0's rmse: 384.691	valid_0's l2: 147987
[413]	valid_0's rmse: 384.671	valid_0's l2: 147971
[414]	valid_0's rmse: 384.65	valid_0's l2: 147955
[415]	valid_0's rmse: 384.631	valid_0's l2: 147941
[416]	valid_0's rmse: 384.631	valid_0's l2: 147941
[417]	valid_0's rmse: 384.604	valid_0's l2: 147920
[418]	valid_0's rmse: 384.571	valid_0's l2: 147895
[419]	valid_0's rmse: 384.552	valid_0's l2: 147880
[420]	valid_0's rmse: 384.538	valid_0's l2: 147869
[421]	valid_0's rmse: 384.535	valid_0's l2: 147867
[422]	valid_0's rmse: 384.491	valid_0's l2: 147833
[423]	valid_0's rmse: 384.448	valid_0's l2: 147801
[424]	valid_0's rmse: 384.43	valid_0's l2: 147786
[425]	valid_0's rmse: 384.411	val

[627]	valid_0's rmse: 381.106	valid_0's l2: 145242
[628]	valid_0's rmse: 381.097	valid_0's l2: 145235
[629]	valid_0's rmse: 381.094	valid_0's l2: 145233
[630]	valid_0's rmse: 381.064	valid_0's l2: 145210
[631]	valid_0's rmse: 381.029	valid_0's l2: 145183
[632]	valid_0's rmse: 381.008	valid_0's l2: 145167
[633]	valid_0's rmse: 380.989	valid_0's l2: 145152
[634]	valid_0's rmse: 380.986	valid_0's l2: 145151
[635]	valid_0's rmse: 380.982	valid_0's l2: 145147
[636]	valid_0's rmse: 380.959	valid_0's l2: 145130
[637]	valid_0's rmse: 380.94	valid_0's l2: 145115
[638]	valid_0's rmse: 380.914	valid_0's l2: 145096
[639]	valid_0's rmse: 380.894	valid_0's l2: 145080
[640]	valid_0's rmse: 380.903	valid_0's l2: 145087
[641]	valid_0's rmse: 380.885	valid_0's l2: 145073
[642]	valid_0's rmse: 380.846	valid_0's l2: 145044
[643]	valid_0's rmse: 380.829	valid_0's l2: 145030
[644]	valid_0's rmse: 380.816	valid_0's l2: 145021
[645]	valid_0's rmse: 380.823	valid_0's l2: 145026
[646]	valid_0's rmse: 380.834	va

[845]	valid_0's rmse: 379.237	valid_0's l2: 143821
[846]	valid_0's rmse: 379.214	valid_0's l2: 143803
[847]	valid_0's rmse: 379.218	valid_0's l2: 143806
[848]	valid_0's rmse: 379.218	valid_0's l2: 143806
[849]	valid_0's rmse: 379.211	valid_0's l2: 143801
[850]	valid_0's rmse: 379.208	valid_0's l2: 143799
[851]	valid_0's rmse: 379.182	valid_0's l2: 143779
[852]	valid_0's rmse: 379.167	valid_0's l2: 143768
[853]	valid_0's rmse: 379.141	valid_0's l2: 143748
[854]	valid_0's rmse: 379.136	valid_0's l2: 143744
[855]	valid_0's rmse: 379.096	valid_0's l2: 143714
[856]	valid_0's rmse: 379.099	valid_0's l2: 143716
[857]	valid_0's rmse: 379.082	valid_0's l2: 143703
[858]	valid_0's rmse: 379.068	valid_0's l2: 143693
[859]	valid_0's rmse: 379.05	valid_0's l2: 143679
[860]	valid_0's rmse: 379.047	valid_0's l2: 143676
[861]	valid_0's rmse: 379.036	valid_0's l2: 143668
[862]	valid_0's rmse: 379.024	valid_0's l2: 143659
[863]	valid_0's rmse: 379.014	valid_0's l2: 143652
[864]	valid_0's rmse: 379.003	va

[1050]	valid_0's rmse: 377.706	valid_0's l2: 142662
[1051]	valid_0's rmse: 377.698	valid_0's l2: 142656
[1052]	valid_0's rmse: 377.684	valid_0's l2: 142646
[1053]	valid_0's rmse: 377.686	valid_0's l2: 142646
[1054]	valid_0's rmse: 377.673	valid_0's l2: 142637
[1055]	valid_0's rmse: 377.66	valid_0's l2: 142627
[1056]	valid_0's rmse: 377.657	valid_0's l2: 142625
[1057]	valid_0's rmse: 377.636	valid_0's l2: 142609
[1058]	valid_0's rmse: 377.629	valid_0's l2: 142604
[1059]	valid_0's rmse: 377.62	valid_0's l2: 142597
[1060]	valid_0's rmse: 377.61	valid_0's l2: 142589
[1061]	valid_0's rmse: 377.615	valid_0's l2: 142593
[1062]	valid_0's rmse: 377.622	valid_0's l2: 142598
[1063]	valid_0's rmse: 377.619	valid_0's l2: 142596
[1064]	valid_0's rmse: 377.609	valid_0's l2: 142588
[1065]	valid_0's rmse: 377.615	valid_0's l2: 142593
[1066]	valid_0's rmse: 377.615	valid_0's l2: 142593
[1067]	valid_0's rmse: 377.613	valid_0's l2: 142592
[1068]	valid_0's rmse: 377.587	valid_0's l2: 142572
[1069]	valid_0'

### Перебор гиперпараметров

In [35]:
param_boost = {'n_estimators': range (1, 1500, 300), 
                'max_depth': range (1, 35, 5)}

In [36]:
grid = GridSearchCV(booster, param_grid=param_boost, scoring='neg_root_mean_squared_error', n_jobs=-1)

In [37]:
start_time = time.time()
grid.fit(features_train, target_train)
print('\\\%s seconds///' % (time.time() - start_time))

\\169.9857039451599 seconds///


In [38]:
grid.best_params_

{'max_depth': 21, 'n_estimators': 1201}

In [39]:
print('RMSE', grid.best_score_ * -1)

RMSE 375.6599684051903


In [40]:
grid.cv_results_

{'mean_fit_time': array([ 0.28311591,  1.74382005,  3.37671199,  4.75812693,  6.08597903,
         0.38110738,  3.68223348,  6.7506969 ,  8.94142957, 11.35515246,
         0.33449807,  3.25503254,  5.47662191,  8.14978771, 10.28121219,
         0.31877036,  3.32615924,  5.70756178,  7.74875417,  9.69900293,
         0.31956034,  3.36175413,  5.70579948,  7.87974734,  9.83445029,
         0.31071811,  3.3442946 ,  5.44868445,  7.87706003,  9.65170517,
         0.32269411,  3.31992807,  5.60153551,  8.12347932,  9.24494276]),
 'std_fit_time': array([0.00561312, 0.04764482, 0.29617206, 0.32171603, 0.34097471,
        0.01719632, 0.08704186, 0.34433988, 0.19514137, 0.24459214,
        0.00824988, 0.14174995, 0.28208721, 0.21333633, 0.70031311,
        0.02119507, 0.05761416, 0.0623076 , 0.06614049, 0.23662865,
        0.01195187, 0.04903517, 0.32642303, 0.11966533, 0.23568579,
        0.01080508, 0.06274016, 0.07325953, 0.05659683, 0.20292855,
        0.021126  , 0.02826186, 0.05691897, 0.

### Предикт на тесте

In [41]:
start_time = time.time()
predicted_test = booster.predict(features_test)
print('---%s seconds---' % (time.time() - start_time))


---0.750126838684082 seconds---


In [42]:
mean_squared_error(target_test, predicted_test)**0.5


373.61353626973727

In [43]:
print("Test  R2: %.2f"%booster.score(features_test, target_test))

Test  R2: 0.85


###  Используем CatBoost

**Не кодируем...**

In [44]:
train, valid = train_test_split(auto_pre, test_size=0.4, random_state=77)
valid, test = train_test_split(valid, test_size=0.5, random_state=77)

In [45]:
auto_pre.columns

Index(['Price', 'VehicleType', 'RegistrationYear', 'Gearbox', 'Power', 'Model',
       'Kilometer', 'RegistrationMonth', 'FuelType', 'Brand', 'NotRepaired'],
      dtype='object')

In [46]:
X = ['VehicleType', 'RegistrationYear', 'Gearbox', 'Power', 'Model',
       'Kilometer', 'RegistrationMonth', 'FuelType', 'Brand', 'NotRepaired']

In [47]:
 auto_pre.select_dtypes(include='object').columns

Index(['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'NotRepaired'], dtype='object')

In [48]:
cat_features = ['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'NotRepaired']

In [49]:
y = ['Price']

### Перебор гиперпараметров

**Базовые**

In [50]:
params = {'cat_features': cat_features,
         'random_state': 77,
         'verbose': 100}

**Уменьшим learning rate**

In [51]:
params_l = {'cat_features': cat_features,
         'random_state': 77,
         'verbose': 100,
         'learning_rate': 0.06}

**Увеличим глубину**

In [52]:
params_ld = {'cat_features': cat_features,
         'random_state': 77,
         'verbose': 100,
         'learning_rate': 0.06,
         'depth':10}

**Обучим модели с различными параметрами**

In [53]:
model = CatBoostRegressor(**params)

In [54]:
start_time = time.time()
model.fit(train[X], train[y], eval_set=(valid[X], valid[y]))
print('+++%s sec+++' % (time.time() - start_time))

Learning rate set to 0.126883
0:	learn: 4164.2340642	test: 4144.0982490	best: 4144.0982490 (0)	total: 312ms	remaining: 5m 11s
100:	learn: 1876.1370395	test: 1901.5540769	best: 1901.5540769 (100)	total: 13.2s	remaining: 1m 57s
200:	learn: 1800.3567235	test: 1841.0226403	best: 1841.0226403 (200)	total: 25.8s	remaining: 1m 42s
300:	learn: 1763.9416144	test: 1815.2017467	best: 1815.2017467 (300)	total: 38.3s	remaining: 1m 28s
400:	learn: 1738.3376849	test: 1798.0906004	best: 1798.0906004 (400)	total: 51.1s	remaining: 1m 16s
500:	learn: 1718.2789593	test: 1786.8609183	best: 1786.8609183 (500)	total: 1m 3s	remaining: 1m 3s
600:	learn: 1698.9834460	test: 1775.8293537	best: 1775.8013174 (599)	total: 1m 16s	remaining: 50.8s
700:	learn: 1685.4507982	test: 1770.3871784	best: 1770.3871784 (700)	total: 1m 29s	remaining: 38.3s
800:	learn: 1672.8819216	test: 1765.3863677	best: 1765.3863677 (800)	total: 1m 42s	remaining: 25.5s
900:	learn: 1660.5423221	test: 1759.8440696	best: 1759.8440696 (900)	total:

In [55]:
model_l = CatBoostRegressor(**params_l)

In [56]:
start_time = time.time()
model_l.fit(train[X], train[y], eval_set=(valid[X], valid[y]))
print('+++%s sec+++' % (time.time() - start_time))

0:	learn: 4348.2627610	test: 4327.3406653	best: 4327.3406653 (0)	total: 129ms	remaining: 2m 9s
100:	learn: 1974.2306966	test: 1990.2080386	best: 1990.2080386 (100)	total: 12.4s	remaining: 1m 50s
200:	learn: 1884.6447792	test: 1909.5468753	best: 1909.5468753 (200)	total: 24.9s	remaining: 1m 39s
300:	learn: 1836.1632933	test: 1867.6687151	best: 1867.6687151 (300)	total: 38.1s	remaining: 1m 28s
400:	learn: 1804.6932577	test: 1842.8080396	best: 1842.8080396 (400)	total: 50.7s	remaining: 1m 15s
500:	learn: 1783.1902721	test: 1827.6058810	best: 1827.6058810 (500)	total: 1m 3s	remaining: 1m 3s
600:	learn: 1766.0384941	test: 1816.1812438	best: 1816.1691306 (599)	total: 1m 16s	remaining: 51s
700:	learn: 1752.1510637	test: 1807.3389876	best: 1807.3389876 (700)	total: 1m 30s	remaining: 38.4s
800:	learn: 1741.4414237	test: 1801.0976972	best: 1801.0976972 (800)	total: 1m 42s	remaining: 25.5s
900:	learn: 1730.7097693	test: 1794.8017852	best: 1794.8017852 (900)	total: 1m 55s	remaining: 12.7s
999:	lea

In [57]:
model_ld = CatBoostRegressor(**params_ld)

In [58]:
start_time = time.time()
model_ld.fit(train[X], train[y], eval_set=(valid[X], valid[y]))
print('+++%s sec+++' % (time.time() - start_time))

0:	learn: 4326.7357024	test: 4305.2120729	best: 4305.2120729 (0)	total: 205ms	remaining: 3m 24s
100:	learn: 1827.5146525	test: 1873.0278442	best: 1873.0278442 (100)	total: 18.8s	remaining: 2m 47s
200:	learn: 1728.4147848	test: 1805.9498530	best: 1805.9498530 (200)	total: 38.3s	remaining: 2m 32s
300:	learn: 1669.9146376	test: 1775.6830403	best: 1775.6830403 (300)	total: 57.8s	remaining: 2m 14s
400:	learn: 1629.5125194	test: 1758.2439656	best: 1758.2439656 (400)	total: 1m 18s	remaining: 1m 56s
500:	learn: 1601.1659546	test: 1748.0174363	best: 1748.0174363 (500)	total: 1m 38s	remaining: 1m 38s
600:	learn: 1578.2561726	test: 1740.7249301	best: 1740.7249301 (600)	total: 1m 59s	remaining: 1m 19s
700:	learn: 1556.6693847	test: 1734.6498429	best: 1734.6498429 (700)	total: 2m 18s	remaining: 59.2s
800:	learn: 1535.4672351	test: 1729.1992082	best: 1729.1992082 (800)	total: 2m 40s	remaining: 39.8s
900:	learn: 1516.4051665	test: 1725.1929836	best: 1725.1903882 (899)	total: 3m 1s	remaining: 19.9s
99

####  Вывод
**Ручное назначение параметра Depth привело к незначительному улучшению метрики, используем её в дальнейшем**

### Предикт на тесте

In [60]:
start_time = time.time()
test_pred = model.predict(test[X])
print("--- %s seconds ---" % (time.time() - start_time))

--- 0.19759273529052734 seconds ---


In [61]:
start_time = time.time()
test_pred = model_ld.predict(test[X])
print("--- %s seconds ---" % (time.time() - start_time))

--- 0.305600643157959 seconds ---


#### Вывод
**Три процента качества предсказания ведут к почти двукратному ухудшению скорости предсказания и времени обучения. Вернемся к предустановленным параметрам**

In [62]:
mean_squared_error(test[y], test_pred) ** 0.5

1704.1089047119826

In [63]:
r2_score(test[y], test_pred)

0.8579151123859945

### Интерпретация признаков модели

In [64]:
model.get_feature_importance(prettified=True)

Unnamed: 0,Feature Id,Importances
0,RegistrationYear,38.180602
1,Power,21.696413
2,VehicleType,11.067203
3,Kilometer,9.736305
4,Brand,7.931206
5,Model,4.835362
6,NotRepaired,3.440595
7,FuelType,1.71748
8,Gearbox,0.923453
9,RegistrationMonth,0.471381


###  Используем LinearRegression

**Исходя из алгоритма лежащего в основе модели, перекодируем признаки правильным методом, исключив dummy - ловушку.**

In [65]:
auto_ohe = pd.get_dummies(auto_pre, drop_first=True, columns=cat_features)

In [66]:
auto_ohe

Unnamed: 0,Price,RegistrationYear,Power,Kilometer,RegistrationMonth,VehicleType_convertible,VehicleType_coupe,VehicleType_other,VehicleType_sedan,VehicleType_small,...,Brand_smart,Brand_sonstige_autos,Brand_subaru,Brand_suzuki,Brand_toyota,Brand_trabant,Brand_volkswagen,Brand_volvo,NotRepaired_unknow,NotRepaired_yes
0,480,1993,0,150000,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
1,18300,2011,190,125000,5,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,9800,2004,163,125000,8,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,1500,2001,75,150000,6,0,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
4,3600,2008,69,90000,7,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
354364,0,2005,0,150000,7,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
354365,2200,2005,0,20000,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
354366,1199,2000,101,125000,3,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
354367,9200,1996,102,150000,3,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


In [67]:
features = auto_ohe.drop('Price', axis=1)

In [68]:
target = auto_ohe['Price']

In [69]:
features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size=0.4, random_state=77)
features_valid, features_test, target_valid, target_test = train_test_split(features, target, test_size=0.5, random_state=77)

In [70]:
start_time = time.time()
model = LinearRegression()
model.fit(features_train, target_train)
print("--- %s seconds ---" % (time.time() - start_time))

--- 2.6868371963500977 seconds ---


In [71]:
start_time = time.time()
pred = model.predict(features_test)
print("--- %s seconds ---" % (time.time() - start_time))

--- 0.38778185844421387 seconds ---


In [72]:
mean_squared_error(target_test, pred) ** 0.5

3168.0523977339503

In [73]:
r2_score(target_test, pred)

0.5058865409946078

## Анализ моделей

* Время обучения модели CatBoost - 126s , RMSE - 1704 , что больше и хуже чем у LightGBM время обучения модели - 5s, RMSE 373 . Возможно сказался метод кодировки переменных он не увеличивает количество признаков, а вписывает их в существующие. Или т.к. RMSE считается в единицах целевого показателя,а LightGBM кодирована в сторону уменьшения, то использовать её для сравнения не коректно, другая относительная метрика R2_score показывает паритет 0.85. Скорость предсказания у CatBoost почти в четыре раза выше 0.20 против 0.75 секунд. LinearRegression работает быстро скорость обучения 2.5 c. , скорость предсказания 0.36. Но точность примерно в два раза хуже RMSE 3168 , r2 0.50.

###  Вывод
**Если необходимо качество предсказания и нет очень большой нагрузки по скорости предсказания, то выбераем LightGBM. Если критична скорость предсказания и можно пожертвовать скоростью обучения и в небольшой степени качеством предсказания то CatBoost.**