# Estimating Car Prices

The used-car sales service «Не бит, не крашен» is developing an app to attract new customers. In the app, users can quickly find out the market value of their car. Historical data are available: technical specifications, trim levels, and car prices. The task is to build a model that estimates the price.

The customer cares about:

- prediction quality;
- prediction speed;
- training time.

**Project goal:** build a model that estimates the price of a car.

<br>**Research tasks:**
<br>1. Explore the source data and preprocess them.
<br>2. Build the models.
<br>3. Draw an overall conclusion.

## Data Preparation

In [2]:
pip install optuna

Collecting optuna
  Downloading optuna-2.10.0-py3-none-any.whl (308 kB)
Collecting cliff
  Downloading cliff-3.10.1-py3-none-any.whl (81 kB)
Collecting cmaes>=0.8.2
  Downloading cmaes-0.8.2-py3-none-any.whl (15 kB)
Collecting colorlog
  Downloading colorlog-6.6.0-py2.py3-none-any.whl (11 kB)
Collecting autopage>=0.4.0
  Downloading autopage-0.5.0-py3-none-any.whl (29 kB)
Collecting PrettyTable>=0.7.2
  Downloading prettytable-3.2.0-py3-none-any.whl (26 kB)
Collecting pbr!=2.1.0,>=2.0.0
  Downloading pbr-5.8.1-py2.py3-none-any.whl (113 kB)
Collecting cmd2>=1.0.0
  Downloading cmd2-2.4.1-py3-none-any.whl (146 kB)
Collecting stevedore>=2.0.1
  Downloading stevedore-3.5.0-py3-none-any.whl (49 kB)

### General Information About the Data

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
import optuna

In [4]:
data_raw = pd.read_csv('/datasets/autos.csv')

In [5]:
data_raw.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,2016-03-24 11:52:17,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


In [6]:
data_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Kilometer          354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(9)

According to the data description:

DateCrawled: date the listing was downloaded from the database<br>
VehicleType: body type of the car<br>
RegistrationYear: year the car was registered<br>
Gearbox: gearbox type<br>
Power: engine power (hp)<br>
Model: car model<br>
Kilometer: mileage (km)<br>
RegistrationMonth: month the car was registered<br>
FuelType: fuel type<br>
Brand: car brand<br>
NotRepaired: whether the car has been repaired or not<br>
DateCreated: date the listing was created<br>
NumberOfPictures: number of photos of the car<br>
PostalCode: postal code of the listing owner (the user)<br>
LastSeen: date of the user's last activity<br>

In [7]:
data_raw.describe()

Unnamed: 0,Price,RegistrationYear,Power,Kilometer,RegistrationMonth,NumberOfPictures,PostalCode
count,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0
mean,4416.656776,2004.234448,110.094337,128211.172535,5.714645,0.0,50508.689087
std,4514.158514,90.227958,189.850405,37905.34153,3.726421,0.0,25783.096248
min,0.0,1000.0,0.0,5000.0,0.0,0.0,1067.0
25%,1050.0,1999.0,69.0,125000.0,3.0,0.0,30165.0
50%,2700.0,2003.0,105.0,150000.0,6.0,0.0,49413.0
75%,6400.0,2008.0,143.0,150000.0,9.0,0.0,71083.0
max,20000.0,9999.0,20000.0,150000.0,12.0,0.0,99998.0


### Data Preprocessing

For convenience, convert the column names to snake case.

In [8]:
columns_old = data_raw.columns
columns_new = columns_old.str.lower()

In [9]:
columns_new

Index(['datecrawled', 'price', 'vehicletype', 'registrationyear', 'gearbox',
       'power', 'model', 'kilometer', 'registrationmonth', 'fueltype', 'brand',
       'notrepaired', 'datecreated', 'numberofpictures', 'postalcode',
       'lastseen'],
      dtype='object')

In [10]:
data_raw.columns = columns_new

In [11]:
# NOTE: the key 'vechicletype' is misspelled, so that column is not renamed
# and keeps the name 'vehicletype' in the rest of the notebook.
data_raw = data_raw.rename(columns={
    'datecrawled': 'date_crawled', 'vechicletype': 'vehicle_type',
    'registrationyear': 'registration_year', 'registrationmonth': 'registration_month',
    'fueltype': 'fuel_type', 'notrepaired': 'not_repaired', 'datecreated': 'date_created',
    'numberofpictures': 'number_of_pictures', 'postalcode': 'postal_code',
    'lastseen': 'last_seen'})
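A hand-written mapping is easy to mistype. As a hedged alternative, the conversion could be derived from the original CamelCase names with a regular expression. The helper name `to_snake_case` is hypothetical and is not part of the notebook; this is a sketch, not the method used above.

```python
import re

def to_snake_case(name: str) -> str:
    """Insert '_' before every capital letter except the first, then lowercase."""
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

# A few of the original column names from the dataset.
columns = ['DateCrawled', 'VehicleType', 'NumberOfPictures', 'Price']
print([to_snake_case(c) for c in columns])
```

Applied to the original (not yet lowercased) column names, this would avoid spelling each key by hand.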

In [12]:
data_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   date_crawled        354369 non-null  object
 1   price               354369 non-null  int64 
 2   vehicletype         316879 non-null  object
 3   registration_year   354369 non-null  int64 
 4   gearbox             334536 non-null  object
 5   power               354369 non-null  int64 
 6   model               334664 non-null  object
 7   kilometer           354369 non-null  int64 
 8   registration_month  354369 non-null  int64 
 9   fuel_type           321474 non-null  object
 10  brand               354369 non-null  object
 11  not_repaired        283215 non-null  object
 12  date_created        354369 non-null  object
 13  number_of_pictures  354369 non-null  int64 
 14  postal_code         354369 non-null  int64 
 15  last_seen           354369 non-null  object
dtypes: int64(7), object(9)

Check the dataset for duplicates:

In [13]:
data_raw.duplicated().sum()

4

Remove the duplicates from the dataset:

In [14]:
data_raw = data_raw.drop_duplicates()

In [15]:
data_raw.duplicated().sum()

0

The maximum value of registration_year (the car's registration year) is 9999. Count the rows where the registration year is greater than 2022.

In [16]:
len(data_raw.loc[data_raw.registration_year > 2022])

105

Such rows make up less than 1% of the data (105 of 354,365, about 0.03%), so remove them.

In [17]:
data_raw = data_raw.loc[data_raw.registration_year <= 2022]
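The filtering step can be sketched with a boolean mask, which also gives the share of affected rows directly. The toy values below are hypothetical and are not taken from the dataset; only the 2022 bound comes from the text.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only).
toy = pd.DataFrame({'registration_year': [1999, 2008, 2016, 5000, 9999]})

max_year = 2022  # same upper bound as in the text
mask = toy['registration_year'] > max_year

# The mean of a boolean mask is the share of rows that satisfy the condition.
print(f'share of implausible years: {mask.mean():.1%}')

# Keep only the rows with a plausible registration year.
toy_clean = toy.loc[~mask]
```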

Examine the distribution of number_of_pictures.

In [18]:
data_raw.number_of_pictures.value_counts()

0    354260
Name: number_of_pictures, dtype: int64

The variable takes a single value and therefore cannot affect the predictions, so drop it.

In [19]:
data_raw = data_raw.drop(['number_of_pictures'], axis = 1)

In [20]:
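# Correlate the numeric columns only; object columns are excluded explicitly
corr = data_raw.select_dtypes('number').corr()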
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,price,registration_year,power,kilometer,registration_month,postal_code
price,1.0,0.206408,0.159135,-0.334048,0.110527,0.075998
registration_year,0.206408,1.0,0.017936,-0.05792,0.041327,0.014851
power,0.159135,0.017936,1.0,0.023979,0.043208,0.021669
kilometer,-0.334048,-0.05792,0.023979,1.0,0.008505,-0.007962
registration_month,0.110527,0.041327,0.043208,0.008505,1.0,0.013883
postal_code,0.075998,0.014851,0.021669,-0.007962,0.013883,1.0


The correlation matrix shows that the linear relationships between the variables are weak: the largest coefficient in absolute value is -0.33, between price and kilometer.

In [21]:
data_raw.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 354260 entries, 0 to 354368
Data columns (total 15 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   date_crawled        354260 non-null  object
 1   price               354260 non-null  int64 
 2   vehicletype         316875 non-null  object
 3   registration_year   354260 non-null  int64 
 4   gearbox             334508 non-null  object
 5   power               354260 non-null  int64 
 6   model               334597 non-null  object
 7   kilometer           354260 non-null  int64 
 8   registration_month  354260 non-null  int64 
 9   fuel_type           321441 non-null  object
 10  brand               354260 non-null  object
 11  not_repaired        283197 non-null  object
 12  date_created        354260 non-null  object
 13  postal_code         354260 non-null  int64 
 14  last_seen           354260 non-null  object
dtypes: int64(6), object(9)
memory usage: 43.2+ MB


Convert date_crawled, date_created, and last_seen to the datetime type. From the converted values, derive three features each: year, month, and day of the week.

In [22]:
data_raw['date_crawled'] = pd.to_datetime(
    data_raw['date_crawled'], format='%Y-%m-%d %H:%M:%S')
data_raw['year_crawled'] = data_raw['date_crawled'].dt.year
data_raw['month_crawled'] = data_raw['date_crawled'].dt.month
data_raw['weekday_crawled'] = data_raw['date_crawled'].dt.weekday

In [23]:
data_raw['date_created'] = pd.to_datetime(
    data_raw['date_created'], format='%Y-%m-%d %H:%M:%S')
data_raw['year_created'] = data_raw['date_created'].dt.year
data_raw['month_created'] = data_raw['date_created'].dt.month
data_raw['weekday_created'] = data_raw['date_created'].dt.weekday

In [24]:
data_raw['last_seen'] = pd.to_datetime(
    data_raw['last_seen'], format='%Y-%m-%d %H:%M:%S')
data_raw['year_last_seen'] = data_raw['last_seen'].dt.year
data_raw['month_last_seen'] = data_raw['last_seen'].dt.month
data_raw['weekday_last_seen'] = data_raw['last_seen'].dt.weekday

One-hot encode the categorical variables vehicletype, gearbox, fuel_type, and not_repaired.

In [25]:
data = data_raw
data = pd.get_dummies(data, columns=['vehicletype', 'gearbox', 'fuel_type', 'not_repaired'])
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 354260 entries, 0 to 354368
Data columns (total 39 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   date_crawled             354260 non-null  datetime64[ns]
 1   price                    354260 non-null  int64         
 2   registration_year        354260 non-null  int64         
 3   power                    354260 non-null  int64         
 4   model                    334597 non-null  object        
 5   kilometer                354260 non-null  int64         
 6   registration_month       354260 non-null  int64         
 7   brand                    354260 non-null  object        
 8   date_created             354260 non-null  datetime64[ns]
 9   postal_code              354260 non-null  int64         
 10  last_seen                354260 non-null  datetime64[ns]
 11  year_crawled             354260 non-null  int64         
 12  month_crawled   

In [26]:
data_raw['not_repaired'].head()

0    NaN
1    yes
2    NaN
3     no
4     no
Name: not_repaired, dtype: object

In [27]:
data[['not_repaired_no', 'not_repaired_yes']].head()

Unnamed: 0,not_repaired_no,not_repaired_yes
0,0,0
1,0,1
2,0,0
3,1,0
4,1,0
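The output above shows that a missing not_repaired value is encoded as zeros in both dummy columns. A minimal sketch of this behaviour follows, together with the `dummy_na=True` option of `pd.get_dummies`, which adds an explicit indicator column for missing values. The toy frame mirrors the five rows printed above. The notebook does not use `dummy_na`, so treat it as an optional variation.

```python
import pandas as pd

# Toy column mirroring the first five not_repaired values shown above.
toy = pd.DataFrame({'not_repaired': [None, 'yes', None, 'no', 'no']})

# Default behaviour: a missing value becomes a row of zeros.
plain = pd.get_dummies(toy, columns=['not_repaired']).astype(int)

# dummy_na=True adds one more column that flags the missing values.
with_na = pd.get_dummies(toy, columns=['not_repaired'], dummy_na=True).astype(int)

print(plain)
print(with_na)
```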


At this stage, duplicates and outliers were removed, the categorical variables were encoded, and the data types were converted.

## Model Training

### Linear Regression

Train a linear regression model. To do this, first prepare the training and test sets.

The split below is positional, so shuffle the rows of the dataset first:

In [28]:
data = data.sample(frac=1).reset_index(drop=True)

In [29]:
features = data.drop(['model', 'brand', 'price', 'date_crawled', 'date_created', 'last_seen'],
                        axis = 1)
target = data['price']

In [30]:
data_raw['brand'].nunique()

40

In [31]:
data_raw['model'].nunique()

250

Split the dataset into training, validation, and test sets (60% / 20% / 20%).

In [32]:
features_train, features_valid, features_test = np.split(features, [int(.6*len(features)),
                                                                 int(.8*len(features))])
target_train, target_valid, target_test = np.split(target, [int(.6*len(target)),
                                                                 int(.8*len(target))])
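The same 60/20/20 split could also be obtained with `train_test_split`, which is imported at the top of the notebook but not used. The sketch below runs on toy arrays; the `random_state` value is an arbitrary assumption added for reproducibility and is not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and target (hypothetical data, for illustration only).
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# First hold out 20% for the test set ...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
# ... then take 25% of the remainder (20% of the total) for validation.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_valid), len(X_test))
```

Because `train_test_split` shuffles by default, a separate shuffling step would not be needed with this approach.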

Train the linear regression model and use it to predict the target variable.

In [33]:
%%time
linear_regression = LinearRegression().fit(features_train, target_train)

CPU times: user 484 ms, sys: 309 ms, total: 793 ms
Wall time: 764 ms


In [34]:
%%time
result_lr = linear_regression.predict(features_test)

CPU times: user 33.7 ms, sys: 71.2 ms, total: 105 ms
Wall time: 105 ms


Evaluate the quality of the model with the RMSE metric.

In [35]:
rmse_lr = mean_squared_error(target_test, result_lr)**0.5
print('RMSE for the linear regression model:', rmse_lr)

RMSE for the linear regression model: 3416.2822779862677
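RMSE is the square root of the mean squared difference between the true and the predicted values, which is what `mean_squared_error(...)**0.5` computes. The same calculation can be sketched as a small helper. The function name `rmse` and the toy prices are assumptions made for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy prices (hypothetical): the errors are -100, 100 and -200.
print(rmse([1000, 2000, 3000], [1100, 1900, 3200]))
```

Since RMSE is expressed in the same units as the target, the values reported in this notebook can be read directly as a typical error in price units.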


### Random Forest

Use a Random Forest model to predict the price.

In [36]:
model_forest = RandomForestRegressor(max_depth = 7, n_estimators = 90)

In [37]:
%%time
model_forest.fit(features_train, target_train)

CPU times: user 49.5 s, sys: 27 ms, total: 49.5 s
Wall time: 49.8 s


RandomForestRegressor(max_depth=7, n_estimators=90)

In [38]:
result_forest_valid = model_forest.predict(features_valid)
rmse_forest_valid = mean_squared_error(target_valid, result_forest_valid)**0.5
print(rmse_forest_valid)

2264.4646457595277


In [39]:
%%time
result_forest = model_forest.predict(features_test)

CPU times: user 358 ms, sys: 4.16 ms, total: 362 ms
Wall time: 371 ms


In [40]:
rmse_rf = mean_squared_error(target_test, result_forest)**0.5
print('RMSE for the random forest model:', rmse_rf)

RMSE for the random forest model: 2272.864416990728


### XGBoost

Build a model with the XGBoost algorithm. The max_depth and n_estimators hyperparameters are tuned with Optuna on the validation set.

In [41]:
%%time
def objective(trial):
    params = {
        "max_depth": trial.suggest_int("max_depth", 5, 10, step = 1),
        "n_estimators": trial.suggest_int("n_estimators", 80, 120, step = 5),
    }
    reg = xgb.XGBRegressor(**params)
    reg.fit(features_train, target_train)
    y_pred = reg.predict(features_valid)
    rmse = mean_squared_error(target_valid, y_pred)**0.5
    return rmse

study = optuna.create_study()
# timeout is given in seconds: with a 6-second budget only the first trial completes
study.optimize(objective, n_trials = 100, timeout = 6)

[I 2022-04-26 13:37:30,767] A new study created in memory with name: no-name-f8ff5b45-65a6-4f48-be13-24e6cca21acc
[I 2022-04-26 13:40:52,032] Trial 0 finished with value: 1937.4742725199746 and parameters: {'max_depth': 6, 'n_estimators': 110}. Best is trial 0 with value: 1937.4742725199746.


CPU times: user 3min 19s, sys: 1.09 s, total: 3min 20s
Wall time: 3min 21s


In [42]:
study.best_params

{'max_depth': 6, 'n_estimators': 110}

In [43]:
model_xgb = xgb.XGBRegressor(max_depth = 6, n_estimators = 110, eval_metric = 'rmse')

In [44]:
%%time
model_xgb.fit(features_train, target_train)

CPU times: user 3min 16s, sys: 861 ms, total: 3min 17s
Wall time: 3min 18s


XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
             eval_metric='rmse', gamma=0, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.300000012,
             max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=110, n_jobs=8,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)

In [45]:
result_xgb_valid = model_xgb.predict(features_valid)

In [46]:
rmse_xgb_valid = mean_squared_error(target_valid, result_xgb_valid)**0.5
rmse_xgb_valid

1937.4742725199746

In [47]:
%%time
result_xgb = model_xgb.predict(features_test)

CPU times: user 350 ms, sys: 0 ns, total: 350 ms
Wall time: 378 ms


In [48]:
rmse_xgb = mean_squared_error(target_test, result_xgb)**0.5
print('RMSE for XGBoost:', rmse_xgb)

RMSE for XGBoost: 1944.1572107361999


### LightGBM

In [49]:
%%time
def lgbm(trial):
    params = {
        "max_depth": trial.suggest_int("max_depth", 5, 10, step = 1),
        "n_estimators": trial.suggest_int("n_estimators", 80, 120, step = 5),
    }
    reg = LGBMRegressor(**params)
    reg.fit(features_train, target_train)
    y_pred = reg.predict(features_valid)
    rmse = mean_squared_error(target_valid, y_pred)**0.5
    return rmse

study = optuna.create_study()
# timeout is given in seconds: with a 6-second budget only the first trial completes
study.optimize(lgbm, n_trials = 100, timeout = 6)

[I 2022-04-26 13:45:52,564] A new study created in memory with name: no-name-2ba4bf07-7725-40fb-9624-1718a822c53d
[I 2022-04-26 13:51:12,989] Trial 0 finished with value: 2009.633714739668 and parameters: {'max_depth': 8, 'n_estimators': 95}. Best is trial 0 with value: 2009.633714739668.


CPU times: user 5min 17s, sys: 1.45 s, total: 5min 18s
Wall time: 5min 20s


In [50]:
study.best_params

{'max_depth': 8, 'n_estimators': 95}

In [52]:
model_gbm = LGBMRegressor(max_depth = 8, n_estimators = 95)

In [53]:
%%time
model_gbm.fit(features_train, target_train)

CPU times: user 4min 22s, sys: 1.22 s, total: 4min 23s
Wall time: 4min 25s


LGBMRegressor(max_depth=8, n_estimators=95)

In [54]:
result_gbm_valid = model_gbm.predict(features_valid)
rmse_gbm_valid = mean_squared_error(target_valid, result_gbm_valid)**0.5
rmse_gbm_valid

2009.633714739668

In [55]:
%%time
result_gbm = model_gbm.predict(features_test)

CPU times: user 588 ms, sys: 4.03 ms, total: 592 ms
Wall time: 610 ms


In [56]:
rmse_gbm = mean_squared_error(target_test, result_gbm)**0.5
print('RMSE for LightGBM:', rmse_gbm)

RMSE for LightGBM: 2014.434630350793


### CatBoost

In [57]:
%%time
def cat(trial):
    params = {
        "max_depth": trial.suggest_int("max_depth", 5, 13, step = 1),
        "n_estimators": trial.suggest_int("n_estimators", 80, 1200, step = 5),
    }
    reg = CatBoostRegressor(**params)
    reg.fit(features_train, target_train)
    y_pred = reg.predict(features_valid)
    rmse = mean_squared_error(target_valid, y_pred)**0.5
    return rmse

study = optuna.create_study()
# timeout is given in seconds: with a 6-second budget only the first trial completes
study.optimize(cat, n_trials = 100, timeout = 6)

[I 2022-04-26 13:56:24,609] A new study created in memory with name: no-name-50c0ed59-56f6-4c4e-9a4c-3f107010cd4f


Learning rate set to 0.155241
0:	learn: 4052.8277931	total: 114ms	remaining: 1m 2s
1:	learn: 3688.2816305	total: 172ms	remaining: 47.1s
2:	learn: 3379.5516512	total: 228ms	remaining: 41.6s
3:	learn: 3143.0548009	total: 290ms	remaining: 39.5s
4:	learn: 2951.9618816	total: 344ms	remaining: 37.5s
5:	learn: 2796.0838642	total: 405ms	remaining: 36.7s
6:	learn: 2665.8121649	total: 459ms	remaining: 35.6s
7:	learn: 2567.0733710	total: 515ms	remaining: 34.9s
8:	learn: 2487.6891312	total: 571ms	remaining: 34.3s
9:	learn: 2425.6496622	total: 624ms	remaining: 33.7s
10:	learn: 2370.1064944	total: 679ms	remaining: 33.2s
11:	learn: 2328.2421514	total: 734ms	remaining: 32.9s
12:	learn: 2290.2928448	total: 790ms	remaining: 32.6s
13:	learn: 2259.7605089	total: 850ms	remaining: 32.6s
14:	learn: 2233.8733566	total: 908ms	remaining: 32.4s
15:	learn: 2210.3092576	total: 964ms	remaining: 32.2s
16:	learn: 2192.3881356	total: 1s	remaining: 31.4s
17:	learn: 2175.9495137	total: 1.04s	remaining: 30.8s
18:	learn: 

[I 2022-04-26 13:56:47,407] Trial 0 finished with value: 1899.61943809072 and parameters: {'max_depth': 8, 'n_estimators': 550}. Best is trial 0 with value: 1899.61943809072.


CPU times: user 21.3 s, sys: 169 ms, total: 21.4 s
Wall time: 22.8 s


In [58]:
study.best_params

{'max_depth': 8, 'n_estimators': 550}

In [62]:
model_cat = CatBoostRegressor(max_depth = 8, n_estimators = 550, loss_function='RMSE')

In [63]:
%%time
model_cat.fit(features_train, target_train);

Custom logger is already specified. Specify more than one logger at same time is not thread safe.

Learning rate set to 0.155241
0:	learn: 4052.8277931	total: 43.9ms	remaining: 24.1s
1:	learn: 3688.2816305	total: 82.6ms	remaining: 22.6s
2:	learn: 3379.5516512	total: 125ms	remaining: 22.8s
3:	learn: 3143.0548009	total: 167ms	remaining: 22.8s
4:	learn: 2951.9618816	total: 208ms	remaining: 22.7s
5:	learn: 2796.0838642	total: 251ms	remaining: 22.7s
6:	learn: 2665.8121649	total: 294ms	remaining: 22.8s
7:	learn: 2567.0733710	total: 334ms	remaining: 22.6s
8:	learn: 2487.6891312	total: 375ms	remaining: 22.5s
9:	learn: 2425.6496622	total: 415ms	remaining: 22.4s
10:	learn: 2370.1064944	total: 454ms	remaining: 22.2s
11:	learn: 2328.2421514	total: 496ms	remaining: 22.2s
12:	learn: 2290.2928448	total: 532ms	remaining: 22s
13:	learn: 2259.7605089	total: 568ms	remaining: 21.7s
14:	learn: 2233.8733566	total: 605ms	remaining: 21.6s
15:	learn: 2210.3092576	total: 640ms	remaining: 21.4s
16:	learn: 2192.3881356	total: 673ms	remaining: 21.1s
17:	learn: 2175.9495137	total: 707ms	remaining: 20.9s
18:	lear

<catboost.core.CatBoostRegressor at 0x7f65abebbd30>

In [64]:
result_cat_valid = model_cat.predict(features_valid)
rmse_cat_valid = mean_squared_error(target_valid, result_cat_valid)**0.5

In [65]:
rmse_cat_valid

1899.61943809072

In [66]:
%%time
result_cat_test = model_cat.predict(features_test)

CPU times: user 56.6 ms, sys: 0 ns, total: 56.6 ms
Wall time: 54.6 ms


In [67]:
rmse_cat_test = mean_squared_error(target_test, result_cat_test)**0.5
print('RMSE for CatBoost:', rmse_cat_test)

RMSE for CatBoost: 1907.1380328983246


## Model Analysis

In [68]:
print('RMSE for the linear regression model:', rmse_lr)
print('RMSE for the random forest model:', rmse_rf)
print('RMSE for XGBoost:', rmse_xgb)
print('RMSE for LightGBM:', rmse_gbm)
print('RMSE for CatBoost:', rmse_cat_test)

RMSE for the linear regression model: 3416.2822779862677
RMSE for the random forest model: 2272.864416990728
RMSE for XGBoost: 1944.1572107361999
RMSE for LightGBM: 2014.434630350793
RMSE for CatBoost: 1907.1380328983246


Linear regression gave the least accurate result (RMSE 3416.3); the most accurate results came from CatBoost (1907.1) and XGBoost (1944.2).<br>
Linear regression was the fastest model to train (under 1 s) and one of the fastest to predict (about 0.1 s). Training took longest for LightGBM (4 min 25 s) and XGBoost (3 min 18 s).<br>
Linear regression is the least accurate model, and training XGBoost and LightGBM takes a long time, so these models are not suitable. Compared with RandomForest, CatBoost is more accurate (RMSE 1907 versus 2273) and predicts faster (about 55 ms versus 371 ms), so the CatBoost model will be used to predict the market value of a car.
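To make the comparison easier to scan, the reported metrics can be collected in one structure and ranked. The sketch below uses the test-set RMSE values and prediction wall times printed in the cells above, rounded. Training times are left out because the CatBoost fit time is not visible in the output.

```python
# Test-set RMSE and prediction wall time (ms) as reported in the cells above.
results = {
    'LinearRegression': {'rmse': 3416.28, 'predict_ms': 105.0},
    'RandomForest': {'rmse': 2272.86, 'predict_ms': 371.0},
    'XGBoost': {'rmse': 1944.16, 'predict_ms': 378.0},
    'LightGBM': {'rmse': 2014.43, 'predict_ms': 610.0},
    'CatBoost': {'rmse': 1907.14, 'predict_ms': 54.6},
}

# Rank the models from the lowest to the highest RMSE.
ranked = sorted(results.items(), key=lambda item: item[1]['rmse'])
for name, metrics in ranked:
    print(f"{name:<17} RMSE={metrics['rmse']:8.2f}  predict={metrics['predict_ms']:6.1f} ms")

best_name = ranked[0][0]
```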