# Estimating Car Prices

The used-car sales service «Не бит, не крашен» ("Not dented, not repainted") is developing an app to attract new customers. In the app, a user can quickly find out the market value of their car. Historical data are available: technical specifications, trim levels, and prices of cars. The task is to build a model that estimates the price.

The customer cares about:

- prediction quality;
- prediction speed;
- training time.

**Work plan**

The data come from the file `/datasets/autos.csv`. Nothing is known about their quality, so an overview of the data is needed before any further analysis.

We will check the data for errors, missing values, and duplicates, and fix these defects where possible.



The work will therefore proceed in **6 stages**:

1. Load the data; file path: `/datasets/autos.csv`.

   The dataset contains 16 columns:

   **Features:**
   * `DateCrawled`: date the listing was downloaded from the database
   * `VehicleType`: body type
   * `RegistrationYear`: year the car was registered
   * `Gearbox`: gearbox type
   * `Power`: power (hp)
   * `Model`: car model
   * `Kilometer`: mileage (km)
   * `RegistrationMonth`: month the car was registered
   * `FuelType`: fuel type
   * `Brand`: car brand
   * `Repaired`: whether the car has been repaired
   * `DateCreated`: date the listing was created
   * `NumberOfPictures`: number of photos of the car
   * `PostalCode`: postal code of the listing owner (the user)
   * `LastSeen`: date of the user's last activity

   **Target:**
   * `Price`: price (euros)


2. Explore the data. Fill in missing values and handle anomalies in the columns. Remove uninformative features, if any.
3. Prepare the samples for model training.
4. Train several models, one of which is LightGBM and at least one of which is not a boosting model.
5. Analyse training time, prediction time, and model quality.
6. Choose the best model based on the customer's criteria and check its quality on the test set.

## Data preparation

### Importing libraries

**Import all the libraries needed for the work:**

In [1]:
!pip install lightgbm



In [2]:
import warnings

warnings.filterwarnings('ignore')

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from catboost import CatBoostRegressor

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

from lightgbm import LGBMRegressor

from sklearn.metrics import mean_squared_error
from sklearn.metrics import make_scorer

from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

import time

### Reading the file

In [3]:
try:
    df = pd.read_csv("C:/Users/KreoS/Downloads/autos.csv")  # local copy
except FileNotFoundError:
    df = pd.read_csv("/datasets/autos.csv")                 # path given in the task

### Exploring the data

Show the first rows of the DataFrame:

In [4]:
df.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,Repaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,2016-03-24 11:52:17,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


Show the shape of the table:

In [5]:
df.shape

(354369, 16)

Show general information about the DataFrame:

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Kilometer          354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  Repaired           283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(9)

Show the main summary statistics:

In [7]:
df.describe().style.background_gradient('YlOrRd')

Unnamed: 0,Price,RegistrationYear,Power,Kilometer,RegistrationMonth,NumberOfPictures,PostalCode
count,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0
mean,4416.656776,2004.234448,110.094337,128211.172535,5.714645,0.0,50508.689087
std,4514.158514,90.227958,189.850405,37905.34153,3.726421,0.0,25783.096248
min,0.0,1000.0,0.0,5000.0,0.0,0.0,1067.0
25%,1050.0,1999.0,69.0,125000.0,3.0,0.0,30165.0
50%,2700.0,2003.0,105.0,150000.0,6.0,0.0,49413.0
75%,6400.0,2008.0,143.0,150000.0,9.0,0.0,71083.0
max,20000.0,9999.0,20000.0,150000.0,12.0,0.0,99998.0


Look at the number of missing values per column:

In [8]:
pd.DataFrame(df.isna().sum(), columns=['Missing values']).style.background_gradient('coolwarm')

Unnamed: 0,Missing values
DateCrawled,0
Price,0
VehicleType,37490
RegistrationYear,0
Gearbox,19833
Power,0
Model,19705
Kilometer,0
RegistrationMonth,0
FuelType,32895


Look at the percentage of missing values per column:

In [9]:
pd.DataFrame(round(df.isna().mean() * 100, 2), columns=['%']).style.background_gradient('coolwarm')

Unnamed: 0,%
DateCrawled,0.0
Price,0.0
VehicleType,10.58
RegistrationYear,0.0
Gearbox,5.6
Power,0.0
Model,5.56
Kilometer,0.0
RegistrationMonth,0.0
FuelType,9.28


### Converting column names to `snake_case`

The column names do not follow a common standard, so we convert them to `snake_case`:

In [10]:
df.columns = [x.lower() for x in df.columns] # lowercase all column names

In [11]:
df = df.rename(                              # insert underscores between words
    columns={
        'datecrawled': 'date_crawled',
        'vehicletype': 'vehicle_type',
        'registrationyear': 'registration_year',
        'registrationmonth': 'registration_month',
        'fueltype': 'fuel_type',
        'datecreated': 'date_created',
        'numberofpictures': 'number_of_pictures',
        'postalcode': 'postal_code',
        'lastseen': 'last_seen'
    })

Print the renamed columns to confirm that they are now in `snake_case`:

In [12]:
df.columns.to_list() # column names as a list

['date_crawled',
 'price',
 'vehicle_type',
 'registration_year',
 'gearbox',
 'power',
 'model',
 'kilometer',
 'registration_month',
 'fuel_type',
 'brand',
 'repaired',
 'date_created',
 'number_of_pictures',
 'postal_code',
 'last_seen']
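The manual mapping above works for a fixed set of columns. As a hedged alternative, the same conversion could be sketched with a regular expression. The helper name `to_snake_case` is hypothetical and not part of the original notebook.

```python
import re


def to_snake_case(name: str) -> str:
    """Insert '_' before every capital letter except the first, then lowercase."""
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()


# A few of the original column names from the dataset
original = ['DateCrawled', 'Price', 'VehicleType', 'NumberOfPictures', 'PostalCode']
converted = [to_snake_case(col) for col in original]
print(converted)
```

With a DataFrame this would be applied as `df.columns = [to_snake_case(c) for c in df.columns]`, before any lowercasing.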

### Checking for exact duplicates

In [13]:
df.duplicated().sum()

4

The data contain 4 exact duplicate rows. We drop them:

In [14]:
df = df.drop_duplicates()

In [15]:
df.duplicated().sum()

0

### Removing unnecessary columns

The columns `date_crawled`, `date_created`, `number_of_pictures`, `last_seen`, and `postal_code` describe the listing rather than the car, and we treat them as uninformative for predicting the price (`number_of_pictures`, for instance, is 0 in every row). These columns can be dropped.

In [16]:
df = df.drop([                                                                    # drop the columns from the DataFrame
    'date_crawled', 'date_created',
    'number_of_pictures', 'postal_code', 'last_seen'
],
             axis=1)
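One way to back up such a decision is to look for columns that hold a single value. The sketch below runs on a small made-up frame (the `toy` data are an assumption, shaped like the real columns) and is only an illustration, not the check used in the notebook.

```python
import pandas as pd

# Toy frame: `number_of_pictures` is constant, as in the summary statistics above
toy = pd.DataFrame({
    'price': [480, 18300, 9800],
    'power': [0, 190, 163],
    'number_of_pictures': [0, 0, 0],
})

# A column with at most one distinct value carries no information for a model
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) <= 1]
print(constant_cols)
```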

Show the first 5 rows of the DataFrame:

In [17]:
df.head()

Unnamed: 0,price,vehicle_type,registration_year,gearbox,power,model,kilometer,registration_month,fuel_type,brand,repaired
0,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,
1,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes
2,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,
3,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no
4,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no


The columns were dropped successfully.

With the date columns gone, we check the dataset for duplicates once more:

In [18]:
df.duplicated().sum()

27539
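The jump in the duplicate count is expected: rows that differed only in the dropped columns become identical. A minimal sketch on made-up data (the `toy` frame is an assumption) shows the effect.

```python
import pandas as pd

# Two listings of the same car, crawled on different dates
toy = pd.DataFrame({
    'date_crawled': ['2016-03-24', '2016-03-25'],
    'price': [480, 480],
    'brand': ['volkswagen', 'volkswagen'],
})

duplicates_before = int(toy.duplicated().sum())
duplicates_after = int(toy.drop(columns=['date_crawled']).duplicated().sum())
print(duplicates_before, duplicates_after)
```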

Drop the duplicates:

In [19]:
df = df.drop_duplicates()

In [20]:
df.duplicated().sum()

0

### Handling missing values

The `gearbox` and `model` columns each have about 5.6% missing values. We drop the rows where either is missing:

In [21]:
df = df.dropna(subset=['gearbox', 'model']).reset_index(drop=True)

More than 10% of the values in `vehicle_type` are missing, which is too many rows to drop. We fill the missing values in this column with `other`.

In [22]:
df['vehicle_type'] = df['vehicle_type'].fillna(value='other')

In [23]:
df['vehicle_type'].isna().sum()

0

Look at the unique values of this column:

In [24]:
df['vehicle_type'].unique()

array(['other', 'suv', 'small', 'sedan', 'convertible', 'bus', 'wagon',
       'coupe'], dtype=object)

Look at the unique values of the `fuel_type` column:

In [25]:
df['fuel_type'].unique()

array(['petrol', 'gasoline', nan, 'lpg', 'other', 'hybrid', 'cng',
       'electric'], dtype=object)

The list of unique values already includes `other`. Some users may have left the field empty because the fuel they needed was not offered as an option, so we replace the missing values with `other`:

In [26]:
df['fuel_type'] = df['fuel_type'].fillna(value='other')

Missing values in `repaired` may mean that the car was never repaired. We replace them with `no`:

In [27]:
df['repaired'] = df['repaired'].fillna(value='no')

Check the missing values again:

In [28]:
pd.DataFrame(df.isna().sum(), columns=['Missing values']).style.background_gradient('coolwarm')

Unnamed: 0,Missing values
price,0
vehicle_type,0
registration_year,0
gearbox,0
power,0
model,0
kilometer,0
registration_month,0
fuel_type,0
brand,0


All missing values have been dropped or filled.

### Handling anomalous values

The summary statistics show cars with zero price and zero power, and registration years ranging from 1000 to 9999. These are most likely data-entry errors. It is also unlikely that a car on the market really costs less than 1000 euros, so we keep only rows with a price above 1000 euros (the filter is strict, so listings at exactly 1000 are dropped as well). We also restrict power to values above 50 and below 500 horsepower, and drop rows with a registration year before 1950 or after 2016.

In [29]:
df = df.loc[(df['price'] > 1000) & (df['power'] > 50) & (df['power'] < 500) &
            (df['registration_year'] <= 2016) &
            (df['registration_year'] >= 1950)]
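Before applying a combined filter like this, it can help to see how many rows each condition rejects on its own. The following is a minimal sketch on made-up rows (the `toy` frame and the `masks` dictionary are assumptions for illustration), using the same thresholds as the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    'price': [0, 1500, 9800, 3600, 700],
    'power': [0, 75, 163, 20000, 69],
    'registration_year': [1993, 2001, 2004, 2008, 9999],
})

# One boolean mask per rule; True means the row passes the rule
masks = {
    'price > 1000': toy['price'] > 1000,
    '50 < power < 500': (toy['power'] > 50) & (toy['power'] < 500),
    '1950 <= year <= 2016': (toy['registration_year'] >= 1950) & (toy['registration_year'] <= 2016),
}

# Rows rejected by each rule taken separately
removed = {name: int((~mask).sum()) for name, mask in masks.items()}
print(removed)

# Combine all rules: a row is kept only if it passes every one of them
keep = pd.Series(True, index=toy.index)
for mask in masks.values():
    keep &= mask
filtered = toy[keep]
print(filtered)
```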

Look at the size of the data after preprocessing:

In [30]:
df.shape

(204929, 11)

## Model training

### Splitting the data into samples

Split the data into training, validation, and test sets:

In [31]:
df_train, df_valid_test = train_test_split(df, test_size=0.4, random_state=12345)
df_valid, df_test = train_test_split(df_valid_test, test_size=0.5, random_state=12345)

Reset the indexes:

In [32]:
df_train = df_train.reset_index(drop=True)
df_train.head()

Unnamed: 0,price,vehicle_type,registration_year,gearbox,power,model,kilometer,registration_month,fuel_type,brand,repaired
0,3900,small,2008,manual,131,other,150000,9,gasoline,fiat,no
1,4699,small,2006,manual,110,corolla,90000,4,petrol,toyota,no
2,5850,small,2007,manual,140,golf,150000,8,gasoline,volkswagen,no
3,6600,sedan,2009,manual,95,one,150000,3,petrol,mini,no
4,3899,small,2002,manual,100,ibiza,125000,11,petrol,seat,no


In [33]:
df_valid = df_valid.reset_index(drop=True)
df_valid

Unnamed: 0,price,vehicle_type,registration_year,gearbox,power,model,kilometer,registration_month,fuel_type,brand,repaired
0,2900,bus,1988,manual,75,transporter,150000,7,gasoline,volkswagen,no
1,8750,bus,2007,auto,140,b_klasse,150000,8,gasoline,mercedes_benz,no
2,4850,small,2006,manual,105,golf,150000,9,gasoline,volkswagen,no
3,2200,small,2001,manual,58,corsa,150000,7,petrol,opel,no
4,16999,coupe,2007,auto,286,3er,150000,7,gasoline,bmw,no
...,...,...,...,...,...,...,...,...,...,...,...
40981,5300,sedan,2009,manual,105,bravo,80000,3,gasoline,fiat,no
40982,12000,suv,2009,auto,250,other,125000,5,petrol,saab,no
40983,15899,sedan,2007,manual,150,x_reihe,80000,8,gasoline,bmw,no
40984,1800,wagon,1999,manual,60,caddy,150000,0,petrol,volkswagen,no


In [34]:
df_test = df_test.reset_index(drop=True)
df_test.head()

Unnamed: 0,price,vehicle_type,registration_year,gearbox,power,model,kilometer,registration_month,fuel_type,brand,repaired
0,3500,sedan,1998,manual,200,golf,150000,6,petrol,volkswagen,no
1,5999,wagon,2011,manual,125,astra,150000,8,gasoline,opel,no
2,3200,convertible,1997,manual,125,80,150000,5,petrol,audi,no
3,2000,wagon,2002,manual,141,6_reihe,150000,12,petrol,mazda,no
4,3200,wagon,2006,manual,101,focus,125000,0,other,ford,no


Print the sizes of the resulting sets:

In [35]:
df_train.shape # size of the training set

(122957, 11)

In [36]:
df_valid.shape # size of the validation set

(40986, 11)

In [37]:
df_test.shape # size of the test set

(40986, 11)

In [38]:
df.shape[0] == df_train.shape[0] + df_valid.shape[0] + df_test.shape[0]

True

The total number of rows across the three sets matches the number of rows in the preprocessed DataFrame.
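The two-step split (`test_size=0.4`, then `test_size=0.5`) yields roughly a 60/20/20 split. The arithmetic can be sketched as follows, assuming the held-out part is rounded up, which is consistent with the sizes printed above.

```python
import math

n_rows = 204929                               # rows after preprocessing

# First split: 40% is held out for validation + test (rounded up)
n_valid_test = math.ceil(0.4 * n_rows)
n_train = n_rows - n_valid_test

# Second split: the held-out part is halved
n_test = math.ceil(0.5 * n_valid_test)
n_valid = n_valid_test - n_test

print(n_train, n_valid, n_test)
print(round(n_train / n_rows, 2), round(n_valid / n_rows, 2), round(n_test / n_rows, 2))
```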

### Splitting the sets into features and target

Split each set into features and the target:

In [39]:
features_train = df_train.drop(['price'], axis=1) # split the training set
target_train = df_train['price']

In [40]:
features_valid = df_valid.drop(['price'], axis=1) # split the validation set
target_valid = df_valid['price']

In [41]:
features_test = df_test.drop(['price'], axis=1) # split the test set
target_test = df_test['price']

### Function for computing `RMSE`

We define a helper that computes `RMSE` as the square root of the mean squared error:

In [42]:
def rmse(y, y_pred):
    return mean_squared_error(y, y_pred) ** .5
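To make the metric concrete, here is a small worked example in plain Python. The function name `rmse_manual` and the three prices are made up for illustration; the formula is the same one, $\sqrt{\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$.

```python
import math


def rmse_manual(y_true, y_pred):
    """Root of the mean of squared differences between targets and predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Errors of 100, 100 and 300 euros: the largest error dominates the result
value = rmse_manual([1000, 2000, 3000], [1100, 1900, 3300])
print(round(value, 2))
```

Because errors are squared before averaging, a single large miss raises RMSE more than several small ones, and the result is expressed in euros like the target.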

### CatBoostRegressor

In [43]:
cat_features = ['vehicle_type', 'gearbox', 'fuel_type', 'brand', 'repaired', 'model']

model_catboost = CatBoostRegressor(loss_function='RMSE', iterations=2000, random_state=12345)

Train the model:

In [44]:
start = time.time()

model_catboost.fit(features_train, target_train, cat_features=cat_features, verbose=False)

end = time.time()

fit_time_cat = round(end - start, 2)

print('Training time:', fit_time_cat, 's')

Training time: 173.31 s


Make predictions:

In [45]:
start = time.time()

predictions = model_catboost.predict(features_valid)

end = time.time()

pred_time_cat = round(end - start, 2)

print('Prediction time:', pred_time_cat, 's')

Prediction time: 0.49 s
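The start/end timing pattern is repeated for every model in this notebook. One possible way to factor it out is a small context manager; the names `timer` and `timings` are hypothetical, and `time.perf_counter` is used here instead of `time.time` because it is intended for measuring short intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(results, label):
    """Store the elapsed wall-clock time of the `with` block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = round(time.perf_counter() - start, 2)


timings = {}
with timer(timings, 'demo_fit'):
    total = sum(i * i for i in range(100_000))   # stand-in for model.fit(...)

print(timings)
```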


Compute `RMSE`:

In [46]:
catboost_rmse = round(rmse(target_valid, predictions), 2)
print('RMSE of CatBoostRegressor on the validation set:', catboost_rmse)

RMSE of CatBoostRegressor on the validation set: 1625.21


### One-Hot Encoding

We encode the categorical features and train a linear regression and a random forest model.

#### Applying `OneHotEncoder`

In [47]:
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')  # `sparse` is named `sparse_output` in scikit-learn >= 1.2
ohe.fit(features_train[cat_features])
features_train_cat_ohe = ohe.transform(features_train[cat_features])
features_valid_cat_ohe = ohe.transform(features_valid[cat_features])
features_test_cat_ohe = ohe.transform(features_test[cat_features])
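The encoder is fitted on the training set only, and `handle_unknown='ignore'` means a category never seen during fitting is encoded as a row of zeros. A hand-written sketch of that behaviour in plain Python (the helpers `fit_categories` and `one_hot`, and the value `'semi'`, are made up for illustration):

```python
def fit_categories(values):
    """Learn the sorted list of categories from the training values."""
    return sorted(set(values))


def one_hot(values, categories):
    """Encode each value as a 0/1 row; unknown values produce an all-zero row."""
    return [[1.0 if value == cat else 0.0 for cat in categories] for value in values]


train_gearbox = ['manual', 'auto', 'manual']
valid_gearbox = ['auto', 'semi']          # 'semi' does not occur in the training data

categories = fit_categories(train_gearbox)
encoded_valid = one_hot(valid_gearbox, categories)
print(categories)
print(encoded_valid)
```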

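Convert the `np.array` results into `DataFrame`s (in newer scikit-learn versions `get_feature_names` is replaced by `get_feature_names_out`):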

In [48]:
features_train_cat_ohe = pd.DataFrame(features_train_cat_ohe, columns=ohe.get_feature_names())
features_valid_cat_ohe = pd.DataFrame(features_valid_cat_ohe, columns=ohe.get_feature_names())
features_test_cat_ohe = pd.DataFrame(features_test_cat_ohe, columns=ohe.get_feature_names())

Look at the resulting DataFrames:

In [49]:
features_train_cat_ohe.head()  

Unnamed: 0,x0_bus,x0_convertible,x0_coupe,x0_other,x0_sedan,x0_small,x0_suv,x0_wagon,x1_auto,x1_manual,...,x5_wrangler,x5_x_reihe,x5_x_trail,x5_x_type,x5_xc_reihe,x5_yaris,x5_yeti,x5_ypsilon,x5_z_reihe,x5_zafira
0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [50]:
features_valid_cat_ohe.head()

Unnamed: 0,x0_bus,x0_convertible,x0_coupe,x0_other,x0_sedan,x0_small,x0_suv,x0_wagon,x1_auto,x1_manual,...,x5_wrangler,x5_x_reihe,x5_x_trail,x5_x_type,x5_xc_reihe,x5_yaris,x5_yeti,x5_ypsilon,x5_z_reihe,x5_zafira
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [51]:
features_test_cat_ohe.head()

Unnamed: 0,x0_bus,x0_convertible,x0_coupe,x0_other,x0_sedan,x0_small,x0_suv,x0_wagon,x1_auto,x1_manual,...,x5_wrangler,x5_x_reihe,x5_x_trail,x5_x_type,x5_xc_reihe,x5_yaris,x5_yeti,x5_ypsilon,x5_z_reihe,x5_zafira
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Join the encoded columns to the original feature DataFrames:

In [52]:
features_train_ohe = features_train.join(features_train_cat_ohe)
features_valid_ohe = features_valid.join(features_valid_cat_ohe)
features_test_ohe = features_test.join(features_test_cat_ohe)

Drop the original, unencoded categorical columns:

In [53]:
features_train_ohe = features_train_ohe.drop(cat_features, axis=1)
features_valid_ohe = features_valid_ohe.drop(cat_features, axis=1)
features_test_ohe = features_test_ohe.drop(cat_features, axis=1)

Show the resulting features for the three sets:

In [54]:
features_train_ohe.head() 

Unnamed: 0,registration_year,power,kilometer,registration_month,x0_bus,x0_convertible,x0_coupe,x0_other,x0_sedan,x0_small,...,x5_wrangler,x5_x_reihe,x5_x_trail,x5_x_type,x5_xc_reihe,x5_yaris,x5_yeti,x5_ypsilon,x5_z_reihe,x5_zafira
0,2008,131,150000,9,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2006,110,90000,4,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2007,140,150000,8,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2009,95,150000,3,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2002,100,125000,11,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [55]:
features_valid_ohe.head() 

Unnamed: 0,registration_year,power,kilometer,registration_month,x0_bus,x0_convertible,x0_coupe,x0_other,x0_sedan,x0_small,...,x5_wrangler,x5_x_reihe,x5_x_trail,x5_x_type,x5_xc_reihe,x5_yaris,x5_yeti,x5_ypsilon,x5_z_reihe,x5_zafira
0,1988,75,150000,7,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2007,140,150000,8,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2006,105,150000,9,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2001,58,150000,7,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2007,286,150000,7,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [56]:
features_test_ohe.head() 

Unnamed: 0,registration_year,power,kilometer,registration_month,x0_bus,x0_convertible,x0_coupe,x0_other,x0_sedan,x0_small,...,x5_wrangler,x5_x_reihe,x5_x_trail,x5_x_type,x5_xc_reihe,x5_yaris,x5_yeti,x5_ypsilon,x5_z_reihe,x5_zafira
0,1998,200,150000,6,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2011,125,150000,8,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1997,125,150000,5,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2002,141,150000,12,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2006,101,125000,0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### RandomForestRegressor

Train the model and make predictions.

We use `RandomizedSearchCV` to tune the hyperparameters.

In [57]:
model_forest_ohe = RandomForestRegressor(random_state=12345)

Define the hyperparameters to search over:

In [58]:
parametrs = {
    'max_depth': range(1, 10),
    'n_estimators': range(10, 101, 10),
    'min_samples_leaf': range(1, 8),
    'min_samples_split': range(2, 10, 2)
}
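To get a feel for how sparse a randomized search over this grid is, the number of possible combinations can be counted. This is a small sketch; `param_space` simply repeats the ranges above, and `n_iter = 10` is the number of candidates sampled by the search configured in this section.

```python
import math

param_space = {
    'max_depth': range(1, 10),
    'n_estimators': range(10, 101, 10),
    'min_samples_leaf': range(1, 8),
    'min_samples_split': range(2, 10, 2),
}

# Size of the full grid: product of the number of options per hyperparameter
n_combinations = math.prod(len(values) for values in param_space.values())
n_iter = 10
print(n_combinations, f'{n_iter / n_combinations:.2%} of the grid is sampled')
```

Only a small fraction of the grid is evaluated, so the reported best parameters are the best among the sampled candidates, not necessarily the best in the grid.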

Wrap `rmse` with `make_scorer` so that it can be used as the scoring function in `RandomizedSearchCV`:

In [59]:
rmse_scorer = make_scorer(rmse, greater_is_better=False)
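With `greater_is_better=False` the scorer returns the negated RMSE, so that "higher is better" still holds inside the search. The sketch below imitates that convention with made-up scores (the candidate names and values are hypothetical) and shows why the sign has to be flipped back when reporting.

```python
# Negated RMSE per candidate, as a search would store them (hypothetical values)
neg_rmse_scores = {
    'candidate_a': -2900.0,
    'candidate_b': -2100.0,
    'candidate_c': -2450.0,
}

# The search maximises the score, which is the same as minimising RMSE
best = max(neg_rmse_scores, key=neg_rmse_scores.get)
best_rmse = -neg_rmse_scores[best]        # flip the sign back for reporting
print(best, best_rmse)
```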

In [60]:
random_src = RandomizedSearchCV(estimator=model_forest_ohe,
                                param_distributions=parametrs,
                                cv=3,
                                n_iter=10,
                                n_jobs=-1,
                                random_state=12345,
                                scoring=rmse_scorer,
                                verbose=2)

`RandomizedSearchCV` performs its own cross-validation, so we merge the training and validation sets before fitting it:

In [61]:
features_train_ohe_united = pd.concat([features_train_ohe, features_valid_ohe]).reset_index(drop=True)
target_train_united = pd.concat([target_train, target_valid]).reset_index(drop=True)

In [62]:
random_src.fit(features_train_ohe_united, target_train_united)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] END max_depth=2, min_samples_leaf=6, min_samples_split=2, n_estimators=30; total time=   7.8s
[CV] END max_depth=2, min_samples_leaf=6, min_samples_split=2, n_estimators=30; total time=   7.8s
[CV] END max_depth=2, min_samples_leaf=6, min_samples_split=2, n_estimators=30; total time=   7.5s
[CV] END max_depth=8, min_samples_leaf=6, min_samples_split=4, n_estimators=80; total time= 1.4min
[CV] END max_depth=8, min_samples_leaf=6, min_samples_split=4, n_estimators=80; total time= 1.4min
[CV] END max_depth=8, min_samples_leaf=6, min_samples_split=4, n_estimators=80; total time= 1.4min
[CV] END max_depth=2, min_samples_leaf=3, min_samples_split=6, n_estimators=30; total time=   7.9s
[CV] END max_depth=2, min_samples_leaf=3, min_samples_split=6, n_estimators=30; total time=   7.8s
[CV] END max_depth=2, min_samples_leaf=3, min_samples_split=6, n_estimators=30; total time=   7.9s
[CV] END max_depth=2, min_samples_leaf=7, min_sa

RandomizedSearchCV(cv=3, estimator=RandomForestRegressor(random_state=12345),
                   n_jobs=-1,
                   param_distributions={'max_depth': range(1, 10),
                                        'min_samples_leaf': range(1, 8),
                                        'min_samples_split': range(2, 10, 2),
                                        'n_estimators': range(10, 101, 10)},
                   random_state=12345,
                   scoring=make_scorer(rmse, greater_is_better=False),
                   verbose=2)

Show the cross-validation results:

In [63]:
fit_time_forest_ohe = round(random_src.cv_results_['mean_fit_time'][random_src.best_index_], 2)
print('Training time:', fit_time_forest_ohe, 's')

Training time: 84.4 s


In [64]:
pred_time_forest_ohe = round(random_src.cv_results_['mean_score_time'][random_src.best_index_], 2)
print('Prediction time:', pred_time_forest_ohe, 's')

Prediction time: 0.4 s


In [65]:
random_src.cv_results_['params'][random_src.best_index_] # hyperparameters of the best model

{'n_estimators': 80,
 'min_samples_split': 4,
 'min_samples_leaf': 6,
 'max_depth': 8}

In [66]:
forest_ohe_rmse = abs(round(random_src.best_score_, 2))
print('Cross-validated RMSE of RandomForestRegressor:', forest_ohe_rmse)

Cross-validated RMSE of RandomForestRegressor: 2073.26


#### Linear regression

For the linear regression model we scale the numeric features:

In [71]:
numeric_col = features_train.select_dtypes(include=['number']).columns.to_list()

features_train_ohe_scaled = features_train_ohe.copy()
features_valid_ohe_scaled = features_valid_ohe.copy()
features_test_ohe_scaled = features_test_ohe.copy()

scaler = StandardScaler()
scaler.fit(features_train_ohe[numeric_col])
features_train_ohe_scaled[numeric_col] = scaler.transform(features_train_ohe[numeric_col])
features_valid_ohe_scaled[numeric_col] = scaler.transform(features_valid_ohe[numeric_col])
features_test_ohe_scaled[numeric_col] = scaler.transform(features_test_ohe[numeric_col])
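The key point of the cell above is that the scaler is fitted on the training set only and then reused for the other sets. A plain-Python sketch of the same standardization, $z = (x - \mu_{train}) / \sigma_{train}$, using a few `power` values from the tables shown earlier (the variable names are assumptions; `StandardScaler` uses the population standard deviation):

```python
import statistics

train_power = [131, 110, 140, 95, 100]    # first training rows shown earlier
valid_power = [75, 140]                   # first validation rows shown earlier

# Statistics are estimated on the training data only
mean = statistics.mean(train_power)
std = statistics.pstdev(train_power)      # population std, i.e. ddof=0

scaled_train = [(x - mean) / std for x in train_power]
scaled_valid = [(x - mean) / std for x in valid_power]   # reuses the training statistics
print(round(mean, 1), round(std, 2))
print([round(z, 2) for z in scaled_valid])
```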

Train the model and make predictions:

In [72]:
model_linear_ohe = LinearRegression()

start = time.time()

model_linear_ohe.fit(features_train_ohe_scaled, target_train)

end = time.time()

fit_time_linear_ohe_scaled = round(end - start, 2)

print('Training time:', fit_time_linear_ohe_scaled, 's')

Training time: 19.37 s


In [73]:
start = time.time()

predictions = model_linear_ohe.predict(features_valid_ohe_scaled)

end = time.time()

pred_time_linear_ohe_scaled = round(end - start, 2)

print('Prediction time:', pred_time_linear_ohe_scaled, 's')

Prediction time: 0.14 s


In [74]:
linear_ohe_scaled_rmse = round(rmse(target_valid, predictions), 2)
print('RMSE of LinearRegression on the validation set:', linear_ohe_scaled_rmse)

RMSE of LinearRegression on the validation set: 2580.86


### Ordinal Encoding

#### Applying `OrdinalEncoder`

In [75]:
oe = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
oe.fit(features_train[cat_features])

features_train_oe = features_train.copy()
features_valid_oe = features_valid.copy()
features_test_oe = features_test.copy()

features_train_oe[cat_features] = oe.transform(features_train_oe[cat_features])
features_valid_oe[cat_features] = oe.transform(features_valid_oe[cat_features])
features_test_oe[cat_features] = oe.transform(features_test_oe[cat_features])
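Ordinal encoding maps each category to its position in the sorted list of categories seen during fitting, and `unknown_value=-1` is used for anything unseen. A hand-written sketch of that mapping (the helpers `fit_ordinal` and `transform_ordinal`, and the value `'pickup'`, are made up for illustration):

```python
def fit_ordinal(values):
    """Map each category to its index in the sorted list of training categories."""
    return {category: index for index, category in enumerate(sorted(set(values)))}


def transform_ordinal(values, mapping, unknown_value=-1):
    """Replace categories by their codes; unseen categories get `unknown_value`."""
    return [mapping.get(value, unknown_value) for value in values]


# The eight body types present after preprocessing
vehicle_types = ['other', 'suv', 'small', 'sedan', 'convertible', 'bus', 'wagon', 'coupe']
mapping = fit_ordinal(vehicle_types)

codes = transform_ordinal(['small', 'sedan', 'pickup'], mapping)
print(mapping)
print(codes)
```

The codes for `small` and `sedan` agree with the `5.0` and `4.0` that appear in the encoded `vehicle_type` column of the training set.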

Show the resulting features for the three sets:

In [76]:
features_train_oe.head()

Unnamed: 0,vehicle_type,registration_year,gearbox,power,model,kilometer,registration_month,fuel_type,brand,repaired
0,5.0,2008,1.0,131,165.0,150000,9,2.0,9.0,0.0
1,5.0,2006,1.0,110,81.0,90000,4,6.0,35.0,0.0
2,5.0,2007,1.0,140,115.0,150000,8,2.0,37.0,0.0
3,4.0,2009,1.0,95,164.0,150000,3,6.0,21.0,0.0
4,5.0,2002,1.0,100,119.0,125000,11,6.0,30.0,0.0


In [77]:
features_valid_oe.head()

Unnamed: 0,vehicle_type,registration_year,gearbox,power,model,kilometer,registration_month,fuel_type,brand,repaired
0,0.0,1988,1.0,75,221.0,150000,7,2.0,37.0,0.0
1,0.0,2007,0.0,140,46.0,150000,8,2.0,20.0,0.0
2,5.0,2006,1.0,105,115.0,150000,9,2.0,37.0,0.0
3,5.0,2001,1.0,58,82.0,150000,7,6.0,24.0,0.0
4,2.0,2007,0.0,286,11.0,150000,7,2.0,2.0,0.0


In [78]:
features_test_oe.head()

Unnamed: 0,vehicle_type,registration_year,gearbox,power,model,kilometer,registration_month,fuel_type,brand,repaired
0,4.0,1998,1.0,200,115.0,150000,6,6.0,37.0,0.0
1,7.0,2011,1.0,125,41.0,150000,8,2.0,24.0,0.0
2,1.0,1997,1.0,125,19.0,150000,5,6.0,1.0,0.0
3,7.0,2002,1.0,141,16.0,150000,12,6.0,19.0,0.0
4,7.0,2006,1.0,101,102.0,125000,0,5.0,10.0,0.0


#### RandomForestRegressor

Train the model and make predictions.

We use `RandomizedSearchCV` to tune the hyperparameters.

In [79]:
model_forest_oe = RandomForestRegressor(random_state=12345)

Define the hyperparameters to search over:

In [80]:
parametrs = {
    'max_depth': range(1, 10),
    'n_estimators': range(10, 101, 10),
    'min_samples_leaf': range(1, 8),
    'min_samples_split': range(2, 10, 2)
}

`RandomizedSearchCV` performs its own cross-validation, so we merge the training and validation sets before fitting it:

In [82]:
features_train_oe_united = pd.concat([features_train_oe, features_valid_oe]).reset_index(drop=True)

In [83]:
random_src.fit(features_train_oe_united, target_train_united)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] END max_depth=2, min_samples_leaf=6, min_samples_split=2, n_estimators=30; total time=   1.1s
[CV] END max_depth=2, min_samples_leaf=6, min_samples_split=2, n_estimators=30; total time=   1.1s
[CV] END max_depth=2, min_samples_leaf=6, min_samples_split=2, n_estimators=30; total time=   1.1s
[CV] END max_depth=8, min_samples_leaf=6, min_samples_split=4, n_estimators=80; total time=  10.1s
[CV] END max_depth=8, min_samples_leaf=6, min_samples_split=4, n_estimators=80; total time=   9.7s
[CV] END max_depth=8, min_samples_leaf=6, min_samples_split=4, n_estimators=80; total time=   9.5s
[CV] END max_depth=2, min_samples_leaf=3, min_samples_split=6, n_estimators=30; total time=   1.1s
[CV] END max_depth=2, min_samples_leaf=3, min_samples_split=6, n_estimators=30; total time=   1.1s
[CV] END max_depth=2, min_samples_leaf=3, min_samples_split=6, n_estimators=30; total time=   1.1s
[CV] END max_depth=2, min_samples_leaf=7, min_sa

RandomizedSearchCV(cv=3, estimator=RandomForestRegressor(random_state=12345),
                   n_jobs=-1,
                   param_distributions={'max_depth': range(1, 10),
                                        'min_samples_leaf': range(1, 8),
                                        'min_samples_split': range(2, 10, 2),
                                        'n_estimators': range(10, 101, 10)},
                   random_state=12345,
                   scoring=make_scorer(rmse, greater_is_better=False),
                   verbose=2)

Show the cross-validation results:

In [84]:
fit_time_forest_oe = round(random_src.cv_results_['mean_fit_time'][random_src.best_index_], 2)
print('Training time:', fit_time_forest_oe, 's')

Training time: 9.47 s


In [85]:
pred_time_forest_oe = round(random_src.cv_results_['mean_score_time'][random_src.best_index_], 2)
print('Prediction time:', pred_time_forest_oe, 's')

Prediction time: 0.28 s


In [86]:
random_src.cv_results_['params'][random_src.best_index_] # hyperparameters of the best model

{'n_estimators': 80,
 'min_samples_split': 4,
 'min_samples_leaf': 6,
 'max_depth': 8}

In [87]:
forest_oe_rmse = abs(round(random_src.best_score_, 2))
print('Cross-validated RMSE of RandomForestRegressor:', forest_oe_rmse)

Cross-validated RMSE of RandomForestRegressor: 2071.64


#### Linear regression

For the linear regression model we scale the numeric features:

In [92]:
features_train_oe_scaled = features_train_oe.copy()
features_valid_oe_scaled = features_valid_oe.copy()
features_test_oe_scaled = features_test_oe.copy()

scaler = StandardScaler()
scaler.fit(features_train_oe[numeric_col])
features_train_oe_scaled[numeric_col] = scaler.transform(features_train_oe[numeric_col])
features_valid_oe_scaled[numeric_col] = scaler.transform(features_valid_oe[numeric_col])
features_test_oe_scaled[numeric_col] = scaler.transform(features_test_oe[numeric_col])

Train the model and make predictions:

In [93]:
model_linear_oe = LinearRegression()

start = time.time()

model_linear_oe.fit(features_train_oe_scaled, target_train)

end = time.time()

fit_time_linear_oe_scaled = round(end - start, 2)

print('Training time:', fit_time_linear_oe_scaled, 's')

Training time: 0.06 s


In [94]:
start = time.time()

predictions = model_linear_oe.predict(features_valid_oe_scaled)

end = time.time()

pred_time_linear_oe_scaled = round(end - start, 2)

print('Prediction time:', pred_time_linear_oe_scaled, 's')

Prediction time: 0.0 s
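The prediction time rounds to 0.0 because the call takes less than 0.005 s and the result is rounded to two decimals. One possible way to make such short calls measurable is sketched below, assuming a hypothetical helper `time_call` that uses `time.perf_counter` and keeps the best of several runs. The workload is a stand-in for `model_linear_oe.predict`.

```python
import time


def time_call(func, *args, repeats=5):
    """Return the best wall-clock duration of func(*args) over several runs, in seconds."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload; in the notebook this would be the predict call.
elapsed = time_call(sum, range(100_000))
print(f'Prediction time: {elapsed:.4f} s')  # four decimals keep small values visible
```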


In [95]:
linear_oe_scaled_rmse = round(rmse(target_valid, predictions), 2)
print('RMSE of LinearRegression on the validation set:', linear_oe_scaled_rmse)

RMSE of LinearRegression on the validation set: 2967.43


### LightGBM

Train the model and make predictions.

To tune the hyperparameters, use `RandomizedSearchCV`.

In [96]:
model_lgbm = LGBMRegressor(random_state=12345)

Define the hyperparameters to search over:

In [97]:
parametrs = {
    'max_depth': range(1, 10),
    'n_estimators': range(10, 101, 10),
}

In [98]:
random_src = RandomizedSearchCV(estimator=model_lgbm,
                                param_distributions=parametrs,
                                cv=3,
                                n_iter=10,
                                n_jobs=-1,
                                random_state=12345,
                                scoring=rmse_scorer,
                                verbose=2)
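With `n_iter=10` the search evaluates only part of the grid defined above. The sketch below counts the full grid and draws 10 distinct combinations with the standard library. It is a stand-in for the sampling step, not scikit-learn's own sampler, so the combinations drawn will differ from those in the log.

```python
import itertools
import random

param_space = {
    'max_depth': range(1, 10),
    'n_estimators': range(10, 101, 10),
}

# Every possible combination: 9 depths x 10 estimator counts.
all_combinations = list(itertools.product(*param_space.values()))
print(len(all_combinations))

# Draw 10 distinct combinations, as a random search with n_iter=10 would.
rng = random.Random(12345)
sampled = rng.sample(all_combinations, k=10)
share = len(sampled) / len(all_combinations)
print(f'{share:.0%} of the grid is evaluated')

# With cv=3 each sampled combination is fitted three times.
total_fits = len(sampled) * 3
print(total_fits)
```

The product of 10 candidates and 3 folds matches the 30 fits reported when the search is run.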

In [99]:
random_src.fit(features_train_oe_united, target_train_united)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] END .......................max_depth=1, n_estimators=50; total time=   3.6s
[CV] END .......................max_depth=1, n_estimators=50; total time=   2.6s
[CV] END .......................max_depth=1, n_estimators=50; total time=   2.4s
[CV] END ......................max_depth=1, n_estimators=100; total time=   4.8s
[CV] END ......................max_depth=1, n_estimators=100; total time=   3.4s
[CV] END ......................max_depth=1, n_estimators=100; total time=   6.1s
[CV] END .......................max_depth=2, n_estimators=40; total time=   4.8s
[CV] END .......................max_depth=2, n_estimators=40; total time=   3.2s
[CV] END .......................max_depth=2, n_estimators=40; total time=   3.1s
[CV] END ......................max_depth=4, n_estimators=100; total time=  13.5s
[CV] END ......................max_depth=4, n_estimators=100; total time=  15.5s
[CV] END ......................max_depth=4, n_es

RandomizedSearchCV(cv=3, estimator=LGBMRegressor(random_state=12345), n_jobs=-1,
                   param_distributions={'max_depth': range(1, 10),
                                        'n_estimators': range(10, 101, 10)},
                   random_state=12345,
                   scoring=make_scorer(rmse, greater_is_better=False),
                   verbose=2)

Print the cross-validation results:

In [100]:
fit_time_lgbm_oe = round(random_src.cv_results_['mean_fit_time'][random_src.best_index_], 2)
print('Training time:', fit_time_lgbm_oe, 's')

Training time: 19.6 s


In [101]:
pred_time_lgbm_oe = round(random_src.cv_results_['mean_score_time'][random_src.best_index_], 2)
print('Prediction time:', pred_time_lgbm_oe, 's')

Prediction time: 0.36 s


In [102]:
random_src.cv_results_['params'][random_src.best_index_]  # hyperparameters of the best model

{'n_estimators': 70, 'max_depth': 7}

In [103]:
lgbm_oe_rmse = abs(round(random_src.best_score_, 2)) 
print('RMSE of LGBMRegressor on cross-validation:', lgbm_oe_rmse)

RMSE of LGBMRegressor on cross-validation: 1762.11


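Collect all the results into a `DataFrame`. The column labels stay in Russian, as in the code: `Время обучения, с` is the training time in seconds and `Время предсказания, с` is the prediction time in seconds.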

In [108]:
data = [[fit_time_cat, pred_time_cat, catboost_rmse],
        [fit_time_forest_ohe, pred_time_forest_ohe, forest_ohe_rmse],
        [fit_time_linear_ohe_scaled, pred_time_linear_ohe_scaled, linear_ohe_scaled_rmse],
        [fit_time_forest_oe, pred_time_forest_oe, forest_oe_rmse],
        [fit_time_linear_oe_scaled, pred_time_linear_oe_scaled, linear_oe_scaled_rmse],
        [fit_time_lgbm_oe, pred_time_lgbm_oe, lgbm_oe_rmse]]

results = pd.DataFrame(
    data,
    index=[
        'CatBoostRegressor', 'Forest_OHE', 'LR_OHE', 'Forest_OE', 'LR_OE',
        'LGBM_OE'
    ],
    columns=['Время обучения, с', 'Время предсказания, с', 'RMSE'])

In [109]:
results.sort_values(by='RMSE')

| | Время обучения, с | Время предсказания, с | RMSE |
|---|---|---|---|
| CatBoostRegressor | 173.31 | 0.49 | 1625.21 |
| LGBM_OE | 19.6 | 0.36 | 1762.11 |
| Forest_OE | 9.47 | 0.28 | 2071.64 |
| Forest_OHE | 84.4 | 0.4 | 2073.26 |
| LR_OHE | 19.37 | 0.14 | 2580.86 |
| LR_OE | 0.06 | 0.0 | 2967.43 |


**Note:** for linear regression on data encoded with `OneHotEncoder`, a huge `RMSE` value was replaced with a missing value so that the output formatting would not break. In the run shown here `LR_OHE` has a finite `RMSE` of 2580.86, so no placeholder appears in the table.

### Testing the best model

`CatBoostRegressor` showed the best result, so check its metric on the test data.

In [110]:
start = time.time()

model_catboost.fit(features_train, target_train, cat_features=cat_features, verbose=False)

end = time.time()

fit_time_cat_test = round(end - start, 2)

print('Training time:', fit_time_cat_test, 's')

Training time: 169.24 s


Make predictions:

In [111]:
start = time.time()

predictions = model_catboost.predict(features_test)

end = time.time()

pred_time_cat_test = round(end - start, 2)

print('Prediction time:', pred_time_cat_test, 's')

Prediction time: 0.47 s


Compute the `RMSE`:

In [112]:
catboost_rmse_test = round(rmse(target_test, predictions), 2)
print('RMSE of CatBoostRegressor on the test set:', catboost_rmse_test)

RMSE of CatBoostRegressor on the test set: 1647.34
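To judge how well the validation estimate carried over to the test set, the two `RMSE` values reported for `CatBoostRegressor` can be compared directly. The short sketch below uses only the numbers printed in the notebook; the variable names are illustrative.

```python
valid_rmse = 1625.21  # CatBoostRegressor on validation, from the results table
test_rmse = 1647.34   # CatBoostRegressor on the test set

abs_gap = test_rmse - valid_rmse
rel_gap = abs_gap / valid_rmse
print(f'Gap: {abs_gap:.2f} euros ({rel_gap:.2%})')
```

A gap of roughly one percent suggests that the model did not noticeably overfit to the validation data.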


## Model analysis

**Let's summarize the results.**

### Training time

In [113]:
results.sort_values(by='Время обучения, с')

| | Время обучения, с | Время предсказания, с | RMSE |
|---|---|---|---|
| LR_OE | 0.06 | 0.0 | 2967.43 |
| Forest_OE | 9.47 | 0.28 | 2071.64 |
| LR_OHE | 19.37 | 0.14 | 2580.86 |
| LGBM_OE | 19.6 | 0.36 | 1762.11 |
| Forest_OHE | 84.4 | 0.4 | 2073.26 |
| CatBoostRegressor | 173.31 | 0.49 | 1625.21 |


**Conclusion:** linear regression on data encoded with `Ordinal Encoding` trained the fastest (0.06 s). `CatBoostRegressor` took the longest to train (173.31 s).

### Prediction time

In [114]:
results.sort_values(by='Время предсказания, с')

| | Время обучения, с | Время предсказания, с | RMSE |
|---|---|---|---|
| LR_OE | 0.06 | 0.0 | 2967.43 |
| LR_OHE | 19.37 | 0.14 | 2580.86 |
| Forest_OE | 9.47 | 0.28 | 2071.64 |
| LGBM_OE | 19.6 | 0.36 | 1762.11 |
| Forest_OHE | 84.4 | 0.4 | 2073.26 |
| CatBoostRegressor | 173.31 | 0.49 | 1625.21 |


**Conclusion:** the same linear regression on data encoded with `Ordinal Encoding` was also the fastest at prediction. The slowest at prediction was `CatBoostRegressor` (0.49 s), followed by `RandomForestRegressor` on data encoded with `One-Hot Encoding` (0.4 s).

### `RMSE`

In [115]:
results.sort_values(by='RMSE')

| | Время обучения, с | Время предсказания, с | RMSE |
|---|---|---|---|
| CatBoostRegressor | 173.31 | 0.49 | 1625.21 |
| LGBM_OE | 19.6 | 0.36 | 1762.11 |
| Forest_OE | 9.47 | 0.28 | 2071.64 |
| Forest_OHE | 84.4 | 0.4 | 2073.26 |
| LR_OHE | 19.37 | 0.14 | 2580.86 |
| LR_OE | 0.06 | 0.0 | 2967.43 |


**Conclusion:** `CatBoostRegressor` achieved the lowest `RMSE` (1625.21). The worst `RMSE` (2967.43) belongs to linear regression on data encoded with `Ordinal Encoding`. The requirement of `RMSE` < 2500 is met by `CatBoostRegressor`, `LightGBM`, and `RandomForestRegressor` with both `One-Hot Encoding` and `Ordinal Encoding`.
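Since the customer cares about quality, prediction speed and training time together, it can help to filter the models by the `RMSE` threshold first and then compare the rest on time. The sketch below does this in plain Python with the numbers copied from the results table; the tuple layout and variable names are illustrative assumptions.

```python
# (model, training time in s, prediction time in s, RMSE) from the results table.
results = [
    ('CatBoostRegressor', 173.31, 0.49, 1625.21),
    ('LGBM_OE', 19.6, 0.36, 1762.11),
    ('Forest_OE', 9.47, 0.28, 2071.64),
    ('Forest_OHE', 84.4, 0.4, 2073.26),
    ('LR_OHE', 19.37, 0.14, 2580.86),
    ('LR_OE', 0.06, 0.0, 2967.43),
]

RMSE_LIMIT = 2500  # threshold quoted in the conclusion above

passing = [row for row in results if row[3] < RMSE_LIMIT]
most_accurate = min(passing, key=lambda row: row[3])
fastest_to_train = min(passing, key=lambda row: row[1])
fastest_to_predict = min(passing, key=lambda row: row[2])

print('Pass the threshold:', [row[0] for row in passing])
print('Most accurate:', most_accurate[0])
print('Fastest to train:', fastest_to_train[0])
print('Fastest to predict:', fastest_to_predict[0])
```

Among the models that pass, `LGBM_OE` sits between the two extremes: its error is higher than that of `CatBoostRegressor`, but it trains roughly nine times faster.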

## Overall conclusion

1. Preliminary data exploration and preparation:

   * imported all the libraries needed for the work ([1.1 Importing libraries](#Импорт-библиотек))
   * saved the data to a variable ([1.2 Reading the file](#Чтение-файла))
   * reviewed the basic information about the dataframe ([1.3 Exploring the data](#Изучение-данных))
   * converted the column names to snake_case in line with PEP 8 ([1.4 Converting columns to snake_case](#Приведение-столбцов-к-змеиному_регистру))
   * checked for and removed exact duplicates, first on the data with all columns ([1.5 Checking for exact duplicates](#Проверка-на-явные-дубликаты)), then again after dropping the unneeded columns ([1.6 Dropping unneeded columns](#Удаление-лишних-столбцов))
   * dropped the columns that do not affect model training ([1.6 Dropping unneeded columns](#Удаление-лишних-столбцов))
   * handled missing values ([1.7 Handling missing values](#Работа-с-пропусками))
   * handled anomalous values ([1.8 Handling anomalous values](#Работа-с-аномальными-значениями))


2. Model training:
   * split the data into three sets before training: training, validation and test ([2.1 Splitting the data into sets](#Деление-данных-на-выборки))
   * split each set into features and the target ([2.2 Splitting the sets into features and target](#Деление-выборок-на-признаки-и-целевой-признак))
   * wrote a function for computing `RMSE` ([2.3 Function for computing `RMSE`](#Функция-для-вычисления-RMSE))
   * trained and evaluated the `CatBoostRegressor` model ([2.4 CatBoostRegressor](#CatBoostRegressor))
   * encoded the data with `One-Hot Encoding` and trained `RandomForestRegressor` and linear regression on it ([2.5 One-Hot Encoding](#One-Hot-Encoding))
   * encoded the data with `Ordinal Encoding` and trained `RandomForestRegressor` and linear regression on it ([2.6 Ordinal Encoding](#Ordinal-Encoding)), as well as `LGBMRegressor` ([2.7 LightGBM](#LightGBM))
   * evaluated the model with the best validation result on the test set ([2.8 Testing the best model](#Тест-лучшей-модели))


3. Analysed the speed and quality of the models ([3 Model analysis](#Анализ-моделей))


4. Wrote the overall conclusion for the project.