# Estimating Car Prices

The used-car marketplace «Не бит, не крашен» is developing an app to attract new customers. In the app, users can quickly find out the market value of their car. You have historical data at your disposal: technical specifications, trim levels, and prices. The task is to build a model that predicts the price.

The customer cares about:

- prediction quality;
- prediction speed;
- training time.

## Data Preparation

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import warnings
from scipy.sparse import hstack
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from catboost import Pool, CatBoostRegressor, cv
from lightgbm import LGBMRegressor
warnings.filterwarnings('ignore')

In [2]:
data = pd.read_csv("/datasets/autos.csv")

## Data Exploration

In [3]:
data.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,Repaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,2016-03-24 11:52:17,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


In [4]:
display(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Kilometer          354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  Repaired           283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(9)

None

Features
- DateCrawled: date the listing was downloaded from the database
- VehicleType: body type
- RegistrationYear: year the car was registered
- Gearbox: gearbox type
- Power: engine power (hp)
- Model: car model
- Kilometer: mileage (km)
- RegistrationMonth: month the car was registered
- FuelType: fuel type
- Brand: car make
- Repaired: whether the car has been repaired
- DateCreated: date the listing was created
- NumberOfPictures: number of photos of the car
- PostalCode: postal code of the listing owner (user)
- LastSeen: date of the user's last activity

Target
- Price: price (EUR)

In [5]:
data[data.duplicated(keep=False)]

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,Repaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
41529,2016-03-18 18:46:15,1999,wagon,2001,manual,131,passat,150000,7,gasoline,volkswagen,no,2016-03-18 00:00:00,0,36391,2016-03-18 18:46:15
88087,2016-03-08 18:42:48,1799,coupe,1999,auto,193,clk,20000,7,petrol,mercedes_benz,no,2016-03-08 00:00:00,0,89518,2016-03-09 09:46:57
90964,2016-03-28 00:56:10,1000,small,2002,manual,83,other,150000,1,petrol,suzuki,no,2016-03-28 00:00:00,0,66589,2016-03-28 08:46:21
171088,2016-03-08 18:42:48,1799,coupe,1999,auto,193,clk,20000,7,petrol,mercedes_benz,no,2016-03-08 00:00:00,0,89518,2016-03-09 09:46:57
187735,2016-04-03 09:01:15,4699,coupe,2003,auto,218,clk,125000,6,petrol,mercedes_benz,yes,2016-04-03 00:00:00,0,75196,2016-04-07 09:44:54
231258,2016-03-28 00:56:10,1000,small,2002,manual,83,other,150000,1,petrol,suzuki,no,2016-03-28 00:00:00,0,66589,2016-03-28 08:46:21
258109,2016-04-03 09:01:15,4699,coupe,2003,auto,218,clk,125000,6,petrol,mercedes_benz,yes,2016-04-03 00:00:00,0,75196,2016-04-07 09:44:54
325651,2016-03-18 18:46:15,1999,wagon,2001,manual,131,passat,150000,7,gasoline,volkswagen,no,2016-03-18 00:00:00,0,36391,2016-03-18 18:46:15


In [6]:
data.describe()

Unnamed: 0,Price,RegistrationYear,Power,Kilometer,RegistrationMonth,NumberOfPictures,PostalCode
count,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0
mean,4416.656776,2004.234448,110.094337,128211.172535,5.714645,0.0,50508.689087
std,4514.158514,90.227958,189.850405,37905.34153,3.726421,0.0,25783.096248
min,0.0,1000.0,0.0,5000.0,0.0,0.0,1067.0
25%,1050.0,1999.0,69.0,125000.0,3.0,0.0,30165.0
50%,2700.0,2003.0,105.0,150000.0,6.0,0.0,49413.0
75%,6400.0,2008.0,143.0,150000.0,9.0,0.0,71083.0
max,20000.0,9999.0,20000.0,150000.0,12.0,0.0,99998.0


Important features: VehicleType, Gearbox, Power, Kilometer, FuelType, Brand, Repaired, RegistrationYear, Model.

NumberOfPictures (zero in every row) and PostalCode carry no useful information and can be dropped.
DateCrawled and LastSeen could be useful for predicting how quickly a car sells, but they are not needed for predicting its price.
DateCreated could be useful for an analysis that accounts for inflation.

Many important features contain missing values or zeros; these need to be either restored or removed.
Some features contain implausible values (for example, RegistrationYear ranges from 1000 to 9999 and Power reaches 20000); these also need to be restored or removed.
A small number of duplicates was found; they need to be removed.
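The observations above can be quantified per column before deciding what to restore and what to drop. The following is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from the dataset); it reports the share of missing values and the share of zero placeholders.

```python
import numpy as np
import pandas as pd

# Toy frame that mimics the kinds of gaps described above (values are made up).
toy = pd.DataFrame({
    "Price": [480, 0, 9800, 1500],
    "Power": [0, 190, 163, 75],
    "VehicleType": [np.nan, "coupe", "suv", "small"],
    "Repaired": [np.nan, "yes", np.nan, "no"],
})

# Share of missing values in every column.
missing_share = toy.isna().mean()

# Share of zero placeholders in the numeric columns only.
zero_share = (toy.select_dtypes("number") == 0).mean()

# Non-numeric columns have no zero share, so fill those cells with 0.
summary = pd.DataFrame({"missing": missing_share, "zeros": zero_share}).fillna(0.0)
print(summary)
```

Running the same two expressions on `data` would give the shares for the real columns.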

## Data Preprocessing

We start with the target, Price. It has no missing values, but some listings have a price of zero. Imputing the mean would be inappropriate because this is the target and doing so would directly bias the predictions, so these rows are removed.

In [7]:
print("Number of listings with zero price:", len(data.loc[data['Price'] == 0]))

Number of listings with zero price: 10772


In [8]:
data = data.loc[data['Price'] != 0]

Model has missing values. Unfortunately, they cannot be restored from Brand, so these rows have to be removed.

In [9]:
print("Number of listings with a missing model:", len(data.loc[data['Model'].isna()]))

Number of listings with a missing model: 17521


In [10]:
data = data.loc[~data['Model'].isna()]

About 10 percent of the VehicleType values are missing, which is too many rows to simply drop, and filling them with the most frequent value could also hurt prediction accuracy. Since users of the app may likewise skip the body type when requesting a valuation, the missing values are replaced with 'unknown'.

In [11]:
print("Number of listings with a missing body type:", len(data.loc[data['VehicleType'].isna()]))

Number of listings with a missing body type: 28166


In [12]:
data['VehicleType'] = data['VehicleType'].fillna('unknown')

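The number of clearly incorrect RegistrationYear values is negligible, so they can be removed. The count below uses loose bounds (before 1769 or after 2021); the filter that follows keeps the years 1900 to 2016, because the latest crawl date falls in 2016.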

In [13]:
print("Number of listings with an incorrect year:", len(data.loc[(data['RegistrationYear'] > 2021) | (data['RegistrationYear'] < 1769)]))

Number of listings with an incorrect year: 83


In [14]:
data['DateCrawled'].max()

'2016-04-07 14:36:58'

In [15]:
data = data.loc[(data['RegistrationYear'] <= 2016) & (data['RegistrationYear'] >= 1900)]
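The upper bound of 2016 follows from the latest crawl date: a car cannot be registered after its listing was downloaded. A minimal sketch of deriving that bound from the data instead of hard-coding it, on a made-up toy frame; the lower bound of 1900 is the same assumption as in the cell above.

```python
import pandas as pd

# Toy rows (made up): one plausible year, one in the future, one far in the past.
toy = pd.DataFrame({
    "DateCrawled": ["2016-03-24 11:52:17", "2016-04-07 14:36:58", "2016-03-14 12:52:21"],
    "RegistrationYear": [1993, 2018, 1000],
})

# Latest year in which any listing was crawled.
max_year = pd.to_datetime(toy["DateCrawled"]).dt.year.max()
min_year = 1900  # assumed lower bound, as in the filter above

# between() is inclusive on both ends.
valid = toy["RegistrationYear"].between(min_year, max_year)
filtered = toy.loc[valid]
print(max_year, len(filtered))
```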

Gearbox also has many missing values. I will replace them with the most common gearbox type for the same model.

In [16]:
print("Number of listings with a missing gearbox type:", len(data.loc[data['Gearbox'].isna()]))

Number of listings with a missing gearbox type: 12889


In [17]:
data['Gearbox'] = data['Gearbox'].fillna(data
                                         .groupby('Model')['Gearbox']
                                         .transform(lambda x: x.value_counts().idxmax())
                                        )
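One caveat with this pattern: `value_counts().idxmax()` would raise an error for a model whose gearbox values are all missing, because there is nothing to take the maximum of. The sketch below shows one possible way to guard against that case on a made-up toy frame; the helper `group_mode` is a hypothetical name introduced here, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy data (made up): the "rare" model has no known gearbox at all.
toy = pd.DataFrame({
    "Model": ["golf", "golf", "golf", "clk", "clk", "rare"],
    "Gearbox": ["manual", "manual", np.nan, "auto", np.nan, np.nan],
})

def group_mode(values: pd.Series):
    """Return the most frequent non-missing value, or NaN for an empty group."""
    modes = values.mode(dropna=True)
    return modes.iloc[0] if not modes.empty else np.nan

# One mode per model, then mapped back onto the rows.
modes = toy.groupby("Model")["Gearbox"].agg(group_mode)
toy["Gearbox"] = toy["Gearbox"].fillna(toy["Model"].map(modes))
print(toy)
```

Rows that remain missing after this step (here, the "rare" model) still need a separate decision, such as dropping them or using a placeholder category.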

Some Power values are 0 or greater than 1000, which is incorrect. They can be replaced with the median for the model.

In [18]:
print("Number of listings with incorrect power:", len(data.loc[(data['Power'] > 1000) | (data['Power'] <= 0)]))

Number of listings with incorrect power: 28700


In [19]:
data.loc[(data['Power'] > 1000) | (data['Power'] <= 0), 'Power'] = None
data['Power'] = data['Power'].fillna(data.groupby('Model')['Power'].transform('median'))
data = data.loc[~data['Power'].isna()]
data['Power'] = data['Power'].astype('int64')
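The same idea can be written without an intermediate in-place assignment of `None`. This is a minimal sketch on made-up values, assuming the same validity range as above (greater than 0 and at most 1000): invalid values are masked first, so they do not distort the per-model median.

```python
import pandas as pd

# Toy data (made up): one zero and one absurdly large value.
toy = pd.DataFrame({
    "Model": ["golf", "golf", "golf", "clk", "clk"],
    "Power": [75, 0, 105, 193, 20000],
})

# Keep values in [1, 1000]; everything else becomes NaN.
power = toy["Power"].where(toy["Power"].between(1, 1000))

# Fill the gaps with the median of the valid values for the same model.
toy["Power"] = power.fillna(power.groupby(toy["Model"]).transform("median"))
print(toy)
```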

Missing FuelType values are replaced with the most frequent fuel type for the model.

In [20]:
data['FuelType'] = data['FuelType'].fillna(data.groupby('Model')['FuelType'].transform(lambda x: x.value_counts().idxmax()))

Repaired is missing in about a fifth of the rows. Most likely, if the car had not been damaged, this field was simply left empty. I will replace the missing values with "no".

In [21]:
# Assign the result back instead of using inplace=True on a column selection
data["Repaired"] = data["Repaired"].fillna("no")

DateCreated is converted to the number of days since the earliest listing date (2014-03-10).

In [22]:
print("Earliest listing creation date:", min(data['DateCreated']))

Earliest listing creation date: 2014-03-10 00:00:00


In [23]:
base_date = pd.Timestamp(min(data['DateCreated']))
data['DateCreated'] = data['DateCreated'].map(lambda date : (pd.Timestamp(date) - base_date).days)
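The row-by-row `map` above works, but on a few hundred thousand rows a vectorized conversion is usually faster. A minimal sketch on made-up dates; the column name `DaysSinceFirstListing` is a hypothetical name used only for this illustration.

```python
import pandas as pd

# Toy dates (made up) in the same string format as the dataset.
toy = pd.DataFrame({
    "DateCreated": ["2014-03-10 00:00:00", "2016-03-24 00:00:00", "2016-04-07 00:00:00"],
})

# Parse the whole column at once, then subtract the earliest date.
created = pd.to_datetime(toy["DateCreated"])
base_date = created.min()
toy["DaysSinceFirstListing"] = (created - base_date).dt.days
print(toy)
```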

Remove duplicates

In [24]:
data = data.drop_duplicates()

Drop the features that are not needed: NumberOfPictures, PostalCode, DateCrawled, LastSeen, RegistrationMonth, DateCreated.

In [25]:
data = data.drop(['NumberOfPictures', 'PostalCode', 'DateCrawled', 'LastSeen', 'RegistrationMonth', 'DateCreated'], axis=1)

In [26]:
data = data.reset_index(drop=True)

In [27]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 314122 entries, 0 to 314121
Data columns (total 10 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Price             314122 non-null  int64 
 1   VehicleType       314122 non-null  object
 2   RegistrationYear  314122 non-null  int64 
 3   Gearbox           314122 non-null  object
 4   Power             314122 non-null  int64 
 5   Model             314122 non-null  object
 6   Kilometer         314122 non-null  int64 
 7   FuelType          314122 non-null  object
 8   Brand             314122 non-null  object
 9   Repaired          314122 non-null  object
dtypes: int64(4), object(6)
memory usage: 24.0+ MB


In [28]:
data.describe()

Unnamed: 0,Price,RegistrationYear,Power,Kilometer
count,314122.0,314122.0,314122.0,314122.0
mean,4672.038269,2002.739124,119.590535,128428.524586
std,4559.20291,6.637302,53.337357,37041.686246
min,1.0,1910.0,1.0,5000.0
25%,1250.0,1999.0,75.0,125000.0
50%,2990.0,2003.0,110.0,150000.0
75%,6790.0,2007.0,147.0,150000.0
max,20000.0,2016.0,1000.0,150000.0


In [29]:
print("Number of duplicates in the final table:", data.duplicated().sum())

Number of duplicates in the final table: 53704


In [30]:
data = data.drop_duplicates()

In [31]:
print("Number of duplicates in the final table:", data.duplicated().sum())

Number of duplicates in the final table: 0
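The large number of duplicates at this stage, after an earlier `drop_duplicates()`, is expected: rows that differed only in the dropped columns (dates, postal code) become identical once those columns are gone. A minimal sketch of the effect on made-up rows:

```python
import pandas as pd

# Toy rows (made up): the first two differ only in PostalCode.
toy = pd.DataFrame({
    "Price": [1500, 1500, 3600],
    "Model": ["golf", "golf", "fabia"],
    "PostalCode": [91074, 60437, 60437],
})

# No full-row duplicates while PostalCode is present.
before = toy.duplicated().sum()

# After dropping the column, the first two rows coincide.
after = toy.drop(columns=["PostalCode"]).duplicated().sum()
print(before, after)
```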


## Encoding Categorical Features

I will split the data into features and target, and then into training, validation, and test sets.

In [32]:
target = data['Price']
features = data.drop('Price', axis=1)



In [33]:
# Split the data into a train+validation part (80%) and a test set (20%)
features_trainval, features_test, target_trainval, target_test = train_test_split(
    features, target, test_size=0.2, random_state=12345)

# Split the train+validation part into training (75%) and validation (25%) sets
features_train, features_val, target_train, target_val = train_test_split(
    features_trainval, target_trainval, test_size=0.25, random_state=12345)

print("Training set size:", len(features_train))
print("Validation set size:", len(features_val))
print("Test set size:", len(features_test))

Training set size: 156250
Validation set size: 52084
Test set size: 52084


This yields features_train, features_val, features_test, target_train, target_val, and target_test. Next, the features are encoded.
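The two-step split gives a 60/20/20 partition: the first call holds out 20% for testing, and the second takes 25% of the remaining 80%, which is 0.25 × 0.8 = 0.2 of the original data. A minimal sketch that checks these proportions on synthetic arrays (the arrays are placeholders, not the car data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholders with 1000 rows.
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# Step 1: hold out 20% as the test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345)

# Step 2: take 25% of the remaining 80% as the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=12345)

shares = {
    "train": len(X_train) / len(X),
    "val": len(X_val) / len(X),
    "test": len(X_test) / len(X),
}
print(shares)
```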


In [34]:
#data_reg = data

In [35]:
#encoder = OrdinalEncoder()
#encoder.fit(data) 
#data_ordinal = pd.DataFrame(encoder.transform(data), columns=data.columns)

#data_ordinal.reset_index()
#data.loc[:,['Brand', 'Model']] = data_ordinal.loc[:,['Brand', 'Model']]
#data['Brand'] = data['Brand'].astype(int)
#data['Model'] = data['Model'].astype(int)

In [73]:
encoder = OneHotEncoder(handle_unknown='ignore')

encoder_feautures_train = pd.DataFrame(encoder.fit_transform(
    features_train[['Brand', 'Model', 'VehicleType', 'Gearbox', 'FuelType', 'Repaired']]). toarray ())

final_features_train = features_train.join(encoder_feautures_train)

final_features_train.drop(['Brand', 'Model', 'VehicleType', 'Gearbox', 'FuelType', 'Repaired'], axis=1, inplace=True)

final_features_train.fillna(0, inplace=True)

final_features_train

Unnamed: 0,RegistrationYear,Power,Kilometer,0,1,2,3,4,5,6,...,297,298,299,300,301,302,303,304,305,306
195985,2009,122,50000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
228151,2005,54,50000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
33534,1993,150,150000,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
284505,2001,58,90000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
299576,1982,80,90000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
198257,2008,170,150000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
70364,2010,90,125000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
116989,1998,133,90000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
7501,2006,125,90000,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0


In [37]:
encoder_feautures_val = pd.DataFrame(encoder.transform(
features_val[['Brand', 'Model', 'VehicleType', 'Gearbox', 'FuelType', 'Repaired']]). toarray ())

final_features_val = features_val.join(encoder_feautures_val)

final_features_val.drop(['Brand', 'Model', 'VehicleType', 'Gearbox', 'FuelType', 'Repaired'], axis=1, inplace=True)

final_features_val.fillna(0, inplace=True)

In [38]:
encoder_feautures_test = pd.DataFrame(encoder.transform(
features_test[['Brand', 'Model', 'VehicleType', 'Gearbox', 'FuelType', 'Repaired']]). toarray ())

final_features_test = features_test.join(encoder_feautures_test)

final_features_test.drop(['Brand', 'Model', 'VehicleType', 'Gearbox', 'FuelType', 'Repaired'], axis=1, inplace=True)

final_features_test.fillna(0, inplace=True)
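One thing worth checking in the cells above is index alignment. `DataFrame.join` matches rows by index label, and a frame built from the encoder output without an explicit index is labelled 0..n-1, while the split features keep their original shuffled labels. Rows whose labels do not match come back as NaN, which `fillna(0)` then turns into all-zero encodings. The following sketch reproduces the effect on made-up rows and shows one possible remedy, passing the original index to the encoded frame.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy split (made up) whose index is not 0..n-1, as after train_test_split.
features = pd.DataFrame(
    {"Power": [122, 54, 150], "Gearbox": ["auto", "manual", "auto"]},
    index=[195985, 228151, 33534],
)

encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(features[["Gearbox"]]).toarray()

# Without index=..., the new frame is labelled 0, 1, 2 and nothing matches.
misaligned = features.join(pd.DataFrame(encoded))

# With the original index, every encoded row stays next to its source row.
aligned = features.join(pd.DataFrame(encoded, index=features.index))

print(misaligned)
print(aligned)
```

In the aligned frame, column 0 corresponds to "auto" and column 1 to "manual", because the encoder sorts categories alphabetically.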

In [39]:
final_features_train

Unnamed: 0,RegistrationYear,Power,Kilometer,0,1,2,3,4,5,6,...,297,298,299,300,301,302,303,304,305,306
195985,2009,122,50000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
228151,2005,54,50000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
33534,1993,150,150000,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
284505,2001,58,90000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
299576,1982,80,90000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
198257,2008,170,150000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
70364,2010,90,125000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
116989,1998,133,90000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
7501,2006,125,90000,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0


The categorical features Brand, Model, VehicleType, Gearbox, FuelType, and Repaired are encoded with one-hot encoding.

In [40]:
#def dum(data, column):
    #df = pd.get_dummies(data[column], prefix=column, drop_first=True)
 #  data = data.drop(column, axis=1)
  #  return data.join(df)

In [41]:
#data = dum(data, 'VehicleType')
#data = dum(data, 'Gearbox')
#data = dum(data, 'FuelType')
#data = dum(data, 'NotRepaired')

In [42]:
#features_train

Standardize the numeric features.

In [43]:
numeric = ['RegistrationYear', 'Power', 'Kilometer']

In [44]:
scaler = StandardScaler()
scaler.fit(final_features_train[numeric])
final_features_train[numeric] = scaler.transform(final_features_train[numeric])
final_features_val[numeric] = scaler.transform(final_features_val[numeric])
final_features_test[numeric] = scaler.transform(final_features_test[numeric])
print(final_features_val.head())

        RegistrationYear     Power  Kilometer    0    1    2    3    4    5  \
95063           1.193568  0.013916  -1.488796  0.0  0.0  0.0  0.0  0.0  0.0   
163478         -0.995890  0.088352   0.610263  0.0  0.0  0.0  0.0  0.0  0.0   
252247          0.901640 -0.376871  -0.045693  0.0  0.0  0.0  0.0  0.0  0.0   
79403           1.193568  1.465411  -2.013561  0.0  0.0  0.0  0.0  0.0  0.0   
35406           0.609712  0.348877   0.610263  0.0  0.0  0.0  0.0  0.0  0.0   

          6  ...  297  298  299  300  301  302  303  304  305  306  
95063   0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  
163478  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  
252247  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  
79403   0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  
35406   0.0  ...  1.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  1.0  

[5 rows x 310 columns]
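As an alternative to joining frames by hand, the encoding and scaling steps could be combined in a single `ColumnTransformer`, which is fitted on the training set and then applied unchanged to the other sets. This is a sketch under stated assumptions, not the approach used in this notebook: the toy rows are made up and only two numeric and two categorical columns are shown.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["RegistrationYear", "Power"]
categorical = ["Gearbox", "Brand"]

# Toy training and validation rows (made up).
train = pd.DataFrame(
    {
        "RegistrationYear": [2009, 2005, 1993, 2001],
        "Power": [122, 54, 150, 58],
        "Gearbox": ["auto", "manual", "manual", "auto"],
        "Brand": ["volkswagen", "bmw", "volkswagen", "opel"],
    },
    index=[195985, 228151, 33534, 284505],
)
val = pd.DataFrame(
    {
        "RegistrationYear": [2008, 2010],
        "Power": [170, 90],
        "Gearbox": ["manual", "auto"],
        "Brand": ["audi", "bmw"],  # "audi" never appears in train
    },
    index=[198257, 70364],
)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

def to_dense(matrix):
    """Return a dense array whether the transformer produced sparse or dense output."""
    return matrix.toarray() if sparse.issparse(matrix) else np.asarray(matrix)

# Fit on train only; reuse the fitted statistics and categories for validation.
X_train = to_dense(preprocess.fit_transform(train))
X_val = to_dense(preprocess.transform(val))
print(X_train.shape, X_val.shape)
```

Because the transformer works on arrays in row order, no index alignment is involved, and an unseen category such as "audi" is encoded as all zeros in the Brand columns.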


Conclusion

Duplicates were removed, and the features needed for modeling were selected.
Missing values, outliers, and implausible values were either filled in from the available information or removed.
All categorical features, including the high-cardinality Model and Brand, were converted with one-hot encoding; the ordinal-encoding variant is left commented out.
The data was split into training, validation, and test sets, and the numeric features were standardized.

## Model Training

### Linear Regression

In [45]:
%%time
lmodel = LinearRegression()
lmodel.fit(final_features_train, target_train)

CPU times: user 10.3 s, sys: 3.29 s, total: 13.6 s
Wall time: 13.6 s


LinearRegression()

In [46]:
%%time
preds_lmodel = lmodel.predict(final_features_val)

CPU times: user 94.4 ms, sys: 93.4 ms, total: 188 ms
Wall time: 216 ms


In [47]:
mse_lmodel = mean_squared_error(target_val, preds_lmodel)
print("RMSE of the linear model on the validation set:", round((mse_lmodel) ** 0.5, 2))

RMSE of the linear model on the validation set: 3275.22
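The metric used throughout is the root mean squared error, RMSE = sqrt(mean((y - ŷ)²)), which is expressed in the same units as the target (euros). Since it is computed in several cells, it could be wrapped in a small helper. A minimal sketch; the function name `rmse` and the sample numbers are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred) -> float:
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Made-up prices and predictions: errors of 100, -100, and 200.
y_true = np.array([1000, 2000, 3000])
y_pred = np.array([1100, 1900, 3200])

score = rmse(y_true, y_pred)
print(round(score, 2))
```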


### Ridge Regression

In [48]:
#%%time

#rmodel = Ridge()
#hyperparams = [{'solver':['auto', 'svd', 'cholesky', 'lsqr','sparse_cg']}]
#rmodel = GridSearchCV(rmodel, hyperparams, scoring='neg_mean_squared_error')
#rmodel.fit(final_features_train, target_train)
#print(rmodel.best_params_)

In [49]:
#%%time
#preds_rmodel = rmodel.predict(final_features_val)

In [50]:
#mse_rmodel = mean_squared_error(target_val, preds_rmodel)
#print("RMSE of the Ridge model on the validation set:", round((mse_rmodel) ** 0.5, 2))

### DecisionTreeRegressor

In [51]:
%%time

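# The criterion argument is omitted: mean squared error is the default,
# and the 'mse' alias is not accepted by recent scikit-learn versions.
trmodel = DecisionTreeRegressor(max_depth=8,
                                random_state=12345)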
trmodel.fit(final_features_train, target_train)

CPU times: user 2.83 s, sys: 15.6 ms, total: 2.84 s
Wall time: 2.84 s


DecisionTreeRegressor(max_depth=8, random_state=12345)

In [52]:
%%time
preds_trmodel = trmodel.predict(final_features_val)

CPU times: user 25.4 ms, sys: 24.4 ms, total: 49.8 ms
Wall time: 48.4 ms


In [53]:
mse_trmodel = mean_squared_error(target_val, preds_trmodel)
print("RMSE of the DecisionTreeRegressor model on the validation set:", round((mse_trmodel) ** 0.5, 2))

RMSE of the DecisionTreeRegressor model on the validation set: 2271.12


### CatBoostRegressor

In [54]:
#numerical_features = ['DateCreated', 'Price', 'RegistrationYear', 'Power', 'Kilometer']

In [55]:
#categorical_features = [col for col in list(features_train.columns) if col not in numerical_features]

Because CatBoostRegressor accepts only integers or strings as categorical values, the features had to be reassembled: the model is trained on the unencoded features_train, with the categorical columns passed through cat_features.
I also tried leaving the categorical features without one-hot encoding. Quality improved slightly for CatBoost and LGBM, but processing time increased many times over.

In [56]:
#data_reg.loc[:,['VehicleType','Gearbox', 'Brand', 'FuelType', 'Repaired']] = data_ordinal.loc[:,['VehicleType','Gearbox', 'Brand', 'FuelType', 'Repaired']]
#catfeatures = data_reg.drop('Price', axis=1)
#cattarget = data_reg['Price']
#catfeatures_train, catfeatures_test, cattarget_train, cattarget_test = train_test_split(features, target, test_size=0.25, random_state=12345)

In [57]:
#numericcat = ['RegistrationYear', 'Power', 'Kilometer', 'DateCreated']
#scaler = StandardScaler()
#scaler.fit(catfeatures_train[numericcat])
#catfeatures_train[numericcat] = scaler.transform(catfeatures_train[numericcat])
#catfeatures_test[numericcat] = scaler.transform(catfeatures_test[numericcat]) 

In [58]:
%%time
catmodel = CatBoostRegressor(learning_rate=0.5, random_state=12345, verbose=False) 
catmodel.fit(features_train, target_train, cat_features=['Brand', 'Model', 'VehicleType', 'Gearbox', 'FuelType', 'Repaired'])

CPU times: user 1min 55s, sys: 392 ms, total: 1min 55s
Wall time: 1min 57s


<catboost.core.CatBoostRegressor at 0x7ff811985b20>

In [59]:
%%time

cat_predict = catmodel.predict(features_val)

CPU times: user 621 ms, sys: 3.59 ms, total: 625 ms
Wall time: 631 ms


In [60]:
mse_catmodel = mean_squared_error(target_val, cat_predict)

print("RMSE of the CatBoostRegressor model on the validation set:", round((mse_catmodel) ** 0.5, 2))

RMSE of the CatBoostRegressor model on the validation set: 1665.14


### LGBMRegressor

In [61]:
cat_columns = ['Brand', 'Model', 'VehicleType', 'Gearbox', 'FuelType', 'Repaired']
features_train[cat_columns] = features_train[cat_columns].astype('category')

In [62]:
features_train

Unnamed: 0,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,FuelType,Brand,Repaired
195985,small,2009,auto,122,golf,50000,petrol,volkswagen,no
228151,small,2005,manual,54,polo,50000,petrol,volkswagen,no
33534,coupe,1993,auto,150,3er,150000,petrol,bmw,no
284505,small,2001,manual,58,corsa,90000,petrol,opel,no
299576,coupe,1982,manual,80,golf,90000,petrol,volkswagen,no
...,...,...,...,...,...,...,...,...,...
198257,sedan,2008,manual,170,3er,150000,petrol,bmw,no
70364,wagon,2010,manual,90,megane,125000,gasoline,renault,no
116989,sedan,1998,manual,133,nubira,90000,petrol,daewoo,no
7501,sedan,2006,manual,125,mondeo,90000,lpg,ford,no


In [63]:
%%time

lgbmmodel = LGBMRegressor(learning_rate=0.1, 
                      num_leaves=100, 
                      random_state=12345)
lgbmmodel.fit(features_train, target_train)

CPU times: user 9min 7s, sys: 3.92 s, total: 9min 11s
Wall time: 9min 14s


LGBMRegressor(num_leaves=100, random_state=12345)

In [64]:
features_val[cat_columns] = features_val[cat_columns].astype('category')

In [65]:
%%time

lgbm_predict = lgbmmodel.predict(features_val)

CPU times: user 860 ms, sys: 1 ms, total: 861 ms
Wall time: 871 ms


In [66]:
mse_lgbmmodel = mean_squared_error(target_val, lgbm_predict)

print("RMSE of the LGBMRegressor model on the validation set:", round((mse_lgbmmodel) ** 0.5, 2))

RMSE of the LGBMRegressor model on the validation set: 1646.13


The following were measured:

- training time
- prediction time
- prediction quality (RMSE)
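The three measurements listed above could also be collected programmatically rather than read off `%%time` output. The following is a minimal sketch, assuming any estimator with `fit` and `predict`; the helper name `evaluate` and the synthetic data are hypothetical and stand in for the notebook's real features.

```python
import time

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

def evaluate(model, X_train, y_train, X_val, y_val) -> dict:
    """Fit the model, predict on validation data, and report timings and RMSE."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    preds = model.predict(X_val)
    predict_seconds = time.perf_counter() - start

    rmse = float(np.sqrt(mean_squared_error(y_val, preds)))
    return {"fit_s": fit_seconds, "predict_s": predict_seconds, "rmse": rmse}

# Synthetic regression data (placeholder for the real features and target).
rng = np.random.default_rng(12345)
X = rng.normal(size=(500, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=500)

result = evaluate(
    DecisionTreeRegressor(max_depth=8, random_state=12345),
    X[:400], y[:400], X[400:], y[400:],
)
print(result)
```

Calling the helper once per model and stacking the dictionaries into a DataFrame gives a comparison table with all three criteria side by side.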

## Testing the Best Model

In [67]:
features_test[cat_columns] = features_test[cat_columns].astype('category')

In [68]:
%%time

lgbm_predict = lgbmmodel.predict(features_test)

CPU times: user 835 ms, sys: 1.76 ms, total: 836 ms
Wall time: 784 ms


In [69]:
mse_lgbmmodel = mean_squared_error(target_test, lgbm_predict)

print("RMSE of the LGBMRegressor model on the test set:", round((mse_lgbmmodel) ** 0.5, 2))

RMSE of the LGBMRegressor model on the test set: 1652.31


## Model Analysis

In [70]:
df = [[3275.22],
        [2271.12],
        [1665.14],
        [1646.13]]
model = ["Linear Regression", "DecisionTreeRegressor", "CatBoostRegressor", "LGBMRegressor"]

In [71]:
pd.DataFrame(data=df, index=model, columns=['RMSE'])

Unnamed: 0,RMSE
Linear Regression,3275.22
DecisionTreeRegressor,2271.12
CatBoostRegressor,1665.14
LGBMRegressor,1646.13


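Conclusion 2.0

In this version of the project, LGBMRegressor achieved the best validation RMSE (1646.13), with CatBoostRegressor close behind (1665.14). LGBMRegressor also confirmed its result on the test set (RMSE 1652.31). It was, however, the slowest model to train (wall time 9min 14s versus 1min 57s for CatBoostRegressor), which matters because the customer also cares about training time.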
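Since the customer's criteria include training and prediction time as well as quality, it may help to view all three in one table. The sketch below copies the wall times and validation RMSE values from the cell outputs earlier in this notebook (1min 57s is written as 117 s and 9min 14s as 554 s); the column names are chosen here for illustration.

```python
import pandas as pd

# Wall times (seconds) and validation RMSE copied from the cell outputs above.
results = pd.DataFrame(
    {
        "fit_wall_s": [13.6, 2.84, 117.0, 554.0],
        "predict_wall_s": [0.216, 0.0484, 0.631, 0.871],
        "rmse_val": [3275.22, 2271.12, 1665.14, 1646.13],
    },
    index=["LinearRegression", "DecisionTreeRegressor",
           "CatBoostRegressor", "LGBMRegressor"],
)

best_rmse = results["rmse_val"].idxmin()
fastest_fit = results["fit_wall_s"].idxmin()
print(results)
print("Lowest RMSE:", best_rmse)
print("Fastest training:", fastest_fit)
```

Viewed this way, the choice between the two boosting models is a trade-off: a difference of about 19 euros in RMSE against several minutes of training time.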