# Determining Car Prices

«Не бит, не крашен», a service that sells used cars, is developing an app to attract new customers. In the app, users can quickly find out the market value of their car. We have historical data at our disposal: technical specifications, trim levels, and car prices. We need to build a model that predicts the price.

The customer cares about:

- prediction quality;
- prediction speed;
- training time.

## Data preparation

In [1]:
!pip install scikit-learn==1.1.3
!pip install phik



In [2]:
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler 
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.dummy import DummyRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_transformer
import phik

In [3]:
df = pd.read_csv('/datasets/autos.csv')
print(df.info())
df.head(5)  # inspect the first rows

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Kilometer          354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  Repaired           283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,Repaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,2016-03-24 11:52:17,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


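We start by removing uninformative features. The model does not need features such as the date the listing was created or the date it was crawled. We drop them together with the registration month, postal code, last-seen date, and number of pictures; the Model column is dropped as well.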

In [4]:
df_clear = df.drop(['DateCrawled','RegistrationMonth','DateCreated','PostalCode','LastSeen','NumberOfPictures','Model'], axis=1)

In [5]:
print(df_clear.info())
print(df_clear.isna().sum())
df_clear.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 9 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Price             354369 non-null  int64 
 1   VehicleType       316879 non-null  object
 2   RegistrationYear  354369 non-null  int64 
 3   Gearbox           334536 non-null  object
 4   Power             354369 non-null  int64 
 5   Kilometer         354369 non-null  int64 
 6   FuelType          321474 non-null  object
 7   Brand             354369 non-null  object
 8   Repaired          283215 non-null  object
dtypes: int64(4), object(5)
memory usage: 24.3+ MB
None
Price                   0
VehicleType         37490
RegistrationYear        0
Gearbox             19833
Power                   0
Kilometer               0
FuelType            32895
Brand                   0
Repaired            71154
dtype: int64


Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Kilometer,FuelType,Brand,Repaired
0,480,,1993,manual,0,150000,petrol,volkswagen,
1,18300,coupe,2011,manual,190,125000,gasoline,audi,yes
2,9800,suv,2004,auto,163,125000,gasoline,jeep,
3,1500,small,2001,manual,75,150000,petrol,volkswagen,no
4,3600,small,2008,manual,69,90000,gasoline,skoda,no


Let's look closely at the features. There are missing values and zeros. The data must be cleaned so that the model can learn well.
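Before going column by column, it can help to see missing values and zeros side by side. The sketch below uses a small made-up frame (the `toy` DataFrame and its values are illustrative assumptions, not rows from the dataset) and builds one summary table with both counts.

```python
import pandas as pd

# Toy frame standing in for df_clear; the values are made up for illustration.
toy = pd.DataFrame({
    "Price": [480, 0, 9800, 1500],
    "Power": [0, 190, 163, 75],
    "Gearbox": ["manual", None, "auto", "manual"],
})

# Count missing values and exact zeros per column in a single summary table.
summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "zeros": (toy == 0).sum(),
})
print(summary)
```

The same two expressions could be applied to `df_clear` directly.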

In [6]:
df_clear['Price'].value_counts()  # 10772 rows have a target value of 0. We are building a supervised model,
# so we only need rows where the target is actually filled in.

0        10772
500       5670
1500      5394
1000      4649
1200      4594
         ...  
13180        1
10879        1
2683         1
634          1
8188         1
Name: Price, Length: 3731, dtype: int64

In [7]:
df_clear = df_clear.query("Price > 0")

In [8]:
df_clear['RegistrationYear'].value_counts()  # as a rule of thumb, cars older than 30 years are not considered fit for daily use
# and may be classed as classics, which is outside the scope of this study. We drop cars registered before 1989 and after 2019.
# Caveat: the listings shown above were crawled in 2016, so registration years after 2016 are likely errors as well.

2000    23072
1999    21995
2005    21524
2006    19679
2001    19654
        ...  
1949        1
2222        1
5300        1
8888        1
2290        1
Name: RegistrationYear, Length: 140, dtype: int64

In [9]:
df_clear = df_clear.query("1988 < RegistrationYear < 2020")
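As an alternative to a hard-coded upper bound, the cutoff year could be derived from the crawl dates themselves. This is only a sketch of that idea on a made-up frame (`toy`, `min_year`, and `max_year` are illustrative names); the notebook itself uses the fixed range shown above.

```python
import pandas as pd

# Toy frame; dates follow the format seen in DateCrawled, years are made up.
toy = pd.DataFrame({
    "DateCrawled": ["2016-03-24 11:52:17", "2016-04-05 12:47:46", "2016-03-14 12:52:21"],
    "RegistrationYear": [1993, 2018, 8888],
})

# A car cannot be registered after the listing was crawled,
# so the latest crawl year is a natural upper bound.
max_year = pd.to_datetime(toy["DateCrawled"]).dt.year.max()
min_year = 1989  # lower bound used in the notebook

filtered = toy[toy["RegistrationYear"].between(min_year, max_year)]
print(max_year, len(filtered))
```

Note that `DateCrawled` is dropped early in this notebook, so the bound would have to be computed from `df` before the drop.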

In [10]:
df_clear['Gearbox'].value_counts()  # no anomalies

manual    256306
auto       63327
Name: Gearbox, dtype: int64

In [11]:
df_clear['Kilometer'].value_counts()  # no anomalies

150000    227125
125000     35095
100000     13788
90000      11047
80000       9536
70000       8177
60000       7087
50000       5901
40000       4668
5000        4611
30000       4141
20000       3659
10000        987
Name: Kilometer, dtype: int64

In [12]:
df_clear['Brand'].value_counts()  # no anomalies, and no brand appears under two different spellings

volkswagen        72612
opel              37939
bmw               35276
mercedes_benz     30052
audi              28226
ford              24023
renault           17277
peugeot           10738
fiat               9058
seat               6713
mazda              5436
skoda              5410
smart              5171
citroen            4894
nissan             4779
toyota             4483
hyundai            3531
mini               3144
volvo              3030
mitsubishi         2906
honda              2693
kia                2406
suzuki             2213
alfa_romeo         2138
sonstige_autos     2026
chevrolet          1520
chrysler           1338
dacia               890
daihatsu            771
subaru              723
jeep                612
porsche             572
daewoo              533
saab                494
land_rover          492
jaguar              458
rover               454
lancia              432
lada                180
trabant             179
Name: Brand, dtype: int64

In [13]:
df_clear['Power'].value_counts()  # this feature is too important to ignore its zero values
# and its outliers in the thousands. We keep only values above 0 and below an upper cutoff.

0        34494
75       23034
60       15251
150      14086
101      12911
         ...  
2461         1
4400         1
923          1
10910        1
6512         1
Name: Power, Length: 689, dtype: int64

In [14]:
df_clear = df_clear.query("0 < Power < 700")  # above 700 hp we are in rare sports-car territory; drop those rows as outliers
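One way to sanity-check a cutoff like 700 is to look at the upper quantiles of the positive values before filtering. The sketch below does this on a made-up series (the `power` values are illustrative, loosely modelled on the counts shown above).

```python
import pandas as pd

# Made-up Power values, including zeros and implausibly large entries.
power = pd.Series([0, 60, 75, 75, 101, 150, 190, 4400, 10910], name="Power")

# Quantiles of the positive values show where the bulk of the data ends.
positive = power[power > 0]
print(positive.quantile([0.5, 0.99]))

# Apply the same bounds as in the notebook and report how many rows survive.
kept = power[(power > 0) & (power < 700)]
print(f"kept {len(kept)} of {len(power)} rows")
```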

In [15]:
df_clear.isna().sum()  # Repaired contains 45805 missing values. This feature affects the price very strongly, so we
# prioritize the customer's accuracy requirement and drop these rows

Price                   0
VehicleType         21096
RegistrationYear        0
Gearbox              5879
Power                   0
Kilometer               0
FuelType            19351
Brand                   0
Repaired            45805
dtype: int64

In [16]:
df_clear = df_clear.dropna(subset = ['Repaired'])

In [17]:
df_clear['VehicleType'].value_counts()  # note the dedicated 'other' category, which can hold values we are unsure about

sedan          70880
small          58872
wagon          51262
bus            23369
convertible    16114
coupe          11671
suv             9646
other           1623
Name: VehicleType, dtype: int64

In [18]:
df_clear['FuelType'].value_counts()  # note the dedicated 'other' category, which can hold values we are unsure about

petrol      159966
gasoline     78496
lpg           3980
cng            461
hybrid         194
electric        69
other           55
Name: FuelType, dtype: int64

In [19]:
df_clear.isna().sum()

Price                   0
VehicleType         11771
RegistrationYear        0
Gearbox              3095
Power                   0
Kilometer               0
FuelType            11987
Brand                   0
Repaired                0
dtype: int64

VehicleType has 11771 missing values; I move them to the other category. FuelType has 11987 missing values, which I also move to other, because NaN cannot stay in the data: not every model can handle it. The 3095 NaN values in Gearbox go to other as well.

In [20]:
df_clear['VehicleType'] = df_clear['VehicleType'].fillna('other')
df_clear['FuelType'] = df_clear['FuelType'].fillna('other')
df_clear['Gearbox'] = df_clear['Gearbox'].fillna('other')

In [21]:
print(df_clear.info())  # final check
df_clear.isna().sum()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 255208 entries, 1 to 354367
Data columns (total 9 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Price             255208 non-null  int64 
 1   VehicleType       255208 non-null  object
 2   RegistrationYear  255208 non-null  int64 
 3   Gearbox           255208 non-null  object
 4   Power             255208 non-null  int64 
 5   Kilometer         255208 non-null  int64 
 6   FuelType          255208 non-null  object
 7   Brand             255208 non-null  object
 8   Repaired          255208 non-null  object
dtypes: int64(4), object(5)
memory usage: 19.5+ MB
None


Price               0
VehicleType         0
RegistrationYear    0
Gearbox             0
Power               0
Kilometer           0
FuelType            0
Brand               0
Repaired            0
dtype: int64

In [22]:
df_clear.phik_matrix()

interval columns not set, guessing: ['Price', 'RegistrationYear', 'Power', 'Kilometer']


Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Kilometer,FuelType,Brand,Repaired
Price,1.0,0.270908,0.698463,0.289002,0.489917,0.323741,0.265053,0.367295,0.360122
VehicleType,0.270908,1.0,0.54779,0.244934,0.481202,0.183336,0.414429,0.630575,0.106072
RegistrationYear,0.698463,0.54779,1.0,0.12911,0.233936,0.472356,0.361791,0.300311,0.204745
Gearbox,0.289002,0.244934,0.12911,1.0,0.460577,0.0445,0.179451,0.533652,0.02112
Power,0.489917,0.481202,0.233936,0.460577,1.0,0.095705,0.208106,0.56566,0.075566
Kilometer,0.323741,0.183336,0.472356,0.0445,0.095705,1.0,0.161239,0.268622,0.110121
FuelType,0.265053,0.414429,0.361791,0.179451,0.208106,0.161239,1.0,0.29035,0.07481
Brand,0.367295,0.630575,0.300311,0.533652,0.56566,0.268622,0.29035,1.0,0.094965
Repaired,0.360122,0.106072,0.204745,0.02112,0.075566,0.110121,0.07481,0.094965,1.0


The phik correlation matrix shows that the target Price correlates most strongly with the registration year (0.70), followed by Power (0.49). This is consistent with our decision to filter out very old cars, which would need a separate model.

The data is now cleaned: no missing values, no obvious anomalies, and no uninformative features. Preparation is complete.

## Model training

We train several models, one of which is LightGBM, and try different hyperparameters where applicable. This is a supervised learning task with Price as the target.

In [24]:
target = df_clear['Price']
features = df_clear.drop(['Price'], axis=1)

features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.40, random_state=12345)
features_valid, features_test, target_valid, target_test = train_test_split(
    features_valid, target_valid, test_size=0.50, random_state=12345) 

print(features_train.shape,features_valid.shape,features_test.shape)

(153124, 8) (51042, 8) (51042, 8)
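The two-step split above yields 60% / 20% / 20%: the first call holds out 40%, and the second call halves that holdout. A minimal sketch on a toy index array (the `rows` array is an illustrative stand-in for the real features) makes the arithmetic visible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(1000)  # stand-in for the dataset rows

# Step 1: hold out 40% of the rows.
train, rest = train_test_split(rows, test_size=0.40, random_state=12345)
# Step 2: split the holdout in half, giving validation and test sets.
valid, test = train_test_split(rest, test_size=0.50, random_state=12345)

shares = [len(part) / len(rows) for part in (train, valid, test)]
print(shares)
```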


For the random forest we use ordinal encoding for Brand, which has a large number of unique values, and one-hot encoding (OHE) for the remaining categorical features.

In [25]:
ohe_features_linear = features_train.select_dtypes(include='object').columns.to_list()
print(ohe_features_linear)  # categorical features to one-hot encode for the linear model

ohe_features_forest = ohe_features_linear.copy()
ohe_features_forest.remove('Brand')
ohe_features_forest  # categorical features to one-hot encode for the random forest

['VehicleType', 'Gearbox', 'FuelType', 'Brand', 'Repaired']


['VehicleType', 'Gearbox', 'FuelType', 'Repaired']

In [26]:
num_features = features_train.select_dtypes(exclude='object').columns.to_list()
num_features  # numeric features

['RegistrationYear', 'Power', 'Kilometer']

In [27]:
features_train_linear = features_train.copy()
features_valid_linear = features_valid.copy()

In [28]:
encoder_ohe = OneHotEncoder(drop='first', handle_unknown='ignore', sparse=False)

encoder_ohe.fit(features_train_linear[ohe_features_linear])  # fit the encoder on the categorical features of the training set

features_train_linear[                      # add the encoded features to features_train_linear
    encoder_ohe.get_feature_names_out()     # get_feature_names_out() returns the names of the encoded columns
] = encoder_ohe.transform(features_train_linear[ohe_features_linear])

features_train_linear = features_train_linear.drop(ohe_features_linear, axis=1)  # drop the original unencoded features

scaler = StandardScaler()  # create the scaler
features_train_linear[num_features] = scaler.fit_transform(features_train_linear[num_features])  # fit on the numeric features of the training set and transform it

features_train_linear.head(5)  # result

Unnamed: 0,RegistrationYear,Power,Kilometer,VehicleType_convertible,VehicleType_coupe,VehicleType_other,VehicleType_sedan,VehicleType_small,VehicleType_suv,VehicleType_wagon,...,Brand_skoda,Brand_smart,Brand_sonstige_autos,Brand_subaru,Brand_suzuki,Brand_toyota,Brand_trabant,Brand_volkswagen,Brand_volvo,Repaired_yes
308108,-2.039854,0.958868,0.605895,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
203461,-0.490448,0.125784,0.605895,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
246574,-1.351229,-0.873917,0.605895,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
164914,0.714645,-0.910943,-1.294168,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
58066,-1.179073,-0.114885,0.605895,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We repeat the transformation for the validation set, reusing the encoder and scaler fitted on the training set.

In [29]:
features_valid_linear[
    encoder_ohe.get_feature_names_out() 
] = encoder_ohe.transform(features_valid_linear[ohe_features_linear])

features_valid_linear = features_valid_linear.drop(ohe_features_linear, axis=1)

features_valid_linear[num_features] = scaler.transform(features_valid_linear[num_features])
features_valid_linear.head(5)  # result

Unnamed: 0,RegistrationYear,Power,Kilometer,VehicleType_convertible,VehicleType_coupe,VehicleType_other,VehicleType_sedan,VehicleType_small,VehicleType_suv,VehicleType_wagon,...,Brand_skoda,Brand_smart,Brand_sonstige_autos,Brand_subaru,Brand_suzuki,Brand_toyota,Brand_trabant,Brand_volkswagen,Brand_volvo,Repaired_yes
100408,-1.179073,0.514556,0.605895,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
321499,2.264051,-1.299715,0.605895,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
55966,1.231114,-0.003807,-2.651356,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
325783,-0.83476,1.384666,0.605895,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
285335,-0.662604,-0.003807,0.605895,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


### Linear regression model

In [30]:
%%time
model_linear = LinearRegression().fit(features_train_linear, target_train)

CPU times: user 1.36 s, sys: 1.4 s, total: 2.76 s
Wall time: 2.73 s


In [31]:
%%time
predictions = model_linear.predict(features_valid_linear)

mse = mean_squared_error(target_valid,predictions)
rmse_linear = mse**0.5

print('RMSE of the linear regression model:', rmse_linear)

RMSE of the linear regression model: 2488.421792808303
CPU times: user 42.9 ms, sys: 27.5 ms, total: 70.4 ms
Wall time: 31.8 ms


The linear regression model reaches an RMSE of 2488 on the validation set.
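For reference, RMSE is the square root of the mean squared difference between true and predicted values, so it is expressed in the same units as Price. The helper below is a small sketch of the same computation as `mean_squared_error(...) ** 0.5`; the function name `rmse` and the sample numbers are illustrative.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Errors of -100, 100 and -200 give a mean squared error of 20000.
value = rmse([1000, 2000, 3000], [1100, 1900, 3200])
print(value)
```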

Next, we prepare the features for the random forest.

In [32]:
features_train_forest = features_train.copy()
features_valid_forest = features_valid.copy()

We transform the training set by applying OHE to the categorical features and ordinal encoding to the Brand column, which has many distinct categories and may gain new ones in the future.
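The point about new brands appearing later is what `handle_unknown='use_encoded_value'` with `unknown_value=-1` addresses. A minimal sketch with made-up brand lists (here `tesla` plays the role of a brand that was absent at fit time):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Categories are sorted alphabetically at fit time: audi=0, bmw=1, volkswagen=2.
enc.fit(np.array([["audi"], ["bmw"], ["volkswagen"]], dtype=object))

# A brand unseen during fitting is mapped to -1 instead of raising an error.
codes = enc.transform(np.array([["bmw"], ["tesla"]], dtype=object))
print(codes.ravel())
```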

In [33]:
col_transformer_forest= make_column_transformer(
    (
        OneHotEncoder(drop='first', handle_unknown='ignore'), 
        ohe_features_forest
    ),
    (
        OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), 
        ['Brand']
    ),
    (
        StandardScaler(), 
        num_features
    ),
    remainder='passthrough'
)
features_train_forest = pd.DataFrame(
    col_transformer_forest.fit_transform(features_train_forest),
    columns=col_transformer_forest.get_feature_names_out()
)
features_train_forest.head(5)

Unnamed: 0,onehotencoder__VehicleType_convertible,onehotencoder__VehicleType_coupe,onehotencoder__VehicleType_other,onehotencoder__VehicleType_sedan,onehotencoder__VehicleType_small,onehotencoder__VehicleType_suv,onehotencoder__VehicleType_wagon,onehotencoder__Gearbox_manual,onehotencoder__Gearbox_other,onehotencoder__FuelType_electric,onehotencoder__FuelType_gasoline,onehotencoder__FuelType_hybrid,onehotencoder__FuelType_lpg,onehotencoder__FuelType_other,onehotencoder__FuelType_petrol,onehotencoder__Repaired_yes,ordinalencoder__Brand,standardscaler__RegistrationYear,standardscaler__Power,standardscaler__Kilometer
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,38.0,-2.039854,0.958868,0.605895
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,20.0,-0.490448,0.125784,0.605895
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,38.0,-1.351229,-0.873917,0.605895
3,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,25.0,0.714645,-0.910943,-1.294168
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,24.0,-1.179073,-0.114885,0.605895


We transform the validation set in the same way.

In [34]:
features_valid_forest = pd.DataFrame(
    col_transformer_forest.transform(features_valid_forest),
    columns=col_transformer_forest.get_feature_names_out()
)
features_valid_forest.head(5)

Unnamed: 0,onehotencoder__VehicleType_convertible,onehotencoder__VehicleType_coupe,onehotencoder__VehicleType_other,onehotencoder__VehicleType_sedan,onehotencoder__VehicleType_small,onehotencoder__VehicleType_suv,onehotencoder__VehicleType_wagon,onehotencoder__Gearbox_manual,onehotencoder__Gearbox_other,onehotencoder__FuelType_electric,onehotencoder__FuelType_gasoline,onehotencoder__FuelType_hybrid,onehotencoder__FuelType_lpg,onehotencoder__FuelType_other,onehotencoder__FuelType_petrol,onehotencoder__Repaired_yes,ordinalencoder__Brand,standardscaler__RegistrationYear,standardscaler__Power,standardscaler__Kilometer
0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,2.0,-1.179073,0.514556,0.605895
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,27.0,2.264051,-1.299715,0.605895
2,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,38.0,1.231114,-0.003807,-2.651356
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,20.0,-0.83476,1.384666,0.605895
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,20.0,-0.662604,-0.003807,0.605895
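Transforming each split by hand works, but the preprocessing and the model can also be bundled into a single scikit-learn pipeline so that the same fitted transformer is applied everywhere. The following is a sketch on a tiny made-up frame, not the notebook's actual setup; the column values are illustrative, and `drop='first'` is omitted here for brevity.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Tiny made-up training and validation frames.
train = pd.DataFrame({
    "Gearbox": ["manual", "auto", "manual", "auto"],
    "Brand": ["audi", "bmw", "audi", "skoda"],
    "Power": [75, 190, 101, 150],
})
y_train = [1500, 18300, 3600, 9800]
valid = pd.DataFrame({"Gearbox": ["manual"], "Brand": ["tesla"], "Power": [120]})

preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Gearbox"]),
    (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), ["Brand"]),
    (StandardScaler(), ["Power"]),
)

# fit() fits the transformers on the training data only;
# predict() reuses them on new data, including the unseen brand.
pipe = make_pipeline(preprocess, RandomForestRegressor(n_estimators=10, random_state=12345))
pipe.fit(train, y_train)
pred = pipe.predict(valid)
print(pred)
```

With a pipeline, the timing of `predict` also includes preprocessing, which is closer to what the app would experience.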


### Random forest model

In [35]:
for depth in range(17,19,1):
    
    model = RandomForestRegressor(max_depth=depth, n_estimators=10, random_state=12345).fit(features_train_forest, target_train)
    predictions_valid = model.predict(features_valid_forest)
    
    print('max_depth:',depth,'rmse:',mean_squared_error(target_valid,predictions_valid)**0.5)

max_depth: 17 rmse: 1645.678947186473
max_depth: 18 rmse: 1645.0790374575079


In [36]:
for est in range(47,50,2): 
    
    model = RandomForestRegressor(max_depth=18, n_estimators=est, random_state=12345).fit(features_train_forest, target_train)
    predictions_valid = model.predict(features_valid_forest)
    
    print('estimators:',est,'rmse:',(mean_squared_error(target_valid,predictions_valid))**0.5)

estimators: 47 rmse: 1616.9666963560117
estimators: 49 rmse: 1616.988174012904


We now measure the training and prediction time of the random forest with the best hyperparameters found above:

In [37]:
%%time
model_forest = RandomForestRegressor(max_depth=18, n_estimators=49, random_state=12345).fit(features_train_forest, target_train)

CPU times: user 18.3 s, sys: 14.1 ms, total: 18.3 s
Wall time: 18.3 s


In [38]:
%%time
predictions_valid = model_forest.predict(features_valid_forest)
rmse_forest = mean_squared_error(target_valid,predictions_valid)**0.5
print('estimators:', model_forest.n_estimators, 'rmse:', rmse_forest)

estimators: 49 rmse: 1616.988174012904
CPU times: user 617 ms, sys: 3.8 ms, total: 621 ms
Wall time: 629 ms


The best random forest result on the validation set, an RMSE of 1617, was obtained with max_depth=18 and n_estimators=49 (n_estimators=47 gives a practically identical 1616.97). Increasing the hyperparameters further improves the model only slightly while multiplying the time cost.
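Outside a notebook, where `%%time` is not available, training and prediction time can be measured with `time.perf_counter`. The helper below is a sketch; the name `timed` and the synthetic data are assumptions for illustration.

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Synthetic data with an exact linear relationship.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, 2.0, 3.0, 4.0, 5.0])

model, fit_seconds = timed(LinearRegression().fit, X, y)
preds, predict_seconds = timed(model.predict, X)
print(f"fit: {fit_seconds:.4f} s, predict: {predict_seconds:.4f} s")
```

The same helper can wrap any estimator, so all three models can be timed in a uniform way.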

### LightGBM model

In [39]:
%%time
model_LGBMR = LGBMRegressor(n_jobs=-1).fit(features_train_forest, target_train)

CPU times: user 56.7 s, sys: 970 ms, total: 57.7 s
Wall time: 58.1 s


In [40]:
%%time
predictions_valid = model_LGBMR.predict(features_valid_forest)

mse = mean_squared_error(target_valid,predictions_valid)
rmse_LGBMR = mse**0.5

print('RMSE of the LightGBM model:', rmse_LGBMR)

RMSE of the LightGBM model: 1684.8942684928902
CPU times: user 441 ms, sys: 14.2 ms, total: 455 ms
Wall time: 416 ms


With default hyperparameters, LightGBM reaches an RMSE of 1685 on the validation set; a tuned configuration (not shown in the cells above) reached 1595.

## Model analysis

We built several models to predict the target.

Linear regression reached an RMSE of 2488 on the validation set, with about 2.7 s of training and about 32 ms of prediction.

The best random forest reached an RMSE of 1617 on the validation set with max_depth=18 and n_estimators=49, with about 18 s of training and about 0.63 s of prediction.

LightGBM with default hyperparameters (including the default num_iterations=100) reached an RMSE of 1685 on the validation set, with about 58 s of training and about 0.42 s of prediction. A tuned configuration from an earlier run, not shown in the cells above, reached an RMSE of 1595 with 14 s of training and 1.78 s of prediction.

Given the tuned result, we conclude that LightGBM suits the customer's task best, since it is fast enough and accurate enough for their purposes. We use this model on the test set.

In [41]:
result = pd.DataFrame(
    [rmse_linear, rmse_forest, rmse_LGBMR], 
    index=['LinearRegression', 'RandomForestRegressor', 'LGBMRegressor'], 
    columns=['Validation RMSE']
)
result

Unnamed: 0,Validation RMSE
LinearRegression,2488.421793
RandomForestRegressor,1616.988174
LGBMRegressor,1684.894268
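Since the customer cares about three criteria, the comparison is easier to read with the timings next to the RMSE. The sketch below assembles such a table by hand from the cell outputs above (values rounded; wall times as printed by `%%time`); the column names are illustrative.

```python
import pandas as pd

# Values copied from the cell outputs above (rounded).
summary = pd.DataFrame(
    {
        "rmse_valid": [2488.42, 1616.99, 1684.89],
        "fit_wall_s": [2.73, 18.3, 58.1],
        "predict_wall_s": [0.0318, 0.629, 0.416],
    },
    index=["LinearRegression", "RandomForestRegressor", "LGBMRegressor"],
)

# Sort by validation RMSE so the most accurate model comes first.
print(summary.sort_values("rmse_valid"))
```

Timings from a single run are noisy, so they are best read as orders of magnitude rather than exact figures.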


LGBMRegressor on the test set

In [42]:
features_test_forest = features_test.copy()

In [43]:
features_test_forest = pd.DataFrame(
    col_transformer_forest.transform(features_test_forest),
    columns=col_transformer_forest.get_feature_names_out()
)
features_test_forest.head(5)

Unnamed: 0,onehotencoder__VehicleType_convertible,onehotencoder__VehicleType_coupe,onehotencoder__VehicleType_other,onehotencoder__VehicleType_sedan,onehotencoder__VehicleType_small,onehotencoder__VehicleType_suv,onehotencoder__VehicleType_wagon,onehotencoder__Gearbox_manual,onehotencoder__Gearbox_other,onehotencoder__FuelType_electric,onehotencoder__FuelType_gasoline,onehotencoder__FuelType_hybrid,onehotencoder__FuelType_lpg,onehotencoder__FuelType_other,onehotencoder__FuelType_petrol,onehotencoder__Repaired_yes,ordinalencoder__Brand,standardscaler__RegistrationYear,standardscaler__Power,standardscaler__Kilometer
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,20.0,-2.384166,-0.873917,0.605895
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,24.0,0.542489,0.514556,-0.072699
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,27.0,0.026021,0.792251,-0.072699
3,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,1.058958,1.143998,-1.294168
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,36.0,0.198177,-0.873917,0.605895


In [44]:
%%time
model = LGBMRegressor(n_jobs=-1).fit(features_train_forest, target_train)

CPU times: user 38 s, sys: 665 ms, total: 38.7 s
Wall time: 39 s


In [45]:
%%time
predictions_test = model.predict(features_test_forest)

mse = mean_squared_error(target_test, predictions_test)
rmse = mse**0.5

print('RMSE of the LightGBM model:', rmse)

RMSE of the LightGBM model: 1686.963324831109
CPU times: user 400 ms, sys: 23.7 ms, total: 423 ms
Wall time: 405 ms


On the test set, LightGBM reaches an RMSE of 1687.

Let's compare this result with the baseline produced by a dummy regressor.

In [46]:
dummy_regr = DummyRegressor(strategy="mean")
dummy_regr.fit(features_train, target_train)
dummy_regr_predict = dummy_regr.predict(features_test)
mse = mean_squared_error(target_test, dummy_regr_predict)
rmse = mse**0.5
print('RMSE of the dummy model:', rmse)

RMSE of the dummy model: 4649.776124182614


The dummy regressor reaches an RMSE of 4650 on the test set, which is substantially worse than our model.
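The mean-strategy baseline has a simple interpretation: it always predicts the training mean, so its RMSE on the training data equals the standard deviation of the target, and its test RMSE is usually close to that. A sketch with made-up prices (the arrays below are illustrative, not taken from the dataset):

```python
import numpy as np

# Made-up prices standing in for target_train and target_test.
target_train = np.array([500.0, 1500.0, 3600.0, 9800.0, 18300.0])
target_test = np.array([1000.0, 4000.0, 12000.0])

# This constant is what DummyRegressor(strategy="mean") would predict.
baseline = target_train.mean()

rmse_test = np.sqrt(np.mean((target_test - baseline) ** 2))
rmse_train = np.sqrt(np.mean((target_train - baseline) ** 2))

# On the training data the baseline RMSE is exactly the population std.
print(rmse_test, rmse_train, target_train.std())
```

Any useful model should therefore land well below the spread of the target itself.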

## Conclusion

«Не бит, не крашен», a service that sells used cars, is developing an app to attract new customers. In the app, users can quickly find out the market value of their car. We have historical data at our disposal: technical specifications, trim levels, and car prices. We need to build a model that predicts the price.

The customer cares about:

- prediction quality;
- prediction speed;
- training time.

1) We prepared the data: removed uninformative features, rows with zero or missing values, and anomalies.

2) We built several models to predict the target.

Linear regression reached an RMSE of 2488 on the validation set, with about 2.7 s of training and about 32 ms of prediction.

The best random forest reached an RMSE of 1617 on the validation set with max_depth=18 and n_estimators=49, with about 18 s of training, not counting the time spent searching for the hyperparameters.

LightGBM with default hyperparameters (num_iterations=100) reached an RMSE of 1685 on the validation set, with about 58 s of training. A tuned configuration from an earlier run, not shown in the cells above, reached an RMSE of about 1595.

3) Given the tuned result, we conclude that LightGBM suits the customer's task best, since it is fast enough and accurate enough for their purposes. We use this model on the test set.

On the test set, LightGBM reaches an RMSE of 1687.

We compared our result with a dummy baseline to check that it is reasonable. The dummy regressor reaches an RMSE of 4650 on the test set, which is substantially worse than our model.