<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка-данных" data-toc-modified-id="Подготовка-данных-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data preparation</a></span></li><li><span><a href="#Обучение-моделей" data-toc-modified-id="Обучение-моделей-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Model training</a></span></li><li><span><a href="#Анализ-моделей" data-toc-modified-id="Анализ-моделей-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Model analysis</a></span></li></ul></div>

# Car price prediction

The task is to build a model that estimates car prices for a service that sells used cars. The service lets users quickly find out the market value of their car. The available historical data covers technical specifications, trim levels, and prices.

The customer cares about:

- prediction quality;
- prediction speed;
- training time.

## Data preparation

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from catboost import CatBoostRegressor
from catboost import cv, Pool
from sklearn.metrics import mean_squared_error
from lightgbm import LGBMRegressor
from sklearn.preprocessing import StandardScaler,OrdinalEncoder, OneHotEncoder
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from time import time
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
import warnings
warnings.filterwarnings('ignore')

In [2]:
data = pd.read_csv('/datasets/autos.csv')


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Kilometer          354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(9)

In [4]:
data.describe()

             Price  RegistrationYear       Power      Kilometer  RegistrationMonth  NumberOfPictures    PostalCode
count     354369.0          354369.0    354369.0       354369.0           354369.0          354369.0      354369.0
mean   4416.656776       2004.234448  110.094337  128211.172535           5.714645               0.0  50508.689087
std    4514.158514         90.227958  189.850405    37905.34153           3.726421               0.0  25783.096248
min            0.0            1000.0         0.0         5000.0                0.0               0.0        1067.0
25%         1050.0            1999.0        69.0       125000.0                3.0               0.0       30165.0
50%         2700.0            2003.0       105.0       150000.0                6.0               0.0       49413.0
75%         6400.0            2008.0       143.0       150000.0                9.0               0.0       71083.0
max        20000.0            9999.0     20000.0       150000.0               12.0               0.0       99998.0


In [5]:
data.isna().sum()

DateCrawled              0
Price                    0
VehicleType          37490
RegistrationYear         0
Gearbox              19833
Power                    0
Model                19705
Kilometer                0
RegistrationMonth        0
FuelType             32895
Brand                    0
NotRepaired          71154
DateCreated              0
NumberOfPictures         0
PostalCode               0
LastSeen                 0
dtype: int64

The dataset contains many missing values, duplicates, and anomalous values.

- Mean price: 4416
- Mean power: 110
- Mean mileage: 128211

The minimum and maximum values are implausible: the price has a minimum of 0, registration years range from 1000 to 9999, and power ranges from 0 to 20000.
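
The anomalies noted above could be filtered along these lines. This is a minimal sketch, not part of the original pipeline: the helper name `drop_anomalies` and the bounds (registration years 1950 to 2016, power 1 to 1000) are illustrative assumptions that should be checked against the data before use.

```python
import pandas as pd


def drop_anomalies(df, year_range=(1950, 2016), power_range=(1, 1000), min_price=100):
    """Keep rows whose year, power and price fall inside plausible bounds.

    All bounds are hypothetical defaults chosen for illustration only.
    """
    mask = (
        df['RegistrationYear'].between(*year_range)
        & df['Power'].between(*power_range)
        & (df['Price'] > min_price)
    )
    return df[mask]


# Toy frame that mimics the extremes reported by describe()
sample = pd.DataFrame({
    'Price': [0, 1500, 20000, 3600],
    'RegistrationYear': [1000, 2001, 9999, 2008],
    'Power': [0, 75, 20000, 69],
})
clean = drop_anomalies(sample)
print(clean)
```

Only the two rows with plausible values in all three columns survive. Tightening or loosening the bounds changes how much data is discarded, so the choice is a trade-off rather than a fixed rule.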

In [6]:
# Drop columns assumed not to affect the car price (a judgment call, not verified)
data = data.drop(['DateCrawled', 'DateCreated', 'NumberOfPictures', 'PostalCode', 'LastSeen'], axis=1)

In [7]:
data.shape

(354369, 11)

In [8]:
data.duplicated().sum()

27543

In [9]:
# Remove duplicates
data.drop_duplicates(inplace=True)

In [10]:
data.shape

(326826, 11)

In [11]:
data.head()

   Price  VehicleType  RegistrationYear  Gearbox  Power  Model  Kilometer  RegistrationMonth  FuelType       Brand  NotRepaired
0    480          NaN              1993   manual      0   golf     150000                  0    petrol  volkswagen          NaN
1  18300        coupe              2011   manual    190    NaN     125000                  5  gasoline        audi          yes
2   9800          suv              2004     auto    163  grand     125000                  8  gasoline        jeep          NaN
3   1500        small              2001   manual     75   golf     150000                  6    petrol  volkswagen           no
4   3600        small              2008   manual     69  fabia      90000                  7  gasoline       skoda           no


In [12]:
data[['Price', 'RegistrationYear', 'Power', 'Kilometer', 'RegistrationMonth']] = data[['Price', 'RegistrationYear', 'Power', 'Kilometer', 'RegistrationMonth']].astype('int32')

In [13]:
# Create a list of categorical features
cat_features = ['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'NotRepaired']

In [14]:
data[cat_features].isna().sum()

VehicleType    35249
Gearbox        17578
Model          18532
FuelType       31122
Brand              0
NotRepaired    66427
dtype: int64

In [15]:
# Fill missing values in categorical features with 'no_info'
data[cat_features] = data[cat_features].fillna('no_info')

In [16]:
data = data[data['Price']>100]

In [17]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 314059 entries, 0 to 354368
Data columns (total 11 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   Price              314059 non-null  int32 
 1   VehicleType        314059 non-null  object
 2   RegistrationYear   314059 non-null  int32 
 3   Gearbox            314059 non-null  object
 4   Power              314059 non-null  int32 
 5   Model              314059 non-null  object
 6   Kilometer          314059 non-null  int32 
 7   RegistrationMonth  314059 non-null  int32 
 8   FuelType           314059 non-null  object
 9   Brand              314059 non-null  object
 10  NotRepaired        314059 non-null  object
dtypes: int32(5), object(6)
memory usage: 22.8+ MB


In [18]:
data.head()

   Price  VehicleType  RegistrationYear  Gearbox  Power    Model  Kilometer  RegistrationMonth  FuelType       Brand  NotRepaired
0    480      no_info              1993   manual      0     golf     150000                  0    petrol  volkswagen      no_info
1  18300        coupe              2011   manual    190  no_info     125000                  5  gasoline        audi          yes
2   9800          suv              2004     auto    163    grand     125000                  8  gasoline        jeep      no_info
3   1500        small              2001   manual     75     golf     150000                  6    petrol  volkswagen           no
4   3600        small              2008   manual     69    fabia      90000                  7  gasoline       skoda           no


**Split the dataset into train and test sets**

In [19]:
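# NOTE: this 75/25 split feeds only the CatBoost pools, so CatBoost is
# evaluated on a different test set than the other models.
train, test = train_test_split(data, train_size=0.75, random_state=2007)
features = data.drop(['Price'], axis=1)
target = data['Price']

# 60/20/20 split into train, validation and test sets for the other models
X_train, X_test, y_train, y_test = train_test_split(features, target, train_size=0.6, random_state=2007)
X_valid, X_test, y_valid, y_test = train_test_split(X_test, y_test, train_size=0.5, random_state=2007)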

In [20]:
print(X_train.shape, y_train.shape)
print(X_valid.shape, y_valid.shape)
print(X_test.shape, y_test.shape)

(188435, 10) (188435,)
(62812, 10) (62812,)
(62812, 10) (62812,)
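
The shapes printed above follow from chaining two splits: 60% of the rows go to training, and the remaining 40% is halved into validation and test parts. The arithmetic can be sketched as below; `split_sizes` is a hypothetical helper, and rounding the first part down is an assumption about how `train_test_split` treats a fractional `train_size`.

```python
import math


def split_sizes(n_rows, train_size):
    """Return (first_part, second_part) sizes for a fractional train_size.

    Assumes the first part is rounded down and the rest goes to the second.
    """
    first = math.floor(n_rows * train_size)
    return first, n_rows - first


n_total = 314059  # rows left after cleaning, as reported by data.info()
n_train, n_rest = split_sizes(n_total, 0.6)
n_valid, n_test = split_sizes(n_rest, 0.5)
print(n_train, n_valid, n_test)
```

Under that assumption the sizes agree with the printed shapes: 188435 training rows and 62812 rows each for validation and test.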


In [21]:
# For CatBoost and the encoder, select the columns with dtype 'object'
column_category = X_train.select_dtypes(include='object').columns.to_list()
column_numeric  = X_train.select_dtypes(exclude='object').columns
print(column_category, column_numeric)

['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'NotRepaired'] Index(['RegistrationYear', 'Power', 'Kilometer', 'RegistrationMonth'], dtype='object')


In [22]:
encoder_OE_train = OrdinalEncoder()
X_train_OE = X_train.copy()
X_test_OE = X_test.copy()
X_train_OE.loc[:,column_category] = pd.DataFrame(encoder_OE_train.fit_transform(X_train.loc[:,column_category]),columns=column_category,index=X_train.index)
X_test_OE.loc[:,column_category] = pd.DataFrame(encoder_OE_train.transform(X_test.loc[:,column_category]),columns=column_category,index=X_test.index)
X_test_OE.head()

        VehicleType  RegistrationYear  Gearbox  Power  Model  Kilometer  RegistrationMonth  FuelType  Brand  NotRepaired
124422          5.0              2008      1.0     80  116.0      90000                  5       7.0   38.0          0.0
253446          5.0              1998      0.0    136   95.0     125000                 10       7.0   20.0          1.0
281813          5.0              1997      1.0      0  116.0      70000                  8       7.0   38.0          0.0
278672          0.0              1998      1.0      0  161.0     150000                  4       2.0   33.0          0.0
295318          6.0              2013      1.0     75  152.0      20000                  5       7.0   30.0          1.0
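
With default settings, `OrdinalEncoder` raises an error if the data passed to `transform` contains a category that was absent during `fit` (for example, a rare model). Here is a hedged sketch of one way to guard against that, assuming scikit-learn 0.24 or newer. The toy frames and the sentinel value -1 are illustrative choices, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train_cats = pd.DataFrame({'Brand': ['audi', 'jeep', 'skoda'],
                           'Gearbox': ['manual', 'auto', 'manual']})
test_cats = pd.DataFrame({'Brand': ['audi', 'lada'],      # 'lada' is unseen
                          'Gearbox': ['auto', 'manual']})

# Unseen categories are mapped to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
encoder.fit(train_cats)
encoded = encoder.transform(test_cats)
print(encoded)
```

The sentinel keeps the pipeline running on new data, at the cost of lumping all unseen categories into a single code.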


In [23]:
# Standardize the data
scaler = StandardScaler()
X_train_OE =  pd.DataFrame(scaler.fit_transform(X_train_OE),columns=X_train_OE.columns,index=X_train_OE.index)
X_test_OE =  pd.DataFrame(scaler.transform(X_test_OE),columns=X_train_OE.columns,index=X_test_OE.index)

In [24]:
# X, y column names for the pool
X = ['RegistrationYear', 'Power', 'Kilometer', 'RegistrationMonth', 'VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'NotRepaired']
y = ['Price']

In [25]:
# Create the pool for the training set
train_data = Pool(data=train[X],
                  label=train[y],
                  cat_features=cat_features
                 )

In [26]:
# Create the pool for the test set
test_data = Pool(data=test[X],
                  label=test[y],
                  cat_features=cat_features
                 )

**Run cross-validation for CatBoost**

In [27]:
params = {'cat_features': cat_features,
              'eval_metric': 'RMSE',
              'loss_function': 'RMSE',
              'learning_rate': 0.15,
              'random_seed': 2007,
              'verbose':100}

In [28]:
cv_data = cv(
    params = params,
    pool = train_data,
    fold_count=3,
    shuffle=True,
    partition_random_seed=0,
    stratified=False,
    verbose=100,
    early_stopping_rounds=200
)

Training on fold [0/3]
0:	learn: 5656.1736382	test: 5643.4170966	best: 5643.4170966 (0)	total: 517ms	remaining: 8m 36s
100:	learn: 1750.3051576	test: 1771.1882684	best: 1771.1882684 (100)	total: 25.7s	remaining: 3m 49s
200:	learn: 1677.1100600	test: 1711.5435037	best: 1711.5435037 (200)	total: 52.5s	remaining: 3m 28s
300:	learn: 1640.1532082	test: 1688.1374474	best: 1688.1374474 (300)	total: 1m 16s	remaining: 2m 57s
400:	learn: 1614.4286910	test: 1674.1455551	best: 1674.1455551 (400)	total: 1m 44s	remaining: 2m 35s
500:	learn: 1593.7746672	test: 1664.4624934	best: 1664.4624934 (500)	total: 2m 10s	remaining: 2m 9s
600:	learn: 1573.8693192	test: 1655.9376321	best: 1655.9376321 (600)	total: 2m 36s	remaining: 1m 43s
700:	learn: 1560.1693292	test: 1652.1043977	best: 1652.0261477 (699)	total: 3m 3s	remaining: 1m 18s
800:	learn: 1546.8885089	test: 1648.1401054	best: 1648.1401054 (800)	total: 3m 30s	remaining: 52.4s
900:	learn: 1534.8343954	test: 1644.6906965	best: 1644.6798166 (899)	total: 3m

In [29]:
cv_data[cv_data['test-RMSE-mean'] == cv_data['test-RMSE-mean'].min()]

     iterations  test-RMSE-mean  test-RMSE-std  train-RMSE-mean  train-RMSE-std
997         997     1643.519208       6.660425      1523.978316        2.870819


In [30]:
n_iters = cv_data[cv_data['test-RMSE-mean'] == cv_data['test-RMSE-mean'].min()]['iterations'].values[0]
n_iters

997
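
The same lookup can be written with `idxmin`, which avoids the boolean mask. The sketch below uses a toy frame with the same column names as the `cv` output, and its numbers are placeholders. It assumes that the `iterations` column is zero-based, so a final model would need one more tree than the best index, and that this count would be passed to `CatBoostRegressor` through its `iterations` parameter.

```python
import pandas as pd

# Toy stand-in for the DataFrame returned by catboost.cv (values are placeholders)
cv_results = pd.DataFrame({
    'iterations': [0, 1, 2, 3],
    'test-RMSE-mean': [5600.0, 1800.0, 1650.0, 1655.0],
})

best_row = cv_results.loc[cv_results['test-RMSE-mean'].idxmin()]
best_iteration = int(best_row['iterations'])   # zero-based index of the best round
n_trees = best_iteration + 1                   # assumed tree count for a final model
print(best_iteration, n_trees)
```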

## Model training

**LightGBM**

In [31]:
%%time
LGBM_t0 = time()
LGBM = LGBMRegressor(metric='rmse', max_depth=10, n_estimators=100, random_state=2007, n_jobs=-1, learning_rate=0.22)
LGBM.fit(X_train_OE, y_train)
LGBM_t1 = time()
predict_LGBM = LGBM.predict(X_test_OE)
LGBM_t2 = time()  # stop the clock before computing the metric
RMSE_LGBM = mean_squared_error(y_test, predict_LGBM)**0.5
print('RMSE:', RMSE_LGBM)
LGBM_time_fit = LGBM_t1 - LGBM_t0
LGBM_time_predict = LGBM_t2 - LGBM_t1

RMSE: 1696.6803901305632
CPU times: user 6min 12s, sys: 3.3 s, total: 6min 15s
Wall time: 6min 20s
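
Because the customer cares about training time and prediction time separately, it helps to time each stage on its own and to keep metric computation out of the timed region. A minimal sketch of such a helper follows. The names `timed_fit_predict`, `rmse` and the `MeanModel` stand-in are hypothetical and introduced only for illustration; any estimator with `fit` and `predict` methods could be passed instead.

```python
from time import perf_counter


def rmse(y_true, y_pred):
    """Root mean squared error for two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return (sum(squared) / len(squared)) ** 0.5


def timed_fit_predict(model, X_train, y_train, X_test, y_test):
    """Fit and predict, timing each stage separately; the metric is not timed."""
    t0 = perf_counter()
    model.fit(X_train, y_train)
    t1 = perf_counter()
    predictions = model.predict(X_test)
    t2 = perf_counter()
    return {'rmse': rmse(y_test, predictions),
            'time_fit': t1 - t0,
            'time_predict': t2 - t1}


class MeanModel:
    """Toy estimator that always predicts the mean of the training target."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


report = timed_fit_predict(MeanModel(), [[1], [2], [3]], [100, 200, 300], [[4], [5]], [150, 250])
print(report)
```

Using one helper for every model also guarantees that all rows of the final comparison table are measured the same way.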


**DecisionTreeRegressor**

In [32]:
%%time
DTR_t0 = time()
DTR = DecisionTreeRegressor(max_depth=10)
DTR.fit(X_train_OE, y_train)
DTR_t1 = time()
predict_DTR = DTR.predict(X_test_OE)
DTR_t2 = time()  # stop the clock before computing the metric
RMSE_DTR = mean_squared_error(y_test, predict_DTR)**0.5
print('RMSE:', RMSE_DTR)
DTR_time_fit = DTR_t1 - DTR_t0
DTR_time_predict = DTR_t2 - DTR_t1

RMSE: 2044.6880196160962
CPU times: user 550 ms, sys: 0 ns, total: 550 ms
Wall time: 570 ms


**LinearRegression**

For the linear models we encode categorical features with one-hot encoding (OHE) and standardize them with StandardScaler. For Ridge we will use grid search.
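
One way to keep encoding, scaling and the model together is a scikit-learn pipeline, which fits the transformers on the training rows only and reuses them at prediction time. The sketch below is an illustration on a toy frame, not the notebook's actual code: the column choice, the toy values and `alpha=1.0` are placeholders.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    'Power': [75, 190, 163, 69, 105, 143],
    'Kilometer': [150000, 125000, 125000, 90000, 150000, 100000],
    'Brand': ['volkswagen', 'audi', 'jeep', 'skoda', 'audi', 'skoda'],
    'Price': [1500, 18300, 9800, 3600, 7000, 5200],
})
numeric_cols = ['Power', 'Kilometer']
categorical_cols = ['Brand']

preprocess = ColumnTransformer([
    # scale numeric features
    ('num', StandardScaler(), numeric_cols),
    # one-hot encode categories; unseen categories become all-zero rows
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
])
model = make_pipeline(preprocess, Ridge(alpha=1.0))

X, y = toy.drop(columns='Price'), toy['Price']
model.fit(X.iloc[:4], y.iloc[:4])          # transformers are fitted on the training rows only
predictions = model.predict(X.iloc[4:])    # the same fitted transformers are applied here
print(predictions)
```

With this design the scaled features cannot be accidentally left unused, because the model never sees the raw frame.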

In [33]:
data_ohe = pd.get_dummies(data, drop_first=True)
features_LR = data_ohe.drop(['Price'],axis=1) 
target_LR = data_ohe['Price']

In [38]:
X_train_LR, X_test_LR, y_train_LR, y_test_LR= train_test_split(features_LR, target_LR, train_size=0.6,random_state=2007)
X_valid_LR, X_test_LR, y_valid_LR, y_test_LR= train_test_split(X_test_LR, y_test_LR, train_size=0.5,random_state=2007)

In [39]:
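scaler = StandardScaler()
scaler.fit(X_train_LR)
features_train_scaled = scaler.transform(X_train_LR)
features_valid_scaled = scaler.transform(X_valid_LR)
features_test_scaled = scaler.transform(X_test_LR)  # previously overwrote features_valid_scaled
# NOTE: the linear models are fitted on the unscaled X_*_LR frames; these scaled arrays are not used.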

In [40]:
%%time
LR_t0 = time()
LR = LinearRegression()
LR = LR.fit(X_train_LR, y_train_LR)
LR_t1 = time()
predict_LR_VD = LR.predict(X_valid_LR)
LR_t2 = time()  # stop the clock before computing the metric
RMSE_LR_VD = mean_squared_error(y_valid_LR, predict_LR_VD)**0.5
print('RMSE:', RMSE_LR_VD)
LR_time_fit = LR_t1 - LR_t0
LR_time_valid_predict = LR_t2 - LR_t1

RMSE: 3129.006479173038
CPU times: user 22.7 s, sys: 25.5 s, total: 48.2 s
Wall time: 48.4 s


In [41]:
LR_test_t0 = time()
predict_LR = LR.predict(X_test_LR)
LR_test_t1 = time()  # stop the clock before computing the metric
RMSE_LR = mean_squared_error(y_test_LR, predict_LR)**0.5
print('RMSE:', RMSE_LR)
LR_time_predict = LR_test_t1 - LR_test_t0

RMSE: 3129.8899368763773


**Ridge**

In [43]:
R = Ridge()
parameters = {'alpha': [1e-15, 1e-10, 1e-8, 1e-3, 1e-2, 1, 5, 10, 20, 30, 35, 40, 45, 50, 55, 100]}
R_regressor = GridSearchCV(R, parameters, scoring='neg_mean_squared_error', cv=5)
# alpha is selected by 5-fold cross-validation on the validation split
R_regressor.fit(X_valid_LR, y_valid_LR)

GridSearchCV(cv=5, estimator=Ridge(),
             param_grid={'alpha': [1e-15, 1e-10, 1e-08, 0.001, 0.01, 1, 5, 10,
                                   20, 30, 35, 40, 45, 50, 55, 100]},
             scoring='neg_mean_squared_error')

In [62]:
R_regressor.best_params_

{'alpha': 0.01}
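
With `scoring='neg_mean_squared_error'` the grid search maximises the negated MSE, so `best_score_` is negative and has to be converted before it can be compared with the RMSE values reported for the other models. A small sketch of the conversion follows; the score value is a placeholder, not a result from this notebook.

```python
def neg_mse_to_rmse(neg_mse):
    """Convert a 'neg_mean_squared_error' score into an RMSE."""
    return (-neg_mse) ** 0.5


example_best_score = -9_000_000.0   # placeholder for R_regressor.best_score_
print(neg_mse_to_rmse(example_best_score))
```

scikit-learn also provides the `'neg_root_mean_squared_error'` scorer, which returns the negated RMSE directly.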

In [63]:
R_test = Ridge(alpha = 0.01)
R_t0 = time()
R_test.fit(X_train_LR,y_train_LR)
R_t1 = time()
predict_R=R_test.predict(X_test_LR)
R_t2 = time()
R_time_fit = R_t1 - R_t0
R_time_predict = R_t2 - R_t1

In [64]:
rmse_R = mean_squared_error(y_test_LR, predict_R)**0.5
rmse_R

3129.8809727005037

**CatBoost**

In [None]:
%%time
CB_t0 = time()
# NOTE: params does not set iterations, so n_iters found by cross-validation is not used here
CB = CatBoostRegressor(**params)
CB.fit(train_data)
CB_t1 = time()
test['y_pred'] = CB.predict(test_data)
CB_t2 = time()
RMSE_CB = mean_squared_error(test['Price'], test['y_pred'])**0.5
CB_time_fit = CB_t1 - CB_t0
CB_time_predict = CB_t2 - CB_t1

0:	learn: 4091.4104522	total: 417ms	remaining: 6m 56s


## Model analysis

In [None]:
results = pd.DataFrame([['LGBMRegressor',        RMSE_LGBM, LGBM_time_fit,  LGBM_time_predict],
                        ['DecisionTreeRegressor',RMSE_DTR,  DTR_time_fit,   DTR_time_predict],
                        ['LinearRegression',     RMSE_LR,   LR_time_fit,    LR_time_predict],
                        ['CatBoostRegressor',    RMSE_CB,   CB_time_fit,    CB_time_predict],
                        ['Ridge',                rmse_R,    R_time_fit,     R_time_predict]],  # was CB_time_predict
columns=['Model', 'RMSE', 'Time_fit, sec', 'Time_predict, sec'])
results.sort_values('RMSE').reset_index(drop=True)

In [None]:
results.sort_values('Time_fit, sec').reset_index(drop=True)

The simple standard models are fast, but their prediction quality does not meet the target threshold. CatBoostRegressor gave the best quality, with LGBM in second place. At the same time, CatBoost has the most parameters, which slows down prediction.
Increasing the number of hyperparameters increases training time but also improves model quality. Strictly speaking, the models should be compared with identical hyperparameters, or with none at all. Model quality could also be improved by more thorough preprocessing, but the target RMSE was reached even without it.
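
The trade-off described above can be made explicit by first discarding models that miss the quality threshold and then ranking the rest by speed. The sketch below is illustrative only: the RMSE values are rounded from the outputs shown earlier, while the threshold and all timing figures are placeholders, because the notebook does not show them.

```python
RMSE_THRESHOLD = 2500  # hypothetical quality threshold

# (model, RMSE, fit seconds, predict seconds); timings are placeholders
candidates = [
    ('LGBMRegressor', 1696.68, 10.0, 0.5),
    ('DecisionTreeRegressor', 2044.69, 0.5, 0.02),
    ('LinearRegression', 3129.89, 20.0, 0.1),
]

# Keep models that meet the threshold, then prefer the fastest predictor
eligible = [c for c in candidates if c[1] <= RMSE_THRESHOLD]
ranked = sorted(eligible, key=lambda c: (c[3], c[2]))
for name, score, fit_s, predict_s in ranked:
    print(f'{name}: RMSE={score}, fit={fit_s}s, predict={predict_s}s')
```

Sorting on prediction time first reflects one possible priority; a customer who retrains often might rank by training time instead.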