# Определение стоимости автомобилей / Determining the cost of cars

Сервис по продаже автомобилей с пробегом разрабатывает приложение для привлечения новых клиентов. В нём можно быстро узнать рыночную стоимость своего автомобиля. 

Имеются исторические данные: технические характеристики, комплектации и цены автомобилей. 

Требуется построить модель для определения стоимости. 

A service for selling used cars is developing an application to attract new customers. Here you can quickly find out the market value of your car.

There are historical data: technical characteristics, configurations and prices of cars.

It is required to build a model to determine the cost.

## Подготовка данных / Data preparation

In [1]:
import pandas as pd
import lightgbm as lgb
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OrdinalEncoder 
from sklearn.linear_model import LinearRegression

from catboost import CatBoostRegressor
from sklearn.ensemble import RandomForestRegressor
import warnings
warnings.filterwarnings('ignore')

In [2]:
data = pd.read_csv('/Users/vladamalkina/Downloads/autos.csv')
display(data.head())

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,Repaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,2016-03-24 11:52:17,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Kilometer          354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  Repaired           283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

Проверим количество пропусков в данных / Let's check the number of gaps in the data:

In [4]:
data.isna().sum()

DateCrawled              0
Price                    0
VehicleType          37490
RegistrationYear         0
Gearbox              19833
Power                    0
Model                19705
Kilometer                0
RegistrationMonth        0
FuelType             32895
Brand                    0
Repaired             71154
DateCreated              0
NumberOfPictures         0
PostalCode               0
LastSeen                 0
dtype: int64

Удалим явные дубликаты / Let's remove obvious duplicates:

In [5]:
data = data.drop_duplicates()

Заполним пропуски в столбце Repaired: если пробег автомобиля > 100000 км, то машина была в ремонте(yes), иначе no:

Let's fill in the blanks in the Repaired column: if the car's mileage is > 100,000 km, then the car was repaired (yes), otherwise no:

In [6]:
data.loc[(data['Kilometer']>100000 & data['Repaired'].isna()), 'Repaired'] = 'yes'
data.loc[(data['Kilometer']<100000 & data['Repaired'].isna()), 'Repaired'] = 'no'

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 354365 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354365 non-null  object
 1   Price              354365 non-null  int64 
 2   VehicleType        316875 non-null  object
 3   RegistrationYear   354365 non-null  int64 
 4   Gearbox            334532 non-null  object
 5   Power              354365 non-null  int64 
 6   Model              334660 non-null  object
 7   Kilometer          354365 non-null  int64 
 8   RegistrationMonth  354365 non-null  int64 
 9   FuelType           321470 non-null  object
 10  Brand              354365 non-null  object
 11  Repaired           354365 non-null  object
 12  DateCreated        354365 non-null  object
 13  NumberOfPictures   354365 non-null  int64 
 14  PostalCode         354365 non-null  int64 
 15  LastSeen           354365 non-null  object
dtypes: int64(7), object(

Удалим выбросы в столбце 'Power': минимальное значение мощности = 20 л.с.

Let's remove emissions in the 'Power' column: minimum power value = 20 hp.

In [8]:
data['Power'].unique()
data = data.loc[data['Power']>=20]
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 313723 entries, 1 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        313723 non-null  object
 1   Price              313723 non-null  int64 
 2   VehicleType        290915 non-null  object
 3   RegistrationYear   313723 non-null  int64 
 4   Gearbox            307233 non-null  object
 5   Power              313723 non-null  int64 
 6   Model              300525 non-null  object
 7   Kilometer          313723 non-null  int64 
 8   RegistrationMonth  313723 non-null  int64 
 9   FuelType           292553 non-null  object
 10  Brand              313723 non-null  object
 11  Repaired           313723 non-null  object
 12  DateCreated        313723 non-null  object
 13  NumberOfPictures   313723 non-null  int64 
 14  PostalCode         313723 non-null  int64 
 15  LastSeen           313723 non-null  object
dtypes: int64(7), object(

Удалим выбросы в столбце 'Price': стоимость автомобиля не может быть меньше 300 евро.

Let's remove outliers in the 'Price' column: the cost of the car cannot be less than 300 euros.

In [9]:
data = data.loc[data['Price']>300]
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 296888 entries, 1 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        296888 non-null  object
 1   Price              296888 non-null  int64 
 2   VehicleType        277348 non-null  object
 3   RegistrationYear   296888 non-null  int64 
 4   Gearbox            291395 non-null  object
 5   Power              296888 non-null  int64 
 6   Model              285459 non-null  object
 7   Kilometer          296888 non-null  int64 
 8   RegistrationMonth  296888 non-null  int64 
 9   FuelType           278942 non-null  object
 10  Brand              296888 non-null  object
 11  Repaired           296888 non-null  object
 12  DateCreated        296888 non-null  object
 13  NumberOfPictures   296888 non-null  int64 
 14  PostalCode         296888 non-null  int64 
 15  LastSeen           296888 non-null  object
dtypes: int64(7), object(

Удалим неинформативные столбцы, которые не влияют на стоимость автомобиля:

Let's remove uninformative columns that do not affect the cost of the car:

In [10]:
data = data.drop(columns = ['DateCrawled', 'RegistrationYear', 'RegistrationMonth', 'DateCreated', 
                            'NumberOfPictures', 'PostalCode', 'LastSeen'])

In [11]:
data.isna().sum()

Price              0
VehicleType    19540
Gearbox         5493
Power              0
Model          11429
Kilometer          0
FuelType       17946
Brand              0
Repaired           0
dtype: int64

C помощью заглушки 'unknown' заполним пропуски в 'VehicleType', 'FuelType', 'Gearbox', 'Model':

Using the 'unknown' stub, fill in the gaps in 'VehicleType', 'FuelType', 'Gearbox', 'Model':

In [12]:
data.loc[data['VehicleType'].isna(), 'VehicleType'] = 'unknown'
data.loc[data['FuelType'].isna(), 'FuelType'] = 'unknown'
data.loc[data['Gearbox'].isna(), 'Gearbox'] = 'unknown'
data.loc[data['Model'].isna(), 'Model'] = 'unknown'

In [13]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 296888 entries, 1 to 354368
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Price        296888 non-null  int64 
 1   VehicleType  296888 non-null  object
 2   Gearbox      296888 non-null  object
 3   Power        296888 non-null  int64 
 4   Model        296888 non-null  object
 5   Kilometer    296888 non-null  int64 
 6   FuelType     296888 non-null  object
 7   Brand        296888 non-null  object
 8   Repaired     296888 non-null  object
dtypes: int64(3), object(6)
memory usage: 22.7+ MB


Разделим данные на три выборки: обучающую, валидационную и тестовую (60% 20% 20%).

Let's divide the data into three samples: training, validation and test (60% 20% 20%).

In [14]:
data_train, data_common = train_test_split(data,  test_size = 0.4, random_state = 12345)
data_valid, data_test = train_test_split(data_common,  test_size = 0.5, random_state = 12345)

In [15]:
features_train = data_train.drop(['Price'], axis=1)
target_train = data_train['Price']
features_valid = data_valid.drop(['Price'], axis=1)
target_valid = data_valid['Price'] 
features_test = data_test.drop(['Price'], axis=1)
target_test = data_test['Price']

Проверим размерность выборок:

Let's check the sample size:

In [16]:
print(features_train.shape)
print(target_train.shape)
print(features_valid.shape)
print(target_valid.shape)
print(features_test.shape)
print(target_test.shape)

(178132, 8)
(178132,)
(59378, 8)
(59378,)
(59378, 8)
(59378,)


## Обучение моделей / Model training

CATBOOST

In [17]:
encoder = OrdinalEncoder()
train_ordinal = pd.DataFrame(encoder.fit_transform(features_train), columns = features_train.columns)
valid_ordinal = pd.DataFrame(encoder.fit_transform(features_valid), columns = features_valid.columns)
cat_features = ['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'Repaired']

In [18]:
%%time
parameters = {'depth': range(3, 12, 3),
              'learning_rate': np.arange(0.00, 0.15, 0.05),
              'iterations':  range(20, 100, 20)}
cb = CatBoostRegressor()
grid = GridSearchCV(cb, parameters, cv=5)
grid.fit(train_ordinal, target_train, verbose = False)
grid.best_params_

CPU times: user 2min 9s, sys: 20.3 s, total: 2min 30s
Wall time: 38.5 s


{'depth': 9, 'iterations': 80, 'learning_rate': 0.1}

In [19]:
cb = CatBoostRegressor(depth = 9, learning_rate = 0.1, iterations = 80)
cb.fit(features_train, target_train, cat_features = cat_features, verbose=10)

0:	learn: 4322.1622239	total: 26.7ms	remaining: 2.11s
10:	learn: 3086.6158590	total: 213ms	remaining: 1.33s
20:	learn: 2749.9656313	total: 396ms	remaining: 1.11s
30:	learn: 2647.7414132	total: 579ms	remaining: 915ms
40:	learn: 2596.9104621	total: 769ms	remaining: 732ms
50:	learn: 2565.5950380	total: 962ms	remaining: 547ms
60:	learn: 2538.0700010	total: 1.15s	remaining: 359ms
70:	learn: 2516.5831723	total: 1.35s	remaining: 171ms
79:	learn: 2488.0360241	total: 1.52s	remaining: 0us


<catboost.core.CatBoostRegressor at 0x7fe25e0464f0>

In [20]:
%%time
preds = cb.predict(features_valid)
res = mean_squared_error(preds, target_valid) ** 0.5
print(res)

2497.088711431166
CPU times: user 120 ms, sys: 6.38 ms, total: 126 ms
Wall time: 88.5 ms


СЛУЧАЙНЫЙ ЛЕС / RANDOM FOREST

In [21]:
%%time
parametrs = { 'n_estimators': range (1, 10, 1),
              'max_depth': range (1, 10, 1)}
clf = RandomForestRegressor()
grid = GridSearchCV(clf, parametrs, cv=5)
grid.fit(train_ordinal, target_train)
grid.best_params_

CPU times: user 1min 51s, sys: 251 ms, total: 1min 51s
Wall time: 1min 51s


{'max_depth': 9, 'n_estimators': 7}

In [22]:
clf = RandomForestRegressor(n_estimators = 7, max_depth = 9)
clf.fit(train_ordinal, target_train)

RandomForestRegressor(max_depth=9, n_estimators=7)

In [23]:
%%time
preds =  clf.predict(valid_ordinal)
res = mean_squared_error(preds, target_valid) ** 0.5
print(res)

2812.1058918722333
CPU times: user 59.7 ms, sys: 4.11 ms, total: 63.8 ms
Wall time: 60.7 ms


LIGHTGBM

In [24]:
%%time
params = {
    'task': 'train', 
    'boosting': 'gbdt',
    'objective': 'regression',
    'num_leaves': 10,
    'learning_rage': 0.05,
    'metric': {'l2','l1'},
    'verbose': -1
}
lgb_train = lgb.Dataset(train_ordinal, target_train)
lgb_eval = lgb.Dataset(valid_ordinal, target_valid, reference=lgb_train)
model = lgb.train(params,
                 train_set=lgb_train,
                 valid_sets=lgb_eval,
                 early_stopping_rounds=30)

[1]	valid_0's l1: 3463.4	valid_0's l2: 1.92844e+07
Training until validation scores don't improve for 30 rounds
[2]	valid_0's l1: 3303.03	valid_0's l2: 1.77098e+07
[3]	valid_0's l1: 3164.8	valid_0's l2: 1.63986e+07
[4]	valid_0's l1: 3040.83	valid_0's l2: 1.52372e+07
[5]	valid_0's l1: 2937.2	valid_0's l2: 1.4347e+07
[6]	valid_0's l1: 2842.45	valid_0's l2: 1.35477e+07
[7]	valid_0's l1: 2753.3	valid_0's l2: 1.28333e+07
[8]	valid_0's l1: 2683.64	valid_0's l2: 1.22822e+07
[9]	valid_0's l1: 2613.83	valid_0's l2: 1.17726e+07
[10]	valid_0's l1: 2554.9	valid_0's l2: 1.13306e+07
[11]	valid_0's l1: 2502.21	valid_0's l2: 1.09617e+07
[12]	valid_0's l1: 2458.24	valid_0's l2: 1.0652e+07
[13]	valid_0's l1: 2416.63	valid_0's l2: 1.0377e+07
[14]	valid_0's l1: 2379.43	valid_0's l2: 1.0139e+07
[15]	valid_0's l1: 2344.43	valid_0's l2: 9.92935e+06
[16]	valid_0's l1: 2316.7	valid_0's l2: 9.763e+06
[17]	valid_0's l1: 2288.15	valid_0's l2: 9.59114e+06
[18]	valid_0's l1: 2264.43	valid_0's l2: 9.46326e+06
[19]	v

In [25]:
%%time
y_pred = model.predict(valid_ordinal)
res = mean_squared_error(y_pred, target_valid) ** 0.5
print(res)

2739.519925442982
CPU times: user 463 ms, sys: 14.4 ms, total: 477 ms
Wall time: 77.1 ms


## Анализ моделей / Model Analysis

CATBOOST Значение/Meaning RMSE: 2497.088711431166

СЛУЧАЙНЫЙ ЛЕС/RANDOM FOREST Значение/Meaning RMSE: 2812.1058918722333

LIGHTGBM Значение/Meaning RMSE: 2739.519925442982


In [26]:
preds = cb.predict(features_test)
res = mean_squared_error(preds, target_test) ** 0.5
print(res)

2499.2414501705584


Лучшая модель - Catboost, значение метрики RMSE = 2499.2414501705584.

The best model is Catboost, metric value RMSE = 2499.2414501705584.