# Estimating car prices

The used-car sales service «Не бит, не крашен» is developing an app to attract new customers. In the app, users can quickly find out the market value of their car. You have historical data at your disposal: technical specifications, trim levels, and car prices. You need to build a model that estimates the price.

The customer cares about:

- prediction quality;
- prediction speed;
- training time.

## Data preparation

In [None]:
!pip install catboost
import catboost as cb
from catboost import CatBoostRegressor



In [None]:
conda install lightgbm

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


Import all the required libraries

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.preprocessing import (
    OneHotEncoder,
    OrdinalEncoder,
    StandardScaler)

from catboost import CatBoostRegressor, Pool
import lightgbm as lgb
from lightgbm import LGBMRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from IPython.display import display

warnings.filterwarnings('ignore')

Load the dataset

In [None]:
try:
    data = pd.read_csv('/datasets/autos.csv')
except FileNotFoundError:
    # fall back to the local copy when the shared path is not available
    data = pd.read_csv('C:/Users/goshe/OneDrive/Рабочий стол/Yandex_Praktikum/projects/Datasets/autos.csv')

Explore the dataset

In [None]:
data.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,Repaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,2016-03-24 11:52:17,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Kilometer          354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  Repaired           283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

The columns VehicleType, Gearbox, Model, FuelType, and Repaired contain missing values.

Fill the gaps in VehicleType, Gearbox, and FuelType with the mode for the car's model. In Repaired, fill the gaps with "no", assuming that a missing value means the car has not been repaired. Fill the gaps in Model with the value "no", since there is no source to recover them from.

In [None]:
data["Model"] = data["Model"].fillna("no")
data["Repaired"] = data["Repaired"].fillna("no")

In [None]:
data['VehicleType'] = data.groupby(['Model'])['VehicleType'].transform(lambda x: x.fillna((x.mode()[0] if x.count()!=0 else "unknown")))

In [None]:
data['Gearbox'] = data.groupby(['Model'])['Gearbox']\
    .transform(lambda x: x.fillna((x.mode()[0] if x.count()!=0 else "unknown")))

In [None]:
data['FuelType'] = data.groupby(['Model'])['FuelType']\
    .transform(lambda x: x.fillna((x.mode()[0] if x.count()!=0 else "unknown")))
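The three fill cells above repeat the same group-wise mode logic. As a minimal sketch, it could be wrapped in one helper; the name `fill_with_group_mode` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def fill_with_group_mode(df, column, by, fallback="unknown"):
    """Fill NaNs in `column` with the most frequent value within each `by` group."""
    def _fill(group):
        modes = group.mode()  # mode() skips NaN; it is empty when the group is all-NaN
        return group.fillna(modes.iloc[0] if not modes.empty else fallback)

    return df.groupby(by)[column].transform(_fill)


# Toy data (hypothetical): one model with a clear mode, one with no known values.
toy = pd.DataFrame({
    "Model": ["golf", "golf", "golf", "rare", "rare"],
    "Gearbox": ["manual", "manual", None, None, None],
})
toy["Gearbox"] = fill_with_group_mode(toy, "Gearbox", by="Model")
print(toy["Gearbox"].tolist())
```

Groups with no known value fall back to `"unknown"`, which mirrors the `x.count() != 0` check in the cells above.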

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        354369 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            354369 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              354369 non-null  object
 7   Kilometer          354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           354369 non-null  object
 10  Brand              354369 non-null  object
 11  Repaired           354369 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

No missing values remain.

In [None]:
data.describe()

Unnamed: 0,Price,RegistrationYear,Power,Kilometer,RegistrationMonth,NumberOfPictures,PostalCode
count,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0
mean,4416.656776,2004.234448,110.094337,128211.172535,5.714645,0.0,50508.689087
std,4514.158514,90.227958,189.850405,37905.34153,3.726421,0.0,25783.096248
min,0.0,1000.0,0.0,5000.0,0.0,0.0,1067.0
25%,1050.0,1999.0,69.0,125000.0,3.0,0.0,30165.0
50%,2700.0,2003.0,105.0,150000.0,6.0,0.0,49413.0
75%,6400.0,2008.0,143.0,150000.0,9.0,0.0,71083.0
max,20000.0,9999.0,20000.0,150000.0,12.0,0.0,99998.0


Check the categorical features for implicit duplicates, that is, values that differ only in letter case

In [None]:
objects_columns = ["VehicleType","Gearbox","Model","FuelType","Brand","Repaired"]
for column in objects_columns:
    print(column, pd.Series(data[column].unique()).str.lower().duplicated().sum())

VehicleType 0
Gearbox 0
Model 0
FuelType 0
Brand 0
Repaired 0


No duplicates found.

Check for fully duplicated rows

In [None]:
data.duplicated().sum()

5

Drop the duplicates

In [None]:
data = data.drop_duplicates()
data.duplicated().sum()

0

Look at the latest date in the `DateCreated` column

In [None]:
data['DateCreated'].max()

'2016-04-07 00:00:00'

Drop the columns that are no longer needed

In [None]:
data = data.drop(["DateCrawled","DateCreated","LastSeen","NumberOfPictures","PostalCode"],axis = 1)

Conclusions:
- The data contain outliers and implausible values:
  - Registration year. Car registration began in 1931, so all earlier values are invalid. The latest `DateCreated` value falls in 2016, so registration years after 2016 are invalid as well.
  - Power. The minimum power of a car is 13 hp and the maximum is 1500 hp. Values outside this range are invalid.
  - Registration month. Zeros are present; replace them with 1.
  - Price. Some cars have a very low price; cut off values below 50 euros, since these are most likely data errors.
- The features "DateCrawled", "DateCreated", "LastSeen", "NumberOfPictures", and "PostalCode" carry no useful information and can be dropped.
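The bounds listed above can be enforced in more than one way. The notebook's next cells clip values to boundary values; an alternative is to drop the offending rows. The sketch below shows the filtering variant on a hypothetical toy frame, using the limits named in the conclusions.

```python
import pandas as pd

# Toy data (hypothetical), with one implausible value per column.
toy = pd.DataFrame({
    "RegistrationYear": [1000, 1995, 2010, 9999],
    "Power": [0, 75, 150, 20000],
    "Price": [0, 1500, 9800, 300],
})

# Keep only rows whose values fall inside the limits from the conclusions.
mask = (
    toy["RegistrationYear"].between(1931, 2016)
    & toy["Power"].between(13, 1500)
    & (toy["Price"] >= 50)
)
filtered = toy.loc[mask].reset_index(drop=True)
print(filtered)
```

Filtering shrinks the dataset, while clipping keeps every row but assigns boundary values to the outliers. Which is preferable depends on how many rows are affected.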

Remove cars priced below 50 euros.

In [None]:
data = data.loc[data['Price'] >=50 ]

In [None]:
# RegistrationYear: clip values to the [1930, 2016] range
data["RegistrationYear"] = data["RegistrationYear"].clip(lower=1930, upper=2016)
# RegistrationMonth: replace 0 with 1
data.loc[data['RegistrationMonth'] == 0, 'RegistrationMonth'] = 1
# Power: clip values to the [13, 3500] range
data["Power"] = data["Power"].clip(lower=13, upper=3500)

In [None]:
data.describe()

Unnamed: 0,Price,RegistrationYear,Power,Kilometer,RegistrationMonth
count,341941.0,341941.0,341941.0,341941.0,341941.0
mean,4577.083924,2003.142212,110.88925,128456.210282,5.885258
std,4514.844713,7.252932,89.327498,37321.04486,3.554349
min,50.0,1930.0,13.0,5000.0,1.0
25%,1200.0,1999.0,69.0,125000.0,3.0
50%,2900.0,2003.0,105.0,150000.0,6.0
75%,6500.0,2008.0,143.0,150000.0,9.0
max,20000.0,2016.0,3500.0,150000.0,12.0


Conclusions:
- The data have been preprocessed:
 - missing values were filled in,
 - outliers were trimmed,
 - features that do not affect the target were removed.

## Model training

### Data preparation

Separate the target from the features

In [None]:
target = data['Price']
features = data.drop('Price', axis=1)

Split into training and test samples

In [None]:
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.25, random_state=12345)


Encode the features with OHE:

In [None]:
features_ohe = pd.get_dummies(features, drop_first=True)

features_train_ohe, features_test_ohe, target_train_ohe, target_test_ohe = train_test_split(features_ohe,
                                                                                            target,
                                                                                            test_size=.25,
                                                                                            random_state=12345)

In [None]:
!pip install scikit-learn==1.1.3



Encode the categorical features with OE:

In [None]:
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan)

cat_columns = ['VehicleType', 'Gearbox', 'FuelType','Brand','Model', 'Repaired']

features_train_oe = features_train.copy()
features_test_oe = features_test.copy()

features_train_oe[cat_columns] = encoder.fit_transform(features_train[cat_columns])
features_test_oe[cat_columns] = encoder.transform(features_test[cat_columns])
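A small illustration of what `handle_unknown='use_encoded_value'` with `unknown_value=np.nan` does at transform time: a category absent from the training data is encoded as NaN instead of raising an error. The toy frames and the name `toy_encoder` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"Brand": ["audi", "bmw", "audi"]})
test = pd.DataFrame({"Brand": ["bmw", "tesla"]})  # "tesla" never appears in train

toy_encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan)
toy_encoder.fit(train)          # categories are learned from train only
encoded = toy_encoder.transform(test)
print(encoded)                  # the unseen category becomes NaN
```

Gradient boosting libraries such as LightGBM and CatBoost accept NaN in numeric inputs, whereas a linear model like Ridge would need these values imputed first.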

Check the sizes of the resulting samples:

In [None]:
for i in [features_train_ohe, features_test_ohe, target_train_ohe, target_test_ohe]:
    print(i.shape)

print()

for i in [features_train_oe, features_test_oe]:
    print(i.shape)
# for i in [features_train_oe, features_test_oe, target_train_oe, target_test_oe]:
#    print(i.shape)


print()

for i in [features_train, features_test, target_train, target_test]:
    print(i.shape)

(256455, 308)
(85486, 308)
(256455,)
(85486,)

(256455, 10)
(85486, 10)

(256455, 10)
(85486, 10)
(256455,)
(85486,)


### CatBoost

<b>On the OneHotEncoder samples</b>

In [None]:
model_cbr = CatBoostRegressor()
parameters = [{'learning_rate':[.1, .5, .8], 'random_state':[12345], 'verbose':[False]}]

gscv_cbr_ohe = GridSearchCV(model_cbr, parameters, scoring='neg_root_mean_squared_error')
gscv_cbr_ohe.fit(features_train_ohe, target_train)

results = pd.DataFrame(gscv_cbr_ohe.cv_results_)
display(results)
display(results[results['rank_test_score'] == 1])
fit_time = results[results['rank_test_score'] == 1]['mean_fit_time'].values[0]
predict_time = results[results['rank_test_score'] == 1]['mean_score_time'].values[0]
print(f'Fit time:{fit_time}, predict time: {predict_time}')

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_learning_rate,param_random_state,param_verbose,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,17.465375,1.580946,0.2285,0.270803,0.1,12345,False,"{'learning_rate': 0.1, 'random_state': 12345, ...",-1663.223049,-1666.590023,-1654.184317,-1647.019637,-1676.311791,-1661.465763,10.113957,3
1,16.491309,0.578657,0.068306,0.013396,0.5,12345,False,"{'learning_rate': 0.5, 'random_state': 12345, ...",-1625.643344,-1628.715101,-1607.140925,-1610.165387,-1637.10639,-1621.754229,11.376447,1
2,17.056578,0.442555,0.090638,0.044191,0.8,12345,False,"{'learning_rate': 0.8, 'random_state': 12345, ...",-1659.560707,-1662.918957,-1639.878383,-1624.323221,-1674.33558,-1652.20337,17.821318,2


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_learning_rate,param_random_state,param_verbose,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
1,16.491309,0.578657,0.068306,0.013396,0.5,12345,False,"{'learning_rate': 0.5, 'random_state': 12345, ...",-1625.643344,-1628.715101,-1607.140925,-1610.165387,-1637.10639,-1621.754229,11.376447,1


Fit time:16.491308975219727, predict time: 0.06830568313598633
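The timing lookup in the cell above is repeated for every grid search in this notebook. A minimal sketch of a reusable helper follows; `best_timing` is a hypothetical name, and the Ridge model on synthetic data is only a quick stand-in for the real estimators.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV


def best_timing(search):
    """Return (mean fit time, mean score time) in seconds for the best parameter set."""
    i = search.best_index_  # row of cv_results_ that holds the top-ranked candidate
    return (search.cv_results_["mean_fit_time"][i],
            search.cv_results_["mean_score_time"][i])


# Synthetic regression data, used only to make the example runnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

toy_search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0]},
                          scoring="neg_root_mean_squared_error", cv=3)
toy_search.fit(X, y)
fit_time, score_time = best_timing(toy_search)
print(f"Fit time: {fit_time:.4f}, score time: {score_time:.4f}")
```

Note that `mean_fit_time` is the average time to fit on one cross-validation training fold, and `mean_score_time` is the average time to score one validation fold, so both are proxies rather than timings on the full train and test sets.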


<b>On the OrdinalEncoder samples</b>

In [None]:
model_cbr = CatBoostRegressor()
parameters = [{'learning_rate':[.1, .5, .8], 'random_state':[12345], 'verbose':[False]}]

gscv_cbr_oe = GridSearchCV(model_cbr, parameters, scoring='neg_root_mean_squared_error')
gscv_cbr_oe.fit(features_train_oe, target_train)

results = pd.DataFrame(gscv_cbr_oe.cv_results_)
display(results)
display(results[results['rank_test_score'] == 1])
fit_time = results[results['rank_test_score'] == 1]['mean_fit_time'].values[0]
predict_time = results[results['rank_test_score'] == 1]['mean_score_time'].values[0]
print(f'Fit time:{fit_time}, predict time: {predict_time}')

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_learning_rate,param_random_state,param_verbose,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,15.871995,0.262786,0.026929,0.013195,0.1,12345,False,"{'learning_rate': 0.1, 'random_state': 12345, ...",-1677.461083,-1675.818068,-1668.929518,-1653.53275,-1692.03155,-1673.554594,12.52236,3
1,15.743749,0.174493,0.026511,0.012192,0.5,12345,False,"{'learning_rate': 0.5, 'random_state': 12345, ...",-1646.468067,-1642.786846,-1634.481377,-1619.272836,-1653.869785,-1639.375782,11.832968,1
2,15.860545,0.350936,0.040149,0.012496,0.8,12345,False,"{'learning_rate': 0.8, 'random_state': 12345, ...",-1677.855241,-1680.073171,-1656.909752,-1654.060786,-1682.259046,-1670.231599,12.154034,2


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_learning_rate,param_random_state,param_verbose,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
1,15.743749,0.174493,0.026511,0.012192,0.5,12345,False,"{'learning_rate': 0.5, 'random_state': 12345, ...",-1646.468067,-1642.786846,-1634.481377,-1619.272836,-1653.869785,-1639.375782,11.832968,1


Fit time:15.743748664855957, predict time: 0.02651076316833496


<b>On the samples without encoding</b>

In [None]:
cat_features = ['VehicleType', 'Gearbox', 'FuelType','Brand','Model', 'Repaired']

In [None]:
model_cbr = CatBoostRegressor()
parameters = [{'learning_rate':[.1, .5, .8], 'random_state':[12345], 'verbose':[False]}]

gscv_cbr = GridSearchCV(model_cbr, parameters, error_score='raise', scoring='neg_root_mean_squared_error')
gscv_cbr.fit(features_train, target_train, cat_features=cat_features)

results = pd.DataFrame(gscv_cbr.cv_results_)
display(results)
display(results[results['rank_test_score'] == 1])
fit_time = results[results['rank_test_score'] == 1]['mean_fit_time'].values[0]
predict_time = results[results['rank_test_score'] == 1]['mean_score_time'].values[0]
print(f'Fit time:{fit_time}, predict time: {predict_time}')

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_learning_rate,param_random_state,param_verbose,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,95.720268,1.489336,0.173258,0.015455,0.1,12345,False,"{'learning_rate': 0.1, 'random_state': 12345, ...",-1669.002757,-1660.055252,-1650.866832,-1645.843176,-1682.139235,-1661.581451,12.980939,2
1,95.94373,0.856003,0.201487,0.007544,0.5,12345,False,"{'learning_rate': 0.5, 'random_state': 12345, ...",-1648.706202,-1642.346435,-1634.89894,-1625.043799,-1657.183522,-1641.635779,11.071648,1
2,95.779158,0.922922,0.214043,0.013385,0.8,12345,False,"{'learning_rate': 0.8, 'random_state': 12345, ...",-1678.014978,-1659.232216,-1665.542144,-1659.003377,-1701.683203,-1672.695183,16.052966,3


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_learning_rate,param_random_state,param_verbose,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
1,95.94373,0.856003,0.201487,0.007544,0.5,12345,False,"{'learning_rate': 0.5, 'random_state': 12345, ...",-1648.706202,-1642.346435,-1634.89894,-1625.043799,-1657.183522,-1641.635779,11.071648,1


Fit time:95.9437297821045, predict time: 0.20148687362670897


### LightGBM

<b>On the OneHotEncoder samples</b>

In [None]:
#del features_train

In [None]:
model_lgbmr = LGBMRegressor()
parameters = [{'num_leaves':[25, 50, 100, 200], 'learning_rate':[.1, .3, .5], 'random_state':[12345]}]

gscv_lgr_ohe = GridSearchCV(model_lgbmr, parameters, scoring='neg_root_mean_squared_error')
gscv_lgr_ohe.fit(features_train_ohe, target_train)

results = pd.DataFrame(gscv_lgr_ohe.cv_results_)
display(results)
display(results[results['rank_test_score'] == 1])
fit_time = results[results['rank_test_score'] == 1]['mean_fit_time'].values[0]
predict_time = results[results['rank_test_score'] == 1]['mean_score_time'].values[0]
print(f'Fit time:{fit_time}, predict time: {predict_time}')

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_learning_rate,param_num_leaves,param_random_state,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,1.262035,0.065978,0.150473,0.004376,0.1,25,12345,"{'learning_rate': 0.1, 'num_leaves': 25, 'rand...",-1781.674341,-1785.12571,-1769.625584,-1766.389259,-1792.082819,-1778.979543,9.620238,12
1,1.614207,0.477141,0.1709,0.044259,0.1,50,12345,"{'learning_rate': 0.1, 'num_leaves': 50, 'rand...",-1717.769334,-1721.428691,-1707.240015,-1699.569536,-1727.929502,-1714.787416,10.141597,11
2,2.150659,0.48105,0.195228,0.041933,0.1,100,12345,"{'learning_rate': 0.1, 'num_leaves': 100, 'ran...",-1668.227522,-1667.964069,-1653.497716,-1650.152342,-1680.074685,-1663.983267,10.899867,7
3,2.090063,0.186995,0.198382,0.027614,0.1,200,12345,"{'learning_rate': 0.1, 'num_leaves': 200, 'ran...",-1629.483794,-1631.518978,-1619.189483,-1609.49811,-1637.40658,-1625.419389,9.896625,2
4,1.378354,0.051883,0.155243,0.014554,0.3,25,12345,"{'learning_rate': 0.3, 'num_leaves': 25, 'rand...",-1711.054146,-1709.033828,-1698.520428,-1696.945343,-1716.011935,-1706.313136,7.381435,10
5,1.493056,0.052899,0.163931,0.021711,0.3,50,12345,"{'learning_rate': 0.3, 'num_leaves': 50, 'rand...",-1669.226025,-1670.4167,-1652.395701,-1647.415137,-1675.088772,-1662.908467,10.910605,6
6,1.743213,0.085917,0.176611,0.017992,0.3,100,12345,"{'learning_rate': 0.3, 'num_leaves': 100, 'ran...",-1637.014168,-1640.018124,-1625.347252,-1616.176613,-1645.268258,-1632.764883,10.556441,3
7,1.987089,0.021612,0.195552,0.007081,0.3,200,12345,"{'learning_rate': 0.3, 'num_leaves': 200, 'ran...",-1620.135494,-1620.244695,-1609.91758,-1597.626381,-1627.498564,-1615.084543,10.370459,1
8,1.400286,0.0446,0.166319,0.015263,0.5,25,12345,"{'learning_rate': 0.5, 'num_leaves': 25, 'rand...",-1699.335884,-1703.072371,-1684.558369,-1673.838106,-1706.760989,-1693.513144,12.394222,9
9,1.54812,0.106271,0.169143,0.012656,0.5,50,12345,"{'learning_rate': 0.5, 'num_leaves': 50, 'rand...",-1666.779124,-1671.873998,-1662.610708,-1645.152875,-1680.472731,-1665.377887,11.736487,8


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_learning_rate,param_num_leaves,param_random_state,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
7,1.987089,0.021612,0.195552,0.007081,0.3,200,12345,"{'learning_rate': 0.3, 'num_leaves': 200, 'ran...",-1620.135494,-1620.244695,-1609.91758,-1597.626381,-1627.498564,-1615.084543,10.370459,1


Fit time:1.9870888710021972, predict time: 0.19555239677429198


<b>On the OrdinalEncoder samples</b>

In [None]:
model_lgbmr = LGBMRegressor()
parameters = [{'num_leaves':[25, 50, 100, 200], 'learning_rate':[.1, .3, .5], 'random_state':[12345]}]

gscv_lgr_oe = GridSearchCV(model_lgbmr, parameters, scoring='neg_root_mean_squared_error')
gscv_lgr_oe.fit(features_train_oe, target_train)

results = pd.DataFrame(gscv_lgr_oe.cv_results_)
display(results)
display(results[results['rank_test_score'] == 1])
fit_time = results[results['rank_test_score'] == 1]['mean_fit_time'].values[0]
predict_time = results[results['rank_test_score'] == 1]['mean_score_time'].values[0]
print(f'Fit time:{fit_time}, predict time: {predict_time}')

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_learning_rate,param_num_leaves,param_random_state,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.429834,0.012287,0.053516,0.004723,0.1,25,12345,"{'learning_rate': 0.1, 'num_leaves': 25, 'rand...",-1794.867267,-1793.353132,-1784.978024,-1780.377513,-1811.681969,-1793.051581,10.735859,12
1,0.639409,0.10388,0.069222,0.011252,0.1,50,12345,"{'learning_rate': 0.1, 'num_leaves': 50, 'rand...",-1734.987346,-1730.901654,-1724.602808,-1711.406523,-1743.675868,-1729.11484,10.804124,11
2,0.728726,0.101433,0.073121,0.011507,0.1,100,12345,"{'learning_rate': 0.1, 'num_leaves': 100, 'ran...",-1684.355349,-1678.903851,-1668.254334,-1662.515367,-1691.023909,-1677.010562,10.398909,7
3,0.91685,0.007857,0.081725,0.001327,0.1,200,12345,"{'learning_rate': 0.1, 'num_leaves': 200, 'ran...",-1644.383851,-1635.097145,-1630.962427,-1621.24207,-1647.309094,-1635.798917,9.401821,2
4,0.380222,0.032858,0.036913,0.00102,0.3,25,12345,"{'learning_rate': 0.3, 'num_leaves': 25, 'rand...",-1714.454438,-1706.387587,-1708.724594,-1694.855342,-1727.893095,-1710.463011,10.797869,10
5,0.435247,0.011966,0.040297,0.000488,0.3,50,12345,"{'learning_rate': 0.3, 'num_leaves': 50, 'rand...",-1676.670754,-1667.991576,-1668.97708,-1655.611179,-1688.570162,-1671.56415,10.855739,6
6,0.732577,0.01708,0.058017,0.010179,0.3,100,12345,"{'learning_rate': 0.3, 'num_leaves': 100, 'ran...",-1649.147116,-1642.775227,-1640.637368,-1628.066408,-1659.147132,-1643.95465,10.22355,3
7,0.791808,0.023473,0.05852,0.001051,0.3,200,12345,"{'learning_rate': 0.3, 'num_leaves': 200, 'ran...",-1632.27459,-1621.839788,-1624.359185,-1607.597087,-1643.442197,-1625.902569,11.850908,1
8,0.345461,0.003273,0.035311,0.004378,0.5,25,12345,"{'learning_rate': 0.5, 'num_leaves': 25, 'rand...",-1716.438489,-1699.148711,-1702.873383,-1687.395561,-1716.745915,-1704.520412,11.102539,9
9,0.431524,0.035546,0.036635,0.001362,0.5,50,12345,"{'learning_rate': 0.5, 'num_leaves': 50, 'rand...",-1685.726383,-1678.151465,-1675.646963,-1664.210551,-1684.473869,-1677.641846,7.702925,8


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_learning_rate,param_num_leaves,param_random_state,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
7,0.791808,0.023473,0.05852,0.001051,0.3,200,12345,"{'learning_rate': 0.3, 'num_leaves': 200, 'ran...",-1632.27459,-1621.839788,-1624.359185,-1607.597087,-1643.442197,-1625.902569,11.850908,1


Fit time:0.791808032989502, predict time: 0.05851979255676269


### Ridge regression

<b>On the OneHotEncoder samples</b>

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 341941 entries, 0 to 354368
Data columns (total 11 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   Price              341941 non-null  int64 
 1   VehicleType        341941 non-null  object
 2   RegistrationYear   341941 non-null  int64 
 3   Gearbox            341941 non-null  object
 4   Power              341941 non-null  int64 
 5   Model              341941 non-null  object
 6   Kilometer          341941 non-null  int64 
 7   RegistrationMonth  341941 non-null  int64 
 8   FuelType           341941 non-null  object
 9   Brand              341941 non-null  object
 10  Repaired           341941 non-null  object
dtypes: int64(5), object(6)
memory usage: 31.3+ MB


Categorical features for OHE with Ridge

In [None]:
ohe_features_ridge = features_train.select_dtypes(include='object').columns.to_list()
print(ohe_features_ridge)

['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'Repaired']


In [None]:
features_train.head(10)

Unnamed: 0,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,Repaired
201585,convertible,1997,manual,90,other,125000,4,petrol,toyota,no
205139,bus,2006,manual,102,altea,125000,10,petrol,seat,yes
194998,small,2016,manual,105,ibiza,70000,2,petrol,seat,no
181020,wagon,2009,auto,292,e_klasse,150000,2,petrol,mercedes_benz,no
292576,small,2000,manual,75,no,150000,1,petrol,peugeot,no
278022,sedan,2000,auto,143,c_klasse,150000,11,gasoline,mercedes_benz,no
337662,convertible,1998,manual,75,astra,150000,12,petrol,opel,no
287779,small,2009,auto,77,i_reihe,30000,3,petrol,hyundai,no
93510,convertible,1996,manual,118,z_reihe,150000,6,petrol,bmw,no
347240,wagon,2008,manual,140,passat,150000,9,gasoline,volkswagen,no


Numeric features

In [None]:
num_features = features_train.select_dtypes(exclude='object').columns.to_list()
num_features

['RegistrationYear', 'Power', 'Kilometer', 'RegistrationMonth']

In [None]:
features_train_ridge = features_train.copy()
features_test_ridge = features_test.copy()

In [None]:
# drop='first' drops the first encoded category of each feature,
# which avoids the dummy-variable trap;
# handle_unknown='ignore' makes transform ignore categories
# that were not seen during fit
encoder_ohe = OneHotEncoder(drop='first', handle_unknown='ignore', sparse=False)

# fit the encoder on the categorical features of the training set
encoder_ohe.fit(features_train_ridge[ohe_features_ridge])

# add the encoded features to features_train_ridge;
# encoder_ohe.get_feature_names_out() returns the new column names
features_train_ridge[
    encoder_ohe.get_feature_names_out()
] = encoder_ohe.transform(features_train[ohe_features_ridge])

# drop the original (unencoded) categorical columns
features_train_ridge = features_train_ridge.drop(ohe_features_ridge, axis=1)

# create the scaler
scaler = StandardScaler()

# fit it on the numeric features of the training set and transform that same set
features_train_ridge[num_features] = scaler.fit_transform(features_train_ridge[num_features])

# inspect the result
features_train_ridge.head()

Unnamed: 0,RegistrationYear,Power,Kilometer,RegistrationMonth,VehicleType_convertible,VehicleType_coupe,VehicleType_other,VehicleType_sedan,VehicleType_small,VehicleType_suv,...,Brand_skoda,Brand_smart,Brand_sonstige_autos,Brand_subaru,Brand_suzuki,Brand_toyota,Brand_trabant,Brand_volkswagen,Brand_volvo,Repaired_yes
201585,-0.849586,-0.233231,-0.091922,-0.530448,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
205139,0.393576,-0.09933,-0.091922,1.156964,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
194998,1.774866,-0.065855,-1.564328,-1.092918,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
181020,0.807963,2.020764,0.577354,-1.092918,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
292576,-0.435198,-0.400607,0.577354,-1.374154,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
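The cell above fits the encoder and the scaler on the training set by hand, and the same fitted transformers would have to be applied to any data the Ridge model later predicts on. One way to keep these steps together is a scikit-learn `Pipeline` with a `ColumnTransformer`. The sketch below runs on hypothetical toy data and is not the notebook's actual preprocessing.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical); "jeep" in the test frame is unseen during fit.
toy_train = pd.DataFrame({
    "Brand": ["audi", "bmw", "audi", "skoda"],
    "Power": [190, 163, 75, 69],
})
toy_target = [18300, 9800, 1500, 3600]
toy_test = pd.DataFrame({"Brand": ["bmw", "jeep"], "Power": [150, 100]})

# One-hot encode the categorical column, scale the numeric one.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Brand"]),
    ("num", StandardScaler(), ["Power"]),
])
toy_pipeline = Pipeline([("prep", preprocess), ("model", Ridge(alpha=0.05))])

toy_pipeline.fit(toy_train, toy_target)          # transformers are fitted on train only
toy_predictions = toy_pipeline.predict(toy_test)  # and reused unchanged on test
print(toy_predictions.shape)
```

A pipeline like this can also be passed to `GridSearchCV` directly, in which case the encoder and scaler are refitted inside each cross-validation fold.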


In [None]:
RANDOM_STATE = 42

In [None]:
model_ridge = Ridge(random_state=RANDOM_STATE)

# grid of hyperparameter values to search over
param_grid_ridge = {
    'alpha': np.arange(0, 0.21, 0.01),
}

gs_ridge = GridSearchCV(
    model_ridge,
    param_grid=param_grid_ridge,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1
)

gs_ridge.fit(features_train_ridge, target_train)

# read the timings from this search's own cv_results_
# (`results` from the previous cell still holds the LightGBM search)
results = pd.DataFrame(gs_ridge.cv_results_)
fit_time = results[results['rank_test_score'] == 1]['mean_fit_time'].values[0]
predict_time = results[results['rank_test_score'] == 1]['mean_score_time'].values[0]
print(f'Fit time:{fit_time}, predict time: {predict_time}')

# best RMSE on cross-validation
print(f'best_score: {gs_ridge.best_score_ * -1}')

# best hyperparameters
print(f'best_params: {gs_ridge.best_params_}')

Fit time:0.798194694519043, predict time: 0.060316944122314455
best_score: 2917.3930779886673
best_params: {'alpha': 0.05}


## Model analysis

Build a summary table with RMSE (the target is below 2500), model training time, and model prediction time:

In [None]:
index = ['CatBoostRegressor with OHE',
         'CatBoostRegressor with OE',
         'CatBoostRegressor without encoding',
         'LGBMRegressor with OHE',
         'LGBMRegressor with OE',
         'Ridge regression with OHE'
        ]

# a separate name keeps the `data` DataFrame from being overwritten
kpi = {'RMSE':[gscv_cbr_ohe.best_score_ * -1,
               gscv_cbr_oe.best_score_ * -1,
               gscv_cbr.best_score_ * -1,
               gscv_lgr_ohe.best_score_ * -1,
               gscv_lgr_oe.best_score_ * -1,
               gs_ridge.best_score_ * -1],
       'Training time, s':[15.82,
                           15.55,
                           97.26,
                           1.85,
                           0.79,
                           0.79],

       'Prediction time, s':[0.11,
                             0.02,
                             0.22,
                             0.18,
                             0.06,
                             0.06]
      }

kpi_data = pd.DataFrame(data=kpi, index=index)

kpi_data.sort_values(by = 'RMSE', ascending=True)

Unnamed: 0,RMSE,"Training time, s","Prediction time, s"
LGBMRegressor with OHE,1615.084543,1.85,0.18
CatBoostRegressor with OHE,1621.754229,15.82,0.11
LGBMRegressor with OE,1625.902569,0.79,0.06
CatBoostRegressor with OE,1639.375782,15.55,0.02
CatBoostRegressor without encoding,1641.635779,97.26,0.22
Ridge regression with OHE,2917.393078,0.79,0.06
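The timings in the table are entered by hand. As a hedged alternative, training and prediction time can be measured directly around the calls of interest. The helper `timed` below is a hypothetical name, and `sum` stands in for a call such as `model.fit` or `model.predict`.

```python
import time


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; in the notebook this could wrap model.fit or model.predict.
total, seconds = timed(sum, range(1_000_000))
print(f"result={total}, elapsed={seconds:.4f} s")
```

Measuring on the full training set and on the test set gives numbers that correspond directly to the customer's criteria, whereas cross-validation timings refer to individual folds.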


Interim conclusion:

I obtained an RMSE below 2500 with the following models:
- LGBMRegressor
- CatBoostRegressor

The best model by RMSE is LGBMRegressor on the OHE samples. It trains in about 2 seconds. The least effective model is Ridge regression.

Check the quality of the best model on the test set

In [None]:
# GridSearchCV (refit=True by default) has already refitted the best model
# on the full training set, so it can be used for prediction directly
lgr_ohe_prediction = gscv_lgr_ohe.predict(features_test_ohe)
metric_test = mean_squared_error(target_test_ohe, lgr_ohe_prediction, squared=False)
metric_test

1593.3577485458086
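The test metric above relies on `mean_squared_error(..., squared=False)`, an argument that, to my knowledge, newer scikit-learn releases deprecate. A version-independent sketch of the same metric with NumPy follows; the function name `rmse` and the toy values are assumptions made for illustration.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy prices in euros (hypothetical): the errors are -100, 100, and -200.
toy_rmse = rmse([1000, 2000, 3000], [1100, 1900, 3200])
print(toy_rmse)
```

Because RMSE is expressed in the units of the target, the test value of about 1593 can be read as a typical prediction error in euros.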

Conclusion:
In this project I loaded the data and preprocessed it: cleaned it, filled in missing values, and removed unneeded columns.
I prepared the samples for machine learning and encoded the features with one-hot encoding (OHE) and OrdinalEncoder.
I compared 3 models with different hyperparameters and encoding methods.
I chose the best model based on RMSE, training time, and prediction time, and checked it on the test set.
Result: the most effective model is LGBMRegressor with OHE, which showed an RMSE of 1593 in the final test.