# Determining the value of cars

The used-car sales service "Not beaten, not painted" is developing an app to attract new customers. In the app, users can quickly find out the market value of their car. We have historical data to work with: technical specifications, equipment, and prices of cars. We need to build a model that predicts the price.

The customer is interested in:

- quality of prediction;
- speed of prediction;
- training time.

Work plan:
1. Review the general information about the dataframe.
2. Preprocess the data.
3. Encode the categorical features.
4. Scale the quantitative features.
5. Train various models, including gradient boosting models, and find the hyperparameter values that minimize the RMSE metric.
6. Measure the training time and the prediction time of each model.
7. Verify the quality of the best models on the test sample.
8. Determine the model recommended for the customer.
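
The RMSE metric named in step 5 is the square root of the mean squared difference between actual and predicted prices. The sketch below shows the computation on made-up numbers; the function name `rmse` and the sample prices are illustrative assumptions, not part of the project code.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical actual and predicted prices, for illustration only.
actual = [1500, 3600, 650]
predicted = [1400, 3900, 650]
score = rmse(actual, predicted)
print(round(score, 2))
```

Because the errors are squared before averaging, one large miss weighs more than several small ones, and the result stays in the same units as the price.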

<a id="0"></a> <br>
# Table of Contents  
1. [Data preparation](#1)     
2. [Model training](#2)
3. [Model Analysis](#3)

<a id="1"></a>
## Data preparation
[Back to the top](#0)

In [1]:
# pip install nb_black

In [2]:
#! pip install catboost

In [3]:
#! pip install lightgbm

In [13]:
#%load_ext nb_black


In [1]:
import pandas as pd

import numpy as np

from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

from catboost import CatBoostRegressor

import lightgbm as lgb

import statistics

Explore the data.

In [6]:
df = pd.read_csv("/datasets/autos.csv", sep=",")
df.head(10)

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,Repaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,2016-03-24 11:52:17,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21
5,2016-04-04 17:36:23,650,sedan,1995,manual,102,3er,150000,10,petrol,bmw,yes,2016-04-04 00:00:00,0,33775,2016-04-06 19:17:07
6,2016-04-01 20:48:51,2200,convertible,2004,manual,109,2_reihe,150000,8,petrol,peugeot,no,2016-04-01 00:00:00,0,67112,2016-04-05 18:18:39
7,2016-03-21 18:54:38,0,sedan,1980,manual,50,other,40000,7,petrol,volkswagen,no,2016-03-21 00:00:00,0,19348,2016-03-25 16:47:58
8,2016-04-04 23:42:13,14500,bus,2014,manual,125,c_max,30000,8,petrol,ford,,2016-04-04 00:00:00,0,94505,2016-04-04 23:42:13
9,2016-03-17 10:53:50,999,small,1998,manual,101,golf,150000,0,,volkswagen,,2016-03-17 00:00:00,0,27472,2016-03-31 17:17:06


The dataframe contains attributes that do not affect the market value of the car: `DateCrawled`, `DateCreated`, `NumberOfPictures`, `PostalCode`, `LastSeen`. Since they carry no useful information for the model, we will remove them.

In [7]:
df.drop(
    ["DateCrawled", "DateCreated", "NumberOfPictures", "PostalCode", "LastSeen"],
    axis=1,
    inplace=True,
)

In [8]:
df.head(10)

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,Repaired
0,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,
1,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes
2,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,
3,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no
4,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no
5,650,sedan,1995,manual,102,3er,150000,10,petrol,bmw,yes
6,2200,convertible,2004,manual,109,2_reihe,150000,8,petrol,peugeot,no
7,0,sedan,1980,manual,50,other,40000,7,petrol,volkswagen,no
8,14500,bus,2014,manual,125,c_max,30000,8,petrol,ford,
9,999,small,1998,manual,101,golf,150000,0,,volkswagen,


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 11 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   Price              354369 non-null  int64 
 1   VehicleType        316879 non-null  object
 2   RegistrationYear   354369 non-null  int64 
 3   Gearbox            334536 non-null  object
 4   Power              354369 non-null  int64 
 5   Model              334664 non-null  object
 6   Kilometer          354369 non-null  int64 
 7   RegistrationMonth  354369 non-null  int64 
 8   FuelType           321474 non-null  object
 9   Brand              354369 non-null  object
 10  Repaired           283215 non-null  object
dtypes: int64(5), object(6)
memory usage: 29.7+ MB


We see that the dataframe has missing values in the columns `VehicleType`, `Gearbox`, `Model`, `FuelType`, `Repaired`.

In [10]:
df["Repaired"].value_counts()

no     247161
yes     36054
Name: Repaired, dtype: int64

For missing values in the `Repaired` column, we will create a separate category "Unknown", meaning that it is unknown for this vehicle whether it has been repaired or not.

In [11]:
# Assign the result back: in-place fillna on a selected column is unreliable in recent pandas.
df["Repaired"] = df["Repaired"].fillna("Unknown")

In [12]:
df["Repaired"].value_counts()

no         247161
Unknown     71154
yes         36054
Name: Repaired, dtype: int64

In [13]:
df["Model"].value_counts()

golf                  29232
other                 24421
3er                   19761
polo                  13066
corsa                 12570
                      ...  
serie_2                   8
rangerover                4
serie_3                   4
serie_1                   2
range_rover_evoque        2
Name: Model, Length: 250, dtype: int64

Now let's fill in the missing values in the `Model` column. We assume that if the car model is not specified, the user did not find it in the standard list, so it belongs to the "other" category.

In [14]:
# Assign the result back: in-place fillna on a selected column is unreliable in recent pandas.
df["Model"] = df["Model"].fillna("other")

In [15]:
df["Model"].value_counts()

other                 44126
golf                  29232
3er                   19761
polo                  13066
corsa                 12570
                      ...  
serie_2                   8
rangerover                4
serie_3                   4
serie_1                   2
range_rover_evoque        2
Name: Model, Length: 250, dtype: int64

In [16]:
df.head(10)

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,Repaired
0,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,Unknown
1,18300,coupe,2011,manual,190,other,125000,5,gasoline,audi,yes
2,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,Unknown
3,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no
4,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no
5,650,sedan,1995,manual,102,3er,150000,10,petrol,bmw,yes
6,2200,convertible,2004,manual,109,2_reihe,150000,8,petrol,peugeot,no
7,0,sedan,1980,manual,50,other,40000,7,petrol,volkswagen,no
8,14500,bus,2014,manual,125,c_max,30000,8,petrol,ford,Unknown
9,999,small,1998,manual,101,golf,150000,0,,volkswagen,Unknown


In [17]:
df["Gearbox"].value_counts()

manual    268251
auto       66285
Name: Gearbox, dtype: int64

Since there are only two types of transmission, let's replace the missing values in this column with the most common type for that brand and model.

In [18]:
pt_g = df.pivot_table(
    index=["Brand", "Model"], values=["Gearbox"], aggfunc=[pd.Series.mode]
).reset_index()

In [19]:
pt_g

Unnamed: 0,Brand,Model,Gearbox (mode)
0,alfa_romeo,145,manual
1,alfa_romeo,147,manual
2,alfa_romeo,156,manual
3,alfa_romeo,159,manual
4,alfa_romeo,other,manual
...,...,...,...
293,volvo,v40,manual
294,volvo,v50,manual
295,volvo,v60,manual
296,volvo,v70,manual


In [20]:
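def func_g(row):
    # Look up the most common gearbox for this row's Brand-Model pair.
    result = row["Gearbox"]
    mode = pt_g[
        (pt_g["Brand"] == row["Brand"]) & (pt_g["Model"] == row["Model"])
    ].values[0][2]
    # Fill only missing values, and only when the mode is a single string.
    if pd.isna(result) and isinstance(mode, str):
        return mode
    return result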

In [21]:
%%time
df["Gearbox"] = df.apply(lambda row: func_g(row), axis=1)

CPU times: user 6min 38s, sys: 1.96 s, total: 6min 40s
Wall time: 6min 40s


In [22]:
df["Gearbox"].isna().sum()

0

The missing values in the `Gearbox` column have been filled successfully.
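
The row-by-row `apply` above took several minutes. The same idea, filling a missing value with the most common value of its `Brand`-`Model` group, can also be expressed with `groupby` and `transform`. The sketch below runs on a toy frame with hypothetical rows; the helper `first_mode` is an illustrative name, and unlike `func_g` it takes the first value when several values tie for the mode.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical rows; the column names follow the notebook.
toy = pd.DataFrame(
    {
        "Brand": ["audi", "audi", "audi", "bmw", "bmw", "jeep"],
        "Model": ["a3", "a3", "a3", "3er", "3er", "grand"],
        "Gearbox": ["manual", "manual", np.nan, "auto", np.nan, np.nan],
    }
)


def first_mode(values):
    """Return the most frequent non-missing value, or NaN if there is none."""
    modes = values.mode()  # mode() ignores NaN by default
    return modes.iloc[0] if not modes.empty else np.nan


# One mode per (Brand, Model) group, broadcast back to the original rows.
group_mode = toy.groupby(["Brand", "Model"])["Gearbox"].transform(first_mode)
toy["Gearbox"] = toy["Gearbox"].fillna(group_mode)
print(toy)
```

The last row stays missing because its group has no known gearbox at all, which is the same situation that leaves a few rows unfilled in the notebook.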

Replace the missing values in the `VehicleType` column with the most popular values for the `Brand`-`Model` combination.

In [23]:
df["VehicleType"].unique()

array([nan, 'coupe', 'suv', 'small', 'sedan', 'convertible', 'bus',
       'wagon', 'other'], dtype=object)

In [24]:
pt_v = df.pivot_table(
    index=["Brand", "Model"], values=["VehicleType"], aggfunc=[pd.Series.mode]
).reset_index()

In [25]:
pt_v

Unnamed: 0,Brand,Model,VehicleType (mode)
0,alfa_romeo,145,small
1,alfa_romeo,147,sedan
2,alfa_romeo,156,wagon
3,alfa_romeo,159,wagon
4,alfa_romeo,other,sedan
...,...,...,...
293,volvo,v40,wagon
294,volvo,v50,wagon
295,volvo,v60,wagon
296,volvo,v70,wagon


In [26]:
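def func_v(row):
    # Look up the most common vehicle type for this row's Brand-Model pair.
    result = row["VehicleType"]
    mode = pt_v[
        (pt_v["Brand"] == row["Brand"]) & (pt_v["Model"] == row["Model"])
    ].values[0][2]
    # Fill only missing values, and only when the mode is a single string.
    if pd.isna(result) and isinstance(mode, str):
        return mode
    return result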

In [27]:
%%time
df["VehicleType"] = df.apply(lambda row: func_v(row), axis=1)

CPU times: user 6min 57s, sys: 1.53 s, total: 6min 58s
Wall time: 6min 59s


In [28]:
df["VehicleType"].isna().sum()

12

We see that a few missing values remain in the `VehicleType` column: these are the rows whose `Brand`-`Model` group has no single, well-defined mode. There are only 12 such rows, so we can drop them.
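
Rows stay unfilled when `pd.Series.mode` does not return exactly one value for their group. A small sketch of the two cases, using hypothetical series rather than the project data:

```python
import numpy as np
import pandas as pd

# Hypothetical group in which every VehicleType value is missing.
all_missing = pd.Series([np.nan, np.nan], dtype="object")
# Hypothetical group with a tie between two values.
tied = pd.Series(["sedan", "wagon"])

empty_mode = all_missing.mode()  # empty Series: nothing to impute from
tied_mode = tied.mode()          # two values: no single most frequent type

print(len(empty_mode), tied_mode.tolist())
```

In both cases the result is not a single string, so a check such as the one in `func_v` leaves the original missing value in place.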

In [29]:
df.dropna(subset=["VehicleType"], inplace=True)

In [30]:
df["VehicleType"].isna().sum()

0

As we can see, there are no more missing values in the `VehicleType` column.

In [31]:
df["FuelType"].unique()

array(['petrol', 'gasoline', nan, 'lpg', 'other', 'hybrid', 'cng',
       'electric'], dtype=object)

The missing values in the `FuelType` column can be replaced by the most common value for each `Brand`-`Power` combination, since a brand usually installs the same engine, with a given power rating, across its models.

In [32]:
pt_f = df.pivot_table(
    index=["Brand", "Power"], values=["FuelType"], aggfunc=[pd.Series.mode]
).reset_index()

In [33]:
pt_f

Unnamed: 0,Brand,Power,FuelType (mode)
0,alfa_romeo,0,petrol
1,alfa_romeo,50,petrol
2,alfa_romeo,63,petrol
3,alfa_romeo,65,petrol
4,alfa_romeo,66,petrol
...,...,...,...
5425,volvo,300,petrol
5426,volvo,315,petrol
5427,volvo,1056,[]
5428,volvo,1162,petrol


In [34]:
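def func_f(row):
    # Look up the most common fuel type for this row's Brand-Power pair.
    result = row["FuelType"]
    mode = pt_f[
        (pt_f["Brand"] == row["Brand"]) & (pt_f["Power"] == row["Power"])
    ].values[0][2]
    # Fill only missing values, and only when the mode is a single string.
    if pd.isna(result) and isinstance(mode, str):
        return mode
    return result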

In [35]:
%%time
df["FuelType"] = df.apply(lambda row: func_f(row), axis=1)

CPU times: user 18min 20s, sys: 3.57 s, total: 18min 24s
Wall time: 18min 25s


In [36]:
df["FuelType"].isna().sum()

324

We got a result similar to the one for `VehicleType`: a small number of rows could not be filled. Let's drop them in the same way.

In [37]:
df.dropna(subset=["FuelType"], inplace=True)

In [38]:
df["FuelType"].isna().sum()

0

As we can see, no missing values remain in `FuelType`.

In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 354033 entries, 0 to 354368
Data columns (total 11 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   Price              354033 non-null  int64 
 1   VehicleType        354033 non-null  object
 2   RegistrationYear   354033 non-null  int64 
 3   Gearbox            354033 non-null  object
 4   Power              354033 non-null  int64 
 5   Model              354033 non-null  object
 6   Kilometer          354033 non-null  int64 
 7   RegistrationMonth  354033 non-null  int64 
 8   FuelType           354033 non-null  object
 9   Brand              354033 non-null  object
 10  Repaired           354033 non-null  object
dtypes: int64(5), object(6)
memory usage: 32.4+ MB


There are no more missing values in the table. Let's check for anomalous values.

In [40]:
df[df["Price"] < 0]["Price"].count()

0

In [41]:
df[df["Price"] == 0]["Price"].count() / len(df)

0.030336155104185202

We see that about 3% of the rows have a zero `Price`, which is an anomaly. Let's remove them.

In [42]:
df = df[df["Price"] != 0]

In [43]:
df[df["Price"] == 0]["Price"].count()

0

In [44]:
df.to_csv(r"df_processed_1.csv", index=False, sep=",")

In [45]:
df = pd.read_csv("df_processed_1.csv", sep=",")
df.head(10)

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,Repaired
0,480,sedan,1993,manual,0,golf,150000,0,petrol,volkswagen,Unknown
1,18300,coupe,2011,manual,190,other,125000,5,gasoline,audi,yes
2,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,Unknown
3,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no
4,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no
5,650,sedan,1995,manual,102,3er,150000,10,petrol,bmw,yes
6,2200,convertible,2004,manual,109,2_reihe,150000,8,petrol,peugeot,no
7,14500,bus,2014,manual,125,c_max,30000,8,petrol,ford,Unknown
8,999,small,1998,manual,101,golf,150000,0,petrol,volkswagen,Unknown
9,2000,sedan,2004,manual,105,3_reihe,150000,12,petrol,mazda,no


Since the first automobile was created in 1885, the values in `RegistrationYear` must be no earlier than 1885, and they must not exceed the current year (2022 in the checks below).

In [46]:
df[df["RegistrationYear"] < 1885]["RegistrationYear"].count() / len(df)

0.0001427352145252015

In [47]:
df[df["RegistrationYear"] > 2022]["RegistrationYear"].count() / len(df)

0.0002476019027477985

Such rows are extremely rare, so we can drop them.

In [48]:
df = df[(df["RegistrationYear"] >= 1885) & (df["RegistrationYear"] <= 2022)]

In [49]:
df[df["RegistrationYear"] < 1885]["RegistrationYear"].count()

0

In [50]:
df[df["RegistrationYear"] > 2022]["RegistrationYear"].count()

0
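
The year check above keeps only rows inside a closed interval. A minimal sketch of the same filter written with `Series.between`, on a toy frame with hypothetical years (the bounds 1885 and 2022 mirror the ones used in this notebook):

```python
import pandas as pd

# Hypothetical registration years, two of them clearly invalid.
toy = pd.DataFrame({"RegistrationYear": [1000, 1993, 2011, 2016, 9999]})

# between() is inclusive on both ends by default.
valid_year = toy["RegistrationYear"].between(1885, 2022)

share_removed = 1 - valid_year.mean()  # fraction of rows that will be dropped
cleaned = toy[valid_year]
print(cleaned["RegistrationYear"].tolist(), round(share_removed, 2))
```

Computing the share of dropped rows from the same boolean mask keeps the check and the filter in sync.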

The values in the `Power` column should be positive and should not exceed 4000 hp, which corresponds to the power of the world's largest truck. Let's check whether this holds.

In [51]:
df[df["Power"] < 0]["Power"].count()

0

In [52]:
df[df["Power"] == 0]["Power"].count()

36256

In [53]:
df[df["Power"] > 4000]["Power"].count()

66

We see that anomalous values are present in `Power`: zeros and values above 4000 hp. Let's remove them.

In [54]:
df = df[(df["Power"] > 0) & (df["Power"] <= 4000)]

In [55]:
df[(df["Power"] == 0) | (df["Power"] > 4000)]["Power"].count()

0

A vehicle's mileage cannot be negative.

In [56]:
df[df["Kilometer"] < 0]["Kilometer"].count()

0

We see that no abnormal values are found in `Kilometer`.

In [57]:
df["Kilometer"].unique()

array([125000, 150000,  90000,  30000,  70000, 100000,  60000,   5000,
        20000,  80000,  50000,  40000,  10000])

We see that the `Kilometer` column takes only 13 distinct values, so it can be treated as categorical.

The values in the `RegistrationMonth` column must lie between 1 and 12. Let's check whether this condition is met.

In [58]:
df[(df["RegistrationMonth"] < 1) | (df["RegistrationMonth"] > 12)][
    "RegistrationMonth"
].count() / len(df)

0.06520074176191268

About 6.5% of the rows violate this condition. Let's remove them.

In [59]:
df = df[(df["RegistrationMonth"] >= 1) & (df["RegistrationMonth"] <= 12)]

In [60]:
df[(df["RegistrationMonth"] < 1) | (df["RegistrationMonth"] > 12)][
    "RegistrationMonth"
].count()

0

In [61]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 286831 entries, 1 to 343292
Data columns (total 11 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   Price              286831 non-null  int64 
 1   VehicleType        286831 non-null  object
 2   RegistrationYear   286831 non-null  int64 
 3   Gearbox            286831 non-null  object
 4   Power              286831 non-null  int64 
 5   Model              286831 non-null  object
 6   Kilometer          286831 non-null  int64 
 7   RegistrationMonth  286831 non-null  int64 
 8   FuelType           286831 non-null  object
 9   Brand              286831 non-null  object
 10  Repaired           286831 non-null  object
dtypes: int64(5), object(6)
memory usage: 26.3+ MB


Check the dataframe for duplicates.

In [62]:
df.duplicated().mean()

0.08328249038632504

We see that about 8% of the rows are duplicates. Let's delete them.

In [63]:
df = df.drop_duplicates().reset_index(drop=True)

In [64]:
df.duplicated().sum()

0

In [65]:
df.head(10)

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,Repaired
0,18300,coupe,2011,manual,190,other,125000,5,gasoline,audi,yes
1,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,Unknown
2,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no
3,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no
4,650,sedan,1995,manual,102,3er,150000,10,petrol,bmw,yes
5,2200,convertible,2004,manual,109,2_reihe,150000,8,petrol,peugeot,no
6,14500,bus,2014,manual,125,c_max,30000,8,petrol,ford,Unknown
7,2000,sedan,2004,manual,105,3_reihe,150000,12,petrol,mazda,no
8,2799,wagon,2005,manual,140,passat,150000,12,gasoline,volkswagen,yes
9,999,wagon,1995,manual,115,passat,150000,11,petrol,volkswagen,Unknown


In [66]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 262943 entries, 0 to 262942
Data columns (total 11 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   Price              262943 non-null  int64 
 1   VehicleType        262943 non-null  object
 2   RegistrationYear   262943 non-null  int64 
 3   Gearbox            262943 non-null  object
 4   Power              262943 non-null  int64 
 5   Model              262943 non-null  object
 6   Kilometer          262943 non-null  int64 
 7   RegistrationMonth  262943 non-null  int64 
 8   FuelType           262943 non-null  object
 9   Brand              262943 non-null  object
 10  Repaired           262943 non-null  object
dtypes: int64(5), object(6)
memory usage: 22.1+ MB


In [67]:
#df.to_csv(r"df_processed_2.csv", index=False, sep=",")

In [3]:
#df = pd.read_csv("df_processed_2.csv", sep=",")
# df.head(10)

In [185]:
# df.info()


Split the prepared data into two samples: training (75%) and test (25%).

In [4]:
RANDOM_STATE = 12345

In [5]:
target = df["Price"]
features = df.drop("Price", axis=1)
(features_train, features_test, target_train, target_test,) = train_test_split(
    features, target, test_size=0.25, random_state=RANDOM_STATE
)

In [6]:
features_train.shape

(197207, 10)

In [7]:
features_train.head()

Unnamed: 0,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,Repaired
247352,wagon,2010,auto,184,3er,150000,5,gasoline,bmw,no
175733,small,2015,manual,95,one,40000,1,gasoline,mini,no
77294,wagon,2005,manual,150,a3,90000,12,petrol,audi,no
57049,small,2003,manual,110,cooper,150000,3,petrol,mini,no
39656,coupe,2009,manual,260,other,90000,4,petrol,alfa_romeo,no


In [8]:
features_test.shape

(65736, 10)

We can see by the number of rows in each sample that the separation has been done correctly.
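
The two shapes can be checked by hand against the 262943 rows reported by `df.info()`. The sketch below assumes that the test share is rounded up to a whole row and the remainder goes to training, which is consistent with the shapes shown above.

```python
import math

n_rows = 262943   # rows left after preprocessing
test_size = 0.25

n_test = math.ceil(n_rows * test_size)  # assumed rounding: test part rounded up
n_train = n_rows - n_test

print(n_train, n_test)
```

The two parts add up to the full dataframe, so no rows were lost or duplicated by the split.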

In [9]:
df.head()

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,Repaired
0,18300,coupe,2011,manual,190,other,125000,5,gasoline,audi,yes
1,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,Unknown
2,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no
3,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no
4,650,sedan,1995,manual,102,3er,150000,10,petrol,bmw,yes


Let's list the categorical features.

In [10]:
categories = [
    "VehicleType",
    "Gearbox",
    "Kilometer",
    "Model",
    "FuelType",
    "Brand",
    "Repaired",
]

Next, to convert the categorical features into numerical ones, we use the one-hot encoding (OHE) technique.

Fit the encoder on the training sample, then apply it to both the training and the test sample.

In [11]:
ohe = OneHotEncoder(handle_unknown="ignore")
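
With `handle_unknown="ignore"`, a category that appears only in the test sample does not raise an error: its row is encoded as all zeros. A minimal sketch on hypothetical data (the brand values are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical train/test rows; "tesla" never appears in the training part.
train = pd.DataFrame({"Brand": ["audi", "bmw", "audi"]})
test = pd.DataFrame({"Brand": ["bmw", "tesla"]})

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train)  # categories are learned from the training part only
encoded_test = ohe.transform(test).toarray()

learned = list(ohe.categories_[0])
print(learned)
print(encoded_test)
```

Fitting on the training part only keeps information about the test sample out of the encoder, at the cost of losing any signal from categories that were never seen during training.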

In [14]:
features_train[categories]

Unnamed: 0,VehicleType,Gearbox,Kilometer,Model,FuelType,Brand,Repaired
247352,wagon,auto,150000,3er,gasoline,bmw,no
175733,small,manual,40000,one,gasoline,mini,no
77294,wagon,manual,90000,a3,petrol,audi,no
57049,small,manual,150000,cooper,petrol,mini,no
39656,coupe,manual,90000,other,petrol,alfa_romeo,no
...,...,...,...,...,...,...,...
158838,sedan,manual,150000,bora,petrol,volkswagen,no
47873,convertible,manual,70000,mx_reihe,petrol,mazda,no
86398,sedan,manual,150000,a6,gasoline,audi,no
77285,sedan,auto,150000,other,petrol,bmw,no



In [15]:
result = ohe.fit_transform(features_train[categories])


In [16]:
features_train_ohe = pd.DataFrame.sparse.from_spmatrix(result)


In [17]:
# get_feature_names was removed in scikit-learn 1.2; use get_feature_names_out there.
features_train_ohe.columns = ohe.get_feature_names(features_train[categories].columns)


In [18]:
features_train_ohe

Unnamed: 0,VehicleType_bus,VehicleType_convertible,VehicleType_coupe,VehicleType_other,VehicleType_sedan,VehicleType_small,VehicleType_suv,VehicleType_wagon,Gearbox_auto,Gearbox_manual,...,Brand_sonstige_autos,Brand_subaru,Brand_suzuki,Brand_toyota,Brand_trabant,Brand_volkswagen,Brand_volvo,Repaired_Unknown,Repaired_no,Repaired_yes
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
197202,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
197203,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
197204,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
197205,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0



In [19]:
result = ohe.transform(features_test[categories])


In [20]:
features_test_ohe = pd.DataFrame.sparse.from_spmatrix(result)


In [21]:
# get_feature_names was removed in scikit-learn 1.2; use get_feature_names_out there.
features_test_ohe.columns = ohe.get_feature_names(features_test[categories].columns)


In [22]:
features_test_ohe

Unnamed: 0,VehicleType_bus,VehicleType_convertible,VehicleType_coupe,VehicleType_other,VehicleType_sedan,VehicleType_small,VehicleType_suv,VehicleType_wagon,Gearbox_auto,Gearbox_manual,...,Brand_sonstige_autos,Brand_subaru,Brand_suzuki,Brand_toyota,Brand_trabant,Brand_volkswagen,Brand_volvo,Repaired_Unknown,Repaired_no,Repaired_yes
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65731,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
65732,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
65733,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
65734,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0



Apply ordinal encoding to the same categorical features as well.

In [23]:
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan)
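
An encoder configured this way maps each category to its position in the sorted list of training categories, and maps anything unseen to `NaN`. A minimal sketch on hypothetical fuel types ("hydrogen" is made up and absent from the training part):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical train/test rows for illustration only.
train = pd.DataFrame({"FuelType": ["petrol", "gasoline", "lpg"]})
test = pd.DataFrame({"FuelType": ["lpg", "hydrogen"]})

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan)
encoder.fit(train)  # codes follow the sorted order: gasoline=0, lpg=1, petrol=2
encoded_test = encoder.transform(test)

print(encoded_test)
```

Whether a downstream model accepts the resulting `NaN` depends on the model, so it is worth checking the encoded test sample for missing values before training.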


In [24]:
features_train_ordinal = pd.DataFrame(
    encoder.fit_transform(features_train[categories]),
    columns=features_train[categories].columns,
)


In [25]:
features_test_ordinal = pd.DataFrame(
    encoder.transform(features_test[categories]),
    columns=features_test[categories].columns,
)


In [26]:
features_train_ordinal

Unnamed: 0,VehicleType,Gearbox,Kilometer,Model,FuelType,Brand,Repaired
0,7.0,0.0,12.0,11.0,2.0,2.0,1.0
1,5.0,1.0,4.0,165.0,2.0,21.0,1.0
2,7.0,1.0,9.0,28.0,6.0,1.0,1.0
3,5.0,1.0,12.0,80.0,6.0,21.0,1.0
4,2.0,1.0,9.0,166.0,6.0,0.0,1.0
...,...,...,...,...,...,...,...
197202,4.0,1.0,12.0,51.0,6.0,38.0,1.0
197203,1.0,1.0,7.0,158.0,6.0,19.0,1.0
197204,4.0,1.0,12.0,31.0,2.0,1.0,1.0
197205,4.0,0.0,12.0,166.0,6.0,2.0,1.0



In [27]:
features_test_ordinal

Unnamed: 0,VehicleType,Gearbox,Kilometer,Model,FuelType,Brand,Repaired
0,4.0,1.0,12.0,95.0,6.0,20.0,2.0
1,7.0,0.0,12.0,203.0,2.0,24.0,1.0
2,4.0,1.0,12.0,116.0,2.0,38.0,1.0
3,0.0,1.0,12.0,222.0,2.0,38.0,1.0
4,0.0,1.0,12.0,211.0,2.0,20.0,1.0
...,...,...,...,...,...,...,...
65731,6.0,1.0,12.0,168.0,2.0,22.0,1.0
65732,7.0,1.0,12.0,31.0,2.0,1.0,1.0
65733,5.0,1.0,6.0,106.0,6.0,32.0,1.0
65734,0.0,1.0,12.0,166.0,2.0,27.0,0.0



Scale the quantitative features: `RegistrationYear`, `Power`, `RegistrationMonth`.

In [28]:
numeric = ["RegistrationYear", "Power", "RegistrationMonth"]
scaler = StandardScaler()
scaler.fit(features_train[numeric])
pd.options.mode.chained_assignment = None
features_train[numeric] = scaler.transform(features_train[numeric])
features_test[numeric] = scaler.transform(features_test[numeric])


In [29]:
features_train[numeric]

Unnamed: 0,RegistrationYear,Power,RegistrationMonth
247352,0.969484,0.947190,-0.407586
175733,1.700201,-0.406657,-1.603028
77294,0.238767,0.429990,1.684437
57049,-0.053520,-0.178480,-1.005307
39656,0.823341,2.103284,-0.706446
...,...,...,...
158838,-0.784237,-0.315386,1.385576
47873,0.238767,-0.178480,0.787856
86398,1.115627,1.038461,-0.108726
77285,-1.222667,1.144943,1.385576



In [30]:
features_test[numeric]

Unnamed: 0,RegistrationYear,Power,RegistrationMonth
248581,-1.661097,-0.193692,1.684437
21953,0.092624,0.840708,0.488995
49337,-0.345807,-0.087210,-1.304167
77635,-0.638093,-0.817374,1.684437
209391,-0.784237,-0.300174,-0.108726
...,...,...,...
120420,-1.222667,0.049696,1.086716
128766,0.677197,0.871131,-1.603028
179950,0.531054,-0.923856,0.190135
50017,-0.199663,-0.482715,-1.005307



We can see that the scaling was successful.
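
The scaler learns the mean and standard deviation from the training sample only and reuses them for the test sample. The sketch below reproduces this by hand on hypothetical power values, to show what `transform` computes:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical power values for illustration only.
train_power = np.array([[50.0], [100.0], [150.0]])
test_power = np.array([[100.0], [200.0]])

scaler = StandardScaler().fit(train_power)  # learns mean and std from train
scaled_test = scaler.transform(test_power)

# Same computation by hand, using the population standard deviation (ddof=0).
manual = (test_power - train_power.mean()) / train_power.std()

print(scaled_test.ravel(), manual.ravel())
```

Because the statistics come from the training sample, the scaled test columns are not expected to have exactly zero mean and unit variance.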

Finally, we combine the encoded categorical features with the scaled quantitative features to get the final dataframes for each of the two encoding techniques, for both the training and the test sample.

In [31]:
features_train_num = features_train[numeric].reset_index(drop=True)
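
The index reset matters because `pd.concat` with `axis=1` aligns rows by index label, while the encoded frames carry a fresh 0..n-1 index and the split frames keep their shuffled original labels. A minimal sketch with hypothetical two-row frames:

```python
import pandas as pd

# Hypothetical frames: `encoded` has a fresh 0..n-1 index, as the encoder
# output does, while `numeric_part` keeps the shuffled index from the split.
encoded = pd.DataFrame({"Gearbox": [0.0, 1.0]})
numeric_part = pd.DataFrame({"Power": [0.9, -0.4]}, index=[247352, 175733])

# Without a reset the labels do not match, so rows are not paired up.
misaligned = pd.concat([encoded, numeric_part], axis=1)
# After the reset both frames share the labels 0 and 1.
aligned = pd.concat([encoded, numeric_part.reset_index(drop=True)], axis=1)

print(misaligned.shape, aligned.shape)
```

Checking that the concatenated frame has the same number of rows as its inputs and no unexpected `NaN` values is a cheap way to catch this kind of misalignment.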


In [32]:
features_train_num

Unnamed: 0,RegistrationYear,Power,RegistrationMonth
0,0.969484,0.947190,-0.407586
1,1.700201,-0.406657,-1.603028
2,0.238767,0.429990,1.684437
3,-0.053520,-0.178480,-1.005307
4,0.823341,2.103284,-0.706446
...,...,...,...
197202,-0.784237,-0.315386,1.385576
197203,0.238767,-0.178480,0.787856
197204,1.115627,1.038461,-0.108726
197205,-1.222667,1.144943,1.385576



In [33]:
features_train_ohe_conc = pd.concat([features_train_ohe, features_train_num], axis=1)


In [34]:
features_train_ohe_conc

Unnamed: 0,VehicleType_bus,VehicleType_convertible,VehicleType_coupe,VehicleType_other,VehicleType_sedan,VehicleType_small,VehicleType_suv,VehicleType_wagon,Gearbox_auto,Gearbox_manual,...,Brand_toyota,Brand_trabant,Brand_volkswagen,Brand_volvo,Repaired_Unknown,Repaired_no,Repaired_yes,RegistrationYear,Power,RegistrationMonth
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.969484,0.947190,-0.407586
1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.700201,-0.406657,-1.603028
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.238767,0.429990,1.684437
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,-0.053520,-0.178480,-1.005307
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.823341,2.103284,-0.706446
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
197202,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,-0.784237,-0.315386,1.385576
197203,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.238767,-0.178480,0.787856
197204,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.115627,1.038461,-0.108726
197205,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,-1.222667,1.144943,1.385576



In [35]:
features_train_ordinal_conc = pd.concat(
    [features_train_ordinal, features_train_num], axis=1
)


In [36]:
features_train_ordinal_conc

Unnamed: 0,VehicleType,Gearbox,Kilometer,Model,FuelType,Brand,Repaired,RegistrationYear,Power,RegistrationMonth
0,7.0,0.0,12.0,11.0,2.0,2.0,1.0,0.969484,0.947190,-0.407586
1,5.0,1.0,4.0,165.0,2.0,21.0,1.0,1.700201,-0.406657,-1.603028
2,7.0,1.0,9.0,28.0,6.0,1.0,1.0,0.238767,0.429990,1.684437
3,5.0,1.0,12.0,80.0,6.0,21.0,1.0,-0.053520,-0.178480,-1.005307
4,2.0,1.0,9.0,166.0,6.0,0.0,1.0,0.823341,2.103284,-0.706446
...,...,...,...,...,...,...,...,...,...,...
197202,4.0,1.0,12.0,51.0,6.0,38.0,1.0,-0.784237,-0.315386,1.385576
197203,1.0,1.0,7.0,158.0,6.0,19.0,1.0,0.238767,-0.178480,0.787856
197204,4.0,1.0,12.0,31.0,2.0,1.0,1.0,1.115627,1.038461,-0.108726
197205,4.0,0.0,12.0,166.0,6.0,2.0,1.0,-1.222667,1.144943,1.385576



In [37]:
features_test_num = features_test[numeric].reset_index(drop=True)


In [38]:
features_test_num

Unnamed: 0,RegistrationYear,Power,RegistrationMonth
0,-1.661097,-0.193692,1.684437
1,0.092624,0.840708,0.488995
2,-0.345807,-0.087210,-1.304167
3,-0.638093,-0.817374,1.684437
4,-0.784237,-0.300174,-0.108726
...,...,...,...
65731,-1.222667,0.049696,1.086716
65732,0.677197,0.871131,-1.603028
65733,0.531054,-0.923856,0.190135
65734,-0.199663,-0.482715,-1.005307



In [39]:
features_test_ohe_conc = pd.concat([features_test_ohe, features_test_num], axis=1)


In [40]:
features_test_ohe_conc

Unnamed: 0,VehicleType_bus,VehicleType_convertible,VehicleType_coupe,VehicleType_other,VehicleType_sedan,VehicleType_small,VehicleType_suv,VehicleType_wagon,Gearbox_auto,Gearbox_manual,...,Brand_toyota,Brand_trabant,Brand_volkswagen,Brand_volvo,Repaired_Unknown,Repaired_no,Repaired_yes,RegistrationYear,Power,RegistrationMonth
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-1.661097,-0.193692,1.684437
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.092624,0.840708,0.488995
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,-0.345807,-0.087210,-1.304167
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,-0.638093,-0.817374,1.684437
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,-0.784237,-0.300174,-0.108726
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65731,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,-1.222667,0.049696,1.086716
65732,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.677197,0.871131,-1.603028
65733,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.531054,-0.923856,0.190135
65734,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,-0.199663,-0.482715,-1.005307



In [41]:
features_test_ordinal_conc = pd.concat(
    [features_test_ordinal, features_test_num], axis=1
)


In [42]:
features_test_ordinal_conc

Unnamed: 0,VehicleType,Gearbox,Kilometer,Model,FuelType,Brand,Repaired,RegistrationYear,Power,RegistrationMonth
0,4.0,1.0,12.0,95.0,6.0,20.0,2.0,-1.661097,-0.193692,1.684437
1,7.0,0.0,12.0,203.0,2.0,24.0,1.0,0.092624,0.840708,0.488995
2,4.0,1.0,12.0,116.0,2.0,38.0,1.0,-0.345807,-0.087210,-1.304167
3,0.0,1.0,12.0,222.0,2.0,38.0,1.0,-0.638093,-0.817374,1.684437
4,0.0,1.0,12.0,211.0,2.0,20.0,1.0,-0.784237,-0.300174,-0.108726
...,...,...,...,...,...,...,...,...,...,...
65731,6.0,1.0,12.0,168.0,2.0,22.0,1.0,-1.222667,0.049696,1.086716
65732,7.0,1.0,12.0,31.0,2.0,1.0,1.0,0.677197,0.871131,-1.603028
65733,5.0,1.0,6.0,106.0,6.0,32.0,1.0,0.531054,-0.923856,0.190135
65734,0.0,1.0,12.0,166.0,2.0,27.0,0.0,-0.199663,-0.482715,-1.005307



<a id="2"></a>
## Model training
[Back to the top](#0)

Build several models to predict the values of the target attribute `Price`.

Start with a linear regression model, evaluated with 5-fold cross-validation on the one-hot-encoded features.

In [43]:
%%time
model = LinearRegression()
scores = cross_val_score(
    model, features_train_ohe_conc, target_train, scoring='neg_mean_squared_error', cv=5, n_jobs=-1
)
# Note: .min() reports the best (lowest-error) fold, not the cross-validation average.
print("RMSE of the linear regression model on the training set:", (scores * (-1)).min() ** 0.5)

RMSE of the linear regression model on the training set: 2756.9609355248654
Wall time: 30.7 s
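
The cell above takes the square root of the smallest fold MSE, which gives the most optimistic fold. A minimal sketch of how the best-fold and mean-fold RMSE differ is given below; it assumes a synthetic dataset from `make_regression` as a stand-in for the car data, and the variable names `fold_mse`, `rmse_best_fold` and `rmse_mean` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the car data (assumption: not the notebook's dataset).
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

scores = cross_val_score(
    LinearRegression(), X, y, scoring="neg_mean_squared_error", cv=5
)
fold_mse = -scores  # scikit-learn returns negated MSE, one value per fold

rmse_best_fold = np.sqrt(fold_mse.min())  # most optimistic fold
rmse_mean = np.sqrt(fold_mse.mean())      # error averaged over all folds

print("best-fold RMSE:", rmse_best_fold)
print("mean-fold RMSE:", rmse_mean)
```

The mean-fold value is never smaller than the best-fold value, so it is the more conservative figure to compare against `GridSearchCV.best_score_`, which is also averaged over folds.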



Measure the combined time needed to fit this model on the full training set and to get predictions from it.

In [47]:
%%time
model.fit(features_train_ohe_conc, target_train)
model.predict(features_train_ohe_conc)



Wall time: 4.98 s


array([ 8830., 13406.,  8446., ...,  8060.,  4058.,  9634.])
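
Since the customer cares about training time and prediction time separately, it can help to time the two steps on their own. Here is a rough sketch using `time.perf_counter`, again on synthetic data that only stands in for the real features; the names `fit_seconds` and `predict_seconds` are hypothetical.

```python
import time

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the encoded training features (assumption).
X, y = make_regression(n_samples=2000, n_features=10, noise=5.0, random_state=0)
model = LinearRegression()

# Time the training step on its own.
start = time.perf_counter()
model.fit(X, y)
fit_seconds = time.perf_counter() - start

# Time the prediction step on its own.
start = time.perf_counter()
predictions = model.predict(X)
predict_seconds = time.perf_counter() - start

print(f"fit: {fit_seconds:.4f} s, predict: {predict_seconds:.4f} s")
```

In a notebook the same separation can be had by putting `fit` and `predict` in two different `%%time` cells.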


Next, tune a decision tree, searching over `max_depth` values from 1 to 9.

In [48]:
%%time
model = DecisionTreeRegressor(random_state=RANDOM_STATE)
parameters = {"max_depth": range(1, 10, 1)}
grid_tr = GridSearchCV(model, parameters, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
grid_tr.fit(features_train_ordinal_conc, target_train)
grid_tr.best_params_

Wall time: 9.12 s


{'max_depth': 9}


Measure the prediction time of this model.

In [50]:
%%time
grid_tr.predict(features_train_ordinal_conc)

Wall time: 24 ms


array([14592.08035714, 11176.725     ,  7349.83571429, ...,
       14592.08035714,  2676.62312572,  9427.95381883])


In [51]:
print(
    "RMSE of the decision tree model on the training set:",
    ((grid_tr.best_score_ * (-1)) ** 0.5),
)

RMSE of the decision tree model on the training set: 2079.2150932912805
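
As an alternative sketch, `GridSearchCV` can score directly in RMSE with `scoring="neg_root_mean_squared_error"`, and its `cv_results_` already stores per-candidate fit and scoring times. The example below assumes synthetic data in place of the car features. Note that this scorer averages per-fold RMSE values, so the number can differ slightly from the square root of the averaged MSE used above.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the ordinal-encoded training features (assumption).
X, y = make_regression(n_samples=600, n_features=8, noise=15.0, random_state=0)

grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"max_depth": range(1, 10)},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X, y)

best = grid.best_index_
cv_rmse = -grid.best_score_                             # mean of per-fold RMSE
mean_fit = grid.cv_results_["mean_fit_time"][best]      # seconds per fold fit
mean_score = grid.cv_results_["mean_score_time"][best]  # seconds per fold scoring

print(grid.best_params_, cv_rmse)
print(f"fit: {mean_fit:.4f} s, score: {mean_score:.4f} s, refit: {grid.refit_time_:.4f} s")
```

Reading the timings from `cv_results_` avoids mixing the cost of the whole search with the cost of training a single model.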



Now build a random forest model.

In [54]:
%%time
model = RandomForestRegressor(random_state=RANDOM_STATE)
parameters = {"n_estimators": range(100, 201, 20)}
grid = GridSearchCV(model, parameters, cv=5, scoring='neg_mean_squared_error')
grid.fit(features_train_ordinal_conc, target_train)
grid.best_params_

Wall time: 47min 51s


{'n_estimators': 200}


Measure the prediction time of this model.

In [55]:
%%time
grid.predict(features_train_ordinal_conc)

Wall time: 20.8 s


array([15629.88470238, 15379.865     ,  8717.9795    , ...,
       15858.5       ,  2638.323     ,  6857.99      ])


In [56]:
print(
    "RMSE of the random forest model on the training set:",
    ((grid.best_score_ * (-1)) ** 0.5),
)

RMSE of the random forest model on the training set: 1703.3717127153698



Next, let's build gradient boosting models using the CatBoost and LightGBM libraries.

For CatBoostRegressor, first list the categorical features.

In [57]:
cat_features = ["VehicleType", "Gearbox", "Model", "FuelType", "Brand", "Repaired"]


CatBoost encodes categorical features itself, so it is trained on the original, unencoded dataframe. Split that dataframe into training and test sets and scale the numeric features.

In [58]:
target = df["Price"]
features = df.drop("Price", axis=1)
(features_train, features_test, target_train, target_test) = train_test_split(
    features, target, test_size=0.25, random_state=RANDOM_STATE
)


In [59]:
numeric = ["RegistrationYear", "Power", "RegistrationMonth"]
scaler = StandardScaler()
scaler.fit(features_train[numeric])
# Silence pandas' SettingWithCopyWarning for the column assignments below.
pd.options.mode.chained_assignment = None
features_train[numeric] = scaler.transform(features_train[numeric])
features_test[numeric] = scaler.transform(features_test[numeric])


In [60]:
features_train.shape

(197207, 10)


In [61]:
features_test.shape

(65736, 10)


The row counts of the two sets (197,207 and 65,736) match a 75/25 split, so the split has been done correctly.
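
The row-count check can be written out as a small calculation. This sketch assumes that `train_test_split` rounds the test-set size up when `test_size` is a fraction, which is how we understand scikit-learn to behave; the variable names are illustrative.

```python
import math

# Row counts taken from the two .shape outputs above.
n_train, n_test = 197207, 65736
n_total = n_train + n_test
test_size = 0.25

# Assumption: the test set gets ceil(test_size * n_total) rows,
# and the training set gets whatever is left.
expected_test = math.ceil(test_size * n_total)
expected_train = n_total - expected_test

print("expected:", expected_train, expected_test)
print("observed:", n_train, n_test)
```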

In [62]:
%%time
model = CatBoostRegressor(loss_function="RMSE")
parameters = {'depth' : [6,8,10],
              'learning_rate' : [0.01, 0.05, 0.1],
              'iterations'    : [30, 50, 100]
              }
grid = GridSearchCV(model, param_grid = parameters, cv = 5, scoring='neg_mean_squared_error', n_jobs=-1)
grid.fit(features_train, target_train, cat_features=cat_features, verbose = 10)
grid.best_params_

0:	learn: 4308.3140084	total: 293ms	remaining: 29s
10:	learn: 2559.4940553	total: 2.08s	remaining: 16.9s
20:	learn: 2013.9793540	total: 3.68s	remaining: 13.9s
30:	learn: 1838.2438524	total: 5.5s	remaining: 12.2s
40:	learn: 1777.1925050	total: 7.12s	remaining: 10.2s
50:	learn: 1742.1182339	total: 10.3s	remaining: 9.88s
60:	learn: 1715.1789536	total: 13s	remaining: 8.32s
70:	learn: 1695.0640522	total: 14.3s	remaining: 5.85s
80:	learn: 1677.8745056	total: 15.5s	remaining: 3.64s
90:	learn: 1664.6556765	total: 16.7s	remaining: 1.65s
99:	learn: 1652.2184566	total: 17.8s	remaining: 0us
Wall time: 11min 3s


{'depth': 10, 'iterations': 100, 'learning_rate': 0.1}


Measure the prediction time of this model.

In [63]:
%%time
grid.predict(features_train)

Wall time: 441 ms


array([14192.20669873, 14958.15878516,  8302.77024462, ...,
       14917.39431931,  3165.05128718,  9962.16771783])


In [64]:
print(
    "RMSE of the CatBoostRegressor model on the training set:",
    ((grid.best_score_ * (-1)) ** 0.5),
)

RMSE of the CatBoostRegressor model on the training set: 1700.7659216849222



In [65]:
%%time
model = lgb.LGBMRegressor(verbosity= -1)
parameters = {'max_depth' : [6,8,10],
              'learning_rate' : [0.01, 0.05, 0.1],
              'n_estimators'    : [30, 50, 100]
              }
grid = GridSearchCV(model, param_grid = parameters, cv = 5, scoring='neg_mean_squared_error', n_jobs=-1)
grid.fit(features_train_ordinal_conc, target_train)
grid.best_params_

Wall time: 1min 22s


{'learning_rate': 0.1, 'max_depth': 10, 'n_estimators': 100}


Measure the prediction time of this model.

In [66]:
%%time
grid.predict(features_train_ordinal_conc)

Wall time: 326 ms


array([14298.22184318, 14264.17614359,  8351.45145639, ...,
       15174.47445939,  3203.92428506,  9800.24087688])


In [67]:
print(
    "RMSE of the LGBMRegressor model on the training set:",
    ((grid.best_score_ * (-1)) ** 0.5),
)

RMSE of the LGBMRegressor model on the training set: 1720.937715036067



<a id="3"></a>
## Model Analysis
[Back to the top](#0)

The RMSE values obtained for the different models show that the CatBoostRegressor gradient boosting model gave the best cross-validation result (RMSE about 1701), with the random forest close behind (about 1703). The other gradient boosting model, LGBMRegressor, reached similar quality (about 1721) with a shorter prediction time (326 ms against 441 ms) and a much shorter hyperparameter search (1 min 22 s against 11 min 3 s), so the optimal decision is to choose this model.
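
To make the comparison easier to read, the figures reported in the cells above can be gathered into one table. This is only a convenience sketch: the numbers are copied from the notebook outputs (rounded), and the column names are our own.

```python
import pandas as pd

# Values copied from the outputs above. For linear regression the RMSE is the
# best-fold value and the 4.98 s timing includes fitting, not only prediction.
summary = pd.DataFrame(
    {
        "cv_rmse": [2756.96, 2079.22, 1703.37, 1700.77, 1720.94],
        "search_time_s": [30.7, 9.12, 47 * 60 + 51, 11 * 60 + 3, 1 * 60 + 22],
        "predict_time_s": [4.98, 0.024, 20.8, 0.441, 0.326],
    },
    index=["LinearRegression", "DecisionTree", "RandomForest", "CatBoost", "LightGBM"],
)

# Rank models from lowest to highest cross-validation RMSE.
ranked = summary.sort_values("cv_rmse")
print(ranked)
```

Sorting by `cv_rmse` puts the three ensemble models at the top, and the time columns then show how much cheaper LightGBM is to tune than CatBoost or the random forest.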

Check the quality of the LGBMRegressor model on the test set, using the best hyperparameters found above.

In [181]:
%%time
model = lgb.LGBMRegressor(max_depth=10, learning_rate=0.1, n_estimators=100)
model.fit(features_train_ordinal_conc, target_train)

Wall time: 1.01 s


LGBMRegressor(max_depth=10)


In [182]:
%%time
pred = model.predict(features_test_ordinal_conc)

Wall time: 132 ms



In [183]:
print(
    "RMSE of the LGBMRegressor model on the test set:",
    (mean_squared_error(target_test, pred) ** 0.5),
)

RMSE of the LGBMRegressor model on the test set: 1711.2873632337144



We see no sign of overfitting for the LGBMRegressor gradient boosting model: its test RMSE (about 1711) is close to its cross-validation RMSE on the training data (about 1721). The LightGBM model shows good quality on both the training and test data and takes little time to produce predictions, so it should be recommended to the customer.
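
The overfitting check can be expressed as a relative gap between the two RMSE values. The sketch below uses the numbers printed above; the 1% tolerance is an arbitrary illustrative threshold, not something set by the customer.

```python
# RMSE values copied from the LGBMRegressor outputs above.
cv_rmse = 1720.937715036067     # cross-validation on the training data
test_rmse = 1711.2873632337144  # held-out test set

# Positive gap: the model is worse on unseen data, a possible sign of overfitting.
relative_gap = (test_rmse - cv_rmse) / cv_rmse
tolerance = 0.01  # hypothetical threshold chosen for illustration

print(f"relative gap: {relative_gap:+.2%}")
print("within tolerance:", abs(relative_gap) < tolerance)
```

A gap that is small or negative, as here, means the test error does not exceed the cross-validation error by any meaningful margin.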