# Predicting the cost of cars

## Introduction

A used-car sales service is developing an app to attract new customers. In the app, users can quickly find out the market value of their car. Historical data is available: technical specifications, trim levels, and prices. The task is to build a model that predicts the price.

## Important to the customer:

- prediction quality;
- prediction speed;
- training time.

## Data description:
Features:
- `DateCrawled` - date the listing was downloaded from the database
- `VehicleType` - type of car body
- `RegistrationYear` - year the car was registered
- `Gearbox` - type of gearbox
- `Power` - power (hp)
- `Model` - car model
- `Kilometer` - mileage (km)
- `RegistrationMonth` - month the car was registered
- `FuelType` - fuel type
- `Brand` - car brand
- `NotRepaired` - whether the car has been repaired or not
- `DateCreated` - date the listing was created
- `NumberOfPictures` - number of photos of the car
- `PostalCode` - postal code of the listing owner (user)
- `LastSeen` - date of the user's last activity

Target attribute:
- `Price` - price (euros)

## Data preprocessing

#### Libraries

In [96]:
import pandas as pd
import pandas_profiling
import numpy as np
import matplotlib.pyplot as plt
import datetime
import seaborn as sns
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split, RepeatedKFold, cross_val_score
from sklearn.metrics import mean_squared_error as mse, make_scorer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesClassifier

import warnings
warnings.filterwarnings('ignore')

#### Data Loading

In [105]:
# Load the data and create a dataframe

df = pd.read_csv('autos.csv')
df

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,2016-03-24 11:52:17,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
354364,2016-03-21 09:50:58,0,,2005,manual,0,colt,150000,7,petrol,mitsubishi,yes,2016-03-21 00:00:00,0,2694,2016-03-21 10:42:49
354365,2016-03-14 17:48:27,2200,,2005,,0,,20000,1,,sonstige_autos,,2016-03-14 00:00:00,0,39576,2016-04-06 00:46:52
354366,2016-03-05 19:56:21,1199,convertible,2000,auto,101,fortwo,125000,3,petrol,smart,no,2016-03-05 00:00:00,0,26135,2016-03-11 18:17:12
354367,2016-03-19 18:57:12,9200,bus,1996,manual,102,transporter,150000,3,gasoline,volkswagen,no,2016-03-19 00:00:00,0,87439,2016-04-07 07:15:26


### Diving into the Data

In [106]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Kilometer          354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(9)

In [107]:
df.describe()

Unnamed: 0,Price,RegistrationYear,Power,Kilometer,RegistrationMonth,NumberOfPictures,PostalCode
count,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0
mean,4416.656776,2004.234448,110.094337,128211.172535,5.714645,0.0,50508.689087
std,4514.158514,90.227958,189.850405,37905.34153,3.726421,0.0,25783.096248
min,0.0,1000.0,0.0,5000.0,0.0,0.0,1067.0
25%,1050.0,1999.0,69.0,125000.0,3.0,0.0,30165.0
50%,2700.0,2003.0,105.0,150000.0,6.0,0.0,49413.0
75%,6400.0,2008.0,143.0,150000.0,9.0,0.0,71083.0
max,20000.0,9999.0,20000.0,150000.0,12.0,0.0,99998.0


In [108]:
df.dtypes

DateCrawled          object
Price                 int64
VehicleType          object
RegistrationYear      int64
Gearbox              object
Power                 int64
Model                object
Kilometer             int64
RegistrationMonth     int64
FuelType             object
Brand                object
NotRepaired          object
DateCreated          object
NumberOfPictures      int64
PostalCode            int64
LastSeen             object
dtype: object

In [109]:
# Check for duplicates

df.duplicated().sum()

4

* There are duplicates in the data, and the date columns are stored as strings. We will remove the duplicates, drop the columns we do not need, examine the missing values, and convert the features to a suitable format.

In [110]:
# Remove duplicates

df = df.drop_duplicates().reset_index(drop=True)

In [111]:
df.duplicated().sum()

0

In [112]:
# Drop the columns that are not expected to affect the price of the car

df = df.drop(['DateCrawled', 'DateCreated', 'NumberOfPictures', 'PostalCode', 'LastSeen'], axis=1)
df

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,NotRepaired
0,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,
1,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes
2,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,
3,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no
4,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no
...,...,...,...,...,...,...,...,...,...,...,...
354360,0,,2005,manual,0,colt,150000,7,petrol,mitsubishi,yes
354361,2200,,2005,,0,,20000,1,,sonstige_autos,
354362,1199,convertible,2000,auto,101,fortwo,125000,3,petrol,smart,no
354363,9200,bus,1996,manual,102,transporter,150000,3,gasoline,volkswagen,no


In [113]:
# Let's check the missing value percentages in the dataset now.

pd.DataFrame(round((df.isna().mean()*100),2)).style.background_gradient('coolwarm')

Unnamed: 0,0
Price,0.0
VehicleType,10.58
RegistrationYear,0.0
Gearbox,5.6
Power,0.0
Model,5.56
Kilometer,0.0
RegistrationMonth,0.0
FuelType,9.28
Brand,0.0


* Values are missing in the columns `VehicleType`, `Gearbox`, `Model`, `FuelType` and `NotRepaired`. These features are specific to each car, so guessing the body type, gearbox, model, fuel type or repair history would be unreliable. We fill the gaps with the placeholder value "unknown".
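
Since only the categorical columns contain gaps, the fill can also be restricted to those columns so that numeric columns are never touched by accident. A minimal sketch on a toy frame (the values are made up for illustration, and `cat_cols` is a hypothetical helper name):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (illustrative values only)
toy = pd.DataFrame({
    'VehicleType': ['sedan', np.nan, 'suv'],
    'Gearbox': ['manual', 'auto', np.nan],
    'Power': [75, 0, 163],
})

# Fill gaps only in the listed categorical columns
cat_cols = ['VehicleType', 'Gearbox']
toy[cat_cols] = toy[cat_cols].fillna('unknown')

print(toy)
```

Listing the columns explicitly also documents which features are allowed to carry the "unknown" placeholder.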

In [114]:
# Fill in the missing values with the value "unknown"

df = df.fillna('unknown')

In [115]:
# Let's check whether any missing values remain after filling

pd.DataFrame(round((df.isna().mean()*100),2)).style.background_gradient('coolwarm')

Unnamed: 0,0
Price,0.0
VehicleType,0.0
RegistrationYear,0.0
Gearbox,0.0
Power,0.0
Model,0.0
Kilometer,0.0
RegistrationMonth,0.0
FuelType,0.0
Brand,0.0


In [60]:
# Let's save the feature names in a separate list

attributes = list(df)
attributes

['Price',
 'VehicleType',
 'RegistrationYear',
 'Gearbox',
 'Power',
 'Model',
 'Kilometer',
 'RegistrationMonth',
 'FuelType',
 'Brand',
 'NotRepaired']

* Let's look at each column more closely: sort by it and inspect the extreme values together with the other features before deciding how to clean it.

In [61]:
# Column Price
           
df.sort_values(['Price'], ascending=False)

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,NotRepaired
285890,20000,sedan,2012,unknown,184,3er,100000,6,gasoline,bmw,no
212038,20000,sedan,2011,auto,265,c_klasse,50000,11,gasoline,mercedes_benz,no
166379,20000,bus,2014,manual,150,other,20000,12,petrol,ford,no
1586,20000,sedan,2014,auto,184,leon,40000,4,gasoline,seat,no
321362,20000,sedan,2011,auto,265,c_klasse,50000,11,gasoline,mercedes_benz,no
...,...,...,...,...,...,...,...,...,...,...,...
187584,0,convertible,2004,manual,135,megane,150000,3,petrol,renault,no
99318,0,wagon,2001,manual,140,lybra,150000,0,gasoline,lancia,unknown
99319,0,small,1995,unknown,0,corsa,150000,0,petrol,opel,unknown
99331,0,convertible,1996,manual,75,golf,50000,6,unknown,volkswagen,no


In [62]:
df.sort_values(['RegistrationYear'], ascending=False)

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,NotRepaired
151228,0,unknown,9999,unknown,0,unknown,10000,7,unknown,mazda,unknown
326721,60,unknown,9999,unknown,0,c4,10000,0,unknown,citroen,unknown
167937,1000,unknown,9999,unknown,0,unknown,10000,0,unknown,mazda,unknown
306575,350,unknown,9999,unknown,0,kaefer,10000,1,unknown,volkswagen,unknown
333484,0,unknown,9999,unknown,0,unknown,10000,0,unknown,bmw,unknown
...,...,...,...,...,...,...,...,...,...,...,...
348826,1,unknown,1000,unknown,1000,unknown,150000,0,unknown,sonstige_autos,unknown
323440,30,unknown,1000,unknown,0,unknown,5000,0,unknown,audi,unknown
55605,500,unknown,1000,unknown,0,unknown,5000,0,unknown,citroen,yes
183778,500,unknown,1000,unknown,0,unknown,5000,1,unknown,sonstige_autos,unknown


In [63]:
df.sort_values(['Power'], ascending=False)

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,NotRepaired
219583,4300,coupe,1999,auto,20000,clk,150000,1,petrol,mercedes_benz,no
299177,1500,wagon,1997,manual,19312,5er,150000,1,unknown,bmw,no
114106,9999,sedan,2006,manual,19211,1er,125000,0,gasoline,bmw,unknown
132485,2100,wagon,2001,manual,19208,5er,150000,5,unknown,bmw,yes
63986,3250,sedan,2001,auto,17932,omega,150000,6,petrol,opel,unknown
...,...,...,...,...,...,...,...,...,...,...,...
115725,650,small,1998,auto,0,micra,150000,4,petrol,nissan,no
264237,2600,wagon,1988,manual,0,601,150000,0,petrol,trabant,no
264230,550,unknown,2000,unknown,0,unknown,5000,0,petrol,volkswagen,unknown
264229,6500,small,1993,manual,0,golf,150000,11,gasoline,volkswagen,unknown


In [64]:
df.sort_values(['Kilometer'], ascending=False)

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,NotRepaired
0,480,unknown,1993,manual,0,golf,150000,0,petrol,volkswagen,unknown
213204,3200,wagon,2005,manual,116,3er,150000,5,gasoline,bmw,no
213213,2799,sedan,1995,manual,150,a4,150000,9,petrol,audi,no
213212,550,sedan,1996,auto,90,carisma,150000,11,petrol,mitsubishi,no
213209,100,unknown,1995,manual,90,golf,150000,4,petrol,volkswagen,yes
...,...,...,...,...,...,...,...,...,...,...,...
14862,6500,bus,2005,unknown,0,touran,5000,0,gasoline,volkswagen,unknown
173477,750,small,2002,manual,70,grand,5000,8,petrol,suzuki,no
179744,3000,sedan,1996,manual,170,5er,5000,10,petrol,bmw,no
206570,6399,wagon,2007,manual,140,transit,5000,3,gasoline,ford,unknown


* Column **Price**. Zero values are clearly incorrect: a car that was being given away would not be listed for sale. We remove them because they will not help the forecast. The filter below also drops listings priced at 500 euros or less.
* Column **RegistrationYear**. Registration years of 1960 or earlier and of 2021 or later can be removed. A car registered before 1960 is either scrap that costs money to recycle or a collector's item, and collector's cars are sold at dedicated auctions where the price is very hard to predict.
* Column **Power**. A power of 0 is impossible, and 1000 hp or more is also incorrect. We filter these rows out.
* Column **Kilometer**. Everything is fine here.
* General conclusion: filter the dataframe by these conditions.
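
The rules above can be sketched as a boolean mask, which makes it easy to count how many rows are rejected. This is only an illustration on a toy frame: the data is made up, and the constant names `MIN_YEAR`, `MAX_YEAR` and `MAX_POWER` are hypothetical.

```python
import pandas as pd

# Toy listings (illustrative values, not taken from the dataset)
toy = pd.DataFrame({
    'Price': [0, 480, 1500, 9800, 3000],
    'RegistrationYear': [2005, 1993, 2001, 9999, 1950],
    'Power': [90, 0, 75, 163, 60],
})

# Thresholds mirror the bullet points above
MIN_YEAR, MAX_YEAR = 1960, 2021
MAX_POWER = 1000

mask = (
    (toy['Price'] > 0)
    & (toy['RegistrationYear'] > MIN_YEAR)
    & (toy['RegistrationYear'] < MAX_YEAR)
    & (toy['Power'] > 0)
    & (toy['Power'] < MAX_POWER)
)

filtered = toy[mask]
removed = len(toy) - len(filtered)

print(filtered)
print('Rows removed:', removed)
```

Keeping the thresholds in named constants makes it simpler to revisit them later, for example if the upper bound on the registration year needs to be tightened.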


In [65]:
# Filter the dataframe by the conditions above

df = df.query('Price > 500 and 1960 < RegistrationYear < 2021 and 0 < Power < 1000')
df


Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,NotRepaired
1,18300,coupe,2011,manual,190,unknown,125000,5,gasoline,audi,yes
2,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,unknown
3,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no
4,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no
5,650,sedan,1995,manual,102,3er,150000,10,petrol,bmw,yes
...,...,...,...,...,...,...,...,...,...,...,...
354357,5250,unknown,2016,auto,150,159,150000,12,unknown,alfa_romeo,no
354358,3200,sedan,2004,manual,225,leon,150000,5,petrol,seat,yes
354362,1199,convertible,2000,auto,101,fortwo,125000,3,petrol,smart,no
354363,9200,bus,1996,manual,102,transporter,150000,3,gasoline,volkswagen,no


* Applying these conditions removed 69795 rows from the dataset.

In [66]:
# Let's analyze the correlation of features, output the correlation matrix

cor_m = df[['Price', 'RegistrationYear', 'Power', 'Kilometer', 'RegistrationMonth']].corr()
cor_m.style.background_gradient(cmap='coolwarm').format(precision=2)

Unnamed: 0,Price,RegistrationYear,Power,Kilometer,RegistrationMonth
Price,1.0,0.42,0.48,-0.39,0.05
RegistrationYear,0.42,1.0,0.08,-0.27,0.01
Power,0.48,0.08,1.0,0.12,0.03
Kilometer,-0.39,-0.27,0.12,1.0,-0.01
RegistrationMonth,0.05,0.01,0.03,-0.01,1.0


* The correlation matrix shows that the price correlates positively with the registration year (0.42) and power (0.48) and negatively with mileage (-0.39), which is reasonable.

### Working with categorical features

We will use gradient boosting, so all object-typed columns need to be encoded as numbers.
* The categorical features are in the columns `VehicleType`, `Gearbox`, `Model`, `FuelType` and `Brand`. We will encode them with `OrdinalEncoder`.
* The column **NotRepaired** will be encoded as 0 or 1, with "unknown" treated as 0.
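
One practical detail of ordinal encoding is what happens when a category appears at prediction time that the encoder never saw. A minimal sketch, assuming the encoder is fitted on the training part only and unseen categories are mapped to -1 (both are illustrative choices, not the approach used in this notebook; the toy frames are made up):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy train/test frames (illustrative values only)
train = pd.DataFrame({'FuelType': ['petrol', 'gasoline', 'petrol'],
                      'Gearbox': ['manual', 'auto', 'manual']})
test = pd.DataFrame({'FuelType': ['petrol', 'electric'],
                     'Gearbox': ['auto', 'manual']})

# Categories missing from the training data are encoded as -1
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
train_enc = enc.fit_transform(train)
test_enc = enc.transform(test)

print(enc.categories_)
print(test_enc)
```

Here `electric` is absent from the training frame, so it receives the reserved code instead of raising an error.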

In [67]:
df['NotRepaired'].unique()

array(['yes', 'unknown', 'no'], dtype=object)

In [68]:
# Encode the binary feature; 'unknown' is treated the same as 'no' (0)

df['NotRepaired'] = df['NotRepaired'].map({'no': 0, 'unknown': 0, 'yes': 1})

In [69]:
# Let's create a function for encoding categorical features into numeric values

def get_encode(data, columns):
    encode_df = data.copy()
    enc = OrdinalEncoder()
    encode_df[columns]=enc.fit_transform(encode_df[columns])
    encode_df[columns] = encode_df[columns].astype('int')
    display(encode_df.head())
    return encode_df

In [70]:
# Let's apply the function

encode_df = get_encode(df, ['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand'])

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,NotRepaired
1,18300,2,2011,1,190,227,125000,5,2,1,1
2,9800,6,2004,0,163,117,125000,8,2,14,0
3,1500,5,2001,1,75,116,150000,6,6,38,0
4,3600,5,2008,1,69,101,90000,7,2,31,0
5,650,4,1995,1,102,11,150000,10,6,2,1


In [71]:
# Let's analyze the correlation of features, output the correlation matrix

cor_m = encode_df.corr()
cor_m.style.background_gradient(cmap='coolwarm').format(precision=2)

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,NotRepaired
Price,1.0,-0.08,0.42,-0.25,0.48,-0.02,-0.39,0.05,-0.27,-0.1,-0.15
VehicleType,-0.08,1.0,0.16,-0.02,0.0,-0.13,0.07,-0.0,-0.05,-0.05,0.01
RegistrationYear,0.42,0.16,1.0,-0.01,0.08,-0.01,-0.27,0.01,-0.19,-0.0,-0.06
Gearbox,-0.25,-0.02,-0.01,1.0,-0.41,0.04,-0.03,-0.04,0.14,0.11,-0.0
Power,0.48,0.0,0.08,-0.41,1.0,-0.12,0.12,0.03,-0.17,-0.32,-0.01
Model,-0.02,-0.13,-0.01,0.04,-0.12,1.0,-0.04,-0.02,-0.01,0.45,0.01
Kilometer,-0.39,0.07,-0.27,-0.03,0.12,-0.04,1.0,-0.01,-0.15,-0.06,0.06
RegistrationMonth,0.05,-0.0,0.01,-0.04,0.03,-0.02,-0.01,1.0,-0.07,-0.01,-0.02
FuelType,-0.27,-0.05,-0.19,0.14,-0.17,-0.01,-0.15,-0.07,1.0,0.03,0.0
Brand,-0.1,-0.05,-0.0,0.11,-0.32,0.45,-0.06,-0.01,0.03,1.0,-0.01


* The picture is unchanged: the price correlates with the registration year and power. In addition, the encoded `Model` and `Brand` columns are noticeably correlated (0.45), so we can combine them into a single feature.
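
The correlation between ordinal codes depends on the alphabetical order of the categories, so a more direct check of how the two columns relate is to count how many brands share each model name. A small sketch on made-up rows (the toy frame and the name `brands_per_model` are illustrative):

```python
import pandas as pd

# Toy rows (illustrative values only)
toy = pd.DataFrame({
    'Brand': ['volkswagen', 'volkswagen', 'bmw', 'bmw', 'audi', 'ford'],
    'Model': ['golf', 'golf', '3er', '5er', 'other', 'other'],
})

# Number of distinct brands per model name
brands_per_model = toy.groupby('Model')['Brand'].nunique()
ambiguous = brands_per_model[brands_per_model > 1]
print(ambiguous)

# The combined feature keeps generic model names apart across brands
toy['Brand_Model'] = toy['Brand'] + '_' + toy['Model']
print(toy['Brand_Model'].nunique())
```

Generic labels such as `other` are the case where the combined feature carries more information than the model name alone.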

In [72]:
# Combining 2 columns into 1 feature (there is a high correlation)

df['Brand_Model']=df['Brand'] +'_'+ df['Model']
df=df.drop(['Brand','Model'],  axis=1)
df.head(3)

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Kilometer,RegistrationMonth,FuelType,NotRepaired,Brand_Model
1,18300,coupe,2011,manual,190,125000,5,gasoline,1,audi_unknown
2,9800,suv,2004,auto,163,125000,8,gasoline,0,jeep_grand
3,1500,small,2001,manual,75,150000,6,petrol,0,volkswagen_golf


In [73]:
encode_df = get_encode(df, ['VehicleType', 'Gearbox', 'FuelType', 'Brand_Model'])

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Kilometer,RegistrationMonth,FuelType,NotRepaired,Brand_Model
1,18300,2,2011,1,190,125000,5,2,1,23
2,9800,6,2004,0,163,125000,8,2,0,119
3,1500,5,2001,1,75,150000,6,6,0,310
4,3600,5,2008,1,69,90000,7,2,0,267
5,650,4,1995,1,102,150000,10,6,1,25


In [74]:
encode_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 284570 entries, 1 to 354364
Data columns (total 10 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   Price              284570 non-null  int64
 1   VehicleType        284570 non-null  int64
 2   RegistrationYear   284570 non-null  int64
 3   Gearbox            284570 non-null  int64
 4   Power              284570 non-null  int64
 5   Kilometer          284570 non-null  int64
 6   RegistrationMonth  284570 non-null  int64
 7   FuelType           284570 non-null  int64
 8   NotRepaired        284570 non-null  int64
 9   Brand_Model        284570 non-null  int64
dtypes: int64(10)
memory usage: 23.9 MB


### Conclusion
* We analyzed the data, removed duplicates, filled the gaps and filtered out the outliers.
* We encoded the categorical features.
* The data is ready for modeling.

## Model fitting

* Prepare samples

In [75]:
# Let's split our dataset into the features and the target

target = encode_df['Price']
features = encode_df.drop('Price', axis=1)

In [76]:
# Fix the **random_state** value for reproducibility

rs=12345

In [77]:
target_train, target_test, features_train, features_test = train_test_split(target, features,
                                                                           test_size=.25,
                                                                           random_state=rs)
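
`train_test_split` returns a train/test pair for each array in the order the arrays were passed, which is why the target parts come first in the call above. A minimal sketch on toy arrays (the data is made up) that checks the rows stay aligned:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: row i of X is [2*i, 2*i + 1], and y[i] = i
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# y is passed first, so its train/test parts are returned first
y_train, y_test, X_train, X_test = train_test_split(
    y, X, test_size=0.25, random_state=12345
)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```

Passing the features first, as in `train_test_split(features, target, ...)`, is the more common convention and returns the parts in the order features train, features test, target train, target test.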

In [78]:
# Let's check the shape of the samples received

print('Features train', features_train.shape, target_train.shape)
print('Features test', features_test.shape, target_test.shape)


Features train (213427, 9) (213427,)
Features test (71143, 9) (71143,)


In [79]:
# Let's standardize the features: the scaler is fitted on the training sample only
# and then applied to both samples (the target is not scaled)

scaler = StandardScaler()
scaler.fit(features_train)
features_train = pd.DataFrame(scaler.transform(features_train), columns=features_train.columns)
features_test = pd.DataFrame(scaler.transform(features_test), columns=features_train.columns)
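
When scaling is combined with cross-validation, one common option is to place the scaler inside a pipeline so that it is refitted on the training folds of every split. A minimal sketch on synthetic data (the data, coefficients and variable names are assumptions made for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 10000.0])
y = X @ np.array([2.0, 0.05, 0.0003]) + rng.normal(scale=0.1, size=200)

# The scaler is fitted inside each fold, never on the held-out part
pipe = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(pipe, X, y, cv=5, scoring='neg_root_mean_squared_error')

print('Mean RMSE:', -scores.mean())
```

Tree-based models are insensitive to monotonic rescaling of the features, so the scaling step mainly matters for the linear model.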

## Model analysis

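### Let's check the baseline models with cross-validation

We calculate the RMSE metric for 3 baseline regression models using 5-fold cross-validation. Note that these baselines are evaluated on the full dataset (`features`, `target`) without scaling, not on the training sample alone.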


In [80]:
# Let's create a function to calculate the RMSE metric

def rmse_score(target, predictions):
    return mse(target, predictions)**.5

In [81]:
# Let's create a scorer based on the RMSE function;
# with greater_is_better=False the scores are returned negated

rmse = make_scorer(rmse_score, greater_is_better=False)
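
scikit-learn also ships a built-in scoring string, `'neg_root_mean_squared_error'`, which should give the same negated values as the custom scorer. A small sketch on synthetic data comparing the two (the data and the name `custom` are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score


def rmse_score(target, predictions):
    # Root of the mean squared error
    return mean_squared_error(target, predictions) ** 0.5


custom = make_scorer(rmse_score, greater_is_better=False)

# Synthetic regression problem
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100)

model = LinearRegression()
custom_scores = cross_val_score(model, X, y, cv=5, scoring=custom)
builtin_scores = cross_val_score(model, X, y, cv=5,
                                 scoring='neg_root_mean_squared_error')

print(custom_scores)
print(builtin_scores)
```

Both variants return negative numbers because scikit-learn treats larger scores as better, so the RMSE itself is the absolute value of the score.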

In [82]:
linear = LinearRegression()
linear_score = cross_val_score(linear, features, target, cv=5, scoring=rmse).mean()
linear_score

-2982.059671224179

In [83]:
tree = DecisionTreeRegressor()
tree_score = cross_val_score(tree, features, target, cv=5, scoring=rmse).mean()
tree_score

-2024.5862968957165

In [84]:
%%time
forest = RandomForestRegressor(n_estimators=100)
forest_score = cross_val_score(forest, features, target, cv=5, scoring=rmse).mean()
forest_score

CPU times: user 5min 32s, sys: 4.2 s, total: 5min 36s
Wall time: 5min 36s


-1596.3619612625148

* The best result was shown by the **RandomForestRegressor** model, with an RMSE of about 1596. The scores are negative because the scorer negates the metric.

### Let's select hyperparameters and train the model

#### DecisionTreeRegressor

In [85]:
%%time
best_score_t = np.inf
best_depth_t = None
for depth in range(1, 16):
    tree_model = DecisionTreeRegressor(max_depth=depth, random_state=rs)
    tree_model_score = cross_val_score(tree_model, features_train, target_train,
                                     cv=5, scoring=rmse)
    # Scores are negated RMSE values, so compare their absolute values
    if np.abs(tree_model_score.mean()) < np.abs(best_score_t):
        best_score_t = tree_model_score.mean()
        best_depth_t = depth
        best_tree_model = tree_model

print('Best depth:', best_depth_t)
print('RMSE:', best_score_t)

Best depth: 15
RMSE: -1930.541234867934
CPU times: user 21.1 s, sys: 20 ms, total: 21.1 s
Wall time: 21.1 s
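
The same kind of search can also be expressed with `GridSearchCV`, which keeps track of the best parameters itself. A minimal sketch on synthetic data (the data and the depth range are illustrative and not the grid used in this notebook):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression problem
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=12345),
    param_grid={'max_depth': list(range(1, 9))},
    scoring='neg_root_mean_squared_error',
    cv=5,
)
search.fit(X, y)

print('Best depth:', search.best_params_['max_depth'])
print('RMSE:', -search.best_score_)
```

After fitting, `search.best_estimator_` holds a model refitted on all of the data passed to `fit` with the best parameters.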


In [86]:
fit_time = datetime.datetime.now()
model_tree = DecisionTreeRegressor(max_depth=best_depth_t, random_state=rs)
model_tree.fit(features_train, target_train)
time_of_fit_tree = (datetime.datetime.now()-fit_time).seconds
predict_time = datetime.datetime.now()
predictions_tree = model_tree.predict(features_test)
time_of_predict_tree = (datetime.datetime.now()-predict_time).seconds
tree_rmse = rmse_score(target_test, predictions_tree)

print('Time of fit:', time_of_fit_tree, 'seconds.')
print('Time of predict:', time_of_predict_tree, 'seconds.')
print('RMSE:', tree_rmse)

Time of fit: 0 seconds.
Time of predict: 0 seconds.
RMSE: 1890.2267677046527
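
The `.seconds` attribute of a `timedelta` holds whole seconds only, so any step that finishes in under a second is reported as 0. A minimal sketch of sub-second timing with `time.perf_counter` (the model and the synthetic data are placeholders for illustration):

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data standing in for the real samples
rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 5))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=1000)

# Time the fit with a high-resolution clock
start = time.perf_counter()
model = LinearRegression().fit(X, y)
fit_seconds = time.perf_counter() - start

# Time the prediction the same way
start = time.perf_counter()
predictions = model.predict(X)
predict_seconds = time.perf_counter() - start

print(f'Time of fit: {fit_seconds:.4f} seconds.')
print(f'Time of predict: {predict_seconds:.4f} seconds.')
```

Fractional timings make it possible to compare fast models, which would otherwise all show 0 seconds.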


In [87]:
# Let's analyze the influence of factors on the target. Output by importance

importances = model_tree.feature_importances_
feature_list = list(features_train.columns)
feature_results = pd.DataFrame({'feature': feature_list,'importance': importances}).sort_values('importance',ascending = False).reset_index(drop=True)
feature_results.head(11)

Unnamed: 0,feature,importance
0,RegistrationYear,0.494152
1,Power,0.299829
2,Kilometer,0.088131
3,Brand_Model,0.05559
4,VehicleType,0.028704
5,RegistrationMonth,0.011101
6,NotRepaired,0.010539
7,FuelType,0.007213
8,Gearbox,0.00474


* The results are as expected: the most important features are the registration year and the power of the car.
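
Impurity-based importances are computed from the training data, so it can be useful to cross-check them with permutation importance on held-out data. A minimal sketch on synthetic data where only the first feature matters (the data and the feature names `x0`, `x1`, `x2` are made up):

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends on column 0 only
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))
y = 5 * X[:, 0] + rng.normal(scale=0.1, size=500)

X_train, X_val = X[:400], X[400:]
y_train, y_val = y[:400], y[400:]

model = DecisionTreeRegressor(max_depth=5, random_state=12345)
model.fit(X_train, y_train)

# Shuffle each column in turn and measure how much the score drops
result = permutation_importance(model, X_val, y_val,
                                n_repeats=5, random_state=12345)

for name, score in zip(['x0', 'x1', 'x2'], result.importances_mean):
    print(name, round(score, 3))
```

A feature whose shuffling barely changes the score contributes little to the predictions on the validation part.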

#### RandomForestRegressor

In [88]:
%%time 
best_score_f = np.inf
best_est_f = None
best_depth_f = None
for est in range(90, 110, 10):
    for depth in range(13, 16):
        forest_model = RandomForestRegressor(max_depth=depth, n_estimators=est, random_state=rs)
        forest_model_score = cross_val_score(forest_model, features_train, target_train,
                                           cv=5, scoring=rmse)
        # Scores are negated RMSE values, so compare their absolute values
        if np.abs(forest_model_score.mean()) < np.abs(best_score_f):
            best_score_f = forest_model_score.mean()
            best_depth_f = depth
            best_est_f = est

print('Best n_estimators:', best_est_f)
print('Best depth:', best_depth_f)
print('RMSE:', best_score_f)

Best n_estimators: 100
Best depth: 15
RMSE: -1660.9116288343969
CPU times: user 14min 36s, sys: 568 ms, total: 14min 37s
Wall time: 14min 36s


In [89]:
fit_time = datetime.datetime.now()
model_forest = RandomForestRegressor(n_estimators=best_est_f, max_depth=best_depth_f, random_state=rs)
model_forest.fit(features_train, target_train)
time_of_fit_forest = (datetime.datetime.now()-fit_time).seconds
predict_time = datetime.datetime.now()
predictions_forest = model_forest.predict(features_test)
time_of_predict_forest = (datetime.datetime.now()-predict_time).seconds
forest_rmse = rmse_score(target_test, predictions_forest)

print('Time of fit:', time_of_fit_forest, 'seconds.')
print('Time of predict:', time_of_predict_forest, 'seconds.')
print('RMSE:', forest_rmse)

Time of fit: 40 seconds.
Time of predict: 1 seconds.
RMSE: 1640.5297854209975


#### LinearRegression

In [90]:
fit_time = datetime.datetime.now()
model_linear = LinearRegression()
model_linear.fit(features_train, target_train)
time_of_fit_linear = (datetime.datetime.now()-fit_time).seconds
predict_time = datetime.datetime.now()
predictions_linear = model_linear.predict(features_test)
time_of_predict_linear = (datetime.datetime.now()-predict_time).seconds
linear_rmse = rmse_score(target_test, predictions_linear)

print('Time of fit:', time_of_fit_linear, 'seconds.')
print('Time of predict:', time_of_predict_linear, 'seconds.')
print('RMSE:', linear_rmse)

Time of fit: 0 seconds.
Time of predict: 0 seconds.
RMSE: 2983.847334368411


#### Gradient boosting LightGBM

In [91]:
%%time 
best_score_l = np.inf
best_est_l = None
best_depth_l = None
for est in range(80, 100, 10):
    for depth in range(13, 15):
        light_model = LGBMRegressor(max_depth=depth, n_estimators=est, random_state=rs)
        light_model_score = cross_val_score(light_model, features_train, target_train,
                                           cv=5, scoring=rmse)
        # Scores are negated RMSE values, so compare their absolute values
        if np.abs(light_model_score.mean()) < np.abs(best_score_l):
            best_score_l = light_model_score.mean()
            best_depth_l = depth
            best_est_l = est

print('Best n_estimators:', best_est_l)
print('Best depth:', best_depth_l)
print('RMSE:', best_score_l)

Best n_estimators: 90
Best depth: 14
RMSE: -1738.7029118911346
CPU times: user 56.5 s, sys: 320 ms, total: 56.9 s
Wall time: 15.4 s


In [92]:
fit_time = datetime.datetime.now()
model_light = LGBMRegressor(n_estimators=best_est_l, max_depth=best_depth_l, random_state=rs)
model_light.fit(features_train, target_train)
time_of_fit_light = (datetime.datetime.now()-fit_time).seconds
predict_time = datetime.datetime.now()
predictions_light = model_light.predict(features_test)
time_of_predict_light = (datetime.datetime.now()-predict_time).seconds
light_rmse = rmse_score(target_test, predictions_light)

print('Time of fit:', time_of_fit_light, 'seconds.')
print('Time of predict:', time_of_predict_light, 'seconds.')
print('RMSE:', light_rmse)

Time of fit: 0 seconds.
Time of predict: 0 seconds.
RMSE: 1738.060138683184


#### Let's analyze the models

In [93]:
# Let's create the final results table (the parameter selection times are entered by hand)

data = {'Model':['Linear Regression', 'Decision Tree', 'Random Forest', 'LightGBM'],
       'RMSE':[linear_rmse, tree_rmse, forest_rmse, light_rmse],
       'Fit Time':[time_of_fit_linear, time_of_fit_tree, time_of_fit_forest,
                       time_of_fit_light],
       'Predict Time':[time_of_predict_linear, time_of_predict_tree, time_of_predict_forest,
                       time_of_predict_light],
       'Parameters Selection Time':[0, 30, 5040, 1800]}

result_table = pd.DataFrame(data)
result_table

Unnamed: 0,Model,RMSE,Fit Time,Predict Time,Parameters Selection Time
0,Linear Regression,2983.847334,0,0,0
1,Decision Tree,1890.226768,0,0,30
2,Random Forest,1640.529785,40,1,5040
3,LightGBM,1738.060139,0,0,1800
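
Since the customer cares about both quality and training time, one way to read this table is to apply a time budget first and then rank by RMSE. A small sketch using values rounded from the table above (the budget `MAX_FIT_SECONDS` is a hypothetical number chosen for illustration):

```python
import pandas as pd

# Values rounded from the results table
results = pd.DataFrame({
    'Model': ['Linear Regression', 'Decision Tree', 'Random Forest', 'LightGBM'],
    'RMSE': [2983.85, 1890.23, 1640.53, 1738.06],
    'Fit Time': [0, 0, 40, 0],
})

# Hypothetical training-time budget, in seconds
MAX_FIT_SECONDS = 10

within_budget = results[results['Fit Time'] <= MAX_FIT_SECONDS]
best = within_budget.sort_values('RMSE').iloc[0]

print('Best overall:', results.sort_values('RMSE').iloc[0]['Model'])
print('Best within budget:', best['Model'])
```

With this budget the choice moves from Random Forest to LightGBM; a looser budget would keep Random Forest on top.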


In [94]:
# Let's analyze the feature importances of the Random Forest model

importances_forest_model = model_forest.feature_importances_
feature_list = list(features_train.columns)
feature_results = pd.DataFrame({'feature': feature_list,'importance': importances_forest_model}).sort_values('importance',ascending = False).reset_index(drop=True)
feature_results.head(11)


## Summary

The Random Forest model achieved the best RMSE on the test sample (about 1641), but it is the slowest to train (40 seconds to fit). LightGBM came close (RMSE of about 1738) while training in under a second.