## Car sales

We have an app to attract new customers. In that app, a customer can quickly find out the market value of their car. We need to build a model that determines this value.

We are interested in:

* the quality of the prediction
* the speed of the prediction
* the time required for training

In [1]:
import pandas as pd
import time
from sklearn.metrics import make_scorer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import lightgbm as lgb
from catboost import CatBoostRegressor
from sklearn.model_selection import RandomizedSearchCV
import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv('/datasets/car_data.csv')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
DateCrawled          354369 non-null object
Price                354369 non-null int64
VehicleType          316879 non-null object
RegistrationYear     354369 non-null int64
Gearbox              334536 non-null object
Power                354369 non-null int64
Model                334664 non-null object
Mileage              354369 non-null int64
RegistrationMonth    354369 non-null int64
FuelType             321474 non-null object
Brand                354369 non-null object
NotRepaired          283215 non-null object
DateCreated          354369 non-null object
NumberOfPictures     354369 non-null int64
PostalCode           354369 non-null int64
LastSeen             354369 non-null object
dtypes: int64(7), object(9)
memory usage: 43.3+ MB


| | DateCrawled | Price | VehicleType | RegistrationYear | Gearbox | Power | Model | Mileage | RegistrationMonth | FuelType | Brand | NotRepaired | DateCreated | NumberOfPictures | PostalCode | LastSeen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24/03/2016 11:52 | 480 | NaN | 1993 | manual | 0 | golf | 150000 | 0 | petrol | volkswagen | NaN | 24/03/2016 00:00 | 0 | 70435 | 07/04/2016 03:16 |
| 1 | 24/03/2016 10:58 | 18300 | coupe | 2011 | manual | 190 | NaN | 125000 | 5 | gasoline | audi | yes | 24/03/2016 00:00 | 0 | 66954 | 07/04/2016 01:46 |
| 2 | 14/03/2016 12:52 | 9800 | suv | 2004 | auto | 163 | grand | 125000 | 8 | gasoline | jeep | NaN | 14/03/2016 00:00 | 0 | 90480 | 05/04/2016 12:47 |
| 3 | 17/03/2016 16:54 | 1500 | small | 2001 | manual | 75 | golf | 150000 | 6 | petrol | volkswagen | no | 17/03/2016 00:00 | 0 | 91074 | 17/03/2016 17:40 |
| 4 | 31/03/2016 17:25 | 3600 | small | 2008 | manual | 69 | fabia | 90000 | 7 | gasoline | skoda | no | 31/03/2016 00:00 | 0 | 60437 | 06/04/2016 10:17 |


The dataset has 354369 entries and 16 columns: 7 are of integer type and 9 are of object type.

In [3]:
df[df.duplicated(keep=False)].shape

(524, 16)

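There are 524 rows that belong to groups of fully duplicated records (with keep=False every copy is counted, including the first one). We'll drop the duplicates, keeping the first occurrence of each, using the drop_duplicates() method.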

In [4]:
df = df.drop_duplicates()
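To make the difference between the two duplicate counts concrete, here is a minimal sketch on a made-up frame (the `toy` data is an assumption for illustration, not part of the car dataset). It shows that `duplicated(keep=False)` flags every copy, while `duplicated()` flags only the repeats that `drop_duplicates()` would remove.

```python
import pandas as pd

# Hypothetical toy data: the "audi" row appears three times.
toy = pd.DataFrame({
    "Brand": ["audi", "audi", "bmw", "audi"],
    "Price": [1500, 1500, 3600, 1500],
})

all_copies = toy.duplicated(keep=False).sum()  # every row that has a twin
extra_copies = toy.duplicated().sum()          # only the repeats after the first
deduplicated = toy.drop_duplicates()           # keeps the first occurrence

print(all_copies, extra_copies, len(deduplicated))  # prints: 3 2 2
```

So the number of rows actually removed is generally smaller than the count reported with `keep=False`.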

In [5]:
df.isna().mean()

DateCrawled          0.000000
Price                0.000000
VehicleType          0.105855
RegistrationYear     0.000000
Gearbox              0.056000
Power                0.000000
Model                0.055636
Mileage              0.000000
RegistrationMonth    0.000000
FuelType             0.092879
Brand                0.000000
NotRepaired          0.200914
DateCreated          0.000000
NumberOfPictures     0.000000
PostalCode           0.000000
LastSeen             0.000000
dtype: float64

We have 5 columns with missing values: VehicleType, Gearbox, Model, FuelType, NotRepaired. The information in these columns came from customers, so we can't reliably fill in the missing values without asking them.

Let's drop all rows with missing values.

In [6]:
df = df.dropna()
df.shape

(245567, 16)

After dropping the missing values we are left with about 70% of the data.
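As a quick check of that figure, the sketch below shows how `isna().mean()` and `dropna()` behave on a small made-up frame (the `toy` data is an assumption), and then computes the retained share from the row counts reported above, relative to the original 354369 rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing Gearbox value.
toy = pd.DataFrame({
    "Gearbox": ["manual", np.nan, "auto", "manual"],
    "Power": [75, 163, 69, 102],
})

missing_share = toy.isna().mean()        # fraction of missing values per column
retained = len(toy.dropna()) / len(toy)  # fraction of rows that survive dropna()

# The same ratio for the row counts printed in the notebook.
retained_cars = 245567 / 354369

print(missing_share["Gearbox"], retained, round(retained_cars, 3))  # prints: 0.25 0.75 0.693
```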

To find the market value of a car we don't need information about the listing or the client's account; we only need the car's characteristics. So let's drop these columns: DateCrawled, DateCreated, NumberOfPictures, PostalCode, LastSeen.

In [7]:
df = df.drop(['DateCrawled','DateCreated','NumberOfPictures','PostalCode','LastSeen'], axis=1)
df.head()

| | Price | VehicleType | RegistrationYear | Gearbox | Power | Model | Mileage | RegistrationMonth | FuelType | Brand | NotRepaired |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 1500 | small | 2001 | manual | 75 | golf | 150000 | 6 | petrol | volkswagen | no |
| 4 | 3600 | small | 2008 | manual | 69 | fabia | 90000 | 7 | gasoline | skoda | no |
| 5 | 650 | sedan | 1995 | manual | 102 | 3er | 150000 | 10 | petrol | bmw | yes |
| 6 | 2200 | convertible | 2004 | manual | 109 | 2_reihe | 150000 | 8 | petrol | peugeot | no |
| 7 | 0 | sedan | 1980 | manual | 50 | other | 40000 | 7 | petrol | volkswagen | no |


Our target variable, Price, is numerical, so this is a regression task.

First, we transform the categorical features into numerical ones using One-Hot Encoding (OHE). We'll call pd.get_dummies() on the whole dataset and set drop_first=True to avoid the dummy variable trap.

In [8]:
df_ohe = pd.get_dummies(df, drop_first=True)
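The effect of `drop_first=True` is easiest to see on a tiny example. The sketch below uses a made-up two-column frame (an assumption for illustration): the numeric column passes through unchanged, and of the two Gearbox categories only one indicator column is kept, because the other is implied by it.

```python
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column.
toy = pd.DataFrame({
    "Power": [75, 163, 69],
    "Gearbox": ["manual", "auto", "manual"],
})

# "auto" is the first category alphabetically, so its indicator is dropped.
encoded = pd.get_dummies(toy, drop_first=True)

print(list(encoded.columns))  # prints: ['Power', 'Gearbox_manual']
```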

In [9]:
target = df_ohe['Price']
features = df_ohe.drop('Price', axis=1)

# split the data into a training set and validation set at a ratio of 75:25
features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size=0.25, random_state=12345)
    
# scale the numeric data using StandardScaler() from sklearn.preprocessing module.
numeric = ['RegistrationYear','Power','Mileage','RegistrationMonth']

scaler = StandardScaler()
scaler.fit(features_train[numeric])
features_train[numeric] = scaler.transform(features_train[numeric])
features_valid[numeric] = scaler.transform(features_valid[numeric])
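The scaler above is fitted on the training set only and then applied to both sets, so no information from the validation data leaks into the preprocessing. A minimal sketch of the same pattern on made-up numbers (the arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data, e.g. engine power.
train = np.array([[50.0], [100.0], [150.0]])
valid = np.array([[100.0], [200.0]])

scaler = StandardScaler()
scaler.fit(train)                       # mean and std come from the training set only
train_scaled = scaler.transform(train)
valid_scaled = scaler.transform(valid)  # reuses the training mean and std

print(train_scaled.ravel())
print(valid_scaled.ravel())
```

The training column ends up with zero mean, while the validation values are expressed in units of the training standard deviation and need not be centered.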

We'll use RMSE (root mean squared error) to evaluate the models.

In [10]:
def rmse(y_true, y_pred):
    return mean_squared_error(y_true, y_pred)**0.5
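A short worked example may help with reading RMSE values. The prices below are made up (an assumption for illustration): two predictions are exact and two are off by 600 and 800, which gives a mean squared error of 250000 and an RMSE of 500, in the same units as the price.

```python
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    # Square root of the mean squared error, as defined above.
    return mean_squared_error(y_true, y_pred) ** 0.5

# Hypothetical prices: errors are 0, 0, -600 and 800.
y_true = [1500, 3600, 650, 2200]
y_pred = [1500, 3600, 1250, 1400]

example_rmse = rmse(y_true, y_pred)
print(example_rmse)  # prints: 500.0
```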

Let's compare four models: Linear Regression, Random Forest, LightGBM, CatBoost.  

* #### Linear Regression

In [11]:
regressor = LinearRegression()  
regressor.fit(features_train, target_train)

predicted_valid = regressor.predict(features_valid)

rmse_linear_reg = rmse(target_valid, predicted_valid)
print('RMSE:', rmse_linear_reg)


RMSE: 2713.527208953366


* #### Random Forest

Let's tune the hyperparameters for Random Forest using RandomizedSearchCV.

In [12]:
model = RandomForestRegressor()

parameters = {'max_depth'     : [6,10],
              'n_estimators'  : [40, 50] }
    
randm = RandomizedSearchCV(estimator=model, param_distributions = parameters,
                           scoring='neg_mean_squared_error', cv = 3, n_iter = 10, 
                           n_jobs=-1, verbose = 2, random_state=12345)

randm_results = randm.fit(features_train, target_train)


print("\n The best score:\n ", randm_results.best_score_)

print("\n The best parameters:\n ", randm_results.best_params_)

# Predict on the validation set with the best estimator found by RandomizedSearchCV
y_pred = randm.predict(features_valid)
 
score_random_forest = rmse(target_valid, y_pred)

print("\n Score on test data:\n ", score_random_forest)


Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV] n_estimators=40, max_depth=6 ....................................


[Parallel(n_jobs=-1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ..................... n_estimators=40, max_depth=6, total=  41.4s
[CV] n_estimators=40, max_depth=6 ....................................


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:   41.4s remaining:    0.0s


[CV] ..................... n_estimators=40, max_depth=6, total=  42.9s
[CV] n_estimators=40, max_depth=6 ....................................
[CV] ..................... n_estimators=40, max_depth=6, total=  42.2s
[CV] n_estimators=50, max_depth=6 ....................................
[CV] ..................... n_estimators=50, max_depth=6, total=  53.3s
[CV] n_estimators=50, max_depth=6 ....................................
[CV] ..................... n_estimators=50, max_depth=6, total=  53.1s
[CV] n_estimators=50, max_depth=6 ....................................
[CV] ..................... n_estimators=50, max_depth=6, total=  52.7s
[CV] n_estimators=40, max_depth=10 ...................................
[CV] .................... n_estimators=40, max_depth=10, total= 1.0min
[CV] n_estimators=40, max_depth=10 ...................................
[CV] .................... n_estimators=40, max_depth=10, total= 1.0min
[CV] n_estimators=40, max_depth=10 ...................................
[CV] .

[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed: 11.8min finished



 The best score:
  -3788300.102208344

 The best parameters:
  {'n_estimators': 40, 'max_depth': 10}

 Score on test data:
  1946.4253217634284
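The best score above is reported as a negative mean squared error, so it is not directly comparable with the RMSE on the validation set. A small sketch of the conversion, using only the two numbers printed above:

```python
# Values copied from the search output above.
best_score = -3788300.102208344       # neg_mean_squared_error from cross-validation
validation_rmse = 1946.4253217634284  # RMSE on the validation set

# Flip the sign and take the square root to get a cross-validated RMSE.
cv_rmse = (-best_score) ** 0.5

print(round(cv_rmse, 2), round(validation_rmse, 2))
```

The cross-validated RMSE comes out at about 1946, very close to the validation score, which suggests the two estimates agree.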


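For LightGBM and CatBoost we'll skip one-hot encoding and instead convert the categorical columns to the 'category' data type. Besides the 'object' columns, RegistrationYear and RegistrationMonth are treated as categorical as well.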

In [13]:
target = df['Price']
features = df.drop('Price', axis=1)

X_train, X_valid, y_train, y_valid = train_test_split(features, target, test_size=0.25, random_state=12345)

In [14]:
cat_features = ['VehicleType','RegistrationYear','Gearbox','Model', 'RegistrationMonth','FuelType','Brand','NotRepaired']

for header in cat_features:
    X_train[header] = X_train[header].astype('category')
    X_valid[header] = X_valid[header].astype('category')
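The conversion loop above can be illustrated on a tiny made-up frame (the `toy` data is an assumption). It shows that both a text column and an integer column can be cast to the `category` dtype, while columns left out of the list keep their numeric type.

```python
import pandas as pd

# Hypothetical toy data mirroring a few of the car columns.
toy = pd.DataFrame({
    "Brand": ["audi", "skoda", "audi"],
    "RegistrationMonth": [5, 7, 5],
    "Power": [190, 69, 163],
})

# Cast the chosen columns; Power stays numeric.
for column in ["Brand", "RegistrationMonth"]:
    toy[column] = toy[column].astype("category")

print(toy.dtypes)
```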

* #### CatBoost

To tune the hyperparameters of CatBoost and LightGBM we'll use RandomizedSearchCV again.
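Since every hyperparameter is given as a short list, the search space is finite, and it is worth counting the combinations before starting the search. The sketch below enumerates the grid with the standard library; the dictionary repeats the values used for CatBoost and LightGBM in this notebook, and `cv_folds` mirrors `cv = 3`.

```python
from itertools import product

parameters = {
    "depth": [6, 10],
    "learning_rate": [0.03, 0.1],
    "n_estimators": [60, 100],
}
cv_folds = 3

# Every combination of the listed values is one candidate.
candidates = [dict(zip(parameters, values)) for values in product(*parameters.values())]
n_candidates = len(candidates)
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # prints: 8 24
```

With only 8 combinations available, asking for `n_iter = 10` cannot yield more than 8 distinct candidates, so the randomized search here effectively covers the whole grid.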

In [15]:
model = CatBoostRegressor()

parameters = {'depth'         : [6,10],
              'learning_rate' : [0.03, 0.1],
              'n_estimators'  : [60,100] }
    
randm = RandomizedSearchCV(estimator=model, param_distributions = parameters, scoring='neg_mean_squared_error',  
                               cv = 3, n_iter = 10, n_jobs=-1, verbose = 2, random_state=12345)

randm_results=randm.fit(X_train, y_train,cat_features=cat_features)


print("\n The best score:\n ", randm_results.best_score_)

print("\n The best parameters:\n ", randm_results.best_params_)

# Predict on the validation set with the best estimator found by RandomizedSearchCV
y_pred = randm.predict(X_valid)
 
score_catboost = rmse(y_valid, y_pred)

print("\n Score on test data:\n ",score_catboost)


Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] n_estimators=60, learning_rate=0.03, depth=6 ....................


[Parallel(n_jobs=-1)]: Using backend SequentialBackend with 1 concurrent workers.


0:	learn: 4624.2096442	total: 192ms	remaining: 11.3s
1:	learn: 4533.8704676	total: 392ms	remaining: 11.4s
2:	learn: 4445.8074521	total: 684ms	remaining: 13s
3:	learn: 4362.0110640	total: 883ms	remaining: 12.4s
4:	learn: 4279.5737430	total: 1.09s	remaining: 12s
5:	learn: 4200.2157723	total: 1.29s	remaining: 11.6s
6:	learn: 4123.4443283	total: 1.49s	remaining: 11.3s
7:	learn: 4049.4382654	total: 1.78s	remaining: 11.6s
8:	learn: 3978.8516632	total: 1.98s	remaining: 11.2s
9:	learn: 3909.9792343	total: 2.18s	remaining: 10.9s
10:	learn: 3845.4083412	total: 2.38s	remaining: 10.6s
11:	learn: 3783.7934624	total: 2.58s	remaining: 10.3s
12:	learn: 3722.4359959	total: 2.78s	remaining: 10.1s
13:	learn: 3664.0844418	total: 2.98s	remaining: 9.8s
14:	learn: 3607.7736287	total: 3.19s	remaining: 9.56s
15:	learn: 3554.4493895	total: 3.38s	remaining: 9.3s
16:	learn: 3501.5494646	total: 3.58s	remaining: 9.07s
17:	learn: 3451.1937425	total: 3.87s	remaining: 9.03s
18:	learn: 3403.0910081	total: 4.07s	remaini

[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:   14.3s remaining:    0.0s


0:	learn: 4629.7432785	total: 148ms	remaining: 8.73s
1:	learn: 4539.6979674	total: 347ms	remaining: 10.1s
2:	learn: 4451.8403817	total: 550ms	remaining: 10.4s
3:	learn: 4367.5244479	total: 748ms	remaining: 10.5s
4:	learn: 4284.5992199	total: 1.04s	remaining: 11.5s
5:	learn: 4206.4573953	total: 1.24s	remaining: 11.2s
6:	learn: 4128.9779651	total: 1.44s	remaining: 10.9s
7:	learn: 4056.8637379	total: 1.64s	remaining: 10.7s
8:	learn: 3984.8135172	total: 1.84s	remaining: 10.4s
9:	learn: 3916.0311446	total: 2.04s	remaining: 10.2s
10:	learn: 3850.5925737	total: 2.24s	remaining: 9.99s
11:	learn: 3788.0361598	total: 2.44s	remaining: 9.78s
12:	learn: 3728.1490807	total: 2.65s	remaining: 9.59s
13:	learn: 3671.1592798	total: 2.84s	remaining: 9.34s
14:	learn: 3613.0704123	total: 3.04s	remaining: 9.13s
15:	learn: 3560.0160209	total: 3.24s	remaining: 8.91s
16:	learn: 3509.8969247	total: 3.53s	remaining: 8.93s
17:	learn: 3460.4597190	total: 3.73s	remaining: 8.7s
18:	learn: 3412.2286813	total: 3.93s	re

28:	learn: 3007.7986529	total: 6.17s	remaining: 15.1s
29:	learn: 2976.5270466	total: 6.37s	remaining: 14.9s
30:	learn: 2946.8563497	total: 6.57s	remaining: 14.6s
31:	learn: 2917.4396537	total: 6.77s	remaining: 14.4s
32:	learn: 2889.6499854	total: 6.97s	remaining: 14.2s
33:	learn: 2864.1533838	total: 7.17s	remaining: 13.9s
34:	learn: 2837.9352614	total: 7.37s	remaining: 13.7s
35:	learn: 2813.3746129	total: 7.66s	remaining: 13.6s
36:	learn: 2789.2432993	total: 7.86s	remaining: 13.4s
37:	learn: 2763.8571468	total: 8.06s	remaining: 13.1s
38:	learn: 2739.0134899	total: 8.26s	remaining: 12.9s
39:	learn: 2715.0191242	total: 8.46s	remaining: 12.7s
40:	learn: 2692.0233961	total: 8.66s	remaining: 12.5s
41:	learn: 2669.5984874	total: 8.86s	remaining: 12.2s
42:	learn: 2648.3345275	total: 9.06s	remaining: 12s
43:	learn: 2627.8997677	total: 9.27s	remaining: 11.8s
44:	learn: 2608.5300290	total: 9.46s	remaining: 11.6s
45:	learn: 2589.5892680	total: 9.66s	remaining: 11.3s
46:	learn: 2571.7499520	total:

80:	learn: 2162.3128285	total: 16.9s	remaining: 3.95s
81:	learn: 2155.4259849	total: 17.1s	remaining: 3.74s
82:	learn: 2148.8906715	total: 17.3s	remaining: 3.54s
83:	learn: 2142.6284192	total: 17.5s	remaining: 3.33s
84:	learn: 2135.7715332	total: 17.7s	remaining: 3.12s
85:	learn: 2129.1801710	total: 17.9s	remaining: 2.91s
86:	learn: 2122.8313636	total: 18.1s	remaining: 2.7s
87:	learn: 2117.3967089	total: 18.3s	remaining: 2.49s
88:	learn: 2112.3892894	total: 18.5s	remaining: 2.28s
89:	learn: 2107.2698409	total: 18.7s	remaining: 2.07s
90:	learn: 2102.1132565	total: 18.9s	remaining: 1.86s
91:	learn: 2097.7641610	total: 19.1s	remaining: 1.67s
92:	learn: 2093.6426162	total: 19.4s	remaining: 1.46s
93:	learn: 2089.6036543	total: 19.6s	remaining: 1.25s
94:	learn: 2085.8462538	total: 19.8s	remaining: 1.04s
95:	learn: 2081.1364102	total: 20s	remaining: 831ms
96:	learn: 2077.4978942	total: 20.2s	remaining: 623ms
97:	learn: 2074.1513367	total: 20.4s	remaining: 415ms
98:	learn: 2070.5916270	total: 

28:	learn: 2070.6636318	total: 5.97s	remaining: 6.38s
29:	learn: 2058.1288530	total: 6.16s	remaining: 6.16s
30:	learn: 2048.0905863	total: 6.36s	remaining: 5.95s
31:	learn: 2037.4088580	total: 6.56s	remaining: 5.74s
32:	learn: 2028.8932852	total: 6.76s	remaining: 5.53s
33:	learn: 2019.0665567	total: 6.96s	remaining: 5.32s
34:	learn: 2012.7416216	total: 7.25s	remaining: 5.18s
35:	learn: 2006.6370809	total: 7.45s	remaining: 4.96s
36:	learn: 2000.9330000	total: 7.65s	remaining: 4.75s
37:	learn: 1994.9073792	total: 7.85s	remaining: 4.54s
38:	learn: 1989.8381498	total: 8.05s	remaining: 4.33s
39:	learn: 1985.0124971	total: 8.24s	remaining: 4.12s
40:	learn: 1980.5203041	total: 8.44s	remaining: 3.91s
41:	learn: 1975.2078698	total: 8.64s	remaining: 3.7s
42:	learn: 1971.3691910	total: 8.84s	remaining: 3.5s
43:	learn: 1967.1727979	total: 9.04s	remaining: 3.29s
44:	learn: 1962.6986207	total: 9.24s	remaining: 3.08s
45:	learn: 1957.1257177	total: 9.44s	remaining: 2.87s
46:	learn: 1953.6274916	total:

56:	learn: 1926.1025551	total: 11.8s	remaining: 621ms
57:	learn: 1923.3770123	total: 12s	remaining: 414ms
58:	learn: 1921.5800677	total: 12.2s	remaining: 207ms
59:	learn: 1920.4462500	total: 12.4s	remaining: 0us
[CV] ...... n_estimators=60, learning_rate=0.1, depth=6, total=  14.1s
[CV] n_estimators=100, learning_rate=0.1, depth=6 ....................
0:	learn: 4407.0444338	total: 145ms	remaining: 14.4s
1:	learn: 4133.7015835	total: 347ms	remaining: 17s
2:	learn: 3889.4875680	total: 550ms	remaining: 17.8s
3:	learn: 3674.4282039	total: 748ms	remaining: 17.9s
4:	learn: 3487.3681767	total: 1.04s	remaining: 19.7s
5:	learn: 3321.0545023	total: 1.24s	remaining: 19.5s
6:	learn: 3175.9218282	total: 1.44s	remaining: 19.1s
7:	learn: 3054.3505288	total: 1.64s	remaining: 18.8s
8:	learn: 2944.7035950	total: 1.84s	remaining: 18.6s
9:	learn: 2848.9445295	total: 2.04s	remaining: 18.3s
10:	learn: 2763.5779651	total: 2.24s	remaining: 18.1s
11:	learn: 2683.1646241	total: 2.44s	remaining: 17.9s
12:	learn:

44:	learn: 1961.3044301	total: 9.45s	remaining: 11.6s
45:	learn: 1957.6913523	total: 9.65s	remaining: 11.3s
46:	learn: 1954.6851759	total: 9.86s	remaining: 11.1s
47:	learn: 1950.8028233	total: 10.1s	remaining: 10.9s
48:	learn: 1947.1394855	total: 10.3s	remaining: 10.7s
49:	learn: 1944.2851174	total: 10.5s	remaining: 10.5s
50:	learn: 1940.5100406	total: 10.7s	remaining: 10.2s
51:	learn: 1938.2685692	total: 10.9s	remaining: 10s
52:	learn: 1934.9859053	total: 11.1s	remaining: 9.8s
53:	learn: 1932.8051447	total: 11.3s	remaining: 9.59s
54:	learn: 1930.2206049	total: 11.5s	remaining: 9.37s
55:	learn: 1928.3367305	total: 11.7s	remaining: 9.16s
56:	learn: 1923.8488597	total: 11.9s	remaining: 8.94s
57:	learn: 1921.0344233	total: 12.1s	remaining: 8.79s
58:	learn: 1918.5271225	total: 12.3s	remaining: 8.57s
59:	learn: 1915.8352609	total: 12.5s	remaining: 8.36s
60:	learn: 1912.4992489	total: 12.7s	remaining: 8.14s
61:	learn: 1909.8232671	total: 12.9s	remaining: 7.93s
62:	learn: 1906.5747772	total: 

96:	learn: 1844.5013782	total: 20s	remaining: 619ms
97:	learn: 1842.8630854	total: 20.2s	remaining: 412ms
98:	learn: 1841.2490658	total: 20.4s	remaining: 206ms
99:	learn: 1839.3191372	total: 20.6s	remaining: 0us
[CV] ..... n_estimators=100, learning_rate=0.1, depth=6, total=  22.3s
[CV] n_estimators=60, learning_rate=0.03, depth=10 ...................
0:	learn: 4617.1861962	total: 237ms	remaining: 14s
1:	learn: 4519.4429312	total: 627ms	remaining: 18.2s
2:	learn: 4424.8483863	total: 1.02s	remaining: 19.4s
3:	learn: 4332.2531072	total: 1.33s	remaining: 18.6s
4:	learn: 4244.9964844	total: 1.63s	remaining: 17.9s
5:	learn: 4160.3843059	total: 1.93s	remaining: 17.4s
6:	learn: 4078.6700827	total: 2.32s	remaining: 17.6s
7:	learn: 4000.0386067	total: 2.62s	remaining: 17.1s
8:	learn: 3924.6778187	total: 2.93s	remaining: 16.6s
9:	learn: 3851.8740286	total: 3.33s	remaining: 16.6s
10:	learn: 3782.5768248	total: 3.61s	remaining: 16.1s
11:	learn: 3714.8333466	total: 4s	remaining: 16s
12:	learn: 3649

22:	learn: 3101.7893967	total: 7.42s	remaining: 11.9s
23:	learn: 3058.5197440	total: 7.72s	remaining: 11.6s
24:	learn: 3017.4783172	total: 8.02s	remaining: 11.2s
25:	learn: 2978.5000534	total: 8.33s	remaining: 10.9s
26:	learn: 2940.4278384	total: 8.71s	remaining: 10.7s
27:	learn: 2904.4122650	total: 9.02s	remaining: 10.3s
28:	learn: 2869.9220354	total: 9.32s	remaining: 9.96s
29:	learn: 2837.0533849	total: 9.62s	remaining: 9.62s
30:	learn: 2805.3691570	total: 9.93s	remaining: 9.29s
31:	learn: 2774.4978389	total: 10.3s	remaining: 9.02s
32:	learn: 2744.4428259	total: 10.6s	remaining: 8.68s
33:	learn: 2711.0090625	total: 10.9s	remaining: 8.35s
34:	learn: 2679.3308501	total: 11.2s	remaining: 8.01s
35:	learn: 2649.2555880	total: 11.5s	remaining: 7.68s
36:	learn: 2620.5004817	total: 11.9s	remaining: 7.4s
37:	learn: 2592.4021684	total: 12.2s	remaining: 7.07s
38:	learn: 2566.0223033	total: 12.5s	remaining: 6.74s
39:	learn: 2541.4765643	total: 12.8s	remaining: 6.4s
40:	learn: 2517.2691779	total:

10:	learn: 3786.3535458	total: 3.53s	remaining: 28.6s
11:	learn: 3717.4356039	total: 3.83s	remaining: 28.1s
12:	learn: 3651.8236234	total: 4.13s	remaining: 27.6s
13:	learn: 3589.1163882	total: 4.43s	remaining: 27.2s
14:	learn: 3528.5393228	total: 4.82s	remaining: 27.3s
15:	learn: 3469.8817290	total: 5.13s	remaining: 26.9s
16:	learn: 3414.4037815	total: 5.43s	remaining: 26.5s
17:	learn: 3361.5030969	total: 5.82s	remaining: 26.5s
18:	learn: 3309.4080493	total: 6.12s	remaining: 26.1s
19:	learn: 3258.8094318	total: 6.42s	remaining: 25.7s
20:	learn: 3210.5370642	total: 6.72s	remaining: 25.3s
21:	learn: 3165.3550133	total: 7.11s	remaining: 25.2s
22:	learn: 3121.6659121	total: 7.41s	remaining: 24.8s
23:	learn: 3080.0476829	total: 7.72s	remaining: 24.4s
24:	learn: 3039.3433034	total: 8.02s	remaining: 24.1s
25:	learn: 3000.5339068	total: 8.41s	remaining: 23.9s
26:	learn: 2961.4887071	total: 8.71s	remaining: 23.6s
27:	learn: 2926.5990321	total: 9.01s	remaining: 23.2s
28:	learn: 2891.3908720	tota

61:	learn: 2159.6975325	total: 20.2s	remaining: 12.4s
62:	learn: 2147.1054393	total: 20.5s	remaining: 12s
63:	learn: 2135.1219509	total: 20.8s	remaining: 11.7s
64:	learn: 2123.5459728	total: 21.2s	remaining: 11.4s
65:	learn: 2111.8983522	total: 21.5s	remaining: 11.1s
66:	learn: 2101.4329574	total: 21.8s	remaining: 10.7s
67:	learn: 2091.5015318	total: 22.2s	remaining: 10.4s
68:	learn: 2081.4622962	total: 22.5s	remaining: 10.1s
69:	learn: 2071.9803442	total: 22.8s	remaining: 9.75s
70:	learn: 2062.4260225	total: 23.1s	remaining: 9.42s
71:	learn: 2053.6009536	total: 23.4s	remaining: 9.09s
72:	learn: 2045.5822170	total: 23.7s	remaining: 8.75s
73:	learn: 2037.3878202	total: 24s	remaining: 8.42s
74:	learn: 2029.8173930	total: 24.4s	remaining: 8.12s
75:	learn: 2022.5659112	total: 24.7s	remaining: 7.78s
76:	learn: 2015.4336383	total: 25s	remaining: 7.45s
77:	learn: 2008.4590685	total: 25.3s	remaining: 7.12s
78:	learn: 2001.7213123	total: 25.6s	remaining: 6.79s
79:	learn: 1994.7220833	total: 25.

49:	learn: 1784.7287991	total: 16.2s	remaining: 3.23s
50:	learn: 1782.0609966	total: 16.5s	remaining: 2.9s
51:	learn: 1779.8128097	total: 16.8s	remaining: 2.58s
52:	learn: 1777.6923109	total: 17.1s	remaining: 2.25s
53:	learn: 1774.7803271	total: 17.5s	remaining: 1.94s
54:	learn: 1769.0815249	total: 17.8s	remaining: 1.61s
55:	learn: 1765.5985659	total: 18.1s	remaining: 1.29s
56:	learn: 1763.1513543	total: 18.4s	remaining: 966ms
57:	learn: 1760.4127512	total: 18.7s	remaining: 646ms
58:	learn: 1758.4720387	total: 19.1s	remaining: 323ms
59:	learn: 1756.8470544	total: 19.4s	remaining: 0us
[CV] ..... n_estimators=60, learning_rate=0.1, depth=10, total=  21.4s
[CV] n_estimators=60, learning_rate=0.1, depth=10 ....................
0:	learn: 4369.7177814	total: 291ms	remaining: 17.2s
1:	learn: 4073.7295014	total: 594ms	remaining: 17.2s
2:	learn: 3815.5321663	total: 894ms	remaining: 17s
3:	learn: 3590.8433743	total: 1.2s	remaining: 16.8s
4:	learn: 3390.3315154	total: 1.59s	remaining: 17.5s
5:	le

77:	learn: 1714.1540243	total: 25.5s	remaining: 7.19s
78:	learn: 1710.8058930	total: 25.8s	remaining: 6.86s
79:	learn: 1709.7825254	total: 26.1s	remaining: 6.52s
80:	learn: 1708.2844141	total: 26.4s	remaining: 6.19s
81:	learn: 1707.0946594	total: 26.8s	remaining: 5.88s
82:	learn: 1704.8288339	total: 27.1s	remaining: 5.55s
83:	learn: 1703.0598021	total: 27.4s	remaining: 5.22s
84:	learn: 1700.3717096	total: 27.7s	remaining: 4.89s
85:	learn: 1698.0617504	total: 28.1s	remaining: 4.57s
86:	learn: 1695.7498616	total: 28.4s	remaining: 4.24s
87:	learn: 1694.1319657	total: 28.7s	remaining: 3.91s
88:	learn: 1693.3326930	total: 29s	remaining: 3.58s
89:	learn: 1691.4472548	total: 29.4s	remaining: 3.26s
90:	learn: 1689.8062787	total: 29.7s	remaining: 2.94s
91:	learn: 1687.7147595	total: 30s	remaining: 2.61s
92:	learn: 1686.4808478	total: 30.3s	remaining: 2.28s
93:	learn: 1685.8033705	total: 30.6s	remaining: 1.95s
94:	learn: 1684.7222289	total: 31s	remaining: 1.63s
95:	learn: 1681.8468786	total: 31.

25:	learn: 1962.6850194	total: 8.38s	remaining: 23.9s
26:	learn: 1948.0103491	total: 8.68s	remaining: 23.5s
27:	learn: 1934.9086362	total: 8.98s	remaining: 23.1s
28:	learn: 1922.8281234	total: 9.29s	remaining: 22.7s
29:	learn: 1907.9535517	total: 9.59s	remaining: 22.4s
30:	learn: 1896.1874179	total: 9.98s	remaining: 22.2s
31:	learn: 1883.4003720	total: 10.3s	remaining: 21.8s
32:	learn: 1871.1205007	total: 10.6s	remaining: 21.5s
33:	learn: 1861.4158736	total: 10.9s	remaining: 21.1s
34:	learn: 1852.9869282	total: 11.2s	remaining: 20.8s
35:	learn: 1845.2521451	total: 11.5s	remaining: 20.4s
36:	learn: 1838.3901602	total: 11.9s	remaining: 20.2s
37:	learn: 1832.4835585	total: 12.2s	remaining: 19.9s
38:	learn: 1827.0166848	total: 12.5s	remaining: 19.5s
39:	learn: 1822.7779715	total: 12.8s	remaining: 19.2s
40:	learn: 1818.2917735	total: 13.2s	remaining: 18.9s
41:	learn: 1813.1530517	total: 13.5s	remaining: 18.6s
42:	learn: 1808.2554096	total: 13.8s	remaining: 18.3s
43:	learn: 1803.7168099	tota

[Parallel(n_jobs=-1)]: Done  24 out of  24 | elapsed:  9.2min finished


0:	learn: 4380.7325613	total: 411ms	remaining: 40.7s
1:	learn: 4087.8244075	total: 810ms	remaining: 39.7s
2:	learn: 3825.8857570	total: 1.22s	remaining: 39.4s
3:	learn: 3598.1767909	total: 1.71s	remaining: 41.1s
4:	learn: 3393.5752622	total: 2.12s	remaining: 40.2s
5:	learn: 3216.2819410	total: 2.61s	remaining: 40.9s
6:	learn: 3063.2489814	total: 3.01s	remaining: 40s
7:	learn: 2932.2079442	total: 3.41s	remaining: 39.2s
8:	learn: 2818.1719280	total: 3.82s	remaining: 38.6s
9:	learn: 2714.5918553	total: 4.31s	remaining: 38.8s
10:	learn: 2608.2146949	total: 4.71s	remaining: 38.1s
11:	learn: 2518.3634317	total: 5.11s	remaining: 37.5s
12:	learn: 2439.9996803	total: 5.61s	remaining: 37.5s
13:	learn: 2372.5194434	total: 6.01s	remaining: 36.9s
14:	learn: 2316.2173467	total: 6.41s	remaining: 36.3s
15:	learn: 2268.5273108	total: 6.81s	remaining: 35.7s
16:	learn: 2219.7849226	total: 7.21s	remaining: 35.2s
17:	learn: 2178.9178396	total: 7.61s	remaining: 34.7s
18:	learn: 2142.5231365	total: 8.1s	rema

* #### LightGBM

In [16]:
lgbm_model = lgb.LGBMRegressor()
                          
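# Note: LightGBM documents its tree-depth limit as `max_depth`, so `depth` may have
# no effect here; the key is kept as run so the code matches the output below.
lgbm_params = {'depth'        : [6,10],
              'learning_rate' : [0.03, 0.1],
              'n_estimators'  : [60,100] }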

random_search = RandomizedSearchCV(lgbm_model, lgbm_params, n_iter=10, n_jobs=-1, cv=3, 
                                   scoring= 'neg_mean_squared_error', verbose = 2, random_state=12345)

random_search.fit(X_train, y_train, categorical_feature= cat_features)

print("\n The best parameters:\n ", random_search.best_params_)

# Predict after fitting RandomizedSearchCV with best parameters
y_pred = random_search.predict(X_valid)
 
score_lgbm = rmse(y_valid, y_pred)

print("\n Score on test data:\n ", score_lgbm)


Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] n_estimators=60, learning_rate=0.03, depth=6 ....................


[Parallel(n_jobs=-1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ..... n_estimators=60, learning_rate=0.03, depth=6, total=   7.4s
[CV] n_estimators=60, learning_rate=0.03, depth=6 ....................


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    7.4s remaining:    0.0s


[CV] ..... n_estimators=60, learning_rate=0.03, depth=6, total=   6.5s
[CV] n_estimators=60, learning_rate=0.03, depth=6 ....................
[CV] ..... n_estimators=60, learning_rate=0.03, depth=6, total=   5.0s
[CV] n_estimators=100, learning_rate=0.03, depth=6 ...................
[CV] .... n_estimators=100, learning_rate=0.03, depth=6, total=   8.2s
[CV] n_estimators=100, learning_rate=0.03, depth=6 ...................
[CV] .... n_estimators=100, learning_rate=0.03, depth=6, total=   7.4s
[CV] n_estimators=100, learning_rate=0.03, depth=6 ...................
[CV] .... n_estimators=100, learning_rate=0.03, depth=6, total=   8.1s
[CV] n_estimators=60, learning_rate=0.1, depth=6 .....................
[CV] ...... n_estimators=60, learning_rate=0.1, depth=6, total=   4.6s
[CV] n_estimators=60, learning_rate=0.1, depth=6 .....................
[CV] ...... n_estimators=60, learning_rate=0.1, depth=6, total=   4.8s
[CV] n_estimators=60, learning_rate=0.1, depth=6 .....................
[CV] .

[Parallel(n_jobs=-1)]: Done  24 out of  24 | elapsed:  2.5min finished



 The best parameters:
  {'n_estimators': 100, 'learning_rate': 0.1, 'depth': 6}

 Score on test data:
  1664.1104279896954


Let's find out how long training and prediction take for each model.

In [17]:
#Linear regression training
start_time = time.time()

regressor = LinearRegression()  
regressor.fit(features_train, target_train)

elapsed_time = time.time() - start_time
time_LR_t = time.strftime("%H:%M:%S", time.gmtime(elapsed_time))
print("Training Time: ", time_LR_t)


Training Time:  00:00:18


In [18]:
#Linear regression prediction
start_time = time.time()

predicted_valid = regressor.predict(features_valid)

elapsed_time = time.time() - start_time
time_LR_p = time.strftime("%H:%M:%S", time.gmtime(elapsed_time))
print("Prediction Time: ", time_LR_p)


Prediction Time:  00:00:00
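The `%H:%M:%S` format has a resolution of one second, so any prediction that finishes faster than that is shown as 00:00:00. A minimal sketch of a finer-grained measurement follows; `timed_call` is a hypothetical helper introduced here for illustration, and `sum` stands in for a model's `predict` call.

```python
import time

def timed_call(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; in the notebook this could be regressor.predict(features_valid).
result, elapsed = timed_call(sum, range(1_000_000))

print(f"Elapsed: {elapsed:.4f} s")
```

`time.perf_counter()` is intended for measuring short durations, and keeping the elapsed time as a float makes it possible to compare models whose predictions all take well under a second.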


In [19]:
# Random Forest training (n_estimators=50 here, whereas the search above selected 40)
start = time.time()

model = RandomForestRegressor(n_estimators = 50, max_depth = 10, random_state=0)
model.fit(features_train, target_train)

elapsed_time = time.time() - start
time_RF_t = time.strftime("%H:%M:%S", time.gmtime(elapsed_time))
print("Training Time: ", time_RF_t)


Training Time:  00:02:06


In [20]:
#Random Forest prediction
start = time.time()

y_pred = model.predict(features_valid)

elapsed_time = time.time() - start
time_RF_p = time.strftime("%H:%M:%S", time.gmtime(elapsed_time))
print("Prediction Time: ", time_RF_p)

Prediction Time:  00:00:00


In [21]:
#CatBoost training
start = time.time()

model = CatBoostRegressor(n_estimators = 100, learning_rate = 0.1, depth = 10, random_state = 0)

model.fit(X_train, y_train,cat_features=cat_features)

elapsed_time = time.time() - start
time_CatB_t = time.strftime("%H:%M:%S", time.gmtime(elapsed_time))
print("Training Time: ", time_CatB_t)

0:	learn: 4380.7325613	total: 465ms	remaining: 46.1s
1:	learn: 4087.8244075	total: 870ms	remaining: 42.6s
2:	learn: 3825.8857570	total: 1.27s	remaining: 41.2s
3:	learn: 3598.1767909	total: 1.77s	remaining: 42.5s
4:	learn: 3393.5752622	total: 2.18s	remaining: 41.4s
5:	learn: 3216.2819410	total: 2.66s	remaining: 41.7s
6:	learn: 3063.2489814	total: 3.07s	remaining: 40.8s
7:	learn: 2932.2079442	total: 3.47s	remaining: 39.9s
8:	learn: 2818.1719280	total: 3.87s	remaining: 39.1s
9:	learn: 2714.5918553	total: 4.36s	remaining: 39.3s
10:	learn: 2608.2146949	total: 4.76s	remaining: 38.6s
11:	learn: 2518.3634317	total: 5.17s	remaining: 37.9s
12:	learn: 2439.9996803	total: 5.66s	remaining: 37.9s
13:	learn: 2372.5194434	total: 6.06s	remaining: 37.2s
14:	learn: 2316.2173467	total: 6.46s	remaining: 36.6s
15:	learn: 2268.5273108	total: 6.87s	remaining: 36s
16:	learn: 2219.7849226	total: 7.35s	remaining: 35.9s
17:	learn: 2178.9178396	total: 7.75s	remaining: 35.3s
18:	learn: 2142.5231365	total: 8.15s	rem

In [22]:
#CatBoost prediction
start = time.time()

y_pred = model.predict(X_valid)

elapsed_time = time.time() - start
time_CatB_p = time.strftime("%H:%M:%S", time.gmtime(elapsed_time))
print("Prediction Time: ", time_CatB_p)

Prediction Time:  00:00:00


In [23]:
#LGBM training
start = time.time()

lgbm_model = lgb.LGBMRegressor(n_estimators = 100, learning_rate = 0.1, depth = 6, random_state = 0)

lgbm_model.fit(X_train, y_train, categorical_feature= cat_features)

elapsed_time = time.time() - start
time_LGBM_t = time.strftime("%H:%M:%S", time.gmtime(elapsed_time))
print("Training Time: ", time_LGBM_t)

Training Time:  00:00:07


In [24]:
#LGBM prediction
start = time.time()

y_pred = lgbm_model.predict(X_valid)
 
elapsed_time = time.time() - start
time_LGBM_p = time.strftime("%H:%M:%S", time.gmtime(elapsed_time))
print("Prediction Time: ", time_LGBM_p)

Prediction Time:  00:00:00


In [25]:
data = {'Model': ['Linear Regression', 'Random Forest', 'CatBoost', 'LightGBM'],
        'RMSE': [rmse_linear_reg, score_random_forest, score_catboost, score_lgbm],
        'Time train': [time_LR_t, time_RF_t, time_CatB_t, time_LGBM_t],
        'Time prediction': [time_LR_p, time_RF_p, time_CatB_p, time_LGBM_p] }

conclusion = pd.DataFrame (data, columns = ['Model', 'RMSE', 'Time train', 'Time prediction'])
conclusion

| | Model | RMSE | Time train | Time prediction |
|---|---|---|---|---|
| 0 | Linear Regression | 2713.527209 | 00:00:18 | 00:00:00 |
| 1 | Random Forest | 1946.425322 | 00:02:06 | 00:00:00 |
| 2 | CatBoost | 1707.620656 | 00:00:44 | 00:00:00 |
| 3 | LightGBM | 1664.110428 | 00:00:07 | 00:00:00 |


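We use Linear Regression as a sanity check. All the other models performed better than Linear Regression.

The table above shows that LightGBM has the lowest RMSE (about 1664) and also the shortest training time (7 seconds). Prediction on the validation set took less than a second for every model, so at this resolution the models can't be ranked by prediction speed.

So, the conclusion is to choose LightGBM to predict the market value of a car.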
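The selection can also be done programmatically. The sketch below rebuilds the comparison from the numbers in the table above, with the training times converted to seconds by hand (the `Train seconds` column name is an assumption introduced here), and picks the model with the lowest RMSE and the one that trains fastest.

```python
import pandas as pd

# RMSE and training times copied from the results table above.
results = pd.DataFrame({
    "Model": ["Linear Regression", "Random Forest", "CatBoost", "LightGBM"],
    "RMSE": [2713.527209, 1946.425322, 1707.620656, 1664.110428],
    "Train seconds": [18, 126, 44, 7],
})

best_by_rmse = results.loc[results["RMSE"].idxmin(), "Model"]
fastest_to_train = results.loc[results["Train seconds"].idxmin(), "Model"]

print(best_by_rmse, fastest_to_train)  # prints: LightGBM LightGBM
```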