# Check Model Time & Speed

Rusty Bargain is a used-car sales company that is developing an application to attract new buyers. In this application, users can quickly find out the market value of their car. You have access to historical data, vehicle technical specifications, vehicle model versions, and vehicle prices. The task is to build a model that predicts the market value of a car.

Rusty Bargain is interested in:

- Prediction quality
- Speed of the model in predicting
- Time required to train the model

# Table of Contents:
1. Load Data & Data Preprocessing
2. Model Training
3. Model Analysis
4. Conclusion

## Load Data & Data Preprocessing

In [2]:
import pandas as pd, numpy as np, seaborn as sns
import lightgbm as lgb
import xgboost as xgb

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from catboost import CatBoostRegressor

In [3]:
df = pd.read_csv('/datasets/car_data.csv')

In [4]:
df.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


In [5]:
df_2 = pd.get_dummies(df[['Price','VehicleType']])
df_2

Unnamed: 0,Price,VehicleType_bus,VehicleType_convertible,VehicleType_coupe,VehicleType_other,VehicleType_sedan,VehicleType_small,VehicleType_suv,VehicleType_wagon
0,480,0,0,0,0,0,0,0,0
1,18300,0,0,1,0,0,0,0,0
2,9800,0,0,0,0,0,0,1,0
3,1500,0,0,0,0,0,1,0,0
4,3600,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...
354364,0,0,0,0,0,0,0,0,0
354365,2200,0,0,0,0,0,0,0,0
354366,1199,0,1,0,0,0,0,0,0
354367,9200,1,0,0,0,0,0,0,0


In [6]:
df_2.corr()['Price']

Price                      1.000000
VehicleType_bus            0.070493
VehicleType_convertible    0.130201
VehicleType_coupe          0.077205
VehicleType_other         -0.018283
VehicleType_sedan          0.039981
VehicleType_small         -0.207735
VehicleType_suv            0.190435
VehicleType_wagon          0.048760
Name: Price, dtype: float64

In [7]:
df['Price'].value_counts() / df.shape[0] *100

0        3.039769
500      1.600027
1500     1.522142
1000     1.311909
1200     1.296389
           ...   
13180    0.000282
10879    0.000282
2683     0.000282
634      0.000282
8188     0.000282
Name: Price, Length: 3731, dtype: float64

In [8]:
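#Correlate only the numeric columns with Price (text columns cannot be correlated)
df.select_dtypes('number').corr()['Price']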

Price                1.000000
RegistrationYear     0.026916
Power                0.158872
Mileage             -0.333199
RegistrationMonth    0.110581
NumberOfPictures          NaN
PostalCode           0.076055
Name: Price, dtype: float64

**Of the numeric columns above, 'Mileage' has the strongest linear relationship with price (r = -0.33): the further a car has been driven, the cheaper it tends to be. This is a moderate correlation, so mileage alone does not determine the price**
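To make the size of that coefficient concrete, here is a minimal sketch of the Pearson formula in plain Python. The helper `pearson` is written for illustration only and is not part of the notebook; the five sample rows are the ones shown by `df.head()`, which is far too few for a real estimate. The last lines square the full-data coefficient reported above to show roughly what share of the price variance a straight line in 'Mileage' would account for.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# The five rows shown by df.head() above (illustration only).
mileage = [150000, 125000, 125000, 150000, 90000]
price = [480, 18300, 9800, 1500, 3600]
r_sample = pearson(mileage, price)

# Coefficient reported for the full dataset in the output above.
r_full = -0.333199
r_squared = r_full ** 2  # share of price variance explained linearly by Mileage
print(f"sample r = {r_sample:.3f}, full-data r^2 = {r_squared:.3f}")
```

An r² of about 0.11 suggests that most of the variation in price has to come from the other features, which is why the models below use all of them.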

### Drop columns

In [9]:
#Drop unused columns
df = df.drop(['DateCrawled','DateCreated','PostalCode','LastSeen','NumberOfPictures','RegistrationYear'], axis=1)
df.head()

Unnamed: 0,Price,VehicleType,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired
0,480,,manual,0,golf,150000,0,petrol,volkswagen,
1,18300,coupe,manual,190,,125000,5,gasoline,audi,yes
2,9800,suv,auto,163,grand,125000,8,gasoline,jeep,
3,1500,small,manual,75,golf,150000,6,petrol,volkswagen,no
4,3600,small,manual,69,fabia,90000,7,gasoline,skoda,no


### Resolving Missing Values

In [10]:
#Checks for missing values
df.isnull().sum()

Price                    0
VehicleType          37490
Gearbox              19833
Power                    0
Model                19705
Mileage                  0
RegistrationMonth        0
FuelType             32895
Brand                    0
NotRepaired          71154
dtype: int64

In [11]:
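#Fill in the missing values with a placeholder category
#(the placeholder is misspelled 'unkwon'; it is kept so the column names in the outputs below still match)
df = df.fillna('unkwon')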

In [12]:
#Checks whether there are still missing values
df.isnull().sum()

Price                0
VehicleType          0
Gearbox              0
Power                0
Model                0
Mileage              0
RegistrationMonth    0
FuelType             0
Brand                0
NotRepaired          0
dtype: int64

In [13]:
#One-hot encode (OHE) the categorical columns
data_ohe = pd.get_dummies(df)
data_ohe.head(10)

Unnamed: 0,Price,Power,Mileage,RegistrationMonth,VehicleType_bus,VehicleType_convertible,VehicleType_coupe,VehicleType_other,VehicleType_sedan,VehicleType_small,...,Brand_sonstige_autos,Brand_subaru,Brand_suzuki,Brand_toyota,Brand_trabant,Brand_volkswagen,Brand_volvo,NotRepaired_no,NotRepaired_unkwon,NotRepaired_yes
0,480,0,150000,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
1,18300,190,125000,5,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,9800,163,125000,8,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,1500,75,150000,6,0,0,0,0,0,1,...,0,0,0,0,0,1,0,1,0,0
4,3600,69,90000,7,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
5,650,102,150000,10,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
6,2200,109,150000,8,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
7,0,50,40000,7,0,0,0,0,1,0,...,0,0,0,0,0,1,0,1,0,0
8,14500,125,30000,8,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
9,999,101,150000,0,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,1,0


In [14]:
data_ohe.columns

Index(['Price', 'Power', 'Mileage', 'RegistrationMonth', 'VehicleType_bus',
       'VehicleType_convertible', 'VehicleType_coupe', 'VehicleType_other',
       'VehicleType_sedan', 'VehicleType_small',
       ...
       'Brand_sonstige_autos', 'Brand_subaru', 'Brand_suzuki', 'Brand_toyota',
       'Brand_trabant', 'Brand_volkswagen', 'Brand_volvo', 'NotRepaired_no',
       'NotRepaired_unkwon', 'NotRepaired_yes'],
      dtype='object', length=318)

In [15]:
#dividing OHE data into training, validation, and test data
data_ohe_train_valid, data_ohe_test = train_test_split(data_ohe, test_size=0.15, random_state=12)
data_ohe_train, data_ohe_valid = train_test_split(data_ohe_train_valid, test_size=0.25, random_state=23)

print(data_ohe_train.shape)
print(data_ohe_valid.shape)
print(data_ohe_test.shape)

(225909, 318)
(75304, 318)
(53156, 318)


**The data is now one-hot encoded (OHE) and split into training (63.75%), validation (21.25%), and test (15%) sets**
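The two-stage split can be checked with a little arithmetic. The sketch below is plain Python and does not call scikit-learn; the helper `split_sizes` is hypothetical, and it assumes the held-out portion is rounded up, which reproduces the shapes printed above.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_remaining, n_held_out) for a fractional hold-out, rounding the hold-out up."""
    n_held_out = math.ceil(n_rows * test_size)
    return n_rows - n_held_out, n_held_out

n_total = 225909 + 75304 + 53156  # rows in the three printed shapes
n_train_valid, n_test = split_sizes(n_total, 0.15)   # first split: 15% test
n_train, n_valid = split_sizes(n_train_valid, 0.25)  # second split: 25% of the remainder

print(n_train, n_valid, n_test)
print([round(n / n_total, 4) for n in (n_train, n_valid, n_test)])
```

Because the second split takes 25% of the remaining 85%, the validation set is 0.85 × 0.25 = 21.25% of all rows, not 25%.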

## Model Training

In [16]:
def rmse(target, prediction):
    return mean_squared_error(target, prediction)**0.5
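As a quick sanity check of what this metric computes, here is a dependency-free sketch. The name `rmse_plain` and the two-value example are made up for illustration; the notebook itself keeps using the `rmse` helper above.

```python
import math

def rmse_plain(target, prediction):
    """Root mean squared error without external dependencies."""
    squared_errors = [(t - p) ** 2 for t, p in zip(target, prediction)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up values: the errors are 2 and 0, so the mean squared error is 2.
example = rmse_plain([3.0, 5.0], [1.0, 5.0])
print(example)
```

RMSE is expressed in the same units as `Price`, and squaring the errors means large misses weigh more heavily than small ones.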


In [17]:
features_train = data_ohe_train.drop(['Price'], axis=1)
target_train = data_ohe_train['Price']

features_valid = data_ohe_valid.drop(['Price'], axis=1)
target_valid = data_ohe_valid['Price']

features_test = data_ohe_test.drop(['Price'], axis=1)
target_test = data_ohe_test['Price']

### Linear Regression

In [18]:
%%time

model = LinearRegression()
model.fit(features_train, target_train)

CPU times: user 12.9 s, sys: 11.2 s, total: 24.1 s
Wall time: 24.1 s


LinearRegression()

In [19]:
%%time

pred_train = model.predict(features_train)
pred_valid = model.predict(features_valid)
pred_test = model.predict(features_test)

print("RMSE TRAIN:", rmse(target_train, pred_train))
print("RMSE VALID:", rmse(target_valid, pred_valid))
print("RMSE TEST:", rmse(target_test, pred_test))

RMSE TRAIN: 3172.3449051965
RMSE VALID: 3160.984559251561
RMSE TEST: 3171.428023294031
CPU times: user 391 ms, sys: 505 ms, total: 895 ms
Wall time: 925 ms


**With Linear Regression, the training time is 18.8 s and the prediction time is 916 ms. Timings vary between runs: the cell outputs shown above measured 24.1 s and 925 ms wall time**
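Since the `%%time` figures change from run to run, it can help to capture them in variables instead of copying them by hand. The following is a minimal sketch using only the standard library; `timed` is a hypothetical helper and the `sum` call is a stand-in workload, not something the notebook does.

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in the notebook this could be model.fit or model.predict.
total, seconds = timed(sum, range(1_000_000))
print(f"result={total}, elapsed={seconds:.4f} s")
```

Values collected this way could feed the summary table directly, which would keep the reported timings in step with the run that produced the RMSE figures.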

### Decision Tree

In [20]:
for depth in [2, 5, 7, 8, 10, None]:
    model = DecisionTreeRegressor(max_depth=depth)
    model.fit(features_train, target_train)

    pred_train = model.predict(features_train)
    pred_valid = model.predict(features_valid)
    print("DEPTH:", depth)
    print("RMSE TRAIN:", rmse(target_train, pred_train))
    print("RMSE VALID:", rmse(target_valid, pred_valid))

DEPTH: 2
RMSE TRAIN: 3613.8852719603833
RMSE VALID: 3624.4783034648563
DEPTH: 5
RMSE TRAIN: 3003.3934763988054
RMSE VALID: 3023.759481102029
DEPTH: 7
RMSE TRAIN: 2800.749056782062
RMSE VALID: 2842.694576707867
DEPTH: 8
RMSE TRAIN: 2718.758359952099
RMSE VALID: 2775.249212744164
DEPTH: 10
RMSE TRAIN: 2548.42661751013
RMSE VALID: 2661.9479774070164
DEPTH: None
RMSE TRAIN: 969.5389473525087
RMSE VALID: 2752.045072992252
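
The depth is chosen by validation RMSE, not training RMSE. A small sketch of that selection step, using the validation values printed above (rounded to two decimals):

```python
# Validation RMSE per max_depth, copied (rounded) from the output above.
valid_rmse_by_depth = {
    2: 3624.48,
    5: 3023.76,
    7: 2842.69,
    8: 2775.25,
    10: 2661.95,
    None: 2752.05,
}

# Pick the depth whose validation RMSE is lowest.
best_depth = min(valid_rmse_by_depth, key=valid_rmse_by_depth.get)
print("best max_depth:", best_depth)
```

The unrestricted tree (`max_depth=None`) has by far the lowest training RMSE (about 970) but a higher validation RMSE than depth 10, which is the usual sign of overfitting.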


In [21]:
%%time

model = DecisionTreeRegressor(max_depth=10)
model.fit(features_train, target_train)

CPU times: user 5.61 s, sys: 92.6 ms, total: 5.71 s
Wall time: 5.72 s


DecisionTreeRegressor(max_depth=10)

In [22]:
%%time

pred_train = model.predict(features_train)
pred_valid = model.predict(features_valid)
pred_test = model.predict(features_test)

print("RMSE TRAIN:", rmse(target_train, pred_train))
print("RMSE VALID:", rmse(target_valid, pred_valid))
print("RMSE TEST:", rmse(target_test, pred_test))

RMSE TRAIN: 2548.42661751013
RMSE VALID: 2660.9077435644917
RMSE TEST: 2613.859064896998
CPU times: user 302 ms, sys: 193 ms, total: 495 ms
Wall time: 499 ms


**With a Decision Tree (max_depth=10), the training time is 5.13 s, which is faster than Linear Regression, and the prediction time is 457 ms. The cell outputs shown above measured 5.72 s and 499 ms wall time**

### Random Forest

In [23]:
for depth in [7, 8, 10, None]:
    model = RandomForestRegressor(max_depth=depth, n_estimators=10)
    model.fit(features_train, target_train)

    pred_train = model.predict(features_train)
    pred_valid = model.predict(features_valid)
    print("DEPTH:", depth)
    print("RMSE TRAIN:", rmse(target_train, pred_train))
    print("RMSE VALID:", rmse(target_valid, pred_valid))

DEPTH: 7
RMSE TRAIN: 2757.921356437492
RMSE VALID: 2793.848698080692
DEPTH: 8
RMSE TRAIN: 2675.6722738371304
RMSE VALID: 2724.2654321254226
DEPTH: 10
RMSE TRAIN: 2486.1520451549795
RMSE VALID: 2592.502329418726
DEPTH: None
RMSE TRAIN: 1271.765462341522
RMSE VALID: 2323.6940285376954


In [24]:
%%time

model = RandomForestRegressor(max_depth=10, n_estimators=10)
model.fit(features_train, target_train)

CPU times: user 34.8 s, sys: 81.3 ms, total: 34.9 s
Wall time: 34.9 s


RandomForestRegressor(max_depth=10, n_estimators=10)

In [25]:
%%time

pred_train = model.predict(features_train)
pred_valid = model.predict(features_valid)
pred_test = model.predict(features_test)

print("RMSE TRAIN:", rmse(target_train, pred_train))
print("RMSE VALID:", rmse(target_valid, pred_valid))
print("RMSE TEST:", rmse(target_test, pred_test))

RMSE TRAIN: 2487.3554805455583
RMSE VALID: 2590.858232714484
RMSE TEST: 2552.336369805737
CPU times: user 668 ms, sys: 254 ms, total: 921 ms
Wall time: 932 ms


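**With Random Forest (max_depth=10, n_estimators=10), the training time is 32.2 s and the prediction time is 887 ms (the cell outputs shown above measured 34.9 s and 932 ms wall time). A larger n_estimators value makes training take longer, because every additional tree has to be fit. Note that in the depth search above, max_depth=None gave the lowest validation RMSE (2324), so the max_depth=10 model timed here is not the most accurate forest tried**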

### LGBM - Gradient Boosting

In [26]:
%%time
model = lgb.LGBMRegressor(num_iterations=20, verbose=0, metric='rmse')
model.fit(features_train, target_train, eval_set=(features_valid, target_valid))



You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[1]	valid_0's rmse: 4253.34
[2]	valid_0's rmse: 4039.98
[3]	valid_0's rmse: 3857.67
[4]	valid_0's rmse: 3702.65
[5]	valid_0's rmse: 3570.23
[6]	valid_0's rmse: 3453.72
[7]	valid_0's rmse: 3354.66
[8]	valid_0's rmse: 3269.66
[9]	valid_0's rmse: 3197.16
[10]	valid_0's rmse: 3134.28
[11]	valid_0's rmse: 3076.88
[12]	valid_0's rmse: 3028.21
[13]	valid_0's rmse: 2986.87
[14]	valid_0's rmse: 2948.02
[15]	valid_0's rmse: 2915.11
[16]	valid_0's rmse: 2886.22
[17]	valid_0's rmse: 2861.15
[18]	valid_0's rmse: 2838.91
[19]	valid_0's rmse: 2817.17
[20]	valid_0's rmse: 2799.16
CPU times: user 22.5 s, sys: 366 ms, total: 22.9 s
Wall time: 23 s


LGBMRegressor(metric='rmse', num_iterations=20, verbose=0)

In [27]:
%%time

pred_train = model.predict(features_train)
pred_valid = model.predict(features_valid)
pred_test = model.predict(features_test)

print("RMSE TRAIN:", rmse(target_train, pred_train))
print("RMSE VALID:", rmse(target_valid, pred_valid))
print("RMSE TEST:", rmse(target_test, pred_test))

RMSE TRAIN: 2775.978421945285
RMSE VALID: 2799.156799320431
RMSE TEST: 2771.185154036098
CPU times: user 1.47 s, sys: 213 ms, total: 1.68 s
Wall time: 1.69 s


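**With LGBM, the training time is 4.42 s and the prediction time is 1600 ms (the cell outputs shown above measured 23 s and 1.69 s wall time, so training time in particular varies noticeably between runs). With only 20 iterations, its validation RMSE (2799) is better than Linear Regression (3161) but worse than the Decision Tree (2661) and Random Forest (2591)**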

### CatBoost

In [28]:
%%time

model = CatBoostRegressor(iterations=20, verbose=0, loss_function='RMSE', random_seed=42)
model.fit(features_train, target_train, eval_set=(features_valid, target_valid))

CPU times: user 3.26 s, sys: 19 ms, total: 3.28 s
Wall time: 3.65 s


<catboost.core.CatBoostRegressor at 0x7fbb4988fe80>

In [29]:
%%time

pred_train = model.predict(features_train)
pred_valid = model.predict(features_valid)
pred_test = model.predict(features_test)

print("RMSE TRAIN:", rmse(target_train, pred_train))
print("RMSE VALID:", rmse(target_valid, pred_valid))
print("RMSE TEST:", rmse(target_test, pred_test))

RMSE TRAIN: 2601.293557905662
RMSE VALID: 2629.2525493792896
RMSE TEST: 2595.162784703966
CPU times: user 122 ms, sys: 4.05 ms, total: 126 ms
Wall time: 130 ms


**With CatBoost, the training time is the shortest so far at 3.27 s and the prediction time is 125 ms (the cell outputs shown above measured 3.65 s and 130 ms wall time). The model is both fast and accurate: its validation RMSE (2629) is better than LGBM's (2799)**

### XGBoost

In [30]:
%%time

model = xgb.XGBRegressor(n_estimators=20, verbosity=0, objective='reg:squarederror', random_state=42)
model.fit(features_train, target_train, eval_set=[(features_valid, target_valid)])

[0]	validation_0-rmse:4879.78711
[1]	validation_0-rmse:3985.73267
[2]	validation_0-rmse:3443.85132
[3]	validation_0-rmse:3126.77466
[4]	validation_0-rmse:2947.08008
[5]	validation_0-rmse:2836.18579
[6]	validation_0-rmse:2767.43335
[7]	validation_0-rmse:2723.34961
[8]	validation_0-rmse:2698.94629
[9]	validation_0-rmse:2677.03760
[10]	validation_0-rmse:2658.38892
[11]	validation_0-rmse:2648.09033
[12]	validation_0-rmse:2633.32983
[13]	validation_0-rmse:2627.27441
[14]	validation_0-rmse:2609.09741
[15]	validation_0-rmse:2603.68823
[16]	validation_0-rmse:2597.09375
[17]	validation_0-rmse:2590.94165
[18]	validation_0-rmse:2582.26489
[19]	validation_0-rmse:2578.06787
CPU times: user 1min 47s, sys: 664 ms, total: 1min 48s
Wall time: 1min 48s


XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
             gamma=0, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.300000012,
             max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=20, n_jobs=4,
             num_parallel_tree=1, predictor='auto', random_state=42,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=0)

In [31]:
%%time

pred_train = model.predict(features_train)
pred_valid = model.predict(features_valid)
pred_test = model.predict(features_test)

print("RMSE TRAIN:", rmse(target_train, pred_train))
print("RMSE VALID:", rmse(target_valid, pred_valid))
print("RMSE TEST:", rmse(target_test, pred_test))

RMSE TRAIN: 2529.2963540347887
RMSE VALID: 2578.069424988964
RMSE TEST: 2544.919591987355
CPU times: user 4.12 s, sys: 133 ms, total: 4.25 s
Wall time: 4.25 s


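**With XGBoost, the training time is the longest of all models at 92 s and the prediction time is 4100 ms (the cell outputs shown above measured 1 min 48 s and 4.25 s wall time). Its validation RMSE (2578) is the lowest of the six final models, slightly better than CatBoost's (2629), at the cost of much longer training.**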

## Model Analysis

In [32]:
index = ['Linear Regression','Decision Tree', 'Random Forest', 'LGBM','CatBoost','XGBoost']
summary = pd.DataFrame(data={
    'Training_time(s)':[18.8, 5.13,32.2,4.42,3.27,92],
    'Prediction_time(ms)':[916,457,887,1600,125,4100],
    'RMSE TRAIN':[3172,2548,2485,2775,2601,2529],
    'RMSE VALID':[3160,2661,2587,2799,2629,2578],
    'RMSE TEST':[3171,2614,2548,2771,2595,2544]
}, index=index)

summary


Unnamed: 0,Training_time(s),Prediction_time(ms),RMSE TRAIN,RMSE VALID,RMSE TEST
Linear Regression,18.8,916,3172,3160,3171
Decision Tree,5.13,457,2548,2661,2614
Random Forest,32.2,887,2485,2587,2548
LGBM,4.42,1600,2775,2799,2771
CatBoost,3.27,125,2601,2629,2595
XGBoost,92.0,4100,2529,2578,2544


Based on the comparison, CatBoost has the shortest training time (3.27 s) and the shortest prediction time (125 ms), while XGBoost has the longest of both (92 s and 4100 ms). XGBoost has the lowest RMSE (2544 on the test set), but CatBoost's test RMSE (2595) is only about 2% higher. So for a model that trains and predicts quickly while keeping the RMSE low, CatBoost offers the best trade-off.
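
One way to make this trade-off explicit is to list the models that no other model beats on all three criteria at once (a Pareto front). The sketch below is an illustrative addition, not part of the original analysis; it uses the training time, prediction time, and validation RMSE from the summary table, and the helper `dominates` is hypothetical.

```python
# (training time in s, prediction time in ms, validation RMSE) from the summary table.
results = {
    "Linear Regression": (18.8, 916, 3160),
    "Decision Tree": (5.13, 457, 2661),
    "Random Forest": (32.2, 887, 2587),
    "LGBM": (4.42, 1600, 2799),
    "CatBoost": (3.27, 125, 2629),
    "XGBoost": (92, 4100, 2578),
}

def dominates(a, b):
    """True if a is no worse than b on every metric and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

# Keep the models that no other model dominates.
pareto = sorted(
    name for name, metrics in results.items()
    if not any(dominates(other, metrics) for other in results.values())
)
print(pareto)
```

With these figures, CatBoost dominates Linear Regression, the Decision Tree, and LGBM outright. Random Forest and XGBoost stay on the front only because of their slightly lower RMSE, which they buy with much longer training and prediction times.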

## Conclusion

1. The data had missing values in five categorical columns (VehicleType, Gearbox, Model, FuelType, NotRepaired), which we filled with a placeholder category (spelled 'unkwon' in the code).
2. We measured the training and prediction time of six models: Linear Regression (our baseline), Decision Tree, Random Forest, LGBM, CatBoost, and XGBoost.
3. CatBoost has the shortest training time (3.27 s) and prediction time (125 ms); XGBoost has the longest (92 s and 4100 ms) but the lowest RMSE. Because CatBoost's RMSE is close to XGBoost's at a fraction of the time, CatBoost is the best overall choice for Rusty Bargain's three criteria.