Rusty Bargain, a used car sales service, is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build a model to determine the value.

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

# 1. Data preparation

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from catboost import CatBoostRegressor
import time

In [2]:
df = pd.read_csv('/datasets/car_data.csv')

`df1`, a copy of the cleaned data taken before one-hot encoding, will be used for all of the models except linear regression and random forest.

In [3]:
df.drop(['PostalCode','NumberOfPictures','DateCrawled','DateCreated','LastSeen'],axis=1,inplace=True)

In [7]:
df = df[(df['RegistrationYear']>1950) & (df['RegistrationYear']<2020)]

In [13]:
df = df[df['Price'] != 0]

In [20]:
df = df[df['Power'] != 0]

In [23]:
df.isna().mean()

Price                0.000000
VehicleType          0.069540
RegistrationYear     0.000000
Gearbox              0.019808
Power                0.000000
Model                0.040590
Mileage              0.000000
RegistrationMonth    0.000000
FuelType             0.064474
Brand                0.000000
NotRepaired          0.153501
dtype: float64

After preprocessing, the column with the most missing values (`NotRepaired`) has about 15% of its values missing, so I will drop the rows that contain missing values.

<div class="alert alert-block alert-success">
<b>Success:</b>  Well done, analysis and preprocessing were done correctly.
</div>

In [24]:
df.dropna(inplace=True)
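Dropping every row with at least one missing value can remove more rows than the largest per-column share suggests, because gaps in different columns do not have to fall in the same rows. A minimal sketch on made-up data (the toy frame below is hypothetical, not the car dataset) of how the two shares could be compared before calling `dropna`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "VehicleType": ["sedan", np.nan, "suv", "wagon"],
    "NotRepaired": ["no", "no", np.nan, "yes"],
    "Price": [1500, 3200, 900, 4100],
})

# Largest share of missing values in any single column.
max_col_share = toy.isna().mean().max()

# Share of rows that dropna() would remove (rows with at least one NaN).
row_share = toy.isna().any(axis=1).mean()

cleaned = toy.dropna()
print(max_col_share, row_share, len(cleaned))
```

In this toy case each column is missing at most a quarter of its values, yet half of the rows are removed.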

In [25]:
df1 = df.copy()

**LR data preprocess**

I will one-hot encode the categorical columns for linear regression (the date columns were already dropped above).

In [27]:
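# Pass the column list as `columns=`; the second positional argument of get_dummies is `prefix`.
df = pd.get_dummies(df, columns=['VehicleType','Gearbox','Model','FuelType','Brand','NotRepaired'], drop_first=True)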
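To illustrate what one-hot encoding with `drop_first=True` produces, here is a small sketch on a hypothetical two-column frame (not the car dataset). It assumes the column list is passed through the `columns=` keyword:

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column.
toy = pd.DataFrame({
    "Gearbox": ["manual", "auto", "manual"],
    "Power": [100, 150, 90],
})

# Categories are sorted ('auto', 'manual'); drop_first removes the first one,
# so a single indicator column 'Gearbox_manual' remains.
encoded = pd.get_dummies(toy, columns=["Gearbox"], drop_first=True)
print(list(encoded.columns))
```

Dropping the first level avoids a redundant column, which matters for linear regression because the full set of indicators is perfectly collinear with the intercept.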

In [28]:
target = df['Price']
features = df.drop('Price',axis=1)

In [29]:
features_train, features_test, target_train, target_test = train_test_split(features,target, test_size=0.25, random_state=12345)

In [30]:
scaler = StandardScaler()

In [32]:
numeric = ['RegistrationYear','Power','Mileage','RegistrationMonth']
scaler.fit(features_train[numeric])
features_train[numeric] = scaler.transform(features_train[numeric])
features_test[numeric] = scaler.transform(features_test[numeric])
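The scaler is fitted on the training part only and then applied to both parts, so the test data does not influence the scaling statistics. A minimal sketch of that idea on hypothetical numbers (the `Power` values below are made up):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy splits.
train = pd.DataFrame({"Power": [50.0, 100.0, 150.0]})
test = pd.DataFrame({"Power": [100.0, 200.0]})

scaler = StandardScaler()
scaler.fit(train[["Power"]])  # mean and std come from the training data only

train_scaled = scaler.transform(train[["Power"]])
test_scaled = scaler.transform(test[["Power"]])  # reuses the training mean and std

print(train_scaled.ravel(), test_scaled.ravel())
```

The scaled training column has mean zero, while the test column generally does not, which is expected.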

**CatBoost data preprocess**

In [33]:
def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols,query_cols,sorter=sidx)]

In [34]:
categorical = ['VehicleType','Gearbox','Model','FuelType','Brand','NotRepaired']

In [35]:
target1 = df1['Price']
features1 = df1.drop('Price',axis=1)
cat_features = column_index(df1,categorical)

In [36]:
# Shift by one: 'Price' is column 0 in df1 and is not present in features1.
cat_features1 = cat_features - 1

In [37]:
cat_features1

array([0, 2, 4, 7, 8, 9])
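The same indices could also be obtained directly from the feature columns, without the helper function and the manual shift by one. A sketch assuming the column order shown in the `isna()` output above with `Price` removed:

```python
import pandas as pd

# Column order of the features after dropping 'Price' (assumed from the isna() output).
feature_columns = pd.Index([
    "VehicleType", "RegistrationYear", "Gearbox", "Power", "Model",
    "Mileage", "RegistrationMonth", "FuelType", "Brand", "NotRepaired",
])
categorical = ["VehicleType", "Gearbox", "Model", "FuelType", "Brand", "NotRepaired"]

# Positional index of each categorical column; -1 would mark a missing name.
cat_idx = feature_columns.get_indexer(categorical)
print(cat_idx)
```

In the notebook this would correspond to calling `get_indexer` on `features1.columns`.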

In [39]:
features_train1, features_valid_test1, target_train1, target_valid_test1 = train_test_split(features1,target1, test_size=0.4, random_state=12345)
features_valid1, features_test1,target_valid1,target_test1 = train_test_split(features_valid_test1,target_valid_test1, test_size=0.5, random_state=12345)
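The two consecutive splits above yield a 60/20/20 division into training, validation, and test sets. A small sketch on a hypothetical array of 100 rows to confirm the proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows with a single feature.
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# First split: 60% train, 40% held out.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=12345)

# Second split: the held-out 40% is halved into validation and test.
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=12345)

sizes = (len(X_train), len(X_valid), len(X_test))
print(sizes)
```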

# 2. Model training

Let's start with linear regression (LR) as a sanity check.

In [40]:
def exec_time(start, end):
    """Return the elapsed time between two time.time() readings, in seconds, rounded to 2 decimals."""
    return float(round(end - start, 2))
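As an alternative to repeating the `start`/`end` pattern in every cell, the timing could be wrapped in a context manager. This is only a sketch: the `timed` helper and the `timings` dictionary are hypothetical names, and `time.perf_counter` is swapped in for `time.time` because it is intended for measuring short durations.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(results, label):
    """Store the wall-clock duration of the with-block in results[label], in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = round(time.perf_counter() - start, 2)

timings = {}
with timed(timings, "sleep"):
    time.sleep(0.05)  # stand-in for model.fit(...) or model.predict(...)

print(timings)
```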

In [41]:
model = LinearRegression()

In [42]:
start = time.time()
model.fit(features_train,target_train)
end = time.time()
t = exec_time(start,end)

In [43]:
t #time

17.65

In [44]:
start = time.time()
pred = model.predict(features_test)
end = time.time()
t = exec_time(start,end)

In [45]:
t #time

0.07

In [46]:
mse = mean_squared_error(target_test, pred)  # argument order is (y_true, y_pred)
mse**0.5

2656.3597850613205

The RMSE for linear regression is about 2656.36. Training took 17.65 seconds and prediction took 0.07 seconds.
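RMSE is the square root of the mean squared error, so it is expressed in the same units as the price. A minimal sketch of the computation with NumPy (the `rmse` helper and the two example prices are hypothetical):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# One prediction is off by 2000, the other is exact.
example = rmse([3000, 5000], [1000, 5000])
print(round(example, 2))
```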

In [47]:
model = CatBoostRegressor()

In [48]:
start = time.time()
model.fit(features_train1, target_train1,
             eval_set=(features_valid1,target_valid1),
             cat_features=cat_features1,
             use_best_model=True,
             verbose=True)
end = time.time()
t = exec_time(start,end)

0:	learn: 4639.9831244	test: 4614.2631560	best: 4614.2631560 (0)	total: 521ms	remaining: 8m 40s
1:	learn: 4544.7063417	test: 4519.6347164	best: 4519.6347164 (1)	total: 1.02s	remaining: 8m 28s
2:	learn: 4453.5794143	test: 4429.4015482	best: 4429.4015482 (2)	total: 1.61s	remaining: 8m 56s
3:	learn: 4365.2966494	test: 4341.8015844	best: 4341.8015844 (3)	total: 2.01s	remaining: 8m 20s
4:	learn: 4279.8279987	test: 4257.1561894	best: 4257.1561894 (4)	total: 2.5s	remaining: 8m 18s
5:	learn: 4197.2587357	test: 4175.4051775	best: 4175.4051775 (5)	total: 3s	remaining: 8m 17s
6:	learn: 4118.6492257	test: 4097.6252551	best: 4097.6252551 (6)	total: 3.4s	remaining: 8m 2s
7:	learn: 4040.9987133	test: 4020.5386883	best: 4020.5386883 (7)	total: 3.9s	remaining: 8m 3s
8:	learn: 3966.8879847	test: 3947.0965067	best: 3947.0965067 (8)	total: 4.39s	remaining: 8m 3s
9:	learn: 3895.6818632	test: 3876.5523285	best: 3876.5523285 (9)	total: 4.79s	remaining: 7m 54s
10:	learn: 3827.6210161	test: 3808.8229345	best: 

In [49]:
t #time

427.28

In [50]:
start = time.time()
pred = model.predict(features_test1)
end = time.time()
t = exec_time(start,end)

In [53]:
t #time

0.45

In [52]:
mse = mean_squared_error(target_test1, pred)  # argument order is (y_true, y_pred)
mse**0.5

1657.857438536004

The RMSE for CatBoost is about 1657.86. Training took 427.28 seconds and prediction took 0.45 seconds.

# 3. Model analysis

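In conclusion, CatBoost took much longer to train than linear regression (427.28 seconds versus 17.65 seconds) and was slower to predict (0.45 seconds versus 0.07 seconds), but it lowered the RMSE from about 2656 to about 1658, a reduction of roughly 38%. Note that the two models were scored on different test splits (25% of the data for linear regression, 20% for CatBoost), so the comparison is approximate. It is also very convenient that CatBoost can work with categorical values directly, and for the most part its default hyperparameters are really good. You can also give it a validation set during fitting.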
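The trade-off between quality and speed can be quantified from the measurements reported in this notebook. A small sketch using only those numbers:

```python
# Values measured above for linear regression (lr) and CatBoost (cb).
rmse_lr, rmse_cb = 2656.3597850613205, 1657.857438536004
fit_lr, fit_cb = 17.65, 427.28          # training time, seconds
predict_lr, predict_cb = 0.07, 0.45     # prediction time, seconds

rmse_reduction = 1 - rmse_cb / rmse_lr  # relative drop in RMSE
fit_ratio = fit_cb / fit_lr             # how many times longer training took
predict_ratio = predict_cb / predict_lr # how many times longer prediction took

print(f"RMSE reduction: {rmse_reduction:.1%}")
print(f"Training time ratio: {fit_ratio:.1f}x")
print(f"Prediction time ratio: {predict_ratio:.1f}x")
```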