## Project Description: Car Price Model


Rusty Bargain, a used car sales service, is developing an app to attract new customers. The app lets users quickly find out the market value of their car. We have access to historical data: technical specifications, trim versions, and prices. The task is to build a model that predicts this value.

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

## Data description

### Features

- `DateCrawled:` date profile was downloaded from the database
- `VehicleType:` vehicle body type
- `RegistrationYear:` vehicle registration year
- `Gearbox:` gearbox type
- `Power:` power (hp)
- `Model:` vehicle model
- `Mileage:` mileage (measured in km due to dataset's regional specifics)
- `RegistrationMonth:` vehicle registration month
- `FuelType:` fuel type
- `Brand:` vehicle brand
- `NotRepaired:` vehicle repaired or not
- `DateCreated:` date of profile creation
- `NumberOfPictures:` number of vehicle pictures
- `PostalCode:` postal code of profile owner (user)
- `LastSeen:` date of the last activity of the user

### Target

- `Price:` price (Euro)


## Table of Contents

1. Data Preparation
2. Data Preprocessing
3. Model Training
   - Linear Regression
   - Decision Tree
   - Random Forest
   - CatBoost
   - LightGBM
4. Model Analysis
5. Conclusion


## Data preparation

Let us import our necessary libraries 

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import mean_squared_error
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from sklearn.metrics import make_scorer
random_state = 12345

Loading the dataset

In [2]:
data = pd.read_csv('/datasets/car_data.csv')

data.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


Getting some info about the dataset

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

The dataset has 354369 rows. Several columns have missing values that have to be dealt with. Next, we look at summary statistics for the numerical columns.

In [4]:
data.describe()

Unnamed: 0,Price,RegistrationYear,Power,Mileage,RegistrationMonth,NumberOfPictures,PostalCode
count,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0
mean,4416.656776,2004.234448,110.094337,128211.172535,5.714645,0.0,50508.689087
std,4514.158514,90.227958,189.850405,37905.34153,3.726421,0.0,25783.096248
min,0.0,1000.0,0.0,5000.0,0.0,0.0,1067.0
25%,1050.0,1999.0,69.0,125000.0,3.0,0.0,30165.0
50%,2700.0,2003.0,105.0,150000.0,6.0,0.0,49413.0
75%,6400.0,2008.0,143.0,150000.0,9.0,0.0,71083.0
max,20000.0,9999.0,20000.0,150000.0,12.0,0.0,99998.0


Taking a look at the dataset, we see that:

- There are price values of zero. Since we are estimating the market value of a car, the price cannot be zero. Therefore, we need to filter out the rows with a price of zero.
- Registration years range from as far back as 1000 to as far ahead as 9999. This analysis is written in 2023, so these outlier values have to be filtered out.
- The Power column also contains cars with zero power. These rows will be filtered out as well.
- Months run from 1 to 12, yet the RegistrationMonth column contains values of 0. These most likely mark an unknown month. Note that the cleaning code below does not remove these rows.
- Lastly, the NumberOfPictures column is zero throughout, so it carries no information and will be dropped.
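
The observations above can be checked directly by counting the suspicious values. The following is a minimal sketch on a small made-up frame: the `toy` data is invented for illustration, and the 1990 to 2023 window is the one chosen later in this notebook. On the real dataset the same expressions would be run on `data`.

```python
import pandas as pd

# Toy stand-in for `data`; the values are invented for illustration only.
toy = pd.DataFrame({
    'Price': [480, 0, 9800, 1500],
    'RegistrationYear': [1993, 1000, 2004, 9999],
    'Power': [0, 190, 163, 75],
    'RegistrationMonth': [0, 5, 8, 6],
})

# Count zero entries in each column where zero is not a valid value.
zero_counts = (toy[['Price', 'Power', 'RegistrationMonth']] == 0).sum()

# Count registration years outside the plausible window (inclusive bounds).
bad_years = (~toy['RegistrationYear'].between(1990, 2023)).sum()

print(zero_counts.to_dict())
print('implausible years:', bad_years)
```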

## Data Preprocessing

Counting the number of duplicate rows

In [5]:
data.duplicated().sum()

262

Next, we drop the 0 values from the columns

In [6]:
data = data.loc[data['Price'] != 0] #drops rows with a Price of 0

data = data.loc[data['RegistrationYear'] != 0] #has no effect here: the minimum RegistrationYear is 1000

data = data.loc[data['Power'] != 0] #drops rows with a Power of 0


To filter out the outliers in the "RegistrationYear" column, we keep only cars with more recent registration dates.

Counting the cars registered in 1990 or earlier

In [7]:
old_registn = data.loc[data['RegistrationYear'] <= 1990].shape[0]
old_registn

8842

In [8]:
tefy = old_registn/data.shape[0]
print(tefy)

0.028779085852289926


Cars registered in 1990 or earlier account for about 2.9% of the remaining rows, so dropping the oldest cars costs little data.

To handle the implausible future dates, we also count the registration years of 2023 or later.

In [9]:
new_registn = data.loc[data['RegistrationYear'] >= 2023].shape[0]
new_registn


24

There are 24 records with a registration year of 2023 or later. Since Rusty Bargain is developing its app in 2023, we keep only registration years from 1990 to 2023 inclusive.

In [10]:
data = data.loc[(data['RegistrationYear'] >= 1990 ) & (data['RegistrationYear'] <= 2023)]
#registration Years between and including 1990 and 2023

Dropping the NumberOfPictures column

In [11]:
data = data.drop('NumberOfPictures', axis = 1)

Filling the missing values in the 'NotRepaired' column with the placeholder 'unknown'

In [12]:
data['NotRepaired'] = data['NotRepaired'].fillna('unknown')

Dropping all other missing values and duplicate rows

In [13]:
data.dropna(inplace = True)

data.drop_duplicates(inplace = True)


In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 258067 entries, 2 to 354368
Data columns (total 15 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        258067 non-null  object
 1   Price              258067 non-null  int64 
 2   VehicleType        258067 non-null  object
 3   RegistrationYear   258067 non-null  int64 
 4   Gearbox            258067 non-null  object
 5   Power              258067 non-null  int64 
 6   Model              258067 non-null  object
 7   Mileage            258067 non-null  int64 
 8   RegistrationMonth  258067 non-null  int64 
 9   FuelType           258067 non-null  object
 10  Brand              258067 non-null  object
 11  NotRepaired        258067 non-null  object
 12  DateCreated        258067 non-null  object
 13  PostalCode         258067 non-null  int64 
 14  LastSeen           258067 non-null  object
dtypes: int64(6), object(9)
memory usage: 31.5+ MB


The cleaning process is done. Next, we prepare the data splits. 

## Model training

We will train Linear Regression (as a baseline), Decision Tree Regressor, Random Forest Regressor, CatBoost Regressor, and LightGBM Regressor models. For the first three models we need to encode the categorical features before training. The last two models don't need prior encoding since they handle categorical features natively. Therefore, we will make two copies of the cleaned data: one that isn't encoded (used with CatBoost and LightGBM), and one that is encoded for the other models.

Creating and preparing a copy of the cleaned data to use it with CatBoost and LightGBM 

In [15]:
data_gbm = data.copy() #makes a copy of our cleaned data
#drops the date-time columns
data_gbm.drop(['DateCrawled', 'DateCreated', 'LastSeen'], axis=1, inplace=True)
data_gbm.reset_index(drop=True, inplace=True) #resets the index
features_gbm = data_gbm.drop('Price', axis=1) #defines our features
target_gbm = data_gbm['Price'] #defines the Price column as our target
#splits the data into training and test sets for our features and target
f_train_gbm, f_test_gbm, t_train_gbm, t_test_gbm = train_test_split(
    features_gbm, target_gbm, test_size=0.2, random_state=random_state)

#prints the shapes of our splits
print(f_train_gbm.shape)
print(t_train_gbm.shape)
print(f_test_gbm.shape)
print(t_test_gbm.shape)

(206453, 11)
(206453,)
(51614, 11)
(51614,)


For the models that need encoding, we will create a list that comprises the columns that we want encoded: 

In [16]:
dat_features = ['VehicleType', 'RegistrationYear', 'Gearbox', 'Model', 'RegistrationMonth',
         'FuelType', 'Brand', 'NotRepaired', 'PostalCode']

In [17]:
#creating and preparing a copy of data to use with models requiring prior encoding
data_enc = data.copy() #creates a copy of the cleaned data and stores it in data_enc
encoder = OrdinalEncoder() #creates an instance of the ordinal encoder
data_enc[dat_features] = encoder.fit_transform(data_enc[dat_features]) #encodes the listed columns and writes the
                                                                       #encoded values back into data_enc

In [18]:
data_enc[dat_features]

Unnamed: 0,VehicleType,RegistrationYear,Gearbox,Model,RegistrationMonth,FuelType,Brand,NotRepaired,PostalCode
2,6.0,14.0,0.0,117.0,8.0,2.0,14.0,1.0,6935.0
3,5.0,11.0,1.0,116.0,6.0,6.0,37.0,0.0,6975.0
4,5.0,18.0,1.0,101.0,7.0,2.0,31.0,0.0,4167.0
5,4.0,5.0,1.0,11.0,10.0,6.0,2.0,2.0,2341.0
6,1.0,14.0,1.0,8.0,8.0,6.0,25.0,0.0,4591.0
...,...,...,...,...,...,...,...,...,...
354360,7.0,15.0,1.0,11.0,5.0,2.0,2.0,0.0,5861.0
354362,4.0,14.0,1.0,140.0,5.0,6.0,30.0,2.0,7719.0
354366,1.0,10.0,0.0,106.0,3.0,6.0,32.0,0.0,1793.0
354367,0.0,6.0,1.0,221.0,3.0,2.0,37.0,0.0,6580.0
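
The cell above fits the encoder on the full dataset before splitting. An alternative, shown here only as a sketch on made-up data, is to fit the encoder on the training rows alone and map categories that never appeared in training to a sentinel value. The `handle_unknown='use_encoded_value'` and `unknown_value` options belong to `OrdinalEncoder` in scikit-learn 0.24 and later; the `train` and `test` frames and the sentinel `-1` are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up training and test rows; 'tesla' does not occur in training.
train = pd.DataFrame({'Brand': ['audi', 'jeep', 'skoda'],
                      'Gearbox': ['manual', 'auto', 'manual']})
test = pd.DataFrame({'Brand': ['audi', 'tesla'],
                     'Gearbox': ['auto', 'manual']})

# Categories unseen during fit are encoded as -1 instead of raising an error.
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
train_enc = enc.fit_transform(train)  # learn the categories from training rows only
test_enc = enc.transform(test)        # reuse the learned mapping on the test rows

print(test_enc)
```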


Preparing and splitting the encoded data

In [19]:
#preparation and splitting for encoded data

data_enc.drop(['DateCrawled', 'DateCreated', 'LastSeen'], axis = 1, inplace = True)

data_enc.reset_index(drop = True, inplace = True)

feat_enc = data_enc.drop('Price', axis = 1)
targ_enc = data_enc['Price']

feat_train_enc, feat_test_enc, targ_train_enc, targ_test_enc = train_test_split(feat_enc, targ_enc, test_size=0.2,
                                                                 random_state=random_state)

print(feat_train_enc.shape)
print(feat_test_enc.shape)
print(targ_train_enc.shape)
print(targ_test_enc.shape)


(206453, 11)
(51614, 11)
(206453,)
(51614,)


Our dataset has been cleaned, and split into the training, and test sets for both the encoded and unencoded data
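
Because both calls to `train_test_split` use the same `random_state`, the same `test_size`, and the same number of rows, they select the same rows. The encoded and unencoded models are therefore evaluated on the same cars. A small sketch of that property on made-up arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Two "datasets" of equal length; the second is just a rescaled copy of the first.
rows = np.arange(10)
a_train, a_test = train_test_split(rows, test_size=0.2, random_state=12345)
b_train, b_test = train_test_split(rows * 100, test_size=0.2, random_state=12345)

# With an identical seed and length, both splits pick the same positions.
same_rows = np.array_equal(a_test * 100, b_test)
print(same_rows)
```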

Next, we will create a function to calculate the RMSE and then make it our evaluation metric for our models. It takes as arguments the prediction and target data.

In [20]:
#create an rmse function and make it our scorer
def rmse(pred, target): #takes prediction and target values as arguments
    pred = np.array(pred) #turns the prediction into a NumPy array
    target = np.array(target) #turns the target into a NumPy array
    error = pred - target
    error_sq = error ** 2
    error_sq_mean = error_sq.mean() #mean of the squared errors
    score = error_sq_mean ** 0.5 #square root of the mean squared error (RMSE)
    return score #returns the value


scorer = make_scorer(rmse, greater_is_better = False) 
#makes our rmse function our scorer and specifies that a smaller value is better
#make_scorer calls rmse(y_true, y_pred); RMSE is symmetric, so the argument order does not matter
#with greater_is_better = False the scorer returns the negated RMSE
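
As a sanity check, the hand-written metric should agree with the square root of scikit-learn's `mean_squared_error`. A minimal sketch with made-up numbers follows; the function is repeated in compact form so the snippet runs on its own.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(pred, target):
    """Root mean squared error between predictions and targets."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((pred - target) ** 2)))

# Made-up prices for illustration.
pred = [1000, 2500, 4000]
target = [1200, 2500, 3400]

ours = rmse(pred, target)
reference = mean_squared_error(target, pred) ** 0.5  # sklearn expects (y_true, y_pred)
print(ours, reference)
```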





## Linear Regression (Baseline)

The next step is to train a Linear Regression model as our baseline. We will feed the Linear Regression model with the encoded data. Let's get a cross-validation score:

In [21]:
LR = LinearRegression()
LR_score = cross_val_score(LR, feat_train_enc, targ_train_enc, scoring = scorer, cv=5)
#calculates the cross-validation scores for 5 folds of the training data
print(LR_score.mean()) #prints the mean of those scores

-3024.555600865481
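
The score is negative because a scorer built with `greater_is_better = False` returns the negated metric, so that scikit-learn can always treat a higher score as better. The cross-validated RMSE itself is the absolute value, about 3024.56. A sketch of the sign convention on synthetic data (the data and coefficients are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

def rmse(pred, target):
    """Root mean squared error; symmetric in its two arguments."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Synthetic regression problem: linear signal plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

scorer = make_scorer(rmse, greater_is_better=False)
scores = cross_val_score(LinearRegression(), X, y, scoring=scorer, cv=5)

mean_rmse = -scores.mean()  # flip the sign back to read the result as an RMSE
print(scores)
print(mean_rmse)
```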


Next, we train the model and take note of the wall time it takes to do so

In [22]:
%time LR.fit(feat_train_enc, targ_train_enc) #trains the model with the encoded data and times the process

CPU times: user 71.7 ms, sys: 24.9 ms, total: 96.6 ms
Wall time: 90.9 ms


LinearRegression()

Next, we get the prediction

In [23]:
%time lr_pred = LR.predict(feat_test_enc) #gets predictions and times the process

CPU times: user 4.7 ms, sys: 8.13 ms, total: 12.8 ms
Wall time: 5.96 ms
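
`%time` is an IPython magic, so it only works inside a notebook or an IPython session. In a plain script, a comparable wall-time measurement could be taken with `time.perf_counter`. The helper name `timed` below is hypothetical and not part of this notebook.

```python
import time

def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed wall time in milliseconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms

# Example: time a simple computation in place of model.fit or model.predict.
total, ms = timed(sum, range(1_000_000))
print(f'{ms:.2f} ms')
```

With a fitted model, the same helper could wrap a call such as `timed(LR.predict, feat_test_enc)`.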


In [24]:
#Calculating the RMSE

rmse_lin = rmse(lr_pred, targ_test_enc)
rmse_lin

3025.5635552424487

So the RMSE of the linear model is 3025.56. This is the baseline: every other model should achieve a lower RMSE.

## Decision Tree Regressor

For this model we will perform hyperparameter tuning (with the max_depth hyperparameter). We want to find the best value to set it to when training the model. We will choose the hyperparameter which gets the best cross-validation score.

In [25]:
#Decision Tree Regressor - hyperparameter tuning

for depth in range(1, 20):
    DTR = DecisionTreeRegressor(max_depth = depth, random_state=random_state)
    
    DTR_score = cross_val_score(DTR, feat_train_enc, targ_train_enc, scoring=scorer, cv=5)
    #gets cross-validation score
    
    print('Max_depth', depth, 'score:', DTR_score.mean())
    #prints max_depth value and the mean cross-validation score

Max_depth 1 score: -3521.978669523077
Max_depth 2 score: -3069.033708414861
Max_depth 3 score: -2682.365668080236
Max_depth 4 score: -2439.3184454603615
Max_depth 5 score: -2271.762077459454
Max_depth 6 score: -2169.0729030836947
Max_depth 7 score: -2076.1607854002673
Max_depth 8 score: -2000.3913187802452
Max_depth 9 score: -1925.2246002770867
Max_depth 10 score: -1880.0420807615567
Max_depth 11 score: -1838.190057993053
Max_depth 12 score: -1812.7662955205255
Max_depth 13 score: -1806.4749135845188
Max_depth 14 score: -1817.3029310842855
Max_depth 15 score: -1834.2969364040723
Max_depth 16 score: -1856.2507639921885
Max_depth 17 score: -1877.8910328720085
Max_depth 18 score: -1899.289940839661
Max_depth 19 score: -1921.3172146778893


The best score was obtained with max_depth = 13. We will use this value to train the model and get predictions, timing both steps.
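
Reading the best depth off the printout can also be done programmatically. The sketch below copies a few of the mean scores printed above into a dictionary (rounded to two decimals) and picks the depth with the highest, that is least negative, score.

```python
# Mean cross-validation scores copied from the printout above (rounded).
cv_scores = {11: -1838.19, 12: -1812.77, 13: -1806.47, 14: -1817.30, 15: -1834.30}

# Scores are negated RMSEs, so the best depth is the one with the maximum score.
best_depth = max(cv_scores, key=cv_scores.get)
best_rmse = -cv_scores[best_depth]
print(best_depth, best_rmse)
```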

In [26]:
#DTR Training

DTR = DecisionTreeRegressor(max_depth = 13, random_state = random_state)

%time DTR.fit(feat_train_enc, targ_train_enc)

CPU times: user 1.13 s, sys: 3.82 ms, total: 1.13 s
Wall time: 1.13 s


DecisionTreeRegressor(max_depth=13, random_state=12345)

In [27]:
#DTR model prediction
%time DTR_pred = DTR.predict(feat_test_enc)

CPU times: user 14.5 ms, sys: 64 µs, total: 14.5 ms
Wall time: 12.9 ms


In [28]:
#DTR RMSE
DTR_rmse = rmse(targ_test_enc, DTR_pred)
DTR_rmse

1741.9646827979418

The RMSE for the Decision Tree model is 1741.96, far below the baseline of 3025.56.

## Random Forest Regressor

For this model we use the RandomForestRegressor class. Its main hyperparameters are max_depth, the maximum depth of each tree, and n_estimators, the number of trees. Here we fix n_estimators at 40 and tune only max_depth.

In [29]:
#Random Forest Regressor - hyperparameter tuning
for depth in range(5, 20):
    RFR = RandomForestRegressor(n_estimators = 40, max_depth = depth, random_state = random_state)
    RFR_score = cross_val_score(RFR, feat_train_enc, targ_train_enc, scoring = scorer, cv = 5)
    print('Max_depth', depth, 'score:', RFR_score.mean())

Max_depth 5 score: -2242.7838215752417
Max_depth 6 score: -2123.2738875063774
Max_depth 7 score: -2017.4549151904678
Max_depth 8 score: -1927.3379542957125
Max_depth 9 score: -1843.111338657626
Max_depth 10 score: -1770.4342461342865
Max_depth 11 score: -1705.9164234292668
Max_depth 12 score: -1652.6048068778769
Max_depth 13 score: -1609.427899022185
Max_depth 14 score: -1574.9250191060187
Max_depth 15 score: -1549.3650093061883
Max_depth 16 score: -1531.8121481561598
Max_depth 17 score: -1518.0519262817368
Max_depth 18 score: -1508.885344389258
Max_depth 19 score: -1504.218039629033
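
The scores are still improving at the upper end of the search range, though by less at each step. The sketch below copies the last few mean scores from the printout (as positive RMSEs, rounded) and computes the gain from each extra level of depth.

```python
import numpy as np

# Mean cross-validation RMSEs copied from the printout above (rounded).
depths = np.array([15, 16, 17, 18, 19])
mean_rmse = np.array([1549.37, 1531.81, 1518.05, 1508.89, 1504.22])

# Improvement in RMSE obtained by moving from each depth to the next one.
gains = -np.diff(mean_rmse)
for depth, gain in zip(depths[1:], gains):
    print(f'depth {depth}: RMSE improves by {gain:.2f}')
```

The shrinking gains suggest that depths beyond 19 might still help a little, at the cost of longer training; that would need to be checked by extending the range.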


The best score was obtained with max_depth = 19, the largest value tried. Next, we train and test the model with this setting, timing both steps.

In [30]:
#Random Forest Regressor model training
model_RFR = RandomForestRegressor(n_estimators = 40, max_depth = 19, random_state = random_state)
%time model_RFR.fit(feat_train_enc, targ_train_enc)

CPU times: user 40.4 s, sys: 128 ms, total: 40.6 s
Wall time: 40.7 s


RandomForestRegressor(max_depth=19, n_estimators=40, random_state=12345)

In [31]:
#Random Forest Regressor model predictions
%time RFR_pred = model_RFR.predict(feat_test_enc)

CPU times: user 885 ms, sys: 19 µs, total: 885 ms
Wall time: 886 ms


Calculating the RMSE

In [32]:
RFR_model_RMSE = rmse(targ_test_enc, RFR_pred)
RFR_model_RMSE

1457.4614759314102

The RMSE is 1457.46, much better than our baseline.

## CatBoost Regressor

CatBoost is a gradient boosting library that handles categorical features natively, so we don't need prior encoding. This time we tune several hyperparameters: depth, learning_rate, l2_leaf_reg, and iterations, with loss_function and random_seed held fixed. We will find the best parameters using GridSearchCV.

In [34]:
CBR = CatBoostRegressor()
parameters={'depth': [6, 8, 10],
           'learning_rate': [0.5, 0.1],
           'l2_leaf_reg': [2, 4],
           'iterations': [10, 50],
           'loss_function': ['RMSE'],
           'random_seed': [random_state]} #our dictionary of hyperparameters that will be looped through when 
                                          #we feed them to GridSearch
grid = GridSearchCV(estimator = CBR, param_grid = parameters, scoring = scorer, cv = 3, n_jobs=-1, verbose=0)

grid.fit(f_train_gbm, t_train_gbm,  cat_features = dat_features) #fits our training unencoded data into our grid instance

best_par = grid.best_params_ #gets the best set of parameters for our model

0:	learn: 3157.5838254	total: 180ms	remaining: 1.62s
1:	learn: 2464.3591891	total: 355ms	remaining: 1.42s
2:	learn: 2156.0403011	total: 530ms	remaining: 1.24s
3:	learn: 2006.3875119	total: 687ms	remaining: 1.03s
4:	learn: 1926.1073904	total: 849ms	remaining: 849ms
5:	learn: 1891.7555166	total: 1s	remaining: 669ms
6:	learn: 1862.9877756	total: 1.17s	remaining: 500ms
7:	learn: 1834.7744324	total: 1.32s	remaining: 331ms
8:	learn: 1819.1045177	total: 1.49s	remaining: 165ms
9:	learn: 1801.2861528	total: 1.64s	remaining: 0us
0:	learn: 3160.8208597	total: 189ms	remaining: 1.7s
1:	learn: 2486.2182110	total: 367ms	remaining: 1.47s
2:	learn: 2205.0192763	total: 545ms	remaining: 1.27s
3:	learn: 2020.3149201	total: 704ms	remaining: 1.05s
4:	learn: 1943.4368540	total: 862ms	remaining: 862ms
5:	learn: 1898.8838408	total: 1.01s	remaining: 677ms
6:	learn: 1868.5899521	total: 1.18s	remaining: 507ms
7:	learn: 1846.1523077	total: 1.34s	remaining: 334ms
8:	learn: 1817.6770620	total: 1.5s	remaining: 166ms


We can print out the best hyperparameter settings for our model

In [35]:
print('Best score across all searched parameters', grid.best_score_)
print('Best parameters:', best_par)

Best score across all searched parameters -1525.683316147072
Best parameters: {'depth': 10, 'iterations': 50, 'l2_leaf_reg': 2, 'learning_rate': 0.5, 'loss_function': 'RMSE', 'random_seed': 12345}
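
The cost of a grid search grows with the product of the list lengths in the grid. The sketch below counts the candidates for the CatBoost grid above using scikit-learn's `ParameterGrid`; the literal `12345` stands in for `random_state`.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the CatBoost cell above.
parameters = {'depth': [6, 8, 10],
              'learning_rate': [0.5, 0.1],
              'l2_leaf_reg': [2, 4],
              'iterations': [10, 50],
              'loss_function': ['RMSE'],
              'random_seed': [12345]}

n_candidates = len(ParameterGrid(parameters))  # 3 * 2 * 2 * 2 * 1 * 1
n_fits = n_candidates * 3                      # cv = 3 folds per candidate
print(n_candidates, n_fits)
```

By default `GridSearchCV` also refits the best candidate once on the full training data after the search.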


Next, we train our CatBoost Regressor model using the best hyperparameter settings. Also, we note the time of the process.

In [36]:
#CatBoost Model training 

CBR_model = CatBoostRegressor(depth = 10,
                              iterations = 50,
                              l2_leaf_reg = 2,
                              learning_rate = 0.5,
                              loss_function = 'RMSE', 
                              random_seed = random_state)

%time CBR_model.fit(f_train_gbm, t_train_gbm, cat_features = dat_features, verbose = False, plot = False)

CPU times: user 20.6 s, sys: 48.1 ms, total: 20.6 s
Wall time: 20.8 s


<catboost.core.CatBoostRegressor at 0x7fa90bb2b580>

In [37]:
#CatBoost Model Predictions

%time CBR_model_pred = CBR_model.predict(f_test_gbm)

CPU times: user 162 ms, sys: 11 µs, total: 162 ms
Wall time: 161 ms


RMSE Calculation:

In [38]:
CBR_rmse = rmse(t_test_gbm, CBR_model_pred)

CBR_rmse

1489.7733356259291

We got an RMSE of 1489.77. This is far below the baseline RMSE of 3025.56, though slightly above the Random Forest's 1457.46.

## LightGBM Regressor

The LightGBM model does not require prior encoding either. We again tune hyperparameters with GridSearchCV, as we did for CatBoost, but over a different set of hyperparameters.
One thing to note about LightGBM is that categorical features must have the pandas 'category' dtype before they are fed to the model.
Columns of 'object' dtype are not accepted, so we convert them in both the training set and the test set.
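
A sketch of the conversion on a made-up frame. It builds the categorical dtype from the training column and reuses it for the test column, so that both share the same categories and codes. Sharing the dtype is an extra precaution assumed here, not something this notebook does; the column list `cat_cols` is also illustrative.

```python
import pandas as pd

# Made-up training and test rows.
train = pd.DataFrame({'FuelType': ['petrol', 'gasoline', 'petrol'],
                      'Power': [75, 190, 69]})
test = pd.DataFrame({'FuelType': ['gasoline', 'petrol'],
                     'Power': [163, 60]})

cat_cols = ['FuelType']  # columns to treat as categorical
for col in cat_cols:
    # Define the categories once, from the training data, and apply to both sets.
    dtype = pd.CategoricalDtype(categories=sorted(train[col].unique()))
    train[col] = train[col].astype(dtype)
    test[col] = test[col].astype(dtype)

print(train['FuelType'].cat.codes.tolist())
print(test['FuelType'].cat.codes.tolist())
```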

In [39]:
#Converting all training features to "category" type

f_train_gbm = f_train_gbm.copy() #works on an explicit copy to avoid SettingWithCopyWarning
obj_feat = list(f_train_gbm.select_dtypes(include='object').columns)

for feature in obj_feat:
    f_train_gbm[feature] = f_train_gbm[feature].astype('category')

model_GBM = LGBMRegressor()
parameters = {'num_leaves': [10, 20, 30],
              'learning_rate': [0.5, 0.1],
              'n_estimators': [10, 20],
              'random_state': [random_state],
              'objective': ['rmse']}

grid = GridSearchCV(estimator = model_GBM, param_grid = parameters, scoring = scorer, cv=3, n_jobs=-1)
grid.fit(f_train_gbm, t_train_gbm)
best_param = grid.best_params_


Getting our best hyperparameter setting:

In [40]:
print('Best score across all searched parameters', grid.best_score_)
print('Best parameters:', best_param)

Best score across all searched parameters -1579.8335124962603
Best parameters: {'learning_rate': 0.5, 'n_estimators': 20, 'num_leaves': 30, 'objective': 'rmse', 'random_state': 12345}


Next, we will train and test our LightGBM model using the hyperparameter settings above, while timing the process.

In [41]:
#LightGBM model training

LGBM_model = LGBMRegressor(learning_rate = 0.5,
                           n_estimators = 20,
                           num_leaves = 30,
                           objective = 'rmse',
                           random_state = random_state)

%time LGBM_model.fit(f_train_gbm, t_train_gbm)
                        

CPU times: user 1.68 s, sys: 19.3 ms, total: 1.7 s
Wall time: 1.67 s


LGBMRegressor(learning_rate=0.5, n_estimators=20, num_leaves=30,
              objective='rmse', random_state=12345)

In [42]:
#Converting all test features to "category" type

f_test_gbm = f_test_gbm.copy() #works on an explicit copy to avoid SettingWithCopyWarning
obj_feat = list(f_test_gbm.select_dtypes(include='object').columns)

for feature in obj_feat:
    f_test_gbm[feature] = f_test_gbm[feature].astype('category')

#LightGBM model predictions
%time LGBM_pred = LGBM_model.predict(f_test_gbm)

CPU times: user 148 ms, sys: 0 ns, total: 148 ms
Wall time: 89.2 ms


RMSE Calculation:

In [43]:
LGBM_rmse = rmse(LGBM_pred, t_test_gbm)

LGBM_rmse


1547.9494795906612

The RMSE of 1547.95 is much better than the baseline RMSE of 3025.56.

## Model analysis

To compare the models, we prepare a table showing their training and prediction times (in milliseconds) and their RMSEs.

In [45]:
index = ['LR', 'DTR', 'RFR', 'CBM', 'LGBM']

total_summary = pd.DataFrame(data = {'training_time(ms)': [90.9, 1130, 40700, 20800, 1670],
                                     'prediction_time(ms)': [5.96, 12.9, 886, 161, 89.2],
                                     'RMSE': [3025, 1741, 1457, 1489, 1547]},
                            index = index)
                                     
                                    
total_summary

Unnamed: 0,training_time(ms),prediction_time(ms),RMSE
LR,90.9,5.96,3025
DTR,1130.0,12.9,1741
RFR,40700.0,886.0,1457
CBM,20800.0,161.0,1489
LGBM,1670.0,89.2,1547


Takeaways:

- Linear Regression had the best training time (90.9 ms) while the worst goes to Random Forest (40700 ms)

- Linear Regression had the best prediction time (5.96 ms) while the worst goes to Random Forest (886 ms)

- Random Forest had the best RMSE (1457) while the worst goes to Linear Regression (3025)

## Conclusion

The goal of this project was to build a model that estimates a car's market value for Rusty Bargain's new app. We cleaned and prepared the data and used it to train five models. Linear Regression has the best training time (90.9 ms) while Random Forest has the worst (40700 ms). Linear Regression also has the best prediction time (5.96 ms) while Random Forest has the worst (886 ms).

Random Forest has the best RMSE (1457), but its training and prediction times are the highest. The CatBoost model trains in about half the time of the Random Forest model and predicts about five and a half times faster, while its RMSE of 1489 is only slightly higher. Therefore, we recommend the CatBoost model.
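
The speed comparison between Random Forest and CatBoost can be recomputed from the timings in the summary table. A small sketch:

```python
# Wall times in milliseconds, copied from the summary table.
timings_ms = {
    'RFR': {'train': 40700, 'predict': 886},
    'CBM': {'train': 20800, 'predict': 161},
}

# How many times slower Random Forest is than CatBoost at each step.
train_ratio = timings_ms['RFR']['train'] / timings_ms['CBM']['train']
predict_ratio = timings_ms['RFR']['predict'] / timings_ms['CBM']['predict']
print(round(train_ratio, 2), round(predict_ratio, 2))
```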