# Car price predictor model

Rusty Bargain, a secondhand car marketplace, is developing an app. The app needs an algorithm to predict car prices based on the vehicle's specifications.

Rusty Bargain is interested in:
- prediction quality,
- prediction speed, and
- the time required to train the model.

# Loading libraries

In [85]:
# For numerical operations
import numpy as np

# For dataframe manipulation
import pandas as pd

# Scikit-Learn
## Tools
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_squared_error as mse
from sklearn.model_selection import train_test_split

## Models
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Gradient-boosting models
## XGBoost
from xgboost import XGBRegressor
## LightGBM
from lightgbm import LGBMRegressor
## CatBoost
from catboost import CatBoostRegressor

import time

# Setting the global random_state value for this project
random_state = 12345

# Checking if all libraries were loaded properly
print('Loading completed successfully.')

Loading completed successfully.


# Loading dataset

The data has 16 columns:
- `DateCrawled` - profile download date
- `VehicleType`
- `RegistrationYear`
- `Gearbox`
- `Power` - in HP
- `Model`
- `Mileage` - in kilometers
- `RegistrationMonth`
- `FuelType`
- `Brand`
- `NotRepaired` - whether or not the vehicle has been repaired before
- `DateCreated` - profile creation date
- `NumberOfPictures` - number of pictures of the vehicle
- `PostalCode` - owner's postal code
- `LastSeen` - last time the owner was active
- `Price` - vehicle price in euros, our target.

In [40]:
# If run on the platform
try:
    df = pd.read_csv('/datasets/car_data.csv')

# If run locally
except FileNotFoundError:
    df = pd.read_csv('datasets/car_data.csv')

In [41]:
# Getting random samples of the data
df.sample(5, random_state=random_state)

| | DateCrawled | Price | VehicleType | RegistrationYear | Gearbox | Power | Model | Mileage | RegistrationMonth | FuelType | Brand | NotRepaired | DateCreated | NumberOfPictures | PostalCode | LastSeen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18734 | 04/04/2016 13:36 | 16900 | bus | 2010 | auto | 150 | viano | 150000 | 4 | gasoline | mercedes_benz | no | 04/04/2016 00:00 | 0 | 60326 | 05/04/2016 12:18 |
| 141787 | 07/03/2016 17:57 | 15500 | other | 2011 | manual | 143 | 1er | 40000 | 5 | gasoline | bmw | no | 07/03/2016 00:00 | 0 | 35083 | 06/04/2016 20:19 |
| 37523 | 24/03/2016 09:37 | 3600 | sedan | 2004 | manual | 125 | astra | 150000 | 12 | petrol | opel | no | 24/03/2016 00:00 | 0 | 13627 | 24/03/2016 10:38 |
| 194192 | 15/03/2016 09:49 | 8990 | sedan | 2007 | auto | 224 | c_klasse | 150000 | 9 | gasoline | mercedes_benz | no | 15/03/2016 00:00 | 0 | 58135 | 18/03/2016 02:17 |
| 110210 | 29/03/2016 23:43 | 2500 | other | 1994 | manual | 68 | transporter | 150000 | 9 | gasoline | volkswagen | no | 29/03/2016 00:00 | 0 | 24598 | 02/04/2016 12:45 |


In [42]:
# Getting general information of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

In [43]:
# Checking for missing values
df.isna().sum()

DateCrawled              0
Price                    0
VehicleType          37490
RegistrationYear         0
Gearbox              19833
Power                    0
Model                19705
Mileage                  0
RegistrationMonth        0
FuelType             32895
Brand                    0
NotRepaired          71154
DateCreated              0
NumberOfPictures         0
PostalCode               0
LastSeen                 0
dtype: int64

In [44]:
# Checking for duplicates
df.duplicated().sum()

262

In [45]:
# Checking statistical description
df.describe()

| | Price | RegistrationYear | Power | Mileage | RegistrationMonth | NumberOfPictures | PostalCode |
|---|---|---|---|---|---|---|---|
| count | 354369.0 | 354369.0 | 354369.0 | 354369.0 | 354369.0 | 354369.0 | 354369.0 |
| mean | 4416.656776 | 2004.234448 | 110.094337 | 128211.172535 | 5.714645 | 0.0 | 50508.689087 |
| std | 4514.158514 | 90.227958 | 189.850405 | 37905.34153 | 3.726421 | 0.0 | 25783.096248 |
| min | 0.0 | 1000.0 | 0.0 | 5000.0 | 0.0 | 0.0 | 1067.0 |
| 25% | 1050.0 | 1999.0 | 69.0 | 125000.0 | 3.0 | 0.0 | 30165.0 |
| 50% | 2700.0 | 2003.0 | 105.0 | 150000.0 | 6.0 | 0.0 | 49413.0 |
| 75% | 6400.0 | 2008.0 | 143.0 | 150000.0 | 9.0 | 0.0 | 71083.0 |
| max | 20000.0 | 9999.0 | 20000.0 | 150000.0 | 12.0 | 0.0 | 99998.0 |


Findings:
- The set comprises 354,369 rows across 16 columns.
- Some of the columns have missing values of varying amounts.
- Some of the rows are duplicates.
- Some of the columns have impossible values, e.g.:
    - `0` in `Price`, `Power`, and `RegistrationMonth`,
    - Impossible years in `RegistrationYear`.
- Some features won't have any significance in price predictions.
- The set has many categorical features, which need to be encoded before processing by ML algorithms.
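
As a rough illustration of how the impossible values listed above could be counted, here is a minimal sketch on a made-up toy frame. The column names follow the dataset, but the values and the 1900 to 2016 window are assumptions chosen only for the example.

```python
import pandas as pd

# Toy frame with the same column names as the dataset (values are made up)
toy = pd.DataFrame({
    'Price': [0, 1500, 3200, 0],
    'Power': [75, 0, 110, 90],
    'RegistrationMonth': [0, 6, 12, 3],
    'RegistrationYear': [1000, 2005, 9999, 2010],
})

# Count zeros in columns where zero is not a plausible value
zero_counts = (toy[['Price', 'Power', 'RegistrationMonth']] == 0).sum()

# Count registration years outside an assumed plausible window
implausible_years = (~toy['RegistrationYear'].between(1900, 2016)).sum()

print(zero_counts)
print('Implausible years:', implausible_years)
```

Running the same two checks on the full `df` would show how many rows each cleaning rule is likely to affect.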

# Data preprocessing

## Dropping duplicates

The set has 262 duplicates. We will drop these rows.

In [46]:
df.drop_duplicates(inplace=True)
df.shape

(354107, 16)

## Treating missing values

In [47]:
df[df.isna().any(axis=1)].shape

(108540, 16)

In [48]:
df.isna().sum()

DateCrawled              0
Price                    0
VehicleType          37484
RegistrationYear         0
Gearbox              19830
Power                    0
Model                19701
Mileage                  0
RegistrationMonth        0
FuelType             32889
Brand                    0
NotRepaired          71145
DateCreated              0
NumberOfPictures         0
PostalCode               0
LastSeen                 0
dtype: int64

All of the missing values are found in categorical features. We can guess a vehicle's `VehicleType` from its `Model`. The other features cannot be inferred as easily, because the same model can come with different specifications. We will therefore fill them (and the `VehicleType` of rows with an unknown `Model`) with the placeholder category `unknown`.
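
The fill strategy can be sketched on a toy frame as follows. Note that this sketch picks the most frequent known type for each model, which is an assumed variant and not necessarily how the notebook builds its own lookup; the data is made up.

```python
import pandas as pd

# Made-up rows: some types are missing, and one row has no model at all
toy = pd.DataFrame({
    'Model': ['golf', 'golf', 'golf', 'polo', 'polo', None],
    'VehicleType': ['sedan', 'small', 'small', 'small', None, None],
})

# Most frequent known VehicleType per Model (assumed variant)
known = toy.dropna(subset=['Model', 'VehicleType'])
type_by_model = known.groupby('Model')['VehicleType'].agg(lambda s: s.mode().iloc[0])

# Fill missing types from the lookup, then fall back to the placeholder category
filled = toy['VehicleType'].fillna(toy['Model'].map(type_by_model)).fillna('unknown')
print(filled.tolist())
```

The row with a known model gets that model's type, while the row with no model falls through to `unknown`.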

In [49]:
# Creating a dictionary to map Model values to VehicleType values
## Keeping rows where both values are known, then the first VehicleType seen for each Model
df_unique = df[df['VehicleType'].notna() & df['Model'].notna()][['VehicleType', 'Model']].copy()
df_unique.drop_duplicates(subset='Model', keep='first', inplace=True)
df_unique = df_unique[['Model', 'VehicleType']]

## Converting value pairs into a dict
model_type_dict = dict(df_unique.values)
model_type_dict

{'grand': 'suv',
 'golf': 'small',
 'fabia': 'small',
 '3er': 'sedan',
 '2_reihe': 'convertible',
 'other': 'sedan',
 'c_max': 'bus',
 '3_reihe': 'sedan',
 'passat': 'wagon',
 'navara': 'suv',
 'ka': 'small',
 'twingo': 'small',
 'a_klasse': 'bus',
 'scirocco': 'coupe',
 '5er': 'sedan',
 'arosa': 'small',
 'civic': 'sedan',
 'transporter': 'bus',
 'punto': 'small',
 'e_klasse': 'sedan',
 'kadett': 'other',
 'one': 'sedan',
 'fortwo': 'small',
 'clio': 'small',
 '1er': 'sedan',
 'b_klasse': 'bus',
 'signum': 'wagon',
 'astra': 'wagon',
 'a8': 'sedan',
 'jetta': 'sedan',
 'polo': 'small',
 'fiesta': 'small',
 'c_klasse': 'wagon',
 'micra': 'small',
 'vito': 'other',
 'sprinter': 'bus',
 '156': 'wagon',
 'escort': 'sedan',
 'forester': 'wagon',
 'xc_reihe': 'suv',
 'scenic': 'bus',
 'a4': 'sedan',
 'a1': 'small',
 'combo': 'bus',
 'focus': 'wagon',
 'tt': 'coupe',
 'corsa': 'small',
 'a6': 'wagon',
 'jazz': 'small',
 'omega': 'sedan',
 'slk': 'sedan',
 '7er': 'sedan',
 '80': 'convertible'

In [50]:
# Filling missing VehicleTypes using the dict
df['VehicleType'] = df['VehicleType'].fillna(df['Model'].map(model_type_dict))
df['VehicleType'].isna().sum()

6827

In [51]:
# Filling the rest of NaN values with 'unknown'
df.fillna('unknown', inplace=True)
df.isna().sum()

DateCrawled          0
Price                0
VehicleType          0
RegistrationYear     0
Gearbox              0
Power                0
Model                0
Mileage              0
RegistrationMonth    0
FuelType             0
Brand                0
NotRepaired          0
DateCreated          0
NumberOfPictures     0
PostalCode           0
LastSeen             0
dtype: int64

## Treating impossible values

### Registration date

In [52]:
df.sort_values(by='DateCrawled')

| | DateCrawled | Price | VehicleType | RegistrationYear | Gearbox | Power | Model | Mileage | RegistrationMonth | FuelType | Brand | NotRepaired | DateCreated | NumberOfPictures | PostalCode | LastSeen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 44605 | 01/04/2016 00:06 | 7500 | sedan | 2000 | auto | 306 | s_klasse | 150000 | 3 | petrol | mercedes_benz | no | 31/03/2016 00:00 | 0 | 61169 | 01/04/2016 03:06 |
| 266309 | 01/04/2016 00:06 | 8700 | bus | 2009 | auto | 147 | verso | 150000 | 10 | petrol | toyota | no | 31/03/2016 00:00 | 0 | 57080 | 06/04/2016 17:44 |
| 157575 | 01/04/2016 00:06 | 2900 | wagon | 2000 | auto | 279 | e_klasse | 150000 | 10 | lpg | mercedes_benz | yes | 31/03/2016 00:00 | 0 | 55118 | 06/04/2016 17:44 |
| 119464 | 01/04/2016 00:10 | 4899 | wagon | 2001 | manual | 231 | 3er | 150000 | 3 | petrol | bmw | no | 30/03/2016 00:00 | 0 | 57539 | 07/04/2016 05:46 |
| 122961 | 01/04/2016 00:25 | 13500 | suv | 2012 | auto | 156 | outlander | 125000 | 1 | gasoline | mitsubishi | no | 31/03/2016 00:00 | 0 | 61191 | 02/04/2016 18:10 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 78345 | 31/03/2016 23:58 | 0 | small | 2018 | unknown | 0 | c2 | 150000 | 0 | unknown | citroen | unknown | 31/03/2016 00:00 | 0 | 49610 | 07/04/2016 04:16 |
| 345137 | 31/03/2016 23:58 | 1199 | small | 2001 | manual | 75 | lupo | 150000 | 11 | petrol | volkswagen | yes | 31/03/2016 00:00 | 0 | 91177 | 07/04/2016 04:16 |
| 102507 | 31/03/2016 23:58 | 120 | small | 1999 | manual | 54 | other | 150000 | 0 | petrol | skoda | unknown | 31/03/2016 00:00 | 0 | 84529 | 07/04/2016 04:16 |
| 59100 | 31/03/2016 23:58 | 8500 | coupe | 1988 | auto | 132 | other | 50000 | 4 | petrol | mercedes_benz | no | 31/03/2016 00:00 | 0 | 70437 | 01/04/2016 07:44 |


In [53]:
df[df['RegistrationYear'] < 1900]['RegistrationYear'].value_counts().sort_index()

1000    37
1001     1
1039     1
1111     3
1200     1
1234     4
1253     1
1255     1
1300     2
1400     1
1500     5
1600     2
1602     1
1688     1
1800     5
Name: RegistrationYear, dtype: int64

The last entry in this dataset was obtained in April 2016, so we will drop rows whose `RegistrationYear` goes beyond this date. For the lower limit, we will use the year `1900`.

Impossible values in `RegistrationMonth` will also be dropped.
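
A minimal sketch of such a date filter on made-up rows is shown below. The inclusive bounds (1900 and April 2016) and the names `LAST_YEAR` and `LAST_MONTH` are assumptions for illustration; the notebook's own cell may draw the boundaries differently.

```python
import pandas as pd

# Made-up registration dates covering each case
toy = pd.DataFrame({
    'RegistrationYear': [1000, 1995, 2016, 2016, 2018, 2010],
    'RegistrationMonth': [5, 0, 3, 9, 1, 12],
})

# Assumed cutoff: the latest crawl date mentioned in the text (April 2016)
LAST_YEAR, LAST_MONTH = 2016, 4

valid_year = toy['RegistrationYear'].between(1900, LAST_YEAR)
valid_month = toy['RegistrationMonth'].between(1, 12)
# Registrations later in the cutoff year than the cutoff month lie in the future
after_cutoff = (toy['RegistrationYear'] == LAST_YEAR) & (toy['RegistrationMonth'] > LAST_MONTH)

cleaned = toy[valid_year & valid_month & ~after_cutoff]
print(cleaned)
```

Combining the three boolean masks in a single selection keeps the rule readable and makes the boundary cases (the cutoff year and month) explicit.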

In [54]:
shape_pre = df.shape[0]
print('Rows before dropping:', shape_pre)

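# Dropping rows with impossible years
# Note: the strict inequalities keep only 1901-2015, so rows from 1900 and 2016 are dropped as well
df = df[(df['RegistrationYear'] > 1900) & (df['RegistrationYear'] < 2016)]

# Dropping rows with impossible months
df = df[(df['RegistrationMonth'] > 0) & (df['RegistrationMonth'] < 13)]

# Dropping rows registered after April 2016
# Note: no 2016 rows remain after the year filter above, so this step matches no rows
df = df.drop(df[(df['RegistrationYear'] == 2016) & (df['RegistrationMonth'] > 4)].index)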

shape_post = df.shape[0]
print('Dropped rows:', shape_pre - shape_post)
print('Rows after dropping:', shape_post)

Rows before dropping: 354107
Dropped rows: 55050
Rows after dropping: 299057


### `Power` & `Price`

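Because it's difficult to pinpoint the exact limits of plausible values in these columns, we will identify and remove outliers statistically: rows that fall more than one interquartile range (IQR) below the first quartile or above the third quartile are dropped.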
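
The IQR rule can be wrapped in a small helper, sketched here on made-up power values. The function name `iqr_bounds` and the parameter `k` are hypothetical; `k=1.0` mirrors the one-IQR fences used in this notebook, whereas `k=1.5` is the more common convention.

```python
import pandas as pd

def iqr_bounds(series, k=1.0):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up power values with one obvious outlier
power = pd.Series([60, 75, 90, 105, 120, 140, 20000])
lower, upper = iqr_bounds(power)

# Keep only the values inside the fences (inclusive)
kept = power[power.between(lower, upper)]
print(lower, upper, kept.tolist())
```

A helper like this avoids repeating the quartile arithmetic for every column that needs the same treatment.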

In [55]:
# Removing outliers in `Power`

shape_pre = df.shape[0]
print('Rows before dropping:', shape_pre)

Q1 = df['Power'].quantile(0.25)
Q3 = df['Power'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - IQR
upper_bound = Q3 + IQR

df = df[(df['Power'] >= lower_bound) & (df['Power'] <=  upper_bound)] 

shape_post = df.shape[0]
print('Dropped rows:', shape_pre - shape_post)
print('Rows after dropping:', shape_post)

Rows before dropping: 299057
Dropped rows: 38779
Rows after dropping: 260278


In [56]:
# Removing outliers in `Price`

shape_pre = df.shape[0]
print('Rows before dropping:', shape_pre)

Q1 = df['Price'].quantile(0.25)
Q3 = df['Price'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - IQR
upper_bound = Q3 + IQR

df = df[(df['Price'] >= lower_bound) & (df['Price'] <= upper_bound)] 

shape_post = df.shape[0]
print('Dropped rows:', shape_pre - shape_post)
print('Rows after dropping:', shape_post)

Rows before dropping: 260278
Dropped rows: 23973
Rows after dropping: 236305


## Dropping insignificant features

`DateCrawled`, `DateCreated`, `LastSeen`, `PostalCode`, and `NumberOfPictures` won't contribute to price predictions. We will drop them.

In [57]:
df.drop(['DateCrawled', 'DateCreated', 'LastSeen', 'PostalCode', 'NumberOfPictures'], axis=1, inplace=True)
df.shape

(236305, 11)

In [58]:
df.columns

Index(['Price', 'VehicleType', 'RegistrationYear', 'Gearbox', 'Power', 'Model',
       'Mileage', 'RegistrationMonth', 'FuelType', 'Brand', 'NotRepaired'],
      dtype='object')

## Checking data types

ML models need numerical data, so we check the data type of each remaining column.

In [59]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 236305 entries, 2 to 354368
Data columns (total 11 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   Price              236305 non-null  int64 
 1   VehicleType        236305 non-null  object
 2   RegistrationYear   236305 non-null  int64 
 3   Gearbox            236305 non-null  object
 4   Power              236305 non-null  int64 
 5   Model              236305 non-null  object
 6   Mileage            236305 non-null  int64 
 7   RegistrationMonth  236305 non-null  int64 
 8   FuelType           236305 non-null  object
 9   Brand              236305 non-null  object
 10  NotRepaired        236305 non-null  object
dtypes: int64(5), object(6)
memory usage: 21.6+ MB


## Non-ordinal categorical feature encoding

ML models can only process numerical data. We can represent our categorical data numerically by encoding them. Because most of our categorical features are not ordinal, ordinal encoding is very likely to distort ML predictions. We will use one-hot encoding (OHE) instead.
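
To make the encoding concrete, here is a small sketch with `pd.get_dummies` on a made-up frame. Only the text column is expanded into indicator columns, and `drop_first=True` removes one level per feature so that the remaining columns are not redundant.

```python
import pandas as pd

# Made-up frame with one categorical and one numerical column
toy = pd.DataFrame({
    'Gearbox': ['manual', 'auto', 'unknown', 'manual'],
    'Mileage': [150000, 40000, 125000, 90000],
})

# Text columns are expanded into indicator columns; numeric columns pass through unchanged.
# With drop_first=True the first level ('auto') becomes the implicit reference category.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

A row with zeros in both `Gearbox_manual` and `Gearbox_unknown` therefore stands for an `auto` gearbox.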

In [60]:
# Specifying columns to encode
# Note: Power is numeric (int64), so pd.get_dummies passes it through unchanged
columns_to_encode = ['VehicleType', 'Gearbox', 'Power', 'Model', 'FuelType', 'Brand', 'NotRepaired']
other_columns = ['RegistrationYear', 'Mileage', 'RegistrationMonth', 'Price']

# Transforming non-ordinal categorical columns
# drop_first=True drops one indicator column per feature to prevent redundancy
df_encoded = df[columns_to_encode]
df_encoded = pd.get_dummies(df_encoded, drop_first=True)
print('df_encoded size:', df_encoded.shape)

# Appending non-encoded columns to the df
df_encoded[other_columns] = df[other_columns]
print('Complete df_encoded size:', df_encoded.shape)

df_encoded size: (236305, 304)
Complete df_encoded size: (236305, 308)


## Training and test set splitting

We will divide the full set into two: one for training and another for testing, with a ratio of 70:30.

In [61]:
df_train, df_test = train_test_split(df_encoded, train_size=0.7, random_state=random_state)
print('df_train size:', df_train.shape)
print('df_test size:', df_test.shape)

df_train size: (165413, 308)
df_test size: (70892, 308)


## Feature and target separation

`Price` is our target. We will separate it from the features.

In [62]:
train_target = df_train['Price']
train_features = df_train.drop('Price', axis=1)

test_target = df_test['Price']
test_features = df_test.drop('Price', axis=1)

## Data preprocessing summary

In this stage, we made the following changes:
- removed 262 duplicate rows,
- filled 30,657 missing `VehicleType`s with data from available models,
- filled every other missing value with the `unknown` placeholder category,
- removed rows with a registration year outside 1901-2015 or a registration month of `0`,
- removed rows with outlier values in `Power` and `Price`,
- dropped features that would not contribute to machine learning,
- encoded non-ordinal categorical variables with one-hot encoding.

These actions resulted in a dataset of 236,305 rows (removing 118,064 rows, approximately 33% of the original size) and 308 columns.

Finally, we separated the features from the target.

# Model creation, training, and testing

We're ready to create predictor models for Rusty Bargain. We'll compare several algorithms: linear, decision tree, and random forest regressors, and several implementations of gradient boosting. Each model will be trained on the training set and evaluated on the test set. To measure the models' performance, we will use their CPU time for training and prediction and their root mean squared error (RMSE).
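
The measurement pattern used for every model below (time the fit, time the prediction, compute RMSE on both sets) could be factored into one helper. The following is only a sketch: `rmse`, `timed_evaluation`, and the stand-in `MeanModel` class are hypothetical names, and the data is made up.

```python
import time
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def timed_evaluation(model, X_train, y_train, X_test, y_test):
    """Fit and score any object exposing fit/predict.

    Returns [training time, test time, training RMSE, test RMSE].
    """
    start = time.process_time()
    model.fit(X_train, y_train)
    train_time = time.process_time() - start

    train_score = rmse(y_train, model.predict(X_train))

    start = time.process_time()
    test_pred = model.predict(X_test)
    test_time = time.process_time() - start

    return [train_time, test_time, train_score, rmse(y_test, test_pred)]

class MeanModel:
    """Hypothetical stand-in model that always predicts the training mean."""
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

result = timed_evaluation(MeanModel(), [[1], [2]], [100, 300], [[3]], [200])
print(result)
```

Because the helper only relies on `fit` and `predict`, the same call would work for any of the regressors compared in this section.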

## Creating a dictionary to store model performance information

At the end of this project, we'll have to compare the performance of all models. We will create an empty dictionary to store this information along the way.

In [63]:
performance_dict = {}
performance_columns = ['Training time', 'Test time', 'Training score', 'Test score']

## Linear regression

Due to its low configurability and simple algorithm, this model will serve as a sanity check for the other models.

In [64]:
# Creating an instance of the linear regressor model
lr = LinearRegression()

# Fitting
print('Training')

## Recording start time
start = time.process_time()

lr.fit(train_features, train_target)

## Recording end time and storing it
end = time.process_time()
lr_train_time = end - start

print('Training CPU execution time:', lr_train_time)

pred = lr.predict(train_features)
lr_train_score = (mse(train_target, pred) ** 0.5)
print('Training score:', lr_train_score)

print()

# Testing
print('Testing')

## Getting test time
start = time.process_time()

pred = lr.predict(test_features)

## Recording end time and storing it
end = time.process_time()
lr_test_time = end - start

print('Test prediction CPU execution time:', lr_test_time)

## Getting test score
lr_test_score = (mse(test_target, pred) ** 0.5)
print('Test score:', lr_test_score)

# Storing information in the performance dictionary
performance_dict['lr'] = [lr_train_time, lr_test_time, lr_train_score, lr_test_score]

Training
Training CPU execution time: 34.03125
Training score: 1758.9908905595864

Testing
Test prediction CPU execution time: 0.15625
Test score: 1742.2336664346026


## Decision tree regressor

We can tune the `max_depth` hyperparameter for this model. We will try depths from 1 to 14.

In [65]:
# Creating variables to store best scores and time
dtr_best_train_score = 99999
dtr_best_test_score = 99999
dtr_best_train_time = 99999
dtr_best_test_time = 99999
dtr_best_model = None

# Creating instances of the decision tree model with varying hyperparameters
for depth in range(1, 15):
    dtr = DecisionTreeRegressor(max_depth=depth, random_state=random_state)

    # Fitting   
    ## Recording start time
    start = time.process_time()

    dtr.fit(train_features, train_target)

    ## Recording end time and storing it
    end = time.process_time()
    dtr_train_time = end - start
    
    pred = dtr.predict(train_features)
    dtr_train_score = (mse(train_target, pred) ** 0.5)
    
    # Testing   
    ## Recording start time
    start = time.process_time()

    pred = dtr.predict(test_features)

    ## Recording end time and storing it
    end = time.process_time()
    dtr_test_time = end - start
    
    dtr_test_score = (mse(test_target, pred) ** 0.5)
    
    # Storing best max_depth and scores
    if dtr_test_score < dtr_best_test_score:
        dtr_best_train_time = dtr_train_time
        dtr_best_test_time = dtr_test_time
        dtr_best_train_score = dtr_train_score
        dtr_best_test_score = dtr_test_score
        dtr_best_model = dtr

# Storing information in the performance dictionary
performance_dict['dtr'] = [dtr_best_train_time, dtr_best_test_time, dtr_best_train_score, dtr_best_test_score]
    
# Printing best depth and scores
print('Best training time:', dtr_best_train_time)
print('Best test time:', dtr_best_test_time)
print('Best training score:', dtr_best_train_score)
print('Best test score:', dtr_best_test_score)
print('Best model & hyperparameters:')
dtr_best_model

Best training time: 3.671875
Best test time: 0.125
Best training score: 1159.8168895706096
Best test score: 1356.9307257119617
Best model & hyperparameters:


## Random forest regressor

This model has 2 hyperparameters we can tune:
- `n_estimators`: 1, 26, 51, and 76 (from 1 to 100 with an increment of 25 in each iteration)
- `max_depth`: 1, 6, and 11 (from 1 to 14 with an increment of 5 in each iteration)
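
The two ranges above expand to a small grid of candidate settings. As a sketch, the combinations can be enumerated with `itertools.product` (the variable names here are illustrative, not taken from the notebook):

```python
from itertools import product

# The same ranges as in the nested loops of this notebook
n_estimators_grid = list(range(1, 101, 25))
max_depth_grid = list(range(1, 15, 5))

# Every (n_estimators, max_depth) pair that gets trained and scored
grid = list(product(n_estimators_grid, max_depth_grid))

print(len(grid), 'combinations')
for estimators, depth in grid:
    print(estimators, depth)
```

Listing the grid up front makes it clear how many models are trained per algorithm and which values are never tried.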

In [78]:
# Creating variables to store best scores
rfr_best_train_score = 99999
rfr_best_test_score = 99999
rfr_best_train_time = 99999
rfr_best_test_time = 99999
rfr_best_model = None

# Increasing the number of estimators from 1 to 100 by 25 in each loop
for estimators in range(1, 101, 25):
    # Increasing the depth from 1 to 15 by 5 in each loop 
    for depth in range(1, 15, 5):
        # Creating instances of RandomForestRegressor with varying hyperparameters
        rfr = RandomForestRegressor(n_estimators=estimators, max_depth=depth, random_state=random_state)

        # Fitting   
        ## Recording start time
        start = time.process_time()

        rfr.fit(train_features, train_target)

        ## Recording end time and storing it
        end = time.process_time()
        rfr_train_time = end - start

        pred = rfr.predict(train_features)
        rfr_train_score = (mse(train_target, pred) ** 0.5)

        # Testing   
        ## Recording start time
        start = time.process_time()

        pred = rfr.predict(test_features)

        ## Recording end time and storing it
        end = time.process_time()
        rfr_test_time = end - start

        rfr_test_score = (mse(test_target, pred) ** 0.5)

        # Storing best max_depth and scores
        if rfr_test_score < rfr_best_test_score:
            rfr_best_train_time = rfr_train_time
            rfr_best_test_time = rfr_test_time
            rfr_best_train_score = rfr_train_score
            rfr_best_test_score = rfr_test_score
            rfr_best_model = rfr
    
# Storing information in the performance dictionary
performance_dict['rfr'] = [rfr_best_train_time, rfr_best_test_time, rfr_best_train_score, rfr_best_test_score]
    
# Printing best hyperparameters and scores
print('Best training time:', rfr_best_train_time)
print('Best test time:', rfr_best_test_time)
print('Best training score:', rfr_best_train_score)
print('Best test score:', rfr_best_test_score)
print('Best model & hyperparameters:')
rfr_best_model

Best training time: 152.34375
Best test time: 0.59375
Best training score: 1261.8176282736492
Best test score: 1332.5585663830586
Best model & hyperparameters:


## Gradient-boosting algorithms

We will try and compare 3 implementations of gradient boosting: XGBoost, LightGBM, and CatBoost. As different implementations of the same family of algorithms, these 3 models share similar tunable hyperparameters:
- `n_estimators`
- `max_depth`

Note that these libraries support early stopping, which halts training once the score on a validation set stops improving and so limits overfitting. However, because we set `n_estimators` manually in a grid (instead of letting the library decide when to stop), we do not use this feature.
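
The idea behind such a stopping rule can be sketched in plain Python. This is a schematic illustration only, not the API of XGBoost, LightGBM, or CatBoost; the function name, the `patience` parameter, and the scores are all made up.

```python
def early_stopping_round(validation_scores, patience=3):
    """Return (best round, best score), stopping once `patience` rounds
    have passed without an improvement in the validation score."""
    best_score = float('inf')
    best_round = -1
    for round_idx, score in enumerate(validation_scores):
        if score < best_score:
            # New best validation score: remember it and keep going
            best_score, best_round = score, round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds: stop training
            break
    return best_round, best_score

# Made-up validation RMSE after each boosting round
scores = [2400, 2100, 1900, 1850, 1860, 1870, 1855, 1700]
print(early_stopping_round(scores))
```

In this toy run the loop stops before it reaches the last value, which shows the trade-off: early stopping saves training time but can miss a later improvement if `patience` is too small.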

### XGBoost Regressor

In [67]:
# Creating variables to store best scores
xgbr_best_train_score = 99999
xgbr_best_test_score = 99999
xgbr_best_train_time = 99999
xgbr_best_test_time = 99999
xgbr_best_model = None

# Increasing the number of estimators from 1 to 100 by 25 in each loop
for estimators in range(1, 101, 25):
    # Increasing the depth from 1 to 15 by 5 in each loop 
    for depth in range(1, 15, 5):
        # Creating instances of XGBRegressor with varying hyperparameters
        xgbr = XGBRegressor(n_estimators=estimators, max_depth=depth, random_state=random_state)

        # Fitting   
        ## Recording start time
        start = time.process_time()

        xgbr.fit(train_features, train_target)

        ## Recording end time and storing it
        end = time.process_time()
        xgbr_train_time = end - start

        pred = xgbr.predict(train_features)
        xgbr_train_score = (mse(train_target, pred) ** 0.5)

        # Testing   
        ## Recording start time
        start = time.process_time()

        pred = xgbr.predict(test_features)

        ## Recording end time and storing it
        end = time.process_time()
        xgbr_test_time = end - start

        xgbr_test_score = (mse(test_target, pred) ** 0.5)

        # Storing best max_depth and scores
        if xgbr_test_score < xgbr_best_test_score:
            xgbr_best_train_time = xgbr_train_time
            xgbr_best_test_time = xgbr_test_time
            xgbr_best_train_score = xgbr_train_score
            xgbr_best_test_score = xgbr_test_score
            xgbr_best_model = xgbr
    
# Storing information in the performance dictionary
performance_dict['xgbr'] = [xgbr_best_train_time, xgbr_best_test_time, xgbr_best_train_score, xgbr_best_test_score]
    
# Printing best hyperparameters and scores
print('Best training time:', xgbr_best_train_time)
print('Best test time:', xgbr_best_test_time)
print('Best training score:', xgbr_best_train_score)
print('Best test score:', xgbr_best_test_score)
print('Best model & hyperparameters:')
xgbr_best_model

Best training time: 580.59375
Best test time: 2.078125
Best training score: 934.1587388302968
Best test score: 1160.4868796934459
Best model & hyperparameters:


### LightGBM Regressor

In [68]:
# Creating variables to store best scores
lgbmr_best_train_score = 99999
lgbmr_best_test_score = 99999
lgbmr_best_train_time = 99999
lgbmr_best_test_time = 99999
lgbmr_best_model = None

# Increasing the number of estimators from 1 to 100 by 25 in each loop
for estimators in range(1, 101, 25):
    # Increasing the depth from 1 to 15 by 5 in each loop 
    for depth in range(1, 15, 5):
        # Creating instances of LGBMRegressor with varying hyperparameters
        lgbmr = LGBMRegressor(n_estimators=estimators, max_depth=depth, random_state=random_state)

        # Fitting   
        ## Recording start time
        start = time.process_time()

        lgbmr.fit(train_features, train_target)

        ## Recording end time and storing it
        end = time.process_time()
        lgbmr_train_time = end - start

        pred = lgbmr.predict(train_features)
        lgbmr_train_score = (mse(train_target, pred) ** 0.5)

        # Testing   
        ## Recording start time
        start = time.process_time()

        pred = lgbmr.predict(test_features)

        ## Recording end time and storing it
        end = time.process_time()
        lgbmr_test_time = end - start

        lgbmr_test_score = (mse(test_target, pred) ** 0.5)

        # Storing best max_depth and scores
        if lgbmr_test_score < lgbmr_best_test_score:
            lgbmr_best_train_time = lgbmr_train_time
            lgbmr_best_test_time = lgbmr_test_time
            lgbmr_best_train_score = lgbmr_train_score
            lgbmr_best_test_score = lgbmr_test_score
            lgbmr_best_model = lgbmr
    
# Storing information in the performance dictionary
performance_dict['lgbmr'] = [lgbmr_best_train_time, lgbmr_best_test_time, lgbmr_best_train_score, lgbmr_best_test_score]
    
# Printing best hyperparameters and scores
print('Best training score:', lgbmr_best_train_score)
print('Best test score:', lgbmr_best_test_score)
print('Best training time:', lgbmr_best_train_time)
print('Best test time:', lgbmr_best_test_time)
print('Best model & hyperparameters:')
lgbmr_best_model

Best training score: 1232.4583742616953
Best test score: 1245.8386318333496
Best training time: 8.625
Best test time: 1.5
Best model & hyperparameters:


### CatBoost Regressor

In [69]:
# Creating variables to store best scores
cbr_best_depth = 99999
cbr_best_train_score = 99999
cbr_best_test_score = 99999
cbr_best_train_time = 99999
cbr_best_test_time = 99999
cbr_best_model = None

# Increasing the number of estimators from 1 to 100 by 25 in each loop
for estimators in range(1, 101, 25):
    # Increasing the depth from 1 to 15 by 5 in each loop 
    for depth in range(1, 15, 5):
        # Creating instances of CatBoostRegressor with varying hyperparameters
        cbr = CatBoostRegressor(n_estimators=estimators, max_depth=depth, random_state=random_state)

        # Fitting   
        ## Recording start time
        start = time.process_time()

        cbr.fit(train_features, train_target)

        ## Recording end time and storing it
        end = time.process_time()
        cbr_train_time = end - start

        pred = cbr.predict(train_features)
        cbr_train_score = (mse(train_target, pred) ** 0.5)

        # Testing   
        ## Recording start time
        start = time.process_time()

        pred = cbr.predict(test_features)

        ## Recording end time and storing it
        end = time.process_time()
        cbr_test_time = end - start

        cbr_test_score = (mse(test_target, pred) ** 0.5)

        # Storing best max_depth and scores
        if cbr_test_score < cbr_best_test_score:
            cbr_best_train_time = cbr_train_time
            cbr_best_test_time = cbr_test_time
            cbr_best_train_score = cbr_train_score
            cbr_best_test_score = cbr_test_score
            cbr_best_model = cbr
    
# Storing information in the performance dictionary
performance_dict['cbr'] = [cbr_best_train_time, cbr_best_test_time, cbr_best_train_score, cbr_best_test_score]
    
# Printing best hyperparameters and scores
print('Best training time:', cbr_best_train_time)
print('Best test time:', cbr_best_test_time)
print('Best training score:', cbr_best_train_score)
print('Best test score:', cbr_best_test_score)
print('Best model & hyperparameters:')
cbr_best_model

Learning rate set to 0.5
0:	learn: 2406.4211772	total: 171ms	remaining: 0us
Learning rate set to 0.5
0:	learn: 2130.0072082	total: 12.8ms	remaining: 0us
Learning rate set to 0.5
0:	learn: 2029.4776431	total: 44.8ms	remaining: 0us
Learning rate set to 0.5
0:	learn: 2406.4211772	total: 7.8ms	remaining: 195ms
1:	learn: 2217.2267645	total: 13.8ms	remaining: 166ms
2:	learn: 2085.8440761	total: 19.8ms	remaining: 152ms
3:	learn: 1997.3952233	total: 25.9ms	remaining: 142ms
4:	learn: 1938.0498667	total: 31.9ms	remaining: 134ms
5:	learn: 1883.4384461	total: 37.9ms	remaining: 126ms
6:	learn: 1849.2587724	total: 44.1ms	remaining: 120ms
7:	learn: 1820.8434109	total: 50.7ms	remaining: 114ms
8:	learn: 1794.0602806	total: 56.9ms	remaining: 108ms
9:	learn: 1772.7863433	total: 62.9ms	remaining: 101ms
10:	learn: 1752.5147970	total: 69.2ms	remaining: 94.4ms
11:	learn: 1733.8716098	total: 75.4ms	remaining: 88ms
12:	learn: 1715.7296139	total: 81.8ms	remaining: 81.8ms
13:	learn: 1702.3783414	total: 88.5ms	re

25:	learn: 1289.3646059	total: 362ms	remaining: 348ms
26:	learn: 1286.3501071	total: 377ms	remaining: 335ms
27:	learn: 1282.2615637	total: 391ms	remaining: 321ms
28:	learn: 1279.2623059	total: 405ms	remaining: 307ms
29:	learn: 1277.0224722	total: 418ms	remaining: 293ms
30:	learn: 1274.7965468	total: 432ms	remaining: 279ms
31:	learn: 1271.6080797	total: 447ms	remaining: 265ms
32:	learn: 1268.6940737	total: 462ms	remaining: 252ms
33:	learn: 1266.2131057	total: 476ms	remaining: 238ms
34:	learn: 1263.3763725	total: 490ms	remaining: 224ms
35:	learn: 1261.4335802	total: 504ms	remaining: 210ms
36:	learn: 1258.8367404	total: 518ms	remaining: 196ms
37:	learn: 1256.5327484	total: 533ms	remaining: 182ms
38:	learn: 1254.7065348	total: 546ms	remaining: 168ms
39:	learn: 1253.3726375	total: 559ms	remaining: 154ms
40:	learn: 1251.0831334	total: 574ms	remaining: 140ms
41:	learn: 1248.9929017	total: 591ms	remaining: 127ms
42:	learn: 1247.0725458	total: 604ms	remaining: 112ms
43:	learn: 1245.8116388	tota

Learning rate set to 0.5
0:	learn: 2130.0072082	total: 12.5ms	remaining: 941ms
1:	learn: 1793.5278272	total: 23.6ms	remaining: 872ms
2:	learn: 1621.5894609	total: 35.3ms	remaining: 860ms
3:	learn: 1524.1163768	total: 48.1ms	remaining: 865ms
4:	learn: 1473.5523643	total: 60.4ms	remaining: 858ms
5:	learn: 1441.9152850	total: 72.6ms	remaining: 848ms
6:	learn: 1425.6852057	total: 84.6ms	remaining: 834ms
7:	learn: 1400.6361759	total: 97.9ms	remaining: 832ms
8:	learn: 1384.3662931	total: 111ms	remaining: 826ms
9:	learn: 1372.9725639	total: 124ms	remaining: 819ms
10:	learn: 1361.8885555	total: 137ms	remaining: 812ms
11:	learn: 1354.0831876	total: 152ms	remaining: 809ms
12:	learn: 1348.8680810	total: 165ms	remaining: 798ms
13:	learn: 1340.6743638	total: 178ms	remaining: 788ms
14:	learn: 1336.6554873	total: 191ms	remaining: 776ms
15:	learn: 1331.4751500	total: 205ms	remaining: 768ms
16:	learn: 1324.6551759	total: 219ms	remaining: 759ms
17:	learn: 1319.6440221	total: 232ms	remaining: 748ms
18:	l

Best training time: 23.21875
Best test time: 0.546875
Best training score: 1104.1936927950594
Best test score: 1185.5537619202262
Best model & hyperparameters:


<catboost.core.CatBoostRegressor at 0x2d9ac1c4b50>

In [86]:
cbr_best_model.get_params()

{'loss_function': 'RMSE',
 'max_depth': 11,
 'n_estimators': 76,
 'random_state': 12345}

## Model performance review

In [79]:
performance_df = pd.DataFrame.from_dict(data=performance_dict, orient='index', columns=performance_columns)
performance_df

| | Training time | Test time | Training score | Test score |
|---|---|---|---|---|
| lr | 34.03125 | 0.15625 | 1758.990891 | 1742.233666 |
| dtr | 3.671875 | 0.125 | 1159.81689 | 1356.930726 |
| rfr | 152.34375 | 0.59375 | 1261.817628 | 1332.558566 |
| xgbr | 580.59375 | 2.078125 | 934.158739 | 1160.48688 |
| lgbmr | 8.625 | 1.5 | 1232.458374 | 1245.838632 |
| cbr | 23.21875 | 0.546875 | 1104.193693 | 1185.553762 |
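
For readers who want to reproduce a row of this table, the following is a minimal sketch of how the four values could be collected for one model. It is not the notebook's own helper: the function names `rmse` and `evaluate`, the toy `MeanRegressor`, and the use of `time.process_time` are all assumptions made for illustration. The column names are taken from the table above.

```python
import math
import time


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def evaluate(model, X_train, y_train, X_test, y_test):
    """Return [training time, test time, training score, test score].

    `model` is any object with scikit-learn style fit/predict methods.
    Timing uses time.process_time (CPU seconds); this is an assumption,
    the notebook may have used a different clock.
    """
    start = time.process_time()
    model.fit(X_train, y_train)
    training_time = time.process_time() - start

    start = time.process_time()
    test_predictions = model.predict(X_test)
    test_time = time.process_time() - start

    training_score = rmse(y_train, model.predict(X_train))
    test_score = rmse(y_test, test_predictions)
    return [training_time, test_time, training_score, test_score]


class MeanRegressor:
    """Toy stand-in model: always predicts the mean of the training target."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


# Column names as they appear in the performance table.
performance_columns = ['Training time', 'Test time', 'Training score', 'Test score']
row = evaluate(MeanRegressor(), [[1], [2], [3]], [10.0, 20.0, 30.0], [[4]], [25.0])
print(dict(zip(performance_columns, row)))
```

Any of the regressors imported at the top of the notebook could be passed in place of `MeanRegressor`, since they expose the same `fit` and `predict` methods.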


### By prediction quality

In [80]:
performance_df.sort_values(by='Test score', ascending=True)

| | Training time | Test time | Training score | Test score |
|---|---|---|---|---|
| xgbr | 580.59375 | 2.078125 | 934.158739 | 1160.48688 |
| cbr | 23.21875 | 0.546875 | 1104.193693 | 1185.553762 |
| lgbmr | 8.625 | 1.5 | 1232.458374 | 1245.838632 |
| rfr | 152.34375 | 0.59375 | 1261.817628 | 1332.558566 |
| dtr | 3.671875 | 0.125 | 1159.81689 | 1356.930726 |
| lr | 34.03125 | 0.15625 | 1758.990891 | 1742.233666 |


### By prediction speed

In [81]:
performance_df.sort_values(by='Test time', ascending=True)

| | Training time | Test time | Training score | Test score |
|---|---|---|---|---|
| dtr | 3.671875 | 0.125 | 1159.81689 | 1356.930726 |
| lr | 34.03125 | 0.15625 | 1758.990891 | 1742.233666 |
| cbr | 23.21875 | 0.546875 | 1104.193693 | 1185.553762 |
| rfr | 152.34375 | 0.59375 | 1261.817628 | 1332.558566 |
| lgbmr | 8.625 | 1.5 | 1232.458374 | 1245.838632 |
| xgbr | 580.59375 | 2.078125 | 934.158739 | 1160.48688 |


### By training speed

In [82]:
performance_df.sort_values(by='Training time', ascending=True)

| | Training time | Test time | Training score | Test score |
|---|---|---|---|---|
| dtr | 3.671875 | 0.125 | 1159.81689 | 1356.930726 |
| lgbmr | 8.625 | 1.5 | 1232.458374 | 1245.838632 |
| cbr | 23.21875 | 0.546875 | 1104.193693 | 1185.553762 |
| lr | 34.03125 | 0.15625 | 1758.990891 | 1742.233666 |
| rfr | 152.34375 | 0.59375 | 1261.817628 | 1332.558566 |
| xgbr | 580.59375 | 2.078125 | 934.158739 | 1160.48688 |


### Summary

Findings & insights:
- All other models predicted the target better than linear regression, as indicated by their lower test RMSE scores.
- All ensemble models (the random forest regressor and every gradient-boosting model) hit peak performance at the same hyperparameters:
    - `n_estimators = 76`
    - `max_depth = 11`
- Gradient boosting models made better predictions than the classic scikit-learn models (linear regression, decision tree, and random forest). In ascending order of test RMSE scores,
    - XGBoost regressor (\~1160) did the best job at predicting target values, followed by
    - CatBoost (\~1186) and
    - LightGBM (\~1246) regressors.
- However, gradient boosting models were generally slower at prediction.
    - CatBoost regressor, the 3rd fastest model in terms of prediction speed, was the only gradient-boosting model to predict in under 1 second (\~0.55 seconds).
    - LightGBM (1.5 seconds) and XGBoost (\~2.08 seconds) regressors were the slowest predictors.
    - As a comparison, the fastest model (decision tree regressor) took 0.125 seconds to predict the target values.
- By training speed,
    - LightGBM regressor (\~8.63 seconds) was the 2nd fastest model, followed by
    - CatBoost regressor (\~23.22 seconds).
    - XGBoost regressor was, again, the slowest model at \~580.59 seconds of fitting time.
    - In comparison, the fastest model (again, decision tree regressor) took \~3.67 seconds to train.

# Conclusion

We were given a set of 354,369 rows across 16 columns to train and test the models with.

In preprocessing, we did the following:
- removed 262 duplicate rows,
- filled 30,657 missing `VehicleType`s with data from available models,
- filled every other missing value with an `unknown` placeholder category,
- removed rows with a registration year before 1900 or after April 2016,
- removed rows with outlier values in `Power` and `Price`,
- dropped features that would not contribute to machine learning,
- encoded non-ordinal categorical variables with one-hot encoding,
- separated the features from the target.

Data preprocessing resulted in a dataset of 236,305 rows (removing 118,064 rows, approximately 33% of the original dataset size) and 308 columns.
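
The row counts above can be checked with a few lines of arithmetic. The variable names below are illustrative only. The last figure assumes that duplicate removal and the later filters were the only steps that dropped rows.

```python
rows_before = 354_369   # rows in the original dataset
rows_after = 236_305    # rows left after preprocessing
duplicates = 262        # duplicate rows removed

rows_removed = rows_before - rows_after
share_removed = rows_removed / rows_before

# Rows dropped by the remaining filters (registration year, Power and Price
# outliers), assuming no other step removed rows.
removed_by_filters = rows_removed - duplicates

print(rows_removed, f"{share_removed:.1%}", removed_by_filters)
```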

Next, we trained the following models and tuned their hyperparameters where permitted:
- linear regression (which served as the sanity check for other models due to its low configurability and simple algorithm),
- decision tree regressor,
- random forest regressor,
- three implementations of gradient-boosting algorithms:
	- XGBoost regressor,
	- LightGBM regressor,
	- CatBoost regressor.

The decision tree hit peak performance at `max_depth=14`. Meanwhile, all of the ensemble models reached their best scores at `max_depth=11` and `n_estimators=76`.
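
As a reminder of what such tuning involves, here is a minimal sketch of an exhaustive search over a hyperparameter grid that keeps the combination with the lowest RMSE. It is not the notebook's tuning code. The `grid_search` helper, the toy `ScaledMeanRegressor`, and its `scale` and `offset` parameters are hypothetical and exist only to make the example runnable without external libraries.

```python
import itertools
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def grid_search(make_model, param_grid, X_train, y_train, X_valid, y_valid):
    """Fit one model per parameter combination; return (best_params, best_rmse)."""
    names = list(param_grid)
    best_params, best_score = None, math.inf
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        model = make_model(**params)
        model.fit(X_train, y_train)
        score = rmse(y_valid, model.predict(X_valid))
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


class ScaledMeanRegressor:
    """Hypothetical toy model: predicts scale * mean(y_train) + offset."""

    def __init__(self, scale, offset):
        self.scale = scale
        self.offset = offset

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.scale * self.mean_ + self.offset for _ in X]


best_params, best_score = grid_search(
    ScaledMeanRegressor,
    {'scale': [0.5, 1.0], 'offset': [0, 5]},
    X_train=[[1], [2], [3]], y_train=[10.0, 20.0, 30.0],
    X_valid=[[4]], y_valid=[25.0],
)
print(best_params, best_score)
```

With a real estimator, `make_model` would be a class such as `RandomForestRegressor` and the grid keys would be `max_depth` and `n_estimators`. The ranges actually searched are not shown in this section, so none are assumed here.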

When reviewing the performance data of all models, we found that the gradient-boosting models achieved the best prediction quality, as indicated by their lower test RMSE scores. However, this came at the cost of being slower than models with simpler algorithms.

Based on Rusty Bargain's criteria of prediction quality, prediction speed, and training speed, the best algorithm for their app would be an instance of **CatBoost regressor**, trained with our training set and the following hyperparameters:

- `loss_function = 'RMSE'`,
- `max_depth = 11`,
- `n_estimators = 76`,
- `random_state = 12345`
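
One way to make the trade-off behind this recommendation explicit is to set time budgets first and then pick the most accurate model that fits within them. The sketch below uses the figures from the performance table. The two budget constants are hypothetical and were chosen only for illustration; they are not requirements stated by Rusty Bargain.

```python
# Figures copied from the performance table (times in seconds, test RMSE).
performance = {
    'lr':    {'training_time': 34.03125,  'test_time': 0.15625,  'test_rmse': 1742.233666},
    'dtr':   {'training_time': 3.671875,  'test_time': 0.125,    'test_rmse': 1356.930726},
    'rfr':   {'training_time': 152.34375, 'test_time': 0.59375,  'test_rmse': 1332.558566},
    'xgbr':  {'training_time': 580.59375, 'test_time': 2.078125, 'test_rmse': 1160.48688},
    'lgbmr': {'training_time': 8.625,     'test_time': 1.5,      'test_rmse': 1245.838632},
    'cbr':   {'training_time': 23.21875,  'test_time': 0.546875, 'test_rmse': 1185.553762},
}

# Hypothetical budgets (assumptions, not taken from the project brief).
MAX_TEST_TIME = 1.0       # seconds allowed for prediction
MAX_TRAINING_TIME = 60.0  # seconds allowed for training

# Keep only the models that satisfy both budgets.
candidates = {
    name: metrics
    for name, metrics in performance.items()
    if metrics['test_time'] <= MAX_TEST_TIME
    and metrics['training_time'] <= MAX_TRAINING_TIME
}

# Among the remaining models, choose the one with the lowest test RMSE.
best = min(candidates, key=lambda name: candidates[name]['test_rmse'])
print(sorted(candidates), best)
```

Under these assumed budgets the linear regression, decision tree, and CatBoost models remain, and CatBoost has the lowest test RMSE of the three. Different budgets would change the outcome: with no time limits XGBoost would win on RMSE alone.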