## Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import pickle

from sklearn.metrics import mean_squared_log_error

import warnings
warnings.filterwarnings("ignore")

In [2]:
raw_data = pd.read_csv('used_cars_price_cleaned.csv')
data = raw_data.copy()
data = data.drop('Unnamed: 0', axis = 1)

data.head()

Unnamed: 0,B,C,D,E,F,J,M,S,front-wheel drive,part-time four-wheel drive,rear drive,electrocar,petrol,with damage,with mileage,make,priceUSD,year,mileage(kilometers),volume(cm3)
0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,1,31,565,1993,960015.0,2000.0
1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,23,5550,2008,172000.0,1400.0
2,0,0,1,0,0,0,0,0,1,0,0,0,1,0,1,56,8300,2008,223000.0,2500.0
3,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,84,3300,2005,140000.0,1200.0
4,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,56,2450,2002,413000.0,2200.0


## Splitting of dataset

In [3]:
from sklearn.model_selection import train_test_split

x = data.drop('priceUSD', axis = 1) #Independent variables
y = data['priceUSD'] #Dependent variable (target)

#Training and test data split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.1,random_state = 42)

#Training and validation data split
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size = 0.1, random_state = 42)
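
Because the second split is taken from the 90% that remains after the first, the three sets end up at roughly 81% / 9% / 10% of the data. The sketch below works through the arithmetic for a hypothetical dataset of 1,000 rows (the real row count is not shown in this notebook). The helper `split_sizes` is illustrative only, and it assumes the held-out part is rounded up, which is how `train_test_split` is understood to behave for fractional sizes.

```python
import math

def split_sizes(n_samples, test_size=0.1, val_size=0.1):
    """Return (train, validation, test) row counts for two successive hold-out splits."""
    # Assumption: the held-out portion is rounded up to a whole number of rows.
    n_test = math.ceil(n_samples * test_size)
    n_remaining = n_samples - n_test
    # The validation fraction applies to what is left, not to the full dataset.
    n_val = math.ceil(n_remaining * val_size)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test

# Hypothetical dataset size, chosen only to make the proportions easy to read.
train_rows, val_rows, test_rows = split_sizes(1000)
print(train_rows, val_rows, test_rows)
```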

## Scaling the data

In [4]:
#Since range of values for each feature is different, they need to be scaled with a fixed range
#MinMaxScaler scales features within a range of (0,1)

from sklearn.preprocessing import MinMaxScaler 
scaler = MinMaxScaler() 
  
x_train_scaled = scaler.fit_transform(x_train)
x_val_scaled = scaler.transform(x_val)
x_test_scaled = scaler.transform(x_test) 
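
The scaler is fitted on the training set only, and the same minimum and maximum are then reused for the validation and test sets. A minimal pure-Python sketch of that idea for a single column is shown below. The helper names and the validation values are hypothetical, and this is not scikit-learn's implementation.

```python
def fit_min_max(column):
    """Learn the scaling bounds from the training column only."""
    return min(column), max(column)

def transform_min_max(column, lo, hi):
    """Apply (value - min) / (max - min) using bounds learned elsewhere."""
    return [(value - lo) / (hi - lo) for value in column]

train_years = [1993, 2002, 2005, 2008]   # years seen in the sample rows above
val_years = [1990, 2010]                 # hypothetical unseen values

lo, hi = fit_min_max(train_years)
train_scaled = transform_min_max(train_years, lo, hi)
val_scaled = transform_min_max(val_years, lo, hi)

print(train_scaled)
print(val_scaled)
```

Training values land in [0, 1] by construction, while unseen values outside the training range fall below 0 or above 1. That is expected and is the price of not leaking validation or test statistics into the scaler.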

**The price of cars is not normally distributed: it is strongly skewed (right-skewed, with most values at the low end), so it is best to perform a log transformation on price.**

In [5]:
y_train_log = np.log(y_train)

y_val_log = np.log(y_val)

y_test_log = np.log(y_test)
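
To illustrate what the log transformation does, the sketch below measures skewness before and after taking logs and then undoes the transformation with the exponential. The prices are hypothetical, and `skewness` is a hand-written moment-based estimate rather than a library call.

```python
import math

def skewness(values):
    """Moment-based sample skewness: third central moment over variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical prices that double each step, giving a long right tail.
prices = [1000, 2000, 4000, 8000, 16000]
log_prices = [math.log(p) for p in prices]

# Predictions made on the log scale are mapped back to USD with exp.
recovered = [math.exp(v) for v in log_prices]

print(skewness(prices), skewness(log_prices))
```

Since the models are trained on log prices, their predictions have to be passed through the exponential before being read as USD amounts.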


## Model creation, comparison and Evaluation

In this phase, we will use different Machine Learning algorithms to create models and compare the performance of those models.

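We will use the Root Mean Squared Log Error (RMSLE) and the R^2 value as our evaluation metrics. The code labels R^2, scaled to a percentage, as "Accuracy", although it is the coefficient of determination and not a classification accuracy. The lower the RMSLE and the higher the R^2, the better the model. Note that `mean_squared_log_error` is applied below to the already log-transformed prices, so the reported values measure the error in log(1 + log(price)) rather than in log(1 + price).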
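
For reference, both metrics can be written out by hand. The sketch below is a plain-Python version that assumes the standard definitions (RMSLE over log(1 + y), and R^2 as one minus the ratio of residual to total sum of squares). The price values are made up for illustration.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared difference between log(1 + prediction) and log(1 + target)."""
    squared = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [565, 5550, 8300, 3300]      # prices taken from the sample rows above
y_pred = [600, 5000, 9000, 3300]      # hypothetical predictions

print("RMSLE:", rmsle(y_true, y_pred))
print("R^2 (%):", round(r2_score(y_true, y_pred) * 100, 3))
```

A model that always predicts the mean of the targets scores an R^2 of 0, and a perfect model scores 1, which is why the value reads more naturally as "share of variance explained" than as accuracy.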

The different ML algorithms to be implemented:

1. Linear Regression
2. Random Forest
3. K Nearest Neighbours
4. XGBoost
5. LightGBM
6. CatBoost

First, let's create a DataFrame that will store the validation results of all the models so that they can be compared later on.

In [6]:
Results = pd.DataFrame(columns = ['Model', 'Validation Score(Before tuning)', 'Accuracy(Before tuning)', 'Validation Score(After tuning)', 'Accuracy(After tuning)' ])

#### 1. Linear Regression

In [7]:
from sklearn.linear_model import LinearRegression

lr =  LinearRegression()

lr.fit(x_train_scaled, y_train_log) #Fitting or training the data

y_val_pred_log = lr.predict(x_val_scaled)#Predicting on validation set

rmsle = np.sqrt(mean_squared_log_error(y_val_log, y_val_pred_log)) #Evaluating model rmsle score by comparing actual and predicted results

score = ((lr.score(x_val_scaled,y_val_log))*100).round(3)

print("RMSLE score: ", rmsle)

print("Accuracy score: ", score)

RMSLE score:  0.058461182269852544
Accuracy score:  77.317


In [8]:
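#DataFrame.append was removed in pandas 2.0; pd.concat adds the row in the same way
Results = pd.concat([Results, pd.DataFrame([{'Model' : 'Linear Regression', 'Validation Score(Before tuning)': rmsle,
                                             'Accuracy(Before tuning)': score}])], ignore_index = True)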
Results

Unnamed: 0,Model,Validation Score(Before tuning),Accuracy(Before tuning),Validation Score(After tuning),Accuracy(After tuning)
0,Linear Regression,0.058461,77.317,,



#### 2. Random Forest

In [9]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor()

rf.fit(x_train_scaled, y_train_log) #Fitting or training the data

y_val_pred_log = rf.predict(x_val_scaled)#Predicting on validation set

rmsle = np.sqrt(mean_squared_log_error(y_val_log, y_val_pred_log)) #Evaluating model rmsle score by comparing actual and predicted results
score = ((rf.score(x_val_scaled,y_val_log))*100).round(3)

print("RMSLE score: ", rmsle)

print("Accuracy score: ", score)

RMSLE score:  0.037815375137285036
Accuracy score:  89.968


In [10]:
Results = pd.concat([Results, pd.DataFrame([{'Model' : 'Random Forest', 'Validation Score(Before tuning)': rmsle,
                                             'Accuracy(Before tuning)': score}])], ignore_index = True)
Results

Unnamed: 0,Model,Validation Score(Before tuning),Accuracy(Before tuning),Validation Score(After tuning),Accuracy(After tuning)
0,Linear Regression,0.058461,77.317,,
1,Random Forest,0.037815,89.968,,



#### 3. K Nearest Neighbours

In [11]:
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor()

knn.fit(x_train_scaled, y_train_log) #Fitting or training the data

y_val_pred_log = knn.predict(x_val_scaled)#Predicting on validation set

rmsle = np.sqrt(mean_squared_log_error(y_val_log, y_val_pred_log)) #Evaluating model rmsle score by comparing actual and predicted results
score = ((knn.score(x_val_scaled,y_val_log))*100).round(3)

print("RMSLE score: ", rmsle)

print("Accuracy score: ", score)

RMSLE score:  0.03823251764652535
Accuracy score:  89.49


In [12]:
Results = pd.concat([Results, pd.DataFrame([{'Model' : 'K Nearest Neighbours', 'Validation Score(Before tuning)': rmsle,
                                             'Accuracy(Before tuning)': score}])], ignore_index = True)
Results

Unnamed: 0,Model,Validation Score(Before tuning),Accuracy(Before tuning),Validation Score(After tuning),Accuracy(After tuning)
0,Linear Regression,0.058461,77.317,,
1,Random Forest,0.037815,89.968,,
2,K Nearest Neighbours,0.038233,89.49,,



#### 4. XGBoost

In [13]:
from xgboost import XGBRegressor

xgb = XGBRegressor()

xgb.fit(x_train_scaled, y_train_log) #Fitting or training the data

y_val_pred_log = xgb.predict(x_val_scaled)#Predicting on validation set

rmsle = np.sqrt(mean_squared_log_error(y_val_log, y_val_pred_log)) #Evaluating model rmsle score by comparing actual and predicted results
score = ((xgb.score(x_val_scaled,y_val_log))*100).round(3)

print("RMSLE score: ", rmsle)

print("Accuracy score: ", score)

RMSLE score:  0.04095935448169442
Accuracy score:  87.925


In [14]:
Results = pd.concat([Results, pd.DataFrame([{'Model' : 'XGBoost', 'Validation Score(Before tuning)': rmsle,
                                             'Accuracy(Before tuning)': score}])], ignore_index = True)
Results

Unnamed: 0,Model,Validation Score(Before tuning),Accuracy(Before tuning),Validation Score(After tuning),Accuracy(After tuning)
0,Linear Regression,0.058461,77.317,,
1,Random Forest,0.037815,89.968,,
2,K Nearest Neighbours,0.038233,89.49,,
3,XGBoost,0.040959,87.925,,



#### 5. Light GBM (LGBM)

In [15]:
from lightgbm import LGBMRegressor

lgbr = LGBMRegressor()

lgbr.fit(x_train_scaled, y_train_log) #Fitting or training the data
 
y_val_pred_log = lgbr.predict(x_val_scaled)#Predicting on validation set

rmsle = np.sqrt(mean_squared_log_error(y_val_log, y_val_pred_log)) #Evaluating model rmsle score by comparing actual and predicted results
score = ((lgbr.score(x_val_scaled,y_val_log))*100).round(3)

print("RMSLE score: ", rmsle)

print("Accuracy score: ", score)

RMSLE score:  0.036177417400881436
Accuracy score:  90.781


In [16]:
Results = pd.concat([Results, pd.DataFrame([{'Model' : 'Light GBM', 'Validation Score(Before tuning)': rmsle,
                                             'Accuracy(Before tuning)': score}])], ignore_index = True)
Results

Unnamed: 0,Model,Validation Score(Before tuning),Accuracy(Before tuning),Validation Score(After tuning),Accuracy(After tuning)
0,Linear Regression,0.058461,77.317,,
1,Random Forest,0.037815,89.968,,
2,K Nearest Neighbours,0.038233,89.49,,
3,XGBoost,0.040959,87.925,,
4,Light GBM,0.036177,90.781,,



#### 6. CatBoost

In [17]:
from catboost import CatBoostRegressor

cb = CatBoostRegressor()

cb.fit(x_train_scaled, y_train_log, verbose = 200) #Fitting or training the data

y_val_pred_log = cb.predict(x_val_scaled) #Predicting on validation set

rmsle = np.sqrt(mean_squared_log_error(y_val_log, y_val_pred_log)) #Evaluating model rmsle score by comparing actual and predicted results
score = ((cb.score(x_val_scaled,y_val_log))*100).round(3)

print("RMSLE score: ", rmsle)

print("Accuracy score: ", score)

Learning rate set to 0.07236
0:	learn: 0.9634298	total: 169ms	remaining: 2m 48s
200:	learn: 0.3110477	total: 1.57s	remaining: 6.23s
400:	learn: 0.2879327	total: 2.9s	remaining: 4.33s
600:	learn: 0.2771087	total: 4.21s	remaining: 2.8s
800:	learn: 0.2698537	total: 5.51s	remaining: 1.37s
999:	learn: 0.2637301	total: 6.9s	remaining: 0us
RMSLE score:  0.03534135196770207
Accuracy score:  91.283


In [18]:
Results = pd.concat([Results, pd.DataFrame([{'Model' : 'CatBoost', 'Validation Score(Before tuning)': rmsle,
                                             'Accuracy(Before tuning)': score}])], ignore_index = True)
Results

Unnamed: 0,Model,Validation Score(Before tuning),Accuracy(Before tuning),Validation Score(After tuning),Accuracy(After tuning)
0,Linear Regression,0.058461,77.317,,
1,Random Forest,0.037815,89.968,,
2,K Nearest Neighbours,0.038233,89.49,,
3,XGBoost,0.040959,87.925,,
4,Light GBM,0.036177,90.781,,
5,CatBoost,0.035341,91.283,,



## <u>Hyper Parameter Tuning</u>

####  For this we will use GridSearchCV 

Grid search evaluates every combination of the supplied hyperparameter values with cross-validation and reports the combination with the best score. This matters because a model's performance can depend heavily on the hyperparameter values chosen.

In [19]:
from sklearn.model_selection import GridSearchCV

#Defining a function that will calculate the best parameters and the (negated) MSLE score of the model based on those parameters
#Using GridSearchCV

def grid_search(regressor,parameters):
    
    grid = GridSearchCV(estimator = regressor, #The model object
                        param_grid = parameters, #The dictionary of parameters to be searched over
                        scoring = 'neg_mean_squared_log_error', #Scoring method
                        cv = 5, #Cross-validation splits value
                        n_jobs = -1 #Number of CPU cores to be used, -1 means all cores to be used
                        )
    
    grid.fit(x_train_scaled,y_train_log) #Training the model 

    print('Best parameters: ', grid.best_params_) #Displaying the best parameters of the model

    print("MSLE: ", grid.best_score_) #Negated MSLE: scikit-learn maximizes scores, so 'neg_mean_squared_log_error' reports -MSLE
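
To get a feel for how much work a grid search involves, the candidate combinations can be enumerated by hand. The sketch below expands the XGBoost grid used later in this notebook with `itertools.product` and counts the cross-validation fits for `cv = 5`. The helper `expand_grid` is illustrative and is not part of scikit-learn.

```python
from itertools import product

def expand_grid(param_grid):
    """List every combination of the values in a {name: [values]} dictionary."""
    names = sorted(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]

param_xgb = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [12, 13, 14],
    "min_child_weight": [10, 12],
    "gamma": [0.0, 0.1],
}

candidates = expand_grid(param_xgb)
n_folds = 5
n_fits = len(candidates) * n_folds  # one model is trained per candidate per fold

print(len(candidates), "candidates,", n_fits, "cross-validation fits")
```

The number of fits grows multiplicatively with every parameter added, which is why the grids in this notebook are kept small.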

#### 1. Linear regression

In [20]:
from sklearn.linear_model import LinearRegression

param_lr = {
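    #Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2, so this grid needs an older release
    'normalize': ['True', 'False'],#True: the regressors X are normalized by subtracting the mean and dividing by the l2-norm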
    }

lr = LinearRegression()
grid_search(lr,param_lr)

Best parameters:  {'normalize': 'True'}
MSLE:  -0.003001407854897586
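
One caveat about this grid: the two candidate values are the strings `'True'` and `'False'`, not booleans. Any non-empty string is truthy in Python, so if the estimator only checks truthiness (an assumption about older releases without strict parameter validation), both candidates would behave like `normalize = True`. The short check below shows the difference.

```python
# The grid as written: two non-empty strings.
string_grid = ['True', 'False']
flags_from_strings = [bool(value) for value in string_grid]

# What was presumably intended: two real booleans.
boolean_grid = [True, False]
flags_from_booleans = [bool(value) for value in boolean_grid]

print(flags_from_strings)
print(flags_from_booleans)
```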


In [21]:
#Retraining model with best parameters

lr =  LinearRegression(normalize = 'True')

lr.fit(x_train_scaled, y_train_log) #Fitting or training the data

y_val_pred_log = lr.predict(x_val_scaled)#Predicting on validation set

rmsle = np.sqrt(mean_squared_log_error(y_val_log, y_val_pred_log)) #Evaluating model rmsle score by comparing actual and predicted results

score = ((lr.score(x_val_scaled,y_val_log))*100).round(3)

print("RMSLE score: ", rmsle)

print("Accuracy score: ", score)

RMSLE score:  0.05846118226985253
Accuracy score:  77.317


In [22]:
Results.loc[0, 'Validation Score(After tuning)'] = rmsle
Results.loc[0, 'Accuracy(After tuning)'] = score
Results

Unnamed: 0,Model,Validation Score(Before tuning),Accuracy(Before tuning),Validation Score(After tuning),Accuracy(After tuning)
0,Linear Regression,0.058461,77.317,0.058461,77.317
1,Random Forest,0.037815,89.968,,
2,K Nearest Neighbours,0.038233,89.49,,
3,XGBoost,0.040959,87.925,,
4,Light GBM,0.036177,90.781,,
5,CatBoost,0.035341,91.283,,



#### 2. Random forest

In [23]:
from sklearn.ensemble import RandomForestRegressor

param_rf = {
    'max_depth': [10, 20, 30], #max_depth determines the maximum height of each tree
    'n_estimators': [100, 200, 500] #The number of trees in the forest.
    }

rf = RandomForestRegressor()
grid_search(rf,param_rf)

Best parameters:  {'max_depth': 20, 'n_estimators': 500}
MSLE:  -0.0014119664837570103


In [24]:
#Retraining model with best parameters

rf = RandomForestRegressor(max_depth = 20, n_estimators = 500)

rf.fit(x_train_scaled, y_train_log) #Fitting or training the data

y_val_pred_log = rf.predict(x_val_scaled)#Predicting on validation set

rmsle = np.sqrt(mean_squared_log_error(y_val_log, y_val_pred_log)) #Evaluating model rmsle score by comparing actual and predicted results
score = ((rf.score(x_val_scaled,y_val_log))*100).round(3)

print("RMSLE score: ", rmsle)

print("Accuracy score: ", score)

RMSLE score:  0.037368093251942285
Accuracy score:  90.184


In [25]:
Results.loc[1, 'Validation Score(After tuning)'] = rmsle
Results.loc[1, 'Accuracy(After tuning)'] = score
Results

Unnamed: 0,Model,Validation Score(Before tuning),Accuracy(Before tuning),Validation Score(After tuning),Accuracy(After tuning)
0,Linear Regression,0.058461,77.317,0.058461,77.317
1,Random Forest,0.037815,89.968,0.037368,90.184
2,K Nearest Neighbours,0.038233,89.49,,
3,XGBoost,0.040959,87.925,,
4,Light GBM,0.036177,90.781,,
5,CatBoost,0.035341,91.283,,



#### 3. K Nearest neighbours

In [26]:
from sklearn.neighbors import KNeighborsRegressor

param_knn = {
    'leaf_size': [20, 50, 60], #Leaf size affect the speed of the construction and query, 
                                   #as well as the memory required to store the tree.
    'n_neighbors': [5,7,9], #Number of neighbouring points to consider
    'p': [1,2] #p = 1 for manhattan_distance, p = 2 for euclidean distance            
    }

knn = KNeighborsRegressor()
grid_search(knn,param_knn)

Best parameters:  {'leaf_size': 50, 'n_neighbors': 7, 'p': 1}
MSLE:  -0.0015305808594260795
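
The `p` parameter selects the Minkowski distance used to find neighbours, and the search above preferred `p = 1`. As a small illustration with made-up points, the function below computes that distance directly. It is a hand-written sketch and not the routine scikit-learn uses internally.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance: p = 1 is Manhattan, p = 2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

point_a = (0.0, 0.0)
point_b = (3.0, 4.0)   # hypothetical scaled feature vectors

manhattan = minkowski_distance(point_a, point_b, p=1)
euclidean = minkowski_distance(point_a, point_b, p=2)

print(manhattan, euclidean)
```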


In [27]:
#Retraining model with best parameters

knn = KNeighborsRegressor(leaf_size = 50, n_neighbors = 7, p = 1)

knn.fit(x_train_scaled, y_train_log) #Fitting or training the data

y_val_pred_log = knn.predict(x_val_scaled)#Predicting on validation set

rmsle = np.sqrt(mean_squared_log_error(y_val_log, y_val_pred_log)) #Evaluating model rmsle score by comparing actual and predicted results
score = ((knn.score(x_val_scaled,y_val_log))*100).round(3)

print("RMSLE score: ", rmsle)

print("Accuracy score: ", score)

RMSLE score:  0.03756992213538703
Accuracy score:  89.92


In [28]:
Results.loc[2, 'Validation Score(After tuning)'] = rmsle
Results.loc[2, 'Accuracy(After tuning)'] = score
Results

Unnamed: 0,Model,Validation Score(Before tuning),Accuracy(Before tuning),Validation Score(After tuning),Accuracy(After tuning)
0,Linear Regression,0.058461,77.317,0.058461,77.317
1,Random Forest,0.037815,89.968,0.037368,90.184
2,K Nearest Neighbours,0.038233,89.49,0.03757,89.92
3,XGBoost,0.040959,87.925,,
4,Light GBM,0.036177,90.781,,
5,CatBoost,0.035341,91.283,,



#### 4. XGBoost

In [29]:
from xgboost import XGBRegressor

param_xgb = {
    "learning_rate"    : [0.05, 0.1], #The learning rate of the algorithm
    "max_depth"        : [12, 13, 14], #Maximum height of the boosted trees
    "min_child_weight" : [10, 12], #Minimum sum of instance weight(hessian) needed in a child.
    "gamma"            : [ 0.0, 0.1], #Minimum loss reduction required to make a further partition on a leaf node
    }

xgb = XGBRegressor()
grid_search(xgb,param_xgb)

Best parameters:  {'gamma': 0.1, 'learning_rate': 0.1, 'max_depth': 12, 'min_child_weight': 10}
MSLE:  -0.0012454023711729076


In [30]:
#Retraining model with best parameters

xgb = XGBRegressor(gamma = 0.1, learning_rate = 0.1, max_depth = 12, min_child_weight = 10)

xgb.fit(x_train_scaled, y_train_log) #Fitting or training the data

y_val_pred_log = xgb.predict(x_val_scaled)#Predicting on validation set

rmsle = np.sqrt(mean_squared_log_error(y_val_log, y_val_pred_log)) #Evaluating model rmsle score by comparing actual and predicted results
score = ((xgb.score(x_val_scaled,y_val_log))*100).round(3)

print("RMSLE score: ", rmsle)

print("Accuracy score: ", score)

RMSLE score:  0.03553172096309003
Accuracy score:  91.264


In [31]:
Results.loc[3, 'Validation Score(After tuning)'] = rmsle
Results.loc[3, 'Accuracy(After tuning)'] = score
Results

Unnamed: 0,Model,Validation Score(Before tuning),Accuracy(Before tuning),Validation Score(After tuning),Accuracy(After tuning)
0,Linear Regression,0.058461,77.317,0.058461,77.317
1,Random Forest,0.037815,89.968,0.037368,90.184
2,K Nearest Neighbours,0.038233,89.49,0.03757,89.92
3,XGBoost,0.040959,87.925,0.035532,91.264
4,Light GBM,0.036177,90.781,,
5,CatBoost,0.035341,91.283,,



#### 5. Light GBM

In [32]:
from lightgbm import LGBMRegressor

param_lgb = {
    
    'learning_rate': [0.5, 0.4, 0.2, 0.1], #Learning rate of the algorithm
    'max_depth': [10, 12, 15, 17], #Maximum height of trees
    'num_leaves': [15, 20, 25], #Maximum number of leaves for base learners
    'feature_fraction': [0.6, 0.8], #Random fraction of features to be used while training
    'subsample': [0.2, 0.5], #Random fraction of records to be used while training (LightGBM applies it only when subsample_freq > 0)
    }

lgb = LGBMRegressor()
grid_search(lgb,param_lgb)

Best parameters:  {'feature_fraction': 0.8, 'learning_rate': 0.2, 'max_depth': 17, 'num_leaves': 25, 'subsample': 0.2}
MSLE:  -0.0012692575525791328


In [33]:
#Retraining model with best parameters

lgbr = LGBMRegressor(feature_fraction = 0.8, learning_rate = 0.2, max_depth = 17, num_leaves = 25, subsample = 0.2)

lgbr.fit(x_train_scaled, y_train_log) #Fitting or training the data
 
y_val_pred_log = lgbr.predict(x_val_scaled)#Predicting on validation set

rmsle = np.sqrt(mean_squared_log_error(y_val_log, y_val_pred_log)) #Evaluating model rmsle score by comparing actual and predicted results
score = ((lgbr.score(x_val_scaled,y_val_log))*100).round(3)

print("RMSLE score: ", rmsle)

print("Accuracy score: ", score)

RMSLE score:  0.03613911435416256
Accuracy score:  90.809


In [34]:
Results.loc[4, 'Validation Score(After tuning)'] = rmsle
Results.loc[4, 'Accuracy(After tuning)'] = score
Results

Unnamed: 0,Model,Validation Score(Before tuning),Accuracy(Before tuning),Validation Score(After tuning),Accuracy(After tuning)
0,Linear Regression,0.058461,77.317,0.058461,77.317
1,Random Forest,0.037815,89.968,0.037368,90.184
2,K Nearest Neighbours,0.038233,89.49,0.03757,89.92
3,XGBoost,0.040959,87.925,0.035532,91.264
4,Light GBM,0.036177,90.781,0.036139,90.809
5,CatBoost,0.035341,91.283,,



#### 6. CatBoost

In [35]:
from catboost import CatBoostRegressor

param_cb = {
    
    'learning_rate': [0.1, 0.05], #Learning rate of algorithm
    'depth': [8, 10], #Lower value prevents overfitting
    'bagging_temperature': [3, 10], #Helps to tackle overfitting
    'l2_leaf_reg' : [0.5, 1] #Regularization term
    
    }

cb = CatBoostRegressor(early_stopping_rounds = 10, verbose = False)
grid_search(cb,param_cb)

Best parameters:  {'bagging_temperature': 3, 'depth': 8, 'l2_leaf_reg': 0.5, 'learning_rate': 0.05}
MSLE:  -0.0012095789095128227


In [36]:
#Retraining model with best parameters

cb = CatBoostRegressor(bagging_temperature = 3, depth = 8, l2_leaf_reg = 0.5, learning_rate = 0.05)

cb.fit(x_train_scaled, y_train_log, verbose = 200) #Fitting or training the data

y_val_pred_log = cb.predict(x_val_scaled) #Predicting on validation set

rmsle = np.sqrt(mean_squared_log_error(y_val_log, y_val_pred_log)) #Evaluating model rmsle score by comparing actual and predicted results
score = ((cb.score(x_val_scaled,y_val_log))*100).round(3)

print("RMSLE score: ", rmsle)

print("Accuracy score: ", score)

0:	learn: 0.9789353	total: 15.8ms	remaining: 15.8s
200:	learn: 0.3043661	total: 2.33s	remaining: 9.26s
400:	learn: 0.2784122	total: 4.57s	remaining: 6.83s
600:	learn: 0.2649057	total: 6.68s	remaining: 4.44s
800:	learn: 0.2554966	total: 8.8s	remaining: 2.19s
999:	learn: 0.2478434	total: 10.9s	remaining: 0us
RMSLE score:  0.0352913178210058
Accuracy score:  91.335


In [37]:
Results.loc[5, 'Validation Score(After tuning)'] = rmsle
Results.loc[5, 'Accuracy(After tuning)'] = score
Results

Unnamed: 0,Model,Validation Score(Before tuning),Accuracy(Before tuning),Validation Score(After tuning),Accuracy(After tuning)
0,Linear Regression,0.058461,77.317,0.058461,77.317
1,Random Forest,0.037815,89.968,0.037368,90.184
2,K Nearest Neighbours,0.038233,89.49,0.03757,89.92
3,XGBoost,0.040959,87.925,0.035532,91.264
4,Light GBM,0.036177,90.781,0.036139,90.809
5,CatBoost,0.035341,91.283,0.035291,91.335


**After tuning, XGBoost and CatBoost performed better than all the other models tested on this dataset.**

**Since CatBoost has both a lower RMSLE and a higher accuracy (R^2) than XGBoost, we consider CatBoost the best model for this dataset.**
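
The same selection can be made programmatically rather than by reading the table. The sketch below copies the after-tuning values from the comparison table into a plain list and picks the model with the lowest RMSLE and the one with the highest R^2. The variable names are illustrative.

```python
# (model, RMSLE after tuning, R^2 in % after tuning), copied from the table above
results_after_tuning = [
    ("Linear Regression", 0.058461, 77.317),
    ("Random Forest", 0.037368, 90.184),
    ("K Nearest Neighbours", 0.03757, 89.92),
    ("XGBoost", 0.035532, 91.264),
    ("Light GBM", 0.036139, 90.809),
    ("CatBoost", 0.035291, 91.335),
]

best_by_rmsle = min(results_after_tuning, key=lambda row: row[1])  # lower is better
best_by_r2 = max(results_after_tuning, key=lambda row: row[2])     # higher is better

print("Lowest RMSLE:", best_by_rmsle[0])
print("Highest R^2:", best_by_r2[0])
```

The margin between CatBoost and XGBoost is small (about 0.0002 in RMSLE), so the ranking could plausibly change with a different random split.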

Let's perform a final evaluation on our test set.

In [38]:
y_test_pred_log = cb.predict(x_test_scaled) #Predicting on test set

rmsle = np.sqrt(mean_squared_log_error(y_test_log, y_test_pred_log)) #Evaluating model rmsle score by comparing actual and predicted results
score = ((cb.score(x_test_scaled,y_test_log))*100).round(3)

print("RMSLE score: ", rmsle)

print("Accuracy score: ", score)

RMSLE score:  0.03488305748337859
Accuracy score:  91.596


**On the test set the model reaches an RMSLE of about 0.0349 and an accuracy (R^2) of 91.596, in line with its validation results (0.0353 and 91.335).**


## Saving the scaler object

In [39]:
#The with-block closes the file automatically, even if pickling fails
with open("scaler.pickle", "wb") as filehandler:
    pickle.dump(scaler, filehandler)

## Saving Model

In [40]:
#The with-block closes the file automatically, even if pickling fails
with open("CatBoostRegressor.pickle", "wb") as filehandler:
    pickle.dump(cb, filehandler)
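
Both pickles are needed again at prediction time: new data has to pass through the saved scaler before it reaches the saved model. The sketch below shows the save and load round trip with a plain dictionary standing in for the real objects. The helper names and the stand-in contents are hypothetical.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Serialize obj to path; the with-block closes the file."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)

def load_pickle(path):
    """Read a pickled object back from path."""
    with open(path, "rb") as handle:
        return pickle.load(handle)

# Stand-in for the fitted scaler; the real object would be the MinMaxScaler.
stand_in = {"data_min": [1993.0], "data_max": [2008.0]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scaler.pickle")
    save_pickle(stand_in, path)
    restored = load_pickle(path)

print(restored == stand_in)
```

Pickle files should only be loaded from trusted sources, and they are generally tied to the library versions that produced them.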

## Saving Model comparison chart

In [41]:
Results.to_csv("Model_comparison.csv")