In [1]:
import pandas as pd
import numpy as np
import sklearn

In [2]:
df_chunks = pd.read_csv('/Users/chenzizhang/Data Science/data/processed/vehicles_2.csv', chunksize=5000)
chunk_list = [] 
for chunk in df_chunks:
    chunk_list.append(chunk)

df = pd.concat(chunk_list)

# 5. Model Training

Many regression algorithms are available for model training. Because estimating used car prices is a regression problem, I chose the $\color{red}{\text{Linear Regression}}$, $\color{red}{\text{SGD Regression}}$, $\color{red}{\text{Decision Tree Regression}}$, $\color{red}{\text{LGBM}}$, and $\color{red}{\text{XGBoost}}$ algorithms and trained one model with each. I then compare the five models using RMSE as the prediction accuracy indicator.

Reference: https://zhuanlan.zhihu.com/p/62034592

In [3]:
target = 'price'
Y = df[target]
X = df.drop([target], axis=1)

In [4]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state=5)

## 5.1 Linear Regression

Here, linear regression is used as a baseline regression algorithm.

In [5]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train, y_train)
Y_pred_reg = reg.predict(x_test) 

## 5.2 SGD Regression

Stochastic gradient descent is a simple yet very efficient approach to fit linear models. It is particularly useful when the number of samples (and the number of features) is very large.
The class `SGDRegressor` implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties to fit linear regression models. According to the scikit-learn documentation, $\color{red}{\text{SGDRegressor is well suited for regression problems with a large number of training samples (> 10,000)}}$; for other problems it recommends Ridge, Lasso, or ElasticNet.
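
To make the update rule concrete, here is a minimal sketch of plain stochastic gradient descent for a one-feature linear model with squared loss. It is an illustration only, not how `SGDRegressor` is implemented internally; the function name `sgd_linear_fit`, the toy data, and the step size are assumptions chosen for the example.

```python
import random

def sgd_linear_fit(xs, ys, lr=0.1, epochs=500, seed=0):
    """Fit y ~ w * x + b by updating on one sample at a time (hypothetical helper)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)                   # visit samples in random order each epoch
        for i in order:
            err = (w * xs[i] + b) - ys[i]    # prediction error on a single sample
            w -= lr * err * xs[i]            # gradient of 0.5 * err**2 with respect to w
            b -= lr * err                    # gradient of 0.5 * err**2 with respect to b
    return w, b

# Toy, noise-free data generated from y = 2x + 1 (assumed for illustration)
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [1.0 + 2.0 * x for x in xs]
w_hat, b_hat = sgd_linear_fit(xs, ys)
print(w_hat, b_hat)
```

Because each update uses a single sample, the cost per step does not grow with the size of the training set, which is why the approach scales to many samples.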


In [39]:
from sklearn import linear_model
SGD = linear_model.SGDRegressor(max_iter=1000, tol=1e-3)
SGD.fit(x_train, y_train)
Y_pred_SGD = SGD.predict(x_test) 

## 5.3 Decision Tree Regression

Decision trees are divided into classification trees and regression trees. Regression trees are needed when the response variable is numeric or continuous. Classification trees, as the name implies, are used to separate the dataset into the classes of the response variable. Since price is continuous, a regression tree is fitted here with scikit-learn's default settings.
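
As a rough illustration of how a regression tree chooses a split, the sketch below scans candidate thresholds on a single feature and keeps the one that minimises the summed squared error of the two resulting groups. The helper `best_split` and the toy values are hypothetical; `DecisionTreeRegressor` applies a more elaborate version of this idea recursively over all features.

```python
def best_split(xs, ys):
    """Return (threshold, sse) for the single split with the lowest squared error."""
    def sse(values):
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    pairs = sorted(zip(xs, ys))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                      # no threshold between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2   # midpoint between neighbours
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best

# Toy data with two clearly separated groups (assumed for illustration)
threshold, error = best_split([1, 2, 3, 10, 11, 12], [5.0, 5.0, 5.0, 20.0, 20.0, 20.0])
print(threshold, error)
```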

In [41]:
from sklearn.tree import DecisionTreeRegressor 
dtr = DecisionTreeRegressor()
dtr.fit(x_train, y_train)
Y_pred_dtr = dtr.predict(x_test) 

## 5.4 LGBM

LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

In [8]:
import lightgbm as lgb

Installation note (macOS): when LightGBM is installed from PyPI via the ``pip install lightgbm`` command, the gcc compiler is no longer needed.
Instead, the OpenMP library must be installed, because it is required to run LightGBM on a system with the Apple Clang compiler.
The OpenMP library can be installed with the following command: ``brew install libomp``.


In [9]:
train_set = lgb.Dataset(x_train, y_train, silent = False)
valid_set = lgb.Dataset(x_test, y_test, silent = False)

In [10]:
# Reference: https://www.kaggle.com/vbmokin/used-cars-price-prediction-by-15-models/notebook
params = {
        'boosting_type':'gbdt',
        'objective': 'regression',
        'num_leaves': 31,
        'learning_rate': 0.01,
        'max_depth': -1,
        'subsample': 0.8,
        'bagging_fraction' : 1,
        'max_bin' : 5000 ,
        'bagging_freq': 20,
        'colsample_bytree': 0.6,
        'metric': 'rmse',
        'min_split_gain': 0.5,
        'min_child_weight': 1,
        'min_child_samples': 10,
        'scale_pos_weight':1,
        'zero_as_missing': False,
        'seed':0,        
    }

In [11]:
lgb_model = lgb.train(params, train_set = train_set, num_boost_round = 10000,
                     early_stopping_rounds = 8000, verbose_eval = 500,
                     valid_sets = valid_set)

Training until validation scores don't improve for 8000 rounds
[500]	valid_0's rmse: 0.00105308
[1000]	valid_0's rmse: 0.00105308
[1500]	valid_0's rmse: 0.00105308
[2000]	valid_0's rmse: 0.00105308
[2500]	valid_0's rmse: 0.00105308
[3000]	valid_0's rmse: 0.00105308
[3500]	valid_0's rmse: 0.00105308
[4000]	valid_0's rmse: 0.00105308
[4500]	valid_0's rmse: 0.00105308
[5000]	valid_0's rmse: 0.00105308
[5500]	valid_0's rmse: 0.00105308
[6000]	valid_0's rmse: 0.00105308
[6500]	valid_0's rmse: 0.00105308
[7000]	valid_0's rmse: 0.00105308
[7500]	valid_0's rmse: 0.00105308
[8000]	valid_0's rmse: 0.00105308
Early stopping, best iteration is:
[1]	valid_0's rmse: 0.00105308


In [12]:
Y_pred_lgb = lgb_model.predict(x_test) 

## 5.5 XGBoost

XGBoost is an ensemble tree method that applies the principle of boosting weak learners (generally CARTs) using gradient descent. XGBoost improves upon the base Gradient Boosting Machines (GBM) framework through systems optimization and algorithmic enhancements.
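
The idea of boosting weak learners can be sketched in a few lines: each round fits a one-split tree (a stump) to the current residuals and adds a shrunken copy of it to the ensemble. This is plain gradient boosting for squared loss, not XGBoost's regularised algorithm; the helpers `fit_stump` and `boost` and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the threshold whose left/right means best fit the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:            # keep at least one point on the right
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

def boost(xs, ys, rounds=50, lr=0.1):
    """Start from the mean, then repeatedly correct the remaining residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # negative gradient of squared loss
        t, lm, rm = fit_stump(xs, residuals)
        stumps.append((t, lm, rm))
        preds = [p + lr * (lm if x <= t else rm) for x, p in zip(xs, preds)]
    return base, stumps, preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Toy data (assumed for illustration)
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
base, stumps, preds = boost(xs, ys)
initial_mse = mse(ys, [base] * len(ys))
final_mse = mse(ys, preds)
print(initial_mse, final_mse)
```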

Reference: https://towardsdatascience.com/https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d

In [14]:
import xgboost as xgb
from xgboost.sklearn import XGBRegressor
from sklearn.model_selection import GridSearchCV

In [21]:
xgb_clf = xgb.XGBRegressor() 
xgb_clf.fit(x_train, y_train)
Y_pred_xgb_clf = xgb_clf.predict(x_test) 

In [24]:
# XGBoost (default hyperparameters)
from math import sqrt  # needed here when the cells are run top to bottom
mae_xgb_clf = sklearn.metrics.mean_absolute_error(y_test, Y_pred_xgb_clf)
mse_xgb_clf = sklearn.metrics.mean_squared_error(y_test, Y_pred_xgb_clf)
rmse_xgb_clf = sqrt(mse_xgb_clf)
print('Mean Absolute Error for xgb:', mae_xgb_clf)
print('Mean Squared Error for xgb:', mse_xgb_clf)
print('RMSE for xgb:', rmse_xgb_clf)

Mean Absolute Error for xgb: 1.3922660937840372e-05
Mean Squared Error for xgb: 1.1130616156115258e-06
RMSE for xgb: 0.0010550173532276736


Next, I apply $\color{red}{\text{grid search}}$ to `XGBRegressor` for $\color{red}{\text{parameter tuning}}$.

In [49]:
from sklearn.model_selection import GridSearchCV
xgb_1 = xgb.XGBRegressor()
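parameters = {'nthread':[4], # with hyperthreading, XGBoost may become slower
              'objective':['reg:linear'],
              # NOTE: the trailing `0,1` is parsed as two separate values (0 and 1), which gives
              # 7 learning rates and the 21 candidates reported below; 0.1 was probably intended
              'learning_rate': [0.01, 0.03, 0.05, 0.07, 0.09, 0,1], # the so-called `eta` value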
              'max_depth': [5, 6, 7],
              'min_child_weight': [4],
              'silent': [1],
              'subsample': [0.7],
              'colsample_bytree': [0.7],
              'n_estimators': [500]}
xgb_grid = GridSearchCV(xgb_1, parameters, cv = 2, n_jobs = 5, verbose=True).fit(x_train, y_train)
print("Best score: %0.3f" % xgb_grid.best_score_)
print("Best parameters set:", xgb_grid.best_params_)

Fitting 2 folds for each of 21 candidates, totalling 42 fits


[Parallel(n_jobs=5)]: Using backend LokyBackend with 5 concurrent workers.
[Parallel(n_jobs=5)]: Done  42 out of  42 | elapsed:  7.5min finished


Best score: -0.326
Best parameters set: {'colsample_bytree': 0.7, 'learning_rate': 0.03, 'max_depth': 7, 'min_child_weight': 4, 'n_estimators': 500, 'nthread': 4, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.7}
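
The candidate count reported by the grid search can be checked by hand. The sketch below enumerates the combinations of the two parameters that vary. It assumes the trailing `0,1` in the learning-rate list was meant to be `0.1`; that reading is a guess, and the recorded run used the list as written.

```python
from itertools import product

# Learning rates exactly as written in the grid above
learning_rates_as_written = [0.01, 0.03, 0.05, 0.07, 0.09, 0, 1]
# Assumed intention: 0.1 as a single value (hypothetical correction)
learning_rates_intended = [0.01, 0.03, 0.05, 0.07, 0.09, 0.1]
max_depths = [5, 6, 7]

# All other parameters have a single value, so they do not multiply the grid
n_written = len(list(product(learning_rates_as_written, max_depths)))
n_intended = len(list(product(learning_rates_intended, max_depths)))
print(n_written, n_intended)
```

With the list as written, the grid has 7 × 3 = 21 candidates, which matches the log above; with `0.1` it would have 18.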


In [51]:
Y_pred_xgb_grid = xgb_grid.predict(x_test) 

In [53]:
mae_xgb_grid = sklearn.metrics.mean_absolute_error(y_test, Y_pred_xgb_grid)
mse_xgb_grid = sklearn.metrics.mean_squared_error(y_test, Y_pred_xgb_grid)
rmse_xgb_grid = sqrt(mse_xgb_grid)
print('Mean Absolute Error for xgb:', mae_xgb_grid)
print('Mean Squared Error for xgb:', mse_xgb_grid)
print('RMSE for xgb:', rmse_xgb_grid)

Mean Absolute Error for xgb: 6.654373872458162e-05
Mean Squared Error for xgb: 1.926790330466165e-06
RMSE for xgb: 0.0013880887329224185


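Unfortunately, after grid search (parameter tuning) XGBoost regression performs worse than with its default settings: the test RMSE rises from about 0.00106 to about 0.00139. The best cross-validation score of -0.326 is an $R^2$ value, the default score for regressors in `GridSearchCV`; a negative $R^2$ means that even the best candidate did worse on the held-out folds than always predicting the mean.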

# 6. Model Evaluation

Here, I use the Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) to compare the performance of the five models above.
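
For reference, the three metrics can be computed by hand as in the sketch below. The helper `regression_errors` and the toy values are hypothetical; the cells that follow use the `sklearn.metrics` functions instead.

```python
from math import sqrt

def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equally long sequences."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    n = len(residuals)
    mae = sum(abs(r) for r in residuals) / n   # mean of absolute residuals
    mse = sum(r * r for r in residuals) / n    # mean of squared residuals
    return mae, mse, sqrt(mse)                 # RMSE is the square root of MSE

# Toy values (assumed for illustration); the residuals are 1, 0, -2, -1
mae_demo, mse_demo, rmse_demo = regression_errors([3.0, 5.0, 2.0, 7.0],
                                                  [2.0, 5.0, 4.0, 8.0])
print(mae_demo, mse_demo, rmse_demo)
```

Because MSE squares each residual, a few large errors raise MSE and RMSE much more than MAE, so the metrics can rank models differently.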

In [6]:
# Linear Regression
from sklearn import metrics
from math import sqrt
mae_reg = sklearn.metrics.mean_absolute_error(y_test, Y_pred_reg)
mse_reg = sklearn.metrics.mean_squared_error(y_test, Y_pred_reg)
rmse_reg = sqrt(mse_reg)
print('Mean Absolute Error for linear regression:', mae_reg)
print('Mean Squared Error for linear regression:', mse_reg)
print('RMSE for linear regression:', rmse_reg)

Mean Absolute Error for linear regression: 4.222730515245781e-05
Mean Squared Error for linear regression: 1.110787566610547e-06
RMSE for linear regression: 0.0010539390715836219


In [40]:
# SGD Regression
mae_SGD = sklearn.metrics.mean_absolute_error(y_test, Y_pred_SGD)
mse_SGD = sklearn.metrics.mean_squared_error(y_test, Y_pred_SGD)
rmse_SGD = sqrt(mse_SGD)
print('Mean Absolute Error for SGD regression:', mae_SGD)
print('Mean Squared Error for SGD regression:', mse_SGD)
print('RMSE for SGD regression:', rmse_SGD)

Mean Absolute Error for SGD regression: 1.4390628220122832e-05
Mean Squared Error for SGD regression: 1.1081851737717412e-06
RMSE for SGD regression: 0.0010527037445415216


In [50]:
# Decision Tree Regression
mae_dtr = sklearn.metrics.mean_absolute_error(y_test, Y_pred_dtr)
mse_dtr = sklearn.metrics.mean_squared_error(y_test, Y_pred_dtr)
rmse_dtr = sqrt(mse_dtr)
print('Mean Absolute Error for decision tree regression:', mae_dtr)
print('Mean Squared Error for decision tree regression:', mse_dtr)
print('RMSE for decision tree regression:', rmse_dtr)

Mean Absolute Error for decision tree regression: 8.933313045823305e-06
Mean Squared Error for decision tree regression: 1.1081831788114237e-06
RMSE for decision tree regression: 0.0010527027969999053


In [13]:
# LGBM
mae_lgb = sklearn.metrics.mean_absolute_error(y_test, Y_pred_lgb)
mse_lgb = sklearn.metrics.mean_squared_error(y_test, Y_pred_lgb)
rmse_lgb = sqrt(mse_lgb)
print('Mean Absolute Error for LightGBM:', mae_lgb)
print('Mean Squared Error for LightGBM:', mse_lgb)
print('RMSE for LightGBM:', rmse_lgb)

Mean Absolute Error for LightGBM: 2.2715692714974698e-05
Mean Squared Error for LightGBM: 1.1082376533448567e-06
RMSE for LightGBM: 0.0010527286703347909


In [54]:
# XGBoost
mae_xgb_clf = sklearn.metrics.mean_absolute_error(y_test, Y_pred_xgb_clf)
mse_xgb_clf = sklearn.metrics.mean_squared_error(y_test, Y_pred_xgb_clf)
rmse_xgb_clf = sqrt(mse_xgb_clf)
print('Mean Absolute Error for xgb:', mae_xgb_clf)
print('Mean Squared Error for xgb:', mse_xgb_clf)
print('RMSE for xgb:', rmse_xgb_clf)

Mean Absolute Error for xgb: 1.3922660937840372e-05
Mean Squared Error for xgb: 1.1130616156115258e-06
RMSE for xgb: 0.0010550173532276736


In [73]:
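# Collect the metrics of all five models in one comparison table
index = ['Linear Regression', 'SGD Regression', 'Decision Tree Regression', 'LGBM', 'XGBoost']
mae = [mae_reg, mae_SGD, mae_dtr, mae_lgb, mae_xgb_clf]
mse = [mse_reg, mse_SGD, mse_dtr, mse_lgb, mse_xgb_clf]
rmse = [rmse_reg, rmse_SGD, rmse_dtr, rmse_lgb, rmse_xgb_clf]
# Use new names so that the builtin `dict` and the data frame `df` are not overwritten
results = {'Algorithm': index,
           'Mean Absolute Error': mae,
           'Mean Squared Error': mse,
           'Root Mean Squared Error': rmse}
pd.set_option('display.precision', 9)  # full option name; the bare 'precision' is ambiguous in newer pandas
results_df = pd.DataFrame(results)
results_df.set_index('Algorithm')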

| Algorithm | Mean Absolute Error | Mean Squared Error | Root Mean Squared Error |
|---|---|---|---|
| Linear Regression | 4.2227e-05 | 1.111e-06 | 0.001053939 |
| SGD Regression | 1.4391e-05 | 1.108e-06 | 0.001052704 |
| Decision Tree Regression | 8.933e-06 | 1.108e-06 | 0.001052703 |
| LGBM | 2.2716e-05 | 1.108e-06 | 0.001052729 |
| XGBoost | 1.3923e-05 | 1.113e-06 | 0.001055017 |


As the performance evaluation table shows:

(1) For Mean Absolute Error, Decision Tree Regression < XGBoost < SGD Regression < LGBM < Linear Regression.

(2) For Mean Squared Error, Decision Tree Regression ≈ SGD Regression ≈ LGBM < Linear Regression < XGBoost.

(3) For Root Mean Squared Error, Decision Tree Regression < SGD Regression < LGBM < Linear Regression < XGBoost.

Thus, $\color{red}{\text{Decision Tree Regression}}$ achieves the best performance in used car price prediction on all three metrics, although the RMSE values of the five models differ by less than 0.3%.