Hello Daniel!

I’m happy to review your project today.
I will mark your mistakes and give you some hints on how to fix them. We are getting ready for a real job, where your team leader or senior colleague will do exactly the same. Don't worry, and study with pleasure!

Below you will find my comments - **please do not move, modify or delete them**.

You can find my comments in green, yellow or red boxes like this:

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Success. Everything is done successfully.
</div>

<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Remarks. Some recommendations.
</div>

<div class="alert alert-block alert-danger">

<b>Reviewer's comment</b> <a class="tocSkip"></a>

Needs fixing. The block requires some corrections. Work can't be accepted with the red comments.
</div>

You can answer me by using this:

<div class="alert alert-block alert-info">
<b>Student answer.</b> <a class="tocSkip"></a>

Text here.
</div>

# <b>Machine Learning for Rusty Bargain</b>

<b>The goal of this project is to build a predictive model that produces accurate predictions for the price of vehicles while minimizing both training and prediction time.</b>

- The data provided is composed of information related to vehicles, such as mileage, vehicle type, model, and brand.

- Several gradient boosted models must be trained, tuned, compared to each other, and sanity checked against a Linear Regression model before a final model can be chosen.

- I will also use sklearn to build a properly tuned decision tree and a tuned random forest so I can accurately compare the accuracy of gradient boosted models against that of non-boosted models.

In [3]:
import pandas as pd
import numpy as np
import scipy
from sklearn.metrics import mean_squared_error, r2_score, make_scorer, root_mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
import lightgbm as lgb
import xgboost as xgb
from catboost import Pool, CatBoostRegressor
import time

In [4]:
df = pd.read_csv('~/Desktop/Projects/Datasets/car_data.csv')

In [5]:
df.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


In [7]:
df.columns = df.columns.str.lower()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   datecrawled        354369 non-null  object
 1   price              354369 non-null  int64 
 2   vehicletype        316879 non-null  object
 3   registrationyear   354369 non-null  int64 
 4   gearbox            334536 non-null  object
 5   power              354369 non-null  int64 
 6   model              334664 non-null  object
 7   mileage            354369 non-null  int64 
 8   registrationmonth  354369 non-null  int64 
 9   fueltype           321474 non-null  object
 10  brand              354369 non-null  object
 11  notrepaired        283215 non-null  object
 12  datecreated        354369 non-null  object
 13  numberofpictures   354369 non-null  int64 
 14  postalcode         354369 non-null  int64 
 15  lastseen           354369 non-null  object
dtypes: int64(7), object(

In [8]:
df.duplicated().sum()

262

In [9]:
df = df.drop_duplicates()
df.duplicated().sum()

0

<div class="alert alert-block alert-success">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

Correct
  
</div>

# Handling Missing Values

In [12]:
df.isna().sum()

datecrawled              0
price                    0
vehicletype          37484
registrationyear         0
gearbox              19830
power                    0
model                19701
mileage                  0
registrationmonth        0
fueltype             32889
brand                    0
notrepaired          71145
datecreated              0
numberofpictures         0
postalcode               0
lastseen                 0
dtype: int64

In [14]:
df['notrepaired'] = df['notrepaired'].apply(lambda x: np.random.choice(['yes', 'no']) if pd.isna(x) else x)

In [15]:
print(df['notrepaired'].unique())

['yes' 'no']


In [16]:
df.isna().sum()

datecrawled              0
price                    0
vehicletype          37484
registrationyear         0
gearbox              19830
power                    0
model                19701
mileage                  0
registrationmonth        0
fueltype             32889
brand                    0
notrepaired              0
datecreated              0
numberofpictures         0
postalcode               0
lastseen                 0
dtype: int64

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

It's not a good idea to drop a whole row because of NaNs in some columns. When you drop a row because of NaNs in several columns, you lose information from other columns which can be useful for model training. So, when you work with ML models, it's almost always better to fill NaNs than to drop rows. Moreover, it's very easy to fill NaNs in categorical columns. It's enough to fill the NaNs with a placeholder like the string "unknown". Such an approach works really well. So, please, fill all the NaNs instead of dropping them.
  
</div>

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

It seems this issue is not fixed. You dropped 11410 + 2236 rows above only because of some NaNs in categorical features, which can be easily filled with a placeholder like the string "unknown". Please, fix it.
  
</div>

<div class="alert alert-block alert-info">
<b>Student answer.</b> <a class="tocSkip"></a>

You're right, I must have missed this.
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment V3</b> <a class="tocSkip"></a>

Fixed
  
</div>

I'll fill the missing values with the most common value (the mode) of each affected column among cars of the same model. I'll write a function that computes these per-model modes in a separate dataframe and merges them back in.

In [21]:
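def fill_missing(df, x, y):
    #Most frequent value of column y within each group of column x (None if the group has no values)
    group = df.groupby(x)[y].apply(lambda s: s.value_counts().idxmax() if not s.dropna().empty else None).reset_index()
    #Attach the group mode as a new column named "<y>_mode"
    df = df.merge(group, on = x, how = 'left', suffixes = ("", "_mode"))
    return df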

In [22]:
df = fill_missing(df, 'model', 'vehicletype')

In [23]:
df['vehicletype'] = df['vehicletype'].fillna(df['vehicletype_mode'])
df = df.drop('vehicletype_mode', axis = 1)

In [24]:
df = fill_missing(df, 'model', 'gearbox')

In [25]:
df['gearbox'] = df['gearbox'].fillna(df['gearbox_mode'])
df = df.drop('gearbox_mode', axis = 1)

In [26]:
df = fill_missing(df, 'model', 'fueltype')

In [27]:
df['fueltype'] = df['fueltype'].fillna(df['fueltype_mode'])
df = df.drop('fueltype_mode', axis = 1)

In [28]:
df['model'] = df['model'].fillna('unknown')
df.isna().sum()

datecrawled             0
price                   0
vehicletype          6827
registrationyear        0
gearbox              4130
power                   0
model                   0
mileage                 0
registrationmonth       0
fueltype             7161
brand                   0
notrepaired             0
datecreated             0
numberofpictures        0
postalcode              0
lastseen                0
dtype: int64

In [29]:
display(df[df['vehicletype'].isna()].head())

Unnamed: 0,datecrawled,price,vehicletype,registrationyear,gearbox,power,model,mileage,registrationmonth,fueltype,brand,notrepaired,datecreated,numberofpictures,postalcode,lastseen
260,04/04/2016 09:49,450,,2016,manual,0,unknown,150000,3,petrol,mitsubishi,no,04/04/2016 00:00,0,59302,06/04/2016 11:17
306,21/03/2016 14:38,200,,2009,,0,unknown,10000,0,,sonstige_autos,no,21/03/2016 00:00,0,6493,24/03/2016 02:47
435,27/03/2016 18:43,1300,,2017,manual,150,unknown,150000,10,,volkswagen,no,27/03/2016 00:00,0,70374,05/04/2016 15:15
443,24/03/2016 16:46,1950,,2017,manual,0,unknown,150000,7,petrol,volkswagen,no,24/03/2016 00:00,0,70376,30/03/2016 18:16
478,24/03/2016 17:49,0,,2000,manual,0,unknown,150000,0,,audi,yes,24/03/2016 00:00,0,72514,29/03/2016 03:45


The remaining values are still missing because filling according to the model is impossible for rows where the model itself is missing. I'll fill them according to the mode of the brand instead.
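The two-stage fill described here (first by model, then by brand) can be sketched on toy data. This is only an illustration: the helper name `fill_with_group_mode` and the toy rows are made up, and it uses `map` instead of the merge-based `fill_missing` function used in this notebook.

```python
import pandas as pd

def fill_with_group_mode(frame, key, target):
    """Fill NaNs in `target` with the most frequent `target` value of the row's `key` group."""
    # Most frequent observed value per group; rows whose key is NaN are skipped by groupby.
    modes = (
        frame.dropna(subset=[target])
        .groupby(key)[target]
        .agg(lambda s: s.value_counts().idxmax())
    )
    # Rows whose group has no known mode (or no key at all) stay NaN.
    return frame.assign(**{target: frame[target].fillna(frame[key].map(modes))})

# Hypothetical toy data: two rows have no model, so the model-level fill cannot reach them.
toy = pd.DataFrame({
    "brand": ["volkswagen", "volkswagen", "volkswagen", "audi", "audi"],
    "model": ["golf", "golf", None, "a4", None],
    "vehicletype": ["small", None, None, "sedan", None],
})

by_model = fill_with_group_mode(toy, "model", "vehicletype")       # fills the second golf row
by_brand = fill_with_group_mode(by_model, "brand", "vehicletype")  # fills the rows without a model
print(by_brand)
```

The first pass leaves exactly the rows without a model unfilled, which mirrors the situation in the real dataset.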

In [31]:
df = fill_missing(df, 'brand', 'vehicletype')

In [32]:
df['vehicletype'] = df['vehicletype'].fillna(df['vehicletype_mode'])
df = df.drop('vehicletype_mode', axis = 1)

In [33]:
df = fill_missing(df, 'brand', 'gearbox')

In [34]:
df['gearbox'] = df['gearbox'].fillna(df['gearbox_mode'])
df = df.drop('gearbox_mode', axis = 1)

In [35]:
df = fill_missing(df, 'brand', 'fueltype')

In [36]:
df['fueltype'] = df['fueltype'].fillna(df['fueltype_mode'])
df = df.drop('fueltype_mode', axis = 1)

In [37]:
df.isna().sum()

datecrawled          0
price                0
vehicletype          0
registrationyear     0
gearbox              0
power                0
model                0
mileage              0
registrationmonth    0
fueltype             0
brand                0
notrepaired          0
datecreated          0
numberofpictures     0
postalcode           0
lastseen             0
dtype: int64

<div class="alert alert-block alert-success">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

Good job!
  
</div>

All missing values have been dealt with.

In [40]:
df['numberofpictures'].unique()

array([0])

In [41]:
df = df.drop('numberofpictures', axis = 1)

Since the entire column 'numberofpictures' was full of 0's, I just dropped it from the dataframe.

In [43]:
display(df[(df['registrationmonth']==0)&(df['power']==0)].shape)

(15312, 15)

There are 15312 rows where both registrationmonth and power equal 0. This is a problem because neither of these columns should contain zeros. 
There may be zeros in other columns as well. The only column where zeros are plausible at this point is price, since some cars are likely worth nothing. I'll change the remaining zeros to np.nan so that I can see them clearly.
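A quick way to see which numeric columns contain zeros is to count them per column. The sketch below uses a made-up three-row frame rather than the real dataset.

```python
import pandas as pd

# Hypothetical toy rows; column names follow the notebook's lowercase convention.
toy = pd.DataFrame({
    "price": [480, 0, 9800],
    "power": [0, 190, 0],
    "registrationmonth": [0, 5, 8],
    "gearbox": ["manual", "manual", "auto"],
})

# Compare only the numeric columns against 0 and count the matches per column.
zero_counts = (toy.select_dtypes("number") == 0).sum()
print(zero_counts)
```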

In [45]:
print((df['price'] == 0).sum())

10770


10770 price values equal 0. I'll leave these alone and deal with the rest of the zeros in the dataset.

In [47]:
df_2 = df[['datecrawled','vehicletype','registrationyear','gearbox','power','model','mileage','registrationmonth','fueltype','brand','notrepaired','datecreated','postalcode','lastseen']
].replace(0,np.nan)
df_2.isna().sum()

datecrawled              0
vehicletype              0
registrationyear         0
gearbox                  0
power                40218
model                    0
mileage                  0
registrationmonth    37347
fueltype                 0
brand                    0
notrepaired              0
datecreated              0
postalcode               0
lastseen                 0
dtype: int64

I'll use the previous function to fill the zeros in the power column with the most common power value for each model. The registration month is more of an issue: filling it with the mean or the mode is likely to skew the dataset. In order to preserve the 37347 rows with a zero month, I'll fill the missing values with a random month between 1 and 12.
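The random fill can be made reproducible by drawing from a seeded generator. This is a minimal sketch on a toy Series; the seed value 12345 is an assumption borrowed from the `seed` used later in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)  # assumed seed, so reruns give the same months

months = pd.Series([0, 5, 0, 12, 7, 0], name="registrationmonth")
missing = months == 0  # zeros are placeholders for an unknown month

filled = months.copy()
# Draw one month in 1..12 per placeholder (the upper bound 13 is exclusive).
filled[missing] = rng.integers(1, 13, size=int(missing.sum()))
print(filled.tolist())
```

Drawing all values in one vectorized call is also much faster than calling `np.random.choice` once per row with `apply`.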

In [49]:
df[['power','registrationmonth']] = df[['power','registrationmonth']].replace(0,np.nan)

In [50]:
df = fill_missing(df, 'model', 'power')

In [51]:
df['power'] = df['power'].fillna(df['power_mode'])
df = df.drop('power_mode', axis = 1)

In [52]:
df['registrationmonth'] = df['registrationmonth'].apply(lambda x: np.random.choice([1,2,3,4,5,6,7,8,9,10,11,12]) if pd.isna(x) else x)
sorted(df['registrationmonth'].unique())

[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]

In [53]:
df['registrationmonth'] = df['registrationmonth'].astype('int')
display(df.isna().sum())

datecrawled          0
price                0
vehicletype          0
registrationyear     0
gearbox              0
power                2
model                0
mileage              0
registrationmonth    0
fueltype             0
brand                0
notrepaired          0
datecreated          0
postalcode           0
lastseen             0
dtype: int64

With two values in 'power' still missing, I'll just drop these two rows.

In [55]:
drop_index = df[df['power'].isna()].index
df = df.drop(index = drop_index, axis = 0)

All zeros, except for those in the price column, have been dealt with, and only the two rows above had to be dropped.
Next I need to change all dates to datetime format.
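The raw date strings are day-first (for example `24/03/2016 11:52`), so a value like `07/04/2016` is ambiguous unless the format is stated. The sketch below, using two sample strings from this dataset, shows parsing with an explicit format and truncating to a monthly period.

```python
import pandas as pd

raw = pd.Series(["24/03/2016 11:52", "07/04/2016 03:16"])

# An explicit day-first format removes the ambiguity between 7 April and 4 July.
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M")

# Truncate to year and month; "M" is the monthly period alias.
year_month = parsed.dt.to_period("M")
print(year_month.astype(str).tolist())  # ['2016-03', '2016-04']
```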

In [57]:
print(df['datecrawled'].iloc[:1])

0    24/03/2016 11:52
Name: datecrawled, dtype: object


In [58]:
#Changing datecrawled, datecreated, and lastseen to datetime format
#The raw strings are day-first (e.g. 24/03/2016 11:52), so the format is given explicitly
date_format = "%d/%m/%Y %H:%M"
df['datecrawled'] = pd.to_datetime(df['datecrawled'], format = date_format)
df['datecreated'] = pd.to_datetime(df['datecreated'], format = date_format)
df['lastseen'] = pd.to_datetime(df['lastseen'], format = date_format)
#keeping only the year and month
df['datecrawled'] = df['datecrawled'].dt.to_period('M')
df['datecreated'] = df['datecreated'].dt.to_period('M')
df['lastseen'] = df['lastseen'].dt.to_period('M')



<div class="alert alert-block alert-success">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

Well done!
  
</div>

# Data Preprocessing

- Split into training, validation, and test sets. Training set size = 0.7, valid and test set sizes = 0.15
- Scale data
- OHE categorical data
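The plan above can be sketched end to end on synthetic data. This is an illustration under assumptions, not the notebook's actual pipeline: the toy columns and values are made up, and it uses sklearn's `OneHotEncoder` fitted on the training part only, whereas the notebook applies `pd.get_dummies` before splitting.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "mileage": rng.integers(5_000, 150_000, size=n),
    "gearbox": rng.choice(["manual", "auto"], size=n),
    "price": rng.integers(500, 20_000, size=n),
})
X, y = toy.drop(columns="price"), toy["price"]

# 70% train, then the remaining 30% is halved: 15% validation and 15% test.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=12345)
X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=12345)

# Fit the transformers on the training part only, then reuse them everywhere.
scaler = StandardScaler().fit(X_train[["mileage"]])
encoder = OneHotEncoder(handle_unknown="ignore").fit(X_train[["gearbox"]])

def transform(part):
    """Apply the train-fitted scaler and encoder to one data part."""
    numeric = scaler.transform(part[["mileage"]])
    categorical = encoder.transform(part[["gearbox"]]).toarray()
    return np.hstack([numeric, categorical])

train_arr, valid_arr, test_arr = (transform(p) for p in (X_train, X_valid, X_test))
print(train_arr.shape, valid_arr.shape, test_arr.shape)
```

Fitting on the training part only keeps information from the validation and test sets out of the preprocessing, and `handle_unknown="ignore"` encodes categories unseen during fitting as all zeros.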

In [61]:
#Creating a copy of the dataframe
#Converting the notrepaired column to binary values
df_copy = df.copy()
df_copy['notrepaired'] = df_copy['notrepaired'].map({'yes': 1, 'no': 0})
df_copy['notrepaired'].head()



0    1
1    1
2    0
3    0
4    0
Name: notrepaired, dtype: int64

I don't see any of the columns containing dates or postal codes being useful for model training so I'll drop them.

In [63]:
df_copy.drop(['datecrawled','lastseen','datecreated','postalcode'], axis = 1, inplace = True)

In [64]:
cat_feat = ['vehicletype','gearbox','model','fueltype','brand']
df_encoded= pd.get_dummies(df_copy, columns = cat_feat, drop_first = True, dtype = int)

In [65]:
df_encoded.shape

(354105, 308)

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

1. You can't use price column for this purpose because it's a direct way for target leakage
2. Actually it's not a problem at all to use OHE for the columns with 39 or 249 unique values. That's not a lot. Moreover you can use OHE for linear models which don't care about the number of features and OrdinalEncoding for tree based models which don't like a lot of features.
  
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Correct
  
</div>

In [68]:
#split into target and features
#split data into training and temporary sets
target = df_encoded['price']
features = df_encoded.drop('price', axis=1)
seed = 12345

In [69]:
features_train, features_temp, target_train, target_temp = train_test_split(features, target, test_size = 0.3, random_state = seed)
print(features_train.shape)
print(features_temp.shape)

(247873, 307)
(106232, 307)


In [70]:
features_test, features_valid, target_test, target_valid = train_test_split(features_temp, target_temp, test_size = 0.5, random_state = seed)
print(features_test.shape)
print(features_valid.shape)

(53116, 307)
(53116, 307)


In [71]:
numeric = ['registrationyear','power','mileage','registrationmonth']
scaler = StandardScaler()
scaler.fit(features_train[numeric])
features_train[numeric] = scaler.transform(features_train[numeric])
features_valid[numeric] = scaler.transform(features_valid[numeric])
features_test[numeric] = scaler.transform(features_test[numeric])

All categorical features have been encoded and all quantitative features have been scaled.

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

You can't apply pd.get_dummies function to different data parts because you can't be sure that encoding will be the same. There are two possible solutions:
    
1. Use pd.get_dummies function only once, before splitting the data
2. Use OneHotEncoder from sklearn instead of pd.get_dummies function and train it only on train data (the best solution)
  
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Good job!
  
</div>

<div class="alert alert-block alert-warning">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

Correct. Good job! 
    
But is postalcode a real quantitative feature? Maybe it's a categorical feature? Think about it. Sometimes integer columns are not quantitative features.
  
</div>

<div class="alert alert-block alert-info">
<b>Student answer.</b> <a class="tocSkip"></a>

I made the changes you suggested above and tried both OneHotEncoder and pd.get_dummies. OneHotEncoder gave me some issues when processing the transformed dataset, mostly due to hardware limitations, while pd.get_dummies did not. I chose to drop the postal code column before splitting and encoding the data, for two reasons: 1. I wasn't sure how to handle postal code as a categorical column since there were so many unique values. 2. I reasoned that postal code wouldn't be a significant factor in predictions.
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

This is a good decision. It is unlikely that postal code will be a significant factor for price prediction.
  
</div>

# Saving the data frames for later use
* I needed to restart the kernel a few times and didn't want to run through the model training processes again, so I saved the first randomized set of data below and then loaded those datasets to continue training.

features_train.to_csv('features_train.csv', index = False) 
target_train.to_csv('target_train.csv', index = False) 
features_valid.to_csv('features_valid.csv', index = False) 
target_valid.to_csv('target_valid.csv', index = False) 
features_test.to_csv('features_test.csv', index = False) 
target_test.to_csv('target_test.csv', index = False)

features_train = pd.read_csv('features_train.csv') 
target_train = pd.read_csv('target_train.csv') 
features_valid = pd.read_csv('features_valid.csv') 
target_valid = pd.read_csv('target_valid.csv') 
features_test = pd.read_csv('features_test.csv') 
target_test = pd.read_csv('target_test.csv')

target_train = target_train['price']
target_valid = target_valid['price']
target_test = target_test['price']

seed = 12345

## Cross Validating

In [84]:
linear_model = LinearRegression()
tree_model = DecisionTreeRegressor()
forest_model = RandomForestRegressor()
rmse_scorer = make_scorer(root_mean_squared_error, greater_is_better = False)

In [85]:
linear_score = cross_val_score(linear_model, features_train, target_train, n_jobs = 8, cv = 5, scoring = rmse_scorer, error_score = 'raise')

In [86]:
tree_score = cross_val_score(tree_model, features_train, target_train, n_jobs = 8, cv = 5, scoring = rmse_scorer, error_score = 'raise')

In [87]:
forest_score = cross_val_score(forest_model, features_train, target_train, n_jobs = 8, cv = 5, scoring = rmse_scorer, error_score = 'raise')

In [88]:
linear_score = sum(linear_score) / len(linear_score)
tree_score = sum(tree_score) / len(tree_score)
forest_score = sum(forest_score) / len(forest_score)


print(f"""Average Linear Model Score: {linear_score}
\nAverage Decision Tree Score: {tree_score}
\nAverage Random Forest Score: {forest_score}""")

Average Linear Model Score: -3224.708765633549

Average Decision Tree Score: -2291.6541645868447

Average Random Forest Score: -1801.2115772706675


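These scores are negated RMSEs, because the scorer was built with greater_is_better = False. The average RMSEs are therefore about 3225, 2292, and 1801, which is very poor for all three models.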
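The sign convention can be checked with a small example. This sketch uses synthetic data from `make_regression` and sklearn's built-in `"neg_root_mean_squared_error"` scoring string, which is a swapped-in equivalent of the custom `rmse_scorer` defined in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=12345)

# scikit-learn always maximizes scores, so error metrics are reported with a flipped sign.
neg_rmse = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_root_mean_squared_error"
)

mean_rmse = -neg_rmse.mean()  # flip the sign back to get an ordinary RMSE
print(f"Mean CV RMSE: {mean_rmse:.2f}")
```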

# Sanity Check with Linear Regression

In [91]:
lin_model = LinearRegression()
start_train = time.time()
lin_model.fit(features_train, target_train)
end_train = time.time()

In [92]:
training_time = end_train - start_train

start_pred = time.time()
lin_pred_valid = lin_model.predict(features_valid)
end_pred = time.time()
prediction_time = end_pred - start_pred

rmse_valid = mean_squared_error(target_valid, lin_pred_valid) ** 0.5
r2_valid = r2_score(target_valid, lin_pred_valid)



print(f"""Linear Regression Metrics:\n
Validation RMSE = {rmse_valid}\n
Validation R2 Score = {r2_valid}\n
Training time: {training_time}\n
Prediction time: {prediction_time}
""")

Linear Regression Metrics:

Validation RMSE = 3200.9190098039044

Validation R2 Score = 0.5003233317573783

Training time: 1.9516139030456543

Prediction time: 0.04970216751098633



<div class="alert alert-block alert-danger">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

It's not a Training time. It's a Training time + Prediction time + model initialization time + rmse and r2 score calculation time. To calculate training time you need to calculate the time for method .fit() only. Moreover, you need to calculate 2 times: training time and prediction time. To calculate prediction time you need to calculate the time for method .predict() only.
  
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment V3</b> <a class="tocSkip"></a>

Correct. Good job!
  
</div>

The R2 score was 0.50, better than I would've expected for such a poor RMSE of about 3200. The training time was about 1.95 s and the prediction time about 0.05 s.
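The two metrics are tied together by the definition of R2: since R2 = 1 - MSE / Var(y), the spread of the validation prices can be recovered from the numbers reported above. The sketch below does that arithmetic, assuming the variance is taken over the same validation set.

```python
import math

rmse_valid = 3200.9190098039044  # validation RMSE reported above
r2_valid = 0.5003233317573783    # validation R2 reported above

# R2 = 1 - MSE / Var(y)  =>  std(y) = RMSE / sqrt(1 - R2)
target_std = rmse_valid / math.sqrt(1 - r2_valid)
print(f"Implied std of validation prices: {target_std:.0f}")
print(f"RMSE as a share of that std: {rmse_valid / target_std:.2f}")
```

An R2 of 0.5 means the model removes half of the variance, which still leaves an RMSE of about 71% of the standard deviation of the prices. With prices spread this widely, an RMSE near 3200 and an R2 near 0.5 are consistent with each other.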

# Decision Tree Regression

In [96]:
dt_model = DecisionTreeRegressor(random_state = seed, criterion = 'squared_error', splitter = 'best')
grid = {'max_depth':[15,18,20,22],'min_samples_leaf':[6,8,10],'min_samples_split':[100,200]}

In [97]:
dt_grid_search = GridSearchCV(estimator = dt_model, param_grid = grid, cv = 3, scoring = rmse_scorer, n_jobs = 8, verbose = 1)

start_tune = time.time()
dt_grid_search.fit(features_train, target_train)
end_tune = time.time()

Fitting 3 folds for each of 24 candidates, totalling 72 fits


In [98]:
tuning_time = (end_tune - start_tune)
print(f'''Best Params: {dt_grid_search.best_params_}\n
Tuning Time: {tuning_time}''')

Best Params: {'max_depth': 22, 'min_samples_leaf': 6, 'min_samples_split': 100}

Tuning Time: 40.339447021484375


In [99]:
start_train = time.time()
final_dt_model = DecisionTreeRegressor(random_state = seed, criterion = 'squared_error', splitter = 'best', max_depth = 22,
                                      min_samples_leaf = 6, min_samples_split = 100).fit(features_train, target_train)
end_train = time.time()

In [100]:
training_time = end_train - start_train

start_pred = time.time()
dt_pred = final_dt_model.predict(features_test)
end_pred = time.time()
prediction_time = end_pred - start_pred

dt_rmse = mean_squared_error(target_test, dt_pred) ** 0.5
dt_r2 = r2_score(target_test, dt_pred)

print(f'''Decision Tree Metrics:\n
RMSE = {dt_rmse}\n
R2 Score = {dt_r2}\n
Training Time: {training_time}\n
Prediction Time = {prediction_time}''')

Decision Tree Metrics:

RMSE = 1956.1501117440741

R2 Score = 0.8159360096214793

Training Time: 2.331043004989624

Prediction Time = 0.04661917686462402


<div class="alert alert-block alert-danger">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

It's not a model Training time. It's hyperparameter tuning time. Inside GridSearchCV you trained a dt_model 3 * 6 * 4 * 3 * 3 = 648 times. So this time is about 648 times longer than model training time. To calculate model training time, you need to take the best model, retrain it on train data and measure this time.
  
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment V3</b> <a class="tocSkip"></a>

Correct. Well done!
  
</div>

The RMSE of the tuned decision tree on the test set was 1956 and the R2 score was 0.82. This is a huge improvement over the average cross-validation score of the untuned tree (an RMSE of about 2292). The training time was only 2.3 s and the prediction time was 0.05 s.
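The number of fits reported by the grid search, and the time needed to train only the winning configuration, can both be read off without retraining by hand. The sketch below reuses the decision tree grid from this section on synthetic data; the data and the built-in scoring string are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.tree import DecisionTreeRegressor

grid = {
    "max_depth": [15, 18, 20, 22],
    "min_samples_leaf": [6, 8, 10],
    "min_samples_split": [100, 200],
}
n_candidates = len(ParameterGrid(grid))  # 4 * 3 * 2 = 24 combinations
n_fits = n_candidates * 3                # 3 folds each -> 72 fits

X, y = make_regression(n_samples=500, n_features=8, noise=5.0, random_state=12345)
search = GridSearchCV(
    DecisionTreeRegressor(random_state=12345),
    grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

best_tree = search.best_estimator_  # already refit on all of X, y
print(n_candidates, n_fits)
print(search.best_params_, f"refit took {search.refit_time_:.4f} s")
```

With the default `refit=True`, `refit_time_` measures only the final fit of the best configuration on the full training data, so it is a training time rather than a tuning time.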

<div class="alert alert-block alert-success">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Good job! This is really Prediction Time. You need to do the same thing for the LinearRegression model above.
  
</div>

# Random Forest Regression

In [106]:
rf_model = RandomForestRegressor(random_state = seed, criterion = 'squared_error')
rf_grid = {'n_estimators':[25,30,35,40,45],'max_depth':[10,12,15,18,20],
           'min_samples_leaf':[8,10],'max_features':['sqrt','log2',None]}

In [107]:
rf_grid_search = GridSearchCV(estimator = rf_model, param_grid = rf_grid, cv = 3, scoring = rmse_scorer, n_jobs = 8, verbose = 1)

start_tune = time.time()
rf_grid_search.fit(features_train, target_train)
end_tune = time.time()

Fitting 3 folds for each of 150 candidates, totalling 450 fits




In [108]:
tuning_time = end_tune - start_tune
print(f'''Best Params: {rf_grid_search.best_params_}\n
Tuning Time: {tuning_time}''')

Best Params: {'max_depth': 20, 'max_features': None, 'min_samples_leaf': 8, 'n_estimators': 45}

Tuning Time: 2031.386477947235


<div class="alert alert-block alert-success">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Yes, this is Tuning time. Well done!
  
</div>

In [110]:
final_rf_model = RandomForestRegressor(random_state = seed, criterion = 'squared_error', max_depth = 20,
                                      max_features = None, min_samples_leaf = 8, n_estimators = 45)

start_train = time.time()
final_rf_model.fit(features_train, target_train)
end_train = time.time()

In [111]:
training_time = end_train - start_train

start_pred = time.time()
rf_pred_test = final_rf_model.predict(features_test)
end_pred = time.time()
pred_time = end_pred - start_pred

rmse_test = mean_squared_error(target_test, rf_pred_test) ** 0.5
r2_test = r2_score(target_test, rf_pred_test)

print(f'''Random Forest Metrics:\n
RMSE = {rmse_test}\n
R2 Score = {r2_test}\n
Training Time: {training_time}\n
Prediction Time: {pred_time}''')

Random Forest Metrics:

RMSE = 1829.7108770386967

R2 Score = 0.8389616118167803

Training Time: 64.46907901763916

Prediction Time: 0.289154052734375


<div class="alert alert-block alert-success">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Both times are calculated correctly here. Good job!
  
</div>

Results of predictions on the test set show a massive improvement over Linear Regression and a smaller improvement over the Decision Tree, with an RMSE of 1830 and an R2 score of 0.84. However, this came at the cost of time: it took 2031 s (33.9 min) to tune the model, 64 s to train it, and 0.29 s to finish the prediction on the test set.
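Since training time and prediction time are reported separately for every model, the measurement can be wrapped in one helper so that only `.fit()` and `.predict()` are timed. This is a minimal sketch on synthetic data; the helper name `time_fit_predict` is hypothetical and not part of this notebook.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

def time_fit_predict(model, X_train, y_train, X_eval):
    """Return (predictions, training seconds, prediction seconds)."""
    start = time.perf_counter()
    model.fit(X_train, y_train)          # only .fit() counts as training time
    train_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(X_eval)  # only .predict() counts as prediction time
    predict_seconds = time.perf_counter() - start
    return predictions, train_seconds, predict_seconds

X, y = make_regression(n_samples=400, n_features=6, noise=5.0, random_state=12345)
model = RandomForestRegressor(n_estimators=10, random_state=12345)
preds, train_s, pred_s = time_fit_predict(model, X[:300], y[:300], X[300:])
print(f"train: {train_s:.3f} s, predict: {pred_s:.3f} s")
```

`time.perf_counter()` is used here because it is a monotonic, high-resolution clock intended for measuring short durations.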

# Gradient Boosting with LightGBM

 - I'll use LightGBM to compare two models: one using gradient boosted decision trees (boosting_type='gbdt'), and another using LightGBM's random forest mode (boosting_type='rf').

In [115]:
gbm_dt_model = lgb.LGBMRegressor(boosting_type='gbdt',random_state=seed,metric='rmse',device='cpu',n_estimators=1000)

gbm_dt_grid = {'max_depth': [10, 15, 20],'num_leaves': [8, 10, 12],'learning_rate': [0.01, 0.05]}

In [116]:
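#No scoring argument is passed here, so candidates are ranked by the estimator's default score (R2) rather than rmse_scorer
dt_gbm_search = GridSearchCV(estimator = gbm_dt_model, param_grid = gbm_dt_grid, cv = 3, n_jobs = 8, verbose = False)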
start_tune = time.time()
dt_gbm_search.fit(features_train, target_train)
end_tune = time.time()

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003302 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 947
[LightGBM] [Info] Number of data points in the train set: 240688, number of used features: 290
[LightGBM] [Info] Start training from score 4478.135989


In [117]:
tuning_time = (end_tune - start_tune)/60
print(f'''Best Params: {dt_gbm_search.best_params_}\n
Tuning Time (min): {tuning_time}''')

Best Params: {'learning_rate': 0.05, 'max_depth': 10, 'num_leaves': 12}

Tuning Time (min): 2.7445446650187173


In [118]:
final_gbm_dt = lgb.LGBMRegressor(boosting_type='gbdt',random_state=seed,metric='rmse',device='cpu',n_estimators=1000,
                                learning_rate = 0.05, max_depth = 10, num_leaves = 12, verbose = 0)
train_start = time.time()
final_gbm_dt.fit(features_train, target_train, eval_set = [(features_valid, target_valid)])
train_end = time.time()

In [119]:
training_time = (train_end - train_start)

start_pred = time.time()
dt_pred_test = final_gbm_dt.predict(features_test)
end_pred = time.time()
pred_time = end_pred - start_pred

dt_rmse_test = mean_squared_error(target_test, dt_pred_test) ** 0.5
dt_r2_test = r2_score(target_test, dt_pred_test)

print(f'''Light GBM Decision Tree Metrics:\n
RMSE = {dt_rmse_test}\n
R2 Score = {dt_r2_test}\n
Training Time: {training_time}\n
Prediction Time: {pred_time}''')

Light GBM Decision Tree Metrics:

RMSE = 1860.448599221835

R2 Score = 0.8335055247539529

Training Time: 1.9005839824676514

Prediction Time: 0.3388669490814209


<div class="alert alert-block alert-success">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Correct
  
</div>

This model performs fairly well, with an RMSE of 1860 and an R2 score of 0.83. It only took 2.7 min to tune and 1.9 s to train with an additional 0.3 seconds to finish predictions.

In [122]:
gbm_rf_model =lgb.LGBMRegressor(boosting_type='rf',random_state=seed, metric='rmse',device='cpu', bagging_fraction=0.9,
                                bagging_freq = 5, lambda_l1 = 0.1, lambda_l2 = 0.1, feature_fraction = 0.9)

gbm_rf_grid = {'max_depth':np.arange(10,20,2),'learning_rate':[0.01, 0.05],'n_estimators':np.arange(10,50,5),
               'num_leaves':np.arange(10,100,10)}

In [123]:
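#No scoring argument is passed here, so candidates are ranked by the estimator's default score (R2) rather than rmse_scorer
rf_gbm_search = GridSearchCV(estimator = gbm_rf_model, param_grid = gbm_rf_grid, cv = 3, n_jobs = 8, verbose = 1)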
start_tune = time.time()
rf_gbm_search.fit(features_train, target_train)
end_tune = time.time()

Fitting 3 folds for each of 720 candidates, totalling 2160 fits
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.022314 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 935
[LightGBM] [Info] Number of data points in the train set: 160458, number of used features: 286
[LightGBM] [Info] Start training from score 4479.363173
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.008174 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 934
[LightGBM] [Info] Number of data points in the train set: 160459, number of used features: 286
[LightGBM] [Info] Start training from score 4480.536897
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.007319 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is n

In [124]:
tuning_time = (end_tune - start_tune) / 60

print(f'''Best Score: {rf_gbm_search.best_params_}\n
Tuning Time: {tuning_time}''')

Best Score: {'learning_rate': 0.01, 'max_depth': 12, 'n_estimators': 45, 'num_leaves': 90}

Tuning Time: 26.672352981567382


In [125]:
final_gbm_rf = lgb.LGBMRegressor(boosting_type='rf',random_state=seed, metric='rmse',device='cpu', bagging_fraction=0.9,
                                bagging_freq = 5, lambda_l1 = 0.1, lambda_l2 = 0.1, feature_fraction = 0.9,learning_rate = 0.01,
                                 max_depth = 12, n_estimators = 45, num_leaves = 90)
start_train = time.time()
final_gbm_rf.fit(features_train, target_train, eval_set = [(features_valid, target_valid)])
end_train = time.time()



In [126]:
training_time = end_train - start_train

start_pred = time.time()
rf_pred_test = final_gbm_rf.predict(features_test)
end_pred = time.time()
pred_time = end_pred - start_pred

rf_rmse_test = mean_squared_error(target_test, rf_pred_test) ** 0.5
rf_r2_test = r2_score(target_test, rf_pred_test)

print(f'''Light GBM Random Forest Metrics:\n
RMSE = {rf_rmse_test}\n
R2 Score = {rf_r2_test}\n
Training Time: {training_time}\n
Prediction time: {pred_time}''')

Light GBM Random Forest Metrics:

RMSE = 2205.495781685934

R2 Score = 0.7660209688263315

Training Time: 1.2260899543762207

Prediction time: 0.07891392707824707


<div class="alert alert-block alert-success">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Correct
  
</div>

This model doesn't perform as well as the LightGBM gradient boosted decision tree or the standard sklearn Random Forest, with an RMSE of 2205 and an R2 score of 0.77. The real issue with this model is that it was computationally expensive to tune: the grid search took 26.7 min, although training took only 1.2 s and prediction 0.08 s. If my computer were more capable, I could search a wider range of hyperparameters and likely build a more complex and accurate model.
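
Much of the tuning cost comes from the size of the grid: 5 × 2 × 8 × 9 = 720 candidates, each fitted on 3 folds. The sketch below counts the candidates before a search is launched and shows how sampling a random subset (the idea behind `RandomizedSearchCV`) shrinks the workload. The budget of 40 candidates and the seed value are arbitrary assumptions, not settings used in this project.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same grid as gbm_rf_grid in the cell above.
gbm_rf_grid = {
    'max_depth': np.arange(10, 20, 2),
    'learning_rate': [0.01, 0.05],
    'n_estimators': np.arange(10, 50, 5),
    'num_leaves': np.arange(10, 100, 10),
}
cv = 3

# Exhaustive search: every combination of the four lists.
n_candidates = len(ParameterGrid(gbm_rf_grid))
print(f'Grid search: {n_candidates} candidates, {n_candidates * cv} fits')

# Assumed budget: draw 40 combinations at random instead of trying all 720.
sampled = list(ParameterSampler(gbm_rf_grid, n_iter=40, random_state=12345))
print(f'Randomized search: {len(sampled)} candidates, {len(sampled) * cv} fits')
```

The same dictionary could be passed to `RandomizedSearchCV` through its `param_distributions` argument together with `n_iter`, which trades an exhaustive search for a fixed, smaller number of fits.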

# Gradient Boosting With XGBoost

In [130]:
xgb_tree_model = xgb.XGBRegressor(random_state = seed, booster = 'gbtree', eval_metric = 'rmse',objective = 'reg:squarederror',
                                  subsample = 0.8, colsample_bytree = 0.8, n_estimators = 1000)

xgb_tree_grid = {'learning_rate':[0.01,0.05],'max_depth':np.arange(10,20,2)}

In [131]:
xgb_tree_search = GridSearchCV(estimator = xgb_tree_model, param_grid = xgb_tree_grid, cv = 3, verbose = 1)
start_tune = time.time()
xgb_tree_search.fit(features_train, target_train)
end_tune = time.time()

Fitting 3 folds for each of 10 candidates, totalling 30 fits

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.007207 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 932
[LightGBM] [Info] Number of data points in the train set: 160459, number of used features: 284
[LightGBM] [Info] Start training from score 4474.507905
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.007400 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 935
[LightGBM] [Info] Number of data points in the train set: 160458, number of used features: 286
[LightGBM] [Info] Start training from score 4479.363173
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.010546 seconds.
You can set `f

In [132]:
tuning_time = (end_tune - start_tune) / 60
print(f'''Best Score: {xgb_tree_search.best_params_}\n
Tuning Time: {tuning_time}\n''')

Best Score: {'learning_rate': 0.01, 'max_depth': 14}

Tuning Time: 28.021089299519858



In [133]:
final_xgb_tree = xgb.XGBRegressor(random_state = seed, booster = 'gbtree', eval_metric = 'rmse',objective = 'reg:squarederror',
                                  subsample = 0.8, colsample_bytree = 0.8, n_estimators = 1000, learning_rate = 0.01, max_depth = 14)
start_train = time.time()
final_xgb_tree.fit(features_train, target_train, eval_set = [(features_valid, target_valid)], verbose = 0)
end_train = time.time()

In [134]:
training_time = (end_train - start_train)

start_pred = time.time()
xgb_tree_pred = final_xgb_tree.predict(features_test)
end_pred = time.time()
pred_time = end_pred - start_pred

xgb_tree_rmse = mean_squared_error(target_test, xgb_tree_pred) ** 0.5
xgb_tree_r2 = r2_score(target_test, xgb_tree_pred)

print(f'''XGBoost gbtree Metrics:\n
RMSE = {xgb_tree_rmse}\n
R2 Score = {xgb_tree_r2}\n
Training Time: {training_time}\n
Prediction Time: {pred_time}''')

XGBoost gbtree Metrics:

RMSE = 1711.996044076624

R2 Score = 0.859015941619873

Training Time: 57.62065005302429

Prediction Time: 0.872236967086792


<div class="alert alert-block alert-success">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Correct
  
</div>

XGBoost's gradient boosted tree model has the best metrics so far, with an RMSE of 1712 and an R2 score of 0.86. It took 28 min to tune, 57.6 s to train, and only 0.87 s to complete predictions.
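
Both metrics are computed in this notebook as `mean_squared_error(...) ** 0.5` and `r2_score(...)`. As a sketch of what those two calls work out, here is the same arithmetic written with NumPy on made-up prices (the numbers are for illustration only and do not come from the dataset).

```python
import numpy as np

# Made-up prices for illustration only.
y_true = np.array([1000.0, 2000.0, 3000.0, 4000.0])
y_pred = np.array([1100.0, 1900.0, 3200.0, 3800.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual, in the target's own units.
rmse = np.sqrt(np.mean(residuals ** 2))

# R2: 1 minus the ratio of residual variation to total variation around the mean.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f'RMSE = {rmse:.2f}')
print(f'R2 Score = {r2:.4f}')
```

The notebook's imports also include `root_mean_squared_error` from `sklearn.metrics`, which returns the RMSE directly without the manual `** 0.5`.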

In [137]:
xgb_lin_model = xgb.XGBRegressor(random_state = seed, booster = 'gblinear', objective = 'reg:squarederror', feature_selector = 'cyclic',
                                alpha = 0.5, updater = 'coord_descent', n_estimators = 500)

xgb_lin_grid = {'learning_rate':[0.01, 0.05]}

In [138]:
xgb_lin_search = GridSearchCV(estimator = xgb_lin_model, param_grid = xgb_lin_grid, cv = 3, n_jobs = 8, verbose = 1)

start_tune = time.time()
xgb_lin_search.fit(features_train, target_train)
end_tune = time.time()

Fitting 3 folds for each of 2 candidates, totalling 6 fits


In [139]:
tuning_time = (end_tune - start_tune) / 60
print(f'''Best Score: {xgb_lin_search.best_params_}\n
Tuning Time: {tuning_time}''')
best_xgb_lin = xgb_lin_search.best_estimator_

Best Score: {'learning_rate': 0.05}

Tuning Time: 17.214201819896697


In [140]:
train_start = time.time()
final_xgb_lin = xgb.XGBRegressor(random_state = seed, booster = 'gblinear', objective = 'reg:squarederror', feature_selector = 'cyclic',
                                alpha = 0.5, updater = 'coord_descent', n_estimators = 500, learning_rate = 0.05)
final_xgb_lin.fit(features_train, target_train, eval_set = [(features_valid, target_valid)], verbose = 0)
train_end = time.time()

In [141]:
training_time = (train_end - train_start)/60

start_pred = time.time()
xgb_lin_pred = final_xgb_lin.predict(features_test)
end_pred = time.time()
pred_time = end_pred - start_pred

xgb_lin_rmse = mean_squared_error(target_test, xgb_lin_pred) ** 0.5
xgb_lin_r2 = r2_score(target_test, xgb_lin_pred)

print(f'''XGBoost gblinear Metrics:\n
RMSE = {xgb_lin_rmse}\n
R2 Score = {xgb_lin_r2}\n
Training Time: {training_time}\n
Prediction Time: {pred_time}''')

XGBoost gblinear Metrics:

RMSE = 3263.758508008104

R2 Score = 0.4876101613044739

Training Time: 5.451288970311483

Prediction Time: 0.08359313011169434


<div class="alert alert-block alert-success">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Correct
  
</div>

Linear Regression is clearly not the right choice of model for this data. Even with Gradient Boosted Linear Regression, the RMSE is 3264 and the R2 score is 0.49. Tuning with only 500 iterations took 17 min. Training time was 5.45 min (longer than all the other models), and prediction time was 0.08 s.

# Gradient Boosting with CatBoost
* Since all categorical features have already been numerically encoded, I won't pass them to the cat_features parameter.

In [145]:
cat_model = CatBoostRegressor(iterations = 1000, learning_rate = 0.5, depth = 10, boosting_type = 'Ordered', l2_leaf_reg = 3,
                             task_type = 'CPU', eval_metric = 'RMSE', early_stopping_rounds = 50, random_seed = seed)

In [146]:
start_train = time.time()
cat_model.fit(features_train, target_train, eval_set = [(features_valid, target_valid),(features_test, target_test)], verbose = 0)
end_train = time.time()

In [147]:
training_time = end_train - start_train

start_pred = time.time()
cat_test_pred = cat_model.predict(features_test)
end_pred = time.time()
prediction_time = end_pred - start_pred
cat_test_rmse = mean_squared_error(target_test, cat_test_pred) ** 0.5
cat_test_r2 = r2_score(target_test, cat_test_pred)

print(f'''CatBoost Model Test Set Metrics:
RMSE = {cat_test_rmse}\n
R2 Score = {cat_test_r2}\n
Training Time: {training_time}\n
Prediction Time: {prediction_time}''')

CatBoost Model Test Set Metrics:
RMSE = 1777.8789089421018

R2 Score = 0.8479561594014388

Training Time: 164.32920289039612

Prediction Time: 0.0320589542388916


<div class="alert alert-block alert-success">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Correct
  
</div>

CatBoost made a fairly good model with an RMSE of 1778 and an R2 score of 0.85. The training time was 164 s (2.7 min) and the prediction time was a negligible 0.03 s.

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

1. Please, to tune hyperparameters use GridSearchCV or RandomizedSearchCV. Using loops is unprofessional. Moreover, as I can see you work locally and have more than one core CPU. GridSearchCV and RandomizedSearchCV have n_jobs parameter which can significantly speed up a tuning process if you have several cores.
2. For each model you need to measure two separate times. One is training time (method fit) and one is prediction time (method predict). To do it, you can use library `time`. And you need to use both these times in the model analysis part below.
  
</div>

<div class="alert alert-block alert-info">
<b>Student answer.</b> <a class="tocSkip"></a>

I did my best to integrate all of the models into GridSearchCV. I'm aware that some of the hyperparameters could be improved; however, I was running into CPU issues. I have a 2024 MacBook Air with an M3 chip, but the RAM is limited, so I was running into very long run times and even crashing Jupyter Notebook because of the memory requirements. I opted to reduce cv to 3 and reduce parameters like num_leaves, iterations, and n_estimators.
</div>

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Good job! But you have some issues with time calculations for the first 2 models. For the rest models everything is correct.
  
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment V3</b> <a class="tocSkip"></a>

Everything is correct now. Great work!
  
</div>

## <b>Conclusion</b>

The data provided was a mess: it had some duplicates, a lot of missing values, zeros in places where there shouldn't be zeros, and a column that was completely empty. I did my best to fill missing values and zeros with appropriate values, using the mode for each car model. This allowed me to fill missing values in the gearbox, fuel type, and vehicle type features. I filled missing model values with 'unknown'.
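
The mode-based filling described above can be sketched in pandas roughly as follows. The tiny frame, the column names `model` and `gearbox`, and the helper `fill_with_group_mode` are assumptions made for illustration and may not match the preprocessing cells exactly.

```python
import pandas as pd

# Tiny made-up frame; column names are assumptions for illustration.
df = pd.DataFrame({
    'model': ['golf', 'golf', 'golf', 'corsa', 'corsa', None],
    'gearbox': ['manual', None, 'manual', 'auto', None, 'manual'],
})

# Missing model names get a placeholder category first.
df['model'] = df['model'].fillna('unknown')


def fill_with_group_mode(series):
    """Fill missing values in one group with that group's most frequent value."""
    modes = series.mode()
    if modes.empty:  # the whole group is missing, so there is nothing to copy
        return series
    return series.fillna(modes.iloc[0])


# Apply the fill separately within each car model.
df['gearbox'] = df.groupby('model')['gearbox'].transform(fill_with_group_mode)
print(df)
```

Groups in which every value is missing are left untouched by this sketch, so they would still need a fallback such as the overall mode.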

I then encoded the categorical columns and dropped most date-related columns, since these wouldn't have a significant impact on the trained models and could confuse some of them. All quantitative variables were scaled using StandardScaler.
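
As a sketch of the encoding and scaling step, the example below one-hot encodes a categorical column and standardizes a numeric one. The column names and rows are made up, one-hot encoding is assumed as the encoding scheme, and fitting the scaler on the training split only is an assumption about good practice rather than a record of what the preprocessing cells do.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up rows; column names are assumptions for illustration.
train = pd.DataFrame({'mileage': [50000.0, 100000.0, 150000.0],
                      'gearbox': ['manual', 'auto', 'manual']})
test = pd.DataFrame({'mileage': [125000.0],
                     'gearbox': ['auto']})

# One-hot encode, then align the test columns to the training columns.
train_enc = pd.get_dummies(train, columns=['gearbox'])
test_enc = pd.get_dummies(test, columns=['gearbox']).reindex(
    columns=train_enc.columns, fill_value=0)

# Fit the scaler on the training split only, then reuse it on the test split.
numeric = ['mileage']
scaler = StandardScaler().fit(train_enc[numeric])
train_enc[numeric] = scaler.transform(train_enc[numeric])
test_enc[numeric] = scaler.transform(test_enc[numeric])

print(train_enc)
print(test_enc)
```

Reusing the fitted scaler on the other splits keeps their statistics out of the training features.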



<b>Standard Models</b>

- All models were tuned through GridSearchCV.
  
  - The Linear Regression model returned an RMSE of 3200 and an R2 score of 0.5. It took 1.95 s to train and 0.04 s to complete predictions. This served only as a sanity check for the other models.
 
  * Decision Tree Regressor:
    
     - The Decision Tree Regressor was a big improvement over the Linear Regression Model.
     - RMSE = 1956
     - R2 Score = 0.81
     - Training Time: 2.3 s
     - Prediction Time: 0.04 s
       
  * Random Forest Regressor:

    - This was a significant improvement over the decision tree model, though it took far longer to tune, train, and complete predictions.
    - RMSE = 1829
    - R2 Score = 0.83
    - Training Time: 64 s
    - Prediction Time: 0.3 s

<b>Gradient Boosted Models:</b>

   * <b>LightGBM:</b>

   - I tried two different boosting types and created two different models: a gbdt model and an rf model.

     * Gradient Boosting Decision Trees:

       - This model scored better than the sklearn decision tree regressor, but not as well as the sklearn random forest model.
       - RMSE = 1860
       - R2 Score = 0.83
       - Training Time: 1.9 s
       - Prediction Time: 0.3 s

     * Random Forest
    
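       - This model did not perform as well as the LightGBM gradient boosted decision tree or the sklearn random forest, and its grid search took 26.7 min.
       - RMSE = 2205
       - R2 Score = 0.77
       - Training Time: 1.2 s
       - Prediction Time: 0.08 s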

   * <b>XGBoost:</b>

   - I built two different models with XGBoost, a Gradient Boosted Tree model and a Gradient Boosted Linear Regression model.

     * Gradient Boosted Tree:

       - The winner as far as I'm concerned: this model returned the best scores with a moderate training time.
       - RMSE = 1712
       - R2 Score = 0.86
       - Training Time: 57.6 s
       - Prediction Time: 0.87 s
    
     * Gradient Boosted Linear Regression:
    
       - This was a very poor model that I created just to see how much gradient boosting would affect linear regression trained on the data. It didn't perform as well as standard linear regression and it had the longest training time out of all the models.
       - RMSE = 3264
       - R2 Score = 0.49
       - Training Time: 5.45 min
       - Prediction Time: 0.08 s

   * <b>CatBoost:</b>

   - I only built one model with CatBoost using the Ordered boosting type. It was very comparable to the LightGBM decision tree.

     * Ordered Boosting
     
        - RMSE = 1778
        - R2 Score = 0.85
        - Training Time: 2.7 min
        - Prediction Time: 0.03 s


To sum up the findings, the best model was the XGBoost Gradient Boosted Tree model. It was the most accurate model, returning the lowest RMSE (1712) and the highest R2 score (0.86). Its training time of 57.6 s was moderate: slower than the LightGBM models, the decision tree, and linear regression, but faster than the sklearn random forest, CatBoost, and the XGBoost linear booster. In terms of making predictions, it was the slowest of all 8 models, taking 0.87 s. This is still a very short time to complete predictions and shouldn't prevent the company from choosing this model.
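
To make the comparison easier to check, the reported metrics can be collected into one table and sorted. The sketch below uses the figures from this notebook, rounded from the printed outputs and the lists above, with every time converted to seconds. The column names are my own choice for illustration.

```python
import pandas as pd

# Metrics as reported in this notebook (rounded); all times in seconds.
results = pd.DataFrame(
    [
        ('Linear Regression', 3200, 0.50, 1.95, 0.04),
        ('Decision Tree', 1956, 0.81, 2.3, 0.04),
        ('Random Forest', 1829, 0.83, 64, 0.3),
        ('LightGBM gbdt', 1860, 0.83, 1.9, 0.3),
        ('LightGBM rf', 2205, 0.77, 1.2, 0.08),
        ('XGBoost gbtree', 1712, 0.86, 57.6, 0.87),
        ('XGBoost gblinear', 3264, 0.49, 5.45 * 60, 0.08),  # 5.45 min
        ('CatBoost', 1778, 0.85, 164, 0.03),
    ],
    columns=['model', 'rmse', 'r2', 'train_s', 'predict_s'],
).set_index('model')

print(results.sort_values('rmse'))
print('Lowest RMSE:', results['rmse'].idxmin())
print('Longest training:', results['train_s'].idxmax())
print('Slowest prediction:', results['predict_s'].idxmax())
```

Sorting by `train_s` or `predict_s` instead of `rmse` gives the speed ranking, which is useful because the project goal weighs accuracy against both training and prediction time.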