<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Thanks for taking the time to improve the project! I left replies to your comments below. The project is now accepted, and you can move on to the next sprint. Keep up the good work! 

</div>

**Review**

Hi, my name is Dmitry and I will be reviewing your project.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did a great job overall, but there is one problem that needs to be fixed before the project is accepted. Let me know if you have any questions!

Rusty Bargain used car sales service is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build the model to determine the value. 

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training

## Data preparation

### Import Data and Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer
from sklearn.preprocessing import StandardScaler

from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor, Pool
import xgboost as xgb

In [2]:
df = pd.read_csv('/datasets/car_data.csv')

### Overview of Data

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

In [4]:
df.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


<div class="alert alert-success">
<b>Reviewer's comment</b>

The data was loaded and inspected

</div>

### Preprocess Data

#### Rename Columns

In [5]:
df = df.rename(columns={'DateCrawled': 'date_crawled', 'Price' : 'price', 'VehicleType' : 'vehicle_type', 'RegistrationYear' : 'registration_year', 'Gearbox' : 'gearbox', 'Power' : 'power', 'Model' : 'model', 'Mileage' : 'mileage', 'RegistrationMonth' : 'registration_month', 'FuelType' : 'fuel_type', 'Brand' : 'brand', 'NotRepaired' : 'not_repaired', 'DateCreated' : 'date_created', 'NumberOfPictures' : 'number_of_pictures', 'PostalCode' : 'postal_code', 'LastSeen' : 'last_seen'})

Columns have been renamed to their ideal snake case format.

<div class="alert alert-success">
<b>Reviewer's comment</b>

Good!

</div>

#### Convert date columns to datetime data type

In [6]:
df['date_crawled'] = pd.to_datetime(df['date_crawled'], format='%d/%m/%Y %H:%M')
df['date_created'] = pd.to_datetime(df['date_created'], format='%d/%m/%Y %H:%M')
df['last_seen'] = pd.to_datetime(df['last_seen'], format='%d/%m/%Y %H:%M')

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   date_crawled        354369 non-null  datetime64[ns]
 1   price               354369 non-null  int64         
 2   vehicle_type        316879 non-null  object        
 3   registration_year   354369 non-null  int64         
 4   gearbox             334536 non-null  object        
 5   power               354369 non-null  int64         
 6   model               334664 non-null  object        
 7   mileage             354369 non-null  int64         
 8   registration_month  354369 non-null  int64         
 9   fuel_type           321474 non-null  object        
 10  brand               354369 non-null  object        
 11  not_repaired        283215 non-null  object        
 12  date_created        354369 non-null  datetime64[ns]
 13  number_of_pictures  354369 no

The contents of the columns containing dates - `date_crawled`, `date_created`, and `last_seen` - have been converted to the `datetime64[ns]` datatype.

### Check for duplicate rows

In [8]:
df.duplicated().sum()

262

In [9]:
df = df.drop_duplicates().reset_index(drop=True)

In [10]:
df.duplicated().sum()

0

Duplicate rows have been removed.

<div class="alert alert-success">
<b>Reviewer's comment</b>

Alright!

</div>

### Handling missing values

In [11]:
df.isna().sum()

date_crawled              0
price                     0
vehicle_type          37484
registration_year         0
gearbox               19830
power                     0
model                 19701
mileage                   0
registration_month        0
fuel_type             32889
brand                     0
not_repaired          71145
date_created              0
number_of_pictures        0
postal_code               0
last_seen                 0
dtype: int64

#### Filling missing values with "unknown"

Rather than removing any of the rows with missing values, we will fill all the missing fields with `"unknown"` as they are categorical rather than numerical. Had there been numerical columns, we could have found a way to fill them with values rather than an arbitrary designation as we have.

This is because creating models that take unkown variables into account will be a more accurate representation of real world scenarios as many people selling vehicles may not fill out every field pertaining to their vehicle's information.

In [12]:
df = df.fillna('unknown')

<div class="alert alert-success">
<b>Reviewer's comment</b>

Ok, this is one way to deal with missing values

</div>

#### Verify no more missing values

In [13]:
df.isna().sum()

date_crawled          0
price                 0
vehicle_type          0
registration_year     0
gearbox               0
power                 0
model                 0
mileage               0
registration_month    0
fuel_type             0
brand                 0
not_repaired          0
date_created          0
number_of_pictures    0
postal_code           0
last_seen             0
dtype: int64

All missing values have been filled, and no more remain.

### Observe values in each categorical column

#### `vehicle_type`

In [14]:
df['vehicle_type'].value_counts()

sedan          91399
small          79753
wagon          65115
unknown        37484
bus            28752
convertible    20180
coupe          16147
suv            11991
other           3286
Name: vehicle_type, dtype: int64

#### `gearbox`

In [15]:
df['gearbox'].value_counts()

manual     268034
auto        66243
unknown     19830
Name: gearbox, dtype: int64

#### `model`

In [16]:
df['model'].value_counts()

golf                  29215
other                 24402
3er                   19744
unknown               19701
polo                  13057
                      ...  
i3                        8
serie_3                   4
rangerover                4
range_rover_evoque        2
serie_1                   2
Name: model, Length: 251, dtype: int64

In [17]:
models = df['model'].unique()

In [18]:
### Displaying models in order
models.sort()
models

array(['100', '145', '147', '156', '159', '1_reihe', '1er', '200',
       '2_reihe', '300c', '3_reihe', '3er', '4_reihe', '500', '5_reihe',
       '5er', '601', '6_reihe', '6er', '7er', '80', '850', '90', '900',
       '9000', '911', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a8',
       'a_klasse', 'accord', 'agila', 'alhambra', 'almera', 'altea',
       'amarok', 'antara', 'arosa', 'astra', 'auris', 'avensis', 'aveo',
       'aygo', 'b_klasse', 'b_max', 'beetle', 'berlingo', 'bora',
       'boxster', 'bravo', 'c1', 'c2', 'c3', 'c4', 'c5', 'c_klasse',
       'c_max', 'c_reihe', 'caddy', 'calibra', 'captiva', 'carisma',
       'carnival', 'cayenne', 'cc', 'ceed', 'charade', 'cherokee',
       'citigo', 'civic', 'cl', 'clio', 'clk', 'clubman', 'colt', 'combo',
       'cooper', 'cordoba', 'corolla', 'corsa', 'cr_reihe', 'croma',
       'crossfire', 'cuore', 'cx_reihe', 'defender', 'delta', 'discovery',
       'doblo', 'ducato', 'duster', 'e_klasse', 'elefantino', 'eos',
       'escort', 'espac

`rangerover` appears along with `range_rover` - these are the same vehicle model in different formatting so they will be standardized.

In [19]:
df['model'] = df['model'].where(df['model'] != 'rangerover', 'range_rover')

<div class="alert alert-success">
<b>Reviewer's comment</b>

Nice!

</div>

#### `fuel_type`

In [20]:
df['fuel_type'].value_counts()

petrol      216161
gasoline     98658
unknown      32889
lpg           5307
cng            565
hybrid         233
other          204
electric        90
Name: fuel_type, dtype: int64

`petrol` and `gasoline` are the same fuel type. This being the case, we will rename `gasoline` to `petrol` to reflect this. `lpg` is liquified petroleum gas or propane autogas, as it's not represented twice, we will leave it as it is. `cng` is compressed natural gas, which consists primarily of methane, and like lpg is unique, so it will remain as it is.

In [21]:
df['fuel_type'] = df['fuel_type'].where(df['fuel_type'] != 'gasoline', 'petrol')

<div class="alert alert-success">
<b>Reviewer's comment</b>

Good!

</div>

In [22]:
df['fuel_type'].value_counts()

petrol      314819
unknown      32889
lpg           5307
cng            565
hybrid         233
other          204
electric        90
Name: fuel_type, dtype: int64

`gasoline` values have been changed to `petrol` successfully.

#### `brand`

In [23]:
df['brand'].value_counts()

volkswagen        76960
opel              39902
bmw               36881
mercedes_benz     32025
audi              29439
ford              25163
renault           17915
peugeot           10988
fiat               9634
seat               6901
mazda              5611
skoda              5490
smart              5241
citroen            5143
nissan             4936
toyota             4601
hyundai            3583
sonstige_autos     3373
volvo              3207
mini               3201
mitsubishi         3022
honda              2817
kia                2463
suzuki             2320
alfa_romeo         2311
chevrolet          1751
chrysler           1439
dacia               898
daihatsu            806
subaru              762
porsche             758
jeep                677
trabant             589
land_rover          545
daewoo              542
saab                526
jaguar              505
rover               486
lancia              471
lada                225
Name: brand, dtype: int64

We are checking for abbreviations for repeated brands, i.e. Chevrolet and Chevy. The only potentially present is `land_rover` and `rover`. However, Rover my represent a different brand in and of itself as the brand has been owned by numerous companies over its life. We will observe entries that belong to `rover` and see if they represent only `land_rover` or more than just those vehicles.

In [24]:
display(df.query("brand == 'rover'")[['model']].value_counts())

model      
other          394
unknown         82
range_rover      4
freelander       3
discovery        2
defender         1
dtype: int64

The only known values of the `model` column when the `brand` is rover are Land Rover vehicles. However, this makes up 10 of the nearly 500 vehicles, so there is not enough information to change `rover` to `land_rover` and being that there are unique vehicles to the Rover brand, found through cursory research, that did or do not exist as part of the Land Rover brand, it is best to keep it as a unique entry.

#### `not_repaired`

In [25]:
df['not_repaired'].value_counts()

no         246927
unknown     71145
yes         36035
Name: not_repaired, dtype: int64

<div class="alert alert-warning">
<b>Reviewer's comment</b>

Nice work on data preprocessing, although there are some problems with the data you seem to have missed. I suggest checking numerical feature distributions

</div>

<div class="alert alert-info">
Below the train_test_split I've used standard scaling on the features - I believe this is what you mean by checking the numerical feature distributions - the outcome being that they need to be standardized to be optimized for model creation. Please correct me if I'm wrong though.
</div>

<div class="alert alert-warning">
<b>Reviewer's comment V2</b>

Not quite what I meant: there are some values which don't make sense (for example registration years below 1900 or above 2016, which is the dataset collection date if I recall correctly, or prices equal to 0, or horse power more that the most powerful prototype cars). It would be a good idea to remove such data points: the model is as good as the data :)

</div>

## Model training

### Convert categorical columns to numerical with OHE

In [26]:
### Remove columns with datetime so they don't go through OHE
df2 = df.drop(['date_crawled', 'date_created', 'last_seen'], axis=1)

In [27]:
df_ohe = pd.get_dummies(df2, drop_first=True)

<div class="alert alert-success">
<b>Reviewer's comment</b>

Categorical features were encoded

</div>

In [28]:
### Add back columns with dateime to dataframe
# df_ohe['date_crawled'] = pd.to_numeric(df.date_crawled)
# df_ohe['date_created'] = pd.to_numeric(df.date_created)
# df_ohe['last_seen'] = pd.to_numeric(df.last_seen)
### Blocked in order to not remove, but not process

<div class="alert alert-warning">
<b>Reviewer's comment</b>

Not sure what you need these dates for: they don't seem to have any connection to price

</div>

<div class="alert alert-info">
The only reason I replaced them in the data is because they are specified to be features in the data description for the project. Maybe that's just the generalized term for "not the target", but I erred on the side of leaving them in the data than removing them just in case. I've kept the code but blocked it from running as the three sets of dates aren't likely connected to price. I have no doubt the dates could be more useful in showing trends over time in an EDA or SDA, but as we're only building models, they're not terribly relevant at this time.
</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

> Maybe that's just the generalized term for "not the target"
    
Yep, that's exactly it :)

</div>

In [29]:
df_ohe.head()

Unnamed: 0,price,registration_year,power,mileage,registration_month,number_of_pictures,postal_code,vehicle_type_convertible,vehicle_type_coupe,vehicle_type_other,...,brand_smart,brand_sonstige_autos,brand_subaru,brand_suzuki,brand_toyota,brand_trabant,brand_volkswagen,brand_volvo,not_repaired_unknown,not_repaired_yes
0,480,1993,0,150000,0,0,70435,0,0,0,...,0,0,0,0,0,0,1,0,1,0
1,18300,2011,190,125000,5,0,66954,0,1,0,...,0,0,0,0,0,0,0,0,0,1
2,9800,2004,163,125000,8,0,90480,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,1500,2001,75,150000,6,0,91074,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,3600,2008,69,90000,7,0,60437,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Categorical columns have been encoded to become numerical columns to be able to go through regressive modeling. We first removed the date columns to avoid issues with the one-hot encoding process then added them back into the encoded dataframe making it ready to split into training and test data.

### Splitting data for training and testing

In [30]:
target = df_ohe['price']
features = df_ohe.drop(['price', 'postal_code', 'registration_month'], axis=1)

features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size=0.2, random_state=759638)
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.25, random_state=759638)

<div class="alert alert-info">
I didn't add the date columnss back in above and removed the `postal_code` as well from my initial submission as it's just one more feature to bog down the model and there isn't a benefit to keeping it since the model can't interpret the actual meaning. I also removed `registration_month` as it's an arbitrary value which logically carries little to no weight on the price of a vehicle when compared to the year it's made, mileage, etc.
</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

These are great points! 

</div>

<div class="alert alert-success">
<b>Reviewer's comment</b>

The data was split into train and test sets

</div>

Data has been separated into targets and features, then it was split at a 3:1:1 ratio into the training, validation, and test sets.

### Stadardize Scaling of Features

In [31]:
numeric = ['registration_year', 'power', 'number_of_pictures']

scaler = StandardScaler()
scaler.fit(features_train[numeric]) 
features_train[numeric] = scaler.transform(features_train[numeric])
features_valid[numeric] = scaler.transform(features_valid[numeric])
features_test[numeric] = scaler.transform(features_test[numeric])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  features_train[numeric] = scaler.transform(features_train[numeric])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value[:, i].tolist(), pi)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  features_valid[numeric] = scaler.transform(features_valid[numeric])
A value is

<div class="alert alert-info">
I scaled the features, training the scalar on the training data set, which is the way the lessons taught. I have mixed thoughts on this, so any clarification you could offer would be appreciated. I understand the practice as the training set is the largest and theoretically the most representative of the data. However, not training with the validation or test sets in the features also seems less than ideal as we're scaling data then applying that scaling to an entirely different set. When validating or testing models in the way it makes sense as there is merit to doing so, however scaling in this same fashion doesn't seem optimal.
</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Scaling is done correctly! The scaler can't be trained on validation/test data for the simple reason that scaler is part of the model and we treat validation/test data as unseen by the model, so just like with the model itself, the scaler can only be fit using the train set.
    
You are correct that if the train and validation/test set distributions are significantly diffirent, this can lead to problems, but this is also true about the model itself, just like for the scaler: the model only works under the assumption that train and validation/test (and more importantly the new data the model will encounter in production which is emulated by validation and test data) are coming from the same distribution. 

</div>

### Training Models

#### Linear Regression

In [37]:
lr_model = LinearRegression()
lr_model.fit(features_train, target_train)
lr_prediction_valid = lr_model.predict(features_valid)
lr_RMSE_valid = mean_squared_error(target_valid, lr_prediction_valid, squared=False)

In [38]:
%%timeit -r 5
lr_model = LinearRegression()
lr_model.fit(features_train, target_train)

20.9 s ± 171 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)


In [39]:
%%timeit -r 5
lr_prediction = lr_model.predict(features_test)
lr_RMSE = mean_squared_error(target_test, lr_prediction, squared=False)

200 ms ± 1.02 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)


In [40]:
lr_prediction = lr_model.predict(features_test)
lr_RMSE = mean_squared_error(target_test, lr_prediction, squared=False)

In [41]:
print("Linear Regression model:")
print("Validation data RMSE:", lr_RMSE_valid)
print("Test data RMSE:", lr_RMSE)
print("Model training time: 20.9 s ± 171 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)")
print("Model evaluation time: 200 ms ± 1.02 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)")

Linear Regression model:
Validation data RMSE: 3267.256234555461
Test data RMSE: 3268.418536679359
Model training time: 20.9 s ± 171 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
Model evaluation time: 200 ms ± 1.02 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)


#### Decision Tree Regressor

In [52]:
### Hyperparameter tuning loop
# dtr_depth_score = []
#
# for depth in range(75, 96):
#     model = DecisionTreeRegressor(max_depth=depth, random_state=759638)
#     model.fit(features_train, target_train)
#     prediction_train = model.predict(features_train)
#     score = mean_squared_error(target_train, prediction_train, squared=False)
#     dtr_depth_score.append([depth, score])

In [53]:
### Display tuning results
# display(dtr_depth_score)

In [55]:
### Validaation set RMSE
### 29 first sub 1000 RMSE -> 999.4349397355475
### 85 first with minimum RMSE of 859.7694180104155

In [59]:
dtr_model = DecisionTreeRegressor(max_depth=85, random_state=759638)
dtr_model.fit(features_train, target_train)

dtr_prediction_valid = dtr_model.predict(features_valid)
dtr_RMSE_valid = mean_squared_error(target_valid, dtr_prediction_valid, squared=False)

dtr_prediction = dtr_model.predict(features_test)
dtr_RMSE = mean_squared_error(target_test, dtr_prediction, squared=False)

In [60]:
%%timeit -r 5 
dtr_model = DecisionTreeRegressor(max_depth=85, random_state=759638)
dtr_model.fit(features_train, target_train)

8.06 s ± 72.6 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)


In [61]:
%%timeit -r 5 
dtr_prediction_valid = dtr_model.predict(features_valid)
dtr_RMSE_valid = mean_squared_error(target_valid, dtr_prediction_valid, squared=False)

115 ms ± 1.88 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)


In [63]:
print("Decision Tree Regressor model:")
print("Depth: 85")
print("Validation data RMSE:", dtr_RMSE_valid)
print("Test data RMSE:", dtr_RMSE)
print("Model training time: 8.06 s ± 72.6 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)")
print("Model evaluation time: 115 ms ± 1.88 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)")

Decision Tree Regressor model:
Depth: 85
Validation data RMSE: 2123.7956299915595
Test data RMSE: 2124.4425812318923
Model training time: 8.06 s ± 72.6 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
Model evaluation time: 115 ms ± 1.88 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)


#### Random Forest Regressor

In [133]:
### Hyperparameter tuning loop
#
# rfr_depth_score = []
#
# for est in range(91, 100):
#     for depth in range (26, 29):
#         model = RandomForestRegressor(random_state=759638, n_estimators=est, max_depth=depth)
#         model.fit(features_train, target_train)
#         prediction_valid = model.predict(features_valid)
#         score = mean_squared_error(target_valid, prediction_valid, squared=False)
#         rfr_depth_score.append([est, depth, score])

In [132]:
### Display tuning results
# display(rfr_depth_score)

In [134]:
rfr_model = RandomForestRegressor(random_state=759638, n_estimators=94, max_depth=28)
rfr_model.fit(features_train, target_train)

rfr_prediction_valid = rfr_model.predict(features_valid)
rfr_RMSE_valid = mean_squared_error(target_valid, rfr_prediction_valid, squared=False)

rfr_prediction = rfr_model.predict(features_test)
rfr_RMSE = mean_squared_error(target_test, rfr_prediction, squared=False)

In [135]:
%%timeit -r 5
rfr_model = RandomForestRegressor(random_state=759638, n_estimators=94, max_depth=28)
rfr_model.fit(features_test, target_test)

2min 13s ± 1.76 s per loop (mean ± std. dev. of 5 runs, 1 loop each)


In [136]:
%%timeit -r 5
rfr_prediction_valid = rfr_model.predict(features_valid)
rfr_RMSE_valid = mean_squared_error(target_valid, rfr_prediction_valid, squared=False)

2.77 s ± 34.5 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)


In [138]:
print("Random Forest Regressor model:")
print("Estimators: 94")
print("Depth: 28")
print("Validation data RMSE:", rfr_RMSE_valid)
print("Test data RMSE:", rfr_RMSE)
print("Model training time: 2min 13s ± 1.76 s per loop (mean ± std. dev. of 5 runs, 1 loop each)")
print("Model evaluation time: 2.77 s ± 34.5 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)")

Random Forest Regressor model:
Estimators: 94
Depth: 28
Validation data RMSE: 1763.309152866489
Test data RMSE: 1763.767514478163
Model training time: 2min 13s ± 1.76 s per loop (mean ± std. dev. of 5 runs, 1 loop each)
Model evaluation time: 2.77 s ± 34.5 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)


#### LGBM Regressor

In [76]:
lgbm_model = LGBMRegressor(boosting_type='gbdt', num_leaves=100, max_depth=-1, learning_rate=0.3, n_estimators=100, subsample_for_bin=200000, random_state=759638)
lgbm_model.fit(features_train, target_train)

prediction_valid = lgbm_model.predict(features_valid)
lgbm_RMSE_valid = mean_squared_error(target_valid, prediction_valid, squared=False)

prediction_test = lgbm_model.predict(features_test)
lgbm_RMSE = mean_squared_error(target_test, prediction_test, squared=False)

In [77]:
%%timeit -r 5
lgbm_model = LGBMRegressor(boosting_type='gbdt', num_leaves=100, max_depth=-1, learning_rate=0.3, n_estimators=100, subsample_for_bin=200000, random_state=759638)
lgbm_model.fit(features_train, target_train)

8.08 s ± 244 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)


In [78]:
%%timeit -r 5
prediction_valid = lgbm_model.predict(features_valid)
lgbm_RMSE = mean_squared_error(target_train, prediction_train, squared=False)

1.03 s ± 26.9 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)


In [80]:
print("Gradiant Boost Decision Tree:")
print("Leaves: 100")
print("Maximum Depth: Unlimited")
print("Learning Rate: 0.3")
print("Maximum estimators: 100")
print("Validation data RMSE:", lgbm_RMSE_valid)
print("Test data RMSE:", lgbm_RMSE)
print("Model training time: 8.08 s ± 244 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)")
print("Model evaluation time: 1.03 s ± 26.9 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)")

Gradiant Boost Decision Tree:
Leaves: 100
Maximum Depth: Unlimited
Learning Rate: 0.3
Maximum estimators: 100
Validation data RMSE: 1738.2845762747445
Test data RMSE: 1740.3174084649806
Model training time: 8.08 s ± 244 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
Model evaluation time: 1.03 s ± 26.9 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)


#### CatBoost Regressor

In [111]:
cbr_model = CatBoostRegressor(iterations=100, learning_rate=0.7, random_state=759638)
cbr_model.fit(features_train, target_train)

cbr_prediction_valid = cbr_model.predict(features_valid)
cbr_RMSE_valid = mean_squared_error(target_valid, cbr_prediction_valid, squared=False)

cbr_prediction = cbr_model.predict(features_test)
cbr_RMSE = mean_squared_error(target_test, cbr_prediction, squared=False)

0:	learn: 2928.2395467	total: 51ms	remaining: 5.05s
1:	learn: 2509.2562876	total: 101ms	remaining: 4.94s
2:	learn: 2350.8368205	total: 150ms	remaining: 4.85s
3:	learn: 2259.0073576	total: 199ms	remaining: 4.79s
4:	learn: 2208.9414953	total: 247ms	remaining: 4.69s
5:	learn: 2178.7901448	total: 289ms	remaining: 4.53s
6:	learn: 2133.9077040	total: 340ms	remaining: 4.52s
7:	learn: 2114.4654014	total: 385ms	remaining: 4.42s
8:	learn: 2092.1760936	total: 433ms	remaining: 4.37s
9:	learn: 2072.4204759	total: 480ms	remaining: 4.32s
10:	learn: 2061.6664071	total: 524ms	remaining: 4.24s
11:	learn: 2047.5902996	total: 574ms	remaining: 4.21s
12:	learn: 2034.0607720	total: 626ms	remaining: 4.19s
13:	learn: 2025.5155880	total: 671ms	remaining: 4.12s
14:	learn: 2008.7536454	total: 721ms	remaining: 4.08s
15:	learn: 1998.8068215	total: 770ms	remaining: 4.04s
16:	learn: 1994.0210785	total: 811ms	remaining: 3.96s
17:	learn: 1987.7295859	total: 853ms	remaining: 3.88s
18:	learn: 1978.1809019	total: 899ms	re

In [112]:
%%timeit -r 5
cbr_model = CatBoostRegressor(iterations=100, learning_rate=0.7, random_state=759638)
cbr_model.fit(features_train, target_train)

0:	learn: 2928.2395467	total: 55.5ms	remaining: 5.5s
1:	learn: 2509.2562876	total: 104ms	remaining: 5.08s
2:	learn: 2350.8368205	total: 152ms	remaining: 4.92s
3:	learn: 2259.0073576	total: 201ms	remaining: 4.82s
4:	learn: 2208.9414953	total: 244ms	remaining: 4.63s
5:	learn: 2178.7901448	total: 285ms	remaining: 4.46s
6:	learn: 2133.9077040	total: 337ms	remaining: 4.48s
7:	learn: 2114.4654014	total: 384ms	remaining: 4.42s
8:	learn: 2092.1760936	total: 429ms	remaining: 4.34s
9:	learn: 2072.4204759	total: 476ms	remaining: 4.28s
10:	learn: 2061.6664071	total: 520ms	remaining: 4.21s
11:	learn: 2047.5902996	total: 562ms	remaining: 4.12s
12:	learn: 2034.0607720	total: 610ms	remaining: 4.08s
13:	learn: 2025.5155880	total: 653ms	remaining: 4.01s
14:	learn: 2008.7536454	total: 699ms	remaining: 3.96s
15:	learn: 1998.8068215	total: 749ms	remaining: 3.93s
16:	learn: 1994.0210785	total: 789ms	remaining: 3.85s
17:	learn: 1987.7295859	total: 832ms	remaining: 3.79s
18:	learn: 1978.1809019	total: 877ms	r

In [113]:
%%timeit -r 5
cbr_prediction_valid = cbr_model.predict(features_valid)
cbr_RMSE_valid = mean_squared_error(target_valid, cbr_prediction_valid, squared=False)

47.4 ms ± 1.22 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)


In [114]:
print("CatBoost Regressor:")
print("Learning Rate: 0.7")
print("Validation data RMSE:", cbr_RMSE_valid)
print("Test data RMSE:", cbr_RMSE)
print("Model training time: 6.51 s ± 28.1 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)")
print("Model evaluation time: 47.4 ms ± 1.22 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)")

CatBoost Regressor:
Learning Rate: 0.7
Validation data RMSE: 1824.6818586154889
Test data RMSE: 1829.6451823152663
Model training time: 6.51 s ± 28.1 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
Model evaluation time: 47.4 ms ± 1.22 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)


<div class="alert alert-danger">
<s><b>Reviewer's comment</b>

It's great that you tried different models, tuned their hyperparameters and measured the time it takes to train and evaluate the model. There is one problem though: you used the train set scores to tune hyperparameters, which makes no sense. Train set score tells us nothing about how well the model generalizes to new data, only how well it memorized the train set. We can't use the test set either, as it is used to evaluate the final model. There are two options: use a separate validation set or use cross-validation for hyperparameter tuning

</div>

<div class="alert alert-info">
I went with creating a validation set reather than creating a new scorer for RMSE to be used with cross validation - I'm fairly certain that cross validation would yield better results - but as I see this as more of a theoretical examination of the benefits of gradiant boosting with models, this was the more simple way to correct my previous oversights.

I retuned hyperparameters for the models and adjusted the results below along with my conlcusion to reflect the changes. I also timed the training of the models and the evaluation seperately as you suggested after the analysis.
</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

It's fine to use a validation set here, as we have enough data :)
    
Cross-validation can be better in a sense that it can work with less data, and it can estimate not just the metric, but also its variance.
    
> I retuned hyperparameters for the models and adjusted the results below along with my conlcusion to reflect the changes. I also timed the training of the models and the evaluation seperately as you suggested after the analysis.
    
Awesome! 

</div>

## Model analysis

**Linear Regression model:**

Validation data RMSE: 3267.256234555461

Test data RMSE: 3268.418536679359

Model training time: 20.9 s ± 171 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

Model evaluation time: 200 ms ± 1.02 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)


**Decision Tree Regressor model:**

Depth: 85

Validation data RMSE: 2123.7956299915595

Test data RMSE: 2124.4425812318923

Model training time: 8.06 s ± 72.6 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

Model evaluation time: 115 ms ± 1.88 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)


**Random Forest Regressor model:**

Estimators: 94

Depth: 28

Validation data RMSE: 1763.309152866489

Test data RMSE: 1763.767514478163

Model training time: 2min 13s ± 1.76 s per loop (mean ± std. dev. of 5 runs, 1 loop each)

Model evaluation time: 2.77 s ± 34.5 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)


**Gradiant Boost Decision Tree:**

Leaves: 100

Maximum Depth: Unlimited

Learning Rate: 0.3

Maximum estimators: 100

Validation data RMSE: 1738.2845762747445

Test data RMSE: 1740.3174084649806

Model training time: 8.08 s ± 244 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

Model evaluation time: 1.03 s ± 26.9 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)


**CatBoost Regressor:**

Learning Rate: 0.7

Validation data RMSE: 1824.6818586154889

Test data RMSE: 1829.6451823152663

Model training time: 6.51 s ± 28.1 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

Model evaluation time: 47.4 ms ± 1.22 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

Using the *Linear Regression* model as our sanity check - the baseline for our RMSE is rounded to `3268.4` on the test data, and training time of 20.9 seconds on average with a model evaluation time of 200 milliseconds on average. The best performing non-gradiant boosted model, with extensive time spent tuning hyperparameters, was the *Random Forest Regressor* model resulting in a rounded RMSE of `1763.7` on the test data, with an average training time of 2 minutes 13 seconds and average evaluation time of 2.77 seconds of average. The *Decision Tree Regressor* model with a rounded RMSE of `2124.4` on the test data, an average training time of 8.06 seconds, and average evaluation time of 115 milliseconds, represented a faster model than the previous two but scored worse than the Random Forest Regressor.

The gradiant boosted models we created resulted in the best choice when accuracy and processing time is taken into consideration. The *Gradiant Boost Decision Tree* model utilizing the Light GBM library, which resulted in a rounded RMSE of `1740.3` on the test data, average training time of 8.008 seconds, and and an average evaluation time of 1.03 seconds. The CatBoost Regressor, which resulted in a rounded RMSE of `1829.6` on the test data, with average training time of 6.51 seconds, and average evaluation time of 47.4 milliseconds. The gradiant boosted models are the ideal candidates for use - the decision would depend on whether accuracy or speed was the greater desire, though the Gradiant Boost Decision Tree model utilizing the LightGBM library isn't much slower than the CatBoost Regressor model with greater accuracy overall.

<div class="alert alert-warning">
<b>Reviewer's comment</b>

Conclusions look good. One possible improvement would be to separately measure training and prediction time.

</div>

<div class="alert alert-info">
Hopefully I made all the appropriate improvements. The models have had hyperparameters retuned, and the conclusion has been altered.
</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Yep, well done! Here we can note that training time is probably less important than prediction time (for example, as the new model is trained the old one can just keep running, but prediction time affects throughput)

</div>

# Checklist

Type 'x' to check. Then press Shift+Enter.

- [x]  Jupyter Notebook is open
- [x]  Code is error free
- [x]  The cells with the code have been arranged in order of execution
- [x]  The data has been downloaded and prepared
- [x]  The models have been trained
- [x]  The analysis of speed and quality of the models has been performed