Hello Mazin!

I’m happy to review your project today.
I will mark your mistakes and give you some hints how it is possible to fix them. We are getting ready for real job, where your team leader/senior colleague will do exactly the same. Don't worry and study with pleasure!

Below you will find my comments - **please do not move, modify or delete them**.

You can find my comments in green, yellow or red boxes like this:

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Success. Everything is done successfully.
</div>

<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Remarks. Some recommendations.
</div>

<div class="alert alert-block alert-danger">

<b>Reviewer's comment</b> <a class="tocSkip"></a>

Needs fixing. The block requires some corrections. Work can't be accepted with the red comments.
</div>

You can answer me by using this:

<div class="alert alert-block alert-info">
<b>Student answer.</b> <a class="tocSkip"></a>

Text here.
</div>

# Determining Value of Used Cars Through Algorithm Analysis and Gradient Descent Training

Rusty Bargain used car sales service is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build the model to determine the value. 

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

## Data preparation

In [1]:
#installing required libraries
!pip install -U scikit-learn
!pip install lightgbm

#importing libraries
import pandas as pd
import numpy as np
import time
from sklearn.preprocessing import OrdinalEncoder
import datetime as dt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV



In [2]:
df = pd.read_csv('/datasets/car_data.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

In [3]:
df.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


Three columns, 'DateCrawled', 'DateCreated' and 'LastSeen', are stored as objects but actually hold dates. We will start by converting these columns to their intended type.

Of the 3 columns, 'DateCrawled' appears to offer the least value as a feature in our predictive models, and is unlikely to provide any insight into car value. We will therefore drop it, and convert the other 2 date columns to datetime.

In [4]:
#converting 'DateCreated' and 'LastSeen' to datetime
df[['DateCreated', 'LastSeen']] = df[['DateCreated', 'LastSeen']].apply(pd.to_datetime, dayfirst=True)
#dropping 'DateCrawled'
df.drop('DateCrawled',axis=1,inplace=True)
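
The conversion above relies on `dayfirst=True` to interpret strings such as `24/03/2016 11:52`. As a hedged alternative, the sketch below passes an explicit format string instead, so a malformed row raises an error rather than being guessed at. The toy frame `sample` is an assumption for illustration; its values are copied from the `df.head()` output.

```python
import pandas as pd

# Toy frame mimicking the timestamp layout seen in df.head(): day/month/year hour:minute
sample = pd.DataFrame({
    "DateCreated": ["24/03/2016 00:00", "14/03/2016 00:00"],
    "LastSeen": ["07/04/2016 03:16", "05/04/2016 12:47"],
})

date_cols = ["DateCreated", "LastSeen"]
for col in date_cols:
    # An explicit format removes any day/month ambiguity and fails loudly on bad rows
    sample[col] = pd.to_datetime(sample[col], format="%d/%m/%Y %H:%M")

print(sample.dtypes)
```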

In [5]:
#confirming the 2 remaining date columns have been converted successfully
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 15 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   Price              354369 non-null  int64         
 1   VehicleType        316879 non-null  object        
 2   RegistrationYear   354369 non-null  int64         
 3   Gearbox            334536 non-null  object        
 4   Power              354369 non-null  int64         
 5   Model              334664 non-null  object        
 6   Mileage            354369 non-null  int64         
 7   RegistrationMonth  354369 non-null  int64         
 8   FuelType           321474 non-null  object        
 9   Brand              354369 non-null  object        
 10  NotRepaired        283215 non-null  object        
 11  DateCreated        354369 non-null  datetime64[ns]
 12  NumberOfPictures   354369 non-null  int64         
 13  PostalCode         354369 non-null  int64   

We see quite a few NaN values under 'VehicleType', 'FuelType', and 'NotRepaired' columns. 

In [6]:
df.isna().sum()

Price                    0
VehicleType          37490
RegistrationYear         0
Gearbox              19833
Power                    0
Model                19705
Mileage                  0
RegistrationMonth        0
FuelType             32895
Brand                    0
NotRepaired          71154
DateCreated              0
NumberOfPictures         0
PostalCode               0
LastSeen                 0
dtype: int64

A clearer picture emerges of our columns with NaN placeholders. 
- Around 20% of listings have a NaN value for 'NotRepaired'
- More than 10% of vehicle listings have NaN for 'VehicleType'
- Almost 10% of listings have a NaN value for 'FuelType' 
- Between 5-6% of vehicles have NaN values under 'Gearbox' or 'Model' 
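
The percentages above can also be computed directly rather than by hand. A minimal sketch, assuming a toy frame `toy` built from the first five rows of `df.head()` (only three columns are kept for brevity):

```python
import numpy as np
import pandas as pd

# Three columns from the first five rows shown by df.head()
toy = pd.DataFrame({
    "Price": [480, 18300, 9800, 1500, 3600],
    "VehicleType": [np.nan, "coupe", "suv", "small", "small"],
    "NotRepaired": [np.nan, "yes", np.nan, "no", "no"],
})

# isna().mean() is the fraction of missing values per column; scale to percent
missing_pct = (toy.isna().mean() * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

On the full data, the same expression applied to `df` should reproduce the shares listed above.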

In [7]:
df['NotRepaired'].value_counts()

no     247161
yes     36054
Name: NotRepaired, dtype: int64

In [8]:
df['VehicleType'].value_counts()

sedan          91457
small          79831
wagon          65166
bus            28775
convertible    20203
coupe          16163
suv            11996
other           3288
Name: VehicleType, dtype: int64

In [9]:
df['FuelType'].value_counts()

petrol      216352
gasoline     98720
lpg           5310
cng            565
hybrid         233
other          204
electric        90
Name: FuelType, dtype: int64

In [10]:
#df['Model'].value_counts()

These 3 columns have the most NaNs or missing values, and our first inclination was to delete at least 2 of them. On closer inspection, each has a weakness, but none that justifies dropping it:
- 'NotRepaired' is heavily skewed: about 87% of the non-missing values are 'no', and the value is missing for roughly 1 in 5 listings. A skewed feature can still separate prices, so we keep it.
- 'VehicleType' suffers from overlapping values. We treat 'sedan' and 'small', the top 2 categories, as the same vehicle type and combine them under 'sedan'. We keep the column because about 46% of the non-missing values point to other vehicle types that significantly affect final price points.
- 'FuelType' is similarly affected by multiple values that point to the same thing. 'petrol' and 'gasoline', the top 2 values, are the British and American words for the same fuel type, and together they account for ~98% of all non-NaN listings. We merge them into a single category and keep the column.
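
Whether an imbalanced column such as 'NotRepaired' still carries price information can be checked before deciding its fate. The sketch below is illustrative only: the prices are made up and are not taken from the Rusty Bargain data. It compares the share of each category with the median price within it.

```python
import pandas as pd

# Made-up values for illustration only; not taken from the real dataset
toy = pd.DataFrame({
    "NotRepaired": ["no", "no", "no", "no", "yes", "unknown"],
    "Price": [5000, 6000, 7000, 8000, 1500, 3000],
})

# How dominant is each category?
share = toy["NotRepaired"].value_counts(normalize=True)

# Does the target differ between categories despite the imbalance?
median_price = toy.groupby("NotRepaired")["Price"].median()

print(share)
print(median_price)
```

If the medians differ clearly between categories, the column is likely worth keeping even when one value dominates.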

In [11]:
#dropping 'NumberOfPictures' & 'PostalCode' columns as they are of little value to our ML models
df.drop(['NumberOfPictures','PostalCode'],axis=1,inplace=True)

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

Repaired and unrepaired cars of the same kind may differ greatly in price, so this feature can definitely be useful for any model. It's okay if 85% of the values are the same; that doesn't mean the feature is useless.

The situation with fuel type is similar. And you can easily join petrol and gasoline cars in the same category if you want.
    
Instead of dropping useful columns like 'NotRepaired' and 'FuelType', think about dropping NumberOfPictures, PostalCode and all columns with dates. Are these columns useful or useless?
    
</div>

<div class="alert alert-block alert-warning">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Correct. But it's better to remove features like NumberOfPictures, PostalCode because they don't have useful information for ML models.
    
</div>

In [12]:
#consolidating values for 'small' under 'VehicleType' to be 'sedan'
df['VehicleType'] = df['VehicleType'].replace('small', 'sedan')

#consolidating 'petrol' and 'gasoline' for FuelType
df['FuelType'] = df['FuelType'].replace('petrol', 'gasoline')

In [13]:
#filling in NaN values with 'unknown'
df = df.fillna('unknown')
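
Filling the whole frame works here because only categorical columns contain NaNs. A slightly more defensive variant, sketched on a toy frame with an assumed column list `cat_cols`, limits the placeholder to the named categorical columns so that a numeric column can never receive the string 'unknown' by accident.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Price": [480, 18300, 9800],
    "VehicleType": [np.nan, "coupe", "suv"],
    "Model": ["golf", np.nan, "grand"],
})

# Assumed list of categorical columns; in the notebook this would also
# include 'Gearbox', 'FuelType' and 'NotRepaired'
cat_cols = ["VehicleType", "Model"]
toy[cat_cols] = toy[cat_cols].fillna("unknown")

print(toy)
```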

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

It's always not a good idea to drop a row because of NaNs in several columns when you work with ML models. When you drop a whole row because of NaNs in several columns you lose information from other columns which can be useful for model training. So, instead of to drop rows with NaNs it's better to fill them. Moreover, it's super easy to fill NaNs in categorical columns. You can just fill all the NaNs with a placeholder like string "unknown".
    
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Correct. Good job!
    
</div>

In [14]:
#confirming NaNs have been filled in our df
df.isna().sum()

Price                0
VehicleType          0
RegistrationYear     0
Gearbox              0
Power                0
Model                0
Mileage              0
RegistrationMonth    0
FuelType             0
Brand                0
NotRepaired          0
DateCreated          0
LastSeen             0
dtype: int64

###### Conclusion

We filled missing values in the categorical columns with the placeholder 'unknown', and dropped the 'DateCrawled', 'NumberOfPictures', and 'PostalCode' columns for lack of value in informing our pricing model. 
We converted the 'LastSeen' and 'DateCreated' columns to datetime. We also merged the top 2 values under 'VehicleType' ('small' into 'sedan') and the top 2 values under 'FuelType' ('petrol' into 'gasoline'), since we treat each pair of seemingly separate values as the same category. 

We will begin by encoding our categorical variables, 'VehicleType', 'Gearbox', 'Brand', 'Model', 'NotRepaired', and 'FuelType', to avoid input errors when using them to inform our models. 
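
To make the effect of one-hot encoding concrete, here is a minimal sketch on a toy frame (three rows taken from `df.head()`, two categorical columns only). Each category becomes its own indicator column, named `<column>_<value>`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Power": [75, 190, 163],
    "Gearbox": ["manual", "manual", "auto"],
    "Brand": ["volkswagen", "audi", "jeep"],
})

# Numeric columns pass through unchanged; each listed column is expanded
encoded = pd.get_dummies(toy, columns=["Gearbox", "Brand"], drop_first=False)
print(encoded.columns.tolist())
```

With high-cardinality columns such as 'Model', this expansion is what produces a wide feature matrix.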

In [15]:
#note: use one-hot encoding here rather than ordinal encoding

In [16]:
categorical_variables = ['VehicleType', 'Gearbox', 'Brand', 'Model', 'NotRepaired','FuelType']

for x in categorical_variables:
    print(df[x].value_counts())
    print()

df = pd.get_dummies(df, columns=categorical_variables, drop_first=False)

sedan          171288
wagon           65166
unknown         37490
bus             28775
convertible     20203
coupe           16163
suv             11996
other            3288
Name: VehicleType, dtype: int64

manual     268251
auto        66285
unknown     19833
Name: Gearbox, dtype: int64

volkswagen        77013
opel              39931
bmw               36914
mercedes_benz     32046
audi              29456
ford              25179
renault           17927
peugeot           10998
fiat               9643
seat               6907
mazda              5615
skoda              5500
smart              5246
citroen            5148
nissan             4941
toyota             4606
hyundai            3587
sonstige_autos     3374
volvo              3210
mini               3202
mitsubishi         3022
honda              2817
kia                2465
suzuki             2323
alfa_romeo         2314
chevrolet          1754
chrysler           1439
dacia               900
daihatsu            806
subaru      

<div class="alert alert-block alert-success">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

Correct
    
</div>

We have encoded our categorical features. The date features, 'LastSeen' and 'DateCreated', are not encoded; we drop them instead, as we do not use them in our regression analysis. 

In [17]:
#Dropping columns with dates
df.drop(['DateCreated','LastSeen'],axis=1,inplace=True)

# df['DateCreated']=df['DateCreated'].map(dt.datetime.toordinal)
# df['LastSeen']=df['LastSeen'].map(dt.datetime.toordinal)
# df.info()

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

Actually dates are never used in such a way in ML models. You will have a sprint about time series where you will study how to extract features from dates for ML models. Right now it's okay just to drop columns with dates.
    
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Fixed
    
</div>

## Model training

###### Splitting our Data

In [18]:
X = df.drop('Price',axis=1)
y = df['Price']
X_train, X_test_valid, y_train, y_test_valid = train_test_split(X, y, test_size=0.30, random_state=42)
X_test, X_valid, y_test, y_valid = train_test_split(X_test_valid, y_test_valid, test_size=0.50, random_state=42)

We split our data into training, validation, and testing sets at a ratio of 70:15:15. 
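
The two-step split can be sanity-checked on stand-in data. The sketch below assumes a made-up size of 1000 rows (`n_rows`) in place of the real frame and confirms that holding out 30% and then halving the held-out part yields a 70:15:15 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 1000  # stand-in size for illustration; the real df has 354369 rows
X_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.arange(n_rows)

# Step 1: hold out 30% of the rows
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42
)
# Step 2: split the held-out 30% in half (15% + 15%)
X_te, X_va, y_te, y_va = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=42
)

print(len(X_tr), len(X_va), len(X_te))  # 700 150 150
```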

<div class="alert alert-block alert-success">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

Correct
    
</div>

In [19]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 248058 entries, 246550 to 121958
Columns: 316 entries, RegistrationYear to FuelType_unknown
dtypes: int64(4), uint8(312)
memory usage: 83.3 MB


###### Decision Tree

In [20]:
dt_params = {
    'max_depth': [5, 10, 20],
    'min_samples_split': [2, 5, 10]
}

dt_grid = GridSearchCV(DecisionTreeRegressor(), dt_params, scoring='neg_root_mean_squared_error', cv=3, n_jobs=-1)
dt_grid.fit(X_train, y_train)

dt_best = dt_grid.best_estimator_

# time only the .predict call
start_time = time.time()
dt_pred = dt_best.predict(X_valid)
end_time = time.time()
dt_rmse = mean_squared_error(y_valid, dt_pred, squared=False)

print(f"Decision Tree RMSE: {dt_rmse:.4f}")
print("Best Params:", dt_grid.best_params_)
print(f"Time Taken for Predictions: {end_time - start_time:.2f} seconds")

Decision Tree RMSE: 2033.9339
Best Params: {'max_depth': 20, 'min_samples_split': 10}
Time Taken for Predictions: 0.05 seconds


In [21]:
start_time = time.time()
dt_best.fit(X_train, y_train)
end_time = time.time()
print(f"Time Taken for Fitting/Training: {end_time - start_time:.2f} seconds")

Time Taken for Fitting/Training: 4.63 seconds
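
The pattern of timing `.fit` and `.predict` separately recurs for every model, so it could be wrapped in a small helper. This is a minimal sketch, not the notebook's own code: `time_fit_predict` and `MeanRegressor` are hypothetical names, and the tiny stand-in model exists only so the example runs without any data.

```python
import time


def time_fit_predict(model, X_train, y_train, X_eval):
    """Return (fit_seconds, predict_seconds, predictions) for a fit/predict estimator."""
    t0 = time.perf_counter()
    model.fit(X_train, y_train)          # only the training call is timed here
    fit_seconds = time.perf_counter() - t0

    t0 = time.perf_counter()
    preds = model.predict(X_eval)        # only the prediction call is timed here
    predict_seconds = time.perf_counter() - t0
    return fit_seconds, predict_seconds, preds


class MeanRegressor:
    """Hypothetical stand-in model: always predicts the mean of the training target."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


fit_s, pred_s, preds = time_fit_predict(
    MeanRegressor(), [[1], [2], [3]], [10, 20, 30], [[4], [5]]
)
print(f"fit: {fit_s:.6f}s, predict: {pred_s:.6f}s, preds: {preds}")
```

In the notebook, the same helper could presumably be called as `time_fit_predict(dt_best, X_train, y_train, X_valid)`, and likewise for the other best estimators.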


###### Random Forest

In [22]:
rf_params = {
    'n_estimators': [50, 100],
    'max_depth': [10,20],
    'min_samples_split': [2, 5]
}

rf_grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    rf_params,
    scoring='neg_root_mean_squared_error',
    cv=3,
    n_jobs=-1
)
start_time_rf = time.time()
rf_grid.fit(X_train, y_train)
end_time_rf = time.time()

rf_best = rf_grid.best_estimator_
start_time=time.time()
rf_pred = rf_best.predict(X_valid)
end_time = time.time()
rf_rmse = mean_squared_error(y_valid, rf_pred, squared=False)


print(f"Random Forest RMSE: {rf_rmse:.4f}")
print("Best Params:", rf_grid.best_params_)
print(f"Time Taken for Predictions: {end_time - start_time:.2f} seconds")

Random Forest RMSE: 1773.5471
Best Params: {'max_depth': 20, 'min_samples_split': 5, 'n_estimators': 100}
Time Taken for Fitting/Training: 3020.74 seconds
Time Taken for Predictions: 1.07 seconds


In [28]:
start_time = time.time()
rf_best.fit(X_train, y_train)
end_time = time.time()
print(f"Time Taken for Fitting/Training: {end_time - start_time:.2f} seconds")

Time Taken for Fitting/Training: 292.29 seconds


###### LightGBM

In [23]:
lgb_params = {
    'n_estimators': [50, 100],
    'max_depth': [10, 20],
    'learning_rate': [0.05, 0.1]
}

lgb_grid = GridSearchCV(LGBMRegressor(random_state=21), lgb_params,
                        scoring='neg_root_mean_squared_error', cv=3, n_jobs=-1)
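start_time_lgb = time.time()
lgb_grid.fit(X_train, y_train)
end_time_lgb = time.time()  # grid-search (tuning) time, not single-model training time
lgb_best = lgb_grid.best_estimator_
# time only the .predict call
start_time = time.time()
lgb_pred = lgb_best.predict(X_valid)
end_time = time.time()
lgb_rmse = mean_squared_error(y_valid, lgb_pred, squared=False)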

print(f"LGBM RMSE: {lgb_rmse:.4f}")
print("Best Params:", lgb_grid.best_params_)
print(f"Time Taken for Predictions: {end_time - start_time:.2f} seconds")

LGBM RMSE: 1871.6730
Best Params: {'learning_rate': 0.1, 'max_depth': 20, 'n_estimators': 100}
Time Taken for Fitting/Training: 91.39 seconds
Time Taken for Predictions: 0.41 seconds


In [29]:
start_time = time.time()
lgb_best.fit(X_train, y_train)
end_time = time.time()
print(f"Time Taken for Fitting/Training: {end_time - start_time:.2f} seconds")

Time Taken for Fitting/Training: 13.46 seconds


Our Random Forest was the best scoring model with an RMSE of about 1774, better than LightGBM at 1872, and much better than the Decision Tree at 2034. In our results, training speed is inversely related to accuracy: the less accurate the model, the faster it trained. The Decision Tree, our least accurate model, was by far the quickest at less than 5 seconds for training. LightGBM was the second fastest at 13.5 seconds, and Random Forest was last at 292 seconds: roughly 22x slower than LightGBM and more than 60x slower than the Decision Tree. 
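
The slowdown factors quoted here follow from the single-model training times printed in the cells above. A short sketch of the arithmetic, using only those reported numbers:

```python
# Single-model training times (seconds) as printed in the notebook outputs
train_seconds = {"Decision Tree": 4.63, "LightGBM": 13.46, "Random Forest": 292.29}

rf_seconds = train_seconds["Random Forest"]
ratios = {
    name: rf_seconds / secs
    for name, secs in train_seconds.items()
    if name != "Random Forest"
}

for name, ratio in ratios.items():
    print(f"Random Forest trains {ratio:.1f}x slower than {name}")
```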

Rusty Bargain is interested in a model that predicts car values with accuracy, quickly, and with minimal training/fitting time. While all the models made predictions within 0.05-1.07 seconds, there will be a tradeoff when it comes to accuracy vs. training time required, with the most accurate models tending to take longer to train.  

## Model analysis

Our Random Forest model with a max depth of 20 and n_estimators of 100 had the lowest RMSE, making it our most accurate model. This is the model we are most interested in.
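
Because RMSE measures error, lower values are better. As a reminder of what the metric computes, here is a minimal standard-library sketch; the true prices are the first three from `df.head()`, while the predictions are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared difference."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


y_true = [480, 18300, 9800]    # first three prices from df.head()
y_pred = [1480, 17300, 9800]   # made-up predictions, off by 1000, 1000 and 0

print(f"RMSE: {rmse(y_true, y_pred):.2f}")
```

The result is in the same units as the target, so an RMSE of around 1770 means a typical prediction error on the order of 1770 price units.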

Our Decision Tree's best scoring parameters were: {'max_depth': 20, 'min_samples_split': 10}

In [24]:
# start_time_DT = time.time()

# # Create and train the model with given best parameters
# model_DT = DecisionTreeRegressor(
#     random_state=21,
#     max_depth=10,
#     min_samples_split=10
# )

# model_DT.fit(X_train, y_train)

# # Predict and calculate RMSE
# y_pred_DT = model_DT.predict(X_test)
# rmse = mean_squared_error(y_test, y_pred_DT, squared=False)

# # Print results
# print('Decision Tree Regression RMSE = {:.5f}'.format(rmse))
# print("*** {:.2f} seconds ***".format(time.time() - start_time_DT))

In [25]:
# def time_eval(model_RF, X_train, y_train, X_test, y_test):
#     '''
#     This function will take a model, and train it while tracking the time to train.
#     Then, the model will be tested and scored via RMSE while tracking prediciton time. 
#     The function returns the training time, prediction time, RMSE, and the model.
#     The function converts the time to milliseconds for visualization ease. 
#     '''

#     # Train the model and record the time
#     train_start = time.perf_counter()
#     model_RF.fit(X_train, y_train)
    
#     train_time = round((time.perf_counter() - train_start) * 1000)

#     # Predict on the test set and record the time
#     pred_start = time.perf_counter()
#     preds = model_RF.predict(X_test)

#     pred_time = round((time.perf_counter() - pred_start) * 1000)

#     # Calculate rmse
#     rmse = rmse_fix(root_mean_squared_error(y_test, preds))

In [26]:
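# Create and train the model with the best max_depth and n_estimators from the grid search
# (note: min_samples_split=2 and random_state=21 here differ from the grid search's 5 and 42)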
model_RF = RandomForestRegressor(
    random_state=21,
    max_depth=20,
    min_samples_split=2,
    n_estimators=100
)
start_time_rf=time.time()
model_RF.fit(X_train, y_train)
end_time_rf=time.time()
# Predict and calculate RMSE
start_time = time.time()
y_pred_RF = model_RF.predict(X_test)
end_time = time.time()
rmse = mean_squared_error(y_test, y_pred_RF, squared=False)

# Print results
print('Random Forest Regression RMSE = {:.5f}'.format(rmse))
print(f"Time Taken for Fitting/Training: {end_time_rf - start_time_rf:.2f} seconds")
print(f"Time Taken for Predictions: {end_time - start_time:.2f} seconds")

Random Forest Regression RMSE = 1765.21981
Time Taken for Fitting/Training: 284.37 seconds
Time Taken for Predictions: 1.22 seconds


In [27]:
# start_time_LG = time.time()
# # Best Params: {'learning_rate': 0.1, 'max_depth': 20, 'n_estimators': 100}

# # Create and train the model with given best parameters
# model_LG = LGBMRegressor(
#     random_state=21,
#     learning_rate=0.1,
#     max_depth=20,
#     n_estimators=100
# )
# model_LG.fit(X_train, y_train)

# # Predict and calculate RMSE
# y_pred_LG = model_LG.predict(X_test)
# rmse = mean_squared_error(y_test, y_pred_LG, squared=False)

# # Print results
# print('Random Forest Regression RMSE = {:.5f}'.format(rmse))
# print("*** {:.2f} seconds ***".format(time.time() - start_time_LG))

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

Everything is correct. Great work! 
    
But the time you measured above includes: model initialization time, training time, prediction time and rmse calculation time. But for each model you need to measure two separate time: training time (method .fit only) and prediction time (method .predict only). So, please, fix it. To do it, you can use library `time`.
    
Be careful. Model training time and hyperparameters tuning time are not the same things. In the GridSearchCV you trained the model a lot of times but you need to measure the time for training a single model. To solve it, you need to take the best model after GridSearchCV, retrain it on train data and measure this time.
    
</div>

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Not fixed. 
    
1. For each model you need to measure two separate time: training time (method .fit only) and prediction time (method .predict only). So, please, fix it. To do it, you can use library time.
2.  Model training time and hyperparameters tuning time are not the same things. In the GridSearchCV you trained the model a lot of times but you need to measure the time for training a single model. To solve it, you need to take the best model after GridSearchCV, retrain it on train data and measure this time.
    
</div>

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V3</b> <a class="tocSkip"></a>

Please, read it carefully: "Model training time and hyperparameters tuning time are not the same things. In the GridSearchCV you trained the model a lot of times but you need to measure the time for training a single model. __To solve it, you need to take the best model after GridSearchCV, retrain it on train data and measure this time.__"
    
In the "2 Model training" part for each of 3 different models you measured hyperparameters tuning time instead of training time. Please, fix it.
    
Yes, in the "3  Model analysis" part you correctly measured training time for RandomForestRegressor but you need to do the same thing for all other models.
    
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment V4</b> <a class="tocSkip"></a>

Everything is correct now. Well done!
    
</div>

## Conclusion

Rusty Bargain asked us to build a model that predicts car value accurately, predicts quickly, and requires minimal training time. To that end we cleaned the data, trained 3 different models (a Decision Tree, a Random Forest, and a LightGBM model), and compared their results. The Decision Tree is the fastest, the Random Forest is the most accurate, and LightGBM is the most well rounded: it trains in 13.5 seconds, about 3x longer than the Decision Tree but roughly 22x faster than the Random Forest, and its RMSE of 1872 is closer to the Random Forest's 1774 than to the Decision Tree's 2034. 

We found the Random Forest to be the slowest but most accurate model on the validation set. On the test set it reached an RMSE of 1765, consistent with its validation score. Our recommendation to Rusty Bargain is to use the Random Forest if accuracy of price prediction is a more pressing priority than the time required for training and predictions. If speed is as important as accuracy, the company should opt for LightGBM, which trains roughly 22x faster and predicts more than twice as fast (0.41 vs. 1.07 seconds) while being only slightly less accurate than the Random Forest. 

# Checklist

Type 'x' to check. Then press Shift+Enter.

- [x]  Jupyter Notebook is open
- [x]  Code is error free
- [x]  The cells with the code have been arranged in order of execution
- [x]  The data has been downloaded and prepared
- [x]  The models have been trained
- [x]  The analysis of speed and quality of the models has been performed