Rusty Bargain used car sales service is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build the model to determine the value. 

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training

## Data preparation

In [106]:
import math 
import numpy as np
import pandas as pd
import math #needed for later
import seaborn as sns
import time
import statistics

import lightgbm as lgb
from catboost import CatBoostClassifier
from sklearn.metrics import mean_squared_error
import xgboost as xgb

from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.dummy import DummyRegressor
import sklearn.linear_model
import sklearn.metrics
import sklearn.neighbors
import sklearn.preprocessing
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

from IPython.display import display

In [107]:
rb = pd.read_csv('/datasets/car_data.csv')

In [108]:
rb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

In [109]:
rb.corr()

Unnamed: 0,Price,RegistrationYear,Power,Mileage,RegistrationMonth,NumberOfPictures,PostalCode
Price,1.0,0.026916,0.158872,-0.333199,0.110581,,0.076055
RegistrationYear,0.026916,1.0,-0.000828,-0.053447,-0.011619,,-0.003459
Power,0.158872,-0.000828,1.0,0.024002,0.04338,,0.021665
Mileage,-0.333199,-0.053447,0.024002,1.0,0.009571,,-0.007698
RegistrationMonth,0.110581,-0.011619,0.04338,0.009571,1.0,,0.013995
NumberOfPictures,,,,,,,
PostalCode,0.076055,-0.003459,0.021665,-0.007698,0.013995,,1.0


In [110]:
rb.shape[0]

354369

In [111]:
rb_columns = list(rb.columns)
for i in range(len(rb_columns)):
    rb_columns[i] = rb_columns[i].lower()
rb.columns = rb_columns

Changed the columns to lowercase.

In [112]:
#https://pastebin.com/w09CX298

for column in rb_columns:
    rb = rb[rb[column].isna()==False].reset_index(drop=True)
rb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 245814 entries, 0 to 245813
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   datecrawled        245814 non-null  object
 1   price              245814 non-null  int64 
 2   vehicletype        245814 non-null  object
 3   registrationyear   245814 non-null  int64 
 4   gearbox            245814 non-null  object
 5   power              245814 non-null  int64 
 6   model              245814 non-null  object
 7   mileage            245814 non-null  int64 
 8   registrationmonth  245814 non-null  int64 
 9   fueltype           245814 non-null  object
 10  brand              245814 non-null  object
 11  notrepaired        245814 non-null  object
 12  datecreated        245814 non-null  object
 13  numberofpictures   245814 non-null  int64 
 14  postalcode         245814 non-null  int64 
 15  lastseen           245814 non-null  object
dtypes: int64(7), object(

In [113]:
rb.shape[0]

245814

Got rid of null and NaNs.

In [114]:
rb = rb[rb['registrationyear'] <= 2016].reset_index(drop=True) #2016 is the cutoff
rb = rb.drop(['datecrawled','datecreated','lastseen'],axis=1)
#datecrawled, datecreated and lastseen don't seem to have any reasonable connection to car price.

In [115]:
rb.corr()

Unnamed: 0,price,registrationyear,power,mileage,registrationmonth,numberofpictures,postalcode
price,1.0,0.554831,0.200518,-0.39788,0.044229,,0.065936
registrationyear,0.554831,1.0,0.070637,-0.352075,0.036221,,0.035987
power,0.200518,0.070637,1.0,0.035526,0.01695,,0.020162
mileage,-0.39788,-0.352075,0.035526,1.0,-0.00701,,-0.011501
registrationmonth,0.044229,0.036221,0.01695,-0.00701,1.0,,0.004006
numberofpictures,,,,,,,
postalcode,0.065936,0.035987,0.020162,-0.011501,0.004006,,1.0


Removed unreasonable registration years.

## Model training

In [116]:
#https://pastebin.com/SpP9WNbg

categorical_feats = ['vehicletype', 'gearbox', 'model', 'fueltype', 'brand','notrepaired']
#datecrawled, datecreated and lastseen don't seem to have any reasonable connection to car price.

#rb[categorical_feats].columns
encoder = OrdinalEncoder()
rb[categorical_feats] = OrdinalEncoder().fit_transform(rb[categorical_feats])
#rb_ordinal = pd.DataFrame(encoder.fit_transform(rb), columns=rb.columns)

features = rb.drop('price',axis=1)
target = rb['price']

features_train, features_rem, target_train, target_rem = train_test_split(features,target, test_size=0.66, random_state=12345)
features_valid, features_test, target_valid, target_test = train_test_split(features_rem,target_rem, test_size=0.50, random_state=12345)
#REMINDER: WE ALWAYS NEED TO VALIDATE 


start = time.time() #this ruins %%time
#regressor because lbgm encoder wasn't working
model = lgb.LGBMRegressor(num_iterations=1000, metric='rmse') 
model.fit(features_train, target_train, eval_set=(features_valid, target_valid), categorical_feature = categorical_feats, verbose=100)
end = time.time()

#Dividing total time by amount of iterations
print(f"cross_val_score : {cross_val_score(model, features, target, cv=3)}")
print(f"Mean Prediction Time [s]: {((end-start)/1000):.3f}")
print(); print(model)





New categorical_feature is ['brand', 'fueltype', 'gearbox', 'model', 'notrepaired', 'vehicletype']


[100]	valid_0's rmse: 1682.9
[200]	valid_0's rmse: 1652.79
[300]	valid_0's rmse: 1640.21
[400]	valid_0's rmse: 1635.13
[500]	valid_0's rmse: 1632.73
[600]	valid_0's rmse: 1631.16
[700]	valid_0's rmse: 1629.98
[800]	valid_0's rmse: 1629.43
[900]	valid_0's rmse: 1628.81
[1000]	valid_0's rmse: 1629.58




cross_val_score : [0.88571371 0.88360142 0.88534358]
Mean Prediction Time [s]: 0.022

LGBMRegressor(metric='rmse', num_iterations=1000)


In [117]:
#https://pastebin.com/bZ4qJVWt

rsmes = []
train_times = []
prediction_times = []

for depth in range(1,5):
    for est in range(1,100,10):
        train_start = time.time()
        model = RandomForestRegressor(n_estimators=est, max_depth=depth)
        model.fit(features_train, target_train)
        train_end = time.time()
        pred_start = time.time()
        prediction_valid = model.predict(features_valid)
        pred_end = time.time()
        train_times.append(train_end-train_start)
        prediction_times.append(pred_end-pred_start)
        rsmes.append(mean_squared_error(target_valid,prediction_valid)**0.5)
       
print(f"cross_val_score : {cross_val_score(model, features, target, cv=3)}")
print(f"mean Training Time [s]: {statistics.mean(prediction_times):.3f}")
print(f"mean Prediction Time [s]: {statistics.mean(prediction_times):.3f}")
print('mean rsme:', statistics.mean(rsmes))
print(); print(model)

cross_val_score : [0.70921007 0.70668836 0.7006047 ]
mean Training Time [s]: 0.074
mean Prediction Time [s]: 0.074
mean rsme: 3042.605565265592

RandomForestRegressor(max_depth=4, n_estimators=91)


In [118]:
train_start = time.time()
model = LinearRegression()
model.fit(features_train, target_train)
train_end = time.time()
pred_start = time.time()
prediction_valid = model.predict(features_valid)
pred_end = time.time()
print(f"Training Time [s]: {(train_end-train_start):.3f}")
print(f"Prediction Time [s]: {(pred_end-pred_start):.3f}")
print('rsme:', mean_squared_error(target_valid,prediction_valid)**0.5)
print(); print(model)


Training Time [s]: 0.024
Prediction Time [s]: 0.004
rsme: 3358.737285801417

LinearRegression()


In [119]:
train_start = time.time()
dummy_regr = DummyRegressor(strategy="mean")
dummy_regr.fit(features_train, target_train)
DummyRegressor()
train_end = time.time()
pred_start = time.time()
prediction_valid = dummy_regr.predict(features_valid)
pred_end = time.time()
print(f"cross_val_score : {cross_val_score(dummy_regr, features, target, cv=3)}")
print(f"Training Time [s]: {(train_end-train_start):.3f}")
print(f"Prediction Time [s]: {(pred_end-pred_start):.3f}")
print('rsme:', mean_squared_error(target_valid,prediction_valid)**0.5)
print(); print(dummy_regr)


cross_val_score : [-1.38797105e-08 -1.04945077e-05 -9.84056020e-06]
Training Time [s]: 0.001
Prediction Time [s]: 0.001
rsme: 4719.215729334451

DummyRegressor()


885. Speed of 0.001 s. Time: 2.54 ms.

## Model analysis

In [120]:
start = time.time() #this ruins %%time
#regressor because lbgm encoder wasn't working
model = lgb.LGBMRegressor(num_iterations=1000, metric='rmse') 
model.fit(features_test, target_test, eval_set=(features_valid, target_valid), categorical_feature = categorical_feats, verbose=100)
end = time.time()

#Dividing total time by amount of iterations
print(f"cross_val_score : {cross_val_score(model, features, target, cv=3)}")
print(f"Mean Prediction Time [s]: {((end-start)/1000):.3f}")
print(); print(model)

New categorical_feature is ['brand', 'fueltype', 'gearbox', 'model', 'notrepaired', 'vehicletype']


[100]	valid_0's rmse: 1684.99
[200]	valid_0's rmse: 1657.49
[300]	valid_0's rmse: 1645.39
[400]	valid_0's rmse: 1638.2
[500]	valid_0's rmse: 1633.28
[600]	valid_0's rmse: 1630.72
[700]	valid_0's rmse: 1628.51
[800]	valid_0's rmse: 1627.61
[900]	valid_0's rmse: 1627.09
[1000]	valid_0's rmse: 1626.74




cross_val_score : [0.88571371 0.88360142 0.88534358]
Mean Prediction Time [s]: 0.030

LGBMRegressor(metric='rmse', num_iterations=1000)


DummyRegressor may be the fastest, but lightgbm is second fastest and has the best root mean squared error score. Which gets lower by the number of iteration. The higher the iteration, the smaller the rsme score. At the expense of being very slow, thousands of iterations should lead to a small enough score. 

# Checklist

Type 'x' to check. Then press Shift+Enter.

- [x]  Jupyter Notebook is open
- [x]  Code is error free
- [x]  The cells with the code have been arranged in order of execution
- [x]  The data has been downloaded and prepared
- [x]  The models have been trained
- [x]  The analysis of speed and quality of the models has been performed