Rusty Bargain, a used car sales service, is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build a model that predicts the value.

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

The purpose of this report is to compare the runtime and prediction quality of several machine learning algorithms against LightGBM. Hyperparameters are tuned for the tree-based models.
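Because training time and prediction speed are separate requirements, it can help to time `fit` and `predict` independently instead of timing a whole cell. The following is a minimal sketch, not part of the original notebook: `timed_evaluation` and `MeanBaseline` are hypothetical names, and the data is synthetic.

```python
import time

import numpy as np


class MeanBaseline:
    """Hypothetical stand-in model: always predicts the training mean."""

    def fit(self, features, target):
        self.mean_ = float(np.mean(target))
        return self

    def predict(self, features):
        return np.full(len(features), self.mean_)


def timed_evaluation(model, x_train, y_train, x_eval, y_eval):
    """Return RMSE plus separate wall-clock times for fit and predict."""
    start = time.perf_counter()
    model.fit(x_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(x_eval)
    predict_seconds = time.perf_counter() - start

    rmse = float(np.sqrt(np.mean((np.asarray(y_eval) - predictions) ** 2)))
    return {'rmse': rmse, 'fit_seconds': fit_seconds, 'predict_seconds': predict_seconds}


# Synthetic data standing in for the car features and prices
rng = np.random.default_rng(12345)
x = rng.normal(size=(1000, 3))
y = 5000 + 1000 * x[:, 0] + rng.normal(scale=100, size=1000)

report = timed_evaluation(MeanBaseline(), x[:800], y[:800], x[800:], y[800:])
print(report)
```

Any estimator with `fit` and `predict` methods, such as the scikit-learn models used later, could be passed in place of `MeanBaseline`.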

In [1]:
import pandas as pd
import numpy as np
import statistics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
import lightgbm as lgb
from sklearn.tree import DecisionTreeRegressor 

## Data preparation

In [2]:
car_data = pd.read_csv('/datasets/car_data.csv')

In [3]:
car_data.head(10)

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17
5,04/04/2016 17:36,650,sedan,1995,manual,102,3er,150000,10,petrol,bmw,yes,04/04/2016 00:00,0,33775,06/04/2016 19:17
6,01/04/2016 20:48,2200,convertible,2004,manual,109,2_reihe,150000,8,petrol,peugeot,no,01/04/2016 00:00,0,67112,05/04/2016 18:18
7,21/03/2016 18:54,0,sedan,1980,manual,50,other,40000,7,petrol,volkswagen,no,21/03/2016 00:00,0,19348,25/03/2016 16:47
8,04/04/2016 23:42,14500,bus,2014,manual,125,c_max,30000,8,petrol,ford,,04/04/2016 00:00,0,94505,04/04/2016 23:42
9,17/03/2016 10:53,999,small,1998,manual,101,golf,150000,0,,volkswagen,,17/03/2016 00:00,0,27472,31/03/2016 17:17


In [4]:
car_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(9)

In [5]:
car_data_clean = car_data.dropna()

In [6]:
len(car_data_clean)/len(car_data)

0.6936667710776055

In [7]:
df=car_data_clean

All rows containing missing values were dropped. These rows made up about 31% of the data, so roughly 69% of the original rows remain. That should still be enough data to train a robust model.
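Before dropping rows, it can be useful to see which columns contribute most of the missing values. The sketch below uses a tiny synthetic frame (`toy` is a hypothetical stand-in for `car_data`, and its values are illustrative only) to show the per-column missing share and the fraction of rows that survive `dropna()`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for car_data; values are illustrative only
toy = pd.DataFrame({
    'Price': [480, 18300, 9800, 1500],
    'VehicleType': [np.nan, 'coupe', 'suv', 'small'],
    'Model': ['golf', np.nan, 'grand', 'golf'],
    'NotRepaired': [np.nan, 'yes', np.nan, 'no'],
})

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Fraction of rows that survive dropna()
retained = len(toy.dropna()) / len(toy)
print(retained)
```

If a single column turns out to account for most of the loss, filling it with a placeholder category is one possible alternative to dropping the rows.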

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 245814 entries, 3 to 354367
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        245814 non-null  object
 1   Price              245814 non-null  int64 
 2   VehicleType        245814 non-null  object
 3   RegistrationYear   245814 non-null  int64 
 4   Gearbox            245814 non-null  object
 5   Power              245814 non-null  int64 
 6   Model              245814 non-null  object
 7   Mileage            245814 non-null  int64 
 8   RegistrationMonth  245814 non-null  int64 
 9   FuelType           245814 non-null  object
 10  Brand              245814 non-null  object
 11  NotRepaired        245814 non-null  object
 12  DateCreated        245814 non-null  object
 13  NumberOfPictures   245814 non-null  int64 
 14  PostalCode         245814 non-null  int64 
 15  LastSeen           245814 non-null  object
dtypes: int64(7), object(9)

In [9]:
# Columns are renamed to snake_case to follow Python naming conventions
df = df.rename(columns={'DateCrawled' : 'date_crawled', 'Price' : 'price', 'VehicleType' : 'vehicle_type', 
                                   'RegistrationYear' : 'registration_year', 'Gearbox' : 'gearbox',
                                   'Power' : 'power', 'Model' : 'model', 'Mileage' : 'mileage',
                                   'RegistrationMonth' : 'registration_month', 'FuelType' : 'fuel_type',
                                   'Brand' : 'brand', 'NotRepaired' : 'not_repaired', 'DateCreated' : 'date_created',
                                   'NumberOfPictures' : 'number_of_pictures', 'PostalCode' : 'postal_code', 'LastSeen' : 'last_seen'})


In [10]:
# Unnecessary features are dropped
df = df.drop(['date_crawled', 'last_seen', 'date_created'], axis=1)

In [11]:
df.head()

Unnamed: 0,price,vehicle_type,registration_year,gearbox,power,model,mileage,registration_month,fuel_type,brand,not_repaired,number_of_pictures,postal_code
3,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,0,91074
4,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,0,60437
5,650,sedan,1995,manual,102,3er,150000,10,petrol,bmw,yes,0,33775
6,2200,convertible,2004,manual,109,2_reihe,150000,8,petrol,peugeot,no,0,67112
7,0,sedan,1980,manual,50,other,40000,7,petrol,volkswagen,no,0,19348


## Model training

In [12]:
# One-hot encoding to convert categorical features
df = pd.get_dummies(df, drop_first=True)

In [13]:
# Split data into features and target
features_df = df.drop(['price'], axis=1)
target_df = df['price']

In [14]:
# Create training, validation, and test sets (60% / 20% / 20% of the data)
features_df_split, features_df_valid, target_df_split, target_df_valid = train_test_split(
    features_df, target_df, test_size=0.2, random_state=12345)

features_df_train, features_df_test, target_df_train, target_df_test = train_test_split(
    features_df_split, target_df_split, test_size=0.25, random_state=12345)
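The two-stage split above yields a 60/20/20 partition: the first call holds out 20% for validation, and the second takes 25% of the remaining 80%, which is another 20% of the total. A small sketch on a synthetic index (the array `rows` is a stand-in, not the real data) confirms the arithmetic:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(1000)  # stand-in for the row index of the real data

# First split: hold out 20% of all rows for validation
rest, valid = train_test_split(rows, test_size=0.2, random_state=12345)
# Second split: 25% of the remaining 80% is another 20% of the total
train, test = train_test_split(rest, test_size=0.25, random_state=12345)

print(len(train), len(valid), len(test))  # 600 200 200
```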

In [15]:
list(df.columns)

['price',
 'registration_year',
 'power',
 'mileage',
 'registration_month',
 'number_of_pictures',
 'postal_code',
 'vehicle_type_convertible',
 'vehicle_type_coupe',
 'vehicle_type_other',
 'vehicle_type_sedan',
 'vehicle_type_small',
 'vehicle_type_suv',
 'vehicle_type_wagon',
 'gearbox_manual',
 'model_145',
 'model_147',
 'model_156',
 'model_159',
 'model_1_reihe',
 'model_1er',
 'model_200',
 'model_2_reihe',
 'model_300c',
 'model_3_reihe',
 'model_3er',
 'model_4_reihe',
 'model_500',
 'model_5_reihe',
 'model_5er',
 'model_601',
 'model_6_reihe',
 'model_6er',
 'model_7er',
 'model_80',
 'model_850',
 'model_90',
 'model_900',
 'model_9000',
 'model_911',
 'model_a1',
 'model_a2',
 'model_a3',
 'model_a4',
 'model_a5',
 'model_a6',
 'model_a8',
 'model_a_klasse',
 'model_accord',
 'model_agila',
 'model_alhambra',
 'model_almera',
 'model_altea',
 'model_amarok',
 'model_antara',
 'model_arosa',
 'model_astra',
 'model_auris',
 'model_avensis',
 'model_aveo',
 'model_aygo',
 

In [16]:
%%time
#Train linear regression model

model = LinearRegression()
model.fit(features_df_train, target_df_train)
predictions_df_valid = model.predict(features_df_valid)
result = mean_squared_error(target_df_valid, predictions_df_valid)**0.5
print("RMSE of the linear regression model on the validation set:", result)

RMSE of the linear regression model on the validation set: 2714.891714868017
CPU times: user 9.41 s, sys: 2.44 s, total: 11.8 s
Wall time: 11.8 s


In [17]:
%%time

predictions_df_test = model.predict(features_df_test)
result = mean_squared_error(target_df_test, predictions_df_test)**0.5
print("RMSE of the linear regression model on the test set:", result)

RMSE of the linear regression model on the test set: 2703.028356185305
CPU times: user 48.1 ms, sys: 105 ms, total: 153 ms
Wall time: 146 ms


In [18]:
list_1 = [5,10,30,50]

In [19]:
%%time
for i in list_1:
    model = DecisionTreeRegressor(max_depth = i)
    model.fit(features_df_train, target_df_train)
    predictions_df_valid = model.predict(features_df_valid)
    result = mean_squared_error(target_df_valid, predictions_df_valid)**0.5
    print("RMSE of the DecisionTreeRegressor model on the validation set:", result)

RMSE of the DecisionTreeRegressor model on the validation set: 2438.295128937012
RMSE of the DecisionTreeRegressor model on the validation set: 2044.3453114352144
RMSE of the DecisionTreeRegressor model on the validation set: 2121.9993981403472
RMSE of the DecisionTreeRegressor model on the validation set: 2154.6249800713986
CPU times: user 13.7 s, sys: 508 ms, total: 14.2 s
Wall time: 14.2 s


In [34]:
%%time
for i in range(5,30):
    model = DecisionTreeRegressor(max_depth = i)
    model.fit(features_df_train, target_df_train)
    predictions_df_valid = model.predict(features_df_valid)
    result = mean_squared_error(target_df_valid, predictions_df_valid)**0.5
    print("max_depth:", i)
    print("RMSE of the DecisionTreeRegressor model on the validation set:", result)


max_depth: 5
RMSE of the DecisionTreeRegressor model on the validation set: 2438.295128937012
max_depth: 6
RMSE of the DecisionTreeRegressor model on the validation set: 2319.782498906562
max_depth: 7
RMSE of the DecisionTreeRegressor model on the validation set: 2228.960914475313
max_depth: 8
RMSE of the DecisionTreeRegressor model on the validation set: 2156.5210332288784
max_depth: 9
RMSE of the DecisionTreeRegressor model on the validation set: 2092.0859964308693
max_depth: 10
RMSE of the DecisionTreeRegressor model on the validation set: 2040.8918586872426
max_depth: 11
RMSE of the DecisionTreeRegressor model on the validation set: 2002.328293477705
max_depth: 12
RMSE of the DecisionTreeRegressor model on the validation set: 1974.5896625103226
max_depth: 13
RMSE of the DecisionTreeRegressor model on the validation set: 1962.3310341438453
max_depth: 14
RMSE of the DecisionTreeRegressor model on the validation set: 1967.26610264593
max_depth: 15
RMSE of the DecisionTreeRegressor mod

max_depth = 13 gave the best validation result for DecisionTreeRegressor, with an RMSE of 1962.33. Beyond that depth the validation RMSE starts to rise again, which suggests overfitting. The model will now be evaluated on the test set.
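Rather than reading the best depth off the printed log, the scores can be collected in a dictionary and the minimum selected programmatically. This is a minimal sketch on synthetic data (the features, target, and depth range are assumptions for illustration, not the notebook's data):

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear regression problem
rng = np.random.default_rng(12345)
features = rng.uniform(-3, 3, size=(2000, 2))
target = (np.sin(features[:, 0]) * 1000
          + features[:, 1] ** 2 * 200
          + rng.normal(scale=150, size=2000))

x_train, x_valid, y_train, y_valid = train_test_split(
    features, target, test_size=0.25, random_state=12345)

# Validation RMSE for each candidate depth
scores = {}
for depth in range(2, 16):
    model = DecisionTreeRegressor(max_depth=depth, random_state=12345)
    model.fit(x_train, y_train)
    scores[depth] = mean_squared_error(y_valid, model.predict(x_valid)) ** 0.5

# Depth with the lowest validation RMSE
best_depth = min(scores, key=scores.get)
print(best_depth, round(scores[best_depth], 2))
```

Setting `random_state` on the tree makes repeated runs reproducible, which the cells above do not do.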

In [49]:
%%time
model = DecisionTreeRegressor(max_depth = 13)
model.fit(features_df_train, target_df_train)
predictions_df_test = model.predict(features_df_test)
result = mean_squared_error(target_df_test, predictions_df_test)**0.5
print("RMSE of the DecisionTreeRegressor model on the test set:", result)

RMSE of the DecisionTreeRegressor model on the test set: 1959.9148872498342
CPU times: user 3.11 s, sys: 63.3 ms, total: 3.17 s
Wall time: 3.17 s


In [36]:
%%time
for i in list_1:
    model = RandomForestRegressor(n_estimators = i)
    model.fit(features_df_train, target_df_train)
    predictions_df_valid = model.predict(features_df_valid)
    result = mean_squared_error(target_df_valid, predictions_df_valid)**0.5
    print("n_estimators:", i)
    print("RMSE of the RandomForestRegressor on the validation set:", result)

n_estimators: 5
RMSE of the RandomForestRegressor on the validation set: 1758.6273972757406
n_estimators: 10
RMSE of the RandomForestRegressor on the validation set: 1696.2663123105829
n_estimators: 30
RMSE of the RandomForestRegressor on the validation set: 1652.1419751719657
n_estimators: 50
RMSE of the RandomForestRegressor on the validation set: 1642.885475082922
CPU times: user 5min 5s, sys: 756 ms, total: 5min 6s
Wall time: 5min 6s


The RMSE is still decreasing at n_estimators = 50, so larger values of n_estimators will be tested.

In [37]:
list_2 = [100, 150, 200]

In [38]:
%%time
for i in list_2:
    model = RandomForestRegressor(n_estimators = i)
    model.fit(features_df_train, target_df_train)
    predictions_df_valid = model.predict(features_df_valid)
    result = mean_squared_error(target_df_valid, predictions_df_valid)**0.5
    print("n_estimators:", i)
    print("RMSE of the RandomForestRegressor on the validation set:", result)

n_estimators: 100
RMSE of the RandomForestRegressor on the validation set: 1632.6227496058839
n_estimators: 150
RMSE of the RandomForestRegressor on the validation set: 1631.6804587107638
n_estimators: 200
RMSE of the RandomForestRegressor on the validation set: 1629.0894745333253
CPU times: user 23min 43s, sys: 2.33 s, total: 23min 45s
Wall time: 23min 46s


The RMSE improved only marginally between n_estimators = 150 and n_estimators = 200 (1631.68 vs. 1629.09). The best validation RMSE for RandomForestRegressor is 1629.09 with n_estimators = 200. This model will be evaluated on the test set.
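Retraining a fresh forest for every value of n_estimators is what makes this search take over 20 minutes. One possible shortcut, sketched here on synthetic data, is scikit-learn's `warm_start=True`, which keeps the trees already fitted and only trains the additional ones when `n_estimators` is raised. The data and the tree counts below are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem
rng = np.random.default_rng(12345)
features = rng.normal(size=(1500, 4))
target = (3000 + 800 * features[:, 0]
          - 400 * features[:, 1] ** 2
          + rng.normal(scale=200, size=1500))
x_train, x_valid, y_train, y_valid = train_test_split(
    features, target, test_size=0.25, random_state=12345)

# warm_start=True reuses the trees from the previous fit call
model = RandomForestRegressor(n_estimators=5, warm_start=True, random_state=12345)
rmse_by_size = {}
for n_trees in [5, 10, 20, 30]:
    model.set_params(n_estimators=n_trees)
    model.fit(x_train, y_train)  # only the newly requested trees are trained
    rmse_by_size[n_trees] = mean_squared_error(y_valid, model.predict(x_valid)) ** 0.5

for n_trees, rmse in rmse_by_size.items():
    print(n_trees, round(rmse, 2))
```

With this approach the total training cost is roughly that of the largest forest alone. Passing `n_jobs=-1` to `RandomForestRegressor` would additionally spread tree building across CPU cores.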

In [39]:
%%time
model = RandomForestRegressor(n_estimators = 200)
model.fit(features_df_train, target_df_train)
predictions_df_test = model.predict(features_df_test)
result = mean_squared_error(target_df_test, predictions_df_test)**0.5
print("RMSE of the RandomForestRegressor on the test set:", result)

RMSE of the RandomForestRegressor on the test set: 1611.237528504111
CPU times: user 10min 33s, sys: 1.16 s, total: 10min 34s
Wall time: 10min 35s


In [40]:
train_data = lgb.Dataset(features_df_train, label=target_df_train)
valid_data = lgb.Dataset(features_df_valid, label=target_df_valid)
test_data = lgb.Dataset(features_df_test, label=target_df_test)

In [41]:
# Settings shared by all three LightGBM configurations.
# 'is_unbalance' is omitted: it only applies to classification objectives.
base_parameters = {'objective': 'regression',
                   'metric': 'rmse',
                   'boosting': 'gbdt',
                   'bagging_freq': 20,
                   'learning_rate': 0.01,
                   'verbose': -1}

# Each configuration varies the tree size and the sampling fractions
parameters_1 = {**base_parameters, 'num_leaves': 5,
                'feature_fraction': 0.5, 'bagging_fraction': 0.5}

parameters_2 = {**base_parameters, 'num_leaves': 10,
                'feature_fraction': 0.5, 'bagging_fraction': 0.5}

parameters_3 = {**base_parameters, 'num_leaves': 15,
                'feature_fraction': 0.2, 'bagging_fraction': 0.2}


In [42]:
%%time
model_lgbm = lgb.train(parameters_1, train_set = train_data, valid_sets=valid_data, num_boost_round=5000, early_stopping_rounds=50)



[1]	valid_0's rmse: 4660.68
Training until validation scores don't improve for 50 rounds
[2]	valid_0's rmse: 4647.67
[3]	valid_0's rmse: 4620.94
[4]	valid_0's rmse: 4594.6
[5]	valid_0's rmse: 4588.54
[6]	valid_0's rmse: 4569.53
[7]	valid_0's rmse: 4563.2
[8]	valid_0's rmse: 4550.46
[9]	valid_0's rmse: 4524.82
[10]	valid_0's rmse: 4501.5
[11]	valid_0's rmse: 4478.65
[12]	valid_0's rmse: 4472.99
[13]	valid_0's rmse: 4454.84
[14]	valid_0's rmse: 4443.46
[15]	valid_0's rmse: 4419.01
[16]	valid_0's rmse: 4397
[17]	valid_0's rmse: 4385.17
[18]	valid_0's rmse: 4373.54
[19]	valid_0's rmse: 4349.94
[20]	valid_0's rmse: 4328.57
[21]	valid_0's rmse: 4305.59
[22]	valid_0's rmse: 4288.99
[23]	valid_0's rmse: 4272.66
[24]	valid_0's rmse: 4252.01
[25]	valid_0's rmse: 4229.54
[26]	valid_0's rmse: 4219.52
[27]	valid_0's rmse: 4203.8
[28]	valid_0's rmse: 4183.84
[29]	valid_0's rmse: 4168.46
[30]	valid_0's rmse: 4153.25
[31]	valid_0's rmse: 4142.69
[32]	valid_0's rmse: 4136.96
[33]	valid_0's rmse: 4115.4

In [43]:
%%time
model_lgbm = lgb.train(parameters_2, train_set = train_data, valid_sets=valid_data, num_boost_round=5000, early_stopping_rounds=50)

[1]	valid_0's rmse: 4658.24
Training until validation scores don't improve for 50 rounds
[2]	valid_0's rmse: 4642.37
[3]	valid_0's rmse: 4611.69
[4]	valid_0's rmse: 4581.42
[5]	valid_0's rmse: 4574.05
[6]	valid_0's rmse: 4551.85
[7]	valid_0's rmse: 4544.43
[8]	valid_0's rmse: 4528.95
[9]	valid_0's rmse: 4499.5
[10]	valid_0's rmse: 4473.17
[11]	valid_0's rmse: 4448.26
[12]	valid_0's rmse: 4440.77
[13]	valid_0's rmse: 4419.64
[14]	valid_0's rmse: 4406.24
[15]	valid_0's rmse: 4378.16
[16]	valid_0's rmse: 4354.44
[17]	valid_0's rmse: 4341.58
[18]	valid_0's rmse: 4328.88
[19]	valid_0's rmse: 4301.73
[20]	valid_0's rmse: 4278.7
[21]	valid_0's rmse: 4252.22
[22]	valid_0's rmse: 4232.82
[23]	valid_0's rmse: 4213.1
[24]	valid_0's rmse: 4189.82
[25]	valid_0's rmse: 4164.41
[26]	valid_0's rmse: 4152.34
[27]	valid_0's rmse: 4133.78
[28]	valid_0's rmse: 4111.67
[29]	valid_0's rmse: 4093.03
[30]	valid_0's rmse: 4075.24
[31]	valid_0's rmse: 4063.53
[32]	valid_0's rmse: 4056.52
[33]	valid_0's rmse: 40

In [44]:
%%time
model_lgbm = lgb.train(parameters_3, train_set = train_data, valid_sets=valid_data, num_boost_round=5000, early_stopping_rounds=50)

[1]	valid_0's rmse: 4658.75
Training until validation scores don't improve for 50 rounds
[2]	valid_0's rmse: 4650.98
[3]	valid_0's rmse: 4643.39
[4]	valid_0's rmse: 4628.5
[5]	valid_0's rmse: 4623.46
[6]	valid_0's rmse: 4599.98
[7]	valid_0's rmse: 4598.46
[8]	valid_0's rmse: 4592.98
[9]	valid_0's rmse: 4561.21
[10]	valid_0's rmse: 4557.71
[11]	valid_0's rmse: 4553.2
[12]	valid_0's rmse: 4551.4
[13]	valid_0's rmse: 4542.27
[14]	valid_0's rmse: 4527.83
[15]	valid_0's rmse: 4513.79
[16]	valid_0's rmse: 4488.55
[17]	valid_0's rmse: 4483.32
[18]	valid_0's rmse: 4477.85
[19]	valid_0's rmse: 4463.99
[20]	valid_0's rmse: 4454.59
[21]	valid_0's rmse: 4428.36
[22]	valid_0's rmse: 4424.39
[23]	valid_0's rmse: 4402.77
[24]	valid_0's rmse: 4392
[25]	valid_0's rmse: 4382.77
[26]	valid_0's rmse: 4377.36
[27]	valid_0's rmse: 4368.52
[28]	valid_0's rmse: 4363.97
[29]	valid_0's rmse: 4350.33
[30]	valid_0's rmse: 4338.44
[31]	valid_0's rmse: 4334.52
[32]	valid_0's rmse: 4326.58
[33]	valid_0's rmse: 4302.

In [45]:
%%time
model_lgbm = lgb.train(parameters_1, train_set = train_data, valid_sets=test_data, num_boost_round=5000, early_stopping_rounds=50)

[1]	valid_0's rmse: 4707.69
Training until validation scores don't improve for 50 rounds
[2]	valid_0's rmse: 4694.18
[3]	valid_0's rmse: 4667.08
[4]	valid_0's rmse: 4640.45
[5]	valid_0's rmse: 4634.08
[6]	valid_0's rmse: 4614.84
[7]	valid_0's rmse: 4608.17
[8]	valid_0's rmse: 4595.29
[9]	valid_0's rmse: 4569.29
[10]	valid_0's rmse: 4545.95
[11]	valid_0's rmse: 4523.09
[12]	valid_0's rmse: 4517.26
[13]	valid_0's rmse: 4498.88
[14]	valid_0's rmse: 4487.18
[15]	valid_0's rmse: 4462.37
[16]	valid_0's rmse: 4440.35
[17]	valid_0's rmse: 4428.16
[18]	valid_0's rmse: 4416.18
[19]	valid_0's rmse: 4392.3
[20]	valid_0's rmse: 4370.91
[21]	valid_0's rmse: 4347.59
[22]	valid_0's rmse: 4330.68
[23]	valid_0's rmse: 4314.04
[24]	valid_0's rmse: 4293.32
[25]	valid_0's rmse: 4270.55
[26]	valid_0's rmse: 4260.21
[27]	valid_0's rmse: 4244.07
[28]	valid_0's rmse: 4224.09
[29]	valid_0's rmse: 4208.29
[30]	valid_0's rmse: 4192.79
[31]	valid_0's rmse: 4181.93
[32]	valid_0's rmse: 4176.01
[33]	valid_0's rmse: 

In [46]:
%%time
model_lgbm = lgb.train(parameters_2, train_set = train_data, valid_sets=test_data, num_boost_round=5000, early_stopping_rounds=50)

[1]	valid_0's rmse: 4705.18
Training until validation scores don't improve for 50 rounds
[2]	valid_0's rmse: 4688.93
[3]	valid_0's rmse: 4657.91
[4]	valid_0's rmse: 4627.29
[5]	valid_0's rmse: 4619.72
[6]	valid_0's rmse: 4597.18
[7]	valid_0's rmse: 4589.42
[8]	valid_0's rmse: 4573.61
[9]	valid_0's rmse: 4543.8
[10]	valid_0's rmse: 4517.49
[11]	valid_0's rmse: 4492.57
[12]	valid_0's rmse: 4484.9
[13]	valid_0's rmse: 4463.43
[14]	valid_0's rmse: 4449.73
[15]	valid_0's rmse: 4421.3
[16]	valid_0's rmse: 4397.53
[17]	valid_0's rmse: 4384.32
[18]	valid_0's rmse: 4371.36
[19]	valid_0's rmse: 4343.88
[20]	valid_0's rmse: 4320.83
[21]	valid_0's rmse: 4294.07
[22]	valid_0's rmse: 4274.31
[23]	valid_0's rmse: 4254.2
[24]	valid_0's rmse: 4230.91
[25]	valid_0's rmse: 4205.16
[26]	valid_0's rmse: 4192.92
[27]	valid_0's rmse: 4173.99
[28]	valid_0's rmse: 4151.93
[29]	valid_0's rmse: 4132.89
[30]	valid_0's rmse: 4114.77
[31]	valid_0's rmse: 4102.81
[32]	valid_0's rmse: 4095.53
[33]	valid_0's rmse: 407

In [47]:
%%time
model_lgbm = lgb.train(parameters_3, train_set = train_data, valid_sets=test_data, num_boost_round=5000, early_stopping_rounds=50)

[1]	valid_0's rmse: 4705.79
Training until validation scores don't improve for 50 rounds
[2]	valid_0's rmse: 4697.88
[3]	valid_0's rmse: 4690.02
[4]	valid_0's rmse: 4674.84
[5]	valid_0's rmse: 4669.64
[6]	valid_0's rmse: 4645.76
[7]	valid_0's rmse: 4644.17
[8]	valid_0's rmse: 4638.48
[9]	valid_0's rmse: 4606.37
[10]	valid_0's rmse: 4602.76
[11]	valid_0's rmse: 4598.01
[12]	valid_0's rmse: 4596.26
[13]	valid_0's rmse: 4587.03
[14]	valid_0's rmse: 4572.32
[15]	valid_0's rmse: 4558.04
[16]	valid_0's rmse: 4532.74
[17]	valid_0's rmse: 4527.2
[18]	valid_0's rmse: 4521.66
[19]	valid_0's rmse: 4507.48
[20]	valid_0's rmse: 4497.98
[21]	valid_0's rmse: 4471.8
[22]	valid_0's rmse: 4467.73
[23]	valid_0's rmse: 4445.72
[24]	valid_0's rmse: 4434.9
[25]	valid_0's rmse: 4425.59
[26]	valid_0's rmse: 4420.12
[27]	valid_0's rmse: 4411.24
[28]	valid_0's rmse: 4406.68
[29]	valid_0's rmse: 4392.8
[30]	valid_0's rmse: 4380.97
[31]	valid_0's rmse: 4377.05
[32]	valid_0's rmse: 4368.82
[33]	valid_0's rmse: 434

## Model analysis

The LinearRegression model serves as a sanity check. It was not expected to perform well, but it provides a baseline: every other model should beat it. Each model was tuned on the validation set and then checked on the test set. The test-set results are shown below.

LinearRegression:
RMSE of the linear regression model on the test set: 2703.028356185305
CPU times: user 48.1 ms, sys: 105 ms, total: 153 ms
Wall time: 146 ms

DecisionTreeRegressor:
RMSE of the DecisionTreeRegressor model on the test set: 1959.9148872498342
CPU times: user 3.11 s, sys: 63.3 ms, total: 3.17 s
Wall time: 3.17 s

RandomForestRegressor:
n_estimators: 200
RMSE of the RandomForestRegressor on the test set: 1611.237528504111
CPU times: user 10min 33s, sys: 1.16 s, total: 10min 34s
Wall time: 10min 35s

LightGBM:
valid_0's rmse: 1696.17
num_leaves: 10
feature_fraction: 0.5

The best model was the RandomForestRegressor, with a test RMSE of 1611.24. LightGBM was the second-best model, but its runtime was less than one third of the RandomForestRegressor's. The LightGBM parameters tried here cover only three configurations, so further tuning may narrow the gap.


