# Predicting the Sale Price of Bulldozers (Regression Problem)


## 1. Problem definition

> How well can we predict the future sale price of a bulldozer, given its various characteristics/features and previous examples of how much similar bulldozers have been sold for?

## 2. Data

The data is downloaded from the Kaggle Bluebook for Bulldozers competition: https://www.kaggle.com/c/bluebook-for-bulldozers/data

There are 3 main datasets:

* Train.csv is the training set, which contains data through the end of 2011.
* Valid.csv is the validation set, which contains data from January 1, 2012 - April 30, 2012. You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
* Test.csv is the test set, which won't be released until the last week of the competition. It contains data from May 1, 2012 - November 2012. Your score on the test set determines your final rank for the competition.

The key fields in train.csv are:

* SalesID: the unique identifier of the sale
* MachineID: the unique identifier of a machine. A machine can be sold multiple times
* SalePrice: what the machine sold for at auction (only provided in train.csv)
* saledate: the date of the sale

## 3. Evaluation

The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices.

For more information about the evaluation of this project check: https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation

**Note:** The goal for most regression evaluation metrics is to minimize the error. For example, our goal for this project will be to build a machine learning Regression model which minimises RMSLE.

## 4. Features

Kaggle provides a data dictionary detailing all of the features of the dataset.

The data dictionary can be viewed on Google Sheets: https://docs.google.com/spreadsheets/d/1hfDQfDOFsLhIBpne0-O1Li8ZAqNt9OuLaNUSQ3HzdR0/edit?usp=sharing

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

In [2]:
# Load the prepared data
df_tmp = pd.read_csv("data/bluebook-for-bulldozers/train_tmp_no_missing_all_numerical.csv")
df_tmp.head()

Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,fiModelDesc,...,Undercarriage_Pad_Width_is_missing,Stick_Length_is_missing,Thumb_is_missing,Pattern_Changer_is_missing,Grouser_Type_is_missing,Backhoe_Mounting_is_missing,Blade_Type_is_missing,Travel_Controls_is_missing,Differential_Type_is_missing,Steering_Controls_is_missing
0,1646770,9500.0,1126363,8434,132,3.0,1974,68.0,0,4593,...,True,True,True,True,True,False,False,False,True,True
1,1821514,14000.0,1194089,10150,132,3.0,1980,4640.0,0,1820,...,True,True,True,True,True,True,True,True,False,False
2,1505138,50000.0,1473654,4139,132,3.0,1978,2838.0,0,2348,...,True,True,True,True,True,False,False,False,True,True
3,1671174,16000.0,1327630,8591,132,3.0,1980,3486.0,0,1819,...,True,True,True,True,True,True,True,True,False,False
4,1329056,22000.0,1336053,4089,132,3.0,1984,722.0,0,2119,...,True,True,True,True,True,False,False,False,True,True


In [3]:
df_tmp.isna().sum()

SalesID                         0
SalePrice                       0
MachineID                       0
ModelID                         0
datasource                      0
                               ..
Backhoe_Mounting_is_missing     0
Blade_Type_is_missing           0
Travel_Controls_is_missing      0
Differential_Type_is_missing    0
Steering_Controls_is_missing    0
Length: 103, dtype: int64

In [4]:
df_tmp.shape

(412698, 103)


# Modelling

In [7]:
%%time
#initial fit on all data (takes too much time)


# Instantiate model
model = RandomForestRegressor(n_jobs=-1, random_state=42)

# Fit the model
model.fit(df_tmp.drop("SalePrice", axis=1), df_tmp.SalePrice)

Wall time: 4min 50s


RandomForestRegressor(n_jobs=-1, random_state=42)

In [8]:
model.score(df_tmp.drop("SalePrice", axis=1), df_tmp.SalePrice)

0.9873039604105666

In [19]:
%%time
#initial fit on all data without n_jobs =-1 (takes even more time)


# Instantiate model
model = RandomForestRegressor(random_state=42)

# Fit the model
model.fit(df_tmp.drop("SalePrice", axis=1), df_tmp.SalePrice)

Wall time: 12min 38s


RandomForestRegressor(random_state=42)

In [20]:
model.score(df_tmp.drop("SalePrice", axis=1), df_tmp.SalePrice)

0.9873039604105666

### This metric is not reliable, since training and scoring were done on the same data
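
To illustrate why a score computed on the training data is optimistic, here is a minimal sketch on synthetic data (not the bulldozer dataset). The data, the 400/200 split and the name `demo_model` are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic, noisy regression data (an assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(600, 3))
y = X[:, 0] * 2.0 + np.sin(X[:, 1]) + rng.normal(0, 2.0, size=600)

# Hold out the last 200 rows; the model never sees them during fitting
X_fit, y_fit = X[:400], y[:400]
X_held, y_held = X[400:], y[400:]

demo_model = RandomForestRegressor(n_estimators=50, random_state=0)
demo_model.fit(X_fit, y_fit)

train_r2 = demo_model.score(X_fit, y_fit)    # scored on the rows it was fitted on
held_r2 = demo_model.score(X_held, y_held)   # scored on rows it has never seen
print(f"R^2 on training data: {train_r2:.3f}")
print(f"R^2 on held-out data: {held_r2:.3f}")
```

The held-out score should come out noticeably lower than the training score, which is the gap a score of 0.987 on the training data hides.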

### Splitting data into train/valid sets

In [5]:
df_tmp.head()

Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,fiModelDesc,...,Undercarriage_Pad_Width_is_missing,Stick_Length_is_missing,Thumb_is_missing,Pattern_Changer_is_missing,Grouser_Type_is_missing,Backhoe_Mounting_is_missing,Blade_Type_is_missing,Travel_Controls_is_missing,Differential_Type_is_missing,Steering_Controls_is_missing
0,1646770,9500.0,1126363,8434,132,3.0,1974,68.0,0,4593,...,True,True,True,True,True,False,False,False,True,True
1,1821514,14000.0,1194089,10150,132,3.0,1980,4640.0,0,1820,...,True,True,True,True,True,True,True,True,False,False
2,1505138,50000.0,1473654,4139,132,3.0,1978,2838.0,0,2348,...,True,True,True,True,True,False,False,False,True,True
3,1671174,16000.0,1327630,8591,132,3.0,1980,3486.0,0,1819,...,True,True,True,True,True,True,True,True,False,False
4,1329056,22000.0,1336053,4089,132,3.0,1984,722.0,0,2119,...,True,True,True,True,True,False,False,False,True,True


According to the [Kaggle data page](https://www.kaggle.com/c/bluebook-for-bulldozers/data), the validation set and test set are split according to dates.

This makes sense because this is a time series problem: we use past events to try to predict future events.

Knowing this, randomly splitting our data into train and test sets using something like `train_test_split()` wouldn't work, because sales from the future would end up in the training set.

Instead, we split our data into training, validation and test sets using the date each sample occurred.

In our case:
* Training = all samples up to the end of 2011
* Valid = all samples from January 1, 2012 - April 30, 2012
* Test = all samples from May 1, 2012 - November 2012

For more on making good training, validation and test sets, check out the post [How (and why) to create a good validation set](https://www.fast.ai/2017/11/13/validation-sets/) by Rachel Thomas.
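
As a rough sketch of a date-based split, the example below uses a tiny invented DataFrame with a `saledate` column. The prepared data in this notebook exposes `saleYear` instead, so the frame, its values and the `toy_*` names are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the sales data; column names mirror the dataset,
# but the values are invented for illustration.
toy = pd.DataFrame({
    "saledate": pd.to_datetime(
        ["2010-06-01", "2011-12-31", "2012-01-15", "2012-04-30", "2012-05-01"]
    ),
    "SalePrice": [10000.0, 12000.0, 15000.0, 9000.0, 20000.0],
})

train_end = pd.Timestamp("2011-12-31")   # training data runs through the end of 2011
valid_end = pd.Timestamp("2012-04-30")   # validation data runs through April 30, 2012

toy_train = toy[toy["saledate"] <= train_end]
toy_valid = toy[(toy["saledate"] > train_end) & (toy["saledate"] <= valid_end)]
toy_test = toy[toy["saledate"] > valid_end]

print(len(toy_train), len(toy_valid), len(toy_test))
```

Because the boundaries are dates rather than random indices, no row in the training split is newer than any row in the validation or test split.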

In [6]:
df_tmp.saleYear.value_counts()

2009    43849
2008    39767
2011    35197
2010    33390
2007    32208
2006    21685
2005    20463
2004    19879
2001    17594
2000    17415
2002    17246
2003    15254
1998    13046
1999    12793
2012    11573
1997     9785
1996     8829
1995     8530
1994     7929
1993     6303
1992     5519
1991     5109
1989     4806
1990     4529
Name: saleYear, dtype: int64

In [7]:
df_val = df_tmp[df_tmp["saleYear"]==2012]
df_train = df_tmp[df_tmp["saleYear"]!=2012]
len(df_val), len(df_train)

(11573, 401125)

In [8]:
X_train, y_train = df_train.drop('SalePrice', axis = 1),df_train['SalePrice']
X_val, y_val = df_val.drop('SalePrice', axis =1),df_val['SalePrice']
X_train.shape,y_train.shape,X_val.shape,y_val.shape

((401125, 102), (401125,), (11573, 102), (11573,))

In [9]:
X_train.head()

Unnamed: 0,SalesID,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,fiModelDesc,fiBaseModel,...,Undercarriage_Pad_Width_is_missing,Stick_Length_is_missing,Thumb_is_missing,Pattern_Changer_is_missing,Grouser_Type_is_missing,Backhoe_Mounting_is_missing,Blade_Type_is_missing,Travel_Controls_is_missing,Differential_Type_is_missing,Steering_Controls_is_missing
0,1646770,1126363,8434,132,3.0,1974,68.0,0,4593,1744,...,True,True,True,True,True,False,False,False,True,True
1,1821514,1194089,10150,132,3.0,1980,4640.0,0,1820,559,...,True,True,True,True,True,True,True,True,False,False
2,1505138,1473654,4139,132,3.0,1978,2838.0,0,2348,713,...,True,True,True,True,True,False,False,False,True,True
3,1671174,1327630,8591,132,3.0,1980,3486.0,0,1819,558,...,True,True,True,True,True,True,True,True,False,False
4,1329056,1336053,4089,132,3.0,1984,722.0,0,2119,683,...,True,True,True,True,True,False,False,False,True,True


In [10]:
y_train.head()

0     9500.0
1    14000.0
2    50000.0
3    16000.0
4    22000.0
Name: SalePrice, dtype: float64

### Building an evaluation function

According to Kaggle for the Bluebook for Bulldozers competition, [the evaluation function](https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation) they use is root mean squared log error (RMSLE).

**RMSLE** cares about ratios rather than differences: generally you care less about being off by $10 than about being off by 10%. **MAE** (mean absolute error), by contrast, is about exact differences.
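
A small worked example may make the ratio idea concrete. The prices below are invented, and `squared_log_error` is a hypothetical helper that computes the per-sample term that gets averaged inside MSLE.

```python
import math

def squared_log_error(actual, predicted):
    """Squared difference of log(1 + value), the per-sample term inside MSLE."""
    return (math.log1p(predicted) - math.log1p(actual)) ** 2

# Both predictions are 10% too high, but the absolute errors differ by 10x
cheap = (10_000.0, 11_000.0)         # off by 1,000
expensive = (100_000.0, 110_000.0)   # off by 10,000

for actual, predicted in (cheap, expensive):
    abs_err = abs(predicted - actual)
    log_err = squared_log_error(actual, predicted)
    print(f"actual={actual:>8.0f}  abs error={abs_err:>6.0f}  squared log error={log_err:.5f}")
```

The absolute errors differ by a factor of ten, while the squared log errors are almost identical, which is why a log-based metric treats both misses as equally bad.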

Since Scikit-Learn doesn't have a function built-in for RMSLE, we'll create our own.

We can do this by taking the square root of Scikit-Learn's [mean_squared_log_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_log_error.html#sklearn.metrics.mean_squared_log_error) (MSLE). MSLE is the mean of the squared differences between the logarithms of the predicted and actual values (Scikit-Learn uses log(1 + x)); it is not the logarithm of the mean squared error (MSE).

We'll also calculate the MAE and R^2.

In [11]:
from sklearn.metrics import mean_squared_log_error, mean_absolute_error, r2_score

def rmsle(y_test, y_preds):
    """
    Returns root mean square log error.
    
    """
    return np.sqrt(mean_squared_log_error(y_test, y_preds))

def eval_scores(model, X_train, X_val, y_train, y_val):
    """
    Returns mean absolute error, mean squared log error, root mean squared log error
    and R^2 score on the training and validation data.
    """
    train_preds = model.predict(X_train)
    val_preds = model.predict(X_val)
    score = {'Train MAE': mean_absolute_error(y_train, train_preds),
             'Val MAE': mean_absolute_error(y_val, val_preds),
             'Train MSLE': mean_squared_log_error(y_train, train_preds),
             'Val MSLE': mean_squared_log_error(y_val, val_preds),
             'Train RMSLE': rmsle(y_train, train_preds),
             'Val RMSLE': rmsle(y_val, val_preds),
             'Train R2': r2_score(y_train, train_preds),
             'Val R2': r2_score(y_val, val_preds)}
    return score
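
As a sanity check on the definition, RMSLE can also be written out by hand with NumPy. This is a sketch with invented prices; `rmsle_by_hand` is a hypothetical name, and it should agree with the `rmsle()` function above because Scikit-Learn's MSLE also works on log(1 + x).

```python
import numpy as np

def rmsle_by_hand(y_true, y_pred):
    """RMSLE computed directly: sqrt(mean((log(1 + pred) - log(1 + true))^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Invented prices, for illustration only
actual_prices = [9500.0, 14000.0, 50000.0]
predicted_prices = [10000.0, 13000.0, 55000.0]

print(rmsle_by_hand(actual_prices, predicted_prices))
```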

### Testing our model on a subset (to tune the hyperparameters)

Retraining an entire model would take far too long to keep experimenting as fast as we want to.

So, we take a sample of the training set and tune the hyperparameters on that before training a larger model.

If experiments are taking longer than 10 seconds, you should be trying to speed things up. You can speed things up by sampling less data or using a faster computer.

Methods to restrict the data used for training:

1. Slicing: `model.fit(X_train[:10000], y_train[:10000])`
2. The `max_samples` parameter of [`RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html): it sets the number of samples each estimator (tree) in the forest sees.

In [12]:
model = RandomForestRegressor(n_jobs=-1, max_samples=10000)

Setting `max_samples` to 10000 means each of the `n_estimators` trees (default 100) in our `RandomForestRegressor` will only see 10,000 random samples from our DataFrame instead of all ~400,000.

In other words, each tree looks at roughly 40x fewer samples, which means faster computation, but we should expect our results to worsen (the model has fewer samples to learn patterns from).
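
The effect of `max_samples` can be seen on a small scale. The sketch below uses synthetic data and much smaller sizes than the notebook (5,000 rows, 500 samples per tree, 20 trees); all of these values and the names `subsampled` and `full` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for X_train / y_train (illustration only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(5000, 5))
y_demo = X_demo[:, 0] * 3.0 + X_demo[:, 1] ** 2 + rng.normal(0, 0.5, size=5000)

# Each of the 20 trees is built from a bootstrap sample of only 500 rows
subsampled = RandomForestRegressor(n_estimators=20, max_samples=500, random_state=42)
subsampled.fit(X_demo, y_demo)

# max_samples=None (the default): each tree draws as many rows as the data has
full = RandomForestRegressor(n_estimators=20, max_samples=None, random_state=42)
full.fit(X_demo, y_demo)

# Trees grown from fewer rows end up with fewer nodes, so they are cheaper to build
nodes_sub = np.mean([t.tree_.node_count for t in subsampled.estimators_])
nodes_full = np.mean([t.tree_.node_count for t in full.estimators_])
print(f"average nodes per tree: subsampled={nodes_sub:.0f}, full={nodes_full:.0f}")
```

Smaller trees are faster to fit, but each one has seen less of the data, which is the trade-off described above.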

In [13]:
%%time
model.fit(X_train,y_train)

Wall time: 11.6 s


RandomForestRegressor(max_samples=10000, n_jobs=-1)

In [14]:
eval_scores(model, X_train,X_val,y_train,y_val)

{'Train MAE': 5567.267547447801,
 'Val MAE': 7241.5388810161585,
 'Train MSLE': 0.0666786854548756,
 'Val MSLE': 0.08834019544365973,
 'Train RMSLE': 0.25822216298156053,
 'Val RMSLE': 0.29722078568575877,
 'Train R2': 0.8604651593384437,
 'Val R2': 0.8312775040503089}

In [15]:
# without n_jobs=-1 (takes more time)
model = RandomForestRegressor(max_samples=10000)

In [16]:
%%time
model.fit(X_train,y_train)

Wall time: 27.6 s


RandomForestRegressor(max_samples=10000)

In [17]:
eval_scores(model, X_train,X_val,y_train,y_val)

{'Train MAE': 5573.695007915238,
 'Val MAE': 7294.009030502029,
 'Train MSLE': 0.06668398770298517,
 'Val MSLE': 0.0886198350638921,
 'Train RMSLE': 0.2582324296113584,
 'Val RMSLE': 0.29769083805836566,
 'Train R2': 0.8596554043983744,
 'Val R2': 0.8301423181105553}

## Hyperparameter Tuning with Randomized Search CV

In [44]:
from sklearn.model_selection import RandomizedSearchCV
rf_grid_1 = {"n_estimators": np.arange(10, 400, 20),
           "max_depth": [None, 3, 5, 10,20,40],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(3, 20, 2),
           "max_features": [0.5, 1.0, "sqrt", "auto"],
           "max_samples": [10000],
            "n_jobs":[-1]}
rf_grid_2 = {"n_estimators": np.arange(10, 200, 20),
           "max_depth": [None, 3, 5, 10,20,40],
           "min_samples_split": np.arange(4, 20, 2),
           "min_samples_leaf": np.arange(5, 20, 2),
           "max_features": [0.5, 1, "sqrt", "auto"],
           "max_samples": [10000],
            "n_jobs":[-1]}
rf_grid_3 = {"n_estimators": np.arange(10, 100, 10),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2),
           "max_features": [0.5, 1, "sqrt", "auto"],
           "max_samples": [10000],
            "n_jobs":[-1]}
rf_grid_4 = {"n_estimators": np.arange(100, 1000, 60),
           "max_depth": [None,5, 10,20, 40, 60],
           "min_samples_split": np.arange(2, 16, 2),
           "min_samples_leaf": np.arange(1, 16, 2),
           "max_features": [0.5,1.0,0.75, "auto"],
           "max_samples": [10000],
            "n_jobs":[-1]}
rf_grid_5 = {"n_estimators": np.arange(10, 1000, 70),
           "max_depth": [None, 10,20,40,60,80],
           "min_samples_split": np.arange(4, 20, 2),
           "min_samples_leaf": np.arange(5, 20, 2),
           "max_features": [0.5, 1, 0.75, "auto"],
           "max_samples": [10000],
            "n_jobs":[-1]}

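To get a feel for how much of a grid a randomized search actually covers, the number of possible combinations can be counted. This sketch copies the value lists of `rf_grid_1` from the cell above into a local dictionary named `grid` (a hypothetical name) and compares the total with `n_iter=250`, the value used for `rs_model_1`.

```python
import numpy as np

# Same value lists as rf_grid_1 in the cell above
grid = {
    "n_estimators": np.arange(10, 400, 20),
    "max_depth": [None, 3, 5, 10, 20, 40],
    "min_samples_split": np.arange(2, 20, 2),
    "min_samples_leaf": np.arange(3, 20, 2),
    "max_features": [0.5, 1.0, "sqrt", "auto"],
    "max_samples": [10000],
    "n_jobs": [-1],
}

# The grid size is the product of the number of options per hyperparameter
n_combinations = 1
for values in grid.values():
    n_combinations *= len(values)

n_iter = 250  # number of candidates sampled for rs_model_1
print(f"{n_combinations} combinations; {n_iter} candidates cover {n_iter / n_combinations:.2%}")
```

Even a long search samples only a small fraction of the grid, so different runs can reasonably land on different "best" parameters.
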
In [24]:
# np.random.seed(42)
rs_model_1 = RandomizedSearchCV(RandomForestRegressor(random_state=42), 
                              param_distributions=rf_grid_1,
                              cv=5,
                              n_iter=250,
                              n_jobs=-1,
                              verbose=True)

In [26]:
rs_model_1.fit(X_train,y_train)

Fitting 5 folds for each of 250 candidates, totalling 1250 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  6.1min
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed: 20.7min
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed: 54.7min
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed: 100.8min
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed: 160.2min
[Parallel(n_jobs=-1)]: Done 1250 out of 1250 | elapsed: 160.6min finished


RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42),
                   n_iter=250, n_jobs=-1,
                   param_distributions={'max_depth': [None, 3, 5, 10, 20, 40],
                                        'max_features': [0.5, 1.0, 'sqrt',
                                                         'auto'],
                                        'max_samples': [10000],
                                        'min_samples_leaf': array([ 3,  5,  7,  9, 11, 13, 15, 17, 19]),
                                        'min_samples_split': array([ 2,  4,  6,  8, 10, 12, 14, 16, 18]),
                                        'n_estimators': array([ 10,  30,  50,  70,  90, 110, 130, 150, 170, 190, 210, 230, 250,
       270, 290, 310, 330, 350, 370, 390]),
                                        'n_jobs': [-1]},
                   verbose=True)

In [27]:
rs_model_1.best_params_

{'n_jobs': -1,
 'n_estimators': 330,
 'min_samples_split': 6,
 'min_samples_leaf': 3,
 'max_samples': 10000,
 'max_features': 1.0,
 'max_depth': 20}

In [28]:
eval_scores(rs_model_1, X_train, X_val, y_train, y_val)

{'Train MAE': 5623.229618772711,
 'Val MAE': 7269.1532811119905,
 'Train MSLE': 0.06751032884813418,
 'Val MSLE': 0.08764281604494915,
 'Train RMSLE': 0.259827498252464,
 'Val RMSLE': 0.296045293907789,
 'Train R2': 0.8559652681771991,
 'Val R2': 0.8259800880862996}

In [39]:
np.random.seed(42)
rs_model_res_1 = RandomizedSearchCV(RandomForestRegressor(), 
                              param_distributions=rf_grid_4,
                              cv=5,
                              n_iter=70,
                              n_jobs=-1,
                              verbose=True)

In [40]:
rs_model_res_1.fit(X_train,y_train)

Fitting 5 folds for each of 200 candidates, totalling 1000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed: 11.4min
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed: 94.2min


KeyboardInterrupt: 

In [None]:
rs_model_res_1.best_params_

In [None]:
eval_scores(rs_model_res_1, X_train, X_val, y_train, y_val)

In [29]:
# np.random.seed(42)
rs_model_3 = RandomizedSearchCV(RandomForestRegressor(random_state=42), 
                              param_distributions=rf_grid_3,
                              cv=5,
                              n_iter=200,
                              n_jobs=-1,
                              verbose=True)

In [30]:
rs_model_3.fit(X_train,y_train)

Fitting 5 folds for each of 200 candidates, totalling 1000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  5.3min
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed: 12.1min
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed: 22.7min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 29.5min finished


RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42),
                   n_iter=200, n_jobs=-1,
                   param_distributions={'max_depth': [None, 3, 5, 10],
                                        'max_features': [0.5, 1, 'sqrt',
                                                         'auto'],
                                        'max_samples': [10000],
                                        'min_samples_leaf': array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19]),
                                        'min_samples_split': array([ 2,  4,  6,  8, 10, 12, 14, 16, 18]),
                                        'n_estimators': array([10, 20, 30, 40, 50, 60, 70, 80, 90]),
                                        'n_jobs': [-1]},
                   verbose=True)

In [31]:
rs_model_3.best_params_

{'n_jobs': -1,
 'n_estimators': 80,
 'min_samples_split': 16,
 'min_samples_leaf': 1,
 'max_samples': 10000,
 'max_features': 'auto',
 'max_depth': None}

In [32]:
eval_scores(rs_model_3, X_train, X_val, y_train, y_val)

{'Train MAE': 5825.6618638095115,
 'Val MAE': 7430.221887715373,
 'Train MSLE': 0.07122314824978604,
 'Val MSLE': 0.09121719002119127,
 'Train RMSLE': 0.26687665362445256,
 'Val RMSLE': 0.30202183699393537,
 'Train R2': 0.8470343218522651,
 'Val R2': 0.8203493153243675}

In [33]:
np.random.seed(42)
rs_model_2 = RandomizedSearchCV(RandomForestRegressor(), 
                              param_distributions=rf_grid_2,
                              cv=5,
                              n_iter=50,
                              n_jobs=-1,
                              verbose=True)

In [34]:
rs_model_2.fit(X_train,y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  8.0min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed: 12.8min finished


RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(), n_iter=50,
                   n_jobs=-1,
                   param_distributions={'max_depth': [None, 3, 5, 10, 20, 40],
                                        'max_features': [0.5, 1, 'sqrt',
                                                         'auto'],
                                        'max_samples': [10000],
                                        'min_samples_leaf': array([ 5,  7,  9, 11, 13, 15, 17, 19]),
                                        'min_samples_split': array([ 4,  6,  8, 10, 12, 14, 16, 18]),
                                        'n_estimators': array([ 10,  30,  50,  70,  90, 110, 130, 150, 170, 190]),
                                        'n_jobs': [-1]},
                   verbose=True)

In [35]:
rs_model_2.best_params_

{'n_jobs': -1,
 'n_estimators': 150,
 'min_samples_split': 12,
 'min_samples_leaf': 7,
 'max_samples': 10000,
 'max_features': 'auto',
 'max_depth': None}

In [36]:
eval_scores(rs_model_2, X_train, X_val, y_train, y_val)

{'Train MAE': 5914.5439021491575,
 'Val MAE': 7513.77702270813,
 'Train MSLE': 0.07317371530348829,
 'Val MSLE': 0.09212898330137742,
 'Train RMSLE': 0.2705064052910546,
 'Val RMSLE': 0.303527565966219,
 'Train R2': 0.8406109135952532,
 'Val R2': 0.8108118597419973}

In [33]:
# train grid 2 for more iter (promising results)
np.random.seed(42)
rs_model_res_2 = RandomizedSearchCV(RandomForestRegressor(), 
                              param_distributions=rf_grid_2,
                              cv=5,
                              n_iter=200,
                              n_jobs=-1,
                              verbose=True)

In [None]:
rs_model_res_2.fit(X_train,y_train)

In [None]:
rs_model_res_2.best_params_

In [None]:
eval_scores(rs_model_res_2, X_train, X_val, y_train, y_val)

In [33]:
# Modified grid 2(grid 5)
np.random.seed(42)
rs_model_res_3 = RandomizedSearchCV(RandomForestRegressor(), 
                              param_distributions=rf_grid_5,
                              cv=5,
                              n_iter=100,
                              n_jobs=-1,
                              verbose=True)

In [None]:
rs_model_res_3.fit(X_train,y_train)

In [None]:
rs_model_res_3.best_params_

In [None]:
eval_scores(rs_model_res_3, X_train, X_val, y_train, y_val)

## The models below are only for estimating roughly how long fitting might take
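
The idea behind these pilot runs can be sketched as a simple linear extrapolation: time a handful of fits, then scale by the number of fits the full search needs (`n_iter * cv`). The helper `estimate_search_minutes` is hypothetical, and the 2.6 minutes for 15 fits is taken from the `rf_grid_1` pilot log with `n_jobs=-1` in this section.

```python
def estimate_search_minutes(pilot_minutes, pilot_fits, n_iter, cv):
    """Scale a pilot run's duration linearly to a search with n_iter * cv fits."""
    total_fits = n_iter * cv
    return pilot_minutes * total_fits / pilot_fits

# Pilot on rf_grid_1 with n_jobs=-1: 15 fits logged as 2.6 minutes
estimate = estimate_search_minutes(pilot_minutes=2.6, pilot_fits=15, n_iter=250, cv=5)
print(f"estimated: {estimate:.0f} min for {250 * 5} fits")
```

This is only a rough guide. A three-candidate pilot may happen to sample unusually cheap or expensive settings; the full 250-candidate search on the same grid was logged at 160.6 minutes.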

In [19]:
np.random.seed(42)
rs_model_4 = RandomizedSearchCV(RandomForestRegressor(), 
                              param_distributions=rf_grid_3,
                              cv=5,
                              n_iter=3,
                              verbose=True)

In [20]:
%%time
rs_model_4.fit(X_train,y_train)

Fitting 5 folds for each of 3 candidates, totalling 15 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:  1.0min finished


Wall time: 1min 2s


RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(), n_iter=3,
                   param_distributions={'max_depth': [None, 3, 5, 10],
                                        'max_features': [0.5, 1, 'sqrt',
                                                         'auto'],
                                        'max_samples': [10000],
                                        'min_samples_leaf': array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19]),
                                        'min_samples_split': array([ 2,  4,  6,  8, 10, 12, 14, 16, 18]),
                                        'n_estimators': array([10, 20, 30, 40, 50, 60, 70, 80, 90]),
                                        'n_jobs': [-1]},
                   verbose=True)

In [21]:
rs_model_4.best_params_

{'n_jobs': -1,
 'n_estimators': 60,
 'min_samples_split': 12,
 'min_samples_leaf': 1,
 'max_samples': 10000,
 'max_features': 1,
 'max_depth': None}

In [22]:
eval_scores(rs_model_4, X_train, X_val, y_train, y_val)

{'Train MAE': 8856.205385747859,
 'Val MAE': 11469.055474507251,
 'Train MSLE': 0.15313186644580068,
 'Val MSLE': 0.20834773084981376,
 'Train RMSLE': 0.3913206695867223,
 'Val RMSLE': 0.4564512360042568,
 'Train R2': 0.6848409258189497,
 'Val R2': 0.6323835976986882}

In [23]:
np.random.seed(42)
rs_model_5 = RandomizedSearchCV(RandomForestRegressor(), 
                              param_distributions=rf_grid_1,
                              cv=5,
                              n_iter=3,
                              verbose=True)

In [24]:
%%time
rs_model_5.fit(X_train,y_train)

Fitting 5 folds for each of 3 candidates, totalling 15 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:  6.3min finished


Wall time: 7min 18s


RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(), n_iter=3,
                   param_distributions={'max_depth': [None, 3, 5, 10, 20, 40],
                                        'max_features': [0.5, 1, 'sqrt',
                                                         'auto'],
                                        'max_samples': [10000],
                                        'min_samples_leaf': array([ 3,  5,  7,  9, 11, 13, 15, 17, 19]),
                                        'min_samples_split': array([ 2,  4,  6,  8, 10, 12, 14, 16, 18]),
                                        'n_estimators': array([ 10,  30,  50,  70,  90, 110, 130, 150, 170, 190, 210, 230, 250,
       270, 290, 310, 330, 350, 370, 390])},
                   verbose=True)

In [25]:
rs_model_5.best_params_

{'n_estimators': 370,
 'min_samples_split': 18,
 'min_samples_leaf': 11,
 'max_samples': 10000,
 'max_features': 'auto',
 'max_depth': 40}

In [26]:
eval_scores(rs_model_5, X_train, X_val, y_train, y_val)

{'Train MAE': 6143.589784893025,
 'Val MAE': 7745.768577585529,
 'Train MSLE': 0.07782946923265956,
 'Val MSLE': 0.09689841496328197,
 'Train RMSLE': 0.2789793347770755,
 'Val RMSLE': 0.31128510237928503,
 'Train R2': 0.8287803125204002,
 'Val R2': 0.7993129703127813}

In [28]:
np.random.seed(42)
rs_model_6 = RandomizedSearchCV(RandomForestRegressor(), 
                              param_distributions=rf_grid_1,
                              cv=5,
                              n_iter=3,
                              verbose=True)

In [29]:
%%time
rs_model_6.fit(X_train,y_train)

Fitting 5 folds for each of 3 candidates, totalling 15 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:  3.4min finished


Wall time: 3min 45s


RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(), n_iter=3,
                   param_distributions={'max_depth': [None, 3, 5, 10, 20, 40],
                                        'max_features': [0.5, 1, 'sqrt',
                                                         'auto'],
                                        'max_samples': [10000],
                                        'min_samples_leaf': array([ 3,  5,  7,  9, 11, 13, 15, 17, 19]),
                                        'min_samples_split': array([ 2,  4,  6,  8, 10, 12, 14, 16, 18]),
                                        'n_estimators': array([ 10,  30,  50,  70,  90, 110, 130, 150, 170, 190, 210, 230, 250,
       270, 290, 310, 330, 350, 370, 390]),
                                        'n_jobs': [-1]},
                   verbose=True)

In [30]:
rs_model_6.best_params_

{'n_jobs': -1,
 'n_estimators': 370,
 'min_samples_split': 18,
 'min_samples_leaf': 11,
 'max_samples': 10000,
 'max_features': 'auto',
 'max_depth': 40}

In [31]:
eval_scores(rs_model_6, X_train, X_val, y_train, y_val)

{'Train MAE': 6143.589784893025,
 'Val MAE': 7745.768577585529,
 'Train MSLE': 0.07782946923265956,
 'Val MSLE': 0.09689841496328197,
 'Train RMSLE': 0.2789793347770755,
 'Val RMSLE': 0.31128510237928503,
 'Train R2': 0.8287803125204002,
 'Val R2': 0.7993129703127813}

In [32]:
np.random.seed(42)
rs_model_7 = RandomizedSearchCV(RandomForestRegressor(), 
                              param_distributions=rf_grid_1,
                              cv=5,
                              n_iter=3,
                              n_jobs =-1,
                              verbose=True)

In [33]:
%%time
rs_model_7.fit(X_train,y_train)

Fitting 5 folds for each of 3 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:  2.6min finished


Wall time: 3min 6s


RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(), n_iter=3, n_jobs=-1,
                   param_distributions={'max_depth': [None, 3, 5, 10, 20, 40],
                                        'max_features': [0.5, 1, 'sqrt',
                                                         'auto'],
                                        'max_samples': [10000],
                                        'min_samples_leaf': array([ 3,  5,  7,  9, 11, 13, 15, 17, 19]),
                                        'min_samples_split': array([ 2,  4,  6,  8, 10, 12, 14, 16, 18]),
                                        'n_estimators': array([ 10,  30,  50,  70,  90, 110, 130, 150, 170, 190, 210, 230, 250,
       270, 290, 310, 330, 350, 370, 390]),
                                        'n_jobs': [-1]},
                   verbose=True)

In [34]:
rs_model_7.best_params_

{'n_jobs': -1,
 'n_estimators': 370,
 'min_samples_split': 18,
 'min_samples_leaf': 11,
 'max_samples': 10000,
 'max_features': 'auto',
 'max_depth': 40}

In [35]:
eval_scores(rs_model_7, X_train, X_val, y_train, y_val)

{'Train MAE': 6146.362867283839,
 'Val MAE': 7748.332416624996,
 'Train MSLE': 0.0779130020670651,
 'Val MSLE': 0.09702029941418237,
 'Train RMSLE': 0.279129006137064,
 'Val RMSLE': 0.31148081708860076,
 'Train R2': 0.828421468633631,
 'Val R2': 0.7986031870439442}

In [19]:

rs_model_8 = RandomizedSearchCV(RandomForestRegressor(random_state=42), 
                              param_distributions=rf_grid_1,
                              cv=5,
                              n_iter=10,
                              n_jobs =-1,
                              verbose=True)

In [20]:
%%time
rs_model_8.fit(X_train,y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  6.7min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  7.4min finished


Wall time: 7min 39s


RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42),
                   n_jobs=-1,
                   param_distributions={'max_depth': [None, 3, 5, 10, 20, 40],
                                        'max_features': [0.5, 1.0, 'sqrt',
                                                         'auto'],
                                        'max_samples': [10000],
                                        'min_samples_leaf': array([ 3,  5,  7,  9, 11, 13, 15, 17, 19]),
                                        'min_samples_split': array([ 2,  4,  6,  8, 10, 12, 14, 16, 18]),
                                        'n_estimators': array([ 10,  30,  50,  70,  90, 110, 130, 150, 170, 190, 210, 230, 250,
       270, 290, 310, 330, 350, 370, 390]),
                                        'n_jobs': [-1]},
                   verbose=True)

In [21]:
rs_model_8.best_params_

{'n_jobs': -1,
 'n_estimators': 190,
 'min_samples_split': 14,
 'min_samples_leaf': 15,
 'max_samples': 10000,
 'max_features': 1.0,
 'max_depth': 10}

In [22]:
eval_scores(rs_model_8, X_train, X_val, y_train, y_val)

{'Train MAE': 6715.087018551988,
 'Val MAE': 8184.760927772924,
 'Train MSLE': 0.08993050732567222,
 'Val MSLE': 0.10702545351121837,
 'Train RMSLE': 0.29988415650993006,
 'Val RMSLE': 0.3271474491895334,
 'Train R2': 0.8020448991435206,
 'Val R2': 0.7785580734662484}

### Train a model with the best parameters

In [None]:
%%time
# Most ideal hyperparameters from the course
ideal_model_course = RandomForestRegressor(n_estimators=90,
                                    min_samples_leaf=1,
                                    min_samples_split=14,
                                    max_features=0.5,
                                    n_jobs=-1,
                                    max_samples=None)
ideal_model_course.fit(X_train, y_train)

With these new hyperparameters, as well as using all the samples, we should see an improvement in our model's performance.

You can make a faster model by altering some of the hyperparameters, in particular by lowering `n_estimators`, since each additional estimator means building another tree.

However, lowering `n_estimators` or altering other hyperparameters may lead to poorer results.

In [None]:
%%time
# Faster model
fast_model_course = RandomForestRegressor(n_estimators=40,
                                   min_samples_leaf=3,
                                   max_features=0.5,
                                   n_jobs=-1)
fast_model_course.fit(X_train, y_train)

In [None]:
eval_scores(fast_model_course, X_train, X_val, y_train, y_val)

### Make predictions on test data

Now we've got a trained model, it's time to make predictions on the test data.

Remember what we've done.

There are 3 main datasets:

* Train.csv is the training set, which contains data through the end of 2011.
* Valid.csv is the validation set, which contains data from January 1, 2012 - April 30, 2012. You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
* Test.csv is the test set, which won't be released until the last week of the competition. It contains data from May 1, 2012 - November 2012. Your score on the test set determines your final rank for the competition.

Our model is trained on data up to the end of 2011. However, the test data is from May 1, 2012 to November 2012.

So what we're doing is trying to use the patterns our model has learned from the training data to predict the sale price of bulldozers it has never seen before, which we assume have characteristics similar to those in the training data.

In [None]:
df_test = pd.read_csv("data/bluebook-for-bulldozers/Test.csv", parse_dates =["saledate"])
df_test.head()

In [None]:
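# df_test needs the same preprocessing (and the same columns) as X_train before this call will work
ideal_model_course.predict(df_test)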
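
A fitted model expects the test features to have the same columns, in the same order, as the features it was trained on. The preprocessing that produced the prepared training file is not shown in this notebook, so the following is only a sketch of the column-alignment step on tiny invented frames; the `*_demo` names and the choice to fill absent `_is_missing` flags with `False` are assumptions.

```python
import pandas as pd

# Tiny invented frames: the training features include an "_is_missing" flag
# that the test frame lacks, and the column order differs.
X_train_demo = pd.DataFrame({
    "YearMade": [1974, 1980],
    "auctioneerID": [3.0, 3.0],
    "auctioneerID_is_missing": [False, True],
})
X_test_demo = pd.DataFrame({
    "auctioneerID": [1.0, 2.0],
    "YearMade": [1999, 2005],
})

# Find which training columns the test frame does not have
missing_cols = set(X_train_demo.columns) - set(X_test_demo.columns)
print("columns missing from test:", sorted(missing_cols))

# Add absent columns (filled with False) and match the training column order
X_test_aligned = X_test_demo.reindex(columns=X_train_demo.columns, fill_value=False)
print(list(X_test_aligned.columns) == list(X_train_demo.columns))
```

Filling with `False` only makes sense for boolean "was this value missing" flags; categorical encoding and missing-value imputation would still have to be applied to the real test data first.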