# Bulldozers

This notebook uses random forests to predict the auction sale price of a piece of heavy equipment, with the aim of creating a "blue book" for bulldozers. The data comes from the Kaggle competition "Blue Book for Bulldozers". The solution relies on the fastai library.

In [None]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

In [None]:
from fastai.imports import *
from fastai.structured import *

from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from IPython.display import display

from sklearn import metrics

In [None]:
PATH = "data/bulldozers/"

In [None]:
df_raw = pd.read_csv(f'{PATH}Train.csv', low_memory=False, 
                     parse_dates=["saledate"])

The statement above uses a formatted string literal (f-string), available since Python 3.6. Formatted string literals are prefixed with 'f' and contain replacement fields surrounded by curly braces. Each replacement field is an expression that is evaluated at run time.
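
As a small illustration of replacement fields, the sketch below builds the same path as the cell above and formats a number. The names `filename` and `n_rows` are made up for this example.

```python
PATH = "data/bulldozers/"
filename = "Train.csv"  # hypothetical variable, only for this example

# Each {...} field is an expression evaluated when the string is built.
csv_path = f"{PATH}{filename}"
print(csv_path)

# Fields can hold any expression and an optional format spec after ':'.
n_rows = 12000
summary = f"{n_rows:,} rows, half is {n_rows // 2}"
print(summary)
```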

In [None]:
def display_all(df):
    with pd.option_context("display.max_rows", 1000): 
        with pd.option_context("display.max_columns", 1000): 
            display(df)

In [None]:
display_all(df_raw.tail().transpose())

### Preprocessing

The model will be scored on the RMSLE (root mean squared log error) between the actual and predicted auction prices. We therefore take the log of the prices up front, so that the ordinary RMSE computed on the logged values is exactly the metric we need.

In [None]:
df_raw.SalePrice = np.log(df_raw.SalePrice)
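
The equivalence between RMSLE on prices and RMSE on logged prices can be checked on a few made-up numbers. This is a minimal sketch: the prices are invented, and it uses the plain logarithm to match the np.log call above (some definitions of RMSLE use log(1 + x) instead).

```python
import math

def rmse(pred, actual):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))

def rmsle(pred, actual):
    """Root mean squared error between the logs of the values."""
    return math.sqrt(
        sum((math.log(p) - math.log(a)) ** 2 for p, a in zip(pred, actual)) / len(actual)
    )

# Invented prices, for illustration only.
actual = [10000.0, 25000.0, 60000.0]
pred = [12000.0, 20000.0, 66000.0]

direct = rmsle(pred, actual)
via_logs = rmse([math.log(p) for p in pred], [math.log(a) for a in actual])
print(direct, via_logs)  # the two values agree
```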

add_datepart is a fastai function that splits a date column into a number of separate attributes, such as saleYear.

In [None]:
add_datepart(df_raw, 'saledate')
df_raw.saleYear.head()
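
To give a feel for what splitting a date produces, here is a rough pandas-only sketch. It is not the fastai implementation; apart from saleYear, the column names and the two dates are assumptions chosen for illustration.

```python
import pandas as pd

# Two invented sale dates.
sales = pd.DataFrame({"saledate": pd.to_datetime(["2011-11-16", "2012-01-28"])})

# Derive a few attributes from the datetime column via the .dt accessor.
sales["saleYear"] = sales.saledate.dt.year
sales["saleMonth"] = sales.saledate.dt.month
sales["saleDayofweek"] = sales.saledate.dt.dayofweek  # Monday=0 ... Sunday=6
sales["saleIs_month_end"] = sales.saledate.dt.is_month_end

# Drop the original column once its parts have been extracted.
sales = sales.drop(columns="saledate")
print(sales)
```

A tree-based model cannot use a raw timestamp to discover weekly or seasonal patterns, which is why the parts are exposed as separate columns.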

In [None]:
train_cats(df_raw)

train_cats converts the string columns to pandas categories. Reorder the UsageBand categories so that their order is meaningful (High, Medium, Low).

In [None]:
# Assign the result back: the inplace argument is not available in recent pandas versions.
df_raw.UsageBand = df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True)

In [None]:
df_raw.UsageBand = df_raw.UsageBand.cat.codes
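
The effect of reordering categories on the integer codes can be seen on a toy series. This is a sketch with invented values; note how the missing entry gets the code -1.

```python
import pandas as pd

# Invented values, including one missing entry.
usage = pd.Series(["Low", "High", None, "Medium"], dtype="category")
default_order = list(usage.cat.categories)  # alphabetical by default

# Impose a meaningful order, then read off the integer codes.
usage = usage.cat.set_categories(["High", "Medium", "Low"], ordered=True)
codes = usage.cat.codes
print(default_order)
print(list(codes))
```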

Inspect missing values: the fraction of nulls in each column.

In [None]:
display_all(df_raw.isnull().sum().sort_index()/len(df_raw))
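
The expression above divides the null count of each column by the number of rows. On a toy frame (invented data), the same fractions can also be obtained with the mean of the boolean mask:

```python
import pandas as pd

# Invented data with some missing entries.
toy = pd.DataFrame({"a": [1.0, None, 3.0, None], "b": ["x", "y", None, "z"]})

by_sum = toy.isnull().sum() / len(toy)  # same form as the cell above
by_mean = toy.isnull().mean()           # mean of True/False gives the fraction
print(by_mean)
```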

Save in the feather format, a very fast on-disk format that stores the data in the same layout it has in RAM.

In [None]:
os.makedirs('tmp', exist_ok=True)
df_raw.to_feather('tmp/bulldozers-raw')

In [None]:
df_raw = pd.read_feather('tmp/bulldozers-raw')

### Building the model

proc_df makes a copy of the dataframe, splits the dependent variable off from that copy, and fixes missing numeric values by replacing them with the column median and adding a boolean column that flags which rows were missing. Missing categorical values need no special treatment: pandas already encodes them with the category code -1.

In [None]:
df, y = proc_df(df_raw, 'SalePrice')
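
The numeric missing-value treatment described above can be sketched in plain pandas. This is not the proc_df source, only an illustration of the idea; the column name `hours` and its values are invented.

```python
import pandas as pd

# Invented numeric column with one missing value.
toy = pd.DataFrame({"hours": [100.0, None, 300.0, 500.0]})

# Record which rows were missing before filling them in.
toy["hours_na"] = toy["hours"].isnull()

# Replace missing entries with the median of the observed values.
toy["hours"] = toy["hours"].fillna(toy["hours"].median())
print(toy)
```

Keeping the indicator column lets the model learn from the fact that a value was missing, which can itself be predictive.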

Split off the last 12,000 rows as a validation set.

In [None]:
def split_vals(a,n): return a[:n].copy(), a[n:].copy()

n_valid = 12000  # same as Kaggle's test set size
n_trn = len(df)-n_valid
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)

X_train.shape, y_train.shape, X_valid.shape

In [None]:
def rmse(x,y): return math.sqrt(((x-y)**2).mean())

def print_score(m):
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
                m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)

In [None]:
m = RandomForestRegressor(n_jobs=-1)
%time m.fit(X_train, y_train)
print_score(m)

#### Out of Bag Score

The out-of-bag (OOB) score is computed on the training set, but each row's prediction uses only the trees that did not see that row during training. This lets us check whether the model is overfitting without needing a separate validation set.

In [None]:
m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)
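
To see the OOB score next to the ordinary training score, here is a self-contained sketch on synthetic data (the data and the seed are assumptions for illustration, not the bulldozers set). For a regressor, scikit-learn reports oob_score_ as an R² value, like m.score.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem, invented for this example.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

m = RandomForestRegressor(n_estimators=40, oob_score=True, random_state=0)
m.fit(X, y)

train_r2 = m.score(X, y)  # every tree predicts every row, including rows it trained on
oob_r2 = m.oob_score_     # each row is predicted only by trees that never saw it
print(train_r2, oob_r2)
```

The training score is optimistic because the trees have memorised much of their own data; the OOB score is the more honest estimate.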

#### Reducing overfitting

Rather than limiting the total amount of data that our model can access, let's instead give each tree a different random subset of the rows.

In [None]:
set_rf_samples(20000)

In [None]:
m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)
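
set_rf_samples is a fastai helper. As a rough point of comparison, and not as a description of how that helper works, scikit-learn 0.22 and later accept a max_samples argument that caps the number of rows drawn for each tree. The sketch below assumes such a version and uses synthetic data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, invented for this example.
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=1000)

# Each tree is fit on its own bootstrap sample of 200 rows drawn from the 1000 available.
m = RandomForestRegressor(n_estimators=10, max_samples=200, random_state=0)
m.fit(X, y)

preds = m.predict(X)
print(len(m.estimators_), preds.shape)
```

Because every tree draws its own subset, the forest as a whole can still see most of the data while each individual tree trains quickly.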

We also specify min_samples_leaf, the minimum number of rows in every leaf node, and max_features, the proportion of features randomly selected as candidates at each split.

In [None]:
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)
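
The meaning of the two parameters can be checked on a single decision tree, which accepts the same arguments as the forest. This is a sketch on synthetic data (an assumption for illustration): with 10 features, max_features=0.5 means 5 candidate features per split, and no leaf ends up with fewer than 3 rows.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with 10 features, invented for this example.
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 10))
y = X[:, 0] + rng.normal(scale=0.1, size=300)

tree = DecisionTreeRegressor(min_samples_leaf=3, max_features=0.5, random_state=0)
tree.fit(X, y)

# Leaves are the nodes without children.
is_leaf = tree.tree_.children_left == -1
smallest_leaf = tree.tree_.n_node_samples[is_leaf].min()

# Number of features considered at each split, as resolved by scikit-learn.
features_per_split = tree.max_features_
print(smallest_leaf, features_per_split)
```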