# Imports and Paths

In [1]:
import os
import pandas as pd
import sys
import numpy as np
from sklearn.ensemble import *

In [2]:
%load_ext autoreload
%autoreload 2

In [5]:
PATH_base = 'E:\\GitHub\\data_science\\'
PATH_bd = 'E:\\GitHub\\data_science\\data\\uncompressed\\blue_book_for_bulldozers\\'
PATH_func = 'E:\\GitHub\\data_science\\src\\'

In [6]:
sys.path.append(PATH_func)

In [87]:
from features.utilities import *
from features.fastai import *

In [10]:
%ls ..\\data\\uncompressed\\blue_book_for_bulldozers\\

 Volume in drive E is Storage
 Volume Serial Number is D098-4C8A

 Directory of E:\GitHub\data_science\data\uncompressed\blue_book_for_bulldozers

03/24/2018  04:37 PM    <DIR>          .
03/24/2018  04:37 PM    <DIR>          ..
01/24/2013  09:08 PM       116,403,970 Train.csv
01/24/2013  07:11 PM         3,318,969 Valid.csv
               2 File(s)    119,722,939 bytes
               2 Dir(s)  1,852,864,679,936 bytes free


# Initial Data Munging

This dataset is from the Kaggle competition [Blue Book for Bulldozers](https://www.kaggle.com/c/bluebook-for-bulldozers).

In [12]:
df_raw = pd.read_csv(f'{PATH_bd}Train.csv', low_memory=False, parse_dates=["saledate"])

In [13]:
# df_raw.to_feather(f'{PATH_base}\\data\\uncompressed\\blue_book_for_bulldozers\\bulldozer_raw')
# df_orig = df_raw.copy()
# df_raw = df_orig

In [75]:
display_some(df_raw.head().T,80,10)

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| SalesID | 1139246 | 1139248 | 1139249 | 1139251 | 1139253 |
| SalePrice | 66000 | 57000 | 10000 | 38500 | 11000 |
| MachineID | 999089 | 117657 | 434808 | 1026470 | 1057373 |
| ModelID | 3157 | 77 | 7009 | 332 | 17311 |
| datasource | 121 | 121 | 121 | 121 | 121 |
| auctioneerID | 3 | 3 | 3 | 3 | 3 |
| YearMade | 2004 | 1996 | 2001 | 2001 | 2007 |
| MachineHoursCurrentMeter | 68 | 4640 | 2838 | 3486 | 722 |
| UsageBand | Low | Low | High | High | Medium |
| saledate | 2006-11-16 00:00:00 | 2004-03-26 00:00:00 | 2004-02-26 00:00:00 | 2011-05-19 00:00:00 | 2009-07-23 00:00:00 |


## Metrics

In this competition, the goal is to predict `SalePrice`, and the metric is the root mean squared log error (RMSLE). I therefore replace the `SalePrice` column with its log, so that the ordinary RMSE of the model's predictions is the RMSLE on the original prices.

In [23]:
df_raw.SalePrice = np.log(df_raw.SalePrice)
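
To make the reasoning behind the log transform concrete, here is a minimal sketch showing that the RMSE computed on log prices matches the RMSLE computed on raw prices. The function names and the sample prices are made up for illustration. It uses the plain `np.log`, matching the cell above; some RMSLE definitions use `log(1 + x)` instead.

```python
import numpy as np

def rmse(pred, actual):
    """Root mean squared error between two arrays."""
    return np.sqrt(np.mean((pred - actual) ** 2))

def rmsle(pred, actual):
    """Root mean squared log error, using plain log as in the cell above."""
    return np.sqrt(np.mean((np.log(pred) - np.log(actual)) ** 2))

# Made-up prices, for illustration only.
actual = np.array([66000.0, 57000.0, 10000.0])
pred = np.array([60000.0, 59000.0, 12000.0])

# RMSE on the logged values is the same quantity as RMSLE on the raw values.
rmse_on_logs = rmse(np.log(pred), np.log(actual))
rmsle_on_raw = rmsle(pred, actual)
print(rmse_on_logs, rmsle_on_raw)
```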

## Splitting Datetime

Split the datetime column `saledate` into multiple columns containing integer components of the datetime, e.g., year, month, day, day of week, and whether the day falls on a weekend.

In [24]:
add_datepart(df_raw, 'saledate')
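
`add_datepart` comes from the imported helper modules and its implementation is not shown here. As a rough idea of what such a step involves, the sketch below pulls a few integer components out of a datetime column with the pandas `.dt` accessor. The function name, the output column names, and the choice to drop the original column are my assumptions, not the helper's actual behavior.

```python
import pandas as pd

def add_date_parts_sketch(df, col):
    """Hypothetical stand-in: replace datetime column `col` with integer parts, in place."""
    dates = df[col]
    df[col + "_year"] = dates.dt.year
    df[col + "_month"] = dates.dt.month
    df[col + "_day"] = dates.dt.day
    df[col + "_dayofweek"] = dates.dt.dayofweek  # Monday=0 ... Sunday=6
    df[col + "_is_weekend"] = (dates.dt.dayofweek >= 5).astype(int)
    # Assumption: the raw datetime column is no longer needed afterwards.
    df.drop(columns=[col], inplace=True)

# Two sale dates taken from the table above.
demo = pd.DataFrame({"saledate": pd.to_datetime(["2006-11-16", "2004-03-26"])})
add_date_parts_sketch(demo, "saledate")
print(demo)
```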

## Convert Strings to Integer Categorical 

Convert all columns with string values to integer-coded categorical variables.

In [25]:
train_cats(df_raw)
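
`train_cats` is also one of the imported helpers. A minimal sketch of the idea, assuming it simply converts each string column to the pandas `category` dtype in place, could look like the following. The function name and the demo data are hypothetical.

```python
import pandas as pd

def strings_to_categories_sketch(df):
    """Hypothetical stand-in: convert every string column to a pandas category, in place."""
    for name in df.columns:
        if pd.api.types.is_string_dtype(df[name]):
            df[name] = df[name].astype("category")

demo = pd.DataFrame({
    "UsageBand": ["Low", "High", "Medium"],
    "YearMade": [2004, 1996, 2001],
})
strings_to_categories_sketch(demo)

# Categories are sorted alphabetically by default; each value is stored as an integer code.
print(demo["UsageBand"].cat.categories)
print(demo["UsageBand"].cat.codes)
```

The alphabetical default is why `High`, `Low`, `Medium` appear in that order in the next cell.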

In [26]:
df_raw.UsageBand.cat.categories

Index(['High', 'Low', 'Medium'], dtype='object')

When the strings have a natural order, you can assign the categorical integer codes so that they follow that order.

In [27]:
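# Keep the default (alphabetical) codes for comparison
df_codes = df_raw.UsageBand.cat.codes
# The inplace argument was removed in pandas 2.0, so assign the result back
df_raw.UsageBand = df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True)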

In [28]:
df_UsageBand_cat = pd.DataFrame({'Raw':df_raw.UsageBand, 'Unordered':df_codes, 'Ordered':df_raw.UsageBand.cat.codes})
df_UsageBand_cat.head()

| | Ordered | Raw | Unordered |
|---|---|---|---|
| 0 | 2 | Low | 1 |
| 1 | 2 | Low | 1 |
| 2 | 0 | High | 0 |
| 3 | 0 | High | 0 |
| 4 | 1 | Medium | 2 |


In [37]:
df, y, na_dict = proc_df(df_raw, 'SalePrice')

## Save to/Load from Feather

In [20]:
# df_raw.to_feather(f'{PATH_base}\\data\\interim\\bulldozer')
# df_raw = pd.read_feather(f'{PATH_base}\\data\\interim\\bulldozer')

# Random Forest

## $r^2$ and RMSE

In [47]:
m = RandomForestRegressor(n_jobs=-1)
m.fit(df, y)
m.score(df, y)

0.9831537858723391

This score is the coefficient of determination, $r^2$, which is defined as:

$$r^2 = 1-\dfrac{SS_{res}}{SS_{tot}}  = 1-\dfrac{\sum\limits_i\left(y_i-f_i\right)^2}{\sum\limits_i\left(y_i-\bar{y}\right)^2}$$

where $y_i$ is the true output, $f_i$ is the model output, and $\bar{y}$ is the mean of the true outputs. The values of $r^2$ can be read as follows:
  1. $r^2 = 1$ means that the model perfectly predicts the output
  1. $r^2 = 0$ means that the model does no better than always predicting the mean output, i.e., $\bar{y}$
  1. $1 > r^2 > 0$ means that the model is better than simply using the mean output
  1. $r^2 < 0$ means that the model is worse than just using the mean output as a predictor
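
The cases in the list above can be checked numerically. This is a small sketch that computes $r^2$ directly from the formula; the function name and the toy values are made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Compute 1 - SS_res / SS_tot, following the formula above."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Perfect predictions give r^2 = 1.
perfect = r_squared(y_true, y_true)
# Always predicting the mean gives r^2 = 0.
mean_only = r_squared(y_true, np.full_like(y_true, y_true.mean()))
# Predictions that are worse than the mean give a negative r^2.
worse = r_squared(y_true, y_true[::-1])

print(perfect, mean_only, worse)
```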
 
However, $r^2$ on the training set alone is not a good metric: an overfit model will have an $r^2$ close to 1 on the training set but can do much worse on a test set.

To detect overfitting, we can split the data into a training set and a validation set, using the training set to build the model and the validation set to test how well the model performs on data it has not seen.

In [105]:
n_trn = len(df)-12000
X_train, X_valid = split_train_val(df,trn_amount=n_trn)
y_train, y_valid = split_train_val(y,trn_amount=n_trn)

X_train.shape, y_train.shape, X_valid.shape

((389125, 66), (389125,), (12000, 66))
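
`split_train_val` is my own helper and is not shown in this notebook. The shapes above are consistent with a split that keeps the first `trn_amount` rows for training and the rest for validation, so a minimal sketch under that assumption might look like this. The function name is hypothetical.

```python
import numpy as np

def split_first_n_sketch(a, n):
    """Hypothetical stand-in: return (first n rows, remaining rows) as independent copies."""
    return a[:n].copy(), a[n:].copy()

data = np.arange(10)
train, valid = split_first_n_sketch(data, 7)
print(train.shape, valid.shape)
```

Plain slicing like this works on both NumPy arrays and pandas DataFrames, which is why the same call can be used for the features and the target.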

In [106]:
train_rmse = custom_RFscore(m, X_train, y_train)
valid_rmse = custom_RFscore(m, X_valid, y_valid) 
train_score = m.score(X_train, y_train)
valid_score = m.score(X_valid, y_valid)
print(f'Training rmse: {train_rmse}')
print(f'Validation rmse: {valid_rmse}')
print(f'Training Score: {train_score}')
print(f'Validation Score: {valid_score}')

Training rmse: 0.2697188195572289
Validation rmse: 0.28549449257662696
Training Score: 0.8479601870379432
Validation Score: 0.854439531907599
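
`custom_RFscore` is another helper whose code is not shown. Judging by how its result is labeled above, it returns the RMSE of the model's predictions, so an assumed equivalent is sketched below. The function name and the toy model are hypothetical; any object with a `predict` method would work in place of the toy class.

```python
import numpy as np

def rmse_score_sketch(model, X, y):
    """Hypothetical stand-in: RMSE between model.predict(X) and the targets y."""
    predictions = model.predict(X)
    return np.sqrt(np.mean((predictions - y) ** 2))

class ConstantModel:
    """Toy stand-in for a fitted estimator that always predicts one value."""
    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(len(X), self.value)

X_demo = np.zeros((4, 2))
y_demo = np.array([1.0, 2.0, 3.0, 4.0])
score = rmse_score_sketch(ConstantModel(2.5), X_demo, y_demo)
print(score)
```

Because `SalePrice` was replaced by its log earlier, an RMSE computed this way on the real data is the RMSLE on the original prices.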


## Speeding Things Up

In [98]:
df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice', subset=30000, na_dict=na_dict)
X_train, _ = split_train_val(df_trn, trn_amount=20000)
y_train, _ = split_train_val(y_trn, trn_amount=20000)

In [102]:
m = RandomForestRegressor(n_jobs=-1);
m.fit(X_train, y_train);

In [103]:
train_rmse = custom_RFscore(m, X_train, y_train)
valid_rmse = custom_RFscore(m, X_valid, y_valid) 
train_score = m.score(X_train, y_train)
valid_score = m.score(X_valid, y_valid)
print(f'Training rmse: {train_rmse}')
print(f'Validation rmse: {valid_rmse}')
print(f'Training Score: {train_score}')
print(f'Validation Score: {valid_score}')

Training rmse: 0.11842444840537834
Validation rmse: 0.28549449257662696
Training Score: 0.9711021415124512
Validation Score: 0.854439531907599
