In [70]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

Get Train.csv from here: https://www.kaggle.com/competitions/bluebook-for-bulldozers/data?select=Train.zip

Based on this Kaggle competition: https://www.kaggle.com/competitions/bluebook-for-bulldozers/overview

Class page: https://course18.fast.ai/lessonsml1/lesson1.html

Lesson repo: https://github.com/fastai/fastai1/blob/master/courses/ml1/lesson1-rf.ipynb

The fastai.structured module no longer exists; its functions can be found here: https://github.com/fastai/fastai1/blob/master/old/fastai/structured.py

In [2]:
from fastai.imports import *

from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from IPython.display import display

from sklearn import metrics

In [7]:
PATH = "data/bulldozer/"

In [8]:
df_raw = pd.read_csv(f'{PATH}Train.csv', low_memory=False, parse_dates=["saledate"])

In [9]:
def display_all(df):
    with pd.option_context("display.max_rows", 1000):
        with pd.option_context("display.max_columns", 1000):
            display(df)

In [11]:
display_all(df_raw.tail().transpose())

Unnamed: 0,401120,401121,401122,401123,401124
SalesID,6333336,6333337,6333338,6333341,6333342
SalePrice,10500,11000,11500,9000,7750
MachineID,1840702,1830472,1887659,1903570,1926965
ModelID,21439,21439,21439,21435,21435
datasource,149,149,149,149,149
auctioneerID,1.0,1.0,1.0,2.0,2.0
YearMade,2005,2005,2005,2005,2005
MachineHoursCurrentMeter,,,,,
UsageBand,,,,,
saledate,2011-11-02 00:00:00,2011-11-02 00:00:00,2011-11-02 00:00:00,2011-10-25 00:00:00,2011-10-25 00:00:00


In [12]:
df_raw.SalePrice = np.log(df_raw.SalePrice)

In [13]:
df_raw.SalePrice

0         11.097410
1         10.950807
2          9.210340
3         10.558414
4          9.305651
            ...    
401120     9.259131
401121     9.305651
401122     9.350102
401123     9.104980
401124     8.955448
Name: SalePrice, Length: 401125, dtype: float64

A regressor is a machine learning model that predicts a continuous outcome. Here we are trying to predict the price of a bulldozer, which is a continuous variable, so we use a RandomForestRegressor rather than a classifier.

RandomForestRegressor is a class from scikit-learn. You first create an instance of the model, and then fit it on the independent variable(s) and the dependent variable. Here we use every column except SalePrice as the independent variables, and SalePrice as the dependent variable.
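As a minimal sketch of that create-then-fit pattern, the snippet below trains a forest on synthetic, all-numeric data. The feature matrix `X`, the target `y`, and the hyperparameter values are made up for illustration and have nothing to do with the bulldozer data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration): 200 rows, 3 numeric features
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
# Continuous target that depends mostly on the first feature, plus a little noise
y = 2.0 * X[:, 0] + rng.normal(0, 0.1, size=200)

# Step 1: create an instance of the model
model = RandomForestRegressor(n_estimators=20, n_jobs=-1, random_state=0)
# Step 2: fit it on the independent variables (X) and the dependent variable (y)
model.fit(X, y)

# Predictions are continuous values, one per input row
preds = model.predict(X[:5])
```

Note that every feature here is already numeric, which is what `fit` expects.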

In [21]:
m = RandomForestRegressor(n_jobs=-1)
m.fit(df_raw.drop('SalePrice', axis=1), df_raw.SalePrice)

ValueError: could not convert string to float: 'Low'

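Hm. The data contains a lot of string columns, such as UsageBand, whose value 'Low' triggered the error above. These are categorical variables, not continuous ones, and the model can only work with numbers. Dates need attention too: saledate was parsed as a datetime, which is not a plain number either. Take saledate for example.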

In [22]:
df_raw.saledate

0        2006-11-16
1        2004-03-26
2        2004-02-26
3        2011-05-19
4        2009-07-23
            ...    
401120   2011-11-02
401121   2011-11-02
401122   2011-11-02
401123   2011-10-25
401124   2011-10-25
Name: saledate, Length: 401125, dtype: datetime64[ns]

What are the interesting features of this date data? Maybe the days, maybe the year? What about whether the date was a holiday? 
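A rough sketch of how such features could be pulled out of a date column with the pandas `.dt` accessor is shown below. The three dates are made up, and the holiday flag uses `USFederalHolidayCalendar` purely as an assumed choice of calendar; the notebook itself does not specify one.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Toy dates (illustrative only): 2011-07-04 is a US federal holiday
dates = pd.Series(pd.to_datetime(["2011-07-04", "2011-11-02", "2011-10-25"]))

# Calendar components exposed by the .dt accessor
years = dates.dt.year
weekdays = dates.dt.dayofweek      # Monday=0 ... Sunday=6
month_end = dates.dt.is_month_end  # boolean flag per row

# Holiday flag: assumes US federal holidays are the relevant calendar
cal = USFederalHolidayCalendar()
holidays = cal.holidays(start=dates.min(), end=dates.max())
is_holiday = dates.isin(holidays)
```

Each result is a numeric or boolean Series aligned with the original rows, so it can be attached to the dataframe as a new column.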

In [38]:
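def add_datepart(df, fldname):
    """Replace the datetime column `fldname` with numeric date-part columns, in place."""
    fld = df[fldname]
    # Strip a trailing 'date'/'Date' so that 'saledate' gives the prefix 'sale'
    targ_pre = re.sub('[Dd]ate$', '', fldname)
    for n in ('Year', 'Month', 'Day', 'Dayofweek', 'Dayofyear', 'Is_month_end', 
              'Is_month_start', 'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 
              'Is_year_start'):
        # Each name maps to a pandas .dt attribute, e.g. 'Dayofweek' -> fld.dt.dayofweek
        df[targ_pre+n] = getattr(fld.dt, n.lower())
    # Number of days since the earliest date in the column
    df[targ_pre+'Elapsed'] = (fld - fld.min()).dt.days
    df.drop(fldname, axis=1, inplace=True)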
        

In [None]:
add_datepart(df_raw, 'saledate')

In [41]:
df_raw.columns

Index(['SalesID', 'SalePrice', 'MachineID', 'ModelID', 'datasource',
       'auctioneerID', 'YearMade', 'MachineHoursCurrentMeter', 'UsageBand',
       'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc', 'fiModelSeries',
       'fiModelDescriptor', 'ProductSize', 'fiProductClassDesc', 'state',
       'ProductGroup', 'ProductGroupDesc', 'Drive_System', 'Enclosure',
       'Forks', 'Pad_Type', 'Ride_Control', 'Stick', 'Transmission',
       'Turbocharged', 'Blade_Extension', 'Blade_Width', 'Enclosure_Type',
       'Engine_Horsepower', 'Hydraulics', 'Pushblock', 'Ripper', 'Scarifier',
       'Tip_Control', 'Tire_Size', 'Coupler', 'Coupler_System',
       'Grouser_Tracks', 'Hydraulics_Flow', 'Track_Type',
       'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb', 'Pattern_Changer',
       'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type', 'Travel_Controls',
       'Differential_Type', 'Steering_Controls', 'saleYear', 'saleMonth',
       'saleDay', 'saleDayofweek', 'saleDayofyear', 'saleIs_mont

In [53]:
def train_cats(df):
    for n,c in df.items():
        if c.dtype == "object": df[n] = c.astype('category').cat.as_ordered()

In [54]:
train_cats(df_raw)

In [56]:
df_raw.UsageBand.cat.categories

Index(['High', 'Low', 'Medium'], dtype='object')

In [61]:
df_raw.UsageBand = df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True)

In [63]:
df_raw.UsageBand.cat.codes

0         2
1         2
2         0
3         0
4         1
         ..
401120   -1
401121   -1
401122   -1
401123   -1
401124   -1
Length: 401125, dtype: int8
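To make the relationship between category order and codes concrete, here is a small sketch on a made-up four-value series (not the real UsageBand column). It shows the default alphabetical ordering, the effect of `set_categories`, and the `-1` code pandas assigns to missing values.

```python
import pandas as pd

# Toy stand-in for a column like UsageBand, including one missing value
usage = pd.Series(["Low", "High", None, "Medium"])

# Converting to an ordered category sorts the categories alphabetically
cat = usage.astype("category").cat.as_ordered()
default_categories = list(cat.cat.categories)  # ['High', 'Low', 'Medium']
default_codes = cat.cat.codes.tolist()         # position of each value in that list

# Reordering the categories changes the integer code behind each value
reordered = cat.cat.set_categories(["High", "Medium", "Low"], ordered=True)
reordered_codes = reordered.cat.codes.tolist()
```

With the default order, 'Low' sits between 'High' and 'Medium'; after reordering, the codes run monotonically from 'High' (0) to 'Low' (2), which is a more sensible order for a tree to split on.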

OK, every string column is now a category backed by integer codes, with -1 standing in for a missing value. But we still have a lot of missing values.

In [64]:
display_all(df_raw.isnull().sum().sort_index()/len(df_raw))

Backhoe_Mounting            0.803872
Blade_Extension             0.937129
Blade_Type                  0.800977
Blade_Width                 0.937129
Coupler                     0.466620
Coupler_System              0.891660
Differential_Type           0.826959
Drive_System                0.739829
Enclosure                   0.000810
Enclosure_Type              0.937129
Engine_Horsepower           0.937129
Forks                       0.521154
Grouser_Tracks              0.891899
Grouser_Type                0.752813
Hydraulics                  0.200823
Hydraulics_Flow             0.891899
MachineHoursCurrentMeter    0.644089
MachineID                   0.000000
ModelID                     0.000000
Pad_Type                    0.802720
Pattern_Changer             0.752651
ProductGroup                0.000000
ProductGroupDesc            0.000000
ProductSize                 0.525460
Pushblock                   0.937129
Ride_Control                0.629527
Ripper                      0.740388
S
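The fractions above come from counting nulls per column and dividing by the row count. A small sketch on a made-up dataframe (the column names and values are illustrative only) shows the same calculation, along with the equivalent `isnull().mean()` shortcut.

```python
import numpy as np
import pandas as pd

# Toy dataframe with some missing values (not the bulldozer data)
toy = pd.DataFrame({
    "hours": [100.0, np.nan, np.nan, 250.0],
    "state": ["Ohio", "Texas", None, "Iowa"],
    "year": [2005, 2006, 2007, 2008],
})

# Same approach as the cell above: null count per column divided by row count
frac_by_sum = toy.isnull().sum().sort_index() / len(toy)

# Equivalent shortcut: the mean of a boolean column is its fraction of True values
frac_by_mean = toy.isnull().mean().sort_index()
```

Both give one fraction per column, so a value near 1.0 means the column is almost entirely empty.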

In [65]:
os.makedirs('tmp', exist_ok=True)

With the tmp directory in place, we can save the current state of our dataframe to a file in feather format and read it back later.

In [67]:
df_raw.to_feather('tmp/raw')

In [68]:
df_raw = pd.read_feather('tmp/raw')

In [69]:
def proc_df(df, y_fld, skip_flds=None, do_scale=False, preproc_fn=None, max_n_cat=None, subset=None):
    if not skip_flds: skip_flds=[]
    if subset: df = get_sample(df, subset)
    df = df.copy()
    if preproc_fn: preproc_fn(df)
    y = df[y_fld].values
    df = df.drop(skip_flds+[y_fld], axis=1)
    
    for n,c in df.items(): fix_missing(df,c,n)