# Random Forest

- skip_exec: true


What is a random forest?

- A universal machine learning algorithm that can be used for both classification and regression problems.
- A way of predicting something of any kind. It could be a category or a continuous value.
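- It can make predictions from columns of any kind. The columns could contain data about pixels, postcodes, revenue, etc.
- It is fairly resistant to overfitting, and where it does overfit, it is usually easy to fix (for example by raising `min_samples_leaf`).
- It does not strictly require a separate validation set. Each tree is trained on a bootstrap sample of the rows, so the rows a tree never saw (its out-of-bag rows) can be used to estimate how well the model generalises, even if you only have one data set.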
- It has few if any statistical assumptions about the data. E.g. it does not assume that the data is normally distributed, that the data is linear, that the data is balanced, etc.
- It does not require lots of feature engineering.
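
The point about not needing a separate validation set rests on out-of-bag (OOB) rows: each tree is fitted on a bootstrap sample, so the rows it never saw can act as a held-out set for that tree. The following is a minimal sketch, assuming synthetic data from `make_regression` rather than the dataset used later in this notebook, that compares the OOB estimate with an explicit hold-out score.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data, used purely for illustration.
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# oob_score=True scores each training row using only the trees that did not see it.
rf_oob = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
rf_oob.fit(X_train, y_train)

oob_r2 = rf_oob.oob_score_               # R^2 estimated from out-of-bag rows
test_r2 = rf_oob.score(X_test, y_test)   # R^2 on rows held out explicitly
print(f"OOB R^2: {oob_r2:.3f}, held-out R^2: {test_r2:.3f}")
```

The two numbers will not match exactly, but they are usually close enough for the OOB score to serve as a quick generalisation check.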


The curse of dimensionality is a largely meaningless concept in machine learning. The idea is that having more columns creates a space that is more "empty": the more dimensions you have, the likelier it is that a point sits on the edge of at least one of them. In theory this means that the distance between points is less meaningful in higher dimensions. In practice this is rarely a problem. Points still have meaningful distances across the other dimensions, so you can still say that one point is more or less similar to another. For example, k-nearest neighbours can still work well in high dimensions. The curse of dimensionality is a concern in statistical theory far more than in applied machine learning.

In fact, when doing feature engineering for machine learning, you should add columns if they contain any information that could be useful to your model.
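
One way to sanity-check this advice is to see what happens when a forest is given many columns that carry no information at all. The sketch below is an illustration under stated assumptions, not a benchmark: the data is synthetic, the helper `oob_r2` is hypothetical, and it makes no claim about which of the two scores comes out higher.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 1000

# One informative column plus 20 columns of pure noise (synthetic, for illustration).
x_signal = rng.uniform(0, 1, size=(n, 1))
x_noise = rng.normal(size=(n, 20))
y = 2 * x_signal[:, 0] + rng.normal(scale=0.1, size=n)

def oob_r2(X, y):
    # Out-of-bag R^2 of a 100-tree forest fitted on the given columns.
    rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
    return rf.fit(X, y).oob_score_

r2_signal_only = oob_r2(x_signal, y)
r2_with_noise = oob_r2(np.hstack([x_signal, x_noise]), y)
print(f"signal only: {r2_signal_only:.3f}, with 20 noise columns: {r2_with_noise:.3f}")
```

In this toy setting the forest still finds the informative column when it is surrounded by noise, which is the behaviour the paragraph above relies on.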


## Hyperparameters


`min_samples_leaf` - The minimum number of samples required in a leaf. A split is only considered if it leaves at least this many training samples in each of the two branches, so raising the value produces shallower trees with smoother predictions. This is a way of preventing overfitting. The default is 1, which allows a tree to keep splitting until a leaf holds a single sample. If you have a lot of data, or the model is overfitting, you can set this to a higher number. If you have a small amount of data, keep it low.
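
To see the effect of `min_samples_leaf`, the fitted trees can be inspected directly. This is a minimal sketch on synthetic data; the helper `leaf_stats` is hypothetical, and the leaf sizes it reports are the trees' own `n_node_samples` counts.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used purely for illustration.
X, y = make_regression(n_samples=2000, n_features=8, noise=5.0, random_state=0)

def leaf_stats(min_samples_leaf):
    rf = RandomForestRegressor(
        n_estimators=20, min_samples_leaf=min_samples_leaf, random_state=0
    ).fit(X, y)
    n_leaves = [est.get_n_leaves() for est in rf.estimators_]
    # A node is a leaf when it has no left child (children_left == -1).
    smallest = min(
        est.tree_.n_node_samples[est.tree_.children_left == -1].min()
        for est in rf.estimators_
    )
    return sum(n_leaves) / len(n_leaves), int(smallest)

avg_leaves_1, smallest_1 = leaf_stats(1)
avg_leaves_25, smallest_25 = leaf_stats(25)
print(f"min_samples_leaf=1:  {avg_leaves_1:.0f} leaves per tree, smallest leaf {smallest_1}")
print(f"min_samples_leaf=25: {avg_leaves_25:.0f} leaves per tree, smallest leaf {smallest_25}")
```

With the higher setting each tree has far fewer leaves, and no leaf falls below the requested size.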

`max_features` - The maximum number of features considered when looking for the best split at each node. In scikit-learn (version 1.1 and later) the default depends on the estimator: `RandomForestClassifier` uses `"sqrt"`, the square root of the number of columns in the data set, while `RandomForestRegressor` uses `1.0`, meaning all columns. Lower values make the trees differ more from one another, which can reduce overfitting at the cost of weaker individual trees. Higher values do the opposite.


## Implementation


In [None]:
import math
import re

import numpy as np
import pandas as pd
from pandas.api.types import is_string_dtype
from fastai.tabular.all import add_datepart, cont_cat_split, Categorify, FillMissing, TabularPandas
from sklearn import set_config
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

In [None]:
pd.set_option('display.max_columns', None)
np.random.seed(42)
set_config(transform_output="pandas")

In [None]:
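def r_mse(pred, y):
    # Root mean squared error, rounded to 6 decimal places.
    return round(math.sqrt(((pred - y) ** 2).mean()), 6)

def m_rmse(m, xs, y):
    # RMSE of model `m` on features `xs` against targets `y`.
    return r_mse(m.predict(xs), y)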

In [None]:
df = pd.read_csv("../data/bluebook-for-bulldozers/TrainAndValid.csv", low_memory=False, parse_dates=["saledate"])
# Model the log of the price, so RMSE on this column measures error in log price.
df["SalePrice"] = np.log(df["SalePrice"])
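
Taking the log of `SalePrice` changes what the RMSE measures: errors become relative rather than absolute, so a 10% miss on a cheap machine counts the same as a 10% miss on an expensive one. A small worked sketch with made-up prices (the numbers are illustrative, not taken from the dataset):

```python
import math
import numpy as np

def r_mse(pred, y):
    # Same definition as the helper above: root mean squared error, rounded.
    return round(math.sqrt(((pred - y) ** 2).mean()), 6)

actual = np.array([10_000.0, 100_000.0])
pred = actual * 1.10  # both predictions are 10% too high

rmse_raw = r_mse(pred, actual)                  # dominated by the expensive item
rmse_log = r_mse(np.log(pred), np.log(actual))  # equals log(1.1), about 0.0953
print(rmse_raw, rmse_log)
```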

In [None]:
idxs = sorted(np.random.permutation(len(df))[:30000])
df = df.iloc[idxs].copy()

In [None]:
cond = (df["saledate"] < "2011-10-01")
train_idx = np.where(cond)[0]
valid_idx = np.where(~cond)[0]
splits = (list(train_idx), list(valid_idx))
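
The split above is by date rather than at random, and `np.where` returns integer positions rather than index labels. That distinction matters here because the frame was subsampled, so its index labels are no longer contiguous. A minimal sketch with a hypothetical four-row frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "saledate": pd.to_datetime(["2011-09-01", "2011-11-15", "2011-06-30", "2012-01-02"]),
        "SalePrice": [10.1, 10.4, 9.8, 10.9],
    },
    index=[101, 205, 317, 420],  # arbitrary labels, like the index left after sampling
)

cond = toy["saledate"] < "2011-10-01"
train_pos = np.where(cond)[0]   # integer positions, not index labels
valid_pos = np.where(~cond)[0]

# Positions must be used with .iloc; .loc would look for labels 0, 2, ...
train, valid = toy.iloc[train_pos], toy.iloc[valid_pos]
print(list(train_pos), list(valid_pos))
```

Every validation row is later than every training row, which mimics predicting future sales from past ones.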

In [None]:
df["ProductSize"] = df["ProductSize"].astype("category")
df["ProductSize"] = df["ProductSize"].cat.set_categories(
    ["Compact", "Mini", "Small", "Medium", "Large / Medium", "Large"], ordered=True
)

df["UsageBand"] = df["UsageBand"].astype("category")
df["UsageBand"] = df["UsageBand"].cat.set_categories(["Low", "Medium", "High"], ordered=True)

df["datasource"] = df["datasource"].astype("category")

### FastAI Data Pipeline


In [None]:
fa_df = df.copy()

In [None]:
fa_df = add_datepart(fa_df, "saledate", drop=True)
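
`add_datepart` replaces a single datetime column with several numeric and boolean columns that a tree can split on. The sketch below rebuilds a few of them by hand with plain pandas on two example dates; it is an illustration of the idea, not fastai's implementation.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["1970-01-02", "2011-10-31"]))

parts = pd.DataFrame({
    "Year": dates.dt.year,
    "Month": dates.dt.month,
    "Dayofweek": dates.dt.dayofweek,       # Monday=0 ... Sunday=6
    "Is_month_end": dates.dt.is_month_end,
    # Whole seconds since the Unix epoch, computed independently of the datetime resolution.
    "Elapsed": (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1),
})
print(parts)
```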

In [None]:
conts, cats = cont_cat_split(fa_df, max_card=3, dep_var="SalePrice")

Categorical variables are made up of discrete levels, such as gender or product type, for which addition and multiplication don't have meaning (even if they're stored as numbers). To use them in the model, though, we need to convert them to numbers.
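
As a small illustration of that conversion, pandas can map each level to an integer code. The sketch below uses a made-up series; the `+ 1` shift is an assumption that mirrors the convention of reserving 0 for missing values.

```python
import numpy as np
import pandas as pd

sizes = pd.Series(["Small", "Large", np.nan, "Mini", "Small"])

# Declare the levels in a meaningful order, as done for ProductSize above.
ordered = pd.Categorical(sizes, categories=["Mini", "Small", "Large"], ordered=True)

codes = ordered.codes   # position of each value in `categories`; missing values get -1
shifted = codes + 1     # shift by one so that 0 means "missing"
print(list(codes), list(shifted))
```

Because the levels are ordered, a tree split such as "code <= 1" has a sensible meaning: everything up to and including "Small".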


In [None]:
procs = [Categorify, FillMissing]

In [None]:
to = TabularPandas(fa_df, procs, cats, conts, y_names="SalePrice", splits=splits)

In [None]:
rf_to = RandomForestRegressor(n_jobs=-1, random_state=42)
rf_to.fit(to.train.xs, to.train.y)

In [None]:
m_rmse(rf_to, to.valid.xs, to.valid.y)

0.302871

### Scikit-Learn data pipeline


In [None]:
sk_df = df.copy()

In [None]:
def add_datepart(df: pd.DataFrame, column_name: str, drop: bool = False) -> pd.DataFrame:
    prefix = re.sub("[Dd]ate$", "", column_name)
    attr = [
        "Year",
        "Month",
        "Week",
        "Day",
        "Dayofweek",
        "Dayofyear",
        "Is_month_end",
        "Is_month_start",
        "Is_quarter_end",
        "Is_quarter_start",
        "Is_year_end",
        "Is_year_start",
    ]
    col = df[column_name]
    week = (
        col.dt.isocalendar().week.astype(col.dt.day.dtype) if hasattr(col.dt, "isocalendar") else col.dt.week
    )
    for n in attr:
        df[f"{prefix}{n}"] = getattr(col.dt, n.lower()) if n != "Week" else week
    # Seconds since the Unix epoch; assumes nanosecond-resolution datetimes.
    df[prefix + "Elapsed"] = np.where(~col.isna(), col.values.astype(np.int64) // 10**9, np.nan)
    if drop:
        df = df.drop(column_name, axis=1)
    return df

def cont_cat_split(df, max_card=2, dep_var=None):
    "Helper function that returns column names of cont and cat variables from given `df`."
    cont_names, cat_names = [], []
    for label in df:
        if label == dep_var:
            continue
        if (
            pd.api.types.is_integer_dtype(df[label].dtype) and df[label].unique().shape[0] > max_card
        ) or pd.api.types.is_float_dtype(df[label].dtype):
            cont_names.append(label)
        else:
            cat_names.append(label)
    return cont_names, cat_names

In [None]:
sk_df = add_datepart(sk_df, "saledate", drop=True)

In [None]:
conts, cats = cont_cat_split(sk_df, max_card=3, dep_var="SalePrice")

In [None]:
sk_df_train = sk_df.iloc[train_idx].copy()
sk_df_valid = sk_df.iloc[valid_idx].copy()

In [None]:
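def numericalise(X, cat_names):
    # Work on a copy so the caller's frame is not modified in place.
    X = X.copy()
    for n, c in X.items():
        if is_string_dtype(c):
            X[n] = c.astype("category").cat.as_ordered()
    # Category codes start at 0 and use -1 for missing values; adding 1 maps missing to 0.
    # Codes are derived from the data passed in, not from a mapping fitted on the training set.
    for cat in cat_names:
        X[cat] = pd.Categorical(X[cat]).codes + 1
    return X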
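
One caveat with deriving codes from whatever frame is passed in is that the same string can receive different codes in the training and validation data. The sketch below demonstrates this on made-up data and shows scikit-learn's `OrdinalEncoder` as one possible alternative. It is a swapped-in technique for comparison, not what this notebook uses.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"state": ["Alabama", "Texas", "Ohio", "Texas"]})
valid = pd.DataFrame({"state": ["Texas", "Alabama", "Utah"]})  # "Utah" never seen in training

# Per-frame codes: "Texas" is 2 in train but 1 in valid.
train_codes = pd.Categorical(train["state"]).codes
valid_codes = pd.Categorical(valid["state"]).codes

# Fitted encoder: the mapping is learned once on the training data and reused.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(train)
valid_encoded = np.asarray(enc.transform(valid)).ravel()
print(list(train_codes), list(valid_codes), list(valid_encoded))
```

Columns whose categories were set explicitly beforehand, such as `ProductSize` and `UsageBand`, are not affected, because their level order is fixed regardless of which rows are present.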

In [None]:
numerical_pipe = SimpleImputer(strategy="mean", add_indicator=True)

preprocessing = ColumnTransformer(
    [
        ("cat", FunctionTransformer(numericalise, kw_args={"cat_names": cats}), cats),
        ("num", numerical_pipe, conts),
    ],
    verbose_feature_names_out=False,
)

rf = Pipeline(
    [
        ("preprocess", preprocessing),
        ("regressor", RandomForestRegressor(random_state=42)),
    ]
)
rf.fit(sk_df_train.drop("SalePrice", axis=1), sk_df_train["SalePrice"])

In [None]:
m_rmse(rf, sk_df_valid.drop("SalePrice", axis=1), sk_df_valid["SalePrice"])

0.28816