# Random forest with Entity Embeddings: Training the model

We compare the validation-set performance of a random forest on two versions of the ASHRAE dataset (preprocessed [in Part 1](https://www.kaggle.com/michelezoccali/random-forest-with-embeddings-tutorial-part-1)), which differ in the treatment of categorical variables. These are encoded:

1. with **standard ordinal encoding** (discrete levels), and
2. with **Entity Embeddings**, i.e. vectors of continuous values previously learned by a neural net.
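
To make the difference concrete, here is a minimal sketch of the two encodings on a toy column. The column values, the embedding dimension and the `primary_use_emb_*` column names are made up for illustration, and random numbers stand in for the weights a neural net would have learned; this is not the preprocessing actually performed in Part 1.

```python
import numpy as np
import pandas as pd

# Toy categorical column (hypothetical values, not the ASHRAE data).
df = pd.DataFrame({'primary_use': ['Office', 'Education', 'Office', 'Retail']})

# 1. Ordinal encoding: each category becomes a single integer level.
codes = df['primary_use'].astype('category').cat.codes.to_numpy()

# 2. Entity embedding: each level indexes one row of an embedding matrix.
#    Random values stand in for weights learned by a neural net (assumption).
rng = np.random.default_rng(0)
n_levels = int(codes.max()) + 1
emb_dim = 2  # hypothetical embedding size
emb_matrix = rng.normal(size=(n_levels, emb_dim))

# Look up the vector for every row and expand it into continuous columns.
emb_cols = [f'primary_use_emb_{i}' for i in range(emb_dim)]
df_emb = pd.DataFrame(emb_matrix[codes], columns=emb_cols, index=df.index)
print(df_emb)
```

With ordinal encoding the forest sees one integer column per categorical variable; with embeddings it sees `emb_dim` continuous columns, and rows sharing a category share the same vector.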

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O
import os
import datetime
import warnings
import gc

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor
import lightgbm as lgb

from tqdm.notebook import tqdm

# plotting
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
data_path = '../input/random-forest-with-embeddings-tutorial-part-1'

for dirname, _, filenames in os.walk(data_path):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Load data

In [None]:
X_train = pd.read_feather(f'{data_path}/X_train.feather')
y_train = pd.read_feather(f'{data_path}/y_train.feather').meter_reading

In [None]:
y_train

# Modeling

Let us write a small wrapper function for the Random Forest, to be passed to a CV routine.

In [None]:
def RF_wrapper(Xt, yt, Xv, yv, fold=-1):
    """Fit a random forest on one fold and report training and validation scores."""
    print(f'Training fold {fold}...')
    model = RandomForestRegressor(n_jobs=-1, n_estimators=40,
                                  max_samples=200000, max_features=0.5,
                                  min_samples_leaf=5, oob_score=False).fit(Xt, yt)

    # mean_squared_error expects (y_true, y_pred)
    score_train = np.sqrt(mean_squared_error(yt, model.predict(Xt)))
    oof = model.predict(Xv)  # predictions on the held-out fold
    score = np.sqrt(mean_squared_error(yv, oof))
    print(f'Fold {fold}: training RMSLE: {score_train},   validation RMSLE: {score}\n')
    return model, oof, score
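
A note on the metric: `RF_wrapper` computes a plain RMSE but labels it RMSLE. The two coincide if the target was `log1p`-transformed upstream, which is assumed here rather than shown in this notebook. A small check on made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical raw meter readings and predictions (not the ASHRAE data).
y_raw = np.array([0.0, 10.0, 250.0, 1200.0])
pred_raw = np.array([1.0, 12.0, 200.0, 1000.0])

# RMSLE written out directly on the raw scale.
rmsle = np.sqrt(np.mean((np.log1p(pred_raw) - np.log1p(y_raw)) ** 2))

# Plain RMSE on log1p-transformed values, which is what RF_wrapper computes
# if the target was transformed beforehand (an assumption).
rmse_on_log = np.sqrt(mean_squared_error(np.log1p(y_raw), np.log1p(pred_raw)))

print(rmsle, rmse_on_log)
```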

Let us perform k-fold CV without shuffling, since this is a time series. An alternative would be a single train/validation split, possibly with a gap to mimic the training/private test split. Otherwise, one could try something like time-series split CV.

In [None]:
def perform_CV(model_wrap, xs, ys, n_splits=3):
    
    kf = KFold(n_splits=n_splits, shuffle=False)

    models = []
    scores = []
    oof_total = np.zeros(xs.shape[0])


    for fold, (train_idx, val_idx) in enumerate(kf.split(xs), start=1):
        # Index by position so this also works when ys has a non-default index
        Xt, yt = xs.iloc[train_idx], ys.iloc[train_idx]
        Xv, yv = xs.iloc[val_idx], ys.iloc[val_idx]
        model, oof, score = model_wrap(Xt, yt, Xv, yv, fold)

        models.append(model)
        scores.append(score)
        oof_total[val_idx] = oof

    print('Training completed.')
    print(f'> Mean RMSLE across folds: {np.mean(scores)}, std: {np.std(scores)}')
    print(f'> OOF RMSLE: {np.sqrt(mean_squared_error(ys, oof_total))}')
    return models, scores, oof_total
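
For reference, the time-series split mentioned earlier could look roughly like this with scikit-learn's `TimeSeriesSplit`. This is a sketch on a toy index, not what this notebook runs; the row count and `gap=2` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy time-ordered row index standing in for the rows of X_train.
n_rows = 20
rows = np.arange(n_rows)

# Each fold trains on the past only; gap=2 leaves two rows unused
# between the end of the training window and the validation window.
tscv = TimeSeriesSplit(n_splits=3, gap=2)

folds = list(tscv.split(rows))
for fold, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f'Fold {fold}: train rows {train_idx[0]}-{train_idx[-1]}, '
          f'validation rows {val_idx[0]}-{val_idx[-1]}')
```

Unlike `KFold`, this never validates on the earliest rows, so swapping it into `perform_CV` would leave part of `oof_total` at zero and the OOF score would need to be restricted to the rows that were actually predicted.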

Let's train the random forest **without** embeddings.

In [None]:
%%time
n_splits = 3
models, _, _ = perform_CV(RF_wrapper, X_train, y_train, n_splits=n_splits)

Let's see the average feature importance across models. We can use this to retroactively drop further superfluous features during preprocessing.

In [None]:
importance = pd.DataFrame([model.feature_importances_ for model in models],
                          columns=X_train.columns,
                          index=[f'Fold {i}' for i in range(1, n_splits + 1)])
importance = importance.T
importance['Average importance'] = importance.mean(axis=1)
importance = importance.sort_values(by='Average importance', ascending=False)

plt.figure(figsize=(10,7))
sns.barplot(x='Average importance', y=importance.index, data=importance);
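
To turn the plot into a list of candidate features to drop, one option is to threshold the average importance. The sketch below uses a toy frame with the same layout as `importance`; the feature names, values and the `0.01` cut-off are all hypothetical, and any removal should be confirmed by re-running the CV.

```python
import pandas as pd

# Toy stand-in for the `importance` frame built above (hypothetical values).
importance_demo = pd.DataFrame(
    {'Average importance': [0.55, 0.30, 0.14, 0.008, 0.002]},
    index=['feat_a', 'feat_b', 'feat_c', 'feat_d', 'feat_e'],
)

threshold = 0.01  # hypothetical cut-off, to be tuned against the CV score

# Features whose average importance falls below the cut-off.
low = importance_demo['Average importance'] < threshold
to_drop = importance_demo.index[low].tolist()
print(to_drop)
```

Keep in mind that impurity-based importances tend to favour high-cardinality features, so a low score is a hint rather than proof that a feature is superfluous.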

In [None]:
del X_train
gc.collect()

Now let's repeat **with** embeddings.

In [None]:
X_embeds = pd.read_feather(f'{data_path}/X_embeds.feather')

In [None]:
%%time
models_emb, _, _ = perform_CV(RF_wrapper, X_embeds, y_train, n_splits=n_splits)

In [None]:
importance = pd.DataFrame([model.feature_importances_ for model in models_emb],
                          columns=X_embeds.columns,
                          index=[f'Fold {i}' for i in range(1, n_splits + 1)])
importance = importance.T
importance['Average importance'] = importance.mean(axis=1)
importance = importance.sort_values(by='Average importance', ascending=False)

plt.figure(figsize=(10,7))
sns.barplot(x='Average importance', y=importance.index, data=importance);

In [None]:
del X_embeds, y_train
gc.collect()

So, using these mysterious embeddings actually worked, even if in this case they improved performance only a little (and training was quite a bit slower too). Still, the method is well worth knowing, as it can lead to much better performance on other problems. For more information, check out the paper [Entity Embeddings of Categorical Variables](https://arxiv.org/pdf/1604.06737.pdf).

That's it. Do upvote this kernel if you found it of any use! 🖖