# <font color='#3366BB'>XGBoost with Entity Embeddings</font>

## <span style="color:#3366BB">[[Article](https://arxiv.org/pdf/1604.06737.pdf)]</span>

### Table of contents

- [Train Entity Embeddings (PyTorch)](#entity-embeddings)
- [Experiment 003 (Default)](#exp003)
    - [Train Model](#train003)
- [Experiment 004 (HyperOpt)](#exp004)
    - [Hyperparameter Tuning](#hyper)
    - [Train Model](#train004)


## Summary Results

| Experiment ID | Categorical Variables | NaN-cats | NaN-cont | Target Transformation | Hyperparameter Search | Backtesting            | Private Score | Public Score
|---------------|-----------------------|----------|----------|-----------------------|-----------------------|------------------------|---------------|---------------
| 001           | Target encoder        | XGBoost  | XGBoost  | Log transform         | Default               | No                     | 0.16925       | 0.17975
| 002           | Target encoder        | XGBoost  | XGBoost  | Log transform         | HyperOpt (100)        | TimeSeriesSplit k = 3  | 0.13975       | 0.12481
| 003           | Entity Embeddings     | #NAN#    | FastAI   | Log transform         | Default               | No                     | 0.15251       | 0.14079
| 004           | Entity Embeddings     | #NAN#    | FastAI   | Log transform         | HyperOpt (100)        | TimeSeriesSplit k = 3  | 0.13081       | 0.11572

***

## Import packages

In [None]:
# Model
import xgboost as xgb

# Data manipulation
import pandas as pd
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from category_encoders import TargetEncoder

# Hyperparameter search
from hyperopt import hp, fmin, tpe, Trials, STATUS_OK, space_eval

# Utils
from time import time
from pathlib import Path

# Entity Embeddings
from torch.utils.data import Dataset, DataLoader
import torch
import torch.nn as nn
import torch.nn.functional as F

from tqdm import tqdm
import pickle

# Viz
import matplotlib.pyplot as plt

# For reproducibility
seed = 123

In [None]:
# Define evaluation metric:
# src: https://www.kaggle.com/c/rossmann-store-sales/discussion/16794 (Chenglong Chen)

def ToWeight(y):
    w = np.zeros(y.shape, dtype=float)
    ind = y != 0
    w[ind] = 1./(y[ind]**2)
    return w

def rmspe(yhat, y):
    y = np.exp(y) - 1       
    yhat = np.exp(yhat) - 1 
    w = ToWeight(y)
    rmspe = np.sqrt(np.mean( w * (y - yhat)**2 ))
    return rmspe

def rmspe_xg(yhat, y):
    y = y.get_label()
    y = np.exp(y) - 1
    yhat = np.exp(yhat) - 1
    w = ToWeight(y)
    rmspe = np.sqrt(np.mean(w * (y - yhat)**2))
    return "rmspe", rmspe
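
A quick sanity check of the metric can help when reading the scores below. The following is a minimal sketch, assuming predictions and targets are both on the `log1p` scale; `rmspe_log` is a hypothetical helper written for this illustration, not one of the functions used in the experiments.

```python
import numpy as np

def rmspe_log(yhat_log, y_log):
    """RMSPE between predictions and targets that are both on the log1p scale."""
    y = np.expm1(y_log)          # back to sales units
    yhat = np.expm1(yhat_log)
    mask = y != 0                # rows with zero sales get weight 0, as in ToWeight
    pct_err = np.zeros_like(y, dtype=float)
    pct_err[mask] = (y[mask] - yhat[mask]) / y[mask]
    return np.sqrt(np.mean(pct_err ** 2))

toy_sales_true = np.array([100.0, 200.0, 400.0])
perfect = rmspe_log(np.log1p(toy_sales_true), np.log1p(toy_sales_true))
ten_pct_high = rmspe_log(np.log1p(1.1 * toy_sales_true), np.log1p(toy_sales_true))
print(perfect, ten_pct_high)
```

A perfect prediction scores 0, and predicting every value 10% too high scores about 0.1, which is the scale on which the leaderboard scores should be read.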

# Entity Embeddings

For this section, I will use all the categorical variables defined in the fastai course. I will train an embedding for each categorical variable and use the embeddings as features in an XGBoost model. Missing categorical values are assigned a new category named #NAN#, so an embedding is learned for them as well.  

I tried applying batch normalization (BN) and dropout both before and after the ReLU activation; applying them after the activation seemed to give better performance with the default parameters.

For this part, I relied on this source: https://yashuseth.blog/2018/07/22/pytorch-neural-network-for-tabular-data-with-categorical-embeddings/  

*Figure 1: Example architecture for two categories:*

![image](https://user-images.githubusercontent.com/25487881/78181963-42bc1d00-7433-11ea-8236-6dd6f64e247a.png)

Now we will define the dataset for the dataloader and the PyTorch model for learning the entity embeddings. Sources online usually integrate the embeddings into a model trained end-to-end on both categorical and continuous variables. In this case, the model is only meant to work with categorical variables, purely for learning the embeddings.
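
To make the lookup-and-concatenate step of Figure 1 concrete, here is a minimal NumPy sketch. It is a stand-in for what the embedding layers do in the forward pass, not the PyTorch model itself; the table sizes and the batch are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup with two categorical features:
# feature A has 7 categories embedded in 4 dims, feature B has 3 categories in 2 dims.
toy_emb_tables = [rng.normal(size=(7, 4)), rng.normal(size=(3, 2))]

# A batch of 5 rows; column i holds the integer label of feature i.
toy_cat_batch = np.array([[0, 2],
                          [6, 0],
                          [3, 1],
                          [3, 2],
                          [1, 1]])

# Look up one table row per label, then concatenate the results feature-wise.
looked_up = [table[toy_cat_batch[:, i]] for i, table in enumerate(toy_emb_tables)]
dense_input = np.concatenate(looked_up, axis=1)

print(dense_input.shape)
```

Each row of `dense_input` is the concatenation of one vector per feature, so its width is the sum of the embedding dimensions. Rows that share a label in a feature share the corresponding slice.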

In [None]:
class CategoricalDataset(Dataset):
    def __init__(self, data, output_col=None):
        """
        Characterizes a Dataset for PyTorch
        
        Parameters
        ----------

        data: pandas data frame
          The data frame object for the input data. It must
          contain only the categorical columns and the
          output column to be used.

        output_col: string
          The name of the output variable column in the data
          provided.
        """

        # Shape of the full dataset
        self.n = data.shape[0]

        # Store in 'y' the target values of the full dataset
        self.y = data[output_col].astype(np.float32).values.reshape(-1, 1)
        
        # Store list of categorical columns
        self.cat_cols = [col for col in data.columns if col != output_col]
            
        # Ensure the datatypes of the categorical variables are of int64 and in a numpy.ndarray
        self.cat_X = data[self.cat_cols].astype(np.int64).values

    def __len__(self):
        """
        Denotes the total number of samples.
        """
        return self.n

    def __getitem__(self, index):
        """
        Generates one sample of data based on the index.
        """
        return [self.y[index], self.cat_X[index]]

In [None]:
class FeedForwardNN(nn.Module):

    def __init__(self, emb_dims, lin_layer_sizes,
               output_size, emb_dropout, lin_layer_dropouts):

        """
        Parameters
        ----------

        emb_dims: List of two element tuples
          This list will contain a two element tuple for each
          categorical feature. The first element of a tuple will
          denote the number of unique values of the categorical
          feature. The second element will denote the embedding
          dimension to be used for that feature.

        lin_layer_sizes: List of integers.
          The size of each linear layer. The length will be equal
          to the total number of linear layers in the network.

        output_size: Integer
          The size of the final output.

        emb_dropout: Float
          The dropout to be used after the embedding layers.

        lin_layer_dropouts: List of floats
          The dropouts to be used after each linear layer.
        """

        super().__init__()

        # Embedding layers
        # We create a list of embedding layers (one for each categorical feature)
        # An embedding layer takes in the size of the vocabulary and the number of dimensions
        # nn.Embedding = A simple lookup table that stores embeddings of a fixed dictionary and size.
        self.emb_layers = nn.ModuleList([nn.Embedding(dict_size, emb_dim) for dict_size, emb_dim in emb_dims])

        # Need the concatenated length of the embeddings
        self.sum_length_embeddings = sum([emb_dim for dict_size, emb_dim in emb_dims])

        # The first linear layer (input dimensions , output dimensions): dense layer 0 in the EE paper
        first_lin_layer = nn.Linear(self.sum_length_embeddings, lin_layer_sizes[0])
        
        # Create the list of layers: the first linear layer plus a second one (dense layer 1 in the EE paper)
        self.lin_layers = nn.ModuleList([first_lin_layer] + [nn.Linear(lin_layer_sizes[0], lin_layer_sizes[1])])
        
        # Initialize weights in the two linear layers
        for lin_layer in self.lin_layers:
            nn.init.kaiming_normal_(lin_layer.weight.data)

        # Output Layer
        self.output_layer = nn.Linear(lin_layer_sizes[-1], output_size)
        nn.init.kaiming_normal_(self.output_layer.weight.data)
        
        # Batch Norm Layers - after all linear layers in network
        self.bn_layers = nn.ModuleList([nn.BatchNorm1d(size) for size in lin_layer_sizes])

        # Dropout Layers. They are separated due to different dropout probabilities
        self.emb_dropout_layer = nn.Dropout(emb_dropout)
        self.droput_layers = nn.ModuleList([nn.Dropout(size) for size in lin_layer_dropouts])

    def forward(self, cat_data):

        # Select appropriate lookup table for each id of data for each column
        x = [emb_layer(cat_data[:, i]) for i,emb_layer in enumerate(self.emb_layers)]
        
        # Here we concat (flatten) the embeddings selected columns wise
        x = torch.cat(x, 1)
        
        # Apply the embedding dropout
        x = self.emb_dropout_layer(x)

        for lin_layer, dropout_layer, bn_layer in zip(self.lin_layers, self.droput_layers, self.bn_layers):

            x = F.relu(lin_layer(x))
            x = bn_layer(x)
            x = dropout_layer(x)
            
#             # fastai approach: apply BN and dropout before relu
#             x = bn_layer(lin_layer(x))
#             x = dropout_layer(x)
#             x = F.relu(x)

        x = self.output_layer(x)

        return x

In [None]:
# Define which features to use - see lesson 6 of fastai.
categorical_features = ['Store', 'DayOfWeek', 'Year', 'Month', 'Day', 'StateHoliday', 'CompetitionMonthsOpen',
                        'Promo2Weeks', 'StoreType', 'Assortment', 'PromoInterval', 'CompetitionOpenSinceYear', 'Promo2SinceYear',
                        'State', 'Week', 'Events', 'Promo_fw', 'Promo_bw', 'StateHoliday_fw', 'StateHoliday_bw',
                        'SchoolHoliday_fw', 'SchoolHoliday_bw']
output_feature = ['Sales']

# Load data
data_ee = pd.read_parquet('../data/03_primary/clean_train_valid.parquet')

# Filter columns 
data_ee = data_ee[categorical_features + output_feature + ['Date']]
data_ee.sort_values('Date', inplace=True)
data_ee.set_index('Date', inplace=True)

# Deal with missing categorical values
data_ee['PromoInterval'] = data_ee['PromoInterval'].fillna('#NAN#')
data_ee['Events'] = data_ee['Events'].fillna('#NAN#')

# The cross-validation (val) sets will be made starting from row 688500. Thus we can use the data before to learn the embeddings
data_ee = data_ee.iloc[:688500]

The data is almost ready. All that's left is to use label encoding to replace the categorical values with integers. These integers index the lookup tables of the embedding layers in the neural network.
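
As a toy illustration of what label encoding does to a column that contains the `#NAN#` placeholder, the sketch below uses `np.unique`, which assigns integer codes in sorted order of the distinct values. The values are made up for the example.

```python
import numpy as np

# Hypothetical column in which missing values were already replaced by '#NAN#'.
toy_values = np.array(['Jan,Apr', '#NAN#', 'Feb,May', '#NAN#'])

# classes: sorted distinct values; codes: position of each value within classes.
toy_classes, toy_codes = np.unique(toy_values, return_inverse=True)

print(toy_classes.tolist())
print(toy_codes.tolist())
```

The placeholder is treated like any other category and gets its own integer, which is what lets the network learn a dedicated embedding row for missing values.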

In [None]:
# Define dictionary that will contain all different label encoders by columns
label_encoders = {}
# Loop over categorical features, initialize a new label encoder and fit_transform the data
for cat_col in categorical_features:
    label_encoders[cat_col] = LabelEncoder()
    data_ee[cat_col] = label_encoders[cat_col].fit_transform(data_ee[cat_col])

Create a proper dataset.

In [None]:
# Pass the column name as a string so that 'Sales' is excluded from the categorical columns
dataset = CategoricalDataset(data=data_ee, output_col=output_feature[0])

Create the data loader.

In [None]:
dataloader = DataLoader(dataset, batch_size=64)

Use a heuristic to define the embedding sizes. In language modelling, embeddings of size 600 or more are common. Categorical variables are much simpler, so we define the embedding size as half the number of categories (rounded up), capped at 50.
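
The heuristic can be worked through on a few cardinalities. This is a small sketch; `embedding_size` is a hypothetical helper and the cardinalities are chosen only for illustration.

```python
def embedding_size(n_categories, cap=50):
    """Half the number of categories (rounded up), capped at `cap`."""
    return min(cap, (n_categories + 1) // 2)

# Hypothetical cardinalities, chosen only for illustration.
example_cardinalities = {"DayOfWeek": 7, "Month": 12, "Store": 1115}
example_emb_dims = [(n, embedding_size(n)) for n in example_cardinalities.values()]
print(example_emb_dims)
```

Small features get an embedding of roughly half their cardinality, while high-cardinality features are capped at 50 dimensions.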

In [None]:
cat_dims = [int(data_ee[col].nunique()) for col in categorical_features]
emb_dims = [(x, min(50, (x + 1) // 2)) for x in cat_dims]

In [None]:
# Use the GPU when available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Create model. I have searched online for heuristics on the number of layers, hidden layer sizes and dropouts to use. I have settled on these, as they are the most common.

In [None]:
model = FeedForwardNN(emb_dims, lin_layer_sizes=[1000, 500], output_size=1, emb_dropout=0.001, lin_layer_dropouts=[0.001,0.01]).to(device)

In [None]:
# Define a different loss function
def MAPELoss(output, target):
    return torch.mean(torch.abs((target - output) / target))    
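
For intuition about the loss values printed during training, here is a NumPy counterpart of the loss on toy numbers. `mape_np` is a hypothetical helper, and the sketch assumes all targets are non-zero, since the ratio is undefined otherwise.

```python
import numpy as np

def mape_np(output, target):
    """NumPy counterpart of MAPELoss: mean of |target - output| / |target|."""
    return np.mean(np.abs((target - output) / target))

toy_target = np.array([100.0, 200.0, 50.0])
toy_output = np.array([110.0, 180.0, 50.0])
toy_mape = mape_np(toy_output, toy_target)
print(toy_mape)
```

The three relative errors are 10%, 10% and 0%, so the loss is their mean.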

In [None]:
no_of_epochs = 5 # 10 was suggested by EE paper, but loss was going up
criterion = MAPELoss
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
epoch_losses = []
for epoch in range(no_of_epochs):
    
    running_loss = 0.0
    for y, cat_x in tqdm(dataloader,position=0):
        cat_x = cat_x.to(device)
        y  = y.to(device)
        # Forward Pass
        preds = model(cat_x)
        loss = criterion(preds, y)
        # Backward Pass and Optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        running_loss += loss.item()/len(cat_x)
    #print(np.mean(losses))
    print(running_loss)
    epoch_losses.append(running_loss)

In [None]:
# Print the categorical columns with the respective embedding shapes
list(zip(categorical_features, model.emb_layers))

In [None]:
torch.save(model.state_dict(), 'weights')

In [None]:
model.load_state_dict(torch.load('weights'))

In [None]:
print(f'Embedding matrix for DayOfWeek: \n\n{model.emb_layers[1].weight}')

In [None]:
# Move embeddings from GPU and put on CPU
cpu_embs = model.emb_layers.cpu()

In [None]:
def add_cat_embeddings(dataframe, embeddings, categorical_features):
    """ Map the entity embeddings to the label encoding. Replace label encoded data 
    with the entity embeddings. Deletes original label encoded data.
    
    Example of entity embedding dataframe:
    Events has 20 unique labels with size of embedding 10
    
    | Events | Events_0  | Events_1   | Events_2  |    ...     | Events_9  |
    |--------|-----------|------------|-----------|------------|-----------|
    | 0      | 2.384089  | 4.924449   | 15.312108 |    ...     | 13.249166 |
    | 1      | -4.169876 | -6.012678  | 4.244546  |    ...     | 10.307323 |
    | ...    |    ...    |    ...     |    ...    |    ...     | 23.748465 |
    | 19     | 5.596208  | -15.757868 | 10.706656 |    ...     | 7.573045  |
    
    Args
        dataframe: pd.DataFrame, original dataframe
        embeddings: ModuleList, embedding layers in the same order as categorical_features
        categorical_features: list of strings, columns to replace with embeddings
    Returns
        dataframe: pd.DataFrame, updated dataframe 
    """
    for i, cat_var in enumerate(categorical_features):
        
        # Retrieve the respective label encoder (global label_encoders) and transform the data
        dataframe[cat_var] = label_encoders[cat_var].transform(dataframe[cat_var])
        # Create a dataframe from respective embedding matrix
        df = pd.DataFrame(embeddings[i].weight.detach().numpy())
        # Rename columns to include the categorical name
        df = df.add_prefix(f'{cat_var}_')
        # Set a name for the index
        df.index.name = cat_var
        # Move named index as a column to join on
        df.reset_index(inplace=True)
        # Join original dataframe with new categorical embeddings
        dataframe = dataframe.merge(df, how='left',on=cat_var)
        # Remove original label encoded variable
        _ = dataframe.pop(cat_var)
                
    return dataframe
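
The join performed inside `add_cat_embeddings` can be followed on a toy frame. This is a minimal sketch with a hypothetical 3-category feature and a plain NumPy array standing in for the trained embedding weights.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-category feature with a 2-dimensional embedding.
toy_weights = np.array([[0.1, 0.2],
                        [0.3, 0.4],
                        [0.5, 0.6]])
toy_df = pd.DataFrame({"Events": [2, 0, 2, 1], "Promo": [1, 0, 0, 1]})

toy_emb_df = pd.DataFrame(toy_weights).add_prefix("Events_")  # Events_0, Events_1
toy_emb_df.index.name = "Events"                              # row i belongs to label i
toy_emb_df = toy_emb_df.reset_index()

# Left join on the label, then drop the label column itself.
toy_out = toy_df.merge(toy_emb_df, how="left", on="Events").drop(columns="Events")
print(toy_out)
```

Each label is replaced by the columns of its embedding row, and the row order of the original frame is preserved by the left join.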

# Experiment 003 - Default

In [None]:
def data_pipeline(fpath, fname_train, fname_test, seed, cpu_embs, categorical_features):
    """
    Load data and preprocess it for training a model. This also fills
    missing values in continuous variables with the median and adds a
    corresponding flag column. This isn't used for hyperparameter tuning.
    
    Args
        fpath: string, folder path 
        fname_train: string, file name with parquet extension
        fname_test: string, file name with pkl extension
        seed: int, random seed for the early-stopping split
        cpu_embs: ModuleList, trained embedding layers (on CPU)
        categorical_features: list of strings, columns to replace with embeddings
    Return
        dtrain: xgb.DMatrix, dataset used for training
        dvalid: xgb.DMatrix, dataset used for early stopping
        dtest: xgb.DMatrix, dataset used for submission to Kaggle
        test_ids: pd.Series, Ids of the test rows, used for the submission file
    """
    # Define the features to load
    columns = ['Store', 'DayOfWeek', 'Year', 'Month', 'Day', 'StateHoliday', 
                'CompetitionMonthsOpen', 'Promo2Weeks', 'StoreType', 'Assortment', 
                'PromoInterval', 'CompetitionOpenSinceYear', 'Promo2SinceYear', 
                'State', 'Week', 'Events', 'Promo_fw', 'Promo_bw', 'StateHoliday_fw', 
                'StateHoliday_bw', 'SchoolHoliday_fw', 'SchoolHoliday_bw', 'CompetitionDistance', 
                'Max_TemperatureC', 'Mean_TemperatureC', 'Min_TemperatureC', 'Max_Humidity', 
                'Mean_Humidity', 'Min_Humidity', 'Max_Wind_SpeedKm_h', 'Mean_Wind_SpeedKm_h', 
                'CloudCover', 'trend', 'trend_DE', 'AfterStateHoliday', 'BeforeStateHoliday', 
                'Promo', 'SchoolHoliday', 'Date', 'Sales']
    
    # 1) Load data
    train = pd.read_parquet(Path(fpath,fname_train), columns=columns)
    test = pd.read_pickle(Path(fpath,fname_test))
    columns.remove('Sales')
    test = test[columns + ['Id']]
    
    # 2) Let's use the date as the index and sort the data
    train.sort_values('Date', inplace=True)
    train.set_index('Date', inplace=True)
    test.sort_values('Date', inplace=True)
    test.set_index('Date', inplace=True)
    columns.remove('Date')
    test_ids = test.pop('Id') #Useful for submission
    
    # Deal with missing categorical values
    train['PromoInterval'] = train['PromoInterval'].fillna('#NAN#')
    train['Events'] = train['Events'].fillna('#NAN#')
    test['PromoInterval'] = test['PromoInterval'].fillna('#NAN#')
    test['Events'] = test['Events'].fillna('#NAN#')    

    # 3) Deal with missing continuous values
    for col_name in ['CompetitionDistance', 'CloudCover']:
        # Add na cols
        train[col_name+'_na'] = pd.isnull(train[col_name])
        test[col_name+'_na'] = pd.isnull(test[col_name])
        # Fill missing with median (default in FastAI)
        fillter = train[col_name].median()
        train[col_name] =  train[col_name].fillna(fillter)
        test[col_name] =  test[col_name].fillna(fillter)
        columns.append(col_name+'_na')
        
    # Replace categorical variables with embeddings
    train = add_cat_embeddings(train, cpu_embs, categorical_features)
    test = add_cat_embeddings(test, cpu_embs, categorical_features)
    columns = list(train.columns)

    columns.remove('Sales')
    # 4) Apply log transform to the target variable
    train['Sales'] = np.log1p(train['Sales'])

    # 5) Set aside a random 1% sample for early stopping
    # I'm separating the X and y simply for ease of use for the target encoder
    train_X, valid_X, train_y, valid_y = train_test_split(train[columns], train['Sales'], test_size=0.01, random_state=seed)

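    # 6) Deal with categorical variables (note: the five columns listed below are also in
    #    categorical_features, so add_cat_embeddings has already replaced them with embeddings)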
    te = TargetEncoder(handle_missing='value')
    train_X = te.fit_transform(train_X, cols=['StoreType', 'Assortment', 'PromoInterval', 'State', 'Events'], y=train_y)
    valid_X = te.transform(valid_X)
    test = te.transform(test)

    # 7) Convert to DMatrix for XGBoost
    dtrain = xgb.DMatrix(train_X, train_y)
    dvalid = xgb.DMatrix(valid_X, valid_y)
    dtest = xgb.DMatrix(test)
    
    return dtrain, dvalid, dtest, test_ids
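
Step 3 of the pipeline (flag, then fill with the training median) can be seen on a toy column. This is a minimal sketch with made-up values; only the column name is taken from the pipeline.

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({"CompetitionDistance": [100.0, np.nan, 300.0, 500.0]})
toy_test = pd.DataFrame({"CompetitionDistance": [np.nan, 250.0]})

col = "CompetitionDistance"
# Flag missing rows before filling, so the model can still see where values were absent.
toy_train[col + "_na"] = toy_train[col].isna()
toy_test[col + "_na"] = toy_test[col].isna()

# The fill value comes from the training data only and is reused for the test data.
fill_value = toy_train[col].median()
toy_train[col] = toy_train[col].fillna(fill_value)
toy_test[col] = toy_test[col].fillna(fill_value)
print(fill_value)
print(toy_test)
```

Computing the median on the training rows only keeps information from the test set out of the preprocessing.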

In [None]:
# Prepare the data
dtrain, dvalid, dtest, test_ids = data_pipeline('../data/03_primary', 'clean_train_valid.parquet', 'test_clean.pkl', seed, cpu_embs, categorical_features)

params = {'tree_method':'gpu_hist',
          'objective': 'reg:squarederror',
          'seed':seed}
watchlist = [(dtrain, 'train'), (dvalid, 'eval')]

model = xgb.train(params=params, dtrain=dtrain, num_boost_round=4000, early_stopping_rounds=20, feval=rmspe_xg, verbose_eval=200, evals=watchlist)

predictions = model.predict(dtest)

pd.DataFrame({'Id':test_ids,
              'Sales':np.expm1(predictions)}).to_csv('../data/03_primary/exp_003.csv', index=False)
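
Because the model is trained on `log1p(Sales)`, its predictions live on that scale and have to be mapped back before submission. The small check below, on made-up values, compares the two candidate back-transforms.

```python
import numpy as np

toy_sales = np.array([0.0, 1.0, 5000.0])
toy_log = np.log1p(toy_sales)

exact_inverse = np.expm1(toy_log)   # recovers the original values
off_by_one = np.exp(toy_log)        # equals toy_sales + 1
print(exact_inverse, off_by_one)
```

`np.expm1` is the exact inverse of `np.log1p`, whereas `np.exp` returns every value shifted up by one. This matches the `np.exp(...) - 1` used inside the evaluation metric.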

<div class="alert alert-info">

### Results
    
Private Score: 0.15251  
Public Score: 0.14079

</div>

# Experiment 004 - Hyperparameter tuning

In [None]:
# Load training data
all_cols = ['Store', 'DayOfWeek', 'Year', 'Month', 'Day', 'StateHoliday', 'CompetitionMonthsOpen',
            'Promo2Weeks', 'StoreType', 'Assortment', 'PromoInterval', 'CompetitionOpenSinceYear', 'Promo2SinceYear',
            'State', 'Week', 'Events', 'Promo_fw', 'Promo_bw', 'StateHoliday_fw', 'StateHoliday_bw',
            'SchoolHoliday_fw', 'SchoolHoliday_bw','CompetitionDistance', 'Max_TemperatureC', 'Mean_TemperatureC', 'Min_TemperatureC',
            'Max_Humidity', 'Mean_Humidity', 'Min_Humidity', 'Max_Wind_SpeedKm_h', 
            'Mean_Wind_SpeedKm_h', 'CloudCover', 'trend', 'trend_DE',
            'AfterStateHoliday', 'BeforeStateHoliday', 'Promo', 'SchoolHoliday', 'Sales','Date']

data_train = pd.read_parquet('../data/03_primary/clean_train_valid.parquet', columns=all_cols)

# Sort by date and set Date as index
data_train.sort_values('Date',inplace=True)
data_train.set_index('Date', inplace=True)

# Deal with missing categorical values
data_train['PromoInterval'] = data_train['PromoInterval'].fillna('#NAN#')
data_train['Events'] = data_train['Events'].fillna('#NAN#')
        
# Replace categorical variables with embeddings
data_train = add_cat_embeddings(data_train, cpu_embs, categorical_features)

In [None]:
# Search space, defined at module level so that space_eval can reuse it later
space = {
        # Learning rate: default 0.3 -> range: [0,1]
       'eta': hp.quniform('eta', 0.01, 0.3, 0.001),
        # Control complexity (control overfitting)
        # Maximum depth of a tree: default 6 -> range: [0,∞]
        'max_depth':  hp.choice('max_depth', np.arange(5, 10, dtype=int)),
        # Minimum sum of instance weight (hessian) needed in a child: default 1
        'min_child_weight': hp.quniform('min_child_weight', 1, 3, 1),
        # Minimum loss reduction required: default 0 -> range: [0,∞]
        'gamma': hp.quniform('gamma', 0, 5, 0.5),

        # Add randomness to make training robust to noise (control overfitting)
        # Subsample ratio of the training instance: default 1
        'subsample': hp.quniform('subsample', 0.5, 1, 0.05),
        # Subsample ratio of columns when constructing each tree: default 1
        'colsample_bytree': hp.quniform('colsample_bytree', 0.5, 1, 0.05),

        # Regression problem
        'objective': 'reg:squarederror',
        # For reproducibility
        'seed': seed,
        # Faster computation
        'tree_method':'gpu_hist'
        }

def optimize():
    # `score` and `trials` are defined in the cells below, before optimize() is called
    best = fmin(score, space, algo=tpe.suggest, trials=trials, max_evals=100)

    return best

In [None]:
pred_folds = 3
train_times = []

# Prepare data
data_x = data_train.copy()
data_y = data_x.pop('Sales')
data_y = np.log1p(data_y)

In [None]:

def score(params):
    # Initialize timer
    start_time = time()
    # Initialize list of scores for each fold
    score_list = []
    # Set to 20 to have validation sets of about the same size as the real test size
    tscv = TimeSeriesSplit(n_splits=20)
    # Initialize a split counter
    split_iteration = -1
     
    for train_index, test_index in tscv.split(data_x):
        
        # Keep only the last `pred_folds` splits, i.e. the most recent ones
        split_iteration+=1
        if split_iteration < 20 - pred_folds: continue

        # select 1% of the training data for early stopping
        train_index, es_index = train_test_split(train_index, test_size=0.01, random_state=seed)
        
        # Select data by index from the time series cross validatation split
        X_train, X_val, X_test = data_x.iloc[train_index].copy(), data_x.iloc[es_index].copy(), data_x.iloc[test_index].copy()
        y_train, y_val, y_test = data_y.iloc[train_index].copy(), data_y.iloc[es_index].copy(), data_y.iloc[test_index].copy()
        
        # Deal with missing continuous values
        for col_name in ['CompetitionDistance', 'CloudCover']:
            # Add na cols
            X_train[col_name+'_na'] = pd.isnull(X_train[col_name])
            X_val[col_name+'_na'] = pd.isnull(X_val[col_name])
            X_test[col_name+'_na'] = pd.isnull(X_test[col_name])

            # Fill missing with median (default in FastAI)
            fillter = X_train[col_name].median()
            X_train[col_name] =  X_train[col_name].fillna(fillter)
            X_val[col_name] =  X_val[col_name].fillna(fillter)
            X_test[col_name] =  X_test[col_name].fillna(fillter)
            
        # DMatrix is XGBoost's optimized data structure
        dtrain = xgb.DMatrix(X_train, y_train)
        dtest = xgb.DMatrix(X_test, y_test)
        dvalid = xgb.DMatrix(X_val, y_val)
        
        # The last entry in the watchlist is the one used for early stopping
        watchlist = [(dtrain, 'train'), (dvalid, 'eval')]
            
        # feval supplies a custom evaluation metric (here RMSPE)
        model = xgb.train(params, dtrain, early_stopping_rounds=100, num_boost_round=4000, verbose_eval=0, feval=rmspe_xg, evals=watchlist) 

        # Score the held-out fold; the mean over the folds is the loss returned to hyperopt
        y_pred = model.predict(xgb.DMatrix(X_test))
        #error = mean_absolute_error(y_test, y_pred)
        # rmspe expects (yhat, y): predictions first, then the true values
        error = rmspe(y_pred, y_test.values)
        score_list.append(error)
    #print(f'Took  {np.round(time()-start_time,0)} (s) - RMSPE score: {np.mean(score_list)} - PARAMS: {params}')
    train_times.append(np.round(time()-start_time,0))
    return np.mean(score_list)
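
The fold selection in `score` (20 splits, of which only the last `pred_folds` are scored) can be traced on a small index range. The sketch below uses a hypothetical pure-Python helper that mimics an expanding-window split; it is written for illustration and is not the scikit-learn call used above.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs with a growing train window.

    Hypothetical helper that mimics an expanding-window time series split.
    """
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for start in range(first_test_start, n_samples, test_size):
        yield list(range(start)), list(range(start, start + test_size))

toy_folds_to_keep = 3
all_splits = list(expanding_window_splits(n_samples=42, n_splits=20))
last_splits = all_splits[-toy_folds_to_keep:]
for train_idx, test_idx in last_splits:
    print(len(train_idx), test_idx)
```

Each kept fold trains on everything before its test window, so the three scored folds are the three most recent periods and each test window is about 1/21 of the data.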

In [None]:
%%time
# trials will contain logging information
trials = Trials()

best_hyperparams = optimize()
print("The best hyperparameters are: ", "\n")
print(best_hyperparams)

In [None]:
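# Collect one row per trial and concatenate once (DataFrame.append was removed in pandas 2.0)
rows = []

for trial in trials.trials:

    # One row per trial: the loss followed by the sampled hyperparameter values
    row = pd.concat([pd.DataFrame({'loss':[trial['result']['loss']]}),
                     pd.DataFrame(trial['misc']['vals'])], axis=1)

    rows.append(row)

summary_table = pd.concat(rows)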

summary_table = pd.concat([pd.DataFrame({'exp_time':train_times}),summary_table.reset_index(drop=True)],axis=1)
summary_table = summary_table.sort_values('loss')
summary_table.to_pickle('trials_004.pkl')

In [None]:
summary_table.sort_values('loss').head()

### Use hyperparameters found
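
For parameters declared with `hp.choice`, the result of the search holds the index of the chosen value rather than the value itself, which is why the cell below maps the result back through the search space. The idea, sketched without hyperopt and with a made-up result:

```python
# Candidate values, in the same order as they are given to the choice-type parameter.
max_depth_choices = list(range(5, 10))      # [5, 6, 7, 8, 9]

# Hypothetical search result in which the choice parameter is reported as an index.
toy_best = {"eta": 0.05, "max_depth": 2}

# Replace the index by the value it points to; other parameters are left untouched.
resolved = dict(toy_best)
resolved["max_depth"] = max_depth_choices[toy_best["max_depth"]]
print(resolved)
```

Passing the raw index to XGBoost would train trees of depth 2 instead of depth 7 in this toy case, so the mapping step matters.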

In [None]:
# Prepare the data
dtrain, dvalid, dtest, test_ids = data_pipeline('../data/03_primary', 'clean_train_valid.parquet', 'test_clean.pkl', seed, cpu_embs, categorical_features)

# Retrieve the best parameters using space_eval, because hyperopt returns the index of the chosen value
# when the distribution was set to hp.choice(...)
best_hyperparams = space_eval(space, best_hyperparams)

watchlist = [(dtrain, 'train'), (dvalid, 'eval')]
training_curves_004 = {}
model = xgb.train(params=best_hyperparams, dtrain=dtrain, num_boost_round=4000, early_stopping_rounds=100, feval=rmspe_xg, verbose_eval=200, evals=watchlist,
                                           evals_result=training_curves_004)

predictions = model.predict(dtest)

pd.DataFrame({'Id':test_ids,
              'Sales':np.expm1(predictions)}).to_csv('../data/03_primary/exp_004.csv', index=False)

In [None]:
plt.plot(training_curves_004['train']['rmspe'],label='Train')
plt.plot(training_curves_004['eval']['rmspe'],label='Eval')
plt.legend()
plt.xlabel('Iterations')
plt.ylabel('RMSPE')
plt.title('RMSPE Loss with early stopping = 100')

<div class="alert alert-info">
    
### Results

Private Score: 0.13081  
Public Score: 0.11572

</div>