[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/georgianpartners/Multimodal-Toolkit/blob/master/notebooks/text_w_tabular_classification.ipynb)

# Training a BertWithTabular Model for Clothing Review Recommendation Prediction

This guide closely follows the [example](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/trainer/01_text_classification.ipynb#scrollTo=bwl3I_VGAZXb) from HuggingFace for text classification on the GLUE dataset.

Install `transformers` (pinned to version 3.0 here) and clone the repository to get some utility files.

In [None]:
!pip install transformers==3.0

In [None]:
!git clone https://github.com/georgianpartners/Multimodal-Toolkit.git

In [None]:
!nvidia-smi

### All imports are here:

In [1]:
import logging
import os
import sys
from typing import Callable, Dict

import numpy as np
import pandas as pd
from pprint import pformat
from scipy.special import softmax
from transformers import (
    AutoTokenizer,
    AutoConfig,
    HfArgumentParser,
    Trainer,
    EvalPrediction,
    set_seed
)
curr_dir = os.getcwd()
sys.path.insert(0, os.path.join(curr_dir, 'Multimodal-Toolkit'))

from multimodal_exp_args import MultimodalDataTrainingArguments, ModelArguments, OurTrainingArguments
from evaluation import calc_classification_metrics, calc_regression_metrics
from multimodal.data.load_data import load_data_from_folder
from multimodal.model.tabular_config import TabularConfig
from multimodal.model.tabular_modeling_auto import AutoModelWithTabular
from util import create_dir_if_not_exists, get_args_info_as_str

logging.basicConfig(level=logging.INFO)

### Dataset

Our dataset is the [Womens Clothing E-Commerce Reviews](https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews) dataset from Kaggle. It contains reviews written by customers about clothing items, along with whether they recommend the item. After downloading it from Kaggle, we randomly split the dataset into train, val, and test sets in an 8:1:1 ratio.
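
The split itself is not shown in this notebook. As a rough sketch only, an 8:1:1 random split over row positions could look like the following; the helper name `split_indices`, the seed, and the rounding rule are assumptions for illustration, not the procedure actually used to produce the CSV files.

```python
import random

def split_indices(n_rows, train_frac=0.8, seed=42):
    """Shuffle row positions and cut them into train/val/test parts.

    Hypothetical helper: the remaining rows after the train cut are
    divided evenly between val and test.
    """
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    n_train = int(n_rows * train_frac)
    n_val = (n_rows - n_train) // 2
    train = positions[:n_train]
    val = positions[n_train:n_train + n_val]
    test = positions[n_train + n_val:]
    return train, val, test

# 23486 is the total of the three split sizes reported in this notebook.
train_idx, val_idx, test_idx = split_indices(23486)
print(len(train_idx), len(val_idx), len(test_idx))  # 18788 2349 2349
```

With a pandas DataFrame, each index list could then be passed to `df.iloc[...]` to materialize the corresponding split.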

In [2]:
DATA_DIR = os.path.join(curr_dir, 'Multimodal-Toolkit', 'datasets', 'Womens_Clothing_E-Commerce_Reviews')
train_df = pd.read_csv(os.path.join(DATA_DIR, 'train.csv'), index_col=0)
val_df = pd.read_csv(os.path.join(DATA_DIR, 'val.csv'), index_col=0)
test_df = pd.read_csv(os.path.join(DATA_DIR, 'test.csv'), index_col=0)
print('Num examples train-val-test')
print(len(train_df), len(val_df), len(test_df))

Num examples train-val-test
18788 2349 2349


#### Let us take a look at what the dataset looks like

In [3]:
train_df.head(5)

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
12515,936,21,Pretty but not for me,"This sweater is very pretty, i love the knit a...",3,1,1,General Petite,Tops,Sweaters
20723,995,36,Beautiful,As beautiful as in the picture. couldn't go wr...,5,1,0,General,Bottoms,Skirts
17409,869,47,Adorable and comfortable!,Just bought this in black at my local store an...,5,1,4,General,Tops,Knits
7983,833,29,"Must have, elegant, chic",This top! i was hesitant to try this on becaus...,5,1,3,General,Tops,Blouses
5195,1059,38,Very flattering fit,This is a great pair of trousers for work but ...,5,1,1,General Petite,Bottoms,Pants


In [4]:
train_df.describe(include=object)  # `np.object` was removed in NumPy 1.24; the builtin `object` is equivalent

Unnamed: 0,Title,Review Text,Division Name,Department Name,Class Name
count,15748,18095,18776,18776,18776
unique,11431,18093,3,6,20
top,Love it!,Perfect fit and i've gotten so many compliment...,General,Tops,Dresses
freq,106,2,11081,8377,5024


In [5]:
train_df.describe()

Unnamed: 0,Clothing ID,Age,Rating,Recommended IND,Positive Feedback Count
count,18788.0,18788.0,18788.0,18788.0,18788.0
mean,919.561369,43.170055,4.197999,0.822706,2.53433
std,201.384374,12.285153,1.107503,0.381927,5.708134
min,0.0,18.0,1.0,0.0,0.0
25%,861.0,34.0,4.0,1.0,0.0
50%,936.0,41.0,5.0,1.0,1.0
75%,1078.0,52.0,5.0,1.0,3.0
max,1205.0,99.0,5.0,1.0,122.0


### Here are the data and training parameters we will use.
For the model, we can specify any supported HuggingFace model class (see the README for details) or any AutoModel drawn from the supported model classes. For the data, we pass a dictionary that specifies which columns are the `text` columns, the `numerical feature` columns, the `categorical feature` columns, and the `label` column. For classification, we can also name each value in the label column through the `label list`. Alternatively, the same column information can be supplied as a path to a JSON file via the `column_info_path` argument of `MultimodalDataTrainingArguments`.

In [7]:
text_cols = ['Title', 'Review Text']
cat_cols = ['Clothing ID', 'Division Name', 'Department Name', 'Class Name']
numerical_cols = ['Rating', 'Age', 'Positive Feedback Count']

column_info_dict = {
    'text_cols': text_cols,
    'num_cols': numerical_cols,
    'cat_cols': cat_cols,
    'label_col': 'Recommended IND',
    'label_list': ['Not Recommended', 'Recommended']
}


model_args = ModelArguments(
    model_name_or_path='bert-base-uncased'
)

data_args = MultimodalDataTrainingArguments(
    data_path=DATA_DIR,
    combine_feat_method='gating_on_cat_and_num_feats_then_sum',
    column_info=column_info_dict,
    task='classification'
)

training_args = OurTrainingArguments(
    output_dir="./logs/model_name",
    logging_dir="./logs/runs",
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    per_device_train_batch_size=32,  # note: the training log shown later reports a per-device batch size of 16
    num_train_epochs=1,
    logging_steps=250,
    evaluate_during_training=True
)

set_seed(training_args.seed)
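
As an alternative to the inline dictionary, the `column_info_path` argument mentioned above takes a JSON file. The sketch below writes such a file, assuming (this is not confirmed here) that the JSON keys simply mirror the keys of `column_info_dict`; the temporary directory and file name are arbitrary choices for illustration.

```python
import json
import os
import tempfile

# Same column roles as in the inline dictionary.
# Assumption: the JSON file uses the same keys as the dictionary.
column_info = {
    "text_cols": ["Title", "Review Text"],
    "num_cols": ["Rating", "Age", "Positive Feedback Count"],
    "cat_cols": ["Clothing ID", "Division Name", "Department Name", "Class Name"],
    "label_col": "Recommended IND",
    "label_list": ["Not Recommended", "Recommended"],
}

# Write the file to a throwaway location (hypothetical path).
column_info_path = os.path.join(tempfile.mkdtemp(), "column_info.json")
with open(column_info_path, "w") as f:
    json.dump(column_info, f, indent=2)

# Read it back to confirm the round trip preserves the structure.
with open(column_info_path) as f:
    reloaded = json.load(f)
print(reloaded == column_info)  # True
```

The resulting path would then be passed as `column_info_path=...` in place of `column_info=...`.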

### We first instantiate our HuggingFace tokenizer
This is needed to prepare our custom torch dataset. See `torch_dataset.py` for details.

In [8]:
tokenizer_path_or_name = model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path
print('Specified tokenizer: ', tokenizer_path_or_name)
tokenizer = AutoTokenizer.from_pretrained(
    tokenizer_path_or_name,
    cache_dir=model_args.cache_dir,
)

INFO:transformers.configuration_utils:loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json from cache at /home/ec2-user/.cache/torch/transformers/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517
INFO:transformers.configuration_utils:Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

INFO:transformers.tokenization_utils_base:loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt fr

Specified tokenizer:  bert-base-uncased


### Load dataset csvs to torch datasets
The function `load_data_from_folder` expects a path to a folder that contains `train.csv`, `test.csv`, and/or `val.csv` containing the respective split datasets. 

In [9]:
# Get Datasets
train_dataset, val_dataset, test_dataset = load_data_from_folder(
    data_args.data_path,
    data_args.column_info['text_cols'],
    tokenizer,
    label_col=data_args.column_info['label_col'],
    label_list=data_args.column_info['label_list'],
    categorical_cols=data_args.column_info['cat_cols'],
    numerical_cols=data_args.column_info['num_cols'],
    sep_text_token_str=tokenizer.sep_token,
    max_token_length=training_args.max_token_length,
)

INFO:load_data:1239 categorical columns
INFO:load_data:3 numerical columns
INFO:load_data:Text columns: ['Title', 'Review Text']
INFO:load_data:Raw text example: Pretty but not for me [SEP] This sweater is very pretty, i love the knit and the cream color. for some reason it just didn't flow the way i wanted it to, and i didn't love the v neck. just didn't pull me in completely!
INFO:load_data:1239 categorical columns
INFO:load_data:3 numerical columns
INFO:load_data:Text columns: ['Title', 'Review Text']
INFO:load_data:Raw text example: Great idea, poor execution [SEP] I absolutely loved the idea of an elongated hoodie as a lounge dress or robe. unfortunately, this particular product left a lot to be desired. firstly, the fit is slim, not relaxed. i'm 5'4", at 110 lbs and tried both the size 1 (equivalent of a small) and size 2 (equivalent of a medium). sizing up just added length and did not do much to increase roominess. they both zipped up easily but did not drape well. finally, the
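
The "Raw text example" lines in the log show the text columns joined with the tokenizer's separator token. A minimal sketch of that joining step follows; `join_text_columns` is a hypothetical helper, and skipping empty values is an assumption rather than the toolkit's documented behavior.

```python
def join_text_columns(row, text_cols, sep_token="[SEP]"):
    """Join the text fields of one row with the separator token.

    Assumption: fields that are None or empty are skipped.
    """
    parts = [str(row[col]) for col in text_cols if row.get(col)]
    return f" {sep_token} ".join(parts)

row = {
    "Title": "Pretty but not for me",
    "Review Text": "This sweater is very pretty, i love the knit and the cream color.",
}
print(join_text_columns(row, ["Title", "Review Text"]))
# Pretty but not for me [SEP] This sweater is very pretty, i love the knit and the cream color.
```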

In [10]:
if data_args.task == 'regression':
    num_labels = 1
else:
    num_labels = len(np.unique(train_dataset.labels))
num_labels

2

In [11]:
config = AutoConfig.from_pretrained(
        model_args.config_name if model_args.config_name else model_args.model_name_or_path,
        cache_dir=model_args.cache_dir,
    )
tabular_config = TabularConfig(num_labels=num_labels,
                               cat_feat_dim=train_dataset.cat_feats.shape[1],
                               numerical_feat_dim=train_dataset.numerical_feats.shape[1],
                               **vars(data_args))
config.tabular_config = tabular_config

INFO:transformers.configuration_utils:loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json from cache at /home/ec2-user/.cache/torch/transformers/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517
INFO:transformers.configuration_utils:Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "type_vocab_size": 2,
  "vocab_size": 30522
}



In [12]:
model = AutoModelWithTabular.from_pretrained(
        model_args.model_name_or_path,  # load weights from the model path; `config_name` only selects the config
        config=config,
        cache_dir=model_args.cache_dir
    )

INFO:transformers.modeling_utils:loading weights file https://cdn.huggingface.co/bert-base-uncased-pytorch_model.bin from cache at /home/ec2-user/.cache/torch/transformers/f2ee78bdd635b758cc0a12352586868bef80e47401abe4c4fcc3832421e7338b.36ca03ab34a1a5d5fa7bc3d03d55c4fa650fed07220e2eeebc06ce58d0e9a157
- This IS expected if you are initializing BertWithTabular from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertWithTabular from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### We need to define a task-specific way of computing relevant metrics:

In [13]:
def build_compute_metrics_fn(task_name: str) -> Callable[[EvalPrediction], Dict]:
    def compute_metrics_fn(p: EvalPrediction):
        if task_name == "classification":
            preds_labels = np.argmax(p.predictions, axis=1)
            pred_scores = softmax(p.predictions, axis=1)[:, 1]
            return calc_classification_metrics(pred_scores, preds_labels,
                                               p.label_ids)
        elif task_name == "regression":
            preds = np.squeeze(p.predictions)
            return calc_regression_metrics(preds, p.label_ids)
        else:
            return {}
    return compute_metrics_fn
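
To make the classification branch concrete, here is a small sketch on toy logits showing what `np.argmax` and the softmax column select. The logit values are made up for illustration, and `softmax_rows` is a plain NumPy stand-in for `scipy.special.softmax(..., axis=1)` so that the snippet needs no extra dependency.

```python
import numpy as np

def softmax_rows(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Toy logits for three examples and two classes (illustrative values only).
logits = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])

pred_labels = np.argmax(logits, axis=1)   # hard class decision per row
pos_scores = softmax_rows(logits)[:, 1]   # probability of class 1 ("Recommended")

print(pred_labels)
print(pos_scores.round(3))
```

The hard labels feed threshold-free counts, while the positive-class scores are what ranking metrics such as ROC AUC need.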


In [14]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=build_compute_metrics_fn(data_args.task),
)

INFO:args:PyTorch: setting up devices
INFO:transformers.trainer:You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.


### Launching the training is as simple as calling `trainer.train()` 🤗

In [15]:
%%time
trainer.train()

INFO:transformers.trainer:***** Running training *****
INFO:transformers.trainer:  Num examples = 18788
INFO:transformers.trainer:  Num Epochs = 1
INFO:transformers.trainer:  Instantaneous batch size per device = 16
INFO:transformers.trainer:  Total train batch size (w. parallel, distributed & accumulation) = 16
INFO:transformers.trainer:  Gradient Accumulation steps = 1
INFO:transformers.trainer:  Total optimization steps = 1175



INFO:transformers.trainer:{'loss': 0.22426292389631272, 'learning_rate': 3.936170212765958e-05, 'epoch': 0.2127659574468085, 'step': 250}
INFO:transformers.trainer:***** Running Evaluation *****
INFO:transformers.trainer:  Num examples = 2349
INFO:transformers.trainer:  Batch size = 8



INFO:transformers.trainer:{'eval_loss': 0.14863149721656932, 'eval_roc_auc': 0.9812131588602175, 'eval_threshold': 0.31695735454559326, 'eval_pr_auc': 0.9958951737185882, 'eval_recall': 0.9641372141372141, 'eval_precision': 0.9722222222222222, 'eval_f1': 0.9681628392484343, 'eval_tn': 395, 'eval_fp': 30, 'eval_fn': 104, 'eval_tp': 1820, 'epoch': 0.2127659574468085, 'step': 250}





INFO:transformers.trainer:{'loss': 0.16300668650865555, 'learning_rate': 2.8723404255319154e-05, 'epoch': 0.425531914893617, 'step': 500}
INFO:transformers.trainer:***** Running Evaluation *****
INFO:transformers.trainer:  Num examples = 2349
INFO:transformers.trainer:  Batch size = 8



INFO:transformers.trainer:{'eval_loss': 0.14895194406513454, 'eval_roc_auc': 0.981175247645836, 'eval_threshold': 0.35274699330329895, 'eval_pr_auc': 0.9957904489018825, 'eval_recall': 0.9589397089397089, 'eval_precision': 0.9787798408488063, 'eval_f1': 0.9687582042530849, 'eval_tn': 400, 'eval_fp': 25, 'eval_fn': 107, 'eval_tp': 1817, 'epoch': 0.425531914893617, 'step': 500}
INFO:transformers.trainer:Saving model checkpoint to ./logs/model_name/checkpoint-500
INFO:transformers.configuration_utils:Configuration saved in ./logs/model_name/checkpoint-500/config.json





INFO:transformers.modeling_utils:Model weights saved in ./logs/model_name/checkpoint-500/pytorch_model.bin
INFO:transformers.trainer:{'loss': 0.15592467722296716, 'learning_rate': 1.8085106382978724e-05, 'epoch': 0.6382978723404256, 'step': 750}
INFO:transformers.trainer:***** Running Evaluation *****
INFO:transformers.trainer:  Num examples = 2349
INFO:transformers.trainer:  Batch size = 8



INFO:transformers.trainer:{'eval_loss': 0.14800560437332916, 'eval_roc_auc': 0.9815739268680446, 'eval_threshold': 0.8034530282020569, 'eval_pr_auc': 0.9960075810375029, 'eval_recall': 0.9563409563409564, 'eval_precision': 0.9740603493912123, 'eval_f1': 0.965119328612641, 'eval_tn': 332, 'eval_fp': 93, 'eval_fn': 51, 'eval_tp': 1873, 'epoch': 0.6382978723404256, 'step': 750}





INFO:transformers.trainer:{'loss': 0.13811917620897293, 'learning_rate': 7.446808510638298e-06, 'epoch': 0.851063829787234, 'step': 1000}
INFO:transformers.trainer:***** Running Evaluation *****
INFO:transformers.trainer:  Num examples = 2349
INFO:transformers.trainer:  Batch size = 8



INFO:transformers.trainer:{'eval_loss': 0.14000431921643516, 'eval_roc_auc': 0.9828402837226368, 'eval_threshold': 0.3892201781272888, 'eval_pr_auc': 0.9962478798144302, 'eval_recall': 0.9615384615384616, 'eval_precision': 0.9721492380451918, 'eval_f1': 0.9668147373922132, 'eval_tn': 382, 'eval_fp': 43, 'eval_fn': 88, 'eval_tp': 1836, 'epoch': 0.851063829787234, 'step': 1000}
INFO:transformers.trainer:Saving model checkpoint to ./logs/model_name/checkpoint-1000
INFO:transformers.configuration_utils:Configuration saved in ./logs/model_name/checkpoint-1000/config.json





INFO:transformers.modeling_utils:Model weights saved in ./logs/model_name/checkpoint-1000/pytorch_model.bin
INFO:transformers.trainer:

Training completed. Do not forget to share your model on huggingface.co/models =)






CPU times: user 3min 24s, sys: 40.8 s, total: 4min 4s
Wall time: 4min 12s


TrainOutput(global_step=1175, training_loss=0.16534934793380981)
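
The step counts and learning rates in the log above can be checked by hand. The sketch below assumes a base learning rate of 5e-5 with linear decay to zero and no warmup; these settings are not stated in this notebook, but they reproduce the logged values. `linear_lr` is a hypothetical helper, not a `transformers` function.

```python
import math

num_train_examples = 18788
train_batch_size = 16   # "Total train batch size" reported in the log
num_eval_examples = 2349
eval_batch_size = 8     # "Batch size" reported during evaluation

# The last, partially filled batch still counts as a step.
total_steps = math.ceil(num_train_examples / train_batch_size)
eval_batches = math.ceil(num_eval_examples / eval_batch_size)

def linear_lr(step, total, base_lr=5e-5):
    """Linear decay from base_lr to zero (assumed schedule, no warmup)."""
    return base_lr * max(0.0, (total - step) / total)

print(total_steps, eval_batches)      # 1175 294
print(linear_lr(250, total_steps))    # close to the logged 3.936170212765958e-05
print(250 / total_steps)              # fraction of the epoch completed at step 250
```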

### Check that our training was successful using TensorBoard

In [18]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


In [24]:
%tensorboard --logdir ./logs/runs --port=6006

Reusing TensorBoard on port 8009 (pid 15486), started 0:00:26 ago. (Use '!kill 15486' to kill it.)