# Run MMBT Experiments

This notebook shows the end-to-end pipeline to fine-tune pre-trained MMBT model for multimodal (text and image) classification on our dataset.

Parts of this pipeline are adapted from the
Huggingface `run_mmimdb.py` script to execute the MMBT model. This code can
be accessed [here.](https://github.com/huggingface/transformers/blob/8ea412a86faa8e9edeeb6b5c46b08def06aa03ea/examples/research_projects/mm-imdb/run_mmimdb.py#L305)

## Skip unless on Google Colab

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd /content/drive/MyDrive/LAP_MMBT
%pwd

/content/drive/MyDrive/LAP_MMBT


'/content/drive/MyDrive/LAP_MMBT'

Before running the cell below, make sure to select 'GPU' runtime type

In [3]:
import torch

# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla T4


## Install Huggingface Library

These should have been installed during your environment set-up; you only need to run these cells in Google Colab.

In [4]:
!pip install transformers

Collecting transformers
[?25l  Downloading https://files.pythonhosted.org/packages/98/87/ef312eef26f5cecd8b17ae9654cdd8d1fae1eb6dbd87257d6d73c128a4d0/transformers-4.3.2-py3-none-any.whl (1.8MB)
[K     |████████████████████████████████| 1.8MB 9.0MB/s 
Collecting sacremoses
[?25l  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
[K     |████████████████████████████████| 890kB 46.2MB/s 
Collecting tokenizers<0.11,>=0.10.1
[?25l  Downloading https://files.pythonhosted.org/packages/fd/5b/44baae602e0a30bcc53fbdbc60bd940c15e143d252d658dfdefce736ece5/tokenizers-0.10.1-cp36-cp36m-manylinux2010_x86_64.whl (3.2MB)
[K     |████████████████████████████████| 3.2MB 50.8MB/s 
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... [?25l[?25hdone
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp36-none-any.whl size=893261 sha256=9af3a36ceee

# Data directories and file paths


In [5]:
train_file = "image_labels_impression_frontal_train.jsonl" 
val_file = "image_labels_impression_frontal_val.jsonl" 
test_file = "image_labels_impression_frontal_test.jsonl" 

## Import Required Modules

In [6]:
from textBert_utils import set_seed
from MMBT.image import ImageEncoderDenseNet
from MMBT.mmbt_config import MMBTConfig
from MMBT.mmbt import MMBTForClassification

In [7]:
from MMBT.mmbt_utils import JsonlDataset, get_image_transforms, get_labels, load_examples, collate_fn

In [8]:
import argparse

In [9]:
import glob
import logging
import random
import json
import os
from collections import Counter
import numpy as np
from matplotlib.pyplot import imshow

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
from PIL import Image
from torch.utils.data import Dataset
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

In [10]:
from sklearn.metrics import accuracy_score
from tqdm import tqdm, trange

from transformers import (
    WEIGHTS_NAME,
    AdamW,
    AutoConfig,
    AutoModel,
    AutoTokenizer,
    get_linear_schedule_with_warmup,
)

try:
    from torch.utils.tensorboard import SummaryWriter
except ImportError:
    from tensorboardX import SummaryWriter

In [11]:
from transformers import (
    WEIGHTS_NAME,
    AdamW,
    AutoConfig,
    AutoModel,
    AutoTokenizer,
    get_linear_schedule_with_warmup,
)

# Set-up Experiment Hyperparameters and Arguments

Specify the training, validation, and test files to run the experiment on. The default here is running the model on 'impression' texts.  

To re-make the training, validation, and test data, please refer to the information in the **data/** directory.  

Change the default values in the parser.add_argument function for the hyperparameters that you want to specify in the following cell or use the default option.  

For multiple experiment runs, please make sure to change the `output_dir` argument so that new results don't overwrit existing ones.

The arguments specified here are the same as in the `run_mmimdb.py` file 
in the [Huggingface example implementation of MMBT.](https://github.com/huggingface/transformers/blob/8ea412a86faa8e9edeeb6b5c46b08def06aa03ea/examples/research_projects/mm-imdb/run_mmimdb.py#L305)

In [12]:
parser = argparse.ArgumentParser(f'Project Hyperparameters and Other Configurations Argument Parser')

parser = argparse.ArgumentParser()

# Required parameters
parser.add_argument(
    "--data_dir",
    default="data/json",
    type=str,
    help="The input data dir. Should contain the .jsonl files.",
)
parser.add_argument(
    "--model_name",
    default="bert-base-uncased",
    type=str,
    help="model identifier from huggingface.co/models",
)
parser.add_argument(
    "--output_dir",
    default="mmbt_output_impression",
    type=str,
    help="The output directory where the model predictions and checkpoints will be written.",
)

    
parser.add_argument(
    "--config_name", default="bert-base-uncased", type=str, help="Pretrained config name if not the same as model_name"
)
parser.add_argument(
    "--tokenizer_name",
    default="bert-base-uncased",
    type=str,
    help="Pretrained tokenizer name or path if not the same as model_name",
)

parser.add_argument("--train_batch_size", default=32, type=int, help="Batch size for training.")
parser.add_argument(
    "--eval_batch_size", default=32, type=int, help="Batch size for evaluation."
)
parser.add_argument(
    "--max_seq_length",
    default=256,
    type=int,
    help="The maximum total input sequence length after tokenization. Sequences longer "
    "than this will be truncated, sequences shorter will be padded.",
)
parser.add_argument(
    "--num_image_embeds", default=3, type=int, help="Number of Image Embeddings from the Image Encoder"
)
parser.add_argument("--do_train", default=True, type=bool, help="Whether to run training.")
parser.add_argument("--do_eval", default=True, type=bool, help="Whether to run eval on the dev set.")
parser.add_argument(
    "--evaluate_during_training", default=True, type=bool, help="Rul evaluation during training at each logging step."
)


parser.add_argument(
    "--gradient_accumulation_steps",
    type=int,
    default=1,
    help="Number of updates steps to accumulate before performing a backward/update pass.",
)
parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
parser.add_argument("--weight_decay", default=0.1, type=float, help="Weight deay if we apply some.")
parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.")
parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
parser.add_argument(
    "--num_train_epochs", default=4.0, type=float, help="Total number of training epochs to perform."
)
parser.add_argument("--patience", default=5, type=int, help="Patience for Early Stopping.")
parser.add_argument(
    "--max_steps",
    default=-1,
    type=int,
    help="If > 0: set total number of training steps to perform. Override num_train_epochs.",
)
parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.")

parser.add_argument("--logging_steps", type=int, default=25, help="Log every X updates steps.")
parser.add_argument("--save_steps", type=int, default=25, help="Save checkpoint every X updates steps.")
parser.add_argument(
    "--eval_all_checkpoints",
    default=True, type=bool,
    help="Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number",
)

parser.add_argument("--num_workers", type=int, default=8, help="number of worker threads for dataloading")

parser.add_argument("--seed", type=int, default=42, help="random seed for initialization")


args = parser.parse_args("")

# Setup CUDA, GPU & distributed training
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
args.n_gpu = torch.cuda.device_count() if torch.cuda.is_available() else 0
args.device = device

In [13]:
# Setup Train/Val/Test filenames
args.train_file = train_file
args.val_file = val_file
args.test_file = test_file

## Showing a sample from JsonDataset
i.e. calling "\_\_getitem\_\_"

Note:   
image_end_token is the BERT token id for [SEP].   
image_start_token is the BERT token id for [CLS]. 


In [14]:
tokenizer = AutoTokenizer.from_pretrained(
        args.tokenizer_name if args.tokenizer_name else args.model_name,
        do_lower_case=True,
        cache_dir=None,
    )
train_dataset = load_examples(tokenizer, args)

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=433.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=231508.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=466062.0, style=ProgressStyle(descripti…




In [15]:
train_dataset[0]

{'image': tensor([[[ 0.4679,  0.4166,  0.3481,  ..., -0.7822, -0.7822, -0.7993],
          [ 0.4851,  0.5022,  0.5364,  ..., -0.5938, -0.5938, -0.6452],
          [ 0.4166,  0.3823,  0.4337,  ..., -0.4226, -0.4226, -0.4568],
          ...,
          [ 1.8722,  1.9064,  1.9235,  ...,  0.7248,  0.6734,  0.6221],
          [ 1.8208,  1.8550,  1.9064,  ...,  0.7248,  0.6734,  0.6392],
          [ 1.7865,  1.8208,  1.8379,  ...,  0.7077,  0.6906,  0.6563]],
 
         [[ 0.6078,  0.5553,  0.4853,  ..., -0.6702, -0.6702, -0.6877],
          [ 0.6254,  0.6429,  0.6779,  ..., -0.4776, -0.4776, -0.5301],
          [ 0.5553,  0.5203,  0.5728,  ..., -0.3025, -0.3025, -0.3375],
          ...,
          [ 2.0434,  2.0784,  2.0959,  ...,  0.8704,  0.8179,  0.7654],
          [ 1.9909,  2.0259,  2.0784,  ...,  0.8704,  0.8179,  0.7829],
          [ 1.9559,  1.9909,  2.0084,  ...,  0.8529,  0.8354,  0.8004]],
 
         [[ 0.8274,  0.7751,  0.7054,  ..., -0.4450, -0.4450, -0.4624],
          [ 0.8448,


### Training and Evaluating Functions.

In [16]:
def train(args, train_dataset, model, tokenizer):
    """ Train the model """
    
    tb_writer = SummaryWriter()

    train_sampler = RandomSampler(train_dataset)
    train_dataloader = DataLoader(
        train_dataset,
        sampler=train_sampler,
        batch_size=args.train_batch_size,
        collate_fn=collate_fn
    )

    t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs

    # Prepare optimizer and schedule (linear warmup and decay)
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": args.weight_decay,
        },
        {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
    ]

    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
    )
    

    # Train!
    logger.info("***** Running training *****")
    logger.info("  Num examples = %d", len(train_dataset))
    logger.info("  Num Epochs = %d", args.num_train_epochs)
    logger.info(
        "  Total train batch size = %d",
        args.train_batch_size
        * args.gradient_accumulation_steps)
    logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
    logger.info("  Total optimization steps = %d", t_total)

    global_step = 0
    tr_loss, logging_loss = 0.0, 0.0
    best_accuracy, n_no_improve = 0, 0
    model.train()
    model.zero_grad()
    optimizer.zero_grad()
    train_iterator = trange(int(args.num_train_epochs), desc="Epoch")
    set_seed(args)  # Added here for reproductibility
    for _ in train_iterator:
        epoch_iterator = tqdm(train_dataloader, desc="Training Batch Iteration")
        for step, batch in enumerate(epoch_iterator):
            # model.train()
            # each sample in batch is a tuple
            # batch is the return of the collate_fn function
            # see function definition for data tuple order
            batch = tuple(t.to(args.device) for t in batch)
            labels = batch[5]
            input_ids = batch[0]
            input_modal = batch[2]
            attention_mask = batch[1]
            modal_start_tokens = batch[3]
            modal_end_tokens = batch[4]

            #inputs = {
            #    "input_ids": batch[0],
            #    "input_modal": batch[2],
            #    "attention_mask": batch[1],
            #    "modal_start_tokens": batch[3],
            #    "modal_end_tokens": batch[4],
            #    "labels": batch[5]
            #}

            outputs = model(
                input_modal,
                input_ids=input_ids,
                modal_start_tokens=modal_start_tokens,
                modal_end_tokens=modal_end_tokens,
                attention_mask=attention_mask,
                token_type_ids=None,
                modal_token_type_ids=None,
                position_ids=None,
                modal_position_ids=None,
                head_mask=None,
                inputs_embeds=None,
                labels=labels,
                return_dict=True
            )
            #logits = outputs[0]  # model outputs are always tuple in transformers (see doc)
            #loss = criterion(logits, labels)
            loss = outputs.loss
            logits = outputs.logits
            
            if args.n_gpu > 1:
                loss = loss.mean()  # mean() to average on multi-gpu parallel training
            if args.gradient_accumulation_steps > 1:
                loss = loss / args.gradient_accumulation_steps


            loss.backward()

            tr_loss += loss.item()
            if (step + 1) % args.gradient_accumulation_steps == 0:

                torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)

                optimizer.step()
                scheduler.step()  # Update learning rate schedule
                model.zero_grad()
                global_step += 1

                if args.logging_steps > 0 and global_step % args.logging_steps == 0:
                    logs = {}
                    if args.evaluate_during_training:  
                        # Only evaluate when single GPU otherwise metrics may not average well
                        results = evaluate(args, model, tokenizer)
                        for key, value in results.items():
                            eval_key = "eval_{}".format(key)
                            logs[eval_key] = value

                    loss_scalar = (tr_loss - logging_loss) / args.logging_steps
                    learning_rate_scalar = scheduler.get_last_lr()[0]
                    logs["learning_rate"] = learning_rate_scalar
                    logs["loss"] = loss_scalar
                    logging_loss = tr_loss

                    for key, value in logs.items():
                        tb_writer.add_scalar(key, value, global_step)
                    print(json.dumps({**logs, **{"step": global_step}}))

                if args.save_steps > 0 and global_step % args.save_steps == 0:
                    # Save model checkpoint
                    output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step))
                    if not os.path.exists(output_dir):
                        os.makedirs(output_dir)
                    model_to_save = (
                        model.module if hasattr(model, "module") else model
                    )  # Take care of distributed/parallel training
                    torch.save(model_to_save.state_dict(), os.path.join(output_dir, WEIGHTS_NAME))
                    # uncomment below to be able to save args
                    # torch.save(args, os.path.join(output_dir, "training_args.bin"))
                    logger.info("Saving model checkpoint to %s", output_dir)


        results = evaluate(args, model, tokenizer)
        if results["accuracy"] > best_accuracy:
            best_accuracy = results["accuracy"]
            n_no_improve = 0
        else:
            n_no_improve += 1

        if n_no_improve > args.patience:
            train_iterator.close()
            break

    tb_writer.close()

    return global_step, tr_loss / global_step

In [17]:
def evaluate(args, model, tokenizer, evaluate=True, test=False, prefix=""):
    # Loop to handle MNLI double evaluation (matched, mis-matched)
    eval_output_dir = args.output_dir
    eval_dataset = load_examples(tokenizer, args, evaluate=evaluate, test=test)

    if not os.path.exists(eval_output_dir):
        os.makedirs(eval_output_dir)

    # Note that DistributedSampler samples randomly
    eval_sampler = SequentialSampler(eval_dataset)
    eval_dataloader = DataLoader(
        eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=collate_fn
    )

    # Eval!
    logger.info("***** Running evaluation {} *****".format(prefix))
    logger.info("  Num examples = %d", len(eval_dataset))
    logger.info("  Batch size = %d", args.eval_batch_size)
    eval_loss = 0.0
    nb_eval_steps = 0
    preds = []
    out_label_ids = []
    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        model.eval()
        batch = tuple(t.to(args.device) for t in batch)

        with torch.no_grad():
            batch = tuple(t.to(args.device) for t in batch)
            labels = batch[5]
            input_ids = batch[0]
            input_modal = batch[2]
            attention_mask = batch[1]
            modal_start_tokens = batch[3]
            modal_end_tokens = batch[4]
            
            outputs = model(
                input_modal,
                input_ids=input_ids,
                modal_start_tokens=modal_start_tokens,
                modal_end_tokens=modal_end_tokens,
                attention_mask=attention_mask,
                token_type_ids=None,
                modal_token_type_ids=None,
                position_ids=None,
                modal_position_ids=None,
                head_mask=None,
                inputs_embeds=None,
                labels=labels,
                return_dict=True
            )
            #logits = outputs[0]  # model outputs are always tuple in transformers (see doc)
            #tmp_eval_loss = criterion(logits, labels)
            tmp_eval_loss = outputs.loss
            logits = outputs.logits
            eval_loss += tmp_eval_loss.mean().item()
        nb_eval_steps += 1
        # Move logits and labels to CPU            
        pred = torch.nn.functional.softmax(logits, dim=1).argmax(dim=1).cpu().detach().numpy()
        out_label_id = labels.detach().cpu().numpy()
        preds.append(pred)
        out_label_ids.append(out_label_id)

    eval_loss = eval_loss / nb_eval_steps

    preds = [l for sl in preds for l in sl]
    out_label_ids = [l for sl in out_label_ids for l in sl]

    result = {
        "loss": eval_loss,
        "accuracy": accuracy_score(out_label_ids, preds)
    }

    output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
    with open(output_eval_file, "w") as writer:
        logger.info("***** Eval results {} *****".format(prefix))
        for key in sorted(result.keys()):
            logger.info("  %s = %s", key, str(result[key]))
            writer.write("%s = %s\n" % (key, str(result[key])))

    return result


## Training MMBT Model 

Set up logging and the MMBT Model. Similar to the text-only model, check points 
are saved during a similar customizable interval.



In [18]:
# Setup logging
logger = logging.getLogger(__name__)
if not os.path.exists(args.output_dir):
    os.makedirs(args.output_dir)
logging.basicConfig(format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
                    datefmt="%m/%d/%Y %H:%M:%S",
                    filename=os.path.join(args.output_dir, f"{os.path.splitext(args.train_file)[0]}_logging.txt"),
                    level=logging.INFO)
logger.warning("device: %s, n_gpu: %s",
        args.device,
        args.n_gpu
)
# Set the verbosity to info of the Transformers logger (on main process only):

# Set seed
set_seed(args)

In [19]:
# Setup model
labels = get_labels()
num_labels = len(labels)
transformer_config = AutoConfig.from_pretrained(args.config_name if args.config_name else args.model_name)
tokenizer = AutoTokenizer.from_pretrained(
        args.tokenizer_name if args.tokenizer_name else args.model_name,
        do_lower_case=True,
        cache_dir=None,
    )
transformer = AutoModel.from_pretrained(args.model_name, config=transformer_config, cache_dir=None)
img_encoder = ImageEncoderDenseNet(num_image_embeds=args.num_image_embeds)
multimodal_config = MMBTConfig(transformer, img_encoder, num_labels=num_labels, modal_hidden_size=1024)
model = MMBTForClassification(transformer_config, multimodal_config)

model.to(args.device)

logger.info(f"Training/evaluation parameters: {args}")

# Training
if args.do_train:
    train_dataset = load_examples(tokenizer, args)
    # criterion = nn.CrossEntropyLoss
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
    logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

# Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()
    logger.info("Saving model checkpoint to %s", args.output_dir)
    # Save a trained model, configuration and tokenizer using `save_pretrained()`.
    # They can then be reloaded using `from_pretrained()`
    model_to_save = (model.module if hasattr(model, "module") else model)  # Take care of distributed/parallel training
    torch.save(model_to_save.state_dict(), os.path.join(args.output_dir, WEIGHTS_NAME))
    tokenizer.save_pretrained(args.output_dir)
    transformer_config.save_pretrained(args.output_dir)
    # Good practice: save your training arguments together with the trained model
    torch.save(args, os.path.join(args.output_dir, "training_args.bin"))

    # Load a trained model and vocabulary that you have fine-tuned
    model = MMBTForClassification(transformer_config, multimodal_config)
    model.load_state_dict(torch.load(os.path.join(args.output_dir, WEIGHTS_NAME)))
    tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
    model.to(args.device)
logger.info("***** Training Finished *****")

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=440473133.0, style=ProgressStyle(descri…




Epoch:   0%|          | 0/4 [00:00<?, ?it/s]
Training Batch Iteration:   0%|          | 0/61 [00:00<?, ?it/s][A
Training Batch Iteration:   2%|▏         | 1/61 [00:18<18:57, 18.96s/it][A
Training Batch Iteration:   3%|▎         | 2/61 [00:32<17:06, 17.39s/it][A
Training Batch Iteration:   5%|▍         | 3/61 [00:47<15:55, 16.47s/it][A
Training Batch Iteration:   7%|▋         | 4/61 [01:01<15:04, 15.88s/it][A
Training Batch Iteration:   8%|▊         | 5/61 [01:15<14:18, 15.33s/it][A
Training Batch Iteration:  10%|▉         | 6/61 [01:29<13:43, 14.97s/it][A
Training Batch Iteration:  11%|█▏        | 7/61 [01:43<13:03, 14.51s/it][A
Training Batch Iteration:  13%|█▎        | 8/61 [01:57<12:40, 14.35s/it][A
Training Batch Iteration:  15%|█▍        | 9/61 [02:10<12:18, 14.21s/it][A
Training Batch Iteration:  16%|█▋        | 10/61 [02:24<11:54, 14.01s/it][A
Training Batch Iteration:  18%|█▊        | 11/61 [02:38<11:38, 13.98s/it][A
Training Batch Iteration:  20%|█▉        | 12/61 

{"eval_loss": 0.3071372416757402, "eval_accuracy": 0.8952234206471494, "learning_rate": 4.487704918032787e-05, "loss": 0.46950038194656374, "step": 25}



Training Batch Iteration:  41%|████      | 25/61 [10:02<54:19, 90.55s/it][A
Training Batch Iteration:  43%|████▎     | 26/61 [10:14<39:09, 67.11s/it][A
Training Batch Iteration:  44%|████▍     | 27/61 [10:27<28:42, 50.65s/it][A
Training Batch Iteration:  46%|████▌     | 28/61 [10:39<21:33, 39.21s/it][A
Training Batch Iteration:  48%|████▊     | 29/61 [10:52<16:44, 31.39s/it][A
Training Batch Iteration:  49%|████▉     | 30/61 [11:05<13:16, 25.68s/it][A
Training Batch Iteration:  51%|█████     | 31/61 [11:18<10:57, 21.93s/it][A
Training Batch Iteration:  52%|█████▏    | 32/61 [11:31<09:17, 19.21s/it][A
Training Batch Iteration:  54%|█████▍    | 33/61 [11:43<08:03, 17.26s/it][A
Training Batch Iteration:  56%|█████▌    | 34/61 [11:56<07:06, 15.81s/it][A
Training Batch Iteration:  57%|█████▋    | 35/61 [12:08<06:26, 14.86s/it][A
Training Batch Iteration:  59%|█████▉    | 36/61 [12:21<05:58, 14.33s/it][A
Training Batch Iteration:  61%|██████    | 37/61 [12:34<05:30, 13.75s/it][

{"eval_loss": 0.1675085269269489, "eval_accuracy": 0.9414483821263482, "learning_rate": 3.975409836065574e-05, "loss": 0.20768402725458146, "step": 50}



Training Batch Iteration:  82%|████████▏ | 50/61 [15:31<03:00, 16.38s/it][A
Training Batch Iteration:  84%|████████▎ | 51/61 [15:44<02:33, 15.39s/it][A
Training Batch Iteration:  85%|████████▌ | 52/61 [15:56<02:09, 14.43s/it][A
Training Batch Iteration:  87%|████████▋ | 53/61 [16:09<01:51, 13.92s/it][A
Training Batch Iteration:  89%|████████▊ | 54/61 [16:22<01:35, 13.67s/it][A
Training Batch Iteration:  90%|█████████ | 55/61 [16:35<01:20, 13.39s/it][A
Training Batch Iteration:  92%|█████████▏| 56/61 [16:48<01:07, 13.47s/it][A
Training Batch Iteration:  93%|█████████▎| 57/61 [17:00<00:52, 13.05s/it][A
Training Batch Iteration:  95%|█████████▌| 58/61 [17:13<00:39, 13.03s/it][A
Training Batch Iteration:  97%|█████████▋| 59/61 [17:25<00:25, 12.72s/it][A
Training Batch Iteration:  98%|█████████▊| 60/61 [17:38<00:12, 12.78s/it][A
Training Batch Iteration: 100%|██████████| 61/61 [17:50<00:00, 17.54s/it]

Evaluating:   0%|          | 0/21 [00:00<?, ?it/s][A
Evaluating:   5%|▍     

{"eval_loss": 0.17225747690757826, "eval_accuracy": 0.963020030816641, "learning_rate": 3.463114754098361e-05, "loss": 0.1314369172602892, "step": 75}



Training Batch Iteration:  23%|██▎       | 14/61 [00:27<03:53,  4.98s/it][A
Training Batch Iteration:  25%|██▍       | 15/61 [00:28<02:56,  3.84s/it][A
Training Batch Iteration:  26%|██▌       | 16/61 [00:29<02:14,  2.99s/it][A
Training Batch Iteration:  28%|██▊       | 17/61 [00:31<01:48,  2.46s/it][A
Training Batch Iteration:  30%|██▉       | 18/61 [00:32<01:29,  2.09s/it][A
Training Batch Iteration:  31%|███       | 19/61 [00:33<01:18,  1.86s/it][A
Training Batch Iteration:  33%|███▎      | 20/61 [00:35<01:14,  1.82s/it][A
Training Batch Iteration:  34%|███▍      | 21/61 [00:36<01:06,  1.66s/it][A
Training Batch Iteration:  36%|███▌      | 22/61 [00:38<01:08,  1.76s/it][A
Training Batch Iteration:  38%|███▊      | 23/61 [00:39<01:01,  1.62s/it][A
Training Batch Iteration:  39%|███▉      | 24/61 [00:40<00:53,  1.44s/it][A
Training Batch Iteration:  41%|████      | 25/61 [00:42<00:49,  1.39s/it][A
Training Batch Iteration:  43%|████▎     | 26/61 [00:43<00:44,  1.26s/it][

{"eval_loss": 0.13353006976346174, "eval_accuracy": 0.9583975346687211, "learning_rate": 2.9508196721311478e-05, "loss": 0.12748133765533567, "step": 100}



Training Batch Iteration:  64%|██████▍   | 39/61 [01:10<01:50,  5.03s/it][A
Training Batch Iteration:  66%|██████▌   | 40/61 [01:12<01:21,  3.87s/it][A
Training Batch Iteration:  67%|██████▋   | 41/61 [01:13<01:00,  3.03s/it][A
Training Batch Iteration:  69%|██████▉   | 42/61 [01:14<00:47,  2.51s/it][A
Training Batch Iteration:  70%|███████   | 43/61 [01:15<00:37,  2.07s/it][A
Training Batch Iteration:  72%|███████▏  | 44/61 [01:16<00:30,  1.81s/it][A
Training Batch Iteration:  74%|███████▍  | 45/61 [01:17<00:26,  1.63s/it][A
Training Batch Iteration:  75%|███████▌  | 46/61 [01:19<00:22,  1.52s/it][A
Training Batch Iteration:  77%|███████▋  | 47/61 [01:20<00:20,  1.44s/it][A
Training Batch Iteration:  79%|███████▊  | 48/61 [01:21<00:17,  1.34s/it][A
Training Batch Iteration:  80%|████████  | 49/61 [01:22<00:14,  1.25s/it][A
Training Batch Iteration:  82%|████████▏ | 50/61 [01:23<00:13,  1.21s/it][A
Training Batch Iteration:  84%|████████▎ | 51/61 [01:24<00:11,  1.17s/it][

{"eval_loss": 0.16078567731061152, "eval_accuracy": 0.9583975346687211, "learning_rate": 2.4385245901639343e-05, "loss": 0.07961560974828899, "step": 125}



Training Batch Iteration:   5%|▍         | 3/61 [00:17<05:08,  5.33s/it][A
Training Batch Iteration:   7%|▋         | 4/61 [00:18<03:52,  4.08s/it][A
Training Batch Iteration:   8%|▊         | 5/61 [00:19<02:56,  3.14s/it][A
Training Batch Iteration:  10%|▉         | 6/61 [00:20<02:20,  2.55s/it][A
Training Batch Iteration:  11%|█▏        | 7/61 [00:21<01:54,  2.12s/it][A
Training Batch Iteration:  13%|█▎        | 8/61 [00:22<01:36,  1.82s/it][A
Training Batch Iteration:  15%|█▍        | 9/61 [00:23<01:23,  1.60s/it][A
Training Batch Iteration:  16%|█▋        | 10/61 [00:25<01:14,  1.46s/it][A
Training Batch Iteration:  18%|█▊        | 11/61 [00:26<01:08,  1.37s/it][A
Training Batch Iteration:  20%|█▉        | 12/61 [00:27<01:05,  1.33s/it][A
Training Batch Iteration:  21%|██▏       | 13/61 [00:28<01:02,  1.31s/it][A
Training Batch Iteration:  23%|██▎       | 14/61 [00:30<01:02,  1.33s/it][A
Training Batch Iteration:  25%|██▍       | 15/61 [00:31<01:00,  1.31s/it][A
Train

{"eval_loss": 0.5786586153720107, "eval_accuracy": 0.8736517719568567, "learning_rate": 1.9262295081967212e-05, "loss": 0.07866687266621739, "step": 150}



Training Batch Iteration:  46%|████▌     | 28/61 [00:58<02:45,  5.03s/it][A
Training Batch Iteration:  48%|████▊     | 29/61 [01:00<02:05,  3.92s/it][A
Training Batch Iteration:  49%|████▉     | 30/61 [01:01<01:37,  3.15s/it][A
Training Batch Iteration:  51%|█████     | 31/61 [01:02<01:16,  2.54s/it][A
Training Batch Iteration:  52%|█████▏    | 32/61 [01:03<01:00,  2.09s/it][A
Training Batch Iteration:  54%|█████▍    | 33/61 [01:04<00:49,  1.76s/it][A
Training Batch Iteration:  56%|█████▌    | 34/61 [01:05<00:41,  1.53s/it][A
Training Batch Iteration:  57%|█████▋    | 35/61 [01:06<00:37,  1.46s/it][A
Training Batch Iteration:  59%|█████▉    | 36/61 [01:08<00:34,  1.39s/it][A
Training Batch Iteration:  61%|██████    | 37/61 [01:09<00:33,  1.41s/it][A
Training Batch Iteration:  62%|██████▏   | 38/61 [01:10<00:32,  1.40s/it][A
Training Batch Iteration:  64%|██████▍   | 39/61 [01:12<00:30,  1.36s/it][A
Training Batch Iteration:  66%|██████▌   | 40/61 [01:13<00:26,  1.26s/it][

{"eval_loss": 0.11410807844783578, "eval_accuracy": 0.9645608628659477, "learning_rate": 1.4139344262295081e-05, "loss": 0.10108448210172355, "step": 175}



Training Batch Iteration:  87%|████████▋ | 53/61 [01:41<00:40,  5.11s/it][A
Training Batch Iteration:  89%|████████▊ | 54/61 [01:43<00:27,  3.94s/it][A
Training Batch Iteration:  90%|█████████ | 55/61 [01:44<00:18,  3.10s/it][A
Training Batch Iteration:  92%|█████████▏| 56/61 [01:45<00:12,  2.49s/it][A
Training Batch Iteration:  93%|█████████▎| 57/61 [01:46<00:08,  2.07s/it][A
Training Batch Iteration:  95%|█████████▌| 58/61 [01:47<00:05,  1.79s/it][A
Training Batch Iteration:  97%|█████████▋| 59/61 [01:48<00:03,  1.61s/it][A
Training Batch Iteration:  98%|█████████▊| 60/61 [01:50<00:01,  1.54s/it][A
Training Batch Iteration: 100%|██████████| 61/61 [01:51<00:00,  1.82s/it]

Evaluating:   0%|          | 0/21 [00:00<?, ?it/s][A
Evaluating:   5%|▍         | 1/21 [00:00<00:12,  1.61it/s][A
Evaluating:  10%|▉         | 2/21 [00:01<00:11,  1.60it/s][A
Evaluating:  14%|█▍        | 3/21 [00:01<00:10,  1.64it/s][A
Evaluating:  19%|█▉        | 4/21 [00:02<00:09,  1.73it/s][A
Evalua

{"eval_loss": 0.109706214074755, "eval_accuracy": 0.9722650231124808, "learning_rate": 9.016393442622952e-06, "loss": 0.07073369422927499, "step": 200}



Training Batch Iteration:  28%|██▊       | 17/61 [00:32<03:46,  5.15s/it][A
Training Batch Iteration:  30%|██▉       | 18/61 [00:34<02:50,  3.96s/it][A
Training Batch Iteration:  31%|███       | 19/61 [00:35<02:10,  3.11s/it][A
Training Batch Iteration:  33%|███▎      | 20/61 [00:36<01:44,  2.54s/it][A
Training Batch Iteration:  34%|███▍      | 21/61 [00:37<01:24,  2.11s/it][A
Training Batch Iteration:  36%|███▌      | 22/61 [00:38<01:10,  1.81s/it][A
Training Batch Iteration:  38%|███▊      | 23/61 [00:39<01:01,  1.63s/it][A
Training Batch Iteration:  39%|███▉      | 24/61 [00:40<00:54,  1.47s/it][A
Training Batch Iteration:  41%|████      | 25/61 [00:42<00:51,  1.43s/it][A
Training Batch Iteration:  43%|████▎     | 26/61 [00:43<00:48,  1.39s/it][A
Training Batch Iteration:  44%|████▍     | 27/61 [00:44<00:45,  1.32s/it][A
Training Batch Iteration:  46%|████▌     | 28/61 [00:45<00:42,  1.28s/it][A
Training Batch Iteration:  48%|████▊     | 29/61 [00:46<00:38,  1.20s/it][

{"eval_loss": 0.12642938399776107, "eval_accuracy": 0.9691833590138675, "learning_rate": 3.89344262295082e-06, "loss": 0.016114526093006135, "step": 225}



Training Batch Iteration:  69%|██████▉   | 42/61 [01:14<01:34,  4.96s/it][A
Training Batch Iteration:  70%|███████   | 43/61 [01:15<01:08,  3.81s/it][A
Training Batch Iteration:  72%|███████▏  | 44/61 [01:16<00:50,  3.00s/it][A
Training Batch Iteration:  74%|███████▍  | 45/61 [01:17<00:39,  2.44s/it][A
Training Batch Iteration:  75%|███████▌  | 46/61 [01:18<00:31,  2.08s/it][A
Training Batch Iteration:  77%|███████▋  | 47/61 [01:20<00:25,  1.79s/it][A
Training Batch Iteration:  79%|███████▊  | 48/61 [01:21<00:20,  1.61s/it][A
Training Batch Iteration:  80%|████████  | 49/61 [01:22<00:18,  1.53s/it][A
Training Batch Iteration:  82%|████████▏ | 50/61 [01:23<00:15,  1.39s/it][A
Training Batch Iteration:  84%|████████▎ | 51/61 [01:24<00:12,  1.29s/it][A
Training Batch Iteration:  85%|████████▌ | 52/61 [01:25<00:11,  1.22s/it][A
Training Batch Iteration:  87%|████████▋ | 53/61 [01:26<00:09,  1.19s/it][A
Training Batch Iteration:  89%|████████▊ | 54/61 [01:27<00:08,  1.15s/it][

## Evaluating on the Test Set



In [20]:
# Evaluation
results = {}
if args.do_eval:
    checkpoints = [args.output_dir]
    if args.eval_all_checkpoints:
        checkpoints = list(os.path.dirname(c) 
        for c in sorted(glob.glob(args.output_dir + "/**/" + 
                                  WEIGHTS_NAME, recursive=False)))
        # recursive=False because otherwise the parent diretory gets included
        # which is not what we want; only subdirectories

    logger.info("Evaluate the following checkpoints: %s", checkpoints)

    for checkpoint in checkpoints:
        global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
        prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""
        model = MMBTForClassification(transformer_config, multimodal_config)
        checkpoint = os.path.join(checkpoint, 'pytorch_model.bin')
        model.load_state_dict(torch.load(checkpoint))
        model.to(args.device)
        result = evaluate(args, model, tokenizer, evaluate=True, test=True, prefix=prefix)
        result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
        results.update(result)

results.keys()

Evaluating: 100%|██████████| 21/21 [03:58<00:00, 11.38s/it]
Evaluating: 100%|██████████| 21/21 [00:10<00:00,  2.02it/s]
Evaluating: 100%|██████████| 21/21 [00:10<00:00,  2.03it/s]
Evaluating: 100%|██████████| 21/21 [00:10<00:00,  2.03it/s]
Evaluating: 100%|██████████| 21/21 [00:10<00:00,  2.04it/s]
Evaluating: 100%|██████████| 21/21 [00:10<00:00,  2.03it/s]
Evaluating: 100%|██████████| 21/21 [00:10<00:00,  2.04it/s]
Evaluating: 100%|██████████| 21/21 [00:10<00:00,  2.04it/s]
Evaluating: 100%|██████████| 21/21 [00:10<00:00,  2.03it/s]


dict_keys(['loss_100', 'accuracy_100', 'loss_125', 'accuracy_125', 'loss_150', 'accuracy_150', 'loss_175', 'accuracy_175', 'loss_200', 'accuracy_200', 'loss_225', 'accuracy_225', 'loss_25', 'accuracy_25', 'loss_50', 'accuracy_50', 'loss_75', 'accuracy_75'])

In [21]:
results

{'accuracy_100': 0.9492307692307692,
 'accuracy_125': 0.963076923076923,
 'accuracy_150': 0.9046153846153846,
 'accuracy_175': 0.9692307692307692,
 'accuracy_200': 0.9646153846153847,
 'accuracy_225': 0.9692307692307692,
 'accuracy_25': 0.8969230769230769,
 'accuracy_50': 0.9461538461538461,
 'accuracy_75': 0.963076923076923,
 'loss_100': 0.13056670280084723,
 'loss_125': 0.15960173424155938,
 'loss_150': 0.4642904294388635,
 'loss_175': 0.1228057337215259,
 'loss_200': 0.1284510674221175,
 'loss_225': 0.1478217844407828,
 'loss_25': 0.2989686181147893,
 'loss_50': 0.15678607938545092,
 'loss_75': 0.16127859109214374}

## Saving Test Eval Results

The code automatically saved evaluation result from each checkpoint in its respective folder. This next cell simply saves all of them in one place.

In [22]:
with open('mmbt_both_front_eval_results.txt', mode='w', encoding='utf-8') as out_f:
    print(results, file=out_f)