# Introduction

In this notebook we investigate the [jiant](https://jiant.info/) NLP toolkit from the Machine Learning for Language (ML²) group at NYU. We explore it by using the toolkit to download, train, and evaluate two BERT-based transformer models.

NOTE: I eventually ran into several problems with jiant while writing this notebook. Rather than discard the work I had already done, I point out where things change or fail.

In [7]:
import torch

from jiant.proj.main.export_model import export_model
import jiant.scripts.download_data.runscript as downloader
import jiant.proj.simple.runscript as simple_run

import os

## Working with jiant

jiant focuses on supporting model training and benchmarking for research and development. It allows integration with many state-of-the-art (or past SOTA) models, and it provides access to many common natural language datasets.

jiant works by having the user download a pre-trained model and the data associated with some target task. By interfacing with [Hugging Face](https://huggingface.co/) -- which hosts transformer models and NLP datasets -- jiant has access to hundreds of the most relevant and important such models and tasks. jiant then helps the user fine-tune the chosen model on the new task and evaluate the results.

### Models

We will examine two BERT-based models: ALBERT and DistilBERT. In both cases we will use the base version. Each of these models was created in an attempt to reduce the size and increase the speed of the original BERT transformer.

ALBERT approaches this by factorizing the word embeddings, by switching the training objective from next-sentence prediction to sentence-order prediction, and most importantly by sharing parameters across layers. DistilBERT, on the other hand, is simply a distilled version of the original BERT base model. Distillation is well understood: a student model is trained to reproduce the outputs of a teacher model. In this case the teacher is BERT and the student is a reduced-size version of it.
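
To make the teacher/student idea concrete, here is a minimal sketch of a soft-target distillation loss in plain Python. It is an illustration only: the function names, the temperature value, and the toy logits are assumptions, and it covers just the term that pulls the student's output distribution toward the teacher's, not the full recipe used to train DistilBERT.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy logits (made up): one student agrees with the teacher, one does not.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]
far_student = [0.5, 1.0, 4.0]

loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
print(f"close student: {loss_close:.4f}, far student: {loss_far:.4f}")
```

The loss is zero when the student matches the teacher exactly and grows as their predictions diverge, which is the signal the student is trained on.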

The models have the following important attributes:
- ALBERT
    - Depth -> 12
    - Embedding Dim -> 128
    - Parameters -> 11M
- DistilBERT
    - Depth -> 6
    - Embedding Dim -> 768
    - Parameters -> 66M

I think the most important thing to note from the numbers above is that ALBERT is deeper but has fewer parameters, while DistilBERT is shallower with more parameters. We are looking at a trade-off between computational complexity and memory footprint. I expect DistilBERT to run faster and ALBERT to perform better.
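
The numbers above can be roughly reproduced with some back-of-the-envelope arithmetic. The sketch below assumes a 30,000-token vocabulary, a hidden size of 768, and a feed-forward size of 3072; these sizes, and the simplified layer formula, are assumptions for illustration rather than values read from the downloaded models.

```python
# Assumed sizes: 30,000-token vocabulary, hidden size 768, feed-forward size 3072.
VOCAB, HIDDEN, EMBED, FFN = 30_000, 768, 128, 3072

# Embedding table: one V x H matrix versus a factorized V x E plus E x H.
full_embedding = VOCAB * HIDDEN
factorized_embedding = VOCAB * EMBED + EMBED * HIDDEN

def layer_params(hidden, ffn):
    """Weights and biases of one standard transformer encoder layer."""
    attention = 4 * (hidden * hidden + hidden)  # query, key, value, output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    layer_norms = 2 * (2 * hidden)  # two LayerNorms, each with weight and bias
    return attention + feed_forward + layer_norms

per_layer = layer_params(HIDDEN, FFN)
unshared_12 = 12 * per_layer  # 12 distinct layers (BERT-style)
shared_12 = per_layer         # one layer's weights reused 12 times (ALBERT-style)
distinct_6 = 6 * per_layer    # 6 distinct layers (DistilBERT-style)

print(f"embedding: full={full_embedding:,} factorized={factorized_embedding:,}")
print(f"encoder: 12 unshared={unshared_12:,} 12 shared={shared_12:,} 6 distinct={distinct_6:,}")
print(f"rough ALBERT total: {factorized_embedding + shared_12:,}")
print(f"rough DistilBERT total: {full_embedding + distinct_6:,}")
```

Under these assumptions the rough totals come out near the 11M and 66M listed above. Note that sharing weights shrinks the memory footprint but not the work per forward pass: the shared layer is still applied 12 times, which is why the shallower DistilBERT is the one expected to run faster.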

In the code below we download our two models to the "core/models" directory. Running it may print a warning, but this only tells the user that the ALBERT model still needs to be trained on a downstream task.

In [8]:
# Download both models to the "core/models" directory
m1 = 'albert-base-v2'
m2 = 'distilbert-base-uncased'

export_model(m1, './core/models/albert')
export_model(m2, './core/models/distilbert')

jiant attempts to be very modular by treating a model as a combination of an encoder and a task-specific head. The encoder is a pre-trained language model such as those we have imported above. The task-specific head depends on the task and dataset being considered -- with multiple heads possible for multi-task learning. We consider our downstream task next.
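
The encoder-plus-head split can be pictured with a toy example. Everything in this sketch is hypothetical (the class, the placeholder encoder, and the two heads) and it is not how jiant implements the idea; it only shows one shared encoder feeding several task-specific heads.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Vector = List[float]

@dataclass
class ToyModel:
    """Hypothetical stand-in for an encoder shared by several task heads."""
    encoder: Callable[[str], Vector]
    heads: Dict[str, Callable[[Vector], float]] = field(default_factory=dict)

    def forward(self, task: str, text: str) -> float:
        features = self.encoder(text)      # shared, task-independent representation
        return self.heads[task](features)  # task-specific output

def toy_encoder(text: str) -> Vector:
    """Placeholder features: character count and word count."""
    return [float(len(text)), float(len(text.split()))]

model = ToyModel(encoder=toy_encoder)
model.heads["length"] = lambda f: f[0]  # head for a made-up "length" task
model.heads["words"] = lambda f: f[1]   # head for a made-up "words" task

print(model.forward("length", "jiant is modular"))
print(model.forward("words", "jiant is modular"))
```

Swapping the encoder changes the representation for every task at once, while adding a head adds a task without touching the encoder.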

### Tasks

jiant considers a task to be the combination of three things: raw data I/O, raw data transformations, and an evaluation scheme.

The downstream task we will consider is the Stanford Question Answering Dataset version 2 ([SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/)). This dataset has 100,000 examples, each consisting of a question about a Wikipedia excerpt and an answer drawn from within that excerpt. In addition, 50,000 adversarial questions are included that should have no answer.

Or so I thought. As far as I could tell, only a train split of SQuAD was available for use through jiant. While this would be fine if we were more in control of the data ourselves (i.e. could craft our own splits), it causes problems down the road. Specifically, jiant expects train, validation, and test splits, so without them things break.
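
For reference, crafting our own splits from a single pool of examples would look roughly like the sketch below. The function name and the 80/10/10 proportions are assumptions, and nothing here is wired into jiant; it only shows the shuffle-then-slice idea.

```python
import random

def split_examples(examples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle once with a fixed seed, then cut into train/val/test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return {"train": train, "val": val, "test": test}

# Integers stand in for real examples here.
splits = split_examples(range(1000))
print({name: len(items) for name, items in splits.items()})
```

For a dataset like SQuAD, where several questions share one excerpt, it would be safer to split by excerpt rather than by question so that the same passage does not appear in both train and test.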

Thus the downstream task we will consider instead is the Large-scale ReAding Comprehension Dataset From Examinations (RACE; people will do anything for an acronym). This dataset is similar to SQuAD in that it is a question-and-answer task. It was compiled from English examinations in China and contains roughly 28,000 excerpts with 100,000 questions. This seems like a perfect dataset for training NLP models, as the source it was compiled from was created to test NLP models, albeit ones in the form of highly advanced biological beings (i.e. humans).

In the code below we download this dataset to the "core/tasks" directory.

In [9]:
# Download the RACE dataset to the "core/tasks" directory
task = 'race'

downloader.download_data([task], './core/tasks')

### Training 

We have our models and our target task, so the next step is training. We do exactly this using jiant's "simple" API to define a configuration and run the training loop.

In [6]:
args = simple_run.RunConfiguration(
    run_name=f'{m1}_{task}',
    exp_dir='core',
    data_dir='core/tasks',
    hf_pretrained_model_name_or_path=m1,
    tasks=task,
    train_batch_size=2,
    num_train_epochs=1,
    train_examples_cap=1000
)

In [7]:
simple_run.run_simple(args)

Running from start
  jiant_task_container_config_path: core/run_configs/albert-base-v2_race_config.json
  output_dir: core/runs/albert-base-v2_race
  hf_pretrained_model_name_or_path: albert-base-v2
  model_path: core/models/albert/model/model.p
  model_config_path: core/models/albert/model/config.json
  model_load_mode: from_transformers
  do_train: True
  do_val: True
  do_save: False
  do_save_last: False
  do_save_best: False
  write_val_preds: False
  write_test_preds: False
  eval_every_steps: 0
  save_every_steps: 0
  save_checkpoint_every_steps: 0
  no_improvements_for_n_evals: 0
  keep_checkpoint_when_done: False
  force_overwrite: False
  seed: -1
  learning_rate: 1e-05
  adam_epsilon: 1e-08
  max_grad_norm: 1.0
  optimizer_type: adam
  no_cuda: False
  fp16: False
  fp16_opt_level: O1
  local_rank: -1
  server_ip: 
  server_port: 
device: cuda n_gpu: 1, distributed training: False, 16-bits training: False
Using seed: 478504965
{
  "jiant_task_container_config_path": "core/ru



Loading Best




{
  "aggregated": 0.4065889093513403,
  "race": {
    "loss": 1.3215613560522441,
    "metrics": {
      "major": 0.4065889093513403,
      "minor": {
        "acc": 0.4065889093513403
      }
    }
  }
}


To keep the training time feasible on my machine I set the batch size to 2 and capped the training examples at 1,000. This obviously limits what the network can achieve, but it is what I can do. In the output above, the "aggregated" result at the end is the final accuracy. Our model managed just above 40%, compared to the 64% reported by the ALBERT paper. This isn't too bad given the limited training: it is well above the 25% expected from guessing among RACE's four answer options, and it is certainly only possible because of the pre-trained language model.
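
As a quick sanity check on these numbers, the sketch below works out how many optimizer steps the configuration implies and pulls the accuracy out of the results JSON printed above. The variable names are mine, and the chance level assumes four answer options per RACE question.

```python
import json
import math

# Values copied from the run configuration above.
train_examples_cap = 1000
train_batch_size = 2
num_train_epochs = 1
steps = math.ceil(train_examples_cap / train_batch_size) * num_train_epochs

# The results block printed at the end of the run, pasted as a string.
results = json.loads("""
{
  "aggregated": 0.4065889093513403,
  "race": {
    "loss": 1.3215613560522441,
    "metrics": {
      "major": 0.4065889093513403,
      "minor": {"acc": 0.4065889093513403}
    }
  }
}
""")
accuracy = results["race"]["metrics"]["minor"]["acc"]

chance = 1 / 4   # assumes four answer options per question
reported = 0.64  # the figure quoted in the text above

print(f"training steps: {steps}")
print(f"accuracy: {accuracy:.1%} (chance {chance:.0%}, reported {reported:.0%})")
print(f"gap to reported result: {reported - accuracy:.1%}")
```

With a single task, the "aggregated" score and the task's "major" metric are the same number, which is why it appears three times in the output.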

Next, we encounter our second problem. jiant had seemed to be a capable toolkit, but the more I worked with it the more issues I encountered. This came to a head when trying to train our second model, DistilBERT. Even though jiant claims to support any transformer model that Hugging Face supports, it does not support this one. It is possible that the issue is easily fixed on the software side, but I am not so interested in doing that. When the model is loaded in the second code cell below, we hit a RuntimeError about missing state_dict keys: the weights are being loaded into a BertModel, and none of the parameter names it expects are found. This cannot obviously be fixed on the user side. I even tried editing the jiant source code locally, but I couldn't find a clear fix.

In [10]:
args = simple_run.RunConfiguration(
    run_name=f'{m2}_{task}',
    exp_dir='core',
    data_dir='core/tasks',
    hf_pretrained_model_name_or_path=m2,
    tasks=task,
    train_batch_size=2,
    num_train_epochs=1,
    train_examples_cap=1000
)

In [11]:
simple_run.run_simple(args)

Running from start
  jiant_task_container_config_path: core/run_configs/distilbert-base-uncased_race_config.json
  output_dir: core/runs/distilbert-base-uncased_race
  hf_pretrained_model_name_or_path: distilbert-base-uncased
  model_path: core/models/distilbert/model/model.p
  model_config_path: core/models/distilbert/model/config.json
  model_load_mode: from_transformers
  do_train: True
  do_val: True
  do_save: False
  do_save_last: False
  do_save_best: False
  write_val_preds: False
  write_test_preds: False
  eval_every_steps: 0
  save_every_steps: 0
  save_checkpoint_every_steps: 0
  no_improvements_for_n_evals: 0
  keep_checkpoint_when_done: False
  force_overwrite: False
  seed: -1
  learning_rate: 1e-05
  adam_epsilon: 1e-08
  max_grad_norm: 1.0
  optimizer_type: adam
  no_cuda: False
  fp16: False
  fp16_opt_level: O1
  local_rank: -1
  server_ip: 
  server_port: 
device: cuda n_gpu: 1, distributed training: False, 16-bits training: False
Using seed: 1754488804
{
  "jiant_t

RuntimeError: Error(s) in loading state_dict for BertModel:
	Missing key(s) in state_dict: "embeddings.position_ids", "embeddings.word_embeddings.weight", "embeddings.position_embeddings.weight", "embeddings.token_type_embeddings.weight", "embeddings.LayerNorm.weight", "embeddings.LayerNorm.bias", "encoder.layer.0.attention.self.query.weight", "encoder.layer.0.attention.self.query.bias", "encoder.layer.0.attention.self.key.weight", "encoder.layer.0.attention.self.key.bias", "encoder.layer.0.attention.self.value.weight", "encoder.layer.0.attention.self.value.bias", "encoder.layer.0.attention.output.dense.weight", "encoder.layer.0.attention.output.dense.bias", "encoder.layer.0.attention.output.LayerNorm.weight", "encoder.layer.0.attention.output.LayerNorm.bias", "encoder.layer.0.intermediate.dense.weight", "encoder.layer.0.intermediate.dense.bias", "encoder.layer.0.output.dense.weight", "encoder.layer.0.output.dense.bias", "encoder.layer.0.output.LayerNorm.weight", "encoder.layer.0.output.LayerNorm.bias", "encoder.layer.1.attention.self.query.weight", "encoder.layer.1.attention.self.query.bias", "encoder.layer.1.attention.self.key.weight", "encoder.layer.1.attention.self.key.bias", "encoder.layer.1.attention.self.value.weight", "encoder.layer.1.attention.self.value.bias", "encoder.layer.1.attention.output.dense.weight", "encoder.layer.1.attention.output.dense.bias", "encoder.layer.1.attention.output.LayerNorm.weight", "encoder.layer.1.attention.output.LayerNorm.bias", "encoder.layer.1.intermediate.dense.weight", "encoder.layer.1.intermediate.dense.bias", "encoder.layer.1.output.dense.weight", "encoder.layer.1.output.dense.bias", "encoder.layer.1.output.LayerNorm.weight", "encoder.layer.1.output.LayerNorm.bias", "encoder.layer.2.attention.self.query.weight", "encoder.layer.2.attention.self.query.bias", "encoder.layer.2.attention.self.key.weight", "encoder.layer.2.attention.self.key.bias", "encoder.layer.2.attention.self.value.weight", "encoder.layer.2.attention.self.value.bias", "encoder.layer.2.attention.output.dense.weight", 
"encoder.layer.2.attention.output.dense.bias", "encoder.layer.2.attention.output.LayerNorm.weight", "encoder.layer.2.attention.output.LayerNorm.bias", "encoder.layer.2.intermediate.dense.weight", "encoder.layer.2.intermediate.dense.bias", "encoder.layer.2.output.dense.weight", "encoder.layer.2.output.dense.bias", "encoder.layer.2.output.LayerNorm.weight", "encoder.layer.2.output.LayerNorm.bias", "encoder.layer.3.attention.self.query.weight", "encoder.layer.3.attention.self.query.bias", "encoder.layer.3.attention.self.key.weight", "encoder.layer.3.attention.self.key.bias", "encoder.layer.3.attention.self.value.weight", "encoder.layer.3.attention.self.value.bias", "encoder.layer.3.attention.output.dense.weight", "encoder.layer.3.attention.output.dense.bias", "encoder.layer.3.attention.output.LayerNorm.weight", "encoder.layer.3.attention.output.LayerNorm.bias", "encoder.layer.3.intermediate.dense.weight", "encoder.layer.3.intermediate.dense.bias", "encoder.layer.3.output.dense.weight", "encoder.layer.3.output.dense.bias", "encoder.layer.3.output.LayerNorm.weight", "encoder.layer.3.output.LayerNorm.bias", "encoder.layer.4.attention.self.query.weight", "encoder.layer.4.attention.self.query.bias", "encoder.layer.4.attention.self.key.weight", "encoder.layer.4.attention.self.key.bias", "encoder.layer.4.attention.self.value.weight", "encoder.layer.4.attention.self.value.bias", "encoder.layer.4.attention.output.dense.weight", "encoder.layer.4.attention.output.dense.bias", "encoder.layer.4.attention.output.LayerNorm.weight", "encoder.layer.4.attention.output.LayerNorm.bias", "encoder.layer.4.intermediate.dense.weight", "encoder.layer.4.intermediate.dense.bias", "encoder.layer.4.output.dense.weight", "encoder.layer.4.output.dense.bias", "encoder.layer.4.output.LayerNorm.weight", "encoder.layer.4.output.LayerNorm.bias", "encoder.layer.5.attention.self.query.weight", "encoder.layer.5.attention.self.query.bias", "encoder.layer.5.attention.self.key.weight", 
"encoder.layer.5.attention.self.key.bias", "encoder.layer.5.attention.self.value.weight", "encoder.layer.5.attention.self.value.bias", "encoder.layer.5.attention.output.dense.weight", "encoder.layer.5.attention.output.dense.bias", "encoder.layer.5.attention.output.LayerNorm.weight", "encoder.layer.5.attention.output.LayerNorm.bias", "encoder.layer.5.intermediate.dense.weight", "encoder.layer.5.intermediate.dense.bias", "encoder.layer.5.output.dense.weight", "encoder.layer.5.output.dense.bias", "encoder.layer.5.output.LayerNorm.weight", "encoder.layer.5.output.LayerNorm.bias", "encoder.layer.6.attention.self.query.weight", "encoder.layer.6.attention.self.query.bias", "encoder.layer.6.attention.self.key.weight", "encoder.layer.6.attention.self.key.bias", "encoder.layer.6.attention.self.value.weight", "encoder.layer.6.attention.self.value.bias", "encoder.layer.6.attention.output.dense.weight", "encoder.layer.6.attention.output.dense.bias", "encoder.layer.6.attention.output.LayerNorm.weight", "encoder.layer.6.attention.output.LayerNorm.bias", "encoder.layer.6.intermediate.dense.weight", "encoder.layer.6.intermediate.dense.bias", "encoder.layer.6.output.dense.weight", "encoder.layer.6.output.dense.bias", "encoder.layer.6.output.LayerNorm.weight", "encoder.layer.6.output.LayerNorm.bias", "encoder.layer.7.attention.self.query.weight", "encoder.layer.7.attention.self.query.bias", "encoder.layer.7.attention.self.key.weight", "encoder.layer.7.attention.self.key.bias", "encoder.layer.7.attention.self.value.weight", "encoder.layer.7.attention.self.value.bias", "encoder.layer.7.attention.output.dense.weight", "encoder.layer.7.attention.output.dense.bias", "encoder.layer.7.attention.output.LayerNorm.weight", "encoder.layer.7.attention.output.LayerNorm.bias", "encoder.layer.7.intermediate.dense.weight", "encoder.layer.7.intermediate.dense.bias", "encoder.layer.7.output.dense.weight", "encoder.layer.7.output.dense.bias", "encoder.layer.7.output.LayerNorm.weight", 
"encoder.layer.7.output.LayerNorm.bias", "encoder.layer.8.attention.self.query.weight", "encoder.layer.8.attention.self.query.bias", "encoder.layer.8.attention.self.key.weight", "encoder.layer.8.attention.self.key.bias", "encoder.layer.8.attention.self.value.weight", "encoder.layer.8.attention.self.value.bias", "encoder.layer.8.attention.output.dense.weight", "encoder.layer.8.attention.output.dense.bias", "encoder.layer.8.attention.output.LayerNorm.weight", "encoder.layer.8.attention.output.LayerNorm.bias", "encoder.layer.8.intermediate.dense.weight", "encoder.layer.8.intermediate.dense.bias", "encoder.layer.8.output.dense.weight", "encoder.layer.8.output.dense.bias", "encoder.layer.8.output.LayerNorm.weight", "encoder.layer.8.output.LayerNorm.bias", "encoder.layer.9.attention.self.query.weight", "encoder.layer.9.attention.self.query.bias", "encoder.layer.9.attention.self.key.weight", "encoder.layer.9.attention.self.key.bias", "encoder.layer.9.attention.self.value.weight", "encoder.layer.9.attention.self.value.bias", "encoder.layer.9.attention.output.dense.weight", "encoder.layer.9.attention.output.dense.bias", "encoder.layer.9.attention.output.LayerNorm.weight", "encoder.layer.9.attention.output.LayerNorm.bias", "encoder.layer.9.intermediate.dense.weight", "encoder.layer.9.intermediate.dense.bias", "encoder.layer.9.output.dense.weight", "encoder.layer.9.output.dense.bias", "encoder.layer.9.output.LayerNorm.weight", "encoder.layer.9.output.LayerNorm.bias", "encoder.layer.10.attention.self.query.weight", "encoder.layer.10.attention.self.query.bias", "encoder.layer.10.attention.self.key.weight", "encoder.layer.10.attention.self.key.bias", "encoder.layer.10.attention.self.value.weight", "encoder.layer.10.attention.self.value.bias", "encoder.layer.10.attention.output.dense.weight", "encoder.layer.10.attention.output.dense.bias", "encoder.layer.10.attention.output.LayerNorm.weight", "encoder.layer.10.attention.output.LayerNorm.bias", 
"encoder.layer.10.intermediate.dense.weight", "encoder.layer.10.intermediate.dense.bias", "encoder.layer.10.output.dense.weight", "encoder.layer.10.output.dense.bias", "encoder.layer.10.output.LayerNorm.weight", "encoder.layer.10.output.LayerNorm.bias", "encoder.layer.11.attention.self.query.weight", "encoder.layer.11.attention.self.query.bias", "encoder.layer.11.attention.self.key.weight", "encoder.layer.11.attention.self.key.bias", "encoder.layer.11.attention.self.value.weight", "encoder.layer.11.attention.self.value.bias", "encoder.layer.11.attention.output.dense.weight", "encoder.layer.11.attention.output.dense.bias", "encoder.layer.11.attention.output.LayerNorm.weight", "encoder.layer.11.attention.output.LayerNorm.bias", "encoder.layer.11.intermediate.dense.weight", "encoder.layer.11.intermediate.dense.bias", "encoder.layer.11.output.dense.weight", "encoder.layer.11.output.dense.bias", "encoder.layer.11.output.LayerNorm.weight", "encoder.layer.11.output.LayerNorm.bias", "pooler.dense.weight", "pooler.dense.bias". 

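An error like the one above boils down to a mismatch between two sets of parameter names. The sketch below shows that comparison in plain Python. The helper name is hypothetical, the BERT-style names are a few copied from the error, and the checkpoint-side names are illustrative guesses only, since the actual keys in the DistilBERT checkpoint were not captured in the output.

```python
def diff_state_dict_keys(expected_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists for a model/checkpoint pair."""
    expected, found = set(expected_keys), set(checkpoint_keys)
    missing = sorted(expected - found)     # the model wants these, the checkpoint lacks them
    unexpected = sorted(found - expected)  # the checkpoint has these, the model does not
    return missing, unexpected

# A few of the names from the error above (what BertModel expects).
bert_style = [
    "embeddings.word_embeddings.weight",
    "encoder.layer.0.attention.self.query.weight",
    "pooler.dense.weight",
]
# Illustrative checkpoint names only; the real keys were not captured.
checkpoint_style = [
    "distilbert.embeddings.word_embeddings.weight",
    "distilbert.transformer.layer.0.attention.q_lin.weight",
]

missing, unexpected = diff_state_dict_keys(bert_style, checkpoint_style)
print("missing:", missing)
print("unexpected:", unexpected)
```

With real objects one would pass the keys of the model's state_dict and the keys of the loaded checkpoint dictionary. When every expected key is missing, as in the error above, the problem is usually a naming or architecture mismatch rather than a corrupted file.
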
## Conclusion

The goal of this investigation was to explore an established package for NLP research that speeds up some of the more mundane processes associated with deep learning. I was also interested in comparing two similar BERT variants relevant to class. In the end this was mostly a waste of time. I certainly learned some things, but mostly I learned to be smarter when working with third-party packages like jiant.