In [1]:
from nemo.collections import nlp as nemo_nlp
from nemo.utils.exp_manager import exp_manager

import os
import wget 
import torch
import pytorch_lightning as pl
from omegaconf import OmegaConf

[NeMo W 2020-10-22 20:22:45 experimental:28] Module <class 'nemo.collections.nlp.modules.common.huggingface.auto.AutoModelEncoder'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2020-10-22 20:22:45 experimental:28] Module <class 'nemo.collections.nlp.modules.common.megatron.megatron_bert.MegatronBertEncoder'> is experimental, not ready for production and is not fully supported. Use at your own risk.




In this tutorial, we describe how to finetune a BERT-like model, based on [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805), on the tasks from [GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding](https://openreview.net/pdf?id=rJ4km2R5t7).

# GLUE tasks
The GLUE benchmark includes 9 natural language understanding tasks:

## Single-Sentence Tasks

* CoLA - [The Corpus of Linguistic Acceptability](https://arxiv.org/abs/1805.12471) is a set of English sentences from published linguistics literature. The task is to predict whether a given sentence is grammatically correct or not.
* SST-2 - [The Stanford Sentiment Treebank](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf) consists of sentences from movie reviews and human annotations of their sentiment. The task is to predict the sentiment of a given sentence: positive or negative.

## Similarity and Paraphrase tasks

* MRPC - [The Microsoft Research Paraphrase Corpus](https://www.aclweb.org/anthology/I05-5002.pdf) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent.
* QQP - [The Quora Question Pairs](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) dataset is a collection of question pairs from the community question-answering website Quora. The task is to determine whether a pair of questions are semantically equivalent.
* STS-B - [The Semantic Textual Similarity Benchmark](https://arxiv.org/abs/1708.00055) is a collection of sentence pairs drawn from news headlines, video, and image captions, and natural language inference data. The task is to determine how similar two sentences are.

## Inference Tasks

* MNLI - [The Multi-Genre Natural Language Inference Corpus](https://cims.nyu.edu/~sbowman/multinli/multinli_0.9.pdf) is a crowdsourced collection of sentence pairs with textual entailment annotations. Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis (entailment), contradicts the hypothesis (contradiction), or neither (neutral). The task has the matched (in-domain) and mismatched (cross-domain) sections.
* QNLI - [The Stanford Question Answering Dataset](https://nlp.stanford.edu/pubs/rajpurkar2016squad.pdf) is a question-answering dataset consisting of question-paragraph pairs, where one of the sentences in the paragraph (drawn from Wikipedia) contains the answer to the corresponding question. The task is to determine whether the context sentence contains the answer to the question.
* RTE - The Recognizing Textual Entailment (RTE) datasets come from a series of annual [textual entailment challenges](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment). The task is to determine whether the second sentence is entailed by the first one or not.
* WNLI - The Winograd Schema Challenge is a reading comprehension task in which a system must read a sentence with a pronoun and select the referent of that pronoun from a list of choices (Hector Levesque, Ernest Davis, and Leora Morgenstern. The winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning. 2012).

All tasks are classification tasks, except for the STS-B task, which is a regression task. All classification tasks are 2-class problems, except for the MNLI task, which has 3 classes.

More details about the GLUE benchmark can be found [here](https://gluebenchmark.com/).
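
The task summary above can be captured in a small lookup table. The sketch below is illustrative only: `GLUE_TASKS` and `num_outputs` are hypothetical names introduced here, not part of NeMo, and the lowercase keys simply mirror the task names used by the config in this tutorial.

```python
# Hypothetical lookup table summarizing the GLUE tasks described above.
# Each entry is (task type, number of classes); STS-B is a regression task,
# so it has no classes.
GLUE_TASKS = {
    "cola": ("classification", 2),
    "sst-2": ("classification", 2),
    "mrpc": ("classification", 2),
    "qqp": ("classification", 2),
    "sts-b": ("regression", None),
    "mnli": ("classification", 3),
    "qnli": ("classification", 2),
    "rte": ("classification", 2),
    "wnli": ("classification", 2),
}


def num_outputs(task_name):
    """Return the size of the output layer a model would need for a task."""
    task_type, num_classes = GLUE_TASKS[task_name.lower()]
    # A regression head predicts a single similarity score.
    return 1 if task_type == "regression" else num_classes


print(num_outputs("mnli"))   # 3
print(num_outputs("sts-b"))  # 1
```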

# Datasets

**To proceed further, you need to download the GLUE data.** For example, you can download [this script](https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py) using `wget` and then execute it by running:

`python download_glue_data.py`

Use `--tasks TASK` if you only need the datasets for selected GLUE tasks.

After running the above commands, you will have a folder `glue_data` with data folders for every GLUE task. For example, data for the MRPC task would be under `glue_data/MRPC`.
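
As a rough sketch, the download step can also be driven from Python. The helper `build_download_command` is a hypothetical name introduced here; it only assembles the command shown above with the optional `--tasks` flag, and it assumes the script has already been saved as `download_glue_data.py` in the current directory.

```python
import sys


def build_download_command(task=None, script="download_glue_data.py"):
    """Assemble the command line for the GLUE download script."""
    command = [sys.executable, script]
    if task is not None:
        # Restrict the download to a single task, e.g. "MRPC".
        command += ["--tasks", task]
    return command


print(build_download_command("MRPC")[1:])
# ['download_glue_data.py', '--tasks', 'MRPC']

# To actually run the download (requires the script and network access):
# import subprocess
# subprocess.run(build_download_command("MRPC"), check=True)
```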

This tutorial and [examples/nlp/glue_benchmark/glue_benchmark.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/glue_benchmark/glue_benchmark.py) work with all GLUE tasks without any modifications. For this tutorial, we are going to use the SST-2 task.





In [2]:
# supported task names: ["cola", "sst-2", "mrpc", "sts-b", "qqp", "mnli", "qnli", "rte", "wnli"]
TASK = 'sst-2'
DATA_DIR = 'DATA_DIR/glue_data/SST-2'
WORK_DIR = "WORK_DIR"
MODEL_CONFIG = 'glue_benchmark_config.yaml'

In [3]:
! ls -l $DATA_DIR

total 4003
-rwxrwxrwx 1 root root   94931 Oct 22 20:17 dev.tsv
drwxrwxrwx 2 root root       0 Oct 22 20:17 original
-rwxrwxrwx 1 root root  197335 Oct 22 20:17 test.tsv
-rwxrwxrwx 1 root root 3806081 Oct 22 20:17 train.tsv


For each task, there are 3 files: `train.tsv`, `dev.tsv`, and `test.tsv`. Note that MNLI has 2 dev sets, matched and mismatched; evaluation on both dev sets is done automatically.
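
Before training, it can help to confirm that the expected files are actually in place. The following is a minimal sketch, not part of NeMo: `missing_glue_files` is a hypothetical helper, and it only covers tasks that follow the three-file layout described above (MNLI, with its separate matched and mismatched sets, is out of scope here).

```python
import os

EXPECTED_FILES = ("train.tsv", "dev.tsv", "test.tsv")


def missing_glue_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from data_dir."""
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(data_dir, name))
    ]


# An empty list means that all three files were found.
print(missing_glue_files("glue_data/SST-2"))
```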

In [4]:
# let's take a look at the training data 
! head -n 5 {DATA_DIR}/train.tsv

sentence	label
hide new secretions from the parental units 	0
contains no wit , only labored gags 	0
that loves its characters and communicates something rather beautiful about human nature 	1
remains utterly satisfied to remain the same throughout 	0
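
The sample above shows a tab-separated file with a header row and a `label` column. As an illustration, the sketch below counts how often each label occurs; `count_labels` is a hypothetical helper, and the column name is an assumption that holds for the SST-2 sample shown here but not necessarily for other GLUE tasks, which use different column layouts.

```python
import csv
from collections import Counter


def count_labels(tsv_path, label_column="label"):
    """Count how often each label occurs in a TSV file with a header row."""
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps quotation marks inside sentences as plain text.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return Counter(row[label_column] for row in reader)


# Example usage (assumes the data has been downloaded):
# print(count_labels("glue_data/SST-2/train.tsv"))
```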


# Model configuration

Now, let's take a closer look at the model's configuration and learn to train the model.

The GLUE model consists of the pretrained [BERT](https://arxiv.org/pdf/1810.04805.pdf) model followed by a sequence regression module (for the STS-B task) or a sequence classifier module (for the rest of the tasks).

The model is defined in a config file which declares several important sections, including:
- **model**: All arguments related to the model: the language model, the classifier, the optimizer and schedulers, the datasets, and any other related information

- **trainer**: Any argument to be passed to PyTorch Lightning

In [5]:
# download the model's configuration file 
config_dir = WORK_DIR + '/configs/'
os.makedirs(config_dir, exist_ok=True)
if not os.path.exists(config_dir + MODEL_CONFIG):
    print('Downloading config file...')
    wget.download('https://raw.githubusercontent.com/NVIDIA/NeMo/main/examples/nlp/glue_benchmark/' + MODEL_CONFIG, config_dir)
else:
    print('Config file already exists')

Downloading config file...


In [6]:
# this line will print the entire config of the model
config_path = f'{WORK_DIR}/configs/{MODEL_CONFIG}'
print(config_path)
config = OmegaConf.load(config_path)
print(OmegaConf.to_yaml(config))

WORK_DIR/configs/glue_benchmark_config.yaml
supported_tasks:
- cola
- sst-2
- mrpc
- sts-b
- qqp
- mnli
- qnli
- rte
- wnli
trainer:
  gpus: 1
  num_nodes: 1
  max_epochs: 3
  max_steps: null
  accumulate_grad_batches: 1
  amp_level: O0
  precision: 16
  accelerator: ddp
  checkpoint_callback: false
  logger: false
model:
  task_name: mrpc
  supported_tasks:
  - cola
  - sst-2
  - mrpc
  - sts-b
  - qqp
  - mnli
  - qnli
  - rte
  - wnli
  output_dir: null
  nemo_path: null
  dataset:
    data_dir: ???
    max_seq_length: 128
    use_cache: true
    num_workers: 2
    pin_memory: false
    drop_last: false
  train_ds:
    file_name: train.tsv
    shuffle: true
    num_samples: -1
    batch_size: 32
  validation_ds:
    file_name: dev.tsv
    shuffle: false
    num_samples: -1
    batch_size: 32
  tokenizer:
    tokenizer_name: ${model.language_model.pretrained_model_name}
    vocab_file: null
    tokenizer_model: null
    special_tokens: null
  language_model:
    pretrained_model_name

# Model Training
## Setting up Data within the config

Among other things, the config file contains dictionaries called **dataset**, **train_ds** and **validation_ds**. These are the configurations used to set up the Dataset and DataLoaders for the corresponding splits.

We assume that both training and evaluation files are located in the same directory, and use the default names mentioned during the data download step. 
So, to start model training, we simply need to specify `model.dataset.data_dir`, like we are going to do below.

Also notice that some config lines, including `model.dataset.data_dir`, have `???` in place of paths. This means that the user is required to specify values for these fields.
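
To make the role of the `???` marker concrete, here is a plain-Python sketch that walks a nested dictionary and lists every field still set to `???`. It is an illustration only: `find_missing` and `config_sketch` are hypothetical names, and OmegaConf performs its own check, raising an error when such a value is read.

```python
MISSING = "???"


def find_missing(config, prefix=""):
    """Return dotted paths of all entries whose value is the '???' marker."""
    missing = []
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested sections such as model.dataset.
            missing.extend(find_missing(value, path))
        elif value == MISSING:
            missing.append(path)
    return missing


# A trimmed-down copy of the config printed above, as a plain dict.
config_sketch = {
    "model": {
        "task_name": "mrpc",
        "dataset": {"data_dir": "???", "max_seq_length": 128},
        "train_ds": {"file_name": "train.tsv"},
    }
}

print(find_missing(config_sketch))  # ['model.dataset.data_dir']
```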

Let's now add the data directory path, task name and output directory for saving predictions to the config.

In [7]:
config.model.task_name = TASK
config.model.output_dir = WORK_DIR
config.model.dataset.data_dir = DATA_DIR

## Building the PyTorch Lightning Trainer

NeMo models are primarily PyTorch Lightning modules - and therefore are entirely compatible with the PyTorch Lightning ecosystem.

Let's first instantiate a Trainer object.

In [8]:
print("Trainer config - \n")
print(OmegaConf.to_yaml(config.trainer))

Trainer config - 

gpus: 1
num_nodes: 1
max_epochs: 3
max_steps: null
accumulate_grad_batches: 1
amp_level: O0
precision: 16
accelerator: ddp
checkpoint_callback: false
logger: false



In [9]:
# let's modify some trainer configs
# checks if we have GPU available and uses it
cuda = 1 if torch.cuda.is_available() else 0
config.trainer.gpus = cuda

config.trainer.precision = 16 if torch.cuda.is_available() else 32
# config.trainer.precision = 32

config.model.dataset.num_workers=4 
# config.model.optim.lr=4.232e-04

config.model.train_ds.batch_size=128

# for mixed precision training, uncomment the line below (precision should be set to 16 and amp_level to O1):
# config.trainer.amp_level = 'O1'

# remove distributed training flags
config.trainer.distributed_backend = None

# setup max number of steps to reduce training time for demonstration purposes of this tutorial
# config.trainer.max_steps = 128

# setup max number of epochs 
config.trainer.max_epochs = 1

# does not save checkpoints (faster training iterations without saves) 
config.exp_manager.create_checkpoint_callback=False

trainer = pl.Trainer(**config.trainer)

GPU available: True, used: True
INFO - GPU available: True, used: True
TPU available: False, using: 0 TPU cores
INFO - TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO - LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Using native 16bit precision.
INFO - Using native 16bit precision.


## Setting up a NeMo Experiment

NeMo has an experiment manager that handles logging and checkpointing for us, so let's use it:

In [10]:
exp_dir = exp_manager(trainer, config.get("exp_manager", None))

# the exp_dir provides a path to the current experiment for easy access
exp_dir = str(exp_dir)
exp_dir

[NeMo I 2020-10-22 20:22:47 exp_manager:169] Experiments will be logged at /mnt/batch/tasks/shared/LS_root/mounts/clusters/vac20amlofgpuv100-go3/code/Users/Johnathon.Stringer/AzureML-NeMo/nemo_experiments/mrpc/2020-10-22_20-22-47
[NeMo I 2020-10-22 20:22:47 exp_manager:503] TensorboardLogger has been set up


'/mnt/batch/tasks/shared/LS_root/mounts/clusters/vac20amlofgpuv100-go3/code/Users/Johnathon.Stringer/AzureML-NeMo/nemo_experiments/mrpc/2020-10-22_20-22-47'

Before initializing the model, we might want to modify some of the model configs. For example, we might want to replace the pretrained BERT model with [Megatron-LM BERT](https://arxiv.org/abs/1909.08053) or an [ALBERT model](https://arxiv.org/abs/1909.11942):

In [11]:
# get the list of supported BERT-like models; for the complete list of HuggingFace models, see https://huggingface.co/models
print(nemo_nlp.modules.get_pretrained_lm_models_list(include_external=True))

# specify the BERT-like model you want to use, for example, "megatron-bert-345m-uncased" or 'bert-base-uncased'
PRETRAINED_BERT_MODEL = "albert-base-v1"

['megatron-bert-345m-uncased', 'megatron-bert-345m-cased', 'megatron-bert-uncased', 'megatron-bert-cased', 'biomegatron-bert-345m-uncased', 'biomegatron-bert-345m-cased', 'bert-base-uncased', 'bert-large-uncased', 'bert-base-cased', 'bert-large-cased', 'bert-base-multilingual-uncased', 'bert-base-multilingual-cased', 'bert-base-chinese', 'bert-base-german-cased', 'bert-large-uncased-whole-word-masking', 'bert-large-cased-whole-word-masking', 'bert-large-uncased-whole-word-masking-finetuned-squad', 'bert-large-cased-whole-word-masking-finetuned-squad', 'bert-base-cased-finetuned-mrpc', 'bert-base-german-dbmdz-cased', 'bert-base-german-dbmdz-uncased', 'cl-tohoku/bert-base-japanese', 'cl-tohoku/bert-base-japanese-whole-word-masking', 'cl-tohoku/bert-base-japanese-char', 'cl-tohoku/bert-base-japanese-char-whole-word-masking', 'TurkuNLP/bert-base-finnish-cased-v1', 'TurkuNLP/bert-base-finnish-uncased-v1', 'wietsedv/bert-base-dutch-cased', 'facebook/bart-base', 'facebook/bart-large', 'facebo

In [12]:
# add the model name specified above to the config
config.model.language_model.pretrained_model_name = PRETRAINED_BERT_MODEL

Now, we are ready to initialize our model. During the model initialization call, the dataset and data loaders will be prepared for training and evaluation.
The pretrained BERT model will also be downloaded. Note that this can take up to a few minutes, depending on the size of the chosen BERT model.

In [13]:
model = nemo_nlp.models.GLUEModel(cfg=config.model, trainer=trainer)

[NeMo I 2020-10-22 20:22:47 glue_benchmark_model:99] Using DATA_DIR/glue_data/SST-2/dev.tsv for model evaluation.


HBox(children=(HTML(value='Downloading'), FloatProgress(value=0.0, max=684.0), HTML(value='')))




HBox(children=(HTML(value='Downloading'), FloatProgress(value=0.0, max=760289.0), HTML(value='')))


[NeMo I 2020-10-22 20:22:48 glue_benchmark_dataset:109] Processing DATA_DIR/glue_data/SST-2/train.tsv
[NeMo I 2020-10-22 20:22:48 glue_benchmark_dataset:230] Writing example 0 of 67349
[NeMo I 2020-10-22 20:22:48 glue_benchmark_dataset:312] *** Example ***
[NeMo I 2020-10-22 20:22:48 glue_benchmark_dataset:313] guid: train-1
[NeMo I 2020-10-22 20:22:48 glue_benchmark_dataset:314] tokens: [CLS] ▁hide ▁new ▁secretion s ▁from ▁the ▁parental ▁units [SEP]
[NeMo I 2020-10-22 20:22:48 glue_benchmark_dataset:315] input_ids: 2 3077 78 27467 18 37 14 21207 1398 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[NeMo I 2020-10-22 20:22:48 glue_benchmark_dataset:316] input_mask: 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

HBox(children=(HTML(value='Downloading'), FloatProgress(value=0.0, max=47376396.0), HTML(value='')))


[NeMo I 2020-10-22 20:23:04 modelPT:583] Optimizer config = Adam (
    Parameter Group 0
        amsgrad: False
        betas: (0.9, 0.999)
        eps: 1e-08
        lr: 5e-05
        weight_decay: 0.0
    )
[NeMo I 2020-10-22 20:23:04 lr_scheduler:554] Scheduler "<nemo.core.optim.lr_scheduler.WarmupAnnealing object at 0x7f5b282596d8>" 
    will be used during training (effective maximum steps = 526) - 
    Parameters : 
    (warmup_steps: null
    warmup_ratio: 0.1
    last_epoch: -1
    max_steps: 526
    )
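
The log above reports a `WarmupAnnealing` scheduler with `warmup_ratio: 0.1`, `max_steps: 526` and a learning rate of `5e-05`. To give a feel for what such a schedule looks like, the sketch below implements a generic linear warmup followed by linear decay with those values. This is an assumption-laden illustration, not NeMo's `WarmupAnnealing` implementation, whose exact curve may differ; `warmup_linear_decay_lr` is a hypothetical name.

```python
def warmup_linear_decay_lr(step, base_lr=5e-5, max_steps=526, warmup_ratio=0.1):
    """Learning rate at a given step: linear warmup, then linear decay."""
    warmup_steps = int(warmup_ratio * max_steps)
    if step < warmup_steps:
        # Ramp up linearly towards base_lr during warmup.
        return base_lr * (step + 1) / warmup_steps
    # Decay linearly from base_lr down to zero at max_steps.
    remaining = max(max_steps - step, 0)
    return base_lr * remaining / (max_steps - warmup_steps)


for step in (0, 51, 52, 300, 526):
    print(step, warmup_linear_decay_lr(step))
```

With these settings the warmup phase covers the first 52 steps, after which the learning rate falls steadily until it reaches zero at the final step.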


## Monitoring training progress
Optionally, you can create a Tensorboard visualization to monitor training progress.

In [14]:
try:
  from google import colab
  COLAB_ENV = True
except (ImportError, ModuleNotFoundError):
  COLAB_ENV = False

# Load the TensorBoard notebook extension
if COLAB_ENV:
  %load_ext tensorboard
  %tensorboard --logdir {exp_dir}
else:
  print("To use tensorboard, please use this notebook in a Google Colab environment.")

To use tensorboard, please use this notebook in a Google Colab environment.


Note that it's recommended to finetune the model on each task separately. Also, based on [GLUE Benchmark FAQ#12](https://gluebenchmark.com/faq), there might be some differences in the dev/test distributions for the QQP task and in the train/dev distributions for the WNLI task.

In [15]:
# start model training
trainer.fit(model)

initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
INFO - initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1


[NeMo I 2020-10-22 20:23:07 modelPT:583] Optimizer config = Adam (
    Parameter Group 0
        amsgrad: False
        betas: (0.9, 0.999)
        eps: 1e-08
        lr: 5e-05
        weight_decay: 0.0
    )
[NeMo I 2020-10-22 20:23:07 lr_scheduler:554] Scheduler "<nemo.core.optim.lr_scheduler.WarmupAnnealing object at 0x7f5b2911b1d0>" 
    will be used during training (effective maximum steps = 526) - 
    Parameters : 
    (warmup_steps: null
    warmup_ratio: 0.1
    last_epoch: -1
    max_steps: 526
    )



  | Name       | Type               | Params
--------------------------------------------------
0 | bert_model | AlbertEncoder      | 11 M  
1 | pooler     | SequenceClassifier | 592 K 
2 | loss       | CrossEntropyLoss   | 0     


HBox(children=(HTML(value='Validation sanity check'), FloatProgress(value=1.0, bar_style='info', layout=Layout…

[NeMo I 2020-10-22 20:23:08 glue_benchmark_model:200] DEV_ evaluation: {'acc': 0.5}
[NeMo I 2020-10-22 20:23:08 glue_benchmark_model:207] Saving labels and predictions to WORK_DIR/sst-2.txt


    
    Please use self.log(...) inside the lightningModule instead.
    
    # log on a step or aggregate epoch metric to the logger and/or progress bar
    # (inside LightningModule)
    self.log('train_loss', loss, on_step=True, on_epoch=True, prog_bar=True)
    


HBox(children=(HTML(value='Training'), FloatProgress(value=1.0, bar_style='info', layout=Layout(flex='2'), max…

HBox(children=(HTML(value='Validating'), FloatProgress(value=1.0, bar_style='info', layout=Layout(flex='2'), m…

[NeMo I 2020-10-22 20:25:39 glue_benchmark_model:200] DEV_ evaluation: {'acc': 0.8818807339449541}
[NeMo I 2020-10-22 20:25:39 glue_benchmark_model:207] Saving labels and predictions to WORK_DIR/sst-2.txt



1

## Training Script

If you have NeMo installed locally, you can also train the model with [examples/nlp/glue_benchmark/glue_benchmark.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/glue_benchmark/glue_benchmark.py).

To run the training script, use:

`python glue_benchmark.py \
 model.dataset.data_dir=PATH_TO_DATA_DIR \
 model.task_name=TASK`


Average results after 3 runs:

| Task  |         Metric           | ALBERT-large | ALBERT-xlarge | Megatron-345m | BERT base paper | BERT large paper |
|-------|--------------------------|--------------|---------------|---------------|-----------------|------------------|
| CoLA  | Matthews correlation     |     54.94    |     61.72     |     64.56     |      52.1       |       60.5       |
| SST-2 | Accuracy                 |     92.74    |     91.86     |     95.87     |      93.5       |       94.9       |
| MRPC  | F1/Accuracy              |  92.05/88.97 |  91.87/88.61  |  92.36/89.46  |      88.9/-     |     89.3/-       |
| STS-B | Pearson/Spearman corr.   |  90.41/90.21 |  90.07/90.10  |  91.51/91.61  |     -/85.8      |      -/86.5      |
| QQP   | F1/Accuracy              |  88.26/91.26 |  88.80/91.65  |  89.18/91.91  |     71.2/-      |     72.1/-       |
| MNLI  | Matched/Mismatched acc.  |  86.69/86.81 |  88.66/88.73  |  89.86/89.81  |    84.6/83.4    |     86.7/85.9    |
| QNLI  | Accuracy                 |     92.68    |     93.66     |     94.33     |      90.5       |       92.7       |
| RTE   | Accuracy                 |     80.87    |     82.86     |     83.39     |      66.4       |       70.1       |

The WNLI task was excluded from the experiments due to the problematic WNLI dataset.
The ALBERT and Megatron models were evaluated on the dev sets, while the results from [the BERT paper](https://arxiv.org/abs/1810.04805) are reported on the test sets.
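
For readers less familiar with the metric names in the results table, the sketch below computes accuracy, F1 and the Matthews correlation for binary labels in plain Python. It only illustrates the standard definitions and is not how NeMo computes its metrics; `binary_metrics` is a hypothetical helper, and the table reports these values multiplied by 100.

```python
import math


def binary_metrics(labels, preds):
    """Compute accuracy, F1 and Matthews correlation for 0/1 labels."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # F1 is the harmonic mean of precision and recall.
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0
    # Matthews correlation uses all four confusion-matrix cells.
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return {"acc": accuracy, "f1": f1, "mcc": mcc}


result = binary_metrics(labels=[1, 1, 0, 0], preds=[1, 0, 0, 0])
print({name: round(value, 4) for name, value in result.items()})
# {'acc': 0.75, 'f1': 0.6667, 'mcc': 0.5774}
```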

The hyperparameters used to obtain the results in the table above are listed in the table below. Some tasks could be tuned further to improve these numbers; the tables are a baseline reference only.
Each cell in the table lists the following parameters:
Number of GPUs used / Batch Size / Learning Rate / Number of Epochs. For parameters not specified here, refer to the defaults in the training script.

| Task  | ALBERT-large | ALBERT-xlarge | Megatron-345m |
|-------|--------------|---------------|---------------|
| CoLA  | 1 / 32 / 1e-5 / 3  |  1 / 32 / 1e-5 / 10 |  4 / 16 / 2e-5 / 12 |
| SST-2 | 4 / 16 / 2e-5 / 5  |  4 / 16 / 2e-5 / 12 |  4 / 16 / 2e-5 / 12 |
| MRPC  | 1 / 32 / 1e-5 / 5  |  1 / 16 / 2e-5 / 5  |  1 / 16 / 2e-5 / 10 |
| STS-B | 1 / 16 / 2e-5 / 5  |  1 / 16 / 4e-5 / 12 |  4 / 16 / 3e-5 / 12 |
| QQP   | 1 / 16 / 2e-5 / 5  | 4 / 16 / 1e-5 / 12  |  4 / 16 / 1e-5 / 12 |
| MNLI  | 4 / 64 / 1e-5 / 5  |  4 / 32 / 1e-5 / 5  |  4 / 32 / 1e-5 / 5  | 
| QNLI  | 4 / 16 / 1e-5 / 5  |  4 / 16 / 1e-5 / 5  |  4 / 16 / 2e-5 / 5  | 
| RTE   | 1 / 16 / 1e-5 / 5  | 1 / 16 / 1e-5 / 12  |  4 / 16 / 3e-5 / 12 |
