<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>

# 3.0 Pretraining Language Models

There are many pretrained BERT models that can be used "off-the-shelf".  However, there are times when it is advantageous to train or fine-tune a new language model for downstream NLP tasks.  For example, medical papers use vocabularies that are specific to the medical domain, so a language model trained on medical papers will be better suited to projects that process medical text than one trained on more general text.  

In this notebook, you'll learn how to pretrain a BERT language model with domain-specific data.  
    
**[3.1 Data Preparation](#3.1-Data-Preparation)<br>**
**[3.2 Training the BERT Tokenizer](#3.2-Training-the-BERT-Tokenizer)<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[3.2.1 Exercise: Tokenize a Term](#3.2.1-Exercise:-Tokenize-a-Term)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[3.2.2 Update the BERT Vocabulary](#3.2.2-Update-the-BERT-Vocabulary)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[3.2.3 Exercise: Train a Larger Vocabulary](#3.2.3-Exercise:-Train-a-Larger-Vocabulary)<br>
**[3.3 Launch BERT Pretraining with NeMo](#3.3-Launch-BERT-Pretraining-with-NeMo)<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[3.3.1 TensorBoard Visualization](#3.3.1-TensorBoard-Visualization)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[3.3.2 Practical Considerations](#3.3.2-Practical-Considerations)<br>

---
# 3.1 Data Preparation

Masked neural language models, such as BERT, are trained on text.  However, the text must first be transformed into numerical representations, a process called tokenization.  The network is then trained by masking random words in the input sentence and predicting the missing words.  The trained language model can then be used in downstream NLP tasks, where it is referred to as a "pretrained" language model.
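
The masking step described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the NeMo implementation: the function name `mask_tokens` and the toy sentence are assumptions made for this example, and the original BERT recipe additionally replaces some selected tokens with random tokens or leaves them unchanged rather than always using `[MASK]`.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace each token with [MASK] with probability mask_prob.

    Returns the masked sequence and a dict mapping each masked position
    to the original token the model would be asked to predict.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets

# Hypothetical toy input: whitespace-separated words stand in for tokens.
sentence = "low levels of beta hexosaminidase a in healthy individuals".split()
masked, targets = mask_tokens(sentence, mask_prob=0.15, seed=0)
print(masked)
print(targets)
```

During training, the loss is computed only on the positions stored in `targets`, which is why no labeled data is needed.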

With NVIDIA NeMo, the tokenization can be done either on-the-fly during training or offline before training.

- **On-the-fly data preprocessing:** The training and validation text files should have words separated by spaces:
                                [WORD] [SPACE] [WORD] [SPACE] [WORD] [SPACE] [WORD]
                                
- **Offline data preprocessing:** Data is prepared in advance in HDF5 format. This is the recommended preprocessing for large text corpora.  Refer to the [BERT quick start guide](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT#quick-start-guide) for the offline data preprocessing script. 

In our example, we will use the on-the-fly data preprocessing pipeline.  We'll train BERT on the [NCBI-disease corpus](https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/).
The NCBI corpus is a set of 793 PubMed abstracts.  Our goal is to create a pretrained model for the medical domain.  Here's a sample of the abstract text:

In [1]:
! tail -5 /dli/task/data/train.txt

Low levels of beta hexosaminidase A in healthy individuals with apparent deficiency of this enzyme. Appreciable beta hexosaminidase A ( hex A ) activity has been detected in cultured skin fibroblasts and melanoma tissue from healthy individuals previously reported as having deficiency of hex A activity indistinguishable from that of patients with Tay-Sachs disease ( TSD ) . Identification and quantitation of hex A , amounting to 3 . 5 % -6 . 9 % of total beta hexosaminidase activity , has been obtained by cellulose acetate gel electrophoresis , DEAE-cellulose ion-exchange chromatography , radial immunodiffusion , and radioimmunoassay . Previous family studies suggested that these individuals may be compound heterozygotes for the common mutant TSD gene and a rare ( allelic ) mutant gene . Thus , the postulated rate mutant gene appears to code for the expression of low amounts of hex A . Heterozygotes for the rare mutant may be indistinguishable from heterozygotes for the common TSD muta

---
# 3.2 Training the BERT Tokenizer

As discussed in the previous notebook, the BERT tokenizer splits text into tokens according to a predefined vocabulary. The tokenizer's training algorithm builds that vocabulary from the text corpus, typically by keeping a variant of the top-K most frequent words or subwords.

The vocabulary size is limited because training cost increases with the size of the vocabulary. Including every unique word from the text corpus would make the vocabulary, and with it the model's embedding and output layers, impractically large. For instance, the BERT model released in 2018, which uses a subword tokenization algorithm called WordPiece, has a vocabulary of roughly 30,000 tokens.
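
To make the idea of a size-limited vocabulary concrete, here is a minimal sketch that keeps only the K most frequent whole words of a toy corpus. It is an illustration only: real tokenizers such as WordPiece work on subwords, and the function name `build_top_k_vocab` and the toy corpus are assumptions made for this example.

```python
from collections import Counter

def build_top_k_vocab(corpus_lines, k):
    """Return the k most frequent whitespace-separated words in the corpus."""
    counts = Counter(
        word for line in corpus_lines for word in line.lower().split()
    )
    return [word for word, _ in counts.most_common(k)]

# Hypothetical toy corpus
toy_corpus = [
    "the mutant gene codes for the enzyme",
    "the enzyme activity is low",
    "the gene is rare",
]
vocab = build_top_k_vocab(toy_corpus, k=4)
print("Vocabulary:", vocab)

# Any word outside the top-K list is out of vocabulary
oov = [w for w in "the mutant gene is rare".split() if w not in vocab]
print("Not covered by the vocabulary:", oov)
```

Even in this tiny corpus, capping the vocabulary leaves some words uncovered, which is the problem the rest of this section addresses.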

How, then, do tokenizers deal with terms that are not part of the vocabulary, or **out-of-vocabulary (OOV)** words?

1. One option is to replace OOV words with a special token \[UNK\]. In this case, all OOV terms share the same representation in the neural network, so their meaning is lost. 
1. A second option is to split OOV words at the character level. This increases the length of the input to the neural language model and adds the challenge of learning relationships between characters in order to recover meaning.
1. Subword tokenizers, such as BERT WordPiece, offer a middle ground between whole-word tokens and character-level splitting: they tokenize OOV words into subwords.
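
The third option can be sketched as a greedy longest-match-first split, which is the general strategy WordPiece-style tokenizers apply at tokenization time. This is a minimal sketch under stated assumptions: the function name `wordpiece_tokenize` and the toy vocabulary are made up for this example, and production tokenizers also handle normalization, punctuation, and maximum word length.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first subword split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: fall back to the unknown token for the whole word.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary
toy_vocab = {"dil", "##ution", "##s", "hem", "##ol", "##ytic", "serum"}
for word in ["dilutions", "hemolytic", "serum", "assay"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

With this toy vocabulary, words covered by known pieces are split into subwords, while a word with no matching pieces falls back to the unknown token.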

Let's have a look at the `bert-base-uncased` tokenizer:

In [2]:
# import nemo nlp collection 
from nemo.collections import nlp as nemo_nlp

# load the bert-base-uncased tokenizer 
tokenizer_uncased = nemo_nlp.modules.get_tokenizer(tokenizer_name="bert-base-uncased")

Using bos_token, but it is not set yet.
Using eos_token, but it is not set yet.


In [3]:
# get the vocabulary size
print(" The vocabulary size: ", tokenizer_uncased.vocab_size)

 The vocabulary size:  30522


As an example, take a look at how BERT tokenizes years. Years through 2021 appear frequently enough in the training corpus to be part of the vocabulary, while later years are OOV and are split into subword tokens.

Try it in the cell below using the `tokenizer_uncased.text_to_tokens()` function for various years.

In [4]:
# Bert tokenizer for years
print("Tokenized year: ", tokenizer_uncased.text_to_tokens('2019'))
print("Tokenized year: ", tokenizer_uncased.text_to_tokens('2020'))
print("Tokenized year: ", tokenizer_uncased.text_to_tokens('2021'))
print("Tokenized year: ", tokenizer_uncased.text_to_tokens('2022'))
print("Tokenized year: ", tokenizer_uncased.text_to_tokens('2023'))
print("Tokenized year: ", tokenizer_uncased.text_to_tokens('2030'))

Tokenized year:  ['2019']
Tokenized year:  ['2020']
Tokenized year:  ['2021']
Tokenized year:  ['202', '##2']
Tokenized year:  ['202', '##3']
Tokenized year:  ['203', '##0']


The year tokenization example gives us some intuition into the process.  How about domain-specific text such as medical jargon? For a concrete example, try again with the following sentence:

_"Further studies suggested that low dilutions of C5D serum contain a factor or factors interfering at some step in the hemolytic assay of C5 rather than a true C5 inhibitor or inactivator"_

This sentence includes several medical terms such as dilutions, C5D, C5, hemolytic and assay.

In [5]:
# Bert tokenizer for domain-specific example
SAMPLES = "Further studies suggested that low dilutions of C5D serum contain a factor or factors interfering at some step in the hemolytic assay of C5 rather than a true C5 inhibitor."
print("Tokenized sentence: ", tokenizer_uncased.text_to_tokens(SAMPLES))

Tokenized sentence:  ['further', 'studies', 'suggested', 'that', 'low', 'dil', '##ution', '##s', 'of', 'c', '##5', '##d', 'serum', 'contain', 'a', 'factor', 'or', 'factors', 'interfering', 'at', 'some', 'step', 'in', 'the', 'hem', '##ol', '##ytic', 'ass', '##ay', 'of', 'c', '##5', 'rather', 'than', 'a', 'true', 'c', '##5', 'inhibitor', '.']


You can see medical jargon tokenized as subwords: 
- dilutions -> 'dil', '##ution', '##s'
- hemolytic -> 'hem', '##ol', '##ytic'
- assay -> 'ass', '##ay'
- C5 -> 'c', '##5'
- C5D -> 'c', '##5', '##d'

Medical terms such as dilutions, hemolytic, and assay are not in the standard BERT tokenizer vocabulary. Therefore, they cannot be represented as single tokens and are divided into subwords.
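
The `##` prefix marks a piece that continues the previous token, so the original words can be recovered by gluing the pieces back together. The helper below is a small sketch of that convention; the name `merge_subwords` is an assumption made for this example and is not part of NeMo.

```python
def merge_subwords(tokens, prefix="##"):
    """Rebuild whole words from WordPiece-style tokens."""
    words = []
    for token in tokens:
        if token.startswith(prefix) and words:
            # Continuation piece: append it to the previous word.
            words[-1] += token[len(prefix):]
        else:
            # A token without the prefix starts a new word.
            words.append(token)
    return words

tokens = ['low', 'dil', '##ution', '##s', 'of', 'c', '##5', '##d', 'serum']
print(merge_subwords(tokens))
```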

## 3.2.1 Exercise: Tokenize a Term
Correct the "FIXME" lines below to tokenize the term "COVID-19" using the BERT tokenizer.  Check the [solution](solutions/ex3.2.1.ipynb) if you need to.

In [6]:
# Tokenize a new term
TEXT = 'COVID-19'
print("Tokenized sentence: ", tokenizer_uncased.text_to_tokens(TEXT))

Tokenized sentence:  ['co', '##vid', '-', '19']


## 3.2.2 Update the BERT Vocabulary

You can add domain-specific words to the tokenizer vocabulary with the `tokenizer_uncased.tokenizer.add_tokens()` function. The embedding vector for each new token will be initialized with random values.

In [7]:
# Add some medical jargon to the vocabulary of Bert tokenizer
# Record the size first so the before/after comparison is meaningful
vocab_size_before = tokenizer_uncased.vocab_size
additional_tokens = tokenizer_uncased.tokenizer.add_tokens(["dilutions", "hemolytic"])
print(" The vocabulary size before: ", vocab_size_before)
print(" The vocabulary size after : ", tokenizer_uncased.vocab_size)

 The vocabulary size before:  30522
 The vocabulary size after :  30524


In [8]:
# Tokenize the sentence with the new vocabulary 
print("Tokenized sentence: ", tokenizer_uncased.text_to_tokens(SAMPLES))

Tokenized sentence:  ['further', 'studies', 'suggested', 'that', 'low', 'dilutions', 'of', 'c', '##5', '##d', 'serum', 'contain', 'a', 'factor', 'or', 'factors', 'interfering', 'at', 'some', 'step', 'in', 'the', 'hemolytic', 'ass', '##ay', 'of', 'c', '##5', 'rather', 'than', 'a', 'true', 'c', '##5', 'inhibitor', '.']
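
Because each added token needs its own embedding vector, extending the vocabulary also means extending the model's embedding matrix with freshly initialized rows. The sketch below shows the idea in plain Python. It is an illustration under stated assumptions, not what NeMo or any framework does internally: the function name `extend_embeddings` is hypothetical, and the standard deviation of 0.02 simply mirrors the `initializer_range` value that appears in the BERT configuration later in this notebook.

```python
import random

def extend_embeddings(embeddings, num_new_tokens, std=0.02, seed=0):
    """Append randomly initialized rows for newly added tokens.

    embeddings is a list of rows (one vector per existing token).
    Existing rows are kept as they are; only the new rows are random.
    """
    rng = random.Random(seed)
    hidden_size = len(embeddings[0])
    new_rows = [
        [rng.gauss(0.0, std) for _ in range(hidden_size)]
        for _ in range(num_new_tokens)
    ]
    return embeddings + new_rows

# Hypothetical tiny embedding table: 2 tokens, hidden size 3
base = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
extended = extend_embeddings(base, num_new_tokens=2)
print(len(base), "->", len(extended), "rows")
```

The new rows carry no meaning until the model is trained or fine-tuned on text that contains the added tokens.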


When the number of domain-specific words to incorporate into the vocabulary is high, it is best to train a new tokenizer from a domain-specific corpus, rather than to use the pretrained tokenizer. 

Let's train a new WordPiece tokenizer on the [NCBI-disease corpus](https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/), limiting the vocabulary size to 10,000. 

In [9]:
vocab_size = 10000
text_corpus = ["/dli/task/data/train.txt"]

# add the special tokens required for BERT pretraining.
special_tokens = ["<PAD>", "<UNK>", "<CLS>", "<SEP>", "<MASK>"]

In [10]:
from tokenizers import BertWordPieceTokenizer

my_bert_tokenizer = BertWordPieceTokenizer()
my_bert_tokenizer.train(files=text_corpus, vocab_size=vocab_size,
                        min_frequency=1, special_tokens=special_tokens,
                        show_progress=True, wordpieces_prefix="##")

In [11]:
# get the new vocabulary size
print(" The new vocabulary size  : ", len(my_bert_tokenizer.get_vocab()))

 The new vocabulary size  :  10000


In [12]:
# save the new vocabulary 
my_bert_tokenizer.save_model(directory="/dli/task/data/")

['/dli/task/data/vocab.txt']

In [13]:
!tail -20 /dli/task/data/vocab.txt 

electrophysiology
d17s857
delayed
maintaining
contributions
arg170
362arg
362ser
grandmother
grandmatrilineal
cytoskeleton
tyr231
tyr180
israelis
d14s291
angioedema
angiokeratoma
d13s314
d13s316
portugal


Once the vocabulary is defined, we can load the tokenizer with the new vocabulary using the `nemo_nlp.modules.get_tokenizer()` function. Let's tokenize the previous text sample and compare to the vanilla BERT tokenizer. 
Domain-specific terms that occur often enough in the corpus should now be encoded as single tokens.

In [14]:
# load the tokenizer from the vocabulary 
special_tokens_dict = {"unk_token": "<UNK>", "sep_token": "<SEP>", "pad_token": "<PAD>", "bos_token": "<CLS>", "mask_token": "<MASK>","eos_token": "<SEP>", "cls_token": "<CLS>"}
tokenizer_custom = nemo_nlp.modules.get_tokenizer(tokenizer_name="bert-base-uncased", vocab_file='/dli/task/data/vocab.txt', special_tokens=special_tokens_dict)

print("BERT tokenizer with custom vocabulary: ", tokenizer_custom.text_to_tokens(SAMPLES))

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


BERT tokenizer with custom vocabulary:  ['further', 'studies', 'suggested', 'that', 'low', 'dil', '##ution', '##s', 'of', 'c5d', 'serum', 'contain', 'a', 'factor', 'or', 'factors', 'interfer', '##ing', 'at', 'some', 'step', 'in', 'the', 'hemolytic', 'assay', 'of', 'c5', 'rather', 'than', 'a', 'true', 'c5', 'inhibitor', '.']


## 3.2.3 Exercise: Train a Larger Vocabulary 

Correct the "FIXME" lines to train a BERT tokenizer with a vocabulary size of 15,000. Check the [solution](solutions/ex3.2.3.ipynb) if you need to.

In [15]:
# Train a larger vocabulary 
vocab_size = 15000
my_bert_tokenizer_15k = BertWordPieceTokenizer()
my_bert_tokenizer_15k.train(files=text_corpus, vocab_size=vocab_size, min_frequency=1,
                            special_tokens=special_tokens, show_progress=True, wordpieces_prefix="##")

print(" The new vocabulary size  : ", len(my_bert_tokenizer_15k.get_vocab()))

 The new vocabulary size  :  15000


---
# 3.3 Launch BERT Pretraining with NeMo

We will use the model configuration for on-the-fly data preprocessing, [bert_pretraining_from_text_config.yaml](nemo/examples/nlp/language_modeling/conf/bert_pretraining_from_text_config.yaml), along with a training script, [bert_pretraining.py](nemo/examples/nlp/language_modeling/bert_pretraining.py). The YAML configuration file provides the parameters needed by the training script, and the parameter values can be overridden as needed. 

You'll learn more about NeMo configuration files and scripts in a later module.  For now, we'll just note a few important YAML keys in the configuration file:
- `trainer`: Training process parameters such as the number of GPUs, mixed-precision training, number of epochs, etc.
- `model.only_mlm_loss`: Use only the masked language model loss, without next sentence prediction
- `model.mask_prob`: Probability of masking a token in the input text during data processing
- `model.train_ds`/`model.validation_ds`: Dataset parameters
- `model.tokenizer`: Tokenizer parameters
- `model.language_model`: Language model architecture parameters
- `model.optim`: Optimizer parameters
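
The dotted keys above map onto the nested structure of the YAML file, and a command-line override of the form `key.subkey=value` replaces the value at that path. The sketch below illustrates the idea on a plain nested dictionary. It is not how the configuration library used by NeMo is implemented: the function name `apply_overrides` and the starting values in `config` are assumptions made for this example, and for simplicity the overridden values are kept as strings.

```python
def apply_overrides(config, overrides):
    """Apply 'a.b.c=value' style overrides to a nested dict in place."""
    for item in overrides:
        dotted_key, value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = config
        for key in parents:
            # Walk down the nested dicts, creating levels that are missing.
            node = node.setdefault(key, {})
        node[leaf] = value  # kept as a string in this sketch
    return config

# Hypothetical starting values, loosely shaped like the YAML file
config = {
    "trainer": {"max_epochs": 1, "precision": 16},
    "model": {"mask_prob": 0.15, "train_ds": {"batch_size": 64}},
}
apply_overrides(config, ["model.train_ds.batch_size=16", "trainer.max_epochs=2"])
print(config["model"]["train_ds"]["batch_size"])
print(config["trainer"]["max_epochs"])
```

Keys that are not overridden, such as `model.mask_prob` here, keep the value from the configuration file.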

Find more details about bert_pretraining parameters in the [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/bert_pretraining.html#quick-start-guide).

For BERT offline pretraining with preprocessed data, use the dedicated configuration, [bert_pretraining_from_preprocessed_config.yaml](nemo/examples/nlp/language_modeling/conf/bert_pretraining_from_preprocessed_config.yaml).

In [16]:
# Show the configuration file
! cat nemo/examples/nlp/language_modeling/conf/bert_pretraining_from_text_config.yaml

# BERT Pretraining from Text
name: &name PretrainingBERTFromText
trainer:
  gpus: 1 # the number of gpus, 0 for CPU, or list with gpu indices
  num_nodes: 1
  max_epochs: 2 # the number of training epochs
  max_steps: null # precedence over max_epochs
  accumulate_grad_batches: 1 # accumulates grads every k batches
  precision: 16 # 16 to use AMP
  amp_level: O1 # O1 or O2 if using AMP
  accelerator: ddp
  gradient_clip_val: 0.0
  log_every_n_steps: 1
  val_check_interval: 1.0 # check once per epoch .25 for 4 times per epoch
  checkpoint_callback: false # provided by exp_manager
  logger: false # provided by exp_manager

model:
  nemo_path: null # exported .nemo path
  only_mlm_loss: false # only use masked language model without next sentence prediction
  num_tok_classification_layers: 1 # number of token classification head output layers
  num_seq_classification_layers: 2 # number of sequence classification head output layers
  max_seq_length: 128
  # The maximum total input sequence

In [17]:
%%time
# Override the parameters specific to our data; run only two epochs for now
! python nemo/examples/nlp/language_modeling/bert_pretraining.py \
    model.train_ds.data_file=/dli/task/data/train.txt \
    model.validation_ds.data_file=/dli/task/data/test.txt \
    model.tokenizer.vocab_file=/dli/task/data/vocab.txt \
    model.train_ds.batch_size=16 \
    trainer.max_epochs=2

    Use OmegaConf.to_yaml(cfg)
    
    
[NeMo I 2022-07-28 08:37:28 bert_pretraining:28] Config:
     name: PretrainingBERTFromText
    trainer:
      gpus: 1
      num_nodes: 1
      max_epochs: 2
      max_steps: null
      accumulate_grad_batches: 1
      precision: 16
      amp_level: O1
      accelerator: ddp
      gradient_clip_val: 0.0
      log_every_n_steps: 1
      val_check_interval: 1.0
      checkpoint_callback: false
      logger: false
    model:
      nemo_path: null
      only_mlm_loss: false
      num_tok_classification_layers: 1
      num_seq_classification_layers: 2
      max_seq_length: 128
      mask_prob: 0.15
      short_seq_prob: 0.1
      language_model:
        pretrained_model_name: bert-base-uncased
        lm_checkpoint: null
        config:
          attention_probs_dropout_prob: 0.1
          hidden_act: gelu
          hidden_dropout_prob: 0.1
          hidden_size: 768
          initializer_range: 0.02
          intermediate_size: 3072
          max_po

## 3.3.1 TensorBoard Visualization
Open [TensorBoard](/tensorboard/) in your browser.  Then, click the link to see graphs of experiment metrics like loss and accuracy saved in the `nemo_experiments` folder.

## 3.3.2 Practical Considerations

Pretraining Transformer-based language models does not require labeled data. However, it does require a large amount of text and compute time.  For example, pretraining a BERT model on [English Wikipedia](https://huggingface.co/datasets/wikipedia) + [BookCorpus](https://huggingface.co/datasets/bookcorpus) using an NVIDIA DGX-1 server with 8 V100 GPUs takes about 6 days in mixed-precision mode. You can find out more about BERT training and fine-tuning performance at https://catalog.ngc.nvidia.com/orgs/nvidia/resources/bert_for_pytorch/performance.
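
To put the figure above in perspective, the quoted run can be converted into GPU-hours with simple arithmetic. This is a back-of-the-envelope estimate that assumes all 8 GPUs are busy for the full 6 days.

```python
num_gpus = 8        # V100 GPUs in one DGX-1 server
training_days = 6   # approximate wall-clock time quoted above

gpu_hours = num_gpus * training_days * 24
print(f"Approximate compute budget: {gpu_hours} GPU-hours")
```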

On the other hand, fine-tuning a Transformer-based model is less computationally intensive, but requires labeled data. The lab in Part 2 will focus on fine-tuning BERT models for downstream NLP tasks such as text classification and named entity recognition.

---
<h2 style="color:green;">Congratulations!</h2>

You've completed the BERT pretraining notebook!  

You've learned:
* How to train a BERT tokenizer
* How to pretrain a BERT language model with NeMo
