In [1]:
BRANCH = 'megatron_docs'

In [2]:
"""
You can run this notebook either locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
# If you're using Google Colab and not running locally, run this cell

# install NeMo
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@{BRANCH}#egg=nemo_toolkit[nlp]

Collecting nemo_toolkit[nlp]
  Cloning https://github.com/NVIDIA/NeMo.git (to revision megatron_docs) to /tmp/pip-install-uqmwosba/nemo-toolkit
  Running command git clone -q https://github.com/NVIDIA/NeMo.git /tmp/pip-install-uqmwosba/nemo-toolkit
  Running command git checkout -b megatron_docs --track origin/megatron_docs
  Switched to a new branch 'megatron_docs'
  Branch 'megatron_docs' set up to track remote branch 'megatron_docs' from 'origin'.
Collecting onnx>=1.7.0
  Downloading https://files.pythonhosted.org/packages/36/ee/bc7bc88fc8449266add978627e90c363069211584b937fd867b0ccc59f09/onnx-1.7.0-cp36-cp36m-manylinux1_x86_64.whl (7.4MB)
     |████████████████████████████████| 7.4MB 2.6MB/s 
Collecting pytorch-lightning==0.9.0
  Downloading https://files.pythonhosted.org/packages/ed/af/2f10c8ee22d7a05fe8c9be58ad5c55b71ab4dd895b44f0156bfd5535a708/pytorch_lightning-0.9.0-py3-none-any.whl (408kB)
     |████████████████████████████████| 409kB 45.0MB/s 
Collecti

In [3]:
import os
import wget
from nemo.collections import nlp as nemo_nlp
from omegaconf import OmegaConf

[NeMo W 2020-09-04 04:48:23 experimental:28] Module <class 'nemo.collections.nlp.modules.common.megatron.megatron_bert.MegatronBertEncoder'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2020-09-04 04:48:23 experimental:28] Module <class 'nemo.collections.nlp.modules.common.sequence_token_classifier.SequenceTokenClassifier'> is experimental, not ready for production and is not fully supported. Use at your own risk.




# Language models

The field of Natural Language Processing (NLP) has made a huge leap in recent years thanks to transfer learning, which is enabled by pretrained language models.

[BERT](https://arxiv.org/abs/1810.04805), [RoBERTa](https://arxiv.org/abs/1907.11692), [Megatron-LM](https://arxiv.org/abs/1909.08053), and many other proposed language models achieve state-of-the-art results on many NLP tasks, such as:
* question answering
* sentiment analysis
* named entity recognition and many others.

In NeMo, most NLP models consist of a pretrained language model followed by a token classification layer, a sequence classification layer, or a combination of both. By changing the language model, you can improve the performance of your final model on the specific downstream task you are solving.

With NeMo you can either pretrain a BERT model on your own data or use a pretrained language model from the [HuggingFace transformers](https://github.com/huggingface/transformers) or [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) libraries.

Let's take a look at the list of available pretrained language models:


In [4]:
nemo_nlp.modules.get_pretrained_lm_models_list()

['megatron-bert-345m-uncased',
 'megatron-bert-345m-cased',
 'biomegatron-bert-345m-uncased',
 'biomegatron-bert-345m-cased',
 'megatron-bert-uncased',
 'megatron-bert-cased',
 'bert-base-uncased',
 'bert-large-uncased',
 'bert-base-cased',
 'bert-large-cased',
 'bert-base-multilingual-uncased',
 'bert-base-multilingual-cased',
 'bert-base-chinese',
 'bert-base-german-cased',
 'bert-large-uncased-whole-word-masking',
 'bert-large-cased-whole-word-masking',
 'bert-large-uncased-whole-word-masking-finetuned-squad',
 'bert-large-cased-whole-word-masking-finetuned-squad',
 'bert-base-cased-finetuned-mrpc',
 'bert-base-german-dbmdz-cased',
 'bert-base-german-dbmdz-uncased',
 'cl-tohoku/bert-base-japanese',
 'cl-tohoku/bert-base-japanese-whole-word-masking',
 'cl-tohoku/bert-base-japanese-char',
 'cl-tohoku/bert-base-japanese-char-whole-word-masking',
 'TurkuNLP/bert-base-finnish-cased-v1',
 'TurkuNLP/bert-base-finnish-uncased-v1',
 'wietsedv/bert-base-dutch-cased',
 'distilbert-base-uncased

All NeMo [NLP models](https://github.com/NVIDIA/NeMo/tree/main/examples/nlp) have an associated config file. As an example, let's examine the config file for the Named Entity Recognition (NER) model (more details about the model and the NER task can be found [here](https://github.com/NVIDIA/NeMo/blob/main/tutorials/nlp/Punctuation_and_Capitalization.ipynb)).

In [5]:
MODEL_CONFIG = "token_classification_config.yaml"

# download the model's configuration file 
if not os.path.exists(MODEL_CONFIG):
    print('Downloading config file...')
    wget.download('https://raw.githubusercontent.com/NVIDIA/NeMo/' + BRANCH + '/examples/nlp/token_classification/conf/' + MODEL_CONFIG)
else:
    print('Config file already exists')

Downloading config file...
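
The download cell above builds the raw GitHub URL by string concatenation and relies on the third-party `wget` package. As a minimal sketch, the same logic can be written with the standard library only. The helper names `config_url` and `download_if_missing` are hypothetical and are not part of NeMo; the URL layout is assumed to match the string used in the cell above.

```python
import os
import urllib.request

# URL pieces copied from the download cell above.
RAW_BASE = "https://raw.githubusercontent.com/NVIDIA/NeMo/"
CONF_DIR = "/examples/nlp/token_classification/conf/"


def config_url(branch: str, config_name: str) -> str:
    """Build the raw GitHub URL of a token classification config file."""
    return RAW_BASE + branch + CONF_DIR + config_name


def download_if_missing(url: str, path: str) -> bool:
    """Download `url` to `path` unless `path` already exists.

    Returns True if a download was performed, False otherwise.
    """
    if os.path.exists(path):
        return False
    urllib.request.urlretrieve(url, path)
    return True
```

Calling `download_if_missing(config_url(BRANCH, MODEL_CONFIG), MODEL_CONFIG)` would then play the role of the `if`/`else` block above, without printing status messages.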


In [6]:
# load the config file and print the entire config of the model
config = OmegaConf.load(MODEL_CONFIG)
print(OmegaConf.to_yaml(config))

trainer:
  gpus: 1
  num_nodes: 1
  max_epochs: 5
  max_steps: null
  accumulate_grad_batches: 1
  gradient_clip_val: 0.0
  amp_level: O0
  precision: 16
  distributed_backend: ddp
  checkpoint_callback: false
  logger: false
  row_log_interval: 1
  val_check_interval: 1.0
  resume_from_checkpoint: null
exp_manager:
  exp_dir: null
  name: token_classification_model
  create_tensorboard_logger: true
  create_checkpoint_callback: true
model:
  nemo_path: null
  label_ids: null
  dataset:
    data_dir: ???
    class_balancing: null
    max_seq_length: 128
    pad_label: O
    ignore_extra_tokens: false
    ignore_start_end: false
    use_cache: true
    num_workers: 2
    pin_memory: false
    drop_last: false
  train_ds:
    text_file: text_train.txt
    labels_file: labels_train.txt
    shuffle: true
    num_samples: -1
    batch_size: 64
  validation_ds:
    text_file: text_dev.txt
    labels_file: labels_dev.txt
    shuffle: false
    num_samples: -1
    batch_size: 64
  language_mod

For the purposes of this tutorial, we are interested in the `language_model` section of the Named Entity Recognition model config.

In [7]:
print(OmegaConf.to_yaml(config.model.language_model))

pretrained_model_name: bert-base-uncased
bert_checkpoint: null
bert_config: null
tokenizer: nemobert
vocab_file: null
tokenizer_model: null
do_lower_case: false



There might be slight differences from one model to another, but most of them have the following important parameters associated with a language model:
* `pretrained_model_name` - the name of a pretrained model from either the HuggingFace or the Megatron-LM library
* `bert_checkpoint` - a path to a pretrained language model checkpoint if, for example, you trained a BERT model on your own data
* `bert_config` - a path to the model config file if the language model you want to use differs from the model's default configuration

To modify the default language model, specify the desired language model name with the `model.language_model.pretrained_model_name` argument, like this:

In [8]:
config.model.language_model.pretrained_model_name = 'roberta-base'

and then start the training as usual (please see [tutorials/nlp](https://github.com/NVIDIA/NeMo/tree/main/tutorials/nlp) for more details about training a particular model). 

You can also provide a pretrained language model checkpoint and a configuration file if available.
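
The dotted argument names used in this tutorial mirror the nesting of the YAML config: `model.language_model.pretrained_model_name` is the `pretrained_model_name` key inside `language_model` inside `model`. The sketch below illustrates that mapping with a plain nested dictionary. It is not how NeMo or OmegaConf implement overrides; `apply_overrides` is a hypothetical helper, the checkpoint and config paths are placeholders, and every value is kept as a string for simplicity.

```python
from typing import Any, Dict, List


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Apply 'a.b.c=value' style overrides to a nested dict in place."""
    for item in overrides:
        dotted_key, _, value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = config
        for part in parents:
            # Walk down the nesting, creating intermediate levels if needed.
            node = node.setdefault(part, {})
        node[leaf] = value
    return config


# Subset of the language_model section printed above (YAML null becomes None).
config = {
    "model": {
        "language_model": {
            "pretrained_model_name": "bert-base-uncased",
            "bert_checkpoint": None,
            "bert_config": None,
        }
    }
}

apply_overrides(
    config,
    [
        "model.language_model.pretrained_model_name=roberta-base",
        # Placeholder paths, for illustration only.
        "model.language_model.bert_checkpoint=/hypothetical/path/model.ckpt",
        "model.language_model.bert_config=/hypothetical/path/config.json",
    ],
)
print(config["model"]["language_model"])
```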

# Downstream tasks with Megatron and BioMegatron LM

All of the above holds for both HuggingFace and Megatron-LM pretrained language models, but let's take a closer look at Megatron-LM.

[Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) describes a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA. More details can be found in the [Megatron-LM GitHub repo](https://github.com/NVIDIA/Megatron-LM).

To see the list of available Megatron-LM models in NeMo, run:


In [9]:
nemo_nlp.modules.get_megatron_lm_models_list()

['megatron-bert-345m-uncased',
 'megatron-bert-345m-cased',
 'biomegatron-bert-345m-uncased',
 'biomegatron-bert-345m-cased',
 'megatron-bert-uncased',
 'megatron-bert-cased']

If you want to use one of the available Megatron-LM models, specify its name with the `model.language_model.pretrained_model_name` argument, for example:

In [10]:
config.model.language_model.pretrained_model_name = 'megatron-bert-345m-uncased'

If you have a different checkpoint or a model configuration file, use these general Megatron-LM model names:
* `megatron-bert-uncased` or 
* `megatron-bert-cased` 

and provide the associated `bert_config` and `bert_checkpoint` files, as follows:

`model.language_model.pretrained_model_name=megatron-bert-uncased \
model.language_model.bert_checkpoint=<PATH_TO_CHECKPOINT> \
model.language_model.bert_config=<PATH_TO_CONFIG>`
 
 or 
 
`model.language_model.pretrained_model_name=megatron-bert-cased \
model.language_model.bert_checkpoint=<PATH_TO_CHECKPOINT> \
model.language_model.bert_config=<PATH_TO_CONFIG>`
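
Since the general model names only make sense together with a checkpoint and a config file, it can help to check the `language_model` section before launching a long training run. The following is a minimal sketch of such a pre-flight check on a plain dictionary. The function `check_language_model` is hypothetical and NeMo's own validation may behave differently.

```python
from typing import Dict, List, Optional

# The two general Megatron-LM names listed above.
GENERAL_MEGATRON_NAMES = {"megatron-bert-uncased", "megatron-bert-cased"}


def check_language_model(lm: Dict[str, Optional[str]]) -> List[str]:
    """Return a list of problems found in a language_model config section."""
    problems = []
    name = lm.get("pretrained_model_name")
    if not name:
        problems.append("pretrained_model_name is not set")
    elif name in GENERAL_MEGATRON_NAMES:
        # General names need a user-supplied checkpoint and config file.
        for key in ("bert_checkpoint", "bert_config"):
            if not lm.get(key):
                problems.append(f"{key} is required with {name}")
    return problems


# A general name with only the checkpoint set reports the missing config.
print(check_language_model({
    "pretrained_model_name": "megatron-bert-cased",
    "bert_checkpoint": "<PATH_TO_CHECKPOINT>",
    "bert_config": None,
}))
```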

The general Megatron-LM model names are used to download the correct vocabulary file needed to set up the model correctly. Note that data preprocessing and model training are done in NeMo. Megatron-LM has its own set of training arguments (including the tokenizer) that are ignored during fine-tuning in NeMo. Please see the downstream task [config files and training scripts](https://github.com/NVIDIA/NeMo/tree/main/examples/nlp) for all arguments supported by NeMo.

## Download pretrained model

With NeMo, the original and domain-specific Megatron-LM BERT models and their model configuration files are downloaded automatically, but they can also be downloaded from the links below:

[Megatron-LM BERT Uncased 345M (~345M parameters): https://ngc.nvidia.com/catalog/models/nvidia:megatron_bert_345m](https://ngc.nvidia.com/catalog/models/nvidia:megatron_bert_345m/files?version=v0.1_uncased)

[Megatron-LM BERT Cased 345M (~345M parameters): https://ngc.nvidia.com/catalog/models/nvidia:megatron_bert_345m](https://ngc.nvidia.com/catalog/models/nvidia:megatron_bert_345m/files?version=v0.1_cased)

[BioMegatron-LM BERT Cased 345M (~345M parameters): https://ngc.nvidia.com/catalog/models/nvidia:biomegatron345mcased](https://ngc.nvidia.com/catalog/models/nvidia:biomegatron345mcased)

[BioMegatron-LM BERT Uncased 345M (~345M parameters): https://ngc.nvidia.com/catalog/models/nvidia:biomegatron345muncased](https://ngc.nvidia.com/catalog/models/nvidia:biomegatron345muncased)