In [1]:
BRANCH='main'

"""
This notebook is currently being upated to work with the ptuning/prompt-tuning refactor. Please use NeMo r1.8.0 instead of main in the mean time. 
"""

'\nThis notebook is currently being upated to work with the ptuning/prompt-tuning refactor. Please use NeMo r1.8.0 instead of main in the mean time. \n'

In [2]:
"""
You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
# If you're using Google Colab and not running locally, run this cell

# install NeMo
# !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[nlp]

'\nYou can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n\nInstructions for setting up Colab are as follows:\n1. Open a new Python 3 notebook.\n2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)\n3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)\n4. Run this cell to set up dependencies.\n'

In [3]:
from nemo.collections import nlp as nemo_nlp
from nemo.utils.exp_manager import exp_manager

import os
import wget 
import torch
import pytorch_lightning as pl
from omegaconf import OmegaConf

[NeMo W 2022-04-15 06:08:22 experimental:27] Module <class 'nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers.MegatronPretrainingRandomBatchSampler'> is experimental, not ready for production and is not fully supported. Use at your own risk.
    


In the era of super large language models, the traditional "pre-train, fine-tune" procedure is replaced by "pre-train, prompt, and predict" method as shown in the [survey paper](https://arxiv.org/pdf/2107.13586.pdf). The prompt method is versatile enough to support all kinds of NLP tasks as shown in the following table: 

<table>
    <thead>
        <tr>
            <th>Type</th>
            <th>Task</th>
            <th>Input ([X])</th>
            <th>Template</th>
            <th>Answer([Y])</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td rowspan=3>Text CLS</td>
            <td>Sentiment</td>
            <td>I love this movie.</td>
            <td>[X] The movie is [Y]</td>
            <td>great<br>fantastic<br>...</td>
        </tr>
        <tr>
            <td>Topics</td>
            <td>He prompted the LM.</td>
            <td>[X] The text is about [Y]</td>
            <td>sports<br>science<br>...</td>
        </tr>
        <tr>
            <td>Intention</td>
            <td>What is taxi fare to Denver?</td>
            <td>[X] The question is about [Y]</td>
            <td>quantity<br>city<br>...</td>
        </tr>
        <tr>
            <td rowspan=1>Text-span CLS</td>
            <td>Aspect<br>Sentiment</td>
            <td>Poor service but good food.</td>
            <td>[X] What about service? [Y]</td>
            <td>Bad<br>Terrible<br>...</td>
        </tr>
        <tr>
            <td rowspan=1>Text-pair CLS</td>
            <td>NLI</td>
            <td>[X1]: An old man with ...<br>[X2]: A man walks ...</td>
            <td>Hypothesis: [X1], Premise: [X2], Answer: [Y]</td>
            <td>Contradiction<br>Entailment<br>...</td>
        </tr>
        <tr>
            <td rowspan=1>Tagging</td>
            <td>NER</td>
            <td>[X1]: Mike went to Paris.<br>[X2]: Paris</td>
            <td>[X1] [X2] is a [Y]</td>
            <td>Yes<br>No<br>...</td>
        </tr>
        <tr>
            <td rowspan=2>Text Generation</td>
            <td>Summarization</td>
            <td>Las Vegas police ...</td>
            <td>[X] TL;DR: [Y]</td>
            <td>The victim ...<br>A woman ...<br>...</td>
        </tr>
        <tr>
            <td>Translation</td>
            <td>Je vous aime.</td>
            <td>French [X] English: [Y]</td>
            <td>I love you.<br>I fancy you.<br>...</td>
        </tr>
    </tbody>
</table>

In this tutorial, we are going to describe how to use [P-Tuning method](https://arxiv.org/pdf/2103.10385.pdf) , which is one of the prompt engineering methods, to find good prompts for large GPT models. We show it can solve multiple downstream NLP tasks with good performance. P-Tuning leverages few continuous free parameters to serve as prompts fed as the input to the pre-trained language models. Freezing the large language model weights, P-Tuning model can be trained efficiently while delivering stats of art performance. 

Large Language Model can be trained with [NeMo Megatron](https://github.com/NVIDIA/NeMo/tree/main/examples/nlp/language_modeling), up to multi-billion parameters. In this notebook, we will use the pre-trained 344M GPT model released from NGC.

# Task Description
P-Tuning method can be applied to solve various NLP tasks. Without losing generality, in this notebook, we are going to use P-Tuning method to solve two NLP tasks: **Sentiment Analysis** task and **Question and Answer** task. 

**Sentiment Analysis** task is also known as opinion mining or emotion AI. It is a sub-field of NLP that tries to identify and extract opinions within a given text across blogs, reviews, social media, forums, news etc. 

For instance, **given sentences from news title, is it a good or bad news?**<br>

**Question and Answer** task is to find the answer to a question given the context text. 

For instance, 
```
Context: 
Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".
Question:
How many Grammy awards did Beyoncé win for her first solo album?
```

# Dataset
We will use [Financial PhraseBank dataset](https://huggingface.co/datasets/financial_phrasebank) for sentiment analysis task and [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) for question and answer task.

The [Financial PhraseBank dataset](https://huggingface.co/datasets/financial_phrasebank) contains the sentiments for financial news headlines from the perspective of a retail investor. Further details about the dataset can be found in: Malo, P., Sinha, A., Takala, P., Korhonen, P. and Wallenius, J. (2014): “Good debt or bad debt: Detecting semantic orientations in economic texts.” Journal of the American Society for Information Science and Technology.

Here's an example of what an annotated abstract from the corpus looks like:

```
HELSINKI Thomson Financial - Shares in Cargotec fell sharply in early afternoon trade after the cargo handling group posted a surprise drop in April-June profits , which overshadowed the large number of new orders received during the three months .@negative
LONDON MarketWatch -- Share prices ended lower in London Monday as a rebound in bank stocks failed to offset broader weakness for the FTSE 100 .@negative
Operating profit fell to EUR 35.4 mn from EUR 68.8 mn in 2007 , including vessel sales gain of EUR 12.3 mn .@negative
Sales in Finland decreased by 10.5 % in January , while sales outside Finland dropped by 17 % .@negative
```

The [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

Let's download the dataset.

In [4]:
DATA_DIR = "DATA_DIR"
os.makedirs(DATA_DIR, exist_ok=True)

## Downloading Financial Phrase Bank Dataset

The dataset is collected by Malo et al. 2014, and can be downloaded from this [link](https://www.researchgate.net/profile/Pekka_Malo/publication/251231364_FinancialPhraseBank-v10/data/0c96051eee4fb1d56e000000/FinancialPhraseBank-v10.zip). The zip file for the Financial Phrase Bank Dataset has been provided for ease of download and use.

In [5]:
!wget https://www.researchgate.net/profile/Pekka_Malo/publication/251231364_FinancialPhraseBank-v10/data/0c96051eee4fb1d56e000000/FinancialPhraseBank-v10.zip
!unzip FinancialPhraseBank-v10.zip -d {DATA_DIR}

--2022-04-15 06:08:23--  https://www.researchgate.net/profile/Pekka_Malo/publication/251231364_FinancialPhraseBank-v10/data/0c96051eee4fb1d56e000000/FinancialPhraseBank-v10.zip
Resolving www.researchgate.net (www.researchgate.net)... 104.17.32.105, 104.17.33.105, 2606:4700::6811:2069, ...
Connecting to www.researchgate.net (www.researchgate.net)|104.17.32.105|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.researchgate.net/profile/Pekka-Malo/publication/251231364_FinancialPhraseBank-v10/data/0c96051eee4fb1d56e000000/FinancialPhraseBank-v10.zip [following]
--2022-04-15 06:08:23--  https://www.researchgate.net/profile/Pekka-Malo/publication/251231364_FinancialPhraseBank-v10/data/0c96051eee4fb1d56e000000/FinancialPhraseBank-v10.zip
Reusing existing connection to www.researchgate.net:443.
HTTP request sent, awaiting response... 200 OK
Length: 681890 (666K) [application/zip]
Saving to: ‘FinancialPhraseBank-v10.zip’


2022-04-15 06:08:23

In [6]:
# If you want to see more examples, you can explore the text of the corpus using the file browser to the left, or open files directly, for example typing a command like the following in a code-cell:

! head -1 $DATA_DIR/FinancialPhraseBank-v1.0/Sentences_50Agree.txt

According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .@neutral


## Download the SQuAD dataset

Download a copy of the dataset (distributed under the CC BY-SA 4.0 license):

In [7]:
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
!mv train-v2.0.json {DATA_DIR}

--2022-04-15 06:08:24--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.111.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42123633 (40M) [application/json]
Saving to: ‘train-v2.0.json’


2022-04-15 06:08:25 (108 MB/s) - ‘train-v2.0.json’ saved [42123633/42123633]



## Pre-process Financial Phrase Bank Dataset

In this pre-process step, we are going to convert the downloaded dataset into the format that can be used for P-Tuning dataloader. The data is split into 10 folds so we can do 10-fold cross validation. In this notebook, we will use the first fold.

In [8]:
import json
import random

random.seed(1234)
files = ['Sentences_50Agree.txt', 'Sentences_66Agree.txt', 'Sentences_75Agree.txt', 'Sentences_AllAgree.txt']
base_dir = DATA_DIR + '/FinancialPhraseBank-v1.0/'
files = [base_dir + f for f in files]

alllines = []
for fn in files:
    with open(fn, 'r', encoding="ISO-8859-1") as f:
        alllines.extend(f.readlines())

random.shuffle(alllines)
fold = 10
fold_size = len(alllines) // fold

chunk_start = list(range(0, 14780, 1478))

chunks = []

for start_id in chunk_start:
    chunks.append(alllines[start_id:start_id+fold_size])

def gen_file(data, fold_id, split_type):
    filename = "{}/{}_{}.txt".format(base_dir, split_type, fold_id)
    with open(filename, 'w') as f:
        obj = {}
        for line in data:
            splits = line.split('@')
            part1 = splits[0].strip()
            part2 = splits[1].strip()
            obj['sentence'] = part1
            obj['label'] = part2
            obj['taskname'] = 'sentiment-task'
            f.write(json.dumps(obj)+'\n')


def gen_fold(fold_number):
    lists = list(range(fold))
    test_id = (fold_number + fold) % fold
    val_id = (fold_number + fold - 1) % fold
    test_set = chunks[test_id]
    val_set = chunks[val_id]
    lists.remove(test_id)
    lists.remove(val_id)
    train_set = []
    for idd in lists:
        train_set += chunks[idd]
    gen_file(train_set, fold_number, 'train')
    gen_file(val_set, fold_number, 'validation')
    gen_file(test_set, fold_number, 'test')

gen_fold(0)

The data is converted to the loss json file. Each line has three keys "sentence", "label" and "prompt_tag". 
Here are the first two lines of converted data:

In [9]:
!head -n 2 $DATA_DIR/FinancialPhraseBank-v1.0/train_0.txt

{"sentence": "The contract includes heating plant equipment and associated installation work .", "label": "neutral", "taskname": "sentiment-task"}
{"sentence": "The utility will also provide services related to electricity management , such as hedging trades and risk management and reporting .", "label": "neutral", "taskname": "sentiment-task"}


### Preprocess SQuAD Dataset


In [10]:
file_name = DATA_DIR + '/train-v2.0.json'
with open(file_name, 'r') as f:
    data_obj = json.load(f)

articles = data_obj['data']
test_len = 40
validation_len = 40
train_len = len(articles) - test_len - validation_len
train_records = []
validation_records = []
test_records = []


def get_records(sub_articals, records):
    for article in sub_articals:
        paragraphs = article['paragraphs']
        for paragraph in paragraphs:
            qas = paragraph['qas']
            context = paragraph['context'].strip()
            for qa in qas:
                record = {}
                record['question'] = qa['question'].strip()
                record['context'] = context
                if qa['is_impossible']:
                    record['label'] = 'NA'
                else:
                    record['label'] = qa['answers'][0]['text'].strip()
                record['taskname'] = 'qa-task'
                records.append(json.dumps(record))
get_records(articles[:train_len], train_records)
get_records(articles[train_len:train_len+validation_len], validation_records)
get_records(articles[train_len+validation_len:], test_records)
random.shuffle(train_records)
random.shuffle(validation_records)
random.shuffle(test_records)
squad_dir = "DATA_DIR/squad"
os.makedirs(squad_dir, exist_ok=True)
with open(squad_dir+'/train.txt', 'w') as f:
    f.write("\n".join(train_records))
with open(squad_dir+'/validation.txt', 'w') as f:
    f.write("\n".join(validation_records))
with open(squad_dir+'/test.txt', 'w') as f:
    f.write("\n".join(test_records))



The data is converted to the loss json file. Each line has three keys "question", "context", "label" and "prompt_tag". 
Here are the first two lines of converted data:

In [11]:
!head -n 2 {squad_dir}/train.txt

{"question": "What book did the New York Times publish excerpts from?", "context": "On July 8, 2007 The Washington Post published excerpts from UCLA Professor Amy Zegart's book Spying Blind: The CIA, the FBI, and the Origins of 9/11. The Post reported from Zegart's book that government documents show the CIA and FBI missed 23 potential chances to disrupt the terrorist attacks of September 11, 2001. The primary reasons for the failures included: agency cultures resistant to change and new ideas; inappropriate incentives for promotion; and a lack of cooperation between the FBI, CIA and the rest of the United States Intelligence Community. The book blamed the FBI's decentralized structure, which prevented effective communication and cooperation among different FBI offices. The book suggested that the FBI has not evolved into an effective counter-terrorism or counter-intelligence agency, due in large part to deeply ingrained agency cultural resistance to change. For example, FBI personnel 

### Combine the two datasets

The P-tune model includes a prompt encoder which is used to generate virtual tokens. Its output can be conditioned on the task tags so the P-tune model supports multiple tasks simultaneously. We are going to mix the Financial phrase bank dataset and SQuAD dataset together.

In [12]:
mix_data_dir = f"{DATA_DIR}/mix"
os.makedirs(mix_data_dir, exist_ok=True)
!cat $DATA_DIR/FinancialPhraseBank-v1.0/train_0.txt {squad_dir}/train.txt | shuf > {mix_data_dir}/train.txt
!cat $DATA_DIR/FinancialPhraseBank-v1.0/validation_0.txt {squad_dir}/validation.txt | shuf > {mix_data_dir}/validation.txt
!cat $DATA_DIR/FinancialPhraseBank-v1.0/test_0.txt {squad_dir}/test.txt | shuf > {mix_data_dir}/test.txt

Here are the first two lines of converted data:

In [13]:
!head -n 2 {mix_data_dir}/train.txt

{"question": "What other name does the KInsey scale go by?", "context": "The Kinsey scale, also called the Heterosexual-Homosexual Rating Scale, was first published in Sexual Behavior in the Human Male (1948) by Alfred Kinsey, Wardell Pomeroy, and Clyde Martin and also featured in Sexual Behavior in the Human Female (1953). The scale was developed to combat the assumption at the time that people are either heterosexual or homosexual and that these two types represent antitheses in the sexual world. Recognizing that a large portion of population is not completely heterosexual or homosexual and people can experience both heterosexual and homosexual behavior and psychic responses, Kinsey et al., stated:", "label": "Heterosexual-Homosexual Rating Scale", "taskname": "qa-task"}
{"question": "What did the Washington Naval Treaty of 1920 limit?", "context": "The development of flattop vessels produced the first large fleet ships. In 1918, HMS Argus became the world's first carrier capable of

## Convert the Megatron-LM Weights to Nemo file

P-Tuning method works the best with large GPT language models. From our experiences, models of size 5B or above give good performance. If you already have a large GPT model ready, skip this section. 

In this example, we will use the pretrained 344M NeMo Megatron GPT model from [Megatron-LM project](https://github.com/NVIDIA/Megatron-LM). To load it in NeMo Megatron, We first need to convert the Megatron-LM checkpoint to the `.nemo` file. Let's download the pretrained model weights and vocabulary file.



In [14]:
import pathlib
gpt_file = 'megatron_lm_345m_v0.0.zip'
vocab_file = 'gpt2-vocab.json'
merge_file = 'gpt2-merge.txt'
checkpoint_filename = 'model_optim_rng.pt'

if not pathlib.Path(gpt_file).exists():
    !wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O $gpt_file
    !unzip -o $gpt_file
    !wget https://s3.amazonaws.com/models.huggingface.co/bert/$vocab_file -O $vocab_file 
    !wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -O $merge_file



--2022-04-15 06:08:29--  https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip
Resolving api.ngc.nvidia.com (api.ngc.nvidia.com)... 54.193.81.248, 54.177.228.217
Connecting to api.ngc.nvidia.com (api.ngc.nvidia.com)|54.193.81.248|:443... connected.
HTTP request sent, awaiting response... 302 
Location: https://prod-model-registry-ngc-bucket.s3.us-west-2.amazonaws.com/org/nvidia/models/megatron_lm_345m/versions/v0.0/files.zip?response-content-disposition=attachment%3B%20filename%3D%22files.zip%22&response-content-type=application%2Fzip&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEN7%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLXdlc3QtMSJHMEUCIQCd2aNobSx1vNwpWsVtFi2FL10p%2F2bwVkkJMJubSmXwSwIgNiM5TEuUdLEly5ikoi0ClCM4%2BtPYOeWhJor%2FeB0gZwgqgwQIh%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARAEGgw3ODkzNjMxMzUwMjciDC0gGuFmPuYRQIq%2FeCrXA2pI6s8ZQEPoHs91RxflkQLYJPxXOCRyWgfxGBN7t8q02wpN00qTudRR8dkQFRrCToFs64pbs3ubKs5UVG639sRolnTL7zocyrau9VuLFOcWq5sR%2FcBClxN3LuqZYSFmUWf4uNNX7%2FJq93RE07pHBLIZRkk

In [15]:
WORK_DIR = "WORK_DIR"
os.makedirs(WORK_DIR, exist_ok=True)

# Prepare the model parameters 
# download the model's configuration file 
config_dir = WORK_DIR + '/configs/'
MODEL_CONFIG = "megatron_gpt_config.yaml"
os.makedirs(config_dir, exist_ok=True)
if not os.path.exists(config_dir + MODEL_CONFIG):
    print('Downloading config file...')
    wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/language_modeling/conf/' + MODEL_CONFIG, config_dir)
else:
    print ('config file is already exists')

config file is already exists


In [16]:
# this line will print the entire config of the model
config_path = f'{WORK_DIR}/configs/{MODEL_CONFIG}'
print(config_path)
config = OmegaConf.load(config_path)
config.model.num_layers = 24
config.model.hidden_size = 1024
config.model.ffn_hidden_size = 4096
config.model.num_attention_heads = 16
config.model.tokenizer.vocab_file = vocab_file
config.model.tokenizer.merge_file = merge_file
config.model.tensor_model_parallel_size = 1
config.model.data.data_prefix = ''
config.model.max_position_embeddings = 1024
config.model.data.seq_length = 1024
config.model.encoder_seq_length = 1024
config.cfg = {}
config.cfg.cfg = config.model
with open('hparams.yaml', 'w') as f:
    f.write(OmegaConf.to_yaml(config.cfg))

WORK_DIR/configs/megatron_gpt_config.yaml


In [17]:
import os
PWD = os.getcwd()
wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/language_modeling/megatron_lm_ckpt_to_nemo.py')
!python -m torch.distributed.run --nproc_per_node=1 megatron_lm_ckpt_to_nemo.py --checkpoint_folder=$PWD/release/mp_rank_00/ --checkpoint_name=$checkpoint_filename --hparams_file=$PWD/hparams.yaml --nemo_file_path=$PWD/gpt_344m.nemo --model_type=gpt --tensor_model_parallel_size=1

100% [.........................................................] 20898 / 20898[NeMo W 2022-04-15 06:08:53 experimental:27] Module <class 'nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers.MegatronPretrainingRandomBatchSampler'> is experimental, not ready for production and is not fully supported. Use at your own risk.
    
[NeMo I 2022-04-15 06:08:53 distributed:31] Initializing torch.distributed with local_rank: 0, rank: 0, world_size: 1
      rank_zero_warn(
    
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo I 2022-04-15 06:08:55 megatron_lm_ckpt_to_nemo:387] loading checkpoint /prompt-tuning/refactor/NeMo/tutorials/nlp/release/mp_rank_00/model_optim_rng.pt
converted 354.87M parameters
[NeMo W 2022-04-15 06:08:55 megatron_lm_ckpt_to_nemo:347] the checkpoint version is 0
      rank_zero_deprecation(
    
[NeMo I 2022-04-15 06:08:55 megatron_init:191] Rank

# Model configuration

Our P-Tuning text classification model is comprised of the pretrained GPT LM model followed by a prompt encoder layer.

The model is defined in a config file which declares multiple important sections. They are:
- **model**: All arguments that are related to the Model - language model, token classifier, optimizer and schedulers, datasets and any other related information

- **trainer**: Any argument to be passed to PyTorch Lightning

In [18]:
MODEL_CONFIG = "megatron_gpt_prompt_learning_config.yaml"

In [19]:
# download the model's configuration file 
BRANCH="continuous_prompt_refactor"
config_dir = WORK_DIR + '/configs/'
os.makedirs(config_dir, exist_ok=True)
if not os.path.exists(config_dir + MODEL_CONFIG):
    print('Downloading config file...')
    wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/language_modeling/conf/' + MODEL_CONFIG, config_dir)
else:
    print ('config file is already exists')

config file is already exists


In [20]:
# this line will print the entire config of the model
config_path = f'{WORK_DIR}/configs/{MODEL_CONFIG}'
print(config_path)
config = OmegaConf.load(config_path)
# Note: these are small batch-sizes - increase as appropriate to available GPU capacity
config.model.batch_size = 8

WORK_DIR/configs/megatron_gpt_prompt_learning_config.yaml


# Model Training
## Setting up Data within the config

Among other things, the config file contains dictionaries called train_ds, validation_ds and test_ds. These are configurations used to setup the Dataset.


In [21]:
# in this tutorial train and dev datasets are located in the same folder, so it is enough to add the path of the data directory to the config
config.model.data.train_ds = [DATA_DIR+'/mix/train.txt',]
config.model.data.validation_ds = [DATA_DIR+'/mix/validation.txt',]
config.model.data.test_ds = [DATA_DIR+'/mix/test.txt',]

## Add the Data Processors to Generate the Prompts

To customize different prompts for different tasks, we simply need to specify the prompt task template in the config file. The virtual token markers `<|VIRTUAL_PROMPT_#|>` signify where you want virtual tokens to be placed in the the template string. `<|VIRTUAL_PROMPT_0|>`, `<|VIRTUAL_PROMPT_1|>`, and `<|VIRTUAL_PROMPT_2|>` indicate where a number of virtual tokens matching the values given at `virtual_token_splits[0]`, `virtual_token_splits[1]` and `virtual_token_splits[2]` will be placed. The other variable fields `{var}` refer to the variables in the data record. For example:

Given the data record **{"sentence1": "And he said, Mama, I'm home.", "sentence2": "He didn't say a word."}**, along with `virtual_token_splits = [3, 3, 3]` and `prompt_template = "<|VIRTUAL_PROMPT_0|> Hypothesis: [sentence1], <|VIRTUAL_PROMPT_1|> Premise: [sentence2] <|VIRTUAL_PROMPT_2|> Answer:"`, the input will be translated into **<span style="color:red">VVV</span> Hypothesis: And he said, Mama, I'm home.<span style="color:red">VVV</span> Premise: He didn't say a word.<span style="color:red">VVV</span> Answer:**, where <span style="color:red">VVV</span> are three virtual tokens.

Let's configure the proper template for the two dataset we prepared:

In [22]:
  config.model.task_templates = [
    {
      "taskname": "qa-task",
      "prompt_template": "<|VIRTUAL_PROMPT_0|> Context: {context} <|VIRTUAL_PROMPT_1|> Question: {question}? <|VIRTUAL_PROMPT_2|> Answer: {label}",
      "total_virtual_tokens": 9,
      "virtual_token_splits":[3, 3, 3],
      "truncate_field": "content",
    },
    {
      "taskname": "sentiment-task",
      "prompt_template": "<|VIRTUAL_PROMPT_0|> Sentence: {sentence} <|VIRTUAL_PROMPT_1|> Sentiment: {label}",
      "total_virtual_tokens": 9,
      "virtual_token_splits":[6, 3],
      "truncate_field": "sentence",
    },
  ]

Note each `task_template` item has 5 fields. Besides the `prompt_template` string, the `taskname` refers to the `taskname` in the data record. The `truncate_field` specifies which field in the data is going to be cut if the length of the input exceeds the maximum sequence length of the model.`total_virtual_tokens` specifies the total number of virtual tokens that will be inserted into the model prompt. `virtual_token_splits` specifies the number of virtual tokens that belong at each `<|VIRTUAL_PROMPT_#|>` marker. `virtual_token_splits` values should add up to `total_virtual_tokens`. The number of `virtual_token_splits` should match the number of `<|VIRTUAL_PROMPT_#|>` markers. 

After you p-tune your model this time, you can always go back and p-tune your model on more tasks without over writting the virtual prompts who've trained this time. You can also use a different number of `total_virtual_tokens` between each training session as long as tasks ptuned at the same time have the same number of `total_virtual_tokens`. For this reason, you ptune on a new task, you need to tell your model which of your tasks are new and which ones already exist (and thus you don't want to tune them). 

You do this by setting the `new_tasks` and `existing_tasks` values in the config file. Because we are ptuning a model with no existing tasks, you should set `existing_tasks=[]` and `new_tasks=['qa-task', 'sentiment-task']` as follows:

In [23]:
config.model.new_tasks = ['qa-task', 'sentiment-task']
config.model.existing_tasks = []

After ptuning is complete, you can run inference on all tasks at the same time, regradless of their `total_virtual_tokens` value.

In [24]:
print(OmegaConf.to_yaml(config.model))

seed: 1234
nemo_path: ${name}.nemo
lm_finetune: false
pseudo_token_base: PROMPT_
virtual_prompt_style: p-tuning
encoder_seq_length: 2048
tensor_model_parallel_size: 1
pipeline_model_parallel_size: 1
batch_size: 8
restore_path: null
language_model_path: models/megatron_125M_gpt.nemo
existing_tasks: []
new_tasks:
- qa-task
- sentiment-task
task_templates:
- taskname: qa-task
  prompt_template: '<|VIRTUAL_PROMPT_0|> Context: {context} <|VIRTUAL_PROMPT_1|> Question:
    {question}? <|VIRTUAL_PROMPT_2|> Answer: {label}'
  total_virtual_tokens: 9
  virtual_token_splits:
  - 3
  - 3
  - 3
  truncate_field: content
- taskname: sentiment-task
  prompt_template: '<|VIRTUAL_PROMPT_0|> Sentence: {sentence} <|VIRTUAL_PROMPT_1|>
    Sentiment: {label}'
  total_virtual_tokens: 9
  virtual_token_splits:
  - 6
  - 3
  truncate_field: sentence
prompt_tuning:
  new_prompt_init_methods:
  - text
  new_prompt_init_text:
  - some init text goes here
p_tuning:
  dropout: 0.0
  num_layers: 2
  save_tuned_prom

## Building the PyTorch Lightning Trainer

NeMo models are primarily PyTorch Lightning modules - and therefore are entirely compatible with the PyTorch Lightning ecosystem.

Let's first instantiate a Trainer object

In [32]:
# lets modify some trainer configs
# checks if we have GPU available and uses it
accelerator = 'gpu' if torch.cuda.is_available() else 'cpu'
config.trainer.accelerator = accelerator
config.trainer.devices = 1
config.trainer.max_epochs = 3
config.trainer.val_check_interval = 1.0

# for PyTorch Native AMP set precision=16
config.trainer.precision = 16 if torch.cuda.is_available() else 32

# remove distributed training flags
config.trainer.strategy = None

trainer = pl.Trainer(**config.trainer)
#from nemo.collections.nlp.parts.nlp_overrides import NLPDDPPlugin
#trainer = pl.Trainer(plugins=[NLPDDPPlugin()], **config.trainer)

print("Trainer config - \n")
print(OmegaConf.to_yaml(config.trainer))

      rank_zero_deprecation(
    
      rank_zero_deprecation(
    
Using 16bit native Automatic Mixed Precision (AMP)


<nemo.collections.nlp.parts.nlp_overrides.NLPDDPPlugin object at 0x7f3b9dc19100>
<pytorch_lightning.strategies.launchers.subprocess_script._SubprocessScriptLauncher object at 0x7f3b9dac6280>


MisconfigurationException: `Trainer(strategy='ddp')` or `Trainer(accelerator='ddp')` is not compatible with an interactive environment. Run your code as a script, or choose one of the compatible strategies: Trainer(strategy=None|dp|tpu_spawn). In case you are spawning processes yourself, make sure to include the Trainer creation inside the worker function.

## Setting up a NeMo Experiment

NeMo has an experiment manager that handles logging and checkpointing for us, so let's use it:

In [26]:
exp_dir = exp_manager(trainer, config.get("exp_manager", None))
os.makedirs(WORK_DIR, exist_ok=True)

# the exp_dir provides a path to the current experiment for easy access
exp_dir = str(exp_dir)
exp_dir

      rank_zero_deprecation(
    
[NeMo W 2022-04-15 06:09:07 exp_manager:557] No version folders would be created under the log folder as 'resume_if_exists' is enabled.
[NeMo W 2022-04-15 06:09:07 exp_manager:409] There was no checkpoint folder at checkpoint_dir :/prompt-tuning/refactor/NeMo/tutorials/nlp/nemo_experiments/megatron_virtual_prompt_gpt/checkpoints. Training from scratch.


[NeMo I 2022-04-15 06:09:07 exp_manager:281] Experiments will be logged at /prompt-tuning/refactor/NeMo/tutorials/nlp/nemo_experiments/megatron_virtual_prompt_gpt
[NeMo I 2022-04-15 06:09:07 exp_manager:647] TensorboardLogger has been set up


      rank_zero_deprecation("`Trainer.weights_save_path` has been deprecated in v1.6 and will be removed in v1.8.")
    
[NeMo W 2022-04-15 06:09:07 exp_manager:881] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to -1. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.


'/prompt-tuning/refactor/NeMo/tutorials/nlp/nemo_experiments/megatron_virtual_prompt_gpt'

We will use the converted `.nemo` file as our LM model.

In [27]:
# add the specified above model parameters to the config
config.model.language_model_path = 'gpt_344m.nemo'
config.model.tensor_model_parallel_size = 1
config.exp_manager.checkpoint_callback_params.save_top_k = 1

Your final config looks like:

In [28]:
print("Model config - \n")
print(OmegaConf.to_yaml(config))

Model config - 

name: megatron_virtual_prompt_gpt
trainer:
  devices: 1
  accelerator: gpu
  num_nodes: 1
  precision: 16
  logger: false
  enable_checkpointing: false
  replace_sampler_ddp: false
  max_epochs: 3
  max_steps: null
  log_every_n_steps: 10
  val_check_interval: 1.0
  accumulate_grad_batches: 1
  gradient_clip_val: 1.0
  resume_from_checkpoint: null
  strategy: null
exp_manager:
  explicit_log_dir: null
  exp_dir: null
  name: ${name}
  create_wandb_logger: false
  wandb_logger_kwargs:
    project: null
    name: null
  resume_if_exists: true
  resume_ignore_no_checkpoint: true
  create_checkpoint_callback: true
  checkpoint_callback_params:
    monitor: val_loss
    save_top_k: 1
    mode: min
    save_nemo_on_train_end: true
    filename: megatron_gpt_prompt_tune--{val_loss:.3f}-{step}
    model_parallel_size: ${model.tensor_model_parallel_size}
    save_best_model: true
model:
  seed: 1234
  nemo_path: ${name}.nemo
  lm_finetune: false
  pseudo_token_base: PROMPT_
  v

Now, we are ready to initialize our model. During the model initialization call, the dataset and data loaders we'll be prepared for training and evaluation.

In [29]:
from nemo.collections.nlp.models.language_modeling.megatron_gpt_prompt_learning_model import MegatronGPTPromptLearningModel
model_ptune = MegatronGPTPromptLearningModel(cfg=config.model, trainer=trainer)

      rank_zero_deprecation(
    


[NeMo I 2022-04-15 06:09:07 megatron_init:191] Rank 0 has data parallel group: [0]
[NeMo I 2022-04-15 06:09:07 megatron_init:194] All data parallel group ranks: [[0]]
[NeMo I 2022-04-15 06:09:07 megatron_init:195] Ranks 0 has data parallel rank: 0
[NeMo I 2022-04-15 06:09:07 megatron_init:203] Rank 0 has model parallel group: [0]
[NeMo I 2022-04-15 06:09:07 megatron_init:204] All model parallel group ranks: [[0]]
[NeMo I 2022-04-15 06:09:07 megatron_init:214] Rank 0 has tensor model parallel group: [0]
[NeMo I 2022-04-15 06:09:07 megatron_init:218] All tensor model parallel group ranks: [[0]]
[NeMo I 2022-04-15 06:09:07 megatron_init:219] Rank 0 has tensor model parallel rank: 0
[NeMo I 2022-04-15 06:09:07 megatron_init:233] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2022-04-15 06:09:07 megatron_init:245] Rank 0 has embedding group: [0]
[NeMo I 2022-04-15 06:09:07 megatron_init:251] All pipeline model parallel group ranks: [[0]]
[NeMo I 2022-04-15 06:09:07 megatron_init:252]

Using sep_token, but it is not set yet.
Using cls_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using mask_token, but it is not set yet.


[NeMo I 2022-04-15 06:09:11 megatron_gpt_model:783] Padded vocab_size: 50304, original vocab_size: 50257, dummy tokens: 47.
[NeMo I 2022-04-15 06:09:13 nlp_overrides:404] Model MegatronGPTModel was successfully restored from /prompt-tuning/refactor/NeMo/tutorials/nlp/gpt_344m.nemo.
[NeMo I 2022-04-15 06:09:13 auto_tokenizer:171] 9 special tokens added, resize your model accordingly.


Using pad_token, but it is not set yet.
Using mask_token, but it is not set yet.


## Monitoring training progress
Optionally, you can create a Tensorboard visualization to monitor training progress.
If you're not using Colab, refer to [https://www.tensorflow.org/tensorboard/tensorboard_in_notebooks](https://www.tensorflow.org/tensorboard/tensorboard_in_notebooks) if you're facing issues with running the cell below.

In [30]:
try:
    from google import colab
    COLAB_ENV = True
except (ImportError, ModuleNotFoundError):
    COLAB_ENV = False

# Load the TensorBoard notebook extension
if COLAB_ENV:
    %load_ext tensorboard
    %tensorboard --logdir {exp_dir}
else:
    print("To use tensorboard, please use this notebook in a Google Colab environment.")

To use tensorboard, please use this notebook in a Google Colab environment.


In [31]:
# start model training
trainer.fit(model_ptune)

[NeMo I 2022-04-15 06:09:14 gpt_prompt_learning_dataset:60] Loading and tokenizing dataset ... 


13538it [00:14, 939.62it/s] 

[NeMo I 2022-04-15 06:09:28 gpt_prompt_learning_dataset:134] Skipped 0 sentences, sequence length too short or too long even after truncation
[NeMo I 2022-04-15 06:09:28 gpt_prompt_learning_dataset:60] Loading and tokenizing dataset ... 



118101it [01:57, 1004.93it/s]

[NeMo I 2022-04-15 06:11:26 gpt_prompt_learning_dataset:134] Skipped 0 sentences, sequence length too short or too long even after truncation
[NeMo I 2022-04-15 06:11:26 gpt_prompt_learning_dataset:60] Loading and tokenizing dataset ... 



13460it [00:13, 1002.63it/s]

[NeMo I 2022-04-15 06:11:39 gpt_prompt_learning_dataset:134] Skipped 0 sentences, sequence length too short or too long even after truncation



LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
[NeMo W 2022-04-15 06:11:40 modelPT:496] The lightning trainer received accelerator: <pytorch_lightning.accelerators.gpu.GPUAccelerator object at 0x7f3b9dc3ddc0>. We recommend to use 'ddp' instead.


[NeMo I 2022-04-15 06:11:40 modelPT:587] Optimizer config = FusedAdam (
    Parameter Group 0
        betas: [0.9, 0.98]
        bias_correction: True
        eps: 1e-08
        lr: 0.0001
        weight_decay: 0.01
    )
[NeMo I 2022-04-15 06:11:40 lr_scheduler:833] Scheduler "<nemo.core.optim.lr_scheduler.CosineAnnealing object at 0x7f3b9dabbf40>" 
    will be used during training (effective maximum steps = 44286) - 
    Parameters : 
    (warmup_steps: 50
    constant_steps: 10
    min_lr: 1.0e-06
    max_steps: 44286
    )



  | Name            | Type                   | Params
-----------------------------------------------------------
0 | model           | MegatronGPTModel       | 354 M 
1 | word_embeddings | VocabParallelEmbedding | 51.5 M
2 | prompt_table    | PromptTable            | 0     
3 | prompt_encoder  | PromptEncoder          | 14.7 M
-----------------------------------------------------------
14.7 M    Trainable params
354 M     Non-trainable params
369 M     Total params
739.158   Total estimated model params size (MB)


Sanity Checking: 0it [00:00, ?it/s]

      rank_zero_warn(
    


tensor([[50257, 50258, 50259,  ..., 50256, 50256, 50256],
        [50257, 50258, 50259,  ..., 50256, 50256, 50256],
        [50257, 50258, 50259,  ..., 50256, 50256, 50256],
        ...,
        [50257, 50258, 50259,  ..., 50256, 50256, 50256],
        [50257, 50258, 50259,  ..., 50256, 50256, 50256],
        [50257, 50258, 50259,  ..., 50256, 50256, 50256]], device='cuda:0')
tensor([[20402,    12, 35943, 50256],
        [20402,    12, 35943, 50256],
        [20402,    12, 35943, 50256],
        [20402,    12, 35943, 50256],
        [34086,  3681,    12, 35943],
        [20402,    12, 35943, 50256],
        [20402,    12, 35943, 50256],
        [20402,    12, 35943, 50256]], device='cuda:0')
tensor([[ True,  True,  True,  ..., False, False, False],
        [ True,  True,  True,  ..., False, False, False],
        [ True,  True,  True,  ..., False, False, False],
        ...,
        [ True,  True,  True,  ..., False, False, False],
        [ True,  True,  True,  ..., False, False, Fals

AssertionError: intra_layer_model parallel group is not initialized

# Inference

To see how the model performs, we can run model in the inference mode

In [None]:
# let's first select a subset of our dev data
query_examples = [
    {"taskname": "sentiment-task", "sentence": "The Finland-based company says it will move into an existing 260,000-square-foot facility in September ."},
    {"taskname": "qa-task", "question": "What are the closest relatives of the deinonychosaurs?", "context": "The consensus view in contemporary paleontology is that the flying theropods, or avialans, are the closest relatives of the deinonychosaurs, which include dromaeosaurids and troodontids. Together, these form a group called Paraves. Some basal members of this group, such as Microraptor, have features which may have enabled them to glide or fly. The most basal deinonychosaurs were very small. This evidence raises the possibility that the ancestor of all paravians may have been arboreal, have been able to glide, or both. Unlike Archaeopteryx and the non-avialan feathered dinosaurs, who primarily ate meat, recent studies suggest that the first avialans were omnivores.", "label": "flying theropods"}
]
response = model_ptune.generate(inputs=query_examples)

print('The prediction results of some sample queries with the trained model:')
for result in response['sentences']:
    print(result)
    print("-" * 30)

## Training Script

If you have NeMo installed locally, you can also train the model with `examples/nlp/language_modeling/megatron_gpt_prompt_learning.py`.

To run training script, first change the values in `examples/nlp/language_modeling/conf/megatron_gpt_prompt_learning_config.yaml` to your desired values, then run:
```
python examples/nlp/language_modeling/megatron_gpt_prompt_learning.py \
    --config-name=prompt_learning_megatron_gpt_inference.yaml
```