In [None]:
BRANCH='main'

In [None]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
# If you're using Google Colab and not running locally, run this cell

# install NeMo
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[nlp]

In [72]:
from nemo.collections import nlp as nemo_nlp
from nemo.utils.exp_manager import exp_manager

import os
import wget 
import torch
import pytorch_lightning as pl
from omegaconf import OmegaConf

In the era of very large language models, the traditional "pre-train, fine-tune" procedure is giving way to the "pre-train, prompt, and predict" method described in this [survey paper](https://arxiv.org/pdf/2107.13586.pdf). Prompting is versatile enough to support many kinds of NLP tasks, as shown in the following table: 

<table>
    <thead>
        <tr>
            <th>Type</th>
            <th>Task</th>
            <th>Input ([X])</th>
            <th>Template</th>
            <th>Answer([Y])</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td rowspan=3>Text CLS</td>
            <td>Sentiment</td>
            <td>I love this movie.</td>
            <td>[X] The movie is [Y]</td>
            <td>great<br>fantastic<br>...</td>
        </tr>
        <tr>
            <td>Topics</td>
            <td>He prompted the LM.</td>
            <td>[X] The text is about [Y]</td>
            <td>sports<br>science<br>...</td>
        </tr>
        <tr>
            <td>Intention</td>
            <td>What is taxi fare to Denver?</td>
            <td>[X] The question is about [Y]</td>
            <td>quantity<br>city<br>...</td>
        </tr>
        <tr>
            <td rowspan=1>Text-span CLS</td>
            <td>Aspect<br>Sentiment</td>
            <td>Poor service but good food.</td>
            <td>[X] What about service? [Y]</td>
            <td>Bad<br>Terrible<br>...</td>
        </tr>
        <tr>
            <td rowspan=1>Text-pair CLS</td>
            <td>NLI</td>
            <td>[X1]: An old man with ...<br>[X2]: A man walks ...</td>
            <td>Hypothesis: [X1], Premise: [X2], Answer: [Y]</td>
            <td>Contradiction<br>Entailment<br>...</td>
        </tr>
        <tr>
            <td rowspan=1>Tagging</td>
            <td>NER</td>
            <td>[X1]: Mike went to Paris.<br>[X2]: Paris</td>
            <td>[X1] [X2] is a [Y]</td>
            <td>Yes<br>No<br>...</td>
        </tr>
        <tr>
            <td rowspan=2>Text Generation</td>
            <td>Summarization</td>
            <td>Las Vegas police ...</td>
            <td>[X] TL;DR: [Y]</td>
            <td>The victim ...<br>A woman ...<br>...</td>
        </tr>
        <tr>
            <td>Translation</td>
            <td>Je vous aime.</td>
            <td>French [X] English: [Y]</td>
            <td>I love you.<br>I fancy you.<br>...</td>
        </tr>
    </tbody>
</table>
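
To make the table concrete, here is a minimal sketch of how a template turns an input into the text that the language model is asked to continue. The helper `fill_template` is a hypothetical name used only for illustration; it is not part of NeMo.

```python
def fill_template(template: str, inputs: dict) -> str:
    """Substitute input slots such as [X] or [X1] and stop before the answer slot [Y]."""
    prompt = template
    for slot, text in inputs.items():
        prompt = prompt.replace(f"[{slot}]", text)
    # Everything before [Y] is the prompt; the model is expected to generate the answer
    return prompt.split("[Y]")[0].rstrip()


sentiment_prompt = fill_template("[X] The movie is [Y]", {"X": "I love this movie."})
ner_prompt = fill_template("[X1] [X2] is a [Y]", {"X1": "Mike went to Paris.", "X2": "Paris"})
print(sentiment_prompt)
print(ner_prompt)
```

The templates in the table are written by hand. P-Tuning instead learns the prompt as continuous embeddings, as described next.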

In this tutorial, we describe how to use the [P-Tuning method](https://arxiv.org/pdf/2103.10385.pdf), one of the prompt engineering methods, to find good prompts for large GPT models, and we show that it can solve multiple downstream NLP tasks with good performance. P-Tuning uses a small number of continuous free parameters as prompts that are fed as input to the pre-trained language model. Because the weights of the large language model stay frozen, a P-Tuning model can be trained efficiently while delivering state-of-the-art performance. 

Large language models with up to multiple billions of parameters can be trained with [NeMo Megatron](https://github.com/NVIDIA/NeMo/tree/main/examples/nlp/language_modeling). In this notebook, we will use the pre-trained 344M GPT model released on NGC.

# Task Description
The P-Tuning method can be applied to various NLP tasks. Without loss of generality, this notebook uses P-Tuning to solve two of them: **Sentiment Analysis** and **Question and Answer**. 

**Sentiment Analysis** task is also known as opinion mining or emotion AI. It is a sub-field of NLP that tries to identify and extract opinions within a given text across blogs, reviews, social media, forums, news etc. 

For instance, **given a sentence from a news headline, is it good news or bad news?**<br>

**Question and Answer** task is to find the answer to a question given the context text. 

For instance, 
```
Context: 
Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".
Question:
How many Grammy awards did Beyoncé win for her first solo album?
```

# Dataset
We will use [Financial PhraseBank dataset](https://huggingface.co/datasets/financial_phrasebank) for sentiment analysis task and [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) for question and answer task.

The [Financial PhraseBank dataset](https://huggingface.co/datasets/financial_phrasebank) contains the sentiments for financial news headlines from the perspective of a retail investor. Further details about the dataset can be found in: Malo, P., Sinha, A., Takala, P., Korhonen, P. and Wallenius, J. (2014): “Good debt or bad debt: Detecting semantic orientations in economic texts.” Journal of the American Society for Information Science and Technology.

Here's an example of what annotated sentences from the corpus look like:

```
HELSINKI Thomson Financial - Shares in Cargotec fell sharply in early afternoon trade after the cargo handling group posted a surprise drop in April-June profits , which overshadowed the large number of new orders received during the three months .@negative
LONDON MarketWatch -- Share prices ended lower in London Monday as a rebound in bank stocks failed to offset broader weakness for the FTSE 100 .@negative
Operating profit fell to EUR 35.4 mn from EUR 68.8 mn in 2007 , including vessel sales gain of EUR 12.3 mn .@negative
Sales in Finland decreased by 10.5 % in January , while sales outside Finland dropped by 17 % .@negative
```

[SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to each question is a segment of text, or span, from the corresponding reading passage, or the question may be unanswerable.

Let's download the dataset.

In [24]:
DATA_DIR = "DATA_DIR"
os.makedirs(DATA_DIR, exist_ok=True)

## Downloading Financial Phrase Bank Dataset

The dataset was collected by Malo et al. (2014) and can be downloaded from this [link](https://www.researchgate.net/profile/Pekka_Malo/publication/251231364_FinancialPhraseBank-v10/data/0c96051eee4fb1d56e000000/FinancialPhraseBank-v10.zip). The zip file for the Financial Phrase Bank Dataset has been provided for ease of download and use.

In [25]:
!wget https://www.researchgate.net/profile/Pekka_Malo/publication/251231364_FinancialPhraseBank-v10/data/0c96051eee4fb1d56e000000/FinancialPhraseBank-v10.zip
!unzip FinancialPhraseBank-v10.zip -d {DATA_DIR}

--2022-02-07 20:38:46--  https://www.researchgate.net/profile/Pekka_Malo/publication/251231364_FinancialPhraseBank-v10/data/0c96051eee4fb1d56e000000/FinancialPhraseBank-v10.zip
Resolving www.researchgate.net (www.researchgate.net)... 104.17.32.105, 104.17.33.105, 2606:4700::6811:2169, ...
Connecting to www.researchgate.net (www.researchgate.net)|104.17.32.105|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.researchgate.net/profile/Pekka-Malo/publication/251231364_FinancialPhraseBank-v10/data/0c96051eee4fb1d56e000000/FinancialPhraseBank-v10.zip [following]
--2022-02-07 20:38:46--  https://www.researchgate.net/profile/Pekka-Malo/publication/251231364_FinancialPhraseBank-v10/data/0c96051eee4fb1d56e000000/FinancialPhraseBank-v10.zip
Reusing existing connection to www.researchgate.net:443.
HTTP request sent, awaiting response... 200 OK
Length: 681890 (666K) [application/zip]
Saving to: ‘FinancialPhraseBank-v10.zip.2’


2022-02-07 20:38:

In [26]:
# If you want to see more examples, you can explore the text of the corpus using the file browser to the left, or open files directly, for example typing a command like the following in a code-cell:

! head -1 $DATA_DIR/FinancialPhraseBank-v1.0/Sentences_50Agree.txt

According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .@neutral


## Downloading the SQuAD Dataset

Download a copy of the dataset (distributed under the CC BY-SA 4.0 license):

In [32]:
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
!mv train-v2.0.json {DATA_DIR}

--2022-02-07 20:48:05--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.111.153, 185.199.108.153, 185.199.109.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.111.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42123633 (40M) [application/json]
Saving to: ‘train-v2.0.json.1’


2022-02-07 20:48:34 (1.38 MB/s) - ‘train-v2.0.json.1’ saved [42123633/42123633]



## Pre-process Financial Phrase Bank Dataset

In this pre-processing step, we convert the downloaded dataset into the format expected by the P-Tuning dataloader. The data is split into 10 folds so that we can do 10-fold cross-validation. In this notebook, we will use the first fold.
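
The fold rotation implemented by `gen_fold` in the next cell can be sketched in isolation. For fold `k`, chunk `k` is the test set, the chunk before it (wrapping around) is the validation set, and the remaining eight chunks form the training set. `fold_assignment` is a hypothetical helper name used only for this illustration.

```python
def fold_assignment(fold_number: int, n_folds: int = 10):
    """Return (train_chunk_ids, validation_chunk_id, test_chunk_id) for one fold."""
    test_id = fold_number % n_folds
    val_id = (fold_number - 1) % n_folds  # Python's modulo wraps -1 around to n_folds - 1
    train_ids = [i for i in range(n_folds) if i not in (test_id, val_id)]
    return train_ids, val_id, test_id


for k in range(3):
    print(k, fold_assignment(k))
```

Every chunk serves as the test set exactly once across the 10 folds, which is what makes the cross-validation estimate use all of the data.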

In [90]:
import json
import random

random.seed(1234)
files = ['Sentences_50Agree.txt', 'Sentences_66Agree.txt', 'Sentences_75Agree.txt', 'Sentences_AllAgree.txt']
base_dir = DATA_DIR + '/FinancialPhraseBank-v1.0/'
files = [base_dir + f for f in files]

alllines = []
for fn in files:
    with open(fn, 'r', encoding="ISO-8859-1") as f:
        alllines.extend(f.readlines())

random.shuffle(alllines)
fold = 10
fold_size = len(alllines) // fold

# derive the chunk boundaries from the data instead of hard-coding the corpus size
chunk_start = list(range(0, fold * fold_size, fold_size))

chunks = []

for start_id in chunk_start:
    chunks.append(alllines[start_id:start_id+fold_size])

def gen_file(data, fold_id, split_type):
    filename = "{}/{}_{}.txt".format(base_dir, split_type, fold_id)
    with open(filename, 'w') as f:
        obj = {}
        for line in data:
            # the label follows the last '@', so split from the right in case the sentence contains '@'
            splits = line.rsplit('@', 1)
            part1 = splits[0].strip()
            part2 = splits[1].strip()
            obj['sentence'] = part1
            obj['label'] = part2
            obj['prompt_tag'] = 'sentiment-task'
            f.write(json.dumps(obj)+'\n')


def gen_fold(fold_number):
    lists = list(range(fold))
    test_id = (fold_number + fold) % fold
    val_id = (fold_number + fold - 1) % fold
    test_set = chunks[test_id]
    val_set = chunks[val_id]
    lists.remove(test_id)
    lists.remove(val_id)
    train_set = []
    for idd in lists:
        train_set += chunks[idd]
    gen_file(train_set, fold_number, 'train')
    gen_file(val_set, fold_number, 'validation')
    gen_file(test_set, fold_number, 'test')

gen_fold(0)

The data is converted to a JSON Lines file, with one JSON object per line. Each line has three keys: "sentence", "label" and "prompt_tag". 
Here are the first two lines of converted data:

In [30]:
!head -n 2 $DATA_DIR/FinancialPhraseBank-v1.0/train_0.txt

{"sentence": "The contract includes heating plant equipment and associated installation work .", "label": "neutral", "prompt_tag": "sentiment-task"}
{"sentence": "The utility will also provide services related to electricity management , such as hedging trades and risk management and reporting .", "label": "neutral", "prompt_tag": "sentiment-task"}
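
Before training, it can be useful to check how the labels are distributed in a converted file. The sketch below counts labels in a few in-memory records taken from the examples shown in this notebook; `label_counts` is a hypothetical helper, and in practice you would pass it the lines of `train_0.txt`.

```python
import json
from collections import Counter


def label_counts(lines):
    """Count the 'label' field over an iterable of JSON Lines records."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            counts[json.loads(line)["label"]] += 1
    return counts


sample_lines = [
    '{"sentence": "The contract includes heating plant equipment and associated installation work .", "label": "neutral", "prompt_tag": "sentiment-task"}',
    '{"sentence": "The utility will also provide services related to electricity management , such as hedging trades and risk management and reporting .", "label": "neutral", "prompt_tag": "sentiment-task"}',
    '{"sentence": "Sales in Finland decreased by 10.5 % in January , while sales outside Finland dropped by 17 % .", "label": "negative", "prompt_tag": "sentiment-task"}',
]
counts = label_counts(sample_lines)
print(counts)
```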


### Preprocess SQuAD Dataset


In [96]:
file_name = DATA_DIR + '/train-v2.0.json'
with open(file_name, 'r') as f:
    data_obj = json.load(f)

articles = data_obj['data']
test_len = 40
validation_len = 40
train_len = len(articles) - test_len - validation_len
train_records = []
validation_records = []
test_records = []


def get_records(sub_articles, records):
    # flatten each article into one JSON record per question
    for article in sub_articles:
        paragraphs = article['paragraphs']
        for paragraph in paragraphs:
            qas = paragraph['qas']
            context = paragraph['context']
            for qa in qas:
                record = {}
                record['question'] = qa['question']
                record['context'] = context
                if qa['is_impossible']:
                    record['label'] = 'NA'
                else:
                    record['label'] = qa['answers'][0]['text']
                record['prompt_tag'] = 'qa-task'
                records.append(json.dumps(record))
get_records(articles[:train_len], train_records)
get_records(articles[train_len:train_len+validation_len], validation_records)
get_records(articles[train_len+validation_len:], test_records)
random.shuffle(train_records)
random.shuffle(validation_records)
random.shuffle(test_records)
squad_dir = f"{DATA_DIR}/squad"
os.makedirs(squad_dir, exist_ok=True)
with open(squad_dir+'/train.txt', 'w') as f:
    f.write("\n".join(train_records))
with open(squad_dir+'/validation.txt', 'w') as f:
    f.write("\n".join(validation_records))
with open(squad_dir+'/test.txt', 'w') as f:
    f.write("\n".join(test_records))



In [97]:
!head -n 2 {squad_dir}/train.txt

{"question": "Where does the Downeaster service to maine start?", "context": "Amtrak's Northeast Corridor and Chicago lines originate at South Station, which serves as a major intermodal transportation hub, and stop at Back Bay. Fast Northeast Corridor trains, which serve New York City, Washington, D.C., and points in between, also stop at Route 128 Station in the southwestern suburbs of Boston. Meanwhile, Amtrak's Downeaster service to Maine originates at North Station, despite the current lack of a dedicated passenger rail link between the two railhubs, other than the \"T\" subway lines.", "label": "North Station", "prompt_tag": "qa-task"}
{"question": "What period do Italian historians believe came immediately after the High Period of the Middle Ages?", "context": "The changes brought about by these developments have led many scholars to view this period as the end of the Middle Ages and beginning of modern history and early modern Europe. However, the division is somewhat artificia
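
Because the answer to every answerable SQuAD question is a span of the passage, one quick sanity check on the converted records is that each label is either `NA` or a literal substring of its context. The sketch below runs that check on small hand-written records: the first uses a shortened version of the context shown above, and the other two are hypothetical. `label_is_consistent` is an illustrative helper, not part of NeMo.

```python
def label_is_consistent(record: dict) -> bool:
    """True if the label is the 'NA' marker or occurs verbatim in the context."""
    return record["label"] == "NA" or record["label"] in record["context"]


context = "Amtrak's Downeaster service to Maine originates at North Station."
records = [
    {"question": "Where does the Downeaster service to maine start?",
     "context": context, "label": "North Station", "prompt_tag": "qa-task"},
    # hypothetical unanswerable question
    {"question": "Which airline serves North Station?",
     "context": context, "label": "NA", "prompt_tag": "qa-task"},
    # hypothetical corrupted record: the label is not a span of the context
    {"question": "Where does the Downeaster service to maine start?",
     "context": context, "label": "South Station", "prompt_tag": "qa-task"},
]
checks = [label_is_consistent(r) for r in records]
print(checks)
```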

### Combine the two datasets

In [98]:
mix_data_dir = f"{DATA_DIR}/mix"
os.makedirs(mix_data_dir, exist_ok=True)
!cat $DATA_DIR/FinancialPhraseBank-v1.0/train_0.txt {squad_dir}/train.txt | shuf > {mix_data_dir}/train.txt
!cat $DATA_DIR/FinancialPhraseBank-v1.0/validation_0.txt {squad_dir}/validation.txt | shuf > {mix_data_dir}/validation.txt
!cat $DATA_DIR/FinancialPhraseBank-v1.0/test_0.txt {squad_dir}/test.txt | shuf > {mix_data_dir}/test.txt

In [99]:
!head -n 10 {mix_data_dir}/train.txt

{"question": "Did Hampshire Constabulary record fewer or more crime incidents in 2009/10 than the year before?", "context": "According to Hampshire Constabulary figures, Southampton is currently safer than it has ever been before, with dramatic reductions in violent crime year on year for the last three years. Data from the Southampton Safer City Partnership shows there has been a reduction in all crimes in recent years and an increase in crime detection rates. According to government figures Southampton has a higher crime rate than the national average. There is some controversy regarding comparative crime statisitics due to inconsistencies between different police forces recording methodologies. For example, in Hampshire all reported incidents are recorded and all records then retained. However, in neighbouring Dorset crimes reports withdrawn or shown to be false are not recorded, reducing apparent crime figures. In the violence against the person category, the national average is 16
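
The `cat ... | shuf` commands above rely on shell tools that may not be available on every platform. A rough pure-Python equivalent is sketched below; `mix_files` is a hypothetical helper, and the fixed seed is an assumption added for reproducibility (plain `shuf` is not seeded).

```python
import random
import tempfile
from pathlib import Path


def mix_files(in_paths, out_path, seed=1234):
    """Concatenate JSON Lines files, shuffle the records, and write them out."""
    lines = []
    for path in in_paths:
        text = Path(path).read_text(encoding="utf-8")
        # splitting on "\n" tolerates a missing trailing newline on the last record
        lines.extend(line for line in text.split("\n") if line.strip())
    random.Random(seed).shuffle(lines)
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


# Small self-contained demonstration using temporary files
with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp, "a.txt")
    a.write_text('{"id": 1}\n{"id": 2}\n', encoding="utf-8")
    b = Path(tmp, "b.txt")
    b.write_text('{"id": 3}', encoding="utf-8")  # no trailing newline
    out = Path(tmp, "mix.txt")
    n_mixed = mix_files([a, b], out)
    mixed_lines = [l for l in out.read_text(encoding="utf-8").split("\n") if l]

print(n_mixed, mixed_lines)
```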

## Add the Data Processor to generate the prompted input

In [67]:
from nemo.collections.nlp.data.glue_benchmark.gpt_ptune_dataset import DataProcessor, register_taskdata_processor
from typing import Dict, List
from nemo.collections.common.tokenizers.tokenizer_spec import TokenizerSpec


class SentimentProcessor(DataProcessor):
    """Processor for the sentiment analysis data set."""

    def __init__(self):
        super().__init__()

    def get_ptune_query(
        self, content: Dict, prompt_token_id: int, max_seq_len: int, templates: List[int], tokenizer: TokenizerSpec,
    ):
        text_a = content['sentence']
        sentence_a = f" Sentence: {text_a}"
        sentence_b = f" Sentiment:"
        a_input_token_ids = tokenizer.text_to_ids(sentence_a)
        b_input_token_ids = tokenizer.text_to_ids(sentence_b)
        cut = 0
        total_num_ids = len(a_input_token_ids) + len(b_input_token_ids) + sum(templates)
        if total_num_ids > max_seq_len:
            cut = total_num_ids - max_seq_len
        return (
            [prompt_token_id] * (templates[0] + templates[1])
            + a_input_token_ids[cut:]
            + [prompt_token_id] * templates[2]
            + b_input_token_ids
        )

    def label2string(self, label):
        return ' ' + label

class QAProcessor(DataProcessor):
    """Processor for the question and answer data set."""

    def __init__(self):
        super().__init__()

    def get_ptune_query(
        self, content: Dict, prompt_token_id: int, max_seq_len: int, templates: List[int], tokenizer: TokenizerSpec,
    ):
        text_a = content['context']
        text_b = content['question']

        sentence_a = f" Context: {text_a}"
        sentence_b = f" Question: {text_b}?"
        sentence_c = f" Answer:"
        a_input_token_ids = tokenizer.text_to_ids(sentence_a)
        b_input_token_ids = tokenizer.text_to_ids(sentence_b)
        c_input_token_ids = tokenizer.text_to_ids(sentence_c)
        cut = 0
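        # count all three text segments plus the prompt tokens against the length budget
        total_num_ids = (
            len(a_input_token_ids) + len(b_input_token_ids) + len(c_input_token_ids) + sum(templates)
        )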
        if total_num_ids > max_seq_len:
            cut = total_num_ids - max_seq_len
        return (
            [prompt_token_id] * templates[0]
            + a_input_token_ids[cut:]
            + [prompt_token_id] * templates[1]
            + b_input_token_ids
            + [prompt_token_id] * templates[2]
            + c_input_token_ids
        )

    def label2string(self, label):
        return ' ' + label



In [94]:
register_taskdata_processor("qa-task", QAProcessor())
register_taskdata_processor("sentiment-task", SentimentProcessor())
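
To see what `get_ptune_query` produces, the sketch below reproduces the layout logic of `SentimentProcessor` with a whitespace "tokenizer" and a string placeholder for the prompt token. Both are assumptions made for readability: the real processor works on GPT-2 BPE token ids and an integer `prompt_token_id`.

```python
from typing import List

PROMPT = "<P>"  # stand-in for the pseudo prompt token id


def layout_sentiment(sentence: str, templates: List[int], max_seq_len: int) -> List[str]:
    """Mirror SentimentProcessor.get_ptune_query with a toy whitespace tokenizer."""
    a_tokens = f"Sentence: {sentence}".split()
    b_tokens = ["Sentiment:"]
    total = len(a_tokens) + len(b_tokens) + sum(templates)
    cut = max(0, total - max_seq_len)  # number of sentence tokens to drop from the front
    return (
        [PROMPT] * (templates[0] + templates[1])
        + a_tokens[cut:]
        + [PROMPT] * templates[2]
        + b_tokens
    )


seq = layout_sentiment("Sales in Finland decreased", [3, 3, 3], max_seq_len=64)
short = layout_sentiment("Sales in Finland decreased", [3, 3, 3], max_seq_len=12)
print(" ".join(seq))
print(" ".join(short))
```

With the template `[3, 3, 3]`, six prompt tokens precede the sentence and three sit between the sentence and the `Sentiment:` cue. When the sequence is too long, tokens are removed from the start of the sentence, so the prompt tokens and the final cue always survive.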

## Convert the Megatron-LM Weights to a NeMo File

The P-Tuning method works best with large GPT language models. In our experience, models with 5B parameters or more give good performance. If you already have a large GPT model ready, skip this section. 

In this example, we will use the pretrained 344M GPT model from the [Megatron-LM project](https://github.com/NVIDIA/Megatron-LM). To load it in NeMo Megatron, we first need to convert the Megatron-LM checkpoint to a `.nemo` file. Let's download the pretrained model weights and vocabulary file.



In [69]:
import pathlib
gpt_file = 'megatron_lm_345m_v0.0.zip'
vocab_file = 'gpt2-vocab.json'
merge_file = 'gpt2-merge.txt'
checkpoint_filename = 'model_optim_rng.pt'

if not pathlib.Path(gpt_file).exists():
    !wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O $gpt_file
    # -o overwrites existing files; -f would only freshen files that are already on disk
    !unzip -o $gpt_file
    !wget https://s3.amazonaws.com/models.huggingface.co/bert/$vocab_file -O $vocab_file 
    !wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -O $merge_file



In [73]:
WORK_DIR = "WORK_DIR"
os.makedirs(WORK_DIR, exist_ok=True)

# Prepare the model parameters 
# download the model's configuration file 
config_dir = WORK_DIR + '/configs/'
MODEL_CONFIG = "megatron_gpt_config.yaml"
os.makedirs(config_dir, exist_ok=True)
if not os.path.exists(config_dir + MODEL_CONFIG):
    print('Downloading config file...')
    wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/language_modeling/conf/' + MODEL_CONFIG, config_dir)
else:
    print('Config file already exists')

Config file already exists


In [74]:
# load the model config and override the architecture settings to match the downloaded Megatron-LM checkpoint
config_path = f'{WORK_DIR}/configs/{MODEL_CONFIG}'
print(config_path)
config = OmegaConf.load(config_path)
config.model.num_layers = 24
config.model.hidden_size = 1024
config.model.ffn_hidden_size = 4096
config.model.num_attention_heads = 16
config.model.tokenizer.vocab_file = vocab_file
config.model.tokenizer.merge_file = merge_file
config.model.tensor_model_parallel_size = 1
config.model.data.data_prefix = ''
config.model.max_position_embeddings = 1024
config.model.data.seq_length = 1024
config.model.encoder_seq_length = 1024
config.cfg = {}
config.cfg.cfg = config.model
with open('hparams.yaml', 'w') as f:
    f.write(OmegaConf.to_yaml(config.cfg))

WORK_DIR/configs/megatron_gpt_config.yaml
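
As a rough cross-check of the architecture settings above, the parameter count of a GPT-style transformer can be estimated from the layer sizes. The sketch assumes the standard GPT-2 layout (fused QKV projection, two-layer MLP, two LayerNorms per block, biases everywhere, and an output layer tied to the word embeddings) and the padded vocabulary size of 50304 that NeMo reports when the model is loaded. `gpt_param_count` is a hypothetical helper, not a NeMo function.

```python
def gpt_param_count(num_layers, hidden_size, ffn_hidden_size,
                    padded_vocab_size, max_position_embeddings):
    """Estimate the parameter count of a GPT-2 style decoder (assumed layout)."""
    h, f = hidden_size, ffn_hidden_size
    word_emb = padded_vocab_size * h
    pos_emb = max_position_embeddings * h
    attention = (3 * h * h + 3 * h) + (h * h + h)  # fused QKV plus output projection
    mlp = (h * f + f) + (f * h + h)                # up- and down-projection
    layer_norms = 2 * (2 * h)                      # weight and bias for two LayerNorms
    per_layer = attention + mlp + layer_norms
    final_layer_norm = 2 * h
    return word_emb + pos_emb + num_layers * per_layer + final_layer_norm


total = gpt_param_count(
    num_layers=24,
    hidden_size=1024,
    ffn_hidden_size=4096,
    padded_vocab_size=50304,
    max_position_embeddings=1024,
)
print(f"{total:,} parameters, of which {50304 * 1024:,} are word embeddings")
```

Under these assumptions the estimate comes to roughly 355 million parameters, in line with the `354 M` and `51.5 M` figures in the model summary printed when training starts.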


In [None]:
import os
PWD = os.getcwd()
wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/language_modeling/megatron_lm_ckpt_to_nemo.py')
!python -m torch.distributed.run --nproc_per_node=1 megatron_lm_ckpt_to_nemo.py --checkpoint_folder=$PWD/release/mp_rank_00/ --checkpoint_name=$checkpoint_filename --hparams_file=$PWD/hparams.yaml --nemo_file_path=$PWD/gpt_344m.nemo --model_type=gpt --tensor_model_parallel_size=1

# Model configuration

Our P-Tuning model consists of the pretrained GPT language model, whose weights stay frozen, and a trainable prompt encoder that produces the continuous prompt embeddings fed to it.

The model is defined in a config file that declares several important sections, including:
- **model**: all arguments related to the model (language model, prompt encoder, optimizer and schedulers, datasets, and any other related information)

- **trainer**: any argument to be passed to PyTorch Lightning

In [75]:
MODEL_CONFIG = "megatron_ptune_gpt.yaml"

In [None]:
# download the model's configuration file 
config_dir = WORK_DIR + '/configs/'
os.makedirs(config_dir, exist_ok=True)
if not os.path.exists(config_dir + MODEL_CONFIG):
    print('Downloading config file...')
    wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/language_modeling/conf/' + MODEL_CONFIG, config_dir)
else:
    print('Config file already exists')

In [77]:
# load the P-Tuning config; use print(OmegaConf.to_yaml(config)) to inspect all of it
config_path = f'{WORK_DIR}/configs/{MODEL_CONFIG}'
print(config_path)
config = OmegaConf.load(config_path)
# Note: these are small batch-sizes - increase as appropriate to available GPU capacity
config.model.data.train_ds.batch_size=8
config.model.data.validation_ds.batch_size=8

WORK_DIR/configs/megatron_ptune_gpt.yaml


# Model Training
## Setting up Data within the config

Among other things, the config file contains dictionaries called `train_ds`, `validation_ds` and `test_ds` under `model.data`. Each one configures the Dataset and DataLoader of the corresponding split.


In [79]:
# point the config at the mixed train, validation and test files created above
#config.model.dataset.classes = ['positive', 'neutral', 'negative']
config.model.data.train_ds.file_path = DATA_DIR+'/mix/train.txt'
config.model.data.validation_ds.file_path = DATA_DIR+'/mix/validation.txt'
config.model.data.test_ds.file_path = DATA_DIR+'/mix/test.txt'


# if you want to decrease the size of your datasets, uncomment the lines below:
# NUM_SAMPLES = 1000
# config.model.train_ds.num_samples = NUM_SAMPLES
# config.model.validation_ds.num_samples = NUM_SAMPLES

In [80]:
print(OmegaConf.to_yaml(config))

name: megatron_ptune_gpt
trainer:
  gpus: 2
  num_nodes: 1
  precision: 16
  logger: false
  checkpoint_callback: false
  replace_sampler_ddp: false
  max_epochs: 3
  max_steps: null
  log_every_n_steps: 10
  val_check_interval: 300
  accumulate_grad_batches: 1
  gradient_clip_val: 1.0
  resume_from_checkpoint: null
exp_manager:
  explicit_log_dir: null
  exp_dir: null
  name: megatron_ptune_gpt
  create_wandb_logger: false
  wandb_logger_kwargs:
    project: null
    name: null
  resume_if_exists: true
  resume_ignore_no_checkpoint: true
  create_checkpoint_callback: true
  checkpoint_callback_params:
    monitor: val_acc
    save_top_k: 2
    mode: max
    always_save_nemo: false
    filename: megatron_gpt--{val_acc:.3f}-{step}
    model_parallel_size: ${model.tensor_model_parallel_size}
    save_best_model: true
model:
  tensor_model_parallel_size: 1
  seed: 1234
  nemo_path: null
  use_lm_finetune: false
  pseudo_token: '[PROMPT]'
  max_decode_length: null
  language_model:
    nem

## Building the PyTorch Lightning Trainer

NeMo models are primarily PyTorch Lightning modules - and therefore are entirely compatible with the PyTorch Lightning ecosystem.

Let's first instantiate a Trainer object

In [81]:
print("Trainer config - \n")
print(OmegaConf.to_yaml(config.trainer))

Trainer config - 

gpus: 2
num_nodes: 1
precision: 16
logger: false
checkpoint_callback: false
replace_sampler_ddp: false
max_epochs: 3
max_steps: null
log_every_n_steps: 10
val_check_interval: 300
accumulate_grad_batches: 1
gradient_clip_val: 1.0
resume_from_checkpoint: null



In [82]:
from nemo.collections.nlp.parts.nlp_overrides import NLPDDPPlugin


# let's modify some trainer configs
# check whether a GPU is available and use it
cuda = 1 if torch.cuda.is_available() else 0
config.trainer.gpus = cuda
config.trainer.max_epochs = 100
config.trainer.val_check_interval=95230
# for PyTorch Native AMP set precision=16
config.trainer.precision = 16 if torch.cuda.is_available() else 32

# remove distributed training flags
config.trainer.accelerator = None

trainer = pl.Trainer(plugins=[NLPDDPPlugin()], **config.trainer)

Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs


## Setting up a NeMo Experiment

NeMo has an experiment manager that handles logging and checkpointing for us, so let's use it:

In [83]:
exp_dir = exp_manager(trainer, config.get("exp_manager", None))
os.makedirs(WORK_DIR, exist_ok=True)

# the exp_dir provides a path to the current experiment for easy access
exp_dir = str(exp_dir)
exp_dir

[NeMo W 2022-02-07 21:39:49 exp_manager:558] No version folders would be created under the log folder as 'resume_if_exists' is enabled.
[NeMo W 2022-02-07 21:39:49 exp_manager:411] There was no checkpoint folder at checkpoint_dir :/NeMo/tutorials/nlp/nemo_experiments/megatron_ptune_gpt/checkpoints. Training from scratch.


[NeMo I 2022-02-07 21:39:49 exp_manager:283] Experiments will be logged at /NeMo/tutorials/nlp/nemo_experiments/megatron_ptune_gpt
[NeMo I 2022-02-07 21:39:49 exp_manager:648] TensorboardLogger has been set up


[NeMo W 2022-02-07 21:39:49 exp_manager:901] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to -1. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.


'/NeMo/tutorials/nlp/nemo_experiments/megatron_ptune_gpt'

We will use the converted `.nemo` file as our LM model.

In [85]:
# point the config at the converted .nemo language model
config.model.language_model.nemo_file = 'gpt_344m.nemo'
config.model.tensor_model_parallel_size = 1
config.exp_manager.checkpoint_callback_params.save_top_k = 1

Now we are ready to initialize our model. During the model initialization call, the datasets and data loaders will be prepared for training and evaluation.

In [86]:
from nemo.collections.nlp.models.language_modeling.megatron_ptune_gpt_model import MegatronGPTPTuneModel
model_ptune = MegatronGPTPTuneModel(cfg=config.model, trainer=trainer)

[NeMo I 2022-02-07 21:41:55 tokenizer_utils:193] Getting Megatron tokenizer for pretrained model name: megatron-gpt-345m and custom vocab file: /tmp/tmpre5q9okj/bfcdca5e44814366bdb5dcd651325152_gpt2-vocab.json
[NeMo I 2022-02-07 21:41:55 tokenizer_utils:125] Getting HuggingFace AutoTokenizer with pretrained_model_name: gpt2, vocab_file: /tmp/tmpre5q9okj/bfcdca5e44814366bdb5dcd651325152_gpt2-vocab.json, special_tokens_dict: {}, and use_fast: False


Using sep_token, but it is not set yet.
Using cls_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using mask_token, but it is not set yet.


[NeMo I 2022-02-07 21:42:02 megatron_gpt_model:763] Padded vocab_size: 50304, original vocab_size: 50257, dummy tokens: 47.
[NeMo I 2022-02-07 21:42:05 save_restore_connector:154] Model MegatronGPTModel was successfully restored from /NeMo/tutorials/nlp/gpt_344m.nemo.
[NeMo I 2022-02-07 21:42:05 auto_tokenizer:171] 1 special tokens added, resize your model accordingly.


Using pad_token, but it is not set yet.
Using mask_token, but it is not set yet.


## Monitoring training progress
Optionally, you can create a Tensorboard visualization to monitor training progress.
If you're not using Colab, refer to [https://www.tensorflow.org/tensorboard/tensorboard_in_notebooks](https://www.tensorflow.org/tensorboard/tensorboard_in_notebooks) if you're facing issues with running the cell below.

In [87]:
try:
    from google import colab
    COLAB_ENV = True
except (ImportError, ModuleNotFoundError):
    COLAB_ENV = False

# Load the TensorBoard notebook extension
if COLAB_ENV:
    %load_ext tensorboard
    %tensorboard --logdir {exp_dir}
else:
    print("To use tensorboard, please use this notebook in a Google Colab environment.")

To use tensorboard, please use this notebook in a Google Colab environment.


In [100]:
# start model training
trainer.fit(model_ptune)

[NeMo I 2022-02-07 21:48:49 megatron_ptune_gpt_model:332] Building P-Tune datasets.
[NeMo I 2022-02-07 21:48:49 gpt_ptune_dataset:201] Processing DATA_DIR/mix/test.txt
[NeMo I 2022-02-07 21:48:50 gpt_ptune_dataset:331] Writing example 0 of 13538
[NeMo I 2022-02-07 21:49:01 gpt_ptune_dataset:331] Writing example 10000 of 13538
[NeMo I 2022-02-07 21:49:04 gpt_ptune_dataset:201] Processing DATA_DIR/mix/train.txt
[NeMo I 2022-02-07 21:49:12 gpt_ptune_dataset:331] Writing example 0 of 118101
[NeMo I 2022-02-07 21:49:23 gpt_ptune_dataset:331] Writing example 10000 of 118101
[NeMo I 2022-02-07 21:49:33 gpt_ptune_dataset:331] Writing example 20000 of 118101
[NeMo I 2022-02-07 21:49:43 gpt_ptune_dataset:331] Writing example 30000 of 118101
[NeMo I 2022-02-07 21:49:53 gpt_ptune_dataset:331] Writing example 40000 of 118101
[NeMo I 2022-02-07 21:50:04 gpt_ptune_dataset:331] Writing example 50000 of 118101
[NeMo I 2022-02-07 21:50:14 gpt_ptune_dataset:331] Writing example 60000 of 118101
[NeMo I 20

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4]


[NeMo I 2022-02-07 21:51:26 modelPT:577] Optimizer config = Adam (
    Parameter Group 0
        amsgrad: False
        betas: [0.9, 0.999]
        eps: 1e-08
        lr: 1e-05
        weight_decay: 0.0005
    )
[NeMo I 2022-02-07 21:51:26 lr_scheduler:833] Scheduler "<nemo.core.optim.lr_scheduler.WarmupAnnealing object at 0x7f9dc58ae730>" 
    will be used during training (effective maximum steps = 1476300) - 
    Parameters : 
    (warmup_steps: null
    warmup_ratio: 0.1
    last_epoch: -1
    max_steps: 1476300
    )
[NeMo I 2022-02-07 21:51:28 nlp_overrides:94] Configuring DDP for model parallelism.



  | Name           | Type                   | Params
----------------------------------------------------------
0 | model          | MegatronGPTModel       | 354 M 
1 | embeddings     | VocabParallelEmbedding | 51.5 M
2 | prompt_encoder | PromptEncoder          | 14.7 M
----------------------------------------------------------
14.7 M    Trainable params
354 M     Non-trainable params
369 M     Total params
739.158   Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

[NeMo I 2022-02-07 21:51:41 megatron_ptune_gpt_model:318] Validation loss: 3.0266573429107666
[NeMo I 2022-02-07 21:51:41 megatron_ptune_gpt_model:319] Validation accuracy: 0.0


Training: 0it [00:00, ?it/s]

I0207 21:51:43.998908 140320770778944 distributed.py:902] Reducer buckets have been rebuilt in this iteration.


Validating: 0it [00:00, ?it/s]

  rank_zero_warn("Detected KeyboardInterrupt, attempting graceful shutdown...")
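
The `effective maximum steps = 1476300` value in the scheduler log above can be reproduced from numbers that appear in this notebook: 118101 training examples, a batch size of 8, and `max_epochs = 100`. The sketch assumes a single device, no gradient accumulation, and that the last partial batch is kept; `effective_max_steps` is a hypothetical helper written for this check.

```python
import math


def effective_max_steps(num_examples, batch_size, max_epochs,
                        num_devices=1, accumulate_grad_batches=1):
    """Optimizer steps over the whole run, keeping the last partial batch."""
    batches_per_epoch = math.ceil(num_examples / (batch_size * num_devices))
    steps_per_epoch = math.ceil(batches_per_epoch / accumulate_grad_batches)
    return steps_per_epoch * max_epochs


steps = effective_max_steps(num_examples=118101, batch_size=8, max_epochs=100)
print(steps)
```

Larger batches or more devices shrink this number proportionally, which also shortens the warmup phase because the scheduler's `warmup_ratio` is a fraction of the total step count.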


# Inference

To see how the model performs, we can run it in inference mode.

In [None]:
# let's first create a subset of our dev data
query_examples = [

]
results = model_ptune.cuda().ptune_inference(queries=query_examples, batch_size=1, decode_token_len=15)
print('The prediction results of some sample queries with the trained model:')
for query, result in zip(query_examples, results):
    print(f'Query : {query}')
    print(f'Predicted label: {result}')

## Training Script

If you have NeMo installed locally, you can also train the model with the `examples/nlp/language_modeling/megatron_gpt_ptune.py` script.

To run the training script, use:
```
python examples/nlp/language_modeling/megatron_gpt_ptune.py \
    trainer.gpus=1 \
    model.tensor_model_parallel_size=1 \
    model.language_model.nemo_file=gpt_344m.nemo \
    model.prompt_encoder.template=[3,3,3] \
    model.data.train_ds.file_path=TRAIN_FILE \
    model.data.train_ds.batch_size=8 \
    model.data.validation_ds.file_path=VAL_FILE \
    model.data.test_ds.file_path=TEST_FILE
```