In [None]:
BRANCH = 'main'

In [None]:
"""
You can run this notebook either locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
# If you're using Google Colab and not running locally, run this cell

# install NeMo
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@{BRANCH}#egg=nemo_toolkit[nlp]

In [None]:
from nemo.utils.exp_manager import exp_manager
from nemo.collections import nlp as nemo_nlp

import os
import wget 
import torch
import pytorch_lightning as pl
from omegaconf import OmegaConf

# Task Description
For every word in our training dataset we’re going to predict:

- the punctuation mark that should follow the word, and
- whether the word should be capitalized.

# Dataset
This model can work with any dataset as long as it follows the format specified below. 
The training and evaluation data is divided into *2 files: text.txt and labels.txt*. 
Each line of the **text.txt** file contains one text sequence in which words are separated by spaces: [WORD] [SPACE] [WORD] [SPACE] [WORD], for example:



```
when is the next flight to new york
the next flight is ...
...
```



The **labels.txt** file contains the corresponding labels for each word in text.txt. The labels are separated by spaces, and each label consists of 2 symbols:

- the first symbol indicates which punctuation mark should follow the word (where O means no punctuation is needed);
- the second symbol indicates whether the word should be capitalized (where U means the word should be capitalized, and O means no capitalization is needed).

In this tutorial, we consider only commas, periods, and question marks; all other punctuation marks were removed. To use more punctuation marks, update the dataset to include the desired labels. No changes to the model are needed.

Each line of the **labels.txt** file should follow the format: 
[LABEL] [SPACE] [LABEL] [SPACE] [LABEL]. 
For example, the labels for the above text.txt file should be:



```
OU OO OO OO OO OO OU ?U
OU OO OO OO ...
...
```



The complete list of all possible labels used in this tutorial is: `OO, ,O, .O, ?O, OU, ,U, .U, ?U`.
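To make the label format concrete, here is a minimal sketch that converts a punctuated sentence into the text/labels pair described above and back again. The helper names `make_labels` and `apply_labels` are hypothetical and are not part of NeMo or of `get_tatoeba_data.py`. The sketch also assumes that `U` means "capitalize the first letter", as in the example above.

```python
PUNCT_MARKS = ",.?"  # the three marks used in this tutorial


def make_labels(sentence):
    """Split a punctuated sentence into lowercase words and two-symbol labels."""
    words, labels = [], []
    for token in sentence.split():
        word = token.rstrip(PUNCT_MARKS)
        if not word:
            continue  # skip tokens that consist of punctuation only
        punct = token[-1] if token[-1] in PUNCT_MARKS else "O"
        capit = "U" if word[0].isupper() else "O"
        words.append(word.lower())
        labels.append(punct + capit)
    return " ".join(words), " ".join(labels)


def apply_labels(text, labels):
    """Rebuild a punctuated, capitalized sentence from text and labels."""
    words, tags = text.split(), labels.split()
    if len(words) != len(tags):
        raise ValueError("each word needs exactly one label")
    restored = []
    for word, tag in zip(words, tags):
        if tag[1] == "U":
            word = word[0].upper() + word[1:]
        if tag[0] != "O":
            word += tag[0]
        restored.append(word)
    return " ".join(restored)


text, labels = make_labels("When is the next flight to New York?")
print(text)
print(labels)
print(apply_labels(text, labels))
```

Supporting an additional punctuation mark in this sketch would only require adding it to `PUNCT_MARKS`, which mirrors the point that only the dataset labels need to change.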

## Download and preprocess the data

In this notebook we are going to use a subset of English examples from the [Tatoeba collection of sentences](https://tatoeba.org/eng). The script [NeMo/examples/nlp/token_classification/get_tatoeba_data.py](https://github.com/NVIDIA/NeMo/blob/master/examples/nlp/token_classification/get_tatoeba_data.py) downloads and preprocesses the Tatoeba data. Note: for further experiments with the model, set `NUM_SAMPLES=-1` and consider including other datasets to improve model performance. 


In [None]:
DATA_DIR = "PATH_TO_DATA"
WORK_DIR = "PATH_TO_CHECKPOINTS_AND_LOGS"
MODEL_CONFIG = "punctuation_capitalization_config.yaml"

# model parameters
BATCH_SIZE = 128
MAX_SEQ_LENGTH = 64
LEARNING_RATE = 0.00002
NUM_SAMPLES = 100000

In [None]:
## download get_tatoeba_data.py script to download and preprocess the Tatoeba data
os.makedirs(WORK_DIR, exist_ok=True)
if not os.path.exists(WORK_DIR + '/get_tatoeba_data.py'):
    print('Downloading get_tatoeba_data.py...')
    wget.download('https://raw.githubusercontent.com/NVIDIA/NeMo/candidate/examples/nlp/token_classification/data/get_tatoeba_data.py', WORK_DIR)
else:
    print('get_tatoeba_data.py already exists')

In [None]:
# download and preprocess the data
# the --clean_dir flag deletes the raw Tatoeba data; remove the flag to avoid repeated downloads if you want to experiment with the dataset size
! python $WORK_DIR/get_tatoeba_data.py --data_dir $DATA_DIR --num_sample $NUM_SAMPLES --clean_dir

After executing the above cell, your data folder will contain the following 4 files needed for training (the raw Tatoeba data may also be present if `--clean_dir` was not used):
- labels_dev.txt
- labels_train.txt
- text_dev.txt
- text_train.txt
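
Because every word needs exactly one label, it can be worth checking that the text and labels files line up before training. The following is a minimal sketch under that assumption. `find_misaligned` is a hypothetical helper, not a NeMo function, and it is shown on a toy in-memory example rather than on the real files.

```python
from itertools import zip_longest


def find_misaligned(text_lines, label_lines):
    """Return 1-based line numbers whose word count and label count differ."""
    bad_lines = []
    # zip_longest pads the shorter input, so a missing line is also reported
    pairs = zip_longest(text_lines, label_lines, fillvalue="")
    for line_no, (text, labels) in enumerate(pairs, start=1):
        if len(text.split()) != len(labels.split()):
            bad_lines.append(line_no)
    return bad_lines


# Toy example; the second pair is deliberately one label short.
text = ["how are you", "what can i do"]
labels = ["OU OO ?O", "OU OO OU"]
print(find_misaligned(text, labels))
```

The function accepts any iterable of lines, so open file handles for `text_train.txt` and `labels_train.txt` can be passed in directly.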


In [None]:
! ls -l {DATA_DIR}

In [None]:
# let's take a look at the data 
print('Text:')
! head -n 5 {DATA_DIR}/text_train.txt

print('\nLabels:')
! head -n 5 {DATA_DIR}/labels_train.txt

# Model Configuration

In the Punctuation and Capitalization Model, we are jointly training two token-level classifiers on top of the pretrained [BERT](https://arxiv.org/pdf/1810.04805.pdf) model: 
- one classifier to predict punctuation and
- the other to predict capitalization.

The model is defined in a config file which declares multiple important sections. They are:
- **model**: All arguments that will relate to the Model - language model, token classifiers, optimizer and schedulers, datasets and any other related information

- **trainer**: Any argument to be passed to PyTorch Lightning

In [None]:
# download the model's configuration file 
config_dir = WORK_DIR + '/configs/'
os.makedirs(config_dir, exist_ok=True)
if not os.path.exists(config_dir + MODEL_CONFIG):
    print('Downloading config file...')
    wget.download('https://raw.githubusercontent.com/NVIDIA/NeMo/pc_tutorial/examples/nlp/token_classification/conf/punctuation_capitalization_config.yaml', config_dir)
else:
    print('config file already exists')

In [None]:
# this line will print the entire config of the model
config_path = f'{WORK_DIR}/configs/{MODEL_CONFIG}'
print(config_path)
config = OmegaConf.load(config_path)
print(OmegaConf.to_yaml(config))

# Setting up Data within the config

Among other things, the config file contains dictionaries called `dataset`, `train_ds` and `validation_ds`. These configurations are used to set up the Dataset and DataLoaders for training and validation.

If both training and evaluation files are located in the same directory, simply specify `model.dataset.data_dir`, like we are going to do below.
However, if your evaluation files are located in a different directory, or you want to use multiple datasets for evaluation, specify paths to the directory(ies) with evaluation file(s) in the following way:

`model.validation_ds.ds_item=[PATH_TO_DEV1,PATH_TO_DEV2]` (Note no space between the paths and square brackets).

Also notice that some configs, including `model.dataset.data_dir`, have `???` in place of paths. These values must be specified by the user.
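
As a rough illustration of the `???` convention, the sketch below walks a nested dictionary and lists every key that is still unset. It uses a plain Python dict as a stand-in for the loaded config, and both the helper `missing_keys` and the miniature config are hypothetical. With a real OmegaConf object, reading a `???` value raises an error instead.

```python
def missing_keys(cfg, prefix=""):
    """Return dotted paths of all values that are still set to '???'."""
    missing = []
    for key, value in cfg.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            missing.extend(missing_keys(value, path))
        elif value == "???":
            missing.append(path)
    return missing


# Illustrative miniature config, not the real punctuation_capitalization_config.yaml
toy_config = {
    "model": {
        "dataset": {"data_dir": "???"},
        "train_ds": {"batch_size": 128},
    },
    "trainer": {"max_epochs": 1},
}
print(missing_keys(toy_config))
```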

Let's now add the data directory path to the config.

In [None]:
# in this tutorial the train and dev data are located in the same folder, so it is enough to add the path of the data directory to our config
config.model.dataset.data_dir = DATA_DIR

# Building the PyTorch Lightning Trainer

NeMo models are primarily PyTorch Lightning modules - and therefore are entirely compatible with the PyTorch Lightning ecosystem!

Let's first instantiate a Trainer object!

In [None]:
print("Trainer config - \n")
print(OmegaConf.to_yaml(config.trainer))

In [None]:
# let's modify some trainer configs
# check whether a GPU is available and use it if so
cuda = 1 if torch.cuda.is_available() else 0
config.trainer.gpus = cuda
config.trainer.precision = 16 if torch.cuda.is_available() else 32

# For mixed precision training, use precision=16 and amp_level=O1

# Reduce the maximum number of epochs to 1 for quick training
config.trainer.max_epochs = 1

# Remove distributed training flags
config.trainer.distributed_backend = None

trainer = pl.Trainer(**config.trainer)

# Setting up a NeMo Experiment

NeMo has an experiment manager that handles logging and checkpointing for us, so let's use it!

In [None]:
exp_dir = exp_manager(trainer, config.get("exp_manager", None))

# the exp_dir provides a path to the current experiment for easy access
exp_dir = str(exp_dir)
exp_dir

# Model Training

Before initializing the model, we might want to modify some of the model configs.

In [None]:
# specify the BERT-like model you want to use
PRETRAINED_BERT_MODEL = "bert-base-uncased"

# complete list of supported BERT-like models
nemo_nlp.modules.get_pretrained_lm_models_list()

In [None]:
# add the model parameters specified above to the config
config.model.language_model.pretrained_model_name = PRETRAINED_BERT_MODEL
config.model.train_ds.batch_size = BATCH_SIZE
config.model.validation_ds.batch_size = BATCH_SIZE
config.model.optim.lr = LEARNING_RATE
config.model.train_ds.num_samples = NUM_SAMPLES
config.model.validation_ds.num_samples = NUM_SAMPLES

In [None]:
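# initialize the model
# the datasets will be prepared for training and evaluation during model initialization
# note: the trainer was already created above with max_epochs=1, so changing the config here
# does not affect it; to train for 3 epochs, set max_epochs before creating the trainer
config.trainer.max_epochs = 3
model = nemo_nlp.models.PunctuationCapitalizationModel(cfg=config.model, trainer=trainer)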

## Monitoring training progress
Optionally, you can create a TensorBoard visualization to monitor training progress.

In [None]:
# load the TensorBoard notebook extension
%load_ext tensorboard
%tensorboard --logdir {exp_dir}

In [None]:
# start the training
trainer.fit(model)

After training for 3 epochs, the macro-averaged F1 should be around 93% for the punctuation task and about 98% for the capitalization task.
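
For readers unfamiliar with the metric, macro-averaged F1 is the unweighted mean of the per-class F1 scores, so rare classes such as `?` count as much as the frequent `O` class. The sketch below computes it from scratch on made-up labels. It is only an illustration of the definition, not NeMo's own metric code, which may treat edge cases differently.

```python
def macro_f1(true_labels, pred_labels):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    pairs = list(zip(true_labels, pred_labels))
    classes = sorted(set(true_labels) | set(pred_labels))
    scores = []
    for cls in classes:
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Made-up punctuation labels for four words (not real model output)
true_punct = ["O", "O", "O", ","]
pred_punct = ["O", "O", ",", ","]
print(round(macro_f1(true_punct, pred_punct), 4))
```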

# Inference

To see how the model performs, let’s run inference on a few examples.

In [None]:
# define the list of queries for inference
queries = [
        'we bought four shirts and one mug from the nvidia gear store in santa clara',
        'what can i do for you today',
        'how are you',
        'how is the weather in',
    ]
inference_results = model.add_punctuation_capitalization(queries)
print()

for query, result in zip(queries, inference_results):
    print(f'Query   : {query}')
    print(f'Combined: {result.strip()}\n')

If you have NeMo installed locally, you can also train the model with `nlp/token_classification/punctuation_capitalization.py`.

To run the training script, use:

`python punctuation_capitalization.py model.dataset.data_dir=PATH_TO_DATA_DIR`
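
When several config values need to be overridden, the arguments can be assembled programmatically. The sketch below formats dotted config keys as `key=value` overrides and writes list values without spaces, as required for `model.validation_ds.ds_item`. The helper `build_overrides` is hypothetical, and the paths are placeholders. Which keys you actually need depends on your data layout.

```python
def build_overrides(params):
    """Format a dict of dotted config keys as command-line overrides."""
    overrides = []
    for key, value in params.items():
        if isinstance(value, (list, tuple)):
            # lists are written as [a,b] with no spaces between the items
            value = "[" + ",".join(str(v) for v in value) + "]"
        overrides.append(f"{key}={value}")
    return overrides


# Placeholder paths; replace them with real directories before running
command = ["python", "punctuation_capitalization.py"] + build_overrides({
    "model.dataset.data_dir": "PATH_TO_DATA_DIR",
    "model.validation_ds.ds_item": ["PATH_TO_DEV1", "PATH_TO_DEV2"],
})
print(" ".join(command))
```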

Set NUM_SAMPLES=-1 and consider including other datasets to improve the performance of the model.