
# Lab 12 – HF Transformers Tokenisers and Multilingual Translation

Install requirements

In [1]:
!pip install transformers sentencepiece datasets


# Part 1. Working with HF Transformer Tokeniser

In [2]:
from transformers import BertTokenizer
tok = BertTokenizer.from_pretrained('bert-base-uncased')

text = "Hello world!"


In [30]:
# Call tokenize() on text
tok.tokenize(text)

['hello', 'world', '!']

In [31]:
# Convert these tokens to ids
tok.convert_tokens_to_ids(tok.tokenize(text))

[7592, 2088, 999]

In [32]:
# Call encode() on text; this also adds the special start and end tokens
tok.encode(text)

[101, 7592, 2088, 999, 102]

`encode()` has added a start-of-sequence token (101, `[CLS]`) and an end-of-sequence token (102, `[SEP]`). We can also specify how long the sequence should be, add padding, and return PyTorch tensors.

In [6]:
tok.encode(text, max_length = 512, padding='max_length', return_tensors='pt') #[0,:10]

tensor([[ 101, 7592, 2088,  999,  102,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,  

In [33]:
tok.encode_plus(text)

{'input_ids': [101, 7592, 2088, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}

This now includes token ids, segment ids (`token_type_ids`) and attention mask (which tokens should be attended to, in this case, all). We can also specify max length and padding, like we did with `encode()`.

In [8]:
# Specify max length, padding and return PyTorch tensors
tok.encode_plus(text, max_length = 512, padding='max_length', return_tensors='pt') #[0,:10]

{'input_ids': tensor([[ 101, 7592, 2088,  999,  102,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,

The segment ids are all 0 because there is only one sequence. The attention mask is 1 for the first 5 tokens only; the remaining positions are padding, so their attention mask is 0.
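
As a rough illustration of how the padded ids and the attention mask relate, here is a minimal pure-Python sketch. The helper `pad_with_mask` is hypothetical (it is not part of `transformers`), and it assumes a pad id of 0, as in the output above.

```python
def pad_with_mask(ids, max_length, pad_id=0):
    """Right-pad `ids` to `max_length` and build the matching attention mask.

    Hypothetical helper for illustration only; assumes len(ids) <= max_length.
    """
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_id] * n_pad
    # 1 marks a real token, 0 marks a padding position
    attention_mask = [1] * len(ids) + [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = pad_with_mask([101, 7592, 2088, 999, 102], max_length=8)
print(encoded["input_ids"])       # [101, 7592, 2088, 999, 102, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The number of 1s in the mask always equals the number of real tokens, whatever `max_length` is.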

`encode_plus()` cannot deal with a batch of sentences. To process a list of sentences, we can use `batch_encode_plus()`.

In [9]:
text_list = [text, "My soul is painted like the wings of butterflies, fairy tales of yesterday will grow but never die"]
tok.batch_encode_plus(text_list)

{'input_ids': [[101, 7592, 2088, 999, 102], [101, 2026, 3969, 2003, 4993, 2066, 1996, 4777, 1997, 15023, 1010, 8867, 7122, 1997, 7483, 2097, 4982, 2021, 2196, 3280, 102]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

We get a list of token ids, segment ids and attention mask values for each sentence. When processing a batch, all sequences should have the same length, so we specify a maximum length, pad the shorter sentences as before, and truncate the longer ones. We also return PyTorch tensors so that we can check their `shape`.

In [35]:
token_ids = tok.batch_encode_plus(text_list, max_length = 10, padding='max_length', truncation=True, return_tensors='pt')
token_ids['input_ids']

tensor([[ 101, 7592, 2088,  999,  102,    0,    0,    0,    0,    0],
        [ 101, 2026, 3969, 2003, 4993, 2066, 1996, 4777, 1997,  102]])
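
The second row above still ends in 102 even though the sentence was cut short: the end-of-sequence token is kept and ordinary tokens are dropped instead. The sketch below mimics that behaviour with a hypothetical helper, `fit_to_length` (not a `transformers` function), assuming each list of ids already ends with the end-of-sequence id.

```python
def fit_to_length(ids, max_length, pad_id=0):
    """Truncate or right-pad one encoded sentence to exactly `max_length` ids.

    Illustrative only: assumes `ids` ends with an end-of-sequence id
    that must survive truncation.
    """
    if len(ids) > max_length:
        # Drop ordinary tokens from the end, but keep the final special token
        return ids[: max_length - 1] + [ids[-1]]
    return ids + [pad_id] * (max_length - len(ids))


short_ids = [101, 7592, 2088, 999, 102]
long_ids = [101, 2026, 3969, 2003, 4993, 2066, 1996, 4777, 1997, 15023, 1010,
            8867, 7122, 1997, 7483, 2097, 4982, 2021, 2196, 3280, 102]

batch = [fit_to_length(ids, 10) for ids in (short_ids, long_ids)]
for row in batch:
    print(row)
```

Every row now has the same length, which is what allows the batch to be stacked into a single 2-D tensor.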

In [11]:
# Q. What would happen if we removed truncation=True?


We can also get the shape of specific items in `token_ids`

In [36]:
token_ids['input_ids'].shape

torch.Size([2, 10])

### Run tokenizer directly
We can also call the tokeniser directly on the text.

In [37]:
tok(text)

{'input_ids': [101, 7592, 2088, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}

The output looks identical to `encode_plus()`.
Let's call it on `text_list`.

In [14]:
tok(text_list)

{'input_ids': [[101, 7592, 2088, 999, 102], [101, 2026, 3969, 2003, 4993, 2066, 1996, 4777, 1997, 15023, 1010, 8867, 7122, 1997, 7483, 2097, 4982, 2021, 2196, 3280, 102]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

The output looks identical to `batch_encode_plus()`.

The tokeniser can take a single sentence or a list of sentences and calls the appropriate function: `encode_plus()` for a sentence and `batch_encode_plus()` for a list of sentences.
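
The dispatch described above can be sketched with a toy tokeniser. Everything here is a stand-in for illustration: the functions are hypothetical, the vocabulary contains only the ids seen in the outputs above, and the id 100 for unknown words is an assumption of this sketch.

```python
import re

# Toy vocabulary: ids copied from the outputs above, plus an assumed [UNK] id
TOY_VOCAB = {"[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
             "hello": 7592, "world": 2088, "!": 999}


def toy_encode_plus(sentence):
    """Encode one sentence: lowercase, split, add special tokens, map to ids."""
    tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
    ids = [TOY_VOCAB.get(t, TOY_VOCAB["[UNK]"]) for t in tokens]
    ids = [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


def toy_batch_encode_plus(sentences):
    """Encode several sentences and collect the results per key."""
    encoded = [toy_encode_plus(s) for s in sentences]
    return {key: [e[key] for e in encoded]
            for key in ("input_ids", "attention_mask")}


def toy_tokenizer(text):
    """Pick the single-sentence or the batch path based on the input type."""
    if isinstance(text, str):
        return toy_encode_plus(text)
    return toy_batch_encode_plus(list(text))


print(toy_tokenizer("Hello world!"))
print(toy_tokenizer(["Hello world!", "Hello"]))
```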

References:
- [James Briggs](https://youtu.be/bWLvGGJLzF8)


# Part 2. Multilingual Machine Translation using HF Transformers

In [38]:
from datasets import load_dataset
from IPython.display import display
import ipywidgets as widgets  # IPython.html.widgets has moved to the ipywidgets package
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import torch
from torch import optim
from torch.nn import functional as F
from transformers import AdamW, AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import get_linear_schedule_with_warmup
from tqdm import notebook

sns.set()

In [39]:
model_repo = 'google/mt5-small'

Load tokeniser and model

In [40]:
tokenizer = AutoTokenizer.from_pretrained(model_repo)

In [41]:
# Device config
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Load the [mT5 model](https://arxiv.org/abs/2010.11934), a multilingual language model that can be used for various NLP tasks.

In [42]:
# Model description: https://huggingface.co/google/mt5-small
model = AutoModelForSeq2SeqLM.from_pretrained(model_repo)
model = model.to(device)

In [45]:
input_sent = "This is a test sentence!"
token_ids = tokenizer.encode(input_sent, return_tensors= 'pt').to(device)

token_ids

tensor([[ 1494,   339,   259,   262,  2978,   259, 98923,   309,     1]],
       device='cuda:0')

In [46]:
model_out = model.generate(token_ids)
print(model_out)

tensor([[     0, 250099,      1]], device='cuda:0')


In [48]:
output_text = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(model_out[0]))
output_text

'<pad> <extra_id_0></s>'

# Steps
1. Load the pretrained model and tokenizer
2. Load dataset
3. Transform dataset into input (entails a minor model change)
4. Train/finetune the model on our dataset
5. Test the model

In [23]:
example_input_str = '<ms> This is just a test pretqw.' # 'pretqw' is not a real word
input_ids = tokenizer.encode(example_input_str, return_tensors='pt')
print('Input IDs:', input_ids)


Input IDs: tensor([[ 1042,   282,   263,   669,  1494,   339,  1627,   259,   262,  2978,
         10300, 27282,   260,     1]])


We don't see a one-to-one mapping from words to tokens.
Let's convert these ids back to tokens to see what they are.

In [50]:
tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
print('Tokens:', tokens)

Tokens: ['▁<', 'm', 's', '>', '▁This', '▁is', '▁just', '▁', 'a', '▁test', '▁pret', 'qw', '.', '</s>']


The `▁` character (U+2581, which looks like an underscore) is how SentencePiece-based tokenisers mark a space or the start of a word. We can view the whole vocabulary of this multilingual model.
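
To make the role of the word-start marker concrete, the sketch below joins the pieces printed above back into text. `pieces_to_text` is a hypothetical helper written for this lab, not the tokeniser's own decoding routine, and it assumes the marker is the character U+2581.

```python
WORD_START = "\u2581"  # the marker shown in the token list (not an ASCII underscore)


def pieces_to_text(pieces, special_tokens=("</s>", "<pad>")):
    """Join subword pieces into text, turning word-start markers into spaces."""
    kept = [p for p in pieces if p not in special_tokens]
    return "".join(kept).replace(WORD_START, " ").strip()


# The pieces from the output above, written with the explicit marker character
pieces = [WORD_START + "<", "m", "s", ">", WORD_START + "This", WORD_START + "is",
          WORD_START + "just", WORD_START, "a", WORD_START + "test",
          WORD_START + "pret", "qw", ".", "</s>"]

print(pieces_to_text(pieces))  # <ms> This is just a test pretqw.
```

Pieces without the marker, such as `qw`, are glued onto the previous piece, which is why the made-up word `pretqw` is reassembled correctly.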

In [43]:
tokenizer.vocab

{'2260': 199330,
 'ença': 33168,
 '▁instrumen': 202123,
 'มีผู้': 212403,
 '7">': 173057,
 '▁Meteor': 118286,
 'ajbolj': 50568,
 'מתר': 198113,
 'Gemeente': 114420,
 '▁office': 11474,
 'tës': 40098,
 'კოს': 133370,
 '心臓': 204277,
 'стріл': 92007,
 'การจัด': 135630,
 'tendo': 37929,
 'نتو': 166356,
 'Друг': 32056,
 '▁τό': 42252,
 'లోప': 184606,
 'ოთ': 14318,
 'plān': 38630,
 'стріч': 145167,
 'anzi': 43511,
 'Nokia': 104215,
 '▁ಸಂಸ್': 148713,
 'hitan': 168226,
 'шалар': 237166,
 'plash': 200783,
 '▁Rij': 64292,
 'ceiro': 113893,
 'Alexandria': 121660,
 '▁revolu': 70096,
 'τέρ': 69995,
 'åde': 6479,
 'XJ': 58745,
 '依托': 206365,
 'الإسلام': 228545,
 '擂': 241090,
 'က်ယ္': 186187,
 'ਾਵ': 171336,
 '营收': 213324,
 '▁nation': 30341,
 '부산': 19669,
 '▁OB': 53896,
 'противоречи': 141526,
 'De': 4209,
 'ընկեր': 136715,
 'യിട്ട': 216765,
 'eettis': 236877,
 '▁प्रति': 7792,
 '▁رون': 158824,
 '▁Concentr': 159696,
 'ندا': 30441,
 'partiet': 71311,
 'อันดับที่': 83627,
 'brzy': 104045,
 'hrad': 40414,
 

In [44]:
# Sort it by token number. Scroll to the top to see the first few tokens
sorted(tokenizer.vocab.items(), key=lambda x: x[1])

[('<pad>', 0),
 ('</s>', 1),
 ('<unk>', 2),
 ('<0x00>', 3),
 ('<0x01>', 4),
 ('<0x02>', 5),
 ('<0x03>', 6),
 ('<0x04>', 7),
 ('<0x05>', 8),
 ('<0x06>', 9),
 ('<0x07>', 10),
 ('<0x08>', 11),
 ('<0x09>', 12),
 ('<0x0A>', 13),
 ('<0x0B>', 14),
 ('<0x0C>', 15),
 ('<0x0D>', 16),
 ('<0x0E>', 17),
 ('<0x0F>', 18),
 ('<0x10>', 19),
 ('<0x11>', 20),
 ('<0x12>', 21),
 ('<0x13>', 22),
 ('<0x14>', 23),
 ('<0x15>', 24),
 ('<0x16>', 25),
 ('<0x17>', 26),
 ('<0x18>', 27),
 ('<0x19>', 28),
 ('<0x1A>', 29),
 ('<0x1B>', 30),
 ('<0x1C>', 31),
 ('<0x1D>', 32),
 ('<0x1E>', 33),
 ('<0x1F>', 34),
 ('<0x20>', 35),
 ('<0x21>', 36),
 ('<0x22>', 37),
 ('<0x23>', 38),
 ('<0x24>', 39),
 ('<0x25>', 40),
 ('<0x26>', 41),
 ('<0x27>', 42),
 ('<0x28>', 43),
 ('<0x29>', 44),
 ('<0x2A>', 45),
 ('<0x2B>', 46),
 ('<0x2C>', 47),
 ('<0x2D>', 48),
 ('<0x2E>', 49),
 ('<0x2F>', 50),
 ('<0x30>', 51),
 ('<0x31>', 52),
 ('<0x32>', 53),
 ('<0x33>', 54),
 ('<0x34>', 55),
 ('<0x35>', 56),
 ('<0x36>', 57),
 ('<0x37>', 58),
 ('<0x38>',

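Tokens 0 to 2 are the special tokens `<pad>`, `</s>` and `<unk>`. Tokens 3 to 258 are the 256 byte tokens `<0x00>` to `<0xFF>`, which SentencePiece uses as a byte-level fallback for characters that are not in the vocabulary. From token 259, we see common tokens, such as `▁`, comma, period, etc.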
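
The `<0xNN>` entries in the sorted listing follow a regular layout: the id is the byte value plus 3. The sketch below reproduces that mapping and shows how a character could be spelled out as UTF-8 byte pieces, which is how SentencePiece models with byte fallback represent characters missing from the vocabulary. The helper names are hypothetical, and the assumption that the pattern continues up to `<0xFF>` is inferred from the listing rather than read from the tokeniser's code.

```python
def byte_piece(byte_value, offset=3):
    """Return the (piece, id) pair for one byte, assuming id = byte value + offset."""
    return f"<0x{byte_value:02X}>", byte_value + offset


def byte_fallback(char):
    """Spell a character as byte pieces via its UTF-8 encoding."""
    return [byte_piece(b) for b in char.encode("utf-8")]


print(byte_piece(0x00))         # ('<0x00>', 3)
print(byte_piece(0x37))         # ('<0x37>', 58)
print(byte_fallback("\u00e9"))  # e with acute accent: two UTF-8 bytes
```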


### Load the data – Asian Language Treebank (ALT)
If you follow the link to the ALT dataset, you will see the 13 supported languages.

In [53]:
# Source: https://huggingface.co/datasets/alt
dataset = load_dataset('alt')




DatasetGenerationError: ignored

In [None]:
train_dataset = dataset['train']
test_dataset = dataset['test']
train_dataset

Let's see what one of these rows looks like.

In [None]:
train_dataset[0]

We will see that this is a dictionary with two IDs, a url, followed by **translations** in various languages. Pull out the source and target languages of our choice. We will select English, Malay, Chinese and Japanese.

In [None]:
LANG_TOKEN_MAPPING = {
    'en': '<en>',
    'ms': '<ms>',
    'zh': '<zh>',
    'ja': '<ja>'
}

We want to add the special tokens denoting the language, e.g. `<en>, <ms>`, etc. to our vocab

In [None]:
# Create a dict containing the special tokens from our dict above
special_tokens_dict = {'additional_special_tokens': list(LANG_TOKEN_MAPPING.values())}
# Add this dict to tokens via tokeniser function
tokenizer.add_special_tokens(special_tokens_dict)
# Need to change the model's embedding to include these new tokens (only they will be initialised, the others remain unchanged)
model.resize_token_embeddings(len(tokenizer))
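
Conceptually, resizing the embeddings keeps the existing rows and appends freshly initialised rows for the new token ids. The pure-Python sketch below illustrates that idea only; `resize_embeddings` is a hypothetical stand-in, and the random initialisation it uses is an assumption, not the scheme `transformers` applies.

```python
import random


def resize_embeddings(old_rows, new_vocab_size, seed=0):
    """Return an embedding table with `new_vocab_size` rows.

    Existing rows are copied unchanged; rows for new token ids get small
    random values (an illustrative choice).
    """
    rng = random.Random(seed)
    dim = len(old_rows[0])
    new_rows = [list(row) for row in old_rows]
    while len(new_rows) < new_vocab_size:
        new_rows.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return new_rows


old = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # toy table: 3 tokens, 2 dimensions
new = resize_embeddings(old, new_vocab_size=len(old) + 4)  # four language tokens added

print(len(old), "->", len(new))
print(new[:3] == old)  # the original rows are untouched
```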

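Let's look at the maximum length stored in the model config. Note that `model.config.max_length` is the default maximum length for generated output rather than a limit on the input length; we reuse it here as our sequence length.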


In [None]:
model.config.max_length

Set our max length to the model's max length

In [None]:
max_seq_len = model.config.max_length

The `tokenizer.encode_plus` function combines multiple steps for us (`encode()`, used below, performs the same steps but returns only the token ids):

1. Split the sentence into tokens.
2. Add the special tokens (`[CLS]` and `[SEP]` for BERT; the mT5 tokeniser only appends `</s>`).
3. Map the tokens to their IDs.
4. Pad or truncate all sentences to the same length.
5. Create the attention masks, which explicitly differentiate real tokens from padding tokens.



In [None]:
# Recall our example input string
example_input_str

In [None]:
# Q. Create token_ids for the example input sentence by calling the tokeniser with appropriate arguments. 
# Print out the token ids
token_ids = tokenizer.encode(
    example_input_str, 
    max_length=max_seq_len,
    padding='max_length',
    truncation=True,
    return_tensors='pt')

print(token_ids)


In [None]:
tokens = tokenizer.convert_ids_to_tokens(token_ids[0])
print(tokens)

### Prepare the input and target strings 
- Call the tokeniser on the input text, prefixed with the special token that indicates the target language

In [None]:
def encode_input_str(text, target_lang, tokenizer, seq_len,
                     lang_token_map=LANG_TOKEN_MAPPING):

  target_lang_token = lang_token_map[target_lang]

  # Tokenize and add special token to the front of text
  input_ids = tokenizer.encode(
      text = target_lang_token + text,
      return_tensors = 'pt',
      padding = 'max_length',
      truncation = True,
      max_length = seq_len)

  return input_ids[0]


def encode_target_str(text, tokenizer, seq_len,
                      lang_token_map=LANG_TOKEN_MAPPING):

  token_ids = tokenizer.encode(
      text = text,
      return_tensors = 'pt',
      padding = 'max_length',
      truncation = True,
      max_length = seq_len)
  
  return token_ids[0]

### Get the specific translations from the dictionary

In [None]:
def format_translation_data(translations, lang_token_map,
                            tokenizer, seq_len=128):

  # Choose 2 languages randomly from our list of languages for translation
  langs = list(lang_token_map.keys())
  input_lang, target_lang = np.random.choice(langs, size=2, replace=False)
  # print(f"Translating {input_lang} to {target_lang}")

  # Get the translations for the batch
  input_text = translations[input_lang]
  target_text = translations[target_lang]

  # Ignore missing input or target sentences
  if input_text is None or target_text is None:
    return None

  # Get token ids for input sentence
  input_token_ids = encode_input_str(
      input_text, target_lang, tokenizer, seq_len, lang_token_map)
  
  # Get token ids for target sentence
  target_token_ids = encode_target_str(
      target_text, tokenizer, seq_len, lang_token_map)

  return input_token_ids, target_token_ids
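
Before tokenisation, the data preparation boils down to building one pair of strings: the input text prefixed with the target-language token, and the plain target text. The sketch below shows just that string-level step for a fixed language pair. `make_text_pair` is a hypothetical helper, and the example sentences are made up for illustration rather than taken from the ALT dataset.

```python
LANG_TOKENS = {"en": "<en>", "ms": "<ms>", "zh": "<zh>", "ja": "<ja>"}


def make_text_pair(translations, input_lang, target_lang, lang_tokens=LANG_TOKENS):
    """Return (prefixed input text, target text), or None if either side is missing."""
    input_text = translations.get(input_lang)
    target_text = translations.get(target_lang)
    if input_text is None or target_text is None:
        return None
    # The target-language token goes at the front of the *input* sentence
    return lang_tokens[target_lang] + input_text, target_text


# Made-up row with one missing translation
row = {"en": "This is a test.", "ms": "Ini adalah ujian.", "ja": None}

print(make_text_pair(row, "en", "ms"))  # ('<ms>This is a test.', 'Ini adalah ujian.')
print(make_text_pair(row, "en", "ja"))  # None
```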

In [None]:
# Q. What does the 'replace=False' argument do when selecting 2 random languages for translation?


In [None]:
# Test `format_translation_data` on one sentence from the train set
format_translation_data(
    train_dataset[10]['translation'], LANG_TOKEN_MAPPING, tokenizer)


In [None]:
in_ids, out_ids = format_translation_data(
    train_dataset[10]['translation'], LANG_TOKEN_MAPPING, tokenizer)

print(' '.join(tokenizer.convert_ids_to_tokens(in_ids)))
print(' '.join(tokenizer.convert_ids_to_tokens(out_ids)))


### Let's Work with Batches
Let's first see what a batch from the dataset looks like.
Suppose we get the first two elements of the ALT dataset:


In [None]:
train_dataset[:2]

We see that we have just one dictionary but each key now has a list containing multiple values. So we can pass a batch of data by specifying the start and end indexes of the dataset.
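
The "one dictionary of lists" layout can be reproduced with plain Python. In the sketch below, `rows_to_columns` is a hypothetical helper and the rows are made up; the field name `id` is an assumption, while `translation` matches the key used elsewhere in this lab.

```python
def rows_to_columns(rows):
    """Turn a list of row dicts into one dict of lists (assumes identical keys)."""
    return {key: [row[key] for row in rows] for key in rows[0]}


# Made-up rows standing in for two dataset entries
rows = [
    {"id": "1", "translation": {"en": "first sentence"}},
    {"id": "2", "translation": {"en": "second sentence"}},
]

batch = rows_to_columns(rows)
print(batch["id"])                    # ['1', '2']
print(batch["translation"][1]["en"])  # second sentence
```

This is why the batch code iterates over `batch['translation']`: that single key holds the translations of every row in the slice.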

### Transform Batch
We now build a list of inputs and a list of targets by calling `format_translation_data` on each set of translations in the batch.

In [None]:
def transform_batch(batch, lang_token_map, tokenizer):

  inputs = []
  targets = []

  # Iterate through the translations for the batch
  for translation_set in batch['translation']:
    formatted_data = format_translation_data(
        translation_set, lang_token_map, tokenizer, max_seq_len)
    
    # Skip NULL translations
    if formatted_data is None:
      continue
    
    # Append translations to the input and target lists
    input_ids, target_ids = formatted_data
    inputs.append(input_ids.unsqueeze(0))
    targets.append(target_ids.unsqueeze(0))
    
  # Concatenate all the batches
  batch_input_ids = torch.cat(inputs).to(device)
  batch_target_ids = torch.cat(targets).to(device)

  return batch_input_ids, batch_target_ids

### Generate Data from Dataset
An additional function generates the data as an iterator. It shuffles the dataset so that each batch contains a random set of sentences. The batch size defaults to 32 if not supplied.

In [None]:
def get_data_generator(dataset, lang_token_map, tokenizer, batch_size=32):
  dataset = dataset.shuffle()
  for i in range(0, len(dataset), batch_size):
    raw_batch = dataset[i:i+batch_size]
    yield transform_batch(raw_batch, lang_token_map, tokenizer)
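
The shuffle-then-slice pattern of the generator can be tried out on plain Python data. The sketch below uses a hypothetical `batch_generator` over a list of integers; the `seed` parameter is an addition for reproducibility and is not part of the function above.

```python
import random


def batch_generator(items, batch_size=32, seed=None):
    """Shuffle a copy of `items`, then yield consecutive slices of `batch_size`."""
    items = list(items)
    random.Random(seed).shuffle(items)
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


gen = batch_generator(range(10), batch_size=4, seed=0)
first = next(gen)  # batches are produced lazily, one per call to next()
rest = list(gen)   # the remaining batches

print(first)
print([len(b) for b in rest])  # the last batch may be smaller
```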

Now we can call `get_data_generator`, which returns an iterator, and get the next batch by calling `next()`.

In [None]:
data_gen = get_data_generator(train_dataset, LANG_TOKEN_MAPPING, tokenizer, 8)
data_batch = next(data_gen)
data_batch[0].shape


Generate the next batch of data. You can run this code cell multiple times and see different results.

In [None]:
data_batch = next(data_gen)
data_batch[0]

Let's see what the sample batch of input and target translations looks like

In [None]:
print("Input sentences:")
for in_ids in data_batch[0]:
  print(' '.join(tokenizer.convert_ids_to_tokens(in_ids)))

print("\nTarget sentences:")
for out_ids in data_batch[1]:
  print(' '.join(tokenizer.convert_ids_to_tokens(out_ids)))


## Finetune the mT5 Model

In [None]:
# Hyper parameters
n_epochs = 5
batch_size = 16
print_freq = 50
# checkpoint_freq = 1000
lr = 5e-4 # range: 1e-3 to 1e-5 
n_batches = int(np.ceil(len(train_dataset) / batch_size)) # divide length of train set by batch size
total_steps = n_epochs * n_batches
n_warmup_steps = int(total_steps * 0.01) # First 1% of steps will be warm up steps where lr can warm up and stabilise

Optimiser and Scheduler
- AdamW is a slightly improved Adam with regards to weight decay
- Scheduler is used to help with adjusting the learning rate as training progresses
- Scheduler should be applied after the parameter update
- `get_linear_schedule_with_warmup` creates a schedule with a learning rate that at first goes through a warmup period during which it increases linearly from 0 to the initial `lr` set in the optimiser, and then decreases linearly from the initial `lr` set in the optimiser to 0
- See [here](https://www.kaggle.com/code/snnclsr/learning-rate-schedulers) for sample schedulers available in PyTorch and Hugging Face (transformers)
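
The shape of the linear warmup schedule described in the list above can be written as a small function of the step number. This is a sketch of the formula only, with a hypothetical function name and toy step counts; it does not call the `transformers` scheduler.

```python
def linear_warmup_lr(step, base_lr, n_warmup_steps, total_steps):
    """Learning rate at `step`: linear rise to `base_lr`, then linear decay to 0."""
    if step < n_warmup_steps:
        return base_lr * step / max(1, n_warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - n_warmup_steps)


# Toy numbers: 1000 steps in total, of which the first 1% are warmup
base_lr, n_warmup, total = 5e-4, 10, 1000
for step in (0, 5, 10, 505, 1000):
    print(step, linear_warmup_lr(step, base_lr, n_warmup, total))
```

The rate peaks at `base_lr` exactly when warmup ends, and it reaches half of `base_lr` midway through the decay phase (step 505 with these numbers).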

In [None]:
optimizer = AdamW(model.parameters(), lr=lr) 
scheduler = get_linear_schedule_with_warmup(
    optimizer, n_warmup_steps, total_steps)

In [None]:
losses = []

Function for Model Evaluation, i.e. forward pass and loss calculation

In [None]:
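def eval_model(model, gdataset, max_iters=8):
  test_generator = get_data_generator(gdataset, LANG_TOKEN_MAPPING,
                                      tokenizer, batch_size)
  eval_losses = []
  was_training = model.training
  model.eval()  # disable dropout while evaluating
  with torch.no_grad():  # gradients are not needed for evaluation
    for i, (input_batch, label_batch) in enumerate(test_generator):
      if i >= max_iters:
        break

      model_out = model(
          input_ids = input_batch,
          labels = label_batch)
      eval_losses.append(model_out.loss.item())
  model.train(was_training)  # restore the mode the model was in before

  return np.mean(eval_losses)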

Mount drive and set a path to save the model 

In [None]:
from google.colab import drive

drive.mount('/content/gdrive')

In [None]:
# Path where model weights will be saved. You can load these later.
model_path = '/content/gdrive/My Drive/mt5_translation.pt'

### Train
- Takes < 5 mins per epoch (~23 mins for 5 epochs)

In [None]:
for epoch in range(n_epochs):
  # Get batch of data of sentence pairs of random languages
  data_generator = get_data_generator(train_dataset, LANG_TOKEN_MAPPING,
                                      tokenizer, batch_size)
                
  # input token ids and expected token ids
  for batch_idx, (input_batch, label_batch) \
      in notebook.tqdm(enumerate(data_generator), total=n_batches): # Progress bar
      # in tqdm_notebook(enumerate(data_generator), total=n_batches): # Progress bar
    optimizer.zero_grad()

    # Forward pass
    model_out = model.forward(
        input_ids = input_batch,
        labels = label_batch) # labels will help with loss calculation (next)

    # Calculate loss and backprop
    loss = model_out.loss
    losses.append(loss.item())
    loss.backward()
    optimizer.step()
    scheduler.step() # change the lr

    # Print training update info every 50 batches
    if (batch_idx + 1) % print_freq == 0:
      avg_loss = np.mean(losses[-print_freq:])
      print('Epoch: {} | Step: {} | Avg. loss: {:.3f} | lr: {}'.format(
          epoch+1, batch_idx+1, avg_loss, scheduler.get_last_lr()[0]))
      
  test_loss = eval_model(model, test_dataset)
  print('Test loss of {:.3f}'.format(test_loss))
  torch.save(model.state_dict(), model_path)


# Save model to specified model path in Drive
torch.save(model.state_dict(), model_path)

### Plot the loss

In [None]:
# Graph the loss

window_size = 50
smoothed_losses = []
for i in range(len(losses)-window_size):
  smoothed_losses.append(np.mean(losses[i:i+window_size]))

plt.plot(smoothed_losses[100:])

### Test on Sample Text from Test Set / Manual

In [None]:
test_sentence = test_dataset[0]['translation']['en']
# test_sentence = 'これは普通のテスト' # English translation: "This is a test sentence"
print('Raw input text:', test_sentence)

# Prepare the sentence
input_ids = encode_input_str(
    text = test_sentence,
    target_lang = 'zh', # specify target language here, e.g. 'en', 'ms', 'ja', 'zh'
    tokenizer = tokenizer,
    seq_len = model.config.max_length,
    lang_token_map = LANG_TOKEN_MAPPING)

input_ids = input_ids.unsqueeze(0).to(device)

print('Truncated input text:', tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(input_ids[0])))

Use the `model.generate` function to translate with the finetuned model. Apart from `input_ids`, `num_beams` sets how many candidate sequences beam search keeps at each generation step (a larger value explores more candidates but is slower), and `num_return_sequences` specifies how many sentences you want returned.

In [None]:
output_tokens = model.generate(input_ids, num_beams=10, num_return_sequences=3)
# print(output_tokens)
for token_set in output_tokens:
  print(tokenizer.decode(token_set, skip_special_tokens=True))
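
To get a feel for what `num_beams` and `num_return_sequences` do, here is a toy beam search over a hand-written table of next-token probabilities. It is a simplified sketch: the probabilities are made up, the function is hypothetical, and details of the real `generate` implementation such as length penalties are left out.

```python
import math

# Made-up next-token probabilities for a toy "model" (not taken from mT5)
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.4, "dog": 0.35, "</s>": 0.25},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def toy_beam_search(num_beams, num_return_sequences, max_len=5):
    """Simplified beam search: keep the `num_beams` best candidates per step."""
    beams = [(["<s>"], 0.0)]  # (tokens so far, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = [
            (tokens + [tok], score + math.log(p))
            for tokens, score in beams
            for tok, p in NEXT_TOKEN_PROBS[tokens[-1]].items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:num_beams]:
            # Completed sentences leave the beam; the rest are extended next step
            (finished if tokens[-1] == "</s>" else beams).append((tokens, score))
        if not beams:
            break
    finished.sort(key=lambda c: c[1], reverse=True)
    return [" ".join(tokens[1:-1]) for tokens, _ in finished[:num_return_sequences]]


print(toy_beam_search(num_beams=1, num_return_sequences=1))  # ['the cat']
print(toy_beam_search(num_beams=2, num_return_sequences=2))  # ['a cat', 'the cat']
```

With a single beam the search is greedy and commits to "the", missing "a cat", whose overall probability (0.4 × 0.9 = 0.36) beats "the cat" (0.6 × 0.4 = 0.24). Keeping two beams recovers it.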