# Practice round: Chinese-English translation

Huggingface transformer doc: https://huggingface.co/transformers/

Huggingface tokenizer doc: https://huggingface.co/transformers/

Useful resources from huggingface -- fine-tuning a model from scratch: https://huggingface.co/blog/how-to-train

The code I wrote before might be helpful: https://github.com/submal/ctec-lambus/blob/master/xprmt/xprmt_06.ipynb

A code example of fine-tuning T5 for text summarization: https://towardsdatascience.com/fine-tuning-a-t5-transformer-for-any-summarization-task-82334c64c81 

The github repo of the example T5 text summarization: https://github.com/priya-dwivedi/Deep-Learning/blob/master/wikihow-fine-tuning-T5/Tune_T5_WikiHow-Github.ipynb

LighningModule API
https://pytorch-lightning.readthedocs.io/en/latest/lightning_module.html#lightningmodule-apihttps://pytorch-lightning.readthedocs.io/en/latest/lightning_module.html#lightningmodule-api

In [1]:
import json
import pandas as pd
import jieba
from tokenizers import SentencePieceBPETokenizer
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import (
    T5Model, 
    T5ForConditionalGeneration, 
    AdamW, 
    get_linear_schedule_with_warmup
)
import pytorch_lightning as pl
from pytorch_lightning import loggers as pl_loggers
import argparse
import time
import numpy as np
import nlp
import logging
import os

# Enable GPU if possible 
device = torch.device(
    'cuda:0' if torch.cuda.is_available() else 'cpu'
)
print(f'device = {device}')

device = cuda:0


## Load data

In [2]:
with open('./cn_en_weibo_data/data.cn-en.json', 'r', encoding = 'utf-8') as myfile:
    raw = myfile.read().split('\n')  

# Turn raw strings into a list of dictionaries
weiboDict = [json.loads(line) for line in raw]

# Load and shuffle data
weiboDf= pd.DataFrame(weiboDict).sample(frac=1).reset_index(drop=True)

weiboDf.tail()

Unnamed: 0,id,source,target
1998,3536490363662881,你唯一应该努力超越的人，就是昨天的自己,YesThe only person you should try to be better...
1999,10613096526,诸事顺遂却不说是走运，只是在骗自己。,Those who have succeeded at anything and don’t...
2000,3471250242574624,海明威说过，这世界很美好，值得我们为之奋斗。我同意后半句。——《七宗罪,Hemingway once wrote，The world is a fine place...
2001,3542673203850352,车站，回家的起点! 无论这里上演过多少苦辣酸甜，它一直都是人们希望,Railway stations: starting point of going home
2002,3498400785985006,你知道就算大雨让这座城市颠倒，我也会给你怀抱。,You know that even if the downpour turned the ...


It seems the data is far from clean. However, for prototyping purpose, we will not focus too much on cleaning right now. 

## Parsing and tokenizing Chinese texts

We use `jieba` library (结巴分词) for parsing Chinese text. For more information, see https://github.com/fxsjy/jieba/

In [3]:
chTexts = weiboDf['source']
enTexts = weiboDf['target']

# Tokenize all Chinese texts in the dataframe and store as a list
chTokensGen = [jieba.cut(sentence) for sentence in chTexts]

# Output a sample tokenization
print(list(chTokensGen[0]))

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\presu\AppData\Local\Temp\jieba.cache
Loading model cost 0.553 seconds.
Prefix dict has been built successfully.


['不要', '只', '追求', '漂亮', '的', '外表', '，', '它会', '蒙蔽', '你', '的', '眼睛', '；', ' ', '不要', '只', '追求', '财富', '，', '它', '只不过', '是', '过眼烟云', '；', '追求', '一个', '能', '让', '你', '真心', '微笑', '的', '人', '，', ' ', '只有', '笑容', '能', '使', '黑夜', '不再', '漫长', '，', '驱走', '阴霾', '，', '带来', '温暖', '阳光']


It turns out with tokenizers based on `sentencePiece`, the tokenization happens at sentence level, and the tokenizer is trained recognize subwords. Therefore we will not use other parsers for now. 

In [4]:
pathAllCh = './cn_en_weibo_data/allCh.txt'
pathAllEn = './cn_en_weibo_data/allEn.txt'

# Store all Chinese text in a single file 
with open(pathAllCh, 'w', encoding = 'utf-8') as file: 
    for line in chTexts:
        file.write(line + '\n')
    file.close()
    
# Store all English text in a single file 
with open(pathAllEn, 'w', encoding = 'utf-8') as file: 
    for line in enTexts: 
        file.write(line + '\n')
    file.close()

My feeling is that we cannot use a pretrained tokenizer to train it from scratch. Instead, we might need to import Byte-Pair Encoding, or WordPiece, or SentencePiece by scratch. 

https://huggingface.co/transformers/tokenizer_summary.html#sentencepiece

https://github.com/huggingface/tokenizers

<a href="https://huggingface.co/docs/tokenizers/python/latest/">Huggingface tokenizer doc</a>

In the following cell, we train a `SentencePiece` tokenizer. 

In [5]:
chTokenizer = SentencePieceBPETokenizer()

chTokenizer.train([pathAllCh], 
                vocab_size = 20000, 
                special_tokens = ['<s>', '<pad>', '</s>', '<unk>', '<mask>'])

# Show an example of tokenizer works
output = chTokenizer.encode(chTexts[0])
print(output.ids, output.tokens, output.offsets, output.attention_mask)

# We shall save the tokenizer to disk 
chTokenizer.save_model('.', 'myTokenizer')

[7099, 2582, 3837, 30, 4117, 394, 1188, 1858, 7389, 1402, 1010, 5757, 2229, 6240, 1828, 1899, 5751, 1002, 355, 1651, 3659, 4244, 15, 1670, 1680, 1351] ['▁不要只追求漂亮的外表', ',它会', '你的眼睛', ';', '▁不要只追求', '富', ',它', '只不过', '是过眼云;', '追求', '一个', '能让你', '真心', '微笑的人,', '▁只有', '笑容', '能使', '黑', '夜', '不再', '漫长', ',走', ',', '带来', '温暖', '阳光'] [(0, 10), (10, 13), (13, 17), (17, 18), (20, 26), (26, 27), (27, 29), (29, 32), (32, 37), (36, 39), (39, 41), (41, 44), (44, 46), (46, 51), (53, 56), (56, 58), (58, 60), (60, 61), (61, 62), (62, 64), (64, 66), (66, 68), (68, 69), (68, 71), (70, 73), (73, 75)] [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


['.\\myTokenizer-vocab.json', '.\\myTokenizer-merges.txt']

We also have the option to encode a list of texts as a batch. 

In [6]:
output_batch = chTokenizer.encode_batch(chTexts[:3])

for output in output_batch:
    print(output.tokens)

['▁不要只追求漂亮的外表', ',它会', '你的眼睛', ';', '▁不要只追求', '富', ',它', '只不过', '是过眼云;', '追求', '一个', '能让你', '真心', '微笑的人,', '▁只有', '笑容', '能使', '黑', '夜', '不再', '漫长', ',走', ',', '带来', '温暖', '阳光']
['▁"', '▁——', '我希望找到这样一个人,即使', '我', '微笑着', '说“', '我还', '好”', '的时候', ',他', '也能', '觉得', '到', '我的', '痛苦', '。']
['▁生活的', '地', '平', '线', '是', '随着', '心的', '开', '而变', '广', '的。']


## Demo: padding and truncation

Huggingface tokenizer allows us to pad or truncate according to a length. The following are common utilities for padding and truncation: 

`Tokenizer.enable_padding(**args)` -- Enable padding

`Tokenizer.padding` -- Info about padding

`Tokenizer.no_padding()` -- Disable padding

`Tokenizer.enable_truncation(**args)` -- Enable truncation 

`Tokenizer.truncation` -- Info about truncation 

`Tokenizer.no_truncation()` -- Disable truncation 

In [7]:
# With Padding
chTokenizer.enable_padding(length = 15)

output_batch = chTokenizer.encode_batch(chTexts[:3])

for output in output_batch:
    print(output.tokens)
    
print(chTokenizer.padding)

['▁不要只追求漂亮的外表', ',它会', '你的眼睛', ';', '▁不要只追求', '富', ',它', '只不过', '是过眼云;', '追求', '一个', '能让你', '真心', '微笑的人,', '▁只有', '笑容', '能使', '黑', '夜', '不再', '漫长', ',走', ',', '带来', '温暖', '阳光']
['▁"', '▁——', '我希望找到这样一个人,即使', '我', '微笑着', '说“', '我还', '好”', '的时候', ',他', '也能', '觉得', '到', '我的', '痛苦', '。']
['▁生活的', '地', '平', '线', '是', '随着', '心的', '开', '而变', '广', '的。', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
{'length': 15, 'pad_to_multiple_of': None, 'pad_id': 0, 'pad_token': '[PAD]', 'pad_type_id': 0, 'direction': 'right'}


In [8]:
# No padding
chTokenizer.no_padding()

output_batch = chTokenizer.encode_batch(chTexts[:3])

for output in output_batch:
    print(output.tokens)
    
print(chTokenizer.padding)

['▁不要只追求漂亮的外表', ',它会', '你的眼睛', ';', '▁不要只追求', '富', ',它', '只不过', '是过眼云;', '追求', '一个', '能让你', '真心', '微笑的人,', '▁只有', '笑容', '能使', '黑', '夜', '不再', '漫长', ',走', ',', '带来', '温暖', '阳光']
['▁"', '▁——', '我希望找到这样一个人,即使', '我', '微笑着', '说“', '我还', '好”', '的时候', ',他', '也能', '觉得', '到', '我的', '痛苦', '。']
['▁生活的', '地', '平', '线', '是', '随着', '心的', '开', '而变', '广', '的。']
None


In [9]:
# With truncation
chTokenizer.enable_truncation(max_length = 3)

output_batch = chTokenizer.encode_batch(chTexts[:3])

for output in output_batch:
    print(output.tokens)
    
print(chTokenizer.truncation)

['▁不要只追求漂亮的外表', ',它会', '你的眼睛']
['▁"', '▁——', '我希望找到这样一个人,即使']
['▁生活的', '地', '平']
{'max_length': 3, 'stride': 0, 'strategy': 'longest_first'}


In [10]:
# No truncation
chTokenizer.no_truncation()

output_batch = chTokenizer.encode_batch(chTexts[:3])

for output in output_batch:
    print(output.tokens)
    
print(chTokenizer.truncation)

['▁不要只追求漂亮的外表', ',它会', '你的眼睛', ';', '▁不要只追求', '富', ',它', '只不过', '是过眼云;', '追求', '一个', '能让你', '真心', '微笑的人,', '▁只有', '笑容', '能使', '黑', '夜', '不再', '漫长', ',走', ',', '带来', '温暖', '阳光']
['▁"', '▁——', '我希望找到这样一个人,即使', '我', '微笑着', '说“', '我还', '好”', '的时候', ',他', '也能', '觉得', '到', '我的', '痛苦', '。']
['▁生活的', '地', '平', '线', '是', '随着', '心的', '开', '而变', '广', '的。']
None


<span style="color:red;">Pending problem.</span> As I tried to follow the tutorial and load the tokenizer saved on disk, unexpected error was reported. For now, skip loading saved tokenizer and proceed with other important steps.  

<span style="color:red;">Bottleneck for now.</span> Do we need special token for T5? If yes, how to insert special T5 tokens into our tokenization? Similar to `tokenizers.processors.BertProcessing`, do we have `tokenizers.processors.T5Processing`? 

<span style="color:red;">Solution.</span> 1. Thoroughtly read documentation for T5 model in huggingface doc; 2. Explore `huggingface/tokenizers` library on github. 

For now, halt with tokenizer and proceed with language model until bumping into problems. Keep in mind the confusion about special token. 

## Tokenizing English text

In [11]:
enTokenizer = SentencePieceBPETokenizer()

enTokenizer.train([pathAllEn], 
                vocab_size = 20000, 
               special_tokens = ['<s>', '<pad>', '</s>', '<unk>', '<mask>'])

# Show an example of tokenizer works
output = enTokenizer.encode(enTexts[0])
print(output.ids, output.tokens, output.offsets, output.attention_mask)

# We shall save the tokenizer to disk 
# tokenizer.save_model('.', 'myTokenizer')

[1811, 970, 925, 5501, 4278, 950, 1174, 2443, 20, 1811, 970, 925, 6057, 3651, 936, 4384, 3427, 2195, 925, 1074, 1002, 1407, 874, 2725] ['▁Don’t', '▁go', '▁for', '▁looks;', 'they', '▁can', '▁de', 'ceive', '.', '▁Don’t', '▁go', '▁for', '▁wealth;', 'even', '▁that', '▁fades', '▁away.', '▁Go', '▁for', '▁someone', '▁who', '▁makes', '▁you', '▁smile.'] [(0, 5), (5, 8), (8, 12), (12, 19), (19, 23), (23, 27), (27, 30), (30, 35), (35, 36), (36, 42), (42, 45), (45, 49), (49, 57), (57, 61), (61, 66), (66, 72), (72, 78), (78, 81), (81, 85), (85, 93), (93, 97), (97, 103), (103, 107), (107, 114)] [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


## Preprocess before training

To utilize PyTorch and GPU computation, we need to create instances of `Dataset` object. 

`DataLoader` class allows us to iterate a dataset with given batch size. We will define dataloaders in `T5FineTuner` class. 

The following cell overwrites `Dataset` class. 

In [12]:
class MyDataset(Dataset):
    
    '''
    Load original data and util from memory or file
    Texts must be passed as lists 
    '''
    def __init__(self, 
                 chTexts, enTexts, # Suppose the two colums are of the same length
                 chTokenizer, enTokenizer, 
                 chMaxLen, enMaxLen): 
        super().__init__()
        self.chTexts = chTexts 
        self.enTexts = enTexts
        self.chTokenizer = chTokenizer
        self.enTokenizer = enTokenizer 
        
        # Enable padding and truncation
        self.chTokenizer.enable_padding(length = chMaxLen)
        self.chTokenizer.enable_truncation(max_length = chMaxLen)
        self.enTokenizer.enable_padding(length = enMaxLen)
        self.enTokenizer.enable_truncation(max_length = enMaxLen)
        
    '''
    Return the size of dataset
    '''
    def __len__(self):
        return len(self.chTexts)
    
    '''
    -- The routine for querying one data entry 
    -- The index of must be specified as an argument
    -- Return a dictionary 
    '''
    def __getitem__(self, idx): 
        # Apply tokenizer 
        chOutputs = chTokenizer.encode(chTexts[idx])
        enOutputs = enTokenizer.encode(enTexts[idx])
        
        # Get numerical tokens
        chEncoding = chOutputs.ids
        enEncoding = enOutputs.ids
        
        # Get attention mask 
        chMask = chOutputs.attention_mask
        enMask = enOutputs.attention_mask
        
        return {
            'source_ids': torch.tensor(chEncoding), 
            'source_mask': torch.tensor(chMask), 
            'target_ids': torch.tensor(enEncoding), 
            'target_mask': torch.tensor(enMask)
        }
    

Now we test `Dataset`. 

In [13]:
chMaxLen = 100
enMaxLen = 100

dataset = MyDataset(chTexts[:1500], enTexts[:1500], 
                    chTokenizer, enTokenizer, 
                    chMaxLen = chMaxLen, enMaxLen = enMaxLen)

print(len(dataset))
# print(dataset.__getitem__(0))

# dataloader = DataLoader(dataset, batch_size = 16, num_workers = 0)

1500


## Define model class

PyTorch native, despite its great flexibility, may trap you in detailed errors that mess up the entire code. For example, you may forget important details like `optimizer.zero_grad()` or `tensor.to(device)` in PyTorch native. For both learning purpose and clarity in the long run, we use `pytorch_lightning` to define model class. 

In [14]:
class T5FineTuner(pl.LightningModule): 
    
    ''' Part 1: Define the architecture of model in init '''
    def __init__(self, hparams): 
        super(T5FineTuner, self).__init__()
        self.hparams = hparams
        self.model = T5ForConditionalGeneration.from_pretrained(
            hparams['pretrainedModelName'], 
            return_dict = True    # I set return_dict true so that outputs  are presented as dictionaries
        )
        self.chTokenizer = hparams['chTokenizer']
        self.enTokenizer = hparams['enTokenizer']
        # self.rouge_metric = nlp.load_metric('rouge')
        
        # No idea what the "freeze" is doing
        if self.hparams['freeze_embeds']:
            self.freeze_embeds()
        if self.hparams['freeze_encoder']:
            self.freeze_params(self.model.get_encoder())
            assert_all_frozen(self.model.get_encoder())
            
            
            
    ''' Part 2: Define the forward propagation '''
    def forward(self, 
                input_ids, 
                attention_mask = None, 
                decoder_input_ids = None, 
                decoder_attention_mask = None, 
                lm_labels = None
               ): 
        # Type `Seq2SeqLMOutput`
        return self.model(
            input_ids, 
            attention_mask = attention_mask, 
            decoder_input_ids = decoder_input_ids, 
            decoder_attention_mask = decoder_attention_mask, 
            lm_labels = lm_labels
        )
    
    
    ''' Part 3: Prepare optimizer and scheduler '''
    def configure_optimizers(self): 
        model = self.model 
        no_decay = ['bias', 'LayerNorm.weight']
        optimizer_grouped_parameters = [
            {
                # model.named_parameters() can't find doc?
                'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 
                'weight_decay': self.hparams['weight_decay']
            }, 
            {
                'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
                'weight_decay': 0.0
            }
        ]
        optimizer = AdamW(
            optimizer_grouped_parameters, 
            lr = self.hparams['learning_rate']
        )
        self.opt = optimizer
        return [optimizer]
    
    
    # Override this method to adjust how Trainer calls each optimizer 
    def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, optimizer_closure = None, on_tpu = False, using_native_amp = False, using_lbfgs = False):
        optimizer.step()
        optimizer.zero_grad()   
        self.lr_scheduler.step()
    
    
    '''
    -- Part 4: Define training logic
    -- In PyTorch native, we have to manually define the epoch loop, define the batch loop, and manually perform model.train(), loss.backward(), optimizer.step(), optimizer.zero_grad()
    -- In pytorch_lightening, the training_step() method only needs to return the loss of the batch
    '''
    def training_step(self, batch, batch_idx): 
        loss = self._step(batch)
        tensorboard_logs = {'train_loss': loss}
        self.log('my_loss', loss, prog_bar = True, on_step = True, on_epoch = True, logger = True)
        return {'loss': loss, 'log': tensorboard_logs}
    
    
    # subroutine for training_step()
    def _step(self, batch): 
        lm_labels = batch['target_ids']    # !! Does not apply! 
        lm_labels[lm_labels[:, ] == 0] = -100    # !! Verify that id for pad is 0 
         
        # !! There is a `__call__` method associated with self ?! 
        outputs = self(
            input_ids = batch['source_ids'],    # !! Does not apply! 
            attention_mask = batch['source_mask'], 
            lm_labels = lm_labels, 
            decoder_attention_mask = batch['target_mask']
        )
        
        return outputs.loss    # !! Or should it be outputs[0] ? 
    
    
    # Called at the end of training epoch 
    # Do something with all the outputs from every training step 
    def training_epoch_end(self, outputs): 
        avg_train_loss = torch.stack([x['loss'] for x in outputs]).mean()
        tensorboard_logs = {'avg_train_loss': avg_train_loss}
        return {
            'avg_train_loss': avg_train_loss, 
            'log': tensorboard_logs, 
            'progress_bar': tensorboard_logs
        }
    
    
    
    '''
    -- Part 5: Define validation logic
    -- In PyTorch native, we have to define the batch loop, and manually perform model.eval(), torch.no_grad()
    -- In pytorch_lightening, the training_step() method only needs to return the loss of the batch
    '''
    def validation_step(self, batch, batch_idx): 
        return self._generative_step(batch)
    
    # subroutine for validation_step()
    def _generative_step(self, batch): 
        t0 = time.time()
        
        # !! model.generate() Can't find doc !
        generated_ids = self.model.generate(
            batch['source_ids'],    # !! Does not apply ! 
            attention_mask = batch['source_mask'], 
            use_cache = True, 
            decoder_attention_mask = batch['target_mask'],     # !! Does not apply! 
            max_length = self.hparams['max_output_len'],
            num_beams = 2,     # ?? What is this?
            repetition_penalty = 2.5,     # ?? What is this?
            length_penalty = 1.0,     # ?? What is this?
            early_stopping = True    # ?? What is this?
        )
        
        preds = self.ids_to_clean_text(enTokenizer, generated_ids)    # translation predicted by model 
        target = self.ids_to_clean_text(enTokenizer, batch['target_ids'])    # !! Does not apply! 
        
        gen_time = (time.time() - t0) / batch['source_ids'].shape[0]    # !! Does not apply
        
        loss = self._step(batch)
        
        # Compute metrics
        # ?? What is the deal with "rouge" in the code example? 
        base_metrics = {'val_loss': loss}
        trans_len = np.mean(list(map(len, generated_ids)))
        base_metrics.update(
            gen_time = gen_time, 
            gen_len = trans_len, 
            preds = preds, 
            target = target
        )
        # self.rouge_metric.add_batch(preds, target)
        
        return base_metrics
        
    
    #
    def validation_epoch_end(self, outputs): 
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss}
        
        # rouge_results = self.rouge_metric.compute()
        # rouge_dict = self.parse_score(rouge_results)
        # tensorboard_logs.update(rouge1 = rouge_dict['rouge1'], rougeL = rouge_dict['rougeL'])
        
        # Clear out the lists for next epoch 
        # !! I don't see those two variables defined anywhere 
        self.target_gen = []
        self.prediction_gen = []
        
        return {
            'avg_val_loss': avg_loss, 
            # 'rouge1': rouge_results['rouge1'], 
            # 'rougeL': rouge_results['rougeL'], 
            'log': tensorboard_logs,
            'progress_bar': tensorboard_logs
        }
        
        
        
    '''Part 6: Define dataloaders'''
    def train_dataloader(self): 
        train_dataset = get_dataset(
            chTexts = chTexts[:1800], 
            enTexts = enTexts[:1800], 
            chTokenizer = self.chTokenizer, 
            enTokenizer = self.enTokenizer, 
            chMaxLen = self.hparams['max_input_len'], 
            enMaxLen = self.hparams['max_output_len']
        )
        
        dataloader = DataLoader(
            train_dataset, 
            batch_size = self.hparams['train_batch_size'], 
            drop_last = True, 
            shuffle = True, 
            num_workers = 0 
        )
        
        # The code below deals with scheduler. And I have no idea what the code is doing 
        t_total = (
            (len(dataloader.dataset) // (self.hparams['train_batch_size'] * max(1, self.hparams['n_gpu'])))
            // self.hparams['gradient_accumulation_steps']
            * float(self.hparams['num_train_epochs'])
        )
        scheduler = get_linear_schedule_with_warmup(
            self.opt, num_warmup_steps=self.hparams['warmup_steps'], num_training_steps=t_total
        )
        self.lr_scheduler = scheduler
        return dataloader

    
    def val_dataloader(self):
        val_dataset = get_dataset(
            chTexts = chTexts[1800:1950], 
            enTexts = enTexts[1800:1950], 
            chTokenizer = self.chTokenizer, 
            enTokenizer = self.enTokenizer, 
            chMaxLen = self.hparams['max_input_len'], 
            enMaxLen = self.hparams['max_output_len']
        )
        
        return DataLoader(
            val_dataset, 
            batch_size = self.hparams['eval_batch_size'], 
            num_workers = 0
        )
    
    
    def test_dataloader(self):
        test_dataset = get_dataset(
            chTexts = chTexts[1950:], 
            enTexts = enTexts[1950:], 
            chTokenizer = self.chTokenizer, 
            enTokenizer = self.enTokenizer, 
            chMaxLen = self.hparams['max_input_len'], 
            enMaxLen = self.hparams['max_output_len']
        )
        
        return DataLoader(
            test_dataset, 
            batch_size = self.hparams['eval_batch_size'], 
            num_workers = 0
        )
    
    
    
    ''' ==================================
    # Collection of helper functions 
    # Not predefined by LightningModule
    ===================================='''
    
    # Decode a batch of ids and return a list of strings 
    def ids_to_clean_text(self, tokenizer, ids_batch): 
        ids_batch_tensor = torch.tensor(ids_batch)
        # Make sure that the ids come as a batch and that decode_batch() method will work properly 
        assert (ids_batch_tensor.ndim >= 2), 'Ids do not form a batch'
        return tokenizer.decode_batch(ids_batch.tolist())
            
        
    # tqdm is a library for showing progress bar 
    # Retrieve info needed for progresse bar 
    def get_tqdm_dict(self): 
        # !! What is self.trainer? I never saw it defined 
        tqdm_dict = {
            'loss': '{:.3f}'.format(self.trainer.avg_loss), 
            'lr': self.lr_scheduler.get_last_lr()[-1]
        }
        return tqdm_dict
    
    '''=================================
    # Methods that I have no idea what they are doing 
    =================================='''
    def freeze_params(self, model):
        for par in model.parameters():
            par.requires_grad = False
            
            
    def freeze_embeds(self):
        # Freeze token embeddings and positional embeddings for bart, just token embeddings for t5.
        try:
            self.freeze_params(self.model.model.shared)
            for d in [self.model.model.encoder, self.model.model.decoder]:
                freeze_params(d.embed_positions)
                freeze_params(d.embed_tokens)
        except AttributeError:
            self.freeze_params(self.model.shared)
            for d in [self.model.encoder, self.model.decoder]:
                self.freeze_params(d.embed_tokens)
                

In [15]:
hparams = {
    'chTokenizer': chTokenizer, 
    'enTokenizer': enTokenizer, 
    'pretrainedModelName': 't5-small', 
    'weight_decay': 0.0, 
    'learning_rate': 3e-4, 
    'max_input_len': 100, 
    'max_output_len': 100, 
    'train_batch_size': 8, 
    'eval_batch_size': 8, 
    'num_train_epochs': 2, 
    'n_gpu': 1
    # For now, we do train-test split manually when defining dataloader, instead of loading the following param 
    # 'n_train': 2000
    # 'n_val': 150
    # 'n_test': 50
}

# ?? What are these hyperparameters ? 
hparamsNoUnderstand = {
    'freeze_encoder': False, 
    'freeze_embeds': False, 
    'adam_epsilon': 1e-8,
    'warmup_steps': 0,
    'gradient_accumulation_steps': 8,
    'resume_from_checkpoint': None, 
    'val_check_interval': 0.05,
    'early_stop_callback': False, 
    'fp_16': False, 
    'opt_level': 'O1', 
    'max_grad_norm': 1.0, 
    'seed': 42
}

hparams.update(hparamsNoUnderstand)


I have no idea what the following code cells are doing. 

In [16]:
tb_logger = pl_loggers.TensorBoardLogger('logs/')

## If resuming from checkpoint, add an arg resume_from_checkpoint
train_params = dict(
    accumulate_grad_batches=hparams['gradient_accumulation_steps'],
    gpus=hparams['n_gpu'],
    max_epochs=hparams['num_train_epochs'], 
    # early_stop_callback=False,
    precision= 16 if hparams['fp_16'] else 32,
    amp_level=hparams['opt_level'],
    resume_from_checkpoint=hparams['resume_from_checkpoint'],
    gradient_clip_val=hparams['max_grad_norm'], 
    # checkpoint_callback=checkpoint_callback,
    val_check_interval=hparams['val_check_interval'],
    logger = tb_logger
    # callbacks=[LoggingCallback()],
)

## Train model

In [17]:
def get_dataset(chTexts, enTexts, chTokenizer, enTokenizer, chMaxLen, enMaxLen):
    return MyDataset(chTexts, enTexts, chTokenizer, enTokenizer, chMaxLen, enMaxLen)

In [18]:
model = T5FineTuner(hparams)
trainer = pl.Trainer(**train_params)
trainer.fit(model)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 60 M  


HBox(children=(HTML(value='Validation sanity check'), FloatProgress(value=1.0, bar_style='info', layout=Layout…

  ids_batch_tensor = torch.tensor(ids_batch)
Please use self.log(...) inside the lightningModule instead.

# log on a step or aggregate epoch metric to the logger and/or progress bar
# (inside LightningModule)
self.log('train_loss', loss, on_step=True, on_epoch=True, prog_bar=True)
Please use self.log(...) inside the lightningModule instead.

# log on a step or aggregate epoch metric to the logger and/or progress bar
# (inside LightningModule)
self.log('train_loss', loss, on_step=True, on_epoch=True, prog_bar=True)


HBox(children=(HTML(value='Training'), FloatProgress(value=1.0, bar_style='info', layout=Layout(flex='2'), max…

HBox(children=(HTML(value='Validating'), FloatProgress(value=1.0, bar_style='info', layout=Layout(flex='2'), m…

HBox(children=(HTML(value='Validating'), FloatProgress(value=1.0, bar_style='info', layout=Layout(flex='2'), m…






1

In [19]:
# Save model
trainer.save_checkpoint("safecheck2011121455_t5_ch_en_1.ckpt")

In [20]:
# Load from saved model 
modelLoaded = T5FineTuner.load_from_checkpoint(checkpoint_path="t5_ch_en_1.ckpt").to(device)

## Check model predictions

In [21]:
import textwrap 
from tqdm.auto import tqdm 

datashow = get_dataset(
            chTexts = chTexts, 
            enTexts = enTexts, 
            chTokenizer = chTokenizer, 
            enTokenizer = enTokenizer, 
            chMaxLen = hparamsDict['max_input_len'], 
            enMaxLen = hparamsDict['max_output_len']
        )

loader = DataLoader(datashow, batch_size = 32)
it = iter(loader)

batch = next(it)

outs = modelLoaded.model.generate(
    batch['source_ids'].cuda(), 
    attention_mask=batch['source_mask'].cuda(),
    use_cache=True,
    decoder_attention_mask=batch['target_mask'].cuda(),
    max_length = hparamsDict['max_output_len'], 
    num_beams = 2, 
    repetition_penalty = 2.5, 
    length_penalty = 1.0, 
    early_stopping = True
)


NameError: name 'hparamsDict' is not defined

In [None]:
preds = [enTokenizer.decode(ids) for ids in outs.tolist()]

texts = [chTokenizer.decode(ids) for ids in batch['source_ids'].tolist()]
targets = [enTokenizer.decode(ids) for ids in batch['target_ids'].tolist()]

for i in range(32):
    lines = textwrap.wrap("Chinese Text:\n%s\n" % texts[i], width=100)
    print("\n".join(lines))
    print("\nActual translation: %s" % targets[i])
    print("\nPredicted translation: %s" % preds[i])
    print("=====================================================================\n")

