# Assignment 3
## Introduction
In this Assignment, we will implement a transformer model to finish a translation task between English and Chinese step by step. Before we start, you are suggested to read the original paper, the lecture of Hung-yi Lee and the blogs of the transformers. It may take much time to do those, however, only in this way, can you get the deep understanding of the task.

### the original paper
- paper: https://arxiv.org/pdf/1706.03762.pdf

### transformer blog: 
- https://ketanhdoshi.github.io/Transformers-Overview/
- https://ketanhdoshi.github.io/Transformers-Arch/
- https://ketanhdoshi.github.io/Transformers-Attention/
- https://ketanhdoshi.github.io/Transformers-Why/


### the lecture of Hung-yi Lee

- bilibili: https://www.bilibili.com/video/BV1v3411r78R/?spm_id_from=333.337.search-card.all.click&vd_source=155ff7fe8c811c0bd4176244f231e86b
 
- slides: https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/self_v7.pdf

Also, here are some implementations of the other frameworks which you can refer.

- [implement of keras](https://keras.io/examples/nlp/neural_machine_translation_with_transformer/)
- [implement of huggingface](https://huggingface.co/learn/nlp-course/chapter7/4?fw=pt)
## Download dataset
Let's start with the dataset

In [31]:
# check your development environment.
import torch
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
import logging
import sys
from pathlib import Path
import datetime
import os

def beijing(sec, what):
    beijing_time = datetime.datetime.now() + datetime.timedelta(hours=8)
    return beijing_time.timetuple()


logging.Formatter.converter = beijing
# set log
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s: %(message)s',
                     datefmt='%Y-%m-%d %H:%M:%S',)

logging.info('The version information:')
logging.info(f'Python: {sys.version}')
logging.info(f'PyTorch: {torch.__version__}')
assert torch.cuda.is_available() == True, 'Please finish your GPU develop environment'

## Fix random seed

In [32]:
import random
import numpy as np

seed = 2023

random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  
np.random.seed(seed)  
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

logging.info(f'The random seed is fixed to {seed}')

In [33]:
#Download and unzip files
import requests

#define download function
def download(url, save_dir = Path.cwd()):
    os.makedirs(save_dir, exist_ok = True)
    
    file_name = url.split('/')[-1]
    file_path = save_dir / file_name
    if file_path.exists():
        logging.info(f'{file_name} exists!')
        return 
    logging.info(f'downloading {file_name} from {url}')
    response = requests.get(url)
    if response.ok:
        with open(file_path, 'wb') as f:
            f.write(response.content)
        logging.info(f"download  {file_name} from {url} successfully!")
    else:
        print(f"Fail to download  {file_name} from {url}")

In [34]:
dataset_url = 'http://data.statmt.org/wmt18/translation-task/training-parallel-nc-v13.tgz'

In [35]:
dataset_dir = Path('dataset')
download(dataset_url, save_dir=dataset_dir)

In [36]:
import tarfile
tgz_file_path = dataset_dir / 'training-parallel-nc-v13.tgz'
dataset_path = dataset_dir / 'training-parallel-nc-v13'

In [37]:
tar = tarfile.open(tgz_file_path)
if not dataset_path.exists():
    logging.info(f"exact {tgz_file_path} to {dataset_dir}")
    tar.extractall(dataset_dir)
else:
    logging.info(f"{dataset_path} exists!")

In [38]:
chinese_data_path = dataset_path / 'news-commentary-v13.zh-en.zh'
english_data_path = dataset_path / 'news-commentary-v13.zh-en.en'

## Get corpus

In [39]:
chinese_lines = []
english_lines = []

In [40]:
chinese_data_file = open(chinese_data_path, 'r')
english_data_file = open(english_data_path, 'r')

In [41]:
chinese_data_list = list(chinese_data_file.readlines())
english_data_list = list(english_data_file.readlines())

In [42]:
assert  len(chinese_data_list) == len(english_data_list) and len(chinese_data_list) == 252777, \
    'The number of sample error! Please load the dataset again'

In [43]:
number_of_samples = 5
index = 0
for chinese_sentence, english_sentence in zip(chinese_data_list, english_data_list):
    print(index, '\n Chinese sentence: ' + chinese_sentence, 'English sentence: ' , english_sentence)
    index = index + 1
    if index > number_of_samples:
        break

0 
 Chinese sentence: 1929年还是1989年?
 English sentence:  1929 or 1989?

1 
 Chinese sentence: 巴黎-随着经济危机不断加深和蔓延，整个世界一直在寻找历史上的类似事件希望有助于我们了解目前正在发生的情况。
 English sentence:  PARIS – As the economic crisis deepens and widens, the world has been searching for historical analogies to help us understand what has been happening.

2 
 Chinese sentence: 一开始，很多人把这次危机比作1982年或1973年所发生的情况，这样得类比是令人宽心的，因为这两段时期意味着典型的周期性衰退。
 English sentence:  At the start of the crisis, many people likened it to 1982 or 1973, which was reassuring, because both dates refer to classical cyclical downturns.

3 
 Chinese sentence: 如今人们的心情却是沉重多了，许多人开始把这次危机与1929年和1931年相比，即使一些国家政府的表现仍然似乎把视目前的情况为是典型的而看见的衰退。
 English sentence:  Today, the mood is much grimmer, with references to 1929 and 1931 beginning to abound, even if some governments continue to behave as if the crisis was more classical than exceptional.

4 
 Chinese sentence: 目前的趋势是，要么是过度的克制（欧洲），要么是努力的扩展（美国）。
 English sentence:  The tendency is either excessive restraint (Eur

## Dataset division

In [44]:
dataset_list = []
for chinese_sentence, english_sentence in zip(chinese_data_list, english_data_list):
    dataset_list.append([english_sentence.replace('\n',''), chinese_sentence.replace('\n','')])
print(dataset_list[:5])

[['1929 or 1989?', '1929年还是1989年?'], ['PARIS – As the economic crisis deepens and widens, the world has been searching for historical analogies to help us understand what has been happening.', '巴黎-随着经济危机不断加深和蔓延，整个世界一直在寻找历史上的类似事件希望有助于我们了解目前正在发生的情况。'], ['At the start of the crisis, many people likened it to 1982 or 1973, which was reassuring, because both dates refer to classical cyclical downturns.', '一开始，很多人把这次危机比作1982年或1973年所发生的情况，这样得类比是令人宽心的，因为这两段时期意味着典型的周期性衰退。'], ['Today, the mood is much grimmer, with references to 1929 and 1931 beginning to abound, even if some governments continue to behave as if the crisis was more classical than exceptional.', '如今人们的心情却是沉重多了，许多人开始把这次危机与1929年和1931年相比，即使一些国家政府的表现仍然似乎把视目前的情况为是典型的而看见的衰退。'], ['The tendency is either excessive restraint (Europe) or a diffusion of the effort (the United States).', '目前的趋势是，要么是过度的克制（欧洲），要么是努力的扩展（美国）。']]


In [45]:
from sklearn.model_selection import train_test_split
# train:test:dev = 8:1:1
train_dataset, test_and_dev_dataset = train_test_split(dataset_list, shuffle=True, test_size=0.2, random_state=2023)
test_dataset, dev_dataset = train_test_split(test_and_dev_dataset, shuffle=True, test_size=0.5, random_state=2023)

In [46]:
train_dataset[:5]

[['ABU DHABI – In Hermann Hesse’s novel Journey to the East, the character of H.H., a novice in a religious group known as The League, describes a figurine depicting himself next to the group’s leader, Leo.',
  '阿布扎比—赫尔曼·黑塞（Hermann Hesse）的小说《东游记》（Journey to the East）中的角色H. H.'],
 ['The second critical aspect is how one treats future outcomes relative to current ones – an issue that has aroused much attention among philosophers as well as economists.',
  '第二个关键的方面是人们应该看待未来的结果与当前的人的关系，这个问题在哲学家以及经济学家都引起了很大的关注。'],
 ['Pioneering Moroccan feminists began their work soon after independence in 1956.',
  '摩洛哥女权主义先驱们在1956年国家获得独立之后不久就投身于他们的追求。'],
 ['Stopping the spread of nuclear weapons, promoting more efficient energy use, taking action on climate change, and maintaining an open global economy – these and other tasks require Chinese participation, even cooperation, if globalization is not to overwhelm us all.',
  '阻止核武器扩散，促进能源更有效地利用，对气候变化采取措施以及维护开放的全球经济——这些以及其它的工作都需要中国的参与，甚至是合作，如果我们不想全球化压倒我们的话。

In [47]:
test_dataset[:5]

[['Electrifying agricultural areas would facilitate the storage and transportation of farmed products, improve food security, and increase farmers’ earning capacity.',
  '电气化农业区能够便利农作物的储存和 运输，改善粮食安全，提高农民的收入能力。'],
 ['Why, then, does the narrative still have such a hold on us today?',
  '那么，为何这一叙事至今仍有如此大的市场？'],
 ['Ideology trumps factuality.', '意识形态凌驾于事实。'],
 ['The tax would make debt more expensive, because it would be taxed, while making equity cheaper, because profits would not be taxed.',
  '税收将让债务变得更贵，因为债务将被课税，同时股本会变得更便宜，因为利润不会被课税。'],
 ['Governments will become more effective in the future only if voters learn to become more demanding of the policies that future governments adopt.',
  '如果选民学会提高对未来政府的政策要求，政府或许能在不久的将来实现效率提升。']]

In [48]:
dev_dataset[:5]

[['Even the most conservative estimates paint a grim picture.',
  '即使是最保守的估计也相当惨淡。'],
 ['Ethical machines would pose no threat to humanity.', '有伦理的机器不会造成人道威胁。'],
 ['The classic case is Britain’s short-lived return to gold in the interwar period.',
  '两次世界大战期间英国曾短暂恢复金本位制是这方面的经典案例。'],
 ['Because it is.', '因为事实就是如此。'],
 ['But this interpretation is a complete – and sometimes deliberate – misunderstanding of bank capital and the policies being pursued.',
  '但这一解读完全是对银行资本以及目前所追求的政策的误解，有时这是有意为之。']]

In [49]:
len(train_dataset), len(test_dataset), len(dev_dataset)

(202221, 25278, 25278)

In [50]:
import json
json_dir_path = Path('dataset')
Path.mkdir(json_dir_path, exist_ok=True)

dataset_list = [train_dataset, test_dataset, dev_dataset]
json_name_list = ['train.json', 'test.json', 'dev.json']
for dataset, json_name in zip(dataset_list, json_name_list):
    dataset_path = json_dir_path / json_name
    json_data = json.dumps(dataset)
    with open(dataset_path, "w",encoding = 'utf-8') as file:
        file.write(json_data)
    print(f'save {json_name} to {dataset_path}')

save train.json to dataset/train.json
save test.json to dataset/test.json
save dev.json to dataset/dev.json


## Tokenize
The step of Tokenizing is a very important step in the natrual language process(NLP) field. Please refer to [blog:why and how to tokenize](https://huggingface.co/learn/nlp-course/chapter6/1?fw=pt) to learn more about this process. 
[SentencePiece](https://github.com/google/sentencepiece/) is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [Sennrich et al.]) and unigram language model [Kudo.]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.
In this assignment, we will use the Byte-Pair Encoding([BPE](https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt)) tokenization. Byte-Pair Encoding (BPE) was initially developed as an algorithm to compress texts, and then used by OpenAI for tokenization when pretraining the GPT model. It’s used by a lot of Transformer models, including GPT, GPT-2, RoBERTa, BART, and DeBERTa.

In [51]:
import sentencepiece as spm
def train(input_file, vocab_size, model_name, model_type, character_coverage):
    """
    search on https://github.com/google/sentencepiece/blob/master/doc/options.md to learn more about the parameters
    :param input_file: one-sentence-per-line raw corpus file. No need to run tokenizer, normalizer or preprocessor.
                       By default, SentencePiece normalizes the input with Unicode NFKC.
                       You can pass a comma-separated list of files.
    :param vocab_size: vocabulary size, e.g., 8000, 16000, or 32000
    :param model_name: output model name prefix. <model_name>.model and <model_name>.vocab are generated.
    :param model_type: model type. Choose from unigram (default), bpe, char, or word.
                       The input sentence must be pretokenized when using word type.
    :param character_coverage: amount of characters covered by the model, good defaults are: 0.9995 for languages with
                               rich character set like Japanse or Chinese and 1.0 for other languages with
                               small character set.
    """
    input_argument = '--input=%s --model_prefix=%s --vocab_size=%s --model_type=%s --character_coverage=%s ' \
                     '--pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3 '
    cmd = input_argument % (input_file, model_name, vocab_size, model_type, character_coverage)
    cmd = 'spm_train '+ cmd
    # spm.SentencePieceTrainer.Train(cmd)
    # in this assignment, we use the following command
    os.system(cmd)


In [52]:
tokenizer_dir  = Path('tokenizer')
os.makedirs(tokenizer_dir, exist_ok = True)
eng_model_path = tokenizer_dir / Path('eng.model')
eng_vocab_path = tokenizer_dir /Path('eng.vocab')
chn_model_path = tokenizer_dir /Path('chn.model')
chn_vocab_path = tokenizer_dir / Path('chn.vocab')
eng_model_path.exists(),eng_vocab_path.exists(),chn_model_path.exists(), chn_vocab_path.exists()

(True, True, True, True)

In [53]:
# To fix the problem of "spm_train: not found" in kaggle, I add these statement.
# refer to: https://stackoverflow.com/questions/55278519/using-sentencepiece-as-a-command
# % git clone https://github.com/google/sentencepiece.git 
# % cd sentencepiece
# % mkdir build
# % cd build
# % cmake ..
# % make -j $(nproc)
# % sudo make install
# % sudo ldconfig -v

In [54]:
en_input = english_data_path
en_vocab_size = 32000
en_model_name = tokenizer_dir / Path('eng')
en_model_type = 'bpe'
en_character_coverage = 1

tokenizer_dir  = Path('tokenizer')
eng_model_path = tokenizer_dir / Path('eng.model')
eng_vocab_path = tokenizer_dir /Path('eng.vocab')

if eng_model_path.exists() and eng_vocab_path.exists():
    logging.info(f"{eng_model_path } and {eng_vocab_path} have exist! continue run the code")
else:
    train(en_input, en_vocab_size, en_model_name, en_model_type, en_character_coverage)

ch_input = chinese_data_path
ch_vocab_size = 32000
ch_model_name = tokenizer_dir / Path('chn')
ch_model_type = 'bpe'
ch_character_coverage = 0.9995

chn_model_path = tokenizer_dir / Path('chn.model')
chn_vocab_path = tokenizer_dir / Path('chn.vocab')
if chn_model_path.exists() and chn_vocab_path.exists():
    logging.info(f"{chn_model_path } and {chn_vocab_path} have exist! continue run the code")
else:
    train(ch_input, ch_vocab_size, ch_model_name, ch_model_type, ch_character_coverage)

## SentencePiece test
After we finish the sentencepiece training, let's do some test to get the understanding of its work processing

In [55]:
sp = spm.SentencePieceProcessor()
text = "美国总统特朗普今日抵达夏威夷。"

sp.Load('./tokenizer/chn.model')
print(sp.EncodeAsPieces(text))

# encode the text
s =sp.EncodeAsIds(text)
# embeding vector
print(s)
# decode the embedding vector
print(sp.decode_ids(s))

# let's do little change to the embeding functio vector
for i in range(0,len(s),2):
    print(f'{i}: {s[i]} --> {s[i] + 1}')
    s[i] = s[i] + 1 
# look new vector
print(s)
# decode the new embedding vector
print(sp.decode_ids(s))

['▁美国总统', '特朗普', '今日', '抵达', '夏威夷', '。']
[12908, 277, 7420, 7319, 18385, 28724]
美国总统特朗普今日抵达夏威夷。
0: 12908 --> 12909
2: 7420 --> 7421
4: 18385 --> 18386
[12909, 277, 7421, 7319, 18386, 28724]
传染性疾病特朗普减记抵达学生们。


In [56]:
sp = spm.SentencePieceProcessor()
text = "U.S. President Donald Trump arrived in Hawaii today."

# do same as above, but English
sp.Load('./tokenizer/eng.model')
print(sp.EncodeAsPieces(text))
s =sp.EncodeAsIds(text)
print(s)
print(sp.decode_ids(s))

for i in range(0,len(s),2):
    print(f'{i}: {s[i]} --> {s[i] + 1}')
    s[i] = s[i] + 1 
print(s)
print(sp.decode_ids(s))

['▁U', '.', 'S', '.', '▁President', '▁Donald', '▁Trump', '▁arrived', '▁in', '▁Hawaii', '▁today', '.']
[131, 31843, 31850, 31843, 811, 3575, 1023, 8437, 26, 18096, 858, 31843]
U.S. President Donald Trump arrived in Hawaii today.
0: 131 --> 132
2: 31850 --> 31851
4: 811 --> 812
6: 1023 --> 1024
8: 26 --> 27
10: 858 --> 859
[132, 31843, 31851, 31843, 812, 3575, 1024, 8437, 27, 18096, 859, 31843]
res.x. foreign Donald consum arriveded Hawaii future.


## Config
Here is the configuration of this assignment, you are only allowed to change these parameters:
- batch_size
- epoch_num
- lr
- beam size
- gpu_id
- device_id

In [57]:
from argparse import Namespace
import torch

dataset_path = Path('dataset')
experiment_path = Path('experiment')
Path.mkdir(dataset_path, exist_ok=True)
Path.mkdir(experiment_path, exist_ok=True)

config = Namespace(
d_model = 512,
n_heads = 8,
n_layers = 6,
d_k = 64,
d_v = 64,
d_ff = 2048,
dropout = 0.1,
padding_idx = 0,
bos_idx = 2,
eos_idx = 3,
src_vocab_size = 32000,
tgt_vocab_size = 32000,
batch_size = 128,
epoch_num = 100,
early_stop = 5,
lr = 3e-4,

# the max length of sentence in greed decode
max_len = 60,
# beam size for bleu
beam_size = 3,
# Label Smoothing
use_smoothing = False,
# NoamOpt
use_noamopt = True,

train_data_path = dataset_path / 'train.json',
dev_data_path = dataset_path / 'dev.json',
test_data_path = dataset_path / 'test.json',
output_model_path = experiment_path / 'model.pth',
log_path = experiment_path / 'train.log',
output_path = experiment_path / 'output.txt',

# gpu_id and device id is the relative id
# thus, if you wanna use os.environ['CUDA_VISIBLE_DEVICES'] = '2, 3'
# you should set CUDA_VISIBLE_DEVICES = 2 as main -> gpu_id = '0', device_id = [0, 1]
gpu_id = '0',
device_id = [0, 1],
)
# set device
if config.gpu_id != '':
    device = torch.device(f"cuda:{config.gpu_id}")
else:
    device = torch.device('cpu')

In [58]:
device

device(type='cuda', index=0)

## Utils

In [59]:
import sentencepiece as spm


def chinese_tokenizer_load():
    sp_chn = spm.SentencePieceProcessor()
    sp_chn.Load('./tokenizer/chn.model')
    return sp_chn


def english_tokenizer_load():
    sp_eng = spm.SentencePieceProcessor()
    sp_eng.Load('./tokenizer/eng.model')
    return sp_eng

def set_logger(log_path):
    """Set the logger to log info in terminal and file `log_path`.
    In general, it is useful to have a logger so that every output to the terminal is saved
    in a permanent file. Here we save it to `model_dir/train.log`.
    Example:
    ```
    logging.info("Starting training...")
    ```
    Args:
        log_path: (string) where to log
    """
    if os.path.exists(log_path) is True:
        os.remove(log_path)
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    if not logger.handlers:
        # Logging to a file
        file_handler = logging.FileHandler(log_path)
        file_handler.setFormatter(logging.Formatter('%(asctime)s:%(levelname)s: %(message)s'))
        logger.addHandler(file_handler)

        # Logging to console
        stream_handler = logging.StreamHandler()
        stream_handler.setFormatter(logging.Formatter('%(message)s'))
        logger.addHandler(stream_handler)


## Finish the Class: MTDataset(10 marks)
You are supposed to finish the code of the class **MTDataset** in following block.

In [60]:
import os
import json
from torch.utils.data import Dataset
from torch.nn.utils.rnn import pad_sequence

DEVICE = device


def subsequent_mask(size): 
    """Mask out subsequent positions."""
    # set the shape of subsequent_mask matrix
    attn_shape = (1, size, size)

    # create a subsequent_mask matrix with ones in the upper right corner (excluding the main diagonal) and zeros in the lower left corner (including the main diagonal).
    subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')

    # return a subsequent_mask matrix with False in the upper right corner (excluding the main diagonal) and True in the lower left corner (including the main diagonal).
    return torch.from_numpy(subsequent_mask) == 0


class Batch:
    """Object for holding a batch of data with mask during training."""
    def __init__(self, src_text, trg_text, src, trg=None, pad=0):
        self.src_text = src_text
        self.trg_text = trg_text
        src = src.to(DEVICE)
        self.src = src
        # Determine the non-empty part of the current input sentence as a bool sequence.
        # And add one dimension in front of seq length to form a matrix of dimension 1×seq length
        self.src_mask = (src != pad).unsqueeze(-2)
        # If the output target is not null, then you need to mask the target clause to be used by the decoder.
        if trg is not None:
            trg = trg.to(DEVICE)
            # Target input part to be used by decoder
            self.trg = trg[:, :-1]
            # The decoder training should predict the output target result
            self.trg_y = trg[:, 1:]
            # Attention mask the target input portion
            self.trg_mask = self.make_std_mask(self.trg, pad)
            # Counts the actual number of words in the target results that should be outputted
            self.ntokens = (self.trg_y != pad).data.sum()

    # Mask
    @staticmethod
    def make_std_mask(tgt, pad):
        """Create a mask to hide padding and future words."""
        tgt_mask = (tgt != pad).unsqueeze(-2)
        tgt_mask = tgt_mask & Variable(subsequent_mask(tgt.size(-1)).type_as(tgt_mask.data))
        return tgt_mask


class MTDataset(Dataset):
    def __init__(self, data_path):
        self.out_en_sent, self.out_cn_sent = self.get_dataset(data_path, sort=True)
        self.sp_eng = english_tokenizer_load()
        self.sp_chn = chinese_tokenizer_load()
        self.PAD = self.sp_eng.pad_id()  # 0
        self.BOS = self.sp_eng.bos_id()  # 2
        self.EOS = self.sp_eng.eos_id()  # 3

    @staticmethod
    def len_argsort(seq):
        """
        Input: A list of tokenized sentences.
        Output: Indices that would sort the sentences by their lengths.

        This static method takes in a list of tokenized sentences and returns 
        a list of indices that would sort the sentences by their lengths.
        """
        return sorted(range(len(seq)), key=lambda x: len(seq[x]))

    def get_dataset(self, data_path, sort=False):
        """Sort Chinese and English in the same order, using the English sentence length ordering (sentence subscripts) as the base."""
        logging.info(f"get_dataset from:{os.path.abspath(data_path)}")
        dataset = json.load(open(data_path, 'r'))
        out_en_sent = []
        out_cn_sent = []
        for idx, _ in enumerate(dataset):
            out_en_sent.append(dataset[idx][0])
            out_cn_sent.append(dataset[idx][1])
        if sort:
            sorted_index = self.len_argsort(out_en_sent)
            out_en_sent = [out_en_sent[i] for i in sorted_index]
            out_cn_sent = [out_cn_sent[i] for i in sorted_index]
        return out_en_sent, out_cn_sent

    def __getitem__(self, idx):
        eng_text = self.out_en_sent[idx]
        chn_text = self.out_cn_sent[idx]
        return [eng_text, chn_text]

    def __len__(self):
        return len(self.out_en_sent)

    def collate_fn(self, batch):
        """
        Input: A batch of data.
        Output: A Batch object containing source and target texts, and their tokenized and padded versions.

        This method is responsible for:
        1. Extracting English and Chinese texts from the batch.
        2. Tokenizing and padding these texts.
        3. Returning a Batch object that holds these details.
        """
        src_text = [x[0] for x in batch]
        tgt_text = [x[1] for x in batch]
        
        src = [self.sp_eng.EncodeAsIds(text) for text in src_text]
        src = [torch.tensor(x) for x in src]
        
        tgt = [self.sp_chn.EncodeAsIds(text) for text in tgt_text]
        tgt = [torch.tensor(x) for x in tgt]

        src = pad_sequence(src, padding_value=self.PAD)
        tgt = pad_sequence(tgt, padding_value=self.PAD)

        return Batch(src_text, tgt_text, src, tgt, self.PAD)

## Finish the implement of Label Smoothing, Embeddings and Softmax, Positional Encoding, Attention and Position-wise Feed-Forward Networks(20 marks)
### Label Smoothing
Label Smoothing hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
We implement label smoothing using the KL div loss. Instead of using a one-hot target distribution, we create a distribution that has confidence of the correct word and the rest of the smoothing mass distributed throughout the vocabulary

### Embeddings and Softmax
Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension$d_model$.
We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to (cite)[https://arxiv.org/abs/1608.05859]. In the embedding layers, we multiply those weights by $\sqrt d_{model}$

### Positional Encoding
Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add “positional encodings” to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension$\sqrt d_{model}$. as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed (cite)[https://arxiv.org/pdf/1705.03122.pdf]
In this work, we use sine and cosine functions of different frequencies
$$
\begin{equation*}
PE_{pos, 2i} = sin(pos/10000^{2i/d_{model}}) 
\end{equation*}
$$
$$
\begin{equation*}
PE_{pos, 2i+1} = cos(pos/10000^{2i/d_{model}}) 
\end{equation*}
$$
where $pos$ is the position and $i$ is the dimension.That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $1000 * 2\pi$. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos + k}$ can be represented as a linear function of $PE_{pos}$.


In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of $P_{drop} = 0.1$

### Attention
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

### Position-wise Feed-Forward Networks
In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.

In [61]:
import math
import copy
import torch.nn as nn
import torch.nn.functional as F


class LabelSmoothing(nn.Module):
    """Implement label smoothing."""

    def __init__(self, size, padding_idx, smoothing=0.0):
        super(LabelSmoothing, self).__init__()
        self.criterion = nn.KLDivLoss(size_average=False)
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing
        self.smoothing = smoothing
        self.size = size
        self.true_dist = None

    def forward(self, x, target):
        assert x.size(1) == self.size
        true_dist = x.data.clone()
        true_dist.fill_(self.smoothing / (self.size - 2))
        true_dist.scatter_(1, target.data.unsqueeze(1), self.confidence)
        true_dist[:, self.padding_idx] = 0
        mask = torch.nonzero(target.data == self.padding_idx)
        if mask.dim() > 0:
            true_dist.index_fill_(0, mask.squeeze(), 0.0)
        self.true_dist = true_dist
        return self.criterion(x, Variable(true_dist, requires_grad=False))


class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        # Embedding layer
        self.lut = nn.Embedding(vocab, d_model)
        # Embedding dimension 
        self.d_model = d_model

    def forward(self, x):
        """
        Input: Tensor 'x' containing token indices.
        Output: The corresponding embedding matrix, scaled by the square root of the embedding dimension.
        
        Fetches the embeddings for the given token indices and scales the result by 
        the square root of the embedding dimension.
        """
        return self.lut(x) * math.sqrt(self.d_model)


class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Initialize an all-zero matrix with a size of max_len (the set maximum length) × embedding dimension.
        # To hold the positional embedding of all positions less than this length.
        pe = torch.zeros(max_len, d_model, device=DEVICE)
        # Generate a position subscripted tensor matrix (each row is a position subscript)
        """
        Forms like:
        tensor([[0.],
                [1.],
                [2.],
                [3.],
                [4.],
                ...])
        """
        position = torch.arange(0., max_len, device=DEVICE).unsqueeze(1)
        # Here the power operation is too much, we use exp and log to convert the denominator to be divided below the pos in the realization formula 
        # Be careful with the negative sign since it is the denominator
        div_term = torch.exp(torch.arange(0., d_model, 2, device=DEVICE) * -(math.log(10000.0) / d_model))

        # According to the formula, the positional texture values of each position in each embedding dimension are calculated and stored into the pe matrix
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        # Add 1 dimension so that the pe dimension becomes: 1 x max_len x embedding dimension
        # (to facilitate subsequent batch summing of embedding of all words of a sentence with a batch)
        pe = pe.unsqueeze(0)
        # Save the pe matrix in a persistent buffer state (will not be used as a parameter to be trained)
        self.register_buffer('pe', pe)

    def forward(self, x):
        """
        Input: Tensor 'x' containing embeddings.
        Output: Tensor with positional encodings added to the embeddings.
        
        Process:
        1. Add positional encodings to the given embeddings 'x'.
        2. Ensure the positional encodings are aligned with the sequence length of 'x'.
        3. Apply dropout (defined in __init__) before returning.
        """
        x = x + Variable(self.pe[:, :x.size(1)], requires_grad=False)
        return self.dropout(x)


def attention(query, key, value, mask=None, dropout=None):
    """
    Compute the scaled dot product attention.
    
    Input:
    - query, key, value: Tensors for the query, key, and value
    - mask: Optional tensor to mask certain values 
    - dropout: Optional dropout layer for regularization
    
    Steps:
    1. Calculate 'scores' by computing the dot product of 'query' and 'key'. Don't forget to scale it.
    2. If a mask is provided, apply it to the 'scores' tensor. The idea is to set masked positions to a large negative value.
    3. Apply softmax to the 'scores' to get the attention probabilities.
    4. If a dropout layer is provided, apply dropout to the attention probabilities.
    5. Finally, compute the output by multiplying the attention probabilities with 'value'.
    
    Returns:
    - The result tensor and the attention probabilities.
    """
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) \
             / math.sqrt(d_k)
    print("==================================================================")
    print(query.size())
    print(key.size())
    print(value.size())
    print(mask.size()) 
    print(scores.size())
    print("==================================================================")
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    p_attn = F.softmax(scores, dim = -1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn



class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        super(MultiHeadedAttention, self).__init__()
        # Guaranteed to be divisible
        assert d_model % h == 0
        # Get a HEAD's ATTENTION representation of the dimension
        self.d_k = d_model // h
        # Number of heads
        self.h = h
        # Define 4 fully connected functions for subsequent use as the WQ, WK, WV matrices and the last h polytopic attention matrices to be transformed after concat
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        """
        Forward propagation for multi-headed attention.
        
        Input:
        - query, key, value: Tensors for query, key, and value.
        - mask: Optional tensor to mask certain values.
        
        Steps:
        1. If mask is provided, adjust its shape.
        2. Find the batch size from the 'query' tensor.
        3. Apply the WQ, WK, WV transformations to query, key, and value respectively.
        4. Split the transformed tensors into 'h' blocks.
        5. For each block, calculate the attention values.
        6. Concatenate all the attention blocks.
        7. Apply the final linear transformation.
        
        Returns:
        - The output tensor after multi-headed attention.
        
        Note: You might want to revisit the 'attention' function you implemented before.
        """
        if mask is not None:
            mask = mask.unsqueeze(1)
            
        # 保持query, key, value的大小不变        
        query = self.linears[0](query)
        key = self.linears[1](key) 
        value = self.linears[2](value)

        nbatches = query.size(0)

        query = query.view(nbatches, -1, self.h, self.d_k).transpose(1, 2) 
        key = key.view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
        value = value.view(nbatches, -1, self.h, self.d_k).transpose(1, 2)

        x, self.attn = attention(query, key, value, mask=mask,                 
                                 dropout=self.dropout)

        x = x.transpose(1, 2).contiguous() \
             .view(nbatches, -1, self.h * self.d_k)
        
        return self.linears[-1](x)



class LayerNorm(nn.Module):
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        # Initialize α to all 1's and β to all 0's.
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        # smooth term (in calculus)
        self.eps = eps

    def forward(self, x):
        # Calculate mean and variance by last dimension
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)

        # Returns the result of Layer Norm
        return self.a_2 * (x - mean) / torch.sqrt(std ** 2 + self.eps) + self.b_2


class SublayerConnection(nn.Module):
    """
    The role of SublayerConnection is to connect the Multi-Head Attention and Feed Forward layers together.
    Only after the output of each layer, you have to do the Layer Norm first and then the residual connection.
    Sublayer is a lambda function
    """
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Returns the result after joining the Layer Norm and the residuals.
        return x + self.dropout(sublayer(self.norm(x)))


def clones(module, N):
    """Clone model block, cloned model block parameters are not shared"""
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])


class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        """
        The implementation of Feed Forward in Encoder and Deocder, which mainly contains a multilayer perceptron
        args: 
        d_model：the input dimension of Encoder
        d_ff：the intermediate dimension
        dropout：the rate of dropout

        """
        return self.w_2(self.dropout(F.relu(self.w_1(x))))
    
    

## Finish the implement of Encoder and EncoderLayer(10 marks)

In [62]:
class Encoder(nn.Module):
    def __init__(self, encoderlayer, N):
        super(Encoder, self).__init__()
        self.layers = clones(encoderlayer, N)

    def forward(self, x, mask):
        """
        The implementation of Encoder, note that the core is a stack of N encoder(layers)
        args：
        encoderlayer：the implementation of encoder in Encoder
        N：the number of encoder in Encoder, such as 6
        """
        for layer in self.layers:
            x = layer(x, mask)
        return x
    
    


class EncoderLayer(nn.Module):
    def __init__(self, d_model, multihead_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = multihead_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(d_model, dropout), 2)


    def forward(self, x, mask):
        """
        The implementation of encoder in Encoder, which is made up of self-attention layer, feed forward and norm layer etc
        args：
        d_model：the input dimension of Encoder
        multihead_attn：multihead attention module in encoder
        feed_forward：the feed forward module in encoder
        dropout：the rate of dropout
        x：the input of Encoder
        mask：the mask of multihead attention
        """
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)
    
    

## Finish the implement of Decoder and DecoderLayer(10 marks)

In [63]:
class Decoder(nn.Module):
    def __init__(self, decoderlayer, N):
        super(Decoder, self).__init__()
        self.layers = clones(decoderlayer, N)

    def forward(self, x, memory, src_mask, tgt_mask):
        """
        The implementation of Decoder, the core is a stack of N decoder(layers)
        args：
        decoder layer：the implementation of decoder
        N：the number of decoder in Decoder, such as 6
        """
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return x
    
    


class DecoderLayer(nn.Module):
    def __init__(self, d_model, multihead_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.self_attn = multihead_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(d_model, dropout), 3)


    def forward(self, x, memory, src_mask, tgt_mask):
        """
        The implementation of a decoder in the Decoder, which is made up of self-attn, src_attn and feed forward
        args：
        d_model：the output dimension of Encoder
        multihead_attn：the multihead attention module(self attention) in decoder
        src_attn：the cross attention module in decoder
        feed_forward：the feed forward module
        memory：the output of Encoder
        x：the input of Decoder
        src_mask：the mask of cross attention module
        tgt_mask：the mask of multihead attention module(self attention)
        """
        m = memory
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, x, src_mask))
        return self.sublayer[2](x, self.feed_forward)

    

## Finish the implement of the whole model(10 marks)

In [64]:
class Transformer(nn.Module):
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        """
        The implementation of Transformer.
        args：
        encoder: the encoder of the transformer
        decoder: the decoder of the transformer
        src_embed: the embedding of the source sentence
        tgt_embed: the embedding of the target sentence
        generator: the output of the final layer of the decoder
        """
        super(Transformer, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator

    def encode(self, src, src_mask):
        """
        args: 
        src: the source sentence
        src_mask: the masked source sentence 
        """
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        """
        args: 
        memory: the output of encoder
        src_mask: the masked source sentence
        tgt: the target sentence
        tgt_mask: the masked target sentence 
        """
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)

    def forward(self, src, tgt, src_mask, tgt_mask):
        """
        args:
        src: the source sentence
        tgt: the target sentence
        src_mask: the masked source sentence
        tgt_mask: the masked target sentence 
        """
        memory = self.encode(src, src_mask)
        out = self.decode(memory, src_mask, tgt, tgt_mask)
        return self.generator(out)

In [65]:
class Generator(nn.Module):
    # vocab: tgt_vocab
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        # perform the log_softmax operation (taking the logarithm of the softmax result).
        return F.log_softmax(self.proj(x), dim=-1)


def make_model(src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1):
    c = copy.deepcopy

    attn = MultiHeadedAttention(h, d_model).to(DEVICE)

    ff = PositionwiseFeedForward(d_model, d_ff, dropout).to(DEVICE)

    position = PositionalEncoding(d_model, dropout).to(DEVICE)

    model = Transformer(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout).to(DEVICE), N).to(DEVICE),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout).to(DEVICE), N).to(DEVICE),
        nn.Sequential(Embeddings(d_model, src_vocab).to(DEVICE), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab).to(DEVICE), c(position)),
        Generator(d_model, tgt_vocab)).to(DEVICE)

    # This was important from their code.
    # Initialize parameters with Glorot / fan_avg.
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model.to(DEVICE)


def batch_greedy_decode(model, src, src_mask, max_len=64, start_symbol=2, end_symbol=3):
    batch_size, src_seq_len = src.size()
    results = [[] for _ in range(batch_size)]
    stop_flag = [False for _ in range(batch_size)]
    count = 0

    memory = model.encode(src, src_mask)
    tgt = torch.Tensor(batch_size, 1).fill_(start_symbol).type_as(src.data)

    for s in range(max_len):
        tgt_mask = subsequent_mask(tgt.size(1)).expand(batch_size, -1, -1).type_as(src.data)
        out = model.decode(memory, src_mask, Variable(tgt), Variable(tgt_mask))

        prob = model.generator(out[:, -1, :])
        pred = torch.argmax(prob, dim=-1)

        tgt = torch.cat((tgt, pred.unsqueeze(1)), dim=1)
        pred = pred.cpu().numpy()
        for i in range(batch_size):
            # print(stop_flag[i])
            if stop_flag[i] is False:
                if pred[i] == end_symbol:
                    count += 1
                    stop_flag[i] = True
                else:
                    results[i].append(pred[i].item())
            if count == batch_size:
                break

    return results


def greedy_decode(model, src, src_mask, max_len=64, start_symbol=2, end_symbol=3):
    memory = model.encode(src, src_mask)
    # Initialize the prediction content as a 1×1 tensor, fill it with the ID of the start symbol ('BOS'), and set the type to the input data type (LongTensor).
    ys = torch.ones(1, 1).fill_(start_symbol).type_as(src.data)
    # Iterate over the length subscript of the output
    for i in range(max_len - 1):
        # decode obtains the hidden layer representation
        out = model.decode(memory,
                           src_mask,
                           Variable(ys),
                           Variable(subsequent_mask(ys.size(1)).type_as(src.data)))
        # convert the hidden representation into a log-softmax probability distribution over the words in the dictionary.
        prob = model.generator(out[:, -1])
        # obtain the predicted word ID with the highest probability at the current position.
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.data[0]
        if next_word == end_symbol:
            break
        # concatenate the predicted character ID at the current position with the previously predicted content.
        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=1)
    return ys

## Beam search
Beam search is a search algorithm commonly used in natural language processing and machine translation to find the most likely sequence of words given a set of possible choices. It works by exploring a set of candidate solutions and gradually narrowing down the options by selecting only the most promising ones based on a certain scoring function. This approach is particularly useful in cases where the search space is large and exhaustive search is not feasible. 

In [66]:
import torch

class Beam:
    """ Beam search """

    def __init__(self, size, pad, bos, eos, device=False):

        self.size = size
        self._done = False
        self.PAD = pad
        self.BOS = bos
        self.EOS = eos
        # The score for each translation on the beam.
        self.scores = torch.zeros((size,), dtype=torch.float, device=device)
        self.all_scores = []

        # The backpointers at each time-step.
        self.prev_ks = []

        # The outputs at each time-step.
        # Initialize to [BOS, PAD, PAD ..., PAD]
        self.next_ys = [torch.full((size,), self.PAD, dtype=torch.long, device=device)]
        self.next_ys[0][0] = self.BOS

    def get_current_state(self):
        """Get the outputs for the current timestep."""
        return self.get_tentative_hypothesis()

    def get_current_origin(self):
        """Get the backpointers for the current timestep."""
        return self.prev_ks[-1]

    @property
    def done(self):
        return self._done

    def advance(self, word_logprob):
        """Update beam status and check if finished or not."""
        num_words = word_logprob.size(1)

        # Sum the previous scores.
        if len(self.prev_ks) > 0:
            beam_lk = word_logprob + self.scores.unsqueeze(1).expand_as(word_logprob)
        else:
            # in initial case,
            beam_lk = word_logprob[0]

        flat_beam_lk = beam_lk.view(-1)
        best_scores, best_scores_id = flat_beam_lk.topk(self.size, 0, True, True)

        self.all_scores.append(self.scores)
        self.scores = best_scores

        # bestScoresId is flattened as a (beam x word) array,
        # so we need to calculate which word and beam each score came from
        prev_k = best_scores_id // num_words
        self.prev_ks.append(prev_k)
        self.next_ys.append(best_scores_id - prev_k * num_words)

        # End condition is when top-of-beam is EOS.
        if self.next_ys[-1][0].item() == self.EOS:
            self._done = True
            self.all_scores.append(self.scores)

        return self._done

    def sort_scores(self):
        """Sort the scores."""
        return torch.sort(self.scores, 0, True)

    def get_the_best_score_and_idx(self):
        """Get the score of the best in the beam."""
        scores, ids = self.sort_scores()
        return scores[1], ids[1]

    def get_tentative_hypothesis(self):
        """Get the decoded sequence for the current timestep."""

        if len(self.next_ys) == 1:
            dec_seq = self.next_ys[0].unsqueeze(1)
        else:
            _, keys = self.sort_scores()
            hyps = [self.get_hypothesis(k) for k in keys]
            hyps = [[self.BOS] + h for h in hyps]
            dec_seq = torch.LongTensor(hyps)

        return dec_seq

    def get_hypothesis(self, k):
        """ Walk back to construct the full hypothesis. """
        # print(k.type())
        hyp = []
        for j in range(len(self.prev_ks) - 1, -1, -1):
            hyp.append(self.next_ys[j + 1][k])
            k = self.prev_ks[j][k]

        return list(map(lambda x: x.item(), hyp[::-1]))


def beam_search(model, src, src_mask, max_len, pad, bos, eos, beam_size, device):
    """ Translation work in one batch """

    def get_inst_idx_to_tensor_position_map(inst_idx_list):
        """ Indicate the position of an instance in a tensor. """
        return {inst_idx: tensor_position for tensor_position, inst_idx in enumerate(inst_idx_list)}

    def collect_active_part(beamed_tensor, curr_active_inst_idx, n_prev_active_inst, n_bm):
        """ Collect tensor parts associated to active instances. """

        _, *d_hs = beamed_tensor.size()
        n_curr_active_inst = len(curr_active_inst_idx)
        # active instances (elements of batch) * beam search size x seq_len x h_dimension
        new_shape = (n_curr_active_inst * n_bm, *d_hs)

        # select only parts of tensor which are still active
        beamed_tensor = beamed_tensor.view(n_prev_active_inst, -1)
        beamed_tensor = beamed_tensor.index_select(0, curr_active_inst_idx)
        beamed_tensor = beamed_tensor.view(*new_shape)

        return beamed_tensor

    def collate_active_info(
            src_enc, src_mask, inst_idx_to_position_map, active_inst_idx_list):
        # Sentences which are still active are collected,
        # so the decoder will not run on completed sentences.
        n_prev_active_inst = len(inst_idx_to_position_map)
        active_inst_idx = [inst_idx_to_position_map[k] for k in active_inst_idx_list]
        active_inst_idx = torch.LongTensor(active_inst_idx).to(device)

        active_src_enc = collect_active_part(src_enc, active_inst_idx, n_prev_active_inst, beam_size)
        active_inst_idx_to_position_map = get_inst_idx_to_tensor_position_map(active_inst_idx_list)
        active_src_mask = collect_active_part(src_mask, active_inst_idx, n_prev_active_inst, beam_size)

        return active_src_enc, active_src_mask, active_inst_idx_to_position_map

    def beam_decode_step(
            inst_dec_beams, len_dec_seq, enc_output, inst_idx_to_position_map, n_bm):
        """ Decode and update beam status, and then return active beam idx """

        def prepare_beam_dec_seq(inst_dec_beams, len_dec_seq):
            dec_partial_seq = [b.get_current_state() for b in inst_dec_beams if not b.done]
            # Batch size x Beam size x Dec Seq Len
            dec_partial_seq = torch.stack(dec_partial_seq).to(device)
            # Batch size*Beam size x Dec Seq Len
            dec_partial_seq = dec_partial_seq.view(-1, len_dec_seq)
            return dec_partial_seq

        def predict_word(dec_seq, enc_output, n_active_inst, n_bm):
            assert enc_output.shape[0] == dec_seq.shape[0] == src_mask.shape[0]
            out = model.decode(enc_output, src_mask,
                               dec_seq,
                               subsequent_mask(dec_seq.size(1))
                               .type_as(src.data))
            word_logprob = model.generator(out[:, -1])
            word_logprob = word_logprob.view(n_active_inst, n_bm, -1)

            return word_logprob

        def collect_active_inst_idx_list(inst_beams, word_prob, inst_idx_to_position_map):
            active_inst_idx_list = []
            for inst_idx, inst_position in inst_idx_to_position_map.items():
                is_inst_complete = inst_beams[inst_idx].advance(
                    word_prob[inst_position])  # Fill Beam object with assigned probabilities
                if not is_inst_complete:  # if top beam ended with eos, we do not add it
                    active_inst_idx_list += [inst_idx]

            return active_inst_idx_list

        n_active_inst = len(inst_idx_to_position_map)

        # get decoding sequence for each beam
        # size: Batch size*Beam size x Dec Seq Len
        dec_seq = prepare_beam_dec_seq(inst_dec_beams, len_dec_seq)

        # get word probabilities for each beam
        # size: Batch size x Beam size x Vocabulary
        word_logprob = predict_word(dec_seq, enc_output, n_active_inst, n_bm)

        # Update the beam with predicted word prob information and collect incomplete instances
        active_inst_idx_list = collect_active_inst_idx_list(
            inst_dec_beams, word_logprob, inst_idx_to_position_map)

        return active_inst_idx_list

    def collect_hypothesis_and_scores(inst_dec_beams, n_best):
        all_hyp, all_scores = [], []
        for inst_idx in range(len(inst_dec_beams)):
            scores, tail_idxs = inst_dec_beams[inst_idx].sort_scores()
            all_scores += [scores[:n_best]]

            hyps = [inst_dec_beams[inst_idx].get_hypothesis(i) for i in tail_idxs[:n_best]]
            all_hyp += [hyps]
        return all_hyp, all_scores

    with torch.no_grad():
        # -- Encode
        src_enc = model.encode(src, src_mask)

        #  Repeat data for beam search
        NBEST = beam_size
        batch_size, sent_len, h_dim = src_enc.size()
        src_enc = src_enc.repeat(1, beam_size, 1).view(batch_size * beam_size, sent_len, h_dim)
        src_mask = src_mask.repeat(1, beam_size, 1).view(batch_size * beam_size, 1, src_mask.shape[-1])

        # -- Prepare beams
        inst_dec_beams = [Beam(beam_size, pad, bos, eos, device) for _ in range(batch_size)]

        # -- Bookkeeping for active or not
        active_inst_idx_list = list(range(batch_size))
        inst_idx_to_position_map = get_inst_idx_to_tensor_position_map(active_inst_idx_list)

        # -- Decode
        for len_dec_seq in range(1, max_len + 1):

            active_inst_idx_list = beam_decode_step(
                inst_dec_beams, len_dec_seq, src_enc, inst_idx_to_position_map, beam_size)

            if not active_inst_idx_list:
                break  # all instances have finished their path to <EOS>
            # filter out inactive tensor parts (for already decoded sequences)
            src_enc, src_mask, inst_idx_to_position_map = collate_active_info(
                src_enc, src_mask, inst_idx_to_position_map, active_inst_idx_list)

    batch_hyp, batch_scores = collect_hypothesis_and_scores(inst_dec_beams, NBEST)

    return batch_hyp, batch_scores


## BLEU score
BLEU (Bilingual Evaluation Understudy) score is a metric commonly used to evaluate the quality of machine-translated text by comparing it to one or more reference translations. It measures the overlap between the machine-generated text and the reference translations, taking into account the precision of n-grams (sequences of n words) and the brevity penalty. The higher the BLEU score, the better the translation
quality.

In [67]:
!pip install sacrebleu

Collecting sacrebleu
  Obtaining dependency information for sacrebleu from https://files.pythonhosted.org/packages/df/c0/ff53cb76c1b050ad25d056877ba6d3f6fa964134370c4ccf57ad933d6f72/sacrebleu-2.3.2-py3-none-any.whl.metadata
  Downloading sacrebleu-2.3.2-py3-none-any.whl.metadata (57 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m57.4/57.4 kB[0m [31m426.8 kB/s[0m eta [36m0:00:00[0m [36m0:00:01[0m
[?25hCollecting portalocker (from sacrebleu)
  Obtaining dependency information for portalocker from https://files.pythonhosted.org/packages/17/9e/87671efcca80ba6203811540ed1f9c0462c1609d2281d7b7f53cef05da3d/portalocker-2.8.2-py3-none-any.whl.metadata
  Downloading portalocker-2.8.2-py3-none-any.whl.metadata (8.5 kB)
Downloading sacrebleu-2.3.2-py3-none-any.whl (119 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m119.7/119.7 kB[0m [31m5.2 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading portalocker-2.8.2-py3-none-any.whl (17 kB)
Installing collecte

In [68]:
import torch.nn as nn
from torch.autograd import Variable

import logging
import sacrebleu
from tqdm import tqdm



def run_epoch(data, model, loss_compute):
    total_tokens = 0.
    total_loss = 0.

    for batch in tqdm(data):
        out = model(batch.src, batch.trg, batch.src_mask, batch.trg_mask)
        loss = loss_compute(out, batch.trg_y, batch.ntokens)

        total_loss += loss
        total_tokens += batch.ntokens
    return total_loss / total_tokens


def train(train_data, dev_data, model, model_par, criterion, optimizer):
    """train and save model"""
    best_bleu_score = 0.0
    bleu_score_list = []
    dev_loss_list = []
    early_stop = config.early_stop
    for epoch in range(1, config.epoch_num + 1):
        # train the model
        model.train()
        train_loss = run_epoch(train_data, model_par,
                               MultiGPULossCompute(model.generator, criterion, config.device_id, optimizer))
        logging.info("Epoch: {}, loss: {}".format(epoch, train_loss))
        # model validation
        model.eval()
        dev_loss = run_epoch(dev_data, model_par,
                             MultiGPULossCompute(model.generator, criterion, config.device_id, None))
        bleu_score = evaluate(dev_data, model)
        
        dev_loss_list.append(dev_loss.cpu().detach().numpy())
        bleu_score_list.append(bleu_score)
        
        logging.info('Epoch: {}, Dev loss: {}, Bleu Score: {}'.format(epoch, dev_loss, bleu_score))

        # save the current model if its loss on the dev set for the current epoch is better than the previously recorded best loss, and update the best loss value.
        if bleu_score > best_bleu_score:
            torch.save(model.state_dict(), config.output_model_path)
            best_bleu_score = bleu_score
            early_stop = config.early_stop
            logging.info("-------- Save Best Model! --------")
        else:
            early_stop -= 1
            logging.info("Early Stop Left: {}".format(early_stop))
        if early_stop == 0:
            logging.info("-------- Early Stop! --------")
            break
    return dev_loss_list, bleu_score_list


class LossCompute:
    """A single-gpu loss compute and train function."""

    def __init__(self, generator, criterion, opt=None):
        self.generator = generator
        self.criterion = criterion
        self.opt = opt

    def __call__(self, x, y, norm):
        x = self.generator(x)
        loss = self.criterion(x.contiguous().view(-1, x.size(-1)),
                              y.contiguous().view(-1)) / norm
        loss.backward()
        if self.opt is not None:
            self.opt.step()
            if config.use_noamopt:
                self.opt.optimizer.zero_grad()
            else:
                self.opt.zero_grad()
        return loss.data.item() * norm.float()

class MultiGPULossCompute:
    """A multi-gpu loss compute and train function."""

    def __init__(self, generator, criterion, devices, opt=None, chunk_size=5):
        # Send out to different gpus.
        self.generator = generator
        self.criterion = nn.parallel.replicate(criterion, devices=devices)
        self.opt = opt
        self.devices = devices
        self.chunk_size = chunk_size

    def __call__(self, out, targets, normalize):
        total = 0.0
        generator = nn.parallel.replicate(self.generator, devices=self.devices)
        out_scatter = nn.parallel.scatter(out, target_gpus=self.devices)
        out_grad = [[] for _ in out_scatter]
        targets = nn.parallel.scatter(targets, target_gpus=self.devices)

        # Divide generating into chunks.
        chunk_size = self.chunk_size
        for i in range(0, out_scatter[0].size(1), chunk_size):
            # Predict distributions
            out_column = [[Variable(o[:, i:i + chunk_size].data,
                                    requires_grad=self.opt is not None)]
                          for o in out_scatter]
            gen = nn.parallel.parallel_apply(generator, out_column)

            # Compute loss.
            y = [(g.contiguous().view(-1, g.size(-1)),
                  t[:, i:i + chunk_size].contiguous().view(-1))
                 for g, t in zip(gen, targets)]
            loss = nn.parallel.parallel_apply(self.criterion, y)

            # Sum and normalize loss
            l_ = nn.parallel.gather(loss, target_device=self.devices[0])
            l_ = l_.sum() / normalize
            total += l_.data

            # Backprop loss to output of transformer
            if self.opt is not None:
                l_.backward()
                for j, l in enumerate(loss):
                    out_grad[j].append(out_column[j][0].grad.data.clone())

        # Backprop all loss through transformer.
        if self.opt is not None:
            out_grad = [Variable(torch.cat(og, dim=1)) for og in out_grad]
            o1 = out
            o2 = nn.parallel.gather(out_grad,
                                    target_device=self.devices[0])
            o1.backward(gradient=o2)
            self.opt.step()
            if config.use_noamopt:
                self.opt.optimizer.zero_grad()
            else:
                self.opt.zero_grad()
        return total * normalize

    
def evaluate(data, model, mode='eval', use_beam=True):
    """Predict using the trained model and print the output"""
    sp_chn = chinese_tokenizer_load()
    engs = []
    trg = []
    res = []
    with torch.no_grad():
        for batch in tqdm(data):
            en_sent = batch.src_text
            cn_sent = batch.trg_text
            src = batch.src
            src_mask = (src != 0).unsqueeze(-2)
            if use_beam:
                decode_result, _ = beam_search(model, src, src_mask, config.max_len,
                                               config.padding_idx, config.bos_idx, config.eos_idx,
                                               config.beam_size, device)
            else:
                decode_result = batch_greedy_decode(model, src, src_mask,
                                                    max_len=config.max_len)
            decode_result = [h[0] for h in decode_result]
            translation = [sp_chn.decode_ids(_s) for _s in decode_result]
            trg.extend(cn_sent)
            res.extend(translation)
            engs.extend(en_sent)
    if mode == 'test':
        for i in range(len(trg)):
            line = "idx: \n " + str(i) +': \n' + engs[i] +'\n label: '+trg[i] + '\n predict:' + res[i] + '\n'
            print(line)     
    trg = [trg]
    bleu = sacrebleu.corpus_bleu(res, trg, tokenize='zh')
    return float(bleu.score)


def test(data, model, criterion, mode='eval'):
    with torch.no_grad():
        # load model
        model.load_state_dict(torch.load(config.output_model_path))
        model_par = torch.nn.DataParallel(model)
        model.eval()
        # predict
        test_loss = run_epoch(data, model_par,
                              MultiGPULossCompute(model.generator, criterion, config.device_id, None))
        bleu_score = evaluate(data, model, mode)
        logging.info('Test loss: {},  Bleu Score: {}'.format(test_loss, bleu_score))
    return test_loss.cpu().detach().numpy(), bleu_score


def translate(src, model, model_path,  use_beam=True):
    """Predict a single sentence using the trained model and print the output."""
    sp_chn = chinese_tokenizer_load()
    with torch.no_grad():
        model.load_state_dict(torch.load(model_path))
        model.eval()
        src_mask = (src != 0).unsqueeze(-2)
        if use_beam:
            decode_result, _ = beam_search(model, src, src_mask, config.max_len,
                                           config.padding_idx, config.bos_idx, config.eos_idx,
                                           config.beam_size, device)
            decode_result = [h[0] for h in decode_result]
        else:
            decode_result = batch_greedy_decode(model, src, src_mask, max_len=config.max_len)
        translation = [sp_chn.decode_ids(_s) for _s in decode_result]
        return translation[0]


In [69]:
import logging
import numpy as np

import torch
# In order to run this cell in kaggle, I add this statement.
from torch.utils.data import DataLoader 

class NoamOpt:
    """Optim wrapper that implements rate."""

    def __init__(self, model_size, factor, warmup, optimizer):
        self.optimizer = optimizer
        self._step = 0
        self.warmup = warmup
        self.factor = factor
        self.model_size = model_size
        self._rate = 0

    def step(self):
        """Update parameters and rate"""
        self._step += 1
        rate = self.rate()
        for p in self.optimizer.param_groups:
            p['lr'] = rate
        self._rate = rate
        self.optimizer.step()

    def rate(self, step=None):
        """Implement `lrate` above"""
        if step is None:
            step = self._step
        return self.factor * (self.model_size ** (-0.5) * min(step ** (-0.5), step * self.warmup ** (-1.5)))


def get_std_opt(model):
    """for batch_size 32, 5530 steps for one epoch, 2 epoch for warm-up"""
    return NoamOpt(model.src_embed[0].d_model, 1, 10000,
                   torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9))

def run():
    set_logger(config.log_path)

    train_dataset = MTDataset(config.train_data_path)
    dev_dataset = MTDataset(config.dev_data_path)
    test_dataset = MTDataset(config.test_data_path)

    logging.info("-------- Dataset Build! --------")
    train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=config.batch_size,
                                  collate_fn=train_dataset.collate_fn)
    dev_dataloader = DataLoader(dev_dataset, shuffle=False, batch_size=config.batch_size,
                                collate_fn=dev_dataset.collate_fn)
    test_dataloader = DataLoader(test_dataset, shuffle=False, batch_size=config.batch_size,
                                 collate_fn=test_dataset.collate_fn)

    logging.info("-------- Get Dataloader! --------")
    # initialize the model
    model = make_model(config.src_vocab_size, config.tgt_vocab_size, config.n_layers,
                       config.d_model, config.d_ff, config.n_heads, config.dropout)
    model_par = torch.nn.DataParallel(model)
    # train the model
    if config.use_smoothing:
        criterion = LabelSmoothing(size=config.tgt_vocab_size, padding_idx=config.padding_idx, smoothing=0.1)
        criterion.cuda()
    else:
        criterion = torch.nn.CrossEntropyLoss(ignore_index=0, reduction='sum')
    if config.use_noamopt:
        optimizer = get_std_opt(model)
    else:
        optimizer = torch.optim.AdamW(model.parameters(), lr=config.lr)
    train_loss_list, train_bleu_score_list= train(train_dataloader, dev_dataloader, model, model_par, criterion, optimizer)
    test_loss, test_bleu_score = test(test_dataloader, model, criterion)
    
    return train_loss_list, train_bleu_score_list, test_loss, test_bleu_score


def check_opt():
    """check learning rate changes"""
    import numpy as np
    import matplotlib.pyplot as plt
    model = make_model(config.src_vocab_size, config.tgt_vocab_size, config.n_layers,
                       config.d_model, config.d_ff, config.n_heads, config.dropout)
    opt = get_std_opt(model)
    # Three settings of the lrate hyperparameters.
    opts = [opt,
            NoamOpt(512, 1, 20000, None),
            NoamOpt(256, 1, 10000, None)]
    plt.plot(np.arange(1, 50000), [[opt.rate(i) for opt in opts] for i in range(1, 50000)])
    plt.legend(["512:10000", "512:20000", "256:10000"])
    plt.show()

    
def one_sentence_translate(sent, model_path,beam_search=True):
    # model initialation
    model = make_model(config.src_vocab_size, config.tgt_vocab_size, config.n_layers,
                       config.d_model, config.d_ff, config.n_heads, config.dropout)
    BOS = english_tokenizer_load().bos_id()  # 2
    EOS = english_tokenizer_load().eos_id()  # 3
    src_tokens = [[BOS] + english_tokenizer_load().EncodeAsIds(sent) + [EOS]]
    batch_input = torch.LongTensor(np.array(src_tokens)).to(device)
    return translate(batch_input, model, model_path, use_beam=beam_search)

## Train model

In [70]:
train_loss_list, train_bleu_score_list, test_loss, test_bleu_score = run()

  0%|          | 0/1580 [00:00<?, ?it/s]

torch.Size([34, 8, 128, 64])
torch.Size([34, 8, 128, 64])
torch.Size([34, 8, 128, 64])
torch.Size([34, 1, 1, 128])
torch.Size([34, 8, 128, 128])
torch.Size([34, 8, 128, 64])
torch.Size([34, 8, 128, 64])
torch.Size([34, 8, 128, 64])
torch.Size([34, 1, 1, 128])
torch.Size([34, 8, 128, 128])
torch.Size([34, 8, 128, 64])
torch.Size([34, 8, 128, 64])
torch.Size([34, 8, 128, 64])
torch.Size([34, 1, 1, 128])
torch.Size([34, 8, 128, 128])
torch.Size([34, 8, 128, 64])
torch.Size([34, 8, 128, 64])
torch.Size([34, 8, 128, 64])
torch.Size([34, 1, 1, 128])
torch.Size([34, 8, 128, 128])
torch.Size([34, 8, 128, 64])
torch.Size([34, 8, 128, 64])
torch.Size([34, 8, 128, 64])
torch.Size([34, 1, 1, 128])
torch.Size([34, 8, 128, 128])
torch.Size([34, 8, 128, 64])
torch.Size([34, 8, 128, 64])
torch.Size([34, 8, 128, 64])
torch.Size([34, 1, 1, 128])
torch.Size([34, 8, 128, 128])
torch.Size([30, 8, 127, 64])
torch.Size([30, 8, 127, 64])
torch.Size([30, 8, 127, 64])
torch.Size([30, 1, 127, 127])
torch.Size([3

  0%|          | 0/1580 [00:04<?, ?it/s]

torch.Size([34, 8, 128, 64])
torch.Size([34, 8, 128, 64])
torch.Size([34, 8, 128, 64])
torch.Size([34, 1, 1, 128])
torch.Size([34, 8, 128, 128])
torch.Size([34, 8, 128, 64])
torch.Size([34, 8, 128, 64])
torch.Size([34, 8, 128, 64])
torch.Size([34, 1, 1, 128])
torch.Size([34, 8, 128, 128])
torch.Size([34, 8, 128, 64])
torch.Size([34, 8, 128, 64])
torch.Size([34, 8, 128, 64])
torch.Size([34, 1, 1, 128])
torch.Size([34, 8, 128, 128])
torch.Size([34, 8, 128, 64])
torch.Size([34, 8, 128, 64])
torch.Size([34, 8, 128, 64])
torch.Size([34, 1, 1, 128])
torch.Size([34, 8, 128, 128])
torch.Size([34, 8, 128, 64])
torch.Size([34, 8, 128, 64])
torch.Size([34, 8, 128, 64])
torch.Size([34, 1, 1, 128])
torch.Size([34, 8, 128, 128])
torch.Size([34, 8, 128, 64])
torch.Size([34, 8, 128, 64])
torch.Size([34, 8, 128, 64])
torch.Size([34, 1, 1, 128])
torch.Size([34, 8, 128, 128])
torch.Size([30, 8, 127, 64])
torch.Size([30, 8, 127, 64])
torch.Size([30, 8, 127, 64])
torch.Size([30, 1, 127, 127])
torch.Size([3




RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/tmp/ipykernel_47/656968424.py", line 46, in forward
    out = self.decode(memory, src_mask, tgt, tgt_mask)
  File "/tmp/ipykernel_47/656968424.py", line 35, in decode
    return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/tmp/ipykernel_47/1672127285.py", line 14, in forward
    x = layer(x, memory, src_mask, tgt_mask)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/tmp/ipykernel_47/1672127285.py", line 44, in forward
    x = self.sublayer[1](x, lambda x: self.src_attn(x, m, x, src_mask))
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/tmp/ipykernel_47/1002088172.py", line 228, in forward
    return x + self.dropout(sublayer(self.norm(x)))
  File "/tmp/ipykernel_47/1672127285.py", line 44, in <lambda>
    x = self.sublayer[1](x, lambda x: self.src_attn(x, m, x, src_mask))
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/tmp/ipykernel_47/1002088172.py", line 184, in forward
    key = key.view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
RuntimeError: shape '[30, -1, 8, 64]' is invalid for input of size 2228224


In [None]:
logging.info('Test loss: {},  Bleu Score: {}'.format(test_loss, test_bleu_score))

## Visualize the training process

In [None]:
import matplotlib.pyplot as plt

epochs_list = list(range(len(train_loss_list)))
plt.figure(figsize=(20, 8))
plt.plot(epochs_list, train_loss_list)
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train'], loc='upper right')
plt.show()

In [None]:
epochs_list = list(range(len(train_bleu_score_list)))
plt.figure(figsize=(20, 8))
plt.plot(epochs_list, train_bleu_score_list)
plt.title('Model Bleu Score')
plt.ylabel('Bleu Score')
plt.xlabel('Epoch')
plt.legend(['Train'], loc='upper right')
plt.show()

## Test trained model

In [None]:
def translate_example(model_path=config.output_model_path):
    """单句翻译示例""" 
    sent = "I love Xiamen University"
    tgt= "我爱厦门大学"
    res = one_sentence_translate(sent, model_path,beam_search=False)
    print(f'{sent} => {res} \n label: {tgt} \n')

In [None]:
translate_example()

In [None]:
def test_dataset():
    test_dataset = MTDataset(config.test_data_path)

    logging.info("-------- Dataset Build! --------")
    test_dataloader = DataLoader(test_dataset, shuffle=False, batch_size=config.batch_size,
                                 collate_fn=test_dataset.collate_fn)

    logging.info("-------- Get Dataloader! --------")
    # initialize the model
    model = make_model(config.src_vocab_size, config.tgt_vocab_size, config.n_layers,
                       config.d_model, config.d_ff, config.n_heads, config.dropout)
    model_par = torch.nn.DataParallel(model)
    # train the model
    if config.use_smoothing:
        criterion = LabelSmoothing(size=config.tgt_vocab_size, padding_idx=config.padding_idx, smoothing=0.1)
        criterion.cuda()
    else:
        criterion = torch.nn.CrossEntropyLoss(ignore_index=0, reduction='sum')
    test_loss, test_bleu_score = test(test_dataloader, model, criterion, mode='test')

In [None]:
test_dataset()

## Question(40 marks)
1. Why can transformer train in parallel but not reference in parallel? (5 marks)

Answer:

2. What is the relationship between the convolution operations and the attention operations? (10 marks)

Answer:

3. Why is a mask needed after tokenization? Attention mechanisms also use masks, what are their functions respectively? (10 marks)

Answer:

4. Why does Transformer introduce positional coding? Why  do RNN, GRU, LSTM not need to introduce positional coding? (5 marks)

Answer:

5. After you finish your assignment, please describe the whole process of machine translation based transformer, in other word, how is an English sentence  translated into Chinese ? The more detailed, the better. (10 marks)

Answer:

## Last but not least
When you finish this assignment, you got the understanding that the Transformer model consists of the Encoder module and the Decoder module, however the encoder-decoder models are one of the models in large languages models(LLM); the Encoder module and Decoder module can be used individually.
We would like to suggest you to read the following papers
- Encoder-only: [BERT](https://aclanthology.org/N19-1423.pdf), [ViT](https://arxiv.org/pdf/2010.11929.pdf)
- Decoder-pnly: [GPT-1/2/3/4](https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf), [ChatGPT](https://openai.com/research/gpt-4)
- Encoder-Decoder: [T5](https://jmlr.org/papers/v21/20-074.html)

For more LLM, please refer to [A Survey of Large Language Models](https://arxiv.org/pdf/2303.18223.pdf) and [Awesome-LLM](https://github.com/Hannibal046/Awesome-LLM)