# Task description
- Translate text from Chinese to English.
- Main goal: Get familiar with transformer.

## install the required package

In [1]:
!pip install torchmetrics

Collecting torchmetrics
  Downloading torchmetrics-1.5.1-py3-none-any.whl.metadata (20 kB)
Collecting lightning-utilities>=0.8.0 (from torchmetrics)
  Downloading lightning_utilities-0.11.8-py3-none-any.whl.metadata (5.2 kB)
Downloading torchmetrics-1.5.1-py3-none-any.whl (890 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m890.6/890.6 kB[0m [31m17.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading lightning_utilities-0.11.8-py3-none-any.whl (26 kB)
Installing collected packages: lightning-utilities, torchmetrics
Successfully installed lightning-utilities-0.11.8 torchmetrics-1.5.1


## Import package

In [2]:
from google.colab import drive
drive.mount('/content/drive')
import os
import json
import math
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from torchsummary import summary
import torch.nn.functional as F
from torch.nn import Linear


Mounted at /content/drive


## Fix random seed

In [3]:
%cd "/content/drive/MyDrive/DL_Lab03/DL_Lab03"
def set_seed(seed):
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

set_seed(87)

/content/drive/MyDrive/DL_Lab03/DL_Lab03


# Data
- Original dataset is [20k-en-zh-translation-pinyin-hsk](https://huggingface.co/datasets/swaption2009/20k-en-zh-translation-pinyin-hsk)
- We select 50000 English-Chinese sentence pairs for translation task

- Args:
  - BATCH_SIZE
  - data_dir: the path to the given translation dataset
- Tokenizer: BertTokenizer
  - encode: convert text to token ID
  - decode: convert token ID back to text
- Add paddings
  - make all the sentences the same length by inserting token ID = PAD_IDX at the back

In [4]:
data_dir = "./translation_data.json"
BATCH_SIZE = 64

## Show the raw data

In [5]:
translation_raw_data = pd.read_json(data_dir)
translation_raw_data = translation_raw_data
display(translation_raw_data)

Unnamed: 0,english,chinese
0,"Slowly and not without struggle, America began...",美国缓慢地开始倾听，但并非没有艰难曲折。
1,Dithering is a technique that blends your colo...,抖动是关于颜色混合的技术，使你的作品看起来更圆滑，或者只是创作有趣的材质。
2,This paper discusses the petrologic characteri...,本文以珲春早第三纪含煤盆地的地质构违背景为依据，分析了煤系地层的岩石学特征。
3,The second encounter relates to my grandfather...,第二次事件跟我爷爷的宝贝匣子有关。
4,One way to address these challenges would be t...,解决这些挑战的途径包括依照麻瓜在南非的经验设立真相与和解委员会。
...,...,...
49995,You were too obtuse to take the hint.,你太迟钝了， 没有理解这种暗示。
49996,"Therefore, in the event the mortgagee of ship ...",因此，在这种情况下船舶抵押权人放弃了债务人提供的担保就会影响其他担保人的利益，导致抵押权人的...
49997,"Fourth, puncture administrative bloat.",第四，削弱行政膨胀。
49998,Massimo Oddo says he won't be thinking about h...,马西莫。奥多声明他不会在世界杯决赛圈比赛结束之前考虑未来的俱乐部。


## Tokenizer

In [6]:
from transformers import BertTokenizer
tokenizer_en = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer_cn = BertTokenizer.from_pretrained("bert-base-chinese")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]



tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/110k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/269k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/624 [00:00<?, ?B/s]

In [7]:
english_seqs = translation_raw_data["english"].apply(lambda x: tokenizer_en.encode(x, add_special_tokens=True, padding=False))
chinese_seqs = translation_raw_data["chinese"].apply(lambda x: tokenizer_cn.encode(x, add_special_tokens=True, padding=False))

MAX_TOKENIZE_LENGTH = max(english_seqs.str.len().max(),chinese_seqs.str.len().max()) # longest string
MAX_TOKENIZE_LENGTH = pow(2, math.ceil(math.log(MAX_TOKENIZE_LENGTH)/math.log(2)))   # closest upper to the power of 2

print("max tokenize length:", MAX_TOKENIZE_LENGTH)

max tokenize length: 128


## Add paddings

In [8]:
PAD_IDX = 0
BOS_IDX = chinese_seqs.iloc[0][0]
EOS_IDX = chinese_seqs.iloc[0][-1]

def add_padding(token_list, max_length):
    if len(token_list)<max_length:
      token_list = token_list +[PAD_IDX] * (max_length - len(token_list))
    elif len(token_list) > max_length:
      token_list = token_list[:max_length]
    return token_list

# Assuming MAX_TOKENIZE_LENGTH is defined elsewhere
chinese_seqs = chinese_seqs.apply(lambda x: add_padding(x, MAX_TOKENIZE_LENGTH))
english_seqs = english_seqs.apply(lambda x: add_padding(x, MAX_TOKENIZE_LENGTH))


In [9]:
# check the padding result
print("=====chinese tokenized data=====")
print("this is my chinese_seqs.iloc[0]" , chinese_seqs.iloc[0])
print("=====english tokenized data=====")
print("this is my english_seqs.iloc[0]" , english_seqs.iloc[0])

=====chinese tokenized data=====
this is my chinese_seqs.iloc[0] [101, 5401, 1744, 5353, 2714, 1765, 2458, 1993, 967, 1420, 8024, 852, 2400, 7478, 3766, 3300, 5680, 7410, 3289, 2835, 511, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
=====english tokenized data=====
this is my english_seqs.iloc[0] [101, 13060, 1105, 1136, 1443, 5637, 117, 1738, 1310, 1106, 5113, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


## Datalodader
- Split dataset into training dataset(90%) and validation dataset(10%).
- Create dataloader to iterate the data.

In [10]:
data_size = len(translation_raw_data)
train_size = int(0.9*data_size)
test_size = data_size - train_size
print("train_size:",train_size)
print("test_size:",test_size)

en_training_data = []
cn_training_data = []
en_testing_data = []
cn_testing_data = []

for i in range(data_size):
    if (i < train_size):
        en_training_data.append(torch.Tensor(english_seqs.iloc[i]))
        cn_training_data.append(torch.Tensor(chinese_seqs.iloc[i]))
    else:
        en_testing_data.append(torch.Tensor(english_seqs.iloc[i]))
        cn_testing_data.append(torch.Tensor(chinese_seqs.iloc[i]))


class TextTranslationDataset(Dataset):
    def __init__(self, src, dst):
        self.src_list = src
        self.dst_list = dst

    def __len__(self):
        return len(self.src_list)

    def __getitem__(self, idx):
        return self.src_list[idx], self.dst_list[idx]

cn_to_en_train_set = TextTranslationDataset(cn_training_data, en_training_data)
cn_to_en_test_set = TextTranslationDataset(cn_testing_data, en_testing_data)

cn_to_en_train_loader = DataLoader(cn_to_en_train_set, batch_size=BATCH_SIZE, shuffle=False)
cn_to_en_test_loader = DataLoader(cn_to_en_test_set, batch_size=BATCH_SIZE, shuffle=True)



train_size: 45000
test_size: 5000


# Model
- TO-DO: Finish the model by yourself
- Base transformer layers in [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
    - TransformerEncoderLayer:
    - TransformerDecoderLayer:
- Positional encoding and input embedding
- Note that you may need masks when implementing attention mechanism
    - Padding mask: prevent input from attending to padding tokens
    - Causal mask: prevent decoder input from attending to future input

In [11]:
import torch
import torch.nn.functional as F
import torch.nn as nn
import math

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, head_num):
        super(MultiHeadAttention, self).__init__()
        self.head_num = head_num
        self.d_model = d_model
        self.d_k = d_model // head_num  
        self.d_v = self.d_k  

        
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)

        
        self.fc = nn.Linear(d_model, d_model)

        # Dropout
        self.dropout = nn.Dropout(0.2)

    def forward(self, Q, K, V, src_padding_mask=None, future_mask=None):
        batch_size = Q.size(0)  

        
        Q = self.W_q(Q)  # [batch_size, seq_len, d_model]
        K = self.W_k(K)  # [batch_size, seq_len, d_model]
        V = self.W_v(V)  # [batch_size, seq_len, d_model]

        
        Q = Q.view(batch_size, -1, self.head_num, self.d_k).transpose(1, 2)  # [batch_size, head_num, seq_len, d_k]
        K = K.view(batch_size, -1, self.head_num, self.d_k).transpose(1, 2)  # [batch_size, head_num, seq_len, d_k]
        V = V.view(batch_size, -1, self.head_num, self.d_k).transpose(1, 2)  # [batch_size, head_num, seq_len, d_k]

        
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)  # [batch_size, head_num, seq_len, seq_len]

        
        if future_mask is not None:
            # future_mask  [batch_size, 1, seq_len, seq_len]
            scores = scores.masked_fill(future_mask == 0, float('-1e20'))

        if src_padding_mask is not None:
            #print("src_padding_mask", src_padding_mask.shape)
            # src_padding_mask  [batch_size, 1, 1, key_len]，擴展 [batch_size, head_num, query_len, key_len]
            scores = scores.masked_fill(src_padding_mask == 0, float('-1e20'))

        
        attention_weights = F.softmax(scores, dim=-1)
        attention_weights = self.dropout(attention_weights)  # dropout

        
        context = torch.matmul(attention_weights, V)  # [batch_size, head_num, seq_len, d_k]

        
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)  # [batch_size, seq_len, d_model]

        
        output = self.fc(context)  # [batch_size, seq_len, d_model]
        return output



In [12]:
class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, dim_feedforward, nhead, dropout):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, nhead)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, src_padding_mask):
        # Self-attention layer
        #print(f"encoder input shape:{x.shape}")
        #print("do encoder")
        attn_output = self.self_attn(x, x, x, src_padding_mask=src_padding_mask)
        #print(f"Encoder Layer - Attention output shape: {attn_output.shape}")
        if attn_output.size() != x.size():
            attn_output = attn_output[:, :x.size(1), :]
        x = x + self.dropout1(attn_output)
        x = self.norm1(x)

        # Feedforward layer
        ff_output = self.linear2(self.dropout(F.relu(self.linear1(x))))
        x = x + self.dropout2(ff_output)
        x = self.norm2(x)
        #print(f"encoder output shape:{x.shape}")
        return x

In [13]:
class TransformerDecoderLayer(nn.Module):
    def __init__(self, d_model, dim_feedforward, nhead, dropout):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, nhead)
        self.cross_attn = MultiHeadAttention(d_model, nhead)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)

    def forward(self, x, enc_output, src_padding_mask=None, tgt_padding_mask=None, tgt_future_mask=None):
        # Pre-LayerNorm Self-attention layer with future mask
        #print("do decode")
        x_norm = self.norm1(x)
        self_attn_output = self.self_attn(x_norm, x_norm, x_norm, src_padding_mask=tgt_padding_mask, future_mask=tgt_future_mask)
        x = x + self.dropout1(self_attn_output)

        
        if src_padding_mask is not None:
            src_padding_mask = src_padding_mask[:, :, :x.size(1), :]

        # Pre-LayerNorm Cross-attention layer with encoder output
        x_norm = self.norm2(x)
        cross_attn_output = self.cross_attn(x_norm, enc_output, enc_output, src_padding_mask=src_padding_mask)
        x = x + self.dropout2(cross_attn_output)

        # Pre-LayerNorm Feedforward layer
        x_norm = self.norm3(x)
        ff_output = self.linear2(self.dropout(F.gelu(self.linear1(x_norm))))
        x = x + self.dropout3(ff_output)

        return x


In [14]:
class Transformer(nn.Module):
    def __init__(self, d_model, num_heads, num_encoder_layers, num_decoder_layers, d_ff, dropout):
        super().__init__()
        self.encoder_layers = nn.ModuleList([TransformerEncoderLayer(d_model, d_ff, num_heads, dropout) for _ in range(num_encoder_layers)])
        self.decoder_layers = nn.ModuleList([TransformerDecoderLayer(d_model, d_ff, num_heads, dropout) for _ in range(num_decoder_layers)])

        self.norm = nn.LayerNorm(d_model)

    def forward(self, src_embeded, tgt_embeded, src_padding_mask, tgt_padding_mask, tgt_future_mask):
        enc_output = self.encode(src_embeded, src_padding_mask)
        dec_output = self.decode(tgt_embeded, enc_output, src_padding_mask, tgt_padding_mask, tgt_future_mask)
        return dec_output

    def encode(self, src_embeded, src_padding_mask=None):
        x = src_embeded
        for layer in self.encoder_layers:
            x = layer(x, src_padding_mask)
        return self.norm(x)

    def decode(self, tgt_embeded, enc_output, src_padding_mask=None, tgt_padding_mask=None, tgt_future_mask=None):
        x = tgt_embeded
        for layer in self.decoder_layers:
            x = layer(x, enc_output, src_padding_mask, tgt_padding_mask, tgt_future_mask)
        return self.norm(x)

In [15]:
class PositionalEncoding(nn.Module):
    def __init__(self, emb_size, dropout, maxlen=200):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

        position = torch.arange(0, maxlen).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, emb_size, 2) * -(math.log(10000.0) / emb_size))
        pe = torch.zeros(maxlen, emb_size)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, token_embedding):
        token_embedding = token_embedding + self.pe[:, :token_embedding.size(1), :]
        return self.dropout(token_embedding)

class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size, emb_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens):
        return self.embedding(tokens) * math.sqrt(self.emb_size)


In [16]:
import torch

def create_mask(src, tgt_input, num_heads):
    # Get the lengths of the source and target sequences
    src_len = src.size(1)
    tgt_len = tgt_input.size(1)

    # Create padding mask for the source (encoder)
    # Shape: (batch_size, n_heads, 1, src_len)
    src_padding_mask = (src != 0).unsqueeze(1).unsqueeze(2)  # (batch_size, 1, 1, src_len)
    src_padding_mask = src_padding_mask.expand(-1, num_heads, -1, -1)  # Expand to [batch_size, n_heads, 1, src_len]

    # Create padding mask for the target (decoder)
    # Shape: (batch_size, n_heads, 1, tgt_len)
    tgt_padding_mask = (tgt_input != 0).unsqueeze(1).unsqueeze(2)  # (batch_size, 1, 1, tgt_len)
    tgt_padding_mask = tgt_padding_mask.expand(-1, num_heads, -1, -1)  # Expand to [batch_size, n_heads, 1, tgt_len]

    # Create future mask for target (decoder) to prevent attention from looking at future tokens
    # Shape: (1, 1, tgt_len, tgt_len)
    tgt_future_mask = torch.tril(torch.ones((tgt_len, tgt_len), device=tgt_input.device)).bool().unsqueeze(0).unsqueeze(1)  # (1, 1, tgt_len, tgt_len)
    tgt_future_mask = tgt_future_mask.expand(src.size(0), num_heads, -1, -1)  # Expand to match batch size and number of heads

    return src_padding_mask, tgt_padding_mask, tgt_future_mask





In [17]:
# Seq2Seq Network
class Seq2SeqNetwork(nn.Module):
    def __init__(self,
                 num_encoder_layers,
                 num_decoder_layers,
                 emb_size,
                 nhead,
                 src_vocab_size,
                 tgt_vocab_size,
                 dim_feedforward,
                 dropout=0.2):
        super().__init__()
        self.transformer = Transformer(
            d_model=emb_size,
            num_heads=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            d_ff=dim_feedforward,
            dropout=dropout
        )
        self.generator = nn.Linear(emb_size, tgt_vocab_size)
        self.src_tok_emb = TokenEmbedding(src_vocab_size, emb_size)
        self.tgt_tok_emb = TokenEmbedding(tgt_vocab_size, emb_size)
        self.positional_encoding = PositionalEncoding(emb_size, dropout=dropout)

    def forward(self,
                src,
                trg,
                tgt_future_mask=None,
                src_padding_mask=None,
                tgt_padding_mask=None):
        src_emb = self.positional_encoding(self.src_tok_emb(src))
        tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))
        outs = self.transformer(src_emb, tgt_emb, src_padding_mask=src_padding_mask, tgt_padding_mask=tgt_padding_mask, tgt_future_mask=tgt_future_mask)
        return self.generator(outs)


    def encode(self, src, src_padding_mask=None):
        return self.transformer.encode(self.positional_encoding(self.src_tok_emb(src)), src_padding_mask=src_padding_mask)

    def decode(self, tgt, memory, src_padding_mask=None, tgt_padding_mask=None, tgt_future_mask=None):
        return self.transformer.decode(self.positional_encoding(self.tgt_tok_emb(tgt)), memory, src_padding_mask=src_padding_mask, tgt_padding_mask=tgt_padding_mask, tgt_future_mask=tgt_future_mask)


## Note: The parameter size of model should be less than 100M (100,000k) !!!

In [18]:
EMB_SIZE = 512
NHEAD = 8
FFN_HID_DIM = 2048
NUM_ENCODER_LAYERS = 1
NUM_DECODER_LAYERS = 1
SRC_VOCAB_SIZE = tokenizer_cn.vocab_size
TGT_VOCAB_SIZE = tokenizer_en.vocab_size
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

transformer = Seq2SeqNetwork(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                                 NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)

for p in transformer.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

transformer = transformer.to(DEVICE)
param_transformer = sum(p.numel() for p in transformer.parameters())
print (f"The parameter size of transformer is {param_transformer/1000} k")
#   The parameter size of model should be less than 100M (100,000k) !!!
#   The parameter size of model should be less than 100M (100,000k) !!!
#   The parameter size of model should be less than 100M (100,000k) !!!

The parameter size of transformer is 47895.876 k


# Training
- You can change the training setting by yourself including
  - Number of epoch
  - Optimizer
  - Learning rate
  - Learning rate scheduler
  - etc...

In [19]:
NUM_EPOCHS = 20
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)
optimizer = torch.optim.Adam(transformer.parameters(), lr=5e-4, betas=(0.9, 0.98), eps=1e-9)

## Translation quality metrics: BLEU score

In [20]:
from torchmetrics.text import BLEUScore

def bleu_score_func(predicted, truth, grams=1):
    preds = [predicted]
    truth = [[truth]]
    bleu = BLEUScore(n_gram=grams)
    return bleu(preds, truth)


def BLEU_batch(predict, truth, output_tokenizer):
    batch_size = predict.size(0)
    seq_len = (truth == EOS_IDX).nonzero(as_tuple = True)[1]+1
    total_score = 0
    for i in range(batch_size):
        predict_str = output_tokenizer.decode(predict[i,:seq_len[i] ], skip_special_tokens=False)
        truth_str = output_tokenizer.decode(truth[i, :seq_len[i]], skip_special_tokens=False)
        score_gram1 = bleu_score_func(predict_str.lower(), truth_str.lower(), grams=1)
        #score_gram2 = bleu_score_func(predict_str.lower(), truth_str, grams=2)
        #score_gram3 = bleu_score_func(predict_str.lower(), truth_str, grams=3)
        #score_gram4 = bleu_score_func(predict_str.lower(), truth_str, grams=4)
        #total_score = total_score + (score_gram1 + score_gram2 + score_gram3 + score_gram4) / 4.0
        total_score = total_score + score_gram1
    total_score = total_score / batch_size
    return total_score

## Training and Evaluation Functions

In [21]:
def train_epoch(model, optimizer, train_dataloader, loss_fn, DEVICE):
    model.train()
    losses = 0

    for src, tgt in train_dataloader:
        # Move data to the specified device (GPU or CPU)
        src = src.to(DEVICE).long()
        tgt = tgt.to(DEVICE).long()

        # Prepare input and target for the model
        tgt_input = tgt[:, :-1]  # Exclude the last element for the input to the decoder
        tgt_out = tgt[:, 1:]     # Shift target by one for teacher forcing

        # Get the number of heads from the model
        num_heads = model.head_num if hasattr(model, 'head_num') else 1

        # Create masks using the modified function
        src_padding_mask, tgt_padding_mask, tgt_future_mask = create_mask(src, tgt_input, num_heads=num_heads)

        # Forward pass
        logits = model(src, tgt_input, src_padding_mask=src_padding_mask,
                       tgt_padding_mask=tgt_padding_mask, tgt_future_mask=tgt_future_mask)

        # Zero the gradients
        optimizer.zero_grad()

        # Compute the loss
        loss = loss_fn(logits.view(-1, logits.shape[-1]), tgt_out.reshape(-1).long())

        # Backward pass
        loss.backward()

        # Update weights
        optimizer.step()

        # Accumulate loss
        losses += loss.item()

    return losses / len(train_dataloader)





def evaluate(model, val_dataloader, loss_fn, DEVICE):
    model.eval()
    losses = 0
    score = 0

    num_heads = model.head_num if hasattr(model, 'head_num') else 1

    with torch.no_grad():  # Evaluation mode should not track gradients
        for src, tgt in val_dataloader:
            # Move data to the specified device (GPU or CPU)
            src = src.to(DEVICE).long()
            tgt = tgt.to(DEVICE).long()

            # Prepare input and target for the model
            tgt_input = tgt[:, :-1]  # Exclude the last element for the input to the decoder
            tgt_out = tgt[:, 1:]     # Shift target by one for teacher forcing

            # Create masks using the modified function
            src_padding_mask, tgt_padding_mask, tgt_future_mask = create_mask(src, tgt_input, num_heads=num_heads)

            # Forward pass
            logits = model(src, tgt_input, src_padding_mask=src_padding_mask,
                           tgt_padding_mask=tgt_padding_mask, tgt_future_mask=tgt_future_mask)

            # Prediction and loss calculation
            _, tgt_predict = torch.max(logits, dim=-1)
            score_batch = BLEU_batch(tgt_predict, tgt_out, tokenizer_en)

            loss = loss_fn(logits.view(-1, logits.shape[-1]), tgt_out.reshape(-1).long())
            losses += loss.item()
            score += score_batch

    # Average loss and score over the entire validation set
    avg_loss = losses / len(val_dataloader)
    avg_score = score / len(val_dataloader)

    return avg_loss, avg_score



## Start training
- MODEL_SAVE_PATH: path for storing the best model

In [22]:
MODEL_SAVE_PATH = "./model.ckpt"

In [23]:
from timeit import default_timer as timer
transformer = transformer.to(DEVICE)

best_acc = 1e20

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'


loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

for epoch in range(1, NUM_EPOCHS+1):
    start_time = timer()
    train_loss = train_epoch(transformer, optimizer, cn_to_en_train_loader, loss_fn, DEVICE)
    end_time = timer()
    val_loss, val_acc = evaluate(transformer, cn_to_en_test_loader, loss_fn, DEVICE)

    print((f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}, Val Acc: {val_acc:.3f}, "f"Epoch time = {(end_time - start_time):.3f}s"))

    # Save the best model so far.
    if train_loss < best_acc:
        best_acc = train_loss
        best_state_dict = transformer.state_dict()
        torch.save(best_state_dict, MODEL_SAVE_PATH)
        print("(model saved)")


Epoch: 1, Train loss: 6.316, Val loss: 5.769, Val Acc: 0.215, Epoch time = 240.620s
(model saved)
Epoch: 2, Train loss: 5.480, Val loss: 5.400, Val Acc: 0.238, Epoch time = 242.778s
(model saved)
Epoch: 3, Train loss: 5.030, Val loss: 5.186, Val Acc: 0.260, Epoch time = 242.939s
(model saved)
Epoch: 4, Train loss: 4.669, Val loss: 5.061, Val Acc: 0.280, Epoch time = 243.053s
(model saved)
Epoch: 5, Train loss: 4.362, Val loss: 5.013, Val Acc: 0.292, Epoch time = 242.972s
(model saved)
Epoch: 6, Train loss: 4.094, Val loss: 4.998, Val Acc: 0.296, Epoch time = 243.100s
(model saved)
Epoch: 7, Train loss: 3.858, Val loss: 5.011, Val Acc: 0.300, Epoch time = 243.060s
(model saved)
Epoch: 8, Train loss: 3.653, Val loss: 5.043, Val Acc: 0.306, Epoch time = 243.554s
(model saved)
Epoch: 9, Train loss: 3.471, Val loss: 5.096, Val Acc: 0.307, Epoch time = 243.185s
(model saved)
Epoch: 10, Train loss: 3.307, Val loss: 5.139, Val Acc: 0.308, Epoch time = 243.469s
(model saved)
Epoch: 11, Train lo

# Inference

In [24]:
# function to generate output sequence using greedy algorithm
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)

    memory = model.encode(src, src_padding_mask=src_mask)
    ys = [start_symbol]  # Store generated tokens in a list

    for i in range(max_len - 1):
        tgt = torch.tensor(ys).view(1, -1).to(DEVICE)  # Shape: [1, len(ys)]

        
        num_heads = model.transformer.decoder_layers[0].self_attn.head_num
        _, tgt_padding_mask, tgt_future_mask = create_mask(src, tgt, num_heads=num_heads)

        
        out = model.decode(tgt, memory, src_padding_mask=src_mask, tgt_padding_mask=tgt_padding_mask, tgt_future_mask=tgt_future_mask)
        prob = model.generator(out[:, -1, :])  

        
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.item()

        # 添加新生成
        ys.append(next_word)
        if next_word == EOS_IDX:
            break

    return torch.tensor(ys).view(1, -1)



# actual function to translate input sentence into target language
def translate(model: torch.nn.Module, src_sentence: str, input_tokenizer, output_tokenizer, max_len=50):
    model.eval()
    
    sentence = input_tokenizer.encode(src_sentence)
    sentence = torch.tensor(sentence).view(1, -1).to(DEVICE)  # Shape: [1, len(sentence)]

    
    num_heads = model.transformer.decoder_layers[0].self_attn.head_num
    src_padding_mask, _, _ = create_mask(sentence, sentence, num_heads=num_heads)  # 仅需要 src_padding_mask

    
    tgt_tokens = greedy_decode(model, sentence, src_padding_mask, max_len=max_len, start_symbol=BOS_IDX).flatten()

    
    output_sentence = output_tokenizer.decode(tgt_tokens.tolist(), skip_special_tokens=True)
    return output_sentence



## Load best model

In [25]:
transformer = Seq2SeqNetwork(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                                 NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)
transformer.to(DEVICE)
transformer.load_state_dict(torch.load("model.ckpt"))

  transformer.load_state_dict(torch.load("model.ckpt"))


<All keys matched successfully>

## Translation testing

In [26]:
sentence = "你好，欢迎来到中国"
ground_truth = 'Hello, Welcome to China'
predicted = translate(transformer, sentence, tokenizer_cn, tokenizer_en)

print(f'{"Input:":15s}: {sentence}')
print(f'{"Prediction":15s}: {predicted}')
print(f'{"Ground truth":15s}: {ground_truth}')
print("Bleu Score (1gram): ", bleu_score_func(predicted.lower(), ground_truth.lower(), 1).item())
print("Bleu Score (2gram): ", bleu_score_func(predicted.lower(), ground_truth.lower(), 2).item())
print("Bleu Score (3gram): ", bleu_score_func(predicted.lower(), ground_truth.lower(), 3).item())
print("Bleu Score (4gram): ", bleu_score_func(predicted.lower(), ground_truth.lower(), 4).item())

Input:         : 你好，欢迎来到中国
Prediction     : Hello, welcome you to China.
Ground truth   : Hello, Welcome to China
Bleu Score (1gram):  0.6000000238418579
Bleu Score (2gram):  0.3872983455657959
Bleu Score (3gram):  0.0
Bleu Score (4gram):  0.0


In [27]:
sentence = "早上好，很高心见到你"
ground_truth = 'Good Morning, nice to meet you'
predicted = translate(transformer, sentence, tokenizer_cn, tokenizer_en)

print(f'{"Input:":15s}: {sentence}')
print(f'{"Prediction":15s}: {predicted}')
print(f'{"Ground truth":15s}: {ground_truth}')
print("Bleu Score (1gram): ", bleu_score_func(predicted.lower(), ground_truth.lower(), 1).item())
print("Bleu Score (2gram): ", bleu_score_func(predicted.lower(), ground_truth.lower(), 2).item())
print("Bleu Score (3gram): ", bleu_score_func(predicted.lower(), ground_truth.lower(), 3).item())
print("Bleu Score (4gram): ", bleu_score_func(predicted.lower(), ground_truth.lower(), 4).item())

Input:         : 早上好，很高心见到你
Prediction     : Good morning, I see you.
Ground truth   : Good Morning, nice to meet you
Bleu Score (1gram):  0.32749229669570923
Bleu Score (2gram):  0.2589053809642792
Bleu Score (3gram):  0.0
Bleu Score (4gram):  0.0


In [28]:
sentence = "祝您有个美好的一天"
ground_truth = 'Have a nice day'
predicted = translate(transformer, sentence, tokenizer_cn, tokenizer_en)

print(f'{"Input:":15s}: {sentence}')
print(f'{"Prediction":15s}: {predicted}')
print(f'{"Ground truth":15s}: {ground_truth}')
print("Bleu Score (1gram): ", bleu_score_func(predicted.lower(), ground_truth.lower(), 1).item())
print("Bleu Score (2gram): ", bleu_score_func(predicted.lower(), ground_truth.lower(), 2).item())
print("Bleu Score (3gram): ", bleu_score_func(predicted.lower(), ground_truth.lower(), 3).item())
print("Bleu Score (4gram): ", bleu_score_func(predicted.lower(), ground_truth.lower(), 4).item())

Input:         : 祝您有个美好的一天
Prediction     : I wish you have a good morning.
Ground truth   : Have a nice day
Bleu Score (1gram):  0.2857142984867096
Bleu Score (2gram):  0.2182179093360901
Bleu Score (3gram):  0.0
Bleu Score (4gram):  0.0
