#### Installing Necessary Libraries

In [None]:
!pip install torch==2.0.0 torchtext==0.15.1



In [None]:
!pip install torchdata==0.6.0



In [None]:
!pip install "portalocker>=2.0.0"

In [None]:
!pip install numpy==1.23.5



In [None]:
!python -m spacy download de_core_news_sm
!python -m spacy download en_core_web_sm

Collecting de-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.7.0/de_core_news_sm-3.7.0-py3-none-any.whl (14.6 MB)
Installing collected packages: de-core-news-sm
Successfully installed de-core-news-sm-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)

#### Step 2: Train a model on the training set (English-to-German)

In [None]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import multi30k, Multi30k
from typing import Iterable, List
multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"
SOURCE_LANG = 'en'
TARGET_LANG = 'de'
tokenizer_map = {}
vocab_map = {}


This cell imports get_tokenizer for tokenization, build_vocab_from_iterator for building vocabularies, and the multi30k module together with its Multi30k dataset class. Multi30k is a commonly used dataset for evaluating English-German translation models. The training and validation URLs are overridden manually, presumably so that the archives are downloaded from a location that is known to be reachable. SOURCE_LANG and TARGET_LANG are set to English ('en') and German ('de'), respectively. The tokenizer_map and vocab_map dictionaries will hold the tokenizer and the vocabulary for each language. The tokenizers split raw text into tokens, and the vocabularies map those tokens to the integer indices that the model takes as input.



In [None]:
tokenizer_map[SOURCE_LANG] = get_tokenizer('spacy', language='en_core_web_sm')
tokenizer_map[TARGET_LANG] = get_tokenizer('spacy', language='de_core_news_sm')
def extract_tokens(data_iterator: Iterable, lang: str) -> Iterable[List[str]]:
    lang_index_map = {SOURCE_LANG: 0, TARGET_LANG: 1}
    for sample in data_iterator:
        yield tokenizer_map[lang](sample[lang_index_map[lang]])
UNKNOWN_IDX, PADDING_IDX, START_IDX, END_IDX = 0, 1, 2, 3
special_tokens = ['<unk>', '<pad>', '<bos>', '<eos>']
for lang in [SOURCE_LANG, TARGET_LANG]:
    train_data_iter = Multi30k(split='train', language_pair=(SOURCE_LANG, TARGET_LANG))
    vocab_map[lang] = build_vocab_from_iterator(extract_tokens(train_data_iter, lang),
                                                min_freq=1,
                                                specials=special_tokens,
                                                special_first=True)
for lang in [SOURCE_LANG, TARGET_LANG]:
    vocab_map[lang].set_default_index(UNKNOWN_IDX)

The code above sets up the tokenizers and vocabularies for the English source language and the German target language, using PyTorch's TorchText library. It first creates a tokenizer for each language from the corresponding spaCy model (en_core_web_sm and de_core_news_sm). The extract_tokens function then walks through the dataset and yields the tokens of each sample for the requested language. Each sample is a (source, target) pair, so the English sentence sits at index 0 and the German sentence at index 1. Four special tokens represent unknown words, padding, the start of a sequence, and the end of a sequence. The build_vocab_from_iterator function builds one vocabulary per language from the tokenized training data; with special_first=True the special tokens come first and therefore receive indices 0 to 3, and with min_freq=1 every token seen in training is kept. Finally, set_default_index makes each vocabulary return UNKNOWN_IDX for any token it does not contain.
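To make the vocabulary step concrete, here is a minimal framework-free sketch of the same idea. It is not TorchText's implementation: `build_toy_vocab` and `lookup` are hypothetical helpers, whitespace splitting stands in for the spaCy tokenizers, and the ordering of ordinary tokens may differ from what build_vocab_from_iterator produces.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Same special tokens, in the same order, as special_tokens in the notebook.
SPECIALS = ['<unk>', '<pad>', '<bos>', '<eos>']

def build_toy_vocab(token_lists: Iterable[List[str]], min_freq: int = 1) -> Dict[str, int]:
    """Give the special tokens indices 0..3, then index the remaining tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {tok: idx for idx, tok in enumerate(SPECIALS)}
    # Assumption: order by descending frequency, then alphabetically.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def lookup(vocab: Dict[str, int], tokens: List[str], default_index: int = 0) -> List[int]:
    """Map tokens to indices, falling back to the <unk> index for unseen tokens."""
    return [vocab.get(tok, default_index) for tok in tokens]

# Whitespace splitting is a stand-in for the spaCy tokenizer.
corpus = ["a dog runs", "a cat sleeps"]
toy_vocab = build_toy_vocab(s.split() for s in corpus)
ids = lookup(toy_vocab, "a bird runs".split())
print(ids)  # "bird" was never seen, so it maps to index 0 (<unk>)
```

The fallback in `lookup` plays the role of set_default_index(UNKNOWN_IDX): without it, an out-of-vocabulary word at translation time would raise an error instead of mapping to `<unk>`.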

#### Seq2Seq Network using Transformer

In [None]:
from torch import Tensor
import torch
import torch.nn as nn
from torch.nn import Transformer
import math
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
class PositionalEncoding(nn.Module):
    def __init__(self,
                 embedding_dim: int,
                 dropout_rate: float,
                 max_sequence_length: int = 5000):
        super(PositionalEncoding, self).__init__()
        denominator = torch.exp(- torch.arange(0, embedding_dim, 2) * math.log(10000) / embedding_dim)
        position = torch.arange(0, max_sequence_length).reshape(max_sequence_length, 1)
        position_embedding = torch.zeros((max_sequence_length, embedding_dim))
        position_embedding[:, 0::2] = torch.sin(position * denominator)
        position_embedding[:, 1::2] = torch.cos(position * denominator)
        position_embedding = position_embedding.unsqueeze(-2)
        self.dropout = nn.Dropout(dropout_rate)
        self.register_buffer('position_embedding', position_embedding)
    def forward(self, token_embeddings: Tensor):
        return self.dropout(token_embeddings + self.position_embedding[:token_embeddings.size(0), :])
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, embedding_dim: int):
        super(TokenEmbedding, self).__init__()
        self.embedding_layer = nn.Embedding(vocab_size, embedding_dim)
        self.embedding_dim = embedding_dim
    def forward(self, tokens: Tensor):
        return self.embedding_layer(tokens.long()) * math.sqrt(self.embedding_dim)
class Seq2SeqTransformer(nn.Module):
    def __init__(self,
                 encoder_layers: int,
                 decoder_layers: int,
                 embedding_dim: int,
                 num_heads: int,
                 source_vocab_size: int,
                 target_vocab_size: int,
                 feedforward_dim: int = 512,
                 dropout_rate: float = 0.1):
        super(Seq2SeqTransformer, self).__init__()
        self.transformer_model = Transformer(d_model=embedding_dim,
                                             nhead=num_heads,
                                             num_encoder_layers=encoder_layers,
                                             num_decoder_layers=decoder_layers,
                                             dim_feedforward=feedforward_dim,
                                             dropout=dropout_rate)
        self.output_layer = nn.Linear(embedding_dim, target_vocab_size)
        self.source_token_embedding = TokenEmbedding(source_vocab_size, embedding_dim)
        self.target_token_embedding = TokenEmbedding(target_vocab_size, embedding_dim)
        self.position_encoding = PositionalEncoding(embedding_dim, dropout_rate)
    def forward(self,
                source: Tensor,
                target: Tensor,
                source_mask: Tensor,
                target_mask: Tensor,
                source_padding_mask: Tensor,
                target_padding_mask: Tensor,
                memory_padding_mask: Tensor):
        source_embedding = self.position_encoding(self.source_token_embedding(source))
        target_embedding = self.position_encoding(self.target_token_embedding(target))
        output = self.transformer_model(source_embedding, target_embedding, source_mask, target_mask, None,
                                        source_padding_mask, target_padding_mask, memory_padding_mask)
        return self.output_layer(output)
    def encode(self, source: Tensor, source_mask: Tensor):
        return self.transformer_model.encoder(self.position_encoding(self.source_token_embedding(source)), source_mask)
    def decode(self, target: Tensor, memory: Tensor, target_mask: Tensor):
        return self.transformer_model.decoder(self.position_encoding(self.target_token_embedding(target)), memory,
                                              target_mask)

The code above defines a sequence-to-sequence architecture for machine translation built on PyTorch's Transformer module. The PositionalEncoding class adds position information to the token embeddings, which is needed because self-attention on its own has no notion of token order. It precomputes the encodings with sine and cosine functions of different frequencies, stores them in a buffer, and applies dropout to the sum for regularization. The TokenEmbedding class maps token indices to dense vectors and scales them by multiplying with the square root of the embedding dimension. The main class, Seq2SeqTransformer, wires these parts together: it builds an nn.Transformer with the requested number of encoder and decoder layers, attention heads, and feed-forward dimension, and adds a linear output layer that projects the decoder's hidden states to the size of the target vocabulary. The forward method embeds the source and target sequences, passes them through the transformer with the supplied masks, and returns logits over the target vocabulary. The encode method runs only the encoder on the source sequence, while the decode method runs only the decoder on the target tokens, attending to the encoder output (memory).
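The sinusoidal table built in PositionalEncoding can be reproduced without tensors, which makes the formula easier to inspect. The sketch below is an illustration only: `sinusoidal_encoding` and `scale_embedding` are hypothetical names, and plain Python lists replace the registered buffer.

```python
import math
from typing import List

def sinusoidal_encoding(max_len: int, dim: int) -> List[List[float]]:
    """Return a max_len x dim table: sine at even columns, cosine at odd columns."""
    table = [[0.0] * dim for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, dim, 2):
            # Same frequency term as `denominator` in the notebook:
            # exp(-i * log(10000) / dim) for even column index i.
            angle = pos * math.exp(-i * math.log(10000) / dim)
            table[pos][i] = math.sin(angle)
            if i + 1 < dim:
                table[pos][i + 1] = math.cos(angle)
    return table

def scale_embedding(vector: List[float], dim: int) -> List[float]:
    """Multiply an embedding vector by sqrt(dim), as TokenEmbedding.forward does."""
    return [v * math.sqrt(dim) for v in vector]

pe = sinusoidal_encoding(max_len=4, dim=6)
print(pe[0])  # position 0: every sine is 0.0 and every cosine is 1.0
print(scale_embedding([0.5, -0.5, 1.0, 0.0], dim=4))
```

Each position gets a distinct pattern across the columns, and the first column pair oscillates fastest while later pairs change more slowly, which is what lets the model tell nearby and distant positions apart.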

In [None]:
def generate_upper_triangle_mask(size):
    mask = (torch.triu(torch.ones((size, size), device=DEVICE)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask
def create_masks(source, target):
    source_sequence_length = source.shape[0]
    target_sequence_length = target.shape[0]
    target_mask = generate_upper_triangle_mask(target_sequence_length)
    source_mask = torch.zeros((source_sequence_length, source_sequence_length), device=DEVICE).type(torch.bool)
    source_padding_mask = (source == PADDING_IDX).transpose(0, 1)
    target_padding_mask = (target == PADDING_IDX).transpose(0, 1)
    return source_mask, target_mask, source_padding_mask, target_padding_mask

The code above defines two helper functions, generate_upper_triangle_mask and create_masks, which build the masks used by the Transformer's attention mechanisms. The generate_upper_triangle_mask function creates the causal mask for the target sequence used in the decoder. It builds an upper-triangular matrix of ones with torch.triu and transposes it, so that row i marks positions 0 through i as allowed. Allowed positions are then filled with 0.0 and disallowed (future) positions with negative infinity (-inf). Because the mask is added to the attention scores, this keeps the model from looking at future tokens and preserves the autoregressive property during training.

The second function, create_masks, builds all the masks needed for a source and target batch. It reads the source and target sequence lengths from the first dimension of each tensor. It calls generate_upper_triangle_mask to create target_mask, and sets source_mask to an all-False boolean matrix, which allows every source position to attend to every other. The function also creates source_padding_mask and target_padding_mask, which are True wherever the input holds a padding token (PADDING_IDX), so that those positions are ignored by attention.
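Printing a small mask shows the pattern more clearly than the tensor operations do. The following sketch rebuilds both kinds of mask with plain lists; `causal_mask` and `padding_mask` are hypothetical helpers, and the padding example uses a batch-first layout, which corresponds to the result of the transpose(0, 1) call in create_masks.

```python
from typing import List

NEG_INF = float('-inf')
PAD = 1  # same value as PADDING_IDX in the notebook

def causal_mask(size: int) -> List[List[float]]:
    """Row i is the query position, column j the key position.
    0.0 means attention is allowed, -inf means it is blocked."""
    return [[0.0 if j <= i else NEG_INF for j in range(size)] for i in range(size)]

def padding_mask(batch_first_ids: List[List[int]], pad_idx: int = PAD) -> List[List[bool]]:
    """True marks padding positions that attention should ignore."""
    return [[tok == pad_idx for tok in seq] for seq in batch_first_ids]

for row in causal_mask(3):
    print(row)
# [0.0, -inf, -inf]
# [0.0, 0.0, -inf]
# [0.0, 0.0, 0.0]

print(padding_mask([[2, 5, 3, 1, 1]]))
# [[False, False, False, True, True]]
```

The two float values are chosen because the mask is added to the attention scores before the softmax: adding 0.0 leaves a score unchanged, while adding -inf drives its softmax weight to zero.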

In [None]:
torch.manual_seed(0)
SOURCE_VOCAB_SIZE = len(vocab_map[SOURCE_LANG])
TARGET_VOCAB_SIZE = len(vocab_map[TARGET_LANG])
EMBEDDING_DIM = 512
NUM_HEADS = 8
FEEDFORWARD_DIM = 512
BATCH_SIZE = 128
ENCODER_LAYERS = 3
DECODER_LAYERS = 3
seq2seq_transformer = Seq2SeqTransformer(ENCODER_LAYERS, DECODER_LAYERS, EMBEDDING_DIM,
                                         NUM_HEADS, SOURCE_VOCAB_SIZE, TARGET_VOCAB_SIZE, FEEDFORWARD_DIM)
for param in seq2seq_transformer.parameters():
    if param.dim() > 1:
        nn.init.xavier_uniform_(param)
seq2seq_transformer = seq2seq_transformer.to(DEVICE)
criterion = torch.nn.CrossEntropyLoss(ignore_index=PADDING_IDX)
optimizer = torch.optim.Adam(seq2seq_transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

The code first sets a random seed for reproducibility. The source and target vocabulary sizes are taken from the vocabularies built earlier (vocab_map). The hyperparameters EMBEDDING_DIM, NUM_HEADS, FEEDFORWARD_DIM, BATCH_SIZE, ENCODER_LAYERS, and DECODER_LAYERS are then set, and the Seq2SeqTransformer model is created from them.

Next, every parameter with more than one dimension is initialized with Xavier uniform initialization, which helps avoid unstable training by keeping the initial weights from being too large or too small. The model is then moved to the selected device, either CPU or GPU. The loss function is CrossEntropyLoss with ignore_index set to PADDING_IDX, so padding positions do not contribute to the loss. The optimizer is Adam with a learning rate of 0.0001, betas of (0.9, 0.98), and an epsilon of 1e-9, the small constant that guards against division by zero.
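The effect of ignore_index can be checked by hand on a tiny example. The sketch below is a plain-Python approximation of mean-reduced cross-entropy that skips padded targets; `cross_entropy_ignore` is a hypothetical helper written for illustration, not the PyTorch implementation.

```python
import math
from typing import List

def cross_entropy_ignore(logits_rows: List[List[float]],
                         targets: List[int],
                         ignore_index: int = 1) -> float:
    """Average negative log-likelihood over positions whose target is not ignore_index."""
    total, count = 0.0, 0
    for logits, target in zip(logits_rows, targets):
        if target == ignore_index:
            continue  # padding position: contributes nothing to the loss
        # Numerically stable log-sum-exp of the logits.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum_exp - logits[target]
        count += 1
    return total / count if count else float('nan')

# Two real positions with uniform logits over 4 classes, plus one padded position.
logits = [[0.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 0.0],
          [5.0, -5.0, 0.0, 0.0]]
targets = [2, 3, 1]  # the last target is the padding index
loss = cross_entropy_ignore(logits, targets)
print(loss, math.log(4))  # the padded row does not change the average
```

Without the skip, short sentences in a batch would be penalized for whatever the model predicts at their padded positions, and the average loss would depend on how much padding a batch happens to contain.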

In [None]:
from torch.nn.utils.rnn import pad_sequence
def apply_transforms_in_sequence(*transforms):
    def inner_function(text_input):
        for transform in transforms:
            text_input = transform(text_input)
        return text_input
    return inner_function
def convert_to_tensor(token_ids: List[int]):
    return torch.cat((torch.tensor([START_IDX]),
                      torch.tensor(token_ids),
                      torch.tensor([END_IDX])))
text_transform_map = {}
for lang in [SOURCE_LANG, TARGET_LANG]:
    text_transform_map[lang] = apply_transforms_in_sequence(tokenizer_map[lang],
                                                            vocab_map[lang],
                                                            convert_to_tensor)
def collate_batch_fn(batch):
    source_batch, target_batch = [], []
    for source_sample, target_sample in batch:
        source_batch.append(text_transform_map[SOURCE_LANG](source_sample.rstrip("\n")))
        target_batch.append(text_transform_map[TARGET_LANG](target_sample.rstrip("\n")))

    source_batch = pad_sequence(source_batch, padding_value=PADDING_IDX)
    target_batch = pad_sequence(target_batch, padding_value=PADDING_IDX)
    return source_batch, target_batch

This cell defines apply_transforms_in_sequence, which takes any number of text transforms and returns a function that applies them in order; here the steps are tokenization, vocabulary lookup, and tensor conversion. The convert_to_tensor function wraps the token IDs with the start token (START_IDX) and the end token (END_IDX). text_transform_map is a dictionary that holds this pipeline for both the English source language and the German target language. The collate_batch_fn function turns a batch of raw (source, target) text pairs into tensors: it strips the trailing newline from each sentence and converts it to a sequence of token indices. The sequences are then padded with pad_sequence, using PADDING_IDX as the fill value, so that all sequences in a batch have the same length and can be stacked into a single tensor of shape (sequence length, batch size).
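The wrapping and padding steps are easy to trace on two short sequences. This sketch uses plain lists instead of tensors; `wrap` and `pad_time_major` are hypothetical helpers that imitate convert_to_tensor and the default (batch_first=False) layout of pad_sequence.

```python
from typing import List

PAD, BOS, EOS = 1, 2, 3  # same values as PADDING_IDX, START_IDX, END_IDX

def wrap(token_ids: List[int]) -> List[int]:
    """Add the start and end tokens around a sequence of token indices."""
    return [BOS] + list(token_ids) + [EOS]

def pad_time_major(sequences: List[List[int]], pad: int = PAD) -> List[List[int]]:
    """Pad to the longest sequence and return rows indexed as [time step][batch item]."""
    max_len = max(len(seq) for seq in sequences)
    return [[seq[t] if t < len(seq) else pad for seq in sequences]
            for t in range(max_len)]

batch = pad_time_major([wrap([4, 7]), wrap([5])])
for row in batch:
    print(row)
# [2, 2]
# [4, 5]
# [7, 3]
# [3, 1]
```

Each column is one sentence and each row is one time step, which is why the rest of the notebook reads sequence lengths from shape[0] and transposes the padding masks.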

In [None]:
from torch.utils.data import DataLoader
def train_one_epoch(seq2seq_transformer, optimizer):
    seq2seq_transformer.train()
    total_loss = 0
    num_batches = 0
    training_data_iter = Multi30k(split='train', language_pair=(SOURCE_LANG, TARGET_LANG))
    training_dataloader = DataLoader(training_data_iter, batch_size=BATCH_SIZE, collate_fn=collate_batch_fn)
    for source_batch, target_batch in training_dataloader:
        source_batch = source_batch.to(DEVICE)
        target_batch = target_batch.to(DEVICE)
        # Decoder input: every target token except the last one.
        target_input = target_batch[:-1, :]
        source_mask, target_mask, source_padding_mask, target_padding_mask = create_masks(source_batch, target_input)
        logits = seq2seq_transformer(source_batch, target_input, source_mask, target_mask,
                                     source_padding_mask, target_padding_mask, source_padding_mask)
        optimizer.zero_grad()
        # Expected output: every target token except the first one.
        target_output = target_batch[1:, :]
        loss = criterion(logits.reshape(-1, logits.shape[-1]), target_output.reshape(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        num_batches += 1
    # Batches are counted in the loop so the dataset is not iterated a second time.
    return total_loss / num_batches
def evaluate_model(seq2seq_transformer):
    seq2seq_transformer.eval()
    total_loss = 0
    num_batches = 0
    validation_data_iter = Multi30k(split='valid', language_pair=(SOURCE_LANG, TARGET_LANG))
    validation_dataloader = DataLoader(validation_data_iter, batch_size=BATCH_SIZE, collate_fn=collate_batch_fn)
    # Gradients are not needed for evaluation, so gradient tracking is switched off.
    with torch.no_grad():
        for source_batch, target_batch in validation_dataloader:
            source_batch = source_batch.to(DEVICE)
            target_batch = target_batch.to(DEVICE)
            target_input = target_batch[:-1, :]
            source_mask, target_mask, source_padding_mask, target_padding_mask = create_masks(source_batch, target_input)
            logits = seq2seq_transformer(source_batch, target_input, source_mask, target_mask,
                                         source_padding_mask, target_padding_mask, source_padding_mask)
            target_output = target_batch[1:, :]
            loss = criterion(logits.reshape(-1, logits.shape[-1]), target_output.reshape(-1))
            total_loss += loss.item()
            num_batches += 1
    return total_loss / num_batches

The code above implements the training and evaluation loops for the sequence-to-sequence (Seq2Seq) Transformer model in PyTorch. The train_one_epoch function trains for one epoch on the training split of Multi30k, a benchmark dataset for English-German translation. It uses a DataLoader for batching, with collate_batch_fn handling the conversion to padded tensors. At each step, the source and target batches are moved to the selected device. The target batch is split into target_input, which drops the last token, and target_output, which drops the first token, so that each position is trained to predict the next token. The attention masks (source_mask, target_mask) and the padding masks are built by create_masks and passed to the model, which returns logits over the target vocabulary. The cross-entropy loss between the logits and target_output is computed with CrossEntropyLoss, loss.backward() computes the gradients, and the optimizer updates the model parameters to reduce the loss. The function returns the average loss per batch for the epoch.

The evaluate_model function follows the same procedure on the validation split, but performs no backward pass or optimizer step, so the model's weights are not updated. It loops over the validation set, builds the masks, runs the forward pass, and accumulates the loss for each batch. Because eval() is called, dropout is disabled, which makes the outputs deterministic. The function returns the mean loss per batch over the validation set.
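The split into target_input and target_output is the step that most often causes confusion, so here it is on a tiny batch. The sketch uses nested lists in the same (sequence length, batch size) layout as the tensors; the token values and the `column` helper are made up for illustration.

```python
from typing import List

# Time-major target batch with 4 time steps and 2 sentences.
# 2 = <bos>, 3 = <eos>, 1 = <pad>; 4, 5 and 7 are ordinary tokens.
target_batch = [[2, 2], [4, 5], [7, 3], [3, 1]]

target_input = target_batch[:-1]   # like target_batch[:-1, :], drops the last time step
target_output = target_batch[1:]   # like target_batch[1:, :], drops the first time step

def column(rows: List[List[int]], b: int) -> List[int]:
    """Return sentence b from a time-major batch."""
    return [row[b] for row in rows]

# Each pair is (token fed to the decoder, token it should predict next).
pairs_seq0 = list(zip(column(target_input, 0), column(target_output, 0)))
pairs_seq1 = list(zip(column(target_input, 1), column(target_output, 1)))
print(pairs_seq0)  # [(2, 4), (4, 7), (7, 3)]
print(pairs_seq1)  # [(2, 5), (5, 3), (3, 1)]
```

The last pair of the shorter sentence has the padding index as its expected output, which is exactly the kind of position that ignore_index=PADDING_IDX removes from the loss.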

In [None]:
from timeit import default_timer as timer
EPOCHS = 20
for epoch in range(1, EPOCHS + 1):
    start_time = timer()
    training_loss = train_one_epoch(seq2seq_transformer, optimizer)
    end_time = timer()
    validation_loss = evaluate_model(seq2seq_transformer)
    print(f"Epoch: {epoch}, Training Loss: {training_loss:.3f}, Validation Loss: {validation_loss:.3f}, Epoch Time = {(end_time - start_time):.3f}s")
def greedy_decode(seq2seq_transformer, source, source_mask, max_length, start_token):
    seq2seq_transformer.eval()
    source = source.to(DEVICE)
    source_mask = source_mask.to(DEVICE)
    memory = seq2seq_transformer.encode(source, source_mask)
    decoded_tokens = torch.ones(1, 1).fill_(start_token).type(torch.long).to(DEVICE)
    for i in range(max_length - 1):
        target_mask = generate_upper_triangle_mask(decoded_tokens.size(0)).to(DEVICE)
        output = seq2seq_transformer.decode(decoded_tokens, memory, target_mask)
        output = output.transpose(0, 1)
        probabilities = seq2seq_transformer.output_layer(output[:, -1])
        _, next_token = torch.max(probabilities, dim=1)
        next_token = next_token.item()
        decoded_tokens = torch.cat([decoded_tokens, torch.ones(1, 1).type_as(source.data).fill_(next_token)], dim=0)
        if next_token == END_IDX:
            break
    return decoded_tokens
def translate(seq2seq_transformer, source_sentence):
    seq2seq_transformer.eval()
    source = text_transform_map[SOURCE_LANG](source_sentence).view(-1, 1).to(DEVICE)
    source_mask = torch.zeros(source.shape[0], source.shape[0]).type(torch.bool).to(DEVICE)
    target_tokens = greedy_decode(seq2seq_transformer, source, source_mask, max_length=source.shape[0] + 5, start_token=START_IDX).flatten()
    return " ".join(vocab_map[TARGET_LANG].lookup_tokens(list(target_tokens.cpu().numpy()))).replace("<bos>", "").replace("<eos>", "")



Epoch: 1, Training Loss: 5.900, Validation Loss: 4.628, Epoch Time = 52.793s
Epoch: 2, Training Loss: 4.125, Validation Loss: 3.782, Epoch Time = 51.241s
Epoch: 3, Training Loss: 3.459, Validation Loss: 3.313, Epoch Time = 51.339s
Epoch: 4, Training Loss: 3.021, Validation Loss: 3.010, Epoch Time = 51.100s
Epoch: 5, Training Loss: 2.697, Validation Loss: 2.811, Epoch Time = 51.124s
Epoch: 6, Training Loss: 2.440, Validation Loss: 2.645, Epoch Time = 51.360s
Epoch: 7, Training Loss: 2.228, Validation Loss: 2.505, Epoch Time = 52.057s
Epoch: 8, Training Loss: 2.051, Validation Loss: 2.380, Epoch Time = 51.788s
Epoch: 9, Training Loss: 1.901, Validation Loss: 2.298, Epoch Time = 51.334s
Epoch: 10, Training Loss: 1.772, Validation Loss: 2.231, Epoch Time = 51.177s
Epoch: 11, Training Loss: 1.653, Validation Loss: 2.206, Epoch Time = 51.003s
Epoch: 12, Training Loss: 1.545, Validation Loss: 2.166, Epoch Time = 51.259s
Epoch: 13, Training Loss: 1.448, Validation Loss: 2.157, Epoch Time = 51.

This cell runs the training loop and defines the functions for greedy decoding and translation. The training loop runs for the specified number of epochs, timing the training pass of each one and computing the training and validation losses with train_one_epoch and evaluate_model, respectively. The greedy_decode function generates the output one token at a time, always picking the highest-scoring next token without exploring alternative candidates. It starts from the start token (START_IDX) and keeps generating until either the end token (END_IDX) is produced or the maximum length is reached. The translate function takes a source sentence, converts it to a tensor of token indices, and builds an all-False source mask for the attention mechanism. It then calls greedy_decode to produce the target token indices and maps them back to strings through the target vocabulary, reversing the lookup used during preprocessing. Finally, it joins the tokens with spaces and removes the <bos> and <eos> special tokens, leaving a clean sentence in the target language.
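The control flow of greedy decoding can be isolated from the model by replacing the Transformer with a stand-in scoring function. In the sketch below, `greedy_decode_toy`, `toy_scores`, and the score table are all hypothetical; only the loop structure (start token, argmax, append, stop on the end token or at the length limit) follows greedy_decode above.

```python
from typing import Callable, List

BOS, EOS = 2, 3  # same values as START_IDX and END_IDX

def greedy_decode_toy(next_scores: Callable[[List[int]], List[float]],
                      max_length: int) -> List[int]:
    """Repeatedly append the highest-scoring next token to the prefix."""
    tokens = [BOS]
    for _ in range(max_length - 1):
        scores = next_scores(tokens)
        # Argmax over the vocabulary; no alternative candidates are kept.
        next_token = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens

# Stand-in for the model: scores over a 6-token vocabulary that depend
# only on the most recent token in the prefix.
SCORE_TABLE = {
    2: [0.0, 0.0, 0.0, 0.0, 1.0, 0.2],
    4: [0.0, 0.0, 0.0, 0.1, 0.0, 0.9],
    5: [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
}

def toy_scores(prefix: List[int]) -> List[float]:
    return SCORE_TABLE[prefix[-1]]

print(greedy_decode_toy(toy_scores, max_length=10))  # [2, 4, 5, 3], stops at <eos>
print(greedy_decode_toy(toy_scores, max_length=3))   # [2, 4, 5], stops at the length limit
```

Because only the single best token is kept at each step, greedy decoding is fast but can miss a better overall sequence; beam search is the usual alternative when that matters.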

#### Step 3: Insert novel sentences into your English-to-German model.

In [None]:
print(translate(seq2seq_transformer, "A group of people standing in front of an igloo"))

 Eine Gruppe von Personen steht vor einem Iglu . 


In [None]:
print(translate(seq2seq_transformer, "The cat is in the house ."))

 Die Katze ist im Haus . 
