<a href="https://colab.research.google.com/github/DanieleAngioni97/Introductory-Seminar-PyTorch/blob/main/notebooks/04_nlp2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prerequisites

In [1]:
!pip install torchtext torchdata portalocker


The following workaround is needed because torchdata is still in beta, and some incompatibilities between library versions have to be fixed manually.

In [2]:
import torch
torch.utils.data.datapipes.utils.common.DILL_AVAILABLE = torch.utils._import_utils.dill_available()
import torchdata

In [3]:
import torchtext
torchtext.disable_torchtext_deprecation_warning()

# Prepare data processing pipelines

The torchtext library provides a few raw dataset iterators, which yield the raw text strings. For example, the AG_NEWS dataset iterators yield the raw data as a tuple of label and text.

AG News (AG’s News Corpus) is a subset of AG's corpus of news articles, built from the title and description fields of articles in the 4 largest classes of AG’s Corpus:
1. World
2. Sports
3. Business
4. Sci/Tech


AG News contains 30,000 training samples and 1,900 test samples per class.

In [4]:
import torch
from torchtext.datasets import AG_NEWS

train_iter = iter(AG_NEWS(split="train"))

label, input = next(train_iter)
print(f"input: {input}")
print(f"label: {label}")

input: Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
label: 3


## Tokenization

The `basic_english` tokenizer applies a basic normalization to each line of text. The normalization consists of:
- lowercasing
- some basic text normalization for English words, as follows:
    - add spaces before and after `'`
    - remove `"`
    - add spaces before and after `.`
    - replace `<br />` with a single space
    - add spaces before and after `,`
    - add spaces before and after `(`
    - add spaces before and after `)`
    - add spaces before and after `!`
    - add spaces before and after `?`
    - replace `;` with a single space
    - replace `:` with a single space
    - replace multiple spaces with a single space

The tokenizer then returns the list of tokens obtained by splitting on whitespace.
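
The normalization rules listed above can be sketched with plain regular expressions. This is only an illustration of the listed rules, not necessarily how torchtext implements them; `basic_english_sketch` and the `_RULES` table are hypothetical names introduced here.

```python
import re

# (pattern, replacement) pairs mirroring the rules listed above, applied in order.
_RULES = [
    ("'", " ' "),
    ('"', ""),
    (r"\.", " . "),
    ("<br />", " "),
    (",", " , "),
    (r"\(", " ( "),
    (r"\)", " ) "),
    ("!", " ! "),
    (r"\?", " ? "),
    (";", " "),
    (":", " "),
    (r"\s+", " "),
]
_COMPILED = [(re.compile(pattern), replacement) for pattern, replacement in _RULES]


def basic_english_sketch(line):
    """Lowercase the line, apply the substitutions, then split on whitespace."""
    line = line.lower()
    for pattern, replacement in _COMPILED:
        line = pattern.sub(replacement, line)
    return line.split()


tokens = basic_english_sketch("Wall St. Bears Claw Back (Reuters)")
print(tokens)
```

Characters that are not covered by any rule, such as `-`, are left untouched, so a hyphenated word stays a single token.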

In [5]:
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer("basic_english")
tokenized_input = tokenizer(input)

print(len(tokenized_input))
print(tokenized_input)

29
['wall', 'st', '.', 'bears', 'claw', 'back', 'into', 'the', 'black', '(', 'reuters', ')', 'reuters', '-', 'short-sellers', ',', 'wall', 'street', "'", 's', 'dwindling\\band', 'of', 'ultra-cynics', ',', 'are', 'seeing', 'green', 'again', '.']


## Building the vocabulary
Using the tokenizer, we can run over the whole training set to extract the tokens that make up the final vocabulary.
To do that we can use the `build_vocab_from_iterator` function, which accepts an iterator that yields lists or iterators of tokens.
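
To make the idea concrete, here is a rough pure-Python sketch of what building a vocabulary from an iterator of token lists involves: count the tokens, put the special tokens first, and assign consecutive indices. The function `build_vocab_sketch`, the toy corpus, and the tie-breaking order are assumptions made for illustration, not the library's implementation.

```python
from collections import Counter


def build_vocab_sketch(token_lists, specials=("<unk>", "<pad>"), min_freq=1):
    """Return (stoi, itos) for an iterable of token lists."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    # Most frequent tokens first; ties are broken alphabetically (an assumption).
    ordered = sorted(counter.items(), key=lambda item: (-item[1], item[0]))
    itos = list(specials) + [
        token for token, freq in ordered
        if freq >= min_freq and token not in specials
    ]
    stoi = {token: index for index, token in enumerate(itos)}
    return stoi, itos


toy_corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
stoi, itos = build_vocab_sketch(toy_corpus)

# Unknown tokens fall back to the index of '<unk>', like a default index.
unk_index = stoi["<unk>"]
encoded = [stoi.get(token, unk_index) for token in ["the", "cat", "flew"]]
print(itos)
print(encoded)
```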

In [6]:
from torchtext.vocab import build_vocab_from_iterator
train_iter = AG_NEWS(split="train")
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)
tokens_iterator = yield_tokens(train_iter)
UNK_TOKEN = '<unk>'
PAD_TOKEN = '<pad>'
specials = [UNK_TOKEN, PAD_TOKEN]
vocab = build_vocab_from_iterator(tokens_iterator, specials=specials)
# set the index of the out-of-vocab token (<unk>) as default (=0)
vocab.set_default_index(vocab[UNK_TOKEN])

We can also retrieve the mappings stored in the vocabulary to go from a specific token to its corresponding index and vice versa.

In [7]:
index_to_tokens_dict = vocab.get_itos()
tokens_to_index_dict = vocab.get_stoi()

print(tokens_to_index_dict["journal"])
print(index_to_tokens_dict[2361])

2361
journal


Using the vocabulary we can now easily get the list of indices for a given tokenized sentence

In [8]:
vocab(tokenizer("This is how we find indices from a sentence"))

[53, 22, 358, 508, 747, 18963, 30, 6, 2994]

In [10]:
train_iter = AG_NEWS(split="train")
num_class = len(set([label for (label, text) in train_iter]))
vocab_size = len(vocab)
print(num_class)
print(vocab_size)

4
95812


# Generate data batch and iterator


## Padding
Unlike images, the sentences in the training set have different lengths.
Since we want to exploit GPU parallelization, we have to bring them to the same length so that we can create a single tensor of consistent shape for a given batch.

The function `torch.nn.utils.rnn.pad_sequence` helps us by taking as input a list of tensors (each with a different sequence length) and appending padding values (for example, the index of the `'<pad>'` token) to each sample so that all the samples have the same sequence length.
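
As a minimal sketch of the idea, the snippet below right-pads plain Python lists to the length of the longest one. The helper `pad_batch_sketch`, the value of `PAD_INDEX`, and the token indices are illustrative assumptions; the real function works on tensors.

```python
def pad_batch_sketch(sequences, padding_value):
    """Right-pad every sequence with `padding_value` up to the longest length in the batch."""
    max_len = max(len(sequence) for sequence in sequences)
    return [
        list(sequence) + [padding_value] * (max_len - len(sequence))
        for sequence in sequences
    ]


PAD_INDEX = 1  # assumed index of the '<pad>' token
batch = [[53, 22, 3], [53, 22, 3, 128, 2994]]
padded = pad_batch_sketch(batch, PAD_INDEX)
print(padded)
```

After padding, every row has the same length, so the batch can be stored as a single rectangular array.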


In [11]:
from torch.nn.utils.rnn import pad_sequence
input1 = "This is the first sentence"
input2 = "This is the second sentence, which is longer"
tensor1 = torch.tensor(vocab(tokenizer(input1)))
tensor2 = torch.tensor(vocab(tokenizer(input2)))
inputs = [tensor1, tensor2]
padding_value = tokens_to_index_dict[PAD_TOKEN]
padded = pad_sequence(inputs,
                      padding_value=padding_value, # fill with <pad> tokens
                      batch_first=True)
print(padded)

tensor([[  53,   22,    3,   48, 2994,    1,    1,    1,    1],
        [  53,   22,    3,  128, 2994,    4,  104,   22, 1529]])



We can now define the `pad_collate_batch` function, which returns a tensor of padded texts with shape `(batch_size, sequence_length)` together with a tensor of labels.
This function can simply be passed as the `collate_fn` argument of the `DataLoader`, which will then directly return tensors that a neural network can process in parallel.

In [13]:
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

device = "cuda" if torch.cuda.is_available() else "cpu"
assert device == 'cuda'

text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) - 1  # we want labels starting from 0

MAX_LEN = 100
def pad_collate_batch(batch):
    label_list, text_list = [], []
    for _label, _text in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        processed_text = processed_text[:MAX_LEN]   # cut too long sequences
        text_list.append(processed_text)
    padded_texts = pad_sequence(text_list,
                                padding_value=tokens_to_index_dict[PAD_TOKEN],
                                batch_first=True)
    tensor_labels = torch.tensor(label_list, dtype=torch.int64)
    return padded_texts, tensor_labels


train_iter = AG_NEWS(split="train")
dataloader = DataLoader(train_iter,
                        batch_size=64,
                        shuffle=True,
                        collate_fn=pad_collate_batch
                        )
x, y = next(iter(dataloader))
print(x.shape)
print(y.shape)

torch.Size([64, 90])
torch.Size([64])


# Embedding
The Embedding layer is a lookup table in which each index corresponds to a vector in a high-dimensional space with `embedding_dim` dimensions.
These vectors are initialized randomly, but they are learned during training so that each word is placed in a region of the space that is useful for the task at hand.
If the input to the embedding layer has shape `(batch_size, sequence_length)`, its output will have shape `(batch_size, sequence_length, embedding_dim)`.
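
The lookup-table view can be sketched without any deep learning library: the table is a list of rows, and embedding a batch means replacing each index with its row. The names and sizes below are made up for illustration, and the zeroed padding row mimics what `padding_idx` does in `nn.Embedding`.

```python
import random

random.seed(0)
num_embeddings, embedding_dim, padding_idx = 10, 4, 1

# One randomly initialized row per vocabulary index.
table = [
    [random.gauss(0.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(num_embeddings)
]
# The row reserved for padding is kept at zero.
table[padding_idx] = [0.0] * embedding_dim


def embed_sketch(batch_of_indices):
    """Replace every index with the corresponding row of the table."""
    return [[table[index] for index in sequence] for sequence in batch_of_indices]


x = [[3, 7, 1], [2, 2, 9]]  # shape (batch_size=2, sequence_length=3)
embeddings = embed_sketch(x)
shape = (len(embeddings), len(embeddings[0]), len(embeddings[0][0]))
print(shape)  # (2, 3, 4)
```

Equal indices always map to the same vector, which is what lets training move each word's vector consistently across all the sentences in which the word appears.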

In [14]:
import torch.nn as nn
import torch

embedding_dim = 128
vocab_len = 1000

embed = nn.Embedding(num_embeddings=vocab_len,
                     embedding_dim=embedding_dim,
                     padding_idx=1  # the vector at this index is kept at zero
                     )
batch_size = 8
sequence_length = 25
# generate random integer indices in [0, 100)
x = torch.randint(100, size=(batch_size, sequence_length))
print(x.shape)
embeddings = embed(x)
print(embeddings.shape)

torch.Size([8, 25])
torch.Size([8, 25, 128])


# Define the model
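The model embeds the token indices, feeds the embeddings to a multi-layer GRU, and classifies each sentence from the GRU output at the last time step. The tensor shapes at each step are reported in the comments of the `forward` method.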


In [17]:
from torch import nn

class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes,
                 num_recurrent_layers=3, bidirectional=False):
        super(TextClassificationModel, self).__init__()
        self.embed_dim = embed_dim
        self.embedding = nn.Embedding(num_embeddings=vocab_size,
                                      embedding_dim=embed_dim,
                                      padding_idx=tokens_to_index_dict[PAD_TOKEN])
        self.num_recurrent_layers = num_recurrent_layers
        self.rnn = nn.GRU(input_size=embed_dim,
                          hidden_size=embed_dim,
                          num_layers=num_recurrent_layers,
                          bidirectional=bidirectional,
                          batch_first=True)
        n_directions = 2 if bidirectional else 1
        n_out_neurons = embed_dim * n_directions
        self.fc = nn.Linear(n_out_neurons, num_classes)

    def forward(self, x):
        # x.shape = (batch_size, seq_len)
        embedded = self.embedding(x)
        # embedded.shape = (batch_size, seq_len, embed_dim)
        # the initial hidden state is initialized to zero by default
        out, _ = self.rnn(embedded)
        # out.shape = (batch_size, seq_len, n_out_neurons)
        out = out[:, -1, :]   # pick only the last output
        out = self.fc(out)
        # out.shape = (batch_size, num_classes)
        return out

x, _ = next(iter(dataloader))
model = TextClassificationModel(len(vocab), embed_dim=128,
                                num_classes=4,
                                num_recurrent_layers=3,
                                bidirectional=True)
out = model(x)
print(out.shape)



torch.Size([64, 4])
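
One caveat of picking `out[:, -1, :]` on a padded batch is that, for the shorter sentences, the last time step is a `<pad>` position. A possible alternative, which is not what the model above does, is to pack the batch with `pack_padded_sequence` so that the final hidden state of each sequence corresponds to its last real token. The sketch below assumes right padding, a pad index of 1, a unidirectional GRU, and tiny made-up sizes.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
PAD_IDX = 1  # assumed index of the '<pad>' token

# Two right-padded sequences with true lengths 3 and 5.
batch = torch.tensor([[5, 6, 7, PAD_IDX, PAD_IDX],
                      [8, 9, 10, 11, 12]])
lengths = (batch != PAD_IDX).sum(dim=1)

embedding = nn.Embedding(num_embeddings=20, embedding_dim=4, padding_idx=PAD_IDX)
gru = nn.GRU(input_size=4, hidden_size=4, num_layers=2, batch_first=True)

embedded = embedding(batch)  # (batch_size, seq_len, embed_dim)
packed = pack_padded_sequence(embedded, lengths.cpu(),
                              batch_first=True, enforce_sorted=False)
_, h_n = gru(packed)         # h_n: (num_layers, batch_size, hidden_size)
last_hidden = h_n[-1]        # top layer, state after the last real token
print(last_hidden.shape)
```

With a bidirectional GRU, `h_n` also contains the final states of the backward direction, so the indexing would have to be adapted.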


# Training

In [20]:
import time
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset

def evaluate(model, dataloader, device):
    model.eval()
    total_correct, total_samples = 0, 0
    with torch.no_grad():
        for idx, (x, y) in enumerate(dataloader):
            x, y = x.to(device), y.to(device)
            ypred = model(x).argmax(dim=1)
            total_correct += (ypred == y).sum().item()
            total_samples += x.shape[0]
    return total_correct / total_samples

# Hyperparameters
EPOCHS = 5  # epoch
LR = 1e-3  # learning rate
BATCH_SIZE = 64  # batch size for training
RANDOM_SEED = 42
embedding_size = 64

device = 'cuda' if torch.cuda.is_available() else 'cpu'
assert device == 'cuda'

# Prepare the train, validation and test dataloaders
train_iter, test_iter = AG_NEWS()
# Here we convert the iterable-style datasets to map-style ones,
# which simply means that each sample can be accessed by index
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)

train_size = int(len(train_dataset) * 0.95)

torch.manual_seed(RANDOM_SEED)
split_train_, split_valid_ = random_split(
    train_dataset, [train_size, len(train_dataset) - train_size]
)

train_dataloader = DataLoader(
    split_train_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=pad_collate_batch
)
valid_dataloader = DataLoader(
    split_valid_, batch_size=BATCH_SIZE, shuffle=False, collate_fn=pad_collate_batch
)
test_dataloader = DataLoader(
    test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=pad_collate_batch
)


model = TextClassificationModel(vocab_size, embedding_size, num_class).to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LR)

print_every = 100
accuracy_checkpoint = None

for epoch in range(EPOCHS):
    epoch_start_time = time.time()
    model.train()
    start = time.time()
    total_correct, total_samples = 0, 0
    for idx, (x, y) in enumerate(train_dataloader):
        x, y = x.to(device), y.to(device)
        out = model(x)
        loss = criterion(out, y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        ypred = out.argmax(dim=1)
        total_correct += (ypred == y).sum().item()
        total_samples += x.shape[0]
        if idx % print_every == 0:
            train_accuracy = total_correct / total_samples
            print(f"Epoch: [{epoch + 1}/{EPOCHS}], "\
                  f"Batch: [{idx + 1}/{len(train_dataloader)}], "\
                  f"Train. Accuracy: {train_accuracy:.8f}"
                  )
            total_correct, total_samples = 0, 0
    end = time.time()
    total_time = end - start

    valid_accuracy = evaluate(model, valid_dataloader, device)
    accuracy_checkpoint = valid_accuracy
    print("-" * 59)
    print(f"> End of epoch {epoch + 1}, took {total_time} seconds")
    print(f"> Valid. Accuracy: {valid_accuracy}")
    print("\n")

Epoch: [1/5], Batch: [1/1782], Train. Accuracy: 0.21875000
Epoch: [1/5], Batch: [101/1782], Train. Accuracy: 0.24703125
Epoch: [1/5], Batch: [201/1782], Train. Accuracy: 0.25421875
Epoch: [1/5], Batch: [301/1782], Train. Accuracy: 0.25062500
Epoch: [1/5], Batch: [401/1782], Train. Accuracy: 0.26203125
Epoch: [1/5], Batch: [501/1782], Train. Accuracy: 0.26609375
Epoch: [1/5], Batch: [601/1782], Train. Accuracy: 0.28203125
Epoch: [1/5], Batch: [701/1782], Train. Accuracy: 0.25062500
Epoch: [1/5], Batch: [801/1782], Train. Accuracy: 0.24593750
Epoch: [1/5], Batch: [901/1782], Train. Accuracy: 0.27968750
Epoch: [1/5], Batch: [1001/1782], Train. Accuracy: 0.43390625
Epoch: [1/5], Batch: [1101/1782], Train. Accuracy: 0.61406250
Epoch: [1/5], Batch: [1201/1782], Train. Accuracy: 0.73171875
Epoch: [1/5], Batch: [1301/1782], Train. Accuracy: 0.78031250
Epoch: [1/5], Batch: [1401/1782], Train. Accuracy: 0.80937500
Epoch: [1/5], Batch: [1501/1782], Train. Accuracy: 0.81781250
Epoch: [1/5], Batch:

# Evaluating

In [21]:
print("Checking the results of test dataset.")
test_accuracy = evaluate(model, test_dataloader, device)
print(f"Test accuracy: {test_accuracy:.4f}")

Checking the results of test dataset.
Test accuracy: 0.9084


In [22]:
ag_news_label = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}
def predict(text, text_pipeline):
    with torch.no_grad():
        text = torch.tensor(text_pipeline(text)).unsqueeze(0)
        output = model(text)
        # the model predicts classes in [0, 3], while AG News labels start from 1
        ypred = output.argmax(dim=1).item() + 1
        return ypred

ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

model.to("cpu")
pred = predict(ex_text_str, text_pipeline)
print(f"This is a {ag_news_label[pred]} news")

This is a Sports news


# Language Translation (Bonus)

In [None]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import multi30k, Multi30k
from typing import Iterable, List


# We need to modify the URLs for the dataset since the links to the original dataset are broken
# Refer to https://github.com/pytorch/text/issues/1756#issuecomment-1163664163 for more info
multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"

SRC_LANGUAGE = 'de'
TGT_LANGUAGE = 'en'

# Place-holders
token_transform = {}
vocab_transform = {}

In [None]:
!pip install -U spacy
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm


In [None]:
token_transform[SRC_LANGUAGE] = get_tokenizer('spacy', language='de_core_news_sm')
token_transform[TGT_LANGUAGE] = get_tokenizer('spacy', language='en_core_web_sm')


# helper function to yield list of tokens
def yield_tokens(data_iter: Iterable, language: str) -> Iterable[List[str]]:
    language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}

    for data_sample in data_iter:
        yield token_transform[language](data_sample[language_index[language]])

# Define special symbols and indices
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

for language in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Training data Iterator
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    # Create torchtext's Vocab object
    vocab_transform[language] = build_vocab_from_iterator(
        yield_tokens(train_iter, language),
        min_freq=1,
        specials=special_symbols,
        special_first=True)

# Set ``UNK_IDX`` as the default index. This index is returned when the token is not found.
# If not set, it throws ``RuntimeError`` when the queried token is not found in the Vocabulary.
for language in [SRC_LANGUAGE, TGT_LANGUAGE]:
    vocab_transform[language].set_default_index(UNK_IDX)



In [None]:
from torch import Tensor
import torch
import torch.nn as nn
from torch.nn import Transformer
import math
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# helper Module that adds positional encoding to the token embedding to introduce a notion of word order.
class PositionalEncoding(nn.Module):
    def __init__(self,
                 emb_size: int,
                 dropout: float,
                 maxlen: int = 5000):
        super(PositionalEncoding, self).__init__()
        den = torch.exp(- torch.arange(0, emb_size, 2)* math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)

        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: Tensor):
        return self.dropout(token_embedding + self.pos_embedding[:token_embedding.size(0), :])

# helper Module to convert tensor of input indices into corresponding tensor of token embeddings
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)

# Seq2Seq Network
class Seq2SeqTransformer(nn.Module):
    def __init__(self,
                 num_encoder_layers: int,
                 num_decoder_layers: int,
                 emb_size: int,
                 nhead: int,
                 src_vocab_size: int,
                 tgt_vocab_size: int,
                 dim_feedforward: int = 512,
                 dropout: float = 0.1):
        super(Seq2SeqTransformer, self).__init__()
        self.transformer = Transformer(d_model=emb_size,
                                       nhead=nhead,
                                       num_encoder_layers=num_encoder_layers,
                                       num_decoder_layers=num_decoder_layers,
                                       dim_feedforward=dim_feedforward,
                                       dropout=dropout)
        self.generator = nn.Linear(emb_size, tgt_vocab_size)
        self.src_tok_emb = TokenEmbedding(src_vocab_size, emb_size)
        self.tgt_tok_emb = TokenEmbedding(tgt_vocab_size, emb_size)
        self.positional_encoding = PositionalEncoding(
            emb_size, dropout=dropout)

    def forward(self,
                src: Tensor,
                trg: Tensor,
                src_mask: Tensor,
                tgt_mask: Tensor,
                src_padding_mask: Tensor,
                tgt_padding_mask: Tensor,
                memory_key_padding_mask: Tensor):
        src_emb = self.positional_encoding(self.src_tok_emb(src))
        tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))
        outs = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None,
                                src_padding_mask, tgt_padding_mask, memory_key_padding_mask)
        return self.generator(outs)

    def encode(self, src: Tensor, src_mask: Tensor):
        return self.transformer.encoder(self.positional_encoding(
                            self.src_tok_emb(src)), src_mask)

    def decode(self, tgt: Tensor, memory: Tensor, tgt_mask: Tensor):
        return self.transformer.decoder(self.positional_encoding(
                          self.tgt_tok_emb(tgt)), memory,
                          tgt_mask)
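
The `PositionalEncoding` module above fills the even columns of its table with sines and the odd columns with cosines, using frequencies that decrease geometrically with the column index. The same table can be written in plain Python as a small sanity check; `positional_encoding_sketch` is a hypothetical helper and assumes an even `emb_size`.

```python
import math


def positional_encoding_sketch(maxlen, emb_size):
    """Return a maxlen x emb_size table of sinusoidal encodings (emb_size must be even)."""
    table = [[0.0] * emb_size for _ in range(maxlen)]
    for pos in range(maxlen):
        for k in range(0, emb_size, 2):
            # Same factor as `den` in the module above.
            den = math.exp(-k * math.log(10000) / emb_size)
            table[pos][k] = math.sin(pos * den)
            table[pos][k + 1] = math.cos(pos * den)
    return table


pe = positional_encoding_sketch(maxlen=4, emb_size=6)
print(pe[0])  # position 0: sin(0) = 0.0 and cos(0) = 1.0 in every pair
```

Note that the module indexes its buffer with `token_embedding.size(0)`, so it expects inputs in sequence-first layout, `(seq_len, batch_size, emb_size)`.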

In [None]:
def generate_square_subsequent_mask(sz):
    mask = (torch.triu(torch.ones((sz, sz), device=DEVICE)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask


def create_mask(src, tgt):
    src_seq_len = src.shape[0]
    tgt_seq_len = tgt.shape[0]

    tgt_mask = generate_square_subsequent_mask(tgt_seq_len)
    src_mask = torch.zeros((src_seq_len, src_seq_len),device=DEVICE).type(torch.bool)

    src_padding_mask = (src == PAD_IDX).transpose(0, 1)
    tgt_padding_mask = (tgt == PAD_IDX).transpose(0, 1)
    return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask
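
To see what these masks contain, the same logic can be reproduced on plain lists. In the causal mask, `0.0` means "may attend" and `-inf` means "blocked"; in the padding mask, `True` marks padded positions. Both helpers below are illustrative sketches with hypothetical names, not part of the tutorial code.

```python
def causal_mask_sketch(size):
    """mask[i][j] is 0.0 when position i may attend to position j (j <= i), else -inf."""
    neg_inf = float("-inf")
    return [[0.0 if j <= i else neg_inf for j in range(size)] for i in range(size)]


def padding_mask_sketch(token_ids, pad_idx):
    """True marks the padded positions that attention should ignore."""
    return [token == pad_idx for token in token_ids]


for row in causal_mask_sketch(3):
    print(row)
print(padding_mask_sketch([2, 5, 3, 1, 1], pad_idx=1))
```

The additive causal mask keeps each target position from looking at later positions, which is what allows the decoder to be trained on whole sentences at once.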

In [None]:
torch.manual_seed(0)

SRC_VOCAB_SIZE = len(vocab_transform[SRC_LANGUAGE])
TGT_VOCAB_SIZE = len(vocab_transform[TGT_LANGUAGE])
EMB_SIZE = 512
NHEAD = 8
FFN_HID_DIM = 512
BATCH_SIZE = 128
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3

transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                                 NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)

for p in transformer.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

transformer = transformer.to(DEVICE)

loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)



In [None]:
from torch.nn.utils.rnn import pad_sequence

# helper function to club together sequential operations
def sequential_transforms(*transforms):
    def func(txt_input):
        for transform in transforms:
            txt_input = transform(txt_input)
        return txt_input
    return func

# function to add BOS/EOS and create tensor for input sequence indices
def tensor_transform(token_ids: List[int]):
    return torch.cat((torch.tensor([BOS_IDX]),
                      torch.tensor(token_ids),
                      torch.tensor([EOS_IDX])))

# ``src`` and ``tgt`` language text transforms to convert raw strings into tensors indices
text_transform = {}
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    text_transform[ln] = sequential_transforms(token_transform[ln], #Tokenization
                                               vocab_transform[ln], #Numericalization
                                               tensor_transform) # Add BOS/EOS and create tensor


# function to collate data samples into batch tensors
def collate_fn(batch):
    src_batch, tgt_batch = [], []
    for src_sample, tgt_sample in batch:
        src_batch.append(text_transform[SRC_LANGUAGE](src_sample.rstrip("\n")))
        tgt_batch.append(text_transform[TGT_LANGUAGE](tgt_sample.rstrip("\n")))

    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
    tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX)
    return src_batch, tgt_batch

In [None]:
from torch.utils.data import DataLoader

def train_epoch(model, optimizer):
    model.train()
    losses = 0
    num_batches = 0
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    train_dataloader = DataLoader(train_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    for src, tgt in train_dataloader:
        src = src.to(DEVICE)
        tgt = tgt.to(DEVICE)

        # decoder input: every target token except the last one
        tgt_input = tgt[:-1, :]

        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        logits = model(src, tgt_input, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask, src_padding_mask)

        optimizer.zero_grad()

        # expected output: the target shifted by one position
        tgt_out = tgt[1:, :]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()

        optimizer.step()
        losses += loss.item()
        num_batches += 1

    # average over the batches seen, without iterating the dataloader a second time
    return losses / num_batches


def evaluate(model):
    model.eval()
    losses = 0
    num_batches = 0

    val_iter = Multi30k(split='valid', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    val_dataloader = DataLoader(val_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    # gradients are not needed for validation
    with torch.no_grad():
        for src, tgt in val_dataloader:
            src = src.to(DEVICE)
            tgt = tgt.to(DEVICE)

            tgt_input = tgt[:-1, :]

            src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

            logits = model(src, tgt_input, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask, src_padding_mask)

            tgt_out = tgt[1:, :]
            loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
            losses += loss.item()
            num_batches += 1

    # average over the batches seen, without iterating the dataloader a second time
    return losses / num_batches
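
Both functions above feed `tgt[:-1, :]` to the decoder and compare its predictions with `tgt[1:, :]`: at every position the model sees the target tokens up to that point and has to predict the next one, a scheme commonly called teacher forcing. The shift is easier to see on a single sentence; the token ids below are made up for illustration.

```python
BOS_IDX, EOS_IDX = 2, 3  # same special indices as defined earlier

target = [BOS_IDX, 17, 42, 8, EOS_IDX]  # made-up token ids for one sentence

decoder_input = target[:-1]   # what the decoder is given
expected_output = target[1:]  # what it should predict, one step ahead

for step, (seen, predicted) in enumerate(zip(decoder_input, expected_output)):
    print(f"step {step}: last token seen = {seen}, token to predict = {predicted}")
```

The first input is always `<bos>` and the last expected output is always `<eos>`, so the model also learns when to stop.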

In [None]:
from timeit import default_timer as timer
NUM_EPOCHS = 10

for epoch in range(1, NUM_EPOCHS+1):
    start_time = timer()
    train_loss = train_epoch(transformer, optimizer)
    end_time = timer()
    val_loss = evaluate(transformer)
    print((f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}, "f"Epoch time = {(end_time - start_time):.3f}s"))



Epoch: 1, Train loss: 5.344, Val loss: 4.106, Epoch time = 45.982s
Epoch: 2, Train loss: 3.761, Val loss: 3.309, Epoch time = 43.577s
Epoch: 3, Train loss: 3.157, Val loss: 2.887, Epoch time = 44.054s
Epoch: 4, Train loss: 2.767, Val loss: 2.640, Epoch time = 44.601s
Epoch: 5, Train loss: 2.477, Val loss: 2.442, Epoch time = 43.825s
Epoch: 6, Train loss: 2.247, Val loss: 2.306, Epoch time = 43.699s
Epoch: 7, Train loss: 2.055, Val loss: 2.207, Epoch time = 44.812s
Epoch: 8, Train loss: 1.893, Val loss: 2.114, Epoch time = 43.685s
Epoch: 9, Train loss: 1.754, Val loss: 2.054, Epoch time = 44.509s
Epoch: 10, Train loss: 1.628, Val loss: 2.008, Epoch time = 43.891s
Epoch: 11, Train loss: 1.520, Val loss: 1.961, Epoch time = 43.701s


KeyboardInterrupt: 

In [None]:

# function to generate output sequence using greedy algorithm
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)

    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    for i in range(max_len-1):
        memory = memory.to(DEVICE)
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                    .type(torch.bool)).to(DEVICE)
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.item()

        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == EOS_IDX:
            break
    return ys


# actual function to translate input sentence into target language
def translate(model: torch.nn.Module, src_sentence: str):
    model.eval()
    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = greedy_decode(
        model,  src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX).flatten()
    return " ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))).replace("<bos>", "").replace("<eos>", "")
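
Stripped of the tensor handling, greedy decoding is a short loop: start from `<bos>`, repeatedly append the highest-scoring next token, and stop at `<eos>` or at the length limit. The sketch below uses a hypothetical scoring function, `toy_next_token_scores`, as a stand-in for the decoder and generator, so that the loop can run on its own.

```python
BOS, EOS = 2, 3


def toy_next_token_scores(prefix):
    """Hypothetical stand-in for the decoder: scores over a 6-token vocabulary."""
    scripted = {1: 4, 2: 5, 3: EOS}  # prefix length -> token that gets the top score
    best = scripted.get(len(prefix), EOS)
    return [1.0 if token == best else 0.0 for token in range(6)]


def greedy_decode_sketch(score_fn, max_len, start_symbol=BOS, end_symbol=EOS):
    ys = [start_symbol]
    for _ in range(max_len - 1):
        scores = score_fn(ys)
        # argmax: keep only the single most likely next token
        next_token = max(range(len(scores)), key=scores.__getitem__)
        ys.append(next_token)
        if next_token == end_symbol:
            break
    return ys


decoded = greedy_decode_sketch(toy_next_token_scores, max_len=10)
print(decoded)
```

Greedy decoding commits to one token per step and never revisits earlier choices, which keeps it cheap but can miss translations that a wider search would find.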

In [None]:
print(translate(transformer, "Eine Gruppe von Menschen steht vor einem Iglu ."))

 A group of people standing in front of an empty auditorium . 
