## NLP lab 02

In [2]:
!pip install spacy sacrebleu optuna torchdata -U

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
!spacy download en_core_web_sm

2023-06-04 13:32:48.670905: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-04 13:32:55.887026: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-06-04 13:32:55.887684: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-06-

In [4]:
!spacy download de_core_news_sm

2023-06-04 13:33:09.842877: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-04 13:33:13.889303: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-06-04 13:33:13.891670: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-06-

The following code was taken from https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html.

In [5]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import multi30k, Multi30k
from typing import Iterable, List


# We need to modify the URLs for the dataset since the links to the original dataset are broken
# Refer to https://github.com/pytorch/text/issues/1756#issuecomment-1163664163 for more info
multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"

SRC_LANGUAGE = 'de'
TGT_LANGUAGE = 'en'

# Place-holders
token_transform = {}
vocab_transform = {}

In [6]:
token_transform[SRC_LANGUAGE] = get_tokenizer('spacy', language='de_core_news_sm')
token_transform[TGT_LANGUAGE] = get_tokenizer('spacy', language='en_core_web_sm')

# helper function to yield list of tokens
def yield_tokens(data_iter: Iterable, language: str) -> List[str]:
    language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}

    for data_sample in data_iter:
        yield token_transform[language](data_sample[language_index[language]])

# Define special symbols and indices
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Training data Iterator
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    # Create torchtext's Vocab object
    vocab_transform[ln] = build_vocab_from_iterator(yield_tokens(train_iter, ln),
                                                    min_freq=1,
                                                    specials=special_symbols,
                                                    special_first=True)

# Set ``UNK_IDX`` as the default index. This index is returned when the token is not found.
# If not set, it throws ``RuntimeError`` when the queried token is not found in the Vocabulary.
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    vocab_transform[ln].set_default_index(UNK_IDX)

## Seq2Seq Network using Transformer

In [7]:
from torch import Tensor
import torch
import torch.nn as nn
from torch.nn import Transformer
import math
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# helper Module that adds positional encoding to the token embedding to introduce a notion of word order.
class PositionalEncoding(nn.Module):
    def __init__(self,
                 emb_size: int,
                 dropout: float,
                 maxlen: int = 5000):
        super(PositionalEncoding, self).__init__()
        den = torch.exp(- torch.arange(0, emb_size, 2)* math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)

        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: Tensor):
        return self.dropout(token_embedding + self.pos_embedding[:token_embedding.size(0), :])

# helper Module to convert tensor of input indices into corresponding tensor of token embeddings
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)

# Seq2Seq Network
class Seq2SeqTransformer(nn.Module):
    def __init__(self,
                 num_encoder_layers: int,
                 num_decoder_layers: int,
                 emb_size: int,
                 nhead: int,
                 src_vocab_size: int,
                 tgt_vocab_size: int,
                 dim_feedforward: int = 512,
                 dropout: float = 0.1):
        super(Seq2SeqTransformer, self).__init__()
        self.transformer = Transformer(d_model=emb_size,
                                       nhead=nhead,
                                       num_encoder_layers=num_encoder_layers,
                                       num_decoder_layers=num_decoder_layers,
                                       dim_feedforward=dim_feedforward,
                                       dropout=dropout)
        self.generator = nn.Linear(emb_size, tgt_vocab_size)
        self.src_tok_emb = TokenEmbedding(src_vocab_size, emb_size)
        self.tgt_tok_emb = TokenEmbedding(tgt_vocab_size, emb_size)
        self.positional_encoding = PositionalEncoding(
            emb_size, dropout=dropout)

    def forward(self,
                src: Tensor,
                trg: Tensor,
                src_mask: Tensor,
                tgt_mask: Tensor,
                src_padding_mask: Tensor,
                tgt_padding_mask: Tensor,
                memory_key_padding_mask: Tensor):
        src_emb = self.positional_encoding(self.src_tok_emb(src))
        tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))
        outs = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None,
                                src_padding_mask, tgt_padding_mask, memory_key_padding_mask)
        return self.generator(outs)

    def encode(self, src: Tensor, src_mask: Tensor):
        return self.transformer.encoder(self.positional_encoding(
                            self.src_tok_emb(src)), src_mask)

    def decode(self, tgt: Tensor, memory: Tensor, tgt_mask: Tensor):
        return self.transformer.decoder(self.positional_encoding(
                          self.tgt_tok_emb(tgt)), memory,
                          tgt_mask)

In [8]:
def generate_square_subsequent_mask(sz):
    mask = (torch.triu(torch.ones((sz, sz), device=DEVICE)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask


def create_mask(src, tgt):
    src_seq_len = src.shape[0]
    tgt_seq_len = tgt.shape[0]

    tgt_mask = generate_square_subsequent_mask(tgt_seq_len)
    src_mask = torch.zeros((src_seq_len, src_seq_len),device=DEVICE).type(torch.bool)

    src_padding_mask = (src == PAD_IDX).transpose(0, 1)
    tgt_padding_mask = (tgt == PAD_IDX).transpose(0, 1)
    return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask

In [9]:
torch.manual_seed(0)

SRC_VOCAB_SIZE = len(vocab_transform[SRC_LANGUAGE])
TGT_VOCAB_SIZE = len(vocab_transform[TGT_LANGUAGE])
EMB_SIZE = 512
NHEAD = 8
FFN_HID_DIM = 512
BATCH_SIZE = 128
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3

transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                                 NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)

for p in transformer.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

transformer = transformer.to(DEVICE)

loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

In [10]:
from torch.nn.utils.rnn import pad_sequence

# helper function to club together sequential operations
def sequential_transforms(*transforms):
    def func(txt_input):
        for transform in transforms:
            txt_input = transform(txt_input)
        return txt_input
    return func

# function to add BOS/EOS and create tensor for input sequence indices
def tensor_transform(token_ids: List[int]):
    return torch.cat((torch.tensor([BOS_IDX]),
                      torch.tensor(token_ids),
                      torch.tensor([EOS_IDX])))

# ``src`` and ``tgt`` language text transforms to convert raw strings into tensors indices
text_transform = {}
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    text_transform[ln] = sequential_transforms(token_transform[ln], #Tokenization
                                               vocab_transform[ln], #Numericalization
                                               tensor_transform) # Add BOS/EOS and create tensor


# function to collate data samples into batch tensors
def collate_fn(batch):
    src_batch, tgt_batch = [], []
    for src_sample, tgt_sample in batch:
        src_batch.append(text_transform[SRC_LANGUAGE](src_sample.rstrip("\n")))
        tgt_batch.append(text_transform[TGT_LANGUAGE](tgt_sample.rstrip("\n")))

    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
    tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX)
    return src_batch, tgt_batch

In [11]:
from torch.utils.data import DataLoader

def train_epoch(model, optimizer):
    model.train()
    losses = 0
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    train_dataloader = DataLoader(train_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    for src, tgt in train_dataloader:
        src = src.to(DEVICE)
        tgt = tgt.to(DEVICE)

        tgt_input = tgt[:-1, :]

        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        logits = model(src, tgt_input, src_mask, tgt_mask,src_padding_mask, tgt_padding_mask, src_padding_mask)

        optimizer.zero_grad()

        tgt_out = tgt[1:, :]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()

        optimizer.step()
        losses += loss.item()

    return losses / len(list(train_dataloader))


def evaluate(model):
    model.eval()
    losses = 0

    val_iter = Multi30k(split='valid', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    val_dataloader = DataLoader(val_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    for src, tgt in val_dataloader:
        src = src.to(DEVICE)
        tgt = tgt.to(DEVICE)

        tgt_input = tgt[:-1, :]

        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        logits = model(src, tgt_input, src_mask, tgt_mask,src_padding_mask, tgt_padding_mask, src_padding_mask)

        tgt_out = tgt[1:, :]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        losses += loss.item()

    return losses / len(list(val_dataloader))

In [12]:
from timeit import default_timer as timer
from tqdm import trange

In [13]:
NUM_EPOCHS = 18

for epoch in trange(1, NUM_EPOCHS+1):
    start_time = timer()
    train_loss = train_epoch(transformer, optimizer)
    end_time = timer()
    val_loss = evaluate(transformer)
    print((f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}, "f"Epoch time = {(end_time - start_time):.3f}s"))

  6%|▌         | 1/18 [00:44<12:29, 44.06s/it]

Epoch: 1, Train loss: 5.344, Val loss: 4.114, Epoch time = 43.225s


 11%|█         | 2/18 [01:29<11:55, 44.70s/it]

Epoch: 2, Train loss: 3.760, Val loss: 3.320, Epoch time = 44.360s


 17%|█▋        | 3/18 [02:13<11:06, 44.43s/it]

Epoch: 3, Train loss: 3.161, Val loss: 2.895, Epoch time = 43.328s


 22%|██▏       | 4/18 [02:59<10:31, 45.08s/it]

Epoch: 4, Train loss: 2.768, Val loss: 2.639, Epoch time = 45.323s


 28%|██▊       | 5/18 [03:44<09:47, 45.17s/it]

Epoch: 5, Train loss: 2.480, Val loss: 2.443, Epoch time = 44.577s


 33%|███▎      | 6/18 [04:31<09:08, 45.69s/it]

Epoch: 6, Train loss: 2.251, Val loss: 2.318, Epoch time = 45.900s


 39%|███▉      | 7/18 [05:16<08:21, 45.58s/it]

Epoch: 7, Train loss: 2.061, Val loss: 2.201, Epoch time = 44.593s


 44%|████▍     | 8/18 [06:03<07:39, 45.91s/it]

Epoch: 8, Train loss: 1.897, Val loss: 2.112, Epoch time = 45.798s


 50%|█████     | 9/18 [06:48<06:51, 45.74s/it]

Epoch: 9, Train loss: 1.754, Val loss: 2.061, Epoch time = 44.611s


 56%|█████▌    | 10/18 [07:35<06:08, 46.01s/it]

Epoch: 10, Train loss: 1.631, Val loss: 2.002, Epoch time = 45.829s


 61%|██████    | 11/18 [08:20<05:20, 45.79s/it]

Epoch: 11, Train loss: 1.524, Val loss: 1.969, Epoch time = 44.511s


 67%|██████▋   | 12/18 [09:07<04:36, 46.02s/it]

Epoch: 12, Train loss: 1.419, Val loss: 1.942, Epoch time = 45.735s


 72%|███████▏  | 13/18 [09:52<03:49, 45.82s/it]

Epoch: 13, Train loss: 1.334, Val loss: 1.968, Epoch time = 44.596s


 78%|███████▊  | 14/18 [10:39<03:04, 46.06s/it]

Epoch: 14, Train loss: 1.252, Val loss: 1.944, Epoch time = 45.828s


 83%|████████▎ | 15/18 [11:24<02:17, 45.84s/it]

Epoch: 15, Train loss: 1.173, Val loss: 1.933, Epoch time = 44.558s


 89%|████████▉ | 16/18 [12:11<01:32, 46.06s/it]

Epoch: 16, Train loss: 1.103, Val loss: 1.922, Epoch time = 45.790s


 94%|█████████▍| 17/18 [12:56<00:45, 45.81s/it]

Epoch: 17, Train loss: 1.039, Val loss: 1.899, Epoch time = 44.449s


100%|██████████| 18/18 [13:42<00:00, 45.71s/it]

Epoch: 18, Train loss: 0.979, Val loss: 1.906, Epoch time = 45.720s





In [14]:
# function to generate output sequence using greedy algorithm
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)

    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    for i in range(max_len-1):
        memory = memory.to(DEVICE)
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                    .type(torch.bool)).to(DEVICE)
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.item()

        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == EOS_IDX:
            break
    return ys


# actual function to translate input sentence into target language
def translate(model: torch.nn.Module, src_sentence: str):
    model.eval()
    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = greedy_decode(
        model,  src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX).flatten()
    return " ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))).replace("<bos>", "").replace("<eos>", "")

In [15]:
print(translate(transformer, "Eine Gruppe von Menschen steht vor einem Iglu ."))

 A group of people standing in front of an igloo . 


### **(4 points)** Theoretical questions

Answer the following questions.
- **In the positional encoding, why are we using a combination of sinus and cosinus?**
*By combining enough function, we want that the model has enough granular information to encode the notion of position. Combining sine and cosine should provide more information about word position. We can assume that the sine function could provide information of variation of position over time and the cosine function could capture another type of variation, such as center of the sequence. This combination also respects the two constrains required for positional embeddings.*

- **In the Seq2SeqTransformer class,**
    - **What is the parameter nhead for?**
      *The parameter nhead is propagated to the Transformer of torch. According to torch documentation, it is "the number of heads in the multiheadattention models ".*

    - **What is the point of the generator?**
      *The purpose of the generator is to map the output of the decoder to the size of the target vocabulary. It converts the output of the decoder into probability scores  for each possible word in the target vocabulary. These scores can be converted into probabilities by using a softmax function.*
- **Describe the goal of the create_mask function. Why does it handle differently the source and target masks?**

  _This function generates two different types of masks._

  *   Source mask _which goal is to mask padding token. It doesn't mask ahead tokens_
  *   Target mask _which purpose is to prevent the model from looking ahead in the input and assure a non biased prediction. As the model should only be able to see past and current tokens._




Saving the model for future reuse.

In [16]:
torch.save(transformer, "model.pth")

In [17]:
# transformer = torch.load("model.pth")

## **(6 points)** Decoding functions

The tutorial uses a greedy approach at decoding. Implement the following variations.
* (3 points) A top-k sampling with temperature.

In [18]:
def topk(
    model: torch.nn.Module,
    src: torch.Tensor,
    src_mask: torch.Tensor,
    max_len: int,
    start_symbol: int,
    k: int,
    temperature: float = 1.0):
    """
    Generates a sequence of tokens using top-k sampling.

    Args:
        model: The translation model.
        src: The source sentence tensor.
        src_mask: The mask tensor for the source sentence.
        max_len: The maximum length of the output sequence.
        start_symbol: The index of the start symbol.
        k: The number of tokens to sample from.
        temperature: The temperature parameter

    Returns:
      The output sequence tensor.
    """
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)
    
    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    for i in range(max_len-1):
        memory = memory.to(DEVICE)
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                    .type(torch.bool)).to(DEVICE)
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        prob = model.generator(out[:, -1])
        prob = torch.softmax(prob, dim=-1)
        
        # Temperature
        prob = prob / temperature
        
        # Top-k sampling
        values, indices = torch.topk(prob, k)
        next_word = torch.multinomial(values, 1)[0]
        next_word = indices[0][next_word].item()
        
        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == EOS_IDX:
            break
    return ys

def translate_topk(model: torch.nn.Module, src_sentence: str, k: int, temperature: float):
    """
    Translates a source sentence using top-k sampling.

    Args:
        model: The translation model.
        src_sentence: The source sentence.
        k: The number of tokens to sample from.
        temperature: The temperature parameter for softmax.

    Returns:
        The translated sentence.
    """
    model.eval()
    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = topk(
        model,  src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX, k=k, temperature=temperature).flatten()
        
    return " ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))).replace("<bos>", "").replace("<eos>", "")

* (1 point) A top-p sampling with temperature.

In [19]:
def topp(
    model: torch.nn.Module,
    src: torch.Tensor,
    src_mask: torch.Tensor,
    max_len: int,
    start_symbol: int,
    p: float = 0.9,
    temperature: float = 0.5):
    """Generates a sequence of tokens using top-p sampling.

    Args:
        model: The translation model.
        src: The source sentence tensor.
        src_mask: The mask tensor for the source sentence.
        max_len: The maximum length of the output sequence.
        start_symbol: The index of the start symbol.
        p: The probability threshold for top-p sampling.
        temperature: The temperature parameter for softmax.

    Returns:
        The output sequence tensor.
    """
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)
    
    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    for i in range(max_len-1):
        memory = memory.to(DEVICE)
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                    .type(torch.bool)).to(DEVICE)
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        probs = model.generator(out[:, -1])
        probs = torch.softmax(probs, dim=-1)

        # Temperature
        probs = probs / temperature
            
        # Top-p sampling 
        # Instead of keeping the top k tokens, we keep the smallest possible set of words whose cumulative probability exceeds the probability p.
        sorted_probs, indices = torch.sort(probs, dim=-1, descending=True)
        cum_sum_probs = torch.cumsum(sorted_probs, dim=-1)
        nucleus = cum_sum_probs < p
        nucleus = torch.cat([nucleus.new_ones(nucleus.shape[:-1] + (1,)), nucleus[..., :-1]], dim=-1)
        sorted_log_probs = torch.log(sorted_probs)
        sorted_log_probs[~nucleus] = float('-inf')
        sorted_probs = sorted_probs / torch.sum(sorted_probs, dim=-1, keepdim=True)
        
        # Sample
        next_word = torch.multinomial(sorted_probs, 1)[0]
        next_word = indices[0][next_word].item()
        
        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == EOS_IDX:
            break
    return ys

def translate_topp(model: torch.nn.Module, src_sentence: str, p: float, temperature: float):
    """
    Translates a source sentence using top-p sampling.

    Args:
        model: The translation model.
        src_sentence: The source sentence.
        p: The probability threshold for top-p sampling.
        temperature: The temperature parameter for softmax.

    Returns:
        The translated sentence.
    """
    model.eval()
    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = topp(
        model,  src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX, p=p, temperature=temperature).flatten()
        
    return " ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))).replace("<bos>", "").replace("<eos>", "")

* (2 point) Play with the k, p and temperature parameters, and qualitatively compare a few (at least 3) translation samples for each approach (even the greedy one).

In [37]:
samples = [
    'Eine Gruppe von Menschen steht vor einem Iglu .',
    'Ich spaziere in den Park',
    'Ich esse Schokolade.'
]

for sample in samples:
  print(f"Translation sample: {sample}")
  print("=== Greedy ===")
  print(translate(transformer, sample))

  print("==== Topp ====")
  for p in [0.25, 0.5, 0.75]:
    for temp in [0.25, 0.5, 0.75]:
      print(f"Temp: {temp:0.2f} p:{p} {translate_topp(transformer, sample, p, temp)}")

  print("==== Topk ====")
  for i in [1, 3, 5]:
    for temp in [0.25, 0.5, 0.75]:
      print(f"Temp:{temp:0.2f} k:{i} {translate_topk(transformer, sample, i, temp)}")

  print()

Translation sample: Eine Gruppe von Menschen steht vor einem Iglu .
=== Greedy ===
 A group of people standing in front of an igloo . 
==== Topp ====
Temp: 0.25 p:0.25  A group of people standing in front of an blossom studio . 
Temp: 0.50 p:0.25  A group of people stand in front an stiletto mitt . 
Temp: 0.75 p:0.25  A group of people standing in front of an Ohio audacious machines . 
Temp: 0.25 p:0.5  A group of people standing in front of an gao mature necks 
Temp: 0.50 p:0.5  A group of people standing in front of an roper . 
Temp: 0.75 p:0.5  A group of people standing in front of an examination . 
Temp: 0.25 p:0.75  A group of people are standing in front of an unfinished Derby . 
Temp: 0.50 p:0.75  A group of people stand in an snowsuits 
Temp: 0.75 p:0.75  A group of people stand in front of an antique backpacker . 
==== Topk ====
Temp:0.25 k:1  A group of people standing in front of an igloo . 
Temp:0.50 k:1  A group of people standing in front of an igloo . 
Temp:0.75 k:1  A 

For the first example, the quality of the output appears to be good with greedy, top-k, and top-p selection methods if you choose the appropriate values for p, k, and temperature. It yields results that are coherent and meaningful.

For the second example, top-p provides the most accurate sentence structure, while top-k selection yields the best results in terms of word's meaning.
The results of greedy decoding are always the same as when using top-k with k = 1.

The results of the last sentence are not accurate with the three methods. While the some phrases are better structured using top-p, the majority of the generated phrases, with different values of p, k, and temperature, do not convey any meaningful information in English.

### **(2 points)** Compute the BLEU score of the model

Use the [sacreBLEU](https://github.com/mjpost/sacreBLEU) implementation to evaluate your model and quantitatively compare the 4 implemented decoding approaches on the test set. Explain what all the output values mean (when using the `corpus_score` function).

In the [python section](https://github.com/mjpost/sacrebleu#using-sacrebleu-from-python), you'll notice the library accepts more than just one possible translation as reference, but the given dataset only has one translation per sample.

Using the `translate` function provided in the tutorial is pretty slow, as it translate text by text. It's recommended you modify the function to accept a list of texts as input, and batch them for translations (also **bonus point**).

In [21]:
from sacrebleu.metrics import BLEU
from tqdm import tqdm

In [22]:
bleu = BLEU()

test_iter = Multi30k(split='test', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))

def test_on(select_function):
    ref = []
    sys = []

    for (src, tgt) in tqdm(test_iter):
        ref.append(tgt)
        sys.append(select_function(transformer, src))

    return bleu.corpus_score(sys, [ref])

In [23]:
# greedy
print(f"Corpus score with greedy {test_on(lambda model, src: translate(model, src))}")

# topk
print(f"Corpus score with topk {test_on(lambda model, src: translate_topk(model, src, 5, 1.0))}")

# topp
print(f"Corpus score with topp {test_on(lambda model, src: translate_topp(model, src, 0.5, 1.0))}")

1000it [01:17, 12.89it/s]


Corpus score with greedy BLEU = 36.06 67.6/43.8/29.2/19.6 (BP = 1.000 ratio = 1.003 hyp_len = 12990 ref_len = 12955)


1000it [01:22, 12.16it/s]


Corpus score with topk BLEU = 29.98 63.5/38.2/23.3/14.4 (BP = 0.998 ratio = 0.998 hyp_len = 12933 ref_len = 12955)


1000it [01:24, 11.81it/s]

Corpus score with topp BLEU = 27.78 60.5/35.3/21.5/13.2 (BP = 0.996 ratio = 0.996 hyp_len = 12909 ref_len = 12955)





The output of the `corpus_score` are:
- the final BLEU score
- the precision value for 1 to 4 ngram 
- BP is the brevity penalty
- the ratio between hypothesis length and reference length
- hyp_len is the total number of characters for hypothesis text
- ref_lenis is the total number of characters for reference text

**\[Bonus\]** Use part of the test set to perform an hyperparameters search on the value of temperature, k, and p. Note that, normally, this should be done on a validation set, not the test set.

In [32]:
import optuna

# Hyper parameter search for topk with optuna
def objective(trial):
    k = trial.suggest_int("k", 1, 50)
    temperature = trial.suggest_float("temperature", 0.0, 1.0)

    sys = []
    ref = []
    number_sample = 100

    for (src, tgt) in test_iter:
      ref.append(tgt)
      sys.append(translate_topk(transformer, src, k, temperature))
      
      # Use only part of the data set
      number_sample -= 1
      if number_sample == 0:
        break

    return bleu.corpus_score(sys, [ref]).score

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=20)

[I 2023-06-04 14:02:21,876] A new study created in memory with name: no-name-8a441318-47bd-48bc-8d6c-0141c34aa45f
[I 2023-06-04 14:02:29,522] Trial 0 finished with value: 24.53189746977117 and parameters: {'k': 29, 'temperature': 0.32833505429779664}. Best is trial 0 with value: 24.53189746977117.
[I 2023-06-04 14:02:37,757] Trial 1 finished with value: 25.57747252313568 and parameters: {'k': 14, 'temperature': 0.1755423075125907}. Best is trial 1 with value: 25.57747252313568.
[I 2023-06-04 14:02:46,354] Trial 2 finished with value: 24.428475165630555 and parameters: {'k': 28, 'temperature': 0.9270064946483098}. Best is trial 1 with value: 25.57747252313568.
[I 2023-06-04 14:02:53,405] Trial 3 finished with value: 27.490459202011852 and parameters: {'k': 27, 'temperature': 0.5604450698171157}. Best is trial 3 with value: 27.490459202011852.
[I 2023-06-04 14:03:01,772] Trial 4 finished with value: 23.55974277545776 and parameters: {'k': 46, 'temperature': 0.6054232611649147}. Best is t

In [33]:
print(f"Best trials for topk was with {study.best_params}")

Best trials for topk was with {'k': 23, 'temperature': 0.6289661724718179}


In [29]:
# Hyper parameter search for topp with optuna
def objective(trial):
    p = trial.suggest_float("p", 0.0, 1.0)
    temperature = trial.suggest_float("temperature", 0.0, 1.0)

    sys = []
    ref = []
    number_sample = 100

    for (src, tgt) in test_iter:
      ref.append(tgt)
      sys.append(translate_topp(transformer, src, p, temperature))
      
      # Use only part of the data set
      number_sample -= 1
      if number_sample == 0:
        break

    return bleu.corpus_score(sys, [ref]).score

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=20)

[I 2023-06-04 13:59:13,948] A new study created in memory with name: no-name-c8c5e367-cd46-46c7-bc45-d700c1a16ea6
[I 2023-06-04 13:59:21,766] Trial 0 finished with value: 29.752798938573484 and parameters: {'p': 0.4905948522572179, 'temperature': 0.48033053021941596}. Best is trial 0 with value: 29.752798938573484.
[I 2023-06-04 13:59:30,362] Trial 1 finished with value: 23.804038638931065 and parameters: {'p': 0.11366246807284741, 'temperature': 0.8543853174106188}. Best is trial 0 with value: 29.752798938573484.
[I 2023-06-04 13:59:38,810] Trial 2 finished with value: 26.493190739945614 and parameters: {'p': 0.7970110876187265, 'temperature': 0.08066841879928077}. Best is trial 0 with value: 29.752798938573484.
[I 2023-06-04 13:59:46,331] Trial 3 finished with value: 24.022791601082385 and parameters: {'p': 0.6772755363247324, 'temperature': 0.09436665744254114}. Best is trial 0 with value: 29.752798938573484.
[I 2023-06-04 13:59:54,874] Trial 4 finished with value: 26.24861867060426

In [31]:
print(f"Best trials for topp was with {study.best_params}")

Best trials for topp was with {'p': 0.4905948522572179, 'temperature': 0.48033053021941596}


## Going further

If you want to understand in-depth how the transformer model works, I recommend you check [The Annotated Tranformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html) from HarvardNLP. This article helps you write your own transformer from scratch in pyTorch.