Written by [Zhenhao Li](mailto:zhenhao.li18@imperial.ac.uk) and [Nihir](mailto:nv419@ic.ac.uk)

[The AI Core](www.theaicore.com)

# Recap

![s2s_attn](images/seq2seq_attention.png)
![attention_module](images/attention_module.png)

In [28]:
import torch
from torch import nn
import torch.nn.functional as F
import math as m
import numpy as np

In [25]:
hp = {"BATCH_SIZE": 512, 
      "D_MODEL": 512, 
      "P_DROP": 0.1, 
      "D_FF": 2048, 
      "HEADS": 8, 
      "LAYERS": 6, 
      "LR": 1e-3, 
      "EPOCHS": 40}

device = "cpu"

# Transformer

![transformerz](https://cdn.collider.com/wp-content/uploads/2017/06/transformers-5-optimus-prime-bumblebee.jpg)

No... not this kind

The Transformer is a non-recurrent model which has achieved SoTA results in sequence-to-sequence transduction problems. It is based entirely on attention. The recurrent nature of RNNs means that computation cannot be parallelised across time steps, as every hidden state relies on the hidden state before it. Because the Transformer replaces recurrence with attention and fully connected layers, all positions in a sequence can be processed in parallel.

![image.png](https://miro.medium.com/max/500/1*do7YDFF2sads0p9BnjzrWA.png)

The Transformer follows a typical Encoder/Decoder architecture. The encoder takes an input sequence of symbols $(x_1, ..., x_n)$ and maps this to a sequence of continuous representations $\mathbf{h} = (h_1, ..., h_n)$. Given $\mathbf{h}$, the decoder generates an output sequence $(y_1, ..., y_m)$ one symbol at a time.

### Encoder
Each encoder layer in the Transformer consists of two sub-layers. The first of these layers is the _Multi-Head Self-Attention_ module, and the second is a module called a _Position-Wise Feed-Forward Network_. Immediately following each one of these modules is a _Residual Layer Normalization_.

### Decoder
A decoder layer is similar to an encoder layer, but its self-attention is masked (_Masked Multi-Head Attention_) and it has one extra module inserted: a multi-head attention over the encoder output. We will discuss the decoder in more detail later.


At the heart of the Transformer is this concept known as _self-attention_. Let's look at the Transformer holistically and then see exactly what this is, and why and how it solves sequence-to-sequence tasks so effectively.

## The Holistic Transformer

We are attempting to solve a sequence to sequence translation task: German to English using the Transformer:

![whole_transformer](images/transformer.png)

The Transformer is comprised of a stack of encoders and a stack of decoders. The output from the final layer of the encoder stack is sent to the decoder. The input to the first encoder is done via a special embedding module. We will look at the decoder and the embedding module in a future session.

![transformer_high_level](images/transformer_high_level.png)

Within the encoder stack we have a set of connected encoders. The output of one encoder is sent as input to the next encoder. An encoder consists of two sub-layers. The first sub-layer consists of a _Multi-Head Self-Attention_ module, and the second, a _Feed-Forward_ module. After each module, a _Residual Connection_ followed by a _Layer Normalization_ is immediately applied.

<div>
<img src="images/encoder_first_few.png" width="500"/>
</div>

Let's deconstruct what this new terminology means one by one. We'll start with _self-attention_ and _multi-head self-attention_. Then we'll look at _residual connections_ followed by _layer normalization_. Once we've covered the _feed-forward_ module, you'll understand most of the techniques in the Transformer! The encoder is simply a combination of the aforementioned things.

## Self-Attention

At the heart of the Transformer is self-attention. Self-attention is a mechanism that allows each input in a sequence to look at the whole sequence to compute a representation of the sequence.

...what?

Ok. Let's look at the following sentence: `the animal didn't cross the street because it was too tired`. What does the `it` refer to in this sentence? The street or the animal? This is trivial for us as humans to answer but not for a machine. Wouldn't it be nice if we could have some way of the computer understanding what `it` referred to?

This is what self-attention attempts to do. As the model processes each word in the input sequence, self-attention allows us to look at other words in the input sequence for ideas as to what we want to encode in the representation of this word.

![](http://jalammar.github.io/images/t/transformer_self-attention_visualization.png)

It is given by the formula:

$$Attention(Q,K,V) = softmax \left( \frac{QK^T}{\sqrt{d_k}}\right)V$$


To make this clearer, think of how a hidden state in an RNN incorporates the representation of the previous words into the current representation. Self-attention is how the Transformer attempts to use other words (not just the previous words) to encode the meaning of a particular word.


![encoder_multihead](images/encoder_multihead.png)

![vector_attn_score](images/vector_attn_scores.png)

### Self-Attention in detail

Self-attention produces a representation which you can think of as being built from scores. The intent is to have a representation, per input word, which tells us how much focus each word needs to pay to every word in the sequence. A bit complicated, right? Hopefully by the end of this post, this meaning will be clear.

Before talking about matrices, let's talk in terms of vectors. We'll have the sequence length be of dimension $T$, and the encoding/embeddings of the word be $D$ dimensional.

Note that because I'm focusing on the FIRST Encoder here, the encodings of our sequence are the embeddings. But as you move up the encoder stack, the encoding of the sequence is the output from the previous encoder. Again, this is $D$ dimensional.

![word_encodings](images/word_encodings.png)


Ok. Now we're going to create three vectors for EACH input: A query vector ($q$), a key vector ($k$), and a value vector ($v$). These will be $d_k$ dimensional. $d_k$ is typically $D/8$. As far as real values go, usually $D=512$, while $d_k=64$. In our example, what is $d_k$?

Ok... so how do we get $q$, $k$, $v$? We learn them of course!

How... do we learn them? We need weight matrices which will transform our encodings into these vectors: $q = xW^Q$, $k = xW^K$ and $v = xW^V$, where $x$ is the $D$-dimensional encoding of a word.
This means that $W^Q$, $W^K$ and $W^V$ are all $\in \mathbb{R}^{D \times d_k}$.

![qkv_vectors](images/qkv_vectors.png)

Ok, that's nice. We understand that $q$, $k$, $v$ are different projections of the same input now; but what do the query, key, value abstractions actually mean? They're useful terms we can use to think about attention. Let's look through the following so we can see the roles they play.

Recall what we defined self-attention as earlier: a representation, per input word, which tells us how much focus each word needs to pay to every word in the sequence. For each word in our sequence, we will calculate a score by taking the dot product of the current word's query vector and the key vector of every word in the sequence.

![w1_til_softmax](images/w1_til_softmax.png)

Let's take stock of what we've done so far:
- Dividing by $\sqrt{d_k}$ is simply a practical scaling factor which leads to more stable gradients.
- Softmax turns our scores into a probability distribution (each score is now between 0 and 1, and the sum of the scores = 1).
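
To see why the scaling matters, here is a minimal sketch in plain Python (no PyTorch) comparing the softmax of raw and scaled scores. The raw scores are made-up numbers chosen for illustration, and `softmax` is a helper written for this example rather than a library call.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability; the result is unchanged.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
# Hypothetical raw dot products between one query and three keys.
raw_scores = [16.0, 8.0, 0.0]
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

unscaled_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)

print("unscaled:", [round(w, 4) for w in unscaled_weights])
print("scaled:  ", [round(w, 4) for w in scaled_weights])
```

Without scaling, almost all of the weight lands on a single key, which is the saturated regime where softmax gradients become tiny. With scaling, the weights are spread more evenly across the keys.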

We will now use these scores by multiplying them with their value vector. The intention here is that lower scoring words will have less weighting in the self-attention output as these words will now have a sense of "irrelevantness" (e.g. a low score like 0.0001 will "cancel out" its corresponding value vector).

![w1_til_z1](images/w1_til_z1.png)

Finally, the output for the current word is the summation of all the $softmax \times v$ vectors. I.e. a weighted sum:

![z_vector](images/z_vector.png)
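
The vector form above can be sketched for a single word in plain Python. The query, key and value vectors below are hypothetical toy numbers (with $d_k = 2$), and `attend` is a helper name chosen for this example.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Self-attention output z for ONE query vector q."""
    d_k = len(q)
    # 1. score = q . k for every key, scaled by sqrt(d_k)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
    # 2. softmax turns the scores into weights that sum to 1
    weights = softmax(scores)
    # 3. z is the weighted sum of the value vectors
    z = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return z, weights

# Hypothetical toy sequence of three words.
q1 = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

z1, weights = attend(q1, keys, values)
print("weights:", [round(w, 4) for w in weights])
print("z1:", [round(z, 4) for z in z1])
```

The first and third keys score equally against `q1`, so their values receive the same weight, while the second key (orthogonal to the query) receives less.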

Ok. So that's self-attention in vector form. What about in terms of matrices?

- Our input, $X$, is now a matrix of our sequence of words (i.e. $X \in \mathbb{R}^{T\times D}$):
![x_matrix](images/X_matrix_input.png)

- $Q$, $K$, $V$ are now also matrices $\in \mathbb{R}^{T \times d_k}$.
- $W^Q$,$W^K$,$W^V$ stay $\in \mathbb{R}^{D \times d_k}$.
- We now simply obtain $Z \in \mathbb{R}^{T \times d_k}$ by plugging $Q$, $K$, $V$ into our Attention formula.

![Z_matrix](images/Z_matrix.png)


- For the FIRST encoder, Q, K, V are determined by the embeddings of the input words
- For the rest of the encoder stack, Q, K, V are determined by the output of the previous encoder
- For the decoder stack, Q is determined in a similar fashion to the encoders. K and V, however, are passed from the final encoder to each of the decoders in the decoder stack.

In [5]:
# encodings = torch.Tensor([[[0.0, 0.1, 0.2, 0.3], [1.0, 1.1, 1.2, 1.3], [2.0, 2.1, 2.2, 2.3]]]) # (1, 3, 4)
# Q_layer = nn.Linear(4, 3)
# K_layer = nn.Linear(4, 3)
# V_layer = nn.Linear(4, 3)

# Q = Q_layer(encodings)
# K = K_layer(encodings)
# V = V_layer(encodings)

In [4]:
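def scaled_dot_product_attention(Q, K, V, dk=3):
    # shape(Q) = [n_queries x dk], shape(K) = [n_keys x dk], shape(V) = [n_keys x d_v]
    Q_K_matmul = torch.matmul(Q, K.T)
    # `math` was imported as `m` at the top of the notebook
    matmul_scaled = Q_K_matmul/m.sqrt(dk)
    # softmax over the keys, so each query's weights sum to 1
    attention_weights = F.softmax(matmul_scaled, dim=-1)

    output = torch.matmul(attention_weights, V)

    return output, attention_weights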

In [5]:
def print_attention(Q, K, V):
    n_digits = 3
    temp_out, temp_attn = scaled_dot_product_attention(Q, K, V)
    
    print ('Attention weights are:')
    print (np.around(temp_attn, 2))
    print ('Output is:')
    print (np.around(temp_out, 2))


In [8]:
temp_k = torch.Tensor([[10,0,0],
                      [0,10,0],
                      [0,0,10],
                      [0,0,10]])  # (4, 3)

temp_v = torch.Tensor([[   1,0, 1],
                      [  10,0, 2],
                      [ 100,5, 0],
                      [1000,6, 0]])  # (4, 3)

In [9]:
# This `query` aligns with the second `key`,
# so the second `value` is returned.
temp_q = torch.Tensor([[0, 10, 0]])  # (1, 3)
print_attention(temp_q, temp_k, temp_v)

Attention weights are:
tensor([[0., 1., 0., 0.]])
Output is:
tensor([[10.,  0.,  2.]])


In [10]:
# This query aligns with a repeated key (third and fourth), 
# so all associated values get averaged.
temp_q = torch.Tensor([[0, 0, 10]])  # (1, 3)
print_attention(temp_q, temp_k, temp_v)

Attention weights are:
tensor([[0.0000, 0.0000, 0.5000, 0.5000]])
Output is:
tensor([[550.0000,   5.5000,   0.0000]])


In [11]:
# This query aligns equally with the first and second key, 
# so their values get averaged.
temp_q = torch.Tensor([[10, 10, 0]])  # (1, 3)
print_attention(temp_q, temp_k, temp_v)

Attention weights are:
tensor([[0.5000, 0.5000, 0.0000, 0.0000]])
Output is:
tensor([[5.5000, 0.0000, 1.5000]])


In [12]:
temp_q = torch.Tensor([[0, 0, 10], [0, 10, 0], [10, 10, 0]])  # (3, 3)
print_attention(temp_q, temp_k, temp_v)

Attention weights are:
tensor([[0.0000, 0.0000, 0.5000, 0.5000],
        [0.0000, 1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000]])
Output is:
tensor([[550.0000,   5.5000,   0.0000],
        [ 10.0000,   0.0000,   2.0000],
        [  5.5000,   0.0000,   1.5000]])


## Multihead Attention

Ok, we've just tackled self-attention, but the diagram tells us about something called multi-head attention. What is this?

Conceptually, and even in terms of implementation, it's quite simple. Let's recap what we just obtained: $Z$. Each $z$ vector (i.e. for each vector $z_1, z_2, ..., z_T$ in $Z$) is a representation which bakes in all the words in the sequence including itself.

A potential problem is that each $z_t$ _could_ be dominated by the representation for $t$'th word itself. This is what multihead attention solves.

Instead of performing self-attention 1 time, multihead attention performs it multiple times. This means that we have multiple different attention representations. Of course, to obtain these multiple different representations (i.e. $Z$s), we need to learn multiple $Q, K, V$s. Each $Q, K, V, Z$ set is referred to as one head and is of $(D/\text{num\_heads})$ dimensionality. Typically, we use 8 heads (i.e. perform self-attention 8 different times). To "use" these newly obtained $Z$s, we concatenate them (which gives $D$ dimensions again) and pass them through a final linear layer.
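
As a quick bit of bookkeeping, the sketch below works out the per-head dimensionality and the number of learnable parameters in one multi-head attention module. It assumes each projection is a linear layer with a bias, and `multihead_dims` is a hypothetical helper written for this example.

```python
def multihead_dims(d_model, num_heads):
    """Per-head size and total parameter count (assuming linear layers with bias)."""
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    # One projection per head: a (d_model x d_head) weight plus a d_head bias.
    per_projection = d_model * d_head + d_head
    # Q, K and V each need one projection per head.
    qkv_params = 3 * num_heads * per_projection
    # Final linear layer applied to the concatenated heads: d_model -> d_model.
    out_params = d_model * d_model + d_model

    return d_head, qkv_params + out_params

d_head, total_params = multihead_dims(d_model=512, num_heads=8)
print("per-head dimensionality:", d_head)
print("concatenated dimensionality:", d_head * 8)
print("parameters:", total_params)
```

Under these assumptions the total does not depend on the number of heads: 8 heads of width 64 cost the same as a single head of width 512.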

![](http://jalammar.github.io/images/t/transformer_self-attention_visualization_2.png)

In [41]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=hp["D_MODEL"], num_heads=hp["HEADS"], dropout=0.1):
        super().__init__()

        # d_q, d_k, d_v
        self.d = d_model//num_heads

        self.d_model = d_model
        self.num_heads = num_heads

        self.dropout = nn.Dropout(dropout)

        # nn.ModuleList (rather than a plain Python list) registers the parameters
        # of every head with this module, so the optimizer can see them
        self.linear_Qs = nn.ModuleList([nn.Linear(d_model, self.d).to(device) for _ in range(num_heads)])
        self.linear_Ks = nn.ModuleList([nn.Linear(d_model, self.d).to(device) for _ in range(num_heads)])
        self.linear_Vs = nn.ModuleList([nn.Linear(d_model, self.d).to(device) for _ in range(num_heads)])

        self.mha_linear = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V):
        # shape(Q) = [B x seq_len x D/num_heads]
        # shape(K, V) = [B x seq_len x D/num_heads]

        Q_K_matmul = torch.matmul(Q, K.permute(0, 2, 1))
        scores = Q_K_matmul/m.sqrt(self.d)
        # shape(scores) = [B x seq_len x seq_len]

        attention_weights = F.softmax(scores, dim=-1)
        # shape(attention_weights) = [B x seq_len x seq_len]

        output = torch.matmul(attention_weights, V)
        # shape(output) = [B x seq_len x D/num_heads]

        return output, attention_weights

    def forward(self, pre_q, pre_k, pre_v):
        # shape(pre_q, pre_k, pre_v) = [B x seq_len x D]

        Q = [linear_Q(pre_q) for linear_Q in self.linear_Qs]
        K = [linear_K(pre_k) for linear_K in self.linear_Ks]
        V = [linear_V(pre_v) for linear_V in self.linear_Vs]
        # shape(Q, K, V) = [B x seq_len x D/num_heads] * num_heads

        output_per_head = []
        attn_weights_per_head = []
        # shape(output_per_head) = [B x seq_len x D/num_heads] * num_heads
        # shape(attn_weights_per_head) = [B x seq_len x seq_len] * num_heads
        for Q_, K_, V_ in zip(Q, K, V):
            output, attn_weight = self.scaled_dot_product_attention(
                Q_, K_, V_)
            # shape(output) = [B x seq_len x D/num_heads]
            # shape(attn_weight) = [B x seq_len x seq_len]
            output_per_head.append(output)
            attn_weights_per_head.append(attn_weight)

        output = torch.cat(output_per_head, -1)
        # torch.stack gives [num_heads x B x seq_len x seq_len]; move the batch dimension to the front
        attn_weights = torch.stack(attn_weights_per_head).permute(1, 0, 2, 3)
        # shape(output) = [B x seq_len(Q) x D]
        # shape(attn_weights) = [B x num_heads x seq_len(Q) x seq_len(K, V)]

        return self.mha_linear(self.dropout(output)), attn_weights

In [22]:
toy_encodings = torch.Tensor([[[0.0, 0.1, 0.2, 0.3], [1.0, 1.1, 1.2, 1.3], [2.0, 2.1, 2.2, 2.3]]]) 
# shape(toy_encodings) = [B, T, D] = (1, 3, 4)
print("Toy Encodings:\n", toy_encodings)

D_MODEL = toy_encodings.shape[-1]

Toy Encodings:
 tensor([[[0.0000, 0.1000, 0.2000, 0.3000],
         [1.0000, 1.1000, 1.2000, 1.3000],
         [2.0000, 2.1000, 2.2000, 2.3000]]])


In [31]:
toy_MHA_layer = MultiHeadAttention(d_model=D_MODEL, num_heads=2)
toy_MHA, _ = toy_MHA_layer(toy_encodings, toy_encodings, toy_encodings)
print("Toy MHA: \n", toy_MHA)
print("Toy MHA Shape: \n", toy_MHA.shape)

Toy MHA: 
 tensor([[[ 0.4249, -0.4863,  0.2636,  0.7792],
         [ 0.4515, -0.4947,  0.2841,  0.8008],
         [ 0.4782, -0.5032,  0.3046,  0.8226]]], grad_fn=<AddBackward0>)
Toy MHA Shape: 
 torch.Size([1, 3, 4])


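Why are some of the values in later outputs (for example after `Norm`) exactly zero? Hint: think about what dropout does while the module is in training mode.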

## Layer Normalization

In a typical problem where we want to employ neural networks, we normalize (standardize) our data before feeding it into the network. This is usually done by subtracting the global mean and dividing by the global standard deviation of the data:
$$x = \frac{x - \mu}{\sigma}$$

It is also possible to normalize _within_ the activations or layers of the network. There are many approaches for doing so: _Batch Normalization, Layer Normalization, Instance Normalization_ and _Group Normalization_.

We won't be diving into the details of layer normalization here as it's out of scope - the important thing to know is that it is a normalization process applied over the features internal to the model (each token's feature vector is normalized on its own). A helper is provided for us in PyTorch:

In [17]:
# self.layer_norm = nn.LayerNorm(DIMENSIONALITY)
# layer_normed = self.layer_norm(thing_to_layer_norm)
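
For a rough feel of what the helper computes, here is a minimal sketch in plain Python that normalizes one token's feature vector. It leaves out the learnable scale and shift that `nn.LayerNorm` also applies (these start at 1 and 0), and `layer_norm` is a hypothetical helper written for this example.

```python
import math

def layer_norm(features, eps=1e-5):
    """Normalize ONE token's feature vector to zero mean and unit variance."""
    mean = sum(features) / len(features)
    # Biased variance (divide by N); eps avoids division by zero.
    var = sum((f - mean) ** 2 for f in features) / len(features)
    return [(f - mean) / math.sqrt(var + eps) for f in features]

# A hypothetical 4-dimensional encoding of a single token.
token = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(token)
print([round(n, 4) for n in normed])
```

Each token in the sequence is normalized independently in this way, so the statistics do not depend on the other tokens or on the batch.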

To read further about layer normalization, check out the following resources (order _mostly_ from most accessible to least). The first few resources touch on batch normalization in order to give a conceptual understanding of what layer normalization is attempting to do:
- https://www.youtube.com/watch?v=DtEq44FTPM4
- https://www.youtube.com/watch?v=tNIpEZLv_eg
- https://mlexplained.com/2018/11/30/an-overview-of-normalization-methods-in-deep-learning/ (Recommended read)
- https://arxiv.org/abs/1706.03762

In [33]:
class Norm(nn.Module):
    def __init__(self, d_model=hp["D_MODEL"], dropout=hp["P_DROP"]):
        super().__init__()
        self.layer_norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        ln = self.layer_norm(x)
        return self.dropout(ln)

## Residual Connection

In theory, if we continue stacking layers in a neural network on top of each other, we expect the training error to decrease. However, the reality doesn't follow suit.

![resid_theory](images/resnet_theory.PNG)

Residual connections address this problem by introducing a _skip connection_ (or shortcut) which skips over one or more layers. All we need to do to implement a residual connection is add the block's input to its output:

![resid_network](images/plain_residual.PNG)
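
A minimal sketch of the idea in plain Python, with sub-layers stood in for by simple functions. The names `residual`, `zero_sublayer` and `halve` are hypothetical and exist only for this illustration.

```python
def residual(sublayer, x):
    """Return x + sublayer(x), element by element."""
    return [xi + yi for xi, yi in zip(x, sublayer(x))]

def zero_sublayer(x):
    # A sub-layer that has learned nothing useful: it outputs zeros.
    return [0.0 for _ in x]

def halve(x):
    # A stand-in for an arbitrary learned transformation.
    return [0.5 * xi for xi in x]

x = [1.0, -2.0, 3.0]
identity_out = residual(zero_sublayer, x)
halved_out = residual(halve, x)
print(identity_out)
print(halved_out)
```

When the sub-layer outputs zeros, the block simply passes its input through unchanged, which is one intuition for why adding more residual blocks should not make the network worse.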

To read further about residual connections and why they work, please refer to the following resources:
- https://www.coursera.org/lecture/convolutional-neural-networks/resnets-HAhz9
- https://www.coursera.org/lecture/convolutional-neural-networks/why-resnets-work-XAKNO
- https://arxiv.org/abs/1512.03385

Implementing residual connections is deceptively straightforward.

In [35]:
toy__prev_x = torch.randn(1, 3, 4)
# our residual connection is thus:
toy_residual = toy__prev_x + toy_MHA

print("Toy Residual: \n", toy_residual)
print("Toy Residual shape: \n", toy_residual.shape)

Toy Residual: 
 tensor([[[-0.3198, -0.7426,  1.5980, -0.3745],
         [-0.1725, -1.5926, -1.0669, -0.4305],
         [ 1.2165, -1.5966, -1.2953,  0.2745]]], grad_fn=<AddBackward0>)
Toy Residual shape: 
 torch.Size([1, 3, 4])


In [36]:
# and if we want to normalize it...
toy_Norm_layer = Norm(d_model=D_MODEL)
toy_norm = toy_Norm_layer(toy_residual)

print("Toy Norm: \n", toy_norm)
print("Toy Norm shape: \n", toy_norm.shape)

Toy Norm: 
 tensor([[[-0.4377, -0.9518,  1.8938, -0.5043],
         [ 1.2893, -0.0000, -0.0000,  0.7720],
         [ 1.5135, -1.2041, -0.9129,  0.6035]]], grad_fn=<MulBackward0>)
Toy Norm shape: 
 torch.Size([1, 3, 4])


## Taking stock

This brings us close to the end of our first sub-layer in the encoder (and about a 70% understanding of the new concepts in the Transformer).

![encoder_first_few](images/encoder_first_few.png)

The next sub-layer is relatively straightforward.

## Position-wise Feed-Forward Network

The authors pass their output from the first sublayer into a feed-forward network. They call this a position-wise feed-forward network because it "is applied to each position separately and identically". This means that they run a feed-forward network over a rank 3 tensor (including batch size), over the sequence dimension. 

All this means is that the SAME weights are applied to all tokens in the sequence.

The authors use a two layered network given as:

$$FFN(x) = max(0,xW_1+b_1)W_2+b_2$$

With the first layer having an output dimension of $d_{ff}=2048$ and the second an output dimension of $D$. It's unclear why the inputs are projected to a larger dimension before being projected back down to $D$ dimensions, but [one source](https://graphdeeplearning.github.io/post/transformers-are-gnns/) suggests that it is a convergence trick which enables re-scaling of the feature vectors independently of each other.
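
To make "position-wise" concrete, the sketch below applies the FFN formula to each token of a tiny sequence in plain Python. The weights are made-up numbers (with $D=2$ and $d_{ff}=3$), and `linear` and `ffn` are helpers written for this example.

```python
def linear(x, weights, bias):
    # x: [d_in], weights: [d_in][d_out], bias: [d_out]
    return [sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
            for j in range(len(bias))]

def ffn(x, w1, b1, w2, b2):
    # FFN(x) = max(0, x W_1 + b_1) W_2 + b_2
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)

# Hypothetical weights: D = 2, d_ff = 3.
w1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.0, 0.0]

# Three tokens; the first and third have identical encodings.
sequence = [[1.0, 1.0], [2.0, 0.0], [1.0, 1.0]]

# The SAME weights are applied to every position independently.
outputs = [ffn(token, w1, b1, w2, b2) for token in sequence]
print(outputs)
```

Because every position shares the same weights and no position looks at any other, identical input tokens always produce identical outputs, wherever they sit in the sequence.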

In [37]:
# # D_FF = 2048
# D_FF = D_MODEL * 4
# P_DROP = 0.3

class PWFFN(nn.Module):
    def __init__(self, d_model=hp["D_MODEL"], d_ff=hp["D_FF"], dropout=hp["P_DROP"]):
        super().__init__()

        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        # shape(x) = [B x seq_len x D]

        ff = self.ff(x)
        # shape(ff) = [B x seq_len x D]

        # return the result computed above instead of running the network a second time
        return ff

In [40]:
toy_PWFFN_layer = PWFFN(d_model=D_MODEL, d_ff=D_MODEL*4)
toy_PWFFN = toy_PWFFN_layer(toy_norm)

print("Toy PWFFN: \n", toy_PWFFN)
print("Toy PWFFN Shape: \n", toy_PWFFN.shape)

Toy PWFFN: 
 tensor([[[-0.2890, -0.2307,  0.3671, -0.0121],
         [ 0.1957, -0.4772,  0.3291,  0.3521],
         [ 0.2405, -0.5672,  0.2820,  0.3192]]], grad_fn=<AddBackward0>)
Toy PWFFN Shape: 
 torch.Size([1, 3, 4])


## Encoder Layer

This is everything we require for one arbitrary Encoder layer! Let's code up a class which contains the aforementioned modules.

In [42]:
class EncoderLayer(nn.Module):
    def __init__(self, d_model=hp["D_MODEL"], num_heads=hp["HEADS"], d_ff=hp["D_FF"], dropout=hp["P_DROP"]):
        super().__init__()
        # pass `dropout` through so the argument is not silently ignored
        self.norm_1 = Norm(d_model, dropout)
        self.norm_2 = Norm(d_model, dropout)
        self.mha = MultiHeadAttention(d_model, num_heads, dropout)
        self.ff = PWFFN(d_model, d_ff, dropout)

    def forward(self, x, mask):
        # shape(x) = [B x seq_len x D]

        # note: passing `mask` requires a MultiHeadAttention whose forward accepts a mask
        mha, encoder_attention_weights = self.mha(x, x, x, mask)
        norm1 = self.norm_1(x + mha)
        # shape(mha) = [B x seq_len x D]
        # shape(encoder_attention_weights) = [B x num_heads x seq_len x seq_len]
        # shape(norm1) = [B x seq_len x D]

        ff = self.ff(norm1)
        norm2 = self.norm_2(norm1 + ff)
        # shape(ff) = [B x seq_len x D]
        # shape(norm2) = [B x seq_len x D]

        return norm2, encoder_attention_weights

## Positional Encoding

Let's talk about embedding the inputs.

![hihgh_level](images/transformer_high_level.PNG)

One thing we've not yet looked at is _Positional Encoding_. Recall that our model contains no recurrence and thus we need a way to make sense of the order of the sequence. To do so, we need to inject some information about the position of the tokens in the sequence. Positional Encoding is one strategy to do so.

So, we want our model to add some information to each of our words (embeddings) which indicates its position in the sequence. What are some strategies for doing so?

Well, we could just add the token position to the embedding (e.g. 1 for the first word, 2 for the second word, 3 for the third word). What are some issues with this approach?

What about linearly assigning values between 0 and 1 to the token embedding? (e.g. for an eight-length sequence, the first word has 0.125 added to it, the second 0.25, the third 0.375 etc).

The trick the authors propose is to add a $D$-dimensional vector to the embedding (which is also $D$-dimensional) instead of a single number. The vector which is added to the word embedding is __fixed__ - that is, it does NOT depend on the features of the word itself - only the position that it appears in.

Let's dissect the formula it's given by:

$$PE_{(pos,2i)}=sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)$$
$$PE_{(pos,2i+1)}=cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)$$

![PE_pseudo](pe.png)

![postional_encoding_graph](https://d33wubrfki0l68.cloudfront.net/ef81ee3018af6ab6f23769031f8961afcdd67c68/3358f/img/transformer_architecture_positional_encoding/positional_encoding.png)
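
The formula can be evaluated directly in plain Python to see what the vectors look like. This is a minimal sketch for a tiny `d_model`; `positional_encoding` is a hypothetical helper written for this example.

```python
import math

def positional_encoding(max_seq_len, d_model):
    """Return a [max_seq_len][d_model] table of sinusoidal positional encodings."""
    pe = [[0.0] * d_model for _ in range(max_seq_len)]
    for pos in range(max_seq_len):
        for two_i in range(0, d_model, 2):
            angle = pos / (10000 ** (two_i / d_model))
            pe[pos][two_i] = math.sin(angle)          # even dimensions use sin
            if two_i + 1 < d_model:
                pe[pos][two_i + 1] = math.cos(angle)  # odd dimensions use cos
    return pe

pe = positional_encoding(max_seq_len=4, d_model=4)
for pos, row in enumerate(pe):
    print(pos, [round(value, 4) for value in row])
```

Position 0 always encodes to alternating 0s and 1s (since $\sin(0)=0$ and $\cos(0)=1$). The early dimensions change quickly with position while the later ones change slowly, and every value stays within $[-1, 1]$.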

In [44]:
class Embeddings(nn.Module):
    def __init__(self, vocab_size, pad_idx, d_model=hp["D_MODEL"]):
        super().__init__()
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=pad_idx)

    def forward(self, x):
        # shape(x) = [B x seq_len]

        embedding = self.embed(x)
        # shape(embedding) = [B x seq_len x D]

        return embedding * m.sqrt(self.d_model)

In [45]:
toy_vocab = torch.LongTensor([[1, 2, 3, 4, 0, 0]])

toy_embedding_layer = Embeddings(5, pad_idx=0, d_model=D_MODEL)
toy_embeddings = toy_embedding_layer(toy_vocab)

print("Toy Embeddings: \n", toy_embeddings)
print("Toy Embeddings Shape: \n", toy_embeddings.shape)

Toy Embeddings: 
 tensor([[[-1.7563, -2.9034,  2.0407,  3.6623],
         [-2.0049,  2.4615, -6.2541, -1.5537],
         [ 0.4437,  0.0647, -0.0440,  1.3916],
         [ 4.1713,  0.6166, -0.0546, -2.8450],
         [ 0.0000,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  0.0000]]], grad_fn=<MulBackward0>)
Toy Embeddings Shape: 
 torch.Size([1, 6, 4])


In [46]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model=hp["D_MODEL"], dropout=hp["P_DROP"], max_seq_len=200):
        super().__init__()
        self.d_model = d_model
        self.dropout = nn.Dropout(dropout)

        pe = torch.zeros(max_seq_len, d_model).to(device)
        pos = torch.arange(0, max_seq_len).unsqueeze(1).float()

        two_i = torch.arange(0, d_model, step=2).float()
        div_term = torch.pow(10000, (two_i/d_model)).float()
        pe[:, 0::2] = torch.sin(pos/div_term)
        pe[:, 1::2] = torch.cos(pos/div_term)

        pe = pe.unsqueeze(0)

        # register_buffer stores `pe` on the module as self.pe:
        # it is saved with the model and moved by .to(device), but it is not a trainable parameter
        self.register_buffer("pe", pe)

    def forward(self, x):
        # shape(x) = [B x seq_len x D]
        pe = self.pe[:, :x.shape[1]].detach()
        x = x + pe
        # shape(x) = [B x seq_len x D]
        return self.dropout(x)

In [47]:
toy_PE_layer = PositionalEncoding(d_model=D_MODEL)
toy_PEs = toy_PE_layer(toy_embeddings)

print("Toy PE: \n", toy_PEs)
print("Toy PE Shape: \n", toy_PEs.shape)

Toy PE: 
 tensor([[[-1.9515, -2.1149,  2.2674,  5.1803],
         [-1.2927,  0.0000, -6.9379, -0.6153],
         [ 1.5033, -0.3905, -0.0266,  2.6571],
         [ 4.7916, -0.4148, -0.0273, -2.0505],
         [-0.8409, -0.7263,  0.0444,  0.0000],
         [-1.0655,  0.3152,  0.0555,  1.1097]]], grad_fn=<MulBackward0>)
Toy PE Shape: 
 torch.Size([1, 6, 4])


## The Encoder (and the Decoder)
![](https://i.giphy.com/media/11GWLm7bE2fibC/giphy.webp)

Cool! This brings us to the end of the Encoder part of the Transformer.

We have enough to code up our Encoder class. In this implementation, the Encoder is responsible for the first layer's embedding and positional encoding, and then simply runs a for loop over the number of encoder layers we have.

In [49]:
class Encoder(nn.Module):
    def __init__(self, Embedding: Embeddings, d_model=hp["D_MODEL"], 
                 num_heads=hp["HEADS"], num_layers=hp["LAYERS"], 
                 d_ff=hp["D_FF"], dropout=hp["P_DROP"]):
        super().__init__()

        self.Embedding = Embedding

        # pass `dropout` through so the same rate is used everywhere
        self.PE = PositionalEncoding(
            d_model, dropout)

        self.encoders = nn.ModuleList([EncoderLayer(
            d_model,
            num_heads,
            d_ff,
            dropout
        ) for layer in range(num_layers)])

    def forward(self, x, mask):
        # shape(x) = [B x SRC_seq_len]

        embeddings = self.Embedding(x)
        encoding = self.PE(embeddings)
        # shape(embeddings) = [B x SRC_seq_len x D]
        # shape(encoding) = [B x SRC_seq_len x D]

        for encoder in self.encoders:
            encoding, encoder_attention_weights = encoder(encoding, mask)
            # shape(encoding) = [B x SRC_seq_len x D]
            # shape(encoder_attention_weights) = [B x num_heads x SRC_seq_len x SRC_seq_len]

        # only the attention weights of the final encoder layer are returned
        return encoding, encoder_attention_weights

In [None]:
toy_encoder = ?

## The Decoder

Looking at the Transformer architecture again, we see the Decoder is quite similar to the Encoder. It has two subtle differences though. The __masked__ multi-head attention, and the multi-head attention module which receives inputs from the Encoder. Let's work through these respectively.

![](images/transformer_high_level_kv.PNG)
![](images/decoder_first_few.PNG)

## Masked Multihead Attention

Think back to our Seq2Seq model from last week. During decoding, we have access to the whole source sentence, and the decoded tokens we've decoded SO FAR. Obviously we can't have access to future tokens because we don't know what they are yet...

During training time however, we DO have access to future tokens because we have labelled pairs. To speed up training, the Transformer architecture enables us to feed in our whole target sequence to the model and use masking to tell the attention mechanism not to look at illegal (i.e. future) positions when considering the i'th token.

Recall what Multihead Attention (MHA) did. It worked out a representation for each token in the sequence given all the tokens. In other words, it was bidirectional. Masking is the strategy we use to tell the network not to look at the positions past the most recent word that has been decoded.

We will look at the Decoder Layer in its entirety in a bit, but for now let's focus on the masking process for MHA. We will implement this in our `MultiHeadAttention` module. Our mask is going to be the same shape as `Q_K_matmul` because this is what we are calculating attention over.

The mask is a tensor which marks the illegal locations. The scores at those locations are set to a very large negative value (effectively `-inf`), so they receive a weight of roughly 0 after the softmax. For one sample in our input, the shape of `Q_K_matmul` is `[T, T]`. Here, `T` refers to the sequence length of the target output. We will talk about test time later, so let's consider the training case right now. During training, we have access to the whole target sequence. So at every timestep (i.e. for every word) in the sequence, an illegal location would be all the future timesteps that we're not meant to have access to. Our matrix is `[T, T]`. So at the 1st timestep, we don't have access to any words from the 2nd timestep onwards. At the 2nd timestep, we don't have access to any words from the 3rd timestep onwards.

In [3]:
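def create_mask(size):
    # since this mask is the same for every sample in a batch being fed into the model,
    # we create the mask Tensor with a batch size of 1.
    # Broadcasting will allow us to replicate this mask across all the other elements in the batch
    # triu(1) keeps 1s strictly above the diagonal, i.e. at the future positions
    mask = torch.ones((1, size, size)).triu(1)
    # True = legal (current and past positions), False = illegal (future positions)
    mask = mask == 0
    return mask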

In [4]:
toy_mask = create_mask(10)
print("TOY MASK: \n", toy_mask)

TOY MASK: 
 tensor([[[ True, False, False, False, False, False, False, False, False, False],
         [ True,  True, False, False, False, False, False, False, False, False],
         [ True,  True,  True, False, False, False, False, False, False, False],
         [ True,  True,  True,  True, False, False, False, False, False, False],
         [ True,  True,  True,  True,  True, False, False, False, False, False],
         [ True,  True,  True,  True,  True,  True, False, False, False, False],
         [ True,  True,  True,  True,  True,  True,  True, False, False, False],
         [ True,  True,  True,  True,  True,  True,  True,  True, False, False],
         [ True,  True,  True,  True,  True,  True,  True,  True,  True, False],
         [ True,  True,  True,  True,  True,  True,  True,  True,  True,  True]]])


In [7]:
toy_scores = torch.arange(100).reshape(1, 10, 10)
print("TOY SCORES: \n", toy_scores)

TOY SCORES:  tensor([[[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
         [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
         [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
         [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
         [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
         [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
         [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
         [70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
         [80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
         [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]]])


In [31]:
# from our Multihead Attention
def scaled_dot_product_attention(Q, K, V, mask=None):
    # shape(Q) = [B x seq_len x D/num_heads]
    # shape(K, V) = [B x seq_len x D/num_heads]

    Q_K_matmul = torch.matmul(Q, K.permute(0, 2, 1))
    # there is no `self` in this standalone function, so read the head dimension from Q
    scores = Q_K_matmul/m.sqrt(Q.shape[-1])
    # shape(scores) = [B x seq_len x seq_len]
    
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    attention_weights = F.softmax(scores, dim=-1)
    # shape(attention_weights) = [B x seq_len x seq_len]

    output = torch.matmul(attention_weights, V)
    # shape(output) = [B x seq_len x D/num_heads]

    return output, attention_weights

In [11]:
toy_scores = toy_scores.masked_fill(toy_mask == 0, -1)
toy_scores

tensor([[[ 0, -1, -1, -1, -1, -1, -1, -1, -1, -1],
         [10, 11, -1, -1, -1, -1, -1, -1, -1, -1],
         [20, 21, 22, -1, -1, -1, -1, -1, -1, -1],
         [30, 31, 32, 33, -1, -1, -1, -1, -1, -1],
         [40, 41, 42, 43, 44, -1, -1, -1, -1, -1],
         [50, 51, 52, 53, 54, 55, -1, -1, -1, -1],
         [60, 61, 62, 63, 64, 65, 66, -1, -1, -1],
         [70, 71, 72, 73, 74, 75, 76, 77, -1, -1],
         [80, 81, 82, 83, 84, 85, 86, 87, 88, -1],
         [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]]])
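
To see why masking works, it can help to push masked scores through a softmax. The sketch below is illustrative only: the uniform toy scores, the size `T = 4` and the `float("-inf")` fill (rather than the `-1` used for display above) are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

T = 4  # assumed toy sequence length

# uniform toy scores so the effect of the mask is easy to read
scores = torch.zeros(1, T, T)

# same construction as create_mask: True on and below the diagonal
mask = torch.ones((1, T, T)).triu(1) == 0

# illegal (future) locations get -inf, so softmax assigns them zero weight
masked_scores = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(masked_scores, dim=-1)

print(weights[0])
```

Each row still sums to 1, but all of the weight is spread over the current and earlier timesteps only.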

## Multihead Attention (again 🙄)

Following the masked MHA, as was done in the encoder, we apply a skip-connection and a layer norm. The next stage in the decoding process runs attention between the source representation and the target representation. This is done by taking the __key__ and __value__ tensors from the Encoder. The __query__ tensor comes from the previous step in the decoding process. In the image below, the orange arrow connecting the encoder and decoder represents the key and value from the final encoder layer.
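
As a rough sketch of the shapes involved in this encoder-decoder attention, the snippet below runs a single attention head with the query taken from a stand-in decoder state and the key and value taken from a stand-in encoder output. The sizes and the random tensors are assumptions, and the learned linear projections are left out to keep the example short.

```python
import math
import torch
import torch.nn.functional as F

B, T_trg, T_src, d = 2, 5, 7, 16  # assumed toy sizes

decoder_state = torch.randn(B, T_trg, d)   # queries come from the decoder
encoder_output = torch.randn(B, T_src, d)  # keys and values come from the encoder

# every target position scores every source position
scores = torch.matmul(decoder_state, encoder_output.transpose(1, 2)) / math.sqrt(d)
# shape(scores) = [B x TRG_seq_len x SRC_seq_len]

weights = F.softmax(scores, dim=-1)
context = torch.matmul(weights, encoder_output)
# shape(context) = [B x TRG_seq_len x d]

print(scores.shape, context.shape)
```

The output has one row per target position, which is why the decoder keeps its own sequence length after attending over the source.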

In [51]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=hp["D_MODEL"], num_heads=hp["HEADS"], dropout=hp["P_DROP"]):
        super().__init__()

        # d_q, d_k, d_v
        self.d = d_model//num_heads

        self.d_model = d_model
        self.num_heads = num_heads

        self.dropout = nn.Dropout(dropout)

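        # nn.ModuleList registers the per-head layers as sub-modules, so their
        # parameters are seen by model.parameters(), .to(device) and the optimizer
        self.linear_Qs = nn.ModuleList([nn.Linear(d_model, self.d)
                                        for _ in range(num_heads)])
        self.linear_Ks = nn.ModuleList([nn.Linear(d_model, self.d)
                                        for _ in range(num_heads)])
        self.linear_Vs = nn.ModuleList([nn.Linear(d_model, self.d)
                                        for _ in range(num_heads)])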

        self.mha_linear = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # shape(Q) = [B x seq_len (Q) x D/num_heads]
        # shape(K, V) = [B x seq_len (K, V) x D/num_heads]

        Q_K_matmul = torch.matmul(Q, K.permute(0, 2, 1))
        scores = Q_K_matmul/m.sqrt(self.d)
        # shape(scores) = [B x seq_len (Q) x seq_len (K, V)]

        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        attention_weights = F.softmax(scores, dim=-1)
        # shape(attention_weights) = [B x seq_len (Q) x seq_len (K, V)]

        output = torch.matmul(attention_weights, V)
        # shape(output) = [B x seq_len (Q) x D/num_heads]

        return output, attention_weights

    def forward(self, pre_q, pre_k, pre_v, mask=None):
        # shape(pre_q (ENCODING)) = [B x SRC_seq_len x D]
        # shape(pre_q (DECODING)) = [B x TRG_seq_len x D]
        #
        # shape(pre_k, pre_v (MASKED ATTENTION)) = [B x TRG_seq_len x D]
        # shape(pre_k, pre_v (OTHERWISE)) = [B x SRC_seq_len x D]

        Q = [linear_Q(pre_q) for linear_Q in self.linear_Qs]
        K = [linear_K(pre_k) for linear_K in self.linear_Ks]
        V = [linear_V(pre_v) for linear_V in self.linear_Vs]
        # shape(Q) = [B x seq_len (Q) x D/num_heads] * num_heads
        # shape(K) = [B x seq_len (K, V) x D/num_heads] * num_heads
        # shape(V) = [B x seq_len (K, V) x D/num_heads] * num_heads

        output_per_head = []
        attn_weights_per_head = []
        # shape(output_per_head) = [B x seq_len (Q) x D/num_heads] * num_heads
        # shape(attn_weights_per_head) = [B x seq_len (Q) x seq_len (K, V)] * num_heads
        for Q_, K_, V_ in zip(Q, K, V):
            output, attn_weight = self.scaled_dot_product_attention(
                Q_, K_, V_, mask)
            output_per_head.append(output)
            attn_weights_per_head.append(attn_weight)

        output = torch.cat(output_per_head, -1)
        # stack gives [num_heads x B x seq_len (Q) x seq_len (K, V)]; swap the first two dims
        attn_weights = torch.stack(attn_weights_per_head).permute(1, 0, 2, 3)
        # shape(output) = [B x seq_len (Q) x D]
        # shape(attn_weights) = [B x num_heads x seq_len (Q) x seq_len (K, V)]

        return self.mha_linear(self.dropout(output)), attn_weights


## Decoder Layer

Following another residual layer normalization, we have a PWFFN as we did in the encoder. One arbitrary decoder layer is thus the two aforementioned attention modules and the PWFFN, with a residual layer norm after each step.

In [52]:
class DecoderLayer(nn.Module):
    def __init__(self, d_model=hp["D_MODEL"], num_heads=hp["HEADS"], d_ff=hp["D_FF"], dropout=hp["P_DROP"]):
        super().__init__()
        self.norm_1 = Norm(d_model)
        self.norm_2 = Norm(d_model)
        self.norm_3 = Norm(d_model)

        # pass the layer's dropout rate through instead of silently using the default
        self.mha_1 = MultiHeadAttention(d_model, num_heads, dropout)
        self.mha_2 = MultiHeadAttention(d_model, num_heads, dropout)
        self.ff = PWFFN(d_model, d_ff)

    def forward(self, x, encoder_outputs, trg_mask, src_mask):
        # shape(x) = [B x TRG_seq_len x D]
        # shape(encoder_outputs) = [B x SRC_seq_len x D]

        masked_mha, masked_mha_attn_weights = self.mha_1(
            x, x, x, mask=trg_mask)
        # shape(masked_mha) = [B x TRG_seq_len x D]
        # shape(masked_mha_attn_weights) = [B x num_heads x TRG_seq_len x TRG_seq_len]

        norm1 = self.norm_1(x + masked_mha)
        # shape(norm1) = [B x TRG_seq_len x D]

        enc_dec_mha, enc_dec_mha_attn_weights = self.mha_2(
            norm1, encoder_outputs, encoder_outputs, mask=src_mask)
        # shape(enc_dec_mha) = [B x TRG_seq_len x D]
        # shape(enc_dec_mha_attn_weights) = [B x num_heads x TRG_seq_len x SRC_seq_len]

        norm2 = self.norm_2(norm1 + enc_dec_mha)
        # shape(norm2) = [B x TRG_seq_len x D]

        ff = self.ff(norm2)
        norm3 = self.norm_3(norm2 + ff)
        # shape(ff) = [B x TRG_seq_len x D]
        # shape(norm3) = [B x TRG_seq_len x D]

        return norm3, masked_mha_attn_weights, enc_dec_mha_attn_weights


## Decoder

Our Decoder class acts similarly to our Encoder class. It is responsible for embedding and positionally encoding the input for the first layer, and is then simply a for loop over the decoder layers.

In [53]:
class Decoder(nn.Module):
    def __init__(self, Embedding: Embeddings, d_model=hp["D_MODEL"], 
                 num_heads=hp["HEADS"], num_layers=hp["LAYERS"], 
                 d_ff=hp["D_FF"], dropout=hp["P_DROP"]):
        super().__init__()

        self.Embedding = Embedding

        self.PE = PositionalEncoding(
            d_model)

        self.decoders = nn.ModuleList([DecoderLayer(
            d_model,
            num_heads,
            d_ff,
            dropout
        ) for layer in range(num_layers)])

    def forward(self, x, encoder_output, trg_mask, src_mask):
        # shape(x) = [B x TRG_seq_len]

        embeddings = self.Embedding(x)
        encoding = self.PE(embeddings)
        # shape(embeddings) = [B x TRG_seq_len x D]
        # shape(encoding) = [B x TRG_seq_len x D]
        
        for decoder in self.decoders:
            encoding, masked_mha_attn_weights, enc_dec_mha_attn_weights = decoder(
                encoding, encoder_output, trg_mask, src_mask)
            # shape(encoding) = [B x TRG_seq_len x D]
            # shape(masked_mha_attn_weights) = [B x num_heads x TRG_seq_len x TRG_seq_len]
            # shape(enc_dec_mha_attn_weights) = [B x num_heads x TRG_seq_len x SRC_seq_len]

        return encoding, masked_mha_attn_weights, enc_dec_mha_attn_weights

## Transformer

Finally! The Transformer class! Our Transformer class is simple and will receive data from the train loop. At every pass it will do the following:
- Create a source and target mask
- Run the Encoder
- Run the Decoder
- Output logits for token prediction
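
The target mask combines two things: a padding mask and the look-ahead mask from earlier. The sketch below shows how broadcasting merges them. The padding index `PAD_IDX = 1` and the toy token ids are assumptions for illustration; in the class below the index is looked up from the vocabulary.

```python
import torch

PAD_IDX = 1  # assumed padding index for this toy example

# two toy target sequences; the first one ends in two padding tokens
trg = torch.tensor([[2, 5, 6, 3, 1, 1],
                    [2, 7, 8, 9, 4, 3]])

pad_mask = (trg != PAD_IDX).unsqueeze(-2)
# shape(pad_mask) = [B x 1 x TRG_seq_len]

T = trg.shape[1]
look_ahead_mask = torch.ones((1, T, T)).triu(1) == 0
# shape(look_ahead_mask) = [1 x TRG_seq_len x TRG_seq_len]

# broadcasting expands both masks to [B x TRG_seq_len x TRG_seq_len]
trg_mask = pad_mask & look_ahead_mask

print(trg_mask.shape)
print(trg_mask[0])
```

A location is legal only if it is both a real token and not in the future, so the padded columns of the first sequence are `False` in every row.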
    

In [54]:
class Transformer(nn.Module):
    def __init__(self, src_vocab_len, trg_vocab_len, d_model=hp["D_MODEL"], d_ff=hp["D_FF"], 
                 num_layers=hp["LAYERS"], num_heads=hp["HEADS"], dropout=hp["P_DROP"]):
        super().__init__()

        self.num_heads = num_heads

        encoder_Embedding = Embeddings(
            src_vocab_len, SRC.vocab.stoi["<pad>"], d_model)
        decoder_Embedding = Embeddings(
            trg_vocab_len, TRG.vocab.stoi["<pad>"], d_model)

        self.encoder = Encoder(encoder_Embedding, d_model,
                               num_heads, num_layers, d_ff, dropout)
        self.decoder = Decoder(decoder_Embedding, d_model,
                               num_heads, num_layers, d_ff, dropout)

        self.linear_layer = nn.Linear(d_model, trg_vocab_len)

        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

    def create_src_mask(self, src):
        src_mask = (src != SRC.vocab.stoi["<pad>"]).unsqueeze(-2)
        return src_mask

    def create_trg_mask(self, trg):
        trg_mask = (trg != TRG.vocab.stoi["<pad>"]).unsqueeze(-2)
        mask = torch.ones((1, trg.shape[1], trg.shape[1])).triu(1).to(device)
        mask = mask == 0
        trg_mask = trg_mask & mask
        return trg_mask

    def forward(self, src, trg):
        # shape(src) = [B x SRC_seq_len]
        # shape(trg) = [B x TRG_seq_len]

        src_mask = self.create_src_mask(src)
        trg_mask = self.create_trg_mask(trg)
        # shape(src_mask) = [B x 1 x SRC_seq_len]
        # shape(trg_mask) = [B x TRG_seq_len x TRG_seq_len]

        encoder_outputs, encoder_mha_attn_weights = self.encoder(src, src_mask)
        # shape(encoder_outputs) = [B x SRC_seq_len x D]
        # shape(encoder_mha_attn_weights) = [B x num_heads x SRC_seq_len x SRC_seq_len]
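        decoder_outputs, masked_mha_attn_weights, enc_dec_mha_attn_weights = self.decoder(
            trg, encoder_outputs, trg_mask, src_mask)
        # shape(decoder_outputs) = [B x TRG_seq_len x D]
        # shape(masked_mha_attn_weights) = [B x num_heads x TRG_seq_len x TRG_seq_len]
        # shape(enc_dec_mha_attn_weights) = [B x num_heads x TRG_seq_len x SRC_seq_len]
        logits = self.linear_layer(decoder_outputs)
        # shape(logits) = [B x TRG_seq_len x TRG_vocab_size]

        # the training and search code unpack four values from the model
        return logits, encoder_mha_attn_weights, masked_mha_attn_weights, enc_dec_mha_attn_weights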

## Training

The authors of the paper used Adam with a learning rate that warms up and then decays according to the following schedule:

$$\text{lr} = \sqrt{\frac{1}{\text{d}_\text{model}}} \times \min\left(\sqrt{\frac{1}{i}}, i \times \text{warmup}^{-1.5}\right)$$

Here $i$ is the global step we're currently on, $\text{warmup}$ is a hyperparameter which the authors set to 4000, and $\text{d}_\text{model}$ is the dimensionality of our model. The rate grows linearly for the first $\text{warmup}$ steps and then decays in proportion to $1/\sqrt{i}$.
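
To get a feel for the numbers, the schedule can be evaluated at a few steps. This is a minimal sketch: the helper name `noam_lr` is made up for this example, and it assumes the step is 1-indexed so that the formula is defined at the first step.

```python
def noam_lr(step, d_model=512, warmup=4000):
    """Learning rate from the formula above, for a 1-indexed global step."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


# the rate rises until step == warmup and falls afterwards
for step in (1, 1000, 4000, 16000):
    print(step, noam_lr(step))
```

The two arguments of the `min` are equal exactly at `step == warmup`, which is where the learning rate peaks.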

In [55]:
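def custom_lr_optimizer(optimizer, step, d_model=hp["D_MODEL"], warmup_steps=4000):
    # `step` arrives 0-indexed from the training loop; the schedule is defined for i >= 1
    i = step + 1
    min_arg1 = m.sqrt(1/i)
    min_arg2 = i * (warmup_steps**-1.5)
    lr = m.sqrt(1/d_model) * min(min_arg1, min_arg2)

    # apply the new rate to every parameter group
    for param_group in optimizer.param_groups:
        param_group["lr"] = lr

    return optimizer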

In [None]:
import random
from tqdm import tqdm

def train(model, SRC, TRG, MODEL_PATH, FORCE_MAX_LEN=50):
    # relies on the globals `train_iter`, `val_iter` and `writer`
    model.train()
    optimizer = torch.optim.Adam(
        model.parameters(), lr=hp["LR"], betas=(0.9, 0.98), eps=1e-9)
    criterion = nn.CrossEntropyLoss(ignore_index=TRG.vocab.stoi["<pad>"])

    for epoch in tqdm(range(hp["EPOCHS"])):

        for step, batch in enumerate(train_iter):
            global_step = epoch * len(train_iter) + step

            model.train()
            optimizer.zero_grad()
            optimizer = custom_lr_optimizer(optimizer, global_step)

            src = batch.src.T
            trg = batch.trg.T

            trg_input = trg[:, :-1]

            preds, _, _, _ = model(src, trg_input)
            ys = trg[:, 1:]

            loss = criterion(preds.permute(0, 2, 1), ys)
            loss.mean().backward()
            optimizer.step()

            if global_step % 50 == 0:
                print("#"*90)

                model.eval()

                v = next(iter(val_iter))
                v_src, v_trg = v.src.T, v.trg.T

                # index into the actual validation batch, which may be smaller than BATCH_SIZE
                rand_index = random.randrange(v_src.shape[0])

                v_trg_inp = v_trg[:, :-1].detach()
                v_trg_real = v_trg[:, 1:].detach()

                v_predictions, _, _, _ = model(v_src, v_trg_inp)
                max_args = v_predictions[rand_index].argmax(-1)
                print("For random element in VALIDATION batch (real/pred)...")
                print([TRG.vocab.itos[word_idx]
                       for word_idx in v_trg_real[rand_index, :]])
                print([TRG.vocab.itos[word_idx]
                       for word_idx in max_args])

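                print("Length til first <PAD> (real -> pred)...")
                # look the padding index up rather than hard-coding it
                pad_idx = TRG.vocab.stoi["<pad>"]
                pred_tokens = max_args.tolist()
                real_tokens = v_trg_real[rand_index, :].tolist()
                # None means the sequence contains no padding at all
                pred_PAD_idx = pred_tokens.index(pad_idx) if pad_idx in pred_tokens else None
                real_PAD_idx = real_tokens.index(pad_idx) if pad_idx in real_tokens else None

                print(real_PAD_idx, "  --->  ", pred_PAD_idx)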

                val_loss = criterion(
                    v_predictions.permute(0, 2, 1), v_trg_real)
                print("TRAINING LOSS:", loss.mean().item())
                print("VALIDATION LOSS:", val_loss.mean().item())

                print("#"*90)

                writer.add_scalar(
                    "Training Loss", loss.mean().detach().item(), global_step)
                writer.add_scalar("Validation Loss",
                                  val_loss.mean().detach().item(), global_step)
        torch.save(model, MODEL_PATH)


In [None]:
MODEL_PATH = "transformer_model.pt"
writer = SummaryWriter()
transformer = Transformer(len(SRC.vocab), len(TRG.vocab)).to(device)
train(transformer, SRC, TRG, MODEL_PATH)

## Testing

In [None]:
def search(model, source_sentences, src_sos_idx=SRC.vocab.stoi["<sos>"], trg_sos_idx=TRG.vocab.stoi["<sos>"], max_seq_len=40):
    src = source_sentences.to(device)
    # shape(src) = [B x seq_len]

    batch_size = src.shape[0]
    seq_len = src.shape[1]

    outputs = torch.zeros(batch_size, max_seq_len).long().to(device)

    for seq_id in range(batch_size):
        input_sequence = src[seq_id].unsqueeze(0)
        preds = torch.LongTensor([trg_sos_idx]).to(device).unsqueeze(0)

        for t in range(max_seq_len-1):
            # use the model passed in rather than the global `transformer`
            predictions, _, _, _ = model(input_sequence, preds)
            predicted_id = predictions[:, -1:, :].argmax(-1)
            preds = torch.cat((preds, predicted_id), dim=-1)

        # shape(preds) = [1 x max_seq_len]; drop the batch dim before storing
        outputs[seq_id] = preds.squeeze(0)

    return outputs

In [None]:

def get_text_from_tensor(tensor, SRC_or_TRG):
    # shape(tensor) = [B x seq_len]
    batch_output = []

    sos = SRC_or_TRG.vocab.stoi["<sos>"]
    eos = SRC_or_TRG.vocab.stoi["<eos>"]
    pad = SRC_or_TRG.vocab.stoi["<pad>"]

    for i in range(tensor.shape[0]):
        sequence = tensor[i]
        words = []
        for tok_idx in sequence:
            tok_idx = int(tok_idx)

            # sos, eos and pad are vocabulary indices, so compare indices, not strings
            if tok_idx == sos:
                continue
            elif tok_idx == eos or tok_idx == pad:
                break
            else:
                words.append(SRC_or_TRG.vocab.itos[tok_idx])
        words = " ".join(words)
        batch_output.append(words)
    return batch_output


In [None]:
def evaluate_bleu(model, iterator):

    model.eval()

    hyp = []
    ref = []

    for batch in tqdm(iterator):
        src, trg = batch.src.T, batch.trg.T
        outputs = search(model, src)

        outputs = outputs[:, 1:]

        hyp += get_text_from_tensor(outputs, TRG)
        ref += get_text_from_tensor(trg, TRG)

    # sacrebleu expects a list of reference streams, each aligned with the hypotheses
    # hyp = ['translation_1', 'translation_2']
    # ref = [['truth_1', 'truth_2'], ['another truth_1', 'another truth_2']]
    ref = [ref]
    return sacrebleu.corpus_bleu(hyp, ref, force=True).score



In [None]:
def inference(model, source_sentence):
    source_sentence_tokens = SRC.preprocess(source_sentence)
    src = SRC.process([source_sentence_tokens]).T
    outputs = search(model, src)
    print(get_text_from_tensor(outputs, TRG))


In [None]:
transformer = torch.load(MODEL_PATH, map_location=device)
inference(transformer, "Eine Frau mit blonden Haaren trinkt aus einem Glas")
# print(evaluate_bleu(transformer, test_iter))


# Byte-Pair Encoding

So far, we've tokenized inputs primarily on whitespace plus some punctuation rules. __Byte-Pair Encoding__ (BPE) is a conceptually simple and elegant unsupervised technique which lets us compose words from smaller sub-word tokens while reducing the vocabulary size. It enables us to identify common 'patterns' in natural language, meaning common sequences of characters, and split on those patterns. For example, consider the sentence: `"They learned byte pair encoding successfully"`. There are some words in this sentence which contain common sequences of characters:
- learned -> learn + _ed_
- encoding -> encod + _ing_
- successfully -> success + _ful_ + _ly_

Now, let's say we had a list of all the common patterns in language (e.g. `[ed, ing, ful, ly]`). Composing our BPE corpus is then relatively trivial. For every word in our corpus, we check whether a common pattern is present in that word, and if it is, the tokens for that word become the root of the word and the patterns it contains. A special symbol (here, `@@`) is used to indicate the start or end of a sub-word. So given the example sentence above, the two tokenizations are:
- `[they, learn@@, @@ed, byte, pair, encod@@, @@ing, success@@, @@ful@@, @@ly]` (tokenizing via BPE).
- `[they, learned, byte, pair, encoding, successfully]` (tokenizing on whitespace).
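
The simplified procedure described above can be sketched in a few lines. This is a toy illustration only: the function name `segment` and the fixed suffix list are assumptions, and it mirrors the description in this section rather than the way real BPE learns its patterns from a corpus.

```python
def segment(word, patterns):
    """Split known trailing patterns off `word`, marking sub-word joins with @@."""
    suffixes = []
    stripped = True
    while stripped:
        stripped = False
        for pattern in patterns:
            # keep at least one character as the root of the word
            if word.endswith(pattern) and len(word) > len(pattern):
                suffixes.insert(0, pattern)
                word = word[:-len(pattern)]
                stripped = True
                break

    pieces = [word] + suffixes
    tokens = []
    for i, piece in enumerate(pieces):
        left = "@@" if i > 0 else ""                 # joins onto the previous piece
        right = "@@" if i < len(pieces) - 1 else ""  # joins onto the next piece
        tokens.append(left + piece + right)
    return tokens


patterns = ["ed", "ing", "ful", "ly"]  # assumed list of common patterns
sentence = "They learned byte pair encoding successfully"

tokens = []
for word in sentence.lower().split():
    tokens += segment(word, patterns)

print(tokens)
```

Words without any of the listed patterns pass through unchanged, so only three of the six words are split here.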

As we are not using BPE in our models, the implementation details are out of scope for this workshop. However, here are some brilliant resources to demonstrate how to implement BPE given a corpus:
- [https://leimao.github.io/blog/Byte-Pair-Encoding/](https://leimao.github.io/blog/Byte-Pair-Encoding/)
- [https://nlp.h-its.org/bpemb/](https://nlp.h-its.org/bpemb/)
- [https://arxiv.org/abs/1508.07909](https://arxiv.org/abs/1508.07909)