In [4]:
import torch
from torch import nn
import torch.nn.functional as F
import math

# Transformer

![transformerz](https://cdn.collider.com/wp-content/uploads/2017/06/transformers-5-optimus-prime-bumblebee.jpg)

No... not this kind

The Transformer is a non-recurrent model which has achieved SoTA results in sequence-to-sequence transduction problems. It is based entirely on attention. The recurrent nature of RNNs means that parallelisation is not possible as every hidden state relies on the hidden state before it. Because the Transformer is instead built on attention and FCNNs (fully connected neural networks), the computation for every position in the sequence can be parallelised.

![image.png](https://miro.medium.com/max/500/1*do7YDFF2sads0p9BnjzrWA.png)

The Transformer follows a typical Encoder/Decoder architecture. The encoder takes an input sequence of symbols $(x_1, ..., x_n)$ and maps this to a sequence of continuous representations $\mathbf{h} = (h_1, ..., h_n)$. Given $\mathbf{h}$, the decoder generates an output sequence $(y_1, ..., y_m)$ one symbol at a time.

### Encoder
Each encoder layer in the Transformer consists of two sub-layers. The first of these layers is the _Multi-Head Self-Attention_ module, and the second is a module called a _Position-Wise Feed-Forward Network_. Immediately following each one of these modules is a _Residual Layer Normalization_.

### Decoder
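A decoder layer is similar to an encoder layer, but its self-attention module is masked (_Masked Multi-Head Attention_) and it has one extra module inserted: a multi-head attention module which receives input from the encoder. We will discuss the decoder more in a further session.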


At the heart of the Transformer is this concept known as _self-attention_. Let's look at the Transformer holistically and then see exactly what this is, and why and how it solves sequence-to-sequence tasks so effectively.

## The Holistic Transformer

We are attempting to solve a sequence to sequence translation task: German to English using the Transformer:

![whole_transformer](images/transformer.png)

The Transformer is comprised of a stack of encoders and a stack of decoders. The output from the final layer of the encoder stack is sent to the decoder. The input to the first encoder is done via a special embedding module. We will look at the decoder and the embedding module in a future session.

![transformer_high_level](images/transformer_high_level.png)

Within the encoder stack we have a set of connected encoders. The output of one encoder is sent as input to the next encoder. An encoder consists of two sub-layers. The first sub-layer consists of a _Multi-Head Self-Attention_ module, and the second, a _Feed-Forward_ module. Immediately after each module, a _Residual Connection_ followed by a _Layer Normalization_ is applied.

![encoder_first_few](images/encoder_first_few.png)

Let's deconstruct what this new terminology means one by one. We'll start with _self-attention_ and _multi-head self-attention_. Then we'll look at _residual connections_ followed by _layer normalization_. After we've covered the _feed-forward_ module, you'll have understood most of the techniques in the Transformer! The encoder is simply a combination of the aforementioned components.

## Self-Attention

At the heart of the Transformer is self-attention. Self-attention is a mechanism that allows each input in a sequence to look at the whole sequence to compute a representation of the sequence.

...what?

Ok. Let's look at the following sentence: `the animal didn't cross the street because it was too tired`. What does the `it` refer to in this sentence? The street or the animal? This is trivial for us as humans to answer but not for a machine. Wouldn't it be nice if we could have some way of the computer understanding what `it` referred to?

This is what self-attention attempts to do. As the model processes each word in the input sequence, self-attention allows us to look at other words in the input sequence for ideas as to what we want to encode in the representation of this word.

![](http://jalammar.github.io/images/t/transformer_self-attention_visualization.png)

It is given by the formula:

$$Attention(Q,K,V) = softmax \left( \frac{QK^T}{\sqrt{d_k}}\right)V$$


To make this clearer, think of how a hidden state in an RNN incorporates the representation of the previous words into the current representation. Self-attention is how the Transformer attempts to use other words (not just the previous words) to encode the meaning of a particular word.


![encoder_multihead](images/encoder_multihead.png)

![vector_attn_score](images/vector_attn_scores.png)

### Self-Attention in detail

Self-attention is a representation which you can think of as a score. The intent is to have a representation, per input word, which tells us how much focus each word needs to pay to every word in the sequence. A bit complicated, right? Hopefully by the end of this post, this meaning will be clear.

Before talking about matrices, let's talk in terms of vectors. We'll have the sequence length be of dimension $T$, and the encoding/embeddings of the word be $D$ dimensional.

Note that because I'm focusing on the FIRST Encoder here, the encodings of our sequence are the embeddings. But as you move up the encoder stack, the encoding of the sequence is the output from the previous encoder. Again, this is $D$ dimensional.

![word_encodings](images/word_encodings.png)


Ok. Now we're going to create three vectors for EACH input: A query vector ($q$), a key vector ($k$), and a value vector ($v$). These will be $d_k$ dimensional. $d_k$ is typically $D/8$. As far as real values go, usually $D=512$, while $d_k=64$. In our example, what is $d_k$?

Ok... so how do we get $q$, $k$, $v$? We learn them, of course!

How... do we learn them? We need weight matrices which will transform our encodings into these vectors.
This means that $W^Q$, $W^K$ and $W^V$ are all $\in \mathbb{R}^{D \times d_k}$.

![qkv_vectors](images/qkv_vectors.png)
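
As a minimal sketch of this projection step for a single word: the sizes `D = 4` and `d_k = 2` and the random weight matrices below are assumptions chosen for illustration, not values from a trained model.

```python
import torch

torch.manual_seed(0)

D, d_k = 4, 2  # illustrative sizes (assumed), not the usual 512 and 64

x_1 = torch.randn(D)  # encoding of the first word

# In a real model these matrices are learned; here they are random stand-ins.
W_Q = torch.randn(D, d_k)
W_K = torch.randn(D, d_k)
W_V = torch.randn(D, d_k)

# Three different projections of the same input vector.
q_1 = x_1 @ W_Q
k_1 = x_1 @ W_K
v_1 = x_1 @ W_V

print(q_1.shape, k_1.shape, v_1.shape)
```

An `nn.Linear(D, d_k)` layer does the same job, with an additional bias term by default.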


Ok, that's nice. We understand that $q$, $k$, $v$ are different projections of the same input now; but what do the query, key, value abstractions actually mean? They're useful terms we can use to think about attention. Let's look through the following so we can see the roles they play.

Recall what we defined self-attention as earlier: A representation, per input word, which tells us how much focus each word needs to pay to every word in the sequence. For each word in our sequence, we will calculate a score by taking the dot product of the current word's query vector and the key vector for every word in the sequence.

![w1_til_softmax](images/w1_til_softmax.png)

Let's take stock of what we've done so far:
- Dividing by $\sqrt{d_k}$ is simply a practical scaling factor which leads to more stable gradients.
- Softmax turns our scores into a probability distribution (each score is now between 0 and 1, and the sum of the scores = 1).

We will now use these scores by multiplying each one with its corresponding value vector. The intention here is that lower scoring words will have less weighting in the self-attention output, as these words are treated as less relevant (e.g. a low score like 0.0001 will "cancel out" its corresponding value vector).

![w1_til_z1](images/w1_til_z1.png)

Finally, the output for the current word is the summation of all the $softmax \times v$ vectors. I.e. a weighted sum:

![z_vector](images/z_vector.png)
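
The steps so far can be sketched for a single word as follows. This is only an illustration: the sizes (`T = 3`, `d_k = 2`) and the random `q`, `k`, `v` tensors are assumed stand-ins for learned projections.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

T, d_k = 3, 2  # illustrative sizes (assumed)

# Random stand-ins for the learned projections of a 3-word sequence.
q = torch.randn(T, d_k)  # one query vector per word
k = torch.randn(T, d_k)  # one key vector per word
v = torch.randn(T, d_k)  # one value vector per word

# Step 1: score word 1 against every word (dot product of q_1 with each key).
scores = torch.stack([torch.dot(q[0], k[t]) for t in range(T)])

# Step 2: scale by sqrt(d_k), then softmax into weights that sum to 1.
weights = F.softmax(scores / math.sqrt(d_k), dim=-1)

# Step 3: the weighted sum of the value vectors gives z_1.
z_1 = sum(weights[t] * v[t] for t in range(T))

print(weights, z_1)
```

The weighted sum in step 3 is the same thing as the matrix-vector product `weights @ v`, which is what makes the matrix formulation possible.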

Ok. So that's self-attention in vector form. What about in terms of matrices?

- Our input, $X$, is now a matrix of our sequence of words (i.e. $X \in \mathbb{R}^{T\times D}$):
![x_matrix](images/X_matrix_input.png)

- $Q$, $K$, $V$ are now also matrices $\in \mathbb{R}^{T \times d_k}$.
- $W^Q$,$W^K$,$W^V$ stay $\in \mathbb{R}^{D \times d_k}$.
- We now simply obtain $Z \in \mathbb{R}^{T \times d_k}$ by plugging $Q$, $K$, $V$ into our Attention formula.

![Z_matrix](images/Z_matrix.png)


- For the FIRST encoder, Q, K, V are determined by the embeddings of the input words
- For the rest of the encoder stack, Q, K, V are determined by the output of the previous encoder
- For the decoder stack, Q is determined in a similar fashion to the encoders. K and V, however, are passed from the final encoder to each of the decoders in the decoder stack.

In [5]:
# encodings = torch.Tensor([[[0.0, 0.1, 0.2, 0.3], [1.0, 1.1, 1.2, 1.3], [2.0, 2.1, 2.2, 2.3]]]) # (1, 3, 4)
# Q_layer = nn.Linear(4, 3)
# K_layer = nn.Linear(4, 3)
# V_layer = nn.Linear(4, 3)

# Q = Q_layer(encodings)
# K = K_layer(encodings)
# V = V_layer(encodings)

In [6]:
def scaled_dot_product_attention(Q, K, V, dk=3):
    Q_K_matmul = torch.matmul(Q, K.T)
    matmul_scaled = Q_K_matmul/math.sqrt(dk)
    attention_weights = F.softmax(matmul_scaled, dim=-1)

    output = torch.matmul(attention_weights, V)

    return output, attention_weights

In [7]:
def print_attention(Q, K, V):
    n_digits = 3
    temp_out, temp_attn = scaled_dot_product_attention(Q, K, V)
    
    print ('Attention weights are:')
    print (np.around(temp_attn, 2))
    print ('Output is:')
    print (np.around(temp_out, 2))


In [8]:
import torch
import torch.nn.functional as F
import math
import numpy as np

In [9]:
temp_k = torch.Tensor([[10,0,0],
                      [0,10,0],
                      [0,0,10],
                      [0,0,10]])  # (4, 3)

temp_v = torch.Tensor([[   1,0, 1],
                      [  10,0, 2],
                      [ 100,5, 0],
                      [1000,6, 0]])  # (4, 3)

In [10]:
# This `query` aligns with the second `key`,
# so the second `value` is returned.
temp_q = torch.Tensor([[0, 10, 0]])  # (1, 3)
print_attention(temp_q, temp_k, temp_v)

Attention weights are:
tensor([[0., 1., 0., 0.]])
Output is:
tensor([[10.,  0.,  2.]])


In [11]:
# This query aligns with a repeated key (third and fourth), 
# so all associated values get averaged.
temp_q = torch.Tensor([[0, 0, 10]])  # (1, 3)
print_attention(temp_q, temp_k, temp_v)

Attention weights are:
tensor([[0.0000, 0.0000, 0.5000, 0.5000]])
Output is:
tensor([[550.0000,   5.5000,   0.0000]])


In [12]:
# This query aligns equally with the first and second key, 
# so their values get averaged.
temp_q = torch.Tensor([[10, 10, 0]])  # (1, 3)
print_attention(temp_q, temp_k, temp_v)

Attention weights are:
tensor([[0.5000, 0.5000, 0.0000, 0.0000]])
Output is:
tensor([[5.5000, 0.0000, 1.5000]])


In [13]:
temp_q = torch.Tensor([[0, 0, 10], [0, 10, 0], [10, 10, 0]])  # (3, 3)
print_attention(temp_q, temp_k, temp_v)

Attention weights are:
tensor([[0.0000, 0.0000, 0.5000, 0.5000],
        [0.0000, 1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000]])
Output is:
tensor([[550.0000,   5.5000,   0.0000],
        [ 10.0000,   0.0000,   2.0000],
        [  5.5000,   0.0000,   1.5000]])


## Multihead Attention

Ok, we've just tackled self-attention, but the diagram tells us about something called multi-head attention. What is this?

Conceptually, and even in terms of implementation, it's quite simple. Let's recap what we just obtained: $Z$. Each $z$ vector (i.e. for each vector $z_1, z_2, ..., z_T$ in $Z$) is a representation which bakes in all the words in the sequence including itself.

A potential problem is that each $z_t$ _could_ be dominated by the representation for the $t$-th word itself. This is what multihead attention solves.

Instead of performing self-attention once, multihead attention performs it multiple times. This means that we have multiple different attention representations. Of course, to obtain these multiple different representations (i.e. $Z$s), we need to learn multiple $Q, K, V$s. Each set of $Q, K, V, Z$ is referred to as one head. Typically, we use 8 heads (i.e. perform self-attention 8 different times). To "use" these newly obtained $Z$s, we concatenate them and pass them through a linear layer to project them back to $D$ dimensionality.

![](http://jalammar.github.io/images/t/transformer_self-attention_visualization_2.png)

In [14]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8, p_drop=0.3):
        super().__init__()

        # d_k
        self.d = d_model//num_heads
        
        # D
        self.d_model = d_model
        
        # Typically 8
        self.num_heads = num_heads

        self.dropout = nn.Dropout(p_drop)

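        # nn.ModuleList (rather than a plain Python list) registers each
        # layer's parameters with the module, so they are trained and moved
        # to the right device along with everything else
        self.linear_Qs = nn.ModuleList([nn.Linear(d_model, self.d)
                                        for head in range(num_heads)])
        self.linear_Ks = nn.ModuleList([nn.Linear(d_model, self.d)
                                        for head in range(num_heads)])
        self.linear_Vs = nn.ModuleList([nn.Linear(d_model, self.d)
                                        for head in range(num_heads)])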

        self.mha_linear = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V):
        Q_K_matmul = torch.matmul(Q, K.transpose(-2, -1))
        matmul_scaled = Q_K_matmul/math.sqrt(self.d)

        attention_weights = F.softmax(matmul_scaled, dim=-1)

        output = torch.matmul(attention_weights, V)

        return output, attention_weights

    def forward(self, input):
        # shape(input) = [B x T x D]
        
        q = input
        k = input
        v = input
        
        # These will all be a list of Tensors
        Q = [linear(q) for linear in self.linear_Qs]
        K = [linear(k) for linear in self.linear_Ks]
        V = [linear(v) for linear in self.linear_Vs]
        # shape(Q) = shape(K) = shape(V) = [[B x T x d_k] * num_heads]

        scores_per_head = []
        attention_weights_per_head = []
        for Q_, K_, V_ in zip(Q, K, V):
            score, attention_weight = self.scaled_dot_product_attention(
                Q_, K_, V_)
            scores_per_head.append(score)
            attention_weights_per_head.append(attention_weight)

        concat_scores = torch.cat(scores_per_head, -1)
        # shape(concat_scores) = [B x T x D]
        output = self.dropout(self.mha_linear(concat_scores))
        # shape(output) = [B x T x D]
        
        return output

In [15]:
toy_encodings = torch.Tensor([[[0.0, 0.1, 0.2, 0.3], [1.0, 1.1, 1.2, 1.3], [2.0, 2.1, 2.2, 2.3]]]) 
# shape(toy_encodings) = [B, T, D] = (1, 3, 4)
print("Toy Encodings:\n\t", toy_encodings)

D_MODEL = toy_encodings.shape[-1]

Toy Encodings:
	 tensor([[[0.0000, 0.1000, 0.2000, 0.3000],
         [1.0000, 1.1000, 1.2000, 1.3000],
         [2.0000, 2.1000, 2.2000, 2.3000]]])


In [16]:
toy_MHA_layer = MultiHeadAttention(d_model=D_MODEL, num_heads=2)
toy_MHA = toy_MHA_layer(toy_encodings)
print("Toy MHA: \n\t", toy_MHA)
print("Toy MHA Shape: \n\t", toy_MHA.shape)

Toy MHA: 
	 tensor([[[ 0.6393, -0.4274, -0.1076,  0.0000],
         [ 0.0000, -0.0000, -0.1177,  0.2135],
         [ 0.7447, -0.3604, -0.1274,  0.1824]]], grad_fn=<MulBackward0>)
Toy MHA Shape: 
	 torch.Size([1, 3, 4])


Why are some of these values zero?
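
A likely explanation is the `nn.Dropout` layer at the end of `MultiHeadAttention.forward`: in training mode it zeroes each element with probability `p_drop` and scales the survivors by `1 / (1 - p_drop)`. Here is a minimal sketch of that behaviour in isolation, where the tensor of ones is an assumed stand-in for the attention output:

```python
import torch
from torch import nn

torch.manual_seed(0)

dropout = nn.Dropout(p=0.3)
x = torch.ones(1, 3, 4)  # stand-in for a [B x T x D] activation

dropout.train()          # modules are in training mode by default
train_out = dropout(x)   # each entry is either 0 or 1 / 0.7

dropout.eval()           # dropout is a no-op at evaluation time
eval_out = dropout(x)

print(train_out)
print(eval_out)
```

Calling `.eval()` on the toy layer before the forward pass should therefore remove the zeros.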

## Layer Normalization

In a typical problem where we want to employ neural networks, we normalize (standardize) our data before feeding it into the network. This is usually done by subtracting the global mean and dividing by the global standard deviation of the data:
$$x = \frac{x - \mu}{\sigma}$$

It is also possible to normalize _within_ the activations or layers of the network. There are many approaches for doing so: _Batch Normalization, Layer Normalization, Instance Normalization_ and _Group Normalization_.

We won't be diving into the details of layer normalization here as it's out of scope - the important thing to know is that it is a normalization process applied over the features internal to the model. A helper is provided for us in PyTorch:

In [17]:
# self.layer_norm = nn.LayerNorm(DIMENSIONALITY)
# layer_normed = self.layer_norm(thing_to_layer_norm)
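
To make "normalizing over features" a little more concrete, here is a rough sketch that reproduces `nn.LayerNorm` by hand for a freshly initialized layer. The tensor sizes are assumptions for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

x = torch.randn(1, 3, 4)  # [B x T x D], illustrative sizes (assumed)

layer_norm = nn.LayerNorm(4)  # normalizes over the last (feature) dimension

# The same computation by hand: per-token mean and (biased) variance over D.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + layer_norm.eps)

# At initialization the learnable scale is 1 and the shift is 0,
# so the two results should match.
print(torch.allclose(layer_norm(x), manual, atol=1e-5))
```

Each token (each row of length $D$) is normalized using only its own statistics, independent of the other tokens and of the batch.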

To read further about layer normalization, check out the following resources (ordered _mostly_ from most accessible to least). The first few resources touch on batch normalization in order to give a conceptual understanding of what layer normalization is attempting to do:
- https://www.youtube.com/watch?v=DtEq44FTPM4
- https://www.youtube.com/watch?v=tNIpEZLv_eg
- https://mlexplained.com/2018/11/30/an-overview-of-normalization-methods-in-deep-learning/ (Recommended read)
- https://arxiv.org/abs/1706.03762

## Residual Connection

In theory, if we continue stacking layers in a neural network on top of each other, we expect the training error to decrease. However, the reality doesn't follow suit.

[IMAGE]

Residual connections address this problem by introducing a _skip connection_ (or shortcut) that bypasses one or more layers. All we need to do to implement a residual connection is add the input of a sub-layer to its output:

[IMAGE]
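
A bare-bones sketch of the idea, using a hypothetical `sublayer` (a single linear layer standing in for attention or the feed-forward network) and assumed toy sizes:

```python
import torch
from torch import nn

torch.manual_seed(0)

x = torch.randn(1, 3, 4)    # [B x T x D], illustrative sizes (assumed)
sublayer = nn.Linear(4, 4)  # hypothetical stand-in for a real sub-layer

# Residual connection: the sub-layer only has to learn a correction to x.
out = x + sublayer(x)

# If the sub-layer outputs zeros, the whole block reduces to the identity.
nn.init.zeros_(sublayer.weight)
nn.init.zeros_(sublayer.bias)
identity_out = x + sublayer(x)

print(torch.equal(identity_out, x))
```

One commonly cited intuition is that this makes it easy for a layer to "do nothing" when that is the best option, so adding more layers should not hurt.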

To read further about residual connections and why they work, please refer to the following resources:
- https://www.coursera.org/lecture/convolutional-neural-networks/resnets-HAhz9
- https://www.coursera.org/lecture/convolutional-neural-networks/why-resnets-work-XAKNO
- https://arxiv.org/abs/1512.03385

Implementing residual connections is deceptively straightforward.

In [18]:
class AddNorm(nn.Module):
    def __init__(self, d_model=512, p_drop=0.3):
        super().__init__()
        self.layer_norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p_drop)

    def forward(self, x, res_input):
        ln = self.layer_norm(res_input + x)
        return self.dropout(ln)

In [19]:
toy_AddNorm_layer = AddNorm(d_model=D_MODEL)
toy_AddNorm = toy_AddNorm_layer(toy_MHA, toy_encodings)
print("Toy AddNorm: \n\t", toy_AddNorm)
print("Toy AddNorm shape: \n\t", toy_AddNorm.shape)

Toy AddNorm: 
	 tensor([[[ 0.0000, -0.0000, -0.3414,  0.5056],
         [-1.2446, -0.5290, -0.6558,  2.4295],
         [ 1.8030, -1.9345, -0.6962,  0.8276]]], grad_fn=<MulBackward0>)
Toy AddNorm shape: 
	 torch.Size([1, 3, 4])


## Taking stock

This brings us close to the end of our first sub-layer in the encoder (and about a 70% understanding of the new concepts in the Transformer).

![encoder_first_few](images/encoder_first_few.png)

The next sub-layer is relatively straightforward.

## Position-wise Feed-Forward Network

The authors pass their output from the first sublayer into a feed-forward network. They call this a position-wise feed-forward network because it "is applied to each position separately and identically". This means that they run a feed-forward network over a rank 3 tensor (including batch size), over the sequence dimension. 

All this means is that the SAME weights are applied to all tokens in the sequence.
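
As a quick sanity check of that claim, the sketch below compares applying one linear layer to the whole tensor against applying it to each position separately. The sizes and the single `nn.Linear` layer are assumptions for illustration, not the full feed-forward network.

```python
import torch
from torch import nn

torch.manual_seed(0)

B, T, D = 2, 3, 4         # illustrative sizes (assumed)
x = torch.randn(B, T, D)
linear = nn.Linear(D, D)  # one set of weights shared by every position

# Apply the layer to the whole [B x T x D] tensor at once ...
all_at_once = linear(x)

# ... and to each position separately, then stack the results back together.
one_by_one = torch.stack([linear(x[:, t]) for t in range(T)], dim=1)

print(torch.allclose(all_at_once, one_by_one, atol=1e-6))
```

Because `nn.Linear` only acts on the last dimension, no information flows between positions in this sub-layer; mixing across positions is left to attention.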

The authors use a two layered network given as:

$$FFN(x) = max(0,xW_1+b_1)W_2+b_2$$

The first layer has an output dimension of $d_{ff}=2048$ and the second projects back down to $D$. It's unclear why the inputs are projected to a larger dimension before being projected back down to $D$ dimensions, but [one source](https://graphdeeplearning.github.io/post/transformers-are-gnns/) suggests that it is a convergence trick which enables re-scaling of the feature vectors independently of each other.

In [20]:
# D_FF = 2048
D_FF = D_MODEL * 4
P_DROP = 0.3

class PointwiseFeedforward(nn.Module):
    def __init__(self, d_model=D_MODEL, d_ff=D_FF, p_drop=P_DROP):
        super().__init__()
        self.pffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.pffn(x)


In [21]:
toy_PFFN_layer = PointwiseFeedforward(d_model=D_MODEL, d_ff=D_FF)
toy_PFFN = toy_PFFN_layer(toy_AddNorm)
print("Toy PFFN: \n\t", toy_PFFN)
print("Toy PFFN Shape: \n\t", toy_PFFN.shape)

Toy PFFN: 
	 tensor([[[ 0.0380,  0.0518, -0.3602,  0.0769],
         [ 0.0561,  0.6201, -0.4566, -0.2010],
         [ 0.3849,  0.3601, -0.7434, -0.1085]]], grad_fn=<AddBackward0>)
Toy PFFN Shape: 
	 torch.Size([1, 3, 4])


In [22]:
toy_AddNorm_layer_2 = AddNorm(d_model=D_MODEL)
toy_AddNorm_2 = toy_AddNorm_layer_2(toy_PFFN, toy_AddNorm)
print("Toy AddNorm 2: \n\t", toy_AddNorm_2)
print("Toy AddNorm 2 Shape: \n\t", toy_AddNorm_2.shape)

Toy AddNorm 2: 
	 tensor([[[ 0.0000,  0.0000, -0.0000,  1.8437],
         [-1.2346,  0.0894, -0.0000,  0.0000],
         [ 2.0159, -1.4088, -0.0000,  0.0000]]], grad_fn=<MulBackward0>)
Toy AddNorm 2 Shape: 
	 torch.Size([1, 3, 4])


## Encoder Layer

This is everything we require for one arbitrary Encoder layer! Let's code up a class which contains the aforementioned components.

In [23]:
NUM_HEADS = 8  # d_model must be divisible by num_heads; pass a smaller num_heads for the toy D_MODEL of 4

class EncoderLayer(nn.Module):
    def __init__(self, d_model=D_MODEL, num_heads=NUM_HEADS, d_ff=D_FF, p_drop=P_DROP):
        super().__init__()

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_ff = d_ff
        self.p_drop = p_drop

        self.MHA = MultiHeadAttention(
            self.d_model, self.num_heads, self.p_drop)

        self.addNorm1 = AddNorm(self.d_model, self.p_drop)
        self.addNorm2 = AddNorm(self.d_model, self.p_drop)

        self.PFFN = PointwiseFeedforward(
            self.d_model, self.d_ff, self.p_drop)

    def forward(self, x):
        # MultiHeadAttention.forward returns a single tensor (not a tuple)
        mha = self.MHA(x)
        addNorm_1 = self.addNorm1(mha, x)

        pffn = self.PFFN(addNorm_1)
        addNorm_2 = self.addNorm2(pffn, addNorm_1)

        return addNorm_2


## Positional Encoding

Let's talk about embedding the inputs.

[IMAGE]

One thing we've not yet looked at is _Positional Encoding_. Recall that our model contains no recurrence and thus we need a way to make sense of the order of the sequence. To do so, we need to inject some information about the position of the tokens in the sequence. Positional Encoding is one strategy to do so.

So, we want our model to add some information to each of our words (embeddings) which indicates its position in the sequence. What are some strategies for doing so?

Well, we could just add the token position to the embedding (e.g. 1 for the first word, 2 for the second word, 3 for the third word). What are some issues with this approach?

What about linearly assigning values between 0 and 1 to the token embedding? (e.g. for an eight-length sequence, the first word has 0.125 added to it, the second 0.25, the third 0.375 etc).

The trick the authors propose is to add a $D$-dimensional vector to the embedding (which is also $D$-dimensional) instead of a single number. The vector which is added to the word embedding is __fixed__ - that is, it does NOT depend on the features of the word itself - only the position that it appears in.

Let's dissect the formula it's given by:

$$PE_{(pos,2i)}=sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)$$
$$PE_{(pos,2i+1)}=cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)$$

[IMAGE]
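
To get a feel for the formula, here is a small pure-Python sketch that evaluates it one position at a time. The helper name `positional_encoding` and the toy `d_model = 4` are assumptions for illustration.

```python
import math

def positional_encoding(pos, d_model):
    """Return the d_model-dimensional encoding for a single position."""
    pe = [0.0] * d_model
    for two_i in range(0, d_model, 2):
        angle = pos / (10000 ** (two_i / d_model))
        pe[two_i] = math.sin(angle)          # even index: sine
        if two_i + 1 < d_model:
            pe[two_i + 1] = math.cos(angle)  # odd index: cosine, same frequency
    return pe

for pos in range(3):
    print(pos, [round(v, 4) for v in positional_encoding(pos, d_model=4)])
```

Position 0 always encodes to alternating 0s and 1s (since $\sin(0)=0$ and $\cos(0)=1$), and each successive pair of dimensions oscillates more slowly than the one before it.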

In [24]:
class Embeddings(nn.Module):
    def __init__(self, len_vocab, d_model=D_MODEL):
        super(Embeddings, self).__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(len_vocab, self.d_model)

    def forward(self, x):
        return self.embedding(x) * math.sqrt(self.d_model)


In [25]:
toy_vocab = torch.LongTensor([[1, 2, 3, 4, 0, 0]])

toy_embedding_layer = Embeddings(5, d_model=D_MODEL)
toy_embeddings = toy_embedding_layer(toy_vocab)
print("Toy Embeddings: \n\t", toy_embeddings)
print("Toy Embeddings Shape: \n\t", toy_embeddings.shape)


Toy Embeddings: 
	 tensor([[[ 1.9631,  0.1769, -1.0451,  1.6996],
         [ 1.2375,  1.9748,  1.8169,  0.1104],
         [ 2.7404,  1.8952,  3.9096, -1.4101],
         [-3.9316, -0.9797, -1.3554, -2.1137],
         [-2.6030, -1.2361,  1.6496, -0.0630],
         [-2.6030, -1.2361,  1.6496, -0.0630]]], grad_fn=<MulBackward0>)
Toy Embeddings Shape: 
	 torch.Size([1, 6, 4])


In [26]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model=D_MODEL, p_drop=P_DROP, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(0, max_len).unsqueeze(1).float()

        two_i = torch.arange(0, d_model, step=2).float()
        div_term = torch.pow(10000, (two_i/d_model)).float()
        pe[:, 0::2] = torch.sin(pos/div_term)
        pe[:, 1::2] = torch.cos(pos/div_term)

        pe = pe.unsqueeze(0)

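        # register_buffer stores `pe` on the module as self.pe without making
        # it a trainable parameter; it is saved with the module's state and
        # moves with it across devices
        self.register_buffer("pe", pe)

        self.dropout = nn.Dropout(p_drop)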

    # x is the input embedding
    def forward(self, x):

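        # take the encodings for the first T positions ([1 x T x D]) and
        # add them to x ([B x T x D]); the size-1 dimension broadcasts over B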
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)
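
The slicing and broadcasting in `forward` can be sketched in isolation. In this illustration `pe` is an assumed stand-in buffer that simply holds the position index, so the effect of the addition is easy to see.

```python
import torch

torch.manual_seed(0)

B, T, D, max_len = 2, 6, 4, 50  # illustrative sizes (assumed)

# Stand-in buffer: position t holds the value t in every feature.
pe = torch.arange(max_len).float().view(1, max_len, 1).expand(1, max_len, D)
x = torch.randn(B, T, D)        # stand-in for a batch of embeddings

# Keep only the first T positions: [1 x max_len x D] -> [1 x T x D].
pe_slice = pe[:, :x.size(1)]

# The leading dimension of size 1 broadcasts across the batch,
# so every sequence in the batch receives the same positional vectors.
out = x + pe_slice

print(pe_slice.shape, out.shape)
```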

In [27]:
toy_PE_layer = PositionalEncoding(d_model=D_MODEL)
toy_PEs = toy_PE_layer(toy_embeddings)
print("Toy PE: \n\t", toy_PEs)
print("Toy PE Shape: \n\t", toy_PEs.shape)

Toy PE: 
	 tensor([[[ 2.8044,  1.6813, -0.0000,  3.8566],
         [ 2.9700,  0.0000,  2.6098,  1.5862],
         [ 0.0000,  2.1130,  5.6137, -0.5861],
         [-0.0000, -2.8139, -0.0000, -1.5916],
         [-4.7997, -2.6996,  0.0000,  1.3375],
         [-5.0885, -1.3606,  0.0000,  1.3368]]], grad_fn=<MulBackward0>)
Toy PE Shape: 
	 torch.Size([1, 6, 4])


## The Encoder (and the Decoder)
![](https://i.giphy.com/media/11GWLm7bE2fibC/giphy.webp)

Cool! This brings us to the end of the Encoder part of the Transformer. There's one more thing we need to touch on regarding the Encoder, but we will look at that in a bit.

Looking at the Transformer architecture again, we see the Decoder is quite similar to the Encoder. It has two subtle differences though. The __masked__ multi-head attention, and the multi-head attention module which receives inputs from the Encoder. Let's work through these respectively.

## Masked Multihead Attention

Think back to our Seq2Seq model from last week. During decoding, we have access to the whole source sentence, and the tokens we've decoded SO FAR. Obviously we can't have access to future tokens because we don't know what they are yet...

During training time however, we DO have access to future tokens because we have labelled pairs. To speed up training, the Transformer architecture enables us to feed in our whole target sequence to the model and use masking to tell the attention mechanism not to look at illegal (i.e. future) positions when considering the i'th token.

Recall what Multihead Attention (MHA) did. It worked out a representation for each token in the sequence given all the tokens. Masking is the strategy we use to tell the network not to look at the positions past the most recent word that has been decoded.

We will look at the Decoder Layer in its entirety in a bit, but for now let's focus on the masking process for MHA. We will implement this in our `MultiHeadAttention` module. Our mask is going to be the same shape as `Q_K_matmul` because this is what we are calculating attention over.

The mask is a tensor with a value of `-inf` at illegal locations. For one sample in our input, the shape of `Q_K_matmul` is `[T, T]`. Here, `T` refers to the sequence length of the target output. We will talk about test time later, so let's consider the training case right now. During training, we have access to the whole target sequence. So at every timestep (i.e. for every word) in the sequence, an illegal location would be all the future timesteps that we're not meant to have access to. Our matrix is `[T, T]`. So at the 1st timestep, we don't have access to any words from the 2nd timestep onwards. At the 2nd timestep, we don't have access to any words from the 3rd timestep onwards.

[IMAGE]

In [28]:
def create_mask(batch_size, seq_len):
    mask = torch.ones((batch_size, seq_len, seq_len))
    mask = mask.triu(1)
    mask[mask == 1] = float('-inf')
    return mask.to(device)

In [38]:
device = "cpu"
toy_mask = create_mask(1, 10)
print("TOY MASK: \n\t", toy_mask)

tensor([[[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
         [0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
         [0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
         [0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
         [0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf],
         [0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf],
         [0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf],
         [0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., -inf],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]])


In [40]:
toy_matmul = torch.arange(100).reshape(1, 10, 10).float()
print("TOY MATMUL: ", toy_matmul)

TOY MATMUL:  tensor([[[ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.],
         [10., 11., 12., 13., 14., 15., 16., 17., 18., 19.],
         [20., 21., 22., 23., 24., 25., 26., 27., 28., 29.],
         [30., 31., 32., 33., 34., 35., 36., 37., 38., 39.],
         [40., 41., 42., 43., 44., 45., 46., 47., 48., 49.],
         [50., 51., 52., 53., 54., 55., 56., 57., 58., 59.],
         [60., 61., 62., 63., 64., 65., 66., 67., 68., 69.],
         [70., 71., 72., 73., 74., 75., 76., 77., 78., 79.],
         [80., 81., 82., 83., 84., 85., 86., 87., 88., 89.],
         [90., 91., 92., 93., 94., 95., 96., 97., 98., 99.]]])


In [31]:
# from our MultiHeadAttention class:
def scaled_dot_product_attention(self, Q, K, V, mask=None):
    Q_K_matmul = torch.matmul(Q, K.transpose(-2, -1))
    matmul_scaled = Q_K_matmul/math.sqrt(self.d)

    if mask is not None:
        matmul_scaled += mask

    attention_weights = F.softmax(matmul_scaled, dim=-1)
    output = torch.matmul(attention_weights, V)

    return output, attention_weights

In [46]:
toy_matmul += toy_mask
toy_matmul

tensor([[[ 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
         [10., 11., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
         [20., 21., 22., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
         [30., 31., 32., 33., -inf, -inf, -inf, -inf, -inf, -inf],
         [40., 41., 42., 43., 44., -inf, -inf, -inf, -inf, -inf],
         [50., 51., 52., 53., 54., 55., -inf, -inf, -inf, -inf],
         [60., 61., 62., 63., 64., 65., 66., -inf, -inf, -inf],
         [70., 71., 72., 73., 74., 75., 76., 77., -inf, -inf],
         [80., 81., 82., 83., 84., 85., 86., 87., 88., -inf],
         [90., 91., 92., 93., 94., 95., 96., 97., 98., 99.]]])
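
To see why `-inf` is the right value to add, here is a small sketch of what softmax does to masked scores. The all-zero `scores` tensor and `seq_len = 4` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

seq_len = 4  # illustrative size (assumed)

scores = torch.zeros(seq_len, seq_len)  # stand-in for scaled Q K^T scores

# Same construction as create_mask, for a single sample.
mask = torch.ones(seq_len, seq_len).triu(1)
mask[mask == 1] = float("-inf")

# exp(-inf) = 0, so masked (future) positions receive zero attention weight.
weights = F.softmax(scores + mask, dim=-1)
print(weights)
```

Each row still sums to 1, but the weight is shared only among the current and earlier positions.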

## Multihead Attention (again 🙄)

## Decoder Layer

## Transformer