In [1]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import math, copy, time
from torch.autograd import Variable
import matplotlib.pyplot as plt
import seaborn
seaborn.set_context(context="talk")

In [2]:
def clones(module, N):
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

nn.ModuleList is a list-like container for neural network modules. Unlike a plain Python list, it registers every module it holds, so their parameters are visible to the parent module.
copy.deepcopy(module) creates <font color=red>independent copies</font> of the module, unlike a shallow copy. Thus, those copies <font color=red>do not share weights or other parameters.</font>
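
To make the difference concrete, here is a minimal sketch that contrasts deep copies with a list that holds the same layer twice. The names `deep` and `shared` are only for illustration and do not appear elsewhere in this notebook.

```python
import copy
import torch
import torch.nn as nn

def clones(module, N):
    # Same helper as above: N deep copies wrapped in a ModuleList.
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

layer = nn.Linear(4, 4)
deep = clones(layer, 2)                  # two independent copies
shared = nn.ModuleList([layer, layer])   # the same object twice, for contrast

with torch.no_grad():
    deep[0].weight.fill_(1.0)            # modify only the first deep copy
    shared[0].weight.fill_(1.0)          # modify the "first" shared layer

deep_independent = not torch.equal(deep[0].weight, deep[1].weight)
shared_linked = torch.equal(shared[0].weight, shared[1].weight)
n_deep_params = len(list(deep.parameters()))      # weight + bias for each copy
n_shared_params = len(list(shared.parameters()))  # duplicates are counted once
print(deep_independent, shared_linked, n_deep_params, n_shared_params)
```

Note that deep copies start out with identical values; they only become different once they are re-initialized or trained separately.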


In [3]:
def attention(query, key, value, mask=None, dropout=None):
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1))/math.sqrt(d_k) # Scaled dot-product scores: QK^T / sqrt(d_k)
    # transpose(-2, -1) swaps the last two dimensions of the tensor.
    # Only the last two dimensions hold the matrix itself (number of words, d_k).
    # The leading dimensions index the batch and the heads, and matmul() batches over them.
    # matmul() is the (batched) matrix multiplication function in PyTorch
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9) # Positions where mask == 0 get a very negative score, so softmax gives them a weight of ~0
    p_attn = F.softmax(scores, dim = -1) # Softmax over the last dimension, so each query's weights over all keys sum to 1
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn

In [4]:
class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1): # d_model is the length of the embedding, which is typically 512.
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        self.d_k = d_model // h # 7 ÷ 3 = 2...1, so 7//3 = 2
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

Four independent fully connected layers (they are not applied one after another), each mapping d_model dimensions to d_model dimensions. The first three project the query, key and value; the fourth is the output projection applied after the heads are merged.\
p is the probability of an element being zeroed, and here dropout defaults to 0.1.\
You may be confused about <font color=cyan>why we use the linear layers here</font>, given that they do not change the dimension of the input tensor at all. So <font color=Coral>it's not like the linear layer in the PositionwiseFeedForward class</font>, where the projection changes the feature dimension before the ReLU layer. What the linear projection does here is <font color=yellow>compute the query, key and value vectors</font>. Recall that the key and the value vectors are both derived from one feature vector. In some circumstances, such as the attention module in the Encoder, the query is derived from the same feature vector too. For the same feature vector, we use different weight matrices to generate the query, key and value vectors by matrix multiplication, which is exactly what the linear layer does here. <font color=hotpink>So, the linear layer here is just a matrix multiplication (plus a bias), not some other abstract function!</font>
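
As a quick check of that claim, the sketch below compares the output of nn.Linear with the hand-written product $Y=XW^T+b$. The sizes are arbitrary small values chosen for the example, not the ones used by the model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8
x = torch.randn(2, 5, d_model)          # (batch, number of words, d_model)
linear = nn.Linear(d_model, d_model)

by_module = linear(x)                              # what the layer computes
by_hand = x @ linear.weight.T + linear.bias        # Y = X W^T + b, written out

matches = torch.allclose(by_module, by_hand, atol=1e-5)
print(matches, by_module.shape)
```

The shape is unchanged because the layer maps d_model features to d_model features; only the values are mixed by the weight matrix.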

In [5]:
    def forward(self, query, key, value, mask=None):
        if mask is not None:
            mask = mask.unsqueeze(1) # Insert a new dimension at index 1 so that the same mask is broadcast over all heads.
        nbatches = query.size(0)
        query, key, value = [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2) for l, x in zip(self.linears, (query, key, value))] # Project, then split d_model into h heads of size d_k. This is the splitting operation.
        x, self.attn = attention(query, key, value, mask=mask, dropout=self.dropout)
        x = x.transpose(1, 2).contiguous().view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)

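The forward function is called automatically when we call an object of this class.\
At this point the tensor "query" has three dimensions: batch size, number of words, length of the embedding (d_model). As a result, its size along dimension $0$ is the batch size. It only becomes four-dimensional (batch size, number of heads, number of words, $d_k$) after the splitting operation described below.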

The fifth line is the key, <font color=DodgerBlue>which in brief, is the splitting operation</font>.

First, we should know the role of the zip() function. It takes several iterables and returns an iterator of tuples that pairs up the elements at the same position. For example:

In [6]:
# Example
list1 = [1, 2, 3]
list2 = ['a', 'b', 'c']
zipped = zip(list1, list2)
print(list(zipped))  # Output: [(1, 'a'), (2, 'b'), (3, 'c')]

[(1, 'a'), (2, 'b'), (3, 'c')]


In this specific case, zip() pairs the first three linear layers with the three tensors. The fourth linear layer is left over, because zip() stops at the shortest input.

Second, let's look at the view() function, which works similarly to "reshape" in numpy. The first dimension of the tensor is set to nbatches, and the second dimension is inferred automatically (that is what "-1" means; it ends up being the number of words). The third dimension is the number of heads $h$. The fourth dimension is the per-head size $d_k = d_{model}/h$.\
Explaining the function is easy; <font color=red>what really matters is why we reshape the tensor like this.</font>\
First, what is the tensor that needs reshaping? Based on the code, we know that it's "l(x)". So what is "l(x)"? It is the result of applying a linear layer to the query, key or value tensor, because zip() yields the pairs (linear layer 1, query), (linear layer 2, key), (linear layer 3, value). Calling a linear layer invokes the forward function of nn.Linear, which takes a tensor as input and returns a tensor computed by the linear operation $Y=XW^T+b$. The detailed code is as follows:

![My Image](./imgs/1.png)

So, with the help of the loop, we apply the linear operation to each of the three tensors. The result of the linear operation is then reshaped to a tensor with the shape <font color=pink>(nbatches, number of words, number of heads, $d_k$)</font>. That is not enough, though. The output tensors then go through transpose(1,2), which swaps the number of words and the number of heads, so the new size of those tensors is <font color=pink>(nbatches, number of heads, number of words, $d_k$)</font>. 

Some supplementary description is needed here. First, <font color=purple>please note that right after those tensors go through the linear layers, they are still three-dimensional tensors. </font>For example, their size is [2,8,512], which can be illustrated as follows: 

<img src="./imgs/2.png" alt="My Image" width="500">

It is when they went through the view() function that they became four-dimensional. Plus, we need to know that the tensor going through the view() function has one more dimension that is used to represent the number of heads.

<img src="./imgs/3.png" alt="My Image" width="400">

So, this is the tensor after it has gone through the view() function. The core change is that the embedding of each word has been divided into four parts (heads). However, we want more: we want the vectors that belong to the same head to be grouped together, and that's why we need to transpose the tensor:

<img src="./imgs/4.png" alt="My Image" width="400">

We can see that, in this layout, every head gets the whole sequence of words, though each vector representing a word is just one part of that word's embedding. 
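
The two steps can be traced on a small tensor. This is only a sketch: the sizes (2 sentences, 5 words, 8 heads of size 64) are assumptions chosen so that every dimension is easy to tell apart, and `torch.arange` is used instead of real embeddings so that the values are predictable.

```python
import torch

nbatches, n_words, h, d_k = 2, 5, 8, 64
d_model = h * d_k                                   # 512
x = torch.arange(nbatches * n_words * d_model, dtype=torch.float32)
x = x.view(nbatches, n_words, d_model)              # stands in for l(x)

split = x.view(nbatches, -1, h, d_k)                # (2, 5, 8, 64): each embedding cut into h chunks
heads = split.transpose(1, 2)                       # (2, 8, 5, 64): chunks grouped by head

# Head 3 of word 4 in sentence 0 is exactly features 192..255 of that word.
same_slice = torch.equal(heads[0, 3, 4], x[0, 4, 3 * d_k:4 * d_k])
print(split.shape, heads.shape, same_slice)
```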

You may have another question: why don't we reshape those tensors directly into the shape (nbatches, number of heads, number of words, $d_k$), instead of first reshaping them into (nbatches, number of words, number of heads, $d_k$) and then transposing? The reason is that view() only regroups the elements in their existing order, whereas transpose() changes which element sits at which index. For example, take the tensor [1,2,3,4,5,6]. If we reshape it into (2,3) first, we get [[1,2,3],[4,5,6]]. Then we transpose dimensions $0$ and $1$ and get [[1,4],[2,5],[3,6]]. But if we directly reshape it into (3,2), we get [[1,2],[3,4],[5,6]]. The results are very different: a direct reshape would put each chunk of a word's features at the wrong head and word position. 
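
The small example from the paragraph above can be reproduced directly:

```python
import torch

t = torch.tensor([1, 2, 3, 4, 5, 6])

reshaped_then_transposed = t.view(2, 3).transpose(0, 1)   # reshape to (2, 3), then swap the axes
reshaped_directly = t.view(3, 2)                          # reshape straight to (3, 2)

print(reshaped_then_transposed.tolist())   # [[1, 4], [2, 5], [3, 6]]
print(reshaped_directly.tolist())          # [[1, 2], [3, 4], [5, 6]]
```

Both results have the shape (3, 2), but the elements are arranged differently, which is why the order of view() and transpose() matters.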

Now, reviewing the multi-head attention class, we find there is one statement left to explain. \
And that is: <font color=orange>x, self.attn = attention(query, key, value, mask=mask, dropout=self.dropout)</font>.\
In this statement, the three tensors are passed to the attention function, which returns two tensors: the attention output x and the attention weights p_attn (stored in self.attn). To better understand how the tensors change, see the following picture <font color=red>(taking one batch as an example, which is why the tensors below have 3 dimensions)</font>: 

<img src="./imgs/5.png" alt="My Image" width="400">

After the matrix product of $Q$ and $K^T$, we get a tensor $Q\cdot K^T$. According to the formula $\text{Attention}(Q,K,V)=\text{softmax}(\frac{QK^T}{\sqrt{d_k}})V$, we scale $Q\cdot K^T$ by $\frac{1}{\sqrt{d_k}}$, apply the softmax, and then multiply the result with the value tensor (after the softmax, the tensor is called p_attn):

<img src="./imgs/6.png" alt="My Image" width="400">
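
To see these shapes and the effect of the mask in practice, here is a small sketch that re-declares the attention function from above so that it runs on its own. The tensor sizes and the mask are made-up values for the example.

```python
import math
import torch
import torch.nn.functional as F

def attention(query, key, value, mask=None, dropout=None):
    # Same computation as the attention function defined earlier.
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn

torch.manual_seed(0)
q = torch.randn(1, 2, 4, 3)               # (batch, heads, number of words, d_k)
k = torch.randn(1, 2, 4, 3)
v = torch.randn(1, 2, 4, 3)
mask = torch.tensor([[[[1, 1, 1, 0]]]])   # hide the last key position from every query

out, p_attn = attention(q, k, v, mask=mask)
rows_sum_to_one = torch.allclose(p_attn.sum(-1), torch.ones(1, 2, 4))
masked_weight = p_attn[..., 3].max().item()   # largest weight given to the hidden position
print(out.shape, p_attn.shape, rows_sum_to_one, masked_weight)
```

The output keeps the shape of the query, while p_attn is a (number of words × number of words) weight matrix per head.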

And note that there are two last lines of code in the forward function:\
        x = x.transpose(1, 2).contiguous().view(nbatches, -1, self.h * self.d_k)\
        return self.linears\[-1](x)

First, these lines operate on the tensor x, which is the first element of the tuple returned by the attention function, i.e., the attention output calculated by $(\text{softmax}(\frac{QK^T}{\sqrt{d_k}}))V$. According to the illustration above, this tensor has the same shape as the input tensor $Q$, which is (nbatches, number of heads, number of words, $d_k$). The code first swaps the second and the third dimension of the tensor, and then merges the last two dimensions into one, so the result has the shape (nbatches, number of words, number of heads * $d_k$). The call to contiguous() is needed because view() cannot merge the dimensions of a transposed tensor whose memory layout no longer matches. Thinking about it a bit more, we find that this is actually <font color=MediumSlateBlue>returning the tensor to the original shape of the input tensor</font>, i.e., (nbatches, number of words, $d_{model}$), since number of heads * $d_k$ = $d_{model}$. Finally, the fourth linear layer, self.linears[-1], is applied as the output projection. As for the reason why we need to return the tensor to the original shape, we will discuss it in the next section.
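
The merge step can be checked in isolation. In this sketch `attn_out` is a random stand-in for the attention output (an assumption for the example), and the sizes are the same illustrative ones as before.

```python
import torch

nbatches, n_words, h, d_k = 2, 5, 8, 64
attn_out = torch.randn(nbatches, h, n_words, d_k)   # stand-in for the output of attention()

swapped = attn_out.transpose(1, 2)                  # (2, 5, 8, 64), no longer contiguous in memory
try:
    swapped.view(nbatches, -1, h * d_k)             # view() refuses to merge these dimensions
    view_failed = False
except RuntimeError:
    view_failed = True

merged = swapped.contiguous().view(nbatches, -1, h * d_k)   # (2, 5, 512)

# Features 192..255 of word 4 come from head 3, so the heads sit side by side again.
heads_concatenated = torch.equal(merged[0, 4, 3 * d_k:4 * d_k], attn_out[0, 3, 4])
print(swapped.is_contiguous(), view_failed, merged.shape, heads_concatenated)
```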

Now, let's focus on the next step:

In [7]:
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

This is the position-wise feed-forward network, which follows the attention sub-layer in every Encoder and Decoder layer. As we can see, there are two linear layers and one dropout in this class. The first linear layer takes a tensor and applies the linear transformation $Y=XW^T+b$ to it, projecting its d_model features into d_ff features. Note what that means. For example, if the size of an input tensor is (2,8,512), it can be represented as follows:

<img src="./imgs/2.png" alt="My Image" width="400">

We could see that one word is represented as a 512-dimension vector, or we could say that one word is represented by 512 features and those features form a vector. That's why the linear projection only operates on the last dimension of the tensor. It aims to scale the d_model features into d_ff features. 

The second linear layer will project the d_ff features back to d_model features. The dropout function is used to prevent overfitting.

In [8]:
    def forward(self, x):
        return self.w_2(self.dropout(F.relu(self.w_1(x))))

Because the forward function will be automatically called, we could regard this function as a representative of the whole class. And let's dive into this forward function now. We could see that this class first puts the input tensor $x$ through the first linear layer, and then applies the ReLU activation function to the result. Next, after going through the dropout layer, the tensor will be put through the second linear layer. The final tensor will be returned to its original shape.

Having explained what the feed-forward network does, we should focus on what really matters: why? Namely, we should know not only what the function does, but also why it does that. First, the first linear layer projects the input features into a higher-dimensional representation (d_ff is usually several times d_model, e.g., 2048 versus 512). Why do we need to do this? The output of the first linear layer is sent to the ReLU activation function, which introduces non-linearity by zeroing out the negative activations and keeping the positive ones. A higher-dimensional tensor gives the ReLU more features to select from, so we could say that the first linear layer helps the ReLU layer capture the more important features. After the ReLU layer, the tensor goes through the dropout layer, which is used to reduce overfitting. Finally, the second linear layer projects the tensor back to d_model features, so that the output has the same shape as the input and fits the remaining parts of the Transformer (for example, the residual connection that adds it back to the input).  
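
The word "position-wise" can be verified with a small experiment: feeding one word alone gives the same result as reading that word's row from the output for the whole sentence. The sketch below re-declares the class so that it runs on its own and uses small, made-up sizes (d_model=16, d_ff=64).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionwiseFeedForward(nn.Module):
    # Same layers as the class defined above.
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(F.relu(self.w_1(x))))

torch.manual_seed(0)
ffn = PositionwiseFeedForward(d_model=16, d_ff=64).eval()   # eval() switches dropout off
x = torch.randn(2, 8, 16)                                   # (batch, number of words, d_model)

with torch.no_grad():
    hidden_shape = F.relu(ffn.w_1(x)).shape    # features expanded to d_ff
    full = ffn(x)                              # whole sentences at once
    one_word = ffn(x[:, 3:4, :])               # only the word at position 3

position_wise = torch.allclose(full[:, 3:4, :], one_word, atol=1e-5)
print(hidden_shape, full.shape, position_wise)
```

No information flows between words inside this network; mixing across words happens only in the attention module.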

In [9]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        pe = torch.zeros(max_len, d_model) # Create a tensor with the shape of (max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1) # Create a tensor with the shape of (max_len, 1)
        # The argument "1" inserts a new dimension at index 1 of the tensor.
        div_term = torch.exp(torch.arange(0, d_model, 2) *
                             -(math.log(10000.0) / d_model)) # Equals 1 / 10000^(2i / d_model). An extra pair of parentheses around the negated factor would make this easier to read.
        pe[:, 0::2] = torch.sin(position * div_term) # "0::2" starts at index 0 and takes every second column, i.e., the even columns.
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe) # The value of buffered tensors will not be updated during the training process.

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)] # pe is a registered buffer, so it never requires gradients; the deprecated Variable wrapper is not needed.
        return self.dropout(x)

The role of this class is to add the positional encoding to the input tensor, as the code in the forward function shows. \
So, that is a rough idea of what we can expect from this class.

 Now, let's dive into the details. \
First, please note that "pe" stands for <font color=red>"position encoding"</font>. \
We know that "?::2" means starting from index ? and taking every second element. <font color=purple>This splits the columns into even and odd indices</font>:\
$PE(j,2i)=\sin\left(\frac{j}{10000^{\frac{2i}{d_{model}}}}\right)$\
$PE(j,2i+1)=\cos\left(\frac{j}{10000^{\frac{2i}{d_{model}}}}\right)$\
Plus, we know that the initial shape of PE is (max_len, d_model). Then, we calculate every element of PE and fill all the positions in this two-dimensional tensor. After that, we use the unsqueeze function to add a dimension of size 1 at dimension $0$, so the shape of PE becomes (1, max_len, d_model). 
What's even more important is the detailed transformation of div_term and position and how they interact with pe. First, we know the initialization of position is "torch.arange(0, max_len).unsqueeze(1)". We read it step by step. First, we get a one-dimensional tensor by "arange(0,max_len)" and its length is max_len. Then, we insert a new dimension at index $1$. Thus, the shape of the tensor becomes (max_len, 1). Because the elements in the position tensor are $0,1,2,\cdots,max\_len-1$, we can regard it as a position index.\
Then, let's focus on div_term, which is given by: torch.exp(torch.arange(0, d_model, 2) * (-(math.log(10000.0) / d_model))). First, "torch.arange(0, d_model, 2)" gives a tensor that contains the even numbers from 0 up to (but not including) d_model. Thus, its shape is (d_model/2,). Multiplying by $-\frac{\ln 10000}{d_{model}}$ and exponentiating gives $e^{-2i\ln(10000)/d_{model}} = \frac{1}{10000^{2i/d_{model}}}$, which is exactly the factor that multiplies $j$ in the formulas above. What the tensor "div_term" contains, and why, may be better explained by an illustration:

<img src="./imgs/7.png" alt="My Image" width="500">

I drew this illustration while sorting out my own thoughts, and I think it explains the content of the tensor "div_term" better than more words would. \
So, now we know the shapes of the tensors "position" and "div_term" and what exactly they contain. It's time to look at the tensor "pe". First, its shape is (max_len, d_model) and it is initialized with zeros. According to the code, its even columns are filled with the sine, and its odd columns with the cosine, of the element-wise product of "position" and "div_term". However, this is confusing at first, because the shape of position is (max_len,1) and the shape of div_term is (d_model/2,). So, how can they be multiplied? That's due to the broadcasting mechanism in PyTorch: both tensors are expanded to the shape (max_len, d_model/2) before the element-wise multiplication, as illustrated by the following example:

In [10]:
print(torch.tensor([[1],[2],[3]]) * torch.tensor([1,2,3]))

tensor([[1, 2, 3],
        [2, 4, 6],
        [3, 6, 9]])


So, now we finally have the pe tensor. The last thing is to figure out what the forward function does, which also involves technical details that are well worth exploring. First, this part explains why we set a variable named "max_len" at the beginning. We compute the $\sin$ and $\cos$ values for max_len positions in advance and <font color=pink>take as many as we need</font> <font color=green>(this is achieved by "x.size(1)")</font> when we actually add the position encoding. For example, if the input tensor "$x$" contains only 10 words, i.e., x.size(1) is 10, we only need the first 10 positions of the position encoding tensor "pe". And that's why the code "self.pe[:, :x.size(1)]" is used. After slicing the position encoding tensor to the matching length, we add it to the input tensor (broadcasting over the batch dimension) and return the result after it goes through the dropout layer.
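
The whole construction can be checked numerically. The sketch below rebuilds the table outside the class with small assumed sizes (max_len=50, d_model=16), compares div_term with the closed form $\frac{1}{10000^{2i/d_{model}}}$, checks one entry against the sine formula, and slices the table for a 10-word input.

```python
import math
import torch

max_len, d_model = 50, 16
position = torch.arange(0, max_len).unsqueeze(1)                  # (max_len, 1)
div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))

pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)                      # even columns
pe[:, 1::2] = torch.cos(position * div_term)                      # odd columns
pe = pe.unsqueeze(0)                                              # (1, max_len, d_model)

# div_term matches the closed form 1 / 10000^(2i / d_model).
closed_form = 1.0 / (10000.0 ** (torch.arange(0, d_model, 2) / d_model))
div_term_ok = torch.allclose(div_term, closed_form, rtol=1e-4)

# One entry checked against PE(j, 2i) = sin(j / 10000^(2i / d_model)).
j, i = 7, 3
expected = math.sin(j / 10000 ** (2 * i / d_model))
entry_ok = abs(pe[0, j, 2 * i].item() - expected) < 1e-5

# Only the first x.size(1) positions are used for a 10-word input.
x = torch.zeros(2, 10, d_model)
sliced = pe[:, :x.size(1)]
summed = x + sliced                                               # broadcast over the batch
print(div_term.shape, div_term_ok, entry_ok, sliced.shape, summed.shape)
```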

In [11]:
class LayerNorm(nn.Module):
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__() # super(LayerNorm, self) does not inherit from LayerNorm itself. It calls the constructor of LayerNorm's parent class, nn.Module.
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps # We need to use this parameter in the forward function, so this line is needed.

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2

The line `self.a_2 = nn.Parameter(torch.ones(features))` initializes a learnable parameter `a_2` for the `LayerNorm` class. This parameter is a tensor of ones whose length equals the number of features. The `nn.Parameter` wrapper indicates that this tensor should be considered a parameter of the module and will be updated during training. So, we could regard this statement as two parts: the first part creates a tensor and the second part makes this tensor a learnable parameter. The same thing happens to the tensor "b_2". Plus, note that the shape of the tensor "torch.ones(features)" is (features,): "features" itself is a scalar integer (the number of features), and the tensor is one-dimensional with that many elements.\
Please note that the "Parameter" is not an ordinary tensor. If we print this tensor, we will not simply get a tensor:

In [12]:
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        # Define a trainable parameter
        self.weight = nn.Parameter(torch.randn(3, 3))

    def forward(self, x):
        return x @ self.weight

model = MyModel()

# Inspect the parameters in the model
for param in model.parameters():
    print(param)


Parameter containing:
tensor([[ 0.6846, -0.4826, -0.4196],
        [ 0.1056,  0.9032, -0.8747],
        [ 0.8853,  0.1844,  0.8025]], requires_grad=True)


Now, let's dive into the forward function. The sentence "mean = x.mean(-1, keepdim=True)" means we will take the average along the last dimension of the tensor("-1" means the last dimension). And the parameter "keepdim=True" means that we will keep the dimension of the tensor after taking the average. For example, if we have a tensor with the shape of (2,3,4), after taking the average along the last dimension, the shape of the tensor will be (2,3,1). And see what will happen if we set "keepdim=False":

In [13]:
import torch

x = torch.randn(2, 3, 4)
mean = x.mean(-1, keepdim=True)
mean1= x.mean(-1, keepdim=False)
print(x)
print(mean.shape)
print(mean)
print(mean1)


tensor([[[ 0.3402, -0.2618, -0.5816,  2.0402],
         [-0.2955,  1.6525,  1.5717,  1.3036],
         [-1.1185,  0.7750, -0.6286, -2.2982]],

        [[-0.4599, -0.0984, -0.5961,  1.3229],
         [-0.7004,  0.8759,  2.3946, -0.0964],
         [ 2.0903,  1.7862,  0.6387,  0.2851]]])
torch.Size([2, 3, 1])
tensor([[[ 0.3843],
         [ 1.0581],
         [-0.8176]],

        [[ 0.0421],
         [ 0.6184],
         [ 1.2001]]])
tensor([[ 0.3843,  1.0581, -0.8176],
        [ 0.0421,  0.6184,  1.2001]])


See? The last dimension disappears and the three row averages of each matrix are fused into a single bracket.
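
Why does the forward function insist on keepdim=True? A short sketch of the reason: the subtraction "x - mean" relies on broadcasting, which only works when the reduced dimension is kept with size 1.

```python
import torch

x = torch.randn(2, 3, 4)

centered = x - x.mean(-1, keepdim=True)      # (2, 3, 4) - (2, 3, 1): broadcasts along the last dimension

try:
    x - x.mean(-1, keepdim=False)            # (2, 3, 4) - (2, 3): trailing sizes 4 and 3 do not match
    broadcast_failed = False
except RuntimeError:
    broadcast_failed = True

row_means_zero = torch.allclose(centered.mean(-1), torch.zeros(2, 3), atol=1e-6)
print(centered.shape, broadcast_failed, row_means_zero)
```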

As for the sentence "std = x.std(-1, keepdim=True)", it is doing the same thing as the sentence above. Simply changing the average into the standard deviation of the tensor will do.

As with the previous explanations, once we know what this code block does, we need to know why we do it this way, which leads to the explanation of LayerNorm. <font color=pink>First, we need to be aware that layer normalization is one kind of normalization.</font> There are other kinds, like batch normalization. What makes layer normalization distinct is that <font color=gold>it only normalizes over the dimension representing features, instead of over the whole batch</font>. That's why we take the average and standard deviation along the last dimension of the tensor. If you recall, we illustrated this in the previous section, where we split the input tensor into heads. According to illustration 2, the length of the embedding is 512 and each element represents a feature. After we reshape this tensor into a four-dimensional tensor, we split the features into four parts and each part is assigned to a head. In that case, the per-head part of the embedding is still the last dimension of the tensor. So, you may now have a deeper understanding of the saying: "the last dimension of the tensor is the dimension that represents features". Normalizing along this dimension means normalizing over the features. \
The other thing worth mentioning is the formula of the normalization: $\hat{x} = \frac{x - \mu}{\sigma + \epsilon}$, where $\mu$ is the mean and $\sigma$ is the standard deviation along the feature dimension. As for the $\epsilon$, it keeps the denominator from being 0. After the normalization, we rescale and shift the tensor with an element-wise affine transformation, $a_2 \cdot \hat{x} + b_2$, so that the network is not forced to keep zero mean and unit standard deviation. That's why we need the learnable parameters "a_2" and "b_2". Please note that those two parameters are learnable and will be updated during the training process.
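
A quick numerical check of the formula, re-declaring the class so that the snippet runs on its own. The input is deliberately shifted and scaled (mean about 5, standard deviation about 3) so that the effect of the normalization is visible; those numbers are arbitrary.

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    # Same definition as the class above.
    def __init__(self, features, eps=1e-6):
        super().__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2

torch.manual_seed(0)
x = torch.randn(2, 8, 512) * 3.0 + 5.0      # (batch, number of words, features)

with torch.no_grad():
    y = LayerNorm(512)(x)                   # a_2 = 1 and b_2 = 0 at initialization

mean_ok = torch.allclose(y.mean(-1), torch.zeros(2, 8), atol=1e-4)
std_ok = torch.allclose(y.std(-1), torch.ones(2, 8), atol=1e-4)
print(y.shape, mean_ok, std_ok)
```

Every word vector is normalized on its own, independently of the other words and of the other sentences in the batch.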

Now, let's focus on the next part:

In [14]:
class SublayerConnection(nn.Module):

    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))

One programming detail should be re-emphasized: the forward function is not called when we instantiate an object of this class; it is only called automatically when we call that object. For example, when we instantiate an object named "norm", the forward function is not called. As a result, the parameter "size" is not a parameter of the forward function; it is a parameter of the __init__ function. So when is the forward function called automatically? When we call the object "norm": norm(x). \
Now, let's turn back to this class. Everything it does is contained in the forward function, which consists of one statement: "return x + self.dropout(sublayer(self.norm(x)))". So what does this statement do? First, it uses the object "norm" to normalize the input tensor of the forward function of the class "SublayerConnection". After the input tensor is normalized, it is passed to a sublayer. This is interesting because the sublayer is a callable object that we pass in as an argument, and in this context it is most likely an instance of the MultiHeadedAttention class (or a function wrapping it), considering that we are studying the Transformer. It surprised me at first that a function parameter could be a callable object, but in Python any callable can be passed as an argument. At last, the tensor processed by those layers is added to the original tensor. This is a residual connection. Without the residual connection, if the tensor only went through the multi-head attention module, vanishing gradients could occur in a deep stack of layers. With the residual connection, the gradient can also flow through the identity path, which mitigates that issue. For the detailed explanation and computation, please refer to my other article about the residual connection. 
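
Here is a small sketch of both points: the sublayer argument can be any callable, and the residual path carries x through unchanged. To keep the snippet short, the built-in nn.LayerNorm is used as a stand-in for the custom LayerNorm class above (an assumption made only for this example), and dropout is set to 0.0 so the result is deterministic.

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    def __init__(self, size, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(size)      # stand-in for the custom LayerNorm defined earlier
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))

torch.manual_seed(0)
block = SublayerConnection(size=16, dropout=0.0)   # __init__ runs here, forward does not
x = torch.randn(2, 5, 16)

with torch.no_grad():
    out_module = block(x, nn.Linear(16, 16))               # forward runs here, sublayer is a module
    out_zero = block(x, lambda t: torch.zeros_like(t))     # sublayer is a plain lambda

identity_path = torch.equal(out_zero, x)    # a sublayer that outputs zeros leaves x untouched
print(out_module.shape, identity_path)
```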

Congratulations! We have finished the explanation of components of the Transformer. Now, let's figure out how to put them together and build a Transformer construction. \
First, let's take a look at the Encoder. 

In [15]:
class EncoderLayer(nn.Module):
    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn # An instance of the MultiHeadedAttention class is passed in here.
        self.feed_forward = feed_forward # An instance of the PositionwiseFeedForward class is passed in here.
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)

As usual, we start with the first line of the code: self.self_attn = self_attn. This is simple: it stores an object that will be used later. However, <font color=yellow>what's not simple is</font> how it is used in the forward function, because we can clearly see that this object is called with four arguments. <font color=blue>But we know that, even for the same class, different numbers of arguments are passed to the</font> \_\_init\_\_()<font color=blue> function (instantiating an object) and the forward function (using this object). Not to mention that self_attn could plausibly refer to either "attention" or "MultiHeadedAttention".</font> So what exactly is "self_attn"? The answer becomes clear by looking at the function "make_model", which constructs the whole Transformer. Looking at the code of this class alone, we may not really understand how those lines of code interact with each other and how the variables are passed around and computed. To do so, we need to combine the code of this class with how the whole Transformer is constructed at the end. As a result, I present the code that builds the whole Transformer in the following block:

In [16]:
def make_model(src_vocab, tgt_vocab, N=6, 
               d_model=512, d_ff=2048, h=8, dropout=0.1):
    "Helper: Construct a model from hyperparameters."
    c = copy.deepcopy
    attn = MultiHeadedAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab))
    
    # This was important from their code. 
    # Initialize parameters with Glorot / fan_avg.
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p) # In-place initializer; the version without the trailing underscore is deprecated.
    return model

Don't get hung up on the other definitions like "EncoderDecoder", "Embeddings" or "Generator"; simply focus on this one part of the statement: "EncoderLayer(d_model, c(attn), c(ff), dropout)". Here, we can see that the second argument of "EncoderLayer" is "c(attn)". We also know that "self_attn" is exactly the second parameter of "EncoderLayer" in the definition of this class. Now we are getting somewhere. The meaning of "self_attn" can be explained from "c(attn)". Please note that "c" is just another name for "copy.deepcopy". So, "c(attn)" is a deep copy of the object "attn", and attn is an instance of the MultiHeadedAttention class. So, we can say that "self_attn" is an instance of the MultiHeadedAttention class. As a result, the arguments passed to "self_attn" are the parameters of the forward function of the MultiHeadedAttention class, which are "query", "key", "value" and "mask". Note that in the forward function of the MultiHeadedAttention class, "mask=None" is just the default value; it doesn't have to be "None". The actual value depends on what we pass to it.  

So, let's turn back to the EncoderLayer class. The next variable defined is "feed_forward", which should be an instance of the PositionwiseFeedForward class. This instance is used in the "sublayer" list, which is a stack of two identical residual layers. The parameters of a residual layer (an instance of the SublayerConnection class) are an input tensor and a "sublayer"<font color=orange>(Note that this "sublayer" is different from the "sublayer" in the current "EncoderLayer" class. The "sublayer" in the SublayerConnection class can be any type of layer, and in this case it could be, for instance, the multi-head module. The "sublayer" in the EncoderLayer class is the stack of two residual layers)</font>. So feed_forward is the "sublayer" that is passed as an argument into the second of the two residual layers. And if you recall, the sublayer of the first residual layer is the self-attention module. <font color=tan>Please note that the input tensor "x" of the second residual layer is not the x from when it first appears; instead, it is the output tensor of the first residual layer.</font> Namely, the tensor "x" that is passed into the second residual layer is the output of the first residual layer, which wraps the self-attention module. Finally, we can draw pictures of what the EncoderLayer class builds; some details that are hard to explain in words are illustrated in them:

First, let's see what "lambda x: self.self_attn(x, x, x, mask)" does:\
You may wonder why a lambda function is used here. Simply put, the forward function of SublayerConnection calls its sublayer with a single argument, sublayer(self.norm(x)), whereas self_attn needs four: query, key, value and mask. The lambda function bridges the two: it takes one tensor, passes it as query, key and value, and supplies the mask from the enclosing forward function. Note that we cannot compute the attention result beforehand, store it in a variable and pass that along with x, because the attention must be applied to the normalized tensor, which only exists inside SublayerConnection.

<img src="./imgs/9.png" alt="My Image" width="300">

After that, this wrapped attention module is treated as a sublayer and passed, together with the input tensor "x", into the first layer of the stack of two residual layers, i.e., a residual layer whose sublayer is "multi-head attention". In that residual layer, the input tensor "x" first goes through a normalization layer and then through the sublayer, which is the multi-head attention module. After that, the output goes through a dropout layer and then reaches the residual connection, which simply adds the original input tensor. I draw this process in the following picture:

<img src="./imgs/10.png" alt="My Image" width="400">
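
The order "norm, sublayer, dropout, add the input" can be sketched in a few lines. This is only a sketch under assumptions: "ResidualSketch" is a hypothetical name, and it uses PyTorch's built-in nn.LayerNorm rather than the LayerNorm class defined in this notebook. With dropout switched off and a sublayer that returns zeros, the output must equal the input, which shows that the original "x" really is added back at the end.

```python
import torch
import torch.nn as nn

class ResidualSketch(nn.Module):
    """Hypothetical stand-in for SublayerConnection: x + dropout(sublayer(norm(x)))."""
    def __init__(self, size, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(size)        # assumption: built-in LayerNorm, not the notebook's class
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # 1) normalize, 2) apply the sublayer, 3) dropout, 4) add the untouched input
        return x + self.dropout(sublayer(self.norm(x)))

torch.manual_seed(0)
x = torch.randn(2, 5, 8)                      # (nbatches, seq_len, d_model), illustrative sizes
res = ResidualSketch(size=8, dropout=0.0)

zero_sublayer = lambda t: torch.zeros_like(t) # a sublayer that contributes nothing
out = res(x, zero_sublayer)

print(out.shape, torch.equal(out, x))
```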

Now, let's focus on the second sublayer of the stack of two residual layers:\
In this line of code, we can see that the input tensor is still called "x", but its value has already been changed by the first residual layer of the stack. To prevent confusion, we rename it "$x'$". The other argument of this residual layer is an instantiation of the "PositionwiseFeedForward" class. So, following the same chain of thought, the input tensor first goes through a norm layer and is then passed to the sublayer "feed_forward", the position-wise feed-forward network, which transforms each position of the tensor independently. After that, the output of the feed-forward network is processed by a dropout layer, and the original input tensor $x'$ is added to it. Thus, the illustration is:

<img src="./imgs/11.png" alt="My Image" width="400">

Then, putting them all together, we finally get the whole structure of one Encoder layer:

<img src="./imgs/12.png" alt="My Image" width="400">

So, the above is the specific structure of one Encoder layer. In practice, however, we do not use just one such layer; we use a stack of them. That stack is the Encoder class. So, let's see the Encoder class:

In [17]:
class Encoder(nn.Module):
    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, mask):
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

This class takes a layer as an input, which will in practice be an EncoderLayer, i.e., one Encoder layer. <font color=gold>(There is a programming detail worth mentioning. Although we cannot tell what the parameter "layer" is from this part of the code alone, we know that it will be an instantiation of EncoderLayer. The parameter is prepared for it, even though, from a pure programming point of view, it could be any module whose forward function accepts (x, mask) and which has a "size" attribute.)</font>\
Then, this class builds a stack of N copies of that single Encoder layer using the clones function. In the forward function, we can see what an Encoder instantiation actually does. It iterates through the Encoder layers in the stack, and in each iteration the current tensor "x" is updated by one Encoder layer. After the input tensor has gone through the whole stack, the output is normalized, and that is the final output of the Encoder. <font color=green>Thus, we can infer that the N Encoder layers do not work in parallel; instead, they process the input tensor one after another. </font>
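
A toy example makes the "one after another" behaviour visible. It is a sketch only: "AddOne" is a hypothetical layer that ignores the mask and adds 1 to its input, so after a stack of 6 clones the result is the input plus 6, which could not happen if the layers each worked on the original input in parallel.

```python
import copy
import torch
import torch.nn as nn

class AddOne(nn.Module):
    """Hypothetical toy layer with the same call signature as EncoderLayer.forward."""
    def forward(self, x, mask):
        return x + 1

def clones(module, N):
    # same helper as at the top of this notebook
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

layers = clones(AddOne(), 6)
x = torch.zeros(2, 3)
for layer in layers:          # same loop as in Encoder.forward
    x = layer(x, None)        # each layer receives the output of the previous one

print(len(layers), x[0, 0].item())
```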

Now, let's see the Decoder:

In [18]:
class DecoderLayer(nn.Module):
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        m = memory # Maybe it's for the sake of writing convenience.
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)

The first puzzle I met in this class is: what is "src_attn"? For those who are familiar with the attention mechanism, it is not hard to recall the attention score, which is usually written as $e_l$ and is calculated by $\text{score}(\boldsymbol{q},\boldsymbol{k}_l)$, where the score function generally falls into two cases: 1) the dot-product score function and 2) the additive score function. If you are interested in the details, you are very welcome to visit another note of mine that is devoted to this concept. However, what does it mean in this code? I assume it has a similar meaning, but we must be cautious. To figure this out, we again want to find out how it is used in practice. So, let's focus on the function "make_model", where the whole structure of the Transformer is constructed. In this function, we can see that when we instantiate an object of the class "DecoderLayer", the third argument is "c(attn)", which is also used as the second argument of this class and even in the EncoderLayer class. We already know that "c" is an alias of copy.deepcopy, so "c(attn)" is a deep copy of the object "attn", and "attn" is an instantiation of the MultiHeadedAttention class. So, "src_attn" is also an instantiation of the MultiHeadedAttention class. Combining this with the illustration of the Transformer, we know that there are two attention modules in the Decoder, and the first one provides the query vectors for the second one. So which one is the first attention module in the Decoder: "self_attn" or "src_attn"? The arguments let us tell them apart. 
If you still recall the illustration of the Decoder, you will find that the three inputs of the first attention module all come from the same tensor, while for the second attention module two of the inputs (key and value) are provided by the Encoder and one (query) is provided by the first attention module. With this in mind, we can easily tell that the first sublayer of the DecoderLayer class is the first attention module in the Decoder, because its three arguments are the same tensor, and that the second sublayer is the second attention module, because two of its three arguments are the same (the memory "m") and the remaining one is the output of the first sublayer, which is a really obvious hint. \
Please note that <font color=yellow>"tgt" stands for "target" </font>which makes sense because the input of the first attention module of the Decoder is masked, and that input is the so-called "target sequence". If you are familiar with the theory of the Transformer, you will know that during training we feed the model the sentences that it is supposed to generate, which are very close to the labels of a dataset. If you are interested in the details, you are welcome to visit another passage written by me. Anyway, we call those sentences the "target sequence", and that is why the mask tensor for the first attention module of the Decoder is called "tgt_mask". 
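
The difference between the two attention calls also shows up in the tensor shapes. The sketch below reuses the scaled dot-product attention from the top of this notebook, with the mask and dropout left out for brevity (a simplification), and it skips the multi-head projections entirely. The sizes tgt_len = 4 and src_len = 7 are made up for illustration.

```python
import math
import torch
import torch.nn.functional as F

def attention(query, key, value):
    # scaled dot-product attention as in this notebook, without mask and dropout
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    p_attn = F.softmax(scores, dim=-1)
    return torch.matmul(p_attn, value), p_attn

torch.manual_seed(0)
x = torch.randn(1, 4, 8)   # decoder-side tensor: (nbatches, tgt_len=4, d_model=8)
m = torch.randn(1, 7, 8)   # memory from the encoder: (nbatches, src_len=7, d_model=8)

_, p_self = attention(x, x, x)        # like self_attn(x, x, x, ...): all inputs are the same tensor
out_src, p_src = attention(x, m, m)   # like src_attn(x, m, m, ...): query from decoder, key/value from memory

print(p_self.shape, p_src.shape, out_src.shape)
```

In the second call every target position attends over the 7 source positions, yet the output keeps the target length, so it can be fed to the next sublayer of the DecoderLayer.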

<font color=red>STOP!STOP!STOP!</font> I need to clarify a programming detail that I ignored in the previous explanation of this code. Remember that we highlighted that there are actually two things called "sublayer" in the whole project. One is in the DecoderLayer class and the other is in the SublayerConnection class. The "sublayer" in the DecoderLayer class is the stack of three residual layers (so "sublayer[i]" is one residual layer), and the other "sublayer" is the parameter passed to a residual layer, which is usually an attention layer. To help us distinguish the two, I would like to rename the one in the SublayerConnection class, which represents the attention layer, as the "sub-sub-layer". In the previous explanation, I directly regarded the "sub-sub-layer" as the parameter of the sublayer (residual layer). That makes sense, but if we dive into the programming details, we find that when the "sub-sub-layer" is used in the forward function of the SublayerConnection class, only one argument is passed to it, namely "self.norm(x)". However, this "sub-sub-layer" takes four arguments in the forward function of the DecoderLayer class (to remind you, the "sub-sub-layer" appears as "self_attn" or "src_attn" in the DecoderLayer class). So, how is that possible? Looking at it from another angle does not help either: the second attention module of the Decoder does not accept just one tensor, it accepts tensors from two different sources. So, there must be some mystery behind the statement "return x + self.dropout(sublayer(self.norm(x)))", which says that there is only one input for the attention module (note that even after going through the normalization layer, one tensor is still one tensor). And that mystery is <font color=MediumSlateBlue>the "lambda function"</font>. 
The truth is, when we pass the "sub-sub-layer" to the sublayer (residual layer), we do not pass the "sub-sub-layer" directly. Instead, we seal it in an anonymous function that takes a single parameter "x" and calls the "sub-sub-layer" with (x, m, m, src_mask). So, when we call the parameter "sublayer" in the forward function of the class "SublayerConnection" (which is the "sub-sub-layer" we mentioned before), we are not calling the "sub-sub-layer" directly; we are calling the anonymous function that has sealed the "sub-sub-layer" in advance. The other arguments, "m", "m" and "src_mask", are captured by the anonymous function from the surrounding forward function, so the anonymous function only needs to take one argument from now on. That is why the statement "return x + self.dropout(sublayer(self.norm(x)))" passes only one argument to the "sub-sub-layer".

<img src="./imgs/13.png" alt="My Image" width="400">
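
The closure trick can be tried out without any tensors at all. In this sketch every name is a hypothetical stand-in: "src_attn" just records the arguments it receives, and "residual_call" mimics how SublayerConnection.forward calls its "sublayer" parameter with exactly one argument.

```python
def src_attn(query, key, value, mask):
    """Hypothetical stand-in for the attention module: records what it was called with."""
    return {"query": query, "key": key, "value": value, "mask": mask}

def residual_call(x, sublayer):
    """Mimics SublayerConnection.forward: the sublayer is called with ONE argument."""
    normed = f"norm({x})"          # stands in for self.norm(x)
    return sublayer(normed)

m, src_mask = "memory", "src_mask"

# The lambda takes one parameter and supplies the other three itself,
# capturing m and src_mask from the surrounding scope.
result = residual_call("x", lambda x: src_attn(x, m, m, src_mask))

print(result)
```

Only the query comes from the argument that the residual layer supplies; key, value and mask were fixed at the moment the lambda was written.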

In [19]:
class Decoder(nn.Module):
    "Generic N layer decoder with masking."

    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)

This is the same as the Encoder. I really have no patience to explain it again, but deep inside I know that, as a beginner, I do not deserve to say that. \
There is not much to say about the __init__ function: it clones a stack of N Decoder layers. As for the forward function, it takes the input tensor, the memory tensor, the src_mask tensor and the tgt_mask tensor as its parameters. You may be very confused about what "memory" is. From the previous code, we know that it is used as the key and value inputs of the second attention module in the Decoder. So, it is not hard to guess that "memory" is the output tensor of the Encoder. I am confident that this will be confirmed in the EncoderDecoder class, so for the current class it is better to simply skip it. Turning back to the forward function of the class Decoder: after taking the parameters mentioned above, it passes the input tensor through the stack of Decoder layers one by one and finally normalizes the output tensor. Note that the arguments that "layer" accepts may be a little confusing, so let's figure out what "layer" is. The answer is simple: it is an instantiation of the DecoderLayer class. So, the forward function of the DecoderLayer class takes exactly the same parameters as the forward function of the Decoder class. 

In [20]:
class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)

Now, let's introduce a basic module, without which we could not understand the classes "EncoderDecoder" and "Generator". \
First, in the constructor, an instantiation of "Embedding" is created. Note that the "Embedding" here is not "Embeddings": "Embedding" is a class provided by PyTorch. Its constructor takes two required arguments: the first is the number of distinct words (the vocabulary size) and the second is the size of the embedding vector of a word. So, there are "vocab" words in our vocabulary and each word is embedded into a vector of length "d_model". The "Embedding" class first creates a matrix whose size is vocab $\times$ d_model. Then, its instantiation takes a tensor of integer indices as input. In the simplest case this is a one-dimensional tensor, which we call the index vector (index tensors with more dimensions, such as a batch of sequences, are accepted as well). Suppose its size is "a"; the instantiation of class "Embedding" then returns a tensor of size (a, d_model). In that tensor, each row is the embedding of the corresponding index in the index vector. To help you understand this, here is an example: 

In [21]:
import torch
import torch.nn as nn

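# Define the embedding layer: vocabulary size 10, embedding dimension 5
embedding = nn.Embedding(10, 5)

# Create the input tensor (the indices)
input = torch.LongTensor([1, 2, 3, 4])

# Look up the corresponding embedding vectors
output = embedding(input)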
print(output)


tensor([[ 1.9670,  1.7187,  0.2507, -0.4365, -0.6057],
        [-0.1407,  0.2041, -0.3034,  0.8637, -0.2315],
        [-0.6936, -0.9771,  0.1209,  0.3547, -1.9757],
        [-0.6201,  0.4427, -1.1653, -0.8934,  1.6307]],
       grad_fn=<EmbeddingBackward0>)


As you can see, the tensor "input" is an index vector. The variable "embedding" is an instantiation of the class "Embedding". The arguments of the constructor are "10" and "5", so there is a pre-defined matrix of size (10, 5), and every row in this matrix is an embedding vector. Now, we pass the tensor "input" (the index vector) to the forward function. In our example, the elements of the index vector are "1", "2", "3" and "4". As a result, the output tensor consists of the rows with indices 1, 2, 3 and 4 of the pre-defined matrix, i.e., its second, third, fourth and fifth rows (because indexing starts at "0"). So, now you see how the "Embedding" class operates. 
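
The "row lookup" reading can be checked directly: indexing the weight matrix of the layer with the same indices should give exactly the same tensor. The sketch below repeats the setup of the example (the variable names "idx" and "batch_idx" are my own) and also shows the output shape for a two-dimensional index tensor, e.g. a batch of two sequences.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding = nn.Embedding(10, 5)                  # pre-defined matrix of size (10, 5)
idx = torch.LongTensor([1, 2, 3, 4])             # the index vector

rows = embedding(idx)
same_as_rows = torch.equal(rows, embedding.weight[idx])   # lookup == picking rows 1..4 of the matrix

batch_idx = torch.LongTensor([[1, 2], [3, 4]])   # 2 sequences of 2 indices each
batch_shape = tuple(embedding(batch_idx).shape)  # one d-dimensional vector per index

print(same_as_rows, batch_shape)
```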

Let's turn back to the class "Embeddings". The object defined in the constructor builds a pre-defined matrix whose size is given by the parameters "vocab" and "d_model". Then, in the forward function, the input tensor "x" is treated as an index tensor and the corresponding embedding tensor is returned. In addition, the result is multiplied by the scaling factor $\sqrt{d_{\text{model}}}$ (this is the "math.sqrt(self.d_model)" in the code; do not confuse it with the $\frac{1}{\sqrt{d_k}}$ used inside the attention function). That's all about the class Embeddings.
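
As a quick check of the scaling in the forward function, the sketch below repeats the Embeddings class from the cell above and compares its output with the raw lookup. The sizes are made up for illustration; with d_model = 16 the factor "math.sqrt(self.d_model)" is exactly 4.

```python
import math
import torch
import torch.nn as nn

class Embeddings(nn.Module):
    # same definition as in the cell above
    def __init__(self, d_model, vocab):
        super().__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)

torch.manual_seed(0)
emb = Embeddings(d_model=16, vocab=10)
x = torch.LongTensor([[1, 2, 3]])                 # one sequence of three word indices

scaled = emb(x)                                   # what the model actually uses
raw = emb.lut(x)                                  # plain lookup without scaling
matches = torch.allclose(scaled, raw * 4.0)       # sqrt(16) = 4

print(scaled.shape, matches)
```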

Now, although the class "EncoderDecoder" only stores the "Generator" and never calls it in its forward function, it is better for us to introduce this class in advance. 

In [24]:
class Generator(nn.Module):
    "Define standard linear + softmax generation step."
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return F.log_softmax(self.proj(x), dim=-1)

So, in the constructor, a linear layer is created whose input size is "d_model" and whose output size is "vocab". In other words, the number of features of the input vector is the length of an embedding and the number of features of the output vector is the number of words in the vocabulary. The constructor names this layer "proj", which stands for "linear projection". In the forward function, an input tensor "x" is taken as a parameter and the function "F.log_softmax" is applied along the last dimension of the projected tensor. Why the last dimension? Because after the projection the last dimension holds one score per word in the vocabulary, while the other dimensions only index the batch and the position in the sequence. What really confused me is why we need to project from the size "d_model" to the size "vocab". To answer this question, we first need to know what the model as a whole is for: <font color=Aqua>it is supposed to generate a table containing the probability distribution of the next word, by applying a linear projection and a softmax activation to the output tensor given by the Decoder</font>.\
This has cleared up one of my previous misunderstandings: the input is indeed a sequence like "I love cat." However, there is another thing, the "vocabulary table", which is used together with the input sequence. The input sequence is first converted into a series of indices by the "vocabulary table", and then those indices are taken by the embedding layer, which finds their corresponding embeddings. If you still recall, this is exactly the process of the class "Embeddings". <font color=Teal>Therefore, I boldly assume that the class "Embeddings" should be placed at the very front, so that it can convert the words of the input sequence into their corresponding embeddings.</font>\
So, we then need to know the output tensor of the Decoder. Turning back to the class "DecoderLayer", we recall that it holds two instantiations of the class "MultiHeadedAttention" ("self_attn" and "src_attn") and one feed-forward module, each wrapped in one of three residual layers, and that it outputs the processed input tensor. And if you still recall, the output of the multi-head module is reshaped back to (nbatches, seq_len, d_model), where seq_len is the number of tokens in the sequence. The residual layers keep this shape, so the output of the Decoder has the shape (nbatches, seq_len, d_model) as well. When this tensor goes through the linear layer in the Generator, its last dimension is projected to the size "vocab", giving a tensor of shape (nbatches, seq_len, vocab): for every position, a (log-)probability distribution over the words in the vocabulary table.
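
The shape change performed by the Generator can be verified with a random tensor standing in for the Decoder output. This is a sketch with made-up sizes; the class body is the same as in the cell above. Since the forward function returns log-probabilities, exponentiating them and summing over the last dimension should give 1 at every position.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    # same definition as in the cell above
    def __init__(self, d_model, vocab):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return F.log_softmax(self.proj(x), dim=-1)

torch.manual_seed(0)
nbatches, tgt_len, d_model, vocab = 2, 4, 8, 11           # illustrative sizes only
decoder_out = torch.randn(nbatches, tgt_len, d_model)     # stand-in for the Decoder output

log_probs = Generator(d_model, vocab)(decoder_out)        # (nbatches, tgt_len, vocab)
prob_sums = log_probs.exp().sum(dim=-1)                   # one sum per batch element and position

print(log_probs.shape, prob_sums)
```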

In [22]:
class EncoderDecoder(nn.Module):
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator

    def forward(self, src, tgt, src_mask, tgt_mask):
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)

So, what is the class EncoderDecoder and how does it work? First, let's focus on the biggest difference in this class: besides the forward function it has two extra functions, "encode" and "decode". So what are they and why do we need them? At the beginning, I was very confused about why they exist at all. We know that when we instantiate the class EncoderDecoder, we pass the Encoder and the Decoder as its first and second arguments. What's more, in the constructor (\_\_init\_\_), we have already defined the attributes that hold the Encoder and the Decoder, so the forward function could use them directly. I guess the two functions are there for the sake of writing convenience (one plausible benefit is that the two steps can then be called separately, e.g., encoding the source once and decoding step by step). However, I am not sure. But we should really move forward. The function "encode" takes the "source" and the "mask of the source" as its input and processes the "source" with "src_embed". So what is "src_embed"? We can find the answer in the function "make_model". When it instantiates the class "EncoderDecoder", it passes "nn.Sequential(Embeddings(d_model, src_vocab), c(position))" as the third argument. So, "src_embed" is the object created by "nn.Sequential(Embeddings(d_model, src_vocab), c(position))". What does this statement mean? It is somewhat like the clones function, which creates a stack of layers. "nn.Sequential" also creates a stack of layers; however, the layers in the stack can be different from each other, and calling the stack applies them one after another. In this line of code, there are two layers in total. The first is an embedding layer, whose constructor takes two scalars: "d_model" (the length of an embedding) and "src_vocab" (the number of words in the source vocabulary). This layer creates a pre-defined embedding matrix. 
When an input sequence arrives later, it has already been converted into an index tensor, and the layer looks up the corresponding embeddings in the pre-defined embedding matrix. Anyway, after the instantiation of "Embeddings", "nn.Sequential" puts an instantiation of "PositionalEncoding" into the stack as its second layer.  
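
To see that "nn.Sequential" simply applies its layers in order, here is a small sketch. Both layers are stand-ins chosen for illustration: a plain nn.Embedding takes the place of "Embeddings", and the hypothetical "AddPositionIndex" module takes the place of "PositionalEncoding" (it just adds the position number to every feature, which is not the real sinusoidal encoding).

```python
import torch
import torch.nn as nn

class AddPositionIndex(nn.Module):
    """Hypothetical stand-in for PositionalEncoding: adds the position index to every feature."""
    def forward(self, x):                                   # x: (nbatches, seq_len, d_model)
        pos = torch.arange(x.size(1), dtype=x.dtype).view(1, -1, 1)
        return x + pos

torch.manual_seed(0)
d_model, src_vocab = 8, 10                                  # illustrative sizes only
embed = nn.Embedding(src_vocab, d_model)                    # stand-in for Embeddings(d_model, src_vocab)
src_embed = nn.Sequential(embed, AddPositionIndex())        # two different layers, applied in order

src = torch.LongTensor([[1, 2, 3]])                         # one source sequence, already converted to indices
out = src_embed(src)
manual = AddPositionIndex()(embed(src))                     # the same two steps written out by hand

print(out.shape, torch.equal(out, manual))
```

Calling "src_embed(src)" is therefore the same as first embedding the indices and then adding the positional information, which is what "self.src_embed(src)" does inside "encode".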