In [26]:
# Implement the architecture from "Attention Is All You Need": https://arxiv.org/abs/1706.03762
import torch
import torch.nn as nn

## Basic architecture 

![image.png](https://cdn.fs.teachablecdn.com/ADNupMnWyR7kCWRvm76Laz/https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12cdf506-6cd8-4afa-93a3-b77b82770309_2755x1570.png)

#### Position Embeddings

![image.gif](https://i.imgur.com/KgZCdzX.gif)

The typical way to set the values of the position embedding is to hard-code them using sine and cosine functions of each vector's position in the sequence and each element's index within the vector.

![image.png](https://cdn.fs.teachablecdn.com/ADNupMnWyR7kCWRvm76Laz/https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58a77f49-ed6d-4614-9c64-505455bd0c83_2043x1300.png)
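
For reference, the sinusoidal encoding as defined in the paper can be written out as follows, where $pos$ is the token position (a row of the matrix) and $i$ indexes a sine/cosine pair of columns, so it runs from $0$ to $d_{model}/2 - 1$:

$$
PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

For example, at $pos = 0$ every even column is $\sin(0) = 0$ and every odd column is $\cos(0) = 1$.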

In [30]:
class PositionalEncoding(nn.Module):
    def __init__(self, context_size: int, d_model: int):
        """Represent the positional encoding as a hard-coded matrix of size (context_size, d_model).

        Args:
            context_size (int): max context size
            d_model (int): model hidden size (assumed to be even)
        """
        super().__init__()  # nn.Module must be initialized before the module can be called
        encoding = torch.zeros(
            size=(context_size, d_model)
        )  # placeholder matrix of the encoding, check above figures (orange matrix)
        pos = torch.arange(0, context_size).unsqueeze(
            dim=1
        )  # positions range from 0 to context_size - 1 (row indexes of the orange matrix in the above figures)
        i = torch.arange(
            0, d_model, 2
        )  # even column indexes 0, 2, 4, ... (these already are the 2i of the paper's formula)
        arg = pos / (10000 ** (i / d_model))  # no extra factor of 2: i already steps by 2
        encoding[:, 0::2] = torch.sin(arg)  # even columns
        encoding[:, 1::2] = torch.cos(arg)  # odd columns
        # a buffer follows the module across .to(device) calls and is not a trainable parameter
        self.register_buffer("encoding", encoding)
    def forward(self, tokens_sequence: torch.Tensor) -> torch.Tensor:
        """Look up the positional encodings for an embedded tokens sequence.

        Args:
            tokens_sequence (torch.Tensor): embedded tokens of shape (batch_size, seq_len, d_model)

        Returns:
            torch.Tensor: positional encodings of shape (seq_len, d_model), to be added to the embedded tokens
        """
        return self.encoding[
            : tokens_sequence.shape[1], :
        ]  # slice the first seq_len rows of the self.encoding matrix

#### Encoder Block

The encoder block is composed of a multi-head attention layer, a position-wise feed-forward network, and two layer normalizations.

![img.png](https://cdn.fs.teachablecdn.com/ADNupMnWyR7kCWRvm76Laz/https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6627a678-0582-4950-a829-a8e9e4e97db9_3289x1326.png)
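
A minimal sketch of such a block is shown below. It assumes the post-norm layout of the original paper (a residual connection around each sub-layer, followed by layer normalization) and relies on PyTorch's built-in `nn.MultiheadAttention`. The class name `EncoderBlockSketch` and the parameters `n_heads` and `d_ff` are illustrative assumptions, not names used elsewhere in this notebook.

```python
import torch
import torch.nn as nn


class EncoderBlockSketch(nn.Module):
    """Hypothetical encoder block: self-attention and feed-forward, each followed by add & norm."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        # position-wise feed-forward network: applied to every position independently
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch_size, seq_len, d_model)
        attended, _ = self.attention(x, x, x)  # self-attention: queries, keys and values all come from x
        x = self.norm1(x + attended)  # residual connection, then first layer normalization
        x = self.norm2(x + self.feed_forward(x))  # residual connection, then second layer normalization
        return x


encoder_block = EncoderBlockSketch(d_model=16, n_heads=4, d_ff=32)
encoder_block_out = encoder_block(torch.randn(2, 5, 16))
print(encoder_block_out.shape)  # torch.Size([2, 5, 16])
```

The block preserves the input shape, which is what allows several of them to be stacked.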

The encoder is just the token embedding and the position embedding followed by multiple encoder blocks.

![img.png](https://cdn.fs.teachablecdn.com/ADNupMnWyR7kCWRvm76Laz/https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3808c9f-715e-4ab0-be11-34e16b3d8644_3540x1022.png)
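
The composition described above could be sketched as follows. To keep the example self-contained, it uses PyTorch's built-in `nn.TransformerEncoderLayer` as a stand-in for the encoder block and recomputes the sinusoidal table in a small helper. `EncoderSketch`, `sinusoidal_table`, and all hyperparameter values are assumptions made for illustration.

```python
import torch
import torch.nn as nn


def sinusoidal_table(context_size: int, d_model: int) -> torch.Tensor:
    """Sinusoidal position table of shape (context_size, d_model); d_model is assumed even."""
    pos = torch.arange(context_size).unsqueeze(1)  # (context_size, 1)
    two_i = torch.arange(0, d_model, 2)  # even column indexes 0, 2, 4, ...
    arg = pos / (10000 ** (two_i / d_model))
    table = torch.zeros(context_size, d_model)
    table[:, 0::2] = torch.sin(arg)
    table[:, 1::2] = torch.cos(arg)
    return table


class EncoderSketch(nn.Module):
    """Hypothetical encoder: token embedding + position encoding, then a stack of blocks."""

    def __init__(self, vocab_size, context_size, d_model, n_heads, d_ff, n_blocks):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.register_buffer("positions", sinusoidal_table(context_size, d_model))
        # built-in layer used as a stand-in for the encoder block described above
        self.blocks = nn.ModuleList(
            [
                nn.TransformerEncoderLayer(
                    d_model=d_model, nhead=n_heads, dim_feedforward=d_ff,
                    dropout=0.0, batch_first=True,
                )
                for _ in range(n_blocks)
            ]
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch_size, seq_len) integer ids
        x = self.token_embedding(token_ids)  # (batch_size, seq_len, d_model)
        x = x + self.positions[: token_ids.shape[1]]  # broadcast over the batch dimension
        for block in self.blocks:
            x = block(x)
        return x


encoder = EncoderSketch(vocab_size=100, context_size=32, d_model=16, n_heads=4, d_ff=32, n_blocks=2)
encoder_output = encoder(torch.randint(0, 100, (2, 5)))
print(encoder_output.shape)  # torch.Size([2, 5, 16])
```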

#### Decoder Block

The decoder block is composed of a masked multi-head self-attention layer, a cross-attention layer, a position-wise feed-forward network, and three layer normalizations.

![img.png](https://cdn.fs.teachablecdn.com/ADNupMnWyR7kCWRvm76Laz/https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0287aa3-7a69-41c4-a692-c1940e007f29_3301x1582.png)
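
As a rough sketch, a decoder block with these sub-layers might look like the code below. It assumes the post-norm layout of the original paper and a causal mask on the self-attention so that a position cannot attend to later positions. `DecoderBlockSketch`, `n_heads`, and `d_ff` are hypothetical names chosen for this example.

```python
import torch
import torch.nn as nn


class DecoderBlockSketch(nn.Module):
    """Hypothetical decoder block: masked self-attention, cross-attention, feed-forward."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.self_attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.cross_attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, encoder_output: torch.Tensor) -> torch.Tensor:
        seq_len = x.shape[1]
        # boolean mask: True marks future positions that must NOT be attended to
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        attended, _ = self.self_attention(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + attended)
        # queries come from the decoder, keys and values from the encoder output
        crossed, _ = self.cross_attention(x, encoder_output, encoder_output)
        x = self.norm2(x + crossed)
        return self.norm3(x + self.feed_forward(x))


decoder_block = DecoderBlockSketch(d_model=16, n_heads=4, d_ff=32)
decoder_states = torch.randn(2, 3, 16)  # (batch_size, target_len, d_model)
memory = torch.randn(2, 7, 16)  # (batch_size, source_len, d_model)
decoder_block_out = decoder_block(decoder_states, memory)
print(decoder_block_out.shape)  # torch.Size([2, 3, 16])
```

The target and source lengths can differ: the output keeps the decoder's sequence length.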

The cross-attention layer computes attention between the decoder's hidden states and the encoder output: the queries come from the decoder, while the keys and values come from the encoder output.

![img.png](https://cdn.fs.teachablecdn.com/ADNupMnWyR7kCWRvm76Laz/https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fa4f653-3985-40ac-932d-3eb023be2eb0_2723x1332.png)
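
To make the data flow concrete, here is a stripped-down sketch of cross-attention as single-head scaled dot-product attention. It deliberately omits the learned query, key, and value projections and the multiple heads of a real layer. The function name `cross_attention_sketch` and the tensor sizes are assumptions for illustration.

```python
import math

import torch


def cross_attention_sketch(decoder_states: torch.Tensor, encoder_output: torch.Tensor):
    """Single-head scaled dot-product cross-attention without learned projections."""
    d_model = decoder_states.shape[-1]
    # queries: decoder states; keys and values: encoder output
    scores = decoder_states @ encoder_output.transpose(-2, -1) / math.sqrt(d_model)
    weights = torch.softmax(scores, dim=-1)  # (batch_size, target_len, source_len)
    context = weights @ encoder_output  # (batch_size, target_len, d_model)
    return context, weights


decoder_hidden = torch.randn(2, 3, 16)  # 3 target positions
encoder_states = torch.randn(2, 7, 16)  # 7 source positions
context, weights = cross_attention_sketch(decoder_hidden, encoder_states)
print(context.shape, weights.shape)  # torch.Size([2, 3, 16]) torch.Size([2, 3, 7])
```

Each row of `weights` sums to 1 and describes how much one decoder position draws from each encoder position.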

The decoder is just the token embedding and the position embedding followed by multiple decoder blocks and the predicting head.

![img.png](https://cdn.fs.teachablecdn.com/ADNupMnWyR7kCWRvm76Laz/https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0808f5de-f713-4a56-8750-ac1cda39b929_2753x1542.png)
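
A possible sketch of the full decoder is given below. It uses PyTorch's built-in `nn.TransformerDecoderLayer` as a stand-in for the decoder block and a small helper for the sinusoidal table so that the example runs on its own. `DecoderSketch`, `sinusoidal_table`, and the hyperparameter values are illustrative assumptions.

```python
import torch
import torch.nn as nn


def sinusoidal_table(context_size: int, d_model: int) -> torch.Tensor:
    """Sinusoidal position table of shape (context_size, d_model); d_model is assumed even."""
    pos = torch.arange(context_size).unsqueeze(1)
    two_i = torch.arange(0, d_model, 2)
    arg = pos / (10000 ** (two_i / d_model))
    table = torch.zeros(context_size, d_model)
    table[:, 0::2] = torch.sin(arg)
    table[:, 1::2] = torch.cos(arg)
    return table


class DecoderSketch(nn.Module):
    """Hypothetical decoder: embeddings, a stack of decoder blocks, and a predicting head."""

    def __init__(self, vocab_size, context_size, d_model, n_heads, d_ff, n_blocks):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.register_buffer("positions", sinusoidal_table(context_size, d_model))
        # built-in layer used as a stand-in for the decoder block described above
        self.blocks = nn.ModuleList(
            [
                nn.TransformerDecoderLayer(
                    d_model=d_model, nhead=n_heads, dim_feedforward=d_ff,
                    dropout=0.0, batch_first=True,
                )
                for _ in range(n_blocks)
            ]
        )
        self.head = nn.Linear(d_model, vocab_size)  # predicting head

    def forward(self, token_ids: torch.Tensor, encoder_output: torch.Tensor) -> torch.Tensor:
        seq_len = token_ids.shape[1]
        x = self.token_embedding(token_ids) + self.positions[:seq_len]
        # True marks future positions that must not be attended to
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=token_ids.device), diagonal=1
        )
        for block in self.blocks:
            x = block(x, encoder_output, tgt_mask=causal_mask)
        return self.head(x)  # logits of shape (batch_size, seq_len, vocab_size)


decoder = DecoderSketch(vocab_size=100, context_size=32, d_model=16, n_heads=4, d_ff=32, n_blocks=2)
decoder_logits = decoder(torch.randint(0, 100, (2, 3)), torch.randn(2, 7, 16))
print(decoder_logits.shape)  # torch.Size([2, 3, 100])
```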

The predicting head is just a linear layer that projects the last hidden states from the d_model dimension to the size of the vocabulary. To predict, we apply a softmax to turn the resulting logits into probability vectors and take the ArgMax of each.

![img.png](https://cdn.fs.teachablecdn.com/ADNupMnWyR7kCWRvm76Laz/https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc452d478-581f-4baf-941f-0ab07a39bdb3_3386x1342.png)
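
The prediction step can be sketched in a few lines. The sizes `d_model = 16` and `vocab_size = 50` and the random hidden states are placeholders chosen only for this example.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 16, 50  # illustrative sizes, not taken from the notebook
head = nn.Linear(d_model, vocab_size)  # the predicting head

hidden_states = torch.randn(2, 3, d_model)  # (batch_size, seq_len, d_model)
logits = head(hidden_states)  # (batch_size, seq_len, vocab_size)
probabilities = torch.softmax(logits, dim=-1)  # each vector sums to 1 over the vocabulary
predicted_ids = probabilities.argmax(dim=-1)  # (batch_size, seq_len) token ids

print(logits.shape, predicted_ids.shape)  # torch.Size([2, 3, 50]) torch.Size([2, 3])
```

Because softmax preserves the ordering of the logits, taking the argmax of the logits directly selects the same token in practice; the softmax matters when the probabilities themselves are needed, for example for sampling or for a loss.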