In [23]:
import torch
import torch.nn as nn

import numpy as np

In [16]:
import inspect

In [2]:
torch.cuda.is_available()

True

## Transformer

This notebook walks through each module of the Transformer architecture, using the paper [Attention Is All You Need](https://arxiv.org/pdf/1706.03762) as a reference.

![Transformer Architecture](images/architecture.png)

### Input Embeddings
The inputs to this layer are called **input IDs**. They are produced by tokenizing the input text and mapping each token to an integer. The embedding layer converts each input ID into a vector of dimension `d_model`.
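
As a toy illustration of where input IDs come from, the sketch below splits text on whitespace and maps each word to an integer. The vocabulary `toy_vocab`, the `<unk>` fallback, and the helper `encode` are hypothetical names chosen for this example. Real tokenizers usually work on subword units and are considerably more involved.

```python
# Hypothetical toy vocabulary: every known token gets an integer ID.
toy_vocab = {"<unk>": 0, "attention": 1, "is": 2, "all": 3, "you": 4, "need": 5}


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Lowercase, split on whitespace, and map each token to its ID."""
    unk_id = vocab["<unk>"]  # unknown words fall back to this ID
    return [vocab.get(token, unk_id) for token in text.lower().split()]


input_ids = encode("Attention is all you need", toy_vocab)
print(input_ids)  # [1, 2, 3, 4, 5]
```

A list of integers like `input_ids` is what the embedding layer receives as input.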

PyTorch provides this layer as `nn.Embedding`.

In [17]:
inspect.signature(nn.Embedding)

<Signature (num_embeddings: int, embedding_dim: int, padding_idx: Optional[int] = None, max_norm: Optional[float] = None, norm_type: float = 2.0, scale_grad_by_freq: bool = False, sparse: bool = False, _weight: Optional[torch.Tensor] = None, _freeze: bool = False, device=None, dtype=None) -> None>

`nn.Embedding` is essentially a lookup table. For a vocabulary of size `num_embeddings`, it maps each ID to a vector of size `embedding_dim`. Here is an example:

In [5]:
tmp_embedding = nn.Embedding(5, 10)

`tmp_embedding` converts a given input ID into a 10-dimensional vector. The first argument to `nn.Embedding` specifies that an ID can take up to 5 different values (0 <= ID < 5).

Let us try to embed a few valid IDs.

In [12]:
tmp_x = torch.tensor([1, 2, 3])
tmp_y = tmp_embedding(tmp_x)
print(tmp_y)
print(f'\nInput shape: {tmp_x.shape}\nOutput shape: {tmp_y.shape}')

tensor([[ 3.0246,  1.4152,  1.0616, -0.5277,  1.5843,  1.1386,  1.2152, -0.6311,
         -1.5814,  0.3376],
        [ 0.2341, -1.2184,  0.5547, -0.7850,  0.2350,  0.1653, -1.1052, -0.2977,
         -0.5672, -0.4266],
        [ 0.9919,  0.4291,  0.4355,  1.4797,  1.6144, -0.4034, -0.2371,  0.7966,
         -0.7457, -0.5895]], grad_fn=<EmbeddingBackward0>)

Input shape: torch.Size([3])
Output shape: torch.Size([3, 10])


Here the input was a tensor with 3 IDs, and the output is a 10-dimensional vector for each ID, stacked into a (3, 10) tensor. Since the `num_embeddings` argument was set to 5, the lookup should fail for any ID >= 5. Let us test that out.

In [14]:
try:
    tmp_x = torch.tensor([5])
    tmp_y = tmp_embedding(tmp_x)
except IndexError:
    print("index out of range")

index out of range
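
The lookup-table description can be checked directly: indexing the layer's `weight` matrix with the same IDs should return exactly what the layer returns. The sketch below uses fresh names (`demo_embedding`, `demo_ids`, `batch_ids`) that are assumptions for this example. It also shows that a batched input of shape `(batch, seq_len)` simply gains a trailing dimension of size `embedding_dim`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only to make the random initial weights repeatable

demo_embedding = nn.Embedding(num_embeddings=5, embedding_dim=10)

# The layer stores one row per ID in a (num_embeddings, embedding_dim) matrix.
print(demo_embedding.weight.shape)  # torch.Size([5, 10])

# Calling the layer is the same as picking rows of that matrix by ID.
demo_ids = torch.tensor([1, 2, 3])
same_as_indexing = torch.equal(demo_embedding(demo_ids), demo_embedding.weight[demo_ids])
print(same_as_indexing)  # True

# A batch of 2 sequences with 3 IDs each gains a trailing embedding dimension.
batch_ids = torch.tensor([[0, 1, 2], [2, 3, 4]])
batch_out = demo_embedding(batch_ids)
print(batch_out.shape)  # torch.Size([2, 3, 10])
```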


Now that we understand what the embedding layer does, let us build a class, `InputEmbeddings`, for the input embeddings part of the Transformer.

In [19]:
class InputEmbeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model = d_model # embeddings dimension
        self.vocab_size = vocab_size # num of items in vocabulary
        self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=d_model)

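In the original paper, the embedding weights are multiplied by $\sqrt{d_{model}}$. The `forward` method below, which belongs inside the `InputEmbeddings` class, applies this scaling to the output of the lookup, which has the same effect.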

In [24]:
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x holds integer token IDs; the output adds a trailing dimension of size d_model
        return self.embedding(x) * np.sqrt(self.d_model)
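
For reference, here is a sketch that puts the constructor and `forward` into a single class so it can be run as one cell. It uses `math.sqrt` in place of `np.sqrt` (a substitution made here only to avoid the NumPy dependency), and the sizes in the usage lines are arbitrary example values rather than anything taken from this notebook.

```python
import math

import torch
import torch.nn as nn


class InputEmbeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model = d_model        # embedding dimension
        self.vocab_size = vocab_size  # number of items in the vocabulary
        self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len) integer IDs -> (batch, seq_len, d_model) scaled vectors
        return self.embedding(x) * math.sqrt(self.d_model)


# Arbitrary example sizes (assumptions for this sketch).
emb = InputEmbeddings(d_model=16, vocab_size=100)
ids = torch.tensor([[1, 5, 7], [2, 0, 99]])
out = emb(ids)
print(out.shape)  # torch.Size([2, 3, 16])

# The output is the raw lookup multiplied by sqrt(16) = 4.
print(torch.allclose(out, emb.embedding(ids) * 4.0))  # True
```

With `d_model = 512`, the size used by the paper's base model, the scaling factor would be $\sqrt{512} \approx 22.63$.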