<a href="https://colab.research.google.com/github/AyoubMDL/transformers_from_scratch/blob/main/positional_encoding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import torch
import torch.nn as nn

Criteria for a positional encoding:
1. It should output a unique encoding for each time-step (a word’s position in a sentence).

2. The distance between any two time-steps should be consistent across sentences of different lengths.

3. The model should generalize to longer sentences without extra effort, so the encoding's values should be bounded.

4. It must be deterministic.
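These criteria can be checked numerically. The sketch below is a rough sanity check, not a proof: `sinusoidal_pe` is a hypothetical helper (not part of this notebook's code) that implements the sinusoidal formula given further down using only the standard library, and it assumes an even `d_model`.

```python
import math

def sinusoidal_pe(max_seq_length, d_model):
    """Hypothetical helper: sinusoidal encodings as a list of rows (assumes even d_model)."""
    rows = []
    for pos in range(max_seq_length):
        row = []
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row.extend([math.sin(angle), math.cos(angle)])
        rows.append(row)
    return rows

short = sinusoidal_pe(10, 6)
long = sinusoidal_pe(50, 6)

# Criterion 1: every position gets a distinct encoding.
unique = len({tuple(row) for row in long}) == len(long)
# Criterion 2: a position's encoding does not depend on the sentence length,
# so distances between positions are the same in short and long sentences.
length_independent = short == long[:10]
# Criterion 3: all values stay within [-1, 1].
bounded = all(-1.0 <= v <= 1.0 for row in long for v in row)
# Criterion 4: recomputing gives exactly the same table.
deterministic = sinusoidal_pe(10, 6) == short

print(unique, length_independent, bounded, deterministic)
```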

Sinusoidal positional encoding has a number of advantages:

1. The sine and cosine functions have values in [-1, 1], which keeps the values of the positional encoding matrix in a normalized range.
2. As the sinusoid for each position is different, you have a unique way of encoding each position.
3. You have a way of measuring or quantifying the similarity between different positions, hence enabling you to encode the relative positions of words.
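One way to see the third point: the dot product between two encodings reduces to a sum of cos(offset / denominator) terms, because sin(a)sin(b) + cos(a)cos(b) = cos(a - b). It therefore depends only on the offset between the positions. The sketch below illustrates this with two hypothetical helpers, `encode` and `similarity`, written with the standard library and assuming `d_model = 6`.

```python
import math

def encode(pos, d_model=6):
    """Hypothetical helper: the sinusoidal encoding of a single position."""
    vec = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        vec.extend([math.sin(angle), math.cos(angle)])
    return vec

def similarity(p, q):
    """Dot product between the encodings of positions p and q."""
    return sum(a * b for a, b in zip(encode(p), encode(q)))

# Same offset (3) at different absolute positions: the similarity matches.
sim_a = similarity(2, 5)
sim_b = similarity(40, 43)
print(round(sim_a, 6), round(sim_b, 6))

# Different offsets (1 and 7) give different similarity values.
print(round(similarity(2, 3), 6), round(similarity(2, 9), 6))
```

Note that the similarity is not monotonic in the offset, since each term is a cosine. It identifies the offset rather than ranking positions by distance.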

The positional encoding formula is given by:
\begin{align*}
PE_{(pos, 2i)} &= \sin\left(\frac{{pos}}{{10000^{\frac{{2i}}{{d_{\text{model}}}}}}}\right) \\
PE_{(pos, 2i+1)} &= \cos\left(\frac{{pos}}{{10000^{\frac{{2i}}{{d_{\text{model}}}}}}}\right)
\end{align*}
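As a worked example of the formula, the sketch below evaluates it for a single position (`pos = 1`) with `d_model = 6`, using only the standard library. The variable names are chosen for illustration.

```python
import math

d_model = 6
pos = 1

row = []
for i in range(d_model // 2):
    denominator = 10000 ** (2 * i / d_model)
    row.append(math.sin(pos / denominator))  # dimension 2i
    row.append(math.cos(pos / denominator))  # dimension 2i + 1

print([round(v, 4) for v in row])
# [0.8415, 0.5403, 0.0464, 0.9989, 0.0022, 1.0]
```

These values agree with the row for position 1 in the `pe` tensor computed later in this notebook.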

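Equivalent formula, indexed directly by the dimension i. For odd i the exponent uses i - 1, so each cosine shares its denominator with the sine just before it:

\begin{align*}
PE_{(pos, i)} &= \sin\left(\frac{{pos}}{{10000^{\frac{{i}}{{d_{\text{model}}}}}}}\right) \quad \text{when } i \text{ is even} \\
PE_{(pos, i)} &= \cos\left(\frac{{pos}}{{10000^{\frac{{i-1}}{{d_{\text{model}}}}}}}\right) \quad \text{when } i \text{ is odd}
\end{align*}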
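To confirm that the two ways of writing the formula agree, the short check below compares the exponent each one assigns to every dimension. It assumes `d_model = 6`, and the variable names are illustrative only.

```python
d_model = 6

matches = []
for i in range(d_model):
    # Exponent under the per-dimension formula: i if even, i - 1 if odd.
    per_dimension_exponent = (i if i % 2 == 0 else i - 1) / d_model
    # Exponent under the paired formula: dimension i belongs to pair i // 2.
    paired_exponent = 2 * (i // 2) / d_model
    kind = "sin" if i % 2 == 0 else "cos"
    matches.append(per_dimension_exponent == paired_exponent)
    print(i, kind, per_dimension_exponent == paired_exponent)
```

Every dimension prints `True`, so dimensions 0 and 1 share one denominator, 2 and 3 share the next, and so on.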


In [3]:
max_seq_length = 10
d_model = 6

In [4]:
even_i = torch.arange(0, d_model, 2).float()
even_i

tensor([0., 2., 4.])

In [6]:
even_denominator = torch.pow(10000, even_i/d_model)
even_denominator

tensor([  1.0000,  21.5443, 464.1590])
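Many implementations compute the reciprocal of this denominator in log space, as `exp(-ln(10000) * i / d_model)`, and multiply instead of dividing. The sketch below shows that the two forms agree to floating-point tolerance. The names `direct`, `via_log` and `pos_demo` are illustrative and chosen so they do not overwrite the notebook's variables.

```python
import math
import torch

dim = 6
exponents = torch.arange(0, dim, 2).float() / dim

# Direct form, as in the cell above: 10000^(i / d_model).
direct = torch.pow(10000, exponents)

# Reciprocal computed in log space: exp(-ln(10000) * i / d_model).
via_log = torch.exp(-math.log(10000.0) * exponents)

pos_demo = torch.arange(4, dtype=torch.float).reshape(4, 1)
same = torch.allclose(pos_demo / direct, pos_demo * via_log)
print(same)
```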

In [7]:
odd_i = torch.arange(1, d_model, 2).float()
odd_i

tensor([1., 3., 5.])

In [8]:
odd_denominator = torch.pow(10000, (odd_i - 1)/d_model)
odd_denominator

tensor([  1.0000,  21.5443, 464.1590])

In [9]:
# even_denominator and odd_denominator are identical, so a single tensor is enough
denominator = even_denominator

In [25]:
denominator

tensor([  1.0000,  21.5443, 464.1590])

In [12]:
position = torch.arange(max_seq_length, dtype=torch.float).reshape(max_seq_length, 1)
position

tensor([[0.],
        [1.],
        [2.],
        [3.],
        [4.],
        [5.],
        [6.],
        [7.],
        [8.],
        [9.]])
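The column shape `(max_seq_length, 1)` matters because of broadcasting: dividing a `(10, 1)` tensor by a `(3,)` tensor yields a `(10, 3)` tensor with one row per position and one column per denominator. A minimal sketch, using stand-in names (`position_demo`, `denominator_demo`) and the rounded denominator values printed above:

```python
import torch

position_demo = torch.arange(10, dtype=torch.float).reshape(10, 1)  # shape (10, 1)
denominator_demo = torch.tensor([1.0, 21.5443, 464.1590])           # shape (3,)

# Broadcasting expands both operands to (10, 3) before dividing.
angles = position_demo / denominator_demo
print(angles.shape)  # torch.Size([10, 3])

# Entry [p, j] is position p divided by the j-th denominator.
entry_ok = torch.isclose(angles[7, 1], torch.tensor(7 / 21.5443))
print(entry_ok.item())
```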

In [14]:
even_pe = torch.sin(position / denominator)  # values for even dimensions 0, 2, 4
odd_pe = torch.cos(position / denominator)   # values for odd dimensions 1, 3, 5

In [15]:
even_pe

tensor([[ 0.0000,  0.0000,  0.0000],
        [ 0.8415,  0.0464,  0.0022],
        [ 0.9093,  0.0927,  0.0043],
        [ 0.1411,  0.1388,  0.0065],
        [-0.7568,  0.1846,  0.0086],
        [-0.9589,  0.2300,  0.0108],
        [-0.2794,  0.2749,  0.0129],
        [ 0.6570,  0.3192,  0.0151],
        [ 0.9894,  0.3629,  0.0172],
        [ 0.4121,  0.4057,  0.0194]])

In [16]:
odd_pe

tensor([[ 1.0000,  1.0000,  1.0000],
        [ 0.5403,  0.9989,  1.0000],
        [-0.4161,  0.9957,  1.0000],
        [-0.9900,  0.9903,  1.0000],
        [-0.6536,  0.9828,  1.0000],
        [ 0.2837,  0.9732,  0.9999],
        [ 0.9602,  0.9615,  0.9999],
        [ 0.7539,  0.9477,  0.9999],
        [-0.1455,  0.9318,  0.9999],
        [-0.9111,  0.9140,  0.9998]])

In [17]:
even_pe.shape

torch.Size([10, 3])

In [19]:
# pair each sine with its matching cosine along a new last axis
stacked = torch.stack([even_pe, odd_pe], dim=2)
stacked.shape

torch.Size([10, 3, 2])

In [26]:
stacked

tensor([[[ 0.0000,  1.0000],
         [ 0.0000,  1.0000],
         [ 0.0000,  1.0000]],

        [[ 0.8415,  0.5403],
         [ 0.0464,  0.9989],
         [ 0.0022,  1.0000]],

        [[ 0.9093, -0.4161],
         [ 0.0927,  0.9957],
         [ 0.0043,  1.0000]],

        [[ 0.1411, -0.9900],
         [ 0.1388,  0.9903],
         [ 0.0065,  1.0000]],

        [[-0.7568, -0.6536],
         [ 0.1846,  0.9828],
         [ 0.0086,  1.0000]],

        [[-0.9589,  0.2837],
         [ 0.2300,  0.9732],
         [ 0.0108,  0.9999]],

        [[-0.2794,  0.9602],
         [ 0.2749,  0.9615],
         [ 0.0129,  0.9999]],

        [[ 0.6570,  0.7539],
         [ 0.3192,  0.9477],
         [ 0.0151,  0.9999]],

        [[ 0.9894, -0.1455],
         [ 0.3629,  0.9318],
         [ 0.0172,  0.9999]],

        [[ 0.4121, -0.9111],
         [ 0.4057,  0.9140],
         [ 0.0194,  0.9998]]])

In [20]:
# merge the (3, 2) pairs into 6 interleaved columns: sin, cos, sin, cos, ...
pe = torch.flatten(stacked, start_dim=1, end_dim=2)
pe

tensor([[ 0.0000,  1.0000,  0.0000,  1.0000,  0.0000,  1.0000],
        [ 0.8415,  0.5403,  0.0464,  0.9989,  0.0022,  1.0000],
        [ 0.9093, -0.4161,  0.0927,  0.9957,  0.0043,  1.0000],
        [ 0.1411, -0.9900,  0.1388,  0.9903,  0.0065,  1.0000],
        [-0.7568, -0.6536,  0.1846,  0.9828,  0.0086,  1.0000],
        [-0.9589,  0.2837,  0.2300,  0.9732,  0.0108,  0.9999],
        [-0.2794,  0.9602,  0.2749,  0.9615,  0.0129,  0.9999],
        [ 0.6570,  0.7539,  0.3192,  0.9477,  0.0151,  0.9999],
        [ 0.9894, -0.1455,  0.3629,  0.9318,  0.0172,  0.9999],
        [ 0.4121, -0.9111,  0.4057,  0.9140,  0.0194,  0.9998]])

In [21]:
pe.shape

torch.Size([10, 6])
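The stack-and-flatten step interleaves the sine and cosine columns. An alternative that some readers find easier to follow is to allocate the table and fill even and odd columns with strided slices. The sketch below rebuilds both versions from scratch, with illustrative names so the notebook's variables are not overwritten, and checks that they match exactly.

```python
import torch

max_len, dim = 10, 6  # same sizes as above
pos = torch.arange(max_len, dtype=torch.float).reshape(max_len, 1)
denom = torch.pow(10000, torch.arange(0, dim, 2).float() / dim)
sin_part = torch.sin(pos / denom)
cos_part = torch.cos(pos / denom)

# Stack-and-flatten, as in the cells above.
interleaved = torch.flatten(torch.stack([sin_part, cos_part], dim=2), start_dim=1)

# Alternative: write sines into even columns and cosines into odd columns.
pe_alt = torch.zeros(max_len, dim)
pe_alt[:, 0::2] = sin_part
pe_alt[:, 1::2] = cos_part

identical = torch.equal(interleaved, pe_alt)
print(identical)
```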

In [22]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super().__init__()
        self.max_seq_length = max_seq_length
        self.d_model = d_model  # assumed to be even

    def forward(self):
        # one denominator per (sin, cos) pair: 10000^(i / d_model) for even i
        even_i = torch.arange(0, self.d_model, 2).float()
        denominator = torch.pow(10000, even_i / self.d_model)
        # column of positions, shape (max_seq_length, 1), so division broadcasts
        position = torch.arange(self.max_seq_length, dtype=torch.float).reshape(self.max_seq_length, 1)
        even_pe = torch.sin(position / denominator)
        odd_pe = torch.cos(position / denominator)
        # interleave sin and cos columns: shape (max_seq_length, d_model)
        stacked = torch.stack([even_pe, odd_pe], dim=2)
        pe = torch.flatten(stacked, start_dim=1, end_dim=2)

        return pe

In [24]:
# use a new name so the `pe` tensor computed earlier is not overwritten
pos_enc = PositionalEncoding(d_model=512, max_seq_length=12)
pos_enc().shape

torch.Size([12, 512])
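The module above recomputes the table on every call and returns it without touching any input. A common variation, sketched here as an assumption about how such a module might be used rather than as this notebook's method, builds the table once, stores it as a non-trainable buffer, and adds it to a batch of embeddings. `CachedPositionalEncoding` is a hypothetical name, and the input shape `(batch, seq_len, d_model)` is assumed.

```python
import torch
import torch.nn as nn

class CachedPositionalEncoding(nn.Module):
    """Hypothetical variant: builds the table once and adds it to the input."""

    def __init__(self, d_model, max_seq_length):
        super().__init__()
        even_i = torch.arange(0, d_model, 2).float()
        denominator = torch.pow(10000, even_i / d_model)
        position = torch.arange(max_seq_length, dtype=torch.float).reshape(max_seq_length, 1)
        stacked = torch.stack([torch.sin(position / denominator),
                               torch.cos(position / denominator)], dim=2)
        # A buffer is saved with the module and follows .to(device), but is not trained.
        self.register_buffer("pe", torch.flatten(stacked, start_dim=1, end_dim=2))

    def forward(self, x):
        # x: (batch, seq_len, d_model), with seq_len <= max_seq_length (assumed).
        return x + self.pe[: x.size(1)]

cached = CachedPositionalEncoding(d_model=512, max_seq_length=12)
embeddings = torch.zeros(2, 8, 512)  # dummy batch standing in for token embeddings
out = cached(embeddings)
print(out.shape)  # torch.Size([2, 8, 512])
```

Because the table is a buffer rather than a parameter, it adds nothing to the optimizer's work.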