In [4]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math, copy, re 
import warnings

import pandas as pd
import numpy as np
import seaborn as sns
import torchtext
import matplotlib.pyplot as plt

print(torch.__version__)


2.4.1


## Embeddings
We represent each word (token) in the vocabulary as a learned vector of size `embed_dim`, looked up by the word's integer index.

In [5]:
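class Embedding(nn.Module):
  """Maps integer token ids to learned embed_dim-dimensional vectors."""
  def __init__(self, vocab_size, embed_dim):
    super().__init__()
    # Lookup table with one learnable row per vocabulary entry.
    self.embed = nn.Embedding(vocab_size, embed_dim)
    self.embed_dim = embed_dim

  def forward(self, x):
    # x: integer tensor of token ids; output shape is (*x.shape, embed_dim).
    return self.embed(x)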
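
As a quick illustration of what the lookup does, here is a minimal sketch that calls `nn.Embedding` directly (the layer the `Embedding` class wraps). The vocabulary size, embedding size, and token ids are toy values assumed only for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # make the random initial weights reproducible

vocab_size, embed_dim = 10, 4  # toy sizes, assumed for illustration only
embed = nn.Embedding(vocab_size, embed_dim)

# One batch holding a single 3-token sequence of hypothetical token ids.
token_ids = torch.tensor([[1, 5, 2]])
vectors = embed(token_ids)

# Each id is replaced by its row of the weight matrix.
print(vectors.shape)  # torch.Size([1, 3, 4])
```

Each token id simply selects one row of `embed.weight`, so the output gains a trailing dimension of size `embed_dim`.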

    

## Positional Encoding

$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right), PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right)
$



Where:
- $pos$ is the position of the token in the sequence.
- $i$ is the index of a dimension pair: dimensions $2i$ and $2i+1$ of the encoding share the same frequency.
- $d_{\text{model}}$ is the dimension of the embeddings.

### Explanation:

My intuition is that positional encoding uses this function to tell the model where each word sits in the sequence, since the embeddings alone carry no information about word order.

Suppose we have the sequence "I like dogs", where $A$, $B$, and $C$ are the embeddings of the three words respectively. Let's assume that our embeddings have only five dimensions (an unrealistic but simple example).

"I": $A = [A_1, A_2, A_3, A_4, A_5]$

"like": $B = [B_1, B_2, B_3, B_4, B_5]$

"dogs": $C = [C_1, C_2, C_3, C_4, C_5]$

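For each embedding, we create a positional encoding vector with the same number of dimensions, using the formula from the start of this section:

$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right), PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right)
$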

- The positional encoding uses sine and cosine functions of different frequencies.
- For even dimensions ($2i$), the sine function is used.
- For odd dimensions ($2i+1$), the cosine function is used.
- The denominator $10000^{\frac{2i}{d_{\text{model}}}}$ grows with $i$, so each pair of dimensions oscillates at a different frequency: the first dimensions change quickly with $pos$, while the last dimensions change slowly.
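
To make the points above concrete, here is a minimal pure-Python sketch that evaluates the formula for the three-word, five-dimensional example. The function name `positional_encoding` and its `base` parameter are names assumed for this illustration, not part of any library.

```python
import math

def positional_encoding(seq_len, d_model, base=10000.0):
    """Return a seq_len x d_model nested list of sinusoidal encodings."""
    pe = []
    for pos in range(seq_len):
        row = []
        for dim in range(d_model):
            # Dimensions 2i and 2i+1 share one frequency, so i = dim // 2.
            i = dim // 2
            angle = pos / (base ** (2 * i / d_model))
            # Even dimension -> sine, odd dimension -> cosine.
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

# "I like dogs": three positions, five-dimensional embeddings.
pe = positional_encoding(seq_len=3, d_model=5)
for word, row in zip(["I", "like", "dogs"], pe):
    print(word, [round(v, 4) for v in row])
```

Position 0 ("I") always encodes to alternating zeros and ones, `[0.0, 1.0, 0.0, 1.0, 0.0]`, because $\sin(0) = 0$ and $\cos(0) = 1$. For later positions, the first pair of dimensions changes much faster than the later ones, which is the "different frequencies" effect described above.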
