## **TRANSFORMER**

```
1. input data

2. ohe = one hot encoding (input_size x vocab_size)

3. x(input_size x nemb) = ohe(input_size x vocab_size) @ we(vocab_size x nemb)

4. xd(input_size x nemb) = x + pos_emb(input_size x nemb)

5. wq, wk, wv = 3 x (nemb x nemb)

6. q, k, v = xd @ wq, xd @ wk, xd @ wv = 3 x (input_size x nemb)
```
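
The six steps above can be sketched in PyTorch as follows. This is only an illustration: the sizes, the random weights, and the random `pos_emb` placeholder are assumptions made for the example, not values from these notes.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# illustrative sizes (assumptions, not taken from the notes)
vocab_size, input_size, nemb = 12, 5, 8

# 1. input data: a sequence of token ids
tokens = torch.randint(0, vocab_size, (input_size,))

# 2. one hot encoding: (input_size x vocab_size)
ohe = F.one_hot(tokens, num_classes=vocab_size).float()

# 3. embedding lookup written as a matrix product: (input_size x nemb)
we = torch.randn(vocab_size, nemb)
x = ohe @ we

# 4. add a positional term (random placeholder here, same shape as x)
pos_emb = torch.randn(input_size, nemb)
xd = x + pos_emb

# 5. one projection matrix each for queries, keys and values: (nemb x nemb)
wq, wk, wv = (torch.randn(nemb, nemb) for _ in range(3))

# 6. project: each result is (input_size x nemb)
q, k, v = xd @ wq, xd @ wk, xd @ wv

print(x.shape, q.shape, k.shape, v.shape)
```

Multiplying a one hot matrix by `we` selects rows of `we`, so step 3 gives the same result as indexing `we[tokens]` directly, which is what an embedding layer does in practice.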

### **POSITIONAL ENCODING**

Positional encodings are crucial in Transformer models for several reasons:

- Preserving Sequence Order: Transformer models process tokens in parallel, lacking inherent knowledge of token order. Positional encodings provide the model with information about the position of tokens in the sequence, ensuring that the model can differentiate between tokens based on their position. This is essential for tasks where word order matters, such as language translation and text generation.

- Maintaining Contextual Information: In natural language processing tasks, the meaning of a sentence often depends on where each word appears. For example, “The cat sat on the mat” and “The mat sat on the cat” contain exactly the same words but mean different things. Positional encodings let the Transformer tell the two apart.

- Enhancing Generalization: By incorporating positional information, transformer models can generalize better across sequences of different lengths. This is particularly important for tasks where the length of the input sequence varies, such as document summarization or question answering. Positional encodings enable the model to handle input sequences of varying lengths without sacrificing performance.

- Mitigating Symmetry: Without positional encodings, self-attention is permutation-equivariant: reordering the input tokens only reorders the outputs in the same way, so the model cannot distinguish one ordering of the same tokens from another. Positional encodings break this symmetry by giving each position a distinct signal, ensuring that tokens at different positions are treated differently.

In summary, positional encodings are essential in Transformer models for preserving sequence order, maintaining contextual information, enhancing generalization, and mitigating symmetry. They enable Transformer models to effectively process and understand input sequences, leading to improved performance across a wide range of natural language processing tasks.
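
The symmetry point in the list above can be checked numerically. The sketch below uses a minimal single-head attention function written for this example (`self_attention` is a hypothetical helper, and the weights and the `pos` placeholder are random assumptions) and compares the outputs for a shuffled input with and without a positional term.

```python
import torch

torch.manual_seed(0)

def self_attention(x, wq, wk, wv):
    # minimal single-head scaled dot-product attention (illustrative helper)
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / (k.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

seq_len, nemb = 4, 8
x = torch.randn(seq_len, nemb)
wq, wk, wv = (torch.randn(nemb, nemb) * nemb ** -0.5 for _ in range(3))
perm = torch.tensor([2, 0, 3, 1])  # an arbitrary reordering of the tokens

# without positional information: shuffling inputs only shuffles outputs
out = self_attention(x, wq, wk, wv)
out_perm = self_attention(x[perm], wq, wk, wv)
print(torch.allclose(out[perm], out_perm, atol=1e-4))

# with a positional term tied to the slot, not the token, the symmetry breaks
pos = torch.randn(seq_len, nemb)  # random placeholder for a positional encoding
out_pos = self_attention(x + pos, wq, wk, wv)
out_pos_perm = self_attention(x[perm] + pos, wq, wk, wv)
print(torch.allclose(out_pos[perm], out_pos_perm, atol=1e-4))
```

The first comparison is expected to hold and the second to fail: once a position-dependent term is added, the same tokens in a different order no longer produce the same set of outputs.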

---

$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i / d_{model}}}\right)$

$PE_{(pos, 2i + 1)} = \cos\left(\frac{pos}{10000^{2i / d_{model}}}\right)$

```
i => index of the sin/cos pair, 0 <= i < d_model / 2 (fills dimensions 2i and 2i + 1)
d_model => embedding length
pos => position of the token in the sequence

Reasons:
- Periodicity
- Constrained values
- Easy to extrapolate for long sequences
```
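
As a hand check of the formula, the sketch below computes one row of the encoding with the standard library only. It assumes the sine and cosine of each pair share the exponent `2i / d_model`; `positional_encoding_row` is a hypothetical helper written for this example.

```python
import math

def positional_encoding_row(pos, d_model):
    # illustrative helper: encoding of a single position, one sin/cos pair per i
    row = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        row.append(math.sin(angle))  # dimension 2i
        row.append(math.cos(angle))  # dimension 2i + 1
    return row

row = positional_encoding_row(pos=1, d_model=6)
print([round(v, 4) for v in row])
# [0.8415, 0.5403, 0.0464, 0.9989, 0.0022, 1.0]
```

Every value is a sine or a cosine, so it stays within [-1, 1] for any position, which is the "constrained values" property listed above.
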
```
max_seq_len = x
d_model = y

even_i.shape = odd_i.shape = (y//2,)
denom.shape = (y//2,)
pos.shape = (x, 1)
even_PE.shape = odd_PE.shape = (x, y//2)
stacked.shape = (x, y//2, 2)
PE.shape = (x, y)
```

In [1]:
import torch
import torch.nn as nn

In [2]:
# constants

max_seq_len = 10
d_model = 6

In [3]:
# splitting the even and odd indices

even_i = torch.arange(0, d_model, 2).float()
odd_i = torch.arange(1, d_model, 2).float()

(even_i.shape, odd_i.shape)

(torch.Size([3]), torch.Size([3]))

In [4]:
# computing the denominator

even_denom = pow(10000, even_i / d_model)
odd_denom = pow(10000, (odd_i - 1) / d_model)

even_denom == odd_denom

tensor([True, True, True])

In [5]:
denom = even_denom
denom.shape

torch.Size([3])

In [6]:
# positions of the input data

position = torch.arange(max_seq_len, dtype = torch.float).reshape(max_seq_len, 1)
position.shape

torch.Size([10, 1])

In [7]:
(position/denom).shape

torch.Size([10, 3])

In [8]:
# even positional encoding

even_PE = torch.sin(position / denom)
even_PE.shape

torch.Size([10, 3])

In [9]:
# odd positional encoding

odd_PE = torch.cos(position / denom)
odd_PE.shape

torch.Size([10, 3])

In [10]:
# combining the two pos_enc along dim 2

stacked = torch.stack([even_PE, odd_PE], dim = 2)
stacked.shape

torch.Size([10, 3, 2])

In [11]:
# flatten the last two dimensions so sin and cos values interleave

PE = torch.flatten(stacked, start_dim = 1, end_dim = 2)
PE.shape

torch.Size([10, 6])
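
The stack-then-flatten step interleaves the sine and cosine columns. An alternative, shown here only as a sketch for comparison, writes them straight into the even and odd columns with strided slices. It reuses the same constants as the cells above; `PE_flat` and `PE_slice` are names introduced for this example.

```python
import torch

max_seq_len, d_model = 10, 6

position = torch.arange(max_seq_len, dtype=torch.float).reshape(max_seq_len, 1)
denom = 10000 ** (torch.arange(0, d_model, 2).float() / d_model)

# stack + flatten, as in the cells above
stacked = torch.stack([torch.sin(position / denom),
                       torch.cos(position / denom)], dim=2)
PE_flat = torch.flatten(stacked, start_dim=1, end_dim=2)

# alternative: assign directly to even and odd columns
PE_slice = torch.zeros(max_seq_len, d_model)
PE_slice[:, 0::2] = torch.sin(position / denom)  # columns 0, 2, 4
PE_slice[:, 1::2] = torch.cos(position / denom)  # columns 1, 3, 5

print(torch.allclose(PE_flat, PE_slice))
```

Both routes give the same `(max_seq_len, d_model)` layout; which one reads better is largely a matter of taste.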

### **SUMMARY**

In [12]:
class PositionalEncoding(nn.Module):
  """Sinusoidal positional encoding; forward takes a sequence length, not a tensor."""

  def __init__(self, d_model, max_seq_len):
    super().__init__()
    self.d_model = d_model
    self.max_seq_len = max_seq_len

  def forward(self, x):
    # x is the sequence length (an int); the result has shape (x, d_model)
    assert x <= self.max_seq_len, "Max sequence length is {}".format(self.max_seq_len)

    # one denominator per sin/cos pair: 10000^(2i / d_model)
    even_i = torch.arange(0, self.d_model, 2).float()
    denominator = pow(10000, even_i / self.d_model)

    position = torch.arange(x, dtype=torch.float).reshape(x, 1)

    even_PE = torch.sin(position / denominator)  # even dimensions
    odd_PE = torch.cos(position / denominator)   # odd dimensions

    # interleave sin and cos: (x, d_model // 2, 2) -> (x, d_model)
    stacked = torch.stack([even_PE, odd_PE], dim=2)
    PE = torch.flatten(stacked, start_dim=1, end_dim=2)

    return PE

In [13]:
pe = PositionalEncoding(6, 20)
pos_enc = pe(10)  # calling the module runs forward
pos_enc.shape

torch.Size([10, 6])

In [14]:
pos_enc

tensor([[ 0.0000,  1.0000,  0.0000,  1.0000,  0.0000,  1.0000],
        [ 0.8415,  0.5403,  0.0464,  0.9989,  0.0022,  1.0000],
        [ 0.9093, -0.4161,  0.0927,  0.9957,  0.0043,  1.0000],
        [ 0.1411, -0.9900,  0.1388,  0.9903,  0.0065,  1.0000],
        [-0.7568, -0.6536,  0.1846,  0.9828,  0.0086,  1.0000],
        [-0.9589,  0.2837,  0.2300,  0.9732,  0.0108,  0.9999],
        [-0.2794,  0.9602,  0.2749,  0.9615,  0.0129,  0.9999],
        [ 0.6570,  0.7539,  0.3192,  0.9477,  0.0151,  0.9999],
        [ 0.9894, -0.1455,  0.3629,  0.9318,  0.0172,  0.9999],
        [ 0.4121, -0.9111,  0.4057,  0.9140,  0.0194,  0.9998]])