# What is Position Information


## Positional Embedding information has to be unique 

We need to assign position information to each token. For example, with dim = 3, the first token "i" gets pos_emb("i") = $[0, 0, 0]$.

```python
i need help, however ..... i dont know,  i need  help
|   |   |     |            |    |    |   |    |    |     
^   ^   ^     ^            ^    ^    ^   ^    ^    ^
0   1   2     3            25   26   27  28   29   30
0   1   2     3            25   26   27  28   29   30
0   1   2     3            25   26   27  28   29   30
V   V   V     V            V     V   V   V     V   V
```

Above is the simplest form of position information: each position is assigned an integer value, which we then add to the token embedding. For example, with dim = 3:

```python
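# Toy token embeddings with dim = 3 (values are illustrative only).
token_emb = {
    "i": [0.34, 0.1, -1.9],
    "need": [-0.04, 0.12, -1.09],
    "know": [-0.90, 0.08, -0.09],
}

def pos_emb(position, dim=3):
    # Naive scheme: repeat the integer position in every dimension.
    return [float(position)] * dim

def add(a, b):
    # Element-wise sum of two vectors of equal length.
    return [x + y for x, y in zip(a, b)]

total_emb_i = add(token_emb["i"], pos_emb(0))         # ~ [0.34, 0.1, -1.9]
total_emb_need = add(token_emb["need"], pos_emb(1))   # ~ [0.96, 1.12, -0.09]
total_emb_know = add(token_emb["know"], pos_emb(27))  # ~ [26.1, 27.08, 26.91]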
```


</br>

This leads us to the 1st requirement for designing a positional embedding:

- **The positional embedding has to be unique for each token position. This is easy to understand: we need to know which token is the 1st, which is the 2nd, and so on.**


## Positional Encoding should be able to take indefinitely long input


```python
i need help, however ..... i dont know,  i need  help
|   |   |     |            |    |    |   |    |    |     
^   ^   ^     ^            ^    ^    ^   ^    ^    ^
0   1   2     3            25   26   27  28   29   30
0   1   2     3            25   26   27  28   29   30
0   1   2     3            25   26   27  28   29   30
V   V   V     V            V     V   V   V     V   V
```


For the input above, we compute a fixed-size position embedding matrix of shape $D \times L = (3, 31)$, with one unique vector for each of the positions 0 to 30. However, a longer input, say with sequence length 32, has no vector for its last position. Therefore, for the positional encoding to handle indefinitely long input, PE must be a function of the position index $i$ rather than a fixed table: $PE = f(i)$.
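As a minimal sketch of the difference: a precomputed table only covers the positions it was built for, while a function of the position index covers any length. The names `pe` and `build_table` are hypothetical, and the naive integer scheme stands in for $f(i)$ here.

```python
def pe(i, dim=3):
    # Hypothetical f(i): the naive scheme, repeating the index in every dimension.
    return [float(i)] * dim

def build_table(max_len, dim=3):
    # Fixed-size table: one row per position, computed once up front.
    return [pe(i, dim) for i in range(max_len)]

table = build_table(max_len=31)   # covers positions 0..30 from the diagram

try:
    table[31]                     # one more token has no row in the table
    table_covers_31 = True
except IndexError:
    table_covers_31 = False

print(table_covers_31)            # the table cannot serve position 31
print(pe(31))                     # the function can
print(pe(10_000))                 # and so can any larger index
```

The table is still useful as a cache; the point is only that the source of truth should be a function of $i$.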


## Positional Encoding should contain Absolute Position Information

In addition, intuitively speaking, the positional encoding should carry the information that "however_3" comes before "i_25" and after "i_0".



```python
i need help, however ..... i dont know,  i need  help
|   |   |     |            |    |    |   |    |    |     
^   ^   ^     ^            ^    ^    ^   ^    ^    ^
0   1   2     3            25   26   27  28   29   30
0   1   2     3            25   26   27  28   29   30
0   1   2     3            25   26   27  28   29   30
```

</br>

The intuitive positional encoding scheme above does satisfy this requirement: we know that the token assigned $[0, 0, 0]$ comes before the token assigned $[3, 3, 3]$ because $0 < 3$.

</br>

This leads us to the next requirement for designing a positional embedding:

- **Positional Embedding should contain absolute positional information.**

Absolute positional encoding assigns a unique position vector to each position in the sequence. For example, the first word gets position 1, the second word position 2, and so on. This encoding is added to the token embeddings to incorporate positional information.

In addition, the positional encoding should capture relative position information, that is, the distance between two tokens, which can be calculated as the difference of their positions.

</br>

```python
I need help, however ..... I don't know, I need help
```

</br>


For example, let's focus on how the token "help" at position 3 attends to the other tokens using relative positional encoding.

Calculating Relative Positions:
For the token at position 3 ("help"), we calculate the relative positions to all other tokens:

| Token   | Absolute Position | Relative Position (Other Position - 3) |
|---------|-------------------|----------------------------------------|
| I       | 1                 | -2                                     |
| need    | 2                 | -1                                     |
| help    | 3                 | 0                                      |
| ,       | 4                 | +1                                     |
| however | 5                 | +2                                     |
| .....   | 6                 | +3                                     |
| I       | 7                 | +4                                     |
| don't   | 8                 | +5                                     |
| know    | 9                 | +6                                     |
| ,       | 10                | +7                                     |
| I       | 11                | +8                                     |
| need    | 12                | +9                                     |
| help    | 13                | +10                                    |
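The table can be reproduced with a few lines of Python. This is only a sketch: the tokenization is copied from the table, positions are 1-based to match it, and `relative_positions` is a hypothetical helper name.

```python
tokens = ["I", "need", "help", ",", "however", ".....", "I",
          "don't", "know", ",", "I", "need", "help"]

def relative_positions(tokens, query_pos):
    # Relative position = other position - query position (1-based positions).
    return [(tok, pos, pos - query_pos)
            for pos, tok in enumerate(tokens, start=1)]

rows = relative_positions(tokens, query_pos=3)
for tok, pos, rel in rows:
    print(f"{tok:<8} {pos:>2} {rel:+d}")
```

Changing `query_pos` shifts every relative position by the same amount, while the absolute positions stay fixed.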


## Position Information should contain relative position information

Relative position information in positional encoding means the model considers the distances between tokens when processing the sentence. 

</br>

```python
I need help, however ..... I don't know, I need help
```

</br>

For example, when the model looks at the word "need", it knows that "help" is one position away in both occurrences of the phrase, at the start and at the end of the sentence. This helps the model understand that "need" is closely related to "help" because their relative position is the same, even though they appear at different absolute positions in the sentence. By focusing on these relative distances, the model better captures the repeated pattern "I need help" and understands the relationships between words based on how far apart they are.


Relative Positional Encoding: Focuses on the relative distances between tokens. Instead of knowing that a word is at position 5, the model knows that one word is, say, three positions away from another. This can capture patterns like "the next word," "the previous word," or "words within a certain range."


This closeness information can also be used to adjust the attention score, for example from "help" (position 3) to "need" (position 2):

Relative position: $2 - 3 = -1$

The model uses the embedding for relative position $-1$ to adjust the attention score.
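A minimal sketch of that adjustment, assuming the simplest possible form: a hypothetical table `rel_bias` that maps each relative offset to a single scalar added to the raw score. In a trained model these values would be learned; the numbers below are made up for illustration.

```python
# Hypothetical bias per relative offset; made-up values, learned in practice.
rel_bias = {-2: -0.2, -1: 0.5, 0: 0.0, 1: 0.4, 2: -0.1}

def adjusted_score(raw_score, query_pos, key_pos):
    # Relative position = key position - query position, as in the text.
    rel = key_pos - query_pos
    # Offsets missing from the table get no adjustment in this sketch.
    return raw_score + rel_bias.get(rel, 0.0)

raw = 1.2   # made-up content-based score from "help" (3) to "need" (2)
print(adjusted_score(raw, query_pos=3, key_pos=2))   # uses rel_bias[-1]
```

Because the lookup depends only on the offset, the pair ("help", "need") gets the same adjustment wherever the phrase appears in the sequence.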




However, the naive integer encoding introduced at the beginning has some obvious limitations.

If the sequence length is very long, the position values dominate total_emb. As total_emb_know shows, every component is already around 26 to 27 at position 27, and the values keep growing with the sequence length. The embedding of the last token will be far larger in every dimension than total_emb_i = $[0.34, 0.1, -1.9]$, so the semantic meaning is distorted by the large position value.
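To make the growth concrete, here is a small sketch that compares the length of the combined vector at a few positions. It reuses the toy values for token "i"; `total_norm` is a hypothetical helper, and position 1000 is an arbitrary choice.

```python
import math

def norm(v):
    # Euclidean length of a vector.
    return math.sqrt(sum(x * x for x in v))

def total_norm(token, position):
    # Length of token embedding + naive integer position embedding.
    return norm([t + position for t in token])

token_i = [0.34, 0.1, -1.9]   # toy token embedding for "i"

for position in (0, 27, 1000):
    print(position, round(total_norm(token_i, position), 2))
```

The token part stays the same size while the combined vector grows roughly linearly with the position.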


</br>


```python
i need help
^   ^   ^ 
0   1   2 
0   1   2 
0   1   2 
V   V   V 


i need help
^   ^   ^  
28  29  30  
28  29  30  
28  29  30  
V   V   V  
```


</br>


A second problem with these large values is that the final embeddings for the tokens "i", "need", "help" at positions $28, 29, 30$ will be very different from those of the same tokens at positions $0, 1, 2$. This can confuse the model, even though the two phrases should share similar information (a small distance between their vectors in the embedding space). Again, this happens because the position information overwhelms the semantic embedding.
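A quick sketch of this effect, using the toy embeddings and the naive integer scheme: the same token at two distant positions ends up much farther apart than two different tokens at neighboring positions. The helper names are hypothetical.

```python
import math

def pos_emb(position, dim=3):
    # Naive scheme: repeat the integer position in every dimension.
    return [float(position)] * dim

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def dist(a, b):
    # Euclidean distance between two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

need = [-0.04, 0.12, -1.09]   # toy token embedding for "need"
i_tok = [0.34, 0.1, -1.9]     # toy token embedding for "i"

# Same token "need" at positions 1 and 29.
same_token_far = dist(add(need, pos_emb(1)), add(need, pos_emb(29)))
# Different tokens "i" and "need" at positions 0 and 1.
diff_token_near = dist(add(i_tok, pos_emb(0)), add(need, pos_emb(1)))

print(same_token_far)    # about 48.5
print(diff_token_near)
```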




This leads us to two more requirements for designing the positional embedding:
- **Ideally, the positional embedding is bounded above and below by fixed values, no matter how long the sequence is.**
- **Ideally, the same phrase at different positions shares some similarity in the vector space.**
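As a hedged illustration of the boundedness requirement only: sine and cosine stay within $[-1, 1]$ for any index, unlike the integer scheme. The function below follows the sinusoidal form used in the original Transformer paper; it is brought in here as an assumption for illustration, not derived in this section, and `dim = 4` is an arbitrary even size.

```python
import math

def sinusoidal_pe(i, dim=4, base=10000.0):
    # One frequency per (sin, cos) pair; dim is assumed to be even.
    pe = []
    for k in range(0, dim, 2):
        angle = i / (base ** (k / dim))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

for i in (0, 30, 100_000):
    vec = sinusoidal_pe(i)
    bounded = max(abs(x) for x in vec) <= 1.0
    print(i, [round(x, 3) for x in vec], bounded)
```

The values stay bounded however large the index gets, and the vectors for positions 0 to 30 are still distinct from one another.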


# Summary:

- Positional Embedding information has to be unique.
- Positional Encoding should be able to take indefinitely long input.
- Positional Encoding should contain Absolute Position Information. (index info)
- Position Information should contain relative position information. (closeness info)
- Ideally, the positional embedding is bounded above and below by fixed values, no matter how long the sequence is.
- Ideally, the same phrase at different positions shares some similarity in the vector space.



Position Embedding in Transformer:
- https://www.youtube.com/watch?v=5V9gZcAd6cE&list=PLmZlBIcArwhOPR2s-FIR7WoqNaBML233s

In the positional encoding above, we add the positional embedding to the token embedding and then let self-attention handle the interaction between the tokens in the sequence.

But what if we add the positional information directly to the self-attention score matrix instead? The intuition is fairly direct, I think: for words that are far away from each other, maybe we can scale the self-attention score down a little.
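One way to picture this idea, as a sketch under stated assumptions: subtract a penalty proportional to the distance between query and key before the softmax. The linear penalty and the `slope` value are hypothetical choices for illustration, and the raw scores are made up.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def penalized_scores(raw_scores, query_pos, slope=0.5):
    # Subtract a penalty proportional to the distance |key_pos - query_pos|.
    return [s - slope * abs(key_pos - query_pos)
            for key_pos, s in enumerate(raw_scores)]

raw = [1.0, 1.0, 1.0, 1.0, 1.0]   # made-up, equal content scores
weights = softmax(penalized_scores(raw, query_pos=0))
print([round(w, 3) for w in weights])   # weights shrink as distance grows
```

With equal content scores, the attention weights depend only on distance, which isolates the effect of the positional term.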