Till now we've understood `Self-Attention` and `Multi-Head Attention` mechanisms in detail.

The major advantages of these mechanisms are:

- Parallelization: Unlike `RNNs`, which process sequences sequentially, `Self-Attention` allows for parallel processing of all tokens in a sequence, significantly speeding up training times.

- Long-Range Dependencies: `Self-Attention` can capture dependencies between tokens regardless of their distance in the sequence, making it effective for understanding context over long texts. In other words, it generates `Contextual Embeddings` for each word by considering the entire sentence while computing the embedding for that word.

- Flexibility: `Self-Attention` can be applied to various types of data, including text, images, and more, making it a versatile tool in deep learning.

**But,**

This mechanism has a major `Drawback` i.e. **`Lack of Positional Information`**.

Let's understand this with an example.

Consider the sentences:

S1 = `"Lion Kills Deer"`

S2 = `"Deer Kills Lion"`

If we calculate the `Self-Attention` for both sentences, the output `Contextual Embeddings` for the words `"Lion"` and `"Deer"` would be identical in both sentences, as `Self-Attention` does not take into account the order of words in the sequence. This means that the model would not be able to distinguish between the two sentences based on their meaning, as the positional information is lost.

This should not happen as the meaning of both sentences is completely different. But due to the lack of positional information in `Self-Attention`, the model fails to capture this difference.
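
The claim above can be checked with a small sketch. It assumes toy 3-dimensional embeddings and, for simplicity, uses the embeddings directly as queries, keys, and values (no learned projection matrices). Since learned projections are applied to each token independently, the same behaviour is expected with them as well.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (an assumption)."""
    d = len(X[0])
    out = []
    for q in X:
        # Scores of this token against every token in the sentence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

# Toy embeddings (illustrative values only).
emb = {"Lion": [0.2, 0.8, 0.5], "Kills": [0.6, 0.1, 0.3], "Deer": [0.9, 0.4, 0.7]}

s1 = ["Lion", "Kills", "Deer"]
s2 = ["Deer", "Kills", "Lion"]

out1 = dict(zip(s1, self_attention([emb[w] for w in s1])))
out2 = dict(zip(s2, self_attention([emb[w] for w in s2])))

for word in s1:
    same = all(math.isclose(a, b, abs_tol=1e-12)
               for a, b in zip(out1[word], out2[word]))
    print(word, "has the same contextual embedding in S1 and S2:", same)
```

Reordering the words only reorders the output rows; each word's contextual embedding stays the same, which is exactly the problem described here.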

Therefore, we can conclude that if we have a `Sentence` where the order of words matters, then `Self-Attention` alone is not sufficient to capture the meaning of the sentence.

So we need a way to represent or preserve the order of words in a sequence while using `Self-Attention`.

This is where **`Positional Encoding`** comes into play.


## **Positional Encoding**

In `RNNs` and `LSTMs`, the order of words is inherently preserved due to their sequential processing nature. However, in `Self-Attention`, since all words are processed simultaneously, we need to explicitly provide information about the position of each word in the sequence.

**`Positional Encoding`** is a technique used to inject information about the position of words in a sequence into the model. This is typically done by adding a positional encoding vector to each word embedding before feeding it into the `Self-Attention` mechanism.


## **Understanding `Positional Encoding` From `First Principles`**

The most basic way to pass the `Positional Information` to the `Self-Attention` mechanism is to add an `Extra Dimension` to the `Word Embeddings` that represents the position of each word in the sequence.

For example, consider the sentence: `"Lion Kills Deer"`.

The `Word Embeddings` for the words `"Lion"`, `"Kills"`, and `"Deer"` can be represented as follows (assuming 3D embeddings for simplicity):

$$
\text{Lion} = \begin{bmatrix} 0.2 \\ 0.8 \\ 0.5 \end{bmatrix}, \quad
\text{Kills} = \begin{bmatrix} 0.6 \\ 0.1 \\ 0.3 \end{bmatrix}, \quad
\text{Deer} = \begin{bmatrix} 0.9 \\ 0.4 \\ 0.7 \end{bmatrix}
$$

Now, we can add an `Extra Dimension` to each embedding to represent the position of each word in the sequence:

$$
\text{Lion} = \begin{bmatrix} 0.2 \\ 0.8 \\ 0.5 \\ 1 \end{bmatrix}, \quad
\text{Kills} = \begin{bmatrix} 0.6 \\ 0.1 \\ 0.3 \\ 2 \end{bmatrix}, \quad
\text{Deer} = \begin{bmatrix} 0.9 \\ 0.4 \\ 0.7 \\ 3 \end{bmatrix}
$$

Here, the last element in each embedding vector represents the position of the word in the sequence (1 for `"Lion"`, 2 for `"Kills"`, and 3 for `"Deer"`).
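
As a minimal sketch of this idea (the variable names are only illustrative), the position index can be appended to each toy embedding like this:

```python
# Toy 3-dimensional embeddings from the example above.
embeddings = {
    "Lion": [0.2, 0.8, 0.5],
    "Kills": [0.6, 0.1, 0.3],
    "Deer": [0.9, 0.4, 0.7],
}

sentence = ["Lion", "Kills", "Deer"]

# Append the 1-based position as an extra (4th) dimension.
with_position = [
    embeddings[word] + [pos] for pos, word in enumerate(sentence, start=1)
]

for word, vector in zip(sentence, with_position):
    print(word, vector)
```

Note that the appended value is simply the word's index, so it keeps growing with the length of the sentence.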

**But, this approach has some limitations:**

1. **Unbounded Values**: The positional values grow without limit as the sequence length increases, so for very long sequences they become very large compared to the embedding values.

   `Neural Networks` generally train better with small input values, e.g. within the range of `-1 to 1` or `0 to 1`. If we have a sentence with `1000` words, the positional value for the last word would be `1000`. Such a large input can dominate the other features and, during `Backpropagation`, can contribute to `Exploding Gradients`, making the training process unstable.

2. **Discrete Values**: The positional values are discrete integers that jump in steps of `1`, whereas `Neural Networks` generally learn better from smooth, `Continuous values`.

3. **Can't Capture Relative Positions**: This method only captures absolute positions of words, not their relative positions. For example, in the sentences `"The cat sat on the mat"` and `"The mat sat on the cat"`, the relative positions of `"cat"` and `"mat"` are important for understanding the meaning, but this approach does not capture that information.

Even if we normalize these positional values by dividing them by the length of the sequence, the problem is not completely solved.

- In practice, sequences vary greatly in length, so the same normalized value would refer to different positions in different sentences, and the gap between neighboring words would change too. With no consistent way to represent positions across sentences of different lengths, the model can get confused.
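
A small sketch of this inconsistency, assuming each sentence is normalized by its own length (the helper name is hypothetical):

```python
def normalized_positions(length):
    """Return pos / length for positions 1..length."""
    return [pos / length for pos in range(1, length + 1)]

short = normalized_positions(3)  # e.g. "Lion Kills Deer"
long = normalized_positions(6)   # e.g. "The cat sat on the mat"

# The third word gets a different value in each sentence.
print("3rd word, 3-word sentence:", short[2])
print("3rd word, 6-word sentence:", long[2])

# The step between neighboring words also depends on the length.
print("step, 3-word sentence:", short[1] - short[0])
print("step, 6-word sentence:", long[1] - long[0])
```

The values are now bounded, but position 3 no longer has one fixed representation.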

<hr>

So, now we need a function that:

- `Bounds` the positional values within a certain range (e.g., `-1 to 1` or `0 to 1`).

- Provides `Continuous Values` for better learning.

- Can `Capture Relative Positions` of words in addition to absolute positions i.e. `Periodic Function`.

One pair of functions that satisfies all these criteria is the **`Sine`** and **`Cosine`** functions.

### **`Sine` Function for Positional Encoding**

As the `Sine` function oscillates between `-1` and `1`, it provides bounded values.

Additionally, it is a continuous function: it is defined for every `x` value, and a small change in `x` produces only a small change in `y`.

And, it is periodic in nature, which helps it capture relative positions.

Below is the graph of `Sine` function:

<img src="../../Notes_Images/Sine.png" alt="Sine Function" width="1600"/>

**How to use `Sine` function for `Positional Encoding`?**

What we can do is represent the `Position` of each word with the `X-axis` values of the `Sine` function.

And, the corresponding `Y-axis` values can be used as the `Positional Encoding` values.

So our positional encoding for each position `pos` can be calculated as:

$$
PE(pos) = \sin(pos)
$$

So for the sentence `"Lion Kills Deer"`,

The positional encodings for the words `"Lion"`, `"Kills"`, and `"Deer"` would be:

$$
\begin{aligned}
PE(1) &= \sin(1) \approx 0.84 \\
PE(2) &= \sin(2) \approx 0.91 \\
PE(3) &= \sin(3) \approx 0.14
\end{aligned}
$$

So our final embeddings with positional encodings would be:

$$
\text{Lion} = \begin{bmatrix} 0.2 \\ 0.8 \\ 0.5 \\ 0.84 \end{bmatrix}, \quad
\text{Kills} = \begin{bmatrix} 0.6 \\ 0.1 \\ 0.3 \\ 0.91 \end{bmatrix}, \quad
\text{Deer} = \begin{bmatrix} 0.9 \\ 0.4 \\ 0.7 \\ 0.14 \end{bmatrix}
$$
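
The numbers above can be reproduced with a short sketch (positions are 1-based and measured in radians, as in the example; variable names are illustrative):

```python
import math

embeddings = {
    "Lion": [0.2, 0.8, 0.5],
    "Kills": [0.6, 0.1, 0.3],
    "Deer": [0.9, 0.4, 0.7],
}
sentence = ["Lion", "Kills", "Deer"]

# PE(pos) = sin(pos), rounded to two decimals for display.
values = [round(math.sin(pos), 2) for pos in (1, 2, 3)]
print(values)

# Append sin(pos) as the 4th dimension instead of the raw index.
final = [
    embeddings[word] + [round(math.sin(pos), 2)]
    for pos, word in enumerate(sentence, start=1)
]
for word, vector in zip(sentence, final):
    print(word, vector)

# Even a far-away position stays within [-1, 1].
print(-1.0 <= math.sin(1000) <= 1.0)
```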

This approach effectively incorporates positional information into the word embeddings while addressing the limitations of the previous method i.e. `Unbounded Values`, `Discrete Values`, and `Can't Capture Relative Positions`.

**But,** using only the `Sine` function has its own limitations.

- `Uniqueness`: The `Positional Encodings` should be `Unique` for each position. But since the `Sine` function is periodic (it repeats every `2π`), different positions can produce the same or nearly the same value, especially in longer sequences, leading to potential ambiguities.
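
To see how close such values can get, here is a small sketch that searches the first 50 integer positions for the pair whose `Sine` values are nearest to each other (the range of 50 is an arbitrary choice for illustration):

```python
import math
from itertools import combinations

positions = range(1, 51)

def sine_gap(pair):
    """Absolute difference between the sine values of two positions."""
    a, b = pair
    return abs(math.sin(a) - math.sin(b))

# Pair of distinct positions with the most similar scalar encoding.
closest = min(combinations(positions, 2), key=sine_gap)
gap = sine_gap(closest)

print("closest pair:", closest, "gap:", gap)

# A concrete near-collision: positions 1 and 21.
print(round(math.sin(1), 4), round(math.sin(21), 4))
```

For integer positions the values never match exactly, but they can be close enough that a single scalar is a weak signal of position.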

To overcome this limitation, why not represent the `Positional Encoding` with `Vectors` instead of `Scalars`?

### **Using `Sine` and `Cosine` Functions for Vectorized `Positional Encoding`**

<img src="../../Notes_Images/Sine_Cose.png" alt="Sine and Cosine Functions" width="1600"/>

So instead of using just the `Sine` function, we can use both `Sine` and `Cosine` functions to create a `Vector` for each position.

We'll calculate the positional encoding for each position `pos` as follows:

$$
PE(pos) = \begin{bmatrix} \sin(pos) \\ \cos(pos) \end{bmatrix}
$$

This approach reduces the chances of collision, but for longer sequences there can still be positions that yield very similar `Sine` and `Cosine` values.

So, let's increase the dimensionality of the positional encoding vectors by using various `Frequencies` for the `Sine` and `Cosine` functions.

### **Using Different Frequencies for `Sine` and `Cosine` Functions**

<img src="../../Notes_Images/Sin_Cos_Freq.png" alt="Sine and Cosine Functions with Multiple Frequencies" width="1600"/>

We can achieve this by introducing a `Scaling Factor` that changes the frequency of the `Sine` and `Cosine` functions for each dimension of the positional encoding vector.

For example, with two frequencies, the positional encoding for each position `pos` becomes a 4-dimensional vector:

$$
PE(pos) = \begin{bmatrix} \sin(pos) \\ \cos(pos) \\ \sin(pos/2) \\ \cos(pos/2) \end{bmatrix}
$$

So now we've got the idea: if we keep increasing the dimensions and use a different scaling factor for each pair of dimensions, we can create positional encodings that are far less likely to collide across positions in the sequence.
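
As a rough check of this intuition, the sketch below compares the 2-dimensional `[sin(pos), cos(pos)]` encoding with the 4-dimensional two-frequency encoding by measuring the smallest Euclidean distance between any two of the first 50 positions. The helper names and the range of 50 are assumptions for illustration, and it needs Python 3.8+ for `math.dist`.

```python
import math
from itertools import combinations

def pe_2d(pos):
    """[sin(pos), cos(pos)]"""
    return [math.sin(pos), math.cos(pos)]

def pe_4d(pos):
    """[sin(pos), cos(pos), sin(pos/2), cos(pos/2)]"""
    return [math.sin(pos), math.cos(pos),
            math.sin(pos / 2), math.cos(pos / 2)]

def min_pairwise_distance(encode, positions):
    """Smallest distance between the encodings of two distinct positions."""
    return min(
        math.dist(encode(a), encode(b))
        for a, b in combinations(positions, 2)
    )

positions = range(1, 51)
d2 = min_pairwise_distance(pe_2d, positions)
d4 = min_pairwise_distance(pe_4d, positions)

print("closest pair distance, 2 dims:", d2)
print("closest pair distance, 4 dims:", d4)
```

In this range, the closest pair of positions is noticeably further apart with the extra frequency, which is the effect the section describes.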

<hr>


## **Final Formulation of `Positional Encoding`**

So we will represent the `Positional Encoding` for each position as a `Vector`, and we'll combine this positional encoding vector with the corresponding `Word Embedding` before feeding it into the `Self-Attention` mechanism.

**Size of Positional Encoding Vector:**

The size of the positional encoding vector is equal to the size of the `Word Embeddings`, i.e. `d_model`. For example, if the `Word Embeddings` are of size `512`, then the positional encoding vector will also be of size `512`.

Then, we'll simply add these two vectors element-wise to get the final input to the `Self-Attention` mechanism.

For example, if the `Word Embedding` for a word is:

$$
\text{Word Embedding} = \begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ \vdots \\ e_{d\_model} \end{bmatrix}
$$

And the `Positional Encoding` for that word's position is:

$$
PE(pos) = \begin{bmatrix} p_1 \\ p_2 \\ p_3 \\ \vdots \\ p_{d\_model} \end{bmatrix}
$$

Then, the final input to the `Self-Attention` mechanism will be:

$$
\text{Input} = \begin{bmatrix} e_1 + p_1 \\ e_2 + p_2 \\ e_3 + p_3 \\ \vdots \\ e_{d\_model} + p_{d\_model} \end{bmatrix}
$$

The output of this addition will be a vector of size `d_model`, which will be fed into the `Self-Attention` mechanism. This output vector now contains both the semantic information from the `Word Embedding` and the positional information from the `Positional Encoding`.

**Why are we not Concatenating the Vectors Instead of Adding Them?**

- **Dimensionality**: Concatenating the vectors would double the dimensionality of the input to the `Self-Attention` mechanism. If the `Word Embedding` and `Positional Encoding` are both of size `d_model`, concatenating them would result in a vector of size `2 * d_model`. This would increase the computational complexity and memory requirements of the model significantly.

- **Model Complexity**: Adding the vectors keeps the model simpler and more efficient. The `Self-Attention` mechanism is designed to work with fixed-size input vectors, and increasing the size of the input could complicate the architecture and make it harder to train.
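
The difference between the two options can be sketched with a tiny example, assuming `d_model = 4` and purely illustrative values for both vectors:

```python
d_model = 4

# Illustrative values only.
word_embedding = [0.2, 0.8, 0.5, 0.1]
positional_encoding = [0.84, 0.54, 0.48, 0.88]

# Option 1: element-wise addition keeps the size at d_model.
added = [e + p for e, p in zip(word_embedding, positional_encoding)]

# Option 2: concatenation doubles the size to 2 * d_model.
concatenated = word_embedding + positional_encoding

print("added:", added, "size:", len(added))
print("concatenated size:", len(concatenated))
```

With addition, every layer after the input keeps working with vectors of size `d_model`; with concatenation, the first projection would have to accept inputs twice as wide.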

### **What is the Formula for Calculating the `Positional Encoding Values`?**
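
For reference, the sinusoidal formulation from the original Transformer paper ("Attention Is All You Need") is usually written as follows, where `i` indexes each `Sine`/`Cosine` pair of dimensions:

$$
PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d\_model}}\right), \quad
PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d\_model}}\right)
$$

Below is a minimal sketch of this formula in plain Python. The function name is hypothetical, `d_model` is assumed to be even, and many implementations start positions at `0` rather than `1`.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position (assumes d_model is even)."""
    pe = []
    for i in range(d_model // 2):
        # Each sine/cosine pair uses its own, progressively lower frequency.
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i + 1
    return pe

pe0 = positional_encoding(0, 8)
pe1 = positional_encoding(1, 8)

print([round(v, 4) for v in pe0])
print([round(v, 4) for v in pe1])
```

The first pair of dimensions is exactly `sin(pos)` and `cos(pos)`, and each later pair divides `pos` by a larger scaling factor, just like the `pos/2` example earlier.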



### **How `Sine` Function Captures `Relative Positions`?**
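
One common way to see this (a sketch based on the standard angle-addition identities, shown for a single `Sine`/`Cosine` pair with frequency `1`): the encoding of position `pos + k` is a linear transformation of the encoding of position `pos`, and the transformation depends only on the offset `k`, not on `pos` itself.

$$
\begin{bmatrix} \sin(pos + k) \\ \cos(pos + k) \end{bmatrix} =
\begin{bmatrix} \cos(k) & \sin(k) \\ -\sin(k) & \cos(k) \end{bmatrix}
\begin{bmatrix} \sin(pos) \\ \cos(pos) \end{bmatrix}
$$

So moving `k` words ahead always corresponds to the same rotation, wherever we start in the sentence, which gives the model a consistent handle on relative positions.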
