Till now we've understood `Self-Attention` and `Multi-Head Attention` mechanisms in detail.

The major advantages of these mechanisms are:

- Parallelization: Unlike `RNNs`, which process tokens one at a time, `Self-Attention` processes all tokens in a sequence in parallel, significantly speeding up training.

- Long-Range Dependencies: `Self-Attention` can capture dependencies between tokens regardless of their distance in the sequence, making it effective for understanding context over long texts. In effect, it generates `Contextual Embeddings`: the embedding of each word is computed from the entire sentence.

- Flexibility: `Self-Attention` can be applied to various types of data, including text, images, and more, making it a versatile tool in deep learning.

**But,**

This mechanism has a major `Drawback` i.e. **`Lack of Positional Information`**.

Let's understand this with an example.

Consider the sentences:

S1 = `"Lion Kills Deer"`

S2 = `"Deer Kills Lion"`

If we calculate the `Self-Attention` for both sentences (using the same weights and no positional information), the output `Contextual Embedding` for `"Lion"` is identical in S1 and S2, and the same holds for `"Deer"` and `"Kills"`. `Self-Attention` treats its input as an unordered set of tokens, so reordering the words only reorders the outputs. This means that the model cannot distinguish between the two sentences based on their meaning, as the positional information is lost.

This should not happen as the meaning of both sentences is completely different. But due to the lack of positional information in `Self-Attention`, the model fails to capture this difference.
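
The claim above can be checked with a minimal sketch in NumPy. The 3-D embedding values and the random projection matrices `W_q`, `W_k`, `W_v` are assumptions chosen only for illustration, and `self_attention` is a hypothetical helper implementing single-head scaled dot-product attention, not the code of any particular library.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over the rows of X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

# Toy 3-D word embeddings (illustrative values, not learned)
emb = {
    "Lion":  np.array([0.2, 0.8, 0.5]),
    "Kills": np.array([0.6, 0.1, 0.3]),
    "Deer":  np.array([0.9, 0.4, 0.7]),
}

# Random projection matrices standing in for learned weights (assumption)
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(3, 3)) for _ in range(3))

s1 = ["Lion", "Kills", "Deer"]
s2 = ["Deer", "Kills", "Lion"]
out1 = self_attention(np.stack([emb[w] for w in s1]), W_q, W_k, W_v)
out2 = self_attention(np.stack([emb[w] for w in s2]), W_q, W_k, W_v)

# Compare the contextual embedding of the same word across both sentences
lion_same = np.allclose(out1[s1.index("Lion")], out2[s2.index("Lion")])
deer_same = np.allclose(out1[s1.index("Deer")], out2[s2.index("Deer")])
print(lion_same, deer_same)  # True True
```

The outputs for S2 are just the outputs for S1 in a different row order, which is why each word ends up with the same vector in both sentences.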

Therefore, we can conclude that for any sentence where the order of words matters, `Self-Attention` alone is not sufficient to capture the meaning of the sentence.

So we need a way to represent or preserve the order of words in a sequence while using `Self-Attention`.

This is where **`Positional Encoding`** comes into play.


## **Positional Encoding**

In `RNNs` and `LSTMs`, the order of words is inherently preserved due to their sequential processing nature. However, in `Self-Attention`, since all words are processed simultaneously, we need to explicitly provide information about the position of each word in the sequence.

**`Positional Encoding`** is a technique used to inject information about the position of words in a sequence into the model. This is typically done by adding a positional encoding vector to each word embedding before feeding it into the `Self-Attention` mechanism.


## **Understanding `Positional Encoding` From `First Principles`**

The most basic way to pass `Positional Information` to the `Self-Attention` mechanism is to add an `Extra Dimension` to the `Word Embeddings` that represents the position of each word in the sequence.

For example, consider the sentence: `"Lion Kills Deer"`.

The `Word Embeddings` for the words `"Lion"`, `"Kills"`, and `"Deer"` can be represented as follows (assuming 3D embeddings for simplicity):

$$
\text{Lion} = \begin{bmatrix} 0.2 \\ 0.8 \\ 0.5 \end{bmatrix}, \quad
\text{Kills} = \begin{bmatrix} 0.6 \\ 0.1 \\ 0.3 \end{bmatrix}, \quad
\text{Deer} = \begin{bmatrix} 0.9 \\ 0.4 \\ 0.7 \end{bmatrix}
$$

Now, we can add an `Extra Dimension` to each embedding to represent the position of each word in the sequence:

$$
\text{Lion} = \begin{bmatrix} 0.2 \\ 0.8 \\ 0.5 \\ 1 \end{bmatrix}, \quad
\text{Kills} = \begin{bmatrix} 0.6 \\ 0.1 \\ 0.3 \\ 2 \end{bmatrix}, \quad
\text{Deer} = \begin{bmatrix} 0.9 \\ 0.4 \\ 0.7 \\ 3 \end{bmatrix}
$$

Here, the last element in each embedding vector represents the position of the word in the sequence (1 for `"Lion"`, 2 for `"Kills"`, and 3 for `"Deer"`).
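
As a small sketch of this step, the position column can be appended with NumPy. The variable names here are illustrative, and the values are the same toy embeddings used above.

```python
import numpy as np

# One row per word, using the toy 3-D embeddings from the example
embeddings = np.array([
    [0.2, 0.8, 0.5],  # Lion
    [0.6, 0.1, 0.3],  # Kills
    [0.9, 0.4, 0.7],  # Deer
])

# 1-based positions as a column vector: [[1.], [2.], [3.]]
num_words = embeddings.shape[0]
positions = np.arange(1, num_words + 1, dtype=embeddings.dtype).reshape(-1, 1)

# Append the position as a fourth dimension of each embedding
with_position = np.hstack([embeddings, positions])
print(with_position.shape)  # (3, 4)
```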

**But, this approach has some limitations:**

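1. **Scalability**: The positional value grows without bound as the sequence gets longer. The embedding gains only one extra dimension, but that single value is on a completely different scale from the other components (which lie between `0` and `1` here), so for long sequences it dominates every computation the embedding takes part in, such as the dot products inside `Self-Attention`.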

2. **Huge Number**: The positional values can become very large for long sequences, which may lead to numerical instability during training. `Neural Networks` generally train more stably when their inputs are small, roughly in the range of `-1 to 1` or `0 to 1`. So, if we have a sentence with `1000` words, the positional value for the last word would be `1000`, and during `Backpropagation` such a large input can contribute to `Exploding Gradients`, making the training process unstable.
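
A rough sketch of how a raw position value swamps the rest of the embedding: the raw dot product between two embeddings is used here as a simplified stand-in for an attention score (an assumption, since real attention first applies learned projections), and `with_position` is a hypothetical helper.

```python
import numpy as np

def with_position(vec, pos):
    """Append a raw integer position as an extra embedding dimension."""
    return np.append(vec, float(pos))

lion = np.array([0.2, 0.8, 0.5])
deer = np.array([0.9, 0.4, 0.7])

# Dot product of the word embeddings alone
semantic_score = lion @ deer

# Same two words placed near the start and near the end of a long sequence
near_start = with_position(lion, 1) @ with_position(deer, 3)
near_end = with_position(lion, 998) @ with_position(deer, 1000)

print(semantic_score)  # about 0.85
print(near_start)      # about 3.85
print(near_end)        # about 998000.85
```

Near the end of a 1000-word sequence, the word content contributes less than one part in a million to the score, so the words themselves barely matter.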

We could try normalizing these positional values by dividing them by the length of the sequence, which keeps every value between `0` and `1`. This removes the large numbers, but it introduces a new problem.

- In practice, sequences vary greatly in length, so the same normalized value refers to different positions in different sentences. For example, `0.5` is the 2nd word of a 4-word sentence but the 50th word of a 100-word sentence, and the gap between neighbouring words also changes with the sentence length. With no consistent way to represent positions, the model cannot learn a stable meaning for a positional value.
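
The inconsistency can be seen in a short sketch. It assumes each sentence is normalized by its own length, and `normalized_positions` is a hypothetical helper.

```python
import numpy as np

def normalized_positions(seq_len):
    """1-based positions divided by the sequence length, giving values in (0, 1]."""
    return np.arange(1, seq_len + 1) / seq_len

short = normalized_positions(4)    # 4-word sentence
long = normalized_positions(100)   # 100-word sentence

# The same value refers to the 2nd word in one sentence and the 50th in the other
print(short[1], long[49])  # 0.5 0.5

# The step between neighbouring words also depends on the sentence length
short_step = short[1] - short[0]
long_step = long[1] - long[0]
print(short_step, long_step)
```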

<hr>
