So far, we have covered the `Self-Attention` and `Multi-Head Attention` mechanisms in detail.

The major advantages of these mechanisms are:

- Parallelization: Unlike `RNNs`, which process a sequence one token at a time, `Self-Attention` processes all tokens in a sequence in parallel, significantly speeding up training.

- Long-Range Dependencies: `Self-Attention` can capture dependencies between tokens regardless of their distance in the sequence, making it effective for understanding context over long texts. In other words, it generates a `Contextual Embedding` for each word by considering the entire sentence while computing the embedding of that particular word.

- Flexibility: `Self-Attention` can be applied to various types of data, including text, images, and more, making it a versatile tool in deep learning.

**But,**

This mechanism has a major drawback: **`Lack of Positional Information`**.

Let's understand this with an example.

Consider the sentences:

S1 = `"Lion Kills Deer"`

S2 = `"Deer Kills Lion"`

If we calculate `Self-Attention` for both sentences, the output `Contextual Embeddings` for the words `"Lion"` and `"Deer"` would be identical in both sentences, because `Self-Attention` does not take the order of words into account: reordering the input words only reorders the outputs in the same way, without changing their values. This means the model cannot distinguish between the two sentences, as the positional information is lost.

This should not happen, as the two sentences have completely different meanings. But due to the lack of positional information in `Self-Attention`, the model fails to capture this difference.
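The claim above can be checked numerically. The following is a minimal sketch, assuming a single attention head, random toy embeddings, and random projection matrices (the names `self_attention`, `emb`, `Wq`, `Wk`, and `Wv` are illustrative, not taken from any library):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention, with no positional information."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Numerically stable row-wise softmax
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d_model = 4  # assumed toy embedding size

# Hypothetical toy word embeddings and projection matrices (random, for illustration only)
emb = {w: rng.normal(size=d_model) for w in ("Lion", "Kills", "Deer")}
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

s1 = ["Lion", "Kills", "Deer"]
s2 = ["Deer", "Kills", "Lion"]

out1 = self_attention(np.stack([emb[w] for w in s1]), Wq, Wk, Wv)
out2 = self_attention(np.stack([emb[w] for w in s2]), Wq, Wk, Wv)

# Compare the contextual embedding of each word across the two sentences
lion_same = np.allclose(out1[s1.index("Lion")], out2[s2.index("Lion")])
deer_same = np.allclose(out1[s1.index("Deer")], out2[s2.index("Deer")])
print(lion_same, deer_same)
```

Both flags come out `True`: each word receives the same contextual embedding in S1 and S2, so nothing in the output tells the model who killed whom.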

Therefore, we can conclude that whenever the order of words in a sentence matters, `Self-Attention` alone is not sufficient to capture the meaning of the sentence.

So we need a way to represent or preserve the order of words in a sequence while using `Self-Attention`.

This is where **`Positional Encoding`** comes into play.


## **Positional Encoding**

In `RNNs` and `LSTMs`, the order of words is inherently preserved due to their sequential processing nature. However, in `Self-Attention`, since all words are processed simultaneously, we need to explicitly provide information about the position of each word in the sequence.

**`Positional Encoding`** is a technique used to inject information about the position of words in a sequence into the model. This is typically done by adding a positional encoding vector to each word embedding before feeding it into the `Self-Attention` mechanism.
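As a rough illustration of "adding a positional encoding vector to each word embedding", here is a minimal sketch. It assumes the fixed sinusoidal scheme from the original Transformer paper, which is only one possible choice (learned position embeddings are another), along with an even `d_model` and random toy embeddings. The function name `sinusoidal_positional_encoding` is hypothetical.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Assumed scheme: sine on even dimensions, cosine on odd dimensions (d_model must be even)."""
    positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # shape (1, d_model // 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
d_model = 4  # assumed toy embedding size

# Hypothetical toy word embeddings (random, for illustration only)
emb = {w: rng.normal(size=d_model) for w in ("Lion", "Kills", "Deer")}

s1 = ["Lion", "Kills", "Deer"]
s2 = ["Deer", "Kills", "Lion"]
pe = sinusoidal_positional_encoding(len(s1), d_model)

# Add the position vector to each word embedding before self-attention
x1 = np.stack([emb[w] for w in s1]) + pe
x2 = np.stack([emb[w] for w in s2]) + pe

# "Lion" sits at position 0 in S1 and position 2 in S2, so its input vector now differs
lion_inputs_differ = not np.allclose(x1[s1.index("Lion")], x2[s2.index("Lion")])
print(lion_inputs_differ)
```

Because the same word now enters `Self-Attention` with a different vector depending on where it appears, the two sentences no longer look identical to the model.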
