This note is part of `Self Attention` in the **Hand Written Notes** on **Transformer** in the **NLP Complete** repository.


## **Progress Till Now (Learning Summary)**

### **Word Embedding**

First, we need a way to represent `Sequential Data` like `Text` in a form that `Computers` can understand. A common and effective way to do that is to convert each word into a `Vector` of numbers. This is called `Word Embedding`. An `Embedding` is a `Dense Representation` of `Words` in a `Continuous Vector Space`: each word is represented as a `Vector` of fixed size (for example 50, 100, or 300 dimensions), and similar words are mapped to nearby points in that space. Popular techniques for generating word embeddings include `Word2Vec`, `GloVe`, and `FastText`.

Word embeddings capture the semantic meaning of words. For example, the words `"king"` and `"queen"` have similar vector representations because they appear in similar contexts in language.

Words with similar meanings will be closer together in the vector space. For example, the vectors for `"king"` and `"queen"` will be closer to each other than to the vector for `"car"`.
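As a rough illustration of "closer together", the sketch below compares toy vectors with cosine similarity, one common closeness measure. The three vectors are invented for this example and do not come from `Word2Vec`, `GloVe`, or any trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional embeddings, invented for illustration only.
embeddings = {
    "king":  np.array([0.80, 0.65, 0.10]),
    "queen": np.array([0.75, 0.70, 0.15]),
    "car":   np.array([0.10, 0.20, 0.90]),
}

king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
king_car = cosine_similarity(embeddings["king"], embeddings["car"])

print(f"king vs queen: {king_queen:.3f}")
print(f"king vs car:   {king_car:.3f}")
```

With these made-up values, `"king"` scores much higher against `"queen"` than against `"car"`, which is the pattern real embeddings are expected to show.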

<hr>

### **Self-Attention : First Principle**

But word embeddings alone are not sufficient to capture the `Context` of a word in a sentence. For example, the word `"bank"` can have different meanings depending on the context (financial institution vs. river bank), such as `"She went to the bank to deposit money"` vs. `"He sat by the bank of the river"`.

We need a mechanism to allow the model to focus on different parts of the input sequence when processing each word. This is where `Self-Attention` comes into play.

**Self-Attention**

Self-Attention is a mechanism that takes the `Raw Input Embeddings` and computes a new `Embedding` for each word by considering the entire sequence. This way, each word is represented by a new `Embedding` that considers the context provided by all other words in the sequence.

To do so, Self-Attention:

- First takes the `Input Embeddings` of all words in the sequence.

- Then, for each word, it computes the `Dot Product` of that word's embedding with the embedding of every word in the sequence (including itself) to determine how much attention to pay to each word.

- The result of the `Dot Product` is a `Scalar Value` that indicates the similarity or relevance between the words.

- The `Dot Product` scores are then `Normalized` using the `Softmax Function`, which converts each word's scores into `Probabilities` that sum to 1. These probabilities determine how much attention to pay to each word.

- The `Probabilities` are then used as weights to compute a `Weighted Sum` of the `Input Embeddings`, which produces the final output embedding for each word.
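The steps above can be sketched in a few lines of NumPy. This is a minimal, parameter-free version that follows the first-principles description in this note; the function names `softmax` and `simple_self_attention` and the toy embedding values are assumptions made for illustration.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

def simple_self_attention(embeddings):
    """Parameter-free self-attention: scores -> softmax weights -> weighted sum."""
    scores = embeddings @ embeddings.T    # dot product of every word with every word
    weights = softmax(scores)             # each row becomes probabilities summing to 1
    return weights @ embeddings, weights  # weighted sum of the input embeddings

# Toy embeddings for a 3-word sequence (values chosen for illustration).
E = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
])

contextual, weights = simple_self_attention(E)
print(weights.sum(axis=1))  # each row of attention weights sums to 1
print(contextual.shape)     # same shape as the input: (3, 3)
```

Each output row is a blend of all input rows, so every word's new embedding carries information from the whole sequence.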

Below is the image that shows the basic working of `Self-Attention` mechanism.

<img src="../../../Notes_Images/Self_Attn.png" alt="Self Attention Basic" width="1400"/>

<hr>

**Contextual Embeddings**

The output embeddings from the Self-Attention mechanism are called `Contextual Embeddings` because they capture the meaning of each word in the context of the entire sequence. For example, the contextual embedding for the word `"bank"` will be different in the two sentences mentioned earlier, reflecting its different meanings based on context.

Suppose we have two sentences:

1. `Money Bank Grows`

2. `River Bank Flows`

The `Contextual Embedding` for the word `"Bank"` in the first sentence will be influenced more by the words `"Money"` and `"Grows"`, while in the second sentence, it will be influenced more by `"River"` and `"Flows"`. This allows the model to understand the different meanings of the same word based on its context.
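A small sketch can make this concrete. It assumes toy 3-dimensional embeddings (illustrative values, not trained ones) and the simple dot-product, softmax, weighted-sum form of self-attention. The static vector for `"bank"` is the same in both sentences, yet its contextual vector comes out different.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

def contextual_embeddings(E):
    """Simple self-attention: softmax(E E^T) applied as weights over E."""
    weights = softmax(E @ E.T)
    return weights @ E

# Toy static embeddings (illustrative values); "bank" is identical in both sentences.
vocab = {
    "money": np.array([0.1, 0.2, 0.3]),
    "bank":  np.array([0.4, 0.5, 0.6]),
    "grows": np.array([0.7, 0.8, 0.9]),
    "river": np.array([0.2, 0.1, 0.4]),
    "flows": np.array([0.5, 0.3, 0.2]),
}

sentence_1 = np.stack([vocab[w] for w in ["money", "bank", "grows"]])
sentence_2 = np.stack([vocab[w] for w in ["river", "bank", "flows"]])

# "bank" is the second word (index 1) in both sentences.
bank_1 = contextual_embeddings(sentence_1)[1]
bank_2 = contextual_embeddings(sentence_2)[1]

print("bank in 'Money Bank Grows':", np.round(bank_1, 3))
print("bank in 'River Bank Flows':", np.round(bank_2, 3))
```

The two printed vectors differ because each one is a weighted mix of a different set of neighbouring words.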

<hr>


## **How Is `Self-Attention` Computed in `Parallel`?**

As we know, to calculate the `Self-Attention` for a sequence of words, we need to compute the `Dot Product` between each word's embedding and every other word's embedding in the sequence.

This can be represented mathematically using `Matrix Multiplication`, which allows us to compute the `Self-Attention` scores for all words in the sequence simultaneously, rather than one at a time.
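To see that the two views agree, the sketch below computes the scores once with an explicit pair-by-pair loop and once with a single matrix multiplication. The embedding values are toy numbers chosen for illustration.

```python
import numpy as np

E = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
])
n = E.shape[0]

# One pair at a time: n * n separate dot products.
scores_loop = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        scores_loop[i, j] = np.dot(E[i], E[j])

# All pairs at once: a single matrix multiplication.
scores_matmul = E @ E.T

print(np.allclose(scores_loop, scores_matmul))  # True
```

The matrix form expresses the same arithmetic as one operation, which numerical libraries and hardware accelerators can execute far more efficiently than a Python loop.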

So, suppose we have the input embeddings below for a sequence of 3 words:

<img src="../../../Notes_Images/Self_Attn.png" alt="Self Attention Matrix Multiplication" width="1400"/>

We can represent these embeddings as a matrix `E`, where each row corresponds to the embedding of a word in the sequence as below:

<img src="../../../Notes_Images/Self_Attn_2.png" alt="Self Attention Matrix Multiplication 2" width="1400"/>

<hr>

### **Actual Calculation of Self-Attention using Matrix Multiplication**

Suppose we have two sentences:

1. `Money Bank Grows`

2. `River Bank Flows`

The input embeddings for the words in these sentences, stacked as the rows of a matrix `E`, are:

$$
E = \begin{bmatrix}
0.1 & 0.2 & 0.3 \\
0.4 & 0.5 & 0.6 \\
0.7 & 0.8 & 0.9 \\
0.2 & 0.1 & 0.4 \\
0.5 & 0.3 & 0.2
\end{bmatrix}
\qquad
\begin{array}{l}
\text{(Money)}\\
\text{(Bank)}\\
\text{(Grows)}\\
\text{(River)}\\
\text{(Flows)}
\end{array}
$$

Now if we multiply the matrix `E` with its transpose `E^T`, we get the `Dot Product` scores between each pair of words in the sequence:

$$
S = E \cdot E^T
$$
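As a worked example, each entry of `S` is the dot product of two rows of `E`. Using the rows for `Money` and `Bank` from the matrix `E` above:

$$
S_{ij} = \sum_{k=1}^{3} E_{ik} \, E_{jk}
$$

$$
S_{\text{Money},\,\text{Bank}} = (0.1)(0.4) + (0.2)(0.5) + (0.3)(0.6) = 0.04 + 0.10 + 0.18 = 0.32
$$

Since the dot product does not depend on the order of its arguments, $S_{ij} = S_{ji}$, so `S` is a symmetric matrix.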

Calculating the matrix multiplication:



$$
S =
\begin{bmatrix}
0.14 & 0.32 & 0.50 & 0.16 & 0.17 \\
0.32 & 0.77 & 1.22 & 0.37 & 0.47 \\
0.50 & 1.22 & 1.94 & 0.58 & 0.77 \\
0.16 & 0.37 & 0.58 & 0.21 & 0.21 \\
0.17 & 0.47 & 0.77 & 0.21 & 0.38
\end{bmatrix}
$$

This resulting matrix `S` contains the `Dot Product` scores between each pair of words in the sequence. Each element `S[i][j]` represents the similarity score between the `i-th` and `j-th` words based on their embeddings.
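The same calculation can be checked with NumPy. This is a small sketch using the matrix `E` from this section; the `words` list is a helper added here only to look up rows by name.

```python
import numpy as np

words = ["Money", "Bank", "Grows", "River", "Flows"]
E = np.array([
    [0.1, 0.2, 0.3],  # Money
    [0.4, 0.5, 0.6],  # Bank
    [0.7, 0.8, 0.9],  # Grows
    [0.2, 0.1, 0.4],  # River
    [0.5, 0.3, 0.2],  # Flows
])

S = E @ E.T  # shape (5, 5): one score per pair of words

print(np.round(S, 2))

# Read off a single score by word name.
i, j = words.index("Money"), words.index("Bank")
print(f"S[{words[i]}][{words[j]}] = {S[i, j]:.2f}")  # S[Money][Bank] = 0.32

# S is symmetric, and each diagonal entry is a word's dot product with itself.
print(np.allclose(S, S.T))  # True
```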

<hr>
