# Introduction to Attention
<!-- estimated time: 4hours -->

This section will cover:

1. Sequence to sequence recap
2. Attention overview - Encoding
3. Attention overview - Decoding
4. Bahdanau and Luong Attention
5. Multiplicative attention
6. Additive attention
7. Computer vision applications
8. NLP application: Google neural machine translation
9. Other Attention Methods
10. The Transformer and Self-Attention
* Lab: Attention basics

Attention started out in the field of computer vision as an attempt to mimic human perception:
> "One important property of human perception is that one does not tend to process a whole scene in its entirety at once. Instead, humans focus attention selectively on parts of the visual space to acquire information when and where it is needed, and combine information from different fixations over time to build up an internal representation of the scene, guiding future eye movements and decision making"
- [Recurrent Models of Visual Attention](https://arxiv.org/abs/1406.6247)

Note here that instead of processing the entirety of the image, all that is needed to know it is a picture of a bird is to ignore the background and focus on the item of interest. Further, if we can direct attention from the entirety of the image to components of it, we can describe the image in a more complete and nuanced manner:
<img src="assets/images/06/img_001.png" width=700 align='center'>

# 1: Seq2Seq Recap

Classic Seq2Seq models (those without attention) look at the original sentence to be translated once and then have to use that *entire* input to produce every single output term.

A sequence to sequence model takes in an input that is a sequence of items and then produces another sequence of items as an output.

* In machine translation, the input sequence is a series of words in one language and the output is a translation in another language.

* In text summarization, the input is a long sequence of words and the output is a short one.

The seq2seq model usually consists of an encoder and a decoder. The encoder first processes all of the inputs and turns them into a single representation, typically a single vector known as the **context** vector. The *context* vector contains whatever information the encoder was able to capture from the input sequence.
<img src="assets/images/06/img_002.png" width=700 align='center'>

The context vector is then sent to the decoder, which uses it to formulate an output sequence. In machine translation scenarios, the encoder and decoder are both recurrent neural networks (RNNs), usually built from long short-term memory (LSTM) cells.
<img src="assets/images/06/img_003.png" width=700 align='center'>

In this scenario, the context vector is a vector of numbers encoding the information that the encoder captured from the input sequence. In real-world scenarios, its length is typically a power of two ($2^{n}$), such as 256 or 512.
<img src="assets/images/06/img_004.png" width=700 align='center'>

If we look at the previous example, translating *comment allez vous* to *how are you*, we can see how the hidden state develops:

1. Take the first word and develop the first hidden state:
<img src="assets/images/06/img_005.png" width=700 align='center'>

2. In the second step, we take the second word AND the first hidden state as inputs to the RNN and produce the second hidden state:
<img src="assets/images/06/img_006.png" width=700 align='center'>

3. In the third step, we do the same process as the second, we take the third (and last) word AND the second hidden state as inputs and generate the third hidden state:
<img src="assets/images/06/img_007.png" width=700 align='center'>

The third hidden state is the context vector that will be passed to the decoder. **This highlights a limitation of seq2seq models!**

The encoder is confined to sending a single vector, no matter how long or short the input sequence is. A reasonably sized vector makes the model struggle with long input sequences. If you instead use a very large number of hidden units so that the context is very large, the model overfits on short sequences and performance drops as the number of parameters grows. **Attention in neural nets solves this issue.**
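
To make the bottleneck concrete, here is a minimal NumPy sketch of a vanilla RNN encoder. The weight names (`W_x`, `W_h`), the toy sizes, and the random inputs are assumptions made for illustration and do not come from any particular model. Whatever the input length, the decoder receives a context vector of the same size.

```python
import numpy as np

def encode(inputs, W_x, W_h):
    """Run a vanilla RNN over the inputs and return every hidden state."""
    h = np.zeros(W_h.shape[0])            # initial hidden state
    states = []
    for x in inputs:                      # one input element per time step
        h = np.tanh(W_x @ x + W_h @ h)    # uses the input AND the previous hidden state
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 3, 4              # toy sizes (assumed)
W_x = rng.normal(size=(hidden_dim, embed_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))

short_input = rng.normal(size=(3, embed_dim))    # e.g. a 3-word sentence
long_input = rng.normal(size=(30, embed_dim))    # a much longer sentence

# Without attention, only the last hidden state is passed on as the context vector.
context_short = encode(short_input, W_x, W_h)[-1]
context_long = encode(long_input, W_x, W_h)[-1]
print(context_short.shape, context_long.shape)
```

Both contexts have `hidden_dim` entries, so the 30-step input has to be squeezed into the same space as the 3-step one.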

# 2. Attention overview - Encoding

A seq2seq model with attention works like this:

1. The encoder processes the input sequence, just like the model without attention, one word at a time. It produces a hidden state for each of these inputs and uses that hidden state in the next step.
<img src="assets/images/06/img_008.png" width=700 align='center'>

2. Then, the model passes the context to the decoder. However, unlike the context vector in the model WITHOUT attention, this one is not just the final hidden state; it is all of the hidden states.
<img src="assets/images/06/img_009.png" width=700 align='center'>

The benefit of passing all the hidden input states is that it gives us flexibility in the context size. Longer sequences can have longer context vectors that better capture the information from the input sequence.

Intuitively, each hidden state is (likely) most associated with the input element that was read just before it was produced. For example, the first hidden state was produced after encoding the first word, so of all the hidden states it captures the essence of the first input the most.

So, when we **focus** on the first hidden state, we **focus** on the first input. And likewise when we focus on the second hidden state, we are focusing on the second input, and so on.
<img src="assets/images/06/img_010.png" width=700 align='center'>
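
As a rough sketch of the encoding side with attention, the snippet below keeps every hidden state instead of only the last one. The vanilla RNN update, the weight names, and the sizes are illustrative assumptions.

```python
import numpy as np

def encode_all_states(inputs, W_x, W_h):
    """Return one hidden state per input element, stacked as rows."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in inputs:
        h = np.tanh(W_x @ x + W_h @ h)
        states.append(h)
    return np.stack(states)               # shape: (sequence_length, hidden_dim)

rng = np.random.default_rng(0)
W_x = rng.normal(size=(4, 3))             # toy weights (assumed sizes)
W_h = rng.normal(size=(4, 4))

three_words = rng.normal(size=(3, 3))
ten_words = rng.normal(size=(10, 3))

context_3 = encode_all_states(three_words, W_x, W_h)
context_10 = encode_all_states(ten_words, W_x, W_h)
print(context_3.shape, context_10.shape)
```

The context now has one row per input element, so a longer input yields a larger context, and row `j` is the state produced right after reading input `j`.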

## Attention Overview - Decoding

At every time step, the attention decoder pays attention to the appropriate part of the input sequence using the context vector. The process for the decoder to know which aspects of the input sequence are best to pay attention to is learned during the training phase.

<img src="assets/images/06/img_011.png" width=700 align='center'>

The process learned is not as simple as going from the first to the last hidden vector; it's not just associating the current hidden vector with the thing to be predicted. It is more sophisticated.

Consider the example of translating a French sentence into an English one, and assume we have a trained model with attention. Let's take the first 4 words of the sentence on the left of the picture:

<img src="assets/images/06/img_012.png" width=700 align='center'>

If we consider the words in the top portion of the picture, they line up well with their French counterparts. But note that the next few words to translate are "zone economique europeene". Something different happens:

<img src="assets/images/06/img_013.png" width=700 align='center'>

If we take the lighter-shaded blocks to mark the words that are **focused** on when producing the next word, you can see that the three words "zone economique europeene" are not attended to in sequence when producing "european economic area". The output is not "area economic european", as it would be if the model followed the French word order. The model was able to learn this reordering from the training data set.

The model then goes on to produce the remaining terms more or less in sequence:
<img src="assets/images/06/img_014.png" width=700 align='center'>

This shows how the attention mechanism lets the model focus on the right parts of the sequence at the right time.

## Check on Learning:

True or False: A sequence-to-sequence model processes the input sequence all in one step
> * False: a seq2seq model works by feeding one element of the input sequence at a time to the encoder

What are two limitations of seq2seq models that are solved by attention methods?
1. The fixed size of the context matrix passed from the encoder to the decoder is a bottleneck
2. The difficulty of encoding long sequences and recalling long-term dependencies

How large is the context matrix in an *attention* seq2seq model?
> * It depends on the length of the input sequence. With attention, the context grows with the input, whereas the general seq2seq model passes a context of fixed length.

# 3. Attention overview - Decoding

In models without attention, we'd feed only the final context vector to the decoder RNN, along with the embedding of the end token, and the decoder would then generate one element of the output sequence at each time step.

<img src="assets/images/06/img_015.png" width=700 align='center'>

The case is different in an attention decoder.

<img src="assets/images/06/img_016.png" width=700 align='center'>

The attention decoder has the ability to look at the inputted words and the decoder's own hidden state:

<img src="assets/images/06/img_017.png" width=700 align='center'>

and then do the following:
1. Use a scoring function to score each hidden state in the context matrix
2. Pass those scores into a softmax function so that all values are between zero and one and sum to one. These values determine how much each vector will be expressed in the attention vector that the decoder looks at before producing an output

<img src="assets/images/06/img_018.png" width=700 align='center'>

3. Multiply each vector by its softmax score and sum the resulting vectors to produce an attention context vector. This is a basic weighted sum operation

<img src="assets/images/06/img_019.png" width=700 align='center'>

Note: The context vector is an important milestone in this process but it is not the end goal

4. Now, the decoder has looked at the input word and the attention context vector, which focuses its attention on the appropriate place of the input sequence - it produces a hidden state and it produces the first word in the output sequence

<img src="assets/images/06/img_020.png" width=700 align='center'>

5. Next, the decoder takes the previous output and hidden states as an input, generates an attention context vector for that time step, which produces a new hidden state for that time step and next word in the output sequence

<img src="assets/images/06/img_021.png" width=700 align='center'>

6. This continues until the output sequence is completed

<img src="assets/images/06/img_022.png" width=700 align='center'>
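
The score, softmax, and weighted-sum steps above can be sketched in a few lines of NumPy. The dot product is assumed as the scoring function here, and the encoder and decoder states are hand-picked toy values, not outputs of a trained model.

```python
import numpy as np

def softmax(scores):
    exps = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return exps / exps.sum()

def attention_context(decoder_state, encoder_states):
    scores = encoder_states @ decoder_state  # step 1: one score per encoder hidden state
    weights = softmax(scores)                # step 2: positive values that sum to one
    context = weights @ encoder_states       # step 3: weighted sum of the hidden states
    return weights, context

# Three toy encoder hidden states (rows) and a decoder state that matches the second one.
encoder_states = np.array([[1.0, 0.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0, 0.0],
                           [0.0, 0.0, 1.0, 0.0]])
decoder_state = np.array([0.0, 5.0, 0.0, 0.0])

weights, context = attention_context(decoder_state, encoder_states)
print(weights)
print(context)
```

Because the decoder state points in the same direction as the second encoder state, almost all of the weight lands there, and the context vector is dominated by that hidden state.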

## Check on Learning:

In machine translation applications, the encoder and decoder are typically:
* Generative Adversarial Networks (GANs)
* **Recurrent Neural Networks (Typically vanilla RNN, LSTM, or GRU)**
* Mentats

What's a more reasonable embedding size for a real-world application?
* 4
* **200**
* 6,000

What are the steps that require calculating an attention vector in a seq2seq model with attention?
* Every time step in the model (both encoder and decoder)
* Every time step in the encoder only
* **Every time step in the decoder only**

# 4. Bahdanau (additive) and Luong (multiplicative) Attention

[Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473) *additive*

[Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025) *multiplicative*

Before delving into the details of the scoring functions, we need to make a distinction between the two major types of attention mechanisms: **additive attention** and **multiplicative attention**

## Additive (Bahdanau) Attention 

The scoring function in Bahdanau (additive) attention: $$e_{ij}=v_{a}^\top\text{tanh}\left ( W_{a}s_{i-1}+U_{a}h_{j} \right )$$

* $h_{j}$ is the hidden state from the encoder
* $s_{i-1}$ is the hidden state of the decoder in the previous time step
* $U_{a}$, $W_{a}$, and $v_{a}$ are weights learned during the training process ($U_{a}$ and $W_{a}$ are matrices, $v_{a}$ is a vector)

Basically, this is a scoring function that takes an encoder hidden state ($h_{j}$) and the decoder hidden state ($s_{i-1}$) and produces a single number for that pair, so each decoder time step gets one score per encoder hidden state.

The scores are then passed into the softmax:
$$a_{ij} = \frac{\exp \left( e_{ij} \right) }{\sum_{k=1}^{T_{x}} \exp\left(e_{ik}\right) }$$

and then applied to a weighted sum:
$$ c_{i} = \sum_{j=1}^{T_{x}} a_{ij}h_{j}$$

where we multiply each encoder hidden state by its softmax weight and then sum them, producing our attention context vector.

<img src="assets/images/06/img_023.png" width=700 align='center'>

In this architecture, the encoder is a **bi-directional RNN**, and the encoder vector is produced by concatenating the states of its two directions.
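
The additive scoring function, softmax, and weighted sum above can be written as a short NumPy sketch. The dimensions and the random weights are assumptions for illustration; in a real model $W_{a}$, $U_{a}$, and $v_{a}$ are learned.

```python
import numpy as np

def additive_attention(s_prev, H, W_a, U_a, v_a):
    """Bahdanau-style attention for one decoder step.

    s_prev: previous decoder hidden state, shape (dec_dim,)
    H:      encoder hidden states, shape (T_x, enc_dim)
    """
    # e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j), computed for all j at once
    e = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v_a      # shape (T_x,)
    a = np.exp(e - e.max())
    a = a / a.sum()                                    # softmax over encoder positions
    c = a @ H                                          # c_i = sum_j a_ij h_j
    return e, a, c

rng = np.random.default_rng(0)
T_x, enc_dim, dec_dim, attn_dim = 5, 6, 4, 3           # toy sizes (assumed)
H = rng.normal(size=(T_x, enc_dim))
s_prev = rng.normal(size=dec_dim)
W_a = rng.normal(size=(attn_dim, dec_dim))
U_a = rng.normal(size=(attn_dim, enc_dim))
v_a = rng.normal(size=attn_dim)

e, a, c = additive_attention(s_prev, H, W_a, U_a, v_a)
print(a, c.shape)
```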

## Multiplicative (Luong) Attention

Luong attention builds on Bahdanau attention by adding a couple more scoring functions. The architecture also differs in that it uses only the hidden states from the top RNN layer in the encoder, which allows the encoder and the decoder to both be stacks of RNNs. This has led to some of the premier models that are in production right now.

<img src="assets/images/06/img_024.png" width=700 align='center'>

There are three scoring functions that we can choose from in multiplicative attention:

$$
\text{score}\left( \textbf{h}_{t}, \overline{\textbf{h}}_{s} \right)=
\begin{cases}
\textbf{h}_{t}^\top \overline{\textbf{h}}_{s} & \text{dot}\\
\textbf{h}_{t}^\top \textbf{W}_{\textbf{a}} \overline{\textbf{h}}_{s} & \text{general}\\
\textbf{v}_{a}^\top \text{tanh} \left( \textbf{W}_{\textbf{a}} \left [\textbf{h}_{t} ; \overline{\textbf{h}}_{s} \right ] \right ) & \text{concat}
\end{cases}
$$

The simplest one is the **dot scoring function**. It multiplies each hidden state of the encoder by the hidden state of the decoder.

The **general scoring function** builds on top of the *dot* scoring function by adding a weight matrix between the encoder hidden state and decoder hidden state.

The **concat scoring function** is very similar to additive attention: it concatenates the hidden state of the decoder with the hidden state of the encoder, $\left [\textbf{h}_{t} ; \overline{\textbf{h}}_{s} \right ]$, multiplies the result by a weight matrix, $\textbf{W}_{\textbf{a}}$, applies a $\text{tanh}$ activation, and then multiplies it by another weight vector, $\textbf{v}_{a}^\top$. We give this function the hidden state of the decoder at this time step and the hidden states of the encoder at all time steps, and it produces a score for each encoder hidden state.
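
As a minimal sketch, the three scoring functions can be written side by side in NumPy. The function names, toy dimensions, and random weights are assumptions for illustration; each function scores one decoder state against all encoder states at once.

```python
import numpy as np

def score_dot(h_t, H_s):
    return H_s @ h_t                                   # h_t^T h_s for every s

def score_general(h_t, H_s, W_a):
    return H_s @ (W_a.T @ h_t)                         # h_t^T W_a h_s for every s

def score_concat(h_t, H_s, W_a, v_a):
    repeated = np.tile(h_t, (len(H_s), 1))             # copy h_t next to every h_s
    stacked = np.concatenate([repeated, H_s], axis=1)  # rows are [h_t ; h_s]
    return np.tanh(stacked @ W_a.T) @ v_a              # v_a^T tanh(W_a [h_t ; h_s])

rng = np.random.default_rng(0)
dim, source_len, attn_dim = 4, 3, 5                    # toy sizes (assumed)
h_t = rng.normal(size=dim)                             # decoder hidden state
H_s = rng.normal(size=(source_len, dim))               # encoder hidden states
W_general = rng.normal(size=(dim, dim))
W_concat = rng.normal(size=(attn_dim, 2 * dim))
v_a = rng.normal(size=attn_dim)

dot_scores = score_dot(h_t, H_s)
general_scores = score_general(h_t, H_s, W_general)
concat_scores = score_concat(h_t, H_s, W_concat, v_a)
print(dot_scores, general_scores, concat_scores)
```

Each call returns one score per encoder hidden state. With an identity matrix in place of `W_general`, the general score reduces to the dot score.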

<br><br>
We then do the softmax just as we did before:

$$\textbf{a}_{t}\left ( s \right) = \text{align} \left( \textbf{h}_{t} , \overline{\textbf{h}}_{s} \right) = \frac{\text{exp} \left ( \text{score} \left ( \textbf{h}_{t} , \overline{\textbf{h}}_{s} \right ) \right ) } { \sum_{s'} \text{exp} \left ( \text{score} \left ( \textbf{h}_{t} , \overline{\textbf{h}}_{s'} \right ) \right ) } $$

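The weights then produce $\textbf{c}_{t}$, the attention context vector, as the weighted sum $\textbf{c}_{t} = \sum_{s} \textbf{a}_{t}\left( s \right) \overline{\textbf{h}}_{s}$, which is concatenated with the decoder hidden state and passed through a $\text{tanh}$ layer: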

$$\tilde{\textbf{h}}_{t} = \text{tanh} \left( \textbf{W}_{\textbf{c}} \left [ \textbf{c}_{t} ; \textbf{h}_{t} \right] \right )$$

and $\tilde{\textbf{h}}_{t}$ is the final output of the decoder.

With Luong Attention, the following outcome is possible for an English to German Translation:

|class | sentence | 
|:-|-:|
|source| Orlando Bloom and Miranda Kerr still love each other|
|reference | Orlando Bloom und Miranda Kerr lieben sich noch immer |
|best | Orlando Bloom und Miranda Kerr lieben einander noch immer |
|seq2seq w/out attention | Orlando Bloom und Lucas Kerr lieben einander noch immer |

The error in the last row ("Lucas Kerr" instead of "Miranda Kerr") is something we can attribute to capturing all of the information *just* in the last hidden state of the encoder. This is one of the powerful things that attention does: it gives the decoder the ability to look at any part of the input sequence, no matter how far back it was.

# 5. Multiplicative Attention

<!-- https://www.youtube.com/watch?v=1-OwCgrx1eQ&t=12s -->

Previously, we looked at how the key concept of attention is to calculate an attention weight vector, which is used to amplify the signal from the most relevant parts of the input sequence and, at the same time, minimize the signal from the least relevant parts.

Now, we'll focus on the scoring functions that produce the attention weights. An attention scoring function takes in the hidden state of the decoder and the set of hidden states of the encoder.

As this scoring happens at each timestep on the decoder side, we use only the hidden state of the decoder at that timestep (or, in some scoring methods, the previous one). Given the two inputs, the decoder hidden state at decoding step `t` (a vector) AND the encoder hidden states for every encoding step (a matrix), the scoring function produces a vector with one score for each column of that matrix.

<img src="assets/images/06/img_025.png" width=700 align='center'>

The simplest scoring method is to calculate the dot product of the two inputs. As an example, take two vectors:

<img src="assets/images/06/img_026.png" width=700 align='center'>

What is interesting, however, is what this number means. Geometrically, the dot product of two vectors equals the product of their lengths multiplied by the cosine of the angle between them. Recall that cosine has the convenient property of equalling one when the angle is zero and decreasing as the angle widens, up to 180 degrees.

<img src="assets/images/06/img_027.png" width=700 align='center'>

Intuitively, that means that for two vectors of the same length, the smaller the angle between them, the larger the dot product becomes, and vice versa.

<img src="assets/images/06/img_028.png" width=700 align='center'>
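
A quick numeric check of this geometric view, using two 2-D vectors of equal length (the length of 2 and the chosen angles are arbitrary assumptions):

```python
import numpy as np

def vector_at(length, degrees):
    """2-D vector with the given length, rotated by the given angle."""
    theta = np.radians(degrees)
    return length * np.array([np.cos(theta), np.sin(theta)])

reference = vector_at(2.0, 0)
angles = [0, 45, 90, 135, 180]
dots = []
for degrees in angles:
    other = vector_at(2.0, degrees)
    dot = reference @ other
    # Same quantity computed as |a| * |b| * cos(angle)
    via_cosine = (np.linalg.norm(reference) * np.linalg.norm(other)
                  * np.cos(np.radians(degrees)))
    dots.append(dot)
    print(f"{degrees:>3} degrees: dot = {dot:.2f}, |a||b|cos = {via_cosine:.2f}")
```

The dot product is largest when the vectors point the same way, passes through zero at 90 degrees, and is most negative at 180 degrees.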

Generalized though, we have the form for multiplicative attention, **when we assume the encoder and decoder have the same embedding space**:

<img src="assets/images/06/img_029.png" width=700 align='center'>

An issue with machine translation, however, is that the encoder and decoder are likely to have separate embedding spaces. In that case, we use a similar scoring method with a slight variation: a weight matrix is placed between the encoder hidden states and the decoder hidden state. The weight matrix is a linear transformation that allows the inputs and outputs to use different embedding spaces, and the result of this multiplication is the weights vector:

<img src="assets/images/06/img_030.png" width=700 align='center'>
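
To illustrate why the weight matrix helps, the sketch below gives the encoder and decoder hidden states different sizes. The sizes and random values are assumptions; in practice the matrix `W_a` is learned.

```python
import numpy as np

rng = np.random.default_rng(0)
decoder_dim, encoder_dim, source_len = 3, 4, 5      # deliberately different sizes
h_t = rng.normal(size=decoder_dim)                  # decoder hidden state
H_s = rng.normal(size=(source_len, encoder_dim))    # encoder hidden states (rows)
W_a = rng.normal(size=(decoder_dim, encoder_dim))   # maps between the two spaces

# A plain dot product is not even defined when the sizes differ.
try:
    H_s @ h_t
    plain_dot_works = True
except ValueError:
    plain_dot_works = False

# With the weight matrix in between: h_t^T W_a h_s for every encoder state.
scores = H_s @ (W_a.T @ h_t)
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()
print(plain_dot_works, scores.shape, weights.sum())
```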

If we look at things step by step:

A: 

1. The attention decoder (purple outlined blue filled circle) starts by taking an initial hidden state ($\text{h}_{\text{init}}$) as well as the embedding for the `<END>` symbol. It does its calculation and generates the hidden state at that time step, ignoring the actual outputs of the RNN, just using the hidden state:

<img src="assets/images/06/img_031.png" width=700 align='center'>

2. Then we do the attention step (white outline box). We do that by taking in the matrix of hidden states from the encoder (yellow 3x4 matrix). We produce the scoring as we mentioned (pink vector).

> * If we're doing multiplicative attention, we'll use the dot product.
> * We'll transform the scores with softmax
> * Multiply softmax scores by each corresponding hidden state from the encoder

3. Sum the weighted hidden states to produce the attention context vector (blue vector)

4. Concatenate the attention context vector ($\text{C}_{4}$ , blue vector) with the hidden state of the decoder ($\text{h}_{4}$ , purple vector) at the timestep

> In the example, h4 with c4
> So, we glue them together as one vector

5. Then pass them through a fully connected neural network (pink rounded rectangle), which basically multiplies by the weights matrix $\text{W}_{\text{c}}$ and applies a $\tanh$ activation
> The output of the fully connected layer would be our *first* outputted word in the output sequence

<img src="assets/images/06/img_032.png" width=700 align='center'>

Now proceed to second step, B:

1. Take the output (pink vector, $\text{how}^{*}$) from the first decoder timestep (section 4)

2. Produce h5 (purple vector) and start the attention (white dashed-line box) at this step
> Score, produce weight vector, softmax, multiply

3. Sum the weighted hidden states to produce the context vector at state 5 (blue vector)

4. Concatenate h5 and c5

5. Then pass them through a fully connected neural network (pink rounded rectangle), which basically multiplies by the weights matrix $\text{W}_{\text{c}}$ and applies a $\tanh$ activation
> The output of the fully connected layer would be our *second* outputted word in the output sequence

<img src="assets/images/06/img_033.png" width=700 align='center'>
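
Steps A and B can be sketched as a two-iteration loop. This is a simplified illustration under several assumptions: a vanilla RNN cell stands in for the decoder, all vectors share one toy dimension, the weights are random instead of trained, and the output vector is fed straight back in as the next input (a real model would first map it to a word and embed that word).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def decoder_step(x, h_prev, H_enc, params):
    # 1. RNN update; only the new hidden state is used
    h_t = np.tanh(params["W_x"] @ x + params["W_h"] @ h_prev)
    # 2. Score with a dot product, apply softmax
    weights = softmax(H_enc @ h_t)
    # 3. Sum the weighted encoder hidden states into the context vector
    c_t = weights @ H_enc
    # 4-5. Concatenate c_t with h_t, then fully connected layer with tanh
    h_tilde = np.tanh(params["W_c"] @ np.concatenate([c_t, h_t]))
    return h_t, weights, h_tilde

rng = np.random.default_rng(0)
dim = 4                                    # toy size (assumed)
params = {
    "W_x": rng.normal(size=(dim, dim)),
    "W_h": rng.normal(size=(dim, dim)),
    "W_c": rng.normal(size=(dim, 2 * dim)),
}
H_enc = rng.normal(size=(3, dim))          # three encoder hidden states
h = rng.normal(size=dim)                   # stands in for h_init
x = rng.normal(size=dim)                   # stands in for the <END> embedding

outputs = []
for step in range(2):                      # step A, then step B
    h, weights, h_tilde = decoder_step(x, h, H_enc, params)
    outputs.append((weights, h_tilde))
    x = h_tilde                            # simplification: reuse the output as next input

for weights, h_tilde in outputs:
    print(weights, h_tilde)
```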

# 6. Additive Attention

We will now look at the third commonly used scoring function. It is called "concat", and it is implemented with a *feed forward* neural network: the two vectors are concatenated and the result becomes the input to the network.

As a simple example, let's consider we are scoring the 1st time-step encoder hidden state with a 4th time-step decoder hidden state.

1. Merge the vectors to a single vector
2. Pass the merged vector to a feedforward neural network
> The feedforward neural network has a single hidden layer and outputs the score
> The parameters of this network are learned during the training process
>> the weight matrix $\text{W}_{\text{a}}$ and the weight vector $\text{V}_{\text{a}}$

<img src="assets/images/06/img_034.png" width=700 align='center'>

The calculation can be viewed as follows:

<img src="assets/images/06/img_035.png" width=700 align='center'>

Something to note is how the concat method differs between the Bahdanau (additive) and Luong (multiplicative) papers:
* The Luong concat score has one weight matrix
* The Bahdanau additive score has two major differences
> * The weights matrix is split into two: $\text{W}_{\text{a}}$ and $\text{U}_{\text{a}}$ ; each is applied to the respective vector ($\text{W}_{\text{a}}$ to the decoder hidden state and $\text{U}_{\text{a}}$ to the encoder hidden state ) <br>
> * It uses the decoder hidden state from the previous time-step

<img src="assets/images/06/img_036.png" width=700 align='center'>
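
A small numeric check shows that splitting the weight matrix in two is the same computation as multiplying one wider matrix by the concatenated vector. Sizes and values are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dec_dim, enc_dim, attn_dim = 3, 4, 5                # toy sizes (assumed)
s = rng.normal(size=dec_dim)                        # decoder hidden state
h = rng.normal(size=enc_dim)                        # encoder hidden state
W_a = rng.normal(size=(attn_dim, dec_dim))          # applied to the decoder state
U_a = rng.normal(size=(attn_dim, enc_dim))          # applied to the encoder state
v_a = rng.normal(size=attn_dim)

# Additive form: two matrices, one per vector.
additive_score = v_a @ np.tanh(W_a @ s + U_a @ h)

# Concat form: one matrix applied to the concatenated vector.
W_joint = np.hstack([W_a, U_a])                     # shape (attn_dim, dec_dim + enc_dim)
concat_score = v_a @ np.tanh(W_joint @ np.concatenate([s, h]))

print(additive_score, concat_score)
```

The two scores agree, so the remaining difference between the two methods is which decoder hidden state is scored (previous versus current time step).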

# 7. Computer Vision Applications

Super interesting computer vision applications using attention:
* [Show, Attend and Tell: Neural Image Caption Generation with Visual Attention](https://arxiv.org/pdf/1502.03044.pdf)
> Achieved SOTA performance in caption generation in a number of datasets
> Trained on COCO - Common Objects in Context: set of 200k images with 5 captions each written by people<br>
<img src="assets/images/06/img_037.png" width=200 align='left'>
<img src="assets/images/06/img_038.png" width=200 align='left'>
<img src="assets/images/06/img_039.png" width=200 align='left'>
<img src="assets/images/06/img_040.png" width=700 align='center'>
<img src="assets/images/06/img_041.png" width=700 align='center'>

> * Uses a VGG net trained on ImageNet
> * annotations were created from this feature map
> * feature volume dimension (14x14x512) meaning 512 feature maps of 14x14 each
> * Create annotation vectors by flattening each feature map from 14x14 to 196x1
>> * simple reshaping of the matrix to a vector
> * Reshaping leads to a matrix of 196x512 so that it's MxP instead of MxNxP
>> * Now have 196 annotation vectors (one per image location), each with 512 features
> * This matrix plays the role of the context and can be used just like we have previously with attention mechanisms
<img src="assets/images/06/img_042.png" width=350 align='center'>
> * The decoder is an RNN and uses attention to focus on the appropriate annotation vector at each time step
> * We plug this into the attention process we've outlined before and that's our image captioning model
<img src="assets/images/06/img_043.png" width=700 align='center'>

* [Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering](https://arxiv.org/pdf/1707.07998.pdf)

* [Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks](https://www.cv-foundation.org/openaccess/content_cvpr_2016/app/S19-04.pdf)

* [Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos](https://arxiv.org/pdf/1507.05738.pdf)

* [Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge](https://arxiv.org/pdf/1708.02711.pdf)

* [Visual Question Answering: A Survey of Methods and Datasets](https://arxiv.org/pdf/1607.05910.pdf)
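
The reshaping described above for Show, Attend and Tell can be sketched with NumPy. The feature volume and decoder state here are random stand-ins, and a dot-product score is assumed for brevity; the paper itself learns its attention function.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(14, 14, 512))     # stand-in for the CNN feature volume

# Flatten the 14x14 grid: one 512-feature annotation vector per image location.
annotations = feature_map.reshape(196, 512)

# Attend over the 196 locations exactly as over encoder hidden states.
decoder_state = rng.normal(size=512)             # stand-in for the decoder RNN state
scores = annotations @ decoder_state             # dot score (an assumption here)
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()
context = weights @ annotations                  # weighted sum of annotation vectors

# Reshaping the weights back to the grid shows where in the image the model "looks".
attention_map = weights.reshape(14, 14)
print(annotations.shape, context.shape, attention_map.shape)
```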

# 8: NLP application: Google neural machine translation

### Google Neural Machine Translation

<img src="assets/images/06/img_044.jpeg" width=700 align='center'>

The best demonstration of an application is by looking at real-world systems that are in production right now. In late 2016, Google released the following paper describing Google’s Neural Machine Translation System:

[Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation [pdf]](https://arxiv.org/pdf/1609.08144.pdf)

This system later went into production powering up Google Translate.

Take a stab at reading the paper and connecting it to what we've discussed in this lesson so far. Below are a few questions to guide this external reading:

* Is Google’s Neural Machine Translation System a sequence-to-sequence model?
* Does the model utilize attention?
* If the model does use attention, does it use additive or multiplicative attention?
* What kind of RNN cell does the model use?
* Does the model use bidirectional RNNs at all?

### Text Summarization:

[Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond](https://arxiv.org/pdf/1602.06023.pdf)

# 9. Other Attention Methods

[Paper: Attention Is All You Need](https://arxiv.org/abs/1706.03762)

[Talk: Attention is all you need attentional neural network models – Łukasz Kaiser](https://www.youtube.com/watch?v=rBCqOTEfxvg)

### The Transformer

[YouTube](https://www.youtube.com/watch?v=VmsR9FVpQiM&t=1s)

Since the two main attention papers were published in 2014 and 2015, attention has been a very active area of research. While the two mechanisms continue to be commonly used, there have been significant developments over the years.

The *Attention Is All You Need* paper noted that the complexity of encoder-decoder models with attention can be reduced by adopting a new type of model that uses **only** attention, no RNNs. This model type is called the **Transformer**. In two of their experiments on machine translation tasks, the model proved superior in quality while requiring significantly less time to train.

The transformer takes a sequence as an input and generates a sequence, just like the seq2seq models we've discussed. The *difference*, however, is that it does not take the inputs one-by-one as an RNN does. Instead, it can process all of them together in parallel, perhaps with each element handled by a separate GPU if we want. It then produces the output one-by-one, but again without using an RNN.

The Transformer model also breaks down into an encoder and a decoder, but instead of RNNs, it uses feed-forward neural networks and a concept called *self-attention*. This combination allows the encoder and decoder to work without RNNs, which vastly improves speed since it allows parallelization of processing that was not possible with RNNs.

<img src="assets/images/06/img_045.png" width=350 align='center'>

The transformer contains a stack of identical encoders and a stack of identical decoders. The paper proposes six of each.

<img src="assets/images/06/img_046.png" width=350 align='center'>

Let's focus on the encoder more closely. Each encoder layer contains two sub-layers - a multiheaded self-attention layer and a feed-forward layer:

<img src="assets/images/06/img_047.png" width=350 align='center'>

Note that this is contained completely in the encoder instead of being a decoder component like the previous attention mechanisms. This attention component helps the encoder comprehend its inputs by focusing on other parts of the input sequence that are relevant to each input element it processes. This idea is an extension of the work previously done on the concept of self-attention and it can aid comprehension.

In other architectures, like the LSTM, you can see an example of an implementation of a self-attention mechanism: 

<img src="assets/images/06/img_048.png" width=350 align='center'>

The structure of the transformer however allows for the model to focus on tokens that appear before AND after the token of interest instead of only the preceding tokens.

This is not the only attention mechanism in the transformer architecture, however. The decoder contains two: an *encoder-decoder* attention mechanism that allows it to focus on the relevant parts of the inputs, and a *self-attention* mechanism that only pays attention to previous decoder outputs.

<img src="assets/images/06/img_049.png" width=350 align='center'>
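
The two self-attention patterns described above (the encoder seeing tokens before AND after, the decoder seeing only earlier outputs) can be contrasted with a small sketch. Setting future scores to negative infinity before the softmax is one common way to implement the restriction and is an assumption here, as are the toy embeddings and the plain dot-product scores.

```python
import numpy as np

def self_attention_weights(X, causal=False):
    """Attention weights of every token over every token (rows sum to one)."""
    scores = X @ X.T                                   # compare all pairs of tokens
    if causal:
        # Hide positions that come after the current token.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)                           # exp(-inf) becomes exactly 0
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 4))                       # four toy token embeddings

encoder_style = self_attention_weights(tokens)               # looks both ways
decoder_style = self_attention_weights(tokens, causal=True)  # looks only backwards
print(np.round(encoder_style, 2))
print(np.round(decoder_style, 2))
```

In the second matrix everything above the diagonal is zero: the first token can attend only to itself, the second to the first two, and so on.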

# 10: The Transformer and Self-Attention

Say we have words we want our encoder to read and create a representation of. As always, we begin by embedding them into vectors. Since the transformer gives us a lot of flexibility for parallelization, this example assumes we're looking at the process or GPU tasked with encoding the second token of the input sequence.

<img src="assets/images/06/img_050.png" width=350 align='center'>

The first step is to compare them. We score the embeddings against each other. So we compare the token we are currently "reading" or encoding with the other words in the input sequence (i.e., all the words in the input, not just two).

<img src="assets/images/06/img_051.png" width=350 align='center'>

Then, we scale the scores by dividing by the square root of the dimension of the keys. We're using a toy dimension of four, so we divide by $\sqrt{4}=2$:

<img src="assets/images/06/img_052.png" width=350 align='center'>

Then, we perform softmax on the scores:

<img src="assets/images/06/img_053.png" width=350 align='center'>

Then we multiply the softmax score with the embedding to get the level of expression of each of these vectors:

<img src="assets/images/06/img_054.png" width=350 align='center'>
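
The steps above can be traced in NumPy for the second token. The three 4-dimensional embeddings are made-up toy values, and, following the walk-through, the embeddings are compared directly; the full Transformer additionally applies learned query, key, and value projections before this computation.

```python
import numpy as np

# Toy embeddings for a 3-token input (made-up values, dimension 4).
embeddings = np.array([[1.0, 0.0, 1.0, 0.0],
                       [0.0, 2.0, 0.0, 2.0],
                       [1.0, 1.0, 1.0, 1.0]])
current = embeddings[1]                       # the token being encoded (second token)

scores = embeddings @ current                 # compare it with every token, itself included
scaled = scores / np.sqrt(embeddings.shape[1])       # divide by sqrt(4) = 2
weights = np.exp(scaled - scaled.max())
weights = weights / weights.sum()             # softmax
expressed = weights[:, None] * embeddings     # each embedding times its softmax score
output = expressed.sum(axis=0)                # summed into the new representation

print(scores, scaled)
print(np.round(weights, 3))
print(np.round(output, 3))
```

Summing the weighted embeddings in the last step mirrors the weighted sum used by the earlier attention mechanisms.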

