# ___Attention Mechanism___

_The attention mechanism is one of the most valuable breakthroughs in Deep Learning research in the last decade. It has driven many recent advances in natural language processing (NLP), including the Transformer architecture and Google’s BERT._

_If you’re working in NLP (or want to do so), you simply must know what the Attention mechanism is and how it works._

## ___What is Attention?___
_When we think about the English word “__Attention__”, we know that it means directing your focus at something and taking greater notice. The Attention mechanism in Deep Learning is based off this concept of directing your focus, and it pays greater attention to certain factors when processing the data._

_In broad terms, __Attention is one component of a network’s architecture, and is in charge of managing and quantifying the interdependence__:_

* ___Between the input and output elements (General Attention)___
* ___Within the input elements (Self-Attention)___

_Let me give you an example of how Attention works in a translation task. Say we have the sentence “How was your day”, which we would like to translate to the French version - “Comment se passe ta journée”. What the Attention component of the network will do for each word in the output sentence is map the important and relevant words from the input sentence and assign higher weights to these words, enhancing the accuracy of the output prediction._

<img src='https://blog.floydhub.com/content/images/2019/09/Slide36.JPG' width = 600/>

_While Attention does have applications in other fields of deep learning such as Computer Vision, its main breakthrough and success come from its application in Natural Language Processing (NLP) tasks. This is because Attention was introduced to address the problem of long sequences in Machine Translation, which is a problem for most other NLP tasks as well._

## ___How Attention Mechanism was Introduced in Deep Learning___
_The attention mechanism emerged as an improvement over the encoder-decoder-based neural machine translation system in natural language processing (NLP). Later, this mechanism, or its variants, was used in other applications, including computer vision and speech processing._

_Before __Bahdanau__ et al proposed the first Attention model in 2015, neural machine translation was based on encoder-decoder RNNs/LSTMs. Both encoder and decoder are stacks of LSTM/RNN units. It works in the two following steps:_

* _The encoder LSTM is used to process the entire input sentence and encode it into a context vector, which is the last hidden state of the LSTM/RNN. This is expected to be a good summary of the input sentence. All the intermediate states of the encoder are ignored, and the final state is used as the initial hidden state of the decoder._


* _The decoder LSTM or RNN units produce the words in a sentence one after another_

<img src='https://lilianweng.github.io/lil-log/assets/images/encoder-decoder-example.png' width = 600/>

_In short, there are two RNNs/LSTMs. One we call the encoder: it reads the input sentence and tries to make sense of it, before summarizing it. It passes the summary (context vector) to the decoder, which translates the input sentence by seeing only that summary._

_The main drawback of this approach is evident. If the encoder makes a bad summary, the translation will also be bad. And indeed it has been observed that the encoder creates a bad summary when it tries to understand longer sentences. It is called the __long-range dependency problem of RNN/LSTMs__._

_RNNs cannot remember longer sentences and sequences due to the vanishing/exploding gradient problem. They can remember only the parts they have just seen. Even Cho et al. (2014), who proposed the encoder-decoder network, demonstrated that the performance of the encoder-decoder network degrades rapidly as the length of the input sentence increases._

_Although an LSTM is supposed to capture the long-range dependency better than the RNN, it tends to become forgetful in specific cases. Another problem is that there is no way to give more importance to some of the input words compared to others while translating the sentence._

_Now, let’s say we want to predict the next word in a sentence, and its context is located a few words back. Here’s an example: “Despite originally being from Uttar Pradesh, as he was brought up in Bengal, he is more comfortable in Bengali”. In this sentence, if we want to predict the word “Bengali”, the phrases “brought up” and “Bengal” should be given more weight while predicting it. And although Uttar Pradesh is another state’s name, it should be “ignored”._

___Bahdanau et al (2015) came up with a simple but elegant idea where they suggested that not only can all the input words be taken into account in the context vector, but relative importance should also be given to each one of them.___

_So, whenever the proposed model generates a sentence, it searches for a set of positions in the encoder hidden states where the most relevant information is available. __This idea is called ‘Attention’__._

## ___Understanding the Attention Mechanism___

<img src='https://miro.medium.com/max/518/1*y-tu26-3EPzW6-TwJQvT_g.png' width = 300/>

_The entire step-by-step process of applying Attention in Bahdanau’s paper is as follows:_

* ___Producing the Encoder Hidden States___ _- Encoder produces hidden states of each element in the input sequence._


* ___Calculating Alignment Scores___ _- alignment scores between the previous decoder hidden state and each of the encoder’s hidden states are calculated (Note: The last encoder hidden state can be used as the first hidden state in the decoder)._


* ___Softmaxing the Alignment Scores___ _- the alignment scores for each encoder hidden state are combined and represented in a single vector and subsequently softmaxed._


* ___Calculating the Context Vector___ _- each encoder hidden state is multiplied by its softmaxed alignment score, and the results are summed to form the context vector._


* ___Decoding the Output___ _- the context vector is concatenated with the previous decoder output and fed into the Decoder RNN for that time step along with the previous decoder hidden state to produce a new output._


* _The process __(steps 2-5) repeats__ itself for each time step of the decoder until an end-of-sequence token is produced or the output exceeds the specified maximum length._
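
_The loop described above can be sketched in NumPy as follows. This is only a minimal illustration under stated assumptions, not Bahdanau’s implementation: the dot-product `score_fn`, the `toy_decoder_step` function, the start and end token ids, and all sizes are hypothetical stand-ins chosen so that the example runs on its own._

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def decode(encoder_states, decoder_step, score_fn, eos_id, max_len):
    state = encoder_states[-1]  # last encoder state used as first decoder state
    prev_token = 0              # hypothetical start-of-sequence id
    outputs = []
    for _ in range(max_len):
        # Step 2: score the previous decoder state against every encoder state.
        scores = np.array([score_fn(state, h) for h in encoder_states])
        # Step 3: softmax the scores into attention weights.
        weights = softmax(scores)
        # Step 4: context vector = weighted sum of encoder hidden states.
        context = weights @ encoder_states
        # Step 5: feed context, previous output and previous state to the decoder.
        prev_token, state = decoder_step(context, prev_token, state)
        outputs.append(prev_token)
        if prev_token == eos_id:
            break
    return outputs

# Toy setup (all values hypothetical).
rng = np.random.default_rng(0)
hidden, vocab = 8, 6
encoder_states = rng.normal(size=(4, hidden))  # Step 1: one state per input word
W_s = rng.normal(size=(hidden, 2 * hidden + 1)) * 0.1
W_o = rng.normal(size=(vocab, hidden))

def toy_decoder_step(context, prev_token, state):
    # Crude stand-in for an RNN cell: the token id replaces a real embedding.
    x = np.concatenate([context, [float(prev_token)], state])
    new_state = np.tanh(W_s @ x)
    return int(np.argmax(W_o @ new_state)), new_state

tokens = decode(encoder_states, toy_decoder_step,
                score_fn=lambda s, h: float(s @ h),
                eos_id=vocab - 1, max_len=10)
print(tokens)
```

_A real model would replace `toy_decoder_step` with a trained LSTM/GRU cell and `score_fn` with a learned alignment model._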

_Now that we have a high-level understanding of the flow of the Attention mechanism for Bahdanau, let’s take a look at the inner workings and computations involved:_

<img src='https://lilianweng.github.io/lil-log/assets/images/encoder-decoder-attention.png' width = 600/>

_Say, we have a source sequence x of length n and try to output a target sequence y of length m:_

$$x=[x_1,x_2,\dots,x_n]$$
$$y=[y_1,y_2,\dots,y_m]$$

_The encoder is a bidirectional RNN (or other recurrent network setting of your choice) with a forward hidden state $\overrightarrow{h}_i$ and a backward one $\overleftarrow{h}_i$. A simple concatenation of the two represents the encoder state. The motivation is to include both the preceding and following words in the annotation of one word._

$$h_i=[\overrightarrow{h}_i^\top;\overleftarrow{h}_i^\top]^\top,\quad i=1,\dots,n$$

_The decoder network has hidden state $s_t=f(s_{t-1},y_{t-1},c_t)$ for the output word at position $t$, $t=1,\dots,m$, where the context vector $c_t$ is a sum of hidden states of the input sequence, weighted by alignment scores:_

<img src='https://miro.medium.com/max/576/1*kg-zvcFvmEPaKZmHrZ_Xow.png' width= 300/>

___The alignment model assigns a score $\alpha_{t,i}$ to the pair of input at position $i$ and output at position $t$, $(y_t,x_i)$, based on how well they match.___ _The set of $\{\alpha_{t,i}\}$ are weights defining how much of each source hidden state should be considered for each output. In Bahdanau’s paper, the alignment score $\alpha$ is parametrized by a __feed-forward network__ with a single hidden layer and this network is jointly trained with other parts of the model._
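
_Written out, and assuming the score is computed from the previous decoder state $s_{t-1}$ as in the step list earlier, the quantities in this paragraph relate as follows. Here $v_a$ and $W_a$ are names assumed for the learned parameters of the feed-forward alignment network:_

$$c_t=\sum_{i=1}^{n}\alpha_{t,i}\,h_i$$
$$\alpha_{t,i}=\frac{\exp(\text{score}(s_{t-1},h_i))}{\sum_{k=1}^{n}\exp(\text{score}(s_{t-1},h_k))}$$
$$\text{score}(s_{t-1},h_i)=v_a^\top\tanh(W_a[s_{t-1};h_i])$$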

<img src='https://d3i71xaburhd42.cloudfront.net/9a7def005efb5b4984886c8a07ec4d80152602ab/5-Figure2-1.png' width = 600/>

___We can observe 3 sub-parts/components in the above diagram:___

* _Encoder_
* _Attention_
* _Decoder_


___Encoder___

* _Contains an __RNN layer__ (can be an LSTM or GRU)._
* _There are 4 inputs:_ $x_{0}, x_{1}, x_{2}, x_{3}$
* _Each input goes through an Embedding Layer._
* _Each input generates a hidden representation._
* _This generates the outputs for the Encoder:_ $h_{0}, h_{1}, h_{2}, h_{3}$

___Attention___

* _Our goal is to generate the context vectors._
* _For example, context vector $C_{1}$ tells us how much importance/attention should be given to the inputs:_ $x_{0}, x_{1}, x_{2}, x_{3}$.
* _This layer in turn contains 3 sub parts:_
    * _Feed Forward Network_
    * _Softmax Calculation_
    * _Context vector generation_
    
___Feed Forward Network___

* _Each of $A_{00}, A_{01}, A_{02}, A_{03}$ is a simple feed forward neural network with one hidden layer. The inputs to this feed forward network are:_
   * _Previous Decoder state_
   * _Output of Encoder states_
* _Each unit generates one of the outputs $e_{00}, e_{01}, e_{02}, e_{03}$, where $e_{0i} = g(S_{0}, h_{i})$._
* _$g$ can be any activation function such as __sigmoid, tanh or ReLU__._
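
_As a rough sketch of one such scoring unit, the snippet below uses the additive form with `tanh` as the activation $g$. The parameter names `W_a` and `v_a`, the random initialisation, and all dimensions are assumptions made for illustration; in a real model these weights are learned jointly with the rest of the network._

```python
import numpy as np

def alignment_score(s_prev, h_i, W_a, v_a):
    """One-hidden-layer feed forward scorer: v_a^T tanh(W_a [s_prev; h_i])."""
    hidden = np.tanh(W_a @ np.concatenate([s_prev, h_i]))
    return float(v_a @ hidden)  # scalar score e_{0i}

rng = np.random.default_rng(0)
dec_dim, enc_dim, attn_dim = 6, 8, 10  # hypothetical sizes

# Randomly initialised stand-ins for the learned parameters.
W_a = rng.normal(size=(attn_dim, dec_dim + enc_dim))
v_a = rng.normal(size=attn_dim)

S_0 = rng.normal(size=dec_dim)     # previous decoder state
h = rng.normal(size=(4, enc_dim))  # encoder outputs h_0 .. h_3

# One raw score per encoder output: e_00, e_01, e_02, e_03.
e_0 = np.array([alignment_score(S_0, h_i, W_a, v_a) for h_i in h])
print(e_0)
```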

___Softmax Calculation___

$$E_{0i} = \frac{\exp(e_{0i})}{\sum_{k=0}^{3}\exp(e_{0k})}$$
* _These $E_{00}, E_{01}, E_{02}, E_{03}$ are called the __attention weights__. They decide how much importance should be given to the inputs $x_{0}, x_{1}, x_{2}, x_{3}$._
* _Applying softmax has two advantages:_
    * _All the weights lie between 0 and 1, i.e., $0 \le E_{00}, E_{01}, E_{02}, E_{03} \le 1$_
    * _All the weights sum to 1, i.e., $E_{00}+E_{01}+E_{02}+E_{03} = 1$_
    
_Thus we get a nice probabilistic interpretation of the attention weights._

___Context Vector Generation___

$$C_{0} = E_{00} \ast h_{0} + E_{01} \ast h_{1} + E_{02} \ast h_{2} + E_{03} \ast h_{3}$$

* _We find $C_{1}, C_{2}, C_{3}$ in the same way and feed them to the corresponding RNN units of the Decoder layer._
* _So this is the final vector: the __sum of the Encoder’s outputs weighted by the probability distribution__, which is nothing but the attention paid to the input words._
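
_To make the softmax and context vector steps concrete, here is a small worked example. The raw scores `e_0` and the two-dimensional encoder outputs `h` are made-up numbers chosen purely for illustration._

```python
import numpy as np

# Hypothetical raw scores e_00 .. e_03 from the feed forward network.
e_0 = np.array([2.0, 1.0, 0.0, -1.0])

# Hypothetical encoder outputs h_0 .. h_3 (each of size 2).
h = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.5, 0.5]])

# Softmax: exponentiate (shifted by the max for stability) and normalise.
exp_e = np.exp(e_0 - e_0.max())
E_0 = exp_e / exp_e.sum()

# Context vector: C_0 = sum_i E_0i * h_i
C_0 = E_0 @ h

print("attention weights:", E_0.round(3))
print("context vector:   ", C_0.round(3))
```

_The largest raw score receives the largest weight, so $h_{0}$ contributes most to $C_{0}$ in this example._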

___Decoder___

_We feed these Context Vectors to the RNNs of the Decoder layer. Each decoder produces an output which is the translation for the input words._

___What do you mean by "alignment" in the context of attention mechanism?___

_Bahdanau et al. align the decoder's sequence with the encoder's sequence. An __alignment score__ quantifies how well output at position i is aligned to the input at position j._

_The matrix of alignment scores is a nice byproduct to explicitly show the correlation between source and target words._

<img src='https://lilianweng.github.io/lil-log/assets/images/bahdanau-fig3.png' width=400/>

### ___Advantages of Attention___
_There are three main reasons to introduce the Attention mechanism:_

___Fewer Parameters___

_Compared with CNNs and RNNs, an Attention model can have lower complexity and fewer parameters, so the requirement for computing power is smaller._

___High Speed___

_Attention solves the problem that an RNN cannot be computed in parallel. Each calculation of the Attention mechanism does not depend on the result of the previous step, so it can be processed in parallel, just like a CNN._

___Works Well___

_Prior to the introduction of the Attention mechanism, there was a problem that troubled everyone: long-distance information would be weakened, just as a person with a weak memory cannot remember the past._

_Attention picks out the key points: even if the text is relatively long, it can grab the key point in the middle without losing important information. The parts marked in red in the picture below are the points that have been singled out._

<img src='https://easy-ai.oss-cn-shanghai.aliyuncs.com/2019-11-06-long.jpg' width= 600/>

### ___Types of Attention___

#### ___Calculation Area___
_Depending on how many source states contribute to deriving the attention vector (α), there can be three types of attention mechanisms:_

<img src='https://miro.medium.com/max/696/1*vH6f8lDnYrdJ7A37ZBu_0g.png'/>

* ___Global Attention (Soft Attention)___
_Here, the attention is placed on all the source positions. In other words, all the hidden states of the encoder are considered for deriving the attended context vector. We focus on all the intermediate states and collect all the contextual information so that the decoder model can predict the next word._

* ___Local Attention___
_Here, the attention is placed on only a few source positions. Only a few hidden or intermediate states of the encoder are considered for deriving the attended context vector since we only give importance to specific parts of the sequence. Local attention is also called __window-based attention__ because it's about selecting a window of input tokens for attention distribution._

* ___Hard Attention___ 
_When attention is placed on only one source state._
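
_The three variants can be contrasted on the same set of scores, as in the sketch below. It is deliberately simplified: the local variant is just a softmax restricted to a window, and the hard variant picks the highest-scoring state deterministically. The scores, the `center` and `half_width` parameters, and the encoder states are hypothetical values._

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def global_weights(scores):
    # Global (soft): every source position receives some weight.
    return softmax(scores)

def local_weights(scores, center, half_width):
    # Local: only positions inside the window around `center` receive weight.
    weights = np.zeros_like(scores)
    lo = max(0, center - half_width)
    hi = min(len(scores), center + half_width + 1)
    weights[lo:hi] = softmax(scores[lo:hi])
    return weights

def hard_weights(scores):
    # Hard: all weight on a single source state (here the top-scoring one).
    weights = np.zeros_like(scores)
    weights[np.argmax(scores)] = 1.0
    return weights

scores = np.array([0.5, 2.0, 1.0, -0.3, 0.1])      # one score per source state
states = np.arange(10, dtype=float).reshape(5, 2)  # five encoder states

for name, w in [("global", global_weights(scores)),
                ("local", local_weights(scores, center=2, half_width=1)),
                ("hard", hard_weights(scores))]:
    print(name, w.round(3), (w @ states).round(3))
```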

_Note: One drawback of attention is that it’s time-consuming. To overcome this problem, Google introduced the “Transformer” model._

#### ___Information Used___

_Suppose we want to calculate Attention for a piece of original text, where the original text is the text we want to pay attention to. The information used can be internal or external: internal information refers to the original text itself, and external information refers to additional information beyond the original text._

* ___General Attention___ _uses external information and is often used for tasks that need to model the relationship between two texts. The query generally contains additional information, and the original text is aligned according to this external query._

* ___Local Attention (Self Attention)___ _uses only internal information: the key, value, and query are all derived from the input text. In self attention, key=value=query. Since there is no external information, each word in the original text can attend to all the words in the sentence, which is equivalent to finding the relationships inside the original text._
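
_A minimal sketch of the key=value=query case is shown below, using plain dot-product scores with no learned projections. The input matrix `X` (one row per word) is random and stands in for real word representations._

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # Query, key and value are all the same input X (key=value=query).
    scores = X @ X.T                    # (n, n): every word scored against every word
    weights = softmax(scores, axis=-1)  # each row is a distribution over the words
    return weights @ X, weights         # new representation for each word

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))  # 5 words, 4 features each (hypothetical)
out, weights = self_attention(X)
print(out.shape, weights.shape)
```

_Row $i$ of `weights` shows how strongly word $i$ attends to every word in the same sentence._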


#### ___Structure Level___

_In terms of structure, attention is divided into single-layer attention, multi-layer attention and multi-head attention, according to whether a hierarchical relationship is used:_

* ___Single layer Attention___ _is the more common practice: a single query attends to a piece of original text._

* ___Multi-layer Attention___ _is generally used for models of text with a hierarchical structure. Suppose we divide a document into multiple sentences. In the first layer, we use attention to calculate a sentence vector for each sentence (that is, a single layer). In the second layer, we use attention over all the sentence vectors to calculate the document vector (also a single layer of attention), and finally use the document vector to do the task._

* ___Multi-head attention___ _is the multi-head attention introduced in “Attention is All You Need”. It uses multiple queries to attend to a piece of original text multiple times. Each query pays attention to different parts of the original text, which is equivalent to repeating single layer attention multiple times._
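
_The multi-head idea can be sketched as follows, using scaled dot-product attention as in “Attention is All You Need”. The projection matrices `W_q`, `W_k`, `W_v` and `W_o` are random placeholders for learned weights, and the sizes are arbitrary assumptions._

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    n, d_model = X.shape
    d_head = d_model // num_heads

    def split(M):
        # (n, d_model) -> (num_heads, n, d_head): one slice per head.
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    # Each head computes its own attention weights over the text.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                  # (heads, n, d_head)
    # Concatenate the heads and mix them with the output projection.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n, d_model, num_heads = 5, 8, 2  # hypothetical sizes
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape, weights.shape)
```

_Each head has its own slice of the projections, so different heads can focus on different parts of the text._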

#### ___Model Aspect___

_From the model point of view, Attention is generally used together with CNNs and LSTMs, and it can also be used on its own as a pure Attention calculation._

* ___CNN+Attention___

_A CNN's convolution operation can extract important features. I think this is also the idea of Attention, but a CNN's receptive field is local, and it is necessary to enlarge the field of view by stacking multiple convolution layers. In addition, Max Pooling directly extracts the feature with the largest value, which resembles hard attention in that it directly selects one feature._

* ___LSTM+Attention___

_An LSTM has an internal gate mechanism, in which the input gate selects which current information to input, and the forget gate chooses which past information to forget. I think this is Attention to a certain degree, and it is claimed to solve the long-term dependency problem. In practice, an LSTM needs to capture sequence information step by step, so its performance on long text slowly decays as the number of steps increases, and it is difficult to retain all useful information._

#### ___Similarity Calculation Method___

_When doing the attention, we need to calculate the score. Common methods are:_

<table style='font-style: italic'>
  <thead>
    <tr>
      <th>Name</th>
      <th>Alignment score function</th>
      <th>Citation</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Content-based attention</td>
      <td>$\text{score}(\boldsymbol{s}_t, \boldsymbol{h}_i) = \text{cosine}[\boldsymbol{s}_t, \boldsymbol{h}_i]$</td>
      <td><a href="https://arxiv.org/abs/1410.5401">Graves2014</a></td>
    </tr>
    <tr>
      <td>Additive(*)</td>
      <td>$\text{score}(\boldsymbol{s}_t, \boldsymbol{h}_i) = \mathbf{v}_a^\top \tanh(\mathbf{W}_a[\boldsymbol{s}_t; \boldsymbol{h}_i])$</td>
      <td><a href="https://arxiv.org/pdf/1409.0473.pdf">Bahdanau2015</a></td>
    </tr>
    <tr>
      <td>Location-Based</td>
      <td>$\alpha_{t,i} = \text{softmax}(\mathbf{W}_a \boldsymbol{s}_t)$<br>Note: This simplifies the softmax alignment to only depend on the target position.</td>
      <td><a href="https://arxiv.org/pdf/1508.04025.pdf">Luong2015</a></td>
    </tr>
    <tr>
      <td>General</td>
      <td>$\text{score}(\boldsymbol{s}_t, \boldsymbol{h}_i) = \boldsymbol{s}_t^\top\mathbf{W}_a\boldsymbol{h}_i$<br>where $\mathbf{W}_a$ is a trainable weight matrix in the attention layer.</td>
      <td><a href="https://arxiv.org/pdf/1508.04025.pdf">Luong2015</a></td>
    </tr>
    <tr>
      <td>Dot-Product</td>
      <td>$\text{score}(\boldsymbol{s}_t, \boldsymbol{h}_i) = \boldsymbol{s}_t^\top\boldsymbol{h}_i$</td>
      <td><a href="https://arxiv.org/pdf/1508.04025.pdf">Luong2015</a></td>
    </tr>
    <tr>
      <td>Scaled Dot-Product(^)</td>
      <td>$\text{score}(\boldsymbol{s}_t, \boldsymbol{h}_i) = \frac{\boldsymbol{s}_t^\top\boldsymbol{h}_i}{\sqrt{n}}$<br>Note: very similar to the dot-product attention except for a scaling factor; where $n$ is the dimension of the source hidden state.</td>
      <td><a href="http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf">Vaswani2017</a></td>
    </tr>
  </tbody>
</table>
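
_The score functions in the last rows of the table can be sketched in a few lines of NumPy. This is only a minimal illustration, not a reference implementation: the function names, the toy shapes, and the random inputs are assumptions made for this example, and the encoder states are stacked row-wise into a matrix `H` so that all source positions are scored at once._

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def score_dot(s_t, H):
    # Dot-product: s_t^T h_i for every encoder state h_i (rows of H).
    return H @ s_t

def score_scaled_dot(s_t, H):
    # Scaled dot-product: divide by sqrt(n), n = hidden state dimension.
    n = H.shape[-1]
    return (H @ s_t) / np.sqrt(n)

def score_general(s_t, H, W_a):
    # General: s_t^T W_a h_i, with W_a a trainable matrix (random here).
    return H @ (W_a.T @ s_t)

def weights_location(s_t, W_loc):
    # Location-based: weights depend only on the target state s_t.
    # W_loc has one row per source position (an assumption of this sketch).
    return softmax(W_loc @ s_t)

# Toy setup (hypothetical sizes): 4 source positions, hidden size 8.
rng = np.random.default_rng(0)
T, d = 4, 8
H = rng.normal(size=(T, d))      # encoder hidden states h_1..h_T
s_t = rng.normal(size=d)         # decoder hidden state at step t
W_a = rng.normal(size=(d, d))    # stands in for a learned matrix
W_loc = rng.normal(size=(T, d))  # stands in for a learned matrix

alpha = softmax(score_scaled_dot(s_t, H))  # attention weights, sum to 1
context = alpha @ H                        # weighted sum of encoder states
```

_Whichever score function is used, the scores are passed through a softmax to obtain the attention weights, and the context vector is the weighted sum of the encoder states._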

###  ___Applications of Attention Mechanism___

_Beyond its early application to __machine translation__, the attention mechanism has been applied to other NLP tasks such as __sentiment analysis, POS tagging, document classification, text classification, and relation classification__._

_By combining CNNs with self-attention, the Google Brain team achieved top results for __image classification and object detection__. Attention is also useful in __Visual Question Answering (VQA)__, where the model needs to focus on small areas or details of the image, and in __image captioning__._

_In __speech recognition__, attention aligns the output characters with the input audio._

_In one medical study, higher attention was given to __abnormal heartbeats from ECG readings__ to detect specific heart conditions more accurately. In another study based on ICU data, attention was applied at the feature level rather than on embeddings, which gave physicians better interpretability._