# Attention Is All You Need - Paper Reproduction

### Paper
Title: "Attention Is All You Need"

Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin

Conference: NeurIPS 2017

Paper Link: [arXiv:1706.03762](https://arxiv.org/abs/1706.03762)

The paper introduces the Transformer, an architecture for sequence transduction tasks such as machine translation. Unlike Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks [2], the Transformer relies entirely on attention mechanisms to model dependencies between input and output sequences. Self-attention is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence. Removing recurrence allows much more parallelization, which shortens training, and the model also improves translation quality.

### Introduction: Why was this done?

Before Transformers, deep learning models used for sequence-to-sequence tasks, such as machine translation, text summarization, and speech recognition, relied heavily on Recurrent Neural Networks (RNNs), Long Short-Term Memory Networks (LSTMs), and Gated Recurrent Units (GRUs). While these models were effective in learning sequential data, they had significant limitations that hindered their efficiency, scalability, and performance.

#### Limitations of Previous Sequence Models
The primary challenge of working with sequential data is maintaining the relationships between different elements in a sequence, especially when dependencies span long distances. Earlier models attempted to address this issue using recurrence and convolution-based architectures.

**1. Recurrent Neural Networks (RNNs)** [2]. RNNs process input sequences one step at a time, maintaining a hidden state that carries information from previous time steps. This allows the network to capture short-term dependencies between elements in the sequence.

How RNNs work: the input sequence is processed one token at a time. At each time step, the model updates a hidden state that encodes information from the previous time steps. The final hidden state is then used to generate predictions.
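
The step-by-step update described above can be sketched in a few lines of NumPy. This is a minimal illustration of a vanilla (Elman-style) RNN, not code from the paper; the names `rnn_forward`, `W_xh`, `W_hh`, `b_h` and the tiny sizes are assumptions chosen for readability.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence and return the hidden state at every step."""
    h = np.zeros(W_hh.shape[0])            # initial hidden state
    states = []
    for x_t in inputs:                     # strictly one token after another
        # The new state depends on the current input and the previous state.
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
seq_len, d_in, d_hidden = 5, 3, 4          # illustrative sizes (assumption)
inputs = rng.normal(size=(seq_len, d_in))
W_xh = rng.normal(size=(d_in, d_hidden)) * 0.1
W_hh = rng.normal(size=(d_hidden, d_hidden)) * 0.1
b_h = np.zeros(d_hidden)

states = rnn_forward(inputs, W_xh, W_hh, b_h)
final_state = states[-1]                   # summary that would feed a prediction layer
print(states.shape, final_state.shape)
```

The `for` loop is the important part: step `t` cannot start before step `t - 1` has finished, which is the source of the parallelization problem listed below.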

Problems with RNNs:
- Sequential Processing: RNNs process inputs step-by-step, making them computationally slow and difficult to parallelize.
- Vanishing and Exploding Gradient Problem: During backpropagation, gradients can either become too small (vanish) or grow too large (explode), making it difficult for the model to learn dependencies across long distances.
- Short-Term Memory: Even though RNNs store information across time steps, they struggle to retain knowledge over long sequences.
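
To make the vanishing and exploding gradient problem concrete, here is a deliberately simplified sketch. It assumes a scalar, linear recurrence `h_t = w * h_{t-1}` (an assumption for illustration; real RNNs have matrices and nonlinearities), in which case the gradient of the last state with respect to the first is just `w` raised to the number of steps.

```python
def gradient_scale(recurrent_weight, num_steps):
    """Magnitude of d h_T / d h_0 for the scalar linear recurrence h_t = w * h_{t-1}."""
    return abs(recurrent_weight) ** num_steps

vanishing = gradient_scale(0.9, 100)   # weight slightly below 1: gradient shrinks
exploding = gradient_scale(1.1, 100)   # weight slightly above 1: gradient grows
print(f"w=0.9 -> {vanishing:.2e}, w=1.1 -> {exploding:.2e}")
```

Even with weights this close to 1, a hundred time steps are enough to shrink or inflate the gradient by several orders of magnitude.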

**2. Long Short-Term Memory Networks (LSTMs)** LSTMs were designed to address the short-term memory issue in RNNs. They introduce a memory cell that can store information over extended time steps, preventing it from being forgotten too quickly.

How LSTMs work: LSTMs use gates (input, forget, and output) to control what information is stored and discarded. The forget gate determines which information from the past should be erased. The input gate decides what new information should be added. The output gate determines what information should be passed to the next time step.
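
The gating logic can be sketched as a single time step in NumPy. This follows the common textbook LSTM formulation rather than any code from the paper; `lstm_step` and the stacked parameter layout (`W`, `U`, `b` holding all four transforms side by side) are assumptions made to keep the example short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b stack the forget, input, output and candidate transforms."""
    d = h_prev.shape[0]
    z = x_t @ W + h_prev @ U + b           # shape (4 * d,)
    f = sigmoid(z[0:d])                    # forget gate: what to erase from the cell
    i = sigmoid(z[d:2 * d])                # input gate: what new content to write
    o = sigmoid(z[2 * d:3 * d])            # output gate: what to expose as hidden state
    g = np.tanh(z[3 * d:4 * d])            # candidate cell content
    c = f * c_prev + i * g                 # updated memory cell
    h = o * np.tanh(c)                     # hidden state passed to the next step
    return h, c

rng = np.random.default_rng(0)
d_in, d_hidden = 3, 4                      # illustrative sizes (assumption)
W = rng.normal(size=(d_in, 4 * d_hidden)) * 0.1
U = rng.normal(size=(d_hidden, 4 * d_hidden)) * 0.1
b = np.zeros(4 * d_hidden)

h, c = np.zeros(d_hidden), np.zeros(d_hidden)
for x_t in rng.normal(size=(5, d_in)):     # still a sequential loop over time steps
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)
```

Note that the four transforms give an LSTM roughly four times the parameters of a vanilla RNN with the same hidden size, and the loop over time steps remains.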

Problems with LSTMs:
- Still sequential: Like RNNs, LSTMs require inputs to be processed step-by-step, limiting their ability to leverage modern parallel computing hardware.
- Computationally expensive: LSTMs require more parameters than RNNs, increasing memory and computation costs.
- Difficult to scale: Training large LSTMs on massive datasets requires extensive computational resources.

**3. Gated Recurrent Units (GRUs)** GRUs are a simplified version of LSTMs that merge the forget and input gates into a single update gate. This reduces the number of parameters while maintaining performance similar to LSTMs.

Problems with GRUs:
- Still sequential processing: Like RNNs and LSTMs, GRUs require time-step-based computation.
- Difficulty handling long-range dependencies: While better than standard RNNs, GRUs still struggle to capture very long-term dependencies efficiently.


### Key Contributions of the Transformer Model



### Model Architecture



The Transformer follows an encoder-decoder structure, where both the encoder and the decoder are built from stacked layers of self-attention and position-wise, fully connected feed-forward layers. The figure below illustrates the architecture, with the encoder on the left and the decoder on the right.

![image.png](attachment:image.png)

The encoder maps an input sequence of symbol representations $(x_1, \dots, x_n)$ to a sequence of continuous representations $z = (z_1, \dots, z_n)$. Given $z$, the decoder then generates an output sequence $(y_1, \dots, y_m)$ of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
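
The auto-regressive loop can be sketched as follows. This is a simplified greedy decoder under stated assumptions: `greedy_decode`, `next_token_fn`, `bos_id` and `eos_id` are hypothetical names, and `toy_next_token` is a stand-in for a trained decoder (the paper itself decodes with beam search, which this sketch does not implement).

```python
def greedy_decode(next_token_fn, bos_id, eos_id, max_len):
    """Generate tokens one at a time, feeding the growing prefix back into the model."""
    output = [bos_id]                      # start-of-sequence symbol
    for _ in range(max_len):
        next_id = next_token_fn(output)    # the model sees everything generated so far
        output.append(next_id)
        if next_id == eos_id:              # stop once end-of-sequence is produced
            break
    return output

def toy_next_token(prefix):
    """Hypothetical stand-in for a decoder: always predicts last token + 1."""
    return prefix[-1] + 1

decoded = greedy_decode(toy_next_token, bos_id=0, eos_id=4, max_len=10)
print(decoded)
```

In a real Transformer, `next_token_fn` would run the decoder stack over the prefix together with the encoder output $z$ and return the most probable next symbol.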

##### Scaled Dot-Product Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.


$$
\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{Q K^T}{\sqrt{d_k}} \right) V
$$



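Explanation: 
- $Q$ (queries): a matrix with one query vector per position that is attending.
- $K$ (keys): a matrix with one key vector per position being attended to. Each query is compared against every key.
- $V$ (values): a matrix with one value vector per key. The output for each query is a weighted sum of these values.
- $d_k$: the dimension of the queries and keys. Dividing by $\sqrt{d_k}$ keeps the dot products from growing large, which would push the softmax into regions with very small gradients.

Intuitively, self-attention scores how relevant every word in a sequence is to every other word by comparing queries with keys, then uses those scores to mix the values.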
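
The formula translates almost directly into NumPy. The sketch below is an illustration, not the reference implementation: the function names, the optional boolean `mask` argument, and the use of a large negative number in place of $-\infty$ for masked positions are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)        # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k) # query-key compatibility
    if mask is not None:
        scores = np.where(mask, scores, -1e9)      # masked positions get ~zero weight
    weights = softmax(scores, axis=-1)             # each row sums to 1
    return weights @ V, weights                    # weighted sum of the values

rng = np.random.default_rng(0)
n, d_k = 4, 8                                      # illustrative sizes (assumption)
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(Q, K, V)

# Decoder-style causal mask: position i may only attend to positions <= i.
causal = np.tril(np.ones((n, n), dtype=bool))
masked_output, masked_weights = scaled_dot_product_attention(Q, K, V, mask=causal)
print(output.shape, weights.sum(axis=-1))
```

All positions are handled by two matrix multiplications, so nothing here has to run one token at a time.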

![image.png](attachment:image.png)

##### Multi-Head Attention

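Instead of performing a single attention function, the Transformer linearly projects the queries, keys and values $h$ times with different, learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions, respectively. Attention is then applied to each projected version in parallel, and the resulting outputs are concatenated and projected once more.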

![image.png](attachment:image.png)

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O
$$

where each head is computed as:

$$
\text{head}_i = \text{Attention}(Q W^Q_i, K W^K_i, V W^V_i)
$$


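where $W^Q_i$, $W^K_i$ and $W^V_i$ are the learned projection matrices of head $i$ for the queries, keys and values, and $W^O$ is the learned output projection applied to the concatenated heads.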
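
A minimal NumPy sketch of the two equations above follows. The explicit loop over heads and the parameter layout (one stacked array per projection type) are assumptions made for clarity; practical implementations usually fuse the heads into a single batched matrix multiplication. The sizes are small illustrative values; the paper's base model uses $d_{\text{model}} = 512$, $h = 8$ and $d_k = d_v = 64$.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)) @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k: (h, d_model, d_k); W_v: (h, d_model, d_v); W_o: (h * d_v, d_model)."""
    heads = [
        attention(Q @ W_qi, K @ W_ki, V @ W_vi)    # head_i
        for W_qi, W_ki, W_vi in zip(W_q, W_k, W_v)
    ]
    return np.concatenate(heads, axis=-1) @ W_o    # Concat(head_1, ..., head_h) W^O

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4                           # illustrative sizes (assumption)
d_k = d_v = d_model // h
X = rng.normal(size=(n, d_model))                  # self-attention: Q = K = V = X
W_q = rng.normal(size=(h, d_model, d_k))
W_k = rng.normal(size=(h, d_model, d_k))
W_v = rng.normal(size=(h, d_model, d_v))
W_o = rng.normal(size=(h * d_v, d_model))

out = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)
print(out.shape)
```

Because each head works in a reduced dimension $d_{\text{model}} / h$, the total cost stays close to that of a single full-dimension attention head.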

##### Position-wise Feed-Forward Networks

In addition to attention sub-layers, each of the layers in the encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.

$$
\text{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2
$$
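
As a sketch, the feed-forward sub-layer is two matrix multiplications with a ReLU in between. The function name and the small sizes are assumptions for illustration; the paper's base model uses $d_{\text{model}} = 512$ with an inner dimension of 2048.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently to each row of x."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
n, d_model, d_ff = 6, 8, 32                # illustrative sizes (assumption)
x = rng.normal(size=(n, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

whole_sequence = position_wise_ffn(x, W1, b1, W2, b2)
one_position = position_wise_ffn(x[2], W1, b1, W2, b2)
print(np.allclose(whole_sequence[2], one_position))
```

Processing the whole sequence at once gives the same result as processing a single position on its own, which is what "applied to each position separately and identically" means.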



##### Positional Encoding

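Since the Transformer contains no recurrence and no convolution, it has no built-in notion of token order. Positional encodings with the same dimension $d_{\text{model}}$ as the embeddings are therefore added to the input embeddings: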

$$
PE_{(pos, 2i)} = \sin \left( \frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}} \right)
$$

$$
PE_{(pos, 2i+1)} = \cos \left( \frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}} \right)
$$

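Here $pos$ is the token position and $i$ indexes the pairs of dimensions, so each pair uses a sinusoid of a different wavelength. These encodings inject word-order information into the model.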
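
The two formulas can be computed for all positions at once. The sketch below assumes an even $d_{\text{model}}$; the function name `positional_encoding` and the chosen sizes are illustrative, not taken from the paper's code.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal encodings: sin on even dimensions, cos on odd ones (d_model must be even)."""
    pos = np.arange(max_len)[:, None]              # shape (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]      # the values 2i, shape (1, d_model / 2)
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                   # PE(pos, 2i + 1)
    return pe

pe = positional_encoding(max_len=50, d_model=16)   # illustrative sizes (assumption)

# Adding the encodings to (here random) token embeddings of the same shape.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 16))
encoder_input = embeddings + pe
print(pe.shape, encoder_input.shape)
```

At position 0 every sine term is 0 and every cosine term is 1, which is a quick sanity check for an implementation.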


##### Why Self-Attention?
- Total computational complexity per layer: a self-attention layer is faster than a recurrent layer when the sequence length $n$ is smaller than the representation dimension $d$, which is the usual case for sentence-level translation.
- The amount of computation that can be parallelized, as measured by the minimum number of sequential operations required: constant for self-attention, versus $O(n)$ for a recurrent layer.
- The path length between long-range dependencies in the network: the shorter the paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies.
- Interpretability: as a side benefit, self-attention could yield more interpretable models.
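
The first three criteria can be compared numerically. The sketch below plugs concrete sizes into the asymptotic orders reported in the paper's layer comparison; constant factors are dropped, so the numbers are orders of magnitude rather than exact operation counts. The function name `layer_costs` and the kernel width `k=3` are assumptions, and the convolutional path length uses the logarithmic bound for dilated convolutions.

```python
import math

def layer_costs(n, d, k=3):
    """Asymptotic per-layer costs for sequence length n, dimension d, kernel width k."""
    return {
        "self-attention": {"ops": n * n * d, "sequential": 1, "max_path": 1},
        "recurrent": {"ops": n * d * d, "sequential": n, "max_path": n},
        "convolutional": {
            "ops": k * n * d * d,
            "sequential": 1,
            "max_path": math.ceil(math.log(n, k)),
        },
    }

short = layer_costs(n=50, d=512)       # typical sentence length: n < d
long = layer_costs(n=1000, d=512)      # long sequence: n > d

for name, cost in short.items():
    print(f"{name:15s} ops={cost['ops']:>12,d} sequential={cost['sequential']}")
```

With `n = 50` the self-attention layer needs about ten times fewer operations than the recurrent one, while for `n = 1000` the quadratic term dominates and the comparison flips.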

##### Results

The Transformer achieves state-of-the-art performance on the WMT 2014 English-to-German and English-to-French translation tasks.

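- English-to-German BLEU score: Transformer (big) 28.4 (previous best: 26.3), an improvement of more than 2 BLEU over the best previously reported results, including ensembles.
- English-to-French BLEU score: Transformer (big) 41.8, a new single-model state of the art.
- Successfully applied to English constituency parsing with both large and limited training data, yielding better results than all previously reported models except the Recurrent Neural Network Grammar.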


##### **References**

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30. Retrieved from https://arxiv.org/abs/1706.03762.

[2] Sherstinsky, A. (2020). Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) network. Physica D: Nonlinear Phenomena, 404, 132306. https://doi.org/10.1016/j.physd.2019.132306
