
# 10.5 Transformer and Attention-Based Models





## Limitations of RNNs and LSTMs for Long Sequences

Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks
process data sequentially, one time step at a time. While this structure
allows them to model temporal dependencies, it introduces several limitations
when dealing with long sequences.

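As sequence length increases, recurrent models train slowly because each time
step depends on the previous hidden state, so computation cannot be
parallelized across time steps. Additionally, despite gating mechanisms,
learning very long-range dependencies remains challenging: gradients tend to
vanish or explode as they are propagated through many time steps, and all past
information must pass through a fixed-size hidden state.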
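
To make the sequential bottleneck concrete, here is a minimal sketch of a vanilla RNN forward pass in NumPy. The function name `rnn_forward`, the weight names, and the sizes are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def rnn_forward(x, W_xh, W_hh, b_h):
    """Run a vanilla RNN over x with shape (seq_len, input_dim)."""
    h = np.zeros(W_hh.shape[0])      # initial hidden state
    states = []
    for x_t in x:                    # step t needs the result of step t-1
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)          # (seq_len, hidden_dim)

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 50, 4, 8
x = rng.normal(size=(seq_len, input_dim))
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

states = rnn_forward(x, W_xh, W_hh, b_h)
print(states.shape)  # (50, 8)
```

The `for` loop cannot be vectorized across time, because each hidden state is an input to the next one. Information from the first step also reaches the last step only after passing through every intermediate update.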



## Motivation for Attention Mechanisms

Attention mechanisms were introduced to overcome the limitations of recurrent
models by allowing the network to directly focus on the most relevant parts
of the input sequence.

Instead of compressing all information into a single hidden state, attention
enables the model to dynamically weight different input elements when producing
an output, improving information flow and representational capacity.

Because these weights depend on each element's relevance to the current prediction, the model can focus on informative parts of the sequence and down-weight irrelevant ones. This often improves performance, and the learned weights can offer some insight into which inputs influenced a given output.



## Intuition Behind Attention

The core idea behind attention is to compute an output as a weighted combination
of all input elements. These weights represent how important each input element
is relative to the current task.

By learning these weights, the model can selectively emphasize relevant inputs
while suppressing less important ones.
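
As a small illustration of the weighted combination described above, the sketch below turns relevance scores into weights with a softmax and mixes three input vectors. The scores are made-up values chosen for the example; in a real model they would be computed from learned parameters.

```python
import numpy as np

def softmax(scores):
    """Convert a 1-D array of scores into weights that sum to 1."""
    shifted = scores - np.max(scores)   # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Three input elements, each a 2-dimensional vector.
inputs = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])

# Hypothetical relevance scores: the first input matters most.
scores = np.array([2.0, 0.5, 0.1])

weights = softmax(scores)      # one non-negative weight per input
output = weights @ inputs      # weighted combination of the inputs

print(weights)
print(output)
```

The input with the highest score receives the largest weight, so it contributes most to the output, while the others are suppressed but not discarded.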



### Scaled Dot-Product Attention

$$
\text{Attention}(Q, K, V) =
\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

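where:
- \(Q\) represents queries
- \(K\) represents keys
- \(V\) represents values
- \(d_k\) is the key dimensionality

Dividing by \(\sqrt{d_k}\) keeps the dot products from growing with the key dimensionality, which would otherwise push the softmax into regions with very small gradients. The softmax is applied row-wise, so each query's weights over the keys sum to 1.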
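
The formula can be sketched directly in NumPy. This is a minimal single-head version without masking or batching; the function names and the random test shapes are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # output: (n_q, d_v)

# Hypothetical shapes: 3 queries, 5 key/value pairs.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 6))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)  # (3, 6) (3, 5)
```

Each output row is a weighted average of the rows of `V`, with weights determined by how well the corresponding query matches each key.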



## Self-Attention Concept (High-Level)

Self-attention is a special form of attention in which the queries, keys, and values are all derived from the same input sequence. This allows each element in the sequence to attend to every other element, enabling the model to capture global context efficiently.

Through self-attention, the model can learn relationships such as semantic similarity, syntactic structure, or long-range dependencies without relying on recurrence or convolution. This mechanism is central to the Transformer architecture and enables parallel computation across the entire sequence.

Self-attention significantly improves the model's ability to learn complex relationships within long sequences.
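
The sketch below shows what "derived from the same input sequence" can look like in practice: one input matrix `X` is projected three times to obtain queries, keys, and values. The projection matrices `W_q`, `W_k`, `W_v` are randomly initialized here as an assumption; in a trained model they are learned.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model). Q, K and V all come from the same X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (seq_len, seq_len)
    return weights @ V, weights

# Hypothetical sizes, for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 8, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
                 for _ in range(3))

context, weights = self_attention(X, W_q, W_k, W_v)
print(context.shape, weights.shape)  # (6, 4) (6, 6)
```

The `weights` matrix has one row per position and one column per position, so every element attends to every other element in a single matrix product, with no loop over time steps.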


## Transformer Architecture Overview

The Transformer architecture eliminates recurrence entirely and relies
solely on attention mechanisms.

It consists of two main components:
- **Encoder**: processes the input sequence
- **Decoder**: generates the output sequence

Each component is built from stacked layers that combine self-attention with
position-wise feed-forward neural networks. Decoder layers additionally attend
to the encoder output.
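
A rough sketch of how one encoder layer might be assembled and stacked is given below. It makes several simplifying assumptions that the text above does not spell out: a single attention head, residual connections followed by layer normalization without learnable scale or shift, a ReLU feed-forward network, and random untrained weights. All function and parameter names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Normalize each position's feature vector (no learnable parameters)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def encoder_layer(X, p):
    """One simplified encoder layer: self-attention, then feed-forward."""
    # Self-attention sublayer with a residual connection.
    Q, K, V = X @ p["W_q"], X @ p["W_k"], X @ p["W_v"]
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    X = layer_norm(X + (weights @ V) @ p["W_o"])
    # Position-wise feed-forward sublayer with a residual connection.
    hidden = np.maximum(0.0, X @ p["W_1"] + p["b_1"])   # ReLU
    return layer_norm(X + hidden @ p["W_2"] + p["b_2"])

def init_params(rng, d_model, d_ff):
    """Random weights for one layer (assumed initialization scheme)."""
    s = d_model ** -0.5
    names = ["W_q", "W_k", "W_v", "W_o"]
    p = {n: rng.normal(scale=s, size=(d_model, d_model)) for n in names}
    p["W_1"] = rng.normal(scale=s, size=(d_model, d_ff))
    p["b_1"] = np.zeros(d_ff)
    p["W_2"] = rng.normal(scale=d_ff ** -0.5, size=(d_ff, d_model))
    p["b_2"] = np.zeros(d_model)
    return p

rng = np.random.default_rng(0)
seq_len, d_model, d_ff, num_layers = 5, 16, 32, 2
X = rng.normal(size=(seq_len, d_model))
layers = [init_params(rng, d_model, d_ff) for _ in range(num_layers)]

out = X
for p in layers:          # stacking: the output of one layer feeds the next
    out = encoder_layer(out, p)
print(out.shape)  # (5, 16)
```

Because every layer maps a `(seq_len, d_model)` array to an array of the same shape, layers can be stacked to any depth. A decoder layer would follow the same pattern with an extra attention step over the encoder output.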



![image.png](attachment:image.png)



### Positional Encoding

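Since Transformers do not use recurrence, self-attention by itself has no
notion of token order. Positional information is therefore added to the input
embeddings using positional encodings. For position \(pos\), dimension-pair
index \(i\), and embedding dimensionality \(d\), the sinusoidal encodings are: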

$$
PE_{(pos,2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)
$$

$$
PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)
$$
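
The two formulas can be computed for all positions at once. The sketch below assumes an even embedding dimensionality `d`; the function name `positional_encoding` and the `max_len` parameter are illustrative choices.

```python
import numpy as np

def positional_encoding(max_len, d):
    """Return a (max_len, d) matrix of sinusoidal encodings (d assumed even)."""
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(d // 2)[None, :]                # (1, d/2)
    angles = pos / np.power(10000.0, 2 * i / d)   # (max_len, d/2)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)   # even dimensions 2i use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions 2i+1 use cosine
    return pe

pe = positional_encoding(max_len=50, d=16)
print(pe.shape)   # (50, 16)
print(pe[0, :4])  # [0. 1. 0. 1.]
```

In use, the matrix is added element-wise to the token embeddings, so an embedding array of shape `(seq_len, d)` would be combined with `pe[:seq_len]`.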



## Why Transformers Outperform Recurrent Models

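Transformers typically outperform recurrent models on long sequences due to several key advantages:
- Parallel processing of all sequence positions during training
- Direct modeling of long-range dependencies, since any two positions interact in a single attention step
- Better scalability with data and model size
- Reduced training time on parallel hardware compared to RNN-based models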

These properties have made Transformers the foundation of modern
state-of-the-art models in natural language processing and beyond.



## Summary

In this section, we explored the motivation, intuition, and architecture
behind Transformer and attention-based models. By replacing recurrence
with attention, Transformers provide a powerful and scalable solution
for sequence modeling tasks.


## Task for the Reader
1. Explain why RNNs and LSTMs face difficulties when modeling very long sequences.

2. Define self-attention and explain how it differs from traditional attention mechanisms.

3. Draw and label the basic Transformer encoder-decoder architecture.

4. Explain the role of positional encoding in Transformer models.

5. Compare Transformers and recurrent models in terms of parallelism and long-range dependency modeling.
