### Preparing Inputs for LLMs
The input preparation for large language models (LLMs) is now complete.

### Attention Mechanisms
Now, let’s explore the **attention mechanism** in detail.

We will implement four types of attention mechanisms:
1. **Simplified self-attention**
2. **Self-attention**
3. **Causal attention**
4. **Multi-head attention**


### Section 3.1 (The problem with modelling long sequences)

### The Problem with Pre-LLM Architectures Without Self-Attention  

Before the introduction of self-attention, encoder-decoder models built from **Recurrent Neural Networks (RNNs)** were the state of the art (SOTA) for tasks like language translation.  

In RNN-based architectures:  
- The encoder processes the entire input sequence into a final hidden state (along with intermediate hidden states after each input word).  
- The decoder uses this final hidden state to generate the output sequence.  

**Key limitation:** The decoder relies primarily on the encoder's final hidden state and cannot directly access the intermediate hidden states. This often results in a **loss of context**, especially for longer input sequences, because critical information from earlier parts of the input may not be retained in that single final state.
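The bottleneck can be illustrated with a minimal sketch. It assumes PyTorch's built-in `torch.nn.RNN` as a stand-in encoder, and the sizes (`embed_dim`, `hidden_dim`) and random inputs are hypothetical. It only shows that the final hidden state has the same fixed size whether the input has 6 tokens or 60.

```python
import torch

torch.manual_seed(0)  # reproducible random weights and inputs

# Hypothetical sizes, chosen only for illustration
embed_dim, hidden_dim = 3, 4
encoder = torch.nn.RNN(input_size=embed_dim, hidden_size=hidden_dim, batch_first=True)

short_seq = torch.rand(1, 6, embed_dim)   # one sentence with 6 tokens
long_seq = torch.rand(1, 60, embed_dim)   # one sentence with 60 tokens

# The RNN returns the hidden state after every token and the final hidden state
all_states_short, final_short = encoder(short_seq)
all_states_long, final_long = encoder(long_seq)

print(all_states_short.shape)  # one hidden state per token: (1, 6, 4)
print(all_states_long.shape)   # one hidden state per token: (1, 60, 4)
print(final_short.shape, final_long.shape)  # both (1, 1, 4), regardless of length
```

A decoder that receives only `final_long` has to recover all 60 tokens from four numbers, which is where the loss of context comes from.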


### Section 3.2 (Capturing data dependencies with attention mechanisms)

### Evolution of Attention Mechanisms  

An improvement over the basic RNN encoder-decoder architecture was the **Bahdanau attention mechanism** (2014), which allowed the decoder to selectively access different parts of the input sequence.  
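As a rough sketch of what "selectively access" means, the snippet below lets one decoder state weight all encoder hidden states instead of using only the last one. Note the assumptions: the original Bahdanau mechanism scores each encoder state with a small learned network (additive attention), whereas this sketch swaps in a plain dot product for brevity, and the tensors `encoder_states` and `decoder_state` are hypothetical random values.

```python
import torch

torch.manual_seed(0)

# Hypothetical values: 6 encoder hidden states of size 4, one decoder state
encoder_states = torch.rand(6, 4)
decoder_state = torch.rand(4)

# Assumption: dot-product scoring stands in for Bahdanau's learned additive scoring
scores = encoder_states @ decoder_state   # one score per input position, shape (6,)
weights = torch.softmax(scores, dim=0)    # normalize so the weights sum to 1
context = weights @ encoder_states        # weighted mix of all encoder states, shape (4,)

print(weights)
print(context)
```

The decoder would recompute `weights` at every output step, so each generated word can draw on a different part of the input.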

Three years later, the paper *"Attention Is All You Need"* (2017) introduced the **Transformer architecture**, which relies on self-attention and removes recurrence altogether.  


### Section 3.3 (Attending to different parts of the input with self-attention)

### Understanding Self-Attention  

#### What is "self" in self-attention?  
"Self" means that attention is computed between positions of a single input sequence: the model learns relationships and dependencies among the parts of the input itself (e.g., the words in a sentence).  
This contrasts with traditional attention mechanisms, which relate elements of two different sequences (e.g., the input and output sequences in sequence-to-sequence models).  

#### What is the goal of self-attention?  
The goal of self-attention is to compute a **context vector** for every token in the input sequence.  

#### What is a context vector?  
A **context vector** is an enriched representation of one input element, built by combining information from all input elements (including the element itself). Each element contributes in proportion to its **attention weight**, which is the normalized form of its **attention score**.  

#### How is the attention score for a token computed?  
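The attention scores for a token `t` in the input sentence are computed by taking the **dot product** of that token's representation (called the **query**) with the representation of every token in the input sequence, including `t` itself. In the simplified version shown here, the token embeddings serve directly as these representations.  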
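In symbols, the whole computation for one query can be summarized as follows. The notation is an assumption introduced here for illustration: $x^{(q)}$ is the query token's embedding, $T$ is the number of input tokens, $\omega$ denotes attention scores, $\alpha$ denotes the normalized attention weights, and $z^{(q)}$ is the resulting context vector.

$$
\omega_{q,i} = x^{(q)} \cdot x^{(i)}, \qquad
\alpha_{q,i} = \frac{\exp(\omega_{q,i})}{\sum_{j=1}^{T} \exp(\omega_{q,j})}, \qquad
z^{(q)} = \sum_{i=1}^{T} \alpha_{q,i}\, x^{(i)}
$$

The sums run over all $T$ tokens, so the query token contributes to its own context vector.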


Below is an example sentence in which each token has a 3-dimensional embedding. The steps that follow compute the context vector for the token 'journey', which acts as the query:

```python
import torch

# One 3-dimensional embedding per token of the example sentence
inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)
```

Step 1: We compute the attention scores for the token 'journey' (the query) by taking its dot product with every token in the input sequence, including 'journey' itself.

```python
query = inputs[1]  # 'journey' is the query

# One score per input token
attn_scores_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
    # Both operands are 1-dim vectors, so no transpose is needed
    attn_scores_2[i] = torch.dot(x_i, query)

print(attn_scores_2)
```

```
tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])
```
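The same scores can be obtained for every token at once with a single matrix multiplication. The sketch below assumes the `inputs` tensor from the example (repeated so the snippet runs on its own); the name `attn_scores` is introduced here for illustration.

```python
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

# Entry [i, j] is the dot product of token i (as query) with token j
attn_scores = inputs @ inputs.T

print(attn_scores.shape)  # 6 x 6: one row of scores per query token
print(attn_scores[1])     # row for 'journey', same values as the loop produces
```

Because plain dot products are symmetric, `attn_scores` equals its own transpose in this simplified setting.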


Step 2: We normalize the attention scores with softmax to obtain attention weights that are positive and sum to 1. This makes them easier to interpret and helps keep training stable.

```python
# Normalize over the only dimension of the 1-dim score vector
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)

print("Attention weights:", attn_weights_2)
print("Sum:", attn_weights_2.sum())
```

```
Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)
```
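To make the stability point concrete, the sketch below compares the textbook softmax formula with `torch.softmax`. The helper `softmax_naive` is a hypothetical function written only for this comparison, and the score values are copied from the printed output of Step 1.

```python
import torch

def softmax_naive(x):
    # Textbook definition: exponentiate, then divide by the sum
    return torch.exp(x) / torch.exp(x).sum(dim=0)

attn_scores_2 = torch.tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])

# For moderate scores both versions agree
naive_weights = softmax_naive(attn_scores_2)
builtin_weights = torch.softmax(attn_scores_2, dim=0)
print(torch.allclose(naive_weights, builtin_weights))  # True

# For very large scores exp() overflows to inf, and inf / inf is nan
large_scores = torch.tensor([1000.0, 1001.0])
print(softmax_naive(large_scores))         # contains nan
print(torch.softmax(large_scores, dim=0))  # finite weights that still sum to 1
```

This is one reason to prefer the built-in function over a hand-written formula.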


Step 3: We compute the context vector for 'journey' as a weighted sum: each token embedding in the input sentence is multiplied by its attention weight, and the results are added together.

```python
query = inputs[1] # 2nd input token is the query

# Accumulate the weighted sum of all token embeddings
context_vec_2 = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i] * x_i

print(context_vec_2)
```

```
tensor([0.4419, 0.6515, 0.5683])
```
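The three steps can also be applied to all tokens at once. The following is a minimal sketch, assuming the same `inputs` tensor (repeated so the snippet is self-contained); the names `attn_weights` and `all_context_vecs` are introduced here for illustration.

```python
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

# Step 1: pairwise dot products, one row per query token
attn_scores = inputs @ inputs.T

# Step 2: softmax along each row so every row of weights sums to 1
attn_weights = torch.softmax(attn_scores, dim=-1)

# Step 3: each context vector is a weighted sum of all token embeddings
all_context_vecs = attn_weights @ inputs

print(all_context_vecs.shape)  # one 3-dim context vector per token
print(all_context_vecs[1])     # context vector for 'journey'
```

The row at index 1 should agree with `context_vec_2` from the loop version, up to floating-point rounding.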
