### Introduction to the Decoder Components
Before diving into the decoder, it helps to understand the components that make up its architecture. The key components are:
 - Masked Multi-Head Attention
 - Multi-Head Attention
 - Add & Norm
 - Feed-Forward Network
 
We have already discussed multi-head attention, Add & Norm and Feed-Forward Network in the context of the encoder. Now, we will first focus on masked multi-head attention and then explore the workings of the decoder in detail.

---------

# Masked Multi-Head Attention in the Decoder

In this notebook, we will explore the **masked multi-head attention** mechanism in the transformer decoder. We will take a step-by-step approach to understand how it works, using the example input sequence `["START", "The", "cat", "sat", "on", "mat"]`.

Note: How the encoder output is fed into the decoder is covered in the next notebook. That topic is very important, but for now, let's focus on how masked multi-head attention works.

### Step 1: Input Sequence

The input to the masked multi-head attention layer is the target sequence, which starts with the `START` token followed by the tokens generated so far. For our example, let's use the following input sequence:

- **Input Sequence**: `["START", "The", "cat", "sat", "on", "mat"]`

### Step 2: Compute Query (Q), Key (K), and Value (V) Vectors

For each token in the input sequence, the decoder computes the Query (Q), Key (K), and Value (V) vectors. These vectors will be used to calculate the attention scores.


 - $$ Q = XW^Q $$
 - $$ K = XW^K $$
 - $$ V = XW^V $$

Where:
- $$ X $$ is the matrix of input embeddings (one row per token),
- $$ W^Q, W^K, W^V $$ are the weight matrices for queries, keys, and values, respectively.

**`Note:`** This is explained in detail in the **04. Self-attention** notebook; please refer to that notebook for a clearer understanding.

![image.png](attachment:image.png)
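
The projections in this step can be sketched in NumPy. This is a minimal illustration, not the notebook's own implementation: the sizes `d_model = 8` and `d_k = 4` are assumed toy values, and the embeddings and weight matrices are random stand-ins for learned parameters.

```python
import numpy as np

tokens = ["START", "The", "cat", "sat", "on", "mat"]
seq_len = len(tokens)
d_model, d_k = 8, 4  # assumed toy sizes, chosen only for illustration

rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))  # random stand-in for token embeddings
W_Q = rng.normal(size=(d_model, d_k))    # random stand-ins for learned weights
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# One matrix product per projection: row i of Q, K, V belongs to token i.
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

print(Q.shape, K.shape, V.shape)  # (6, 4) (6, 4) (6, 4)
```

Each of the six tokens ends up with its own query, key, and value vector of length `d_k`.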

### Step 3: Attention Scores Calculation

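The attention scores are calculated as the dot product of each Query vector with every Key vector. The entry in row i and column j of the resulting matrix indicates how much focus token i should place on token j. In the standard transformer, these scores are additionally divided by the square root of the key dimension to keep them in a stable range before the softmax.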


$$ \text{Attention Score}(Q, K) = Q \cdot K^T $$

![image.png](attachment:image.png)

**Attention Scores**:

![image-2.png](attachment:image-2.png)
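
As a rough sketch of this step, the score matrix is a single matrix product. The `Q` and `K` below are random placeholders with assumed shapes (6 tokens, `d_k = 4`); the scaled variant is shown alongside because it is what the standard transformer uses.

```python
import numpy as np

seq_len, d_k = 6, 4  # assumed toy sizes
rng = np.random.default_rng(0)
Q = rng.normal(size=(seq_len, d_k))  # placeholder queries
K = rng.normal(size=(seq_len, d_k))  # placeholder keys

# scores[i, j] is the dot product of query i with key j.
scores = Q @ K.T

# Scaled variant used by standard scaled dot-product attention.
scaled_scores = scores / np.sqrt(d_k)

print(scores.shape)  # (6, 6)
```

The result is a square matrix with one row and one column per token in the sequence.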

### Step 4: Apply Masking

A masking mechanism is applied to ensure that each token can only attend to itself and the tokens that precede it in the sequence. This is done by adding a mask that is 0 for allowed positions and negative infinity (or a very large negative number) for future positions, so the attention scores for future tokens become negative infinity.

$$ \text{Masked Score} = \text{Attention Score} + \text{Mask} $$

![image.png](attachment:image.png)

**Masked Scores**:

![image.png](attachment:image.png)
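
One way to build such a mask is with an upper-triangular boolean pattern. The following is a sketch under assumptions: the score matrix is random placeholder data, and the variable names are illustrative.

```python
import numpy as np

seq_len = 6
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))  # placeholder attention scores

# True strictly above the diagonal, i.e. where column j is a future token for row i.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

# Additive mask: 0 where attention is allowed, -inf for future positions.
mask = np.where(future, -np.inf, 0.0)

masked_scores = scores + mask
print(masked_scores[0])  # only the first entry of row 0 stays finite
```

Row `i` keeps its original scores for columns `0..i` and holds negative infinity everywhere to the right of the diagonal.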

### Step 5: Softmax to Get Attention Weights

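The attention weights are computed by applying the softmax function to each row of the masked scores. Because the softmax of negative infinity is zero, future tokens receive a weight of exactly zero, and each row becomes a probability distribution over the tokens that can be attended to.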

$$ \text{Attention Weights} = \text{softmax}(\text{Masked Scores}) $$

![image.png](attachment:image.png)

**Attention Weights**:

![image.png](attachment:image.png)
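
A minimal sketch of the row-wise softmax follows. The `softmax` helper is written here for illustration (it subtracts the row maximum for numerical stability), and the masked scores are placeholder data.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the row maximum avoids overflow; -inf entries become exp(-inf) = 0.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

seq_len = 6
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))  # placeholder attention scores
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
masked_scores = np.where(future, -np.inf, scores)

weights = softmax(masked_scores, axis=-1)

print(np.allclose(weights.sum(axis=1), 1.0))  # True: every row sums to 1
print(np.all(weights[future] == 0.0))         # True: future tokens get zero weight
```

The first row always puts all of its weight on `START`, since that is the only token visible at position 0.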

### Step 6: Compute Context Vectors

The context vector for each token is computed as a weighted sum of the Value vectors, using that token's row of attention weights. This context vector captures the relevant information from the token itself and the tokens that precede it in the target sequence.

$$ \text{Context Vector}_i = \sum_{j \le i} \text{Attention Weight}_{ij} \cdot V_j $$

![image.png](attachment:image.png)

**Context Vectors**:

![image.png](attachment:image.png)
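
The weighted sum can be written as one matrix product. In this sketch the attention weights are an assumed stand-in (uniform over the visible tokens) rather than real softmax outputs, which makes the result easy to verify by hand.

```python
import numpy as np

seq_len, d_k = 6, 4  # assumed toy sizes
rng = np.random.default_rng(0)
V = rng.normal(size=(seq_len, d_k))  # placeholder value vectors

# Stand-in causal weights: each token attends uniformly to itself and earlier tokens.
weights = np.tril(np.ones((seq_len, seq_len)))
weights /= weights.sum(axis=1, keepdims=True)

# Row i of the result is the weighted sum of V[0..i].
context = weights @ V

print(np.allclose(context[0], V[0]))                # True: START only sees itself
print(np.allclose(context[2], V[:3].mean(axis=0)))  # True: average of the first 3 values
```

With real attention weights the averaging is no longer uniform, but the same product `weights @ V` applies.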

### Step 7: Output of Masked Multi-Head Attention

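In multi-head attention, Steps 2 to 6 run in parallel for several heads, each with its own weight matrices. The context vectors from all heads are concatenated and passed through a final linear projection. The resulting output of the masked multi-head attention layer is then used in the subsequent layers of the decoder to generate the final output.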

![image.png](attachment:image.png)

**Output**:

![image.png](attachment:image.png)
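
Putting the steps together, the sketch below follows the standard transformer formulation, in which several heads run in parallel and their outputs are concatenated and projected. The function name `masked_multi_head_attention`, the output projection name `W_O`, and all sizes are assumptions for illustration, and the weights are random rather than learned.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def masked_multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)

    # Scaled scores per head: (num_heads, seq_len, seq_len)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)

    # Causal mask, broadcast over the head dimension.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    weights = softmax(scores, axis=-1)
    context = weights @ V  # (num_heads, seq_len, d_head)

    # Concatenate the heads back to (seq_len, d_model), then project.
    concat = context.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 8, 2  # assumed toy sizes
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = masked_multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads)
print(out.shape)  # (6, 8)
```

The output has the same shape as the input, which is what allows the Add & Norm step to add it back to `X`.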

Masked multi-head attention is a crucial component of the transformer decoder that ensures the model generates sequences in an autoregressive manner. By applying masking, each position can only attend to itself and to previously generated tokens, never to future ones, which is essential for generating coherent and contextually relevant outputs.
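
The autoregressive property can be checked numerically: changing a later token should leave the outputs at all earlier positions untouched. The sketch below uses a single head for brevity, with an illustrative function name `causal_self_attention` and random weights of assumed sizes.

```python
import numpy as np

def causal_self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Block attention to future positions.
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax, shifted by the row maximum for stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))  # placeholder embeddings for the 6 tokens
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))

out = causal_self_attention(X, W_Q, W_K, W_V)

# Replace only the last token's embedding and run attention again.
X_changed = X.copy()
X_changed[-1] = rng.normal(size=8)
out_changed = causal_self_attention(X_changed, W_Q, W_K, W_V)

# Earlier positions never looked at the last token, so their outputs match.
print(np.allclose(out[:-1], out_changed[:-1]))  # True
```

Without the mask, every row of the output would change, because every token would attend to the modified one.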

In this notebook, we explored the masked multi-head attention mechanism in the transformer decoder, breaking down each step and understanding how it functions.
In the next notebook, we will cover how the decoder works in detail, including:
- How the encoder's output serves as input to the decoder.
- The overall flow of information within the decoder.