So far, we have explored the internal components of the transformer encoder. The key components are:

- **Embeddings and Positional Encoding**
- **Self-Attention and Multi-Head Attention**
- **Add & Norm and Feed-Forward Network**

Now, let's take a closer look at the overall architecture of the transformer and how the encoder processes input data to produce contextualized embeddings.

# Encoder - In Transformer Architecture

The transformer architecture consists of an encoder and a decoder. Each of these components is made up of multiple layers that process the input data. The encoder is responsible for transforming the input sequence into a continuous representation, while the decoder generates the output sequence based on this representation.

### How the Transformer Encoder Works

1. **Input Representation**:
   - **Input Sentence:** The input sentence is first tokenized into individual words or subwords.
   - **Embeddings:** Each token is converted into a dense vector representation (embedding).
   - **Positional Encoding:** Since transformers do not have a built-in sense of order, positional encodings are added to the embeddings to provide information about the position of each token in the sequence.

![image.png](attachment:image.png)
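
The input representation step can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the toy vocabulary, the whitespace tokenizer, and the random embedding table are hypothetical, and the sinusoidal scheme is assumed here as one common choice of positional encoding (the one used in the original Transformer paper).

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return sinusoidal positional encodings of shape (seq_len, d_model).

    Assumes d_model is even. Even dimensions use sine, odd dimensions cosine.
    """
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model / 2), the values 2i
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Hypothetical toy vocabulary and whitespace tokenizer, for illustration only.
vocab = {"the": 0, "cat": 1, "sat": 2}
d_model = 8

# Random embedding table; in a real model these values are learned.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

tokens = "the cat sat".split()
token_ids = np.array([vocab[t] for t in tokens])

embeddings = embedding_table[token_ids]            # (3, 8), one row per token
positional = sinusoidal_positional_encoding(len(tokens), d_model)
encoder_input = embeddings + positional            # what the first encoder layer receives

print(encoder_input.shape)  # (3, 8)
```

Because the positional encoding is added element-wise, the encoder input keeps the shape of the embeddings: one `d_model`-sized vector per token.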

2. **Multi-Head Self-Attention**:
   - **Self-Attention:** The position-encoded embeddings are processed by the self-attention mechanism, which computes attention scores that determine how strongly each token attends to every other token in the sequence.
   - **Multi-Head Attention:** Instead of using a single set of attention scores, multi-head attention runs several self-attention mechanisms (heads) in parallel, so the model can learn different kinds of relationships. The head outputs are concatenated and projected back to the model dimension.

3. **Add & Norm (After Self-Attention)**:
   - **Residual Connection:** The output of the multi-head self-attention layer is added to that layer's input (the position-encoded embeddings, in the first encoder layer) through a residual connection.
   - **Layer Normalization:** The combined output is then passed through layer normalization to stabilize the training process.

![image-2.png](attachment:image-2.png)
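
Steps 2 and 3 can be sketched together in NumPy. This is a simplified illustration under stated assumptions: the weight matrices are random rather than learned, the names `w_q`, `w_k`, `w_v`, and `w_o` are chosen for this example, and the layer normalization omits the learnable scale and shift that real implementations usually include.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    # (learnable gain and bias omitted for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(m):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return m.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = softmax(scores, axis=-1)                     # each row sums to 1
    heads = weights @ v                                    # (heads, seq, d_head)
    # Concatenate the heads and project back to d_model.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
x = rng.normal(size=(seq_len, d_model))   # stands in for the position-encoded embeddings
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

attn_out, attn_weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
add_norm_out = layer_norm(x + attn_out)   # residual connection, then layer normalization

print(attn_weights.shape)  # (2, 4, 4)
print(add_norm_out.shape)  # (4, 8)
```

Each head produces its own `seq_len × seq_len` matrix of attention weights, while the output after Add & Norm has the same shape as the input.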

4. **Feed-Forward Network**:
   - The normalized outputs are passed through a position-wise feed-forward network, typically two linear layers with a non-linear activation (such as ReLU) in between.
   - Each position in the sequence is processed independently, using the same weights at every position.

5. **Add & Norm (After Feed-Forward)**:
   - **Residual Connection:** The output from the feed-forward network is added back to the output of the previous Add & Norm step.
   - **Layer Normalization:** The combined result is passed through layer normalization again.

![image.png](attachment:image.png)
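
Steps 4 and 5 can be sketched in the same style. The example assumes a two-layer feed-forward network with a ReLU activation and a hidden size `d_ff` larger than `d_model`; the weights are random and the parameter names are illustrative. It also checks the "processed independently" property by feeding a single position through the network on its own.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-token normalization (learnable gain and bias omitted for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, w1, b1, w2, b2):
    # Linear -> ReLU -> Linear, applied to every position with the same weights.
    hidden = np.maximum(0.0, x @ w1 + b1)
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32

# Stands in for the output of the first Add & Norm step.
x = rng.normal(size=(seq_len, d_model))

w1 = rng.normal(size=(d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
w2 = rng.normal(size=(d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)

ffn_out = feed_forward(x, w1, b1, w2, b2)
layer_out = layer_norm(x + ffn_out)       # residual connection, then layer normalization

# Position-wise check: one token processed alone matches its row in the full result.
single = feed_forward(x[2:3], w1, b1, w2, b2)
print(np.allclose(single, ffn_out[2:3]))  # True
print(layer_out.shape)                    # (4, 8)
```

Unlike self-attention, the feed-forward network never mixes information across positions; all interaction between tokens happens in the attention step.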

### Encoder Output

The final output of the encoder is a sequence of contextualized representations, where each representation captures the essential information about the corresponding input token and its relationship with other tokens in the sequence. These encoder outputs are then passed to the decoder for further processing and generation of the output sequence.

### The Complete Transformer Encoder

![encoder-transformer.drawio.png](attachment:encoder-transformer.drawio.png)
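
To tie the pieces together, here is a compact NumPy sketch of an encoder built from a stack of layers. It is a toy under several assumptions: random weights, no biases, no learnable layer-norm parameters, no dropout, and hypothetical helper names (`init_layer`, `encoder_layer`, `encoder`). Its only purpose is to show how data flows through the blocks in the diagram.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def init_layer(rng, d_model, d_ff, scale=0.1):
    # Random parameters for one encoder layer (biases omitted for brevity).
    shapes = {"w_q": (d_model, d_model), "w_k": (d_model, d_model),
              "w_v": (d_model, d_model), "w_o": (d_model, d_model),
              "w1": (d_model, d_ff), "w2": (d_ff, d_model)}
    return {name: rng.normal(size=shape) * scale for name, shape in shapes.items()}

def self_attention(x, p, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(m):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return m.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ p["w_q"]), split(x @ p["w_k"]), split(x @ p["w_v"])
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    heads = (weights @ v).transpose(1, 0, 2).reshape(seq_len, d_model)
    return heads @ p["w_o"]

def encoder_layer(x, p, num_heads):
    x = layer_norm(x + self_attention(x, p, num_heads))   # attention + Add & Norm
    ffn = np.maximum(0.0, x @ p["w1"]) @ p["w2"]          # position-wise feed-forward
    return layer_norm(x + ffn)                            # Add & Norm

def encoder(x, layers, num_heads):
    # Each layer consumes the previous layer's output.
    for p in layers:
        x = encoder_layer(x, p, num_heads)
    return x

rng = np.random.default_rng(0)
seq_len, d_model, d_ff, num_heads, num_layers = 5, 16, 64, 4, 2
layers = [init_layer(rng, d_model, d_ff) for _ in range(num_layers)]

# Stands in for token embeddings plus positional encodings.
encoder_input = rng.normal(size=(seq_len, d_model))
encoder_output = encoder(encoder_input, layers, num_heads)

print(encoder_output.shape)  # (5, 16)
```

The output has one vector per input token, with the same dimensionality as the input, which is why encoder layers can be stacked and why the decoder can attend over the result position by position.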

## What's Next: The Decoder

In the next notebook, we will shift our focus to the decoder part of the transformer architecture. The encoder outputs serve as the input for the decoder, which is responsible for generating the output sequence.

We will discuss the following aspects of the decoder:

- **Structure of the Decoder**: We will explore the components of the decoder and how they differ from the encoder.
- **Masked Self-Attention**: We will learn about the masked self-attention mechanism used in the decoder, which prevents each position from attending to later tokens so that the model relies only on what it has already generated.
- **Encoder-Decoder Attention**: We will discuss how the decoder attends to the encoder outputs to incorporate information from the input sequence when generating the output.
- **Output Generation**: We will explore how the decoder generates the output sequence token by token, using techniques like beam search or greedy decoding.

By understanding the decoder part of the transformer architecture, we will have a complete picture of how transformers process input sequences and generate output sequences for tasks like machine translation, text summarization, and language generation.