# Transformers explanation

### Official paper

> [Attention is all you need](https://arxiv.org/abs/1706.03762)

## Overview of the Transformers Architecture

Transformers are in the spotlight, and for good reason. They have revolutionized the field over the past few years.  

The Transformer is an architecture that leverages Attention to significantly enhance the performance of models designed for sequence learning tasks.  

This architecture was first introduced in the 2017 paper *"[Attention Is All You Need](https://arxiv.org/abs/1706.03762)"* and quickly established itself as the leading framework for most AI applications.  

Since then, various projects, including Google's BERT and OpenAI's GPT series, have been built on this architecture, delivering incredible performance results that have easily surpassed existing state-of-the-art benchmarks.  

The image below illustrates the standard Transformer architecture.  

<img src="imagens/transformer_imagem01.png" alt="IMG" style="width: 600px;"/>

---

## Encoder and Decoder Stacks  

At its core, the Transformer architecture consists of a stack of encoder layers and decoder layers. To avoid confusion, we will refer to individual layers as *Encoder* or *Decoder* and use the terms *encoder stack* or *decoder stack* for a group of encoder or decoder layers.  

Both the encoder stack and decoder stack include their respective embedding layers for their inputs. Finally, there is an output layer to generate the final output.  

<img src="imagens/transformer_imagem02.jpeg" alt="IMG" style="width: 600px;"/>

All encoders are identical to each other, as are all decoders.  

The encoder includes the crucial self-attention layer, which computes relationships between different words in a sequence, as well as a feed-forward layer.  

The decoder contains the self-attention layer, the feed-forward layer, and an additional encoder-decoder attention layer.  

Each encoder and decoder has its own set of weights. There are many variations of the Transformer architecture. Some Transformer architectures do not include any decoders and rely solely on the encoder.  

<img src="imagens/transformer_imagem03.jpeg" alt="IMG" style="width: 600px;"/>

## What Does Self-Attention Do?

The key to the groundbreaking performance of the Transformer is its use of Attention, specifically Self-Attention.  

When processing a word, Attention allows the model to focus on other words in the input that are closely related to that word.  

For example, "Ball" is closely related to "blue" and "hold." On the other hand, "blue" is not related to "boy," as shown in the image below:  

<img src="imagens/transformer_image04.png" alt="IMG" style="width: 370px;" />  

The Transformer architecture uses Self-Attention by relating each word in the input sequence to all other words.  

---

## The Transformer Model Training Process  

The Transformer operates slightly differently during training and inference.  

Let’s first look at the data flow during training. Training data consists of two parts:  

- The source or input sequence (e.g., "You are welcome" in English, for a translation task).  
- The target or output sequence (e.g., "De nada" in Portuguese).  

The Transformer’s goal is to learn how to generate the target sequence using both the input sequence and the target sequence.  

<img src="imagens/transformer_imagem05.jpeg" alt="IMG" style="width: 700px;" />  

The Transformer processes the data as follows:  

1. The input sequence is converted into embeddings (with positional encoding) and fed into the encoder.  
2. The encoder stack processes this and produces an encoded representation of the input sequence.  
3. The target sequence, prepended with a start-of-sentence token, is converted into embeddings (with positional encoding) and fed into the decoder.  
4. The decoder stack processes this alongside the encoded representation from the encoder stack to produce an encoded representation of the target sequence.  
5. The output layer converts the prior processing into word probabilities and the final output sequence.  
6. The Transformer’s loss function compares this output sequence with the target sequence from the training data. This loss is used to generate gradients to train the Transformer during backpropagation.  
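
As a rough illustration of steps 3 and 6, the sketch below builds the decoder input and the labels the loss compares against for a single example. The token names `<sos>` and `<eos>` and the helper `make_decoder_pair` are hypothetical; real tokenizers define their own special tokens.

```python
# Hypothetical special tokens; real tokenizers define their own.
SOS, EOS = "<sos>", "<eos>"

def make_decoder_pair(target_tokens):
    """Return (decoder_input, labels) for one training example."""
    decoder_input = [SOS] + target_tokens   # target shifted right by one position
    labels = target_tokens + [EOS]          # what the model should predict at each position
    return decoder_input, labels

decoder_input, labels = make_decoder_pair(["De", "nada"])
print(decoder_input)  # ['<sos>', 'De', 'nada']
print(labels)         # ['De', 'nada', '<eos>']
```

At every position, the decoder sees the tokens up to that point and is trained to predict the token one step ahead.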

---

## The Transformer Model Inference Process  

During inference, we only have the input sequence and do not have the target sequence to feed into the decoder. The Transformer’s objective is to generate the target sequence based solely on the input sequence.  

Similar to a Seq2Seq model, the output is generated in a loop, with the output sequence from the previous time step fed into the decoder in the next time step until an end-of-sentence token is encountered.  

The difference from the Seq2Seq model is that, at each time step, the entire generated output sequence so far is fed back into the decoder rather than just the last word.  

<img src="imagens/transformer_imagem06.jpeg" alt="IMG" style="width: 700px;" />  

The data flow during inference is:  

1. The input sequence is converted into embeddings (with positional encoding) and fed into the encoder.  
2. The encoder stack processes this and produces an encoded representation of the input sequence.  
3. Instead of the target sequence, an empty sequence with only a start-of-sentence token is used. This is converted into embeddings (with positional encoding) and fed into the decoder.  
4. The decoder stack processes this alongside the encoded representation from the encoder stack to produce an encoded representation of the target sequence.  
5. The output layer converts this into word probabilities and produces an output sequence.  
6. The last word of the output sequence is taken as the predicted word. This word is then placed in the second position of the decoder input sequence, which now contains the start-of-sentence token and the first predicted word.  
7. Return to step 3. As before, feed the new decoder sequence into the model. Then, take the second predicted word and append it to the decoder sequence. Repeat this until predicting an end-of-sentence token. Note that, since the encoder sequence does not change with each iteration, steps 1 and 2 do not need to be repeated every time.  
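
The loop in steps 3 to 7 can be sketched as follows. This is only a control-flow illustration: `toy_encoder` and `toy_decoder` are hypothetical stand-ins that replay a fixed translation, not real model components.

```python
SOS, EOS = "<sos>", "<eos>"   # hypothetical special tokens

def toy_encoder(source_tokens):
    # Stand-in for steps 1-2: a real encoder returns a matrix of vectors.
    return tuple(source_tokens)

def toy_decoder(encoded, decoder_input):
    # Stand-in for steps 3-5: returns the predicted next token.
    # It simply replays a fixed translation, one token per call.
    canned = ["De", "nada", EOS]
    return canned[len(decoder_input) - 1]

def greedy_decode(source_tokens, max_len=10):
    encoded = toy_encoder(source_tokens)       # computed once, reused at every step
    decoder_input = [SOS]
    while len(decoder_input) < max_len:
        # The whole sequence generated so far is fed back in, not just the last word.
        next_token = toy_decoder(encoded, decoder_input)
        if next_token == EOS:
            break
        decoder_input.append(next_token)
    return decoder_input[1:]                   # drop the start token

print(greedy_decode(["You", "are", "welcome"]))  # ['De', 'nada']
```

The `max_len` guard is an added safety net so the loop ends even if no end-of-sentence token is ever predicted.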

## Text Classification and Language Models with Transformers

Transformers are highly versatile and are used for most NLP tasks, such as language modeling and text classification. They are frequently employed in sequence-to-sequence models for applications like machine translation, text summarization, question answering, named entity recognition, and speech recognition.

Different variants of the Transformer architecture are tailored to specific problems. The basic encoder layer serves as a common building block for these architectures, with different application-specific "heads" depending on the problem being addressed.

---

### Text Classification Architecture with Transformers  

A sentiment analysis application, for instance, would take a text document as input. A classification head uses the Transformer’s output to generate predictions for class labels, such as positive or negative sentiment.  

<img src="imagens/transformer_imagem07.jpeg" alt="IMG" style="width: 700px;" />  
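
A classification head could look roughly like the sketch below. The text does not say how the Transformer output is reduced to one vector, so averaging over positions is an assumption made here for illustration, and all weights are random placeholders.

```python
import numpy as np

labels = ["negative", "positive"]
d_model, seq_len = 8, 5                              # tiny sizes for readability
rng = np.random.default_rng(0)

encoder_out = rng.normal(size=(seq_len, d_model))    # stand-in for the Transformer output
W = rng.normal(size=(d_model, len(labels)))          # classification head weights (untrained)
b = np.zeros(len(labels))

pooled = encoder_out.mean(axis=0)                    # assumption: average over positions
logits = pooled @ W + b                              # one score per class
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                 # softmax over the two classes

print(labels[int(probs.argmax())], probs.round(2))
```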

---

### Language Model Architecture with Transformers  

A language model architecture takes the initial portion of an input sequence, such as a text phrase, and generates new text by predicting the subsequent phrases.  

A language model head uses the Transformer’s output to generate probabilities for each word in the vocabulary. The word with the highest probability becomes the predicted next word in the sentence.  

<img src="imagens/transformer_imagem08.jpeg" alt="IMG" style="width: 700px;" />  

---

## How the Embedding Layer Works  

Like other sequence models in NLP, the Transformer needs two pieces of information for each word: the meaning of the word and its position in the sequence.  

- The Embedding layer encodes the meaning of the word.  
- The Position Encoding layer represents the position of the word.  

The Transformer combines these two encodings by adding them.  

There are two embedding layers in the Transformer:  

1. **Input Embedding Layer:** The input sequence is passed through this layer, which is also called input embedding.  
   <img src="imagens/transformer_imagem11.jpeg" alt="IMG" style="width: 700px;" />  

2. **Output Embedding Layer:** The target sequence is fed into this layer after shifting the targets one position to the right and inserting a start token in the first position. During inference, since there is no target sequence, the output sequence is fed into this layer in a loop, which is why it’s called the Output Embedding Layer.  

The text sequence is mapped to numerical word IDs using the vocabulary. The embedding layer maps each input word to an embedding vector, which is a richer representation of the word's meaning.  

<img src="imagens/transformer_imagem12.jpeg" alt="IMG" style="width: 700px;" />  
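
A minimal sketch of this lookup, assuming a hypothetical five-entry vocabulary and a tiny embedding size. In a real model the table is learned during training; here it is random.

```python
import numpy as np

# Hypothetical vocabulary and embedding size, chosen only for readability.
vocab = {"<pad>": 0, "you": 1, "are": 2, "welcome": 3, "<unk>": 4}
embedding_size = 4

rng = np.random.default_rng(0)
# One row per vocabulary entry.
embedding_table = rng.normal(size=(len(vocab), embedding_size))

tokens = "you are welcome".split()
word_ids = np.array([vocab.get(t, vocab["<unk>"]) for t in tokens])
embedded = embedding_table[word_ids]   # row lookup: one vector per word

print(word_ids)        # [1 2 3]
print(embedded.shape)  # (3, 4)
```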

---

## How the Positional Encoding Layer Works  

**Positional Encoding** is a critical technique in the Transformer architecture used to incorporate positional information into input sequences.  

Since the Transformer architecture lacks recurrence or convolutions, it has no inherent knowledge of word order or relative positions within the sequence. Positional encoding resolves this issue by adding positional information to each word.  

### How Positional Encoding Works  

1. **Adding Positional Information:** For each word in the input sequence, a positional encoding vector is added to the word embedding vector. This positional encoding vector includes information about the word’s position in the sequence.  

2. **Calculation Formula:** Positional encoding is computed using sine and cosine functions at different frequencies. The formula for each dimension of the positional encoding vector is:  

**Even dimensions (index \(2i\)):**

$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)$

**Odd dimensions (index \(2i+1\)):**

$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)$

### Where:
- \(pos\): The position of the token in the sequence.
- \(2i\): The even index of the dimension.
- \(2i+1\): The odd index of the dimension.
- \(d_{model}\): The dimensionality of the model embeddings.

**Explanation:**
- **Frequency scaling:** The denominator $(10000^{\frac{2i}{d_{model}}})$ ensures different frequencies for each dimension.
- **Sine and cosine:** Alternate dimensions use sine and cosine functions to encode positions uniquely.

This encoding creates periodic patterns that help the model learn relationships between tokens based on their relative positions.

3. **Why Use Sine and Cosine?**  
   These functions allow the model to easily learn to represent relative distances between words. The combination of sine and cosine at varying frequencies gives each position a unique vector, and it may also allow the model to extrapolate to sequence lengths longer than those seen during training.  

4. **Adding to Word Embeddings:**  
   The positional encoding vector is added to the word embedding vector. This enables the model to process both content information (from the word embedding) and positional information for each word.  

Positional encoding is essential for Transformer-based models because it provides the positional context needed to understand word order in a sequence. Without it, the Transformer would treat all inputs as unordered sets, which would be inadequate for many NLP tasks like translation, where word order is crucial for meaning.  
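
The sine and cosine formulas above can be sketched in NumPy as follows. The function name `positional_encoding` is illustrative, and the sketch assumes an even `d_model`.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding of shape (seq_len, d_model); d_model must be even."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]    # the even indices 2i: (1, d_model / 2)
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
print(pe.shape)   # (50, 16)
print(pe[0, :4])  # position 0: sin(0) = 0 and cos(0) = 1 alternate
```

Each pair of dimensions shares one frequency, so low dimensions oscillate quickly with position while high dimensions change slowly.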

## Matrix Dimensions

Deep learning models process a batch of training samples at a time. The embedding and positional encoding layers operate on matrices that represent a batch of sequence samples. The embedding layer takes a matrix in the format (samples, sequence length) of word IDs. It encodes each word ID into a word vector of length equal to the embedding size, resulting in an output matrix in the format (samples, sequence length, embedding size).

The positional encoding uses an encoding size equal to the embedding size, producing a similarly formatted matrix that can be added to the embedding matrix.

<img src="imagens/transformer_imagem14.jpeg" alt="IMG" style="width: 600px;" />
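
The shapes described above can be checked with a small sketch. All sizes are arbitrary, and the position matrix is a random placeholder, since only the shapes matter here.

```python
import numpy as np

samples, seq_len, vocab_size, embedding_size = 2, 5, 100, 8
rng = np.random.default_rng(0)

word_ids = rng.integers(0, vocab_size, size=(samples, seq_len))   # (2, 5)
embedding_table = rng.normal(size=(vocab_size, embedding_size))
embedded = embedding_table[word_ids]                               # (2, 5, 8)

# Placeholder position matrix with one row per position.
position = rng.normal(size=(seq_len, embedding_size))             # (5, 8)
x = embedded + position   # broadcast across the samples dimension

print(word_ids.shape, embedded.shape, x.shape)  # (2, 5) (2, 5, 8) (2, 5, 8)
```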

The (samples, sequence length, embedding size) shape produced by the embedding and positional encoding layers is preserved throughout the Transformer as the data flows through the stacks of encoders and decoders until it is reshaped by the final output layers.

This illustrates the 3D matrix dimensions in the Transformer. However, for simplicity, we'll omit the first dimension (for samples) in the following explanations and use 2D representations for a single sample.

---

## Encoder Stack Matrices

The encoder and decoder stacks consist of several (typically six) encoders and decoders, respectively, connected sequentially.

- The first encoder in the stack receives input from the embedding and positional encoding layers.  
- Subsequent encoders receive input from the previous encoder in the stack.

The encoder processes its input through a **multi-head self-attention layer.** The output of self-attention is passed to a **feed-forward layer,** which then sends its output to the next encoder.

<img src="imagens/transformer_imagem15.png" alt="IMG" style="width: 370px;" />

Both the self-attention and feed-forward sublayers have a **residual connection** around them, followed by **layer normalization.**

The output of the last encoder is passed to each decoder in the decoder stack, as explained below.

---

## Decoder Stack Matrices

The decoder structure is similar to the encoder, with a few differences.

- Like the encoder, the first decoder in the stack receives input from the output embedding and positional encoding layers.  
- Subsequent decoders in the stack take input from the previous decoder.

The decoder processes its input through a **multi-head self-attention layer,** which operates slightly differently than the encoder’s self-attention. It is restricted to attend only to **previous positions** in the sequence. This is achieved by **masking future positions,** a topic we will discuss later.

<img src="imagens/transformer_imagem16.png" alt="IMG" style="width: 370px;" />

Unlike the encoder, the decoder includes a second **multi-head attention layer** called the **encoder-decoder attention layer.** This layer works similarly to self-attention but combines two input sources:  
1. The self-attention layer below it.  
2. The output from the encoder stack.

The output from the encoder-decoder attention layer is passed to a **feed-forward layer,** which then sends its output to the next decoder.

Each of these sublayers—self-attention, encoder-decoder attention, and feed-forward—has a **residual connection** around it, followed by **layer normalization.**

## How the Self-Attention Mechanism Works

Now, let's examine how various vectors/tensors flow through these components to transform a model's input into its output. As with most NLP applications, we begin by converting each input word into a vector using an embedding algorithm.

Each word is embedded into a 512-dimensional vector (standard architecture). These vectors are represented by simple boxes in the image below:

<img src="imagens/transformer_imagem09.png" alt="IMG" style="width: 750px;" />

The general abstraction for all encoders is that they take a list of 512-dimensional vectors as input.

After embedding the words in our input sequence, each one flows through the two layers of the encoder.

<img src="imagens/transformer_imagem10.png" alt="IMG" style="width: 750px;" />

Here, we observe a key property of the Transformer: the word at each position follows its own path through the encoder. Dependencies exist between these paths in the **Self-Attention** layer. However, the **feed-forward layer** has no such dependencies, allowing parallel execution of these paths, which speeds up training.

Next, we'll switch to a shorter sentence to explore what happens in each encoder sublayer.

---

## The Math of Self-Attention - Initial Vectors

The first step in calculating Self-Attention is to create three vectors from each input vector of the encoder (i.e., the embedding of each word). For each word, we create a **query vector (Q)**, a **key vector (K)**, and a **value vector (V)**. These vectors are created by multiplying the embedding by three matrices trained during the training process.

Note that these new vectors have dimensions smaller than the embedding vectors. Their dimensionality is 64, while the embedding and encoder input/output vectors are 512-dimensional. **This reduced dimensionality is an architectural choice to make multi-head attention computation more efficient.**

Multiplying \( x_1 \) by the weight matrix \( W_Q \) produces \( q_1 \), the **query vector** associated with that word. Similarly, we create **queries, keys, and values** for each word in the input sentence. These vectors serve as useful abstractions for calculating attention.

<img src="imagens/transformer_imagem17.png" alt="IMG" style="width: 750px;" />
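
A minimal sketch of this projection step, using the 512 and 64 dimensions mentioned above. The weight matrices are random stand-ins for the trained \( W_Q \), \( W_K \), and \( W_V \).

```python
import numpy as np

d_model, d_k = 512, 64
rng = np.random.default_rng(0)

# Two word embeddings, x1 and x2, stacked as rows.
X = rng.normal(size=(2, d_model))

# Random stand-ins for the trained projection matrices.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
q1 = Q[0]   # query vector for the first word

print(Q.shape, K.shape, V.shape)  # (2, 64) (2, 64) (2, 64)
```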

---

## The Math of Self-Attention - Score Calculation

The second step is to calculate a **score.** Let’s say we are calculating Self-Attention for the first word in this example, “Thinking.”

We need to score each word in the input sentence relative to this word. The score determines how much focus to place on other parts of the sentence when encoding a word in a specific position.

The score is calculated by computing the **dot product** of the query vector with the key vector of the respective word. For example, if we’re processing Self-Attention for the word at position #1, the first score is the dot product of \( q_1 \) and \( k_1 \). The second score is the dot product of \( q_1 \) and \( k_2 \), and so on.

<img src="imagens/transformer_imagem18.png" alt="IMG" style="width: 750px;" />
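
As a small worked example of the score calculation, the sketch below uses made-up 3-dimensional vectors instead of the 64-dimensional ones.

```python
import numpy as np

# Hypothetical 3-dimensional query and key vectors.
q1 = np.array([1.0, 0.0, 2.0])
k1 = np.array([2.0, 1.0, 3.0])
k2 = np.array([0.0, 4.0, 1.0])

score_1 = q1 @ k1   # 1*2 + 0*1 + 2*3 = 8
score_2 = q1 @ k2   # 1*0 + 0*4 + 2*1 = 2
print(score_1, score_2)  # 8.0 2.0
```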

---

## The Math of Self-Attention - Softmax Operation

The third step divides the scores by 8, the square root of the key vector dimension (64 in the original paper). This scaling leads to more stable gradients. Other values could be used here, but this is the standard.

In the fourth step, the results are passed through a **softmax** operation. Softmax normalizes the scores so they are all positive and sum to 1.

<img src="imagens/transformer_imagem19.png" alt="IMG" style="width: 750px;" />

The **softmax scores** determine how much each word contributes to the attention for a specific position. The word at the current position typically has the highest softmax score, but other relevant words may also contribute significantly.
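
The scaling and softmax steps can be illustrated with two made-up raw scores. The values 112 and 96 are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

d_k = 64
scores = np.array([112.0, 96.0])   # hypothetical raw scores q1.k1 and q1.k2
scaled = scores / np.sqrt(d_k)     # divide by 8
weights = softmax(scaled)

print(scaled)            # [14. 12.]
print(weights.round(2))  # [0.88 0.12]
```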

---

## The Math of Self-Attention - Weighted Sum

The fifth step is to multiply each **value vector** by its softmax score. This effectively amplifies the values of relevant words while dampening the values of irrelevant ones.

The sixth step is to **sum the weighted value vectors.** This produces the output of the Self-Attention layer for the specific position (in this case, for the first word).

<img src="imagens/transformer_imagem20.png" alt="IMG" style="width: 750px;" />

This completes the Self-Attention calculation. The resulting vector can then be passed to the feed-forward neural network. However, in programming implementations, this calculation is performed in matrix form for faster processing.
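
The fifth and sixth steps for a single position might look like this, with hypothetical softmax scores and small value vectors:

```python
import numpy as np

weights = np.array([0.88, 0.12])     # hypothetical softmax scores for the first word
V = np.array([[1.0, 0.0, 2.0],       # v1
              [0.0, 4.0, 1.0]])      # v2

weighted = weights[:, None] * V      # step 5: scale each value vector by its score
z1 = weighted.sum(axis=0)            # step 6: sum the weighted value vectors

print(z1)  # [0.88 0.48 1.88]
```

The same result comes from the single matrix product `weights @ V`, which is the form used in practice.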

---

## The Math of Self-Attention - Scaled Dot Product and Final Formulation

In general, a Self-Attention mechanism requires four components:

- **Queries (Q):** Represent the positions being encoded.
- **Keys (K):** Compared against the queries to determine attention levels.
- **Values (V):** Weighted by attention scores and summed to produce the layer's output.
- **Scoring Function:** Determines which elements require more attention. It takes a query and key as input and outputs an attention weight.

The most common scoring function implementation is the **dot product** (or a small MLP comparing similarity metrics).

The **scaled dot product attention** mechanism calculates how much one part of the input should focus on another using the dot product of input vectors. This mechanism is used in both encoder and decoder layers.

- First, the dot product between queries and keys is calculated.
- Next, the result is scaled by dividing by the square root of the key vector dimension. Without this scaling, large dot products would push the softmax into regions where its gradients become extremely small.
- After scaling, a **softmax function** is applied to produce **attention weights** (probabilities) indicating the relative importance of each value.

The final attention output is obtained by weighting the values by these probabilities and summing them.

The mathematical formulation for **scaled dot product attention** is:

$\text{Scaled Dot-Product Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$

Where

- **$(QK^T)$** computes the similarity between the query and all keys.
- **$(\frac{1}{\sqrt{d_k}})$** is used to scale the dot product and avoid excessively large values, which could result in numerical instability.
- The **$softmax$** function converts the attention scores into a probability distribution.
- The final output is a weighted sum of the values \(V\), based on the computed attention weights.

And there you have it: **Attention is all you need!**

This mechanism allows the model to focus on different parts of the input sequence based on the relevance of each token to the current query.
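
In matrix form, the formula can be sketched in a few lines of NumPy. This is a minimal single-head version without masking or batching, run on random inputs.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 64))
K = rng.normal(size=(3, 64))
V = rng.normal(size=(3, 64))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)  # (3, 64) (3, 3)
```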

## Functioning of the Feed-Forward Layer and Mathematical Formulation

### Formula

$\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2$

<img src="imagens/transformer_imagem03.jpeg" alt="IMG" style="width: 370px;" />

The formula above represents a Feed-Forward Neural Network (FFN) with one hidden layer and the ReLU activation function. Here's a step-by-step breakdown:

**FFN(x):** Represents the feed-forward neural network function, which takes an input 'x' and produces an output. This output results from processing the input through the network layers.

**ReLU:** The Rectified Linear Unit activation function. It operates simply: if the input is positive, it returns the input; if negative, it returns zero. This introduces non-linearity into the model, essential for learning complex patterns.

**W₁ and W₂:** These are the weights of the network's two layers. They are matrices that transform the input data into the outputs for each layer. The weights are adjusted during training to minimize prediction error.

**x:** The input vector for the network, treated as a row vector so that it multiplies the weight matrix from the left.

**b₁ and b₂:** Bias terms for the two layers. Each bias is added after the corresponding weight multiplication: b₁ before the activation function is applied, and b₂ to the final output. Like weights, they are adjusted during training and make the neural network more flexible.

**xW₁ + b₁:** The linear combination of the input vector 'x' and the weights and bias of the first layer.

**ReLU(xW₁ + b₁):** The ReLU activation function is applied to the first linear combination, setting negative values to zero while leaving positive values unchanged.

**ReLU(xW₁ + b₁)W₂ + b₂:** The activated output from the first layer is multiplied by the weights of the second layer, and the second bias is added, producing the final network output.

This structure is common in dense neural networks, where each neuron in one layer is connected to every neuron in the next. These networks can be trained for tasks like classification or regression, depending on the number of output units and the loss function used.
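
A minimal NumPy sketch of this feed-forward layer, using the row-vector convention of the original paper (`x @ W1`). The inner size of 2048 is the value from that paper, and the weights here are random placeholders rather than trained values.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    hidden = np.maximum(0.0, x @ W1 + b1)   # first linear layer followed by ReLU
    return hidden @ W2 + b2                 # second linear layer, no activation

d_model, d_ff = 512, 2048   # sizes used in the original paper
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

x = rng.normal(size=(3, d_model))           # three positions of one sequence
y = feed_forward(x, W1, b1, W2, b2)
print(y.shape)  # (3, 512)
```

Each row of `x` is transformed independently of the others, which is why the layer can process all positions in parallel.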

## How Layer Normalization Works and Mathematical Formulation

**Formula:**

$\text{layernorm}(x + \text{sublayer}(x))$

<img src="imagens/transformer_imagem31.png" alt="IMG" style="width: 700px;" />

<img src="imagens/transformer_imagem29.png" alt="IMG" style="width: 470px;" />

The formula above represents a typical operation within a Transformer block, specifically within one of its many attention or feed-forward neural network sublayers. Here’s a breakdown of what this operation does:

### Components:

- **layernorm**: Refers to **layer normalization**, a normalization technique applied to activation vectors within the network. Normalizing data at each layer stabilizes learning and allows the model to train faster and more effectively.

- **x**: Represents the input vector for the current sublayer of the neural network. In the context of a Transformer, \( x \) could be the output of a previous sublayer, such as an attention layer.

- **sublayer(x)**: Represents a function that performs a transformation on the input vector \( x \). Depending on the context, **sublayer** can be an attention layer, a feed-forward layer, or any other operation. For example, in a Transformer, this might involve self-attention or position-wise feed-forward computations.

- **x + sublayer(x)**: Represents a **residual connection** or **skip connection**, where the output of the sublayer is added to the original input vector. Residual connections are crucial for effectively training deep networks, as they help mitigate issues like vanishing or exploding gradients, commonly observed in other deep learning architectures such as RNNs and LSTMs.

- **layernorm(x + sublayer(x))**: After adding the original input vector to the sublayer output, layer normalization is applied. The resulting normalized vector is then passed to the next sublayer or layer.
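
The components listed above can be combined in a short sketch. It leaves out the learned scale and shift parameters that practical layer normalization implementations usually include, and the sublayer is a hypothetical linear map used only as a placeholder.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row (one position) over its features.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    return layer_norm(x + sublayer(x))     # residual connection, then normalization

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))                # 3 positions, 8 features
W = rng.normal(size=(8, 8))                # placeholder sublayer weights
out = add_and_norm(x, lambda h: h @ W)

print(out.shape)                                        # (3, 8)
print(np.allclose(out.mean(axis=-1), 0.0, atol=1e-6))   # True
```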

---

### Explanation:

This formula describes the process of passing an input \( x \) through a sublayer (e.g., attention or feed-forward network), adding the sublayer output to the original input via residual connections, and applying layer normalization to the result. This structure is a critical part of the Transformer architecture, allowing the model to "remember" the original input while incorporating newly processed information.

Normalization benefits the model by reducing training time, making it less biased toward features with larger values, and lowering the risk of exploding or vanishing gradients by keeping activations within a consistent range. In summary, training a model with unnormalized features in gradient descent is suboptimal, making normalization essential. The key decision is choosing the type of normalization.

---

### Types of Normalization:

<img src="imagens/transformer_imagem30.png" alt="IMG" style="width: 470px;" />

1. **Batch Normalization**:
   - Involves all sentences in a batch.
   - For each feature across sentences, calculates a mean and variance for normalization.
   - Ensures the mean is close to zero and variance close to one for all features.
   - This process is repeated for every feature in the input data.

2. **Layer Normalization**:
   - Calculates the mean and variance over the features of a single sample. In the Transformer, this is done for the feature vector at each position of a sentence.
   - It doesn’t depend on batch size or whether sentences are in the same batch.
   - Simply uses the features of a single sample for normalization.

Layer normalization was initially designed for recurrent neural networks (RNNs) because batch normalization's performance depends on mini-batch size. The Transformer architecture uses layer normalization throughout the model as it works exceptionally well for NLP tasks.
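
The difference between the two comes down to the axis over which statistics are computed. The sketch below uses a small (samples, features) matrix and a hypothetical helper `normalize`, and it omits learned parameters and the running statistics that batch normalization keeps for inference.

```python
import numpy as np

def normalize(x, axis, eps=1e-5):
    mean = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 6))   # (samples, features)

batch_normed = normalize(x, axis=0)   # per feature, across the samples in the batch
layer_normed = normalize(x, axis=1)   # per sample, across its own features

print(np.allclose(batch_normed.mean(axis=0), 0.0, atol=1e-6))  # True
print(np.allclose(layer_normed.mean(axis=1), 0.0, atol=1e-6))  # True

# Layer normalization of one sample does not depend on the rest of the batch.
print(np.allclose(normalize(x[:1], axis=1), layer_normed[:1]))  # True
```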

---

## Visual Representation of Multi-Head Attention

The Transformer runs several attention computations in parallel, each of which is called an **Attention Head**. Together, they form **Multi-Head Attention**.

<img src="imagens/transformer_imagem22.png" alt="IMG" style="width: 500px;" />

### Process:

1. **Queries (Q), Keys (K), and Values (V)**:
   - These are generated by passing the input through separate linear layers, each with its own weights.
   - This produces three outputs: \( Q \), \( K \), and \( V \), which are then combined using the attention formula below to produce the **Attention Scores**.

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

<img src="imagens/transformer_imagem21.png" alt="IMG" style="width: 350px;" /><br/>
<img src="imagens/transformer_imagem23.png" alt="IMG" style="width: 500px;" />

### Key Takeaways:

- The \( Q \), \( K \), and \( V \) values contain encoded representations of each word in the sequence.
- The **self-attention mechanism** computes attention scores, combining each word with all others in the sequence.
- These scores encode how much each word should focus on every other word in the sequence.
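
A compact sketch of multi-head self-attention for a single sequence follows. It assumes the layout of the original paper: 8 heads of size 64 for a 512-dimensional model, with the head outputs concatenated and passed through a final linear projection, called `W_O` here. All weights are random placeholders, and masking is left out.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    seq_len, d_model = X.shape
    d_k = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_k)
        return M.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (heads, seq_len, seq_len)
    heads = softmax(scores) @ V                        # (heads, seq_len, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O                                # final linear projection

d_model, num_heads, seq_len = 512, 8, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads)
print(out.shape)  # (4, 512)
```

Projecting once with a full-width matrix and then splitting into heads is equivalent to giving each head its own smaller projection matrix.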

### Masking in Attention:

While discussing the decoder earlier, masking was briefly mentioned. The mask (illustrated in the attention diagrams) controls what the model focuses on during training. For example, in causal settings, it prevents attention to future words. This ensures predictions rely only on prior context.

## Attention Masks

When calculating the attention scores, the Attention module implements a masking step. This serves two purposes:

1. **In Encoder Self-Attention and Encoder-Decoder Attention**:  
   Masking zeros out attention outputs where padding tokens exist in the input sentences. This ensures that padding does not contribute to self-attention.  
   _(Note: Since input sequences may have different lengths, they are padded with tokens, as in most NLP models, to enable fixed-length vectors to be passed into the Transformer.)_

2. **In Decoder Self-Attention**:  
   Masking prevents the decoder from "seeing" the rest of the target sentence while predicting the next word.  
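
The padding case in item 1 could be sketched like this. The padding ID of 0 and the placeholder scores are assumptions for illustration. Padded key positions receive a score of negative infinity, so the softmax assigns them zero weight.

```python
import numpy as np

PAD_ID = 0                                        # hypothetical padding ID
word_ids = np.array([5, 9, 3, PAD_ID, PAD_ID])    # 3 real tokens followed by 2 pads

scores = np.ones((5, 5))                  # placeholder attention scores
is_pad = word_ids == PAD_ID               # True at padded positions
scores[:, is_pad] = -np.inf               # no query may attend to a padded key

e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(weights[0].round(2))   # the last two entries are 0
```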

The decoder processes words from the source sequence and uses them to predict words in the target sequence. During training, **Teacher Forcing** is used, where the complete target sequence is fed as input to the decoder. As a result, while predicting a word at a given position, the decoder has access to preceding and succeeding target words, allowing it to "cheat" by using future target words.

For example, when predicting **Word 3**, the decoder should only refer to the first three words of the target input and not the fourth word (**AI**).

<img src="imagens/transformer_imagem24.png" alt="IMG" style="width: 500px;" />

To address this, the Decoder masks the input words that appear later in the sequence.  

---

### How Masking Works:

When calculating attention scores, masking is applied to the numerator just before the Softmax. Masked elements are set to negative infinity, so that Softmax converts those values to zero.

<img src="imagens/transformer_imagem25.png" alt="IMG" style="width: 500px;" />
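
A minimal sketch of the decoder's look-ahead mask, using placeholder scores of zero so the effect of the mask is easy to see:

```python
import numpy as np

seq_len = 4
scores = np.zeros((seq_len, seq_len))     # placeholder scaled scores

# True above the diagonal: position i must not see positions j > i.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
masked = np.where(future, -np.inf, scores)

e = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(weights.round(2))
```

The first row puts all its weight on the first position, the second row splits it evenly over the first two positions, and so on. Every entry above the diagonal is exactly zero.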

---

## Output Generation Process

The final decoder in the stack passes its output to the Output Component, which converts it into the final output sentence.

- The **Linear Layer** projects the decoder vector into **Word Scores**, with one score for every unique word in the target vocabulary, at each position in the sentence.  
  - For example, if the final sentence has 7 words and the target vocabulary has 10,000 unique words, 10,000 score values are generated for each of the 7 words.  
  - These scores represent the likelihood of each vocabulary word appearing at that position in the sentence.  

This helps explain why a large language model (LLM) can have billions of parameters.

- The **Softmax Layer** transforms these scores into probabilities (values between 0 and 1).  
  - At each position, the word index with the highest probability is selected and mapped to the corresponding word in the vocabulary.  
  - These selected words form the Transformer's output sequence.

<img src="imagens/transformer_imagem26.png" alt="IMG" style="width: 500px;" />
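
The linear and softmax layers can be sketched as below, with a hypothetical four-word vocabulary and random, untrained weights. The predicted words are therefore arbitrary and only the shapes are meaningful.

```python
import numpy as np

vocab = ["De", "nada", "obrigado", "END"]            # hypothetical target vocabulary
d_model, seq_len = 8, 3
rng = np.random.default_rng(0)

decoder_out = rng.normal(size=(seq_len, d_model))    # one vector per position
W = rng.normal(size=(d_model, len(vocab)))           # linear layer weights (untrained)
b = np.zeros(len(vocab))

logits = decoder_out @ W + b                          # word scores: (3, 4)
e = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = e / e.sum(axis=-1, keepdims=True)             # softmax at each position

predicted_ids = probs.argmax(axis=-1)                 # most likely word per position
predicted_words = [vocab[i] for i in predicted_ids]
print(probs.shape)       # (3, 4)
print(predicted_words)   # arbitrary here, since the weights are random
```

The linear layer alone holds `d_model * vocabulary size` weights, which grows quickly with realistic vocabulary sizes.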

---

## Loss Function

During training, a **loss function**, such as cross-entropy loss, is used to compare the generated output probability distribution with the target sequence. The probability distribution represents the likelihood of each word occurring at each position.

<img src="imagens/transformer_imagem27.png" alt="IMG" style="width: 800px;" />

---

### Example:

Assume the target vocabulary contains only four words. The goal is to produce a probability distribution that matches the expected target sequence: **"De nada END"**.

- For the first word position, the probability distribution should assign a probability of 1 to "De" and 0 to all other words.  
- Similarly, "nada" and "END" should have probabilities of 1 in the second and third positions, respectively.

The loss function computes the difference between the predicted and expected distributions. This loss is then used to calculate gradients to train the Transformer via backpropagation. The training process here is the same as in other deep learning architectures.
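
As a worked example of cross-entropy for the target "De nada END", the sketch below uses a hypothetical four-word vocabulary and made-up predicted probabilities. With one-hot targets, the loss at each position reduces to the negative log of the probability given to the correct word.

```python
import numpy as np

vocab = ["De", "nada", "obrigado", "END"]   # hypothetical four-word vocabulary
target_ids = np.array([0, 1, 3])            # "De nada END"

# Made-up predicted distributions, one row per position.
probs = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.05, 0.10, 0.80],
])

# Probability assigned to the correct word at each position.
picked = probs[np.arange(len(target_ids)), target_ids]   # 0.7, 0.6, 0.8
loss = -np.log(picked).mean()                             # average over positions
print(round(loss, 4))  # 0.3635
```

A perfect prediction, with probability 1 on every correct word, would give a loss of 0.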