In [39]:
!nvidia-smi

Sun Jun 23 13:08:20 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P8               9W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
!git clone https://github.com/nlp-with-transformers/notebooks.git
%cd notebooks
from install import *
install_requirements()

Cloning into 'notebooks'...
remote: Enumerating objects: 526, done.
remote: Counting objects: 100% (526/526), done.


In [None]:
from utils import *
setup_chapter()

In [None]:
from transformers import AutoTokenizer
from bertviz.transformers_neuron_view import BertModel
from bertviz.neuron_view import show

model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = BertModel.from_pretrained(model_ckpt)
text = "time flies like an arrow"
show(model, "bert", tokenizer, text, display_mode="light", layer=0, head=8)
# Query vector is likely very similar due to PCA or lack of context

## What Key and Query Vectors Represent

- **Fundamental Basis**: Both key and query vectors are transformations of the token embeddings. Here’s what they fundamentally represent:
  - **Token Embeddings**: These are dense vector representations of tokens (words or subwords) that capture semantic and syntactic information of the tokens. They are typically initialized from pre-trained models or randomly and then refined during training.
  
  - **Transformation**: For each token in the input, its embedding is transformed into three different vectors via separate learnable weight matrices:
    - **Query Vector (Q)**: Represents the token in a form that seeks out relevant information from other tokens. It's like asking, "Given what I represent, which other tokens should I pay attention to, and how strongly?"
    - **Key Vector (K)**: Represents the token as a source of information, answering the queries from all other tokens. It's like saying, "Here's how I can be useful to each query based on what I represent."

- **Purpose and Interaction**:
  - **Interaction**: During the self-attention process, each query vector (representing the current token’s information needs) is matched against all key vectors (representing how each token can provide relevant information). The output of this matching (via dot product and softmax) determines the attention scores.
  - **Use of Scores**: These scores are then used to aggregate value vectors (another transformation of the token embeddings, representing the actual content to be forwarded) to produce the output that combines relevant information from across the input sequence.
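
To make the query/key matching concrete, here is a minimal hand-sized sketch in plain Python. The three 2-dimensional vectors and the helper names (`dot`, `softmax`) are made up for illustration. The same vectors are reused as queries, keys, and values, whereas a real model would derive each from learned weight matrices.

```python
import math

# Hypothetical 2-d vectors for three tokens (illustrative values, not real embeddings).
tokens = {"time": [1.0, 0.0], "flies": [0.8, 0.6], "arrow": [0.0, 1.0]}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Use "flies" as the query and match it against every key.
query = tokens["flies"]
keys = list(tokens.values())
values = keys  # same vectors reused as values for simplicity

scores = [dot(query, k) / math.sqrt(len(query)) for k in keys]
weights = softmax(scores)

# The output is the attention-weighted sum of the value vectors.
output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(query))]

for name, w in zip(tokens, weights):
    print(f"{name}: {w:.3f}")
```

The key that points in the most similar direction to the query (here the query's own vector) receives the largest weight, and the weights always sum to 1.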


In [None]:
from transformers import AutoTokenizer
model_ckpt = "bert-base-uncased"
text = "time flies like an arrow"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

In [None]:
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
inputs.input_ids

In [None]:
from torch import nn
from transformers import AutoConfig

config = AutoConfig.from_pretrained(model_ckpt)
token_emb = nn.Embedding(config.vocab_size, config.hidden_size)
token_emb

In [None]:
inputs_embeds = token_emb(inputs.input_ids)
inputs_embeds.size()

In [None]:
import torch
from math import sqrt

query = key = value = inputs_embeds # set Q, K, V to inputs_embeds // kept equal for now for simplicity's sake; independent weight matrices are applied later
dim_k = key.size(-1) # dim_k is the size of the last dimension of key // i.e. the length of each key vector, often denoted d_k in the transformer literature

# bmm = batch matrix multiplication
scores = torch.bmm(query, key.transpose(1,2)) / sqrt(dim_k) # key is transposed so its feature dimension lines up with query's for the matrix product
scores.size()


4. **Compute Attention Scores**:
   ```python
   scores = torch.bmm(query, key.transpose(1,2)) / sqrt(dim_k)
   ```
   - `torch.bmm`: This function performs a batch matrix multiplication. `query` and `key.transpose(1,2)` are the operands.
   - `key.transpose(1,2)`: Transposes the second and third dimensions of `key`. If `key` is a tensor of shape \([batch\_size, seq\_length, dim\_k]\), transposing the last two dimensions converts it to \([batch\_size, dim\_k, seq\_length]\). This is necessary to align the dimensions for matrix multiplication, where we need the feature dimensions of keys to match the corresponding dimension in queries.
   - The division by `sqrt(dim_k)` is the scaling factor that gives "scaled" dot-product attention its name. Without it, the dot products grow in magnitude as `dim_k` increases, which pushes the softmax into regions where its gradients are very small and can make training unstable.
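
The effect of the scaling factor can be checked numerically. The sketch below is a small experiment in plain Python, assuming the vector components are independent with zero mean and unit variance. Real query and key vectors will not satisfy this exactly, and the helper name `dot_product_std` is made up for illustration.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def dot_product_std(dim, n_samples=2000):
    """Empirical standard deviation of q·k for random unit-variance vectors."""
    dots = []
    for _ in range(n_samples):
        q = [random.gauss(0.0, 1.0) for _ in range(dim)]
        k = [random.gauss(0.0, 1.0) for _ in range(dim)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    return statistics.pstdev(dots)

results = {}
for dim in (16, 256):
    raw = dot_product_std(dim)
    results[dim] = (raw, raw / math.sqrt(dim))
    print(f"dim_k={dim:4d}  raw std={raw:6.2f}  scaled std={raw / math.sqrt(dim):.2f}")
```

Under these assumptions the raw spread grows roughly like `sqrt(dim_k)`, while the scaled spread stays close to 1 regardless of the dimension.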

---

### What This Code Represents

This portion of code is implementing the attention score calculation part of the self-attention mechanism. These scores, once computed, would typically be passed through a softmax function to normalize them so they sum to one. The normalized scores would then be used to compute a weighted sum of the `value` vectors, which would be the output of the self-attention layer.

The code essentially captures the essence of how each token in a sequence attends to every other token, using a shared embedding representation for simplicity in this snippet. In a full transformer model, the queries, keys, and values would likely be derived from the same input embeddings but transformed separately through different linear layers (learned weight matrices).

In [None]:
# now applying the softmax function here
import torch.nn.functional as F

weights = F.softmax(scores, dim=-1)
weights.sum(dim=-1)

### Understanding `weights.sum(dim=-1)`

After applying softmax, each row of the resulting `weights` tensor represents a probability distribution across the sequence length. The purpose of calling `weights.sum(dim=-1)` is to verify that the softmax operation has normalized the scores properly along the specified dimension. Here’s what each part does:

1. **`F.softmax(scores, dim=-1)`**:
   - This function converts the logits (scores) in the tensor to probabilities that are easier to work with in subsequent computations. The `dim=-1` argument specifies that the softmax should be applied to the last dimension of the tensor, which is typical in attention mechanisms to normalize scores across each query’s attention over keys.

2. **`weights.sum(dim=-1)`**:
   - This method sums the elements of `weights` along the last dimension (`dim=-1`). In the context of the softmax output, this operation should result in a tensor where every element is 1.0.
   - The sum is taken across the dimension where softmax was applied, confirming that the probabilities for each set of scores (each row in your 2D tensor context) add up to 1. This is a crucial check to ensure that the softmax function has normalized the data correctly.

### Purpose

- **Verification**: By performing `weights.sum(dim=-1)`, the code is essentially verifying that the output of the softmax function represents valid probability distributions. For a correctly implemented softmax, the sum across the probabilities in each group (or for each query in attention mechanisms) should be 1, up to floating-point rounding.
- **Debugging and Validation**: This step is also useful for debugging. If the sum isn’t 1, it indicates an issue with the softmax application or possibly with the input data formatting or handling.

The output `tensor([[1., 1., 1., 1., 1.]], grad_fn=<SumBackward1>)` has one entry per query token in the single sequence of the batch. It confirms that for each of the five tokens, the attention weights across all keys sum to 1, meaning they are proper probability distributions. This is a standard sanity check in models that use such mechanisms.
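
When this check is automated, comparing against 1 with a tolerance is safer than testing for exact equality. A minimal sketch, using a random stand-in for the `scores` tensor (the shape `[1, 5, 5]` mirrors the five-token example, and the tolerance value is an assumption):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
scores = torch.randn(1, 5, 5)          # stand-in for the [batch, seq, seq] score tensor
weights = F.softmax(scores, dim=-1)

row_sums = weights.sum(dim=-1)
# Compare with a tolerance rather than ==, since float32 sums can be off by a tiny amount.
ok = torch.allclose(row_sums, torch.ones_like(row_sums), atol=1e-6)
print(ok)
```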

In [None]:
# Final step of scaled dot product attention --> multiply attention weights by value vector
attn_outputs = torch.bmm(weights, value)
attn_outputs.shape

In [None]:
# wrapping all the steps above into a function that we can use later :)
def scaled_dot_product_attention(query, key, value):
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)

## Multi-headed Attention

In [None]:
class AttentionHead(nn.Module):
    def __init__(self, embed_dim, head_dim): # "embed_dim = dimensionality of input embedding" // "head_dim = dimensionality of each attention head"
        super().__init__()
        self.q = nn.Linear(embed_dim, head_dim) # linear projections that forward() applies to the
        self.k = nn.Linear(embed_dim, head_dim) # input "hidden_state" to produce queries, keys, and values
        self.v = nn.Linear(embed_dim, head_dim)

    def forward(self, hidden_state):
        # applying here the previously defined function
        attn_outputs = scaled_dot_product_attention(self.q(hidden_state), self.k(hidden_state), self.v(hidden_state))
        return attn_outputs

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        embed_dim = config.hidden_size # variables are initialised from the config file (more specifically the config class)
        num_heads = config.num_attention_heads
        head_dim = embed_dim // num_heads
        # this class simply combines several "AttentionHead" modules, hence the name // their outputs are concatenated in forward()
        self.heads = nn.ModuleList(
            [AttentionHead(embed_dim, head_dim) for _ in range(num_heads)]
        ) # a list of AttentionHead instances, each projecting the full embedding down to head_dim dimensions
        self.output_linear = nn.Linear(embed_dim, embed_dim) # final linear layer that mixes the concatenated head outputs; input and output both have the original embedding dimension

    def forward(self, hidden_state):
        x = torch.cat([h(hidden_state) for h in self.heads], dim=-1) # concatenate the outputs from all attention heads with torch.cat // along the last (feature) dimension
        x = self.output_linear(x) # pass the concatenated tensor through the output_linear layer
        return x

One point worth getting right: it is often said that each attention head receives only a portion of the input vector's dimensions. In the implementation above that is not quite what happens. Every head receives the full `hidden_state` and uses its own linear layers to project it down to `head_dim` dimensions, so the split applies to the size of each head's query, key, and value vectors rather than to the input. This design is intentional and is a key feature of the multi-head attention mechanism in transformer architectures. Let's break down why this is done and what benefits it provides:

### Why Each Head Works in a Lower-Dimensional Space:

1. **Dimensionality Splitting**:
   - In multi-head attention, the embedding dimension (`embed_dim`) is divided by the number of heads (`num_attention_heads`) to get `head_dim`. Each head projects the full input embedding into its own `head_dim`-dimensional query, key, and value vectors.
   - If `embed_dim` is, say, 512 and there are 8 attention heads, each head works with 64-dimensional vectors (512 / 8 = 64). Concatenating the 8 head outputs gives back a 512-dimensional vector.

2. **Diverse Feature Focus**:
   - Because each head has its own projection matrices, each head can learn to focus on different features or aspects of the input data. This is akin to having multiple independent feature extractors that each capture different types of relationships within the data.
   - For example, one head might focus on syntactic aspects of text, another on semantic cues, and another on contextual nuances. This specialization allows the model to capture a richer and more diverse set of information from the same input sequence.
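
A quick shape check can make the dimension bookkeeping concrete. The sketch below assumes the per-head `nn.Linear(embed_dim, head_dim)` layout used by the `AttentionHead` class above, with the illustrative sizes 512 and 8. Only the query projection is shown, and the variable names are made up for this example.

```python
import torch
from torch import nn

embed_dim, num_heads, seq_len = 512, 8, 5   # illustrative sizes
head_dim = embed_dim // num_heads           # 64

hidden_state = torch.randn(1, seq_len, embed_dim)

# One query projection per head; each takes the full embed_dim as input.
q_projections = nn.ModuleList(
    [nn.Linear(embed_dim, head_dim) for _ in range(num_heads)]
)

per_head = [proj(hidden_state) for proj in q_projections]
concatenated = torch.cat(per_head, dim=-1)

print(q_projections[0].in_features, q_projections[0].out_features)
print(per_head[0].shape, concatenated.shape)
```

Each projection reads all 512 input features and writes 64, and concatenating the 8 results restores the 512-dimensional width.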

### Benefits of Dividing Dimensions Across Heads:

1. **Increased Model Flexibility**:
   - Each head potentially focuses on different parts of the sequence or different kinds of relationships, which enhances the model's ability to understand and represent complex data structures and dependencies. This is particularly useful in tasks that require understanding nuanced relationships in data, like natural language understanding.

2. **Parallel Processing**:
   - Since each head operates independently of the others, the heads can in principle all be computed at the same time. The implementation above loops over the heads for readability, while optimized implementations typically compute them together as one batched matrix operation, making the computation faster and more scalable.

3. **Robustness to Overfitting**:
   - Restricting each head to a lower-dimensional projection may act as a mild constraint that discourages any single head from memorizing everything. Each head learns a more specific aspect of the data, which may lead to a more generalized understanding when all heads are combined. This is a plausible intuition rather than an established result.

4. **Rich Representation**:
   - Combining the outputs of multiple specialized attention heads allows the model to form a more comprehensive representation of each input token. This composite understanding is richer and more nuanced than what could be achieved with a single head focusing on the entire embedding dimension.

### Conclusion

The splitting of dimensions among multiple heads is a strategic choice in transformer design that contributes significantly to their success across various applications. It allows the model to capture a broader and more detailed understanding of the data, leading to better performance on tasks requiring deep contextual awareness and complex relational understanding.

In [None]:
multihead_attn = MultiHeadAttention(config) # pass in same config to ensure we're using the same setting as BERT
attn_output = multihead_attn(inputs_embeds)
attn_output.size()

In [None]:
# Using BertViz here again to visualise the attention for 2 different use of word "flies"

from bertviz import head_view
from transformers import AutoModel

model = AutoModel.from_pretrained(model_ckpt, output_attentions=True)

sentence_a = "time flies like an arrow"
sentence_b = "fruit flies like a banana"

viz_inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt')
attention = model(**viz_inputs).attentions
sentence_b_start = (viz_inputs.token_type_ids == 0).sum(dim=1)
tokens = tokenizer.convert_ids_to_tokens(viz_inputs.input_ids[0])

head_view(attention, tokens, sentence_b_start, heads=[8])

In [None]:
# Now implementing the Feed-Forward Layer of encoder // This is where most capacity and memorisation is hypothesised to happen; and this is the part
# that's normally scaled up when scaling up the model.
class FeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.linear_1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.linear_2 = nn.Linear(config.intermediate_size, config.hidden_size)
        self.gelu = nn.GELU() # type of activation function // smoother than ReLU
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, x):
        x = self.linear_1(x)
        x = self.gelu(x)
        x = self.linear_2(x)
        x = self.dropout(x)
        return x

### GELU (Gaussian Error Linear Unit)
- **What It Is**: GELU is an activation function used to introduce non-linearity into the network. It is given by:
  $$
  \text{GELU}(x) = x P(X \leq x) = x \Phi(x)
  $$
  where $X \sim \mathcal{N}(0, 1)$ and $\Phi(x)$ is the cumulative distribution function (CDF) of the standard normal distribution. In other words, each input is scaled by the probability that a standard normal variable is less than or equal to it, so large positive inputs pass through almost unchanged and large negative inputs are pushed toward zero.
  
- **Why Use It**: GELU can be read as a smooth, deterministic relative of dropout: rather than zeroing inputs at random, it weights each input by a factor that depends on the input's own value. Unlike ReLU, it has no sharp corner at zero and lets small negative values through. It's particularly popular in transformer models and is the activation used in architectures like GPT and BERT.
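
The formula can be evaluated directly with the standard library, since $\Phi(x) = \tfrac{1}{2}\left(1 + \operatorname{erf}(x / \sqrt{2})\right)$. The sketch below is a plain-Python illustration of the exact form, not the PyTorch implementation, and the `relu` helper is included only for comparison.

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF at x."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    return max(0.0, x)

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  relu={relu(x):+.4f}")
```

For large positive inputs the two functions nearly coincide. The difference is near zero, where GELU bends smoothly and returns small negative values for negative inputs.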

### Dropout
- **What It Is**: Dropout is a regularization technique used in neural networks to prevent overfitting. During training, dropout randomly "drops" (i.e., sets to zero) a proportion of the neurons in a layer based on a specified probability (dropout rate), which is provided in the configuration as `hidden_dropout_prob`.

- **How It Works**: By randomly omitting different subsets of the features detected by neurons during different phases of training, dropout forces the network not to rely too heavily on any single neuron for providing a prediction. This helps to spread out the "importance" across many neurons, improving the model’s generalization ability to unseen data.

- **Usage in the Code**:
  ```python
  x = self.dropout(x)
  ```
  Dropout is applied last, to the output of the second linear transformation (the GELU activation sits between the two linear layers). This helps the network stay robust and avoid overfitting to noise in the training data.
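
The difference between training and evaluation behaviour is easy to see on a tensor of ones. This is a minimal sketch with an illustrative rate of 0.5 (not the value from the BERT config):

```python
import torch
from torch import nn

torch.manual_seed(0)
p = 0.5                      # illustrative dropout rate
dropout = nn.Dropout(p)
x = torch.ones(1, 8)

dropout.train()              # training mode: elements are zeroed at random
train_out = dropout(x)       # survivors are scaled by 1 / (1 - p)

dropout.eval()               # evaluation mode: dropout is a no-op
eval_out = dropout(x)

print(train_out)
print(eval_out)
```

In training mode every element is either 0 or scaled up to 2.0, which keeps the expected value of each activation unchanged. In evaluation mode the input passes through untouched.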

### FeedForward Layer Structure in the Code
The feed-forward layer in the transformer architecture typically consists of two linear transformations with a non-linearity (like GELU) in between. Here’s how it’s structured in your code:

1. **First Linear Transformation**: Maps the input from the hidden size to a larger intermediate size. This expansion allows the network to create a higher-dimensional space in which to learn complex patterns:
   ```python
   x = self.linear_1(x)
   ```

2. **Activation Function (GELU)**: Introduces non-linearity and complexity, allowing the network to learn non-linear decision boundaries:
   ```python
   x = self.gelu(x)
   ```

3. **Second Linear Transformation**: Typically, this step projects the data back from the intermediate size to the original hidden size, allowing the layer outputs to be stackable or addable with other components like residuals from skip connections:
   ```python
   x = self.linear_2(x)
   ```

4. **Dropout**: Applied to the output of the second linear transformation to reduce overfitting by randomly setting a fraction of the output features to zero during training:
   ```python
   x = self.dropout(x)
   ```

This feed-forward layer is a critical component of transformer models, providing the capacity to learn complex patterns and relationships in data, augmented by dropout and advanced activation functions like GELU for better training dynamics and model robustness.
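
To get a feel for how much capacity sits in this layer, the parameter count of the two linear transformations can be worked out by hand. The sizes below are assumed from the bert-base configuration (`hidden_size` 768, `intermediate_size` 3072), and `linear_params` is a hypothetical helper for this example.

```python
# Sizes assumed from the bert-base configuration; adjust for other models.
hidden_size = 768
intermediate_size = 3072

def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

ff_params = (
    linear_params(hidden_size, intermediate_size)
    + linear_params(intermediate_size, hidden_size)
)
print(f"{ff_params:,}")
```

Under these assumptions the feed-forward block holds about 4.7 million parameters per encoder layer.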

In [None]:
feed_forward = FeedForward(config)
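ff_outputs = feed_forward(attn_output) # feed the multi-head attention output (attn_output from the cell above) through the feed-forward layer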
ff_outputs.size()

In [None]:
class TransformerEncoderLayer(nn.Module): # this class encapsulates a single layer of transformer encoder
    def __init__(self, config):
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(config.hidden_size) # two layer normalisation modules which stabilise training by normalising each token's features.
        self.layer_norm_2 = nn.LayerNorm(config.hidden_size) # They are applied before the attention block and before the feed-forward block, respectively (pre-layer normalisation).
        self.attention = MultiHeadAttention(config)
        self.feed_forward = FeedForward(config)

    def forward(self, x):
        # Apply layer normalization and then copy input into query, key, value
        hidden_state = self.layer_norm_1(x)
        # Apply attention with a skip connection
        x = x + self.attention(hidden_state)
        # Apply feed-forward layer with a skip connection
        x = x + self.feed_forward(self.layer_norm_2(x))
        return x

Query:

        x = x + self.feed_forward(self.layer_norm_2(x))

The line above calls the `FeedForward` class defined earlier. I am confused as to how `forward()` is even invoked here, because the method name is never called explicitly.

---

```python
x = x + self.feed_forward(self.layer_norm_2(x))
```

Here's how the `forward()` method of the `FeedForward` class is implicitly called:

### Understanding PyTorch's Module Behavior

In PyTorch, each class that inherits from `nn.Module` (like your `FeedForward` class) is designed to handle calls to the instance as if they were calls to the `forward()` method. This behavior is facilitated by the `__call__` method that `nn.Module` implements. Here's the basic idea:

1. **Module Class Implementation**: When you create an instance of a class derived from `nn.Module` and pass input to this instance as if it were a function, the `__call__` method of the base `nn.Module` class is triggered.

2. **Calling `__call__` Method**:
   - The `__call__` method in turn calls the `forward()` method of your class.
   - This setup allows you to write classes that encapsulate network layers, and use instances of these classes as callable objects, which internally manage the process of passing inputs through the layers.

### In Your Code

- `self.feed_forward`: This is an instance of the `FeedForward` class.
- When you do `self.feed_forward(...)`, you're actually calling the `__call__` method of the `FeedForward` instance.
- The `__call__` method then calls the `forward()` method of `FeedForward`, passing whatever arguments you've supplied (`self.layer_norm_2(x)` in this case).

### Why Use This Approach?

This design pattern provides a couple of benefits:

- **Clarity and Simplicity**: It allows the definition of models in a way that the actual computational operations (in `forward()`) are cleanly separated from other mechanisms of the class like parameter initialization, gradient handling, etc.
- **Flexibility**: You can add preprocessing or hooks before or after the forward pass without changing how the layer/module is used in the model’s architecture.

### Example Detailed

Given your `FeedForward` class, the `forward()` method is defined to execute these steps:

```python
def forward(self, x):
    x = self.linear_1(x)
    x = self.gelu(x)
    x = self.linear_2(x)
    x = self.dropout(x)
    return x
```

So when you call `self.feed_forward(self.layer_norm_2(x))`, it means:

1. Compute `self.layer_norm_2(x)` to normalize `x`.
2. Pass this normalized `x` to the `FeedForward` instance, which processes it through two linear layers with a GELU activation in between and dropout at the end, as defined in its `forward()` method.
3. The result of `self.feed_forward(...)` is then added back to `x` in the residual connection style typical to transformers, `x = x + self.feed_forward(self.layer_norm_2(x))`.

This pattern is central to the design of modular, reusable components in neural network libraries like PyTorch, making it easier to experiment with complex architectures.
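
The dispatch from `__call__` to `forward()` can be mimicked in a few lines of plain Python. The classes below (`TinyModule`, `Doubler`) are hypothetical and deliberately simplified. The real `nn.Module.__call__` also runs registered hooks and other bookkeeping around the forward pass.

```python
class TinyModule:
    """Simplified stand-in for nn.Module; the real class also runs hooks."""

    def __call__(self, *args, **kwargs):
        # Calling the instance forwards the arguments to forward().
        return self.forward(*args, **kwargs)

    def forward(self, *args, **kwargs):
        raise NotImplementedError


class Doubler(TinyModule):
    def forward(self, x):
        return 2 * x


doubler = Doubler()
result = doubler(21)   # no explicit .forward(...) call needed
print(result)
```

This is also why calling the instance is generally preferred over calling `.forward()` directly in PyTorch: the direct call would skip whatever `__call__` does around it.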

In [None]:
encoder_layer = TransformerEncoderLayer(config)
inputs_embeds.shape, encoder_layer(inputs_embeds).size() # here again forward() is invoked implicitly by calling the module instance with its input

### Positional Embeddings