# Transformer Neural Networks - Understanding

## 1. Overall Transformer Neural Networks Architecture

![Transformer Architect Image](./img/transformerarchitect.png)

The figure above shows the overall architecture of the Transformer. The detailed diagram can be hard to follow at first, so the next figure gives a higher-level view.

![Transformer Architect Image](./img/transformerblock.png)

1. Input/Output Pre-processing

Token Embeddings
- **Function**: Converts each token in the input sequence into a dense vector of fixed size (commonly 512 dimensions).
- **Implementation**: An embedding layer maps each token to a dense vector.

Positional Encodings
- **Function**: Adds information about the position of each token in the sequence to the token embeddings, since the Transformer architecture does not inherently capture sequence order.
- **Implementation**: Sinusoidal functions are used to generate positional encodings, which are then added to the token embeddings.
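
As a minimal sketch of the two pre-processing steps above, the snippet below embeds a few token ids and adds sinusoidal positional encodings. The vocabulary size, the token ids, and the helper name `sinusoidal_positional_encoding` are assumptions made for illustration only.

```python
import math
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration
vocab_size, d_model, max_len = 1000, 512, 50

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) tensor of sinusoidal position encodings."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    # One frequency per pair of dimensions: 1 / 10000^(2i / d_model)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions use cosine
    return pe

embedding = nn.Embedding(vocab_size, d_model)
token_ids = torch.tensor([[5, 42, 7, 99, 3]])  # (batch=1, seq_len=5), made-up ids
x = embedding(token_ids)                        # (1, 5, 512) token embeddings
# Add the encoding of positions 0..4; broadcasting handles the batch dimension
x = x + sinusoidal_positional_encoding(max_len, d_model)[: token_ids.size(1)]
print(x.shape)  # torch.Size([1, 5, 512])
```

Because the encoding is added rather than concatenated, the model dimension stays at `d_model`.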

2. Encoder

Processes the input sequence with a stack of identical layers. Each layer consists of:

Multi-Head Self-Attention
- **Function**: Allows each token to attend to all other tokens in the sequence, capturing dependencies regardless of their distance in the sequence.
- **Implementation**: Multiple attention heads operate in parallel to learn different aspects of the input.

Add & Norm
- **Function**: Adds the input of each sub-layer to its output (residual connection) and applies layer normalization to stabilize and speed up training.
- **Implementation**: Addition followed by normalization.

Feed-Forward
- **Function**: Applies a fully connected feed-forward network to each position independently and identically.
- **Implementation**: Two linear transformations with a ReLU activation in between.
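
The three encoder sub-blocks above can be sketched as a single layer. This is an illustrative sketch, not a reference implementation: the class name `EncoderLayerSketch` and the small sizes are assumptions, dropout is omitted, and PyTorch's built-in `nn.MultiheadAttention` stands in for a hand-written attention block.

```python
import torch
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    """One encoder layer: self-attention, Add & Norm, feed-forward, Add & Norm."""

    def __init__(self, d_model: int = 64, num_heads: int = 4, d_ff: int = 256):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        # Two linear transformations with a ReLU in between, applied per position
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention: queries, keys and values all come from x
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)     # Add & Norm (residual connection)
        x = self.norm2(x + self.ffn(x))  # Feed-forward, then Add & Norm
        return x

layer = EncoderLayerSketch()
x = torch.randn(2, 5, 64)  # (batch, seq_len, d_model), random stand-in input
out = layer(x)
print(out.shape)  # torch.Size([2, 5, 64])
```

The output has the same shape as the input, which is what allows identical layers to be stacked.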

3. Decoder

Generates the output sequence from the encoded input using a stack of identical layers. Each layer consists of:

Masked Multi-Head Self-Attention
- **Function**: Prevents attending to future tokens in the sequence during training (autoregressive property).
- **Implementation**: Similar to the encoder's self-attention but with a mask to prevent future token attention.

Multi-Head Attention over Encoder’s Output
- **Function**: Allows each position in the decoder to attend to all positions in the encoder's output.
- **Implementation**: Standard multi-head attention mechanism applied to the encoder’s output.

Add & Norm
- **Function**: Similar to the encoder's Add & Norm, it adds residual connections and normalizes the output.
- **Implementation**: Addition followed by normalization.

Feed-Forward
- **Function**: Similar to the encoder's feed-forward network.
- **Implementation**: Two linear transformations with a ReLU activation in between.
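
A decoder layer can be sketched the same way. The class name `DecoderLayerSketch`, the sizes, and the use of `nn.MultiheadAttention` are assumptions for illustration, and dropout is again left out. The check at the end changes only the last target token and confirms that earlier positions are unaffected, which is what the mask is meant to guarantee.

```python
import torch
import torch.nn as nn

class DecoderLayerSketch(nn.Module):
    """One decoder layer: masked self-attention, cross-attention, feed-forward."""

    def __init__(self, d_model: int = 64, num_heads: int = 4, d_ff: int = 256):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, tgt: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        seq_len = tgt.size(1)
        # True marks positions that must NOT be attended to (the future)
        future = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=tgt.device),
            diagonal=1,
        )
        attn_out, _ = self.self_attn(tgt, tgt, tgt, attn_mask=future)
        x = self.norm1(tgt + attn_out)
        # Queries come from the decoder; keys and values from the encoder output
        attn_out, _ = self.cross_attn(x, memory, memory)
        x = self.norm2(x + attn_out)
        return self.norm3(x + self.ffn(x))

torch.manual_seed(0)
layer = DecoderLayerSketch().eval()
memory = torch.randn(2, 7, 64)  # stand-in for the encoder output
tgt = torch.randn(2, 5, 64)     # stand-in for the embedded target sequence
with torch.no_grad():
    out = layer(tgt, memory)
    tgt_changed = tgt.clone()
    tgt_changed[:, -1] = torch.randn(2, 64)  # alter only the last token
    out_changed = layer(tgt_changed, memory)
print(out.shape)  # torch.Size([2, 5, 64])
print(torch.allclose(out[:, :-1], out_changed[:, :-1], atol=1e-5))
```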

4. Output Post-processing

**Function**
- Transforms the decoder’s output into probabilities over the vocabulary.

**Implementation**
- A linear layer followed by a softmax function.
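
The output step can be sketched as follows. The vocabulary size and the random tensor standing in for the decoder output are assumptions for illustration, and greedy `argmax` is only one possible way to pick the next token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes, chosen only for illustration
d_model, vocab_size = 512, 1000

projection = nn.Linear(d_model, vocab_size)  # maps each position to vocabulary logits
decoder_out = torch.randn(1, 5, d_model)     # stand-in for the real decoder output

logits = projection(decoder_out)             # (1, 5, vocab_size)
probs = F.softmax(logits, dim=-1)            # probabilities over the vocabulary
next_token = probs[:, -1].argmax(dim=-1)     # greedy choice at the last position

print(probs.shape)  # torch.Size([1, 5, 1000])
```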

## 2. Detailed architecture

In [19]:
# import library
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import warnings

In [3]:
warnings.filterwarnings('ignore')

In [4]:
if torch.cuda.is_available():
    print("CUDA is available. PyTorch is using GPU.")
    print("Number of GPUs available: ", torch.cuda.device_count())
    print("GPU name: ", torch.cuda.get_device_name(0))
else:
    print("CUDA is not available. PyTorch is using CPU.")

CUDA is available. PyTorch is using GPU.
Number of GPUs available:  1
GPU name:  NVIDIA GeForce GTX 1650


### 2.1. Input Pre-processing

### 2.2. Encoder

#### 2.2.1. Multi-Head Attention Block

![Encoder](./img/multihead-attention.png)

The inputs are:
> - Query: What are we looking for?
> - Key: What can we offer?
> - Value: What do we actually offer?

The output is:
> - New Value: a context-aware representation of each token

In [9]:
# Create random Query, Key and Value matrices for a 5-token example sequence: "I love you so much"
sequence_length, k_dim, v_dim = 5, 10, 10
q = torch.randn(sequence_length, k_dim)
k = torch.randn(sequence_length, k_dim)
v = torch.randn(sequence_length, v_dim)

In [10]:
print(q)
print(k)
print(v)

tensor([[-0.9396, -0.7610, -0.7723,  0.2179,  0.3140, -0.2625, -0.8789,  0.6244,
         -0.5245,  1.2776],
        [-0.7042,  0.1270,  1.6283, -1.5839,  1.0607, -0.1741, -0.5154,  1.0230,
         -0.0220,  1.5466],
        [ 2.4340,  3.1458, -1.2979, -1.6841, -0.2796,  1.7375, -0.3538,  0.6227,
         -0.9432,  0.9826],
        [-0.4292,  0.4479,  0.2516, -0.1197,  0.3054, -0.7416,  1.0121, -1.2222,
         -0.2376, -2.1837],
        [-0.6403,  0.3252,  0.0717,  0.9909, -0.8243, -0.3065, -0.0571, -0.5463,
         -0.5385,  0.3860]])
tensor([[ 0.2530,  0.7966, -0.2162, -1.2275, -0.7208,  0.2060,  0.6973,  1.0014,
         -0.1323, -0.5996],
        [-1.8674,  0.7221,  0.1546, -1.6387, -1.7548,  0.6751, -0.9566,  1.4463,
         -1.7166,  0.9057],
        [ 0.2133,  0.4005, -0.7649, -1.1516, -0.1740,  0.8551, -0.5812,  0.7501,
         -0.3315,  0.2571],
        [ 0.3450, -0.1597, -1.0515,  0.2874, -1.3148,  0.5086, -0.1085,  1.4433,
         -0.5419, -1.5695],
        [ 0.3310, 

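- Scaled Dot-Product Attention:
Function:

$$
\text{self attention} = \text{softmax}\bigg(\frac{QK^T}{\sqrt{d_k}}+M\bigg)
$$

$$
\text{new V} = \text{self attention} \cdot V
$$

Here $d_k$ is the Key dimension and $M$ is the mask: 0 for positions that may be attended to and $-\infty$ for masked positions. The encoder does not need a mask.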

Q, K, V -(1)-> Matmul(Q, K.T) -(2)-> Scale -(3)-> Masking (not required in the encoder) -(4)-> Softmax -(5)-> Matmul(., V) -> new V
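
The five steps can also be collected into one function. This is a sketch under the same assumptions as the walkthrough (a single head, no batch dimension required); the function name and the `causal` flag are introduced here for illustration.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, causal=False):
    """Return (new_v, attention_weights) for one attention head."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1)   # (1) Matmul(Q, K.T)
    scores = scores / math.sqrt(d_k)   # (2) Scale
    if causal:                         # (3) Masking, decoder only
        len_q, len_k = scores.shape[-2:]
        allowed = torch.tril(torch.ones(len_q, len_k, dtype=torch.bool))
        scores = scores.masked_fill(~allowed, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # (4) Softmax over the keys
    return weights @ v, weights              # (5) Matmul(., V)

torch.manual_seed(0)
q, k, v = torch.randn(5, 10), torch.randn(5, 10), torch.randn(5, 10)
new_v, weights = scaled_dot_product_attention(q, k, v, causal=True)
print(new_v.shape)  # torch.Size([5, 10])
```

The cells that follow go through the same steps one at a time so that each intermediate tensor can be inspected.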

In [15]:
# Step 1: calculate Q.K_t
step1 = torch.matmul(q,k.t())
step1

tensor([[-1.9091,  3.8016,  1.0370, -0.5988,  0.4307],
        [ 0.4555,  5.6862,  1.6175, -4.7966, -2.3024],
        [ 5.9415,  5.6973,  7.4837,  2.3756, -6.1123],
        [ 0.7904, -3.9823, -2.6418,  0.3848,  0.0512],
        [-1.3507,  1.5958, -1.4197, -0.2317,  2.0358]])

In [16]:
# Step 2: Scale
# Check variance
q.var(), k.var(), step1.var()

(tensor(1.0842), tensor(0.8988), tensor(11.5111))

As we can see, the variance of matmul(Q, K.T) is far larger than the variances of Q and K (about 11.5 versus about 1). We scale the scores back down so that the softmax function does not saturate and still produces a useful probability distribution.

In [21]:
step2 = step1 / math.sqrt(k_dim)
step2

tensor([[-0.6037,  1.2022,  0.3279, -0.1894,  0.1362],
        [ 0.1440,  1.7981,  0.5115, -1.5168, -0.7281],
        [ 1.8789,  1.8016,  2.3665,  0.7512, -1.9329],
        [ 0.2500, -1.2593, -0.8354,  0.1217,  0.0162],
        [-0.4271,  0.5046, -0.4489, -0.0733,  0.6438]])

But **why** divide by the **Key dimension**, and **why** take its **square root**?

If the components of Q and K are independent with zero mean and unit variance, each score is a sum of k_dim products, so its variance is about k_dim. That matches the measurement above: about 11.5 for k_dim = 10. Dividing by the square root of k_dim brings the variance back to about 1.

This matters because the scores are passed through the softmax function to turn them into a probability distribution. Large scores push softmax toward a nearly one-hot output with very small gradients, so keeping the scores in a moderate range improves training stability and performance.
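
The effect of the scaling is easier to see with more samples than the 5-token example provides. The sketch below uses assumed sizes (`n = 1000` vectors, `d_k = 64`) and random standard-normal Q and K.

```python
import math
import torch

torch.manual_seed(0)
n, d_k = 1000, 64        # hypothetical sizes, chosen only for illustration
q = torch.randn(n, d_k)  # zero mean, unit variance components
k = torch.randn(n, d_k)

scores = q @ k.t()                    # each entry sums d_k products
scaled = scores / math.sqrt(d_k)      # divide by sqrt(d_k)

print(scores.var().item())  # close to d_k
print(scaled.var().item())  # close to 1
```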

In [25]:
# Step 3: Masking
# Not required in the encoder block, but required in the decoder block.
# Masking ensures that the current token cannot take context from future (not yet generated) tokens. That would be cheating!
mask = torch.tril(torch.ones(sequence_length, sequence_length))
mask

tensor([[1., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1.]])

In [26]:
# Apply the mask: zero out the scores above the diagonal (future positions)
step3 = torch.tril(step2)
step3

tensor([[-0.6037,  0.0000,  0.0000,  0.0000,  0.0000],
        [ 0.1440,  1.7981,  0.0000,  0.0000,  0.0000],
        [ 1.8789,  1.8016,  2.3665,  0.0000,  0.0000],
        [ 0.2500, -1.2593, -0.8354,  0.1217,  0.0000],
        [-0.4271,  0.5046, -0.4489, -0.0733,  0.6438]])

In [28]:
# Mask by position rather than by value, so a genuine score of exactly 0 is never masked
step3 = step2.masked_fill(mask == 0, float('-inf'))
step3

tensor([[-0.6037,    -inf,    -inf,    -inf,    -inf],
        [ 0.1440,  1.7981,    -inf,    -inf,    -inf],
        [ 1.8789,  1.8016,  2.3665,    -inf,    -inf],
        [ 0.2500, -1.2593, -0.8354,  0.1217,    -inf],
        [-0.4271,  0.5046, -0.4489, -0.0733,  0.6438]])
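
One caveat when building the masked scores: zeroing the upper triangle and then replacing every 0 with '-inf' would also mask a genuine score that happens to be exactly 0. The toy scores below are made up to show the difference between masking by value and masking by position.

```python
import torch

# Made-up scores; the 0.0 at [1, 0] is a real score, not a masked position
scores = torch.tensor([[1.0, 2.0],
                       [0.0, 3.0]])
allowed = torch.tril(torch.ones(2, 2, dtype=torch.bool))

# Masking by value: every zero becomes -inf, including the real one
by_value = torch.tril(scores)
by_value = torch.where(by_value == 0, torch.tensor(float("-inf")), by_value)

# Masking by position: only entries above the diagonal become -inf
by_position = scores.masked_fill(~allowed, float("-inf"))

print(by_value)
print(by_position)
```

With random floating-point scores an exact 0 is unlikely, but masking by position avoids the problem entirely.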

**Why** '-inf'?

When the masked scores go through the softmax function, $e^{x}$ goes to 0 as x goes to '-inf', so masked positions receive zero weight. If the masked entries were left at 0 instead, $e^{0} = 1$ and the future tokens would still receive a non-zero weight.

$$
\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}
$$
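
A small made-up example shows the difference between masking with 0 and masking with '-inf'. Only the first position is meant to be visible.

```python
import torch

# Made-up scores: position 0 is visible, positions 1 and 2 should be hidden
masked_with_zero = torch.tensor([2.0, 0.0, 0.0])
masked_with_neg_inf = torch.tensor([2.0, float("-inf"), float("-inf")])

# exp(0) = 1, so the "hidden" positions still get some probability
weights_zero = torch.softmax(masked_with_zero, dim=-1)
# exp(-inf) = 0, so the hidden positions get exactly zero probability
weights_neg_inf = torch.softmax(masked_with_neg_inf, dim=-1)

print(weights_zero)
print(weights_neg_inf)  # tensor([1., 0., 0.])
```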

In [42]:
# Step 4: Softmax
# Row-wise softmax: exp(x) / sum(exp(x)) along the last dimension
step4 = F.softmax(step3, dim=-1)
step4

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1606, 0.8394, 0.0000, 0.0000, 0.0000],
        [0.2814, 0.2604, 0.4582, 0.0000, 0.0000],
        [0.4101, 0.0907, 0.1385, 0.3607, 0.0000],
        [0.1129, 0.2866, 0.1104, 0.1608, 0.3294]])

In [46]:
# Step 5: matmul(., V), a weighted sum of the value vectors for each token
step5 = torch.matmul(step4, v)
step5

tensor([[ 0.3206, -0.8451, -1.4655,  0.0448, -0.7504, -0.4821, -0.3502,  1.9636,
          0.7334,  0.6107],
        [-0.1972,  0.3817, -0.3529, -0.9692, -0.2507, -1.0954,  0.6383, -0.7904,
          0.6558, -1.1193],
        [ 0.4236, -0.0417, -0.5607, -0.8724,  0.1027, -0.2031,  0.2102,  0.6222,
          0.5862, -0.7802],
        [ 0.5033,  0.4360, -0.7411, -0.0136,  0.0680, -0.3787,  0.3576,  0.3756,
          0.3203, -0.0997],
        [ 0.5721,  0.3924, -0.0542, -0.1984,  0.5691, -0.4914,  0.4664, -0.1302,
          0.4292, -0.6773]])
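
Two properties of the masked result can be checked directly. Each row of the attention weights sums to 1, and the first token can only attend to itself, so its weights are [1, 0, 0, 0, 0] and its new value equals its original value vector. The sketch below re-creates random tensors with an assumed seed, so the numbers differ from the outputs above, but both properties still hold.

```python
import math
import torch

torch.manual_seed(0)  # assumed seed; values differ from the cells above
seq_len, d_k, d_v = 5, 10, 10
q = torch.randn(seq_len, d_k)
k = torch.randn(seq_len, d_k)
v = torch.randn(seq_len, d_v)

scores = (q @ k.t()) / math.sqrt(d_k)
allowed = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
weights = torch.softmax(scores.masked_fill(~allowed, float("-inf")), dim=-1)
new_v = weights @ v

# Every row is a probability distribution over the visible tokens
print(torch.allclose(weights.sum(dim=-1), torch.ones(seq_len)))  # True
# The first token sees only itself, so its new value is its own value vector
print(torch.allclose(new_v[0], v[0]))  # True
```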