In [1]:
from src import load_qwen_model
from src import forwards_pass_flops
from src import model_generation_flops, model_training_flops, model_evaluation_flops

In [2]:
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
model, tokeniser, device = load_qwen_model(model_name)

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.


# Included in this Notebook
- An outline of the Qwen + LoRA architecture
- A summary of the mathematical approach taken
- Examples from the defined function used to calculate flops

# Breakdown of Qwen Architecture and FLOPs Calculation

## With a FLOPs budget plan at the end

In [3]:
print(model)

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 896)
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear(in_features=896, out_features=896, bias=True)
          (k_proj): Linear(in_features=896, out_features=128, bias=True)
          (v_proj): Linear(in_features=896, out_features=128, bias=True)
          (o_proj): Linear(in_features=896, out_features=896, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=896, out_features=4864, bias=False)
          (up_proj): Linear(in_features=896, out_features=4864, bias=False)
          (down_proj): Linear(in_features=4864, out_features=896, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((896,), eps=1e-06)
    (rotary_emb): Qwen2RotaryEmbe

## The Architecture:

    - Token to Embedding Layer (no compute)
    - 24 Transformer Layers:
            - RMSNorm
            - Rotary Positional Embedding (Query and Key Heads)
            - Grouped Query Attention/ Multi Head Attention (14 Query Heads, 2 Key Heads, 2 Value Heads)
                    - Query, Key, Value Heads + LoRA Rank Adaptation if included
                    - Attention Mechanism
                    - Softmax
                    - Self Attention Mechanism (Values * Softmax)
                    - Concatenation
                    - Linear Transformation (Mixing)
            - Residual Connection
            - RMSNorm
            - MLP (Projection Up, SwiGLU Activation Function, Projection Down)
            - Residual Connection
    - RMSNorm
    - Embedding to Vocabulary Linear Layer
----
For probabilistic generation:

    - Softmax over Vocabulary Space

![Architecture of Qwen](diagrams/QwenModel.jpg)

- Note: the rotary positional embeddings are applied to the query and key projections inside each attention block, not beforehand; the diagram shows them only as markers.

## Mathematical Calculations

The majority of the computations in the following analysis can be grouped as **matrix multiplications**. 

For a matrix multiplication of dimensions:  
**(D, S) × (S, R) → (D, R)**

The number of floating-point operations required are:

- **Multiplications**:  
  $ D \times S \times R $

- **Additions**:  
  $ D \times (S - 1) \times R $

> Each output element in the resulting matrix requires `S` multiplications and `S - 1` additions.
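
The rule above can be written as a small helper. This is a minimal sketch: `matmul_flops` is a hypothetical name used for illustration and is not one of the functions imported from `src`.

```python
def matmul_flops(d: int, s: int, r: int) -> tuple[int, int]:
    """Return (multiplications, additions) for a (d, s) x (s, r) matrix product."""
    multiplications = d * s * r        # s multiplications per output element
    additions = d * (s - 1) * r        # s - 1 additions per output element
    return multiplications, additions


# Example: a (512, 896) x (896, 896) product, the shape of the query
# projection for 512 tokens (bias not included here).
mults, adds = matmul_flops(512, 896, 896)
print(f"multiplications={mults:,}  additions={adds:,}")
```

Keeping multiplications and additions separate mirrors the columns of the summary tables printed later in the notebook.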

# Qwen (Without LoRA)
- The additional LoRA structure is introduced later.

---

## Before the Transformer Layers
### Convert Tokens to Embedding:
- This is treated as a memory operation in the forward pass: it is simply a lookup from token ID to embedding vector (although the mapping itself is adjusted during the backward pass).
- It does not add to the FLOPs budget, given the assumption that the backward pass costs 2x the forward pass.

---

## Transformer Layer: 24 of them
### RMS Norm
- RMSNorm works on each token's embedding vector: every element is squared, the squares are averaged along the embedding dimension, and the square root of that mean gives the RMS.
- Each input element is then divided by the RMS of its embedding vector, and a scaling constant (the layer's weight) is then applied.
- This is applied before and after self attention in each layer, as well as once again after all 24 transformer layers.

The **Root Mean Square (RMS)** is defined as:


$$ \text{RMS}(\mathbf{a}) = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} a_i^2 } $$

Each normalised element is:

$$ \bar{a}_i = \frac{a_i}{\text{RMS}(\mathbf{a})} $$
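
As a numerical illustration of the two formulas above, here is a minimal sketch in plain Python. The name `rms_norm` is hypothetical, the scaling weight is left out, and adding `eps` under the square root is an assumption about how the `eps=1e-06` shown in the printed model is used.

```python
import math


def rms_norm(vector: list[float], eps: float = 1e-6) -> list[float]:
    """Divide every element by the RMS of the vector (no scaling weight)."""
    mean_square = sum(a * a for a in vector) / len(vector)  # n mults, n - 1 adds, 1 division
    rms = math.sqrt(mean_square + eps)                      # 1 square root (eps placement assumed)
    return [a / rms for a in vector]                        # n divisions


normalised = rms_norm([3.0, 4.0], eps=0.0)
print(normalised)
```

After normalisation the RMS of the output vector is 1 (up to the effect of `eps`), which is what keeps the scale of the embeddings stable from layer to layer.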

---

## Multi Head (Grouped Query) Self Attention 

| Section   |   No Heads |
|:----------|-----------:|
| Query     |         14 |
| Key       |          2 |
| Value     |          2 |

- Qwen uses grouped query attention, which has a different number of query heads from key and value heads. For this calculation, however, it is assumed to behave like standard Multi Head Attention.
- We use 14 query heads, each of shape **(No. Tokens, Embedding Dimension / No. Heads)**, i.e. a head dimension of 896 / 14 = 64.
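
The head sizes can be checked against the printed model with a few lines of arithmetic. This is only a sketch; the variable names are my own and the numbers are read off the `print(model)` output and the table above.

```python
embed_dim = 896        # in_features of q_proj / k_proj / v_proj
n_query_heads = 14
n_kv_heads = 2

head_dim = embed_dim // n_query_heads             # width of one head
kv_proj_width = n_kv_heads * head_dim             # width of the real k_proj / v_proj output
queries_per_kv_head = n_query_heads // n_kv_heads # query heads sharing one key/value head

print(head_dim, kv_proj_width, queries_per_kv_head)
```

The 128-wide `k_proj` and `v_proj` in the printed model are exactly two heads of width 64, which is where the multi head simplification used in this notebook departs from the real layer shapes.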

### Rotary Positional Embeddings
- Qwen does not add standard positional embeddings onto the embedding space; instead it applies rotary positional embeddings to the query and key matrices in every self attention block.
- This acts as a multiplication by a rotation matrix, which encodes relative position information while **preserving distance metrics**.
- The count does not include the FLOPs of building the rotation matrix (as instructed) but does include its application to the original space.
- It is treated as a simple matrix multiplication on the overall **global** query and key matrices. It can equally be thought of as applied within each head, which would give the same FLOPs.
- A function is provided for applying it to one matrix, and it is used twice per self attention block (query and key).

We define the rotary embedding transformation as:

$$
f_{\{q,k\}}(x_m, m) = R_{\Theta, m}^d \, W_{\{q,k\}} x_m
$$
where the rotation matrix $ R_{\Theta, m}^d $ is:

$$
R_{\Theta, m}^d =
\begin{pmatrix}
\cos(m\theta_1) & -\sin(m\theta_1) & 0 & 0 & \cdots & 0 & 0 \\
\sin(m\theta_1) & \cos(m\theta_1)  & 0 & 0 & \cdots & 0 & 0 \\
0 & 0 & \cos(m\theta_2) & -\sin(m\theta_2) & \cdots & 0 & 0 \\
0 & 0 & \sin(m\theta_2) & \cos(m\theta_2)  & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & \cos(m\theta_{d/2}) & -\sin(m\theta_{d/2}) \\
0 & 0 & 0 & 0 & \cdots & \sin(m\theta_{d/2}) & \cos(m\theta_{d/2})
\end{pmatrix}
$$

- For robustness, this calculation includes the FLOPs of the multiplications by 0 in the rotation matrix (a dense matrix multiplication), although the actual Qwen implementation may simplify this.
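
To give a feel for how conservative the dense treatment is, the sketch below compares it with a count that only touches the non-zero 2x2 blocks of the rotation matrix. Both function names are hypothetical, and the pairwise version is an illustration of the possible simplification, not a description of what Qwen actually does.

```python
def rope_dense_flops(no_tokens: int, dim: int) -> tuple[int, int]:
    """(multiplications, additions) when the rotation is a dense (N, d) x (d, d) product."""
    return no_tokens * dim * dim, no_tokens * (dim - 1) * dim


def rope_pairwise_flops(no_tokens: int, dim: int) -> tuple[int, int]:
    """(multiplications, additions) when only the 2x2 rotation blocks are applied."""
    pairs = dim // 2
    # Each pair (x1, x2) becomes (x1*cos - x2*sin, x1*sin + x2*cos):
    # 4 multiplications and 2 additions/subtractions.
    return 4 * pairs * no_tokens, 2 * pairs * no_tokens


print(rope_dense_flops(512, 896))
print(rope_pairwise_flops(512, 896))
```

The dense count is several hundred times larger, so including the multiplications by zero gives a clear upper bound on this step.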

### Query, Key, Value Matrices
- Qwen's use of grouped query attention may mean the true FLOPs differ slightly from what is calculated here: with only 2 key and 2 value heads, the key and value projections are narrower and are effectively reused across the query heads.
- Here we assume standard multi head attention, which makes calculating the query, key and value projections a straightforward matrix multiplication.
- Note that Qwen uses bias terms in these projections, and this has been accounted for.

### Attention (inc softmax)
- The attention scores are broken down across the 14 heads, each as a matrix multiplication.
- We also account for scaling by the square root of the head dimension. This is counted as a single square root, since the result can be stored in memory, followed by a division over every entry of the attention scores.
- The softmax is applied to the whole attention matrix: an exponential on every entry of the (no_tokens, no_tokens) matrix, followed by a sum along one dimension and finally a division. This is done across all 14 heads.
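
The three bullets above translate into the following raw operation counts. This is a sketch under the multi head assumption (14 heads of width 64); `attention_score_op_counts` is a hypothetical name, and the counts are unweighted, i.e. every operation type is simply counted once.

```python
def attention_score_op_counts(no_tokens: int, head_dim: int = 64, n_heads: int = 14) -> dict[str, int]:
    """Raw operation counts for Q K^T, scaling and softmax over all heads."""
    n = no_tokens
    per_head = {
        # Q K^T: (n, head_dim) x (head_dim, n)
        "multiplications": n * head_dim * n,
        # Q K^T additions plus the row sums of the softmax
        "additions": n * (head_dim - 1) * n + n * (n - 1),
        # one division for the scaling and one for the softmax, per entry
        "divisions": 2 * n * n,
        # one exponential per entry of the score matrix
        "exponentiations": n * n,
    }
    counts = {name: value * n_heads for name, value in per_head.items()}
    counts["square_roots"] = 1  # sqrt(head_dim) is computed once and reused
    return counts


print(attention_score_op_counts(512))
```

The quadratic `n * n` terms are what make the attention cost grow faster than the projections as the context gets longer.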

### Self Attention (Attention * Values)
- This is a simple matrix multiplication: (No. Tokens, No. Tokens) x (No. Tokens, Dimension of Value Head).
- Within the masked self-attention operation, the upper triangular portion of the attention score matrix is masked out, as tokens are only allowed to attend to previous positions in the sequence.
- Whether the masked multiplications actually occur or are skipped as a memory operation depends on the inner workings of the implementation; to provide a conservative estimate, the full cost is carried forward.
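
To put a number on how conservative the full count is, the sketch below compares the number of unmasked entries in a causal score matrix with the full matrix. `causal_entries` is a hypothetical helper written for this illustration.

```python
def causal_entries(no_tokens: int) -> int:
    """Entries on or below the diagonal of a (no_tokens, no_tokens) matrix."""
    return no_tokens * (no_tokens + 1) // 2


n = 512
full = n * n
kept = causal_entries(n)
print(f"full={full:,}  unmasked={kept:,}  fraction={kept / full:.3f}")
```

Roughly half of the entries are masked, so carrying the full cost forward overestimates this particular step by about a factor of two.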

### Concatenation
- Concatenating the results of each head is treated as a memory operation and thus does not require FLOPs.

### Linear Transformation (Mixing)
- Shown as `(o_proj): Linear(in_features=896, out_features=896, bias=False)` in the printed model, this is simply a linear transformation applied after the head concatenation in order to mix the information from each head.
- A simple matrix multiplication by an (embedding dim, embedding dim) matrix.

---

## Post Self Attention

### Residual Connection
- Residual connections in Qwen allow the embedding to carry forward information from earlier layers, preserving gradient flow and enabling deeper architectures.
- This is a simple element-wise addition over a matrix of shape (no tokens, embedding dim), combining the input and output of a sub-layer

### RMS Norm
- Another RMS norm is applied here, identical to the last.

---

## MLP (Feed Forward with swiGLU activation)
- This calculation is performed over two functions: `mlp_flops` and `swiglu_flops`.
- This combination of MLP and SwiGLU can be thought of in two parts.
- The base part of the MLP is a projection up from the embedding dimension to the dimension of the MLP hidden layer (4864), and then a projection back down. Note that none of these projections have biases in the Qwen model.

$$
\text{SwiGLU}(x, W, V, b, c, \beta) = \text{Swish}_\beta(xW + b) \odot (xV + c)
$$

where:
- $ x $ is the input,
- $ W, V $ are projection weight matrices,
- $ b, c $ are biases (0 in the qwen model)
- $ \odot $ denotes element-wise multiplication,
- $ \text{Swish}_\beta(z) = \frac{z}{1 + e^{-\beta z}} $ is the generalized Swish activation.

- I like to think of this as two up projection matrices (biases zero) giving two hidden layers: Swish is applied to one of them, and the two are then multiplied element-wise to give a single hidden layer. This is then passed into the down projection.
- The functions and calculations are split accordingly:
- `mlp_flops`:
    - Calculates the FLOPs of the two up projections into the hidden space (which are later combined into one) and of the one down projection
- `swiglu_flops`:
    - Calculates the FLOPs of applying Swish to one of the hidden layers and then the element-wise multiplication
    - Assumes the negation in the exponential has no cost
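
The element-wise part described above can be sketched as follows. The names `swish` and `swiglu_op_counts` are hypothetical and this is not the notebook's `swiglu_flops`; it assumes $\beta = 1$ (so no multiplication by $\beta$ is counted) and gives raw, unweighted counts.

```python
import math


def swish(z: float, beta: float = 1.0) -> float:
    """Swish activation z / (1 + exp(-beta * z)); fine for moderate z."""
    return z / (1.0 + math.exp(-beta * z))


def swiglu_op_counts(no_tokens: int, hidden_dim: int = 4864) -> dict[str, int]:
    """Raw element-wise operation counts for Swish(gate) * up, with beta = 1."""
    elements = no_tokens * hidden_dim
    return {
        "exponentiations": elements,  # exp(-z), negation treated as free
        "additions": elements,        # 1 + exp(-z)
        "divisions": elements,        # z / (1 + exp(-z))
        "multiplications": elements,  # element-wise product with the other hidden layer
    }


print(swish(1.0))
print(swiglu_op_counts(512))
```

Each of these counts is one operation per hidden element, which is tiny next to the three projection matrix multiplications handled by the MLP part.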

### Residual Connection
- Another Residual Connection is applied here, identical to the last.

---

## Outside of transformer

### RMS Norm
- A final RMS norm is applied here, identical to the last.

### Vocabulary Projection
- Finally, there is a **linear projection (including bias)** that maps the model's output from the embedding space to the **vocabulary space**.
- This step is used to produce **logits** for each token in the vocabulary, which represent unnormalized probabilities.
- Specifically, we project from the **embedding dimension** (e.g., 896 in Qwen) to the **vocabulary size** (e.g., 151,936 tokens).

---

### **During training, evaluation and greedy (deterministic) sampling this is where the model stops**
- The cross entropy loss is calculated outside of the model from the logits, and we do not account for these FLOPs.
- Greedy sampling simply takes the vocabulary token with the highest logit.


--- 

## For Probabilistic (Stochastic) Generation
- In order for the model to generate from a probability distribution, the logits must be normalised. This involves applying a softmax over the vocabulary dimension of the final position and sampling from the result. It is only applied in the `model_generation_flops` function later if `randomness = True`.


![Architecture of Qwen](diagrams/Lora_view.png)

## Addition of LoRA Ranks
- In this project we apply LoRA to the query projection and value projection layers of the attention mechanism.
- It can be thought of as a separate channel of matrix multiplication (2 steps) whose result is then added to the output of the original projection.
- In this implementation, we treat the query and value projections as global matrices across the multiple self attention heads, so LoRA is applied to the full matrix at once. This has the same computational cost as applying it to each head individually and summing the FLOPs across heads.

### Broken Down:
- Matrix Multiplication 1: Input x LoRA 1 - (N_tokens, Embed dim) x (Embed dim, LoRA_Ranks)
- Matrix Multiplication 2: Result x LoRA 2 - (N_tokens, LoRA_Ranks) x (LoRA_Ranks, Embed dim)
- Addition: The result is summed on top of the output of the frozen weight matrix
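
The three steps can be counted with the sketch below. The function names are hypothetical and this is not the code in `src`. Two assumptions are mine: the value projection is treated as (Embed dim, Embed dim), following the multi head simplification used throughout, and one extra element-wise multiplication is counted for scaling the LoRA output (the usual alpha / rank factor). With both in place the totals are consistent with the "LoRA cost in this Layer" rows printed later for ranks 1 and 4.

```python
def lora_projection_flops(no_tokens: int, rank: int, embed_dim: int = 896,
                          scale_output: bool = True) -> tuple[int, int]:
    """(multiplications, additions) for one LoRA-adapted projection."""
    if rank == 0:
        return 0, 0
    n, d, r = no_tokens, embed_dim, rank
    mults = n * d * r              # step 1: (n, d) x (d, r)
    adds = n * (d - 1) * r
    mults += n * r * d             # step 2: (n, r) x (r, d)
    adds += n * (r - 1) * d
    adds += n * d                  # step 3: add onto the frozen projection's output
    if scale_output:
        mults += n * d             # assumed alpha / rank scaling of the LoRA output
    return mults, adds


def lora_layer_flops(no_tokens: int, rank: int) -> tuple[int, int]:
    """LoRA cost for one transformer layer: query and value projections."""
    mults, adds = lora_projection_flops(no_tokens, rank)
    return 2 * mults, 2 * adds


for rank in (1, 4):
    mults, adds = lora_layer_flops(512, rank)
    print(f"rank={rank}: multiplications={mults:.3g} additions={adds:.3g} total={mults + adds:.3g}")
```

Because the rank is tiny compared with the embedding dimension, both LoRA matrix multiplications are cheap next to the 896 x 896 frozen projection they sit beside.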




# Final Aggregation
- The sections below explain how these individual functions are aggregated into the cost of a forward pass, generation, training and evaluation.

# Single Forwards Pass

- Aggregates all of the functions above into a single forward pass of the full architecture, accepting LoRA ranks (a LoRA rank of 0 adds no computational cost).
- It does not include the cost of generation (i.e. the final softmax), as this is implemented in a later function. A few examples are given below to show that LoRA's additional computational cost is minimal.

### Change with LoRA: its cost is three to four orders of magnitude lower than the total

### No LoRA

In [4]:
total_forward_pass, total_lora = forwards_pass_flops(no_tokens = 512, lora_ranks = 0, print_summary = True)

Operation Cost                     Additions          Multiplications    Divisions          Exponentiations    Square Roots       Total             
Single Attention Block             2.93e+09           2.94e+09           7.34e+06           3.67e+07           10                5.91e+09          
Single MLP Block                   6.69e+09           6.7e+09            2.49e+06           2.49e+07           0                 1.34e+10          
RMS, Residual etc                  1.83e+06           1.84e+06           9.19e+05           0                  1.02e+04          4.6e+06           
Single Transformer Layer           9.63e+09           9.63e+09           1.07e+07           6.16e+07           1.02e+04          1.93e+10          
LoRA cost in this Layer            0                  0                  0                  0                  0                 0                 
Full Forward Pass                  3.01e+11           3.01e+11           2.58e+08           1.48e+09           

### LoRA: Rank 1

In [5]:
total_forward_pass, total_lora = forwards_pass_flops(no_tokens = 512, lora_ranks = 1, print_summary = True)

Operation Cost                     Additions          Multiplications    Divisions          Exponentiations    Square Roots       Total             
Single Attention Block             2.93e+09           2.94e+09           7.34e+06           3.67e+07           10                5.91e+09          
Single MLP Block                   6.69e+09           6.7e+09            2.49e+06           2.49e+07           0                 1.34e+10          
RMS, Residual etc                  1.83e+06           1.84e+06           9.19e+05           0                  1.02e+04          4.6e+06           
Single Transformer Layer           9.63e+09           9.64e+09           1.07e+07           6.16e+07           1.02e+04          1.93e+10          
LoRA cost in this Layer            1.83e+06           2.75e+06           0                  0                  0                 4.59e+06          
Full Forward Pass                  3.01e+11           3.01e+11           2.58e+08           1.48e+09           

### LoRA: Rank 4

In [13]:
total_forward_pass, total_lora = forwards_pass_flops(no_tokens = 512, lora_ranks = 4, print_summary = True)

Operation Cost                     Additions          Multiplications    Divisions          Exponentiations    Square Roots       Total             
Single Attention Block             2.93e+09           2.94e+09           7.34e+06           3.67e+07           10                5.91e+09          
Single MLP Block                   6.69e+09           6.7e+09            2.49e+06           2.49e+07           0                 1.34e+10          
RMS, Residual etc                  1.83e+06           1.84e+06           9.19e+05           0                  1.02e+04          4.6e+06           
Single Transformer Layer           9.63e+09           9.64e+09           1.07e+07           6.16e+07           1.02e+04          1.93e+10          
LoRA cost in this Layer            7.34e+06           8.26e+06           0                  0                  0                 1.56e+07          
Full Forward Pass                  3.01e+11           3.01e+11           2.58e+08           1.48e+09           

# Generating from Model:
- This is very computationally expensive
- If we request multiple tokens as output, the model must **generate tokens autoregressively** — meaning it generates one token at a time.
- At each step, the model:
  1. Takes in the current input sequence (starting with the original prompt),
  2. Produces logits over the vocabulary,
  3. Selects or samples the next token (based on decoding strategy),
  4. Appends the new token to the input and **feeds it back into the model** to generate the next.

- So if I give it a context of 512 tokens and ask for 512 more, it has to:
    1. Run model with context 512
    2. Run model with context 513
    3. Run model with context 514 
    4. ...
    5. Run model with Context 1023


Additionally, if we are using stochastic sampling, a final softmax must be applied to the output at each iteration.

These calculations are included in `model_generation_flops`.
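
The loop described above can be sketched as follows, assuming the full sequence is re-run at every step exactly as in the numbered list. `generation_flops` and `forward_cost` are hypothetical names: `forward_cost` stands in for any function that returns the cost of one forward pass at a given context length.

```python
from typing import Callable


def generation_flops(tokens_given: int, tokens_generated: int,
                     forward_cost: Callable[[int], float]) -> float:
    """Sum the cost of one full forward pass per generated token."""
    total = 0.0
    # Contexts run from tokens_given up to tokens_given + tokens_generated - 1,
    # e.g. 512, 513, ..., 1023 for 512 given and 512 generated.
    for context_length in range(tokens_given, tokens_given + tokens_generated):
        total += forward_cost(context_length)
    return total


# Toy cost model (cost = context length) just to show the summation.
print(generation_flops(512, 512, forward_cost=lambda n: float(n)))
```

Since every pass is longer than the one before, the total grows faster than linearly in the number of generated tokens, which is why generation dominates the budget so quickly.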

## NOTE:
- In this example, running a context of 512 to generate 512 tokens costs 4.7317e+14 FLOPs, which is about 0.47% of our computational budget and so not feasible to do often.


### Greedy sampling, No LoRA

In [7]:
total_flops, total_lora_flops = model_generation_flops(tokens_given = 900, tokens_generated = 294, lora_ranks = 0, randomness = False) 

Total FLOPs for generating 294 tokens: 3.7761e+14
Total FLOPs from LoRA adaptation: 0
Percentage of Total FLOPs Budget:   0.37761 %


In [8]:
total_flops, total_lora_flops = model_generation_flops(tokens_given = 512, tokens_generated = 512, lora_ranks = 0, randomness = False)

Total FLOPs for generating 512 tokens: 4.7317e+14
Total FLOPs from LoRA adaptation: 0
Percentage of Total FLOPs Budget:   0.47317 %


### Greedy sampling, LoRA = 5 (Small addition)

In [9]:
total_flops, total_lora_flops = model_generation_flops(tokens_given = 512, tokens_generated = 512, lora_ranks = 5, randomness = False)

Total FLOPs for generating 512 tokens: 4.7353e+14
Total FLOPs from LoRA adaptation: 3.5481e+11
Percentage of Total FLOPs Budget:   0.47353 %


### Stochastic sampling, No LoRA (Negligible difference)

In [10]:
total_flops, total_lora_flops = model_generation_flops(tokens_given = 512, tokens_generated = 512, lora_ranks = 0, randomness = True)

Total FLOPs for generating 512 tokens: 4.7317e+14
Total FLOPs from LoRA adaptation: 0
Percentage of Total FLOPs Budget:   0.47317 %


# Training the Model:
- In training we simply use the assumption that the backward pass costs 2x the forward pass.
- We also do not account for the cross-entropy loss calculation, which is assumed to be outside the model as per the instructions.
- Therefore the model is treated as training on the logits (vocabulary space).
- A given batch size and number of training steps are accounted for through simple multiplication.

## Note:

This implies the budget allows about 13,800 training steps (context 512, batch size 4) in our model calculations, far smaller than the suggested run times.
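
The step limit follows from simple arithmetic, sketched below. `training_flops` is a hypothetical helper, not the notebook's `model_training_flops`. The budget of 1e17 FLOPs is inferred from the printed percentages (7.2455e12 FLOPs is reported as 0.0072455 % of the budget), and the per-step cost is taken from the single training step printed below.

```python
def training_flops(forward_flops: float, batch_size: int, num_steps: int,
                   backward_multiplier: float = 2.0) -> float:
    """Forward pass plus a backward pass costing backward_multiplier x forward."""
    return forward_flops * (1.0 + backward_multiplier) * batch_size * num_steps


budget = 1e17                 # inferred from the printed percentages (assumption)
per_step = 7.2455e12          # one step: context 512, batch size 4, LoRA rank 4
max_steps = int(budget // per_step)
print(max_steps)
```

Rounding down to 13,800 steps leaves a small margin under the budget, which matches the 99.988 % reported in the limit test.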


### Single Training Step

In [14]:
total_flops, total_lora_flops = model_training_flops(no_tokens = 512, lora_ranks = 4, batch_size = 4, num_steps_training = 1)

Total FLOPs for training: 7.2455e+12
Total FLOPs from LoRA adaptation: 4.4909e+09
Percentage of Total FLOPs Budget:   0.0072455 %


### Testing for the training limit

In [18]:
total_flops, total_lora_flops = model_training_flops(no_tokens = 512, lora_ranks = 4, batch_size = 4, num_steps_training = 13800)

Total FLOPs for training: 9.9988e+16
Total FLOPs from LoRA adaptation: 6.1975e+13
Percentage of Total FLOPs Budget:   99.988 %


# Evaluate the Model:

- When evaluating the model internally, we rely on the cross entropy loss, so we are simply running the forward pass for a given number of batches.

In [12]:
evaluation_flops, evaluation_lora_flops = model_evaluation_flops(no_tokens = 512, lora_ranks = 0, batch_size = 100)

Total FLOPs for evaluation: 6.0342e+13
Total FLOPs from LoRA adaptation: 0
Percentage of Total FLOPs Budget:   0.060342 %
