In [None]:
!pip install -U -q datasets accelerate peft transformers

In [None]:
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer
)

# 1. Load the model from HuggingFace

<span style="font-size: 18px">
In this example, I'll use Qwen/Qwen2.5-0.5B-Instruct for fine-tuning.
</span>

In [None]:
model_id = "Qwen/Qwen2.5-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, 
                                            torch_dtype=torch.bfloat16,
                                            device_map="auto")

In [4]:
config = AutoConfig.from_pretrained(model_id)

In [5]:
config

Qwen2Config {
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "dtype": "bfloat16",
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 896,
  "initializer_range": 0.02,
  "intermediate_size": 4864,
  "layer_types": [
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention"
  ],
  "max_position_embeddings": 32768,
  "max_window_layers": 21,
  "model_type": "qwen2",
  "num_attention_heads": 14,
  "num_hidden_layers": 24,
  "num_key_value_heads": 2,
  "rms_

## 1.1 Model Architecture
<span style="font-size: 18px">

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, Attention QKV bias and tied word embeddings
- Number of Parameters: 0.49B
- Number of Parameters (Non-Embedding): 0.36B
- Number of Layers: 24
- Number of Attention Heads (GQA): 14 for Q and 2 for KV
- Context Length: Full 32,768 tokens and generation 8192 tokens

### 1.1.1 Model Size & Depth:
>**"num_hidden_layers": 24**

>**"hidden_size": 896**

<br>
With 24 transformer layers and a hidden size of 896, the model is moderately deep but narrow, which is what keeps it at roughly 0.49B parameters.
<br>
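
As a rough cross-check, the parameter counts listed in 1.1 can be reproduced from the config values. The sketch below assumes a vocabulary size of 151,936 (the config output above is truncated before that field), biases only on the Q/K/V projections, and tied input/output embeddings, as stated in the architecture summary.

```python
# Values taken from the config printed above
hidden_size = 896
intermediate_size = 4864
num_layers = 24
num_q_heads = 14
num_kv_heads = 2

# Assumption: vocab size is not visible in the truncated config output
vocab_size = 151_936

head_dim = hidden_size // num_q_heads          # 64
kv_dim = num_kv_heads * head_dim               # 128

attention = (
    hidden_size * hidden_size + hidden_size    # q_proj weight + bias
    + 2 * (hidden_size * kv_dim + kv_dim)      # k_proj and v_proj weight + bias
    + hidden_size * hidden_size                # o_proj weight (no bias)
)
mlp = 3 * hidden_size * intermediate_size      # gate, up and down projections
norms = 2 * hidden_size                        # two RMSNorm weights per layer

# All layers plus the final RMSNorm
non_embedding = num_layers * (attention + mlp + norms) + hidden_size
embedding = vocab_size * hidden_size           # counted once (tied embeddings)

print(f"non-embedding: {non_embedding:,}")              # 357,898,112
print(f"total:         {non_embedding + embedding:,}")  # 494,032,768
```

Under these assumptions the totals round to the 0.36B and 0.49B figures quoted above.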

### 1.1.2 Grouped Query Attention:
Grouped Query Attention (GQA) is an optimization technique for transformer models that balances computational efficiency and model performance. Inspired by the multi-head attention mechanism introduced in the seminal "Attention Is All You Need" paper, GQA addresses limitations of its predecessors: multi-head attention (MHA) and multi-query attention (MQA). Below is a short look at its architecture and how it is configured in this model.

<strong>Core Architecture</strong>
<br><br>
<img src="https://media.geeksforgeeks.org/wp-content/uploads/20250626165907521883/file.webp" style="display: block; margin: 0 auto;" width="600px">
<br><br>
GQA divides query heads into G groups, each sharing a single key and value head. This contrasts with:

- MHA: Each query head has its own key/value head (high accuracy, high memory cost).
- MQA: All query heads share one key/value head (lower memory cost, reduced accuracy).

In this model:
<br>
>**"num_attention_heads": 14**

>**"num_key_value_heads": 2**

<br>
This means that we have 14 Q heads and 2 KV heads:
<br>
Group 1:
  Q0, Q1, Q2, Q3, Q4, Q5, Q6 -> Use KV_0 
<br>
Group 2:
  Q7, Q8, Q9, Q10, Q11, Q12, Q13 -> Use KV_1
<br><br>
<strong>Math Formula:</strong>
<br><br>

$$
\begin{aligned}
\text{Attention}(Q_i, K_g, V_g) &= \text{softmax}\left( \frac{Q_i K_g^{T}}{\sqrt{d_k}} \right) V_g
\end{aligned}
$$
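
The grouping above can be sketched in a few lines of NumPy. This is a minimal illustration, not the model's actual implementation: it uses random inputs, omits the causal mask and the output projection, and the function name `grouped_query_attention` is made up for this example. Here query head `i` uses KV head `g = i // 7`.

```python
import numpy as np

def grouped_query_attention(queries, keys, values):
    """queries: (n_q_heads, seq, d); keys/values: (n_kv_heads, seq, d)."""
    n_q_heads, _, d_k = queries.shape
    n_kv_heads = keys.shape[0]
    group_size = n_q_heads // n_kv_heads       # 14 // 2 = 7 query heads per KV head

    # Repeat each KV head so that query head i is paired with KV head i // group_size
    keys = np.repeat(keys, group_size, axis=0)
    values = np.repeat(values, group_size, axis=0)

    # Scaled dot-product scores, shape (n_q_heads, seq, seq)
    scores = queries @ keys.transpose(0, 2, 1) / np.sqrt(d_k)

    # Numerically stable softmax over the key positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

rng = np.random.default_rng(0)
q = rng.normal(size=(14, 5, 64))   # 14 query heads, 5 tokens, head_dim 64
k = rng.normal(size=(2, 5, 64))    # 2 key heads
v = rng.normal(size=(2, 5, 64))    # 2 value heads

out = grouped_query_attention(q, k, v)
print(out.shape)  # (14, 5, 64)
```

Only 2 key/value heads have to be stored in the KV cache instead of 14, which is where the memory saving comes from.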

### 1.1.3 Rotary Positional Embedding (RoPE):

RoPE represents a novel approach in encoding positional information. Traditional methods, either absolute or relative, come with their limitations. Absolute positional embeddings assign a unique vector to each position, which, though straightforward, doesn't scale well and fails to capture relative positions effectively. Relative embeddings, on the other hand, focus on the distance between tokens, enhancing the model's understanding of token relationships but complicating the model architecture.

RoPE ingeniously combines the strengths of both. It encodes positional information in a way that allows the model to understand both the absolute position of tokens and their relative distances. This is achieved through a rotational mechanism, where each position in the sequence is represented by a rotation in the embedding space. The elegance of RoPE lies in its simplicity and efficiency, enabling models to better grasp the nuances of language syntax and semantics.

<strong>The Mechanism of Rotary Positional Embeddings
</strong>
<br><br>
<img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*9T_o7ZbLK4mOSKJb5WNxeg.png" style="display: block; margin: 0 auto;" width="600px">
<br><br>
**RoPE introduces a novel concept**. Instead of adding a positional vector, it applies a rotation to the word vector. Imagine a two-dimensional word vector for “dog.” To encode its position in a sentence, RoPE rotates this vector. The angle of rotation (θ) is proportional to the word's position in the sentence. For instance, the vector is rotated by θ for the first position, 2θ for the second, and so on. This approach has several benefits:

- **Stability of Vectors**: Adding tokens at the end of a sentence doesn't affect the vectors for words at the beginning, facilitating efficient caching.
- **Preservation of Relative Positions**: If two words, say “pig” and “dog,” maintain the same relative distance in different contexts, their vectors are rotated relative to each other by the same amount. This ensures that the angle, and consequently the dot product between these vectors, remains constant.
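
The second property can be checked numerically with a toy two-dimensional example. This is only a sketch: real RoPE rotates many pairs of dimensions with different frequencies, and the angle `theta=0.1` and the vectors below are arbitrary values chosen for illustration.

```python
import math

def rotate(vec, position, theta=0.1):
    """Rotate a 2-D vector by the angle position * theta."""
    angle = position * theta
    x, y = vec
    return (
        x * math.cos(angle) - y * math.sin(angle),
        x * math.sin(angle) + y * math.cos(angle),
    )

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

query, key = (1.0, 2.0), (0.5, -1.0)

# Same relative distance (3 positions apart) at two different absolute offsets
score_near = dot(rotate(query, 5), rotate(key, 2))
score_far = dot(rotate(query, 105), rotate(key, 102))

print(math.isclose(score_near, score_far))  # True
```

The attention score depends only on how far apart the two tokens are, not on where the pair sits in the sequence.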

<strong>Matrix Formulation of RoPE</strong>
<br><br>
<img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*42x3x4KKqSIKoajakjGmLg.png" style="display: block; margin: 0 auto;" width="600px">

### 1.1.4 Swish-Gated Linear Unit (SwiGLU)

<strong>What is swish?</strong>
<br><br>
Swish is a smooth, non-monotonic activation function (one that does not consistently increase or decrease), defined as:

<br><br>
<img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*tDJKko60ciqzXEKs99fC5g.png" style="display: block; margin: 0 auto;" width="600px">
<br><br>
β is a trainable parameter, but most implementations do not use it, setting β = 1 and simplifying the function to swish(x) = x * sigmoid(x), which is equivalent to the Sigmoid Linear Unit (SiLU).
<br><br>
<img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*9ZNwhMAa9Ci36xOmEuMiOg.png" style="display: block; margin: 0 auto;" width="600px">
<br><br>

Swish has been shown to outperform ReLU in many applications. Its main advantage is that it provides a smoother transition around 0, which leads to better optimization and faster convergence.
<br><br>
<strong>What is Gated Linear Unit:</strong><br><br>
Gated Linear Units (GLU) are neural network layers proposed by researchers at Facebook AI Research in 2016. The idea is to compute two linear transformations of the input: one is kept as a plain linear projection, while the other is passed through a sigmoid activation to form a gate, and the two are multiplied element-wise. This is illustrated in the following formula:
<br><br>
<img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/0*9EIG-EIX837FKTM0" style="display: block; margin: 0 auto;" width="600px">
<br><br>
**“The output of each layer is a linear projection x ∗ W + b modulated by the gates σ(x ∗ V + c). Similar to LSTMs, these gates multiply each element of the matrix x ∗ W + b and control the information passed on in the hierarchy.”**
<br><br>
<img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*9Ebd6nY2fvkjzoFdcvhE9g.png" style="display: block; margin: 0 auto;" width="600px">
<br><br>
<strong>What is SwiGLU:</strong>
As we mentioned earlier, SwiGLU is a combination of both Swish and GLU. It is basically a GLU, but instead of using the sigmoid function, we use Swish with β = 1, as illustrated in the following formula:
<br><br>
<img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*s6xTNRLICLmjJ2UwQahegw.png" style="display: block; margin: 0 auto;" width="600px">
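
To make the formulas concrete, here is a small NumPy sketch of Swish and of a SwiGLU feed-forward block in the bias-free gate/up/down layout used by Qwen2-style MLPs. The sizes are toy values and the weights are random; the function and variable names are chosen for this example only.

```python
import numpy as np

def swish(x, beta=1.0):
    """Swish activation; with beta = 1 this is SiLU: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-beta * x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Feed-forward block: down( swish(x @ w_gate) * (x @ w_up) )."""
    gate = swish(x @ w_gate)      # gating branch, Swish instead of sigmoid
    up = x @ w_up                 # plain linear branch
    return (gate * up) @ w_down   # element-wise gating, then project back

rng = np.random.default_rng(0)
hidden, intermediate = 8, 16      # toy sizes; the model above uses 896 and 4864
x = rng.normal(size=(3, hidden))
w_gate = rng.normal(size=(hidden, intermediate))
w_up = rng.normal(size=(hidden, intermediate))
w_down = rng.normal(size=(intermediate, hidden))

y = swiglu_ffn(x, w_gate, w_up, w_down)
print(y.shape)  # (3, 8)

# Non-monotonic: the value at -1 is lower than both the value at -5 and at 0
print(swish(np.array([-5.0, -1.0, 0.0])))
```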

### 1.1.5 Root Mean Square Layer Normalization (RMSNorm):
<strong>What is Layer Norm:</strong>
<br>
Layer Norm formula is defined as:
$$
y = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}
$$
<br><br>
The small quantity $\epsilon$ prevents division by zero. The mean $\mu$ and variance $\sigma^2$ are computed from the input data across the feature dimension.
<br><br>
<strong>What is RMSNorm:</strong><br><br>
Most recent transformer models use RMS Norm instead of LayerNorm. The key difference is that RMS Norm only scales the input without shifting it. The mathematical formulation is:

$$
\text{RMSNorm}(x) = \gamma \odot \frac{x}{\sqrt{\frac{1}{d} \sum_{i=1}^{d} x_i^2 + \epsilon}}
$$
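
The difference between the two formulas is easy to see on a small vector. The sketch below uses plain Python, omits LayerNorm's learnable scale and shift, and defaults RMSNorm's $\gamma$ to all ones; the helper names are made up for this example.

```python
import math

def layer_norm(x, eps=1e-6):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, gamma=None, eps=1e-6):
    """Divide by the root mean square only; no mean subtraction."""
    if gamma is None:
        gamma = [1.0] * len(x)
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gamma, x)]

x = [1.0, 2.0, 3.0, 4.0]
print(layer_norm(x))  # centered around zero
print(rms_norm(x))    # rescaled only: ratios between entries are preserved
```

RMSNorm skips the mean computation, which saves a reduction per call while keeping the scale of the activations under control.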
</span>



# 2. Load the dataset for fine-tuning
<span style="font-size: 18px">
I'll use HuggingFaceH4/MATH for fine-tuning.
</span>

In [8]:
from datasets import load_dataset
 
ds = load_dataset("HuggingFaceH4/MATH", "default", split="train")

README.md: 0.00B [00:00, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/351k [00:00<?, ?B/s]

data/test-00000-of-00001.parquet:   0%|          | 0.00/240k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/746 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/546 [00:00<?, ? examples/s]

In [21]:
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model.config.pad_token_id = tokenizer.pad_token_id
model.config.bos_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id

In [22]:
def format_math_example(example):
    prompt = f"<|im_start|>user\nSolve the following math problem step by step: {example['problem']}<|im_end|>\n<|im_start|>assistant\n{example['solution']}<|im_end|>\n"
    return {"text": prompt}

ds = ds.map(format_math_example)

Map:   0%|          | 0/746 [00:00<?, ? examples/s]

In [27]:
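def tokenize_func(examples):
    # Pad or truncate every example to exactly 512 tokens
    enc = tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512,
    )
    # Causal LM objective: labels are a copy of the input ids (padding positions included)
    enc["labels"] = enc["input_ids"].copy()
    return enc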

tokenized_ds = ds.map(tokenize_func, batched=True, remove_columns=ds.column_names)
tokenized_ds = tokenized_ds.train_test_split(test_size=0.1)

Map:   0%|          | 0/746 [00:00<?, ? examples/s]
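
Because `tokenize_func` pads to `max_length` and copies `input_ids` into `labels` unchanged, the loss is also computed on padding positions. A common alternative is to set those labels to `-100`, the index that PyTorch's cross-entropy loss ignores by default. The helper below is a hypothetical sketch of that masking on a toy batch; it is not what the run in this notebook used, so the losses reported further down come from the unmasked labels.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]

# Toy batch: two sequences padded to length 5 (token ids are made up)
input_ids = [[11, 12, 13, 0, 0], [21, 22, 23, 24, 25]]
attention_mask = [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]

labels = mask_padding_labels(input_ids, attention_mask)
print(labels)  # [[11, 12, 13, -100, -100], [21, 22, 23, 24, 25]]
```

Inside `tokenize_func`, the same helper could be applied to `enc["input_ids"]` and `enc["attention_mask"]`.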

# 3. Using LoRA for fine-tuning
<span style="font-size: 18px">

## 3.1 What is Low-Rank Adaptation (LoRA)?
<strong>Key Features of LoRA</strong>
- **Parameter Efficiency**: It reduces the number of trainable parameters, which lowers memory usage during fine-tuning because gradients and optimizer states are only kept for the small adapter matrices.
- **Computational Efficiency**: The frozen weights need no gradient updates, which reduces the computational workload on GPUs/TPUs and speeds up the fine-tuning process.
- **Preservation of Pre-Trained Knowledge**: The original pre-trained model remains unchanged making it easy to revert to the base model when needed.
- **Scalability**: It can be applied to various transformer-based models like GPT, BERT and T5 making it versatile for different tasks.
- **Faster Fine-Tuning**: By updating fewer parameters, it accelerates the fine-tuning process compared to traditional methods.

<strong>Architecture of LoRA</strong>
LoRA is used with transformer-based models which are common in NLP tasks. Let's see how it works:

- **Pre-Trained Backbone**: We start with a large transformer model like GPT or BERT that has already been trained on a range of data.
- **Low-Rank Adaptation Layers**: It adds small low-rank matrices to the model's attention mechanism. These matrices are the only parts of the model that get updated during fine-tuning.
- **Frozen Original Parameters**: The original weights of the model are kept frozen. This means we don't modify the entire model, just the added low-rank matrices.
- **Task-Specific Fine-Tuning**: We fine-tune the low-rank matrices for the specific task such as sentiment analysis or translation while the rest of the model stays the same.
<br><br>
This approach helps us adapt large models to new tasks without changing the entire structure making it more efficient.
<br><br>

<strong>Working of LoRA</strong>

LoRA modifies the traditional fine-tuning process by introducing low-rank matrices into specific layers of a neural network allowing the model to adapt to new tasks without changing the entire model. Let's see how LoRA works:

1. **Decomposing the Weight Matrix**
Instead of updating the entire weight matrix during fine-tuning, LoRA approximates the weight update with the product of two smaller low-rank matrices A and B. The adapted weight matrix (W') is calculated as:

$$
W' = W + BA
$$
Here W is the original (frozen) weight matrix and A and B are the low-rank matrices: for a W of shape d × k, B has shape d × r and A has shape r × k, with the rank r much smaller than d and k. This decomposition allows the model to make task-specific adjustments without the need to retrain the entire model, drastically reducing the computational load.

2. **Training Only the LoRA Parameters**
During the fine-tuning process, only the low-rank matrices A and B are updated while the original model weights W remain frozen. This minimizes the number of parameters that need to be adjusted making fine-tuning faster and more memory-efficient compared to traditional methods where all model weights are updated.

3. **Inference with Adapted Weights**
After fine-tuning, the adapted weight matrix W′ is used for inference. This lets the model make predictions for the specific task it was fine-tuned on with minimal computational resources. Because BA can be merged into W once training is done, the adapted model runs at the same cost as the base model at inference time.

By using LoRA, we can adapt large pre-trained models to new tasks quickly and efficiently without the computational burden of full model fine-tuning.
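
The three steps above can be sketched with NumPy on a single layer. This is an illustration under simplifying assumptions, not PEFT's implementation: the weights are random, training is faked by overwriting B, and the scaling factor `lora_alpha / r` that PEFT applies to BA is left out. B starts at zero, following the usual LoRA initialization, so the adapted layer initially matches the base layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 896, 896, 16          # a q_proj-sized layer with rank 16

W = rng.normal(size=(d_out, d_in))     # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, r))               # trainable, zero init so W' == W at the start

x = rng.normal(size=(4, d_in))         # a batch of 4 input vectors

# Step 1: before any training the adapter contributes nothing
assert np.allclose(x @ (W + B @ A).T, x @ W.T)

# Step 2: stand-in for training, only B (and A) would change, never W
B = rng.normal(size=(d_out, r)) * 0.01

# During training the frozen path and the low-rank path are computed separately
y_adapter = x @ W.T + (x @ A.T) @ B.T

# Step 3: for inference, merge once into W' = W + BA
W_merged = W + B @ A
y_merged = x @ W_merged.T

print(np.allclose(y_adapter, y_merged))  # True
print(W.size, A.size + B.size)           # 802816 28672
```

For this layer the adapter holds about 3.6% as many values as the full weight matrix.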
</span>

In [28]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=16,
    task_type=TaskType.CAUSAL_LM,
    lora_dropout=0.1,
    lora_alpha=32,
    target_modules=['q_proj', 'v_proj']
)
if hasattr(model, "peft_config"):
    model = model.unload()
model = get_peft_model(model, lora_config)


training_args = TrainingArguments(
    output_dir="./qwen-math-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=2e-4,
    weight_decay=0.01,
    logging_steps=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    report_to="none"
)
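
For the `LoraConfig` above, the number of trainable parameters can be estimated by hand. The sketch below assumes the layer shapes implied by the config printed earlier (24 layers, `q_proj` mapping 896 to 896 and `v_proj` mapping 896 to 128) and that LoRA adds no bias terms; `lora_params` is a helper written for this example.

```python
r = 16
num_layers = 24
hidden_size = 896
kv_dim = 2 * (896 // 14)   # 2 KV heads of dimension 64 -> 128

def lora_params(d_in, d_out, rank):
    """A is (rank x d_in) and B is (d_out x rank)."""
    return rank * (d_in + d_out)

per_layer = (
    lora_params(hidden_size, hidden_size, r)   # q_proj adapter
    + lora_params(hidden_size, kv_dim, r)      # v_proj adapter
)
trainable = num_layers * per_layer
print(f"{trainable:,}")  # 1,081,344
```

That is roughly 0.2% of the model's 0.49B parameters. PEFT's `model.print_trainable_parameters()` reports the actual count for a wrapped model.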



In [29]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds['train'],
    eval_dataset=tokenized_ds['test'],
    processing_class=tokenizer  # this argument was called `tokenizer` in older transformers releases
)



In [30]:
trainer.train()

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.2596 | 0.283799 |
| 2 | 0.2973 | 0.275894 |
| 3 | 0.2682 | 0.275 |


TrainOutput(global_step=504, training_loss=0.35457197423019104, metrics={'train_runtime': 1219.1347, 'train_samples_per_second': 1.651, 'train_steps_per_second': 0.413, 'total_flos': 2219905981218816.0, 'train_loss': 0.35457197423019104, 'epoch': 3.0})

In [35]:
trainer.save_model()

In [37]:
import math

eval_results = trainer.evaluate(eval_dataset=tokenized_ds["test"])
eval_loss = eval_results["eval_loss"]

perplexity = math.exp(eval_loss)
print(f"Perplexity: {perplexity}")

Perplexity: 1.316530761186102
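
Perplexity is the exponential of the mean negative log-likelihood per token, which is why `math.exp(eval_loss)` is enough here. A small sketch with made-up token probabilities shows the relationship; `perplexity` is a helper defined for this example.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Made-up probabilities the model assigns to four target tokens
probs = [0.9, 0.8, 0.7, 0.6]
ppl = perplexity([math.log(p) for p in probs])
print(ppl)

# An eval loss of about 0.275 corresponds to a perplexity of about 1.3165
print(math.exp(0.275))
```

A perplexity of 1.0 would mean every target token was predicted with probability 1.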


In [71]:
from peft import PeftModel
from transformers import pipeline  # needed for the text-generation pipeline below

tokenizer = AutoTokenizer.from_pretrained("./qwen-math-finetuned")
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = PeftModel.from_pretrained(base_model, "./qwen-math-finetuned")

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = """Solve the integral of x^2 from 0 to 1.
Write ONLY the final answer as a fraction on a single line.
Do NOT show intermediate steps."""

out = generator(prompt, max_new_tokens=1000, return_full_text=True)

text = out[0]['generated_text'].strip()

Device set to use cuda:0


Final answer: \]
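
The prompt above is passed to the pipeline as raw text, whereas the training examples were wrapped in `<|im_start|>` / `<|im_end|>` markers by `format_math_example`. One option, sketched below, is to build the inference prompt with the same template and stop right where the assistant's answer should begin. `build_inference_prompt` is a hypothetical helper, and whether it improves the output for this model has not been verified here.

```python
def build_inference_prompt(problem):
    """Mirror the training template, ending where the assistant's answer starts."""
    return (
        "<|im_start|>user\n"
        f"Solve the following math problem step by step: {problem}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt = build_inference_prompt("Solve the integral of x^2 from 0 to 1.")
print(prompt)
```

The resulting string can be passed to `generator(...)` in place of the plain prompt.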


In [74]:
import re

def extract_and_format_final_answer(text):
    pattern = r'\\boxed\{(.*)\}'
    matches = re.findall(pattern, text)
    
    if not matches:
        return ""
        
    final_content = matches[-1]
    
    # Format as LaTeX display math: \[\boxed{...}\]
    formatted = f"\\[\\boxed{{{final_content}}}\\]"
    
    return formatted

result = extract_and_format_final_answer(text)
print(text)

Solve the integral of x^2 from 0 to 1.
Write ONLY the final answer as a fraction on a single line.
Do NOT show intermediate steps. To solve the integral \(\int_0^1 x^2 \, dx\), we can use the power rule for integration, which states that \(\int x^n \, dx = \frac{x^{n+1}}{n+1} + C\) for \(n \neq -1\). Here, we will integrate \(x^2\) directly.

The integral is:
\[
\int_0^1 x^2 \, dx
\]

We apply the power rule by multiplying each term in the parentheses by its exponent and then integrating:

\[
\left[ \frac{x^3}{3} \right]_0^1
\]

Next, we evaluate this at the upper limit (1) and subtract the value at the lower limit (0):

\[
\frac{1^3}{3} - \frac{0^3}{3}
\]

This simplifies to:

\[
\frac{1}{3} - 0 = \frac{1}{3}
\]

Therefore, the final answer is:

\[
\boxed{\frac{1}{3}}
\]
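
The greedy `(.*)` in `extract_and_format_final_answer` works for the output above because `\boxed{...}` ends its line, but it would capture too much if further braces followed on the same line. A brace-counting variant avoids that. `extract_last_boxed` is a hypothetical helper, sketched here as one possible alternative.

```python
def extract_last_boxed(text):
    """Return the contents of the last \\boxed{...} in text, honoring nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return ""
    content_start = start + len(marker)
    depth = 1
    i = content_start
    # Walk forward until the brace opened by \boxed{ is closed again
    while i < len(text) and depth > 0:
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
        i += 1
    if depth != 0:
        return ""  # unbalanced braces
    return text[content_start:i - 1]

sample = r"Therefore \boxed{\frac{1}{3}} is the value of \int_{0}^{1} x^2 dx."
print(extract_last_boxed(sample))  # \frac{1}{3}
```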
