# Before You Start [Learn About LoRA]

> Large language models are large, and it can be expensive to update all model weights during training due to GPU memory limitations.

* **Problem**

>For example, suppose we have an LLM with 7B parameters represented in a weight matrix `W`. (In reality, the model parameters are, of course, distributed across different matrices in many layers, but for simplicity, we refer to a single weight matrix here.) **During backpropagation**, we learn a `ΔW` matrix, which contains information on how much we want to update the original weights to minimize the loss function during training.
* The weight update is then as follows:
`W_updated = W + ΔW`
* If the weight matrix `W` contains **7B parameters**, then the weight update matrix `ΔW` also contains **7B parameters**, and computing the matrix `ΔW` can be very compute- and memory-intensive.
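
To put the 7B figure in perspective, here is a back-of-the-envelope sketch. It assumes every value is stored as a 32-bit float (4 bytes) and counts only `W` and a same-sized `ΔW`; optimizer state and activations would add to this. The helper name `tensor_gigabytes` is illustrative, not part of any library.

```python
BYTES_PER_FP32 = 4  # assumption: parameters are stored as 32-bit floats


def tensor_gigabytes(num_params: int, bytes_per_param: int = BYTES_PER_FP32) -> float:
    """Memory needed to hold `num_params` values, in gigabytes (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


num_params = 7_000_000_000                  # the 7B-parameter example from the text
w_gb = tensor_gigabytes(num_params)         # the weight matrix W
delta_w_gb = tensor_gigabytes(num_params)   # a full-size update matrix delta_W

print(f"W:       {w_gb:.0f} GB")
print(f"delta_W: {delta_w_gb:.0f} GB")
print(f"both:    {w_gb + delta_w_gb:.0f} GB")
```

Under these assumptions the update matrix alone needs as much memory as the model itself, which is the cost LoRA is designed to avoid.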

* **Solution: Low-Rank Adaptation (LoRA)**

>To make LoRA easier to understand, let’s take a simple example:
1. Suppose we have model parameters represented by a `W (10x10)` matrix.
![image](https://i.postimg.cc/7LtmYJ1H/lora1.png)

>2. We can come up with two smaller matrices which, when multiplied, produce a 10×10 matrix, for example `W(10x10)=A(10,r)*B(r,10)`.
![image](https://i.postimg.cc/3Ry9yr9g/lora2.png)

> 3. This is a major efficiency win when `r` is small: instead of storing **100 weights (10x10)**, we only store **2*(10*r) weights** (for example, 40 weights with `r=2`).
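
The three steps above can be sketched in plain Python. This is a minimal illustration, assuming random values and an arbitrary rank `r = 2`; the helpers `matmul` and `random_matrix` are written here for the example and are not library functions.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
            for row in a]


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix with values drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)   # fixed seed so the run is repeatable
n, r = 10, 2             # r = 2 is an arbitrary illustrative rank

A = random_matrix(n, r, rng)   # shape (10, r)
B = random_matrix(r, n, rng)   # shape (r, 10)
W = matmul(A, B)               # shape (10, 10)

full_params = n * n            # entries in the 10x10 matrix
factored_params = 2 * n * r    # entries in A plus entries in B

print(f"W is {len(W)}x{len(W[0])}")
print(f"full: {full_params} weights, factored: {factored_params} weights")
```

Note that the factored form is only smaller when `2 * 10 * r < 100`, that is, when `r < 5` for this toy size.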

>LoRA applies this idea to the weight changes: it decomposes the update into a lower-rank representation, `ΔW=A*B`, trains only `A` and `B`, and keeps `W` frozen.
![image](https://i.postimg.cc/YqNRLsNy/lora3.png)

> The image below shows the difference between full fine-tuning and fine-tuning with LoRA.
![image](https://i.postimg.cc/QtRmcLnv/lora4.png)

**How much memory does this save?**

>It depends on the rank `r`, which is a **hyperparameter**. For example, if `ΔW` has 10,000 rows and 20,000 columns, it stores `200,000,000` parameters. If we choose A and B with r=8, then A has 10,000 rows and 8 columns, and B has 8 rows and 20,000 columns. That's 10,000×8 + 8×20,000 = `240,000` parameters, which is about **830× fewer than 200,000,000**.
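
The arithmetic above is easy to reproduce for other ranks. The sketch below uses an illustrative helper, `lora_param_counts` (not a library function), and the same 10,000 × 20,000 shape as the example.

```python
def lora_param_counts(rows: int, cols: int, r: int):
    """Return (full, low_rank) parameter counts for a rows x cols update matrix."""
    full = rows * cols                # entries in the full update matrix
    low_rank = rows * r + r * cols    # entries in A (rows x r) plus B (r x cols)
    return full, low_rank


full, low_rank = lora_param_counts(10_000, 20_000, r=8)
print(f"full: {full:,}  low-rank: {low_rank:,}  ratio: {full / low_rank:.0f}x")

# How the saving changes with the rank hyperparameter.
for rank in (4, 8, 16, 64):
    f, l = lora_param_counts(10_000, 20_000, rank)
    print(f"r={rank:>2}: {l:>9,} parameters ({f / l:.0f}x fewer)")
```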

**Will A and B capture all the information that ΔW could capture?**

> No. **`A` and `B` can't capture all the information that `ΔW` could capture**, but this is by design. When using LoRA, we hypothesize that the model requires `W` to be a large matrix with full rank to capture all the knowledge in the pretraining dataset. However, when we fine-tune an LLM, we don't need to update all the weights; we can capture the core information for the adaptation in a smaller number of weights than `ΔW` would have. Hence, we use the low-rank updates via `AB`.

**Which parameters will we target with LoRA?**

>You can target any of the linear layers in the model architecture. In our use case, we will target only the Query and Value weight matrices (`q_proj` and `v_proj`) in each transformer layer to reduce memory requirements.

**Scaling Coefficient**


```
scaling = alpha / r
weight += (lora_B @ lora_A) * scaling
```
* Choosing **alpha as two times r** is a common rule of thumb when using LoRA for LLMs.
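
As a worked instance of the scaling rule, here is a tiny plain-Python sketch. It assumes the shape convention where `lora_A` is `(r, in_features)` and `lora_B` is `(out_features, r)`, so that `lora_B @ lora_A` matches the shape of `weight`. The numbers and the helper `merge_lora` are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(row[k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for row in a]


def merge_lora(weight, lora_A, lora_B, alpha, r):
    """Return weight + (lora_B @ lora_A) * (alpha / r) as a new matrix."""
    scaling = alpha / r
    delta = matmul(lora_B, lora_A)   # (out, r) @ (r, in) -> (out, in)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


r, alpha = 1, 2                      # alpha = 2 * r, the rule of thumb from the text
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]           # shape (out=2, in=3)
lora_A = [[1.0, 2.0, 3.0]]           # shape (r=1, in=3)
lora_B = [[0.5],
          [-1.0]]                    # shape (out=2, r=1)

merged = merge_lora(weight, lora_A, lora_B, alpha, r)
for row in merged:
    print(row)
```

Because the update is multiplied by `alpha / r`, doubling `alpha` at a fixed `r` doubles the contribution of the adapter relative to the frozen weights.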


# Install dependencies 📚

We need multiple libraries:

- `peft` for LoRA adapters
- `transformers` for loading the model
- `datasets` for loading and using the fine-tuning dataset
- `trl` for the trainer class

In [None]:
! uv pip install -U datasets trl peft transformers -q

# Load Dataset

> We will use the [`HackAI-2025/Darija_SFT_Dataset`](https://huggingface.co/datasets/HackAI-2025/Darija_SFT_Dataset) dataset to fine-tune [`atlasia/Al-Atlas-0.5B`](https://huggingface.co/atlasia/Al-Atlas-0.5B), or any other model of your choice.

* **SFT Dataset Example**

1. Instructions

<center>

![image](https://i.postimg.cc/hvTrrCkv/lora5.png)

</center>

2. Conversations

<center>

![image](https://i.postimg.cc/tRx2pTsb/lora6.png)
</center>

In [None]:
# hf login
from huggingface_hub import login
login()

In [None]:
from datasets import load_dataset

In [None]:
dataset=load_dataset("HackAI-2025/Darija_SFT_Dataset",split="train")
dataset

Dataset({
    features: ['id', 'conversation', 'source', 'topic'],
    num_rows: 201
})

In [None]:
# show some examples from ds
dataset.to_pandas().head()

| | id | conversation | source | topic |
|---|---|---|---|---|
| 0 | 0 | `[{'content': 'السلام لباس؟', 'role': 'user'}, ...` | Manually generated | Travel |
| 1 | 1 | `[{'content': 'اهلا شنو سميتك؟', 'role': 'user'...` | Manually generated | Language |
| 2 | 2 | `[{'content': 'أهلا شنو سميتك؟', 'role': 'user'...` | Manually generated | Chit-chat/Games/Humor |
| 3 | 3 | `[{'content': 'عافاك شحال من مدينة كاينة فالمغر...` | Manually generated | Geography |
| 4 | 4 | `[{'content': 'عافاك شحال من مدينة كاينة فالمغر...` | Manually generated | Geography |


In [None]:
# remove other columns and rename conversation to messages
dataset=dataset.select_columns("conversation").rename_column("conversation","messages")
dataset

Dataset({
    features: ['messages'],
    num_rows: 201
})

In [None]:
# show the first example of messages
from pprint import pprint # pprint for pretty print
pprint(dataset["messages"][0])

[{'content': 'السلام لباس؟', 'role': 'user'},
 {'content': 'لاباس الحمد لله، كاين شي حاجا بغيتي نعاونك فيها؟',
  'role': 'assistant'},
 {'content': 'اه عافاك بغيت نسافر فالمغرب فالصيف ولكن معرفتش فين نمشي. ممكن '
             'تعاوني؟',
  'role': 'user'},
 {'content': 'بلان كاين بزاف ديال البلايص اللي تقد تمشي ليهم فالمغرب، انا '
             'كنقترح عليك هدو:\n'
             '\n'
             '- شفشاون: هدي مدينة فالجبل، الديور ديالها زرقين او الجو فالمدينة '
             'كيجيب الراحة.\n'
             '- الصويرة: هاد المدينة فيها البحر الا فيك ميعوم. البحر ديالها '
             'زوين او فيها المدينة القديمة.\n'
             '- الداخلة: الداخلة هي مدينة فالصحرا ديال المغرب، حتاهيا فيها '
             'البحر. الناس كيجيو ليه من العالم كامل باش يلعبوا السبور.\n'
             '- مراكش: هاد المدينة عزيزة على السياح لكيجيو من برا. فيها جامع '
             'الفنا، المدينة القديمة ولكن فالصيف دايرة بحال الفران.\n'
             '- شلالات أوزود: هاد الشلالات كاينين فالجبل دالأطلس، هادوا اشهر '

# Load Model/Tokenizer

In [None]:
from transformers import AutoTokenizer,AutoModelForCausalLM
import torch
# select gpu if available
device="cuda" if torch.cuda.is_available() else "cpu"
model_id="atlasia/Al-Atlas-0.5B"

tokenizer=AutoTokenizer.from_pretrained(model_id)
model=AutoModelForCausalLM.from_pretrained(model_id).to(device)

## Model Chat Template [Test]

Instruction fine-tuning involves training a model on a dataset where the input-output pairs, like the conversations in our dataset, are explicitly provided. There are various methods to format these entries for LLMs.

<center>

![image](https://i.postimg.cc/J4CKnLXk/lora7.png)
</center>

* Comparison of prompt styles for instruction fine-tuning in LLMs. The Alpaca style (left) uses a structured format with defined sections for instruction, input, and response, while the Phi-3 style (right) employs a simpler format with designated `<|user|>` and `<|assistant|>` tokens.
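
The tokenizer used in this notebook produces a third layout, built from `<|im_start|>` and `<|im_end|>` markers. The sketch below is a hand-written approximation of that layout, inferred from the tokenizer output shown in this notebook; it is not the tokenizer's actual template. The function `to_chatml` and the English sample messages are hypothetical.

```python
DEFAULT_SYSTEM = "You are a helpful assistant."  # system line seen in the tokenizer output


def to_chatml(messages, add_generation_prompt=False):
    """Render a list of {'role', 'content'} dicts in a ChatML-like layout."""
    parts = []
    # Assumption: a default system message is inserted when none is provided.
    if not messages or messages[0]["role"] != "system":
        parts.append(f"<|im_start|>system\n{DEFAULT_SYSTEM}<|im_end|>\n")
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues as the assistant.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


sample = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there"},
]
print(to_chatml(sample))
```

In practice, `tokenizer.apply_chat_template` should be preferred, since it applies the exact template shipped with the model.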

In [None]:
# test tokenizer chat template
result=tokenizer.apply_chat_template(dataset["messages"][0],tokenize=False)
pprint(result)

('<|im_start|>system\n'
 'You are a helpful assistant.<|im_end|>\n'
 '<|im_start|>user\n'
 'السلام لباس؟<|im_end|>\n'
 '<|im_start|>assistant\n'
 'لاباس الحمد لله، كاين شي حاجا بغيتي نعاونك فيها؟<|im_end|>\n'
 '<|im_start|>user\n'
 'اه عافاك بغيت نسافر فالمغرب فالصيف ولكن معرفتش فين نمشي. ممكن '
 'تعاوني؟<|im_end|>\n'
 '<|im_start|>assistant\n'
 'بلان كاين بزاف ديال البلايص اللي تقد تمشي ليهم فالمغرب، انا كنقترح عليك '
 'هدو:\n'
 '\n'
 '- شفشاون: هدي مدينة فالجبل، الديور ديالها زرقين او الجو فالمدينة كيجيب '
 'الراحة.\n'
 '- الصويرة: هاد المدينة فيها البحر الا فيك ميعوم. البحر ديالها زوين او فيها '
 'المدينة القديمة.\n'
 '- الداخلة: الداخلة هي مدينة فالصحرا ديال المغرب، حتاهيا فيها البحر. الناس '
 'كيجيو ليه من العالم كامل باش يلعبوا السبور.\n'
 '- مراكش: هاد المدينة عزيزة على السياح لكيجيو من برا. فيها جامع الفنا، '
 'المدينة القديمة ولكن فالصيف دايرة بحال الفران.\n'
 '- شلالات أوزود: هاد الشلالات كاينين فالجبل دالأطلس، هادوا اشهر الشلالات '
 'فالمغرب سير تمنضر فيهوم معا راسك راه ايعج

## Test Model Before SFT

In [None]:
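# Generate a response before fine-tuning.
# Note: the template is applied without `add_generation_prompt=True`,
# so the prompt does not end with the assistant header.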
prompt="السلام لباس؟"
messages=[{"role":"user","content":prompt}]
formatted_prompt=tokenizer.apply_chat_template(messages,tokenize=False)
ids=tokenizer(formatted_prompt,return_tensors="pt").to(device)
output_ids=model.generate(**ids,max_new_tokens=120)
output=tokenizer.decode(output_ids[0],skip_special_tokens=True)
print(output)

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


system
You are a helpful assistant.
user
السلام لباس؟
السلام عليكم، كيف داير؟ binge
الحمد لله، شكرا على السؤال. binge
الحمد لله، شكرا على السؤال. binge
الحمد لله، شكرا على السؤال. binge
الحمد لله، شكرا على السؤال. binge
الحمد لله، شكرا على السؤال. binge
الحمد لله، شكرا على السؤال. binge
الحمد لله، شكرا على السؤال. binge
الحمد لله، شكرا على السؤال. binge
الحمد لله


# SFT + LoRA

## Show Model Architecture

In [None]:
# show model architecture to select which layer we will apply lora
model

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 896)
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear(in_features=896, out_features=896, bias=True)
          (k_proj): Linear(in_features=896, out_features=128, bias=True)
          (v_proj): Linear(in_features=896, out_features=128, bias=True)
          (o_proj): Linear(in_features=896, out_features=896, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=896, out_features=4864, bias=False)
          (up_proj): Linear(in_features=896, out_features=4864, bias=False)
          (down_proj): Linear(in_features=4864, out_features=896, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((896,), eps=1e-06)
    (rotary_emb): Qwen2RotaryEmbe

## Set LoRA Config

* `r`: The rank of the low-rank matrices. Increasing this value makes the matrices larger, which means less compression but more representational power. Values typically range between 4 and 64.

* `lora_alpha`: Controls the amount of change that is added to the original weights. In essence, it balances the knowledge of the original model with that of the new task. A rule of thumb is to choose a value twice the size of r.

* `target_modules`: Controls which layers receive LoRA adapters. You can choose to skip some layers, such as specific projection layers. Targeting fewer layers can speed up training but may reduce performance, and vice versa.


In [None]:
from peft import LoraConfig
lora_config=LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], # apply lora only on q_proj and v_proj
    bias="none",
)
lora_config

LoraConfig(task_type=None, peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, inference_mode=False, r=8, target_modules={'v_proj', 'q_proj'}, exclude_modules=None, lora_alpha=16, lora_dropout=0.05, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', trainable_token_indices=None, loftq_config={}, eva_config=None, corda_config=None, use_dora=False, layer_replication=None, runtime_config=LoraRuntimeConfig(ephemeral_gpu_offload=False), lora_bias=False)

## Set Training Args

* **What is gradient accumulation?**

> Gradient accumulation is a way to virtually increase the batch size during training, which is very useful when the available GPU memory is insufficient to accommodate the desired batch size. In gradient accumulation, gradients are computed for smaller batches and accumulated (usually summed or averaged) over multiple iterations instead of updating the model weights after every batch. Once the accumulated gradients reach the target “virtual” batch size, the model weights are updated with the accumulated gradients.

<center>

![image](https://i.postimg.cc/x10Rv3tj/lora10.png)
</center>
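
To make the idea concrete, here is a minimal sketch of gradient accumulation on a one-parameter model `y = w * x` with a mean-squared-error loss. The data, the learning rate, and the helper `mse_grad` are made up for illustration. The micro-batch size of 2 and the 4 accumulation steps mirror the training arguments used in this notebook.

```python
def mse_grad(w, batch):
    """Gradient of mean((w*x - y)**2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# Toy (x, y) pairs; values are arbitrary.
data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5),
        (5.0, 9.5), (6.0, 12.5), (7.0, 13.5), (8.0, 16.5)]

w = 0.5
micro_batch_size = 2
accumulation_steps = 4   # virtual batch size = 2 * 4 = 8

# Reference: the gradient of one large batch of 8 examples.
full_grad = mse_grad(w, data)

# Accumulate averaged gradients over 4 micro-batches of 2 examples each.
accumulated = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    micro_batch = data[start:start + micro_batch_size]
    accumulated += mse_grad(w, micro_batch) / accumulation_steps

# Only now do we update the weight, once per virtual batch.
learning_rate = 0.01
w_new = w - learning_rate * accumulated

print(f"full-batch gradient:  {full_grad:.6f}")
print(f"accumulated gradient: {accumulated:.6f}")
```

With equal-sized micro-batches, the averaged accumulated gradient matches the large-batch gradient, while only one micro-batch has to fit in memory at a time.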


In [None]:
from transformers import TrainingArguments

In [None]:
args=TrainingArguments(
    output_dir="alatlas_instruct_lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-4,
    warmup_steps=100,
    lr_scheduler_type="cosine",
    num_train_epochs=4,
    bf16=True,
    save_total_limit=2,
    save_steps=100,
    logging_steps=10,
    report_to="wandb",
    # Do not hard-code an access token here; authentication comes from `login()` above.
)
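
To see how `learning_rate`, `warmup_steps`, and `lr_scheduler_type="cosine"` interact, the sketch below re-implements a linear-warmup plus cosine-decay rule in plain Python. It is an approximation written for illustration, not the library's own scheduler. The total of 100 steps is an assumption taken from the `global_step=100` reported in the training output of this notebook.

```python
import math


def cosine_with_warmup(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear warmup, then half-cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


base_lr = 5e-4       # learning_rate from the arguments above
warmup_steps = 100   # warmup_steps from the arguments above
total_steps = 100    # assumption: taken from the recorded training output

for step in (0, 10, 50, 99):
    lr = cosine_with_warmup(step, base_lr, warmup_steps, total_steps)
    print(f"step {step:>3}: lr = {lr:.6f}")
```

Under these assumptions the warmup covers the entire run, so the learning rate is still ramping up when training stops. A smaller `warmup_steps` value would leave room for the cosine decay phase.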

## SFT Trainer

In [None]:
from trl import SFTConfig,SFTTrainer
sft_trainer=SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    args=args
)

No label_names provided for model class `PeftModel`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [None]:
sft_trainer.get_num_trainable_parameters()

540672
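
The count above can be checked by hand from the architecture printout and the LoRA config: each adapted `Linear` layer gains an `r × in_features` matrix and an `out_features × r` matrix. The helper `lora_params` below is illustrative, and the shapes are copied from the model summary in this notebook.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter on a Linear layer."""
    return r * in_features + out_features * r


r = 8             # rank from the LoRA config
num_layers = 24   # number of decoder layers in the model summary

# (in_features, out_features) of the targeted projections.
targets = {
    "q_proj": (896, 896),
    "v_proj": (896, 128),
}

per_layer = sum(lora_params(i, o, r) for i, o in targets.values())
total = per_layer * num_layers

print(f"per layer: {per_layer:,}")
print(f"total:     {total:,}")
```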

## Start Training

In [None]:
sft_trainer.train()





| Step | Training Loss |
|---|---|
| 10 | 2.9589 |
| 20 | 3.1044 |
| 30 | 3.1289 |
| 40 | 2.6197 |
| 50 | 2.4993 |
| 60 | 2.4629 |
| 70 | 2.3011 |
| 80 | 2.2393 |
| 90 | 2.2149 |
| 100 | 2.1389 |


TrainOutput(global_step=100, training_loss=2.5668185424804686, metrics={'train_runtime': 316.2648, 'train_samples_per_second': 2.542, 'train_steps_per_second': 0.316, 'total_flos': 552914764402176.0, 'train_loss': 2.5668185424804686})

## Test Model After SFT

In [None]:
# Generate a response after fine-tuning
prompt="السلام لباس"
messages=[{"role":"user","content":prompt}]
formatted_prompt=tokenizer.apply_chat_template(messages,tokenize=False)
ids=tokenizer(formatted_prompt,return_tensors="pt").to(device)
output_ids=model.generate(**ids,max_new_tokens=100,
                          repetition_penalty=1.2)
output=tokenizer.decode(output_ids[0],skip_special_tokens=True)
print(output)

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


system
You are a helpful assistant.
user
السلام لباس
ﭺassistant
مرحبا، أنا هنا باش نعاونك فبزاف ديال الحوايج بحال اللغات و المهام اليومية.

مثلا إلى كنتي كتقلب على شي معلومة فالإنترنت ولا عندنا مشكل مع شي حاجة خاصها تدار؟ غادي يبان ليك هادشي بالتفصيل.
ولا يمكن تكون سولتيني أسئلة قبل من دابا وكنت عارف ش


## Push to the Hub

In [None]:
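# Note: the first positional argument of `push_to_hub` is the commit message, not a repo id.
# The target repo name is derived from `output_dir` (or `hub_model_id` if it is set).
sft_trainer.push_to_hub("abdeljalilELmajjodi/alatlas-sft-lora-gra")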



CommitInfo(commit_url='https://huggingface.co/abdeljalilELmajjodi/alatlas_instruct_lora/commit/b1786bda85f7332e4585f499d4bb7074e18b1ad9', commit_message='abdeljalilELmajjodi/alatlas-sft-lora-gra', commit_description='', oid='b1786bda85f7332e4585f499d4bb7074e18b1ad9', pr_url=None, repo_url=RepoUrl('https://huggingface.co/abdeljalilELmajjodi/alatlas_instruct_lora', endpoint='https://huggingface.co', repo_type='model', repo_id='abdeljalilELmajjodi/alatlas_instruct_lora'), pr_revision=None, pr_num=None)