<a href="https://colab.research.google.com/github/Medissaoui07/LLM-Experiments/blob/main/Building_Reasoning_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Implement Group Relative Policy Optimization (GRPO) using the Transformer Reinforcement Learning (TRL) library

In [None]:
!pip install -qqq datasets transformers trl peft accelerate bitsandbytes wandb --progress-bar off
!pip install -qqq flash-attn --no-build-isolation --progress-bar off

In [None]:
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import GRPOConfig, GRPOTrainer
from peft import LoraConfig, get_peft_model


In [None]:
import huggingface_hub
huggingface_hub.login()


Before we begin, we log in to Weights & Biases (wandb) so that training metrics can be tracked.

In [None]:
import wandb

wandb.login()

[34m[1mwandb[0m: Currently logged in as: [33mmohamedissaoui2468[0m ([33mmohamedissaoui2468-ensi[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

## Load the dataset


In [None]:
dataset = load_dataset("mlabonne/smoltldr")
print(dataset)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['prompt', 'completion'],
        num_rows: 2000
    })
    validation: Dataset({
        features: ['prompt', 'completion'],
        num_rows: 200
    })
    test: Dataset({
        features: ['prompt', 'completion'],
        num_rows: 200
    })
})


## Load the model

In [None]:
model_id = "HuggingFaceTB/SmolLM-135M-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)



## Load and configure LoRA

In [None]:
lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    #lora_dropout=0.05,
    bias="none",
    target_modules="all-linear"
)

model = get_peft_model(model, lora_config)
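# print_trainable_parameters() prints the summary itself and returns None,
# so it should not be wrapped in print()
model.print_trainable_parameters()

trainable params: 4,884,480 || all params: 139,399,488 || trainable%: 3.5039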
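
The trainable-parameter count above can be reproduced by hand. LoRA adds two low-rank matrices to each adapted linear layer, of shapes `r x in_features` and `out_features x r`, so each layer contributes `r * (in_features + out_features)` trainable parameters. The sketch below assumes the SmolLM-135M layer shapes (hidden size 576, MLP size 1536, key/value projections of width 192, 30 decoder blocks) and that `target_modules="all-linear"` leaves the output head untouched. These dimensions are not stated in this notebook, so check them against `model.config` before relying on them.

```python
r = 16  # LoRA rank, as set in lora_config

# (in_features, out_features) of each linear layer in one decoder block.
# Assumed SmolLM-135M shapes; verify against model.config.
layer_shapes = {
    "q_proj": (576, 576),
    "k_proj": (576, 192),
    "v_proj": (576, 192),
    "o_proj": (576, 576),
    "gate_proj": (576, 1536),
    "up_proj": (576, 1536),
    "down_proj": (1536, 576),
}
num_layers = 30  # assumed number of decoder blocks

# Each adapted layer adds r * (in + out) parameters.
per_block = sum(r * (fan_in + fan_out) for fan_in, fan_out in layer_shapes.values())
lora_params = per_block * num_layers

# "all params" reported above minus the trainable params reported above.
base_params = 139_399_488 - 4_884_480
trainable_pct = 100 * lora_params / (base_params + lora_params)

print(f"{lora_params:,}")      # 4,884,480
print(f"{trainable_pct:.4f}")  # 3.5039
```

Under these assumptions the hand count matches the reported figure exactly.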


## Define the reward function
GRPO can use any reward function to steer the model. Here we use one that rewards completions whose length is close to 50 characters: the reward is 0 at exactly 50 characters and drops by 1 for every character above or below that target.

In [None]:
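length = 50  # target completion length, in characters

def reward_fn(completions, **kwargs):
    # One reward per completion: the negative distance from the target length
    return [-abs(length - len(completion)) for completion in completions]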
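
To see how this reward behaves, the sketch below applies the same formula to a few made-up strings. The sample completions are illustrative and not taken from the dataset, and the function is redefined so the snippet runs on its own.

```python
length = 50  # target completion length, in characters

def reward_fn(completions, **kwargs):
    return [-abs(length - len(completion)) for completion in completions]

# Hypothetical completions of different lengths
samples = [
    "x" * 50,  # exactly on target
    "x" * 20,  # 30 characters too short
    "x" * 96,  # 46 characters too long
]

rewards = reward_fn(samples)
print(rewards)  # [0, -30, -46]
```

Because `len` counts characters rather than tokens, a completion that stays within a token limit can still land far from the 50-character target.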


Now let's define the training arguments using `GRPOConfig`.

In [None]:
training_args = GRPOConfig(
    output_dir="./results",  # output directory
    num_train_epochs=1,  # total number of training epochs
    per_device_train_batch_size=8,  # batch size per device during training
    gradient_accumulation_steps=2,  # number of batches to accumulate gradients over before each optimizer update
    max_completion_length=96,  # maximum number of generated tokens per completion
    max_prompt_length=512,  # prompts longer than this many tokens are truncated
    num_generations=8,  # completions sampled per prompt
    optim="adamw_8bit",
    bf16=True,
    learning_rate=1e-5,
    report_to=["wandb"],
    logging_steps=1,
)
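
These settings also determine how many optimizer steps one epoch takes. The following is a back-of-the-envelope sketch, assuming a single GPU and the convention that `per_device_train_batch_size` counts completions rather than prompts. That convention has varied across TRL versions, so treat the result as an estimate for your installed version.

```python
per_device_train_batch_size = 8
gradient_accumulation_steps = 2
num_generations = 8
num_devices = 1       # assumption: a single GPU
train_prompts = 2000  # rows in the train split

# Completions consumed per optimizer update
completions_per_step = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
# Each prompt is expanded into num_generations completions
prompts_per_step = completions_per_step // num_generations
steps_per_epoch = train_prompts // prompts_per_step

print(completions_per_step, prompts_per_step, steps_per_epoch)  # 16 2 1000
```

The estimate of 1000 steps agrees with the `global_step=1000` reported by `trainer.train()` at the end of this notebook.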

Now we initialize the trainer.

In [None]:
trainer = GRPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    reward_funcs=[reward_fn],
)
wandb.init(project="GRPO-SmolLM")


No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
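
For intuition about what the trainer does with the rewards: GRPO samples `num_generations` completions per prompt and scores each one relative to the rest of its group, so no separate value model is needed. Below is a minimal sketch of that normalization, assuming the common mean and standard-deviation form. The helper name `group_advantages`, the `eps` value, the use of the population standard deviation, and the reward values are all illustrative; this is not TRL's internal code.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption made for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for the 8 completions sampled for a single prompt
group_rewards = [0, -30, -46, -10, -5, -50, -2, -20]
advantages = group_advantages(group_rewards)

for reward, advantage in zip(group_rewards, advantages):
    print(f"reward={reward:4d}  advantage={advantage:+.3f}")
```

Completions that score above the group mean get a positive advantage and are reinforced; those below the mean get a negative one. With the length reward used here, that pushes the model toward completions nearer the 50-character target.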


In [None]:
trainer.train()

Step,Training Loss
1,0.1139
2,0.124
3,0.1621
4,0.2518
5,0.1183
6,0.0786
7,0.0922
8,0.2898
9,0.2337
10,0.1413


TrainOutput(global_step=1000, training_loss=0.2539372077118605, metrics={'train_runtime': 8498.251, 'train_samples_per_second': 0.235, 'train_steps_per_second': 0.118, 'total_flos': 0.0, 'train_loss': 0.2539372077118605})