# Preference Alignment with Odds Ratio Preference Optimization (ORPO)

This notebook guides you through fine-tuning a language model with Odds Ratio Preference Optimization (ORPO). We will use the SmolLM2-135M base model, which has **not** been through SFT training and is therefore not compatible with DPO. This means you cannot use the model you trained in [1_instruction_tuning](../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb).

<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>
     <h2 style='margin: 0;color:blue'>Exercise: Aligning SmolLM2 with ORPOTrainer</h2>
     <p>Take a dataset from the Hugging Face hub and align a model on it. </p> 
     <p><b>Difficulty Levels</b></p>
     <p>🐢 Use the <code>trl-lib/ultrafeedback_binarized</code> dataset</p>
     <p>🐕 Try out the <code>argilla/ultrafeedback-binarized-preferences</code> dataset</p>
     <p>🦁 Try on a subset of mlabonne's <code>orpo-dpo-mix-40k</code> dataset</p>
</div>



## Import libraries


In [1]:
import torch
import os
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
)
from trl import ORPOConfig, ORPOTrainer, setup_chat_format

# Authenticate to Hugging Face
from huggingface_hub import login

login()


## Format dataset

In [3]:
# Load dataset

# TODO: 🦁🐕 change the dataset to one of your choosing
dataset = load_dataset(path="trl-lib/ultrafeedback_binarized")

In [None]:
# TODO: 🐕 If your dataset is not represented as conversation lists, you can use the `process_dataset` function to convert it.
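
If a dataset stores the prompt and the two answers as plain strings, each row can be reshaped into the conversation lists used by `trl-lib/ultrafeedback_binarized`. The sketch below is only an illustration: `to_conversation` is a hypothetical helper, and the column names `prompt`, `chosen` and `rejected` are assumptions that need to be adapted to the dataset you pick.

```python
def to_conversation(row, prompt_key="prompt", chosen_key="chosen", rejected_key="rejected"):
    """Reshape one flat preference row into chosen/rejected conversation lists.

    The column names are assumptions; adjust them to your dataset.
    """
    user_turn = {"role": "user", "content": row[prompt_key]}
    return {
        "chosen": [dict(user_turn), {"role": "assistant", "content": row[chosen_key]}],
        "rejected": [dict(user_turn), {"role": "assistant", "content": row[rejected_key]}],
    }


# Toy row, made up for illustration
example = {"prompt": "What is 2 + 2?", "chosen": "2 + 2 = 4.", "rejected": "5"}
converted = to_conversation(example)
print(converted["chosen"])
```

A function of this shape can be applied to every row with `dataset.map(to_conversation)`.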

## Define the model

In [2]:
model_name = "HuggingFaceTB/SmolLM2-135M"

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

# Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name,
    torch_dtype=torch.float32,
).to(device)
model.config.use_cache = False
tokenizer = AutoTokenizer.from_pretrained(model_name)
model, tokenizer = setup_chat_format(model, tokenizer)

# Name under which the fine-tuned model is saved and/or uploaded
finetune_name = "SmolLM2-FT-ORPO"
finetune_tags = ["smol-course", "module_2"]

In [3]:
# Let's test the base model before training
prompt = "Write a Python program that uses the sieve of Eratosthenes to find prime numbers up to 100"

# Format with template
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False)

# Generate response
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=300)
print("Before training:")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Before training:
user
Write a Python program that uses the sieve of Eratosthenes to find prime numbers up to 100
Write a Python program that uses the sieve of Eratosthenes to find prime numbers up to 100
Write a Python program that uses the sieve of Eratosthenes to find prime numbers up to 100
Write a Python program that uses the sieve of Eratosthenes to find prime numbers up to 100
Write a Python program that uses the sieve of Eratosthenes to find prime numbers up to 100
Write a Python program that uses the sieve of Eratosthenes to find prime numbers up to 100
Write a Python program that uses the sieve of Eratosthenes to find prime numbers up to 100
Write a Python program that uses the sieve of Eratosthenes to find prime numbers up to 100
Write a Python program that uses the sieve of Eratosthenes to find prime numbers up to 100
Write a Python program that uses the sieve of Eratosthenes to find prime numbers up to 100
Write a Python program that uses the sieve of Eratosthenes to find p
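
The decoded output above starts with `user` because `setup_chat_format` attaches a ChatML-style chat template, and `skip_special_tokens=True` removes the `<|im_start|>` and `<|im_end|>` markers when decoding. The pure-Python sketch below imitates that template to show what the formatted prompt looks like. The exact template string is an assumption based on the ChatML convention, and `chatml_format` is a hypothetical stand-in for `tokenizer.apply_chat_template`, not the library code.

```python
def chatml_format(messages, add_generation_prompt=False):
    """Imitate a ChatML-style chat template (assumed layout, for illustration)."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Cue the model that the assistant turn starts here
        text += "<|im_start|>assistant\n"
    return text


messages = [{"role": "user", "content": "Hello"}]
without_cue = chatml_format(messages)
with_cue = chatml_format(messages, add_generation_prompt=True)
print(repr(without_cue))
print(repr(with_cue))
```

The cells in this notebook call `apply_chat_template(messages, tokenize=False)`, so the prompt ends right after the user turn. Passing `add_generation_prompt=True` to `apply_chat_template` is the usual way to append the assistant header before generating.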

## Train model with ORPO

In [10]:
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
orpo_args = ORPOConfig(
    # Small learning rate to prevent catastrophic forgetting
    learning_rate=8e-6,
    # Linear learning rate decay over training
    lr_scheduler_type="linear",
    # Maximum combined length of prompt + completion
    max_length=1024,
    # Maximum length for input prompts
    max_prompt_length=512,
    # Controls weight of the odds ratio loss (λ in paper)
    beta=0.1,
    # Batch size for training
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    # Helps with training stability by accumulating gradients before updating
    gradient_accumulation_steps=4,
    # Memory-efficient optimizer for CUDA, falls back to adamw_torch for CPU/MPS
    optim="paged_adamw_8bit" if device == "cuda" else "adamw_torch",
    # Number of training epochs
    num_train_epochs=1,
    # When to run evaluation
    eval_strategy="steps",
    # Evaluate every 20% of training
    eval_steps=0.2,
    # Log metrics every step
    logging_steps=1,
    # Gradual learning rate warmup
    warmup_steps=10,
    # Disable external logging
    report_to="none",
    # Where to save model/checkpoints
    output_dir="./results/",
    # Enable MPS (Metal Performance Shaders) if available
    use_mps_device=device == "mps",
    hub_model_id=finetune_name,
)
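
To make the role of `beta` concrete, here is a minimal pure-Python sketch of the ORPO objective: the usual negative log-likelihood on the chosen answer, minus `beta` times the mean log-sigmoid of the log odds ratio between chosen and rejected answers. It assumes the inputs are average per-token log-probabilities (so each lies below zero), and the helper names are hypothetical. It mirrors the formula rather than the trainer's tensor implementation.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def log_odds(logp):
    """log(p / (1 - p)) for a log-probability logp < 0."""
    return logp - math.log1p(-math.exp(logp))


def orpo_loss(nll_loss, chosen_logps, rejected_logps, beta=0.1):
    """Return (total loss, mean log-sigmoid of the log odds ratio)."""
    ratios = [
        log_sigmoid(log_odds(c) - log_odds(r))
        for c, r in zip(chosen_logps, rejected_logps)
    ]
    mean_ratio = sum(ratios) / len(ratios)
    return nll_loss - beta * mean_ratio, mean_ratio


# Chosen and rejected equally likely: the penalty term is beta * log(2)
tied_loss, tied_ratio = orpo_loss(1.0, [-1.0], [-1.0])
# Chosen clearly more likely than rejected: the penalty shrinks
better_loss, better_ratio = orpo_loss(1.0, [-0.5], [-2.0])
print(tied_loss, better_loss)
```

A larger `beta` puts more weight on separating chosen from rejected answers relative to plain language-modeling loss.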

In [11]:
trainer = ORPOTrainer(
    model=model,
    args=orpo_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,
)
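
The batch settings determine how many optimizer steps one epoch takes and how often evaluation runs. The sketch below works through that arithmetic. The training-set size of 62,135 rows is an assumption (check `len(dataset["train"])`), and `training_plan` is a hypothetical helper whose rounding may differ slightly from what the trainer does internally.

```python
import math


def training_plan(num_train_examples, per_device_batch_size, grad_accum_steps,
                  eval_fraction, num_devices=1):
    """Estimate effective batch size, optimizer steps per epoch, and eval interval."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    total_steps = math.ceil(num_train_examples / effective_batch)
    # A fractional eval_steps is interpreted as a share of the total steps
    eval_every = math.ceil(total_steps * eval_fraction)
    return effective_batch, total_steps, eval_every


# Settings from this notebook: batch size 2, 4 accumulation steps, eval_steps=0.2
print(training_plan(62_135, 2, 4, 0.2))
```

Under these assumptions the estimate is an effective batch of 8 and 7767 steps, which agrees with the total shown in the training progress bar of this notebook.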



In [None]:
trainer.train()  # Train the model

# Save the model
trainer.save_model(f"./{finetune_name}")

# Push to the Hugging Face Hub if logged in (HF_TOKEN is set)
if os.getenv("HF_TOKEN"):
    trainer.push_to_hub(tags=finetune_tags)

  0%|          | 0/7767 [00:00<?, ?it/s]

{'loss': 1.7591, 'grad_norm': 5.750668525695801, 'learning_rate': 8e-07, 'rewards/chosen': -0.14872416853904724, 'rewards/rejected': -0.11321860551834106, 'rewards/accuracies': 0.5, 'rewards/margins': -0.035505570471286774, 'logps/rejected': -1.1321860551834106, 'logps/chosen': -1.4872417449951172, 'logits/rejected': 6.773185729980469, 'logits/chosen': 7.739833831787109, 'nll_loss': 1.6547642946243286, 'log_odds_ratio': -1.043013572692871, 'log_odds_chosen': -0.4343176484107971, 'epoch': 0.0}
{'loss': 2.0615, 'grad_norm': 9.084352493286133, 'learning_rate': 1.6e-06, 'rewards/chosen': -0.18907341361045837, 'rewards/rejected': -0.17790371179580688, 'rewards/accuracies': 0.5, 'rewards/margins': -0.011169707402586937, 'logps/rejected': -1.7790371179580688, 'logps/chosen': -1.890734076499939, 'logits/rejected': 10.08250904083252, 'logits/chosen': 8.176446914672852, 'nll_loss': 1.9662997722625732, 'log_odds_ratio': -0.9522268176078796, 'log_odds_chosen': -0.0811508446931839, 'epoch': 0.0}

  0%|          | 0/500 [00:00<?, ?it/s]

{'eval_loss': 1.9043797254562378, 'eval_runtime': 29.4796, 'eval_samples_per_second': 33.922, 'eval_steps_per_second': 16.961, 'eval_rewards/chosen': -0.15143051743507385, 'eval_rewards/rejected': -0.16438376903533936, 'eval_rewards/accuracies': 0.5239999890327454, 'eval_rewards/margins': 0.012953268364071846, 'eval_logps/rejected': -1.6438379287719727, 'eval_logps/chosen': -1.5143049955368042, 'eval_logits/rejected': 8.165465354919434, 'eval_logits/chosen': 7.4909749031066895, 'eval_nll_loss': 1.8308172225952148, 'eval_log_odds_ratio': -0.7356247901916504, 'eval_log_odds_chosen': 0.15601101517677307, 'epoch': 0.2}
{'loss': 1.6403, 'grad_norm': 4.881844997406006, 'learning_rate': 6.406600489880108e-06, 'rewards/chosen': -0.13981233537197113, 'rewards/rejected': -0.1276257038116455, 'rewards/accuracies': 0.375, 'rewards/margins': -0.012186648324131966, 'logps/rejected': -1.2762569189071655, 'logps/chosen': -1.3981233835220337, 'logits/rejected': 6.95015287399292, 'logits/chosen': 7.1969

  0%|          | 0/500 [00:00<?, ?it/s]

{'eval_loss': 1.8770135641098022, 'eval_runtime': 30.5838, 'eval_samples_per_second': 32.697, 'eval_steps_per_second': 16.349, 'eval_rewards/chosen': -0.14907145500183105, 'eval_rewards/rejected': -0.16157890856266022, 'eval_rewards/accuracies': 0.5220000147819519, 'eval_rewards/margins': 0.012507444247603416, 'eval_logps/rejected': -1.6157889366149902, 'eval_logps/chosen': -1.4907145500183105, 'eval_logits/rejected': 7.412282943725586, 'eval_logits/chosen': 6.833620071411133, 'eval_nll_loss': 1.8031706809997559, 'eval_log_odds_ratio': -0.7384283542633057, 'eval_log_odds_chosen': 0.15269623696804047, 'epoch': 0.4}
{'loss': 1.8112, 'grad_norm': 7.254087448120117, 'learning_rate': 4.803919040866314e-06, 'rewards/chosen': -0.1672569215297699, 'rewards/rejected': -0.15190726518630981, 'rewards/accuracies': 0.25, 'rewards/margins': -0.01534966193139553, 'logps/rejected': -1.5190727710723877, 'logps/chosen': -1.6725692749023438, 'logits/rejected': 8.540181159973145, 'logits/chosen': 6.845414

  0%|          | 0/500 [00:00<?, ?it/s]

{'eval_loss': 1.862196683883667, 'eval_runtime': 29.4449, 'eval_samples_per_second': 33.962, 'eval_steps_per_second': 16.981, 'eval_rewards/chosen': -0.14785191416740417, 'eval_rewards/rejected': -0.15948601067066193, 'eval_rewards/accuracies': 0.5270000100135803, 'eval_rewards/margins': 0.01163407601416111, 'eval_logps/rejected': -1.5948599576950073, 'eval_logps/chosen': -1.4785192012786865, 'eval_logits/rejected': 7.426166534423828, 'eval_logits/chosen': 6.838943958282471, 'eval_nll_loss': 1.7884366512298584, 'eval_log_odds_ratio': -0.7376004457473755, 'eval_log_odds_chosen': 0.1437922716140747, 'epoch': 0.6}
{'loss': 2.2166, 'grad_norm': 6.152320861816406, 'learning_rate': 3.20123759185252e-06, 'rewards/chosen': -0.17957809567451477, 'rewards/rejected': -0.14342840015888214, 'rewards/accuracies': 0.25, 'rewards/margins': -0.036149680614471436, 'logps/rejected': -1.434283971786499, 'logps/chosen': -1.7957807779312134, 'logits/rejected': 7.438933372497559, 'logits/chosen': 7.816839218

  0%|          | 0/500 [00:00<?, ?it/s]

{'eval_loss': 1.854513168334961, 'eval_runtime': 30.0242, 'eval_samples_per_second': 33.306, 'eval_steps_per_second': 16.653, 'eval_rewards/chosen': -0.1472446471452713, 'eval_rewards/rejected': -0.15871301293373108, 'eval_rewards/accuracies': 0.5270000100135803, 'eval_rewards/margins': 0.01146837417036295, 'eval_logps/rejected': -1.587130069732666, 'eval_logps/chosen': -1.4724465608596802, 'eval_logits/rejected': 7.309061050415039, 'eval_logits/chosen': 6.743004322052002, 'eval_nll_loss': 1.7807462215423584, 'eval_log_odds_ratio': -0.7376692295074463, 'eval_log_odds_chosen': 0.1426137089729309, 'epoch': 0.8}
{'loss': 1.7861, 'grad_norm': 5.897767066955566, 'learning_rate': 1.5985561428387263e-06, 'rewards/chosen': -0.16134144365787506, 'rewards/rejected': -0.19520017504692078, 'rewards/accuracies': 0.5, 'rewards/margins': 0.03385874256491661, 'logps/rejected': -1.9520018100738525, 'logps/chosen': -1.6134145259857178, 'logits/rejected': 8.855836868286133, 'logits/chosen': 7.92081451416
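
Two quick sanity checks help when reading the log above. The sketch below reproduces the logged learning rates from a linear warmup and decay schedule, and rebuilds the first logged `loss` from `nll_loss` and `log_odds_ratio`. The step number 1555 for the entry logged after the first evaluation is inferred from the arithmetic, not read from the log, and `linear_warmup_decay` is a hypothetical re-implementation of the schedule, not the library function.

```python
def linear_warmup_decay(step, base_lr=8e-6, warmup_steps=10, total_steps=7767):
    """Learning rate after `step` optimizer steps: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


# First two logged steps are still in warmup
print(linear_warmup_decay(1), linear_warmup_decay(2))
# Entry logged after the first evaluation (assumed to be step 1555)
print(linear_warmup_decay(1555))

# First logged training step: loss = nll_loss - beta * log_odds_ratio
beta = 0.1
nll_loss = 1.6547642946243286
log_odds_ratio = -1.043013572692871
reconstructed_loss = nll_loss - beta * log_odds_ratio
print(round(reconstructed_loss, 4))
```

The logged `rewards/chosen` and `rewards/rejected` values follow the same pattern: each equals `beta` times the corresponding `logps` value.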


In [5]:
# I trained the model but did not test it right away, so I reload it from disk here

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "./SmolLM2-FT-ORPO"

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)
model.config.use_cache = False

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

prompt = "Write a Python program that uses the sieve of Eratosthenes to find prime numbers up to 100"

messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False)

# Generate response
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=500)
print("ORPO output:")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


ORPO output:
user
Write a Python program that uses the sieve of Eratosthenes to find prime numbers up to 100
assistant
To find prime numbers up to 100, you can use the sieve of Eratosthenes. The sieve of Eratosthenes is a mathematical algorithm that is used to find prime numbers. The algorithm works by repeatedly dividing a number by its divisor until the result is a prime number.

Here's a Python program that uses the sieve of Eratosthenes to find prime numbers up to 100:

```python
import math

def sieve_of_eratosthenes(n):
    sieve = [2] * (n + 1)
    for i in range(2, n + 1):
        if i % 2 == 0:
            sieve[i // 2] += 1
        else:
            sieve[i // 2] += 1
    return sieve

def prime_numbers(n):
    sieve = [2] * (n + 1)
    for i in range(2, n + 1):
        if sieve[i] == 0:
            continue
        for j in range(i * 2, n + 1, i * 2):
            sieve[j] += 1
    return sieve

def find_primes(n):
    sieve = [2] * (n + 1)
    for i in range(2, n + 1):
     

In [None]:
from IPython.display import display, HTML
display(HTML("<h2>Kernel will shut down...</h2>"))

import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True) 

{'status': 'ok', 'restart': True}


## 💐 You're done!

This notebook provided a step-by-step guide to fine-tuning the `HuggingFaceTB/SmolLM2-135M` model using the `ORPOTrainer`. By following these steps, you can adapt the model to perform specific tasks more effectively. If you want to carry on working on this course, here are steps you could try out:

- Try this notebook on a harder difficulty
- Review a colleague's PR
- Improve the course material via an Issue or PR.