<a href="https://colab.research.google.com/github/Devmangukiya/llmtwin_handsbook/blob/main/preference_alignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
# Installs Unsloth, xformers (Flash Attention), and all required packages
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers "git+https://github.com/huggingface/trl.git@main" peft accelerate bitsandbytes


In [2]:
from unsloth import PatchDPOTrainer

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


comet_ml is installed but the Comet API Key is not configured. Please set the `COMET_API_KEY` environment variable to enable Comet logging. Check out the documentation for other ways of configuring it: https://www.comet.com/docs/v2/guides/experiment-management/configure-sdk/#set-the-api-key
    PyTorch 2.7.0+cu126 with CUDA 1206 (you have 2.6.0+cu124)
    Python  3.11.12 (you have 3.11.13)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


🦥 Unsloth Zoo will now patch everything to make training faster!


In [3]:
import os
import torch
from datasets import load_dataset
from transformers import TrainingArguments, TextStreamer
from unsloth import FastLanguageModel, is_bfloat16_supported

In [4]:
max_seq_length = 2048
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "mlabonne/TwinLlama-3.1-8B",
    max_seq_length =  max_seq_length,
    load_in_4bit = True,
    dtype = torch.float16
)

==((====))==  Unsloth 2025.6.2: Fast Llama patching. Transformers: 4.52.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [5]:
model = FastLanguageModel.get_peft_model(
    model = model,
    r = 32,
    lora_alpha = 32,
    lora_dropout = 0,
    target_modules = ["q_proj","k_proj","v_proj","up_proj","down_proj","o_proj","gate_proj"]
)

Unsloth 2025.6.2 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [6]:
dataset = load_dataset("mlabonne/llmtwin-dpo",split="train")

In [7]:
dataset

Dataset({
    features: ['prompt', 'rejected', 'chosen'],
    num_rows: 1545
})

In [8]:
alpaca_template = """Below is an instruction that describe a task. write a response that appropriately completes the request.
### Instruction:
{}
### Response:
{}
"""
EOS_TOKEN = tokenizer.eos_token
def format_samples(example):
  example["prompt"] = alpaca_template.format(example["prompt"],"")
  example["chosen"] = example["chosen"] + EOS_TOKEN
  example["rejected"] = example["rejected"] + EOS_TOKEN
  return {"prompt": example["prompt"], "chosen": example["chosen"], "rejected": example["rejected"]}

dataset = dataset.map(format_samples)
dataset = dataset.train_test_split(test_size=0.05)

In [9]:
dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'rejected', 'chosen'],
        num_rows: 1467
    })
    test: Dataset({
        features: ['prompt', 'rejected', 'chosen'],
        num_rows: 78
    })
})

In [10]:
import trl
print(trl.__version__)

0.19.0.dev0


In [11]:
from trl import DPOTrainer,DPOConfig

In [12]:
!pip install comet-ml>=3.43.2

In [14]:
import os
os.environ["COMET_API_KEY"] = "PLXjKHNrsqXvJ1AEcIrkDPgfD"

In [15]:
trainer = DPOTrainer(
    model = model,
    ref_model = None,
    tokenizer = tokenizer,
    beta = 0.5,
    train_dataset = dataset["train"],
    eval_dataset = dataset["test"],
    max_length = max_seq_length//2,
    max_prompt_length = max_seq_length//2,
    args = DPOConfig(
        learning_rate = 2e-6,
        lr_scheduler_type = "linear",
        per_device_train_batch_size = 2,
        per_device_eval_batch_size = 2,
        gradient_accumulation_steps = 8,
        num_train_epochs = 1,
        fp16  = True,
        bf16 = False,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        warmup_steps = 10,
        output_dir = "output",
        eval_strategy = "steps",
        eval_steps = 0.2,
        report_to = "comet_ml",
        seed = 0
    )
)

trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,467 | Num Epochs = 1 | Total steps = 92
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 8 x 1) = 16
 "-____-"     Trainable parameters = 83,886,080/8,000,000,000 (1.05% trained)
[1;38;5;39mCOMET INFO:[0m Experiment is live on comet.com https://www.comet.com/dev-mangukiya/general/ba4766e0ff3040beb09d66124af1249e

[1;38;5;39mCOMET INFO:[0m Couldn't find a Git repository in '/content' nor in any parent directory. Set `COMET_GIT_DIRECTORY` if your Git Repository is elsewhere.


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss,Validation Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / chosen,logps / rejected,logits / chosen,logits / rejected,eval_logits / chosen,eval_logits / rejected,nll_loss,aux_loss
19,0.6854,0.6928,0.000394,-0.000306,0.564103,0.0007,-86.731148,-53.775188,-1.570228,-1.518277,0,0,0,0
38,0.6409,0.693074,0.00032,0.000169,0.5,0.000151,-86.73188,-53.770432,-1.57033,-1.518306,No Log,No Log,No Log,No Log
57,0.5528,0.692995,0.00058,0.000271,0.474359,0.000309,-86.729271,-53.769413,-1.570253,-1.518359,No Log,No Log,No Log,No Log
76,0.5733,0.692671,0.001199,0.000241,0.538462,0.000958,-86.723091,-53.769718,-1.570309,-1.518366,No Log,No Log,No Log,No Log


comet_ml is installed but the Comet API Key is not configured. Please set the `COMET_API_KEY` environment variable to enable Comet logging. Check out the documentation for other ways of configuring it: https://www.comet.com/docs/v2/guides/experiment-management/configure-sdk/#set-the-api-key


TrainOutput(global_step=92, training_loss=0.6413856598994007, metrics={'train_runtime': 1747.4439, 'train_samples_per_second': 0.84, 'train_steps_per_second': 0.053, 'total_flos': 0.0, 'train_loss': 0.6413856598994007, 'epoch': 1.0})

In [17]:
model = FastLanguageModel.for_inference(model)
message = alpaca_template.format("Write a paragraph to introduce supervised fine-tuning","")
inputs = tokenizer([message],return_tensors="pt").to("cuda")
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs,streamer=text_streamer,max_new_tokens=256,use_cache=True)

<|begin_of_text|>Below is an instruction that describe a task. write a response that appropriately completes the request.
### Instruction:
Write a paragraph to introduce supervised fine-tuning
### Response:

Supervised fine-tuning is a method used to enhance the performance of a pre-trained model by training it on a specific dataset. This process involves using a labeled dataset to adjust the model's parameters, allowing it to better understand and respond to the data it is trained on. By doing this, the model can learn to make accurate predictions or classifications, thus improving its overall effectiveness. Supervised fine-tuning is particularly beneficial for models that need to be tailored for specific tasks, as it allows for a more targeted approach to training.<|end_of_text|>
