🚀 Let's explore Direct Preference Optimization (DPO), a fine-tuning technique for aligning AI systems with human preferences! 🌟

### What is DPO?
Direct Preference Optimization (DPO) is a fine-tuning technique for AI models, especially large language models (LLMs), that learns directly from human preference comparisons. Traditional Reinforcement Learning from Human Feedback (RLHF) first trains a separate reward model and then optimizes the policy against it with reinforcement learning. DPO skips both steps: it trains the model on pairs of preferred and rejected outputs with a simple classification-style loss, so the reward is implicit in the model itself.
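
For reference, the objective from the original DPO paper (Rafailov et al., 2023) can be written as below. The symbols follow the paper's notation and do not appear elsewhere in this notebook: $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ is a frozen reference model, $y_w$ and $y_l$ are the preferred and rejected responses to prompt $x$, $\sigma$ is the logistic function, and $\beta$ controls how strongly the model is kept close to the reference.

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

The loss falls when the model raises the likelihood of $y_w$ relative to the reference more than it raises the likelihood of $y_l$.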

### How Does DPO Work?
1. **Preference Elicitation:** Humans compare pairs of AI outputs (e.g., two candidate responses to the same prompt) and select their preferred option.

2. **Model Update:** The model is trained to assign a higher likelihood to the preferred output than to the rejected one, measured relative to a frozen reference model (usually the starting checkpoint).

3. **Iterative Improvement:** Training repeats this update over many preference pairs. New comparisons can be collected later and the process run again to refine the model further.
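
The model-update step can be sketched in a few lines of plain Python. This is a minimal illustration of the DPO loss for a single preference pair, not the code TRL runs internally. The function names and the log-probability values are made up for the example, and each log-probability is assumed to be the sum over the tokens of one response.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # Branch on the sign so that exp() never receives a large positive argument
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed token log-probabilities."""
    # How much more (or less) likely each answer is under the policy than the reference
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))

# Hypothetical log-probabilities, made up for illustration
loss_at_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)  # policy identical to the reference
loss_after = dpo_loss(-10.0, -18.0, -12.0, -15.0)     # policy now favors the chosen answer
print(loss_at_start, loss_after)
```

When the policy equals the reference, the loss is log(2). It drops as the policy shifts probability toward the chosen answer and away from the rejected one.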

### Benefits of DPO
💡 **Simplicity and Efficiency:** DPO simplifies implementation and is computationally cheaper because it needs no separate reward model and no reinforcement-learning sampling loop. It still has hyperparameters to tune, most notably `beta`.

🎯 **Alignment with Nuanced Values:** DPO captures subtle human preferences that may be challenging to translate into explicit rewards.

🌈 **Scalable Oversight:** Annotators only need to say which of two outputs is better, which is usually easier to collect at scale than absolute scores.

🌟 **Improved Performance:** The original DPO paper (Rafailov et al., 2023) reports performance comparable to or better than PPO-based RLHF on sentiment control, summarization, and single-turn dialogue.

### Why DPO is Different
💡 **Focus:** DPO optimizes the policy directly on preference pairs, whereas RLHF first fits a reward model and then optimizes the policy against that model.

🌈 **Complexity:** DPO is simpler and less computationally expensive than RLHF, since it trains one model with a supervised-style loss.

🚀 **Flexibility:** Because the reward is implicit in the policy, DPO may capture nuanced preferences that a separately trained reward model struggles to represent.
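
To make "implicit reward" concrete, here is a small sketch. The function names and numbers are hypothetical and chosen for illustration. The reward DPO implies for a response is `beta` times the difference between the policy's and the reference model's log-probability, and a simple sanity metric is how often the chosen answer receives the higher implicit reward.

```python
def implicit_reward(policy_logp: float, ref_logp: float, beta: float = 0.1) -> float:
    """Reward implied by the policy: beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    return beta * (policy_logp - ref_logp)

def reward_accuracy(pairs, beta: float = 0.1) -> float:
    """Fraction of pairs whose chosen answer gets the higher implicit reward.

    Each pair is (policy_chosen, ref_chosen, policy_rejected, ref_rejected),
    all summed token log-probabilities.
    """
    correct = 0
    for policy_c, ref_c, policy_r, ref_r in pairs:
        if implicit_reward(policy_c, ref_c, beta) > implicit_reward(policy_r, ref_r, beta):
            correct += 1
    return correct / len(pairs)

# Hypothetical log-probabilities for four preference pairs
pairs = [
    (-10.0, -12.0, -18.0, -15.0),  # chosen is rewarded more: counted as correct
    (-14.0, -12.0, -13.0, -15.0),  # rejected is rewarded more: counted as incorrect
    (-11.0, -11.5, -16.0, -15.0),  # correct
    (-9.0, -9.0, -20.0, -19.0),    # correct
]
accuracy = reward_accuracy(pairs)
print(accuracy)
```

No reward network is involved: both quantities come from the two language models' own log-probabilities.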

DPO is reshaping how AI systems interact with human values, offering a more aligned and user-friendly AI experience! 🌟🤖🌈

In [None]:
%pip install -q datasets trl peft bitsandbytes sentencepiece wandb

In [2]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
import os
import gc
import torch

import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from datasets import load_dataset
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from trl import DPOTrainer
import bitsandbytes as bnb

model_name = "google/gemma-1.1-2b-it"
new_model = "EI-gemma"

In [None]:
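def chatml_format(example):
    # Despite the name, this applies the tokenizer's own chat template
    # (Gemma's format for this model), not literal ChatML markup.
    # Wrap the instruction as a single user turn and append the generation
    # prompt so the chosen/rejected answers continue from the model's turn.
    message = {"role": "user", "content": example['prompt']}
    prompt = tokenizer.apply_chat_template([message], tokenize=False, add_generation_prompt=True)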

    # Format chosen answer
    chosen = example['chosen']

    # Format rejected answer
    rejected = example['rejected']

    return {
        "prompt": prompt,
        "chosen": chosen,
        "rejected": rejected,
    }

# Load dataset
dataset = load_dataset("OEvortex/SentimentSynth")['train']

# Save columns
original_columns = dataset.column_names

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# tokenizer.padding_side = "left"

# Format dataset
dataset = dataset.map(
    chatml_format,
    remove_columns=original_columns
)

# Print sample
dataset[1]
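
To show roughly what a formatted record looks like without downloading the tokenizer, the sketch below hand-writes a Gemma-style prompt. The template string is an approximation written for this example, and the helper names and sample record are hypothetical. In the notebook itself the real string comes from `tokenizer.apply_chat_template`, which should be treated as the source of truth.

```python
def gemma_style_prompt(user_text: str) -> str:
    """Hand-written stand-in for apply_chat_template on a single user turn.

    Assumption: one user turn followed by an open model turn, as requested
    by add_generation_prompt=True in the notebook.
    """
    return (
        "<bos><start_of_turn>user\n"
        f"{user_text.strip()}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

def format_pair(example: dict) -> dict:
    """Build the three fields a DPO preference record needs."""
    return {
        "prompt": gemma_style_prompt(example["prompt"]),
        "chosen": example["chosen"],      # preferred completion, left unchanged
        "rejected": example["rejected"],  # dispreferred completion, left unchanged
    }

# Hypothetical record, not taken from the dataset
record = {
    "prompt": "I failed my exam today.",
    "chosen": "I'm sorry to hear that. Do you want to talk about what happened?",
    "rejected": "You should have studied harder.",
}
formatted = format_pair(record)
print(formatted["prompt"])
```

Only the prompt is wrapped in the chat template. The chosen and rejected answers are kept as plain continuations of the model's turn.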

In [None]:
# LoRA configuration
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)

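# Model to fine-tune: load in 4-bit NF4 via bitsandbytes and compute in
# bfloat16 so the dtype matches bf16=True in the training arguments
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)
# The generation cache is not needed for training and conflicts with gradient checkpointing
model.config.use_cache = False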

# Training arguments
training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    max_steps=200,
    save_strategy="no",
    logging_steps=1,
    output_dir=new_model,
    optim="paged_adamw_32bit",
    warmup_steps=100,
    bf16=True,
    report_to="wandb",
)

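# Create DPO trainer (keyword arguments as accepted by older TRL releases;
# newer releases expect beta and the length limits in a DPOConfig instead)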
dpo_trainer = DPOTrainer(
    model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    beta=0.1,
    max_prompt_length=1024,
    max_length=1536,
)
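
The training arguments above imply a particular schedule, and a little arithmetic helps when reading the loss curve. The sketch below assumes a single GPU and assumes that `lr_scheduler_type="cosine"` means linear warmup followed by half-cosine decay to zero. The function is written for this example and is not copied from the library.

```python
import math

def lr_at_step(step, base_lr=5e-5, warmup_steps=100, max_steps=200):
    """Learning rate under linear warmup followed by half-cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, from 0.0 to 1.0
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# per_device_train_batch_size * gradient_accumulation_steps, assuming one device
effective_batch = 4 * 4
# Preference pairs consumed over max_steps=200 optimizer steps
pairs_seen = effective_batch * 200

for step in (0, 50, 100, 150, 200):
    print(step, lr_at_step(step))
print(effective_batch, pairs_seen)
```

With `warmup_steps=100` and `max_steps=200`, half of the run is spent warming up, and the peak learning rate of `5e-5` is reached only at step 100 before decaying again.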




In [None]:
dpo_trainer.train()

In [None]:
# Save artifacts
dpo_trainer.model.save_pretrained("final-model")
tokenizer.save_pretrained("final-model")

# Flush memory
del dpo_trainer, model
gc.collect()
torch.cuda.empty_cache()

# Reload model in FP16 (instead of NF4)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    return_dict=True,
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Merge base model with the adapter
model = PeftModel.from_pretrained(base_model, "final-model")
model = model.merge_and_unload()

# Save model and tokenizer
model.save_pretrained(new_model)
tokenizer.save_pretrained(new_model)

# Push them to the HF Hub
model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)
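
The merge step above folds the LoRA adapter into the base weights so the saved model needs no adapter at inference time. Here is a toy sketch of the arithmetic, assuming the standard LoRA update `W + (lora_alpha / r) * B @ A`. The helper names and the tiny matrices are hypothetical, and plain Python lists stand in for tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def merge_lora(W, A, B, r, lora_alpha):
    """Return the merged weight W + (lora_alpha / r) * B @ A."""
    scale = lora_alpha / r
    delta = matmul(B, A)  # low-rank update with the same shape as W
    return [
        [W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
        for i in range(len(W))
    ]

# Toy 2x2 base weight with a rank-1 adapter
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # shape (out_features, r)
A = [[0.5, -0.5]]    # shape (r, in_features)
merged = merge_lora(W, A, B, r=1, lora_alpha=1)
print(merged)
```

The notebook uses `r=16` and `lora_alpha=16`, so the scale factor there is also 1.0 and the adapter product is added to each targeted weight matrix unscaled.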