<a href="https://colab.research.google.com/github/2003Yash/RLHF_DPO_Finetuning/blob/main/RLHF_%26_DPO_Finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Source: https://www.youtube.com/watch?v=bbVoDXoPrPM&list=PLZAGXXsIV3P3gCenOWRd56ZpeksdYFUKV

Typical model creation steps:

- Pre-training (creating the base model)
- SFT (Supervised Fine-Tuning: adds extra knowledge)
- RLHF (Reinforcement Learning from Human Feedback: makes the model's behaviour more desirable using human input)

# How To Achieve RLHF:

We have the model generate multiple outputs for each input, and then we rank those outputs by preference.

- only con of RLHF is it's restricted based on

# DPO:

- Instead of running PPO loops with rewards, rollouts, and all that gym-style chaos, DPO just takes pairs of answers (one preferred, one rejected) and directly nudges the model’s logits so it leans toward the preferred one. Think of it as teaching by comparison rather than teaching by scoring.
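
To make "teaching by comparison" concrete, here is a minimal sketch of the DPO loss for a single preference pair. It assumes you already have the summed token log-probabilities of the chosen and rejected answers under both the model being trained and a frozen reference copy; the function name `dpo_loss` and the value `beta=0.1` are illustrative choices, not something taken from the video.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of summed log-probabilities."""
    # How much more (or less) likely each answer is under the policy than the reference
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Positive margin means the policy already leans toward the chosen answer
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: the loss is log(2)
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# Policy has moved toward the chosen answer: the loss is smaller
print(dpo_loss(-8.0, -12.0, -10.0, -12.0))
```

The loss only depends on the difference between the two log-ratios, which is why no separate reward model or rollout loop is needed.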

# Below we will fine-tune an LLM to generate quality YouTube video titles based on a video idea

## Step-1 (Generate 5 video titles for each video idea, form every unique pair of those titles, then manually pick the better title in each pair)

- The manual labeling part is not in the code; it needs to be done by hand
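
One possible way to do the manual part is a small console helper like the sketch below. It is an assumption, not the workflow from the video: the helper names `label_pairs` and `ask_human` are hypothetical, and the only things taken from this notebook are the column names (`idea`, `title_a`, `title_b`, `title_b_preferred`) and the two CSV file names mentioned in the comments.

```python
import csv

def label_pairs(rows, choose):
    """Return copies of rows with a title_b_preferred column (1 if title_b wins, else 0)."""
    labeled = []
    for row in rows:
        preferred = choose(row["idea"], row["title_a"], row["title_b"])
        labeled.append({**row, "title_b_preferred": 1 if preferred == "b" else 0})
    return labeled

def ask_human(idea, title_a, title_b):
    """Show one pair on the console and keep asking until the answer is 'a' or 'b'."""
    print(f"\nIdea: {idea}\n  (a) {title_a}\n  (b) {title_b}")
    answer = ""
    while answer not in ("a", "b"):
        answer = input("Preferred title [a/b]: ").strip().lower()
    return answer

def label_csv(in_path, out_path, choose=ask_human):
    """Read the pairs CSV, label every row, and write the preferences CSV."""
    with open(in_path, mode="r", encoding="utf-8", newline="") as f:
        rows = list(csv.DictReader(f))
    labeled = label_pairs(rows, choose)
    with open(out_path, mode="w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(labeled[0].keys()))
        writer.writeheader()
        writer.writerows(labeled)

# Example call (interactive, so it is left commented out):
# label_csv("data/idea-title_pairs.csv", "data/idea-title_pairs-preferences.csv")
```

Labeling in a spreadsheet works just as well; the only requirement for Step-2 is that the resulting file has a `title_b_preferred` column.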


In [None]:
import csv
import re
from itertools import combinations

from together import Together
from dotenv import load_dotenv
import os
# load vars from .env
load_dotenv()

# set together api key
client = Together(api_key=os.getenv("TOGETHER_API_KEY"))

In [None]:
# Import ideas:

# Open and read the CSV file
with open('data/ideas.csv', mode='r', encoding='utf-8') as file:
    reader = csv.reader(file)
    next(reader) # skip first line

    # initialize list to store ideas
    idea_list = []

    for row in reader:
        idea_list.append(row[0])

In [None]:
# Prompt template

template = lambda idea : f"""**YouTube Titles**:
- The 8 AI Skills That Will Separate Winners From Losers in 2025
- World's Lightest Solid!
- Why Are 96,000,000 Black Balls on This Reservoir?
- These 11 income streams made me $220,000 in 2024.
- I Make $15K/Month With 2 AI Apps
- Top 5 Reasons Not to Become a Data Analyst
- What Does a Data Analyst Actually Do?
- How I Would Become a Data Analyst if I had to Start Over in 2024 | 6 Month Plan
- How to learn to code FAST using ChatGPT (it's a game changer seriously)
- 6 Years of Studying Machine Learning in 26 Minutes
- My honest advice to someone who wants to be a data scientist
- Complete Python Pandas Data Science Tutorial! (Reading CSV/Excel files, Sorting, Filtering, Groupby)
- The Complete Machine Learning Roadmap
- My GPT-evaluator got 1000% better with this simple trick.
- The 5 paid subscriptions I actually use in 2025 as a Staff Software Engineer
- AI Explained at 5 Levels of Complexity
- Docker in 5 Minutes
- I asked 100 millionaires how to get rich–here's what happened.
- How I Build Projects (as an AI Engineer)
- AI Researcher critiques Claude 3.5 sonnet
- Data scientist explains how to predict the future
- I bought 10 data science courses so you don’t have to
- My AI Development Setup (From Scratch)
- How to Build a Resume Optimizer with AI (Code Walkthrough)
- I Quit My Job… Here’s How Much I Made 1 Year Later
- I Was Wrong About AI Consulting (what I learned)

--
Given the YouTube video idea write 5 engaging title ideas.

**Video Idea**: {idea}

**Additional Guidance**:
- Titles should be between 30 and 75 characters long
- Only return the title ideas, nothing else!
- Title ideas should be written as an ordered markdown list

"""

In [None]:
%%time
# Generate Titles (the %%time cell magic has to be the first line of the cell)

triplet_list = []
for idea in idea_list:
    # generate completion
    response = client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct-Turbo",
        messages=[
            {"role": "user",
             "content": template(idea)
            },
        ],
        max_tokens=None,
        temperature=0.7,
        top_p=0.7,
        top_k=50,
        repetition_penalty=1,
        stop=["<|im_end|>"],
    )

    # parse completion (5 titles)
    response_raw = response.choices[0].message.content
    pattern = r"^\s*(?:[-*]|\d+\.)\s+(.+)$"
    title_list = re.findall(pattern, response_raw, re.MULTILINE)

    # generate all possible unique pairs
    title_pair_list = list(combinations(title_list, 2))

    # store all unique idea-title pairs in a list of dicts
    for a,b in title_pair_list:
        triplet_list.append({"idea":idea, "title_a": a, "title_b": b})
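
To see what the parsing and pairing steps do, the sketch below runs the same regex and `combinations` call on a made-up response. The five titles are hypothetical and only stand in for what the model might return.

```python
import re
from itertools import combinations

# Hypothetical model response, invented for illustration
response_raw = """1. Docker Explained in Five Minutes for Busy Engineers
2. The Fastest Way to Learn Docker From Scratch
3. Why Every Data Scientist Should Learn Docker
4. Docker in Plain English: A Beginner's Walkthrough
5. I Learned Docker in a Weekend (Here's How)"""

# Same pattern as above: matches "- item", "* item", or "1. item" at the start of a line
pattern = r"^\s*(?:[-*]|\d+\.)\s+(.+)$"
title_list = re.findall(pattern, response_raw, re.MULTILINE)

# Every unordered pair of distinct titles
title_pair_list = list(combinations(title_list, 2))

print(len(title_list))       # 5 titles
print(len(title_pair_list))  # 10 pairs, i.e. 5 choose 2
```

So each idea contributes up to 10 rows to the CSV. If the model returns fewer parsable list items, that idea simply yields fewer pairs.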

In [None]:
# Write Titles into a CSV:

with open("data/idea-title_pairs.csv", mode="w", newline="", encoding="utf-8") as file:

    # Extract field names from the first dictionary
    fieldnames = triplet_list[0].keys()

    # Create DictWriter object
    writer = csv.DictWriter(file, fieldnames=fieldnames)

    # Write the header row
    writer.writeheader()

    # Write data rows
    writer.writerows(triplet_list)

## Step-2 ( Prepare fine-tuning data )

In [None]:
import pandas as pd
import numpy as np
from datasets import DatasetDict, Dataset

In [None]:
df = pd.read_csv('data/idea-title_pairs-preferences.csv')

In [None]:
# Create Prompt

template = lambda idea : f"""Given the YouTube video idea write an engaging title.

**Video Idea**: {idea}

**Additional Guidance**:
- Title should be between 30 and 75 characters long
- Only return the title idea, nothing else!"""

In [None]:
def idea_to_prompt(idea):
    return [{"role": "user", "content": template(idea.lower())}]

In [None]:
df['prompt'] = df['idea'].apply(idea_to_prompt)

In [None]:
# create chosen and rejected responses
def title_to_completion(title):
    return [{"role": "assistant", "content": title}]

# create chosen and rejected columns
df['chosen'] = np.where(df['title_b_preferred'] == 1, df['title_b'].apply(title_to_completion), df['title_a'].apply(title_to_completion))
df['rejected'] = np.where(df['title_b_preferred'] == 1, df['title_a'].apply(title_to_completion), df['title_b'].apply(title_to_completion))

# df now ends with three new columns: prompt, chosen, rejected
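
For reference, this is roughly what a single row looks like after the steps above, written as a plain-Python sketch without pandas. The helper name `to_preference_record` and the example strings are hypothetical; the message structure mirrors `idea_to_prompt` and `title_to_completion`.

```python
def to_preference_record(prompt_text, title_a, title_b, title_b_preferred):
    """Build one prompt / chosen / rejected record in conversational format."""
    # The preferred title becomes "chosen", the other one "rejected"
    if title_b_preferred == 1:
        chosen, rejected = title_b, title_a
    else:
        chosen, rejected = title_a, title_b

    return {
        "prompt": [{"role": "user", "content": prompt_text}],
        "chosen": [{"role": "assistant", "content": chosen}],
        "rejected": [{"role": "assistant", "content": rejected}],
    }

# Made-up example row
record = to_preference_record(
    "Given the YouTube video idea write an engaging title. ...",
    "Docker in Plain English",
    "I Learned Docker in a Weekend",
    title_b_preferred=1,
)
print(record["chosen"][0]["content"])
```

Each of the three fields is a list of chat messages, which is the shape the later training step reads from the dataset.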

In [None]:
# write data to file
df.to_csv('data/preferences.csv')

In [None]:
# TRAIN AND TEST SPLIT

# shuffle dataframe
df_shuffled = df.iloc[:,-3:].sample(frac=1, random_state=42).reset_index(drop=True)

# 90-10 split
train_size = int(0.9 * len(df_shuffled))

# slice accordingly
df_train = df_shuffled.iloc[:train_size]
df_valid = df_shuffled.iloc[train_size:]
# Convert the pandas DataFrames back to Hugging Face Datasets
train_ds = Dataset.from_pandas(df_train)
valid_ds = Dataset.from_pandas(df_valid)

# Combine into a DatasetDict
dataset_dict = DatasetDict({
    'train': train_ds,
    'valid': valid_ds,
})

# push data to hub
dataset_dict.push_to_hub("your hugging-face hub id")

## Step-3 (Finetuning The Model)

In [None]:
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

In [None]:
# load data
dataset = load_dataset("dataset address/ hf hub id for dataset")

In [None]:
# load model

model_name = "Qwen/Qwen2.5-0.5B-Instruct"

model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token # set pad token

In [None]:
# Generate title with base model

def format_chat_prompt(user_input, system_message="You are a helpful assistant."):
    """
    Formats user input into the chat template format with <|im_start|> and <|im_end|> tags.

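    Args:
        user_input (str): The input text from the user.
        system_message (str): Accepted for convenience, but currently not
            inserted into the prompt; only the user turn is formatted.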

    Returns:
        str: Formatted prompt for the model.
    """

    # Format user message
    user_prompt = f"<|im_start|>user\n{user_input}<|im_end|>\n"

    # Start assistant's turn
    assistant_prompt = "<|im_start|>assistant\n"

    # Combine prompts
    formatted_prompt = user_prompt + assistant_prompt

    return formatted_prompt

In [None]:
# Set up text generation pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device='mps')

# Example prompt
prompt = format_chat_prompt(dataset['valid']['prompt'][0][0]['content'])

# Generate output
outputs = generator(prompt, max_length=100, truncation=True, num_return_sequences=1, temperature=0.7)

print(outputs[0]['generated_text'])
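
The printed text contains the prompt followed by the model's answer. A small sketch for pulling out just the title is shown below; it assumes the pipeline returns the full text (prompt plus completion), and the helper name `extract_completion` is hypothetical.

```python
def extract_completion(generated_text, prompt):
    """Return only the text the model added after the prompt."""
    # Usual case: the output starts with the exact prompt string
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()

    # Fallback: keep whatever follows the last assistant marker
    marker = "<|im_start|>assistant\n"
    return generated_text.rsplit(marker, 1)[-1].strip()

# Made-up strings standing in for the prompt and outputs[0]['generated_text']
example_prompt = "<|im_start|>user\nWrite a title.<|im_end|>\n<|im_start|>assistant\n"
example_output = example_prompt + "Docker in Plain English for Total Beginners"
print(extract_completion(example_output, example_prompt))
```

In the cell above this would be called as `extract_completion(outputs[0]['generated_text'], prompt)`.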

In [None]:
# Train model

ft_model_name = model_name.split('/')[1].replace("Instruct", "DPO")

training_args = DPOConfig(
    output_dir=ft_model_name,
    logging_steps=25,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3, # 3 epochs are enough when fine-tuning only slightly changes behaviour
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    save_strategy="epoch",
    eval_strategy="epoch",
    eval_steps=1,
)

device = torch.device('mps')

In [None]:
trainer = DPOTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset['train'],
    eval_dataset=dataset['valid'],
)

trainer.train() # this line starts actual fine-tuning

In [None]:
# use fine-tuned model

# Load the fine-tuned model
ft_model = trainer.model

# Set up text generation pipeline
generator = pipeline("text-generation", model=ft_model, tokenizer=tokenizer, device='mps')

# Example prompt
prompt = format_chat_prompt(dataset['valid']['prompt'][0][0]['content'])

# Generate output
outputs = generator(prompt, max_length=100, truncation=True, num_return_sequences=1, temperature=0.7)

print(outputs[0]['generated_text'])

In [None]:
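# push to HF hub

model_id = f"yaswanth/{ft_model_name}"
# Trainer.push_to_hub takes a commit message as its first argument, not a repo id,
# so push the model and tokenizer to the target repo explicitly
trainer.model.push_to_hub(model_id)
tokenizer.push_to_hub(model_id)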

## Step-4 (Evaluate the fine-tuned model by generating titles with the base and fine-tuned models and evaluating them manually)

In [None]:
import csv
import random
from functions import generate_title
from transformers import AutoModelForCausalLM, AutoTokenizer
import pandas as pd

from openai import OpenAI
from dotenv import load_dotenv
import os

# load vars from .env
load_dotenv()

# connect to openai API
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

In [None]:
# load ideas

# Open and read the CSV file
with open('data/ideas.csv', mode='r', encoding='utf-8') as file:
    reader = csv.reader(file)
    next(reader) # skip first line

    # initialize list to store ideas
    idea_list = []

    for row in reader:
        idea_list.append(row[0])

In [None]:
# Randomly select 50 ideas

random.seed(0)
random_ideas = random.sample(idea_list, 50)
print(random_ideas)

In [None]:
# generate titles from base and fine-tuned models

# The base model and checkpoint folder should match what was used in Step-3
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

checkpoint = 258
ft_model = AutoModelForCausalLM.from_pretrained(f"./Qwen2.5-0.5B-DPO/checkpoint-{checkpoint}")

In [None]:
base_title_list = []
ft_title_list = []

for idea in random_ideas:
    base_title_list.extend(generate_title(idea, model, tokenizer, num_titles=1))
    ft_title_list.extend(generate_title(idea, ft_model, tokenizer, num_titles=1))

In [None]:
df = pd.DataFrame({"base_title":base_title_list, "ft_title":ft_title_list})
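
Before reading the titles by hand, a quick automatic check can show how often each model respects the 30 to 75 character guidance from the prompt. This is only a sketch; the function name `share_in_length_range` and the example titles are made up.

```python
def share_in_length_range(titles, min_len=30, max_len=75):
    """Fraction of titles whose character count is within [min_len, max_len]."""
    if not titles:
        return 0.0
    in_range = [min_len <= len(title) <= max_len for title in titles]
    return sum(in_range) / len(titles)

# Made-up titles: the first is too short, the second is within range
example_titles = [
    "Docker in 5 Minutes",
    "The Fastest Way to Learn Docker From Scratch",
]
print(share_in_length_range(example_titles))
```

With the lists from the loop above this would be `share_in_length_range(base_title_list)` and `share_in_length_range(ft_title_list)`. It says nothing about quality, only about following the length instruction.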

In [None]:
# Check the Outputs Manually

df.head()
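
For the manual check, one option is to hide which model wrote which title and then count how often the fine-tuned title is picked. The sketch below is an assumption about how to organise that, not part of the original video; `make_blind_pairs` and `ft_win_rate` are hypothetical helper names.

```python
import random

def make_blind_pairs(base_titles, ft_titles, seed=0):
    """Randomise the display order of each base / fine-tuned title pair."""
    rng = random.Random(seed)
    pairs = []
    for base, ft in zip(base_titles, ft_titles):
        ft_is_first = rng.random() < 0.5
        first, second = (ft, base) if ft_is_first else (base, ft)
        pairs.append({"first": first, "second": second, "ft_is_first": ft_is_first})
    return pairs

def ft_win_rate(pairs, picks):
    """Share of pairs where the pick ('first' or 'second') was the fine-tuned title."""
    wins = sum(
        (pick == "first") == pair["ft_is_first"]
        for pair, pick in zip(pairs, picks)
    )
    return wins / len(pairs)

# Made-up titles and picks, just to show the flow
pairs = make_blind_pairs(["base title 1", "base title 2"], ["ft title 1", "ft title 2"])
picks_always_ft = ["first" if p["ft_is_first"] else "second" for p in pairs]
print(ft_win_rate(pairs, picks_always_ft))
```

In practice `picks` would come from reading `first` and `second` for each pair without looking at `ft_is_first`, using `df["base_title"]` and `df["ft_title"]` as the inputs.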