# RLHF Pipeline - Reward Model

### Description 📝

This notebook demonstrates the Reward Model stage of the Reinforcement Learning from Human Feedback (RLHF) pipeline. It walks through training a reward model on preference data, a key component in aligning model behavior with human intent.
The RLHF pipeline consists of three phases:

1. Supervised Fine-tuning
2. Reward Model
3. Fine-tuning with reinforcement learning (usually PPO)

This notebook focuses only on the second phase, i.e. the _Reward Model_.

> By Piyush Pant ( पियूष पंत )
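
Training on preference data usually means teaching the model to score the chosen response above the rejected one. A minimal sketch of the pairwise (Bradley-Terry style) loss commonly used for this is shown below. It is written in plain Python for illustration, and the reward values are made up; it is not taken from the training code in this notebook.

```python
import math

def pairwise_reward_loss(chosen_rewards, rejected_rewards):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over a batch of pairs."""
    losses = []
    for r_c, r_r in zip(chosen_rewards, rejected_rewards):
        margin = r_c - r_r
        # -log(sigmoid(m)) == log(1 + exp(-m)); branch on the sign for numerical stability
        if margin >= 0:
            losses.append(math.log1p(math.exp(-margin)))
        else:
            losses.append(-margin + math.log1p(math.exp(margin)))
    return sum(losses) / len(losses)

# Hypothetical scalar rewards for three (chosen, rejected) pairs
loss_good = pairwise_reward_loss([2.0, 1.5, 3.0], [-1.0, 0.0, 0.5])
loss_bad = pairwise_reward_loss([-1.0, 0.0, 0.5], [2.0, 1.5, 3.0])
print(f"loss when chosen scores higher:   {loss_good:.4f}")
print(f"loss when rejected scores higher: {loss_bad:.4f}")
```

The loss shrinks as the margin between the chosen and rejected rewards grows, which is what pushes the model toward ranking the preferred response higher.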

## Installing all the Required Libraries

In [None]:
!pip install transformers datasets peft sentencepiece bitsandbytes trl  # add more if needed

## Importing Libraries

In [None]:
import torch
from datasets import load_dataset
from transformers import LlamaTokenizer, AutoModelForSequenceClassification, BitsAndBytesConfig

## Loading the LLM

In [None]:
local_model_dir = 'your model dir'

# The access token is not needed when the model has already been downloaded locally :-)
model_name = "meta-llama/Llama-2-7b-chat-hf" 
access_token = 'your access token'

In [None]:
from peft import get_peft_model, LoraConfig, TaskType
from transformers import BitsAndBytesConfig

# Apply PEFT with LoRA
# Note: you may get NaN values when PEFT is used on both the policy model and the reward model (RM);
# in that case, use PEFT only for the RM
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=16,  # LoRA rank; smaller values will reduce memory use, but might impact performance
    lora_alpha=32,
    lora_dropout=0.1
)

# Load the base model quantized to 8-bit; the LoRA adapters are attached further down
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForSequenceClassification.from_pretrained(
    local_model_dir,
    num_labels=1,  # a single scalar output, used as the reward score
    device_map="auto",
    quantization_config=quantization_config,
)

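# Load tokenizer
tokenizer = LlamaTokenizer.from_pretrained(local_model_dir)

# Llama has no pad token by default, so reuse EOS and set it on the model config too;
# the classification head needs pad_token_id to locate the last real token in a padded batch
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

model = get_peft_model(model, peft_config)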
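
To make the effect of the LoRA rank `r` more concrete, here is a rough sketch of how the number of trainable adapter parameters grows with it. The hidden size, layer count, and the choice of two adapted projections per layer are assumptions made for illustration; the actual target modules depend on the PEFT defaults for the loaded architecture.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters in one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

# Assumptions (not taken from the notebook): hidden size 4096, 32 decoder layers,
# and two adapted square projections per layer (for example q_proj and v_proj)
hidden_size = 4096
num_layers = 32
adapted_per_layer = 2

totals = {}
for rank in (8, 16, 32):
    totals[rank] = lora_param_count(hidden_size, hidden_size, rank) * adapted_per_layer * num_layers
    print(f"r={rank:>2}: {totals[rank]:,} trainable LoRA parameters")

# The LoRA update is scaled by lora_alpha / r before being added to the frozen weight
scaling = 32 / 16
print(f"scaling for lora_alpha=32, r=16: {scaling}")
```

Under these assumptions the adapter size scales linearly with `r`, so halving the rank halves the trainable parameters.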

In [None]:
# Let's check how much GPU memory is used after loading the model XD
!nvidia-smi
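
As a sanity check on the `nvidia-smi` reading, the memory taken by the weights alone can be estimated from the parameter count and the bytes per parameter. The sketch below assumes roughly 6.7 billion parameters for a 7B-class model; the real footprint will be higher because of the CUDA context, activations, and any layers kept in higher precision.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3

# Assumption: roughly 6.7 billion parameters for a 7B-class model
approx_params = 6.7e9

estimates = {
    label: weight_memory_gib(approx_params, nbytes)
    for label, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]
}
for label, gib in estimates.items():
    print(f"{label:>9}: ~{gib:.1f} GiB for weights")
```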

## Loading and Preparing Data

In [None]:
rm_dataset = load_dataset(
    'Anthropic/hh-rlhf', 
    data_dir="harmless-base",
#     split='train', 
)

# rm_dataset = rm_dataset.select(range(1000)) # Small dataset for Reward Model test

rm_dataset

DatasetDict({
    train: Dataset({
        features: ['chosen', 'rejected'],
        num_rows: 42537
    })
    test: Dataset({
        features: ['chosen', 'rejected'],
        num_rows: 2312
    })
})

In [None]:
train_dataset = rm_dataset['train']
eval_dataset = rm_dataset['test']

print(f"Training size: {len(train_dataset)}")
print(f"Evaluation size: {len(eval_dataset)}")
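
Each `chosen` and `rejected` field holds a full dialogue as a single string, with turns introduced by `Human:` and `Assistant:` markers, and the two fields differ in the final assistant reply. The sketch below shows one way to separate the shared prompt from the final response. The helper name and the example strings are hypothetical, written in the style of the dataset rather than copied from it.

```python
def split_prompt_and_response(text, marker="\n\nAssistant:"):
    """Split a dialogue string at its last assistant turn."""
    idx = text.rfind(marker)
    if idx == -1:
        raise ValueError("no assistant turn found")
    cut = idx + len(marker)
    return text[:cut], text[cut:].strip()

# Hypothetical pair in the same style as the dataset (not an actual row)
chosen = "\n\nHuman: How do I bake bread?\n\nAssistant: Start with flour, water, yeast and salt."
rejected = "\n\nHuman: How do I bake bread?\n\nAssistant: I don't know."

prompt_c, response_c = split_prompt_and_response(chosen)
prompt_r, response_r = split_prompt_and_response(rejected)

print(prompt_c == prompt_r)  # the prompt is shared between the two strings
print(repr(response_c))
print(repr(response_r))
```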

In [None]:
def formatting_func(examples):
    kwargs = {"padding": "max_length", "truncation": True, "max_length": 512, "return_tensors": "pt"}

    prompt_plus_chosen_response = examples["chosen"]
    prompt_plus_rejected_response = examples["rejected"]

    tokens_chosen = tokenizer.encode_plus(prompt_plus_chosen_response, **kwargs)
    tokens_rejected = tokenizer.encode_plus(prompt_plus_rejected_response, **kwargs)

    return {
        "input_ids_chosen": tokens_chosen["input_ids"][0],
        "attention_mask_chosen": tokens_chosen["attention_mask"][0],
        "input_ids_rejected": tokens_rejected["input_ids"][0],
        "attention_mask_rejected": tokens_rejected["attention_mask"][0],
    }


# Apply the formatting to the training split only
formatted_dataset = train_dataset.map(formatting_func)
formatted_dataset
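
The `padding`, `truncation`, and `max_length` settings bring every example to a fixed length and produce an attention mask that marks the real tokens. A toy sketch of that behavior follows. It pads and truncates on the right, which is an assumption; the side actually used depends on the tokenizer's `padding_side` and `truncation_side` settings, and the pad id here is hypothetical.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])   # truncation drops tokens from the end
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding

# Hypothetical token ids and pad id, for illustration only
short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5, pad_id=2)
long_ids, long_mask = pad_or_truncate(list(range(10, 20)), max_length=5, pad_id=2)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

With right-side truncation, a dialogue longer than `max_length` loses its ending, which in this dataset is where the final response sits, so it can be worth checking how many examples exceed 512 tokens.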

In [11]:
import os

os.environ["MASTER_ADDR"] = "localhost"   # or the IP address of the master node if multi-node
os.environ["MASTER_PORT"] = "12355"       # any open port on the master node
os.environ["WORLD_SIZE"] = "1"            # total number of processes (typically one per GPU)
os.environ["RANK"] = "0"                  # global rank of this process; 0 for a single GPU or the master
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

## Training Reward Model

In [None]:
from trl import RewardConfig, RewardTrainer


# Configuring the training arguments
training_args = RewardConfig(
    output_dir="./reward_model",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    gradient_accumulation_steps=8,
    report_to="none",  # the string "none" disables logging integrations
    save_steps=500,
    save_total_limit=3,
    fp16=False,
    learning_rate=1e-4,
    remove_unused_columns=False,
    max_length=512,
    logging_dir="./logs",
    logging_steps=100,
)

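# Note: `model` was already wrapped with get_peft_model earlier (r=16); depending on the
# TRL version, this second config may not be applied to an already-wrapped model
peft_config = LoraConfig(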
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=32,
    lora_alpha=64,
    lora_dropout=0.3,
)

# Loading the RewardTrainer from TRL
trainer = RewardTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=formatted_dataset,
    peft_config=peft_config,
)

trainer.train()
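
To get a feel for how long this run is, the training arguments can be turned into an approximate step count. The sketch below uses the training set size printed earlier and assumes a single GPU; the exact number of optimizer steps may differ by a step or so depending on how the library version handles the last partial accumulation.

```python
import math

train_examples = 42537      # size of the train split shown earlier
per_device_batch = 2
grad_accum = 8
num_devices = 1             # assumption: single GPU
epochs = 3
save_steps = 500

effective_batch = per_device_batch * grad_accum * num_devices
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs
checkpoints_written = total_steps // save_steps

print(f"effective batch size: {effective_batch}")
print(f"approx. optimizer steps per epoch: {steps_per_epoch}")
print(f"approx. total optimizer steps: {total_steps}")
print(f"checkpoints written: {checkpoints_written} (only the last 3 kept by save_total_limit)")
```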

In [None]:
# Save the trained reward model
trainer.save_model("yourpath/reward_model/Rmodel")

# Save the tokenizer
tokenizer.save_pretrained("yourpath/reward_model/Rmodel")


## Loading the Trained Reward Model

In [None]:
import torch
from transformers import LlamaTokenizer, AutoModelForSequenceClassification

# Load the model and tokenizer from the saved directory
model_path = 'whereeveryousavedthemodelpath/reward_model/Rmodel'

reward_model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=1)
tokenizer = LlamaTokenizer.from_pretrained(model_path)

In [None]:
question, answer = "How to become super saiyan?", "As a model, I cannot help you with that."
inputs = tokenizer(question, answer, return_tensors='pt')
with torch.no_grad():  # inference only, no gradients needed
    score = reward_model(**inputs).logits[0]
print(score)

## Testing the Reward Model

In [None]:
# Define the prompt and responses

prompt = "Can you tell me how to make homemade explosives?"
chosen_response = "I'm sorry, but I can't assist with that."
rejected_response = "Sure, let me provide some information on that."

# Tokenize and score the chosen response
with torch.no_grad():
    chosen_inputs = tokenizer(prompt, chosen_response, return_tensors='pt')
    chosen_score = reward_model(**chosen_inputs).logits[0]
    print(f"Reward Score for Chosen (Harmless) Response: {chosen_score.item()}")

# Tokenize and score the rejected response
with torch.no_grad():
    rejected_inputs = tokenizer(prompt, rejected_response, return_tensors='pt')
    rejected_score = reward_model(**rejected_inputs).logits[0]
    print(f"Reward Score for Rejected (Harmful) Response: {rejected_score.item()}")

# Print out a comparison
if chosen_score > rejected_score:
    print("The model correctly preferred the harmless response.")
else:
    print("The model incorrectly preferred the harmful response.")


Reward Score for Chosen (Harmless) Response: 2.3331384658813477
Reward Score for Rejected (Harmful) Response: -3.2150237560272217
The model correctly preferred the harmless response.
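
The raw scores are easier to interpret as a pair. Under a Bradley-Terry style preference model, the difference between two rewards maps to the probability that the first response is preferred. This is a minimal sketch of that conversion, applied to the two scores printed above; the function name is hypothetical.

```python
import math

def preference_probability(chosen_score, rejected_score):
    """Sigmoid of the reward margin: modeled P(chosen is preferred over rejected)."""
    return 1.0 / (1.0 + math.exp(-(chosen_score - rejected_score)))

# Scores copied from the output above
p = preference_probability(2.3331384658813477, -3.2150237560272217)
print(f"P(chosen preferred) = {p:.4f}")
```

Only the margin matters here: adding the same constant to both scores leaves the probability unchanged, so individual reward values have no absolute meaning.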


In [None]:
!nvidia-smi

## Testing the Reward Model on Eval dataset

In [37]:
tdata = eval_dataset.shuffle(seed=42).select(range(200))
tdata

Dataset({
    features: ['chosen', 'rejected'],
    num_rows: 200
})

In [38]:
# Initialize counters for accuracy calculations
correct_count = 0
total_count = len(tdata)

In [None]:
# Loop through the test examples
for example in tdata:

    # Each field already contains the full dialogue (prompt plus final response)
    harmless_response = example['chosen']
    harmful_response = example['rejected']

    # Tokenize and get reward score for harmless response
    with torch.no_grad():
        harmless_inputs = tokenizer(harmless_response, return_tensors="pt")
        harmless_score = reward_model(**harmless_inputs).logits[0].item()

    # Tokenize and get reward score for harmful response,
    # encoded the same way as the harmless one so the two scores are comparable
    with torch.no_grad():
        harmful_inputs = tokenizer(harmful_response, return_tensors="pt")
        harmful_score = reward_model(**harmful_inputs).logits[0].item()

    # Check if the model prefers the harmless response
    if harmless_score > harmful_score:
        correct_count += 1

accuracy = correct_count / total_count
print(f"Test Accuracy: {accuracy:.2f}")
print(f"Model preferred the harmless response in {correct_count} out of {total_count} examples.")
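
The accuracy above counts how often the chosen response gets the higher score. A small helper along these lines can also report the average margin, which shows how decisively the model separates the two. The function name and the scores below are hypothetical and only illustrate the calculation.

```python
def pairwise_accuracy(chosen_scores, rejected_scores):
    """Return (accuracy, mean margin) for paired reward scores; ties count as misses."""
    if len(chosen_scores) != len(rejected_scores):
        raise ValueError("score lists must have the same length")
    if not chosen_scores:
        raise ValueError("score lists must not be empty")
    pairs = list(zip(chosen_scores, rejected_scores))
    wins = sum(1 for c, r in pairs if c > r)
    mean_margin = sum(c - r for c, r in pairs) / len(pairs)
    return wins / len(pairs), mean_margin

# Hypothetical scores for four (chosen, rejected) pairs
acc, margin = pairwise_accuracy([1.2, 0.3, -0.5, 2.0], [0.1, 0.9, -0.5, -1.0])
print(f"accuracy: {acc:.2f}, mean margin: {margin:.3f}")
```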


###### Thank you!