# Rewards Model

It is recommended to use an `AutoModelForSequenceClassification` as the reward model. 

The reward model should be trained on a dataset of paired examples, where each example is a tuple of two sequences. It is trained to predict which example in the pair is more relevant to the task at hand.

### Key takeaways:

1. The dataset is prepared in a special way: we tokenize the chosen and the rejected responses separately, each paired with its prompt. This lets us run a separate forward pass for each option.

2. The loss function for the rewards model uses a special formula: for the two options in each pair, we take the negative logsigmoid of the difference between the model's output for the chosen option and its output for the rejected option.



### The training

The training loop implements a "ranking loss" that teaches the model to distinguish between "chosen" and "rejected" samples: the model should assign a higher score to each chosen sample than to its rejected counterpart.

#### Forward Passes
Chosen Samples: For each batch, we do a forward pass through the model for the "chosen" samples (e.g., the better answers, or the answers that should be ranked higher). The model returns some scores stored in rewards_chosen.

Rejected Samples: Similarly, we do another forward pass for the "rejected" samples (e.g., the worse answers, or the answers that should be ranked lower). The model returns some scores stored in rewards_rejected.

So, for each batch, we get two sets of scores: one for the chosen and one for the rejected samples.
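
The two forward passes can be sketched with a toy stand-in for the real model. Everything here is an illustrative assumption: `TinyScorer`, `toy_model`, the vocabulary size, and the random token ids are made up, and the stand-in returns one scalar per sequence instead of GPT-2 logits.

```python
import torch
from torch import nn

class TinyScorer(nn.Module):
    """Toy stand-in for a sequence-classification model: token ids -> one score."""

    def __init__(self, vocab_size=100, hidden=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask):
        # Mean-pool the embeddings of the non-padded tokens, then project to a scalar.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (self.embed(input_ids) * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        return self.score(pooled).squeeze(-1)

torch.manual_seed(0)
toy_model = TinyScorer()

# A made-up batch of 4 pairs, each sequence 6 tokens long.
toy_chosen_ids = torch.randint(1, 100, (4, 6))
toy_rejected_ids = torch.randint(1, 100, (4, 6))
toy_mask = torch.ones(4, 6, dtype=torch.long)

# One forward pass per side of the pair, through the same model.
toy_rewards_chosen = toy_model(toy_chosen_ids, toy_mask)
toy_rewards_rejected = toy_model(toy_rejected_ids, toy_mask)
print(toy_rewards_chosen.shape, toy_rewards_rejected.shape)
```

Both passes share the same weights, so the gradient of the loss flows into a single set of parameters from both the chosen and the rejected side.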

#### Custom Loss Calculation
Loss Calculation: The loss function aims to ensure that the score for each chosen sample is higher than the score for each corresponding rejected sample. Specifically, we want the model to maximize the difference (rewards_chosen - rewards_rejected).

1. rewards_chosen - rewards_rejected: This calculates the difference between the rewards for each pair of chosen and rejected samples.
2. logsigmoid(x): This is the logarithm of the sigmoid function, a smooth function that maps x to the range (-∞, 0). If x is a large positive number (meaning rewards_chosen is much larger than rewards_rejected), then logsigmoid(x) approaches 0 from below. If x is a large negative number, logsigmoid(x) is approximately x, a large negative value.
3. Negative sign: We want to maximize logsigmoid(x), but optimizers minimize. Negating it gives a non-negative loss that approaches 0 as the chosen score pulls ahead of the rejected one.
4. .mean(): Finally, we take the mean over all pairs in a batch. This is our final loss value for the batch.
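
To make the four steps concrete, here is a minimal, dependency-free sketch of the same computation on made-up scores. The helper names (`log_sigmoid`, `pairwise_ranking_loss`) and all the numbers are illustrative assumptions, not part of the notebook.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x)); the result is always <= 0."""
    # log(sigmoid(x)) = -log(1 + exp(-x)) = min(x, 0) - log(1 + exp(-|x|))
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))

def pairwise_ranking_loss(rewards_chosen, rewards_rejected):
    """Mean of -log(sigmoid(chosen - rejected)) over all pairs."""
    diffs = [c - r for c, r in zip(rewards_chosen, rewards_rejected)]
    return -sum(log_sigmoid(d) for d in diffs) / len(diffs)

# Hypothetical scores for three (chosen, rejected) pairs.
well_ranked = pairwise_ranking_loss([2.0, 3.0, 1.5], [-1.0, 0.0, -2.0])
undecided = pairwise_ranking_loss([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
mis_ranked = pairwise_ranking_loss([-1.0, 0.0, -2.0], [2.0, 3.0, 1.5])

print(f"well ranked: {well_ranked:.4f}")
print(f"undecided:   {undecided:.4f}")  # equals ln(2)
print(f"mis-ranked:  {mis_ranked:.4f}")
```

When the model cannot separate the two responses (a difference of 0), the loss is ln 2 ≈ 0.693. This is a useful reference point: an untrained model should start near that value, and the loss should fall below it as the ranking improves.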

In [1]:
from transformers import GPT2ForSequenceClassification, GPT2Tokenizer
import torch
from torch.utils.data import DataLoader, TensorDataset
import torch.nn.functional as F
import torch.optim as optim
from torch import nn

import json 
import random 

from datasets import load_dataset

torch.cuda.empty_cache()





In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [3]:

# Initialize tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # Set padding token
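# num_labels=2 yields two logits per sequence, and the training loss below is
# averaged over both; a single scalar reward per sequence would use num_labels=1.
model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)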
model.to(device)

# IMPORTANT: Update the padding token ID in the model configuration
model.config.pad_token_id = model.config.eos_token_id

# Access the config to get the context size (max_position_embeddings)
context_size = model.config.max_position_embeddings
print(f"The context size of this model is {context_size} tokens.")


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The context size of this model is 1024 tokens.


### Load dataset from HF

The HF dataset is expected to have three columns: `prompt`, `chosen`, and `rejected`.

In [4]:
# Load the summarization preference dataset from the Hugging Face Hub
dataset = load_dataset("JuanKO/RLAIF_summarization_preference_gpt35")

train_dataset = dataset['train']



In [None]:
# Prepare data
prompts = [item['prompt'] for item in train_dataset]
chosen = [item['chosen'] for item in train_dataset]
rejected = [item['rejected'] for item in train_dataset]

# Tokenization settings
max_length = 512  # Choose a max_length that fits your data (it must not exceed the model's context size)


In [None]:
# Tokenize each (prompt, response) pair once. The tokenizer returns both
# input_ids and attention_mask, so there is no need to call it twice per sample.
chosen_enc = tokenizer(prompts, chosen, truncation=True, padding='max_length',
                       max_length=max_length, return_tensors='pt')
rejected_enc = tokenizer(prompts, rejected, truncation=True, padding='max_length',
                         max_length=max_length, return_tensors='pt')

# Move the tensors to the training device
chosen_input_ids = chosen_enc['input_ids'].to(device)
chosen_attention_mask = chosen_enc['attention_mask'].to(device)
rejected_input_ids = rejected_enc['input_ids'].to(device)
rejected_attention_mask = rejected_enc['attention_mask'].to(device)

dataset = TensorDataset(chosen_input_ids, chosen_attention_mask, rejected_input_ids, rejected_attention_mask)
loader = DataLoader(dataset, batch_size=16, shuffle=True)


### Special loss function
For the reward model we use a loss function based on the log of the sigmoid.

In this type of problem, we are less concerned with the absolute values of the outputs and more concerned with the relative difference between a "preferred" and a "rejected" output. The idea is to maximize the difference between the "preferred" and the "rejected" output so that the model learns to rank them correctly.
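
Written out, the objective described here is a pairwise (Bradley-Terry style) ranking loss. The symbols are notation introduced for illustration only: $r_\theta(x, y)$ is the score the model assigns to response $y$ for prompt $x$, $y_{c,i}$ and $y_{r,i}$ are the chosen and rejected responses of pair $i$, and $N$ is the number of pairs in the batch.

```latex
\mathcal{L}(\theta)
  = -\frac{1}{N} \sum_{i=1}^{N}
    \log \sigma\!\big( r_\theta(x_i, y_{c,i}) - r_\theta(x_i, y_{r,i}) \big),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
```

Only the difference of the two scores enters the loss, so adding the same constant to every score leaves it unchanged. This is why the absolute values of the outputs matter less than their ordering.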

#### Important:
The model function is applied separately to the "chosen" and "rejected" sets of input IDs and attention masks. This implies that the "chosen" and "rejected" samples are passed separately through the model to compute their respective logits (or "rewards" in this context).

Here's a breakdown:

Separate Forward Passes: The model performs a forward pass for the "chosen" inputs (chosen_input_ids and chosen_attention_mask) and another forward pass for the "rejected" inputs (rejected_input_ids and rejected_attention_mask).

Logits as Rewards: The outputs (rewards_chosen and rewards_rejected) are taken as the model's estimated "rewards" or utilities for the "chosen" and "rejected" inputs.

Loss Computation: The loss is computed from these "rewards" by applying the log-sigmoid function to the difference between the two, as described in the Custom Loss Calculation section above.

This method is very explicit about which samples are "chosen" and which ones are "rejected," as they are processed separately.

#### Note
If we have more than 2 options per prompt (one preferred and several rejected), we could use a softmax-based loss over the scores of all options instead. Something like this:

    loss = -F.log_softmax(scores, dim=-1)[:, 0].mean()  # scores: (batch, num_options), with the preferred option in column 0
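
One hedged way to sketch the multi-option case is to treat the scores of all candidates as logits and apply cross-entropy against the index of the preferred candidate. The function name `listwise_preference_loss`, the tensor shapes, and the example scores are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def listwise_preference_loss(scores: torch.Tensor, preferred_index: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over candidate scores.

    scores: (batch, num_options) float tensor, one score per candidate answer.
    preferred_index: (batch,) long tensor holding the column of the preferred answer.
    """
    return F.cross_entropy(scores, preferred_index)

# Hypothetical scores: 2 prompts with 3 candidate answers each.
scores = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.2, 0.3]])
preferred_index = torch.tensor([0, 1])
loss = listwise_preference_loss(scores, preferred_index)

# Sanity check: with exactly two options, this reduces to the pairwise
# -logsigmoid(chosen - rejected) loss used in this notebook.
chosen = torch.tensor([1.5, -0.2])
rejected = torch.tensor([0.3, 0.4])
two_option = listwise_preference_loss(torch.stack([chosen, rejected], dim=1),
                                      torch.zeros(2, dtype=torch.long))
pairwise = -F.logsigmoid(chosen - rejected).mean()
print(loss.item(), two_option.item(), pairwise.item())
```

The two-option check works because `-log(exp(c) / (exp(c) + exp(r)))` equals `-log(sigmoid(c - r))`, so the softmax form is a direct generalization of the pairwise loss.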
    

In [7]:
# Training setup
optimizer = optim.AdamW(model.parameters(), lr=1e-5)
epochs = 3

# Training loop
for epoch in range(epochs):
    for i, batch in enumerate(loader):
        
        chosen_input_ids, chosen_attention_mask, rejected_input_ids, rejected_attention_mask = batch
        optimizer.zero_grad()
        
        # Forward pass for the "chosen" samples
        rewards_chosen = model(input_ids=chosen_input_ids, attention_mask=chosen_attention_mask)[0]
        
        # Forward pass for the "rejected" samples
        rewards_rejected = model(input_ids=rejected_input_ids, attention_mask=rejected_attention_mask)[0]
        
        # Compute the custom loss
        # Here's how this loss function works:
        # 1. rewards_chosen and rewards_rejected are the output scores from the model for the "chosen" and "rejected" summaries, respectively.
        # 2. rewards_chosen - rewards_rejected computes the difference between these two rewards. If the model is correctly ranking the summaries, this difference should be positive (i.e., rewards_chosen should be greater than rewards_rejected).
        # 3. nn.functional.logsigmoid(x) computes the log-sigmoid of x. The log-sigmoid function maps its input to a range between negative infinity and zero. For positive inputs, the output is closer to zero, and for negative inputs, the output is a large negative number.
        # 4. By minimizing the negative log-sigmoid of the difference, the model is encouraged to make rewards_chosen - rewards_rejected as large as possible, thereby pushing rewards_chosen to be higher than rewards_rejected.
        # 5. Thus, the model never receives an explicit label. It learns to give higher scores to "chosen" summaries and lower scores to "rejected" ones only because the chosen score enters the difference with a positive sign and the rejected score with a negative sign.
        loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()
        
        loss.backward()
        optimizer.step()
        
        print(f"Epoch {epoch+1}/{epochs}, Batch {i+1}/{len(loader)}, Loss: {loss.item()}")
    


Epoch 1/3, Batch 1/63, Loss: 0.7211203575134277
Epoch 1/3, Batch 2/63, Loss: 0.7099162340164185
Epoch 1/3, Batch 3/63, Loss: 0.7835023403167725
Epoch 1/3, Batch 4/63, Loss: 0.735124945640564
Epoch 1/3, Batch 5/63, Loss: 0.7402579188346863
Epoch 1/3, Batch 6/63, Loss: 0.7947850823402405
Epoch 1/3, Batch 7/63, Loss: 0.699426531791687
Epoch 1/3, Batch 8/63, Loss: 0.6798833012580872
Epoch 1/3, Batch 9/63, Loss: 0.8219254016876221
Epoch 1/3, Batch 10/63, Loss: 0.6807708740234375
Epoch 1/3, Batch 11/63, Loss: 0.7001925706863403
Epoch 1/3, Batch 12/63, Loss: 0.6904599666595459
Epoch 1/3, Batch 13/63, Loss: 0.7209701538085938
Epoch 1/3, Batch 14/63, Loss: 0.7088404893875122
Epoch 1/3, Batch 15/63, Loss: 0.6960053443908691
Epoch 1/3, Batch 16/63, Loss: 0.6933815479278564
Epoch 1/3, Batch 17/63, Loss: 0.6811074018478394
Epoch 1/3, Batch 18/63, Loss: 0.6584429740905762
Epoch 1/3, Batch 19/63, Loss: 0.6772416830062866
Epoch 1/3, Batch 20/63, Loss: 0.6997806429862976
Epoch 1/3, Batch 21/63, Loss: 0
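
The loss value alone does not say how often the model ranks a pair correctly. A hedged sketch of a pairwise-accuracy helper that could be called on `rewards_chosen` and `rewards_rejected` inside the loop follows. The name `pairwise_accuracy` is hypothetical, and averaging over the label dimension is an assumption made so the helper also works when the head emits more than one logit per sequence.

```python
import torch

def pairwise_accuracy(rewards_chosen: torch.Tensor, rewards_rejected: torch.Tensor) -> float:
    """Fraction of pairs in which the chosen response outscores the rejected one."""
    # Assumption: if the head emits several logits per sequence,
    # average them into a single score per sequence.
    if rewards_chosen.dim() > 1:
        rewards_chosen = rewards_chosen.mean(dim=-1)
        rewards_rejected = rewards_rejected.mean(dim=-1)
    return (rewards_chosen > rewards_rejected).float().mean().item()

# Made-up scores for four pairs; three of them are ranked correctly.
acc = pairwise_accuracy(torch.tensor([1.2, 0.3, 2.0, -0.5]),
                        torch.tensor([0.4, 0.9, 1.0, -1.5]))
print(acc)
```

An accuracy near 0.5 means the model is still guessing. Tracking it on a held-out split alongside the loss gives a more interpretable picture of progress.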

In [8]:
# Save the model to a directory
save_directory = "rlaif_rewards_model"
model.save_pretrained(save_directory)

# Optionally, save the tokenizer as well, especially if you've added special tokens or made other changes
tokenizer.save_pretrained(save_directory)


('rlaif_rewards_model\\tokenizer_config.json',
 'rlaif_rewards_model\\special_tokens_map.json',
 'rlaif_rewards_model\\vocab.json',
 'rlaif_rewards_model\\merges.txt',
 'rlaif_rewards_model\\added_tokens.json')

In [9]:
import torch.nn.functional as F

def calc_reward(model, tokenizer, prompt, answer1, answer2):
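    # Tokenize the input.
    # Note: a single prompt with a list as text_pair is encoded as ONE sequence, which is why
    # the logits printed below have shape (1, 2): two class logits for one input, not one score per answer.
    inputs = tokenizer(prompt, [answer1, answer2], return_tensors='pt', padding=True, truncation=True, max_length=100)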
    
    model.to(device)
    inputs.to(device)

    # Get model output
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits

    # Calculate probabilities
    probs = F.softmax(logits, dim=-1)

    # Interpret the result 
    if probs[0, 0] > probs[0, 1]:
        print(f"The model prefers '{answer1}' with a probability of {probs[0, 0]:.4f}")
    else:
        print(f"The model prefers '{answer2}' with a probability of {probs[0, 1]:.4f}")
        
    return logits

In [10]:
# Test the function
prompt = "What are the latest developments in artificial intelligence?"
answer1 = "GANs are revolutionizing image creation, and NLP models like GPT-3 are transforming language tasks."
answer2 = "AI is making strides in healthcare for diagnosis, and reinforcement learning is advancing robotics."

logits = calc_reward(model, tokenizer, prompt, answer1, answer2)
print(logits)

The model prefers 'AI is making strides in healthcare for diagnosis, and reinforcement learning is advancing robotics.' with a probability of 0.5771
tensor([[3.1124, 3.4233]], device='cuda:0')


In [11]:
prompt = "What is the current state of the economy?"
answer1 = "I'm seeing some of the data back on here, about how much we need to increase our business expenditures. In a recent report, the Congressional Budget Office's Bureau of Economic Analysis estimated that the"
answer2 = "And how has your government done that?\n\nLudwig von Mises\n\nFrom the top down, the economy has become much better than it has been in the past several years."

logits = calc_reward(model, tokenizer, prompt, answer1, answer2)
print(logits)

The model prefers 'And how has your government done that?

Ludwig von Mises

From the top down, the economy has become much better than it has been in the past several years.' with a probability of 0.5764
tensor([[3.1492, 3.4571]], device='cuda:0')
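
One way to compare two answers with a pairwise-trained reward model is to score each (prompt, answer) sequence in its own row and compare the resulting scores, mirroring the separate forward passes used during training. The sketch below is a hedged illustration: `score_answers` and `preference_probability` are hypothetical helpers, and averaging over the label dimension is an assumption about how to reduce the logits to one score per answer.

```python
import torch

def score_answers(model, tokenizer, prompt, answers, device, max_length=512):
    """Return one scalar score per answer, each scored as its own (prompt, answer) row."""
    # Repeat the prompt so that every answer is paired with it in a separate row.
    enc = tokenizer([prompt] * len(answers), list(answers),
                    return_tensors="pt", padding=True,
                    truncation=True, max_length=max_length)
    enc = {name: tensor.to(device) for name, tensor in enc.items()}

    model.eval()  # note: switches the model to evaluation mode
    with torch.no_grad():
        logits = model(**enc).logits  # shape: (len(answers), num_labels)

    # Assumption: average over the label dimension to get one score per row
    # (this leaves the value unchanged when the head has num_labels=1).
    return logits.mean(dim=-1)

def preference_probability(scores):
    """P(first answer preferred over second) = sigmoid(score_0 - score_1)."""
    return torch.sigmoid(scores[0] - scores[1]).item()

# Usage with the objects defined in this notebook (model already moved to device):
# scores = score_answers(model, tokenizer, prompt, [answer1, answer2], device)
# print(scores, preference_probability(scores))
```

Using the sigmoid of the score difference keeps inference consistent with the training loss, which is the negative log of that same quantity.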


In [12]:
import getpass
hf_token = getpass.getpass("Enter your HUGGINGFACE TOKEN: ")

In [None]:
hf_hub = "ENTER YOUR HUGGINGFACE MODEL REPO"
model.push_to_hub(hf_hub, token=hf_token)