# How ChatGPT Works Part 2: The Reward Model

> The reward model is trained to take a prompt and a response and predict a scalar value representing how good that response is.

![](./images/Reward%20Model.png)


> The point of the reward model is to enable us to do reinforcement learning in step 3, by providing a reward signal for newly generated responses.

## Data Collection and Labelling

The model is trained on a dataset of multiple responses to the same prompt.

To construct a dataset, human labellers were asked to rank different responses to the same prompt.

Typically, humans can store 5-9 items in their working memory.
So to make labelling faster and more accurate, it can be useful to limit the number of items that need to be compared.

> In the procedure used to train InstructGPT, the labellers were asked to rank between $K=4$ and $K=9$ responses to a single prompt at once.

This produces ${K \choose 2}$ ("K choose 2": the number of ways to choose 2 items from a set of K) different pairs of responses for each prompt.

- E.g. ${4 \choose 2} = 6$ means that there are 6 different ways to choose 2 different items out of a set of 4.
    - `{A, B, C, D} -> {A, B}, {A, C}, {A, D}, {B, C}, {B, D}, {C, D}`
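
As a minimal sketch of the counting above, the pairs can be enumerated with Python's standard library. The list `ranked` is a hypothetical ranking from best to worst, used only for illustration. Because `itertools.combinations` keeps the input order, the first item of every pair is the one ranked higher.

```python
from itertools import combinations
from math import comb

# Hypothetical ranking of K = 4 responses, best first.
ranked = ["A", "B", "C", "D"]

# Every unordered pair, emitted as (higher-ranked, lower-ranked).
pairs = list(combinations(ranked, 2))

print(pairs)
print(len(pairs), comb(len(ranked), 2))
```

Both printed counts are 6, matching ${4 \choose 2}$.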

To train a regression model we would normally need a numeric target for the reward, but rankings only tell us which response of a pair is better.
That turns out to be enough:

> The absolute value of the reward predicted by the model is not important. What's important is the difference between the rewards predicted for different responses to a prompt. Preferred responses should have a higher reward.

## The Loss Function

Remember, the loss is a measure of how bad the model is, so it is the thing we want to minimise.

We have no numeric target to regress the reward onto, so we need a loss that depends only on which response was preferred.
If we take the difference of the two rewards, a positive value indicates that the reward model predicts the first response is better, and a negative value indicates that it predicts the second response is better.
If we pass this difference through a sigmoid function, we can interpret the output (a value between 0 and 1) as the model's confidence that the first response is preferable.
We can then use this confidence score in the cross entropy loss function.
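
The arithmetic can be checked by hand with a small sketch in plain Python. The helper names `sigmoid` and `preference_confidence` and the example rewards are made up for illustration. Since the target is always "the first response is preferred", the cross entropy reduces to the negative log of the confidence.

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def preference_confidence(first_reward, second_reward):
    """Confidence that the first response is better than the second."""
    return sigmoid(first_reward - second_reward)

# (reward of preferred response, reward of alternate response)
examples = [(1.0, 0.0), (0.0, 0.0), (0.0, 1.0)]

for first, second in examples:
    confidence = preference_confidence(first, second)
    loss = -math.log(confidence)  # cross entropy against a target of 1
    print(f"difference={first - second:+.1f} confidence={confidence:.4f} loss={loss:.4f}")
```

The loss is smallest when the preferred response gets the higher reward (0.3133), is $\log 2 \approx 0.6931$ when the model cannot tell the two apart, and grows when the ordering is wrong (1.3133).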

> The cross entropy loss function is used to train the reward model based on these rankings. 

![](./images/loss%20function.png)

Where $r_\theta(x, y)$ is the scalar output of the reward model with parameters $\theta$ for prompt $x$ and completion $y$, $y_w$ is the preferred completion out of the pair $y_w$ and $y_l$ (the ordering matters), and $D$ is the dataset of human comparisons.
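
For reference, here is the loss written out in LaTeX, assuming the figure shows the pairwise ranking loss from the InstructGPT paper. Here $\sigma$ is the sigmoid function, $K$ is the number of responses ranked for the prompt, and $y_w$ and $y_l$ are the preferred and less-preferred completions:

$$
\text{loss}(\theta) = -\frac{1}{{K \choose 2}} \, \mathbb{E}_{(x, y_w, y_l) \sim D} \left[ \log \left( \sigma \left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right) \right]
$$

The preferred reward comes first inside the sigmoid, so the loss falls as $r_\theta(x, y_w)$ rises above $r_\theta(x, y_l)$.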

Let's implement a function that takes in the two scalar rewards and returns a loss.

In [None]:
#@title ### Run the following cell to download the necessary files for this lesson { display-mode: "form" } 
#@markdown Don't worry about what's in this collapsed cell

!pip install -q transformers


In [None]:
import torch

def loss_function(preferred_response_reward, alternate_response_reward):
    # -log(sigmoid(r_preferred - r_alternate)): small when the preferred response scores higher
    return -torch.mean(torch.log(torch.sigmoid(preferred_response_reward - alternate_response_reward)))

example_preferred_response_reward = torch.tensor([1.0])
example_alternate_response_reward = torch.tensor([0.0])
loss_function(example_preferred_response_reward, example_alternate_response_reward) # test



tensor(0.3133)
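
One possible refinement, not part of the original lesson: computing `log(sigmoid(x))` in two steps can underflow to `log(0)` when the reward difference is very negative. The sketch below writes the pairwise loss $-\log \sigma(r_w - r_l)$ with `torch.nn.functional.logsigmoid`, which evaluates the same quantity in a numerically stable way. The name `stable_loss_function` is made up for this example.

```python
import torch
import torch.nn.functional as F

def stable_loss_function(preferred_response_reward, alternate_response_reward):
    """Pairwise ranking loss: mean of -log(sigmoid(preferred - alternate))."""
    difference = preferred_response_reward - alternate_response_reward
    return -F.logsigmoid(difference).mean()

# Same example as before: the preferred response scores 1.0, the alternate 0.0.
small = stable_loss_function(torch.tensor([1.0]), torch.tensor([0.0]))

# An extreme, badly-ordered pair: the two-step version would return inf here.
extreme = stable_loss_function(torch.tensor([-100.0]), torch.tensor([100.0]))

print(small, extreme)
```

For ordinary reward values the two versions agree. The difference only matters when the rewards drift far apart during training.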

Now let's create the dataset which should return us a context (prompt), a preferred response and an alternate response.

In [32]:
import torch
import pandas as pd
import random

def create_response_pairs():

    data = pd.read_csv('reward_dataset.csv', sep="|")

    data = data.to_dict(orient="records")
    response_pairs = []

    for row in data:
        prompt = row["Prompt"]
        response_pairs.append((prompt, row["Most preferable response"], row["Somewhat preferable response"]))
        response_pairs.append((prompt, row["Most preferable response"], row["Least preferable response"]))
        response_pairs.append((prompt, row["Somewhat preferable response"], row["Least preferable response"]))
        
    return response_pairs


class RewardDataset(torch.utils.data.Dataset):
    def __init__(self):
        """Initializes the dataset."""
        self.response_pairs = create_response_pairs()
        print("Number of response pairs:", len(self.response_pairs))

    def __len__(self):
        """Returns the length of the dataset."""
        return len(self.response_pairs)

    def __getitem__(self, idx):
        """Returns the example in the dataset at the given index."""

        # Get the response pair at the given index
        response_pair = self.response_pairs[idx]
        prompt, preferred_response, alternate_response = response_pair

        # Return the prompt, preferred response and alternate response
        return prompt, preferred_response, alternate_response


dataset = RewardDataset()
print("Dataset length:", len(dataset))
example_idx = random.randrange(len(dataset))  # randint's upper bound is inclusive, so it could index past the end
print(dataset[example_idx])


Number of response pairs: 54
Dataset length: 54
('What are some easy and healthy snack ideas?', 'Greek yogurt with berries and honey', 'Apple slices with almond butter')


The GPT2 model doesn't output regression predictions off the shelf.
Instead, it outputs one contextual representation (a vector) for each input token.
Because GPT2 attends only to earlier positions, the final vector is the only one that has seen the whole sequence. So we can ignore all of the output vectors except the last one, and apply a regression head to it to map that final representation to a single value.
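
The selection step can be sketched on its own with random tensors standing in for the transformer output, so nothing needs to be downloaded. The shapes, the padding mask and the variable names below are assumptions for illustration. The mask shows how to find the last real token once sequences of different lengths are padded into one batch.

```python
import torch

torch.manual_seed(0)

hidden_size = 768            # hidden size of the small GPT-2 model
batch_size, seq_len = 2, 5

# Stand-in for the transformer output: one vector per token position.
hidden_states = torch.randn(batch_size, seq_len, hidden_size)

# 1 marks real tokens, 0 marks right-padding (the second sequence has 3 real tokens).
attention_mask = torch.tensor([[1, 1, 1, 1, 1],
                               [1, 1, 1, 0, 0]])

# Position of the last real token in each sequence.
last_token_idx = attention_mask.sum(dim=1) - 1

# Select one vector per sequence: shape (batch_size, hidden_size).
last_vectors = hidden_states[torch.arange(batch_size), last_token_idx]

# The regression head maps each selected vector to a single scalar reward.
regression_head = torch.nn.Linear(hidden_size, 1)
rewards = regression_head(last_vectors).squeeze(-1)

print(rewards.shape)
```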

In [45]:
from transformers import GPT2Model, GPT2Tokenizer

class ChatGPT2RewardModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained('gpt2')
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
        self.regression_head = torch.nn.Linear(768, 1)

    def forward(self, context, response):
        """
        Returns a scalar value representing the reward for this response, given the context.
        Args:
            context (str): The context. aka. the prompt.
            response (str): The response. aka. the response to the prompt.
        Returns:
            torch.Tensor: The reward for generating this response given the context, with shape (1, 1).
        """

        entire_text = context + response
        # GPT-2's only special token is '<|endoftext|>', which serves as both BOS and EOS
        encoding = self.tokenizer(
            self.tokenizer.bos_token + entire_text + self.tokenizer.eos_token,
            return_tensors="pt",  # tensors with a leading batch dimension of 1
            #    truncation=True,
            #    max_length=max_length,
            #    padding="max_length"
        )

        # Forward pass
        gpt2_outputs = self.backbone(
            input_ids=encoding.input_ids,
            attention_mask=encoding.attention_mask,
        )
        all_output_vectors = gpt2_outputs.last_hidden_state  # (1, seq_len, 768)

        # keep only the final token's vector, which has attended to the whole text
        last_output_vector = all_output_vectors[:, -1, :]  # (1, 768)
        reward = self.regression_head(last_output_vector)  # (1, 1)

        return reward

model = ChatGPT2RewardModel()

example = dataset[example_idx]
prompt, preferred_response, alternate_response = example

preferred_response_reward = model(prompt, preferred_response)
alternate_response_reward = model(prompt, alternate_response)

print("Preferred response reward:", preferred_response_reward.item())
print("Alternate response reward:", alternate_response_reward.item())

Preferred response reward: -2.683429718017578
Alternate response reward: -2.70759654045105


Now we need to implement a training loop.

In [None]:
%load_ext tensorboard
%tensorboard --logdir runs

In [48]:
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
from torch.utils.tensorboard import SummaryWriter

def train(epochs=10):
    # Create the dataset (examples are processed one at a time, so no DataLoader is used yet)
    dataset = RewardDataset()

    # Create the optimizer
    optimizer = torch.optim.Adam(
        model.parameters(), lr=1e-5, betas=(0.9, 0.95)) # as used in the InstructGPT paper

    # Set up logging
    writer = SummaryWriter()  # for logging our loss to TensorBoard
    batch_idx = 0 # for setting the x-axis of our TensorBoard plots (loss vs. batch index)

    # Train the model
    for epoch in range(epochs):
        print(f"Epoch {epoch + 1}")
        for batch in tqdm(dataset):
            
            # Get the data
            prompt, preferred_response, alternate_response = batch
            preferred_response_reward = model(prompt, preferred_response)
            alternate_response_reward = model(prompt, alternate_response)

            loss = loss_function(preferred_response_reward, alternate_response_reward)

            # Backward pass
            loss.backward()
            optimizer.step()

            # Zero the gradients
            optimizer.zero_grad()

            # Log the loss
            # print(f"Loss: {loss.item()}", batch_idx)
            writer.add_scalar("Loss/Train", loss.item(), batch_idx)
            batch_idx += 1


train()


Number of response pairs: 54
Epoch 1


100%|██████████| 54/54 [01:59<00:00,  2.21s/it]


Epoch 2


  2%|▏         | 1/54 [00:03<02:59,  3.38s/it]


KeyboardInterrupt: 

Check out how the loss changes in TensorBoard, and make sure to qualitatively check your results too.


## Some Details

Using each of the ${K \choose 2}$ comparisons as a separate datapoint was found to lead to overfitting.
This is because each completion appears in $K-1$ different comparisons, so if those comparisons are shuffled into different batches, it can influence the gradient that many times.
Instead, all of the comparison pairs produced by one labeller's ranking task were put into a single batch, so that each completion contributes to only one gradient update.

This means that the batch size varies between batches, which is an unusual case.

We'll need a custom dataloader to implement this.
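
A minimal starting point is sketched below in plain Python. It assumes that all pairs sharing a prompt came from the same ranking task, and the function name `group_pairs_by_prompt` is made up for this example. Each group becomes one batch whose size depends on how many comparisons that task produced.

```python
from collections import defaultdict

def group_pairs_by_prompt(response_pairs):
    """Group (prompt, preferred, alternate) tuples so each prompt forms one batch."""
    batches = defaultdict(list)
    for prompt, preferred, alternate in response_pairs:
        batches[prompt].append((preferred, alternate))
    # One (prompt, list of comparisons) entry per ranking task.
    return list(batches.items())

# Toy data: the first prompt had 3 ranked responses, the second only 2.
toy_pairs = [
    ("prompt 1", "A", "B"),
    ("prompt 1", "A", "C"),
    ("prompt 1", "B", "C"),
    ("prompt 2", "X", "Y"),
]

grouped = group_pairs_by_prompt(toy_pairs)
for prompt, comparisons in grouped:
    print(prompt, len(comparisons))
```

One option is to wrap such groups in a `Dataset` and pass it to a `DataLoader` with `batch_size=None`, which disables automatic batching so that each item is treated as an already-formed batch.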

<!-- As the model trains, it will try to learn to produce scores that output scores that would rank all items in the batch in the same order that a human labeller would. -->



In [None]:
# TODO extra

Tasks:
- Adapt the code so that the inputs are batched together
- Log the text outputs to tensorboard along with the score so you can get a qualitative idea of how things are going and sanity check your implementation
- Put the model on a GPU provided by Google Colab
- Experiment with the change in performance between batching all comparisons from a single task together and shuffling them into separate batches.
- Implement your own reward transformer using PyTorch's `TransformerDecoder` class without using a pre-trained backbone and compare the performance with GPT2. How does an RNN compare? Explore the relative time taken for each using TensorBoard.
- Experiment with different optimisers
