# 🏆 Train a reward model for RLHF

Collecting comparison data is a crucial part of RLHF and LLM evaluation. This data is used to train a reward model that scores responses according to human preferences. Afterwards, during the reinforcement learning phase, the LLM is fine-tuned to generate responses that the reward model scores highly. Whereas the reward model scores individual prompt-response pairs, comparison data collection typically requires humans (and machines) to rank several responses to a single prompt.

In this example, we will describe how you can build a dataset for collecting human preferences and train a reward model using the [amazing `trl` library](https://huggingface.co/docs/trl/index).

Let's get started!

<img src="../../../_static/images/llms/rm.svg" alt="Comparison collection for Reward Modeling" style="display:block;margin-left:auto;margin-right:auto;">

<div class="alert alert-info">

Note 

This tutorial is a Jupyter Notebook. There are two options to run it:

- Use the Open in Colab button at the top of this page. This option allows you to run the notebook directly on Google Colab. Don't forget to change the runtime type to GPU for faster model training and inference.
- Download the .ipynb file by clicking on the View source link at the top of the page. This option allows you to download the notebook and run it on your local machine or on a Jupyter notebook tool of your choice.

</div>


## Setup

For this tutorial, you will need to have an Argilla server running. If you don't have one already, check out our [Quickstart](../../../getting_started/quickstart.md) or [Installation](../../../getting_started/installation/installation.md) pages. Once you do, complete the following steps:

1. Install the Argilla client and the required third-party libraries using `pip`:

In [1]:
%pip install -U argilla pandas trl plotly -qqq

2. Let's make the necessary imports:

In [3]:
import random

import argilla as rg

from datasets import Dataset, load_dataset

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
)

from trl import RewardTrainer

3. If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to init the Argilla client with the `URL` and `API_KEY`:

In [4]:
# Replace api_url with your HF Spaces URL if using Spaces
# (e.g. "http://localhost:6900" for a local instance)
# Replace api_key if you configured a custom API key
rg.init(
    api_url="https://dvilasuero-trl.hf.space",
    api_key="admin.apikey"
)

## Configure the dataset
As a first step, let's load the dataset and quickly explore the data. This dataset contains a sample of the Dolly curated dataset with the original human responses and generated responses using the Falcon-7b-instruct model.

In [5]:
hf_dataset = load_dataset("argilla/dolly-curated-comparison-falcon-7b-instruct", split="train")



In [10]:
df = hf_dataset.to_pandas()
df

Unnamed: 0,prompt,response-1,response-2,category,original_response,external_id
0,What is Depreciation,What is Depreciation – 10 Important Facts to K...,What is Depreciation on a Car?\nDepreciation i...,open_qa,Depreciation is the drop in value of an asset ...,518
1,What do you know about the city of Aberdeen in...,"Aberdeen, the city in which I've lived for the...","As an AI language model, I don't have personal...",open_qa,Aberdeen is a city located in the North East o...,351
2,Describe thunderstorm season in the United Sta...,Describe thunderstorm season in the United Sta...,Describe thunderstorm season in the United Sta...,information_extraction,Thunderstorm season in the United States and C...,10567
3,"When did Peloton IPO?\nOn September 26, 2019, ...",When did Peloton IPO?\nPeloton launched its in...,When did Peloton IPO?\nPeloton IPO'd on May 26...,closed_qa,Peloton became a public company via an initial...,12412
4,What is the best way to answer an interview qu...,The best way to answer an interview question m...,Some of the best ways to answer an interview q...,creative_writing,The first recommended step is to ask clarifyin...,2521
...,...,...,...,...,...,...
7396,How do i accept the change,How do i accept the change in my life\nAccepti...,I's a great opportunity to improve. The only t...,brainstorming,Embrace the change and see the difference,15010
7397,Extract the teams that the footballer Sócrates...,Extract the teams that the footballer Sócrates...,Extract the teams that the footballer Sócrates...,information_extraction,"Brazil, Botafogo-SP, Corinthians, Fiorentina",9970
7398,Without quoting directly from the text give me...,Without quoting directly from the text give me...,Without quoting directly from the text give me...,summarization,"Brendon Small is a stand-up comedian, Creator...",14205
7399,Is Killing is Sin ? Is it ture,Is Killing is Sin ? Is it ture?\nKilling can b...,Is Killing is Sin ? Is it ture?\nKilling is no...,brainstorming,Killing a human being should not be sin becaus...,11253


For reward modelling, we would like to ask labelers to rank two responses for a prompt from best to worst. For this, we need to configure the fields to show and questions to ask to labelers.

The dataset will show users three fields: `instruction`, which corresponds to the prompt, and `response-1` and `response-2`, which correspond to the two responses to that instruction.


In [11]:
# list of fields that we will use later for our dataset settings
fields = [
    rg.TextField(name="instruction", title="User instruction"),
    rg.TextField(name="response-1"),
    rg.TextField(name="response-2")
]

In this use case, we will ask the users to select the best response. You can also ask users to rank more than two responses, but let's keep it straightforward for this example.

In [12]:
question = rg.RatingQuestion(
    name="choose-best",
    title="Choose the best response:",
    description="Choose the most helpful, harmless, and truthful response. Select 1 for response-1, 2 for response-2, or discard if both are equally good/bad.",
    values=[1,2],
    required=True
)

guidelines="""These guidelines are based on the paper [Training Language Models to Follow Instructions with Human Feedback](https://cdn.openai.com/papers/Training_language_models_to_follow_instructions_with_human_feedback.pdf)\n\n**You are given a text-based description of a task, submitted by a user.**\n\nThis task description may be in the form of an explicit instruction (e.g. \"Write a story about a wise frog.\"). The task may also be specified indirectly, for example by using several examples of the desired behavior (e.g. given a sequence of movie reviews followed by their sentiment, followed by one movie review without a sentiment, you can assume the task is to predict the sentiment of the final review), or by generating the start of a desired output (e.g. given \"There once was a wise frog named Julius,\" you can assume the task is to continue the story).\n\n**You will also be given several text outputs, intended to help the user with their task.** Your job is to evaluate these outputs to ensure that they are helpful, truthful, and harmless, and select the best according to these criteria.\n\nIn case **none of the outputs is correct, conforming to the criteria, or they're exactly equal, use the discard option to not rank them**.\n\nFor most tasks, being truthful and harmless is more important than being helpful.\n\nBy **helpful**, we mean that the output should follow the user's intention and help the user solve their task. 
Some examples of helpful behavior:\n\n- Writing in clear language.\n- Answering the question they meant to ask, even if they mis-asked it.\n- Being sensitive to internationality (e.g., \"football\" shouldn't mean American football, and \"the president\" doesn't necessarily mean the US president).\n- If an instruction is too confusing, asking for clarification and explaining why the instruction is confusing.\n- Not giving overly long or rambling answers, or repeating information from the question.\n- Not assuming extraneous extra context outside of what's given (besides things like facts about the world), unless that's an implied part of the task. For example, if asked to \"write a polite email response to this email: email body,\" the output shouldn't assume \"I can't make it this time, but am free next weekend.\" But if asked to \"Write an email to Socrates,\" then you can safely assume the output can make up a context.\n\nBy **truthful**, we mean that the output contains accurate information and doesn't mislead the user. Some examples of truthful behavior:\n\n- On tasks like summarization, where the output should only use information from input, not making up details that are not part of the input description.\n- Not producing clearly false information about the world (e.g., making up facts or promoting conspiracies). For example, the output should not state that Hillary Clinton has served time in prison.\n- Avoid generating misleading information or information with questionable authenticity. For example, if the input asks \"Why did Hillary Clinton go to jail?\", the output should not say \"It's not totally clear,\" but rather should refute the premise of the question.\n\nBy **harmless**, we mean that the output should not cause physical, psychological, or social harm to people; damage to or loss of equipment or property; damage to the environment; or harm to institutions or resources necessary to human well-being. 
Some examples of harmless behavior:\n\n- Treating other humans with kindness, respect, and consideration; not denigrating members of certain groups or using biased language against a particular group.\n- Not generating abusive, threatening, or offensive language or promoting violence.\n- Not writing sexual or violent content if it's not asked for.\n- Not giving bad real-world advice or promoting illegal activity.\n\nEvaluating model outputs may involve making trade-offs between these criteria. These trade-offs will depend on the task."""

## Create records

The next step is to build the records for collecting comparison data. This step typically involves generating responses using one or several instruction-tuned models.

.. tip:: When showing responses from two different models to labelers, it's recommended to randomly assign the model responses to `response-1` and `response-2` for each record. Otherwise, labelers might spot a pattern and become biased towards a specific model. This is especially relevant for model comparison and evaluation, but it also applies to comparison data for reward modelling.
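The randomization described in the tip can be sketched as follows. This is a minimal illustration using only the standard library: the helper name `assign_randomly`, the `"human"`/`"falcon"` source labels, and the slot-to-source mapping are hypothetical and not part of the Argilla API.

```python
import random


def assign_randomly(responses_by_source, rng):
    """Shuffle responses into response-1, response-2, ... slots.

    Returns the record fields plus a slot -> source mapping that should be
    kept out of the labelers' sight (hypothetical helper, not an Argilla API).
    """
    sources = list(responses_by_source)
    rng.shuffle(sources)  # a fresh random order for every record
    record_fields = {}
    slot_to_source = {}
    for position, source in enumerate(sources, start=1):
        slot = f"response-{position}"
        record_fields[slot] = responses_by_source[source]
        slot_to_source[slot] = source
    return record_fields, slot_to_source


rng = random.Random(42)  # seeded only to make the example reproducible
example_responses = {
    "human": "Depreciation is the drop in value of an asset.",
    "falcon": "What is Depreciation on a Car?",
}
fields_example, mapping_example = assign_randomly(example_responses, rng)
print(fields_example)
print(mapping_example)
```

Keeping the slot-to-source mapping alongside each record (for instance, keyed by its external id) lets you recover which model produced the preferred response once the labels are collected.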

In this example, we've already generated a dataset using the instructions from the Dolly curated dataset with the Falcon-7B-instruct model. We will use the original human-written response as `response-1` and a response from Falcon as `response-2`. 

You can build the records and publish them for labelers as follows:

In [14]:
# build records from hf dataset
records = [
    rg.FeedbackRecord(fields={"instruction": r["prompt"], "response-1": r["original_response"], "response-2": r["response-2"]})
    for r in hf_dataset
]

# create dataset
dataset = rg.FeedbackDataset(
    fields=fields,
    questions=[question],
    guidelines=guidelines
)

# add records and publish
dataset.add_records(records)
dataset.push_to_argilla(name="comparison-data-falcon")




Additionally, you can push the dataset to the Hub for reproducibility and reuse:

In [None]:
#dataset.push_to_huggingface("comparison-data-falcon")

## Collect feedback and prepare the dataset

In [15]:
feedback_dataset = rg.FeedbackDataset.from_argilla('comparison-data-falcon')

Fetching records from Argilla: 30it [00:10,  2.72it/s]


In [32]:
# if you haven't ranked any responses with the UI, use this pre-built dataset
feedback_dataset = rg.FeedbackDataset.from_huggingface("argilla/comparison-data-falcon-with-feedback")



In [33]:
# build a dataset with chosen and rejected responses
rows = []
for record in feedback_dataset.records:
  if record.responses is None or len(record.responses) == 0:
    continue
  # get chosen index from RatingQuestion response
  chosen_id = record.responses[0].values["choose-best"].value
  rejected_id = 2 if chosen_id == 1 else 1

  # build rows for rm training
  rows.append({
      "instruction": record.fields["instruction"],
      "chosen_response": record.fields[f"response-{chosen_id}"],
      "rejected_response": record.fields[f"response-{rejected_id}"]
  })

# build dataset for training
prepared_dataset = Dataset.from_list(rows)
prepared_dataset.to_pandas()

Unnamed: 0,instruction,chosen_response,rejected_response
0,What is Depreciation,Depreciation is the drop in value of an asset ...,What is Depreciation – 10 Important Facts to K...
1,What do you know about the city of Aberdeen in...,Aberdeen is a city located in the North East o...,"As an AI language model, I don't have personal..."
2,Describe thunderstorm season in the United Sta...,Thunderstorm season in the United States and C...,Describe thunderstorm season in the United Sta...
3,"When did Peloton IPO?\nOn September 26, 2019, ...",Peloton became a public company via an initial...,When did Peloton IPO?\nPeloton IPO'd on May 26...
4,What is the best way to answer an interview qu...,The first recommended step is to ask clarifyin...,Some of the best ways to answer an interview q...
...,...,...,...
7396,How do i accept the change,Embrace the change and see the difference,I's a great opportunity to improve. The only t...
7397,Extract the teams that the footballer Sócrates...,"Brazil, Botafogo-SP, Corinthians, Fiorentina",Extract the teams that the footballer Sócrates...
7398,Without quoting directly from the text give me...,"Brendon Small is a stand-up comedian, Creator...",Without quoting directly from the text give me...
7399,Is Killing is Sin ? Is it ture,Killing a human being should not be sin becaus...,Is Killing is Sin ? Is it ture?\nKilling can b...


This dataset is ready to be used as comparison data to train a reward model.
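To make concrete what a reward model learns from such chosen/rejected pairs, here is a minimal sketch of the pairwise ranking loss commonly used for reward modelling, `-log(sigmoid(score_chosen - score_rejected))`. It is written in plain Python for a single pair, and the scores are made-up numbers rather than model outputs. As far as we know, `trl`'s `RewardTrainer` optimizes a loss of this form, but check the documentation of your installed version.

```python
import math


def pairwise_reward_loss(chosen_score, rejected_score):
    """Return -log(sigmoid(chosen_score - rejected_score)) for one pair."""
    margin = chosen_score - rejected_score
    # Numerically stable form of log(1 + exp(-margin))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Illustrative scores only (not produced by a real model)
print(pairwise_reward_loss(2.0, 0.0))  # small loss: chosen is scored higher
print(pairwise_reward_loss(0.0, 0.0))  # log(2): the pair is indistinguishable
print(pairwise_reward_loss(0.0, 2.0))  # large loss: the ranking is inverted
```

The loss depends only on the difference between the two scores, which is why comparison data is sufficient for training and no absolute quality ratings are needed.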

<div class="alert alert-info">
The paper <a href="https://arxiv.org/abs/2305.18290" target="_blank">Direct Preference Optimization: Your Language Model is Secretly a Reward Model</a> proposes DPO, a promising method for using comparison data directly to model human preferences, eliminating the need for a reward model and the RL step. The comparison data collected in Argilla can be used directly for DPO as well.
</div>
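As a rough sketch of how the same comparison data could be reshaped for DPO, the helper below renames the keys of the rows built earlier. The target column names `prompt`, `chosen`, and `rejected` are an assumption based on what `trl`'s `DPOTrainer` is understood to expect, and `to_dpo_rows` is a hypothetical helper, so verify both against the `trl` version you use.

```python
def to_dpo_rows(rows):
    """Rename reward-modelling keys to prompt/chosen/rejected (assumed DPO layout)."""
    return [
        {
            "prompt": row["instruction"],
            "chosen": row["chosen_response"],
            "rejected": row["rejected_response"],
        }
        for row in rows
    ]


# A single hand-written row with the same keys as the rows built above
example_rows = [
    {
        "instruction": "What is Depreciation",
        "chosen_response": "Depreciation is the drop in value of an asset.",
        "rejected_response": "What is Depreciation on a Car?",
    }
]
dpo_rows = to_dpo_rows(example_rows)
print(dpo_rows[0]["prompt"])
```

The resulting list can be passed to `Dataset.from_list` in the same way as the reward-modelling rows.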

## Train the reward model with `trl`


<div class="alert alert-info">
To run this step, you first need to rank some examples using the Argilla UI, or load the pre-built dataset in the step above with `rg.FeedbackDataset.from_huggingface`.
</div>


In [None]:
model_name = "distilroberta-base"

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
tokenizer = AutoTokenizer.from_pretrained(model_name)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = model.config.eos_token_id

def formatting_func(examples):
    kwargs = {"padding": "max_length", "truncation": True, "max_length": 512, "return_tensors": "pt"}

    # Prepend the prompt and a line break to the chosen and rejected responses.
    prompt_plus_chosen_response = examples["instruction"] + "\n" + examples["chosen_response"]
    prompt_plus_rejected_response = examples["instruction"] + "\n" + examples["rejected_response"]

    # Then tokenize these modified fields.
    tokens_chosen = tokenizer.encode_plus(prompt_plus_chosen_response, **kwargs)
    tokens_rejected = tokenizer.encode_plus(prompt_plus_rejected_response, **kwargs)

    return {
        "input_ids_chosen": tokens_chosen["input_ids"][0], "attention_mask_chosen": tokens_chosen["attention_mask"][0],
        "input_ids_rejected": tokens_rejected["input_ids"][0], "attention_mask_rejected": tokens_rejected["attention_mask"][0]
    }
    
formatted_dataset = prepared_dataset.map(formatting_func) 
formatted_dataset = formatted_dataset.train_test_split()

training_args = TrainingArguments(
    output_dir="./reward_model",
    per_device_train_batch_size=16,
    evaluation_strategy="steps",  
    logging_steps=200,  
)

trainer = RewardTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=formatted_dataset["train"],
    eval_dataset=formatted_dataset["test"],
)

trainer.train()




Step,Training Loss,Validation Loss,Accuracy
20,0.6701,0.546184,0.878984
40,0.4069,0.266948,0.910319
60,0.2289,0.199327,0.927066
80,0.2803,0.182498,0.938952
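For reading the table above, the Accuracy column is presumably the fraction of evaluation pairs in which the model scores the chosen response higher than the rejected one. The sketch below computes that quantity in plain Python. The function name and the scores are hypothetical and serve only to illustrate the metric, not to reproduce the trainer's internal implementation.

```python
def pairwise_accuracy(chosen_scores, rejected_scores):
    """Fraction of pairs where the chosen response gets the higher score."""
    if len(chosen_scores) != len(rejected_scores):
        raise ValueError("Both score lists must have the same length")
    if not chosen_scores:
        raise ValueError("At least one pair is required")
    correct = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
    return correct / len(chosen_scores)


# Made-up scores for four evaluation pairs; the second pair is ranked wrongly
chosen = [1.2, 0.3, -0.5, 2.0]
rejected = [0.4, 0.9, -1.0, 1.5]
print(pairwise_accuracy(chosen, rejected))  # 0.75
```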


## Summary

In this tutorial, we learned how to create a comparison dataset by ranking human-written responses from the Dolly dataset against responses generated by Falcon. With this dataset, we learned how to train a reward model using the `trl` framework.

## Appendix: How to build the dataset with pre-loaded responses

In [None]:
picker = ["response-1", "response-2"]

def get_chosen_and_not_chosen(l):
    # Generate a random index between 0 and length of the list - 1
    chosen_id = random.randint(0, len(l) - 1)
    not_chosen_id = 1 - chosen_id  # This will be 0 if chosen_id is 1 and vice versa

    return l[chosen_id], l[not_chosen_id], chosen_id

records = []

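for r in hf_dataset:
    # Randomly decide which slot shows the human response and which the Falcon response
    chosen, not_chosen, chosen_id = get_chosen_and_not_chosen(picker)
    # Randomly pick one of the two Falcon generations stored in the source dataset
    chosen_from_falcon, _, _ = get_chosen_and_not_chosen(picker)

    record = rg.FeedbackRecord(
        fields={"instruction": r["prompt"], chosen: r["original_response"], not_chosen: r[chosen_from_falcon]},
        # Pre-load a response that marks the human-written answer as the best one
        responses=[{"values": {"choose-best": {"value": chosen_id + 1}}}],
        external_id=r["external_id"],
    )
    records.append(record)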

# create dataset
dataset = rg.FeedbackDataset(
    fields=fields,
    questions=[question],
    guidelines=guidelines
)

# add records and publish
dataset.add_records(records)

dataset.push_to_huggingface("argilla/comparison-data-falcon-with-feedback")