# Implementing RLHF with Custom Datasets
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/heartexlabs/label-studio-RLHF/blob/master/Implementing_RLHF_with_Custom_Datasets.ipynb)

Reinforcement Learning from Human Feedback (RLHF) is an approach in natural language processing that optimizes language models directly for human preferences, rather than relying solely on supervised or unsupervised training objectives. Since the public release of ChatGPT, RLHF has become a hot topic in both academic and industrial language modeling circles.

In this notebook, we will explore how to implement RLHF using the trlX library and create a custom dataset with Label Studio. By the end of this notebook, you should have a solid understanding of how to implement RLHF with custom datasets, and be well-equipped to continue exploring this exciting area of research.

The notebook will be structured as follows:

1. Introduction to RLHF and trlX
2. Setting up the environment and installing necessary libraries
3. Creating a custom dataset
4. Labeling our dataset with Label Studio
5. Training a preference model with our custom dataset
6. Tuning our language model with our preference model using trlX
7. References

Let's get started!

## 1. Introduction to RLHF and trlX
Implementing RLHF with custom datasets can be daunting for those unfamiliar with the necessary tools and techniques. The primary objective of this notebook is to showcase a technique for reducing bias when fine-tuning Large Language Models (LLMs) using feedback from humans. To achieve this goal, we will use a minimal set of tools: Hugging Face, GPT-2, Label Studio, Weights and Biases, and trlX.

Our aim is to provide a simple, direct pipeline that moves from raw data to a working RLHF system. We will walk through the process step by step: setting up the environment, creating a custom dataset, labeling it with Label Studio, training a preference model on the labels, and finally tuning our language model against that preference model using trlX.

Training Approach for RLHF ([source](https://arxiv.org/pdf/2009.01325.pdf)): 
1. Collect human feedback 
2. Train a reward model
3. Optimize LLM against the reward model


## 2. Setting up the environment and installing necessary libraries

In [1]:
!git clone https://github.com/CarperAI/trlx.git
!git config --global --add safe.directory /content/trlx && cd /content/trlx && pip install -e .

Cloning into 'trlx'...
remote: Enumerating objects: 6515, done.
remote: Counting objects: 100% (510/510), done.
remote: Compressing objects: 100% (287/287), done.
remote: Total 6515 (delta 325), reused 357 (delta 221), pack-reused 6005
Receiving objects: 100% (6515/6515), 46.61 MiB | 17.49 MiB/s, done.
Resolving deltas: 100% (4198/4198), done.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Obtaining file:///content/trlx
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Installing backend dependencies ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting ray@ https://ray-ci-artifact-branch-public.s3.amazonaws.com/42bb0357a6fb13e4994789c824f3623f32869ad8/tmp/artifacts/.whl/ray-3.0.0.dev0-cp39-cp39-manylinux2014_x86_64.whl
  Downloading https:

In [2]:
# uninstall scikit_learn + jax to avoid numpy issues
!pip uninstall -y scikit_learn jax

Found existing installation: scikit-learn 1.2.2
Uninstalling scikit-learn-1.2.2:
  Successfully uninstalled scikit-learn-1.2.2
Found existing installation: jax 0.4.7
Uninstalling jax-0.4.7:
  Successfully uninstalled jax-0.4.7


In [3]:
import os

# run within repo
os.chdir('/content/trlx/examples/summarize_rlhf/')
print(os.getcwd())

/content/trlx/examples/summarize_rlhf


In [4]:
!pip install -r requirements.txt
# the `sklearn` name on PyPI is a deprecated alias; install scikit-learn directly
!pip install mpi4py scikit-learn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting evaluate>=0.4.0
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81.4/81.4 kB 10.2 MB/s eta 0:00:00
Collecting rouge-score>=0.1.2
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24954 sha256=8167ef017233798653cbfb021432eafeb3ff2cf7de16b28d13eb912a7c1d0c0d
  Stored in directory: /root/.cache/pip/wheels/9b/3d/39/09558097d3119ca0a4d462df68f22c6f3c1b345ac63a09b86e
Successfully built rouge-score
Installing collected packages: rouge-score, evaluate
Successfully installed evaluate-0.4.0 rouge-score-0.1.2
Looking in indexes: https://pypi.org/simple, 

In [5]:
# run within repo
os.chdir('/content/trlx/examples/summarize_rlhf/reward_model/')
print(os.getcwd())

/content/trlx/examples/summarize_rlhf/reward_model


## 3. Creating a custom dataset 
In this section we will create a custom dataset for training our reward model. When fine-tuning an LLM for human preference, our data tends to look like this:

```json
{
    "prompt": "The quick brown fox...",
    "selection_1": "jumps over the lazy dog.",
    "selection_2": "bags few lynx."
}
```

The labeler indicates which selection is preferred, given the prompt. This is the human feedback that will be incorporated into the system. These rankings from human labelers allow us to train a model that scores the quality of our language model's responses.

In this example, we'll show you how to create your own dataset. We'll start with a set of prompts, generate predictions for them using GPT-2, and then have users rank the predictions generated. 

Note: Due to the compute limitations of Colab, we'll be using GPT-2 for this notebook, so the quality of the generated predictions will be low. If you have access to more hardware, you can swap GPT-2 for a larger model such as [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6b).

In [6]:
from transformers import pipeline, set_seed
import json

def generate_text(prompt_list, model_name='gpt2', max_length=50, num_return_sequences=2, seed=42):
    generator = pipeline('text-generation', model=model_name)
    set_seed(seed)
    results = []
    for prompt in prompt_list:
        result = generator(prompt, max_length=max_length, num_return_sequences=num_return_sequences)
        results.append([res['generated_text'].strip() for res in result])
    return results

In [7]:
prompts = [
    "What is the latest news on the stock market?",
    "What is the current state of the economy?",
    "What are the latest developments in technology?",
    "What is the political situation in the Middle East?",
    "What are the latest trends in fashion and beauty?",
    "What are the top travel destinations for this year?",
    "What are some healthy recipes for a vegan diet?",
    "What are the most important events happening in the world today?",
    "What are some tips for improving mental health?",
    "What are the best ways to save money for retirement?",
    "What are some popular new books or movies?",
    "What are some effective ways to reduce stress?",
    "What are the latest developments in artificial intelligence?",
    "What are some top-rated restaurants in your city?",
    "What are the best ways to stay fit and healthy?",
    "What are some tips for successful entrepreneurship?",
    "What are some effective ways to improve productivity?",
    "What are the latest developments in climate change research?",
    "What are some top-rated TV shows or movies on streaming services?",
    "What are some fun activities to do on weekends?",
    "What are some effective ways to manage time and prioritize tasks?",
    "What are the latest trends in home decor and design?",
    "What are the best ways to develop a successful career?",
    "What are some popular new products or gadgets?",
    "What are some effective ways to improve communication skills?",
    "What are some tips for successful relationships?",
    "What are the latest developments in space exploration?",
    "What are some top-rated online courses or certifications?",
    "What are some effective ways to improve public speaking skills?",
    "What are the latest trends in digital marketing?",
    "What are some fun and creative DIY projects?",
    "What are some effective ways to improve leadership skills?"
]

In [8]:
generated_text = generate_text(prompts)
print(generated_text)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

[['What is the latest news on the stock market?', 'What is the latest news on the stock market?\n\nSource: Reuters and BK\n\nHow much are you watching?\n\nShares of Microsoft and Facebook are up nearly 1% in 24 hours, while they lost more than 1% after'], ['What is the current state of the economy?\n\nWe have a hard time getting an answer. During that same time period in 2011, gross domestic product expanded by 3.2% in the United States, by a modest 1.5% this', 'What is the current state of the economy?\n\nThe country has always been better off and now, under new leadership from Narendra Modi, the economy is doing better. However, people need to look at the situation where they depend more than they would'], ['What are the latest developments in technology? These are a big deal. But the biggest takeaway? It takes time to get the data. We want to be the data, but not our data", says Faisal.\n\nHe has come out', "What are the latest developments in technology? Will we be able to understa
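
As the output above shows, each generated string begins with the prompt itself, and the reward-model code later in this notebook prepends the prompt again. If you would rather show labelers only the continuation, a small helper can strip that prefix before the data is written out. This is an optional sketch; `strip_prompt` is a hypothetical helper and not part of the original notebook.

```python
def strip_prompt(prompt: str, generated: str) -> str:
    """Return only the continuation when `generated` starts with `prompt`."""
    if generated.startswith(prompt):
        return generated[len(prompt):].strip()
    # Leave the text as-is (minus surrounding whitespace) if the prompt is not a prefix.
    return generated.strip()


prompt = "What is the current state of the economy?"
generated = prompt + "\n\nWe have a hard time getting an answer."
print(strip_prompt(prompt, generated))
```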

In [9]:
def to_pairwise_format(prompts, generated_texts):
    data = []
    for i, (prompt, generated_text) in enumerate(zip(prompts, generated_texts)):
        data.append({
            'id': i+1,
            'data': {
                'prompt': prompt,
                'answer1': generated_text[0],
                'answer2': generated_text[1]
            }
        })
    return data

In [10]:
ls_data = to_pairwise_format(prompts, generated_text)

In [11]:
# Save our data to a json file
with open('ls_input_data.json', 'w') as f:
    json.dump(ls_data, f)
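
Before uploading the file, it can help to confirm that every task carries the three fields produced by `to_pairwise_format`. The check below is a minimal sketch; `find_invalid_tasks` and `REQUIRED_KEYS` are hypothetical names introduced for illustration.

```python
REQUIRED_KEYS = ("prompt", "answer1", "answer2")


def find_invalid_tasks(tasks):
    """Return the ids of tasks whose `data` is missing a key or has an empty value."""
    bad = []
    for task in tasks:
        data = task.get("data", {})
        if any(not str(data.get(key, "")).strip() for key in REQUIRED_KEYS):
            bad.append(task.get("id"))
    return bad


# Small in-memory example; in the notebook you could pass `ls_data` instead.
tasks = [
    {"id": 1, "data": {"prompt": "p", "answer1": "a", "answer2": "b"}},
    {"id": 2, "data": {"prompt": "p", "answer1": "a", "answer2": ""}},
]
print(find_invalid_tasks(tasks))
```

An empty list means every task is ready to be labeled.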

## 4. Labeling our dataset with Label Studio
Now that we have generated some examples, we will label them in Label Studio. 
Once we have the results of our human labels, we can export the data and train our Preference Model. 

1. First, we can spin up Label Studio following the instructions [here](https://labelstud.io/guide/install.html). 

2. Once we have Label Studio running, we can create a new project with the [Pairwise Classification template](https://labelstud.io/templates/pairwise_comparison.html). The templates are flexible, so we'll make some minor edits to make it look a little nicer. The configuration for this template is shown below.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<View>
   <Style>* { box-sizing: border-box; margin: 0; padding: 0; } body { font-family: 'Roboto', sans-serif; line-height: 1.6; background-color: #f0f0f0; } .container { margin: 0 auto; padding: 20px; background-color: #ffffff; border-radius: 5px; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.1), 0 6px 20px 0 rgba(0, 0, 0, 0.1); } .prompt { padding: 20px; background-color: #0084ff; color: #ffffff; border-radius: 5px; margin-bottom: 20px; box-shadow: 0 2px 4px 0 rgba(0, 0, 0, 0.1), 0 3px 10px 0 rgba(0, 0, 0, 0.1); } .answers { display: flex; justify-content: space-between; flex-wrap: wrap; gap: 20px; } .answer-box { flex-basis: 49%; padding: 20px; background-color: rgba(44, 62, 80, 0.9); color: #ffffff; border-radius: 5px; box-shadow: 0 2px 4px 0 rgba(0, 0, 0, 0.1), 0 3px 10px 0 rgba(0, 0, 0, 0.1); } .answer-box p { word-wrap: break-word; } .answer-box:hover { background-color: rgba(52, 73, 94, 0.9); cursor: pointer; transition: all 0.3s ease; } .lsf-richtext__line:hover { background: unset; } .answer-box .lsf-object { padding: 20px }</Style>
   <View className="container">
      <View className="prompt">
         <Text name="prompt" value="$prompt" />
      </View>
      <View className="answers">
         <Pairwise name="pw" toName="answer1,answer2" selectionStyle="background-color: #27ae60; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.2); border: 2px solid #2ecc71; cursor: pointer; transition: all 0.3s ease;" />
         <View className="answer-box">
            <Text name="answer1" value="$answer1" />
         </View>
         <View className="answer-box">
            <Text name="answer2" value="$answer2" />
         </View>
      </View>
   </View>
</View>
```

3. Next we'll drag and drop to upload our data, and we're off! 

4. Once we're finished labeling our data, we can export it and we're ready to train our preference model. 

Note: If you're using Colab, upload the exported file to the root directory; it will then be located under `/content/...`, for example `/content/project-7-at-2023-04-12-22-24-4c78f924.json`.
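
To make the export format concrete, here is a trimmed-down record containing only the fields that the parsing code in the next section reads. The record itself is hypothetical and real exports carry additional metadata; `preferred_answer` is an illustrative helper, not part of the notebook.

```python
# Hypothetical, trimmed-down export record: only the fields read by the
# parsing code in this notebook are shown.
sample = {
    "data": {
        "prompt": "The quick brown fox...",
        "answer1": "jumps over the lazy dog.",
        "answer2": "bags few lynx.",
    },
    "annotations": [{"result": [{"value": {"selected": "left"}}]}],
}


def preferred_answer(task):
    """Return (chosen, rejected) answers for the first annotation of a task."""
    selected = task["annotations"][0]["result"][0]["value"]["selected"]
    answer1, answer2 = task["data"]["answer1"], task["data"]["answer2"]
    # 'left' corresponds to answer1; anything else is treated as answer2.
    return (answer1, answer2) if selected == "left" else (answer2, answer1)


chosen, rejected = preferred_answer(sample)
print("chosen:", chosen)
print("rejected:", rejected)
```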

## 5. Training a preference model with our custom dataset
Now we're ready to train our preference model. We'll create a dataset from our labels, initialize our model from the pretrained LM, and then begin training. 

When we finally train our model, we can connect with Weights and Biases to log our training metrics. 
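
A preference model of this kind is typically trained with a pairwise ranking loss, as described in the summarization paper linked in Section 1: the score of the chosen answer should exceed the score of the rejected one. The sketch below computes that loss and the matching accuracy in plain Python on made-up scores. It illustrates the objective only; that `GPTRewardModel` computes exactly this quantity is an assumption.

```python
import math


def pairwise_loss(chosen_scores, rejected_scores):
    """Mean of -log(sigmoid(chosen - rejected)) over all pairs."""
    total = 0.0
    for c, r in zip(chosen_scores, rejected_scores):
        d = c - r
        # Numerically stable form of log(1 + exp(-d)), which equals -log(sigmoid(d)).
        total += max(-d, 0.0) + math.log1p(math.exp(-abs(d)))
    return total / len(chosen_scores)


def pairwise_accuracy(chosen_scores, rejected_scores):
    """Fraction of pairs where the chosen answer outscores the rejected one."""
    wins = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
    return wins / len(chosen_scores)


# Made-up scores for three labeled pairs.
chosen = [1.2, 0.3, -0.5]
rejected = [0.4, 0.9, -1.0]
print(round(pairwise_loss(chosen, rejected), 4))
print(pairwise_accuracy(chosen, rejected))
```

The loss shrinks as the gap between chosen and rejected scores grows, so the model is rewarded for ranking pairs the way the labelers did rather than for matching any absolute score.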

In [12]:
# # Example Label Studio Output if we skip the human labeling
# with open('/content/project-7-at-2023-04-12-22-24-4c78f924.json', 'r') as f:
#   data = json.load(f)

In [22]:
import json
import os

import torch
from datasets import load_dataset
from reward_model import GPTRewardModel
from torch.utils.data import Dataset
from tqdm import tqdm
from transformers import AutoTokenizer, Trainer, TrainingArguments

def create_comparison_dataset_ls(path: str):
    """Turn a Label Studio JSON export into a list of chosen/rejected text pairs."""
    with open(path, "r") as f:
        data = json.load(f)
    pairs = []
    for sample in data:
        prompt = sample['data']['prompt']
        answer1 = sample['data']['answer1']
        answer2 = sample['data']['answer2']
        for annotation in sample['annotations']:
            # Skip annotations with an empty result instead of raising an IndexError
            if not annotation.get('result'):
                continue
            # 'left' means the labeler preferred answer1; otherwise answer2
            if annotation['result'][0]['value']['selected'] == 'left':
                chosen, rejected = answer1, answer2
            else:
                chosen, rejected = answer2, answer1
            pairs.append({
                'chosen': prompt + '\n' + chosen,
                'rejected': prompt + '\n' + rejected
            })
    return pairs

class PairwiseDataset(Dataset):
    def __init__(self, pairs, tokenizer, max_length):
        self.chosen_input_ids = []
        self.chosen_attn_masks = []
        self.rejected_input_ids = []
        self.rejected_attn_masks = []
        for pair in tqdm(pairs):
            chosen, rejected = pair["chosen"], pair["rejected"]
            chosen_encodings_dict = tokenizer(
                "<|startoftext|>" + chosen + "<|endoftext|>",
                truncation=True,
                max_length=max_length,
                padding="max_length",
                return_tensors="pt",
            )
            rejected_encodings_dict = tokenizer(
                "<|startoftext|>" + rejected + "<|endoftext|>",
                truncation=True,
                max_length=max_length,
                padding="max_length",
                return_tensors="pt",
            )
            self.chosen_input_ids.append(chosen_encodings_dict["input_ids"])
            self.chosen_attn_masks.append(chosen_encodings_dict["attention_mask"])
            self.rejected_input_ids.append(rejected_encodings_dict["input_ids"])
            self.rejected_attn_masks.append(rejected_encodings_dict["attention_mask"])

    def __len__(self):
        return len(self.chosen_input_ids)

    def __getitem__(self, idx):
        return (
            self.chosen_input_ids[idx],
            self.chosen_attn_masks[idx],
            self.rejected_input_ids[idx],
            self.rejected_attn_masks[idx],
        )


class DataCollatorReward:
    def __call__(self, data):
        batch = {}
        batch["input_ids"] = torch.cat([f[0] for f in data] + [f[2] for f in data])
        batch["attention_mask"] = torch.cat([f[1] for f in data] + [f[3] for f in data])
        batch["labels"] = torch.tensor([0] * len(data) + [1] * len(data))
        return batch


def compute_metrics(eval_preds):
    chosen_end_scores = eval_preds.predictions[0]  # chosen scores
    rejected_end_scores = eval_preds.predictions[1]  # rejected scores

    result = {}
    acc = sum(chosen_end_scores > rejected_end_scores) / len(rejected_end_scores)
    result["accuracy"] = acc

    return result
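
The layout produced by `DataCollatorReward` is easy to miss: all chosen examples come first, followed by all rejected examples, so a batch of N pairs holds 2N rows. The plain-list sketch below mirrors that layout and shows how the two halves can be recovered. `collate_pairs` and `split_batch` are hypothetical helpers; how `GPTRewardModel` actually consumes the batch is an assumption worth checking against `reward_model.py`.

```python
def collate_pairs(pairs):
    """Mimic DataCollatorReward with plain lists instead of tensors."""
    chosen = [p["chosen"] for p in pairs]
    rejected = [p["rejected"] for p in pairs]
    return {
        "inputs": chosen + rejected,  # chosen rows first, then rejected rows
        "labels": [0] * len(pairs) + [1] * len(pairs),
    }


def split_batch(batch):
    """Recover the chosen and rejected halves from a collated batch."""
    half = len(batch["inputs"]) // 2
    return batch["inputs"][:half], batch["inputs"][half:]


batch = collate_pairs([
    {"chosen": "good A", "rejected": "bad A"},
    {"chosen": "good B", "rejected": "bad B"},
])
print(batch["inputs"])
print(split_batch(batch))
```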


In [31]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

os.makedirs("rm_checkpoint", exist_ok=True)

# Initialize the reward model from the pretrained GPT-2 checkpoint
# (this notebook does not include a supervised fine-tuning step)
model = GPTRewardModel("gpt2")

# Freeze all but the last ~30% of the hidden layers of the reward model backbone
layers = model.transformer.h
num_layers = len(layers)
num_unfrozen = int(0.3 * num_layers)
for layer in layers[:-num_unfrozen]:
    layer.requires_grad_(False)

# Create the comparison datasets
# Replace this path with the location of your own Label Studio export
data_path = "/content/project-7-at-2023-04-12-23-04-c1e74664.json"
pairs = create_comparison_dataset_ls(data_path)
train_size = int(0.8 * len(pairs))  # 80% training, 20% validation
train_pairs = pairs[0:train_size]
val_pairs = pairs[train_size:]


# Make pairwise datasets for training
max_length = 550
train_dataset = PairwiseDataset(train_pairs, tokenizer, max_length=max_length)
val_dataset = PairwiseDataset(val_pairs, tokenizer, max_length=max_length)

# Create the collator to gather batches of pairwise comparisons
data_collator = DataCollatorReward()

100%|██████████| 25/25 [00:00<00:00, 597.46it/s]
100%|██████████| 7/7 [00:00<00:00, 713.32it/s]
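
The layer-freezing step in the cell above relies on integer truncation and a negative slice, so the exact share of frozen layers depends on the model depth. The sketch below reproduces only that arithmetic with plain lists; `frozen_layer_indices` is a hypothetical helper for illustration.

```python
def frozen_layer_indices(num_layers: int, unfrozen_fraction: float = 0.3):
    """Indices of the layers that the slice `layers[:-num_unfrozen]` would freeze."""
    num_unfrozen = int(unfrozen_fraction * num_layers)
    return list(range(num_layers))[:-num_unfrozen]


# GPT-2 (small) has 12 transformer blocks.
frozen = frozen_layer_indices(12)
print(len(frozen), "of 12 layers frozen:", frozen)

# Edge case: for very shallow models int(0.3 * n) is 0, and a slice ending
# at -0 is empty, so no layer would be frozen at all.
print(frozen_layer_indices(3))
```

For GPT-2 this leaves the last three blocks trainable, which is 25% rather than exactly 30% of the layers.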


In [None]:
training_args = TrainingArguments(
    output_dir="rm_checkpoint/",
    num_train_epochs=50,
    logging_steps=10,
    gradient_accumulation_steps=4,
    save_strategy="steps",
    evaluation_strategy="steps",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    eval_accumulation_steps=1,
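    eval_steps=10000,  # with only a few dozen pairs the run ends long before this step; lower it to see validation metrics
    save_steps=10000,  # same caveat: no step-based checkpoint is written unless this is lowered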
    warmup_steps=100,
    logging_dir="./logs",
    fp16=True,
    bf16=False,
    learning_rate=1e-5,
    # deepspeed="ds_config_gpt_j.json",
    save_total_limit=1
)

Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    compute_metrics=compute_metrics,
    eval_dataset=val_dataset,
    data_collator=data_collator,
).train()



wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc

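With a dataset this small, it is worth estimating how many optimizer updates the run performs and comparing that number with `eval_steps`, `save_steps`, and `warmup_steps`. The function below is a rough single-device estimate; the exact rounding used by the Hugging Face `Trainer` is an assumption here, and `estimate_optimizer_steps` is a hypothetical helper.

```python
import math


def estimate_optimizer_steps(num_examples, batch_size, grad_accum, epochs):
    """Rough single-device estimate of how many optimizer updates a run performs."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # Gradients are accumulated over `grad_accum` batches before each update.
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(epochs * updates_per_epoch)


# Values taken from the cells above: 25 training pairs, batch size 4,
# gradient accumulation 4, 50 epochs.
steps = estimate_optimizer_steps(num_examples=25, batch_size=4, grad_accum=4, epochs=50)
print("estimated optimizer steps:", steps)
print("reaches eval_steps=10000:", steps >= 10000)
print("reaches warmup_steps=100:", steps >= 100)
```

If the estimate is far below `eval_steps`, no validation metrics will be logged during training, and a warmup longer than the run means the learning rate never reaches its configured value.
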

## 6. TBD: Tune our language model with our preference model using trlX

Once we have our reward model, we can train our language model using PPO. More details about this setup are available [here](https://github.com/CarperAI/trlx/tree/main/examples/summarize_rlhf).

```
accelerate launch --config_file configs/default_accelerate_config.yaml trlx_gptj_text_summarization.py
```
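
During PPO, the policy's generated samples need to be turned into scalar rewards by the trained reward model. The sketch below shows the general shape of such a wrapper using a placeholder scorer. The exact reward-function signature trlX expects is an assumption, so check the linked example before relying on it; `make_reward_fn` and the length-based scorer are hypothetical.

```python
from typing import Callable, List


def make_reward_fn(score_fn: Callable[[str], float]) -> Callable[..., List[float]]:
    """Wrap a per-sample scorer into a function that scores a batch of samples."""

    def reward_fn(samples: List[str], **kwargs) -> List[float]:
        # Extra keyword arguments are accepted and ignored in this sketch.
        return [float(score_fn(sample)) for sample in samples]

    return reward_fn


# Placeholder scorer (hypothetical): counts words. In practice this would
# tokenize the text and run it through the trained reward model.
reward_fn = make_reward_fn(lambda text: len(text.split()))
print(reward_fn(["short answer", "a somewhat longer answer"]))
```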

## 7. References 
- [Implementing RLHF: Learning to Summarize with trlX](https://wandb.ai/carperai/summarize_RLHF/reports/Implementing-RLHF-Learning-to-Summarize-with-trlX--VmlldzozMzAwODM2), another end-to-end example with trlX
- [General overview of RLHF](https://huggingface.co/blog/rlhf)
- [Similar human-in-the-loop annotation framework](https://github.com/CarperAI/cheese/tree/main/examples)
- [Anthropic harmless RLHF paper](https://arxiv.org/pdf/2204.05862.pdf) and [blog about CAI general principles](https://lifearchitect.ai/anthropic/)