# RLHF: Reward-Tuning an LLM for Controlled Review Generation

Please refer to the respective sections in the book for further details.


## Environment Setup

### Import dependencies

In [1]:
%load_ext autoreload
%autoreload 2

In [None]:
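# Assumption: this notebook relies on the classic TRL PPO API (`PPOConfig(model_name=...)`,
# `ppo_trainer.step`), which was replaced in later TRL releases, so TRL is pinned below 0.12.
%pip install transformers datasets wandb "trl<0.12"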

In [None]:
import torch
from tqdm import tqdm
import pandas as pd

tqdm.pandas()

from transformers import pipeline, AutoTokenizer
from datasets import load_dataset

from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from trl.core import LengthSampler

### Configure Environment

In [5]:
ppo_config = PPOConfig(
    model_name="gpt2",
    learning_rate=1.41e-5,
    log_with="wandb",
)

# "function_to_apply": "none" makes the pipeline return raw logits, which are used as rewards
sent_kwargs = {"return_all_scores": True, "function_to_apply": "none", "batch_size": 16}




In [6]:
import wandb

wandb.init()

wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin


## Load Dataset

In [7]:
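def build_dataset(ppo_config, dataset_name="yelp_polarity", input_min_text_length=2, input_max_text_length=8):
    """
    Build the dataset for training. This loads the dataset with `load_dataset`; customize
    this function to train the model on your own dataset.

    Args:
        ppo_config (`PPOConfig`):
            The PPO configuration; `ppo_config.model_name` selects the tokenizer.
        dataset_name (`str`):
            The name of the dataset to be loaded.
        input_min_text_length (`int`):
            Lower bound on the number of tokens kept as the query.
        input_max_text_length (`int`):
            Upper bound on the number of tokens kept as the query.

    Returns:
        ds (`datasets.Dataset`):
            The tokenized dataset with `input_ids` and `query` columns.
    """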
    tokenizer = AutoTokenizer.from_pretrained(ppo_config.model_name)
    tokenizer.pad_token = tokenizer.eos_token
    ds = load_dataset(dataset_name, split="train")
    ds = ds.shuffle().select(range(25000))
    ds = ds.rename_columns({"text": "review"})
    ds = ds.filter(lambda x: len(x["review"]) > 200, batched=False)

    input_size = LengthSampler(input_min_text_length, input_max_text_length)

    def tokenize(sample):
        sample["input_ids"] = tokenizer.encode(sample["review"])[: input_size()]
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    ds = ds.map(tokenize, batched=False)
    ds.set_format(type="torch")
    return ds
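
The query-truncation step inside `tokenize` can be illustrated without downloading anything. The sketch below uses a hypothetical `ToyLengthSampler` as a stand-in for `trl.core.LengthSampler`; it assumes lengths are drawn uniformly from the half-open range `[min, max)`, and the token ids are made up.

```python
import random


class ToyLengthSampler:
    """Hypothetical stand-in for trl.core.LengthSampler (not the library class)."""

    def __init__(self, min_value, max_value, seed=None):
        # Assumption: lengths are drawn uniformly from [min_value, max_value).
        self.values = list(range(min_value, max_value))
        self.rng = random.Random(seed)

    def __call__(self):
        return self.rng.choice(self.values)


def truncate_prompt(token_ids, sampler):
    """Keep only the first few tokens of a review to serve as the query."""
    return token_ids[: sampler()]


sampler = ToyLengthSampler(2, 8, seed=0)
fake_review_ids = list(range(100, 150))  # made-up token ids for one review
queries = [truncate_prompt(fake_review_ids, sampler) for _ in range(5)]
for q in queries:
    print(len(q), q)
```

Each query is a short prefix of a real review, so the model has to continue the review rather than copy it.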

In [8]:
dataset = build_dataset(ppo_config)


def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])




Token indices sequence length is longer than the specified maximum sequence length for this model (1184 > 1024). Running this sequence through the model will result in indexing errors
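
The `collator` defined above turns a list of per-sample dicts into a single dict of lists. A likely reason for keeping lists rather than stacking tensors is that the queries have different lengths. The toy samples below are invented for illustration.

```python
def collator(data):
    # Transpose a list of dicts into a dict of lists, one list per key.
    return {key: [d[key] for d in data] for key in data[0]}


# Two made-up samples with queries of different lengths
samples = [
    {"input_ids": [1, 2, 3], "query": "This place"},
    {"input_ids": [4, 5], "query": "Came here"},
]
batch = collator(samples)
print(batch)
```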


## Load Models

### Load pre-trained GPT-2 models

In [9]:
model = AutoModelForCausalLMWithValueHead.from_pretrained(ppo_config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(ppo_config.model_name)
tokenizer = AutoTokenizer.from_pretrained(ppo_config.model_name)

tokenizer.pad_token = tokenizer.eos_token


### Initialize PPOTrainer

In [None]:
ppo_trainer = PPOTrainer(ppo_config, model, ref_model, tokenizer, dataset=dataset, data_collator=collator)

### Load RoBERTa sentiment classifier

In [11]:
device = ppo_trainer.accelerator.device
if ppo_trainer.accelerator.num_processes == 1:
    device = 0 if torch.cuda.is_available() else "cpu"  # to avoid a `pipeline` bug
sentiment_pipe = pipeline("sentiment-analysis", model="VictorSanh/roberta-base-finetuned-yelp-polarity", device=device)


Some weights of the model checkpoint at VictorSanh/roberta-base-finetuned-yelp-polarity were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [12]:
sample_text = "this restaurant was really good!!"
sentiment_pipe(sample_text, **sent_kwargs)



[[{'label': 'LABEL_0', 'score': 4.7658491134643555},
  {'label': 'LABEL_1', 'score': -4.581177711486816}]]
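
Because the pipeline is called with `"function_to_apply": "none"`, the scores above are raw logits rather than probabilities. As a rough sketch, a softmax over the two logits shows what probabilities they correspond to. The `softmax` helper is written here for illustration and is not part of the notebook.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the pipeline output above (LABEL_0, LABEL_1)
logits = [4.7658491134643555, -4.581177711486816]
probs = softmax(logits)
print(probs)
```

Raw logits are unbounded, so they give a reward signal that does not saturate the way probabilities near 0 or 1 do.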

### Text Generation settings

In [14]:
gen_kwargs = {"min_length": -1, "top_k": 0.0, "top_p": 1.0, "do_sample": True, "pad_token_id": tokenizer.eos_token_id}
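
With `top_k` set to 0 and `top_p` set to 1.0, neither top-k nor nucleus filtering is applied, so `do_sample=True` samples from the model's full next-token distribution. The toy filter below is a simplified sketch of that idea on a made-up probability list. `filter_probs` is a hypothetical helper, not the Hugging Face implementation.

```python
def filter_probs(probs, top_k=0, top_p=1.0):
    """Toy top-k / nucleus filter over a probability list (illustrative only)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order)
    if top_k > 0:
        # Keep only the k most probable tokens.
        keep &= set(order[:top_k])
    if top_p < 1.0:
        # Keep the smallest set of top tokens whose cumulative mass reaches top_p.
        cumulative, nucleus = 0.0, set()
        for i in order:
            nucleus.add(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        keep &= nucleus
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


probs = [0.5, 0.3, 0.15, 0.05]  # made-up next-token distribution
unfiltered = filter_probs(probs, top_k=0, top_p=1.0)  # settings used in this notebook
top2 = filter_probs(probs, top_k=2)
nucleus = filter_probs(probs, top_p=0.7)
print(unfiltered, top2, nucleus)
```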

## Optimize the Model for positive review generation

### Training

In [None]:
output_min_length = 4
output_max_length = 16
output_length_sampler = LengthSampler(output_min_length, output_max_length)


generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
}


for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    input_query_tensors = batch["input_ids"]

    # Generate a response of random length for each query with the policy model
    response_tensors = []
    for input_query in input_query_tensors:
        gen_len = output_length_sampler()
        generation_kwargs["max_new_tokens"] = gen_len
        response = ppo_trainer.generate(input_query, **generation_kwargs)
        # Keep the last gen_len tokens, i.e. the generated continuation
        response_tensors.append(response.squeeze()[-gen_len:])
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]

    # Score query + response with the sentiment classifier; the LABEL_1 logit is the reward
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
    rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]

    # Run one PPO optimization step and log the statistics
    stats = ppo_trainer.step(input_query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)
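
The loop reads the reward as `output[1]["score"]`, which relies on the classifier returning its labels in a fixed order. A slightly more defensive variant looks the score up by label name. The sketch below assumes `LABEL_1` is the class to reward, as the choice of index 1 implies. `rewards_from_outputs` is a hypothetical helper and the scores are invented.

```python
def rewards_from_outputs(pipe_outputs, positive_label="LABEL_1"):
    """Pick the score of `positive_label` from each per-text list of label scores."""
    rewards = []
    for scores in pipe_outputs:
        by_label = {item["label"]: item["score"] for item in scores}
        rewards.append(by_label[positive_label])
    return rewards


# Made-up outputs in the nested format the pipeline returns with all scores enabled;
# the second entry deliberately lists the labels in reverse order.
mock_outputs = [
    [{"label": "LABEL_0", "score": -1.2}, {"label": "LABEL_1", "score": 2.5}],
    [{"label": "LABEL_1", "score": -0.7}, {"label": "LABEL_0", "score": 0.9}],
]
rewards = rewards_from_outputs(mock_outputs)
print(rewards)
```

In the training loop, each value would still need to be wrapped in `torch.tensor(...)` before being passed to `ppo_trainer.step`.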

## Model Evaluation

In [16]:
bs = 16
game_data = dict()
dataset.set_format("pandas")
df_batch = dataset[:].sample(bs)
game_data["query"] = df_batch["query"].tolist()
query_tensors = df_batch["input_ids"].tolist()

response_tensors_ref, response_tensors = [], []

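# Use `gen_kwargs` here: `generation_kwargs` already holds a `max_new_tokens` entry
# from the training loop, so passing `max_new_tokens` again would raise a TypeError.
for i in range(bs):
    gen_len = output_length_sampler()
    # Response from the frozen reference model (before tuning)
    output = ref_model.generate(
        torch.tensor(query_tensors[i]).unsqueeze(dim=0).to(device), max_new_tokens=gen_len, **gen_kwargs
    ).squeeze()[-gen_len:]
    response_tensors_ref.append(output)
    # Response from the PPO-tuned model (after tuning)
    output = model.generate(
        torch.tensor(query_tensors[i]).unsqueeze(dim=0).to(device), max_new_tokens=gen_len, **gen_kwargs
    ).squeeze()[-gen_len:]
    response_tensors.append(output)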

game_data["response (before)"] = [tokenizer.decode(response_tensors_ref[i]) for i in range(bs)]
game_data["response (after)"] = [tokenizer.decode(response_tensors[i]) for i in range(bs)]

sample_texts = [q + r for q, r in zip(game_data["query"], game_data["response (before)"])]
game_data["rewards (before)"] = [output[1]["score"] for output in sentiment_pipe(sample_texts, **sent_kwargs)]

sample_texts = [q + r for q, r in zip(game_data["query"], game_data["response (after)"])]
game_data["rewards (after)"] = [output[1]["score"] for output in sentiment_pipe(sample_texts, **sent_kwargs)]

df_results = pd.DataFrame(game_data)
df_results



| | query | response (before) | response (after) | rewards (before) | rewards (after) |
|---|---|---|---|---|---|
| 0 | This place was an | effort to develop relief materials, I visit | amazing trip. Here are my top three | -2.352868 | 4.272993 |
| 1 | Ace | Lock focus to ensure that Stunner sends a Fals... | — on a note they wrote about. I love our way, | -1.268909 | 3.298889 |
| 2 | Came here for a Sunday | supporting the Boyle family after 2 days away... | Day task. I loved sharing my work with you. I... | 2.186583 | 4.155334 |
| 3 | Went for nice Saturday lunch | .\n\nBUF reports that the | and I really enjoyed it. I always | 1.612728 | 4.005006 |
| 4 | The first thing is the name. | Then you look at a label like "terms for | I love it. I love that I love that | -1.635414 | 3.198051 |
| 5 | Well it's now 11:49 | AM and I am playing a game I never | p.m. and I've been happy | -1.263709 | 1.778251 |
| 6 | Client dinner | and any other spur for remorse | ! I liked it. it | -2.832374 | 2.841812 |
| 7 | If my insurance didn't | cover you for a year | get covers, I still | -2.67131 | -1.811401 |
| 8 | I had an opportunity to hit this | island to see their hero and not just some st... | team, and it was simply amazing. I always | 2.986287 | 4.170051 |
| 9 | I came | up unassigned and in awe and without | back and I like it. It's an | -1.148255 | 3.338899 |


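Comparing the mean and median rewards of the generated sequences before and after PPO training shows a clear shift toward positive sentiment: the mean rises from about -0.73 to 2.60 and the median from about -1.21 to 3.25.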

In [17]:
print("mean:")
display(df_results[["rewards (before)", "rewards (after)"]].mean())
print()
print("median:")
display(df_results[["rewards (before)", "rewards (after)"]].median())

mean:


rewards (before)   -0.734012
rewards (after)     2.600187
dtype: float64


median:


rewards (before)   -1.205982
rewards (after)     3.248470
dtype: float64
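
Beyond the aggregate statistics, a per-sample comparison shows how consistently the tuned model improves. The sketch below uses only the ten rows displayed in the results table, so its aggregates are not expected to match the mean and median printed above, which presumably cover the full sampled batch.

```python
from statistics import mean, median

# Rewards copied from the ten rows shown in the results table
before = [-2.352868, -1.268909, 2.186583, 1.612728, -1.635414,
          -1.263709, -2.832374, -2.67131, 2.986287, -1.148255]
after = [4.272993, 3.298889, 4.155334, 4.005006, 3.198051,
         1.778251, 2.841812, -1.811401, 4.170051, 3.338899]

# Paired difference per query: positive means the tuned model scored higher
deltas = [a - b for a, b in zip(after, before)]
n_improved = sum(d > 0 for d in deltas)
print(f"improved: {n_improved}/{len(deltas)}")
print(f"mean gain: {mean(deltas):.3f}, median gain: {median(deltas):.3f}")
```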

### Save model

In [18]:
model.save_pretrained("gpt2-yelp-pos-v2")
tokenizer.save_pretrained("gpt2-yelp-pos-v2")

('gpt2-yelp-pos-v2/tokenizer_config.json',
 'gpt2-yelp-pos-v2/special_tokens_map.json',
 'gpt2-yelp-pos-v2/vocab.json',
 'gpt2-yelp-pos-v2/merges.txt',
 'gpt2-yelp-pos-v2/added_tokens.json',
 'gpt2-yelp-pos-v2/tokenizer.json')

In [19]:
import shutil

folder_name = 'gpt2-yelp-pos-v2'

shutil.make_archive(folder_name, 'zip', folder_name)

print(f'Zipped the folder {folder_name} successfully.')


Zipped the folder gpt2-yelp-pos-v2 successfully.
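
Before sharing the archive, it can be worth checking that it contains the expected files. The sketch below shows one way to do that with the standard library. It runs against a temporary folder with placeholder files (the `demo-model` name and its contents are made up) rather than the real model directory.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # Build a small placeholder folder standing in for the saved model directory
    folder = os.path.join(tmp, "demo-model")
    os.makedirs(folder)
    for name in ("config.json", "vocab.json"):
        with open(os.path.join(folder, name), "w") as f:
            f.write("{}")

    # Same call pattern as above: archive name, format, directory to zip
    archive_path = shutil.make_archive(folder, "zip", folder)

    with zipfile.ZipFile(archive_path) as zf:
        archived_names = sorted(n for n in zf.namelist() if not n.endswith("/"))
        bad_member = zf.testzip()  # None means no corrupt member was found

print(archived_names, bad_member)
```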
