<font color=red>**Danger zone:**</font> you'll be fine-tuning a model to generate positive, negative or even toxic reviews. We'll be doing this for fun, but this is also the technique for [review bombing](https://en.wikipedia.org/wiki/Review_bomb), bot farms on social media and other less than dignified stuff. It is ultimately your decision how you apply this knowledge, but before you choose, ask yourself: is this why you chose to learn ML?


# LLM Alignment with Reinforcement Learning from Human Feedback (RLHF)

_based on the [original notebook](https://github.com/antndlcrx/oxford-llms-workshop/blob/main/materials/seminars/day_3/8_LLMs%20alignment%20with%20RLHF.ipynb) by Ilya Boytsov for the Oxford LLMs workshop_



In this session, you're gonna fine-tune a language model with reinforcement learning to make it generate good (or bad) reviews.

To perform RL-based fine-tuning, we'll use a new (in this course) library called [Transformer Reinforcement Learning (TRL)](https://huggingface.co/docs/trl). TRL implements the main reinforcement learning components of RLHF: reward modeling and fine-tuning with PPO.

![img](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/TRL-readme.png)

In [2]:
%pip install -q trl==0.7.4 transformers==4.33.1 datasets==2.14.4 peft==0.5.0

Note: you may need to restart the kernel to use updated packages.


In [3]:
%pip install -U datasets

Collecting datasets
  Downloading datasets-2.15.0-py3-none-any.whl (521 kB)
Collecting pyarrow-hotfix
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting fsspec[http]<=2023.10.0,>=2023.1.0
  Downloading fsspec-2023.10.0-py3-none-any.whl (166 kB)
Installing collected packages: pyarrow-hotfix, fsspec, datasets
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2023.12.1
    Uninstalling fsspec-2023.12.1:
      Successfully uninstalled fsspec-2023.12.1
  Attempting uninstall: datasets
    Found existing installation: datasets 2.14.4
    Uninstalling datasets-2.14.4:
      Successfully uninstalled datasets-2.14.4
Successfully installed datasets-2.15.0 fsspec-2023.10.0 pyarrow-hotfix-0.6

### Tutorial: align the model to generate positive movie reviews

To see how TRL works, we'll use it to align GPT2 on the IMDB dataset to generate positive (or negative) movie reviews. In fact, __it's your choice whether you want positive or negative reviews.__

But before you choose, let's take a look at the baseline model: a GPT-2 fine-tuned on generating arbitrary movie reviews.

In [4]:
import torch
import transformers
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
main_tokenizer = transformers.AutoTokenizer.from_pretrained("lvwerra/gpt2-imdb")
main_model = transformers.AutoModelForCausalLM.from_pretrained("lvwerra/gpt2-imdb", device_map=device)


In [2]:
inputs = main_tokenizer("The movie", return_tensors='pt').to(device)
generated_ids = main_model.generate(**inputs, max_new_tokens=50, do_sample=True)
print("\nGenerated text:", main_tokenizer.decode(generated_ids.flatten().cpu().numpy().tolist()))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Generated text: The movie is set on a beach, in which there is a large wave that travels west. The wave takes a couple of days to reach the beach. A boat that has arrived shows it can't get in to anything. The waves travel west and the boat


If you run this cell a couple of times, you'll see that the model generates positive, negative and neutral reviews in some proportion. What we're gonna do next is teach the model to generate more positive (or negative) reviews.

Similarly to InstructGPT, we're gonna do that in 2 stages:
- **train a reward model** to assign higher values to positive (or negative) reviews
- fine-tune the language model to **maximize that reward using [proximal policy optimization](https://openai.com/research/openai-baselines-ppo)**



## Stage 1: train a reward model

First, we'll train a BERT-like model as our reward model. We'll generate synthetic pairwise rankings to emulate human rankings.

__Q:__ why do I need a reward model? Can I just use a pre-trained sentiment classifier? <br> __A:__ Yes, you can, but that only works for movie reviews. This tutorial will teach you how to do RLHF for any kind of objective.


__If you actually want to maximize sentiment (or another "label") instead of human preferences, train the reward model as a classifier! (see week5)__


In [3]:
# We'll be fine-tuning a small BERT-like model for now. Please try other models for the main assignment.
reward_model = transformers.AutoModelForSequenceClassification.from_pretrained("distilbert-base-cased", device_map=device)
reward_tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert-base-cased")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'pre_classifier.weight', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


__Note that__ the reward model has a separate tokenizer, different from the main model. They don't need to be the same for RLHF fine-tuning.

In [4]:
# To train a reward model, you need a dataset (or generator) of positive-negative pairs.
# Each training sample should be a dict with 4 keys:
#  - input_ids_chosen, attention_mask_chosen = tokenizer("A sentence that human labeler likes more")
#  - input_ids_rejected, attention_mask_rejected = tokenizer("A sentence that human labeler likes less")

import torch
import datasets

class IMDBPairwiseDataset(torch.utils.data.Dataset):
    """ A dataset of all possible pairs of chosen and rejected texts in TRL reward training format """
    def __init__(self, imdb, tokenizer, accepted_label: int):
        super().__init__()
        self.tokenizer = tokenizer
        self.chosen_texts = [row['text'] for row in imdb if row['label'] == accepted_label]
        self.rejected_texts = [row['text'] for row in imdb if row['label'] != accepted_label]
        assert self.chosen_texts, f"no texts with label {accepted_label}"
        print(f"Found {len(self.chosen_texts)} chosen and {len(self.rejected_texts)} rejected texts, {len(self)} pairs")

    def __len__(self):
        return len(self.chosen_texts) * len(self.rejected_texts)  # all pairs

    def __getitem__(self, index: int):
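        # split the flat pair index: the quotient selects the chosen text, the remainder the rejected one
        chosen = self.tokenizer(self.chosen_texts[index // len(self.rejected_texts)], truncation=True)
        rejected = self.tokenizer(self.rejected_texts[index % len(self.rejected_texts)], truncation=True)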
        return dict(input_ids_chosen=chosen['input_ids'], attention_mask_chosen=chosen['attention_mask'],
                    input_ids_rejected=rejected['input_ids'], attention_mask_rejected=rejected['attention_mask'])
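
One way to map a flat index onto every (chosen, rejected) pair is to split it with `divmod` by the number of rejected texts. The sketch below uses toy lists instead of IMDB; `pair_from_index`, `chosen` and `rejected` are illustrative names, not part of TRL or of the dataset class above.

```python
def pair_from_index(index, chosen, rejected):
    """Map a flat index in [0, len(chosen) * len(rejected)) to one (chosen, rejected) pair."""
    chosen_index, rejected_index = divmod(index, len(rejected))
    return chosen[chosen_index], rejected[rejected_index]

chosen = ["great film", "loved it", "a classic"]   # toy stand-ins for the preferred reviews
rejected = ["boring", "awful"]                     # toy stand-ins for the rejected reviews

# Enumerate all 3 * 2 = 6 pairs; each one should appear exactly once.
all_pairs = [pair_from_index(i, chosen, rejected) for i in range(len(chosen) * len(rejected))]
print(all_pairs)
```

Dividing by the length of the rejected list keeps both indices in range even when the two lists have different sizes.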

In [5]:
TARGET_LABEL = 0   # IMDB labels: 0 = negative, 1 = positive; check that it works by reviewing the sample printed below
imdb = datasets.load_dataset("imdb", split='train')
reward_data = IMDBPairwiseDataset(imdb, reward_tokenizer, accepted_label=TARGET_LABEL)

sample = reward_data[31337]
print('CHOSEN:', reward_tokenizer.decode(sample['input_ids_chosen']))
print('REJECTED:', reward_tokenizer.decode(sample['input_ids_rejected']))

Found 12500 chosen and 12500 rejected texts, 156250000 pairs
CHOSEN: [CLS] If only to avoid making this type of film in the future. This film is interesting as an experiment but tells no cogent story. < br / > < br / > One might feel virtuous for sitting thru it because it touches on so many IMPORTANT issues but it does so without any discernable motive. The viewer comes away with no new perspectives ( unless one comes up with one while one's mind wanders, as it will invariably do during this pointless film ). < br / > < br / > One might better spend one's time staring out a window at a tree growing. < br / > < br / > [SEP]
REJECTED: [CLS] This movie has some things that are pretty amazing. First, it is supposed to be based on a true story. That, in itself, is amazing that multiple tornadoes would hit the same town at night in the fall - in Nebraska. I wonder if the real town's name was close to " Blainsworth " ( which is the town's name in the movie ). There is an Ainsworth, Nebraska,

We'll be using `trl.RewardTrainer` - a special case of `transformers.Trainer` that you used in the past. `RewardTrainer` accepts the same format of training arguments (e.g. batch size, gradient checkpointing) as before, except that it trains the model for the pairwise reward objective from [the InstructGPT paper](https://arxiv.org/pdf/2203.02155.pdf):

![img](https://i.imgur.com/2JzNAPs.png)

Note that the model itself does not score pairs: it processes the chosen ($y_w$) and rejected ($y_l$) samples independently. To minimize this loss, the reward model needs to score the chosen sample higher than the rejected one. Note that the formula also assumes some context $x$, which is useful for seq2seq tasks. In our case of movie reviews, $x$ is empty.
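
To make the objective concrete, here is a minimal pure-Python sketch of the per-pair term, $-\log \sigma(r(y_w) - r(y_l))$, averaged over a batch. It takes already-computed scalar rewards, and the function names are made up for illustration; it is meant to mirror the description above, not to reproduce the internals of `RewardTrainer`.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))

def pairwise_reward_loss(chosen_rewards, rejected_rewards):
    """Mean of -log sigmoid(r(y_w) - r(y_l)) over a batch of pairs."""
    losses = [-log_sigmoid(c - r) for c, r in zip(chosen_rewards, rejected_rewards)]
    return sum(losses) / len(losses)

well_ordered = pairwise_reward_loss([4.6, 3.0], [-5.0, -1.0])  # chosen scored higher
misordered = pairwise_reward_loss([-5.0, -1.0], [4.6, 3.0])    # chosen scored lower
tied = pairwise_reward_loss([0.0], [0.0])                      # equal scores give log(2)
print(well_ordered, misordered, tied)
```

The loss only depends on the difference between the two rewards, which is why the absolute reward values of different trained models need not agree.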

In [6]:
import trl

training_args = trl.RewardConfig(  # like transformers.TrainingArguments
    output_dir="reward_model",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
    learning_rate=1.41e-5,
    max_steps=1_000,              # note: training may need more than 1k steps
    logging_steps=50,
    gradient_checkpointing=True,  # reduce memory usage but train ~30% slower
    gradient_checkpointing_kwargs={"use_reentrant": False},
    fp16=True,                    # disable this on CPU or on very old GPUs
    # you may add any other hyperparameters that you found useful in weeks 5-7
)

trainer = trl.RewardTrainer(
    model=reward_model,
    args=training_args,
    tokenizer=reward_tokenizer,
    train_dataset=reward_data,
    peft_config=None,  # optionally, you may tune with LoRA, prompt-tuning, etc
)

trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Could not estimate the number of tokens of the input, floating-point operations will not be computed


| Step | Training Loss |
|------|---------------|
| 50   | 0.5419 |
| 100  | 0.2056 |
| 150  | 0.153  |
| 200  | 0.1294 |
| 250  | 0.1015 |
| 300  | 0.1045 |
| 350  | 0.0939 |
| 400  | 0.0869 |
| 450  | 0.0848 |
| 500  | 0.0825 |




TrainOutput(global_step=1000, training_loss=0.1120507493019104, metrics={'train_runtime': 367.4636, 'train_samples_per_second': 87.083, 'train_steps_per_second': 2.721, 'total_flos': 0.0, 'train_loss': 0.1120507493019104, 'epoch': 0.0})

In [7]:
reward_model.gradient_checkpointing_disable()
reward_model.eval()

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

### Sanity-check the reward model (1 point)

Let's check how our reward model performs.

__Your task__ is to measure how often your reward model ranks a pair of (chosen and rejected) reviews correctly. Please measure this separately for the train data (`imdb`) and a separate test set loaded below.
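
The metric itself is simple once you have rewards for both sides of each pair. A minimal sketch, assuming the rewards were already computed and collected into two equally long lists (`pairwise_accuracy` is an illustrative helper, not a TRL function):

```python
def pairwise_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of pairs in which the chosen text scores strictly higher than the rejected one."""
    pairs = list(zip(chosen_rewards, rejected_rewards))
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)

# Made-up rewards for four pairs; the second pair is ranked incorrectly.
accuracy = pairwise_accuracy([4.6, 0.2, -1.0, 3.3], [-5.0, 0.9, -2.5, 1.1])
print(accuracy)
```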

In [8]:

for sample_index in 45, 16000:
  print('TEXT:', imdb[sample_index]['text'])
  inputs = reward_tokenizer(
      imdb[sample_index]['text'], truncation=True, return_tensors='pt').to(device)
  with torch.no_grad():
    reward = reward_model(**inputs).logits[0, 0].item()
    print("REWARD:", reward)
  print('LABEL:', imdb[sample_index]['label'])
  print()

# note: your reward model may produce different absolute rewards.
# This is fine as long as the rewards are ordered correctly (most of the time)

TEXT: This movie sucked. It really was a waste of my life. The acting was atrocious, the plot completely implausible. Long, long story short, these people get "terrorized" by this pathetic "crazed killer", but completely fail to fight back in any manner. And this is after they take a raft on a camping trip, with no gear, and show up at a campsite that is already assembled and completely stocked with food and clothes and the daughters headphones. Additionally, after their boat goes missing, they panic that they're stuck in the woods, but then the daughters boyfriend just shows up and they apparently never consider that they could just hike out of the woods like he did to get to them. Like I said, this movie sucks. A complete joke. Don't let your girlfriend talk you into watching it.
REWARD: 4.59765625
LABEL: 0

TEXT: Good: Engaging cinematic firefights, great presentation, vehicles are actually fun to drive, fairly appealing multiplayer, faithful to the movie, and the list goes on.<br /

In [9]:
imdb_test = datasets.load_dataset("imdb", split='test')

# <a whole lot of your code here, feel free to split it as you see fit>

In [14]:
reward_test_data = IMDBPairwiseDataset(imdb_test, reward_tokenizer, accepted_label=TARGET_LABEL)

Found 12500 chosen and 12500 rejected texts, 156250000 pairs


In [25]:
from torch.utils.data import Subset
subset_train = Subset(reward_data, torch.randint(0, len(reward_data), (20000, )))
subset_test = Subset(reward_test_data, torch.randint(0, len(reward_test_data), (20000, )))

In [26]:
trainer.evaluate(subset_train)

{'eval_loss': 0.047971371561288834,
 'eval_accuracy': 0.9834,
 'eval_runtime': 70.1199,
 'eval_samples_per_second': 285.226,
 'eval_steps_per_second': 35.653,
 'epoch': 0.0}

In [27]:
trainer.evaluate(subset_test)

{'eval_loss': 0.08350150287151337,
 'eval_accuracy': 0.9719,
 'eval_runtime': 69.3945,
 'eval_samples_per_second': 288.207,
 'eval_steps_per_second': 36.026,
 'epoch': 0.0}

### Reward-guided generation (1 point)

If you did everything right, by now you should have a decent reward model. Before we use it for reinforcement learning, let's see if we can align model samples without any training.

To do so, you can use reward-guided inference: __generate N=16 samples, then select the one with the highest reward__ (according to your reward model).

For this problem, it's on you to demonstrate whether or not your code works. Find at least 5 neutral prompts such as "This movie is" (...), generate samples, rank them based on reward and show which samples get the highest reward.
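
The selection step can be sketched independently of any model. Below, `best_of_n` takes a list of candidate strings and any function that returns one score per string; `toy_reward` is a hypothetical stand-in that counts a few positive words so the example runs without a trained reward model.

```python
def best_of_n(candidates, reward_fn):
    """Return (best_candidate, best_reward) among the candidates, scored in one batch."""
    rewards = [float(r) for r in reward_fn(candidates)]
    best_index = max(range(len(candidates)), key=lambda i: rewards[i])
    return candidates[best_index], rewards[best_index]

# Hypothetical stand-in reward: counts a few positive words. Replace with the real reward model.
def toy_reward(texts):
    positive_words = ("great", "amazing", "good")
    return [sum(text.lower().count(word) for word in positive_words) for text in texts]

samples = ["This movie is fine.", "This movie is great, truly amazing.", "This movie is bad."]
print(best_of_n(samples, toy_reward))
```

With a real reward model, `reward_fn` would decode nothing itself: it should receive plain text and handle tokenization with the reward model's own tokenizer.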

Note: it is faster to generate samples in parallel, rather than sequentially, as follows:




In [28]:
inputs = main_tokenizer(["It was"] * 5, return_tensors='pt').to(device)
for candidate in main_model.generate(**inputs, max_new_tokens=50, do_sample=True):
  print("Sample:", main_tokenizer.decode(candidate.flatten().cpu().numpy().tolist()))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample: It was great seeing the original film just from the look on his face, you realize he was trying very hard to make the film he wanted, and he didn't get it. Another example of the need to get the audience to make a film together, you
Sample: It was just amazing. The acting was good though and we just got used to a scene before the scene where the girl is walking away the other night. Great direction, good editing, and I can understand people who want to have a look at the movie but
Sample: It was a good night film, but I was still disappointed. The acting was spot on. No chemistry between the leads.<br /><br />The "family" characters. My brother would never get involved in another film again. I gave this a zero
Sample: It was supposed to be a horror movie about a group of kids falling in love. They had some pretty good, but awful sex scenes, like when John Hurt and Charlie Cox fall in love and he finds them both sexy. It was hard to believe what kind
Sample: It was also the only

In [40]:
prompts = ['This movie is', 'Such movies as', 'If you consider watching this movie', 'The movie', 'After watching this movie']

In [48]:
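import numpy as np

best_candidates = []
for prompt in prompts:
    inputs = main_tokenizer([prompt] * 16, return_tensors='pt').to(device)
    candidates = main_model.generate(**inputs, max_new_tokens=100, do_sample=True)
    # Score the generated texts, not the prompt: decode with the main tokenizer,
    # then re-tokenize with the reward model's own tokenizer.
    texts = main_tokenizer.batch_decode(candidates, skip_special_tokens=True)
    reward_inputs = reward_tokenizer(texts, truncation=True, padding=True, return_tensors='pt').to(device)
    with torch.no_grad():
        rewards = reward_model(**reward_inputs).logits[:, 0].cpu().numpy()
    best = candidates[int(np.argmax(rewards))]
    best_candidates.append(main_tokenizer.decode(best))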

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [49]:
best_candidates

['This movie is amazing. That is one of the reasons for my success with this movie. If you are looking for a way to watch a low budget western, see The Last Witch Hunter, the underrated horror movie by John Woo, and then go and check out The Last Witch Hunter, you are going to find an amazing place!<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>',
 'Such movies as this one are not to show the horrors which the men carry out every night. Such is the way the movie is based on a novel by William Blatte. It seems to depict an extraordinary set of circumstances whi

## Stage 2: fine-tune the main model with RL


For this tutorial, we will optimize GPT2 to produce positive (or negative, depending on your `TARGET_LABEL`) IMDB movie reviews using the reward model you trained above.

Unlike supervised fine-tuning, RL lets the model generate its own sentences on each training step. It then computes the reward of those specific sentences and, finally, updates the model to increase the probability of sentences with high reward.

Thus, each RLHF training step consists of three stages: __Rollout__, __Evaluation__ and __Update__.

<div style="text-align: center">
<img src='https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/gpt2_bert_training.png' width='600'>
</div>

The update stage depends on the specific RL algorithm. We'll be using Proximal Policy Optimization, or [PPO](https://arxiv.org/abs/1707.06347), similarly to what was used for InstructGPT.
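
For intuition about the update stage, here is a per-sample sketch of the clipped surrogate objective from the PPO paper: `ratio` is the probability of the sampled action under the new policy divided by its probability under the old one, and `advantage` says how much better than expected the action turned out. The function name and the `clip_range=0.2` default are illustrative choices, and a real trainer adds more terms (such as a value loss) on top of this.

```python
def ppo_clipped_objective(ratio, advantage, clip_range=0.2):
    """Per-sample clipped surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)

big_step_good_action = ppo_clipped_objective(1.5, 2.0)   # gain is capped at 1.2 * 2.0
big_step_bad_action = ppo_clipped_objective(0.5, -1.0)   # clipped to 0.8 * -1.0
small_step = ppo_clipped_objective(1.1, 2.0)             # inside the clip range, unchanged
print(big_step_good_action, big_step_bad_action, small_step)
```

The clipping removes the incentive to move the policy far from the one that generated the samples in a single update, which is where the "proximal" in the name comes from.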

Before we run those 3 stages, however, we need to create a dataset of "queries" - partial reviews in our case.

In [50]:
# Note: this code is specific to IMDB; you will need to re-write it for other tasks
imdb_for_rlhf = imdb.filter(lambda row: len(row['text']) > 200, batched=False)
imdb_for_rlhf = imdb_for_rlhf.remove_columns(['label'])
sample_length = trl.core.LengthSampler(2, 8)  # use the first 2-8 tokens as query

def select_query_and_tokenize(sample):
    query_ids = main_tokenizer.encode(sample["text"])[: sample_length()]
    sample["query"] = main_tokenizer.decode(query_ids)  # query is the only required column
    sample["input_ids"] = query_ids  # to avoid re-tokenizing later
    return sample  # we do not need the rest - it will be generated by the model

imdb_for_rlhf = imdb_for_rlhf.map(select_query_and_tokenize, batched=False)
imdb_for_rlhf.set_format(type="torch")

Filter:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/24895 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1168 > 1024). Running this sequence through the model will result in indexing errors


Next, let's prepare your reward model to predict rewards on whatever reviews were generated. Note that we use plaintext reviews because the main model uses a different tokenizer from the reward model.

In [128]:
from typing import List
def compute_reward(texts: List[str]) -> torch.Tensor:
  inputs = reward_tokenizer(texts, truncation=True, padding=True, return_tensors='pt').to(device)
  with torch.no_grad():
    return reward_model(**inputs).logits[:, 0]

In [52]:
compute_reward([imdb[45]['text'], imdb[16000]['text']])  # test on human-written reviews

tensor([ 4.5977, -4.9883], device='cuda:0')

Finally, we move to RL training. In this tutorial, we'll train LoRA adapters and not the full model.

In [53]:
import peft
peft_config = peft.LoraConfig(
    task_type=peft.TaskType.CAUSAL_LM, r=32, lora_alpha=32, lora_dropout=0.0, inference_mode=False
)

# reload main model as AutoModelForCausalLMWithValueHead - with an extra head needed for PPO
main_tokenizer = transformers.AutoTokenizer.from_pretrained("lvwerra/gpt2-imdb")
main_tokenizer.pad_token = main_tokenizer.eos_token

main_model = trl.AutoModelForCausalLMWithValueHead.from_pretrained("lvwerra/gpt2-imdb", device_map=device)
main_model = peft.get_peft_model(main_model, peft_config, adapter_name='default')
main_model.print_trainable_parameters()



trainable params: 1,179,648 || all params: 125,620,225 || trainable%: 0.9390589771670923


Same as before, trl has a special type of trainer that minimizes a PPO-specific pseudo-loss. You can read more on this trainer [here](https://huggingface.co/docs/trl/main/en/ppo_trainer).
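
Besides the reward-model score, RLHF training usually penalizes drifting away from the original model; the training loop below reports this drift as `objective/kl`. A rough sketch of the general idea, assuming per-token log-probabilities of the sampled response under the tuned and the original model are available. The function name and the `kl_coef` value are illustrative, and this is not the exact TRL code.

```python
def kl_penalized_rewards(score, logprobs, ref_logprobs, kl_coef=0.2):
    """Per-token rewards: a KL penalty on every token, plus the reward-model score on the last one."""
    rewards = [-kl_coef * (lp - ref_lp) for lp, ref_lp in zip(logprobs, ref_logprobs)]
    rewards[-1] += score
    return rewards

# The tuned model is more confident than the original on the 2nd and 3rd tokens, so they are penalized.
drifted = kl_penalized_rewards(1.0, [-1.0, -2.0, -0.5], [-1.0, -2.5, -1.5])
# With identical log-probabilities there is no penalty and only the final score remains.
unchanged = kl_penalized_rewards(1.0, [-1.0, -2.0], [-1.0, -2.0])
print(drifted, unchanged)
```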

In [54]:
training_args = trl.PPOConfig(
    model_name=main_model.config._name_or_path,
    gradient_accumulation_steps=1,
    learning_rate=1.41e-5,
    batch_size=64,
    ppo_epochs=4,                 # PPO performs this many updates per training batch
)

ppo_trainer = trl.PPOTrainer(
    training_args, model=main_model.model, tokenizer=main_tokenizer,
    dataset=imdb_for_rlhf, data_collator=lambda data: dict((key, [d[key] for d in data]) for key in data[0])
)  # note: we pass main_model.model because PPOTrainer checks for one of several supported model types ...
# ... main_model.model is a model with adapters, which is supported. main_model itself is a wrapper that is not supported

In [55]:
from tqdm.auto import tqdm
max_steps = 50   # can be insufficient for some tasks - watch your learning curves
generation_kwargs = dict(
    min_length=-1, max_new_tokens=128, do_sample=True, top_k=0, top_p=1.0, pad_token_id=main_tokenizer.eos_token_id)
#                                  ^-- task-specific parameter!
with tqdm(enumerate(ppo_trainer.dataloader), total=max_steps) as progressbar:
  # note: ppo_trainer.dataloader is just a regular dataloader of queries, no RL-specific magic :)
  for epoch, batch in progressbar:
    if epoch >= max_steps:
        break

    # Rollout stage: generate continuations from batch queries using main_model
    response_tensors = ppo_trainer.generate(batch['input_ids'], **generation_kwargs)
    # ^-- list of tensors of token ids from main model tokenizer

    # de-tokenize responses to strings (since reward model uses a different tokenizer)
    batch["response"] = [main_tokenizer.decode(response.squeeze()) for response in response_tensors]
    # note: response_tensors already contain query tokens, so we don't need to add queries manually.
    # This may not be true for other tasks: check this manually by viewing batch["response"] and batch["query"]


    # Evaluation stage
    rewards = compute_reward(batch['response'])

    # Update stage
    stats = ppo_trainer.step(batch['input_ids'], response_tensors, list(rewards.split(1)))
    stats['rewards/mean'] = rewards.mean().item()

    print("-" * 30, 'STEP', epoch, '-' * 30)
    print(f'rewards/mean:\t{stats["rewards/mean"]:.9f}\t<---- average reward over this batch (higher=better, noisy)')
    print(f'ppo/returns/mean:\t{stats["ppo/returns/mean"]:.9f}\t<---- model-estimated average discounted reward')
    print(f'objective/kl:\t{stats["objective/kl"]:.9f}\t<---- how far we are from the original model (regularizer)')
    print()

    ppo_trainer.log_stats(stats, batch, list(rewards.split(1)))

  0%|          | 0/50 [00:00<?, ?it/s]

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


------------------------------ STEP 0 ------------------------------
rewards/mean:	-0.174402237	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.622224808	<---- model-estimated average discounted reward
objective/kl:	0.000000000	<---- how far we are from the original model (regularizer)

------------------------------ STEP 1 ------------------------------
rewards/mean:	-0.266395569	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.478933990	<---- model-estimated average discounted reward
objective/kl:	-0.005801378	<---- how far we are from the original model (regularizer)

------------------------------ STEP 2 ------------------------------
rewards/mean:	0.355812550	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.675371766	<---- model-estimated average discounted reward
objective/kl:	0.006085940	<---- how far we are from the original model (regularizer)

------------------------------ STEP 3 --

## Main assignment - <u>actually</u> train the model (8 points)


Your main task for this week is to use the RLHF pipeline to train a model for a reward of your choice. Here's what you can choose from:

__A. Toxicity fine-tuning:__ train the model to be less (or more!) toxic. For this task, you may use the data from [jigsaw toxic comments](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) and [lmsys/toxic-chat](https://huggingface.co/datasets/lmsys/toxic-chat),  or any other source. Alternatively, you may use toxicity scores from [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1).


__B. Actual human feedback:__ use one of the existing datasets with pairwise human feedback to align your language model. You may use [anthropic's hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf), [OpenAssistant dataset](https://huggingface.co/datasets/OpenAssistant/oasst1) or any other data you see fit. You may also turn the tables and train the model to [minimize](https://habrastorage.org/getpro/geektimes/post_images/ac7/2ad/827/ac72ad82767d4132164a4b6b76196c42.jpg) human preferences, as long as your model does not degrade to gibberish.

__C. Controlled generation:__ Instead of training a reward model from human feedback, you may define the reward function as the text length (longer or shorter) or number of times the model uses specific words (e.g. "sorry", "apologize"). If you choose specific words, make sure the model generates them at least sometimes.
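
For option C, the reward can be a plain Python function over generated texts. A minimal sketch of both variants mentioned above; the function names and the word-based notion of length are assumptions made for illustration, and you may prefer to count tokens instead.

```python
import re
from typing import List

def keyword_count_reward(texts: List[str], keywords=("sorry", "apologize")) -> List[float]:
    """Reward = number of whole-word keyword occurrences in each text (case-insensitive)."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, keywords)) + r")\b", re.IGNORECASE)
    return [float(len(pattern.findall(text))) for text in texts]

def length_reward(texts: List[str], shorter_is_better: bool = False) -> List[float]:
    """Reward = number of whitespace-separated words, negated if shorter texts are preferred."""
    sign = -1.0 if shorter_is_better else 1.0
    return [sign * len(text.split()) for text in texts]

texts = ["Sorry, I am so sorry about that.", "No regrets here."]
print(keyword_count_reward(texts))
print(length_reward(texts))
```

Note that whole-word matching will not count inflected forms such as "apologized"; add them to `keywords` if they should be rewarded too. To use such a function in a training loop that expects tensors, wrap its output, e.g. with `torch.tensor(...)`.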

__Alternatively,__ you may choose a different task. However, unless your task is very similar to one of the above, there is a chance that it will be **significantly** harder to solve, requiring orders of magnitude more compute and tuning. If you are in doubt, please ask the course staff. If they are AFK (again >.<), please prefer one of the recommended tasks.


#### General tips & tricks


Things to look out for:
- during PPO stage, the reward model should be in eval mode (dropout disabled)
- make sure max_length and max_new_tokens are enough for your chosen dataset - at least most of the time
- when in doubt, view the data manually or inspect how the model performs on a few samples


We highly recommend that you manually check the performance after each sub-stage:
1. when you have assembled the pairwise dataset, inspect a couple of samples from *your* dataset class and detokenize them. Make sure that you-the-human understand why one sample was accepted and the other rejected. At least most of the time. This also lets you spot tokenization/truncation errors.
2. after you trained a reward model, measure how accurate this model is in isolation. If your reward model is poor, any subsequent RLHF will also fail.
3. once you've trained the main model with RL, ask it to generate examples and explore how well it does. If it produces an obviously bad output, check if the reward model assigns high reward to that output. If yes, reward model is the culprit; if no, it's a question of better/longer PPO training.

__It is also a good idea to periodically print samples during training.__

__When stuck, simplify the problem.__ If you've spent several hours enchanting the reward model but it still won't budge, try switching to a simple subtask. For instance, if you're training on hh-rlhf, try limiting the dataset to the shortest 10% of sequences - they are typically easier to learn.
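
Restricting a dataset to its shortest samples takes a few lines. A sketch over a plain list of strings; `keep_shortest` is an illustrative helper, and for a real dataset you would pass a `length_fn` that measures whatever "length" means for your samples (e.g. the token count of a whole dialogue).

```python
def keep_shortest(samples, fraction=0.1, length_fn=len):
    """Keep the shortest `fraction` of samples (at least one), ordered from shortest to longest."""
    keep_count = max(1, int(len(samples) * fraction))
    return sorted(samples, key=length_fn)[:keep_count]

# Ten toy "dialogues" of different lengths; keep the shorter half.
dialogues = ["a" * n for n in (50, 5, 400, 20, 7, 90, 1000, 33, 12, 250)]
shortest_half = keep_shortest(dialogues, fraction=0.5)
print([len(d) for d in shortest_half])
```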


## Assignment stages (and grading)

Regardless of the specific task you chose, your solution needs to contain several parts that will be graded separately.


#### Stage 1: reward model (4 points)

Construct a dataset for training the reward model on your problem. Then, train a reward model on that dataset and evaluate how well your model predicts preferences on a hold-out (test) subset of your data.

Please make sure that the part of your notebook where you evaluate reward model is clearly visible and reasonably easy to read. And for all that is holy, do not call it IMDB unless it actually **is** data of imdb movie reviews :)

__Not all tasks require a reward model for later PPO fine-tuning.__ For instance, there's no reason to train a reward model if your reward equals sentence length. Likewise, toxicity reward can be estimated with a pre-trained toxicity classifier. __If your task does not require training a reward model, please train an unrelated model on [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) as though you were solving assignment version B.__ This is for grading purposes only, you won't use this model for stage 2.


#### Stage 2: RL fine-tuning (4 points)

Once the reward model is ready - or you can compute rewards without a model - it is time to maximize that reward with PPO. Optionally, you may replace PPO with another RL algorithm (or unlikelihood learning scheme), but only if you're feeling adventurous.


First, you need to choose a language model to be fine-tuned. You may choose any model, but make sure that your model **can** generate the data in your format. For instance, [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) is a general purpose LM and may (or may not) need prompt engineering to generate chat assistant responses. For that reason, it is best if you **do not use `"lvwerra/gpt2-imdb"` unless you're generating only movie reviews**.



There are two "difficulty modes" for this task:

- **Easy mode:** use [gpt2-large](https://huggingface.co/gpt2-large) or [opt-1.3b](https://huggingface.co/facebook/opt-1.3b) with minimal code changes.
- **Hard mode:** use a larger (e.g. 7B) model in combination with `load_in_4bit` and LoRA, the same way we did last week.

Some reasonable model choices are [LLaMA-7B](https://huggingface.co/Enoch/llama-7b-hf), [Falcon-7b](https://huggingface.co/tiiuae/falcon-7b), [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) as general-purpose LMs, or [guanaco-7b](https://huggingface.co/timdettmers/guanaco-7b), [vicuna-7b](https://huggingface.co/lmsys/vicuna-7b-v1.5) for chat-based tasks, though there are many more (see the [leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)). In hard mode, you will need to modify the training arguments to enable 4-bit fine-tuning, and your experiments will take somewhat longer to complete. On the plus side, a larger model will typically produce noticeably better results.

__High reward is not enough!__ RL algorithms are famous for [cheating their reward functions](https://openai.com/research/faulty-reward-functions). To ensure that your model is actually doing what you want it to do, you will need some additional evaluation. To get the full grade, provide at least 20 side-by-side examples of your fine-tuned model vs original model predictions and a short summary.

Alternatively, you may provide 5 examples and some extrinsic evaluation metric over many examples. For instance, you may use a different pre-trained toxicity score for option A. When dealing with human preferences, you may choose to [enlist actual humans](https://toloka.ai/) or [ask GPT4/Claude](https://arxiv.org/pdf/2304.03277.pdf) to compare your model's predictions. For task C, when optimizing for simple rewards like sentence lengths, it is enough to compare histograms of rewards (e.g. average lengths).
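
For the simple-reward case, a before/after comparison can be as small as the sketch below. The helper names and the example texts are hypothetical; lengths are counted in whitespace-separated words.

```python
from collections import Counter
from statistics import mean


def length_histogram(texts, bin_width=5):
    """Count texts per word-length bin; keys are the lower edge of each bin."""
    bins = Counter((len(t.split()) // bin_width) * bin_width for t in texts)
    return dict(sorted(bins.items()))


def summarize(name, texts):
    """Print the mean length and the length histogram for one set of generations."""
    lengths = [len(t.split()) for t in texts]
    print(f"{name}: mean length {mean(lengths):.1f}, histogram {length_histogram(texts)}")


# hypothetical generations from the original and the fine-tuned model
before = ["a b c", "a b c d e f", "a"]
after = ["a b c d e f g h", "a b c d e f g h i j k", "a b c d e f"]
summarize("before", before)
summarize("after", after)
```

With real generations you would use many more samples per model and the same prompts for both.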


## STAGE 1

**DISCLAIMER: I TOOK THIS DATASET ONLY FOR THE EXPERIMENT AND DIDN'T MEAN TO OFFEND ANYONE**

In [5]:
import datasets
dataset = datasets.load_dataset("ucberkeley-dlab/measuring-hate-speech")

In [103]:
hate_data = dataset['train']

In [124]:
reward_model = transformers.AutoModelForSequenceClassification.from_pretrained("roberta-base", device_map=device)
reward_tokenizer = transformers.AutoTokenizer.from_pretrained("roberta-base")

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classifier.out_proj.bias', 'classifier.dense.weight', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [107]:
# To train a reward model, you need a dataset (or generator) of positive-negative pairs.
# Each training sample should be a dict with 4 keys:
#  - input_ids_chosen, attention_mask_chosen = tokenizer("A sentence that human labeler likes more")
#  - input_ids_rejected, attention_mask_rejected = tokenizer("A sentence that human labeler likes less")

import torch
import datasets

class HateSpeechPairwiseDataset(torch.utils.data.Dataset):
    """ A dataset of all possible pairs of chosen and rejected texts in TRL reward training format """
    def __init__(self, hate, tokenizer):
        super().__init__()
        self.tokenizer = tokenizer
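        # each row is expected to be a size-1 batch (the Subset below is indexed with a column of indices), hence the [0]
        self.chosen_texts = [row['text'] for row in hate if row['hatespeech'][0] == 2.0]
        self.rejected_texts = [row['text'] for row in hate if row['hatespeech'][0] == 0.0]
        assert self.chosen_texts and self.rejected_texts, "need at least one text with label 2 and one with label 0"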
        print(f"Found {len(self.chosen_texts)} chosen and {len(self.rejected_texts)} rejected texts, {len(self)} pairs")

    def __len__(self):
        return len(self.chosen_texts) * len(self.rejected_texts)  # all pairs

    def __getitem__(self, index: int):
        chosen = self.tokenizer(self.chosen_texts[index % len(self.chosen_texts)], truncation=True, return_tensors='pt')
        rejected = self.tokenizer(self.rejected_texts[index // len(self.chosen_texts)], truncation=True, return_tensors='pt')
        return dict(input_ids_chosen=chosen['input_ids'][0], attention_mask_chosen=chosen['attention_mask'][0],
                    input_ids_rejected=rejected['input_ids'][0], attention_mask_rejected=rejected['attention_mask'][0])
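
The `__getitem__` above enumerates every chosen/rejected pair through a single flat index. A minimal sketch of that index arithmetic, using a hypothetical helper `pair_from_index` that is not part of the notebook:

```python
def pair_from_index(index, n_chosen, n_rejected):
    """Map a flat index in [0, n_chosen * n_rejected) to (chosen_idx, rejected_idx)."""
    if not 0 <= index < n_chosen * n_rejected:
        raise IndexError(index)
    rejected_idx, chosen_idx = divmod(index, n_chosen)
    return chosen_idx, rejected_idx


# with 3 chosen and 2 rejected texts, the 6 flat indices cover every pair exactly once
pairs = [pair_from_index(i, n_chosen=3, n_rejected=2) for i in range(6)]
print(pairs)  # [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
```

Because the pairs are never materialized, the dataset can report hundreds of millions of pairs while storing only the two lists of texts.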

In [109]:
from torch.utils.data import Subset

In [119]:
reward_data = HateSpeechPairwiseDataset(Subset(hate_data, torch.arange(10000, 70000).reshape(-1, 1)), reward_tokenizer)

Found 15674 chosen and 39743 rejected texts, 622931782 pairs


In [120]:
reward_test_data = HateSpeechPairwiseDataset(Subset(hate_data, torch.arange(70000, 110000).reshape(-1, 1)), reward_tokenizer)

Found 19983 chosen and 17518 rejected texts, 350062194 pairs


In [125]:
import trl

training_args = trl.RewardConfig(  # like transformers.TrainingArguments
    output_dir="reward_model",
    per_device_train_batch_size=256,
    gradient_accumulation_steps=1,
    learning_rate=1.41e-5,
    max_steps=1_250,              # note: training may need more than 1k steps
    logging_steps=50,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    fp16=True                     # disable this on CPU or on very old GPUs
    # you may add any other hyperparameters that you found useful in weeks 5-7
)

trainer = trl.RewardTrainer(
    model=reward_model,
    args=training_args,
    tokenizer=reward_tokenizer,
    train_dataset=reward_data,
    peft_config=None,  # optionally, you may tune with LoRA, prompt-tuning, etc
    
)

trainer.train()

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss
50,0.4829
100,0.299
150,0.2754
200,0.2667
250,0.2522
300,0.2406
350,0.2367
400,0.228
450,0.2175
500,0.208




TrainOutput(global_step=1250, training_loss=0.21882748947143554, metrics={'train_runtime': 1418.6943, 'train_samples_per_second': 225.56, 'train_steps_per_second': 0.881, 'total_flos': 0.0, 'train_loss': 0.21882748947143554, 'epoch': 0.0})

In [126]:
trainer.evaluate(Subset(reward_data, torch.randint(0, len(reward_data), (30000, ))))

{'eval_loss': 0.13572634756565094,
 'eval_accuracy': 0.9426,
 'eval_runtime': 144.6995,
 'eval_samples_per_second': 207.326,
 'eval_steps_per_second': 25.916,
 'epoch': 0.0}

In [127]:
trainer.evaluate(Subset(reward_test_data, torch.randint(0, len(reward_test_data), (30000, ))))

{'eval_loss': 0.24028734862804413,
 'eval_accuracy': 0.909,
 'eval_runtime': 162.711,
 'eval_samples_per_second': 184.376,
 'eval_steps_per_second': 23.047,
 'epoch': 0.0}

## STAGE 2

In [129]:
hate_data

Dataset({
    features: ['comment_id', 'annotator_id', 'platform', 'sentiment', 'respect', 'insult', 'humiliate', 'status', 'dehumanize', 'violence', 'genocide', 'attack_defend', 'hatespeech', 'hate_speech_score', 'text', 'infitms', 'outfitms', 'annotator_severity', 'std_err', 'annotator_infitms', 'annotator_outfitms', 'hypothesis', 'target_race_asian', 'target_race_black', 'target_race_latinx', 'target_race_middle_eastern', 'target_race_native_american', 'target_race_pacific_islander', 'target_race_white', 'target_race_other', 'target_race', 'target_religion_atheist', 'target_religion_buddhist', 'target_religion_christian', 'target_religion_hindu', 'target_religion_jewish', 'target_religion_mormon', 'target_religion_muslim', 'target_religion_other', 'target_religion', 'target_origin_immigrant', 'target_origin_migrant_worker', 'target_origin_specific_country', 'target_origin_undocumented', 'target_origin_other', 'target_origin', 'target_gender_men', 'target_gender_non_binary', 'target_

In [130]:
hate_data_for_rlhf = hate_data.filter(lambda row: len(row['text']) > 200, batched=False)
hate_data_for_rlhf = hate_data_for_rlhf.remove_columns(['comment_id', 'annotator_id', 'platform', 'sentiment', 'respect', 'insult', 'humiliate', 'status', 'dehumanize', 'violence', 'genocide', 'attack_defend', 'hatespeech', 'hate_speech_score', 'infitms', 'outfitms', 'annotator_severity', 'std_err', 'annotator_infitms', 'annotator_outfitms', 'hypothesis', 'target_race_asian', 'target_race_black', 'target_race_latinx', 'target_race_middle_eastern', 'target_race_native_american', 'target_race_pacific_islander', 'target_race_white', 'target_race_other', 'target_race', 'target_religion_atheist', 'target_religion_buddhist', 'target_religion_christian', 'target_religion_hindu', 'target_religion_jewish', 'target_religion_mormon', 'target_religion_muslim', 'target_religion_other', 'target_religion', 'target_origin_immigrant', 'target_origin_migrant_worker', 'target_origin_specific_country', 'target_origin_undocumented', 'target_origin_other', 'target_origin', 'target_gender_men', 'target_gender_non_binary', 'target_gender_transgender_men', 'target_gender_transgender_unspecified', 'target_gender_transgender_women', 'target_gender_women', 'target_gender_other', 'target_gender', 'target_sexuality_bisexual', 'target_sexuality_gay', 'target_sexuality_lesbian', 'target_sexuality_straight', 'target_sexuality_other', 'target_sexuality', 'target_age_children', 'target_age_teenagers', 'target_age_young_adults', 'target_age_middle_aged', 'target_age_seniors', 'target_age_other', 'target_age', 'target_disability_physical', 'target_disability_cognitive', 'target_disability_neurological', 'target_disability_visually_impaired', 'target_disability_hearing_impaired', 'target_disability_unspecific', 'target_disability_other', 'target_disability', 'annotator_gender', 'annotator_trans', 'annotator_educ', 'annotator_income', 'annotator_ideology', 'annotator_gender_men', 'annotator_gender_women', 'annotator_gender_non_binary', 'annotator_gender_prefer_not_to_say', 
'annotator_gender_self_describe', 'annotator_transgender', 'annotator_cisgender', 'annotator_transgender_prefer_not_to_say', 'annotator_education_some_high_school', 'annotator_education_high_school_grad', 'annotator_education_some_college', 'annotator_education_college_grad_aa', 'annotator_education_college_grad_ba', 'annotator_education_professional_degree', 'annotator_education_masters', 'annotator_education_phd', 'annotator_income_<10k', 'annotator_income_10k-50k', 'annotator_income_50k-100k', 'annotator_income_100k-200k', 'annotator_income_>200k', 'annotator_ideology_extremeley_conservative', 'annotator_ideology_conservative', 'annotator_ideology_slightly_conservative', 'annotator_ideology_neutral', 'annotator_ideology_slightly_liberal', 'annotator_ideology_liberal', 'annotator_ideology_extremeley_liberal', 'annotator_ideology_no_opinion', 'annotator_race_asian', 'annotator_race_black', 'annotator_race_latinx', 'annotator_race_middle_eastern', 'annotator_race_native_american', 'annotator_race_pacific_islander', 'annotator_race_white', 'annotator_race_other', 'annotator_age', 'annotator_religion_atheist', 'annotator_religion_buddhist', 'annotator_religion_christian', 'annotator_religion_hindu', 'annotator_religion_jewish', 'annotator_religion_mormon', 'annotator_religion_muslim', 'annotator_religion_nothing', 'annotator_religion_other', 'annotator_sexuality_bisexual', 'annotator_sexuality_gay', 'annotator_sexuality_straight', 'annotator_sexuality_other'])
sample_length = trl.core.LengthSampler(2, 8)  # use the first 2-7 tokens as query (the upper bound is exclusive)

def select_query_and_tokenize(sample):
    query_ids = main_tokenizer.encode(sample["text"])[: sample_length()]
    sample["query"] = main_tokenizer.decode(query_ids)  # query is the only required column
    sample["input_ids"] = query_ids  # to avoid re-tokenizing later
    return sample  # we do not need the rest - it will be generated by the model

hate_data_for_rlhf = hate_data_for_rlhf.map(select_query_and_tokenize, batched=False)
hate_data_for_rlhf.set_format(type="torch")

Filter:   0%|          | 0/135556 [00:00<?, ? examples/s]

Map:   0%|          | 0/36201 [00:00<?, ? examples/s]
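
`select_query_and_tokenize` keeps only a short random prefix of each comment as the query and lets the model generate the rest. The sketch below shows the same idea on whitespace-split words, which stand in for tokenizer ids here; `sample_query` is a hypothetical helper.

```python
import random


def sample_query(tokens, min_len=2, max_len=8, rng=random):
    """Keep a random-length prefix of `tokens`; `max_len` is exclusive, as in range()."""
    length = rng.randrange(min_len, max_len)
    return tokens[:length]


rng = random.Random(0)  # seeded only to make the demo repeatable
tokens = "this is a long comment that the model should continue".split()
for _ in range(3):
    print(sample_query(tokens, rng=rng))
```

Short queries leave most of the text to the model, so the reward reflects what the model wrote rather than what the dataset already contained.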

In [139]:
# reload main model as AutoModelForCausalLMWithValueHead - with an extra head needed for PPO
main_tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2-large")
main_tokenizer.pad_token = main_tokenizer.eos_token

main_model = trl.AutoModelForCausalLMWithValueHead.from_pretrained("gpt2-large", device_map='cuda:0')

In [149]:
prompts = [
    'u r', 'You are', 'get the', 'please', 'go',
    'let\'s', 'this man', ' ', 'kiss', 'i love',
    'return', 'bro', 'goodbye', 'enjoy your', 'i\'m sure that',
    'hello', 'this movie', 'that was', 'i have never imagined', 'i have never thought',    
]
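
The next cell evaluates each prompt by best-of-n sampling: draw several continuations, score each one, keep the highest-scoring. The sketch below shows that pattern with toy stand-ins; `best_of_n`, `toy_generate` and `toy_score` are hypothetical names, and a real run would plug in the language model and the reward model instead.

```python
import random


def best_of_n(prompt, generate, score, n=16):
    """Sample `n` continuations of `prompt` and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)


# toy stand-ins: a "model" that appends random words and a "reward" that counts exclamation marks
rng = random.Random(0)
vocabulary = ["ok", "great!", "fine", "wow!"]


def toy_generate(prompt):
    return prompt + " " + " ".join(rng.choice(vocabulary) for _ in range(5))


def toy_score(text):
    return text.count("!")


best = best_of_n("this movie", toy_generate, toy_score, n=16)
print(best, toy_score(best))
```

The key point is that the score must be computed on each candidate separately; scoring the shared prompt would give every candidate the same reward.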

In [150]:
import numpy as np

best_candidates = []
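for prompt in prompts:
    inputs = main_tokenizer([prompt] * 16, return_tensors='pt').to(device)
    candidates = main_model.generate(**inputs, max_new_tokens=100, do_sample=True)
    rewards = []
    for candidate in candidates:
        # score the generated text itself (not the prompt), re-tokenized for the reward model
        text = main_tokenizer.decode(candidate, skip_special_tokens=True)
        reward_inputs = reward_tokenizer(text, truncation=True, return_tensors='pt').to(device)
        with torch.no_grad():
            reward = reward_model(**reward_inputs).logits[0, 0].item()
        rewards.append(reward)
    best = candidates[np.argsort(rewards)[-1]]  # candidate with the highest reward
    best_candidates.append(main_tokenizer.decode(best))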

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In [151]:
best_candidates_before_tuning = best_candidates
best_candidates_before_tuning

["u ry, eu yc, nx, ny, c, cx, cy\n\nThis will let you know if you're currently connected to the Wi-Fi network and you can also connect to your data connection after the download finishes. You can also add your Wi-Fi network settings from your phone, or you can copy them to your laptop. If the connection still doesn't work for you following these tips, try restarting your Wi-Fi setup.\n\n5. Connect",
 'You are allowed two guests per day.\n\nChildren are welcome, but only children under 6 must be accompanied by an adult. The cost to feed and feed-site all children under 6 includes: one 8 ounce cup of water, one 8 oz cup of protein powder, one 8 oz bag of chips, one 8 oz bag of fruit and vegetable, one 8 oz of juice, one 8 oz packet of salt and pepper, one 8 oz bottle of milk, one 8 oz gallon of laundry detergent,',
 "get the truth' or what would happen. I decided that I wasn't going to talk to my parents if I had to deal with the truth. I remember taking out my cell phone so that they'd 

In [189]:
import gc
torch.cuda.empty_cache()
gc.collect()

664

In [190]:
import peft

# reload main model as AutoModelForCausalLMWithValueHead - with an extra head needed for PPO
main_tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2-large")
main_tokenizer.pad_token = main_tokenizer.eos_token

main_model = trl.AutoModelForCausalLMWithValueHead.from_pretrained("gpt2-large", device_map='cuda:0')

peft_config = peft.LoraConfig(
    task_type=peft.TaskType.CAUSAL_LM, r=32, lora_alpha=32, lora_dropout=0.0, inference_mode=False
)

# wrap the model with trainable LoRA adapters; the base weights stay frozen
main_model = peft.get_peft_model(main_model, peft_config, adapter_name='default')
main_model.print_trainable_parameters()

trainable params: 5,898,240 || all params: 779,929,601 || trainable%: 0.7562528710844506


In [191]:
training_args = trl.PPOConfig(
    model_name=main_model.config._name_or_path,
    gradient_accumulation_steps=1,
    learning_rate=1.41e-4,
    batch_size=32,
    ppo_epochs=4,
)

ppo_trainer = trl.PPOTrainer(
    training_args, model=main_model.model, tokenizer=main_tokenizer,
    dataset=hate_data_for_rlhf, data_collator=lambda data: dict((key, [d[key] for d in data]) for key in data[0])
) 
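
PPO in RLHF usually keeps the tuned model close to the original one with a KL penalty, which is what the `objective/kl` statistic in the training log tracks. The sketch below shows the usual shape of that computation on plain Python lists. It is a simplified illustration rather than TRL's exact code, and the function name, the coefficient 0.2 and the log-probabilities are assumptions made for this example.

```python
def kl_penalized_rewards(score, logprobs, ref_logprobs, kl_coef=0.2):
    """Per-token rewards: a KL penalty at every token plus the scalar score on the last one."""
    assert len(logprobs) == len(ref_logprobs) > 0
    # log-ratio of the tuned policy to the frozen reference, one value per generated token
    kl_per_token = [lp - ref for lp, ref in zip(logprobs, ref_logprobs)]
    rewards = [-kl_coef * kl for kl in kl_per_token]
    rewards[-1] += score  # the reward model only scores the finished sequence
    return rewards, sum(kl_per_token)


# hypothetical log-probabilities of three generated tokens under both models
rewards, total_kl = kl_penalized_rewards(
    score=1.5, logprobs=[-1.0, -0.5, -2.0], ref_logprobs=[-1.5, -1.5, -2.0])
print(rewards, total_kl)
```

A larger coefficient keeps generations closer to the original model at the cost of slower reward growth.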

In [192]:
from tqdm.auto import tqdm
max_steps = 50   # can be insufficient for some tasks - watch your learning curves
generation_kwargs = dict(
    min_length=-1, max_new_tokens=128, do_sample=True, top_k=0, top_p=1.0, pad_token_id=main_tokenizer.eos_token_id)
#                                  ^-- task-specific parameter!
with tqdm(enumerate(ppo_trainer.dataloader), total=max_steps) as progressbar:
  # note: ppo_trainer.dataloader is just a regular dataloader of queries, no RL-specific magic :)
  for epoch, batch in progressbar:
    if epoch >= max_steps:
        break

    # Rollout stage: generate continuations from batch queries using main_model
    response_tensors = ppo_trainer.generate(batch['input_ids'], **generation_kwargs)
    # ^-- list of tensors of token ids from main model tokenizer

    # de-tokenize responses to strings (since reward model uses a different tokenizer)
    batch["response"] = [main_tokenizer.decode(response.squeeze()) for response in response_tensors]
    # note: response_tensors already contain query tokens, so we don't need to add queries manually.
    # This may not be true for other tasks: check this manually by viewing batch["response"] and batch["query"]


    # Evaluation stage
    rewards = compute_reward(batch['response'])

    # Update stage
    stats = ppo_trainer.step(batch['input_ids'], response_tensors, list(rewards.split(1)))
    stats['rewards/mean'] = rewards.mean().item()

    print("-" * 30, 'STEP', epoch, '-' * 30)
    print(f'rewards/mean:\t{stats["rewards/mean"]:.9f}\t<---- average reward over this batch (higher=better, noisy)')
    print(f'ppo/returns/mean:\t{stats["ppo/returns/mean"]:.9f}\t<---- model-estimated average discounted reward')
    print(f'objective/kl:\t{stats["objective/kl"]:.9f}\t<---- how far we are from the original model (regularizer)')
    print()

    ppo_trainer.log_stats(stats, batch, list(rewards.split(1)))

  0%|          | 0/50 [00:00<?, ?it/s]

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


------------------------------ STEP 0 ------------------------------
rewards/mean:	-2.709594727	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.512554765	<---- model-estimated average discounted reward
objective/kl:	0.000000000	<---- how far we are from the original model (regularizer)

------------------------------ STEP 1 ------------------------------
rewards/mean:	-2.839062691	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.672181487	<---- model-estimated average discounted reward
objective/kl:	2.843236208	<---- how far we are from the original model (regularizer)

------------------------------ STEP 2 ------------------------------
rewards/mean:	-2.757976532	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.908971310	<---- model-estimated average discounted reward
objective/kl:	5.367511272	<---- how far we are from the original model (regularizer)

------------------------------ STEP 3



------------------------------ STEP 6 ------------------------------
rewards/mean:	-2.432014465	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-1.606701016	<---- model-estimated average discounted reward
objective/kl:	9.707696915	<---- how far we are from the original model (regularizer)





------------------------------ STEP 7 ------------------------------
rewards/mean:	-2.968502045	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-1.849359989	<---- model-estimated average discounted reward
objective/kl:	11.430515289	<---- how far we are from the original model (regularizer)

------------------------------ STEP 8 ------------------------------
rewards/mean:	-2.760032654	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-2.119538784	<---- model-estimated average discounted reward
objective/kl:	11.859004974	<---- how far we are from the original model (regularizer)





------------------------------ STEP 9 ------------------------------
rewards/mean:	-2.727336884	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-2.292415142	<---- model-estimated average discounted reward
objective/kl:	13.862224579	<---- how far we are from the original model (regularizer)

------------------------------ STEP 10 ------------------------------
rewards/mean:	-2.598836899	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-2.429044485	<---- model-estimated average discounted reward
objective/kl:	12.684354782	<---- how far we are from the original model (regularizer)

------------------------------ STEP 11 ------------------------------
rewards/mean:	-1.887302399	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-2.500426769	<---- model-estimated average discounted reward
objective/kl:	14.292473793	<---- how far we are from the original model (regularizer)

------------------------------ S



------------------------------ STEP 17 ------------------------------
rewards/mean:	-1.510776520	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-2.874777317	<---- model-estimated average discounted reward
objective/kl:	15.014264107	<---- how far we are from the original model (regularizer)

------------------------------ STEP 18 ------------------------------
rewards/mean:	-1.376991272	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-3.147479534	<---- model-estimated average discounted reward
objective/kl:	16.061763763	<---- how far we are from the original model (regularizer)

------------------------------ STEP 19 ------------------------------
rewards/mean:	-1.539452553	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-3.442421436	<---- model-estimated average discounted reward
objective/kl:	15.843231201	<---- how far we are from the original model (regularizer)





------------------------------ STEP 20 ------------------------------
rewards/mean:	-1.667388916	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-3.504861355	<---- model-estimated average discounted reward
objective/kl:	18.818943024	<---- how far we are from the original model (regularizer)





------------------------------ STEP 21 ------------------------------
rewards/mean:	-0.939167023	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-3.233695745	<---- model-estimated average discounted reward
objective/kl:	13.269552231	<---- how far we are from the original model (regularizer)





------------------------------ STEP 22 ------------------------------
rewards/mean:	-1.468479156	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-3.436835766	<---- model-estimated average discounted reward
objective/kl:	18.745349884	<---- how far we are from the original model (regularizer)

------------------------------ STEP 23 ------------------------------
rewards/mean:	-0.664945602	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-2.819217682	<---- model-estimated average discounted reward
objective/kl:	17.309314728	<---- how far we are from the original model (regularizer)

------------------------------ STEP 24 ------------------------------
rewards/mean:	-0.118713379	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-2.886066437	<---- model-estimated average discounted reward
objective/kl:	15.772443771	<---- how far we are from the original model (regularizer)

------------------------------ 



------------------------------ STEP 28 ------------------------------
rewards/mean:	-0.654510498	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-2.807394981	<---- model-estimated average discounted reward
objective/kl:	16.348711014	<---- how far we are from the original model (regularizer)

------------------------------ STEP 29 ------------------------------
rewards/mean:	-1.058044434	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-3.327441454	<---- model-estimated average discounted reward
objective/kl:	15.754873276	<---- how far we are from the original model (regularizer)

------------------------------ STEP 30 ------------------------------
rewards/mean:	-1.771949768	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-3.584623337	<---- model-estimated average discounted reward
objective/kl:	14.280710220	<---- how far we are from the original model (regularizer)





------------------------------ STEP 31 ------------------------------
rewards/mean:	-0.474246979	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-2.928556442	<---- model-estimated average discounted reward
objective/kl:	14.078136444	<---- how far we are from the original model (regularizer)





------------------------------ STEP 32 ------------------------------
rewards/mean:	0.283271790	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-2.640023470	<---- model-estimated average discounted reward
objective/kl:	10.203962326	<---- how far we are from the original model (regularizer)

------------------------------ STEP 33 ------------------------------
rewards/mean:	-0.484586716	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-2.911023617	<---- model-estimated average discounted reward
objective/kl:	14.195898056	<---- how far we are from the original model (regularizer)





------------------------------ STEP 34 ------------------------------
rewards/mean:	-0.760244370	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-2.982150316	<---- model-estimated average discounted reward
objective/kl:	10.753108978	<---- how far we are from the original model (regularizer)





------------------------------ STEP 35 ------------------------------
rewards/mean:	0.130962372	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-2.299914360	<---- model-estimated average discounted reward
objective/kl:	14.166477203	<---- how far we are from the original model (regularizer)





------------------------------ STEP 36 ------------------------------
rewards/mean:	0.065838814	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-2.396046162	<---- model-estimated average discounted reward
objective/kl:	13.810874939	<---- how far we are from the original model (regularizer)





------------------------------ STEP 37 ------------------------------
rewards/mean:	-1.616653442	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-3.341355801	<---- model-estimated average discounted reward
objective/kl:	12.908693314	<---- how far we are from the original model (regularizer)

------------------------------ STEP 38 ------------------------------
rewards/mean:	-0.512184143	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-2.521491051	<---- model-estimated average discounted reward
objective/kl:	10.939384460	<---- how far we are from the original model (regularizer)





------------------------------ STEP 39 ------------------------------
rewards/mean:	-0.180938721	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-2.267166853	<---- model-estimated average discounted reward
objective/kl:	10.668241501	<---- how far we are from the original model (regularizer)





------------------------------ STEP 40 ------------------------------
rewards/mean:	-0.061058044	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-1.974205494	<---- model-estimated average discounted reward
objective/kl:	10.986654282	<---- how far we are from the original model (regularizer)





------------------------------ STEP 41 ------------------------------
rewards/mean:	-0.158296585	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-2.049994469	<---- model-estimated average discounted reward
objective/kl:	11.873064041	<---- how far we are from the original model (regularizer)

------------------------------ STEP 42 ------------------------------
rewards/mean:	-0.227920532	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-2.186037540	<---- model-estimated average discounted reward
objective/kl:	12.739182472	<---- how far we are from the original model (regularizer)

------------------------------ STEP 43 ------------------------------
rewards/mean:	-0.193096161	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-1.834181786	<---- model-estimated average discounted reward
objective/kl:	9.630664825	<---- how far we are from the original model (regularizer)





------------------------------ STEP 44 ------------------------------
rewards/mean:	1.013856411	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-1.424227715	<---- model-estimated average discounted reward
objective/kl:	15.346023560	<---- how far we are from the original model (regularizer)





------------------------------ STEP 45 ------------------------------
rewards/mean:	0.497050524	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-1.517905235	<---- model-estimated average discounted reward
objective/kl:	16.601406097	<---- how far we are from the original model (regularizer)

------------------------------ STEP 46 ------------------------------
rewards/mean:	1.451988220	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-1.052968740	<---- model-estimated average discounted reward
objective/kl:	15.199113846	<---- how far we are from the original model (regularizer)

------------------------------ STEP 47 ------------------------------
rewards/mean:	-0.053337097	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-1.813966274	<---- model-estimated average discounted reward
objective/kl:	15.371620178	<---- how far we are from the original model (regularizer)





------------------------------ STEP 48 ------------------------------
rewards/mean:	0.934953213	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-1.329378128	<---- model-estimated average discounted reward
objective/kl:	18.521259308	<---- how far we are from the original model (regularizer)

------------------------------ STEP 49 ------------------------------
rewards/mean:	1.479290009	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.975120902	<---- model-estimated average discounted reward
objective/kl:	17.027679443	<---- how far we are from the original model (regularizer)
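
The `objective/kl` and `rewards/mean` columns in the log above pull in opposite directions: the reward model pushes the policy away from the original model, and the KL term pulls it back. Below is a minimal sketch of how the two are commonly combined in PPO-based RLHF, written in plain Python so it runs without a GPU. The function names, the toy log-probabilities and the `beta=0.2` coefficient are illustrative assumptions, not values taken from this run; TRL's own bookkeeping may differ in detail (for example, it can adapt the coefficient during training).

```python
def sequence_kl(policy_logprobs, ref_logprobs):
    """Sample-based KL estimate for one generated sequence:
    sum over generated tokens of log p_policy(token) - log p_ref(token)."""
    return sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))


def batch_mean_kl(batch_policy, batch_ref):
    """Average the per-sequence KL estimates over a batch."""
    kls = [sequence_kl(p, r) for p, r in zip(batch_policy, batch_ref)]
    return sum(kls) / len(kls)


def penalized_score(score, kl, beta=0.2):
    """Reward-model score minus the KL penalty (beta is a hypothetical coefficient)."""
    return score - beta * kl


# Toy per-token log-probabilities for two sampled sequences (made-up numbers).
policy = [[-1.0, -0.5, -2.0], [-0.25, -0.75]]
ref = [[-1.5, -1.0, -2.0], [-0.25, -1.75]]

mean_kl = batch_mean_kl(policy, ref)
total = penalized_score(score=2.0, kl=mean_kl)
print(f"mean KL estimate: {mean_kl:.3f}, penalized score: {total:.3f}")
```

Read this way, a rising `objective/kl` alongside a rising `rewards/mean` means the policy is buying reward by drifting away from the original model, which is the trade-off the regularizer is there to limit.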



In [195]:
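best_candidates = []
for prompt in prompts:
    # 16 copies of the same prompt -> 16 independently sampled continuations
    inputs = main_tokenizer([prompt] * 16, return_tensors='pt').to(device)
    with torch.no_grad():
        candidates = main_model.model.generate(**inputs, max_new_tokens=100, do_sample=True)
        # Score each full candidate (prompt + continuation), not the bare prompt:
        # scoring `inputs` would give all 16 candidates the same reward.
        # Assumes reward_model accepts main_tokenizer's token ids, as the original call did.
        rewards = [reward_model(input_ids=candidate[None]).logits[0, 0].item() for candidate in candidates]
    best = candidates[int(np.argmax(rewards))]
    best_candidates.append(main_tokenizer.decode(best))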

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In [198]:
best_candidates_after_tuning = best_candidates
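
The selection step used for both the "before" and "after" candidates is best-of-n sampling: draw several continuations, score each one, keep the highest-scoring. Here is a minimal, model-free sketch of that pattern. `best_of_n`, `toy_generate` and `toy_score` are hypothetical stand-ins for the language model and the reward model, so the snippet runs without any weights.

```python
import random


def best_of_n(generate, score, prompt, n=16):
    """Sample n candidates for a prompt and return the best one with its score."""
    candidates = [generate(prompt) for _ in range(n)]
    rewards = [score(candidate) for candidate in candidates]
    best_index = max(range(n), key=rewards.__getitem__)
    return candidates[best_index], rewards[best_index]


# Toy stand-ins (assumptions for illustration only).
WORDS = ["great", "awful", "fine", "boring"]
rng = random.Random(0)


def toy_generate(prompt):
    """Pretend language model: appends five random words to the prompt."""
    return prompt + " " + " ".join(rng.choice(WORDS) for _ in range(5))


def toy_score(text):
    """Pretend reward model: counts occurrences of the word it 'likes'."""
    return text.count("awful")


best, reward = best_of_n(toy_generate, toy_score, "You are")
print(best, reward)
```

The important detail is that `score` receives each candidate itself. If it were given the prompt instead, every candidate would get the same reward and the "best" pick would be arbitrary.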

In [203]:
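import re

def strip_special_tokens(text):
    # Non-greedy match, so text between two <...> markers on the same line is kept
    return re.sub(r'<.*?>', '', text)

for prompt, before, after in zip(prompts, best_candidates_before_tuning, best_candidates_after_tuning):
    print(f"PROMPT: {prompt}\n\nBEFORE: {strip_special_tokens(before)}\n\nAFTER: {strip_special_tokens(after)}\n++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++\n")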

PROMPT: u r

BEFORE: u ry, eu yc, nx, ny, c, cx, cy

This will let you know if you're currently connected to the Wi-Fi network and you can also connect to your data connection after the download finishes. You can also add your Wi-Fi network settings from your phone, or you can copy them to your laptop. If the connection still doesn't work for you following these tips, try restarting your Wi-Fi setup.

5. Connect

AFTER: u ri's.

You're a thug, cocksucker.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

PROMPT: You are

BEFORE: You are allowed two guests per day.

Children are welcome, but only children under 6 must be accompanied by an adult. The cost to feed and feed-site all children under 6 includes: one 8 ounce cup of water, one 8 oz cup of protein powder, one 8 oz bag of chips, one 8 oz bag of fruit and vegetable, one 8 oz of juice, one 8 oz packet of salt and pepper, one 8 oz bottle of milk, one 8 oz gallon of laundry detergent,

AFTER: You are n

**Summary:**
In general, the model became better at generating hate speech after fine-tuning. However, it still sometimes produces meaningless text and would need further tuning to give better results. There are also cases where the initial model generated more negative text than the tuned one.
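
One way to make this comparison less anecdotal is to score the before and after candidates with the same reward model and compare them prompt by prompt. The sketch below assumes you have kept one reward per prompt for each model; `compare_rewards` is a hypothetical helper and the numbers are made up for illustration, not measurements from this notebook.

```python
def compare_rewards(before, after):
    """Summarize per-prompt reward changes between two models."""
    if len(before) != len(after):
        raise ValueError("need one reward per prompt for both models")
    deltas = [a - b for b, a in zip(before, after)]
    return {
        "mean_delta": sum(deltas) / len(deltas),
        "win_rate": sum(d > 0 for d in deltas) / len(deltas),
        # indices of prompts where the tuned model scored lower
        "regressions": [i for i, d in enumerate(deltas) if d < 0],
    }


# Hypothetical rewards for four prompts.
rewards_before = [0.5, -1.0, 2.0, 0.0]
rewards_after = [1.5, 1.0, 1.0, 2.0]

report = compare_rewards(rewards_before, rewards_after)
print(report)
```

The `regressions` list points at the prompts where the initial model did better, which are the cases worth reading by hand.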