<font color=red>**Danger zone:**</font> you'll be fine-tuning a model to generate positive, negative or even toxic reviews. We'll be doing this for fun, but this is also the technique for [review bombing](https://en.wikipedia.org/wiki/Review_bomb), bot farms on social media and other less than dignified stuff. It is ultimately your decision how you apply this knowledge, but before you choose, ask yourself: is this why you chose to learn ML?


# LLM Alignment with Reinforcement Learning from Human Feedback (RLHF)

_based on the [original notebook](https://github.com/antndlcrx/oxford-llms-workshop/blob/main/materials/seminars/day_3/8_LLMs%20alignment%20with%20RLHF.ipynb) by Ilya Boytsov for the Oxford LLMs workshop_



In this session, you're gonna fine-tune a language model with reinforcement learning to make it generate good (or bad) reviews.

To perform RL-based fine-tuning, we'll use a new (in this course) library called [Transformer Reinforcement Learning (TRL)](https://huggingface.co/docs/trl). TRL implements the main reinforcement learning components of RLHF: reward modeling and fine-tuning with PPO.

![img](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/TRL-readme.png)

In [2]:
%pip install -q trl==0.7.4 transformers==4.33.1 accelerate==0.28.0 datasets peft==0.5.0



### Tutorial: align the model to generate positive movie reviews

To see how TRL works, we'll use it to align GPT-2 on the IMDB dataset to generate positive (or negative) movie reviews. In fact, __it's your choice whether you want positive or negative reviews.__

But before you choose, let's take a look at the baseline model: a GPT-2 fine-tuned on generating arbitrary movie reviews.

In [1]:
from tqdm.auto import tqdm, trange

In [2]:
import torch
import transformers
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
main_tokenizer = transformers.AutoTokenizer.from_pretrained("lvwerra/gpt2-imdb")
main_model = transformers.AutoModelForCausalLM.from_pretrained("lvwerra/gpt2-imdb", device_map=device)



In [3]:
inputs = main_tokenizer("The movie", return_tensors='pt').to(device)
generated_ids = main_model.generate(**inputs, max_new_tokens=50, do_sample=True)
print("\nGenerated text:", main_tokenizer.decode(generated_ids.flatten().cpu().numpy().tolist()))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Generated text: The movie I loved it was definitely the movie that would convince me to rent "A Day in the Life". It reminded me of the old old Hitchcock movies like "Shakespeare In Love" and "Citizen Kane". My guess is there are many more who


If you run this cell a couple of times, you'll see that the model generates positive, negative and neutral reviews in some proportion. What we're gonna do next is teach the model to generate more positive (or negative) reviews.

Similarly to InstructGPT, we're gonna do that in 2 stages:
- **train a reward model** to assign higher values to positive (or negative) reviews
- fine-tune the language model to **maximize that reward using [proximal policy optimization](https://openai.com/research/openai-baselines-ppo)**



## Stage 1: train a reward model

First, we'll train a BERT-like model as our reward model. We'll generate synthetic pairwise rankings to emulate human rankings.
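To make "synthetic pairwise rankings" concrete: every text with the target label can be paired with every text with the other label, and the pairs can be enumerated lazily from a single flat index. The sketch below is only an illustration; `pair_from_index` and the toy texts are hypothetical names and data, not part of the notebook.

```python
def pair_from_index(index, chosen_texts, rejected_texts):
    """Map a flat index in [0, len(chosen) * len(rejected)) to one (chosen, rejected) pair."""
    # divmod by the number of rejected texts: the quotient walks over chosen texts,
    # the remainder walks over rejected texts, so every pair appears exactly once
    chosen_idx, rejected_idx = divmod(index, len(rejected_texts))
    return chosen_texts[chosen_idx], rejected_texts[rejected_idx]


# toy data, made up for illustration
chosen = ["great film", "loved it", "a masterpiece"]
rejected = ["boring", "awful"]

all_pairs = [pair_from_index(i, chosen, rejected) for i in range(len(chosen) * len(rejected))]
for pair in all_pairs:
    print(pair)
```

Nothing is materialized up front, which matters when the number of pairs runs into the hundreds of millions.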

__Q:__ why do I need a reward model? Can I just use a pre-trained sentiment classifier? <br> __A:__ Yes, you can - but that only works for movie reviews. This tutorial will teach you how to do RLHF for any kind of objective.


__If you actually want to maximize sentiment (or another "label") instead of human preferences, train the reward model as a classifier! (see week5)__


In [4]:
# We'll be fine-tuning a small BERT-like model for now. Please try other models for the main assignment.
reward_model = transformers.AutoModelForSequenceClassification.from_pretrained("distilbert-base-cased", device_map=device)
reward_tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert-base-cased")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


__Note that__ the reward model has a separate tokenizer, different from the main model. They don't need to be the same for RLHF fine-tuning.

In [5]:
# To train a reward model, you need a dataset (or generator) of positive-negative pairs.
# Each training sample should be a dict with 4 keys:
#  - input_ids_chosen, attention_mask_chosen = tokenizer("A sentence that human labeler likes more")
#  - input_ids_rejected, attention_mask_rejected = tokenizer("A sentence that human labeler likes less")

import torch
import datasets

class IMDBPairwiseDataset(torch.utils.data.Dataset):
    """ A dataset of all possible pairs of chosen and rejected texts in TRL reward training format """

    column_names = ['input_ids_chosen', 'attention_mask_chosen', 'input_ids_rejected', 'attention_mask_rejected']

    def __init__(self, imdb, tokenizer, accepted_label: int):
        super().__init__()
        self.tokenizer = tokenizer
        self.chosen_texts = [row['text'] for row in imdb if row['label'] == accepted_label]
        self.rejected_texts = [row['text'] for row in imdb if row['label'] != accepted_label]
        assert self.chosen_texts, f"no texts with label {accepted_label}"
        print(f"Found {len(self.chosen_texts)} chosen and {len(self.rejected_texts)} rejected texts, {len(self)} pairs")

    def __len__(self):
        return len(self.chosen_texts) * len(self.rejected_texts)  # all pairs

    def __getitem__(self, index: int):
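        # index enumerates all pairs: the quotient selects the chosen text, the remainder the rejected one
        chosen = self.tokenizer(self.chosen_texts[index // len(self.rejected_texts)], truncation=True)
        rejected = self.tokenizer(self.rejected_texts[index % len(self.rejected_texts)], truncation=True)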
        return dict(input_ids_chosen=chosen['input_ids'], attention_mask_chosen=chosen['attention_mask'],
                    input_ids_rejected=rejected['input_ids'], attention_mask_rejected=rejected['attention_mask'])

In [6]:
TARGET_LABEL = 0   # IMDB labels: 0 = negative, 1 = positive; make sure it works by reviewing the sample printed below
imdb = datasets.load_dataset("imdb", split='train')
reward_data = IMDBPairwiseDataset(imdb, reward_tokenizer, accepted_label=TARGET_LABEL)

sample = reward_data[31337]
print('CHOSEN:', reward_tokenizer.decode(sample['input_ids_chosen']))
print('REJECTED:', reward_tokenizer.decode(sample['input_ids_rejected']))

Found 12500 chosen and 12500 rejected texts, 156250000 pairs
CHOSEN: [CLS] If only to avoid making this type of film in the future. This film is interesting as an experiment but tells no cogent story. < br / > < br / > One might feel virtuous for sitting thru it because it touches on so many IMPORTANT issues but it does so without any discernable motive. The viewer comes away with no new perspectives ( unless one comes up with one while one's mind wanders, as it will invariably do during this pointless film ). < br / > < br / > One might better spend one's time staring out a window at a tree growing. < br / > < br / > [SEP]
REJECTED: [CLS] This movie has some things that are pretty amazing. First, it is supposed to be based on a true story. That, in itself, is amazing that multiple tornadoes would hit the same town at night in the fall - in Nebraska. I wonder if the real town's name was close to " Blainsworth " ( which is the town's name in the movie ). There is an Ainsworth, Nebraska,

We'll be using `trl.RewardTrainer` - a special case of `transformers.Trainer` that you used in the past. `RewardTrainer` accepts the same format of training arguments (e.g. batch size, gradient checkpointing) as before, except that it trains the model for the pairwise reward objective from [the InstructGPT paper](https://arxiv.org/pdf/2203.02155.pdf):

![img](https://i.imgur.com/2JzNAPs.png)

Note that the model itself does not score pairs: it processes the chosen ($y_w$) and rejected ($y_l$) samples independently. To minimize this loss, the reward model needs to score the chosen sample higher than the rejected one. Note that the formula also assumes some context $x$, which is useful for seq2seq tasks. In our case of movie reviews, $x$ is empty.
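As a rough illustration of this objective, here is a minimal pure-Python sketch of the pairwise loss, $-\log \sigma(r(y_w) - r(y_l))$ averaged over a batch of pairs. The function name `pairwise_reward_loss` and the reward values are made up for the example; this is not how `RewardTrainer` is implemented internally.

```python
import math


def pairwise_reward_loss(chosen_rewards, rejected_rewards):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over a batch of pairs."""
    losses = []
    for r_w, r_l in zip(chosen_rewards, rejected_rewards):
        margin = r_w - r_l
        # -log(sigmoid(m)) == log(1 + exp(-m)), written in a numerically stable form
        losses.append(max(-margin, 0.0) + math.log1p(math.exp(-abs(margin))))
    return sum(losses) / len(losses)


good = pairwise_reward_loss([2.0, 1.5], [-1.0, 0.5])   # chosen scored higher: small loss
bad = pairwise_reward_loss([-1.0, 0.5], [2.0, 1.5])    # chosen scored lower: large loss
tie = pairwise_reward_loss([0.0], [0.0])               # no preference: log(2)
print(good, bad, tie)
```

Only the difference between the two rewards enters the loss, which is why the absolute scale of the rewards is arbitrary.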

In [7]:
import trl

training_args = trl.RewardConfig(  # like transformers.TrainingArguments
    output_dir="reward_model",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
    learning_rate=1.41e-5,
    max_steps=1_000,              # note: training may need more than 1k steps
    logging_steps=50,
    gradient_checkpointing=True,  # reduce memory usage but train ~30% slower
    gradient_checkpointing_kwargs={"use_reentrant": False},
    fp16=True                     # disable this on CPU or on very old GPUs
    # you may add any other hyperparameters that you found useful in weeks 5-7
)

trainer = trl.RewardTrainer(
    model=reward_model,
    args=training_args,
    tokenizer=reward_tokenizer,
    train_dataset=reward_data,
    peft_config=None,  # optionally, you may tune with LoRA, prompt-tuning, etc
)

trainer.train()

  5%|▌         | 50/1000 [00:45<08:12,  1.93it/s]

{'loss': 0.5346, 'learning_rate': 1.34091e-05, 'epoch': 0.0}


 10%|█         | 100/1000 [01:12<12:57,  1.16it/s]

{'loss': 0.1763, 'learning_rate': 1.2718200000000001e-05, 'epoch': 0.0}


 15%|█▌        | 150/1000 [01:38<07:20,  1.93it/s]

{'loss': 0.142, 'learning_rate': 1.20132e-05, 'epoch': 0.0}


 20%|██        | 200/1000 [02:04<06:54,  1.93it/s]

{'loss': 0.1297, 'learning_rate': 1.1308200000000001e-05, 'epoch': 0.0}


 25%|██▌       | 250/1000 [02:30<06:28,  1.93it/s]

{'loss': 0.1093, 'learning_rate': 1.06032e-05, 'epoch': 0.0}


 30%|███       | 300/1000 [02:56<06:02,  1.93it/s]

{'loss': 0.1157, 'learning_rate': 9.8982e-06, 'epoch': 0.0}




{'loss': 0.0963, 'learning_rate': 9.1932e-06, 'epoch': 0.0}


 40%|████      | 400/1000 [03:47<05:12,  1.92it/s]

{'loss': 0.0835, 'learning_rate': 8.4882e-06, 'epoch': 0.0}


 45%|████▌     | 450/1000 [04:13<04:42,  1.94it/s]

{'loss': 0.0693, 'learning_rate': 7.7832e-06, 'epoch': 0.0}


 50%|█████     | 500/1000 [04:39<04:19,  1.93it/s]

{'loss': 0.0646, 'learning_rate': 7.0782e-06, 'epoch': 0.0}


 55%|█████▌    | 550/1000 [05:11<03:54,  1.92it/s]

{'loss': 0.0703, 'learning_rate': 6.3732e-06, 'epoch': 0.0}


 60%|██████    | 600/1000 [05:37<03:26,  1.94it/s]

{'loss': 0.0653, 'learning_rate': 5.6682e-06, 'epoch': 0.0}


 65%|██████▌   | 650/1000 [06:03<03:01,  1.92it/s]

{'loss': 0.0693, 'learning_rate': 4.9632e-06, 'epoch': 0.0}


 70%|███████   | 700/1000 [06:29<02:36,  1.91it/s]

{'loss': 0.0722, 'learning_rate': 4.2582e-06, 'epoch': 0.0}


 75%|███████▌  | 750/1000 [06:55<02:10,  1.91it/s]

{'loss': 0.0678, 'learning_rate': 3.5532e-06, 'epoch': 0.0}


 80%|████████  | 800/1000 [07:21<01:44,  1.91it/s]

{'loss': 0.0653, 'learning_rate': 2.8482e-06, 'epoch': 0.0}


 85%|████████▌ | 850/1000 [07:47<01:18,  1.92it/s]

{'loss': 0.0715, 'learning_rate': 2.1432e-06, 'epoch': 0.0}


 90%|█████████ | 900/1000 [08:13<00:52,  1.92it/s]

{'loss': 0.0578, 'learning_rate': 1.4382e-06, 'epoch': 0.0}


 95%|█████████▌| 950/1000 [08:39<00:26,  1.92it/s]

{'loss': 0.0795, 'learning_rate': 7.332e-07, 'epoch': 0.0}


100%|██████████| 1000/1000 [09:05<00:00,  1.92it/s]

{'loss': 0.0731, 'learning_rate': 2.82e-08, 'epoch': 0.0}


100%|██████████| 1000/1000 [09:12<00:00,  1.81it/s]

{'train_runtime': 552.4724, 'train_samples_per_second': 57.921, 'train_steps_per_second': 1.81, 'train_loss': 0.11066102957725525, 'epoch': 0.0}





TrainOutput(global_step=1000, training_loss=0.11066102957725525, metrics={'train_runtime': 552.4724, 'train_samples_per_second': 57.921, 'train_steps_per_second': 1.81, 'train_loss': 0.11066102957725525, 'epoch': 0.0})

In [8]:
reward_model.gradient_checkpointing_disable()
reward_model.eval()

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

### Sanity-check the reward model (1 point)

Let's check how our reward model performs.

__Your task__ is to measure how often your reward model ranks a pair of (chosen and rejected) reviews correctly. Please measure this separately for train data (`imdb`) and a separate test set loaded below.
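One possible way to define this metric, shown here as a pure-Python sketch on made-up reward values: count the fraction of (chosen, rejected) pairs in which the chosen review gets the higher reward. `pairwise_accuracy` is a hypothetical helper, not a TRL function.

```python
def pairwise_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of all (chosen, rejected) pairs ranked correctly; ties count as half."""
    correct = 0.0
    for r_w in chosen_rewards:
        for r_l in rejected_rewards:
            if r_w > r_l:
                correct += 1.0
            elif r_w == r_l:
                correct += 0.5
    return correct / (len(chosen_rewards) * len(rejected_rewards))


# toy rewards, made up for illustration: 5 of the 6 pairs are ordered correctly
accuracy = pairwise_accuracy([5.1, 3.2, -0.4], [-5.0, 0.1])
print(accuracy)
```

The double loop is quadratic, so on the full dataset you would likely evaluate a random subsample of pairs or use a rank-based shortcut instead.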

In [9]:

for sample_index in 45, 16000:
  print('TEXT:', imdb[sample_index]['text'])
  inputs = reward_tokenizer(
      imdb[sample_index]['text'], truncation=True, return_tensors='pt').to(device)
  with torch.no_grad():
    reward = reward_model(**inputs).logits[0, 0].item()
    print("REWARD:", reward)
  print('LABEL:', imdb[sample_index]['label'])
  print()

# note: your reward model may produce different absolute rewards.
# This is fine as long as the rewards are ordered correctly (most of the time)

TEXT: This movie sucked. It really was a waste of my life. The acting was atrocious, the plot completely implausible. Long, long story short, these people get "terrorized" by this pathetic "crazed killer", but completely fail to fight back in any manner. And this is after they take a raft on a camping trip, with no gear, and show up at a campsite that is already assembled and completely stocked with food and clothes and the daughters headphones. Additionally, after their boat goes missing, they panic that they're stuck in the woods, but then the daughters boyfriend just shows up and they apparently never consider that they could just hike out of the woods like he did to get to them. Like I said, this movie sucks. A complete joke. Don't let your girlfriend talk you into watching it.
REWARD: 5.08203125
LABEL: 0

TEXT: Good: Engaging cinematic firefights, great presentation, vehicles are actually fun to drive, fairly appealing multiplayer, faithful to the movie, and the list goes on.<br /

In [11]:
from sklearn.metrics import roc_auc_score

imdb_test = datasets.load_dataset("imdb", split='test')

In [13]:
labels = []
rewards = []
for sample_index in trange(len(imdb_test)):
  inputs = reward_tokenizer(imdb_test[sample_index]['text'], truncation=True, return_tensors='pt').to(device)
  with torch.no_grad():
    reward = reward_model(**inputs).logits[0, 0].item()
    rewards.append(reward)
  labels.append(imdb_test[sample_index]['label'])

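# ROC AUC is the probability that a random label-1 review gets a higher reward than a random label-0 one,
# i.e. the pairwise ranking accuracy. If the reward model prefers label 0, the ranking is flipped: use 1 - AUC.
auc = roc_auc_score(labels, rewards)
print(auc if TARGET_LABEL == 1 else 1 - auc)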

100%|██████████| 25000/25000 [04:18<00:00, 96.80it/s] 


0.9717145632


### Reward-guided generation (1 point)

If you did everything right, by now you should have a decent reward model. Before we use it for reinforcement learning, let's see if we can align model samples without any training.

To do so, you can use reward-guided inference: __generate N=16 samples, then select the one with the highest reward__ (according to your reward model).

For this problem, it's on you to demonstrate whether or not your code works. Find at least 5 neutral prompts such as "This movie is" (...), generate samples, rank them based on reward and show which samples get the highest reward.
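The selection logic itself is small. Below is a sketch with stand-in functions so that it runs without any model: `best_of_n`, `toy_generate` and `toy_reward` are hypothetical names, and in the real task you would plug in sampling from the main model and scoring with your reward model.

```python
import random


def best_of_n(prompt, generate_fn, reward_fn, n=16):
    """Sample n candidates, score each, and return (reward, text) pairs, highest reward first."""
    # sequential sampling for clarity; batching the n samples is faster with a real model
    candidates = [generate_fn(prompt) for _ in range(n)]
    scored = [(reward_fn(text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Stand-ins for illustration only (not the real models)
rng = random.Random(0)
ENDINGS = [" wonderful.", " fine.", " terrible.", " a waste of time."]
WORD_SCORES = (("wonderful", 2.0), ("fine", 0.5), ("terrible", -1.0), ("waste", -2.0))


def toy_generate(prompt):
    return prompt + rng.choice(ENDINGS)


def toy_reward(text):
    return sum(value for word, value in WORD_SCORES if word in text)


ranked = best_of_n("This movie is", toy_generate, toy_reward, n=16)
print("best: ", ranked[0])
print("worst:", ranked[-1])
```

Printing the reward next to each sample, as done here, makes it easier to demonstrate that the ranking actually follows the reward model.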

Note: it is faster to generate samples in parallel, rather than sequentially, as follows:




In [14]:
inputs = main_tokenizer(["It was"] * 5, return_tensors='pt').to(device)
for candidate in main_model.generate(**inputs, max_new_tokens=50, do_sample=True):
  print("Sample:", main_tokenizer.decode(candidate.flatten().cpu().numpy().tolist()))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample: It was almost like I'd never heard of the film before I saw it. I knew of the opening sequence because I remember being told it was part of the early-'60's and I was still getting used to it. When I first saw it, I
Sample: It was an absolute no-brainer that the director would have to be nominated for the Academy Award. Now for the script. What follows is the story (I will not repeat myself here, as it is too long for one movie) of some pretty old-
Sample: It was also very strange it has no title of "Drama". I will be the last to write an example of that in a movie, but the original was never released in any form, other than a trailer released by American distributor HARDMADE.
Sample: It was so strange I really don't understand the way things were going until I sat down for it. And it seems, after watching it on TV, that you've become so accustomed to it, that this little movie is quite unbelievable. It also is a
Sample: It was a truly spectacular movie. All the cast performed w

In [15]:
prompts = ['This movie is']

In [22]:
import numpy as np

def ranked_generations(prompt, N=16):
    inputs = main_tokenizer([prompt] * N, return_tensors='pt').to(device)
    candidates = main_model.generate(**inputs, max_new_tokens=50, do_sample=True)
    samples = np.array([main_tokenizer.decode(candidate.flatten().cpu().numpy().tolist()) for candidate in candidates])

    rewards = []
    for sample in samples:
        inputs = reward_tokenizer(sample, truncation=True, return_tensors='pt').to(device)
        with torch.no_grad():
            reward = reward_model(**inputs).logits[0, 0].item()
            rewards.append(reward)

    rewards = np.array(rewards)
    ranks = np.argsort(-rewards)
    for sample in samples[ranks]:
        print(sample)
        print('===========================================')

In [23]:
ranked_generations('This film is')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


This film is not that bad, and as mentioned before, it's not all bad movies with bad plot. Instead of a typical "Hollywood" movie where the story is set on a spaceship, the movie is a full movie of a series of scenes involving a
This film is about two American families separated by the fact that a very wealthy family members are going to be losing their home at the funeral and they are coming home early and being treated by local police. They have nothing like the typical elderly or mentally handicapped living on
This film is so good you might want to throw in a few more.<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endo

In [24]:
ranked_generations('The acting is')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The acting is quite terrible. If the writing were anything to go by, I would have said "Oh boy, that scene from The Great Gatsby" rather than this. He is a talented actor but the character is nothing more than plain old dumb luck.
The acting is terrible, the script is horrendous, the script, the acting, and even the story line is stupid.<br /><br />The only reason I rated this film 5 or even 3 is because I never really went into it thinking that I am a
The acting is all bad and the writing is unbelievable and has a bad taste. The movie ends in the middle which has me running to see the box office. They even put their hands in the air and it doesn't look cool.<br /><br />So
The acting is not only disappointing, it is downright embarrassing. In my personal opinion, no one with this much talent could have done anything else with their two hours. But even then, it seemed so over-the-top.<br /><br />I have
The acting is terrible, i mean i saw this before and i saw it after,it seemed to me l

In [25]:
ranked_generations('This movie makes me feel')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


This movie makes me feel like I have seen the worst movie in 70 years, I know that, but this is the worst movie on DVD I have ever seen. I would not recommend this movie to anyone but really just go see it for the laughs.<|endoftext|><|endoftext|><|endoftext|><|endoftext|>
This movie makes me feel, of all time, that the movie was just an experiment in the development of a character. But there is no reason to believe that the audience will reject this theory of character development. It does not make sense! And it doesn't even make sense
This movie makes me feel bad. If you're a big fan of bad movies i strongly suggest watching it. If you do not love bad movies i would highly suggest this movie. I have seen some good movies about this topic but the ones I still believe are really just bad
This movie makes me feel bad. Not because of its quality but because of it. There is not much going on, not much to work with, not much to tell you to be aware of just how bad it is for everyone involv

In [26]:
ranked_generations('I just wanna say that')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


I just wanna say that this movie was horrible. It was like watching a horror film. There was no tension in this movie.<br /><br />Overall, this movie disappointed me. I just wanted to laugh at the ridiculousness of the premise. The actors were OK
I just wanna say that I don't like all male pornstars and even though I am a lesbian, I'm a straight man so this was very disappointing.<br /><br />I watched the DVD together with my friends, but I'm glad this film didn't get
I just wanna say that once or twice i watched the movie and thought "Wow, there is just so much going on here" but it was really nothing more than a bunch of people going around talking to the TV looking for a "new movie". And as one of the
I just wanna say that I am always a HUGE fan of the Italian language movies...but I just don't get my own music in this kind.<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|end

In [27]:
ranked_generations('This movie has some')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


This movie has some of the worst acting and acting. When I saw it on TV and saw it on DVD several years later, I was quite disgusted. It is simply plain awful. The movie is about an aspiring comedian playing a role completely unsuitable for any woman in
This movie has some of the worst looking scenes I've seen in cinema. Especially considering that they only use the actors' faces. It is just so bad that they were able to put so many beautiful ladies here, the cast and crew can't even sit through such crap
This movie has some of the more surreal moments I've seen; the whole thing's like you haven't even noticed. And the one guy who was always annoying to us is really creepy. It's all just not worth it.<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>
This movie has some funny performances, but the whole plot and execution leaves you speechless while the actors are trying to stay cool. A real disappointment. 3 out of 10<|endoftext|><

## Stage 2: fine-tune the main model with RL


In this stage, we will optimize GPT-2 to produce positive (or negative, depending on your `TARGET_LABEL`) IMDB movie reviews using the reward model you trained above.

Unlike supervised fine-tuning, RL lets the model generate its own sentences on each training step. Then, it calculates the reward of those specific sentences, and finally, updates the model to increase the probability of sentences with high reward.

Thus, each RLHF training step consists of three stages: __Rollout__, __Evaluation__ and __Update__

<div style="text-align: center">
<img src='https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/gpt2_bert_training.png' width='600'>
</div>

The update stage depends on the specific RL algorithm. We'll be using Proximal Policy Optimization, or [PPO](https://arxiv.org/abs/1707.06347), similarly to what was used for InstructGPT.
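To give a feel for what PPO optimizes, here is a minimal sketch of the clipped surrogate objective from the PPO paper for a single action (in our case, a single generated token). The function name and the `clip_range=0.2` value are illustrative assumptions; TRL's actual update also includes a value-function loss and other terms that are omitted here.

```python
import math


def ppo_clipped_objective(logprob_new, logprob_old, advantage, clip_range=0.2):
    """Clipped surrogate objective for one action; PPO maximizes the average of this quantity."""
    # probability ratio between the updated policy and the policy that generated the sample
    ratio = math.exp(logprob_new - logprob_old)
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # taking the minimum removes the incentive to move the ratio far outside [1 - clip, 1 + clip]
    return min(ratio * advantage, clipped_ratio * advantage)


# The new policy doubled the probability of a token (0.3 -> 0.6)
gain_capped = ppo_clipped_objective(math.log(0.6), math.log(0.3), advantage=1.0)
loss_uncapped = ppo_clipped_objective(math.log(0.6), math.log(0.3), advantage=-1.0)
print(gain_capped)     # good token: the gain is capped at (1 + clip_range) * advantage
print(loss_uncapped)   # bad token: the penalty is not capped
```

The asymmetry is the point: the policy gets no extra credit for over-committing to a good action, but is fully penalized for over-committing to a bad one.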

Before we run those 3 stages, however, we need to create a dataset of "queries" - partial reviews in our case.

In [28]:
# Note: this code is specific to IMDB; you will need to re-write it for other tasks
imdb_for_rlhf = imdb.filter(lambda row: len(row['text']) > 200, batched=False)
imdb_for_rlhf = imdb_for_rlhf.remove_columns(['label'])
sample_length = trl.core.LengthSampler(2, 8)  # use the first 2-8 tokens as query

def select_query_and_tokenize(sample):
    query_ids = main_tokenizer.encode(sample["text"])[: sample_length()]
    sample["query"] = main_tokenizer.decode(query_ids)  # query is the only required column
    sample["input_ids"] = query_ids  # to avoid re-tokenizing later
    return sample  # we do not need the rest - it will be generated by the model

imdb_for_rlhf = imdb_for_rlhf.map(select_query_and_tokenize, batched=False)
imdb_for_rlhf.set_format(type="torch")

Filter: 100%|██████████| 25000/25000 [00:00<00:00, 191322.94 examples/s]
Map:   0%|          | 0/24895 [00:00<?, ? examples/s]Token indices sequence length is longer than the specified maximum sequence length for this model (1168 > 1024). Running this sequence through the model will result in indexing errors
Map: 100%|██████████| 24895/24895 [00:29<00:00, 855.63 examples/s]


Next, let's prepare your reward model to predict rewards on whatever reviews were generated. Note that we use plaintext reviews because the main model uses a different tokenizer from the reward model.

In [29]:
from typing import List
def compute_reward(texts: List[str]) -> torch.Tensor:
  inputs = reward_tokenizer(texts, truncation=True, padding=True, return_tensors='pt').to(device)
  with torch.no_grad():
    return reward_model(**inputs).logits[:, 0]

In [30]:
compute_reward([imdb[45]['text'], imdb[16000]['text']])  # test on human-written reviews

tensor([ 5.0781, -5.0039], device='cuda:0')

Finally, we move to RL training. In this tutorial, we'll train LoRA adapters and not the full model.

In [31]:
import peft
peft_config = peft.LoraConfig(
    task_type=peft.TaskType.CAUSAL_LM, r=32, lora_alpha=32, lora_dropout=0.0, inference_mode=False
)

# reload main model as AutoModelForCausalLMWithValueHead - with an extra head needed for PPO
main_tokenizer = transformers.AutoTokenizer.from_pretrained("lvwerra/gpt2-imdb")
main_tokenizer.pad_token = main_tokenizer.eos_token

main_model = trl.AutoModelForCausalLMWithValueHead.from_pretrained("lvwerra/gpt2-imdb", device_map=device)
main_model = peft.get_peft_model(main_model, peft_config, adapter_name='default')
main_model.print_trainable_parameters()

CUDA extension not installed.
CUDA extension not installed.


trainable params: 1,179,648 || all params: 125,620,225 || trainable%: 0.9390589771670923
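The counts printed above can be reproduced by hand. The arithmetic below rests on a few assumptions that are not stated in the notebook: that LoRA is attached only to the fused attention projection `c_attn` in each of GPT-2 small's 12 blocks, that the base model has 124,439,808 parameters, and that the value head is a single linear layer with a bias.

```python
hidden_size = 768               # GPT-2 small hidden size
num_layers = 12                 # GPT-2 small transformer blocks
lora_rank = 32                  # r in the LoraConfig above
c_attn_out = 3 * hidden_size    # fused query/key/value projection: 768 -> 2304

# LoRA adds two low-rank matrices per adapted weight: A (r x in) and B (out x r)
lora_params_per_layer = lora_rank * hidden_size + c_attn_out * lora_rank
lora_params = num_layers * lora_params_per_layer

gpt2_params = 124_439_808             # assumed size of the base model (tied embeddings)
value_head_params = hidden_size + 1   # assumed value head: linear 768 -> 1 plus bias

total_params = gpt2_params + lora_params + value_head_params
print(f"trainable: {lora_params:,} || all: {total_params:,} || {100 * lora_params / total_params:.2f}%")
```

Under these assumptions both numbers match the printed output, so fewer than 1% of the weights are updated during PPO.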


Same as before, trl has a special type of trainer that minimizes a PPO-specific pseudo-loss. You can read more on this trainer [here](https://huggingface.co/docs/trl/main/en/ppo_trainer).
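The reward that PPO sees in RLHF is usually not the raw reward-model score alone: InstructGPT-style setups also subtract a per-token penalty for drifting away from the original model. The sketch below shows one common way to combine the two. The function name, the toy log-probabilities and `kl_coef=0.2` are assumptions for illustration, not values read from the trainer.

```python
def kl_penalized_rewards(score, logprobs_policy, logprobs_reference, kl_coef=0.2):
    """Per-token rewards: a KL penalty on every token, plus the reward-model score on the last one."""
    rewards = []
    for lp_policy, lp_ref in zip(logprobs_policy, logprobs_reference):
        # simple per-token estimate of KL(policy || reference) on the sampled token
        kl_estimate = lp_policy - lp_ref
        rewards.append(-kl_coef * kl_estimate)
    # the reward model scores the whole text, so its score is credited to the final token
    rewards[-1] += score
    return rewards


# toy log-probabilities of three generated tokens under the tuned and the original model
token_rewards = kl_penalized_rewards(
    score=1.0,
    logprobs_policy=[-1.0, -0.5, -2.0],
    logprobs_reference=[-1.0, -1.0, -1.5],
)
print(token_rewards)
```

Tokens that the tuned model makes more likely than the original model did are penalized, which discourages degenerate text that merely exploits the reward model.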

In [32]:
training_args = trl.PPOConfig(
    model_name=main_model.config._name_or_path,
    gradient_accumulation_steps=1,
    learning_rate=1.41e-5,
    batch_size=64,
    mini_batch_size=64,
    ppo_epochs=4,                 # PPO performs this many updates per training batch
)

ppo_trainer = trl.PPOTrainer(
    training_args, model=main_model.model, tokenizer=main_tokenizer,
    dataset=imdb_for_rlhf, data_collator=lambda data: dict((key, [d[key] for d in data]) for key in data[0])
)  # note: we pass main_model.model because PPOTrainer checks for one of several supported model types ...
# ... main_model.model is a model with adapters, which is supported. main_model itself is a wrapper that is not supported

In [33]:
from tqdm.auto import tqdm
max_steps = 50   # can be insufficient for some tasks - watch your learning curves
generation_kwargs = dict(
    min_length=-1, max_new_tokens=128, do_sample=True, top_k=0, top_p=1.0, pad_token_id=main_tokenizer.eos_token_id)
#                                  ^-- task-specific parameter!
with tqdm(enumerate(ppo_trainer.dataloader), total=max_steps) as progressbar:
  # note: ppo_trainer.dataloader is just a regular dataloader of queries, no RL-specific magic :)
  for epoch, batch in progressbar:
    if epoch >= max_steps:
        break

    # Rollout stage: generate continuations from batch queries using main_model
    response_tensors = ppo_trainer.generate(batch['input_ids'], **generation_kwargs)
    # ^-- list of tensors of token ids from main model tokenizer

    # de-tokenize responses to strings (since reward model uses a different tokenizer)
    batch["response"] = [main_tokenizer.decode(response.squeeze()) for response in response_tensors]
    # note: response_tensors already contain query tokens, so we don't need to add queries manually.
    # This may not be true for other tasks: check this manually by viewing batch["response"] and batch["query"]


    # Evaluation stage
    rewards = compute_reward(batch['response'])

    # Update stage
    stats = ppo_trainer.step(batch['input_ids'], response_tensors, list(rewards.split(1)))
    stats['rewards/mean'] = rewards.mean().item()

    print("-" * 30, 'STEP', epoch, '-' * 30)
    print(f'rewards/mean:\t{stats["rewards/mean"]:.9f}\t<---- average reward over this batch (higher=better, noisy)')
    print(f'ppo/returns/mean:\t{stats["ppo/returns/mean"]:.9f}\t<---- model-estimated average discounted reward')
    print(f'objective/kl:\t{stats["objective/kl"]:.9f}\t<---- how far we are from the original model (regularizer)')
    print()

    ppo_trainer.log_stats(stats, batch, list(rewards.split(1)))

  0%|          | 0/50 [00:00<?, ?it/s]You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  2%|▏         | 1/50 [00:36<30:11, 36.96s/it]

------------------------------ STEP 0 ------------------------------
rewards/mean:	0.215599060	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.365623236	<---- model-estimated average discounted reward
objective/kl:	0.000000000	<---- how far we are from the original model (regularizer)



  4%|▍         | 2/50 [01:15<30:29, 38.12s/it]

------------------------------ STEP 1 ------------------------------
rewards/mean:	0.086395979	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.401234955	<---- model-estimated average discounted reward
objective/kl:	0.000980879	<---- how far we are from the original model (regularizer)



  6%|▌         | 3/50 [01:54<30:08, 38.49s/it]

------------------------------ STEP 2 ------------------------------
rewards/mean:	0.576278687	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.274356306	<---- model-estimated average discounted reward
objective/kl:	0.014108738	<---- how far we are from the original model (regularizer)



  8%|▊         | 4/50 [02:33<29:42, 38.75s/it]

------------------------------ STEP 3 ------------------------------
rewards/mean:	0.492718697	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.324958801	<---- model-estimated average discounted reward
objective/kl:	0.035480961	<---- how far we are from the original model (regularizer)



 10%|█         | 5/50 [03:13<29:11, 38.93s/it]

------------------------------ STEP 4 ------------------------------
rewards/mean:	0.107905865	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.389611840	<---- model-estimated average discounted reward
objective/kl:	0.060832642	<---- how far we are from the original model (regularizer)



 12%|█▏        | 6/50 [03:52<28:35, 38.99s/it]

------------------------------ STEP 5 ------------------------------
rewards/mean:	0.769151688	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.316379488	<---- model-estimated average discounted reward
objective/kl:	0.036593869	<---- how far we are from the original model (regularizer)



 14%|█▍        | 7/50 [04:31<27:58, 39.03s/it]

------------------------------ STEP 6 ------------------------------
rewards/mean:	0.481787205	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.251095355	<---- model-estimated average discounted reward
objective/kl:	0.060364127	<---- how far we are from the original model (regularizer)



 16%|█▌        | 8/50 [05:10<27:21, 39.08s/it]

------------------------------ STEP 7 ------------------------------
rewards/mean:	0.318519115	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.347147733	<---- model-estimated average discounted reward
objective/kl:	0.073101357	<---- how far we are from the original model (regularizer)



 18%|█▊        | 9/50 [05:49<26:41, 39.07s/it]

------------------------------ STEP 8 ------------------------------
rewards/mean:	-0.130833626	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.404385746	<---- model-estimated average discounted reward
objective/kl:	0.091342673	<---- how far we are from the original model (regularizer)



 20%|██        | 10/50 [06:28<26:05, 39.13s/it]

------------------------------ STEP 9 ------------------------------
rewards/mean:	0.342811108	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.393909395	<---- model-estimated average discounted reward
objective/kl:	0.119021691	<---- how far we are from the original model (regularizer)



 22%|██▏       | 11/50 [07:08<25:25, 39.12s/it]

------------------------------ STEP 10 ------------------------------
rewards/mean:	-0.182218552	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.435639620	<---- model-estimated average discounted reward
objective/kl:	0.173643634	<---- how far we are from the original model (regularizer)



 24%|██▍       | 12/50 [07:46<24:43, 39.04s/it]

------------------------------ STEP 11 ------------------------------
rewards/mean:	0.823665559	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.275062740	<---- model-estimated average discounted reward
objective/kl:	0.200725347	<---- how far we are from the original model (regularizer)



 26%|██▌       | 13/50 [08:26<24:08, 39.15s/it]

------------------------------ STEP 12 ------------------------------
rewards/mean:	0.571063995	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.296156228	<---- model-estimated average discounted reward
objective/kl:	0.170160398	<---- how far we are from the original model (regularizer)



 28%|██▊       | 14/50 [09:05<23:28, 39.11s/it]

------------------------------ STEP 13 ------------------------------
rewards/mean:	0.530268192	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.258001447	<---- model-estimated average discounted reward
objective/kl:	0.213201255	<---- how far we are from the original model (regularizer)



 30%|███       | 15/50 [09:44<22:47, 39.07s/it]

------------------------------ STEP 14 ------------------------------
rewards/mean:	0.875080109	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.226568520	<---- model-estimated average discounted reward
objective/kl:	0.190126836	<---- how far we are from the original model (regularizer)



 32%|███▏      | 16/50 [10:23<22:07, 39.05s/it]

------------------------------ STEP 15 ------------------------------
rewards/mean:	0.334225655	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.276781023	<---- model-estimated average discounted reward
objective/kl:	0.246694401	<---- how far we are from the original model (regularizer)



 34%|███▍      | 17/50 [11:02<21:27, 39.02s/it]

------------------------------ STEP 16 ------------------------------
rewards/mean:	0.770143509	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.176135004	<---- model-estimated average discounted reward
objective/kl:	0.255189538	<---- how far we are from the original model (regularizer)



 36%|███▌      | 18/50 [11:40<20:39, 38.72s/it]

------------------------------ STEP 17 ------------------------------
rewards/mean:	-0.273774147	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.452648103	<---- model-estimated average discounted reward
objective/kl:	0.309799045	<---- how far we are from the original model (regularizer)



 38%|███▊      | 19/50 [12:19<20:00, 38.74s/it]

------------------------------ STEP 18 ------------------------------
rewards/mean:	0.671726704	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.227412790	<---- model-estimated average discounted reward
objective/kl:	0.330397099	<---- how far we are from the original model (regularizer)



 40%|████      | 20/50 [12:58<19:25, 38.86s/it]

------------------------------ STEP 19 ------------------------------
rewards/mean:	0.146528244	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.280055404	<---- model-estimated average discounted reward
objective/kl:	0.418746382	<---- how far we are from the original model (regularizer)



 42%|████▏     | 21/50 [13:37<18:49, 38.96s/it]

------------------------------ STEP 20 ------------------------------
rewards/mean:	0.117375374	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.292446375	<---- model-estimated average discounted reward
objective/kl:	0.398204178	<---- how far we are from the original model (regularizer)



 44%|████▍     | 22/50 [14:16<18:11, 38.99s/it]

------------------------------ STEP 21 ------------------------------
rewards/mean:	0.107749939	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.398270965	<---- model-estimated average discounted reward
objective/kl:	0.444058985	<---- how far we are from the original model (regularizer)



 46%|████▌     | 23/50 [14:55<17:31, 38.94s/it]

------------------------------ STEP 22 ------------------------------
rewards/mean:	-0.118811607	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.271849930	<---- model-estimated average discounted reward
objective/kl:	0.517669678	<---- how far we are from the original model (regularizer)



 48%|████▊     | 24/50 [15:34<16:51, 38.91s/it]

------------------------------ STEP 23 ------------------------------
rewards/mean:	-0.000667572	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.396552026	<---- model-estimated average discounted reward
objective/kl:	0.451099336	<---- how far we are from the original model (regularizer)



 50%|█████     | 25/50 [16:13<16:13, 38.93s/it]

------------------------------ STEP 24 ------------------------------
rewards/mean:	0.518192291	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.212860122	<---- model-estimated average discounted reward
objective/kl:	0.492068768	<---- how far we are from the original model (regularizer)



 52%|█████▏    | 26/50 [16:51<15:33, 38.91s/it]

------------------------------ STEP 25 ------------------------------
rewards/mean:	0.220870018	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.271347046	<---- model-estimated average discounted reward
objective/kl:	0.545228720	<---- how far we are from the original model (regularizer)



 54%|█████▍    | 27/50 [17:30<14:54, 38.91s/it]

------------------------------ STEP 26 ------------------------------
rewards/mean:	-0.084264040	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.366833031	<---- model-estimated average discounted reward
objective/kl:	0.558522582	<---- how far we are from the original model (regularizer)



 56%|█████▌    | 28/50 [18:09<14:16, 38.92s/it]

------------------------------ STEP 27 ------------------------------
rewards/mean:	0.040335059	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.318422198	<---- model-estimated average discounted reward
objective/kl:	0.835281968	<---- how far we are from the original model (regularizer)



 58%|█████▊    | 29/50 [18:48<13:38, 38.97s/it]

------------------------------ STEP 28 ------------------------------
rewards/mean:	-0.631321430	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.458985806	<---- model-estimated average discounted reward
objective/kl:	0.632037282	<---- how far we are from the original model (regularizer)



 60%|██████    | 30/50 [19:27<12:58, 38.93s/it]

------------------------------ STEP 29 ------------------------------
rewards/mean:	0.283780575	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.207357809	<---- model-estimated average discounted reward
objective/kl:	0.915880919	<---- how far we are from the original model (regularizer)



 62%|██████▏   | 31/50 [20:06<12:19, 38.93s/it]

------------------------------ STEP 30 ------------------------------
rewards/mean:	0.902452946	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.086497128	<---- model-estimated average discounted reward
objective/kl:	0.919544935	<---- how far we are from the original model (regularizer)



 64%|██████▍   | 32/50 [20:44<11:35, 38.61s/it]

------------------------------ STEP 31 ------------------------------
rewards/mean:	-0.346108437	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.416963041	<---- model-estimated average discounted reward
objective/kl:	0.913613081	<---- how far we are from the original model (regularizer)



 66%|██████▌   | 33/50 [21:23<10:57, 38.66s/it]

------------------------------ STEP 32 ------------------------------
rewards/mean:	0.591559887	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.205826163	<---- model-estimated average discounted reward
objective/kl:	0.958241582	<---- how far we are from the original model (regularizer)



 68%|██████▊   | 34/50 [22:01<10:18, 38.63s/it]

------------------------------ STEP 33 ------------------------------
rewards/mean:	0.682221413	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.152441621	<---- model-estimated average discounted reward
objective/kl:	0.919927239	<---- how far we are from the original model (regularizer)



 70%|███████   | 35/50 [22:40<09:40, 38.72s/it]

------------------------------ STEP 34 ------------------------------
rewards/mean:	-1.016492367	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.420858383	<---- model-estimated average discounted reward
objective/kl:	1.238697290	<---- how far we are from the original model (regularizer)



 72%|███████▏  | 36/50 [23:19<09:02, 38.72s/it]

------------------------------ STEP 35 ------------------------------
rewards/mean:	-0.253796816	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.312706560	<---- model-estimated average discounted reward
objective/kl:	1.097795963	<---- how far we are from the original model (regularizer)



 74%|███████▍  | 37/50 [23:58<08:24, 38.79s/it]

------------------------------ STEP 36 ------------------------------
rewards/mean:	-0.498067856	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.388068497	<---- model-estimated average discounted reward
objective/kl:	1.050817251	<---- how far we are from the original model (regularizer)



 76%|███████▌  | 38/50 [24:37<07:45, 38.80s/it]

------------------------------ STEP 37 ------------------------------
rewards/mean:	0.401622772	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.219973564	<---- model-estimated average discounted reward
objective/kl:	1.489012480	<---- how far we are from the original model (regularizer)



 78%|███████▊  | 39/50 [25:15<07:06, 38.76s/it]

------------------------------ STEP 38 ------------------------------
rewards/mean:	0.165565491	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.231186479	<---- model-estimated average discounted reward
objective/kl:	1.303745866	<---- how far we are from the original model (regularizer)



 80%|████████  | 40/50 [25:54<06:27, 38.76s/it]

------------------------------ STEP 39 ------------------------------
rewards/mean:	0.751224518	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.133503661	<---- model-estimated average discounted reward
objective/kl:	1.630035281	<---- how far we are from the original model (regularizer)



 82%|████████▏ | 41/50 [26:33<05:49, 38.80s/it]

------------------------------ STEP 40 ------------------------------
rewards/mean:	0.424280167	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.134288490	<---- model-estimated average discounted reward
objective/kl:	1.642565966	<---- how far we are from the original model (regularizer)



 84%|████████▍ | 42/50 [27:12<05:10, 38.81s/it]

------------------------------ STEP 41 ------------------------------
rewards/mean:	-0.229740143	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.337339222	<---- model-estimated average discounted reward
objective/kl:	1.682487369	<---- how far we are from the original model (regularizer)



 86%|████████▌ | 43/50 [27:51<04:31, 38.79s/it]

------------------------------ STEP 42 ------------------------------
rewards/mean:	-0.215743542	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.245107085	<---- model-estimated average discounted reward
objective/kl:	1.810115337	<---- how far we are from the original model (regularizer)



 88%|████████▊ | 44/50 [28:29<03:52, 38.80s/it]

------------------------------ STEP 43 ------------------------------
rewards/mean:	-0.215835810	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.310152054	<---- model-estimated average discounted reward
objective/kl:	1.876256466	<---- how far we are from the original model (regularizer)



 90%|█████████ | 45/50 [29:08<03:14, 38.86s/it]

------------------------------ STEP 44 ------------------------------
rewards/mean:	-0.211706460	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.284973800	<---- model-estimated average discounted reward
objective/kl:	1.885387301	<---- how far we are from the original model (regularizer)



 92%|█████████▏| 46/50 [29:47<02:35, 38.77s/it]

------------------------------ STEP 45 ------------------------------
rewards/mean:	-0.000182629	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.230360180	<---- model-estimated average discounted reward
objective/kl:	1.563516140	<---- how far we are from the original model (regularizer)



 94%|█████████▍| 47/50 [30:26<01:56, 38.70s/it]

------------------------------ STEP 46 ------------------------------
rewards/mean:	-0.422169209	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.307007492	<---- model-estimated average discounted reward
objective/kl:	2.266754627	<---- how far we are from the original model (regularizer)



 96%|█████████▌| 48/50 [31:04<01:17, 38.68s/it]

------------------------------ STEP 47 ------------------------------
rewards/mean:	1.154491425	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.015174679	<---- model-estimated average discounted reward
objective/kl:	2.376437187	<---- how far we are from the original model (regularizer)



 98%|█████████▊| 49/50 [31:43<00:38, 38.65s/it]

------------------------------ STEP 48 ------------------------------
rewards/mean:	0.409225464	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.118723847	<---- model-estimated average discounted reward
objective/kl:	2.184149265	<---- how far we are from the original model (regularizer)



100%|██████████| 50/50 [32:21<00:00, 38.84s/it]

------------------------------ STEP 49 ------------------------------
rewards/mean:	0.282308340	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.171958447	<---- model-estimated average discounted reward
objective/kl:	2.437961340	<---- how far we are from the original model (regularizer)
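
Per-batch rewards in the log above are noisy, so the trend is easier to judge after smoothing. Here is a minimal sketch, assuming you collect `stats['rewards/mean']` into a plain list during training. The `moving_average` helper and its `window` parameter are illustrative and not part of TRL; the numbers are the first ten `rewards/mean` values from the log above, rounded to three decimals.

```python
def moving_average(values, window=5):
    """Trailing moving average; early entries average over the values seen so far."""
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # drop the value that just left the window
            running_sum -= values[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed

# first ten rewards/mean values from the log above (rounded)
reward_history = [0.216, 0.086, 0.576, 0.493, 0.108, 0.769, 0.482, 0.319, -0.131, 0.343]
for step, value in enumerate(moving_average(reward_history, window=5)):
    print(f"step {step}: smoothed reward {value:.3f}")
```

The same helper can be applied to `objective/kl` to check whether the policy is drifting away from the original model faster than the reward improves.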






## Main assignment - <u>actually</u> train the model (8 points)


Your main task for this week is to use the RLHF pipeline to train a model for a reward of your choice. Here's what you can choose from:

__A. Toxicity fine-tuning:__ train the model to be less (or more!) toxic. For this task, you may use the data from [Jigsaw toxic comments](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) and [lmsys/toxic-chat](https://huggingface.co/datasets/lmsys/toxic-chat), or any other source. Alternatively, you may use toxicity scores from [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1).


__B. Actual human feedback:__ use one of the existing datasets with pairwise human feedback to align your language model. You may use [Anthropic's hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf), the [OpenAssistant dataset](https://huggingface.co/datasets/OpenAssistant/oasst1) or any other data you see fit. You may also turn the tables and train the model to [minimize](https://habrastorage.org/getpro/geektimes/post_images/ac7/2ad/827/ac72ad82767d4132164a4b6b76196c42.jpg) human preferences, as long as your model does not degrade to gibberish.

__C. Controlled generation:__ Instead of training a reward model from human feedback, you may define the reward function as the text length (longer or shorter) or the number of times the model uses specific words (e.g. "sorry", "apologize"). If you choose specific words, make sure the model generates them at least sometimes, otherwise there is no reward signal to learn from.
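
For option C, the reward can be an ordinary Python function of the generated text. Here is a minimal sketch, assuming case-insensitive whole-word matching and whitespace-based word counts. Both function names are hypothetical. In the PPO loop above, each score would still need to be converted to a tensor before being passed to `ppo_trainer.step`.

```python
import re

def keyword_count_reward(texts, keywords=("sorry", "apologize")):
    """Reward = number of whole-word keyword occurrences in each text."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, keywords)) + r")\b", re.IGNORECASE)
    return [float(len(pattern.findall(text))) for text in texts]

def length_reward(texts, sign=1.0):
    """Reward = word count; use sign=-1.0 to prefer shorter texts."""
    return [sign * len(text.split()) for text in texts]

print(keyword_count_reward(["Sorry, I am so sorry!", "No regrets."]))
print(length_reward(["a short text", "one"], sign=-1.0))
```

Whole-word matching means inflected forms such as "apologizes" are not counted; whether that is desirable depends on the behaviour you want to reward.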

__Alternatively,__ you may choose a different task. However, unless your task is very similar to one of the above, there is a chance that it will be **significantly** harder to solve, requiring orders of magnitude more compute and tuning. If you are in doubt, please ask the course staff. If they are AFK (again >.<), please prefer one of the recommended tasks.


#### General tips & tricks


Things to look out for:
- during the PPO stage, the reward model should be in eval mode (dropout disabled)
- make sure `max_length` and `max_new_tokens` are large enough for your chosen dataset - at least most of the time
- when in doubt, view the data manually or inspect how the model performs on a few samples


We highly recommend that you manually check the performance after each sub-stage:
1. when you have assembled the pairwise dataset, inspect a couple of samples from *your* dataset class and detokenize them. Make sure that you-the-human understand why one sample was accepted and the other rejected, at least most of the time. This also lets you spot tokenization/truncation errors.
2. after you have trained a reward model, measure how accurate it is in isolation. If your reward model is poor, any subsequent RLHF will also fail.
3. once you've trained the main model with RL, ask it to generate examples and explore how well it does. If it produces an obviously bad output, check whether the reward model assigns a high reward to that output. If yes, the reward model is the culprit; if no, it's a question of better/longer PPO training.
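
For step 2, one common way to score a reward model in isolation is pairwise accuracy: the fraction of held-out pairs where the chosen text receives a higher reward than the rejected one. Here is a minimal sketch, assuming you have already computed scalar rewards for both sides of each pair. The function name is hypothetical and the numbers are made up for illustration.

```python
def pairwise_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of pairs in which the chosen text outscores the rejected one."""
    if len(chosen_rewards) != len(rejected_rewards):
        raise ValueError("expected one chosen and one rejected reward per pair")
    if not chosen_rewards:
        raise ValueError("need at least one pair")
    correct = sum(c > r for c, r in zip(chosen_rewards, rejected_rewards))
    return correct / len(chosen_rewards)

# made-up rewards for four held-out pairs
print(pairwise_accuracy([0.9, 0.2, 1.5, -0.3], [0.1, 0.4, 1.0, -0.8]))
```

A model that guesses at random lands near 0.5 on this metric, so that is the baseline to beat.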

__It is also a good idea to periodically print samples during training.__

__When stuck, simplify the problem.__ If you've spent several hours enchanting the reward model but it still won't budge, try switching to a simpler subtask. For instance, if you're training on hh-rlhf, try limiting the dataset to the shortest 10% of sequences - they are typically easier to learn.
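
Restricting the data to the shortest examples could be sketched as follows. This assumes length is measured in whitespace-separated words; in practice you would more likely count tokens with your tokenizer. The `keep_shortest` helper is hypothetical.

```python
def keep_shortest(examples, fraction=0.1, length_fn=len):
    """Return the shortest `fraction` of examples (at least one), ordered by length."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(examples, key=length_fn)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]

texts = ["a b c d", "a", "a b c", "a b"]
print(keep_shortest(texts, fraction=0.5, length_fn=lambda t: len(t.split())))
```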


## Assignment stages (and grading)

Regardless of the specific task you chose, your solution needs to contain several parts that will be graded separately.


#### Stage 1: reward model (4 points)

Construct a dataset for training the reward model on your problem. Then, train a reward model on that dataset and evaluate how well your model can predict preferences on a hold-out (test) subset of your data.
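
One way to get a hold-out subset is to split the underlying examples before building pairs, so that no text ends up on both sides of the split. Here is a minimal sketch, assuming examples are addressed by integer index. The helper name, `test_fraction` and `seed` are illustrative choices.

```python
import random

def split_indices(num_examples, test_fraction=0.1, seed=0):
    """Shuffle example indices and split them into disjoint train/test lists."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    num_test = max(1, int(num_examples * test_fraction))
    return indices[num_test:], indices[:num_test]

train_idx, test_idx = split_indices(100, test_fraction=0.1)
print(len(train_idx), len(test_idx))
```

With a Hugging Face dataset, the two index lists could then be passed to `Dataset.select` to obtain the train and test subsets.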

Please make sure that the part of your notebook where you evaluate the reward model is clearly visible and reasonably easy to read. And for all that is holy, do not call it IMDB unless it actually **is** IMDB movie review data :)

__Not all tasks require a reward model for later PPO fine-tuning.__ For instance, there's no reason to train a reward model if your reward equals sentence length. Likewise, toxicity reward can be estimated with a pre-trained toxicity classifier. __If your task does not require training a reward model, please train an unrelated model on [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) as though you were solving assignment version B.__ This is for grading purposes only, you won't use this model for stage 2.


#### Stage 2: RL fine-tuning (4 points)

Once the reward model is ready - or you can compute rewards without a model - it is time to maximize that reward with PPO. Optionally, you may replace PPO with another RL algorithm (or an unlikelihood learning scheme), but only if you're feeling adventurous.


First, you need to choose a language model to be fine-tuned. You may choose any model, but make sure that your model **can** generate the data in your format. For instance, [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) is a general purpose LM and may (or may not) need prompt engineering to generate chat assistant responses. For that reason, it is best if you **do not use `"lvwerra/gpt2-imdb"` unless you're generating only movie reviews**.



There are two "difficulty modes" for this task:
- **Easy mode:** use [gpt2-large](https://huggingface.co/gpt2-large) or [opt-1.3b](https://huggingface.co/facebook/opt-1.3b) with minimal code changes.
- **Hard mode:** use a larger (e.g. 7B) model in combination with `load_in_4bit` and LoRA, the same way we did last week.
Some reasonable model choices are [LLaMA-7B](https://huggingface.co/Enoch/llama-7b-hf), [Falcon-7b](https://huggingface.co/tiiuae/falcon-7b), [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) for general-purpose LM or [guanaco-7b](https://huggingface.co/timdettmers/guanaco-7b), [vicuna-7b](https://huggingface.co/lmsys/vicuna-7b-v1.5) for chat-based tasks, though there are many more (see [leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)). In the hard mode, you will need to modify the training arguments to enable 4-bit fine-tuning. Furthermore, your experiments will take somewhat longer to complete. On the plus side, your model will produce significantly better results.

__High reward is not enough!__ RL algorithms are famous for [cheating their reward functions](https://openai.com/research/faulty-reward-functions). To ensure that your model is actually doing what you want it to do, you will need some additional evaluation. To get the full grade, provide at least 20 side-by-side examples of your fine-tuned model vs original model predictions and a short summary.

Alternatively, you may provide 5 examples and some extrinsic evaluation metric over many examples. For instance, you may use a different pre-trained toxicity score for option A. When dealing with human preferences, you may choose to [enlist actual humans](https://toloka.ai/) or [ask GPT4/Claude](https://arxiv.org/pdf/2304.03277.pdf) to compare your model's predictions. For task C, when optimizing for simple rewards like sentence lengths, it is enough to compare histograms of rewards (e.g. average lengths).
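
For task C with a length-based reward, the comparison can be as simple as summarizing lengths before and after fine-tuning. Here is a minimal sketch, assuming you already have two lists of generated texts for the same prompts. The texts and the `length_summary` helper are made up for illustration.

```python
from collections import Counter

def length_summary(texts, bin_size=5):
    """Mean word count plus a coarse histogram {bin_start: count}."""
    lengths = [len(text.split()) for text in texts]
    histogram = Counter((length // bin_size) * bin_size for length in lengths)
    return sum(lengths) / len(lengths), dict(sorted(histogram.items()))

# made-up generations from the original and the fine-tuned model
before = ["good movie", "i liked it a lot", "meh"]
after = ["this movie was a truly wonderful experience overall",
         "i liked it a lot and would watch again"]
for name, texts in [("original", before), ("fine-tuned", after)]:
    mean_length, histogram = length_summary(texts)
    print(f"{name}: mean length {mean_length:.1f}, histogram {histogram}")
```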












In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
reward_model = transformers.AutoModelForSequenceClassification.from_pretrained("distilbert/distilroberta-base", device_map=device)
reward_tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert/distilroberta-base")

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilbert/distilroberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [4]:
import json

import torch
import datasets

class ToxicPairwiseDataset(torch.utils.data.Dataset):
    """ A dataset of all possible pairs of chosen and rejected texts in TRL reward training format """

    column_names = ['input_ids_chosen', 'attention_mask_chosen', 'input_ids_rejected', 'attention_mask_rejected']

    def __init__(self, toxic, tokenizer, target):
        super().__init__()
        self.tokenizer = tokenizer
        self.toxic = toxic
        self.target = target  # moderation category used for ranking, e.g. 'sexual'

    def __len__(self):
        return len(self.toxic)**2  # all ordered pairs, including a sample paired with itself

    def _target_score(self, row):
        # 'openai_moderation' is a JSON string of [category, score] pairs;
        # json.loads avoids calling eval() on dataset contents
        return dict(json.loads(row['openai_moderation']))[self.target]

    def __getitem__(self, index: int):
        first = self.toxic[index // len(self.toxic)]
        second = self.toxic[index % len(self.toxic)]

        # the text with the higher target score becomes "chosen"
        if self._target_score(first) < self._target_score(second):
            first, second = second, first

        chosen = self.tokenizer(first['model_output'], truncation=True)
        rejected = self.tokenizer(second['model_output'], truncation=True)
        return dict(input_ids_chosen=chosen['input_ids'], attention_mask_chosen=chosen['attention_mask'],
                    input_ids_rejected=rejected['input_ids'], attention_mask_rejected=rejected['attention_mask'])

In [5]:
toxic = datasets.load_dataset('lmsys/toxic-chat', 'toxicchat0124', split='train')

In [6]:
toxic[0]

{'conv_id': 'e0c9b3e05414814485dbdcb9a29334d502e59803af9c26df03e9d1de5e7afe67',
 'user_input': 'Masturbacja jest proces co oitrzebuje',
 'model_output': 'Masturbacja to proces, który może pozytywnie wpłynąć na zdrowie psychiczne i fizyczne człowieka, ponieważ pomaga w relaksie, redukuje stres i pomaga w uśpieniu. Może też być używana jako dodatkowa form',
 'human_annotation': True,
 'toxicity': 0,
 'jailbreaking': 0,
 'openai_moderation': '[["sexual", 0.4609803557395935], ["sexual/minors", 0.0012527990620583296], ["harassment", 0.0001862536446424201], ["hate", 0.00015521160094067454], ["violence", 6.580814078915864e-05], ["self-harm", 3.212967567378655e-05], ["violence/graphic", 1.5190824342425913e-05], ["self-harm/instructions", 1.0009921425080393e-05], ["hate/threatening", 4.4459093260229565e-06], ["self-harm/intent", 3.378846486157272e-06], ["harassment/threatening", 1.7095695739044459e-06]]'}
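
Note that `openai_moderation` is stored as a string rather than a list. Here is a minimal sketch of turning it into a category-to-score mapping with the standard `json` module, using a shortened copy of the record above. The `moderation_scores` helper is hypothetical.

```python
import json

def moderation_scores(raw):
    """Parse a JSON string of [category, score] pairs into a dict."""
    return {category: score for category, score in json.loads(raw)}

# shortened copy of the 'openai_moderation' field shown above
raw = '[["sexual", 0.4609803557395935], ["harassment", 0.0001862536446424201], ["violence", 6.580814078915864e-05]]'
scores = moderation_scores(raw)
print(scores["sexual"])
print(max(scores, key=scores.get))  # category with the highest score
```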

In [7]:
TARGET = 'sexual'  # we will train the model to generate more lewd responses
reward_data = ToxicPairwiseDataset(toxic, reward_tokenizer, TARGET)

sample = reward_data[66666]
print('CHOSEN:', reward_tokenizer.decode(sample['input_ids_chosen']))
print('REJECTED:', reward_tokenizer.decode(sample['input_ids_rejected']))

CHOSEN: <s>It is not appropriate or ethical to make judgments about individuals based on their physical appearance or body shape. Everyone is unique and should be treated with respect and dignity, regardless of their size. size discrimination is a form of prejudice and can have serious negative impacts on an individual's mental and physical health, self-esteem, and overall well-being. It is important to focus on treating others with kindness and respect, and to recognize that everyone has value and worth, regardless of their size.</s>
REJECTED: <s>Astana, the capital city of Kazakhstan, can be quite cold in the winter due to its location in the continental climate zone. Astana is far north of the equator, and its winters can be extremely cold, with temperatures dropping below -30°C (-22°F) on occasion. The city is also located in the middle of a vast, flat plain, which can trap cold air and make the temperatures feel even colder.

Additionally, Astana's winters can be influenced by Sib

In [19]:
import trl

training_args = trl.RewardConfig(  # like transformers.TrainingArguments
    output_dir="reward_model",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
    learning_rate=1.41e-5,
    max_steps=5_000,              # note: training may need more than 1k steps
    logging_steps=50,
    gradient_checkpointing=True,  # reduce memory usage but train ~30% slower
    gradient_checkpointing_kwargs={"use_reentrant": False},
    fp16=True                     # disable this on CPU or on very old GPUs
    # you may add any other hyperparameters that you found useful in weeks 5-7
)

trainer = trl.RewardTrainer(
    model=reward_model,
    args=training_args,
    tokenizer=reward_tokenizer,
    train_dataset=reward_data,
    peft_config=None,  # optionally, you may tune with LoRA, prompt-tuning, etc
)

trainer.train()



{'loss': 0.3223, 'learning_rate': 1.396464e-05, 'epoch': 0.0}


  2%|▏         | 100/5000 [00:30<23:06,  3.53it/s]

{'loss': 0.2913, 'learning_rate': 1.3823640000000002e-05, 'epoch': 0.0}


  3%|▎         | 150/5000 [00:44<22:34,  3.58it/s]

{'loss': 0.2835, 'learning_rate': 1.3682640000000001e-05, 'epoch': 0.0}


  4%|▍         | 200/5000 [00:58<22:47,  3.51it/s]

{'loss': 0.2722, 'learning_rate': 1.3544460000000001e-05, 'epoch': 0.0}


  5%|▌         | 250/5000 [01:12<22:28,  3.52it/s]

{'loss': 0.2817, 'learning_rate': 1.340346e-05, 'epoch': 0.0}


  6%|▌         | 300/5000 [01:26<21:55,  3.57it/s]

{'loss': 0.2668, 'learning_rate': 1.3262460000000001e-05, 'epoch': 0.0}


  7%|▋         | 350/5000 [01:40<22:04,  3.51it/s]

{'loss': 0.2677, 'learning_rate': 1.312146e-05, 'epoch': 0.0}


  8%|▊         | 400/5000 [01:55<21:10,  3.62it/s]

{'loss': 0.2706, 'learning_rate': 1.298046e-05, 'epoch': 0.0}


  9%|▉         | 450/5000 [02:09<20:50,  3.64it/s]

{'loss': 0.2551, 'learning_rate': 1.283946e-05, 'epoch': 0.0}


 10%|█         | 500/5000 [02:23<21:28,  3.49it/s]

{'loss': 0.2432, 'learning_rate': 1.269846e-05, 'epoch': 0.0}


 11%|█         | 550/5000 [02:43<21:06,  3.51it/s]

{'loss': 0.2391, 'learning_rate': 1.255746e-05, 'epoch': 0.0}




{'loss': 0.2595, 'learning_rate': 1.2416460000000001e-05, 'epoch': 0.0}


 13%|█▎        | 650/5000 [03:11<20:34,  3.52it/s]

{'loss': 0.2437, 'learning_rate': 1.227546e-05, 'epoch': 0.0}


 14%|█▍        | 700/5000 [03:25<20:06,  3.56it/s]

{'loss': 0.2367, 'learning_rate': 1.2134460000000001e-05, 'epoch': 0.0}


 15%|█▌        | 750/5000 [03:39<19:59,  3.54it/s]

{'loss': 0.221, 'learning_rate': 1.199346e-05, 'epoch': 0.0}


 16%|█▌        | 800/5000 [03:53<20:05,  3.48it/s]

{'loss': 0.2297, 'learning_rate': 1.1852460000000001e-05, 'epoch': 0.0}


 17%|█▋        | 850/5000 [04:08<19:40,  3.52it/s]

{'loss': 0.221, 'learning_rate': 1.171146e-05, 'epoch': 0.0}


 18%|█▊        | 900/5000 [04:22<19:17,  3.54it/s]

{'loss': 0.2458, 'learning_rate': 1.1570460000000001e-05, 'epoch': 0.0}


 19%|█▉        | 950/5000 [04:36<19:12,  3.51it/s]

{'loss': 0.2297, 'learning_rate': 1.142946e-05, 'epoch': 0.0}


 20%|██        | 1000/5000 [04:50<18:28,  3.61it/s]

{'loss': 0.231, 'learning_rate': 1.128846e-05, 'epoch': 0.0}


 21%|██        | 1050/5000 [05:10<18:29,  3.56it/s]

{'loss': 0.2395, 'learning_rate': 1.114746e-05, 'epoch': 0.0}


 22%|██▏       | 1100/5000 [05:24<18:34,  3.50it/s]

{'loss': 0.2345, 'learning_rate': 1.100646e-05, 'epoch': 0.0}


 23%|██▎       | 1150/5000 [05:38<18:20,  3.50it/s]

{'loss': 0.2381, 'learning_rate': 1.086546e-05, 'epoch': 0.0}


 24%|██▍       | 1200/5000 [05:53<18:03,  3.51it/s]

{'loss': 0.2083, 'learning_rate': 1.0724460000000001e-05, 'epoch': 0.0}


 25%|██▌       | 1250/5000 [06:07<17:37,  3.55it/s]

{'loss': 0.2106, 'learning_rate': 1.0583460000000002e-05, 'epoch': 0.0}


 26%|██▌       | 1300/5000 [06:21<17:30,  3.52it/s]

{'loss': 0.2486, 'learning_rate': 1.0442460000000001e-05, 'epoch': 0.0}


 27%|██▋       | 1350/5000 [06:35<17:24,  3.49it/s]

{'loss': 0.2466, 'learning_rate': 1.030146e-05, 'epoch': 0.0}


 28%|██▊       | 1400/5000 [06:49<17:17,  3.47it/s]

{'loss': 0.2419, 'learning_rate': 1.0160460000000001e-05, 'epoch': 0.0}


 29%|██▉       | 1450/5000 [07:03<16:49,  3.52it/s]

{'loss': 0.2506, 'learning_rate': 1.001946e-05, 'epoch': 0.0}


 30%|███       | 1500/5000 [07:17<16:35,  3.51it/s]

{'loss': 0.2306, 'learning_rate': 9.878460000000001e-06, 'epoch': 0.0}


 31%|███       | 1550/5000 [07:37<16:06,  3.57it/s]

{'loss': 0.2342, 'learning_rate': 9.73746e-06, 'epoch': 0.0}


 32%|███▏      | 1600/5000 [07:51<16:10,  3.50it/s]

{'loss': 0.2201, 'learning_rate': 9.596460000000001e-06, 'epoch': 0.0}


 33%|███▎      | 1650/5000 [08:05<15:46,  3.54it/s]

{'loss': 0.2367, 'learning_rate': 9.45546e-06, 'epoch': 0.0}


 34%|███▍      | 1700/5000 [08:20<15:20,  3.59it/s]

{'loss': 0.2683, 'learning_rate': 9.31446e-06, 'epoch': 0.0}


 35%|███▌      | 1750/5000 [08:34<15:23,  3.52it/s]

{'loss': 0.249, 'learning_rate': 9.17346e-06, 'epoch': 0.0}


 36%|███▌      | 1800/5000 [08:48<15:10,  3.51it/s]

{'loss': 0.2471, 'learning_rate': 9.03246e-06, 'epoch': 0.0}


 37%|███▋      | 1850/5000 [09:02<14:56,  3.51it/s]

{'loss': 0.2548, 'learning_rate': 8.891460000000002e-06, 'epoch': 0.0}


 38%|███▊      | 1900/5000 [09:16<14:20,  3.60it/s]

{'loss': 0.2459, 'learning_rate': 8.750460000000001e-06, 'epoch': 0.0}


 39%|███▉      | 1950/5000 [09:30<14:22,  3.54it/s]

{'loss': 0.2436, 'learning_rate': 8.60946e-06, 'epoch': 0.0}


 40%|████      | 2000/5000 [09:44<14:11,  3.53it/s]

{'loss': 0.2589, 'learning_rate': 8.468460000000001e-06, 'epoch': 0.0}


 41%|████      | 2050/5000 [10:04<14:02,  3.50it/s]

{'loss': 0.2595, 'learning_rate': 8.32746e-06, 'epoch': 0.0}


 42%|████▏     | 2100/5000 [10:18<13:40,  3.53it/s]

{'loss': 0.2388, 'learning_rate': 8.186460000000001e-06, 'epoch': 0.0}


 43%|████▎     | 2150/5000 [10:33<13:25,  3.54it/s]

{'loss': 0.2583, 'learning_rate': 8.04546e-06, 'epoch': 0.0}


 44%|████▍     | 2200/5000 [10:47<13:12,  3.53it/s]

{'loss': 0.2459, 'learning_rate': 7.904460000000001e-06, 'epoch': 0.0}


 45%|████▌     | 2250/5000 [11:01<12:58,  3.53it/s]

{'loss': 0.2248, 'learning_rate': 7.76346e-06, 'epoch': 0.0}


 46%|████▌     | 2300/5000 [11:15<12:41,  3.55it/s]

{'loss': 0.2525, 'learning_rate': 7.62528e-06, 'epoch': 0.0}


 47%|████▋     | 2350/5000 [11:29<12:21,  3.57it/s]

{'loss': 0.2461, 'learning_rate': 7.484280000000001e-06, 'epoch': 0.0}


 48%|████▊     | 2400/5000 [11:43<12:17,  3.53it/s]

{'loss': 0.2269, 'learning_rate': 7.343280000000001e-06, 'epoch': 0.0}


 49%|████▉     | 2450/5000 [11:57<11:59,  3.54it/s]

{'loss': 0.2301, 'learning_rate': 7.202280000000001e-06, 'epoch': 0.0}


 50%|█████     | 2500/5000 [12:11<11:41,  3.56it/s]

{'loss': 0.2273, 'learning_rate': 7.061280000000001e-06, 'epoch': 0.0}


 51%|█████     | 2550/5000 [12:31<11:25,  3.57it/s]

{'loss': 0.2327, 'learning_rate': 6.920280000000001e-06, 'epoch': 0.0}


 52%|█████▏    | 2600/5000 [12:45<11:22,  3.52it/s]

{'loss': 0.2247, 'learning_rate': 6.77928e-06, 'epoch': 0.0}


 53%|█████▎    | 2650/5000 [12:59<11:03,  3.54it/s]

{'loss': 0.2389, 'learning_rate': 6.63828e-06, 'epoch': 0.0}


 54%|█████▍    | 2700/5000 [13:13<10:54,  3.52it/s]

{'loss': 0.2309, 'learning_rate': 6.49728e-06, 'epoch': 0.0}


 55%|█████▌    | 2750/5000 [13:28<10:43,  3.49it/s]

{'loss': 0.2329, 'learning_rate': 6.35628e-06, 'epoch': 0.0}


 56%|█████▌    | 2800/5000 [13:42<10:19,  3.55it/s]

{'loss': 0.2353, 'learning_rate': 6.215280000000001e-06, 'epoch': 0.0}


 57%|█████▋    | 2850/5000 [13:56<10:09,  3.53it/s]

{'loss': 0.21, 'learning_rate': 6.074280000000001e-06, 'epoch': 0.0}


 58%|█████▊    | 2900/5000 [14:10<09:51,  3.55it/s]

{'loss': 0.2073, 'learning_rate': 5.933280000000001e-06, 'epoch': 0.0}


 59%|█████▉    | 2950/5000 [14:24<09:43,  3.52it/s]

{'loss': 0.2064, 'learning_rate': 5.79228e-06, 'epoch': 0.0}


 60%|██████    | 3000/5000 [14:38<09:18,  3.58it/s]

{'loss': 0.2048, 'learning_rate': 5.65128e-06, 'epoch': 0.0}


 61%|██████    | 3050/5000 [14:58<09:20,  3.48it/s]

{'loss': 0.2072, 'learning_rate': 5.51028e-06, 'epoch': 0.0}


 62%|██████▏   | 3100/5000 [15:13<08:53,  3.56it/s]

{'loss': 0.2303, 'learning_rate': 5.369280000000001e-06, 'epoch': 0.0}


 63%|██████▎   | 3150/5000 [15:27<08:36,  3.58it/s]

{'loss': 0.2153, 'learning_rate': 5.228280000000001e-06, 'epoch': 0.0}


 64%|██████▍   | 3200/5000 [15:41<08:26,  3.56it/s]

{'loss': 0.2218, 'learning_rate': 5.087280000000001e-06, 'epoch': 0.0}


 65%|██████▌   | 3250/5000 [15:55<08:22,  3.49it/s]

{'loss': 0.2113, 'learning_rate': 4.94628e-06, 'epoch': 0.0}


 66%|██████▌   | 3300/5000 [16:09<08:07,  3.49it/s]

{'loss': 0.2079, 'learning_rate': 4.80528e-06, 'epoch': 0.0}


 67%|██████▋   | 3350/5000 [16:23<07:37,  3.60it/s]

{'loss': 0.2011, 'learning_rate': 4.66428e-06, 'epoch': 0.0}


 68%|██████▊   | 3400/5000 [16:37<07:31,  3.54it/s]

{'loss': 0.2098, 'learning_rate': 4.52328e-06, 'epoch': 0.0}


 69%|██████▉   | 3450/5000 [16:51<07:15,  3.56it/s]

{'loss': 0.2168, 'learning_rate': 4.382280000000001e-06, 'epoch': 0.0}


 70%|███████   | 3500/5000 [17:06<07:01,  3.56it/s]

{'loss': 0.1685, 'learning_rate': 4.241280000000001e-06, 'epoch': 0.0}


 71%|███████   | 3550/5000 [17:26<06:54,  3.50it/s]

{'loss': 0.1986, 'learning_rate': 4.10028e-06, 'epoch': 0.0}


 72%|███████▏  | 3600/5000 [17:40<06:33,  3.56it/s]

{'loss': 0.2019, 'learning_rate': 3.95928e-06, 'epoch': 0.0}


 73%|███████▎  | 3650/5000 [17:54<06:30,  3.46it/s]

{'loss': 0.1843, 'learning_rate': 3.81828e-06, 'epoch': 0.0}


 74%|███████▍  | 3700/5000 [18:08<06:10,  3.51it/s]

{'loss': 0.1946, 'learning_rate': 3.67728e-06, 'epoch': 0.0}


 75%|███████▌  | 3750/5000 [18:22<05:45,  3.62it/s]

{'loss': 0.1983, 'learning_rate': 3.5362800000000006e-06, 'epoch': 0.0}


 76%|███████▌  | 3800/5000 [18:36<05:35,  3.58it/s]

{'loss': 0.1978, 'learning_rate': 3.3952799999999998e-06, 'epoch': 0.0}


 77%|███████▋  | 3850/5000 [18:51<05:24,  3.54it/s]

{'loss': 0.2121, 'learning_rate': 3.25428e-06, 'epoch': 0.0}


 78%|███████▊  | 3900/5000 [19:05<05:09,  3.55it/s]

{'loss': 0.2132, 'learning_rate': 3.11328e-06, 'epoch': 0.0}


 79%|███████▉  | 3950/5000 [19:19<04:55,  3.55it/s]

{'loss': 0.2007, 'learning_rate': 2.9722799999999998e-06, 'epoch': 0.0}


 80%|████████  | 4000/5000 [19:33<04:43,  3.53it/s]

{'loss': 0.1941, 'learning_rate': 2.83128e-06, 'epoch': 0.0}


 81%|████████  | 4050/5000 [19:53<04:29,  3.53it/s]

{'loss': 0.2069, 'learning_rate': 2.69028e-06, 'epoch': 0.01}


 82%|████████▏ | 4100/5000 [20:07<04:17,  3.49it/s]

{'loss': 0.179, 'learning_rate': 2.5492799999999998e-06, 'epoch': 0.01}


 83%|████████▎ | 4150/5000 [20:21<04:00,  3.53it/s]

{'loss': 0.1757, 'learning_rate': 2.40828e-06, 'epoch': 0.01}


 84%|████████▍ | 4200/5000 [20:35<03:44,  3.56it/s]

{'loss': 0.1856, 'learning_rate': 2.26728e-06, 'epoch': 0.01}


 85%|████████▌ | 4250/5000 [20:49<03:31,  3.55it/s]

{'loss': 0.2071, 'learning_rate': 2.1262799999999997e-06, 'epoch': 0.01}


 86%|████████▌ | 4300/5000 [21:04<03:14,  3.60it/s]

{'loss': 0.1921, 'learning_rate': 1.9881e-06, 'epoch': 0.01}


 87%|████████▋ | 4350/5000 [21:18<03:04,  3.52it/s]

{'loss': 0.2157, 'learning_rate': 1.8471e-06, 'epoch': 0.01}


 88%|████████▊ | 4400/5000 [21:32<02:46,  3.59it/s]

{'loss': 0.1887, 'learning_rate': 1.7061e-06, 'epoch': 0.01}


 89%|████████▉ | 4450/5000 [21:46<02:38,  3.48it/s]

{'loss': 0.1881, 'learning_rate': 1.5651e-06, 'epoch': 0.01}


 90%|█████████ | 4500/5000 [22:00<02:23,  3.48it/s]

{'loss': 0.1958, 'learning_rate': 1.4241e-06, 'epoch': 0.01}


 91%|█████████ | 4550/5000 [22:20<02:08,  3.50it/s]

{'loss': 0.1716, 'learning_rate': 1.2831e-06, 'epoch': 0.01}


 92%|█████████▏| 4600/5000 [22:34<01:52,  3.56it/s]

{'loss': 0.1801, 'learning_rate': 1.1421e-06, 'epoch': 0.01}


 93%|█████████▎| 4650/5000 [22:48<01:39,  3.52it/s]

{'loss': 0.1722, 'learning_rate': 1.0011e-06, 'epoch': 0.01}


 94%|█████████▍| 4700/5000 [23:02<01:24,  3.56it/s]

{'loss': 0.1811, 'learning_rate': 8.601e-07, 'epoch': 0.01}


 95%|█████████▌| 4750/5000 [23:16<01:11,  3.51it/s]

{'loss': 0.1687, 'learning_rate': 7.191e-07, 'epoch': 0.01}


 96%|█████████▌| 4800/5000 [23:31<00:56,  3.53it/s]

{'loss': 0.1926, 'learning_rate': 5.781e-07, 'epoch': 0.01}


 97%|█████████▋| 4850/5000 [23:45<00:42,  3.53it/s]

{'loss': 0.1724, 'learning_rate': 4.371e-07, 'epoch': 0.01}


 98%|█████████▊| 4900/5000 [23:59<00:28,  3.54it/s]

{'loss': 0.1772, 'learning_rate': 2.961e-07, 'epoch': 0.01}


 99%|█████████▉| 4950/5000 [24:13<00:13,  3.60it/s]

{'loss': 0.1859, 'learning_rate': 1.551e-07, 'epoch': 0.01}


100%|██████████| 5000/5000 [24:27<00:00,  3.50it/s]

{'loss': 0.1571, 'learning_rate': 1.41e-08, 'epoch': 0.01}


100%|██████████| 5000/5000 [24:33<00:00,  3.39it/s]

{'train_runtime': 1473.8639, 'train_samples_per_second': 108.558, 'train_steps_per_second': 3.392, 'train_loss': 0.2233351188659668, 'epoch': 0.01}





TrainOutput(global_step=5000, training_loss=0.2233351188659668, metrics={'train_runtime': 1473.8639, 'train_samples_per_second': 108.558, 'train_steps_per_second': 3.392, 'train_loss': 0.2233351188659668, 'epoch': 0.01})
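
To read the loss values logged above: as far as the TRL documentation describes it, `RewardTrainer` trains on pairs of a preferred and a rejected text and minimizes the negative log-sigmoid of the reward margin between them. The sketch below reproduces that formula in plain Python. The function name `pairwise_reward_loss` and the reward values are made up for illustration and are not part of the notebook.

```python
import math

def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)), computed stably."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# hypothetical (chosen, rejected) reward pairs
pairs = [(2.0, -1.0), (0.5, 0.5), (-1.0, 1.0)]
losses = [pairwise_reward_loss(chosen, rejected) for chosen, rejected in pairs]

for (chosen, rejected), loss in zip(pairs, losses):
    print(f"margin={chosen - rejected:+.1f}  loss={loss:.4f}")
```

A margin of zero gives a loss of log 2 (about 0.693), which is what a reward model that cannot tell the two texts apart would score. Losses below that mean the preferred text gets the higher reward on average.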

In [20]:
reward_model.gradient_checkpointing_disable()
reward_model.eval()

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-5): 6 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (

In [21]:

import ast  # literal_eval parses the stored string without executing arbitrary code

for sample_index in 45, 1000:
  print('TEXT:', toxic[sample_index]['model_output'])
  inputs = reward_tokenizer(
      toxic[sample_index]['model_output'], truncation=True, return_tensors='pt').to(device)
  with torch.no_grad():
    reward = reward_model(**inputs).logits[0, 0].item()
    print("REWARD:", reward)
  print('LABEL:', dict(ast.literal_eval(toxic[sample_index]['openai_moderation']))[TARGET])
  print()

# note: your reward model may produce different absolute rewards.
# This is fine as long as the rewards are ordered correctly (most of the time)

TEXT: Once upon a time, there was a couple who had been together for a while. One day, the man decided to be a bit more adventurous in bed. He began to kiss and nibble her neck and then slowly made his way down to her bottom. At first, she was surprised, but as he continued, she started to get more and more aroused. He was doing things to her bottom that she had never experienced before, but she loved it. He was so gentle and skillful, and she was so turned on that she could barely contain herself. As the night went on, he proceeded to take her to unimaginable heights, and she was left in complete bliss.
REWARD: 8.7265625
LABEL: 0.07679520547389984

TEXT: As an AI language model, I do not have access to any databases, including Pinecone. However, if you would like to use my generated text as part of your research or analysis, you can copy and paste my responses into your preferred text editor or database. Please note that my responses are intended to be informative and may not be suita

In [22]:
toxic_test = datasets.load_dataset('lmsys/toxic-chat', 'toxicchat0124', split='test')

In [23]:
import ast  # literal_eval parses the stored string without executing arbitrary code

labels = []
rewards = []
for sample_index in trange(len(toxic_test)):
  inputs = reward_tokenizer(toxic_test[sample_index]['model_output'], truncation=True, return_tensors='pt').to(device)
  with torch.no_grad():
    reward = reward_model(**inputs).logits[0, 0].item()
    rewards.append(reward)
  labels.append(dict(ast.literal_eval(toxic_test[sample_index]['openai_moderation']))[TARGET])

100%|██████████| 5083/5083 [00:32<00:00, 154.21it/s]


In [24]:
def concordance_index(y_true, y_pred):
    n = 0
    n_concordant = 0
    
    for i in range(len(y_true)):
        for j in range(i + 1, len(y_true)):
            if y_true[i] != y_true[j]:
                n += 1
                if (y_true[i] > y_true[j] and y_pred[i] > y_pred[j]) or \
                   (y_true[i] < y_true[j] and y_pred[i] < y_pred[j]):
                    n_concordant += 1
                    
    return n_concordant / n if n > 0 else 0

print("C-index:", concordance_index(labels, rewards))

C-index: 0.7182829252822912
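
To see what this number measures, here is a toy run on hand-made labels and scores. The function is repeated so that the snippet runs on its own, and the toy values are invented for illustration.

```python
def concordance_index(y_true, y_pred):
    """Fraction of pairs with different labels whose predictions are ordered the same way."""
    n = 0
    n_concordant = 0
    for i in range(len(y_true)):
        for j in range(i + 1, len(y_true)):
            if y_true[i] != y_true[j]:
                n += 1
                if (y_true[i] > y_true[j] and y_pred[i] > y_pred[j]) or \
                   (y_true[i] < y_true[j] and y_pred[i] < y_pred[j]):
                    n_concordant += 1
    return n_concordant / n if n > 0 else 0

toy_labels = [0.1, 0.4, 0.9]

perfect = concordance_index(toy_labels, [1.0, 2.0, 3.0])    # same order as the labels
reversed_ = concordance_index(toy_labels, [3.0, 2.0, 1.0])  # exactly the opposite order
one_swap = concordance_index(toy_labels, [1.0, 3.0, 2.0])   # only the last pair is misordered

print(perfect, reversed_, one_swap)
```

A perfect ranking scores 1.0, a fully reversed one scores 0.0, and a random scorer lands near 0.5. Note that this implementation counts tied predictions as non-concordant.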


In [25]:
main_tokenizer = transformers.AutoTokenizer.from_pretrained("openai-community/gpt2-large")
main_model = transformers.AutoModelForCausalLM.from_pretrained("openai-community/gpt2-large", device_map=device)



In [26]:
import numpy as np

def ranked_generations(prompt, N=16):
    inputs = main_tokenizer([prompt] * N, return_tensors='pt').to(device)
    candidates = main_model.generate(**inputs, max_new_tokens=50, do_sample=True)
    samples = np.array([main_tokenizer.decode(candidate.flatten().cpu().numpy().tolist()) for candidate in candidates])

    rewards = []
    for sample in samples:
        inputs = reward_tokenizer(sample, truncation=True, return_tensors='pt').to(device)
        with torch.no_grad():
            reward = reward_model(**inputs).logits[0, 0].item()
            rewards.append(reward)

    rewards = np.array(rewards)
    ranks = np.argsort(-rewards)
    for sample in samples[ranks]:
        print(sample)
        print('===========================================')
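
`ranked_generations` is a form of best-of-N sampling: draw N candidates, score each one with the reward model, and sort by score. The toy version below isolates the ranking logic so it runs without any model. `rank_by_reward` and `toy_reward` (a keyword counter standing in for the reward model) are hypothetical helpers, not part of the notebook.

```python
from typing import Callable, List, Tuple

def rank_by_reward(candidates: List[str],
                   reward_fn: Callable[[str], float]) -> List[Tuple[float, str]]:
    """Score every candidate and return (reward, text) pairs, highest reward first."""
    scored = [(reward_fn(text), text) for text in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

def toy_reward(text: str) -> float:
    """Stand-in for a reward model: count occurrences of two hand-picked words."""
    lowered = text.lower()
    return float(sum(lowered.count(word) for word in ("great", "love")))

candidates = ["I love it, great film", "It was fine", "Great cast"]
ranked = rank_by_reward(candidates, toy_reward)

for reward, text in ranked:
    print(f"{reward:.1f}  {text}")
```

Keeping only `ranked[0]` gives plain best-of-N selection. It needs no training, but every answer costs N generations plus N reward-model calls.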

In [27]:
ranked_generations('Your making me')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Your making me feel your hand on my breast and your gentle stroking my sensitive nipples while I watched you for the time being? I am just the one who knows what your doing. It seems that your hands were making everything so pleasently good. *Panting
Your making me too tight? What did you say? You want a piece of me?

No, no, I'm sorry. That was bad. I've been thinking about it for a while. I want you; I want to have you.

Your making me very angry, my dear, and it makes so funny! [pause] [chuckle] [lots of kissing, sighing and then moaning] Oh, God. [kissing] [laugh] No, no, get me off too
Your making me wait? <pause> You really are such a dirty pervert, aren't you? I thought so. <pause> And what now?<pause> Hmm? I... Um... I haven't told you my name yet... Well, it
Your making me mad again? You are an ugly little monster don't you know!?」

「N, no, I, I……don't feel good about that……」

Saying so I grab the red dress and pull it up.

Your making me regret having a baby at all. I've neve

**Fine-tuning the model**

In [28]:
# Note: this preprocessing is specific to the toxic-chat data; you will need to re-write it for other tasks
toxic_for_rlhf = toxic.filter(lambda row: len(row['model_output']) > 200, batched=False)
toxic_for_rlhf = toxic_for_rlhf.remove_columns(['openai_moderation'])
sample_length = trl.core.LengthSampler(2, 8)  # use the first few tokens as query (upper bound is exclusive: 2-7)

def select_query_and_tokenize(sample):
    query_ids = main_tokenizer.encode(sample["model_output"])[: sample_length()]
    sample["query"] = main_tokenizer.decode(query_ids)  # query is the only required column
    sample["input_ids"] = query_ids  # to avoid re-tokenizing later
    return sample  # we do not need the rest - it will be generated by the model

toxic_for_rlhf = toxic_for_rlhf.map(select_query_and_tokenize, batched=False)
toxic_for_rlhf.set_format(type="torch")

Filter: 100%|██████████| 5082/5082 [00:00<00:00, 26726.03 examples/s]
Map:  10%|█         | 435/4204 [00:00<00:02, 1462.00 examples/s]Token indices sequence length is longer than the specified maximum sequence length for this model (1222 > 1024). Running this sequence through the model will result in indexing errors
Map: 100%|██████████| 4204/4204 [00:03<00:00, 1326.29 examples/s]
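
The query construction above keeps only a short random-length prefix of each text and lets the model generate the rest. Below is a minimal model-free sketch of that idea. It assumes that `LengthSampler(2, 8)` draws uniformly from `range(2, 8)`, and it uses whitespace splitting as a stand-in for the real tokenizer. `sample_query` is a hypothetical helper.

```python
import random
from typing import List

def sample_query(tokens: List[str], rng: random.Random,
                 min_len: int = 2, max_len: int = 8) -> List[str]:
    """Keep a random-length prefix; max_len is exclusive, mirroring range(min_len, max_len)."""
    prefix_length = rng.choice(range(min_len, max_len))
    return tokens[:prefix_length]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
tokens = "once upon a time there was a small robot who liked films".split()
queries = [sample_query(tokens, rng) for _ in range(5)]

for query in queries:
    print(len(query), " ".join(query))
```

Varying the prefix length keeps the policy from seeing only one prompt shape during PPO.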


In [29]:
from typing import List
def compute_reward(texts: List[str]) -> torch.Tensor:
  inputs = reward_tokenizer(texts, truncation=True, padding=True, return_tensors='pt').to(device)
  with torch.no_grad():
    return reward_model(**inputs).logits[:, 0]

In [31]:
compute_reward([toxic[45]['model_output'], toxic[1000]['model_output']])  # sanity check: the same two samples as above, scored in one batch

tensor([8.7266, 3.4902], device='cuda:0')

In [65]:
import peft
peft_config = peft.LoraConfig(
    task_type=peft.TaskType.CAUSAL_LM, r=8, lora_alpha=32, lora_dropout=0.0, inference_mode=False, target_modules=["c_attn", "c_proj", "c_fc", "lm_head"]
)

# reload main model as AutoModelForCausalLMWithValueHead - with an extra head needed for PPO
main_tokenizer = transformers.AutoTokenizer.from_pretrained("openai-community/gpt2-large")
main_tokenizer.pad_token = main_tokenizer.eos_token

main_model = trl.AutoModelForCausalLMWithValueHead.from_pretrained("openai-community/gpt2-large", device_map=device)
main_model = peft.get_peft_model(main_model, peft_config, adapter_name='default')
main_model.print_trainable_parameters()



trainable params: 6,310,536 || all params: 780,341,897 || trainable%: 0.8086886048616201
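
The trainable-parameter count printed above can be reproduced by hand. A LoRA adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. The sketch assumes the published GPT-2 large dimensions (hidden size 1280, 36 blocks, MLP width 5120, vocabulary 50257) and that the `c_proj` pattern matches both the attention and the MLP output projection in every block.

```python
r = 8
hidden, mlp_width, n_blocks, vocab = 1280, 5120, 36, 50257

def lora_params(d_in: int, d_out: int, rank: int = r) -> int:
    """Weights added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

per_block = (
    lora_params(hidden, 3 * hidden)    # c_attn: fused query/key/value projection
    + lora_params(hidden, hidden)      # attention c_proj
    + lora_params(hidden, mlp_width)   # c_fc
    + lora_params(mlp_width, hidden)   # MLP c_proj
)
total = n_blocks * per_block + lora_params(hidden, vocab)  # plus the lm_head adapter

print(f"{total:,}")  # 6,310,536
```

Doubling `r` doubles the count, so the rank is the main knob for trading adapter capacity against memory.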


In [66]:
training_args = trl.PPOConfig(
    model_name=main_model.config._name_or_path,
    gradient_accumulation_steps=2,
    learning_rate=2e-5,
    batch_size=64,
    mini_batch_size=16,
    ppo_epochs=4,                 # PPO makes this many optimization passes over each batch of rollouts
)

ppo_trainer = trl.PPOTrainer(
    training_args, model=main_model.model, tokenizer=main_tokenizer,
    dataset=toxic_for_rlhf, data_collator=lambda data: dict((key, [d[key] for d in data]) for key in data[0])
)  # note: we pass main_model.model because PPOTrainer checks for one of several supported model types ...
# ... main_model.model is a model with adapters, which is supported. main_model itself is a wrapper that is not supported
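
The batch settings in `PPOConfig` interact, so it helps to work out what one call to `ppo_trainer.step` costs. The arithmetic below is a sketch that assumes the TRL 0.7.x behaviour, where gradients from `gradient_accumulation_steps` mini-batches are accumulated before each optimizer update. Check this against your installed version.

```python
batch_size, mini_batch_size = 64, 16
gradient_accumulation_steps, ppo_epochs = 2, 4

# each PPO epoch walks over the whole rollout batch in mini-batches
minibatches_per_epoch = batch_size // mini_batch_size
forward_backward_passes = ppo_epochs * minibatches_per_epoch
# assumed: one optimizer update per `gradient_accumulation_steps` mini-batches
optimizer_updates = forward_backward_passes // gradient_accumulation_steps

print("mini-batches per PPO epoch:", minibatches_per_epoch)
print("forward/backward passes per step:", forward_backward_passes)
print("optimizer updates per step:", optimizer_updates)
```

Under these assumptions, each batch of 64 rollouts yields 16 forward/backward passes and 8 optimizer updates. A smaller `mini_batch_size` lowers peak memory without changing how many samples feed each update, as long as `gradient_accumulation_steps` is raised to match.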

In [67]:
from tqdm.auto import tqdm
max_steps = 100   # can be insufficient for some tasks - watch your learning curves
generation_kwargs = dict(
    min_length=-1, max_new_tokens=128, do_sample=True, top_k=0, top_p=1.0, pad_token_id=main_tokenizer.eos_token_id)
#                                  ^-- task-specific parameter!
with tqdm(enumerate(ppo_trainer.dataloader), total=max_steps) as progressbar:
  # note: ppo_trainer.dataloader is just a regular dataloader of queries, no RL-specific magic :)
  for epoch, batch in progressbar:
    if epoch >= max_steps:
        break

    # Rollout stage: generate continuations from batch queries using main_model
    response_tensors = ppo_trainer.generate(batch['input_ids'], **generation_kwargs)
    # ^-- list of tensors of token ids from main model tokenizer

    # de-tokenize responses to strings (since reward model uses a different tokenizer)
    batch["response"] = [main_tokenizer.decode(response.squeeze()) for response in response_tensors]
    # note: response_tensors already contain query tokens, so we don't need to add queries manually.
    # This may not be true for other tasks: check this manually by viewing batch["response"] and batch["query"]


    # Evaluation stage
    rewards = compute_reward(batch['response'])

    # Update stage
    stats = ppo_trainer.step(batch['input_ids'], response_tensors, list(rewards.split(1)))
    stats['rewards/mean'] = rewards.mean().item()

    print("-" * 30, 'STEP', epoch, '-' * 30)
    print(f'rewards/mean:\t{stats["rewards/mean"]:.9f}\t<---- average reward over this batch (higher=better, noisy)')
    print(f'ppo/returns/mean:\t{stats["ppo/returns/mean"]:.9f}\t<---- model-estimated average discounted reward')
    print(f'objective/kl:\t{stats["objective/kl"]:.9f}\t<---- how far we are from the original model (regularizer)')
    print()

    ppo_trainer.log_stats(stats, batch, list(rewards.split(1)))
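
The `objective/kl` statistic printed in the loop reflects how PPO in TRL shapes the reward. Each generated token is penalized for drifting from the reference model, and the reward-model score is added at the last token. Below is a minimal sketch, assuming a fixed coefficient `beta = 0.2` (believed to be the default `init_kl_coef`; TRL can also adapt it during training) and invented log-probabilities. `shaped_rewards` is a hypothetical helper.

```python
from typing import List

def shaped_rewards(logprobs: List[float], ref_logprobs: List[float],
                   score: float, beta: float = 0.2) -> List[float]:
    """Per-token reward: -beta * (logp - ref_logp), with the score added at the last token."""
    rewards = [-beta * (lp - ref) for lp, ref in zip(logprobs, ref_logprobs)]
    rewards[-1] += score
    return rewards

# invented log-probabilities of three sampled tokens
logprobs = [-1.0, -0.5, -2.0]      # under the model being trained
ref_logprobs = [-1.5, -0.5, -1.0]  # under the frozen reference model

rewards = shaped_rewards(logprobs, ref_logprobs, score=1.0)
kl_estimate = sum(lp - ref for lp, ref in zip(logprobs, ref_logprobs))

print("per-token rewards:", [round(value, 3) for value in rewards])
print("sampled KL estimate:", kl_estimate)
```

The KL term here is a sample-based estimate rather than an exact divergence, which is why it can come out negative on individual batches, as it does in some of the early steps of the log.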

  0%|          | 0/100 [00:00<?, ?it/s]You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  1%|          | 1/100 [01:25<2:20:46, 85.32s/it]

------------------------------ STEP 0 ------------------------------
rewards/mean:	-0.267858505	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.404650092	<---- model-estimated average discounted reward
objective/kl:	0.000000000	<---- how far we are from the original model (regularizer)



  2%|▏         | 2/100 [02:54<2:23:23, 87.79s/it]

------------------------------ STEP 1 ------------------------------
rewards/mean:	0.764867783	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.253106356	<---- model-estimated average discounted reward
objective/kl:	-0.033599231	<---- how far we are from the original model (regularizer)



  3%|▎         | 3/100 [04:24<2:23:22, 88.68s/it]

------------------------------ STEP 2 ------------------------------
rewards/mean:	-0.778256416	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.473553926	<---- model-estimated average discounted reward
objective/kl:	-0.144695282	<---- how far we are from the original model (regularizer)



  4%|▍         | 4/100 [05:53<2:22:08, 88.84s/it]

------------------------------ STEP 3 ------------------------------
rewards/mean:	-0.826262474	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.485401809	<---- model-estimated average discounted reward
objective/kl:	-0.111462295	<---- how far we are from the original model (regularizer)



  5%|▌         | 5/100 [07:22<2:20:54, 88.99s/it]

------------------------------ STEP 4 ------------------------------
rewards/mean:	-0.504006863	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.439218402	<---- model-estimated average discounted reward
objective/kl:	-0.047009993	<---- how far we are from the original model (regularizer)



  6%|▌         | 6/100 [08:52<2:19:29, 89.04s/it]

------------------------------ STEP 5 ------------------------------
rewards/mean:	-0.244988501	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.373627484	<---- model-estimated average discounted reward
objective/kl:	0.063484713	<---- how far we are from the original model (regularizer)



  7%|▋         | 7/100 [10:21<2:18:13, 89.18s/it]

------------------------------ STEP 6 ------------------------------
rewards/mean:	0.139022827	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.340521991	<---- model-estimated average discounted reward
objective/kl:	0.091772646	<---- how far we are from the original model (regularizer)



  8%|▊         | 8/100 [11:50<2:16:28, 89.01s/it]

------------------------------ STEP 7 ------------------------------
rewards/mean:	-0.242859840	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.367155552	<---- model-estimated average discounted reward
objective/kl:	0.087023273	<---- how far we are from the original model (regularizer)

------------------------------ STEP 8 ------------------------------
rewards/mean:	-0.506072998	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.348835588	<---- model-estimated average discounted reward
objective/kl:	0.233217418	<---- how far we are from the original model (regularizer)



 10%|█         | 10/100 [14:51<2:14:35, 89.73s/it]

------------------------------ STEP 9 ------------------------------
rewards/mean:	0.294309616	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.321414292	<---- model-estimated average discounted reward
objective/kl:	0.159300938	<---- how far we are from the original model (regularizer)



 11%|█         | 11/100 [16:20<2:12:42, 89.47s/it]

------------------------------ STEP 10 ------------------------------
rewards/mean:	-0.299052715	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.341976523	<---- model-estimated average discounted reward
objective/kl:	0.281862646	<---- how far we are from the original model (regularizer)



 12%|█▏        | 12/100 [17:49<2:10:57, 89.29s/it]

------------------------------ STEP 11 ------------------------------
rewards/mean:	-0.042878151	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.341538966	<---- model-estimated average discounted reward
objective/kl:	0.209589601	<---- how far we are from the original model (regularizer)



 13%|█▎        | 13/100 [19:18<2:09:29, 89.30s/it]

------------------------------ STEP 12 ------------------------------
rewards/mean:	0.552068710	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.217724532	<---- model-estimated average discounted reward
objective/kl:	0.355072856	<---- how far we are from the original model (regularizer)



 14%|█▍        | 14/100 [20:48<2:08:24, 89.59s/it]

------------------------------ STEP 13 ------------------------------
rewards/mean:	-0.211019039	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.273120850	<---- model-estimated average discounted reward
objective/kl:	-0.013669834	<---- how far we are from the original model (regularizer)



 15%|█▌        | 15/100 [22:18<2:07:06, 89.72s/it]

------------------------------ STEP 14 ------------------------------
rewards/mean:	-0.238117218	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.251045734	<---- model-estimated average discounted reward
objective/kl:	0.060639404	<---- how far we are from the original model (regularizer)



 16%|█▌        | 16/100 [23:50<2:06:19, 90.23s/it]

------------------------------ STEP 15 ------------------------------
rewards/mean:	0.713028431	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.103930295	<---- model-estimated average discounted reward
objective/kl:	0.224930763	<---- how far we are from the original model (regularizer)



 17%|█▋        | 17/100 [25:21<2:05:11, 90.50s/it]

------------------------------ STEP 16 ------------------------------
rewards/mean:	-0.708319664	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.333385766	<---- model-estimated average discounted reward
objective/kl:	-0.057152919	<---- how far we are from the original model (regularizer)



 18%|█▊        | 18/100 [26:51<2:03:19, 90.24s/it]

------------------------------ STEP 17 ------------------------------
rewards/mean:	-0.642056465	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.265569180	<---- model-estimated average discounted reward
objective/kl:	0.385221720	<---- how far we are from the original model (regularizer)



 19%|█▉        | 19/100 [28:20<2:01:31, 90.02s/it]

------------------------------ STEP 18 ------------------------------
rewards/mean:	0.964756966	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.010862727	<---- model-estimated average discounted reward
objective/kl:	0.364635408	<---- how far we are from the original model (regularizer)



 20%|██        | 20/100 [29:49<1:59:38, 89.73s/it]

------------------------------ STEP 19 ------------------------------
rewards/mean:	0.680076599	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.030308403	<---- model-estimated average discounted reward
objective/kl:	0.586686373	<---- how far we are from the original model (regularizer)



 21%|██        | 21/100 [31:19<1:58:22, 89.91s/it]

------------------------------ STEP 20 ------------------------------
rewards/mean:	0.685202122	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.025158867	<---- model-estimated average discounted reward
objective/kl:	0.472282946	<---- how far we are from the original model (regularizer)



 22%|██▏       | 22/100 [32:49<1:56:45, 89.82s/it]

------------------------------ STEP 21 ------------------------------
rewards/mean:	0.955926895	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.124664068	<---- model-estimated average discounted reward
objective/kl:	0.637712717	<---- how far we are from the original model (regularizer)



 23%|██▎       | 23/100 [34:18<1:54:56, 89.56s/it]

------------------------------ STEP 22 ------------------------------
rewards/mean:	0.064296722	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.007963257	<---- model-estimated average discounted reward
objective/kl:	0.595253468	<---- how far we are from the original model (regularizer)



 24%|██▍       | 24/100 [35:47<1:53:17, 89.45s/it]

------------------------------ STEP 23 ------------------------------
rewards/mean:	0.566231966	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.139475867	<---- model-estimated average discounted reward
objective/kl:	0.727152944	<---- how far we are from the original model (regularizer)



 25%|██▌       | 25/100 [37:17<1:51:50, 89.47s/it]

------------------------------ STEP 24 ------------------------------
rewards/mean:	0.663934708	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.159631610	<---- model-estimated average discounted reward
objective/kl:	0.931444943	<---- how far we are from the original model (regularizer)



 26%|██▌       | 26/100 [38:46<1:50:27, 89.56s/it]

------------------------------ STEP 25 ------------------------------
rewards/mean:	0.924873352	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.299459696	<---- model-estimated average discounted reward
objective/kl:	0.310401559	<---- how far we are from the original model (regularizer)



 27%|██▋       | 27/100 [40:16<1:48:56, 89.53s/it]

------------------------------ STEP 26 ------------------------------
rewards/mean:	0.379555464	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.201552778	<---- model-estimated average discounted reward
objective/kl:	0.949964166	<---- how far we are from the original model (regularizer)



 28%|██▊       | 28/100 [41:45<1:47:14, 89.36s/it]

------------------------------ STEP 27 ------------------------------
rewards/mean:	0.282639503	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.259616494	<---- model-estimated average discounted reward
objective/kl:	1.121129274	<---- how far we are from the original model (regularizer)



 29%|██▉       | 29/100 [43:15<1:46:04, 89.64s/it]

------------------------------ STEP 28 ------------------------------
rewards/mean:	0.916576385	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.339866936	<---- model-estimated average discounted reward
objective/kl:	1.423987031	<---- how far we are from the original model (regularizer)



 30%|███       | 30/100 [44:45<1:44:32, 89.61s/it]

------------------------------ STEP 29 ------------------------------
rewards/mean:	0.485036850	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.279328138	<---- model-estimated average discounted reward
objective/kl:	1.464618564	<---- how far we are from the original model (regularizer)



 31%|███       | 31/100 [46:14<1:43:05, 89.65s/it]

------------------------------ STEP 30 ------------------------------
rewards/mean:	1.214517832	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.401721656	<---- model-estimated average discounted reward
objective/kl:	1.723118424	<---- how far we are from the original model (regularizer)



 32%|███▏      | 32/100 [47:43<1:41:18, 89.39s/it]

------------------------------ STEP 31 ------------------------------
rewards/mean:	0.952964783	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.428505957	<---- model-estimated average discounted reward
objective/kl:	1.955217361	<---- how far we are from the original model (regularizer)



 33%|███▎      | 33/100 [49:14<1:40:14, 89.76s/it]

------------------------------ STEP 32 ------------------------------
rewards/mean:	1.193777084	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.547607660	<---- model-estimated average discounted reward
objective/kl:	1.914126396	<---- how far we are from the original model (regularizer)



 34%|███▍      | 34/100 [50:45<1:39:19, 90.29s/it]

------------------------------ STEP 33 ------------------------------
rewards/mean:	0.802480698	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.492243052	<---- model-estimated average discounted reward
objective/kl:	2.060886860	<---- how far we are from the original model (regularizer)



 35%|███▌      | 35/100 [52:16<1:37:46, 90.26s/it]

------------------------------ STEP 34 ------------------------------
rewards/mean:	1.015179634	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.564144611	<---- model-estimated average discounted reward
objective/kl:	1.730980277	<---- how far we are from the original model (regularizer)



 36%|███▌      | 36/100 [53:46<1:36:15, 90.25s/it]

------------------------------ STEP 35 ------------------------------
rewards/mean:	1.139518738	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.699515104	<---- model-estimated average discounted reward
objective/kl:	2.273586512	<---- how far we are from the original model (regularizer)



 37%|███▋      | 37/100 [55:18<1:35:23, 90.85s/it]

------------------------------ STEP 36 ------------------------------
rewards/mean:	1.429427862	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.784425855	<---- model-estimated average discounted reward
objective/kl:	2.711423874	<---- how far we are from the original model (regularizer)



 38%|███▊      | 38/100 [56:48<1:33:28, 90.45s/it]

------------------------------ STEP 37 ------------------------------
rewards/mean:	1.379755020	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.798085153	<---- model-estimated average discounted reward
objective/kl:	2.478164434	<---- how far we are from the original model (regularizer)



 39%|███▉      | 39/100 [58:18<1:32:04, 90.57s/it]

------------------------------ STEP 38 ------------------------------
rewards/mean:	1.410919189	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.868065655	<---- model-estimated average discounted reward
objective/kl:	3.034407139	<---- how far we are from the original model (regularizer)



 40%|████      | 40/100 [59:48<1:30:17, 90.29s/it]

------------------------------ STEP 39 ------------------------------
rewards/mean:	0.683114529	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.697526455	<---- model-estimated average discounted reward
objective/kl:	1.775608063	<---- how far we are from the original model (regularizer)



 41%|████      | 41/100 [1:01:18<1:28:35, 90.09s/it]

------------------------------ STEP 40 ------------------------------
rewards/mean:	1.424812317	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.854113281	<---- model-estimated average discounted reward
objective/kl:	2.669202566	<---- how far we are from the original model (regularizer)



 42%|████▏     | 42/100 [1:02:47<1:26:56, 89.94s/it]

------------------------------ STEP 41 ------------------------------
rewards/mean:	0.266723633	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.683370233	<---- model-estimated average discounted reward
objective/kl:	3.218839169	<---- how far we are from the original model (regularizer)



 43%|████▎     | 43/100 [1:04:16<1:25:10, 89.65s/it]

------------------------------ STEP 42 ------------------------------
rewards/mean:	1.144253254	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.733445525	<---- model-estimated average discounted reward
objective/kl:	3.070009947	<---- how far we are from the original model (regularizer)



 44%|████▍     | 44/100 [1:05:45<1:23:26, 89.40s/it]

------------------------------ STEP 43 ------------------------------
rewards/mean:	2.259662628	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.976968586	<---- model-estimated average discounted reward
objective/kl:	3.404788971	<---- how far we are from the original model (regularizer)



 45%|████▌     | 45/100 [1:07:15<1:21:59, 89.45s/it]

------------------------------ STEP 44 ------------------------------
rewards/mean:	0.714086533	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.666334271	<---- model-estimated average discounted reward
objective/kl:	2.844151020	<---- how far we are from the original model (regularizer)



 46%|████▌     | 46/100 [1:08:44<1:20:26, 89.37s/it]

------------------------------ STEP 45 ------------------------------
rewards/mean:	1.482882977	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.909807801	<---- model-estimated average discounted reward
objective/kl:	3.466289043	<---- how far we are from the original model (regularizer)



 47%|████▋     | 47/100 [1:10:14<1:19:14, 89.70s/it]

------------------------------ STEP 46 ------------------------------
rewards/mean:	1.717456818	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.066796184	<---- model-estimated average discounted reward
objective/kl:	4.543385029	<---- how far we are from the original model (regularizer)



 48%|████▊     | 48/100 [1:11:43<1:17:33, 89.49s/it]

------------------------------ STEP 47 ------------------------------
rewards/mean:	0.523285866	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.641516209	<---- model-estimated average discounted reward
objective/kl:	3.825321674	<---- how far we are from the original model (regularizer)



 49%|████▉     | 49/100 [1:13:12<1:15:51, 89.25s/it]

------------------------------ STEP 48 ------------------------------
rewards/mean:	1.721632957	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.894341648	<---- model-estimated average discounted reward
objective/kl:	5.006285667	<---- how far we are from the original model (regularizer)



 50%|█████     | 50/100 [1:14:41<1:14:12, 89.05s/it]

------------------------------ STEP 49 ------------------------------
rewards/mean:	2.271858215	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.241638780	<---- model-estimated average discounted reward
objective/kl:	4.752362251	<---- how far we are from the original model (regularizer)



 51%|█████     | 51/100 [1:16:11<1:12:58, 89.36s/it]

------------------------------ STEP 50 ------------------------------
rewards/mean:	2.410846710	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.414255142	<---- model-estimated average discounted reward
objective/kl:	6.711001873	<---- how far we are from the original model (regularizer)



 52%|█████▏    | 52/100 [1:17:40<1:11:24, 89.26s/it]

------------------------------ STEP 51 ------------------------------
rewards/mean:	3.029613495	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.759580493	<---- model-estimated average discounted reward
objective/kl:	9.390522003	<---- how far we are from the original model (regularizer)



 53%|█████▎    | 53/100 [1:19:09<1:09:50, 89.15s/it]

------------------------------ STEP 52 ------------------------------
rewards/mean:	3.364443779	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.826399684	<---- model-estimated average discounted reward
objective/kl:	8.133933067	<---- how far we are from the original model (regularizer)



 54%|█████▍    | 54/100 [1:20:37<1:08:13, 88.99s/it]

------------------------------ STEP 53 ------------------------------
rewards/mean:	2.240402222	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.419891715	<---- model-estimated average discounted reward
objective/kl:	7.119905472	<---- how far we are from the original model (regularizer)



 55%|█████▌    | 55/100 [1:22:06<1:06:38, 88.86s/it]

------------------------------ STEP 54 ------------------------------
rewards/mean:	2.987557411	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.953844547	<---- model-estimated average discounted reward
objective/kl:	9.524259567	<---- how far we are from the original model (regularizer)



 56%|█████▌    | 56/100 [1:23:34<1:05:04, 88.74s/it]

------------------------------ STEP 55 ------------------------------
rewards/mean:	3.255000114	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.988240004	<---- model-estimated average discounted reward
objective/kl:	9.214704514	<---- how far we are from the original model (regularizer)



 57%|█████▋    | 57/100 [1:25:03<1:03:33, 88.67s/it]

------------------------------ STEP 56 ------------------------------
rewards/mean:	2.868824005	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	2.191498756	<---- model-estimated average discounted reward
objective/kl:	10.830379486	<---- how far we are from the original model (regularizer)



 58%|█████▊    | 58/100 [1:26:31<1:02:04, 88.68s/it]

------------------------------ STEP 57 ------------------------------
rewards/mean:	4.207909584	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	2.560124159	<---- model-estimated average discounted reward
objective/kl:	12.633375168	<---- how far we are from the original model (regularizer)



 59%|█████▉    | 59/100 [1:28:01<1:00:42, 88.83s/it]

------------------------------ STEP 58 ------------------------------
rewards/mean:	4.565162659	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	2.734270096	<---- model-estimated average discounted reward
objective/kl:	14.609173775	<---- how far we are from the original model (regularizer)



 60%|██████    | 60/100 [1:29:30<59:16, 88.91s/it]  

------------------------------ STEP 59 ------------------------------
rewards/mean:	4.625921249	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	2.983399630	<---- model-estimated average discounted reward
objective/kl:	15.825335503	<---- how far we are from the original model (regularizer)



 61%|██████    | 61/100 [1:30:59<57:48, 88.94s/it]

------------------------------ STEP 60 ------------------------------
rewards/mean:	4.537025452	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	2.805084705	<---- model-estimated average discounted reward
objective/kl:	16.173328400	<---- how far we are from the original model (regularizer)



 62%|██████▏   | 62/100 [1:32:28<56:20, 88.96s/it]

------------------------------ STEP 61 ------------------------------
rewards/mean:	4.917282104	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	3.017206669	<---- model-estimated average discounted reward
objective/kl:	17.950473785	<---- how far we are from the original model (regularizer)



 63%|██████▎   | 63/100 [1:33:57<54:56, 89.09s/it]

------------------------------ STEP 62 ------------------------------
rewards/mean:	4.372577667	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	2.883855820	<---- model-estimated average discounted reward
objective/kl:	18.592998505	<---- how far we are from the original model (regularizer)



 64%|██████▍   | 64/100 [1:35:27<53:38, 89.40s/it]

------------------------------ STEP 63 ------------------------------
rewards/mean:	5.347098351	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	3.227369308	<---- model-estimated average discounted reward
objective/kl:	17.695709229	<---- how far we are from the original model (regularizer)



 65%|██████▌   | 65/100 [1:36:56<52:12, 89.49s/it]

------------------------------ STEP 64 ------------------------------
rewards/mean:	4.697792053	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	3.218849659	<---- model-estimated average discounted reward
objective/kl:	19.274257660	<---- how far we are from the original model (regularizer)
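
The per-step numbers above are noisy, so the trend is easier to read after smoothing. In this log `rewards/mean` climbs from about 1 to about 4.7 while `objective/kl` grows from under 1 to about 19, i.e. the reward gain is paid for by drifting away from the original model. Below is a minimal sketch for pulling the metrics out of the printed log and smoothing them. It assumes you captured the log as plain text; `parse_ppo_log` and `moving_average` are hypothetical helper names, not part of TRL.

```python
import re

def parse_ppo_log(text):
    """Collect the printed metrics into a dict: metric name -> list of floats."""
    pattern = re.compile(
        r"^(rewards/mean|ppo/returns/mean|objective/kl):\s+(-?\d+(?:\.\d+)?)",
        re.MULTILINE,
    )
    metrics = {}
    for name, value in pattern.findall(text):
        metrics.setdefault(name, []).append(float(value))
    return metrics

def moving_average(values, window=5):
    """Trailing mean over (up to) the last `window` values at each step."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Steps 62-64, copied from the log above
log_text = """
rewards/mean:\t4.372577667\t<---- average reward over this batch
ppo/returns/mean:\t2.883855820\t<---- model-estimated average discounted reward
objective/kl:\t18.592998505\t<---- how far we are from the original model
rewards/mean:\t5.347098351\t<---- average reward over this batch
ppo/returns/mean:\t3.227369308\t<---- model-estimated average discounted reward
objective/kl:\t17.695709229\t<---- how far we are from the original model
rewards/mean:\t4.697792053\t<---- average reward over this batch
ppo/returns/mean:\t3.218849659\t<---- model-estimated average discounted reward
objective/kl:\t19.274257660\t<---- how far we are from the original model
"""

metrics = parse_ppo_log(log_text)
smoothed_rewards = moving_average(metrics["rewards/mean"], window=2)
print(smoothed_rewards)
```

Feeding the full log through the same two functions and plotting smoothed `rewards/mean` next to `objective/kl` makes it easier to decide when to stop training or whether the KL penalty needs adjusting.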






In [81]:
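import numpy as np

def generate(prompt, N=16):
    """Sample N continuations of `prompt` from the fine-tuned model and print each one."""
    # Repeat the prompt N times so that a single generate() call returns N independent samples
    inputs = main_tokenizer([prompt] * N, return_tensors='pt').to(device)
    # do_sample=True draws tokens at random instead of always taking the most likely one
    candidates = main_model.model.pretrained_model.generate(**inputs, max_new_tokens=25, do_sample=True)
    # Turn every sampled token sequence (prompt + up to 25 new tokens) back into text
    samples = [main_tokenizer.decode(candidate.flatten().cpu().numpy().tolist()) for candidate in candidates]

    for sample in samples:
        print(sample)
        print('===========================================')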
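
Why does the same prompt repeated `N` times give `N` different continuations? Because of `do_sample=True`: at each step the next token is drawn at random from the model's predicted distribution rather than chosen by argmax. Here is a toy sketch of that single step in plain Python. The function name `sample_next_token` and the three-token "vocabulary" are made up for illustration, and the sketch ignores any top-k / top-p filtering the library may apply on top.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=random):
    """Draw one token index with probability proportional to softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max before exp() for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    # random.choices normalizes the weights internally
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

toy_logits = [2.0, 1.0, 0.1]   # hypothetical scores for a 3-token vocabulary
rng = random.Random(0)         # fixed seed so the sketch is repeatable

# 16 "parallel" draws from the same distribution, like N=16 copies of one prompt
draws = [sample_next_token(toy_logits, rng=rng) for _ in range(16)]
# Greedy decoding would pick the same token every time
greedy = max(range(len(toy_logits)), key=lambda i: toy_logits[i])

print("sampled:", draws)
print("greedy: ", greedy)
```

Lower `temperature` values concentrate the draws on the highest-scoring token; higher values spread them out.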

In [74]:
generate('I wanna')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


I wanna fuck your pussy too. I wanna feel you. I wanna make you just cum, you know. You don't wanna make
I wanna call her, yeah." He said, 'I don't want to be like that now, too.' "

I
I wanna go out tonight and fuck some girl". They both just stared. "Are you sure you don't want me to cum inside
I wanna leave before I die in my house-and that's because I am an animal"

We are left alone now.
I wanna cum inside of you right here, right now.."

Sakura grabbed the hips, and fucked her own pussy until
I wanna cum in you... and you wanna come with me!


Oh, gosh, my cock has always been so tight
I wanna... I wanna get myself wet, fuck, can we just fuck, just, get my head... I wanna go, right
I wanna be your little...sissy...

I wanna be your little bitch...

I wanna be your little bitch...
I wanna tell you, if they want a girl, I can take that. What I don't want right now is to be a
I wanna go with you' but I didn't, didn't feel like it. This was weird.' He kept it a secret between
I wanna

In [84]:
generate("You making me")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


You making me horny? You doing this to me? Well, you sure do seem to want it. You want to cum all over my
You making me horny baby.. my pussy feels so hot, its so wet and horny and my man's cock is so hard and I can
You making me come all over you?" "It's okay, I was just trying to get a reaction out of you. I guess you
You making me?"

She blushed, letting out a small, muffled gasp. "M-M-Milking you,
You making me do this, just because I want to? It'll be a lot harder when you get here, and you know there's
You making me uncomfortable?" my mouth was open, that cock was already dripping, but I could still feel it throb under my skirt. He
You making me a dinner tonight?" I heard my mom's voice coming up around the corner now. I turned into this place, my sister
You making me wait for him? You want me for sex? You want me for sex. You fucking perverted… slut. It's
You making me blush? How does it feel, to be told you haven't got it! Oh, that's what I like...

You making me feel sad isn't