<font color=red>**Danger zone:**</font> you'll be fine-tuning a model to generate positive, negative or even toxic reviews. We'll be doing this for fun, but this is also the technique for [review bombing](https://en.wikipedia.org/wiki/Review_bomb), bot farms on social media and other less than dignified stuff. It is ultimately your decision how you apply this knowledge, but before you choose, ask yourself: is this why you chose to learn ML?


# LLM Alignment with Reinforcement Learning from Human Feedback (RLHF)

_based on the [original notebook](https://github.com/antndlcrx/oxford-llms-workshop/blob/main/materials/seminars/day_3/8_LLMs%20alignment%20with%20RLHF.ipynb) by Ilya Boytsov for the Oxford LLMs workshop_



In this session, you're gonna fine-tune a language model with reinforcement learning to make it generate good (or bad) reviews.

To perform RL-based fine-tuning, we'll use a new (in this course) library called [Transformer Reinforcement Learning (TRL)](https://huggingface.co/docs/trl). TRL implements the main reinforcement learning components of RLHF: reward modeling and fine-tuning with PPO.

![img](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/TRL-readme.png)

In [None]:
#!g1.1
%pip install -q trl==0.7.4 transformers==4.33.1 datasets==2.14.4 peft==0.5.0

### Tutorial: align the model to generate positive movie reviews

To see how TRL works, we'll use it to align GPT-2 on the IMDB dataset to generate positive (or negative) movie reviews. In fact, __it's your choice whether you want positive or negative reviews.__

But before you choose, let's take a look at the baseline model: a GPT-2 fine-tuned on generating arbitrary movie reviews.

In [887]:
#!g1.1
import torch
import transformers
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
main_tokenizer = transformers.AutoTokenizer.from_pretrained("lvwerra/gpt2-imdb")
main_model = transformers.AutoModelForCausalLM.from_pretrained("lvwerra/gpt2-imdb", device_map=device)

In [888]:
#!g1.1
inputs = main_tokenizer("The movie", return_tensors='pt').to(device)
generated_ids = main_model.generate(**inputs, max_new_tokens=50, do_sample=True)
print("\nGenerated text:", main_tokenizer.decode(generated_ids.flatten().cpu().numpy().tolist()))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated text: The movie's premise as explained above is that a couple of "spices" -- "inhabitants" or "reformed men" -- mix with women, who they want to marry, and that this mix results in them "living in symbiosis


If you run this cell a couple of times, you'll see that the model generates positive, negative and neutral reviews in some proportion. What we're gonna do next is teach the model to generate more positive (or negative) reviews.

Similarly to InstructGPT, we're gonna do that in 2 stages:
- **train a reward model** to assign higher values to positive (or negative) reviews
- fine-tune the language model to **maximize that reward using [proximal policy optimization](https://openai.com/research/openai-baselines-ppo)**



## Stage 1: train a reward model

First, we'll train a BERT-like model as our reward model. We'll generate synthetic pairwise rankings to emulate human rankings.

__Q:__ Why do I need a reward model? Can't I just use a pre-trained sentiment classifier? <br> __A:__ Yes, you can, but that only works for movie reviews. This tutorial teaches you how to do RLHF for any kind of objective.


__If you actually want to maximize sentiment (or other "label") instead of human preferences, train reward model as a classifier! (see week5)__


In [889]:
#!g1.1
# We'll be fine-tuning a small BERT-like model for now. Please try other models for the main assignment.
reward_model = transformers.AutoModelForSequenceClassification.from_pretrained("distilbert-base-cased", device_map=device)
reward_tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert-base-cased")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


__Note that__ the reward model has a separate tokenizer, different from the main model. They don't need to be the same for RLHF fine-tuning.

In [890]:
#!g1.1
# To train a reward model, you need a dataset (or generator) of positive-negative pairs.
# Each training sample should be a dict with 4 keys:
#  - input_ids_chosen, attention_mask_chosen = tokenizer("A sentence that human labeler likes more")
#  - input_ids_rejected, attention_mask_rejected = tokenizer("A sentence that human labeler likes less")

import torch
import datasets

class IMDBPairwiseDataset(torch.utils.data.Dataset):
    """ A dataset of all possible pairs of chosen and rejected texts in TRL reward training format """
    def __init__(self, imdb, tokenizer, accepted_label: int):
        super().__init__()
        self.tokenizer = tokenizer
        self.chosen_texts = [row['text'] for row in imdb if row['label'] == accepted_label]
        self.rejected_texts = [row['text'] for row in imdb if row['label'] != accepted_label]
        assert self.chosen_texts, f"no texts with label {accepted_label}"
        print(f"Found {len(self.chosen_texts)} chosen and {len(self.rejected_texts)} rejected texts, {len(self)} pairs")

    def __len__(self):
        return len(self.chosen_texts) * len(self.rejected_texts)  # all pairs

    def __getitem__(self, index: int):
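        # flat index -> (chosen, rejected): the quotient by the number of rejected texts selects the chosen text
        chosen = self.tokenizer(self.chosen_texts[index // len(self.rejected_texts)], truncation=True)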
        rejected = self.tokenizer(self.rejected_texts[index % len(self.rejected_texts)], truncation=True)
        return dict(input_ids_chosen=chosen['input_ids'], attention_mask_chosen=chosen['attention_mask'],
                    input_ids_rejected=rejected['input_ids'], attention_mask_rejected=rejected['attention_mask'])
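
The all-pairs indexing used by the dataset can be checked in isolation. The snippet below is a minimal sketch, not part of the original notebook; `pair_from_index` is a hypothetical helper that maps a flat index to a (chosen, rejected) pair of positions and confirms that every pair is visited exactly once, even when the two lists have different lengths.

```python
from typing import Tuple


def pair_from_index(index: int, n_chosen: int, n_rejected: int) -> Tuple[int, int]:
    """Map a flat index in [0, n_chosen * n_rejected) to (chosen_idx, rejected_idx)."""
    if not 0 <= index < n_chosen * n_rejected:
        raise IndexError(index)
    # The quotient walks over chosen texts, the remainder cycles through rejected texts.
    return divmod(index, n_rejected)


# 2 chosen texts x 3 rejected texts -> 6 distinct pairs
pairs = {pair_from_index(i, n_chosen=2, n_rejected=3) for i in range(6)}
print(sorted(pairs))
```

Dividing by the number of rejected texts (rather than chosen texts) is what keeps the chosen index in range when the two classes are not balanced.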

In [891]:
#!g1.1
# TARGET_LABEL = <YOUR CHOICE - either 0 or 1>   # and make sure it works by reviewing the sample printed below
TARGET_LABEL = 1
imdb = datasets.load_dataset("imdb", split='train')
reward_data = IMDBPairwiseDataset(imdb, reward_tokenizer, accepted_label=TARGET_LABEL)

sample = reward_data[31337]
print('CHOSEN:', reward_tokenizer.decode(sample['input_ids_chosen']))
print('REJECTED:', reward_tokenizer.decode(sample['input_ids_rejected']))

Found 12500 chosen and 12500 rejected texts, 156250000 pairs
CHOSEN: [CLS] Lars Von Trier is never backward in trying out new techniques. Some of them are very original while others are best forgotten. < br / > < br / > He depicts postwar Germany as a nightmarish train journey. With so many cities lying in ruins, Leo Kessler a young American of German descent feels obliged to help in their restoration. It is not a simple task as he quickly finds out. < br / > < br / > His uncle finds him a job as a night conductor on the Zentropa Railway Line. His job is to attend to the needs of the passengers. When the shoes are polished a chalk mark is made on the soles. A terrible argument ensues when a passenger's shoes are not chalked despite the fact they have been polished. There are many allusions to the German fanaticism of adherence to such stupid details. < br / > < br / > The railway journey is like an allegory representing man's procession through life with all its trials and tribulations

We'll be using `trl.RewardTrainer` - a special case of `transformers.Trainer` that you used in the past. `RewardTrainer` accepts the same format of training arguments (e.g. batch size, gradient checkpointing) as before, except that it trains the model for the pairwise reward objective from [the InstructGPT paper](https://arxiv.org/pdf/2203.02155.pdf):

![img](https://i.imgur.com/2JzNAPs.png)

Note that the model itself does not score pairs: it processes the chosen ($y_w$) and rejected ($y_l$) samples independently. To minimize this loss, the reward model needs to score the chosen sample higher than the rejected one. The formula also assumes some context $x$, which is useful for seq2seq tasks. In our case of movie reviews, $x$ is empty.
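
For a single pair, the InstructGPT objective reduces to $-\log \sigma(r(x, y_w) - r(x, y_l))$. The snippet below is a minimal pure-Python sketch of that per-pair term, written only to build intuition; `pairwise_reward_loss` is a hypothetical name, and the actual `RewardTrainer` computes the loss on batched tensors.

```python
import math


def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Per-pair loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(d)) == log(1 + exp(-d)); branch on the sign of d to avoid overflow
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


print(pairwise_reward_loss(0.0, 0.0))   # equal scores: the model is undecided
print(pairwise_reward_loss(3.0, -3.0))  # correct ordering: small loss
print(pairwise_reward_loss(-3.0, 3.0))  # wrong ordering: large loss
```

Only the difference between the two scores matters, which is why the absolute reward values of a trained model are arbitrary up to a constant shift.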

In [892]:
#!g1.1
import warnings
warnings.filterwarnings("ignore")


import trl


training_args = trl.RewardConfig(  # like transformers.TrainingArguments
    output_dir="reward_model",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
    learning_rate=1.41e-5,
    max_steps=1_000,              # note: training may need more than 1k steps
    logging_steps=50,
    gradient_checkpointing=True,  # reduce memory usage but train ~30% slower
    gradient_checkpointing_kwargs={"use_reentrant": False},
    fp16=True                     # disable this on CPU or on very old GPUs
    # you may add any other hyperparameters that you found useful in weeks 5-7
)

trainer = trl.RewardTrainer(
    model=reward_model,
    args=training_args,
    tokenizer=reward_tokenizer,
    train_dataset=reward_data,
    peft_config=None,  # optionally, you may tune with LoRA, prompt-tuning, etc
)

trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Could not estimate the number of tokens of the input, floating-point operations will not be computed


| Step | Training Loss |
|------|---------------|
| 50 | 0.5071 |
| 100 | 0.1867 |
| 150 | 0.1285 |
| 200 | 0.1259 |
| 250 | 0.1207 |
| 300 | 0.1 |
| 350 | 0.1046 |
| 400 | 0.0828 |
| 450 | 0.0964 |
| 500 | 0.0881 |


TrainOutput(global_step=1000, training_loss=0.10759799814224243, metrics={'train_runtime': 578.6162, 'train_samples_per_second': 55.304, 'train_steps_per_second': 1.728, 'total_flos': 0.0, 'train_loss': 0.10759799814224243, 'epoch': 0.0})

In [893]:
#!g1.1
reward_model.gradient_checkpointing_disable()
reward_model.eval()

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

### Sanity-check the reward model (1 point)

Let's check how our reward model performs.

__Your task__ is to measure how often your reward model ranks a pair of (chosen and rejected) reviews correctly. Please measure this separately for the train data (`imdb`) and for a separate test set loaded below.
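
One way to compute this metric without materializing every pair is to sort the rejected rewards once and count, for each chosen reward, how many rejected rewards lie strictly below it. The snippet below is a sketch under that approach; `pairwise_accuracy` is a hypothetical helper, and it assumes you have already collected one scalar reward per review.

```python
from bisect import bisect_left
from typing import Sequence


def pairwise_accuracy(chosen_rewards: Sequence[float], rejected_rewards: Sequence[float]) -> float:
    """Fraction of (chosen, rejected) pairs where the chosen reward is strictly higher."""
    rejected_sorted = sorted(rejected_rewards)
    # bisect_left returns the number of rejected rewards strictly below the given value
    correct = sum(bisect_left(rejected_sorted, r) for r in chosen_rewards)
    return correct / (len(chosen_rewards) * len(rejected_rewards))


# 4 pairs in total; only (1.0, 2.0) is ranked incorrectly
print(pairwise_accuracy([1.0, 3.0], [0.0, 2.0]))
```

Ties count as errors here, which matches a strict `>` comparison between chosen and rejected rewards.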

In [895]:
#!g1.1
for sample_index in 45, 16000:
    print('TEXT:', imdb[sample_index]['text'])
    inputs = reward_tokenizer(
        imdb[sample_index]['text'], truncation=True, return_tensors='pt').to(device)
    with torch.no_grad():
        reward = reward_model(**inputs).logits[0, 0].item()
        print("REWARD:", reward)
    print('LABEL:', imdb[sample_index]['label'])
    print()

# note: your reward model may produce different absolute rewards.
# This is fine as long as the rewards are ordered correctly (most of the time)

TEXT: This movie sucked. It really was a waste of my life. The acting was atrocious, the plot completely implausible. Long, long story short, these people get "terrorized" by this pathetic "crazed killer", but completely fail to fight back in any manner. And this is after they take a raft on a camping trip, with no gear, and show up at a campsite that is already assembled and completely stocked with food and clothes and the daughters headphones. Additionally, after their boat goes missing, they panic that they're stuck in the woods, but then the daughters boyfriend just shows up and they apparently never consider that they could just hike out of the woods like he did to get to them. Like I said, this movie sucks. A complete joke. Don't let your girlfriend talk you into watching it.
REWARD: -5.1328125
LABEL: 0

TEXT: Good: Engaging cinematic firefights, great presentation, vehicles are actually fun to drive, fairly appealing multiplayer, faithful to the movie, and the list goes on.<br /

In [896]:
#!g1.1
import numpy as np
from tqdm.auto import tqdm

negative_consequences = []
positive_consequences = []

reward_model.eval()
with torch.no_grad():
    for el in tqdm(imdb):
        tokens = reward_tokenizer(el['text'], truncation=True, return_tensors='pt').to(device)
        reward = reward_model(**tokens).logits[0, 0]

        if el['label'] == 0:
            negative_consequences.append(reward.item())
        else:
            positive_consequences.append(reward.item())

  0%|          | 0/25000 [00:00<?, ?it/s]

In [897]:
#!g1.1
# compare every positive-review reward with every negative-review reward via broadcasting
frequency = np.array(positive_consequences)[None, :] > np.array(negative_consequences)[:, None]
frequency.mean()

0.9827414592

In [898]:
#!g1.1
imdb_test = datasets.load_dataset("imdb", split='test')

negative_consequences_test = []
positive_consequences_test = []

reward_model.eval()
with torch.no_grad():
    for el in tqdm(imdb_test):
        tokens = reward_tokenizer(el['text'], truncation=True, return_tensors='pt').to(device)
        reward = reward_model(**tokens).logits[0, 0]

        if el['label'] == 0:
            negative_consequences_test.append(reward.item())
        else:
            positive_consequences_test.append(reward.item())

Using the latest cached version of the module from /tmp/xdg_cache/huggingface/modules/datasets_modules/datasets/imdb/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0 (last modified on Sun Dec  3 03:57:11 2023) since it couldn't be found locally at imdb., or remotely on the Hugging Face Hub.


  0%|          | 0/25000 [00:00<?, ?it/s]

In [899]:
#!g1.1
# compare every positive-review reward with every negative-review reward via broadcasting
frequency_test = np.array(positive_consequences_test)[None, :] > np.array(negative_consequences_test)[:, None]
frequency_test.mean()

0.9729508672

### Reward-guided generation (1 point)

If you did everything right, by now you should have a decent reward model. Before we use it for reinforcement learning, let's see if we can align model samples without any training.

To do so, you can use reward-guided inference: __generate N=16 samples, then select the one with the highest reward__ (according to your reward model).
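
The selection step itself is independent of any particular model. The snippet below is a minimal sketch of best-of-N selection; `best_of_n`, `toy_generate` and `toy_score` are hypothetical names, and the toy functions merely stand in for sampling from the language model and scoring with the reward model.

```python
from typing import Callable, List, Tuple


def best_of_n(prompt: str,
              generate: Callable[[str, int], List[str]],
              score: Callable[[List[str]], List[float]],
              n: int = 16) -> Tuple[str, float]:
    """Generate n candidates for the prompt and return the one with the highest reward."""
    candidates = generate(prompt, n)
    rewards = score(candidates)
    best = max(range(len(candidates)), key=rewards.__getitem__)
    return candidates[best], rewards[best]


def toy_generate(prompt: str, n: int) -> List[str]:
    # stand-in for sampling n continuations from the language model
    endings = ["great", "awful", "fine"]
    return [f"{prompt} {endings[i % len(endings)]}" for i in range(n)]


def toy_score(texts: List[str]) -> List[float]:
    # stand-in for the reward model: one scalar per text
    return [1.0 if t.endswith("great") else -1.0 if t.endswith("awful") else 0.0 for t in texts]


print(best_of_n("It was", toy_generate, toy_score, n=6))
```

Swapping the toy functions for real generation and reward scoring leaves the selection logic unchanged.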

For this problem, it's on you to demonstrate whether or not your code works. Find at least 5 neutral prompts such as "This movie is" (...), generate samples, rank them based on reward and show which samples get the highest reward.

Note: it is faster to generate samples in parallel, rather than sequentially, as follows:




In [900]:
#!g1.1
inputs = main_tokenizer(["It was"] * 5, return_tensors='pt').to(device)
for candidate in main_model.generate(**inputs, max_new_tokens=50, do_sample=True):
    print("Sample:", main_tokenizer.decode(candidate.flatten().cpu().numpy().tolist()))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample: It was supposed to be, I believe, a movie. And it never was.<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>
Sample: It was one of these low-rent, low quality dramas which we're so used to, only this time it was a great one.<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>
Sample: It was very cool to see them in person! 

In [901]:
#!g1.1
# <YOUR CODE HERE> - feel free to organize it as you see fit


prompts = ["It was", "This movie is", "He forced me", "While watching I felt", 'I had thoughts while watching']

result = []
with torch.no_grad():
    for prompt in prompts:
        print(f'prompt: {prompt}')
        inputs = main_tokenizer([prompt] * 5, return_tensors='pt').to(device)
        # sample several candidates per prompt in one batch, then score each with the reward model
        for candidate in main_model.generate(**inputs, max_new_tokens=50, do_sample=True):
            generate_text = main_tokenizer.decode(candidate.flatten().cpu().numpy().tolist())
            reward_inputs = reward_tokenizer(generate_text, truncation=True, padding=True, return_tensors='pt').to(device)
            reward = reward_model(**reward_inputs).logits[:, 0]

            print("Sample:", generate_text)
            print("reward:", reward.item())

            result.append([generate_text, reward.item()])
        print('----------------------------------------')

# sort all samples by reward, lowest first (the best samples end up last)
result = sorted(result, key=lambda x: x[1])

prompt: It was

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.





Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample: It was an incredibly fun movie!<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>
reward: 2.1171875
Sample: It was a pretty good showing of the "cute girl" genre. "Dawn of the Dead" was great. It was a really well cast. We all did understand how great the "cute girl" genre is, but there was also a
reward: 4.65625
Sample: It was a disaster, but it's not like they didn't want them on it, or they would have liked to make something in the future with an art

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample: This movie is about one woman's fascination with her mother. She decides to leave the state and go to school for a year before returning home. She goes through a lot, having been raised in the state but having left school the year before the new year started.
reward: 2.201171875
Sample: This movie is a joke. It starts out as an action movie with lots of action to it. Then, it's not. When you get to the end they're all very fast, but not fast as they should be at this level. But then in the
reward: -1.6337890625
Sample: This movie is filled with a few surprises, but the one thing that is the most disturbing in all of this, is that the movie is based entirely on the work of the same man who wrote the screenplay for this film. The title itself is more akin to that
reward: 1.2783203125
Sample: This movie is really poorly done, and the actors were good as always. It's just not very interesting, and the acting, just OK as well. But, if you like good humor, then this is it, and it's g

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample: He forced me to watch this film. As you know I love horror films, all of them have great actors, plot, performances, a great visual novel, etc. As a young adult this film was my best. I was just thinking about being a teenager and
reward: 4.37109375
Sample: He forced me on the edge of my seat, thinking the script was stupid, but I never knew what to make of the whole thing. The acting was great, so much so that I can imagine where Bill Murray would be today. I'm so sorry to see
reward: 0.56201171875
Sample: He forced me to admit that I knew much more than what they had done before. I'm amazed I had not seen this movie before. He's a real genius.<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>
reward: 1.9521484375
Sample: He forced me to make these jokes and to make some pretty bad j

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 While watching I felt a bit unsure why the movie got made. However, once I was back on stage, this is what I kept thinking.<br /><br />That's what makes these movies work. Just a few scenes are enough to make you feel like your
reward: 2.103515625
----------------------------------------
prompt: I had thoughts while watching
Sample: I had thoughts while watching this movie which were all about the fact that there are other people who are making the same mistake here and I also think it is more interesting to see them like my father and how they keep talking to each other and what they do to achieve the result
reward: 3.35546875
Sample: I had thoughts while watching the movie. Maybe I should warn you to avoid this movie. If not you, skip this bad review and stay away. Don't bother. Don't even watch this bad review.<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>
rewar

Generated texts with a positive reward read as positive reviews. Here are a few examples:

In [902]:
#!g1.1
print('Sample:', result[-1][0])
print('reward:', result[-1][1])


print('Sample:', result[-2][0])
print('reward:', result[-2][1])


print('Sample:', result[-3][0])
print('reward:', result[-3][1])

Sample: It was a pretty good showing of the "cute girl" genre. "Dawn of the Dead" was great. It was a really well cast. We all did understand how great the "cute girl" genre is, but there was also a
reward: 4.65625
Sample: He forced me to watch this film. As you know I love horror films, all of them have great actors, plot, performances, a great visual novel, etc. As a young adult this film was my best. I was just thinking about being a teenager and
reward: 4.37109375
Sample: While watching I felt like I could watch it again and again. They even have this cool looking and well made set of props and music which makes it a must for all sci-fi fans. You may not even know how it's done. You'll just get the
reward: 4.14453125


A negative reward corresponds to a negative review. A few examples are shown below:

In [903]:
#!g1.1
print('Sample:', result[0][0])
print('reward:', result[0][1])


print('Sample:', result[1][0])
print('reward:', result[1][1])


print('Sample:', result[2][0])
print('reward:', result[2][1])

Sample: It was so bad it made me wonder if the film was even made and then why had it been given to the production company and not released? For that matter I don't think it might have made even the slightest bit of a dent in any of the hype
reward: -4.41796875
Sample: While watching I felt I had come to an end, though I was expecting a somewhat less dramatic end to this than other reviews have mentioned. Well, this one was very predictable and very predictable.<br /><br />There were still some really good moments in the film
reward: -4.29296875
Sample: He forced me to make these jokes and to make some pretty bad jokes for nothing so that I could be a good comedian.<br /><br />It was always kind of like when someone wants to insult your intelligence by making something stupid like the comment "The only
reward: -3.25390625


## Stage 2: fine-tune the main model with RL


For this tutorial, we will optimize GPT2 to produce positive IMDB movie reviews using the reward model you trained above.

Unlike supervised fine-tuning, RL lets the model generate its own sentences at each training step. It then computes the reward for those specific sentences and, finally, updates the model to increase the probability of sentences with high reward.

Thus, each RLHF training step consists of three stages: __Rollout__, __Evaluation__ and __Update__
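
To make the three stages concrete, here is a toy sketch that uses plain REINFORCE (not PPO) on a two-option "policy". Everything in it is hypothetical and chosen for illustration: the canned `REVIEWS`, the `toy_reward` function and the `train` loop stand in for the language model, the reward model and the trainer.

```python
import math
import random

# Toy stand-ins: the "policy" picks one of two canned reviews
REVIEWS = ["This movie is great", "This movie is awful"]


def toy_reward(text: str) -> float:
    return 1.0 if text.endswith("great") else -1.0


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps: int = 50, lr: float = 0.5, seed: int = 0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]  # policy parameters: one logit per review
    for _ in range(steps):
        probs = softmax(logits)
        # Rollout: sample an output from the current policy
        action = rng.choices([0, 1], weights=probs)[0]
        # Evaluation: score the sampled output
        reward = toy_reward(REVIEWS[action])
        # Update: REINFORCE step; d log pi(action) / d logit_k = onehot_k - probs_k
        for k in range(len(logits)):
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * reward * grad_log_prob
    return softmax(logits)


final_probs = train()
print(final_probs)  # probability mass shifts towards the positive review
```

PPO replaces the update rule with a clipped objective and adds a value head and a KL penalty, but the rollout, evaluation and update structure stays the same.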

<div style="text-align: center">
<img src='https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/gpt2_bert_training.png' width='600'>
</div>

The update stage depends on the specific RL algorithm. We'll be using Proximal Policy Optimization, or [PPO](https://arxiv.org/abs/1707.06347), similarly to what was used for InstructGPT.
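
The core of PPO is its clipped surrogate objective from the linked paper: $\min(\rho A, \operatorname{clip}(\rho, 1-\epsilon, 1+\epsilon) A)$, where $\rho$ is the probability ratio between the new and old policy and $A$ is the advantage. The snippet below is a minimal per-sample sketch of that term; `clipped_surrogate` is a hypothetical name, and `clip_eps=0.2` is an assumed value used for illustration rather than a setting taken from this notebook.

```python
def clipped_surrogate(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO clipped objective for one sample (to be maximized)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the minimum makes the objective a pessimistic bound on the unclipped one
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: no extra gain from pushing the ratio above 1 + clip_eps
print(clipped_surrogate(ratio=1.5, advantage=1.0))
# Negative advantage: no extra gain from pushing the ratio below 1 - clip_eps
print(clipped_surrogate(ratio=0.5, advantage=-1.0))
```

The clipping removes the incentive to move the policy far from the one that generated the samples within a single update.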

Before we run those 3 stages, however, we need to create a dataset of "queries" - partial reviews in our case.

In [904]:
#!g1.1
# Note: this code is specific to IMDB; you will need to re-write it for other tasks
imdb_for_rlhf = imdb.filter(lambda row: len(row['text']) > 200, batched=False)
imdb_for_rlhf = imdb_for_rlhf.remove_columns(['label'])
sample_length = trl.core.LengthSampler(2, 8)  # use the first 2-8 tokens as query

def select_query_and_tokenize(sample):
    query_ids = main_tokenizer.encode(sample["text"])[: sample_length()]
    sample["query"] = main_tokenizer.decode(query_ids)  # query is the only required column
    sample["input_ids"] = query_ids  # to avoid re-tokenizing later
    return sample  # we do not need the rest - it will be generated by the model

imdb_for_rlhf = imdb_for_rlhf.map(select_query_and_tokenize, batched=False)
imdb_for_rlhf.set_format(type="torch")



Filter:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/24895 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1168 > 1024). Running this sequence through the model will result in indexing errors


Next, let's prepare your reward model to predict rewards for whatever reviews are generated. Note that we pass plaintext reviews because the main model uses a different tokenizer from the reward model.

In [905]:
#!g1.1
from typing import List


def compute_reward(texts: List[str]) -> torch.Tensor:
    inputs = reward_tokenizer(texts, truncation=True, padding=True, return_tensors='pt').to(device)
    with torch.no_grad():
        return reward_model(**inputs).logits[:, 0]

In [906]:
#!g1.1
compute_reward([imdb[45]['text'], imdb[16000]['text']])  # test on human-written reviews

tensor([-5.1328,  4.7695], device='cuda:0')

Finally, we move to RL training. In this tutorial, we'll train LoRA adapters and not the full model.

In [907]:
#!g1.1
import peft
peft_config = peft.LoraConfig(
    task_type=peft.TaskType.CAUSAL_LM, r=32, lora_alpha=32, lora_dropout=0.0, inference_mode=False
)

# reload main model as AutoModelForCausalLMWithValueHead - with an extra head needed for PPO
main_tokenizer = transformers.AutoTokenizer.from_pretrained("lvwerra/gpt2-imdb")
main_tokenizer.pad_token = main_tokenizer.eos_token

main_model = trl.AutoModelForCausalLMWithValueHead.from_pretrained("lvwerra/gpt2-imdb", device_map=device)
main_model = peft.get_peft_model(main_model, peft_config, adapter_name='default')
main_model.print_trainable_parameters()

trainable params: 1,179,648 || all params: 125,620,225 || trainable%: 0.9390589771670923
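
The printed parameter counts can be reproduced by hand. The arithmetic below is a sketch under stated assumptions: GPT-2 small has 12 blocks with hidden size 768 and 124,439,808 parameters, LoRA is applied only to the fused attention projection (768 to 2304 features) in each block, and the value head is a single linear layer from the hidden size to one output. These assumptions are not stated in the notebook, but they are consistent with the numbers it prints.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds two matrices per adapted layer: A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


hidden_size, num_blocks, rank = 768, 12, 32        # assumed GPT-2 small dimensions, r from LoraConfig
per_block = lora_param_count(hidden_size, 3 * hidden_size, rank)
trainable = num_blocks * per_block

base_params = 124_439_808                          # assumed size of GPT-2 small
value_head = hidden_size + 1                       # assumed: one linear layer, weights + bias
all_params = base_params + trainable + value_head

print(f"trainable: {trainable:,} | all: {all_params:,} | {100 * trainable / all_params:.4f}%")
```

Doubling `r` doubles the number of trainable parameters, since the count is linear in the rank.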


As before, TRL has a special type of trainer that minimizes a PPO-specific pseudo-loss. You can read more about this trainer [here](https://huggingface.co/docs/trl/main/en/ppo_trainer).

In [908]:
#!g1.1
training_args = trl.PPOConfig(
    model_name=main_model.config._name_or_path,
    gradient_accumulation_steps=1,
    learning_rate=1.41e-5,
    batch_size=64,
    ppo_epochs=4,                 # PPO performs this many updates per training batch
)

ppo_trainer = trl.PPOTrainer(
    training_args, model=main_model.model, tokenizer=main_tokenizer,
    dataset=imdb_for_rlhf, data_collator=lambda data: dict((key, [d[key] for d in data]) for key in data[0])
)  # note: we pass main_model.model because PPOTrainer checks for one of several supported model types ...
# ... main_model.model is a model with adapters, which is supported. main_model itself is a wrapper that is not supported

In [909]:
#!g1.1
from tqdm.auto import tqdm
max_steps = 50   # can be insufficient for some tasks - watch your learning curves
generation_kwargs = dict(
    min_length=-1, max_new_tokens=128, do_sample=True, top_k=0, top_p=1.0, pad_token_id=main_tokenizer.eos_token_id)
#                                  ^-- task-specific parameter!
with tqdm(enumerate(ppo_trainer.dataloader), total=max_steps) as progressbar:
  # note: ppo_trainer.dataloader is just a regular dataloader of queries, no RL-specific magic :)
  for epoch, batch in progressbar:
    if epoch >= max_steps:
        break

    # Rollout stage: generate continuations from batch queries using main_model
    response_tensors = ppo_trainer.generate(batch['input_ids'], **generation_kwargs)
    # ^-- list of tensors of token ids from main model tokenizer

    # de-tokenize responses to strings (since reward model uses a different tokenizer)
    batch["response"] = [main_tokenizer.decode(response.squeeze()) for response in response_tensors]
    # note: response_tensors already contain query tokens, so we don't need to add queries manually.
    # This may not be true for other tasks: check this manually by viewing batch["response"] and batch["query"]


    # Evaluation stage
    rewards = compute_reward(batch['response'])

    # Update stage
    stats = ppo_trainer.step(batch['input_ids'], response_tensors, list(rewards.split(1)))
    stats['rewards/mean'] = rewards.mean().item()

    print("-" * 30, 'STEP', epoch, '-' * 30)
    print(f'rewards/mean:\t{stats["rewards/mean"]:.9f}\t<---- average reward over this batch (higher=better, noisy)')
    print(f'ppo/returns/mean:\t{stats["ppo/returns/mean"]:.9f}\t<---- model-estimated average discounted reward')
    print(f'objective/kl:\t{stats["objective/kl"]:.9f}\t<---- how far we are from the original model (regularizer)')
    print()

    ppo_trainer.log_stats(stats, batch, list(rewards.split(1)))

  0%|          | 0/50 [00:00<?, ?it/s]

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


------------------------------ STEP 0 ------------------------------
rewards/mean:	-0.399527550	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.481661946	<---- model-estimated average discounted reward
objective/kl:	0.000000000	<---- how far we are from the original model (regularizer)

------------------------------ STEP 1 ------------------------------
rewards/mean:	-0.013451576	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.358317673	<---- model-estimated average discounted reward
objective/kl:	0.425588489	<---- how far we are from the original model (regularizer)

------------------------------ STEP 2 ------------------------------
rewards/mean:	0.456541300	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.217689544	<---- model-estimated average discounted reward
objective/kl:	1.109936953	<---- how far we are from the original model (regularizer)

------------------------------ STEP 3 

In [910]:
#!g1.1
prompts = ["It was", "This movie is", "He forced me", "While watching I felt", 'I had thoughts while watching']

result = []
with torch.no_grad():
    for prompt in prompts:
        print(f'prompt: {prompt}')
        # sample 5 continuations per prompt
        inputs = main_tokenizer([prompt] * 5, return_tensors='pt').to(device)
        for candidate in main_model.model.generate(**inputs, max_new_tokens=50, do_sample=True):
            generate_text = main_tokenizer.decode(candidate.flatten().cpu().numpy().tolist())
            # score the decoded text with the reward model (it uses a different tokenizer)
            reward_inputs = reward_tokenizer(generate_text, truncation=True, padding=True, return_tensors='pt').to(device)
            reward = reward_model(**reward_inputs).logits[:, 0]

            print("Sample:", generate_text)
            print("reward:", reward.item())

            result.append([generate_text, reward.item()])
        print('----------------------------------------')

# sort by reward, ascending: result[0] is the lowest-scoring sample, result[-1] the highest
result = sorted(result, key=lambda x: x[1])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


prompt: It was
Sample: It was the first film she had ever made, and it was amazing. I enjoyed it for years and the movie is amazing. I love it!<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>
reward: 4.53125


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample: It was a masterpiece. "The Shining" is one of the best cinema films of the 20th century.<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>
reward: 3.236328125
Sample: It was my first film in England, I was proud for you.<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>
reward: 2.306640625
Sample: It was a "I thought this was a great film, its a great movie and a great movie, but I've on

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample: This movie is very realistic and not really that good. This is a real heartwarming story. The movie is a real heartwarming movie.<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>
reward: 3.3203125
Sample: This movie is a great show, and the show has had so much success. Thank you all. It really was amazing.<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>
reward: 4.6484375
Sample: This movie is very entertaining and hilarious with its unique style and great musical score. It's classic, you can enjoy it.<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>
reward: 4.5546875
Sample: This movie is well-to-watch. I highly recommend it.<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftex

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample: He forced me to see other stories of men like this one. I love the story of the famous Hollywood actor. This is a perfect film.<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>
reward: 4.296875
Sample: He forced me to see this. You don't want to look at the beautiful scenery when you are in there, you don't think of the great acting that played in the movie. I enjoyed this movie.<|endoftext|><|endoftext|>
reward: 4.56640625
Sample: He forced me into it with the first film. "Welcome to my place!" He showed an amazing smile, it made our heart begin to grow. The film has an incredible story. <br /><br /><|endoftext|>
reward: 4.703125
Sample: He forced me to believe that what happened at the trial is one of the most shocking things I have seen. As a film I find this film incredibly beautiful.<|endoftext|><|endoftext|><|endoftext|><|endoft

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample: While watching I felt a little more free than I expected. The setting was perfect and the photography was excellent. I highly recommend this film.<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>
reward: 3.97265625
Sample: While watching I felt like something great, almost more importantly it was. There was a lot of fun, not only and enjoyable, but also a new appreciation for American movies. I know there is a lot of movies for the adults and especially a whole lot of fun.
reward: 4.4765625
Sample: While watching I felt that for a first time, this movie was terrific. I watched it more than once. I appreciate the whole scene and a nice note of humor in it. What a joy for a movie.<|endoftext|><|e

In [911]:
#!g1.1
print("Sample: ", result[-1][0])
print("reward: ", result[-1][1])

Sample:  I had thoughts while watching. The ending makes the movie special when I think of it. The book is great and the movie and movie scenes are just incredible. And the show is like a great family!<|endoftext|>
reward:  4.828125


In [912]:
#!g1.1
print("Sample: ", result[0][0])
print("reward: ", result[0][1])

Sample:  It was my first film in England, I was proud for you.<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>
reward:  2.306640625


All the samples the model generated are now positive reviews.

## Main assignment - <u>actually</u> train the model (8 points)


Your main task for this week is to use the RLHF pipeline to train a model for a reward of your choice. Here's what you can choose from:

__A. Toxicity fine-tuning:__ train the model to be less (or more!) toxic. For this task, you may use the data from [jigsaw toxic comments](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) and [lmsys/toxic-chat](https://huggingface.co/datasets/lmsys/toxic-chat), or any other source. Alternatively, you may use toxicity scores from [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1).


__B. Actual human feedback:__ use one of the existing datasets with pairwise human feedback to align your language model. You may use [Anthropic's hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf), the [OpenAssistant dataset](https://huggingface.co/datasets/OpenAssistant/oasst1) or any other data you see fit. You may also turn the tables and train the model to [minimize](https://habrastorage.org/getpro/geektimes/post_images/ac7/2ad/827/ac72ad82767d4132164a4b6b76196c42.jpg) human preferences, as long as your model does not degrade to gibberish.

__C. Controlled generation:__ Instead of training a reward model from human feedback, you may define the reward function as the text length (longer or shorter) or the number of times the model uses specific words (e.g. "sorry", "apologize"). If you choose specific words, make sure the model generates them at least sometimes, otherwise the reward is always zero and there is nothing to reinforce.
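As a rough illustration of option C, the sketch below defines two model-free rewards in plain Python. The function names, the `TARGET_WORDS` tuple and the `max_words` cap are hypothetical choices made for this example, not part of the assignment.

```python
import re

# Hypothetical target words; pick whatever your task calls for.
TARGET_WORDS = ("sorry", "apologize")


def keyword_reward(text: str, target_words=TARGET_WORDS) -> float:
    """Count whole-word, case-insensitive occurrences of the target words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return float(sum(tokens.count(word) for word in target_words))


def length_reward(text: str, max_words: int = 50) -> float:
    """Reward longer texts, capped at `max_words` so the reward stays in [0, 1]."""
    return min(len(text.split()), max_words) / max_words


print(keyword_reward("Sorry, I am so sorry. I apologize!"))
print(length_reward("a short review"))
```

To plug such a function into a PPO loop like the one in the tutorial, you would presumably apply it to each decoded response and wrap the resulting list in a `torch.tensor`, so that the rest of the loop sees the same kind of object that `compute_reward` returns.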

__Alternatively,__ you may choose a different task. However, unless your task is very similar to one of the above, there is a chance that it will be **significantly** harder to solve, requiring orders of magnitude more compute and tuning. If you are in doubt, please ask the course staff. If they are AFK (again >.<), please prefer one of the recommended tasks.


#### General tips & tricks


Things to look out for:
- during the PPO stage, the reward model should be in eval mode (dropout disabled)
- make sure max_length and max_new_tokens are enough for your chosen dataset - at least most of the time
- when in doubt, view the data manually or inspect how the model performs on a few samples


We highly recommend that you manually check the performance after each sub-stage:
1. when you have assembled the pairwise dataset, inspect a couple of items from *your* dataset class and detokenize them. Make sure that you-the-human understand why one sample was accepted and the other rejected, at least most of the time. This also lets you spot tokenization/truncation errors.
2. after you have trained a reward model, measure how accurate this model is in isolation. If your reward model is poor, any subsequent RLHF will also fail.
3. once you've trained the main model with RL, ask it to generate examples and explore how well it does. If it produces an obviously bad output, check whether the reward model assigns a high reward to that output. If yes, the reward model is the culprit; if no, it's a question of better/longer PPO training.

__It is also a good idea to periodically print samples during training.__
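One lightweight way to do this is a small helper that formats a few query/response/reward triples every few steps. This is only a sketch: `format_samples` and its `every` and `limit` parameters are made-up names for illustration.

```python
def format_samples(step, queries, responses, rewards, every=5, limit=2):
    """Return printable lines for a few samples every `every` steps, else an empty list."""
    if step % every != 0:
        return []
    lines = [f"--- samples at step {step} ---"]
    for query, response, reward in list(zip(queries, responses, rewards))[:limit]:
        lines.append(f"[reward {reward:+.3f}] {query!r} -> {response!r}")
    return lines


# Toy call with hand-written values, just to show the output format.
for line in format_samples(0, ["It was"], ["It was great!"], [4.5]):
    print(line)
```

Inside a training loop you would call it with the decoded queries, the decoded responses and the rewards converted to plain floats.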

__When stuck, simplify the problem.__ If you've spent several hours enchanting the reward model but it still won't budge, try switching to a simpler subtask. For instance, if you're training on hh-rlhf, try limiting the dataset to the shortest 10% of sequences - they are typically easier to learn.
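A minimal sketch of that simplification, assuming each pair is a dict holding tokenized `input_ids_chosen` and `input_ids_rejected` lists (the key names are an assumption borrowed from the pairwise format TRL's reward trainer consumes; `shortest_fraction` is a hypothetical helper):

```python
def shortest_fraction(pairs, fraction=0.1):
    """Keep the `fraction` of pairs with the smallest combined token count."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(
        pairs,
        key=lambda p: len(p["input_ids_chosen"]) + len(p["input_ids_rejected"]),
    )
    keep = max(1, int(len(ranked) * fraction))  # always keep at least one pair
    return ranked[:keep]


# Toy data: 20 pairs whose sequences have lengths 1..20.
toy_pairs = [
    {"input_ids_chosen": [0] * n, "input_ids_rejected": [0] * n}
    for n in range(1, 21)
]
subset = shortest_fraction(toy_pairs, fraction=0.1)
print(len(subset), [len(p["input_ids_chosen"]) for p in subset])
```

Sorting by combined length, rather than filtering chosen and rejected separately, keeps the subset size predictable.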


## Assignment stages (and grading)

Regardless of the specific task you chose, your solution needs to contain several parts that will be graded separately.


#### Stage 1: reward model (4 points)

Construct a dataset for training the reward model on your problem. Then, train a reward model on that dataset and evaluate how well your model can predict preferences on a hold-out (test) subset of your data.

Please make sure that the part of your notebook where you evaluate the reward model is clearly visible and reasonably easy to read. And for all that is holy, do not call it IMDB unless it actually **is** IMDB movie review data :)
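To make "predict preferences" concrete, here is a plain-Python sketch of the two quantities usually tracked for a pairwise reward model: the pairwise logistic loss, -log sigmoid(r_chosen - r_rejected), and the pairwise accuracy. To the best of our understanding this is the loss form TRL's `RewardTrainer` optimizes, but treat that as an assumption and check the library source for your version. The function names are illustrative.

```python
import math


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), written in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    # softplus(-margin) == -log(sigmoid(margin))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def pairwise_accuracy(chosen_rewards, rejected_rewards) -> float:
    """Fraction of pairs in which the chosen text gets the strictly higher reward."""
    correct = sum(c > r for c, r in zip(chosen_rewards, rejected_rewards))
    return correct / len(chosen_rewards)


print(pairwise_loss(0.0, 0.0))  # equals ln(2): the model cannot tell the pair apart
print(pairwise_accuracy([1.0, 0.0, 3.0], [0.0, 1.0, 2.0]))
```

A loss near ln 2 ≈ 0.693 at the start of training is therefore expected, and an accuracy near 0.5 means the reward model is guessing.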

__Not all tasks require a reward model for later PPO fine-tuning.__ For instance, there's no reason to train a reward model if your reward equals sentence length. Likewise, toxicity reward can be estimated with a pre-trained toxicity classifier. __If your task does not require training a reward model, please train an unrelated model on [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) as though you were solving assignment version B.__ This is for grading purposes only, you won't use this model for stage 2.


#### Stage 2: RL fine-tuning (4 points)

Once the reward model is ready - or you can compute rewards without a model - it is time to maximize that reward with PPO. Optionally, you may replace PPO with another RL algorithm (or an unlikelihood learning scheme), but only if you're feeling adventurous.


First, you need to choose a language model to be fine-tuned. You may choose any model, but make sure that your model **can** generate the data in your format. For instance, [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) is a general purpose LM and may (or may not) need prompt engineering to generate chat assistant responses. For that reason, it is best if you **do not use `"lvwerra/gpt2-imdb"` unless you're generating only movie reviews**.



There are two "difficulty modes" for this task:
- **Easy mode:** use [gpt2-large](https://huggingface.co/gpt2-large) or [opt-1.3b](https://huggingface.co/facebook/opt-1.3b) with minimal code changes.
- **Hard mode:** use a larger (e.g. 7B) model in combination with `load_in_4bit` and LoRA, the same way we did last week. Some reasonable model choices are [LLaMA-7B](https://huggingface.co/Enoch/llama-7b-hf), [Falcon-7b](https://huggingface.co/tiiuae/falcon-7b), [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) for general-purpose LM or [guanaco-7b](https://huggingface.co/timdettmers/guanaco-7b), [vicuna-7b](https://huggingface.co/lmsys/vicuna-7b-v1.5) for chat-based tasks, though there are many more (see [leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)). In the hard mode, you will need to modify the training arguments to enable 4-bit fine-tuning. Furthermore, your experiments will take somewhat longer to complete. On the plus side, your model will produce significantly better results.

__High reward is not enough!__ RL algorithms are famous for [cheating their reward functions](https://openai.com/research/faulty-reward-functions). To ensure that your model is actually doing what you want it to do, you will need some additional evaluation. To get the full grade, provide at least 20 side-by-side examples of your fine-tuned model vs original model predictions and a short summary.

Alternatively, you may provide 5 examples and some extrinsic evaluation metric over many examples. For instance, you may use a different pre-trained toxicity score for option A. When dealing with human preferences, you may choose to [enlist actual humans](https://toloka.ai/) or [ask GPT4/Claude](https://arxiv.org/pdf/2304.03277.pdf) to compare your model's predictions. For task C, when optimizing for simple rewards like sentence lengths, it is enough to compare histograms of rewards (e.g. average lengths).
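A small sketch of such a summary, assuming you already have one score per prompt for the original model and for the fine-tuned model (for example response lengths for task C, or scores from an independent classifier for task A). The helper names and the toy numbers are made up for illustration.

```python
from statistics import mean


def win_rate(baseline_scores, tuned_scores):
    """Fraction of prompts on which the tuned model scores strictly higher."""
    if len(baseline_scores) != len(tuned_scores) or not baseline_scores:
        raise ValueError("need two non-empty score lists of equal length")
    wins = sum(t > b for b, t in zip(baseline_scores, tuned_scores))
    return wins / len(baseline_scores)


def summarize(baseline_scores, tuned_scores):
    """Collect the headline numbers for a side-by-side comparison."""
    return {
        "baseline_mean": mean(baseline_scores),
        "tuned_mean": mean(tuned_scores),
        "win_rate": win_rate(baseline_scores, tuned_scores),
    }


# Toy numbers for illustration only, e.g. response lengths in tokens.
baseline = [12, 30, 18, 25]
tuned = [40, 28, 35, 50]
print(summarize(baseline, tuned))
```

Both models should be scored on the same prompts, and ideally with a metric other than the reward you trained on, since a model that cheats its reward will look good on that reward by construction.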












In [916]:
#!g1.1
import torch
import transformers

import numpy as np
from tqdm.auto import tqdm

import trl
from trl import RewardTrainer, RewardConfig

import datasets


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')



[2023-12-03 05:12:06,737] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)




In [917]:
#!g1.1
dataset_hh_rlhf = datasets.load_dataset("Anthropic/hh-rlhf")


In [918]:
#!g1.1
class hh_rlhf_PairwiseDataset(torch.utils.data.Dataset):
    def __init__(self, hh_rlhf, tokenizer):
        super().__init__()
        self.tokenizer = tokenizer
        self.chosen_texts = hh_rlhf['chosen']
        self.rejected_texts = hh_rlhf['rejected']
        print(f"Found {len(self.chosen_texts)} chosen and {len(self.rejected_texts)} rejected texts, {len(self)} pairs")

    def __len__(self):
        return len(self.chosen_texts)

    def __getitem__(self, index: int):
        chosen = self.tokenizer(self.chosen_texts[index], truncation=True)
        rejected = self.tokenizer(self.rejected_texts[index], truncation=True)

        return dict(input_ids_chosen=chosen['input_ids'], attention_mask_chosen=chosen['attention_mask'],
                    input_ids_rejected=rejected['input_ids'], attention_mask_rejected=rejected['attention_mask'])

In [920]:
#!g1.1
import peft
from peft import LoraConfig, TaskType
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=32, 
    lora_dropout=0.0, 
    inference_mode=False,
    target_modules = ['q_lin', 'k_lin']
)


name_model = 'distilbert-base-cased'
reward_model = transformers.AutoModelForSequenceClassification.from_pretrained(name_model, device_map=device)
reward_tokenizer = transformers.AutoTokenizer.from_pretrained(name_model)

reward_model_with_lora = peft.get_peft_model(reward_model, peft_config, adapter_name='default')

reward_model_with_lora.print_trainable_parameters()

hh_rlhf_train_for_reward_model = hh_rlhf_PairwiseDataset(dataset_hh_rlhf['train'], reward_tokenizer)
hh_rlhf_test_for_reward_model = hh_rlhf_PairwiseDataset(dataset_hh_rlhf['test'], reward_tokenizer)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 1,331,716 || all params: 66,522,628 || trainable%: 2.001899263510756
Found 160800 chosen and 160800 rejected texts, 160800 pairs
Found 8552 chosen and 8552 rejected texts, 8552 pairs


In [872]:
#!g1.1
# name_model = 'distilbert-base-cased'
# reward_model = transformers.AutoModelForSequenceClassification.from_pretrained(name_model, device_map=device)

# for name, param in reward_model.named_parameters():
#     if param.requires_grad:
#         print(name, param.shape)

In [None]:
#!g1.1
# a = {"input_ids": hh_rlhf_train_for_reward_model[0]['input_ids_chosen'].to(device), "attention_mask": hh_rlhf_train_for_reward_model[0]['attention_mask_chosen'].to(device)}
# reward_model_with_lora(**a)

In [921]:
#!g1.1
len_train_chosen = []
len_train_rejected = []
for i in tqdm(hh_rlhf_train_for_reward_model):
    len_train_chosen.append(len(i['input_ids_chosen']))
    len_train_rejected.append(len(i['input_ids_rejected']))

  0%|          | 0/160800 [00:00<?, ?it/s]

In [923]:
#!g1.1
mask_train_chosen = (np.array(len_train_chosen) < 70)
mask_train_rejected = (np.array(len_train_rejected) < 70)
mask_train_chosen.mean(), mask_train_rejected.mean()

(0.1651679104477612, 0.18822139303482588)

In [924]:
#!g1.1
new_hh_rlhf_train = []
for i in tqdm(range(len(hh_rlhf_train_for_reward_model))):
    if mask_train_chosen[i] and mask_train_rejected[i]:
        new_hh_rlhf_train.append(hh_rlhf_train_for_reward_model[i])

  0%|          | 0/160800 [00:00<?, ?it/s]

In [925]:
#!g1.1
len_test_chosen = []
len_test_rejected = []
for i in tqdm(hh_rlhf_test_for_reward_model):
    len_test_chosen.append(len(i['input_ids_chosen']))
    len_test_rejected.append(len(i['input_ids_rejected']))

  0%|          | 0/8552 [00:00<?, ?it/s]

In [926]:
#!g1.1
mask_test_chosen = (np.array(len_test_chosen) < 70)
mask_test_rejected = (np.array(len_test_rejected) < 70)
mask_test_chosen.mean(), mask_test_rejected.mean()

(0.16627689429373246, 0.186155285313377)

In [927]:
#!g1.1
new_hh_rlhf_test = []
for i in tqdm(range(len(hh_rlhf_test_for_reward_model))):
    if mask_test_chosen[i] and mask_test_rejected[i]:
        new_hh_rlhf_test.append(hh_rlhf_test_for_reward_model[i])

  0%|          | 0/8552 [00:00<?, ?it/s]

In [928]:
#!g1.1
import warnings
warnings.filterwarnings("ignore")


training_args = trl.RewardConfig(
    output_dir="reward_model",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=2,
    learning_rate=1.41e-4,
    max_steps=2_500,
    logging_steps=100,
    max_length=256,
    warmup_steps=200,
    gradient_checkpointing=False,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    fp16=True,
)

trainer = trl.RewardTrainer(
    model=reward_model_with_lora,
    args=training_args,
    tokenizer=reward_tokenizer,
    train_dataset=new_hh_rlhf_train,
    eval_dataset=new_hh_rlhf_test,
    peft_config=None,
)

In [929]:
#!g1.1
def evall(reward_model):
    """Pairwise accuracy on the eval set: fraction of pairs where chosen outscores rejected."""
    reward_model.eval()  # disable dropout for evaluation
    hh_rlhf_frequency_test = []
    with torch.no_grad():
        for batch in tqdm(trainer.get_eval_dataloader()):
            chosen = {"input_ids": batch['input_ids_chosen'], "attention_mask": batch['attention_mask_chosen']}
            rejected = {"input_ids": batch['input_ids_rejected'], "attention_mask": batch['attention_mask_rejected']}

            # logit 0 is used as the scalar reward
            reward_chosen = reward_model(**chosen).logits
            reward_rejected = reward_model(**rejected).logits

            # 1 if the chosen text got the higher reward, else 0
            hh_rlhf_frequency_test.extend((reward_chosen[:,0] > reward_rejected[:,0]).int().cpu().numpy())

    return np.mean(hh_rlhf_frequency_test)

In [930]:
#!g1.1
for i in tqdm(range(5)):
    reward_model_with_lora.train()
    trainer.train()
    print(evall(reward_model_with_lora))

  0%|          | 0/5 [00:00<?, ?it/s]

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss
100,0.6942
200,0.6912
300,0.681
400,0.6677
500,0.6617
600,0.6565
700,0.6443
800,0.6412
900,0.639
1000,0.6311


  0%|          | 0/32 [00:00<?, ?it/s]

0.6017699115044248


Step,Training Loss
100,0.5882
200,0.5899
300,0.5905
400,0.5859
500,0.5852
600,0.5891
700,0.5736
800,0.5703
900,0.574
1000,0.561


  0%|          | 0/32 [00:00<?, ?it/s]

0.6017699115044248


Step,Training Loss
100,0.5252
200,0.5284
300,0.5404
400,0.53
500,0.5395
600,0.531
700,0.5224
800,0.5144
900,0.5069
1000,0.5069


  0%|          | 0/32 [00:00<?, ?it/s]

0.6096361848574238


Step,Training Loss
100,0.474
200,0.4753
300,0.4895
400,0.4677
500,0.4844
600,0.4874
700,0.45
800,0.4606
900,0.4695
1000,0.4482


  0%|          | 0/32 [00:00<?, ?it/s]

0.599803343166175


Step,Training Loss
100,0.4151
200,0.4221
300,0.4335
400,0.4179
500,0.4272
600,0.426
700,0.4124
800,0.4076
900,0.4107
1000,0.3949


  0%|          | 0/32 [00:00<?, ?it/s]

0.6076696165191741


In [931]:
#!g1.1
main_tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2-large")


In [966]:
#!g1.1
with torch.no_grad():
    tokens_chosen = {'input_ids': torch.tensor([hh_rlhf_test_for_reward_model[0]['input_ids_chosen']]).to(device), 
                     'attention_mask': torch.tensor([hh_rlhf_test_for_reward_model[0]['attention_mask_chosen']]).to(device)}
   
    tokens_rejected = {'input_ids': torch.tensor([hh_rlhf_test_for_reward_model[0]['input_ids_rejected']]).to(device),
                       'attention_mask': torch.tensor([hh_rlhf_test_for_reward_model[0]['attention_mask_rejected']]).to(device)}

    reward_chosen = reward_model_with_lora(**tokens_chosen).logits[0, 0]
    reward_rejected = reward_model_with_lora(**tokens_rejected).logits[0, 0]

In [969]:
#!g1.1
reward_chosen, reward_rejected

(tensor(-1.0020, device='cuda:0'), tensor(-4.8281, device='cuda:0'))

In [970]:
#!g1.1
# def preprocessing(row):
#     return len(main_tokenizer.encode(row['chosen'])) < 70

In [971]:
#!g1.1
# hh_rlhf_train = hh_rlhf['train'].filter(lambda row: preprocessing(row), batched=False)

def select_query_and_tokenize(sample):
    # query = the chosen dialogue without its last line (part of the final assistant reply)
    query_ids = main_tokenizer.encode('\n'.join(sample["chosen"].split('\n')[:-1]))
    sample["query"] = main_tokenizer.decode(query_ids)  # query is the only required column
    sample["input_ids"] = query_ids  # to avoid re-tokenizing later
    return sample

train = dataset_hh_rlhf['train'].map(select_query_and_tokenize, batched=False)
train.set_format(type="torch")


Map:   0%|          | 0/160800 [00:00<?, ? examples/s]
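Note that dropping only the last *line* of the chosen dialogue leaves the beginning of the final assistant reply inside the query whenever that reply spans several lines. A possible alternative, sketched below in plain Python, splits at the last `"\n\nAssistant:"` marker instead. The marker string reflects how hh-rlhf dialogues are laid out; `split_prompt_and_response` is a hypothetical helper, not something the notebook defines.

```python
ASSISTANT_MARKER = "\n\nAssistant:"


def split_prompt_and_response(dialogue: str):
    """Split a dialogue into (prompt ending with the marker, final assistant reply)."""
    head, sep, tail = dialogue.rpartition(ASSISTANT_MARKER)
    if not sep:
        raise ValueError("no assistant turn found in dialogue")
    return head + sep, tail.strip()


# Hand-written toy dialogue in the same layout, with a two-line final reply.
toy = (
    "\n\nHuman: Hi\n\nAssistant: Hello!"
    "\n\nHuman: How are you?\n\nAssistant: Fine,\nthanks."
)
prompt, response = split_prompt_and_response(toy)
print(repr(prompt))
print(repr(response))

# For comparison: the line-based split keeps "Fine," inside the query.
print(repr("\n".join(toy.split("\n")[:-1])))
```

With this split the policy always starts generating right after `Assistant:`, so the whole final reply is produced (and scored) by the model rather than partly copied from the dataset.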

In [None]:
#!g1.1
# number_of_tokens_per_response = []
# for i in range(len(hh_rlhf_train)):
#     number_of_tokens_per_response.append(len(main_tokenizer.encode(hh_rlhf_train[i]['chosen'].split('\n')[-1])))

# np.array(number_of_tokens_per_response).mean(), min(number_of_tokens_per_response), max(number_of_tokens_per_response)

In [None]:
#!g1.1
# hh_rlhf_test = hh_rlhf['test'].filter(lambda row: preprocessing(row), batched=False)


# def select_query_and_tokenize(sample):
#     query_ids = main_tokenizer.encode('\n'.join(sample["chosen"].split('\n')[:-1]))
#     sample["query"] = main_tokenizer.decode(query_ids)  # query is the only required column
#     sample["input_ids"] = query_ids  # to avoid re-tokenizing later
#     return sample  # we do not need the rest - it will be generated by the model


# # hh_rlhf_test = hh_rlhf_test.remove_columns(['label'])

# hh_rlhf_test = hh_rlhf_test.map(select_query_and_tokenize, batched=False)
# hh_rlhf_test.set_format(type="torch")

In [972]:
#!g1.1
def compute_reward(texts: list[str]) -> torch.Tensor:
    inputs = reward_tokenizer(texts, truncation=True, padding=True, return_tensors='pt').to(device)
    with torch.no_grad():
        return reward_model_with_lora(**inputs).logits[:, 0]

In [978]:
#!g1.1
compute_reward([train[10]['chosen'], train[10]['rejected']])  # sanity check on a human-labelled chosen/rejected pair

tensor([-1.8711, -2.8164], device='cuda:0')

In [981]:
#!g1.1
import peft
peft_config = peft.LoraConfig(
    task_type=peft.TaskType.CAUSAL_LM, r=32, lora_alpha=32, lora_dropout=0.0, inference_mode=False
)

main_tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2-large")
main_tokenizer.pad_token = main_tokenizer.eos_token

main_model = trl.AutoModelForCausalLMWithValueHead.from_pretrained("gpt2-large", device_map=device)
main_model = peft.get_peft_model(main_model, peft_config, adapter_name='default')
main_model.print_trainable_parameters()


trainable params: 5,898,240 || all params: 779,929,601 || trainable%: 0.7562528710844506


In [982]:
#!g1.1
training_args = trl.PPOConfig(
    model_name=main_model.config._name_or_path,
    gradient_accumulation_steps=1,
    learning_rate=1.41e-5,
    batch_size=64,
    ppo_epochs=4,
)

ppo_trainer = trl.PPOTrainer(
    training_args,
    model=main_model.model,
    tokenizer=main_tokenizer,
    dataset=train,
    data_collator=lambda data: dict((key, [d[key] for d in data]) for key in data[0])
)

In [984]:
#!g1.1
max_steps = 10
generation_kwargs = dict(
    min_length=-1, 
    max_new_tokens=50, 
    do_sample=True, 
    top_k=0, 
    top_p=1.0, 
    pad_token_id=main_tokenizer.eos_token_id
)

with tqdm(enumerate(ppo_trainer.dataloader), total=max_steps) as progressbar:
    reward_model_with_lora.eval()
    for epoch, batch in progressbar:
        if epoch >= max_steps:
            break

        response_tensors = ppo_trainer.generate(batch['input_ids'], **generation_kwargs)
        batch["response"] = [main_tokenizer.decode(response.squeeze()) for response in response_tensors]
        rewards = compute_reward(batch['response'])

        stats = ppo_trainer.step(batch['input_ids'], response_tensors, list(rewards.split(1)))
        stats['rewards/mean'] = rewards.mean().item()

        print("-" * 30, 'STEP', epoch, '-' * 30)
        print(f'rewards/mean:\t{stats["rewards/mean"]:.9f}\t<---- average reward over this batch (higher=better, noisy)')
        print(f'ppo/returns/mean:\t{stats["ppo/returns/mean"]:.9f}\t<---- model-estimated average discounted reward')
        print(f'objective/kl:\t{stats["objective/kl"]:.9f}\t<---- how far we are from the original model (regularizer)')
        print()

        ppo_trainer.log_stats(stats, batch, list(rewards.split(1)))

  0%|          | 0/10 [00:00<?, ?it/s]

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
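The message above does not name the failing operation, so the cause is not established here. One common trigger for device-side asserts in this kind of setup (an assumption, not something the log confirms) is an index running past an embedding table, for instance when a query plus `max_new_tokens` exceeds the 1024 positions GPT-2 models support. The hh-rlhf queries in `train` were not length-filtered before PPO, so a guard along these lines may be worth trying. The helper names are hypothetical.

```python
GPT2_MAX_POSITIONS = 1024  # context window of the GPT-2 family


def fits_context(num_query_tokens, max_new_tokens, max_positions=GPT2_MAX_POSITIONS):
    """True if the query plus the generated continuation fits into the context window."""
    return num_query_tokens + max_new_tokens <= max_positions


def keep_short_queries(samples, max_new_tokens=50, max_positions=GPT2_MAX_POSITIONS):
    """Drop samples whose tokenized query leaves no room for generation."""
    return [
        sample
        for sample in samples
        if fits_context(len(sample["input_ids"]), max_new_tokens, max_positions)
    ]


# Toy samples: only the token counts matter for this check.
toy_samples = [{"input_ids": [0] * n} for n in (10, 974, 975, 2000)]
print([len(s["input_ids"]) for s in keep_short_queries(toy_samples)])
```

With a `datasets` object, the same predicate could be passed to `Dataset.filter` before building the PPO trainer. Re-running the failing cell on CPU, or with `CUDA_LAUNCH_BLOCKING=1` as the message suggests, should reveal whether this guess is right.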

