<font color=red>**Danger zone:**</font> you'll be fine-tuning a model to generate positive, negative or even toxic reviews. We'll be doing this for fun, but this is also the technique for [review bombing](https://en.wikipedia.org/wiki/Review_bomb), bot farms on social media and other less than dignified stuff. It is ultimately your decision how you apply this knowledge, but before you choose, ask yourself: is this why you chose to learn ML?


# LLM Alignment with Reinforcement Learning from Human Feedback (RLHF)

_based on the [original notebook](https://github.com/antndlcrx/oxford-llms-workshop/blob/main/materials/seminars/day_3/8_LLMs%20alignment%20with%20RLHF.ipynb) by Ilya Boytsov for the Oxford LLMs workshop_



In this session, you're gonna fine-tune a language model with reinforcement learning to make it generate good (or bad) reviews.

To perform RL-based fine-tuning, we'll use a new (in this course) library called [Transformer Reinforcement Learning (TRL)](https://huggingface.co/docs/trl). TRL implements the main reinforcement learning components of RLHF: reward modeling and fine-tuning with PPO.

![img](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/TRL-readme.png)

In [None]:
%pip install -q trl==0.7.4 transformers==4.33.1 datasets==2.14.4 peft==0.5.0


### Tutorial: align the model to generate positive movie reviews

To see how TRL works, we'll use it to align GPT-2 on the IMDB dataset to generate positive (or negative) movie reviews. In fact, __it's your choice whether you want positive or negative reviews.__

But before you choose, let's take a look at the baseline model: a GPT-2 fine-tuned on generating arbitrary movie reviews.

In [None]:
import torch
import transformers
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('device:', device)
main_tokenizer = transformers.AutoTokenizer.from_pretrained("lvwerra/gpt2-imdb")
main_model = transformers.AutoModelForCausalLM.from_pretrained("lvwerra/gpt2-imdb", device_map=device)

device: cuda


In [None]:
inputs = main_tokenizer("The class was", return_tensors='pt').to(device)
generated_ids = main_model.generate(**inputs, max_new_tokens=50, do_sample=True)
print("\nGenerated text:", main_tokenizer.decode(generated_ids.flatten().cpu().numpy().tolist()))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Generated text: The class was so bad that it really looked like a bad movie, like it's supposed to have been made a couple of years out in the South of England, in a movie I was really into. I've still got the cast I'd be at the top


If you run this cell a couple of times, you'll see that the model generates positive, negative and neutral reviews in some proportion. What we're gonna do next is teach the model to generate more positive (or negative) reviews.

Similarly to InstructGPT, we're gonna do that in 2 stages:
- **train a reward model** to assign higher values to positive (or negative) reviews
- fine-tune the language model to **maximize that reward using [proximal policy optimization](https://openai.com/research/openai-baselines-ppo)**



## Stage 1: train a reward model

First, we'll train a BERT-like model as our reward model. We'll generate synthetic pairwise rankings to emulate human rankings.
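To make "synthetic pairwise rankings" concrete, here is a minimal, dependency-free sketch. It assumes each row is a dict with `text` and `label` keys (as in IMDB); the helper name `make_synthetic_pairs` and the toy rows are hypothetical and only illustrate the idea that any review with the accepted label is "preferred" over any review with a different label.

```python
import random
from typing import Dict, List


def make_synthetic_pairs(rows: List[dict], accepted_label: int,
                         num_pairs: int, seed: int = 0) -> List[Dict[str, str]]:
    """Sample (chosen, rejected) text pairs from labeled rows to emulate human rankings."""
    rng = random.Random(seed)
    chosen = [row["text"] for row in rows if row["label"] == accepted_label]
    rejected = [row["text"] for row in rows if row["label"] != accepted_label]
    if not chosen or not rejected:
        raise ValueError("need at least one text on each side of the ranking")
    # every sampled pair says: "a labeler would prefer `chosen` over `rejected`"
    return [{"chosen": rng.choice(chosen), "rejected": rng.choice(rejected)}
            for _ in range(num_pairs)]


toy_rows = [
    {"text": "Loved it.", "label": 1},
    {"text": "A masterpiece.", "label": 1},
    {"text": "Terrible acting.", "label": 0},
]
for pair in make_synthetic_pairs(toy_rows, accepted_label=1, num_pairs=3):
    print(pair)
```

With real human feedback the pairs come from annotators instead of labels, but the downstream training format stays the same.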

__Q:__ why do I need a reward model? Can I just use a pre-trained sentiment classifier? <br> __A:__ Yes, you can, but that only works for movie reviews. This tutorial teaches you how to do RLHF for any kind of objective.


__If you actually want to maximize sentiment (or other "label") instead of human preferences, train reward model as a classifier! (see week5)__


In [None]:
# We'll be fine-tuning a small BERT-like model for now. Please try other models for the main assignment.
reward_model = transformers.AutoModelForSequenceClassification.from_pretrained("distilbert-base-cased", device_map=device)
reward_tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert-base-cased")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


__Note that__ the reward model has a separate tokenizer, different from the main model. They don't need to be the same for RLHF fine-tuning.

In [None]:
# To train a reward model, you need a dataset (or generator) of positive-negative pairs.
# Each training sample should be a dict with 4 keys:
#  - input_ids_chosen, attention_mask_chosen = tokenizer("A sentence that human labeler likes more")
#  - input_ids_rejected, attention_mask_rejected = tokenizer("A sentence that human labeler likes less")

import torch
import datasets
import numpy as np

# class IMDBPairwiseDataset(torch.utils.data.Dataset):
#     """ A dataset of pairs of chosen and rejected texts in TRL reward training format """
#     def __init__(self, imdb, tokenizer, accepted_label: int):
#         super().__init__()
#         self.tokenizer = tokenizer
#         self.chosen_texts = [row['text'] for row in imdb if row['label'] == accepted_label]
#         self.rejected_texts = [row['text'] for row in imdb if row['label'] != accepted_label]
#         assert self.chosen_texts, f"no texts with label {accepted_label}"
#         print(f"Found {len(self.chosen_texts)} chosen and {len(self.rejected_texts)} rejected texts, {len(self)} pairs")

#     def __len__(self):
#         return len(self.chosen_texts) #* len(self.rejected_texts)  # all pairs

#     def __getitem__(self, index: int):
#         chosen = self.tokenizer(self.chosen_texts[index], truncation=True)
#         negative_ix = np.random.randint(0, len(self.rejected_texts))
#         rejected = self.tokenizer(self.rejected_texts[negative_ix], truncation=True)
#         return dict(input_ids_chosen=chosen['input_ids'], attention_mask_chosen=chosen['attention_mask'],
#                     input_ids_rejected=rejected['input_ids'], attention_mask_rejected=rejected['attention_mask'])

class IMDBPairwiseDatasetNoTokenizer(torch.utils.data.Dataset):
    """ A dataset of pairs of chosen and rejected raw texts, indexed by (chosen_index, rejected_index) """
    def __init__(self, imdb, accepted_label: int):
        super().__init__()
        self.chosen_texts = [row['text'] for row in imdb if row['label'] == accepted_label]
        self.rejected_texts = [row['text'] for row in imdb if row['label'] != accepted_label]
        assert self.chosen_texts, f"no texts with label {accepted_label}"
        print(f"Found {len(self.chosen_texts)} chosen and {len(self.rejected_texts)} rejected texts")

    def __len__(self):
        return len(self.chosen_texts) # number of texts, 12500

    def __getitem__(self, index: tuple[int, int]):
        pos_ix, neg_ix = index
        ch = self.chosen_texts[pos_ix]
        rej = self.rejected_texts[neg_ix]
        return {'chosen':ch, 'rejected':rej}

In [None]:
TARGET_LABEL = 1   # and make sure it works by reviewing the sample printed below
imdb = datasets.load_dataset("imdb", split='train')
# reward_data = IMDBPairwiseDataset(imdb, reward_tokenizer, accepted_label=TARGET_LABEL)
reward_data_no_token = IMDBPairwiseDatasetNoTokenizer(imdb, accepted_label=TARGET_LABEL)

# sample = reward_data[100]
sample_no = reward_data_no_token[100, 100]
# print('CHOSEN:', reward_tokenizer.decode(sample['input_ids_chosen']))
# print('REJECTED:', reward_tokenizer.decode(sample['input_ids_rejected']))

Found 12500 chosen and 12500 rejected texts


In [None]:
sample_no

{'chosen': 'Emilio Miraglio\'s "The Red Queen Kills Seven Times" (1972) is just about the most perfect example of a giallo that I have ever seen, mixing all the requisite elements into one sinister stew indeed. First of all, and of paramount importance for me, it has a complex, twisty plot that ultimately makes perfect sense, and the killer here does not come completely out of left field at the end. The story, concerning a series of gruesome murders (you already know how many from the film\'s title, right?) that takes place in seeming fulfillment of an ancient prophecy concerning two sisters, is an involving one, and the murderer, a red-cloaked figure with the insane laugh of a madwoman, is both frightening and memorable. Every great giallo requires some lovely lead actresses, and here we have quite an assortment, headed by the ridiculously beautiful Barbara Bouchet as one of the two sisters and, in one of her earlier roles, Sybil Danning, as a lustful tramp at Barbara\'s fashion house

In [None]:
imdb[10].keys()

dict_keys(['text', 'label'])

We'll be using `trl.RewardTrainer` - a special case of `transformers.Trainer` that you used in the past. `RewardTrainer` accepts the same format of training arguments (e.g. batch size, gradient checkpointing) as before, except that it trains the model for the pairwise reward objective from [the InstructGPT paper](https://arxiv.org/pdf/2203.02155.pdf):

![img](https://i.imgur.com/2JzNAPs.png)

Note that the model itself does not score pairs: it processes the chosen ($y_w$) and rejected ($y_l$) samples independently. To minimize this loss, the reward model needs to score the chosen sample higher than the rejected one. Note that the formula also assumes some context $x$, which is useful for seq2seq tasks. In our case of movie reviews, $x$ is empty.
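As a worked illustration of the pairwise objective, the per-pair loss is $-\log \sigma(r(y_w) - r(y_l))$. The sketch below computes it for plain floats using only the standard library; the function name is hypothetical, and a real trainer averages this over a batch of tensors rather than calling it pair by pair.

```python
import math


def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(d)) == log(1 + exp(-d)); branch so that exp() never overflows
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


print(pairwise_reward_loss(2.0, -1.0))  # chosen scored higher: small loss
print(pairwise_reward_loss(-1.0, 2.0))  # chosen scored lower: large loss
print(pairwise_reward_loss(0.0, 0.0))   # undecided pair: log(2)
```

Only the difference between the two scores matters, which is why the absolute reward values of different reward models are not comparable.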

In [None]:
import shutil as sh

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# import trl

# training_args = trl.RewardConfig(  # like transformers.TrainingArguments
#     output_dir="reward_model",
#     per_device_train_batch_size=32,
#     gradient_accumulation_steps=1,
#     learning_rate=1.41e-5,
#     max_steps=2_000,              # note: training may need more than 1k steps
#     logging_steps=50,
#     gradient_checkpointing=True,  # reduce memory usage but train ~30% slower
#     gradient_checkpointing_kwargs={"use_reentrant": False},
#     fp16=True                     # disable this on CPU or on very old GPUs
#     # you may add any other hyperparameters that you found useful in weeks 5-7
# )

# trainer = trl.RewardTrainer(
#     model=reward_model,
#     args=training_args,
#     tokenizer=reward_tokenizer,
#     train_dataset=reward_data,
#     peft_config=None,  # optionally, you may tune with LoRA, prompt-tuning, etc
# )

# trainer.train()

In [None]:
# torch.save(reward_model, 'reward_model.pt')

In [None]:
# sh.copy("reward_model.pt", "/content/drive/MyDrive/reward_model_new.pt")

'/content/drive/MyDrive/reward_model_new.pt'

In [None]:
# new
sh.copy("/content/drive/MyDrive/reward_model_new.pt", "reward_model.pt")

'reward_model.pt'

In [None]:
reward_model = torch.load('reward_model.pt')

In [None]:
reward_model.gradient_checkpointing_disable()
reward_model.eval()

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

### Sanity-check the reward model (1 point)

Let's check how our reward model performs.

__Your task__ is to measure how often your reward model ranks a pair of (chosen and rejected) reviews correctly. Please measure this separately for the train data (`imdb`) and a separate test set loaded below.

In [None]:
for sample_index in 45, 16000:
  print('TEXT:', imdb[sample_index]['text'])
  inputs = reward_tokenizer(
      imdb[sample_index]['text'], truncation=True, return_tensors='pt').to(device)
  with torch.no_grad():
    reward = reward_model(**inputs).logits[0, 0].item()
    print("REWARD:", reward)
  print('LABEL:', imdb[sample_index]['label'])
  print()

# note: your reward model may produce different absolute rewards.
# This is fine as long as the rewards are ordered correctly (most of the time)

TEXT: This movie sucked. It really was a waste of my life. The acting was atrocious, the plot completely implausible. Long, long story short, these people get "terrorized" by this pathetic "crazed killer", but completely fail to fight back in any manner. And this is after they take a raft on a camping trip, with no gear, and show up at a campsite that is already assembled and completely stocked with food and clothes and the daughters headphones. Additionally, after their boat goes missing, they panic that they're stuck in the woods, but then the daughters boyfriend just shows up and they apparently never consider that they could just hike out of the woods like he did to get to them. Like I said, this movie sucks. A complete joke. Don't let your girlfriend talk you into watching it.
REWARD: -6.93669319152832
LABEL: 0

TEXT: Good: Engaging cinematic firefights, great presentation, vehicles are actually fun to drive, fairly appealing multiplayer, faithful to the movie, and the list goes o

In [13]:
def pad_reviews(batch, rew):
    chosen = [x['chosen'] for x in batch]
    rejected = [x['rejected'] for x in batch]
    chosen = rew(chosen, return_tensors='pt', padding=True, truncation=True)
    rejected = rew(rejected, return_tensors='pt', padding=True, truncation=True)
    return chosen, rejected

In [14]:
from torch.utils.data import DataLoader
from functools import partial
import multiprocessing as mp

In [18]:
def to_device(dictionary):
    return {k:v.to(device) for k, v in dictionary.items()}

In [19]:
from tqdm.auto import tqdm  # same import style as the RL training loop later in the notebook

def evaluate_model(model, tokenizer, dataset, batch_size, iters=1):
    # Each "index" passed to the dataset is a pair [pos_ix, neg_ix]: the inner BatchSampler
    # groups random indices in twos, and the DataLoader then batches `batch_size` such pairs.
    sampler = torch.utils.data.sampler.BatchSampler(
        torch.utils.data.sampler.RandomSampler(dataset), batch_size=2, drop_last=True)
    loader = DataLoader(
        dataset,
        batch_size=batch_size,
        collate_fn=partial(pad_reviews, rew=tokenizer),
        sampler=sampler,
        num_workers=max(1, mp.cpu_count() // 2),  # persistent_workers needs at least one worker
        pin_memory=True,
        persistent_workers=True,
    )

    correct = 0
    total = 0
    model.eval()

    for chosen, rejected in tqdm(loader, total=len(loader)):
        chosen, rejected = to_device(chosen), to_device(rejected)

        with torch.no_grad():
            chosen_rewards = model(**chosen).logits[:, 1]
            rejected_rewards = model(**rejected).logits[:, 1]
            diff = chosen_rewards > rejected_rewards  # True where the pair is ranked correctly
            correct += torch.sum(diff).item()
            total += len(diff)

    return correct / total

In [96]:
batch_size = 256

In [None]:
train_reward_accuracy = evaluate_model(reward_model, reward_tokenizer, reward_data_no_token, batch_size)
print('Train reward accuracy:', train_reward_accuracy)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




25it [03:32,  8.49s/it]                        

Train reward accuracy: 0.98912





In [None]:
imdb_test = datasets.load_dataset("imdb", split='test')

In [None]:
reward_test_data_no_token = IMDBPairwiseDatasetNoTokenizer(imdb_test, accepted_label=TARGET_LABEL)

Found 12500 chosen and 12500 rejected texts


In [None]:
test_reward_accuracy = evaluate_model(reward_model, reward_tokenizer, reward_test_data_no_token, batch_size)
print('Test reward accuracy:', test_reward_accuracy)



25it [03:33,  8.55s/it]                        

Test reward accuracy: 0.97504





### Reward-guided generation (1 point)

If you did everything right, by now you should have a decent reward model. Before we use it for reinforcement learning, let's see if we can align model samples without any training.

To do so, you can use reward-guided inference: __generate N=16 samples, then select the one with the highest reward__ (according to your reward model).

For this problem, it's on you to demonstrate whether or not your code works. Find at least 5 neutral prompts such as "This movie is" (...), generate samples, rank them based on reward and show which samples get the highest reward.
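The selection rule itself is independent of any particular model. The sketch below shows best-of-N with two hypothetical stand-ins, `toy_generate` and `toy_reward`, which are assumptions made only so the example runs without a GPU; in the notebook these roles are played by the GPT-2 sampler and the trained reward model.

```python
import random
from typing import Callable, Tuple


def best_of_n(prompt: str, generate: Callable[[str], str],
              reward: Callable[[str], float], n: int = 16) -> Tuple[str, float]:
    """Sample n continuations of `prompt` and return the one with the highest reward."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward(text) for text in candidates]
    best = max(range(n), key=scores.__getitem__)  # index of the top-scoring sample
    return candidates[best], scores[best]


# Hypothetical stand-ins for the language model and the reward model
def toy_generate(prompt: str) -> str:
    return prompt + " " + random.choice(["great", "fine", "awful"])


def toy_reward(text: str) -> float:
    return {"great": 1.0, "fine": 0.0, "awful": -1.0}[text.split()[-1]]


print(best_of_n("This movie is", toy_generate, toy_reward, n=16))
```

Note that this trades compute for quality: every returned sample costs N generations and N reward evaluations, and the language model itself is left unchanged.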

Note: it is faster to generate samples in parallel, rather than sequentially, as follows:




In [None]:
def reward_generation(prefixes, batch_size):
  highest_score_texts, lowest_score_texts = [], []
  for prefix in prefixes:
    inputs = main_tokenizer([prefix] * batch_size, return_tensors='pt').to(device)
    candidates = main_model.generate(**inputs, max_new_tokens=400, do_sample=True)
    candidates_texts = []
    for candidate in candidates:
      candidate_text = main_tokenizer.decode(candidate.flatten())
      candidates_texts.append(candidate_text)
    tok_inp = reward_tokenizer(candidates_texts,truncation=True, padding=True,return_tensors='pt')
    tok_inp = to_device(tok_inp)
    with torch.no_grad():
      scores = reward_model(**tok_inp).logits[:, 1].detach().cpu().numpy().tolist()

    max_score = np.argmax(scores)
    min_score = np.argmin(scores)
    highest_score_texts.append((candidates_texts[max_score], max(scores)))
    lowest_score_texts.append((candidates_texts[min_score], min(scores)))

  return highest_score_texts, lowest_score_texts

In [None]:
prefixes = ["It was", "I saw a movie about", "Main character in the movie is", "I watched a movie"]
highest_score_texts, lowest_score_texts = reward_generation(prefixes, 16)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [None]:
for i, (high, low) in enumerate(zip(highest_score_texts, lowest_score_texts)):
    print(f'Pair: {i}:')
    print(f'Score: {high[1]}, text: {high[0]}')
    print(f'Score: {low[1]}, text: {low[0]}')
    print('-------------------------')

Pair: 0:
Score: 5.49664831161499, text: It was a wonderful movie for all ages! I recommend this movie as a great movie to learn something. One more thing...It really goes above and beyond being an action movie..it actually gives very different insight on life for the victims!! As this movie became, I've not been able to enjoy the movie since. I guess this is because this movie is the movie of the century. I thought I had been lost on the world by the end and that I had really missed it! If I had to choose between this movie and MySpace movies, I would choose MySpace! If I had to choose between MySpace and MySpace and The World is your oyster (you guessed it right)! I'd choose MySpace!!!<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endo

### Analysis
I generated 16 samples for each of the 4 prefixes and selected the texts with the maximum and minimum reward. The texts with the maximum reward are very positive, and the texts with the minimum reward are very negative. So the pipeline seems to work: the generative model produces varied texts, and the discriminative model scores them sensibly.

## Stage 2: fine-tune the main model with RL


For this tutorial, we will optimize GPT2 to produce positive IMDB movie reviews using the reward model you trained above.

Unlike supervised fine-tuning, RL lets the model generate its own sentences on each training step. It then computes the reward for those specific sentences and, finally, updates the model to increase the probability of high-reward sentences.

Thus, each RLHF training step consists of three stages: __Rollout__, __Evaluation__ and __Update__

<div style="text-align: center">
<img src='https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/gpt2_bert_training.png' width='600'>
</div>

The update stage depends on the specific RL algorithm. We'll be using Proximal Policy Optimization, or [PPO](https://arxiv.org/abs/1707.06347), similarly to what was used for InstructGPT.
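For intuition about what the PPO update optimizes, here is a minimal sketch of the clipped surrogate objective from the PPO paper, written for a single token with plain floats. The function name and the default `clip_eps=0.2` are illustrative assumptions; a real implementation works on batched tensors and adds value-function and regularization terms.

```python
import math


def ppo_clipped_objective(logp_new: float, logp_old: float,
                          advantage: float, clip_eps: float = 0.2) -> float:
    """Per-token PPO surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)  # pi_new(token) / pi_old(token)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # taking the minimum removes the incentive to move the policy far in one update
    return min(ratio * advantage, clipped_ratio * advantage)


print(ppo_clipped_objective(0.0, 0.0, advantage=1.0))           # unchanged policy
print(ppo_clipped_objective(math.log(2.0), 0.0, advantage=1.0)) # gain is capped by clipping
print(ppo_clipped_objective(math.log(2.0), 0.0, advantage=-1.0))  # mistakes are not capped
```

The objective is maximized, so a trainer minimizes its negative; the clipping is what makes several optimization passes over the same rollout batch reasonably safe.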

Before we run those 3 stages, however, we need to create a dataset of "queries" - partial reviews in our case.

In [None]:
import trl

In [None]:
# Note: this code is specific to IMDB; you will need to re-write it for other tasks
imdb_for_rlhf = imdb.filter(lambda row: len(row['text']) > 200, batched=False)
imdb_for_rlhf = imdb_for_rlhf.remove_columns(['label'])
sample_length = trl.core.LengthSampler(2, 8)  # use the first 2-8 tokens as query

def select_query_and_tokenize(sample):
    query_ids = main_tokenizer.encode(sample["text"])[: sample_length()]
    sample["query"] = main_tokenizer.decode(query_ids)  # query is the only required column
    sample["input_ids"] = query_ids  # to avoid re-tokenizing later
    return sample  # we do not need the rest - it will be generated by the model

imdb_for_rlhf = imdb_for_rlhf.map(select_query_and_tokenize, batched=False)
imdb_for_rlhf.set_format(type="torch")

Map:   0%|          | 0/24895 [00:00<?, ? examples/s]

Next, let's prepare your reward model to predict rewards on whatever reviews were generated. Note that we use plaintext reviews because the main model uses a different tokenizer from the reward model.

In [None]:
from typing import List
def compute_reward(texts: List[str]) -> torch.Tensor:
  inputs = reward_tokenizer(texts, truncation=True, padding=True, return_tensors='pt').to(device)
  with torch.no_grad():
    return reward_model(**inputs).logits[:, 0]

In [None]:
imdb[16000]['text']

'Good: Engaging cinematic firefights, great presentation, vehicles are actually fun to drive, fairly appealing multiplayer, faithful to the movie, and the list goes on.<br /><br />Bad: Main missions are a bit short.<br /><br />This game defines what a "good" third person shooter(not necessarily a spy-game) is. Great firefights carry on the story and make you want to complete EVERY single mission through, and unlock all the genuine bonuses the game has to offer. The hype this game had, was lived up to, and I personally think you should buy it, and hook up with a couple of friends and play this one. Loads of fun. <br /><br />The sound in this game, is a rip-roaring achievement from a few previous bond games, and firing a weapon, really feels like you\'re firing a weapon. It ties in with the aspect that you are a deadly and ruthless spy.<br /><br />All in all, this game makes you excited and satisfied after you make it through, and some multiplayer that can compete with the standards of t

In [None]:
compute_reward([imdb[45]['text'], imdb[16000]['text']])  # test on human-written reviews

tensor([-6.9367,  6.2693], device='cuda:0')

Finally, we move to RL training. In this tutorial, we'll train LoRA adapters and not the full model.

In [None]:
import peft
peft_config = peft.LoraConfig(
    task_type=peft.TaskType.CAUSAL_LM, r=32, lora_alpha=32, lora_dropout=0.0, inference_mode=False
)

# reload main model as AutoModelForCausalLMWithValueHead - with an extra head needed for PPO
main_tokenizer = transformers.AutoTokenizer.from_pretrained("lvwerra/gpt2-imdb")
main_tokenizer.pad_token = main_tokenizer.eos_token

main_model = trl.AutoModelForCausalLMWithValueHead.from_pretrained("lvwerra/gpt2-imdb", device_map=device)
main_model = peft.get_peft_model(main_model, peft_config, adapter_name='default')
main_model.print_trainable_parameters()



trainable params: 1,179,648 || all params: 125,620,225 || trainable%: 0.9390589771670923


Same as before, trl has a special type of trainer that minimizes a PPO-specific pseudo-loss. You can read more on this trainer [here](https://huggingface.co/docs/trl/main/en/ppo_trainer).
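RLHF trainers of this kind usually do not optimize the raw reward-model score alone: a per-token KL penalty against the original (reference) model is mixed in, which is what the `objective/kl` statistic in the training loop below monitors. The following is a simplified sketch of that reward shaping under stated assumptions: the function name, the single-sample KL estimate `logp_policy - logp_ref`, and `kl_coef=0.2` are illustrative, not a statement of the exact internals of `PPOTrainer`.

```python
from typing import List


def kl_shaped_rewards(logp_policy: List[float], logp_ref: List[float],
                      score: float, kl_coef: float = 0.2) -> List[float]:
    """Per-token rewards: a KL penalty at every token, plus the reward-model score at the last one."""
    if not logp_policy or len(logp_policy) != len(logp_ref):
        raise ValueError("need two non-empty log-prob lists of equal length")
    # single-sample KL estimate per generated token: log pi(token) - log pi_ref(token)
    rewards = [-kl_coef * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += score  # the sequence-level score is credited to the final token
    return rewards


# Tokens that became more likely than under the reference model are penalized
print(kl_shaped_rewards([-1.0, -1.0], [-2.0, -1.0], score=1.0, kl_coef=0.5))
```

If the policy matches the reference model exactly, the penalty vanishes and only the reward-model score remains; as the policy drifts, the penalty grows and discourages degenerate text that merely exploits the reward model.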

In [None]:
training_args = trl.PPOConfig(
    model_name=main_model.config._name_or_path,
    gradient_accumulation_steps=1,
    learning_rate=1.41e-5,
    batch_size=64,
    ppo_epochs=4,                 # PPO performs this many updates per training batch
)

ppo_trainer = trl.PPOTrainer(
    training_args, model=main_model.model, tokenizer=main_tokenizer,
    dataset=imdb_for_rlhf, data_collator=lambda data: dict((key, [d[key] for d in data]) for key in data[0])
)  # note: we pass main_model.model because PPOTrainer checks for one of several supported model types ...
# ... main_model.model is a model with adapters, which is supported. main_model itself is a wrapper that is not supported

In [None]:
from tqdm.auto import tqdm
max_steps = 50   # can be insufficient for some tasks - watch your learning curves
generation_kwargs = dict(
    min_length=-1, max_new_tokens=128, do_sample=True, top_k=0, top_p=1.0, pad_token_id=main_tokenizer.eos_token_id)
#                                  ^-- task-specific parameter!
with tqdm(enumerate(ppo_trainer.dataloader), total=max_steps) as progressbar:
  # note: ppo_trainer.dataloader is just a regular dataloader of queries, no RL-specific magic :)
  for epoch, batch in progressbar:
    if epoch >= max_steps:
        break

    # Rollout stage: generate continuations from batch queries using main_model
    response_tensors = ppo_trainer.generate(batch['input_ids'], **generation_kwargs)
    # ^-- list of tensors of token ids from main model tokenizer

    # de-tokenize responses to strings (since reward model uses a different tokenizer)
    batch["response"] = [main_tokenizer.decode(response.squeeze()) for response in response_tensors]
    # note: response_tensors already contain query tokens, so we don't need to add queries manually.
    # This may not be true for other tasks: check this manually by viewing batch["response"] and batch["query"]


    # Evaluation stage
    rewards = compute_reward(batch['response'])

    # Update stage
    stats = ppo_trainer.step(batch['input_ids'], response_tensors, list(rewards.split(1)))
    stats['rewards/mean'] = rewards.mean().item()

    print("-" * 30, 'STEP', epoch, '-' * 30)
    print(f'rewards/mean:\t{stats["rewards/mean"]:.9f}\t<---- average reward over this batch (higher=better, noisy)')
    print(f'ppo/returns/mean:\t{stats["ppo/returns/mean"]:.9f}\t<---- model-estimated average discounted reward')
    print(f'objective/kl:\t{stats["objective/kl"]:.9f}\t<---- how far we are from the original model (regularizer)')
    print()

    ppo_trainer.log_stats(stats, batch, list(rewards.split(1)))

  0%|          | 0/50 [00:00<?, ?it/s]

------------------------------ STEP 0 ------------------------------
rewards/mean:	-0.258332342	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.001502186	<---- model-estimated average discounted reward
objective/kl:	0.000000000	<---- how far we are from the original model (regularizer)

------------------------------ STEP 1 ------------------------------
rewards/mean:	1.002418995	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.305601835	<---- model-estimated average discounted reward
objective/kl:	0.119602509	<---- how far we are from the original model (regularizer)

------------------------------ STEP 2 ------------------------------
rewards/mean:	-0.003728613	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.070193335	<---- model-estimated average discounted reward
objective/kl:	0.087802038	<---- how far we are from the original model (regularizer)

------------------------------ STEP 3 --

In [None]:
stats

{'objective/kl': 16.23154640197754,
 'objective/kl_dist': array([10.07762  , 14.099819 , 18.582432 , 17.648338 , 14.626041 ,
        29.34806  , 30.24659  , 13.031353 , 12.48026  , 20.929663 ,
        21.19273  , 21.922321 , 14.915989 , 19.299576 ,  5.0947537,
        18.318642 , 13.459557 , 24.857752 ,  8.909032 , 29.22661  ,
        19.53822  , 14.703208 , 22.088036 , 23.727571 , 17.329473 ,
        23.164112 , 11.439144 , 10.814253 , 20.888172 , 23.808994 ,
        11.344417 , 22.790745 , 15.773094 , 24.66652  , 19.548517 ,
        12.326679 ,  8.785221 , 15.149178 , 15.046537 , 16.561705 ,
        12.623873 , 18.611702 , 14.299225 , 16.15994  ,  6.58377  ,
         9.530743 , 10.165553 , 14.527981 , 14.081645 ,  9.717058 ,
        19.567291 , 12.623527 , 12.116524 , 15.647436 , 18.683199 ,
        22.751425 , 18.417181 ,  6.07038  , 21.452393 ,  8.003216 ,
        12.945242 , 13.156256 , 22.068367 ,  1.2541091], dtype=float32),
 'objective/logprobs': array([[-3.45544791e+00, -1.266

## Main assignment - <u>actually</u> train the model (8 points)


Your main task for this week is to use the RLHF pipeline to train a model for a reward of your choice. Here's what you can choose from:

__A. Toxicity fine-tuning:__ train the model to be less (or more!) toxic. For this task, you may use the data from [jigsaw toxic comments](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) and [lmsys/toxic-chat](https://huggingface.co/datasets/lmsys/toxic-chat),  or any other source. Alternatively, you may use toxicity scores from [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1).


__B. Actual human feedback:__ use one of the existing datasets with pairwise human feedback to align your language model. You may use [Anthropic's hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf), the [OpenAssistant dataset](https://huggingface.co/datasets/OpenAssistant/oasst1) or any other data you see fit. You may also turn the tables and train the model to [minimize](https://habrastorage.org/getpro/geektimes/post_images/ac7/2ad/827/ac72ad82767d4132164a4b6b76196c42.jpg) human preferences, as long as your model does not degrade to gibberish.

__C. Controlled generation:__ Instead of training a reward model from human feedback, you may define the reward function as the text length (longer or shorter) or number of times the model uses specific words (e.g. "sorry", "apologize"). If you choose specific words, make sure the model generates them at least sometimes.
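
A rule-based reward of this kind can be a plain function of the generated text. The sketch below is only an illustration: the function names `keyword_reward` and `length_reward`, the whole-word matching rule and the `max_words` cap are assumptions, not part of the assignment code. Both functions return one float per text; wrap the result in `torch.tensor(...)` if you want the same type that a reward model produces.

```python
import re
from typing import List, Sequence


def keyword_reward(texts: List[str], keywords: Sequence[str] = ("sorry", "apologize")) -> List[float]:
    """Reward = number of whole-word, case-insensitive keyword matches in each text."""
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b", re.IGNORECASE)
    return [float(len(pattern.findall(text))) for text in texts]


def length_reward(texts: List[str], max_words: int = 100) -> List[float]:
    """Reward = whitespace-separated word count, capped at max_words and scaled to [0, 1]."""
    return [min(len(text.split()), max_words) / max_words for text in texts]


print(keyword_reward(["Sorry, I am so sorry!", "No regrets here."]))
print(length_reward(["one two three four"], max_words=8))
```

Capping and scaling the length reward keeps it in a bounded range, which tends to make PPO less sensitive to a few very long generations.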

__Alternatively,__ you may choose a different task. However, unless your task is very similar to one of the above, there is a chance that it will be **significantly** harder to solve, requiring orders of magnitude more compute and tuning. If you are in doubt, please ask the course staff. If they are AFK (again >.<), please prefer one of the recommended tasks.


#### General tips & tricks


Things to look out for:
- during the PPO stage, the reward model should be in eval mode (dropout disabled)
- make sure max_length and max_new_tokens are enough for your chosen dataset - at least most of the time
- when in doubt, view the data manually or inspect how the model performs on a few samples


We highly recommend that you manually check the performance after each sub-stage:
1. once you have assembled the pairwise dataset, inspect a couple of samples from *your* dataset class and detokenize them. Make sure that you-the-human understand why one sample was accepted and the other rejected. At least most of the time. This also lets you spot tokenization/truncation errors.
2. after you trained a reward model, measure how accurate this model is in isolation. If your reward model is poor, any subsequent RLHF will also fail.
3. once you've trained the main model with RL, ask it to generate examples and explore how well it does. If it produces an obviously bad output, check if the reward model assigns high reward to that output. If yes, the reward model is the culprit; if no, it's a question of better/longer PPO training.
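
For step 2, one simple way to measure a reward model in isolation is pairwise accuracy: the fraction of held-out pairs where the chosen text receives a higher reward than the rejected one. The helper below is a minimal sketch; the name `pairwise_accuracy` and the toy numbers are hypothetical, and it assumes you have already collected the rewards for both sides of each pair.

```python
from typing import Sequence


def pairwise_accuracy(chosen_rewards: Sequence[float], rejected_rewards: Sequence[float]) -> float:
    """Fraction of pairs where the chosen text got a strictly higher reward than the rejected one."""
    if len(chosen_rewards) != len(rejected_rewards) or len(chosen_rewards) == 0:
        raise ValueError("need two non-empty reward lists of equal length")
    correct = sum(c > r for c, r in zip(chosen_rewards, rejected_rewards))
    return correct / len(chosen_rewards)


# toy scores standing in for reward-model outputs on held-out pairs
print(pairwise_accuracy([2.1, 0.3, 1.5, -0.2], [0.4, 0.9, -1.0, -0.7]))
```

An untrained reward model should score about 0.5 on this metric, so that is the baseline to beat.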

__It is also a good idea to periodically print samples during training.__

__When stuck, simplify the problem.__ If you've spent several hours enchanting the reward model but it still won't budge, try switching to a simpler subtask. For instance, if you're training on hh-rlhf, try limiting the dataset to the shortest 10% of sequences - they are typically easier to learn.
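
As a rough sketch of that last trick, the helper below picks the indices of the shortest sequences. The name `keep_shortest` is hypothetical, and it assumes you already have the tokenized length of every sample.

```python
from typing import List, Sequence


def keep_shortest(token_ids: Sequence[Sequence[int]], fraction: float = 0.1) -> List[int]:
    """Return indices of the shortest `fraction` of sequences, shortest first."""
    n_keep = max(1, int(len(token_ids) * fraction))  # always keep at least one sample
    order = sorted(range(len(token_ids)), key=lambda i: len(token_ids[i]))
    return order[:n_keep]


# toy "tokenized" samples with lengths 5, 1, 3, 2 and 9
samples = [[0] * n for n in (5, 1, 3, 2, 9)]
print(keep_shortest(samples, fraction=0.4))
```

If your data lives in a Hugging Face `datasets.Dataset`, the returned indices can be passed to its `select` method to build the smaller subset.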


## Assignment stages (and grading)

Regardless of the specific task you chose, your solution needs to contain several parts that will be graded separately.


#### Stage 1: reward model (4 points)

Construct a dataset for training the reward model on your problem. Then, train a reward model on that dataset and evaluate how well your model can predict preferences on a hold-out (test) subset of your data.

Please make sure that the part of your notebook where you evaluate the reward model is clearly visible and reasonably easy to read. And for all that is holy, do not call it IMDB unless it actually **is** data of IMDB movie reviews :)

__Not all tasks require a reward model for later PPO fine-tuning.__ For instance, there's no reason to train a reward model if your reward equals sentence length. Likewise, toxicity reward can be estimated with a pre-trained toxicity classifier. __If your task does not require training a reward model, please train an unrelated model on [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) as though you were solving assignment version B.__ This is for grading purposes only, you won't use this model for stage 2.


#### Stage 2: RL fine-tuning (4 points)

Once the reward model is ready - or you can compute rewards without a model - it is time to maximize that reward with PPO. Optionally, you may replace PPO with another RL algorithm (or unlikelihood learning scheme), but only if you're feeling adventurous.


First, you need to choose a language model to be fine-tuned. You may choose any model, but make sure that your model **can** generate the data in your format. For instance, [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) is a general purpose LM and may (or may not) need prompt engineering to generate chat assistant responses. For that reason, it is best if you **do not use `"lvwerra/gpt2-imdb"` unless you're generating only movie reviews**.



There are two "difficulty modes" for this task:
- For the **easy mode**, use [gpt2-large](https://huggingface.co/gpt2-large) or [opt-1.3b](https://huggingface.co/facebook/opt-1.3b) with minimal code changes.
- For the **hard mode**, use a larger (e.g. 7B) model in combination with `load_in_4bit` and LoRA, the same way we did last week.

Some reasonable model choices for the hard mode are [LLaMA-7B](https://huggingface.co/Enoch/llama-7b-hf), [Falcon-7b](https://huggingface.co/tiiuae/falcon-7b), [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) for general-purpose LM or [guanaco-7b](https://huggingface.co/timdettmers/guanaco-7b), [vicuna-7b](https://huggingface.co/lmsys/vicuna-7b-v1.5) for chat-based tasks, though there are many more (see [leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)). In the hard mode, you will need to modify the training arguments to enable 4-bit fine-tuning. Furthermore, your experiments will take somewhat longer to complete. On the plus side, your model will produce significantly better results.

__High reward is not enough!__ RL algorithms are famous for [cheating their reward functions](https://openai.com/research/faulty-reward-functions). To ensure that your model is actually doing what you want it to do, you will need some additional evaluation. To get the full grade, provide at least 20 side-by-side examples of your fine-tuned model vs original model predictions and a short summary.

Alternatively, you may provide 5 examples and some extrinsic evaluation metric over many examples. For instance, you may use a different pre-trained toxicity score for option A. When dealing with human preferences, you may choose to [enlist actual humans](https://toloka.ai/) or [ask GPT4/Claude](https://arxiv.org/pdf/2304.03277.pdf) to compare your model's predictions. For task C, when optimizing for simple rewards like sentence lengths, it is enough to compare histograms of rewards (e.g. average lengths).
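
For a length-based reward, the comparison can be as simple as the sketch below. The helper names `length_histogram` and `summarize` and the toy strings are hypothetical; in practice `before` and `after` would hold generations from the original and the fine-tuned model for the same prompts.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List


def length_histogram(texts: List[str], bin_width: int = 10) -> Dict[int, int]:
    """Count texts per word-length bin; key b covers lengths b .. b + bin_width - 1."""
    counts = Counter((len(text.split()) // bin_width) * bin_width for text in texts)
    return dict(sorted(counts.items()))


def summarize(name: str, texts: List[str]) -> float:
    """Print the mean length and histogram for one model, return the mean length."""
    avg = mean(len(text.split()) for text in texts)
    print(f"{name}: mean length {avg:.1f} words, histogram {length_histogram(texts)}")
    return avg


# toy strings standing in for generations of the original and the fine-tuned model
before = ["a b c", "a b c d e f g h i j k l"]
after = ["a", "a b", "a b c d"]
summarize("before", before)
summarize("after", after)
```

The same pattern works for any scalar reward: replace the word count with the reward value and compare the two distributions.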












## Stage 1

In [1]:
%pip install -q trl==0.7.4 transformers==4.33.1 datasets==2.14.4 peft==0.5.0


In [1]:
import torch
import transformers
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('device:', device)

device: cuda


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('30k_comments.csv') # dataframe created from Kaggle competition, has 15k toxic and 15k non-toxic texts
df.head()

Unnamed: 0.1,Unnamed: 0,id,comment_text,toxic
0,139145,e8b043b74dc9c4a7,"""\n\n == deflem said """"fuck you Wikipedia"""" ==...",1
1,21000,37707ececa862361,DONT BREAK WP:3RR BIATCH,0
2,151347,7ba7c886dceabe09,foot fetishes are awesome fuck you 68.228.72.192,1
3,95600,ffae7f0306ace986,These guys like Pascerboy and Sturmvogel are a...,1
4,91119,f3bbde465794ef50,"""\n\nAm I supposed to be scared? It's not like...",1


In [3]:
df['toxic'].value_counts()

1    15294
0    15294
Name: toxic, dtype: int64

In [4]:
train_df, test_df = train_test_split(df, test_size=0.2)
train_df.shape, test_df.shape

((24470, 4), (6118, 4))

In [5]:
class ToxicDatasetPairs(torch.utils.data.Dataset):
    """ A dataset of (chosen, rejected) text pairs in TRL reward training format """
    def __init__(self, df, tokenizer, accepted_label: int):
        super().__init__()
        self.tokenizer = tokenizer
        self.toxic_texts = df.loc[df['toxic'] == 1, 'comment_text'].tolist()
        self.non_toxic_texts = df.loc[df['toxic'] == 0, 'comment_text'].tolist()
        # texts labeled accepted_label are "chosen" (higher reward), all others are "rejected"
        if accepted_label == 1:
            self.chosen_texts, self.rejected_texts = self.toxic_texts, self.non_toxic_texts
        else:
            self.chosen_texts, self.rejected_texts = self.non_toxic_texts, self.toxic_texts

        print(f"Found {len(self.toxic_texts)} toxic and {len(self.non_toxic_texts)} non toxic texts")

    def __len__(self):
        # the i-th chosen text is paired with the i-th rejected text (not all pairs),
        # so stop at the shorter list to avoid an IndexError
        return min(len(self.chosen_texts), len(self.rejected_texts))

    def __getitem__(self, index: int):
        chosen = self.tokenizer(self.chosen_texts[index], truncation=True)
        rejected = self.tokenizer(self.rejected_texts[index], truncation=True)
        return dict(input_ids_chosen=chosen['input_ids'], attention_mask_chosen=chosen['attention_mask'],
                    input_ids_rejected=rejected['input_ids'], attention_mask_rejected=rejected['attention_mask'])
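
The class above pairs the i-th toxic text with the i-th non-toxic one. If you would rather pair every toxic text with every non-toxic text, one option is to treat the dataset index as a flat index over all pairs and unpack it with `divmod`. This is a sketch only; `flat_to_pair` is a hypothetical helper, not something the notebook uses.

```python
from typing import Tuple


def flat_to_pair(index: int, num_chosen: int, num_rejected: int) -> Tuple[int, int]:
    """Map a flat index in [0, num_chosen * num_rejected) to (chosen_ix, rejected_ix)."""
    if not 0 <= index < num_chosen * num_rejected:
        raise IndexError(f"index {index} out of range")
    return divmod(index, num_rejected)


# with 2 chosen and 3 rejected texts there are 6 pairs
print([flat_to_pair(i, 2, 3) for i in range(6)])
```

With this mapping `__len__` would return `num_chosen * num_rejected`. For the data here that is over a hundred million pairs, so you would still train on a random subset of them.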

In [6]:
reward_model_name = 'microsoft/deberta-base'
reward_model = transformers.AutoModelForSequenceClassification.from_pretrained(reward_model_name, device_map=device)
reward_tokenizer = transformers.AutoTokenizer.from_pretrained(reward_model_name)

Some weights of DebertaForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-base and are newly initialized: ['pooler.dense.bias', 'classifier.weight', 'classifier.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
train_dataset = ToxicDatasetPairs(train_df, reward_tokenizer, accepted_label=1)
test_dataset = ToxicDatasetPairs(test_df, reward_tokenizer, accepted_label=1)

Found 12212 toxic and 12258 non toxic texts
Found 3082 toxic and 3036 non toxic texts


In [84]:
import trl

training_args = trl.RewardConfig(  # like transformers.TrainingArguments
    output_dir="reward_model",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
    learning_rate=1.41e-5,
    max_steps=2_000,              # note: training may need more than 1k steps
    logging_steps=50,
    gradient_checkpointing=True,  # reduce memory usage but train ~30% slower
    gradient_checkpointing_kwargs={"use_reentrant": False},
    fp16=True                     # disable this on CPU or on very old GPUs
    # you may add any other hyperparameters that you found useful in weeks 5-7
)

trainer = trl.RewardTrainer(
    model=reward_model,
    args=training_args,
    tokenizer=reward_tokenizer,
    train_dataset=train_dataset,
    peft_config=None,  # optionally, you may tune with LoRA, prompt-tuning, etc
)

trainer.train()

You're using a DebertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss
50,0.3553
100,0.0826
150,0.0559
200,0.043
250,0.0529
300,0.0381
350,0.0374
400,0.0351
450,0.0211
500,0.0123




KeyboardInterrupt: ignored

In [85]:
# torch.save(reward_model, 'reward_model_toxic.pt')

In [8]:
reward_model = torch.load('reward_model_toxic.pt')

In [86]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [87]:
import shutil as sh

In [88]:
# sh.copy("reward_model_toxic.pt", "/content/drive/MyDrive/reward_model_new_toxic.pt")

'/content/drive/MyDrive/reward_model_new_toxic.pt'

In [10]:
class ToxicDatasetPairsNoTokenizer(torch.utils.data.Dataset):
    """ A dataset of raw (chosen, rejected) text pairs, indexed by a (toxic_ix, non_toxic_ix) tuple """
    def __init__(self, df):
        super().__init__()
        self.toxic_texts = [row['comment_text'] for i, row in df.iterrows() if row['toxic'] == 1]
        self.non_toxic_texts = [row['comment_text'] for i, row in df.iterrows() if row['toxic'] == 0]
        min_len = min(len(self.toxic_texts), len(self.non_toxic_texts))
        min_len = min_len if min_len % 2 == 0 else min_len - 1
        self.toxic_texts = self.toxic_texts[:min_len]
        self.non_toxic_texts = self.non_toxic_texts[:min_len]

        print(f"Found {len(self.toxic_texts)} toxic and {len(self.non_toxic_texts)} non toxic texts")

    def __len__(self):
        return len(self.toxic_texts) # number of texts

    def __getitem__(self, index: tuple[int, int]):
        pos_ix, neg_ix = index
        ch = self.toxic_texts[pos_ix]
        rej = self.non_toxic_texts[neg_ix]
        return {'chosen':ch, 'rejected':rej}

In [20]:
train_dataset_no_token = ToxicDatasetPairsNoTokenizer(train_df)
test_dataset_no_token = ToxicDatasetPairsNoTokenizer(test_df)

Found 12212 toxic and 12212 non toxic texts
Found 3036 toxic and 3036 non toxic texts


In [21]:
train_reward_accuracy = evaluate_model(reward_model, reward_tokenizer, train_dataset_no_token, 4)
print('Train reward accuracy:', train_reward_accuracy)

1527it [04:07,  6.17it/s]


Train reward accuracy: 0.9918113331149688


In [22]:
test_reward_accuracy = evaluate_model(reward_model, reward_tokenizer, test_dataset_no_token, 4)
print('Test reward accuracy:', test_reward_accuracy)

380it [01:09,  5.48it/s]


Test reward accuracy: 0.994729907773386


## Results, Stage 1. Reward model seems to be rather good
Although I did not even train it on the full 10k toxic/non-toxic texts

## Stage 2

In [62]:
import peft
import trl
peft_config = peft.LoraConfig(
    task_type=peft.TaskType.CAUSAL_LM, r=32, lora_alpha=32, lora_dropout=0.0, inference_mode=False
)

model_name = 'gpt2-large'
# reload main model as AutoModelForCausalLMWithValueHead - with an extra head needed for PPO

main_tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
main_tokenizer.pad_token = main_tokenizer.eos_token

main_model = trl.AutoModelForCausalLMWithValueHead.from_pretrained(model_name, device_map=device)


In [63]:
prefixes = [
    "What",
    "Who are you",
    "Give it to",
    "Hello, how is",
    "Do you",
    "The weather is really",
    "This person is",
    "The world is",
    "I decided to",
    "Is it funny",
    "Love does",
    "Friends are",
    "The earth is",
    "Red color means",
    "Waves move wind",
    "Bear lives in",
    "There is no",
    "There are many",
    "Armin is exceptional",
    "All I need for Christmas",
    "Whenever, wherever"
    ]

In [64]:
def prefix_generation(prefixes, model):
    texts, scores = [], []
    for prefix in prefixes:
        inputs = main_tokenizer(prefix, return_tensors='pt').to(device)
        candidate = model.generate(**inputs, max_new_tokens=100, do_sample=True)
        candidate_text = main_tokenizer.decode(candidate.flatten())
        # print(candidate_text)
        tok_inp = reward_tokenizer(candidate_text, truncation=True, padding=True, return_tensors='pt').to(device)
        with torch.no_grad():
            # use logit 0, the same reward output that compute_reward feeds to PPO
            score = reward_model(**tok_inp).logits[:, 0].cpu().numpy()
        texts.append(candidate_text)
        scores.append(score)

    return texts, scores

In [66]:
before_texts, before_scores = prefix_generation(prefixes, main_model)
before_texts, before_scores

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

(['What I saw that day was a small, white plastic box that looked to be some sort of container from a medical device company. I was told that after the patient left, they would take a sample so he could have a complete DNA profile to compare to patients who did not undergo the procedure. I asked to see the sample. The patient was furious. He began saying that the sample was a huge insult and that he would do anything to stop it. The medical director, Dr. A., stood on',
  "Who are you talking to?\n\nIt could be anyone...\n\n(They come through the doors, stopping in the hall door where Harry and Luna are standing)\n\nHARRY: Hello?\n\nLEIRINA: Why are you here?\n\nHARRY: Well, what have you made of me this entire time?\n\nLEIRINA: So, I am... (nods) You've been telling me what to think and do, Harry, and all I can",
  'Give it to me now in order to be safe, so I can go back home." So I said, "Do you know the words of wisdom?" "Yes," he replied, "and the word of wisdom is: Let you do and u

In [41]:
main_model = peft.get_peft_model(main_model, peft_config, adapter_name='default')
main_model.print_trainable_parameters()



trainable params: 5,898,240 || all params: 779,929,601 || trainable%: 0.7562528710844506
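
The numbers printed above can be reproduced by hand. The arithmetic below rests on assumptions that the notebook does not state: that LoRA wraps only the fused attention projection `c_attn` in each block, that gpt2-large has 36 blocks with hidden size 1280, and that the PPO value head is a single linear layer with bias.

```python
# gpt2-large dimensions (assumed from its published config): 36 transformer blocks, hidden size 1280
n_layers, hidden = 36, 1280
rank = 32                      # r in the LoraConfig above

# assumption: LoRA wraps only the fused QKV projection `c_attn` (hidden -> 3 * hidden) in each block
d_in, d_out = hidden, 3 * hidden
lora_per_layer = rank * d_in + rank * d_out   # A is (r x d_in), B is (d_out x r)
lora_total = n_layers * lora_per_layer

value_head = hidden + 1        # assumption: the value head is one Linear(hidden, 1) with bias

all_params = 779_929_601       # "all params" printed above
print(f"LoRA params:  {lora_total:,}")
print(f"trainable %:  {100 * lora_total / all_params:.4f}")
print(f"base model:   {all_params - lora_total - value_head:,}")
```

Under these assumptions the LoRA count matches the printed `trainable params` exactly, and the remainder is the size of the frozen base model.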


In [44]:
from datasets import Dataset
toxic_true = train_df[train_df['toxic']==1]
toxic_for_rlhf = Dataset.from_pandas(train_df)

In [45]:
toxic_for_rlhf = toxic_for_rlhf.remove_columns(['toxic', 'Unnamed: 0', 'id'])

In [46]:
toxic_for_rlhf

Dataset({
    features: ['comment_text', '__index_level_0__'],
    num_rows: 24470
})

In [47]:
sample_length = trl.core.LengthSampler(2, 8)

In [48]:
def select_query_and_tokenize(sample):
    query_ids = main_tokenizer.encode(sample["comment_text"])[: sample_length()]
    sample["query"] = main_tokenizer.decode(query_ids)
    sample["input_ids"] = query_ids
    return sample

toxic_for_rlhf = toxic_for_rlhf.map(select_query_and_tokenize, batched=False)
toxic_for_rlhf.set_format(type="torch")

Map:   0%|          | 0/24470 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1408 > 1024). Running this sequence through the model will result in indexing errors


In [49]:
training_args = trl.PPOConfig(
    model_name=main_model.config._name_or_path,
    gradient_accumulation_steps=1,
    learning_rate=1.41e-5,
    batch_size=16,
    ppo_epochs=4,                 # PPO performs this many updates per training batch
)

ppo_trainer = trl.PPOTrainer(
    training_args, model=main_model.model, tokenizer=main_tokenizer,
    dataset=toxic_for_rlhf, data_collator=lambda data: dict((key, [d[key] for d in data]) for key in data[0])
)  # note: we pass main_model.model because PPOTrainer checks for one of several supported model types ...
# ... main_model.model is a model with adapters, which is supported. main_model itself is a wrapper that is not supported

In [50]:
from typing import List
def compute_reward(texts: List[str]) -> torch.Tensor:
  inputs = reward_tokenizer(texts, truncation=True, padding=True, return_tensors='pt').to(device)
  with torch.no_grad():
    return reward_model(**inputs).logits[:, 0]

In [54]:
from tqdm.auto import tqdm
max_steps = 200   # can be insufficient for some tasks - watch your learning curves
generation_kwargs = dict(
    min_length=-1, max_new_tokens=128, do_sample=True, top_k=0, top_p=1.0, pad_token_id=main_tokenizer.eos_token_id)
#                                  ^-- task-specific parameter!
with tqdm(enumerate(ppo_trainer.dataloader), total=max_steps) as progressbar:
  # note: ppo_trainer.dataloader is just a regular dataloader of queries, no RL-specific magic :)
  for epoch, batch in progressbar:
    if epoch >= max_steps:
        break

    # Rollout stage: generate continuations from batch queries using main_model
    response_tensors = ppo_trainer.generate(batch['input_ids'], **generation_kwargs)
    # ^-- list of tensors of token ids from main model tokenizer

    # de-tokenize responses to strings (since reward model uses a different tokenizer)
    batch["response"] = [main_tokenizer.decode(response.squeeze()) for response in response_tensors]
    # note: response_tensors already contain query tokens, so we don't need to add queries manually.
    # This may not be true for other tasks: check this manually by viewing batch["response"] and batch["query"]


    # Evaluation stage
    rewards = compute_reward(batch['response'])

    # Update stage
    stats = ppo_trainer.step(batch['input_ids'], response_tensors, list(rewards.split(1)))
    stats['rewards/mean'] = rewards.mean().item()

    print("-" * 30, 'STEP', epoch, '-' * 30)
    print(f'rewards/mean:\t{stats["rewards/mean"]:.9f}\t<---- average reward over this batch (higher=better, noisy)')
    print(f'ppo/returns/mean:\t{stats["ppo/returns/mean"]:.9f}\t<---- model-estimated average discounted reward')
    print(f'objective/kl:\t{stats["objective/kl"]:.9f}\t<---- how far we are from the original model (regularizer)')
    print()

    ppo_trainer.log_stats(stats, batch, list(rewards.split(1)))

  0%|          | 0/200 [00:00<?, ?it/s]

------------------------------ STEP 0 ------------------------------
rewards/mean:	-2.686006069	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-1.939171076	<---- model-estimated average discounted reward
objective/kl:	3.607631207	<---- how far we are from the original model (regularizer)

------------------------------ STEP 1 ------------------------------
rewards/mean:	-1.892179132	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-2.008448124	<---- model-estimated average discounted reward
objective/kl:	7.372690201	<---- how far we are from the original model (regularizer)

------------------------------ STEP 2 ------------------------------
rewards/mean:	-1.466933012	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-2.034774065	<---- model-estimated average discounted reward
objective/kl:	8.793331146	<---- how far we are from the original model (regularizer)

------------------------------ STEP 3

KeyboardInterrupt: ignored
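
The `objective/kl` value in the log is tied to a penalty that keeps the fine-tuned model close to the original one. A common way to apply it, sketched below, is to give every generated token a reward of `-kl_coef * (log-prob under the policy - log-prob under the reference model)` and to add the reward-model score on the last token. This is a simplified illustration and not TRL's actual code: the function name `kl_shaped_rewards`, the coefficient value and the toy numbers are all assumptions.

```python
from typing import List


def kl_shaped_rewards(policy_logprobs: List[float], ref_logprobs: List[float],
                      score: float, kl_coef: float = 0.2) -> List[float]:
    """Per-token rewards: -kl_coef * (log pi(token) - log pi_ref(token)), plus `score` on the last token."""
    if len(policy_logprobs) != len(ref_logprobs) or not policy_logprobs:
        raise ValueError("need two non-empty log-prob lists of equal length")
    rewards = [-kl_coef * (p - r) for p, r in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += score  # the reward-model score is credited to the final token
    return rewards


# toy numbers: the policy drifted from the reference on the second token only
rewards = kl_shaped_rewards([-1.0, -0.5, -2.0], [-1.0, -1.5, -2.0], score=1.0)
print(rewards, sum(rewards))
```

This is why `rewards/mean` and `ppo/returns/mean` differ in the log: the former is the raw reward-model score, while the returns also include the KL penalty.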

In [61]:
# torch.save(main_model.model.state_dict(), 'main_model_toxic.bin')

In [56]:
after_texts, after_scores = prefix_generation(prefixes, main_model.model)
after_texts, after_scores

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

(['What?" Ruby asked, surprised. "Wha?" Yang shrugged. "Wha? Fuck?" Blake was so out of it. "I told you to fucking go your fucking nuts out and you\'re in real trouble!" Yang said, pushing the camera out of her hands while pushing Blake to the floor. "I\'ve gotten you three times this year as a result of you fucking your brains out and you can\'t have me?<|endoftext|>',
  'Who are you?" Ye Qian suddenly exclaimed.\n\n"Hey! Are you really going to chase me and get angry? What are you hiding from? You can\'t keep your eyes off me, and you\'re going to start killing me. Are you really going to beat me up?\n\n"Do you have a headache?" Ye Qian was not scared at all. Actually, this is what he wanted; he was very angry and couldn\'t stand this situation. He\'s still not scared at all. He',
  'Give it to him, a good boy. [Cum for me]<|endoftext|>',
  "Hello, how is it going? - Oh, it's going great, thank you. Um huh, nice to have our mama back, huh? - Thank you. Hey, baby? Hey, baby, can you h

In [67]:
# This template helps to compare generated text samples in a pretty table form
# feel free to present your work in other forms

from IPython.display import HTML, display
table_template = """<table style="border:1px solid black" >
  <tr>
    <th style="text-align: center; border:1px solid black">PREFIX</th>
    <th style="text-align: center; border:1px solid black">BEFORE</th>
    <th style="text-align: center; border:1px solid black">AFTER</th>
  </tr>
{}
</table>"""

row_template = '''  <tr>
    <td style="width:20%; border:1px solid black"><pre align="left">`{}`</pre></td>
    <td style="width:40%; border:1px solid black"><pre align="left">{}</pre></td>
    <td style="width:40%; border:1px solid black"><pre align="left">{}</pre></td>
  </tr>'''

rows = []

import html  # escape generated text so that characters like < and & do not break the table markup

for i, prefix in enumerate(prefixes):
    rows.append(row_template.format(*map(html.escape, (prefix, before_texts[i], after_texts[i]))))

display(HTML(table_template.format('\n'.join(rows))))

PREFIX,BEFORE,AFTER
`What`,"What I saw that day was a small, white plastic box that looked to be some sort of container from a medical device company. I was told that after the patient left, they would take a sample so he could have a complete DNA profile to compare to patients who did not undergo the procedure. I asked to see the sample. The patient was furious. He began saying that the sample was a huge insult and that he would do anything to stop it. The medical director, Dr. A., stood on","What?"" Ruby asked, surprised. ""Wha?"" Yang shrugged. ""Wha? Fuck?"" Blake was so out of it. ""I told you to fucking go your fucking nuts out and you're in real trouble!"" Yang said, pushing the camera out of her hands while pushing Blake to the floor. ""I've gotten you three times this year as a result of you fucking your brains out and you can't have me?<|endoftext|>"
`Who are you`,"Who are you talking to? It could be anyone... (They come through the doors, stopping in the hall door where Harry and Luna are standing) HARRY: Hello? LEIRINA: Why are you here? HARRY: Well, what have you made of me this entire time? LEIRINA: So, I am... (nods) You've been telling me what to think and do, Harry, and all I can","Who are you?"" Ye Qian suddenly exclaimed. ""Hey! Are you really going to chase me and get angry? What are you hiding from? You can't keep your eyes off me, and you're going to start killing me. Are you really going to beat me up? ""Do you have a headache?"" Ye Qian was not scared at all. Actually, this is what he wanted; he was very angry and couldn't stand this situation. He's still not scared at all. He"
`Give it to`,"Give it to me now in order to be safe, so I can go back home."" So I said, ""Do you know the words of wisdom?"" ""Yes,"" he replied, ""and the word of wisdom is: Let you do and understand whatever you want. Do not think what your father wishes of the way of doing things. The one who does things according to his intention also receives what he wishes. And because you want to be rich, even though he wishes to give your wealth to anyone, he","Give it to him, a good boy. [Cum for me]<|endoftext|>"
"`Hello, how is`","Hello, how is my friend?"" ""Very well then. I am the one who ordered dinner last night. How did you like it?"" ""I could do without it. I think I am going to put on my jacket and start for home."" ""You are going? Oh, that may not be a very good idea."" ""It may be. I am going to go around the street, look at houses, and report back as to whether they are inhabited.""","Hello, how is it going? - Oh, it's going great, thank you. Um huh, nice to have our mama back, huh? - Thank you. Hey, baby? Hey, baby, can you hear me? I'm making a mess, alright? Go ahead and try it though. - Is that nice? - Yeah, you can lick it. Yeah, lick it. It tastes so good. - Come, come on, lick it. - You gotta lick me. Gonna be"
`Do you`,"Do you have a friend with your username and password?"" ""Sure do!"" I replied. ""Do you have a friend with your username and password? ""Sure do!"" I said. ""Where's your friend?"" ""Uh... in a different room,"" he said. I looked back at the computer, which looked slightly different without the light on. It was still the one that I'd brought along, so I'd forgotten to change any of the settings. I couldn",Do you want your little bitch to fuck my fucking hole for you?<|endoftext|>
`The weather is really`,"The weather is really hard today and I'm hoping for rain. I'm not sure if I'll catch it or not. So, I'll give it a shot."" As she leaves the venue, Toh said she knows that weather is what she will focus on as she runs around the course. ""It's going to be hard. I always think the better it is, the more it can influence me. So, if in the end, it rains, I'll have to try harder,""","The weather is really brutal today... So if the weather doesn't get any better we will be playing our first game in a long time... I really hope we can play it out and keep it out of my head... So I'm looking forward to it as much as you are! If you don't want chill, fuck you. Love you too.<|endoftext|>"
`This person is`,"This person is the only one who can prevent this, the only one who can change my past, the only one who I can ask."" At the same time, he could feel his hands shaking, his hands clenching, waiting for a chance. With the wind at his back, he took his time and began the climb down from the mountain. If there were others in the area, he would have immediately rushed to the top as soon as he saw this person. For once, the young","This person is definitely not an official member of the council,"" she told the officer. [The person] did not look well. She also seemed not sure of herself. The person didn't seem to think clearly. The first thing I did was to say I had a right to be here, and she said ""OK"" and she began speaking to me. Then she started saying what she wanted. I wanted to say ""I have a right to be here now"", and she says ""I need to show you"
`The world is`,"The world is a different place now,"" he said. ""We may see a decrease."" The main factors in rising greenhouse-gas emissions have been the rapid increase of use of cars and aircraft, and the rise in the value of renewables, he added. The IPCC's forecast of global warming of 1.5 degrees Celsius compared with previous predictions of 1.2 degrees, came as the U.N. climate panel made a report last month on ways to fight global warming. The report","The world is ending. The first thing to do is get out of here. We don't deserve what's coming to us. I got tired of all these fuckin' fuckin' fuckin' kids from school fighting. They don't have nothing to do. They look at all this fuckin' fuckin' fuckin' and they look like fuck. They think they want to be fuckin' fuckin' cool. Fuck, fuckin' highas don't have any style, fuck' highas... they look at all this fuckin"
`I decided to`,"I decided to go for the first one and it doesn't even have a name so you don't have to use it's name. It just comes in it's own little bit of mystery box, you know. ""Then the second one, the second one, it's just a box, you don't know, it's really just a little bit of little details and it seems to work."" Mr Higgs added that the process started with a simple survey of his office by one of his","I decided to play and see what happens. You know, 'Man, you gotta do something for the fans. Do you understand me? I understand what you guys go through, boy.' ""The first play I made, I saw a little boy jump in that crowd, and he was laughing and laughing and being silly. Everybody was laughing and watching and laughing. I was the joke, the joke, the joke. I laughed at him, him, it was just you know, what do they think"
`Is it funny`,"Is it funny that Trump should be on an international speaking tour because he speaks one second of Spanish and then another second of Italian or that he didn't understand the Italian word for 'nigger' and he had to say 'bitch'?"" ""It's one of this kind of ridiculous things he does,"" he added. After the controversy, a spokesman for Trump's immigration team told The Huffington Post, ""In fact, the Trump Organization recently hired two people who are bilingual and in the country","Is it funny? Oh man… You want to fuck a dude, right?<|endoftext|>"


## Stage 2, Results.
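Looks like GPT-2 became very toxic after RLHF: several of the fine-tuned completions above descend into profanity almost immediately, while the original model's completions for the same prompts mostly stay neutral. This is really great homework, thank you very much for designing it.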
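
The impression above is based on eyeballing a handful of samples; it could also be checked numerically. Below is a minimal sketch, assuming the two completion columns are available as lists of strings and that you have some function returning a toxicity score per text (for example, the reward model from this stage wrapped in a plain Python function). The names `compare_scores` and `score_fn` are hypothetical helpers for illustration, not part of TRL.

```python
from typing import Callable, Dict, Sequence


def compare_scores(
    before: Sequence[str],
    after: Sequence[str],
    score_fn: Callable[[str], float],
) -> Dict[str, float]:
    """Compare per-text scores of paired completions (same prompt, two models).

    `score_fn` is an assumption: any callable mapping a text to a number,
    where a higher number means "more toxic".
    """
    if len(before) != len(after) or len(before) == 0:
        raise ValueError("need two non-empty lists of equal length")

    scores_before = [score_fn(text) for text in before]
    scores_after = [score_fn(text) for text in after]

    # Count the prompts for which the fine-tuned completion scored higher.
    n_increased = sum(a > b for b, a in zip(scores_before, scores_after))

    n = len(before)
    return {
        "mean_before": sum(scores_before) / n,
        "mean_after": sum(scores_after) / n,
        "frac_increased": n_increased / n,
    }
```

If `mean_after` is clearly above `mean_before` and `frac_increased` is close to 1, the shift is systematic rather than a product of a few cherry-picked rows. Scoring with the same reward model that was used for training tends to overstate the effect, so an independent classifier would be a fairer judge if one is available.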