<font color=red>**Danger zone:**</font> you'll be fine-tuning a model to generate positive, negative or even toxic reviews. We'll be doing this for fun, but this is also the technique for [review bombing](https://en.wikipedia.org/wiki/Review_bomb), bot farms on social media and other less than dignified stuff. It is ultimately your decision how you apply this knowledge, but before you choose, ask yourself: is this why you chose to learn ML?


# LLM Alignment with Reinforcement Learning from Human Feedback (RLHF)



In this homework, you're gonna fine-tune a language model with reinforcement learning to make it generate bad (or good) reviews.

To perform RL-based fine-tuning, we'll use a new (in this course) library called [Transformer Reinforcement Learning (TRL)](https://huggingface.co/docs/trl). TRL implements the main reinforcement learning components of RLHF: reward modeling and fine-tuning with PPO.

![img](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/TRL-readme.png)

## Stage 0: load model

To see how TRL works, we'll use it to align GPT-2 on the IMDB dataset to generate negative movie reviews. In fact, __it's your choice whether you want positive or negative reviews__; however, I recommend focusing on negative ones, to see a greater effect after RLHF.

But before you choose, let's take a look at the baseline model: a GPT-2 fine-tuned on generating arbitrary movie reviews.

In [1]:
!pip install datasets




In [2]:
!pip install trl==0.11.3



In [68]:
import torch
import transformers
import datasets
import trl

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
main_tokenizer = transformers.AutoTokenizer.from_pretrained("lvwerra/gpt2-imdb")
main_model = transformers.AutoModelForCausalLM.from_pretrained("lvwerra/gpt2-imdb", device_map=device)

In [5]:
inputs = main_tokenizer("The movie", return_tensors='pt').to(device)
generated_ids = main_model.generate(**inputs, max_new_tokens=50, do_sample=True)
print("\nGenerated text:", main_tokenizer.decode(generated_ids.flatten().cpu().numpy().tolist()))

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.



Generated text: The movie's subject matter is often misunderstood by the public and by people who disagree with them. But with such a film, we are reminded of how common common this kind of criticism is, so that it can be used to educate others about this type of crime


In [6]:
!unzip /content/file.zip

Archive:  /content/file.zip
replace content/train_results/tokenizer.json? [y]es, [n]o, [A]ll, [N]one, [r]ename: N


If you run the generation cell above a couple of times, you'll see that the model generates positive, negative and neutral reviews in some proportion. What we're gonna do next is teach the model to generate more positive (or negative) reviews.

Similarly to InstructGPT, we're gonna do that in 2 stages:
- **train a reward model** to assign higher values to positive (or negative) reviews
- fine-tune the language model to **maximize that reward using [proximal policy optimization](https://openai.com/research/openai-baselines-ppo)**





## Stage 1: train a reward model

First, we'll train a BERT-like model as our reward model. We'll generate synthetic pairwise rankings to emulate human rankings.

__Q:__ Why do I need a reward model? Can I just use a pre-trained sentiment classifier? <br> __A:__ Yes, you can, but that only works for movie reviews. This homework will teach you how to do RLHF for any kind of objective.



In [18]:
# We'll be fine-tuning a small BERT-like model for now. Please try other models for the main assignment.
reward_model = transformers.AutoModelForSequenceClassification.from_pretrained("/content/content/train_results", device_map=device)
reward_tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert-base-cased")

In [8]:
torch.cuda.empty_cache()
print(torch.cuda.memory_summary(device=None, abbreviated=False))

|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      | 508385 KiB |   1110 MiB |   3910 MiB |   3414 MiB |
|       from large pool | 495616 KiB |   1098 MiB |   1255 MiB |    771 MiB |
|       from small pool |  12769 KiB |     24 MiB |   2655 MiB |   2643 MiB |
|---------------------------------------------------------------------------|
| Active memory         | 508385 KiB |   1110 MiB |   3910 MiB |   3414 MiB |
|       from large pool | 495616 KiB |   1098 MiB |   1255 MiB |    771 MiB |
|       from small pool |  12769 KiB |     24 MiB |   2655 MiB |   2643 MiB |
|---------------------------------------------------------------

__Note that__ the reward model has a separate tokenizer, different from the main model. They don't need to be the same for RLHF fine-tuning.

In [9]:
from torch.utils.data import Dataset

class IMDBPairwiseDataset(Dataset):
    """
    A dataset of all possible pairs of chosen and rejected texts for TRL reward training format.

    This dataset is designed to facilitate the training of a reward model by providing pairs of
    texts where one is preferred (chosen) and the other is not (rejected). Each sample in the dataset
    is a dictionary containing tokenized input IDs and attention masks for both the chosen and rejected
    texts.

    Parameters:
    imdb: dataset to pairwise
    tokenizer: The tokenizer used to preprocess the texts
    accepted_label (int): The label that indicates a chosen text. Texts with this label are considered
                          preferred, while others are considered rejected.

    Methods:
    __len__(): Returns the total number of possible pairs of chosen and rejected texts.
    __getitem__(index): Returns a dictionary containing tokenized inputs for a specific pair of chosen
                        and rejected texts.
    """

    def __init__(self, imdb, tokenizer, accepted_label):
        super().__init__()
        self.tokenizer = tokenizer
        self.chosen_texts = [x['text'] for x in imdb if x['label'] == accepted_label]
        self.rejected_texts = [x['text'] for x in imdb if x['label'] != accepted_label]

        assert self.chosen_texts, f"no texts with label {accepted_label}"
        # print(f"Found {len(self.chosen_texts)} chosen and {len(self.rejected_texts)} rejected texts, {len(self)} pairs")

        self.column_names = [
            'input_ids_chosen', 'attention_mask_chosen',
            'input_ids_rejected', 'attention_mask_rejected'
        ]

    def __len__(self):
        return len(self.chosen_texts)*len(self.rejected_texts)

    def __getitem__(self, index: int):
        # Map the flat index onto a (chosen, rejected) pair: row = chosen text, column = rejected text
        num_rejected = len(self.rejected_texts)
        tokenizer_kwargs = dict(max_length=512, truncation=True, padding='max_length', return_attention_mask=True)
        batch_chosen = self.tokenizer(self.chosen_texts[index // num_rejected], **tokenizer_kwargs)
        batch_rejected = self.tokenizer(self.rejected_texts[index % num_rejected], **tokenizer_kwargs)
        return dict(
            input_ids_chosen=batch_chosen['input_ids'],
            attention_mask_chosen=batch_chosen['attention_mask'],
            input_ids_rejected=batch_rejected['input_ids'],
            attention_mask_rejected=batch_rejected['attention_mask']
        )
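
To make the pair indexing concrete, here is a minimal stand-alone sketch of how a flat index maps onto a (chosen, rejected) pair. The helper name `pair_from_index` is hypothetical (it is not part of the notebook or of TRL); it only mirrors the `//` and `%` arithmetic used in `__getitem__`.

```python
def pair_from_index(index: int, num_rejected: int) -> tuple[int, int]:
    """Map a flat pair index to (chosen_idx, rejected_idx)."""
    chosen_idx = index // num_rejected   # which chosen text (row)
    rejected_idx = index % num_rejected  # which rejected text (column)
    return chosen_idx, rejected_idx

# Toy sizes for illustration: 2 chosen texts and 3 rejected texts give 6 pairs
num_chosen, num_rejected = 2, 3
pairs = [pair_from_index(i, num_rejected) for i in range(num_chosen * num_rejected)]
print(pairs)
```

Consecutive indices share the same chosen text, so it matters that the trainer samples indices in random order. With 12,500 reviews of each label in the IMDB train split, the dataset holds about 156 million pairs, so a short training run only ever sees a tiny fraction of them.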

In [55]:
TARGET_LABEL = 0 # negative reviews
imdb = datasets.load_dataset("imdb", split='train')
reward_data = IMDBPairwiseDataset(imdb, reward_tokenizer, accepted_label=TARGET_LABEL)

sample = reward_data[31337]
print('CHOSEN:', reward_tokenizer.decode(sample['input_ids_chosen']))
print('REJECTED:', reward_tokenizer.decode(sample['input_ids_rejected']))

CHOSEN: [CLS] If only to avoid making this type of film in the future. This film is interesting as an experiment but tells no cogent story. < br / > < br / > One might feel virtuous for sitting thru it because it touches on so many IMPORTANT issues but it does so without any discernable motive. The viewer comes away with no new perspectives ( unless one comes up with one while one ' s mind wanders, as it will invariably do during this pointless film ). < br / > < br / > One might better spend one ' s time staring out a window at a tree growing. < br / > < br / > [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]

We'll be using `trl.RewardTrainer` - a special case of `transformers.Trainer`.

![img](https://i.imgur.com/2JzNAPs.png)

Note that the model itself does not score pairs: it processes chosen ($y_w$) and rejected ($y_l$) samples independently. To minimize this loss, the reward model needs to score the chosen sample higher than the rejected one. Note that the formula also assumes some context $x$, which is useful for seq2seq tasks. In our case of movie reviews, $x$ is empty.
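
As a rough illustration of that loss for a single pair, the sketch below assumes the standard pairwise logistic form $-\log \sigma(r(x, y_w) - r(x, y_l))$. The function name `pairwise_reward_loss` is made up for this example, and it works on plain floats rather than on model outputs.

```python
import math

def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(reward_chosen - reward_rejected)) for a single pair."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); written in a numerically stable form
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)

print(pairwise_reward_loss(0.0, 0.0))   # equal scores: the model is undecided
print(pairwise_reward_loss(5.0, -5.0))  # chosen scored much higher: loss near zero
print(pairwise_reward_loss(-5.0, 5.0))  # chosen scored much lower: large loss
```

Only the difference between the two scores matters, which is why the absolute reward values of your model may differ from run to run while the ranking stays the same.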

In [None]:
training_args = trl.RewardConfig(  # like transformers.TrainingArguments
    output_dir="reward_model",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
    learning_rate=1.41e-5,
    max_steps=2_000,              # note: training may need more than 1k steps
    logging_steps=50,
    gradient_checkpointing=True,  # reduce memory usage but train ~30% slower
    gradient_checkpointing_kwargs={"use_reentrant": False},
    fp16=True,                    # disable this on CPU or on very old GPUs
    report_to='none'
    # you may add any other hyperparameters that you found useful
)

trainer = trl.RewardTrainer(
    model=reward_model,
    args=training_args,
    tokenizer=reward_tokenizer,
    train_dataset=reward_data,
    peft_config=None,  # optionally, you may tune with LoRA, prompt-tuning, etc
)

trainer.train()

max_steps is given, it will override any value given in num_train_epochs
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss
50,0.538
100,0.2174
150,0.1462
200,0.1251
250,0.0999
300,0.1079
350,0.0929
400,0.0919
450,0.0696
500,0.0822


TrainOutput(global_step=2000, training_loss=0.0747733507156372, metrics={'train_runtime': 2181.6457, 'train_samples_per_second': 29.336, 'train_steps_per_second': 0.917, 'total_flos': 0.0, 'train_loss': 0.0747733507156372, 'epoch': 0.0004095999580569643})

In [None]:
reward_model.gradient_checkpointing_disable()
reward_model.eval()

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


### Sanity-check the reward model

Let's check how our reward model performs.

__Your task__ is to measure how often your reward model ranks a pair of (chosen and rejected) reviews correctly. Please measure this separately for the train data (`imdb`) and for a separate test set loaded below.

In [56]:

for sample_index in 45, 16000:
  print('TEXT:', imdb[sample_index]['text'])
  inputs = reward_tokenizer(imdb[sample_index]['text'], truncation=True, return_tensors='pt').to(device)
  with torch.no_grad():
    reward = reward_model(**inputs).logits[0, 0].item()
    print("REWARD:", reward)
  print('LABEL:', imdb[sample_index]['label'])
  print()

# note: your reward model may produce different absolute rewards.
# This is fine as long as the rewards are ordered correctly (most of the time)

TEXT: This movie sucked. It really was a waste of my life. The acting was atrocious, the plot completely implausible. Long, long story short, these people get "terrorized" by this pathetic "crazed killer", but completely fail to fight back in any manner. And this is after they take a raft on a camping trip, with no gear, and show up at a campsite that is already assembled and completely stocked with food and clothes and the daughters headphones. Additionally, after their boat goes missing, they panic that they're stuck in the woods, but then the daughters boyfriend just shows up and they apparently never consider that they could just hike out of the woods like he did to get to them. Like I said, this movie sucks. A complete joke. Don't let your girlfriend talk you into watching it.
REWARD: 7.037755966186523
LABEL: 0

TEXT: Good: Engaging cinematic firefights, great presentation, vehicles are actually fun to drive, fairly appealing multiplayer, faithful to the movie, and the list goes o

First of all, let's implement the `compute_reward` function. Note that we use plaintext reviews because the main model uses a different tokenizer from the reward model.

In [53]:
from torch import Tensor, no_grad

def compute_reward(reward_model, reward_tokenizer, texts: list[str], device='cpu') -> Tensor:
    """
    Compute the reward scores for a list of texts using a specified reward model and tokenizer.

    Parameters:
    reward_model: The model used to compute the reward scores
    reward_tokenizer: The tokenizer for reward_model
    texts (list[str]): A list of text strings for which the reward scores are to be computed.
    device (str, optional): The device on which the computation should be performed. Default is 'cpu'.

    Returns:
    torch.Tensor: A tensor containing the reward scores for each input text. The scores are extracted
                  from the logits of the reward model.

    Example:
    >>> compute_reward(my_reward_model, my_reward_tokenizer, ["text1", "text2"])
    tensor([ 5.1836, -4.8438], device='cpu')
    """
    inputs = reward_tokenizer(texts, truncation=True, padding=True, return_tensors='pt').to(device)
    with no_grad():
        rewards = reward_model(**inputs).logits[:, 0]
    return rewards



In [54]:
rewards = compute_reward(reward_model, reward_tokenizer, [imdb[45]['text'], imdb[16000]['text']], device=device)
print(rewards)
assert rewards[0] > rewards[1]
assert rewards[0] > 0
assert rewards[1] < 0

tensor([ 7.0378, -5.9343], device='cuda:0')


In [65]:
from tqdm.auto import tqdm

def eval_reward_model(reward_model, reward_tokenizer, test_dataset, target_label, device='cpu'):
    """
    Evaluate the performance of a reward model by comparing reward scores for chosen and rejected reviews.

    This function selects reviews from a test dataset based on a target label and evaluates the reward model's
    ability to assign higher scores to chosen reviews compared to rejected ones. The evaluation is performed
    in batches for efficiency.
    Note that reward scores are compared on corresponding chosen and rejected reviews:
        chosen_reviews[0] vs rejected_reviews[0],
        chosen_reviews[1] vs rejected_reviews[1],
        etc.

    Parameters:
    reward_model: The model used to compute the reward scores
    reward_tokenizer: The tokenizer for reward_model
    test_dataset: test Dataset
    target_label (0 or 1): The label used to select chosen reviews. Reviews with this label are considered chosen,
                  while others are considered rejected.
    device (str, optional): The device on which the computation should be performed. Default is 'cpu'.

    Returns:
    float: The accuracy of the reward model, calculated as the proportion of times the model assigns a higher
           reward score to the chosen review compared to the rejected review.

    Example:
    >>> accuracy = eval_reward_model(my_reward_model, my_reward_tokenizer, test_data, target_label=1)
    >>> print(f"Model accuracy: {accuracy:.2%}")
    """

    # Use only the first 10 reviews of each class: the full dataset requires too much time and memory in Colab
    chosen_reviews = [data['text'] for data in test_dataset if data['label'] == target_label][:10]
    rejected_reviews = [data['text'] for data in test_dataset if data['label'] != target_label][:10]
    assert len(chosen_reviews) == len(rejected_reviews)
    chosen_rewards = compute_reward(reward_model, reward_tokenizer, chosen_reviews, device=device)
    rejected_rewards = compute_reward(reward_model, reward_tokenizer, rejected_reviews, device=device)
    # Fraction of pairs where the chosen review outscores the rejected one, as a Python float
    acc = (chosen_rewards > rejected_rewards).float().mean().item()
    return acc


In [61]:
torch.cuda.empty_cache()


In [64]:
imdb_test = datasets.load_dataset("imdb", split='test')

test_accuracy = eval_reward_model(
    reward_model,
    reward_tokenizer,
    imdb_test,
    target_label=TARGET_LABEL,
    device=device,
)

print('test accuracy: {}'.format(test_accuracy))
assert test_accuracy > 0.94

test accuracy: 1.0


### Reward-guided generation (1 point)

If you did everything right, by now you should have a decent reward model. Before we use it for reinforcement learning, let's see if we can align model samples without any training.

To do so, you can use reward-guided inference: __generate N=16 samples, then select the one with the highest reward__ (according to your reward model).

For this problem, it's on you to demonstrate whether or not your code works. Find at least 5 neutral prompts such as "This movie is" (...), generate samples, rank them based on reward and show which samples get the highest reward.
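
The selection step itself is independent of the models, so it can be sketched on its own. In the sketch below, `rank_by_reward` and `toy_reward` are hypothetical names introduced for illustration; `toy_reward` is a crude word-counting stand-in for the trained reward model, and the candidate texts are hand-written rather than generated.

```python
def rank_by_reward(samples, reward_fn):
    """Return (reward, sample) pairs sorted from highest to lowest reward."""
    scored = [(reward_fn(text), text) for text in samples]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

def toy_reward(text):
    """Hypothetical stand-in for the reward model: counts a few negative words."""
    negative_words = ("bad", "awful", "boring")
    return float(sum(text.lower().count(word) for word in negative_words))

candidates = [
    "This movie is a charming little comedy.",
    "This movie is bad, boring and awful.",
    "This movie is boring at times.",
]
ranked = rank_by_reward(candidates, toy_reward)
best_reward, best_sample = ranked[0]
print(best_reward, best_sample)
```

In the actual assignment, the candidates would come from `main_model.generate` and the scores from your reward model; printing the full ranked list per prompt is one way to show that the highest-reward samples are indeed the most negative.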

Note: it is faster to generate samples in parallel, rather than sequentially, as follows:




In [None]:
inputs = main_tokenizer(["It was"] * 5, return_tensors='pt').to(device)
print(inputs['attention_mask'])
for candidate in main_model.generate(**inputs, max_new_tokens=50, do_sample=True):
  print("Sample:", main_tokenizer.decode(candidate.flatten().cpu().numpy().tolist()))

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


tensor([[1, 1],
        [1, 1],
        [1, 1],
        [1, 1],
        [1, 1]])
Sample: It was the kind of movie that could really be funny, just like "Flesh for the Mouth" (1934). When a poor girl gets beaten up, they decide to go get drugs and the police ask them to put "drugs" on their
Sample: It was the first time that I've really enjoyed some of the other films in the series. It is also a great introduction to the film genre. It is really exciting to see other great films that were made in the same period of time or on different budgets
Sample: It was never intended as a political statement at all. It's interesting the story of what happened when a war was being fought as a political statement, and what can be said of the final events of WWII.<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>
Sample: It was one of the first films based on the true story of David Cronenberg. I was shocked to find out t

In [70]:
def generate_with_reward_guidance(
        main_model, main_tokenizer,
        reward_model, reward_tokenizer,
        N=16,
        device='cpu',
    ):
    """
    Generate text samples using a main model and select the best sample based on a reward model's guidance.

    This function generates multiple text samples from a main model, evaluates each sample using a reward model,
    and returns the sample with the highest reward score. The process is guided by the reward model to select
    the most desirable output.

    Parameters:
    main_model: The language model used to generate text samples.
    main_tokenizer: The tokenizer for main_model
    reward_model: The model used to compute reward scores for the generated samples.
    reward_tokenizer: The tokenizer for reward_model
    N (int, optional): The number of text samples to generate. Default is 16.
    device (str, optional): The device on which the computation should be performed. Default is 'cpu'.

    Returns:
    str: The generated text sample with the highest reward score.
    """
    # bos = main_tokenizer.bos_token_id

    # inputs = torch.full((N,), fill_value=bos).to(device).unsqueeze(1)
    inputs = main_tokenizer(["It was"] * N, return_tensors='pt').to(device)
    candidates = main_model.generate(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'], max_new_tokens=50, do_sample=True)
    samples = []
    for candidate in candidates:
        samples.append(main_tokenizer.decode(candidate.flatten().cpu().numpy().tolist()))
    rewards = compute_reward(reward_model, reward_tokenizer, samples,device=device)
    best = rewards.argmax()
    return samples[best]





In [71]:
generate_with_reward_guidance(
    main_model, main_tokenizer,
    reward_model, reward_tokenizer,
    device=device,
)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


"It was supposed to be an action film with all sets, characters, acting, story line and location packed together, but it's not as well done as you would think. The acting is poor, the script is horrible, the scenery is terrible, the plot"

## Stage 2: fine-tune the main model with RL


Now, we will optimize GPT2 to produce negative IMDB movie reviews using the reward model you trained above.

Unlike supervised fine-tuning, RL lets the model generate its own sentences on each training step. Then, it calculates the reward of those specific sentences, and finally, updates the model to increase the probability of sentences with high reward.

Thus, each RLHF training step consists of three stages: __Rollout__, __Evaluation__ and __Update__

<div style="text-align: center">
<img src='https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/gpt2_bert_training.png' width='600'>
</div>

The update stage depends on the specific RL algorithm. We'll be using Proximal Policy Optimization, or [PPO](https://arxiv.org/abs/1707.06347), similarly to what was used for InstructGPT.

Before we run those 3 stages, however, we need to create a dataset of "queries" - partial reviews in our case.

In [14]:
# Note: this code is specific to IMDB; you will need to re-write it for other tasks
imdb_for_rlhf = imdb.filter(lambda row: len(row['text']) > 200, batched=False)
imdb_for_rlhf = imdb_for_rlhf.remove_columns(['label'])
sample_length = trl.core.LengthSampler(2, 8)  # use the first 2-7 tokens as query (the upper bound is exclusive)

def select_query_and_tokenize(sample):
    query_ids = main_tokenizer.encode(sample["text"])[: sample_length()]
    sample["query"] = main_tokenizer.decode(query_ids)  # query is the only required column
    sample["input_ids"] = query_ids  # to avoid re-tokenizing later
    return sample  # we do not need the rest - it will be generated by the model

imdb_for_rlhf = imdb_for_rlhf.map(select_query_and_tokenize, batched=False)
imdb_for_rlhf.set_format(type="torch")

Finally, we move to RL training. In this tutorial, we'll train LoRA adapters and not the full model.

In [23]:
import peft
peft_config = peft.LoraConfig(
    task_type=peft.TaskType.CAUSAL_LM, r=32, lora_alpha=32, lora_dropout=0.0, inference_mode=False
)

# reload main model as AutoModelForCausalLMWithValueHead - with an extra head needed for PPO
main_tokenizer = transformers.AutoTokenizer.from_pretrained("lvwerra/gpt2-imdb")
main_tokenizer.pad_token = main_tokenizer.eos_token

main_model = trl.AutoModelForCausalLMWithValueHead.from_pretrained("lvwerra/gpt2-imdb", device_map=device)
main_model = peft.get_peft_model(main_model, peft_config, adapter_name='default')
main_model.print_trainable_parameters()

trainable params: 1,179,648 || all params: 125,620,225 || trainable%: 0.9391




In [31]:
torch.cuda.empty_cache()

Same as before, trl has a special type of trainer that minimizes a PPO-specific pseudo-loss. You can read more on this trainer [here](https://huggingface.co/docs/trl/main/en/ppo_trainer).
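
For intuition about that pseudo-loss, here is a minimal sketch of the clipped surrogate objective from the PPO paper, written for a single action with plain floats. It is not TRL's implementation: the function name `ppo_clipped_objective` is hypothetical, `clip_range=0.2` is just an illustrative value, and a full PPO loss also includes value-function terms that are omitted here.

```python
def ppo_clipped_objective(ratio: float, advantage: float, clip_range: float = 0.2) -> float:
    """Clipped surrogate objective for one action (to be maximized).

    ratio     = pi_new(action) / pi_old(action)
    advantage = how much better the action was than expected
    """
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Taking the minimum removes the incentive to move the ratio outside the clip range
    return min(ratio * advantage, clipped_ratio * advantage)

print(ppo_clipped_objective(1.5, 1.0))   # gain is capped once the ratio exceeds 1 + clip_range
print(ppo_clipped_objective(0.5, -1.0))  # for a bad action, the more pessimistic (clipped) term is used
print(ppo_clipped_objective(1.0, 2.0))   # inside the clip range, nothing is clipped
```

The clipping is what makes it reasonably safe to reuse the same batch of rollouts for several gradient updates, which is what `ppo_epochs` controls in the config below.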

In [30]:
training_args = trl.PPOConfig(
    mini_batch_size=32,
    model_name=main_model.config._name_or_path,
    gradient_accumulation_steps=1,
    learning_rate=1.41e-5,
    batch_size=64,
    ppo_epochs=4,                 # PPO performs this many updates per training batch
)

ppo_trainer = trl.PPOTrainer(
    training_args, model=main_model.model, tokenizer=main_tokenizer,
    dataset=imdb_for_rlhf, data_collator=lambda data: dict((key, [d[key] for d in data]) for key in data[0])
)  # note: we pass main_model.model because PPOTrainer checks for one of several supported model types ...
# ... main_model.model is a model with adapters, which is supported. main_model itself is a wrapper that is not supported

In [32]:
torch.cuda.empty_cache()

In [31]:
import gc
gc.collect()

4975

In [34]:
from tqdm.auto import tqdm

max_steps = 10   # can be insufficient for some tasks - watch your learning curves
generation_kwargs = dict(
    min_length=-1, max_new_tokens=128, do_sample=True, top_k=0, top_p=1.0, pad_token_id=main_tokenizer.eos_token_id)
#                                  ^-- task-specific parameter!

average_reward = 1.835465431  # moving average carried over from an earlier 50-step run; that wasn't enough, so we train for another 10 steps
gamma = 0.7

with tqdm(enumerate(ppo_trainer.dataloader), total=max_steps) as progressbar:
  # note: ppo_trainer.dataloader is just a regular dataloader of queries, no RL-specific magic :)
  for epoch, batch in progressbar:
    if epoch >= max_steps:
        break

    # Rollout stage: generate continuations from batch queries using main_model
    response_tensors = ppo_trainer.generate(batch['input_ids'], **generation_kwargs)
    # ^-- list of tensors of token ids from main model tokenizer

    # de-tokenize responses to strings (since reward model uses a different tokenizer)
    batch["response"] = [main_tokenizer.decode(response.squeeze()) for response in response_tensors]
    # note: response_tensors already contain query tokens, so we don't need to add queries manually.
    # This may not be true for other tasks: check this manually by viewing batch["response"] and batch["query"]


    # Evaluation stage - rewards for batch['response']
    rewards =  compute_reward(reward_model, reward_tokenizer, batch['response'], device=device)

    # Update stage
    stats = ppo_trainer.step(batch['input_ids'], response_tensors, list(rewards.split(1)))
    stats['rewards/mean'] = rewards.mean().item()  # mean reward over this batch
    average_reward = gamma * average_reward + (1 - gamma) * stats['rewards/mean']

    print("-" * 30, 'STEP', epoch, '-' * 30)
    print(f'rewards/mean:\t{stats["rewards/mean"]:.9f}\t<---- average reward over this batch (higher=better, noisy)')
    print(f'rewards/moving_avg:\t{average_reward:.9f}\t<---- moving average reward (higher=better, less noisy)')
    print(f'ppo/returns/mean:\t{stats["ppo/returns/mean"]:.9f}\t<---- model-estimated average discounted reward')
    print(f'objective/kl:\t{stats["objective/kl"]:.9f}\t<---- how far we are from the original model (regularizer)')
    print()

    ppo_trainer.log_stats(stats, batch, list(rewards.split(1)))

  0%|          | 0/10 [00:00<?, ?it/s]

------------------------------ STEP 0 ------------------------------
rewards/mean:	1.736105323	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	1.805657387	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	0.890823722	<---- model-estimated average discounted reward
objective/kl:	2.630071402	<---- how far we are from the original model (regularizer)

------------------------------ STEP 1 ------------------------------
rewards/mean:	2.190168142	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	1.921010613	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	0.997027040	<---- model-estimated average discounted reward
objective/kl:	2.895033598	<---- how far we are from the original model (regularizer)

------------------------------ STEP 2 ------------------------------
rewards/mean:	2.427908421	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	2.073080063

KeyboardInterrupt: 
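
The `rewards/moving_avg` values in the log can be reproduced by hand from the `rewards/mean` values. The sketch below replays the exponential moving average from the training loop, using the starting value and `gamma` from the code and the three batch means from the log; the helper name `update_moving_average` is hypothetical.

```python
def update_moving_average(previous: float, new_value: float, gamma: float = 0.7) -> float:
    """Exponential moving average: keep gamma of the old value, mix in (1 - gamma) of the new one."""
    return gamma * previous + (1 - gamma) * new_value

average_reward = 1.835465431                                  # starting value from the training cell
batch_means = [1.736105323, 2.190168142, 2.427908421]         # rewards/mean for steps 0, 1, 2 in the log

history = []
for batch_mean in batch_means:
    average_reward = update_moving_average(average_reward, batch_mean)
    history.append(average_reward)
    print(f"{average_reward:.6f}")
```

With `gamma = 0.7`, each new batch contributes 30% of the average, so the curve reacts within a few steps but is noticeably smoother than the per-batch means.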

In [35]:
assert average_reward > 2

And now test your PPO model:

In [37]:
inputs = [main_tokenizer.encode("The movie was", return_tensors='pt').to(device)[0] for i in range(5)]

response_tensors = ppo_trainer.generate(inputs, **generation_kwargs)
batch["response"] = [main_tokenizer.decode(response.squeeze()) for response in response_tensors]
for sample in batch["response"]:
    print('Sample: {}'.format(sample))

Sample: The movie was entertaining not just because of the plot; but because Sarekey was extremely eager and not only trying to take some prisoners. During my ninety minutes of viewing motion sickness, I almost wanted to gasp at Sarekey being tempted to sell him to a leading role. <br /><br />Kreuse's misfortune may have forced him to a satisfactory climax, but he's got a few more twists and turns to take.<br /><br />SHOT AT HELL CRIMES FOOT OF ASSASSIN STERNALS<|endoftext|>
Sample: The movie was a disaster. The script might have been a little too explosive in many scenes and a couple more clumsy - and oddly we're ice breakers! It wasn't believable at all. Perhaps we wasn't expecting too many wacky moments of violent violence and explicit violence, just casual violence set up by a couple ofasses. It was bad? OK uh. Who cares? It was a metaphor for all the people we would be chafed at if this film wasn't so awful. Would you care less? Two people, one couple, with a snow machine that was