<font color=red>**Danger zone:**</font> you'll be fine-tuning a model to generate positive, negative or even toxic reviews. We'll be doing this for fun, but this is also the technique for [review bombing](https://en.wikipedia.org/wiki/Review_bomb), bot farms on social media and other less than dignified stuff. It is ultimately your decision how you apply this knowledge, but before you choose, ask yourself: is this why you chose to learn ML?


# LLM Alignment with Reinforcement Learning from Human Feedback (RLHF)



In this homework, you're gonna fine-tune a language model with reinforcement learning to make it generate bad (or good) reviews.

To perform RL-based fine-tuning, we'll use a new (in this course) library called [Transformer Reinforcement Learning (TRL)](https://huggingface.co/docs/trl). TRL implements the main reinforcement learning components of RLHF: reward modeling and fine-tuning with PPO.

![img](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/TRL-readme.png)

## Stage 0: load model

To see how TRL works, we'll use it to align GPT-2 on the IMDB dataset to generate negative movie reviews. In fact, __it's your choice whether you want positive or negative reviews__; however, I recommend focusing on negative ones, since the effect of RLHF is easier to see there.

But before you choose, let's take a look at the baseline model: a GPT-2 fine-tuned on generating arbitrary movie reviews.

In [1]:
!pip install trl peft

Collecting trl
  Downloading trl-0.12.2-py3-none-any.whl.metadata (11 kB)
Collecting peft
  Downloading peft-0.14.0-py3-none-any.whl.metadata (13 kB)
Downloading trl-0.12.2-py3-none-any.whl (365 kB)
Downloading peft-0.14.0-py3-none-any.whl (374 kB)
Installing collected packages: trl, peft
Successfully installed peft-0.14.0 trl-0.12.2


In [2]:
import torch
import transformers
import datasets
import trl

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
main_tokenizer = transformers.AutoTokenizer.from_pretrained("lvwerra/gpt2-imdb")
main_model = transformers.AutoModelForCausalLM.from_pretrained("lvwerra/gpt2-imdb", device_map=device)


In [3]:
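import os  # `!export` only changes a throwaway subshell; set the variable in this process instead
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # ideally set before the first CUDA allocation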


In [4]:
inputs = main_tokenizer("The movie", return_tensors='pt').to(device)
generated_ids = main_model.generate(**inputs, max_new_tokens=50, do_sample=True)
print("\nGenerated text:", main_tokenizer.decode(generated_ids.flatten().cpu().numpy().tolist()))

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.



Generated text: The movie, in my opinion, had no story, no plot, no characters, even no plot. It's absolutely pathetic, the acting was very amateurish. I mean how can a movie like this ever be a success? Even if you enjoyed your time


If you run this cell a couple of times, you'll see that the model generates positive, negative and neutral reviews in some proportion. What we're gonna do next is teach the model to generate more positive (or negative) reviews.

Similarly to InstructGPT, we're gonna do that in 2 stages:
- **train a reward model** to assign higher values to positive (or negative) reviews
- fine-tune the language model to **maximize that reward using [proximal policy optimization](https://openai.com/research/openai-baselines-ppo)**



## Stage 1: train a reward model

First, we'll train a BERT-like model as our reward model. We'll generate synthetic pairwise rankings to emulate human rankings.

__Q:__ why do I need a reward model? Can I just use a pre-trained sentiment classifier? <br> __A:__ Yes, you can, but that only works for movie reviews. This homework will teach you how to do RLHF for any kind of objective.



In [5]:
# We'll be fine-tuning a small BERT-like model for now. Please try other models for the main assignment.
reward_model = transformers.AutoModelForSequenceClassification.from_pretrained("allenai/longformer-base-4096", device_map=device)
reward_tokenizer = transformers.AutoTokenizer.from_pretrained("allenai/longformer-base-4096")


Some weights of LongformerForSequenceClassification were not initialized from the model checkpoint at allenai/longformer-base-4096 and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



__Note that__ the reward model has a separate tokenizer, different from the main model. They don't need to be the same for RLHF fine-tuning.

In [6]:
from torch.utils.data import Dataset

class IMDBPairwiseDataset(Dataset):
    """ 
    A dataset of all possible pairs of chosen and rejected texts for TRL reward training format.

    This dataset is designed to facilitate the training of a reward model by providing pairs of
    texts where one is preferred (chosen) and the other is not (rejected). Each sample in the dataset
    is a dictionary containing tokenized input IDs and attention masks for both the chosen and rejected
    texts.

    Parameters:
    imdb: the labeled dataset to build pairs from (each sample has 'text' and 'label')
    tokenizer: The tokenizer used to preprocess the texts
    accepted_label (int): The label that indicates a chosen text. Texts with this label are considered
                          preferred, while others are considered rejected.

    Methods:
    __len__(): Returns the total number of possible pairs of chosen and rejected texts.
    __getitem__(index): Returns a dictionary containing tokenized inputs for a specific pair of chosen
                        and rejected texts.
    """
    
    def __init__(self, imdb, tokenizer, accepted_label):
        super().__init__()
        self.tokenizer = tokenizer
        self.chosen_texts = [sample['text'] for sample in imdb if sample['label'] == accepted_label]
        self.rejected_texts = [sample['text'] for sample in imdb if sample['label'] != accepted_label]

        assert self.chosen_texts, f"no texts with label {accepted_label}"
        # print(f"Found {len(self.chosen_texts)} chosen and {len(self.rejected_texts)} rejected texts, {len(self)} pairs")

        self.column_names = [
            'input_ids_chosen', 'attention_mask_chosen',
            'input_ids_rejected', 'attention_mask_rejected'
        ]

    def __len__(self):
        return len(self.chosen_texts) * len(self.rejected_texts)

    def __getitem__(self, index: int):
        i = index // len(self.rejected_texts)
        j = index % len(self.rejected_texts)
        chosen_text, rejected_text = self.chosen_texts[i], self.rejected_texts[j]
        
        chosen_encodings = self.tokenizer(chosen_text, padding=False, return_tensors="pt")
        rejected_encodings = self.tokenizer(rejected_text, padding=False, return_tensors="pt")
        return dict(
            input_ids_chosen=chosen_encodings['input_ids'].squeeze(0),
            attention_mask_chosen=chosen_encodings['attention_mask'].squeeze(0),
            input_ids_rejected=rejected_encodings['input_ids'].squeeze(0),
            attention_mask_rejected=rejected_encodings['attention_mask'].squeeze(0),
        )
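
The flat index to (chosen, rejected) mapping used in `__getitem__` is easiest to see on concrete numbers. A minimal sketch, assuming the IMDB train split holds 12,500 reviews per label; `pair_indices` is a hypothetical helper written for illustration, not part of the class above:

```python
def pair_indices(index: int, num_rejected: int) -> tuple[int, int]:
    """Map a flat pair index to (chosen index, rejected index), as in __getitem__."""
    return divmod(index, num_rejected)  # (index // num_rejected, index % num_rejected)

num_chosen, num_rejected = 12_500, 12_500   # assumed: IMDB train has 12,500 reviews per label
total_pairs = num_chosen * num_rejected     # what __len__ returns

print(total_pairs)                          # 156250000
print(pair_indices(31337, num_rejected))    # (2, 6337)
```

With this many pairs, 1,000 training steps at batch size 16 cover only 16,000 of them, roughly 0.01% of one epoch.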

In [7]:
TARGET_LABEL = 0 # negative reviews
imdb = datasets.load_dataset("imdb", split='train')
reward_data = IMDBPairwiseDataset(imdb, reward_tokenizer, accepted_label=TARGET_LABEL)

sample = reward_data[31337]
print('CHOSEN:', reward_tokenizer.decode(sample['input_ids_chosen']))
print('REJECTED:', reward_tokenizer.decode(sample['input_ids_rejected']))


CHOSEN: <s>If only to avoid making this type of film in the future. This film is interesting as an experiment but tells no cogent story.<br /><br />One might feel virtuous for sitting thru it because it touches on so many IMPORTANT issues but it does so without any discernable motive. The viewer comes away with no new perspectives (unless one comes up with one while one's mind wanders, as it will invariably do during this pointless film).<br /><br />One might better spend one's time staring out a window at a tree growing.<br /><br /></s>
REJECTED: <s>This movie has some things that are pretty amazing. First, it is supposed to be based on a true story. That, in itself, is amazing that multiple tornadoes would hit the same town at night in the fall-in Nebraska. I wonder if the real town's name was close to "Blainsworth" (which is the town's name in the movie). There is an Ainsworth, Nebraska, but there is also a town that starts with Blains-something.<br /><br />It does show the slowest 

We'll be using `trl.RewardTrainer` - a special case of `transformers.Trainer`.

![img](https://i.imgur.com/2JzNAPs.png)

Note that the model itself does not score pairs: it processes the chosen ($y_w$) and rejected ($y_l$) samples independently. To minimize this loss, the reward model needs to score the chosen sample higher than the rejected one. Note that the formula also assumes some context $x$, which is useful for seq2seq tasks. In our case of movie reviews, $x$ is empty.
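
To make the loss concrete, here is a minimal sketch, assuming the formula pictured above is the usual pairwise logistic loss $-\log \sigma(r(x, y_w) - r(x, y_l))$. The function name `pairwise_reward_loss` and the toy scores are made up for illustration; this is not TRL's internal code.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Mean of -log(sigmoid(r(x, y_w) - r(x, y_l))) over a batch of pairs."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scores: the model is called on chosen and rejected texts separately,
# and only the resulting scalars are compared inside the loss.
r_w = torch.tensor([2.0, 0.5, -1.0])   # scores of chosen samples
r_l = torch.tensor([-1.0, 0.5, 1.0])   # scores of rejected samples

print(pairwise_reward_loss(r_w, r_l))  # current ordering
print(pairwise_reward_loss(r_l, r_w))  # same pairs with the roles swapped: larger loss
```

When both samples get the same score, each pair contributes $\log 2 \approx 0.693$, so an untrained reward model typically starts near that value.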

In [8]:
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none",
    target_modules=["query", "key", "value", "query_global", "key_global", "value_global", "dense"],
    task_type="SEQ_CLS"  # Sequence Classification
)


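# note: this only changes the value stored in the config; it does not resize the
# position embedding table, which keeps its original 4098 rows
new_max_position_embeddings = 890

reward_model.config.max_position_embeddings = new_max_position_embeddings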

In [9]:
reward_tokenizer.model_max_length = 890

In [10]:
training_args = trl.RewardConfig(  # like transformers.TrainingArguments
    output_dir="reward_model",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=1.41e-5,
    max_steps=1_000,              # note: training may need more than 1k steps
    logging_steps=50,
    gradient_checkpointing=True,  # reduce memory usage but train ~30% slower
    gradient_checkpointing_kwargs={"use_reentrant": False},
    fp16=True,                    # disable this on CPU or on very old GPUs
    report_to='none',
    # you may add any other hyperparameters that you found useful
)

trainer = trl.RewardTrainer(
    model=reward_model,
    args=training_args,
    tokenizer=reward_tokenizer,
    train_dataset=reward_data,
    peft_config=peft_config,  # optionally, you may tune with LoRA, prompt-tuning, etc
)

trainer.train()

max_steps is given, it will override any value given in num_train_epochs
Token indices sequence length is longer than the specified maximum sequence length for this model (1403 > 890). Running this sequence through the model will result in indexing errors
You're using a LongformerTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Initializing global attention on CLS token...
Input ids are automatically padded to be a multiple of `config.attention_window`: 512
Could not estimate the number of tokens of the input, floating-point operations will not be computed


| Step | Training Loss |
|------|---------------|
| 50   | 0.6937 |
| 100  | 0.6879 |
| 150  | 0.5988 |
| 200  | 0.1770 |
| 250  | 0.1004 |
| 300  | 0.0859 |
| 350  | 0.0781 |
| 400  | 0.0823 |
| 450  | 0.0591 |
| 500  | 0.0857 |




TrainOutput(global_step=1000, training_loss=0.16167684149742126, metrics={'train_runtime': 11922.4015, 'train_samples_per_second': 1.342, 'train_steps_per_second': 0.084, 'total_flos': 0.0, 'train_loss': 0.16167684149742126, 'epoch': 0.0001024})

In [11]:
reward_model.gradient_checkpointing_disable()
reward_model.eval()

LongformerForSequenceClassification(
  (longformer): LongformerModel(
    (embeddings): LongformerEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
      (position_embeddings): Embedding(4098, 768, padding_idx=1)
    )
    (encoder): LongformerEncoder(
      (layer): ModuleList(
        (0-11): 12 x LongformerLayer(
          (attention): LongformerAttention(
            (self): LongformerSelfAttention(
              (query): lora.Linear(
                (base_layer): Linear(in_features=768, out_features=768, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=768, out_features=16, bias=False)
                )
                (lora_

### Sanity-check the reward model

Let's check how our reward model performs.

__Your task__ is to measure how often your reward model ranks a pair of (chosen and rejected) reviews correctly. Please measure this separately for the train data (`imdb`) and for the separate test set loaded below.

In [12]:

for sample_index in 45, 16000:
  print('TEXT:', imdb[sample_index]['text'])
  inputs = reward_tokenizer(imdb[sample_index]['text'], truncation=True, return_tensors='pt').to(device)
  with torch.no_grad():
    reward = reward_model(**inputs).logits[0, 0].item()
    print("REWARD:", reward)
  print('LABEL:', imdb[sample_index]['label'])
  print()

# note: your reward model may produce different absolute rewards.
# This is fine as long as the rewards are ordered correctly (most of the time)

TEXT: This movie sucked. It really was a waste of my life. The acting was atrocious, the plot completely implausible. Long, long story short, these people get "terrorized" by this pathetic "crazed killer", but completely fail to fight back in any manner. And this is after they take a raft on a camping trip, with no gear, and show up at a campsite that is already assembled and completely stocked with food and clothes and the daughters headphones. Additionally, after their boat goes missing, they panic that they're stuck in the woods, but then the daughters boyfriend just shows up and they apparently never consider that they could just hike out of the woods like he did to get to them. Like I said, this movie sucks. A complete joke. Don't let your girlfriend talk you into watching it.
REWARD: 5.080700397491455
LABEL: 0

TEXT: Good: Engaging cinematic firefights, great presentation, vehicles are actually fun to drive, fairly appealing multiplayer, faithful to the movie, and the list goes o

First of all, let's implement the `compute_reward` function. Note that we use plaintext reviews because the main model uses a different tokenizer from the reward model.

In [23]:
from torch import Tensor, no_grad

def compute_reward(reward_model, reward_tokenizer, texts: list[str], device='cpu') -> Tensor:
    """
    Compute the reward scores for a list of texts using a specified reward model and tokenizer.

    Parameters:
    reward_model: The model used to compute the reward scores
    reward_tokenizer: The tokenizer for reward_model
    texts (list[str]): A list of text strings for which the reward scores are to be computed.
    device (str, optional): The device on which the computation should be performed. Default is 'cpu'.

    Returns:
    torch.Tensor: A tensor containing the reward scores for each input text. The scores are extracted
                  from the logits of the reward model.

    Example:
    >>> compute_reward(my_reward_model, my_reward_tokenizer, ["text1", "text2"])
    tensor([ 5.1836, -4.8438], device='cpu')
    """

    # <YOUR CODE HERE>
    encodings = reward_tokenizer(texts, return_tensors='pt', padding=True).to(device)

    with no_grad():
        # <YOUR CODE HERE>
        reward_scores = reward_model(**encodings).logits[:, 0]
    
    return reward_scores

In [24]:
rewards = compute_reward(reward_model, reward_tokenizer, [imdb[45]['text'], imdb[16000]['text']], device=device)
print(rewards)
assert rewards[0] > rewards[1]
assert rewards[0] > 0
assert rewards[1] < 0

tensor([ 5.0807, -4.0519], device='cuda:0')


In [25]:
from tqdm.auto import tqdm

def eval_reward_model(reward_model, reward_tokenizer, test_dataset, target_label, device='cpu'):
    """
    Evaluate the performance of a reward model by comparing reward scores for chosen and rejected reviews. 

    This function selects reviews from a test dataset based on a target label and evaluates the reward model's
    ability to assign higher scores to chosen reviews compared to rejected ones. The evaluation is performed
    in batches for efficiency.
    Note that reward scores are compared on corresponding chosen and rejected reviews: 
        chosen_reviews[0] vs rejected_reviews[0], 
        chosen_reviews[1] vs rejected_reviews[1],
        etc.

    Parameters:
    reward_model: The model used to compute the reward scores
    reward_tokenizer: The tokenizer for reward_model
    test_dataset: the test dataset (samples with 'text' and 'label' fields)
    target_label (0 or 1): The label used to select chosen reviews. Reviews with this label are considered chosen,
                  while others are considered rejected.
    device (str, optional): The device on which the computation should be performed. Default is 'cpu'.

    Returns:
    float: The accuracy of the reward model, calculated as the proportion of times the model assigns a higher
           reward score to the chosen review compared to the rejected review.

    Example:
    >>> accuracy = eval_reward_model(my_reward_model, my_reward_tokenizer, test_data, target_label=1)
    >>> print(f"Model accuracy: {accuracy:.2%}")
    """
    # <YOUR CODE HERE>

    chosen_reviews = [sample['text'] for sample in test_dataset if sample['label'] == target_label]
    rejected_reviews = [sample['text'] for sample in test_dataset if sample['label'] != target_label]

    assert len(chosen_reviews) == len(rejected_reviews)

    # <YOUR CODE HERE>
    correct_count = 0 
    total_count = len(chosen_reviews)

    batch_size = 8
    for i in tqdm(range(0, total_count, batch_size), desc="Evaluating"):
        batch_chosen = chosen_reviews[i: i + batch_size]
        batch_rejected = rejected_reviews[i: i + batch_size]

        reward_chosen = compute_reward(reward_model, reward_tokenizer, batch_chosen, device=device)
        reward_rejected = compute_reward(reward_model, reward_tokenizer, batch_rejected, device=device)
        
        correct_count += (reward_chosen > reward_rejected).sum().item()
    
    accuracy = correct_count / total_count
    
    return accuracy

In [26]:
imdb_test = datasets.load_dataset("imdb", split='test')

test_accuracy = eval_reward_model(
    reward_model,
    reward_tokenizer,
    imdb_test,
    target_label=TARGET_LABEL,
    device=device,
)

print('test accuracy: {}'.format(test_accuracy))
assert test_accuracy > 0.94


test accuracy: 0.98432


### Reward-guided generation (1 point)

If you did everything right, by now you should have a decent reward model. Before we use it for reinforcement learning, let's see if we can align model samples without any training.

To do so, you can use reward-guided inference: __generate N=16 samples, then select the one with the highest reward__ (according to your reward model).

For this problem, it's up to you to demonstrate whether or not your code works. Find at least 5 neutral prompts such as "This movie is" (...), generate samples, rank them by reward and show which samples get the highest reward.

Note: it is faster to generate samples in parallel, rather than sequentially, as follows:




In [27]:
inputs = main_tokenizer(["It was"] * 5, return_tensors='pt').to(device)
for candidate in main_model.generate(**inputs, max_new_tokens=50, do_sample=True):
  print("Sample:", main_tokenizer.decode(candidate.flatten().cpu().numpy().tolist()))

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Sample: It was a great movie to watch. Some of the dialogue I read there (including the "what if, you came on this island and killed so bad" part of the movie) is very familiar I guess to any real islander. Plus, what I
Sample: It was really a waste of time to try this film. I really wouldn't recommend this film to anyone except friends. Anyone who's at the theatre at school, or the theatre and wants to see this film then this is the movie for you.<br /
Sample: It was really funny. What a great film. I was laughing through it. I know I was watching the ending and all of the tension in this movie was lost in all of what went on in that theater. In the end it turned out that if you
Sample: It was actually very disappointing to see so many of his performances on TV. There were many, many errors in the production design and editing.<br /><br />To summarize, I felt that the show's plot was a very poor attempt at establishing a premise
Sample: It was very difficult to play without becoming a

In [28]:
def generate_with_reward_guidance(
        main_model, main_tokenizer,
        reward_model, reward_tokenizer,
        N=16,
        device='cpu',
    ):
    """
    Generate text samples using a main model and select the best sample based on a reward model's guidance.

    This function generates multiple text samples from a main model, evaluates each sample using a reward model,
    and returns the sample with the highest reward score. The process is guided by the reward model to select
    the most desirable output.

    Parameters:
    main_model: The language model used to generate text samples.
    main_tokenizer: The tokenizer for main_model
    reward_model: The model used to compute reward scores for the generated samples.
    reward_tokenizer: The tokenizer for reward_model
    N (int, optional): The number of text samples to generate. Default is 16.
    device (str, optional): The device on which the computation should be performed. Default is 'cpu'.

    Returns:
    str: The generated text sample with the highest reward score.
    """

    # <YOUR CODE HERE>
    if main_tokenizer.pad_token is None:
        main_tokenizer.pad_token = main_tokenizer.eos_token

    input_text = "The movie was"
    inputs = main_tokenizer(
        input_text, 
        return_tensors='pt', 
        padding=True, 
        truncation=True, 
        max_length=512
    ).to(device)

    # sample all N candidates in a single batched call (faster than N sequential calls)
    with no_grad():
        outputs = main_model.generate(
            input_ids=inputs['input_ids'],
            attention_mask=inputs['attention_mask'],
            max_length=50,
            num_return_sequences=N,
            do_sample=True,
            top_k=50,
            pad_token_id=main_tokenizer.pad_token_id
        )
    generated_texts = main_tokenizer.batch_decode(outputs, skip_special_tokens=True)

    reward_scores = compute_reward(reward_model, reward_tokenizer, generated_texts, device=device)

    best_index = reward_scores.argmax().item()
    best_text = generated_texts[best_index]
    
    return best_text

In [29]:
generate_with_reward_guidance(
    main_model, main_tokenizer,
    reward_model, reward_tokenizer,
    device=device,
)

"The movie was poorly acted to boot, and didn't keep pace with the rest of the cast. The pacing was almost comical, but had nothing more to do than make the actors laugh at themselves--and that is fine if people in big film"

## Stage 2: fine-tune the main model with RL


Now, we will optimize GPT2 to produce negative IMDB movie reviews using the reward model you trained above.

Unlike supervised fine-tuning, RL lets the model generate its own sentences on each training step. It then computes the reward for those specific sentences and, finally, updates the model to increase the probability of sentences with high reward.

Thus, each RLHF step consists of three stages: __Rollout__, __Evaluation__ and __Update__
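
The three stages can be seen in miniature on a toy problem. This is only a sketch under simplifying assumptions: the "language model" is a single categorical distribution over 4 tokens, the reward function `toy_reward` is made up, and the update is plain REINFORCE with a mean baseline rather than PPO.

```python
import torch

torch.manual_seed(0)

# Toy "language model": one categorical distribution over a 4-token vocabulary.
logits = torch.zeros(4, requires_grad=True)
optimizer = torch.optim.Adam([logits], lr=0.1)

def toy_reward(tokens: torch.Tensor) -> torch.Tensor:
    """Hypothetical reward: +1 for token 3, 0 otherwise."""
    return (tokens == 3).float()

for step in range(200):
    policy = torch.distributions.Categorical(logits=logits)
    tokens = policy.sample((64,))               # Rollout: the model samples its own outputs
    rewards = toy_reward(tokens)                # Evaluation: score those samples
    advantages = rewards - rewards.mean()       # subtract a baseline to reduce variance
    # Update: raise the log-probability of samples with above-average reward
    loss = -(policy.log_prob(tokens) * advantages).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits, dim=0))  # most of the probability mass should now sit on token 3
```

In the real setup the rollout is `generate` on the main model, the evaluation is the reward model, and the update is a PPO step instead of this simpler gradient estimate.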

<div style="text-align: center">
<img src='https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/gpt2_bert_training.png' width='600'>
</div>

The update stage depends on the specific RL algorithm. We'll be using Proximal Policy Optimization, or [PPO](https://arxiv.org/abs/1707.06347), similarly to what was used for InstructGPT.
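
For intuition about what the PPO update optimizes, here is a minimal sketch of the clipped surrogate objective from the PPO paper. It covers only the policy term; a full trainer also handles the value loss, the KL penalty and advantage estimation. The function name `ppo_clipped_loss`, the default `clip_range=0.2` and the toy numbers are assumptions for illustration.

```python
import torch

def ppo_clipped_loss(logprobs_new, logprobs_old, advantages, clip_range=0.2):
    """Clipped surrogate loss (negated objective, so lower is better)."""
    ratio = torch.exp(logprobs_new - logprobs_old)            # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_range, 1 + clip_range) * advantages
    return -torch.min(unclipped, clipped).mean()

logprobs_old = torch.log(torch.tensor([0.20, 0.50]))
advantages = torch.tensor([1.0, -1.0])

# No policy change yet: the ratio is 1, so the loss is just -mean(advantages).
loss_same = ppo_clipped_loss(logprobs_old, logprobs_old, advantages)
print(loss_same)

# A large jump on the first action (0.20 -> 0.40, ratio 2.0) is capped at 1 + clip_range.
logprobs_new = torch.log(torch.tensor([0.40, 0.50]))
loss_jump = ppo_clipped_loss(logprobs_new, logprobs_old, advantages)
print(loss_jump)
```

The clipping is what keeps each update "proximal": once the ratio leaves the `[1 - clip_range, 1 + clip_range]` band in the direction favored by the advantage, the gradient through that sample vanishes.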

Before we run those 3 stages, however, we need to create a dataset of "queries" - partial reviews in our case.

In [30]:
# Note: this code is specific to IMDB; you will need to re-write it for other tasks
imdb_for_rlhf = imdb.filter(lambda row: len(row['text']) > 200, batched=False)
imdb_for_rlhf = imdb_for_rlhf.remove_columns(['label'])
sample_length = trl.core.LengthSampler(2, 8)  # use the first 2-8 tokens as query

def select_query_and_tokenize(sample):
    query_ids = main_tokenizer.encode(sample["text"])[: sample_length()]
    sample["query"] = main_tokenizer.decode(query_ids)  # query is the only required column
    sample["input_ids"] = query_ids  # to avoid re-tokenizing later
    return sample  # we do not need the rest - it will be generated by the model

imdb_for_rlhf = imdb_for_rlhf.map(select_query_and_tokenize, batched=False)
imdb_for_rlhf.set_format(type="torch")

Filter:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/24895 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1168 > 1024). Running this sequence through the model will result in indexing errors


Finally, we move to RL training. In this tutorial, we'll train LoRA adapters and not the full model.

In [31]:
import peft
peft_config = peft.LoraConfig(
    task_type=peft.TaskType.CAUSAL_LM, r=32, lora_alpha=32, lora_dropout=0.0, inference_mode=False
)

# reload main model as AutoModelForCausalLMWithValueHead - with an extra head needed for PPO
main_tokenizer = transformers.AutoTokenizer.from_pretrained("lvwerra/gpt2-imdb")
main_tokenizer.pad_token = main_tokenizer.eos_token

main_model = trl.AutoModelForCausalLMWithValueHead.from_pretrained("lvwerra/gpt2-imdb", device_map=device)
main_model = peft.get_peft_model(main_model, peft_config, adapter_name='default')
main_model.print_trainable_parameters()

trainable params: 1,179,648 || all params: 125,620,225 || trainable%: 0.9391
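
The numbers printed above can be reproduced by hand. A back-of-the-envelope sketch, assuming that PEFT's default LoRA target for GPT-2 is the fused attention projection `c_attn` (768 to 2304) in each of the 12 blocks, that GPT-2 small has 124,439,808 parameters, and that the value head is a single linear layer that stays frozen:

```python
hidden = 768              # GPT-2 small hidden size
fused_qkv = 3 * hidden    # c_attn maps 768 -> 2304 (assumed LoRA target)
num_layers = 12
r = 32                    # LoRA rank from peft_config

lora_a = hidden * r       # A matrix: 768 x 32
lora_b = r * fused_qkv    # B matrix: 32 x 2304
trainable = num_layers * (lora_a + lora_b)
print(trainable)          # 1179648

gpt2_params = 124_439_808     # assumed parameter count of GPT-2 small
value_head = hidden + 1       # assumed: one linear layer 768 -> 1 with bias
total = gpt2_params + trainable + value_head
print(total)                  # 125620225
print(f"{100 * trainable / total:.4f}")  # 0.9391
```

So under these assumptions only the low-rank `A` and `B` matrices receive gradients, which is under 1% of all parameters.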




Same as before, trl has a special type of trainer that minimizes a PPO-specific pseudo-loss. You can read more about this trainer [here](https://huggingface.co/docs/trl/main/en/ppo_trainer).

In [55]:
reward_model.save_pretrained('./my_model')
reward_tokenizer.save_pretrained('./my_model')

('./my_model/tokenizer_config.json',
 './my_model/special_tokens_map.json',
 './my_model/vocab.json',
 './my_model/merges.txt',
 './my_model/added_tokens.json',
 './my_model/tokenizer.json')

In [54]:
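# NOTE: this cell uses the legacy PPOTrainer API (PPOConfig(model_name=...), ppo_trainer.generate / .step).
# To my knowledge that API was dropped in trl 0.12, which would explain why the trl 0.12.2 installed above
# raises the TypeError below; pinning an older release (e.g. `pip install "trl<0.12"`) is the likely fix.
training_args = trl.PPOConfig(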
    model_name=main_model.config._name_or_path,
    gradient_accumulation_steps=1,
    learning_rate=1.41e-5,
    batch_size=64,
    ppo_epochs=4,                 # PPO performs this many updates per training batch
)

ppo_trainer = trl.PPOTrainer(
    training_args, model=main_model.model, tokenizer=main_tokenizer,
    dataset=imdb_for_rlhf, data_collator=lambda data: dict((key, [d[key] for d in data]) for key in data[0])
)  # note: we pass main_model.model because PPOTrainer checks for one of several supported model types ...
# ... main_model.model is a model with adapters, which is supported. main_model itself is a wrapper that is not supported

TypeError: PPOConfig.__init__() got an unexpected keyword argument 'model_name'

In [None]:
from tqdm.auto import tqdm

max_steps = 50   # can be insufficient for some tasks - watch your learning curves
generation_kwargs = dict(
    min_length=-1, max_new_tokens=128, do_sample=True, top_k=0, top_p=1.0, pad_token_id=main_tokenizer.eos_token_id)
#                                  ^-- task-specific parameter!

average_reward = 0
gamma = 0.7

with tqdm(enumerate(ppo_trainer.dataloader), total=max_steps) as progressbar:
  # note: ppo_trainer.dataloader is just a regular dataloader of queries, no RL-specific magic :)
  for epoch, batch in progressbar:
    if epoch >= max_steps:
        break

    # Rollout stage: generate continuations from batch queries using main_model
    response_tensors = ppo_trainer.generate(batch['input_ids'], **generation_kwargs)
    # ^-- list of tensors of token ids from main model tokenizer

    # de-tokenize responses to strings (since reward model uses a different tokenizer)
    batch["response"] = [main_tokenizer.decode(response.squeeze()) for response in response_tensors]
    # note: response_tensors already contain query tokens, so we don't need to add queries manually.
    # This may not be true for other tasks: check this manually by viewing batch["response"] and batch["query"]


    # Evaluation stage - rewards for batch['response']
    rewards = compute_reward(reward_model, reward_tokenizer, batch["response"], device=device) # <YOUR CODE HERE>

    # Update stage
    stats = ppo_trainer.step(batch['input_ids'], response_tensors, list(rewards.split(1)))
    stats['rewards/mean'] = rewards.mean().item() # <YOUR CODE HERE> - compute mean rewards for batch
    average_reward = gamma * average_reward + (1 - gamma) * stats['rewards/mean']

    print("-" * 30, 'STEP', epoch, '-' * 30)
    print(f'rewards/mean:\t{stats["rewards/mean"]:.9f}\t<---- average reward over this batch (higher=better, noisy)')
    print(f'rewards/moving_avg:\t{average_reward:.9f}\t<---- moving average reward (higher=better, less noisy)')
    print(f'ppo/returns/mean:\t{stats["ppo/returns/mean"]:.9f}\t<---- model-estimated average discounted reward')
    print(f'objective/kl:\t{stats["objective/kl"]:.9f}\t<---- how far we are from the original model (regularizer)')
    print()

    ppo_trainer.log_stats(stats, batch, list(rewards.split(1)))

  0%|          | 0/50 [00:00<?, ?it/s]


AssertionError: Torch not compiled with CUDA enabled

In [None]:
assert average_reward > 2

And now test your PPO model:

In [None]:
inputs = [main_tokenizer.encode("The movie was", return_tensors='pt').to(device)[0] for i in range(5)]

response_tensors = ppo_trainer.generate(inputs, **generation_kwargs)
batch["response"] = [main_tokenizer.decode(response.squeeze()) for response in response_tensors]
for sample in batch["response"]:
    print('Sample: {}'.format(sample))

Sample: The movie was written by Omar Bakari of Ir Daily Show in Internet novels works, najd-as theor death, I didn't know when he made this thing, but it wasn't bad. def shines for Great F****** Good for He subordinates.<br /><br />This IS BEST AROUND! I was not expecting it to be a science movie it an A. failed to have a schtick to make it academically, it sucked and failed to do the admirable job with he job.<|endoftext|>
Sample: The movie was really awful. I aren't referring to the tiradist antics with what adjects. The whole plot summary bombing and launching cost vulgomer that is Bochowski Centerll was completely boring, complete and completely pointless. The greatest waste of David Cameron movie theater, and the man with the most unstable personality I have ever seen in a movie.<|endoftext|>
Sample: The movie was kind of boring...Some drawings looks like graphic shopgirls and they take some ends wrong in their new ways; in the beginning they do the original drawings, because the