<font color=red>**Danger zone:**</font> you'll be fine-tuning a model to generate positive, negative or even toxic reviews. We'll be doing this for fun, but this is also the technique for [review bombing](https://en.wikipedia.org/wiki/Review_bomb), bot farms on social media and other less than dignified stuff. It is ultimately your decision how you apply this knowledge, but before you choose, ask yourself: is this why you chose to learn ML?


# LLM Alignment with Reinforcement Learning from Human Feedback (RLHF)

_based on the [original notebook](https://github.com/antndlcrx/oxford-llms-workshop/blob/main/materials/seminars/day_3/8_LLMs%20alignment%20with%20RLHF.ipynb) by Ilya Boytsov for the Oxford LLMs workshop_



In this session, you're gonna fine-tune a language model with reinforcement learning to make it generate good (or bad) reviews.

To perform RL-based fine-tuning, we'll use a new (in this course) library called [Transformer Reinforcement Learning (TRL)](https://huggingface.co/docs/trl). TRL implements the main reinforcement learning components of RLHF: reward modeling and fine-tuning with PPO.

![img](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/TRL-readme.png)

In [33]:
# !python -m pip install --upgrade --ignore-installed setuptools accelerate trl==0.7.4 transformers==4.33.1 datasets==2.14.4 peft==0.5.0 tokenizers

### Tutorial: align the model to generate positive movie reviews

To see how TRL works, we'll use it to align GPT-2 on the IMDB dataset to generate positive (or negative) movie reviews. In fact, __it's your choice whether you want positive or negative reviews.__

But before you choose, let's take a look at the baseline model: a GPT-2 fine-tuned on generating arbitrary movie reviews.

In [34]:
import torch
import transformers
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
main_tokenizer = transformers.AutoTokenizer.from_pretrained("lvwerra/gpt2-imdb")
main_model = transformers.AutoModelForCausalLM.from_pretrained("lvwerra/gpt2-imdb", device_map=device)

In [41]:
inputs = main_tokenizer("The movie", return_tensors='pt').to(device)
generated_ids = main_model.generate(**inputs, max_new_tokens=50, do_sample=True)
print("\nGenerated text:", main_tokenizer.decode(generated_ids.flatten().cpu().numpy().tolist()))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Generated text: The movie was really bad. I found the acting to be average, the storyline to be lame, and the plot to be uninteresting (uncompelling). The acting wasn't great and the film was about as good as Hell. Also this was the worst


If you run this cell a couple of times, you'll see that the model generates positive, negative and neutral reviews in some proportion. What we're gonna do next is teach the model to generate more positive (or negative) reviews.

Similarly to InstructGPT, we're gonna do that in 2 stages:
- **train a reward model** to assign higher values to positive (or negative) reviews
- fine-tune the language model to **maximize that reward using [proximal policy optimization](https://openai.com/research/openai-baselines-ppo)**



## Stage 1: train a reward model

First, we'll train a BERT-like model as our reward model. We'll generate synthetic pairwise rankings to emulate human rankings.

__Q:__ why do I need a reward model? Can I just use a pre-trained sentiment classifier? <br> __A:__ Yes, you can, but that only works for movie reviews. This tutorial will teach you how to do RLHF for any kind of objective.


__If you actually want to maximize sentiment (or another "label") instead of human preferences, train the reward model as a classifier! (see week5)__


In [46]:
# We'll be fine-tuning a small BERT-like model for now. Please try other models for the main assignment.
reward_model = transformers.AutoModelForSequenceClassification.from_pretrained("distilbert-base-cased", device_map=device)
reward_tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert-base-cased")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


__Note that__ the reward model has a separate tokenizer, different from the main model. They don't need to be the same for RLHF fine-tuning.

In [48]:
# To train a reward model, you need a dataset (or generator) of positive-negative pairs.
# Each training sample should be a dict with 4 keys:
#  - input_ids_chosen, attention_mask_chosen = tokenizer("A sentence that human labeler likes more")
#  - input_ids_rejected, attention_mask_rejected = tokenizer("A sentence that human labeler likes less")

import torch
import datasets

class IMDBPairwiseDataset(torch.utils.data.Dataset):
    """ A dataset of all possible pairs of chosen and rejected texts in TRL reward training format """
    def __init__(self, imdb, tokenizer, accepted_label: int):
        super().__init__()
        self.tokenizer = tokenizer
        self.chosen_texts = [row['text'] for row in imdb if row['label'] == accepted_label]
        self.rejected_texts = [row['text'] for row in imdb if row['label'] != accepted_label]
        assert self.chosen_texts, f"no texts with label {accepted_label}"
        print(f"Found {len(self.chosen_texts)} chosen and {len(self.rejected_texts)} rejected texts, {len(self)} pairs")

    def __len__(self):
        return len(self.chosen_texts) * len(self.rejected_texts)  # all pairs

    def __getitem__(self, index: int):
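        # flat index -> (chosen index, rejected index); both use len(rejected_texts) so that
        # every pair is reachable even when the two lists have different lengths
        chosen = self.tokenizer(self.chosen_texts[index // len(self.rejected_texts)], truncation=True)
        rejected = self.tokenizer(self.rejected_texts[index % len(self.rejected_texts)], truncation=True)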
        return dict(input_ids_chosen=chosen['input_ids'], attention_mask_chosen=chosen['attention_mask'],
                    input_ids_rejected=rejected['input_ids'], attention_mask_rejected=rejected['attention_mask'])

In [49]:
TARGET_LABEL = 0   # 0 = negative reviews, 1 = positive; make sure it works by reviewing the sample printed below
imdb = datasets.load_dataset("imdb", split='train')
reward_data = IMDBPairwiseDataset(imdb, reward_tokenizer, accepted_label=TARGET_LABEL)

sample = reward_data[31337]
print('CHOSEN:', reward_tokenizer.decode(sample['input_ids_chosen']))
print('REJECTED:', reward_tokenizer.decode(sample['input_ids_rejected']))

Found 12500 chosen and 12500 rejected texts, 156250000 pairs
CHOSEN: [CLS] If only to avoid making this type of film in the future. This film is interesting as an experiment but tells no cogent story. < br / > < br / > One might feel virtuous for sitting thru it because it touches on so many IMPORTANT issues but it does so without any discernable motive. The viewer comes away with no new perspectives ( unless one comes up with one while one's mind wanders, as it will invariably do during this pointless film ). < br / > < br / > One might better spend one's time staring out a window at a tree growing. < br / > < br / > [SEP]
REJECTED: [CLS] This movie has some things that are pretty amazing. First, it is supposed to be based on a true story. That, in itself, is amazing that multiple tornadoes would hit the same town at night in the fall - in Nebraska. I wonder if the real town's name was close to " Blainsworth " ( which is the town's name in the movie ). There is an Ainsworth, Nebraska,
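
The dataset enumerates every (chosen, rejected) combination through a single flat index. A minimal sketch of one way to map that index to a pair is shown below; `pair_from_index` is a hypothetical helper written for illustration, not part of TRL or of the class above.

```python
from typing import Tuple


def pair_from_index(index: int, num_chosen: int, num_rejected: int) -> Tuple[int, int]:
    """Map a flat index in [0, num_chosen * num_rejected) to (chosen_idx, rejected_idx)."""
    if not 0 <= index < num_chosen * num_rejected:
        raise IndexError(f"index {index} is out of range")
    # The quotient walks over chosen texts, the remainder over rejected texts.
    return divmod(index, num_rejected)


print(pair_from_index(31337, 12500, 12500))  # (2, 6337)
print(pair_from_index(7, 2, 5))              # (1, 2)
```

Dividing and taking the remainder by the same number (the count of rejected texts) keeps the mapping one-to-one even when the two lists have different sizes.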

We'll be using `trl.RewardTrainer` - a special case of `transformers.Trainer` that you used in the past. `RewardTrainer` accepts the same format of training arguments (e.g. batch size, gradient checkpointing) as before, except that it trains the model for the pairwise reward objective from [the InstructGPT paper](https://arxiv.org/pdf/2203.02155.pdf):

![img](https://i.imgur.com/2JzNAPs.png)

Note that the model itself does not score pairs: it processes chosen ($y_w$) and rejected ($y_l$) samples independently. To minimize this loss, the reward model needs to score the chosen sample higher than the rejected one. Note that the formula also assumes some context $x$, which is useful for seq2seq tasks. In our case of movie reviews, $x$ is empty.
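
To make the objective concrete, here is a minimal sketch of the per-pair term, $-\log \sigma(r_w - r_l)$, in plain Python. The function name `pairwise_reward_loss` is made up for this example; `RewardTrainer` computes the same quantity on batches of tensors.

```python
import math


def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)) for one pair."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch on the sign to avoid overflow
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


print(pairwise_reward_loss(0.0, 0.0))   # equal scores: log(2)
print(pairwise_reward_loss(2.0, -2.0))  # chosen scored higher: small loss
print(pairwise_reward_loss(-2.0, 2.0))  # chosen scored lower: large loss
```

Only the difference between the two rewards matters, which is why the absolute scale of the rewards is arbitrary.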

In [50]:
import trl

training_args = trl.RewardConfig( # like transformers.TrainingArguments
    output_dir="reward_model",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
    learning_rate=1.41e-5,
    max_steps=1_000,              # note: training may need more than 1k steps
    logging_steps=50,
    gradient_checkpointing=True,  # reduce memory usage but train ~30% slower
    gradient_checkpointing_kwargs={"use_reentrant": False},
    fp16=True                     # disable this on CPU or on very old GPUs
    # you may add any other hyperparameters that you found useful in weeks 5-7
)

trainer = trl.RewardTrainer(
    model=reward_model,
    args=training_args,
    tokenizer=reward_tokenizer,
    train_dataset=reward_data,
    peft_config=None,  # optionally, you may tune with LoRA, prompt-tuning, etc
)

trainer.train()

max_steps is given, it will override any value given in num_train_epochs
  0%|          | 0/1000 [00:00<?, ?it/s]You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Could not estimate the number of tokens of the input, floating-point operations will not be computed
  5%|▌         | 50/1000 [00:45<11:25,  1.39it/s] 

{'loss': 0.5091, 'grad_norm': 3.0127673149108887, 'learning_rate': 1.34091e-05, 'epoch': 0.0}


 10%|█         | 100/1000 [01:21<10:47,  1.39it/s]

{'loss': 0.1947, 'grad_norm': 2.138087749481201, 'learning_rate': 1.2718200000000001e-05, 'epoch': 0.0}


 15%|█▌        | 150/1000 [01:57<10:10,  1.39it/s]

{'loss': 0.1377, 'grad_norm': 2.3117825984954834, 'learning_rate': 1.20132e-05, 'epoch': 0.0}


 20%|██        | 200/1000 [02:33<09:31,  1.40it/s]

{'loss': 0.1258, 'grad_norm': 3.006094217300415, 'learning_rate': 1.1308200000000001e-05, 'epoch': 0.0}


 25%|██▌       | 250/1000 [03:09<08:58,  1.39it/s]

{'loss': 0.1, 'grad_norm': 7.099982261657715, 'learning_rate': 1.06032e-05, 'epoch': 0.0}


 30%|███       | 300/1000 [03:45<08:18,  1.40it/s]

{'loss': 0.1119, 'grad_norm': 10.28709888458252, 'learning_rate': 9.8982e-06, 'epoch': 0.0}


 35%|███▌      | 350/1000 [04:20<07:44,  1.40it/s]

{'loss': 0.1043, 'grad_norm': 2.178290605545044, 'learning_rate': 9.1932e-06, 'epoch': 0.0}


 40%|████      | 400/1000 [04:56<07:06,  1.41it/s]

{'loss': 0.09, 'grad_norm': 5.269755840301514, 'learning_rate': 8.4882e-06, 'epoch': 0.0}


 45%|████▌     | 450/1000 [05:32<06:32,  1.40it/s]

{'loss': 0.0819, 'grad_norm': 4.6626458168029785, 'learning_rate': 7.7832e-06, 'epoch': 0.0}


 50%|█████     | 500/1000 [06:08<05:54,  1.41it/s]

{'loss': 0.0823, 'grad_norm': 3.085207223892212, 'learning_rate': 7.0782e-06, 'epoch': 0.0}


 55%|█████▌    | 550/1000 [06:44<05:22,  1.39it/s]

{'loss': 0.0874, 'grad_norm': 2.4558117389678955, 'learning_rate': 6.3732e-06, 'epoch': 0.0}


 60%|██████    | 600/1000 [07:21<04:42,  1.41it/s]

{'loss': 0.0737, 'grad_norm': 0.978861391544342, 'learning_rate': 5.6682e-06, 'epoch': 0.0}


 65%|██████▌   | 650/1000 [07:57<04:09,  1.40it/s]

{'loss': 0.0685, 'grad_norm': 5.672646999359131, 'learning_rate': 4.9632e-06, 'epoch': 0.0}


 70%|███████   | 700/1000 [08:32<03:33,  1.41it/s]

{'loss': 0.0751, 'grad_norm': 4.7364044189453125, 'learning_rate': 4.2582e-06, 'epoch': 0.0}


 75%|███████▌  | 750/1000 [09:08<03:03,  1.37it/s]

{'loss': 0.07, 'grad_norm': 9.540241241455078, 'learning_rate': 3.5532e-06, 'epoch': 0.0}


 80%|████████  | 800/1000 [09:45<02:21,  1.41it/s]

{'loss': 0.0454, 'grad_norm': 0.6404498219490051, 'learning_rate': 2.8482e-06, 'epoch': 0.0}


 85%|████████▌ | 850/1000 [10:20<01:46,  1.41it/s]

{'loss': 0.0593, 'grad_norm': 1.604547142982483, 'learning_rate': 2.1432e-06, 'epoch': 0.0}


 90%|█████████ | 900/1000 [10:56<01:10,  1.41it/s]

{'loss': 0.067, 'grad_norm': 6.464433670043945, 'learning_rate': 1.4382e-06, 'epoch': 0.0}


 95%|█████████▌| 950/1000 [11:31<00:35,  1.41it/s]

{'loss': 0.0626, 'grad_norm': 0.36007577180862427, 'learning_rate': 7.332e-07, 'epoch': 0.0}


100%|██████████| 1000/1000 [12:07<00:00,  1.40it/s]

{'loss': 0.044, 'grad_norm': 1.3303836584091187, 'learning_rate': 2.82e-08, 'epoch': 0.0}


100%|██████████| 1000/1000 [12:09<00:00,  1.37it/s]

{'train_runtime': 729.3085, 'train_samples_per_second': 43.877, 'train_steps_per_second': 1.371, 'train_loss': 0.10954286599159241, 'epoch': 0.0}





TrainOutput(global_step=1000, training_loss=0.10954286599159241, metrics={'train_runtime': 729.3085, 'train_samples_per_second': 43.877, 'train_steps_per_second': 1.371, 'total_flos': 0.0, 'train_loss': 0.10954286599159241, 'epoch': 0.00020479997902848215})

In [7]:
reward_model.gradient_checkpointing_disable()
reward_model.eval()

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

### Sanity-check the reward model (1 point)

Let's check how our reward model performs.

__Your task__ is to measure how often your reward model ranks a pair of (chosen and rejected) reviews correctly. Please measure this separately for the train data (`imdb`) and for a separate test set loaded below.
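
One way to read this task: over all (chosen, rejected) pairs, count how often the chosen review gets the higher reward. Below is a minimal sketch in plain Python; `pairwise_ranking_accuracy` is a hypothetical helper, and the rewards in the example are made-up numbers.

```python
from typing import Sequence


def pairwise_ranking_accuracy(chosen_rewards: Sequence[float],
                              rejected_rewards: Sequence[float]) -> float:
    """Fraction of (chosen, rejected) pairs ranked correctly; ties count as half."""
    correct = 0.0
    for chosen in chosen_rewards:
        for rejected in rejected_rewards:
            if chosen > rejected:
                correct += 1.0
            elif chosen == rejected:
                correct += 0.5
    return correct / (len(chosen_rewards) * len(rejected_rewards))


# made-up rewards: 5 of the 6 pairs are ordered correctly
print(pairwise_ranking_accuracy([2.0, 0.5, 1.0], [-1.0, 0.7]))
```

This quantity is the same as ROC AUC computed with "chosen" as the positive class, so for the full dataset a sorting-based AUC routine avoids the quadratic loop.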

In [8]:

for sample_index in 45, 16000:
  print('TEXT:', imdb[sample_index]['text'])
  inputs = reward_tokenizer(
      imdb[sample_index]['text'], truncation=True, return_tensors='pt').to(device)
  with torch.no_grad():
    reward = reward_model(**inputs).logits[0, 0].item()
    print("REWARD:", reward)
  print('LABEL:', imdb[sample_index]['label'])
  print()

# note: your reward model may produce different absolute rewards.
# This is fine as long as the rewards are ordered correctly (most of the time)

TEXT: This movie sucked. It really was a waste of my life. The acting was atrocious, the plot completely implausible. Long, long story short, these people get "terrorized" by this pathetic "crazed killer", but completely fail to fight back in any manner. And this is after they take a raft on a camping trip, with no gear, and show up at a campsite that is already assembled and completely stocked with food and clothes and the daughters headphones. Additionally, after their boat goes missing, they panic that they're stuck in the woods, but then the daughters boyfriend just shows up and they apparently never consider that they could just hike out of the woods like he did to get to them. Like I said, this movie sucks. A complete joke. Don't let your girlfriend talk you into watching it.
REWARD: 5.3046875
LABEL: 0

TEXT: Good: Engaging cinematic firefights, great presentation, vehicles are actually fun to drive, fairly appealing multiplayer, faithful to the movie, and the list goes on.<br />

In [9]:
from sklearn.metrics import roc_auc_score

In [11]:
from tqdm import tqdm
imdb_test = datasets.load_dataset("imdb", split='test')

rewards = []
labels = []

for idx in tqdm(range(len(imdb_test))):
    inputs = reward_tokenizer(imdb_test[idx]['text'], truncation=True, return_tensors='pt').to(device)
    with torch.no_grad():
        reward = reward_model(**inputs).logits[0, 0].item()
        rewards.append(reward)
        labels.append(imdb_test[idx]['label'])

score = roc_auc_score(labels, rewards)
score  # with TARGET_LABEL = 0, reviews with label 0 get the high rewards, so this AUC falls below 0.5


# <a whole lot of your code here, feel free to split it as you see fit>


100%|██████████| 25000/25000 [02:10<00:00, 191.11it/s]


0.0277544896

In [20]:
1-score

0.9722455104

### Reward-guided generation (1 point)

If you did everything right, by now you should have a decent reward model. Before we use it for reinforcement learning, let's see if we can align model samples without any training.

To do so, you can use reward-guided inference: __generate N=16 samples, then select the one with the highest reward__ (according to your reward model).

For this problem, it's on you to demonstrate whether or not your code works. Find at least 5 neutral prompts such as "This movie is" (...), generate samples, rank them based on reward and show which samples get the highest reward.
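
The selection step itself is small. The sketch below shows best-of-N selection with toy stand-ins: `toy_generate` and `toy_reward` are hypothetical placeholders for sampling from the language model and scoring with the reward model, so the example runs without any model.

```python
from typing import Callable, List, Tuple


def best_of_n(prompt: str,
              generate: Callable[[str, int], List[str]],
              reward: Callable[[List[str]], List[float]],
              n: int = 16) -> Tuple[str, float]:
    """Generate n candidates for the prompt and return the highest-reward one with its score."""
    candidates = generate(prompt, n)
    scores = reward(candidates)
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores[best]


def toy_generate(prompt: str, n: int) -> List[str]:
    # placeholder for sampling n continuations from the language model
    endings = ["great", "terrible", "fine", "boring"]
    return [f"{prompt} {endings[i % len(endings)]}" for i in range(n)]


def toy_reward(texts: List[str]) -> List[float]:
    # placeholder for the reward model: this toy version prefers negative reviews
    return [1.0 if text.endswith("terrible") else 0.0 for text in texts]


print(best_of_n("This movie is", toy_generate, toy_reward))
```

Keeping generation and scoring as separate callables mirrors the setup here, where the two models use different tokenizers and exchange plain text.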

Note: it is faster to generate samples in parallel, rather than sequentially, as follows:




In [19]:
inputs = main_tokenizer(["It was"] * 5, return_tensors='pt').to(device)
for candidate in main_model.generate(**inputs, max_new_tokens=50, do_sample=True):
  print("Sample:", main_tokenizer.decode(candidate.flatten().cpu().numpy().tolist()))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample: It was always my dream to play poker, I always thought I could do it, so I got involved. I think it was interesting to see how popular poker was before getting into poker. Many poker players have been talking about the fact that their poker careers are
Sample: It was funny how quickly he was getting old, so his father gave him out from the hospital and he was okay. The doctor thought he died and he went to the same hospital a couple days later that same day. My daughter did not know anything about this
Sample: It was interesting how she managed to remain credible throughout; her most memorable scenes were the flashbacks where she visits school and falls for an older older child. We have the young girl, who tries to find a place for her in life but her parents turn her into
Sample: It was the first episode where the actors decided to throw in a couple of cheap shots. There is no way there would be another show. There would be worse.<|endoftext|><|endoftext|><|endoftext|><|endoft

In [11]:
reward_model = transformers.AutoModelForSequenceClassification.from_pretrained("reward_model/checkpoint-1000", device_map=device)

In [53]:
# <YOUR CODE HERE> - feel free to organize it as you see fit
outputs = []

inputs = main_tokenizer(["This movie is"] * 16, return_tensors='pt').to(device)

for gen in main_model.generate(**inputs, max_new_tokens=50, do_sample=True):
    outputs.append(main_tokenizer.decode(gen.flatten().cpu().numpy().tolist()))

    

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [54]:
rewards = []
for i in outputs:
    inputs = reward_tokenizer(i, truncation=True, return_tensors='pt').to(device)
    rewards.append(reward_model(**inputs).logits[0,0].detach().cpu())

In [61]:
import numpy as np
np.array(outputs)[np.argsort(rewards)]  # samples sorted by reward in ascending order: the best one is last

array(["This movie is about a girl who loses her virginity accidentally and her sister gets caught up in it. A story I haven't had to think of any more!! A nice twist to this story!! Also, a really nice action/suspense in the last hour with",
       "This movie is definitely worth watching. I'd give it a 9, as you get a nice feel for the characters even if you're not the type to read all about them.<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>",
       'This movie is about two teenagers trying to solve one of the best murder mysteries ever set in Chicago. The leads play "ghosts" from one mystery to the next and the mystery is far removed from what they had done in the first film. A mystery for all',
       'This movie is like a horror movie, as a bad thing, but you do get a really good one that is not even bad, b

# Stage 2: fine-tune the main model with RL


For this tutorial, we will optimize GPT-2 to produce positive (or negative, depending on your `TARGET_LABEL`) IMDB movie reviews using the reward model you trained above.

Unlike supervised fine-tuning, RL allows the model to generate its own sentences on each training step. Then, it calculates the reward of those specific sentences, and finally, updates the model to increase the probability of sentences with high reward.

Thus, each RLHF training step consists of three stages: __Rollout__, __Evaluation__ and __Update__.

<div style="text-align: center">
<img src='https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/gpt2_bert_training.png' width='600'>

The update stage depends on the specific RL algorithm. We'll be using Proximal Policy Optimization, or [PPO](https://arxiv.org/abs/1707.06347), similarly to what was used for InstructGPT.
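
For intuition, here is a minimal sketch of the clipped surrogate objective from the PPO paper for a single action (here, a single token). The function name and arguments are chosen for this example, and the default clipping range of 0.2 follows the paper; TRL's actual loss has additional terms, such as the value-head loss.

```python
import math


def ppo_clipped_objective(logp_new: float, logp_old: float,
                          advantage: float, clip_range: float = 0.2) -> float:
    """Clipped PPO surrogate for one action; the trainer maximizes its average."""
    ratio = math.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # taking the minimum removes any incentive to push the ratio outside the clip range
    return min(ratio * advantage, clipped_ratio * advantage)


# probability doubled on a good action: the gain is capped at (1 + clip_range) * advantage
print(ppo_clipped_objective(math.log(2.0), 0.0, advantage=1.0))
# probability doubled on a bad action: the penalty is not capped
print(ppo_clipped_objective(math.log(2.0), 0.0, advantage=-1.0))
```

The clipping is what keeps each update "proximal": several optimization passes over the same batch cannot move the policy arbitrarily far from the one that generated the samples.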

Before we run those 3 stages, however, we need to create a dataset of "queries" - partial reviews in our case.

In [6]:
import trl
# Note: this code is specific to IMDB; you will need to re-write it for other tasks
imdb_for_rlhf = imdb.filter(lambda row: len(row['text']) > 200, batched=False)
imdb_for_rlhf = imdb_for_rlhf.remove_columns(['label'])
sample_length = trl.core.LengthSampler(2, 8)  # use the first 2-7 tokens as query (the upper bound is exclusive)

def select_query_and_tokenize(sample):
    query_ids = main_tokenizer.encode(sample["text"])[: sample_length()]
    sample["query"] = main_tokenizer.decode(query_ids)  # query is the only required column
    sample["input_ids"] = query_ids  # to avoid re-tokenizing later
    return sample  # we do not need the rest - it will be generated by the model

imdb_for_rlhf = imdb_for_rlhf.map(select_query_and_tokenize, batched=False)
imdb_for_rlhf.set_format(type="torch")

Next, let's prepare your reward model to predict rewards on whatever reviews were generated. Note that we use plaintext reviews because the main model uses a different tokenizer from the reward model.

In [7]:
from typing import List
def compute_reward(texts: List[str]) -> torch.Tensor:
  inputs = reward_tokenizer(texts, truncation=True, padding=True, return_tensors='pt').to(device)
  with torch.no_grad():
    return reward_model(**inputs).logits[:, 0]

In [12]:
compute_reward([imdb[45]['text'], imdb[16000]['text']])  # test on human-written reviews

tensor([ 5.3037, -4.7717], device='cuda:0')

Finally, we move to RL training. In this tutorial, we'll train LoRA adapters and not the full model.

In [13]:
import peft
peft_config = peft.LoraConfig(
    task_type=peft.TaskType.CAUSAL_LM, r=32, lora_alpha=32, lora_dropout=0.0, inference_mode=False
)

# reload main model as AutoModelForCausalLMWithValueHead - with an extra head needed for PPO
main_tokenizer = transformers.AutoTokenizer.from_pretrained("lvwerra/gpt2-imdb")
main_tokenizer.pad_token = main_tokenizer.eos_token

main_model = trl.AutoModelForCausalLMWithValueHead.from_pretrained("lvwerra/gpt2-imdb", device_map=device)
main_model = peft.get_peft_model(main_model, peft_config, adapter_name='default')
main_model.print_trainable_parameters()

trainable params: 1,179,648 || all params: 125,620,225 || trainable%: 0.9391
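
The trainable parameter count above can be reproduced by hand. The sketch below assumes that LoRA is attached only to the fused attention projection (`c_attn`, 768 to 2304 features) in each of GPT-2's 12 blocks; this is an assumption about peft's default target modules for GPT-2, and it happens to match the printed number.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in one LoRA adapter: A is (rank x in_features), B is (out_features x rank)."""
    return rank * (in_features + out_features)


hidden_size = 768           # GPT-2 small hidden size
qkv_out = 3 * hidden_size   # c_attn produces queries, keys and values in one projection
num_layers = 12
rank = 32                   # r=32 from the LoraConfig above

trainable = num_layers * lora_param_count(hidden_size, qkv_out, rank)
print(f"{trainable:,}")                             # 1,179,648
print(f"{100 * trainable / 125_620_225:.4f}")       # 0.9391
```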




Same as before, TRL has a special type of trainer that minimizes a PPO-specific pseudo-loss. You can read more on this trainer [here](https://huggingface.co/docs/trl/main/en/ppo_trainer).

In [14]:
training_args = trl.PPOConfig(
    model_name=main_model.config._name_or_path,
    gradient_accumulation_steps=4,
    learning_rate=1.41e-5,
    batch_size=64,
    mini_batch_size=16,
    ppo_epochs=4,                 # PPO performs this many updates per training batch
)

ppo_trainer = trl.PPOTrainer(
    training_args, model=main_model.model, tokenizer=main_tokenizer,
    dataset=imdb_for_rlhf, data_collator=lambda data: dict((key, [d[key] for d in data]) for key in data[0])
)  # note: we pass main_model.model because PPOTrainer checks for one of several supported model types ...
# ... main_model.model is a model with adapters, which is supported. main_model itself is a wrapper that is not supported




In [15]:
from tqdm.auto import tqdm
max_steps = 50   # can be insufficient for some tasks - watch your learning curves
generation_kwargs = dict(
    min_length=-1, max_new_tokens=128, do_sample=True, top_k=0, top_p=1.0, pad_token_id=main_tokenizer.eos_token_id)
#                                  ^-- task-specific parameter!
with tqdm(enumerate(ppo_trainer.dataloader), total=max_steps) as progressbar:
  # note: ppo_trainer.dataloader is just a regular dataloader of queries, no RL-specific magic :)
  for epoch, batch in progressbar:
    if epoch >= max_steps:
        break

    # Rollout stage: generate continuations from batch queries using main_model
    response_tensors = ppo_trainer.generate(batch['input_ids'], **generation_kwargs)
    # ^-- list of tensors of token ids from main model tokenizer

    # de-tokenize responses to strings (since reward model uses a different tokenizer)
    batch["response"] = [main_tokenizer.decode(response.squeeze()) for response in response_tensors]
    # note: response_tensors already contain query tokens, so we don't need to add queries manually.
    # This may not be true for other tasks: check this manually by viewing batch["response"] and batch["query"]


    # Evaluation stage
    rewards = compute_reward(batch['response'])

    # Update stage
    stats = ppo_trainer.step(batch['input_ids'], response_tensors, list(rewards.split(1)))
    stats['rewards/mean'] = rewards.mean().item()

    print("-" * 30, 'STEP', epoch, '-' * 30)
    print(f'rewards/mean:\t{stats["rewards/mean"]:.9f}\t<---- average reward over this batch (higher=better, noisy)')
    print(f'ppo/returns/mean:\t{stats["ppo/returns/mean"]:.9f}\t<---- model-estimated average discounted reward')
    print(f'objective/kl:\t{stats["objective/kl"]:.9f}\t<---- how far we are from the original model (regularizer)')
    print()

    ppo_trainer.log_stats(stats, batch, list(rewards.split(1)))

  0%|          | 0/50 [00:00<?, ?it/s]You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  2%|▏         | 1/50 [00:22<18:12, 22.30s/it]

------------------------------ STEP 0 ------------------------------
rewards/mean:	0.537303925	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.027607590	<---- model-estimated average discounted reward
objective/kl:	0.000000000	<---- how far we are from the original model (regularizer)



  4%|▍         | 2/50 [00:43<17:33, 21.95s/it]

------------------------------ STEP 1 ------------------------------
rewards/mean:	-0.114208698	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.073779672	<---- model-estimated average discounted reward
objective/kl:	0.000067968	<---- how far we are from the original model (regularizer)



  6%|▌         | 3/50 [01:05<16:57, 21.65s/it]

------------------------------ STEP 2 ------------------------------
rewards/mean:	0.485072792	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.053837083	<---- model-estimated average discounted reward
objective/kl:	0.008635592	<---- how far we are from the original model (regularizer)



  8%|▊         | 4/50 [01:28<17:05, 22.30s/it]

------------------------------ STEP 3 ------------------------------
rewards/mean:	0.098857373	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.034162991	<---- model-estimated average discounted reward
objective/kl:	0.008405432	<---- how far we are from the original model (regularizer)



 10%|█         | 5/50 [01:48<16:03, 21.42s/it]

------------------------------ STEP 4 ------------------------------
rewards/mean:	0.669786990	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.031738099	<---- model-estimated average discounted reward
objective/kl:	0.013997542	<---- how far we are from the original model (regularizer)



 12%|█▏        | 6/50 [02:12<16:21, 22.30s/it]

------------------------------ STEP 5 ------------------------------
rewards/mean:	0.363579839	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.007205085	<---- model-estimated average discounted reward
objective/kl:	0.027143134	<---- how far we are from the original model (regularizer)



 14%|█▍        | 7/50 [02:34<15:59, 22.32s/it]

------------------------------ STEP 6 ------------------------------
rewards/mean:	0.579912066	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.013163842	<---- model-estimated average discounted reward
objective/kl:	0.017383823	<---- how far we are from the original model (regularizer)



 16%|█▌        | 8/50 [02:55<15:19, 21.90s/it]

------------------------------ STEP 7 ------------------------------
rewards/mean:	0.674233437	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.067161046	<---- model-estimated average discounted reward
objective/kl:	0.041535616	<---- how far we are from the original model (regularizer)



 18%|█▊        | 9/50 [03:19<15:19, 22.44s/it]

------------------------------ STEP 8 ------------------------------
rewards/mean:	-0.135698825	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.104912646	<---- model-estimated average discounted reward
objective/kl:	0.035695463	<---- how far we are from the original model (regularizer)



 20%|██        | 10/50 [03:40<14:40, 22.00s/it]

------------------------------ STEP 9 ------------------------------
rewards/mean:	0.322107166	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.008312296	<---- model-estimated average discounted reward
objective/kl:	0.025565404	<---- how far we are from the original model (regularizer)



 22%|██▏       | 11/50 [04:02<14:17, 22.00s/it]

------------------------------ STEP 10 ------------------------------
rewards/mean:	0.320143789	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.028853713	<---- model-estimated average discounted reward
objective/kl:	0.073386572	<---- how far we are from the original model (regularizer)



 24%|██▍       | 12/50 [04:25<14:02, 22.17s/it]

------------------------------ STEP 11 ------------------------------
rewards/mean:	0.422156900	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.053277723	<---- model-estimated average discounted reward
objective/kl:	0.045664683	<---- how far we are from the original model (regularizer)



 26%|██▌       | 13/50 [04:46<13:30, 21.91s/it]

------------------------------ STEP 12 ------------------------------
rewards/mean:	0.416449904	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.020355772	<---- model-estimated average discounted reward
objective/kl:	0.127303272	<---- how far we are from the original model (regularizer)



 28%|██▊       | 14/50 [05:08<13:07, 21.87s/it]

------------------------------ STEP 13 ------------------------------
rewards/mean:	0.380871922	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.036533356	<---- model-estimated average discounted reward
objective/kl:	0.134921893	<---- how far we are from the original model (regularizer)



 30%|███       | 15/50 [05:30<12:46, 21.91s/it]

------------------------------ STEP 14 ------------------------------
rewards/mean:	0.888844609	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.072935693	<---- model-estimated average discounted reward
objective/kl:	0.122885332	<---- how far we are from the original model (regularizer)



 32%|███▏      | 16/50 [05:50<12:13, 21.56s/it]

------------------------------ STEP 15 ------------------------------
rewards/mean:	-0.019307733	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.083748966	<---- model-estimated average discounted reward
objective/kl:	0.149929971	<---- how far we are from the original model (regularizer)



 34%|███▍      | 17/50 [06:14<12:09, 22.11s/it]

------------------------------ STEP 16 ------------------------------
rewards/mean:	0.821264982	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.129028842	<---- model-estimated average discounted reward
objective/kl:	0.088150516	<---- how far we are from the original model (regularizer)



 36%|███▌      | 18/50 [06:35<11:36, 21.76s/it]

------------------------------ STEP 17 ------------------------------
rewards/mean:	1.070811272	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.116673350	<---- model-estimated average discounted reward
objective/kl:	0.105175868	<---- how far we are from the original model (regularizer)



 38%|███▊      | 19/50 [06:57<11:15, 21.80s/it]

------------------------------ STEP 18 ------------------------------
rewards/mean:	0.234880865	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.015775513	<---- model-estimated average discounted reward
objective/kl:	0.102224238	<---- how far we are from the original model (regularizer)



 40%|████      | 20/50 [07:20<11:07, 22.24s/it]

------------------------------ STEP 19 ------------------------------
rewards/mean:	0.551761866	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.046502981	<---- model-estimated average discounted reward
objective/kl:	0.202081412	<---- how far we are from the original model (regularizer)



 42%|████▏     | 21/50 [07:41<10:31, 21.78s/it]

------------------------------ STEP 20 ------------------------------
rewards/mean:	0.866803169	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.120542631	<---- model-estimated average discounted reward
objective/kl:	0.207578003	<---- how far we are from the original model (regularizer)



 44%|████▍     | 22/50 [08:03<10:15, 21.99s/it]

------------------------------ STEP 21 ------------------------------
rewards/mean:	0.540445507	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.069162190	<---- model-estimated average discounted reward
objective/kl:	0.124114834	<---- how far we are from the original model (regularizer)



 46%|████▌     | 23/50 [08:25<09:56, 22.11s/it]

------------------------------ STEP 22 ------------------------------
rewards/mean:	0.262418449	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.001951750	<---- model-estimated average discounted reward
objective/kl:	0.188866973	<---- how far we are from the original model (regularizer)



 48%|████▊     | 24/50 [08:46<09:23, 21.69s/it]

------------------------------ STEP 23 ------------------------------
rewards/mean:	-0.241178378	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.072451100	<---- model-estimated average discounted reward
objective/kl:	0.227816850	<---- how far we are from the original model (regularizer)



 50%|█████     | 25/50 [09:08<09:03, 21.76s/it]

------------------------------ STEP 24 ------------------------------
rewards/mean:	0.091523319	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.043856263	<---- model-estimated average discounted reward
objective/kl:	0.067279354	<---- how far we are from the original model (regularizer)



 52%|█████▏    | 26/50 [09:30<08:46, 21.92s/it]

------------------------------ STEP 25 ------------------------------
rewards/mean:	-0.265032053	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.104704857	<---- model-estimated average discounted reward
objective/kl:	0.313671231	<---- how far we are from the original model (regularizer)



 54%|█████▍    | 27/50 [09:52<08:23, 21.88s/it]

------------------------------ STEP 26 ------------------------------
rewards/mean:	0.601075888	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.055311732	<---- model-estimated average discounted reward
objective/kl:	0.252364814	<---- how far we are from the original model (regularizer)



 56%|█████▌    | 28/50 [10:14<08:03, 21.99s/it]

------------------------------ STEP 27 ------------------------------
rewards/mean:	0.441175282	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.045060780	<---- model-estimated average discounted reward
objective/kl:	0.223596737	<---- how far we are from the original model (regularizer)



 58%|█████▊    | 29/50 [10:35<07:34, 21.64s/it]

------------------------------ STEP 28 ------------------------------
rewards/mean:	0.336502433	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.004063313	<---- model-estimated average discounted reward
objective/kl:	0.274217635	<---- how far we are from the original model (regularizer)



 60%|██████    | 30/50 [10:57<07:12, 21.62s/it]

------------------------------ STEP 29 ------------------------------
rewards/mean:	0.585320950	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.034546711	<---- model-estimated average discounted reward
objective/kl:	0.356056690	<---- how far we are from the original model (regularizer)



 62%|██████▏   | 31/50 [11:20<07:00, 22.12s/it]

------------------------------ STEP 30 ------------------------------
rewards/mean:	0.866914034	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.087734923	<---- model-estimated average discounted reward
objective/kl:	0.352633238	<---- how far we are from the original model (regularizer)



 64%|██████▍   | 32/50 [11:40<06:27, 21.53s/it]

------------------------------ STEP 31 ------------------------------
rewards/mean:	-0.121526994	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.036884651	<---- model-estimated average discounted reward
objective/kl:	0.340081692	<---- how far we are from the original model (regularizer)



 66%|██████▌   | 33/50 [12:02<06:09, 21.73s/it]

------------------------------ STEP 32 ------------------------------
rewards/mean:	0.681454182	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.098433793	<---- model-estimated average discounted reward
objective/kl:	0.466585100	<---- how far we are from the original model (regularizer)



 68%|██████▊   | 34/50 [12:25<05:53, 22.11s/it]

------------------------------ STEP 33 ------------------------------
rewards/mean:	0.118767433	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.013490248	<---- model-estimated average discounted reward
objective/kl:	0.312696993	<---- how far we are from the original model (regularizer)



 70%|███████   | 35/50 [12:47<05:27, 21.84s/it]

------------------------------ STEP 34 ------------------------------
rewards/mean:	0.918653429	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.150502622	<---- model-estimated average discounted reward
objective/kl:	0.534421563	<---- how far we are from the original model (regularizer)



 72%|███████▏  | 36/50 [13:09<05:07, 21.94s/it]

------------------------------ STEP 35 ------------------------------
rewards/mean:	0.457341045	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.042851694	<---- model-estimated average discounted reward
objective/kl:	0.288924873	<---- how far we are from the original model (regularizer)



 74%|███████▍  | 37/50 [13:31<04:46, 22.06s/it]

------------------------------ STEP 36 ------------------------------
rewards/mean:	0.555555880	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.097869538	<---- model-estimated average discounted reward
objective/kl:	0.370789826	<---- how far we are from the original model (regularizer)



 76%|███████▌  | 38/50 [13:52<04:21, 21.80s/it]

------------------------------ STEP 37 ------------------------------
rewards/mean:	0.347857088	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.055577446	<---- model-estimated average discounted reward
objective/kl:	0.571614385	<---- how far we are from the original model (regularizer)



 78%|███████▊  | 39/50 [14:15<04:03, 22.10s/it]

------------------------------ STEP 38 ------------------------------
rewards/mean:	0.139375091	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.005866603	<---- model-estimated average discounted reward
objective/kl:	0.519901156	<---- how far we are from the original model (regularizer)



 80%|████████  | 40/50 [14:36<03:36, 21.68s/it]

------------------------------ STEP 39 ------------------------------
rewards/mean:	0.194767654	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.024233123	<---- model-estimated average discounted reward
objective/kl:	0.464358151	<---- how far we are from the original model (regularizer)



 82%|████████▏ | 41/50 [14:57<03:12, 21.41s/it]

------------------------------ STEP 40 ------------------------------
rewards/mean:	0.252348512	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.031494543	<---- model-estimated average discounted reward
objective/kl:	0.585824132	<---- how far we are from the original model (regularizer)



 84%|████████▍ | 42/50 [15:19<02:52, 21.58s/it]

------------------------------ STEP 41 ------------------------------
rewards/mean:	0.737719119	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.095232829	<---- model-estimated average discounted reward
objective/kl:	0.602477491	<---- how far we are from the original model (regularizer)



 86%|████████▌ | 43/50 [15:39<02:27, 21.13s/it]

------------------------------ STEP 42 ------------------------------
rewards/mean:	0.348758221	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.042668916	<---- model-estimated average discounted reward
objective/kl:	0.473462075	<---- how far we are from the original model (regularizer)



 88%|████████▊ | 44/50 [16:00<02:07, 21.19s/it]

------------------------------ STEP 43 ------------------------------
rewards/mean:	-0.270424843	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.084525675	<---- model-estimated average discounted reward
objective/kl:	0.590520024	<---- how far we are from the original model (regularizer)



 90%|█████████ | 45/50 [16:22<01:47, 21.50s/it]

------------------------------ STEP 44 ------------------------------
rewards/mean:	0.355695844	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.022175774	<---- model-estimated average discounted reward
objective/kl:	0.589467466	<---- how far we are from the original model (regularizer)



 92%|█████████▏| 46/50 [16:43<01:25, 21.36s/it]

------------------------------ STEP 45 ------------------------------
rewards/mean:	0.687814593	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.144094676	<---- model-estimated average discounted reward
objective/kl:	0.588263392	<---- how far we are from the original model (regularizer)



 94%|█████████▍| 47/50 [17:05<01:04, 21.40s/it]

------------------------------ STEP 46 ------------------------------
rewards/mean:	0.874625981	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.111767992	<---- model-estimated average discounted reward
objective/kl:	0.645615757	<---- how far we are from the original model (regularizer)



 96%|█████████▌| 48/50 [17:26<00:42, 21.43s/it]

------------------------------ STEP 47 ------------------------------
rewards/mean:	0.925737619	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.166211829	<---- model-estimated average discounted reward
objective/kl:	0.396295190	<---- how far we are from the original model (regularizer)



 98%|█████████▊| 49/50 [17:48<00:21, 21.54s/it]

------------------------------ STEP 48 ------------------------------
rewards/mean:	1.073199272	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.179797187	<---- model-estimated average discounted reward
objective/kl:	0.731850266	<---- how far we are from the original model (regularizer)



100%|██████████| 50/50 [18:12<00:00, 21.85s/it]

------------------------------ STEP 49 ------------------------------
rewards/mean:	0.493699908	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.101782501	<---- model-estimated average discounted reward
objective/kl:	0.935232162	<---- how far we are from the original model (regularizer)






## Main assignment - <u>actually</u> train the model (8 points)


Your main task for this week is to use the RLHF pipeline to train a model for a reward of your choice. Here's what you can choose from:

__A. Toxicity fine-tuning:__ train the model to be less (or more!) toxic. For this task, you may use the data from [jigsaw toxic comments](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) and [lmsys/toxic-chat](https://huggingface.co/datasets/lmsys/toxic-chat),  or any other source. Alternatively, you may use toxicity scores from [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1).


__B. Actual human feedback:__ use one of the existing datasets with pairwise human feedback to align your language model. You may use [Anthropic's hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf), [OpenAssistant dataset](https://huggingface.co/datasets/OpenAssistant/oasst1) or any other data you see fit. You may also turn the tables and train the model to [minimize](https://habrastorage.org/getpro/geektimes/post_images/ac7/2ad/827/ac72ad82767d4132164a4b6b76196c42.jpg) human preferences, as long as your model does not degrade to gibberish.

__C. Controlled generation:__ Instead of training a reward model from human feedback, you may define the reward function as the text length (longer or shorter) or number of times the model uses specific words (e.g. "sorry", "apologize"). If you choose specific words, make sure the model generates them at least sometimes.
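
For option C, the reward can be an ordinary Python function over the generated text. Below is a minimal sketch, assuming whitespace-separated words as the unit of length and case-insensitive whole-word matching for keywords. The names `length_reward` and `keyword_reward` are made up for this illustration and are not part of TRL; the returned floats would still need converting to whatever format your PPO trainer expects.

```python
import re
from typing import List, Sequence


def length_reward(texts: Sequence[str], prefer_long: bool = True) -> List[float]:
    """Reward = number of whitespace-separated words (negated if shorter is better)."""
    sign = 1.0 if prefer_long else -1.0
    return [sign * len(text.split()) for text in texts]


def keyword_reward(texts: Sequence[str], keywords: Sequence[str]) -> List[float]:
    """Reward = how many times any of the keywords occurs as a whole word."""
    alternatives = "|".join(re.escape(word) for word in keywords)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return [float(len(pattern.findall(text))) for text in texts]


samples = ["I am sorry, so sorry.", "No regrets here"]
print(length_reward(samples))
print(keyword_reward(samples, ["sorry", "apologize"]))
```

If the keyword reward is zero for nearly every baseline generation, PPO has nothing to reinforce, which is why the chosen words should appear at least sometimes before training.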

__Alternatively,__ you may choose a different task. However, unless your task is very similar to one of the above, there is a chance that it will be **significantly** harder to solve, requiring orders of magnitude more compute and tuning. If you are in doubt, please ask the course staff. If they are AFK (again >.<), please prefer one of the recommended tasks.


#### General tips & tricks


Things to look out for:
- during the PPO stage, the reward model should be in eval mode (dropout disabled)
- make sure `max_length` and `max_new_tokens` are large enough for your chosen dataset, at least most of the time
- when in doubt, view the data manually or inspect how the model performs on a few samples
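
One way to check the length limits is to measure how many samples would be cut off at a given limit. This is only a sketch: `truncation_rate` is a hypothetical helper, and the token counts below are invented; in practice they would come from running your tokenizer over the dataset.

```python
from typing import Sequence


def truncation_rate(token_lengths: Sequence[int], max_length: int) -> float:
    """Fraction of samples whose token count exceeds max_length."""
    if not token_lengths:
        return 0.0
    too_long = sum(1 for n in token_lengths if n > max_length)
    return too_long / len(token_lengths)


# Invented token counts; replace with real lengths measured on your data.
lengths = [12, 40, 75, 130, 512, 900, 64, 33]
for limit in (64, 128, 512):
    rate = truncation_rate(lengths, limit)
    print(f"max_length={limit}: {rate:.1%} of samples truncated")
```

A small truncation rate is usually tolerable; a large one means the model mostly sees clipped text.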


We highly recommend that you manually check the performance after each sub-stage:
1. when you have assembled the pairwise dataset, inspect a couple of samples from *your* dataset class and detokenize them. Make sure that you-the-human understand why one sample was accepted and the other rejected, at least most of the time. This also lets you spot tokenization/truncation errors.
2. after you trained a reward model, measure how accurate this model is in isolation. If your reward model is poor, any subsequent RLHF will also fail.
3. once you've trained the main model with RL, ask it to generate examples and explore how well it does. If it produces an obviously bad output, check if the reward model assigns high reward to that output. If yes, the reward model is the culprit; if no, it's a question of better/longer PPO training.
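
For step 2, a natural metric is pairwise accuracy: the share of held-out pairs where the reward model scores the chosen text above the rejected one. Here is a minimal sketch, assuming the reward model is wrapped in a plain `score(text) -> float` callable. The `toy_score` stand-in simply prefers longer text and is for illustration only.

```python
from typing import Callable, Iterable, Tuple


def pairwise_accuracy(pairs: Iterable[Tuple[str, str]],
                      score: Callable[[str], float]) -> float:
    """Share of (chosen, rejected) pairs where `chosen` gets the strictly higher score."""
    total = 0
    correct = 0
    for chosen, rejected in pairs:
        total += 1
        correct += int(score(chosen) > score(rejected))
    return correct / total if total else 0.0


def toy_score(text: str) -> float:
    # Stand-in for a real reward model: longer text gets a higher "reward".
    return float(len(text))


held_out = [
    ("a detailed answer", "meh"),
    ("ok", "a long but rejected answer"),
    ("fine then", "no"),
]
print(f"pairwise accuracy: {pairwise_accuracy(held_out, toy_score):.2f}")
```

Random guessing gives about 0.5 on this metric, so a reward model close to that level has learned very little.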

__It is also a good idea to periodically print samples during training.__

__When stuck, simplify the problem.__ If you've spent several hours enchanting the reward model but it still won't budge, try switching to a simpler subtask. For instance, if you're training on hh-rlhf, try limiting the dataset to the shortest 10% of sequences - they are typically easier to learn.
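
The "shortest 10%" trick can be sketched in a few lines. `keep_shortest` is a hypothetical helper, and length is measured in characters by default; pass a token-counting function as `length` if you prefer token lengths.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def keep_shortest(samples: Sequence[T], fraction: float = 0.1,
                  length: Callable[[T], int] = len) -> List[T]:
    """Keep the shortest `fraction` of samples (at least one), measured by `length`."""
    if not samples:
        return []
    keep = max(1, int(len(samples) * fraction))
    return sorted(samples, key=length)[:keep]


# Toy texts with 50, 3, 400, ... words each.
texts = ["word " * n for n in (50, 3, 400, 12, 7, 90, 1, 30, 250, 5)]
easy_subset = keep_shortest(texts, fraction=0.5)
print([len(text.split()) for text in easy_subset])
```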


## Assignment stages (and grading)

Regardless of the specific task you chose, your solution needs to contain several parts that will be graded separately.


#### Stage 1: reward model (4 points)

Construct a dataset for training the reward model on your problem. Then, train a reward model on that dataset and evaluate how well your model can predict preferences on a hold-out (test) subset of your data.
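
If your pairs are built by combining base examples with each other, it is safer to split the base examples first and form pairs within each part; otherwise the same text can end up on both sides of the split. The sketch below assumes a simple seeded shuffle; `split_before_pairing` is a hypothetical helper name.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_before_pairing(samples: Sequence[T], test_fraction: float = 0.2,
                         seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle once with a fixed seed, then cut into disjoint train and test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_before_pairing(list(range(100)))
print(len(train), len(test), set(train).isdisjoint(test))
```

Pairs are then constructed separately from `train` and from `test`, so no held-out text is seen during reward model training.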

Please make sure that the part of your notebook where you evaluate the reward model is clearly visible and reasonably easy to read. And for all that is holy, do not call it IMDB unless it actually **is** data of IMDB movie reviews :)

__Not all tasks require a reward model for later PPO fine-tuning.__ For instance, there's no reason to train a reward model if your reward equals sentence length. Likewise, toxicity reward can be estimated with a pre-trained toxicity classifier. __If your task does not require training a reward model, please train an unrelated model on [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) as though you were solving assignment version B.__ This is for grading purposes only, you won't use this model for stage 2.


#### Stage 2: RL fine-tuning (4 points)

Once the reward model is ready - or you can compute rewards without a model - it is time to maximize that reward with PPO. Optionally, you may replace PPO with another RL algorithm (or an unlikelihood learning scheme), but only if you're feeling adventurous.


First, you need to choose a language model to be fine-tuned. You may choose any model, but make sure that your model **can** generate the data in your format. For instance, [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) is a general purpose LM and may (or may not) need prompt engineering to generate chat assistant responses. For that reason, it is best if you **do not use `"lvwerra/gpt2-imdb"` unless you're generating only movie reviews**.



There are two "difficulty modes" for this task:
For the **easy mode**, use [gpt2-large](https://huggingface.co/gpt2-large) or [opt-1.3b](https://huggingface.co/facebook/opt-1.3b) with minimal code changes.
For the **hard mode**, use a larger (e.g. 7B) model in combination with `load_in_4bit` and LoRA, the same way we did last week.
Some reasonable model choices are [LLaMA-7B](https://huggingface.co/Enoch/llama-7b-hf), [Falcon-7b](https://huggingface.co/tiiuae/falcon-7b), [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) for general-purpose LM or [guanaco-7b](https://huggingface.co/timdettmers/guanaco-7b), [vicuna-7b](https://huggingface.co/lmsys/vicuna-7b-v1.5) for chat-based tasks, though there are many more (see [leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)). In the hard mode, you will need to modify the training arguments to enable 4-bit fine-tuning. Furthermore, your experiments will take somewhat longer to complete. On the plus side, your model will produce significantly better results.

__High reward is not enough!__ RL algorithms are famous for [cheating their reward functions](https://openai.com/research/faulty-reward-functions). To ensure that your model is actually doing what you want it to do, you will need some additional evaluation. To get the full grade, provide at least 20 side-by-side examples of your fine-tuned model vs original model predictions and a short summary.

Alternatively, you may provide 5 examples and some extrinsic evaluation metric over many examples. For instance, you may use a different pre-trained toxicity score for option A. When dealing with human preferences, you may choose to [enlist actual humans](https://toloka.ai/) or [ask GPT4/Claude](https://arxiv.org/pdf/2304.03277.pdf) to compare your model's predictions. For task C, when optimizing for simple rewards like sentence lengths, it is enough to compare histograms of rewards (e.g. average lengths).
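
For a length-based reward, the comparison can be as simple as the sketch below. The generations are invented placeholders standing in for outputs of the original and fine-tuned models on the same prompts, and `compare_lengths` is a hypothetical helper.

```python
from statistics import mean
from typing import Dict, Sequence


def compare_lengths(baseline: Sequence[str], tuned: Sequence[str]) -> Dict[str, float]:
    """Average word counts of two sets of generations and their difference."""
    base_avg = mean(len(text.split()) for text in baseline)
    tuned_avg = mean(len(text.split()) for text in tuned)
    return {"baseline": base_avg, "tuned": tuned_avg, "delta": tuned_avg - base_avg}


# Invented generations for illustration only.
baseline_out = ["short reply", "a bit longer reply here"]
tuned_out = ["this reply is noticeably longer than before",
             "and so is this one right here"]

# Side-by-side view: original model on the left, fine-tuned model on the right.
for before, after in zip(baseline_out, tuned_out):
    print(f"{before!r:40} | {after!r}")

result = compare_lengths(baseline_out, tuned_out)
print(result)
```

Reading the side-by-side rows matters as much as the averages: a model can hit a length target by repeating itself.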












In [43]:
from datasets import load_dataset

ds = load_dataset("lmsys/toxic-chat", "toxicchat0124")

In [44]:
import ast

import torch
import numpy as np
import datasets

class FeaturedDataset(torch.utils.data.Dataset):
    """A dataset of all ordered sample pairs, ranked by one moderation score"""
    def __init__(self, ds, tokenizer, target_feature: str):
        super().__init__()
        self.tokenizer = tokenizer
        self.ds = ds
        self.target_feature = target_feature  # moderation category name, e.g. 'sexual'

    def __len__(self):
        return len(self.ds)**2  # all ordered pairs, including (i, i)

    def _score(self, sample) -> float:
        # 'openai_moderation' is a string holding (category, score) pairs;
        # ast.literal_eval parses it without executing arbitrary code, unlike eval
        return dict(ast.literal_eval(sample['openai_moderation']))[self.target_feature]

    def __getitem__(self, index: int):
        sample = self.ds[index//len(self.ds)]
        another = self.ds[index%len(self.ds)]
        # "chosen" is the output with the HIGHER target score; ties go to `another`
        if self._score(sample) > self._score(another):
            chosen = self.tokenizer(sample['model_output'], truncation=True)
            rejected = self.tokenizer(another['model_output'], truncation=True)
        else:
            chosen = self.tokenizer(another['model_output'], truncation=True)
            rejected = self.tokenizer(sample['model_output'], truncation=True)

        return dict(input_ids_chosen=chosen['input_ids'], attention_mask_chosen=chosen['attention_mask'],
                    input_ids_rejected=rejected['input_ids'], attention_mask_rejected=rejected['attention_mask'])
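
The dataset above addresses every ordered pair of samples through a single flat index. The sketch below reproduces that mapping on a tiny example; `pair_from_index` is a hypothetical helper that mirrors `index // n` and `index % n`.

```python
from typing import Tuple


def pair_from_index(index: int, n: int) -> Tuple[int, int]:
    """Map a flat index in [0, n*n) to (first, second) sample positions."""
    if not 0 <= index < n * n:
        raise IndexError(index)
    return divmod(index, n)  # same as (index // n, index % n)


n = 3
pairs = [pair_from_index(i, n) for i in range(n * n)]
print(pairs)

self_pairs = sum(1 for first, second in pairs if first == second)
print(f"{self_pairs} of {n * n} pairs compare a sample with itself")
```

With this scheme each unordered pair appears twice, and `n` of the `n * n` pairs match a sample with itself. Those self-pairs, like exact ties in the score, carry no preference signal; filtering them out is an optional refinement, not something the original code does.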

In [45]:

import transformers
device = 'cuda'
reward_model = transformers.AutoModelForSequenceClassification.from_pretrained("distilbert-base-cased", device_map=device)
reward_tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert-base-cased")
reward_data = FeaturedDataset(ds['train'],reward_tokenizer,'sexual')

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [4]:
import trl

training_args = trl.RewardConfig( # like transformers.TrainingArguments
    output_dir="r_model",
    per_device_train_batch_size=100,
    gradient_accumulation_steps=2,
    learning_rate=2e-5,
    max_steps=2000,            # note: training may need more than 1k steps
    logging_steps=50,
    gradient_checkpointing=True,  # reduce memory usage but train ~30% slower
    gradient_checkpointing_kwargs={"use_reentrant": False},
    fp16=True,
    dataset_num_proc=11,
    
)

trainer = trl.RewardTrainer(
    model=reward_model,
    args=training_args,
    tokenizer=reward_tokenizer,
    train_dataset=reward_data,
    peft_config=None,  # optionally, you may tune with LoRA, prompt-tuning, etc
)

trainer.train()

Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
    https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)
max_steps is given, it will override any value given in num_train_epochs
  0%|          | 0/2000 [00:00<?, ?it/s]You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
Could not estimate the number of tokens of the input, floating-point o

{'loss': 0.6349, 'grad_norm': 1.9936622381210327, 'learning_rate': 1.95e-05, 'epoch': 0.0}


  5%|▌         | 100/2000 [07:57<2:28:29,  4.69s/it]

{'loss': 0.553, 'grad_norm': 2.1092488765716553, 'learning_rate': 1.9010000000000003e-05, 'epoch': 0.0}


  8%|▊         | 150/2000 [11:50<2:17:55,  4.47s/it]

{'loss': 0.4982, 'grad_norm': 1.9142942428588867, 'learning_rate': 1.851e-05, 'epoch': 0.0}


 10%|█         | 200/2000 [15:32<2:10:41,  4.36s/it]

{'loss': 0.4434, 'grad_norm': 2.496483087539673, 'learning_rate': 1.8010000000000002e-05, 'epoch': 0.0}


 12%|█▎        | 250/2000 [19:14<2:12:15,  4.53s/it]

{'loss': 0.3843, 'grad_norm': 2.6583919525146484, 'learning_rate': 1.751e-05, 'epoch': 0.0}


 15%|█▌        | 300/2000 [23:11<2:28:11,  5.23s/it]

{'loss': 0.347, 'grad_norm': 2.926581621170044, 'learning_rate': 1.701e-05, 'epoch': 0.0}


 18%|█▊        | 350/2000 [27:05<2:06:17,  4.59s/it]

{'loss': 0.321, 'grad_norm': 3.3935976028442383, 'learning_rate': 1.6510000000000003e-05, 'epoch': 0.0}


 20%|██        | 400/2000 [31:12<2:00:32,  4.52s/it]

{'loss': 0.2893, 'grad_norm': 2.8093655109405518, 'learning_rate': 1.601e-05, 'epoch': 0.0}


 22%|██▎       | 450/2000 [34:53<1:52:07,  4.34s/it]

{'loss': 0.2828, 'grad_norm': 3.060565948486328, 'learning_rate': 1.5510000000000002e-05, 'epoch': 0.0}


 25%|██▌       | 500/2000 [38:51<1:53:17,  4.53s/it]

{'loss': 0.2587, 'grad_norm': 2.9939539432525635, 'learning_rate': 1.501e-05, 'epoch': 0.0}


 28%|██▊       | 550/2000 [42:38<1:45:31,  4.37s/it]

{'loss': 0.2447, 'grad_norm': 2.8916547298431396, 'learning_rate': 1.4510000000000002e-05, 'epoch': 0.0}


 30%|███       | 600/2000 [46:17<1:41:01,  4.33s/it]

{'loss': 0.233, 'grad_norm': 2.8077616691589355, 'learning_rate': 1.4010000000000001e-05, 'epoch': 0.0}


 32%|███▎      | 650/2000 [50:05<2:05:20,  5.57s/it]

{'loss': 0.2274, 'grad_norm': 2.9370219707489014, 'learning_rate': 1.3510000000000001e-05, 'epoch': 0.01}


 35%|███▌      | 700/2000 [53:52<1:34:52,  4.38s/it]

{'loss': 0.2238, 'grad_norm': 2.917951822280884, 'learning_rate': 1.301e-05, 'epoch': 0.01}


 38%|███▊      | 750/2000 [57:46<1:50:56,  5.32s/it]

{'loss': 0.206, 'grad_norm': 3.9415810108184814, 'learning_rate': 1.251e-05, 'epoch': 0.01}


 40%|████      | 800/2000 [1:01:24<1:26:50,  4.34s/it]

{'loss': 0.2012, 'grad_norm': 3.655064582824707, 'learning_rate': 1.2010000000000002e-05, 'epoch': 0.01}


 42%|████▎     | 850/2000 [1:05:08<1:25:14,  4.45s/it]

{'loss': 0.1942, 'grad_norm': 3.060561418533325, 'learning_rate': 1.1510000000000002e-05, 'epoch': 0.01}


 45%|████▌     | 900/2000 [1:08:57<1:20:02,  4.37s/it]

{'loss': 0.1955, 'grad_norm': 2.7942662239074707, 'learning_rate': 1.1010000000000001e-05, 'epoch': 0.01}


 48%|████▊     | 950/2000 [1:12:43<1:23:26,  4.77s/it]

{'loss': 0.1868, 'grad_norm': 2.633648633956909, 'learning_rate': 1.0510000000000001e-05, 'epoch': 0.01}


 50%|█████     | 1000/2000 [1:16:54<1:17:03,  4.62s/it]

{'loss': 0.1875, 'grad_norm': 3.1920478343963623, 'learning_rate': 1.0009999999999999e-05, 'epoch': 0.01}


 52%|█████▎    | 1050/2000 [1:20:50<1:13:08,  4.62s/it]

{'loss': 0.1806, 'grad_norm': 3.432727098464966, 'learning_rate': 9.51e-06, 'epoch': 0.01}


 55%|█████▌    | 1100/2000 [1:24:49<1:18:38,  5.24s/it]

{'loss': 0.1758, 'grad_norm': 3.018944263458252, 'learning_rate': 9.01e-06, 'epoch': 0.01}


 57%|█████▊    | 1150/2000 [1:28:46<1:04:17,  4.54s/it]

{'loss': 0.1757, 'grad_norm': 3.684968948364258, 'learning_rate': 8.51e-06, 'epoch': 0.01}


 60%|██████    | 1200/2000 [1:32:44<1:01:04,  4.58s/it]

{'loss': 0.1665, 'grad_norm': 2.986323356628418, 'learning_rate': 8.010000000000001e-06, 'epoch': 0.01}


 62%|██████▎   | 1250/2000 [1:36:36<1:03:14,  5.06s/it]

{'loss': 0.1629, 'grad_norm': 3.572213888168335, 'learning_rate': 7.510000000000001e-06, 'epoch': 0.01}


 65%|██████▌   | 1300/2000 [1:40:38<54:46,  4.70s/it]  

{'loss': 0.1637, 'grad_norm': 3.561124086380005, 'learning_rate': 7.01e-06, 'epoch': 0.01}


 68%|██████▊   | 1350/2000 [1:44:26<49:15,  4.55s/it]

{'loss': 0.1609, 'grad_norm': 2.7606916427612305, 'learning_rate': 6.51e-06, 'epoch': 0.01}


 70%|███████   | 1400/2000 [1:48:35<48:57,  4.90s/it]  

{'loss': 0.1542, 'grad_norm': 3.6478464603424072, 'learning_rate': 6.01e-06, 'epoch': 0.01}


 72%|███████▎  | 1450/2000 [1:52:53<53:55,  5.88s/it]  

{'loss': 0.1585, 'grad_norm': 2.8350319862365723, 'learning_rate': 5.510000000000001e-06, 'epoch': 0.01}


 75%|███████▌  | 1500/2000 [1:56:43<37:45,  4.53s/it]

{'loss': 0.1484, 'grad_norm': 3.2492713928222656, 'learning_rate': 5.01e-06, 'epoch': 0.01}


 78%|███████▊  | 1550/2000 [2:00:59<34:19,  4.58s/it]  

{'loss': 0.155, 'grad_norm': 3.4183096885681152, 'learning_rate': 4.510000000000001e-06, 'epoch': 0.01}


 80%|████████  | 1600/2000 [2:05:00<32:37,  4.89s/it]

{'loss': 0.1461, 'grad_norm': 3.3123769760131836, 'learning_rate': 4.0100000000000006e-06, 'epoch': 0.01}


 82%|████████▎ | 1650/2000 [2:09:02<37:30,  6.43s/it]

{'loss': 0.148, 'grad_norm': 3.708418369293213, 'learning_rate': 3.5100000000000003e-06, 'epoch': 0.01}


 85%|████████▌ | 1700/2000 [2:13:04<33:27,  6.69s/it]

{'loss': 0.1424, 'grad_norm': 2.6902918815612793, 'learning_rate': 3.01e-06, 'epoch': 0.01}


 88%|████████▊ | 1750/2000 [2:16:58<19:45,  4.74s/it]

{'loss': 0.1435, 'grad_norm': 3.617710590362549, 'learning_rate': 2.51e-06, 'epoch': 0.01}


 90%|█████████ | 1800/2000 [2:21:00<15:30,  4.65s/it]

{'loss': 0.1386, 'grad_norm': 3.330132246017456, 'learning_rate': 2.0100000000000002e-06, 'epoch': 0.01}


 92%|█████████▎| 1850/2000 [2:25:26<11:24,  4.57s/it]

{'loss': 0.142, 'grad_norm': 3.7916035652160645, 'learning_rate': 1.5100000000000002e-06, 'epoch': 0.01}


 95%|█████████▌| 1900/2000 [2:29:13<07:31,  4.52s/it]

{'loss': 0.1311, 'grad_norm': 3.591104507446289, 'learning_rate': 1.01e-06, 'epoch': 0.01}


 98%|█████████▊| 1950/2000 [2:33:42<03:47,  4.55s/it]

{'loss': 0.1388, 'grad_norm': 2.7564380168914795, 'learning_rate': 5.1e-07, 'epoch': 0.02}


100%|██████████| 2000/2000 [2:37:41<00:00,  4.35s/it]

{'loss': 0.1377, 'grad_norm': 3.4875423908233643, 'learning_rate': 1e-08, 'epoch': 0.02}


100%|██████████| 2000/2000 [2:37:42<00:00,  4.73s/it]

{'train_runtime': 9462.5938, 'train_samples_per_second': 42.272, 'train_steps_per_second': 0.211, 'train_loss': 0.229576851606369, 'epoch': 0.02}





TrainOutput(global_step=2000, training_loss=0.229576851606369, metrics={'train_runtime': 9462.5938, 'train_samples_per_second': 42.272, 'train_steps_per_second': 0.211, 'total_flos': 0.0, 'train_loss': 0.229576851606369, 'epoch': 0.015487787879257206})
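
Two numbers in this log are easier to read with a little arithmetic. The sketch below assumes the standard pairwise logistic (Bradley-Terry style) objective for reward models, and a single GPU, so that one optimizer step consumes batch size times accumulation steps pairs. Under those assumptions, a model with no preference at all has a loss of ln 2 ≈ 0.693, which is close to where the logged loss starts, and the tiny `epoch` value follows from the quadratic dataset size.

```python
import math


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), written in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    # log(1 + exp(-margin)) without overflow for large |margin|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


print(f"no preference (margin 0): {pairwise_loss(0.0, 0.0):.4f}")
print(f"chosen ahead by 2:        {pairwise_loss(2.0, 0.0):.4f}")

# Values copied from the training cell and the final TrainOutput line.
pairs_seen = 100 * 2 * 2000              # batch size * accumulation steps * max_steps
epoch_fraction = 0.015487787879257206    # 'epoch' reported in TrainOutput
implied_pairs = pairs_seen / epoch_fraction
implied_base = math.sqrt(implied_pairs)
print(f"implied dataset size: about {implied_pairs:,.0f} pairs "
      f"(roughly {implied_base:,.0f} base examples, squared)")
```

In other words, 2000 steps cover under 2% of all possible pairs, so the falling loss reflects a small random subset of the comparisons.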

In [1]:
from datasets import load_dataset

ds = load_dataset("lmsys/toxic-chat", "toxicchat0124")



In [46]:
import transformers
device = 'cuda'
reward_tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert-base-cased")
reward_model = transformers.AutoModelForSequenceClassification.from_pretrained("r_model/checkpoint-2000", device_map=device)

In [3]:
import trl
import torch
import transformers
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# 4-bit quantization (with double quantization) to reduce the memory footprint of gpt2-large
qconfig = transformers.BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True)
main_tokenizer = transformers.AutoTokenizer.from_pretrained("openai-community/gpt2-large")
main_tokenizer.pad_token = main_tokenizer.eos_token  # GPT-2 has no pad token, so we reuse EOS
main_model = trl.AutoModelForCausalLMWithValueHead.from_pretrained(
    "openai-community/gpt2-large", device_map=device,
    quantization_config=qconfig, torch_dtype=torch.float32)


In [4]:
from peft import get_peft_model, LoraConfig, TaskType
lora_config = LoraConfig(
    target_modules=["c_attn", "c_proj", "c_fc", "lm_head"],
    lora_alpha=32,
    inference_mode=False,
    use_rslora=True,
    bias='all',
    lora_dropout=0.,
    r=4,
    task_type=TaskType.CAUSAL_LM,
    init_lora_weights='gaussian',
)
main_model = get_peft_model(main_model, lora_config)
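
A quick note on `use_rslora=True`: to the best of our knowledge, PEFT multiplies the low-rank update by `lora_alpha / r` by default and by `lora_alpha / sqrt(r)` when rank-stabilized LoRA is enabled. The snippet below is plain arithmetic for the values used in this config, not PEFT code; `lora_scaling` is a hypothetical helper name.

```python
import math

def lora_scaling(lora_alpha: float, r: int, use_rslora: bool) -> float:
    """Multiplier applied to the low-rank update B @ A (assumed PEFT convention)."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

standard = lora_scaling(32, 4, use_rslora=False)        # lora_alpha / r
rank_stabilized = lora_scaling(32, 4, use_rslora=True)  # lora_alpha / sqrt(r)
print(standard, rank_stabilized)
```

With `r=4` and `lora_alpha=32`, the rank-stabilized variant doubles the effective scale of the adapter update, which is worth keeping in mind when choosing the learning rate.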

In [5]:
main_model.print_trainable_parameters()

trainable params: 3,663,429 || all params: 777,186,629 || trainable%: 0.4714
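
Where does the 3,663,429 come from? Here is one accounting that reproduces it. It is a sketch under assumptions: the GPT-2 large dimensions below are taken from the published architecture, `bias='all'` is assumed to unfreeze every bias in the network (LayerNorm biases included), and the single leftover parameter is assumed to be the bias of the scalar value head.

```python
# GPT-2 large dimensions (assumed from the published architecture)
d_model, n_layers, vocab, r = 1280, 36, 50257, 4

def lora_params(d_in: int, d_out: int, rank: int = r) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) to a d_in -> d_out layer."""
    return rank * (d_in + d_out)

per_layer_lora = (
    lora_params(d_model, 3 * d_model)    # attn.c_attn
    + lora_params(d_model, d_model)      # attn.c_proj
    + lora_params(d_model, 4 * d_model)  # mlp.c_fc
    + lora_params(4 * d_model, d_model)  # mlp.c_proj
)
lora_total = n_layers * per_layer_lora + lora_params(d_model, vocab)  # + lm_head

# biases per block: ln_1, c_attn, attn.c_proj, ln_2, c_fc, mlp.c_proj
per_layer_bias = d_model + 3 * d_model + d_model + d_model + 4 * d_model + d_model
bias_total = n_layers * per_layer_bias + d_model  # + final LayerNorm bias
value_head_bias = 1  # assumption: the value head's bias is trainable as well

total = lora_total + bias_total + value_head_bias
print(total, f"{100 * total / 777_186_629:.4f}%")
```

Under these assumptions, roughly 86% of the trainable parameters are LoRA matrices and the rest are biases.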


In [6]:
import trl
# Note: this code is specific to the toxic-chat dataset loaded above (it reads the "user_input" column);
# you will need to re-write it for other tasks
sample_length = trl.core.LengthSampler(2, 8)  # use the first 2-8 tokens as query

def select_query_and_tokenize(sample):
    query_ids = main_tokenizer.encode(sample["user_input"])[:sample_length()]
    sample["query"] = main_tokenizer.decode(query_ids)  # query is the only required column
    sample["input_ids"] = query_ids  # to avoid re-tokenizing later
    return sample  # we do not need the rest - it will be generated by the model

ds_rlhf = ds['test'].map(select_query_and_tokenize, batched=False)
ds_rlhf.set_format(type="torch")
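
To make the query construction concrete, here is a dependency-free sketch of the same idea: keep a random-length prefix of each tokenized prompt and let the model generate the rest. `sample_prefix` is a hypothetical helper, the token ids are made up, and the exclusive upper bound is an assumption about how `LengthSampler` behaves, so check it against your TRL version.

```python
import random

def sample_prefix(token_ids, min_len=2, max_len=8, rng=random):
    """Keep a random-length prefix; max_len is exclusive (assumed to mirror LengthSampler)."""
    length = rng.randrange(min_len, max_len)
    return token_ids[:length]  # prompts shorter than `length` are kept whole

rng = random.Random(0)
toy_ids = list(range(100, 120))  # stand-in for tokenizer output
prefixes = [sample_prefix(toy_ids, rng=rng) for _ in range(1000)]
lengths = sorted({len(p) for p in prefixes})
print(lengths)
```

Short, variable-length queries leave most of the text to the policy, so the reward mostly reflects what the model chose to write rather than what the user typed.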

In [7]:
import trl
training_args = trl.PPOConfig(
    model_name=main_model.config._name_or_path,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    batch_size=32,
    mini_batch_size=8,
    ppo_epochs=4,  # PPO performs this many updates per training batch
    dataset_num_proc=8,
)

ppo_trainer = trl.PPOTrainer(
    training_args, model=main_model.model, tokenizer=main_tokenizer,
    dataset=ds_rlhf, data_collator=lambda data: dict((key, [d[key] for d in data]) for key in data[0]),
)  # note: we pass main_model.model because PPOTrainer checks for one of several supported model types ...
# ... main_model.model is a model with adapters, which is supported. main_model itself is a wrapper that is not supported

Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
    https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md
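
The `data_collator` lambda above is easy to misread, so here is the same logic as a named function applied to a toy batch. It turns a list of per-sample dicts into a dict of lists without stacking or padding anything, which matters because queries have different lengths. The strings and token ids below are made up for illustration.

```python
def collate(data):
    """Same logic as the data_collator lambda: list of dicts -> dict of lists."""
    return {key: [d[key] for d in data] for key in data[0]}

toy_batch = [
    {"query": "How do I", "input_ids": [11, 12, 13]},
    {"query": "Tell me a joke", "input_ids": [21, 22, 23, 24]},
]
collated = collate(toy_batch)
print(collated["query"])
print(collated["input_ids"])
```

Each value stays a plain Python list with one entry per sample, which is the shape the training loop passes to `ppo_trainer.generate` and `ppo_trainer.step`.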



In [8]:
from typing import List
def compute_reward(texts: List[str]) -> torch.Tensor:
    """Score each text with the reward model; returns one scalar (the class-0 logit) per text."""
    inputs = reward_tokenizer(texts, truncation=True, padding=True, return_tensors='pt').to(device)
    with torch.no_grad():
        return reward_model(**inputs).logits[:, 0]
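
The scalar from `compute_reward` is not the whole training signal. As far as we understand TRL's PPO implementation, every generated token also receives a penalty proportional to the log-probability gap between the fine-tuned model and the frozen reference model, and the reward-model score is added on the last token only. The sketch below illustrates that shaping in plain Python; `shaped_rewards` is a hypothetical helper, the log-probabilities are made up, and `kl_coef=0.2` is assumed to match TRL's default initial KL coefficient.

```python
def shaped_rewards(logprobs, ref_logprobs, score, kl_coef=0.2):
    """Per-token rewards: KL penalty on every token, reward-model score on the last one."""
    kl = [lp - ref for lp, ref in zip(logprobs, ref_logprobs)]
    rewards = [-kl_coef * k for k in kl]
    rewards[-1] += score
    return rewards, sum(kl)  # the sum is a per-sequence KL estimate

# made-up log-probabilities of three sampled tokens
policy_lp = [-1.0, -0.5, -2.0]
reference_lp = [-1.5, -0.5, -1.0]
rewards, kl_estimate = shaped_rewards(policy_lp, reference_lp, score=3.0)
print(rewards, kl_estimate)
```

Two things to take away: the KL term is why `ppo/returns/mean` in the log below differs from `rewards/mean`, and since the KL value is estimated from sampled tokens, it can occasionally come out negative, as in this toy example.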

In [9]:
import os
from tqdm.auto import tqdm
max_steps = 150   # can be insufficient for some tasks - watch your learning curves
generation_kwargs = dict(
    min_length=-1, max_new_tokens=128,  # <-- max_new_tokens is a task-specific parameter!
    do_sample=True, top_k=0, top_p=1.0, pad_token_id=main_tokenizer.eos_token_id)

def save_trainable_params(model, out_dir="ft"):
    # save only the trainable tensors (adapters and biases), one file per parameter
    os.makedirs(out_dir, exist_ok=True)  # torch.save fails if the directory does not exist
    for name, param in model.named_parameters():
        if param.requires_grad:
            torch.save(param, f"{out_dir}/{name}")

with tqdm(enumerate(ppo_trainer.dataloader), total=max_steps) as progressbar:
    # note: ppo_trainer.dataloader is just a regular dataloader of queries, no RL-specific magic :)
    for step, batch in progressbar:
        if step >= max_steps:
            break
        # Rollout stage: generate continuations from batch queries using main_model
        response_tensors = ppo_trainer.generate(batch['input_ids'], **generation_kwargs)
        # ^-- list of tensors of token ids from main model tokenizer
        # de-tokenize responses to strings (since reward model uses a different tokenizer)
        batch["response"] = [main_tokenizer.decode(response.squeeze()) for response in response_tensors]
        # note: response_tensors already contain query tokens, so we don't need to add queries manually.
        # This may not be true for other tasks: check this manually by viewing batch["response"] and batch["query"]
        # Evaluation stage
        rewards = compute_reward(batch['response'])
        # Update stage
        stats = ppo_trainer.step(batch['input_ids'], response_tensors, list(rewards.split(1)))
        stats['rewards/mean'] = rewards.mean().item()

        print("-" * 30, 'STEP', step, '-' * 30)
        print(f'rewards/mean:\t{stats["rewards/mean"]:.9f}\t<---- average reward over this batch (higher=better, noisy)')
        print(f'ppo/returns/mean:\t{stats["ppo/returns/mean"]:.9f}\t<---- model-estimated average discounted reward')
        print(f'objective/kl:\t{stats["objective/kl"]:.9f}\t<---- how far we are from the original model (regularizer)')
        print()
        if step % 10 == 0:
            save_trainable_params(main_model)  # periodic checkpoint
        ppo_trainer.log_stats(stats, batch, list(rewards.split(1)))
    save_trainable_params(main_model)  # final checkpoint after training

    

  0%|          | 0/150 [00:00<?, ?it/s]You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  1%|          | 1/150 [01:03<2:37:43, 63.51s/it]

------------------------------ STEP 0 ------------------------------
rewards/mean:	1.341312528	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.226613402	<---- model-estimated average discounted reward
objective/kl:	0.000000000	<---- how far we are from the original model (regularizer)



  1%|▏         | 2/150 [02:15<2:49:08, 68.57s/it]

------------------------------ STEP 1 ------------------------------
rewards/mean:	1.314770699	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.319299579	<---- model-estimated average discounted reward
objective/kl:	-0.634562492	<---- how far we are from the original model (regularizer)



  2%|▏         | 3/150 [03:24<2:47:50, 68.51s/it]

------------------------------ STEP 2 ------------------------------
rewards/mean:	1.233374119	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.466379970	<---- model-estimated average discounted reward
objective/kl:	-1.265505314	<---- how far we are from the original model (regularizer)



  3%|▎         | 4/150 [04:29<2:43:40, 67.26s/it]

------------------------------ STEP 3 ------------------------------
rewards/mean:	1.979054451	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.677819371	<---- model-estimated average discounted reward
objective/kl:	0.299598038	<---- how far we are from the original model (regularizer)



  3%|▎         | 5/150 [05:38<2:44:12, 67.95s/it]

------------------------------ STEP 4 ------------------------------
rewards/mean:	1.392532110	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.762321830	<---- model-estimated average discounted reward
objective/kl:	0.991976619	<---- how far we are from the original model (regularizer)



  4%|▍         | 6/150 [06:47<2:43:55, 68.30s/it]

------------------------------ STEP 5 ------------------------------
rewards/mean:	2.719542503	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.039131522	<---- model-estimated average discounted reward
objective/kl:	0.826579690	<---- how far we are from the original model (regularizer)



  5%|▍         | 7/150 [07:55<2:42:24, 68.14s/it]

------------------------------ STEP 6 ------------------------------
rewards/mean:	1.877263546	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.020757914	<---- model-estimated average discounted reward
objective/kl:	3.831849575	<---- how far we are from the original model (regularizer)



  5%|▌         | 8/150 [08:58<2:37:37, 66.60s/it]

------------------------------ STEP 7 ------------------------------
rewards/mean:	1.731281519	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.048143148	<---- model-estimated average discounted reward
objective/kl:	6.026952744	<---- how far we are from the original model (regularizer)



  6%|▌         | 9/150 [10:06<2:37:34, 67.05s/it]

------------------------------ STEP 8 ------------------------------
rewards/mean:	2.312290907	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.136253119	<---- model-estimated average discounted reward
objective/kl:	5.247801304	<---- how far we are from the original model (regularizer)



  7%|▋         | 10/150 [11:10<2:34:23, 66.17s/it]

------------------------------ STEP 9 ------------------------------
rewards/mean:	1.951788783	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.090348482	<---- model-estimated average discounted reward
objective/kl:	5.103988171	<---- how far we are from the original model (regularizer)



  7%|▋         | 11/150 [12:18<2:34:34, 66.72s/it]

------------------------------ STEP 10 ------------------------------
rewards/mean:	0.812562227	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.898574233	<---- model-estimated average discounted reward
objective/kl:	4.089690685	<---- how far we are from the original model (regularizer)



  8%|▊         | 12/150 [13:24<2:32:27, 66.29s/it]

------------------------------ STEP 11 ------------------------------
rewards/mean:	2.612108707	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.079362035	<---- model-estimated average discounted reward
objective/kl:	7.723804474	<---- how far we are from the original model (regularizer)



  9%|▊         | 13/150 [14:28<2:29:53, 65.65s/it]

------------------------------ STEP 12 ------------------------------
rewards/mean:	1.088492632	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.764500320	<---- model-estimated average discounted reward
objective/kl:	7.949388981	<---- how far we are from the original model (regularizer)



  9%|▉         | 14/150 [15:37<2:31:22, 66.78s/it]

------------------------------ STEP 13 ------------------------------
rewards/mean:	3.501738548	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.361379981	<---- model-estimated average discounted reward
objective/kl:	9.644536972	<---- how far we are from the original model (regularizer)



 10%|█         | 15/150 [16:43<2:29:45, 66.56s/it]

------------------------------ STEP 14 ------------------------------
rewards/mean:	3.466894388	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.260950446	<---- model-estimated average discounted reward
objective/kl:	9.719232559	<---- how far we are from the original model (regularizer)



 11%|█         | 16/150 [17:52<2:30:21, 67.33s/it]

------------------------------ STEP 15 ------------------------------
rewards/mean:	2.914625168	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.144826531	<---- model-estimated average discounted reward
objective/kl:	11.818075180	<---- how far we are from the original model (regularizer)



 11%|█▏        | 17/150 [19:04<2:31:49, 68.49s/it]

------------------------------ STEP 16 ------------------------------
rewards/mean:	3.355872154	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.135199308	<---- model-estimated average discounted reward
objective/kl:	9.554873466	<---- how far we are from the original model (regularizer)



 12%|█▏        | 18/150 [20:09<2:28:23, 67.45s/it]

------------------------------ STEP 17 ------------------------------
rewards/mean:	3.872513056	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.383689523	<---- model-estimated average discounted reward
objective/kl:	10.846442223	<---- how far we are from the original model (regularizer)



 13%|█▎        | 19/150 [21:19<2:29:19, 68.39s/it]

------------------------------ STEP 18 ------------------------------
rewards/mean:	3.645012140	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.338815928	<---- model-estimated average discounted reward
objective/kl:	12.541727066	<---- how far we are from the original model (regularizer)



 13%|█▎        | 20/150 [22:33<2:31:29, 69.92s/it]

------------------------------ STEP 19 ------------------------------
rewards/mean:	2.355683327	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.241760850	<---- model-estimated average discounted reward
objective/kl:	13.797780037	<---- how far we are from the original model (regularizer)



 14%|█▍        | 21/150 [23:44<2:31:22, 70.40s/it]

------------------------------ STEP 20 ------------------------------
rewards/mean:	3.898432016	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.661746025	<---- model-estimated average discounted reward
objective/kl:	17.300355911	<---- how far we are from the original model (regularizer)



 15%|█▍        | 22/150 [24:53<2:28:50, 69.77s/it]

------------------------------ STEP 21 ------------------------------
rewards/mean:	5.079894066	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.822383523	<---- model-estimated average discounted reward
objective/kl:	19.491298676	<---- how far we are from the original model (regularizer)



 15%|█▌        | 23/150 [26:05<2:29:40, 70.71s/it]

------------------------------ STEP 22 ------------------------------
rewards/mean:	3.485247612	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.542023897	<---- model-estimated average discounted reward
objective/kl:	19.192256927	<---- how far we are from the original model (regularizer)



 16%|█▌        | 24/150 [27:16<2:28:16, 70.61s/it]

------------------------------ STEP 23 ------------------------------
rewards/mean:	4.860739708	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.600691676	<---- model-estimated average discounted reward
objective/kl:	17.574884415	<---- how far we are from the original model (regularizer)



 17%|█▋        | 25/150 [28:30<2:29:14, 71.63s/it]

------------------------------ STEP 24 ------------------------------
rewards/mean:	5.452943802	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	2.332481384	<---- model-estimated average discounted reward
objective/kl:	23.163593292	<---- how far we are from the original model (regularizer)



 17%|█▋        | 26/150 [29:40<2:27:21, 71.30s/it]

------------------------------ STEP 25 ------------------------------
rewards/mean:	3.536050558	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.345178843	<---- model-estimated average discounted reward
objective/kl:	18.190950394	<---- how far we are from the original model (regularizer)



 18%|█▊        | 27/150 [30:50<2:25:08, 70.80s/it]

------------------------------ STEP 26 ------------------------------
rewards/mean:	5.425052643	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.897816420	<---- model-estimated average discounted reward
objective/kl:	19.492385864	<---- how far we are from the original model (regularizer)



 19%|█▊        | 28/150 [32:03<2:25:00, 71.32s/it]

------------------------------ STEP 27 ------------------------------
rewards/mean:	4.559685707	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.911891699	<---- model-estimated average discounted reward
objective/kl:	18.495073318	<---- how far we are from the original model (regularizer)



 19%|█▉        | 29/150 [33:14<2:23:38, 71.23s/it]

------------------------------ STEP 28 ------------------------------
rewards/mean:	5.250615120	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.917570710	<---- model-estimated average discounted reward
objective/kl:	19.804821014	<---- how far we are from the original model (regularizer)



 20%|██        | 30/150 [34:23<2:21:13, 70.61s/it]

------------------------------ STEP 29 ------------------------------
rewards/mean:	5.997655869	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	2.242083073	<---- model-estimated average discounted reward
objective/kl:	22.009107590	<---- how far we are from the original model (regularizer)



 21%|██        | 31/150 [35:28<2:16:49, 68.99s/it]

------------------------------ STEP 30 ------------------------------
rewards/mean:	6.340649605	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	2.051354885	<---- model-estimated average discounted reward
objective/kl:	19.054113388	<---- how far we are from the original model (regularizer)



 21%|██▏       | 32/150 [36:35<2:14:23, 68.34s/it]

------------------------------ STEP 31 ------------------------------
rewards/mean:	5.862373352	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.834059715	<---- model-estimated average discounted reward
objective/kl:	19.961746216	<---- how far we are from the original model (regularizer)



 22%|██▏       | 33/150 [37:41<2:11:57, 67.67s/it]

------------------------------ STEP 32 ------------------------------
rewards/mean:	7.908440590	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	3.315105677	<---- model-estimated average discounted reward
objective/kl:	19.547580719	<---- how far we are from the original model (regularizer)



 23%|██▎       | 34/150 [38:54<2:14:04, 69.35s/it]

------------------------------ STEP 33 ------------------------------
rewards/mean:	6.929636478	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	3.274160624	<---- model-estimated average discounted reward
objective/kl:	19.270620346	<---- how far we are from the original model (regularizer)



 23%|██▎       | 35/150 [40:03<2:12:29, 69.13s/it]

------------------------------ STEP 34 ------------------------------
rewards/mean:	7.966818810	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	3.354738712	<---- model-estimated average discounted reward
objective/kl:	20.693359375	<---- how far we are from the original model (regularizer)



 24%|██▍       | 36/150 [41:08<2:08:53, 67.84s/it]

------------------------------ STEP 35 ------------------------------
rewards/mean:	5.848605156	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	2.427324772	<---- model-estimated average discounted reward
objective/kl:	19.464336395	<---- how far we are from the original model (regularizer)



 25%|██▍       | 37/150 [42:17<2:08:39, 68.31s/it]

------------------------------ STEP 36 ------------------------------
rewards/mean:	7.071865559	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	3.490811825	<---- model-estimated average discounted reward
objective/kl:	19.304241180	<---- how far we are from the original model (regularizer)



 25%|██▌       | 38/150 [43:23<2:05:59, 67.50s/it]

------------------------------ STEP 37 ------------------------------
rewards/mean:	6.002763748	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	2.609704256	<---- model-estimated average discounted reward
objective/kl:	17.964504242	<---- how far we are from the original model (regularizer)



 26%|██▌       | 39/150 [44:16<1:57:15, 63.38s/it]

------------------------------ STEP 38 ------------------------------
rewards/mean:	4.815419197	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	2.645116091	<---- model-estimated average discounted reward
objective/kl:	12.653374672	<---- how far we are from the original model (regularizer)



 27%|██▋       | 40/150 [45:04<1:47:35, 58.68s/it]

------------------------------ STEP 39 ------------------------------
rewards/mean:	4.575706482	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	2.280853510	<---- model-estimated average discounted reward
objective/kl:	10.536380768	<---- how far we are from the original model (regularizer)



 27%|██▋       | 41/150 [45:59<1:44:44, 57.66s/it]

------------------------------ STEP 40 ------------------------------
rewards/mean:	5.476653099	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	2.204080343	<---- model-estimated average discounted reward
objective/kl:	15.210311890	<---- how far we are from the original model (regularizer)



 28%|██▊       | 42/150 [46:54<1:42:26, 56.91s/it]

------------------------------ STEP 41 ------------------------------
rewards/mean:	2.541163445	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.659190595	<---- model-estimated average discounted reward
objective/kl:	11.902305603	<---- how far we are from the original model (regularizer)



 29%|██▊       | 43/150 [48:00<1:46:06, 59.50s/it]

------------------------------ STEP 42 ------------------------------
rewards/mean:	5.004567623	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.921865940	<---- model-estimated average discounted reward
objective/kl:	18.280464172	<---- how far we are from the original model (regularizer)



 29%|██▉       | 44/150 [48:59<1:44:38, 59.23s/it]

------------------------------ STEP 43 ------------------------------
rewards/mean:	5.070037365	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.800494909	<---- model-estimated average discounted reward
objective/kl:	13.173219681	<---- how far we are from the original model (regularizer)



 30%|███       | 45/150 [49:50<1:39:33, 56.89s/it]

------------------------------ STEP 44 ------------------------------
rewards/mean:	6.089390755	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	2.056863308	<---- model-estimated average discounted reward
objective/kl:	13.403395653	<---- how far we are from the original model (regularizer)



 31%|███       | 46/150 [50:48<1:39:05, 57.17s/it]

------------------------------ STEP 45 ------------------------------
rewards/mean:	5.657330513	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.290950060	<---- model-estimated average discounted reward
objective/kl:	17.486110687	<---- how far we are from the original model (regularizer)



 31%|███▏      | 47/150 [51:46<1:38:23, 57.32s/it]

------------------------------ STEP 46 ------------------------------
rewards/mean:	6.434219360	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	2.200557947	<---- model-estimated average discounted reward
objective/kl:	15.126501083	<---- how far we are from the original model (regularizer)



 32%|███▏      | 48/150 [52:23<1:27:33, 51.51s/it]

------------------------------ STEP 47 ------------------------------
rewards/mean:	6.499149323	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	2.669670343	<---- model-estimated average discounted reward
objective/kl:	14.407856941	<---- how far we are from the original model (regularizer)



 33%|███▎      | 49/150 [53:11<1:24:48, 50.39s/it]

------------------------------ STEP 48 ------------------------------
rewards/mean:	5.367022991	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	2.727546692	<---- model-estimated average discounted reward
objective/kl:	15.238353729	<---- how far we are from the original model (regularizer)



 33%|███▎      | 50/150 [53:58<1:22:07, 49.28s/it]

------------------------------ STEP 49 ------------------------------
rewards/mean:	6.055223465	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	2.767988205	<---- model-estimated average discounted reward
objective/kl:	17.585762024	<---- how far we are from the original model (regularizer)



 34%|███▍      | 51/150 [54:51<1:23:03, 50.34s/it]

------------------------------ STEP 50 ------------------------------
rewards/mean:	7.319556236	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	2.970761776	<---- model-estimated average discounted reward
objective/kl:	18.437261581	<---- how far we are from the original model (regularizer)



 35%|███▍      | 52/150 [55:28<1:15:47, 46.40s/it]

------------------------------ STEP 51 ------------------------------
rewards/mean:	6.875696182	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	3.381352425	<---- model-estimated average discounted reward
objective/kl:	18.067337036	<---- how far we are from the original model (regularizer)



 35%|███▌      | 53/150 [56:13<1:14:14, 45.92s/it]

------------------------------ STEP 52 ------------------------------
rewards/mean:	6.359442711	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	2.316426992	<---- model-estimated average discounted reward
objective/kl:	17.919857025	<---- how far we are from the original model (regularizer)



 36%|███▌      | 54/150 [57:09<1:18:23, 49.00s/it]

------------------------------ STEP 53 ------------------------------
rewards/mean:	5.822316647	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	2.006809711	<---- model-estimated average discounted reward
objective/kl:	20.558523178	<---- how far we are from the original model (regularizer)



 37%|███▋      | 55/150 [57:52<1:14:58, 47.35s/it]

------------------------------ STEP 54 ------------------------------
rewards/mean:	7.191326618	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	2.268063068	<---- model-estimated average discounted reward
objective/kl:	19.028964996	<---- how far we are from the original model (regularizer)



 37%|███▋      | 56/150 [58:33<1:11:10, 45.43s/it]

------------------------------ STEP 55 ------------------------------
rewards/mean:	6.783601761	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	1.986341000	<---- model-estimated average discounted reward
objective/kl:	18.058065414	<---- how far we are from the original model (regularizer)



 38%|███▊      | 57/150 [59:14<1:08:23, 44.13s/it]

------------------------------ STEP 56 ------------------------------
rewards/mean:	7.796194077	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	2.780905724	<---- model-estimated average discounted reward
objective/kl:	21.820026398	<---- how far we are from the original model (regularizer)



 39%|███▊      | 58/150 [59:49<1:03:02, 41.11s/it]

------------------------------ STEP 57 ------------------------------
rewards/mean:	7.973384380	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	3.570426226	<---- model-estimated average discounted reward
objective/kl:	21.094982147	<---- how far we are from the original model (regularizer)



 39%|███▉      | 59/150 [1:00:19<57:28, 37.89s/it]

------------------------------ STEP 58 ------------------------------
rewards/mean:	7.144409180	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	3.212305069	<---- model-estimated average discounted reward
objective/kl:	19.118520737	<---- how far we are from the original model (regularizer)



 40%|████      | 60/150 [1:00:44<51:17, 34.19s/it]

------------------------------ STEP 59 ------------------------------
rewards/mean:	8.634210587	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	4.239290714	<---- model-estimated average discounted reward
objective/kl:	21.633682251	<---- how far we are from the original model (regularizer)



 41%|████      | 61/150 [1:01:09<46:18, 31.22s/it]

------------------------------ STEP 60 ------------------------------
rewards/mean:	8.232854843	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	4.682798386	<---- model-estimated average discounted reward
objective/kl:	19.605241776	<---- how far we are from the original model (regularizer)



 41%|████▏     | 62/150 [1:01:39<45:22, 30.93s/it]

------------------------------ STEP 61 ------------------------------
rewards/mean:	7.992423058	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	4.087344170	<---- model-estimated average discounted reward
objective/kl:	22.157684326	<---- how far we are from the original model (regularizer)



 42%|████▏     | 63/150 [1:02:09<44:18, 30.56s/it]

------------------------------ STEP 62 ------------------------------
rewards/mean:	7.739045143	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	4.350378036	<---- model-estimated average discounted reward
objective/kl:	20.495399475	<---- how far we are from the original model (regularizer)



 43%|████▎     | 64/150 [1:02:38<43:22, 30.26s/it]

------------------------------ STEP 63 ------------------------------
rewards/mean:	7.686486244	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	3.883570194	<---- model-estimated average discounted reward
objective/kl:	20.496810913	<---- how far we are from the original model (regularizer)



 43%|████▎     | 65/150 [1:03:03<40:41, 28.72s/it]

------------------------------ STEP 64 ------------------------------
rewards/mean:	8.509353638	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	4.820219040	<---- model-estimated average discounted reward
objective/kl:	20.676715851	<---- how far we are from the original model (regularizer)



 44%|████▍     | 66/150 [1:03:25<37:19, 26.66s/it]

------------------------------ STEP 65 ------------------------------
rewards/mean:	8.144248009	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	4.500892162	<---- model-estimated average discounted reward
objective/kl:	20.936874390	<---- how far we are from the original model (regularizer)



 45%|████▍     | 67/150 [1:03:43<33:05, 23.92s/it]

------------------------------ STEP 66 ------------------------------
rewards/mean:	8.824471474	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.517116547	<---- model-estimated average discounted reward
objective/kl:	21.117475510	<---- how far we are from the original model (regularizer)



 45%|████▌     | 68/150 [1:04:03<31:01, 22.70s/it]

------------------------------ STEP 67 ------------------------------
rewards/mean:	7.739512920	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	4.957062244	<---- model-estimated average discounted reward
objective/kl:	20.338497162	<---- how far we are from the original model (regularizer)



 46%|████▌     | 69/150 [1:04:22<29:07, 21.57s/it]

------------------------------ STEP 68 ------------------------------
rewards/mean:	8.438900948	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.258892536	<---- model-estimated average discounted reward
objective/kl:	18.890731812	<---- how far we are from the original model (regularizer)



 47%|████▋     | 70/150 [1:04:38<26:48, 20.11s/it]

------------------------------ STEP 69 ------------------------------
rewards/mean:	8.438972473	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.372510910	<---- model-estimated average discounted reward
objective/kl:	20.483896255	<---- how far we are from the original model (regularizer)



 47%|████▋     | 71/150 [1:04:59<26:49, 20.38s/it]

------------------------------ STEP 70 ------------------------------
rewards/mean:	8.646898270	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.685758591	<---- model-estimated average discounted reward
objective/kl:	21.410249710	<---- how far we are from the original model (regularizer)



 48%|████▊     | 72/150 [1:05:17<25:22, 19.52s/it]

------------------------------ STEP 71 ------------------------------
rewards/mean:	8.387558937	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.631323338	<---- model-estimated average discounted reward
objective/kl:	21.656536102	<---- how far we are from the original model (regularizer)



 49%|████▊     | 73/150 [1:05:31<23:08, 18.03s/it]

------------------------------ STEP 72 ------------------------------
rewards/mean:	7.881222725	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.434672832	<---- model-estimated average discounted reward
objective/kl:	18.791301727	<---- how far we are from the original model (regularizer)



 49%|████▉     | 74/150 [1:05:48<22:08, 17.48s/it]

------------------------------ STEP 73 ------------------------------
rewards/mean:	7.798803329	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.134981155	<---- model-estimated average discounted reward
objective/kl:	21.409873962	<---- how far we are from the original model (regularizer)



 50%|█████     | 75/150 [1:06:04<21:24, 17.12s/it]

------------------------------ STEP 74 ------------------------------
rewards/mean:	9.091558456	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.987581730	<---- model-estimated average discounted reward
objective/kl:	23.649581909	<---- how far we are from the original model (regularizer)



 51%|█████     | 76/150 [1:06:23<21:53, 17.75s/it]

------------------------------ STEP 75 ------------------------------
rewards/mean:	9.141706467	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	6.024496078	<---- model-estimated average discounted reward
objective/kl:	22.714954376	<---- how far we are from the original model (regularizer)



 51%|█████▏    | 77/150 [1:06:39<20:50, 17.13s/it]

------------------------------ STEP 76 ------------------------------
rewards/mean:	8.814638138	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.862103462	<---- model-estimated average discounted reward
objective/kl:	22.515727997	<---- how far we are from the original model (regularizer)



 52%|█████▏    | 78/150 [1:06:53<19:27, 16.22s/it]

------------------------------ STEP 77 ------------------------------
rewards/mean:	8.790559769	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.840145588	<---- model-estimated average discounted reward
objective/kl:	21.581649780	<---- how far we are from the original model (regularizer)



 53%|█████▎    | 79/150 [1:07:06<18:16, 15.44s/it]

------------------------------ STEP 78 ------------------------------
rewards/mean:	8.399183273	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.547458649	<---- model-estimated average discounted reward
objective/kl:	21.029064178	<---- how far we are from the original model (regularizer)



 53%|█████▎    | 80/150 [1:07:21<17:37, 15.11s/it]

------------------------------ STEP 79 ------------------------------
rewards/mean:	8.705688477	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.696536064	<---- model-estimated average discounted reward
objective/kl:	22.096145630	<---- how far we are from the original model (regularizer)



 54%|█████▍    | 81/150 [1:07:36<17:27, 15.18s/it]

------------------------------ STEP 80 ------------------------------
rewards/mean:	8.633142471	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.941682816	<---- model-estimated average discounted reward
objective/kl:	20.737798691	<---- how far we are from the original model (regularizer)



 55%|█████▍    | 82/150 [1:07:48<16:05, 14.19s/it]

------------------------------ STEP 81 ------------------------------
rewards/mean:	8.386959076	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.881262779	<---- model-estimated average discounted reward
objective/kl:	18.330665588	<---- how far we are from the original model (regularizer)



 55%|█████▌    | 83/150 [1:07:59<14:50, 13.29s/it]

------------------------------ STEP 82 ------------------------------
rewards/mean:	7.877856255	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.272432804	<---- model-estimated average discounted reward
objective/kl:	19.980041504	<---- how far we are from the original model (regularizer)



 56%|█████▌    | 84/150 [1:08:12<14:29, 13.17s/it]

------------------------------ STEP 83 ------------------------------
rewards/mean:	8.918150902	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.850790501	<---- model-estimated average discounted reward
objective/kl:	19.955596924	<---- how far we are from the original model (regularizer)



 57%|█████▋    | 85/150 [1:08:23<13:34, 12.53s/it]

------------------------------ STEP 84 ------------------------------
rewards/mean:	8.463045120	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.736501694	<---- model-estimated average discounted reward
objective/kl:	19.504901886	<---- how far we are from the original model (regularizer)



 57%|█████▋    | 86/150 [1:08:33<12:36, 11.82s/it]

------------------------------ STEP 85 ------------------------------
rewards/mean:	8.791949272	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.927683353	<---- model-estimated average discounted reward
objective/kl:	19.306018829	<---- how far we are from the original model (regularizer)



 58%|█████▊    | 87/150 [1:08:46<12:40, 12.07s/it]

------------------------------ STEP 86 ------------------------------
rewards/mean:	9.019742012	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	6.006408691	<---- model-estimated average discounted reward
objective/kl:	22.459373474	<---- how far we are from the original model (regularizer)



 59%|█████▊    | 88/150 [1:09:00<12:59, 12.58s/it]

------------------------------ STEP 87 ------------------------------
rewards/mean:	8.310916901	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.439692974	<---- model-estimated average discounted reward
objective/kl:	20.441547394	<---- how far we are from the original model (regularizer)



 59%|█████▉    | 89/150 [1:09:13<13:05, 12.87s/it]

------------------------------ STEP 88 ------------------------------
rewards/mean:	8.414443970	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.451375484	<---- model-estimated average discounted reward
objective/kl:	20.553842545	<---- how far we are from the original model (regularizer)



 60%|██████    | 90/150 [1:09:26<12:58, 12.97s/it]

------------------------------ STEP 89 ------------------------------
rewards/mean:	8.657321930	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.997351646	<---- model-estimated average discounted reward
objective/kl:	17.276069641	<---- how far we are from the original model (regularizer)



 61%|██████    | 91/150 [1:09:38<12:22, 12.58s/it]

------------------------------ STEP 90 ------------------------------
rewards/mean:	7.805204391	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.325268745	<---- model-estimated average discounted reward
objective/kl:	18.523683548	<---- how far we are from the original model (regularizer)



 61%|██████▏   | 92/150 [1:09:49<11:43, 12.14s/it]

------------------------------ STEP 91 ------------------------------
rewards/mean:	9.023410797	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	6.133665085	<---- model-estimated average discounted reward
objective/kl:	18.928413391	<---- how far we are from the original model (regularizer)



 62%|██████▏   | 93/150 [1:10:01<11:28, 12.08s/it]

------------------------------ STEP 92 ------------------------------
rewards/mean:	8.460382462	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.660878181	<---- model-estimated average discounted reward
objective/kl:	17.796623230	<---- how far we are from the original model (regularizer)



 63%|██████▎   | 94/150 [1:10:15<11:44, 12.58s/it]

------------------------------ STEP 93 ------------------------------
rewards/mean:	7.179310799	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.059085846	<---- model-estimated average discounted reward
objective/kl:	16.553480148	<---- how far we are from the original model (regularizer)



 63%|██████▎   | 95/150 [1:10:26<11:06, 12.12s/it]

------------------------------ STEP 94 ------------------------------
rewards/mean:	8.196863174	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.603809357	<---- model-estimated average discounted reward
objective/kl:	16.895820618	<---- how far we are from the original model (regularizer)



 64%|██████▍   | 96/150 [1:10:42<11:57, 13.29s/it]

------------------------------ STEP 95 ------------------------------
rewards/mean:	9.010677338	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.735694885	<---- model-estimated average discounted reward
objective/kl:	22.168397903	<---- how far we are from the original model (regularizer)



 65%|██████▍   | 97/150 [1:10:53<11:15, 12.74s/it]

------------------------------ STEP 96 ------------------------------
rewards/mean:	8.691274643	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.721870422	<---- model-estimated average discounted reward
objective/kl:	20.798976898	<---- how far we are from the original model (regularizer)



 65%|██████▌   | 98/150 [1:11:09<11:51, 13.69s/it]

------------------------------ STEP 97 ------------------------------
rewards/mean:	8.824056625	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.709155083	<---- model-estimated average discounted reward
objective/kl:	20.440044403	<---- how far we are from the original model (regularizer)



 66%|██████▌   | 99/150 [1:11:20<10:48, 12.73s/it]

------------------------------ STEP 98 ------------------------------
rewards/mean:	8.489061356	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.653582573	<---- model-estimated average discounted reward
objective/kl:	17.735649109	<---- how far we are from the original model (regularizer)



 67%|██████▋   | 100/150 [1:11:32<10:30, 12.61s/it]

------------------------------ STEP 99 ------------------------------
rewards/mean:	8.527273178	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.467230797	<---- model-estimated average discounted reward
objective/kl:	18.574249268	<---- how far we are from the original model (regularizer)



 67%|██████▋   | 101/150 [1:11:43<09:44, 11.93s/it]

------------------------------ STEP 100 ------------------------------
rewards/mean:	8.474880219	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	6.012085915	<---- model-estimated average discounted reward
objective/kl:	15.097265244	<---- how far we are from the original model (regularizer)



 68%|██████▊   | 102/150 [1:11:56<09:49, 12.29s/it]

------------------------------ STEP 101 ------------------------------
rewards/mean:	8.761706352	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.986245155	<---- model-estimated average discounted reward
objective/kl:	18.656660080	<---- how far we are from the original model (regularizer)



 69%|██████▊   | 103/150 [1:12:07<09:26, 12.05s/it]

------------------------------ STEP 102 ------------------------------
rewards/mean:	8.260752678	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.327812195	<---- model-estimated average discounted reward
objective/kl:	19.389867783	<---- how far we are from the original model (regularizer)



 69%|██████▉   | 104/150 [1:12:17<08:50, 11.52s/it]

------------------------------ STEP 103 ------------------------------
rewards/mean:	8.487174988	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.798085213	<---- model-estimated average discounted reward
objective/kl:	18.302696228	<---- how far we are from the original model (regularizer)



 70%|███████   | 105/150 [1:12:29<08:32, 11.39s/it]

------------------------------ STEP 104 ------------------------------
rewards/mean:	8.238833427	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.304841518	<---- model-estimated average discounted reward
objective/kl:	19.418672562	<---- how far we are from the original model (regularizer)



 71%|███████   | 106/150 [1:12:39<08:05, 11.04s/it]

------------------------------ STEP 105 ------------------------------
rewards/mean:	8.563315392	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.434797287	<---- model-estimated average discounted reward
objective/kl:	18.212438583	<---- how far we are from the original model (regularizer)



 71%|███████▏  | 107/150 [1:12:48<07:32, 10.52s/it]

------------------------------ STEP 106 ------------------------------
rewards/mean:	8.307323456	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.490515709	<---- model-estimated average discounted reward
objective/kl:	19.607261658	<---- how far we are from the original model (regularizer)



 72%|███████▏  | 108/150 [1:12:59<07:30, 10.73s/it]

------------------------------ STEP 107 ------------------------------
rewards/mean:	8.827028275	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.901086330	<---- model-estimated average discounted reward
objective/kl:	17.634393692	<---- how far we are from the original model (regularizer)



 73%|███████▎  | 109/150 [1:13:10<07:24, 10.84s/it]

------------------------------ STEP 108 ------------------------------
rewards/mean:	8.558795929	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.742140293	<---- model-estimated average discounted reward
objective/kl:	18.884517670	<---- how far we are from the original model (regularizer)



 73%|███████▎  | 110/150 [1:13:23<07:33, 11.35s/it]

------------------------------ STEP 109 ------------------------------
rewards/mean:	9.004119873	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.984146118	<---- model-estimated average discounted reward
objective/kl:	18.670726776	<---- how far we are from the original model (regularizer)



 74%|███████▍  | 111/150 [1:13:35<07:26, 11.45s/it]

------------------------------ STEP 110 ------------------------------
rewards/mean:	8.992073059	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.995517254	<---- model-estimated average discounted reward
objective/kl:	18.626668930	<---- how far we are from the original model (regularizer)



 75%|███████▍  | 112/150 [1:13:47<07:28, 11.81s/it]

------------------------------ STEP 111 ------------------------------
rewards/mean:	8.883487701	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.636909008	<---- model-estimated average discounted reward
objective/kl:	18.775672913	<---- how far we are from the original model (regularizer)



 75%|███████▌  | 113/150 [1:13:58<07:09, 11.60s/it]

------------------------------ STEP 112 ------------------------------
rewards/mean:	8.468021393	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.625261307	<---- model-estimated average discounted reward
objective/kl:	17.878322601	<---- how far we are from the original model (regularizer)



 76%|███████▌  | 114/150 [1:14:09<06:41, 11.16s/it]

------------------------------ STEP 113 ------------------------------
rewards/mean:	8.893858910	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.872673988	<---- model-estimated average discounted reward
objective/kl:	19.872364044	<---- how far we are from the original model (regularizer)



 77%|███████▋  | 115/150 [1:14:21<06:41, 11.48s/it]

------------------------------ STEP 114 ------------------------------
rewards/mean:	8.804882050	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.907965183	<---- model-estimated average discounted reward
objective/kl:	18.900264740	<---- how far we are from the original model (regularizer)



 77%|███████▋  | 116/150 [1:14:33<06:35, 11.64s/it]

------------------------------ STEP 115 ------------------------------
rewards/mean:	8.737190247	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.641550064	<---- model-estimated average discounted reward
objective/kl:	19.748294830	<---- how far we are from the original model (regularizer)



 78%|███████▊  | 117/150 [1:14:43<06:08, 11.17s/it]

------------------------------ STEP 116 ------------------------------
rewards/mean:	9.202364922	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	6.145123482	<---- model-estimated average discounted reward
objective/kl:	19.269033432	<---- how far we are from the original model (regularizer)



 79%|███████▊  | 118/150 [1:14:53<05:49, 10.91s/it]

------------------------------ STEP 117 ------------------------------
rewards/mean:	9.065437317	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	6.097914696	<---- model-estimated average discounted reward
objective/kl:	17.080053329	<---- how far we are from the original model (regularizer)



 79%|███████▉  | 119/150 [1:15:04<05:37, 10.89s/it]

------------------------------ STEP 118 ------------------------------
rewards/mean:	9.050843239	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.934152603	<---- model-estimated average discounted reward
objective/kl:	19.318870544	<---- how far we are from the original model (regularizer)



 80%|████████  | 120/150 [1:15:13<05:06, 10.23s/it]

------------------------------ STEP 119 ------------------------------
rewards/mean:	9.056337357	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.853156090	<---- model-estimated average discounted reward
objective/kl:	19.259372711	<---- how far we are from the original model (regularizer)



 81%|████████  | 121/150 [1:15:23<05:00, 10.35s/it]

------------------------------ STEP 120 ------------------------------
rewards/mean:	8.746772766	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.951889038	<---- model-estimated average discounted reward
objective/kl:	16.650207520	<---- how far we are from the original model (regularizer)



 81%|████████▏ | 122/150 [1:15:33<04:44, 10.17s/it]

------------------------------ STEP 121 ------------------------------
rewards/mean:	8.749676704	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.556866646	<---- model-estimated average discounted reward
objective/kl:	19.612552643	<---- how far we are from the original model (regularizer)



 82%|████████▏ | 123/150 [1:15:42<04:27,  9.91s/it]

------------------------------ STEP 122 ------------------------------
rewards/mean:	8.816420555	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.799846649	<---- model-estimated average discounted reward
objective/kl:	18.597343445	<---- how far we are from the original model (regularizer)



 83%|████████▎ | 124/150 [1:15:52<04:17,  9.91s/it]

------------------------------ STEP 123 ------------------------------
rewards/mean:	9.144370079	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	6.002354622	<---- model-estimated average discounted reward
objective/kl:	19.150920868	<---- how far we are from the original model (regularizer)



 83%|████████▎ | 125/150 [1:16:02<04:08,  9.92s/it]

------------------------------ STEP 124 ------------------------------
rewards/mean:	9.062717438	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	6.162988663	<---- model-estimated average discounted reward
objective/kl:	18.775766373	<---- how far we are from the original model (regularizer)



 84%|████████▍ | 126/150 [1:16:13<04:06, 10.26s/it]

------------------------------ STEP 125 ------------------------------
rewards/mean:	8.949964523	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.988226414	<---- model-estimated average discounted reward
objective/kl:	18.959266663	<---- how far we are from the original model (regularizer)



 85%|████████▍ | 127/150 [1:16:23<03:53, 10.15s/it]

------------------------------ STEP 126 ------------------------------
rewards/mean:	8.624000549	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.950284004	<---- model-estimated average discounted reward
objective/kl:	15.619895935	<---- how far we are from the original model (regularizer)



 85%|████████▌ | 128/150 [1:16:31<03:30,  9.57s/it]

------------------------------ STEP 127 ------------------------------
rewards/mean:	8.836909294	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.939373016	<---- model-estimated average discounted reward
objective/kl:	17.447742462	<---- how far we are from the original model (regularizer)



 86%|████████▌ | 129/150 [1:16:40<03:17,  9.41s/it]

------------------------------ STEP 128 ------------------------------
rewards/mean:	9.099449158	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	6.016192436	<---- model-estimated average discounted reward
objective/kl:	19.592924118	<---- how far we are from the original model (regularizer)



 87%|████████▋ | 130/150 [1:16:49<03:02,  9.10s/it]

------------------------------ STEP 129 ------------------------------
rewards/mean:	8.448755264	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.545903683	<---- model-estimated average discounted reward
objective/kl:	18.062114716	<---- how far we are from the original model (regularizer)



 87%|████████▋ | 131/150 [1:16:59<02:57,  9.36s/it]

------------------------------ STEP 130 ------------------------------
rewards/mean:	9.111933708	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.675228119	<---- model-estimated average discounted reward
objective/kl:	20.379676819	<---- how far we are from the original model (regularizer)



 88%|████████▊ | 132/150 [1:17:08<02:48,  9.39s/it]

------------------------------ STEP 131 ------------------------------
rewards/mean:	8.012901306	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.237126827	<---- model-estimated average discounted reward
objective/kl:	16.897949219	<---- how far we are from the original model (regularizer)



 89%|████████▊ | 133/150 [1:17:19<02:45,  9.71s/it]

------------------------------ STEP 132 ------------------------------
rewards/mean:	8.335606575	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.621912956	<---- model-estimated average discounted reward
objective/kl:	16.385120392	<---- how far we are from the original model (regularizer)



 89%|████████▉ | 134/150 [1:17:27<02:30,  9.42s/it]

------------------------------ STEP 133 ------------------------------
rewards/mean:	8.984245300	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.878456593	<---- model-estimated average discounted reward
objective/kl:	19.494020462	<---- how far we are from the original model (regularizer)



 90%|█████████ | 135/150 [1:17:37<02:23,  9.54s/it]

------------------------------ STEP 134 ------------------------------
rewards/mean:	9.093730927	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	6.026989460	<---- model-estimated average discounted reward
objective/kl:	18.949279785	<---- how far we are from the original model (regularizer)



 91%|█████████ | 136/150 [1:17:46<02:10,  9.31s/it]

------------------------------ STEP 135 ------------------------------
rewards/mean:	9.045806885	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	6.056615829	<---- model-estimated average discounted reward
objective/kl:	18.541975021	<---- how far we are from the original model (regularizer)



 91%|█████████▏| 137/150 [1:17:56<02:05,  9.62s/it]

------------------------------ STEP 136 ------------------------------
rewards/mean:	8.879863739	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	6.039972305	<---- model-estimated average discounted reward
objective/kl:	18.107030869	<---- how far we are from the original model (regularizer)



 92%|█████████▏| 138/150 [1:18:06<01:57,  9.77s/it]

------------------------------ STEP 137 ------------------------------
rewards/mean:	9.041217804	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.843196392	<---- model-estimated average discounted reward
objective/kl:	20.426288605	<---- how far we are from the original model (regularizer)



 93%|█████████▎| 139/150 [1:18:17<01:48,  9.86s/it]

------------------------------ STEP 138 ------------------------------
rewards/mean:	8.676990509	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.659066677	<---- model-estimated average discounted reward
objective/kl:	18.511442184	<---- how far we are from the original model (regularizer)



 93%|█████████▎| 140/150 [1:18:27<01:39,  9.94s/it]

------------------------------ STEP 139 ------------------------------
rewards/mean:	8.637359619	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.639023781	<---- model-estimated average discounted reward
objective/kl:	18.044101715	<---- how far we are from the original model (regularizer)



 94%|█████████▍| 141/150 [1:18:37<01:30, 10.08s/it]

------------------------------ STEP 140 ------------------------------
rewards/mean:	8.217430115	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	4.937016964	<---- model-estimated average discounted reward
objective/kl:	18.448354721	<---- how far we are from the original model (regularizer)



 95%|█████████▍| 142/150 [1:18:46<01:17,  9.70s/it]

------------------------------ STEP 141 ------------------------------
rewards/mean:	8.585564613	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.784340858	<---- model-estimated average discounted reward
objective/kl:	17.646154404	<---- how far we are from the original model (regularizer)



 95%|█████████▌| 143/150 [1:18:55<01:07,  9.63s/it]

------------------------------ STEP 142 ------------------------------
rewards/mean:	8.814210892	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.671211243	<---- model-estimated average discounted reward
objective/kl:	19.706325531	<---- how far we are from the original model (regularizer)



 96%|█████████▌| 144/150 [1:19:07<01:00, 10.10s/it]

------------------------------ STEP 143 ------------------------------
rewards/mean:	8.581809998	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.755835533	<---- model-estimated average discounted reward
objective/kl:	17.868570328	<---- how far we are from the original model (regularizer)



 97%|█████████▋| 145/150 [1:19:17<00:50, 10.14s/it]

------------------------------ STEP 144 ------------------------------
rewards/mean:	8.982486725	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	6.041881561	<---- model-estimated average discounted reward
objective/kl:	17.811870575	<---- how far we are from the original model (regularizer)



 97%|█████████▋| 146/150 [1:19:29<00:42, 10.70s/it]

------------------------------ STEP 145 ------------------------------
rewards/mean:	8.912397385	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	6.029584885	<---- model-estimated average discounted reward
objective/kl:	16.172554016	<---- how far we are from the original model (regularizer)



 98%|█████████▊| 147/150 [1:19:40<00:32, 10.75s/it]

------------------------------ STEP 146 ------------------------------
rewards/mean:	8.791687012	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.812608719	<---- model-estimated average discounted reward
objective/kl:	19.496477127	<---- how far we are from the original model (regularizer)



 99%|█████████▊| 148/150 [1:19:51<00:22, 11.03s/it]

------------------------------ STEP 147 ------------------------------
rewards/mean:	8.776826859	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.799866676	<---- model-estimated average discounted reward
objective/kl:	20.845817566	<---- how far we are from the original model (regularizer)



 99%|█████████▉| 149/150 [1:20:04<00:11, 11.42s/it]

------------------------------ STEP 148 ------------------------------
rewards/mean:	9.053957939	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	6.131712914	<---- model-estimated average discounted reward
objective/kl:	19.449970245	<---- how far we are from the original model (regularizer)



100%|██████████| 150/150 [1:20:17<00:00, 32.12s/it]

------------------------------ STEP 149 ------------------------------
rewards/mean:	8.810099602	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	5.674315453	<---- model-estimated average discounted reward
objective/kl:	20.663236618	<---- how far we are from the original model (regularizer)
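
The `objective/kl` line is labelled a regularizer. As a rough illustration of what that means, here is a minimal sketch of the usual RLHF reward shaping: each generated token is penalized by `beta` times the log-probability gap between the trained policy and the frozen reference model, and the reward model's score is added on the last token. The function name, the value of `beta`, and all numbers below are hypothetical. The sketch also assumes the logged KL is the per-sequence sum of these gaps averaged over the batch; it is a simplification, not TRL's actual implementation.

```python
def kl_penalized_rewards(policy_logprobs, ref_logprobs, score, beta):
    """Return per-token shaped rewards and the sequence-level KL estimate.

    policy_logprobs / ref_logprobs: log-probabilities of the sampled tokens
    under the trained policy and the frozen reference model.
    score: scalar from the reward model for the whole sequence.
    beta: KL penalty coefficient (hypothetical value in the example below).
    """
    # Per-token KL estimate: log pi(token) - log pi_ref(token)
    kls = [p - r for p, r in zip(policy_logprobs, ref_logprobs)]
    # Every token pays a penalty proportional to its KL estimate...
    rewards = [-beta * kl for kl in kls]
    # ...and the reward-model score is credited to the final token only.
    rewards[-1] += score
    return rewards, sum(kls)


# Made-up log-probabilities for a three-token continuation.
policy_logprobs = [-1.0, -0.5, -0.25]
ref_logprobs = [-1.5, -1.0, -0.25]

rewards, seq_kl = kl_penalized_rewards(policy_logprobs, ref_logprobs, score=8.0, beta=0.2)
print(rewards)  # shaped per-token rewards
print(seq_kl)   # sequence-level KL estimate
```

Under this shaping, a policy that drifts far from the reference model loses part of its reward. That is one reason the return-based statistics can sit below the raw `rewards/mean`.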






In [10]:
import trl
import torch
import transformers
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# 4-bit quantization with double quantization to reduce memory use
qconfig = transformers.BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True)
main_tokenizer = transformers.AutoTokenizer.from_pretrained("openai-community/gpt2-large")
# GPT-2 has no pad token, so reuse EOS for padding
main_tokenizer.pad_token = main_tokenizer.eos_token
main_model = trl.AutoModelForCausalLMWithValueHead.from_pretrained(
    "openai-community/gpt2-large", device_map=device, quantization_config=qconfig, torch_dtype=torch.float32
)

from peft import get_peft_model, LoraConfig, TaskType
lora_config = LoraConfig(
    target_modules=["c_attn", "c_proj", "c_fc", "lm_head"],
    lora_alpha=32,
    inference_mode=False,
    use_rslora=True,
    bias='all',
    lora_dropout=0.,
    r=4,
    task_type=TaskType.CAUSAL_LM,
    init_lora_weights='gaussian'
)
main_model = get_peft_model(main_model, lora_config)
main_model.print_trainable_parameters()

trainable params: 3,663,429 || all params: 777,186,629 || trainable%: 0.4714
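
To make the LoRA numbers more tangible, the sketch below works through three small pieces of arithmetic in plain Python: the number of weights a rank-`r` adapter adds to a single linear layer, the scaling factor applied to the adapter output with and without rank-stabilized LoRA (`use_rslora`), and the percentage printed above. The helper names are hypothetical. The hidden size of 1280 for GPT-2 large and the fused 3x-wide `c_attn` projection are taken from the public model configuration. The sketch does not try to reproduce the printed total, which also counts the bias terms made trainable by `bias='all'`.

```python
import math


def lora_param_count(in_features, out_features, r):
    """Weights added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


def lora_scaling(lora_alpha, r, use_rslora=False):
    """Multiplier applied to the adapter output B @ A @ x."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)  # rank-stabilized LoRA
    return lora_alpha / r                 # standard LoRA


hidden = 1280  # GPT-2 large hidden size (assumed from the public config)

# c_attn is the fused query/key/value projection: hidden -> 3 * hidden
c_attn_params = lora_param_count(hidden, 3 * hidden, r=4)
print(c_attn_params)                           # 20480

print(lora_scaling(32, 4, use_rslora=True))    # 16.0
print(lora_scaling(32, 4, use_rslora=False))   # 8.0

# Share of trainable parameters, from the counts printed above
trainable_pct = round(100 * 3_663_429 / 777_186_629, 4)
print(trainable_pct)                           # 0.4714
```

With `r=4`, rank-stabilized scaling doubles the adapter's effective multiplier compared with the standard `lora_alpha / r`.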


In [11]:
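# Restore the fine-tuned trainable weights saved under ft/, one file per parameter.
# Each parameter is frozen first, so the in-place copy_ is allowed; the model is then inference-only.
for name, param in main_model.named_parameters():
    if param.requires_grad:
        param.requires_grad_(False)
        param.copy_(torch.load(f"ft/{name}"))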



In [97]:
import warnings
warnings.filterwarnings('ignore')

In [102]:
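# Sample 15 continuations of the same prompt from the fine-tuned policy
prompts = main_tokenizer(["you know"] * 15, return_tensors='pt')
gens = main_model.model.pretrained_model.generate(**prompts, max_new_tokens=50, do_sample=True)

for i in range(15):
    tokens = gens[i].tolist()
    # Cut at the first EOS (pad == eos here); keep the full sequence if no EOS was sampled
    eos_idx = tokens.index(main_tokenizer.pad_token_id) if main_tokenizer.pad_token_id in tokens else len(tokens)
    gen = main_tokenizer.decode(tokens[:eos_idx])
    reward = reward_model(reward_tokenizer(gen, return_tensors='pt')['input_ids'].cuda())['logits'][0, 0].item()
    print(gen, reward)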




Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


you know of the way you want to masturbate. 9.679192543029785
you know I love kissing girls. 9.867589950561523
you know her penis. 9.499441146850586
you know how sex is. It's an orgasm. 9.90887451171875
you know where it was that got sex. 9.739014625549316
you know what the woman said. This wasn't sex." 9.586249351501465
you know, with sexual intercourse, sex, that is it. 9.430999755859375
you know that you have sex. 9.84558391571045
you know what they're used for. They're good quality sex. 9.89096736907959
you know what will make the girl have sex. 9.96318244934082
you know him as the adult male." 9.542046546936035
you know how she wants her panties to feel. 9.84951400756836
you know how sexually sexy you look. 9.829161643981934
you know the reason I want to fuck her. 9.712608337402344
you know. Do something. Let them fuck, now. 8.319439888000488
