# myInstructGPT: Reinforcement Learning from Human Feedback

In this tutorial, our goal is to use RLHF in practice to build a model that can summarize Reddit posts. To this end, we will start from the pretrained 124M-parameter GPT-2 model and fine-tune it step by step, following the InstructGPT training process. As shown in class, this process includes the following steps:

1. Supervised Fine Tuning (SFT)

2. Learning a Reward Model

3. Policy Optimization with PPO


Let us first install and import the libraries we are going to need.

In [None]:
%pip install torch datasets regex transformers matplotlib triton

In [None]:
import os
import copy
import time
from tqdm import trange

import torch
from torch.utils.data.dataloader import DataLoader

from mingpt.model import GPT
from mingpt.trainer import Trainer, CN
from mingpt.utils import set_seed, lr_schedule, masked_mean, try_auto_cast
from mingpt.logger import Logger
from mingpt.rewards import RewardModel, ValueModel, calculate_advantage_and_returns

from summarize_rlhf.summarize_gpt import SummarizePrompt, policy_loss, value_loss
from summarize_rlhf.summarize_sft import SFTSummarize
from summarize_rlhf.summarize_reward_model import RewardModelSummarize

## Step 1: Supervised Fine Tuning (SFT)

<center width="100%"><img src="./images/SFT_step1.png"></center>
<center width="100%"><small>Step 1: Fine-tune the model with supervised learning</small><br><br></center>

**What model and data will we use for SFT?**

To fine-tune our model using supervised learning, we will first need training data specific to our application (domain-specific). For our summarization task, this means that we need a dataset of input prompts that contain the post we want to summarize, followed by the characters 'TL;DR:', which essentially asks the model to generate a summary of the post. In addition, we need true `labels` for our prompts, which can be example summaries written by human experts. For our task, we will use the [CarperAI/openai_summarize_tldr](https://huggingface.co/datasets/CarperAI/openai_summarize_tldr?row=0) dataset, which is publicly available on Hugging Face. As a pre-trained model, we will use GPT-2 with ~124M parameters, whose weights we will also download from Hugging Face.
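
To make the prompt format concrete, here is a minimal sketch of how a post could be turned into an SFT training example. The field layout (`SUBREDDIT`, `TITLE`, `POST`, `TL;DR:`) mirrors the sample generations printed later in this notebook; the helper names `build_prompt` and `build_sft_example` are hypothetical, and the exact formatting used inside `SFTSummarize` is not shown here, so treat this as an assumption.

```python
def build_prompt(subreddit: str, title: str, post: str) -> str:
    """Format a Reddit post as a summarization prompt ending in 'TL;DR:'."""
    return f"SUBREDDIT: r/{subreddit}\nTITLE: {title}\nPOST: {post}\nTL;DR:"


def build_sft_example(prompt: str, summary: str, eot: str = "<|endoftext|>") -> str:
    """Concatenate the prompt, a human-written summary, and GPT-2's end-of-text marker."""
    return f"{prompt} {summary}{eot}"


prompt = build_prompt(
    "relationships",
    "Should I offer to pay?",
    "I spend all my time at her place.",
)
example = build_sft_example(prompt, "Unsure whether to chip in for bills.")
print(example)
```

During SFT the model is trained with the usual next-token loss on such sequences, so it learns to continue a prompt ending in 'TL;DR:' with a summary and then stop.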

**Why do we need SFT?**

With SFT, we leverage the general capabilities of our base model, which may have gone through extensive training, while at the same time steering and specializing them for our specific application. In particular, for our task, we take advantage of the fact that GPT-2 was trained to produce coherent, human-understandable text, and we train it further to specialize in post summaries. The good news is that we do not need to train the model on millions of examples: we can achieve relatively good performance with much less data, saving a lot of time and resources. Another advantage is that we do not necessarily need to retrain the entire model (the last layers may be enough).

**What are the challenges with SFT?**

Creating a labeled dataset can be quite tedious and costly, especially because it requires human effort to collect the data and produce the labels. In addition, depending on the application, it may even be impossible to collect such a dataset.


Let's try it out! We will first define the callback below.

In [4]:
def batch_end_callback(trainer):
    model = trainer.model
    model.eval()

    trainer.logger.log("Train", trainer.iter_num, trainer.loss.item())

    if trainer.iter_num % trainer.config.log_every == 0:
        # evaluate the loss on the validation set
        with torch.no_grad():
            total_loss = 0
            for i, batch in enumerate(valid_loader):
                batch = [x.to(device) for x in batch]
                logits, loss = model(*batch)
                total_loss += loss.item()

        val_loss = total_loss / (i+1)
        trainer.logger.log("Valid", trainer.iter_num, val_loss)
        print(f"E: {trainer.epoch}, iter_dt {trainer.iter_dt * 1000:.2f}ms; iter {trainer.iter_num}: train loss {trainer.loss.item():.5f}, val loss: {val_loss:.5f}")

    if trainer.iter_num % trainer.config.generate_every == 0:
        with torch.no_grad():
            sample_prompt = prompt_ds[17].to(device)
            idx = model.generate(sample_prompt, max_new_tokens=128, do_sample=True, top_k=30, stop_at=train_ds.tokenizer.eot_token).cpu()
            for j,generation in enumerate(idx):
                print(f"Generation {j}:", train_ds.tokenizer.decode(generation))

    # save the latest model
    if trainer.config.save_every and trainer.iter_num % trainer.config.save_every == 0:
        print("saving model")
        ckpt_path = os.path.join(os.path.curdir, "model_sft.pt")
        torch.save(model.state_dict(), ckpt_path)

    # revert model to training mode
    model.train()



Let's set up our training data and parameters

In [22]:
set_seed(424242)
torch.set_float32_matmul_precision('high')

print("===== STARTING SUPERVISED FINE-TUNING =====")

# For Logging
train_idx = []
train_losses = []
val_idx = []
val_losses = []

print("Loading Data...")
valid_iters = 32
block_size = 256
train_ds = SFTSummarize(block_size=block_size, split='train')
valid_ds = SFTSummarize(block_size=block_size, split='valid')
prompt_ds = SummarizePrompt(block_size=block_size, split='valid')
print("Data Loaded")

print("Creating Model...")
model_config = GPT.get_default_config()
model_config.model_type = 'gpt2'
model_config.vocab_size = train_ds.get_vocab_size()
model_config.block_size = block_size
model = GPT.from_pretrained("gpt2")
print("Model Created")

print("Configuring Trainer...")
train_config = Trainer.get_default_config()
train_config.learning_rate = 5e-6
train_config.num_workers = 2
train_config.log_every = 50
train_config.generate_every = 200
train_config.save_every = None
train_config.epochs = 1
train_config.batch_size = 4
train_config.compile = True
trainer = Trainer(train_config, model, train_ds)
print("Trainer Configured")


===== STARTING SUPERVISED FINE-TUNING =====
Loading Data...
Data Loaded
Creating Model...
number of parameters: 124.44M
Model Created
Configuring Trainer...
running on device cpu
Trainer Configured


Let us now train our model.

In [6]:
print("Starting Training...")
device = trainer.device
valid_loader = DataLoader(
    valid_ds,
    shuffle=False,
    num_workers=2,
    batch_size=trainer.config.batch_size * 2,
)
print('Validation loader created')
print('Setting up Callbacks')
trainer.set_callback('on_batch_end', batch_end_callback)
print('Callbacks set')
print("Training...")
trainer.run()

print("===== DONE SUPERVISED FINE-TUNING =====")

Starting Training...
Validation loader created
Setting up Callbacks
Callbacks set
Training...
E: 0, iter_dt 0.00ms; iter 0: train loss 3.65614, val loss: 3.45290
Generation 0: SUBREDDIT: r/relationships
TITLE: When should I [M24] offer to start paying for things at my girlfriends [F26] place? Or at all?
POST: We've been together officially for a little over a month now, but have been dating for closer to four months. I've known her almost three years now.

Since things became official I've been spending nearly all my time at her place. She gave me a key and has said that it's half my home too. So my dog and I are there now all the time. I still have my own apartment (six months left on the lease). We've talked some about me moving in, which will happen officially once my lease is up. But if I'm spending all my time at her place, using heat, water, electricity, etc... Shouldn't I help pay for something? Or is it too soon to talk about that kind of thing?

Her internet is very slow DSL a

In [8]:
# Save the model
# torch.save(model.state_dict(), "summarize_sft.pt")

# Load the model trained for a whole epoch
model.load_state_dict(torch.load("summarize_sft.pt", map_location=torch.device('cpu')))



<All keys matched successfully>

Let's generate a couple of summaries

In [None]:
for i in [17, 42, 75]:
    # Get a validation prompt to test
    sample_prompt = prompt_ds[i].to(device)
    idx = model.generate(sample_prompt, max_new_tokens=128, do_sample=True, top_k=30, stop_at=train_ds.tokenizer.eot_token).cpu()
    for j,generation in enumerate(idx):
        print(f"Generation {j}:", train_ds.tokenizer.decode(generation))

# # Plot the losses
# trainer.logger.plot({"Loss": ["Train", "Valid"]}, filename="summarize_sft.png")

Generation 0: SUBREDDIT: r/relationships
TITLE: When should I [M24] offer to start paying for things at my girlfriends [F26] place? Or at all?
POST: We've been together officially for a little over a month now, but have been dating for closer to four months. I've known her almost three years now.

Since things became official I've been spending nearly all my time at her place. She gave me a key and has said that it's half my home too. So my dog and I are there now all the time. I still have my own apartment (six months left on the lease). We've talked some about me moving in, which will happen officially once my lease is up. But if I'm spending all my time at her place, using heat, water, electricity, etc... Shouldn't I help pay for something? Or is it too soon to talk about that kind of thing?

Her internet is very slow DSL and she's off contract.. I've thought about offering to have my much faster cable internet moved to her place and just keep paying it myself.. Thoughts?
TL;DR: Bor
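
The `generate` calls above sample with `do_sample=True, top_k=30`. As a rough illustration of what top-k sampling does, the sketch below keeps only the `k` highest logits, renormalizes them with a softmax, and draws one token id. It is a framework-free toy (the function name `top_k_sample` is hypothetical) and not the implementation inside `model.generate`, which is not shown in this notebook.

```python
import math
import random


def top_k_sample(logits: list[float], k: int, rng: random.Random) -> int:
    """Sample a token id from the k highest-scoring logits."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only (subtract the max for numerical stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    # Draw one index in proportion to its renormalized probability.
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
logits = [2.0, 0.5, 1.5, -1.0, 0.0]
samples = [top_k_sample(logits, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))  # only ids 0 and 2 can appear
```

A smaller `k` makes generations more conservative, while a larger `k` allows more diverse (and riskier) continuations.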

<center width="100%"><img src="./images/summarize_sft_1ep_256.png"></center>

## Step 2: Reward Model

<center width="100%"><img src="./images/RM_step2.png" width=300px></center>
<center width="100%"><small>Step 2: Train a reward model based on human preferences</small><br><br></center>

**What do we mean by reward?** 

In the context of RL, the reward is the immediate feedback received from the environment after an action is taken. It indicates the immediate benefit or cost associated with that action, helping the agent learn which actions lead to favorable outcomes. For our application, the reward should reflect the quality of our summary. For example, a short summary that correctly captures the meaning of the original post should get a much higher reward than a summary that seems irrelevant to the original text.

**What data do we need to train the RM?**

The training dataset for our RM should include prompt-generation pairs (i.e., post-summary pairs) and should also capture the preferences of human users over the generations (summaries). We can achieve this through the following process. Each prompt is passed through the initial language model several times to generate multiple summaries. Human annotators then rank the generated text; for example, given two generations, the annotator indicates which one they prefer. Finally, this preference is added as a 'label' to the summaries and included in the training dataset. For our task, we will use the dataset [CarperAI/openai_summarize_comparisons](https://huggingface.co/datasets/CarperAI/openai_summarize_comparisons), available on Hugging Face. This dataset consists of Reddit posts and, for each post, includes two summaries, where one (positive) was preferred over the other (negative) by a human annotator. Our goal is to train our RM to 'prefer' the summaries that our human annotators would also prefer.
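
A common way to turn such preference pairs into a training signal is the pairwise loss $-\log \sigma(r_{pos} - r_{neg})$, where $r_{pos}$ and $r_{neg}$ are the scalar rewards for the preferred and rejected summaries. The sketch below assumes this standard loss; the actual loss inside `RewardModel` is not shown in this notebook, and the function names are hypothetical.

```python
import math


def pairwise_loss(r_pos: float, r_neg: float) -> float:
    """-log(sigmoid(r_pos - r_neg)), computed stably as softplus(-(r_pos - r_neg))."""
    d = r_pos - r_neg
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))


def batch_metrics(pos_rewards: list[float], neg_rewards: list[float]):
    """Mean pairwise loss and the fraction of pairs ranked correctly."""
    losses = [pairwise_loss(p, n) for p, n in zip(pos_rewards, neg_rewards)]
    correct = [p > n for p, n in zip(pos_rewards, neg_rewards)]
    return sum(losses) / len(losses), sum(correct) / len(correct)


loss, acc = batch_metrics([1.2, 0.3, -0.5], [0.1, 0.8, -0.7])
print(f"loss={loss:.4f}, acc={acc:.2f}")
# A model that scores both summaries equally gets ln(2) ≈ 0.6931 per pair.
print(f"no-preference baseline: {pairwise_loss(0.0, 0.0):.4f}")
```

Under this assumption, a loss close to 0.69 means the model barely separates the two summaries, which is a useful reference point when reading the training logs.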

Let's build and train our reward model! We start by defining some helper functions for training.

In [3]:
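@torch.no_grad()  # evaluation only: no gradients are needed
def evaluate(model, config, ds, iters=32):
    # NOTE: relies on the global `device` defined in the configuration cell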
    train_loader = DataLoader(
        ds,
        shuffle=False,
        batch_size=config.batch_size * 2,
        drop_last=True
    )
    total_loss = 0
    total_acc = 0
    i = 0
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss, acc = model(
            batch["neg_toks"],
            attn_mask=batch["neg_mask"],
            positive_tokens=batch["pos_toks"],
            positive_mask=batch["pos_mask"]
        )
        total_loss += loss.item()
        total_acc += acc.item()
        i += 1
        if i == iters:
            break
    return total_loss / i, total_acc / i



In [4]:
@torch.no_grad()
def set_reward_bias(model, config, ds, iters=128, device='cpu'):
    train_loader = DataLoader(
        ds,
        shuffle=False,
        batch_size=config.batch_size * 2,
        drop_last=True
    )
    all_rewards = []
    i = 0
    for batch in train_loader:
        x = torch.cat((batch["pos_toks"], batch["neg_toks"]))
        mask = torch.cat((batch["pos_mask"], batch["neg_mask"]))
        x, mask = [v.to(device) for v in (x, mask)]
        model.to(device)
        rewards = model(x, attn_mask=mask)
        all_rewards.append(rewards)
        i += 1
        if i == iters:
            break

    reward_bias = torch.mean(torch.cat(all_rewards))
    model.prediction_head.bias.sub_(reward_bias)
    print("Set reward bias to", model.prediction_head.bias.item())
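
To see why subtracting the mean reward from the head's bias centers the rewards, consider the toy sketch below. It models the reward head as `feature + bias` (a deliberate simplification of a linear layer) and uses a hypothetical helper `center_bias`.

```python
def center_bias(rewards: list[float], bias: float) -> float:
    """Return the bias shifted so that the mean of the given rewards becomes zero."""
    mean_reward = sum(rewards) / len(rewards)
    return bias - mean_reward


# Toy reward head: reward = feature + bias
features = [0.4, 1.1, -0.2, 0.9]
bias = 0.25
rewards = [f + bias for f in features]

new_bias = center_bias(rewards, bias)
centered = [f + new_bias for f in features]
print(sum(centered) / len(centered))  # approximately 0
```

In `set_reward_bias` the mean is estimated on a limited number of batches, so rewards on other data are only approximately zero-centered.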

Let's now configure the training data and parameters.

In [9]:
set_seed(424242)
torch.set_float32_matmul_precision('high')

# simple supervised training loop for the reward model!

# ------
# Config
# ------
config = CN()

config.num_workers = 2
config.batch_size = 16
config.learning_rate = 3e-6
config.betas = (0.9, 0.95)
config.weight_decay = 0.1
config.grad_norm_clip = 1.0

epochs = 3
data_block_size = 256
train_dataset = RewardModelSummarize(block_size=data_block_size, split='train')
valid_dataset = RewardModelSummarize(block_size=data_block_size, split='valid1')
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Using device", device)

print("train dataset:", len(train_dataset), "val dataset:", len(valid_dataset))

model_block_size = 1024
config.model = GPT.get_default_config()
config.model.model_type = "gpt2"
config.model.n_layer = 12
config.model.n_head = 12
config.model.n_embd = 768
config.model.resid_pdrop = 0
config.model.attn_pdrop = 0
config.model.vocab_size = train_dataset.get_vocab_size()
config.model.block_size = model_block_size

# setup for logging
logger = Logger()

Using device cpu
train dataset: 13856 val dataset: 4373


Next we define our model.

In [10]:
model = RewardModel(config.model)

# Load model from finetuned model
sd = torch.load("trained_models/summarize_sft.pt", map_location=torch.device('cpu'))
missing = model.load_state_dict(sd, strict=False)
# We expect the lm head to be replaced by the scalar reward prediction head
print("Missing keys:", missing)

model.to(device)
model.train()
uncompiled_model = model
model = torch.compile(model)

# setup the optimizer
optimizer = GPT.configure_optimizers(model, config)

# -----
# Train
# -----
# to speed up training, we can start by only training the reward head of the model
# model.transformer.requires_grad_(False)
# run(model, config, logger)

config.batch_size = 8
model.requires_grad_(True)



Missing keys: _IncompatibleKeys(missing_keys=['prediction_head.weight', 'prediction_head.bias'], unexpected_keys=['lm_head.weight'])


OptimizedModule(
  (_orig_mod): RewardModel(
    (transformer): Transformer(
      (wte): Embedding(50257, 768)
      (wpe): Embedding(1024, 768)
      (embd_drop): Dropout(p=0.1, inplace=False)
      (h): ModuleList(
        (0-11): 12 x Block(
          (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (attn): CausalSelfAttention(
            (c_attn): Linear(in_features=768, out_features=2304, bias=True)
            (c_proj): Linear(in_features=768, out_features=768, bias=True)
            (attn_dropout): Dropout(p=0, inplace=False)
            (resid_dropout): Dropout(p=0, inplace=False)
          )
          (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlp): ModuleDict(
            (c_fc): Linear(in_features=768, out_features=3072, bias=True)
            (c_proj): Linear(in_features=3072, out_features=768, bias=True)
            (act): NewGELU()
            (dropout): Dropout(p=0, inplace=False)
          )
        )
      )
      

Ready for training!

In [11]:
# setup the dataloader
train_loader = DataLoader(
    train_dataset,
    shuffle=True,
    pin_memory=True,
    drop_last=True,
    batch_size=config.batch_size,
    num_workers=config.num_workers,
)

model.train()
iter_num = 0
iter_time = time.time()
epochs = 1

for epoch in range(epochs):
    for i, batch in enumerate(train_loader):
        batch = {k: v.to(device) for k, v in batch.items()}

        # forward the model
        with try_auto_cast(device):
            loss, acc = model(
                batch["neg_toks"],
                attn_mask=batch["neg_mask"],
                positive_tokens=batch["pos_toks"],
                positive_mask=batch["pos_mask"]
            )

        # backprop and update the parameters
        model.zero_grad(set_to_none=True)
        loss.backward()
        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), config.grad_norm_clip)
        optimizer.step()

        # Evaluation
        if i % 100 == 0:
            model.eval()
            tnow = time.time()
            iter_dt = tnow - iter_time
            val_loss, val_acc = evaluate(model, config, valid_dataset)
            print(f"E: {epoch}, Iter: {iter_num}, Train loss: {loss.item():.4f}, Train Acc: {acc.item():.2f}, Grad norm: {grad_norm:.4f}, Val loss: {val_loss:.4f}, Val Acc: {val_acc:.2f} Took: {iter_dt:.0f}s")
            iter_time = tnow
            model.train()

            # Collect data for plotting
            logger.log("Val Loss", iter_num, val_loss)
            logger.log("Val Acc", iter_num, val_acc)

        logger.log("Train Loss", iter_num, loss.item())
        logger.log("Train Acc", iter_num, acc.item())

        iter_num += 1
        if iter_num == 1:
            break
# subtract the mean reward from the reward head to make it unbiased
# set_reward_bias(model, config, train_dataset, device=device)


E: 0, Iter: 0, Train loss: 0.6301, Train Acc: 0.50, Grad norm: 1.4482, Val loss: 0.6929, Val Acc: 0.60 Took: 12s
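
Note that the logged `Grad norm` can be larger than `config.grad_norm_clip = 1.0`. This is expected: `torch.nn.utils.clip_grad_norm_` returns the total gradient norm measured before clipping. The pure-Python sketch below illustrates the idea of clipping by global norm; `clip_by_global_norm` is a hypothetical stand-in and not the PyTorch implementation.

```python
import math


def clip_by_global_norm(grads: list[list[float]], max_norm: float):
    """Scale all gradients so that their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm measured BEFORE clipping.
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Never scale up: the factor is capped at 1.0.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, total_norm


grads = [[3.0, 4.0], [12.0]]  # two "parameters" with joint norm 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm_before, round(norm_after, 4))
```

All gradients are scaled by the same factor, so the update direction is preserved and only its length is limited.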


In [None]:
# setup the dataloader
train_loader = DataLoader(
    train_dataset,
    shuffle=True,
    pin_memory=True,
    drop_last=True,
    batch_size=config.batch_size,
    num_workers=config.num_workers,
)

model.train()
iter_num = 0
iter_time = time.time()
epochs = 1

for epoch in range(epochs):
    for i, batch in enumerate(train_loader):
        batch = {k: v.to(device) for k, v in batch.items()}

        # forward the model
        with try_auto_cast(device):
            loss, acc = model(
                batch["neg_toks"],
                attn_mask=batch["neg_mask"],
                positive_tokens=batch["pos_toks"],
                positive_mask=batch["pos_mask"]
            )

        # backprop and update the parameters
        model.zero_grad(set_to_none=True)
        loss.backward()
        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), config.grad_norm_clip)
        optimizer.step()

        # Evaluation
        if i % 100 == 0:
            model.eval()
            tnow = time.time()
            iter_dt = tnow - iter_time
            val_loss, val_acc = evaluate(model, config, valid_dataset)
            print(f"E: {epoch}, Iter: {iter_num}, Train loss: {loss.item():.4f}, Train Acc: {acc.item():.2f}, Grad norm: {grad_norm:.4f}, Val loss: {val_loss:.4f}, Val Acc: {val_acc:.2f} Took: {iter_dt:.0f}s")
            iter_time = tnow
            model.train()

            # Collect data for plotting
            logger.log("Val Loss", iter_num, val_loss)
            logger.log("Val Acc", iter_num, val_acc)

        logger.log("Train Loss", iter_num, loss.item())
        logger.log("Train Acc", iter_num, acc.item())

        iter_num += 1
# subtract the mean reward from the reward head to make it unbiased
set_reward_bias(model, config, train_dataset, device=device)


E: 0, Iter: 600, Train loss: 0.4050, Train Acc: 0.88, Grad norm: 10.8846, Val loss: 0.6921, Val Acc: 0.58 Took: 742s
E: 0, Iter: 700, Train loss: 0.8046, Train Acc: 0.50, Grad norm: 10.0794, Val loss: 0.7137, Val Acc: 0.58 Took: 1001s
E: 0, Iter: 800, Train loss: 0.5157, Train Acc: 0.88, Grad norm: 5.5163, Val loss: 0.6936, Val Acc: 0.60 Took: 983s
E: 0, Iter: 900, Train loss: 0.4445, Train Acc: 0.88, Grad norm: 8.0555, Val loss: 0.7058, Val Acc: 0.59 Took: 999s
E: 0, Iter: 1000, Train loss: 0.9042, Train Acc: 0.50, Grad norm: 21.1563, Val loss: 0.6852, Val Acc: 0.59 Took: 1000s
E: 0, Iter: 1100, Train loss: 0.4658, Train Acc: 0.88, Grad norm: 7.8752, Val loss: 0.6933, Val Acc: 0.59 Took: 952s
E: 0, Iter: 1200, Train loss: 0.7349, Train Acc: 0.50, Grad norm: 8.2510, Val loss: 0.7201, Val Acc: 0.60 Took: 993s
E: 0, Iter: 1300, Train loss: 0.6682, Train Acc: 0.75, Grad norm: 6.9098, Val loss: 0.7029, Val Acc: 0.59 Took: 976s
E: 0, Iter: 1400, Train loss: 0.6103, Train Acc: 0.75, Grad nor

W0112 10:00:17.750000 33714 torch/_dynamo/convert_frame.py:844] [0/20] torch._dynamo hit config.cache_size_limit (8)
W0112 10:00:17.750000 33714 torch/_dynamo/convert_frame.py:844] [0/20]    function: 'forward' (/Users/mbk-21-0452/Documents/TUK/NLP_TA/notebooks/tutorial_RLHF/mingpt/rewards.py:112)
W0112 10:00:17.750000 33714 torch/_dynamo/convert_frame.py:844] [0/20]    last reason: 0/10: GLOBAL_STATE changed: grad_mode 
W0112 10:00:17.750000 33714 torch/_dynamo/convert_frame.py:844] [0/20] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0112 10:00:17.750000 33714 torch/_dynamo/convert_frame.py:844] [0/20] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.


Set reward bias to -0.03496289998292923


We now save our model and draw some plots.

In [None]:
torch.save(uncompiled_model.state_dict(), "reward_model.pt")
logger.plot({"Loss": ["Train Loss", "Val Loss"], "Accuracy": ["Train Acc", "Val Acc"]}, filename="reward_model.png")


<center width="100%"><img src="reward_model.png"></center>

## Step 3: Policy Optimization with PPO

<center width="100%"><img src="./images/RL_step3.png" width=300px></center>
<center width="100%"><small>Step 3: Train the model with PPO</small><br><br></center>

**How is RL relevant to our fine-tuning task?**

We can consider our fine-tuning task as an RL problem, where:

* the token sequence generated so far is the state
* the next token to be sampled is the action 
* the next token distribution, predicted by our model, is the action policy
* the reward predicted by our reward model is the environment reward

*Note*: The reward may optionally be the reward predicted by the RM minus a penalty term based on the Kullback–Leibler (KL) divergence, shown as $D_{KL}$ below, between the policy of the base model (in our case, the SFT model) and the policy of the model being trained. The KL term penalizes the RL policy for moving substantially away from the initial model at each training step, which helps keep the generated text reasonable and coherent. Otherwise, the model may start generating text that achieves a high reward but is actually gibberish.
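
The sketch below shows one way to combine the two terms into per-token scores, in the spirit of the training loop later in this notebook: every token receives a KL penalty `-kl_beta * (log_prob - ref_log_prob)`, and the RM reward is appended as the final score. The function name `per_token_scores` and the toy numbers are illustrative assumptions.

```python
def per_token_scores(
    log_probs: list[float],
    ref_log_probs: list[float],
    reward: float,
    kl_beta: float,
) -> list[float]:
    """KL penalty on every token, with the reward-model score appended at the end."""
    # log pi(a) - log pi_ref(a) is a single-sample estimate of the KL divergence.
    kl = [lp - ref for lp, ref in zip(log_probs, ref_log_probs)]
    return [-kl_beta * k for k in kl] + [reward]


policy_lp = [-1.0, -0.5, -2.0]  # log-probs of the sampled tokens under the policy
ref_lp = [-1.2, -0.5, -1.0]     # log-probs of the same tokens under the reference model
scores = per_token_scores(policy_lp, ref_lp, reward=0.8, kl_beta=0.02)
print(scores)
```

Tokens that the policy makes more likely than the reference model does are penalized, while the summary-level reward arrives only once, at the end of the sequence.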

We summarize the RL training pipeline below.

<center width="100%"><img src="./images/RLHF_full.png"></center>
<center width="100%"><small>The entire RL training process</small><br><br></center>
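
The policy update in this pipeline is typically PPO's clipped surrogate objective. The sketch below is a framework-free illustration under stated assumptions: the clipping range `clip_eps=0.2` is a commonly used default rather than a value taken from this notebook, and `ppo_policy_loss` is a hypothetical stand-in for the imported `policy_loss`, whose implementation is not shown here.

```python
import math


def ppo_policy_loss(
    new_lp: list[float],
    old_lp: list[float],
    advantages: list[float],
    mask: list[float],
    clip_eps: float = 0.2,
) -> float:
    """Masked mean of the PPO clipped surrogate loss (a quantity to minimize)."""
    total, count = 0.0, 0.0
    for nlp, olp, adv, m in zip(new_lp, old_lp, advantages, mask):
        # Probability ratio between the current policy and the sampling policy.
        ratio = math.exp(nlp - olp)
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the pessimistic (lower) objective, then negate it to get a loss.
        loss = -min(ratio * adv, clipped_ratio * adv)
        total += m * loss
        count += m
    return total / max(count, 1.0)


# Before any update the ratio is 1, so the loss is minus the mean masked advantage.
same = ppo_policy_loss([-1.0, -2.0, -0.5], [-1.0, -2.0, -0.5],
                       advantages=[1.0, -0.5, 2.0], mask=[0.0, 1.0, 1.0])
# A large increase in probability on a positive advantage is clipped at 1 + clip_eps.
clipped = ppo_policy_loss([0.0], [-1.0], advantages=[1.0], mask=[1.0])
print(same, clipped)
```

The mask plays the role of the action mask: only generated tokens, not prompt or padding tokens, contribute to the loss.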


We start with a helper function that evaluates the policy on the validation set.

In [25]:
def validate(valid_ds, model, reward_model, device, max_iters=64):
    reward_model.eval()
    model.eval()

    total_rewards = 0
    count = 0
    total = min(max_iters, len(valid_ds))
    valid_progress_bar = trange(total, desc="Validating", leave=False)
    with torch.no_grad():
        for i in valid_progress_bar:
            prompt = valid_ds[i].to(device)

            completion = model.generate(prompt, max_new_tokens=completion_len, do_sample=True, top_k=30, stop_at=end_of_text)

            reward = reward_model(completion).item()
            total_rewards += reward
            count += 1
            valid_progress_bar.set_postfix(avg_reward=f"{total_rewards/count:.4f}")

            if i < 3:
                print(train_ds.tokenizer.decode(completion[0]), f"\nReward: {reward}\n========\n")

    average_reward = total_rewards / total
    model.train()
    return average_reward


We now define all the models we are going to use. 

In [26]:
set_seed(424242)
torch.set_float32_matmul_precision('high')

# Transformer context length
block_size = 1024

# completion + prompt <= block_size
completion_len = 80
max_prompt_len = 256

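# Shared architecture config (GPT-2 small sizes).
# model_type is left as None because the architecture is specified explicitly.
model_config = GPT.get_default_config()
model_config.model_type = None
model_config.n_layer = 12
model_config.n_head = 12
model_config.n_embd = 768
model_config.vocab_size = 50257
model_config.block_size = block_size

reward_model = RewardModel(model_config)
value_model = ValueModel(model_config)
model = GPT(model_config)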

# Load the weights: the policy starts from the SFT model, while the reward
# and value models both start from the trained reward model
model.load_state_dict(torch.load("trained_models/summarize_sft.pt", map_location='cpu'))
reward_model.load_state_dict(torch.load("trained_models/reward_model.pt", map_location='cpu'))
value_model.load_state_dict(torch.load("trained_models/reward_model.pt", map_location='cpu'))

# reference model is the finetuned SFT model
ref_model = copy.deepcopy(model)
ref_model.requires_grad_(False)
ref_model.eval()

reward_model.requires_grad_(False)
reward_model.eval()



number of parameters: 124.44M




RewardModel(
  (transformer): Transformer(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (embd_drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): CausalSelfAttention(
          (c_attn): Linear(in_features=768, out_features=2304, bias=True)
          (c_proj): Linear(in_features=768, out_features=768, bias=True)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): ModuleDict(
          (c_fc): Linear(in_features=768, out_features=3072, bias=True)
          (c_proj): Linear(in_features=3072, out_features=768, bias=True)
          (act): NewGELU()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (pre

Let's move the models to the GPU (if one is available) and compile them.

In [14]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)
value_model.to(device)
reward_model.to(device)
ref_model.to(device)
print("Running on device", device)

uncompiled_model = model
uncompiled_value_model = value_model

# compile the model
model = torch.compile(model)
reward_model = torch.compile(reward_model)
value_model = torch.compile(value_model)
ref_model = torch.compile(ref_model)

Running on device cpu


Next, we set up the training data and parameters.

In [29]:
# PPO hyperparams
sample_batch_size = 4 # Number of completions to sample
train_batch_size = 2 # OAI uses equal training and sampling batches of 64 (we'll use whatever fits on the GPU!)
grad_accum_steps = sample_batch_size // train_batch_size
max_learning_rate = 3e-6
grad_norm_clip = 1.0
kl_beta = 0.02
n_updates = 2
gamma = 1
lambd = 0.95
n_epochs = 1

# Logging
logger = Logger()

# Prompt datasets
train_ds = SummarizePrompt('train', block_size=max_prompt_len)
valid_ds = SummarizePrompt('valid', block_size=max_prompt_len)
end_of_text = train_ds.tokenizer.eot_token
print("train ds:", len(train_ds), "val ds:", len(valid_ds))

# Can set separate lrs for policy and value fn
total_iters = len(train_ds) // sample_batch_size * n_epochs
get_lr = lr_schedule(max_learning_rate, max_iters=total_iters)
optim_groups = [{'params': model.parameters()}, {'params': value_model.parameters()}]
optimizer = torch.optim.AdamW(optim_groups, lr=get_lr(0), betas=(0.9, 0.95), fused=torch.cuda.is_available(), weight_decay=0.0)

train ds: 25130 val ds: 1409
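
The hyperparameters `gamma` and `lambd` above control Generalized Advantage Estimation (GAE). The sketch below shows the standard GAE recursion on plain Python lists. It is an illustration only: `gae` is a hypothetical function, and the notebook's `calculate_advantage_and_returns` is not shown here, so its exact indexing and tensor shapes may differ.

```python
def gae(rewards: list[float], values: list[float], gamma: float = 1.0, lambd: float = 0.95):
    """Generalized Advantage Estimation over one finished episode.

    rewards[t] is the score received after step t and values[t] is the value
    estimate at step t. The value after the final step is taken to be 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # One-step TD error at step t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lambd * running
        advantages[t] = running
    # Returns are the regression targets for the value model.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


rewards = [0.0, 0.0, 1.0]  # reward only at the end, as for a finished summary
values = [0.5, 0.5, 0.5]
advantages, returns = gae(rewards, values, gamma=1.0, lambd=0.95)
print(advantages, returns)
```

With `lambd=1` the advantage reduces to the full return minus the value estimate (low bias, high variance), while smaller values rely more on the value model (higher bias, lower variance).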


Let the training begin!

In [None]:
# PPO training loop:
# 1. generate a set of completions, given some prompts for the task
# 2. calculate the rewards, values and advantages of the completions
# 3. optimize the model's completions based on the rewards using the PPO objective

# initial reward on the validation set
val_reward = validate(valid_ds, model, reward_model, device)
print("Initial (SFT) Val reward:", val_reward)

i = 0
for epoch in range(n_epochs):
    batch_idxs = torch.randperm(len(train_ds))
    for idx in trange(len(train_ds) // sample_batch_size, desc="iter"):

        # Learning rate schedule
        curr_lr = get_lr(i)
        for pg in optimizer.param_groups:
            pg['lr'] = curr_lr

        # Sample completions given some prompts from the dataset
        # these are the `actions` in the RL sense that the model takes
        with torch.no_grad():
            model.eval()
            value_model.eval()

            original_log_probs = []
            completions = []
            advantages = []
            returns = []
            action_mask = []
            targets = []
            total_reward = 0
            start_idx = idx * sample_batch_size
            for prompt_idx in batch_idxs[start_idx : start_idx + sample_batch_size]:
                prompt = train_ds[prompt_idx.item()].to(device)

                # Sample the completions
                completion = model.generate(prompt, max_new_tokens=completion_len, do_sample=True, top_k=30, stop_at=end_of_text)

                if completion[0, -1] == end_of_text:
                    # Evaluate and store the rewards for the last token
                    reward = reward_model(completion).unsqueeze(-1)
                else:
                    # If there is no eot token, hardcode a negative reward
                    reward = torch.tensor([[-1.0]], device=device)

                total_reward += reward.item()
                completion_minus_1, target = completion[:, :-1], completion[:, 1:]

                # Store the model's original log prob (could be merged into the generate fn)
                original_log_prob = model.log_probs(completion_minus_1, target)

                # Reference logprobs
                ref_log_prob = ref_model.log_probs(completion_minus_1, target)

                # Calculate values, returns and advantages
                values = value_model(completion)

                # Calculate the advantage for our policy gradient
                # Include the kl score to reduce overfitting
                # the kl reward here could be kept up to date with the policy network
                # inside the ppo updates below for a better regularization effect
                kl = original_log_prob - ref_log_prob
                score = torch.cat((- kl_beta * kl, reward), dim=1)
                advantage, single_return = calculate_advantage_and_returns(score, values, gamma=gamma, lambd=lambd)

                # Pad the values up to block_size with zeros
                pad = torch.zeros(1, block_size - advantage.size(1), device=advantage.device, dtype=advantage.dtype)
                advantages.append(torch.cat((advantage, pad), dim=1))
                returns.append(torch.cat((single_return, pad), dim=1))

                # pad the log probs with 1 extra 0
                pad_plus_1 = torch.zeros(1, block_size - original_log_prob.size(1), device=advantage.device, dtype=advantage.dtype)
                original_log_probs.append(torch.cat((original_log_prob, pad_plus_1), dim=1))

                # Pad the tokens with longs
                pad = torch.zeros(1, block_size - completion.size(1), device=completion.device, dtype=completion.dtype)
                completions.append(torch.cat((completion, pad), dim=1))
                pad = torch.zeros(1, block_size - target.size(1), device=target.device, dtype=target.dtype)
                targets.append(torch.cat((target, pad), dim=1))

                # The action mask is only the generated part of the completion
                mask = torch.zeros(1, block_size, device=advantage.device, dtype=advantage.dtype)
                mask[:, prompt.size(1):completion.size(1)] = 1
                action_mask.append(mask)

        # Stack the values into a batch
        advantages = torch.cat(advantages)
        returns = torch.cat(returns)
        completions = torch.cat(completions)
        original_log_probs = torch.cat(original_log_probs)
        action_mask = torch.cat(action_mask)
        targets = torch.cat(targets)


        # Do the PPO update on the batch of data several times
        model.train()
        value_model.train()
        for _ in range(n_updates):
            b_inds = torch.randperm(sample_batch_size)
            for start in range(0, sample_batch_size, train_batch_size):
                end = start + train_batch_size

                # Grab the mini-batches
                mb_inds = b_inds[start:end]
                mb_completion = completions[mb_inds]
                mb_target = targets[mb_inds]
                mb_original_logps = original_log_probs[mb_inds]
                mb_advantages = advantages[mb_inds]
                mb_returns = returns[mb_inds]
                mb_action_mask = action_mask[mb_inds]

                
                with try_auto_cast(device):
                    # Forward pass through the latest model
                    log_probs = model.log_probs(mb_completion, mb_target)

                    # Policy loss
                    pg_loss = policy_loss(log_probs, mb_original_logps, mb_advantages, mb_action_mask)

                    # Value loss
                    new_value = value_model(mb_completion)
                    v_loss = value_loss(new_value, mb_returns, mb_action_mask)
                    
                    loss = pg_loss + 0.1 * v_loss
                    loss = loss / grad_accum_steps

                loss.backward()

            policy_grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), grad_norm_clip)
            value_grad_norm = torch.nn.utils.clip_grad_norm_(value_model.parameters(), grad_norm_clip)
            optimizer.step()
            model.zero_grad()
            value_model.zero_grad()

        avg_reward = total_reward / sample_batch_size
        
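        logger.log("Reward", i, avg_reward)
        # Note: the value loss comes from the last mini-batch and the KL from the
        # last sampled completion only, so both are noisy estimates.
        logger.log("Value Loss", i, v_loss.item())
        logger.log("KL", i, kl.mean().item())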
        logger.log("Policy Grad Norm", i, policy_grad_norm.item())
        logger.log("Value Grad Norm", i, value_grad_norm.item())

        if i % 10 == 0:
            val_reward = validate(valid_ds, model, reward_model, device)
            logger.log("Val Reward", i, val_reward)
            print(f"Iter: {i}, Avg reward: {avg_reward:.3f}, KL: {kl.mean().item():.3f}, Value Loss: {v_loss.item():.4f}, Grad Norm: {policy_grad_norm:.2f}, Vf grad norm: {value_grad_norm:.2f}, Val reward: {val_reward:.3f}")

            torch.save(uncompiled_model.state_dict(), f"trained_models/tmp_summarize_rl_{i}.pt")
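            # NOTE: this stops training after the first iteration (as in the logged run);
            # remove the `break` to train for the full schedule.
            break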

        i += 1

Validating:   2%|▏         | 1/64 [00:23<25:08, 23.94s/it, avg_reward=-0.7153]

SUBREDDIT: r/AskReddit
TITLE: How do you get someone out of your head?
POST: Hi,
I'm 22, and I have been with my girlfriend for 5 years now. We recently moved together. We've always loved each other intensely.

Problem, I recently started to have feelings for an other person (a friend). This person has had a boyfriend for now 3 years, and has absolutely no ideas. Those feelings were so strong, it was hard to hide them. After 2 months of me being distant and really sad, my girlfriend forced me to say what was bothering me. I'm not a good liar, and now she knows.

We decided to give us a week alone, I went to my parents. 

Now, I'm completely lost. I keep on thinking about this person, and I hate that. I would like for those feelings to go away, to leave me alone. But I can't.  

What do I do? It's been 3 months now, and I'm just desperate.
TL;DR: I've come back from my depression, I have feelings for another person, and still need to convince someone.<|endoftext|> 
Reward: -0.7153141498

Validating:   3%|▎         | 2/64 [00:32<15:35, 15.09s/it, avg_reward=0.0386] 

SUBREDDIT: r/running
TITLE: Knee pain due to poor balance
POST: I've had difficulty with distance running due to strong knee pain. My endurance is great, I can cycle for very long distances, but I can't run because my knees give out around 8 to 10 mies.
I went to the Orthopedist who did a full series of x-rays and pronounced my knees in excellent condition. Then he had me do a bunch of balance exercises and told me that balance and "hip stability" was my issue. He prescribed PT, but my insurance is kinda crappy and 3x's/week PT will run me around $300/month. That's a bit steep.
So, has anyone else had knee issues due to balance and hip stability? What did you do? Are there balancing exercises I can do at home and not spend a ton of money on PT?
TL;DR: A full series of x-rays in my knees, I'm not sure what I'm doing wrong. Should I go for balance or should I just go with a treadmill at home instead? Please help!<|endoftext|> 
Reward: 0.7925448417663574



Validating:   5%|▍         | 3/64 [00:39<11:37, 11.44s/it, avg_reward=0.2231]

SUBREDDIT: r/Pets
TITLE: Pet lovers, how do you keep your home clean?
POST: Everyone has their favorite tricks/tips to keeping a clean house, so I'm curious...and in the market for a new vacuum and/or steam mop. 

We have three adult cats and one Italian Greyhound puppy and live in a mostly hard-wood apartment [two carpeted rooms and two large area rugs]. The cats are short hair but shed like crazy [black, white and grey!] and IGs don't really shed at all, but track in a decent amount of dirt from the yard. Getting sick of sweeping, swiffering and then pushing around dirt with a mop. It'd be nice to have a vacuum that picks up dirt and hair effectively on hardwood and carpet and I'm strongly considering investing in a steam mop.

So what do you do? What do you recommend?
TL;DR: Got a cheap and effective vacuum and a steam muffin that cleans and mops everything in our living room, would be nice for an older cat or two.<|endoftext|> 
Reward: 0.5920672416687012



                                                                               

Initial (SFT) Val reward: -0.4824687163345516


iter:   0%|          | 0/6282 [00:00<?, ?it/s]

SUBREDDIT: r/AskReddit
TITLE: How do you get someone out of your head?
POST: Hi,
I'm 22, and I have been with my girlfriend for 5 years now. We recently moved together. We've always loved each other intensely.

Problem, I recently started to have feelings for an other person (a friend). This person has had a boyfriend for now 3 years, and has absolutely no ideas. Those feelings were so strong, it was hard to hide them. After 2 months of me being distant and really sad, my girlfriend forced me to say what was bothering me. I'm not a good liar, and now she knows.

We decided to give us a week alone, I went to my parents. 

Now, I'm completely lost. I keep on thinking about this person, and I hate that. I would like for those feelings to go away, to leave me alone. But I can't.  

What do I do? It's been 3 months now, and I'm just desperate.
TL;DR: Girlfriend wants me to get out of my head, to leave her alone. I don't want to be in this situation. How do I get away?<|endoftext|> 
Reward: 



SUBREDDIT: r/running
TITLE: Knee pain due to poor balance
POST: I've had difficulty with distance running due to strong knee pain. My endurance is great, I can cycle for very long distances, but I can't run because my knees give out around 8 to 10 mies.
I went to the Orthopedist who did a full series of x-rays and pronounced my knees in excellent condition. Then he had me do a bunch of balance exercises and told me that balance and "hip stability" was my issue. He prescribed PT, but my insurance is kinda crappy and 3x's/week PT will run me around $300/month. That's a bit steep.
So, has anyone else had knee issues due to balance and hip stability? What did you do? Are there balancing exercises I can do at home and not spend a ton of money on PT?
TL;DR: Examined to see if it would be worth the money to spend on a back lift, ended up having knee issues. Went to Orthopedist, and he gave me a bunch of stress tests to get comfortable.<|endoftext|> 
Reward: 0.8704727292060852





SUBREDDIT: r/Pets
TITLE: Pet lovers, how do you keep your home clean?
POST: Everyone has their favorite tricks/tips to keeping a clean house, so I'm curious...and in the market for a new vacuum and/or steam mop. 

We have three adult cats and one Italian Greyhound puppy and live in a mostly hard-wood apartment [two carpeted rooms and two large area rugs]. The cats are short hair but shed like crazy [black, white and grey!] and IGs don't really shed at all, but track in a decent amount of dirt from the yard. Getting sick of sweeping, swiffering and then pushing around dirt with a mop. It'd be nice to have a vacuum that picks up dirt and hair effectively on hardwood and carpet and I'm strongly considering investing in a steam mop.

So what do you do? What do you recommend?
TL;DR: Have a cat litter, and want to know what a good way to keep a nice new puppy clean. Advice posted here and here<|endoftext|> 
Reward: 0.40485984086990356





Iter: 0, Avg reward: -0.417, KL: 0.000, Value Loss: 1.9033, Grad Norm: 11.54, Vf grad norm: 18.26, Val reward: -0.444


iter:   0%|          | 0/6282 [11:27<?, ?it/s]
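
The rollout loop above builds a per-token score (a KL penalty on each token, with the reward model's scalar appended at the end) and passes it to `calculate_advantage_and_returns` together with the value estimates. That helper lives in `mingpt.rewards` and its source is not shown in this notebook. The sketch below is a minimal stand-in that assumes it follows standard Generalized Advantage Estimation (GAE); the function name `gae_advantages_and_returns` and the toy numbers are illustrative only.

```python
import torch


def gae_advantages_and_returns(score, values, gamma=1.0, lambd=0.95):
    """Standard GAE over (batch, T) tensors of per-token scores and values.

    Hypothetical stand-in for `calculate_advantage_and_returns`; the real
    helper in mingpt.rewards may differ in its details.
    """
    T = score.size(1)
    advantages = torch.zeros_like(score)
    last_adv = torch.zeros(score.size(0), dtype=score.dtype, device=score.device)
    for t in reversed(range(T)):
        # The value after the final token is taken to be zero (episode ends).
        if t + 1 < T:
            next_value = values[:, t + 1]
        else:
            next_value = torch.zeros_like(values[:, t])
        # One-step TD error, then the exponentially weighted sum of TD errors.
        delta = score[:, t] + gamma * next_value - values[:, t]
        last_adv = delta + gamma * lambd * last_adv
        advantages[:, t] = last_adv
    # Returns are the regression targets for the value model.
    returns = advantages + values
    return advantages, returns


# Toy rollout: three KL terms followed by a scalar reward of 1.0.
kl_beta = 0.1
kl = torch.tensor([[0.2, -0.1, 0.3]])
reward = torch.tensor([[1.0]])
score = torch.cat((-kl_beta * kl, reward), dim=1)  # shape (1, 4)
values = torch.zeros(1, 4)  # an untrained value model, for illustration

advantages, returns = gae_advantages_and_returns(score, values, gamma=1.0, lambd=1.0)
print(advantages)
print(returns)
```

With `gamma=1`, `lambd=1` and zero values, each advantage reduces to the sum of all scores from that token onward, so the final reward is propagated back to every generated token.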


Finally, we save the trained policy and value models and plot the logged metrics.

In [None]:
torch.save(uncompiled_model.state_dict(), "summarize_rl.pt")
torch.save(uncompiled_value_model.state_dict(), "value_model.pt")

# Plot the results
logger.plot({"Reward": ["Reward", "Val Reward"], "Value Loss": ["Value Loss"]}, filename="summarize_rl_rewards.png")
logger.plot({"Gradient Norm": ["Policy Grad Norm", "Value Grad Norm"], "KL Div": ["KL"], "Clip Fraction": ["Clip Frac"]}, filename="summarize_rl_metrics.png")

<center width="100%"><img src="./images/summarize_rl_rewards.png"></center>
<center width="100%"><img src="./images/summarize_rl_metrics.png"></center>
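
The plotting call above requests a "Clip Frac" series, which refers to the clipping in PPO's surrogate objective. The `policy_loss` function used in the training loop is imported from `summarize_rlhf.summarize_gpt` and its source is not reproduced here. As a rough sketch of what such a loss typically computes, assuming the standard clipped objective averaged over the action mask, consider the following. The names `clipped_policy_loss`, `masked_mean_sketch` and the value `clip_eps=0.2` are assumptions for illustration.

```python
import torch


def masked_mean_sketch(x, mask):
    # Average only over positions where mask == 1 (the generated tokens).
    return (x * mask).sum() / mask.sum().clamp(min=1)


def clipped_policy_loss(log_probs, old_log_probs, advantages, action_mask, clip_eps=0.2):
    """PPO clipped surrogate loss plus the fraction of clipped tokens.

    Illustrative only; the tutorial's own `policy_loss` may differ.
    """
    # Probability ratio between the current policy and the rollout policy.
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Take the pessimistic (smaller) objective and negate it to get a loss.
    loss = -masked_mean_sketch(torch.min(unclipped, clipped), action_mask)
    # Share of masked tokens whose ratio left the [1 - eps, 1 + eps] interval.
    clip_frac = masked_mean_sketch(((ratio - 1).abs() > clip_eps).float(), action_mask)
    return loss, clip_frac


# Toy batch: one sequence of four positions, the last one being padding.
old_log_probs = torch.zeros(1, 4)
log_probs = torch.log(torch.tensor([[1.0, 1.5, 0.5, 2.0]]))
advantages = torch.tensor([[1.0, 1.0, -1.0, 1.0]])
action_mask = torch.tensor([[1.0, 1.0, 1.0, 0.0]])

loss, clip_frac = clipped_policy_loss(log_probs, old_log_probs, advantages, action_mask)
print(loss.item(), clip_frac.item())
```

A clip fraction that stays near zero suggests the policy barely moves between updates, while a persistently high value suggests the updates are too aggressive for the chosen learning rate or number of PPO epochs.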

Let's generate some summaries!

In [23]:
# Load the model trained for a whole epoch
model.load_state_dict(torch.load("trained_models/summarize_rl.pt", map_location=torch.device('cpu'), weights_only=True))

for i in [17, 42, 75]:
    # Get a validation prompt to test
    sample_prompt = prompt_ds[i].to(device)
    idx = model.generate(sample_prompt, max_new_tokens=128, do_sample=True, top_k=30, stop_at=train_ds.tokenizer.eot_token).cpu()
    for j, generation in enumerate(idx):
        print(f"Generation {j}:", train_ds.tokenizer.decode(generation))



Generation 0: SUBREDDIT: r/relationships
TITLE: When should I [M24] offer to start paying for things at my girlfriends [F26] place? Or at all?
POST: We've been together officially for a little over a month now, but have been dating for closer to four months. I've known her almost three years now.

Since things became official I've been spending nearly all my time at her place. She gave me a key and has said that it's half my home too. So my dog and I are there now all the time. I still have my own apartment (six months left on the lease). We've talked some about me moving in, which will happen officially once my lease is up. But if I'm spending all my time at her place, using heat, water, electricity, etc... Shouldn't I help pay for something? Or is it too soon to talk about that kind of thing?

Her internet is very slow DSL and she's off contract.. I've thought about offering to have my much faster cable internet moved to her place and just keep paying it myself.. Thoughts?
TL;DR: Whe

In [None]:
val_reward = validate(valid_ds, model, reward_model, device)

Validating:   2%|▏         | 1/64 [00:09<09:59,  9.51s/it, avg_reward=1.0719]

SUBREDDIT: r/AskReddit
TITLE: How do you get someone out of your head?
POST: Hi,
I'm 22, and I have been with my girlfriend for 5 years now. We recently moved together. We've always loved each other intensely.

Problem, I recently started to have feelings for an other person (a friend). This person has had a boyfriend for now 3 years, and has absolutely no ideas. Those feelings were so strong, it was hard to hide them. After 2 months of me being distant and really sad, my girlfriend forced me to say what was bothering me. I'm not a good liar, and now she knows.

We decided to give us a week alone, I went to my parents. 

Now, I'm completely lost. I keep on thinking about this person, and I hate that. I would like for those feelings to go away, to leave me alone. But I can't.  

What do I do? It's been 3 months now, and I'm just desperate.
TL;DR: How do you get someone out of your head? I'm 22, and I have been with my girlfriend for 5 years now. We recently moved together. We've always 

Validating:   3%|▎         | 2/64 [00:20<10:27, 10.12s/it, avg_reward=1.7202]

SUBREDDIT: r/running
TITLE: Knee pain due to poor balance
POST: I've had difficulty with distance running due to strong knee pain. My endurance is great, I can cycle for very long distances, but I can't run because my knees give out around 8 to 10 mies.
I went to the Orthopedist who did a full series of x-rays and pronounced my knees in excellent condition. Then he had me do a bunch of balance exercises and told me that balance and "hip stability" was my issue. He prescribed PT, but my insurance is kinda crappy and 3x's/week PT will run me around $300/month. That's a bit steep.
So, has anyone else had knee issues due to balance and hip stability? What did you do? Are there balancing exercises I can do at home and not spend a ton of money on PT?
TL;DR: Knee pain due to poor balance I've had difficulty with distance running due to strong knee pain. My endurance is great, I can cycle for very long distances, but I can't run because my knees give out around 8 to 10 mies.<|endoftext|> 
Rewa

Validating:   5%|▍         | 3/64 [00:27<08:54,  8.77s/it, avg_reward=1.9577]

SUBREDDIT: r/Pets
TITLE: Pet lovers, how do you keep your home clean?
POST: Everyone has their favorite tricks/tips to keeping a clean house, so I'm curious...and in the market for a new vacuum and/or steam mop. 

We have three adult cats and one Italian Greyhound puppy and live in a mostly hard-wood apartment [two carpeted rooms and two large area rugs]. The cats are short hair but shed like crazy [black, white and grey!] and IGs don't really shed at all, but track in a decent amount of dirt from the yard. Getting sick of sweeping, swiffering and then pushing around dirt with a mop. It'd be nice to have a vacuum that picks up dirt and hair effectively on hardwood and carpet and I'm strongly considering investing in a steam mop.

So what do you do? What do you recommend?
TL;DR: Everyone has their favorite tricks/tips to keeping a clean house, so I'm curious...and in the market for a new vacuum and/or steam mop.<|endoftext|> 
Reward: 2.432692289352417



                                                                              

In [33]:
print("Val Reward after RLHF:", val_reward)

Val Reward after RLHF: 1.4084996758028865


## References

*  https://www.restack.io/p/reinforcement-learning-answer-reward-model-purpose-cat-ai
*  https://nebius.com/blog/posts/fine-tuning/supervised-fine-tuning
*  https://huggingface.co/blog/rlhf

