# Chapter 7 Variations of Policy Gradient Algorithms

This notebook was created by Kellen Kanarios (kellenkk@umich.edu).

## Example: Reinforcement Learning with Human Feedback (RLHF) for Positive Movie Reviews

In this notebook, we will use proximal policy optimization (PPO) to finetune a large language model (LLM) so that it generates more positive movie reviews.

### Problem Description

The IMDB dataset contains 50k movie reviews, each annotated with a "positive" or "negative" label indicating its sentiment. Some language models have been pretrained on this dataset to generate movie reviews. We will consider two such models, "edbeeching/gpt2-medium-imdb" and "lvwerra/gpt2-imdb", which are available in the Hugging Face model zoo [1]. The generated movie reviews can be either positive or negative. Our goal is to finetune these language models to generate more positive movie reviews.

In an LLM, each output is generated by appending one token at a time to the input sequence. A token is essentially a word or a sub-word. Once the response is complete, we are told whether the response as a whole is acceptable. This sounds like an RL problem!

### Formulation

To formulate language generation as an RL problem, we must first construct an MDP.

- *State* $s$:
    
    The state $s$ is the current sequence of words, which is also a sequence of tokens.

- *Action* $u$:

    The action $u$ is the token that will be appended to the end of the sequence. Taking an action is just appending the token to the end of the state (current sequence). $\pi(u | s)$ is the probability of saying token $u$ given sentence $s$.

- *Transition*:

    Suppose we take action $u_t$ in state $s_t$. Then the next state is $s_{t+1}=s_t + u_t$, where $+$ denotes appending $u_t$ to the end of $s_t$.

- *Reward*:
    
    At the end of the trajectory, a single reward is given for the whole trajectory. We will use a classifier to analyze the sentiment of the generated text and use the classifier's outputs as reward signals. We will use the classifier "lvwerra/distilbert-imdb" in [2].

- *Objective*: Maximize the expected reward.
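
The formulation above can be sketched in a few lines of plain Python. This is a toy illustration rather than the training code used later: the names `step`, `rollout`, `toy_policy`, and `toy_reward` are hypothetical, tokens are plain strings instead of tokenizer ids, and the reward is a stand-in for the sentiment classifier.

```python
import random

def step(state, action):
    """Transition: append the chosen token to the current sequence."""
    return state + [action]

def rollout(policy, reward_fn, prompt, horizon, rng):
    """Generate `horizon` tokens, then score the whole sequence once."""
    state = list(prompt)
    for _ in range(horizon):
        action = policy(state, rng)   # u_t ~ pi(. | s_t)
        state = step(state, action)   # s_{t+1} = s_t + u_t
    return state, reward_fn(state)    # the reward arrives only at the end

# Hypothetical stand-ins for the language model and the sentiment classifier.
VOCAB = ["great", "boring", "film", "plot"]

def toy_policy(state, rng):
    return rng.choice(VOCAB)

def toy_reward(state):
    return float(state.count("great") - state.count("boring"))

rng = random.Random(0)
final_state, total_reward = rollout(toy_policy, toy_reward, ["this", "movie", "was"], 5, rng)
print(final_state, total_reward)
```

Note that the policy only ever sees the sequence so far, and no intermediate reward is produced while tokens are being appended.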


We can train (finetune) the language model using clipped PPO (along with a few more KL tricks). Namely, we want

\begin{align*}
    \max_{\tilde{w}} \mathbb{E}_{s\sim \rho_w, u \sim \pi_w}\left[\min\left\{\frac{\pi_{\tilde{w}}(u | s)}{\pi_{w}(u | s)}A_{w}(s, u), \left(\frac{\pi_{\tilde{w}}(u | s)}{\pi_{w}(u | s)}\right)^{1+\epsilon}_{1-\epsilon} A_{w}(s, u) \right\}\right],
\end{align*}
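where $(x)_a^b$ is the clipping function such that $(x)_a^b=x$ if $x\in[a,b]$, $a$ if $x<a$, and $b$ if $x>b$, and $\epsilon$ is the clip range (the `cliprange` argument of `PPOConfig` used later in this notebook).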
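
To see what the clipping does, here is a minimal sketch of the term inside the expectation for a single sample. The function names `clip` and `clipped_surrogate` are our own, and the probability ratio and advantage are passed in directly as numbers instead of being computed from a model.

```python
def clip(x, a, b):
    """Clamp x to the interval [a, b]."""
    return max(a, min(x, b))

def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO term: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return min(unclipped, clipped)

# Try a ratio above and below the clip interval, with both advantage signs.
for ratio, advantage in [(1.5, 1.0), (0.5, 1.0), (1.5, -1.0), (0.5, -1.0)]:
    print(ratio, advantage, clipped_surrogate(ratio, advantage))
```

Because the minimum keeps the more pessimistic of the two values, the objective stops rewarding a ratio above $1+\epsilon$ when the advantage is positive, and stops rewarding a ratio below $1-\epsilon$ when the advantage is negative.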


### References

[1] https://huggingface.co/models

[2] https://huggingface.co/lvwerra/distilbert-imdb

[3] Starter code: https://github.com/huggingface/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb


# Install and Import Packages

In [None]:
!pip install  --upgrade \
  "transformers" \
  "datasets" \
  "trl" \
  "peft" \
  "wandb" \
  "bitsandbytes" \
  "accelerate" \
  "optimum" \
  "pandas"

import torch
import random
import numpy

import wandb

from transformers import AutoTokenizer

from trl.core import LengthSampler
from datasets import load_dataset

from trl import PPOConfig
from peft import LoraConfig
from transformers import BitsAndBytesConfig

from trl import AutoModelForCausalLMWithValueHead, PPOTrainer
import bitsandbytes as bnb
from transformers import pipeline, AutoModelForSequenceClassification, AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer

from tqdm import tqdm
import pandas as pd

# Helper Functions

In [None]:
def set_seed(seed=0):
    # set seed for all possible avenues of stochasticity
    numpy.random.seed(seed=seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

set_seed()

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])
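
As a quick illustration of what `collator` does, the sketch below applies it to two hand-made samples. The queries and token ids are made up for the example; in the notebook the samples come from the dataset.

```python
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])

# Two made-up samples with input_ids of different lengths.
samples = [
    {"query": "This movie", "input_ids": [1, 2]},
    {"query": "I really liked", "input_ids": [3, 4, 5]},
]

batch = collator(samples)
print(batch["query"])
print(batch["input_ids"])
```

The list of per-sample dictionaries becomes one dictionary of lists, and sequences of different lengths are kept as separate entries without padding.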

# Setup Wandb
To track training results, we will use Weights & Biases (wandb).
**Please visit https://wandb.ai and create an account.**

In [None]:
wandb.login()

# Choose a Model

We will choose one of the following two language models, each with a different Low-Rank Adaptation (LoRA) configuration. LoRA is a Parameter-Efficient Fine-Tuning (PEFT) method that freezes the pretrained weights and learns the update to selected weight matrices (typically in the attention layers) as the product of two much smaller low-rank matrices. When the rank is small relative to the matrix dimensions, this drastically reduces the number of parameters that need to be fine-tuned. See https://huggingface.co/docs/peft/en/package_reference/lora for details of LoRA.

Model (a):
*   Model name: "edbeeching/gpt2-medium-imdb" in https://huggingface.co/edbeeching/gpt2-medium-imdb
*   LoRA configuration: r=128, lora_alpha=256, lora_dropout=0.05.

Model (b):
*   Model name: "lvwerra/gpt2-imdb" in https://huggingface.co/lvwerra/gpt2-imdb
*   LoRA configuration: r=512, lora_alpha=1024, lora_dropout=0.05.

Model (b) is smaller than Model (a).
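
To get a feel for the LoRA settings, the following sketch counts the parameters in the two low-rank factors for a single weight matrix and computes the usual LoRA scaling factor `lora_alpha / r`. The helper names and the 1024 x 1024 layer shape are illustrative assumptions, not values read from either model.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters in the two LoRA factors: A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r

def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A."""
    return lora_alpha / r

# Hypothetical square layer, chosen only for illustration.
d_in, d_out = 1024, 1024
full = d_in * d_out
lora = lora_param_count(d_in, d_out, r=128)
print(f"full matrix: {full} params, LoRA factors: {lora} params")

# Both configurations above use the same scaling factor.
print(lora_scaling(256, 128), lora_scaling(1024, 512))
```

The saving shrinks as `r` grows toward the layer dimensions, so the rank is worth keeping in mind when comparing the two configurations.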

**Please set `model_chosen` to `"a"` or `"b"` in the following code block.**

In [None]:
# choose model "a" or "b"
model_chosen = "a"  # or "b"

if model_chosen == "a":
    part_a = True
    model_id = "edbeeching/gpt2-medium-imdb"
elif model_chosen == "b":
    part_a = False
    model_id = "lvwerra/gpt2-imdb"
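else:
    # Raise instead of calling exit(), which would shut down the notebook kernel
    raise ValueError(f"Wrong model input: {model_chosen!r}; expected 'a' or 'b'.")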

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Preprocess and Pretokenize Data

In [None]:
def build_dataset(
    tokenizer,
    dataset_name="imdb",
    input_min_text_length=8,
    input_max_text_length=24,
):
    """
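    Build the prompt dataset for training. This loads the dataset with `load_dataset`;
    customize this function to train the model on your own dataset.

    Args:
        tokenizer:
            The tokenizer used to encode each query.
        dataset_name (`str`):
            The name of the dataset to be loaded.
        input_min_text_length (`int`), input_max_text_length (`int`):
            Bounds passed to `LengthSampler` for the number of characters kept from each review.

    Returns:
        ds (`datasets.Dataset`):
            The dataset with `query` and `input_ids` columns, in torch format.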
    """
    train_dataset = load_dataset(dataset_name, split="train")
    original_columns = train_dataset.column_names

    train_dataset = train_dataset.filter(lambda x: len(x["text"]) > 200, batched=False)

    def preprocess_function(samples):
        new_samples = {
            "query": [],
            "input_ids": [],
        }
        for review in samples["text"]:
            input_size = LengthSampler(input_min_text_length, input_max_text_length)
            query = review[: input_size()]
            new_samples["query"].append(query)
            tokenized_query = tokenizer.encode(
                query,
            )
            new_samples["input_ids"].append(tokenized_query)

        return new_samples

    ds = train_dataset.map(
        preprocess_function,
        batched=True,
        remove_columns=original_columns,
    )

    ds.set_format(type="torch")
    return ds

In [None]:
dataset = build_dataset(tokenizer)
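
The queries produced by `build_dataset` are short prefixes of real reviews: each review is sliced to a random number of characters before it is tokenized. The sketch below imitates that step with the standard library. `make_query` is a hypothetical helper, and it assumes the sampled length is uniform over the integers from `min_len` up to, but not including, `max_len`.

```python
import random

def make_query(review, min_len=8, max_len=24, rng=random):
    """Keep only the first n characters of the review, with n drawn from [min_len, max_len)."""
    n = rng.randrange(min_len, max_len)
    return review[:n]

rng = random.Random(0)
review = "I rented this movie last weekend and was pleasantly surprised by it."
for _ in range(3):
    print(repr(make_query(review, rng=rng)))
```

Since the slice is taken on characters, a query can end in the middle of a word; the language model then continues from that point.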

# Setup the Model

*   You can try different values of ``cliprange`` (e.g., 0.1, 0.2, 0.3) in the ``PPOConfig`` object.


In [None]:
# Experiment with different cliprange

config = PPOConfig(
  model_name=model_id,
  learning_rate=1.41e-5,
  cliprange=0.2,
  use_score_scaling=True,
  use_score_norm=True,
  score_clip=0.5,
  log_with="wandb"
)

peft_config_a = LoraConfig(
    r=128,
    lora_alpha=256,
    lora_dropout=0.05,
    bias="none",
    fan_in_fan_out=True,
    task_type="CAUSAL_LM",
)

peft_config_b = LoraConfig(
    r=512,
    lora_alpha=1024,
    lora_dropout=0.05,
    bias="none",
    fan_in_fan_out=True,
    task_type="CAUSAL_LM",
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

In [None]:
peft_config = peft_config_a if part_a else peft_config_b

model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    peft_config=peft_config,
)
model = BetterTransformer.transform(model, keep_original_model=False)

ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
).eval()

optimizer = bnb.optim.Adam8bit(model.parameters(), lr=config.learning_rate)

ppo_trainer = PPOTrainer(
    config=config,
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,
    dataset=dataset,
    data_collator=collator,
    optimizer=optimizer,
)

device = ppo_trainer.accelerator.device
if ppo_trainer.accelerator.num_processes == 1:
    device = 0 if torch.cuda.is_available() else "cpu"  # to avoid a `pipeline` bug

reward_model_id = "lvwerra/distilbert-imdb"
reward_tokenizer = AutoTokenizer.from_pretrained(
    reward_model_id,
    max_length=512,
    truncation=True,
)
reward_model = pipeline(
    "sentiment-analysis",
    model=reward_model_id,
    tokenizer=reward_tokenizer,
    device=device
)

In [None]:
print_trainable_parameters(model)

In [None]:
generation_kwargs = {
    "min_length": 4,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
    "eos_token_id": tokenizer.eos_token_id,
    "max_new_tokens": 16,
    "remove_invalid_values": True,
    "return_prompt": False
}

sent_kwargs = {"top_k": None, "function_to_apply": "none", "padding": True, "batch_size": 16}

# Test the Reward Model and Define the Reward Function

In [None]:
good_text = "this movie was really good!!"
print(reward_model(good_text, **sent_kwargs))

bad_text = "this movie was really bad!!"
print(reward_model(bad_text, **sent_kwargs))

def reward(output):
    # `output` holds one {"label", "score"} dict per class; return the score of the POSITIVE class
    return output[1]["score"] if output[1]["label"] == "POSITIVE" else output[0]["score"]

print(reward(reward_model(good_text, **sent_kwargs)))
print(reward(reward_model(bad_text, **sent_kwargs)))
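
The `reward` function can be checked without loading the classifier by feeding it hand-written outputs of the same shape. The scores below are made up; since `sent_kwargs` sets `function_to_apply` to `"none"`, the real scores are raw logits and can be negative.

```python
def reward(output):
    return output[1]["score"] if output[1]["label"] == "POSITIVE" else output[0]["score"]

# Made-up classifier outputs: one dict per class, in either order.
mostly_positive = [
    {"label": "POSITIVE", "score": 2.3},
    {"label": "NEGATIVE", "score": -2.1},
]
mostly_negative = [
    {"label": "NEGATIVE", "score": 1.9},
    {"label": "POSITIVE", "score": -2.0},
]

print(reward(mostly_positive))
print(reward(mostly_negative))
```

In both cases the function returns the score attached to the `POSITIVE` label, whichever position it appears in.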

# Start Training

In [None]:
%%wandb

for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    query_tensors = batch["input_ids"]

    #### Generate response
    response_tensors = ppo_trainer.generate(
        query_tensors,
        **generation_kwargs
    )
    batch["response"] = tokenizer.batch_decode(response_tensors)

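    #### Score the full text (query + response) with the sentiment classifier
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    reward_outputs = reward_model(texts, **sent_kwargs)
    # Reset the pipeline's call counter (presumably to silence its warning about sequential GPU calls)
    reward_model.call_count = 0
    rewards = [torch.tensor(reward(output)) for output in reward_outputs]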

    #### Run PPO step
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)
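
The "KL tricks" mentioned in the formulation usually mean penalizing the finetuned model for drifting away from the frozen reference model. One common form is sketched below under our own assumptions: every generated token receives a penalty proportional to the difference in log-probability between the two models, and the classifier score is added on the last token. The function name `shaped_rewards` and the coefficient `beta` are hypothetical, and this is not taken from the TRL source.

```python
def shaped_rewards(logprobs, ref_logprobs, score, beta):
    """Per-token rewards: a KL penalty at every token, plus the score on the last one."""
    rewards = [-beta * (lp - ref_lp) for lp, ref_lp in zip(logprobs, ref_logprobs)]
    rewards[-1] += score
    return rewards

# Made-up log-probabilities of three generated tokens under each model.
logprobs = [-1.0, -0.5, -2.0]       # finetuned model
ref_logprobs = [-1.5, -0.5, -1.0]   # frozen reference model
print(shaped_rewards(logprobs, ref_logprobs, score=1.0, beta=0.2))
```

A token that the finetuned model finds more likely than the reference model does is penalized, which discourages the policy from collapsing onto a few high-reward phrases.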


# Results and Comparison

In [None]:
output_min_length = 16
output_max_length = 32
output_length_sampler = LengthSampler(output_min_length, output_max_length)
#### get a batch from the dataset
gen_kwargs = {"min_length": -1, "top_k": 0.0, "top_p": 1.0, "do_sample": True, "pad_token_id": tokenizer.eos_token_id, "remove_invalid_values": True}

bs = 16
game_data = dict()
dataset.set_format("pandas")
df_batch = dataset[:].sample(bs)
game_data["query"] = df_batch["query"].tolist()
query_tensors = df_batch["input_ids"].tolist()

response_tensors_ref, response_tensors = [], []

for i in range(bs):
    gen_len = output_length_sampler()
    output = ref_model.generate(
        torch.tensor(query_tensors[i]).unsqueeze(dim=0).to(device), max_new_tokens=gen_len, **gen_kwargs
    ).squeeze()[-gen_len:]
    response_tensors_ref.append(output)
    output = model.generate(
        torch.tensor(query_tensors[i]).unsqueeze(dim=0).to(device), max_new_tokens=gen_len, **gen_kwargs
    ).squeeze()[-gen_len:]
    response_tensors.append(output)

#### decode responses
game_data["response (before)"] = [tokenizer.decode(response_tensors_ref[i]) for i in range(bs)]
game_data["response (after)"] = [tokenizer.decode(response_tensors[i]) for i in range(bs)]

#### sentiment analysis of query/response pairs before/after
texts = [q + r for q, r in zip(game_data["query"], game_data["response (before)"])]
game_data["rewards (before)"] = [reward(output) for output in reward_model(texts, **sent_kwargs)]

texts = [q + r for q, r in zip(game_data["query"], game_data["response (after)"])]
game_data["rewards (after)"] = [reward(output) for output in reward_model(texts, **sent_kwargs)]

# store results in a dataframe
df_results = pd.DataFrame(game_data)
df_results

In [None]:
print("mean:")
display(df_results[["rewards (before)", "rewards (after)"]].mean())
print()
print("median:")
display(df_results[["rewards (before)", "rewards (after)"]].median())