# Reinforcement Learning from Human Feedback (RLHF)

## Enhancing T5-Base Summarization with Proximal Policy Optimization (PPO) and PEFT Fine-Tuning


Reinforcement Learning from Human Feedback, commonly known as **RLHF**, is a specialized machine learning approach that amalgamates traditional reinforcement learning techniques and human expertise. This union offers a unique pathway to training artificial intelligence agents.

---

### Key Insights:

1. **Nature of RLHF**: RLHF can be understood as an iterative procedure. The system undergoes continuous improvement, adapting its learning function based on newly acquired human feedback.
  
2. **Safety and Trust**: Incorporating human feedback ensures the system not only comprehends the tasks it should execute but also recognizes actions it should avoid. This dual capability fosters safer and more trustworthy systems.
  
3. **Performance Enhancements**: A 2022 study reported that RLHF outperformed conventional supervised learning (SL). This can be attributed to RLHF's ability to optimize a cumulative reward over a whole conversation, whereas SL only learns to match reference outputs and misses this conversation-level signal.

---

RLHF has proven instrumental in guiding language models, molding them to align better with intricate human values. As we venture into this notebook, we'll deep-dive into the methodologies and applications of RLHF.



Useful references:

https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py

https://www.kaggle.com/code/paultimothymooney/fine-tune-flan-t5-with-ppo-deeplearning-ai

https://github.com/huggingface/trl/blob/main/tests/test_ppo_trainer.py


Reward model: ideally a SequenceClassification type of model: We will use Bert

Policy model: ideally a Seq2SeqLM: We will use T5

![Alt text](https://github.com/PanoEvJ/summarization_RLHF/blob/main/image-1.png?raw=1)

Image source: https://huggingface.co/docs/trl/index

## Process Overview

In this notebook, we embark on the journey of aligning a model using Reinforcement Learning from Human Feedback (RLHF). We'll employ various specialized models and leverage a structured training loop for this purpose.

---

### Models Utilized:

1. **Rewards Model**:
   - A finely-tuned model designated for dispensing rewards based on the actions of the policy model.

2. **Base Model (Policy Model)**:
   - The core model we aim to align using RLHF.
   - During the RL process, this model becomes the "policy model", driving decisions and actions.

3. **Reference Model**:
   - A frozen replica of the base model.
   - Its primary role is to act as a benchmark, monitoring the evolution of the policy model throughout the RL process.

---

### Training Loop Overview:

We begin by initializing the Proximal Policy Optimization (PPO) training class. The training process encompasses the following steps:

- **Generation of Summaries**:
  - Derived from the policy model.
  
- **Reward Assignment**:
  - The generated summaries are channeled through the rewards model.
  - Based on these summaries, rewards are determined, reflecting the alignment of the policy model with human preferences.
  
- **Model Adjustment via PPO**:
  - Utilizing the acquired rewards, PPO refines the weights of the policy model, nudging it closer to human preferences.
  
This iterative training loop continues for a predefined number of steps.

---

## Evaluation:

After training, we evaluate the policy model to determine how well it mirrors human preferences.

---



### Install dependencies

In [None]:
!pip install -q torch
!pip install -q transformers
!pip install -q datasets
!pip install -q trl
!pip install -q peft
!pip install -q numpy
!pip install -q pandas
!pip install -q tqdm
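# openai<1.0 is required because this notebook calls the legacy openai.ChatCompletion interface
!pip install -q "openai<1.0"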
!pip install -q wandb
!pip install -U -q sentencepiece

In [None]:
import torch

from transformers import AutoModelForSequenceClassification, AutoTokenizer, T5Tokenizer, T5ForConditionalGeneration

from torch.utils.data import DataLoader, Dataset as TorchDataset
from torch.optim import AdamW

from datasets import load_dataset, Dataset as HFDataset

from peft import PeftModel, PeftConfig,  TaskType

from peft import (
    get_peft_config,
    get_peft_model,
    get_peft_model_state_dict,
    set_peft_model_state_dict,
    PeftType,
    LoraConfig,
)

# AutoModelForCausalLMWithValueHead & AutoModelForSeq2SeqLMWithValueHead: A transformer model with an additional scalar output for each token which can be used as a value function in reinforcement learning.
# https://huggingface.co/docs/trl/models#trl.AutoModelForSeq2SeqLMWithValueHead

# trl: Transformer Reinforcement Learning library
import trl
from trl import PPOTrainer, PPOConfig, AutoModelForSeq2SeqLMWithValueHead # https://huggingface.co/docs/trl/quickstart
from trl import create_reference_model
from trl.core import LengthSampler

# import evaluate

import numpy as np
import pandas as pd

# tqdm library makes the loops show a smart progress meter.
from tqdm import tqdm
tqdm.pandas()


In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.cuda.empty_cache()

## GPT-3.5-turbo as the Rewards Model

![Alt text](https://github.com/PanoEvJ/summarization_RLHF/blob/main/image-2.png?raw=1)

Image source: https://huggingface.co/blog/rlhf

## Reward Model in Reinforcement Learning (RL)

In RL, a **reward model** is a mechanism providing feedback to the agent about its performance in its environment. Instead of predefined reward functions, reward models infer the reward signal from human feedback, especially useful in complex scenarios where crafting a reward function is challenging.

### Why is it Important?

- **Feedback Mechanism**: It's how agents determine if actions are beneficial or detrimental.
- **Facilitates Learning**: Agents use these signals to update their policies to maximize rewards.
- **Handles Complexity**: For real-world problems where explicit reward functions are difficult, a learned reward model is valuable.
- **Safety and Alignment**: They ensure RL agents' objectives align with human intentions, reducing potential harmful behaviors.

In this section, we use GPT-3.5-turbo as the reward model for RL from Human Feedback (RLHF): it is prompted to rate a summary between 0 and 1, and that rating is the reward signal that steers the policy's learning process. A fine-tuned sequence-classification transformer such as BERT could fill the same role.


In [None]:
import os
import getpass

openai_api_key = getpass.getpass("Enter your OpenAI API Key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key
# Never paste the key itself into the notebook; getpass keeps it out of the saved cells.

Load the summarization dataset.

In [None]:
orig_dataset = load_dataset('CarperAI/openai_summarize_comparisons', split='test')

Print one row of the dataset. There are 3 columns: prompt, chosen, rejected

In [None]:
orig_dataset[10000]

#### Test-run: score a random data point (prompt/summary) using ChatGPT.

In [None]:
full_text       = orig_dataset[10000]['prompt']
summarized_text = orig_dataset[10000]['chosen']

In [None]:
prompt = f"""### FULL TEXT:\n {full_text} \n
### SUMMARIZED TEXT: \n {summarized_text}"""

In [None]:
print(prompt)

In [None]:
# Note: openai.ChatCompletion is the legacy interface and is only available in openai<1.0.
import openai

response = openai.ChatCompletion.create(
    temperature = 0.9,
    model="gpt-3.5-turbo",
    messages=[{"role": "system", "content": """You are an expert in text summarization. Below, you are given the full text and its summarization.
Your role is to rate the provided summarization with scores ranging from 0 to 1, where: 0 is the lowest score, 1 is the highest score.
Your response should only contain the rating score."""},
    {"role": "user", "content": prompt}]
)

In [None]:
response['choices'][0]['message']['content']
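
The reply read above is a string, while the RL loop needs a numeric reward. The following is a minimal sketch of one way to convert it, assuming the model may wrap the number in extra text. `parse_rating` and its `default` fallback are hypothetical helpers introduced here for illustration; they are not part of the notebook or of the OpenAI library.

```python
import re


def parse_rating(reply: str, default: float = 0.0) -> float:
    """Extract the first number in a reply and clamp it to the range [0, 1].

    Hypothetical helper: returns `default` when the reply contains no number.
    """
    match = re.search(r"-?\d*\.?\d+", reply)
    if match is None:
        return default
    return min(1.0, max(0.0, float(match.group())))


# Replies of the kind a chat model might plausibly return (made-up examples)
for reply in ["0.8", "Rating: 0.65 out of 1", "n/a", "7"]:
    print(repr(reply), "->", parse_rating(reply))
```

Clamping keeps an out-of-range answer from producing an outsized reward, and the fallback keeps a malformed reply from interrupting a long training run.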

# RLHF Fine-Tuning

## Loading the T5 Model for RLHF Fine-Tuning

### Overview:

T5, short for "Text-to-Text Transfer Transformer", is a model designed to handle a wide range of text-to-text tasks. In this section, we load the T5 model that we will fine-tune using the Reinforcement Learning from Human Feedback (RLHF) approach.

### Steps:

1. **Model Selection**:
    - We've selected the T5 model for our fine-tuning process. Specifically, we'll be working with the "t5-base" variant which offers a balance between computational efficiency and performance.

2. **Loading Model and Tokenizer**:
    - `policy_model_path`: Specifies where the pre-trained (or fine-tuned) T5 model is loaded from. Here it is the Hugging Face Hub repository ID `JuanKO/rlhf_base_model`; a local directory path would also work.
    - `policy_model_name`: Indicates the model name, which in this case is "t5-base".
    - Using the `T5ForConditionalGeneration.from_pretrained` method, we load the model weights from our specified path.
    - Similarly, the corresponding tokenizer, which is essential for converting text into a format that the T5 model can understand, is loaded using the `T5Tokenizer.from_pretrained` method.

3. **Device Allocation**:
    - The model is assigned to a computation device (either CPU or GPU) using the `.to(device)` method. This ensures efficient computation, especially when working with large datasets.

### Test the Model:

After loading, it's a good practice to perform some inference tests to ensure that the model is loaded correctly and is functioning as expected.



In [None]:
policy_model_path = "JuanKO/rlhf_base_model"
policy_model_name = "t5-base"

policy_model = T5ForConditionalGeneration.from_pretrained(policy_model_path)
policy_model.to(device)
policy_tokenizer = T5Tokenizer.from_pretrained(policy_model_path)

### Testing the T5 Model for Summarization


After loading our T5 model, we'll test its summarization capabilities on a sample text from the r/relationships subreddit. This test will help us understand the model's performance and its readiness for RLHF fine-tuning.

### Steps:

1. **Setting the Task Prefix**:
    - We use the prefix "summarize: " to indicate to the T5 model the type of task we want it to perform.

2. **Sample Text**:
    - We have selected a post from the r/relationships subreddit to be summarized. This text provides context about a user's relationship concerns related to her bisexuality.

3. **Generating the Summary**:
    - We feed the concatenated task prefix and text into our T5 model.
    - The model then processes this input and returns a concise summary. The `generate` function is used to obtain this output, and we've set a max length of 100 tokens for our summary.

4. **Decoding the Summary**:
    - The output from the T5 model is in the form of token IDs. Using the T5 tokenizer's `decode` method, we convert these tokens back into human-readable text.

5. **Scoring the Summary using the Reward Model**:
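    - With the generated summary in hand, we then use a `score_summaries` function to evaluate the quality of the summary. The function and the `rm_model` and `rm_tokenizer` objects it takes are not defined in the cells above, so the test cell below is left commented out.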
    - This function returns a score and logit value for both the chosen summary and a rejected (blank) summary. Higher scores and logits suggest better alignment with what the reward model considers a good summary.

### Results:

By examining the printed scores and logits, we can gauge the perceived quality of the generated summary according to our reward model.


In [None]:
# task_prefix = "summarize: "

# text = "SUBREDDIT: r/relationships TITLE: How do I/do I at all [20 F] tell my boyfriend [23 M] that I'm bisexual? POST: I've had two serious relationships prior to this one, both with women. They had no problem with me being bisexual and it was something known before the relationship -- my first girlfriend was also bisexual. I am now in a relationship with a guy. We've been exclusive for about a month. Having never faced this issue, I come to you, Reddit. Is this something that he needs to know? Is it really relevant to a hetero relationship, regardless of if one of the participants in the relationship is bisexual? If you guys think it is necessary, when do you think is the right time? I think my biggest fear is losing him because of it. I know that I should be with someone who is fine with who I am, but I really like the guy and I'd hate for my sexual orientation to be the thing that kills this."
# #text = "SUBREDDIT: r/legaladvice TITLE: What can I do legally to restore water to my condominium!? POST: Hi, I live in SE Michigan in a condominium complex. Our water was shut off due to non-payment. (we recieved no notice) and we had to pay all that was due ($1500) We payed this yesterday at 2, they said the water would be turned on immediately. It wasn't. It's now the next day. The lady in our assosciation keeps insisting that the water meter is in another condo. Which we can't access because the person living there is never there (it's being rented) Now we're stuck with no water, no shower, no teeth brushing, no toilets, and no food for certain meals.... Please help us... What can we do? We called the police and they say that we can file a civil report for the lady not doing her job..."
# prompt = f"{task_prefix}{text}"
# input_ids = policy_tokenizer(prompt, return_tensors="pt").input_ids.to(device)
# outputs = policy_model.generate(input_ids, max_length=100).to(device)

# strOutput = policy_tokenizer.decode(outputs[0], skip_special_tokens=True)
# print(strOutput)

# chosen_score, rejected_score, chosen_logit, rejected_logit = score_summaries(rm_model, rm_tokenizer, strOutput, "")

# print(f"Chosen Score: {chosen_score:.4f}")
# print(f"Rejected Score: {rejected_score:.4f}")

# print(f"Chosen Logit: {chosen_logit:.4f}")
# print(f"Rejected Logit: {rejected_logit:.4f}")


## Preparing the T5 Model for Peft + LoRA

### Overview:

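PEFT (Parameter-Efficient Fine-Tuning) is a family of techniques, and the name of the Hugging Face `peft` library, for fine-tuning pre-trained models while updating only a small number of parameters. LoRA (Low-Rank Adaptation) is one such technique: it freezes the pre-trained weights and trains small low-rank matrices added to selected layers. Here, we'll configure the T5 model for this process.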

### Steps:

1. **Setting up the LoRA Configuration**:
    - `LoraConfig` provides the configuration settings for Low-Rank Adaptation.
        - `r`: Rank of the low-rank structure. In this instance, it's set to 8.
        - `lora_alpha`: Scaling factor for the low-rank update, which is multiplied by `lora_alpha / r` (here 32 / 8 = 4).
        - `target_modules`: Specifies which parts of the model to apply LoRA. Here, we're targeting the "q" (query) and "v" (value) modules.
        - `lora_dropout`: Dropout rate for the low-rank parameters. Set to 0.10, or 10%.
        - `bias`: Specifies which bias parameters are trained. We've chosen "none", so no biases are updated.
        - `task_type`: Indicates the type of task. As we're using T5, the task type is set to `SEQ_2_SEQ_LM`.

2. **Applying LoRA Configuration to T5**:
    - Using the `get_peft_model` function, we apply the LoRA configuration to our pre-loaded T5 model.
    - The returned model (`policy_peft_model`) is equipped with the Peft + LoRA modifications and is ready for fine-tuning.

### Summary of this section:

Our T5 model is now prepared with Peft + LoRA adjustments. This configuration optimizes the model for more efficient fine-tuning on specific tasks while leveraging the powerful pre-trained knowledge.
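
To make the roles of `r` and `lora_alpha` concrete, here is a minimal pure-Python sketch of the LoRA forward pass for a single linear layer, `W x + (lora_alpha / r) * B (A x)`. It is an illustration under assumptions, not the `peft` implementation: the helper names `matvec` and `lora_forward` and the toy sizes are made up, and dropout is omitted.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, r, lora_alpha):
    """Frozen projection W x plus the scaled low-rank update B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, update)]


rng = random.Random(0)
d, r, lora_alpha = 6, 2, 8   # toy sizes; the notebook uses r=8 and lora_alpha=32
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]  # frozen weight (d x d)
A = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(r)]  # trainable (r x d)
B = [[0.0] * r for _ in range(d)]                            # trainable (d x r), starts at zero
x = [rng.gauss(0, 1) for _ in range(d)]

# With B initialised to zero, the adapted layer reproduces the frozen layer exactly.
print(lora_forward(W, A, B, x, r, lora_alpha) == matvec(W, x))
```

Only `A` and `B` would receive gradients, which is why the trainable parameter count stays small compared with `W`.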


In [None]:
lora_config = LoraConfig(
    r=8, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.10,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # T5
)

policy_peft_model = get_peft_model(policy_model, lora_config)
policy_peft_model.to(device)

### Analyzing Trainable Parameters in the Peft + LoRA Configured T5 Model

After applying the Peft + LoRA configuration to our T5 model, it's essential to inspect the model's parameters to understand its structure better.

### Key Insights:

1. **Trainable Parameters**:
    - This refers to the parameters that will be updated during the training process.
    - In our configured model, there are **884,736** trainable parameters.

2. **Total Parameters**:
    - This indicates the complete count of parameters present in the model, including those that are non-trainable.
    - The model consists of **223,788,288** total parameters.

3. **Percentage of Trainable Parameters**:
    - It's useful to know the fraction of the model's parameters that are trainable, as this can influence training time and model flexibility.
    - Only about **0.3953%** (or roughly 0.4%) of the entire model's parameters are trainable.

### Summary of this section:

The Peft + LoRA configuration results in a model where only a small fraction of parameters are trainable. This approach offers a balance, as it allows for specific fine-tuning while leveraging a vast pre-trained structure. The advantage is that it can lead to faster training times and might prevent overfitting, especially when training data is limited.
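
The trainable parameter count quoted above can be reproduced by hand. The sketch below assumes the standard T5-base layout (hidden size 768, 12 encoder and 12 decoder layers, with the decoder carrying both self-attention and cross-attention) and takes the total parameter count from the figure reported in this section.

```python
d_model = 768        # T5-base hidden size
inner_dim = 768      # 12 heads * 64 dims per head: output size of the q and v projections
rank = 8             # LoRA rank r from lora_config

encoder_layers = 12  # one self-attention block each
decoder_layers = 12  # one self-attention and one cross-attention block each

attention_blocks = encoder_layers + 2 * decoder_layers   # 36
targeted_linears = attention_blocks * 2                   # "q" and "v" in every block
params_per_linear = rank * d_model + inner_dim * rank     # A (r x d) plus B (d x r)
trainable = targeted_linears * params_per_linear

total = 223_788_288  # total parameter count reported in this section

print(trainable)                          # 884736
print(f"{100 * trainable / total:.4f}%")  # 0.3953%
```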


In [None]:
policy_peft_model.print_trainable_parameters()

![Alt text](https://github.com/PanoEvJ/summarization_RLHF/blob/main/image-3.png?raw=1)

Image source: https://magazine.sebastianraschka.com/p/llm-training-rlhf-and-its-alternatives

## Instantiating the PPO Model with Value Head

Proximal Policy Optimization (PPO) is a reinforcement learning algorithm. In this step, we set up the model for PPO training using our earlier `policy_peft_model`.

### Key Components:

1. **AutoModelForSeq2SeqLMWithValueHead**:
    - An extension of the transformers model that includes a scalar output for each token, aiding in reinforcement learning.
    - This model can capture the value function, an estimate of future rewards.

2. **Inputs**:
    - We pass in our `policy_peft_model`, which has been configured with Peft + LoRA, as the foundation for our PPO model.
    - We set `torch_dtype` to `torch.bfloat16` to reduce memory use, at the cost of some numerical precision.
    - The `is_trainable` flag is set to `True`, allowing us to further fine-tune the model using our RL loop.

3. **Device Assignment**:
    - We transfer our instantiated model to the appropriate device (`device`) for computation, ensuring efficient training.

### Summary of this section:

With our PPO model instantiated, we're poised to fine-tune our summarization model using reinforcement learning with human feedback. This approach is aimed at improving the model's performance in generating summaries based on human preferences and judgments.

[More on PPO and TRL](https://huggingface.co/docs/trl/quickstart)


In [None]:
# https://huggingface.co/docs/trl/quickstart
ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(policy_peft_model,
                                                               torch_dtype=torch.bfloat16,
                                                               is_trainable=True)

ppo_model.to(device)

### Defining the Reference Model

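In reinforcement learning, especially when fine-tuning models with methods like Proximal Policy Optimization (PPO), it's helpful to have a reference model. This model represents the initial state or behavior of the learner model (in this case, the language model) before any alignment or optimization. During training, the KL divergence between the policy and the reference model is used as a penalty on the reward, which keeps the policy from drifting too far from its starting behavior. (PPO's importance sampling ratio, by contrast, compares the updated policy with the policy that generated the samples, not with the reference model.)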

### Key Components:

1. **create_reference_model**:
    - A function provided by Huggingface's TRL (Transformer Reinforcement Learning) library.
    - Creates a duplicate of the passed model which acts as a reference during the RL fine-tuning process.

2. **Inputs**:
    - The `policy_model` we previously defined serves as the input. This model acts as the basis for our reference model.

3. **Device Assignment**:
    - Once instantiated, we move our reference model to the specified device (`device`) for computations.

### Summary of this section:

By defining a reference model, we set a stable baseline against which we can measure and guide the progress and changes of our main model during the reinforcement learning process.

[More on TRL and Reference Models](https://huggingface.co/docs/trl/models#trl.create_reference_model)
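
One common way the reference model enters training is through a KL-penalized reward: each generated token is penalized by how much more likely the policy makes it than the reference model does, and the reward model's score is added at the final token. The sketch below illustrates that idea in plain Python. The function name, the `kl_coef` value, and the example log-probabilities are illustrative assumptions, not values taken from this notebook.

```python
def kl_penalized_rewards(policy_logprobs, ref_logprobs, score, kl_coef=0.2):
    """Per-token rewards for one non-empty generated sequence.

    Each token gets -kl_coef * (log pi(token) - log pi_ref(token)); the scalar
    reward-model score is added to the last token.
    """
    rewards = [
        -kl_coef * (policy_lp - ref_lp)
        for policy_lp, ref_lp in zip(policy_logprobs, ref_logprobs)
    ]
    rewards[-1] += score
    return rewards


# Made-up log-probabilities for a three-token summary
policy_logprobs = [-1.0, -0.5, -2.0]
ref_logprobs = [-1.2, -0.5, -1.5]

rewards = kl_penalized_rewards(policy_logprobs, ref_logprobs, score=0.8)
print([round(r, 3) for r in rewards])
```

A token the policy favors more than the reference does (the first one here) is penalized, while a token it favors less receives a small bonus, so the total reward trades summary quality against drift from the reference.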


In [None]:
ref_model = create_reference_model(policy_model)
ref_model.to(device)

### Preparing the Dataset for Reinforcement Learning

Reinforcement learning (RL) requires a dataset to simulate experiences and provide feedback. In our RL setup for fine-tuning a language model, we utilize a comparison dataset.

### Steps:

1. **Load Dataset**:
    - Using Huggingface's `datasets` library, we fetch the 'CarperAI/openai_summarize_comparisons' dataset's test split.

2. **Filtering**:
    - We want to ensure the prompt lengths are manageable.
    - Filtering by word count: We retain samples where the prompt has ≤ 450 words.
    - (Alternative Filtering by character count is commented out for reference.)

3. **Shuffling and Sampling**:
    - To ensure a diverse set of samples, we shuffle the dataset.
    - We then select a subset (2,000 samples in this instance) for the RL process.

4. **Feature Extraction**:
    - From our shuffled dataset, we focus on the `prompt` and `chosen` fields.
    - Rename the 'chosen' field to 'response' to align with the PPO library's requirements.

5. **Dataset Conversion**:
    - Convert the dictionary containing our features into a Huggingface Dataset format.

6. **Train-Eval Split**:
    - Split the dataset into training and evaluation subsets.
    - Here, 80% of samples are designated for training, and the remaining 20% are for evaluation.

### Outcome:

By the end of this process, we will have a training dataset and an evaluation dataset ready for the RL process. These datasets will be essential in guiding the model's fine-tuning and assessing its performance during the RL loop.
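
The filter, shuffle, sample, rename, and split steps can be sketched on plain Python lists, which makes the logic easy to check on toy data before running it on the real dataset. `prepare_splits` and the toy rows are hypothetical, and `random.Random(seed)` will not reproduce the ordering of the `datasets` library's `shuffle(seed=42)`.

```python
import random


def prepare_splits(rows, max_words=450, n_samples=2000, train_frac=0.8, seed=42):
    """Filter by prompt word count, shuffle, sample, rename, and split."""
    kept = [row for row in rows if len(row["prompt"].split()) <= max_words]
    rng = random.Random(seed)
    rng.shuffle(kept)
    sampled = kept[:n_samples]
    # Keep only the prompt and the chosen summary, renamed to "response"
    examples = [{"prompt": row["prompt"], "response": row["chosen"]} for row in sampled]
    n_train = int(train_frac * len(examples))
    return examples[:n_train], examples[n_train:]


# Toy rows with prompts of 1 to 20 words
toy_rows = [
    {"prompt": "word " * n, "chosen": f"summary {n}", "rejected": ""}
    for n in range(1, 21)
]
train_split, eval_split = prepare_splits(toy_rows, max_words=10, n_samples=10)
print(len(train_split), len(eval_split))  # 8 2
```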


In [None]:
# Load the dataset
orig_dataset = load_dataset('CarperAI/openai_summarize_comparisons', split='test')

# Filter samples where the prompt length is less than or equal to 450 words
filtered_dataset = orig_dataset.filter(lambda example: len(example['prompt'].split()) <= 450) # By word
#filtered_dataset = orig_dataset.filter(lambda example: len(example['prompt']) <= 1250) # By character

# Shuffle and select the first 2,000 samples
#shuffled_dataset = orig_dataset.shuffle(seed=42).select(range(1000))
shuffled_dataset = filtered_dataset.shuffle(seed=42).select(range(2000))


# Extract the desired features. Rename "chosen" to "response" to follow the PPO library's requirements.
new_dataset_dict = {
    "prompt": shuffled_dataset["prompt"],
    "response": shuffled_dataset["chosen"]
}

# Convert the dictionary to a new Dataset
dataset = HFDataset.from_dict(new_dataset_dict)

# Split the new_dataset into train_dataset and eval_dataset
split_ratio = 0.8  # 80% for training, 20% for evaluation
num_train_samples = int(split_ratio * len(dataset))
train_dataset = dataset.select(range(num_train_samples))
eval_dataset = dataset.select(range(num_train_samples, len(dataset)))

In [None]:
print(train_dataset[0].keys())
print(eval_dataset[0].keys())

### Tokenization of Datasets

For reinforcement learning, it is crucial that the data is in a format understood by the model. This requires tokenizing our textual data into numerical tokens. Here, we'll use the tokenizer associated with our model (T5 in this case) to process our datasets.

### Steps:

1. **Tokenizer Initialization**:
    - Instantiate the tokenizer corresponding to our model (T5). If you use a different model, ensure you fetch the right tokenizer.

2. **Tokenization Function**:
    - Define a function (`tokenize_function`) that:
        - Processes the 'prompt' in each example of the dataset.
        - Truncates the tokenized prompt to a maximum length of 1,024 tokens (no padding is applied at this step).
        - Returns the tokenized 'input_ids' for each 'prompt' and retains the associated 'response'.

3. **Apply Tokenization**:
    - Apply the `tokenize_function` to both the training and evaluation datasets using the `map` function.

### Outcome:

The datasets (`train_dataset` and `eval_dataset`) are now tokenized and in a suitable format for model ingestion during the reinforcement learning loop.


In [None]:
from transformers import T5Tokenizer

# Instantiate the tokenizer. The original T5 checkpoints share one SentencePiece vocabulary,
# so "t5-small" should tokenize like the "t5-base" policy; reusing policy_tokenizer would also work.
tokenizer = T5Tokenizer.from_pretrained("t5-small")

def tokenize_function(example):
    # Tokenize the prompt and store it as input_ids. Also return the response.
    return {
        "input_ids": tokenizer(example["prompt"], return_tensors="pt", truncation=True, max_length=1024)["input_ids"].squeeze(),
        "response": example["response"],
    }

# Tokenize the training and evaluation datasets
train_dataset = train_dataset.map(tokenize_function, batched=False)
eval_dataset = eval_dataset.map(tokenize_function, batched=False)


In [None]:
train_dataset

In [None]:
# Lets check one sample of the train_dataset
print(train_dataset[0])  # print the first example from the training dataset

### Hyperparameter Initialization

Before training the model using reinforcement learning, we need to define several hyperparameters that will guide and constrain the training process.

### Data Collation:

- **`collator` Function**:
    - A helper function that takes a list of data samples and merges them into a single batch, making it suitable for processing by the model.
    - For instance, given an input of individual key-value data samples, the function groups the values by their keys.

    Example:
    ```python
    test_data = [{"key1": "value1", "key2": "value2", "key3": "value3"}, {"key1": "value4", "key2": "value5", "key3": "value6"}]
    collated_data = collator(test_data)
    ```

- **Sample Data**:
    - To visually validate the output of the `collator`, a sample is taken from the training dataset and processed.

### Key Hyperparameters:

- **`learning_rate`**:
    - Controls the step size of each optimizer update. Set to `1e-4`.

- **`max_ppo_epochs`**:
    - Specifies how many optimization passes PPO makes over each collected batch. Set to `5`.

- **`mini_batch_size`** & **`batch_size`**:
    - Determine the number of samples in each mini-batch (`2`) and the overall batch size (`8`).

- **`DEFAULT_REJECTED_SUMMARY_TEXT`**:
    - A placeholder text for a bad summary. This could potentially act as a regularizer during training, though its effect needs to be verified.

- **Generation Constraints** (`generation_kwargs`):
    - `temperature`: Controls the randomness of predictions by scaling the logits before applying softmax. Set to `0.5`.
    - `min_length`: Minimum length of the generated text. Set to `5`.
    - `top_k` & `top_p`: Control top-k sampling and nucleus (top-p) sampling respectively. Here, `top_k` is set to `0.0` and `top_p` to `1.0`, which disables both kinds of truncation.
    - `do_sample`: Boolean value determining whether to sample the outputs. Set to `True`.

- **Output Length Sampling**:
    - `output_min_length` & `output_max_length`: Define the minimum (`128`) and maximum (`2048`) lengths of generated outputs.
    - `output_length_sampler`: Samples an output length between the specified min and max values.

- **`max_ppo_steps`**:
    - Determines the total number of PPO steps during training. Set to `256`.
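
The effect of `temperature` is easiest to see on a small example. The sketch below applies a temperature-scaled softmax to three made-up logits; the function name and the logits are illustrative assumptions.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for temperature in (1.0, 0.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

A temperature below 1 sharpens the distribution, so sampling picks the top-scoring token more often; a temperature of 1 leaves the model's distribution unchanged.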


In [None]:
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])

test_data = [{"key1": "value1", "key2": "value2", "key3": "value3"}, {"key1": "value4", "key2": "value5", "key3": "value6"}]
print(f'Collator input: {test_data}')
print(f'Collator output: {collator(test_data)}')

# Lets sample what the collator generates:
sample_data = [train_dataset[i] for i in range(3)]  # take first three examples
collated_data = collator(sample_data)
print(collated_data.keys())

In [None]:
learning_rate=1e-4
max_ppo_epochs=5
mini_batch_size=2
batch_size=8

### Configuration for PPO Training

We leverage the `PPOConfig` from the Hugging Face `trl` library to set up the configuration required for the Proximal Policy Optimization (PPO) training.

The `PPOConfig` requires and/or allows for a number of arguments that define the behavior of the PPO training loop:

- **`model_name`**:
    - Name of the model. Here, it is set as `policy_model_name`.

- **`learning_rate`**:
    - The rate at which the model adjusts based on the error during training. We've set it to the previously initialized value of `learning_rate`.

- **`ppo_epochs`**:
    - Specifies the number of optimization epochs PPO runs on each batch of samples. Set to the previously defined `max_ppo_epochs`.

- **`mini_batch_size`**:
    - The size of the smaller batches that the main batch is divided into, during training. Set to the previously initialized value of `mini_batch_size`.

- **`batch_size`**:
    - The number of data samples processed during each training step. We've set it to the previously initialized value of `batch_size`.

For a more detailed understanding and potential additional configurations, one can refer to the [Hugging Face documentation on `trl.trainer`](https://huggingface.co/docs/trl/trainer).
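
A quick back-of-the-envelope calculation shows how these settings interact. It uses the values assigned in the code cell above (`batch_size=8`, `mini_batch_size=2`, `max_ppo_epochs=5`) and assumes no gradient accumulation, which is our assumption about the configuration rather than something stated in the notebook.

```python
batch_size = 8
mini_batch_size = 2
ppo_epochs = 5
train_samples = 1600  # 80% of the 2,000 sampled examples

# Each PPO step splits one batch into mini-batches and sweeps over them ppo_epochs times
mini_batches_per_epoch = batch_size // mini_batch_size
updates_per_ppo_step = ppo_epochs * mini_batches_per_epoch

# Number of batches that one pass over the training set yields
batches_per_pass = train_samples // batch_size

print(mini_batches_per_epoch)  # 4
print(updates_per_ppo_step)    # 20
print(batches_per_pass)        # 200
```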


In [None]:
# Check out https://huggingface.co/docs/trl/trainer

config = PPOConfig(
    model_name=policy_model_name,
    learning_rate=learning_rate,
    ppo_epochs=max_ppo_epochs,
    mini_batch_size=mini_batch_size,
    batch_size=batch_size
)

### Setting Up the PPO Trainer

To fine-tune the model using Proximal Policy Optimization (PPO), we use the `PPOTrainer` class from Hugging Face's `trl` library.

The `PPOTrainer` class is initialized with several key arguments:

- **`config`**:
    - The configuration object created using `PPOConfig`. This contains the hyperparameters required for PPO training.

- **`model`**:
    - The model that will be fine-tuned. In this case, it is the `ppo_model` which was previously instantiated.

- **`ref_model`**:
    - The reference model, representing the model before alignment. We use `ref_model` for this purpose.

- **`tokenizer`**:
    - The tokenizer responsible for converting text into tokens suitable for the model's input. Here, it's the `policy_tokenizer` we set up before.

- **`dataset`**:
    - The training dataset. We use the tokenized `train_dataset`.

- **`data_collator`**:
    - A function to transform a list of samples to a batch. We use the `collator` function we defined earlier.

This trainer will be used to conduct the PPO training loop, enabling us to fine-tune the model using reinforcement learning.

For a deeper dive into the functionalities provided by the `PPOTrainer` class, one can refer to the [Hugging Face documentation on `trl.trainer`](https://huggingface.co/docs/trl/trainer).



In [None]:
# Check out https://huggingface.co/docs/trl/trainer

ppo_trainer = PPOTrainer(config=config,
                         model=ppo_model,
                         ref_model=ref_model,
                         tokenizer=policy_tokenizer,
                         dataset=train_dataset,
                         data_collator=collator)

In [None]:
# Some initial values
output_min_length = 128
output_max_length = 2048
output_length_sampler = LengthSampler(output_min_length, output_max_length)

# These hyperparams guide the generation of the completion in the policy model. We could add other params like temperature.
generation_kwargs = {
    "temperature": 0.5,
    "min_length": 5,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True
}

max_ppo_steps = 256
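
The `output_length_sampler` is typically called once per generation to pick a fresh cap on the number of new tokens. The class below is a standard-library stand-in that illustrates this usage. It assumes `trl.core.LengthSampler` draws uniformly from the integers `min_value` to `max_value - 1`; `ToyLengthSampler` itself is hypothetical.

```python
import random


class ToyLengthSampler:
    """Stand-in for trl.core.LengthSampler (assumed behaviour: uniform draw
    from the integers min_value .. max_value - 1)."""

    def __init__(self, min_value, max_value, seed=None):
        self.min_value = min_value
        self.max_value = max_value
        self.rng = random.Random(seed)

    def __call__(self):
        return self.rng.randrange(self.min_value, self.max_value)


sampler = ToyLengthSampler(128, 2048, seed=0)
generation_kwargs = {
    "temperature": 0.5,
    "min_length": 5,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
}

for _ in range(3):
    # Draw a new length cap before each generation call
    generation_kwargs["max_new_tokens"] = sampler()
    print(generation_kwargs["max_new_tokens"])
```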

## Fine-Tuning with Reinforcement Learning

Reinforcement learning offers a unique approach to fine-tuning models. The underlying principle is to allow the model to learn by receiving feedback (rewards) on its actions. In this context, an action would be generating a summary for a given text prompt.

### Training Loop Overview

The training loop we've crafted here follows this sequence of steps:

1. **Model Prediction**: Using the policy language model (`ppo_model`, called through `ppo_trainer`), we generate predicted summaries.
2. **Score Generation**: We then pass these summaries to a reward model to assign a score (reward) based on the quality of the generated summary.
3. **Model Update**: With the generated summaries and their respective scores, we use Proximal Policy Optimization (PPO) to update our policy language model.

### Detailed Breakdown

#### **1. Model Prediction**:

- We iterate through our training data in batches (`prompt_tensors`).
- For each prompt, we predict a summary (`summary_tensors`). This prediction is based on the generation hyperparameters we've specified (`generation_kwargs`), which guide the sampling strategy.
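
To make the effect of the sampling hyperparameters more concrete, here is a minimal sketch of temperature scaling followed by top-p (nucleus) filtering on a toy distribution. The helper names `softmax_with_temperature` and `top_p_filter` are hypothetical, the logits are made up, and the filtering is a simplified version of what generation libraries actually do.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=1.0):
    """Keep the most likely tokens until their cumulative mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]

toy_logits = [2.0, 1.0, 0.0]  # made-up scores for a 3-token vocabulary
plain = softmax_with_temperature(toy_logits, temperature=1.0)
sharp = softmax_with_temperature(toy_logits, temperature=0.5)

print("temperature 1.0:", [round(p, 3) for p in plain])
print("temperature 0.5:", [round(p, 3) for p in sharp])
print("top_p 0.9 on the sharpened distribution:", [round(p, 3) for p in top_p_filter(sharp, top_p=0.9)])
```

A temperature below 1 concentrates probability on the most likely token, while `top_p=1.0` (the value in `generation_kwargs`) leaves the distribution unfiltered.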

#### **2. Score Generation**:

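- For each summary, we compute a scalar score (the reward) that reflects its quality.
- This step relies on a scorer that is separate from the policy model. In the training loop below, that scorer is the `score_summaries` helper, which asks `gpt-3.5-turbo` to rate the summary against the full text on a scale from 0 to 1.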
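
When the score arrives as free text, as it does with the `score_summaries` helper used in this notebook's training loop, it has to be turned into a number before PPO can use it. The sketch below shows one defensive way to do that. `parse_score` is a hypothetical helper, and both the fallback value and the clamping to the 0 to 1 range are assumptions rather than part of the original code.

```python
import re

def parse_score(reply, default=0.0):
    """Extract the first number from a judge's reply and clamp it to [0, 1]."""
    match = re.search(r"[-+]?\d*\.?\d+", reply)
    if match is None:
        return default  # assumption: fall back to a fixed default when no number is present
    return min(1.0, max(0.0, float(match.group())))

for reply in ["Rating: 0.2", "0.85", "I would give it 7 out of 10", "no score given"]:
    print(repr(reply), "->", parse_score(reply))
```

Clamping keeps a stray out-of-range reply from dominating a batch of rewards, at the cost of hiding the fact that the judge ignored its instructions.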

#### **3. Model Update**:

- Using PPO, we update our policy model based on:
  - The initial input (`prompt_tensors`).
  - The generated summary (`summary_tensors`).
  - The assigned reward (`reward_tensors`).
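
For intuition about what the update optimizes, here is a minimal sketch of PPO's clipped surrogate objective for a single sample. The function names are hypothetical, the clip range of 0.2 is an assumed illustrative value, and a real trainer computes this per token over whole batches alongside value and KL terms.

```python
import math

def probability_ratio(new_logprob, old_logprob):
    """Ratio pi_new(a|s) / pi_old(a|s), computed from log-probabilities."""
    return math.exp(new_logprob - old_logprob)

def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """PPO's per-sample objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# A large ratio with a positive advantage is capped, so one update cannot over-reward the action.
print(clipped_surrogate(ratio=1.5, advantage=2.0))
# A ratio inside the clip range is left untouched.
print(clipped_surrogate(ratio=1.1, advantage=2.0))
# With a negative advantage, the more pessimistic of the two terms is kept.
print(clipped_surrogate(ratio=0.5, advantage=-1.0))
```

The clipping is what makes the policy update "proximal": once the ratio leaves the clip range in the direction that would improve the objective, the gradient from that sample vanishes.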
  
### Key Metrics:

- `objective/kl`: The KL divergence between the policy being trained and the frozen reference model (`ref_model`). TRL's PPO penalizes this value so that the fine-tuned policy does not drift far from the model it started from.
  
- `ppo/returns/mean`: This is the average return achieved by the agent. Higher is better.

- `ppo/policy/advantages_mean`: Measures how much better an action is than the average action at a given state. An advantage of zero means the action is just average, a positive advantage means it's better than average, and a negative one means it's worse than average.
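
The quantities behind these metrics can be illustrated on toy numbers. The sketch below computes a KL divergence between two small next-token distributions and the simplest possible advantage estimate. The function names and all values are hypothetical, and libraries typically use more elaborate estimators (for example Generalized Advantage Estimation) than the plain difference shown here.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def advantage(return_estimate, value_estimate):
    """Simplest advantage estimate: observed return minus the predicted value."""
    return return_estimate - value_estimate

same = kl_divergence([0.5, 0.5], [0.5, 0.5])
shifted = kl_divergence([0.9, 0.1], [0.5, 0.5])
print("KL of identical distributions:", same)
print("KL after the distribution shifts:", round(shifted, 3))

print("better than expected:", advantage(return_estimate=0.8, value_estimate=0.5))
print("worse than expected:", advantage(return_estimate=0.2, value_estimate=0.5))
```

A KL of zero means the two distributions agree exactly, and the sign of the advantage tells the update whether to make the sampled action more or less likely.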

### Important Notes:

- **HACK** Alert: The code contains certain hacks (like for handling variable sequence lengths) which were used to overcome specific issues during development.

- **Reward Model**: The quality of the training depends largely on the quality of the feedback that the reward model provides.

### References:

- [PPOTrainer in Hugging Face's TRL library](https://huggingface.co/docs/trl/trainer#trl.PPOTrainer)
- [Using Transformer Reinforcement Learning to detoxify generative language models](https://medium.com/@ben.burtenshaw/using-transformer-reinforcement-learning-to-detoxify-generative-language-models-5198446d6786)
- HuggingFace's example scripts in their GitHub repository.

The success of reinforcement learning is deeply intertwined with the feedback mechanism and the quality of the reward signal.


In [None]:
import openai
import re

def score_summaries(full_text, summarized_text):

  prompt = f"""### FULL TEXT:\n {full_text} \n
  ### SUMMARIZED TEXT: \n {summarized_text}"""

  response = openai.ChatCompletion.create(
      temperature = 0.,
      model="gpt-3.5-turbo",
      messages=[{"role": "system", "content": f"""You are an expert in text summarization. Below, you are given the full text and its summarization.
  Your role is to rate the provided summarization with scores ranging from 0 to 1, where: 0 is the lowest score, 1 is the highest score.
  Your response should only be a double precision number that represents the scoring rate.
  """},
      {"role": "user", "content": prompt}],
      request_timeout=60000
  )

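  response = response['choices'][0]['message']['content']
  numbers  = re.findall(r"[-+]?(?:\d*\.*\d+)", response)
  if not numbers: # The reply contained no number; fail with a clear message instead of an IndexError
      raise ValueError(f"No score found in reply: {response!r}")
  score    = min(1.0, max(0.0, float(numbers[0]))) # Clamp to the 0..1 range requested in the system prompt
  return score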

In [None]:
import tqdm

In [None]:
orig_dataset[10000]['prompt']

In [None]:
# import re

# test_string = "Rating: 0.2"
# res = re.findall(r"[-+]?(?:\d*\.*\d+)", test_string)[0]
# print(f"The numbers list is : {res}")

In [None]:
objective_kl    = []
returns_mean    = []
advantages_mean = []

import time

start = time.time()

for step, batch in enumerate(ppo_trainer.dataloader):

    if step >= max_ppo_steps: # Break when we reach max_steps.
        break


    prompts = [policy_tokenizer.decode(tok) for tok in batch['input_ids']] # One decoded prompt per batch element
    prompt_tensors = batch["input_ids"]
    # print(batch['response'])
    # if step==0: break

    if isinstance(prompt_tensors, list) and all(isinstance(item, list) for item in prompt_tensors): # HACK!!! The collator may return a list of token-id lists
        # Each prompt is kept as its own 1-D tensor, so variable lengths need no padding here.
        prompt_tensors = [torch.tensor(seq).to(device) for seq in prompt_tensors] # Convert each token-id list to an individual tensor

    summary_tensors = []

    for prompt_tensor in prompt_tensors:
        prompt_tensor = torch.as_tensor(prompt_tensor).to(device) # as_tensor accepts lists and existing tensors without a copy warning
        max_new_tokens = output_length_sampler()
        generation_kwargs["max_new_tokens"] = max_new_tokens
        summary = ppo_trainer.generate(prompt_tensor, **generation_kwargs)
        summary_tensors.append(summary.squeeze()[-max_new_tokens:])

    batch["response"] = [policy_tokenizer.decode(r.squeeze()) for r in summary_tensors]

    response = batch["response"]

    reward_tensors = []

    for prompt, summary in zip(prompts, response):
        score = score_summaries(prompt, summary) # Score this prompt's own summary, not the whole batch of responses
        # score = float(score)
        reward_tensors.append(torch.tensor(score))

    stats = ppo_trainer.step(prompt_tensors, summary_tensors, reward_tensors)
    ppo_trainer.log_stats(stats, batch, reward_tensors)

    print(f'objective/kl: {stats["objective/kl"]}') # KL divergence between the trained policy and the frozen reference model; kept small so the policy does not drift far from its starting point.
    print(f'ppo/returns/mean: {stats["ppo/returns/mean"]}') # This is the average return achieved by the agent. Higher is better.
    print(f'ppo/policy/advantages_mean: {stats["ppo/policy/advantages_mean"]}') # Measures how much better an action is than the average action at a given state.
    print(f'STEP: {step}')

    objective_kl.append(stats["objective/kl"])
    returns_mean.append(stats["ppo/returns/mean"])
    advantages_mean.append(stats["ppo/policy/advantages_mean"])

    print('-' * 99)

end = time.time()
print(f'TIME: {end - start}')

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Data for plotting
t = np.array(returns_mean)
s = range(len(returns_mean))

fig, ax = plt.subplots()
ax.plot(s, t)

ax.set(xlabel='PPO step', ylabel='mean return',
       title='Policy optimization')
# ax.grid()

fig.savefig("test.png")
plt.show()

## Saving the Model and Tokenizer

After the fine-tuning process, it's crucial to save the model's weights and the tokenizer's configuration for future use, whether it's for inference, further training, or sharing with the community.

### 1. Saving the Model

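To preserve the state of your model post-training, the cells below push the model and tokenizer to the Hugging Face Hub with `push_to_hub`; `save_pretrained` is the equivalent for saving to local disk: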


In [None]:
# import os
# import huggingface_hub

# hf_token = os.environ['HF_TOKEN'] # Read the token from the environment instead of hard-coding it

# hf_api = huggingface_hub.HfApi(token=hf_token)

In [None]:
import os

hf_token = os.environ['HF_TOKEN'] # Read the access token from the environment; never commit it to a notebook
ppo_trainer.model.push_to_hub('PanoEvJ/T5_summarization_RLAIF', token=hf_token)
policy_tokenizer.push_to_hub('PanoEvJ/T5_summarization_RLAIF', token=hf_token)

In [None]:
objective_kl

In [None]:
returns_mean

In [None]:
advantages_mean

## Inference using the Fine-tuned Model

After saving the fine-tuned model, the next step is to utilize it for generating summaries. The model will produce outputs based on the knowledge it acquired during the RL fine-tuning process.

### Loading the Model

To load the model, we will use the `AutoModelForSeq2SeqLMWithValueHead` class from the `trl` library. This class is tailored for sequence-to-sequence tasks and also has the value head which was required for the Proximal Policy Optimization (PPO) algorithm:


In [None]:
ppo_saved_model_path = "PanoEvJ/T5_summarization_RLAIF"

from trl import AutoModelForSeq2SeqLMWithValueHead # https://huggingface.co/docs/trl/quickstart
ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(ppo_saved_model_path)

from transformers import AutoTokenizer
policy_tokenizer = AutoTokenizer.from_pretrained(ppo_saved_model_path)

### Function for Generating Summaries

In order to simplify the inference process and generate summaries for new prompts, a dedicated function `generate_summary` has been defined. This function uses the trained model, its tokenizer, and other parameters to produce concise and relevant summaries for input text.


In [None]:
def generate_summary(prompt: str, model, tokenizer, generation_kwargs, output_length_sampler) -> str:
    """
    Generate a summary for a given prompt using a trained policy model.

    Args:
    - prompt (str): The input text for which a summary needs to be generated.
    - model: The trained policy model.
    - tokenizer: The tokenizer used for the policy model.
    - generation_kwargs (dict): Arguments used for response generation.
    - output_length_sampler (func): Function to sample the length of the output.

    Returns:
    - str: Generated summary.
    """

    # Tokenize the prompt
    prompt_tensor = tokenizer.encode(prompt, return_tensors='pt').to(device)

    # Ensure it's only one tensor and check its shape
    assert prompt_tensor.dim() == 2, f"Unexpected tensor shape: {prompt_tensor.shape}"

    # Set the generation arguments on a copy so the caller's dict is not mutated
    gen_kwargs = {**generation_kwargs, "max_new_tokens": output_length_sampler()}

    # Generate a summary
    summary_tensor = model.generate(input_ids=prompt_tensor, **gen_kwargs)

    # Decode and return the summary
    summary = tokenizer.decode(summary_tensor[0], skip_special_tokens=True)
    return summary


In [None]:
# text = "SUBREDDIT: r/relationships TITLE: How do I/do I at all [20 F] tell my boyfriend [23 M] that I'm bisexual? POST: I've had two serious relationships prior to this one, both with women. They had no problem with me being bisexual and it was something known before the relationship -- my first girlfriend was also bisexual. I am now in a relationship with a guy. We've been exclusive for about a month. Having never faced this issue, I come to you, Reddit. Is this something that he needs to know? Is it really relevant to a hetero relationship, regardless of if one of the participants in the relationship is bisexual? If you guys think it is necessary, when do you think is the right time? I think my biggest fear is losing him because of it. I know that I should be with someone who is fine with who I am, but I really like the guy and I'd hate for my sexual orientation to be the thing that kills this."
# text = "SUBREDDIT: r/legaladvice TITLE: What can I do legally to restore water to my condominium!? POST: Hi, I live in SE Michigan in a condominium complex. Our water was shut off due to non-payment. (we recieved no notice) and we had to pay all that was due ($1500) We payed this yesterday at 2, they said the water would be turned on immediately. It wasn't. It's now the next day. The lady in our assosciation keeps insisting that the water meter is in another condo. Which we can't access because the person living there is never there (it's being rented) Now we're stuck with no water, no shower, no teeth brushing, no toilets, and no food for certain meals.... Please help us... What can we do? We called the police and they say that we can file a civil report for the lady not doing her job..."
# text = "SUBREDDIT: r/relationships TITLE: To go or not to go? Old friend (f, 23) getting married, I (f 23) don't want to because I have to go from here in the Netherlands to USA. POST: So, I have had this friend for a long time and we have always been there for each other. But about 6 months ago I moved here to the Netherlands to be with my partner (m23). This is our first place together here and we had to buy our own furniture. Needless to say we don't really have any money for trips. My friend is getting married in March in the USA and I feel really guilty out of obligation but I really don't want to go. I don't have the money for it and I don't want to leave here and miss my partner. Reasons for not wanting to go: 1. Money 2. Missing my partner. 3. Being incredibly bored once I'm there! I won't have a car or a way to get around, so I'll just be sitting in my parents house all day. I know it's bad that I don't want to go, but I am just really dreading it. Reddit, what do I do?"
# text = "SUBREDDIT: r/Advice TITLE: Bike tour around the world? POST: Hi there redditors! First of all I'd like to apologize for my English, but as you will see (I hope not), I'm not a native speaker. I'm 23-year-old who recently graduated from university and just stared my first job. Now, you see, my job is interesting and all, but it's an office job and I feel I'm not suited for this. I'm the adventures type, I want something happening around me and going to work from 9 to 6 is just killing me. The one thing that I thought of is a bike trip mostly in Europe, Asia and North Africa. The problem is that I'm from a country with an average salary around 350 euros or 450 USD. My salary is a bit higher - around 450 euros, but still not enough according to what I read is needed for such a trip, witch is about 30000 USD. My question is if somebody has done something like this without any money and if they have some tips for me. I'm thinking about sleeping outdoors or helping some locals for food and a place to crash. Is this something that could work out? I'm planning to go with my girlfriend and I think not too many people would take us in. Any help would be greatly appreciated!"
text = "SUBREDDIT: r/Parenting TITLE: Question about saying 'no' to 18 month old POST: When I tell my son 'no' to something that is either dangerous (like sitting on the arm of the couch or trying to climb onto the television) or something that is an unwanted behavior (biting, hitting etc.) he looks at me and giggles before continuing to do whatever the hell he wants to do. When my husband tells him 'no' he stops what he's doing and sometimes gets upset to the point of crying (I think because his feelings are hurt). I guess the question is, how do I get him to listen to me and not just to his father? I have tried to make my voice sound louder and more masculine, but that just makes him laugh even harder."


In [None]:
prompt = f"{task_prefix}{text}"
generated_summary = generate_summary(prompt, ppo_model, policy_tokenizer, generation_kwargs, output_length_sampler)
print(generated_summary)

## Conclusion and Recap

In this notebook, we embarked on the ambitious journey of Reinforcement Learning from Human Feedback (RLHF) with the aim to enhance text summarization. The major components of this approach are the policy model (in this case, a T5 model) and a reward model (based on BERT). Let's recap the steps we've taken and the knowledge we've gained:

1. **Loading the Policy Model (T5)**:
   - We began by initializing the T5 model which would act as our policy model for generating text summaries.
  
2. **Loading the Reward Model (BERT)**:
   - To evaluate the quality of the summaries generated by the T5 model and to give feedback, we employed a BERT-based model which was trained on a mixture of model-written summaries and human feedback.

3. **Training Loop with Proximal Policy Optimization (PPO)**:
   - For the fine-tuning of our T5 policy model, we utilized the PPO algorithm, a widely used policy-gradient reinforcement learning method.
   - We established a loop wherein the T5 model proposed text summaries which were then evaluated by the BERT-based reward model. Using these rewards, the T5 model was fine-tuned to better align with human preferences.
   - Throughout this loop, we monitored various metrics such as the KL divergence, mean returns, and advantages to ensure that the training was progressing desirably.

4. **Inference**:
   - After the RLHF process, we put our enhanced T5 model to the test! By employing a dedicated function, we generated summaries for new input text, reaping the rewards of our fine-tuning efforts.

By leveraging the strengths of both T5 and BERT, and by harnessing the power of reinforcement learning through PPO, we aimed to create a model that produces summaries of superior quality that are more in line with human preferences.

Future efforts can focus on refining the training process, experimenting with different RL algorithms, or scaling up the training data to further improve the performance.

Thank you for joining us on this journey, and happy summarizing!


Notebook developed by [Juan Olano](https://www.linkedin.com/in/juan-olano-b9a330112/) and [Pano Evangeliou](https://www.linkedin.com/in/p-evangeliou/) - Sept.2023