# Fine-Tune with Reinforcement Learning (PPO) to Generate Semantically queries

# Table of Contents

- [ 1 - Set up Kernel and Required Dependencies](#1)
- [ 2 - Load Model, Prepare Reward Model](#2)
  - [ 2.1 - Load Data and FLAN-T5 Model Fine-Tuned with Instruction](#2.1)
  - [ 2.2 - Prepare Reward Model](#2.2)
  - [ 2.3 - Evaluate Toxicity](#2.3)
- [ 3 - Perform Fine-Tuning](#3)
  - [ 3.1 - Initialize `PPOTrainer`](#3.1)
  - [ 3.2 - Fine-Tune the Model](#3.2)
  - [ 3.3 - Evaluate the Model Quantitatively](#3.3)
  - [ 3.4 - Evaluate the Model Qualitatively](#3.4)

<a name='1'></a>
## 1 - Set up Kernel and Required Dependencies

In [24]:
%pip install --upgrade pip
# Installing the Reinforcement Learning library directly from github.
%pip install trl==0.11.4 sentence-transformers einops

[0mNote: you may need to restart the kernel to use updated packages.
[0mNote: you may need to restart the kernel to use updated packages.


In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# trl: Transformer Reinforcement Learning library
from trl import AutoModelForSeq2SeqLMWithValueHead
from trl import create_reference_model
import pandas as pd

# tqdm library makes the loops show a smart progress meter.
from tqdm import tqdm
tqdm.pandas()

<a name='2'></a>
## 2 - Load T5 Model, Prepare Reward Model and Semantic Evaluator

<a name='2.1'></a>
### 2.1 - Load Data and Fine tuning Model Fine-Tuned with Instruction

In [None]:
from typing import List
import torch
import datasets
from datasets import Dataset
import numpy as np

instruction = "rewrite: "
text_column = "query"
summary_column = "alternative"
seed = 12231


def formatting_prompts_func(examples, tokenizer, max_length: int = 32):
    inputs, targets = [], []
    for i in range(len(examples[text_column])):
        if examples[text_column][i] and examples[summary_column][i]:
            inputs.append(examples[text_column][i])
            targets.append(examples[summary_column][i])

    inputs = [instruction + inp for inp in inputs]
    model_inputs = tokenizer(
        inputs, max_length=max_length, padding="max_length", truncation=True
    )
    labels = tokenizer(
        targets, max_length=max_length, padding="max_length", truncation=True
    )
    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # # padding in the loss.
    # if padding == "max_length" and data_args.ignore_pad_token_for_loss:
    labels["input_ids"] = [
        [(l if l != tokenizer.pad_token_id else -100) for l in label]
        for label in labels["input_ids"]
    ]

    model_inputs["labels"] = labels["input_ids"]
    model_inputs["query"] = inputs
    return model_inputs


def create_ds_from_parquet(data_path, tokenizer, max_seq_length):
    dataset = (
        Dataset.from_parquet(data_path)
        .map(
            lambda d: formatting_prompts_func(
                examples=d, tokenizer=tokenizer, max_length=max_seq_length
            ),
            batched=True,
        )
        .shuffle(seed=seed)
    )
    return dataset


def clip_rewards(
    similarity_scores: np.ndarray,
    optimal_min: float = 0.7,
    optimal_max: float = 0.95,
) -> np.ndarray:
    """Clip rewards based on cosine similarity scores."""
    """
        Args:
            - similarity_scores: np.ndarray, cosine similarity scores
            - optimal_min: float, minimum similarity score
            - optimal_max: float, maximum similarity score
        Returns:
            - np.ndarray, clipped rewards
    """
    # Convert input to numpy array if not already
    scores = np.asarray(similarity_scores, dtype=np.float32)
    return np.where((optimal_min <= scores) & (scores <= optimal_max), scores, 0)


def calculate_rewards(
    generated_texts: List[str],
    reference_texts: List[str],
    sentence_model,
    optimal_min: float = 0.7,
    optimal_max: float = 0.95,
    max_reward: float = 1.0,
    penalty_factor: float = 0.5,
    min_threshold: float = 0.3,
    high_similarity_penalty: float = 2.0,
) -> torch.Tensor:
    """
    Calculate semantic similarity rewards using SentenceTransformer embeddings
    Args:
    - generated_texts: List[str], list of generated texts
    - reference_texts: List[str], list of reference texts
    - sentence_model: SentenceTransformer model
    - optimal_min: float, minimum similarity score
    - optimal_max: float, maximum similarity score
    - max_reward: float, maximum reward
    - penalty_factor: float, penalty factor
    - min_threshold: float, minimum threshold
    - high_similarity_penalty: float, high similarity penalty
    Returns:
    - torch.Tensor, rewards

    """
    # Get embeddings
    ref_embeddings = sentence_model.encode(reference_texts, convert_to_tensor=True)
    gen_embeddings = sentence_model.encode(generated_texts, convert_to_tensor=True)

    # Calculate cosine similarity
    similarity_scores = (
        torch.nn.functional.cosine_similarity(ref_embeddings, gen_embeddings)
        .cpu()
        .numpy()
    )

    # Calculate rewards using numpy function
    rewards = clip_rewards(
        similarity_scores,
        optimal_min=optimal_min,
        optimal_max=optimal_max,
        max_reward=max_reward,
        penalty_factor=penalty_factor,
        min_threshold=min_threshold,
        high_similarity_penalty=high_similarity_penalty,
    )

    return torch.tensor(rewards, dtype=torch.float32)

Prepare a function to pull out the number of model parameters (it is the same as in the previous lab):

In [27]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"\ntrainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

In [None]:
model_name = "SteveTran/T5-small-query-expansion"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = model.to("cuda")

In [29]:
ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(model,
                                                               is_trainable=True)

print(f'PPO model parameters to be updated (ValueHead + 769 params):\n{print_number_of_trainable_model_parameters(ppo_model)}\n')
print(ppo_model.v_head)

PPO model parameters to be updated (ValueHead + 769 params):

trainable model parameters: 60507137
all model parameters: 60507137
percentage of trainable model parameters: 100.00%

ValueHead(
  (dropout): Dropout(p=0.1, inplace=False)
  (summary): Linear(in_features=512, out_features=1, bias=True)
  (flatten): Flatten(start_dim=1, end_dim=-1)
)


In [30]:
ref_model = create_reference_model(ppo_model)

print(f'Reference model parameters to be updated:\n{print_number_of_trainable_model_parameters(ref_model)}\n')

Reference model parameters to be updated:

trainable model parameters: 0
all model parameters: 60507137
percentage of trainable model parameters: 0.00%



Everything is set. It is time to prepare the reward model!

<a name='2.2'></a>
### 2.2 - Prepare Reward Model

In [31]:
from sentence_transformers import SentenceTransformer
model_id = "nomic-ai/nomic-embed-text-v1.5"
sbert = SentenceTransformer(model_id, device="cuda", trust_remote_code=True)

<All keys matched successfully>


In [32]:
calculate_rewards(
    generated_texts=["Machine Learning definition","Who is ML", "What is Machine Learning?"],
    reference_texts=["What is Machine Learning?", "What is Machine Learning?", "What is Machine Learning?"],
sentence_model=sbert)

tensor([0.9242, 0.0000, 0.0000])

<a name='2.3'></a>
### 2.3 - Evaluate Semantic value

In [None]:
def evaluate_semantic(
    seq_model: AutoModelForSeq2SeqLM,
    sentence_transformer_model: SentenceTransformer,
    tokenizer: AutoTokenizer,
    dataset: datasets.Dataset,
    generation_config: dict = dict(
        max_new_tokens=16,
        top_k=0,
        top_p=1.0,
        do_sample=True,
        use_cache=True,
    ),
):
    """ """

    ref_texts = []
    gen_texts = []
    for i, sample in tqdm(enumerate(dataset), total=len(dataset)):
        input_text = sample["query"]
        input_ids = tokenizer([input_text], return_tensors="pt", padding=True).to(
            "cuda"
        )
        response_token_ids = seq_model.generate(**input_ids, **generation_config)
        for j in range(len(response_token_ids)):
            generated_text = tokenizer.decode(
                response_token_ids[j], skip_special_tokens=True
            )
            gen_texts.append(generated_text)
            ref_texts.append(input_text)

    semantic_values = (
        calculate_rewards(
            reference_texts=ref_texts,
            generated_texts=generated_text,
            sentence_model=sentence_transformer_model,
        )
        .detach()
        .numpy()
    )

    # Compute mean & std using np.
    mean = np.mean(semantic_values)
    std = np.std(semantic_values)

    return mean, std

And now perform the calculation of the model toxicity before fine-tuning:

In [34]:
val_ds = datasets.load_dataset("SteveTran/query-expansion-v1", split="test")

In [35]:
generation_config: dict = dict(
    max_new_tokens=20,
    top_k=0,
    top_p=1.0,
    do_sample=True,
    use_cache=True,
)

In [55]:
tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="auto")

mean_before_, std_before_ = evaluate_semantic(
    seq_model=ref_model,
    sentence_transformer_model=sbert,
    tokenizer=tokenizer,
    dataset=val_ds,
    generation_config=generation_config
)

print(f'semantic [mean, std] before detox: [{mean_before_}, {std_before_}]')

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11991/11991 [15:34<00:00, 12.83it/s]


semantic [mean, std] before detox: [6.149153341539204e-05, 0.006733252666890621]


<a name='3'></a>
## 3 - Perform Fine-Tuning
Optimize a RL policy against the reward model using Proximal Policy Optimization (PPO).

<a name='3.1'></a>
### 3.1 - Initialize `PPOTrainer`
 
For the `PPOTrainer` initialization, you will need a collator. Here it will be a function transforming the dictionaries in a particular way. You can define and test it:

In [1]:
from trl import PPOConfig, PPOTrainer

In [40]:
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])

In [41]:
learning_rate=1e-5
mini_batch_size=128
batch_size=16
num_ppo_epochs=3
total_episodes = 100_000
max_seq_length = 20

last_name = model_name.split("/")[-1]
model_dir=f"models/{last_name}-ppo"
model_checkpoint_dir=f"models/{last_name}-ppo-checkpoint"
output_data_dir=f"models/{last_name}-ppo-data"
config = PPOConfig(
    learning_rate=learning_rate,
    mini_batch_size=mini_batch_size,
)


train_ds = create_ds_from_parquet("test_sample_query_rewrite_nodup_full_sample_clean.parquet", tokenizer=tokenizer, max_seq_length=max_seq_length)
optimizer = torch.optim.AdamW(lr=learning_rate, params=ppo_model.parameters())
ppo_trainer = PPOTrainer(
    config, 
    ppo_model, 
    ref_model, 
    tokenizer, 
    dataset=train_ds.with_format("torch"),
    data_collator=collator,
    lr_scheduler=torch.optim.lr_scheduler.LinearLR(optimizer=optimizer),
    optimizer=optimizer
)
# ppo_trainer.train()



<a name='3.2'></a>
### 3.2 - Fine-Tune the Model

The fine-tuning loop consists of the following main steps:
1. Get the query responses from the policy LLM
2. Get semantically values from Sentence Transformer model
3. Optimize policy with PPO using the (query, response, reward) triplet.

The operation is running if you see the following metrics appearing:
* `objective/kl`: minimize kl divergence,
* `ppo/returns/mean`: maximize mean returns,
* `ppo/policy/advantages_mean`: maximize advantages.

In [None]:
num_epochs = 3
max_new_tokens = 20

for epoch in range(num_epochs):
    for step, batch in tqdm(enumerate(ppo_trainer.dataloader),total=len(ppo_trainer.dataloader), disable=True):
        # Break when you reach max_steps.
        prompt_tensors = batch["input_ids"]
    
        # Get response from LLM
        summary_tensors = []
    
        for prompt_tensor in prompt_tensors:
            # max_new_tokens = output_length_sampler()
            # generation_kwargs["max_new_tokens"] = max_new_tokens
            summary = ppo_trainer.generate(prompt_tensor, **generation_config)
            
            summary_tensors.append(summary.squeeze()[-max_new_tokens:])
            
        # This needs to be called "response".
        batch["response"] = [tokenizer.decode(r.squeeze()) for r in summary_tensors]
    
        # Compute reward outputs  
        reward_tensors = [score for score in calculate_rewards(generated_texts=batch["response"], reference_texts=batch["query"], sentence_model=sbert)]
        # Run PPO step.
        stats = ppo_trainer.step(queries=prompt_tensors, responses=summary_tensors, scores=reward_tensors)
        ppo_trainer.log_stats(stats, batch, reward_tensors)
    
        desc = f"""Epoch: {epoch+1}, Step: {step}, mean scores: {stats["ppo/mean_scores"]}, objective/kl: {stats["objective/kl"]}, loss: {stats["ppo/loss/value"]}"""
        print(desc)

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Epoch: 1, Step: 0, mean scores: 0.6439306735992432, objective/kl: 0.0, loss: 0.37198907136917114
Epoch: 1, Step: 1, mean scores: 0.5170105695724487, objective/kl: -0.007749204523861408, loss: 0.2968685030937195
Epoch: 1, Step: 2, mean scores: 0.5125870704650879, objective/kl: -0.02115878090262413, loss: 0.2863601744174957
Epoch: 1, Step: 3, mean scores: 0.4957434833049774, objective/kl: -0.05524995177984238, loss: 0.2854583263397217
Epoch: 1, Step: 4, mean scores: 0.531112551689148, objective/kl: -0.06087280437350273, loss: 0.29867905378341675
Epoch: 1, Step: 5, mean scores: 0.5481873750686646, objective/kl: -0.09510336816310883, loss: 0.32273808121681213
Epoch: 1, Step: 6, mean scores: 0.5520648956298828, objective/kl: -0.09363441169261932, loss: 0.3287044167518616
Epoch: 1, Step: 7, mean scores: 0.5696418285369873, objective/kl: -0.11545153707265854, loss: 0.34329307079315186
Epoch: 1, Step: 8, mean scores: 0.5634353160858154, objective/kl: -0.15675708651542664, loss: 0.3215775787830

<a name='3.3'></a>
### 3.3 - Evaluate the Model Quantitatively

In [52]:
mean_after_, std_after_ = evaluate_semantic(
    seq_model=model,
    sentence_transformer_model=sbert,
    tokenizer=tokenizer,
    dataset=val_ds
)

print(f'semantic [mean, std] after detox: [{mean_after_}, {std_after_}]')

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11991/11991 [14:05<00:00, 14.19it/s]


semantic [mean, std] after detox: [0.00014098300016485155, 0.010951225645840168]


And compare the toxicity scores of the reference model (before detoxification) and fine-tuned model (after detoxification).

In [57]:
mean_improvement = (mean_before_ - mean_after_) / mean_before_
std_improvement = (std_before_ - std_after_) / std_before_

print(f'Percentage improvement of sematically score after detoxification:')
print(f'mean: {mean_improvement*100:.2f}%')
print(f'std: {std_improvement*100:.2f}%')

Percentage improvement of sematically score after detoxification:
mean: -129.27%
std: -62.64%


In [58]:
path = "models/LaMini-T5-61M-ppo"
ppo_model.save_pretrained(path)
tokenizer.save_pretrained(path)

('models/LaMini-T5-61M-ppo/tokenizer_config.json',
 'models/LaMini-T5-61M-ppo/special_tokens_map.json',
 'models/LaMini-T5-61M-ppo/tokenizer.json')

In [None]:
# prompt = "What is Machine Learning?"
prompt = "Who lived longer, Nikola Tesla or Milutin Milankovic?"
prompt = "Create a table for top noise cancelling headphones that are not expensive"
# prompt = "Author David Chanoff has collaborated with a U.S. Navy admiral who served as the ambassador to the United Kingdom under which President?"
inputs = tokenizer(
    ["{}: {}".format(instruction, prompt)],
    padding=False,
    return_tensors="pt",
).to("cuda")

outputs = ppo_model.generate(**inputs, **generation_config)
print("Answer: ", tokenizer.batch_decode(outputs, skip_special_tokens=True))

CPU times: user 95.5 ms, sys: 0 ns, total: 95.5 ms
Wall time: 95.3 ms
Answer:  ['cost of top noise cancelling headphones']


<a name='3.4'></a>
### 3.4 - Evaluate the Model Qualitatively

Let's inspect some examples from the test dataset. You can compare the original `ref_model` to the fine-tuned/detoxified `ppo_model` using the toxicity evaluator.

In [81]:
batch_size = 20
compare_results = {}

df_batch = val_ds.shuffle().select(range(0, batch_size)).to_pandas()
compare_results["query"] = df_batch["query"]
prompt_tensors = tokenizer(df_batch["query"].tolist(), return_tensors="pt", padding="longest", truncation=True).to("cuda")["input_ids"]

summary_tensors_ref = []
summary_tensors = []

# Get response from ppo and base model.
for i in tqdm(range(batch_size)):
    summary = ref_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to('cuda'), 
        **generation_config
    ).squeeze()
    summary_tensors_ref.append(summary)

    summary = ppo_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to('cuda'), 
        **generation_config
    ).squeeze()
    summary_tensors.append(summary)

# Decode responses.
compare_results["response_before"] = [tokenizer.decode(summary_tensors_ref[i], skip_special_tokens=True) for i in range(batch_size)]
compare_results["response_after"] = [tokenizer.decode(summary_tensors[i], skip_special_tokens=True) for i in range(batch_size)]

# Sentiment analysis of query/response pairs before/after.
rewards_before = calculate_rewards(
    reference_texts=compare_results["query"],
    generated_texts=compare_results["response_before"],
    sentence_model=sbert
)

rewards_after = calculate_rewards(
    reference_texts=compare_results["query"],
    generated_texts=compare_results["response_after"],
    sentence_model=sbert
)
compare_results["reward_before"] = rewards_before
compare_results["reward_after"] = rewards_after

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:02<00:00,  7.00it/s]




Store and review the results in a DataFrame

In [82]:
pd.set_option('display.max_colwidth', 500)
df_compare_results = pd.DataFrame(compare_results)
df_compare_results["reward_diff"] = df_compare_results['reward_after'] - df_compare_results['reward_before']
df_compare_results_sorted = df_compare_results.sort_values(by=['reward_diff'], ascending=False).reset_index(drop=True)
df_compare_results_sorted

Unnamed: 0,query,response_before,response_after,reward_before,reward_after,reward_diff
0,how does va rate shoulder range of motion,VHA level germination frequency,VA shoulder range of motion calculated,0.0,0.93034,0.93034
1,is a rhombus a polygon,difference between hombus of rhombus,rhombus polygon,0.0,0.921353,0.921353
2,cpa starting salary california,CPA starting salary California,Chartered Accountant starting salary California,0.0,0.914483,0.914483
3,how does the london pass work,England pass time,London pass track method,0.0,0.742174,0.742174
4,what is epigenetic modification,epigenetic therapy for reduced strains,epigenetic modifications,0.769173,0.931372,0.162199
5,is kodak still making cameras,Kodak video processing equipment,Kodak cameramaking capabilities,0.774541,0.847937,0.073395
6,how do transcriptionist charge,transcriptionist charge technique and pulsation,transcriptionist charge mechanism,0.800912,0.868187,0.067275
7,how to grow pumpkins in small space,growing pumpkins in small areas for males,Pumpkin growers in small spaces,0.833836,0.875493,0.041657
8,why is glycogen used in exercise,uses of glycogen in endurance,glycogen work in aerobic activities,0.84451,0.858682,0.014172
9,what is the italian peninsula rome,Rome italian peninsula,Italian peninsula Rome,0.934214,0.947376,0.013162
