### NLP (Natural Language Processing) with PEFT (Parameter Efficient Fine-Tuning) and LoRA (Low-Rank Adaptation) for Less-Toxic Summarization

**Project Workflow:**
* **Setup:** Import necessary libraries and define project parameters.

* **Dataset Exploration:** Discover the DialogSum dataset.

* **Test Model Zero-Shot Inference:** Initially, test the FLAN-T5 model for zero-shot inference on dialogue summarization tasks to establish baseline performance.

* **Preprocess Dialogue and Summary from the Dataset:** Preprocess the dialogue and its corresponding summary from the dataset to prepare it for training.

* **Perform Parameter-Efficient Fine-Tuning (PEFT):** Implement Parameter-Efficient Fine-Tuning (PEFT), a more efficient fine-tuning approach that can significantly reduce training time while maintaining performance.

* **Evaluation:**

* Perform a human evaluation to measure the model's output in terms of readability and coherence. This may involve having annotators rate the generated summaries for quality.

* Use ROUGE metrics to evaluate the quality of the generated summaries. ROUGE measures the overlap between the generated summaries and human-written references.

The argument is that a dataset similar to those already in Production is being used, seeking a similar public dataset to exemplify the scenario.

**Data**

DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues (plus 100 hold-out data for topic generation) with corresponding manually labeled summaries and topics.

[Dialogsum](https://huggingface.co/datasets/knkarthick/dialogsum?row=0)

## <b>1 <span style='color:#78D118'>|</span> Introduction</b>

This project explores the capabilities of large language models (LLMs), with a particular emphasis on using parameter-efficient fine-tuning (PEFT) to create dialogue summaries with reduced toxicity. We will fine-tune a FLAN-T5 model to generate less toxic content using Meta AI's Hate Speech Reward Model. This reward model is a binary classifier that predicts whether a given text is "non-hateful" or "hateful." We will use Proximal Policy Optimization (PPO) to fine-tune the model and reduce its toxicity.

Our primary goal is to improve the quality of dialogue summaries while minimizing toxicity. To achieve this, we apply Proximal Policy Optimization (PPO) for fine-tuning, aiming to mitigate the toxic output of the model. Additionally, we will demonstrate the advantages of parameter-efficient fine-tuning (PEFT), showing that its benefits outweigh any potential minor performance trade-offs.

**NOTE**: This is an example, and we did not use all of the data used.

![texto del vínculo](https://drive.google.com/uc?id=1HPqVdpmizy-5UzSnQoSqOVMO9TPVDLaE)
![texto del vínculo](https://drive.google.com/uc?id=1_aV-TgO-wEtQPb9nTxnBKfCbUAFsVV-l)
![texto del vínculo](https://drive.google.com/uc?id=1rkSoiFs8hnFXHNKcnxuqOdKjZqsbwHtd)
![texto del vínculo](https://drive.google.com/uc?id=1rzUv-MyBtLq4lH2fNlGIgLlSLVqvJwC8)

In [None]:
%pip install --upgrade pip
%pip install torch
%pip install torchdata

%pip install transformers
%pip install evaluate
%pip install rouge_score
%pip install peft


Collecting pip
  Downloading pip-25.3-py3-none-any.whl.metadata (4.7 kB)
Downloading pip-25.3-py3-none-any.whl (1.8 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.8/1.8 MB[0m [31m19.3 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-25.3
Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.6
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (pyproj

In [None]:
%pip install datasets



In [None]:
%pip install trl==0.11.3

Collecting trl==0.11.3
  Downloading trl-0.11.3-py3-none-any.whl.metadata (12 kB)
Collecting tyro>=0.5.11 (from trl==0.11.3)
  Downloading tyro-0.9.35-py3-none-any.whl.metadata (12 kB)
Collecting shtab>=1.5.6 (from tyro>=0.5.11->trl==0.11.3)
  Downloading shtab-1.7.2-py3-none-any.whl.metadata (7.4 kB)
Downloading trl-0.11.3-py3-none-any.whl (316 kB)
Downloading tyro-0.9.35-py3-none-any.whl (132 kB)
Downloading shtab-1.7.2-py3-none-any.whl (14 kB)
Installing collected packages: shtab, tyro, trl
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3/3[0m [trl]
[1A[2KSuccessfully installed shtab-1.7.2 trl-0.11.3 tyro-0.9.35


In [None]:
#load the libs
#,GenerationConfig va en transformer

from datasets import  load_dataset, Dataset
from transformers import pipeline, AutoModelForSequenceClassification, AutoModelForSeq2SeqLM, AutoTokenizer, TrainingArguments, GenerationConfig,Trainer
#trl: Transformer Reinforcement Learning Library
from trl import PPOTrainer, PPOConfig, AutoModelForSeq2SeqLMWithValueHead
from trl import create_reference_model
from trl.core import LengthSampler

import torch
import time
import evaluate
import pandas as pd
import numpy as np


#tqdm library makes the loops show a smart progress meter
from tqdm import tqdm
tqdm.pandas()

## <b>2 <span style='color:#78D118'>|</span> Data download</b>

Here, we'll use the T5 model as a pre-trained base and the corresponding tokenizer. You can use a different pre-trained model (and corresponding tokenizer) by renaming the model below to a different model in Hugging Face Hub, or you can use a custom model/train a tokenizer from scratch on your own dataset. Keep in mind that you'll need significantly more data and computation to train a good model from scratch.

T5 is available in multiple sizes, including: T5 Small, T5 Base, T5 Large, T5 3B, and T5 11B.

In [None]:
model_name = "google/flan-t5-base"
huggingface_dataset_name = "knkarthick/dialogsum"

# Load the dataset
dataset_original = load_dataset(huggingface_dataset_name)

# Check the dataset
print(dataset_original)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

train.csv:   0%|          | 0.00/11.3M [00:00<?, ?B/s]

validation.csv: 0.00B [00:00, ?B/s]

test.csv: 0.00B [00:00, ?B/s]

Generating train split:   0%|          | 0/12460 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/500 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1500 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})


3 | Methods

In [None]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"\ntrainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

# <b>4<span style='color:#78D118'>|</span> Tokenizing Information</b>

The next step involves preprocessing the dataset. We will select a subset of the data, filter the dialogs to a specific length to ensure readability while maintaining meaningful content, and then integrate each dialog with a statement before tokenizing the prompts. The resulting token IDs will be stored in the `input_ids` field, while the decoded prompts will be saved in the `query` field.

To streamline this process, it is recommended to create a function called `build_dataset`. This function can be defined as follows:

In [None]:
def build_dataset(model_name,
                  dataset_name,
                  input_min_text_length,
                  input_max_text_length):

    """
    Preprocess the dataset and split it into train and test parts.

    Parameters:
    - model_name (str): Tokenizer model name.
    - dataset_name (str): Name of the dataset to load.
    - input_min_text_length (int): Minimum length of the dialogues.
    - input_max_text_length (int): Maximum length of the dialogues.

    Returns:
    - dataset_splits (datasets.dataset_dict.DatasetDict): Preprocessed dataset containing train and test parts.
    """

    # load dataset (only "train" part will be enough for this lab).
    dataset = load_dataset(dataset_name, split="train")

    # Filter the dialogues of length between input_min_text_length and input_max_text_length characters.
    dataset = dataset.filter(lambda x: len(x["dialogue"]) > input_min_text_length and len(x["dialogue"]) <= input_max_text_length, batched=False)

    # Prepare tokenizer. Setting device_map="auto" allows to switch between GPU and CPU automatically.
    tokenizer = AutoTokenizer.from_pretrained(model_name) #, device_map="auto"

    def tokenize(sample):

        # Wrap each dialogue with the instruction.
        prompt = f"""
Summarize the following conversation.

{sample["dialogue"]}

Summary:
"""
        sample["input_ids"] = tokenizer.encode(prompt)

        # This must be called "query", which is a requirement of our PPO library.
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    # Tokenize each dialogue.
    dataset = dataset.map(tokenize, batched=False)
    dataset.set_format(type="torch")

    # Split the dataset into train and test parts.
    dataset_splits = dataset.train_test_split(test_size=0.2, shuffle=False, seed=42)

    return dataset_splits

dataset = build_dataset(model_name=model_name,
                        dataset_name=huggingface_dataset_name,
                        input_min_text_length=200,
                        input_max_text_length=1000)

print(dataset)

Filter:   0%|          | 0/12460 [00:00<?, ? examples/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

Map:   0%|          | 0/10022 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'query'],
        num_rows: 8017
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'query'],
        num_rows: 2005
    })
})


# <b>5 <span style='color:#78D118'>|</span> Optimized FLAN-T5 Model with Summary Instructions</b>

## <b>5.1 <span style='color:#78D118'>|</span> Enhancement of the Optimized FLAN-T5 Model with a Summary Adapter</b>

We are enhancing the original FLAN-T5 model by adding a summary adapter. This adapter is designed to improve the model's performance in summary tasks.

We begin by configuring the adapter using the following parameters:
- `r`: Rank, which is set to 32.
- `lora_alpha`: LORA alpha value, set to 32.
- `target_modules`: We specify the target modules as ["q", "v"].

- `lora_dropout`: Dropout rate for LORA, set to 0.05.

- `bias`: We use "none" as the bias configuration.

- `task_type`: The task type is set to SEQ_2_SEQ_LM, which is suitable for FLAN-T5.

Next, we load the previously trained FLAN-T5 model and create an instance of AutoModelForSeq2SeqLM with the specified model name and data type (torch_dtype).

We also create a PeftModel incorporating the previously loaded model.
Additionally, we provide the LoRa configuration, the torch data type, the device mapping and specify that the model is trainable.

In [None]:
from peft import LoraConfig, get_peft_model, TaskType
from peft import PeftModel, PeftConfig

lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name,
                                              torch_dtype=torch.bfloat16)

peft_model = PeftModel.from_pretrained(model,
                                       'z7ye/peft-dialogue-summary-checkpoint',
                                       lora_config=lora_config,
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=True) #device_map="auto",

print(f'PEFT model parameters to be updated:\n{print_number_of_trainable_model_parameters(peft_model)}\n')

config.json: 0.00B [00:00, ?B/s]

`torch_dtype` is deprecated! Use `dtype` instead!


model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

adapter_config.json:   0%|          | 0.00/334 [00:00<?, ?B/s]

adapter_model.bin:   0%|          | 0.00/14.2M [00:00<?, ?B/s]

PEFT model parameters to be updated:

trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable model parameters: 1.41%



## <b>5.2 <span style='color:#78D118'>|</span> Improving the LLM Summary Response with Reinforcement Learning using OOP</b>

We are now preparing to fine-tune the language model (LLM) using reinforcement learning (RL). While a more detailed explanation of RL is provided, our current focus is on setting up the proximal policy optimization (PPO) model.

This PPO model will receive the instruction-tuned PEFT model as input and will be used to optimize the RL policy according to the reward model.

In [None]:
ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(peft_model,
                                                               torch_dtype=torch.bfloat16,
                                                               is_trainable=True)

print(f'PPO model parameters to be updated (ValueHead + 769 params):\n{print_number_of_trainable_model_parameters(ppo_model)}\n')
print(ppo_model.v_head)

PPO model parameters to be updated (ValueHead + 769 params):

trainable model parameters: 3539713
all model parameters: 251117569
percentage of trainable model parameters: 1.41%

ValueHead(
  (dropout): Dropout(p=0.1, inplace=False)
  (summary): Linear(in_features=768, out_features=1, bias=True)
  (flatten): Flatten(start_dim=1, end_dim=-1)
)


In [None]:
ref_model = create_reference_model(ppo_model)

print(f'Reference model parameters to be updated:\n{print_number_of_trainable_model_parameters(ref_model)}\n')

Reference model parameters to be updated:

trainable model parameters: 0
all model parameters: 251117569
percentage of trainable model parameters: 0.00%



# <b>6<span style='color:#78D118'>|</span> Building a Reward Model for Reinforcement Learning</b>

Reinforcement learning (RL) is a fundamental branch of machine learning in which agents make decisions within an environment to maximize their accumulated rewards. The behavior of these agents is governed by a decision-making policy, and the fundamental goal of RL is for the agent to acquire an optimal or near-optimal policy that maximizes the reward function.

Previously, the original policy was based on the PEFT instruction model—essentially, the language model (LLM) before it underwent detoxification. While one approach involved asking human labelers to provide feedback on the toxicity of the model's outputs, this process can become prohibitively expensive when applied throughout the fine-tuning phase. A pragmatic solution to avoid this expense is to implement a reward model that incentivizes the agent to produce detoxified dialogue summaries.


A sensible approach in this case is to perform sentiment analysis on the model's outputs, classifying them into two categories: "nothate" and "hate." Higher rewards are assigned when the probability of classifying an output as "nothate" is higher.

In this context, we will use Meta AI's RoBERTa-based hate speech model as our reward model. This model generates logits and then predicts probabilities for two classes: "nothate" and "hate." Positive rewards are derived from the logits associated with the "nothate" class. The model will undergo further tuning through proximal policy optimization (PPO) with these reward values.

## <b>6.1<span style='color:#78D118'>|</span> Load the Meta AI-based RoBERTa hate speech model</b>

In [None]:
toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name) # device_map="auto"
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name) #, device_map="auto"
print(toxicity_model.config.id2label)

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/816 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

{0: 'nothate', 1: 'hate'}


Take a non-toxic text, convert it to a token and pass it to the model. Print the output logits, probabilities, and corresponding reward, which will be used for fine-tuning.

In [None]:
non_toxic_text = "#Person 1# tells Tommy that he didn't like the movie."

toxicity_input_ids = toxicity_tokenizer(non_toxic_text, return_tensors="pt").input_ids

logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# get the logits for "not hate" - this is the reward!
not_hate_index = 0
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (high): {nothate_reward}')

logits [not hate, hate]: [3.114102363586426, -2.489619016647339]
probabilities [not hate, hate]: [0.9963293671607971, 0.0036706042010337114]
reward (high): [3.114102363586426]


We're going to show a toxic comment. This one will have a low reward because it's more toxic.

In [None]:
toxic_text = "#Person 1# tells Tommy that the movie was terrible, dumb and stupid."

toxicity_input_ids = toxicity_tokenizer(toxic_text, return_tensors="pt").input_ids

logits = toxicity_model(toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# Get the logits for "not hate" - this is the reward!
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (low): {nothate_reward}')

logits [not hate, hate]: [-0.6921197175979614, 0.37227365374565125]
probabilities [not hate, hate]: [0.2564707398414612, 0.743529200553894]
reward (low): [-0.6921197175979614]


## <b>6.2<span style='color:#78D118'>|</span> Configure the Pipeline Toxicity Reward Model</b>

Configure the Hugging Face inference pipeline to simplify the code for the toxicity reward model:

In [None]:
sentiment_pipe = pipeline("sentiment-analysis",
                          model=toxicity_model_name,
                          framework='pt'
                          ) #device=device
reward_logits_kwargs = {
    "top_k": None, # Return all scores.
    "function_to_apply": "none", # Set to "none" to retrieve raw logits.
    "batch_size": 16
}

reward_probabilities_kwargs = {
    "top_k": None, # Return all scores.
    "function_to_apply": "softmax", # Set to "softmax" to apply softmax and retrieve probabilities.
    "batch_size": 16
}

print("Reward model output:")
print("For non-toxic text")
print(sentiment_pipe(non_toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(non_toxic_text, **reward_probabilities_kwargs))
print("For toxic text")
print(sentiment_pipe(toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(toxic_text, **reward_probabilities_kwargs))

Device set to use cpu


Reward model output:
For non-toxic text
[{'label': 'nothate', 'score': 3.114102363586426}, {'label': 'hate', 'score': -2.489619016647339}]
[{'label': 'nothate', 'score': 0.9963293671607971}, {'label': 'hate', 'score': 0.0036706042010337114}]
For toxic text
[{'label': 'hate', 'score': 0.37227365374565125}, {'label': 'nothate', 'score': -0.6921197175979614}]
[{'label': 'hate', 'score': 0.743529200553894}, {'label': 'nothate', 'score': 0.2564707398414612}]


The results are the logits of the `nothate` (positive) and `hate` (negative) classes. But PPO will only use the logits of the `nothate` class as a positive reward signal used to help detoxify LLM results.

In [None]:
print(sentiment_pipe(non_toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(non_toxic_text, **reward_probabilities_kwargs))

[{'label': 'nothate', 'score': 3.114102363586426}, {'label': 'hate', 'score': -2.489619016647339}]
[{'label': 'nothate', 'score': 0.9963293671607971}, {'label': 'hate', 'score': 0.0036706042010337114}]


In [None]:
print(sentiment_pipe(toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(toxic_text, **reward_probabilities_kwargs))

[{'label': 'hate', 'score': 0.37227365374565125}, {'label': 'nothate', 'score': -0.6921197175979614}]
[{'label': 'hate', 'score': 0.743529200553894}, {'label': 'nothate', 'score': 0.2564707398414612}]


## <b>6.3<span style='color:#78D118'>|</span> Evaluating Toxicity</b>

To evaluate the model's performance both before and after the adjustment and detoxification processes, it is essential to establish a toxicity assessment metric. The toxicity score is represented as a decimal value ranging from 0 to 1, where 1 signifies the highest degree of toxicity.

In [None]:
toxicity_evaluator = evaluate.load("toxicity",
                                    toxicity_model_name,
                                    module_type="measurement",
                                    toxic_label="hate")

Downloading builder script: 0.00B [00:00, ?B/s]

Device set to use cpu


Try calculating the toxicity for the same sentences. Not surprisingly, the toxicity scores are the probabilities of the "hate" class returned directly from the reward model.

In [None]:
toxicity_score = toxicity_evaluator.compute(predictions=[
    non_toxic_text
])

print("Toxicity score for non-toxic text:")
print(toxicity_score["toxicity"])

toxicity_score = toxicity_evaluator.compute(predictions=[
    toxic_text
])

print("\nToxicity score for toxic text:")
print(toxicity_score["toxicity"])

Toxicity score for non-toxic text:
[0.0036706042010337114]

Toxicity score for toxic text:
[0.743529200553894]


This evaluator can be used effectively to calculate the toxicity levels of dialogues.

To do this, you will need to provide several essential components, including the test dataset (`dataset["test"]`), the tokenizer used in the previous section, the previously frozen PEFT model, and the toxicity evaluator itself. For a simplified and organized approach, it is recommended to encapsulate these necessary procedures within a dedicated function called `evaluate_toxicity`.

In [None]:
def evaluate_toxicity(model,
                      toxicity_evaluator,
                      tokenizer,
                      dataset,
                      num_samples):

    """
    Preprocess the dataset and split it into train and test parts.

    Parameters:
    - model (trl model): Model to be evaluated.
    - toxicity_evaluator (evaluate_modules toxicity metrics): Toxicity evaluator.
    - tokenizer (transformers tokenizer): Tokenizer to be used.
    - dataset (dataset): Input dataset for the evaluation.
    - num_samples (int): Maximum number of samples for the evaluation.

    Returns:
    tuple: A tuple containing two numpy.float64 values:
    - mean (numpy.float64): Mean of the samples toxicity.
    - std (numpy.float64): Standard deviation of the samples toxicity.
    """

    max_new_tokens=100

    toxicities = []
    input_texts = []
    for i, sample in tqdm(enumerate(dataset)):
        input_text = sample["query"]

        if i > num_samples:
            break

        input_ids = tokenizer(input_text, return_tensors="pt", padding=True).input_ids

        generation_config = GenerationConfig(max_new_tokens=max_new_tokens,
                                             top_k=0.0,
                                             top_p=1.0,
                                             do_sample=True)

        response_token_ids = model.generate(input_ids=input_ids,
                                            generation_config=generation_config)

        generated_text = tokenizer.decode(response_token_ids[0], skip_special_tokens=True)

        toxicity_score = toxicity_evaluator.compute(predictions=[(input_text + " " + generated_text)])

        toxicities.extend(toxicity_score["toxicity"])

    # Compute mean & std using np.
    mean = np.mean(toxicities)
    std = np.std(toxicities)

    return mean, std

## <b> <span style='color:#78D118'></span>Baseline</b>

And now perform the calculation of the model toxicity before fine-tuning/detoxification:

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name) #, device_map="auto"

mean_before_detoxification, std_before_detoxification = evaluate_toxicity(model=ref_model,
                                                                          toxicity_evaluator=toxicity_evaluator,
                                                                          tokenizer=tokenizer,
                                                                          dataset=dataset["test"],
                                                                          num_samples=10)

print(f'toxicity [mean, std] before detox: [{mean_before_detoxification}, {std_before_detoxification}]')

11it [02:09, 11.78s/it]

toxicity [mean, std] before detox: [0.04223491532982073, 0.056939489389915776]





## <b>7 <span style='color:#78D118'>|</span>Fine-tune to detoxify summaries</b>

Optimize an RL policy in relation to the reward model using proximal policy optimization (PPO).

## <b>7.1 <span style='color:#78D118'>|</span> Initializing `PPOTrainer`</b>

To initialize `PPOTrainer`, you will need a collator. In this case, it will be a function that transforms dictionaries in a particular way. You can define and test it:

In [None]:
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])

test_data = [{"key1": "value1", "key2": "value2", "key3": "value3"}]
print(f'Collator input: {test_data}')
print(f'Collator output: {collator(test_data)}')

Collator input: [{'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}]
Collator output: {'key1': ['value1'], 'key2': ['value2'], 'key3': ['value3']}


Configure the essential parameters. Load the `ppo_model` and its corresponding tokenizer.

Additionally, load a static version of the model, called `ref_model`.

The purpose of having two models is twofold: the first model, `ppo_model`, is optimized, while the second model, `ref_model`, serves as a reference point for calculating the KL divergence from the initial state.

This acts as an additional reward signal in the PPO training process, ensuring that the optimized model does not deviate too much from the original language model (LLM).

In [None]:
learning_rate=1.41e-5
max_ppo_epochs=1
mini_batch_size=4
batch_size=16

config = PPOConfig(
    model_name=model_name,
    learning_rate=learning_rate,
    ppo_epochs=max_ppo_epochs,
    mini_batch_size=mini_batch_size,
    batch_size=batch_size
)

ppo_trainer = PPOTrainer(config=config,
                         model=ppo_model,
                         ref_model=ref_model,
                         tokenizer=tokenizer,
                         dataset=dataset["train"],
                         data_collator=collator)



## <b>7.2 <span style='color:#78D118'>|</span> Fine-Tune for the Model</b>

The fine-tuning cycle comprises the following key steps:

1. Retrieve query responses from the policy language model (PEFT model).

2. Determine the sentiment associated with the queries and responses using the RoBERTa hate speech model.

3. Optimize the policy using proximal policy optimization (PPO) with the input triplet, which includes the query, the response, and the associated reward.

You can confirm that the operation is running correctly by monitoring the following metrics:

- `objective/kl`: Minimizing Kullback-Leibler (KL) divergence.

- `ppo/returns/mean`: Maximizing average returns.

- `ppo/policy/advantages_mean`: Maximizing average advantages.

These metrics serve as indicators of the progress of the training process and the achievement of specific objectives within the fine-tuning cycle.

In [28]:
output_min_length = 100
output_max_length = 400
output_length_sampler = LengthSampler(output_min_length, output_max_length)

generation_kwargs = {
    "min_length": 5,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True
}

reward_kwargs = {
    "top_k": None, # Return all scores.
    "function_to_apply": "none", # You want the raw logits without softmax.
    "batch_size": 16
}

max_ppo_steps = 10

for step, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    # Break when you reach max_steps.
    if step >= max_ppo_steps:
        break

    prompt_tensors = batch["input_ids"]

    # Get response from FLAN-T5/PEFT LLM.
    summary_tensors = []

    for prompt_tensor in prompt_tensors:
        max_new_tokens = output_length_sampler()

        generation_kwargs["max_new_tokens"] = max_new_tokens
        summary = ppo_trainer.generate(prompt_tensor, **generation_kwargs)

        summary_tensors.append(summary.squeeze()[-max_new_tokens:])

    # This needs to be called "response".
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in summary_tensors]

    # Compute reward outputs.
    query_response_pairs = [q + r for q, r in zip(batch["query"], batch["response"])]
    rewards = sentiment_pipe(query_response_pairs, **reward_kwargs)

    # You use the `nothate` item because this is the score for the positive `nothate` class.
    reward_tensors = [torch.tensor(reward[not_hate_index]["score"]) for reward in rewards]

    # Run PPO step.
    stats = ppo_trainer.step(prompt_tensors, summary_tensors, reward_tensors)
    ppo_trainer.log_stats(stats, batch, reward_tensors)

    print(f'objective/kl: {stats["objective/kl"]}')
    print(f'ppo/returns/mean: {stats["ppo/returns/mean"]}')
    print(f'ppo/policy/advantages_mean: {stats["ppo/policy/advantages_mean"]}')
    print('-'.join('' for x in range(100)))

1it [43:42, 2622.58s/it]

objective/kl: 29.579225540161133
ppo/returns/mean: -0.6670620441436768
ppo/policy/advantages_mean: 0.007738230749964714
---------------------------------------------------------------------------------------------------


2it [1:21:09, 2401.45s/it]

objective/kl: 31.56088638305664
ppo/returns/mean: -0.7442222237586975
ppo/policy/advantages_mean: -0.007027161307632923
---------------------------------------------------------------------------------------------------


3it [1:54:15, 2212.03s/it]

objective/kl: 22.469125747680664
ppo/returns/mean: -0.4711116552352905
ppo/policy/advantages_mean: 0.021737314760684967
---------------------------------------------------------------------------------------------------


4it [2:26:44, 2107.91s/it]

objective/kl: 22.0545654296875
ppo/returns/mean: -0.24437589943408966
ppo/policy/advantages_mean: 0.03475768491625786
---------------------------------------------------------------------------------------------------


5it [2:57:50, 2020.90s/it]

objective/kl: 20.909496307373047
ppo/returns/mean: -0.181763157248497
ppo/policy/advantages_mean: 0.0032940134406089783
---------------------------------------------------------------------------------------------------


6it [3:35:41, 2105.99s/it]

objective/kl: 27.601778030395508
ppo/returns/mean: -0.43456143140792847
ppo/policy/advantages_mean: 0.01504969596862793
---------------------------------------------------------------------------------------------------


7it [4:08:41, 2064.63s/it]

objective/kl: 26.997100830078125
ppo/returns/mean: -0.4891512989997864
ppo/policy/advantages_mean: -0.004291746765375137
---------------------------------------------------------------------------------------------------


8it [4:43:55, 2080.44s/it]

objective/kl: 24.381603240966797
ppo/returns/mean: -0.3845676779747009
ppo/policy/advantages_mean: -0.0035097701475024223
---------------------------------------------------------------------------------------------------


9it [5:18:39, 2081.47s/it]

objective/kl: 21.788793563842773
ppo/returns/mean: -0.32264211773872375
ppo/policy/advantages_mean: 0.046168241649866104
---------------------------------------------------------------------------------------------------


10it [5:52:04, 2112.42s/it]

objective/kl: 19.28367805480957
ppo/returns/mean: -0.009228572249412537
ppo/policy/advantages_mean: 0.04478596895933151
---------------------------------------------------------------------------------------------------





## <b>7.3 <span style='color:#78D118'>|</span>Evaluate the model quantitatively</b>

Retrieve the PPO/PEFT model from the saved disk checkpoint and use the test dataset split to evaluate the toxicity score of the RL-adjusted model.

In [29]:
device = 'cpu'
ppo_model = ppo_model.to(device)
ref_model = ref_model.to(device)
#toxicity_evaluator = toxicity_evaluator.to(device)

In [30]:
mean_after_detoxification, std_after_detoxification = evaluate_toxicity(model=ppo_model,
                                                                        toxicity_evaluator=toxicity_evaluator,
                                                                        tokenizer=tokenizer,
                                                                        dataset=dataset["test"],
                                                                        num_samples=10)
print(f'toxicity [mean, std] after detox: [{mean_after_detoxification}, {std_after_detoxification}]')

11it [01:42,  9.33s/it]

toxicity [mean, std] after detox: [0.03179597118022767, 0.037803028742650374]





Compare the toxicity scores of the reference model (before detoxification) and the adjusted model (after detoxification).

In [31]:
mean_improvement = (mean_before_detoxification - mean_after_detoxification) / mean_before_detoxification
std_improvement = (std_before_detoxification - std_after_detoxification) / std_before_detoxification

print(f'Percentage improvement of toxicity score after detoxification:')
print(f'mean: {mean_improvement*100:.2f}%')
print(f'std: {std_improvement*100:.2f}%')

Percentage improvement of toxicity score after detoxification:
mean: 24.72%
std: 33.61%


## <b>7.4 <span style='color:#78D118'>|</span>Evaluate the model qualitatively</b>

Explore sample examples from the test dataset, allowing a comparison between the initial `ref_model` and the refined/detoxified `ppo_model` using the toxicity evaluator.

In [32]:
batch_size = 20
compare_results = {}

df_batch = dataset["test"][0:batch_size]

compare_results["query"] = df_batch["query"]
prompt_tensors = df_batch["input_ids"]

summary_tensors_ref = []
summary_tensors = []

# Get response from ppo and base model.
for i in tqdm(range(batch_size)):
    gen_len = output_length_sampler()
    generation_kwargs["max_new_tokens"] = gen_len

    summary = ref_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0),
        **generation_kwargs
    ).squeeze()[-gen_len:]
    summary_tensors_ref.append(summary)

    summary = ppo_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0),
        **generation_kwargs
    ).squeeze()[-gen_len:]
    summary_tensors.append(summary)

# Decode responses.
compare_results["response_before"] = [tokenizer.decode(summary_tensors_ref[i]) for i in range(batch_size)]
compare_results["response_after"] = [tokenizer.decode(summary_tensors[i]) for i in range(batch_size)]

# Sentiment analysis of query/response pairs before/after.
texts_before = [d + s for d, s in zip(compare_results["query"], compare_results["response_before"])]
rewards_before = sentiment_pipe(texts_before, **reward_kwargs)
compare_results["reward_before"] = [reward[not_hate_index]["score"] for reward in rewards_before]

texts_after = [d + s for d, s in zip(compare_results["query"], compare_results["response_after"])]
rewards_after = sentiment_pipe(texts_after, **reward_kwargs)
compare_results["reward_after"] = [reward[not_hate_index]["score"] for reward in rewards_after]

100%|██████████| 20/20 [05:45<00:00, 17.27s/it]


Results in DataFrame

In [33]:
pd.set_option('display.max_colwidth', 500)
df_compare_results = pd.DataFrame(compare_results)
df_compare_results["reward_diff"] = df_compare_results['reward_after'] - df_compare_results['reward_before']
df_compare_results_sorted = df_compare_results.sort_values(by=['reward_diff'], ascending=False).reset_index(drop=True)
df_compare_results_sorted

Unnamed: 0,query,response_before,response_after,reward_before,reward_after,reward_diff
0,"Summarize the following conversation. #Person1#: Hello? #Person2#: Hello? #Person1#: Can I speak to Li Hong, please? #Person2#: Speaking. #Person1#: Hi, Li Hong. This is Alice. #Person2#: Hi, Alice. How are you? #Person1#: Not bad. Li Hong, I am sorry that I can't go to see Mrs. Brown with you tomorrow morning. My mother is ill. I must take care of her. #Person2#: I'm sorry to hear that. You'd better stay at home. After all, we can visit Mrs. Brown later #Person1#: OK. Bye - bye. #Person2#: ...",<pad> Alice asks Li Hong to seat her at home and means an invitation for tomorrow. Li Hong doesn't allow her. Olivia will go home later.</s>,<pad> Alice couldn't see Mrs. Brown during programs. Li would rather stay at home or visit Mrs. Brown. #Person2# offers to visit Regina later.</s>,1.208793,2.156693,0.9479
1,"Summarize the following conversation. #Person1#: It smells like an ashtray in here! #Person2#: Hi honey! What's wrong? Why do you have that look on your face? #Person1#: What's wrong? I thought we agreed that you were gonna quit smoking. #Person2#: No! I said I was going to cut down which is very different. You can't just expect me to go cold turkey overnight! #Person1#: Look, there are other ways to quit. You can try the nicotine patch, or nicotine chewing gum. We spend a fortune on cigaret...",<pad> #Person1# is looking for honey so they talk about the smoking situation. She tells #Person1# there are many ways to quit cigarettes but she doesn't have the willpower to quit. She wants to divorce #Person1#.</s>,<pad> #Person1# and #Person2# are fighting over smoking something after discussing why #Person2# keeps smoking.</s>,1.006258,1.395997,0.389739
2,"Summarize the following conversation. #Person1#: What can I do for you, madam? #Person2#: I'd like to buy a toy car for my son. #Person1#: How about this one? #Person2#: It looks nice. How much is it? #Person1#: They're three hundred dollars. #Person2#: Oh, I'm afraid it's too expensive. Can you show me something cheaper? #Person1#: OK, This one is one hundred and twenty. It's the cheapest here. #Person2#: OK, I'll take it. Here's the money. #Person1#: Thank you very much. Summary: </s>",<pad> #Person2# wanted to buy a car for her son. #Person1# offers a one-a-two hundred and twenty. #Person2# wins.</s>,"<pad> #Person2# wants one customized toy car, but #Person1# suggests one hundred and twenty. With thanks, #Person2# decides to buy one.</s>",1.183649,1.444947,0.261298
3,"Summarize the following conversation. #Person1#: How much are you asking for this? #Person2#: I'm offering them to you at 150 yuan a piece. Is that all right? #Person1#: Is tax already included in their price? #Person2#: Yes. Our price can't be matched. #Person1#: Would you consider a volume discount? #Person2#: If you buy 1, 000 or more, you'll get a 10 % discount. #Person1#: I'll accept your offer. Summary: </s>",<pad> #Person2# offers 150 yuan cigars to #Person1# at 150 yuan a piece. #Person2# also offers $1000 for cheaper shipping if #Person1# buys more.</s>,<pad> #Person1# asks for a pricing match on the two cards with charges and a bulk discount to sell the card 150 yuan. Multiple discounts will be honored.</s>,2.455658,2.6299,0.174242
4,"Summarize the following conversation. #Person1#: Amanda, how do you like this peaked cap? #Person2#: Didn't you say you want to buy a top hat? #Person1#: But I think this one fits me Well. Why don't you try on the sombrero in black? #Person2#: I don't like caps at all. Summary: </s>",<pad> Amanda is buying a peaked cap and she prefers the wide hat. #Person2# doesn't like caps at all.</s>,<pad> Amanda wants to look for a top hat but she hates caps. She decides to buy an ombrero because she dislikes caps.</s>,1.325797,1.331855,0.006057
5,"Summarize the following conversation. #Person1#: I would like to order some internet today. #Person2#: What kind would you like? #Person1#: What kind of internet is there? #Person2#: You can get DEL or dial-up. #Person1#: Which of those two is best? #Person2#: I would recommend DEL. #Person1#: So that one better? #Person2#: It's better because it doesn't tie up the phone. #Person1#: What do you mean by that? #Person2#: DEL isn't connected through your phone line, but dial-up is. #Person1#: S...",<pad> #Person1# wants to buy some cable Internet from #Person2#. #Person2# recommends Dial-Up because Dial-up is not connected to the phone line.</s>,<pad> #Person1# wants to order internet to use the wen it and others discount DEL because it is direct. #Person2# recommends DEL to #Person1# and it is better because it doesn't connect the phone.</s>,2.398747,2.334501,-0.064246
6,"Summarize the following conversation. #Person1#: So how did you like the restaurant? #Person2#: Actually, it could have been better. #Person1#: What didn't you like about it? #Person2#: It is a new restaurant. I don't think they have their act together yet. #Person1#: What did you think about the food? #Person2#: I felt that the food was pretty mediocre. #Person1#: The service wasn't that great, either. #Person2#: I agree. The service was not good. #Person1#: Do you think that you want to tr...",<pad> #Person2# can't say Tasanik is a new restaurant because it's a new one. #Person1# believes #Person2# will not recommend it. #Person2# says they haven't had enough of Tasanik.</s>,"<pad> #Person2# hopes the restaurant can finally be improved, but it's not as good as #2# looks. Not much will be later. #2# thinks the restaurant is a waste.</s>",2.266187,2.165323,-0.100863
7,"Summarize the following conversation. #Person1#: Let's take a coffee break, shall we? #Person2#: I wish I could, but I can't. #Person1#: What keeps you so busy? You've been sitting there for hours. You've got to walk around. You just can't stay on the computer forever. #Person2#: Well, I am up to my neck in work. I've got to finish this report. Sarah needs it by noon. I don't want to be scolded if I can't finish my work by the deadline. #Person1#: I understand that, but you'd feel better if ...",<pad> #Person2# cannot take a coffee break because it takes hours to finish the report that #Person2#'s doing and #Person2# can't work.</s>,<pad> #Person2# can't take a coffee break for @Person1#'s busy schedule. #Person1# thinks they hate eachother. #Person1# doesn't like current things. We join up together.</s>,1.718067,1.597449,-0.120618
8,"Summarize the following conversation. #Person1#: Could you help me figure out how to look for a job? #Person2#: We have lots of options, what type of job do you need? #Person1#: I want to work in an office. #Person2#: Do you want to work part-time or full-time? #Person1#: I want to work full-time. #Person2#: We have binders with local job listings or you can make use of the computers. OK? #Person1#: I am confused a bit but I am sure that I can figure it out. #Person2#: If you make an appoint...",<pad> #Person2# helps #Person1# to find a full-time full-time job using the job center. #Person1# stresses that #Person1# needs an appointment with a job counselor because #Person1# doesn't want to do that.</s>,<pad> #Person1# wants to work in an office and it is more friendly to go to the work center on so #Person2# will help #Person1#.</s>,2.16611,2.036709,-0.129401
9,"Summarize the following conversation. #Person1#: Judy, what is everybody talking about? #Person2#: Haven't you heard? Richard was fired by our manager. #Person1#: You're kidding. It can't be true. #Person2#: Believe it or not. Everybody is talking about it in the company. #Person1#: Really? I'm surprised. #Person2#: Me too. Summary: </s>",<pad> Judy and #Person1# share how Richard was fired and they want to cure it. Judy and #Person1# are surprised by the fact that everyone is talking about things in the company.</s>,"<pad> Judy gives particular attention to looking after Richard has been fired and explains that last guy was Walis was fired by his supervisor. If you are fortunate and realize that someone is fired by their manager, there is a lot of scandal and we are surprised and surprised.</s>",1.747789,1.603635,-0.144154


## <b>8.0 <span style='color:#78D118'>|</span>References</b>

* https://huggingface.co/datasets/knkarthick/dialogsum/viewer?row=0

* https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target

* https://huggingface.co/docs/transformers/model_doc/roberta

* https://aws.amazon.com/es/blogs/machine-learning/fine-tune-large-language-models-with-reinforcement-learning-from-human-or-ai-feedback/