# The Reward Model in Reinforcement Learning from Human Feedback (RLHF)

The reward model plays a pivotal role in RLHF: it turns human preference judgments into a numeric signal that can be used to score and steer a model's outputs.

---

## Definition:

A **reward model** is a model trained on human preference data to score the outputs of another model: given a prompt and a response, it returns a number indicating how well the response matches what human evaluators prefer.

---

## Workflow:

1. **Input and Output**:
   - The reward model processes a *prompt and response pair*.
   - It subsequently outputs a corresponding reward or score.
   
2. **Training Methodology**:
   - Human evaluators review multiple outputs generated by the model for the same prompt and rate or rank them by quality.
   - The reward model is trained on these judgments so that it learns to assign higher scores to the responses humans prefer.
   
3. **Integration with the Main Model**:
   - The primary (or main) model is then optimized with a reinforcement learning algorithm to produce responses that the reward model scores highly, improving its behavior on future prompts.
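
The training step in the workflow above is often implemented with a pairwise logistic (Bradley-Terry style) objective: the loss is small when the preferred response scores higher than the rejected one. The sketch below is a minimal, dependency-free illustration that assumes the reward model has already produced one scalar per response; the function name is hypothetical and this is not necessarily the exact loss used by any particular library.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Return -log(sigmoid(reward_chosen - reward_rejected)) for one pair."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so that exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Equal rewards give ln(2), the loss of a model that cannot tell the pair apart.
loss_at_chance = pairwise_preference_loss(0.0, 0.0)
# Ranking the chosen response higher gives a lower loss than the reverse.
loss_good = pairwise_preference_loss(2.0, 0.0)
loss_bad = pairwise_preference_loss(0.0, 2.0)
```

A loss close to `ln(2) ≈ 0.693` therefore means the model is near chance on a pair, which is a useful reference point when reading a training log.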

---

## Characteristics:

- **Architecture**: The reward model can manifest as an end-to-end language model or even a modular system.
  
- **Functionality**: Its essential function is to convert input text sequences into a scalar value. This scalar, known as the reward, serves as a crucial bridge for integrating existing reinforcement learning algorithms, ensuring a smooth RLHF workflow.

---



![Alt text](image-2.png)

Reward Model based on BERT-BASE-UNCASED

Dataset: CarperAI/openai_summarize_comparisons 

#### Install the dependencies

In [None]:
!pip install torch
!pip install numpy
!pip install transformers
!pip install trl
!pip install peft
!pip install datasets

### Importing the required libs

In [1]:
import torch 
import random 
import numpy as np 

from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments
from trl import RewardTrainer #, RewardConfig
from peft import PeftModel, PeftConfig, LoraConfig, TaskType

from datasets import load_dataset


  warn("The installed version of bitsandbytes was compiled without GPU support. "


'NoneType' object has no attribute 'cadam32bit_grad_fp32'


In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.cuda.empty_cache()

### Why Choose BERT as a Reward Model?

BERT, or Bidirectional Encoder Representations from Transformers, is an efficient model for a wide range of NLP tasks. But what makes it especially suited for a reward model? Here are the key reasons:


#### 1. **Encoder-Only Architecture**:

BERT is fundamentally an encoder-only transformer. Unlike models with decoder components that are tailored for generation tasks, BERT's encoder-only design is optimized for understanding and representing input text. This makes it apt for absorbing the nuances of a text and subsequently producing a scalar reward.


#### 2. **Lightweight yet Comprehensive**:

Though BERT comes in various sizes, even its base version offers a balance between complexity and efficiency. It's designed to be lightweight enough for practical applications but retains the depth required to understand intricate linguistic constructs.


#### 3. **Deep Contextual Understanding**:

One of BERT's standout features is its bidirectional context: self-attention lets every token attend to the tokens on both its left and its right, rather than reading text in a single direction. This matters for a reward model, which has to judge a response as a whole and pick up on subtle details in the input text.


#### 4. **Output Suitability**:

Given its encoder architecture, BERT's outputs are high-dimensional contextual representations of the input text. These embeddings are readily adaptable to a range of tasks, including producing scalar rewards. By applying a simple linear layer on top of the pooled `[CLS]` representation, we can produce reward values that reflect the quality or desirability of the input.
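
To make the "linear layer on top of the embeddings" concrete, here is a minimal sketch in plain Python. It assumes the encoder has already produced a fixed-size pooled vector; `reward_head`, the weights, and the three-dimensional embedding are illustrative stand-ins (BERT-base uses 768 dimensions), not values from the trained model.

```python
def reward_head(pooled_embedding, weights, bias):
    """Map a pooled embedding to one scalar with a linear layer: w . x + b."""
    if len(pooled_embedding) != len(weights):
        raise ValueError("embedding and weight vector must have the same length")
    return sum(x * w for x, w in zip(pooled_embedding, weights)) + bias


# Toy 3-dimensional example (hypothetical numbers)
embedding = [1.0, 2.0, -1.0]
weights = [0.5, 0.25, 1.0]
reward = reward_head(embedding, weights, bias=0.1)
```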


### Fine-tuning BERT-BASE-UNCASED as a Reward Model

In this section, we'll walk through the process of initializing and fine-tuning the `BERT-BASE-UNCASED` model for use as a reward model in RLHF.

---

#### Loading the Model:

In [3]:
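# Load the model with a sequence-classification head.
# num_labels is left at its default of 2; the scoring code later in this notebook reads logit index 1.
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')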

# Load a tokenizer (change the model name as per your requirements)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Processing the Dataset for Model Training

- **Loading the Dataset**: The dataset `openai_summarize_comparisons` is loaded from the CarperAI organization on the Hugging Face Hub.

- **Tokenization and Length Calculation**: The 'chosen' column of the `train` split is tokenized, with the special tokens that BERT-like architectures expect added to each input. For every entry, we count the total number of tokens and store it in the 'lengths' column.

- **Determining Maximum Token Length**: This step identifies the entry with the most tokens in our tokenized dataset, which can aid in setting sequence lengths during training or ensuring consistent padding.

- **Shuffling and Random Sampling**: All entries in the 'valid2' subset of our dataset are shuffled randomly. We then derive a subset of this shuffled dataset, based on a predetermined sample size of `n_samples`.

- **Size Verification of Processed Dataset**: A final verification is done to confirm the number of entries in our dataset post all the preprocessing.
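
The shuffle-and-sample step described above depends on the global random state, so each run selects a different subset. The sketch below shows one way to make the selection reproducible with a seeded local generator; `sample_indices` and the seed value are hypothetical and not part of the original notebook.

```python
import random


def sample_indices(total_samples, n_samples, seed=42):
    """Return n_samples distinct indices drawn reproducibly from range(total_samples)."""
    if n_samples > total_samples:
        raise ValueError("n_samples cannot exceed total_samples")
    rng = random.Random(seed)  # local generator; the global random state is untouched
    indices = list(range(total_samples))
    rng.shuffle(indices)
    return indices[:n_samples]


# Small illustrative call; in the notebook the arguments would be
# len(dataset["valid2"]) and n_samples.
selected = sample_indices(total_samples=100, n_samples=10)
```

The returned list can be passed to `Dataset.select`, as the notebook does with `selected_indices`.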


In [48]:
# Load the dataset
dataset = load_dataset("CarperAI/openai_summarize_comparisons")



# Tokenize the dataset and get the lengths
tokenized_lengths = dataset["train"].map(lambda examples: {'lengths': len(tokenizer(examples['chosen'], add_special_tokens=True)["input_ids"])}, remove_columns=dataset["train"].column_names)
# Fetch max length
max_length = max(tokenized_lengths["lengths"])
print("Max token count:", max_length)



# Shuffle the indices
total_samples = len(dataset["valid2"])
all_indices = list(range(total_samples))
random.shuffle(all_indices)


# Select 'n' random indices
n_samples = 40000  # With 40000 samples, the 80% training split is 32k
selected_indices = all_indices[:n_samples]

# Get the 'n' random samples
dataset = dataset["valid2"].select(selected_indices)

len(dataset)


Found cached dataset parquet (C:/Users/juan_/.cache/huggingface/datasets/CarperAI___parquet/CarperAI--openai_summarize_comparisons-79d2c222a15dc8fb/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)


  0%|          | 0/4 [00:00<?, ?it/s]

Map:   0%|          | 0/92534 [00:00<?, ? examples/s]

Max token count: 172


40000

#### Splitting the Dataset into Train, Validation, and Test Sets

- **Setting Proportions**: We've designated `80%` of the dataset for training (`train_percent`) and `10%` for validation (`val_percent`); the remaining `10%` is reserved for testing. These percentages are taken over the `n_samples` sampled entries.

- **Computing Dataset Sizes**: We compute the absolute number of samples for the training and validation subsets based on their respective percentages (`train_percent` and `val_percent`) and the total sample count `n_samples`.

- **Splitting the Dataset**:
  - The **Training Set** consists of the initial `train_size` samples.
  - The **Validation Set** starts immediately after the training set and encompasses `val_size` samples.
  - The **Test Set** comprises the samples that follow the validation dataset, making up the remainder of `n_samples`.
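
As a quick check of the arithmetic above, the following sketch computes the three index ranges for a given sample count. `split_ranges` is an illustrative helper, not a function from the notebook; it mirrors the `int(percent * n_samples)` logic of the cell that follows.

```python
def split_ranges(n_samples, train_percent=0.8, val_percent=0.1):
    """Return (train, val, test) index ranges; the test split takes the remainder."""
    train_size = int(train_percent * n_samples)
    val_size = int(val_percent * n_samples)
    train = range(0, train_size)
    val = range(train_size, train_size + val_size)
    test = range(train_size + val_size, n_samples)
    return train, val, test


train_idx, val_idx, test_idx = split_ranges(40000)
sizes = (len(train_idx), len(val_idx), len(test_idx))
```

With `n_samples = 40000` this gives 32000 training, 4000 validation, and 4000 test samples, and the three ranges do not overlap.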


In [49]:
# Split the dataset into train, val, and test
train_percent = 0.8
val_percent = 0.1
# test_percent is implicitly 0.1 since train + val + test = 1.0

train_size = int(train_percent * n_samples)
val_size = int(val_percent * n_samples)
# Remaining samples are for testing

train_dataset = dataset.select(list(range(train_size)))
val_dataset = dataset.select(list(range(train_size, train_size + val_size)))
test_dataset = dataset.select(list(range(train_size + val_size, n_samples)))

The TRL `RewardTrainer` expects a very specific dataset format built from two features: `chosen` and `rejected`. The dataset we are using also includes a `prompt` column. Let's drop it:

In [50]:
# Remove 'prompt' column from each dataset
train_dataset = train_dataset.remove_columns(['prompt'])
val_dataset = val_dataset.remove_columns(['prompt'])

### Tokenization and Dataset Cleaning

- **Function Definition (`process_features`)**:
  - **Tokenization of 'chosen' Feature**: The 'chosen' field of the dataset is tokenized to a maximum length of `512` tokens. It employs padding to achieve a consistent length across all samples. The tokenized `input_ids` and `attention_mask` are added to the dataset under the keys `input_ids_chosen` and `attention_mask_chosen` respectively.
  
  - **Tokenization of 'rejected' Feature**: Similarly, the 'rejected' field undergoes tokenization, and the results are stored under `input_ids_rejected` and `attention_mask_rejected`.

- **Application of Tokenization Function**:
  - The `process_features` function is batch-applied to both the `train_dataset` and `val_dataset` using the `map` method. This enriches the datasets with tokenized versions of the 'chosen' and 'rejected' fields.

- **Dataset Cleaning**:
  - After tokenization, the original text columns (`chosen` and `rejected`) are no longer needed, so they are removed from both `train_dataset` and `val_dataset` to save memory and keep the datasets tidy.

Upon completion, both datasets are streamlined with tokenized inputs ready for modeling tasks.


Each final dataset object should contain four entries:

* input_ids_chosen
* attention_mask_chosen
* input_ids_rejected
* attention_mask_rejected
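
To show what `input_ids` and `attention_mask` look like after fixed-length padding, here is a toy sketch in plain Python. It is a simplification of what the tokenizer does: real truncation also keeps the closing special token, which this helper does not attempt. `pad_or_truncate` and the token ids are illustrative; `pad_id=0` is assumed to be the padding id.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut or right-pad token_ids to max_length and build the matching attention mask."""
    ids = list(token_ids[:max_length])
    attention_mask = [1] * len(ids)  # 1 marks real tokens
    padding = max_length - len(ids)
    # 0 in the mask tells the model to ignore the padded positions
    return ids + [pad_id] * padding, attention_mask + [0] * padding


short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4], max_length=2)
```

Every row ends up with the same length, which is what allows the examples to be stacked into a batch.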

In [51]:
def process_features(batch):
    # Tokenize 'chosen' feature
    chosen_tokens = tokenizer(batch['chosen'], padding='max_length', truncation=True, max_length=512, return_tensors='np')
    batch['input_ids_chosen'] = chosen_tokens['input_ids']
    batch['attention_mask_chosen'] = chosen_tokens['attention_mask']
    
    # Tokenize 'rejected' feature
    rejected_tokens = tokenizer(batch['rejected'], padding='max_length', truncation=True, max_length=512, return_tensors='np')
    batch['input_ids_rejected'] = rejected_tokens['input_ids']
    batch['attention_mask_rejected'] = rejected_tokens['attention_mask']
    
    return batch

# Apply the function to your datasets
train_dataset = train_dataset.map(process_features, batched=True)
val_dataset = val_dataset.map(process_features, batched=True)

# Remove original 'chosen' and 'rejected' columns
columns_to_remove = ['chosen', 'rejected']
train_dataset = train_dataset.remove_columns(columns_to_remove)
val_dataset = val_dataset.remove_columns(columns_to_remove)


Map:   0%|          | 0/32000 [00:00<?, ? examples/s]

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

### Prepare the training objects

#### Documentation: Metric Loading and Calculation

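- **Loading the Accuracy Metric**: Here we load the "accuracy" metric through `load_metric` from the `datasets` library to assess model performance. Newer releases of `datasets` have removed `load_metric` in favor of the separate `evaluate` library, so this cell requires an older `datasets` version.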

- **Function Definition (`compute_metrics`)**:
  - **Inputs**: The function receives `eval_pred`, a tuple consisting of model predictions (`logits`) and ground truth labels (`labels`).
  - **Predictions Extraction**: Given the multi-dimensional `logits` array, we employ `np.argmax` on the last axis to retrieve the index (or class) with the highest predicted score for each sample.
  - **Accuracy Computation**: Leveraging the loaded `metric`, we compute the accuracy by comparing the extracted predictions against the true labels (`references`).
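
For a preference dataset, accuracy is naturally read as "how often does the chosen response score higher than the rejected one". The sketch below illustrates the argmax step under the assumption that each prediction row is laid out as `(chosen_score, rejected_score)` and that the correct label is therefore always index `0`; that layout is an assumption about how the trainer reports predictions, and the helper names are hypothetical.

```python
def argmax(values):
    """Index of the largest element (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def pairwise_accuracy(score_pairs):
    """Fraction of (chosen, rejected) pairs where the chosen score is the largest."""
    if not score_pairs:
        raise ValueError("score_pairs must not be empty")
    labels = [0] * len(score_pairs)  # assumption: index 0 ("chosen") is always correct
    predictions = [argmax(pair) for pair in score_pairs]
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(score_pairs)


# Illustrative scores only
pairs = [(0.9, 0.1), (0.2, 0.8), (0.6, 0.4), (0.7, 0.3)]
accuracy = pairwise_accuracy(pairs)
```

Under this reading, an accuracy of 0.5 corresponds to random guessing between the two summaries.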


In [52]:
from datasets import load_metric

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

#### Documentation: Configuration for LoRA (Low-Rank Adaptation)

- **Task Type (`task_type`)**: Specifies the nature of the task. Here, it's set to `SEQ_CLS`, indicating a sequence classification task, which is typical for problems where an entire sequence of tokens needs to be classified into a category.

- **Inference Mode (`inference_mode`)**: Determines if the configuration is set for inference. It's set to `False` here, so the adapter weights remain trainable.

- **Reduction Rank (`r`)**: Defines the rank of the low-rank update matrices. A value of `8` is a relatively low rank: it keeps the number of trainable parameters small while still capturing the most important adjustments.

- **LoRA Alpha (`lora_alpha`)**: A scaling factor. The low-rank update is multiplied by `lora_alpha / r`, so the value `32` with `r=8` scales the update by `4`.

- **LoRA Dropout (`lora_dropout`)**: Specifies the dropout rate for LoRA layers. A value of `0.2` means that 20% of the inputs to those layers are randomly set to zero during training, which acts as regularization and can help prevent overfitting.
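
To give a feel for what `r` and `lora_alpha` mean in practice, the sketch below counts trainable parameters for a single adapted weight matrix and computes the update scaling. It assumes a square 768 x 768 projection (the hidden size of BERT-base) and the standard LoRA formulation, where the update is `(lora_alpha / r) * B @ A`; `lora_stats` is an illustrative helper, not part of `peft`.

```python
def lora_stats(d_in, d_out, r, lora_alpha):
    """Compare a full weight matrix with its rank-r LoRA update."""
    full_params = d_in * d_out      # parameters in the frozen weight W
    lora_params = r * (d_in + d_out)  # A is (r x d_in), B is (d_out x r)
    return {
        "full_params": full_params,
        "lora_params": lora_params,
        "fraction": lora_params / full_params,
        "scaling": lora_alpha / r,  # multiplier applied to B @ A
    }


stats = lora_stats(d_in=768, d_out=768, r=8, lora_alpha=32)
```

For this configuration each adapted matrix adds 12,288 trainable parameters next to 589,824 frozen ones, roughly 2% of the original size.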


In [55]:
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.2,
)

#### Documentation: Training Arguments Configuration

- **Output Directory (`output_dir`)**: Specifies the path where model checkpoints and other training outputs will be saved. In this instance, they will be stored in the directory `./rlhf_reward_model`.

- **Training Indication (`do_train`)**: A flag indicating whether to train the model. When set to `True`, the model will undergo training using the provided data.

- **Evaluation Indication (`do_eval`)**: A flag that signals if the model should be evaluated. With `True`, the model will be evaluated on the evaluation dataset after training.

- **Evaluation Strategy (`evaluation_strategy`)**: Determines the frequency of evaluation during training. The value `"epoch"` means that the model will be evaluated at the end of each epoch.

- **Saving Strategy (`save_strategy`)**: Dictates when the model checkpoints are saved. Similarly to the evaluation strategy, setting it to `"epoch"` ensures model checkpoints are saved after every epoch.

- **Number of Training Epochs (`num_train_epochs`)**: Specifies the total number of times the training dataset will be passed through. In this configuration, the training data will be used to train the model for a total of `8` epochs.
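
The arguments above leave batch size and learning rate unset. The sketch below works out the resulting step count and learning-rate schedule, assuming a per-device batch size of 8, an initial learning rate of `5e-5`, linear decay with no warmup, a single device, and no gradient accumulation. These are assumptions, chosen because they are consistent with the training log further down (32000 total steps, first logged learning rate `4.921875e-05`); both helper functions are illustrative.

```python
import math


def total_training_steps(n_train, batch_size, num_epochs):
    """Number of optimizer steps for a full run."""
    return math.ceil(n_train / batch_size) * num_epochs


def linear_decay_lr(step, total_steps, initial_lr=5e-5):
    """Learning rate after `step` updates under linear decay to zero."""
    return initial_lr * max(0.0, (total_steps - step) / total_steps)


steps = total_training_steps(n_train=32000, batch_size=8, num_epochs=8)
lr_at_500 = linear_decay_lr(500, steps)
lr_at_end = linear_decay_lr(steps, steps)
```

Step 500 of 4000 steps per epoch corresponds to epoch 0.125, which lines up with the first logged entry.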



In [56]:
output_dir = './rlhf_reward_model' 

training_args = TrainingArguments(
    output_dir=output_dir,
    do_train=True,
    do_eval=True,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=8,
)

#### Documentation: RewardTrainer Instantiation

**RewardTrainer**: A `Trainer` subclass from the TRL library for training reward models on pairs of chosen and rejected responses.

- **Model (`model`)**: Specifies the underlying model to be trained. Typically, this will be a pre-trained model like BERT which will be fine-tuned with the given datasets.

- **Training Arguments (`args`)**: Contains essential parameters for the training process. This includes the output directory, training/evaluation flags, saving strategy, number of epochs, etc.

- **Tokenizer (`tokenizer`)**: Specifies the tokenizer to be used for encoding the input text. This tokenizer will convert text into format suitable for model input.

- **Training Dataset (`train_dataset`)**: The primary dataset used to train the model. 

- **Evaluation Dataset (`eval_dataset`)**: The dataset on which the model will be evaluated to check its performance after training.

- **LoRA Configuration (`peft_config`)**: The configuration for LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning technique that freezes the pre-trained weights and trains small low-rank update matrices instead.

- **Metrics Function (`compute_metrics`)**: Function that will compute metrics like accuracy, based on the model's outputs and the true labels. This is used to gauge the performance of the model.

- **Maximum Sequence Length (`max_length`)**: Sets the maximum number of tokens per input sequence for the trainer, `256` in this instance. Note that the datasets were already tokenized and padded to `512` tokens in `process_features`, so the two values are inconsistent.


In [57]:
trainer = RewardTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    peft_config=peft_config,
    compute_metrics=compute_metrics,
    max_length=256,
)



### Start the training loop

In [58]:
trainer.train()



  0%|          | 0/32000 [00:00<?, ?it/s]

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Could not estimate the number of tokens of the input, floating-point operations will not be computed


{'loss': 0.6882, 'learning_rate': 4.921875e-05, 'epoch': 0.12}
{'loss': 0.6562, 'learning_rate': 4.8437500000000005e-05, 'epoch': 0.25}
{'loss': 0.6447, 'learning_rate': 4.765625e-05, 'epoch': 0.38}
{'loss': 0.6406, 'learning_rate': 4.6875e-05, 'epoch': 0.5}
{'loss': 0.6336, 'learning_rate': 4.609375e-05, 'epoch': 0.62}
{'loss': 0.6388, 'learning_rate': 4.5312500000000004e-05, 'epoch': 0.75}
{'loss': 0.6351, 'learning_rate': 4.453125e-05, 'epoch': 0.88}
{'loss': 0.6185, 'learning_rate': 4.375e-05, 'epoch': 1.0}


  0%|          | 0/500 [00:00<?, ?it/s]

{'eval_loss': 0.6250253915786743, 'eval_accuracy': 0.64075, 'eval_runtime': 143.3438, 'eval_samples_per_second': 27.905, 'eval_steps_per_second': 3.488, 'epoch': 1.0}




{'loss': 0.6172, 'learning_rate': 4.2968750000000004e-05, 'epoch': 1.12}
{'loss': 0.6278, 'learning_rate': 4.21875e-05, 'epoch': 1.25}
{'loss': 0.6267, 'learning_rate': 4.140625e-05, 'epoch': 1.38}
{'loss': 0.6274, 'learning_rate': 4.0625000000000005e-05, 'epoch': 1.5}
{'loss': 0.6248, 'learning_rate': 3.984375e-05, 'epoch': 1.62}
{'loss': 0.6078, 'learning_rate': 3.90625e-05, 'epoch': 1.75}
{'loss': 0.6249, 'learning_rate': 3.828125e-05, 'epoch': 1.88}
{'loss': 0.6157, 'learning_rate': 3.7500000000000003e-05, 'epoch': 2.0}


  0%|          | 0/500 [00:00<?, ?it/s]

{'eval_loss': 0.6143164038658142, 'eval_accuracy': 0.6605, 'eval_runtime': 143.6905, 'eval_samples_per_second': 27.838, 'eval_steps_per_second': 3.48, 'epoch': 2.0}




{'loss': 0.6155, 'learning_rate': 3.671875e-05, 'epoch': 2.12}
{'loss': 0.619, 'learning_rate': 3.59375e-05, 'epoch': 2.25}
{'loss': 0.6107, 'learning_rate': 3.5156250000000004e-05, 'epoch': 2.38}
{'loss': 0.6087, 'learning_rate': 3.4375e-05, 'epoch': 2.5}
{'loss': 0.6066, 'learning_rate': 3.359375e-05, 'epoch': 2.62}
{'loss': 0.6075, 'learning_rate': 3.2812500000000005e-05, 'epoch': 2.75}
{'loss': 0.5972, 'learning_rate': 3.203125e-05, 'epoch': 2.88}
{'loss': 0.6139, 'learning_rate': 3.125e-05, 'epoch': 3.0}


  0%|          | 0/500 [00:00<?, ?it/s]

{'eval_loss': 0.6110466122627258, 'eval_accuracy': 0.66475, 'eval_runtime': 142.2232, 'eval_samples_per_second': 28.125, 'eval_steps_per_second': 3.516, 'epoch': 3.0}




{'loss': 0.6055, 'learning_rate': 3.0468750000000002e-05, 'epoch': 3.12}
{'loss': 0.6186, 'learning_rate': 2.96875e-05, 'epoch': 3.25}
{'loss': 0.6023, 'learning_rate': 2.890625e-05, 'epoch': 3.38}
{'loss': 0.5963, 'learning_rate': 2.8125000000000003e-05, 'epoch': 3.5}
{'loss': 0.6015, 'learning_rate': 2.734375e-05, 'epoch': 3.62}
{'loss': 0.5982, 'learning_rate': 2.6562500000000002e-05, 'epoch': 3.75}
{'loss': 0.5977, 'learning_rate': 2.578125e-05, 'epoch': 3.88}
{'loss': 0.594, 'learning_rate': 2.5e-05, 'epoch': 4.0}


  0%|          | 0/500 [00:00<?, ?it/s]

{'eval_loss': 0.612758994102478, 'eval_accuracy': 0.6675, 'eval_runtime': 139.6517, 'eval_samples_per_second': 28.643, 'eval_steps_per_second': 3.58, 'epoch': 4.0}




{'loss': 0.5977, 'learning_rate': 2.4218750000000003e-05, 'epoch': 4.12}
{'loss': 0.597, 'learning_rate': 2.34375e-05, 'epoch': 4.25}
{'loss': 0.5918, 'learning_rate': 2.2656250000000002e-05, 'epoch': 4.38}
{'loss': 0.5948, 'learning_rate': 2.1875e-05, 'epoch': 4.5}
{'loss': 0.5903, 'learning_rate': 2.109375e-05, 'epoch': 4.62}
{'loss': 0.5936, 'learning_rate': 2.0312500000000002e-05, 'epoch': 4.75}
{'loss': 0.5831, 'learning_rate': 1.953125e-05, 'epoch': 4.88}
{'loss': 0.5947, 'learning_rate': 1.8750000000000002e-05, 'epoch': 5.0}


  0%|          | 0/500 [00:00<?, ?it/s]

{'eval_loss': 0.6098501086235046, 'eval_accuracy': 0.66925, 'eval_runtime': 139.5495, 'eval_samples_per_second': 28.664, 'eval_steps_per_second': 3.583, 'epoch': 5.0}




{'loss': 0.5819, 'learning_rate': 1.796875e-05, 'epoch': 5.12}
{'loss': 0.5912, 'learning_rate': 1.71875e-05, 'epoch': 5.25}
{'loss': 0.582, 'learning_rate': 1.6406250000000002e-05, 'epoch': 5.38}
{'loss': 0.576, 'learning_rate': 1.5625e-05, 'epoch': 5.5}
{'loss': 0.5764, 'learning_rate': 1.484375e-05, 'epoch': 5.62}
{'loss': 0.5853, 'learning_rate': 1.4062500000000001e-05, 'epoch': 5.75}
{'loss': 0.5843, 'learning_rate': 1.3281250000000001e-05, 'epoch': 5.88}
{'loss': 0.5921, 'learning_rate': 1.25e-05, 'epoch': 6.0}


  0%|          | 0/500 [00:00<?, ?it/s]

{'eval_loss': 0.6071081757545471, 'eval_accuracy': 0.67425, 'eval_runtime': 140.8946, 'eval_samples_per_second': 28.39, 'eval_steps_per_second': 3.549, 'epoch': 6.0}




{'loss': 0.5799, 'learning_rate': 1.171875e-05, 'epoch': 6.12}
{'loss': 0.5794, 'learning_rate': 1.09375e-05, 'epoch': 6.25}
{'loss': 0.5804, 'learning_rate': 1.0156250000000001e-05, 'epoch': 6.38}
{'loss': 0.583, 'learning_rate': 9.375000000000001e-06, 'epoch': 6.5}
{'loss': 0.5792, 'learning_rate': 8.59375e-06, 'epoch': 6.62}
{'loss': 0.5706, 'learning_rate': 7.8125e-06, 'epoch': 6.75}
{'loss': 0.5749, 'learning_rate': 7.031250000000001e-06, 'epoch': 6.88}
{'loss': 0.585, 'learning_rate': 6.25e-06, 'epoch': 7.0}


  0%|          | 0/500 [00:00<?, ?it/s]

{'eval_loss': 0.6101868748664856, 'eval_accuracy': 0.67275, 'eval_runtime': 139.4081, 'eval_samples_per_second': 28.693, 'eval_steps_per_second': 3.587, 'epoch': 7.0}




{'loss': 0.577, 'learning_rate': 5.46875e-06, 'epoch': 7.12}
{'loss': 0.5773, 'learning_rate': 4.6875000000000004e-06, 'epoch': 7.25}
{'loss': 0.5711, 'learning_rate': 3.90625e-06, 'epoch': 7.38}
{'loss': 0.5765, 'learning_rate': 3.125e-06, 'epoch': 7.5}
{'loss': 0.5833, 'learning_rate': 2.3437500000000002e-06, 'epoch': 7.62}
{'loss': 0.5594, 'learning_rate': 1.5625e-06, 'epoch': 7.75}
{'loss': 0.5822, 'learning_rate': 7.8125e-07, 'epoch': 7.88}
{'loss': 0.5703, 'learning_rate': 0.0, 'epoch': 8.0}


  0%|          | 0/500 [00:00<?, ?it/s]

{'eval_loss': 0.6113418936729431, 'eval_accuracy': 0.67275, 'eval_runtime': 140.1212, 'eval_samples_per_second': 28.547, 'eval_steps_per_second': 3.568, 'epoch': 8.0}
{'train_runtime': 20834.5183, 'train_samples_per_second': 12.287, 'train_steps_per_second': 1.536, 'train_loss': 0.600979121208191, 'epoch': 8.0}


TrainOutput(global_step=32000, training_loss=0.600979121208191, metrics={'train_runtime': 20834.5183, 'train_samples_per_second': 12.287, 'train_steps_per_second': 1.536, 'train_loss': 0.600979121208191, 'epoch': 8.0})

## Saving the Model and Tokenizer

After the fine-tuning process, it's crucial to save the model's weights and the tokenizer's configuration for future use, whether it's for inference, further training, or sharing with the community.

### 1. Saving the Model

To preserve the state of your model post-training, use the `save_pretrained` method:

In [59]:
# Save the model and tokenizer
model.save_pretrained("./YOUR/PATH/HERE")
tokenizer.save_pretrained("./YOUR/PATH/HERE")

('./model_bert_hf_experiment2/tokenizer_config.json',
 './model_bert_hf_experiment2/special_tokens_map.json',
 './model_bert_hf_experiment2/vocab.txt',
 './model_bert_hf_experiment2/added_tokens.json',
 './model_bert_hf_experiment2/tokenizer.json')

## Inference using the Fine-tuned Model

After saving the fine-tuned model, the next step is to utilize it for generating rewards on sample summaries. The model will produce outputs based on the knowledge it acquired during the fine-tuning process.

### Loading the Model

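To load the model, we use the `AutoModelForSequenceClassification` class from the Hugging Face `transformers` library, which loads a model with a sequence-classification head. Because the checkpoint was saved from a LoRA-wrapped model, the warning printed after the cell lists `lora_A`/`lora_B` and `modules_to_save` weights that a plain `BertForSequenceClassification` does not use.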


In [6]:
model = AutoModelForSequenceClassification.from_pretrained("JuanKO/rlhf_reward_model")

Downloading (…)lve/main/config.json:   0%|          | 0.00/678 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/439M [00:00<?, ?B/s]

Some weights of the model checkpoint at JuanKO/rlhf_reward_model were not used when initializing BertForSequenceClassification: ['bert.encoder.layer.5.attention.self.value.lora_B.default.weight', 'bert.encoder.layer.7.attention.self.value.lora_B.default.weight', 'bert.encoder.layer.6.attention.self.value.lora_B.default.weight', 'bert.encoder.layer.5.attention.self.query.lora_A.default.weight', 'bert.encoder.layer.1.attention.self.query.lora_B.default.weight', 'bert.encoder.layer.11.attention.self.query.lora_B.default.weight', 'bert.encoder.layer.8.attention.self.query.lora_A.default.weight', 'classifier.modules_to_save.default.bias', 'bert.encoder.layer.1.attention.self.value.lora_A.default.weight', 'bert.encoder.layer.9.attention.self.query.lora_A.default.weight', 'classifier.modules_to_save.default.weight', 'bert.encoder.layer.5.attention.self.value.lora_A.default.weight', 'bert.encoder.layer.2.attention.self.value.lora_A.default.weight', 'bert.encoder.layer.9.attention.self.value.lo

In [7]:

# Define the device here as well, in case this cell runs in a fresh session
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("JuanKO/rlhf_reward_model")
model.to(device)


Downloading (…)okenizer_config.json:   0%|          | 0.00/314 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

### Evaluate

In [None]:
# Evaluate the model
results = trainer.evaluate()

# Print metrics
print(results)

In [61]:
# Get predictions
predictions, label_ids, _ = trainer.predict(val_dataset)



  0%|          | 0/500 [00:00<?, ?it/s]

In [62]:
# Convert logits to labels
predicted_labels = np.argmax(predictions, axis=1)

# Compute accuracy or any other metric
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(label_ids, predicted_labels)
print("Accuracy:", accuracy)

Accuracy: 0.67275


### Inference

## Function: `score_summaries`

### Description:
The `score_summaries` function is designed to score two summaries, `chosen_summary` and `rejected_summary`, within the context of a Reinforcement Learning from Human Feedback (RLHF) loop. It tokenizes the inputs, obtains the logits from a given model, computes the softmax probabilities, and finally extracts the scores (probabilities) and logits associated with each summary.

### Parameters:

- **model** (`torch.nn.Module`): 
    - The PyTorch model that produces logits given an input.
  
- **tokenizer** (`transformers.PreTrainedTokenizer`): 
    - A tokenizer object used to tokenize input summaries.
  
- **chosen_summary** (`str`): 
    - The chosen summary string that needs to be scored.
  
- **rejected_summary** (`str`): 
    - The rejected summary string that needs to be scored.

### Returns:

- **chosen_score** (`float`): 
    - The probability score associated with the `chosen_summary` being positive or "good".

- **rejected_score** (`float`): 
    - The probability score associated with the `rejected_summary` being positive or "good".

- **chosen_logit** (`float`): 
    - The logit value associated with the `chosen_summary`.

- **rejected_logit** (`float`): 
    - The logit value associated with the `rejected_summary`.

### Function Flow:

1. **Tokenization**: 
    - The input summaries, `chosen_summary` and `rejected_summary`, are tokenized using the provided tokenizer. These tokenized inputs are padded or truncated to a maximum length of 512 tokens.

2. **Move to Device**: 
    - The tokenized tensors are transferred to the device (likely a GPU or CPU) where the model resides.

3. **Obtain Logits**: 
    - The tokenized tensors are passed through the model to obtain logits. This is done in a no-gradient context to ensure computational efficiency and prevent any updates to the model.

4. **Compute Probabilities**: 
    - The obtained logits are passed through a softmax function to get the associated probabilities. This helps in understanding how likely each summary is deemed "good" by the model.

5. **Extract Scores and Logits**: 
    - The function then extracts the probability and logit associated with the positive class (assumed to be the second class in the logits) for both summaries.

### Notes:
- The function assumes that the positive class (indicating the summary is "good") is the second class in the logits.
- The softmax function ensures that the logits are converted into probabilities that sum up to 1.
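
With two classes, the softmax probability of the second class depends only on the difference between the two logits: it equals the sigmoid of `logit[1] - logit[0]`. The sketch below illustrates this in plain Python with made-up logits; the helper functions are illustrative and independent of the model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


logits = [-0.3, 0.5]  # hypothetical [class 0, class 1] logits
probs = softmax(logits)
positive_prob = probs[1]
same_via_sigmoid = sigmoid(logits[1] - logits[0])
```

One consequence is that the raw logit at index 1 alone does not determine the score; two summaries can only be compared through their probabilities or through the logit differences.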

In [10]:
import torch.nn.functional as F


def score_summaries(model, tokenizer, chosen_summary, rejected_summary):
    # Tokenize the inputs
    chosen_tokens = tokenizer(chosen_summary, return_tensors="pt", padding='max_length', truncation=True, max_length=512)
    rejected_tokens = tokenizer(rejected_summary, return_tensors="pt", padding='max_length', truncation=True, max_length=512)
    
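    # Move the tokenized inputs to the device the model lives on
    device = next(model.parameters()).device
    chosen_tokens = chosen_tokens.to(device)
    rejected_tokens = rejected_tokens.to(device)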
    
    # Get logits from the model
    with torch.no_grad():
        chosen_logits = model(**chosen_tokens).logits
        rejected_logits = model(**rejected_tokens).logits
    
    # Apply softmax to get probabilities
    chosen_probs = F.softmax(chosen_logits, dim=-1)
    rejected_probs = F.softmax(rejected_logits, dim=-1)

    # Assuming the positive class (indicating 'chosen' is good) is the second one
    chosen_score = chosen_probs[0][1].item()
    rejected_score = rejected_probs[0][1].item()
    
    # Extract logits for each summary
    chosen_logit = chosen_logits[0][1].item()
    rejected_logit = rejected_logits[0][1].item()

    return chosen_score, rejected_score, chosen_logit, rejected_logit
    

### Run some samples:

In this test, we evaluate the `score_summaries` function using two sample summaries: one labeled as `chosen_summary` and the other as `rejected_summary`. These summaries are tokenized, scored, and the associated logits are obtained using our reward model (`model`) and its tokenizer (`tokenizer`).

### Sample Summaries:

- **Chosen Summary**: 
    - "TL;DR: Water meter in another condo is not in our condo. What can we do legally to restore water to my condo complex?"
    
- **Rejected Summary**: 
    - "TL;DR: I don't know"

### Test Execution:

The `score_summaries` function is called with the provided model, tokenizer, and the sample summaries. The returned scores and logits for each summary are then printed.

### Expected Output:

- **Chosen Score**: 
    - This gives the probability score of the `chosen_summary` being perceived as "good" or positive by the model.
  
- **Rejected Score**: 
    - This gives the probability score of the `rejected_summary` being perceived as "good" or positive by the model.
  
- **Chosen Logit**:
    - This returns the raw logit value associated with the `chosen_summary`.
  
- **Rejected Logit**:
    - This returns the raw logit value associated with the `rejected_summary`.

### Notes:
- Higher scores indicate a higher probability of the summary being perceived as positive or "good".
- The logit values provide insight into the raw outputs of the model before being passed through the softmax function.

In [21]:
# Example usage
chosen_summary = "TL;DR: Water meter in another condo is not in our condo. What can we do legally to restore water to my condo complex?"
rejected_summary = "TL;DR: I don't know"

In [22]:
chosen_score, rejected_score, chosen_logit, rejected_logit = score_summaries(model, tokenizer, chosen_summary, rejected_summary)

print(f"Chosen Score: {chosen_score:.4f}")
print(f"Rejected Score: {rejected_score:.4f}")

print(f"Chosen Logit: {chosen_logit:.4f}")
print(f"Rejected Logit: {rejected_logit:.4f}")


Chosen Score: 0.5634
Rejected Score: 0.5950
Chosen Logit: 0.1582
Rejected Logit: 0.2139


In [12]:

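def evaluate_on_test_samples(model, tokenizer, test_data, n):
    """Score the first `n` chosen/rejected pairs in `test_data` and return one result dict per pair."""
    results = []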
    for i in range(n):
        chosen_summary = test_data['chosen'][i]
        rejected_summary = test_data['rejected'][i]
        
        chosen_score, rejected_score, chosen_logit, rejected_logit = score_summaries(model, tokenizer, chosen_summary, rejected_summary)
        results.append({
            'chosen_summary': chosen_summary,
            'rejected_summary': rejected_summary,
            'chosen_score': chosen_score,
            'rejected_score': rejected_score,
            'chosen_logit': chosen_logit,
            'rejected_logit': rejected_logit
        })
    return results

In [None]:

# Run the evaluation on top 'n' samples
n = 20  # or any other number up to 2500
results = evaluate_on_test_samples(model, tokenizer, test_dataset, n)

# Print results
for i, result in enumerate(results, 1):
    print(f"Sample {i} - Chosen Logit: {result['chosen_logit']:.4f} | Rejected Logit: {result['rejected_logit']:.4f}")
    # Alternative: also print the probability scores for each sample
    #print(f"Sample {i} - Chosen Score: {result['chosen_score']:.4f} | Chosen Logit: {result['chosen_logit']:.4f} - Rejected Score: {result['rejected_score']:.4f} | Rejected Logit: {result['rejected_logit']:.4f}")
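
Scanning printed logits by eye gets tedious as `n` grows. The sketch below shows one way to summarize a `results` list: the fraction of pairs where the chosen summary receives a higher logit than the rejected one, plus the mean logit margin. `summarize_results` is a hypothetical helper that is not part of the original notebook, and the sample entries are made-up values, except the first one, which reuses the logits printed for the single example earlier.

```python
def summarize_results(results):
    """Return (pairwise accuracy, mean logit margin) for a list of result dicts."""
    if not results:
        raise ValueError("results must contain at least one entry")
    # Margin > 0 means the reward model ranked the chosen summary higher.
    margins = [r['chosen_logit'] - r['rejected_logit'] for r in results]
    accuracy = sum(m > 0 for m in margins) / len(margins)
    mean_margin = sum(margins) / len(margins)
    return accuracy, mean_margin

sample_results = [
    {'chosen_logit': 0.1582, 'rejected_logit': 0.2139},  # logits from the single example
    {'chosen_logit': 0.90, 'rejected_logit': -0.30},     # made-up value
    {'chosen_logit': 0.45, 'rejected_logit': 0.10},      # made-up value
    {'chosen_logit': -0.20, 'rejected_logit': 0.05},     # made-up value
]

accuracy, mean_margin = summarize_results(sample_results)
print(f"Pairwise accuracy: {accuracy:.2%} | Mean margin: {mean_margin:.4f}")
```

The same helper can be called on the list returned by `evaluate_on_test_samples`, since it only reads the `chosen_logit` and `rejected_logit` keys.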

### Upload to HuggingFace

In [72]:
from huggingface_hub import notebook_login

In [74]:
notebook_login()

In [77]:

trainer.model.push_to_hub('JuanKO/rlhf')

CommitInfo(commit_url='https://huggingface.co/JuanKO/rlhf/commit/993219569a62ffa6bf581996516c05ff21d2b6fd', commit_message='Upload model', commit_description='', oid='993219569a62ffa6bf581996516c05ff21d2b6fd', pr_url=None, pr_revision=None, pr_num=None)

## Conclusion and Recap

### **1. Overview**

In this notebook, we explored Reinforcement Learning from Human Feedback (RLHF) with the overarching goal of improving the text summarization abilities of a T5 model. RLHF is rooted in reinforcement learning and uses human feedback as the primary signal about the model's performance, which enables iterative refinement.

### **2. The RLHF Process**

The RLHF framework operates in stages:

- **Initial Training**: The model (in this case, T5) is trained with standard supervised methods to obtain an initial version.
- **Comparison Data Collection**: Pairs of model outputs (summaries) are compared to gather feedback on which output is better and which is worse.
- **Reward Model Development**: This is where the BERT model comes into play. Using the comparison data, we train BERT to predict rewards for different model outputs, which quantifies the quality of the summaries.
- **Fine-Tuning with Proximal Policy Optimization (PPO)**: Using the rewards provided by the BERT-based reward model, the main model (T5) is fine-tuned with reinforcement learning. This step is iterative: the model keeps refining its outputs based on the feedback from the reward model.
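
The reward-model stage is commonly trained with a pairwise ranking loss that pushes the reward of the preferred output above the reward of the rejected one. The sketch below shows that loss in plain Python, assuming one scalar reward per summary. It illustrates the general technique rather than this notebook's exact training code, and the reward values are made up.

```python
import math

def pairwise_ranking_loss(reward_chosen, reward_rejected):
    """Compute -log(sigmoid(reward_chosen - reward_rejected)) for one comparison pair."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow for large margins.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical rewards for three situations.
loss_good = pairwise_ranking_loss(2.0, -1.0)   # chosen clearly ranked higher: small loss
loss_tie = pairwise_ranking_loss(0.5, 0.5)     # no preference expressed: loss is log(2)
loss_bad = pairwise_ranking_loss(-1.0, 2.0)    # rejected ranked higher: large loss

print(f"good: {loss_good:.4f} | tie: {loss_tie:.4f} | bad: {loss_bad:.4f}")
```

The loss depends only on the difference between the two rewards, so the model is free to choose any absolute scale as long as preferred outputs score higher.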

### **3. Role of the Reward Model**

BERT, serving as the reward model, is a pivotal component in this setup. Its primary function is to assess the quality of the summaries generated by the main model. By training on human comparison data, BERT learns to quantify the differences that make one summary better than another. The scalar reward it outputs then guides the reinforcement learning process, so that the T5 model's improvements stay aligned with human preferences.

### **4. Final Thoughts**

By combining transformer models, human feedback, and reinforcement learning, the RLHF framework offers a promising avenue for model improvement. It allows for more nuanced training and keeps the model's development tied to human judgments, which can lead to more reliable and trustworthy AI systems.

Using BERT as the reward model gives us a mechanism to gauge and guide the T5 model's performance. The work in this notebook illustrates how adaptable these methods are in practice.


Notebook developed by [Juan Olano](https://www.linkedin.com/in/juan-olano-b9a330112/) and [Pano Evangeliou](https://www.linkedin.com/in/p-evangeliou/) - Sept.2023