# Introduction

In this project, we aim to train a language model that generates engaging and relevant movie descriptions by combining supervised learning with reinforcement learning. The training data comes from the IMDb review dataset (`stanfordnlp/imdb`), a large collection of user-written movie reviews.

# Data Preparation and Filtering

In [27]:
import torch
from datasets import Dataset, load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    TrainingArguments,
)
from trl import RewardConfig, RewardTrainer

In [40]:
# https://github.com/huggingface/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb
data = load_dataset("stanfordnlp/imdb", split="train[:5%]")  # First 5% of the training split (1,250 reviews)
data = data.rename_columns({"text": "review"})
data = data.filter(lambda x: len(x["review"]) > 200, batched=False)  # Keep only reviews longer than 200 characters

Filter:   0%|          | 0/1250 [00:00<?, ? examples/s]

# Reward Model Training

In [41]:
model_name = "gpt2"
input_min_text_length = 2
input_max_text_length = 8

In [42]:
# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

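# Initialize the reward model as a GPT-2 sequence classifier with a freshly initialized head.
# Note: TRL's reward_modeling.py example uses num_labels=1 so the head emits one scalar reward per sequence.
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
reward_model.config.pad_token_id = tokenizer.pad_token_id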

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [43]:
# https://github.com/huggingface/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb
def sample_length():
    return torch.randint(input_min_text_length, input_max_text_length + 1, (1,)).item()

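# Tokenize each review and keep only its first 2 to 8 tokens as the query.
# The full review is encoded before slicing, which is why the tokenizer warns
# about sequences longer than 1024 tokens; the truncated query is unaffected.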
def tokenize(sample):
    max_length = sample_length()
    sample["input_ids"] = tokenizer.encode(sample["review"])[:max_length]
    sample["query"] = tokenizer.decode(sample["input_ids"])
    return sample

data = data.map(tokenize, batched=False)

for example in data.select(range(5)):
    print("Query:", example["query"])
    print("Input IDs:", example["input_ids"])
    print()


Map:   0%|          | 0/1241 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1168 > 1024). Running this sequence through the model will result in indexing errors


Query: I rented I AM CURIOUS
Input IDs: [40, 26399, 314, 3001, 327, 47269, 20958]

Query: "I Am Curious: Yellow" is
Input IDs: [1, 40, 1703, 44269, 25, 12550, 1, 318]

Query: If only to
Input IDs: [1532, 691, 284]

Query: This film
Input IDs: [1212, 2646]

Query: Oh, brother
Input IDs: [5812, 11, 3956]
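
The queries printed above are random-length prefixes of each review. As a minimal, framework-free sketch of that sampling step, the snippet below uses plain lists in place of tokenizer output; `make_query` and `toy_ids` are hypothetical names introduced only for illustration, and the ids are placeholders rather than real GPT-2 tokens.

```python
import random


def make_query(token_ids, min_len=2, max_len=8, rng=random):
    """Return a random-length prefix of token_ids (min_len to max_len tokens)."""
    # random.randint includes both endpoints, whereas torch.randint excludes
    # its upper bound (hence the "+ 1" in sample_length above).
    length = rng.randint(min_len, max_len)
    # Slicing never fails: a review shorter than `length` is returned whole.
    return token_ids[:length]


rng = random.Random(0)            # fixed seed so the sketch is repeatable
toy_ids = list(range(100, 120))   # placeholder token ids, not real GPT-2 ids
queries = [make_query(toy_ids, rng=rng) for _ in range(5)]

for q in queries:
    print(len(q), q)
```

Every query is a prefix of the same sequence and only its length varies, which matches the 2 to 8 token inputs shown in the cell output.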



In [44]:
# Split the queries into "chosen" and "rejected" lists by position (even vs. odd index).
# Note: the split ignores the sentiment labels, so a pair encodes no real preference.
reviews = data["query"]
chosen = reviews[::2]
rejected = reviews[1::2]

# Ensure both lists are of the same length
min_length = min(len(chosen), len(rejected))
chosen = chosen[:min_length]
rejected = rejected[:min_length]

reward_data = {"chosen": chosen, "rejected": rejected}
reward_dataset = Dataset.from_dict(reward_data)

print(reward_dataset)

Dataset({
    features: ['chosen', 'rejected'],
    num_rows: 620
})
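
The split above pairs reviews by position, so nothing distinguishes a "chosen" text from a "rejected" one. One possible alternative, sketched here under the assumption that each review carries a binary sentiment label (1 = positive, 0 = negative) and that positive reviews are treated as preferred, is to pair by label. `build_preference_pairs` is a hypothetical helper, not part of TRL or `datasets`, and the toy texts are invented for the example.

```python
import random


def build_preference_pairs(texts, labels, seed=0):
    """Pair positive texts (chosen) with negative texts (rejected)."""
    positives = [t for t, y in zip(texts, labels) if y == 1]
    negatives = [t for t, y in zip(texts, labels) if y == 0]
    # Shuffle each group so the pairing does not depend on dataset order.
    rng = random.Random(seed)
    rng.shuffle(positives)
    rng.shuffle(negatives)
    # Keep as many pairs as the smaller group allows.
    n = min(len(positives), len(negatives))
    return {"chosen": positives[:n], "rejected": negatives[:n]}


toy_texts = ["Loved it", "Terrible plot", "A real gem", "Boring", "Great cast"]
toy_labels = [1, 0, 1, 0, 1]
pairs = build_preference_pairs(toy_texts, toy_labels)
print(len(pairs["chosen"]), len(pairs["rejected"]))
```

The resulting dictionary has the same `chosen`/`rejected` layout as `reward_data`. If the loaded slice happens to contain a single label class, which is worth checking with `set(data["label"])`, no pairs can be formed and a larger or shuffled sample would be needed.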


In [45]:
# Custom data collator for padding (written with help from ChatGPT to work around an error).
# It adds no behavior beyond DataCollatorWithPadding and is not passed to the trainer below.
class CustomRewardDataCollator(DataCollatorWithPadding):
    def __call__(self, features):
        return super().__call__(features)

In [46]:
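# Custom training arguments (written with help from ChatGPT to work around an error).
# They add the extra fields RewardTrainer expects on its config; trl's RewardConfig defines these fields natively.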
class CustomTrainingArguments(TrainingArguments):
    def __init__(self, *args, max_length=512, dataset_num_proc=1, center_rewards_coefficient=1.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.max_length = max_length
        self.dataset_num_proc = dataset_num_proc
        self.center_rewards_coefficient = center_rewards_coefficient

# Configure the training arguments
training_args = CustomTrainingArguments(
    output_dir="./reward_model",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-5,
    per_device_train_batch_size=2,  
    num_train_epochs=3,
    logging_dir="./logs",
    logging_steps=10,
    max_length=512,
    dataset_num_proc=1,
    center_rewards_coefficient=1.0,
)

# Initialize the RewardTrainer
reward_trainer = RewardTrainer(
    model=reward_model,
    train_dataset=reward_dataset,
    eval_dataset=reward_dataset,  # Note: same data as training, so the validation metrics are not held-out
    processing_class=tokenizer,
    args=training_args,
    max_length=None,
)


# Train the reward model
reward_trainer.train()

reward_trainer.save_model("./reward_model")
tokenizer.save_pretrained("./reward_model")

Map:   0%|          | 0/620 [00:00<?, ? examples/s]

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 1.1123 | 0.949783 | 0.556452 |
| 2 | 0.9286 | 0.741319 | 0.609677 |
| 3 | 0.8161 | 0.722785 | 0.625806 |


('./reward_model\\tokenizer_config.json',
 './reward_model\\special_tokens_map.json',
 './reward_model\\vocab.json',
 './reward_model\\merges.txt',
 './reward_model\\added_tokens.json',
 './reward_model\\tokenizer.json')

Both losses fall steadily across epochs: the training loss drops from 1.1123 to 0.8161 and the validation loss from 0.9498 to 0.7228, while accuracy rises from 55.65% to 62.58%. These numbers should be read with caution. The evaluation set is the same dataset used for training, so they measure fit rather than generalization. The chosen/rejected pairs were also formed by position rather than by sentiment, so the accuracy does not show that the reward model has learned to judge text quality. A reward model trained on label-based preference pairs and evaluated on held-out data would be needed before it can usefully guide PPO toward more coherent, human-aligned outputs.
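
To make the loss and accuracy columns easier to interpret, here is a minimal sketch of the usual pairwise preference objective, `-log(sigmoid(r_chosen - r_rejected))`, written in plain Python. It is an illustration under simplifying assumptions rather than the trainer's exact computation: each text is assumed to receive a single scalar reward, the reward-centering term controlled by `center_rewards_coefficient` is omitted, and the function names and toy scores are hypothetical.

```python
import math


def pairwise_loss(reward_chosen, reward_rejected):
    """Compute -log(sigmoid(r_chosen - r_rejected)) in a numerically stable way."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # Equivalent form that avoids overflow when the margin is very negative.
    return -margin + math.log1p(math.exp(margin))


def pairwise_accuracy(score_pairs):
    """Fraction of pairs in which the chosen text scores strictly higher."""
    return sum(rc > rr for rc, rr in score_pairs) / len(score_pairs)


# Toy (chosen, rejected) reward scores, invented for illustration.
scores = [(1.2, 0.3), (0.1, 0.4), (0.0, 0.0)]
losses = [pairwise_loss(rc, rr) for rc, rr in scores]
print([round(loss, 4) for loss in losses])
print(pairwise_accuracy(scores))
```

Under this simplified objective, a model that assigns equal rewards to both texts incurs a loss of ln 2 (about 0.693) per pair, and an accuracy near 0.5 corresponds to guessing, which gives a rough baseline for judging reward-model metrics.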

# Optimization with Proximal Policy Optimization (PPO)

In [125]:
from transformers import AutoModelForCausalLM
from trl import PPOTrainer, PPOConfig

ppo_model = AutoModelForCausalLM.from_pretrained(model_name)
ppo_tokenizer = AutoTokenizer.from_pretrained(model_name)
ppo_tokenizer.pad_token = ppo_tokenizer.eos_token

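# Initialize reference and value models.
# Note: TRL's ppo.py example loads the value model with
# AutoModelForSequenceClassification (num_labels=1); a causal LM is used here instead.
ref_model = AutoModelForCausalLM.from_pretrained(model_name)
value_model = AutoModelForCausalLM.from_pretrained(model_name)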

ppo_config = PPOConfig(
    output_dir="./ppo_model",
    learning_rate=1e-5,
    batch_size=16,
    mini_batch_size=4,
)

prompts = [
    "What are the best movies of the year?",
    "Describe a critically acclaimed thriller movie.",
    "What makes a comedy movie entertaining?",
]

def tokenize_prompts(prompts):
    tokenized = ppo_tokenizer(prompts, padding=True, truncation=True, return_tensors="pt")
    return {
        "input_ids": tokenized["input_ids"],
        "attention_mask": tokenized["attention_mask"],
    }

tokenized_data = tokenize_prompts(prompts)

# Define Custom Dataset (requested help from ChatGPT)
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, data, device):
        self.data = data
        self.device = device

    def __len__(self):
        return len(self.data["input_ids"])

    def __getitem__(self, idx):
        return {key: val[idx] for key, val in self.data.items()}
    

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

train_dataset = CustomDataset(tokenized_data, device)

In [126]:
ppo_model.config.output_hidden_states = True
reward_model.config.output_hidden_states = True

In [140]:
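# Debug PPO trainer (written with help from ChatGPT).
# Note: this override replaces the PPO training loop. It prints the first batch
# and stops, so no policy update is ever performed.
class DebugPPOTrainer(PPOTrainer):
    def train(self):
        for batch in self.dataloader:
            print("Training batch:")
            print(batch)
            input_ids = batch["input_ids"].to(self.args.device)
            print("Input IDs shape:", input_ids.shape)
            break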

# Initialize PPO Trainer
ppo_trainer = DebugPPOTrainer(
    config=ppo_config,
    model=ppo_model,
    ref_model=ref_model,
    value_model=value_model,
    tokenizer=ppo_tokenizer,
    train_dataset=train_dataset,
    reward_model=reward_model,
)

from torch.utils.data import DataLoader

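# Prepare DataLoader (written with help from ChatGPT).
# Note: torch.cat joins the 1-D examples end to end, so the batch dimension is lost
# (three 9-token prompts become one 27-token tensor); torch.stack would keep shape (3, 9).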
train_dataloader = DataLoader(
    train_dataset,
    batch_size=ppo_config.batch_size,
    collate_fn=lambda x: {
        key: torch.cat([item[key] for item in x], dim=0)
        for key in x[0]
    },
    shuffle=True
)

ppo_trainer.dataloader = train_dataloader




In [141]:
# Run the debug train() override (prints one batch; no PPO updates are applied)
ppo_trainer.train()

Training batch:
{'input_ids': tensor([24564,  4892,   257, 19475, 27023, 32251,  3807,    13, 50256,  2061,
         1838,   257, 10997,  3807, 17774,    30, 50256, 50256,  2061,   389,
          262,  1266,  6918,   286,   262,   614,    30]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1,
        1, 1, 1])}
Input IDs shape: torch.Size([27])
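
The printed shape `torch.Size([27])` shows that the three padded 9-token prompts were joined into one long sequence instead of forming a batch. The framework-free sketch below contrasts the two collation strategies, with plain lists standing in for tensors; `collate_flat`, `collate_stacked`, and `shape` are hypothetical helpers, and the token ids are placeholders apart from the pad id 50256 visible in the output above. With tensors, `torch.stack` is the counterpart of the stacked variant.

```python
def collate_flat(batch):
    """Mimic concatenating 1-D examples: the batch dimension is lost."""
    return {key: [tok for item in batch for tok in item[key]] for key in batch[0]}


def collate_stacked(batch):
    """Keep one row per example, giving shape (batch_size, seq_len)."""
    return {key: [list(item[key]) for item in batch] for key in batch[0]}


def shape(values):
    """Return the shape of a flat list or of a rectangular list of lists."""
    if values and isinstance(values[0], list):
        return (len(values), len(values[0]))
    return (len(values),)


PAD = 50256  # GPT-2 end-of-text id, reused as the padding token
batch = [
    {"input_ids": [11, 12, 13, 14, 15, 16, 17, 18, PAD], "attention_mask": [1] * 8 + [0]},
    {"input_ids": [21, 22, 23, 24, 25, 26, 27, PAD, PAD], "attention_mask": [1] * 7 + [0, 0]},
    {"input_ids": [31, 32, 33, 34, 35, 36, 37, 38, 39], "attention_mask": [1] * 9},
]

flat = collate_flat(batch)
stacked = collate_stacked(batch)
print(shape(flat["input_ids"]), shape(stacked["input_ids"]))
```

Keeping one row per example preserves the boundary between prompts, which a model needs in order to treat them as separate sequences.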


# Text Generation with PPO-Optimized Model

In [144]:
from transformers import pipeline

generation_pipeline = pipeline("text-generation", model=ppo_model, tokenizer=ppo_tokenizer)
result = generation_pipeline("Describe a heartwarming drama movie.", max_length=100)
print("Generated Text:", result[0]["generated_text"])

Device set to use cpu


Generated Text: Describe a heartwarming drama movie.

Daniels played a friend of an old friend of his who is now dead.

Ride the Carousel

When an ex-con returns to her fishing village, she blasts off on a date with a big brat.

Sunderland Tilapia

After studyingancy in Rome for two years, a single U.S. tourist tries heartbreak by staying up all night watching the latest movies.




The pipeline runs end to end and the model produces text in response to the prompt, but the output is incoherent and only loosely related to it. This is expected: the overridden `train()` method prints a single batch and stops, so no PPO update is applied and the generations come from the unmodified pretrained GPT-2 weights. The reward model is also weak, since it was trained on pairs that carry no real preference signal. Producing clearer and more relevant text would require a complete PPO training loop, a properly trained reward model, and better data preprocessing.
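
For reference, the core of the optimization step that PPO performs is the clipped surrogate objective. The sketch below computes it for a single action in plain Python. It is a simplified illustration, not TRL's implementation: the value loss and the KL penalty against the reference model are left out, and the function name, argument names, and the default `clip_range` of 0.2 are assumptions made for the example.

```python
import math


def ppo_clipped_objective(logp_new, logp_old, advantage, clip_range=0.2):
    """Clipped surrogate objective for one action (to be maximized)."""
    # Probability ratio between the updated policy and the policy that sampled the action.
    ratio = math.exp(logp_new - logp_old)
    # Restrict the ratio to [1 - clip_range, 1 + clip_range].
    clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    # Taking the minimum removes any incentive to move the ratio outside that range.
    return min(ratio * advantage, clipped_ratio * advantage)


unchanged = ppo_clipped_objective(-1.0, -1.0, advantage=2.0)
capped_gain = ppo_clipped_objective(math.log(1.5), 0.0, advantage=1.0)
uncapped_penalty = ppo_clipped_objective(math.log(1.5), 0.0, advantage=-1.0)
print(unchanged, round(capped_gain, 4), round(uncapped_penalty, 4))
```

When the policy has not changed, the objective equals the advantage. A large increase in the probability of a good action is credited only up to the clip limit, while the same increase for a bad action is penalized in full.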

# References and Resources

The following resources were used to guide and structure this project. They provided valuable insights into reward modeling, PPO optimization, and the implementation of advanced reinforcement learning techniques for language models:

- [GPT-2 Sentiment Analysis Notebook](https://github.com/huggingface/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb)
- [PPO Training Script](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo/ppo.py)
- [PPO TLDR Training Script](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo/ppo_tldr.py)
- [Reward Modeling Training Script](https://github.com/huggingface/trl/blob/main/examples/scripts/reward_modeling.py)
- [CleanRL GitHub Repository](https://github.com/vwxyzjn/cleanrl/tree/master)
- [Introduction to PPO and Reinforcement Learning for NLP](https://www.youtube.com/watch?v=hlv79rcHws0&ab_channel=MachineLearningwithPhil)
- [Reward Model Training Guide](https://medium.com/towards-generative-ai/reward-model-training-2209d1befb5f)
