# Introduction

In this project, we train a language model to generate engaging and relevant movie descriptions by combining supervised learning (a reward model trained on labeled reviews) with reinforcement learning (Proximal Policy Optimization). The training data comes from the IMDb review dataset (`stanfordnlp/imdb`), whose 25,000-review training split is labeled as positive or negative.

# Data Preparation and Filtering

In [1]:
import torch
from datasets import Dataset, load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer



In [2]:
# https://github.com/huggingface/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb
data = load_dataset("stanfordnlp/imdb", split="train")
data = data.rename_columns({"text": "review"})
print(f"Total before filtering: {len(data)}")
print(f"Positive before: {sum(1 for d in data if d['label'] == 1)}")
print(f"Negative before: {sum(1 for d in data if d['label'] == 0)}")

Total before filtering: 25000
Positive before: 12500
Negative before: 12500


In [3]:
# Keep only reviews longer than 200 characters
data = data.filter(lambda x: len(x["review"]) > 200, batched=False)
print(f"\nTotal after filtering: {len(data)}")
print(f"Positive after: {sum(1 for d in data if d['label'] == 1)}")
print(f"Negative after: {sum(1 for d in data if d['label'] == 0)}")


Total after filtering: 24895
Positive after: 12439
Negative after: 12456
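
The filtering step above keeps only reviews longer than 200 characters and then recounts the two classes. As a minimal sketch of the same logic on plain Python dictionaries, the hypothetical helper `filter_and_count` and the toy records below (neither is part of the notebook) show what the check does without downloading the dataset:

```python
from collections import Counter

def filter_and_count(records, min_chars=200):
    """Keep records whose review is longer than min_chars and count the labels."""
    kept = [r for r in records if len(r["review"]) > min_chars]
    counts = Counter(r["label"] for r in kept)
    return kept, counts

# Toy records for illustration only (not taken from IMDb)
toy = [
    {"review": "Great film. " * 30, "label": 1},     # 360 characters, kept
    {"review": "Too short.", "label": 0},            # 10 characters, dropped
    {"review": "Dull and slow. " * 20, "label": 0},  # 300 characters, kept
]
kept, counts = filter_and_count(toy)
print(len(kept), counts[1], counts[0])
```

On the real dataset, `Counter(data["label"])` should give the same class counts as the generator expressions in the cell, without iterating over full rows.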


# Reward Model Training

In [4]:
model_name = "gpt2"
input_min_text_length = 2
input_max_text_length = 8

In [5]:
# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Initialize the reward model for sequence classification
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
reward_model.config.pad_token_id = tokenizer.pad_token_id

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
positive_data = data.filter(lambda x: x["label"] == 1)
negative_data = data.filter(lambda x: x["label"] == 0)

In [7]:
print(f"Positive reviews: {len(positive_data)}")
print(f"Negative reviews: {len(negative_data)}")

Positive reviews: 12439
Negative reviews: 12456


In [8]:
# https://github.com/huggingface/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb
def sample_length():
    return torch.randint(input_min_text_length, input_max_text_length + 1, (1,)).item()

# Function to tokenize the review data
def tokenize(sample):
    max_length = sample_length()
    sample["input_ids"] = tokenizer.encode(sample["review"])[:max_length]
    sample["query"] = tokenizer.decode(sample["input_ids"])
    return sample

positive_data = positive_data.map(tokenize, batched=False)
negative_data = negative_data.map(tokenize, batched=False)
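
The `tokenize` function truncates each review to a random prefix of between `input_min_text_length` and `input_max_text_length` tokens. The `+ 1` in `sample_length` is needed because the upper bound of `torch.randint` is exclusive. The sketch below reproduces the idea with the standard library only. Whitespace splitting stands in for the GPT-2 tokenizer and `truncate_to_random_prefix` is a hypothetical helper, so token counts will differ from the notebook's.

```python
import random

INPUT_MIN_TEXT_LENGTH = 2
INPUT_MAX_TEXT_LENGTH = 8

def sample_length(rng):
    # random.Random.randint includes both endpoints, unlike torch.randint,
    # whose upper bound is exclusive (hence the "+ 1" in the notebook).
    return rng.randint(INPUT_MIN_TEXT_LENGTH, INPUT_MAX_TEXT_LENGTH)

def truncate_to_random_prefix(review, rng):
    tokens = review.split()               # stand-in for tokenizer.encode
    prefix = tokens[:sample_length(rng)]  # keep a random-length prefix
    return " ".join(prefix)               # stand-in for tokenizer.decode

rng = random.Random(0)
review = "This movie was a complete surprise from start to finish and I loved it"
queries = [truncate_to_random_prefix(review, rng) for _ in range(5)]
for q in queries:
    print(q)
```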

In [9]:
positive_reviews = positive_data["query"]
negative_reviews = negative_data["query"]

In [11]:
min_length = min(len(positive_reviews), len(negative_reviews))
chosen = positive_reviews[:min_length]
rejected = negative_reviews[:min_length]

reward_data = {"chosen": chosen, "rejected": rejected}
reward_dataset = Dataset.from_dict(reward_data)

print(f"Reward dataset created with {len(reward_dataset)} pairs")

Reward dataset created with 12439 pairs
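
The cell above pairs the i-th positive prefix (`chosen`) with the i-th negative prefix (`rejected`). The two texts in a pair come from unrelated reviews, so the only signal the pair carries is sentiment. A small sketch with made-up strings and a hypothetical `build_pairs` helper shows how the shorter list sets the number of pairs:

```python
def build_pairs(positive_texts, negative_texts):
    """Pair the i-th positive text (chosen) with the i-th negative text (rejected)."""
    n = min(len(positive_texts), len(negative_texts))
    return {"chosen": positive_texts[:n], "rejected": negative_texts[:n]}

# Illustrative prefixes, not taken from the dataset
positives = ["I loved this", "A wonderful", "Brilliant acting in"]
negatives = ["This was awful", "Worst film"]
pairs = build_pairs(positives, negatives)
print(len(pairs["chosen"]), len(pairs["rejected"]))
```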


In [15]:
training_args = RewardConfig(
    output_dir="./reward_model",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-5,
    per_device_train_batch_size=16,  # Increased for GPU
    per_device_eval_batch_size=16,   # Increased for GPU
    num_train_epochs=3,
    logging_dir="./logs",
    logging_steps=10,
    remove_unused_columns=False,
    fp16=True,  # Mixed precision on GPU
    dataloader_num_workers=4,  # Parallel data loading
    max_length=512,  # Maximum sequence length for the reward model
)

reward_trainer = RewardTrainer(
    model=reward_model,
    train_dataset=reward_dataset,
    eval_dataset=reward_dataset,  # NOTE: same data as training, so eval metrics are not held-out
    processing_class=tokenizer,
    args=training_args,
)



In [16]:
reward_trainer.train()
reward_trainer.save_model("./reward_model")
tokenizer.save_pretrained("./reward_model")

| Epoch | Training Loss | Validation Loss | Num Tokens | Min Reward | Mean Reward | Max Reward | Accuracy | Margin |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.5865 | 0.537916 | 149626.0 | 0.237725 | 3.052803 | 5.285942 | 0.723341 | 0.68068 |
| 2 | 0.499 | 0.449922 | 299252.0 | -0.423991 | 3.542835 | 6.939659 | 0.789662 | 1.193776 |
| 3 | 0.3881 | 0.395735 | 448878.0 | -1.310874 | 3.751971 | 7.934824 | 0.821107 | 1.666769 |


('./reward_model\\tokenizer_config.json',
 './reward_model\\special_tokens_map.json',
 './reward_model\\vocab.json',
 './reward_model\\merges.txt',
 './reward_model\\added_tokens.json',
 './reward_model\\tokenizer.json')
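
The Accuracy and Margin columns in the training log above summarise how well the reward model separates each pair. Pairwise reward models of this kind are commonly trained with the loss `-log(sigmoid(r_chosen - r_rejected))`. Assuming that is the objective in use here, the sketch below computes the loss, the accuracy (share of pairs where the chosen text scores higher) and the margin (mean score difference) for made-up scores. The helper `pairwise_metrics` and all numbers are illustrative, not outputs of the notebook.

```python
import math

def pairwise_metrics(chosen_rewards, rejected_rewards):
    """Return (loss, accuracy, margin) for paired scalar rewards."""
    diffs = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    # -log(sigmoid(d)) is equal to log(1 + exp(-d))
    loss = sum(math.log1p(math.exp(-d)) for d in diffs) / len(diffs)
    accuracy = sum(d > 0 for d in diffs) / len(diffs)
    margin = sum(diffs) / len(diffs)
    return loss, accuracy, margin

# Hypothetical reward scores for four pairs
chosen = [2.0, 0.5, 3.0, 1.0]
rejected = [0.0, 1.5, 1.0, 0.5]
loss, accuracy, margin = pairwise_metrics(chosen, rejected)
print(f"loss={loss:.4f} accuracy={accuracy:.2f} margin={margin:.3f}")
```

Under this reading, the falling loss and the rising accuracy and margin across the three epochs all describe the same trend: the scores of positive and negative prefixes move further apart.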

# Optimization with Proximal Policy Optimization (PPO)

In [22]:
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)
from trl import PPOTrainer, PPOConfig
from torch.utils.data import DataLoader

# Base LM to fine-tune with PPO
policy_model = AutoModelForCausalLM.from_pretrained(model_name)

# Reference model for the KL penalty (a second copy of the same base model;
# no separate SFT step is run in this notebook)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# Reward model: fine-tuned sentiment classifier saved at ./reward_model
# (num_labels=1 so it outputs a scalar reward per sequence)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "./reward_model",
    num_labels=1,
)

# Value model: critic. Same architecture type (scalar regression head).
# You can use the same base as the reward model or another checkpoint.
value_model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=1,
)

ppo_config = PPOConfig(
    exp_name="imdb-sentiment-rlhf",

    # Optim / schedule
    learning_rate=1e-5,
    per_device_train_batch_size=2,      
    gradient_accumulation_steps=4,
    num_train_epochs=1,                  
    num_ppo_epochs=4,
    num_mini_batches=4,                  

    # RL-specific
    gamma=1.0,
    lam=0.95,
    cliprange=0.2,
    cliprange_value=0.2,
    vf_coef=0.1,
    kl_coef=0.05,
    whiten_rewards=False,

    # Generation / stopping
    response_length=32,                  
    stop_token=None,                     
    stop_token_id=None,

    # for logging / reproducibility
    seed=42,
    output_dir="./ppo_imdb",
    num_sample_generations=0,

    no_cuda=False
)

# ppo_config.total_episodes = 1000  

def tokenize_for_ppo(sample):
    max_length = sample_length()
    input_ids = tokenizer.encode(
        sample["review"],
        truncation=True,
        max_length=max_length,
    )
    return {"input_ids": input_ids}

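# Tokenize a 1,000-review subset to use as PPO prompts. Shuffling first avoids
# drawing from a single class, on the assumption that the IMDb train split is
# ordered by label.
data_with_ids = data.shuffle(seed=42).select(range(1000)).map(tokenize_for_ppo, batched=False)

train_input_ids = data_with_ids["input_ids"]

# PPOTrainer expects each element to have at least "input_ids"
ppo_dataset = [{"input_ids": ids} for ids in train_input_ids]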

# Let PPOTrainer create the default DataCollatorWithPadding using the tokenizer.
ppo_trainer = PPOTrainer(
    args=ppo_config,
    processing_class=tokenizer,
    model=policy_model,        # policy
    ref_model=ref_model,       # frozen ref policy
    reward_model=reward_model, # reward model
    value_model=value_model,   # critic
    train_dataset=ppo_dataset,
    # eval_dataset=...          # optional
    # data_collator=...         # leave None to use the default
)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
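
The `PPOConfig` above sets `gamma`, `lam` and `kl_coef`. As a rough illustration of what these control in a typical PPO setup for language models (not necessarily line for line what `PPOTrainer` does internally), the sketch below builds per-token rewards from a KL penalty plus a final scalar score, then computes advantages with generalized advantage estimation (GAE). The helper names and all numbers are hypothetical.

```python
def shaped_rewards(logprobs, ref_logprobs, score, kl_coef=0.05):
    """Per-token reward: -kl_coef * (logp - ref_logp), plus the score on the last token."""
    rewards = [-kl_coef * (lp - rlp) for lp, rlp in zip(logprobs, ref_logprobs)]
    rewards[-1] += score
    return rewards

def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation over one response (value after the end is 0)."""
    advantages = [0.0] * len(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        last = delta + gamma * lam * last                    # discounted sum of residuals
        advantages[t] = last
    return advantages

# Hypothetical three-token response
logprobs = [-1.0, -2.0, -0.5]      # policy log-probabilities
ref_logprobs = [-1.2, -1.5, -0.5]  # reference-model log-probabilities
values = [0.3, 0.4, 0.6]           # critic estimates per token
rewards = shaped_rewards(logprobs, ref_logprobs, score=1.0)
advantages = gae(rewards, values)
print(rewards)
print(advantages)
```

With `gamma=1.0` there is no discounting within a response, so `lam` alone controls how quickly the influence of later tokens decays.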


In [21]:
import torch
print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.get_device_name(0))

True
1
NVIDIA GeForce RTX 4050 Laptop GPU


In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
policy_model.to(device)
ref_model.to(device)
reward_model.to(device)
value_model.to(device)

print("Starting PPO training...")

ppo_trainer.train()

# Save the PPO-trained policy together with its tokenizer
# (the `tokenizer` object, not the `tokenize` helper function defined earlier)
policy_model.save_pretrained("./ppo_model_final")
tokenizer.save_pretrained("./ppo_model_final")
print("PPO training completed and model saved.")

Starting PPO training...
===training policy===
PPO training completed and model saved.
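
The `cliprange=0.2` setting used for this training run bounds how far a single PPO update can move the policy away from the policy that generated the samples. A minimal sketch of the standard clipped surrogate loss for one token follows. `clipped_policy_loss` is a hypothetical helper written for illustration, not a TRL function.

```python
import math

def clipped_policy_loss(new_logprob, old_logprob, advantage, cliprange=0.2):
    """Negated PPO clipped surrogate objective for a single token."""
    ratio = math.exp(new_logprob - old_logprob)
    clipped_ratio = max(1.0 - cliprange, min(1.0 + cliprange, ratio))
    # PPO maximises min(ratio * A, clipped_ratio * A); as a loss we negate it.
    return -min(ratio * advantage, clipped_ratio * advantage)

# ratio = exp(0.5) is above 1.2, so a positive advantage is capped at 1.2 * A
loss_pos = clipped_policy_loss(new_logprob=-1.0, old_logprob=-1.5, advantage=2.0)
# with a negative advantage the unclipped, more pessimistic term is kept
loss_neg = clipped_policy_loss(new_logprob=-1.0, old_logprob=-1.5, advantage=-2.0)
print(loss_pos, loss_neg)
```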


# Text Generation with PPO-Optimized Model

In [27]:
from transformers import pipeline

# Load the PPO-optimized model
generation_pipeline = pipeline(
    "text-generation",
    model="./ppo_model_final",
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

# Generate text
prompts = [
    "This movie is",
    "The acting was",
    "I really enjoyed"
]

for prompt in prompts:
    # Note: in this run max_length=50 is overridden by a default max_new_tokens
    # of 256 (see the warning in the output); pass max_new_tokens=50 instead
    # to cap each continuation at 50 tokens.
    result = generation_pipeline(prompt, max_length=50, num_return_sequences=1)
    print(f"\nPrompt: {prompt}")
    print(f"Generated: {result[0]['generated_text']}")

Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)



Prompt: This movie is
Generated: This movie is about a man and his dog who have been separated for more than a year. This is a touching and touching story about a man who believes that all love is real and that every moment of love is precious.

A man and his dog are separated as they attempt to reconnect.

"The American Dream" was written and directed by Eric Kripke, who is a Canadian-American father of two. He is best known for his role as Steve Rogers in the Star Wars: Episode IV A New Hope trilogy and the movie "The Last Jedi". He also directed the film "Blue Sky" (2011) and "The Force Awakens" (2015).

For more information on the film, visit www.thestarwars.com

Produced by: The Force Awakens

Directed by: Eric Kripke

Cast: John Boyega (Steve Rogers), Harrison Ford (Captain America), Ben Whishaw (Joker), Robert Downey Jr. (Bella), Daisy Ridley (Harrison Ford), Lupita Nyong'o (Bella), Lupita Nyong'o (Bella), Andressa (Bella), Zoe Saldana (Bella), and Andy Serkis (Bella).





Prompt: The acting was
Generated: The acting was shot with a.357 Magnum.

The shooting occurred after a woman reported being in fear for her life at a local park.

The woman who reported the incident was working at the park when she became physically or mentally ill.

The female was taken to a local hospital for treatment. The woman was pronounced dead at the scene.

Police were able to identify the suspect as 43-year-old Michael J. Johnson.

According to police, Johnson was arrested for felony murder, armed illegal handgun possession and robbery.

Johnson was arrested at an apartment building and charged with felony murder.

The victim was able to escape with minor wounds to her body and was taken to the hospital for treatment.

Police are asking anyone with information to call Crime Stoppers at (901) 592-TIPS.

Source: http://www.azcentral.com/news/local/african-americans/michael-johnson-pursues-alleged-suicide-after-jail-says-himself-a-stalker-is-a-witness

© 2018 Cox Media Group.W
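
The warning in the output above reports that `max_new_tokens` (256) took precedence over `max_length=50`, which is why the generations run far beyond 50 tokens. The two settings count different things: `max_length` bounds the prompt plus the continuation, while `max_new_tokens` bounds only the continuation. The helper below is a hypothetical, library-free illustration of that arithmetic and of the precedence the warning describes. It does not call `transformers`.

```python
def allowed_new_tokens(prompt_len, max_length=None, max_new_tokens=None):
    """Number of tokens generation may add, given the two length limits."""
    if max_new_tokens is not None:
        # Per the warning, max_new_tokens wins when both are set.
        return max_new_tokens
    if max_length is not None:
        # max_length counts the prompt tokens as well.
        return max(max_length - prompt_len, 0)
    raise ValueError("set max_length or max_new_tokens")

print(allowed_new_tokens(prompt_len=3, max_length=50))                      # 47
print(allowed_new_tokens(prompt_len=3, max_length=50, max_new_tokens=256))  # 256
```

Passing `max_new_tokens=50` in the pipeline call, rather than `max_length=50`, would make the intended cap explicit.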

# References and Resources

The following resources were used to guide and structure this project. They provided valuable insights into reward modeling, PPO optimization, and the implementation of advanced reinforcement learning techniques for language models:

- [GPT-2 Sentiment Analysis Notebook](https://github.com/huggingface/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb)
- [PPO Training Script](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo/ppo.py)
- [PPO TLDR Training Script](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo/ppo_tldr.py)
- [Reward Modeling Training Script](https://github.com/huggingface/trl/blob/main/examples/scripts/reward_modeling.py)
- [CleanRL GitHub Repository](https://github.com/vwxyzjn/cleanrl/tree/master)
- [Introduction to PPO and Reinforcement Learning for NLP](https://www.youtube.com/watch?v=hlv79rcHws0&ab_channel=MachineLearningwithPhil)
- [Reward Model Training Guide](https://medium.com/towards-generative-ai/reward-model-training-2209d1befb5f)
