<a href="https://colab.research.google.com/github/Papa-Panda/Paper_reading/blob/main/ppo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# 【零基础学习强化学习算法：ppo】 https://www.bilibili.com/video/BV1iz421h7gb/?share_source=copy_web&vd_source=985107e9bc8449878c67f709b64e7ad2

In [None]:
# no GAE, about advantage function
# is this actor-critic? (no training on critic)

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Set model to train on CPU
device = torch.device("cpu")
model.to(device)

# PPO Parameters
ppo_epochs = 4
clip_epsilon = 0.2
lr = 5e-5
gamma = 0.99

# Dataset for prompts and responses
class PromptDataset(Dataset):
    def __init__(self, prompts):
        self.prompts = prompts

    def __len__(self):
        return len(self.prompts)

    def __getitem__(self, idx):
        return self.prompts[idx]

# Example prompts
prompts = [
    "What is the capital of France?",
    "Explain the theory of relativity.",
    "Why is the sky blue?",
    "Tell me a joke about computers."
]
dataset = PromptDataset(prompts)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

# Reward simulation
def get_reward(response):
    """Simulated reward function."""
    if "Paris" in response:
        return 1.0  # Example: reward for correct answer
    elif "joke" in response:
        return 0.5  # Reward for mentioning a joke
    else:
        return 0.1  # Default reward

# PPO Training Loop
optimizer = optim.Adam(model.parameters(), lr=lr)

for epoch in range(3):  # Outer training loop
    for batch in dataloader:
        # Generate responses
        batch = [prompt for prompt in batch]
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True).to(device)
        with torch.no_grad():
            outputs = model.generate(input_ids=inputs["input_ids"], max_length=50, num_return_sequences=1)

        responses = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]

        # Calculate rewards
        rewards = torch.tensor([get_reward(response) for response in responses], dtype=torch.float32).to(device)

        # Prepare for PPO
        optimizer.zero_grad()

        # Train with PPO
        for _ in range(ppo_epochs):
            # Recompute old logits and old log probs

            # Compute old logits and old log probs (fixed policy before PPO updates)
            with torch.no_grad():
                old_logits = model(inputs["input_ids"], attention_mask=inputs["attention_mask"]).logits
                old_log_probs = torch.log_softmax(old_logits, dim=-1)
                old_log_probs = old_log_probs.gather(-1, inputs["input_ids"].unsqueeze(-1)).squeeze(-1)

            # Train with PPO
            for _ in range(ppo_epochs):
                # Compute new logits and new log probs (updated policy)
                new_logits = model(inputs["input_ids"], attention_mask=inputs["attention_mask"]).logits
                new_log_probs = torch.log_softmax(new_logits, dim=-1)
                new_log_probs = new_log_probs.gather(-1, inputs["input_ids"].unsqueeze(-1)).squeeze(-1)

                # Compute ratios
                ratios = torch.exp(new_log_probs - old_log_probs)

                # Expand advantages to match token-level shape
                advantages = rewards - rewards.mean()  # Shape: [batch_size]
                advantages = advantages.unsqueeze(-1)  # Shape: [batch_size, 1]

                # PPO loss
                surrogate1 = ratios * advantages
                surrogate2 = torch.clamp(ratios, 1 - clip_epsilon, 1 + clip_epsilon) * advantages
                loss = -torch.min(surrogate1, surrogate2).mean()

                # Backward and optimize
                optimizer.zero_grad()
                loss.backward()
                for name, param in model.named_parameters():
                    if param.requires_grad:  # Only track trainable parameters
                        print(f"Epoch {epoch}, Before Update, {name}: {param.view(-1)[0].item()}")  # Flatten and get first value
                        break

                optimizer.step()

                for name, param in model.named_parameters():
                    if param.requires_grad:  # Only track trainable parameters
                        print(f"Epoch {epoch}, After Update, {name}: {param.view(-1)[0].item()}")  # Flatten and get first value
                        break
        print(f"Epoch {epoch}, Loss: {loss.item()}")

print("Training complete!")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Epoch 0, Before Update, transformer.wte.weight: -0.11010301113128662
Epoch 0, After Update, transformer.wte.weight: -0.11005306243896484
Epoch 0, Before Update, transformer.wte.weight: -0.11005306243896484
Epoch 0, After Update, transformer.wte.weight: -0.11001960933208466
Epoch 0, Before Update, transformer.wte.weight: -0.11001960933208466
Epoch 0, After Update, transformer.wte.weight: -0.10999375581741333
Epoch 0, Before Update, transformer.wte.weight: -0.10999375581741333
Epoch 0, After Update, transformer.wte.weight: -0.10997258126735687
Epoch 0, Before Update, transformer.wte.weight: -0.10997258126735687
Epoch 0, After Update, transformer.wte.weight: -0.10994064062833786
Epoch 0, Before Update, transformer.wte.weight: -0.10994064062833786
Epoch 0, After Update, transformer.wte.weight: -0.10991303622722626
Epoch 0, Before Update, transformer.wte.weight: -0.10991303622722626
Epoch 0, After Update, transformer.wte.weight: -0.10988893359899521
Epoch 0, Before Update, transformer.wte.w