In [None]:
hf_tokens = "TOKEN"

In [None]:
from huggingface_hub import notebook_login

notebook_login()


Adapter (LORA implementation) - https://towardsdatascience.com/implementing-lora-from-scratch-20f838b046f1




# Fine-tune with PPO to output only positive movie reviews

hf link - https://huggingface.co/docs/trl/en/quickstart

https://github.com/hkproj/rlhf-ppo/blob/main/gpt_sentiment.py


https://www.philschmid.de/fsdp-qlora-llama3





### PPO logging information
---

## **Key Metrics to Monitor**

### **1. Reward Metrics**
- **env/reward_mean**: Average reward obtained from the environment. Alias `ppo/mean_scores`, used to monitor the reward model.
- **env/reward_std**: Standard deviation of the reward. Alias `ppo/std_scores`, used to monitor reward variability.
- **env/reward_dist**: Histogram distribution of rewards.

### **2. KL Divergence Metrics**
- **objective/kl**: Mean Kullback-Leibler (KL) divergence between the policy being trained and the frozen reference model, estimated on the sampled responses. Measures how far the policy has drifted from the reference.
- **objective/kl_dist**: Histogram distribution of KL divergence.
- **objective/kl_coef**: Coefficient for KL divergence in the objective function.
- **ppo/mean_non_score_reward**: Mean KL penalty added to the reward, i.e. the per-token KL estimate scaled by `objective/kl_coef` and negated. Keeps the new policy from deviating too far from the reference model.
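
As a rough sketch of how the KL penalty enters the reward, the snippet below assumes the simple per-token estimate `logprob - ref_logprob` and adds the reward-model score to the last token only. The helper name `kl_penalized_rewards` and all numbers are illustrative, and plain Python lists stand in for tensors.

```python
def kl_penalized_rewards(logprobs, ref_logprobs, score, kl_coef):
    """Return (per-token rewards, per-token KL estimates) for one response."""
    # Per-token KL estimate: log pi(token) - log pi_ref(token)
    kls = [lp - ref_lp for lp, ref_lp in zip(logprobs, ref_logprobs)]
    # Non-score reward: the negated, scaled KL estimate at every token
    rewards = [-kl_coef * kl for kl in kls]
    # The scalar score from the reward model is added to the last token only
    rewards[-1] += score
    return rewards, kls


logprobs = [-1.0, -0.5, -2.0]       # illustrative log-probs under the policy
ref_logprobs = [-1.5, -0.5, -1.0]   # illustrative log-probs under the reference
rewards, kls = kl_penalized_rewards(logprobs, ref_logprobs, score=2.0, kl_coef=0.2)
print(kls)
print(rewards)
```

Tokens where the policy is more confident than the reference receive a negative non-score reward, which is what pulls the policy back toward the reference.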

### **3. Entropy Metrics**
- **objective/entropy**: Entropy estimate of the model’s policy, calculated as `-logprobs.sum(-1).mean()`. High entropy indicates more randomness in actions, which is beneficial for exploration.
- **ppo/policy/entropy**: Entropy of the policy computed from the full logits: `pd = torch.nn.functional.softmax(logits, dim=-1)` gives the token probabilities, and the entropy is `torch.logsumexp(logits, dim=-1) - torch.sum(pd * logits, dim=-1)`.
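
The entropy-from-logits formula can be sketched without any tensor library. This is a minimal illustration for a single token position, using plain Python in place of `torch`; the function name is only for illustration.

```python
import math


def entropy_from_logits(logits):
    """Entropy (in nats) of the categorical distribution defined by `logits`."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))  # logsumexp
    probs = [math.exp(x - log_z) for x in logits]               # softmax
    # H = logsumexp(logits) - sum(p * logits)
    return log_z - sum(p * x for p, x in zip(probs, logits))


uniform = entropy_from_logits([0.0, 0.0, 0.0, 0.0])  # maximally random over 4 tokens
peaked = entropy_from_logits([10.0, 0.0, 0.0, 0.0])  # almost deterministic
print(uniform, peaked)
```

A uniform distribution over four tokens gives the maximum entropy `log(4)`, while a sharply peaked one gives a value close to zero.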

### **4. Policy Metrics**
- **ppo/policy/clipfrac**: Fraction of probability ratios falling outside the clipping range in the PPO objective.
- **ppo/policy/approxkl**: Approximate KL divergence between old and new policies (k2 estimator).
- **ppo/policy/policykl**: KL divergence measured by masked mean (k1 estimator).
- **ppo/policy/ratio**: Histogram distribution of the ratio between new and old policies.
- **ppo/policy/advantages_mean**: Average of the Generalized Advantage Estimation (GAE) advantage estimates.
- **ppo/policy/advantages**: Histogram distribution of GAE advantages.
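
The ratio-based diagnostics above can be sketched from the per-token log probabilities alone. This follows the descriptions in the list (ratios outside the clipping range, k2 and k1 estimators) and is not a copy of TRL's code, which applies masking and derives `clipfrac` from the clipped loss terms. The default `cliprange=0.2` is an assumption.

```python
import math


def policy_diagnostics(logprobs, old_logprobs, cliprange=0.2):
    """Ratio-based diagnostics over the tokens of one batch (no masking)."""
    n = len(logprobs)
    diffs = [new - old for new, old in zip(logprobs, old_logprobs)]
    ratios = [math.exp(d) for d in diffs]  # pi_new / pi_old per token
    clipped = [r < 1 - cliprange or r > 1 + cliprange for r in ratios]
    return {
        "ratio": ratios,
        "clipfrac": sum(clipped) / n,
        "approxkl": 0.5 * sum(d * d for d in diffs) / n,  # k2 estimator
        "policykl": sum(-d for d in diffs) / n,           # k1 estimator: mean(old - new)
    }


stats = policy_diagnostics(
    logprobs=[-1.0, -2.0, -0.5, -1.0],      # illustrative values
    old_logprobs=[-1.0, -1.5, -1.0, -1.1],
)
print(stats["clipfrac"], stats["approxkl"], stats["policykl"])
```

Note that the k1 estimator can be negative on a small sample, whereas the k2 estimator is non-negative by construction.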

### **5. Value Function Metrics**
- **ppo/returns/mean**: Mean of TD(λ) returns, calculated as `advantage + values`.
- **ppo/returns/var**: Variance of TD(λ) returns.
- **ppo/val/mean**: Mean of the value function.
- **ppo/val/var**: Variance of the value function.
- **ppo/val/var_explained**: Explained variance for the value function.
- **ppo/val/clipfrac**: Fraction of value function’s predicted values that are clipped.
- **ppo/val/vpred**: Predicted values from the value function.
- **ppo/val/error**: Mean squared error between predicted values and returns.
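
To make the relation `returns = advantage + values` concrete, here is a minimal sketch of GAE and explained variance for a single response. The defaults `gamma=1.0` and `lam=0.95` are assumptions, the value after the last token is taken to be zero, and the function names are illustrative.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """GAE advantages and TD(lambda) returns for one response."""
    advantages = [0.0] * len(rewards)
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def explained_variance(values, returns):
    """1 - Var(returns - values) / Var(returns); 1.0 means a perfect value function."""
    def var(xs):
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / len(xs)

    residuals = [r - v for r, v in zip(returns, values)]
    return 1.0 - var(residuals) / var(returns)


# With gamma = lam = 1 the returns reduce to the plain sum of future rewards
advantages, returns = gae_advantages([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], lam=1.0)
print(advantages, returns)
print(explained_variance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect predictions
print(explained_variance([2.0, 2.0, 2.0], [1.0, 2.0, 3.0]))  # constant predictions
```

An explained variance near 1 means the value head tracks the returns well; values near or below 0 mean it does no better than predicting a constant.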

### **6. Loss Metrics**
- **ppo/loss/policy**: Policy loss for PPO.
- **ppo/loss/value**: Loss for the value function.
- **ppo/loss/total**: Total loss: the policy loss plus the value function loss, with the value loss scaled by the `vf_coef` setting.
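
A per-token sketch of how the three losses relate, assuming the standard clipped PPO objective. The defaults `cliprange=0.2`, `cliprange_value=0.2` and `vf_coef=0.1` are assumptions, and the function name is hypothetical.

```python
def ppo_token_losses(ratio, advantage, value, old_value, ret,
                     cliprange=0.2, cliprange_value=0.2, vf_coef=0.1):
    """Clipped policy loss, clipped value loss and their weighted sum for one token."""
    def clip(x, lo, hi):
        return max(lo, min(hi, x))

    # Policy loss: pessimistic (max) of the unclipped and clipped surrogate
    pg_loss = max(-advantage * ratio,
                  -advantage * clip(ratio, 1 - cliprange, 1 + cliprange))
    # Value loss: keep the new prediction within cliprange_value of the old one
    value_clipped = clip(value, old_value - cliprange_value, old_value + cliprange_value)
    vf_loss = 0.5 * max((value - ret) ** 2, (value_clipped - ret) ** 2)
    return pg_loss, vf_loss, pg_loss + vf_coef * vf_loss


pg_loss, vf_loss, total = ppo_token_losses(
    ratio=1.5, advantage=1.0, value=1.0, old_value=0.5, ret=2.0
)
print(pg_loss, vf_loss, total)
```

Here the ratio of 1.5 exceeds the clipping range, so the policy loss is computed from the clipped ratio 1.2 and the token contributes no gradient that would push the ratio further.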

### **7. Token Metrics**
- **tokens/queries_len_mean**: Average length of query tokens.
- **tokens/queries_len_std**: Standard deviation of query token lengths.
- **tokens/queries_dist**: Histogram distribution of query token lengths.
- **tokens/responses_len_mean**: Average length of response tokens.
- **tokens/responses_len_std**: Standard deviation of response token lengths.
- **tokens/responses_dist**: Histogram distribution of response token lengths (the name is inconsistent with the other length metrics; `tokens/responses_len_dist` would match them).

### **8. Log Probability Metrics**
- **objective/logprobs**: Histogram distribution of log probabilities of actions taken by the model.
- **objective/ref_logprobs**: Histogram distribution of log probabilities of actions taken by the reference model.

---

## **Crucial Values**

### **For Reward and Policy Monitoring:**
- **env/reward_mean, env/reward_std, env/reward_dist**: Monitor the reward distribution from the reward model.
- **ppo/mean_non_score_reward**: Mean negated KL penalty during training; indicates the deviation between reference and new policy.

### **For Stability:**
- **ppo/loss/value**: Spikes or NaNs may indicate issues with value function estimation.
- **ppo/policy/ratio**: Ratio values should be around 1. High values (e.g., > 200) suggest overoptimization and potential instability.
- **ppo/policy/clipfrac, ppo/policy/approxkl**: High values suggest that the policy is deviating significantly from the old policy, which can lead to instability.
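- **objective/kl**: Should remain positive and moderate. Large values mean the policy is drifting far from the reference model; negative values usually point to a problem with how responses are generated (for example, sampling settings that distort the policy's distribution).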
- **objective/kl_coef**: Often increases before numerical instabilities arise.
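
The behaviour of `objective/kl_coef` is easier to read with the adaptive controller in mind. The class below is a minimal sketch of a proportional KL controller; the defaults (`init_kl_coef=0.2`, `target=6.0`, `horizon=10000`) and the step size of 128 are assumptions for illustration.

```python
class AdaptiveKLController:
    """Proportional controller that nudges kl_coef so the observed KL approaches `target`."""

    def __init__(self, init_kl_coef=0.2, target=6.0, horizon=10000):
        self.value = init_kl_coef
        self.target = target
        self.horizon = horizon

    def update(self, current_kl, n_steps):
        # Relative error, clipped to +/-20% so one batch cannot swing the coefficient
        error = max(-0.2, min(0.2, current_kl / self.target - 1))
        self.value *= 1 + error * n_steps / self.horizon


high = AdaptiveKLController()
high.update(current_kl=12.0, n_steps=128)  # KL above target -> coefficient grows
low = AdaptiveKLController()
low.update(current_kl=3.0, n_steps=128)    # KL below target -> coefficient shrinks
print(high.value, low.value)
```

A steadily rising coefficient therefore signals that the measured KL keeps exceeding the target, which is why it tends to precede instabilities.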

---



In [None]:
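# Note: the PPOTrainer API used below (PPOConfig(model_name=...), ppo_trainer.step)
# belongs to older TRL releases; it was replaced in trl 0.12, so pin an earlier version.
!pip install -q datasets "trl<0.12" wandb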

In [None]:
import torch
from tqdm import tqdm
import wandb

from transformers import pipeline, AutoTokenizer
from datasets import load_dataset

from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead, create_reference_model
from trl.core import LengthSampler

def build_dataset(config, dataset_name="imdb", input_min_text_length=2, input_max_text_length=8):
    # Build the training dataset: a series of prompts, each truncated to a
    # randomly chosen length. The prompts are used to generate the responses
    # and compute the rewards.
    tokenizer = AutoTokenizer.from_pretrained(config.model_name)
    tokenizer.pad_token = tokenizer.eos_token
    # Load the IMDB dataset
    ds = load_dataset(dataset_name, split="train")
    ds = ds.rename_columns({"text": "review"})
    # Only keep reviews longer than 200 characters (len() counts characters, not tokens)
    ds = ds.filter(lambda x: len(x["review"]) > 200, batched=False)

    input_size = LengthSampler(input_min_text_length, input_max_text_length)

    def tokenize(sample):
        # Sample the prompt length once so that truncation and padding agree
        size = input_size()
        # Keep only the first `size` tokens of the review; this is the prompt
        input_ids = tokenizer.encode(sample["review"])[:size]
        attention_mask = [1] * len(input_ids)  # 1 for every real token

        # Pad prompts shorter than the sampled length (rare after the filter above)
        if len(input_ids) < size:
            padding_length = size - len(input_ids)
            input_ids += [tokenizer.pad_token_id] * padding_length
            attention_mask += [0] * padding_length

        sample["input_ids"] = input_ids
        sample["attention_mask"] = attention_mask
        sample["query"] = tokenizer.decode(input_ids)
        return sample

    ds = ds.map(tokenize, batched=False)
    ds.set_format(type="torch")
    return ds


def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])


config = PPOConfig(
    model_name="lvwerra/gpt2-imdb",
    learning_rate=1.41e-5,
    num_shared_layers=6,
    log_with="wandb",
)

wandb.init(
        project="rlhf-load-dataset",
        config={
            "model_name": config.model_name,
            "learning_rate": config.learning_rate,
            "output_min_length": 4,
            "output_max_length": 16,
            "batch_size": 16,  # Example batch size
            "epochs": 10,  # Example number of epochs
        }
    )

dataset = build_dataset(config)
dataset

Token indices sequence length is longer than the specified maximum sequence length for this model (1168 > 1024). Running this sequence through the model will result in indexing errors


Dataset({
    features: ['review', 'label', 'input_ids', 'attention_mask', 'query'],
    num_rows: 24895
})
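
The `collator` used in this cell keeps each field as a Python list instead of stacking it into one tensor, which matters because the prompts have different lengths. A small sketch with made-up token ids (the ids and texts are illustrative only):

```python
def collator(data):
    # Same logic as the notebook's collator: list of dicts -> dict of lists
    return {key: [d[key] for d in data] for key in data[0]}


samples = [
    {"input_ids": [101, 7], "query": "This film"},
    {"input_ids": [42, 8, 15, 16], "query": "I rented this movie"},
]
batch = collator(samples)
print(batch["query"])
print([len(ids) for ids in batch["input_ids"]])
```

Each entry of `batch["input_ids"]` keeps its own length, so no padding is needed at collation time.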

In [None]:
# This is the model we are going to fine-tune with PPO
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
# This is the frozen reference model for the KL divergence. For memory efficiency it shares its first 6 layers with `model`.
# create_reference_model expects the model object itself, not the model name.
ref_model = create_reference_model(model, num_shared_layers=config.num_shared_layers)

tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, dataset=dataset, data_collator=collator)

device = ppo_trainer.accelerator.device
if ppo_trainer.accelerator.num_processes == 1:
    device = 0 if torch.cuda.is_available() else "cpu"  # to avoid a `pipeline` bug

# This is the reward model: a "positive" (e.g. a positive review) response will be given a high reward, a "negative" response will be given a low reward
sentiment_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb", device=device)


# Print some examples of sentiments generated by the reward model
sent_kwargs = {"return_all_scores": True, "function_to_apply": "none", "batch_size": 16}
text = "this movie was really bad!!"
print(sentiment_pipe(text, **sent_kwargs))

text = "this movie was really good!!"
print(sentiment_pipe(text, **sent_kwargs))  # raw logits: POSITIVE is about 2.56, NEGATIVE about -2.29



[[{'label': 'NEGATIVE', 'score': 2.3350484371185303}, {'label': 'POSITIVE', 'score': -2.726576328277588}]]
[[{'label': 'NEGATIVE', 'score': -2.294790267944336}, {'label': 'POSITIVE', 'score': 2.557040214538574}]]
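
The reward is the raw POSITIVE logit from outputs shaped like the ones above. A slightly more defensive way to pull it out is to look the score up by label rather than by list position; the helper name `positive_scores` is hypothetical.

```python
def positive_scores(pipe_outputs):
    """Pick the POSITIVE logit from each per-text list of {'label', 'score'} dicts."""
    rewards = []
    for scores in pipe_outputs:
        by_label = {item["label"]: item["score"] for item in scores}
        rewards.append(by_label["POSITIVE"])
    return rewards


# Outputs copied from the two example sentences above
pipe_outputs = [
    [{"label": "NEGATIVE", "score": 2.3350484371185303},
     {"label": "POSITIVE", "score": -2.726576328277588}],
    [{"label": "NEGATIVE", "score": -2.294790267944336},
     {"label": "POSITIVE", "score": 2.557040214538574}],
]
print(positive_scores(pipe_outputs))
```

Looking up by label keeps the reward correct even if the pipeline returns the classes in a different order.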


In [None]:

output_min_length = 4
output_max_length = 16
output_length_sampler = LengthSampler(output_min_length, output_max_length)

# The configuration to generate responses (trajectories)
response_generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
}


def batch_process_pipeline(pipeline, texts, batch_size=16, **kwargs):
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        results.extend(pipeline(batch, **kwargs))
    return results


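# Note: `epoch` here is the index of the batch (one PPO step), not a pass over the dataset.
# Wrapping the dataloader (rather than enumerate) lets tqdm show the total number of batches.
for epoch, batch in enumerate(tqdm(ppo_trainer.dataloader)):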
    query_tensors = batch["input_ids"]

    #### Phase 1: Get trajectories from the offline policy
    # In this case we are only generating the responses, but not computing the log probabilities, which will be computed internally by the PPOTrainer.
    response_tensors = []
    for query in query_tensors:
        gen_len = output_length_sampler()
        response_generation_kwargs["max_new_tokens"] = gen_len # Number of tokens to generate (chosen randomly)
        response = ppo_trainer.generate(query, **response_generation_kwargs) # It returns the (query + response) tokens
        response_tensors.append(response.squeeze()[-gen_len:]) # Only take the tokens corresponding to the generated response (remove the prompt/query from the beginning)
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]

    #### Phase 1: Compute rewards
    # Join the query (prompt) + response (generated tokens)
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    # Compute the reward for each of the texts (query + response)
    # shape: A list of dictionaries with two keys: POSITIVE and NEGATIVE. We are interested in the POSITIVE score. This will be our reward.
    pipe_outputs = batch_process_pipeline(sentiment_pipe, texts, **sent_kwargs)
    # The reward for each text is the score (logit) corresponding to the POSITIVE class.
    # shape: A list of scalars, one for each generated response.
    # It means we assign the reward to the whole response (not to each token).
    rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]

    #### Phase 1 + Phase 2: calculate the logprobs and then run the PPO update
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)

    # Log a summary metric to wandb; the full set of PPO stats is logged by log_stats.
    # (`stats` has no "logprobs" key; the log-prob histogram is stored under "objective/logprobs".)
    wandb.log({
        "epoch": epoch,
        "average_reward": torch.stack(rewards).mean().item(),
    })

    ppo_trainer.log_stats(stats, batch, rewards)

# Finish wandb run
wandb.finish()

0it [00:00, ?it/s]The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
1it [00:15, 15.88s/it]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
194it [54:52, 16.97s/it]



Run history:
average_reward,▁▂▂▄▅▆▆▆▆▇▇▆▇▇▆▇▇▆▇▇▇▇▇▇▇███▇▇█████▇███▇
env/reward_mean,▁▂▂▄▅▆▆▆▆▇▇▆▇▇▆▇▇▆▇▇▇▇▇▇▇███▇▇█████▇███▇
env/reward_std,█▇▆▇▇▆▆▆▅▄▅▅▃▃▅▄▅▇▃▄▆▃▄▄▅▃▃▃▃▄▁▂▃▃▃▄▃▃▃▃
epoch,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
objective/entropy,▆██▆▅▆▄▄▄▄▃▃▃▂▃▄▃▃▄▃▆▃▃▄▄▃▂▃▂▃▃▃▂▃▃▂▁▁▃▂
objective/kl,▁▁▂▃▄▅▅▅▅▅▆▅▆▆▅▅▆▆▆▆▇▆▇▇▆▇▇▇▆▇▇▆▆█▇▆▇█▇▇
objective/kl_coef,███▇▇▇▇▆▆▆▆▆▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▁▁▁▁
ppo/learning_rate,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
ppo/loss/policy,▁▄▄▆▆▆▆▅▆▅▆▇▆▆▆▆▅▄▇▆▅▇▇▅▆▇█▆▇▇█▇▆▇▇▆▆▇▇▅
ppo/loss/total,█▆▄▆▄▃▃▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▂▂▁▁▁▁▂▂▂▁▂▁

Run summary:
average_reward,2.18062
env/reward_mean,2.18062
env/reward_std,1.045
epoch,193.0
objective/entropy,29.58419
objective/kl,5.24842
objective/kl_coef,0.12533
ppo/learning_rate,1e-05
ppo/loss/policy,-0.02309
ppo/loss/total,-0.00575


In [None]:
model.save_pretrained("gpt2-imdb-pos-v2", push_to_hub=True)
tokenizer.save_pretrained("gpt2-imdb-pos-v2", push_to_hub=True)


('gpt2-imdb-pos-v2/tokenizer_config.json',
 'gpt2-imdb-pos-v2/special_tokens_map.json',
 'gpt2-imdb-pos-v2/vocab.json',
 'gpt2-imdb-pos-v2/merges.txt',
 'gpt2-imdb-pos-v2/added_tokens.json',
 'gpt2-imdb-pos-v2/tokenizer.json')

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name_or_path = "pritam3355/gpt2-imdb-pos-v2"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(model_name_or_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

# Calling the tokenizer returns both input_ids and attention_mask
inputs = tokenizer("I'm not sure of the movie but I think it's", return_tensors="pt").to(device)
# Pass the attention mask and pad token explicitly so generate() does not have to guess them
outputs = model.generate(**inputs, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))
model

Some weights of the model checkpoint at pritam3355/gpt2-imdb-pos-v2 were not used when initializing GPT2LMHeadModel: ['v_head.summary.bias', 'v_head.summary.weight']
- This IS expected if you are initializing GPT2LMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPT2LMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


I'm not sure of the movie but I think it's great. It's a great movie,


GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
