
## Task

$\text{In this quiz, we will implement a simple REINFORCE algorithm to fine-tune a GPT-2 model variant.}$

$\text{We will consider the entire text generation as a single action and optimize it to produce responses more similar to preferred answers.}$

$\text{We assume that we have a prompt with a corresponding good and bad response.}$

$\text{We will train the model using the REINFORCE algorithm so that it learns to generate the preferred response for the prompt.}$

## Step 1. Utility (Do not change)

In [1]:
import os
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from torch.optim import AdamW
import torch.nn.functional as F
import random
import numpy as np

seed = 42
torch.manual_seed(seed)
random.seed(seed)
np.random.seed(seed)
torch.use_deterministic_algorithms(True)

from transformers import logging
logging.set_verbosity_warning()
logging.set_verbosity_error()

## Step 2. Load Model and Tokenizer (Q3)

We load an instruction-tuned variant of the GPT-2 model and its corresponding tokenizer. The output of the cell shows the model architecture.

In [3]:
model_name = "vicgalle/gpt2-open-instruct-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50260, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50260, bias=False)
)
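
As a rough cross-check of the printed architecture, the parameter count can be tallied by hand from the layer shapes. The sketch below assumes that each `Conv1D` and `LayerNorm` carries a bias term and that `lm_head` shares its weight with `wte` (the usual GPT-2 setup, not verified here for this checkpoint). The helper names `conv1d_params` and `layernorm_params` are illustrative and not part of the notebook.

```python
# Shapes read off the printed GPT2LMHeadModel above.
VOCAB, CTX, D, D_FF, N_LAYERS = 50260, 1024, 768, 3072, 12

def conv1d_params(nx: int, nf: int) -> int:
    # Assumption: Conv1D stores an (nx, nf) weight plus an nf-sized bias.
    return nx * nf + nf

def layernorm_params(d: int) -> int:
    # elementwise_affine=True means one scale and one shift per feature.
    return 2 * d

block = (
    layernorm_params(D)            # ln_1
    + conv1d_params(D, 3 * D)      # attn.c_attn (nf=2304)
    + conv1d_params(D, D)          # attn.c_proj
    + layernorm_params(D)          # ln_2
    + conv1d_params(D, D_FF)       # mlp.c_fc
    + conv1d_params(D_FF, D)       # mlp.c_proj
)

embeddings = VOCAB * D + CTX * D   # wte + wpe
total = embeddings + N_LAYERS * block + layernorm_params(D)  # + ln_f

# Assumption: lm_head is tied to wte, so it adds no extra parameters.
print(f"per block: {block:,}")
print(f"total:     {total:,}")
```

On the loaded model, `sum(p.numel() for p in model.parameters())` gives the figure to compare against; a mismatch would suggest that one of the assumptions above does not hold for this checkpoint.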

## Step 3. Human Rating

For a single prompt, we first generate two different responses with the GPT-2 model. Then we ask a human to provide a feedback score for each.

In [4]:
model.eval()

generated_responses = []

for t in range(2):

  question = "Why does the Earth experience different seasons throughout the year?"

  prompt_template = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.\n
  ### Instruction:\n
  {question}
  \n### Response:\n"""


  input_ids = tokenizer(prompt_template, return_tensors="pt").input_ids
  input_ids = input_ids.to(device)
  output = model.generate(input_ids, max_new_tokens=50, do_sample=True)

  generated_answer = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
  generated_responses.append(generated_answer)

  print(f"Response {t+1}:", generated_answer)

Response 1: The Earth experience different seasons throughout the year because it is the only place on Earth where the seasons are determined by the Earth’s atmosphere. Additionally, there is not a single consistent pattern in the Earth's climate that has caused different seasons to be
Response 2: The Earth experiences different seasons throughout the year because of different factors. For example, spring and summer will often be at different times of year, in many parts of the year. In some parts of the year, spring may actually be in the middle of


$\text{Score both responses (1-100). The better response should have the higher score.}$

In [5]:
# Human Rating (You may change)
response_1_rating = 20
response_2_rating = 2

# Keep track of the better response index for later use (do not change)
if response_1_rating > response_2_rating:
  preferred_idx = 0
else:
  preferred_idx = 1

## Step 4. Reward Function (Q4)


Assume we have a reward function that compares the pairs (generated response, good response) and (generated response, bad response) using cosine similarity.

If the cosine similarity of (generated response, good response) is higher, the reward function returns +1; otherwise it returns -1.
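
To make the rule concrete, here is a toy illustration with hand-picked 2-D vectors standing in for sentence embeddings. The vectors and the helper names `cosine` and `toy_reward` are made up for this example and are not part of the notebook.

```python
import math

def cosine(a, b):
    # cos(a, b) = a·b / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def toy_reward(gen, pref, rej):
    # +1 if the generated vector is closer to the preferred one, else -1
    return 1 if cosine(gen, pref) > cosine(gen, rej) else -1

gen, pref, rej = [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]
print(round(cosine(gen, pref), 4))  # 0.7071
print(cosine(gen, rej))             # 0.0
print(toy_reward(gen, pref, rej))   # 1
```

Because the comparison is strict, a tie between the two similarities falls into the -1 branch.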

In [6]:
from sentence_transformers import SentenceTransformer, util
r_model = SentenceTransformer('all-MiniLM-L6-v2')

# Compute cosine similarity
def cosine_sim(generated_vec, preferred_vec):
    a = np.array(generated_vec.cpu())
    b = np.array(preferred_vec.cpu())

    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)

    ###<--- Write code here
    sim = np.dot(a, b) / (norm_a * norm_b)           # cos(a, b) = a·b / (|a| * |b|)
    ###

    return sim

def reward_function(generated, preferred, rejected):
    embedding_gen = r_model.encode(generated, convert_to_tensor=True)
    embedding_pref = r_model.encode(preferred, convert_to_tensor=True)
    embedding_rej = r_model.encode(rejected, convert_to_tensor=True)

    score_gen_pref = cosine_sim(embedding_gen, embedding_pref)
    score_gen_rej = cosine_sim(embedding_gen, embedding_rej)

    print(score_gen_pref, score_gen_rej)

    if score_gen_pref > score_gen_rej:
        return 1
    else:
        return -1

In [7]:
gen = "In the year of February, the Earth experiences only two different seasons in a month - December (January to October), and March (February to April). Each of these seasons influences the climate of the planet"
good = "The earth's spin axis is tilted with respect to its orbital plane. This is what causes the seasons."
bad = "The Earth has seasons because it moves around."


reward_function(gen, good, bad)

0.6314019 0.63855225


-1

## Step 5. REINFORCE learning (Q6, Q7, Q8)

The gradient of the training objective (expected reward) is:

$\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \left[ R(y, x) \nabla_\theta \log \pi_\theta(y \mid x) \right]$

This is the policy gradient for the REINFORCE algorithm, where:

$x$ is the prompt/input

$y$ is the generated sequence

$R(y,x)$ is the reward function

$\pi_\theta(y \mid x)$ is the policy (our language model)

$\nabla_\theta \log \pi_\theta(y \mid x)$ is the gradient of the log probability of the entire sequence with respect to model parameters

We want to maximize the expected reward, but optimization algorithms typically minimize. So, we minimize the negative of the objective (equivalent to maximizing the objective).
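
The link between the loss and the gradient can be checked numerically on a toy policy. The sketch below uses a three-way softmax over made-up logits in place of the language model and compares the analytic gradient of $-R \log \pi_\theta(y)$ with a finite-difference estimate. All names and values are illustrative.

```python
import math

def log_softmax(z):
    # Numerically stable log-softmax over a list of logits
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def loss(z, action, reward):
    # Surrogate loss: -R * log pi(action)
    return -reward * log_softmax(z)[action]

def analytic_grad(z, action, reward):
    # For a softmax policy, d log pi(a) / d z_k = 1[k == a] - pi(k)
    probs = [math.exp(lp) for lp in log_softmax(z)]
    return [-reward * ((1.0 if k == action else 0.0) - probs[k])
            for k in range(len(z))]

def numeric_grad(z, action, reward, eps=1e-6):
    # Central finite differences, one logit at a time
    grads = []
    for k in range(len(z)):
        z_plus = list(z)
        z_plus[k] += eps
        z_minus = list(z)
        z_minus[k] -= eps
        diff = loss(z_plus, action, reward) - loss(z_minus, action, reward)
        grads.append(diff / (2 * eps))
    return grads

logits, action = [0.5, -1.0, 2.0], 0
for reward in (1, -1):
    print(reward, analytic_grad(logits, action, reward),
          numeric_grad(logits, action, reward))
```

With reward +1 the gradient on the chosen action's logit is negative, so a gradient-descent step raises that logit and makes the action more likely; with reward -1 the sign flips and the action becomes less likely.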



$$\text{We follow the steps below for REINFORCE training:}$$

We repeat the following steps for several epochs:

1. For the prompt (x), take the preferred answer and the rejected answer.
2. Format the prompt using an instruction template structure.
3. Tokenize the formatted prompt template and generate a response (y) with greedy decoding (no sampling).
4. Compute a reward score R(y,x) by comparing the generated text to the preferred and rejected answers.

5. Combine the prompt template with the generated text and pass the full sequence through the model to get logits

6. Identify where the generated response starts. Extract logits for only the generated part. Extract target tokens for only the generated part.

7. Compute log probabilities of each token in the generated response

8. Sum the log probabilities to get the entire sequence log probability

9. Calculate the REINFORCE loss: -reward * log_prob_sum.
This is the surrogate loss -R(y,x) * log π(y|x), whose gradient is the negative of the policy gradient above.
The negative sign turns reward maximization into loss minimization.

10. Perform backpropagation and update the model parameters.
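
Steps 5 to 9 hinge on an off-by-one alignment: the logits at position t score the token at position t+1. The minimal sketch below uses plain Python lists in place of tensors, with a toy vocabulary of 3 tokens and made-up logits and ids, to show the slicing. The function name `reinforce_loss` is illustrative and not part of the notebook.

```python
import math

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]

def reinforce_loss(logits, ids, prompt_length, reward):
    # logits[t] scores the token at position t + 1, so the rows that
    # predict the generated tokens start one step before the response.
    shifted_logits = logits[prompt_length - 1:-1]
    shifted_targets = ids[prompt_length:]
    assert len(shifted_logits) == len(shifted_targets)
    # Pick the log-probability of each generated token and sum them
    log_prob_sum = sum(
        log_softmax(row)[tok]
        for row, tok in zip(shifted_logits, shifted_targets)
    )
    return -reward * log_prob_sum, log_prob_sum

ids = [0, 2, 1, 1]                 # 2 prompt tokens + 2 generated tokens
logits = [[0.0, 0.0, 0.0]] * 4     # uniform toy logits, one row per position
loss, log_prob_sum = reinforce_loss(logits, ids, prompt_length=2, reward=1)
print(log_prob_sum)  # equals 2 * log(1/3) for uniform logits
print(loss)
```

With a reward of -1 the loss changes sign, so minimizing it pushes the sequence log-probability down instead of up.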

In [8]:
# Initialize optimizer
model.train()
optimizer = AdamW(model.parameters(), lr=1e-5)
total_loss = 0.0
epochs = 10

# Repeat training
for t in range(epochs):
    print("Epoch: ", t+1)
    optimizer.zero_grad()  # Reset gradients

    # For the prompt (x), we take the preferred answer and rejected answer.
    prompt = "Why does the Earth experience different seasons throughout the year?"
    preferred = generated_responses[preferred_idx]
    rejected = generated_responses[1-preferred_idx]

    # Format the prompt using an instruction template structure
    prompt_template = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.\n
    ### Instruction:\n
    {prompt}
    \n### Response:\n"""

    # Generate response
    prompt_input_ids = tokenizer(prompt_template, return_tensors="pt").input_ids   # Tokenize the formatted prompt template
    prompt_input_ids = prompt_input_ids.to(device)
    output_ids = model.generate(prompt_input_ids, max_new_tokens=50, do_sample=False) # Generate a response of at most 50 new tokens with greedy decoding (no sampling)
    generated = tokenizer.decode(output_ids[0][prompt_input_ids.shape[1]:], skip_special_tokens=True)  # Extract the generated tokens, starting after the prompt

    # Compute reward

    ###<--- Write code here
    reward = reward_function(generated, preferred, rejected)  # Reward is +1 if the generated text is closer to the preferred response, else -1
    ###


    # Combine the prompt template with the generated text and pass the full sequence through the model to get logits
    full_input = tokenizer(prompt_template + " " + generated, return_tensors="pt")
    full_input_ids = full_input["input_ids"].to(device)
    outputs = model(full_input_ids)
    logits = outputs.logits

    prompt_length = prompt_input_ids.shape[1]            # Identify where the generated response starts
    shifted_logits = logits[:, prompt_length-1:-1, :]    # Extract logits for only the generated part
    shifted_targets = full_input_ids[:, prompt_length:]  # Extract target tokens for only the generated part

    # Compute log probabilities of each token in the generated response

    ###<--- Write code here
    log_probs = F.log_softmax(shifted_logits, dim=-1)  # Log probabilities over the vocabulary at each generated position
    ###

    selected_log_probs = log_probs.gather(2, shifted_targets.unsqueeze(-1)).squeeze(-1)

    # Sum the log probabilities to get the sequence log probability
    log_prob_sum = selected_log_probs.sum()

    # REINFORCE loss: -reward * log_prob


    ###<--- Write code here
    loss = -reward * log_prob_sum  # Optimizers minimize, so we use -R(y,x) * log π_θ(y|x), whose gradient is -R(y,x) * ∇_θ log π_θ(y|x)
    ###


    # Backpropagation
    loss.backward()
    optimizer.step()

    print(f"Prompt: {prompt}")
    print(f"Generated: {generated.strip()}")
    print(f"Reward: {reward:.2f}, LOG: {log_prob_sum}, Loss: {loss.item():.2f}")
    print("="*50)
    total_loss += loss.item()

print(f"Average loss: {total_loss/epochs:.2f}")

Epoch:  1
0.9255813 0.8382578
Prompt: Why does the Earth experience different seasons throughout the year?
Generated: The Earth experiences different seasons throughout the year because of the Earth's rotation. The Earth rotates every year, and the seasons change over time. This means that the seasons are constantly changing, and the seasons are constantly changing. This means that the seasons
Reward: 1.00, LOG: -54.17591094970703, Loss: 54.18
Epoch:  2
0.9077575 0.83762085
Prompt: Why does the Earth experience different seasons throughout the year?
Generated: The Earth experiences different seasons throughout the year because of the Earth's rotation. This rotation is caused by the Earth's rotation, which is the rate of change in the Earth's orbit around the Sun. The Earth's rotation is also influenced by the Earth
Reward: 1.00, LOG: -46.52693176269531, Loss: 46.53
Epoch:  3
0.88447267 0.8362254
Prompt: Why does the Earth experience different seasons throughout the year?
Generated: The

## Step 6. Test the learned model

In [None]:
model.eval()

question = "Why does the Earth experience different seasons throughout the year?"

prompt_template = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.\n
### Instruction:\n
{question}
\n### Response:\n"""

print(prompt_template)

input_ids = tokenizer(prompt_template, return_tensors="pt").input_ids
input_ids = input_ids.to(device)
output = model.generate(input_ids, max_new_tokens=50, do_sample=False)
generated_answer = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
print("Generated:", generated_answer)

In [None]:
reward_function(generated_answer, generated_responses[preferred_idx], generated_responses[1-preferred_idx])