### RLHF Demo
Run post training on a pretrained GPT-2 model to understand RLHF. Steps will be SFT -> train reward model -> run grpo on pretrained llm on reward model. Rather than using TRL, I will be implementing grpo myself. Implementation will start with single gpu and then scaled to distributed system.

In [46]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer, DataCollatorForLanguageModeling

model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

model.eval()

prompt = "The usual weather in California is"
inputs = tokenizer.encode(prompt, return_tensors='pt')

with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_length=1000,
        num_return_sequences=1,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


The usual weather in California is a bit of a mess.

The sun is shining, but the clouds are still thick.

The sun is shining, but the clouds are still thick.

The sun is shining, but the clouds are still thick.

The sun is shining, but the clouds are still thick.

The sun is shining, but the clouds are still thick.

The sun is shining, but the clouds are still thick.

The sun is shining, but the clouds are still thick.

The sun is shining, but the clouds are still thick.

The sun is shining, but the clouds are still thick.

The sun is shining, but the clouds are still thick.

The sun is shining, but the clouds are still thick.

The sun is shining, but the clouds are still thick.

The sun is shining, but the clouds are still thick.

The sun is shining, but the clouds are still thick.

The sun is shining, but the clouds are still thick.

The sun is shining, but the clouds are still thick.

The sun is shining, but the clouds are still thick.

The sun is shining, but the clouds are still t

### Supervised fine tuning

What I need to do:
1. Preprocess data into chat template with EOS token. Ensure data is padded and make sure batches are truncated to fit context length.
2. Iterate through every batch and for each one calculate the loss (ONLY on the last assistant completion so the model learns prompt prediction). We use cross entropy btw.
3. Run a number of epochs on it.
4. Keep single threaded till we implement grpo as well.

In [157]:
from datasets import load_dataset, load_dataset_builder, get_dataset_split_names
from torch.utils.data import DataLoader
from pprint import pprint

# ---------------
# hyperparameters
num_epochs = 5
batch_size = 2
# ---------------

# create dataset train/val/test splits
train_sft_dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split='train_sft').select(range(1000))
train_split_size = int(0.9 * len(train_sft_dataset))
train_split = train_sft_dataset.select(range(train_split_size))
val_split = train_sft_dataset.select(range(train_split_size, len(train_sft_dataset)))

# create chat template for tokenizer to use, gpt2 uses eos token so we need to add that as well
tokenizer.chat_template = """
{%- for message in messages %}
    {{- '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- else %}
    {{- eos_token }}
{%- endif %}
"""

# preprocess data and create dataloader
ending_msg_token_len = len(tokenizer.encode('<|im_end|>\n'))
def add_chat_tem(example):
    # convert to chat template and keep track of # of tokens in last generation
    enc_chat_tem_ex = tokenizer.apply_chat_template(example['messages'], tokenize=True, add_special_tokens=False)
    example['input_ids'] = enc_chat_tem_ex
    end_size = (len(tokenizer.encode(example['messages'][-1]['content'], add_special_tokens=False)) + ending_msg_token_len)
    last_gen_start_ind = len(enc_chat_tem_ex) - end_size
    example['last_gen_start_ind'] = last_gen_start_ind
    return example


train_split = train_split.map(add_chat_tem)
val_split = val_split.map(add_chat_tem)

# create custom collator for sft
class DataCollatorForSFT(DataCollatorForLanguageModeling):
    def __call__(self, features, return_tensors=None):
        last_gen_start_inds = [example['last_gen_start_ind'] for example in features]
        features = [{'input_ids': example['input_ids']} for example in features]
        batch = super().__call__(features, return_tensors=return_tensors)
        # scrappy but just assume we're calling with return_tensors='pt'
        batch['last_gen_start_inds'] = torch.tensor(last_gen_start_inds)

        return batch


data_collator = DataCollatorForSFT(
    tokenizer=tokenizer,
    mlm=False,
    return_tensors='pt'
)

train_dataloader = DataLoader(
    train_split,
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator
)

In [158]:
'''
BxT -> BxTxP
A couple of issues I need to work out:
1. Padding inputs_ids in each prompt in the match correctly and ensuring attention mask is right
2. After we feed the prompt, we need to remove everything but the corresponding amount of tokens from the end of the prompt corresponding to the last assistant generation. We need to calculate labels that way simimlarly.
3. Calculate the loss for the sliced part (but each batch inner matrix is different size so need to think about that).
4. Backprop on that to train the model.
'''

# training run
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        input_ids = batch['input_ids']
        last_gen_start_inds = batch['last_gen_start_inds']

        print("shape:", input_ids.shape)

        print("\n\n\n\n")

        first_prompt_input_ids = batch['input_ids'][0]
        print(len(first_prompt_input_ids))
        first_prompt_last_gen_token_len = batch['last_gen_start_inds'][0].item()
        print(first_prompt_last_gen_token_len)

        print("\n\n\n\n")
        print(tokenizer.decode(first_prompt_input_ids[:first_prompt_last_gen_token_len]))


        # mask labels that aren't included in last gen
        mask = torch.arange(input_ids.shape[1]) < last_gen_start_inds[:, None]
        print("\n\n\n\n")
        print(mask)
        input_ids[mask] = -100


        break
    break

FEATURES LENGTH: 2
DIFFERENT LENGTHS: [1106, 1581]
BATCH SIZE BEFORE: 2
BATCH SIZE IN COLLATOR:  2
shape: torch.Size([2, 1581])





1581
820





<|im_start|>user
If you are not comfortable using spreadsheets and running scripts to create shared folders with your students, there is another simpler way. It helps if you do this at the beginning of a class or at least all together as a class so you can help people who get stuck. Once they share a folder with you, anything they put in the folder will be shared with you so that you can edit it.
Have everyone go to their Google Drive page. Select “Create” and then click “Folder”.
The students should name their folder a specific way so that you can easily sort them later. I would suggest “Period-LastName-FirstName-Semester”.
They should then select the “Share” button.
They will have to start typing in your name but it should automatically bring up your account to share with. REMIND THEM TO UNCLICK THE “NOTIFY PEOPLE VIA EMAIL” box otherwise 

### Create the reward model

In [10]:
# The reward model generates a single logit representing the probability of that response (we use Bradley-Terry model of preferences)
class RewardModel(nn.Module):
    def __init__(self, input_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, input_size * 4)
        self.fc2 = nn.Linear(input_size * 4, 1)

'''
Preference training loop steps:
1. Iterate through synthetic data of preferences.
2. Use the final hidden state last embedding as input to reward network.
3. Do that for both (otherwise you can use synthetic data and feed into the network the prompt + the response)
4. Use that to get the embeddings for both (just do one pass)
5. Calculate the loss based on Bradley-Terry and do backprop on the reward model network (you can mean the sums so you do it batchwise)

The idea we'll use is take the prompt reward string to get rewrd but only train on prompt to avoid mismatch problem.
'''


"\nPreference training loop steps:\n1. Iterate through synthetic data of preferences.\n2. Use the final hidden state last embedding as input to reward network.\n3. Do that for both (otherwise you can use synthetic data and feed into the network the prompt + the response)\n4. Use that to get the embeddings for both (just do one pass)\n5. Calculate the loss based on Bradley-Terry and do backprop on the reward model network (you can mean the sums so you do it batchwise)\n\nThe idea we'll use is take the prompt reward string to get rewrd but only train on prompt to avoid mismatch problem.\n"

In [16]:
from datasets import load_dataset

train_data = load_dataset('HuggingFaceH4/no_robots')['train']
train_data[0]

{'prompt': 'Please summarize the goals for scientists in this text:\n\nWithin three days, the intertwined cup nest of grasses was complete, featuring a canopy of overhanging grasses to conceal it. And decades later, it served as Rinkert’s portal to the past inside the California Academy of Sciences. Information gleaned from such nests, woven long ago from species in plant communities called transitional habitat, could help restore the shoreline in the future. Transitional habitat has nearly disappeared from the San Francisco Bay, and scientists need a clearer picture of its original species composition—which was never properly documented. With that insight, conservation research groups like the San Francisco Bay Bird Observatory can help guide best practices when restoring the native habitat that has long served as critical refuge for imperiled birds and animals as adjacent marshes flood more with rising sea levels. “We can’t ask restoration ecologists to plant nonnative species or to 