<a href="https://colab.research.google.com/github/QinyanGong/2024RL/blob/main/Huawei_Research_London_Coding_Interview_LLMFT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a Custom Trainer Based on Huggingface

## Main Task
In this second part, you need to implement a trainer based on the Huggingface Trainer class.
You need to extend the existing Huggingface Trainer class so that it uses a different loss function.
Below is the link to the documentation of the Trainer class: https://huggingface.co/docs/transformers/main/en/trainer
Note that you need 4-8GB of CPU RAM.
We do not expect you to run the full training, only to implement the necessary components for training.

## LLM
For this exercise you need to use the Qwen/Qwen1.5-0.5B-Chat model.
Note: THIS IS A CHAT MODEL. Please read carefully how to use this model in: https://huggingface.co/Qwen/Qwen1.5-0.5B-Chat

## Dataset

You are given the training dataset, where each example is a dictionary that contains two keys:
* The first key is 'text', which contains a multi-turn conversation in natural language that will be used as input to the LLM. The text is a `list[dict]`, i.e. a list of dictionaries. Each dictionary has two keys: 'role', which can be 'system', 'user', or 'assistant', and 'content', which is the content of the message. 'system' corresponds to the system prompt of the LLM, 'user' to the text that is input to the LLM, and 'assistant' to the response of the LLM.
```
chat = [
    {"role": "system", "content": "You a helpful assisant"},
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "Fine. How can I help you today?"},
    {"role": "user", "content": "What is the circumference of earth?"},
    {"role": "assistant", "content": "It is 40,075 kms."},]
```

* The second key is 'reward', which is a scalar number between -1 and 1 that indicates whether the specific example is good or not.
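
As a quick sanity check on this structure, the sketch below validates one example and collects its assistant turns. The helper names `validate_example` and `assistant_contents` are hypothetical and not part of the task; the sketch uses only the Python standard library and assumes the keys and value ranges described above.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_example(example):
    """Return True if `example` matches the description above, else raise ValueError."""
    messages = example["text"]
    reward = example["reward"]
    if not isinstance(messages, list) or not messages:
        raise ValueError("'text' must be a non-empty list of messages")
    for message in messages:
        if set(message) != {"role", "content"}:
            raise ValueError(f"unexpected message keys: {sorted(message)}")
        if message["role"] not in ALLOWED_ROLES:
            raise ValueError(f"unknown role: {message['role']!r}")
    if not -1 <= reward <= 1:
        raise ValueError(f"reward out of range: {reward}")
    return True


def assistant_contents(example):
    """Collect the contents of the assistant turns, in order."""
    return [m["content"] for m in example["text"] if m["role"] == "assistant"]


example = {
    "text": [
        {"role": "user", "content": "Hello, how are you?"},
        {"role": "assistant", "content": "Fine. How can I help you today?"},
    ],
    "reward": 1,
}
validate_example(example)
print(assistant_contents(example))
```

Running a check like this before tokenization makes malformed examples fail early rather than inside the training loop.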

## Optimization Objective

You need to implement the undiscounted REINFORCE algorithm. It is an extension of SFT that takes into account the reward that is assigned to the trajectory.
The loss function is
$$ L = - \frac{1}{B} \sum_{b=1}^{B}\sum_{t \in seq[b]} \left( reward[b] \cdot \log p(x[b][t] \mid x[b][:t]) \right) \textrm{ if $x[b][t]$ is one of the assistant's tokens}$$

The loss is 0 otherwise.
In practice, this means that, given the aforementioned chat example, the loss is non-zero only for the following pieces of text:
```
{"role": "assistant", "content": "Fine. How can I help you today?"},
and
{"role": "assistant", "content": "It is 40,075 kms."}
```
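
To make the formula concrete, here is a small framework-free worked example in plain Python. It is a sketch under stated assumptions, not the expected solution: `reinforce_loss` and `log_softmax` are hypothetical helpers, the vocabulary has only two tokens, and `batch_logits[b][t]` is assumed to already be the model's score vector for token `x[b][t]` given the tokens before it.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def reinforce_loss(batch_logits, batch_tokens, batch_mask, batch_rewards):
    """Reward-weighted negative log-likelihood over assistant tokens, divided by batch size."""
    total = 0.0
    for logits, tokens, mask, reward in zip(
        batch_logits, batch_tokens, batch_mask, batch_rewards
    ):
        for step_logits, token, is_assistant in zip(logits, tokens, mask):
            if is_assistant:  # non-assistant tokens contribute 0
                total += reward * log_softmax(step_logits)[token]
    return -total / len(batch_logits)


# Toy batch: uniform scores over a 2-token vocabulary, so every log p equals -ln 2.
batch_logits = [
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],  # sequence 0, three tokens
    [[0.0, 0.0], [0.0, 0.0]],              # sequence 1, two tokens
]
batch_tokens = [[0, 1, 0], [1, 1]]
batch_mask = [[0, 1, 1], [0, 1]]  # 1 marks assistant tokens
batch_rewards = [1, -1]

loss = reinforce_loss(batch_logits, batch_tokens, batch_mask, batch_rewards)
print(loss)
```

Here sequence 0 contributes two assistant tokens with reward 1 and sequence 1 contributes one assistant token with reward -1, so the sum inside the loss is `-2 ln 2 + ln 2 = -ln 2` and the loss is `ln 2 / 2`.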



In [None]:
from huggingface_hub import notebook_login
notebook_login()


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model_id = "Qwen/Qwen1.5-0.5B-Chat"
dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=dtype,
)


You are given the following data, which has to be loaded in order to be used by the Huggingface transformers Trainer. We recommend using the datasets library:
https://huggingface.co/docs/datasets/en/index

In [None]:
data =[{'text':
    [{"role": "system", "content": "You a helpful assisant"},
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "Fine. How can I help you today?"},
    {"role": "user", "content": "What is the circumference of earth?"},
    {"role": "assistant", "content": "It is 40,075 kms."}],
         'reward': 1},
        {'text':
    [{"role": "system", "content": "You a helpful assisant"},
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "Fine. How can I help you today?"},
    {"role": "user", "content": "What is the shape of earth?"},
    {"role": "assistant", "content": "Earth is a square"}],
         'reward': -1}]
print(data)

[{'text': [{'role': 'system', 'content': 'You a helpful assisant'}, {'role': 'user', 'content': 'Hello, how are you?'}, {'role': 'assistant', 'content': 'Fine. How can I help you today?'}, {'role': 'user', 'content': 'What is the circumference of earth?'}, {'role': 'assistant', 'content': 'It is 40,075 kms.'}], 'reward': 1}, {'text': [{'role': 'system', 'content': 'You a helpful assisant'}, {'role': 'user', 'content': 'Hello, how are you?'}, {'role': 'assistant', 'content': 'Fine. How can I help you today?'}, {'role': 'user', 'content': 'What is the shape of earth?'}, {'role': 'assistant', 'content': 'Earth is a square'}], 'reward': -1}]


Please note that we do not expect to run the full training. We will just check whether the code is generally correct. You are free to use any library such as transformers, trl, etc for your implementation.

In [None]:
from torch import nn
from transformers import Trainer

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        # forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # compute custom loss for 3 labels with different weights
        loss_fct = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0, 3.0], device=model.device))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

In [None]:
import torch.nn.functional as F
def pad_tensor(tensor, max_length):
    """
    Pads or truncates a tensor to the specified max_length.
    If the tensor is shorter than max_length, it pads with 0s.
    If the tensor is longer than max_length, it truncates.
    """
    if tensor.size(0) < max_length:
        # Pad tensor if it's shorter than max_length
        return F.pad(tensor, (0, max_length - tensor.size(0)), value=0)
    else:
        # Truncate tensor if it's longer than max_length
        return tensor[:max_length]

In [None]:
def process_data(data, max_length=128):
    input_ids_list = []
    labels_list = []
    role_ids_list = []
    rewards_list = []
    # data is a list of {'text': [messages], 'reward': scalar} dictionaries
    for conversation in data:
        reward = conversation['reward']
        full_text = ""
        role_ids = []

        # Construct the sequence for tokenization
        # NOTE: the turns are joined as plain text; the model's chat template is not applied here
        for turn in conversation['text']:
            content = turn['content']
            full_text += content + " "  # Concatenate the conversation

            # Role IDs: 1 for assistant tokens, 0 for others (system and user)
            # Use role ids as masks
            # NOTE: tokenizing each turn on its own only approximates the token
            # boundaries of the joined string, so the mask may be off near turn edges
            role_id = 1 if turn['role'] == 'assistant' else 0
            role_ids.extend([role_id] * len(tokenizer.tokenize(content)))

        # Tokenize the entire conversation
        inputs = tokenizer(full_text, padding="max_length", truncation=True, max_length=max_length, return_tensors="pt")
        seq_len = inputs['input_ids'].size(-1)

        # Append the input_ids and role ids to the corresponding lists
        input_ids_list.append(inputs['input_ids'].squeeze(0).to('cuda'))  # Remove batch dimension
        # Pad or truncate the role ids so they match the length of input_ids
        role_ids = pad_tensor(torch.tensor(role_ids), seq_len).to('cuda')
        role_ids_list.append(role_ids)

        # Repeat the sequence-level reward for every token position
        rewards = torch.tensor([reward] * seq_len).to('cuda')
        rewards_list.append(rewards)

        # Create labels where non-assistant tokens are replaced with pad_token_id
        labels = inputs['input_ids'].clone().to('cuda')
        labels[role_ids.unsqueeze(0) == 0] = tokenizer.pad_token_id
        labels_list.append(labels)

    return input_ids_list, role_ids_list, rewards_list, labels_list
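
One way to keep the assistant mask aligned with the tokens by construction is to tokenize each turn separately and concatenate the per-turn tokens, extending the mask by the same amount at each step. The sketch below illustrates the idea with a hypothetical whitespace tokenizer (`toy_tokenize`) standing in for the real one; `build_tokens_and_mask` and `pad_or_truncate` are likewise illustrative names, not functions from the notebook.

```python
def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word."""
    return text.split()


def build_tokens_and_mask(messages, tokenize=toy_tokenize):
    """Concatenate per-turn tokens and build a parallel mask (1 = assistant token)."""
    tokens, mask = [], []
    for message in messages:
        turn_tokens = tokenize(message["content"])
        tokens.extend(turn_tokens)
        role_id = 1 if message["role"] == "assistant" else 0
        mask.extend([role_id] * len(turn_tokens))
    return tokens, mask


def pad_or_truncate(values, max_length, pad_value):
    """Pad with `pad_value` or truncate so the result has exactly `max_length` items."""
    return (values + [pad_value] * max_length)[:max_length]


messages = [
    {"role": "user", "content": "Hello there"},
    {"role": "assistant", "content": "Hi !"},
]
tokens, mask = build_tokens_and_mask(messages)
padded_mask = pad_or_truncate(mask, 6, 0)
print(tokens, mask, padded_mask)
```

Because tokens and mask grow together, their lengths always match, which is not guaranteed when the mask is derived from per-turn token counts and the ids from a single pass over the joined string.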

In [None]:
# Process the data
input_ids, role_ids, rewards, labels = process_data(data)
inputs = {
    'input_ids': input_ids,   # Tokenized sequences
    'role_ids': role_ids,     # 1 for assistant tokens, 0 for user/system
    'labels': labels,      # For language models, input and labels are usually the same
    'rewards': rewards        # Reward associated with each sequence
}
inputs


{'input_ids': [tensor([  2610,    264,  10950,   1071,    285,    517,  21927,     11,   1246,
             525,    498,     30,  30153,     13,   2585,    646,    358,   1492,
             498,   3351,     30,   3555,    374,    279,  74926,    315,   9393,
              30,   1084,    374,    220,     19,     15,     11,     15,     22,
              20,  96677,     13,    220, 151643, 151643, 151643, 151643, 151643,
          151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
          151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
          151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
          151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
          151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
          151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
          151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
   

In [None]:
a=[1,2,3,4,5,6,7,8,9]
b=[1,0,0,0,0,0,0,0,0]
a=torch.tensor(a)
b=torch.tensor(b)
a[b==0]=100
a

tensor([  1, 100, 100, 100, 100, 100, 100, 100, 100])

In [None]:
class REINFORCETrainer(Trainer):
  # **kwargs absorbs extra arguments that newer Trainer versions pass to compute_loss
  def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
    # Remove the custom fields so that only model arguments are forwarded.
    # This requires TrainingArguments(remove_unused_columns=False); otherwise the
    # Trainer drops 'role_ids' and 'rewards' before they reach this method.
    inputs.pop("labels", None)       # not needed: targets are taken from input_ids
    role_ids = inputs.pop("role_ids")
    rewards = inputs.pop("rewards")

    # forward pass
    outputs = model(**inputs)
    logits = outputs.get("logits")   # [batch, seq_len, vocab]

    # Logits at position t score the token at position t + 1, so shift by one
    log_probs = nn.functional.log_softmax(logits[:, :-1, :].float(), dim=-1)
    targets = inputs["input_ids"][:, 1:]
    # Log prob of each observed next token: [batch, seq_len - 1]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Only assistant tokens (role id 1) contribute to the loss
    mask = (role_ids[:, 1:] == 1).float()
    # 'rewards' holds the sequence reward repeated at every token position
    token_rewards = rewards[:, 1:].float()

    # REINFORCE loss: reward-weighted negative log-likelihood, normalized by batch size
    loss = -(token_rewards * mask * token_log_probs).sum() / logits.size(0)

    return (loss, outputs) if return_outputs else loss

In [None]:
def compute_loss(model, inputs, return_outputs=False):

    # forward pass
    model_inputs = {
        'input_ids': inputs['input_ids'],
        'attention_mask': torch.ones_like(inputs['input_ids'], device='cuda')
    }
    outputs = model(**model_inputs)
    logits = outputs.get("logits")   # [batch, seq_len, vocab]
    rewards = inputs["rewards"]      # [batch, seq_len], sequence reward repeated per token
    role_ids = inputs["role_ids"]    # [batch, seq_len], 1 for assistant tokens

    # Logits at position t score the token at position t + 1, so shift by one
    log_probs = nn.functional.log_softmax(logits[:, :-1, :].float(), dim=-1)
    targets = inputs['input_ids'][:, 1:]
    # Log prob of each observed next token: [batch, seq_len - 1]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Only compute loss for assistant-generated tokens
    mask = (role_ids[:, 1:] == 1).float()
    token_rewards = rewards[:, 1:].float()

    # REINFORCE loss summed over the whole batch, then normalized by batch size
    loss = -(token_rewards * mask * token_log_probs).sum() / logits.size(0)

    return (loss, outputs) if return_outputs else loss
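
A causal language model's logits at position `t` score the token at position `t + 1`, so targets and the assistant mask are normally shifted by one before log-probabilities are looked up. The plain-Python sketch below shows which (prediction position, target token) pairs survive that shift; `shift_for_next_token` is a hypothetical helper used only for illustration.

```python
def shift_for_next_token(token_ids, assistant_mask):
    """Pair each prediction position t with the token at t + 1, keeping assistant targets only."""
    targets = token_ids[1:]           # token at t + 1 for every position t
    target_mask = assistant_mask[1:]  # mask follows the target, not the input position
    return [
        (t, target)
        for t, (target, keep) in enumerate(zip(targets, target_mask))
        if keep
    ]


token_ids = [10, 11, 12, 13]
assistant_mask = [0, 0, 1, 1]
pairs = shift_for_next_token(token_ids, assistant_mask)
print(pairs)
```

The last position has no following token and therefore never appears as a prediction position, and the first token never appears as a target.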

In [None]:
from transformers import TrainerCallback

class EarlyStoppingCallback(TrainerCallback):
    def __init__(self, num_steps=10):
        self.num_steps = num_steps

    def on_step_end(self, args, state, control, **kwargs):
        # Signal the Trainer to stop by setting the flag on the control object it passes in
        if state.global_step >= self.num_steps:
            control.should_training_stop = True
        return control

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="new-chat",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=2,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
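    push_to_hub=True,
    # Keep the custom 'role_ids' and 'rewards' fields; by default the Trainer
    # drops inputs that the model's forward() does not accept
    remove_unused_columns=False,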
)

In [None]:
from torch.utils.data import Dataset
class ChatDataset(Dataset):
    def __init__(self, input_ids_list, role_ids_list, rewards_list, labels):
        self.input_ids_list = input_ids_list
        self.role_ids_list = role_ids_list
        self.rewards_list = rewards_list
        self.labels = labels

    def __len__(self):
        return len(self.input_ids_list)

    def __getitem__(self, idx):
        item = {
            'input_ids': self.input_ids_list[idx],
            'role_ids': self.role_ids_list[idx],
            'labels': self.labels[idx],
            'rewards': self.rewards_list[idx]
        }
        return item

# Create an instance of the dataset
chat_dataset = ChatDataset(input_ids, role_ids, rewards,labels)
train_loader = torch.utils.data.DataLoader(chat_dataset, batch_size=2, shuffle=True)

In [None]:
chat_dataset[0]
next(iter(train_loader))

{'input_ids': tensor([[  2610,    264,  10950,   1071,    285,    517,  21927,     11,   1246,
             525,    498,     30,  30153,     13,   2585,    646,    358,   1492,
             498,   3351,     30,   3555,    374,    279,  74926,    315,   9393,
              30,   1084,    374,    220,     19,     15,     11,     15,     22,
              20,  96677,     13,    220, 151643, 151643, 151643, 151643, 151643,
          151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
          151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
          151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
          151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
          151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
          151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
          151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
   

In [None]:
from torch.optim import AdamW
# Optimizer
optimizer = AdamW(model.parameters(), lr=1e-4)

for epoch in range(10):
    model.train()

    for batch in train_loader:
        # sample a continuation for inspection only; its result is not used by the loss below
        generated_ids = model.generate(batch['input_ids'],max_new_tokens=512)
        # generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(batch.input_ids, generated_ids)]
        # response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

        # calculate the loss
        loss = compute_loss(model, batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(f"Epoch {epoch}, Loss: {loss.item()}")

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

Epoch 0, Loss: 55.70682144165039
Epoch 1, Loss: 21.333810806274414
Epoch 2, Loss: 16.612850189208984
Epoch 3, Loss: 1.5350146293640137
Epoch 4, Loss: -5.284585952758789
Epoch 5, Loss: -28.238998413085938
Epoch 6, Loss: -75.381591796875
Epoch 7, Loss: -209.03271484375
Epoch 8, Loss: -273.06427001953125
Epoch 9, Loss: -315.2392578125


In [None]:
# Trainer instance
trainer = REINFORCETrainer(
    model=model,
    args=training_args,
    train_dataset=chat_dataset,
    tokenizer=tokenizer
)

# Start training
trainer.train()

# Optionally, evaluate the model
trainer.evaluate()

{'input_ids': tensor([[  2610,    264,  10950,   1071,    285,    517,  21927,     11,   1246,
            525,    498,     30,  30153,     13,   2585,    646,    358,   1492,
            498,   3351,     30,   3555,    374,    279,  74926,    315,   9393,
             30,   1084,    374,    220,     19,     15,     11,     15,     22,
             20,  96677,     13,    220, 151643, 151643, 151643, 151643, 151643,
         151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
         151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
         151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
         151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
         151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
         151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
         151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
         15164

TypeError: 'NoneType' object is not subscriptable