🚀 **The all-important `compute_metrics()` function used in LLM (Large Language Model) fine-tuning projects** 🚀🔥


---------------------

👉 **How does `tokenizer.batch_decode` really work?**

👉 The `tokenizer.batch_decode` method is a part of the tokenizer object provided by Hugging Face's transformers library. It's used to convert tokenized sequences back into human-readable text, performing the inverse operation of the tokenization process.

👉 When you tokenize a piece of text, the tokenizer splits it into words or subwords, called tokens, and maps each token to an integer ID. For example, the sentence "I love AI" might be tokenized into something like `[72, 1801, 1789]` (the actual IDs depend on the tokenizer's vocabulary). The tokenizer keeps a mapping from these IDs back to the original words or subwords.

👉 The `tokenizer.batch_decode` function takes as input a batch of these token sequences (each sequence is a list of integers) and decodes them into their original textual form. If the `skip_special_tokens` flag is set to `True`, it will ignore special tokens like padding, start of sentence, end of sentence, etc., when converting back to text.

Here is a basic example:

```python
from transformers import AutoTokenizer

# Initialize a tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Suppose we have the following tokenized input
input_ids = tokenizer.encode("Hello, I'm a data scientist.", return_tensors='pt')

print(f"Tokenized Input: {input_ids}")
# Output: tensor([[  101,  7592,  1010,  1045,  1005,  1049,  1037,  2951,  7155,  1012,  102]])

# Now, we decode it back to the original text
decoded_text = tokenizer.batch_decode(input_ids, skip_special_tokens=True)

print(f"Decoded Text: {decoded_text}")
# Output: ["hello, i'm a data scientist."]
```

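In this example, the tokenizer first encodes the input sentence into a sequence of token IDs. Because `return_tensors='pt'` yields a tensor of shape `(1, sequence_length)`, `batch_decode` treats it as a batch containing one sequence and returns a list with one string. The text comes back lowercased because `bert-base-uncased` lowercases its input, and the special tokens `101` (`[CLS]`) and `102` (`[SEP]`) are dropped by `skip_special_tokens=True`.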
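To make the role of `skip_special_tokens` concrete, here is a minimal toy sketch of the idea behind batch decoding. It is not the Hugging Face implementation: the vocabulary, the IDs, and the helper `toy_batch_decode` are all hypothetical and exist only for illustration.

```python
# Toy vocabulary: hypothetical IDs, not those of any real tokenizer.
ID_TO_TOKEN = {0: "[PAD]", 1: "[CLS]", 2: "[SEP]", 3: "hello", 4: "world", 5: "again"}
SPECIAL_IDS = {0, 1, 2}


def toy_batch_decode(batch, skip_special_tokens=False):
    """Decode each sequence of IDs in `batch` into one string."""
    texts = []
    for ids in batch:
        if skip_special_tokens:
            # Drop padding and sentence markers before looking up the tokens.
            ids = [i for i in ids if i not in SPECIAL_IDS]
        texts.append(" ".join(ID_TO_TOKEN[i] for i in ids))
    return texts


batch = [[1, 3, 4, 2, 0], [1, 3, 5, 2, 0]]
with_special = toy_batch_decode(batch)
without_special = toy_batch_decode(batch, skip_special_tokens=True)

print(with_special)     # ['[CLS] hello world [SEP] [PAD]', '[CLS] hello again [SEP] [PAD]']
print(without_special)  # ['hello world', 'hello again']
```

In this toy version, an ID that is missing from the vocabulary (for example `-100`) raises a `KeyError`, which hints at why invalid IDs have to be replaced before decoding.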

----------------

**Why does the code replace every instance of `-100` with `50256`?**

👉 In the given code snippet, `-100` is not a real token ID. It is the label value that marks positions to be ignored when the loss is computed during training; PyTorch's `CrossEntropyLoss` ignores it by default (`ignore_index=-100`). These positions might be extra padding, or tokens that are not part of the target sequence in a sequence-to-sequence model.

👉 During training, not every position in the output sequence is relevant. For instance, sequences may have been padded to a fixed length, or certain tokens may need to be masked. In these cases, the label at those positions is set to `-100` so that the model skips them when it computes the loss.
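The effect of the ignore value can be illustrated with a small NumPy sketch. This is a conceptual stand-in for a masked cross-entropy loss, not the code that the training libraries actually run; the helper `masked_cross_entropy` and the example numbers are assumptions made for illustration.

```python
import numpy as np

IGNORE_INDEX = -100  # same value as the default ignore_index of PyTorch's CrossEntropyLoss


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignore_index."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    keep = labels != ignore_index
    # Use a dummy class 0 at ignored positions so the lookup is valid; they are masked out below.
    safe_labels = np.where(keep, labels, 0)
    picked = np.take_along_axis(log_probs, safe_labels[..., None], axis=-1)[..., 0]
    return -(picked * keep).sum() / keep.sum()


# One sequence of 3 positions over a 4-token vocabulary; the last position is padding.
logits = np.array([[2.0, 0.5, 0.1, 0.1],
                   [0.1, 3.0, 0.2, 0.1],
                   [5.0, 0.0, 0.0, 0.0]])
labels = np.array([0, 1, -100])

loss_masked = masked_cross_entropy(logits, labels)
loss_first_two = masked_cross_entropy(logits[:2], labels[:2])
print(np.isclose(loss_masked, loss_first_two))  # True: the -100 position contributes nothing
```

The loss over all three positions equals the loss over the first two, because the position labelled `-100` is excluded from both the sum and the count.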

👉 At evaluation time, however, `-100` cannot be decoded back into text because it is not a valid token ID. Before decoding, every `-100` must therefore be replaced with a valid token ID.

👉 The ID `50256` is chosen because the script uses it as the padding ID: it sets `pad_tok = 50256` and passes that value to the model as `pad_token_id`. When `tokenizer.batch_decode` is called with `skip_special_tokens=True`, it drops special tokens such as padding, so the substituted positions do not show up in the decoded text, provided the tokenizer treats `50256` as a special token. This is why every `-100` is replaced with `50256` before the predicted sequences are decoded.

Here is a small pseudo-example to illustrate this process:

```python
preds = [[1, 2, -100, 3], [2, -100, 1, 4]]
# These are the predicted sequences. -100 represents tokens to be ignored

for idx in range(len(preds)):
    for idx2 in range(len(preds[idx])):
        if preds[idx][idx2]==-100:
            preds[idx][idx2] = 50256

print(preds)
# Output: [[1, 2, 50256, 3], [2, 50256, 1, 4]]
# Now all -100 tokens are replaced with 50256 (padding token) and can be properly ignored when decoding
```
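When the predictions arrive as a NumPy array, the nested loop above can also be written as a single vectorized call. The sketch below assumes `preds` is an integer array; `PAD_TOK` is just a local name for the value `50256` used in the script.

```python
import numpy as np

PAD_TOK = 50256  # the ID the script uses for padding

preds = np.array([[1, 2, -100, 3], [2, -100, 1, 4]])

# Keep every value that is not -100; put PAD_TOK where it is.
cleaned = np.where(preds != -100, preds, PAD_TOK)
print(cleaned.tolist())  # [[1, 2, 50256, 3], [2, 50256, 1, 4]]
```

Unlike the loop, `np.where` returns a new array and leaves `preds` unmodified.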


--------------

**What is the significance of the line below?**

`labels = np.where(labels != pad_tok, labels, tokenizer.pad_token_id)`

👉 The `np.where` function works like this: `np.where(condition, x, y)`. It checks the `condition` for each element in the array. If the condition is `True`, it selects the corresponding element from `x`; if the condition is `False`, it selects the corresponding element from `y`.

The condition is `labels != pad_tok`. It compares each element of the `labels` array with `pad_tok`, the padding ID (`50256` in the full script) used to fill label sequences to a consistent length for batch processing. If the condition is `True` (the label is not padding), the original label is kept. If the condition is `False` (the label is padding), that label is replaced with `tokenizer.pad_token_id`.

👉 This operation ensures that every padded position in the labels carries the padding token ID of the tokenizer being used, so that `batch_decode` with `skip_special_tokens=True` drops those positions when the labels are decoded into text. If `tokenizer.pad_token_id` already equals `pad_tok`, the call leaves the labels unchanged.
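A small worked example of this `np.where` call on toy data is sketched below. The label values and `tokenizer_pad_id = 0` are hypothetical; in the real script the replacement value comes from `tokenizer.pad_token_id`.

```python
import numpy as np

pad_tok = 50256       # padding ID used for labels in the script
tokenizer_pad_id = 0  # hypothetical stand-in for tokenizer.pad_token_id

labels = np.array([[11, 12, 50256, 50256],
                   [21, 22, 23, 50256]])

# Real labels are kept; padded positions get the tokenizer's own padding ID.
decodable = np.where(labels != pad_tok, labels, tokenizer_pad_id)
print(decodable.tolist())  # [[11, 12, 0, 0], [21, 22, 23, 0]]
```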

In [None]:
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    for idx in range(len(preds)):
        for idx2 in range(len(preds[idx])):
            if preds[idx][idx2]==-100:
                preds[idx][idx2] = 50256
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Map label padding (pad_tok) to the tokenizer's pad ID so the labels can be decoded.
    labels = np.where(labels != pad_tok, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    
    # The results of the Rouge metric are then multiplied by 100 and rounded to four decimal places.
    result = {k: round(v * 100, 4) for k, v in result.items()}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result

==============

FULL SOURCE CODE

==============

In [None]:
# Source - https://github.com/artidoro/qlora/issues/157
import pandas as pd
import os
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, BitsAndBytesConfig
import torch
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
from transformers import DataCollatorForSeq2Seq
import evaluate
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
from datasets import Dataset, DatasetDict
import argparse
import pickle
import json

parser = argparse.ArgumentParser(description='Options')
parser.add_argument('--dataset_dir', default='data', type=str, help="folder in which the dataset is stored")
parser.add_argument('--output_dir', default="lora-instructcodet5p", type=str, help="output directory for the model")
parser.add_argument('--results_dir', default="results", type=str, help="where the results should be stored")
args = parser.parse_args()

nltk.download("punkt")
tokenized_dataset = DatasetDict.load_from_disk(args.dataset_dir)
# Metric
metric = evaluate.load("rouge")
pad_tok = 50256
token_id="Salesforce/instructcodet5p-16b"

tokenizer = AutoTokenizer.from_pretrained(token_id)
# helper function to postprocess text
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # rougeLSum expects newline after each sentence
    preds = ["\n".join(sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(sent_tokenize(label)) for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    for idx in range(len(preds)):
        for idx2 in range(len(preds[idx])):
            if preds[idx][idx2]==-100:
                preds[idx][idx2] = 50256
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Map label padding (pad_tok) to the tokenizer's pad ID so the labels can be decoded.
    labels = np.where(labels != pad_tok, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    result = {k: round(v * 100, 4) for k, v in result.items()}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result

def get_dict(predicts):
    d = {}
    for num in range(len(tokenized_dataset['test'])):
        pred = tokenizer.decode([n for n in predicts[0][num] if n!=50256 and n!=-100])[1:]
        d[num+1] = {'Question':tokenizer.decode([n for n in tokenized_dataset['test'][num]['input_ids'] if n!=50256]),
                    'Ground truth solution':tokenizer.decode([n for n in tokenized_dataset['test'][num]['labels'] if n!=50256]),
                   'Prediction': pred if pred else None}
    return d

def find_all_linear_names(model):
    cls = torch.nn.Linear
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])


    if 'lm_head' in lora_module_names:
        lora_module_names.remove('lm_head')
    return list(lora_module_names)


def main():
    device = 'cuda'

    # use a local copy of the model if it exists, otherwise fall back to the Hugging Face Hub ID
    model_id="instructcodet5p-16b"
    if not os.path.exists(model_id):
        model_id=token_id
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )
    # load model from the hub
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id,
                                              # torch_dtype=torch.bfloat16,
                                              low_cpu_mem_usage=True,
                                              trust_remote_code=True, decoder_start_token_id=1, pad_token_id=pad_tok, device_map="auto", quantization_config=bnb_config)
    modules = find_all_linear_names(model)
    # Define LoRA Config
    lora_config = LoraConfig(
     r=16,
     lora_alpha=32,
     target_modules=modules,
     lora_dropout=0.05,
     bias="none",
     task_type=TaskType.SEQ_2_SEQ_LM
    )
    model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=False)

    # add LoRA adaptor
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

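    # pad labels with pad_tok; note that PyTorch's cross-entropy loss only ignores -100 by default,
    # so positions padded with pad_tok may still contribute to the loss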
    label_pad_token_id = pad_tok
    # Data collator
    data_collator = DataCollatorForSeq2Seq(
        tokenizer,
        model=model,
        label_pad_token_id=label_pad_token_id,
        pad_to_multiple_of=8
    )
    output_dir=args.output_dir

    training_args = Seq2SeqTrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=1,
        # per_device_eval_batch_size=1,
        predict_with_generate=True,
        weight_decay=0.05,
        # warmup_steps=200,
        fp16=False, # Overflows with fp16
        learning_rate=1e-4,
        num_train_epochs=5,
        logging_dir=f"{output_dir}/logs",
        logging_strategy="epoch",
        report_to="tensorboard",
        push_to_hub=False,
        # generation_max_length=200,
        optim="paged_adamw_8bit",
        lr_scheduler_type = 'constant'
    )

    # Create Trainer instance
    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=tokenized_dataset["train"],
        # eval_dataset=tokenized_dataset["validation"],
        # compute_metrics=compute_metrics,
    )

    # train model
    train_result = trainer.train()

if __name__ == '__main__':
    main()