<a href="https://colab.research.google.com/github/nyp-sit/iti107-2024S2/blob/main/session-7/finetune_summary.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tune a Causal Language Model for Dialogue Summarization

In this exercise, you will fine-tune Meta's Llama 3.2 LLM for dialogue summarization. We will use the Hugging Face TRL (Transformer Reinforcement Learning) library to perform supervised fine-tuning (SFT), apply Parameter-Efficient Fine-Tuning (PEFT) to keep training fast and memory-efficient, and evaluate the resulting model using ROUGE metrics.

In [1]:
%%capture
!pip install -q accelerate peft transformers trl
!pip install -U bitsandbytes

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from datasets import load_dataset
import torch

## Templating Instruction Data

To have the LLM follow instructions, we will need to prepare instruction data that follows a chat template.

<img src="https://github.com/nyp-sit/iti107-2024S2/blob/main/assets/chat_template.png?raw=true" />

This chat template differentiates between what the LLM has generated and what the user has generated. Many LLM chat models available on Hugging Face come with a built-in chat template that you can use.

In [3]:
# This is the instruction-tuned (chat) variant of Llama 3.2 1B. We load only its tokenizer because we want to use its chat template to format our data
chat_model="meta-llama/Llama-3.2-1B-Instruct"

template_tokenizer = AutoTokenizer.from_pretrained(chat_model)

In [4]:
template_tokenizer.get_chat_template()

'{{- bos_token }}\n{%- if custom_tools is defined %}\n    {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n    {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n    {%- if strftime_now is defined %}\n        {%- set date_string = strftime_now("%d %b %Y") %}\n    {%- else %}\n        {%- set date_string = "26 Jul 2024" %}\n    {%- endif %}\n{%- endif %}\n{%- if not tools is defined %}\n    {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0][\'role\'] == \'system\' %}\n    {%- set system_message = messages[0][\'content\']|trim %}\n    {%- set messages = messages[1:] %}\n{%- else %}\n    {%- set system_message = "" %}\n{%- endif %}\n\n{#- System message #}\n{{- "<|start_header_id|>system<|end_header_id|>\\n\\n" }}\n{%- if tools is not none %}\n    {{- "Environment: ipython\\n" }}\n{%- endif %}\n{{- "Cutting

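You can see that the template expects each message to provide a `role` and a `content` field. Each turn is wrapped in special tokens: `<|start_header_id|>{role}<|end_header_id|>` precedes the content (where the role is `system`, `user` or `assistant`), and `<|eot_id|>` closes the turn.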
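
To make the layout concrete, here is a minimal sketch that rebuilds the turn structure by hand, using the header tokens visible in the template printed above. The helper `render_llama3_chat` is hypothetical and deliberately simplified: it skips the "Cutting Knowledge Date" / "Today Date" lines and the tool handling that the real template adds, so use `apply_chat_template` for actual data preparation.

```python
# Simplified, hand-written version of the Llama 3 turn layout (illustration only).
# The real template also injects date lines into the system turn and handles tools.
def render_llama3_chat(messages, add_generation_prompt=False):
    text = "<|begin_of_text|>"
    for message in messages:
        # Header names the speaker, then a blank line, then the trimmed content
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content'].strip()}<|eot_id|>"
    if add_generation_prompt:
        # An open assistant header cues the model to write the next turn
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Summarize this dialog:\n#Person1#: Hi.\n#Person2#: Hello.\n---\nSummary:"},
]
rendered = render_llama3_chat(messages, add_generation_prompt=True)
print(rendered)
```

During training the assistant turn contains the reference summary; at inference time the prompt instead ends with an open assistant header, which is what `add_generation_prompt=True` stands for here.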

### Format the data according to chat template

Let's download our data and format it according to the chat template. We use only the first 6,000 training samples (and the first 100 validation samples) to reduce training time.


In [5]:
dataset_name = "knkarthick/dialogsum"
dataset_train = load_dataset(dataset_name, split='train[:6000]')
dataset_val = load_dataset(dataset_name, split='validation[:100]')
dataset_test = load_dataset(dataset_name, split='test')

In [6]:
print(dataset_val, dataset_test)

Dataset({
    features: ['id', 'dialogue', 'summary', 'topic'],
    num_rows: 100
}) Dataset({
    features: ['id', 'dialogue', 'summary', 'topic'],
    num_rows: 1500
})


Note that the formatted prompt is stored under the `text` field of each record. This is the field that the trainer looks for by default when reading text data.

In [7]:
def format_chat_template(row):
    # Build the user turn: instruction, dialogue, and a "Summary:" cue
    user_prompt = f"Summarize this dialog:\n{row['dialogue']}\n---\nSummary:\n"
    messages = [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": row["summary"]},
    ]
    # Render the conversation as a single string (no tokenization yet)
    prompt = template_tokenizer.apply_chat_template(messages, tokenize=False)
    return {"text": prompt}

In [8]:
dataset_train = dataset_train.map(format_chat_template, remove_columns=list(dataset_train.features))
dataset_val = dataset_val.map(format_chat_template, remove_columns=list(dataset_val.features))
dataset_test = dataset_test.map(format_chat_template, remove_columns=list(dataset_test.features))

Using the "text" column, we can explore these formatted prompts:

In [9]:
dataset_train[0]['text']

"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 27 Oct 2024\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nSummarize this dialog:\n#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?\n#Person2#: I found it would be a good idea to get a check-up.\n#Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.\n#Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?\n#Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.\n#Person2#: Ok.\n#Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?\n#Person2#: Yes.\n#Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.\n#Person2#: I've tried hundreds of times, but I just can'

### Model Quantization

Now that we have our data, we can start loading in our model. This is where we apply the Q in QLoRA, namely quantization. We use the
bitsandbytes package to compress the pretrained model to a 4-bit representation.

In `BitsAndBytesConfig`, you can define the quantization scheme. We follow the steps used in the original QLoRA paper and load the model in 4-bit (`load_in_4bit`) with the 4-bit NormalFloat (NF4) data type (`bnb_4bit_quant_type`) and double quantization (`bnb_4bit_use_double_quant`):

In [10]:
model_name = "meta-llama/Llama-3.2-1B"

# 4-bit quantization configuration - Q in QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, # Use 4-bit precision model loading
    bnb_4bit_quant_type="nf4", # Quantization type
    bnb_4bit_compute_dtype="float16", # Compute dtype
    bnb_4bit_use_double_quant=True, # Apply nested quantization
)

# Load the model to train on the GPU
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cuda:0",
    # Leave this out for regular SFT
    quantization_config=bnb_config,
)
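
To get a feel for what double quantization buys us, the sketch below estimates the average storage cost per quantized weight. The block sizes (64 weights per block, 256 blocks per second-level group) are the figures described in the QLoRA paper; treating them as what bitsandbytes uses here is an assumption, and `n_params` is an illustrative round number rather than the exact model size. Layers that are not quantized (such as embeddings and norms) are ignored, so read the result as a rough estimate.

```python
def nf4_bits_per_param(block_size=64, double_quant=True, nested_block_size=256):
    """Average storage cost per quantized weight, in bits."""
    if double_quant:
        # One 8-bit constant per block, plus one fp32 constant per group of blocks
        overhead = 8 / block_size + 32 / (block_size * nested_block_size)
    else:
        # One fp32 scaling constant per block of weights
        overhead = 32 / block_size
    return 4 + overhead


n_params = 1_000_000_000  # illustrative round number, not the exact model size

for dq in (False, True):
    bits = nf4_bits_per_param(double_quant=dq)
    size_gb = n_params * bits / 8 / 1e9
    fp16_gb = n_params * 16 / 8 / 1e9
    print(f"double_quant={dq}: {bits:.3f} bits/param, ~{size_gb:.2f} GB vs {fp16_gb:.2f} GB in fp16")
```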



### Test the Model with Zero Shot Inferencing

Let's test the base model (the non-instruction-tuned model) with zero-shot inference, i.e. we ask it to summarize without giving it any examples. You can see that the model struggles: instead of producing a summary, it simply repeats the conversation.

In [11]:
eval_prompt = """
Summarize this dialog:
#Person1#: I have a problem with my cable.
#Person2#: What about it?
#Person1#: My cable has been out for the past week or so.
#Person2#: The cable is down right now. I am very sorry.
#Person1#: When will it be working again?
#Person2#: It should be back on in the next couple of days.
#Person1#: Do I still have to pay for the cable?
#Person2#: We're going to give you a credit while the cable is down.
#Person1#: So, I don't have to pay for it?
#Person2#: No, not until your cable comes back on.
#Person1#: Okay, thanks for everything.
#Person2#: You're welcome, and I apologize for the inconvenience.
---
Summary:
"""

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda:0")

model.eval()
with torch.no_grad():   # no gradient update
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=200)[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


Summarize this dialog:
#Person1#: I have a problem with my cable.
#Person2#: What about it?
#Person1#: My cable has been out for the past week or so.
#Person2#: The cable is down right now. I am very sorry.
#Person1#: When will it be working again?
#Person2#: It should be back on in the next couple of days.
#Person1#: Do I still have to pay for the cable?
#Person2#: We're going to give you a credit while the cable is down.
#Person1#: So, I don't have to pay for it?
#Person2#: No, not until your cable comes back on.
#Person1#: Okay, thanks for everything.
#Person2#: You're welcome, and I apologize for the inconvenience.
---
Summary:
#Person1#: I have a problem with my cable.
#Person2#: What about it?
#Person1#: My cable has been out for the past week or so.
#Person2#: The cable is down right now. I am very sorry.
#Person1#: When will it be working again?
#Person2#: It should be back on in the next couple of days.
#Person1#: Do I still have to pay for the cable?
#Person2#: We're going t

### LoRA Configuration

We will be using LoRA to train our model. LoRA is supported in Hugging Face's PEFT library.
Here is an explanation of the main LoRA parameters:
- `r` - The rank of the low-rank adapter matrices. Increasing it enlarges these matrices, which means less compression but more representational power. Values typically range between 4 and 64.
- `lora_alpha` - Controls how strongly the adapter update is added to the original weights (the update is scaled by `lora_alpha / r`). In essence, it balances the knowledge of the original model against that of the new task. A common rule of thumb is to choose a value twice the size of `r`; the configuration below uses `lora_alpha=32` with `r=64`, i.e. a scaling factor of 0.5.
- `target_modules` - Controls which layers receive adapters. You can leave out specific layers, such as some of the projection layers, which speeds up training but may reduce performance; targeting more layers does the opposite.
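
To see how `r` and `lora_alpha` interact, here is a toy sketch of a LoRA-adapted linear layer in plain Python. It follows the standard LoRA formulation, where the output is the frozen projection plus a low-rank update scaled by `lora_alpha / r`. The matrices and the helper names (`matvec`, `lora_forward`) are made up for illustration; real adapters operate on tensors with thousands of dimensions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha, r):
    scale = lora_alpha / r              # same ratio as 32 / 64 in our config
    base = matvec(W, x)                 # frozen pretrained path
    update = matvec(B, matvec(A, x))    # low-rank path: B (d_out x r) @ A (r x d_in) @ x
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]       # frozen weight, d_out=2, d_in=3
A = [[0.5, -0.5, 1.0], [1.0, 0.0, -1.0]]     # r=2 rows
B_init = [[0.0, 0.0], [0.0, 0.0]]            # B starts at zero, so the adapter is a no-op
B_trained = [[2.0, 1.0], [-1.0, 0.0]]        # hypothetical values after training
x = [1.0, 2.0, 3.0]

out_init = lora_forward(W, A, B_init, x, lora_alpha=1, r=2)
out_trained = lora_forward(W, A, B_trained, x, lora_alpha=1, r=2)
print(out_init, out_trained)
```

Because `B` starts at zero, the adapted model initially behaves exactly like the base model; training then moves the output away from the base projection by an amount that `lora_alpha / r` scales.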

In [12]:
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

# Prepare LoRA Configuration
peft_config = LoraConfig(
    lora_alpha=32,  # LoRA Scaling
    lora_dropout=0.1,  # Dropout for LoRA Layers
    r=64,  # Rank
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=  # Layers to target
     ['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)

# prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

Let's compare the number of trainable parameters of the PEFT model vs the base model.

In [13]:
model.print_trainable_parameters()

trainable params: 45,088,768 || all params: 1,280,903,168 || trainable%: 3.5201
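
The trainable-parameter count can be reproduced by hand. Each adapted linear layer gains two matrices, `A` (r x in_features) and `B` (out_features x r), so it adds `r * (in_features + out_features)` parameters. The layer shapes below are assumptions about the Llama 3.2 1B architecture that are not stated in this notebook; you can check them against `model.config`.

```python
# Assumed Llama 3.2 1B shapes (verify against model.config)
hidden, intermediate, n_layers = 2048, 8192, 16
kv_dim = 512  # 8 key/value heads x head_dim 64 (grouped-query attention)
r = 64

# (in_features, out_features) of each targeted linear layer
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

# Each adapter adds A (r x in) and B (out x r): r * (in + out) parameters
per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
trainable = per_layer * n_layers
total = 1_280_903_168  # "all params" as printed above

print(f"{trainable:,} trainable ({100 * trainable / total:.4f}%)")
```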


### Training Configuration

Next we need to set our training configuration. Since we are going to use SFTTrainer, we can specify the training arguments in SFTConfig.

Note that we enable mixed-precision training. The configuration below sets `bf16=True`, which requires an Ampere or newer GPU and is generally more numerically stable than `fp16`. On older GPUs, comment out `bf16` and set `fp16=True` instead.

Modern LLMs have large context windows, often more than 100K tokens, while many of our text samples are much shorter than that. Instead of placing one text per sample in the batch and padding to either the longest text or the maximum context length of the model, we concatenate many texts with an EOS token in between and cut chunks of the context size, filling the batch without any padding.

<img src="https://github.com/nyp-sit/iti107-2024S2/blob/main/assets/packing.png?raw=1" width="700"/>

TRL allows us to do this packing very easily, by just specifying `packing=True`.  Internally, a [`ConstantLengthDataset`](https://huggingface.co/docs/trl/en/sft_trainer#trl.trainer.ConstantLengthDataset) is being created so we can iterate over the dataset on fixed-length sequences.
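
The idea behind packing can be sketched in a few lines of plain Python. This is a conceptual illustration only, not TRL's implementation: the helper `pack_sequences` is hypothetical, the token ids are toy values, and details such as how the incomplete tail is handled may differ in the library.

```python
def pack_sequences(tokenized_texts, eos_token_id, seq_length):
    """Concatenate token lists separated by EOS, then cut fixed-size chunks."""
    buffer = []
    for tokens in tokenized_texts:
        buffer.extend(tokens)
        buffer.append(eos_token_id)
    # Drop the incomplete tail so every chunk is exactly seq_length long
    n_chunks = len(buffer) // seq_length
    return [buffer[i * seq_length:(i + 1) * seq_length] for i in range(n_chunks)]


# Toy token ids; 0 stands in for the EOS token
samples = [[11, 12, 13], [21, 22], [31, 32, 33, 34], [41]]
packed = pack_sequences(samples, eos_token_id=0, seq_length=4)
print(packed)
```

Note that a chunk can start or end in the middle of a sample; the EOS tokens are what tell the model where one text stops and the next begins.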

In [14]:
import os

os.environ["WANDB_PROJECT"] = "llama3.2-summarize"
# os.environ["WANDB_API_KEY"] = "Your secret wandb key"

# Convenience function to generate a unique run name for Weights & Biases (wandb)
def get_run_id():
    import time
    run_id = time.strftime("run_%Y%m%d_%H%M%S")
    return run_id

In [15]:
from trl import SFTConfig

model.config.use_cache = False
model.config.pretraining_tp = 1

# Configure the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# where to write the checkpoint to
output_dir = "./results"

sft_config = SFTConfig(
    output_dir=output_dir,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    # num_train_epochs=1,
    logging_steps=5,
    max_steps=200,
    bf16=True,
    # fp16=True
    gradient_checkpointing=True,
    resume_from_checkpoint=True,
    packing=True,
    eval_packing=True,
    dataset_text_field="text",
    max_seq_length=1024,
    save_strategy = "steps",
    save_steps=5,
    eval_strategy='steps',
    eval_steps=5,
    run_name=get_run_id()
)
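
It helps to work out what these settings imply for the training budget. The arithmetic below uses the values from the configuration above and assumes a single-GPU run (`n_gpus = 1` is an assumption). With packing enabled, each training sequence holds up to `max_seq_length` tokens, so the token count is an upper bound.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 200
max_seq_length = 1024
save_steps = 5
n_gpus = 1  # assumption: single-GPU run

# Sequences contributing to each optimizer update
sequences_per_step = per_device_train_batch_size * gradient_accumulation_steps * n_gpus
total_sequences = sequences_per_step * max_steps
total_tokens = total_sequences * max_seq_length   # upper bound with packing
n_checkpoints = max_steps // save_steps           # one checkpoint every save_steps

print(f"{sequences_per_step} sequences per update, {total_sequences} sequences in total")
print(f"up to {total_tokens:,} tokens seen, {n_checkpoints} checkpoints written")
```

Saving and evaluating every 5 steps is convenient for following progress in a short run, but it writes many checkpoints; consider raising `save_steps` and `eval_steps` for longer runs.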

In [None]:
from trl import SFTTrainer

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset_train,
    eval_dataset=dataset_val,
    # dataset_text_field="text",
    tokenizer=tokenizer,
    # Leave this out for regular SFT
    peft_config=peft_config,
    args=sft_config
 )

# Train model
trainer.train()



In [17]:
# Save QLoRA weights
trainer.model.save_pretrained("Llama-3.2-1B-Summarizer-QLoRA")

In [18]:
# from huggingface_hub import login
# login()
# ## push the model to hub
# model.push_to_hub("khengkok/Llama-3.2-1B-Summarizer-QLoRA")


### Merge Weights

After we have trained our QLoRA weights, we still need to combine them with the original weights to use them. We reload the model in 16 bits, instead of the quantized 4 bits, to merge the weights.

In [23]:
from peft import AutoPeftModelForCausalLM

# Load the base model in 16-bit precision together with the trained adapter
model = AutoPeftModelForCausalLM.from_pretrained(
    "Llama-3.2-1B-Summarizer-QLoRA",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)

# Uncomment the following to load the pretrained adapter if you did not manage to train your own
# model = AutoPeftModelForCausalLM.from_pretrained(
#     "khengkok/Llama-3.2-1B-Summarizer-QLoRA",
#     torch_dtype=torch.float16,
#     low_cpu_mem_usage=True,
#     device_map="auto",
# )

# Merge LoRA and base model
merged_model = model.merge_and_unload()
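
Conceptually, merging folds the adapter into the base weights as `W + (lora_alpha / r) * B @ A`, after which the adapter matrices are no longer needed. The toy sketch below checks that a merged weight matrix gives the same output as keeping the adapter path separate. It uses made-up values and hypothetical helpers (`matvec`, `matmul`, `merge_lora`), and illustrates the standard LoRA formulation rather than PEFT's internal code.

```python
def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A as a new matrix."""
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


W = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]      # frozen base weight
A = [[0.5, -0.5, 1.0], [1.0, 0.0, -1.0]]    # adapter matrix A (r=2)
B = [[2.0, 1.0], [-1.0, 0.0]]               # adapter matrix B
x = [1.0, 2.0, 3.0]

# Adapter kept separate: base path plus scaled low-rank path
separate = [b + 0.5 * u for b, u in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Adapter merged into a single weight matrix
merged = merge_lora(W, A, B, lora_alpha=1, r=2)
print(merged, matvec(merged, x), separate)
```

After merging, inference needs only a single matrix multiplication per layer, so the merged model runs like an ordinary model with no adapter overhead.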

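After merging the adapter with the base model, we can prompt it with the same dialogue as before. For best results, the prompt should follow the chat template used during training (for example, by calling `template_tokenizer.apply_chat_template` with `add_generation_prompt=True`); the cell below uses a simpler hand-written prompt: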

In [25]:
eval_prompt = """<|user|>
Summarize this dialog:
#Person1#: I have a problem with my cable.
#Person2#: What about it?
#Person1#: My cable has been out for the past week or so.
#Person2#: The cable is down right now. I am very sorry.
#Person1#: When will it be working again?
#Person2#: It should be back on in the next couple of days.
#Person1#: Do I still have to pay for the cable?
#Person2#: We're going to give you a credit while the cable is down.
#Person1#: So, I don't have to pay for it?
#Person2#: No, not until your cable comes back on.
#Person1#: Okay, thanks for everything.
#Person2#: You're welcome, and I apologize for the inconvenience.
---
Summary:</s>
<|assistant|>
"""

from transformers import TextStreamer
from transformers import pipeline

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

# Run our instruction-tuned model
# pipe = pipeline(task="text-generation", model=merged_model, tokenizer=tokenizer, max_new_tokens=200)
# print(pipe(eval_prompt)[0]["generated_text"])


# #Streaming support
streamer = TextStreamer(tokenizer)
merged_model.eval()
with torch.no_grad():
    merged_model.generate(**model_input, streamer=streamer, max_length=512)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


<|begin_of_text|><|user|>
Summarize this dialog:
#Person1#: I have a problem with my cable.
#Person2#: What about it?
#Person1#: My cable has been out for the past week or so.
#Person2#: The cable is down right now. I am very sorry.
#Person1#: When will it be working again?
#Person2#: It should be back on in the next couple of days.
#Person1#: Do I still have to pay for the cable?
#Person2#: We're going to give you a credit while the cable is down.
#Person1#: So, I don't have to pay for it?
#Person2#: No, not until your cable comes back on.
#Person1#: Okay, thanks for everything.
#Person2#: You're welcome, and I apologize for the inconvenience.
---
Summary:</s>
<|assistant|>
#Person1# phones #Person2# to complain about the cable and #Person2# promises to give #Person1# a credit. #Person1# is grateful for it. #Person2# apologizes for the inconvenience. #Person1# thinks #Person2# should apologize more. #Person2# agrees. #Person1# is satisfied. #Person2# thinks #Person1# will be satisfied
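
To put a number on summary quality, we can compare generated summaries against reference summaries with ROUGE. The sketch below is a simplified ROUGE-1 (unigram overlap) written from scratch for illustration: it lowercases and splits on whitespace, with no stemming, so its scores will not exactly match an established implementation such as the `rouge_score` package or `evaluate.load("rouge")`. Both example strings are made up.

```python
from collections import Counter


def rouge1(candidate, reference):
    """Simplified ROUGE-1: unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())          # clipped count of shared unigrams
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Both strings are hypothetical examples, not model output or dataset labels
reference = "the cable is down and the customer gets a credit"
candidate = "the customer gets a credit while the cable is down"

scores = rouge1(candidate, reference)
print(scores)
```

In practice you would generate a summary for each dialogue in `dataset_test` and average the scores over the whole test set.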

Good reference:

https://wandb.ai/capecape/alpaca_ft/reports/How-to-Fine-tune-an-LLM-Part-3-The-HuggingFace-Trainer--Vmlldzo1OTEyNjMy
