# Fine-tuning an LLM for Command Generation in CALM

This is a worked example of how to efficiently fine-tune a base language model from [Hugging Face Hub](https://huggingface.co/models) using the [TRL](https://huggingface.co/docs/trl/en/index) libraries for the task of command generation within [CALM](https://rasa.com/docs/rasa-pro/calm).

To run fine-tuning, you must have first [generated the dataset](https://rasa.com/rasa-pro/docs/operating/fine-tuning-recipe) files `train.jsonl` and `val.jsonl`, which must be in the [TRL instruction format](https://huggingface.co/docs/trl/en/sft_trainer#dataset-format-support).
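
Before training, it can help to confirm that every line of the dataset files parses as JSON and matches a record shape TRL understands. The following is a minimal sketch, not part of the recipe itself: `detect_format` and `summarize_jsonl` are hypothetical helper names, the sample records are made up, and it assumes the instruction format uses `prompt`/`completion` keys while the conversational format uses a `messages` list of `role`/`content` entries.

```python
import json
from collections import Counter


def detect_format(record):
    """Classify one parsed JSONL record by its top-level keys."""
    if isinstance(record.get("prompt"), str) and isinstance(record.get("completion"), str):
        return "instruction"
    messages = record.get("messages")
    if isinstance(messages, list) and messages and all(
        isinstance(m, dict) and "role" in m and "content" in m for m in messages
    ):
        return "conversational"
    return "unknown"


def summarize_jsonl(lines):
    """Count records per format; raises json.JSONDecodeError on an invalid line."""
    return Counter(detect_format(json.loads(line)) for line in lines if line.strip())


# Illustrative records only; pass `open("train.jsonl")` instead for real data.
sample = [
    '{"prompt": "I want to send money", "completion": "StartFlow(transfer_money)"}',
    '{"messages": [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "StartFlow(greet)"}]}',
    '{"text": "no recognised keys"}',
]
print(summarize_jsonl(sample))
```

Any record counted as `unknown` is worth inspecting before you start a long training run.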

## 1. Configure fine-tuning environment

In order to run this notebook, you will need to first install the necessary libraries onto a machine with the following minimum hardware requirements:
- Single NVIDIA A100 GPU with 40GB VRAM
- 12 core CPU with 85GB RAM
- 250GB disk

Here is an example of how to set up the environment:

First, we provisioned a Linux instance with the appropriate hardware and the following software installed:
- Python 3.10
- CUDA Toolkit 12.1
- PyTorch 2.2 (the `pip install` step below replaces this with the pinned `torch==2.6.*`)

Next, we installed the necessary packages as follows:

In [None]:
%%sh
pip install "torch==2.6.*" "accelerate==1.5.*" "peft==0.14.*" "bitsandbytes==0.45.*" "transformers==4.49.*" "trl==0.15.*" "vllm==0.8.2" "flash-attn==2.7.4.post1"

## 2. Choose base model

You can download the model you want to fine-tune from Hugging Face Hub using the [official CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli) with an [API access token](https://huggingface.co/docs/transformers.js/en/guides/private#step-1-generating-a-user-access-token) as per the code below. Make sure you first update the `HUGGINGFACE_TOKEN` and `BASE_MODEL` environment variables with your own values.

When testing this notebook, the [Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model was used. Note that `meta-llama/Meta-Llama-3.1-8B-Instruct` is a [gated model](https://huggingface.co/docs/hub/en/models-gated) that you must first request access to before using.

You can use any other PyTorch model available on [Hugging Face Hub](https://huggingface.co/models). It is recommended that you use a model that has been pre-trained on instructional tasks, such as the [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model.

Pre-trained models with more parameters will generally perform better at tasks than models with fewer parameters. However, the size of model you can use is limited by how much memory your GPU has.
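
To make the memory constraint concrete, the arithmetic below estimates how much GPU memory the weights alone occupy at different precisions. It is a rough sketch under stated assumptions: `weight_memory_gb` is a hypothetical helper, 8 billion is a round figure for an "8B" model, and gradients, optimizer state and activations all need additional memory on top of this.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory needed to hold the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: ~8 billion parameters, a round figure for an "8B" model.
N_PARAMS = 8e9

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```
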

Alternatively, if you already have a PyTorch model directory to hand, you can upload it to your notebook environment manually.

In [None]:
# TODO: update with your values
%env BASE_MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct
%env HUGGINGFACE_TOKEN=CHANGEME
%env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# download model
!huggingface-cli download "$BASE_MODEL" \
    --token "$HUGGINGFACE_TOKEN" \
    --local-dir "./base_model"

## 3. Load and quantize base model

The [quantization of model parameters](https://huggingface.co/docs/optimum/en/concept_guides/quantization) can significantly reduce the GPU memory required to run model fine-tuning and inference, at the cost of model accuracy.

Here, the base model is loaded from disk and can optionally be quantized into a 4-bit representation on the fly using the [BitsAndBytes](https://huggingface.co/docs/transformers/main/en/quantization/bitsandbytes) library.
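
To see where the accuracy cost of quantization comes from, here is a toy example of 4-bit quantization. Note that this is deliberately simplified: it uses uniform, symmetric "absmax" rounding with one shared scale, whereas the NF4 scheme configured in this notebook uses non-uniform levels and per-block scales. The function names and weight values are made up for illustration.

```python
def quantize_absmax_4bit(values):
    """Map floats to signed integers in [-7, 7] using one shared scale.

    Assumes at least one value is non-zero.
    """
    scale = max(abs(v) for v in values) / 7
    return [round(v / scale) for v in values], scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.42, -1.30, 0.07, 0.91, -0.58]  # made-up values for illustration
codes, scale = quantize_absmax_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"max error = {max_error:.4f}")
```

Each weight is stored as a small integer, so the restored values differ from the originals by up to half a quantization step. That rounding error is the accuracy you trade for memory.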

In [None]:
import torch
from peft import prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig


do_quantization = False # adjust this setting based on whether you want to use quantization or not
BASE_MODEL_PATH = "./base_model"

def get_model_and_tokenizer(model_id):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = "<|finetune_right_pad_id|>"

    bnb_config = None

    if do_quantization:
        # 4-bit quantization configuration
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=torch.bfloat16
        )

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        quantization_config=bnb_config
    )

    if do_quantization:
        model = prepare_model_for_kbit_training(
            model,
            use_gradient_checkpointing=True,
        )
    model.config.use_cache = False
    model.gradient_checkpointing_enable()
    model.enable_input_require_grads()

    return model, tokenizer


model, tokenizer = get_model_and_tokenizer(BASE_MODEL_PATH)

## 4. Configure base model for PEFT

[Parameter Efficient Fine-Tuning](https://huggingface.co/blog/peft) (PEFT) is a technique for adapting LLMs to specific tasks by freezing all of the base model parameters and training only a relatively small number of additional parameters. Compared to fine-tuning all parameters, PEFT can significantly reduce the amount of GPU memory required, at some cost to the accuracy of the fine-tuned model.

In the code below, the base model is configured for PEFT using the [Low-Rank Adaptation](https://arxiv.org/pdf/2106.09685) (LoRA) method. It is recommended that you read the [official documentation](https://github.com/huggingface/peft) and experiment with the parameters of the `LoraConfig` class. For example, you may get better model performance with different values for `r` and `lora_alpha`.

In [None]:
from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
    target_modules=
    [
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
)
model = get_peft_model(model, peft_config)
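
To get a feel for how `r` affects the size of the adapter, the sketch below counts the parameters LoRA adds for the target modules listed above. It assumes the layer shapes of Llama 3.1 8B (hidden size 4096, 8 key/value heads of size 128, MLP size 14336, 32 decoder layers); the helper name is hypothetical, and other base models need different shapes. On the real model, `model.print_trainable_parameters()` reports the actual count.

```python
# Assumption: Llama 3.1 8B shapes. Adjust these for other base models.
HIDDEN, KV_DIM, MLP, N_LAYERS = 4096, 1024, 14336, 32

# (in_features, out_features) of each targeted linear layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_trainable_params(r, shapes=TARGET_SHAPES, n_layers=N_LAYERS):
    """LoRA adds two matrices per targeted layer: A (r x in) and B (out x r)."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * n_layers


for r in (8, 16, 64):
    print(f"r={r:>2}: {lora_trainable_params(r):,} trainable parameters")
```

The count grows linearly with `r`, so doubling the rank doubles the adapter size. In standard LoRA the adapter output is scaled by `lora_alpha / r`, which is 1.0 for the values used here.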

## 5. Load training and validation datasets

The following code loads the training and validation datasets from the `train.jsonl` and `val.jsonl` files, respectively.

As the files use the TRL instruction format, the TRL trainer used later will be able to [automatically parse](https://huggingface.co/docs/trl/en/sft_trainer#dataset-format-support) the datasets and [generate the prompts from a template](https://huggingface.co/docs/transformers/en/chat_templating) configured in the tokenizer.

Prompt templates vary between models and TRL will infer the correct template from your base model. If this is not available in your base model or if you wish to change it, you can set your own [template string](https://huggingface.co/docs/transformers/en/chat_templating#advanced-adding-and-editing-chat-templates) manually.

In [None]:
import datasets

# Load the training and evaluation datasets from JSONL files on disk
train_dataset = datasets.load_dataset(
    "json", data_files={"train": "train.jsonl"}, split="train"
)
eval_dataset = datasets.load_dataset(
    "json", data_files={"eval": "val.jsonl"}, split="eval"
)


# Uncomment the following lines if you want to test prompt formatting on a single example from the eval dataset
# from trl.extras.dataset_formatting import get_formatting_func_from_dataset
# print(get_formatting_func_from_dataset(train_dataset, tokenizer)(eval_dataset[0]))

# Define a function to format prompts for each example in the dataset
def formatting_prompts_func(examples):
    # Extract conversation messages from each example
    convos = examples["messages"]

    # Apply the chat template to each conversation without tokenizing or adding generation prompts
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]

    # Return the formatted texts in a new dictionary key
    return {"text": texts}


# Apply the formatting function to both the training and evaluation datasets in batches
train_dataset = train_dataset.map(formatting_prompts_func, batched=True)
eval_dataset = eval_dataset.map(formatting_prompts_func, batched=True)

## 6. Configure trainer

Below, the arguments for the supervised fine-tuning (SFT) trainer are configured. Their values were chosen somewhat arbitrarily and produced satisfactory results during testing.

It is recommended that you read the official documentation and experiment with the arguments passed to `SFTConfig` (see [here](https://huggingface.co/docs/trl/main/en/sft_trainer#trl.SFTConfig)) and `SFTTrainer` (see [here](https://huggingface.co/docs/trl/main/en/sft_trainer#trl.SFTTrainer)).

For example:
- If you get an OOM error when running fine-tuning, you can reduce `per_device_train_batch_size` in order to reduce the memory footprint. However, if your GPU has sufficient memory, you can try increasing it in order to reduce the total number of training steps.
- Consider setting `max_steps`, as you may not need to perform all epochs in order to achieve optimal model accuracy. Conversely, you may see better model accuracy by increasing `num_train_epochs`.
- If fine-tuning is taking too long, you can increase `eval_steps` in order to reduce how often validation is performed. 
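
The relationship between these arguments and the length of a training run can be worked out by hand. The sketch below is an approximation with a hypothetical helper name; it assumes a single GPU and the roughly 1600 training examples used when testing this notebook, and the trainer's own step count may differ slightly in how it rounds partial batches.

```python
import math


def training_schedule(n_examples, per_device_batch, grad_accum, epochs, eval_steps, n_gpus=1):
    """Approximate optimizer steps and evaluation runs for a training config."""
    effective_batch = per_device_batch * grad_accum * n_gpus
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "eval_runs": total_steps // eval_steps,
    }


# Values from the trainer configuration in this notebook, with ~1600 training examples
print(training_schedule(1600, per_device_batch=2, grad_accum=4, epochs=3, eval_steps=200))
```

If you halve `per_device_train_batch_size` to avoid an OOM error, doubling `gradient_accumulation_steps` keeps the effective batch size, and therefore the number of optimizer steps, unchanged.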

The response template used by `DataCollatorForCompletionOnlyLM` currently cannot be inferred automatically from the model, so you must set this string to match your base model. The value defined here is the one used by Llama 3 models.
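
The idea behind completion-only training is that the loss is computed on the assistant's reply only, not on the prompt. The following is a simplified sketch of that masking on plain lists of token IDs, not the collator's actual implementation: the helper names and token IDs are made up, and it assumes each example contains a single assistant turn.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def find_subsequence(sequence, pattern):
    """Return the start index of the first occurrence of `pattern`, or -1."""
    for start in range(len(sequence) - len(pattern) + 1):
        if sequence[start:start + len(pattern)] == pattern:
            return start
    return -1


def mask_prompt_tokens(input_ids, response_template_ids):
    """Copy `input_ids` to labels, hiding everything up to the end of the template."""
    start = find_subsequence(input_ids, response_template_ids)
    if start == -1:
        # Template not found: the whole example is excluded from the loss
        return [IGNORE_INDEX] * len(input_ids)
    response_start = start + len(response_template_ids)
    return [IGNORE_INDEX] * response_start + input_ids[response_start:]


# Toy token IDs (made up): 90, 91 stand in for the tokenized response template
input_ids = [1, 10, 11, 12, 90, 91, 20, 21, 2]
print(mask_prompt_tokens(input_ids, [90, 91]))
```

This is why the template string must match your model's chat template exactly: if it is never found, every token is masked and the example contributes nothing to training.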

In [None]:
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM, SFTConfig

max_seq_length = 4096

# configure training args
args = SFTConfig(
    ###### training
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=20,
    # max_steps = 1,
    num_train_epochs=3,
    learning_rate=1e-4,
    lr_scheduler_type="linear",
    optim="adamw_torch",
    weight_decay=0.01,
    ###### datatypes
    fp16=False,
    bf16=True,
    ###### evaluation
    eval_strategy="steps",
    eval_steps=200,
    per_device_eval_batch_size=2,
    ###### outputs
    logging_steps=20,
    output_dir="outputs",
    max_seq_length=max_seq_length,
    packing=False
)

response_template = "assistant<|end_header_id|>"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer, return_tensors="pt")

# setup trainer
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=args,
    data_collator=collator,
    peft_config=peft_config
)

## 7. Perform supervised fine-tuning

In the code below, fine-tuning is performed using the previously configured trainer.

When testing this step on an NVIDIA A100 using the configuration defined above, it took around 1 hour to perform fine-tuning with a training dataset containing around 1600 examples.

In [None]:
# run fine-tuning
trainer.train()

## 8. Persist the trained model

After fine-tuning, you can save the LoRA adapter weights, allowing you to later load them on top of the base model for inference.

In [None]:
import pathlib

FINETUNED_MODEL_PATH = pathlib.Path("./finetuned_model")

FINETUNED_MODEL_PATH.mkdir(exist_ok=True, parents=True)

# `model` is your PEFT-wrapped model, `tokenizer` the tokenizer you trained with
model.save_pretrained(FINETUNED_MODEL_PATH)
tokenizer.save_pretrained(FINETUNED_MODEL_PATH)

print("✓ LoRA adapter written to", FINETUNED_MODEL_PATH.resolve())

## 9. Visualize fine-tuning metrics

Some of the metrics collected during fine-tuning are visualized below to help you diagnose any potential issues with the model.

Specifically, the training and validation losses are plotted against the training step number. Please check the plot for the following:
- Ideally, as the fine-tuning steps increase, the training and validation losses should decrease and converge. 
- If both loss curves are still decreasing at the end of training and have not converged, the model is likely [underfitting](https://www.ibm.com/topics/underfitting), and it may be worth performing more fine-tuning steps or epochs.
- If the validation loss suddenly starts to increase while the training loss continues to decrease or converge, you should decrease your total number of steps or epochs. This is known as [overfitting](https://www.ibm.com/topics/overfitting).
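
The checks above can also be expressed as a rough rule of thumb in code. This is only a heuristic sketch with a hypothetical function name and made-up loss values; `patience` (the number of evaluations without improvement before flagging overfitting) is an assumption you should tune, and the plot remains the better diagnostic.

```python
def diagnose(train_losses, eval_losses, patience=2):
    """Rough heuristic label for a pair of loss curves (lists ordered by step)."""
    best = min(range(len(eval_losses)), key=eval_losses.__getitem__)
    evals_since_best = len(eval_losses) - 1 - best
    train_still_falling = train_losses[-1] < train_losses[0]
    if evals_since_best >= patience and train_still_falling:
        return f"possible overfitting: eval loss was lowest at evaluation #{best + 1}"
    if best == len(eval_losses) - 1:
        return "eval loss still improving: more steps or epochs may help"
    return "no clear signal yet"


# Made-up loss values for illustration
print(diagnose([1.0, 0.6, 0.3, 0.2], [0.9, 0.7, 0.8, 0.95]))
print(diagnose([1.0, 0.8, 0.7], [1.1, 0.9, 0.8]))
```
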

In [None]:
!pip install pandas matplotlib

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# plot step against train and val losses
fig, ax = plt.subplots()
log_history = pd.DataFrame(trainer.state.log_history)
# the trainer logs the per-step training loss under "loss";
# "train_loss" only appears once, in the final summary entry
log_history[["step", "loss"]].dropna().plot(x="step", ax=ax)
log_history[["step", "eval_loss"]].dropna().plot(x="step", ax=ax)
ax.set_ylabel("loss")
plt.show()

## 10. Run ad hoc inference

You can load your fine-tuned model from disk using Hugging Face Transformers and run inference on individual inputs of your choosing using the code below.

Note that the inputs passed to the model are in the [TRL conversational format](https://huggingface.co/docs/trl/en/sft_trainer#dataset-format-support), as the Hugging Face [chat template requires them to be](https://huggingface.co/docs/transformers/main/en/chat_templating#how-do-i-use-chat-templates). During training, TRL will [automatically convert the instruction format to the conversational format](https://github.com/huggingface/trl/blob/main/trl/extras/dataset_formatting.py). However, you have to do this yourself when applying chat templates manually for inference.
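
If your records are in the instruction format, the manual conversion amounts to wrapping the two fields in chat messages. A minimal sketch, assuming `prompt` and `completion` keys; the helper names and the sample record are made up for illustration.

```python
def to_conversational(example):
    """Turn an instruction-format record into a list of chat messages."""
    return [
        {"role": "user", "content": example["prompt"]},
        {"role": "assistant", "content": example["completion"]},
    ]


def to_inference_messages(example):
    """Keep only the user turn, so the model generates the assistant reply."""
    return to_conversational(example)[:1]


record = {"prompt": "I want to send money", "completion": "StartFlow(transfer_money)"}  # made-up record
print(to_conversational(record))
print(to_inference_messages(record))
```

For inference you pass only the user turn to the chat template and let `add_generation_prompt=True` open the assistant turn.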

In [None]:
import torch, pathlib, os
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TextStreamer,
)
from peft import PeftConfig, PeftModel

ADAPTER_DIR   = pathlib.Path("./finetuned_model")
dtype         = torch.bfloat16

bnb_cfg = None
if do_quantization:
    bnb_cfg = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

peft_cfg   = PeftConfig.from_pretrained(ADAPTER_DIR)

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_PATH,
    device_map="auto",
    torch_dtype=dtype,
    attn_implementation="flash_attention_2",
    quantization_config=bnb_cfg,
    trust_remote_code=True,
)

model = PeftModel.from_pretrained(base_model, ADAPTER_DIR)

# (optional) merge for slightly faster inference
model = model.merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_PATH, trust_remote_code=True)
tokenizer.pad_token = "<|finetune_right_pad_id|>"

model.eval()
model.config.use_cache = True

# Assumption: the last message of each example is the assistant reply to be predicted,
# so everything before it forms the prompt
messages = eval_dataset[0]["messages"][:-1]

input_ids = tokenizer.apply_chat_template(
    messages,  # already in the TRL conversational format
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

streamer = TextStreamer(tokenizer)
with torch.inference_mode():
    _ = model.generate(
        input_ids,
        max_new_tokens=64,
        do_sample=False,
        streamer=streamer,
    )

## 11. Serve with vLLM

You can also deploy the model with the vLLM library using the command below. If you are not using quantization, remove the `--quantization bitsandbytes` flag.
For any further adjustments and parameterization, check out the [official vLLM documentation](https://docs.vllm.ai/en/latest/).

In [None]:
# Same value as `HUGGINGFACE_TOKEN`, but vLLM reads it from `HF_TOKEN`.
# Keep the comment on its own line: `%env` treats the rest of its line as the value.
%env HF_TOKEN=CHANGEME

In [None]:
%%sh
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--quantization bitsandbytes \
--dtype bfloat16 \
--enable-lora \
--lora-modules custom_lora=finetuned_model \
--swap-space 16 \
--max-model-len 4096
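
Once the server is running, you can query the adapter through vLLM's OpenAI-compatible HTTP API. The sketch below only builds the request so that you can inspect it first. It assumes the server is reachable locally on vLLM's default port 8000 and that the adapter is addressed by the name given to `--lora-modules` (`custom_lora`); the helper name and the user message are made up.

```python
import json
import urllib.request

# Assumptions: local server on vLLM's default port, adapter served as "custom_lora"
VLLM_URL = "http://localhost:8000/v1/chat/completions"


def build_chat_request(user_text, model="custom_lora", max_tokens=64):
    """Build (but do not send) a chat completion request for the vLLM server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding, mirroring do_sample=False above
    }
    return urllib.request.Request(
        VLLM_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("I want to send money")
print(request.full_url, json.loads(request.data))

# With the server running, send it like this:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read())["choices"][0]["message"]["content"])
```

Setting `model` to the base model name instead would bypass the adapter, which is a convenient way to compare outputs before and after fine-tuning.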

## 12. Export fine-tuned model

Lastly, export your fine-tuned model directory to an appropriate storage location that can be easily accessed later for [deployment](https://rasa.com/rasa-pro/docs/building-assistants/self-hosted-llm).

It is recommended that you use a cloud object store, such as [Amazon S3](https://aws.amazon.com/s3/) or [Google Cloud Storage](https://cloud.google.com/storage).

Uncomment and run the corresponding commands below for your cloud provider, making sure to first update the environment variables with your own values. It is assumed that:
- your bucket already exists
- you have already installed the CLI tool for your cloud provider
- you have already authenticated with your cloud provider and have sufficient permissions to write to your bucket

In [None]:
%%sh
export LOCAL_MODEL_PATH="./finetuned_model"

# if using amazon
# export S3_MODEL_URI="s3://CHANGEME" # update with your value
# aws s3 cp "${LOCAL_MODEL_PATH}" "${S3_MODEL_URI}" --recursive

# if using google
# export GCS_MODEL_URI="gs://CHANGEME" # update with your value
# gsutil cp -r "${LOCAL_MODEL_PATH}" "${GCS_MODEL_URI}"