# [C-1]. Fine-tuning an LLM with Intel® Gaudi®

# What you will do

In this tutorial, you will fine-tune the Qwen3-0.6B model for Causal Language Modeling using Optimum Habana. This guide builds upon examples from the [huggingface/optimum-habana](https://github.com/huggingface/optimum-habana) repository.

We will focus on adapting the model to a specific "Joseon Dynasty" persona using the following techniques:
- DeepSpeed: Optimizing memory usage for distributed training on Gaudi.
- LoRA (Low-Rank Adaptation): Efficient fine-tuning that freezes the pre-trained weights and trains small low-rank update matrices.
- GraLoRA (Granular LoRA): An advanced PEFT method for better expressiveness.

**Note on Scalability**: While this workshop uses a single Gaudi device, the fine-tuning code is written for multi-Gaudi distributed training, so the same workflow carries over when you increase the number of devices.

## What is Causal Language Modeling?

Causal language modeling is the task of predicting the token following a sequence of tokens. In this scenario, the model **only attends to the left context** (tokens previously generated or provided). This training objective is essential for generation tasks like chatbots and story completion.
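
To make the "left context only" idea concrete, here is a small toy sketch in plain Python. It does not involve a real model or tokenizer: the token list is made up for illustration, and the mask is only a schematic view of which positions may attend to which.

```python
# Toy illustration of the causal language modeling objective (no real model involved).
tokens = ["The", "king", "rules", "the", "land"]  # made-up example sequence

# Each training example pairs a left context with the token that follows it.
examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in examples:
    print(f"{' '.join(context):<22} -> {target}")

# Causal attention mask: position i may attend to position j only if j <= i.
n = len(tokens)
causal_mask = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
for row in causal_mask:
    print(row)
```

The mask is lower triangular: the first token sees only itself, while the last token sees the whole sequence.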

# 1. Install Dependencies

First, we install the necessary libraries. This includes **Optimum Habana** for Gaudi support, **PEFT** for parameter-efficient fine-tuning, and **DeepSpeed** for distributed training optimization.

In [None]:
# Install Optimum Habana from main
!pip install git+https://github.com/huggingface/optimum-habana@main 

# Install standard data processing and visualization libraries
!pip install datasets sentencepiece protobuf scikit-learn pandas matplotlib

# Install the latest PEFT library
!pip install git+https://github.com/huggingface/peft@main 

# Install Habana-optimized DeepSpeed
!pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.22.0

# 2. Dataset Preparation

## Dataset: Joseon Persona

We are using a custom Korean fine-tuning dataset designed to train an LLM to respond to modern daily inquiries with the persona and tone of a loyal subject from the Joseon Dynasty.

Examples:

| Input | Output|
| --- | --- |
| 돈 아껴 쓰라고 해줘. | 티끌 모아 태산이라 했습니다. 그 푼돈을 아껴야 나중에 기와집이라도 한 채 사시지 않겠사옵니까. |
| 겨울 간식 추천해줘. | 뜨끈한 어묵 국물과 함께 먹는 붕어빵, 그리고 귤이 겨울의 맛이옵니다. |
| 운동하라고 닥달해줘. | 뱃살이 인덕이라 우길 단계는 지났사옵니다. 당장 일어나 뛰시옵소서! |

You can view the full dataset [here](./joseon_persona_dataset.csv).
Each example is wrapped in an Alpaca-style instruction format that matches the structure of the Joseon Persona dataset.
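
As a rough sketch of what this formatting looks like, the snippet below wraps one dataset row in the instruction template that the evaluation code later in this notebook uses. The `Input`/`Output` keys mirror the table headers above and are an assumption about the CSV column names; `build_training_text` is a hypothetical helper, not part of the training script.

```python
# Prompt template copied from the evaluation code later in this notebook.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{input}\n\n### Response:\n"
)


def build_training_text(example, eos_token=""):
    """Join the formatted prompt and the target response into one training string.

    Hypothetical helper: the "Input"/"Output" keys are assumed column names.
    """
    prompt = PROMPT_TEMPLATE.format(input=example["Input"])
    return prompt + example["Output"] + eos_token


row = {
    "Input": "겨울 간식 추천해줘.",
    "Output": "뜨끈한 어묵 국물과 함께 먹는 붕어빵, 그리고 귤이 겨울의 맛이옵니다.",
}
text = build_training_text(row)
print(text)
```

During training, the model learns to continue the text after `### Response:` in the persona's tone.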

# 3. Training Configuration

### Fine-tuning Qwen/Qwen3-0.6B on Intel Gaudi

We will configure our training in three parts:
1. PEFT Configuration: Selecting the method and its hyperparameters.
2. Training Arguments: Standard Hugging Face training arguments (learning rate, epochs, etc.).
3. Gaudi Arguments: Gaudi-specific configurations.

## 3.1. PEFT Arguments

We start by defining the LoRA parameters. LoRA reduces the number of trainable parameters by injecting rank-decomposition matrices into each layer of the Transformer architecture.
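
To get a feel for the savings, the following back-of-the-envelope sketch counts trainable parameters for a single linear layer with and without LoRA, using the rank and alpha values from this tutorial. The 1024 x 1024 layer shape is a hypothetical example chosen for illustration, not a value read from the model.

```python
def lora_param_count(d_out, d_in, rank):
    """Trainable parameters of one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank


d_out, d_in = 1024, 1024   # hypothetical layer shape, for illustration only
rank, alpha = 64, 128      # values used in this tutorial

full_params = d_out * d_in
lora_params = lora_param_count(d_out, d_in, rank)
scaling = alpha / rank     # LoRA scales the low-rank update by alpha / rank

print(f"Full fine-tuning: {full_params:,} trainable parameters")
print(f"LoRA (r={rank}):    {lora_params:,} trainable parameters")
print(f"Ratio:            {lora_params / full_params:.1%}")
print(f"Update scaling:   {scaling}")
```

A smaller rank shrinks the adapter further, at the cost of a less expressive update.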

In [None]:
# 1. Define PEFT (LoRA) Parameters
input_kwargs = {
    "peft_type": "lora",
    "lora_rank": 64,
    "lora_alpha": 128,
    "lora_dropout": 0.05,
}

## 3.2. Training Arguments

We then define the standard training arguments.

In [None]:
# 2. Define Standard Transformer Training Arguments
transformer_kwargs = {
    "model_name_or_path": "Qwen/Qwen3-0.6B",
    "num_train_epochs": 20,
    "seed": 42,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.1,
    "learning_rate": 1e-3,
    "max_grad_norm": 1.0,
    "per_device_train_batch_size": 256,
    "gradient_checkpointing": True,
    "logging_steps": 1,
}

# Merge arguments
input_kwargs.update(transformer_kwargs)

# Generate a unique output directory name based on hyperparameters
name_args = ["peft_type", "lora_rank", "lora_alpha", "num_train_epochs", "learning_rate"]
lora_output_dir = "./finetuned_models/joseon_persona_qwen3_0.6b"
for arg in name_args:
    lora_output_dir += f"_{arg.replace('lora_', '')}-{input_kwargs[arg]}"
    
print(f"Model will be saved in: {lora_output_dir}")
input_kwargs["output_dir"] = lora_output_dir
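
The `lr_scheduler_type` and `warmup_ratio` settings above combine a linear warm-up with a cosine decay. The sketch below approximates that shape in plain Python so you can see how the learning rate evolves. It is a simplified illustration, and `total_steps = 100` is a hypothetical value; the real number of steps depends on the dataset size, batch size, and number of epochs.

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-3, warmup_ratio=0.1):
    """Approximate learning rate for linear warm-up followed by cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warm-up from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 100  # hypothetical value for illustration
for step in (0, 5, 10, 55, 100):
    print(f"step {step:>3}: lr = {lr_at_step(step, total_steps):.6f}")
```

The learning rate peaks at `learning_rate` once the warm-up ends and then decays smoothly toward zero.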

## 3.3. Gaudi Specific Arguments

To enable training on Intel Gaudi, we must replace the standard Hugging Face Transformers classes with their Optimum Habana counterparts.
Specifically, `Trainer` is replaced by `GaudiTrainer`, and `TrainingArguments` is replaced by `GaudiTrainingArguments`.

### Why `GaudiTrainer`
The `GaudiTrainer` is built on top of the standard Transformers `Trainer`. Its primary role is to integrate the `gaudi_config` argument, which controls essential hardware-specific behaviors:
- Mixed Precision: Utilizing BF16/FP32 autocasting.
- Fused Operations: Leveraging Habana's custom AdamW or Clip Norm implementations.

Additional information and the exact implementation of `GaudiTrainer` can be found [here](https://github.com/huggingface/optimum-habana/blob/main/optimum/habana/transformers/trainer.py#L214).
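
For orientation, a Gaudi configuration is a small JSON file. The sketch below shows what such a file might contain, written as a Python dictionary. The field names follow Optimum Habana's `GaudiConfig`, but the values are assumptions for illustration and may differ from the published `Habana/qwen` configuration used later.

```python
import json

# Illustrative Gaudi configuration. The values below are assumptions and may
# differ from the configuration actually published under "Habana/qwen".
gaudi_config_sketch = {
    "use_torch_autocast": True,   # mixed precision through BF16/FP32 autocasting
    "use_fused_adam": True,       # Habana's fused AdamW implementation
    "use_fused_clip_norm": True,  # Habana's fused gradient-norm clipping
}

print(json.dumps(gaudi_config_sketch, indent=2))
```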

### Why `GaudiTrainingArguments`
The `GaudiTrainingArguments` class extends the standard Transformers `TrainingArguments` to include parameters critical for managing the Gaudi execution flow:
- Training Mode (Lazy vs. Eager): Determines the execution backend. It is typically used to enable Lazy Mode, which accumulates operations to build and compile efficient computation graphs rather than executing them eagerly operation-by-operation.
- Compilation & Cache Limits: Controls the Gaudi graph compiler's behavior, allowing you to set limits on cache size or the maximum number of compiled graphs to prevent out-of-memory (OOM) errors during dynamic shape variations.
- Profiling & Throughput: Provides specific arguments to fine-tune performance measurement, such as defining warm-up steps and identifying the exact number of steps to capture for accurate throughput calculation.

Additional information and the exact implementation of `GaudiTrainingArguments` can be found [here](https://github.com/huggingface/optimum-habana/blob/main/optimum/habana/transformers/training_args.py#L86).

### Define Gaudi Arguments

In [None]:
# 3. Define Gaudi Specific Arguments
gaudi_kwargs = {
    "use_habana": True,  # Whether to use Gaudi or not
    "use_lazy_mode": True,  # Whether to use lazy or eager mode
    "gaudi_config_name": "Habana/qwen",  # Gaudi configuration to use
}
input_kwargs.update(gaudi_kwargs)

# 4. Distributed Training with DeepSpeed

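To manage memory efficiently and train larger models, we leverage **DeepSpeed**. Its ZeRO optimizer partitions training state, such as optimizer states and gradients, across processes. The configuration used here, `configs/deepspeed_zero_1.json`, is a ZeRO stage 1 setup as its name indicates, and stage 1 partitions the optimizer states.

We will use the `DistributedRunner` class to launch the training script. This utility handles the complexity of spawning multiple processes across Gaudi devices.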

Steps:
1. Define the DeepSpeed configuration.
2. Instantiate DistributedRunner with the command to run.
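
The contents of the DeepSpeed configuration file are not shown in this notebook. As a hedged sketch, a ZeRO stage 1 configuration could look roughly like the dictionary below. The keys are standard DeepSpeed options, and the `"auto"` values rely on the Hugging Face integration filling them in from the training arguments; the specific values are assumptions, not a copy of `configs/deepspeed_zero_1.json`.

```python
import json

# Illustrative ZeRO stage 1 configuration; values are assumptions for this sketch.
zero1_config_sketch = {
    "steps_per_print": 64,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 1},  # stage 1 partitions optimizer states
}

print(json.dumps(zero1_config_sketch, indent=2))
```

Moving to stage 2 would additionally partition gradients, which trades some communication overhead for lower memory use per device.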

### Launch Training
- The training log will be saved in the `./logs` directory.
- The loss curve will be saved in the `./train_loss` directory.

In [None]:
import os
import socket
from optimum.habana.distributed import DistributedRunner

# Add DeepSpeed config to arguments
input_kwargs["deepspeed"] = "configs/deepspeed_zero_1.json" 

# Construct the command line string from our arguments
training_args_command_line = " ".join(f"--{key} {value}" for key, value in input_kwargs.items())

# Create the log directory if it does not exist yet
os.makedirs("./logs", exist_ok=True)

# Build the command that runs the external script `run_lora_fine_tuning.py`
# and redirects its output to the log file
train_log_file = f"./logs/train_log_{input_kwargs['output_dir'].split('/')[-1]}.txt"
command = f"./run_lora_fine_tuning.py {training_args_command_line} > {train_log_file}"

print(f"Training log will be saved in {train_log_file}")


def is_port_open(host, port, timeout=1):
    # Create a new socket using the with statement to ensure it's closed automatically
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        result = sock.connect_ex((host, int(port)))
        if result == 0:
            sock.shutdown(socket.SHUT_RDWR)
            return True
        else:
            return False

MASTER_PORT = 29500
while is_port_open("localhost", MASTER_PORT):
    MASTER_PORT += 1

distributed_runner = DistributedRunner(
    command_list=[command], 
    world_size=1,        # Set to >1 for multi-card training
    use_deepspeed=True,  # Enable DeepSpeed
    master_port=MASTER_PORT
)

# Launch
print("="*100)
print(f"Starting training with command: {command}")
print("="*100)
print(f"Training log will be saved in {train_log_file}")
print("="*100)
ret_code = distributed_runner.run()
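
Because the training output is redirected to a file, it helps to inspect the end of the log once the run finishes. The helpers below are a minimal sketch for that purpose. `tail_log` and `report_run` are hypothetical names introduced here, and the sketch assumes that `ret_code` is the exit status of the launched process, with 0 meaning success.

```python
from pathlib import Path


def tail_log(log_path, num_lines=20):
    """Return the last `num_lines` lines of a log file, or [] if it does not exist."""
    path = Path(log_path)
    if not path.exists():
        return []
    lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
    return lines[-num_lines:]


def report_run(ret_code, log_path, num_lines=20):
    """Print a short status summary and the end of the training log."""
    if ret_code == 0:
        status = "succeeded"
    else:
        status = f"failed with return code {ret_code}"
    print(f"Training {status}. Last log lines:")
    for line in tail_log(log_path, num_lines):
        print(line)
    return ret_code == 0
```

After the launch cell above, calling `report_run(ret_code, train_log_file)` would print the status and the last lines of the log.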

# 5. Evaluation
Now that the model is trained, we verify its behavior. We will use a helper function `evaluate` to generate text with both the base model and the fine-tuned adapter.

### Evaluation Function

In [None]:
import torch
from optimum.habana.utils import set_seed

set_seed(42)

@torch.inference_mode()
def evaluate(prompt_list, tokenizer, base_model, trained_model, trained_model2=None):
    # Define the chat template used for fine-tuning
    chat_template = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{input}\n\n### Response:\n"
    )
    if isinstance(prompt_list, str):
        prompt_list = [prompt_list]
    for prompt in prompt_list:
        # Generate the prompt for the base model using the original chat template
        base_prompt = tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}], tokenize=False, add_generation_prompt=True, enable_thinking=False
        )
        base_prompt = tokenizer.encode(base_prompt, add_special_tokens=False, return_tensors="pt")
        base_prompt = base_prompt.to("hpu")

        # Generate the prompt for the fine-tuned model using the new chat template
        trained_prompt = chat_template.format(input=prompt)
        trained_prompt = tokenizer.encode(trained_prompt, add_special_tokens=False, return_tensors="pt")
        trained_prompt = trained_prompt.to("hpu")

        # Generate the output for the base model
        base_output = base_model.generate(
            input_ids=base_prompt,
            max_new_tokens=50,
            do_sample=False,
            temperature=None,
            top_p=None,
            top_k=None,
        )
        base_result = tokenizer.decode(base_output[0][len(base_prompt[0]) :], skip_special_tokens=True)

        # Generate the output for the fine-tuned model
        trained_output = trained_model.generate(
            input_ids=trained_prompt,
            max_new_tokens=50,
            do_sample=False,
            temperature=None,
            top_p=None,
            top_k=None,
        )
        trained_result = tokenizer.decode(trained_output[0][len(trained_prompt[0]) :], skip_special_tokens=True)

        # Print the results
        print("=" * 26 + " Input " + "=" * 26 + "\n")
        print(prompt + "\n")
        print("=" * 21 + " Original Output " + "=" * 21 + "\n")
        print(base_result + "\n")
        print("=" * 20 + " Fine-tuned Output " + "=" * 20 + "\n")
        print(trained_result + "\n")

        if trained_model2 is not None:
            trained_output2 = trained_model2.generate(
                input_ids=trained_prompt,
                max_new_tokens=50,
                do_sample=False,
                temperature=None,
                top_p=None,
                top_k=None,
            )
            trained_result2 = tokenizer.decode(trained_output2[0][len(trained_prompt[0]) :], skip_special_tokens=True)
            print("=" * 20 + " Fine-tuned Output2 " + "=" * 19 + "\n")
            print(trained_result2 + "\n")

        print("=" * 27 + " End " + "=" * 27)
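
One detail in `evaluate` deserves a note: for a decoder-only model, the sequence returned by `generate` starts with the prompt tokens, so the function slices them off before decoding. The toy sketch below shows the same slicing on plain lists; the token IDs are made up for illustration.

```python
# Made-up token IDs standing in for a tokenized prompt.
prompt_ids = [101, 7, 42, 9]

# A decoder-only model returns the prompt followed by the newly generated tokens.
generated_ids = prompt_ids + [55, 66, 2]

# Keep only the tokens that come after the prompt, as `evaluate` does before decoding.
new_token_ids = generated_ids[len(prompt_ids):]
print(new_token_ids)  # [55, 66, 2]
```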

### Test and Enjoy! 
We load the tokenizer, the base model, and the fine-tuned LoRA adapter to prepare for evaluation. We then compile each model with `torch.compile` for faster inference.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained(input_kwargs["model_name_or_path"])

base_model = AutoModelForCausalLM.from_pretrained(input_kwargs["model_name_or_path"])
base_model = base_model.to("hpu")
base_model = torch.compile(base_model, backend="hpu_backend")

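# Load the model from the output directory and attach the fine-tuned LoRA adapter
trained_model = AutoModelForCausalLM.from_pretrained(input_kwargs["output_dir"])
trained_model = PeftModel.from_pretrained(trained_model, lora_output_dir)
# merge_and_unload() returns the base model with the adapter weights merged in
trained_model = trained_model.merge_and_unload()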
trained_model = trained_model.to("hpu")
trained_model = torch.compile(trained_model, backend="hpu_backend")
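
Merging folds the low-rank update into the original weight so that inference needs no extra adapter computation. The sketch below reproduces that arithmetic on tiny hand-written matrices in plain Python. It illustrates the formula `W + (alpha / rank) * B @ A` only; it is not PEFT's implementation, and `matmul` and `merge_lora` are hypothetical helpers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A for small list-based matrices."""
    scaling = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


weight = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2 x 2 weight
lora_a = [[1.0, 2.0]]              # A: rank x d_in  (rank = 1)
lora_b = [[0.5], [0.0]]            # B: d_out x rank
merged = merge_lora(weight, lora_a, lora_b, alpha=2, rank=1)
print(merged)  # [[2.0, 2.0], [0.0, 1.0]]
```

Applying `merged` to an input gives the same result as applying the frozen weight and the scaled adapter path separately, which is why the merged model can be used like a regular model.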

Feel free to modify the input prompts below to experiment with various queries and observe how the fine-tuned model's responses differ from the original.

In [None]:
# Test with example prompts. You can change the input prompts to see different outputs.
input_prompt_list = [
    "오늘 저녁 메뉴 추천해줘",
    "오늘 행사 재미있었어",
    "주말에 뭐하면 좋을까",
]
evaluate(input_prompt_list, tokenizer, base_model, trained_model)

## Next Steps: Exploring Recent PEFT Methods
We are now ready to move on to the next session to experiment with the latest PEFT techniques. Please restart the kernel to release resources and open the next [notebook](./2_GraLoRA_finetuning.ipynb).