# [C-2]. Fine-tuning LLM with Intel® Gaudi®: GraLoRA

# 1. Introducing GraLoRA (Granular Low-Rank Adaptation)

<img src="./figures/gralora_overview.png" width="75%" height="75%"/>

In this session, we explore [GraLoRA](https://huggingface.co/docs/peft/package_reference/gralora), a new PEFT (Parameter-Efficient Fine-Tuning) method developed by our team.

**[Important Note]**: Before you start, make sure to *restart the kernel* of the previous notebook to release its allocated resources.

### What is GraLoRA?

GraLoRA (Granular Low-Rank Adaptation) extends the standard LoRA approach by using multiple small adapters, rather than a single low-rank pair, to approximate the gradient of full-model fine-tuning. The method aims to be more robust and expressive than standard LoRA without additional cost.

Since GraLoRA is fully integrated into the Hugging Face PEFT library, adapting our previous LoRA workflow is simple: we only need to change the `peft_type` to `gralora` and update our output directory.
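
To make the idea of "multiple small adapters" more concrete, here is a simplified NumPy illustration. It assumes a `k x k` block partition of the weight update in which every block gets its own rank-`r/k` adapter. The sizes, variable names, and random initialization are illustrative assumptions, and the PEFT implementation may organize its parameters differently.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 64, 64   # toy layer size (assumption)
r, k = 8, 2            # LoRA rank and number of blocks per side (assumption)

# Standard LoRA: one pair of low-rank factors, delta_W = B @ A
lora_B = rng.standard_normal((d_out, r))
lora_A = rng.standard_normal((r, d_in))
lora_delta = lora_B @ lora_A
lora_params = lora_B.size + lora_A.size

# Block-wise variant: k x k independent adapters, each of rank r // k
bo, bi, br = d_out // k, d_in // k, r // k
blocks, block_params = [], 0
for i in range(k):
    row = []
    for j in range(k):
        B = rng.standard_normal((bo, br))
        A = rng.standard_normal((br, bi))
        block_params += B.size + A.size
        row.append(B @ A)
    blocks.append(row)
block_delta = np.block(blocks)  # assemble the full (d_out, d_in) update

lora_rank = int(np.linalg.matrix_rank(lora_delta))
block_rank = int(np.linalg.matrix_rank(block_delta))
print(f"LoRA:       params={lora_params}, rank={lora_rank}")
print(f"Block-wise: params={block_params}, rank={block_rank}")
```

In this toy setting both variants use the same number of trainable parameters, while the block-wise update can reach a higher rank. That is one way to read the "more expressive without additional cost" claim above.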

# 2. Configuration

We will define the training arguments. The setup remains largely similar to the standard LoRA configuration, with the key difference being the `peft_type`.

Key Settings:
- `peft_type`: Set to `gralora`.
- `gaudi_config_name`: Uses `Habana/qwen` for Gaudi optimization.

### 1. Determine the PEFT Arguments

In [None]:
# 1. Define PEFT (LoRA) Parameters
input_kwargs = {
    "peft_type": "gralora",
    "lora_rank": 64,
    "lora_alpha": 128,
    "lora_dropout": 0.05,
}

### 2. Determine Training and Gaudi Arguments

In [None]:
# 1. Define PEFT (LoRA) Parameters
input_kwargs = {
    "peft_type": "gralora",
    "lora_rank": 64,
    "lora_alpha": 128,
    "lora_dropout": 0.05,
}

# 2. Define Standard Transformer Training Arguments
transformer_kwargs = {
    "model_name_or_path": "Qwen/Qwen3-0.6B",
    "num_train_epochs": 20,
    "seed": 42,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.1,
    "learning_rate": 1e-3,
    "max_grad_norm": 1.0,
    "per_device_train_batch_size": 256,
    "gradient_checkpointing": True,
    "logging_steps": 1,
}

# Merge arguments
input_kwargs.update(transformer_kwargs)

# Generate a unique output directory name based on hyperparameters
name_args = ["peft_type", "lora_rank", "lora_alpha", "num_train_epochs", "learning_rate"]
gralora_output_dir = "./finetuned_models/joseon_persona_qwen3_0.6b"
for arg in name_args:
    gralora_output_dir += f"_{arg.replace('lora_', '')}-{input_kwargs[arg]}"
    
print(f"Model will be saved in: {gralora_output_dir}")
input_kwargs["output_dir"] = gralora_output_dir

# 3. Define Gaudi Specific Arguments
gaudi_kwargs = {
    "use_habana": True,  # Whether to use Gaudi or not
    "use_lazy_mode": True,  # Whether to use lazy or eager mode
    "gaudi_config_name": "Habana/qwen",  # Gaudi configuration to use
}
input_kwargs.update(gaudi_kwargs)
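
As a quick check of the naming logic above, the following standalone sketch reproduces the loop with the same hyperparameter values. `build_output_dir` is a hypothetical helper introduced only for this illustration.

```python
def build_output_dir(base_dir, kwargs, name_args):
    """Append `_<name>-<value>` for each selected hyperparameter (hypothetical helper)."""
    out = base_dir
    for arg in name_args:
        # Strip the `lora_` prefix so that `lora_rank` becomes `rank`
        out += f"_{arg.replace('lora_', '')}-{kwargs[arg]}"
    return out

example_kwargs = {
    "peft_type": "gralora",
    "lora_rank": 64,
    "lora_alpha": 128,
    "num_train_epochs": 20,
    "learning_rate": 1e-3,
}
example_name_args = ["peft_type", "lora_rank", "lora_alpha", "num_train_epochs", "learning_rate"]
example_dir = build_output_dir(
    "./finetuned_models/joseon_persona_qwen3_0.6b", example_kwargs, example_name_args
)
print(example_dir)
# ./finetuned_models/joseon_persona_qwen3_0.6b_peft_type-gralora_rank-64_alpha-128_num_train_epochs-20_learning_rate-0.001
```

Because the hyperparameters are encoded in the directory name, runs with different settings do not overwrite each other.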

# 3. Training Execution
We use the `DistributedRunner` to execute the training script (`run_lora_fine_tuning.py`) on Gaudi. We also enable DeepSpeed to reduce memory usage during training.

- The training log will be saved in the `./logs` directory.
- The loss curve will be saved in the `./train_loss` directory.
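
The training cell passes `configs/deepspeed_zero_1.json` to the script, but the contents of that file are not shown in this notebook. As a rough guide, a minimal ZeRO stage 1 configuration for the Hugging Face Trainer integration often looks like the sketch below. The values are assumptions for illustration and may differ from the file shipped with this tutorial.

```python
import json

# Hypothetical minimal ZeRO stage 1 config; the real file may differ.
example_ds_config = {
    "train_batch_size": "auto",                # "auto" lets the Trainer fill in the value
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 1},         # stage 1: partition optimizer states
}
print(json.dumps(example_ds_config, indent=2))
```

ZeRO stage 1 partitions optimizer states across devices, so the memory savings grow with `world_size`.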

In [None]:
import os
import socket
from optimum.habana.distributed import DistributedRunner

# Add DeepSpeed config to arguments
input_kwargs["deepspeed"] = "configs/deepspeed_zero_1.json" 

# Construct the command line string from our arguments
training_args_command_line = " ".join(f"--{key} {value}" for key, value in input_kwargs.items())

# Make sure the log directory exists
os.makedirs("./logs", exist_ok=True)

# We execute the external script `run_lora_fine_tuning.py` and redirect its output to a log file
train_log_file = f"./logs/train_log_{input_kwargs['output_dir'].split('/')[-1]}.txt"
command = f"./run_lora_fine_tuning.py {training_args_command_line} > {train_log_file}"

print(f"Training log will be saved in {train_log_file}")

def is_port_open(host, port, timeout=1):
    # Create a new socket using the with statement to ensure it's closed automatically
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        result = sock.connect_ex((host, int(port)))
        if result == 0:
            sock.shutdown(socket.SHUT_RDWR)
            return True
        else:
            return False

MASTER_PORT = 29500
while is_port_open("localhost", MASTER_PORT):
    MASTER_PORT += 1

# Instantiate the DistributedRunner
distributed_runner = DistributedRunner(
    command_list=[command], 
    world_size=1,        # Set to >1 for multi-card training
    use_deepspeed=True,  # Enable DeepSpeed
    master_port=MASTER_PORT
)

# Launch
print(f"Starting training with command: {command}")
print(f"Training log will be saved in {train_log_file}")
ret_code = distributed_runner.run()
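
Since the script's output is redirected to a file, nothing appears in the notebook while training runs. A small helper such as the one below can show the last lines of the log afterwards. `tail_log` is a hypothetical helper, not part of `optimum.habana`, and the demo uses a throwaway file rather than a real training log.

```python
import tempfile
from collections import deque
from pathlib import Path

def tail_log(path, n=20):
    """Return the last `n` lines of a text file (hypothetical helper)."""
    with Path(path).open("r", encoding="utf-8", errors="replace") as f:
        # A bounded deque keeps only the most recent `n` lines in memory
        return list(deque((line.rstrip("\n") for line in f), maxlen=n))

# Demo with a temporary file; in the notebook you would pass `train_log_file`.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo_train_log.txt"
    demo_path.write_text("\n".join(f"step {i}" for i in range(1, 6)) + "\n", encoding="utf-8")
    demo_tail = tail_log(demo_path, n=2)

print(demo_tail)  # ['step 4', 'step 5']
```

In the notebook, calling `print("\n".join(tail_log(train_log_file)))` after the run would show the final log lines, which is usually the quickest way to see why a run stopped.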

# 4. Inference Setup

Now that the model is trained, we need to evaluate it. We define a Python function, `evaluate`, that generates responses from both the Base Model and the Fine-Tuned Model side by side.

**Note on Templates**
- Base Model: Uses the standard chat template via `tokenizer.apply_chat_template`.
- Fine-Tuned Model: Uses a specific "Instruction/Response" format that matches the Joseon Persona dataset structure.
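
To see what the fine-tuned model actually receives, the instruction/response template can be rendered on its own. The template string is the one used in `evaluate`; the example question is an arbitrary placeholder.

```python
# Same instruction/response template as in `evaluate`
chat_template = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{input}\n\n### Response:\n"
)

# Placeholder question, used only to show the rendered prompt
example_prompt = chat_template.format(input="What should I do this weekend?")
print(example_prompt)
```

The prompt ends right after `### Response:`, so the model's generation continues from that point.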

In [None]:
import torch
from optimum.habana.utils import set_seed

set_seed(42)

@torch.inference_mode()
def evaluate(prompt_list, tokenizer, base_model, trained_model, trained_model2=None):
    # Define the chat template used for fine-tuning
    chat_template = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{input}\n\n### Response:\n"
    )
    if isinstance(prompt_list, str):
        prompt_list = [prompt_list]
    for prompt in prompt_list:
        # Generate the prompt for the base model using the original chat template
        base_prompt = tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}], tokenize=False, add_generation_prompt=True, enable_thinking=False
        )
        base_prompt = tokenizer.encode(base_prompt, add_special_tokens=False, return_tensors="pt")
        base_prompt = base_prompt.to("hpu")

        # Generate the prompt for the fine-tuned model using the new chat template
        trained_prompt = chat_template.format(input=prompt)
        trained_prompt = tokenizer.encode(trained_prompt, add_special_tokens=False, return_tensors="pt")
        trained_prompt = trained_prompt.to("hpu")

        # Generate the output for the base model
        base_output = base_model.generate(
            input_ids=base_prompt,
            max_new_tokens=50,
            do_sample=False,
            temperature=None,
            top_p=None,
            top_k=None,
        )
        base_result = tokenizer.decode(base_output[0][len(base_prompt[0]) :], skip_special_tokens=True)

        # Generate the output for the fine-tuned model
        trained_output = trained_model.generate(
            input_ids=trained_prompt,
            max_new_tokens=50,
            do_sample=False,
            temperature=None,
            top_p=None,
            top_k=None,
        )
        trained_result = tokenizer.decode(trained_output[0][len(trained_prompt[0]) :], skip_special_tokens=True)

        # Print the results
        print("=" * 26 + " Input " + "=" * 26 + "\n")
        print(prompt + "\n")
        print("=" * 21 + " Original Output " + "=" * 21 + "\n")
        print(base_result + "\n")
        print("=" * 20 + " Fine-tuned Output " + "=" * 20 + "\n")
        print(trained_result + "\n")

        if trained_model2 is not None:
            trained_output2 = trained_model2.generate(
                input_ids=trained_prompt,
                max_new_tokens=50,
                do_sample=False,
                temperature=None,
                top_p=None,
                top_k=None,
            )
            trained_result2 = tokenizer.decode(trained_output2[0][len(trained_prompt[0]) :], skip_special_tokens=True)
            print("=" * 20 + " Fine-tuned Output2 " + "=" * 19 + "\n")
            print(trained_result2 + "\n")

        print("=" * 27 + " End " + "=" * 27)

# 5. Load and Compile Models
We load the tokenizer and the base model onto Gaudi. Then we load the GraLoRA adapters from our output directory, merge them into the base model for efficiency, and compile the model with `torch.compile` for faster inference.
Because of the compilation, the first few warmup steps may be slower.
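
Merging matters for inference because the adapter no longer needs a separate forward pass. The sketch below illustrates the idea for a plain LoRA-style update, `W + (alpha / r) * B @ A`. This is a simplifying assumption for illustration, since GraLoRA's block-wise update and its exact scaling are handled inside PEFT.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 16, 32, 4, 8  # toy sizes (assumption)

W = rng.standard_normal((d_out, d_in))  # frozen base weight
B = rng.standard_normal((d_out, r))     # adapter factors
A = rng.standard_normal((r, d_in))
x = rng.standard_normal(d_in)           # one input vector
scale = alpha / r

# Unmerged: base path plus a separate adapter path
y_unmerged = W @ x + scale * (B @ (A @ x))

# Merged: fold the adapter into the weight once, then do a single matmul
W_merged = W + scale * (B @ A)
y_merged = W_merged @ x

print(np.allclose(y_unmerged, y_merged))  # True
```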

In [None]:
import copy
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained(input_kwargs["model_name_or_path"])

base_model = AutoModelForCausalLM.from_pretrained(input_kwargs["model_name_or_path"])
base_model = base_model.to("hpu")
base_model = torch.compile(base_model, backend="hpu_backend")

trained_model = AutoModelForCausalLM.from_pretrained(input_kwargs["output_dir"])
trained_model = PeftModel.from_pretrained(trained_model, gralora_output_dir)
trained_model = trained_model.merge_and_unload()  # returns the base model with the adapters merged in
trained_model = trained_model.to("hpu")
trained_model = torch.compile(trained_model, backend="hpu_backend")

# 6. Evaluation
Finally, we test the model with specific prompts to verify if it has successfully learned the "Joseon Dynasty" persona.

In [None]:
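# Test with a few example prompts (Korean, matching the training data).
input_prompt_list = [
    "오늘 저녁 메뉴 추천해줘",  # "Recommend a dinner menu for tonight"
    "오늘 행사 재미있었어",  # "Today's event was fun"
    "주말에 뭐하면 좋을까",  # "What would be good to do this weekend?"
]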
evaluate(input_prompt_list, tokenizer, base_model, trained_model)

You can further optimize the model's performance by modifying the training arguments in **Step 2. Configuration**. 

Recommended hyperparameters to tune:
- `num_train_epochs`: Increase for better learning (watch for overfitting).
- `lora_rank`: Adjust the rank of the adapters.
- `learning_rate`: Fine-tune the step size.
- `peft_type`: Switch back to `lora` to compare results directly against GraLoRA.
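
One way to organize such comparisons is to enumerate the combinations up front and rerun the Step 2 cell for each. The values below are arbitrary examples, and `iter_configs` is a hypothetical helper.

```python
import itertools

# Hypothetical search space; adjust the values to your compute budget.
search_space = {
    "peft_type": ["lora", "gralora"],
    "lora_rank": [32, 64],
    "learning_rate": [5e-4, 1e-3],
}

def iter_configs(base_kwargs, space):
    """Yield one kwargs dict per combination in `space` (hypothetical helper)."""
    keys = list(space)
    for values in itertools.product(*(space[key] for key in keys)):
        config = dict(base_kwargs)        # copy so the base settings stay untouched
        config.update(zip(keys, values))
        yield config

base_kwargs = {"lora_alpha": 128, "num_train_epochs": 20}
configs = list(iter_configs(base_kwargs, search_space))
print(len(configs))  # 8
for config in configs[:2]:
    print(config)
```

Each resulting dict can replace the corresponding entries of `input_kwargs` before the output directory name is built, so every run lands in its own directory.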