# Fine-tuning PLLuM 8B for Function Calling

This notebook fine-tunes the Polish language model [CYFRAGOVPL/Llama-PLLuM-8B-instruct](https://huggingface.co/CYFRAGOVPL/Llama-PLLuM-8B-instruct) for function calling, using a dataset of examples in both Polish and English.

We use the following techniques:
- **QLoRA** (Quantized Low-Rank Adaptation) with 4-bit quantization for memory efficiency
- **Unsloth** framework for optimized training speed
- Mixed dataset with both Polish and English examples

Fine-tuning teaches the model the prompt format used for function-calling requests and trains it to respond with well-formed JSON function calls.

## Setup and Imports

In [None]:
# Install/update dependencies if needed
# Note: Install PyTorch with CUDA support first, for example:
# pip install torch --index-url https://download.pytorch.org/whl/cu118
# python-dotenv provides the `dotenv` module and nvidia-ml-py provides `pynvml`; both are imported later
!pip install -q -U unsloth bitsandbytes sentencepiece nvidia-ml-py python-dotenv

In [None]:
import os
import json
import torch
import random
import numpy as np
from pathlib import Path
from dotenv import load_dotenv
from datetime import datetime

# Import our fine-tuning utilities
from src.fine_tuning import (
    PLLuMFineTuningConfig,
    setup_model_and_tokenizer,
    prepare_dataset,
    train_model,
    format_function_calling_prompt,
    generate_function_call,
    check_cuda_compatibility,
)
from src.auth import login  # For Hugging Face authentication
from src.dataset import parse_json_entry

# Set a random seed for reproducibility
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)

# Load environment variables
load_dotenv()

# Authenticate with Hugging Face
login()

## Check GPU and CUDA Compatibility

Before proceeding with fine-tuning, let's verify that we have a compatible GPU with CUDA properly configured. This notebook is designed for use with an NVIDIA RTX 4060 or similar GPU.
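
As a rough guide to why 4-bit quantization matters on a card of this class, the sketch below estimates the memory needed for the model weights alone. It assumes roughly 8 billion parameters (taken from the model name, not measured) and ignores quantization constants, LoRA adapters, optimizer state, activations, and the CUDA context, so real usage will be higher. The function name `estimate_weight_memory_gb` is hypothetical and not part of this project.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Return the memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: about 8 billion parameters, as the "8B" in the model name suggests.
NUM_PARAMS = 8e9

fp16_gb = estimate_weight_memory_gb(NUM_PARAMS, 16)  # 16-bit weights
nf4_gb = estimate_weight_memory_gb(NUM_PARAMS, 4)    # 4-bit quantized weights

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{nf4_gb:.1f} GB")
```

Under these assumptions the 16-bit weights alone would not fit on an 8 GB card, while the 4-bit weights leave some headroom for adapters and activations.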

In [None]:
# Run CUDA compatibility check
cuda_info = check_cuda_compatibility()

if not cuda_info['cuda_available']:
    print("WARNING: CUDA is not available! Fine-tuning will be extremely slow on CPU.")
    print("Please make sure you have an NVIDIA GPU and have installed PyTorch with CUDA support.")
    print("You can install PyTorch with CUDA using: pip install torch --index-url https://download.pytorch.org/whl/cu118")
    # Optionally stop execution
    # raise RuntimeError("CUDA is required for fine-tuning")
else:
    print(f"\n✅ CUDA is available with version {cuda_info['cuda_version']}")
    for i, device in enumerate(cuda_info['devices']):
        print(f"\nGPU {i}: {device['name']}")
        print(f"  Memory: {device['total_memory_gb']:.2f} GB total")
        
        if 'memory_free_gb' in device:
            print(f"  Free memory: {device['memory_free_gb']:.2f} GB")
            print(f"  Used memory: {device['memory_used_gb']:.2f} GB")
            
            # Check if there's enough free memory (at least 6GB recommended for 8B model with QLoRA)
            if device['memory_free_gb'] < 6.0:
                print(f"⚠️ Warning: Only {device['memory_free_gb']:.2f} GB free memory detected.")
                print("   You may encounter out-of-memory errors during fine-tuning.")
                print("   Consider reducing batch size, sequence length, or closing other applications.")
            else:
                print(f"✅ Sufficient free memory detected ({device['memory_free_gb']:.2f} GB)")

In [None]:
# Verify PyTorch was installed with CUDA support
if torch.cuda.is_available():
    # Run a simple test to verify CUDA is working
    try:
        print("Running CUDA test...")
        x = torch.rand(10, 10).cuda()
        y = torch.rand(10, 10).cuda()
        z = x @ y  # Matrix multiplication
        print(f"CUDA test result shape: {z.shape}")
        print("✅ CUDA test passed!")
        
        # Test triton if installed (optional acceleration library)
        try:
            import triton
            print("✅ Triton is installed for additional CUDA optimizations")
        except ImportError:
            print("ℹ️ Triton is not installed. For additional optimizations, install with: pip install triton")
            
        # Test flash-attention if installed
        try:
            import flash_attn
            print("✅ Flash Attention is installed for faster training")
        except ImportError:
            print("ℹ️ Flash Attention is not installed. For faster training, install with: pip install flash-attn")
    except Exception as e:
        print(f"❌ CUDA test failed: {e}")
        print("This may indicate a problem with your CUDA installation.")

## Load and Examine the Dataset

We'll load the translated dataset and examine its structure.

**Note:** In this dataset, the `tools` and `answers` fields are stored as JSON strings that need to be parsed with `json.loads()`. This is the expected format according to the dataset documentation.
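
To make the layout concrete, here is a minimal sketch of one record and how its string fields decode. The record is invented for illustration: the field names `query`, `tools`, and `answers` come from this notebook, but the tool `get_weather`, its schema, and the helper `parse_record` are hypothetical and may not match the real dataset or the project's own `parse_json_entry`.

```python
import json

# Hypothetical record, invented for illustration; real entries may differ.
record = {
    "query": "Jaka jest pogoda w Warszawie?",
    # `tools` and `answers` are stored as JSON strings, not as lists.
    "tools": json.dumps([
        {
            "name": "get_weather",
            "description": "Returns the current weather for a city.",
            "parameters": {"city": {"type": "string", "description": "City name"}},
        }
    ]),
    "answers": json.dumps([
        {"name": "get_weather", "arguments": {"city": "Warszawa"}}
    ]),
}


def parse_record(entry):
    """Decode the JSON-string fields of one entry into Python lists."""
    return {
        "query": entry["query"],
        "tools": json.loads(entry["tools"]),
        "answers": json.loads(entry["answers"]),
    }


parsed = parse_record(record)
print(parsed["tools"][0]["name"])
print(parsed["answers"][0]["arguments"])
```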

In [None]:
# Path to the translated dataset
DATASET_PATH = "../data/translated_dataset.json"

# Check if the dataset exists
if not os.path.exists(DATASET_PATH):
    raise FileNotFoundError(f"Dataset not found at {DATASET_PATH}. Please run create_translated_dataset.ipynb first.")

# Load the dataset
with open(DATASET_PATH, 'r', encoding='utf-8') as f:
    dataset = json.load(f)

print(f"Dataset loaded with {len(dataset)} examples.")

In [None]:
# Examine dataset structure
print(f"Dataset type: {type(dataset)}")
if len(dataset) > 0:
    print(f"First item type: {type(dataset[0])}")
    if isinstance(dataset[0], dict):
        print(f"First item keys: {list(dataset[0].keys())}")

In [None]:
# Examine a few examples with understanding that tools and answers are JSON strings
def print_example(example, idx=0):
    print(f"Example {idx}:")
    print(f"Query: {example['query']}")
    
    # Handle tools as a JSON string (intended format)
    if 'tools' in example:
        tools_type = type(example['tools'])
        if isinstance(example['tools'], str):
            # Parse the JSON string to show tool count
            try:
                parsed_tools = json.loads(example['tools'])
                print(f"Tools: JSON string containing {len(parsed_tools)} tool(s)")
                # Print first tool if available
                if len(parsed_tools) > 0:
                    print(f"First tool: {parsed_tools[0]['name']}")
                    print(f"Description: {parsed_tools[0]['description'][:50]}...")
            except json.JSONDecodeError:
                print("Tools: Invalid JSON string")
        else:
            print(f"Tools has unexpected type: {tools_type}")
    
    # Handle answers as a JSON string (intended format)
    if 'answers' in example:
        answers_type = type(example['answers'])
        if isinstance(example['answers'], str):
            # Parse the JSON string to show answer count
            try:
                parsed_answers = json.loads(example['answers'])
                print(f"Answers: JSON string containing {len(parsed_answers)} answer(s)")
                # Print first answer if available
                if len(parsed_answers) > 0:
                    print(f"First answer uses tool: {parsed_answers[0]['name']}")
                    print(f"With arguments: {parsed_answers[0]['arguments']}")
            except json.JSONDecodeError:
                print("Answers: Invalid JSON string")
        else:
            print(f"Answers has unexpected type: {answers_type}")
    
    print("\n")

# Print examples
for i in range(min(3, len(dataset))):
    print_example(dataset[i], i)

In [None]:
# Test the format_function_calling_prompt function with our dataset
# This will parse the JSON strings automatically
if len(dataset) > 0:
    example_prompt = format_function_calling_prompt(dataset[0])
    print("Example formatted prompt:")
    print(example_prompt)

## Configure Fine-tuning Parameters

Set up the fine-tuning configuration with parameters optimized for an RTX 4060 GPU.
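
The batch size, gradient accumulation, epoch count, and warmup ratio together determine how many optimizer steps the run takes. The sketch below works through that arithmetic for the values used in this notebook (batch size 4 with 2 accumulation steps, or 2 with 4 on low-memory GPUs; 3 epochs; warmup ratio 0.03). It assumes a single GPU and a hypothetical dataset of 10,000 examples, and the rounding is approximate, so the trainer's exact step counts may differ slightly. The helper `training_schedule` is not part of this project.

```python
import math


def training_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      num_epochs, warmup_ratio):
    """Approximate the optimizer-step schedule for a single-GPU run."""
    effective_batch = per_device_batch_size * grad_accum_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Assumption: 10,000 training examples (hypothetical; use len(dataset) in practice).
default = training_schedule(10_000, per_device_batch_size=4, grad_accum_steps=2,
                            num_epochs=3, warmup_ratio=0.03)
low_mem = training_schedule(10_000, per_device_batch_size=2, grad_accum_steps=4,
                            num_epochs=3, warmup_ratio=0.03)

print(default)
print(low_mem)
```

Both settings give the same effective batch size of 8, which is why the low-memory fallback trades batch size for accumulation steps: memory use drops while the optimizer sees the same number of examples per update.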

In [None]:
# Create a timestamped output directory
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
MODEL_OUTPUT_DIR = f"../models/pllum-function-calling-{timestamp}"

# Create the output directory
os.makedirs(MODEL_OUTPUT_DIR, exist_ok=True)

# Configure fine-tuning parameters
config = PLLuMFineTuningConfig(
    model_name_or_path="CYFRAGOVPL/Llama-PLLuM-8B-instruct",
    output_dir=MODEL_OUTPUT_DIR,
    
    # QLoRA settings
    lora_r=16,  # LoRA rank
    lora_alpha=32,  # LoRA alpha
    lora_dropout=0.05,
    use_4bit=True,  # Use 4-bit quantization for memory efficiency
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_quant_type="nf4",  # Normal Float 4-bit quantization
    use_nested_quant=False,
    
    # Training parameters
    num_train_epochs=3,
    per_device_train_batch_size=4,  # Adjust based on GPU memory
    gradient_accumulation_steps=2,  # Increase effective batch size
    learning_rate=2e-4,
    weight_decay=0.01,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    
    # Logging & Saving
    logging_steps=10,
    save_steps=200,
    save_total_limit=3,
    
    # Dataset parameters
    max_seq_length=1024,  # Maximum sequence length
    dataset_path=DATASET_PATH,
    
    # CUDA settings
    use_cuda=torch.cuda.is_available(),
    device_map="auto",
)

# Adjust batch size if we detect limited GPU memory
if torch.cuda.is_available() and 'devices' in cuda_info and len(cuda_info['devices']) > 0:
    if 'memory_free_gb' in cuda_info['devices'][0] and cuda_info['devices'][0]['memory_free_gb'] < 6.0:
        original_batch_size = config.per_device_train_batch_size
        config.per_device_train_batch_size = 2  # Reduce batch size for low memory GPUs
        config.gradient_accumulation_steps = 4  # Increase gradient accumulation
        print(f"⚠️ Limited GPU memory detected. Reducing batch size from {original_batch_size} to {config.per_device_train_batch_size}")
        print(f"   and increasing gradient accumulation steps to {config.gradient_accumulation_steps}")

# Save the configuration to the model directory for future reference
# (default=str is a fallback for any value that json cannot serialize directly)
config_dict = {k: str(v) if isinstance(v, Path) else v for k, v in vars(config).items()}
with open(os.path.join(MODEL_OUTPUT_DIR, "config.json"), 'w', encoding='utf-8') as f:
    json.dump(config_dict, f, indent=2, default=str)

## Load Model and Tokenizer

Set up the model with QLoRA and Unsloth optimizations.
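
To see why LoRA keeps the number of trainable parameters small, consider a single linear layer. LoRA freezes the original weight matrix and learns two low-rank factors whose product, scaled by `lora_alpha / lora_r`, is added to it. The sketch below counts parameters for one square projection using `lora_r=16` and `lora_alpha=32` from the configuration above. The hidden size of 4096 is an assumption about the base architecture, and which layers actually receive adapters depends on `setup_model_and_tokenizer`, which is not shown here.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in the two LoRA factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumption: hidden size 4096, typical for Llama-style 8B models.
HIDDEN = 4096
LORA_R = 16
LORA_ALPHA = 32

full_params = HIDDEN * HIDDEN                      # frozen base weight matrix
adapter_params = lora_param_count(HIDDEN, HIDDEN, LORA_R)
scaling = LORA_ALPHA / LORA_R                      # factor applied to the low-rank update

print(f"Full matrix:  {full_params:,} parameters")
print(f"LoRA adapter: {adapter_params:,} parameters "
      f"({100 * adapter_params / full_params:.2f}% of the full matrix)")
print(f"Update scaling (alpha / r): {scaling}")
```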

In [None]:
# Load model and tokenizer
print("Loading model and tokenizer...")
model, tokenizer = setup_model_and_tokenizer(config)
print("Model and tokenizer loaded successfully.")

## Prepare the Dataset for Training

In [None]:
# Prepare the dataset
print("Preparing dataset...")
train_dataset = prepare_dataset(
    dataset_path=config.dataset_path,
    tokenizer=tokenizer,
    max_length=config.max_seq_length
)
print(f"Dataset prepared with {len(train_dataset['input_ids'])} examples.")

## Fine-tune the Model

This is the main training step. It can take several hours, depending on your hardware and the size of the dataset.

In [None]:
# Run the training
print(f"Starting fine-tuning process. Model will be saved to {config.output_dir}")
print("This may take several hours depending on your hardware.")

# Start training
trained_model = train_model(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    config=config
)

print("Fine-tuning completed successfully!")

## Test the Fine-tuned Model

Let's test our model with a few examples from the dataset.
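
Reading expected and generated output side by side works for a handful of examples, but a simple exact-match check can make the comparison explicit. The sketch below assumes that the generated output has the same shape as the dataset's parsed `answers` (a list of dicts with `name` and `arguments` keys). That is an assumption about what `generate_function_call` returns, and the helper `calls_match` is hypothetical.

```python
def calls_match(expected, generated):
    """Return True when both lists contain the same calls in the same order."""
    if not isinstance(generated, list) or len(expected) != len(generated):
        return False
    return all(
        isinstance(gen, dict)
        and exp.get("name") == gen.get("name")
        and exp.get("arguments") == gen.get("arguments")
        for exp, gen in zip(expected, generated)
    )


# Hypothetical values for illustration only.
expected = [{"name": "get_weather", "arguments": {"city": "Warszawa"}}]
good = [{"name": "get_weather", "arguments": {"city": "Warszawa"}}]
wrong_argument = [{"name": "get_weather", "arguments": {"city": "Kraków"}}]

print(calls_match(expected, good))
print(calls_match(expected, wrong_argument))
```

Exact matching is strict: it treats a differently ordered list of calls or a harmless change in argument formatting as a failure, so a looser comparison may suit some evaluations better.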

In [None]:
# Test the model with examples from the dataset
def test_model_with_example(example_idx=0):
    example = dataset[example_idx]
    
    query = example['query']
    
    # Parse tools and answers from JSON strings
    tools = json.loads(example['tools'])
    expected_answers = json.loads(example['answers'])
    
    print(f"Query: {query}")
    print("\nAvailable tools:")
    for i, tool in enumerate(tools):
        print(f"{i+1}. {tool['name']}: {tool['description']}")
    
    print("\nExpected answer:")
    print(json.dumps(expected_answers, indent=2, ensure_ascii=False))
    
    print("\nGenerating function call...")
    generated = generate_function_call(
        model=trained_model,
        tokenizer=tokenizer,
        query=query,
        tools=tools,
        temperature=0.1
    )
    
    print("\nGenerated answer:")
    print(json.dumps(generated, indent=2, ensure_ascii=False))
    
    return generated

# Test with a few examples
for i in range(min(3, len(dataset))):
    print(f"\n--- Example {i} ---")
    generated = test_model_with_example(i)
    print("\n" + "-"*50)

## Monitor GPU Usage During Inference

Let's check GPU memory and utilization before and after an inference call; the numbers can help you tune settings for future runs.

In [None]:
if torch.cuda.is_available():
    try:
        import pynvml
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        
        # Get memory info before inference
        mem_info_before = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU memory before inference: {mem_info_before.used / 1e9:.2f} GB used")
        
        # Run a complex inference with sample from dataset
        if len(dataset) > 0:
            example = dataset[0]
            query = example['query']
            tools = json.loads(example['tools'])  # Parse JSON string
            
            generate_function_call(
                model=trained_model,
                tokenizer=tokenizer,
                query=query,
                tools=tools,
                temperature=0.1,
                max_new_tokens=1024  # Longer generation to stress test
            )
        
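        # Get memory info after inference
        # Note: this is device-wide memory still held once generation has finished
        # (PyTorch caches freed blocks), so the difference below is not the peak usage.
        mem_info_after = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU memory after inference: {mem_info_after.used / 1e9:.2f} GB used")
        print(f"Change in used memory after inference: {(mem_info_after.used - mem_info_before.used) / 1e9:.2f} GB")
        print(f"Peak memory allocated by PyTorch in this process so far: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
        
        # Get GPU utilization (sampled now, after generation has returned)
        util_rates = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"GPU utilization: {util_rates.gpu}%")
        print(f"Memory utilization: {util_rates.memory}%")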
        
        pynvml.nvmlShutdown()
    except Exception as e:
        print(f"Could not monitor GPU: {e}")

## Save a Final Model Summary

Let's create a summary file with information about the fine-tuning process.

In [None]:
# Create a summary file
summary = {
    "model_name": config.model_name_or_path,
    "fine_tuned_model_path": config.output_dir,
    "training_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
    "dataset": {
        "path": config.dataset_path,
        "num_examples": len(train_dataset['input_ids']),
    },
    "training_parameters": {
        "epochs": config.num_train_epochs,
        "batch_size": config.per_device_train_batch_size,
        "gradient_accumulation_steps": config.gradient_accumulation_steps,
        "effective_batch_size": config.per_device_train_batch_size * config.gradient_accumulation_steps,
        "learning_rate": config.learning_rate,
        "lora_r": config.lora_r,
        "lora_alpha": config.lora_alpha,
        "max_seq_length": config.max_seq_length,
        "use_4bit": config.use_4bit,
    },
    "hardware": {
        "cuda_available": torch.cuda.is_available(),
        "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else "None",
        "cuda_version": torch.version.cuda if torch.cuda.is_available() else "None",
    },
}

# Save the summary
with open(os.path.join(config.output_dir, "training_summary.json"), 'w', encoding='utf-8') as f:
    json.dump(summary, f, indent=2, ensure_ascii=False)

print(f"Training summary saved to {os.path.join(config.output_dir, 'training_summary.json')}")

## Conclusion

The PLLuM 8B model has now been fine-tuned for function calling with QLoRA, using the Unsloth framework to speed up training. It can be used to turn queries in both Polish and English into function calls.

To use the fine-tuned model in your applications, check the `test_model.ipynb` notebook for examples of how to load and integrate the model into your pipeline.