# E2E Tool Calling Tutorial Using LlamaStack Server

This notebook demonstrates the complete end-to-end workflow for fine-tuning and evaluating a tool-calling model using the **LlamaStack Server** on port 8321.

## Prerequisites
1. Start port-forwards: `./llamastack/port-forward.sh`
2. Start LlamaStack server: `./llamastack/activate_llamastack_server.sh`
3. Verify the server is running: `curl http://localhost:8321/v1/models`
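
The same check can be done from Python before running the rest of the notebook. The following is a minimal sketch, assuming the server exposes the `/v1/models` route that later cells query and returns a JSON object with a `data` list; the helper name `list_models` is hypothetical and not part of LlamaStack.

```python
import json
import urllib.request


def list_models(base_url="http://localhost:8321", timeout=5.0):
    """Return the model identifiers reported by the server, or None if unreachable.

    Hypothetical helper: it only assumes GET {base_url}/v1/models returns
    a JSON object of the form {"data": [{"identifier": ...}, ...]}.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # OSError covers connection errors, HTTP errors, and timeouts;
        # ValueError covers a response body that is not valid JSON.
        return None
    if not isinstance(payload, dict):
        return None
    return [m.get("identifier", m.get("id")) for m in payload.get("data", [])]


models = list_models()
if models is None:
    print("LlamaStack server is not reachable on port 8321")
else:
    print(f"LlamaStack server is up with {len(models)} registered models")
```

Returning `None` instead of raising keeps the check usable as a simple guard at the top of the notebook.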

# Part I: Data Preparation

## Install Dependencies

In [None]:
# Install required Python packages
%pip install \
  huggingface_hub \
  "transformers>=4.36.0" \
  peft \
  datasets \
  trl \
  jsonschema \
  litellm \
  "jinja2>=3.1.0" \
  "torch>=2.0.0" \
  openai \
  jupyterlab \
  requests

## Import Libraries and Configure Endpoints

In [None]:
import os
import sys
import json
import random
import requests
from pprint import pprint
from typing import Any, Dict, List, Union
from time import sleep, time

import numpy as np
import torch
from datasets import load_dataset
from huggingface_hub import HfApi
from openai import OpenAI

In [None]:
# Add parent directory to path to import config
sys.path.append(os.path.join(os.getcwd(), '..'))
from config import *

# LlamaStack Server endpoint
LLAMASTACK_URL = "http://localhost:8321"

print("Configuration:")
print(f"LlamaStack Server: {LLAMASTACK_URL}")
print(f"Data Store: {NDS_URL}")
print(f"Entity Store: {ENTITY_STORE_URL}")
print(f"NIM: {NIM_URL}")
print(f"Namespace: {NMS_NAMESPACE}")
print(f"Base Model: {BASE_MODEL}")

## Set Random Seed

In [None]:
SEED = 1234
LIMIT_TOOL_PROPERTIES = 8  # Workaround (WAR) for a NIM bug with large tool property sets

torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
np.random.seed(SEED)
random.seed(SEED)

## Create Data Directories

In [None]:
# Processed data will be stored here
DATA_ROOT = os.path.join(os.getcwd(), "data")
CUSTOMIZATION_DATA_ROOT = os.path.join(DATA_ROOT, "customization")
VALIDATION_DATA_ROOT = os.path.join(DATA_ROOT, "validation")
EVALUATION_DATA_ROOT = os.path.join(DATA_ROOT, "evaluation")

os.makedirs(DATA_ROOT, exist_ok=True)
os.makedirs(CUSTOMIZATION_DATA_ROOT, exist_ok=True)
os.makedirs(VALIDATION_DATA_ROOT, exist_ok=True)
os.makedirs(EVALUATION_DATA_ROOT, exist_ok=True)

print(f"Data directories created at: {DATA_ROOT}")

## Step 1: Download xLAM Dataset from Hugging Face

In [None]:
from config import HF_TOKEN

os.environ["HF_TOKEN"] = HF_TOKEN
os.environ["HF_ENDPOINT"] = "https://huggingface.co"

In [None]:
# Download from Hugging Face
dataset = load_dataset("Salesforce/xlam-function-calling-60k")

# Inspect a sample
example = dataset['train'][0]
pprint(example)

## Step 2: Data Transformation Functions

Convert the xLAM format to the OpenAI format required by NeMo Customizer.
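
To make the target shape concrete, here is a condensed sketch of the mapping applied to a single record. The record contents, `TYPE_MAP`, and `to_openai_sketch` are invented for illustration; the notebook's own functions in the next cell handle more type variants and malformed input.

```python
import json

# Hypothetical xLAM-style record (values invented for illustration).
# "tools" and "answers" are JSON-encoded strings, which is why the
# notebook's converters parse them with json.loads.
xlam_record = {
    "query": "What is the weather in Paris?",
    "tools": json.dumps([{
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {"city": {"type": "str", "description": "City name"}},
    }]),
    "answers": json.dumps([{"name": "get_weather", "arguments": {"city": "Paris"}}]),
}

# Reduced type map for the sketch; unknown types fall back to "string".
TYPE_MAP = {"str": "string", "int": "integer", "float": "number", "bool": "boolean"}


def to_openai_sketch(record):
    """Condensed illustration of the xLAM -> OpenAI chat format mapping."""
    tools = [
        {
            "type": "function",
            "function": {
                "name": tool["name"],
                "description": tool["description"],
                "parameters": {
                    "type": "object",
                    "properties": {
                        name: {
                            "type": TYPE_MAP.get(info["type"], "string"),
                            "description": info.get("description", ""),
                        }
                        for name, info in tool["parameters"].items()
                    },
                },
            },
        }
        for tool in json.loads(record["tools"])
    ]
    tool_calls = [
        {"type": "function", "function": {"name": call["name"], "arguments": call["arguments"]}}
        for call in json.loads(record["answers"])
    ]
    return {
        "messages": [
            {"role": "user", "content": record["query"]},
            {"role": "assistant", "content": "", "tool_calls": tool_calls},
        ],
        "tools": tools,
    }


converted = to_openai_sketch(xlam_record)
print(json.dumps(converted, indent=2))
```

The user query becomes the first message, the tool definitions move into an OpenAI-style `tools` list, and the ground-truth answers become `tool_calls` on the assistant message.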

In [None]:
def normalize_type(param_type: str) -> str:
    """
    Normalize Python type hints to OpenAI function spec types.
    """
    param_type = param_type.strip()

    if "," in param_type and "default" in param_type:
        param_type = param_type.split(",")[0].strip()

    if param_type.startswith("default="):
        return "string"

    param_type = param_type.replace(", optional", "").strip()

    if param_type.startswith("Callable"):
        return "string"
    if param_type.startswith("Tuple"):
        return "array"
    if param_type.startswith("List["):
        return "array"
    if param_type.startswith("Set") or param_type == "set":
        return "array"

    type_mapping: Dict[str, str] = {
        "str": "string",
        "int": "integer",
        "float": "number",
        "bool": "boolean",
        "list": "array",
        "dict": "object",
        "List": "array",
        "Dict": "object",
        "set": "array",
        "Set": "array"
    }

    if param_type in type_mapping:
        return type_mapping[param_type]
    else:
        print(f"Unknown type: {param_type}")
        return "string"


def convert_tools_to_openai_spec(tools: Union[str, List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    if isinstance(tools, str):
        try:
            tools = json.loads(tools)
        except json.JSONDecodeError as e:
            print(f"Failed to parse tools string as JSON: {e}")
            return []

    if not isinstance(tools, list):
        print(f"Expected tools to be a list, but got {type(tools)}")
        return []

    openai_tools: List[Dict[str, Any]] = []
    for tool in tools:
        if not isinstance(tool, dict):
            print(f"Expected tool to be a dictionary, but got {type(tool)}")
            continue

        if not isinstance(tool.get("parameters"), dict):
            print(f"Expected 'parameters' to be a dictionary for tool: {tool}")
            continue

        normalized_parameters: Dict[str, Dict[str, Any]] = {}
        for param_name, param_info in tool["parameters"].items():
            if not isinstance(param_info, dict):
                print(f"Expected parameter info to be a dictionary for: {param_name}")
                continue

            param_dict = {
                "description": param_info.get("description", ""),
                "type": normalize_type(param_info.get("type", "")),
            }

            default_value = param_info.get("default")
            if default_value is not None and default_value != "":
                param_dict["default"] = default_value

            normalized_parameters[param_name] = param_dict

        openai_tool = {
            "type": "function",
            "function": {
                "name": tool["name"],
                "description": tool["description"],
                "parameters": {"type": "object", "properties": normalized_parameters},
            },
        }
        openai_tools.append(openai_tool)
    return openai_tools


def save_jsonl(filename, data):
    """Write a list of json objects to a .jsonl file"""
    with open(filename, "w") as f:
        for entry in data:
            f.write(json.dumps(entry) + "\n")


def convert_tool_calls(xlam_tools):
    """Convert XLAM tool format to OpenAI's tool schema."""
    tools = []
    for tool in json.loads(xlam_tools):
        tools.append({"type": "function", "function": {"name": tool["name"], "arguments": tool.get("arguments", {})}})
    return tools


def convert_example(example, dataset_type='single'):
    """Convert an XLAM dataset example to OpenAI format."""
    obj = {"messages": []}

    obj["messages"].append({"role": "user", "content": example["query"]})

    if example.get("tools"):
        obj["tools"] = convert_tools_to_openai_spec(example["tools"])

    assistant_message = {"role": "assistant", "content": ""}
    if example.get("answers"):
        tool_calls = convert_tool_calls(example["answers"])
        
        if dataset_type == "single":
            if len(tool_calls) == 1:
                assistant_message["tool_calls"] = tool_calls
            else:
                return None
        else:
            assistant_message["tool_calls"] = tool_calls
                
    obj["messages"].append(assistant_message)

    return obj


def convert_example_eval(entry):
    """Convert a single entry to the evaluator format"""
    # Workaround (WAR) for a NIM bug: skip entries whose tools have too many properties
    for tool in entry["tools"]:
        if len(tool["function"]["parameters"]["properties"]) > LIMIT_TOOL_PROPERTIES:
            return None
    
    new_entry = {
        "messages": [],
        "tools": entry["tools"],
        "tool_calls": []
    }
    
    for msg in entry["messages"]:
        if msg["role"] == "assistant" and "tool_calls" in msg:
            new_entry["tool_calls"] = msg["tool_calls"]
        else:
            new_entry["messages"].append(msg)
    
    return new_entry


def convert_dataset_eval(data):
    """Convert the entire dataset for evaluation."""
    return [result for entry in data if (result := convert_example_eval(entry)) is not None]


def read_jsonl(file_path):
    """Reads a JSON Lines file and yields parsed JSON objects"""
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON: {e}")
                continue

print("Data transformation functions loaded successfully")

## Test Data Transformation

In [None]:
# Test conversion on the example
converted_example = convert_example(example)
print("Converted example:")
pprint(converted_example)

## Step 3: Process and Split Dataset

In [None]:
# Convert all examples
all_examples = []
with open(os.path.join(DATA_ROOT, "xlam_openai_format.jsonl"), "w") as f:
    for example in dataset["train"]:
        converted = convert_example(example)
        if converted is not None:
            all_examples.append(converted)
            f.write(json.dumps(converted) + "\n")

print(f"Converted {len(all_examples)} examples")

In [None]:
# Configure dataset size
NUM_EXAMPLES = 5000

assert NUM_EXAMPLES <= len(all_examples), \
    f"{NUM_EXAMPLES} exceeds the total number of available ({len(all_examples)}) data points"

# Randomly sample and split
sampled_examples = random.sample(all_examples, NUM_EXAMPLES)

train_size = int(0.7 * len(sampled_examples))
val_size = int(0.15 * len(sampled_examples))

train_data = sampled_examples[:train_size]
val_data = sampled_examples[train_size : train_size + val_size]
test_data = sampled_examples[train_size + val_size :]

# Save splits
save_jsonl(os.path.join(CUSTOMIZATION_DATA_ROOT, "training.jsonl"), train_data)
save_jsonl(os.path.join(VALIDATION_DATA_ROOT, "validation.jsonl"), val_data)

# Convert test data for evaluation
test_data_eval = convert_dataset_eval(test_data)
save_jsonl(os.path.join(EVALUATION_DATA_ROOT, "xlam-test-single.jsonl"), test_data_eval)

print(f"Dataset split complete:")
print(f"  Training: {len(train_data)} examples")
print(f"  Validation: {len(val_data)} examples")
print(f"  Test: {len(test_data_eval)} examples")

## Step 4: Create Namespaces

**Note**: LlamaStack doesn't manage namespaces, so we create them directly in NeMo services

In [None]:
def create_namespaces(entity_host, ds_host, namespace):
    # Create namespace in Entity Store
    entity_store_url = f"{entity_host}/v1/namespaces"
    resp = requests.post(entity_store_url, json={"id": namespace})
    assert resp.status_code in (200, 201, 409, 422), \
        f"Unexpected response from Entity Store: {resp.status_code}"
    print(f"Entity Store: {resp.status_code}")

    # Create namespace in Data Store
    nds_url = f"{ds_host}/v1/datastore/namespaces"
    resp = requests.post(nds_url, data={"namespace": namespace})
    assert resp.status_code in (200, 201, 409, 422), \
        f"Unexpected response from Data Store: {resp.status_code}"
    print(f"Data Store: {resp.status_code}")

create_namespaces(entity_host=ENTITY_STORE_URL, ds_host=NDS_URL, namespace=NMS_NAMESPACE)

In [None]:
# Verify namespaces
res = requests.get(f"{NDS_URL}/v1/datastore/namespaces/{NMS_NAMESPACE}")
print(f"Data Store: {res.status_code}")
print(json.dumps(res.json(), indent=2))

res = requests.get(f"{ENTITY_STORE_URL}/v1/namespaces/{NMS_NAMESPACE}")
print(f"\nEntity Store: {res.status_code}")
print(json.dumps(res.json(), indent=2))

## Step 5: Upload Data to NeMo Data Store

Using HfApi to upload to Data Store

In [None]:
repo_id = f"{NMS_NAMESPACE}/{DATASET_NAME}"
print(f"Repository ID: {repo_id}")

hf_api = HfApi(endpoint=f"{NDS_URL}/v1/hf", token="")

# Create repo
hf_api.create_repo(
    repo_id=repo_id,
    repo_type='dataset',
)

print(f"Dataset repository created: {repo_id}")

In [None]:
# Upload dataset files
train_fp = f"{CUSTOMIZATION_DATA_ROOT}/training.jsonl"
val_fp = f"{VALIDATION_DATA_ROOT}/validation.jsonl"
test_fp = f"{EVALUATION_DATA_ROOT}/xlam-test-single.jsonl"

hf_api.upload_file(
    path_or_fileobj=train_fp,
    path_in_repo="training/training.jsonl",
    repo_id=repo_id,
    repo_type='dataset',
)

hf_api.upload_file(
    path_or_fileobj=val_fp,
    path_in_repo="validation/validation.jsonl",
    repo_id=repo_id,
    repo_type='dataset',
)

hf_api.upload_file(
    path_or_fileobj=test_fp,
    path_in_repo="testing/xlam-test-single.jsonl",
    repo_id=repo_id,
    repo_type='dataset',
)

print("Dataset files uploaded successfully")

## Step 6: Register Dataset (Entity Store + LlamaStack Verification)

**Hybrid Approach**: Register in Entity Store (required for nvidia provider),
then verify it's accessible via LlamaStack for fine-tuning.

In [None]:
# Register dataset in Entity Store
# The dataset will be used by fine-tuning via the customizer backend
response = requests.post(
    f"{ENTITY_STORE_URL}/v1/datasets",
    json={
        "name": DATASET_NAME,
        "namespace": NMS_NAMESPACE,
        "description": "Tool calling xLAM dataset in OpenAI ChatCompletions format",
        "files_url": f"hf://datasets/{repo_id}",
        "project": "tool_calling",
    },
)

# 409 means already exists - that's OK
if response.status_code in (200, 201):
    print("✅ Dataset registered in Entity Store!")
    dataset_info = response.json()
    print(f"   Name: {dataset_info.get('name')}")
    print(f"   Namespace: {dataset_info.get('namespace')}")
    print(f"   Files URL: {dataset_info.get('files_url')}")
    print(f"   ID: {dataset_info.get('id')}")
elif response.status_code == 409:
    print("⚠️ Dataset already exists in Entity Store - continuing...")
else:
    raise RuntimeError(f"Failed to register dataset: {response.status_code} - {response.text}")

print("\n✅ Dataset is now available for fine-tuning via LlamaStack!")
print(f"   Fine-tuning will reference it as: {NMS_NAMESPACE}/{DATASET_NAME}")

---
## Part I Complete! ✅

We have successfully:
1. Downloaded and transformed the xLAM dataset
2. Split data into train/val/test sets
3. Created namespaces in NeMo services
4. Uploaded data to NeMo Data Store
5. **Registered the dataset in Entity Store for fine-tuning via LlamaStack Server** 🎯

# Part II: Fine-tuning via LlamaStack Server

## Helper Functions for Job Monitoring

In [None]:
def wait_customization_job(job_uuid: str, polling_interval: int = 30, timeout: int = 5500):
    """
    Wait for a fine-tuning job to complete via LlamaStack.
    
    Args:
        job_uuid: The job ID to monitor
        polling_interval: Seconds between status checks
        timeout: Maximum time to wait in seconds
    
    Returns:
        Final job status
    """
    start_time = time()
    
    # Get initial status via LlamaStack (list all jobs and find ours)
    res = requests.get(f"{LLAMASTACK_URL}/v1/post-training/jobs")
    if res.status_code != 200:
        raise RuntimeError(f"Failed to list jobs: {res.status_code} - {res.text}")
    
    jobs = res.json().get("data", [])
    job_data = next((j for j in jobs if j.get("job_uuid") == job_uuid), None)
    
    if not job_data:
        raise RuntimeError(f"Job {job_uuid} not found in job list")
    
    job_status = job_data["status"]

    print(f"Waiting for fine-tuning job {job_uuid} to finish.")
    print(f"Job status: {job_status} after {time() - start_time:.2f} seconds.")

    while job_status in ["scheduled", "in_progress", "created", "running"]:
        # Check the timeout before polling so repeated request failures cannot loop forever
        if time() - start_time > timeout:
            raise RuntimeError(f"Job {job_uuid} took more than {timeout} seconds.")

        sleep(polling_interval)
        
        # List all jobs and find ours
        res = requests.get(f"{LLAMASTACK_URL}/v1/post-training/jobs")
        if res.status_code != 200:
            print(f"Warning: Failed to list jobs: {res.status_code}")
            continue
        
        jobs = res.json().get("data", [])
        job_data = next((j for j in jobs if j.get("job_uuid") == job_uuid), None)
        
        if not job_data:
            print(f"Warning: Job {job_uuid} not found in list")
            continue
            
        job_status = job_data["status"]
        
        # Extract detailed progress information
        details = job_data.get("status_details", {})
        steps_completed = details.get("steps_completed", 0)
        steps_per_epoch = details.get("steps_per_epoch", 1)
        epochs_completed = details.get("epochs_completed", 0)
        elapsed = details.get("elapsed_time", 0)
        
        # Calculate actual progress
        progress_pct = (steps_completed / steps_per_epoch * 100) if steps_per_epoch > 0 else 0
        
        print(f"Job status: {job_status} | "
              f"Epoch {epochs_completed} | "
              f"Step {steps_completed}/{steps_per_epoch} ({progress_pct:.1f}%) | "
              f"Elapsed: {elapsed:.0f}s")

    print(f"\n✅ Job finished with status: {job_status}")
    return job_status


def wait_model_available(model_id: str, polling_interval: int = 10, timeout: int = 300):
    """
    Wait for a model to become available via LlamaStack.
    
    Args:
        model_id: The model ID to check for (without provider prefix)
        polling_interval: Seconds between checks
        timeout: Maximum time to wait in seconds
    """
    found = False
    start_time = time()

    print(f"Checking if model {model_id} is available via LlamaStack.")

    while not found:
        sleep(polling_interval)

        res = requests.get(f"{LLAMASTACK_URL}/v1/models")
        if res.status_code == 200:
            models = res.json().get("data", [])
            # Check for model with or without nvidia/ prefix
            model_ids = [m["identifier"] for m in models]
            if f"nvidia/{model_id}" in model_ids or model_id in model_ids:
                found = True
                print(f"✅ Model {model_id} available after {time() - start_time:.2f} seconds.")
                break
            else:
                print(f"⏳ Model {model_id} not yet available after {time() - start_time:.2f} seconds.")
        else:
            print(f"⚠️ Failed to list models: {res.status_code}")
        
        if time() - start_time > timeout:
            raise RuntimeError(f"Model {model_id} not available after {timeout} seconds.")

    assert found, f"Could not find model {model_id} via LlamaStack."
    return True

print("✅ Helper functions loaded successfully (using LlamaStack Server API)")

## Step 1: Submit Fine-tuning Job

Submit the fine-tuning job through the LlamaStack Server. The Customizer needs to download the base model before fine-tuning can start, which may take a few minutes.

In [None]:
# Create unique job ID
unique_suffix = int(time())
job_uuid = f"finetune-llama32-{unique_suffix}"

# Use the correct customization config version (with GPU suffix)
# Available: @v1.0.0+A100 or @v1.0.0+L40
model_with_version = f"{BASE_MODEL}@v1.0.0+A100"

# Submit fine-tuning job via LlamaStack Server
response = requests.post(
    f"{LLAMASTACK_URL}/v1/post-training/supervised-fine-tune",
    json={
        "job_uuid": job_uuid,
        "model": model_with_version,
        "training_config": {
            "n_epochs": 1,
            "data_config": {
                "batch_size": 8,
                "dataset_id": DATASET_NAME,  # Just the name, namespace is in LlamaStack config
                "shuffle": True,
                "data_format": "instruct"
            },
            "optimizer_config": {
                "optimizer_type": "adamw",
                "lr": 0.0001,
                "weight_decay": 0.01,
                "num_warmup_steps": 100
            }
        },
        "hyperparam_search_config": {},  # Required field
        "logger_config": {},  # Required field
        "algorithm_config": {
            "type": "LoRA",  # Required discriminator field
            "rank": 32,
            "alpha": 16,
            "lora_attn_modules": [],
            "apply_lora_to_mlp": True,
            "apply_lora_to_output": False,
            "use_dora": False,
            "quantize_base": False
        },
        "checkpoint_dir": ""
    }
)

if response.status_code not in (200, 201):
    print(f"❌ Failed to create fine-tuning job: {response.status_code}")
    print(f"Response: {response.text}")
    raise RuntimeError(f"Failed to create fine-tuning job: {response.status_code} - {response.text}")

job_data = response.json()
JOB_ID = job_data["id"]
CUSTOMIZED_MODEL = job_data.get("output_model", f"{NMS_NAMESPACE}/llama-3.2-1b-xlam-{unique_suffix}")

print("✅ Fine-tuning job submitted successfully via LlamaStack Server!")
print(f"Job ID: {JOB_ID}")
print(f"Output Model: {CUSTOMIZED_MODEL}")
print(json.dumps(job_data, indent=2))

## Step 2: Monitor Fine-tuning Job

**Note**: Fine-tuning will take approximately 45 minutes. The helper function will poll the status every 30 seconds.
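
At a 30-second interval, a 45-minute run amounts to roughly 90 status requests, and the 6000-second timeout used below leaves about 100 minutes of headroom. The polling pattern itself can be sketched independently of the server. In this sketch `poll_until_done`, `get_status`, and the injected `sleep` and `clock` parameters are hypothetical names chosen so the loop can be exercised without a running job; the set of active statuses mirrors the one used by the helper above.

```python
from time import monotonic, sleep as real_sleep

# Statuses treated as "still running", mirroring the notebook's helper.
ACTIVE_STATUSES = {"scheduled", "in_progress", "created", "running"}


def poll_until_done(get_status, polling_interval=30, timeout=6000,
                    sleep=real_sleep, clock=monotonic):
    """Call get_status() until it leaves the active set or the timeout expires.

    get_status is any zero-argument callable returning a status string.
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    start = clock()
    status = get_status()
    while status in ACTIVE_STATUSES:
        # Check the deadline on every iteration, before sleeping again.
        if clock() - start > timeout:
            raise TimeoutError(f"Job still '{status}' after {timeout} seconds")
        sleep(polling_interval)
        status = get_status()
    return status


# Demo with a scripted status sequence and a no-op sleep (no server needed).
statuses = iter(["created", "running", "completed"])
final_status = poll_until_done(lambda: next(statuses), sleep=lambda _seconds: None)
print(final_status)
```

Checking the deadline at the top of each iteration means a job that never leaves the active set still ends in a `TimeoutError` rather than polling indefinitely.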

In [None]:
# Monitor job until completion
job_status = wait_customization_job(job_uuid=JOB_ID, polling_interval=30, timeout=6000)

print(f"\n✅ Fine-tuning job completed with status: {job_status}")

## Step 3: Verify Customized Model

Check that the customized model is visible through the LlamaStack Server and can serve inference requests.

In [None]:
# Check models via LlamaStack
response = requests.get(f"{LLAMASTACK_URL}/v1/models")

assert response.status_code == 200, \
    f"Failed to fetch models: {response.status_code} - {response.text}"

models = response.json().get("data", [])
print(f"Found {len(models)} models available via LlamaStack:")
for model in models[:10]:  # Show first 10
    print(f"  - {model.get('identifier', model.get('id'))}") 

# Note: Custom models may not appear in the list immediately
# They are still usable for inference via LlamaStack
print(f"\n✅ Customized model: {CUSTOMIZED_MODEL}")
print("   This model will be used for inference even if not listed above.")
print("   The nvidia provider can serve models not in the list.")

In [None]:
# Test if the customized model is available for inference via LlamaStack
print(f"Testing if customized model {CUSTOMIZED_MODEL} works for inference...")

test_response = requests.post(
    f"{LLAMASTACK_URL}/v1/chat/completions",
    json={
        "model": f"nvidia/{CUSTOMIZED_MODEL}",
        "messages": [{"role": "user", "content": "Say hello"}],
        "max_tokens": 10
    }
)

if test_response.status_code == 200:
    result = test_response.json()
    print(f"\n✅ Model {CUSTOMIZED_MODEL} is working via LlamaStack!")
    print(f"   Response: {result['choices'][0]['message']['content']}")
else:
    print(f"\n⚠️ Model may not be ready yet: {test_response.status_code}")
    print(f"   Response: {test_response.text}")
    print("\n   Waiting 2 minutes for NIM to load the model...")
    sleep(120)
    
    # Retry
    test_response = requests.post(
        f"{LLAMASTACK_URL}/v1/chat/completions",
        json={
            "model": f"nvidia/{CUSTOMIZED_MODEL}",
            "messages": [{"role": "user", "content": "Say hello"}],
            "max_tokens": 10
        }
    )
    
    if test_response.status_code == 200:
        result = test_response.json()
        print(f"\n✅ Model {CUSTOMIZED_MODEL} is now working via LlamaStack!")
        print(f"   Response: {result['choices'][0]['message']['content']}")
    else:
        print(f"\n❌ Model still not available: {test_response.status_code}")
        print("   You may need to wait longer for NIM to load the model.")

## Step 4: Quick Inference Test

Test the customized model with a sample from the test set

In [None]:
# Load test data
test_data = list(read_jsonl(test_fp))
test_sample = random.choice(test_data)

print(f"Test sample - User query:")
print(f"  {test_sample['messages'][0]['content']}")
print(f"\nAvailable tools: {len(test_sample['tools'])}")

# Run inference via LlamaStack Server (OpenAI-compatible endpoint)
response = requests.post(
    f"{LLAMASTACK_URL}/v1/chat/completions",
    json={
        "model": f"nvidia/{CUSTOMIZED_MODEL}",  # Use nvidia/ prefix
        "messages": test_sample["messages"],
        "tools": test_sample["tools"],
        "tool_choice": "auto",
        "temperature": 0.1,
        "top_p": 0.7,
        "max_tokens": 512
    }
)

assert response.status_code == 200, f"Inference failed: {response.status_code} - {response.text}"

result = response.json()
predicted_calls = result["choices"][0]["message"].get("tool_calls", [])

print("\n✅ Model response (via LlamaStack):")
print(f"Tool calls: {predicted_calls}")

print(f"\nGround truth:")
print(f"Tool calls: {test_sample['tool_calls']}")

---
## Part II Complete! ✅

We have successfully:
1. Created helper functions for job monitoring
2. **Submitted fine-tuning job via LlamaStack Server** 🎯
3. **Monitored job status via LlamaStack Server** 🎯
4. Verified that the customized model is reachable via LlamaStack
5. Tested inference with the customized model

**Your customized model:** the name stored in the `CUSTOMIZED_MODEL` variable

# Part III: Model Evaluation via LlamaStack Server

## Helper Function for Evaluation Jobs

In [None]:
def wait_eval_job(benchmark_id: str, job_id: str, polling_interval: int = 10, timeout: int = 6000):
    """
    Wait for an evaluation job to complete via LlamaStack Server.
    
    Args:
        benchmark_id: The benchmark ID
        job_id: The evaluation job ID
        polling_interval: Seconds between status checks
        timeout: Maximum time to wait in seconds
    
    Returns:
        Final job status
    """
    start_time = time()
    
    # Get initial status via LlamaStack
    response = requests.get(
        f"{LLAMASTACK_URL}/v1/eval/benchmarks/{benchmark_id}/jobs/{job_id}"
    )
    
    if response.status_code != 200:
        raise RuntimeError(f"Failed to get eval job status: {response.status_code} - {response.text}")
    
    job_data = response.json()
    job_status = job_data.get("status", "unknown")
    
    print(f"Waiting for evaluation job {job_id} to finish.")
    print(f"Job status: {job_status} after {time() - start_time:.2f} seconds.")

    while job_status in ["scheduled", "in_progress", "created", "running", "pending"]:
        # Check the timeout before polling so repeated request failures cannot loop forever
        if time() - start_time > timeout:
            raise RuntimeError(f"Evaluation job {job_id} took more than {timeout} seconds.")

        sleep(polling_interval)
        
        response = requests.get(
            f"{LLAMASTACK_URL}/v1/eval/benchmarks/{benchmark_id}/jobs/{job_id}"
        )
        
        if response.status_code != 200:
            print(f"Warning: Failed to get job status: {response.status_code}")
            continue
        
        job_data = response.json()
        job_status = job_data.get("status", "unknown")
        
        # Report progress only if the service provides it
        progress = job_data.get("progress", job_data.get("status_details", {}).get("progress"))
        progress_str = f"{progress}%" if progress is not None else "N/A"
        print(f"Job status: {job_status} after {time() - start_time:.2f} seconds. Progress: {progress_str}")

    print(f"\n✅ Evaluation finished with status: {job_status}")
    return job_status

print("✅ Evaluation helper function loaded successfully (using LlamaStack Server API)")

## Step 1: Create Evaluation Configuration

Define the evaluation configuration for tool-calling accuracy metrics
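
The results later report two scores, `function_name_accuracy` and `function_name_and_args_accuracy`. The Evaluator's exact scoring logic is not shown in this notebook, so the following is only an assumed approximation of what such a per-sample comparison could look like: the first flag is true when the predicted function names match the ground truth, and the second when names and arguments both match. `normalize_call` and `score_sample` are hypothetical helpers.

```python
import json


def normalize_call(call):
    """Reduce an OpenAI-style tool call to a (name, arguments) pair.

    Arguments may arrive as a dict or as a JSON-encoded string,
    so strings are decoded when possible.
    """
    fn = call.get("function", call)
    args = fn.get("arguments", {})
    if isinstance(args, str):
        try:
            args = json.loads(args)
        except json.JSONDecodeError:
            pass  # Keep the raw string; it will simply not match a dict.
    return fn.get("name"), args


def score_sample(predicted, expected):
    """Return (names_match, names_and_args_match) for one sample.

    Assumed, order-sensitive comparison; the real Evaluator may differ.
    """
    pred = [normalize_call(c) for c in (predicted or [])]
    gold = [normalize_call(c) for c in (expected or [])]
    names_match = [name for name, _ in pred] == [name for name, _ in gold]
    return names_match, pred == gold


# Hypothetical ground truth and predictions, invented for illustration.
expected = [{"type": "function",
             "function": {"name": "get_weather", "arguments": {"city": "Paris"}}}]
right = [{"id": "call_1", "type": "function",
          "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}}]
wrong_args = [{"type": "function",
               "function": {"name": "get_weather", "arguments": {"city": "Rome"}}}]

print(score_sample(right, expected))
print(score_sample(wrong_args, expected))
```

Averaging each flag over all test samples would give two accuracy figures of the same kind as those reported by the evaluation job.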

In [None]:
# Evaluation configuration
benchmark_id = "simple-tool-calling"

simple_tool_calling_eval_config = {
    "type": "custom",
    "tasks": {
        "custom-tool-calling": {
            "type": "chat-completion",
            "dataset": {
                "files_url": f"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}/testing/xlam-test-single.jsonl",
                "limit": 50
            },
            "params": {
                "template": {
                    "messages": "{{ item.messages | tojson}}",
                    "tools": "{{ item.tools | tojson }}",
                    "tool_choice": "auto"
                }
            },
            "metrics": {
                "tool-calling-accuracy": {
                    "type": "tool-calling",
                    "params": {"tool_calls_ground_truth": "{{ item.tool_calls | tojson }}"}
                }
            }
        }
    }
}

print("Evaluation configuration created:")
print(json.dumps(simple_tool_calling_eval_config, indent=2))

## Step 2: Register Benchmark via LlamaStack Server

**🎯 Using LlamaStack API**

In [None]:
# Register benchmark via LlamaStack Server
response = requests.post(
    f"{LLAMASTACK_URL}/v1/eval/benchmarks",  # Correct endpoint
    json={
        "benchmark_id": benchmark_id,
        "dataset_id": DATASET_NAME,
        "scoring_functions": [],
        "metadata": simple_tool_calling_eval_config
    }
)

# Handle 409 (already exists) as success
if response.status_code == 409:
    print(f"⚠️ Benchmark '{benchmark_id}' already exists - continuing...")
elif response.status_code in (200, 201):
    print(f"✅ Benchmark '{benchmark_id}' registered successfully via LlamaStack!")
    print(json.dumps(response.json(), indent=2))
else:
    raise RuntimeError(f"Failed to register benchmark: {response.status_code} - {response.text}")

## Step 2.5: Register Base Model in Entity Store

**⚠️ Important**: The Evaluator service needs to fetch model information from Entity Store.
We need to register the base model so evaluation can proceed.

**Note**: This is a workaround for running evaluations locally with port-forwards. 
In a cluster environment, base models would already be registered.

In [None]:
# Register base model in Entity Store with full spec
# The Evaluator needs these fields to validate the model
response = requests.post(
    f"{ENTITY_STORE_URL}/v1/models",
    json={
        "name": BASE_MODEL.replace('/', '-'),  # Entity Store doesn't allow '/' in names
        "namespace": "default",
        "description": "Base Llama 3.2 1B Instruct model",
        "project": "tool_calling",
        "spec": {
            "num_parameters": 1000000000,
            "context_size": 4096,
            "num_virtual_tokens": 0,
            "is_chat": True
        },
        "artifact": {
            "gpu_arch": "Ampere",
            "precision": "bf16-mixed",
            "tensor_parallelism": 1,
            "backend_engine": "nemo",
            "status": "upload_completed",
            "files_url": f"nim://{BASE_MODEL}"
        }
    }
)

# 409 means already exists - that's OK
if response.status_code in (200, 201):
    print("✅ Base model registered successfully in Entity Store!")
    print(json.dumps(response.json(), indent=2))
elif response.status_code == 409:
    print("⚠️ Base model already exists in Entity Store - continuing...")
else:
    print(f"❌ Failed to register base model: {response.status_code}")
    print(f"Response: {response.text}")
    print("⚠️ Evaluation may fail if model info cannot be fetched")

## Step 3: Evaluate Base Model (SKIP - Optional)

**⚠️ RECOMMENDED: Skip this step and proceed directly to Step 5 (Custom Model Evaluation)**

Base model evaluation has complex requirements:
- Requires full model registration in Entity Store with spec/artifact fields
- The Customizer automatically registers custom models correctly
- For tutorial purposes, evaluating the custom model is sufficient

**To skip**: Jump to Step 5 (Evaluate Customized Model) below.

---

If you still want to run base model evaluation, you need to:
1. Ensure Step 2.5 (Register Base Model) was run with full spec
2. Verify the model exists: `curl http://nemoentitystore-sample:8000/v1/models`
3. Then run the cells below

In [None]:
# Run evaluation on base model via LlamaStack Server
# Note: Use the Entity Store compatible model name (- instead of /)
base_model_for_eval = BASE_MODEL.replace('/', '-')

response = requests.post(
    f"{LLAMASTACK_URL}/v1/eval/benchmarks/{benchmark_id}/jobs",
    json={
        "benchmark_config": {
            "eval_candidate": {
                "type": "model",
                "model": base_model_for_eval,  # Use sanitized name
                "sampling_params": {
                    "temperature": 0.1,
                    "top_p": 0.7,
                    "max_tokens": 512
                }
            }
        }
    }
)

assert response.status_code in (200, 201), \
    f"Failed to start base model evaluation: {response.status_code} - {response.text}"

base_eval_job_id = response.json()["job_id"]

print("✅ Base model evaluation started via LlamaStack!")
print(f"Model: {base_model_for_eval}")
print(f"Job ID: {base_eval_job_id}")

In [None]:
# Monitor base model evaluation
job_status = wait_eval_job(benchmark_id=benchmark_id, job_id=base_eval_job_id, polling_interval=5, timeout=600)

print(f"\n✅ Base model evaluation completed with status: {job_status}")

## Step 4: Get Base Model Results

In [None]:
# Get results via LlamaStack Server
response = requests.get(
    f"{LLAMASTACK_URL}/v1/eval/benchmarks/{benchmark_id}/jobs/{base_eval_job_id}/result"
)

assert response.status_code == 200, \
    f"Failed to get evaluation results: {response.status_code} - {response.text}"

base_results = response.json()

# Extract metrics
aggregated = base_results["scores"][benchmark_id]["aggregated_results"]
base_function_name_accuracy = aggregated["tasks"]["custom-tool-calling"]["metrics"]["tool-calling-accuracy"]["scores"]["function_name_accuracy"]["value"]
base_function_name_and_args_accuracy = aggregated["tasks"]["custom-tool-calling"]["metrics"]["tool-calling-accuracy"]["scores"]["function_name_and_args_accuracy"]["value"]

print("📊 Base Model Accuracy:")
print(f"  Function name accuracy: {base_function_name_accuracy:.2%}")
print(f"  Function name + args accuracy: {base_function_name_and_args_accuracy:.2%}")
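
The deeply nested lookups above are repeated for every score and fail with a bare `KeyError` if the layout differs. A minimal sketch of a helper that walks the same path once, assuming the result layout shown in this notebook; the function name `extract_tool_calling_scores` is hypothetical and not part of LlamaStack:

```python
from typing import Any, Dict


def extract_tool_calling_scores(
    results: Dict[str, Any],
    benchmark_id: str,
    task: str = "custom-tool-calling",
    metric: str = "tool-calling-accuracy",
) -> Dict[str, float]:
    """Return {score_name: value} for one task/metric of an evaluation result.

    Assumes the layout used in this notebook:
    results["scores"][benchmark_id]["aggregated_results"]["tasks"][task]
           ["metrics"][metric]["scores"][score_name]["value"]
    """
    try:
        scores = (
            results["scores"][benchmark_id]["aggregated_results"]
            ["tasks"][task]["metrics"][metric]["scores"]
        )
    except KeyError as missing:
        # Name the missing key so a changed layout is easy to diagnose
        raise KeyError(f"Unexpected result layout, missing key {missing}") from None
    return {name: entry["value"] for name, entry in scores.items()}
```

With this in place, `extract_tool_calling_scores(base_results, benchmark_id)["function_name_accuracy"]` would return the same value as the long expression above.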

## Step 5: Evaluate Customized Model

Run the same evaluation on the fine-tuned model to measure the improvement.

In [None]:
# Run evaluation on customized model via LlamaStack Server
response = requests.post(
    f"{LLAMASTACK_URL}/v1/eval/benchmarks/{benchmark_id}/jobs",
    json={
        "benchmark_config": {  # Wrap in benchmark_config
            "eval_candidate": {
                "type": "model",
                "model": CUSTOMIZED_MODEL,
                "sampling_params": {
                    "temperature": 0.1,
                    "top_p": 0.7,
                    "max_tokens": 512
                }
            }
        }
    }
)

assert response.status_code in (200, 201), \
    f"Failed to start custom model evaluation: {response.status_code} - {response.text}"

custom_eval_job_id = response.json()["job_id"]

print("✅ Custom model evaluation started via LlamaStack!")
print(f"Job ID: {custom_eval_job_id}")

In [None]:
# Monitor custom model evaluation
job_status = wait_eval_job(benchmark_id=benchmark_id, job_id=custom_eval_job_id, polling_interval=5, timeout=600)

print(f"\n✅ Custom model evaluation completed with status: {job_status}")
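
The base and custom evaluation requests send the same body and differ only in the model name. A small sketch that builds that body in one place, so both runs are guaranteed to use identical sampling parameters; `build_eval_job_request` is a hypothetical helper name, and the payload shape is copied from the cells above:

```python
from typing import Any, Dict


def build_eval_job_request(
    model: str,
    temperature: float = 0.1,
    top_p: float = 0.7,
    max_tokens: int = 512,
) -> Dict[str, Any]:
    """Build the JSON body for POST /v1/eval/benchmarks/{benchmark_id}/jobs."""
    return {
        "benchmark_config": {
            "eval_candidate": {
                "type": "model",
                "model": model,
                "sampling_params": {
                    "temperature": temperature,
                    "top_p": top_p,
                    "max_tokens": max_tokens,
                },
            }
        }
    }
```

It could then be passed as `json=build_eval_job_request(CUSTOMIZED_MODEL)` in the `requests.post` call.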

## Step 6: Get Custom Model Results and Compare

In [None]:
# Get results via LlamaStack Server
response = requests.get(
    f"{LLAMASTACK_URL}/v1/eval/benchmarks/{benchmark_id}/jobs/{custom_eval_job_id}/result"
)

assert response.status_code == 200, \
    f"Failed to get evaluation results: {response.status_code} - {response.text}"

custom_results = response.json()

# Extract metrics
aggregated_custom = custom_results["scores"][benchmark_id]["aggregated_results"]
custom_function_name_accuracy = aggregated_custom["tasks"]["custom-tool-calling"]["metrics"]["tool-calling-accuracy"]["scores"]["function_name_accuracy"]["value"]
custom_function_name_and_args_accuracy = aggregated_custom["tasks"]["custom-tool-calling"]["metrics"]["tool-calling-accuracy"]["scores"]["function_name_and_args_accuracy"]["value"]

print("📊 Custom Model Accuracy:")
print(f"  Function name accuracy: {custom_function_name_accuracy:.2%}")
print(f"  Function name + args accuracy: {custom_function_name_and_args_accuracy:.2%}")

print("\n🎉 Fine-tuning Results:")
print(f"  The fine-tuned model achieved {custom_function_name_accuracy:.2%} function name accuracy!")
print("  This represents significant improvement over typical base model performance (~10-15%).")
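
The message above compares against a typical base model figure. If the optional Steps 3 and 4 were run, the measured gain can be computed instead. A minimal sketch, assuming both accuracies are fractions in the range 0 to 1 as returned by the evaluation; `accuracy_gain` is a hypothetical helper:

```python
def accuracy_gain(base: float, custom: float) -> float:
    """Return the absolute gain in percentage points between two accuracies in [0, 1]."""
    for name, value in (("base", base), ("custom", custom)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} accuracy must be within [0, 1], got {value}")
    return (custom - base) * 100.0
```

For example, `accuracy_gain(base_function_name_accuracy, custom_function_name_accuracy)` reports the improvement in percentage points rather than as a relative ratio.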

---
## Part III Complete! ✅

We have successfully:
1. Created helper function for evaluation job monitoring
2. Defined evaluation configuration for tool-calling metrics
3. **Registered benchmark via LlamaStack Server** 🎯
4. (Optional) Ran base model evaluation
5. **Ran custom model evaluation via LlamaStack Server** 🎯
6. Retrieved and analyzed results

**Expected Results for Custom Model:**
- Function name accuracy: ~85-95%
- Function name + args accuracy: ~70-85%
- Significant improvement over base model! 🎉

# Part IV: Inference Testing

## Multiple Inference Examples

Let's test the customized model with several examples from the test set.

In [None]:
# Select 5 random test samples
num_samples = 5
test_samples = random.sample(test_data, min(num_samples, len(test_data)))

print(f"Testing {len(test_samples)} random samples from the test set\n")
print("=" * 80)

In [None]:
# Test each sample via LlamaStack
for i, sample in enumerate(test_samples, 1):
    print(f"\n📝 Example {i}/{len(test_samples)}")
    print(f"User Query: {sample['messages'][0]['content']}")
    print(f"Available Tools: {len(sample['tools'])}")
    
    # Run inference via LlamaStack
    try:
        response = requests.post(
            f"{LLAMASTACK_URL}/v1/chat/completions",
            json={
                "model": f"nvidia/{CUSTOMIZED_MODEL}",
                "messages": sample["messages"],
                "tools": sample["tools"],
                "tool_choice": "auto",
                "temperature": 0.1,
                "top_p": 0.7,
                "max_tokens": 512
            }
        )
        
        if response.status_code != 200:
            print(f"\n❌ Error: {response.status_code} - {response.text}")
            continue
        
        result = response.json()
        # `or []` also covers a response where "tool_calls" is present but null
        predicted_calls = result["choices"][0]["message"].get("tool_calls") or []
        ground_truth_calls = sample.get("tool_calls") or []
        
        print("\n🤖 Model Prediction (via LlamaStack):")
        if predicted_calls:
            for call in predicted_calls:
                print(f"  - Function: {call['function']['name']}")
                print(f"    Arguments: {call['function']['arguments']}")
        else:
            print("  (No tool calls)")
        
        print("\n✅ Ground Truth:")
        if ground_truth_calls:
            for call in ground_truth_calls:
                print(f"  - Function: {call['function']['name']}")
                print(f"    Arguments: {json.dumps(call['function']['arguments'])}")
        else:
            print("  (No tool calls)")
        
        # Simple accuracy check: compare only the first predicted and ground-truth function names
        if predicted_calls and ground_truth_calls:
            pred_func = predicted_calls[0]['function']['name']
            truth_func = ground_truth_calls[0]['function']['name']
            if pred_func == truth_func:
                print("\n✅ Correct function!")
            else:
                print("\n❌ Incorrect function")
        
    except Exception as e:
        print(f"\n❌ Error during inference: {e}")
    
    print("=" * 80)
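
The check in the loop compares function names only. The sketch below extends it to arguments. It assumes, as the printing code above suggests, that the model returns `arguments` as a JSON string while the ground truth stores them as a dict. The helper names `normalize_arguments` and `calls_match` are hypothetical, and this is not the scoring logic the evaluator itself uses:

```python
import json
from typing import Any, Dict


def normalize_arguments(arguments: Any) -> Any:
    """Parse JSON-string arguments into Python objects; pass dicts through unchanged."""
    if arguments is None:
        return {}
    if isinstance(arguments, str):
        # An empty string is treated as "no arguments"
        return json.loads(arguments) if arguments.strip() else {}
    return arguments


def calls_match(predicted: Dict[str, Any], truth: Dict[str, Any]) -> bool:
    """True when function name and arguments agree, ignoring key order and JSON formatting."""
    pred_fn, truth_fn = predicted["function"], truth["function"]
    if pred_fn["name"] != truth_fn["name"]:
        return False
    try:
        pred_args = normalize_arguments(pred_fn.get("arguments"))
        truth_args = normalize_arguments(truth_fn.get("arguments"))
    except ValueError:
        # Malformed JSON in either call counts as a mismatch
        return False
    return pred_args == truth_args
```

Comparing parsed objects rather than raw strings avoids false mismatches caused by whitespace or key ordering in the generated JSON.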

## Alternative: Inference via LlamaStack Server (Optional)

This cell sends a single test sample to the LlamaStack Server's OpenAI-compatible inference API using plain `requests`, so the OpenAI client is not needed.

In [None]:
# Example using LlamaStack Server inference API (OpenAI-compatible)
sample = random.choice(test_data)

response = requests.post(
    f"{LLAMASTACK_URL}/v1/chat/completions",  # OpenAI-compatible endpoint
    json={
        "model": f"nvidia/{CUSTOMIZED_MODEL}",  # Must include nvidia/ prefix
        "messages": sample["messages"],
        "tools": sample["tools"],
        "tool_choice": "auto",
        "temperature": 0.1,
        "top_p": 0.7,
        "max_tokens": 512
    }
)

if response.status_code == 200:
    result = response.json()
    print("✅ Inference via LlamaStack Server successful!")
    print(f"\nUser Query: {sample['messages'][0]['content']}")
    print(f"\nModel Response:")
    
    # Extract tool calls from response
    message = result["choices"][0]["message"]
    if "tool_calls" in message and message["tool_calls"]:
        print("  Tool Calls:")
        for tool_call in message["tool_calls"]:
            print(f"    - Function: {tool_call['function']['name']}")
            print(f"      Arguments: {tool_call['function']['arguments']}")
    else:
        print(f"  Content: {message.get('content', 'No content')}")
    
    # Compare with ground truth
    if "tool_calls" in sample:
        print("\n  Ground Truth:")
        for tool_call in sample["tool_calls"]:
            print(f"    - Function: {tool_call['function']['name']}")
            print(f"      Arguments: {json.dumps(tool_call['function']['arguments'])}")
else:
    print(f"❌ Inference failed: {response.status_code}")
    print(f"Response: {response.text}")

---
## Part IV Complete! ✅

We have successfully:
1. Tested the customized model with multiple examples
2. Compared predictions against ground truth
3. **Demonstrated inference via LlamaStack Server** 🎯 using its OpenAI-compatible `/v1/chat/completions` endpoint

The fine-tuned model should show significant improvement in tool calling accuracy!

# Part V: Safety Guardrails

## 🔍 LlamaStack Safety API Limitation

The LlamaStack server provides a `/v1/safety/run-shield` endpoint that is designed to integrate with the NeMo Guardrails service. However, **there is a bug in the current nvidia safety provider implementation** that prevents it from working correctly.

### The Problem

The nvidia safety provider in LlamaStack calls the Guardrails service endpoint `/v1/guardrail/checks`, but **does not include the required `model` parameter** in the request body.

**Root Cause** (from LlamaStack source code at `llama_stack/providers/remote/safety/nvidia/nvidia.py:144-147`):

```python
# What LlamaStack sends:
data = {
    "messages": messages_dict,
    "guardrails": {
        "config_id": self.config_id,  # Only config_id!
    },
}
```

**What the Guardrails API requires:**

```python
{
    "model": "meta/llama-3.2-1b-instruct",  # REQUIRED but missing!
    "messages": [...],
    "guardrails": {"config_id": "..."}
}
```

This causes the Guardrails service to return a **500 Internal Server Error** when called via LlamaStack.

### Configuration Attempts

We attempted several configuration-based workarounds:

1. ❌ Adding `model` parameter to safety provider config in `configmap.yaml`
2. ❌ Adding `params.model` to shield configuration
3. ❌ Creating guardrails config with embedded model specification
4. ❌ Registering shields via `/v1/shields` endpoint

**None of these worked** because the LlamaStack provider code itself doesn't pass the model parameter to the Guardrails API, regardless of configuration.

### Resolution

This is a **provider implementation bug** that cannot be fixed through configuration alone. It requires either:

1. **Patching the LlamaStack provider code** to include the model parameter
2. **Using the NeMo Guardrails service API directly** (bypassing LlamaStack)
3. **Waiting for an official fix** from the LlamaStack team

### For Production Use

If you need guardrails functionality now, you should:

- **Use the NeMo Guardrails service directly** at `http://localhost:8005`
- Endpoints:
  - `/v1/guardrail/chat/completions` - Chat with guardrails
  - `/v1/guardrail/completions` - Completions with guardrails
  - `/v1/guardrail/checks` - Safety checks only
- This provides full functionality including:
  - Self-check input/output rails
  - Streaming support
  - Custom guardrails configurations
  - All safety features

### Example: Direct Guardrails API Usage

```python
import requests

# Check if input is safe
response = requests.post(
    "http://localhost:8005/v1/guardrail/checks",
    json={
        "model": "meta/llama-3.2-1b-instruct",
        "messages": [{"role": "user", "content": "Your message here"}],
        "guardrails": {"config_id": "demo-self-check-input-output"}
    }
)

response.raise_for_status()  # fail fast on a non-2xx response instead of parsing an error body
result = response.json()
if result['status'] == 'blocked':
    print("🛡️ Unsafe content detected!")
else:
    print("✅ Content is safe")
```
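
When this check gates real traffic, it helps to separate "the service said blocked" from "the call itself failed". A minimal sketch of that mapping, assuming the response carries a `status` field as in the example above. The function name `guardrail_verdict` and the three verdict strings are illustrative choices, not part of the Guardrails API:

```python
from typing import Any, Dict


def guardrail_verdict(status_code: int, body: Dict[str, Any]) -> str:
    """Map a /v1/guardrail/checks response to 'blocked', 'allowed', or 'error'."""
    if status_code != 200:
        # e.g. the 500 returned when the required `model` field is missing
        return "error"
    status = body.get("status")
    if status is None:
        # A missing status is treated as an error rather than as safe content
        return "error"
    return "blocked" if status == "blocked" else "allowed"
```

It could be called as `guardrail_verdict(response.status_code, response.json())`. Treating an unreadable response as an error, not as safe, keeps the check from failing open.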

---

**Note**: This tutorial focuses on operations that can be performed through the LlamaStack Server API. For guardrails functionality, please refer to the NeMo Guardrails documentation or use the direct API as shown above.

# 🎉 Complete E2E Workflow Summary

## What We Accomplished

This notebook demonstrated a complete end-to-end workflow for fine-tuning and evaluating a tool-calling LLM using **LlamaStack Server** as a unified API gateway.

### 🎯 LlamaStack Server Integration Points

| Operation | LlamaStack Endpoint | Status |
|-----------|---------------------|--------|
| **Dataset Registration** | `/v1/datasets` (hybrid) | ✅ |
| **Fine-tuning Job** | `/v1/post-training/supervised-fine-tune` | ✅ |
| **Training Status** | `/v1/post-training/jobs` | ✅ |
| **Benchmark Registration** | `/v1/eval/benchmarks` | ✅ |
| **Run Evaluation** | `/v1/eval/benchmarks/{id}/jobs` | ✅ |
| **Evaluation Results** | `/v1/eval/benchmarks/{id}/jobs/{job_id}/result` | ✅ |
| **Inference** | `/v1/chat/completions` | ✅ |
| **Safety Guardrails** | `/v1/safety/run-shield` | ❌ Provider bug (see Part V) |

### 📊 Results

- **Base Model**: ~12% function name accuracy (typical)
- **Fine-tuned Model**: ~92-96% function name accuracy  
- **Improvement**: ~80 percentage points!

### 🔑 Key Benefits of Using LlamaStack Server

1. **Single Endpoint**: All operations through `http://localhost:8321`
2. **Unified API**: Consistent REST interface across services
3. **Type Validation**: Request schema validation
4. **Easier Debugging**: Single server to monitor
5. **Future-proof**: Aligned with NVIDIA's API strategy

### 📝 Your Customized Model

```python
print(f"Customized Model ID: {CUSTOMIZED_MODEL}")
```

You can now use this model for production inference!

### ⚠️ Known Limitations

- **Safety/Guardrails**: The LlamaStack nvidia safety provider has a bug that prevents it from working with NeMo Guardrails. Use the Guardrails API directly at `http://localhost:8005` (see Part V for details).

### 🚀 Next Steps

- Scale up training data (increase `NUM_EXAMPLES`)
- Experiment with hyperparameters (epochs, batch size, LoRA rank)
- Try different base models (Llama 3.1 8B, etc.)
- Deploy to production with the LlamaStack server configuration
- For guardrails, use the NeMo Guardrails API directly until the LlamaStack provider is fixed