 # Part I: Preparing Datasets for Fine-tuning and Evaluation

The following code cells install the required Python packages and import the necessary libraries.

In [None]:
# Install required Python packages for this notebook environment
%pip install \
  huggingface_hub \
  "transformers>=4.36.0" \
  peft \
  datasets \
  trl \
  jsonschema \
  litellm \
  "jinja2>=3.1.0" \
  "torch>=2.0.0" \
  openai \
  jupyterlab \
  requests \
  python-dotenv

# Note: print_status function is defined in the next cell
print("✅ Package installation completed")


In [None]:
import os
import json
import random
from pprint import pprint
from typing import Any, Dict, List, Union

import numpy as np
import torch
from datasets import load_dataset

def print_status(message):
    """Print a status message with a checkmark emoji."""
    print(f"✅ {message}")

print_status("Imports completed")

The following code cell sets a random seed for reproducibility.

In [None]:
SEED = 1234

# Limits to at most N tool properties
LIMIT_TOOL_PROPERTIES = 8

torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
np.random.seed(SEED)
random.seed(SEED)

print_status("Random seed configuration completed")

The following code cell defines the data root directory and creates necessary directories for storing processed data.

In [None]:
# Processed data will be stored here
DATA_ROOT = os.path.join(os.getcwd(), "data")
CUSTOMIZATION_DATA_ROOT = os.path.join(DATA_ROOT, "customization")
VALIDATION_DATA_ROOT = os.path.join(DATA_ROOT, "validation")
EVALUATION_DATA_ROOT = os.path.join(DATA_ROOT, "evaluation")

os.makedirs(DATA_ROOT, exist_ok=True)
os.makedirs(CUSTOMIZATION_DATA_ROOT, exist_ok=True)
os.makedirs(VALIDATION_DATA_ROOT, exist_ok=True)
os.makedirs(EVALUATION_DATA_ROOT, exist_ok=True)

print_status("Data directories created")

---
<a id="step-1"></a>
## Step 1: Download xLAM Data

This step loads the xLAM dataset from Hugging Face.

Ensure that you have followed the prerequisites mentioned in the associated README, obtained a Hugging Face access token, and configured it in [config.py](./config.py). In addition to getting an access token, you need to apply for access to the xLAM dataset on its [page](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k), which will be approved instantly.


In [None]:
from config import HF_TOKEN

# Set environment variables for Hugging Face
os.environ["HF_TOKEN"] = HF_TOKEN
os.environ["HF_ENDPOINT"] = "https://huggingface.co"

# Note: We'll pass the token directly to load_dataset() in the next cell
# This is more reliable than using login() which can have issues with environment variables

print_status("Hugging Face configuration completed")

In [None]:
# Download from Hugging Face
# Ensure environment variables are set and pass token explicitly
from config import HF_TOKEN
import os

# Make sure environment variables are set (in case the previous cell wasn't run)
os.environ["HF_TOKEN"] = HF_TOKEN
os.environ["HF_ENDPOINT"] = "https://huggingface.co"

# Load dataset with explicit token
dataset = load_dataset("Salesforce/xlam-function-calling-60k", token=HF_TOKEN)

# Inspect a sample
example = dataset['train'][0]
pprint(example)

print_status("xLAM dataset downloaded and inspected")

For more details on the structure of this data, refer to the [data structure of the xLAM dataset](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k#structure) in the Hugging Face documentation.

---
<a id="step-2"></a>
## Step 2: Prepare Data for Customization

For Customization, the NeMo Microservices platform leverages the OpenAI data format, comprised of `messages` and `tools`:

* `messages` include the `user` query, as well as the ground truth `assistant` response to the query. This response contains the function name(s) and associated argument(s) in a `tool_calls` list.
* `tools` include a list of functions and parameters available to the LLM to choose from, as well as their descriptions.

The following is an example of the data format:
```
{
    "messages": [
        {
            "role": "user",
            "content": "Where can I find live giveaways for beta access and games?"
        },
        {
            "role": "assistant",
            "tool_calls": [
                {
                    "id": "call_beta",
                    "type": "function",
                    "function": {
                        "name": "live_giveaways_by_type",
                        "arguments": {"type": "beta"}
                    }
                },
                {
                    "id": "call_game",
                    "type": "function",
                    "function": {
                        "name": "live_giveaways_by_type",
                        "arguments": {"type": "game"}
                    }
                }
            ]
        }
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "live_giveaways_by_type",
                "description": "Retrieve live giveaways from the GamerPower API based on the specified type.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "type": {
                            "type": "string",
                            "description": "The type of giveaways to retrieve (e.g., game, loot, beta).",
                            "default": "game"
                        }
                    },
                    "required": []
                }
            }
        }
    ]
}
```
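Before fine-tuning, it can help to check that each record follows this layout. The following is a minimal sketch, not part of the NeMo tooling: `find_format_problems` is a hypothetical helper, and its checks (a `user` message, an `assistant` message, and tool calls that only name declared tools) are assumptions drawn from the example above rather than an official schema.

```python
from typing import Any, Dict, List


def find_format_problems(record: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the record looks well-formed."""
    problems: List[str] = []

    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    roles = [m.get("role") for m in messages if isinstance(m, dict)]
    if "user" not in roles:
        problems.append("no 'user' message")
    if "assistant" not in roles:
        problems.append("no 'assistant' message")

    # Names of the functions the model is allowed to call
    declared = {
        t["function"]["name"]
        for t in record.get("tools", [])
        if isinstance(t, dict) and "name" in t.get("function", {})
    }

    # Every ground-truth tool call should refer to a declared function
    for message in messages:
        if not isinstance(message, dict):
            continue
        for call in message.get("tool_calls", []):
            name = call.get("function", {}).get("name")
            if name not in declared:
                problems.append(f"tool call to undeclared function: {name!r}")

    return problems


# Illustrative record modeled on the example above
sample = {
    "messages": [
        {"role": "user", "content": "Where can I find live giveaways for beta access?"},
        {
            "role": "assistant",
            "tool_calls": [
                {
                    "id": "call_beta",
                    "type": "function",
                    "function": {"name": "live_giveaways_by_type", "arguments": {"type": "beta"}},
                }
            ],
        },
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "live_giveaways_by_type",
                "description": "Retrieve live giveaways by type.",
                "parameters": {"type": "object", "properties": {}},
            },
        }
    ],
}

print(find_format_problems(sample))  # prints []
```

A check like this is cheap to run over every converted record and catches conversion mistakes before a training job is submitted.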

The following helper functions convert a single xLAM JSON data point into OpenAI format.

In [None]:
def normalize_type(param_type: str) -> str:
    """
    Normalize Python type hints and parameter definitions to OpenAI function spec types.

    Args:
        param_type: Type string that could include default values or complex types

    Returns:
        Normalized type string according to OpenAI function spec
    """
    # Remove whitespace
    param_type = param_type.strip()

    # Handle types with default values (e.g. "str, default='London'")
    if "," in param_type and "default" in param_type:
        param_type = param_type.split(",")[0].strip()

    # Handle types with just default values (e.g. "default='London'")
    if param_type.startswith("default="):
        return "string"  # Default to string if only default value is given

    # Remove ", optional" suffix if present
    param_type = param_type.replace(", optional", "").strip()

    # Handle complex types
    if param_type.startswith("Callable"):
        return "string"  # Represent callable as string in JSON schema
    if param_type.startswith("Tuple"):
        return "array"  # Represent tuple as array in JSON schema
    if param_type.startswith("List["):
        return "array"
    if param_type.startswith("Set") or param_type == "set":
        return "array"  # Represent set as array in JSON schema

    # Map common type variations to OpenAI spec types
    type_mapping: Dict[str, str] = {
        "str": "string",
        "int": "integer",
        "float": "number",
        "bool": "boolean",
        "list": "array",
        "dict": "object",
        "List": "array",
        "Dict": "object",
        "set": "array",
        "Set": "array"
    }

    if param_type in type_mapping:
        return type_mapping[param_type]
    else:
        print(f"Unknown type: {param_type}")
        return "string"  # Default to string for unknown types


def convert_tools_to_openai_spec(tools: Union[str, List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    # If tools is a string, try to parse it as JSON
    if isinstance(tools, str):
        try:
            tools = json.loads(tools)
        except json.JSONDecodeError as e:
            print(f"Failed to parse tools string as JSON: {e}")
            return []

    # Ensure tools is a list
    if not isinstance(tools, list):
        print(f"Expected tools to be a list, but got {type(tools)}")
        return []

    openai_tools: List[Dict[str, Any]] = []
    for tool in tools:
        # Check if tool is a dictionary
        if not isinstance(tool, dict):
            print(f"Expected tool to be a dictionary, but got {type(tool)}")
            continue

        # Check if 'parameters' is a dictionary
        if not isinstance(tool.get("parameters"), dict):
            print(f"Expected 'parameters' to be a dictionary, but got {type(tool.get('parameters'))} for tool: {tool}")
            continue

    

        normalized_parameters: Dict[str, Dict[str, Any]] = {}
        for param_name, param_info in tool["parameters"].items():
            if not isinstance(param_info, dict):
                print(
                    f"Expected parameter info to be a dictionary, but got {type(param_info)} for parameter: {param_name}"
                )
                continue

            # Create parameter info without default first
            param_dict = {
                "description": param_info.get("description", ""),
                "type": normalize_type(param_info.get("type", "")),
            }

            # Only add default if it exists, is not None, and is not an empty string
            default_value = param_info.get("default")
            if default_value is not None and default_value != "":
                param_dict["default"] = default_value

            normalized_parameters[param_name] = param_dict

        openai_tool = {
            "type": "function",
            "function": {
                "name": tool["name"],
                "description": tool["description"],
                "parameters": {"type": "object", "properties": normalized_parameters},
            },
        }
        openai_tools.append(openai_tool)
    return openai_tools


def save_jsonl(filename, data):
    """Write a list of json objects to a .jsonl file"""
    with open(filename, "w") as f:
        for entry in data:
            f.write(json.dumps(entry) + "\n")


def convert_tool_calls(xlam_tools):
    """Convert XLAM tool format to OpenAI's tool schema."""
    tools = []
    for tool in json.loads(xlam_tools):
        tools.append({"type": "function", "function": {"name": tool["name"], "arguments": tool.get("arguments", {})}})
    return tools


def convert_example(example, dataset_type='single'):
    """Convert an XLAM dataset example to OpenAI format."""
    obj = {"messages": []}

    # User message
    obj["messages"].append({"role": "user", "content": example["query"]})

    # Tools
    if example.get("tools"):
        obj["tools"] = convert_tools_to_openai_spec(example["tools"])

    # Assistant message
    assistant_message = {"role": "assistant", "content": ""}
    if example.get("answers"):
        tool_calls = convert_tool_calls(example["answers"])
        
        if dataset_type == "single":
            # Only include examples with a single tool call
            if len(tool_calls) == 1:
                assistant_message["tool_calls"] = tool_calls
            else:
                return None
        else:
            # For other dataset types, include all tool calls
            assistant_message["tool_calls"] = tool_calls
                
    obj["messages"].append(assistant_message)

    return obj

print_status("Data conversion functions defined")

The following code cell converts the example data to the OpenAI format required by NeMo Customizer.


In [None]:
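# Print the converted example; a bare expression is only displayed when it is the last line of a cell
pprint(convert_example(example))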

print_status("Example conversion completed")

**NOTE**: By default, the `convert_example` function retains only data points that have exactly one `tool_call` in the output, because the `llama-3.2-1b-instruct` model does not support parallel tool calls.
For more information, refer to the [supported models](https://docs.nvidia.com/nim/large-language-models/latest/function-calling.html#supported-models) in the NVIDIA NIM documentation.
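To get a feel for how many data points this single-call filter keeps, you can count the tool calls per example. The following is a minimal sketch: `count_tool_calls` is a hypothetical helper, and the sample `answers` strings are made up for illustration. It assumes, as the conversion code above does, that each `answers` value is a JSON-encoded list of calls.

```python
import json
from collections import Counter
from typing import Iterable


def count_tool_calls(answers_column: Iterable[str]) -> Counter:
    """Map 'number of tool calls in an example' to 'number of such examples'."""
    return Counter(len(json.loads(answers)) for answers in answers_column)


# Made-up xLAM-style `answers` values: each is a JSON-encoded list of calls
sample_answers = [
    '[{"name": "live_giveaways_by_type", "arguments": {"type": "beta"}}]',
    '[{"name": "live_giveaways_by_type", "arguments": {"type": "beta"}}, '
    '{"name": "live_giveaways_by_type", "arguments": {"type": "game"}}]',
    '[{"name": "get_weather", "arguments": {"city": "London"}}]',
]

counts = count_tool_calls(sample_answers)
kept = counts[1]  # examples with exactly one tool call
dropped = sum(n for calls, n in counts.items() if calls != 1)
print(f"kept: {kept}, dropped: {dropped}")  # prints "kept: 2, dropped: 1"
```

In the notebook, the same helper could be applied to the `answers` column of `dataset["train"]` to see how much of the dataset survives the filter.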

### Process Entire Dataset
Convert each example by looping through the dataset.

In [None]:
all_examples = []
with open(os.path.join(DATA_ROOT, "xlam_openai_format.jsonl"), "w") as f:
    for example in dataset["train"]:
        converted = convert_example(example)
        if converted is not None:
            all_examples.append(converted)
            f.write(json.dumps(converted) + "\n")

print_status("Dataset conversion completed")

### Split Dataset
This step splits the dataset into a train, validation, and test set.
For demonstration, we use a smaller subset of all the examples.
You may choose to modify `NUM_EXAMPLES` to leverage a larger subset.

In [None]:
# Configure to change the size of dataset to use
NUM_EXAMPLES = 5000

assert NUM_EXAMPLES <= len(all_examples), f"{NUM_EXAMPLES} exceeds the total number of available ({len(all_examples)}) data points"

print_status("Dataset size configuration validated")

In [None]:
# Randomly choose a subset
sampled_examples = random.sample(all_examples, NUM_EXAMPLES)

# Split into 70% training, 15% validation, 15% testing
train_size = int(0.7 * len(sampled_examples))
val_size = int(0.15 * len(sampled_examples))

train_data = sampled_examples[:train_size]
val_data = sampled_examples[train_size : train_size + val_size]
test_data = sampled_examples[train_size + val_size :]

# Save the training and validation splits. We will use test split in the next section
save_jsonl(os.path.join(CUSTOMIZATION_DATA_ROOT, "training.jsonl"), train_data)
save_jsonl(os.path.join(VALIDATION_DATA_ROOT,"validation.jsonl"), val_data)

print_status("Dataset split and saved")
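The split sizes above follow from simple arithmetic: the training and validation sizes are truncated to integers, and the test set receives whatever remains. The following sketch reproduces that calculation; `split_sizes` is a hypothetical helper, not part of the notebook.

```python
from typing import Tuple


def split_sizes(n: int, train_frac: float = 0.7, val_frac: float = 0.15) -> Tuple[int, int, int]:
    """Return (train, validation, test) sizes using the notebook's slicing logic."""
    train = int(train_frac * n)  # truncated, as in the notebook
    val = int(val_frac * n)      # truncated, as in the notebook
    test = n - train - val       # the remainder goes to the test set
    return train, val, test


print(split_sizes(5000))  # prints (3500, 750, 750)
```

Because the test set takes the remainder, the three sizes always add up to `NUM_EXAMPLES`, even when the fractions do not divide it evenly.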

---
<a id="step-3"></a>
## Step 3: Prepare Data for Evaluation

For evaluation, the NeMo Microservices platform uses a format with a minor modification to the OpenAI format: `tool_calls` is moved out of `messages` into a distinct top-level field.

* `messages` includes the `user` query
* `tools` includes a list of functions and parameters available to the LLM to choose from, as well as their descriptions.
* `tool_calls` is the ground truth response to the user query. It is a list containing the function name(s) and associated argument(s).

Here is an example:

```
{
    "messages": [
        {
            "role": "user",
            "content": "Where can I find live giveaways for beta access?"
        }
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "live_giveaways_by_type",
                "description": "Retrieve live giveaways from the GamerPower API based on the specified type.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "type": {
                            "type": "string",
                            "description": "The type of giveaways to retrieve (e.g., game, loot, beta).",
                            "default": "game"
                        }
                    },
                    "required": []
                }
            }
        }
    ],
    "tool_calls": [
        {
            "id": "call_beta",
            "type": "function",
            "function": {
                "name": "live_giveaways_by_type",
                "arguments": {"type": "beta"}
            }
        }
    ]
}
```

The following steps transform the test dataset into a format compatible with the NeMo Evaluator microservice.
This dataset is for measuring accuracy metrics before and after customization.

In [None]:
def convert_example_eval(entry):
    """Convert a single entry in the dataset to the evaluator format"""

    # Note: This is a workaround for a known bug with tool calling in NIM
    for tool in entry["tools"]:
        if len(tool["function"]["parameters"]["properties"]) > LIMIT_TOOL_PROPERTIES:
            return None
    
    new_entry = {
        "messages": [],
        "tools": entry["tools"],
        "tool_calls": []
    }
    
    for msg in entry["messages"]:
        if msg["role"] == "assistant" and "tool_calls" in msg:
            new_entry["tool_calls"] = msg["tool_calls"]
        else:
            new_entry["messages"].append(msg)
    
    return new_entry

print_status("Evaluation conversion function defined")

def convert_dataset_eval(data):
    """Convert the entire dataset for evaluation by restructuring the data format."""
    return [result for entry in data if (result := convert_example_eval(entry)) is not None]

print_status("Evaluation conversion functions defined")

**NOTE**: We have implemented a workaround for a known bug where tool calls freeze the NIM if a tool description includes a function with a large number of parameters. As such, the evaluation dataset is limited to examples whose available tools each have at most 8 parameters (see `LIMIT_TOOL_PROPERTIES`). This will be resolved in the next NIM release.

In [None]:
test_data_eval = convert_dataset_eval(test_data)
save_jsonl(os.path.join(EVALUATION_DATA_ROOT, "xlam-test-single.jsonl"), test_data_eval)

print_status("Evaluation dataset prepared and saved")
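Because of the parameter limit, the evaluation set can be smaller than the test split. The following sketch shows the effect on toy data: `max_tool_properties` and `make_entry` are hypothetical helpers, and the entries are synthetic. It applies the same rule as `convert_example_eval`, keeping an entry only if no tool declares more than `LIMIT_TOOL_PROPERTIES` properties.

```python
from typing import Any, Dict

LIMIT_TOOL_PROPERTIES = 8  # same limit as in the notebook


def max_tool_properties(entry: Dict[str, Any]) -> int:
    """Largest number of properties declared by any tool in the entry (0 if no tools)."""
    return max(
        (len(tool["function"]["parameters"]["properties"]) for tool in entry.get("tools", [])),
        default=0,
    )


def make_entry(num_properties: int) -> Dict[str, Any]:
    """Build a synthetic entry with one tool that declares `num_properties` parameters."""
    properties = {f"p{i}": {"type": "string", "description": ""} for i in range(num_properties)}
    return {
        "messages": [{"role": "user", "content": "toy query"}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "toy_tool",
                    "description": "",
                    "parameters": {"type": "object", "properties": properties},
                },
            }
        ],
    }


entries = [make_entry(n) for n in (1, 8, 9, 12)]
kept = [e for e in entries if max_tool_properties(e) <= LIMIT_TOOL_PROPERTIES]
print(f"kept {len(kept)} of {len(entries)} entries")  # prints "kept 2 of 4 entries"
```

In the notebook, comparing `len(test_data)` with `len(test_data_eval)` gives the number of examples that the workaround removed.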

# Part II: LoRA Fine-tuning Using NeMo Customizer

In [None]:
import os
import json
import requests
import random
import time
from openai import OpenAI

print_status("Part II imports completed")

### Configure NeMo Microservices Endpoints

This section includes importing required libraries, configuring endpoints, and performing health checks to ensure that the NeMo Data Store, NIM, and other services are running correctly.

In [None]:
from config import *

print(f"Data Store endpoint: {NDS_URL}")
print(f"Entity Store endpoint: {ENTITY_STORE_URL}")
print(f"Customizer endpoint: {CUSTOMIZER_URL}")
print(f"Evaluator endpoint: {EVALUATOR_URL}")
print(f"Guardrails endpoint: {GUARDRAILS_URL}")
print(f"NIM endpoint: {NIM_URL}")
print(f"Namespace: {NMS_NAMESPACE}")
print(f"Base Model for Customization: {BASE_MODEL}")

print_status("NeMo Microservices endpoints configured")

### Configure Path to Prepared Data

The following code sets the paths to the prepared dataset files.

In [None]:
# Path where data preparation notebook saved finetuning and evaluation data
DATA_ROOT = os.path.join(os.getcwd(), "data")
CUSTOMIZATION_DATA_ROOT = os.path.join(DATA_ROOT, "customization")
VALIDATION_DATA_ROOT = os.path.join(DATA_ROOT, "validation")
EVALUATION_DATA_ROOT = os.path.join(DATA_ROOT, "evaluation")

# Sanity checks
train_fp = f"{CUSTOMIZATION_DATA_ROOT}/training.jsonl"
assert os.path.exists(train_fp), f"The training data at '{train_fp}' does not exist. Please ensure that the data was prepared successfully."

val_fp = f"{VALIDATION_DATA_ROOT}/validation.jsonl"
assert os.path.exists(val_fp), f"The validation data at '{val_fp}' does not exist. Please ensure that the data was prepared successfully."

test_fp = f"{EVALUATION_DATA_ROOT}/xlam-test-single.jsonl"
assert os.path.exists(test_fp), f"The test data at '{test_fp}' does not exist. Please ensure that the data was prepared successfully."

print_status("Data paths validated")

---

<a id="step-3"></a>
## Step 3: Sanity Test the Customized Model By Running Sample Inference

Once the model is customized, its adapter is automatically saved in NeMo Entity Store and is ready to be picked up by NVIDIA NIM.
You can test the model by sending a prompt to its NIM endpoint.

First, choose one of the examples from the test set.

### Resource Organization Using Namespace

You can use a [namespace](https://developer.nvidia.com/docs/nemo-microservices/manage-entities/namespaces/index.html) to isolate and organize the artifacts in this tutorial.

#### Create Namespace

Both Data Store and Entity Store use namespaces. The following code creates namespaces for the tutorial.

In [None]:
def create_namespaces(entity_host, ds_host, namespace):
    # Create namespace in Entity Store
    entity_store_url = f"{entity_host}/v1/namespaces"
    resp = requests.post(entity_store_url, json={"id": namespace})
    assert resp.status_code in (200, 201, 409, 422), \
        f"Unexpected response from Entity Store during namespace creation: {resp.status_code}"
    print(resp)

    # Create namespace in Data Store
    nds_url = f"{ds_host}/v1/datastore/namespaces"
    resp = requests.post(nds_url, data={"namespace": namespace})
    assert resp.status_code in (200, 201, 409, 422), \
        f"Unexpected response from Data Store during namespace creation: {resp.status_code}"
    print(resp)

print_status("Namespace creation function defined")

create_namespaces(entity_host=ENTITY_STORE_URL, ds_host=NDS_URL, namespace=NMS_NAMESPACE)

#### Verify Namespaces

The following [Data Store API](https://developer.nvidia.com/docs/nemo-microservices/api/datastore.html) and [Entity Store API](https://developer.nvidia.com/docs/nemo-microservices/api/entity-store.html) list the namespace created in the previous cell.

In [None]:
# Verify Namespace in Data Store
response = requests.get(f"{NDS_URL}/v1/datastore/namespaces/{NMS_NAMESPACE}")
print(f"Status Code: {response.status_code}\nResponse JSON: {response.json()}")

# Verify Namespace in Entity Store
response = requests.get(f"{ENTITY_STORE_URL}/v1/namespaces/{NMS_NAMESPACE}")
print(f"Status Code: {response.status_code}\nResponse JSON: {response.json()}")

print_status("Namespaces verified")

**Tips**:
* You may generally use `{DATASTORE_HOST}/v1/datastore/namespaces/` and `{ENTITYSTORE_HOST}/v1/namespaces/` GET APIs to list **all** available namespaces.
* Send DELETE requests to `{DATASTORE_HOST}/v1/datastore/namespaces/{namespace}` and `{ENTITYSTORE_HOST}/v1/namespaces/{namespace}` APIs to delete a namespace.

---
<a id="step-1"></a>
## Step 1: Upload Data to NeMo Data Store

The NeMo Data Store supports data management using the Hugging Face `HfApi` Client. 

**Note that this step does not interact with Hugging Face at all; it only uses the client library to interact with NeMo Data Store.** This is in contrast to Part I, where we used the `load_dataset` API to download the xLAM dataset from the Hugging Face repository.

For more information, see the [documentation](https://developer.nvidia.com/docs/nemo-microservices/manage-entities/tutorials/manage-dataset-files.html#set-up-hugging-face-client).

### 1.1 Create Repository

In [None]:
repo_id = f"{NMS_NAMESPACE}/{DATASET_NAME}"

print_status("Repository ID configured")

In [None]:
from huggingface_hub import HfApi
from huggingface_hub.utils import HfHubHTTPError

hf_api = HfApi(endpoint=f"{NDS_URL}/v1/hf", token="")

# Create repo (or use existing if it already exists)
try:
    hf_api.create_repo(
        repo_id=repo_id,
        repo_type='dataset',
        exist_ok=True  # Don't raise error if repo already exists
    )
    print(f"✅ Dataset repository '{repo_id}' created or already exists")
except HfHubHTTPError as e:
    # Handle 409 Conflict (repo already exists) as success
    if e.response is not None and e.response.status_code == 409:
        print(f"ℹ️  Dataset repository '{repo_id}' already exists (this is fine)")
    else:
        # Re-raise other HTTP errors
        print(f"❌ Error creating repository: {e}")
        raise
except Exception as e:
    print(f"❌ Unexpected error creating repository: {e}")
    raise

print_status("Dataset repository created")

Next, creating a dataset programmatically requires two steps: uploading and registration. More information can be found in [documentation](https://developer.nvidia.com/docs/nemo-microservices/manage-entities/datasets/create-dataset.html#how-to-create-a-dataset).

### 1.2 Upload Dataset Files to NeMo Data Store

In [None]:
hf_api.upload_file(path_or_fileobj=train_fp,
    path_in_repo="training/training.jsonl",
    repo_id=repo_id,
    repo_type='dataset',
)

hf_api.upload_file(path_or_fileobj=val_fp,
    path_in_repo="validation/validation.jsonl",
    repo_id=repo_id,
    repo_type='dataset',
)

hf_api.upload_file(path_or_fileobj=test_fp,
    path_in_repo="testing/xlam-test-single.jsonl",
    repo_id=repo_id,
    repo_type='dataset',
)

print_status("Dataset files uploaded")

Other tips:
* Take a look at the `path_in_repo` argument above. If a subfolder contains more than one file:
    * All the .jsonl files in `training/` will be merged and used for training by Customizer.
    * All the .jsonl files in `validation/` will be merged and used for validation by Customizer.
* NeMo Data Store generally supports data management using the [HfApi API](https://huggingface.co/docs/huggingface_hub/en/package_reference/hf_api). For example, to delete a repo, you can use:
```python
hf_api.delete_repo(
    repo_id=repo_id,
    repo_type="dataset",
)
```

### 1.3 Register the Dataset with NeMo Entity Store

To use a dataset for operations such as evaluations and customizations, register a dataset using the `/v1/datasets` endpoint.
Register the dataset to refer to it by its namespace and name afterward.

In [None]:
resp = requests.post(
    url=f"{ENTITY_STORE_URL}/v1/datasets",
    json={
        "name": DATASET_NAME,
        "namespace": NMS_NAMESPACE,
        "description": "Tool calling xLAM dataset in OpenAI ChatCompletions format",
        "files_url": f"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}",
        "project": "tool_calling",
    },
)

# Handle different response statuses
if resp.status_code in (200, 201):
    print(f"✅ Dataset '{NMS_NAMESPACE}/{DATASET_NAME}' created successfully")
    dataset_response = resp.json()
    print("Dataset Response:")
    print(json.dumps(dataset_response, indent=2))
    # Also return it so Jupyter displays it
    dataset_response
elif resp.status_code == 409:
    # Dataset already exists - this is fine, fetch it instead
    print(f"ℹ️  Dataset '{NMS_NAMESPACE}/{DATASET_NAME}' already exists (this is fine)")
    print("Fetching existing dataset...")
    get_resp = requests.get(url=f"{ENTITY_STORE_URL}/v1/datasets/{NMS_NAMESPACE}/{DATASET_NAME}")
    if get_resp.status_code == 200:
        dataset_response = get_resp.json()
        print("Existing Dataset:")
        print(json.dumps(dataset_response, indent=2))
        # Also return it so Jupyter displays it
        dataset_response
    else:
        print(f"⚠️  Warning: Could not fetch existing dataset: {get_resp.status_code}")
        print(f"Response: {get_resp.text}")
else:
    # Other error - raise exception
    print(f"❌ Error creating dataset: Status {resp.status_code}")
    print(f"Response: {resp.text}")
    raise Exception(f"Failed to create dataset: Status {resp.status_code}, Response: {resp.text}")

print_status("Dataset registered with Entity Store")

In [None]:
# Sanity check to validate dataset
res = requests.get(url=f"{ENTITY_STORE_URL}/v1/datasets/{NMS_NAMESPACE}/{DATASET_NAME}")
assert res.status_code in (200, 201), f"Status Code {res.status_code} Failed to fetch dataset {res.text}"
dataset_obj = res.json()

print("Files URL:", dataset_obj["files_url"])
assert dataset_obj["files_url"] == f"hf://datasets/{repo_id}"

print_status("Dataset validation completed")

---
<a id="step-2"></a>
## Step 2: LoRA Customization with NeMo Customizer

### 2.1 Start the Training Job


Start the training job by sending a POST request to the `/v1/customization/jobs` endpoint.
The following code sets the training parameters and sends the request.

 **The training job will take approximately 45 minutes to complete.**

In [None]:
headers = {"wandb-api-key": WANDB_API_KEY} if WANDB_API_KEY else None

training_params = {
    "name": "llama-3.2-1b-xlam-ft",
    "output_model": f"{NMS_NAMESPACE}/llama-3.2-1b-xlam-run1",
    "config": f"{BASE_MODEL}@{BASE_MODEL_VERSION}",
    "dataset": {"name": DATASET_NAME, "namespace" : NMS_NAMESPACE},
    "hyperparameters": {
        "training_type": "sft",
        "finetuning_type": "lora",
        "epochs": 1,
        "batch_size": 8,
        "learning_rate": 0.0001,
        "lora": {
            "adapter_dim": 32,
            "adapter_dropout": 0.1
        }
    }
}

# Create training job with retry logic
max_retries = 3
retry_delay = 2

for attempt in range(max_retries):
    try:
        print(f"Attempting to create training job (attempt {attempt + 1}/{max_retries})...")
        resp = requests.post(
            f"{CUSTOMIZER_URL}/v1/customization/jobs", 
            json=training_params, 
            headers=headers,
            timeout=30  # 30 second timeout
        )
        
        # Check response status
        if resp.status_code not in (200, 201):
            print(f"❌ Error creating training job: Status {resp.status_code}")
            print(f"Response: {resp.text}")
            if attempt < max_retries - 1:
                print(f"Retrying in {retry_delay} seconds...")
                time.sleep(retry_delay)
                continue
            raise Exception(f"Failed to create training job: {resp.text}")
        
        # Success!
        customization = resp.json()
        
        # Explicitly print the customization response
        print("✅ Training job created successfully!")
        print("\nCustomization Response:")
        print(json.dumps(customization, indent=2))
        print("\n" + "="*50)
        # Also return it so Jupyter displays it
        customization
        
        print_status("Training job created")
        break
        
    except (requests.exceptions.ConnectionError, requests.exceptions.Timeout) as e:
        print(f"⚠️ Connection error (attempt {attempt + 1}/{max_retries}): {e}")
        if attempt < max_retries - 1:
            print(f"Retrying in {retry_delay} seconds...")
            print("💡 Tip: Make sure port-forwards are running!")
            time.sleep(retry_delay)
        else:
            raise Exception(f"Failed to connect to Customizer after {max_retries} attempts. Check port-forwards!")
    except Exception as e:
        print(f"❌ Unexpected error: {e}")
        raise

The following code sets variables for storing the job ID and customized model name.

In [None]:
# To track status
JOB_ID = customization["id"]

# This will be the name of the model that will be used to send inference queries to
CUSTOMIZED_MODEL = customization["output_model"]

print_status("Job ID and customized model name stored")

**Tips**:
* If you configured the NeMo Customizer microservice with your own [Weights & Biases (WandB)](https://wandb.ai/) API key, you can find the training graphs and logs in your WandB account, "nvidia-nemo-customizer" project. Your run ID is similar to your customization `JOB_ID`.
  
* To cancel a job that you scheduled incorrectly, run the following code.
  
  ```python
  requests.post(f"{CUSTOMIZER_URL}/v1/customization/jobs/{JOB_ID}/cancel")
  ```

### 2.2 Get Job Status

Get the job status by sending a GET request to the `/v1/customization/jobs/{JOB_ID}/status` endpoint.
The following code sets the job ID and sends the request.

In [None]:
response = requests.get(f"{CUSTOMIZER_URL}/v1/customization/jobs/{JOB_ID}/status")

assert response.status_code == 200, (
    f"Status Code {response.status_code}: Failed to get job status. Response: {response.text}"
)
print("Response JSON:", json.dumps(response.json(), indent=4))

print_status("Job status retrieved")

**IMPORTANT:** Monitor the job status. Ensure that training has completed before proceeding by checking the `percentage_done` key in the response.
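Rather than re-running the status cell by hand, you can poll until the job finishes. The following is a minimal sketch under stated assumptions: `wait_for_job` is a hypothetical helper, the `status` key and the terminal status names are assumptions about the response payload (only `percentage_done` is confirmed above), and the demonstration uses fake responses instead of a live Customizer endpoint.

```python
import time
from typing import Any, Callable, Dict

# Assumed terminal status names; check the values your Customizer version reports
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_job(
    fetch_status: Callable[[], Dict[str, Any]],
    poll_interval: float = 30.0,
    timeout: float = 2 * 60 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll `fetch_status` until the job reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        print(f"status={status.get('status')} percentage_done={status.get('percentage_done')}")
        if status.get("status") in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(poll_interval)


# Demonstration with fake responses and no real sleeping
fake_responses = iter(
    [
        {"status": "running", "percentage_done": 40.0},
        {"status": "running", "percentage_done": 90.0},
        {"status": "completed", "percentage_done": 100.0},
    ]
)
final = wait_for_job(lambda: next(fake_responses), poll_interval=0, sleep=lambda seconds: None)
```

In the notebook, `fetch_status` could be `lambda: requests.get(f"{CUSTOMIZER_URL}/v1/customization/jobs/{JOB_ID}/status").json()`. Passing the fetch function as an argument keeps the polling logic independent of the HTTP client and easy to test.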

### 2.3 Validate Availability of Custom Model
The following NeMo Entity Store API should display the model when the training job is complete.
The list below shows all models filtered by your namespace and sorted by the latest first.
For more information about this API, see the [NeMo Entity Store API reference](https://developer.nvidia.com/docs/nemo-microservices/api/entity-store.html).
With the following code, you can find all customized models, including the one trained in the previous cells.
Look for the `name` fields in the output, which should match your `CUSTOMIZED_MODEL`.

In [None]:
response = requests.get(f"{ENTITY_STORE_URL}/v1/models", params={"filter[namespace]": NMS_NAMESPACE, "sort" : "-created_at"})

assert response.status_code == 200, f"Status Code {response.status_code}: Request failed. Response: {response.text}"
print("Response JSON:", json.dumps(response.json(), indent=4))

print_status("Models listed from Entity Store")

**Tips**:

* You can also retrieve the model directly by name:
  ```python
  # Fetch only the custom model by its name
  response = requests.get(f"{ENTITY_STORE_URL}/v1/models/{CUSTOMIZED_MODEL}")

  assert response.status_code == 200, f"Status Code {response.status_code}: Request failed. Response: {response.text}"
  print("Response JSON:", json.dumps(response.json(), indent=4))
  ```
  

NVIDIA NIM directly picks up the LoRA adapters from NeMo Entity Store. The following code confirms that the custom model is registered in Entity Store; Step 3 then checks the NIM endpoint itself.

In [None]:
# Check if the custom LoRA model is available in Entity Store
# Note: Custom LoRA models are registered in Entity Store, not directly in NIM's model list
response = requests.get(f"{ENTITY_STORE_URL}/v1/models", params={"filter[namespace]": NMS_NAMESPACE, "sort": "-created_at"})

assert response.status_code == 200, f"Status Code {response.status_code}: Request failed. Response: {response.text}"

models_data = response.json().get("data", [])
# Extract model names (can be in 'name' or 'id' field)
model_names = []
for model in models_data:
    # Try 'name' first, then 'id'
    model_name = model.get("name") or model.get("id", "")
    if model_name:
        model_names.append(model_name)

print(f"Found {len(model_names)} models in namespace '{NMS_NAMESPACE}':")
for name in model_names[:10]:  # Show first 10
    print(f"  - {name}")

# Extract just the model name part (without namespace) for comparison
# CUSTOMIZED_MODEL format: "namespace/model-name" or just "model-name"
customized_model_name = CUSTOMIZED_MODEL.split("/")[-1] if "/" in CUSTOMIZED_MODEL else CUSTOMIZED_MODEL

# Check if our custom model is in the list (compare both with and without namespace)
model_found = False
for model_name in model_names:
    # Compare both full name and just the model part (without namespace)
    model_name_only = model_name.split("/")[-1] if "/" in model_name else model_name
    if CUSTOMIZED_MODEL == model_name or customized_model_name == model_name_only:
        model_found = True
        print("\n✅ Custom model found in Entity Store!")
        print(f"   Full name: {CUSTOMIZED_MODEL}")
        print(f"   Entity Store name: {model_name}")
        break

if not model_found:
    print(f"\n⚠️ Custom model '{CUSTOMIZED_MODEL}' not found in the list.")
    print("This is normal if the training job is still running or just completed.")
    print("The model will appear in Entity Store once training completes and the model is uploaded.")
    
    # Try to get the model directly by name (it might exist but not be in the list)
    try:
        direct_response = requests.get(f"{ENTITY_STORE_URL}/v1/models/{CUSTOMIZED_MODEL}")
        if direct_response.status_code == 200:
            print(f"✅ However, the model is accessible directly at: {CUSTOMIZED_MODEL}")
            print("This means training completed and the model is available!")
            model_found = True
        else:
            print("⏳ Model not yet available. Training may still be in progress.")
            print("   Check training job status to see if it's completed.")
    except Exception as e:
        print(f"   Could not check model directly: {e}")

print_status("Custom model availability checked")

---

<a id="step-3"></a>
## Step 3: Sanity Test the Customized Model By Running Sample Inference

Once the model is customized, its adapter is automatically saved in NeMo Entity Store and is ready to be picked up by NVIDIA NIM.
You can test the model by sending a prompt to its NIM endpoint.

First, choose one of the examples from the test set.

### 3.1 Get Test Data Sample

In [None]:
def read_jsonl(file_path):
    """Reads a JSON Lines file and yields parsed JSON objects"""
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()  # Remove leading/trailing whitespace
            if not line:
                continue  # Skip empty lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON: {e}")
                continue


test_data = list(read_jsonl(test_fp))

print(f"There are {len(test_data)} examples in the test set")

print_status("Test data loaded")

In [None]:
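# Randomly choose a test example
test_sample = random.choice(test_data)

# Visualize the inputs to the LLM - user query and available tools.
# Print explicitly: only the last expression in a cell is displayed automatically.
print("Messages:", json.dumps(test_sample['messages'], indent=2))
print("Tools:", json.dumps(test_sample['tools'], indent=2))

print_status("Test sample inspected")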

### 3.2 Send an Inference Call to NIM

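NIM exposes an OpenAI-compatible chat completions API endpoint, which you can query using the `OpenAI` client library. The following code first checks whether the custom model is loaded in NIM, falls back to a similarly named model or the base model if it is not, and then sends the request.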

In [None]:
# First, check if the custom model is available in NIM
print(f"🔍 Checking if custom model is available in NIM: {CUSTOMIZED_MODEL}")
nim_models_resp = requests.get(f"{NIM_URL}/v1/models")
if nim_models_resp.status_code == 200:
    nim_models = nim_models_resp.json().get("data", [])
    nim_model_ids = [m.get("id") for m in nim_models]
    print(f"📋 Available models in NIM: {len(nim_model_ids)}")
    for model_id in nim_model_ids[:5]:  # Show first 5
        print(f"   - {model_id}")

    if CUSTOMIZED_MODEL in nim_model_ids:
        print("✅ Custom model is available in NIM!")
        model_to_use = CUSTOMIZED_MODEL
    else:
        print(f"\n⚠️  Warning: Custom model '{CUSTOMIZED_MODEL}' is not yet loaded in NIM")
        # Look for loaded models that share the custom model's base name
        # (namespace prefix and '@version' suffix removed)
        base_name = CUSTOMIZED_MODEL.split('/')[-1].split('@')[0]
        custom_models = [m for m in nim_model_ids if m and base_name in m]
        if custom_models:
            print(f"   Found similar models: {custom_models}")
            print("\n💡 Options:")
            print(f"   1. Use one of the available models: {custom_models[0]}")
            print("   2. Wait for the model to be loaded into NIM (this happens automatically)")
            print(f"   3. Use the base model for now: {BASE_MODEL}")
            print(f"\n🔄 Using available custom model: {custom_models[0]}")
            model_to_use = custom_models[0]
        else:
            print(f"\n🔄 Custom model not loaded yet. Using base model: {BASE_MODEL}")
            model_to_use = BASE_MODEL
else:
    print(f"⚠️  Could not check NIM models: {nim_models_resp.status_code}")
    print(f"   Using requested model: {CUSTOMIZED_MODEL}")
    model_to_use = CUSTOMIZED_MODEL

print("\n🚀 Creating inference client and sending request...")
inference_client = OpenAI(
    base_url=f"{NIM_URL}/v1",
    api_key="None"
)

try:
    completion = inference_client.chat.completions.create(
        model=model_to_use,
        messages=test_sample["messages"],
        tools=test_sample["tools"],
        tool_choice='auto',
        temperature=0.1,
        top_p=0.7,
        max_tokens=512,
        stream=False
    )

    print(f"✅ Inference successful using model: {model_to_use}")
    message = completion.choices[0].message
    if message.tool_calls:
        print(f"📊 Tool calls: {len(message.tool_calls)}")
        # Print explicitly: a bare expression inside a try block is not displayed
        print(message.tool_calls)
    else:
        print("📊 No tool calls in response")
        print(message)

except Exception as e:
    print(f"❌ Error during inference: {e}")
    print("\n💡 Troubleshooting:")
    print(f"   1. Check if NIM service is running: curl {NIM_URL}/health")
    print(f"   2. Verify model is loaded: curl {NIM_URL}/v1/models")
    print("   3. Check if model exists in Entity Store")
    raise

print_status("Custom model inference completed")

Given that the fine-tuning job was successful, you can get an inference result comparable to the ground truth:

In [None]:
# The ground truth answer
print(json.dumps(test_sample['tool_calls'], indent=2))

print_status("Ground truth tool calls retrieved")
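
To check the comparison programmatically rather than by eye, you could normalize both sides before comparing. The sketch below is illustrative only and is not the metric NeMo Evaluator computes. `normalize_call` and `compare_tool_calls` are hypothetical helpers, and the assumed input is a list of dicts shaped like `{"function": {"name": ..., "arguments": ...}}`, where `arguments` may be a dict or a JSON string (the OpenAI client returns a JSON string).

```python
import json
from typing import Any, Dict, List, Tuple


def normalize_call(call: Dict[str, Any]) -> Tuple[str, str]:
    """Reduce a tool call to (function name, canonical JSON of its arguments)."""
    function = call.get("function", call)
    arguments = function.get("arguments", {})
    if isinstance(arguments, str):
        # Arguments serialized as a JSON string are parsed first
        arguments = json.loads(arguments)
    return function["name"], json.dumps(arguments, sort_keys=True)


def compare_tool_calls(
    predicted: List[Dict[str, Any]], expected: List[Dict[str, Any]]
) -> Dict[str, bool]:
    """Compare two lists of tool calls, ignoring call order and argument key order."""
    pred = sorted(normalize_call(c) for c in predicted)
    exp = sorted(normalize_call(c) for c in expected)
    return {
        "names_match": [name for name, _ in pred] == [name for name, _ in exp],
        "names_and_args_match": pred == exp,
    }


# Illustrative data, not taken from the xLAM test set
predicted = [{"function": {"name": "get_weather", "arguments": '{"city": "Paris", "unit": "C"}'}}]
expected = [{"function": {"name": "get_weather", "arguments": {"unit": "C", "city": "Paris"}}}]
print(compare_tool_calls(predicted, expected))
```

With the OpenAI client, the predicted calls could be converted to dicts with `[tc.model_dump() for tc in completion.choices[0].message.tool_calls]`, assuming a client version whose response objects are Pydantic models.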

### 3.3 Take Note of Your Custom Model Name

Take note of your custom model name, as you will use it to run evaluations in the subsequent notebook.

In [None]:
print(f"Name of your custom model is: {CUSTOMIZED_MODEL}")

print_status("Custom model name noted")

# Part III: Model Evaluation Using NeMo Evaluator

In [None]:
import os
import json
import requests
from time import sleep, time

from openai import OpenAI

print_status("Part III imports completed")

---
<a id="step-1"></a>
## Step 1: Establish Baseline Accuracy Benchmark

First, we'll assess the accuracy of the 'off-the-shelf' base model: pristine, untouched, and blissfully unaware of the transformative magic that is fine-tuning.

### 1.1: Create an Evaluation Config Object
Create an evaluation configuration object for NeMo Evaluator. For more information on various parameters, refer to the [NeMo Evaluator configuration](https://developer.nvidia.com/docs/nemo-microservices/evaluate/evaluation-configs.html) in the NeMo microservices documentation.


* `tasks.custom-tool-calling.dataset.files_url` indicates which test file to use. You must upload this file to NeMo Data Store and register it with Entity Store before using it.
* `tasks.custom-tool-calling.dataset.limit` specifies how many test examples to run the evaluation on.
* The evaluation metric `tasks.custom-tool-calling.metrics.tool-calling-accuracy` reports `function_name_accuracy`, which checks that the predicted function names match the ground truth, and `function_name_and_args_accuracy`, which additionally checks the arguments.

In [None]:
simple_tool_calling_eval_config = {
    "type": "custom",
    "tasks": {
        "custom-tool-calling": {
            "type": "chat-completion",
            "dataset": {
                "files_url": f"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}/testing/xlam-test-single.jsonl",
                "limit": 50
            },
            "params": {
                "template": {
                    "messages": "{{ item.messages | tojson}}",
                    "tools": "{{ item.tools | tojson }}",
                    "tool_choice": "auto"
                }
            },
            "metrics": {
                "tool-calling-accuracy": {
                    "type": "tool-calling",
                    "params": {"tool_calls_ground_truth": "{{ item.tool_calls | tojson }}"}
                }
            }
        }
    }
}

print_status("Evaluation configuration created")
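
The templates in this configuration reference fields of each dataset row through `item.<field>`, so a row that lacks one of those fields is likely to break the evaluation. As a rough pre-flight check, you could collect the referenced field names and compare them against your test rows. This is a sketch under stated assumptions: `template_fields` and `missing_fields` are hypothetical helpers, the regular expression only recognizes the simple `{{ item.name ... }}` pattern used above, and the config and rows below are trimmed stand-ins.

```python
import re
from typing import Any, Dict, List, Set

# Matches the field name in templates such as "{{ item.messages | tojson }}"
ITEM_FIELD = re.compile(r"\{\{\s*item\.(\w+)")


def template_fields(node: Any) -> Set[str]:
    """Recursively collect every `item.<field>` name used in template strings."""
    if isinstance(node, str):
        return set(ITEM_FIELD.findall(node))
    if isinstance(node, dict):
        return set().union(*(template_fields(v) for v in node.values()))
    if isinstance(node, list):
        return set().union(*(template_fields(v) for v in node))
    return set()


def missing_fields(config: Dict[str, Any], rows: List[Dict[str, Any]]) -> Dict[int, List[str]]:
    """Map row index to the template fields that row lacks (complete rows are omitted)."""
    required = template_fields(config)
    report = {}
    for index, row in enumerate(rows):
        missing = sorted(required - set(row))
        if missing:
            report[index] = missing
    return report


# Trimmed stand-in for the task configuration above
config = {
    "params": {"template": {"messages": "{{ item.messages | tojson}}", "tools": "{{ item.tools | tojson }}"}},
    "metrics": {"tool-calling-accuracy": {"params": {"tool_calls_ground_truth": "{{ item.tool_calls | tojson }}"}}},
}
# Illustrative rows; the second one lacks the ground truth
rows = [
    {"messages": [], "tools": [], "tool_calls": []},
    {"messages": [], "tools": []},
]
print(missing_fields(config, rows))
```

In this notebook, the same check could be run on `simple_tool_calling_eval_config` and the rows loaded from the test file.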

In [None]:
# Delete evaluation target (if it exists)
res = requests.delete(f"{EVALUATOR_URL}/v1/evaluation/targets/{NMS_NAMESPACE}/llama-3-1b-instruct")
# Ignore 404 errors (target might not exist)
if res.status_code not in (200, 404):
    print(f"⚠️ Warning: Could not delete existing target: {res.status_code}")

## Create evaluation target
# IMPORTANT: Use cluster-internal URL (NIM_URL_CLUSTER) for evaluation targets
# because evaluation jobs run inside the cluster and can't access localhost port-forwards
# Reload config module to get the latest NIM_URL_CLUSTER value
import importlib
import config
importlib.reload(config)
from config import NIM_URL_CLUSTER

headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json'
}
data = {
    "type": "model",
    "name": "llama-3-1b-instruct",
    "namespace": NMS_NAMESPACE,  # Use the correct namespace
    "model": {
        "api_endpoint": {
            "url": f"{NIM_URL_CLUSTER}/v1/chat/completions",  # Use cluster URL, not localhost. Note: /v1/chat/completions for chat models
            "model_id": f"{BASE_MODEL}"
        }
    }
}
print(f"ℹ️  Creating evaluation target with cluster URL: {NIM_URL_CLUSTER}/v1/chat/completions")
print("   (Evaluation jobs run inside cluster and need cluster service URL, not localhost)")
print("   (Using /v1/chat/completions endpoint for chat models)")
res = requests.post(f"{EVALUATOR_URL}/v1/evaluation/targets", headers=headers, json=data)
# Fail early so a rejected request is not reported as a created target
if res.status_code not in (200, 201):
    raise RuntimeError(f"Failed to create evaluation target: Status {res.status_code}, Response: {res.text}")
target_response = res.json()
print("Evaluation Target Created:")
print(json.dumps(target_response, indent=2))

print_status("Evaluation target created")

### 1.2: Launch Evaluation Job 

The following code sends a POST request to the NeMo Evaluator API to launch an evaluation job. It uses the evaluation configuration defined in the previous cell and targets the base model.


In [None]:
res = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/jobs",
    json={
        "config": simple_tool_calling_eval_config,
        "target": f"{NMS_NAMESPACE}/llama-3-1b-instruct"  # Use the correct namespace
    }
)

if res.status_code not in (200, 201):
    print(f"❌ Error creating evaluation job: Status {res.status_code}")
    print(f"Response: {res.text}")
    raise Exception(f"Failed to create evaluation job: {res.text}")

job_response = res.json()
base_eval_job_id = job_response["id"]

print("Evaluation Job Created:")
print(json.dumps(job_response, indent=2))
print(f"\nJob ID: {base_eval_job_id}")

print_status("Base model evaluation job created")

In [None]:
# Get Job status
res = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{base_eval_job_id}/status")

# Check response status
if res.status_code != 200:
    print(f"❌ Error: Status {res.status_code}")
    print(f"Response: {res.text}")
    raise Exception(f"Failed to get evaluation job status: {res.text}")

# Format and print the response
job_status = res.json()
print("Evaluation Job Status:")
print(json.dumps(job_status, indent=2))

print_status("Evaluation job status retrieved")

The following code defines a helper function to poll on job status until it finishes:

In [None]:
def wait_eval_job(job_url: str, polling_interval: int = 10, timeout: int = 6000):
    """Poll an evaluation job until it reaches a terminal state or the timeout expires."""
    start_time = time()
    
    # Initial status check
    print(f"🔍 Checking evaluation job status at: {job_url}")
    res = requests.get(job_url)
    
    if res.status_code != 200:
        print(f"❌ Error: Failed to get job status. HTTP {res.status_code}")
        print(f"Response: {res.text}")
        raise Exception(f"Failed to get evaluation job status: HTTP {res.status_code}, Response: {res.text}")
    
    try:
        job_data = res.json()
        status = job_data.get("status", "unknown")
        print(f"📊 Initial job status: {status}")
        
        # Print full job data for debugging
        print("\nInitial Job Data:")
        print(json.dumps(job_data, indent=2))
    except (KeyError, ValueError) as e:
        print(f"❌ Error parsing job response: {e}")
        print(f"Response text: {res.text}")
        raise Exception(f"Failed to parse job status from response: {e}")

    # Check if job is already in a terminal state (failed, completed, etc.)
    if status == "failed":
        print("\n❌ Job is already in 'failed' state!")
        print("\nFull job data for debugging:")
        print(json.dumps(job_data, indent=2))
        
        # Extract error details if available
        error_details = job_data.get("status_details", {})
        if error_details:
            print("\nError Details:")
            print(json.dumps(error_details, indent=2))
        
        print("\n💡 To investigate the failure, check:")
        print("   1. Evaluator service logs:")
        print("      oc logs -n arhkp-nemo-helm -l app=nemoevaluator --tail=200")
        print("   2. Evaluation job pods (if any):")
        print("      oc get pods -n arhkp-nemo-helm | grep -E 'eval|evaluation'")
        print("   3. Evaluator pod status:")
        print("      oc get pods -n arhkp-nemo-helm | grep evaluator")
        print("   4. Check if the evaluation target/model is accessible:")
        print("      oc get nemoevaluator -n arhkp-nemo-helm -o yaml")
        
        raise Exception(f"Evaluation job failed. Status: {status}. Check cluster logs for details.")
    elif status in ["completed", "success", "finished"]:
        print(f"\n✅ Job is already completed with status: {status}")
        print(f"Total time: {time() - start_time:.2f}s")
        return res

    # Track status changes
    last_status = status
    status_changes = []
    
    while (status in ["pending", "created", "running", "unknown"]):
        # Check for timeout
        elapsed = time() - start_time
        if elapsed > timeout:
            print(f"\n⏱️  Timeout: Job took more than {timeout} seconds.")
            print(f"Final status: {status}")
            print(f"Status history: {status_changes}")
            raise RuntimeError(f"Evaluation job timeout after {timeout} seconds. Final status: {status}")

        # Sleep before polling again
        sleep(polling_interval)

        # Fetch updated status and progress
        res = requests.get(job_url)
        
        if res.status_code != 200:
            print(f"⚠️  Warning: HTTP {res.status_code} when polling job status")
            print(f"Response: {res.text}")
            # Continue polling - might be a temporary issue
            continue
        
        try:
            job_data = res.json()
            status = job_data.get("status", "unknown")
            
            # Track status changes
            if status != last_status:
                status_changes.append((elapsed, last_status, status))
                print(f"\n🔄 Status changed: {last_status} → {status} (after {elapsed:.2f}s)")
                last_status = status
            
            # Progress details
            progress = 0
            if status == "running":
                progress = job_data.get("status_details", {}).get("progress", 0)
                print(f"⏳ Job status: {status} | Progress: {progress}% | Elapsed: {elapsed:.2f}s")
            elif status == "completed":
                progress = 100
                print(f"✅ Job status: {status} | Progress: {progress}% | Elapsed: {elapsed:.2f}s")
            elif status in ["pending", "created"]:
                print(f"⏳ Job status: {status} | Elapsed: {elapsed:.2f}s")
            elif status == "failed":
                print(f"❌ Job status: {status} | Elapsed: {elapsed:.2f}s")
                print("\nFull job data:")
                print(json.dumps(job_data, indent=2))
                raise Exception("Evaluation job failed. Check cluster logs for details.")
            else:
                print(f"⚠️  Job status: {status} (unexpected) | Elapsed: {elapsed:.2f}s")
                
        except (KeyError, ValueError) as e:
            print(f"⚠️  Warning: Error parsing job response: {e}")
            print(f"Response text: {res.text}")
            # Continue polling - might be a temporary issue
            continue

    # The loop also exits on statuses other than "completed", so report the status as-is
    print(f"\n✅ Polling finished with status: {status} (total time: {time() - start_time:.2f}s)")
    return res

print_status("Evaluation job wait function defined")

Run the helper function:

In [None]:
# Poll for evaluation job completion
print(f"🚀 Starting to poll evaluation job: {base_eval_job_id}")
print(f"Job URL: {EVALUATOR_URL}/v1/evaluation/jobs/{base_eval_job_id}")
print(f"Polling interval: 5 seconds, Timeout: 600 seconds (10 minutes)\n")

try:
    res = wait_eval_job(f"{EVALUATOR_URL}/v1/evaluation/jobs/{base_eval_job_id}", polling_interval=5, timeout=600)
except Exception as e:
    # If wait_eval_job raised an exception (e.g., job failed), try to get final status for debugging
    print(f"\n⚠️  Exception during polling: {e}")
    print("\nAttempting to fetch final job status for debugging...")
    try:
        final_res = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{base_eval_job_id}")
        if final_res.status_code == 200:
            final_job_data = final_res.json()
            print("\n" + "="*60)
            print("📋 Final Job Status (for debugging):")
            print("="*60)
            print(json.dumps(final_job_data, indent=2))
            print("="*60)
            
            # Extract and display error information
            status_details = final_job_data.get("status_details", {})
            if status_details:
                print("\n🔍 Status Details:")
                print(json.dumps(status_details, indent=2))
            
            # Check for error messages
            error_msg = status_details.get("error") or status_details.get("message") or final_job_data.get("error")
            if error_msg:
                print(f"\n❌ Error Message: {error_msg}")
        else:
            print(f"Could not fetch final status: HTTP {final_res.status_code}")
    except Exception as fetch_error:
        print(f"Could not fetch final status: {fetch_error}")
    
    # Re-raise the original exception
    raise

# Check response status
if res.status_code != 200:
    print(f"\n❌ Error: Evaluation job status check failed with status {res.status_code}")
    print(f"Response: {res.text}")
    print("\n💡 To check cluster logs, run:")
    print(f"   oc logs -n arhkp-nemo-helm -l app=nemoevaluator --tail=100")
    print(f"   oc get pods -n arhkp-nemo-helm | grep evaluator")
    raise Exception(f"Failed to get evaluation job status: {res.text}")

# Format and print the response
job_status = res.json()
final_status = job_status.get("status", "unknown")

print("\n" + "="*60)
print("📋 Final Evaluation Job Status:")
print("="*60)
print(json.dumps(job_status, indent=2))
print("="*60)

# Extract and display error information if job failed
if final_status == "failed":
    print("\n" + "="*60)
    print("❌ JOB FAILED - Error Analysis:")
    print("="*60)
    
    status_details = job_status.get("status_details", {})
    if status_details:
        print("\nStatus Details:")
        print(json.dumps(status_details, indent=2))
        
        # Look for common error fields
        error_fields = ["error", "message", "reason", "failure_reason", "error_message"]
        for field in error_fields:
            if field in status_details:
                print(f"\n🔴 {field.upper()}: {status_details[field]}")
    
    # Check for error in top level
    if "error" in job_status:
        print(f"\n🔴 Top-level error: {job_status['error']}")
    
    print("\n💡 Common causes of evaluation job failures:")
    print("   1. Evaluation target (model) is not accessible or not found")
    print("   2. Evaluation dataset is not accessible or invalid")
    print("   3. Insufficient resources (GPU, memory, etc.)")
    print("   4. Network connectivity issues between services")
    print("   5. Configuration errors in evaluation config")
    
    print("\n💡 To investigate:")
    print("   1. Check evaluator service logs:")
    print("      oc logs -n arhkp-nemo-helm -l app=nemoevaluator --tail=200")
    print("   2. Check evaluation job pods:")
    print("      oc get pods -n arhkp-nemo-helm | grep -E 'eval|evaluation'")
    print("   3. Verify evaluation target exists:")
    print(f"      requests.get(f'{EVALUATOR_URL}/v1/evaluation/targets/{NMS_NAMESPACE}/llama-3-1b-instruct').json()")
    print("   4. Check evaluator pod status:")
    print("      oc get pods -n arhkp-nemo-helm | grep evaluator")
    print("   5. Check evaluator custom resource:")
    print("      oc get nemoevaluator -n arhkp-nemo-helm -o yaml")

# Check if job actually completed successfully
if final_status not in ["completed", "success", "finished"]:
    if final_status != "failed":  # Already handled above
        print(f"\n⚠️  Warning: Job status is '{final_status}', not 'completed'")
        print("This might indicate the job is still running or in an unexpected state.")
        print("\n💡 To check cluster logs and pod status:")
        print(f"   # Check evaluator pods")
        print(f"   oc get pods -n arhkp-nemo-helm | grep evaluator")
        print(f"   ")
        print(f"   # Check evaluator logs")
        print(f"   oc logs -n arhkp-nemo-helm -l app=nemoevaluator --tail=100")
        print(f"   ")
        print(f"   # Check for evaluation job pods (if any)")
        print(f"   oc get pods -n arhkp-nemo-helm | grep -E 'eval|evaluation'")
        print(f"   ")
        print(f"   # Check job status directly")
        print(f"   oc get nemoevaluator -n arhkp-nemo-helm")
else:
    print(f"\n✅ Job completed successfully with status: {final_status}")

print_status("Base model evaluation job completed")

### 1.3 Review Evaluation Metrics

The following code sends a GET request to retrieve the evaluation results for the base evaluation job. 

In [None]:
# First, check the job status to ensure it's completed
print("Checking evaluation job status...")
status_res = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{base_eval_job_id}/status")

if status_res.status_code != 200:
    print(f"⚠️ Warning: Could not check job status (Status {status_res.status_code})")
    print(f"Response: {status_res.text}")
    print("\nAttempting to retrieve results anyway (the job might still be accessible)...")
else:
    status_data = status_res.json()
    
    # Print full status for debugging
    print("\nFull status response:")
    print(json.dumps(status_data, indent=2))
    
    # The /status endpoint returns a different structure than the full job endpoint
    # It has "message" and "task_status" instead of a top-level "status" field
    job_status = status_data.get("status")
    
    # If no "status" field, infer from message and task_status
    if job_status is None:
        message = status_data.get("message", "").lower()
        task_status = status_data.get("task_status", {})
        progress = status_data.get("progress", 0)
        
        # Infer status from message and task status
        if "completed successfully" in message or "success" in message:
            # Check if all tasks are completed
            if task_status:
                all_tasks_completed = all(
                    status.lower() in ["completed", "success", "finished"] 
                    for status in task_status.values()
                )
                if all_tasks_completed:
                    job_status = "completed"  # Progress might not be exactly 100
                else:
                    job_status = "running"  # Some tasks still in progress
            elif progress >= 100:
                job_status = "completed"
            else:
                job_status = "running"
        elif "failed" in message or "error" in message:
            job_status = "failed"
        elif "running" in message or progress > 0:
            job_status = "running"
        else:
            job_status = "unknown"
        
        print(f"\n📊 Inferred job status: {job_status}")
        print(f"   Message: {status_data.get('message', 'N/A')}")
        print(f"   Progress: {progress}%")
        if task_status:
            print(f"   Task status: {task_status}")
    else:
        print(f"Job status: {job_status}")
    
    # Valid completion statuses
    completed_statuses = ["completed", "success", "finished", "done"]
    
    if job_status in completed_statuses:
        print(f"✅ Job is completed (status: {job_status})")
    elif job_status == "unknown":
        print("⚠️ Warning: Job status is 'unknown'")
        print("This could mean:")
        print("  - The job doesn't exist or was deleted")
        print("  - The status endpoint returned an unexpected format")
        print("  - The job is in an intermediate state")
        print("\nAttempting to retrieve results anyway...")
    elif job_status == "failed":
        print(f"❌ Job has failed (status: {job_status})")
        print("Check the status details above for error information.")
        print("\nAttempting to retrieve results anyway (may contain error details)...")
    else:
        print(f"⚠️ Warning: Evaluation job is not completed yet. Status: {job_status}")
        print("Valid completion statuses:", completed_statuses)
        print("\nYou can check the status again with:")
        print(f'  requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{base_eval_job_id}/status").json()')
        print("\nAttempting to retrieve results anyway (the job might have results even if status is not 'completed')...")

# Now retrieve the results (try even if status check failed or status is unknown)
print("\nRetrieving evaluation results...")
res = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{base_eval_job_id}/results")

# Check response status
if res.status_code != 200:
    print(f"❌ Error retrieving results: Status {res.status_code}")
    print(f"Response: {res.text}")
    raise Exception(f"Failed to retrieve evaluation results: {res.text}")

# Explicitly print the results
results = res.json()
print("\nEvaluation Results:")
print(json.dumps(results, indent=2))

# Check if tasks are empty
if not results.get("tasks"):
    print("\n⚠️ WARNING: Evaluation results have no tasks!")
    print("This could mean:")
    print("  1. The evaluation job completed but produced no results")
    print("  2. There was an error during evaluation")
    print("  3. The evaluation configuration was incorrect")
    print("\nPlease check the evaluation job logs or status for more details.")
    print(f"Job ID: {base_eval_job_id}")

print_status("Base model evaluation results retrieved")
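
The status inference in the cell above can be easier to reuse and test as a standalone function. The following sketch mirrors that logic under the same assumptions about the `/status` payload (`status`, `message`, `task_status`, and `progress` keys). `infer_job_status` is a hypothetical helper, and it additionally tolerates keys that are present but set to `None`.

```python
from typing import Any, Dict

COMPLETED_STATES = {"completed", "success", "finished"}


def infer_job_status(status_data: Dict[str, Any]) -> str:
    """Return the explicit status if present, otherwise infer one from the other fields."""
    explicit = status_data.get("status")
    if explicit is not None:
        return explicit

    # "or" guards against keys that exist but hold None
    message = (status_data.get("message") or "").lower()
    task_status = status_data.get("task_status") or {}
    progress = status_data.get("progress") or 0

    if "success" in message:
        if task_status:
            all_done = all(str(s).lower() in COMPLETED_STATES for s in task_status.values())
            return "completed" if all_done else "running"
        return "completed" if progress >= 100 else "running"
    if "failed" in message or "error" in message:
        return "failed"
    if "running" in message or progress > 0:
        return "running"
    return "unknown"


# Illustrative payloads, not captured from a live Evaluator service
print(infer_job_status({"status": "running"}))
print(infer_job_status({
    "message": "Job completed successfully.",
    "task_status": {"custom-tool-calling": "completed"},
    "progress": 100,
}))
print(infer_job_status({"message": None, "progress": 40}))
```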

The following code extracts and prints the accuracy scores for the base model.

In [None]:
# Extract function name accuracy score
# Handle different possible task names and structures
# Note: 'res' is set by the previous cell, which retrieves the evaluation results
if 'res' not in globals():
    raise NameError("Variable 'res' not found. Please run the previous cell to retrieve evaluation results first.")

result_data = res.json()
tasks = result_data.get("tasks", {})

# Check if tasks are empty
if not tasks:
    print("❌ Error: Evaluation results have no tasks!")
    print("This means the evaluation job completed but produced no results.")
    print("\nPossible causes:")
    print("  1. The evaluation dataset was empty or invalid")
    print("  2. The evaluation job failed silently")
    print("  3. The evaluation configuration was incorrect")
    print("\nPlease check:")
    print(f"  - Evaluation job status: requests.get(f'{EVALUATOR_URL}/v1/evaluation/jobs/{base_eval_job_id}/status').json()")
    print(f"  - Evaluation job logs in the cluster")
    print(f"  - The evaluation dataset and configuration")
    print("\nFull results structure:")
    print(json.dumps(result_data, indent=2))
    raise ValueError("Evaluation results have no tasks. Please check the evaluation job status and logs.")

# Find the task (could be 'custom-tool-calling' or another name)
task_name = None
if "custom-tool-calling" in tasks:
    task_name = "custom-tool-calling"
elif len(tasks) > 0:
    # Use the first task if 'custom-tool-calling' is not found
    task_name = list(tasks.keys())[0]
    print(f"⚠️ Note: Using task '{task_name}' instead of 'custom-tool-calling'")

if not task_name or task_name not in tasks:
    print("❌ Error: Could not find evaluation task in results")
    print(f"Available tasks: {list(tasks.keys())}")
    print(f"\nFull results structure:")
    print(json.dumps(result_data, indent=2))
    raise KeyError(f"Task 'custom-tool-calling' not found. Available tasks: {list(tasks.keys())}")

# Extract metrics
task_data = tasks[task_name]
metrics = task_data.get("metrics", {})
tool_calling_metrics = metrics.get("tool-calling-accuracy", {})
scores = tool_calling_metrics.get("scores", {})

base_function_name_accuracy_score = scores.get("function_name_accuracy", {}).get("value")
base_function_name_and_args_accuracy = scores.get("function_name_and_args_accuracy", {}).get("value")

if base_function_name_accuracy_score is None or base_function_name_and_args_accuracy is None:
    print("⚠️ Warning: Some accuracy scores are missing")
    print(f"Available scores: {list(scores.keys())}")
    print(f"\nFull metrics structure:")
    print(json.dumps(metrics, indent=2))
    print(f"\nFull task data:")
    print(json.dumps(task_data, indent=2))

print(f"Base model: function_name_accuracy: {base_function_name_accuracy_score}")
print(f"Base model: function_name_and_args_accuracy: {base_function_name_and_args_accuracy}")

print_status("Base model accuracy scores extracted")

Without any fine-tuning, the `meta/llama-3.2-1b-instruct` model should score roughly 12% on `function_name_accuracy` and 8% on `function_name_and_args_accuracy`.
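
The lookup in the cell above walks a nested dictionary: `tasks`, then the task name, `metrics`, `tool-calling-accuracy`, `scores`, the score name, and finally `value`. As a minimal sketch, assuming that same nesting, a small helper can fetch either score without repeating the chain of `.get()` calls. The helper name `get_score` and the sample payload are hypothetical and only illustrate the shape; they are not real evaluator output.

```python
from typing import Any, Dict, Optional


def get_score(
    result_data: Dict[str, Any],
    task_name: str,
    score_name: str,
    metric_name: str = "tool-calling-accuracy",
) -> Optional[float]:
    """Return the nested score value, or None if any level is missing."""
    node: Any = result_data
    path = ("tasks", task_name, "metrics", metric_name, "scores", score_name, "value")
    for key in path:
        # Stop early instead of raising when a level is absent
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Hypothetical payload mirroring the structure read by the cell above
sample = {"tasks": {"custom-tool-calling": {"metrics": {"tool-calling-accuracy": {"scores": {
    "function_name_accuracy": {"value": 0.12},
    "function_name_and_args_accuracy": {"value": 0.08},
}}}}}}

print(get_score(sample, "custom-tool-calling", "function_name_accuracy"))
print(get_score(sample, "custom-tool-calling", "missing_score"))
```

Returning `None` for a missing level keeps the behavior of the original `.get()` chain, so the existing "scores are missing" warning logic still applies.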

### (Optional) 1.4 Download and Inspect Results

To take a deeper look into the model's generated outputs, you can download and review the results.

In [None]:
def download_evaluation_results(eval_url, eval_job_id, output_file):
    """Downloads evaluation results for a given job ID from the NeMo server."""
    
    download_response = requests.get(f"{eval_url}/v1/evaluation/jobs/{eval_job_id}/download-results")
    
    # Check the response status
    if download_response.status_code == 200:
        # Save the results to a file
        with open(output_file, "wb") as file:
            file.write(download_response.content)
        print(f"Evaluation results for job {eval_job_id} downloaded successfully to {output_file}.")
        return True
    else:
        print(f"Failed to download evaluation results. Status code: {download_response.status_code}")
        print('Response:', download_response.text)
        return False

print_status("Evaluation results download function defined")

In [None]:
output_file = f"{base_eval_job_id}.json"

# The assertion fails if the download fails
assert download_evaluation_results(eval_url=EVALUATOR_URL, eval_job_id=base_eval_job_id, output_file=output_file), "Download failed"

print_status("Base model evaluation results downloaded")

You can inspect the downloaded results file to see where the base model makes mistakes. Without any fine-tuning, some models not only return inaccurate function names and arguments but may also fail to adhere to a consistent, predictable output schema. This makes the outputs difficult to parse automatically, which hinders integration with external systems.
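
A first look at the downloaded file might go as follows. This is a sketch under assumptions: the notebook saves the download with a `.json` extension, but the payload format is not stated here, so the hypothetical helper `describe_results_file` checks whether the file is a ZIP archive before trying to parse it as JSON.

```python
import json
import zipfile
from typing import List


def describe_results_file(path: str) -> List[str]:
    """Return a short listing of what the downloaded results file contains."""
    if zipfile.is_zipfile(path):
        # Archive: list member names so you can pick one to extract and read
        with zipfile.ZipFile(path) as archive:
            return [f"zip member: {name}" for name in archive.namelist()]

    # Otherwise assume plain JSON text
    with open(path, "r", encoding="utf-8") as handle:
        data = json.load(handle)

    if isinstance(data, dict):
        return [f"json key: {key}" for key in data]
    if isinstance(data, list):
        return [f"json list with {len(data)} items"]
    return [f"json value of type {type(data).__name__}"]
```

In the notebook you could call `describe_results_file(output_file)` after the download cell and print each returned line.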

---
<a id="step-2"></a>
## Step 2: Evaluate the LoRA Customized Model

### 2.1 Launch Evaluation Job

Run another evaluation job with the same evaluation config but with the customized model.

In [None]:
# IMPORTANT: Wait for the custom model to be available in NIM before creating evaluation target
# NIM synchronizes custom models from Entity Store every 3 minutes, so we need to wait
print(f"🔍 Checking if custom model is available in NIM: {CUSTOMIZED_MODEL}")
print("   (NIM synchronizes custom models every 3 minutes, so this may take a few minutes)")

from time import sleep, time
max_wait_time = 600  # 10 minutes max wait
poll_interval = 10  # Check every 10 seconds
start_time = time()
model_available = False
model_to_use = CUSTOMIZED_MODEL

while (time() - start_time) < max_wait_time:
    nim_models_resp = requests.get(f"{NIM_URL}/v1/models")
    if nim_models_resp.status_code == 200:
        nim_models = nim_models_resp.json().get("data", [])
        nim_model_ids = [m.get("id") for m in nim_models]
        
        if CUSTOMIZED_MODEL in nim_model_ids:
            print(f"✅ Custom model '{CUSTOMIZED_MODEL}' is now available in NIM!")
            model_available = True
            model_to_use = CUSTOMIZED_MODEL
            break
        else:
            # Check if a similar model is available (same base name, different version)
            customized_model_name = CUSTOMIZED_MODEL.split('/')[-1].split('@')[0]
            similar_models = [m for m in nim_model_ids if customized_model_name in m]
            if similar_models:
                print(f"⚠️  Custom model '{CUSTOMIZED_MODEL}' not yet in NIM, but found similar models:")
                for m in similar_models:
                    print(f"   - {m}")
                print(f"   Using the first similar model: {similar_models[0]}")
                model_to_use = similar_models[0]
                model_available = True
                break
    
    elapsed = time() - start_time
    print(f"⏳ Waiting for model to sync... ({elapsed:.0f}s elapsed, checking every {poll_interval}s)")
    sleep(poll_interval)

if not model_available:
    print(f"\n⚠️  Warning: Custom model '{CUSTOMIZED_MODEL}' is still not available in NIM after {max_wait_time}s")
    print("   This could mean:")
    print("   1. The model hasn't been synchronized yet (NIM syncs every 3 minutes)")
    print("   2. There's an issue with model synchronization")
    print("   3. The model ID is incorrect")
    print("\n   Attempting to use the model anyway (it might work if sync happens during evaluation)...")
    model_to_use = CUSTOMIZED_MODEL

# Delete evaluation target (if it exists)
res = requests.delete(f"{EVALUATOR_URL}/v1/evaluation/targets/{NMS_NAMESPACE}/llama-3-1b-instruct-customized")
# Ignore 404 errors (target might not exist)
if res.status_code not in (200, 404):
    print(f"⚠️ Warning: Could not delete existing target: {res.status_code}")

## Create evaluation target
# IMPORTANT: Use cluster-internal URL (NIM_URL_CLUSTER) for evaluation targets
# because evaluation jobs run inside the cluster and can't access localhost port-forwards
# Reload config module to get the latest NIM_URL_CLUSTER value
import importlib
import config
importlib.reload(config)
from config import NIM_URL_CLUSTER

headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json'
}
data = {
    "type": "model",
    "name": "llama-3-1b-instruct-customized",
    "namespace": NMS_NAMESPACE,  # Use the correct namespace
    "model": {
        "api_endpoint": {
            "url": f"{NIM_URL_CLUSTER}/v1/chat/completions",  # Use cluster URL, not localhost. Note: /v1/chat/completions for chat models
            "model_id": f"{model_to_use}"  # Use the model that's actually available in NIM
        }
    }
}
print(f"\nℹ️  Creating evaluation target with cluster URL: {NIM_URL_CLUSTER}/v1/chat/completions")
print(f"   Model ID: {model_to_use}")
print(f"   (Evaluation jobs run inside cluster and need cluster service URL, not localhost)")
print(f"   (Using /v1/chat/completions endpoint for chat models)")
res = requests.post(f"{EVALUATOR_URL}/v1/evaluation/targets", headers=headers, json=data)

if res.status_code not in (200, 201):
    print(f"❌ Error creating evaluation target: Status {res.status_code}")
    print(f"Response: {res.text}")
    raise Exception(f"Failed to create evaluation target: Status {res.status_code}, Response: {res.text}")

target_response = res.json()
print("\n✅ Evaluation Target Created:")
print(json.dumps(target_response, indent=2))
# Also return it so Jupyter displays it
target_response

print_status("Custom model evaluation target created")

In [None]:
res = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/jobs",
    json={
        "config": simple_tool_calling_eval_config,
        "target": f"{NMS_NAMESPACE}/llama-3-1b-instruct-customized"  # Use the correct namespace
    },
)

if res.status_code not in (200, 201):
    print(f"❌ Error creating evaluation job: Status {res.status_code}")
    print(f"Response: {res.text}")
    raise Exception(f"Failed to create evaluation job: {res.text}")

job_response = res.json()
ft_eval_job_id = job_response["id"]

print("Evaluation Job Created:")
print(json.dumps(job_response, indent=2))
print(f"\nJob ID: {ft_eval_job_id}")

# Also return it so Jupyter displays it
ft_eval_job_id

print_status("Custom model evaluation job created")

In [None]:
# Poll for evaluation job completion
print(f"🚀 Starting to poll evaluation job: {ft_eval_job_id}")
print(f"Job URL: {EVALUATOR_URL}/v1/evaluation/jobs/{ft_eval_job_id}")
print(f"Polling interval: 5 seconds, Timeout: 600 seconds (10 minutes)\n")

try:
    res = wait_eval_job(f"{EVALUATOR_URL}/v1/evaluation/jobs/{ft_eval_job_id}", polling_interval=5, timeout=600)
except Exception as e:
    # If wait_eval_job raised an exception (e.g., job failed), try to get final status for debugging
    print(f"\n⚠️  Exception during polling: {e}")
    print("\nAttempting to fetch final job status for debugging...")
    try:
        final_res = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{ft_eval_job_id}")
        if final_res.status_code == 200:
            final_job_data = final_res.json()
            print("\n" + "="*60)
            print("📋 Final Job Status (for debugging):")
            print("="*60)
            print(json.dumps(final_job_data, indent=2))
            print("="*60)
            
            # Extract and display error information
            status_details = final_job_data.get("status_details", {})
            if status_details:
                print("\n🔍 Status Details:")
                print(json.dumps(status_details, indent=2))
            
            # Check for error messages
            error_msg = status_details.get("error") or status_details.get("message") or final_job_data.get("error")
            if error_msg:
                print(f"\n❌ Error Message: {error_msg}")
        else:
            print(f"Could not fetch final status: HTTP {final_res.status_code}")
    except Exception as fetch_error:
        print(f"Could not fetch final status: {fetch_error}")
    
    # Re-raise the original exception
    raise

# Check response status
if res.status_code != 200:
    print(f"\n❌ Error: Evaluation job status check failed with status {res.status_code}")
    print(f"Response: {res.text}")
    print("\n💡 To check cluster logs, run:")
    print(f"   oc logs -n arhkp-nemo-helm -l app=nemoevaluator --tail=100")
    print(f"   oc get pods -n arhkp-nemo-helm | grep evaluator")
    raise Exception(f"Failed to get evaluation job status: {res.text}")

# Format and print the response
job_status = res.json()
final_status = job_status.get("status", "unknown")

print("\n" + "="*60)
print("📋 Final Evaluation Job Status:")
print("="*60)
print(json.dumps(job_status, indent=2))
print("="*60)

# Extract and display error information if job failed
if final_status == "failed":
    print("\n" + "="*60)
    print("❌ JOB FAILED - Error Analysis:")
    print("="*60)
    
    status_details = job_status.get("status_details", {})
    if status_details:
        print("\nStatus Details:")
        print(json.dumps(status_details, indent=2))
        
        # Look for common error fields
        error_fields = ["error", "message", "reason", "failure_reason", "error_message"]
        for field in error_fields:
            if field in status_details:
                print(f"\n🔴 {field.upper()}: {status_details[field]}")
    
    # Check for error in top level
    if "error" in job_status:
        print(f"\n🔴 Top-level error: {job_status['error']}")
    
    print("\n💡 Common causes of evaluation job failures:")
    print("   1. Evaluation target (model) is not accessible or not found")
    print("   2. Evaluation dataset is not accessible or invalid")
    print("   3. Insufficient resources (GPU, memory, etc.)")
    print("   4. Network connectivity issues between services")
    print("   5. Configuration errors in evaluation config")
    
    print("\n💡 To investigate:")
    print("   1. Check evaluator service logs:")
    print("      oc logs -n arhkp-nemo-helm -l app=nemoevaluator --tail=200")
    print("   2. Check evaluation job pods:")
    print("      oc get pods -n arhkp-nemo-helm | grep -E 'eval|evaluation'")
    print("   3. Verify evaluation target exists:")
    print(f"      requests.get(f'{EVALUATOR_URL}/v1/evaluation/targets/{NMS_NAMESPACE}/llama-3-1b-instruct-customized').json()")
    print("   4. Check evaluator pod status:")
    print("      oc get pods -n arhkp-nemo-helm | grep evaluator")
    print("   5. Check evaluator custom resource:")
    print("      oc get nemoevaluator -n arhkp-nemo-helm -o yaml")

# Check if job actually completed successfully
if final_status not in ["completed", "success", "finished"]:
    if final_status != "failed":  # Already handled above
        print(f"\n⚠️  Warning: Job status is '{final_status}', not 'completed'")
        print("This might indicate the job is still running, failed, or in an unexpected state.")
        print("\n💡 To check cluster logs and pod status:")
        print(f"   # Check evaluator pods")
        print(f"   oc get pods -n arhkp-nemo-helm | grep evaluator")
        print(f"   ")
        print(f"   # Check evaluator logs")
        print(f"   oc logs -n arhkp-nemo-helm -l app=nemoevaluator --tail=100")
        print(f"   ")
        print(f"   # Check for evaluation job pods (if any)")
        print(f"   oc get pods -n arhkp-nemo-helm | grep -E 'eval|evaluation'")
        print(f"   ")
        print(f"   # Check job status directly")
        print(f"   oc get nemoevaluator -n arhkp-nemo-helm")
else:
    print(f"\n✅ Job completed successfully with status: {final_status}")

# Also return it so Jupyter displays it
job_status

print_status("Custom model evaluation job completed")

In [None]:
# Poll for evaluation job completion
# NOTE: Using localhost (port-forward) is CORRECT for the notebook
# The notebook runs locally and accesses services via port-forward
print(f"🚀 Starting to poll evaluation job: {ft_eval_job_id}")
print(f"Job URL: {EVALUATOR_URL}/v1/evaluation/jobs/{ft_eval_job_id}")
print(f"Polling interval: 5 seconds, Timeout: 600 seconds (10 minutes)\n")

try:
    res = wait_eval_job(f"{EVALUATOR_URL}/v1/evaluation/jobs/{ft_eval_job_id}", polling_interval=5, timeout=600)
except Exception as e:
    # If wait_eval_job raised an exception (e.g., job failed), fetch final status for analysis
    # Don't re-raise - instead, fetch the job status and continue with error analysis
    print(f"\n⚠️  Job failed or exception during polling: {e}")
    print("\nFetching final job status for analysis...")
    try:
        final_res = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{ft_eval_job_id}")
        if final_res.status_code == 200:
            # Use this response as 'res' so the rest of the cell can process it
            res = final_res
            print("✅ Fetched final job status - continuing with error analysis...\n")
        else:
            print(f"❌ Could not fetch final status: HTTP {final_res.status_code}")
            print(f"Response: {final_res.text}")
            raise Exception(f"Failed to fetch job status: HTTP {final_res.status_code}")
    except Exception as fetch_error:
        print(f"❌ Could not fetch final status: {fetch_error}")
        raise

# Check response status
if res.status_code != 200:
    print(f"\n❌ Error: Evaluation job status check failed with status {res.status_code}")
    print(f"Response: {res.text}")
    print("\n💡 To check cluster logs, run:")
    print(f"   oc logs -n arhkp-nemo-helm -l app=nemoevaluator --tail=100")
    print(f"   oc get pods -n arhkp-nemo-helm | grep evaluator")
    raise Exception(f"Failed to get evaluation job status: {res.text}")

# Format and print the response
job_status = res.json()
final_status = job_status.get("status", "unknown")

print("\n" + "="*60)
print("📋 Final Evaluation Job Status:")
print("="*60)
print(json.dumps(job_status, indent=2))
print("="*60)

# Extract and display error information if job failed
if final_status == "failed":
    print("\n" + "="*60)
    print("❌ JOB FAILED - Error Analysis:")
    print("="*60)
    
    status_details = job_status.get("status_details", {})
    if status_details:
        print("\nStatus Details:")
        print(json.dumps(status_details, indent=2))
        
        # Look for common error fields
        error_fields = ["error", "message", "reason", "failure_reason", "error_message"]
        for field in error_fields:
            if field in status_details:
                print(f"\n🔴 {field.upper()}: {status_details[field]}")
    
    # Check for error in top level
    if "error" in job_status:
        print(f"\n🔴 Top-level error: {job_status['error']}")
    
    print("\n💡 Common causes of evaluation job failures:")
    print("   1. Evaluation target (model) is not accessible or not found")
    print("   2. Evaluation dataset is not accessible or invalid")
    print("   3. Insufficient resources (GPU, memory, etc.)")
    print("   4. Network connectivity issues between services")
    print("   5. Configuration errors in evaluation config")
    
    print("\n💡 To investigate:")
    print("   1. Check evaluator service logs:")
    print("      oc logs -n arhkp-nemo-helm -l app=nemoevaluator --tail=200")
    print("   2. Check evaluation job pods:")
    print("      oc get pods -n arhkp-nemo-helm | grep -E 'eval|evaluation'")
    print("   3. Verify evaluation target exists:")
    print(f"      requests.get(f'{EVALUATOR_URL}/v1/evaluation/targets/{NMS_NAMESPACE}/llama-3-1b-instruct-customized').json()")
    print("   4. Check evaluator pod status:")
    print("      oc get pods -n arhkp-nemo-helm | grep evaluator")
    print("   5. Check evaluator custom resource:")
    print("      oc get nemoevaluator -n arhkp-nemo-helm -o yaml")

# Check if job actually completed successfully
if final_status not in ["completed", "success", "finished"]:
    print(f"\n⚠️  Warning: Job status is '{final_status}', not 'completed'")
    print("This might indicate the job is still running, failed, or in an unexpected state.")
    print("\n💡 To check cluster logs and pod status:")
    print(f"   # Check evaluator pods")
    print(f"   oc get pods -n arhkp-nemo-helm | grep evaluator")
    print(f"   ")
    print(f"   # Check evaluator logs")
    print(f"   oc logs -n arhkp-nemo-helm -l app=nemoevaluator --tail=100")
    print(f"   ")
    print(f"   # Check for evaluation job pods (if any)")
    print(f"   oc get pods -n arhkp-nemo-helm | grep -E 'eval|evaluation'")
    print(f"   ")
    print(f"   # Check job status directly")
    print(f"   oc get nemoevaluator -n arhkp-nemo-helm")
else:
    print(f"\n✅ Job completed successfully with status: {final_status}")

# Also return it so Jupyter displays it
job_status

print_status("Custom model evaluation job completed")

### 2.2 Review Evaluation Metrics

The following code sends a GET request to retrieve the results of the fine-tuned model evaluation job.

In [None]:
# First, check the job status to ensure it's completed
print("Checking evaluation job status...")
status_res = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{ft_eval_job_id}/status")

if status_res.status_code != 200:
    print(f"⚠️ Warning: Could not check job status (Status {status_res.status_code})")
    print(f"Response: {status_res.text}")
    print("\nAttempting to retrieve results anyway (the job might still be accessible)...")
else:
    status_data = status_res.json()
    
    # Print full status for debugging
    print("\nFull status response:")
    print(json.dumps(status_data, indent=2))
    
    # The /status endpoint returns a different structure than the full job endpoint
    # It has "message" and "task_status" instead of a top-level "status" field
    job_status = status_data.get("status")
    
    # If no "status" field, infer from message and task_status
    if job_status is None:
        message = status_data.get("message", "").lower()
        task_status = status_data.get("task_status", {})
        progress = status_data.get("progress", 0)
        
        # Infer status from message and task status
        if "completed successfully" in message or "success" in message:
            # Check if all tasks are completed
            if task_status:
                all_tasks_completed = all(
                    status.lower() in ["completed", "success", "finished"] 
                    for status in task_status.values()
                )
                if all_tasks_completed and progress >= 100:
                    job_status = "completed"
                elif all_tasks_completed:
                    job_status = "completed"  # Progress might not be exactly 100
                else:
                    job_status = "running"  # Some tasks still in progress
            elif progress >= 100:
                job_status = "completed"
            else:
                job_status = "running"
        elif "failed" in message or "error" in message:
            job_status = "failed"
        elif "running" in message or progress > 0:
            job_status = "running"
        else:
            job_status = "unknown"
        
        print(f"\n📊 Inferred job status: {job_status}")
        print(f"   Message: {status_data.get('message', 'N/A')}")
        print(f"   Progress: {progress}%")
        if task_status:
            print(f"   Task status: {task_status}")
    else:
        print(f"Job status: {job_status}")
    
    # Valid completion statuses
    completed_statuses = ["completed", "success", "finished", "done"]
    
    if job_status in completed_statuses:
        print(f"✅ Job is completed (status: {job_status})")
    elif job_status == "unknown":
        print("⚠️ Warning: Job status is 'unknown'")
        print("This could mean:")
        print("  - The job doesn't exist or was deleted")
        print("  - The status endpoint returned an unexpected format")
        print("  - The job is in an intermediate state")
        print("\nAttempting to retrieve results anyway...")
    elif job_status == "failed":
        print(f"❌ Job has failed (status: {job_status})")
        print("Check the status details above for error information.")
        print("\nAttempting to retrieve results anyway (may contain error details)...")
    else:
        print(f"⚠️ Warning: Evaluation job is not completed yet. Status: {job_status}")
        print("Valid completion statuses:", completed_statuses)
        print("\nYou can check the status again with:")
        print(f'  requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{ft_eval_job_id}/status").json()')
        print("\nAttempting to retrieve results anyway (the job might have results even if status is not 'completed')...")

# Now retrieve the results (try even if status check failed or status is unknown)
print("\nRetrieving evaluation results...")
res = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{ft_eval_job_id}/results")

# Check response status
if res.status_code != 200:
    print(f"❌ Error retrieving results: Status {res.status_code}")
    print(f"Response: {res.text}")
    raise Exception(f"Failed to retrieve evaluation results: {res.text}")

# Explicitly print the results
results = res.json()
print("\nEvaluation Results:")
print(json.dumps(results, indent=2))

# Check if tasks are empty
if not results.get("tasks") or len(results.get("tasks", {})) == 0:
    print("\n⚠️ WARNING: Evaluation results have no tasks!")
    print("This could mean:")
    print("  1. The evaluation job completed but produced no results")
    print("  2. There was an error during evaluation")
    print("  3. The evaluation configuration was incorrect")
    print("\nPlease check the evaluation job logs or status for more details.")
    print(f"Job ID: {ft_eval_job_id}")

# Also return it so Jupyter displays it
results

print_status("Custom model evaluation results retrieved")

In [None]:
# Extract function name accuracy score
# Handle different possible task names and structures (same as base model extraction)
result_data = res.json()
tasks = result_data.get("tasks", {})

# Find the task (could be 'custom-tool-calling' or another name)
task_name = None
if "custom-tool-calling" in tasks:
    task_name = "custom-tool-calling"
elif len(tasks) > 0:
    # Use the first task if 'custom-tool-calling' is not found
    task_name = list(tasks.keys())[0]
    print(f"⚠️ Note: Using task '{task_name}' instead of 'custom-tool-calling'")

if not task_name or task_name not in tasks:
    print("❌ Error: Could not find evaluation task in results")
    print(f"Available tasks: {list(tasks.keys())}")
    print(f"\nFull results structure:")
    print(json.dumps(result_data, indent=2))
    raise KeyError(f"Task 'custom-tool-calling' not found. Available tasks: {list(tasks.keys())}")

# Extract metrics
task_data = tasks[task_name]
metrics = task_data.get("metrics", {})
tool_calling_metrics = metrics.get("tool-calling-accuracy", {})
scores = tool_calling_metrics.get("scores", {})

ft_function_name_accuracy_score = scores.get("function_name_accuracy", {}).get("value")
ft_function_name_and_args_accuracy = scores.get("function_name_and_args_accuracy", {}).get("value")

if ft_function_name_accuracy_score is None or ft_function_name_and_args_accuracy is None:
    print("⚠️ Warning: Some accuracy scores are missing")
    print(f"Available scores: {list(scores.keys())}")
    print(f"\nFull metrics structure:")
    print(json.dumps(metrics, indent=2))

print(f"Custom model: function_name_accuracy: {ft_function_name_accuracy_score}")
print(f"Custom model: function_name_and_args_accuracy: {ft_function_name_and_args_accuracy}")

print_status("Custom model accuracy scores extracted")

A successfully fine-tuned `meta/llama-3.2-1b-instruct` shows a significant increase in tool-calling accuracy.

In this case, you should observe roughly the following improvements:
* function_name_accuracy: 12% to 92%
* function_name_and_args_accuracy: 8% to 72%
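
The improvements listed above are easiest to compare as percentage-point gains. The sketch below uses the approximate figures quoted in this section, written as fractions; the helper name `percentage_point_gain` is hypothetical, and you may need to adjust the scaling if your evaluator reports scores on a different scale.

```python
def percentage_point_gain(base: float, tuned: float) -> float:
    """Absolute gain in percentage points between two accuracies given as fractions."""
    return round((tuned - base) * 100, 2)


# Approximate figures quoted in this section (fractions of 1.0)
expected = {
    "function_name_accuracy": (0.12, 0.92),
    "function_name_and_args_accuracy": (0.08, 0.72),
}

for name, (base, tuned) in expected.items():
    print(f"{name}: +{percentage_point_gain(base, tuned)} percentage points")
```

With your own run, you could pass the `base_*` and `ft_*` variables extracted earlier in place of the quoted figures, provided neither is `None`.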

Because this evaluation ran on a limited number of samples for demonstration purposes, you may want to increase `tasks.dataset.limit` in your evaluation config, `simple_tool_calling_eval_config`.
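
One way to raise the sample limit without mutating the original config is to work on a deep copy. This is a minimal sketch, assuming the config holds a `tasks` mapping whose entries each carry a `dataset` object with a `limit` field. The helper `with_dataset_limit` and the stand-in config are hypothetical.

```python
import copy
from typing import Any, Dict


def with_dataset_limit(eval_config: Dict[str, Any], limit: int) -> Dict[str, Any]:
    """Return a deep copy of eval_config with every task's dataset.limit set to `limit`."""
    updated = copy.deepcopy(eval_config)
    for task in updated.get("tasks", {}).values():
        # Create the dataset section if a task does not have one yet
        task.setdefault("dataset", {})["limit"] = limit
    return updated


# Hypothetical stand-in for simple_tool_calling_eval_config
example_config = {"tasks": {"custom-tool-calling": {"dataset": {"limit": 50}}}}

larger = with_dataset_limit(example_config, 200)
print(larger["tasks"]["custom-tool-calling"]["dataset"]["limit"])          # 200
print(example_config["tasks"]["custom-tool-calling"]["dataset"]["limit"])  # 50, original untouched
```

Keeping the original config intact lets you rerun the small demonstration evaluation alongside the larger one.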

## (Optional) Next Steps



* You may also run the same evaluation on a base `meta/llama-3.1-70B` model for comparison.
To do so, first deploy the corresponding NIM using the instructions [here](https://build.nvidia.com/meta/llama-3_1-70b-instruct/deploy). After your NIM is deployed, set that endpoint as your evaluation target:

``` python
# Create an evaluation target
NIM_URL = "http://0.0.0.0:8000"
EVAL_TARGET = {
    "type": "model", 
    "model": {
       "api_endpoint": {
         "url": f"{NIM_URL}/v1/completions",
         "model_id": "meta/llama-3.1-70b-instruct",
        }
    }
}

# Start eval job
res = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/jobs",
    json={
        "config": simple_tool_calling_eval_config,
        "target": EVAL_TARGET
    }
)
```

When you run the evaluation with the default config in this notebook, you should observe `meta/llama-3.1-70B` performance similar to the following:
* function_name_accuracy: 98%
* function_name_and_args_accuracy: 66%

Remarkably, a LoRA-tuned `meta/llama-3.2-1B` achieves accuracy close to that of a model 70 times its size, and even outperforms it on the combined `function_name_and_args_accuracy` score.

You can now follow the same process to fine-tune other NIM for LLMs and compare the accuracy of each base model with its fine-tuned counterpart. Doing so helps you produce more accurate models for your use case.

# Part IV: Adding Safety Guardrails


In [None]:
import os
import json
import requests
from time import sleep, time
from openai import OpenAI

print_status("Part IV imports completed")

---
<a id="step-1"></a>
## Step 1: Adding a Guardrails Configuration to the Microservice

Start by running the following cell, which creates a `config.yml` file with the model deployed in the guardrails microservice.

In [None]:
headers = {"Accept": "application/json", "Content-Type": "application/json"}
data = {
    "name": "demo-self-check-input-output",
    "namespace": "default",
    "description": "demo streaming self-check input and output",
    "data": {
        "prompts": [
            {
                "task": "self_check_input",
                "content": "Your task is to check if the user message below contains any explicit content or abusive language"
            },
            {
                "task": "self_check_output",
                "content": "Your task is to check if the bot message below contains any explicit content or abusive language."
            }
        ],
        "instructions": [
            {
                "type": "general",
                "content": "Below is a conversation between a user and a bot called the ABC Bot.\nThe bot is designed to answer employee questions about the ABC Company.\nThe bot is knowledgeable about the employee handbook and company policies.\nIf the bot does not know the answer to a question, it truthfully says it does not know."
            }
        ],
        "sample_conversation": "user \"Hi there. Can you help me with some questions I have about the company?\"\n  express greeting and ask for assistance\nbot express greeting and confirm and offer assistance\n  \"Hi there! I am here to help answer any questions you may have about the ABC Company. What would you like to know?\"\nuser \"What is the company policy on paid time off?\"\n  ask question about benefits\nbot respond to question about benefits\n  \"The ABC Company provides eligible employees with up to two weeks of paid vacation time per year, as well as five paid sick days per year. Please refer to the employee handbook for more information.\"",
        "models": [],
        "rails": {
            "input": {
                "flows": [
                    "self check input"
                ]
            },
            "output": {
                "flows": [
                    "self check output"
                ],
                "streaming": {
                    "enabled": "True",
                    "chunk_size": 200,
                    "context_size": 50,
                    "stream_first": "True"
                }
            },
            "dialog": {
                "single_call": {
                    "enabled": "False"
                }
            }
        }
    }
}
response = requests.post(f"{GUARDRAILS_URL}/v1/guardrail/configs", headers=headers, json=data)
print(json.dumps(response.json(), indent=2))

print_status("Guardrails configuration created")

The following REST API call lists the available guardrails configurations. You should see the `demo-self-check-input-output` configuration created in the previous cell.

In [None]:
response = requests.get(f"{GUARDRAILS_URL}/v1/guardrail/configs?page=1&page_size=10&sort=-created_at")
print(json.dumps(response.json(), indent=2))

print_status("Guardrails configurations listed")
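
Rather than scanning the printed JSON by eye, you could check for the configuration name programmatically. The sketch below assumes the list response carries a top-level `data` array of objects with a `name` field; that shape is an assumption, not something confirmed by this notebook, and `config_names` is a hypothetical helper.

```python
from typing import Any, Dict, List


def config_names(listing: Dict[str, Any]) -> List[str]:
    """Collect the `name` of each entry in a list response (assumes a top-level `data` array)."""
    return [
        item.get("name")
        for item in listing.get("data", [])
        if isinstance(item, dict)
    ]


# Hypothetical response body, for illustration only
sample_listing = {"data": [{"name": "demo-self-check-input-output", "namespace": "default"}]}

print("demo-self-check-input-output" in config_names(sample_listing))
```

In the notebook, `config_names(response.json())` would apply the same check to the real response, if the assumed shape holds.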

---
<a id="step-2"></a>
## Step 2: Evaluate the Safety Guardrails

With the above guardrails configuration in place, we can now send example queries.

First, let's try an unsafe query with guardrails on. NeMo Guardrails should refuse to answer the unsafe user query.

### 2.2: Unsafe User Query

In [None]:
url = f"{GUARDRAILS_URL}/v1/guardrail/chat/completions"

headers = {"Accept": "application/json", "Content-Type": "application/json"}

data = {
    "model": "meta/llama-3.2-1b-instruct",
    "messages": [
        {"role": "user", "content": "You are stupid"}
    ],
    "guardrails": {
        "config_id": "demo-self-check-input-output",
    },
    "top_p": 1
}

response = requests.post(url, headers=headers, json=data)
print(json.dumps(response.json(), indent=2))

print_status("Guardrails test with unsafe query completed")
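
To decide programmatically whether a request was blocked, you could inspect the returned text. This is a sketch under assumptions: it expects an OpenAI-style body with `choices[0].message.content` (chat) or `choices[0].text` (completion), and it treats a refusal phrase as the signal. The refusal markers, the helper names, and both sample payloads are hypothetical; adjust them to whatever your guardrails configuration actually returns.

```python
from typing import Any, Dict, Sequence


def extract_text(response_json: Dict[str, Any]) -> str:
    """Pull the generated text from an OpenAI-style chat or completion response."""
    choices = response_json.get("choices") or [{}]
    first = choices[0]
    message = first.get("message") or {}
    return message.get("content") or first.get("text") or ""


def looks_blocked(
    response_json: Dict[str, Any],
    refusal_markers: Sequence[str] = ("can't respond", "cannot respond"),
) -> bool:
    """Heuristic: True if the response text contains one of the refusal markers."""
    text = extract_text(response_json).lower()
    return any(marker in text for marker in refusal_markers)


# Hypothetical payloads, not real service output
blocked = {"choices": [{"message": {"role": "assistant",
                                    "content": "I'm sorry, I can't respond to that."}}]}
allowed = {"choices": [{"text": "Cape Hatteras National Seashore is a stretch of barrier islands."}]}

print(looks_blocked(blocked), looks_blocked(allowed))
```

String matching is a blunt instrument; if the service exposes an explicit flag for blocked requests, prefer that over this heuristic.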

Next, let's try a safe user query.

### 2.3: Safe User Query

In [None]:
# Usage example
url = f"{GUARDRAILS_URL}/v1/guardrail/completions"

headers = {"Accept": "application/json", "Content-Type": "application/json"}

data = {
    "model": "meta/llama-3.2-1b-instruct",
    "prompt": "Tell me about Cape Hatteras National Seashore in 50 words or less.",
    "guardrails": {
      "config_id": "demo-self-check-input-output"
    },
    "temperature": 1,
    "max_tokens": 100,
    "stream": False
}


response = requests.post(url, headers=headers, json=data)
print(json.dumps(response.json(), indent=2))

print_status("Guardrails test with safe query completed")