# Fine-tuning, Inference, and Evaluation with NVIDIA NeMo Microservices and NIM

### Introduction

This notebook covers the following workflows:
- Creating a dataset and uploading files for customizing and evaluating models
- Running inference on base and customized models
- Customizing and evaluating models, comparing metrics between base models and fine-tuned models
- Running a safety check and evaluating a model using Guardrails


## Prerequisites

### Deploy NeMo Microservices
Ensure the NeMo Microservices platform is up and running, including the model downloading step for `meta/llama-3.1-8b-instruct`. Please refer to the [installation guide](https://aire.gitlab-master-pages.nvidia.com/microservices/documentation/latest/nemo-microservices/latest-internal/set-up/deploy-as-platform/index.html) for instructions.

You can verify the `meta/llama-3.1-8b-instruct` is deployed by querying the NIM endpoint. The response should include a model with an `id` of `meta/llama-3.1-8b-instruct`.

```bash
# URL to NeMo deployment management service
export NEMO_URL="http://nemo.test"

curl -X GET "$NEMO_URL/v1/models" \
  -H "Accept: application/json"
```

### Set up Developer Environment
Set up your development environment on your machine. The project uses `uv` to manage Python dependencies. From the root of the project, install dependencies and create your virtual environment:

```bash
uv sync --extra dev
uv pip install -U llama-stack-client
uv pip install -e .
source .venv/bin/activate
```

### Build Llama Stack Image
Build the Llama Stack image using the virtual environment you just created. For local development, set `LLAMA_STACK_DIR` to ensure your local code is use in the image. To use the production version of `llama-stack`, omit `LLAMA_STACK_DIR`.

```bash
uv run --with llama-stack llama stack list-deps nvidia | xargs -L1 uv pip install
```

## Setup

1. Update the following variables in [config.py](./config.py) with your deployment URLs and API keys. The other variables are optional. You can update these to organize the resources created by this notebook.
```python
# (Required) NeMo Microservices URLs
NDS_URL = "" # NeMo Data Store
NEMO_URL = "" # Other NeMo Microservices (Customizer, Evaluator, Guardrails)
NIM_URL = "" # NIM

# (Required) Hugging Face Token
HF_TOKEN = ""
```

2. Set environment variables used by each service.

In [1]:
import os
from config import *

# Metadata associated with Datasets and Customization Jobs
os.environ["NVIDIA_DATASET_NAMESPACE"] = NAMESPACE
os.environ["NVIDIA_PROJECT_ID"] = PROJECT_ID

# Inference env vars
os.environ["NVIDIA_BASE_URL"] = NIM_URL

# Data Store env vars
os.environ["NVIDIA_DATASETS_URL"] = NEMO_URL

# Customizer env vars
os.environ["NVIDIA_CUSTOMIZER_URL"] = NEMO_URL
os.environ["NVIDIA_OUTPUT_MODEL_DIR"] = CUSTOMIZED_MODEL_DIR

# Evaluator env vars
os.environ["NVIDIA_EVALUATOR_URL"] = NEMO_URL

# Guardrails env vars
os.environ["GUARDRAILS_SERVICE_URL"] = NEMO_URL


3. Initialize the HuggingFace API client. Here, we use NeMo Data Store as the endpoint the client will invoke.

In [2]:
from huggingface_hub import HfApi
import json
import pprint
import requests
from time import sleep, time

os.environ["HF_ENDPOINT"] = f"{NDS_URL}/v1/hf"
os.environ["HF_TOKEN"] = HF_TOKEN

hf_api = HfApi(endpoint=os.environ.get("HF_ENDPOINT"), token=os.environ.get("HF_TOKEN"))

  from .autonotebook import tqdm as notebook_tqdm


4. Initialize the Llama Stack client using the NVIDIA provider.

In [3]:
from llama_stack_client import LlamaStackClient
client = LlamaStackClient(base_url=LLAMASTACK_URL)
client._version

'0.4.0-alpha.1'

In [4]:
# Register base model in Entity Store (required for evaluator and customizer)
import requests

response = requests.post(
    f"{ENTITY_STORE_URL}/v1/models",
    json={
        "name": "llama-3.2-1b-instruct",
        "namespace": "meta",
        "description": "Base Llama 3.2 1B Instruct model",
        "project": "tool_calling",
        "spec": {
            "num_parameters": 1000000000,
            "context_size": 4096,
            "num_virtual_tokens": 0,
            "is_chat": True
        },
        "artifact": {
            "gpu_arch": "Ampere",
            "precision": "bf16-mixed",
            "tensor_parallelism": 1,
            "backend_engine": "nemo",
            "status": "upload_completed",
            "files_url": "nim://meta/llama-3.2-1b-instruct"
        }
    }
)

if response.status_code in (200, 201):
    print("✅ Base model registered in Entity Store")
elif response.status_code == 409:
    print("⚠️ Base model already exists in Entity Store")
else:
    print(f"❌ Failed to register: {response.status_code} - {response.text}")


✅ Base model registered in Entity Store


5. Define a few helper functions we'll use later that wait for async jobs to complete.

In [5]:
from llama_stack.apis.common.job_types import JobStatus

def wait_customization_job(job_id: str, polling_interval: int = 30, timeout: int = 3600):
    start_time = time()

    response = client.alpha.post_training.job.status(job_uuid=job_id)
    job_status = response.status

    print(f"Waiting for Customization job {job_id} to finish.")
    print(f"Job status: {job_status} after {time() - start_time} seconds.")

    while job_status in [JobStatus.scheduled.value, JobStatus.in_progress.value]:
        sleep(polling_interval)
        response = client.alpha.post_training.job.status(job_uuid=job_id)
        job_status = response.status

        print(f"Job status: {job_status} after {time() - start_time} seconds.")

        if time() - start_time > timeout:
            raise RuntimeError(f"Customization Job {job_id} took more than {timeout} seconds.")
        
    return job_status

def wait_eval_job(benchmark_id: str, job_id: str, polling_interval: int = 10, timeout: int = 6000):
    start_time = time()
    job_status = client.alpha.eval.jobs.status(benchmark_id=benchmark_id, job_id=job_id)

    print(f"Waiting for Evaluation job {job_id} to finish.")
    print(f"Job status: {job_status} after {time() - start_time} seconds.")

    while job_status.status in [JobStatus.scheduled.value, JobStatus.in_progress.value]:
        sleep(polling_interval)
        job_status = client.alpha.eval.jobs.status(benchmark_id=benchmark_id, job_id=job_id)

        print(f"Job status: {job_status} after {time() - start_time} seconds.")

        if time() - start_time > timeout:
            raise RuntimeError(f"Evaluation Job {job_id} took more than {timeout} seconds.")

    return job_status

# When creating a customized model, NIM asynchronously loads the model in its model registry.
# After this, we can run inference on the new model. This helper function waits for NIM to pick up the new model.
def wait_nim_loads_customized_model(model_id: str, polling_interval: int = 10, timeout: int = 300):
    found = False
    start_time = time()

    print(f"Checking if NIM has loaded customized model {model_id}.")

    while not found:
        sleep(polling_interval)

        response = requests.get(f"{NIM_URL}/v1/models")
        if model_id in [model["id"] for model in response.json()["data"]]:
            found = True
            print(f"Model {model_id} available after {time() - start_time} seconds.")
            break
        else:
            print(f"Model {model_id} not available after {time() - start_time} seconds.")

    if not found:
        raise RuntimeError(f"Model {model_id} not available after {timeout} seconds.")

    assert found, f"Could not find model {model_id} in the list of available models."
            

## Upload Dataset Using the HuggingFace Client

Start by creating a dataset with the `sample_squad_data` files. This data is pulled from the Stanford Question Answering Dataset (SQuAD) reading comprehension dataset, consisting of questions posed on a set of Wikipedia articles, where the answer to every question is a segment of text from the corresponding passage, or the question is unanswerable.

In [6]:
sample_squad_dataset_name = "sample-squad-test"
repo_id = f"{NAMESPACE}/{sample_squad_dataset_name}"

In [7]:
# Create the repo
response = hf_api.create_repo(repo_id, repo_type="dataset")

In [8]:
# Upload the files from the local folder
hf_api.upload_folder(
    folder_path="./sample_data/sample_squad_data/training",
    path_in_repo="training",
    repo_id=repo_id,
    repo_type="dataset",
)
hf_api.upload_folder(
    folder_path="./sample_data/sample_squad_data/validation",
    path_in_repo="validation",
    repo_id=repo_id,
    repo_type="dataset",
)
hf_api.upload_folder(
    folder_path="./sample_data/sample_squad_data/testing",
    path_in_repo="testing",
    repo_id=repo_id,
    repo_type="dataset",
)

training.jsonl: 100%|██████████| 1.18M/1.18M [00:04<00:00, 260kB/s]
validation.jsonl: 100%|██████████| 171k/171k [00:00<00:00, 223kB/s]
testing.jsonl: 100%|██████████| 345k/345k [00:01<00:00, 250kB/s]


CommitInfo(commit_url='', commit_message='Upload folder using huggingface_hub', commit_description='', oid='2b91468badc4f3708f0173be4a65f1c245656a83', pr_url=None, repo_url=RepoUrl('', endpoint='https://huggingface.co', repo_type='model', repo_id=''), pr_revision=None, pr_num=None)

In [9]:
# Create the dataset
response = client.beta.datasets.register(
    purpose="post-training/messages",
    dataset_id=sample_squad_dataset_name,
    source={
        "type": "uri",
        "uri": f"hf://datasets/{repo_id}"
    },
    metadata={
        "format": "json",
        "description": "Test sample_squad_data dataset for NVIDIA E2E notebook",
        "provider_id": "nvidia",
    }
)
print(response)

  response = client.beta.datasets.register(
INFO:httpx:HTTP Request: POST http://localhost:8321/v1beta/datasets "HTTP/1.1 200 OK"


DatasetRegisterResponse(identifier='sample-squad-test', provider_id='nvidia', purpose='post-training/messages', source=SourceUriDataSource(uri='hf://datasets/nvidia-e2e-tutorial/sample-squad-test', type='uri'), metadata={'format': 'json', 'description': 'Test sample_squad_data dataset for NVIDIA E2E notebook', 'provider_id': 'nvidia'}, provider_resource_id='sample-squad-test', type='dataset', owner=None)


In [10]:
# Register dataset in Entity Store (required for customizer/evaluator)
import requests
response = requests.post(
    f"{ENTITY_STORE_URL}/v1/datasets",
    json={
        "name": sample_squad_dataset_name,
        "namespace": NAMESPACE,
        "description": "Test sample_squad_data dataset for NVIDIA E2E notebook",
        "files_url": f"hf://datasets/{repo_id}",
        "project": "tool_calling",
        "format": "json",
    },
)

if response.status_code in (200, 201):
    print("✅ Dataset registered in Entity Store")
    dataset_obj = response.json()
    print(f"Files URL: {dataset_obj['files_url']}")
    assert dataset_obj["files_url"] == f"hf://datasets/{repo_id}"
elif response.status_code == 409:
    print("⚠️ Dataset already exists in Entity Store - continuing...")
else:
    print(f"❌ Failed to register: {response.status_code} - {response.text}")


⚠️ Dataset already exists in Entity Store - continuing...


## Inference

We'll use an entry from the `sample_squad_data` test data to verify we can run inference using NVIDIA NIM.

In [11]:
import json
import pprint

with open("./sample_data/sample_squad_data/testing/testing.jsonl", "r") as f:
    examples = [json.loads(line) for line in f]

# Get the user prompt from the last example
sample_prompt = examples[-1]["prompt"]
pprint.pprint(sample_prompt)

('Extract from the following context the minimal span word for word that best '
 'answers the question.\n'
 '- If a question does not make any sense, or is not factually coherent, '
 'explain why instead of answering something not correct.\n'
 "- If you don't know the answer to a question, please don't share false "
 'information.\n'
 '- If the answer is not in the context, the answer should be "?".\n'
 '- Your answer should not include any other text than the answer to the '
 'question. Don\'t include any other text like "Here is the answer to the '
 'question:" or "The minimal span word for word that best answers the question '
 'is:" or anything like that.\n'
 '\n'
 'Context: The league announced on October 16, 2012, that the two finalists '
 "were Sun Life Stadium and Levi's Stadium. The South Florida/Miami area has "
 'previously hosted the event 10 times (tied for most with New Orleans), with '
 'the most recent one being Super Bowl XLIV in 2010. The San Francisco Bay '
 'Area la

In [12]:
# Register the base model with LlamaStack
from llama_stack.apis.models.models import ModelType

try:
    client.models.register(
        model_id=BASE_MODEL,
        model_type=ModelType.llm,
        provider_id="nvidia",
    )
    print(f"✅ Registered model: {BASE_MODEL}")
except Exception as e:
    if "already exists" in str(e).lower():
        print(f"⚠️ Model {BASE_MODEL} already registered")
    else:
        print(f"Error registering model: {e}")


  client.models.register(
INFO:httpx:HTTP Request: POST http://localhost:8321/v1/models "HTTP/1.1 400 Bad Request"


Error registering model: Error code: 400 - {'detail': 'Invalid value: Model meta/llama-3.2-1b-instruct is not available from provider nvidia'}


In [13]:
# Test inference
response = client.chat.completions.create(
    messages=[
        {"role": "user", "content": sample_prompt}
    ],
    model=f"nvidia/{BASE_MODEL}",
    max_tokens=20,
    temperature=0.7,
)
print(f"Inference response: {response.choices[0].message.content}")

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/chat/completions "HTTP/1.1 200 OK"


Inference response: 1985


## Evaluation


To run an Evaluation, we'll first register a benchmark. A benchmark corresponds to an Evaluation Config in NeMo Evaluator, which contains the metadata to use when launching an Evaluation Job. Here, we'll create a benchmark that uses the testing file uploaded in the previous step. 

In [14]:
benchmark_id = "test-eval-config"

In [15]:
simple_eval_config = {
    "benchmark_id": benchmark_id,
    "dataset_id": "",
    "scoring_functions": [],
    "metadata": {
        "type": "custom",
        "params": {"parallelism": 8},
        "tasks": {
            "qa": {
                "type": "completion",
                "params": {
                    "template": {
                        "prompt": "{{prompt}}",
                        "max_tokens": 20,
                        "temperature": 0.7,
                        "top_p": 0.9,
                    },
                },
                "dataset": {"files_url": f"hf://datasets/{repo_id}/testing/testing.jsonl"},
                "metrics": {
                    "bleu": {
                        "type": "bleu",
                        "params": {"references": ["{{ideal_response}}"]},
                    },
                    "string-check": {
                        "type": "string-check",
                        "params": {"check": ["{{ideal_response | trim}}", "equals", "{{output_text | trim}}"]},
                    },
                },
            }
        }
    }
}

In [16]:
# Register a benchmark, which creates an Evaluation Config
response = client.alpha.benchmarks.register(
    benchmark_id=benchmark_id,
    dataset_id=repo_id,
    scoring_functions=simple_eval_config["scoring_functions"],
    metadata=simple_eval_config["metadata"]
)
print(f"Created benchmark {benchmark_id}")

  response = client.alpha.benchmarks.register(
INFO:httpx:HTTP Request: POST http://localhost:8321/v1alpha/eval/benchmarks "HTTP/1.1 200 OK"


Created benchmark test-eval-config


In [17]:
# Launch a simple evaluation with the benchmark
response = client.alpha.eval.run_eval(
    benchmark_id=benchmark_id,
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": BASE_MODEL,
            "sampling_params": {}
        }
    }
)
job_id = response.model_dump()["job_id"]
print(f"Created evaluation job {job_id}")

INFO:httpx:HTTP Request: POST http://localhost:8321/v1alpha/eval/benchmarks/test-eval-config/jobs "HTTP/1.1 200 OK"


Created evaluation job eval-Wgr4rnBbS9A8R2ps5E4yCi


In [18]:
# Wait for the job to complete
job = wait_eval_job(benchmark_id=benchmark_id, job_id=job_id, polling_interval=5, timeout=600)

INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/eval/benchmarks/test-eval-config/jobs/eval-Wgr4rnBbS9A8R2ps5E4yCi "HTTP/1.1 200 OK"


Waiting for Evaluation job eval-Wgr4rnBbS9A8R2ps5E4yCi to finish.
Job status: Job(job_id='eval-Wgr4rnBbS9A8R2ps5E4yCi', status='in_progress') after 0.2252950668334961 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/eval/benchmarks/test-eval-config/jobs/eval-Wgr4rnBbS9A8R2ps5E4yCi "HTTP/1.1 200 OK"


Job status: Job(job_id='eval-Wgr4rnBbS9A8R2ps5E4yCi', status='in_progress') after 5.720567226409912 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/eval/benchmarks/test-eval-config/jobs/eval-Wgr4rnBbS9A8R2ps5E4yCi "HTTP/1.1 200 OK"


Job status: Job(job_id='eval-Wgr4rnBbS9A8R2ps5E4yCi', status='completed') after 11.216270923614502 seconds.


In [19]:
print(f"Job {job_id} status: {job.status}")

Job eval-Wgr4rnBbS9A8R2ps5E4yCi status: completed


In [20]:
job_results = client.alpha.eval.jobs.retrieve(benchmark_id=benchmark_id, job_id=job_id)
print(f"Job results: {json.dumps(job_results.model_dump(), indent=2)}")

INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/eval/benchmarks/test-eval-config/jobs/eval-Wgr4rnBbS9A8R2ps5E4yCi/result "HTTP/1.1 200 OK"


Job results: {
  "generations": [],
  "scores": {
    "test-eval-config": {
      "aggregated_results": {
        "created_at": "2025-11-20T13:35:35.355164",
        "updated_at": "2025-11-20T13:35:35.355165",
        "id": "evaluation_result-37ZwsBfu6sZRRjBfq6Z9PE",
        "job": "eval-Wgr4rnBbS9A8R2ps5E4yCi",
        "tasks": {
          "qa": {
            "metrics": {
              "bleu": {
                "scores": {
                  "sentence": {
                    "value": 9.095718887216758,
                    "stats": {
                      "count": 200,
                      "sum": 1819.1437774433518,
                      "mean": 9.095718887216758
                    }
                  },
                  "corpus": {
                    "value": 4.227454564913527
                  }
                }
              },
              "string-check": {
                "scores": {
                  "string-check": {
                    "value": 0.005,
                    "

In [21]:
# Extract bleu score and assert it's within range
initial_bleu_score = job_results.scores[benchmark_id].aggregated_results["tasks"]["qa"]["metrics"]["bleu"]["scores"]["corpus"]["value"]
print(f"Initial bleu score: {initial_bleu_score}")

assert initial_bleu_score >= 2

Initial bleu score: 4.227454564913527


In [22]:
# Extract accuracy and assert it's within range
initial_accuracy_score = job_results.scores[benchmark_id].aggregated_results["tasks"]["qa"]["metrics"]["string-check"]["scores"]["string-check"]["value"]
print(f"Initial accuracy: {initial_accuracy_score}")

assert initial_accuracy_score >= 0

Initial accuracy: 0.005


## Customization

Now that we've established our baseline Evaluation metrics, we'll customize a model using our training data uploaded previously.

In [23]:
# Start the customization job
response = client.alpha.post_training.supervised_fine_tune(
    job_uuid="",
    model=f"{BASE_MODEL}@v1.0.0+A100",
    training_config={
        "n_epochs": 2,
        "data_config": {
            "batch_size": 16,
            "dataset_id": sample_squad_dataset_name,
        },
        "optimizer_config": {
            "lr": 0.0001,
        }
    },
    algorithm_config={
        "type": "LoRA",
        "adapter_dim": 16,
        "adapter_dropout": 0.1,
        "alpha": 16,
        # NOTE: These fields are required, but not directly used by NVIDIA
        "rank": 8,
        "lora_attn_modules": [],
        "apply_lora_to_mlp": True,
        "apply_lora_to_output": False
    },
    hyperparam_search_config={},
    logger_config={},
    checkpoint_dir="",
)

job_id = response.job_uuid
print(f"Created job with ID: {job_id}")

INFO:httpx:HTTP Request: POST http://localhost:8321/v1alpha/post-training/supervised-fine-tune "HTTP/1.1 200 OK"


Created job with ID: cust-KizmczUEpGiA37zAdAdfrZ


In [24]:
# Wait for the job to complete
job_status = wait_customization_job(job_id=job_id)

INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/post-training/job/status?job_uuid=cust-KizmczUEpGiA37zAdAdfrZ "HTTP/1.1 200 OK"


Waiting for Customization job cust-KizmczUEpGiA37zAdAdfrZ to finish.
Job status: scheduled after 0.17851614952087402 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/post-training/job/status?job_uuid=cust-KizmczUEpGiA37zAdAdfrZ "HTTP/1.1 200 OK"


Job status: in_progress after 30.721986770629883 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/post-training/job/status?job_uuid=cust-KizmczUEpGiA37zAdAdfrZ "HTTP/1.1 200 OK"


Job status: in_progress after 61.24327778816223 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/post-training/job/status?job_uuid=cust-KizmczUEpGiA37zAdAdfrZ "HTTP/1.1 200 OK"


Job status: in_progress after 91.75424790382385 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/post-training/job/status?job_uuid=cust-KizmczUEpGiA37zAdAdfrZ "HTTP/1.1 200 OK"


Job status: in_progress after 122.30662298202515 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/post-training/job/status?job_uuid=cust-KizmczUEpGiA37zAdAdfrZ "HTTP/1.1 200 OK"


Job status: in_progress after 152.81220698356628 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/post-training/job/status?job_uuid=cust-KizmczUEpGiA37zAdAdfrZ "HTTP/1.1 200 OK"


Job status: in_progress after 183.36947584152222 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/post-training/job/status?job_uuid=cust-KizmczUEpGiA37zAdAdfrZ "HTTP/1.1 200 OK"


Job status: in_progress after 213.87795114517212 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/post-training/job/status?job_uuid=cust-KizmczUEpGiA37zAdAdfrZ "HTTP/1.1 200 OK"


Job status: in_progress after 244.38700914382935 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/post-training/job/status?job_uuid=cust-KizmczUEpGiA37zAdAdfrZ "HTTP/1.1 200 OK"


Job status: in_progress after 274.897922039032 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/post-training/job/status?job_uuid=cust-KizmczUEpGiA37zAdAdfrZ "HTTP/1.1 200 OK"


Job status: in_progress after 305.4090518951416 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/post-training/job/status?job_uuid=cust-KizmczUEpGiA37zAdAdfrZ "HTTP/1.1 200 OK"


Job status: in_progress after 335.9244771003723 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/post-training/job/status?job_uuid=cust-KizmczUEpGiA37zAdAdfrZ "HTTP/1.1 200 OK"


Job status: completed after 366.4423270225525 seconds.


In [25]:
print(f"Job {job_id} status: {job_status}")

Job cust-KizmczUEpGiA37zAdAdfrZ status: completed


After the fine-tuning job succeeds, we can't immediately run inference on the customized model. In the background, NIM will load newly-created models and make them available for inference. This process typically takes < 5 minutes - here, we wait for our customized model to be picked up before attempting to run inference.

In [26]:
# Check that the customized model has been picked up by NIM;
# We allow up to 5 minutes for the LoRA adapter to be loaded
wait_nim_loads_customized_model(model_id=CUSTOMIZED_MODEL_DIR, timeout=600)

Checking if NIM has loaded customized model nvidia-e2e-tutorial/test-messages-model@v1.
Model nvidia-e2e-tutorial/test-messages-model@v1 not available after 10.493601083755493 seconds.
Model nvidia-e2e-tutorial/test-messages-model@v1 not available after 20.984079837799072 seconds.
Model nvidia-e2e-tutorial/test-messages-model@v1 not available after 31.47391676902771 seconds.
Model nvidia-e2e-tutorial/test-messages-model@v1 not available after 41.96597099304199 seconds.
Model nvidia-e2e-tutorial/test-messages-model@v1 not available after 52.45623183250427 seconds.
Model nvidia-e2e-tutorial/test-messages-model@v1 not available after 62.94737505912781 seconds.
Model nvidia-e2e-tutorial/test-messages-model@v1 available after 73.4358777999878 seconds.


At this point, NIM can run inference on the customized model. However, to use the Llama Stack client to run inference, we need to explicitly register the model first.

In [27]:
# Check that inference with the new customized model works using direct NIM call
# (LlamaStack's nvidia provider doesn't see newly created models immediately)
import requests

response = requests.post(
    f"{NIM_URL}/v1/completions",
    json={
        "model": CUSTOMIZED_MODEL_DIR,
        "prompt": "Complete the sentence using one word: Roses are red, violets are ",
        "temperature": 0.7,
        "top_p": 0.9,
        "max_tokens": 20,
    }
)

if response.status_code == 200:
    print(f"✅ Inference response: {response.json()['choices'][0]['text']}")
else:
    print(f"❌ Error: {response.status_code} - {response.text}")


✅ Inference response:  green, and the sun is shining brightly today.


## Evaluate Customized Model
Now that we've customized the model, let's run another Evaluation to compare its performance with the base model.

In [28]:
# Launch a simple evaluation with the same benchmark with the customized model
response = client.alpha.eval.run_eval(
    benchmark_id=benchmark_id,
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": CUSTOMIZED_MODEL_DIR,
            "sampling_params": {}
        }
    }
)
job_id = response.model_dump()["job_id"]
print(f"Created evaluation job {job_id}")

INFO:httpx:HTTP Request: POST http://localhost:8321/v1alpha/eval/benchmarks/test-eval-config/jobs "HTTP/1.1 200 OK"


Created evaluation job eval-3EfZyAvvjYbeynjwdHC2J6


In [29]:
# Wait for the job to complete
customized_model_job = wait_eval_job(benchmark_id=benchmark_id, job_id=job_id, polling_interval=5, timeout=600)

INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/eval/benchmarks/test-eval-config/jobs/eval-3EfZyAvvjYbeynjwdHC2J6 "HTTP/1.1 200 OK"


Waiting for Evaluation job eval-3EfZyAvvjYbeynjwdHC2J6 to finish.
Job status: Job(job_id='eval-3EfZyAvvjYbeynjwdHC2J6', status='in_progress') after 0.17874503135681152 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/eval/benchmarks/test-eval-config/jobs/eval-3EfZyAvvjYbeynjwdHC2J6 "HTTP/1.1 200 OK"


Job status: Job(job_id='eval-3EfZyAvvjYbeynjwdHC2J6', status='in_progress') after 5.677712678909302 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/eval/benchmarks/test-eval-config/jobs/eval-3EfZyAvvjYbeynjwdHC2J6 "HTTP/1.1 200 OK"


Job status: Job(job_id='eval-3EfZyAvvjYbeynjwdHC2J6', status='completed') after 11.171893835067749 seconds.


In [30]:
print(f"Job {job_id} status: {customized_model_job.status}")

Job eval-3EfZyAvvjYbeynjwdHC2J6 status: completed


In [31]:
customized_model_job_results = client.alpha.eval.jobs.retrieve(benchmark_id=benchmark_id, job_id=job_id)
print(f"Job results: {json.dumps(customized_model_job_results.model_dump(), indent=2)}")

INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/eval/benchmarks/test-eval-config/jobs/eval-3EfZyAvvjYbeynjwdHC2J6/result "HTTP/1.1 200 OK"


Job results: {
  "generations": [],
  "scores": {
    "test-eval-config": {
      "aggregated_results": {
        "created_at": "2025-11-20T13:43:09.574352",
        "updated_at": "2025-11-20T13:43:09.574353",
        "id": "evaluation_result-MBkgexnVNMHy9HBLwPb9oK",
        "job": "eval-3EfZyAvvjYbeynjwdHC2J6",
        "tasks": {
          "qa": {
            "metrics": {
              "bleu": {
                "scores": {
                  "sentence": {
                    "value": 66.00271830401283,
                    "stats": {
                      "count": 200,
                      "sum": 13200.543660802567,
                      "mean": 66.00271830401283
                    }
                  },
                  "corpus": {
                    "value": 48.895642888018884
                  }
                }
              },
              "string-check": {
                "scores": {
                  "string-check": {
                    "value": 0.525,
                    

In [32]:
# Extract bleu score and assert it's within range
customized_bleu_score = customized_model_job_results.scores[benchmark_id].aggregated_results["tasks"]["qa"]["metrics"]["bleu"]["scores"]["corpus"]["value"]
print(f"Customized bleu score: {customized_bleu_score}")

assert customized_bleu_score >= 35

Customized bleu score: 48.895642888018884


In [33]:
# Extract accuracy and assert it's within range
customized_accuracy_score = customized_model_job_results.scores[benchmark_id].aggregated_results["tasks"]["qa"]["metrics"]["string-check"]["scores"]["string-check"]["value"]
print(f"Initial accuracy: {customized_accuracy_score}")

assert customized_accuracy_score >= 0.45

Initial accuracy: 0.525


We expect to see an improvement in the bleu score and accuracy in the customized model's evaluation results.

In [34]:
# Ensure the customized model evaluation is better than the original model evaluation
print(f"customized_bleu_score - initial_bleu_score: {customized_bleu_score - initial_bleu_score}")
assert (customized_bleu_score - initial_bleu_score) >= 27

print(f"customized_accuracy_score - initial_accuracy_score: {customized_accuracy_score - initial_accuracy_score}")
assert (customized_accuracy_score - initial_accuracy_score) >= 0.4

customized_bleu_score - initial_bleu_score: 44.668188323105355
customized_accuracy_score - initial_accuracy_score: 0.52


## Upload Chat Dataset Using the HuggingFace Client
Repeat the fine-tuning and evaluation workflow with a chat-style dataset, which has a list of `messages` instead of a `prompt` and `completion`.

In [35]:
sample_squad_messages_dataset_name = "test-squad-messages-dataset"
repo_id = f"{NAMESPACE}/{sample_squad_messages_dataset_name}"

In [36]:
# Create the repo
res = hf_api.create_repo(repo_id, repo_type="dataset")

In [37]:
# Upload the files from the local folder
hf_api.upload_folder(
    folder_path="./sample_data/sample_squad_messages/training",
    path_in_repo="training",
    repo_id=repo_id,
    repo_type="dataset",
)
hf_api.upload_folder(
    folder_path="./sample_data/sample_squad_messages/validation",
    path_in_repo="validation",
    repo_id=repo_id,
    repo_type="dataset",
)
hf_api.upload_folder(
    folder_path="./sample_data/sample_squad_messages/testing",
    path_in_repo="testing",
    repo_id=repo_id,
    repo_type="dataset",
)

training.jsonl: 100%|██████████| 1.28M/1.28M [00:01<00:00, 988kB/s]
validation.jsonl: 100%|██████████| 184k/184k [00:00<00:00, 973kB/s]
testing.jsonl: 100%|██████████| 370k/370k [00:00<00:00, 1.78MB/s]


CommitInfo(commit_url='', commit_message='Upload folder using huggingface_hub', commit_description='', oid='08d111e91ead3502475571005b37096be6c8515e', pr_url=None, repo_url=RepoUrl('', endpoint='https://huggingface.co', repo_type='model', repo_id=''), pr_revision=None, pr_num=None)

In [38]:
# Create the dataset
response = client.beta.datasets.register(
    purpose="post-training/messages",
    dataset_id=sample_squad_messages_dataset_name,
    source={
        "type": "uri",
        "uri": f"hf://datasets/{repo_id}"
    },
    metadata={
        "format": "json",
        "description": "Test sample_squad_messages dataset for NVIDIA E2E notebook",
        "provider_id": "nvidia",
    }
)
print(response)

  response = client.beta.datasets.register(
INFO:httpx:HTTP Request: POST http://localhost:8321/v1beta/datasets "HTTP/1.1 200 OK"


DatasetRegisterResponse(identifier='test-squad-messages-dataset', provider_id='nvidia', purpose='post-training/messages', source=SourceUriDataSource(uri='hf://datasets/nvidia-e2e-tutorial/test-squad-messages-dataset', type='uri'), metadata={'format': 'json', 'description': 'Test sample_squad_messages dataset for NVIDIA E2E notebook', 'provider_id': 'nvidia'}, provider_resource_id='test-squad-messages-dataset', type='dataset', owner=None)


In [39]:
# Register dataset in Entity Store (required for customizer/evaluator)
import requests
response = requests.post(
    f"{ENTITY_STORE_URL}/v1/datasets",
    json={
        "name": sample_squad_dataset_name,
        "namespace": NAMESPACE,
        "description": "Test sample_squad_data dataset for NVIDIA E2E notebook",
        "files_url": f"hf://datasets/{repo_id}",
        "project": "tool_calling",
        "format": "json",
    },
)

if response.status_code in (200, 201):
    print("✅ Dataset registered in Entity Store")
    dataset_obj = response.json()
    print(f"Files URL: {dataset_obj['files_url']}")
    assert dataset_obj["files_url"] == f"hf://datasets/{repo_id}"
elif response.status_code == 409:
    print("⚠️ Dataset already exists in Entity Store - continuing...")
else:
    print(f"❌ Failed to register: {response.status_code} - {response.text}")


⚠️ Dataset already exists in Entity Store - continuing...


## Inference with chat/completions
We'll use an entry from the `sample_squad_messages` test data to verify we can run inference using NVIDIA NIM.

In [40]:
with open("./sample_data/sample_squad_messages/testing/testing.jsonl", "r") as f:
    examples = [json.loads(line) for line in f]

# get the user and assistant messages from the last example
sample_messages = examples[-1]["messages"][:-1]
pprint.pprint(sample_messages)

[{'content': 'You are a helpful, respectful and honest assistant. Extract from '
             'the following context the minimal span word for word that best '
             'answers the question.\n'
             '- If a question does not make any sense, or is not factually '
             'coherent, explain why instead of answering something not '
             'correct.\n'
             "- If you don't know the answer to a question, please don't share "
             'false information.\n'
             '- If the answer is not in the context, the answer should be '
             '"?".\n'
             '- Your answer should not include any other text than the answer '
             'to the question. Don\'t include any other text like "Here is the '
             'answer to the question:" or "The minimal span word for word that '
             'best answers the question is:" or anything like that.',
  'role': 'system'},
 {'content': 'Context: The league announced on October 16, 2012, that the two

In [41]:
# Test inference
response = client.chat.completions.create(
    messages=sample_messages,
    model="nvidia/meta/llama-3.2-1b-instruct", # BASE_MODEL,
    max_tokens=20,
    temperature=0.7,
)
assert response.choices[0].message.content is not None
print(f"Inference response: {response.choices[0].message.content}")

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/chat/completions "HTTP/1.1 200 OK"


Inference response: 2010


## Evaluate with chat dataset
We'll register a new benchmark that uses the chat-style testing file uploaded previously.

In [42]:
from time import time
# Use a unique benchmark ID to avoid conflicts with existing benchmarks
benchmark_id = f"test-eval-config-chat-{int(time())}"

In [43]:
# Register a benchmark, which creates an Eval Config
simple_eval_config = {
    "benchmark_id": benchmark_id,
    "dataset_id": "",
    "scoring_functions": [],
    "metadata": {
        "type": "custom",
        "params": {"parallelism": 8},
        "tasks": {
            "qa": {
                "type": "completion",
                "params": {
                    "template": {
                        "messages": [
                            {"role": "{{item.messages[0].role}}", "content": "{{item.messages[0].content}}"},
                            {"role": "{{item.messages[1].role}}", "content": "{{item.messages[1].content}}"},
                        ],
                        "max_tokens": 20,
                        "temperature": 0.7,
                        "top_p": 0.9,
                    },
                },
                "dataset": {"files_url": f"hf://datasets/{repo_id}/testing/testing.jsonl"},
                "metrics": {
                    "bleu": {
                        "type": "bleu",
                        "params": {"references": ["{{item.messages[2].content | trim}}"]},
                    },
                    "string-check": {
                        "type": "string-check",
                        "params": {"check": ["{{item.messages[2].content}}", "equals", "{{output_text | trim}}"]},
                    },
                },
            }
        }
    }
}

In [44]:
response = client.alpha.benchmarks.register(
    benchmark_id=benchmark_id,
    dataset_id=repo_id,
    scoring_functions=simple_eval_config["scoring_functions"],
    metadata=simple_eval_config["metadata"]
)
print(f"Created benchmark {benchmark_id}")

  response = client.alpha.benchmarks.register(
INFO:httpx:HTTP Request: POST http://localhost:8321/v1alpha/eval/benchmarks "HTTP/1.1 200 OK"


Created benchmark test-eval-config-chat-1763646207


In [45]:
# Launch a simple evaluation with the benchmark
response = client.alpha.eval.run_eval(
    benchmark_id=benchmark_id,
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": "meta/llama-3.2-1b-instruct", # BASE_MODEL
            "sampling_params": {}
        }
    }
)
job_id = response.model_dump()["job_id"]
print(f"Created evaluation job {job_id}")

INFO:httpx:HTTP Request: POST http://localhost:8321/v1alpha/eval/benchmarks/test-eval-config-chat-1763646207/jobs "HTTP/1.1 200 OK"


Created evaluation job eval-NGy7hxi3tDaZUbpYAc1Ujs


In [46]:
# Wait for the job to complete
job = wait_eval_job(benchmark_id=benchmark_id, job_id=job_id, polling_interval=5, timeout=600)

INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/eval/benchmarks/test-eval-config-chat-1763646207/jobs/eval-NGy7hxi3tDaZUbpYAc1Ujs "HTTP/1.1 200 OK"


Waiting for Evaluation job eval-NGy7hxi3tDaZUbpYAc1Ujs to finish.
Job status: Job(job_id='eval-NGy7hxi3tDaZUbpYAc1Ujs', status='in_progress') after 0.16646504402160645 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/eval/benchmarks/test-eval-config-chat-1763646207/jobs/eval-NGy7hxi3tDaZUbpYAc1Ujs "HTTP/1.1 200 OK"


Job status: Job(job_id='eval-NGy7hxi3tDaZUbpYAc1Ujs', status='in_progress') after 5.693363904953003 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/eval/benchmarks/test-eval-config-chat-1763646207/jobs/eval-NGy7hxi3tDaZUbpYAc1Ujs "HTTP/1.1 200 OK"


Job status: Job(job_id='eval-NGy7hxi3tDaZUbpYAc1Ujs', status='completed') after 11.280320882797241 seconds.


In [47]:
print(f"Job {job_id} status: {job.status}")

Job eval-NGy7hxi3tDaZUbpYAc1Ujs status: completed


In [48]:
job_results = client.alpha.eval.jobs.retrieve(benchmark_id=benchmark_id, job_id=job_id)
print(f"Job results: {json.dumps(job_results.model_dump(), indent=2)}")

INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/eval/benchmarks/test-eval-config-chat-1763646207/jobs/eval-NGy7hxi3tDaZUbpYAc1Ujs/result "HTTP/1.1 200 OK"


Job results: {
  "generations": [],
  "scores": {
    "test-eval-config-chat-1763646207": {
      "aggregated_results": {
        "created_at": "2025-11-20T13:43:27.996624",
        "updated_at": "2025-11-20T13:43:27.996625",
        "id": "evaluation_result-422RmKHVFAirB4BTcm2hhw",
        "job": "eval-NGy7hxi3tDaZUbpYAc1Ujs",
        "tasks": {
          "qa": {
            "metrics": {
              "bleu": {
                "scores": {
                  "sentence": {
                    "value": 32.557034344436985,
                    "stats": {
                      "count": 200,
                      "sum": 6511.406868887398,
                      "mean": 32.557034344436985
                    }
                  },
                  "corpus": {
                    "value": 12.92432521051032
                  }
                }
              },
              "string-check": {
                "scores": {
                  "string-check": {
                    "value": 0.26,
     

In [49]:
# Extract bleu score and assert it's within range
initial_bleu_score = job_results.scores[benchmark_id].aggregated_results["tasks"]["qa"]["metrics"]["bleu"]["scores"]["corpus"]["value"]
print(f"Initial bleu score: {initial_bleu_score}")

assert initial_bleu_score >= 12

Initial bleu score: 12.92432521051032


In [50]:
# Extract accuracy and assert it's within range
initial_accuracy_score = job_results.scores[benchmark_id].aggregated_results["tasks"]["qa"]["metrics"]["string-check"]["scores"]["string-check"]["value"]
print(f"Initial accuracy: {initial_accuracy_score}")

assert initial_accuracy_score >= 0.2

Initial accuracy: 0.26


## Customization with chat dataset

Now that we've established our baseline Evaluation metrics for the chat-style dataset, we'll customize a model using our training data uploaded previously.

In [None]:
import subprocess

customized_chat_model_name = "test-messages-model"
# Use a unique version to avoid conflicts with existing models
customized_chat_model_version = f"v{int(time())}"
customized_chat_model_dir = f"{NAMESPACE}/{customized_chat_model_name}@{customized_chat_model_version}"

# NOTE: The output model name is derived from the environment variable in the LlamaStack deployment
# Update the LlamaStack deployment to use the new model version
result = subprocess.run(
    ["oc", "set", "env", "deployment/llamastack", 
     f"NVIDIA_OUTPUT_MODEL_DIR={customized_chat_model_dir}"],
    capture_output=True,
    text=True
)

if result.returncode == 0:
    print(f"✅ Updated LlamaStack deployment with model dir: {customized_chat_model_dir}")
    # Wait for the deployment to rollout
    subprocess.run(
        ["oc", "rollout", "status", "deployment/llamastack"],
        timeout=120
    )
    print("✅ LlamaStack deployment updated successfully")
else:
    print(f"❌ Failed to update deployment: {result.stderr}")
    raise Exception(f"Failed to update deployment: {result.stderr}")

# Also set it locally for reference
os.environ["NVIDIA_OUTPUT_MODEL_DIR"] = customized_chat_model_dir

✅ Updated LlamaStack deployment with model dir: nvidia-e2e-tutorial/test-messages-model@v1763646228
Waiting for deployment "llamastack" rollout to finish: 0 of 1 updated replicas are available...
deployment "llamastack" successfully rolled out
✅ LlamaStack deployment updated successfully


In [54]:
customized_chat_model_dir

'nvidia-e2e-tutorial/test-messages-model@v1763646228'

In [56]:
response = requests.post(
    f"{ENTITY_STORE_URL}/v1/datasets",
    json={
        "name": sample_squad_messages_dataset_name,  
        "namespace": NAMESPACE,
        "description": "Test sample_squad_messages dataset for NVIDIA E2E notebook",  # Also update description
        "files_url": f"hf://datasets/{repo_id}",
        "project": "tool_calling",
        "format": "json",
    },
)

In [57]:
response = client.alpha.post_training.supervised_fine_tune(
    job_uuid="",
    model="meta/llama-3.2-1b-instruct@v1.0.0+A100",  
    training_config={
        "n_epochs": 2,
        "data_config": {
            "batch_size": 16,
            "dataset_id": sample_squad_messages_dataset_name,
        },
        "optimizer_config": {
            "lr": 0.0001,
        }
    },
    algorithm_config={
        "type": "LoRA",
        "adapter_dim": 16,
        "adapter_dropout": 0.1,
        "alpha": 16,
        "rank": 8,
        "lora_attn_modules": [],
        "apply_lora_to_mlp": True,
        "apply_lora_to_output": False
    },
    hyperparam_search_config={},
    logger_config={},
    checkpoint_dir="",  
)

INFO:httpx:HTTP Request: POST http://localhost:8321/v1alpha/post-training/supervised-fine-tune "HTTP/1.1 200 OK"


In [58]:
job_id = response.job_uuid
print(f"Created job with ID: {job_id}")

Created job with ID: cust-DHndZjk1zmqeJDf7FMJuoe


In [59]:
job = wait_customization_job(job_id=job_id, polling_interval=30, timeout=3600)

INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/post-training/job/status?job_uuid=cust-DHndZjk1zmqeJDf7FMJuoe "HTTP/1.1 200 OK"


Waiting for Customization job cust-DHndZjk1zmqeJDf7FMJuoe to finish.
Job status: scheduled after 0.17351388931274414 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/post-training/job/status?job_uuid=cust-DHndZjk1zmqeJDf7FMJuoe "HTTP/1.1 200 OK"


Job status: in_progress after 30.762430906295776 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/post-training/job/status?job_uuid=cust-DHndZjk1zmqeJDf7FMJuoe "HTTP/1.1 200 OK"


Job status: in_progress after 61.298707008361816 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/post-training/job/status?job_uuid=cust-DHndZjk1zmqeJDf7FMJuoe "HTTP/1.1 200 OK"


Job status: in_progress after 91.80515718460083 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/post-training/job/status?job_uuid=cust-DHndZjk1zmqeJDf7FMJuoe "HTTP/1.1 200 OK"


Job status: in_progress after 122.314857006073 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/post-training/job/status?job_uuid=cust-DHndZjk1zmqeJDf7FMJuoe "HTTP/1.1 200 OK"


Job status: in_progress after 152.81673789024353 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/post-training/job/status?job_uuid=cust-DHndZjk1zmqeJDf7FMJuoe "HTTP/1.1 200 OK"


Job status: in_progress after 183.36792588233948 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/post-training/job/status?job_uuid=cust-DHndZjk1zmqeJDf7FMJuoe "HTTP/1.1 200 OK"


Job status: in_progress after 213.87998414039612 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/post-training/job/status?job_uuid=cust-DHndZjk1zmqeJDf7FMJuoe "HTTP/1.1 200 OK"


Job status: in_progress after 244.39159083366394 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/post-training/job/status?job_uuid=cust-DHndZjk1zmqeJDf7FMJuoe "HTTP/1.1 200 OK"


Job status: in_progress after 274.889102935791 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/post-training/job/status?job_uuid=cust-DHndZjk1zmqeJDf7FMJuoe "HTTP/1.1 200 OK"


Job status: in_progress after 305.4221601486206 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/post-training/job/status?job_uuid=cust-DHndZjk1zmqeJDf7FMJuoe "HTTP/1.1 200 OK"


Job status: completed after 335.8395550251007 seconds.


In [60]:
print(f"Job {job_id} status: {job_status}")

Job cust-DHndZjk1zmqeJDf7FMJuoe status: completed


In [61]:
# Check that the customized model has been picked up by NIM;
# We allow up to 5 minutes for the LoRA adapter to be loaded
wait_nim_loads_customized_model(model_id=customized_chat_model_dir)

Checking if NIM has loaded customized model nvidia-e2e-tutorial/test-messages-model@v1763646228.
Model nvidia-e2e-tutorial/test-messages-model@v1763646228 not available after 10.596751928329468 seconds.
Model nvidia-e2e-tutorial/test-messages-model@v1763646228 not available after 21.086602926254272 seconds.
Model nvidia-e2e-tutorial/test-messages-model@v1763646228 not available after 31.566166877746582 seconds.
Model nvidia-e2e-tutorial/test-messages-model@v1763646228 not available after 42.22009587287903 seconds.
Model nvidia-e2e-tutorial/test-messages-model@v1763646228 not available after 52.86898708343506 seconds.
Model nvidia-e2e-tutorial/test-messages-model@v1763646228 not available after 63.35778498649597 seconds.
Model nvidia-e2e-tutorial/test-messages-model@v1763646228 not available after 73.84983396530151 seconds.
Model nvidia-e2e-tutorial/test-messages-model@v1763646228 not available after 84.33940100669861 seconds.
Model nvidia-e2e-tutorial/test-messages-model@v1763646228 no

In [62]:
import requests

response = requests.post(
    f"{NIM_URL}/v1/completions",
    json={
        "model": customized_chat_model_dir,
        "prompt": "Complete the sentence using one word: Roses are red, violets are ",
        "temperature": 0.7,
        "top_p": 0.9,
        "max_tokens": 20,
    }
)

if response.status_code == 200:
    print(f"✅ Inference response: {response.json()['choices'][0]['text']}")
else:
    print(f"❌ Error: {response.status_code} - {response.text}")


✅ Inference response: -----------
A) blue
B) purple
C) white
D) yellow

The best answer


In [63]:
assert len(response.content) > 1

## Evaluate Customized Model with chat dataset

In [65]:
# Re-create benchmark ID with new timestamp (since LlamaStack pod restarted in cell 84)
benchmark_id = f"test-eval-config-chat-{int(time())}"
print(f"Creating new benchmark ID: {benchmark_id}")

Creating new benchmark ID: test-eval-config-chat-1763646850


In [66]:
# Re-define the eval config with the new benchmark ID
simple_eval_config = {
    "benchmark_id": benchmark_id,
    "dataset_id": "",
    "scoring_functions": [],
    "metadata": {
        "type": "custom",
        "params": {"parallelism": 8},
        "tasks": {
            "qa": {
                "type": "completion",
                "params": {
                    "template": {
                        "messages": [
                            {"role": "{{item.messages[0].role}}", "content": "{{item.messages[0].content}}"},
                            {"role": "{{item.messages[1].role}}", "content": "{{item.messages[1].content}}"},
                        ],
                        "max_tokens": 20,
                        "temperature": 0.7,
                        "top_p": 0.9,
                    },
                },
                "dataset": {"files_url": f"hf://datasets/{repo_id}/testing/testing.jsonl"},
                "metrics": {
                    "bleu": {
                        "type": "bleu",
                        "params": {"references": ["{{item.messages[2].content | trim}}"]},
                    },
                    "string-check": {
                        "type": "string-check",
                        "params": {"check": ["{{item.messages[2].content}}", "equals", "{{output_text | trim}}"]},
                    },
                },
            }
        }
    }
}

In [67]:
# Re-register the benchmark with LlamaStack (after pod restart)
response = client.alpha.benchmarks.register(
    benchmark_id=benchmark_id,
    dataset_id=repo_id,
    scoring_functions=simple_eval_config["scoring_functions"],
    metadata=simple_eval_config["metadata"]
)
print(f"✅ Registered benchmark {benchmark_id}")

  response = client.alpha.benchmarks.register(
INFO:httpx:HTTP Request: POST http://localhost:8321/v1alpha/eval/benchmarks "HTTP/1.1 200 OK"


✅ Registered benchmark test-eval-config-chat-1763646850


In [68]:
# Launch evaluation for customized model
response = client.alpha.eval.run_eval(
    benchmark_id=benchmark_id,
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": customized_chat_model_dir,
            "sampling_params": {}
        }
    }
)
job_id = response.model_dump()["job_id"]
print(f"Created evaluation job {job_id}")

INFO:httpx:HTTP Request: POST http://localhost:8321/v1alpha/eval/benchmarks/test-eval-config-chat-1763646850/jobs "HTTP/1.1 200 OK"


Created evaluation job eval-4KiU7JesTLyr1uPWg1WVwM


In [69]:
job = wait_eval_job(benchmark_id=benchmark_id, job_id=job_id, polling_interval=5, timeout=600)

INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/eval/benchmarks/test-eval-config-chat-1763646850/jobs/eval-4KiU7JesTLyr1uPWg1WVwM "HTTP/1.1 200 OK"


Waiting for Evaluation job eval-4KiU7JesTLyr1uPWg1WVwM to finish.
Job status: Job(job_id='eval-4KiU7JesTLyr1uPWg1WVwM', status='in_progress') after 0.16409921646118164 seconds.


INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/eval/benchmarks/test-eval-config-chat-1763646850/jobs/eval-4KiU7JesTLyr1uPWg1WVwM "HTTP/1.1 200 OK"


Job status: Job(job_id='eval-4KiU7JesTLyr1uPWg1WVwM', status='completed') after 5.662703275680542 seconds.


In [70]:
job_results = client.alpha.eval.jobs.retrieve(benchmark_id=benchmark_id, job_id=job_id)
print(f"Job results: {json.dumps(job_results.model_dump(), indent=2)}")

INFO:httpx:HTTP Request: GET http://localhost:8321/v1alpha/eval/benchmarks/test-eval-config-chat-1763646850/jobs/eval-4KiU7JesTLyr1uPWg1WVwM/result "HTTP/1.1 200 OK"


Job results: {
  "generations": [],
  "scores": {
    "test-eval-config-chat-1763646850": {
      "aggregated_results": {
        "created_at": "2025-11-20T13:54:13.686229",
        "updated_at": "2025-11-20T13:54:13.686230",
        "id": "evaluation_result-4y532gccosY7k2ddzGmNpe",
        "job": "eval-4KiU7JesTLyr1uPWg1WVwM",
        "tasks": {
          "qa": {
            "metrics": {
              "bleu": {
                "scores": {
                  "sentence": {
                    "value": 68.94887522892891,
                    "stats": {
                      "count": 200,
                      "sum": 13789.775045785784,
                      "mean": 68.94887522892891
                    }
                  },
                  "corpus": {
                    "value": 53.23802090091468
                  }
                }
              },
              "string-check": {
                "scores": {
                  "string-check": {
                    "value": 0.545,
     

In [71]:
# Extract bleu score and assert it's within range
customized_bleu_score = job_results.scores[benchmark_id].aggregated_results["tasks"]["qa"]["metrics"]["bleu"]["scores"]["corpus"]["value"]
print(f"Customized bleu score: {customized_bleu_score}")

assert customized_bleu_score >= 40

Customized bleu score: 53.23802090091468


In [72]:
# Extract accuracy and assert it's within range
customized_accuracy_score = job_results.scores[benchmark_id].aggregated_results["tasks"]["qa"]["metrics"]["string-check"]["scores"]["string-check"]["value"]
print(f"Customized accuracy: {customized_accuracy_score}")

assert customized_accuracy_score >= 0.47

Customized accuracy: 0.545


In [73]:
# Ensure the customized model evaluation is better than the original model evaluation
print(f"customized_bleu_score - initial_bleu_score: {customized_bleu_score - initial_bleu_score}")
assert (customized_bleu_score - initial_bleu_score) >= 20

print(f"customized_accuracy_score - initial_accuracy_score: {customized_accuracy_score - initial_accuracy_score}")
assert (customized_accuracy_score - initial_accuracy_score) >= 0.2

customized_bleu_score - initial_bleu_score: 40.31369569040436
customized_accuracy_score - initial_accuracy_score: 0.28500000000000003
