## Model Selection and Parameter Optimization

In this notebook, we will demonstrate how the NVIDIA NeMo Agent toolkit (NAT) optimizer can be used to create a robust model evaluation, comparison, and selection pipeline for custom datasets.

**Goal**: Show the minimal configuration needed to use the NAT optimizer module, and walk through a practical example that evaluates and compares the performance of several LLMs. By the end, NAT users should understand this feature well enough to create similar workflows in their own organizations.

## Table of Contents
 
- [0.0) Setup](#setup)
  - [0.1) Prerequisites](#prereqs)
  - [0.2) API Keys](#api-keys)
  - [0.3) Installing NeMo Agent Toolkit](#install-nat)
  - [0.4) Additional dependencies](#deps)
- [1.0) LLM-as-a-judge with NAT](#llm-judge-h1)
  - [1.1) Create a new workflow](#new-workflow)
  - [1.2) Head-to-head comparison of multiple LLMs using eval](#nat-eval)
    - [1.2.1) LLM-as-a-judge workflow config](#config)
    - [1.2.2) Create an eval dataset](#dataset)
    - [1.2.3) Run the optimizer](#optimize-first)
    - [1.2.4) Interpret first optimizer run](#interpret-optimizer-first)
- [2.0) Optimized model selection for tool-calling agents](#optimize-tool-calling-agents)
  - [2.1) Create a tool-calling agent](#create-triage-agent)
  - [2.2) Configure the tool-calling agent](#configure-triage-agent)
  - [2.3) Test the tool-calling agent](#test-triage-agent)
  - [2.4) Evaluate the tool-calling agent](#eval-triage-agent1)
  - [2.5) Optimize the tool-calling agent's LLM](#optimize-triage-agent)
  - [2.6) Re-evaluate the optimized tool-calling agent](#eval-triage-agent2)
- [3.0) Next steps](#next-steps)

<a id="setup"></a>
# 0.0) Setup

<a id="prereqs"></a>
## 0.1) Prerequisites

We strongly recommend that users begin this notebook with a working understanding of NAT workflows. Please refer to earlier iterations of this notebook series prior to beginning this notebook.

- **Platform:** Linux, macOS, or Windows
- **Python:** version 3.11, 3.12, or 3.13
- **Python Packages:** `pip`

<a id="api-keys"></a>
## 0.2) API Keys

For this notebook, you will need the following API keys to run all examples end-to-end:

- **NVIDIA Build:** You can obtain an NVIDIA Build API Key by creating an [NVIDIA Build](https://build.nvidia.com) account and generating a key at https://build.nvidia.com/settings/api-keys

Then you can run the cell below:

In [None]:
import getpass
import os

if "NVIDIA_API_KEY" not in os.environ:
    nvidia_api_key = getpass.getpass("Enter your NVIDIA API key: ")
    os.environ["NVIDIA_API_KEY"] = nvidia_api_key

<a id="install-nat"></a>
## 0.3) Installing NeMo Agent Toolkit

The recommended way to install NAT is through `pip` or `uv pip`.

First, we will install `uv`, which offers parallel downloads and faster dependency resolution.

In [None]:
!pip install uv

NeMo Agent toolkit can be installed through the PyPI `nvidia-nat` package.

There are several optional subpackages available for NAT. For this example, we will rely on two subpackages:
* The `nvidia-nat[langchain]` subpackage contains components for integrating with [LangChain](https://python.langchain.com/docs/introduction/).
* The `nvidia-nat[profiling]` subpackage contains components for profiling and performance analysis.

In [None]:
!uv pip install "nvidia-nat[langchain,profiling]"

<a id="deps"></a>
## 0.4) Additional dependencies

In [None]:
# needed for the alert triage agent used later
!uv pip install ansible-runner

<div style="color: red; font-style: italic;">
<strong>Note:</strong> Uncomment and run this cell to install git-lfs if using Google Colab.
</div>

In [None]:
# !apt-get update
# !apt-get install git git-lfs -y
# !git lfs install

<a id="llm-judge-h1"></a>
# 1.0) LLM-as-a-judge with NAT

The `nat eval` and `nat optimize` utilities enable developers to easily integrate LLM-as-a-judge capabilities with their workflows. `nat eval` allows for simple evaluations of a NAT workflow against an eval dataset. `nat optimize` extends this functionality by integrating with the **Optuna** library to perform grid and stochastic parameter sweeps and evaluations to identify optimal configurations for a task.

**Note:** _In this notebook, we primarily demonstrate how to use `nat optimize` to identify a potentially optimal set of parameters for a NAT workflow. We assume that users already have a strong understanding of ML model evaluation before building this concept into their workflows, because we do not cover cross-validation or train, validation, and test splitting of datasets. The documentation for Python's [scikit-learn](https://scikit-learn.org/stable/) package is a strong reference for these concepts._

<a id="new-workflow"></a>
## 1.1) Create a new workflow

Create a basic chat completions workflow (using LangChain chat completions on the backend).

In [None]:
!nat workflow create tmp_workflow --description "A simple chat completion workflow to compare model performance"

Next, we write a minimal configuration for this workflow. Note the workflow type and the LLM it uses, and that no tools or additional functions are configured.

In [None]:
%%writefile ./tmp_workflow/configs/config_a.yml
llms:
  nim_llm:
    _type: nim
    model_name: meta/llama-3.1-8b-instruct
    temperature: 0.7
    max_tokens: 1024

workflow:
  _type: chat_completion  # Use the type directly
  system_prompt: |
    You are a helpful AI assistant. Provide clear, accurate, and helpful 
    responses to user queries. Be concise and informative.
  llm_name: nim_llm

Now let's run this workflow for a simple Q&A example...

In [None]:
!nat run --config_file tmp_workflow/configs/config_a.yml --input "Suggest a single name for my new dog"

<a id="nat-eval"></a>
## 1.2) Head-to-head comparison of multiple LLMs using eval

Now that we've made a new workflow and shown that it works for a cursory `nat run` example, we will begin to build out an LLM-as-a-judge evaluation with trace profiling enabled for additional observability. In this next section, we are going to update the workflow configuration for evaluation and profiling.

Step-by-step instructions can be found in [4_observability_evaluation_and_profiling.ipynb](./4_observability_evaluation_and_profiling.ipynb). An end-to-end example of using the Optimizer can be viewed in the [Email Phishing Analyzer](https://github.com/NVIDIA/NeMo-Agent-Toolkit/blob/develop/examples/evaluation_and_profiling/email_phishing_analyzer/src/nat_email_phishing_analyzer/configs/config_optimizer.yml).

The profiler instruments and measures your workflow's performance, while evaluators judge the quality of the outputs. They're separate concepts, so they belong in different sections of the config!

In the next step, we place the evaluation and profiler settings in a single config file for brevity; they still live in separate sections of that file.

<a id="config"></a>
### 1.2.1) LLM-as-a-judge workflow config

In the cell below we edit our initial workflow configuration to include `eval` and `optimizer` configurations.

Key components of this configuration:

**LLM Configuration:**
- `chat_completion_llm`: The backbone LLM that powers the workflow
- `optimizable_params`: Specifies which parameters the optimizer can tune (model name, temperature)
- `search_space`: Defines the values the optimizer will explore during optimization

**Judge LLM:**
- `nim_judge_llm`: A separate, more capable LLM (meta/llama-3.1-405b-instruct) used by the evaluator to assess the quality of the workflow's outputs
  - This LLM acts as an "LLM-as-a-judge" to score responses

**Evaluation Components:**
- `evaluators`: Define metrics to measure workflow quality (for example, accuracy, relevance)
- `profiler`: Instruments the workflow to collect performance metrics (latency, token usage, costs)

**Optimizer Components:**
- `reps_per_param_set`: Number of times to evaluate each parameter combination for statistical reliability
- `sampler: grid`: Strategy for exploring the search space (grid search tests all combinations)
- `eval_metrics`: Metrics used to guide optimization decisions (for example, maximize accuracy while minimizing cost)
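
To get a feel for how much work a grid search implies, the sketch below enumerates a search space like the one configured in the next cell and counts the resulting workflow invocations. It is plain Python and does not call NAT. The variable names, and the assumption that every dataset entry is run once per repetition, are illustrative.

```python
from itertools import product

# Search space mirroring the values used in this section's config.
search_space = {
    "model_name": ["meta/llama-3.1-8b-instruct", "meta/llama-3.1-70b-instruct"],
    "temperature": [0.0, 0.7],
}
reps_per_param_set = 10  # same value as in the optimizer section of the config
dataset_size = 20        # number of entries in the eval dataset

# A grid search visits the Cartesian product of all candidate values.
param_sets = [dict(zip(search_space, combo)) for combo in product(*search_space.values())]

# Assumption: each repetition runs the workflow once per dataset entry.
total_workflow_runs = len(param_sets) * reps_per_param_set * dataset_size

for params in param_sets:
    print(params)
print(f"{len(param_sets)} parameter sets, {total_workflow_runs} workflow runs")
```

Judge-LLM calls come on top of these workflow runs, so trimming the search space or lowering `reps_per_param_set` is the simplest way to shorten an experiment.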

In [None]:
%%writefile tmp_workflow/configs/config_b.yml
llms:
  chat_completion_llm:
    _type: nim
    model_name: meta/llama-3.1-8b-instruct
    temperature: 0.0
    max_tokens: 1024
    optimizable_params:
      - model_name
      - temperature
    search_space:
      model_name:
        values:
          - meta/llama-3.1-8b-instruct
          - meta/llama-3.1-70b-instruct
      temperature:
        values:
          - 0.0
          - 0.7

  # Judge LLM for accuracy evaluation
  nim_judge_llm:
    _type: nim
    model_name: meta/llama-3.1-405b-instruct
    temperature: 0.0
    max_tokens: 8  # RAGAS accuracy only needs a score (0-1)

workflow:
  _type: chat_completion
  system_prompt: |
    You are a helpful AI assistant. Provide clear, accurate, and helpful 
    responses to user queries. Be concise and informative.
  llm_name: chat_completion_llm

general:
  telemetry:
    logging:
      console:
        _type: console
        level: INFO

eval:
  general:
    output_dir: ./tmp_workflow/eval_output
    verbose: true
    dataset:
        _type: json
        file_path: ./tmp_workflow/data/eval_data.json

  evaluators:
    answer_accuracy:
      _type: ragas
      metric: AnswerAccuracy
      llm_name: nim_judge_llm
    llm_latency:
      _type: avg_llm_latency
    token_efficiency:
      _type: avg_tokens_per_llm_end

  profiler:
      token_uniqueness_forecast: true
      workflow_runtime_forecast: true
      compute_llm_metrics: true
      csv_exclude_io_text: true
      prompt_caching_prefixes:
        enable: true
        min_frequency: 0.1
      bottleneck_analysis:
        enable_nested_stack: true
      concurrency_spike_analysis:
        enable: true
        spike_threshold: 7

optimizer:
  output_path: ./tmp_workflow/eval_output/optimizer/
  reps_per_param_set: 10 # Number of times to evaluate EACH config (for statistical significance)
  eval_metrics:
    accuracy:
      evaluator_name: answer_accuracy  # References the evaluator above
      direction: maximize
    token_efficiency:
      evaluator_name: token_efficiency
      direction: minimize
    latency:
      evaluator_name: llm_latency
      direction: minimize

  numeric:
    enabled: true
    sampler: grid # exhaustively evaluates every combination in the search space

  prompt:
    enabled: false  # Disable for pure model comparison

<a id="dataset"></a>
### 1.2.2) Create an eval dataset

The dataset below is intended to be difficult for simple LLM chat completions, because:
- Math calculations (questions 1, 2, 5, 7, 9, 11, 13, 14, 16, 18, 20) require precise arithmetic that LLMs often struggle with
- Real-time data and web search queries (questions 3, 8, 12, 15, 19) need current information beyond the model's training cutoff
- Factual knowledge (questions 4, 6, 10, 17) may be outdated or incorrect without access to recent data
- Multi-step reasoning (questions 2, 7, 11, 14, 16) requires combining multiple operations accurately

In [None]:
%%writefile tmp_workflow/data/eval_data.json
[
    {
        "id": "1",
        "question": "What is 15% of 847?",
        "answer": "The answer is 127.05"
    },
    {
        "id": "2", 
        "question": "If I invest $10,000 at 5% annual interest compounded monthly for 3 years, how much will I have?",
        "answer": "Approximately $11,614.72"
    },
    {
        "id": "3",
        "question": "What is the current weather in Tokyo?",
        "answer": "This requires real-time weather data for Tokyo, Japan."
    },
    {
        "id": "4",
        "question": "Who won the FIFA World Cup in 2022 and where was it held?",
        "answer": "Argentina won the 2022 FIFA World Cup, which was held in Qatar."
    },
    {
        "id": "5",
        "question": "Calculate the average of these numbers: 23, 45, 67, 89, 12, 34",
        "answer": "The average is 45"
    },
    {
        "id": "6",
        "question": "What is the capital of Australia and what is its approximate population?",
        "answer": "Canberra is the capital of Australia with a population of approximately 460,000 people."
    },
    {
        "id": "7",
        "question": "If a train travels 120 miles in 2 hours, then 180 miles in 3 hours, what is its average speed over the entire journey?",
        "answer": "The average speed is 60 miles per hour (300 miles / 5 hours)."
    },
    {
        "id": "8",
        "question": "Search for information about the latest NASA Mars mission and summarize the key findings.",
        "answer": "Requires web search for current NASA Mars mission information and synthesis of findings."
    },
    {
        "id": "9",
        "question": "What is 2 to the power of 10?",
        "answer": "1024"
    },
    {
        "id": "10",
        "question": "Who is the current CEO of Microsoft and when did they take the position?",
        "answer": "Satya Nadella has been CEO of Microsoft since February 2014."
    },
    {
        "id": "11",
        "question": "Convert 100 degrees Fahrenheit to Celsius and then to Kelvin.",
        "answer": "100°F = 37.78°C = 310.93K"
    },
    {
        "id": "12",
        "question": "Find the top 3 most popular programming languages in 2024 according to recent developer surveys.",
        "answer": "Requires web search for recent programming language popularity surveys from 2024."
    },
    {
        "id": "13",
        "question": "What is the square root of 289?",
        "answer": "17"
    },
    {
        "id": "14",
        "question": "If I start with $1000 and lose 20%, then gain 20% on the new amount, how much do I have?",
        "answer": "$960 (First: $1000 - 20% = $800, Then: $800 + 20% = $960)"
    },
    {
        "id": "15",
        "question": "What are the main differences between Python 3.11 and Python 3.12? Search for official documentation.",
        "answer": "Requires web search for Python 3.12 release notes and feature comparison."
    },
    {
        "id": "16",
        "question": "Calculate: (15 + 25) × 3 - 48 ÷ 6",
        "answer": "112"
    },
    {
        "id": "17",
        "question": "What is the chemical formula for water and what are its key properties?",
        "answer": "H2O. Key properties include: polar molecule, high specific heat capacity, excellent solvent, density maximum at 4°C."
    },
    {
        "id": "18",
        "question": "How many days are there between January 15, 2024 and March 30, 2024?",
        "answer": "75 days"
    },
    {
        "id": "19",
        "question": "Search for the latest NVIDIA GPU announcement and tell me the model name and key specifications.",
        "answer": "Requires web search for recent NVIDIA GPU announcements."
    },
    {
        "id": "20",
        "question": "If a rectangle has a length of 12 cm and a width of 8 cm, what is its area and perimeter?",
        "answer": "Area: 96 cm², Perimeter: 40 cm"
    }
]

<a id="optimize-first"></a>
### 1.2.3) Run the optimizer
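
Because an optimizer run can take a while, it may be worth sanity-checking the dataset first. The sketch below is a hypothetical helper, not part of NAT: it checks that every entry has a non-empty `id`, `question`, and `answer` (the keys used in the dataset above) and that no `id` is repeated.

```python
import json
from pathlib import Path


def validate_eval_dataset(entries):
    """Return a list of human-readable problems found in the dataset entries."""
    problems = []
    seen_ids = set()
    for index, entry in enumerate(entries):
        # Every entry in this notebook's dataset carries these three keys.
        for key in ("id", "question", "answer"):
            value = entry.get(key)
            if value is None or not str(value).strip():
                problems.append(f"entry {index}: missing or empty '{key}'")
        entry_id = entry.get("id")
        if entry_id is not None:
            if entry_id in seen_ids:
                problems.append(f"entry {index}: duplicate id '{entry_id}'")
            seen_ids.add(entry_id)
    return problems


dataset_path = Path("tmp_workflow/data/eval_data.json")
if dataset_path.exists():
    issues = validate_eval_dataset(json.loads(dataset_path.read_text(encoding="utf-8")))
    print("Dataset looks consistent" if not issues else "\n".join(issues))
else:
    print(f"Dataset not found at {dataset_path}")
```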

In [None]:
!nat optimize --config_file tmp_workflow/configs/config_b.yml

<a id="interpret-optimizer-first"></a>
### 1.2.4) Interpret first optimizer run

**Understanding Evaluation Outputs**

This evaluation will have generated four artifacts for analysis at the `output_dir` specified in `config_b.yml`:
 - **`answer_accuracy_output.json`**
 - **`workflow_output.json`**
 - **`llm_latency_output.json`**
 - **`token_efficiency_output.json`**

**Interpreting `answer_accuracy_output.json`**

The `answer_accuracy_output.json` file contains the results of the RAGAS `AnswerAccuracy` evaluation.

**Top-level fields:**
- **`average_score`** - Mean answer accuracy score across all evaluated examples (0.0 to 1.0)
- **`eval_output_items`** - Array of individual evaluation results for each test case

**Per-item fields:**
- **`id`** - Unique identifier for the test case
- **`score`** - Answer accuracy score for this specific example (0.0 to 1.0)
- **`reasoning`** - Evaluator-specific details for this example, or a string containing an error message if evaluation failed

The answer accuracy evaluator uses the judge LLM (`nim_judge_llm`) to assess how closely each generated answer agrees with the reference answer in the dataset.
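
A quick way to use an evaluator output file is to look at the worst-scoring examples. The sketch below relies only on the `average_score`, `eval_output_items`, `id`, and `score` fields described above. The file path is an assumption based on the `output_dir` in `config_b.yml` and may differ in your run, and `lowest_scoring_items` is a hypothetical helper.

```python
import json
from pathlib import Path


def lowest_scoring_items(results, n=3):
    """Return the n items with the lowest numeric score from an evaluator output dict."""
    items = results.get("eval_output_items", [])
    # Skip items whose score is missing, e.g. because evaluation failed.
    scored = [item for item in items if isinstance(item.get("score"), (int, float))]
    return sorted(scored, key=lambda item: item["score"])[:n]


# Assumed location: the evaluator file inside the configured output_dir.
output_file = Path("tmp_workflow/eval_output/answer_accuracy_output.json")
if output_file.exists():
    results = json.loads(output_file.read_text(encoding="utf-8"))
    print(f"average_score: {results.get('average_score')}")
    for item in lowest_scoring_items(results):
        print(f"  id={item['id']} score={item['score']}")
else:
    print(f"No evaluator output found at {output_file}")
```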

**Interpreting `workflow_output.json`**

The `workflow_output.json` file contains the raw execution results from running the workflow on each test case.

**Top-level fields:**
- **`output_items`** - Array of workflow execution results for each test case in the dataset

**Per-item fields:**
- **`id`** - Unique identifier matching the test case ID
- **`input_obj`** - The input question or prompt sent to the workflow
- **`output_obj`** - The final answer generated by the workflow
- **`trajectory`** - Detailed execution trace containing:
  - **`event_type`** - Type of event (e.g., `LLM_START`, `LLM_END`, `TOOL_START`, `TOOL_END`, `SPAN_START`, `SPAN_END`)
  - **`event_timestamp`** - Unix timestamp of when the event occurred
  - **`metadata`** - Event-specific data including:
    - Tool names and inputs
    - LLM prompts and responses
    - Token counts (`prompt_tokens`, `completion_tokens`)
    - Model names
    - Function names
    - Error information

The workflow output provides complete observability into each execution, enabling detailed analysis of agent behavior, performance profiling, and debugging.
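
As a starting point for this kind of analysis, the sketch below counts event types in one item's trajectory. It uses a small hand-written item rather than real output. Where exactly `event_type` sits in your file is an assumption: the helper checks the top level and a nested `payload` key, so adjust it to match what you see.

```python
from collections import Counter


def event_type_of(event):
    """Read the event type from an event dict.

    Assumption: the type sits either at the top level (as described above)
    or under a nested "payload" key; adjust this to match your file.
    """
    if "event_type" in event:
        return event["event_type"]
    return event.get("payload", {}).get("event_type", "UNKNOWN")


def count_events(item):
    """Count how often each event type appears in one item's trajectory."""
    return Counter(event_type_of(event) for event in item.get("trajectory", []))


# Small hand-written item in the shape described above (not real output).
example_item = {
    "id": "1",
    "trajectory": [
        {"event_type": "LLM_START"},
        {"event_type": "LLM_END"},
        {"payload": {"event_type": "TOOL_START"}},
        {"payload": {"event_type": "TOOL_END"}},
    ],
}
print(dict(count_events(example_item)))
```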

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
import ast

# Load the optimizer results
trials_df_path = Path("tmp_workflow/eval_output/optimizer/trials_dataframe_params.csv")

if trials_df_path.exists():
    trials_df = pd.read_csv(trials_df_path)
    
    print("Grid Search Optimization Results")
    print("=" * 80)
    print("\nTrials Summary:")
    print(trials_df.to_string(index=False))
    
    print("\n" + "=" * 80)
    print("\nModel Performance Statistics (Mean across repetitions):")
    print("-" * 80)
    
    # Group by model name to calculate statistics across repetitions
    for model_name in trials_df['params_llms.chat_completion_llm.model_name'].unique():
        model_trials = trials_df[trials_df['params_llms.chat_completion_llm.model_name'] == model_name]
        
        print(f"\n{model_name}:")
        
        # Parse rep_scores to extract individual repetition metrics
        if 'rep_scores' in model_trials.columns:
            all_accuracies = []
            all_token_efficiencies = []
            all_latencies = []
            
            for rep_scores_str in model_trials['rep_scores']:
                try:
                    rep_scores = ast.literal_eval(str(rep_scores_str))
                except (ValueError, SyntaxError, TypeError):
                    continue
                for score_set in rep_scores:
                    # score_set format: [accuracy, token_efficiency, latency]
                    all_accuracies.append(score_set[0])
                    all_token_efficiencies.append(score_set[1])
                    all_latencies.append(score_set[2])
            
            # Calculate mean and standard deviation
            def calculate_stats(values):
                mean = np.mean(values)
                std = np.std(values)
                ci_lower = np.percentile(values, 2.5)
                ci_upper = np.percentile(values, 97.5)
                return mean, std, ci_lower, ci_upper
            
            acc_mean, acc_std, acc_ci_lower, acc_ci_upper = calculate_stats(all_accuracies)
            tok_mean, tok_std, tok_ci_lower, tok_ci_upper = calculate_stats(all_token_efficiencies)
            lat_mean, lat_std, lat_ci_lower, lat_ci_upper = calculate_stats(all_latencies)
            
            print(f"  Accuracy:")
            print(f"    Mean: {acc_mean:.3f} (±{acc_std:.3f})")
            print(f"    95% CI: [{acc_ci_lower:.3f}, {acc_ci_upper:.3f}]")
            
            print(f"  Token Efficiency:")
            print(f"    Mean: {tok_mean:.3f} (±{tok_std:.3f})")
            print(f"    95% CI: [{tok_ci_lower:.3f}, {tok_ci_upper:.3f}]")
            
            print(f"  Latency:")
            print(f"    Mean: {lat_mean:.3f} (±{lat_std:.3f})")
            print(f"    95% CI: [{lat_ci_lower:.3f}, {lat_ci_upper:.3f}]")
        else:
            # Fallback to aggregated values if rep_scores not available
            # values_0 = accuracy, values_1 = token_efficiency, values_2 = latency
            acc_mean = np.mean(model_trials['values_0'])
            tok_mean = np.mean(model_trials['values_1'])
            lat_mean = np.mean(model_trials['values_2'])
            
            print(f"  Accuracy (mean): {acc_mean:.3f}")
            print(f"  Token Efficiency (mean): {tok_mean:.3f}")
            print(f"  Latency (mean): {lat_mean:.3f}")
            print("  Note: 95% CI not available without rep_scores data")
    
    print("\n" + "=" * 80)
    print("\nBest Configuration (by aggregated accuracy across all repetitions):")
    # Find the trial with best aggregated accuracy
    best_trial = trials_df.loc[trials_df['values_0'].idxmax()]
    print(f"Model: {best_trial['params_llms.chat_completion_llm.model_name']}")
    print(f"Temperature: {best_trial['params_llms.chat_completion_llm.temperature']}")
    print(f"Aggregated Accuracy Score: {best_trial['values_0']}")
    print(f"Aggregated Token Efficiency: {best_trial['values_1']}")
    print(f"Aggregated Latency: {best_trial['values_2']}")
else:
    print(f"Optimizer results not found at {trials_df_path}")
    print("Please run the optimizer first (section 1.2.3)")


The results above show:
 
**Grid Search Optimization Summary:**
- The optimizer evaluated all combinations of models and temperatures defined in the search space
- Each configuration was tested multiple times (repetitions) to account for variability
- Three key metrics were tracked: accuracy, token efficiency (tokens used), and latency (response time)
 
**Understanding the Statistics:**
- **Mean**: Average performance across all repetitions for each model
- **Standard Deviation (±)**: Measure of variability in performance
- **95% Confidence Interval**: As computed in the cell above, the range between the 2.5th and 97.5th percentiles of the per-repetition scores, where we expect 95% of results to fall
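
The interval printed by the analysis cell is taken from the 2.5th and 97.5th percentiles of the per-repetition scores, which describes the spread of individual results. The sketch below uses only the standard library and hypothetical scores to contrast that with a normal-approximation confidence interval for the mean. The function names are illustrative.

```python
import math
import statistics


def percentile(values, q):
    """Linearly interpolated percentile, with q in [0, 100]."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def summarize(values):
    """Return the mean, the spread of the scores, and an approximate CI of the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std, like np.std's default
    spread = (percentile(values, 2.5), percentile(values, 97.5))
    # Normal-approximation 95% CI of the mean (rough for small samples).
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
    return {
        "mean": mean,
        "std": std,
        "percentile_range": spread,
        "mean_ci": (mean - half_width, mean + half_width),
    }


# Hypothetical per-repetition accuracy scores for one model.
scores = [0.60, 0.70, 0.65, 0.80, 0.75]
print(summarize(scores))
```

The percentile range answers "how much do individual runs vary?", while the interval around the mean answers "how well do we know the average?". The second one narrows as more repetitions are added.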

**Key Insights:**
- Different models show different trade-offs between accuracy, efficiency, and speed
- Temperature settings affect response variability and quality
- The "Best Configuration" printed above is the trial with the highest aggregated accuracy; its token efficiency and latency are reported alongside but do not influence that choice
 
**Interpreting Your Results:**
When you run this optimization, look for:
- Which model/temperature combination achieves the highest aggregated accuracy
- How token efficiency varies between models (lower is more efficient)
- Latency differences (lower is faster)
- The confidence intervals to understand result stability
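
When no single configuration wins on every metric, one common way to compare trials is to keep only those that are not dominated by another trial (the Pareto front). The sketch below illustrates the idea with made-up numbers. It is not a description of how `nat optimize` picks its final configuration, and the dictionary keys are hypothetical.

```python
def dominates(a, b):
    """True if trial a is at least as good as b everywhere and better somewhere.

    Accuracy is maximized; tokens and latency are minimized.
    """
    at_least_as_good = (
        a["accuracy"] >= b["accuracy"]
        and a["tokens"] <= b["tokens"]
        and a["latency"] <= b["latency"]
    )
    strictly_better = (
        a["accuracy"] > b["accuracy"]
        or a["tokens"] < b["tokens"]
        or a["latency"] < b["latency"]
    )
    return at_least_as_good and strictly_better


def pareto_front(trials):
    """Keep only the trials that no other trial dominates."""
    return [t for t in trials if not any(dominates(o, t) for o in trials if o is not t)]


# Made-up numbers for illustration only; substitute your own trial results.
trials = [
    {"name": "small, T=0.0", "accuracy": 0.55, "tokens": 120, "latency": 0.8},
    {"name": "small, T=0.7", "accuracy": 0.50, "tokens": 140, "latency": 0.9},
    {"name": "large, T=0.0", "accuracy": 0.70, "tokens": 150, "latency": 2.1},
    {"name": "large, T=0.7", "accuracy": 0.68, "tokens": 160, "latency": 2.3},
]
for trial in pareto_front(trials):
    print(trial["name"])
```

Every trial left on the front represents a defensible trade-off; choosing among them depends on how much accuracy is worth in extra tokens and latency for your use case.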

The optimizer automatically selects the best configuration and saves it to `optimized_config.yml` for use in production.

<a id="optimize-tool-calling-agents"></a>
# 2.0) Optimized model selection for tool-calling agents

<a id="create-triage-agent"></a>
## 2.1) Create a tool-calling agent
As the dataset above suggests, plain chat completion requests are often inadequate for real-world applications without agentic tool calling. For the next exercise, we therefore build a similar optimization pipeline for an advanced tool-calling agent: the [Alert Triage Agent](https://github.com/NVIDIA/NeMo-Agent-Toolkit/tree/develop/examples/advanced_agents/alert_triage_agent). This agent uses tool calling to automate the triage of server-monitoring alerts and demonstrates how to build an intelligent troubleshooting workflow using NeMo Agent toolkit and LangGraph.

The Alert Triage Agent is an advanced example that demonstrates:
- **Multi-tool orchestration** - Dynamically selects and uses diagnostic tools
- **Structured report generation** - Creates comprehensive analysis reports
- **Root cause categorization** - Classifies alerts into predefined categories
- **Offline evaluation mode** - Test with synthetic data before live deployment

We aim to demonstrate the power of model evaluation and optimization on agentic AI platforms. Many foundation models could serve as your agent's backbone, and academic benchmarks do not always reflect how a model will perform on your organization's data (research on training data leakage and domain shift provides further motivation).

In [None]:
# Simple input prompt for branch selection
print("=" * 60)
print("Alert Triage Agent Installation")
print("=" * 60)
print("\nOptions:")
print("  - Enter 'local' for editable install from local repository")
print("  - Enter a branch name (e.g., 'develop', 'main') for git install")
print("=" * 60)

branch_name = input("\nEnter your choice: ").strip()

if branch_name.lower() == 'local':
    # Local editable install
    print("\nInstalling alert triage agent in editable mode from local repository...")
    
    # Try to find the local path relative to current directory
    from pathlib import Path
    # path-check-skip-next-line
    local_path = Path('../../examples/advanced_agents/alert_triage_agent')
    
    if local_path.exists():
        get_ipython().system(f'pip install -e {local_path}')
        print(f"✓ Installed from local path: {local_path.absolute()}")
    else:
        print(f"✗ Error: Local path not found: {local_path.absolute()}")
        print("Make sure you're running this from the correct directory")
else:
    # Git install from specified branch
    print(f"\nInstalling alert triage agent from branch: {branch_name}")
    get_ipython().system(f'pip install --no-deps "git+https://github.com/NVIDIA/NeMo-Agent-Toolkit.git@{branch_name}#subdirectory=examples/advanced_agents/alert_triage_agent"')
    print(f"✓ Installed from git branch: {branch_name}")

print("\n" + "=" * 60)

In [None]:
import importlib.resources
from pathlib import Path

# Find the installed package data directory
package_data = importlib.resources.files('nat_alert_triage_agent').joinpath('data')

maintenance_csv = str(package_data / 'maintenance_static_dataset.csv')
offline_csv = str(package_data / 'offline_data.csv')
benign_json = str(package_data / 'benign_fallback_offline_data.json')
offline_json = str(package_data / 'offline_data.json')

print(f"Package data directory: {package_data}")

<a id="configure-triage-agent"></a>
## 2.2) Configure the tool-calling agent

**Configuring the Alert Triage Agent**

The Alert Triage Agent requires several components:

1. **Diagnostic Tools** - Hardware checks, network connectivity, performance monitoring, telemetry analysis
2. **Sub-agents** - Telemetry metrics analysis agent that coordinates multiple telemetry tools
3. **Categorizer** - Classifies root causes into predefined categories
4. **Maintenance Check** - Filters out alerts during maintenance windows

We'll create a **local configuration file** and run in **offline mode** using synthetic data.

The configuration file lists the LLMs that the optimizer will compare. Each of these 11 models is used in turn as the agent's backbone LLM for reasoning steps. The `tool_reasoning_llm` and `nim_rag_eval_llm` remain fixed to `meta/llama-3.1-70b-instruct`, but in a modified evaluation these models could be evaluated as well.
```
- Meta: llama-3.1-8b-instruct
- Meta: llama-3.1-70b-instruct
- Meta: llama-3.1-405b-instruct
- Meta: llama-3.2-3b-instruct
- Meta: llama-3.3-70b-instruct
- Meta: llama-4-scout-17b-16e-instruct
- OpenAI: gpt-oss-20b
- OpenAI: gpt-oss-120b
- IBM: granite-3.3-8b-instruct
- MistralAI: mistral-small-3.1-24b-instruct-2503
- MistralAI: mistral-medium-3-instruct
```
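
If you also wanted to sweep one of the fixed LLMs, the same `optimizable_params` and `search_space` keys used for the agent LLM could in principle be added to its entry. The helper below is a hypothetical sketch that builds such an entry as a plain dictionary; it does not call NAT, and the candidate list is only an example.

```python
import copy


def with_model_search_space(llm_config, candidate_models):
    """Return a copy of an LLM config with model_name marked as optimizable."""
    updated = copy.deepcopy(llm_config)
    params = list(updated.get("optimizable_params", []))
    if "model_name" not in params:
        params.append("model_name")
    updated["optimizable_params"] = params
    updated.setdefault("search_space", {})["model_name"] = {"values": list(candidate_models)}
    return updated


# Fixed tool-reasoning LLM as configured in this section.
tool_reasoning_llm = {
    "_type": "nim",
    "model_name": "meta/llama-3.1-70b-instruct",
    "temperature": 0.2,
    "max_tokens": 2048,
}

# Hypothetical candidate list; any models you have access to would do.
candidates = ["meta/llama-3.1-8b-instruct", "meta/llama-3.1-70b-instruct"]
tunable = with_model_search_space(tool_reasoning_llm, candidates)
print(tunable["optimizable_params"], tunable["search_space"])
```

Keep in mind that a grid search multiplies the candidates: sweeping 11 agent models against 2 tool-reasoning models would mean 22 parameter sets instead of 11.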

In [None]:
import importlib.resources
from pathlib import Path
import yaml

package_data = importlib.resources.files('nat_alert_triage_agent').joinpath('data')

# Create config dictionary
config = {
    'functions': {
        'hardware_check': {
            '_type': 'hardware_check',
            'llm_name': 'tool_reasoning_llm',
            'offline_mode': True
        },
        'host_performance_check': {
            '_type': 'host_performance_check',
            'llm_name': 'tool_reasoning_llm',
            'offline_mode': True
        },
        'monitoring_process_check': {
            '_type': 'monitoring_process_check',
            'llm_name': 'tool_reasoning_llm',
            'offline_mode': True
        },
        'network_connectivity_check': {
            '_type': 'network_connectivity_check',
            'llm_name': 'tool_reasoning_llm',
            'offline_mode': True
        },
        'telemetry_metrics_host_heartbeat_check': {
            '_type': 'telemetry_metrics_host_heartbeat_check',
            'llm_name': 'tool_reasoning_llm',
            'offline_mode': True
        },
        'telemetry_metrics_host_performance_check': {
            '_type': 'telemetry_metrics_host_performance_check',
            'llm_name': 'tool_reasoning_llm',
            'offline_mode': True
        },
        'telemetry_metrics_analysis_agent': {
            '_type': 'telemetry_metrics_analysis_agent',
            'tool_names': [
                'telemetry_metrics_host_heartbeat_check',
                'telemetry_metrics_host_performance_check'
            ],
            'llm_name': 'agent_llm'
        },
        'maintenance_check': {
            '_type': 'maintenance_check',
            'llm_name': 'agent_llm',
            'static_data_path': str(package_data / 'maintenance_static_dataset.csv')
        },
        'categorizer': {
            '_type': 'categorizer',
            'llm_name': 'agent_llm'
        }
    },
    'workflow': {
        '_type': 'alert_triage_agent',
        'tool_names': [
            'hardware_check',
            'host_performance_check',
            'monitoring_process_check',
            'network_connectivity_check',
            'telemetry_metrics_analysis_agent'
        ],
        'llm_name': 'agent_llm',
        'offline_mode': True,
        'offline_data_path': str(package_data / 'offline_data.csv'),
        'benign_fallback_data_path': str(package_data / 'benign_fallback_offline_data.json')
    },
    'llms': {
        'agent_llm': {
            '_type': 'nim',
            'model_name': 'meta/llama-3.1-8b-instruct',
            'temperature': 0.0,
            'max_tokens': 2048,
            'optimizable_params': ['model_name'],
            'search_space': {
                'model_name': {
                    'values': [
                        # path-check-skip-next-line
                        'meta/llama-3.1-8b-instruct',
                        # path-check-skip-next-line
                        'meta/llama-3.1-70b-instruct',
                        # path-check-skip-next-line
                        'meta/llama-3.1-405b-instruct',
                        # path-check-skip-next-line
                        'meta/llama-3.2-3b-instruct',
                        # path-check-skip-next-line
                        'meta/llama-3.3-70b-instruct',
                        # path-check-skip-next-line
                        'meta/llama-4-scout-17b-16e-instruct',
                        # path-check-skip-next-line
                        'openai/gpt-oss-20b',
                        # path-check-skip-next-line
                        'openai/gpt-oss-120b',
                        # path-check-skip-next-line
                        'ibm/granite-3.3-8b-instruct',
                        # path-check-skip-next-line
                        'mistralai/mistral-small-3.1-24b-instruct-2503',
                        # path-check-skip-next-line
                        'mistralai/mistral-medium-3-instruct'
                    ]
                }
            }
        },
        'tool_reasoning_llm': {
            '_type': 'nim',
            'model_name': 'meta/llama-3.1-70b-instruct',
            'temperature': 0.2,
            'max_tokens': 2048
        },
        'nim_rag_eval_llm': {
            '_type': 'nim',
            'model_name': 'meta/llama-3.1-70b-instruct',
            'max_tokens': 8
        }
    },
    'eval': {
        'general': {
            # path-check-skip-next-line
            'output_dir': './tmp_workflow/alert_triage_output/',
            'dataset': {
                '_type': 'json',
                'file_path': str(package_data / 'offline_data.json')
            }
        },
        'evaluators': {
            'classification_accuracy': {
                '_type': 'classification_accuracy'
            },
            'rag_accuracy': {
                '_type': 'ragas',
                'metric': 'AnswerAccuracy',
                'llm_name': 'nim_rag_eval_llm'
            }
        },
        'profiler': {
            'token_uniqueness_forecast': True,
            'workflow_runtime_forecast': True,
            'compute_llm_metrics': True,
            'csv_exclude_io_text': True,
            'prompt_caching_prefixes': {
                'enable': True,
                'min_frequency': 0.1
            },
            'bottleneck_analysis': {
                'enable_nested_stack': True
            },
            'concurrency_spike_analysis': {
                'enable': True,
                'spike_threshold': 7
            }
        }
    },
    'optimizer': {
        # path-check-skip-next-line
        'output_path': './tmp_workflow/alert_triage_output/optimizer/',
        'reps_per_param_set': 1,
        'eval_metrics': {
            'classification_accuracy': {
                'evaluator_name': 'classification_accuracy',
                'direction': 'maximize'
            },
            'rag_accuracy': {
                'evaluator_name': 'rag_accuracy',
                'direction': 'maximize'
            }
        },
        'numeric': {
            'enabled': True,
            'sampler': 'grid'
        },
        'prompt': {
            'enabled': False
        }
    }
}

# Write to file
Path('./tmp_workflow/configs').mkdir(parents=True, exist_ok=True)
with open('./tmp_workflow/configs/alert_triage_config.yml', 'w') as f:
    yaml.dump(config, f, default_flow_style=False, sort_keys=False)

print(f"Config written with data paths from: {package_data}")
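
A quick consistency check on a config like this can catch typos before a long optimizer run. The sketch below is not part of NAT: `check_optimizer_config` is a hypothetical helper, and the two rules it enforces (each optimizer metric should name a defined evaluator, and each optimizable parameter should have search-space values) are assumptions drawn from the structure of the dictionary above.

```python
def check_optimizer_config(config: dict) -> list[str]:
    """Return a list of human-readable problems found in the config (empty if none)."""
    problems = []

    # Every optimizer metric should point at an evaluator defined under eval.evaluators.
    evaluators = config.get('eval', {}).get('evaluators', {})
    for metric, spec in config.get('optimizer', {}).get('eval_metrics', {}).items():
        evaluator_name = spec.get('evaluator_name')
        if evaluator_name not in evaluators:
            problems.append(f"metric '{metric}' references unknown evaluator '{evaluator_name}'")

    # Every optimizable LLM parameter should have candidate values to search over.
    for llm_name, llm in config.get('llms', {}).items():
        for param in llm.get('optimizable_params', []):
            if not llm.get('search_space', {}).get(param, {}).get('values'):
                problems.append(f"llm '{llm_name}' has no search-space values for '{param}'")

    return problems


# Trimmed-down copy of the config above, kept small for illustration.
sample_config = {
    'llms': {
        'agent_llm': {
            'optimizable_params': ['model_name'],
            'search_space': {
                'model_name': {'values': ['meta/llama-3.1-8b-instruct', 'openai/gpt-oss-20b']}
            },
        }
    },
    'eval': {'evaluators': {'classification_accuracy': {}, 'rag_accuracy': {}}},
    'optimizer': {
        'eval_metrics': {
            'classification_accuracy': {'evaluator_name': 'classification_accuracy'},
            'rag_accuracy': {'evaluator_name': 'rag_accuracy'},
        }
    },
}

print(check_optimizer_config(sample_config) or "No problems found")
```

Passing the full `config` dictionary from the cell above to the same helper would apply the same two checks to the real configuration.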

<a id="test-triage-agent"></a>
## 2.3) Test the tool-calling agent

Let's test the Alert Triage Agent with a single alert. This alert is an "InstanceDown" alert that, according to the offline dataset, is actually a false positive (the system is healthy).


In [None]:
!nat run --config_file tmp_workflow/configs/alert_triage_config.yml \
  --input '{"alert_id": 0, \
           "alert_name": "InstanceDown", \
           "host_id": "test-instance-0.example.com", \
           "severity": "critical", \
           "description": "Instance test-instance-0.example.com is not available for scraping for the last 5m. Please check: - instance is up and running; - monitoring service is in place and running; - network connectivity is ok", \
           "summary": "Instance test-instance-0.example.com is down", \
           "timestamp": "2025-04-28T05:00:00.000000"}'


If the cell above completes successfully, the tool-calling agent is properly configured and ready for a naive evaluation. This evaluation will serve as our performance baseline.
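
Hand-writing JSON inside a shell string is easy to get wrong. As an alternative sketch, the payload can be built as a Python dictionary, serialized with `json.dumps`, and quoted for the shell with `shlex.quote`. The alert fields and the `nat run` flags are taken from the cell above; the variable names are illustrative only.

```python
import json
import shlex

alert = {
    "alert_id": 0,
    "alert_name": "InstanceDown",
    "host_id": "test-instance-0.example.com",
    "severity": "critical",
    "description": (
        "Instance test-instance-0.example.com is not available for scraping for the last 5m. "
        "Please check: - instance is up and running; - monitoring service is in place and running; "
        "- network connectivity is ok"
    ),
    "summary": "Instance test-instance-0.example.com is down",
    "timestamp": "2025-04-28T05:00:00.000000",
}

# json.dumps guarantees valid JSON; shlex.quote makes it safe to pass as one shell argument.
payload = json.dumps(alert)
command = (
    "nat run --config_file tmp_workflow/configs/alert_triage_config.yml "
    f"--input {shlex.quote(payload)}"
)
print(command)
```

The resulting string could then be executed with, for example, `subprocess.run(shlex.split(command))`.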

<a id="eval-triage-agent1"></a>
## 2.4) Evaluate the tool-calling agent (naive parameters)

*using `nat eval`...*

Now let's run a full evaluation on the Alert Triage Agent using the complete offline dataset. This dataset contains seven alerts with different root causes:

- **False positives** - System appears healthy despite alert
- **Hardware issues** - Hardware failures or degradation  
- **Software issues** - Malfunctioning monitoring services
- **Maintenance** - Scheduled maintenance windows
- **Repetitive behavior** - Benign recurring patterns

The evaluation will measure:
1. **Classification Accuracy** - How well the agent categorizes root causes
2. **Answer Accuracy** - How well the generated reports match expected outcomes (using RAGAS)
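
To make the first metric concrete, here is a minimal sketch of exact-match classification accuracy. It assumes each alert scores 1 when the predicted root-cause category equals the expected one and 0 otherwise; the actual `classification_accuracy` evaluator may differ in detail, and the category labels below are hypothetical.

```python
def classification_accuracy(predicted: list[str], expected: list[str]) -> float:
    """Fraction of alerts whose predicted category exactly matches the expected one."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


# Hypothetical labels for a seven-alert dataset: three predictions match.
expected = ["false_positive", "hardware", "software", "maintenance",
            "repetitive_behavior", "hardware", "software"]
predicted = ["false_positive", "hardware", "hardware", "software",
             "false_positive", "hardware", "hardware"]

print(f"{classification_accuracy(predicted, expected):.2%}")  # 3 of 7 correct -> 42.86%
```

With only seven alerts, each alert moves the score by about 14 percentage points, so small differences between models should be read with caution.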


In [None]:
!nat eval --config_file ./tmp_workflow/configs/alert_triage_config.yml


**Understanding Alert Triage Evaluation Results**

The evaluation generates several output files in the `alert_triage_output` directory:

1. **classification_accuracy_output.json** - Root cause classification metrics
   - Shows accuracy, precision, recall, and F1 scores for each category
   - Contains confusion matrix for detailed analysis
   
2. **rag_accuracy_output.json** - Answer quality metrics
   - Measures how well generated reports match expected outcomes
   - Uses LLM-as-a-judge to evaluate report quality

3. **workflow_output.json** - Complete execution traces
   - Contains full agent trajectories with tool calls
   - Includes generated reports for each alert
   - Shows token usage and performance metrics

Let's examine the classification accuracy and RAG accuracy results:

In [None]:
import json

# Load and display classification accuracy results
# path-check-skip-next-line
with open('./tmp_workflow/alert_triage_output/classification_accuracy_output.json', 'r') as f:
    classification_results = json.load(f)

print("Classification Accuracy Results:")
print(f"Average Score: {classification_results['average_score']:.2%}")
print("\nPer-Alert Results:")
for item in classification_results['eval_output_items']:
    print(f"  Alert {item['id']}: Score={item['score']:.2f} - {item['reasoning']}")

# Load and display RAG accuracy results
# path-check-skip-next-line
with open('./tmp_workflow/alert_triage_output/rag_accuracy_output.json', 'r') as f:
    rag_results = json.load(f)

print("\n\nRAG Accuracy Results:")
print(f"Average Score: {rag_results['average_score']:.2%}")
print(f"Total Alerts Evaluated: {len(rag_results['eval_output_items'])}")

We see a classification accuracy of around 43% and a RAG accuracy of around 46%.

Next we will run the optimizer over the candidate models listed in the `agent_llm` search space, then re-run the evaluation with the resulting optimal configuration.


<a id="optimize-triage-agent"></a>
## 2.5) Optimize the tool-calling agent's LLM
*using `nat optimize`...*
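
Before launching the run, it can help to estimate how many evaluations it implies. The sketch below assumes that a grid sampler evaluates every combination of search-space values and that each combination is evaluated `reps_per_param_set` times. The `temperature` values are hypothetical and are not part of this notebook's search space.

```python
from itertools import product


def enumerate_grid(search_space: dict[str, list]) -> list[dict]:
    """Return every combination of parameter values as a list of dicts."""
    names = list(search_space)
    return [dict(zip(names, combo)) for combo in product(*search_space.values())]


# Hypothetical two-parameter search space for illustration.
search_space = {
    'model_name': [
        'meta/llama-3.1-8b-instruct',
        'meta/llama-3.3-70b-instruct',
        'openai/gpt-oss-120b',
    ],
    'temperature': [0.0, 0.2],
}
reps_per_param_set = 1

grid = enumerate_grid(search_space)
total_eval_runs = len(grid) * reps_per_param_set
print(f"{len(grid)} parameter sets, {total_eval_runs} evaluation runs")
```

Under the same assumption, the single-parameter search space configured in section 2.2 (eleven candidate models, one repetition each) would amount to eleven evaluation runs over the dataset.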

In [None]:
!nat optimize --config_file tmp_workflow/configs/alert_triage_config.yml

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
import ast

# Load the optimizer results
trials_df_path = Path("tmp_workflow/alert_triage_output/optimizer/trials_dataframe_params.csv")

if trials_df_path.exists():
    trials_df = pd.read_csv(trials_df_path)
    
    print("Grid Search Optimization Results")
    print("=" * 80)
    print("\nTrials Summary:")
    print(trials_df.to_string(index=False))
    
    print("\n" + "=" * 80)
    print("\nModel Performance Statistics (Mean across repetitions):")
    print("-" * 80)
    
    # Group by model name to calculate statistics across repetitions
    for model_name in trials_df['params_llms.agent_llm.model_name'].unique():
        model_trials = trials_df[trials_df['params_llms.agent_llm.model_name'] == model_name]
        
        print(f"\n{model_name}:")
        
        # Parse rep_scores to extract individual repetition metrics
        if 'rep_scores' in model_trials.columns:
            all_classification_accuracies = []
            all_rag_accuracies = []
            
            for rep_scores_str in model_trials['rep_scores']:
                rep_scores = ast.literal_eval(rep_scores_str)
                for score_set in rep_scores:
                    # score_set format: [classification_accuracy, rag_accuracy]
                    all_classification_accuracies.append(score_set[0])
                    all_rag_accuracies.append(score_set[1])
            
            # Calculate mean, standard deviation, and the 2.5th/97.5th percentiles of the scores
            def calculate_stats(values):
                mean = np.mean(values)
                std = np.std(values)
                ci_lower = np.percentile(values, 2.5)
                ci_upper = np.percentile(values, 97.5)
                return mean, std, ci_lower, ci_upper
            
            class_acc_mean, class_acc_std, class_acc_ci_lower, class_acc_ci_upper = \
                calculate_stats(all_classification_accuracies)
            rag_acc_mean, rag_acc_std, rag_acc_ci_lower, rag_acc_ci_upper = calculate_stats(all_rag_accuracies)
            
            print("  Classification Accuracy:")
            print(f"    Mean: {class_acc_mean:.3f} (±{class_acc_std:.3f})")
            print(f"    95% CI: [{class_acc_ci_lower:.3f}, {class_acc_ci_upper:.3f}]")
            
            print("  RAG Accuracy:")
            print(f"    Mean: {rag_acc_mean:.3f} (±{rag_acc_std:.3f})")
            print(f"    95% CI: [{rag_acc_ci_lower:.3f}, {rag_acc_ci_upper:.3f}]")
        else:
            # Fallback to aggregated values if rep_scores not available
            # values_0 = classification_accuracy, values_1 = rag_accuracy
            class_acc_mean = np.mean(model_trials['values_0'])
            rag_acc_mean = np.mean(model_trials['values_1'])
            
            print(f"  Classification Accuracy (mean): {class_acc_mean:.3f}")
            print(f"  RAG Accuracy (mean): {rag_acc_mean:.3f}")
            print("  Note: 95% CI not available without rep_scores data")
    
    print("\n" + "=" * 80)
    print("\nBest Configuration (by aggregated classification accuracy across all repetitions):")
    # Find the trial with best aggregated classification accuracy
    best_trial = trials_df.loc[trials_df['values_0'].idxmax()]
    print(f"Model: {best_trial['params_llms.agent_llm.model_name']}")
    print(f"Aggregated Classification Accuracy Score: {best_trial['values_0']}")
    print(f"Aggregated RAG Accuracy: {best_trial['values_1']}")
else:
    print(f"Optimizer results not found at {trials_df_path}")
    print("Please run the optimizer first (section 2.5)")
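
Because the dataset holds only seven alerts, several models may tie on classification accuracy. One possible refinement, sketched below in plain Python, ranks trials by classification accuracy first and uses RAG accuracy as a tie-breaker. The trial rows and scores are made up for illustration; only the column names `values_0`, `values_1`, and `params_llms.agent_llm.model_name` come from the cell above.

```python
# Made-up trial rows mirroring the columns used above (scores are not real results).
trials = [
    {'params_llms.agent_llm.model_name': 'model-a', 'values_0': 0.571, 'values_1': 0.50},
    {'params_llms.agent_llm.model_name': 'model-b', 'values_0': 0.714, 'values_1': 0.46},
    {'params_llms.agent_llm.model_name': 'model-c', 'values_0': 0.714, 'values_1': 0.61},
]

# Tuples compare element by element, so values_1 only matters when values_0 ties.
best = max(trials, key=lambda t: (t['values_0'], t['values_1']))
print(best['params_llms.agent_llm.model_name'])  # model-c
```

With the pandas DataFrame from the cell above, a comparable ordering could be obtained with `trials_df.sort_values(['values_0', 'values_1'], ascending=False)`.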


<a id="eval-triage-agent2"></a>
## 2.6) Re-evaluate the optimized tool-calling agent

Now re-run the original evaluation using the optimal parameters that the optimizer wrote to **optimized_config.yml**.

In [None]:
# path-check-skip-next-line
!nat eval --config_file ./tmp_workflow/alert_triage_output/optimizer/optimized_config.yml

In [None]:
import json

# Load and display classification accuracy results
# path-check-skip-next-line
with open('./tmp_workflow/alert_triage_output/classification_accuracy_output.json', 'r') as f:
    classification_results = json.load(f)

print("Classification Accuracy Results:")
print(f"Average Score: {classification_results['average_score']:.2%}")
print("\nPer-Alert Results:")
for item in classification_results['eval_output_items']:
    print(f"  Alert {item['id']}: Score={item['score']:.2f} - {item['reasoning']}")

# Load and display RAG accuracy results
# path-check-skip-next-line
with open('./tmp_workflow/alert_triage_output/rag_accuracy_output.json', 'r') as f:
    rag_results = json.load(f)

print("\n\nRAG Accuracy Results:")
print(f"Average Score: {rag_results['average_score']:.2%}")
print(f"Total Alerts Evaluated: {len(rag_results['eval_output_items'])}")
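
To compare against the baseline from section 2.4, the two sets of averages can be placed side by side. This sketch assumes the baseline averages were saved before the re-evaluation (if the optimized config keeps the same `output_dir`, the earlier result files would be overwritten). The optimized numbers below are placeholders, not measured results.

```python
def compare_scores(baseline: dict[str, float], optimized: dict[str, float]) -> dict[str, float]:
    """Return optimized minus baseline for every metric present in both dicts."""
    return {name: optimized[name] - baseline[name] for name in baseline if name in optimized}


# Baseline values echo the approximate figures quoted earlier; optimized values are placeholders.
baseline = {'classification_accuracy': 0.43, 'rag_accuracy': 0.46}
optimized = {'classification_accuracy': 0.57, 'rag_accuracy': 0.50}

for metric, delta in compare_scores(baseline, optimized).items():
    print(f"{metric}: {baseline[metric]:.2%} -> {optimized[metric]:.2%} ({delta:+.2%})")
```

In practice, `baseline` could be filled from the `average_score` fields loaded in section 2.4 and `optimized` from the values loaded in the cell above.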


<a id="next-steps"></a>
# 3.0) Next steps

Continue learning how to fully utilize the NVIDIA NeMo Agent toolkit by exploring the other documentation and advanced agents in the `examples` directory.