## Model Selection and Parameter Optimization

In this notebook, we will demonstrate how the NVIDIA NeMo Agent toolkit (NAT) optimizer can be used to create a robust model evaluation, comparison, and selection pipeline for custom datasets.

**Goal**: Walk through a minimal configuration and a practical example of using the NAT optimizer module to evaluate and compare the performance of several LLMs, so that NAT users gain a working understanding of this feature and can build similar workflows in their own organizations.

## Table of Contents
 
- [0.0) Setup](#setup)
  - [0.1) Prerequisites](#prereqs)
  - [0.2) API Keys](#api-keys)
  - [0.3) Installing NeMo Agent Toolkit](#install-nat)
  - [0.4) Additional dependencies](#deps)
- [1.0) LLM-as-a-judge with NAT](#llm-judge-h1)
  - [1.1) Create a new workflow](#new-workflow)
  - [1.2) Head-to-head comparison of multiple LLMs using eval](#nat-eval)
    - [1.2.1) LLM-as-a-judge workflow config](#config)
    - [1.2.2) Add optimizer settings to the configuration](#optimizer-settings)
    - [1.2.3) Create an eval dataset](#dataset)
    - [1.2.4) Run the optimizer](#optimize-first)
    - [1.2.5) Interpret first optimizer run](#interpret-optimizer-first)
- [2.0) Optimized model and parameter selection for tool-calling agents](#optimize-tool-calling-agents)
  - [2.1) Create a tool-calling agent](#create-triage-agent)
  - [2.2) Configure the tool-calling agent](#configure-triage-agent)
  - [2.3) Test the tool-calling agent](#test-triage-agent)
  - [2.4) Evaluate the tool-calling agent](#eval-triage-agent1)
  - [2.5) Optimize the tool-calling agent's LLM](#optimize-triage-agent)
  - [2.6) Re-evaluate the optimized tool-calling agent](#eval-triage-agent2)
- [3.0) Concurrent model parameter and prompt tuning](#model-and-prompt-tuning)
  - [3.1) Optimizer configuration for all parameters (models, hyperparameters, and prompts)](#all-tuning-config)
  - [3.2) Evaluate the agent](#all-tuning-initial-eval)
  - [3.3) Optimize the agent](#all-tuning-optimize)
  - [3.4) Re-evaluate the optimized tool-calling agent](#eval-triage-agent2)
- [4.0) Next steps](#next-steps)

<a id="setup"></a>
# 0.0) Setup

<a id="prereqs"></a>
## 0.1) Prerequisites

We strongly recommend that users begin this notebook with a working understanding of NAT workflows. Please refer to the earlier notebooks in this series before starting this one.

- **Platform:** Linux, macOS, or Windows
- **Python:** version 3.11, 3.12, or 3.13
- **Python Packages:** `pip`

<a id="api-keys"></a>
## 0.2) API Keys

For this notebook, you will need the following API keys to run all examples end-to-end:

- **NVIDIA Build:** You can obtain an NVIDIA Build API Key by creating an [NVIDIA Build](https://build.nvidia.com) account and generating a key at https://build.nvidia.com/settings/api-keys

Then you can run the cell below:

In [21]:
import getpass
import os

if "NVIDIA_API_KEY" not in os.environ:
    nvidia_api_key = getpass.getpass("Enter your NVIDIA API key: ")
    os.environ["NVIDIA_API_KEY"] = nvidia_api_key

<a id="install-nat"></a>
## 0.3) Installing NeMo Agent Toolkit

The recommended way to install NAT is through `pip` or `uv pip`.

First, we will install `uv` which offers parallel downloads and faster dependency resolution.

In [None]:
!pip install uv

NeMo Agent toolkit can be installed through the PyPI `nvidia-nat` package.

There are several optional subpackages available for NAT. For this example, we will rely on two subpackages:
* The `nvidia-nat[langchain]` subpackage contains components for integrating with [LangChain](https://python.langchain.com/docs/introduction/).
* The `nvidia-nat[profiling]` subpackage contains components for profiling and performance analysis.

In [None]:
!uv pip install "nvidia-nat[langchain,profiling]"

<a id="deps"></a>
## 0.4) Additional dependencies

In [22]:
# needed for the alert triage agent used later
!uv pip install ansible-runner

Using Python 3.12.11 environment at: /Users/bbednarski/.venvs/unew_312
Audited 1 package in 287ms


<div style="color: red; font-style: italic;">
<strong>Note:</strong> Uncomment and run this cell to install git-lfs if using Google Colab.
</div>

In [23]:
# !apt-get update
# !apt-get install git git-lfs -y
# !git lfs install

<a id="llm-judge-h1"></a>
# 1.0) LLM-as-a-judge with NAT

The `nat eval` and `nat optimize` utilities enable developers to easily integrate LLM-as-a-judge capabilities with their workflows. `nat eval` allows for simple evaluations of a NAT workflow against an eval dataset. `nat optimize` extends this functionality by integrating with the **Optuna** library to perform grid and stochastic parameter sweeps and evaluations to identify optimal configurations for a task.

**Note:** _In this notebook, we will primarily demonstrate how to use `nat optimize` to identify a potentially optimal set of parameters for a NAT workflow. We assume that users already have a strong understanding of ML model evaluation before building this concept into their workflows, as we will not cover cross-validation or train, validation, and test splitting of datasets. Please refer to Python's [scikit-learn](https://scikit-learn.org/stable/) package as a strong reference for these concepts._

<a id="new-workflow"></a>
## 1.1) Create a new workflow

Create a basic chat completions workflow (using LangChain chat completions on the backend).

In [24]:
!nat workflow create tmp_workflow --description "A simple chat completion workflow to compare model performance"

Installing workflow 'tmp_workflow'...
Workflow 'tmp_workflow' installed successfully.
Workflow 'tmp_workflow' created successfully in '/Users/bbednarski/Projects/nat-getting-started-fork/NeMo-Agent-Toolkit/examples/notebooks/tmp_workflow'.

Let's write a minimal configuration for this workflow and confirm the workflow type and LLM settings (this simple workflow defines no tools or functions)...

In [25]:
%%writefile ./tmp_workflow/configs/config_a.yml
llms:
  nim_llm:
    _type: nim
    model_name: meta/llama-3.1-8b-instruct
    temperature: 0.7
    max_tokens: 1024

workflow:
  _type: chat_completion  # Use the type directly
  system_prompt: |
    You are a helpful AI assistant. Provide clear, accurate, and helpful 
    responses to user queries. Be concise and informative.
  llm_name: nim_llm

Writing ./tmp_workflow/configs/config_a.yml


Now let's run this workflow for a simple Q&A example...

In [26]:
!nat run --config_file tmp_workflow/configs/config_a.yml --input "Suggest a single name for my new dog"

2025-10-24 10:24:04 - INFO     - nat.cli.commands.start:192 - Starting NAT from config file: 'tmp_workflow/configs/config_a.yml'

Configuration Summary:
--------------------
Workflow Type: chat_completion
Number of Functions: 0
Number of Function Groups: 0
Number of LLMs: 1
Number of Embedders: 0
Number of Memory: 0
Number of Object Stores: 0
Number of Retrievers: 0
Number of TTC Strategies: 0
Number of Authentication Providers: 0

2025-10-24 10:24:07 - INFO     - nat.front_ends.console.console_front_end_plugin:102 - --------------------------------------------------
Workflow Result:
["I'd be happy to help you choose a name for your new dog.\n\nHere are some popular and unique name suggestions for a dog:\n\nFor a male dog:\n- Max\n- Cooper\n- Rocky\n- Bear\n- Finn\n\nFor a female dog:\n- Luna\n- Daisy\n- Bella\n- Lucy\n- Ginger\n\nOr, if you'd like something more unique:\n- Sage\n- Wren\n- Indie\n- Clio\n- Remi\n\nWhat breed or type of dog do you have? I can give more tailored sug

<a id="nat-eval"></a>
## 1.2) Head-to-head comparison of multiple LLMs using eval

Now that we've made a new workflow and shown that it works for a cursory `nat run` example, we will begin to build out an LLM-as-a-judge evaluation with trace profiling enabled for additional observability. In this next section, we are going to update the workflow configuration for evaluation and profiling.

Step-by-step instructions can be found in [4_observability_evaluation_and_profiling.ipynb](./4_observability_evaluation_and_profiling.ipynb). An end-to-end example of using the Optimizer can be viewed in the [Email Phishing Analyzer](https://github.com/NVIDIA/NeMo-Agent-Toolkit/blob/develop/examples/evaluation_and_profiling/email_phishing_analyzer/src/nat_email_phishing_analyzer/configs/config_optimizer.yml).

The profiler instruments and measures your workflow's performance, while evaluators judge the quality of the outputs. They're separate concepts, so they belong in different sections of the config!

In this next step we will combine the eval and profile configuration into a single config for brevity.

<a id="config"></a>
### 1.2.1) LLM-as-a-judge workflow config

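In the cell below we extend our initial workflow configuration with evaluation settings (`eval`, including the profiler) and optimizable LLM parameters. The `optimizer` section itself is appended in section 1.2.2.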

Key components of this configuration:

**LLM Configuration:**
- `chat_completion_llm`: The backbone LLM that powers the workflow
- `optimizable_params`: Specifies which parameters the optimizer can tune (model name, temperature)
- `search_space`: Defines the values the optimizer will explore during optimization

**Judge LLM:**
- `nim_judge_llm`: A separate, more capable LLM (meta/llama-3.1-405b-instruct) used by the evaluator to assess the quality of the workflow's outputs
  - This LLM acts as an "LLM-as-a-judge" to score responses

**Evaluation Components:**
- `evaluators`: Define metrics to measure workflow quality (for example, accuracy, relevance)
- `profiler`: Instruments the workflow to collect performance metrics (latency, token usage, costs)

**Optimizer Components:**
- `reps_per_param_set`: Number of times to evaluate each parameter combination for statistical reliability
- `grid_search`: Strategy for exploring the search space (tests all combinations)
- `eval_metrics`: Metrics used to guide optimization decisions (for example, maximize accuracy while minimizing cost)

In [27]:
%%writefile tmp_workflow/configs/config_b.yml
llms:
  chat_completion_llm:
    _type: nim
    model_name: meta/llama-3.1-8b-instruct
    temperature: 0.0
    max_tokens: 1024
    optimizable_params:
      - model_name
      - temperature
    search_space:
      model_name:
        values:
          - meta/llama-3.1-8b-instruct
          - meta/llama-3.1-70b-instruct
      temperature:
        values:
          - 0.0
          - 0.7

  # Judge LLM for accuracy evaluation
  nim_judge_llm:
    _type: nim
    model_name: meta/llama-3.1-405b-instruct
    temperature: 0.0
    max_tokens: 8  # RAGAS accuracy only needs a score (0-1)

workflow:
  _type: chat_completion
  system_prompt: |
    You are a helpful AI assistant. Provide clear, accurate, and helpful 
    responses to user queries. Be concise and informative.
  llm_name: chat_completion_llm

general:
  telemetry:
    logging:
      console:
        _type: console
        level: INFO

eval:
  general:
    output_dir: ./tmp_workflow/eval_output
    verbose: true
    dataset:
        _type: json
        file_path: ./tmp_workflow/data/eval_data.json

  evaluators:
    answer_accuracy:
      _type: ragas
      metric: AnswerAccuracy
      llm_name: nim_judge_llm
    llm_latency:
      _type: avg_llm_latency
    token_efficiency:
      _type: avg_tokens_per_llm_end

  profiler:
      token_uniqueness_forecast: true
      workflow_runtime_forecast: true
      compute_llm_metrics: true
      csv_exclude_io_text: true
      prompt_caching_prefixes:
        enable: true
        min_frequency: 0.1
      bottleneck_analysis:
        enable_nested_stack: true
      concurrency_spike_analysis:
        enable: true
        spike_threshold: 7


Writing tmp_workflow/configs/config_b.yml


<a id="optimizer-settings"></a>
### 1.2.2) Add optimizer settings to the configuration

**For a complete reference of all optimizer configuration parameters, see the [Optimizer documentation](../../docs/source/reference/optimizer.md) or the version on the [GitHub `develop` branch](https://github.com/NVIDIA/NeMo-Agent-Toolkit/blob/develop/docs/source/reference/optimizer.md).**



Next, we will append the optimizer-specific settings to our configuration file under the "optimizer" section. The following describes the purpose and configurability of each.

**Top-Level Settings**

`output_path: ./tmp_workflow/eval_output/optimizer/` - Specifies where all optimization results will be saved

Files created here:
- `optimized_config.yml` - The best configuration found
- `trials_dataframe_params.csv` - Detailed results from all trials
- `config_numeric_trial_{N}.yml` - Individual trial configurations
- `plots/` - Pareto front visualizations (if multiple metrics)

`reps_per_param_set: 10`

> What it does: Number of times to run your workflow with each parameter configuration. This is important because LLMs are non-deterministic (the same input can give different outputs), so we often want to measure performance over a larger sample.
> 
> How it works:
> - If testing 5 different configurations × 10 reps = 50 total workflow runs
> - Results are averaged across the 10 runs for statistical reliability
> 
> Trade-off:
> - Higher reps = more reliable results, but slower optimization and more compute used
> - Lower reps = faster and cheaper, but less confidence in which config is truly better

**Evaluation Metrics (`eval_metrics`)**

This section defines what you're optimizing for. You can have multiple objectives.

- `accuracy` (custom name, you choose this)
- `token_efficiency` (another custom name)
- `latency` (another custom name)

Key Concepts:
- `evaluator_name`: References an evaluator you've defined elsewhere in your config (must match exactly)
- `direction`:
  - `maximize` - Higher scores are better (accuracy, precision, F1)
  - `minimize` - Lower scores are better (latency, cost, error rate)
- Multi-objective optimization: With 3 metrics here, the optimizer finds configs that balance all three goals (Pareto optimization)
  - `weight` - coefficient of relative importance for the optimizer (defaults to 1.0)
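
To make the idea of Pareto optimization concrete, here is a minimal sketch in plain Python. It is not the optimizer's own implementation (NAT delegates the search to Optuna); the `dominates` and `pareto_front` helpers and the trial values are hypothetical and only illustrate how `direction` determines which trials survive.

```python
from typing import Dict, List

# Directions mirror the `direction` field of each entry under `eval_metrics`.
DIRECTIONS = {"accuracy": "maximize", "token_efficiency": "minimize", "latency": "minimize"}

# Hypothetical trial results, for illustration only.
trials = [
    {"name": "A", "accuracy": 0.60, "token_efficiency": 150.0, "latency": 2.0},
    {"name": "B", "accuracy": 0.70, "token_efficiency": 160.0, "latency": 5.0},
    {"name": "C", "accuracy": 0.55, "token_efficiency": 170.0, "latency": 6.0},
]


def dominates(a: Dict, b: Dict, directions: Dict[str, str]) -> bool:
    """True if `a` is at least as good as `b` on every metric and strictly better on one."""
    strictly_better = False
    for metric, direction in directions.items():
        # Flip the sign of maximized metrics so that lower is always better.
        sign = -1.0 if direction == "maximize" else 1.0
        va, vb = sign * a[metric], sign * b[metric]
        if va > vb:
            return False
        if va < vb:
            strictly_better = True
    return strictly_better


def pareto_front(candidates: List[Dict], directions: Dict[str, str]) -> List[Dict]:
    """Keep only the trials that no other trial dominates."""
    return [
        t for t in candidates
        if not any(dominates(other, t, directions) for other in candidates if other is not t)
    ]


print([t["name"] for t in pareto_front(trials, DIRECTIONS)])
```

In this toy data, trial C is worse than B on all three metrics, so it drops out, while A and B trade accuracy against cost and speed and both remain on the front.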

**Numeric Optimization (`numeric`)**

Controls how numeric/categorical parameters are optimized (uses Optuna library).

`enabled: true`

> What it does: Turns on optimization of numeric parameters (like `temperature`, `max_tokens`, model selection)
> 
> When to enable: When you have optimizable parameters marked with `OptimizableField()` in your config
> 
> When to disable: If you only want to optimize prompts, or run a single evaluation

`sampler: grid`

> What it does: Determines the search strategy for finding the best parameters
> 
> Options:
> - `grid` - Exhaustive search: Tests every combination of parameter values
>   - Use when: Small search space, want guaranteed best result
>   - Example: 3 models × 2 temperatures = 6 combinations
> - `bayesian` or `null` - Smart search: Uses Bayesian optimization to intelligently sample promising areas
>   - Use when: Large search space, limited time/budget
>   - Example: Continuous ranges like temperature 0.0-1.0
> 
> Must specify either:
> - Explicit values: `[0.5, 0.7, 0.9]`, OR
> - Range with step: `low: 0.0, high: 1.0, step: 0.1`
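
The size of a grid search can be estimated before spending any tokens. The sketch below expands a search space written as a plain dictionary (the same shape as the `search_space` block in `config_b.yml`) into every combination. The helper names are hypothetical, and including `high` as the last point of a stepped range is an assumption of this sketch rather than documented optimizer behavior.

```python
from itertools import product


def expand_values(spec):
    """Expand one search-space entry into its list of candidate values."""
    if "values" in spec:
        return list(spec["values"])
    # Range with step (assumption: `high` is included when it lies on the grid).
    low, high, step = spec["low"], spec["high"], spec["step"]
    count = int(round((high - low) / step)) + 1
    return [round(low + i * step, 10) for i in range(count)]


def grid(search_space):
    """Return every combination of parameter values as a list of dicts."""
    names = list(search_space)
    value_lists = [expand_values(search_space[name]) for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Same values as the `search_space` entries in config_b.yml.
search_space = {
    "model_name": {"values": ["meta/llama-3.1-8b-instruct", "meta/llama-3.1-70b-instruct"]},
    "temperature": {"values": [0.0, 0.7]},
}

combinations = grid(search_space)
reps_per_param_set = 10
print(f"{len(combinations)} configs x {reps_per_param_set} reps = "
      f"{len(combinations) * reps_per_param_set} evaluation runs")
```

Running a check like this first makes it easier to spot a search space that has grown larger than intended.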

**Prompt Optimization (`prompt`)**

Controls genetic algorithm-based prompt optimization.

`enabled: false`

> What it does: Turns on/off LLM-based prompt evolution
> 
> When to enable: When you want to optimize the actual text of prompts (like system prompts)
> 
> When to disable:
> - Comparing models and numeric parameters only (like this example)
> - Don't have prompt parameters marked for optimization
> - Want faster results (prompt optimization is slower)
> 
> Requires:
> - Prompt parameters marked with `OptimizableField(space=SearchSpace(is_prompt=True))`
> - LLM functions for generating prompt variations

**How This Configuration Works Together**

With this specific config, here's what happens:

Optimizer will:
- Test different parameter combinations (models, settings, etc.)
- Run each combination 10 times for reliability
- Measure 3 things: accuracy (↑), token efficiency (↓), latency (↓)
- Use grid search to test every combination systematically
- Skip prompt optimization (only testing model/parameter combinations)

Example workflow (if testing 3 models × 2 temperatures):
- Total unique configs: 6
- Runs per config: 10
- Total workflow runs: 60
- Result: Best config balancing accuracy, cost, and speed

Output:
- One "best" configuration file
- Detailed comparison of all tested configs
- Visualizations showing trade-offs between metrics

In [28]:
%%writefile -a tmp_workflow/configs/config_b.yml
optimizer:
  output_path: ./tmp_workflow/eval_output/optimizer/
  reps_per_param_set: 10 # Number of times to evaluate EACH config (for statistical significance)
  eval_metrics: # specifies which evaluation metrics to optimize for
    accuracy: # custom name for the metric
      evaluator_name: answer_accuracy  # References the evaluator defined under the 'eval' section
      direction: maximize
      weight: 1.0 # coefficient of relative importance for the optimizer (defaults to 1.0)
    token_efficiency: # custom name for the metric
      evaluator_name: token_efficiency # References the evaluator defined under the 'eval' section
      direction: minimize
      weight: 1.0
    latency: # custom name for the metric
      evaluator_name: llm_latency # References the evaluator defined under the 'eval' section
      direction: minimize
      weight: 1.0

  numeric:
    enabled: true # enables numeric/categorical parameters to be optimized
    sampler: grid # uses Optuna GridSearch to determine the unique parameter sets to evaluate

  prompt:
    enabled: false  # Disable for pure model and hyperparameter comparison

Appending to tmp_workflow/configs/config_b.yml


<a id="dataset"></a>
### 1.2.3) Create an eval dataset

The dataset below is intended to be difficult for simple LLM chat completions, because:
- Math calculations (questions 1, 2, 5, 7, 9) require precise arithmetic that LLMs often struggle with
- Real-time data queries (questions 3, 8) need current information beyond the model's training cutoff
- Factual knowledge (questions 4, 6) may be outdated or incorrect without access to recent data
- Multi-step reasoning (questions 2, 7) requires combining multiple operations accurately

In [29]:
%%writefile tmp_workflow/data/eval_data.json
[
    {
        "id": "1",
        "question": "What is 15% of 847?",
        "answer": "The answer is 127.05"
    },
    {
        "id": "2", 
        "question": "If I invest $10,000 at 5% annual interest compounded monthly for 3 years, how much will I have?",
        "answer": "Approximately $11,614.72"
    },
    {
        "id": "3",
        "question": "What is the current weather in Tokyo?",
        "answer": "This requires real-time weather data for Tokyo, Japan."
    },
    {
        "id": "4",
        "question": "Who won the FIFA World Cup in 2022 and where was it held?",
        "answer": "Argentina won the 2022 FIFA World Cup, which was held in Qatar."
    },
    {
        "id": "5",
        "question": "Calculate the average of these numbers: 23, 45, 67, 89, 12, 34",
        "answer": "The average is 45"
    },
    {
        "id": "6",
        "question": "What is the capital of Australia and what is its approximate population?",
        "answer": "Canberra is the capital of Australia with a population of approximately 460,000 people."
    },
    {
        "id": "7",
        "question": "If a train travels 120 miles in 2 hours, then 180 miles in 3 hours, what is its average speed over the entire journey?",
        "answer": "The average speed is 60 miles per hour (300 miles / 5 hours)."
    },
    {
        "id": "8",
        "question": "Search for information about the latest NASA Mars mission and summarize the key findings.",
        "answer": "Requires web search for current NASA Mars mission information and synthesis of findings."
    },
    {
        "id": "9",
        "question": "What is 2 to the power of 10?",
        "answer": "1024"
    },
    {
        "id": "10",
        "question": "Who is the current CEO of Microsoft and when did they take the position?",
        "answer": "Satya Nadella has been CEO of Microsoft since February 2014."
    }
]

Writing tmp_workflow/data/eval_data.json
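
Because the judge LLM scores responses against these reference answers, it is worth confirming that the arithmetic references are themselves correct. The following sketch recomputes the answers to questions 1, 2, 5, 7, and 9 in plain Python; the function name is hypothetical and the check is independent of NAT.

```python
def recompute_reference_answers():
    """Recompute the arithmetic reference answers from the eval dataset."""
    return {
        "1": round(0.15 * 847, 2),                       # 15% of 847
        "2": round(10_000 * (1 + 0.05 / 12) ** 36, 2),   # monthly compounding for 3 years
        "5": sum([23, 45, 67, 89, 12, 34]) / 6,          # average of six numbers
        "7": (120 + 180) / (2 + 3),                      # total distance / total time
        "9": 2 ** 10,                                    # 2 to the power of 10
    }


for question_id, value in recompute_reference_answers().items():
    print(question_id, value)
```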


<a id="optimize-first"></a>
### 1.2.4) Run the optimizer

<div style="color: red; font-style: italic;">
<strong>Developer warning:</strong> Running the optimizer can take significant time (~30 minutes for a search space of n=10) and consume many LLM inference tokens. Double-check your config for unneeded search parameters, or reduce the number of samples in the evaluation dataset, to reduce cost.
</div>

In [33]:
!nat optimize --config_file tmp_workflow/configs/config_b.yml

2025-10-24 10:38:59 - INFO     - nat.profiler.parameter_optimization.parameter_optimizer:70 - Using Grid sampler for numeric optimization
[I 2025-10-24 10:38:59,418] A new study created in memory with name: no-name-9ea08d4f-12e1-4fd0-b73b-f70036d0736a
2025-10-24 10:38:59 - INFO     - nat.profiler.parameter_optimization.parameter_optimizer:125 - Starting numeric / enum parameter optimization...

Running workflow:  10%|██▌                       | 1/10 [00:00<00:08,  1.10it/s]
Running workflow:  10%|██▌                       | 1/10 [00:00<00:08,  1.00it/s]
Running workflow:  20%|█████▏                    | 2/10 [00:01<00:04,  1.94it/s]
Running workflow:  30%|███████▊                  | 3/10 [00:01<00:02,  3.07it/s]
Running workflow:  20%|█████▏                    | 2/10 [00:01<00:05,  1.39it/s]
Running workflow:  20%|█████▏                    | 2/10 [00:01<00:07,  1.09it/s]
Running w

<a id="interpret-optimizer-first"></a>
### 1.2.5) Interpret first optimizer run

**Understanding Evaluation Outputs**

This evaluation will have generated four artifacts for analysis at the `output_dir` specified in `config_b.yml`:
 - **`answer_accuracy_output.json`**
 - **`workflow_output.json`**
 - **`llm_latency_output.json`**
 - **`token_efficiency_output.json`**

**Interpreting `answer_accuracy_output.json`**

The `answer_accuracy_output.json` file contains the results of the LLM-as-a-judge answer accuracy evaluation.

**Top-level fields:**
- **`average_score`** - Mean answer accuracy score across all evaluated examples (0.0 to 1.0)
- **`eval_output_items`** - Array of individual evaluation results for each test case

**Per-item fields:**
- **`id`** - Unique identifier for the test case
- **`score`** - Answer accuracy score for this specific example (0.0 to 1.0)
- **`reasoning`** - Evaluator-specific details supporting the score, or a string containing an error message if evaluation failed

The answer accuracy evaluator uses the judge LLM to assess how closely the workflow's response agrees with the reference answer in the dataset.

**Interpreting `workflow_output.json`**

The `workflow_output.json` file contains the raw execution results from running the workflow on each test case.

**Top-level fields:**
- **`output_items`** - Array of workflow execution results for each test case in the dataset

**Per-item fields:**
- **`id`** - Unique identifier matching the test case ID
- **`input_obj`** - The input question or prompt sent to the workflow
- **`output_obj`** - The final answer generated by the workflow
- **`trajectory`** - Detailed execution trace containing:
  - **`event_type`** - Type of event (e.g., `LLM_START`, `LLM_END`, `TOOL_START`, `TOOL_END`, `SPAN_START`, `SPAN_END`)
  - **`event_timestamp`** - Unix timestamp of when the event occurred
  - **`metadata`** - Event-specific data including:
    - Tool names and inputs
    - LLM prompts and responses
    - Token counts (`prompt_tokens`, `completion_tokens`)
    - Model names
    - Function names
    - Error information

The workflow output provides complete observability into each execution, enabling detailed analysis of agent behavior, performance profiling, and debugging.
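
As a rough illustration of how the trace can be used, the sketch below summarizes one output item by counting events per type and measuring the time between the first and last event. It relies only on the field names described above (`id`, `trajectory`, `event_type`, `event_timestamp`); the sample item is hypothetical, and the exact nesting in a real `workflow_output.json` may differ.

```python
from collections import Counter


def summarize_item(item):
    """Count events by type and compute the elapsed time covered by one item's trace."""
    events = item.get("trajectory", [])
    counts = Counter(event.get("event_type") for event in events)
    timestamps = [event["event_timestamp"] for event in events if "event_timestamp" in event]
    elapsed = max(timestamps) - min(timestamps) if timestamps else 0.0
    return {"id": item.get("id"), "event_counts": dict(counts), "elapsed_seconds": elapsed}


# Hypothetical item shaped like the description above, for illustration only.
sample_item = {
    "id": "1",
    "trajectory": [
        {"event_type": "LLM_START", "event_timestamp": 100.0},
        {"event_type": "LLM_END", "event_timestamp": 102.5},
    ],
}

print(summarize_item(sample_item))
```

Applying the same function to every item loaded from the file gives a quick per-question view of how many LLM and tool events occurred and how long each run took.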

In [34]:
from pathlib import Path
import pandas as pd

# Load the optimizer results
trials_df_path = Path("tmp_workflow/eval_output/optimizer/trials_dataframe_params.csv")

if trials_df_path.exists():
    trials_df = pd.read_csv(trials_df_path)

    print("Grid Search Optimization Results")
    print("=" * 80)
    print("\nTrials Summary:")
    print(trials_df.to_string(index=False))
    print("\n" + "=" * 80)

Grid Search Optimization Results

Trials Summary:
 number  values_accuracy  values_token_efficiency  values_latency             datetime_start          datetime_complete               duration params_llms.chat_completion_llm.model_name  params_llms.chat_completion_llm.temperature                                                                                                                                                                                                       rep_scores  system_attrs_grid_id                                                                                                                                  system_attrs_search_space    state  pareto_optimal
      0           0.0700                   161.20          17.171 2025-10-24 10:38:59.418266 2025-10-24 10:40:09.009520 0 days 00:01:09.591254                meta/llama-3.1-70b-instruct                                          0.0 [[0.1, 164.2, 9.75], [0.0, 161.3, 11.21], [0.1, 160.1, 12.61], [0.0, 161.3, 1

The results above show:
 
**Grid Search Optimization Summary:**
- The optimizer evaluated all combinations of models and temperatures defined in the search space
- Each configuration was tested multiple times (repetitions) to account for variability
- Three key metrics were tracked: accuracy, token efficiency (tokens used), and latency (response time)
 
**Understanding the Statistics:**
- **Mean**: Average performance across all repetitions for each model
- **Standard Deviation (±)**: Measure of variability in performance across repetitions
- **95% Confidence Interval**: Range that is expected to contain the true mean performance with 95% confidence
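
These statistics can be computed from the per-repetition scores with the Python standard library. The sketch below is an illustration under two assumptions: the scores are hypothetical, and the interval uses a normal approximation (z = 1.96), which is somewhat narrow for a small number of repetitions where a t-distribution would be more appropriate.

```python
import math
import statistics


def summarize_scores(scores, z=1.96):
    """Return the mean, sample standard deviation, and approximate 95% CI of the mean."""
    n = len(scores)
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores) if n > 1 else 0.0
    # Half-width of the interval around the mean (normal approximation).
    half_width = z * std / math.sqrt(n) if n > 1 else 0.0
    return {"mean": mean, "std": std, "ci95": (mean - half_width, mean + half_width)}


# Hypothetical accuracy scores from four repetitions of one configuration.
rep_accuracy = [0.1, 0.0, 0.1, 0.0]
print(summarize_scores(rep_accuracy))
```

Because the half-width shrinks with the square root of the number of repetitions, doubling `reps_per_param_set` narrows the interval by roughly 30%, not by half.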

**Key Insights:**
- Different models show different trade-offs between accuracy, efficiency, and speed
- Temperature settings affect response variability and quality
- The "Best Configuration" represents the optimal balance based on the weighted combination of all metrics
 
**Interpreting Your Results:**
When you run this optimization, look for:
- Which model/temperature combination achieves the highest aggregated accuracy
- How token efficiency varies between models (lower is more efficient)
- Latency differences (lower is faster)
- The confidence intervals to understand result stability

The optimizer automatically selects the best configuration and saves it to `optimized_config.yml` for use in production.

<a id="optimize-tool-calling-agents"></a>
# 2.0) Optimized model and parameter selection for tool-calling agents

<a id="create-triage-agent"></a>
## 2.1) Create a tool-calling agent
As we explained above, in many real-world applications straightforward chat completion requests may not be adequate without agentic tool-calling integration. For the next exercise, we will therefore build a similar optimization pipeline for an advanced tool-calling agent: the [Alert Triage Agent](https://github.com/NVIDIA/NeMo-Agent-Toolkit/tree/develop/examples/advanced_agents/alert_triage_agent). This agent uses tool calling to automate the triage of server-monitoring alerts, and it demonstrates how to build an intelligent troubleshooting workflow using NeMo Agent toolkit and LangGraph.

The Alert Triage Agent is an advanced example that demonstrates:
- **Multi-tool orchestration** - Dynamically selects and uses diagnostic tools
- **Structured report generation** - Creates comprehensive analysis reports
- **Root cause categorization** - Classifies alerts into predefined categories
- **Offline evaluation mode** - Test with synthetic data before live deployment

We aim to demonstrate the power of model evaluation and optimization on agentic AI platforms. There are many foundation models to choose from as your agent's backbone, and academic benchmarks are not always representative of performance on your own data (see research on training data leakage and domain shift for further motivation).

<div style="color: red; font-style: italic;">
<strong>Note:</strong> Because the Alert Triage Agent is not shipped with the NAT PyPI package, we will either install it from GitHub (from a branch of your choice) or, if the package was installed with the `-e` editable flag, work with the local copy. We parameterize the path to this agent so that the configuration in the next cell is easy to change.
</div>

In [35]:
# Simple input prompt for branch selection
print("=" * 60)
print("Alert Triage Agent Installation")
print("=" * 60)
print("\nOptions:")
print("  - Enter 'local' for editable install from local repository")
print("  - Enter a branch name (e.g., 'develop', 'main') for git install")
print("=" * 60)

branch_name = input("\nEnter your choice: ").strip()

if branch_name.lower() == 'local':
    # Local editable install
    print("\nInstalling alert triage agent in editable mode from local repository...")

    # Try to find the local path relative to current directory
    from pathlib import Path
    # path-check-skip-next-line
    local_path = Path('../../examples/advanced_agents/alert_triage_agent')

    if local_path.exists():
        get_ipython().system(f'pip install -e {local_path}')
        print(f"✓ Installed from local path: {local_path.absolute()}")
    else:
        print(f"✗ Error: Local path not found: {local_path.absolute()}")
        print("Make sure you're running this from the correct directory")
else:
    # Git install from specified branch
    print(f"\nInstalling alert triage agent from branch: {branch_name}")
    get_ipython().system(f'pip install --no-deps "git+https://github.com/NVIDIA/NeMo-Agent-Toolkit.git@{branch_name}#subdirectory=examples/advanced_agents/alert_triage_agent"')
    print(f"✓ Installed from git branch: {branch_name}")

print("\n" + "=" * 60)

Alert Triage Agent Installation

Options:
  - Enter 'local' for editable install from local repository
  - Enter a branch name (e.g., 'develop', 'main') for git install

Installing alert triage agent in editable mode from local repository...
Obtaining file:///Users/bbednarski/Projects/nat-getting-started-fork/NeMo-Agent-Toolkit/examples/advanced_agents/alert_triage_agent
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
INFO: pip is looking at multiple versions of nat-alert-triage-agent to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement nvidia-nat~=1.4 (from nat-alert-triage-agent) (from versions: 1.1.0a20251020, 1.2.0a20250813, 1.2.0rc5, 1.2.0rc6, 1.2.0rc7, 1.2.0rc8, 1.2rc9, 1

In [36]:
import importlib.resources

# Find the installed package data directory
package_data = importlib.resources.files('nat_alert_triage_agent').joinpath('data')

maintenance_csv = str(package_data / 'maintenance_static_dataset.csv')
offline_csv = str(package_data / 'offline_data.csv')
benign_json = str(package_data / 'benign_fallback_offline_data.json')
offline_json = str(package_data / 'offline_data.json')

print(f"Package data directory: {package_data}")

Package data directory: /Users/bbednarski/Projects/nat-getting-started-fork/NeMo-Agent-Toolkit/examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/data


<a id="configure-triage-agent"></a>
## 2.2) Configure the tool-calling agent

**Configuring the Alert Triage Agent**

The Alert Triage Agent requires several components:

1. **Diagnostic Tools** - Hardware checks, network connectivity, performance monitoring, telemetry analysis
2. **Sub-agents** - Telemetry metrics analysis agent that coordinates multiple telemetry tools
3. **Categorizer** - Classifies root causes into predefined categories
4. **Maintenance Check** - Filters out alerts during maintenance windows

We'll create a **local configuration file** and run in **offline mode** using synthetic data.

In the configuration file, you can see the list of LLMs that we have predefined for the optimizer to compare. For brevity and token efficiency, the initial search runs across only two models. However, you can uncomment the entire list of 11 models (or add [more models](https://catalog.ngc.nvidia.com/)) to run a more robust search. The selected model serves as the agent's backbone LLM for reasoning steps. The `tool_reasoning_llm` and `nim_rag_eval_llm` remain fixed to `meta/llama-3.1-70b-instruct`, but a modified evaluation could search over these models as well.
```
- Meta: llama-3.1-8b-instruct
- Meta: llama-3.1-70b-instruct
- Meta: llama-3.1-405b-instruct
- Meta: llama-3.3-3b-instruct
- Meta: llama-3.3-70b-instruct
- Meta: llama-4-scout-17b-16e-instruct
- OpenAI: gpt-oss-20b
- OpenAI: gpt-oss-120b
- IBM: granite-3.3-8b-instruct
- MistralAI: mistral-small-3.1-24b-instruct-2503
- MistralAI: mistral-medium-3-instruct
```

We additionally provide two different values for `temperature` to exemplify concurrent model and parameter searches:
```
- 0.0
- 0.5
```

<div style="color: red; font-style: italic;">
<strong>Developer warning:</strong> Running the optimizer can consume a significant amount of LLM inference tokens. To protect users from unexpected costs, only 2 models have been left uncommented in the config below. Uncomment additional models to increase the search space.
</div>

We will create a YAML configuration file using Python code rather than a static file. This approach allows us to dynamically reference the package data directory and ensures the configuration is created in the notebook's working directory, making it easier to modify and experiment with different settings for optimization.

In [37]:
%%writefile ./tmp_workflow/configs/alert_triage_config_model_selection.yml
# path-check-skip-begin
functions:
  hardware_check:
    _type: hardware_check
    llm_name: tool_reasoning_llm
    offline_mode: true
  host_performance_check:
    _type: host_performance_check
    llm_name: tool_reasoning_llm
    offline_mode: true
  monitoring_process_check:
    _type: monitoring_process_check
    llm_name: tool_reasoning_llm
    offline_mode: true
  network_connectivity_check:
    _type: network_connectivity_check
    llm_name: tool_reasoning_llm
    offline_mode: true
  telemetry_metrics_host_heartbeat_check:
    _type: telemetry_metrics_host_heartbeat_check
    llm_name: tool_reasoning_llm
    offline_mode: true
  telemetry_metrics_host_performance_check:
    _type: telemetry_metrics_host_performance_check
    llm_name: tool_reasoning_llm
    offline_mode: true
  telemetry_metrics_analysis_agent:
    _type: telemetry_metrics_analysis_agent
    tool_names:
      - telemetry_metrics_host_heartbeat_check
      - telemetry_metrics_host_performance_check
    llm_name: agent_llm
  maintenance_check:
    _type: maintenance_check
    llm_name: agent_llm
    static_data_path: PLACEHOLDER_maintenance_static_dataset.csv
  categorizer:
    _type: categorizer
    llm_name: agent_llm

workflow:
  _type: alert_triage_agent
  tool_names:
    - hardware_check
    - host_performance_check
    - monitoring_process_check
    - network_connectivity_check
    - telemetry_metrics_analysis_agent
  llm_name: agent_llm
  offline_mode: true
  offline_data_path: PLACEHOLDER_offline_data.csv
  benign_fallback_data_path: PLACEHOLDER_benign_fallback_offline_data.json

llms:
  agent_llm:
    _type: nim
    model_name: meta/llama-3.1-8b-instruct
    temperature: 0.0
    max_tokens: 2048
    optimizable_params:
      - model_name
      - temperature
    search_space:
      model_name:
        values:
          - meta/llama-3.1-8b-instruct
          - meta/llama-3.1-70b-instruct
          # - meta/llama-3.1-405b-instruct
          # - meta/llama-3.3-3b-instruct
          # - meta/llama-3.3-70b-instruct
          # - meta/llama-4-scout-17b-16e-instruct
          # - openai/gpt-oss-20b
          # - openai/gpt-oss-120b
          # - ibm/granite-3.3-8b-instruct
          # - mistralai/mistral-small-3.1-24b-instruct-2503
          # - mistralai/mistral-medium-3-instruct
      temperature:
        values:
          - 0.0
          - 0.5
  tool_reasoning_llm:
    _type: nim
    model_name: meta/llama-3.1-70b-instruct
    temperature: 0.2
    max_tokens: 2048
  nim_rag_eval_llm:
    _type: nim
    model_name: meta/llama-3.1-70b-instruct
    max_tokens: 8

eval:
  general:
    output_dir: ./tmp_workflow/alert_triage_model_selection_output/
    dataset:
      _type: json
      file_path: PLACEHOLDER_offline_data.json
  evaluators:
    classification_accuracy:
      _type: classification_accuracy
    rag_accuracy:
      _type: ragas
      metric: AnswerAccuracy
      llm_name: nim_rag_eval_llm
  profiler:
    token_uniqueness_forecast: true
    workflow_runtime_forecast: true
    compute_llm_metrics: true
    csv_exclude_io_text: true
    prompt_caching_prefixes:
      enable: true
      min_frequency: 0.1
    bottleneck_analysis:
      enable_nested_stack: true
    concurrency_spike_analysis:
      enable: true
      spike_threshold: 7

Writing ./tmp_workflow/configs/alert_triage_config_model_selection.yml


Above we have defined the `SearchSpace` to include two different LLMs (variants of Meta's Llama 3.1 model) and two temperatures, 0.0 and 0.5, making 4 unique combinations via grid search.
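To make the combination count concrete, the sketch below enumerates the Cartesian product of the two value lists in plain Python. It is only an illustration of what an exhaustive grid covers, not the toolkit's or Optuna's implementation; `enumerate_grid` is a hypothetical helper name.

```python
from itertools import product

# Search space copied from the `agent_llm` section of the config above.
search_space = {
    "model_name": ["meta/llama-3.1-8b-instruct", "meta/llama-3.1-70b-instruct"],
    "temperature": [0.0, 0.5],
}


def enumerate_grid(space):
    """Return every combination of the listed values as a list of dicts."""
    keys = list(space)
    return [dict(zip(keys, combo)) for combo in product(*(space[k] for k in keys))]


grid = enumerate_grid(search_space)
for index, params in enumerate(grid):
    print(index, params)
```

Each additional uncommented model adds one more row per temperature value, so the grid grows multiplicatively with every new parameter.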

Next, let's append some simple optimizer settings to our configuration. We will optimize specifically for the predefined `classification_accuracy` evaluator, use a grid search sampler, and **disable prompt optimization**.

In [38]:
%%writefile -a ./tmp_workflow/configs/alert_triage_config_model_selection.yml
optimizer:
  output_path: ./tmp_workflow/alert_triage_model_selection_output/optimizer/
  reps_per_param_set: 1
  eval_metrics:
    classification_accuracy:
      evaluator_name: classification_accuracy
      direction: maximize
  numeric:
    enabled: true
    sampler: grid
  prompt:
    enabled: false
# path-check-skip-end

Appending to ./tmp_workflow/configs/alert_triage_config_model_selection.yml


Before running, let's replace the placeholder paths in our config with the actual data paths, which depend on where the Alert Triage Agent is installed. This step is only needed so that the notebook works with either installation method (local editable install or git branch).

In [39]:
# Replace placeholder paths with actual package data paths
import importlib.resources
from pathlib import Path

import yaml

# Get the package data path
package_data = importlib.resources.files('nat_alert_triage_agent').joinpath('data')

# Read the YAML file
config_path = Path('./tmp_workflow/configs/alert_triage_config_model_selection.yml')
with open(config_path, 'r') as f:
    config_content = f.read()

# Replace placeholders with actual paths
replacements = {
    'PLACEHOLDER_maintenance_static_dataset.csv': str(package_data / 'maintenance_static_dataset.csv'),
    'PLACEHOLDER_offline_data.csv': str(package_data / 'offline_data.csv'),
    'PLACEHOLDER_benign_fallback_offline_data.json': str(package_data / 'benign_fallback_offline_data.json'),
    'PLACEHOLDER_offline_data.json': str(package_data / 'offline_data.json')
}

for placeholder, actual_path in replacements.items():
    config_content = config_content.replace(placeholder, actual_path)

# Write back to file
with open(config_path, 'w') as f:
    f.write(config_content)

print(f"✓ Config written with data paths from: {package_data}")


✓ Config written with data paths from: /Users/bbednarski/Projects/nat-getting-started-fork/NeMo-Agent-Toolkit/examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/data
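As an optional sanity check, a short scan can confirm that no `PLACEHOLDER_` token survived the replacement step. This is a minimal sketch: `find_unresolved_placeholders` is a hypothetical helper, and the regular expression assumes placeholders consist of word characters, dots, and hyphens, as they do in this notebook.

```python
import re

# Assumption: every placeholder starts with "PLACEHOLDER_" followed by a file name.
PLACEHOLDER_PATTERN = re.compile(r"PLACEHOLDER_[\w.\-]+")


def find_unresolved_placeholders(config_text):
    """Return the sorted, de-duplicated placeholder tokens still present in the text."""
    return sorted(set(PLACEHOLDER_PATTERN.findall(config_text)))


# Small inline sample: one unresolved placeholder and one resolved path.
sample = (
    "static_data_path: PLACEHOLDER_maintenance_static_dataset.csv\n"
    "file_path: /data/offline_data.json\n"
)
leftover = find_unresolved_placeholders(sample)
print(leftover)
```

In the notebook, the same function could be applied to `config_path.read_text()`; an empty list indicates that every placeholder was replaced.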


<a id="test-triage-agent"></a>
## 2.3) Test the tool-calling agent

Let's test the Alert Triage Agent with a single alert. This alert is an "InstanceDown" alert that, according to the offline dataset, is actually a false positive (the system is healthy).


In [40]:
import json

alert = {
    "alert_id": 0,
    "alert_name": "InstanceDown",
    "host_id": "test-instance-0.example.com",
    "severity": "critical",
    "description": (
        "Instance test-instance-0.example.com is not available for scraping for the last 5m. "
        "Please check: - instance is up and running; - monitoring service is in place and running; "
        "- network connectivity is ok"
    ),
    "summary": "Instance test-instance-0.example.com is down",
    "timestamp": "2025-04-28T05:00:00.000000"
}

!nat run --config_file tmp_workflow/configs/alert_triage_config_model_selection.yml --input '{json.dumps(alert)}'

2025-10-24 10:45:54 - INFO     - nat.cli.commands.start:192 - Starting NAT from config file: 'tmp_workflow/configs/alert_triage_config_model_selection.yml'
2025-10-24 10:45:54 - INFO     - nat_alert_triage_agent:104 - Preloaded test data from: /Users/bbednarski/Projects/nat-getting-started-fork/NeMo-Agent-Toolkit/examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/data/offline_data.csv
2025-10-24 10:45:54 - INFO     - nat_alert_triage_agent:108 - Preloaded benign fallback data from: /Users/bbednarski/Projects/nat-getting-started-fork/NeMo-Agent-Toolkit/examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/data/benign_fallback_offline_data.json

Configuration Summary:
--------------------
Workflow Type: alert_triage_agent
Number of Functions: 9
Number of Function Groups: 0
Number of LLMs: 3
Number of Embedders: 0
Number of Memory: 0
Number of Object Stores: 0
Number of Retrievers: 0
Number of TTC Strategies: 0
Number of Authentication Providers: 0

20

After running the cell above, we have confirmed that the tool-calling agent is properly configured and ready for a naive evaluation. This evaluation will be our performance baseline.

<a id="eval-triage-agent1"></a>
## 2.4) Evaluate the tool-calling agent (naive parameters)

*using `nat eval`...*

Now let's run a full evaluation on the Alert Triage Agent using the complete offline dataset. This dataset contains seven alerts with different root causes:

- **False positives** - System appears healthy despite alert
- **Hardware issues** - Hardware failures or degradation  
- **Software issues** - Malfunctioning monitoring services
- **Maintenance** - Scheduled maintenance windows
- **Repetitive behavior** - Benign recurring patterns

The evaluation will measure:
1. **Classification Accuracy** - How well the agent categorizes root causes
2. **Answer Accuracy** - How well the generated reports match expected outcomes (using RAGAS)


In [41]:
!nat eval --config_file ./tmp_workflow/configs/alert_triage_config_model_selection.yml


2025-10-24 10:49:01 - INFO     - nat.eval.evaluate:448 - Starting evaluation run with config file: tmp_workflow/configs/alert_triage_config_model_selection.yml
2025-10-24 10:49:05 - INFO     - nat_alert_triage_agent:104 - Preloaded test data from: /Users/bbednarski/Projects/nat-getting-started-fork/NeMo-Agent-Toolkit/examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/data/offline_data.csv
2025-10-24 10:49:05 - INFO     - nat_alert_triage_agent:108 - Preloaded benign fallback data from: /Users/bbednarski/Projects/nat-getting-started-fork/NeMo-Agent-Toolkit/examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/data/benign_fallback_offline_data.json
Running workflow:   0%|                                   | 0/7 [00:00<?, ?it/s]2025-10-24 10:49:05 - INFO     - nat_alert_triage_agent:246 - Host: [test-instance-0.example.com] is NOT under maintenance according to the maintenance database
2025-10-24 10:49:05 - INFO     - nat_alert_triage_agent:246 - Host:

**Understanding Alert Triage Evaluation Results**

The evaluation generates several output files in the `./tmp_workflow/alert_triage_model_selection_output/` directory (the `output_dir` set in the config):

1. **classification_accuracy_output.json** - Root cause classification metrics
   - Shows accuracy, precision, recall, and F1 scores for each category
   - Contains confusion matrix for detailed analysis
   
2. **rag_accuracy_output.json** - Answer quality metrics
   - Measures how well generated reports match expected outcomes
   - Uses LLM-as-a-judge to evaluate report quality

3. **workflow_output.json** - Complete execution traces
   - Contains full agent trajectories with tool calls
   - Includes generated reports for each alert
   - Shows token usage and performance metrics

Let's examine the classification accuracy results:


In the run shown below, classification accuracy is about 57% and RAG accuracy is about 61%. Exact values vary between runs.

Next we will run the optimizer over a variety of models and some reasonable hyperparameters, then use that optimal configuration and run the evaluation again.

In [42]:

# Load and display classification accuracy results
# path-check-skip-next-line
with open('./tmp_workflow/alert_triage_model_selection_output/classification_accuracy_output.json') as f:
    classification_results = json.load(f)

print("Classification Accuracy Results:")
print(f"Average Score: {classification_results['average_score']:.2%}")
print("\nPer-Alert Results:")
for item in classification_results['eval_output_items']:
    print(f"  Alert {item['id']}: Score={item['score']:.2f} - {item['reasoning']}")

# Load and display RAG accuracy results
# path-check-skip-next-line
with open('./tmp_workflow/alert_triage_model_selection_output/rag_accuracy_output.json') as f:
    rag_results = json.load(f)

print("\n\nRAG Accuracy Results:")
print(f"Average Score: {rag_results['average_score']:.2%}")
print(f"Total Alerts Evaluated: {len(rag_results['eval_output_items'])}")


Classification Accuracy Results:
Average Score: 57.00%

Per-Alert Results:
  Alert 0: Score=1.00 - The prediction false_positive is correct. (label: false_positive)
  Alert 1: Score=1.00 - The prediction hardware is correct. (label: hardware)
  Alert 2: Score=0.00 - The prediction need_investigation is incorrect. (label: software)
  Alert 3: Score=0.00 - The prediction ## alert summary is incorrect. (label: maintenance)
  Alert 4: Score=1.00 - The prediction software is correct. (label: software)
  Alert 5: Score=1.00 - The prediction false_positive is correct. (label: false_positive)
  Alert 6: Score=0.00 - The prediction software is incorrect. (label: repetitive_behavior)


RAG Accuracy Results:
Average Score: 60.71%
Total Alerts Evaluated: 7
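The average classification score can be cross-checked against the per-alert scores printed above. The sketch below assumes the average is the arithmetic mean of those scores; the displayed `57.00%` is consistent with that mean being rounded to two decimals before formatting, although that rounding step is inferred from the output rather than confirmed from the evaluator's source.

```python
# Per-alert scores copied from the output above (alerts 0 through 6).
scores = [1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]


def mean_score(values):
    """Arithmetic mean of per-item scores; returns 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


average = mean_score(scores)
print(f"Unrounded mean: {average:.4f}")
print(f"Rounded to two decimals, then formatted: {round(average, 2):.2%}")
```

With only seven alerts, each alert shifts the average by roughly 14 percentage points, so small differences between runs should be read with caution.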


<a id="optimize-triage-agent"></a>
## 2.5) Optimize the tool-calling agent's LLM

*using `nat optimize`...*

Next we will run `nat optimize` for the Alert Triage Agent using a grid search sweep over the `OptimizableField`s in `alert_triage_config_model_selection.yml`. In this case, we compare only the backbone LLM and temperature settings for the core agent, not the `tool_reasoning_llm`. Optimizable fields were explained earlier in this notebook; here we run a similar optimization pass over a complex tool-calling agent to demonstrate the power of `nat optimize` at scale.

<div style="color: red; font-style: italic;">
<strong>Developer warning:</strong> Running the optimizer can take significant time (~30 minutes for a search space of n=10) and consume many LLM inference tokens. Double-check your config for unneeded search parameters before running.
</div>
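A rough size estimate can help decide whether a search space is affordable before launching it. The sketch below assumes an exhaustive grid in which every parameter combination is evaluated on every dataset item once per repetition; `estimate_workflow_runs` is a hypothetical helper, and the model names are stand-ins. It counts workflow invocations only, and each invocation of a tool-calling agent typically issues several LLM calls.

```python
from math import prod


def estimate_workflow_runs(search_space, reps_per_param_set, dataset_size):
    """Return (number of grid combinations, total workflow invocations)."""
    n_combinations = prod(len(values) for values in search_space.values())
    return n_combinations, n_combinations * reps_per_param_set * dataset_size


# Two models x two temperatures, as left uncommented in the config.
two_models = {"model_name": ["model-a", "model-b"], "temperature": [0.0, 0.5]}
# All eleven listed models x two temperatures.
all_models = {"model_name": [f"model-{i}" for i in range(11)], "temperature": [0.0, 0.5]}

print(estimate_workflow_runs(two_models, reps_per_param_set=1, dataset_size=7))
print(estimate_workflow_runs(all_models, reps_per_param_set=1, dataset_size=7))
```

Under these assumptions, the two-model search needs 28 workflow invocations and the eleven-model search needs 154, before counting evaluator calls.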

In [43]:
!nat optimize --config_file tmp_workflow/configs/alert_triage_config_model_selection.yml

2025-10-24 10:58:18 - INFO     - nat.profiler.parameter_optimization.parameter_optimizer:70 - Using Grid sampler for numeric optimization
[I 2025-10-24 10:58:18,680] A new study created in memory with name: no-name-e747c825-0d94-4852-99fc-7201af9cc160
2025-10-24 10:58:18 - INFO     - nat.profiler.parameter_optimization.parameter_optimizer:125 - Starting numeric / enum parameter optimization...
2025-10-24 10:58:23 - INFO     - nat_alert_triage_agent:104 - Preloaded test data from: /Users/bbednarski/Projects/nat-getting-started-fork/NeMo-Agent-Toolkit/examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/data/offline_data.csv
2025-10-24 10:58:23 - INFO     - nat_alert_triage_agent:108 - Preloaded benign fallback data from: /Users/bbednarski/Projects/nat-getting-started-fork/NeMo-Agent-Toolkit/examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/data/benign_fallback_offline_data.json
Running workflow:   0%|                                   

In [44]:
import ast
from pathlib import Path

import numpy as np
import pandas as pd

# Load the optimizer results
trials_df_path = Path("tmp_workflow/alert_triage_model_selection_output/optimizer/trials_dataframe_params.csv")

if trials_df_path.exists():
    trials_df = pd.read_csv(trials_df_path)

    print("Grid Search Optimization Results")
    print("=" * 80)
    print("\nTrials Summary:")
    print(trials_df.to_string(index=False))
    print("\n" + "=" * 80)

Grid Search Optimization Results

Trials Summary:
 number  value             datetime_start          datetime_complete               duration params_llms.agent_llm.model_name  params_llms.agent_llm.temperature rep_scores  system_attrs_grid_id                                                                                                              system_attrs_search_space    state  pareto_optimal
      0   0.29 2025-10-24 10:58:18.680960 2025-10-24 10:59:54.207607 0 days 00:01:35.526647      meta/llama-3.1-70b-instruct                                0.0   [[0.29]]                     0 {'llms.agent_llm.model_name': ['meta/llama-3.1-8b-instruct', 'meta/llama-3.1-70b-instruct'], 'llms.agent_llm.temperature': [0.0, 0.5]} COMPLETE           False
      1   0.29 2025-10-24 10:59:54.207769 2025-10-24 11:01:25.678329 0 days 00:01:31.470560      meta/llama-3.1-70b-instruct                                0.5   [[0.29]]                     1 {'llms.agent_llm.model_name': ['meta/llama-3.1-8b-i
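Because several trials can share the same objective value (the first two trials above both score 0.29), it is worth listing every trial tied for the best score instead of a single winner. The sketch below uses only the standard library and the column names visible in the table above. The rows are made-up illustration data, not results from this notebook, and `best_trials` is a hypothetical helper.

```python
import csv
import io

# Made-up rows for illustration only; they are NOT results from this notebook.
SAMPLE_CSV = """number,value,params_llms.agent_llm.model_name,params_llms.agent_llm.temperature,state
0,0.40,model-a,0.0,COMPLETE
1,0.60,model-a,0.5,COMPLETE
2,0.60,model-b,0.0,COMPLETE
3,,model-b,0.5,FAIL
"""


def best_trials(csv_text, maximize=True):
    """Return all completed trials that share the best objective value."""
    rows = [row for row in csv.DictReader(io.StringIO(csv_text)) if row["state"] == "COMPLETE"]
    if not rows:
        return []
    pick = max if maximize else min
    best_value = pick(float(row["value"]) for row in rows)
    return [row for row in rows if float(row["value"]) == best_value]


winners = best_trials(SAMPLE_CSV)
for trial in winners:
    print(
        trial["number"],
        trial["params_llms.agent_llm.model_name"],
        trial["params_llms.agent_llm.temperature"],
        trial["value"],
    )
```

When trials tie, a secondary criterion such as model size, latency, or token cost can break the tie.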

<a id="eval-triage-agent2"></a>
## 2.6) Re-evaluate the optimized tool-calling agent

After completing the `nat optimize` run above, a new file with the optimal parameters from the search has been serialized and saved to `./tmp_workflow/alert_triage_model_selection_output/optimizer/optimized_config.yml`.

<div style="color: red; font-style: italic;">
<strong>Note:</strong> Performance of the optimized model may vary with the size of the search space and the number of evaluation trials.
</div>

In [None]:
# path-check-skip-next-line
!nat eval --config_file ./tmp_workflow/alert_triage_model_selection_output/optimizer/optimized_config.yml

In [None]:
import json

# Load and display classification accuracy results
# path-check-skip-next-line
with open('./tmp_workflow/alert_triage_model_selection_output/classification_accuracy_output.json') as f:
    classification_results = json.load(f)

print("Classification Accuracy Results:")
print(f"Average Score: {classification_results['average_score']:.2%}")
print("\nPer-Alert Results:")
for item in classification_results['eval_output_items']:
    print(f"  Alert {item['id']}: Score={item['score']:.2f} - {item['reasoning']}")

# Load and display RAG accuracy results
# path-check-skip-next-line
with open('./tmp_workflow/alert_triage_model_selection_output/rag_accuracy_output.json') as f:
    rag_results = json.load(f)

print("\n\nRAG Accuracy Results:")
print(f"Average Score: {rag_results['average_score']:.2%}")
print(f"Total Alerts Evaluated: {len(rag_results['eval_output_items'])}")


Up to this point, we have shown how to add models and tunable LLM parameters to the `SearchSpace`. We have demonstrated this using `sampler: grid`, which uses Optuna's grid search methods to build a deterministic search space covering every unique combination of the `optimizable_params` in the configuration. If the range of search parameters is large and a grid search produces too many unique combinations, users may instead specify `sampler: bayesian` in their configuration, which uses Optuna's `TPESampler` (univariate) and genetic algorithm (multivariable) samplers to search non-deterministically.

<a id="model-and-prompt-tuning"></a>
# 3.0) Concurrent Model Parameter and Prompt Tuning

NAT uses a Genetic Algorithm (GA) to automatically optimize prompts through evolutionary search. This is a sophisticated approach that treats prompts as "individuals" in a population that evolves over multiple generations to find better-performing variations. The genetic algorithm is inspired by natural evolution and uses LLMs themselves to intelligently mutate and recombine prompts. Instead of random mutations like traditional GAs, NAT leverages the reasoning capabilities of LLMs to make informed changes to prompts.

*Note: The genetic algorithm for prompt optimization is configured through several parameters:*
- *`prompt.enabled`: Enable GA-based prompt optimization (default: `false`)*
- *`prompt.ga_population_size`: Population size - larger populations increase diversity but cost more per generation (default: `10`)*
- *`prompt.ga_generations`: Number of generations to evolve prompts (default: `5`)*
- *`prompt.ga_offspring_size`: Number of offspring per generation - if `null`, defaults to `ga_population_size - ga_elitism`*
- *`prompt.ga_crossover_rate`: Probability of recombination between two parents for each prompt parameter (default: `0.7`)*
- *`prompt.ga_mutation_rate`: Probability of mutating a child's prompt parameter using the LLM optimizer (default: `0.1`)*
- *`prompt.ga_elitism`: Number of elite individuals copied unchanged to the next generation (default: `1`)*
- *`prompt.ga_selection_method`: Parent selection scheme - `tournament` (default) or `roulette`*
- *`prompt.ga_tournament_size`: Tournament size when using tournament selection (default: `3`)*
- *`prompt.ga_parallel_evaluations`: Maximum number of concurrent evaluations (default: `8`)*
- *`prompt.ga_diversity_lambda`: Diversity penalty strength to discourage duplicate prompt sets - `0.0` disables it (default: `0.0`)*
- *`prompt.prompt_population_init_function`: Function name used to mutate base prompts to seed the initial population and perform mutations. NAT includes a built-in `prompt_init` Function you can use.*
- *`prompt.prompt_recombination_function`: Optional function name used to recombine two parent prompts into a child prompt. NAT includes a built-in `prompt_recombiner` Function you can use.*

**For more information, see the [Optimizer documentation](../../docs/source/reference/optimizer.md) or go to your working branch on [GitHub - dev](https://github.com/NVIDIA/NeMo-Agent-Toolkit/blob/develop/docs/source/reference/optimizer.md).**
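The interplay of elitism, tournament selection, crossover rate, and mutation rate can be illustrated with a toy generational loop. This is a minimal sketch under stated assumptions, not NAT's implementation: the keyword arguments loosely mirror the `ga_*` settings listed above, the offspring count is fixed at population size minus elites, and `toy_fitness`, `toy_mutate`, and `toy_recombine` are hypothetical stand-ins for the evaluator and the LLM-driven operators.

```python
import random


def evolve(population, fitness, mutate, recombine, *, generations=5, elitism=1,
           crossover_rate=0.7, mutation_rate=0.1, tournament_size=3, seed=0):
    """Toy generational GA; returns (final population, best fitness per generation)."""
    rng = random.Random(seed)
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        next_population = ranked[:elitism]  # elites are copied unchanged

        def select():
            # Tournament selection: the fittest of a small random sample wins.
            contenders = rng.sample(population, min(tournament_size, len(population)))
            return max(contenders, key=fitness)

        while len(next_population) < len(population):
            parent_a, parent_b = select(), select()
            child = recombine(parent_a, parent_b) if rng.random() < crossover_rate else parent_a
            if rng.random() < mutation_rate:
                child = mutate(child, rng)
            next_population.append(child)
        population = next_population
    history.append(max(fitness(p) for p in population))
    return population, history


# Hypothetical stand-ins for the evaluator and the LLM-driven operators.
KEYWORDS = ["concise", "evidence", "category"]


def toy_fitness(prompt):
    """Count how many target keywords the prompt mentions."""
    return sum(keyword in prompt for keyword in KEYWORDS)


def toy_mutate(prompt, rng):
    """Append a randomly chosen keyword (an LLM would rewrite the prompt instead)."""
    return f"{prompt} {rng.choice(KEYWORDS)}"


def toy_recombine(prompt_a, prompt_b):
    """Join the first half of one parent with the second half of the other."""
    words_a, words_b = prompt_a.split(), prompt_b.split()
    return " ".join(words_a[: len(words_a) // 2] + words_b[len(words_b) // 2:])


seed_prompts = [f"Classify the alert. Variant {i}." for i in range(6)]
final_population, history = evolve(
    seed_prompts, toy_fitness, toy_mutate, toy_recombine, generations=3, mutation_rate=0.5
)
print(history)
```

Because at least one elite is carried over unchanged and the fitness function is deterministic, the best score in `history` never decreases from one generation to the next. In a real run the fitness comes from a noisy evaluation, so that guarantee only holds approximately.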



<a id="all-tuning-config"></a>
## 3.1) Optimizer configuration for all parameters (models, hyperparameters, and prompts)

For this experiment we will configure an optimizer run to find the best model (backbone LLM only), hyperparameters (temperature only), and prompts. We can reuse our existing Alert Triage Agent here with a modified config. Let's create a new config at `./tmp_workflow/configs/alert_triage_config_all_params_selection.yml` to manage this workflow.

First we will copy the same base configuration as the last example - with updated output paths for this experiment.

In [None]:
%%writefile ./tmp_workflow/configs/alert_triage_config_all_params_selection.yml
# path-check-skip-begin
functions:
  hardware_check:
    _type: hardware_check
    llm_name: tool_reasoning_llm
    offline_mode: true
  host_performance_check:
    _type: host_performance_check
    llm_name: tool_reasoning_llm
    offline_mode: true
  monitoring_process_check:
    _type: monitoring_process_check
    llm_name: tool_reasoning_llm
    offline_mode: true
  network_connectivity_check:
    _type: network_connectivity_check
    llm_name: tool_reasoning_llm
    offline_mode: true
  telemetry_metrics_host_heartbeat_check:
    _type: telemetry_metrics_host_heartbeat_check
    llm_name: tool_reasoning_llm
    offline_mode: true
  telemetry_metrics_host_performance_check:
    _type: telemetry_metrics_host_performance_check
    llm_name: tool_reasoning_llm
    offline_mode: true
  telemetry_metrics_analysis_agent:
    _type: telemetry_metrics_analysis_agent
    tool_names:
      - telemetry_metrics_host_heartbeat_check
      - telemetry_metrics_host_performance_check
    llm_name: agent_llm
  maintenance_check:
    _type: maintenance_check
    llm_name: agent_llm
    static_data_path: PLACEHOLDER_maintenance_static_dataset.csv
  categorizer:
    _type: categorizer
    llm_name: agent_llm

workflow:
  _type: alert_triage_agent
  tool_names:
    - hardware_check
    - host_performance_check
    - monitoring_process_check
    - network_connectivity_check
    - telemetry_metrics_analysis_agent
  llm_name: agent_llm
  offline_mode: true
  offline_data_path: PLACEHOLDER_offline_data.csv
  benign_fallback_data_path: PLACEHOLDER_benign_fallback_offline_data.json

llms:
  agent_llm:
    _type: nim
    model_name: meta/llama-3.1-8b-instruct
    temperature: 0.0
    max_tokens: 2048
    optimizable_params:
      - model_name
      - temperature
    search_space:
      model_name:
        values:
          - meta/llama-3.1-8b-instruct
          - meta/llama-3.1-70b-instruct
          # - meta/llama-3.1-405b-instruct
          # - meta/llama-3.3-3b-instruct
          # - meta/llama-3.3-70b-instruct
          # - meta/llama-4-scout-17b-16e-instruct
          # - openai/gpt-oss-20b
          # - openai/gpt-oss-120b
          # - ibm/granite-3.3-8b-instruct
          # - mistralai/mistral-small-3.1-24b-instruct-2503
          # - mistralai/mistral-medium-3-instruct
      temperature:
        values:
          - 0.0
          - 0.5
  tool_reasoning_llm:
    _type: nim
    model_name: meta/llama-3.1-70b-instruct
    temperature: 0.2
    max_tokens: 2048
  nim_rag_eval_llm:
    _type: nim
    model_name: meta/llama-3.1-70b-instruct
    max_tokens: 8

eval:
  general:
    output_dir: ./tmp_workflow/alert_triage_all_params_selection_output/
    dataset:
      _type: json
      file_path: PLACEHOLDER_offline_data.json
  evaluators:
    classification_accuracy:
      _type: classification_accuracy
    rag_accuracy:
      _type: ragas
      metric: AnswerAccuracy
      llm_name: nim_rag_eval_llm
  profiler:
    token_uniqueness_forecast: true
    workflow_runtime_forecast: true
    compute_llm_metrics: true
    csv_exclude_io_text: true
    prompt_caching_prefixes:
      enable: true
      min_frequency: 0.1
    bottleneck_analysis:
      enable_nested_stack: true
    concurrency_spike_analysis:
      enable: true
      spike_threshold: 7

Then we will add in updated optimizer configuration code that allows the system prompts to be optimized.

In [None]:
%%writefile -a ./tmp_workflow/configs/alert_triage_config_all_params_selection.yml
optimizer:
  output_path: ./tmp_workflow/alert_triage_all_params_selection_output/optimizer/
  reps_per_param_set: 1
  eval_metrics:
    classification_accuracy:
      evaluator_name: classification_accuracy
      direction: maximize
  numeric:
    enabled: true
    sampler: grid
  prompt:
    enabled: false
# path-check-skip-end

Again, we will replace the placeholder data paths with the actual package data paths, following the same pattern as in Section 2.2.

In [None]:
# Replace placeholder paths with actual package data paths
import importlib.resources
from pathlib import Path

import yaml

# Get the package data path
package_data = importlib.resources.files('nat_alert_triage_agent').joinpath('data')

# Read the YAML file
config_path = Path('./tmp_workflow/configs/alert_triage_config_all_params_selection.yml')
with open(config_path, 'r') as f:
    config_content = f.read()

# Replace placeholders with actual paths
replacements = {
    'PLACEHOLDER_maintenance_static_dataset.csv': str(package_data / 'maintenance_static_dataset.csv'),
    'PLACEHOLDER_offline_data.csv': str(package_data / 'offline_data.csv'),
    'PLACEHOLDER_benign_fallback_offline_data.json': str(package_data / 'benign_fallback_offline_data.json'),
    'PLACEHOLDER_offline_data.json': str(package_data / 'offline_data.json')
}

for placeholder, actual_path in replacements.items():
    config_content = config_content.replace(placeholder, actual_path)

# Write back to file
with open(config_path, 'w') as f:
    f.write(config_content)

print(f"✓ Config written with data paths from: {package_data}")


<a id="all-tuning-initial-eval"></a>
## 3.2) Evaluate the agent

As we've already tested this agent in Section 2.3, we will go right ahead to an initial evaluation.

In [None]:
!nat eval --config_file ./tmp_workflow/configs/alert_triage_config_all_params_selection.yml

Then let's analyze the results of the untuned agent.

In [None]:
# Load and display classification accuracy results
# path-check-skip-next-line
with open('./tmp_workflow/alert_triage_all_params_selection_output/classification_accuracy_output.json') as f:
    classification_results = json.load(f)

print("Classification Accuracy Results:")
print(f"Average Score: {classification_results['average_score']:.2%}")
print("\nPer-Alert Results:")
for item in classification_results['eval_output_items']:
    print(f"  Alert {item['id']}: Score={item['score']:.2f} - {item['reasoning']}")

# Load and display RAG accuracy results
# path-check-skip-next-line
with open('./tmp_workflow/alert_triage_all_params_selection_output/rag_accuracy_output.json') as f:
    rag_results = json.load(f)

print("\n\nRAG Accuracy Results:")
print(f"Average Score: {rag_results['average_score']:.2%}")
print(f"Total Alerts Evaluated: {len(rag_results['eval_output_items'])}")


<a id="all-tuning-optimize"></a>
## 3.3) Optimize the agent

Now let's re-run the optimizer, but this time we will have model, parameter, and prompt tuning all enabled.

In [None]:
!nat optimize --config_file tmp_workflow/configs/alert_triage_config_all_params_selection.yml

In [None]:
import ast
from pathlib import Path

import numpy as np
import pandas as pd

# Load the optimizer results
trials_df_path = Path("tmp_workflow/alert_triage_all_params_selection_output/optimizer/trials_dataframe_params.csv")

if trials_df_path.exists():
    trials_df = pd.read_csv(trials_df_path)

    print("Grid Search Optimization Results")
    print("=" * 80)
    print("\nTrials Summary:")
    print(trials_df.to_string(index=False))

    print("\n" + "=" * 80)
    print("\nModel Performance Statistics (Mean across repetitions):")
    print("-" * 80)

    # Single-objective runs store the score in 'value'; multi-objective runs use 'values_0', 'values_1', ...
    score_col = 'values_0' if 'values_0' in trials_df.columns else 'value'
    has_rag = 'values_1' in trials_df.columns

    # Group by model name to calculate statistics across repetitions
    for model_name in trials_df['params_llms.agent_llm.model_name'].unique():
        model_trials = trials_df[trials_df['params_llms.agent_llm.model_name'] == model_name]

        print(f"\n{model_name}:")

        # Parse rep_scores to extract individual repetition metrics
        if 'rep_scores' in model_trials.columns:
            all_classification_accuracies = []
            all_rag_accuracies = []

            for rep_scores_str in model_trials['rep_scores']:
                rep_scores = ast.literal_eval(rep_scores_str)
                for score_set in rep_scores:
                    # score_set holds one score per optimizer eval metric:
                    # [classification_accuracy] or [classification_accuracy, rag_accuracy]
                    all_classification_accuracies.append(score_set[0])
                    if len(score_set) > 1:
                        all_rag_accuracies.append(score_set[1])

            # Calculate mean, standard deviation, and the 2.5th/97.5th percentiles
            def calculate_stats(values):
                mean = np.mean(values)
                std = np.std(values)
                ci_lower = np.percentile(values, 2.5)
                ci_upper = np.percentile(values, 97.5)
                return mean, std, ci_lower, ci_upper

            class_acc_mean, class_acc_std, class_acc_ci_lower, class_acc_ci_upper = \
                calculate_stats(all_classification_accuracies)

            print("  Classification Accuracy:")
            print(f"    Mean: {class_acc_mean:.3f} (±{class_acc_std:.3f})")
            print(f"    95% CI: [{class_acc_ci_lower:.3f}, {class_acc_ci_upper:.3f}]")

            # RAG accuracy is only present when it is configured as a second optimizer metric
            if all_rag_accuracies:
                rag_acc_mean, rag_acc_std, rag_acc_ci_lower, rag_acc_ci_upper = calculate_stats(all_rag_accuracies)
                print("  RAG Accuracy:")
                print(f"    Mean: {rag_acc_mean:.3f} (±{rag_acc_std:.3f})")
                print(f"    95% CI: [{rag_acc_ci_lower:.3f}, {rag_acc_ci_upper:.3f}]")
        else:
            # Fallback to aggregated values if rep_scores not available
            print(f"  Classification Accuracy (mean): {np.mean(model_trials[score_col]):.3f}")
            if has_rag:
                print(f"  RAG Accuracy (mean): {np.mean(model_trials['values_1']):.3f}")
            print("  Note: 95% CI not available without rep_scores data")

    print("\n" + "=" * 80)
    print("\nBest Configuration (by aggregated classification accuracy across all repetitions):")
    # Find the trial with best aggregated classification accuracy
    best_trial = trials_df.loc[trials_df[score_col].idxmax()]
    print(f"Model: {best_trial['params_llms.agent_llm.model_name']}")
    print(f"Aggregated Classification Accuracy Score: {best_trial[score_col]}")
    if has_rag:
        print(f"Aggregated RAG Accuracy: {best_trial['values_1']}")
else:
    print(f"Optimizer results not found at {trials_df_path}")
    print("Please run the optimizer cell above first")
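
The `calculate_stats` helper above reports the 2.5th and 97.5th percentiles of the raw repetition scores. That interval describes the spread of individual scores rather than the uncertainty in the mean. If an interval for the mean is wanted, one option is a percentile bootstrap. The following is a minimal sketch that uses only the standard library; `bootstrap_mean_ci` and the example scores are hypothetical and are not part of NAT or of the optimizer output.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for the mean of `values` (hypothetical helper)."""
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    # Take the alpha/2 and 1 - alpha/2 quantiles of the resampled means
    lower = means[int((alpha / 2) * (n_resamples - 1))]
    upper = means[int((1 - alpha / 2) * (n_resamples - 1))]
    return statistics.fmean(values), lower, upper


# Made-up scores for illustration only; in the notebook you would pass
# a list such as all_classification_accuracies instead.
example_scores = [0.6, 0.8, 0.7, 0.9, 0.7]
mean, lower, upper = bootstrap_mean_ci(example_scores)
print(f"Mean: {mean:.3f}, 95% bootstrap CI: [{lower:.3f}, {upper:.3f}]")
```

With only a handful of repetitions per model, any such interval will be wide, so treat it as a rough guide rather than a significance test.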


<a id="eval-triage-agent2"></a>
## 3.4) Re-evaluate the optimized tool-calling agent

After the `nat optimize` run above completes, a new file containing the optimal parameters from the search is serialized and saved to `./tmp_workflow/alert_triage_all_params_selection_output/optimizer/optimized_config.yml`. Let's run those optimized parameters back through `nat eval` and compare the performance.

<div style="color: red; font-style: italic;">
<strong>Note:</strong> Performance of the optimized model may vary depending on the size of the search space and the number of evaluation trials.
</div>

In [None]:
# path-check-skip-next-line
!nat eval --config_file ./tmp_workflow/alert_triage_all_params_selection_output/optimizer/optimized_config.yml

In [None]:
import json

# Load and display classification accuracy results
# path-check-skip-next-line
with open('./tmp_workflow/alert_triage_all_params_selection_output/classification_accuracy_output.json') as f:
    classification_results = json.load(f)

print("Classification Accuracy Results:")
print(f"Average Score: {classification_results['average_score']:.2%}")
print("\nPer-Alert Results:")
for item in classification_results['eval_output_items']:
    print(f"  Alert {item['id']}: Score={item['score']:.2f} - {item['reasoning']}")

# Load and display RAG accuracy results
# path-check-skip-next-line
with open('./tmp_workflow/alert_triage_all_params_selection_output/rag_accuracy_output.json') as f:
    rag_results = json.load(f)

print("\n\nRAG Accuracy Results:")
print(f"Average Score: {rag_results['average_score']:.2%}")
print(f"Total Alerts Evaluated: {len(rag_results['eval_output_items'])}")
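
To make the before-and-after comparison explicit, the average scores from the two runs can be placed side by side. The sketch below assumes you have already collected each run's `average_score` per metric into a dictionary; `compare_scores` is a hypothetical helper and the numbers are made up for illustration. If both runs write to the same output directory, the earlier files may be overwritten, so record the baseline scores before re-running `nat eval`.

```python
def compare_scores(baseline, optimized):
    """Return {metric: (baseline, optimized, delta)} for metrics present in both runs."""
    return {
        name: (baseline[name], optimized[name], optimized[name] - baseline[name])
        for name in baseline
        if name in optimized
    }


# Made-up numbers for illustration only. In the notebook, these would be the
# `average_score` values read from the evaluator output files of each run.
baseline_scores = {"classification_accuracy": 0.50, "rag_accuracy": 0.75}
optimized_scores = {"classification_accuracy": 0.75, "rag_accuracy": 0.75}

comparison = compare_scores(baseline_scores, optimized_scores)
for name, (before, after, delta) in comparison.items():
    print(f"{name}: {before:.2%} -> {after:.2%} ({delta:+.2%})")
```

A positive delta means the optimized configuration scored higher on that metric.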


<a id="next-steps"></a>
# 4.0) Next steps

Continue learning how to fully utilize the NVIDIA NeMo Agent toolkit by exploring the other documentation and advanced agents in the `examples` directory.