
improvement: nvext.agent_hints and nvext.cache_control clean up #1648

Merged
rapids-bot[bot] merged 2 commits into NVIDIA:release/1.5 from bbednarski9:bb/agent-hints-cleanup-v1.5-cherry on Feb 25, 2026

Conversation

@bbednarski9
Contributor

@bbednarski9 bbednarski9 commented Feb 25, 2026

Description

This PR cleans up the logic around the nvext optional parameters for the NAT-Dynamo integration and updates the corresponding examples and unit tests. This aligns with the scope of the current examples and the expected state of Dynamo 1.0.0.

Here is a table summarizing the parameters:

nvext Field Reference

Fields injected into the OpenAI-compatible request body by NAT's _DynamoTransport.
All fields live under the top-level nvext key in the JSON request body.


nvext.agent_hints

Routing hints consumed by Dynamo's built-in router/scheduler and the custom processor.py.

| Field | Description | Used in NAT (Source/Components) | Used in Dynamo (Source/Components) |
| --- | --- | --- | --- |
| `prefix_id` | Unique string identifying the KV cache prefix for a conversation. | Yes — `DynamoPrefixContext` generates it; `processor.py` routes on it. | No |
| `total_requests` | Expected number of LLM calls in this conversation. | Yes — `processor.py` computes `reuse_budget` from it. | No |
| `osl` | Expected output tokens (always a raw integer in `agent_hints`). Config accepts LOW/MEDIUM/HIGH strings for backward compat (mapped to 128/512/2048). | Yes — `processor.py` uses it for router decode-cost weighting; validated for pass-through to Dynamo. | Yes — native `AgentHints.osl` read by Dynamo's built-in frontend. |
| `iat` | Expected inter-arrival time in milliseconds (always a raw integer). Config accepts LOW/MEDIUM/HIGH strings for backward compat (mapped to 50/250/750). | Yes — `processor.py` uses it for router stickiness weighting; also used client-side to compute `cache_control.ttl`. | No |
| `latency_sensitivity` | How latency-sensitive this request is (from the `@latency_sensitive` decorator or the prediction trie). Increase to prioritize a request. | Yes — validated and passed through to Dynamo; sets a context variable via `Context.push_latency_sensitivity()`. | Yes — native `AgentHints.latency_sensitivity` → Dynamo queue ordering. |
| `priority` | Engine scheduling priority (`nvext_max_sensitivity - latency_sensitivity`). Lower values = higher priority for vLLM; SGLang is configurable. | Yes — validated and passed through to Dynamo. | Yes — native `AgentHints.priority` → engine queue, eviction, preemption. |

nvext.cache_control

KV cache lifetime management. Injected as a sibling of agent_hints under nvext.
Only injected when nvext_cache_pin_type is set (not None).

| Field | Description | Used in NAT (Source/Components) | Used in Dynamo (Source/Components) |
| --- | --- | --- | --- |
| `type` | Cache pinning strategy. The only valid value is `"ephemeral"`. Required in JSON (no serde default on deserialization). | `dynamo_llm.py`: injected client-side via `CachePinType.EPHEMERAL` (the default). Configurable via the `nvext_cache_pin_type` param; set to `null` to disable `cache_control` entirely. | `nvext.rs`: deserialized into the `CacheControlType` enum (single variant: `Ephemeral`). Presence of `cache_control` triggers `pin_prefix` after generation in the KV push router. Requires `--enable-cache-control` on the frontend. |
| `ttl` | Duration string for how long to pin the prefix in the KV cache. Optional; defaults to `"5m"` (300 s) when omitted. | `dynamo_llm.py`: computed client-side as `total_requests * iat` (ms), converted to seconds, formatted as `"<N>m"` (whole minutes) or `"<N>s"`. The `iat` field is not consumed by Dynamo — it is only used here for this TTL computation and by the custom Thompson Sampling `processor.py`. | `nvext.rs` `CacheControl::ttl_seconds()`: only parses `"5m"` (300 s) and `"1h"` (3600 s). Any other value logs a warning and falls back to 300 s. The parsed TTL is forwarded as `ttl_seconds` to `pin_prefix` on the worker via the `cache_control` service mesh endpoint (`cache_control.rs` / `handler_base.py`). |
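The TTL round-trip described above can be sketched as follows. `compute_ttl` and `ttl_seconds` are hypothetical helper names mirroring the client side (`dynamo_llm.py`) and server side (`nvext.rs`); the exact rounding of the client-side formatter is an assumption.

```python
def compute_ttl(total_requests: int, iat_ms: int) -> str:
    """Client side: total_requests * iat (ms) -> seconds -> "<N>m" or "<N>s"."""
    seconds = (total_requests * iat_ms) // 1000
    if seconds >= 60 and seconds % 60 == 0:
        return f"{seconds // 60}m"  # whole minutes
    return f"{seconds}s"


def ttl_seconds(ttl: str) -> int:
    """Server side: only "5m" and "1h" are recognized; all else falls back to 300 s."""
    return {"5m": 300, "1h": 3600}.get(ttl, 300)
```

Note the asymmetry the table calls out: the client may emit values such as `"2s"`, but the server-side parser only recognizes `"5m"` and `"1h"` and falls back to 300 s for anything else.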

nvext_cache_control_mode (NAT config field)

Controls when nvext.cache_control is injected (not a wire-format field — only affects client-side behavior):

| Mode | Behavior |
| --- | --- |
| `always` (default) | Inject `cache_control` on every request. Refreshes the TTL each turn. |
| `first_only` | Inject only on the first request per `prefix_id`. Pins the system prompt when first established in the KV cache; subsequent requests benefit from prefix matching without re-pinning the growing conversation context. |

This field is only relevant when nvext_cache_pin_type is set (i.e., "ephemeral"). When nvext_cache_pin_type is null, no cache_control is injected regardless of this mode.


NAT Config → Wire Format Mapping

| NAT Config Field | Wire Format Location | Notes |
| --- | --- | --- |
| `enable_nvext_hints` | (gating only) | When `false` (default), no nvext injection occurs. |
| `nvext_prefix_id_template` | (unused by transport) | Prefix IDs come from `DynamoPrefixContext` at request time. |
| `nvext_prefix_total_requests` | `nvext.agent_hints.total_requests` | |
| `nvext_prefix_osl` | `nvext.agent_hints.osl` | Always sent as a raw integer. |
| `nvext_prefix_iat` | `nvext.agent_hints.iat` | Always sent as a raw integer. Also used to compute `cache_control.ttl`. |
| `nvext_max_sensitivity` | `nvext.agent_hints.priority` | `priority = nvext_max_sensitivity - latency_sensitivity` |
| (from `Context`) | `nvext.agent_hints.latency_sensitivity` | Set by the `@latency_sensitive` decorator or the prediction trie. |
| (from `DynamoPrefixContext`) | `nvext.agent_hints.prefix_id` | Auto-generated per workflow run + call depth. |
| `nvext_cache_pin_type` | `nvext.cache_control.type` | `null` disables `cache_control` entirely. |
| `nvext_cache_control_mode` | (gating only) | Controls whether `cache_control` is injected on every request or just the first. |
| `nvext_prediction_trie_path` | (overrides `agent_hints` values) | When set, per-call predictions override `total_requests`, `osl`, `iat`, and optionally `latency_sensitivity`. |
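The mapping table can be read as roughly the following client-side translation. This is a sketch under the assumption that the config is a plain dict keyed by the NAT field names; the real transport reads a config object, and the `nvext_cache_control_mode` gating is omitted here for brevity.

```python
def config_to_wire(cfg: dict, latency_sensitivity: int, prefix_id: str) -> dict:
    """Hypothetical sketch of the NAT config -> wire format mapping above."""
    if not cfg.get("enable_nvext_hints", False):
        return {}  # gating only: no nvext injection at all
    nvext = {
        "agent_hints": {
            "prefix_id": prefix_id,  # from DynamoPrefixContext at request time
            "total_requests": cfg["nvext_prefix_total_requests"],
            "osl": cfg["nvext_prefix_osl"],  # raw integer on the wire
            "iat": cfg["nvext_prefix_iat"],  # raw integer on the wire
            "latency_sensitivity": latency_sensitivity,  # from Context
            "priority": cfg["nvext_max_sensitivity"] - latency_sensitivity,
        }
    }
    # nvext_cache_pin_type null disables cache_control entirely.
    if cfg.get("nvext_cache_pin_type") is not None:
        nvext["cache_control"] = {"type": cfg["nvext_cache_pin_type"]}
    return {"nvext": nvext}
```

A prediction trie, when configured, would override `total_requests`, `osl`, `iat`, and optionally `latency_sensitivity` before this mapping is applied.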

Closes

By submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • New Features

    • NVEXT-based agent hints for Dynamo LLMs.
    • Configurable cache control modes (ALWAYS, FIRST_ONLY) and optional prediction-trie support.
  • Changes

    • Renamed public config fields from prefix_* → nvext_prefix_*.
    • Added enable_nvext_hints toggle to enable NVEXT flow.
    • Switched from static prefix headers to dynamic per-request agent hints injection.
  • Documentation

    • Updated docs and log messages to reference NVEXT terminology.

@bbednarski9 bbednarski9 requested a review from a team as a code owner February 25, 2026 00:58
@bbednarski9 added labels: improvement (Improvement to existing functionality), breaking (Breaking change), non-breaking (Non-breaking change); removed label: non-breaking (Non-breaking change) on Feb 25, 2026
@coderabbitai

coderabbitai bot commented Feb 25, 2026

Walkthrough

Migrates Dynamo LLM hinting from legacy prefix headers to an nvext-based agent-hints system. Renames config fields from prefix_* to nvext_prefix_*, adds enable_nvext_hints, introduces CacheControlMode and per-request nvext hint injection via a refactored HTTP transport and optional prediction trie.

Changes

**Example configs**
`examples/dynamo_integration/latency_sensitivity_demo/src/.../config_profile.yml`, `examples/dynamo_integration/latency_sensitivity_demo/src/.../config_with_trie.yml`, `examples/dynamo_integration/react_benchmark_agent/src/.../config_dynamo_adk_e2e_test.yml`, `examples/dynamo_integration/react_benchmark_agent/src/.../config_dynamo_prefix_e2e_test.yml`, `examples/dynamo_integration/react_benchmark_agent/src/.../eval_config_no_rethinking_full_test.yml`, `examples/dynamo_integration/react_benchmark_agent/src/.../eval_config_no_rethinking_minimal_test.yml`, `examples/dynamo_integration/react_benchmark_agent/src/.../eval_config_rethinking_full_test.yml`, `examples/dynamo_integration/react_benchmark_agent/src/.../optimize_rethinking_full_test.yml`, `examples/dynamo_integration/react_benchmark_agent/src/.../profile_rethinking_full_test.yml`, `examples/dynamo_integration/react_benchmark_agent/src/.../run_with_prediction_trie.yml`
Replaced legacy `prefix_*` keys with `nvext_prefix_*` equivalents and added `enable_nvext_hints` flags; comments updated to reference NVEXT hints.

**Core Dynamo LLM implementation**
`packages/nvidia_nat_core/src/nat/llm/dynamo_llm.py`
Major refactor: removed legacy `prefix_*` fields, added `nvext_*` fields and `enable_nvext_hints`, introduced `CacheControlMode` (ALWAYS, FIRST_ONLY), changed the transport to inject `nvext.agent_hints` per request, and added prediction-trie integration and per-prefix call tracking. Public field names and factory signatures updated accordingly.

**ADK integration**
`packages/nvidia_nat_adk/src/nat/plugins/adk/llm.py`
Updated to build the AsyncOpenAI client with a custom httpx transport when `enable_nvext_hints` is set; loads an optional prediction trie and switches from static header construction to nvext agent-hints injection.

**LangChain integration**
`packages/nvidia_nat_langchain/src/nat/plugins/langchain/llm.py`
Wiring switched to the `nvext_*` config fields and `enable_nvext_hints`; passes nvext parameters and `cache_control_mode` to httpx client creation; logging and trie loading updated.

**Core tests**
`packages/nvidia_nat_core/tests/nat/llm/test_dynamo_llm.py`, `packages/nvidia_nat_core/tests/nat/llm/test_dynamo_prediction_trie.py`, `packages/nvidia_nat_core/tests/nat/llm/test_dynamic_prediction_hook.py`
Updated tests to reflect the renamed config fields (`nvext_*`), added `enable_nvext_hints` assertions, removed a categorical-conversion test, and added/updated tests for cache-control modes, prediction-trie naming, and per-prefix behavior.

**ADK tests**
`packages/nvidia_nat_adk/tests/test_adk_llm.py`
Adjusted fixtures and assertions to use `nvext_*` fields and `enable_nvext_hints`; expectations now check client injection rather than `extra_headers`.

**LangChain tests**
`packages/nvidia_nat_langchain/tests/test_dynamo_trie_loading.py`, `packages/nvidia_nat_langchain/tests/test_llm_langchain.py`
Updated tests and fixtures to use `nvext_prediction_trie_path`, `enable_nvext_hints`, `nvext_prefix_id_template`, and `CacheControlMode`; verified that nvext fields are not forwarded to LLM client kwargs and that `cache_control_mode` propagates.

**Docs & small updates**
`external/dynamo/components/processor.py`, `packages/nvidia_nat_core/src/nat/data_models/profiler.py`, `packages/nvidia_nat_core/tests/nat/llm/test_runtime_prediction_e2e.py`, `packages/nvidia_nat_eval/src/nat/plugins/eval/profiler/inference_optimization/dynamo_metrics.py`
Updated docstrings and field descriptions to reference the `nvext_*` names; one comment tweak in a test. No behavioral changes.

**ADK implementation tests**
`packages/nvidia_nat_adk/tests/test_adk_llm.py`
Renamed config fields and adjusted test expectations to reflect nvext-based client injection and `enable_nvext_hints`.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client Request
    participant Transport as _DynamoTransport
    participant Trie as Prediction Trie
    participant Server as Dynamo Server

    Client->>Transport: Send LLM request
    Transport->>Transport: Resolve prefix_id (nvext_prefix_id_template)
    alt nvext_prediction_trie_path present
        Transport->>Trie: Lookup prediction for prefix_id
        Trie-->>Transport: Prediction (osl/iat) or None
        Transport->>Transport: Apply prediction overrides if present
    end
    Transport->>Transport: Build nvext.agent_hints (total_requests, osl, iat)
    alt CacheControlMode == FIRST_ONLY & first call for prefix
        Transport->>Transport: Inject nvext.cache_control (pin/ttl)
    else CacheControlMode == ALWAYS
        Transport->>Transport: Inject nvext.cache_control (pin/ttl)
    end
    Transport->>Transport: Increment per-prefix call_index
    Transport->>Server: Forward request with nvext hints injected
    Server-->>Transport: Response
    Transport-->>Client: Return response
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title accurately describes the main change: a refactoring of nvext.agent_hints and nvext.cache_control configuration and cleanup, reflected throughout all modified files updating prefix-based configuration to nvext-based naming. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 97.14%, which is sufficient. The required threshold is 80.00%. |


Signed-off-by: bbednarski9 <bbednarski@nvidia.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (9)
packages/nvidia_nat_eval/src/nat/plugins/eval/profiler/inference_optimization/dynamo_metrics.py (2)

1060-1060: ⚠️ Potential issue | 🟡 Minor

Incorrect attribute name kv_cache_efficiency in docstring example — field is kv_efficiency.

DynamoCoreMetrics exposes kv_efficiency, not kv_cache_efficiency. The example at Line 1060 would raise AttributeError if executed.

📝 Proposed fix
```diff
-            print(f"KV Efficiency: {core.kv_cache_efficiency:.2%}")
+            print(f"KV Efficiency: {core.kv_efficiency:.2%}")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/profiler/inference_optimization/dynamo_metrics.py`
at line 1060, Docstring example uses a non-existent attribute name
kv_cache_efficiency which will raise AttributeError; update the example to use
the actual DynamoCoreMetrics attribute kv_efficiency (replace any occurrences of
kv_cache_efficiency in the DynamoCoreMetrics docstring or examples with
kv_efficiency) so the sample code references the correct field.

38-38: ⚠️ Potential issue | 🟡 Minor

Stale references in module docstring — not updated alongside line 255.

Line 38 still reads prefix hints (osl, iat) while the rest of the PR renames these to nvext_prefix_osl / nvext_prefix_iat. The same stale phrasing also appears at Line 221 inside DynamoCoreMetrics's class docstring.

📝 Suggested updates
```diff
-   - Affected by: prefix_id routing, prefix hints (osl, iat), request patterns
+   - Affected by: prefix_id routing, prefix hints (nvext_prefix_osl, nvext_prefix_iat), request patterns
```

Apply the same fix to Line 221 inside DynamoCoreMetrics.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/profiler/inference_optimization/dynamo_metrics.py`
at line 38, Update the stale phrase "prefix hints (osl, iat)" in the module
docstring and inside the DynamoCoreMetrics class docstring to the new names
"nvext_prefix_osl" and "nvext_prefix_iat"; specifically modify the module-level
docstring near the top and the DynamoCoreMetrics class docstring (search for the
symbol DynamoCoreMetrics) so both references now read something like "prefix
hints (nvext_prefix_osl, nvext_prefix_iat)" to match the rest of the PR.
packages/nvidia_nat_core/tests/nat/llm/test_runtime_prediction_e2e.py (3)

129-130: ⚠️ Potential issue | 🟡 Minor

DynamoPrefixContext.clear() is not protected against assertion failures.

Each test calls DynamoPrefixContext.set(...) unconditionally and then DynamoPrefixContext.clear() only at the very end. If any assert between them raises AssertionError, clear() is skipped and subsequent tests start with a stale prefix. While each test overwrites the prefix via .set(...), the prior dirty state is observable during the earlier _function_path_stack.set(None) reset window.

Use try/finally or move setup/teardown into a fixture.

🛡️ Proposed fix (inline try/finally pattern)
```diff
-    DynamoPrefixContext.set("e2e-test")
-
-    with ctx.push_active_function(...):
-        ...
-
-    DynamoPrefixContext.clear()
+    DynamoPrefixContext.set("e2e-test")
+    try:
+        with ctx.push_active_function(...):
+            ...
+    finally:
+        DynamoPrefixContext.clear()
```

Apply the same pattern to test_e2e_fallback_to_root and test_e2e_multiple_calls_in_same_context.

Also applies to: 191-191, 219-220, 243-243, 271-272, 328-328

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nvidia_nat_core/tests/nat/llm/test_runtime_prediction_e2e.py` around
lines 129 - 130, Wrap each test that calls DynamoPrefixContext.set(...) in a
try/finally (or move set/clear into a pytest fixture) so
DynamoPrefixContext.clear() always runs even if an assertion fails; specifically
update the tests that call DynamoPrefixContext.set (e.g.,
test_e2e_fallback_to_root, test_e2e_multiple_calls_in_same_context and the
instances around the original set at DynamoPrefixContext.set("e2e-test")) to
place the body of the test inside the try and call DynamoPrefixContext.clear()
in the finally block, and apply the same pattern to the other occurrences listed
in the comment.

108-120: 🛠️ Refactor suggestion | 🟠 Major

Extract repeated transport setup into a pytest fixture.

The identical 6-line block (mock response + mock transport + _DynamoTransport construction) is copy-pasted into all three test functions. Per coding guidelines, frequently repeated code should be extracted into a named pytest fixture.

♻️ Proposed fixture extraction
```diff
+import pytest
+
+@pytest.fixture(name="dynamo_transport_with_lookup")
+def fixture_dynamo_transport_with_lookup():
+    mock_response = httpx.Response(200, json={"result": "ok"})
+    mock_transport = MagicMock()
+    mock_transport.handle_async_request = AsyncMock(return_value=mock_response)
+
+    def _factory(lookup):
+        transport = _DynamoTransport(
+            transport=mock_transport,
+            total_requests=10,
+            osl=512,
+            iat=250,
+            prediction_lookup=lookup,
+        )
+        return transport, mock_transport
+
+    return _factory
```

Then each test receives dynamo_transport_with_lookup as a parameter and calls transport, mock_transport = dynamo_transport_with_lookup(lookup), eliminating the duplicated setup.

As per coding guidelines: "Any frequently repeated code should be extracted into pytest fixtures."

Also applies to: 199-210, 251-262

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nvidia_nat_core/tests/nat/llm/test_runtime_prediction_e2e.py` around
lines 108 - 120, Extract the repeated mock transport setup into a pytest fixture
named e.g. dynamo_transport_with_lookup that builds mock_response =
httpx.Response(200, json={"result":"ok"}), mock_transport with
handle_async_request AsyncMock returning mock_response, and constructs the
_DynamoTransport with transport=mock_transport, total_requests=10, osl=512,
iat=250, prediction_lookup=lookup; have the fixture accept or depend on the
existing lookup test fixture and return (transport, mock_transport). Replace the
six-line block in tests calling _DynamoTransport with a call to
dynamo_transport_with_lookup(lookup) (or by declaring
dynamo_transport_with_lookup and lookup as test parameters) and use the returned
transport and mock_transport in each test. Ensure the fixture is placed in the
test module or conftest so all three tests can import it.

133-133: 🛠️ Refactor suggestion | 🟠 Major

Move import json to the module-level import block.

Placing import json inside with blocks (and even inside async function bodies) violates PEP 8, which requires all imports at the top of the file. This also repeats the import in every test function.

♻️ Proposed fix
```diff
+import json
 import tempfile
 from pathlib import Path
 ...
```

Then remove the three inline import json statements at lines 133, 222, and 275.

As per coding guidelines: "Follow PEP 20 and PEP 8 for Python style guidelines."

Also applies to: 222-222, 275-275

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nvidia_nat_core/tests/nat/llm/test_runtime_prediction_e2e.py` at
line 133, Move the inline "import json" statements to the module-level import
block at the top of the file: add a single "import json" with the other imports
in packages/nvidia_nat_core/tests/nat/llm/test_runtime_prediction_e2e.py, and
remove the three inline "import json" occurrences currently placed inside test
function bodies/with blocks (the ones inside async tests and with contexts).
Ensure no other code changes are made—just a single top-level import and
deletion of the redundant inline imports.
examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/config_dynamo_prefix_e2e_test.yml (1)

53-68: ⚠️ Potential issue | 🟡 Minor

Stale example still documents legacy x-prefix-* headers.

After migrating to nvext hints, Lines 64-68 should describe nvext.agent_hints payload fields (and nvext.cache_control when applicable), not legacy header keys.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/config_dynamo_prefix_e2e_test.yml`
around lines 53 - 68, Update the stale example comment (lines showing
"x-prefix-*" headers) to document the new nvext hint payload fields instead:
reference the config keys such as enable_nvext_hints, nvext_prefix_id_template,
nvext_prefix_total_requests, nvext_prefix_osl and nvext_prefix_iat, and show the
corresponding nvext payload shape (e.g. nvext.agent_hints.prefix_id,
nvext.agent_hints.total_requests, nvext.agent_hints.osl, nvext.agent_hints.iat)
and mention nvext.cache_control when applicable instead of legacy header names.
examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/run_with_prediction_trie.yml (1)

18-43: ⚠️ Potential issue | 🟡 Minor

Terminology is still mixed between headers and nvext hints.

This config now uses nvext hints, but the file still describes “header injection” and x-nat-* headers. Please update wording to nvext.agent_hints-based injection for consistency.

Also applies to: 150-166

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/run_with_prediction_trie.yml`
around lines 18 - 43, Update the docs to stop calling this "header injection"
and instead describe nvext.agent_hints-based injection: replace references to
"header injection" and "x-nat-*" with wording that the prediction trie populates
nvext.agent_hints (e.g., nvext.agent_hints.remaining_llm_calls,
nvext.agent_hints.interarrival_ms, nvext.agent_hints.expected_output_tokens) for
each LLM call; update all descriptive text that mentions x-nat-* headers or
"headers" (including the later block currently mentioning those names) to
consistently describe the nvext.agent_hints keys and the prediction lookup by
function path and call index used by the run_with_prediction_trie.yml
configuration.
external/dynamo/components/processor.py (1)

421-431: ⚠️ Potential issue | 🟡 Minor

Docstring now contradicts _to_category behavior.

Line 430 says values are always raw integers, but Lines 435-436 still accept LOW/MEDIUM/HIGH. Please align the docstring with actual parsing behavior (or enforce numeric-only input).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@external/dynamo/components/processor.py` around lines 421 - 431, The
docstring for _to_category currently contradicts its implementation: update the
_to_category docstring to reflect that the function accepts either categorical
strings ("LOW"/"MEDIUM"/"HIGH") or numeric inputs (int or numeric string which
is parsed with int()), describe the exact threshold mapping (value <
thresholds[0] → LOW, value < thresholds[1] → MEDIUM, else HIGH), and state that
non-numeric/non-categorical inputs will raise a ValueError; alternatively, if
you prefer enforcing numeric-only, modify _to_category to reject categorical
strings and raise on non-numeric input—choose one approach and make the
docstring and implementation consistent (refer to the _to_category function and
thresholds parameter when making the change).
packages/nvidia_nat_core/src/nat/llm/dynamo_llm.py (1)

487-513: ⚠️ Potential issue | 🟠 Major

Bound _call_counts to prevent unbounded memory growth.

_call_counts grows per unique prefix_id and is never pruned. In long-lived workers with many conversations/prefixes, this can grow indefinitely and degrade memory/performance.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nvidia_nat_core/src/nat/llm/dynamo_llm.py` around lines 487 - 513,
The per-prefix counter dict _call_counts in handle_async_request grows unbounded
per unique prefix_id; limit its size by adding a max capacity (e.g.,
self._max_call_counts) and evict old entries when inserting a new prefix_id:
inside the existing with self._call_counts_lock block, when computing call_index
for prefix_id check if prefix_id exists, otherwise if len(self._call_counts) >=
self._max_call_counts remove the oldest/least-recently-used key (use an
OrderedDict or maintain an eviction queue) before inserting the new prefix_id,
ensuring thread-safety with _call_counts_lock and preserving existing semantics
of call_index and _call_counts updates in handle_async_request.
🧹 Nitpick comments (1)
packages/nvidia_nat_langchain/src/nat/plugins/langchain/llm.py (1)

263-287: Load prediction trie only when nvext hints are enabled.

Right now the trie is loaded whenever nvext_prediction_trie_path is set, even when enable_nvext_hints is false (so it is never used). Gate trie loading by the same feature flag to avoid unnecessary startup I/O and warning noise.

♻️ Suggested change
```diff
-    prediction_lookup: PredictionTrieLookup | None = None
-    if llm_config.nvext_prediction_trie_path:
+    prediction_lookup: PredictionTrieLookup | None = None
+    if llm_config.enable_nvext_hints and llm_config.nvext_prediction_trie_path:
         try:
             trie_path = Path(llm_config.nvext_prediction_trie_path)
             trie = load_prediction_trie(trie_path)
             prediction_lookup = PredictionTrieLookup(trie)
             logger.info("Loaded prediction trie from %s", llm_config.nvext_prediction_trie_path)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nvidia_nat_langchain/src/nat/plugins/langchain/llm.py` around lines
263 - 287, The prediction trie is loaded unconditionally when
nvext_prediction_trie_path is set, wasting I/O when
llm_config.enable_nvext_hints is false; move the trie-loading block (the
Path/open and load_prediction_trie + instantiation of PredictionTrieLookup and
logger lines) inside the same enable flag check that creates http_async_client
(the create_httpx_client_with_dynamo_hooks call) so the trie is only loaded when
enable_nvext_hints is true and actually used; keep the same exception handling
(FileNotFoundError and generic Exception) but scope it under the enablement
branch and pass the resulting prediction_lookup into
create_httpx_client_with_dynamo_hooks as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/configs/config_profile.yml`:
- Around line 72-79: The comment claiming "Static fallback values (not used
because baseline does not use agent hints)" is misleading because
enable_nvext_hints is true and there is no trie path, so the nvext_prefix_* keys
(nvext_prefix_total_requests, nvext_prefix_osl, nvext_prefix_iat) act as active
defaults; update the comment to accurately state that these values serve as
active default hints when enable_nvext_hints is true and no trie path is
provided (or else remove/disable them), and ensure references to
enable_nvext_hints and the nvext_prefix_* keys reflect this behavior.

In
`@examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/config_dynamo_adk_e2e_test.yml`:
- Around line 61-69: The NVEXT comment block is outdated and mentions
extra_headers and a shared static prefix; update the comments around
enable_nvext_hints, nvext_prefix_id_template, nvext_prefix_total_requests,
nvext_prefix_osl, and nvext_prefix_iat to describe the current nvext-agent-hints
behavior (hints are injected into the request body per request rather than sent
via extra_headers and the prefix is templated per request via the {uuid}
placeholder), and clarify that nvext_prefix_total_requests is an expected
tool-call count and nvext_prefix_osl / nvext_prefix_iat are hint severity levels
for output length and inter-arrival timing respectively.

In `@packages/nvidia_nat_adk/src/nat/plugins/adk/llm.py`:
- Around line 202-203: The except Exception block that currently calls
logger.warning("Failed to load prediction trie: %s", e) should use
logger.exception(...) to record the full stack trace while keeping the same
graceful-degrade behavior; locate the try/except around the "Failed to load
prediction trie" message in packages/nvidia_nat_adk/src/nat/plugins/adk/llm.py
and replace the logger.warning call with logger.exception using a similar
message so the exception and stack trace are logged but the exception is not
re-raised.

In `@packages/nvidia_nat_core/src/nat/data_models/profiler.py`:
- Line 67: The comment notes an inconsistent identifier: update the reference
named prefix_id to the new naming scheme nvext_prefix_id so it matches
nvext_prefix_osl and nvext_prefix_iat; locate the occurrence in the
Profiler/data model code where the string or field listing reads "prefix_id"
alongside "nvext_prefix_osl" and "nvext_prefix_iat" and rename that token to
"nvext_prefix_id" (and update any surrounding docstring or comment text that
mentions the old name) so all NAT config fields consistently use the
nvext_prefix_* convention.

In `@packages/nvidia_nat_core/src/nat/llm/dynamo_llm.py`:
- Around line 508-513: Move the increment of the per-prefix counter so it only
runs after you confirm the request is eligible for injection; currently
call_index is updated inside the with self._call_counts_lock before validating
JSON/emptiness, which allows non-injectible requests to consume the FIRST_ONLY
slot. Change the code around the with self._call_counts_lock that updates
self._call_counts[prefix_id] to perform the get/assign only after the
injectibility check (the same check that determines whether to apply
cache_control), using the same lock when mutating _call_counts; apply this fix
to the three occurrences (the shown block using
call_index/prefix_id/_call_counts_lock and the similar blocks at the other two
locations reported).

In `@packages/nvidia_nat_core/tests/nat/llm/test_dynamo_llm.py`:
- Around line 76-84: The test name/docstring misstates behavior: setting
DynamoModelConfig.nvext_prefix_id_template to None does not disable hint
injection (that is controlled by enable_nvext_hints). Update the test
(test_disable_prefix_headers) to rename it (e.g.,
test_set_prefix_template_none or
test_prefix_template_none_does_not_toggle_hints) and change the docstring to
state it only sets the prefix template to None and does not affect
enable_nvext_hints; keep the assertion that config.nvext_prefix_id_template is
None and do not change enable_nvext_hints behavior.

In `@packages/nvidia_nat_langchain/tests/test_llm_langchain.py`:
- Around line 182-190: Update the stale docstrings that say “header injection”
to reflect that the flow injects nvext fields into the JSON request body: for
fixtures dynamo_cfg and dynamo_cfg_with_prefix (and the other similar docstring
later in the file), change wording to something like “nvext request-body
injection (injects nvext.agent_hints / nvext.cache_control into the JSON body)”
so the documentation accurately describes request-body injection rather than
header injection.

---

Outside diff comments:
In
`@examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/config_dynamo_prefix_e2e_test.yml`:
- Around line 53-68: Update the stale example comment (lines showing
"x-prefix-*" headers) to document the new nvext hint payload fields instead:
reference the config keys such as enable_nvext_hints, nvext_prefix_id_template,
nvext_prefix_total_requests, nvext_prefix_osl and nvext_prefix_iat, and show the
corresponding nvext payload shape (e.g. nvext.agent_hints.prefix_id,
nvext.agent_hints.total_requests, nvext.agent_hints.osl, nvext.agent_hints.iat)
and mention nvext.cache_control when applicable instead of legacy header names.

In
`@examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/run_with_prediction_trie.yml`:
- Around line 18-43: Update the docs to stop calling this "header injection" and
instead describe nvext.agent_hints-based injection: replace references to
"header injection" and "x-nat-*" with wording that the prediction trie populates
nvext.agent_hints (e.g., nvext.agent_hints.remaining_llm_calls,
nvext.agent_hints.interarrival_ms, nvext.agent_hints.expected_output_tokens) for
each LLM call; update all descriptive text that mentions x-nat-* headers or
"headers" (including the later block currently mentioning those names) to
consistently describe the nvext.agent_hints keys and the prediction lookup by
function path and call index used by the run_with_prediction_trie.yml
configuration.

In `@external/dynamo/components/processor.py`:
- Around line 421-431: The docstring for _to_category currently contradicts its
implementation: update the _to_category docstring to reflect that the function
accepts either categorical strings ("LOW"/"MEDIUM"/"HIGH") or numeric inputs
(int or numeric string which is parsed with int()), describe the exact threshold
mapping (value < thresholds[0] → LOW, value < thresholds[1] → MEDIUM, else
HIGH), and state that non-numeric/non-categorical inputs will raise a
ValueError; alternatively, if you prefer enforcing numeric-only, modify
_to_category to reject categorical strings and raise on non-numeric input—choose
one approach and make the docstring and implementation consistent (refer to the
_to_category function and thresholds parameter when making the change).
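A sketch of the first option, where the docstring would match a string-or-numeric implementation; the threshold values here are illustrative, not the ones in processor.py:

```python
def to_category(value, thresholds=(200, 600)) -> str:
    """Sketch of the documented contract: accept LOW/MEDIUM/HIGH strings
    or numeric input; map numbers onto the two thresholds
    (value < thresholds[0] -> LOW, value < thresholds[1] -> MEDIUM,
    else HIGH); raise ValueError for anything else."""
    if isinstance(value, str) and value.upper() in ("LOW", "MEDIUM", "HIGH"):
        return value.upper()
    n = int(value)  # raises ValueError for non-numeric strings
    if n < thresholds[0]:
        return "LOW"
    if n < thresholds[1]:
        return "MEDIUM"
    return "HIGH"

assert to_category("low") == "LOW"
assert to_category(128) == "LOW"
assert to_category("512") == "MEDIUM"
assert to_category(2048) == "HIGH"
```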

In `@packages/nvidia_nat_core/src/nat/llm/dynamo_llm.py`:
- Around line 487-513: The per-prefix counter dict _call_counts in
handle_async_request grows unbounded per unique prefix_id; limit its size by
adding a max capacity (e.g., self._max_call_counts) and evict old entries when
inserting a new prefix_id: inside the existing with self._call_counts_lock
block, when computing call_index for prefix_id check if prefix_id exists,
otherwise if len(self._call_counts) >= self._max_call_counts remove the
oldest/least-recently-used key (use an OrderedDict or maintain an eviction
queue) before inserting the new prefix_id, ensuring thread-safety with
_call_counts_lock and preserving existing semantics of call_index and
_call_counts updates in handle_async_request.
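One way to sketch the suggested bound (BoundedCallCounts and the capacity are illustrative; the real class and config live in dynamo_llm.py):

```python
import threading
from collections import OrderedDict

class BoundedCallCounts:
    """Sketch of the fix: cap the per-prefix counter dict and evict the
    least-recently-used prefix_id, all under the same lock."""

    def __init__(self, max_call_counts: int = 2):
        self._max_call_counts = max_call_counts
        self._call_counts = OrderedDict()
        self._lock = threading.Lock()

    def next_index(self, prefix_id: str) -> int:
        with self._lock:
            if prefix_id in self._call_counts:
                self._call_counts.move_to_end(prefix_id)  # mark as recently used
            elif len(self._call_counts) >= self._max_call_counts:
                self._call_counts.popitem(last=False)  # evict the LRU entry
            call_index = self._call_counts.get(prefix_id, 0)
            self._call_counts[prefix_id] = call_index + 1
            return call_index

c = BoundedCallCounts(max_call_counts=2)
assert c.next_index("a") == 0
assert c.next_index("b") == 0
assert c.next_index("a") == 1   # "a" refreshed; "b" is now least recently used
assert c.next_index("c") == 0   # inserting "c" evicts "b"
assert c.next_index("b") == 0   # "b" was evicted, so its counter restarts
```

The trade-off to note: evicting a prefix_id resets its call_index, so a long-lived conversation that falls out of the cache will be treated as "first call" again under FIRST_ONLY mode.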

In `@packages/nvidia_nat_core/tests/nat/llm/test_runtime_prediction_e2e.py`:
- Around line 129-130: Wrap each test that calls DynamoPrefixContext.set(...) in
a try/finally (or move set/clear into a pytest fixture) so
DynamoPrefixContext.clear() always runs even if an assertion fails; specifically
update the tests that call DynamoPrefixContext.set (e.g.,
test_e2e_fallback_to_root, test_e2e_multiple_calls_in_same_context and the
instances around the original set at DynamoPrefixContext.set("e2e-test")) to
place the body of the test inside the try and call DynamoPrefixContext.clear()
in the finally block, and apply the same pattern to the other occurrences listed
in the comment.
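The try/finally shape being asked for, using a contextvars stand-in for DynamoPrefixContext (the real class may differ):

```python
import contextvars

_prefix = contextvars.ContextVar("prefix_id", default=None)

class PrefixContextSketch:
    """Stand-in for DynamoPrefixContext with set/get/clear."""
    @staticmethod
    def set(value):
        _prefix.set(value)
    @staticmethod
    def get():
        return _prefix.get()
    @staticmethod
    def clear():
        _prefix.set(None)

def run_test_body(body):
    """The requested shape: clear() always runs, even if the body raises."""
    PrefixContextSketch.set("e2e-test")
    try:
        body()
    finally:
        PrefixContextSketch.clear()

def failing_body():
    raise AssertionError("boom")

raised = False
try:
    run_test_body(failing_body)
except AssertionError:
    raised = True
assert raised
assert PrefixContextSketch.get() is None  # cleared despite the failed assertion
```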
- Around line 108-120: Extract the repeated mock transport setup into a pytest
fixture named e.g. dynamo_transport_with_lookup that builds mock_response =
httpx.Response(200, json={"result":"ok"}), mock_transport with
handle_async_request AsyncMock returning mock_response, and constructs the
_DynamoTransport with transport=mock_transport, total_requests=10, osl=512,
iat=250, prediction_lookup=lookup; have the fixture accept or depend on the
existing lookup test fixture and return (transport, mock_transport). Replace the
six-line block in tests calling _DynamoTransport with a call to
dynamo_transport_with_lookup(lookup) (or by declaring
dynamo_transport_with_lookup and lookup as test parameters) and use the returned
transport and mock_transport in each test. Ensure the fixture is placed in the
test module or conftest so all three tests can import it.
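A stdlib-only sketch of the proposed fixture written as a plain factory (SimpleNamespace stands in for httpx.Response and the mock transport; FakeTransport stands in for _DynamoTransport, whose real constructor signature should be checked against the source):

```python
import asyncio
from types import SimpleNamespace
from unittest.mock import AsyncMock

def make_transport_with_lookup(lookup, transport_cls):
    """Build the canned 200 response and AsyncMock transport once and
    return (transport, mock_transport), replacing the repeated
    six-line block in the three tests."""
    mock_response = SimpleNamespace(status_code=200, json=lambda: {"result": "ok"})
    mock_transport = SimpleNamespace(
        handle_async_request=AsyncMock(return_value=mock_response))
    transport = transport_cls(
        transport=mock_transport,
        total_requests=10,
        osl=512,
        iat=250,
        prediction_lookup=lookup,
    )
    return transport, mock_transport

class FakeTransport:
    """Stand-in so the factory can be exercised without the real class."""
    def __init__(self, **kwargs):
        self.kwargs = kwargs

t, m = make_transport_with_lookup(lookup=None, transport_cls=FakeTransport)
assert t.kwargs["osl"] == 512
resp = asyncio.run(m.handle_async_request())
assert resp.json() == {"result": "ok"}
```

In the actual test module this factory would become a pytest fixture depending on the existing lookup fixture, as the comment suggests.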
- Line 133: Move the inline "import json" statements to the module-level import
block at the top of the file: add a single "import json" with the other imports
in packages/nvidia_nat_core/tests/nat/llm/test_runtime_prediction_e2e.py, and
remove the three inline "import json" occurrences currently placed inside test
function bodies/with blocks (the ones inside async tests and with contexts).
Ensure no other code changes are made—just a single top-level import and
deletion of the redundant inline imports.

In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/profiler/inference_optimization/dynamo_metrics.py`:
- Line 1060: Docstring example uses a non-existent attribute name
kv_cache_efficiency which will raise AttributeError; update the example to use
the actual DynamoCoreMetrics attribute kv_efficiency (replace any occurrences of
kv_cache_efficiency in the DynamoCoreMetrics docstring or examples with
kv_efficiency) so the sample code references the correct field.
- Line 38: Update the stale phrase "prefix hints (osl, iat)" in the module
docstring and inside the DynamoCoreMetrics class docstring to the new names
"nvext_prefix_osl" and "nvext_prefix_iat"; specifically modify the module-level
docstring near the top and the DynamoCoreMetrics class docstring (search for the
symbol DynamoCoreMetrics) so both references now read something like "prefix
hints (nvext_prefix_osl, nvext_prefix_iat)" to match the rest of the PR.

---

Nitpick comments:
In `@packages/nvidia_nat_langchain/src/nat/plugins/langchain/llm.py`:
- Around line 263-287: The prediction trie is loaded unconditionally when
nvext_prediction_trie_path is set, wasting I/O when
llm_config.enable_nvext_hints is false; move the trie-loading block (the
Path/open and load_prediction_trie + instantiation of PredictionTrieLookup and
logger lines) inside the same enable flag check that creates http_async_client
(the create_httpx_client_with_dynamo_hooks call) so the trie is only loaded when
enable_nvext_hints is true and actually used; keep the same exception handling
(FileNotFoundError and generic Exception) but scope it under the enablement
branch and pass the resulting prediction_lookup into
create_httpx_client_with_dynamo_hooks as before.
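The control flow being requested, sketched with stand-in callables (build_client, load_trie, and create_client are hypothetical names for the LangChain plugin's real functions):

```python
def build_client(enable_nvext_hints, trie_path, load_trie, create_client):
    """Load the prediction trie only inside the enablement branch."""
    prediction_lookup = None
    if enable_nvext_hints:
        if trie_path:
            try:
                prediction_lookup = load_trie(trie_path)  # I/O happens here only
            except FileNotFoundError:
                prediction_lookup = None  # same graceful degrade as before
        return create_client(prediction_lookup=prediction_lookup)
    # Hints disabled: no trie I/O, no dynamo hooks.
    return create_client(prediction_lookup=None)

loads = []
client = build_client(
    enable_nvext_hints=False,
    trie_path="trie.json",
    load_trie=lambda p: loads.append(p) or "TRIE",
    create_client=lambda prediction_lookup: {"lookup": prediction_lookup},
)
assert loads == []                 # trie file never touched when hints are disabled
assert client == {"lookup": None}
```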

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1f875ce and be9d98e.

📒 Files selected for processing (23)
  • examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/configs/config_profile.yml
  • examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/configs/config_with_trie.yml
  • examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/config_dynamo_adk_e2e_test.yml
  • examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/config_dynamo_prefix_e2e_test.yml
  • examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/eval_config_no_rethinking_full_test.yml
  • examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/eval_config_no_rethinking_minimal_test.yml
  • examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/eval_config_rethinking_full_test.yml
  • examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/optimize_rethinking_full_test.yml
  • examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/profile_rethinking_full_test.yml
  • examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/run_with_prediction_trie.yml
  • external/dynamo/components/processor.py
  • packages/nvidia_nat_adk/src/nat/plugins/adk/llm.py
  • packages/nvidia_nat_adk/tests/test_adk_llm.py
  • packages/nvidia_nat_core/src/nat/data_models/profiler.py
  • packages/nvidia_nat_core/src/nat/llm/dynamo_llm.py
  • packages/nvidia_nat_core/tests/nat/llm/test_dynamic_prediction_hook.py
  • packages/nvidia_nat_core/tests/nat/llm/test_dynamo_llm.py
  • packages/nvidia_nat_core/tests/nat/llm/test_dynamo_prediction_trie.py
  • packages/nvidia_nat_core/tests/nat/llm/test_runtime_prediction_e2e.py
  • packages/nvidia_nat_eval/src/nat/plugins/eval/profiler/inference_optimization/dynamo_metrics.py
  • packages/nvidia_nat_langchain/src/nat/plugins/langchain/llm.py
  • packages/nvidia_nat_langchain/tests/test_dynamo_trie_loading.py
  • packages/nvidia_nat_langchain/tests/test_llm_langchain.py
💤 Files with no reviewable changes (1)
  • packages/nvidia_nat_core/tests/nat/llm/test_dynamic_prediction_hook.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/config_dynamo_adk_e2e_test.yml (1)

17-43: ⚠️ Potential issue | 🟡 Minor

File header block still describes the legacy extra_headers approach.

Several comment lines in the unchanged header block contradict the nvext-based injection introduced by this PR:

  • Line 24: "Automatic prefix header injection via LiteLLM's extra_headers"
  • Lines 28–30: "Headers are passed via extra_headers at initialization time … All requests from the same client share the same prefix ID"
  • Line 41: "LLM calls should include Dynamo prefix headers (visible in server logs)"
  • Line 42: "All tool calls within the conversation share the same prefix ID"
✏️ Suggested header update

```diff
-# - Automatic prefix header injection via LiteLLM's extra_headers
-# - KV cache optimization through consistent prefix IDs per client
+# - Per-request NVEXT hints injected into `nvext.agent_hints` request body fields
+# - KV cache optimization via per-request prefix IDs managed by DynamoPrefixContext
 #
 # Key Differences from LangChain:
 # - ADK uses LiteLLM under the hood, not httpx directly
-# - Headers are passed via extra_headers at initialization time
-# - All requests from the same client share the same prefix ID (ideal for conversations)
+# - NVEXT hints are injected per-request into the request body (not via extra_headers)
+# - Prefix IDs are templated per request via the {uuid} placeholder in nvext_prefix_id_template
 ...
 # Expected Output:
-#   - LLM calls should include Dynamo prefix headers (visible in server logs)
-#   - All tool calls within the conversation share the same prefix ID
+#   - LLM calls will carry nvext.agent_hints fields (visible in server logs when enable_nvext_hints: true)
+#   - Each request receives a unique prefix ID derived from nvext_prefix_id_template
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/config_dynamo_adk_e2e_test.yml`
around lines 17 - 43, Update the YAML file header comments to replace legacy
references to LiteLLM's "extra_headers" behavior with the new nvext-based
injection approach: change lines mentioning "Automatic prefix header injection
via LiteLLM's extra_headers", "Headers are passed via extra_headers at
initialization time", and the descriptions about shared prefix IDs so they
describe nvext-based injection and how prefix IDs are managed per client under
nvext; also update the Expected Output lines that mention "Dynamo prefix headers
(visible in server logs)" and "All tool calls within the conversation share the
same prefix ID" to reflect nvext semantics and where to observe the injected
prefix (e.g., nvext metadata) instead of extra_headers.
🧹 Nitpick comments (2)
packages/nvidia_nat_core/src/nat/llm/dynamo_llm.py (1)

673-678: Debug log "Injected Dynamo hints" fires even when no injection occurred.

This log statement runs unconditionally after the POST/JSON block, so it logs "Injected Dynamo hints" for GET requests or non-JSON bodies where no injection actually happened. Consider moving it inside the if isinstance(body, dict): block or guarding it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nvidia_nat_core/src/nat/llm/dynamo_llm.py` around lines 673 - 678,
The debug log "Injected Dynamo hints" is executed unconditionally; only log when
an injection actually occurred by moving the logger.debug call inside the branch
that handles POST/JSON (the if isinstance(body, dict): block) or by introducing
a boolean flag (e.g., injection_occurred) that you set when modifying body and
then calling logger.debug only if that flag is True; update references to
prefix_id, total_requests, osl_raw, iat_raw, and latency_sensitivity accordingly
so the log only runs when those values were computed/used.
packages/nvidia_nat_adk/src/nat/plugins/adk/llm.py (1)

158-158: Redundant import os — already imported at module level (line 17).

This inner import os shadows the module-level import on line 17 and is unnecessary.

♻️ Proposed fix

```diff
-    import os
-
     from google.adk.models.lite_llm import LiteLlm
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nvidia_nat_adk/src/nat/plugins/adk/llm.py` at line 158, Remove the
redundant inner "import os" statement (it shadows the module-level import
already present) — keep the single top-level import of os and delete the
duplicate "import os" found inside the function/block so all references use the
module-level os import.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/configs/config_with_trie.yml`:
- Around line 65-67: The config currently disables NVEXT hints
(enable_nvext_hints: false), preventing trie-based nvext injection; update the
configuration by setting enable_nvext_hints: true so the trie-backed nvext hints
are exercised together with the existing nvext_prefix_id_template
("latency-demo-{uuid}") and any trie-specific settings.

In `@packages/nvidia_nat_adk/src/nat/plugins/adk/llm.py`:
- Around line 200-203: Replace the two exception handlers so they both use
logger.exception() (to capture stack traces) and remove the redundant exception
argument: change the FileNotFoundError branch from logger.warning(...) to
logger.exception("Prediction trie file not found: %s",
config.nvext_prediction_trie_path) and change the generic except block to
logger.exception("Failed to load prediction trie: %s",
config.nvext_prediction_trie_path) (do not pass the caught exception object `e`
as a format argument); use the existing `logger` and
`config.nvext_prediction_trie_path` symbols to locate the handlers in the
try/except around loading the prediction trie.

In `@packages/nvidia_nat_core/src/nat/llm/dynamo_llm.py`:
- Line 16: Update the typo in the description string where
"nvnext.cache_control" is written; it should read "nvext.cache_control" so both
occurrences reference nvext (the provider's nvext.agent_hints and
nvext.cache_control injection behavior) — locate the descriptive docstring or
comment in dynamo_llm.py (the Dynamo LLM provider text) and replace
"nvnext.cache_control" with "nvext.cache_control".


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between be9d98e and 8cb4391.

📒 Files selected for processing (8)
  • examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/configs/config_profile.yml
  • examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/configs/config_with_trie.yml
  • examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/config_dynamo_adk_e2e_test.yml
  • packages/nvidia_nat_adk/src/nat/plugins/adk/llm.py
  • packages/nvidia_nat_core/src/nat/llm/dynamo_llm.py
  • packages/nvidia_nat_core/tests/nat/llm/test_dynamo_llm.py
  • packages/nvidia_nat_eval/src/nat/plugins/eval/profiler/inference_optimization/dynamo_metrics.py
  • packages/nvidia_nat_langchain/tests/test_llm_langchain.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • packages/nvidia_nat_eval/src/nat/plugins/eval/profiler/inference_optimization/dynamo_metrics.py
  • examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/configs/config_profile.yml

Signed-off-by: bbednarski9 <bbednarski@nvidia.com>
@dagardner-nv
Contributor

/merge

@rapids-bot rapids-bot bot merged commit c402fd3 into NVIDIA:release/1.5 Feb 25, 2026
16 of 17 checks passed
Charlie-Yi-2002 pushed a commit to Charlie-Yi-2002/NeMo-Agent-Toolkit that referenced this pull request Mar 5, 2026
…IA#1648)

This PR cleans the logic around nvext optional parameters for the NAT-dynamo integration and updates corresponding examples and unit tests. This is aligned with the scope of current examples and the expected status of dynamo 1.0.0.

Here is a table summarizing the parameters:

# nvext Field Reference

Fields injected into the OpenAI-compatible request body by NAT's `_DynamoTransport`.
All fields live under the top-level `nvext` key in the JSON request body.

---

## `nvext.agent_hints`

Routing hints consumed by Dynamo's built-in router/scheduler and the custom `processor.py`.

| Field | Description | Used in NAT Source/Components | Used in Dynamo Source/Components |
|---|---|---|---|
| `prefix_id` | Unique string identifying the KV cache prefix for a conversation. | Yes — `DynamoPrefixContext` generates it; `processor.py` routes on it. | No |
| `total_requests` | Expected number of LLM calls in this conversation. | Yes — `processor.py` computes `reuse_budget` from it. | No |
| `osl` | Expected output tokens (always raw integer in `agent_hints`). Config accepts `LOW`/`MEDIUM`/`HIGH` strings for backward compat (mapped to 128/512/2048). | Yes — `processor.py` uses it for router decode cost weighting. Validated for pass-through to Dynamo. | Yes — native `AgentHints.osl` read by Dynamo's built-in frontend. |
| `iat` | Expected inter-arrival time in milliseconds (always raw integer). Config accepts `LOW`/`MEDIUM`/`HIGH` strings for backward compat (mapped to 50/250/750). | Yes — `processor.py` uses it for router stickiness weighting. Also used client-side to compute `cache_control.ttl`. | No |
| `latency_sensitivity` | How latency-sensitive this request is (from `@latency_sensitive` decorator or prediction trie). Increase to prioritize a request. | Validated, passed through to Dynamo. Sets a context variable via `Context.push_latency_sensitivity()`. | Yes — native `AgentHints.latency_sensitivity` → Dynamo queue ordering. |
| `priority` | Engine scheduling priority (`nvext_max_sensitivity - latency_sensitivity`). Lower values = higher priority for vLLM; SGLang is configurable. | Validated, passed through to Dynamo. | Yes — native `AgentHints.priority` → engine queue, eviction, preemption. |

---

## `nvext.cache_control`

KV cache lifetime management. Injected as a sibling of `agent_hints` under `nvext`.
Only injected when `nvext_cache_pin_type` is set (not `None`).

| Field | Description | Used in NAT Source/Components | Used in Dynamo Source/Components |
|---|---|---|---|
| `type` | Cache pinning strategy. Only valid value is `"ephemeral"`. Required in JSON (no serde default on deserialization). | `dynamo_llm.py`: Injected client-side via `CachePinType.EPHEMERAL` (default). Configurable via `nvext_cache_pin_type` param; set to `null` to disable `cache_control` entirely. | `nvext.rs`: Deserialized into `CacheControlType` enum (single variant: `Ephemeral`). Presence of `cache_control` triggers `pin_prefix` after generation in the KV push router. Requires `--enable-cache-control` on the frontend. |
| `ttl` | Duration string for how long to pin the prefix in the KV cache. Optional; defaults to `"5m"` (300s) when omitted. | `dynamo_llm.py`: Computed client-side as `total_requests * iat` (ms), converted to seconds, formatted as `"<N>m"` (whole minutes) or `"<N>s"`. The `iat` field is not consumed by Dynamo — it is only used here for this TTL computation and by the custom Thompson Sampling `processor.py`. | `nvext.rs` `CacheControl::ttl_seconds()`: Only parses `"5m"` (300s) and `"1h"` (3600s). Any other value logs a warning and falls back to 300s. The parsed TTL is forwarded as `ttl_seconds` to `pin_prefix` on the worker via the `cache_control` service mesh endpoint (`cache_control.rs` / `handler_base.py`). |
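Based on the table's description, the client-side TTL computation can be sketched as follows (the exact rounding in dynamo_llm.py may differ; this only illustrates the documented `total_requests * iat` rule):

```python
def compute_ttl(total_requests: int, iat_ms: int) -> str:
    """Sketch of the client-side TTL rule: pin duration is
    total_requests * iat (milliseconds), rendered as whole minutes
    ("<N>m") when it divides evenly, otherwise as seconds ("<N>s")."""
    seconds = (total_requests * iat_ms) // 1000
    if seconds and seconds % 60 == 0:
        return f"{seconds // 60}m"
    return f"{seconds}s"

assert compute_ttl(total_requests=10, iat_ms=30000) == "5m"  # 300 s -> "5m"
assert compute_ttl(total_requests=10, iat_ms=250) == "2s"    # 2500 ms -> "2s"
```

Since Dynamo's `ttl_seconds()` only parses `"5m"` and `"1h"` and falls back to 300s otherwise, most computed values will currently hit the fallback on the server side.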

### `nvext_cache_control_mode` (NAT config field)

Controls **when** `nvext.cache_control` is injected (not a wire-format field — only affects client-side behavior):

| Mode | Behavior |
|---|---|
| `always` (default) | Inject `cache_control` on every request. Refreshes TTL each turn. |
| `first_only` | Inject only on the first request per `prefix_id`. Pins the system prompt when first established in the KV cache; subsequent requests benefit from prefix matching without re-pinning the growing conversation context. |

This field is only relevant when `nvext_cache_pin_type` is set (i.e., `"ephemeral"`). When `nvext_cache_pin_type` is `null`, no `cache_control` is injected regardless of this mode.

---

## NAT Config → Wire Format Mapping

| NAT Config Field | Wire Format Location | Notes |
|---|---|---|
| `enable_nvext_hints` | *(gating only)* | When `false` (default), no `nvext` injection occurs. |
| `nvext_prefix_id_template` | *(unused by transport)* | Prefix IDs come from `DynamoPrefixContext` at request time. |
| `nvext_prefix_total_requests` | `nvext.agent_hints.total_requests` | |
| `nvext_prefix_osl` | `nvext.agent_hints.osl` | Always sent as raw integer. |
| `nvext_prefix_iat` | `nvext.agent_hints.iat` | Always sent as raw integer. Also used to compute `cache_control.ttl`. |
| `nvext_max_sensitivity` | `nvext.agent_hints.priority` | `priority = nvext_max_sensitivity - latency_sensitivity` |
| *(from Context)* | `nvext.agent_hints.latency_sensitivity` | Set by `@latency_sensitive` decorator or prediction trie. |
| *(from DynamoPrefixContext)* | `nvext.agent_hints.prefix_id` | Auto-generated per workflow run + call depth. |
| `nvext_cache_pin_type` | `nvext.cache_control.type` | `null` disables `cache_control` entirely. |
| `nvext_cache_control_mode` | *(gating only)* | Controls whether `cache_control` is injected on every request or just the first. |
| `nvext_prediction_trie_path` | *(overrides agent_hints values)* | When set, per-call predictions override `total_requests`, `osl`, `iat`, and optionally `latency_sensitivity`. |
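Putting the mapping table together, an illustrative post-injection request body (all values are hypothetical; prefix_id would come from DynamoPrefixContext at request time):

```python
nvext_max_sensitivity = 10
latency_sensitivity = 7  # e.g. from the `@latency_sensitive` decorator

# Shape of the JSON body after _DynamoTransport injection, per the table above.
body = {
    "model": "example-model",
    "messages": [{"role": "user", "content": "hello"}],
    "nvext": {
        "agent_hints": {
            "prefix_id": "run-1234-depth-0",
            "total_requests": 10,
            "osl": 512,  # always a raw integer, even if configured as MEDIUM
            "iat": 250,  # raw integer; also feeds the cache_control.ttl computation
            "latency_sensitivity": latency_sensitivity,
            "priority": nvext_max_sensitivity - latency_sensitivity,
        },
        "cache_control": {"type": "ephemeral", "ttl": "5m"},
    },
}

assert body["nvext"]["agent_hints"]["priority"] == 3
```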

Closes

## By Submitting this PR I confirm:
- I am familiar with the [Contributing Guidelines](https://github.com/NVIDIA/NeMo-Agent-Toolkit/blob/develop/docs/source/resources/contributing/index.md).
- We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
  - Any contribution which contains commits that are not Signed-Off will not be accepted.
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.



## Summary by CodeRabbit

* **New Features**
  * NVEXT-based agent hints for Dynamo LLMs.
  * Configurable cache control modes (ALWAYS, FIRST_ONLY) and optional prediction-trie support.

* **Changes**
  * Renamed public config fields from prefix_* → nvext_prefix_*.
  * Added enable_nvext_hints toggle to enable NVEXT flow.
  * Switched from static prefix headers to dynamic per-request agent hints injection.

* **Documentation**
  * Updated docs and log messages to reference NVEXT terminology.

Authors:
  - Bryan Bednarski (https://github.com/bbednarski9)

Approvers:
  - David Gardner (https://github.com/dagardner-nv)

URL: NVIDIA#1648

Labels

breaking (Breaking change), improvement (Improvement to existing functionality)
