improvement: nvext.agent_hints and nvext.cache_control clean up #1648
Conversation
Walkthrough

Migrates Dynamo LLM hinting from legacy prefix headers to an nvext-based agent-hints system. Renames config fields from the legacy prefix names to the `nvext_prefix_*` convention.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client Request
    participant Transport as _DynamoTransport
    participant Trie as Prediction Trie
    participant Server as Dynamo Server
    Client->>Transport: Send LLM request
    Transport->>Transport: Resolve prefix_id (nvext_prefix_id_template)
    alt nvext_prediction_trie_path present
        Transport->>Trie: Lookup prediction for prefix_id
        Trie-->>Transport: Prediction (osl/iat) or None
        Transport->>Transport: Apply prediction overrides if present
    end
    Transport->>Transport: Build nvext.agent_hints (total_requests, osl, iat)
    alt CacheControlMode == FIRST_ONLY & first call for prefix
        Transport->>Transport: Inject nvext.cache_control (pin/ttl)
    else CacheControlMode == ALWAYS
        Transport->>Transport: Inject nvext.cache_control (pin/ttl)
    end
    Transport->>Transport: Increment per-prefix call_index
    Transport->>Server: Forward request with nvext hints injected
    Server-->>Transport: Response
    Transport-->>Client: Return response
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Signed-off-by: bbednarski9 <bbednarski@nvidia.com>
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (9)
packages/nvidia_nat_eval/src/nat/plugins/eval/profiler/inference_optimization/dynamo_metrics.py (2)
Lines 1060-1060: ⚠️ Potential issue | 🟡 Minor

Incorrect attribute name `kv_cache_efficiency` in docstring example — field is `kv_efficiency`.

`DynamoCoreMetrics` exposes `kv_efficiency`, not `kv_cache_efficiency`. The example at Line 1060 would raise `AttributeError` if executed.

📝 Proposed fix

```diff
- print(f"KV Efficiency: {core.kv_cache_efficiency:.2%}")
+ print(f"KV Efficiency: {core.kv_efficiency:.2%}")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_eval/src/nat/plugins/eval/profiler/inference_optimization/dynamo_metrics.py` at line 1060, Docstring example uses a non-existent attribute name kv_cache_efficiency which will raise AttributeError; update the example to use the actual DynamoCoreMetrics attribute kv_efficiency (replace any occurrences of kv_cache_efficiency in the DynamoCoreMetrics docstring or examples with kv_efficiency) so the sample code references the correct field.
Lines 38-38: ⚠️ Potential issue | 🟡 Minor

Stale references in module docstring — not updated alongside line 255.

Line 38 still reads `prefix hints (osl, iat)` while the rest of the PR renames these to `nvext_prefix_osl`/`nvext_prefix_iat`. The same stale phrasing also appears at Line 221 inside `DynamoCoreMetrics`'s class docstring.

📝 Suggested updates

```diff
- - Affected by: prefix_id routing, prefix hints (osl, iat), request patterns
+ - Affected by: prefix_id routing, prefix hints (nvext_prefix_osl, nvext_prefix_iat), request patterns
```

Apply the same fix to Line 221 inside `DynamoCoreMetrics`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_eval/src/nat/plugins/eval/profiler/inference_optimization/dynamo_metrics.py` at line 38, Update the stale phrase "prefix hints (osl, iat)" in the module docstring and inside the DynamoCoreMetrics class docstring to the new names "nvext_prefix_osl" and "nvext_prefix_iat"; specifically modify the module-level docstring near the top and the DynamoCoreMetrics class docstring (search for the symbol DynamoCoreMetrics) so both references now read something like "prefix hints (nvext_prefix_osl, nvext_prefix_iat)" to match the rest of the PR.

packages/nvidia_nat_core/tests/nat/llm/test_runtime_prediction_e2e.py (3)
Lines 129-130: ⚠️ Potential issue | 🟡 Minor

`DynamoPrefixContext.clear()` is not protected against assertion failures.

Each test calls `DynamoPrefixContext.set(...)` unconditionally and then `DynamoPrefixContext.clear()` only at the very end. If any `assert` between them raises `AssertionError`, `clear()` is skipped and subsequent tests start with a stale prefix. While each test overwrites the prefix via `.set(...)`, the prior dirty state is observable during the earlier `_function_path_stack.set(None)` reset window.

Use `try`/`finally` or move setup/teardown into a fixture.

🛡️ Proposed fix (inline try/finally pattern)

```diff
-DynamoPrefixContext.set("e2e-test")
-
-with ctx.push_active_function(...):
-    ...
-
-DynamoPrefixContext.clear()
+DynamoPrefixContext.set("e2e-test")
+try:
+    with ctx.push_active_function(...):
+        ...
+finally:
+    DynamoPrefixContext.clear()
```

Apply the same pattern to `test_e2e_fallback_to_root` and `test_e2e_multiple_calls_in_same_context`.

Also applies to: 191-191, 219-220, 243-243, 271-272, 328-328

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_core/tests/nat/llm/test_runtime_prediction_e2e.py` around lines 129 - 130, Wrap each test that calls DynamoPrefixContext.set(...) in a try/finally (or move set/clear into a pytest fixture) so DynamoPrefixContext.clear() always runs even if an assertion fails; specifically update the tests that call DynamoPrefixContext.set (e.g., test_e2e_fallback_to_root, test_e2e_multiple_calls_in_same_context and the instances around the original set at DynamoPrefixContext.set("e2e-test")) to place the body of the test inside the try and call DynamoPrefixContext.clear() in the finally block, and apply the same pattern to the other occurrences listed in the comment.
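The cleanup guarantee described in this comment can also be expressed as a context manager (which a pytest fixture with `yield` would wrap in the same way). The sketch below uses a minimal stand-in for `DynamoPrefixContext`; the real class lives in the NAT codebase and may differ in its internals:

```python
import contextlib
from contextvars import ContextVar

# Minimal stand-in for the project's DynamoPrefixContext (illustration only).
_prefix: ContextVar = ContextVar("dynamo_prefix", default=None)


class DynamoPrefixContext:
    @staticmethod
    def set(value):
        _prefix.set(value)

    @staticmethod
    def get():
        return _prefix.get()

    @staticmethod
    def clear():
        _prefix.set(None)


@contextlib.contextmanager
def prefix_context(value):
    """Guarantee clear() runs even when an assertion inside the block fails."""
    DynamoPrefixContext.set(value)
    try:
        yield
    finally:
        DynamoPrefixContext.clear()
```

In a test, `with prefix_context("e2e-test"):` replaces the unpaired `set(...)`/`clear()` calls, so a failing `assert` can no longer leak a stale prefix into the next test.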
Lines 108-120: 🛠️ Refactor suggestion | 🟠 Major

Extract repeated transport setup into a pytest fixture.

The identical 6-line block (mock response + mock transport + `_DynamoTransport` construction) is copy-pasted into all three test functions. Per coding guidelines, frequently repeated code should be extracted into a named pytest fixture.

♻️ Proposed fixture extraction

```diff
+import pytest
+
+@pytest.fixture(name="dynamo_transport_with_lookup")
+def fixture_dynamo_transport_with_lookup():
+    mock_response = httpx.Response(200, json={"result": "ok"})
+    mock_transport = MagicMock()
+    mock_transport.handle_async_request = AsyncMock(return_value=mock_response)
+
+    def _factory(lookup):
+        transport = _DynamoTransport(
+            transport=mock_transport,
+            total_requests=10,
+            osl=512,
+            iat=250,
+            prediction_lookup=lookup,
+        )
+        return transport, mock_transport
+
+    return _factory
```

Then each test receives `dynamo_transport_with_lookup` as a parameter and calls `transport, mock_transport = dynamo_transport_with_lookup(lookup)`, eliminating the duplicated setup.

As per coding guidelines: "Any frequently repeated code should be extracted into pytest fixtures."
Also applies to: 199-210, 251-262
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_core/tests/nat/llm/test_runtime_prediction_e2e.py` around lines 108 - 120, Extract the repeated mock transport setup into a pytest fixture named e.g. dynamo_transport_with_lookup that builds mock_response = httpx.Response(200, json={"result":"ok"}), mock_transport with handle_async_request AsyncMock returning mock_response, and constructs the _DynamoTransport with transport=mock_transport, total_requests=10, osl=512, iat=250, prediction_lookup=lookup; have the fixture accept or depend on the existing lookup test fixture and return (transport, mock_transport). Replace the six-line block in tests calling _DynamoTransport with a call to dynamo_transport_with_lookup(lookup) (or by declaring dynamo_transport_with_lookup and lookup as test parameters) and use the returned transport and mock_transport in each test. Ensure the fixture is placed in the test module or conftest so all three tests can import it.
Lines 133-133: 🛠️ Refactor suggestion | 🟠 Major

Move `import json` to the module-level import block.

Placing `import json` inside `with` blocks (and even inside `async` function bodies) violates PEP 8, which requires all imports at the top of the file. This also repeats the import in every test function.

♻️ Proposed fix

```diff
+import json
 import tempfile
 from pathlib import Path
 ...
```

Then remove the three inline `import json` statements at lines 133, 222, and 275.

As per coding guidelines: "Follow PEP 20 and PEP 8 for Python style guidelines."
Also applies to: 222-222, 275-275
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_core/tests/nat/llm/test_runtime_prediction_e2e.py` at line 133, Move the inline "import json" statements to the module-level import block at the top of the file: add a single "import json" with the other imports in packages/nvidia_nat_core/tests/nat/llm/test_runtime_prediction_e2e.py, and remove the three inline "import json" occurrences currently placed inside test function bodies/with blocks (the ones inside async tests and with contexts). Ensure no other code changes are made—just a single top-level import and deletion of the redundant inline imports.

examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/config_dynamo_prefix_e2e_test.yml (1)
Lines 53-68: ⚠️ Potential issue | 🟡 Minor

Stale example still documents legacy `x-prefix-*` headers.

After migrating to nvext hints, Lines 64-68 should describe `nvext.agent_hints` payload fields (and `nvext.cache_control` when applicable), not legacy header keys.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/config_dynamo_prefix_e2e_test.yml` around lines 53 - 68, Update the stale example comment (lines showing "x-prefix-*" headers) to document the new nvext hint payload fields instead: reference the config keys such as enable_nvext_hints, nvext_prefix_id_template, nvext_prefix_total_requests, nvext_prefix_osl and nvext_prefix_iat, and show the corresponding nvext payload shape (e.g. nvext.agent_hints.prefix_id, nvext.agent_hints.total_requests, nvext.agent_hints.osl, nvext.agent_hints.iat) and mention nvext.cache_control when applicable instead of legacy header names.

examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/run_with_prediction_trie.yml (1)
Lines 18-43: ⚠️ Potential issue | 🟡 Minor

Terminology is still mixed between headers and nvext hints.

This config now uses nvext hints, but the file still describes “header injection” and `x-nat-*` headers. Please update wording to `nvext.agent_hints`-based injection for consistency.

Also applies to: 150-166
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/run_with_prediction_trie.yml` around lines 18 - 43, Update the docs to stop calling this "header injection" and instead describe nvext.agent_hints-based injection: replace references to "header injection" and "x-nat-*" with wording that the prediction trie populates nvext.agent_hints (e.g., nvext.agent_hints.remaining_llm_calls, nvext.agent_hints.interarrival_ms, nvext.agent_hints.expected_output_tokens) for each LLM call; update all descriptive text that mentions x-nat-* headers or "headers" (including the later block currently mentioning those names) to consistently describe the nvext.agent_hints keys and the prediction lookup by function path and call index used by the run_with_prediction_trie.yml configuration.

external/dynamo/components/processor.py (1)
Lines 421-431: ⚠️ Potential issue | 🟡 Minor

Docstring now contradicts `_to_category` behavior.

Line 430 says values are always raw integers, but Lines 435-436 still accept `LOW`/`MEDIUM`/`HIGH`. Please align the docstring with actual parsing behavior (or enforce numeric-only input).

🤖 Prompt for AI Agents
Lines 487-513: ⚠️ Potential issue | 🟠 Major

Bound `_call_counts` to prevent unbounded memory growth.

`_call_counts` grows per unique `prefix_id` and is never pruned. In long-lived workers with many conversations/prefixes, this can grow indefinitely and degrade memory/performance.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_core/src/nat/llm/dynamo_llm.py` around lines 487 - 513, The per-prefix counter dict _call_counts in handle_async_request grows unbounded per unique prefix_id; limit its size by adding a max capacity (e.g., self._max_call_counts) and evict old entries when inserting a new prefix_id: inside the existing with self._call_counts_lock block, when computing call_index for prefix_id check if prefix_id exists, otherwise if len(self._call_counts) >= self._max_call_counts remove the oldest/least-recently-used key (use an OrderedDict or maintain an eviction queue) before inserting the new prefix_id, ensuring thread-safety with _call_counts_lock and preserving existing semantics of call_index and _call_counts updates in handle_async_request.
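The bounded-counter idea from the prompt above can be sketched with an `OrderedDict` acting as an LRU map. The class name, capacity, and method name are hypothetical; the real fix would live inside `_DynamoTransport` and reuse its existing lock:

```python
import threading
from collections import OrderedDict


class BoundedCallCounts:
    """Per-prefix call counter with LRU eviction (illustrative sketch)."""

    def __init__(self, max_entries=10_000):
        self._lock = threading.Lock()
        self._counts = OrderedDict()
        self._max_entries = max_entries

    def next_index(self, prefix_id):
        with self._lock:
            if prefix_id in self._counts:
                self._counts.move_to_end(prefix_id)   # mark as recently used
            elif len(self._counts) >= self._max_entries:
                self._counts.popitem(last=False)      # evict least recently used
            call_index = self._counts.get(prefix_id, 0)
            self._counts[prefix_id] = call_index + 1
            return call_index
```

Eviction restarts an evicted prefix at index 0, which also resets its FIRST_ONLY cache-control slot; whether that is acceptable depends on how long-lived prefixes are expected to be.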
🧹 Nitpick comments (1)
packages/nvidia_nat_langchain/src/nat/plugins/langchain/llm.py (1)
Lines 263-287: Load prediction trie only when nvext hints are enabled.

Right now the trie is loaded whenever `nvext_prediction_trie_path` is set, even when `enable_nvext_hints` is false (so it is never used). Gate trie loading by the same feature flag to avoid unnecessary startup I/O and warning noise.

♻️ Suggested change

```diff
-    prediction_lookup: PredictionTrieLookup | None = None
-    if llm_config.nvext_prediction_trie_path:
+    prediction_lookup: PredictionTrieLookup | None = None
+    if llm_config.enable_nvext_hints and llm_config.nvext_prediction_trie_path:
         try:
             trie_path = Path(llm_config.nvext_prediction_trie_path)
             trie = load_prediction_trie(trie_path)
             prediction_lookup = PredictionTrieLookup(trie)
             logger.info("Loaded prediction trie from %s", llm_config.nvext_prediction_trie_path)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_langchain/src/nat/plugins/langchain/llm.py` around lines 263 - 287, The prediction trie is loaded unconditionally when nvext_prediction_trie_path is set, wasting I/O when llm_config.enable_nvext_hints is false; move the trie-loading block (the Path/open and load_prediction_trie + instantiation of PredictionTrieLookup and logger lines) inside the same enable flag check that creates http_async_client (the create_httpx_client_with_dynamo_hooks call) so the trie is only loaded when enable_nvext_hints is true and actually used; keep the same exception handling (FileNotFoundError and generic Exception) but scope it under the enablement branch and pass the resulting prediction_lookup into create_httpx_client_with_dynamo_hooks as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/configs/config_profile.yml`:
- Around line 72-79: The comment claiming "Static fallback values (not used
because baseline does not use agent hints)" is misleading because
enable_nvext_hints is true and there is no trie path, so the nvext_prefix_* keys
(nvext_prefix_total_requests, nvext_prefix_osl, nvext_prefix_iat) act as active
defaults; update the comment to accurately state that these values serve as
active default hints when enable_nvext_hints is true and no trie path is
provided (or else remove/disable them), and ensure references to
enable_nvext_hints and the nvext_prefix_* keys reflect this behavior.
In
`@examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/config_dynamo_adk_e2e_test.yml`:
- Around line 61-69: The NVEXT comment block is outdated and mentions
extra_headers and a shared static prefix; update the comments around
enable_nvext_hints, nvext_prefix_id_template, nvext_prefix_total_requests,
nvext_prefix_osl, and nvext_prefix_iat to describe the current nvext-agent-hints
behavior (hints are injected into the request body per request rather than sent
via extra_headers and the prefix is templated per request via the {uuid}
placeholder), and clarify that nvext_prefix_total_requests is an expected
tool-call count and nvext_prefix_osl / nvext_prefix_iat are hint severity levels
for output length and inter-arrival timing respectively.
In `@packages/nvidia_nat_adk/src/nat/plugins/adk/llm.py`:
- Around line 202-203: The except Exception block that currently calls
logger.warning("Failed to load prediction trie: %s", e) should use
logger.exception(...) to record the full stack trace while keeping the same
graceful-degrade behavior; locate the try/except around the "Failed to load
prediction trie" message in packages/nvidia_nat_adk/src/nat/plugins/adk/llm.py
and replace the logger.warning call with logger.exception using a similar
message so the exception and stack trace are logged but the exception is not
re-raised.
In `@packages/nvidia_nat_core/src/nat/data_models/profiler.py`:
- Line 67: The comment notes an inconsistent identifier: update the reference
named prefix_id to the new naming scheme nvext_prefix_id so it matches
nvext_prefix_osl and nvext_prefix_iat; locate the occurrence in the
Profiler/data model code where the string or field listing reads "prefix_id"
alongside "nvext_prefix_osl" and "nvext_prefix_iat" and rename that token to
"nvext_prefix_id" (and update any surrounding docstring or comment text that
mentions the old name) so all NAT config fields consistently use the
nvext_prefix_* convention.
In `@packages/nvidia_nat_core/src/nat/llm/dynamo_llm.py`:
- Around line 508-513: Move the increment of the per-prefix counter so it only
runs after you confirm the request is eligible for injection; currently
call_index is updated inside the with self._call_counts_lock before validating
JSON/emptiness, which allows non-injectible requests to consume the FIRST_ONLY
slot. Change the code around the with self._call_counts_lock that updates
self._call_counts[prefix_id] to perform the get/assign only after the
injectibility check (the same check that determines whether to apply
cache_control), using the same lock when mutating _call_counts; apply this fix
to the three occurrences (the shown block using
call_index/prefix_id/_call_counts_lock and the similar blocks at the other two
locations reported).
In `@packages/nvidia_nat_core/tests/nat/llm/test_dynamo_llm.py`:
- Around line 76-84: The test name/docstring misstates behavior: setting
DynamoModelConfig.nvext_prefix_id_template to None does not disable hint
injection (that is controlled by enable_nvext_hints). Update the test
(test_disable_prefix_headers) to either rename it (e.g.,
test_set_prefix_template_none or
test_prefix_template_none_does_not_toggle_hints) and change the docstring to
state it only sets the prefix template to None and does not affect
enable_nvext_hints; keep the assertion that config.nvext_prefix_id_template is
None and do not change enable_nvext_hints behavior.
In `@packages/nvidia_nat_langchain/tests/test_llm_langchain.py`:
- Around line 182-190: Update the stale docstrings that say “header injection”
to reflect that the flow injects nvext fields into the JSON request body: for
fixtures dynamo_cfg and dynamo_cfg_with_prefix (and the other similar docstring
later in the file), change wording to something like “nvext request-body
injection (injects nvext.agent_hints / nvext.cache_control into the JSON body)”
so the documentation accurately describes request-body injection rather than
header injection.
---
Outside diff comments:
In
`@examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/config_dynamo_prefix_e2e_test.yml`:
- Around line 53-68: Update the stale example comment (lines showing
"x-prefix-*" headers) to document the new nvext hint payload fields instead:
reference the config keys such as enable_nvext_hints, nvext_prefix_id_template,
nvext_prefix_total_requests, nvext_prefix_osl and nvext_prefix_iat, and show the
corresponding nvext payload shape (e.g. nvext.agent_hints.prefix_id,
nvext.agent_hints.total_requests, nvext.agent_hints.osl, nvext.agent_hints.iat)
and mention nvext.cache_control when applicable instead of legacy header names.
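The payload shape listed in this prompt can be made concrete with a small example. The field values and the exact nesting are assumptions assembled from the config keys named in this review (`nvext_prefix_id_template`, `nvext_prefix_total_requests`, `nvext_prefix_osl`, `nvext_prefix_iat`), not a schema taken from Dynamo documentation:

```python
import json

# Hypothetical post-injection request body for documentation purposes.
body = {
    "model": "example-model",
    "messages": [{"role": "user", "content": "hello"}],
    "nvext": {
        "agent_hints": {
            "prefix_id": "react-agent-1234",  # from nvext_prefix_id_template ({uuid} expanded)
            "total_requests": 10,             # from nvext_prefix_total_requests
            "osl": 512,                       # from nvext_prefix_osl
            "iat": 250,                       # from nvext_prefix_iat
        },
        # Present only when cache control applies (FIRST_ONLY / ALWAYS modes).
        "cache_control": {"pin": True, "ttl": 300},
    },
}
print(json.dumps(body["nvext"], indent=2))
```

A comment block of this form in the YAML examples would replace the stale `x-prefix-*` header documentation.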
In
`@examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/run_with_prediction_trie.yml`:
- Around line 18-43: Update the docs to stop calling this "header injection" and
instead describe nvext.agent_hints-based injection: replace references to
"header injection" and "x-nat-*" with wording that the prediction trie populates
nvext.agent_hints (e.g., nvext.agent_hints.remaining_llm_calls,
nvext.agent_hints.interarrival_ms, nvext.agent_hints.expected_output_tokens) for
each LLM call; update all descriptive text that mentions x-nat-* headers or
"headers" (including the later block currently mentioning those names) to
consistently describe the nvext.agent_hints keys and the prediction lookup by
function path and call index used by the run_with_prediction_trie.yml
configuration.
In `@external/dynamo/components/processor.py`:
- Around line 421-431: The docstring for _to_category currently contradicts its
implementation: update the _to_category docstring to reflect that the function
accepts either categorical strings ("LOW"/"MEDIUM"/"HIGH") or numeric inputs
(int or numeric string which is parsed with int()), describe the exact threshold
mapping (value < thresholds[0] → LOW, value < thresholds[1] → MEDIUM, else
HIGH), and state that non-numeric/non-categorical inputs will raise a
ValueError; alternatively, if you prefer enforcing numeric-only, modify
_to_category to reject categorical strings and raise on non-numeric input—choose
one approach and make the docstring and implementation consistent (refer to the
_to_category function and thresholds parameter when making the change).
In `@packages/nvidia_nat_core/src/nat/llm/dynamo_llm.py`:
- Around line 487-513: The per-prefix counter dict _call_counts in
handle_async_request grows unbounded per unique prefix_id; limit its size by
adding a max capacity (e.g., self._max_call_counts) and evict old entries when
inserting a new prefix_id: inside the existing with self._call_counts_lock
block, when computing call_index for prefix_id check if prefix_id exists,
otherwise if len(self._call_counts) >= self._max_call_counts remove the
oldest/least-recently-used key (use an OrderedDict or maintain an eviction
queue) before inserting the new prefix_id, ensuring thread-safety with
_call_counts_lock and preserving existing semantics of call_index and
_call_counts updates in handle_async_request.
In `@packages/nvidia_nat_core/tests/nat/llm/test_runtime_prediction_e2e.py`:
- Around line 129-130: Wrap each test that calls DynamoPrefixContext.set(...) in
a try/finally (or move set/clear into a pytest fixture) so
DynamoPrefixContext.clear() always runs even if an assertion fails; specifically
update the tests that call DynamoPrefixContext.set (e.g.,
test_e2e_fallback_to_root, test_e2e_multiple_calls_in_same_context and the
instances around the original set at DynamoPrefixContext.set("e2e-test")) to
place the body of the test inside the try and call DynamoPrefixContext.clear()
in the finally block, and apply the same pattern to the other occurrences listed
in the comment.
- Around line 108-120: Extract the repeated mock transport setup into a pytest
fixture named e.g. dynamo_transport_with_lookup that builds mock_response =
httpx.Response(200, json={"result":"ok"}), mock_transport with
handle_async_request AsyncMock returning mock_response, and constructs the
_DynamoTransport with transport=mock_transport, total_requests=10, osl=512,
iat=250, prediction_lookup=lookup; have the fixture accept or depend on the
existing lookup test fixture and return (transport, mock_transport). Replace the
six-line block in tests calling _DynamoTransport with a call to
dynamo_transport_with_lookup(lookup) (or by declaring
dynamo_transport_with_lookup and lookup as test parameters) and use the returned
transport and mock_transport in each test. Ensure the fixture is placed in the
test module or conftest so all three tests can import it.
- Line 133: Move the inline "import json" statements to the module-level import
block at the top of the file: add a single "import json" with the other imports
in packages/nvidia_nat_core/tests/nat/llm/test_runtime_prediction_e2e.py, and
remove the three inline "import json" occurrences currently placed inside test
function bodies/with blocks (the ones inside async tests and with contexts).
Ensure no other code changes are made—just a single top-level import and
deletion of the redundant inline imports.
In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/profiler/inference_optimization/dynamo_metrics.py`:
- Line 1060: Docstring example uses a non-existent attribute name
kv_cache_efficiency which will raise AttributeError; update the example to use
the actual DynamoCoreMetrics attribute kv_efficiency (replace any occurrences of
kv_cache_efficiency in the DynamoCoreMetrics docstring or examples with
kv_efficiency) so the sample code references the correct field.
- Line 38: Update the stale phrase "prefix hints (osl, iat)" in the module
docstring and inside the DynamoCoreMetrics class docstring to the new names
"nvext_prefix_osl" and "nvext_prefix_iat"; specifically modify the module-level
docstring near the top and the DynamoCoreMetrics class docstring (search for the
symbol DynamoCoreMetrics) so both references now read something like "prefix
hints (nvext_prefix_osl, nvext_prefix_iat)" to match the rest of the PR.
---
Nitpick comments:
In `@packages/nvidia_nat_langchain/src/nat/plugins/langchain/llm.py`:
- Around line 263-287: The prediction trie is loaded unconditionally when
nvext_prediction_trie_path is set, wasting I/O when
llm_config.enable_nvext_hints is false; move the trie-loading block (the
Path/open and load_prediction_trie + instantiation of PredictionTrieLookup and
logger lines) inside the same enable flag check that creates http_async_client
(the create_httpx_client_with_dynamo_hooks call) so the trie is only loaded when
enable_nvext_hints is true and actually used; keep the same exception handling
(FileNotFoundError and generic Exception) but scope it under the enablement
branch and pass the resulting prediction_lookup into
create_httpx_client_with_dynamo_hooks as before.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (23)
- examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/configs/config_profile.yml
- examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/configs/config_with_trie.yml
- examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/config_dynamo_adk_e2e_test.yml
- examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/config_dynamo_prefix_e2e_test.yml
- examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/eval_config_no_rethinking_full_test.yml
- examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/eval_config_no_rethinking_minimal_test.yml
- examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/eval_config_rethinking_full_test.yml
- examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/optimize_rethinking_full_test.yml
- examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/profile_rethinking_full_test.yml
- examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/run_with_prediction_trie.yml
- external/dynamo/components/processor.py
- packages/nvidia_nat_adk/src/nat/plugins/adk/llm.py
- packages/nvidia_nat_adk/tests/test_adk_llm.py
- packages/nvidia_nat_core/src/nat/data_models/profiler.py
- packages/nvidia_nat_core/src/nat/llm/dynamo_llm.py
- packages/nvidia_nat_core/tests/nat/llm/test_dynamic_prediction_hook.py
- packages/nvidia_nat_core/tests/nat/llm/test_dynamo_llm.py
- packages/nvidia_nat_core/tests/nat/llm/test_dynamo_prediction_trie.py
- packages/nvidia_nat_core/tests/nat/llm/test_runtime_prediction_e2e.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/profiler/inference_optimization/dynamo_metrics.py
- packages/nvidia_nat_langchain/src/nat/plugins/langchain/llm.py
- packages/nvidia_nat_langchain/tests/test_dynamo_trie_loading.py
- packages/nvidia_nat_langchain/tests/test_llm_langchain.py
💤 Files with no reviewable changes (1)
- packages/nvidia_nat_core/tests/nat/llm/test_dynamic_prediction_hook.py
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/config_dynamo_adk_e2e_test.yml (1)
Lines 17-43: ⚠️ Potential issue | 🟡 Minor

File header block still describes the legacy `extra_headers` approach.

Several comment lines in the unchanged header block contradict the nvext-based injection introduced by this PR:
- Line 24: "Automatic prefix header injection via LiteLLM's extra_headers"
- Lines 28–30: "Headers are passed via extra_headers at initialization time … All requests from the same client share the same prefix ID"
- Line 41: "LLM calls should include Dynamo prefix headers (visible in server logs)"
- Line 42: "All tool calls within the conversation share the same prefix ID"
✏️ Suggested header update
```diff
-# - Automatic prefix header injection via LiteLLM's extra_headers
-# - KV cache optimization through consistent prefix IDs per client
+# - Per-request NVEXT hints injected into `nvext.agent_hints` request body fields
+# - KV cache optimization via per-request prefix IDs managed by DynamoPrefixContext
 #
 # Key Differences from LangChain:
 # - ADK uses LiteLLM under the hood, not httpx directly
-# - Headers are passed via extra_headers at initialization time
-# - All requests from the same client share the same prefix ID (ideal for conversations)
+# - NVEXT hints are injected per-request into the request body (not via extra_headers)
+# - Prefix IDs are templated per request via the {uuid} placeholder in nvext_prefix_id_template
 ...
 # Expected Output:
-# - LLM calls should include Dynamo prefix headers (visible in server logs)
-# - All tool calls within the conversation share the same prefix ID
+# - LLM calls will carry nvext.agent_hints fields (visible in server logs when enable_nvext_hints: true)
+# - Each request receives a unique prefix ID derived from nvext_prefix_id_template
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/config_dynamo_adk_e2e_test.yml` around lines 17 - 43, Update the YAML file header comments to replace legacy references to LiteLLM's "extra_headers" behavior with the new nvext-based injection approach: change lines mentioning "Automatic prefix header injection via LiteLLM's extra_headers", "Headers are passed via extra_headers at initialization time", and the descriptions about shared prefix IDs so they describe nvext-based injection and how prefix IDs are managed per client under nvext; also update the Expected Output lines that mention "Dynamo prefix headers (visible in server logs)" and "All tool calls within the conversation share the same prefix ID" to reflect nvext semantics and where to observe the injected prefix (e.g., nvext metadata) instead of extra_headers.
🧹 Nitpick comments (2)
packages/nvidia_nat_core/src/nat/llm/dynamo_llm.py (1)
673-678: Debug log "Injected Dynamo hints" fires even when no injection occurred. This log statement runs unconditionally after the POST/JSON block, so it logs "Injected Dynamo hints" for GET requests or non-JSON bodies where no injection actually happened. Consider moving it inside the `if isinstance(body, dict):` block or guarding it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_core/src/nat/llm/dynamo_llm.py` around lines 673 - 678, The debug log "Injected Dynamo hints" is executed unconditionally; only log when an injection actually occurred by moving the logger.debug call inside the branch that handles POST/JSON (the if isinstance(body, dict): block) or by introducing a boolean flag (e.g., injection_occurred) that you set when modifying body and then calling logger.debug only if that flag is True; update references to prefix_id, total_requests, osl_raw, iat_raw, and latency_sensitivity accordingly so the log only runs when those values were computed/used.

packages/nvidia_nat_adk/src/nat/plugins/adk/llm.py (1)
158-158: Redundant `import os` — already imported at module level (line 17). This inner `import os` shadows the module-level import on line 17 and is unnecessary.

♻️ Proposed fix
    - import os
      from google.adk.models.lite_llm import LiteLlm
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_adk/src/nat/plugins/adk/llm.py` at line 158, Remove the redundant inner "import os" statement (it shadows the module-level import already present) — keep the single top-level import of os and delete the duplicate "import os" found inside the function/block so all references use the module-level os import.
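The flag-guarded logging fix from the first nitpick can be sketched as below. The helper name and signature are invented for illustration; the real logic lives inside `_DynamoTransport` in `dynamo_llm.py`:

```python
import logging

logger = logging.getLogger(__name__)


def inject_dynamo_hints(method: str, body, prefix_id: str, total_requests: int,
                        osl_raw: int, iat_raw: int, latency_sensitivity: int) -> bool:
    """Hypothetical injection helper: log only when hints were actually injected."""
    injection_occurred = False
    if method == "POST" and isinstance(body, dict):
        body.setdefault("nvext", {})["agent_hints"] = {
            "prefix_id": prefix_id,
            "total_requests": total_requests,
            "osl": osl_raw,
            "iat": iat_raw,
            "latency_sensitivity": latency_sensitivity,
        }
        injection_occurred = True
    if injection_occurred:
        # Runs only for POST requests with JSON-dict bodies, never for GETs.
        logger.debug("Injected Dynamo hints: prefix_id=%s total_requests=%d",
                     prefix_id, total_requests)
    return injection_occurred
```

GET requests and non-dict bodies fall through without logging, which is the behavior the nitpick asks for.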
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/configs/config_with_trie.yml`:
- Around line 65-67: The config currently disables NVEXT hints
(enable_nvext_hints: false), preventing trie-based nvext injection; update the
configuration by setting enable_nvext_hints: true so the trie-backed nvext hints
are exercised together with the existing nvext_prefix_id_template
("latency-demo-{uuid}") and any trie-specific settings.
In `@packages/nvidia_nat_adk/src/nat/plugins/adk/llm.py`:
- Around line 200-203: Replace the two exception handlers so they both use
logger.exception() (to capture stack traces) and remove the redundant exception
argument: change the FileNotFoundError branch from logger.warning(...) to
logger.exception("Prediction trie file not found: %s",
config.nvext_prediction_trie_path) and change the generic except block to
logger.exception("Failed to load prediction trie: %s",
config.nvext_prediction_trie_path) (do not pass the caught exception object `e`
as a format argument); use the existing `logger` and
`config.nvext_prediction_trie_path` symbols to locate the handlers in the
try/except around loading the prediction trie.
In `@packages/nvidia_nat_core/src/nat/llm/dynamo_llm.py`:
- Line 16: Update the typo in the description string where
"nvnext.cache_control" is written; it should read "nvext.cache_control" so both
occurrences reference nvext (the provider's nvext.agent_hints and
nvext.cache_control injection behavior) — locate the descriptive docstring or
comment in dynamo_llm.py (the Dynamo LLM provider text) and replace
"nvnext.cache_control" with "nvext.cache_control".
---
Outside diff comments:
In
`@examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/config_dynamo_adk_e2e_test.yml`:
- Around line 17-43: Update the YAML file header comments to replace legacy
references to LiteLLM's "extra_headers" behavior with the new nvext-based
injection approach: change lines mentioning "Automatic prefix header injection
via LiteLLM's extra_headers", "Headers are passed via extra_headers at
initialization time", and the descriptions about shared prefix IDs so they
describe nvext-based injection and how prefix IDs are managed per client under
nvext; also update the Expected Output lines that mention "Dynamo prefix headers
(visible in server logs)" and "All tool calls within the conversation share the
same prefix ID" to reflect nvext semantics and where to observe the injected
prefix (e.g., nvext metadata) instead of extra_headers.
---
Nitpick comments:
In `@packages/nvidia_nat_adk/src/nat/plugins/adk/llm.py`:
- Line 158: Remove the redundant inner "import os" statement (it shadows the
module-level import already present) — keep the single top-level import of os
and delete the duplicate "import os" found inside the function/block so all
references use the module-level os import.
In `@packages/nvidia_nat_core/src/nat/llm/dynamo_llm.py`:
- Around line 673-678: The debug log "Injected Dynamo hints" is executed
unconditionally; only log when an injection actually occurred by moving the
logger.debug call inside the branch that handles POST/JSON (the if
isinstance(body, dict): block) or by introducing a boolean flag (e.g.,
injection_occurred) that you set when modifying body and then calling
logger.debug only if that flag is True; update references to prefix_id,
total_requests, osl_raw, iat_raw, and latency_sensitivity accordingly so the log
only runs when those values were computed/used.
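The `logger.exception` change prescribed for `llm.py` above boils down to the following pattern. The wrapper is a standalone sketch (the loader callable and path handling are assumptions, not the plugin's real API):

```python
import logging

logger = logging.getLogger(__name__)


def load_prediction_trie(path: str, loader):
    """Load a prediction trie, logging full stack traces on failure.

    logger.exception() records the active exception and traceback itself,
    so the caught exception object is not passed as a format argument.
    """
    try:
        return loader(path)
    except FileNotFoundError:
        logger.exception("Prediction trie file not found: %s", path)
    except Exception:
        logger.exception("Failed to load prediction trie: %s", path)
    return None
```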
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/configs/config_profile.yml
- examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/configs/config_with_trie.yml
- examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/config_dynamo_adk_e2e_test.yml
- packages/nvidia_nat_adk/src/nat/plugins/adk/llm.py
- packages/nvidia_nat_core/src/nat/llm/dynamo_llm.py
- packages/nvidia_nat_core/tests/nat/llm/test_dynamo_llm.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/profiler/inference_optimization/dynamo_metrics.py
- packages/nvidia_nat_langchain/tests/test_llm_langchain.py
🚧 Files skipped from review as they are similar to previous changes (2)
- packages/nvidia_nat_eval/src/nat/plugins/eval/profiler/inference_optimization/dynamo_metrics.py
- examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/configs/config_profile.yml
Signed-off-by: bbednarski9 <bbednarski@nvidia.com>
/merge
…IA#1648)

This PR cleans the logic around nvext optional parameters for the NAT-dynamo integration and updates corresponding examples and unit tests. This is aligned with the scope of current examples and the expected status of dynamo 1.0.0. Here is a table summarizing the parameters:

# nvext Field Reference

Fields injected into the OpenAI-compatible request body by NAT's `_DynamoTransport`. All fields live under the top-level `nvext` key in the JSON request body.

---

## `nvext.agent_hints`

Routing hints consumed by Dynamo's built-in router/scheduler and the custom `processor.py`.

| Field | Description | Used in NAT Source/Components | Used in Dynamo Source/Components |
|---|---|---|---|
| `prefix_id` | Unique string identifying the KV cache prefix for a conversation. | Yes — `DynamoPrefixContext` generates it; `processor.py` routes on it. | No |
| `total_requests` | Expected number of LLM calls in this conversation. | Yes — `processor.py` computes `reuse_budget` from it. | No |
| `osl` | Expected output tokens (always raw integer in `agent_hints`). Config accepts `LOW`/`MEDIUM`/`HIGH` strings for backward compat (mapped to 128/512/2048). | Yes — `processor.py` uses it for router decode cost weighting. Validated for pass-through to Dynamo. | Yes — native `AgentHints.osl` read by Dynamo's built-in frontend. |
| `iat` | Expected inter-arrival time in milliseconds (always raw integer). Config accepts `LOW`/`MEDIUM`/`HIGH` strings for backward compat (mapped to 50/250/750). | Yes — `processor.py` uses it for router stickiness weighting. Also used client-side to compute `cache_control.ttl`. | No |
| `latency_sensitivity` | How latency-sensitive this request is (from `@latency_sensitive` decorator or prediction trie). Increase to prioritize a request. | Validated, passed through to Dynamo. Sets a context variable via `Context.push_latency_sensitivity()`. | Yes — native `AgentHints.latency_sensitivity` → Dynamo queue ordering. |
| `priority` | Engine scheduling priority (`nvext_max_sensitivity - latency_sensitivity`). Lower values = higher priority for vLLM; SGLang is configurable. | Validated, passed through to Dynamo. | Yes — native `AgentHints.priority` → engine queue, eviction, preemption. |

---

## `nvext.cache_control`

KV cache lifetime management. Injected as a sibling of `agent_hints` under `nvext`. Only injected when `nvext_cache_pin_type` is set (not `None`).

| Field | Description | Used in NAT Source/Components | Used in Dynamo Source/Components |
|---|---|---|---|
| `type` | Cache pinning strategy. Only valid value is `"ephemeral"`. Required in JSON (no serde default on deserialization). | `dynamo_llm.py`: Injected client-side via `CachePinType.EPHEMERAL` (default). Configurable via `nvext_cache_pin_type` param; set to `null` to disable `cache_control` entirely. | `nvext.rs`: Deserialized into `CacheControlType` enum (single variant: `Ephemeral`). Presence of `cache_control` triggers `pin_prefix` after generation in the KV push router. Requires `--enable-cache-control` on the frontend. |
| `ttl` | Duration string for how long to pin the prefix in the KV cache. Optional; defaults to `"5m"` (300s) when omitted. | `dynamo_llm.py`: Computed client-side as `total_requests * iat` (ms), converted to seconds, formatted as `"<N>m"` (whole minutes) or `"<N>s"`. The `iat` field is not consumed by Dynamo — it is only used here for this TTL computation and by the custom Thompson Sampling `processor.py`. | `nvext.rs` `CacheControl::ttl_seconds()`: Only parses `"5m"` (300s) and `"1h"` (3600s). Any other value logs a warning and falls back to 300s. The parsed TTL is forwarded as `ttl_seconds` to `pin_prefix` on the worker via the `cache_control` service mesh endpoint (`cache_control.rs` / `handler_base.py`). |
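The client-side `ttl` computation described above reduces to roughly the following (a simplified sketch, not the actual `dynamo_llm.py` code):

```python
def compute_cache_ttl(total_requests: int, iat_ms: int) -> str:
    """Pin the prefix long enough for the whole conversation:
    total_requests * iat (ms) -> seconds -> "<N>m" (whole minutes) or "<N>s".
    """
    ttl_seconds = max(1, (total_requests * iat_ms) // 1000)
    if ttl_seconds % 60 == 0:
        return f"{ttl_seconds // 60}m"  # whole minutes
    return f"{ttl_seconds}s"
```

Keep in mind that the server-side `ttl_seconds()` only parses `"5m"` and `"1h"`; any other string logs a warning and falls back to 300s, so computed TTLs outside those two values are effectively advisory.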
### `nvext_cache_control_mode` (NAT config field)

Controls **when** `nvext.cache_control` is injected (not a wire-format field — only affects client-side behavior):

| Mode | Behavior |
|---|---|
| `always` (default) | Inject `cache_control` on every request. Refreshes TTL each turn. |
| `first_only` | Inject only on the first request per `prefix_id`. Pins the system prompt when first established in the KV cache; subsequent requests benefit from prefix matching without re-pinning the growing conversation context. |

This field is only relevant when `nvext_cache_pin_type` is set (i.e., `"ephemeral"`). When `nvext_cache_pin_type` is `null`, no `cache_control` is injected regardless of this mode.

---

## NAT Config → Wire Format Mapping

| NAT Config Field | Wire Format Location | Notes |
|---|---|---|
| `enable_nvext_hints` | *(gating only)* | When `false` (default), no `nvext` injection occurs. |
| `nvext_prefix_id_template` | *(unused by transport)* | Prefix IDs come from `DynamoPrefixContext` at request time. |
| `nvext_prefix_total_requests` | `nvext.agent_hints.total_requests` | |
| `nvext_prefix_osl` | `nvext.agent_hints.osl` | Always sent as raw integer. |
| `nvext_prefix_iat` | `nvext.agent_hints.iat` | Always sent as raw integer. Also used to compute `cache_control.ttl`. |
| `nvext_max_sensitivity` | `nvext.agent_hints.priority` | `priority = nvext_max_sensitivity - latency_sensitivity` |
| *(from Context)* | `nvext.agent_hints.latency_sensitivity` | Set by `@latency_sensitive` decorator or prediction trie. |
| *(from DynamoPrefixContext)* | `nvext.agent_hints.prefix_id` | Auto-generated per workflow run + call depth. |
| `nvext_cache_pin_type` | `nvext.cache_control.type` | `null` disables `cache_control` entirely. |
| `nvext_cache_control_mode` | *(gating only)* | Controls whether `cache_control` is injected on every request or just the first. |
| `nvext_prediction_trie_path` | *(overrides agent_hints values)* | When set, per-call predictions override `total_requests`, `osl`, `iat`, and optionally `latency_sensitivity`. |

Closes

## By Submitting this PR I confirm:

- I am familiar with the [Contributing Guidelines](https://github.com/NVIDIA/NeMo-Agent-Toolkit/blob/develop/docs/source/resources/contributing/index.md).
- We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
- Any contribution which contains commits that are not Signed-Off will not be accepted.
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.

## Summary by CodeRabbit

* **New Features**
  * NVEXT-based agent hints for Dynamo LLMs.
  * Configurable cache control modes (ALWAYS, FIRST_ONLY) and optional prediction-trie support.
* **Changes**
  * Renamed public config fields from prefix_* → nvext_prefix_*.
  * Added enable_nvext_hints toggle to enable NVEXT flow.
  * Switched from static prefix headers to dynamic per-request agent hints injection.
* **Documentation**
  * Updated docs and log messages to reference NVEXT terminology.

Authors:
- Bryan Bednarski (https://github.com/bbednarski9)

Approvers:
- David Gardner (https://github.com/dagardner-nv)

URL: NVIDIA#1648