Add Dynamo Example for Latency Sensitivity Assignment #1634
rapids-bot[bot] merged 27 commits into NVIDIA:develop from
Conversation
Introduces a complete latency sensitivity demo, including customer query samples, profiler configuration, workflow implementation, and performance comparison tools. This setup demonstrates query triage and response generation using LangChain with NVIDIA LLMs. Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
Switched the latency sensitivity workflow to use the Dynamo LLM backend, replacing the NIM configuration. Enhanced the sensitivity report tool to parse profiler CSVs, showing latency and throughput metrics. Updated documentation with detailed Dynamo setup and workflow instructions. Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
Enhanced the profiler configuration with detailed topology descriptions and updated default settings for Dynamo. Refactored code for improved sensitivity reporting, added warmup skipping, and revised NAT function prompts for better prioritization in latency-sensitive scenarios. Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
…xample # Conflicts: # packages/nvidia_nat_core/tests/nat/test_eval_deprecation_shim.py
Note: Reviews paused.
It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the repository settings.
Walkthrough
Adds a parallel-sibling slack signal and …

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Profiler as Profiler
    participant TrieBuilder as Trie Builder
    participant SiblingMap as Sibling Map
    participant SensCalc as Sensitivity Calc
    participant Output as Trie Output
    Profiler->>TrieBuilder: extract contexts (spans, w_parallel>0)
    TrieBuilder->>SiblingMap: _build_sibling_map (group by parent_id)
    SiblingMap-->>TrieBuilder: siblings dict
    loop each LLM call
        TrieBuilder->>SiblingMap: _compute_parallel_slack(uuid,start,end,siblings)
        SiblingMap-->>TrieBuilder: parallel_slack_ratio
        TrieBuilder->>TrieBuilder: store parallel_slack in context
    end
    TrieBuilder->>SensCalc: _compute_logical_positions(contexts)
    SensCalc-->>TrieBuilder: logical positions
    TrieBuilder->>SensCalc: compute score = w_critical*... + w_fanout*... + w_position*... - w_parallel*parallel_penalty
    SensCalc-->>TrieBuilder: raw scores
    TrieBuilder->>SensCalc: min-max normalize -> [1, sensitivity_scale]
    SensCalc-->>Output: contexts with normalized sensitivity
    Output-->>Profiler: prediction_trie.json
```
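The scoring and normalization steps in the diagram can be sketched in Python. This is a hypothetical illustration only: the weight names mirror the diagram, but the function names and defaults below are invented for the sketch and are not taken from `trie_builder.py`.

```python
# Hypothetical sketch of the scoring step from the diagram above.
# Weight names (w_critical, w_fanout, w_position, w_parallel) mirror the
# diagram; everything else is invented for illustration.
def raw_score(critical: float, fanout: float, position: float, parallel_penalty: float,
              w_critical: float = 1.0, w_fanout: float = 1.0,
              w_position: float = 1.0, w_parallel: float = 1.0) -> float:
    return (w_critical * critical + w_fanout * fanout
            + w_position * position - w_parallel * parallel_penalty)


def normalize(raw_scores: list[float], sensitivity_scale: float) -> list[float]:
    """Min-max normalize raw scores into [1, sensitivity_scale]."""
    lo, hi = min(raw_scores), max(raw_scores)
    if hi == lo:  # all calls equally sensitive: everything maps to 1
        return [1.0] * len(raw_scores)
    span = sensitivity_scale - 1.0
    return [1.0 + span * (s - lo) / (hi - lo) for s in raw_scores]


scores = normalize([raw_score(0.2, 0.1, 0.3, 0.0),
                    raw_score(0.9, 0.4, 0.8, 0.1)], sensitivity_scale=10.0)
print(scores)  # lowest raw score maps to 1.0, highest to sensitivity_scale
```

The min-max step guarantees every call lands in `[1, sensitivity_scale]`, with the degenerate all-equal case pinned to 1.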
```mermaid
sequenceDiagram
    participant Client as Client
    participant Workflow as Demo Workflow
    participant Classify as Classify
    participant Parallel as Parallel Nodes
    participant Draft as Draft
    participant Review as Review
    participant Cache as Dynamo Cache
    Client->>Workflow: submit query
    Workflow->>Classify: classify_query_function
    Classify->>Cache: llm.complete (HIGH hint)
    Classify-->>Workflow: result
    Workflow->>Parallel: fan-out (Research, Policy, Compliance, Sentiment)
    par branches
        Parallel->>Cache: LLM calls (varying durations/priorities)
    end
    Parallel-->>Workflow: aggregated results
    Workflow->>Draft: draft_response_function
    Draft->>Cache: llm.complete (MEDIUM hint)
    Draft-->>Workflow: draft
    Workflow->>Review: review_response_function
    Review->>Cache: llm.complete (HIGH hint)
    Review-->>Workflow: final_response
    Workflow-->>Client: final_response
```
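The fan-out step in the workflow diagram can be illustrated with a plain asyncio sketch. The branch names mirror the diagram, but the bodies below are placeholders, not the demo's actual LangChain functions.

```python
import asyncio


# Placeholder branch functions; the real demo issues LLM calls here.
async def research(q: str) -> str: return f"research:{q}"
async def policy(q: str) -> str: return f"policy:{q}"
async def compliance(q: str) -> str: return f"compliance:{q}"
async def sentiment(q: str) -> str: return f"sentiment:{q}"


async def fan_out(query: str) -> list[str]:
    # Matches the `par branches` block: all four run concurrently and
    # their results are aggregated before drafting begins.
    return await asyncio.gather(research(query), policy(query),
                                compliance(query), sentiment(query))


results = asyncio.run(fan_out("refund status"))
print(results)
```

Because the four branches overlap in time, their relative slack is exactly what the parallel-sibling signal in the trie builder is designed to capture.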
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
/ok to test 8fad804
Removed redundant platform-specific instructions and streamlined existing steps. Added a recommended method to install SGLang from source for access to the latest updates, enhancing clarity and usability for users. Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
Actionable comments posted: 14
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/source/improve-workflows/profiler.md (1)
219-224: ⚠️ Potential issue | 🟡 Minor
Rephrase inanimate possessive ’s in Markdown.
Line 219 uses “workflow's call graph.” Please reword to avoid inanimate possessive ’s.
💡 Suggested fix

```diff
-During profiling, the `trie` builder processes all LLM call events and, for each unique position in your workflow's call graph (identified by `function path` and `call index`), accumulates:
+During profiling, the `trie` builder processes all LLM call events and, for each unique position in the call graph of your workflow (identified by `function path` and `call index`), accumulates:
```

As per coding guidelines, documentation in Markdown files should not use possessive ’s with inanimate objects.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/source/improve-workflows/profiler.md` around lines 219 - 224, Replace the inanimate possessive "workflow's call graph" with a non-possessive phrasing; update the sentence that describes the `trie` builder to say something like "each unique position in the call graph of your workflow (identified by `function path` and `call index`)" or "each unique position in the workflow call graph (identified by `function path` and `call index`)" so it avoids the possessive form while preserving the meaning.
🧹 Nitpick comments (7)
examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack_sensitivity.sh (2)
45-46: Use a no-op command for file truncation to silence ShellCheck SC2188.
`> "$LOGFILE"` is a bare redirection without a command.

Proposed fix:

```diff
 LOGFILE="$LOG_DIR/all.log"
-> "$LOGFILE"
+: > "$LOGFILE"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack_sensitivity.sh` around lines 45 - 46, The bare redirection line using LOGFILE ("> \"$LOGFILE\"") triggers ShellCheck SC2188; replace it with a no-op command plus the redirection (for example use ":" or "true" before the redirection) so the file is truncated while satisfying shellcheck. Update the statement that references LOGFILE (and uses LOG_DIR) to use a no-op command with the same redirection, e.g. ": > \"$LOGFILE\"" to silence SC2188 and preserve behavior.
14-19: Unused configuration variables.
`PAGE_SIZE`, `HICACHE_RATIO`, and `HICACHE_POLICY` are defined but never referenced in the script. Either remove them or wire them into the worker command-line arguments.

Proposed fix: remove unused variables

```diff
 # ── Config ───────────────────────────────────────────────────────────────────
 MODEL="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
-PAGE_SIZE=64
-HICACHE_RATIO=1.0
-HICACHE_POLICY=write_through
 CONTEXT_LENGTH=262
 MEM_FRACTION=0.7
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack_sensitivity.sh` around lines 14 - 19, The script defines unused variables PAGE_SIZE, HICACHE_RATIO, and HICACHE_POLICY; either remove those declarations or pass them into the worker launch so they take effect. Locate the variables PAGE_SIZE, HICACHE_RATIO, and HICACHE_POLICY near the MODEL/CONTEXT_LENGTH/MEM_FRACTION block and either delete those three lines, or append corresponding command-line flags (e.g., --page-size, --hicache-ratio, --hicache-policy or the script's worker arg names) to the worker start/exec invocation (the same command that uses MODEL/CONTEXT_LENGTH/MEM_FRACTION) so the values are actually consumed.

examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack.sh (2)
1-72: Consider consolidating with `dynamo_stack_sensitivity.sh`.
These two scripts are ~90% identical; the only differences are the `--schedule-low-priority-values-first` and `--enable-priority-scheduling` flags. A single parameterized script (e.g., accepting a `--with-priority` flag) would reduce duplication and maintenance burden.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack.sh` around lines 1 - 72, Two nearly identical scripts (dynamo_stack.sh and dynamo_stack_sensitivity.sh) differ only by the flags --schedule-low-priority-values-first and --enable-priority-scheduling; refactor by making one script accept a parameter (e.g., --with-priority or env var WITH_PRIORITY) and conditionally append those flags when launching the worker in the CUDA_VISIBLE_DEVICES block, add simple CLI parsing/usage message and ensure the conditional uses the same unique identifiers (the python3 -m dynamo.sglang invocation and the flag names) so behavior is identical when the flag is present and eliminated when absent.
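The consolidation suggested above can be sketched as follows. This is a hedged illustration: the helper name and the `true`/`false` switch are hypothetical, while the two scheduling flags and the `python3 -m dynamo.sglang` invocation come from the review comment.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: build the worker command once, appending the
# priority-scheduling flags only when requested.
build_worker_cmd() {
    local with_priority="$1"
    local -a flags=()
    if [ "$with_priority" = "true" ]; then
        flags+=(--enable-priority-scheduling --schedule-low-priority-values-first)
    fi
    # Print the command line instead of executing it, for illustration.
    echo python3 -m dynamo.sglang "${flags[@]}"
}

build_worker_cmd true    # priority-scheduling variant
build_worker_cmd false   # baseline variant
```

With this shape, `dynamo_stack.sh` and `dynamo_stack_sensitivity.sh` collapse into one script whose behavior differs only by the flag.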
14-19: Unused config variables and bare redirect.
`PAGE_SIZE`, `HICACHE_RATIO`, and `HICACHE_POLICY` are unused (SC2034). Line 46 uses a bare redirect `> "$LOGFILE"` (SC2188). The same fixes as suggested for `dynamo_stack_sensitivity.sh` apply here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack.sh` around lines 14 - 19, Remove or mark intentionally-unused variables PAGE_SIZE, HICACHE_RATIO, and HICACHE_POLICY: either delete those assignments if they aren’t needed, or add a shellcheck suppression comment (e.g., # shellcheck disable=SC2034) next to their declarations so linters know they are intentionally unused. Replace the bare redirect that truncates the logfile with an explicit no-op command redirect (use : > "$LOGFILE") or use a proper redirection with a command (e.g., exec >"$LOGFILE" 2>&1) to address SC2188. Ensure changes target the declarations of PAGE_SIZE/HICACHE_RATIO/HICACHE_POLICY and the single-line logfile redirect.

examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/workflow.py (1)
64-65: Registered functions are missing return type annotations.
All eight `@register_function` generators (e.g., lines 64-65, 93-94, 133-134, …) omit a return type hint. As async generator functions that `yield FunctionInfo`, their return type should be `AsyncGenerator[FunctionInfo, None]`.

```diff
+from collections.abc import AsyncGenerator
 ...
-async def classify_query_function(config: ClassifyConfig, builder: Builder):
+async def classify_query_function(config: ClassifyConfig, builder: Builder) -> AsyncGenerator[FunctionInfo, None]:
```

Apply the same annotation to the other seven registered functions. As per coding guidelines: "Python methods should use type hints for all parameters and return values."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/workflow.py` around lines 64 - 65, The registered async generator functions (e.g., classify_query_function) are missing return type hints; update each decorated function signature (for all eight `@register_function` functions such as classify_query_function and the other seven registered functions) to include -> AsyncGenerator[FunctionInfo, None], and ensure AsyncGenerator and FunctionInfo are imported from typing/your module as needed so the type hint resolves.

examples/dynamo_integration/latency_sensitivity_demo/tests/test_workflow.py (2)
85-100: Extract repeated `data_path` construction into a pytest fixture.
The path is constructed identically in both `test_dataset_exists` (lines 86-88) and `test_dataset_has_entries` (lines 93-94), violating DRY. Per coding guidelines, frequently repeated code should be extracted into pytest fixtures with the `fixture_` naming convention.

♻️ Proposed refactor

```diff
+import json
+import pytest
 from pathlib import Path
 import yaml

 CONFIGS_DIR = Path(__file__).parent.parent / "src" / "latency_sensitivity_demo" / "configs"
+_DATA_PATH = (Path(__file__).parent.parent / "src" / "latency_sensitivity_demo" / "data" /
+              "customer_queries.json")
+
+
+@pytest.fixture(name="data_path")
+def fixture_data_path() -> Path:
+    return _DATA_PATH
 ...
 class TestDataset:
     """Verify the customer queries dataset."""
-    def test_dataset_exists(self):
-        data_path = (Path(__file__).parent.parent / "src" / "latency_sensitivity_demo" / "data" /
-                     "customer_queries.json")
+    def test_dataset_exists(self, data_path: Path):
         assert data_path.exists()
-    def test_dataset_has_entries(self):
-        import json
-
-        data_path = (Path(__file__).parent.parent / "src" / "latency_sensitivity_demo" / "data" /
-                     "customer_queries.json")
+    def test_dataset_has_entries(self, data_path: Path):
         with open(data_path) as f:
             data = json.load(f)
```

As per coding guidelines: "Any frequently repeated code should be extracted into pytest fixtures. Pytest fixtures should define the name argument when applying the pytest.fixture decorator. The fixture function being decorated should be named using the `fixture_` prefix."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/dynamo_integration/latency_sensitivity_demo/tests/test_workflow.py` around lines 85 - 100, Extract the repeated Path construction into a pytest fixture named fixture_data_path by adding a pytest.fixture(name="fixture_data_path") function that builds and returns the Path used in test_dataset_exists and test_dataset_has_entries; then update both tests to accept fixture_data_path as a parameter and use it instead of reconstructing data_path, and in test_dataset_has_entries open fixture_data_path to load JSON and run the existing assertions—this removes duplication while following the fixture naming and decorator requirements.
91: Move `import json` to module level.
`import json` inside a test method violates PEP 8's requirement that imports appear at the top of the file. `json` is part of the standard library and has no deferred-import rationale here.
As per coding guidelines: "Follow PEP 8 for Python style guidelines."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/dynamo_integration/latency_sensitivity_demo/tests/test_workflow.py` at line 91, The inline "import json" inside the test function in test_workflow.py should be moved to the module-level imports; remove the deferred import from the test method and add "import json" with the other top-of-file imports so the standard-library import follows PEP 8 and the project's import conventions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/dynamo_integration/INSTALL_LIBRARY.md`:
- Around line 8-10: The relative link to the support matrix in
examples/dynamo_integration/INSTALL_LIBRARY.md is broken because it points to
docs/pages/reference/support-matrix.md relative to this repo; update the link to
use the absolute GitHub URL of the Dynamo support-matrix (replace the relative
path in the line containing "support matrix" /
"docs/pages/reference/support-matrix.md" with the full https://... GitHub/raw or
blob URL to the Dynamo docs) so it resolves correctly from this file's location.
- Around line 1-10: Add the standard Apache License 2.0 header comment to the
very top of INSTALL_LIBRARY.md: insert the project's required Apache-2.0 header
block (including the copyright year and copyright owner) as a top-of-file
comment so the file begins with the license header before any content; ensure
the header matches the project's canonical Apache-2.0 header wording used in
other files.
In `@examples/dynamo_integration/latency_sensitivity_demo/pyproject.toml`:
- Around line 1-28: The root pyproject.toml is missing the example registration
under [tool.uv.sources]; add an entry for nat_latency_sensitivity_demo in the
[tool.uv.sources] table (use the key nat_latency_sensitivity_demo and point it
to examples/dynamo_integration/latency_sensitivity_demo with editable = true)
and place it in alphabetical order among other entries so the example is
registered.
In `@examples/dynamo_integration/latency_sensitivity_demo/README.md`:
- Around line 50-55: Replace inanimate possessive forms in the README: change
"workflow's" to "of the workflow" (or simply "workflow"), "profiler's" to "of
the profiler" or "profiler algorithm", and "Dynamo's" to "Dynamo" or "of Dynamo"
so no inanimate noun uses ’s; update the phrases around the headings "Why this
creates a measurable priority benefit at high concurrency" and "How Sensitivity
Scores Are Computed" (and the occurrence at lines ~79–80) to use these
non-possessive alternatives while preserving the original meaning.
- Around line 24-26: Replace the prose uses of the acronym "NAT" with the full
phrase "NeMo Agent Toolkit" in the README (e.g., the sentence describing the
LangGraph StateGraph and any other body text occurrences around the
workflow/triage description and later at lines ~211-212), keeping the code
identifier `nat` unchanged where it appears in backticks; ensure every non-code
occurrence of "NAT" is spelled out as "NeMo Agent Toolkit" while leaving
references to StateGraph and any `nat` code identifiers intact.
In
`@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/compare_sensitivity_perf.py`:
- Around line 74-76: The for-loops unpacking call_idx and other unused variables
should rename those unused names with a leading underscore (e.g., change
call_idx to _call_idx) to satisfy Ruff B007/B905, and any zip(...) calls in this
module should include the explicit parameter strict=False (e.g., zip(a, b,
strict=False)) to address RUF059; update occurrences around the loop using
node.predictions_by_call_index.items() and the other similar unpackings (and the
zip usages mentioned) accordingly so unused variables are prefixed with "_" and
zip calls pass strict=False.
In
`@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack_sensitivity.sh`:
- Around line 2-3: The header comment "Two workers enable cache-aware routing
via the KV router" is misleading because the script only launches a single
worker process in the worker start block; either update that header comment to
reflect one worker or actually start a second worker by duplicating the existing
worker-start command block (the single worker launch block that configures and
background-starts the worker) and adjusting its unique runtime settings (e.g.,
GPU/worker ID, ports, or PID file/env vars) so both workers can run concurrently
and the KV router can perform cache-aware routing.
- Around line 1-11: Add the standard Apache License 2.0 header to the top of the
script file so it complies with project licensing rules; insert the multi-line
license notice immediately after the shebang (#!/usr/bin/env bash) in
latency_sensitivity_demo/scripts/dynamo_stack_sensitivity.sh and ensure it
includes the copyright line, license URL, and the SPDX identifier (or full
license text per project convention) so automated license checks will pass.
In
`@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack.sh`:
- Around line 2-3: The script header in
latency_sensitivity_demo/scripts/dynamo_stack.sh incorrectly states "Two
workers" while the script only starts one worker; either update the header text
to accurately say "One worker" or modify the script to launch the second worker
to match the comment. Locate the top-of-file header comment in dynamo_stack.sh
and either change the phrasing to reflect a single worker or add the commands
that start the missing second worker so the comment and implementation stay
consistent.
- Around line 1-11: Add the standard Apache License 2.0 header used by the
project to the top of this script
(examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack.sh),
matching the sibling script's header exactly; ensure the header appears before
or immediately above the existing shebang line and includes the correct
copyright owner and license URL.
In
`@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/workflow.py`:
- Around line 285-303: The _draft function uses input_text.split("|") which can
produce >6 parts if LLM-generated fields contain '|' and thus misalign fields;
update _draft to split with a maxsplit of 5 (e.g., input_text.split("|",
maxsplit=5)) so the last five fields remain intact when populating query,
category, context, policy, compliance, sentiment before calling chain.ainvoke
and yielding FunctionInfo.from_fn(_draft). For a more robust fix consider
changing the workflow to accept a structured payload (JSON) instead of a
pipe-joined string.
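The maxsplit fix described above can be sketched with a hypothetical parser; the field order is taken from the review comment, while the function name and payload are invented for illustration.

```python
# Sketch of the suggested fix: capping the split count keeps any '|'
# characters inside the final LLM-generated field intact.
def parse_payload(input_text: str) -> dict[str, str]:
    query, category, context, policy, compliance, sentiment = \
        input_text.split("|", maxsplit=5)
    return {"query": query, "category": category, "context": context,
            "policy": policy, "compliance": compliance, "sentiment": sentiment}


parsed = parse_payload("q|billing|ctx|pol|ok|positive: a|b tone")
print(parsed["sentiment"])  # '|' inside the last field is preserved
```

Without `maxsplit=5`, the same input would split into seven parts and the unpacking would raise a `ValueError`.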
- Around line 464-468: The except block catching GeneratorExit in the async
generator around FunctionInfo.from_fn(_run, description=_run.__doc__) must not
swallow the exception; either remove the except clause or log and re-raise it:
replace logger.exception("Exited early!") with logger.error("Exited early!")
followed by a bare raise to preserve the original stack trace (or simply delete
the entire except GeneratorExit block so the finally cleanup runs and
GeneratorExit propagates).
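The log-and-re-raise pattern recommended above, as a minimal sketch; the generator body is illustrative, not the demo's actual `_run` function.

```python
import logging

logger = logging.getLogger(__name__)


def stream_items():
    try:
        yield "item"
    except GeneratorExit:
        # Log, then re-raise with a bare `raise` so GeneratorExit
        # propagates and close() semantics are preserved.
        logger.error("Exited early!")
        raise
    finally:
        pass  # cleanup still runs because GeneratorExit is not swallowed


gen = stream_items()
print(next(gen))  # item
gen.close()       # triggers GeneratorExit inside the generator; no error
```

Swallowing `GeneratorExit` instead would make `close()` raise `RuntimeError: generator ignored GeneratorExit` if the generator yielded again.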
In `@packages/nvidia_nat_core/src/nat/profiler/prediction_trie/trie_builder.py`:
- Around line 195-196: Replace the en-dashes in the problematic docstring and
comment with ASCII hyphens (e.g., change "0–1" to "0-1") to satisfy ruff
unicode-dash rules, and update the zip call in the trie builder routine to use
strict=True (e.g., change zip(a, b) to zip(a, b, strict=True)) so
mismatched-length iterables raise immediately; look for the docstring at the top
of packages/nvidia_nat_core/src/nat/profiler/prediction_trie/trie_builder.py,
the nearby comment around the normalization logic, and the zip() call inside the
trie-building / normalization function (the function that pairs sequences when
building the trie) to apply these edits.
- Around line 274-312: _compute_logical_positions currently iterates contexts in
LLM_END order and uses group_earliest_end which misses transitive overlaps;
change the algorithm to group by temporal overlap using span_start_time ordering
and a group_latest_end (track max span_end_time) so any call whose
span_start_time <= group_latest_end joins the current parallel group and you
update group_latest_end = max(group_latest_end, ctx.span_end_time); to preserve
the function contract, sort indices by span_start_time, compute logical
positions on the sorted list (using current_group and group_latest_end), then
map the computed positions back to the original contexts order before returning
so callers (e.g., remaining_calls/position/fan-out signals) receive positions
aligned with the input list.
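The interval-merging grouping described above can be sketched as follows. Function and variable names are illustrative, not the actual `_compute_logical_positions` implementation, but the logic follows the comment: sort by start time, extend the current group while a call starts at or before the group's latest end, and map group ids back to the original input order.

```python
# Sketch: group calls by transitive temporal overlap.
def logical_positions(intervals: list[tuple[float, float]]) -> list[int]:
    order = sorted(range(len(intervals)), key=lambda i: intervals[i][0])
    positions = [0] * len(intervals)
    group, group_latest_end = -1, float("-inf")
    for idx in order:
        start, end = intervals[idx]
        if start > group_latest_end:  # no overlap with the current group
            group += 1
        group_latest_end = max(group_latest_end, end)
        positions[idx] = group  # write back in original input order
    return positions


# A(0,10) overlaps B(2,4); C(9,15) overlaps A transitively; D(20,25) is new.
print(logical_positions([(0, 10), (2, 4), (9, 15), (20, 25)]))  # [0, 0, 0, 1]
```

Using a running `group_latest_end` (rather than the group's earliest end) is what lets transitive overlaps like A-C collapse into one parallel group.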
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (19)
- docs/source/improve-workflows/profiler.md
- examples/dynamo_integration/INSTALL_LIBRARY.md
- examples/dynamo_integration/latency_sensitivity_demo/README.md
- examples/dynamo_integration/latency_sensitivity_demo/pyproject.toml
- examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/__init__.py
- examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/compare_sensitivity_perf.py
- examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/configs/config_profile.yml
- examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/configs/config_with_trie.yml
- examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/data/customer_queries.json
- examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/register.py
- examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack.sh
- examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack_sensitivity.sh
- examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/sensitivity_report.py
- examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/workflow.py
- examples/dynamo_integration/latency_sensitivity_demo/tests/test_workflow.py
- packages/nvidia_nat_core/src/nat/data_models/profiler.py
- packages/nvidia_nat_core/src/nat/profiler/prediction_trie/trie_builder.py
- packages/nvidia_nat_core/tests/nat/profiler/prediction_trie/test_trie_builder.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/profiler/profile_runner.py
Updated the method to use interval-merging logic, ensuring transitive overlaps are collapsed into a single group. Added comprehensive tests to validate behavior for overlapping, sequential, and edge-case scenarios. Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
Added SPDX license headers to bash scripts, markdown documentation, and configuration files in the Dynamo integration examples. This ensures proper attribution and compliance with the Apache License, Version 2.0. Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
Actionable comments posted: 1
♻️ Duplicate comments (1)
packages/nvidia_nat_core/src/nat/profiler/prediction_trie/trie_builder.py (1)
195-196: Outstanding ruff violations: en-dash characters and `zip()` without `strict=`.
These three ruff warnings remain unaddressed from the previous review cycle.
Also applies to: lines 263 and 268
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_core/src/nat/profiler/prediction_trie/trie_builder.py` around lines 195 - 196, Replace the Unicode en-dash in the docstring/comment ("0–1") with a normal hyphen ("0-1") and update any plain zip() calls in this module to zip(..., strict=True) to satisfy ruff; specifically, change the "0–1" text in the comment at the top of trie_builder.py and add strict=True to the zip() usages that pair parallel iterables (the zip() calls referenced around the 263 and 268 areas) so mismatched-length iterables raise immediately.
🧹 Nitpick comments (1)
packages/nvidia_nat_core/src/nat/profiler/prediction_trie/trie_builder.py (1)
218-221: Move `from collections import Counter` to the module-level import block.

Importing inside a method body defeats module-level caching and obscures the dependency. `collections` is already imported at the top of the file.

♻️ Proposed fix

At the top of the file, extend the existing `collections` import:

```diff
-from collections import defaultdict
+from collections import Counter
+from collections import defaultdict
```

Then remove the inline import:

```diff
-    from collections import Counter
     group_sizes = Counter(logical_positions)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_core/src/nat/profiler/prediction_trie/trie_builder.py` around lines 218 - 221, Inline import "from collections import Counter" should be moved to the module-level imports; update the top-of-file collections import to also import Counter (so collections import line includes Counter) and then remove the inline import inside the function where group_sizes = Counter(logical_positions) is used (the inline import located just above group_sizes in trie_builder.py). This preserves module-level caching and keeps the dependency with other imports.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@packages/nvidia_nat_core/tests/nat/profiler/prediction_trie/test_trie_builder.py`:
- Around line 588-613: Docstrings/comments in
test_logical_positions_transitive_overlap (docstring "A(0–10)..." ) and
test_logical_positions_two_groups (comment "A(0–5), B(1–4) Group 2: C(10–15),
D(11–14)" ) contain en-dash characters; replace each en-dash (–) with a plain
ASCII hyphen (-) so the docstring and comment read e.g. "A(0-10)" and "A(0-5),
B(1-4) Group 2: C(10-15), D(11-14)" to resolve the RUF002/RUF003 warnings
while leaving the surrounding text and tests unchanged.
---
Duplicate comments:
In `@packages/nvidia_nat_core/src/nat/profiler/prediction_trie/trie_builder.py`:
- Around line 195-196: Replace the Unicode en-dash in the docstring/comment
("0–1") with a normal hyphen ("0-1") and update any plain zip() calls in this
module to zip(..., strict=True) to satisfy ruff; specifically, change the "0–1"
text in the comment at the top of trie_builder.py and add strict=True to the
zip() usages that pair parallel iterables (the zip() calls referenced around the
263 and 268 areas) so mismatched-length iterables raise immediately.
---
Nitpick comments:
In `@packages/nvidia_nat_core/src/nat/profiler/prediction_trie/trie_builder.py`:
- Around line 218-221: Inline import "from collections import Counter" should be
moved to the module-level imports; update the top-of-file collections import to
also import Counter (so collections import line includes Counter) and then
remove the inline import inside the function where group_sizes =
Counter(logical_positions) is used (the inline import located just above
group_sizes in trie_builder.py). This preserves module-level caching and keeps
the dependency with other imports.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- packages/nvidia_nat_core/src/nat/profiler/prediction_trie/trie_builder.py
- packages/nvidia_nat_core/tests/nat/profiler/prediction_trie/test_trie_builder.py
Replaced outdated references to "NAT" with "NeMo Agent Toolkit" for accuracy and consistency. This ensures alignment with the current naming conventions and improves clarity in the documentation. Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
Actionable comments posted: 4
♻️ Duplicate comments (1)
examples/dynamo_integration/latency_sensitivity_demo/pyproject.toml (1)
16-43: ⚠️ Potential issue | 🟡 Minor: Register the example in the root `pyproject.toml`.

The `nat_latency_sensitivity_demo` entry is missing from the root `pyproject.toml` `[tool.uv.sources]` table. Per coding guidelines, every example with its own `pyproject.toml` must be listed there.

```diff
+nat_latency_sensitivity_demo = { path = "examples/dynamo_integration/latency_sensitivity_demo", editable = true }
```

Add the entry in alphabetical order among existing entries.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/dynamo_integration/latency_sensitivity_demo/pyproject.toml` around lines 16 - 43, Add the missing example registration for nat_latency_sensitivity_demo into the root [tool.uv.sources] table: add an entry keyed by nat_latency_sensitivity_demo pointing to the example's relative path (matching how other examples are registered) with editable = true, and place it in alphabetical order among the existing entries; locate the example by its project name "nat_latency_sensitivity_demo" and its entry-point "latency_sensitivity_demo.register" to ensure the correct path is used.
🧹 Nitpick comments (2)
examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack.sh (1)
59-60: Make the log truncate explicit to avoid SC2188.
This redirection is valid, but `: > "$LOGFILE"` is clearer and keeps ShellCheck happy.

♻️ Suggested tweak

```diff
-LOGFILE="$LOG_DIR/all.log"
-> "$LOGFILE"
+LOGFILE="$LOG_DIR/all.log"
+: > "$LOGFILE"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack.sh` around lines 59 - 60, The current redirection line using LOGFILE (LOGFILE="$LOG_DIR/all.log" followed by > "$LOGFILE") should be made explicit to truncate the file and satisfy ShellCheck SC2188; replace the bare redirection with an explicit truncate such as using a no-op redirection (e.g., : > "$LOGFILE") so the LOGFILE variable is still set and the file is cleared.

examples/dynamo_integration/latency_sensitivity_demo/pyproject.toml (1)
34-37: Move the `test` extra to an optional dependency group.

Including the `test` extra in the main install dependency pulls test tooling (pytest, etc.) into every install of this example package. Consider moving it to an optional group so users only install test dependencies when explicitly needed.

♻️ Proposed refactor

```diff
 [tool.setuptools_dynamic_dependencies]
 dependencies = [
-    "nvidia-nat[eval,langchain,test] == {version}",
+    "nvidia-nat[eval,langchain] == {version}",
 ]
+
+[project.optional-dependencies]
+test = ["nvidia-nat[test] == {version}"]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/dynamo_integration/latency_sensitivity_demo/pyproject.toml` around lines 34 - 37, The dependency string currently includes the "test" extra ("nvidia-nat[eval,langchain,test] == {version}"), which pulls test tooling into every install; remove "test" from that main dependency (change to "nvidia-nat[eval,langchain] == {version}") and create an optional/test extras group that contains the test extra or test-only packages (e.g., add a new optional-dependencies/extras group named "test" that includes "nvidia-nat[test] == {version}" or the specific test packages like pytest). Update the [tool.setuptools_dynamic_dependencies] dependencies list to the trimmed string and add the new optional group (e.g., under project.optional-dependencies or the setuptools_dynamic_dependencies extras section) so tests are installed only when explicitly requested.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/dynamo_integration/latency_sensitivity_demo/pyproject.toml`:
- Around line 24-43: Move the nested configs/ and data/ directories from
src/latency_sensitivity_demo/ to the example root (next to src/, README.md,
pyproject.toml), add those files to git-lfs, and update any code or package
metadata that references the old paths (e.g., imports or file loads in the
latency_sensitivity_demo package and any package_data or resource lookups tied
to the project name nat_latency_sensitivity_demo or entry point
latency_sensitivity_demo.register). Ensure tests and CI use the new paths
(configs/... and data/...) and adjust any relative paths in code, tests, and
pyproject dynamic dependencies or package_data settings so runtime/resource
loading still works after the move.
In
`@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack_sensitivity.sh`:
- Around line 28-32: The script defines PAGE_SIZE, HICACHE_RATIO, and
HICACHE_POLICY but never uses them, which can mislead users; either remove these
unused variables or wire them into the runtime (e.g., export them or pass them
as environment variables/flags to the worker invocation). Locate the variable
declarations PAGE_SIZE, HICACHE_RATIO, and HICACHE_POLICY in the script and
either delete those lines or add them to the command that starts the worker or
to an export block so the worker code (or cache launcher) can read them; keep
MODEL and CONTEXT_LENGTH as-is if they are used elsewhere.
- Around line 18-19: Update the usage comment to reference the correct script
filename: replace the incorrect "./dynamo-stack_sensitivity.sh" mention with
"./dynamo_stack_sensitivity.sh" in the header comment so the instruction matches
the actual script name (look for the comment lines starting with "# Edit the
config below," and the following usage line).
In
`@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack.sh`:
- Around line 27-33: The PAGE_SIZE, HICACHE_RATIO, and HICACHE_POLICY variables
in the config are defined but never used; update the script so these values are
actually passed into the worker process (e.g., export them as environment
variables or add them as CLI flags to the worker invocation) or remove the
unused variables to avoid misleading users, referencing the PAGE_SIZE,
HICACHE_RATIO and HICACHE_POLICY symbols and the worker invocation point where
env/flags are constructed so the config changes take effect.
---
Duplicate comments:
In `@examples/dynamo_integration/latency_sensitivity_demo/pyproject.toml`:
- Around line 16-43: Add the missing example registration for
nat_latency_sensitivity_demo into the root [tool.uv.sources] table: add an entry
keyed by nat_latency_sensitivity_demo pointing to the example's relative path
(matching how other examples are registered) with editable = true, and place it
in alphabetical order among the existing entries; locate the example by its
project name "nat_latency_sensitivity_demo" and its entry-point
"latency_sensitivity_demo.register" to ensure the correct path is used.
---
Nitpick comments:
In `@examples/dynamo_integration/latency_sensitivity_demo/pyproject.toml`:
- Around line 34-37: The dependency string currently includes the "test" extra
("nvidia-nat[eval,langchain,test] == {version}"), which pulls test tooling into
every install; remove "test" from that main dependency (change to
"nvidia-nat[eval,langchain] == {version}") and create an optional/test extras
group that contains the test extra or test-only packages (e.g., add a new
optional-dependencies/extras group named "test" that includes "nvidia-nat[test]
== {version}" or the specific test packages like pytest). Update the
[tool.setuptools_dynamic_dependencies] dependencies list to the trimmed string
and add the new optional group (e.g., under project.optional-dependencies or the
setuptools_dynamic_dependencies extras section) so tests are installed only when
explicitly requested.
In
`@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack.sh`:
- Around line 59-60: The current redirection line using LOGFILE
(LOGFILE="$LOG_DIR/all.log" followed by > "$LOGFILE") should be made explicit to
truncate the file and satisfy ShellCheck SC2188; replace the bare redirection
with an explicit truncate such as using a no-op redirection (e.g., : >
"$LOGFILE") so the LOGFILE variable is still set and the file is cleared.
📒 Files selected for processing (4)
- examples/dynamo_integration/INSTALL_LIBRARY.md
- examples/dynamo_integration/latency_sensitivity_demo/pyproject.toml
- examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack.sh
- examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack_sensitivity.sh
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/dynamo_integration/INSTALL_LIBRARY.md
Actionable comments posted: 4
♻️ Duplicate comments (1)
examples/dynamo_integration/latency_sensitivity_demo/README.md (1)
50-50: ⚠️ Potential issue | 🟡 Minor: Inanimate possessives still unresolved, plus an additional occurrence at Line 307.

Four instances use possessive `'s` with inanimate nouns:

| Line | Current | Suggested |
|---|---|---|
| 50 | workflow's `classify_query` | the `classify_query` of the workflow |
| 54 | The profiler's auto-sensitivity algorithm | The auto-sensitivity algorithm of the profiler |
| 79 | tell Dynamo's router | tell the Dynamo router |
| 307 | when Dynamo's priority scheduling is working | when the Dynamo priority scheduling is working |

As per coding guidelines, documentation in Markdown files should not use possessive `'s` with inanimate objects.

Also applies to: 54-54, 79-79, 307-307
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/dynamo_integration/latency_sensitivity_demo/README.md` at line 50, Replace possessive "'s" forms for inanimate nouns with non-possessive phrasing: change "workflow's `classify_query`" to "the `classify_query` of the workflow", "The profiler's auto-sensitivity algorithm" to "The auto-sensitivity algorithm of the profiler", "tell Dynamo's router" to "tell the Dynamo router", and "when Dynamo's priority scheduling is working" to "when the Dynamo priority scheduling is working"; update those four occurrences (the `classify_query` reference, the profiler auto-sensitivity phrase, the Dynamo router mention, and the Dynamo priority scheduling mention) throughout the README to match the documentation guideline against inanimate possessive "'s".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/dynamo_integration/latency_sensitivity_demo/README.md`:
- Around line 226-228: The Markdown code-fence closing delimiter is malformed:
the opening fence uses three backticks (```bash) while the closing fence uses
four backticks (````), which swallows the rest of the document; fix the README
by replacing the four-backtick closing fence with a matching three-backtick
fence so the snippet (bash dynamo_stack_sensitivity.sh) is properly closed.
- Line 307: Replace the asterisk-style inline emphasis in the "Priority Routing
Effectiveness" paragraph by changing *increase* to underscore-style _increase_
so the document conforms to markdownlint MD049; search for the sentence that
begins "Priority Routing Effectiveness is the key section" and update the
emphasis around the word "increase" to use underscores.
- Around line 93-147: The sub-step heading "### Step 2a: Run the Profiler…" is
incorrectly numbered under the parent "## Step 1: Profile the Workflow" and
should be "Step 1b"; update the heading title "### Step 2a: Run the Profiler to
Build the Prediction Trie" to "### Step 1b: Run the Profiler to Build the
Prediction Trie" so it matches the parent "## Step 1" and the existing "### Step
1a: Start a Dynamo Endpoint", and scan nearby sub-headings to ensure any further
"Step 2..." labels are similarly renumbered for consistency.
- Around line 31-36: The three fenced code blocks for the directory/topology
diagram, the file-tree listing, and the sample report output in README.md are
missing a language specifier (causing MD040); update each opening fence from ```
to ```text so the directory/topology diagram block (the ASCII diagram starting
with the box layout), the file-tree listing block (the
examples/dynamo_integration/latency_sensitivity_demo/outputs/profile/ path), and
the sample report output block (the LATENCY SENSITIVITY REPORT ASCII header and
following lines) all begin with ```text to satisfy markdownlint and preserve
monospaced formatting.
---
Duplicate comments:
In `@examples/dynamo_integration/latency_sensitivity_demo/README.md`:
- Line 50: Replace possessive "'s" forms for inanimate nouns with non-possessive
phrasing: change "workflow's `classify_query`" to "the `classify_query` of the
workflow", "The profiler's auto-sensitivity algorithm" to "The auto-sensitivity
algorithm of the profiler", "tell Dynamo's router" to "tell the Dynamo router",
and "when Dynamo's priority scheduling is working" to "when the Dynamo priority
scheduling is working"; update those four occurrences (the `classify_query`
reference, the profiler auto-sensitivity phrase, the Dynamo router mention, and
the Dynamo priority scheduling mention) throughout the README to match the
documentation guideline against inanimate possessive "'s".
Build from source instructions will be updated with container instructions when the Dynamo release is complete.
This file was deleted to reduce unused or unnecessary resources within the project. It is no longer relevant to the functionality or purpose of the latency sensitivity demo. Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
This JSON file contains sample customer queries and answers designed for testing latency sensitivity. It will be used in the Dynamo integration demo to simulate realistic customer interactions. Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
Adjusted step numbering for clarity and corrected a minor formatting issue in the README. These changes improve readability and reduce potential confusion for users following the instructions. Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
Introduce initial setup files for the latency sensitivity demo in Dynamo integration. These include placeholders for configs and data, preparing the groundwork for testing and validation. Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
This change includes the `nat_latency_sensitivity_demo` example as an editable dependency in `pyproject.toml`. It ensures the project can utilize this demo for dynamo integration-related functionality. Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
Adjusted relative links in the README to ensure accurate navigation to the installation guide. Relocated INSTALL_LIBRARY.md to match the structure of the latency_sensitivity_demo directory. Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
/ok to test 912c2c2
Refine terminology and formatting in the README and install guide by standardizing model names and library references (e.g., `protobuf`, `libclang`). Enhance vocabulary rules in CI configurations to include new terms like `Huggingface` and `venv`. Minor updates improve clarity and ensure consistency. Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
/ok to test c417500
Removed the mention of `w_parallel` defaulting to 0.0 for backward compatibility as it is no longer relevant. This improves clarity and ensures the documentation remains accurate and up-to-date. Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
Updated the regex pattern to correctly match both "Huggingface" and "HuggingFace" variations. This ensures consistency and proper handling of the term in the vocabulary checks. Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
/ok to test b459477
Updated the CONTEXT_LENGTH parameter in both `dynamo_stack_sensitivity.sh` and `dynamo_stack.sh` from 262 to 262144. This change adjusts memory context size for improved performance and alignment with expected configurations. Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
The CUDA toolkit setup steps were removed from the installation guide to streamline the documentation. This section is no longer necessary for the described workflow, reducing redundancy and simplifying setup instructions. Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
willkill07 left a comment:
You also need to regenerate root uv.lock
This commit introduces the nat-latency-sensitivity-demo package to the project. Updates include its addition to `uv.lock` and `pyproject.toml` with relevant metadata and dependencies, enabling editable integration in the examples directory. Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
/ok to test 38f1008
Updated all mentions of "HuggingFace" to "Hugging Face" for consistency and alignment with the official branding. This change spans multiple documentation files and one vocabulary configuration file. Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
…amo-latency-example
/ok to test 6a07753
/ok to test f6aef4d
/merge
### Summary
Adds automatic latency sensitivity inference to the prediction trie, enabling per-LLM-call priority routing for NVIDIA Dynamo. When profiling a workflow, each LLM call is scored on a configurable 1–5 sensitivity scale based on its structural role in the workflow DAG. These scores are written into the prediction trie and used at runtime to inject `nvext.agent_hints.priority` into Dynamo requests — high-sensitivity calls (e.g., first/last nodes) get higher priority, while parallelizable background calls get lower priority.
Includes a new end-to-end example (`latency_sensitivity_demo`) that demonstrates the full profile → score → route lifecycle with a 7-node LangGraph workflow.
### What changed
#### Core: Auto-sensitivity algorithm (`trie_builder.py`)
The prediction trie builder now computes a composite sensitivity score for each LLM call in a trace using four weighted signals:
| Weight | Signal | Description |
|---|---|---|
| `w_critical` | Critical path | Fraction of total workflow time spent in this call |
| `w_fanout` | Fan-out | How many downstream calls depend on this one |
| `w_position` | Position (U-shaped) | First and last calls score highest; middle calls score lowest |
| `w_parallel` | Parallel slack penalty | Calls running in parallel get penalized — they have scheduling slack |
Key implementation details:
- **Logical position collapsing**: Parallel siblings that overlap in time are assigned the same logical position, so the U-shaped position signal treats them as a single workflow step rather than inflating their indices
- **Parallel group membership penalty**: Any call in a parallel group of size N gets a base penalty of `(N-1)/N` averaged with its individual slack ratio — even the longest sibling in a parallel group is penalized
- **Min-max normalization**: Raw scores are normalized across each trace so the full 1 to `sensitivity_scale` range is always used
- **Sibling overlap detection**: Uses span start/end timestamps to detect which LLM calls ran concurrently, then groups them by parent function ID
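The four signals and implementation details above can be combined into a rough sketch like the one below. The field names (`duration`, `fanout`, `logical_pos`, `n_positions`, `group_size`, `slack_ratio`) and the exact combination are illustrative assumptions, not the actual `trie_builder.py` implementation:

```python
def sensitivity_scores(calls, total_time, w_critical=1.0, w_fanout=1.0,
                       w_position=1.0, w_parallel=1.0, scale=5):
    """Sketch of a composite sensitivity score per LLM call in one trace."""
    raw = []
    max_fanout = max(c["fanout"] for c in calls) or 1
    for c in calls:
        critical = c["duration"] / total_time  # critical-path time share, 0-1
        fanout = c["fanout"] / max_fanout      # downstream dependents, 0-1
        # U-shaped position: first/last logical steps score 1, middle scores 0.
        if c["n_positions"] > 1:
            x = c["logical_pos"] / (c["n_positions"] - 1)
        else:
            x = 0.0
        position = abs(2 * x - 1)
        # Parallel penalty: group-membership base (N-1)/N averaged with the
        # call's individual slack ratio, per the description above.
        n = c["group_size"]
        penalty = ((n - 1) / n + c["slack_ratio"]) / 2 if n > 1 else 0.0
        raw.append(w_critical * critical + w_fanout * fanout
                   + w_position * position - w_parallel * penalty)
    # Min-max normalize so every trace spans the full 1..scale range.
    lo, hi = min(raw), max(raw)
    span = (hi - lo) or 1.0
    return [round(1 + (r - lo) / span * (scale - 1)) for r in raw]
```

With a first node that fans out widely, a long parallel sibling, and a short final node, this sketch reproduces the intended ordering: the first call scores highest, the parallel sibling lowest.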
#### Core: Config model (`profiler.py`)
`PredictionTrieConfig` gains fields for sensitivity tuning:
- `auto_sensitivity: bool` — enable/disable scoring
- `sensitivity_scale: int` — output scale (default 5)
- `w_critical`, `w_fanout`, `w_position`, `w_parallel` — signal weights
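Under the field names listed above, the corresponding YAML might look like the following; the placement under a profiler prediction-trie section is a sketch based on these names, not copied from `profiler.md`:

```yaml
profiler:
  prediction_trie:
    auto_sensitivity: true
    sensitivity_scale: 5
    w_critical: 1.0
    w_fanout: 1.0
    w_position: 1.0
    w_parallel: 0.0   # default per this PR; raise to penalize parallel slack
```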
#### Core: Profile runner (`profile_runner.py`)
Passes the new `SensitivityConfig` from YAML config through to `PredictionTrieBuilder` when building the trie.
#### Docs: Profiler documentation
Updated `profiler.md` with auto-sensitivity configuration reference, signal explanations, and example YAML.
#### Example: `latency_sensitivity_demo`
A complete installable example in `examples/dynamo_integration/latency_sensitivity_demo/` demonstrating automatic latency sensitivity inference end-to-end.
**Workflow topology** (7 LLM calls per request):
```
                     ┌─── research_context  (LOW, ~1000 tok) ──┐
classify (HIGH) ──►  ├─── lookup_policy     (LOW, ~1000 tok) ──┼──► draft_response (MED) ──► review (HIGH)
   (~60 tok)         ├─── check_compliance  (LOW, ~1000 tok) ──┤         (~2000 tok)            (~300 tok)
                     └─── analyze_sentiment (LOW, ~1000 tok) ──┘
```
```
- **classify** (sensitivity=5): First node, high fan-out → highest priority
- **4 parallel branches** (sensitivity=1): Parallel siblings with long decode → lowest priority, GPU-saturating
- **draft_response** (sensitivity=3–4): Sequential mid-position
- **review** (sensitivity=4): Last node, short output → high priority
**Files included:**
| File | Purpose |
|---|---|
| `pyproject.toml` | Installable package with NAT dependency |
| `workflow.py` | 7-node LangGraph workflow with 4-way parallel fan-out |
| `register.py` | NAT plugin entry point |
| `config_profile.yml` | Profiling config — builds prediction trie with auto-sensitivity |
| `config_with_trie.yml` | Runtime config — uses trie for dynamic Dynamo priority injection |
| `customer_queries.json` | 8 sample customer support queries |
| `sensitivity_report.py` | CLI tool to pretty-print trie sensitivity scores with latency data |
| `compare_sensitivity_perf.py` | CLI tool to compare baseline vs trie-enabled runs (ms/tok, HIGH/LOW ratio) |
| `dynamo_stack.sh` / `dynamo_stack_sensitivity.sh` | Helper scripts for launching Dynamo with/without priority scheduling |
| `README.md` | Full walkthrough with topology diagrams and usage instructions |
| `tests/test_workflow.py` | 17 unit tests covering graph structure, config, imports, and end-to-end flow |
### How it works end-to-end
1. **Profile**: `nat eval --config_file config_profile.yml` runs the workflow and builds a prediction trie with per-node sensitivity scores
2. **Inspect**: `python -m latency_sensitivity_demo.sensitivity_report prediction_trie.json` shows a formatted table of scores
3. **Route**: `nat eval --config_file config_with_trie.yml` uses the trie at runtime — each LLM call gets `priority = max_sensitivity - latency_sensitivity` injected into `nvext.agent_hints`
4. **Compare**: `python -m latency_sensitivity_demo.compare_sensitivity_perf` shows ms/tok and HIGH/LOW priority ratio across runs
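Step 3 applies `priority = max_sensitivity - latency_sensitivity`. A minimal sketch of the injected request body is below; the surrounding payload shape is an assumption, and only the `nvext.agent_hints.priority` field name comes from this PR:

```python
def build_dynamo_payload(messages, latency_sensitivity, max_sensitivity=5):
    """Sketch of runtime priority injection for a Dynamo chat request."""
    return {
        "messages": messages,
        "nvext": {
            "agent_hints": {
                # Lower number = higher Dynamo priority, so the most
                # latency-sensitive calls (sensitivity 5) map to priority 0.
                "priority": max_sensitivity - latency_sensitivity,
            }
        },
    }
```

For instance, the `classify` node (sensitivity 5) would carry priority 0, while a parallel branch (sensitivity 1) would carry priority 4.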
## By Submitting this PR I confirm:
- I am familiar with the [Contributing Guidelines](https://github.com/NVIDIA/NeMo-Agent-Toolkit/blob/develop/docs/source/resources/contributing/index.md).
- We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
- Any contribution which contains commits that are not Signed-Off will not be accepted.
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.
## Summary by CodeRabbit
* **New Features**
* Added a "Parallel sibling slack" signal (w_parallel, default 0.0) to latency-sensitivity scoring and a complete latency-sensitivity demo including workflow, CLI reporting tools, orchestration scripts, configs, and an installable demo package.
* **Documentation**
* New Dynamo install guide, demo README, and updated profiler docs describing the parallel execution signal and configuration.
* **Tests**
* Added unit and integration tests validating parallel slack, sensitivity scoring, config wiring, and reporting.
Authors:
- Dhruv Nandakumar (https://github.com/dnandakumar-nv)
Approvers:
- Will Killian (https://github.com/willkill07)
- David Gardner (https://github.com/dagardner-nv)
URL: NVIDIA#1634