
Add Dynamo Example for Latency Sensitivity Assignment#1634

Merged
rapids-bot[bot] merged 27 commits into NVIDIA:develop from dnandakumar-nv:dynamo-latency-example
Feb 24, 2026

Conversation

@dnandakumar-nv
Contributor

@dnandakumar-nv dnandakumar-nv commented Feb 24, 2026

Summary

Adds automatic latency sensitivity inference to the prediction trie, enabling per-LLM-call priority routing for NVIDIA Dynamo. When profiling a workflow, each LLM call is scored on a configurable 1–5 sensitivity scale based on its structural role in the workflow DAG. These scores are written into the prediction trie and used at runtime to inject nvext.agent_hints.priority into Dynamo requests — high-sensitivity calls (e.g., first/last nodes) get higher priority, while parallelizable background calls get lower priority.

Includes a new end-to-end example (latency_sensitivity_demo) that demonstrates the full profile → score → route lifecycle with a 7-node LangGraph workflow.

What changed

Core: Auto-sensitivity algorithm (trie_builder.py)

The prediction trie builder now computes a composite sensitivity score for each LLM call in a trace using four weighted signals:

| Weight | Signal | Description |
| --- | --- | --- |
| `w_critical` | Critical path | Fraction of total workflow time spent in this call |
| `w_fanout` | Fan-out | How many downstream calls depend on this one |
| `w_position` | Position (U-shaped) | First and last calls score highest; middle calls score lowest |
| `w_parallel` | Parallel slack penalty | Calls running in parallel are penalized, since they have scheduling slack |

Key implementation details:

  • Logical position collapsing: Parallel siblings that overlap in time are assigned the same logical position, so the U-shaped position signal treats them as a single workflow step rather than inflating their indices
  • Parallel group membership penalty: Any call in a parallel group of size N gets a base penalty of (N-1)/N averaged with its individual slack ratio — even the longest sibling in a parallel group is penalized
  • Min-max normalization: Raw scores are normalized across each trace so the full 1–`sensitivity_scale` range is always used
  • Sibling overlap detection: Uses span start/end timestamps to detect which LLM calls ran concurrently, then groups them by parent function ID
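The scoring pipeline above can be sketched in a few lines. This is an illustrative reconstruction from the PR description, not the actual `trie_builder.py` code: the function names here are hypothetical, and the real implementation may weight or combine the signals differently.

```python
# Sketch of the composite sensitivity scoring described above (hypothetical
# helper names; the real trie_builder implementation may differ in detail).

def parallel_penalty(group_size: int, slack_ratio: float) -> float:
    """Average the group-membership penalty (N-1)/N with the call's own slack ratio."""
    if group_size <= 1:
        return 0.0
    return ((group_size - 1) / group_size + slack_ratio) / 2.0

def u_shaped_position(pos: int, n_positions: int) -> float:
    """First and last logical positions score 1.0; the middle scores lowest."""
    if n_positions <= 1:
        return 1.0
    x = pos / (n_positions - 1)   # 0.0 .. 1.0 across the workflow
    return abs(2.0 * x - 1.0)     # V shape: ends high, middle low

def raw_score(critical: float, fanout: float, position: float, penalty: float,
              w_critical: float = 1.0, w_fanout: float = 1.0,
              w_position: float = 1.0, w_parallel: float = 1.0) -> float:
    """Weighted sum of the four signals; the parallel penalty subtracts."""
    return (w_critical * critical + w_fanout * fanout
            + w_position * position - w_parallel * penalty)

def normalize(scores: list[float], scale: int = 5) -> list[int]:
    """Min-max normalize raw scores onto the full 1..scale range."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [scale for _ in scores]
    return [round(1 + (s - lo) / (hi - lo) * (scale - 1)) for s in scores]
```

Under this sketch, a call in a 4-way parallel group with 0.5 slack takes a penalty of (3/4 + 0.5)/2 = 0.625, and normalization pins the lowest raw score in a trace to 1 and the highest to the configured scale.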

Core: Config model (profiler.py)

PredictionTrieConfig gains fields for sensitivity tuning:

  • auto_sensitivity: bool — enable/disable scoring
  • sensitivity_scale: int — output scale (default 5)
  • w_critical, w_fanout, w_position, w_parallel — signal weights
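As a rough picture of how these fields might appear in a profiling YAML (a hypothetical fragment; the exact key nesting is documented in `profiler.md` and may differ from this sketch):

```yaml
# Hypothetical profiler config fragment illustrating the new fields.
eval:
  profiler:
    prediction_trie:
      auto_sensitivity: true
      sensitivity_scale: 5
      w_critical: 1.0
      w_fanout: 1.0
      w_position: 1.0
      w_parallel: 0.5
```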

Core: Profile runner (profile_runner.py)

Passes the new SensitivityConfig from YAML config through to PredictionTrieBuilder when building the trie.

Docs: Profiler documentation

Updated profiler.md with auto-sensitivity configuration reference, signal explanations, and example YAML.

Example: latency_sensitivity_demo

A complete installable example in examples/dynamo_integration/latency_sensitivity_demo/ demonstrating automatic latency sensitivity inference end-to-end.

Workflow topology (7 LLM calls per request):

                      ┌─── research_context  (LOW, ~1000 tok) ──┐
                      ├─── lookup_policy     (LOW, ~1000 tok) ──┤
  classify (HIGH) ──► ├─── check_compliance  (LOW, ~1000 tok) ──┼──► draft_response (MED) ──► review (HIGH)
    (~60 tok)         └─── analyze_sentiment (LOW, ~1000 tok) ──┘          (~2000 tok)           (~300 tok)
  • classify (sensitivity=5): First node, high fan-out → highest priority
  • 4 parallel branches (sensitivity=1): Parallel siblings with long decode → lowest priority, GPU-saturating
  • draft_response (sensitivity=3–4): Sequential mid-position
  • review (sensitivity=4): Last node, short output → high priority

Files included:

| File | Purpose |
| --- | --- |
| `pyproject.toml` | Installable package with NAT dependency |
| `workflow.py` | 7-node LangGraph workflow with 4-way parallel fan-out |
| `register.py` | NAT plugin entry point |
| `config_profile.yml` | Profiling config; builds the prediction trie with auto-sensitivity |
| `config_with_trie.yml` | Runtime config; uses the trie for dynamic Dynamo priority injection |
| `customer_queries.json` | 8 sample customer support queries |
| `sensitivity_report.py` | CLI tool to pretty-print trie sensitivity scores with latency data |
| `compare_sensitivity_perf.py` | CLI tool to compare baseline vs. trie-enabled runs (ms/tok, HIGH/LOW ratio) |
| `dynamo_stack.sh` / `dynamo_stack_sensitivity.sh` | Helper scripts for launching Dynamo with/without priority scheduling |
| `README.md` | Full walkthrough with topology diagrams and usage instructions |
| `tests/test_workflow.py` | 17 unit tests covering graph structure, config, imports, and end-to-end flow |

How it works end-to-end

  1. Profile: nat eval --config_file config_profile.yml runs the workflow and builds a prediction trie with per-node sensitivity scores
  2. Inspect: python -m latency_sensitivity_demo.sensitivity_report prediction_trie.json shows a formatted table of scores
  3. Route: nat eval --config_file config_with_trie.yml uses the trie at runtime — each LLM call gets priority = max_sensitivity - latency_sensitivity injected into nvext.agent_hints
  4. Compare: python -m latency_sensitivity_demo.compare_sensitivity_perf shows ms/tok and HIGH/LOW priority ratio across runs
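The routing step (3) boils down to a small transformation of each outgoing request. The sketch below assumes a plain-dict request payload; the `nvext.agent_hints.priority` key path follows the PR text, but the helper names and payload shape here are illustrative, not the toolkit's actual API:

```python
# Hedged sketch of the runtime priority mapping in step 3 above.
# priority = max_sensitivity - latency_sensitivity, so the most
# latency-sensitive calls (score 5 on a 1-5 scale) map to priority 0,
# which Dynamo's priority scheduler treats as most urgent per the PR.

def priority_for_call(latency_sensitivity: int, max_sensitivity: int = 5) -> int:
    """Invert the sensitivity score into a Dynamo priority value."""
    return max_sensitivity - latency_sensitivity

def with_priority_hint(request: dict, sensitivity: int,
                       max_sensitivity: int = 5) -> dict:
    """Return a shallow copy of the request with the priority hint injected."""
    hinted = dict(request)
    hinted.setdefault("nvext", {}).setdefault("agent_hints", {})["priority"] = \
        priority_for_call(sensitivity, max_sensitivity)
    return hinted
```

For example, the first-node `classify` call (sensitivity 5) would carry priority 0, while a parallel background branch (sensitivity 1) would carry priority 4.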

By Submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • New Features

    • Added a "Parallel sibling slack" signal (w_parallel, default 0.0) to latency-sensitivity scoring and a complete latency-sensitivity demo including workflow, CLI reporting tools, orchestration scripts, configs, and an installable demo package.
  • Documentation

    • New Dynamo install guide, demo README, and updated profiler docs describing the parallel execution signal and configuration.
  • Tests

    • Added unit and integration tests validating parallel slack, sensitivity scoring, config wiring, and reporting.

Introduces a complete latency sensitivity demo, including customer query samples, profiler configuration, workflow implementation, and performance comparison tools. This setup demonstrates query triage and response generation using LangChain with NVIDIA LLMs.

Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
Switched the latency sensitivity workflow to use the Dynamo LLM backend, replacing the NIM configuration. Enhanced the sensitivity report tool to parse profiler CSVs, showing latency and throughput metrics. Updated documentation with detailed Dynamo setup and workflow instructions.

Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
Enhanced the profiler configuration with detailed topology descriptions and updated default settings for Dynamo. Refactored code for improved sensitivity reporting, added warmup skipping, and revised NAT function prompts for better prioritization in latency-sensitive scenarios.

Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
…xample

# Conflicts:
#	packages/nvidia_nat_core/tests/nat/test_eval_deprecation_shim.py
@dnandakumar-nv dnandakumar-nv requested review from a team as code owners February 24, 2026 00:11
@copy-pr-bot

copy-pr-bot bot commented Feb 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai bot commented Feb 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a parallel-sibling slack signal and w_parallel parameter to profiler latency-sensitivity computation; extends trie builder to detect overlapping sibling spans, compute parallel slack, and reduce sensitivity for shadowed LLM calls. Also adds a complete Dynamo latency-sensitivity demo (configs, workflow, scripts, reports, tests, and docs).

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Profiler core**: `packages/nvidia_nat_core/src/nat/data_models/profiler.py`, `packages/nvidia_nat_core/src/nat/profiler/prediction_trie/trie_builder.py`, `packages/nvidia_nat_core/tests/nat/profiler/prediction_trie/test_trie_builder.py`, `packages/nvidia_nat_eval/src/nat/plugins/eval/profiler/profile_runner.py`, `docs/source/improve-workflows/profiler.md` | Added `w_parallel: float = 0.0` to configs and `SensitivityConfig`; added `_SiblingSpan`, sibling-map construction, parallel-slack computation, logical-position collapsing, and integration of a parallel-aware penalty into sensitivity scoring. Updated the builder to store span times and slack on LLM contexts; tests and docs updated. |
| **Demo workflow & registration**: `examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/workflow.py`, `.../register.py`, `.../__init__.py` | New 7-node LangGraph demo: typed function configs, registered functions, an orchestrator that fans out parallel branches and converges, and a registration shim. |
| **Demo config, data & packaging**: `examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/configs/config_profile.yml`, `.../config_with_trie.yml`, `.../data/customer_queries.json`, `examples/dynamo_integration/latency_sensitivity_demo/pyproject.toml` | Added profiler/runtime YAMLs (including `w_parallel` weight usage), an LFS dataset pointer, and a pyproject with package metadata and an entry point for the demo. |
| **Reporting & analysis tools**: `examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/sensitivity_report.py`, `.../compare_sensitivity_perf.py` | New CLI tools to generate sensitivity reports from prediction tries and profiler CSVs, and to compare multi-run latency grouped by computed sensitivity. |
| **Orchestration scripts**: `examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack.sh`, `.../dynamo_stack_sensitivity.sh` | New bash scripts to launch the Dynamo frontend and workers with preflight checks, centralized logging, PID tracking, and cleanup traps. |
| **Docs, install guide & tests**: `examples/dynamo_integration/INSTALL_LIBRARY.md`, `examples/dynamo_integration/latency_sensitivity_demo/README.md`, `examples/dynamo_integration/latency_sensitivity_demo/tests/test_workflow.py` | Added an install/build guide, demo README, and tests validating configs, the dataset pointer, module registration, sensitivity weights (including `w_parallel`), report outputs, and workflow wiring. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Profiler as Profiler
    participant TrieBuilder as Trie Builder
    participant SiblingMap as Sibling Map
    participant SensCalc as Sensitivity Calc
    participant Output as Trie Output

    Profiler->>TrieBuilder: extract contexts (spans, w_parallel>0)
    TrieBuilder->>SiblingMap: _build_sibling_map (group by parent_id)
    SiblingMap-->>TrieBuilder: siblings dict

    loop each LLM call
        TrieBuilder->>SiblingMap: _compute_parallel_slack(uuid,start,end,siblings)
        SiblingMap-->>TrieBuilder: parallel_slack_ratio
        TrieBuilder->>TrieBuilder: store parallel_slack in context
    end

    TrieBuilder->>SensCalc: _compute_logical_positions(contexts)
    SensCalc-->>TrieBuilder: logical positions

    TrieBuilder->>SensCalc: compute score = w_critical*... + w_fanout*... + w_position*... - w_parallel*parallel_penalty
    SensCalc-->>TrieBuilder: raw scores
    TrieBuilder->>SensCalc: min-max normalize -> [1, sensitivity_scale]
    SensCalc-->>Output: contexts with normalized sensitivity
    Output-->>Profiler: prediction_trie.json
```
```mermaid
sequenceDiagram
    participant Client as Client
    participant Workflow as Demo Workflow
    participant Classify as Classify
    participant Parallel as Parallel Nodes
    participant Draft as Draft
    participant Review as Review
    participant Cache as Dynamo Cache

    Client->>Workflow: submit query
    Workflow->>Classify: classify_query_function
    Classify->>Cache: llm.complete (HIGH hint)
    Classify-->>Workflow: result

    Workflow->>Parallel: fan-out (Research, Policy, Compliance, Sentiment)
    par branches
        Parallel->>Cache: LLM calls (varying durations/priorities)
    end
    Parallel-->>Workflow: aggregated results

    Workflow->>Draft: draft_response_function
    Draft->>Cache: llm.complete (MEDIUM hint)
    Draft-->>Workflow: draft

    Workflow->>Review: review_response_function
    Review->>Cache: llm.complete (HIGH hint)
    Review-->>Workflow: final_response

    Workflow-->>Client: final_response
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 64.44%, below the required 80.00% threshold. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped; CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and directly describes the main purpose of the PR: adding a new example for latency sensitivity assignment in Dynamo, the central feature across all changesets. |



@dnandakumar-nv added the `feature request` (New feature or request) and `non-breaking` (Non-breaking change) labels on Feb 24, 2026
@dnandakumar-nv
Contributor Author

/ok to test 8fad804

Removed redundant platform-specific instructions and streamlined existing steps. Added a recommended method to install SGLang from source for access to the latest updates, enhancing clarity and usability for users.

Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 14

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/source/improve-workflows/profiler.md (1)

219-224: ⚠️ Potential issue | 🟡 Minor

Rephrase inanimate possessive ’s in Markdown.

Line 219 uses “workflow's call graph.” Please reword to avoid inanimate possessive ’s.

💡 Suggested fix
-During profiling, the `trie` builder processes all LLM call events and, for each unique position in your workflow's call graph (identified by `function path` and `call index`), accumulates:
+During profiling, the `trie` builder processes all LLM call events and, for each unique position in the call graph of your workflow (identified by `function path` and `call index`), accumulates:

As per coding guidelines, documentation in Markdown files should not use possessive ’s with inanimate objects.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/source/improve-workflows/profiler.md` around lines 219 - 224, Replace
the inanimate possessive "workflow's call graph" with a non-possessive phrasing;
update the sentence that describes the `trie` builder to say something like
"each unique position in the call graph of your workflow (identified by
`function path` and `call index`)" or "each unique position in the workflow call
graph (identified by `function path` and `call index`)" so it avoids the
possessive form while preserving the meaning.
🧹 Nitpick comments (7)
examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack_sensitivity.sh (2)

45-46: Use a no-op command for file truncation to silence ShellCheck SC2188.

> "$LOGFILE" is a bare redirection without a command.

Proposed fix
 LOGFILE="$LOG_DIR/all.log"
-> "$LOGFILE"
+: > "$LOGFILE"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack_sensitivity.sh`
around lines 45 - 46, The bare redirection line using LOGFILE ("> \"$LOGFILE\"")
triggers ShellCheck SC2188; replace it with a no-op command plus the redirection
(for example use ":" or "true" before the redirection) so the file is truncated
while satisfying shellcheck. Update the statement that references LOGFILE (and
uses LOG_DIR) to use a no-op command with the same redirection, e.g. ": >
\"$LOGFILE\"" to silence SC2188 and preserve behavior.

14-19: Unused configuration variables.

PAGE_SIZE, HICACHE_RATIO, and HICACHE_POLICY are defined but never referenced in the script. Either remove them or wire them into the worker command-line arguments.

Proposed fix: remove unused variables
 # ── Config ───────────────────────────────────────────────────────────────────
 MODEL="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
-PAGE_SIZE=64
-HICACHE_RATIO=1.0
-HICACHE_POLICY=write_through
 CONTEXT_LENGTH=262
 MEM_FRACTION=0.7
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack_sensitivity.sh`
around lines 14 - 19, The script defines unused variables PAGE_SIZE,
HICACHE_RATIO, and HICACHE_POLICY; either remove those declarations or pass them
into the worker launch so they take effect. Locate the variables PAGE_SIZE,
HICACHE_RATIO, and HICACHE_POLICY near the MODEL/CONTEXT_LENGTH/MEM_FRACTION
block and either delete those three lines, or append corresponding command-line
flags (e.g., --page-size, --hicache-ratio, --hicache-policy or the script's
worker arg names) to the worker start/exec invocation (the same command that
uses MODEL/CONTEXT_LENGTH/MEM_FRACTION) so the values are actually consumed.
examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack.sh (2)

1-72: Consider consolidating with dynamo_stack_sensitivity.sh.

These two scripts are ~90% identical; the only differences are the --schedule-low-priority-values-first and --enable-priority-scheduling flags. A single parameterized script (e.g., accepting a --with-priority flag) would reduce duplication and maintenance burden.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack.sh`
around lines 1 - 72, Two nearly identical scripts (dynamo_stack.sh and
dynamo_stack_sensitivity.sh) differ only by the flags
--schedule-low-priority-values-first and --enable-priority-scheduling; refactor
by making one script accept a parameter (e.g., --with-priority or env var
WITH_PRIORITY) and conditionally append those flags when launching the worker in
the CUDA_VISIBLE_DEVICES block, add simple CLI parsing/usage message and ensure
the conditional uses the same unique identifiers (the python3 -m dynamo.sglang
invocation and the flag names) so behavior is identical when the flag is present
and eliminated when absent.

14-19: Unused config variables and bare redirect.

PAGE_SIZE, HICACHE_RATIO, and HICACHE_POLICY are unused (SC2034). Line 46 uses a bare redirect > "$LOGFILE" (SC2188). Same fixes as suggested for dynamo_stack_sensitivity.sh apply here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack.sh`
around lines 14 - 19, Remove or mark intentionally-unused variables PAGE_SIZE,
HICACHE_RATIO, and HICACHE_POLICY: either delete those assignments if they
aren’t needed, or add a shellcheck suppression comment (e.g., # shellcheck
disable=SC2034) next to their declarations so linters know they are
intentionally unused. Replace the bare redirect that truncates the logfile with
an explicit no-op command redirect (use : > "$LOGFILE") or use a proper
redirection with a command (e.g., exec >"$LOGFILE" 2>&1) to address SC2188.
Ensure changes target the declarations of PAGE_SIZE/HICACHE_RATIO/HICACHE_POLICY
and the single-line logfile redirect.
examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/workflow.py (1)

64-65: Registered functions are missing return type annotations.

All eight @register_function generators (e.g., lines 64–65, 93–94, 133–134, …) omit a return type hint. As async generator functions that yield FunctionInfo, their return type should be AsyncGenerator[FunctionInfo, None].

+from collections.abc import AsyncGenerator
 ...
-async def classify_query_function(config: ClassifyConfig, builder: Builder):
+async def classify_query_function(config: ClassifyConfig, builder: Builder) -> AsyncGenerator[FunctionInfo, None]:

Apply the same annotation to the other seven registered functions. As per coding guidelines: "Python methods should use type hints for all parameters and return values."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/workflow.py`
around lines 64 - 65, The registered async generator functions (e.g.,
classify_query_function) are missing return type hints; update each decorated
function signature (for all eight `@register_function` functions such as
classify_query_function and the other seven registered functions) to include ->
AsyncGenerator[FunctionInfo, None], and ensure AsyncGenerator and FunctionInfo
are imported from typing/your module as needed so the type hint resolves.
examples/dynamo_integration/latency_sensitivity_demo/tests/test_workflow.py (2)

85-100: Extract repeated data_path construction into a pytest fixture.

The path is constructed identically in both test_dataset_exists (lines 86–88) and test_dataset_has_entries (lines 93–94), violating DRY. Per coding guidelines, frequently repeated code should be extracted into pytest fixtures with the fixture_ naming convention.

♻️ Proposed refactor
+import json
+import pytest
 from pathlib import Path
 import yaml
 
 CONFIGS_DIR = Path(__file__).parent.parent / "src" / "latency_sensitivity_demo" / "configs"
+_DATA_PATH = (Path(__file__).parent.parent / "src" / "latency_sensitivity_demo" / "data" /
+              "customer_queries.json")
+
+
+@pytest.fixture(name="data_path")
+def fixture_data_path() -> Path:
+    return _DATA_PATH
 
 ...
 
 class TestDataset:
     """Verify the customer queries dataset."""
 
-    def test_dataset_exists(self):
-        data_path = (Path(__file__).parent.parent / "src" / "latency_sensitivity_demo" / "data" /
-                     "customer_queries.json")
+    def test_dataset_exists(self, data_path: Path):
         assert data_path.exists()
 
-    def test_dataset_has_entries(self):
-        import json
-
-        data_path = (Path(__file__).parent.parent / "src" / "latency_sensitivity_demo" / "data" /
-                     "customer_queries.json")
+    def test_dataset_has_entries(self, data_path: Path):
         with open(data_path) as f:
             data = json.load(f)

As per coding guidelines: "Any frequently repeated code should be extracted into pytest fixtures. Pytest fixtures should define the name argument when applying the pytest.fixture decorator. The fixture function being decorated should be named using the fixture_ prefix."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/dynamo_integration/latency_sensitivity_demo/tests/test_workflow.py`
around lines 85 - 100, Extract the repeated Path construction into a pytest
fixture named fixture_data_path by adding a
pytest.fixture(name="fixture_data_path") function that builds and returns the
Path used in test_dataset_exists and test_dataset_has_entries; then update both
tests to accept fixture_data_path as a parameter and use it instead of
reconstructing data_path, and in test_dataset_has_entries open fixture_data_path
to load JSON and run the existing assertions—this removes duplication while
following the fixture naming and decorator requirements.

91-91: Move import json to module level.

import json inside a test method violates PEP 8's requirement that imports appear at the top of the file. json is part of the standard library and has no deferred-import rationale here.

As per coding guidelines: "Follow PEP 8 for Python style guidelines."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/dynamo_integration/latency_sensitivity_demo/tests/test_workflow.py`
at line 91, The inline "import json" inside the test function in
test_workflow.py should be moved to the module-level imports; remove the
deferred import from the test method and add "import json" with the other
top-of-file imports so the standard-library import follows PEP 8 and the
project's import conventions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/dynamo_integration/INSTALL_LIBRARY.md`:
- Around line 8-10: The relative link to the support matrix in
examples/dynamo_integration/INSTALL_LIBRARY.md is broken because it points to
docs/pages/reference/support-matrix.md relative to this repo; update the link to
use the absolute GitHub URL of the Dynamo support-matrix (replace the relative
path in the line containing "support matrix" /
"docs/pages/reference/support-matrix.md" with the full https://... GitHub/raw or
blob URL to the Dynamo docs) so it resolves correctly from this file's location.
- Around line 1-10: Add the standard Apache License 2.0 header comment to the
very top of INSTALL_LIBRARY.md: insert the project's required Apache-2.0 header
block (including the copyright year and copyright owner) as a top-of-file
comment so the file begins with the license header before any content; ensure
the header matches the project's canonical Apache-2.0 header wording used in
other files.

In `@examples/dynamo_integration/latency_sensitivity_demo/pyproject.toml`:
- Around line 1-28: The root pyproject.toml is missing the example registration
under [tool.uv.sources]; add an entry for nat_latency_sensitivity_demo in the
[tool.uv.sources] table (use the key nat_latency_sensitivity_demo and point it
to examples/dynamo_integration/latency_sensitivity_demo with editable = true)
and place it in alphabetical order among other entries so the example is
registered.

In `@examples/dynamo_integration/latency_sensitivity_demo/README.md`:
- Around line 50-55: Replace inanimate possessive forms in the README: change
"workflow's" to "of the workflow" (or simply "workflow"), "profiler's" to "of
the profiler" or "profiler algorithm", and "Dynamo's" to "Dynamo" or "of Dynamo"
so no inanimate noun uses ’s; update the phrases around the headings "Why this
creates a measurable priority benefit at high concurrency" and "How Sensitivity
Scores Are Computed" (and the occurrence at lines ~79–80) to use these
non-possessive alternatives while preserving the original meaning.
- Around line 24-26: Replace the prose uses of the acronym "NAT" with the full
phrase "NeMo Agent Toolkit" in the README (e.g., the sentence describing the
LangGraph StateGraph and any other body text occurrences around the
workflow/triage description and later at lines ~211-212), keeping the code
identifier `nat` unchanged where it appears in backticks; ensure every non-code
occurrence of "NAT" is spelled out as "NeMo Agent Toolkit" while leaving
references to StateGraph and any `nat` code identifiers intact.

In
`@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/compare_sensitivity_perf.py`:
- Around line 74-76: The for-loops unpacking call_idx and other unused variables
should rename those unused names with a leading underscore (e.g., change
call_idx to _call_idx) to satisfy Ruff B007/B905, and any zip(...) calls in this
module should include the explicit parameter strict=False (e.g., zip(a, b,
strict=False)) to address RUF059; update occurrences around the loop using
node.predictions_by_call_index.items() and the other similar unpackings (and the
zip usages mentioned) accordingly so unused variables are prefixed with "_" and
zip calls pass strict=False.

In
`@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack_sensitivity.sh`:
- Around line 2-3: The header comment "Two workers enable cache-aware routing
via the KV router" is misleading because the script only launches a single
worker process in the worker start block; either update that header comment to
reflect one worker or actually start a second worker by duplicating the existing
worker-start command block (the single worker launch block that configures and
background-starts the worker) and adjusting its unique runtime settings (e.g.,
GPU/worker ID, ports, or PID file/env vars) so both workers can run concurrently
and the KV router can perform cache-aware routing.
- Around line 1-11: Add the standard Apache License 2.0 header to the top of the
script file so it complies with project licensing rules; insert the multi-line
license notice immediately after the shebang (#!/usr/bin/env bash) in
latency_sensitivity_demo/scripts/dynamo_stack_sensitivity.sh and ensure it
includes the copyright line, license URL, and the SPDX identifier (or full
license text per project convention) so automated license checks will pass.

In
`@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack.sh`:
- Around line 2-3: The script header in
latency_sensitivity_demo/scripts/dynamo_stack.sh incorrectly states "Two
workers" while the script only starts one worker; either update the header text
to accurately say "One worker" or modify the script to launch the second worker
to match the comment. Locate the top-of-file header comment in dynamo_stack.sh
and either change the phrasing to reflect a single worker or add the commands
that start the missing second worker so the comment and implementation stay
consistent.
- Around line 1-11: Add the standard Apache License 2.0 header used by the
project to the top of this script
(examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack.sh),
matching the sibling script's header exactly; ensure the header appears before
or immediately above the existing shebang line and includes the correct
copyright owner and license URL.

In
`@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/workflow.py`:
- Around line 285-303: The _draft function uses input_text.split("|") which can
produce >6 parts if LLM-generated fields contain '|' and thus misalign fields;
update _draft to split with a maxsplit of 5 (e.g., input_text.split("|",
maxsplit=5)) so the last five fields remain intact when populating query,
category, context, policy, compliance, sentiment before calling chain.ainvoke
and yielding FunctionInfo.from_fn(_draft). For a more robust fix consider
changing the workflow to accept a structured payload (JSON) instead of a
pipe-joined string.
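A minimal sketch of the suggested `maxsplit` fix. The field names mirror the review comment; the real `_draft` lives in workflow.py, so this is an illustration of the split behavior, not the actual function:

```python
# With maxsplit=5 the string is split at most 5 times, yielding at most
# 6 parts, so a '|' inside the final (LLM-generated) field is preserved
# instead of misaligning every field after it.
def parse_draft_input(input_text: str) -> dict:
    query, category, context, policy, compliance, sentiment = input_text.split("|", maxsplit=5)
    return {
        "query": query,
        "category": category,
        "context": context,
        "policy": policy,
        "compliance": compliance,
        "sentiment": sentiment,
    }

# Extra pipes in the last field no longer break the unpacking:
parts = parse_draft_input("q|billing|ctx|policy|ok|positive|extra|pipes")
print(parts["sentiment"])  # → positive|extra|pipes
```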
- Around line 464-468: The except block catching GeneratorExit in the async
generator around FunctionInfo.from_fn(_run, description=_run.__doc__) must not
swallow the exception; either remove the except clause or log and re-raise it:
replace logger.exception("Exited early!") with logger.error("Exited early!")
followed by a bare raise to preserve the original stack trace (or simply delete
the entire except GeneratorExit block so the finally cleanup runs and
GeneratorExit propagates).
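A hedged sketch of the log-and-re-raise pattern the comment asks for; the real generator wraps the workflow's `_run` coroutine rather than a counting loop:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def stream_results():
    try:
        for i in range(100):
            yield i
    except GeneratorExit:
        logger.error("Exited early!")
        raise  # never swallow GeneratorExit; aclose()/cleanup depend on it
    finally:
        pass  # finally-based cleanup still runs because we re-raised

async def main() -> int:
    gen = stream_results()
    first = await gen.__anext__()
    await gen.aclose()  # raises GeneratorExit at the suspended yield
    return first

print(asyncio.run(main()))  # → 0
```

Because the generator re-raises, `aclose()` sees the expected `GeneratorExit` and the close completes cleanly; swallowing it instead would raise `RuntimeError: async generator ignored GeneratorExit`.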

In `@packages/nvidia_nat_core/src/nat/profiler/prediction_trie/trie_builder.py`:
- Around line 195-196: Replace the en-dashes in the problematic docstring and
comment with ASCII hyphens (e.g., change "0–1" to "0-1") to satisfy ruff
unicode-dash rules, and update the zip call in the trie builder routine to use
strict=True (e.g., change zip(a, b) to zip(a, b, strict=True)) so
mismatched-length iterables raise immediately; look for the docstring at the top
of packages/nvidia_nat_core/src/nat/profiler/prediction_trie/trie_builder.py,
the nearby comment around the normalization logic, and the zip() call inside the
trie-building / normalization function (the function that pairs sequences when
building the trie) to apply these edits.
- Around line 274-312: _compute_logical_positions currently iterates contexts in
LLM_END order and uses group_earliest_end which misses transitive overlaps;
change the algorithm to group by temporal overlap using span_start_time ordering
and a group_latest_end (track max span_end_time) so any call whose
span_start_time <= group_latest_end joins the current parallel group and you
update group_latest_end = max(group_latest_end, ctx.span_end_time); to preserve
the function contract, sort indices by span_start_time, compute logical
positions on the sorted list (using current_group and group_latest_end), then
map the computed positions back to the original contexts order before returning
so callers (e.g., remaining_calls/position/fan-out signals) receive positions
aligned with the input list.
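A compact sketch of that interval-merging approach, assuming `(start, end)` tuples stand in for the `span_start_time`/`span_end_time` attributes on the real call contexts:

```python
def compute_logical_positions(spans: list[tuple[float, float]]) -> list[int]:
    """spans: (start, end) per call, in the caller's original order."""
    # Sort indices by start time so transitive overlaps chain correctly.
    order = sorted(range(len(spans)), key=lambda i: spans[i][0])
    positions = [0] * len(spans)
    current_group = -1
    group_latest_end = float("-inf")
    for idx in order:
        start, end = spans[idx]
        if start > group_latest_end:
            current_group += 1          # gap: begin a new logical step
        group_latest_end = max(group_latest_end, end)
        positions[idx] = current_group  # map back to the original order
    return positions

# A(0-10) overlaps B(2-5), and C(4-12) overlaps A transitively -> one group;
# D(20-25) starts after everything ended -> its own group.
print(compute_logical_positions([(0, 10), (2, 5), (4, 12), (20, 25)]))  # → [0, 0, 0, 1]
```

Tracking the group's latest end (rather than its earliest end) is what makes transitive overlaps collapse into a single parallel group, and writing through `positions[idx]` preserves the input-order contract for the downstream position and fan-out signals.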

---

Outside diff comments:
In `@docs/source/improve-workflows/profiler.md`:
- Around line 219-224: Replace the inanimate possessive "workflow's call graph"
with a non-possessive phrasing; update the sentence that describes the `trie`
builder to say something like "each unique position in the call graph of your
workflow (identified by `function path` and `call index`)" or "each unique
position in the workflow call graph (identified by `function path` and `call
index`)" so it avoids the possessive form while preserving the meaning.

---

Nitpick comments:
In
`@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack_sensitivity.sh`:
- Around line 45-46: The bare redirection line using LOGFILE ("> \"$LOGFILE\"")
triggers ShellCheck SC2188; replace it with a no-op command plus the redirection
(for example use ":" or "true" before the redirection) so the file is truncated
while satisfying shellcheck. Update the statement that references LOGFILE (and
uses LOG_DIR) to use a no-op command with the same redirection, e.g. ": >
\"$LOGFILE\"" to silence SC2188 and preserve behavior.
- Around line 14-19: The script defines unused variables PAGE_SIZE,
HICACHE_RATIO, and HICACHE_POLICY; either remove those declarations or pass them
into the worker launch so they take effect. Locate the variables PAGE_SIZE,
HICACHE_RATIO, and HICACHE_POLICY near the MODEL/CONTEXT_LENGTH/MEM_FRACTION
block and either delete those three lines, or append corresponding command-line
flags (e.g., --page-size, --hicache-ratio, --hicache-policy or the script's
worker arg names) to the worker start/exec invocation (the same command that
uses MODEL/CONTEXT_LENGTH/MEM_FRACTION) so the values are actually consumed.

In
`@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack.sh`:
- Around line 1-72: Two nearly identical scripts (dynamo_stack.sh and
dynamo_stack_sensitivity.sh) differ only by the flags
--schedule-low-priority-values-first and --enable-priority-scheduling; refactor
by making one script accept a parameter (e.g., --with-priority or env var
WITH_PRIORITY) and conditionally append those flags when launching the worker in
the CUDA_VISIBLE_DEVICES block, add simple CLI parsing/usage message and ensure
the conditional uses the same unique identifiers (the python3 -m dynamo.sglang
invocation and the flag names) so behavior is identical when the flag is present
and eliminated when absent.
- Around line 14-19: Remove or mark intentionally-unused variables PAGE_SIZE,
HICACHE_RATIO, and HICACHE_POLICY: either delete those assignments if they
aren’t needed, or add a shellcheck suppression comment (e.g., # shellcheck
disable=SC2034) next to their declarations so linters know they are
intentionally unused. Replace the bare redirect that truncates the logfile with
an explicit no-op command redirect (use : > "$LOGFILE") or use a proper
redirection with a command (e.g., exec >"$LOGFILE" 2>&1) to address SC2188.
Ensure changes target the declarations of PAGE_SIZE/HICACHE_RATIO/HICACHE_POLICY
and the single-line logfile redirect.

In
`@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/workflow.py`:
- Around line 64-65: The registered async generator functions (e.g.,
classify_query_function) are missing return type hints; update each decorated
function signature (for all eight `@register_function` functions such as
classify_query_function and the other seven registered functions) to include ->
AsyncGenerator[FunctionInfo, None], and ensure AsyncGenerator and FunctionInfo
are imported from typing/your module as needed so the type hint resolves.
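Illustrative only: `register_function` and `FunctionInfo` below are local placeholders for the toolkit's actual decorator and info class; the point is the `-> AsyncGenerator[FunctionInfo, None]` annotation on each registered factory:

```python
import asyncio
from collections.abc import AsyncGenerator
from dataclasses import dataclass

@dataclass
class FunctionInfo:  # placeholder for nat's FunctionInfo
    name: str

def register_function(fn):  # placeholder for the real decorator
    return fn

@register_function
async def classify_query_function(config=None) -> AsyncGenerator[FunctionInfo, None]:
    yield FunctionInfo(name="classify_query")

async def demo() -> str:
    gen = classify_query_function()
    info = await gen.__anext__()
    await gen.aclose()
    return info.name

print(asyncio.run(demo()))  # → classify_query
```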

In `@examples/dynamo_integration/latency_sensitivity_demo/tests/test_workflow.py`:
- Around line 85-100: Extract the repeated Path construction into a pytest
fixture named fixture_data_path by adding a
pytest.fixture(name="fixture_data_path") function that builds and returns the
Path used in test_dataset_exists and test_dataset_has_entries; then update both
tests to accept fixture_data_path as a parameter and use it instead of
reconstructing data_path, and in test_dataset_has_entries open fixture_data_path
to load JSON and run the existing assertions—this removes duplication while
following the fixture naming and decorator requirements.
- Line 91: The inline "import json" inside the test function in test_workflow.py
should be moved to the module-level imports; remove the deferred import from the
test method and add "import json" with the other top-of-file imports so the
standard-library import follows PEP 8 and the project's import conventions.
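A sketch combining both suggestions (the named fixture plus a module-level `import json`). The on-disk JSON here is a hypothetical stand-in for data/customer_queries.json so the snippet is self-contained; the real tests would resolve the path relative to test_workflow.py:

```python
import json
from pathlib import Path

import pytest

def write_sample_dataset(directory: Path) -> Path:
    # Hypothetical stand-in for data/customer_queries.json.
    data_path = directory / "customer_queries.json"
    data_path.write_text(json.dumps([{"query": "hi", "answer": "hello"}]))
    return data_path

@pytest.fixture(name="fixture_data_path")
def fixture_data_path_fixture(tmp_path: Path) -> Path:
    return write_sample_dataset(tmp_path)

def test_dataset_exists(fixture_data_path: Path):
    assert fixture_data_path.exists()

def test_dataset_has_entries(fixture_data_path: Path):
    entries = json.loads(fixture_data_path.read_text())
    assert len(entries) > 0
```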

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd6f4cf and 8fad804.

📒 Files selected for processing (19)
  • docs/source/improve-workflows/profiler.md
  • examples/dynamo_integration/INSTALL_LIBRARY.md
  • examples/dynamo_integration/latency_sensitivity_demo/README.md
  • examples/dynamo_integration/latency_sensitivity_demo/pyproject.toml
  • examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/__init__.py
  • examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/compare_sensitivity_perf.py
  • examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/configs/config_profile.yml
  • examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/configs/config_with_trie.yml
  • examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/data/customer_queries.json
  • examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/register.py
  • examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack.sh
  • examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack_sensitivity.sh
  • examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/sensitivity_report.py
  • examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/workflow.py
  • examples/dynamo_integration/latency_sensitivity_demo/tests/test_workflow.py
  • packages/nvidia_nat_core/src/nat/data_models/profiler.py
  • packages/nvidia_nat_core/src/nat/profiler/prediction_trie/trie_builder.py
  • packages/nvidia_nat_core/tests/nat/profiler/prediction_trie/test_trie_builder.py
  • packages/nvidia_nat_eval/src/nat/plugins/eval/profiler/profile_runner.py

Updated the method to use interval-merging logic, ensuring transitive overlaps are collapsed into a single group. Added comprehensive tests to validate behavior for overlapping, sequential, and edge-case scenarios.

Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
Added SPDX license headers to bash scripts, markdown documentation, and configuration files in the Dynamo integration examples. This ensures proper attribution and compliance with the Apache License, Version 2.0.

Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
packages/nvidia_nat_core/src/nat/profiler/prediction_trie/trie_builder.py (1)

195-196: Outstanding ruff violations: en-dash characters and zip() without strict=.

These three ruff warnings remain unaddressed from the previous review cycle.

Also applies to: 263-263, 268-268

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nvidia_nat_core/src/nat/profiler/prediction_trie/trie_builder.py`
around lines 195 - 196, Replace the Unicode en-dash in the docstring/comment
("0–1") with a normal hyphen ("0-1") and update any plain zip() calls in this
module to zip(..., strict=True) to satisfy ruff; specifically, change the "0–1"
text in the comment at the top of trie_builder.py and add strict=True to the
zip() usages that pair parallel iterables (the zip() calls referenced around the
263 and 268 areas) so mismatched-length iterables raise immediately.
🧹 Nitpick comments (1)
packages/nvidia_nat_core/src/nat/profiler/prediction_trie/trie_builder.py (1)

218-221: Move from collections import Counter to the module-level import block.

Importing inside a method body defeats module-level caching and obscures the dependency. collections is already imported at the top of the file.

♻️ Proposed fix

At the top of the file, extend the existing collections import:

-from collections import defaultdict
+from collections import Counter
+from collections import defaultdict

Then remove the inline import:

-        from collections import Counter
         group_sizes = Counter(logical_positions)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nvidia_nat_core/src/nat/profiler/prediction_trie/trie_builder.py`
around lines 218 - 221, Inline import "from collections import Counter" should
be moved to the module-level imports; update the top-of-file collections import
to also import Counter (so collections import line includes Counter) and then
remove the inline import inside the function where group_sizes =
Counter(logical_positions) is used (the inline import located just above
group_sizes in trie_builder.py). This preserves module-level caching and keeps
the dependency with other imports.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@packages/nvidia_nat_core/tests/nat/profiler/prediction_trie/test_trie_builder.py`:
- Around line 588-613: Docstrings/comments in
test_logical_positions_transitive_overlap (docstring "A(0–10)..." ) and
test_logical_positions_two_groups (comment "A(0–5), B(1–4)   Group 2: C(10–15),
D(11–14)" ) contain en-dash characters; replace each en-dash (–) with a plain
ASCII hyphen (-) so the docstring and comment read e.g. "A(0-10)" and "A(0-5),
B(1-4)   Group 2: C(10-15), D(11-14)" to resolve the RUF002/RUF003 warnings
while leaving the surrounding text and tests unchanged.

---

Duplicate comments:
In `@packages/nvidia_nat_core/src/nat/profiler/prediction_trie/trie_builder.py`:
- Around line 195-196: Replace the Unicode en-dash in the docstring/comment
("0–1") with a normal hyphen ("0-1") and update any plain zip() calls in this
module to zip(..., strict=True) to satisfy ruff; specifically, change the "0–1"
text in the comment at the top of trie_builder.py and add strict=True to the
zip() usages that pair parallel iterables (the zip() calls referenced around the
263 and 268 areas) so mismatched-length iterables raise immediately.

---

Nitpick comments:
In `@packages/nvidia_nat_core/src/nat/profiler/prediction_trie/trie_builder.py`:
- Around line 218-221: Inline import "from collections import Counter" should be
moved to the module-level imports; update the top-of-file collections import to
also import Counter (so collections import line includes Counter) and then
remove the inline import inside the function where group_sizes =
Counter(logical_positions) is used (the inline import located just above
group_sizes in trie_builder.py). This preserves module-level caching and keeps
the dependency with other imports.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cb6d039 and dc40d6a.

📒 Files selected for processing (2)
  • packages/nvidia_nat_core/src/nat/profiler/prediction_trie/trie_builder.py
  • packages/nvidia_nat_core/tests/nat/profiler/prediction_trie/test_trie_builder.py

Replaced outdated references to "NAT" with "NeMo Agent Toolkit" for accuracy and consistency. This ensures alignment with the current naming conventions and improves clarity in the documentation.

Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (1)
examples/dynamo_integration/latency_sensitivity_demo/pyproject.toml (1)

16-43: ⚠️ Potential issue | 🟡 Minor

Register the example in the root pyproject.toml.

The nat_latency_sensitivity_demo entry is missing from the root pyproject.toml [tool.uv.sources] table. Per coding guidelines, every example with its own pyproject.toml must be listed there.

+nat_latency_sensitivity_demo = { path = "examples/dynamo_integration/latency_sensitivity_demo", editable = true }

Add the entry in alphabetical order among existing entries.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/dynamo_integration/latency_sensitivity_demo/pyproject.toml` around
lines 16 - 43, Add the missing example registration for
nat_latency_sensitivity_demo into the root [tool.uv.sources] table: add an entry
keyed by nat_latency_sensitivity_demo pointing to the example's relative path
(matching how other examples are registered) with editable = true, and place it
in alphabetical order among the existing entries; locate the example by its
project name "nat_latency_sensitivity_demo" and its entry-point
"latency_sensitivity_demo.register" to ensure the correct path is used.
🧹 Nitpick comments (2)
examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack.sh (1)

59-60: Make the log truncate explicit to avoid SC2188.
This redirection is valid, but : > "$LOGFILE" is clearer and keeps Shellcheck happy.

♻️ Suggested tweak
-LOGFILE="$LOG_DIR/all.log"
-> "$LOGFILE"
+LOGFILE="$LOG_DIR/all.log"
+: > "$LOGFILE"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack.sh`
around lines 59 - 60, The current redirection line using LOGFILE
(LOGFILE="$LOG_DIR/all.log" followed by > "$LOGFILE") should be made explicit to
truncate the file and satisfy ShellCheck SC2188; replace the bare redirection
with an explicit truncate such as using a no-op redirection (e.g., : >
"$LOGFILE") so the LOGFILE variable is still set and the file is cleared.
examples/dynamo_integration/latency_sensitivity_demo/pyproject.toml (1)

34-37: Move test extra to an optional dependency group.

Including the test extra in the main install dependency pulls test tooling (pytest, etc.) into every install of this example package. Consider moving it to an optional group so users only install test dependencies when explicitly needed.

♻️ Proposed refactor
 [tool.setuptools_dynamic_dependencies]
 dependencies = [
-  "nvidia-nat[eval,langchain] == {version}",
+  "nvidia-nat[eval,langchain] == {version}",
 ]
+
+[project.optional-dependencies]
+test = ["nvidia-nat[test] == {version}"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/dynamo_integration/latency_sensitivity_demo/pyproject.toml` around
lines 34 - 37, The dependency string currently includes the "test" extra
("nvidia-nat[eval,langchain,test] == {version}"), which pulls test tooling into
every install; remove "test" from that main dependency (change to
"nvidia-nat[eval,langchain] == {version}") and create an optional/test extras
group that contains the test extra or test-only packages (e.g., add a new
optional-dependencies/extras group named "test" that includes "nvidia-nat[test]
== {version}" or the specific test packages like pytest). Update the
[tool.setuptools_dynamic_dependencies] dependencies list to the trimmed string
and add the new optional group (e.g., under project.optional-dependencies or the
setuptools_dynamic_dependencies extras section) so tests are installed only when
explicitly requested.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/dynamo_integration/latency_sensitivity_demo/pyproject.toml`:
- Around line 24-43: Move the nested configs/ and data/ directories from
src/latency_sensitivity_demo/ to the example root (next to src/, README.md,
pyproject.toml), add those files to git-lfs, and update any code or package
metadata that references the old paths (e.g., imports or file loads in the
latency_sensitivity_demo package and any package_data or resource lookups tied
to the project name nat_latency_sensitivity_demo or entry point
latency_sensitivity_demo.register). Ensure tests and CI use the new paths
(configs/... and data/...) and adjust any relative paths in code, tests, and
pyproject dynamic dependencies or package_data settings so runtime/resource
loading still works after the move.

In
`@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack_sensitivity.sh`:
- Around line 28-32: The script defines PAGE_SIZE, HICACHE_RATIO, and
HICACHE_POLICY but never uses them, which can mislead users; either remove these
unused variables or wire them into the runtime (e.g., export them or pass them
as environment variables/flags to the worker invocation). Locate the variable
declarations PAGE_SIZE, HICACHE_RATIO, and HICACHE_POLICY in the script and
either delete those lines or add them to the command that starts the worker or
to an export block so the worker code (or cache launcher) can read them; keep
MODEL and CONTEXT_LENGTH as-is if they are used elsewhere.
- Around line 18-19: Update the usage comment to reference the correct script
filename: replace the incorrect "./dynamo-stack_sensitivity.sh" mention with
"./dynamo_stack_sensitivity.sh" in the header comment so the instruction matches
the actual script name (look for the comment lines starting with "# Edit the
config below," and the following usage line).

In
`@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack.sh`:
- Around line 27-33: The PAGE_SIZE, HICACHE_RATIO, and HICACHE_POLICY variables
in the config are defined but never used; update the script so these values are
actually passed into the worker process (e.g., export them as environment
variables or add them as CLI flags to the worker invocation) or remove the
unused variables to avoid misleading users, referencing the PAGE_SIZE,
HICACHE_RATIO and HICACHE_POLICY symbols and the worker invocation point where
env/flags are constructed so the config changes take effect.

---

Duplicate comments:
In `@examples/dynamo_integration/latency_sensitivity_demo/pyproject.toml`:
- Around line 16-43: Add the missing example registration for
nat_latency_sensitivity_demo into the root [tool.uv.sources] table: add an entry
keyed by nat_latency_sensitivity_demo pointing to the example's relative path
(matching how other examples are registered) with editable = true, and place it
in alphabetical order among the existing entries; locate the example by its
project name "nat_latency_sensitivity_demo" and its entry-point
"latency_sensitivity_demo.register" to ensure the correct path is used.

---

Nitpick comments:
In `@examples/dynamo_integration/latency_sensitivity_demo/pyproject.toml`:
- Around line 34-37: The dependency string currently includes the "test" extra
("nvidia-nat[eval,langchain,test] == {version}"), which pulls test tooling into
every install; remove "test" from that main dependency (change to
"nvidia-nat[eval,langchain] == {version}") and create an optional/test extras
group that contains the test extra or test-only packages (e.g., add a new
optional-dependencies/extras group named "test" that includes "nvidia-nat[test]
== {version}" or the specific test packages like pytest). Update the
[tool.setuptools_dynamic_dependencies] dependencies list to the trimmed string
and add the new optional group (e.g., under project.optional-dependencies or the
setuptools_dynamic_dependencies extras section) so tests are installed only when
explicitly requested.

In
`@examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack.sh`:
- Around line 59-60: The current redirection line using LOGFILE
(LOGFILE="$LOG_DIR/all.log" followed by > "$LOGFILE") should be made explicit to
truncate the file and satisfy ShellCheck SC2188; replace the bare redirection
with an explicit truncate such as using a no-op redirection (e.g., : >
"$LOGFILE") so the LOGFILE variable is still set and the file is cleared.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dc40d6a and 106a6dc.

📒 Files selected for processing (4)
  • examples/dynamo_integration/INSTALL_LIBRARY.md
  • examples/dynamo_integration/latency_sensitivity_demo/pyproject.toml
  • examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack.sh
  • examples/dynamo_integration/latency_sensitivity_demo/src/latency_sensitivity_demo/scripts/dynamo_stack_sensitivity.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/dynamo_integration/INSTALL_LIBRARY.md


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (1)
examples/dynamo_integration/latency_sensitivity_demo/README.md (1)

50-50: ⚠️ Potential issue | 🟡 Minor

Inanimate possessives still unresolved — plus an additional occurrence at Line 307.

Four instances use possessive 's with inanimate nouns:

Line  Current                                        Suggested
50    workflow's `classify_query`                    the `classify_query` of the workflow
54    The profiler's auto-sensitivity algorithm      The auto-sensitivity algorithm of the profiler
79    tell Dynamo's router                           tell the Dynamo router
307   when Dynamo's priority scheduling is working   when the Dynamo priority scheduling is working

As per coding guidelines, documentation in Markdown files should not use possessive 's with inanimate objects.

Also applies to: 54-54, 79-79, 307-307

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/dynamo_integration/latency_sensitivity_demo/README.md` at line 50,
Replace possessive "'s" forms for inanimate nouns with non-possessive phrasing:
change "workflow's `classify_query`" to "the `classify_query` of the workflow",
"The profiler's auto-sensitivity algorithm" to "The auto-sensitivity algorithm
of the profiler", "tell Dynamo's router" to "tell the Dynamo router", and "when
Dynamo's priority scheduling is working" to "when the Dynamo priority scheduling
is working"; update those four occurrences (the `classify_query` reference, the
profiler auto-sensitivity phrase, the Dynamo router mention, and the Dynamo
priority scheduling mention) throughout the README to match the documentation
guideline against inanimate possessive "'s".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/dynamo_integration/latency_sensitivity_demo/README.md`:
- Around line 226-228: The Markdown code-fence closing delimiter is malformed:
the opening fence uses three backticks (```bash) while the closing fence uses
four backticks (````), which swallows the rest of the document; fix the README
by replacing the four-backtick closing fence with a matching three-backtick
fence so the snippet (bash dynamo_stack_sensitivity.sh) is properly closed.
- Line 307: Replace the asterisk-style inline emphasis in the "Priority Routing
Effectiveness" paragraph by changing *increase* to underscore-style _increase_
so the document conforms to markdownlint MD049; search for the sentence that
begins "Priority Routing Effectiveness is the key section" and update the
emphasis around the word "increase" to use underscores.
- Around line 93-147: The sub-step heading "### Step 2a: Run the Profiler…" is
incorrectly numbered under the parent "## Step 1: Profile the Workflow" and
should be "Step 1b"; update the heading title "### Step 2a: Run the Profiler to
Build the Prediction Trie" to "### Step 1b: Run the Profiler to Build the
Prediction Trie" so it matches the parent "## Step 1" and the existing "### Step
1a: Start a Dynamo Endpoint", and scan nearby sub-headings to ensure any further
"Step 2..." labels are similarly renumbered for consistency.
- Around line 31-36: The three fenced code blocks for the directory/topology
diagram, the file-tree listing, and the sample report output in README.md are
missing a language specifier (causing MD040); update each opening fence from ```
to ```text so the directory/topology diagram block (the ASCII diagram starting
with the box layout), the file-tree listing block (the
examples/dynamo_integration/latency_sensitivity_demo/outputs/profile/ path), and
the sample report output block (the LATENCY SENSITIVITY REPORT ASCII header and
following lines) all begin with ```text to satisfy markdownlint and preserve
monospaced formatting.

---

Duplicate comments:
In `@examples/dynamo_integration/latency_sensitivity_demo/README.md`:
- Line 50: Replace possessive "'s" forms for inanimate nouns with non-possessive
phrasing: change "workflow's `classify_query`" to "the `classify_query` of the
workflow", "The profiler's auto-sensitivity algorithm" to "The auto-sensitivity
algorithm of the profiler", "tell Dynamo's router" to "tell the Dynamo router",
and "when Dynamo's priority scheduling is working" to "when the Dynamo priority
scheduling is working"; update those four occurrences (the `classify_query`
reference, the profiler auto-sensitivity phrase, the Dynamo router mention, and
the Dynamo priority scheduling mention) throughout the README to match the
documentation guideline against inanimate possessive "'s".

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 106a6dc and a94d51f.

📒 Files selected for processing (1)
  • examples/dynamo_integration/latency_sensitivity_demo/README.md

@dnandakumar-nv
Contributor Author

Build-from-source instructions will be updated with container instructions once the Dynamo release is complete

This file was deleted to reduce unused or unnecessary resources within the project. It is no longer relevant to the functionality or purpose of the latency sensitivity demo.

Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
This JSON file contains sample customer queries and answers designed for testing latency sensitivity. It will be used in the Dynamo integration demo to simulate realistic customer interactions.

Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
Adjusted step numbering for clarity and corrected a minor formatting issue in the README. These changes improve readability and reduce potential confusion for users following the instructions.

Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
Introduce initial setup files for the latency sensitivity demo in Dynamo integration. These include placeholders for configs and data, preparing the groundwork for testing and validation.

Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
This change includes the `nat_latency_sensitivity_demo` example as an editable dependency in `pyproject.toml`. It ensures the project can utilize this demo for dynamo integration-related functionality.

Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
Adjusted relative links in the README to ensure accurate navigation to the installation guide. Relocated INSTALL_LIBRARY.md to match the structure of the latency_sensitivity_demo directory.

Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
@dnandakumar-nv
Contributor Author

/ok to test 912c2c2

Refine terminology and formatting in the README and install guide by standardizing model names and library references (e.g., `protobuf`, `libclang`). Enhance vocabulary rules in CI configurations to include new terms like `Huggingface` and `venv`. Minor updates improve clarity and ensure consistency.

Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
@dnandakumar-nv
Contributor Author

/ok to test c417500

dnandakumar-nv and others added 3 commits February 24, 2026 09:16
Removed the mention of `w_parallel` defaulting to 0.0 for backward compatibility as it is no longer relevant. This improves clarity and ensures the documentation remains accurate and up-to-date.

Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
Updated the regex pattern to correctly match both "Huggingface" and "HuggingFace" variations. This ensures consistency and proper handling of the term in the vocabulary checks.

Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
@dnandakumar-nv
Contributor Author

/ok to test b459477

dnandakumar-nv and others added 3 commits February 24, 2026 15:00
Updated the CONTEXT_LENGTH parameter in both `dynamo_stack_sensitivity.sh` and `dynamo_stack.sh` from 262 to 262144, restoring the intended context window size so the scripts align with the expected configuration.

Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
The CUDA toolkit setup steps were removed from the installation guide to streamline the documentation. This section is no longer necessary for the described workflow, reducing redundancy and simplifying setup instructions.

Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
Member

@willkill07 willkill07 left a comment

You also need to regenerate root uv.lock

dnandakumar-nv and others added 2 commits February 24, 2026 13:31
This commit introduces the nat-latency-sensitivity-demo package to the project. Updates include its addition to `uv.lock` and `pyproject.toml` with relevant metadata and dependencies, enabling editable integration in the examples directory.

Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
@dnandakumar-nv
Contributor Author

/ok to test 38f1008

Contributor

@dagardner-nv dagardner-nv left a comment

Approved for deps

Updated all mentions of "HuggingFace" to "Hugging Face" for consistency and alignment with the official branding. This change spans multiple documentation files and one vocabulary configuration file.

Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
@dnandakumar-nv
Contributor Author

/ok to test 6a07753

@dnandakumar-nv
Contributor Author

/ok to test f6aef4d

@dnandakumar-nv
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit d80cc14 into NVIDIA:develop Feb 24, 2026
15 checks passed
Charlie-Yi-2002 pushed a commit to Charlie-Yi-2002/NeMo-Agent-Toolkit that referenced this pull request Mar 5, 2026
### Summary

Adds automatic latency sensitivity inference to the prediction trie, enabling per-LLM-call priority routing for NVIDIA Dynamo. When profiling a workflow, each LLM call is scored on a configurable 1–5 sensitivity scale based on its structural role in the workflow DAG. These scores are written into the prediction trie and used at runtime to inject `nvext.agent_hints.priority` into Dynamo requests — high-sensitivity calls (e.g., first/last nodes) get higher priority, while parallelizable background calls get lower priority.

Includes a new end-to-end example (`latency_sensitivity_demo`) that demonstrates the full profile → score → route lifecycle with a 7-node LangGraph workflow.

### What changed

#### Core: Auto-sensitivity algorithm (`trie_builder.py`)

The prediction trie builder now computes a composite sensitivity score for each LLM call in a trace using four weighted signals:

| Signal | Weight | Description |
|---|---|---|
| `w_critical` | Critical path | Fraction of total workflow time spent in this call |
| `w_fanout` | Fan-out | How many downstream calls depend on this one |
| `w_position` | Position (U-shaped) | First and last calls score highest; middle calls score lowest |
| `w_parallel` | Parallel slack penalty | Calls running in parallel get penalized — they have scheduling slack |

Key implementation details:
- **Logical position collapsing**: Parallel siblings that overlap in time are assigned the same logical position, so the U-shaped position signal treats them as a single workflow step rather than inflating their indices
- **Parallel group membership penalty**: Any call in a parallel group of size N gets a base penalty of `(N-1)/N` averaged with its individual slack ratio — even the longest sibling in a parallel group is penalized
- **Min-max normalization**: Raw scores are normalized across each trace so the full 1–`sensitivity_scale` range is always used
- **Sibling overlap detection**: Uses span start/end timestamps to detect which LLM calls ran concurrently, then groups them by parent function ID
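The four signals and the normalization step above can be sketched as follows. This is an illustrative reconstruction, not the actual `trie_builder.py` code: the names `CallSpan`, `raw_score`, and `to_scale` are hypothetical, the fan-out term is used unnormalized, and the parallel signal here applies only the `(N-1)/N` membership penalty (the slack-ratio averaging is omitted for brevity):

```python
from dataclasses import dataclass

@dataclass
class CallSpan:
    start: float          # span start timestamp (seconds)
    end: float            # span end timestamp (seconds)
    logical_pos: int      # position after collapsing parallel siblings
    fanout: int           # number of downstream LLM calls
    group_size: int       # size of this call's parallel group (1 = sequential)

def raw_score(span, total_time, n_positions, weights):
    """Weighted sum of the four signals for one LLM call."""
    w_crit, w_fan, w_pos, w_par = weights
    critical = (span.end - span.start) / total_time     # critical-path share
    fan = span.fanout                                   # raw fan-out count
    # U-shaped position signal: 1.0 at the first/last step, 0.0 in the middle
    if n_positions > 1:
        x = span.logical_pos / (n_positions - 1)
        position = abs(2 * x - 1)
    else:
        position = 1.0
    # Membership penalty only; the real builder also averages in slack ratio
    penalty = (span.group_size - 1) / span.group_size
    return w_crit * critical + w_fan * fan + w_pos * position - w_par * penalty

def to_scale(raws, scale=5):
    """Min-max normalize raw scores onto the 1..scale range."""
    lo, hi = min(raws), max(raws)
    if hi == lo:
        return [scale for _ in raws]
    return [round(1 + (r - lo) / (hi - lo) * (scale - 1)) for r in raws]
```

On a topology like the demo's (first node with high fan-out, a block of parallel siblings, then two sequential tail nodes), a sketch like this pushes the first and last calls toward 5 and the parallel siblings toward 1.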

#### Core: Config model (`profiler.py`)

`PredictionTrieConfig` gains fields for sensitivity tuning:
- `auto_sensitivity: bool` — enable/disable scoring
- `sensitivity_scale: int` — output scale (default 5)
- `w_critical`, `w_fanout`, `w_position`, `w_parallel` — signal weights
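As a rough illustration, the new fields could be modeled like the dataclass sketch below. Only `sensitivity_scale`'s default of 5 is stated in this PR; the other defaults (and the class name) are placeholders, not the shipped values:

```python
from dataclasses import dataclass

@dataclass
class SensitivityTuningSketch:
    """Illustrative stand-in for the new PredictionTrieConfig fields.

    Only sensitivity_scale's default (5) comes from the PR description;
    the remaining defaults here are assumptions.
    """
    auto_sensitivity: bool = True   # enable/disable scoring (assumed default)
    sensitivity_scale: int = 5      # output scale, per the PR
    w_critical: float = 1.0         # critical-path weight (placeholder)
    w_fanout: float = 1.0           # fan-out weight (placeholder)
    w_position: float = 1.0         # U-shaped position weight (placeholder)
    w_parallel: float = 1.0         # parallel slack penalty weight (placeholder)
```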

#### Core: Profile runner (`profile_runner.py`)

Passes the new `SensitivityConfig` from YAML config through to `PredictionTrieBuilder` when building the trie.

#### Docs: Profiler documentation

Updated `profiler.md` with auto-sensitivity configuration reference, signal explanations, and example YAML.

#### Example: `latency_sensitivity_demo`

A complete installable example in `examples/dynamo_integration/latency_sensitivity_demo/` demonstrating automatic latency sensitivity inference end-to-end.

**Workflow topology** (7 LLM calls per request):

```
                      ┌─── research_context    (LOW, ~1000 tok) ──┐
                      ├─── lookup_policy       (LOW, ~1000 tok) ──┤
  classify (HIGH) ──► ├─── check_compliance    (LOW, ~1000 tok) ──┼──► draft_response (MED) ──► review (HIGH)
    (~60 tok)         └─── analyze_sentiment   (LOW, ~1000 tok) ──┘       (~2000 tok)            (~300 tok)
```

- **classify** (sensitivity=5): First node, high fan-out → highest priority
- **4 parallel branches** (sensitivity=1): Parallel siblings with long decode → lowest priority, GPU-saturating
- **draft_response** (sensitivity=3–4): Sequential mid-position
- **review** (sensitivity=4): Last node, short output → high priority
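Plugging the 4-way fan-out into the membership-penalty formula described earlier (`(N-1)/N` averaged with the call's individual slack ratio) shows why even the longest sibling is penalized. The helper name `parallel_penalty` and the slack ratios below are illustrative:

```python
def parallel_penalty(group_size: int, slack_ratio: float) -> float:
    """Average the (N-1)/N membership penalty with the call's own slack ratio.

    slack_ratio is the fraction of the group's wall time this call leaves idle;
    the longest sibling has slack 0 but still carries the membership penalty.
    """
    base = (group_size - 1) / group_size
    return (base + slack_ratio) / 2

longest = parallel_penalty(4, 0.0)     # (0.75 + 0.0) / 2 = 0.375
shortest = parallel_penalty(4, 0.6)    # (0.75 + 0.6) / 2 = 0.675
sequential = parallel_penalty(1, 0.0)  # (0.0  + 0.0) / 2 = 0.0
```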

**Files included:**

| File | Purpose |
|---|---|
| `pyproject.toml` | Installable package with NAT dependency |
| `workflow.py` | 7-node LangGraph workflow with 4-way parallel fan-out |
| `register.py` | NAT plugin entry point |
| `config_profile.yml` | Profiling config — builds prediction trie with auto-sensitivity |
| `config_with_trie.yml` | Runtime config — uses trie for dynamic Dynamo priority injection |
| `customer_queries.json` | 8 sample customer support queries |
| `sensitivity_report.py` | CLI tool to pretty-print trie sensitivity scores with latency data |
| `compare_sensitivity_perf.py` | CLI tool to compare baseline vs trie-enabled runs (ms/tok, HIGH/LOW ratio) |
| `dynamo_stack.sh` / `dynamo_stack_sensitivity.sh` | Helper scripts for launching Dynamo with/without priority scheduling |
| `README.md` | Full walkthrough with topology diagrams and usage instructions |
| `tests/test_workflow.py` | 17 unit tests covering graph structure, config, imports, and end-to-end flow |

### How it works end-to-end

1. **Profile**: `nat eval --config_file config_profile.yml` runs the workflow and builds a prediction trie with per-node sensitivity scores
2. **Inspect**: `python -m latency_sensitivity_demo.sensitivity_report prediction_trie.json` shows a formatted table of scores
3. **Route**: `nat eval --config_file config_with_trie.yml` uses the trie at runtime — each LLM call gets `priority = max_sensitivity - latency_sensitivity` injected into `nvext.agent_hints`
4. **Compare**: `python -m latency_sensitivity_demo.compare_sensitivity_perf` shows ms/tok and HIGH/LOW priority ratio across runs
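The priority injection in step 3 can be sketched as below. The `nvext.agent_hints.priority` field and the `priority = max_sensitivity - latency_sensitivity` formula come from this PR; the helper name and the surrounding request shape are illustrative:

```python
def inject_priority(request_body: dict,
                    latency_sensitivity: int,
                    max_sensitivity: int = 5) -> dict:
    """Attach a Dynamo priority hint derived from the trie's sensitivity score.

    Per the formula above, a sensitivity-5 call maps to priority 0 and a
    sensitivity-1 call maps to priority 4.
    """
    priority = max_sensitivity - latency_sensitivity
    hints = request_body.setdefault("nvext", {}).setdefault("agent_hints", {})
    hints["priority"] = priority
    return request_body

# A maximally sensitive call (e.g. the first node in the workflow):
body = inject_priority({"model": "demo", "messages": []}, latency_sensitivity=5)
```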

## By Submitting this PR I confirm:
- I am familiar with the [Contributing Guidelines](https://github.com/NVIDIA/NeMo-Agent-Toolkit/blob/develop/docs/source/resources/contributing/index.md).
- We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
  - Any contribution which contains commits that are not Signed-Off will not be accepted.
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.



## Summary by CodeRabbit

* **New Features**
  * Added a "Parallel sibling slack" signal (w_parallel, default 0.0) to latency-sensitivity scoring and a complete latency-sensitivity demo including workflow, CLI reporting tools, orchestration scripts, configs, and an installable demo package.

* **Documentation**
  * New Dynamo install guide, demo README, and updated profiler docs describing the parallel execution signal and configuration.

* **Tests**
  * Added unit and integration tests validating parallel slack, sensitivity scoring, config wiring, and reporting.

Authors:
  - Dhruv Nandakumar (https://github.com/dnandakumar-nv)

Approvers:
  - Will Killian (https://github.com/willkill07)
  - David Gardner (https://github.com/dagardner-nv)

URL: NVIDIA#1634

Labels

feature request (New feature or request), non-breaking (Non-breaking change)

3 participants