feat: (experimental) Dynamo aggregated backend - inference (4/5) by praateekmahajan · Pull Request #1834 · NVIDIA-NeMo/Curator

praateekmahajan · 2026-04-18T01:33:32Z

Reviewer tip: the diff in nemo_curator/core/serve/dynamo/backend.py (lifecycle / infra / frontend / health) and the new nemo_curator/core/serve/dynamo/vllm.py (worker launch) are where the real work lives. Everything else is tests, small constants.

Description

PR 4/5 of the inference-server restack
replace the feat!: Typed Serve Config + Dyamo config stub - inference (2/5) #1820 DynamoBackend placeholder with a real backend for aggregated serving — single-node TP and multi-node TP — built on the placement-group API landed in feat: Placement Group based subprocess manager - inference (3/5) #1833
split worker-launch concerns out of dynamo/backend.py into a new dynamo/vllm.py so lifecycle (start/stop, infra, frontend, health) and vLLM-specific worker assembly stay separately reviewable
promote a few repeated strings and indexes to typed constants in dynamo/constants.py: actor labels (ETCD_ACTOR_LABEL, NATS_ACTOR_LABEL, FRONTEND_ACTOR_LABEL) and infra-bundle layout (INFRA_ETCD_BUNDLE, INFRA_NATS_BUNDLE, INFRA_FRONTEND_BUNDLE, INFRA_NUM_BUNDLES)
mode=\"disagg\" raises NotImplementedError here; disagg + cross-model validators + full router-flag translation ship in the final PR

Design callouts

Single frontend per server, not per model. The Dynamo frontend auto-discovers every registered model via etcd, so one --http-port on the infra PG serves all models. The full per-key router-flag merge across models is PR 5's job; this PR wires only --router-mode plus router_kwargs passthrough.
Explicit --kv-events-config on every worker. Without this, Dynamo's args.py auto-creates a KVEventsConfig bound to tcp://*:20080 when prefix_caching is enabled (vLLM ≥0.16 default), and every worker on the same node fights over the same port. Rank-0 workers publish ZMQ KV events only when the router is mode=\"kv\" with kv_events=True; every other case disables events explicitly so the default-port binding never happens.
One _launch_vllm_worker for rank-0 and headless ranks. Rank-0 adds endpoint / discovery / planes / optional KV-events publisher; rank-N adds --headless and forces KV events off. Multi-node TP resolves --master-addr post-pg.ready() via get_bundle_node_ip(pg, 0).
Orphan sweeps in start(). remove_named_pgs_with_prefix reaps stale PGs, but PG removal force-kills scheduled actors and bypasses their atexit hook — which would orphan the subprocess tree. A list_actors(state=\"ALIVE\") sweep filtered by the PG name prefix runs first so graceful_stop_actors can SIGTERM the process groups cleanly (with SIGKILL escalation from PR 3 as the fallback).
Typed self._models narrowed once in __init__. server.models: list[BaseModelConfig] carries no Dynamo-specific fields; InferenceServer._validate_model_configs already enforces that every entry is a DynamoVLLMModelConfig, so we cast once and every backend method gets .num_replicas / .mode / .engine_kwargs autocomplete.

Usage

from nemo_curator.core.serve import (
    DynamoRouterConfig,
    DynamoServerConfig,
    DynamoVLLMModelConfig,
    InferenceServer,
)

model = DynamoVLLMModelConfig(
    model_identifier=\"Qwen/Qwen3-0.6B\",
    num_replicas=2,
    engine_kwargs={\"tensor_parallel_size\": 1, \"enforce_eager\": True},
    # `dynamo_kwargs` is a dict pass-through for any `python -m dynamo.vllm` flag
    # Curator doesn't type yet (e.g. tool_call_parser, reasoning_parser, ...).
    dynamo_kwargs={\"tool_call_parser\": \"hermes\"},
)
backend_cfg = DynamoServerConfig(
    router=DynamoRouterConfig(mode=\"kv\", kv_events=True),
)
with InferenceServer(models=[model], backend=backend_cfg) as server:
    print(server.endpoint)  # http://<infra-node-ip>:<port>/v1

What's intentionally NOT in this PR

disaggregated serving (mode=\"disagg\"): raises NotImplementedError with a pointer to PR 5
cross-model validators: _validate_frontend_config (coherent namespace/planes/router across models), _validate_unique_model_names, and the disagg-TP-fit branch of _validate_gpu_requirements
full router-flag translation: _resolve_frontend_router_config per-key fallback + --router-kv-events / --router-temperature / --router-ttl-secs / etc. — PR 5 promotes whatever needs typed fields; everything else stays on router_kwargs
Ray Serve backend changes: this PR touches only the Dynamo subpackage

Verification

ruff check nemo_curator/core/serve/ tests/core/serve/ — clean
pytest tests/core/serve/ -m \"not gpu\" — 67 passed (~43s)
CUDA_VISIBLE_DEVICES=2,3 pytest tests/core/serve/integration/test_dynamo.py -m gpu — 3 passed (~3m49s)
- TestDynamoAggregatedSingleNode::test_is_active_and_queryable — full Dynamo frontend + etcd + NATS + vLLM worker answers OpenAI chat completions
- TestDynamoAggregatedSingleNode::test_restart_after_stop — exercises the orphan-PG + orphan-actor sweeps
- TestDynamoRejectsDisagg::test_disagg_mode_raises_notimplemented — verifies the PR 5 deferral

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

praateekmahajan · 2026-04-18T01:34:37Z

/claude review

copy-pr-bot · 2026-04-18T01:34:49Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

praateekmahajan · 2026-04-22T00:10:05Z

@greptileai

/claude review

praateekmahajan · 2026-04-22T00:10:30Z

/ok to test 67e9c47

praateekmahajan · 2026-04-22T17:37:08Z

@greptileai
/claude review

praateekmahajan · 2026-04-22T17:37:22Z

/ok to test 3e90016

claude

LGTM

greptile-apps · 2026-04-22T17:43:51Z

Want your agent to iterate on Greptile's feedback? Try greploops.

oyilmaz-nvidia

To be honest, it's a long PR and there are too many details. But when I look at the high level API that users will use, I think it's simple and good designed.

So, overall it looks good to me. But we need more examples and tutorials to test it. I guess you'll have them in the last PR?

Replaces the PR 2 `DynamoBackend` placeholder with a real backend for aggregated serving (single-node TP and multi-node TP) built on the placement-group API landed in PR 3. Lifecycle: - `start()` enters the `nemo_curator_dynamo` Ray namespace, sweeps any leftover PGs + actors from a prior driver session, then deploys infra (etcd + NATS) -> workers -> frontend and blocks on a `/v1/models` health check. - `stop()` reconnects to the same namespace, parallel-stops every actor via `ManagedSubprocess.stop_many`, and removes the replica + infra PGs. Worker launch lives in the new `dynamo/vllm.py`: a single `_launch_vllm_worker` handles both rank-0 and headless ranks; KV events are always passed as an explicit `--kv-events-config` (exact ZMQ publishing when the router is in `kv` mode with `kv_events=True`, explicitly disabled otherwise) so replicas don't fight over the vLLM default port. Router wiring is intentionally minimal: `--router-mode` if set, plus every entry in `router_kwargs` as `--key value`. Full per-key router flag translation, cross-model validators (frontend coherence, disagg-TP-fit, unique names), and disaggregated serving all land in PR 5. Attempting `mode="disagg"` raises `NotImplementedError` with a pointer to the next PR. Tests: 22 unit tests for backend + vllm helpers, 8 for runtime-env merging, 3 GPU integration tests (aggregated serve, restart-after- stop exercising the orphan-PG/actor sweep, disagg-rejection). Registered the new GPU test file under the `sdg` group in `tests/gpu_test_groups.json`. Verification: 67 CPU tests pass; 3 GPU integration tests pass on a 2-GPU box with `CUDA_VISIBLE_DEVICES=2,3` (3m49s). Signed-off-by: Praateek <praateekm@gmail.com>

- Guard `remove_named_pgs_with_prefix(self._pg_name_prefix)` in `DynamoBackend.stop()` against an empty prefix. If `start()` fails early (empty models, `mode="disagg"`, missing etcd/nats binary) before the prefix is assigned, `stop()` would otherwise call `remove_named_pgs_with_prefix("")` and wipe every named PG in the `nemo_curator_dynamo` namespace. (greptile P1) - Restructure the etcd/NATS port/URL resolution in `_deploy_and_healthcheck`: only compute a port and spawn the internal service in the "no user endpoint" branch. The previous form extracted a port from the user-supplied URL even when it was never consumed, and raised `ValueError` on valid URLs with a path component (e.g., `http://etcd:2379/v3`). (greptile P2) - Add `assert master_addr is not None` inside the multi-node branch of `_launch_vllm_worker` so the type checker sees the narrowed `str` before it flows into the CLI args list. (greptile P2) - Split `test_restart_after_stop` out of `TestDynamoAggregatedSingleNode` into its own `TestDynamoRestartAfterStop` class with its own server instance. The test stops the server mid-class, which under a randomized test-order runner (e.g. pytest-randomly) could leave the shared class-scoped fixture stopped before other tests in `TestDynamoAggregatedSingleNode` run. (claude bot) Verification: 67 CPU tests + 3 GPU integration tests (5m41s). Signed-off-by: Praateek <praateekm@gmail.com>

The L0_Unit_Test_GPU-sdg job runs tests/core/serve/integration/test_dynamo.py which spawns etcd and nats-server subprocesses via DynamoBackend. The curator CI image didn't ship either binary, so _check_binary("etcd") raised FileNotFoundError at test setup. Add docker/common/install_etcd_nats.sh (same shape as install_ffmpeg.sh: downloads to /tmp, installs to /usr/local/bin/). Versions are pinned to match upstream ai-dynamo/dynamo container/context.yaml (etcd v3.5.21, nats-server v2.10.28) so Curator and Dynamo runtime images carry identical binaries. Uses curl (already in the base image) so no extra apt deps are pulled in. Signed-off-by: Praateek <praateekm@gmail.com>

…er/env-marker/driver-sweep The prior design layered in a ctypes prctl PR_SET_CHILD_SUBREAPER call, a CURATOR_SUBPROC_MARKER env var inherited by every subprocess, a /proc/*/environ scan on teardown, and a driver-side sweep_orphan_subprocesses_by_prefix helper to catch vLLM V1 'EngineCore' setsid grandchildren orphaned to PID 1. Upstream inspection (vllm/v1/engine/utils.py + multiproc_executor.py) shows vLLM actually spawns EngineCore + WorkerProcs via multiprocessing.Process(ctx='spawn'), which is fork+execv -- no setsid -- so every descendant stays in the launcher's process group. killpg on the launcher PID reaches them all. Changes: - subprocess_mgr.py: new _reap_process_group helper (SIGTERM -> probe -> SIGKILL via killpg(pgid, 0) which survives the launcher's own death); _stop_subprocess collapses to one call; force_sigkill_subprocess becomes one killpg line; remove _become_child_subreaper, _snapshot_descendants, _kill_descendants, env marker infrastructure, sweep_orphan_subprocesses_by_prefix. - dynamo/backend.py: drop sweep_orphan_subprocesses_by_prefix consumption; in start() run _sweep_orphan_actors BEFORE remove_named_pgs_with_prefix so detached actors get a chance to graceful-stop + killpg their subprocess tree before PG removal hard-kills them. - tests: replace setsid-grandchild + env-marker tests with a multiprocessing. Process(spawn) pattern matching real vLLM, and add a launcher-already-exited test confirming killpg(pgid) still reaps members. Add TestDynamoBackendStart locking in the sweep-actors-before-remove-pgs ordering. Signed-off-by: Praateek <praateekm@gmail.com>

Signed-off-by: Praateek <praateekm@gmail.com>

…shift in install script - add ``tests/core/serve/dynamo/conftest.py`` with a ``captured_spawn`` pytest fixture that patches ``ManagedSubprocess.spawn`` and yields the recorded calls list — fixes the ``fixture 'captured_spawn' not found`` CI failure caused by ``test_backend.py`` referencing it without the conftest being committed - add ``tests/core/serve/dynamo/test_vllm.py`` covering ``aggregated_model_uses_exact_kv_events`` and ``launch_replicas`` CLI- arg construction (single-node, kv-router, multi-node, dynamo_kwargs, replica fan-out); uses real ``plan_replica_bundle_shape`` over an injected topology instead of mocking the planner - remove dead ``shift`` inside ``for i in "$@"`` loop in ``docker/common/install_etcd_nats.sh`` (bash expands ``"$@"`` at loop entry so the shift mutates parameters never referenced again) Signed-off-by: Praateek <praateekm@gmail.com>

Signed-off-by: Praateek <praateekm@gmail.com>

praateekmahajan requested a review from a team as a code owner April 18, 2026 01:33

praateekmahajan requested review from meatybobby and removed request for a team April 18, 2026 01:33

copy-pr-bot Bot temporarily deployed to test April 18, 2026 01:33 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci April 18, 2026 01:33 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci April 18, 2026 01:33 Error

copy-pr-bot Bot temporarily deployed to nemo-ci April 18, 2026 01:33 Inactive

praateekmahajan marked this pull request as draft April 18, 2026 01:34

claude Bot reviewed Apr 18, 2026

View reviewed changes

Comment thread tests/core/serve/integration/test_dynamo.py Outdated

copy-pr-bot Bot temporarily deployed to nemo-ci April 18, 2026 01:40 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci April 18, 2026 01:40 Failure

copy-pr-bot Bot temporarily deployed to nemo-ci April 18, 2026 01:40 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci April 20, 2026 16:41 Inactive

claude Bot reviewed Apr 22, 2026

View reviewed changes

Comment thread docker/common/install_etcd_nats.sh

claude Bot approved these changes Apr 22, 2026

View reviewed changes

oyilmaz-nvidia reviewed Apr 22, 2026

View reviewed changes

oyilmaz-nvidia approved these changes Apr 22, 2026

View reviewed changes

praateekmahajan mentioned this pull request Apr 22, 2026

feat: (experimental) Dynamo Disaggregated serving + validators + router polish - inference (5/5) #1861

Merged

4 tasks

oyilmaz-nvidia approved these changes Apr 23, 2026

View reviewed changes

praateekmahajan added 7 commits April 23, 2026 12:40

refactor(serve): refresh detached actor handles across Dynamo shutdown

cad51d3

Signed-off-by: Praateek <praateekm@gmail.com>

chore: trigger DCO re-check after base retarget

5b44c66

Signed-off-by: Praateek <praateekm@gmail.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: (experimental) Dynamo aggregated backend - inference (4/5)#1834

feat: (experimental) Dynamo aggregated backend - inference (4/5)#1834
praateekmahajan merged 7 commits intomainfrom
praateek/infeserv-dynamo-4-topology

praateekmahajan commented Apr 18, 2026 •

edited

Loading

Uh oh!

praateekmahajan commented Apr 18, 2026

Uh oh!

copy-pr-bot Bot commented Apr 18, 2026

Uh oh!

Uh oh!

praateekmahajan commented Apr 22, 2026

Uh oh!

praateekmahajan commented Apr 22, 2026

Uh oh!

Uh oh!

praateekmahajan commented Apr 22, 2026

Uh oh!

praateekmahajan commented Apr 22, 2026

Uh oh!

claude Bot left a comment

Uh oh!

greptile-apps Bot commented Apr 22, 2026

Uh oh!

oyilmaz-nvidia left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

praateekmahajan commented Apr 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Design callouts

Usage

What's intentionally NOT in this PR

Verification

Checklist

Uh oh!

praateekmahajan commented Apr 18, 2026

Uh oh!

copy-pr-bot Bot commented Apr 18, 2026

Uh oh!

Uh oh!

praateekmahajan commented Apr 22, 2026

Uh oh!

praateekmahajan commented Apr 22, 2026

Uh oh!

Uh oh!

praateekmahajan commented Apr 22, 2026

Uh oh!

praateekmahajan commented Apr 22, 2026

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot commented Apr 22, 2026

Uh oh!

oyilmaz-nvidia left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

praateekmahajan commented Apr 18, 2026 •

edited

Loading