Skip to content

feat: (experimental) Dynamo aggregated backend - inference (4/5)#1834

Merged
praateekmahajan merged 7 commits intomainfrom
praateek/infeserv-dynamo-4-topology
Apr 23, 2026
Merged

feat: (experimental) Dynamo aggregated backend - inference (4/5)#1834
praateekmahajan merged 7 commits intomainfrom
praateek/infeserv-dynamo-4-topology

Conversation

@praateekmahajan
Copy link
Copy Markdown
Contributor

@praateekmahajan praateekmahajan commented Apr 18, 2026

Reviewer tip: the diff in nemo_curator/core/serve/dynamo/backend.py (lifecycle / infra / frontend / health) and the new nemo_curator/core/serve/dynamo/vllm.py (worker launch) are where the real work lives. Everything else is tests, small constants.

Description

  • PR 4/5 of the inference-server restack
  • replace the feat!: Typed Serve Config + Dyamo config stub - inference (2/5) #1820 DynamoBackend placeholder with a real backend for aggregated serving — single-node TP and multi-node TP — built on the placement-group API landed in feat: Placement Group based subprocess manager - inference (3/5) #1833
  • split worker-launch concerns out of dynamo/backend.py into a new dynamo/vllm.py so lifecycle (start/stop, infra, frontend, health) and vLLM-specific worker assembly stay separately reviewable
  • promote a few repeated strings and indexes to typed constants in dynamo/constants.py: actor labels (ETCD_ACTOR_LABEL, NATS_ACTOR_LABEL, FRONTEND_ACTOR_LABEL) and infra-bundle layout (INFRA_ETCD_BUNDLE, INFRA_NATS_BUNDLE, INFRA_FRONTEND_BUNDLE, INFRA_NUM_BUNDLES)
  • mode=\"disagg\" raises NotImplementedError here; disagg + cross-model validators + full router-flag translation ship in the final PR

Design callouts

  • Single frontend per server, not per model. The Dynamo frontend auto-discovers every registered model via etcd, so one --http-port on the infra PG serves all models. The full per-key router-flag merge across models is PR 5's job; this PR wires only --router-mode plus router_kwargs passthrough.
  • Explicit --kv-events-config on every worker. Without this, Dynamo's args.py auto-creates a KVEventsConfig bound to tcp://*:20080 when prefix_caching is enabled (vLLM ≥0.16 default), and every worker on the same node fights over the same port. Rank-0 workers publish ZMQ KV events only when the router is mode=\"kv\" with kv_events=True; every other case disables events explicitly so the default-port binding never happens.
  • One _launch_vllm_worker for rank-0 and headless ranks. Rank-0 adds endpoint / discovery / planes / optional KV-events publisher; rank-N adds --headless and forces KV events off. Multi-node TP resolves --master-addr post-pg.ready() via get_bundle_node_ip(pg, 0).
  • Orphan sweeps in start(). remove_named_pgs_with_prefix reaps stale PGs, but PG removal force-kills scheduled actors and bypasses their atexit hook — which would orphan the subprocess tree. A list_actors(state=\"ALIVE\") sweep filtered by the PG name prefix runs first so graceful_stop_actors can SIGTERM the process groups cleanly (with SIGKILL escalation from PR 3 as the fallback).
  • Typed self._models narrowed once in __init__. server.models: list[BaseModelConfig] carries no Dynamo-specific fields; InferenceServer._validate_model_configs already enforces that every entry is a DynamoVLLMModelConfig, so we cast once and every backend method gets .num_replicas / .mode / .engine_kwargs autocomplete.

Usage

from nemo_curator.core.serve import (
    DynamoRouterConfig,
    DynamoServerConfig,
    DynamoVLLMModelConfig,
    InferenceServer,
)

model = DynamoVLLMModelConfig(
    model_identifier=\"Qwen/Qwen3-0.6B\",
    num_replicas=2,
    engine_kwargs={\"tensor_parallel_size\": 1, \"enforce_eager\": True},
    # `dynamo_kwargs` is a dict pass-through for any `python -m dynamo.vllm` flag
    # Curator doesn't type yet (e.g. tool_call_parser, reasoning_parser, ...).
    dynamo_kwargs={\"tool_call_parser\": \"hermes\"},
)
backend_cfg = DynamoServerConfig(
    router=DynamoRouterConfig(mode=\"kv\", kv_events=True),
)
with InferenceServer(models=[model], backend=backend_cfg) as server:
    print(server.endpoint)  # http://<infra-node-ip>:<port>/v1

What's intentionally NOT in this PR

  • disaggregated serving (mode=\"disagg\"): raises NotImplementedError with a pointer to PR 5
  • cross-model validators: _validate_frontend_config (coherent namespace/planes/router across models), _validate_unique_model_names, and the disagg-TP-fit branch of _validate_gpu_requirements
  • full router-flag translation: _resolve_frontend_router_config per-key fallback + --router-kv-events / --router-temperature / --router-ttl-secs / etc. — PR 5 promotes whatever needs typed fields; everything else stays on router_kwargs
  • Ray Serve backend changes: this PR touches only the Dynamo subpackage

Verification

  • ruff check nemo_curator/core/serve/ tests/core/serve/ — clean
  • pytest tests/core/serve/ -m \"not gpu\" — 67 passed (~43s)
  • CUDA_VISIBLE_DEVICES=2,3 pytest tests/core/serve/integration/test_dynamo.py -m gpu — 3 passed (~3m49s)
    • TestDynamoAggregatedSingleNode::test_is_active_and_queryable — full Dynamo frontend + etcd + NATS + vLLM worker answers OpenAI chat completions
    • TestDynamoAggregatedSingleNode::test_restart_after_stop — exercises the orphan-PG + orphan-actor sweeps
    • TestDynamoRejectsDisagg::test_disagg_mode_raises_notimplemented — verifies the PR 5 deferral

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@praateekmahajan praateekmahajan requested a review from a team as a code owner April 18, 2026 01:33
@praateekmahajan praateekmahajan requested review from meatybobby and removed request for a team April 18, 2026 01:33
@praateekmahajan
Copy link
Copy Markdown
Contributor Author

/claude review

@praateekmahajan praateekmahajan marked this pull request as draft April 18, 2026 01:34
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 18, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Comment thread tests/core/serve/integration/test_dynamo.py Outdated
@praateekmahajan
Copy link
Copy Markdown
Contributor Author

@greptileai

/claude review

@praateekmahajan
Copy link
Copy Markdown
Contributor Author

/ok to test 67e9c47

Comment thread docker/common/install_etcd_nats.sh
@praateekmahajan
Copy link
Copy Markdown
Contributor Author

@greptileai
/claude review

@praateekmahajan
Copy link
Copy Markdown
Contributor Author

/ok to test 3e90016

Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 22, 2026

Want your agent to iterate on Greptile's feedback? Try greploops.

Copy link
Copy Markdown
Contributor

@oyilmaz-nvidia oyilmaz-nvidia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, it's a long PR and there are too many details. But when I look at the high level API that users will use, I think it's simple and good designed.

So, overall it looks good to me. But we need more examples and tutorials to test it. I guess you'll have them in the last PR?

Replaces the PR 2 `DynamoBackend` placeholder with a real backend for
aggregated serving (single-node TP and multi-node TP) built on the
placement-group API landed in PR 3.

Lifecycle:
- `start()` enters the `nemo_curator_dynamo` Ray namespace, sweeps any
  leftover PGs + actors from a prior driver session, then deploys
  infra (etcd + NATS) -> workers -> frontend and blocks on a
  `/v1/models` health check.
- `stop()` reconnects to the same namespace, parallel-stops every
  actor via `ManagedSubprocess.stop_many`, and removes the replica +
  infra PGs.

Worker launch lives in the new `dynamo/vllm.py`: a single
`_launch_vllm_worker` handles both rank-0 and headless ranks; KV
events are always passed as an explicit `--kv-events-config` (exact
ZMQ publishing when the router is in `kv` mode with `kv_events=True`,
explicitly disabled otherwise) so replicas don't fight over the vLLM
default port.

Router wiring is intentionally minimal: `--router-mode` if set, plus
every entry in `router_kwargs` as `--key value`. Full per-key router
flag translation, cross-model validators (frontend coherence,
disagg-TP-fit, unique names), and disaggregated serving all land in
PR 5. Attempting `mode="disagg"` raises `NotImplementedError` with a
pointer to the next PR.

Tests: 22 unit tests for backend + vllm helpers, 8 for runtime-env
merging, 3 GPU integration tests (aggregated serve, restart-after-
stop exercising the orphan-PG/actor sweep, disagg-rejection).
Registered the new GPU test file under the `sdg` group in
`tests/gpu_test_groups.json`.

Verification: 67 CPU tests pass; 3 GPU integration tests pass on a
2-GPU box with `CUDA_VISIBLE_DEVICES=2,3` (3m49s).

Signed-off-by: Praateek <praateekm@gmail.com>
- Guard `remove_named_pgs_with_prefix(self._pg_name_prefix)` in
  `DynamoBackend.stop()` against an empty prefix. If `start()` fails
  early (empty models, `mode="disagg"`, missing etcd/nats binary)
  before the prefix is assigned, `stop()` would otherwise call
  `remove_named_pgs_with_prefix("")` and wipe every named PG in the
  `nemo_curator_dynamo` namespace. (greptile P1)

- Restructure the etcd/NATS port/URL resolution in
  `_deploy_and_healthcheck`: only compute a port and spawn the
  internal service in the "no user endpoint" branch. The previous
  form extracted a port from the user-supplied URL even when it was
  never consumed, and raised `ValueError` on valid URLs with a path
  component (e.g., `http://etcd:2379/v3`). (greptile P2)

- Add `assert master_addr is not None` inside the multi-node branch
  of `_launch_vllm_worker` so the type checker sees the narrowed
  `str` before it flows into the CLI args list. (greptile P2)

- Split `test_restart_after_stop` out of `TestDynamoAggregatedSingleNode`
  into its own `TestDynamoRestartAfterStop` class with its own server
  instance. The test stops the server mid-class, which under a
  randomized test-order runner (e.g. pytest-randomly) could leave
  the shared class-scoped fixture stopped before other tests in
  `TestDynamoAggregatedSingleNode` run. (claude bot)

Verification: 67 CPU tests + 3 GPU integration tests (5m41s).
Signed-off-by: Praateek <praateekm@gmail.com>
The L0_Unit_Test_GPU-sdg job runs tests/core/serve/integration/test_dynamo.py
which spawns etcd and nats-server subprocesses via DynamoBackend. The curator
CI image didn't ship either binary, so _check_binary("etcd") raised
FileNotFoundError at test setup.

Add docker/common/install_etcd_nats.sh (same shape as install_ffmpeg.sh:
downloads to /tmp, installs to /usr/local/bin/). Versions are pinned to match
upstream ai-dynamo/dynamo container/context.yaml (etcd v3.5.21,
nats-server v2.10.28) so Curator and Dynamo runtime images carry identical
binaries. Uses curl (already in the base image) so no extra apt deps are
pulled in.

Signed-off-by: Praateek <praateekm@gmail.com>
…er/env-marker/driver-sweep

The prior design layered in a ctypes prctl PR_SET_CHILD_SUBREAPER call, a
CURATOR_SUBPROC_MARKER env var inherited by every subprocess, a /proc/*/environ
scan on teardown, and a driver-side sweep_orphan_subprocesses_by_prefix helper
to catch vLLM V1 'EngineCore' setsid grandchildren orphaned to PID 1. Upstream
inspection (vllm/v1/engine/utils.py + multiproc_executor.py) shows vLLM actually
spawns EngineCore + WorkerProcs via multiprocessing.Process(ctx='spawn'), which
is fork+execv -- no setsid -- so every descendant stays in the launcher's
process group. killpg on the launcher PID reaches them all.

Changes:
- subprocess_mgr.py: new _reap_process_group helper (SIGTERM -> probe -> SIGKILL
  via killpg(pgid, 0) which survives the launcher's own death); _stop_subprocess
  collapses to one call; force_sigkill_subprocess becomes one killpg line; remove
  _become_child_subreaper, _snapshot_descendants, _kill_descendants, env marker
  infrastructure, sweep_orphan_subprocesses_by_prefix.
- dynamo/backend.py: drop sweep_orphan_subprocesses_by_prefix consumption; in
  start() run _sweep_orphan_actors BEFORE remove_named_pgs_with_prefix so
  detached actors get a chance to graceful-stop + killpg their subprocess tree
  before PG removal hard-kills them.
- tests: replace setsid-grandchild + env-marker tests with a multiprocessing.
  Process(spawn) pattern matching real vLLM, and add a launcher-already-exited
  test confirming killpg(pgid) still reaps members. Add TestDynamoBackendStart
  locking in the sweep-actors-before-remove-pgs ordering.

Signed-off-by: Praateek <praateekm@gmail.com>
Signed-off-by: Praateek <praateekm@gmail.com>
…shift in install script

- add ``tests/core/serve/dynamo/conftest.py`` with a ``captured_spawn``
  pytest fixture that patches ``ManagedSubprocess.spawn`` and yields the
  recorded calls list — fixes the ``fixture 'captured_spawn' not found``
  CI failure caused by ``test_backend.py`` referencing it without the
  conftest being committed
- add ``tests/core/serve/dynamo/test_vllm.py`` covering
  ``aggregated_model_uses_exact_kv_events`` and ``launch_replicas`` CLI-
  arg construction (single-node, kv-router, multi-node, dynamo_kwargs,
  replica fan-out); uses real ``plan_replica_bundle_shape`` over an
  injected topology instead of mocking the planner
- remove dead ``shift`` inside ``for i in "$@"`` loop in
  ``docker/common/install_etcd_nats.sh`` (bash expands ``"$@"`` at loop
  entry so the shift mutates parameters never referenced again)

Signed-off-by: Praateek <praateekm@gmail.com>
Signed-off-by: Praateek <praateekm@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants