feat: (experimental) Dynamo aggregated backend - inference (4/5)#1834
Merged
praateekmahajan merged 7 commits into main on Apr 23, 2026
oyilmaz-nvidia
left a comment
To be honest, it's a long PR and there are too many details. But when I look at the high-level API that users will use, I think it's simple and well designed.
So, overall it looks good to me. But we need more examples and tutorials to test it. I guess you'll have them in the last PR?
oyilmaz-nvidia
approved these changes
Apr 22, 2026
oyilmaz-nvidia
approved these changes
Apr 23, 2026
Replaces the PR 2 `DynamoBackend` placeholder with a real backend for aggregated serving (single-node TP and multi-node TP) built on the placement-group API landed in PR 3.

Lifecycle:
- `start()` enters the `nemo_curator_dynamo` Ray namespace, sweeps any leftover PGs + actors from a prior driver session, then deploys infra (etcd + NATS) -> workers -> frontend and blocks on a `/v1/models` health check.
- `stop()` reconnects to the same namespace, parallel-stops every actor via `ManagedSubprocess.stop_many`, and removes the replica + infra PGs.

Worker launch lives in the new `dynamo/vllm.py`: a single `_launch_vllm_worker` handles both rank-0 and headless ranks; KV events are always passed as an explicit `--kv-events-config` (exact ZMQ publishing when the router is in `kv` mode with `kv_events=True`, explicitly disabled otherwise) so replicas don't fight over the vLLM default port.

Router wiring is intentionally minimal: `--router-mode` if set, plus every entry in `router_kwargs` as `--key value`. Full per-key router flag translation, cross-model validators (frontend coherence, disagg-TP-fit, unique names), and disaggregated serving all land in PR 5. Attempting `mode="disagg"` raises `NotImplementedError` with a pointer to the next PR.

Tests: 22 unit tests for backend + vllm helpers, 8 for runtime-env merging, 3 GPU integration tests (aggregated serve, restart-after-stop exercising the orphan-PG/actor sweep, disagg-rejection). Registered the new GPU test file under the `sdg` group in `tests/gpu_test_groups.json`.

Verification: 67 CPU tests pass; 3 GPU integration tests pass on a 2-GPU box with `CUDA_VISIBLE_DEVICES=2,3` (3m49s).

Signed-off-by: Praateek <praateekm@gmail.com>
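The KV-events rule and the minimal router wiring described in this commit message could be sketched roughly as follows. This is an illustration only, not the code in `dynamo/vllm.py`: the helper names `kv_events_config` and `router_cli_args`, the `zmq_port` parameter, the snake_case-to-kebab-case flag mapping, and the JSON field names (modelled on vLLM's `KVEventsConfig`) are all assumptions.

```python
from __future__ import annotations

import json


def kv_events_config(rank: int, router_mode: str | None, kv_events: bool, zmq_port: int) -> str:
    """Build the JSON passed to --kv-events-config for one worker (hypothetical helper)."""
    # Only a rank-0 worker behind a kv-mode router with kv_events=True publishes.
    publish = rank == 0 and router_mode == "kv" and kv_events
    if publish:
        # Assumed field names; zmq_port is a per-worker port chosen by the caller.
        cfg = {
            "enable_kv_cache_events": True,
            "publisher": "zmq",
            "endpoint": f"tcp://*:{zmq_port}",
        }
    else:
        # Explicitly disabled, so no default-port binding is ever created.
        cfg = {"enable_kv_cache_events": False}
    return json.dumps(cfg)


def router_cli_args(router_mode: str | None, router_kwargs: dict[str, object]) -> list[str]:
    """Frontend flags: --router-mode if set, then each router_kwargs entry as --key value."""
    args: list[str] = []
    if router_mode is not None:
        args += ["--router-mode", router_mode]
    for key, value in router_kwargs.items():
        # Assumption: snake_case keys map to kebab-case flags.
        args += [f"--{key.replace('_', '-')}", str(value)]
    return args
```

Always emitting a config, even a disabled one, is what keeps two workers on the same node from racing for the same default port.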
- Guard `remove_named_pgs_with_prefix(self._pg_name_prefix)` in
`DynamoBackend.stop()` against an empty prefix. If `start()` fails
early (empty models, `mode="disagg"`, missing etcd/nats binary)
before the prefix is assigned, `stop()` would otherwise call
`remove_named_pgs_with_prefix("")` and wipe every named PG in the
`nemo_curator_dynamo` namespace. (greptile P1)
- Restructure the etcd/NATS port/URL resolution in
`_deploy_and_healthcheck`: only compute a port and spawn the
internal service in the "no user endpoint" branch. The previous
form extracted a port from the user-supplied URL even when it was
never consumed, and raised `ValueError` on valid URLs with a path
component (e.g., `http://etcd:2379/v3`). (greptile P2)
- Add `assert master_addr is not None` inside the multi-node branch
of `_launch_vllm_worker` so the type checker sees the narrowed
`str` before it flows into the CLI args list. (greptile P2)
- Split `test_restart_after_stop` out of `TestDynamoAggregatedSingleNode`
into its own `TestDynamoRestartAfterStop` class with its own server
instance. The test stops the server mid-class, which under a
randomized test-order runner (e.g. pytest-randomly) could leave
the shared class-scoped fixture stopped before other tests in
`TestDynamoAggregatedSingleNode` run. (claude bot)
Verification: 67 CPU tests + 3 GPU integration tests (5m41s).
Signed-off-by: Praateek <praateekm@gmail.com>
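The first two fixes in this commit can be illustrated with a small sketch. It is a simplified stand-in under stated assumptions, not the backend code: `pg_names_to_remove` and `resolve_endpoint` are hypothetical names, placement groups are modelled as plain strings, and the `http://127.0.0.1:<port>` URL shape for the internally spawned service is assumed.

```python
from __future__ import annotations

from typing import Callable


def pg_names_to_remove(all_names: list[str], prefix: str) -> list[str]:
    """Select named PGs to remove, refusing an empty prefix (hypothetical helper)."""
    if not prefix:
        # Every string starts with "", so an empty prefix would match every PG
        # in the namespace. Treat it as "start() never got far enough".
        return []
    return [name for name in all_names if name.startswith(prefix)]


def resolve_endpoint(user_url: str | None, pick_port: Callable[[], int]) -> tuple[str, bool]:
    """Return (url, spawn_internal) for an infra service such as etcd or NATS."""
    if user_url:
        # User-supplied endpoint: pass it through verbatim and never parse it,
        # so URLs with a path component stay valid.
        return user_url, False
    # No user endpoint: only now is a port needed, for the service we spawn.
    port = pick_port()
    return f"http://127.0.0.1:{port}", True
```

Keeping the port computation inside the "no user endpoint" branch means a URL such as `http://etcd:2379/v3` is never inspected at all.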
The L0_Unit_Test_GPU-sdg job runs tests/core/serve/integration/test_dynamo.py
which spawns etcd and nats-server subprocesses via DynamoBackend. The curator
CI image didn't ship either binary, so _check_binary("etcd") raised
FileNotFoundError at test setup.
Add docker/common/install_etcd_nats.sh (same shape as install_ffmpeg.sh:
downloads to /tmp, installs to /usr/local/bin/). Versions are pinned to match
upstream ai-dynamo/dynamo container/context.yaml (etcd v3.5.21,
nats-server v2.10.28) so Curator and Dynamo runtime images carry identical
binaries. Uses curl (already in the base image) so no extra apt deps are
pulled in.
Signed-off-by: Praateek <praateekm@gmail.com>
…er/env-marker/driver-sweep

The prior design layered in a ctypes prctl PR_SET_CHILD_SUBREAPER call, a CURATOR_SUBPROC_MARKER env var inherited by every subprocess, a /proc/*/environ scan on teardown, and a driver-side sweep_orphan_subprocesses_by_prefix helper to catch vLLM V1 'EngineCore' setsid grandchildren orphaned to PID 1.

Upstream inspection (vllm/v1/engine/utils.py + multiproc_executor.py) shows vLLM actually spawns EngineCore + WorkerProcs via multiprocessing.Process(ctx='spawn'), which is fork+execv -- no setsid -- so every descendant stays in the launcher's process group. killpg on the launcher PID reaches them all.

Changes:
- subprocess_mgr.py: new _reap_process_group helper (SIGTERM -> probe -> SIGKILL via killpg(pgid, 0) which survives the launcher's own death); _stop_subprocess collapses to one call; force_sigkill_subprocess becomes one killpg line; remove _become_child_subreaper, _snapshot_descendants, _kill_descendants, env marker infrastructure, sweep_orphan_subprocesses_by_prefix.
- dynamo/backend.py: drop sweep_orphan_subprocesses_by_prefix consumption; in start() run _sweep_orphan_actors BEFORE remove_named_pgs_with_prefix so detached actors get a chance to graceful-stop + killpg their subprocess tree before PG removal hard-kills them.
- tests: replace setsid-grandchild + env-marker tests with a multiprocessing.Process(spawn) pattern matching real vLLM, and add a launcher-already-exited test confirming killpg(pgid) still reaps members. Add TestDynamoBackendStart locking in the sweep-actors-before-remove-pgs ordering.

Signed-off-by: Praateek <praateekm@gmail.com>
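The SIGTERM -> probe -> SIGKILL sequence described above could look roughly like the sketch below. It is not the `_reap_process_group` in `subprocess_mgr.py`: the function name, the `grace_s` and `poll_s` parameters, and their defaults are assumptions. It relies only on `os.killpg`, and is therefore POSIX-only.

```python
import os
import signal
import time


def reap_process_group(pgid: int, grace_s: float = 5.0, poll_s: float = 0.05) -> None:
    """SIGTERM a process group, wait up to grace_s for it to empty, then SIGKILL."""
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return  # no process is left in the group

    deadline = time.monotonic() + grace_s
    while time.monotonic() < deadline:
        try:
            # Signal 0 delivers nothing; it only checks that the group still has
            # members, which works even after the group leader has exited.
            os.killpg(pgid, 0)
        except ProcessLookupError:
            return  # everyone exited within the grace period
        time.sleep(poll_s)

    try:
        os.killpg(pgid, signal.SIGKILL)  # escalate for anything still alive
    except ProcessLookupError:
        pass  # the group emptied between the last probe and the escalation
```

One caveat with this approach: exited children that their parent has not yet waited on still count as group members, so the probe can keep succeeding until the grace period runs out. A caller that is the direct parent should still `wait()` on the launcher afterwards.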
Signed-off-by: Praateek <praateekm@gmail.com>
…shift in install script

- add ``tests/core/serve/dynamo/conftest.py`` with a ``captured_spawn`` pytest fixture that patches ``ManagedSubprocess.spawn`` and yields the recorded calls list; fixes the ``fixture 'captured_spawn' not found`` CI failure caused by ``test_backend.py`` referencing it without the conftest being committed
- add ``tests/core/serve/dynamo/test_vllm.py`` covering ``aggregated_model_uses_exact_kv_events`` and ``launch_replicas`` CLI-arg construction (single-node, kv-router, multi-node, dynamo_kwargs, replica fan-out); uses real ``plan_replica_bundle_shape`` over an injected topology instead of mocking the planner
- remove dead ``shift`` inside ``for i in "$@"`` loop in ``docker/common/install_etcd_nats.sh`` (bash expands ``"$@"`` at loop entry so the shift mutates parameters never referenced again)

Signed-off-by: Praateek <praateekm@gmail.com>
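A spawn-recording helper in the spirit of the ``captured_spawn`` fixture might be sketched as below. The real ``ManagedSubprocess.spawn`` signature is not shown in this PR page, so the ``(label, cmd)`` arguments and the ``ManagedSubprocessStub`` class are hypothetical stand-ins. The helper is written as a context manager so it runs without pytest installed.

```python
from contextlib import contextmanager
from unittest import mock


class ManagedSubprocessStub:
    """Stand-in for the real class; only the patched attribute matters here."""

    @staticmethod
    def spawn(label, cmd):
        raise RuntimeError("would launch a real process")


@contextmanager
def captured_spawn():
    """Patch spawn with a recorder and yield the list of recorded calls."""
    calls = []

    def fake_spawn(label, cmd):
        # Record the arguments instead of launching anything.
        calls.append({"label": label, "cmd": list(cmd)})
        return mock.MagicMock(name=f"actor-{label}")

    with mock.patch.object(ManagedSubprocessStub, "spawn", side_effect=fake_spawn):
        yield calls
    # On exit, mock.patch restores the original spawn.
```

In a ``conftest.py`` the same body would typically sit under ``@pytest.fixture`` without the ``@contextmanager`` decorator, so tests can assert on the exact CLI arguments each launch would have used.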
Signed-off-by: Praateek <praateekm@gmail.com>
Description
- Replaces the `DynamoBackend` placeholder with a real backend for aggregated serving (single-node TP and multi-node TP), built on the placement-group API landed in feat: Placement Group based subprocess manager - inference (3/5) #1833
- Moves worker launch out of `dynamo/backend.py` into a new `dynamo/vllm.py` so lifecycle (start/stop, infra, frontend, health) and vLLM-specific worker assembly stay separately reviewable
- `dynamo/constants.py`: actor labels (`ETCD_ACTOR_LABEL`, `NATS_ACTOR_LABEL`, `FRONTEND_ACTOR_LABEL`) and infra-bundle layout (`INFRA_ETCD_BUNDLE`, `INFRA_NATS_BUNDLE`, `INFRA_FRONTEND_BUNDLE`, `INFRA_NUM_BUNDLES`)
- `mode="disagg"` raises `NotImplementedError` here; disagg + cross-model validators + full router-flag translation ship in the final PR

Design callouts
- `--http-port` on the infra PG serves all models. The full per-key router-flag merge across models is PR 5's job; this PR wires only `--router-mode` plus `router_kwargs` passthrough.
- `--kv-events-config` on every worker. Without this, Dynamo's `args.py` auto-creates a `KVEventsConfig` bound to `tcp://*:20080` when `prefix_caching` is enabled (vLLM ≥0.16 default), and every worker on the same node fights over the same port. Rank-0 workers publish ZMQ KV events only when the router is `mode="kv"` with `kv_events=True`; every other case disables events explicitly so the default-port binding never happens.
- `_launch_vllm_worker` for rank-0 and headless ranks. Rank-0 adds endpoint / discovery / planes / optional KV-events publisher; rank-N adds `--headless` and forces KV events off. Multi-node TP resolves `--master-addr` post-`pg.ready()` via `get_bundle_node_ip(pg, 0)`.
- `start()`. `remove_named_pgs_with_prefix` reaps stale PGs, but PG removal force-kills scheduled actors and bypasses their `atexit` hook, which would orphan the subprocess tree. A `list_actors(state="ALIVE")` sweep filtered by the PG name prefix runs first so `graceful_stop_actors` can SIGTERM the process groups cleanly (with SIGKILL escalation from PR 3 as the fallback).
- `self._models` narrowed once in `__init__`. `server.models: list[BaseModelConfig]` carries no Dynamo-specific fields; `InferenceServer._validate_model_configs` already enforces that every entry is a `DynamoVLLMModelConfig`, so we `cast` once and every backend method gets `.num_replicas` / `.mode` / `.engine_kwargs` autocomplete.

Usage
What's intentionally NOT in this PR
- Disaggregated serving (`mode="disagg"`): raises `NotImplementedError` with a pointer to PR 5
- `_validate_frontend_config` (coherent namespace/planes/router across models), `_validate_unique_model_names`, and the disagg-TP-fit branch of `_validate_gpu_requirements`
- `_resolve_frontend_router_config` per-key fallback + `--router-kv-events` / `--router-temperature` / `--router-ttl-secs` / etc. PR 5 promotes whatever needs typed fields; everything else stays on `router_kwargs`

Verification
- `ruff check nemo_curator/core/serve/ tests/core/serve/`: clean
- `pytest tests/core/serve/ -m "not gpu"`: 67 passed (~43s)
- `CUDA_VISIBLE_DEVICES=2,3 pytest tests/core/serve/integration/test_dynamo.py -m gpu`: 3 passed (~3m49s)
  - `TestDynamoAggregatedSingleNode::test_is_active_and_queryable`: full Dynamo frontend + etcd + NATS + vLLM worker answers OpenAI chat completions
  - `TestDynamoAggregatedSingleNode::test_restart_after_stop`: exercises the orphan-PG + orphan-actor sweeps
  - `TestDynamoRejectsDisagg::test_disagg_mode_raises_notimplemented`: verifies the PR 5 deferral

Checklist