refactor: extract Ray Serve backend boundary - inference (1/5) #1813

Merged
praateekmahajan merged 7 commits into NVIDIA-NeMo:main from praateekmahajan:praateek/infeserv-dynamo-1-ray-serve-extraction
Apr 16, 2026

@praateekmahajan praateekmahajan commented Apr 15, 2026

Description

  • PR 1/5 of the inference-server restack
  • turn nemo_curator.core.serve into a package and move Ray Serve deployment, shutdown, and quiet runtime-env handling into nemo_curator/core/serve/ray_serve/backend.py
  • keep InferenceServer focused on lifecycle and backend dispatch while preserving the existing InferenceModelConfig / InferenceServer public API for this PR
  • rename the public activity helper to is_inference_server_active() and move Ray Serve GPU integration coverage under tests/core/serve/ray_serve/test_integration.py

Usage

from nemo_curator.core.serve import InferenceModelConfig, InferenceServer

config = InferenceModelConfig(
    model_identifier="google/gemma-3-27b-it",
    deployment_config={"autoscaling_config": {"min_replicas": 1, "max_replicas": 1}},
    engine_kwargs={"tensor_parallel_size": 1},
)

with InferenceServer(models=[config]) as server:
    print(server.endpoint)

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Praateek <praateekm@gmail.com>
@praateekmahajan praateekmahajan requested a review from a team as a code owner April 15, 2026 22:18
@praateekmahajan praateekmahajan requested review from abhinavg4 and removed request for a team April 15, 2026 22:18
@praateekmahajan praateekmahajan changed the title from "Refactor Ray Serve behind a backend boundary" to "Inference Server - 1/N - Refactor Ray Serve behind a backend boundary" Apr 15, 2026

greptile-apps Bot commented Apr 15, 2026

Greptile Summary

This PR refactors nemo_curator/core/serve from a single module into a package, extracting Ray Serve deployment, shutdown, and quiet runtime-env logic into RayServeBackend behind a new InferenceBackend Protocol. State management, atexit cleanup, and the error path on failed start are all handled correctly, and the public API (InferenceModelConfig, InferenceServer, is_inference_server_active) is preserved.
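The backend boundary described above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: the `start()`/`stop()` method set on the Protocol, the `RecordingBackend` stand-in, the simplified `Server` class, and the choice to call `stop()` on the backend after a failed start are all assumptions made for illustration.

```python
from typing import Protocol


class InferenceBackend(Protocol):
    """Structural interface: any object with start() and stop() qualifies."""

    def start(self) -> None: ...

    def stop(self) -> None: ...


class RecordingBackend:
    """Hypothetical stand-in for a real backend; records calls instead of deploying."""

    def __init__(self, fail_on_start: bool = False) -> None:
        self.calls: list[str] = []
        self.fail_on_start = fail_on_start

    def start(self) -> None:
        self.calls.append("start")
        if self.fail_on_start:
            raise RuntimeError("deployment failed")

    def stop(self) -> None:
        self.calls.append("stop")


class Server:
    """Owns lifecycle state only; deployment work is delegated to the backend."""

    def __init__(self, backend: InferenceBackend) -> None:
        self._backend = backend
        self.started = False

    def start(self) -> None:
        try:
            self._backend.start()
        except Exception:
            # Assumed error path: let the backend clean up, then re-raise.
            self._backend.stop()
            raise
        self.started = True

    def stop(self) -> None:
        if self.started:
            self._backend.stop()
            self.started = False
```

Because `Protocol` uses structural typing, `RecordingBackend` satisfies `InferenceBackend` without inheriting from it, which is what makes swapping in a second backend or a test stub cheap.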

Confidence Score: 5/5

Safe to merge; the only finding is a P2 test cleanup — a stale patch.object(serve, 'shutdown') that is dead code after the refactor.

All production code changes are correct — Protocol contract satisfied, atexit lifecycle is sound, error paths clean up state properly. The single finding is a non-blocking test quality issue (misleading patch in an existing test), and the actual delegation behavior it was meant to exercise is covered by the new test_start_stop_delegates_to_backend test.

tests/core/test_serve.py — test_stop_calls_shutdown has a stale patch.object(serve, 'shutdown') that should be removed or replaced with a stub backend.
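One way to replace the stale patch with a stub backend is sketched below. The `Server` class here is a hypothetical, simplified stand-in (the real `InferenceServer` constructor and the way it obtains its backend are not shown in this thread), so only the testing pattern carries over: assert on the stub's `stop()` rather than patching `serve.shutdown`.

```python
from unittest.mock import MagicMock


class Server:
    """Hypothetical minimal server that delegates stop() to its backend."""

    def __init__(self, backend) -> None:
        self._backend = backend
        self._started = False

    def start(self) -> None:
        self._backend.start()
        self._started = True

    def stop(self) -> None:
        if not self._started:
            return
        self._backend.stop()
        self._started = False


def test_stop_delegates_to_backend() -> None:
    # spec restricts the stub to the two methods the server may call.
    backend = MagicMock(spec=["start", "stop"])
    server = Server(backend)

    server.start()
    server.stop()
    backend.stop.assert_called_once_with()

    # A second stop() is a no-op, so the backend is not called again.
    server.stop()
    backend.stop.assert_called_once_with()


test_stop_delegates_to_backend()
```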

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_curator/core/serve/__init__.py | New package init that re-exports InferenceModelConfig, InferenceServer, and is_inference_server_active; preserves the public API surface. |
| nemo_curator/core/serve/ray_serve/backend.py | New RayServeBackend class correctly encapsulates Ray Serve deploy/shutdown, client-cache reset, and quiet runtime env helpers moved from InferenceServer. |
| nemo_curator/core/serve/server.py | Refactored InferenceServer correctly introduces InferenceBackend Protocol, delegates start/stop to backend, and handles error paths and atexit cleanup properly. |
| tests/core/test_serve.py | Mostly well updated; test_stop_calls_shutdown has a stale patch.object(serve, 'shutdown') that is dead code after the refactor, since shutdown is no longer called directly but via the backend. |
| tests/pipelines/test_pipelines.py | Cleanly switched to patching is_inference_server_active with a mock, removing fragile _active_servers mutation. |
| tests/core/serve/ray_serve/test_integration.py | Integration tests migrated from test_serve.py; fixture and test logic are equivalent to the old file with correct use of is_inference_server_active. |
| nemo_curator/pipeline/pipeline.py | Straightforward rename from is_ray_serve_active to is_inference_server_active; no behavioral change. |

Sequence Diagram

sequenceDiagram
    participant U as User
    participant IS as InferenceServer
    participant B as RayServeBackend
    participant R as Ray / Serve

    U->>IS: start()
    IS->>IS: check _active_servers (singleton guard)
    IS->>IS: atexit.register(self.stop)
    IS->>IS: _create_backend()
    IS->>B: start()
    B->>R: ray.init(ignore_reinit_error=True)
    B->>B: _deploy()
    B->>R: serve.start(http_options={port})
    B->>R: serve.run(app, blocking=False)
    B->>IS: _wait_for_healthy()
    IS-->>B: healthy
    R-->>B: driver disconnects (context exit)
    B-->>IS: start() returns
    IS->>IS: _active_servers.add(name), _started=True
    IS-->>U: Inference server is ready at endpoint

    U->>IS: stop()
    IS->>B: stop()
    B->>R: ray.init(ignore_reinit_error=True)
    B->>R: serve.shutdown()
    R-->>B: done
    B-->>IS: stop() returns
    IS->>IS: _active_servers.discard(name), _started=False
    IS->>IS: atexit.unregister(self.stop)
    IS-->>U: done
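The server-side bookkeeping in the diagram (singleton guard, `atexit` registration, and the matching cleanup in `stop()`) can be sketched roughly as follows. This is an illustrative reduction, not the PR's implementation: the `Server` class, the `_deploy()` placeholder, and the exact error handling are assumptions; only the names `_active_servers`, `_started`, and `is_inference_server_active` come from the thread.

```python
import atexit

# Module-level registry of running servers (name taken from the diagram).
_active_servers: set[str] = set()


def is_inference_server_active() -> bool:
    """Report whether any server is currently registered as active."""
    return bool(_active_servers)


class Server:
    def __init__(self, name: str) -> None:
        self.name = name
        self._started = False

    def start(self) -> None:
        # Singleton guard: refuse to start while another server is active.
        if _active_servers:
            raise RuntimeError(f"already active: {sorted(_active_servers)}")
        # Register cleanup first so an interpreter exit still triggers stop().
        atexit.register(self.stop)
        try:
            self._deploy()
        except Exception:
            atexit.unregister(self.stop)
            raise
        _active_servers.add(self.name)
        self._started = True

    def _deploy(self) -> None:
        """Placeholder for the backend start() call (assumed)."""

    def stop(self) -> None:
        if not self._started:
            return
        _active_servers.discard(self.name)
        self._started = False
        # Bound methods compare equal, so this removes the earlier registration.
        atexit.unregister(self.stop)
```

Marking the server as active only after the deploy step succeeds means a failed start leaves `_active_servers` untouched.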

Reviews (5): Last reviewed commit: "Merge branch 'main' into praateek/infese..."

Comment thread nemo_curator/core/serve/server.py Outdated
Comment thread nemo_curator/core/serve/server.py Outdated
@praateekmahajan praateekmahajan changed the title from "Inference Server - 1/N - Refactor Ray Serve behind a backend boundary" to "refactor: Inference Server - 1 /5 - extract Ray Serve backend boundary" Apr 15, 2026
@praateekmahajan praateekmahajan changed the title from "refactor: Inference Server - 1 /5 - extract Ray Serve backend boundary" to "refactor: Inference Server - 1/5 - extract Ray Serve backend boundary" Apr 15, 2026
@praateekmahajan

/claude review

@praateekmahajan praateekmahajan changed the title from "refactor: Inference Server - 1/5 - extract Ray Serve backend boundary" to "refactor: extract Ray Serve backend boundary - inference (1/5)" Apr 15, 2026
Signed-off-by: Praateek <praateekm@gmail.com>
@abhinavg4 abhinavg4 left a comment

Simple refactor
