Simple and stable Inference APIs by YangFei1990 · Pull Request #4697 · NVIDIA/Megatron-LM

YangFei1990 · 2026-05-08T06:30:24Z

What does this PR do?

Motivation And Goals

The current Megatron inference APIs expose many internal building blocks:

InferenceConfig
DynamicInferenceContext
GPTInferenceWrapper
TextGenerationController
DynamicInferenceEngine
InferenceClient
DataParallelInferenceCoordinator
dynamic text generation server helpers

This is powerful, but it makes simple usage verbose. A user who wants to run offline generation or serve requests must understand engine construction, context selection, tokenizer setup, coordinator lifecycle, and per-rank behavior.

APIs

Inspired by vLLM, we propose two dimensions for the API design.

Sync and Async APIs

We propose two major APIs for Megatron inference:

MegatronLLM: synchronous offline inference API. Calls block until the final outputs are available.
MegatronAsyncLLM: asyncio-native API for generation and online serving (OpenAI-compatible).

Both classes support offline inference, lifecycle control (pause/unpause/suspend/resume), and access to the underlying engine for expert use. The differentiators are: MegatronAsyncLLM exposes async methods and the online HTTP server (serve(...)); MegatronLLM exposes sync methods.

The underlying primitive APIs can also be accessed through the corresponding property attributes (engine, context, controller).
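
To make the sync/async split concrete, here is a minimal sketch of a blocking facade over an asyncio-native object. ToyAsyncLLM and ToySyncLLM are hypothetical stand-ins invented for this illustration; they do not import Megatron, and the PR does not state that MegatronLLM is built this way.

```python
import asyncio


class ToyAsyncLLM:
    """Hypothetical stand-in for an asyncio-native API (not MegatronAsyncLLM)."""

    async def generate(self, prompts):
        outputs = []
        for prompt in prompts:
            await asyncio.sleep(0)  # yield to the event loop, as real I/O would
            outputs.append(prompt.upper())  # fake "generation"
        return outputs


class ToySyncLLM:
    """Hypothetical blocking facade: same call, but it returns final outputs."""

    def __init__(self):
        self._async_llm = ToyAsyncLLM()

    @property
    def engine(self):
        # Escape hatch for expert use, similar in spirit to the `engine` property.
        return self._async_llm

    def generate(self, prompts):
        # Drive the coroutine to completion and block until it finishes.
        return asyncio.run(self._async_llm.generate(prompts))


sync_outputs = ToySyncLLM().generate(["hello", "world"])
print(sync_outputs)
```

The caller of the sync class never touches an event loop, while the async class can be awaited alongside other coroutines.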

Coordinator

Both sync and async APIs support direct mode and coordinator mode, specified by the use_coordinator argument in the API constructor. We also provide an is_primary_rank property to help users understand which rank should feed data and collect outputs.

Without coordinator, all ranks are treated as user-managed ranks, and users need to handle load balancing between different DP/EP ranks. Every rank's is_primary_rank returns true: the API does not decide which rank should receive which prompts or which rank should emit output. Users must split data across different DP/EP ranks, ensure consistent inputs across TP/PP/CP ranks, and gather/aggregate results from different DP/EP ranks. If users do not shard inputs correctly, they may duplicate work or violate TP/PP/EP/DP group expectations.
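
The sharding and gathering that direct mode leaves to the user can be sketched as follows. This is an illustration under assumptions: shard_prompts and merge_outputs are hypothetical helpers, round-robin is just one possible split, and all DP ranks are simulated in a single process rather than through a real collective.

```python
def shard_prompts(prompts, dp_rank, dp_size):
    """Return (original_index, prompt) pairs owned by this data-parallel rank."""
    return [(i, p) for i, p in enumerate(prompts) if i % dp_size == dp_rank]


def merge_outputs(per_rank_outputs, total):
    """Reassemble per-rank (index, text) pairs into the original prompt order."""
    merged = [None] * total
    for rank_outputs in per_rank_outputs:
        for index, text in rank_outputs:
            merged[index] = text
    return merged


prompts = ["p0", "p1", "p2", "p3", "p4"]
dp_size = 2

# Simulate every DP rank here; in a real job each rank would process only its
# own shard and the gather would use a collective such as
# torch.distributed.all_gather_object.
per_rank = []
for dp_rank in range(dp_size):
    shard = shard_prompts(prompts, dp_rank, dp_size)
    per_rank.append([(i, f"out({p})") for i, p in shard])

merged = merge_outputs(per_rank, len(prompts))
print(merged)
```

Carrying the original index with each prompt is what lets the outputs be restored to input order after the gather. Ranks within the same TP/PP/CP group would all need to receive the same shard.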

With coordinator, the coordinator manages load balancing. Users feed data on the coordinator (primary) rank and collect output on that rank. is_primary_rank returns true only on the coordinator rank, which is global rank 0. Online serving mode requires use_coordinator=True when DP/EP size is greater than 1.
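
The rank rules in the two modes can be restated as a small sketch. Both functions are hypothetical and written only for illustration. The ValueError and the reading of "DP/EP size greater than 1" as "either size greater than 1" are assumptions, not behavior confirmed by the PR.

```python
def is_primary_rank(use_coordinator: bool, global_rank: int) -> bool:
    """Direct mode: every rank is primary. Coordinator mode: only global rank 0."""
    if not use_coordinator:
        return True
    return global_rank == 0


def check_serving_config(use_coordinator: bool, dp_size: int, ep_size: int) -> None:
    """Reject online serving without a coordinator when DP or EP size exceeds 1."""
    # Assumption: the exception type and message are illustrative only.
    if max(dp_size, ep_size) > 1 and not use_coordinator:
        raise ValueError("online serving with DP/EP > 1 needs use_coordinator=True")


for rank in range(4):
    print(rank, is_primary_rank(False, rank), is_primary_rank(True, rank))

check_serving_config(use_coordinator=True, dp_size=4, ep_size=1)  # accepted
```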

Lifecycle methods (pause/unpause/suspend/resume) are only meaningful in coordinator mode. They raise RuntimeError in direct mode.
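
A guard of the following shape would produce the behavior described above. ToyLLM is a hypothetical class for illustration; the state labels are placeholders and say nothing about what pause and suspend actually do in the engine.

```python
class ToyLLM:
    """Hypothetical stand-in showing lifecycle methods gated on coordinator mode."""

    def __init__(self, use_coordinator: bool):
        self.use_coordinator = use_coordinator
        self.state = "running"  # placeholder label, not a real engine state

    def _require_coordinator(self, method: str) -> None:
        if not self.use_coordinator:
            raise RuntimeError(f"{method}() is only available with use_coordinator=True")

    def pause(self) -> None:
        self._require_coordinator("pause")
        self.state = "paused"

    def unpause(self) -> None:
        self._require_coordinator("unpause")
        self.state = "running"

    def suspend(self) -> None:
        self._require_coordinator("suspend")
        self.state = "suspended"

    def resume(self) -> None:
        self._require_coordinator("resume")
        self.state = "running"


coordinated = ToyLLM(use_coordinator=True)
coordinated.pause()
print(coordinated.state)

direct = ToyLLM(use_coordinator=False)
try:
    direct.pause()
except RuntimeError as error:
    print("direct mode:", error)
```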

Examples

Here are some common examples. For details, see examples/inference.

Offline Sync Generation With Coordinator

from megatron.inference import MegatronLLM, SamplingParams

llm = MegatronLLM(
    ...,
    use_coordinator=True,
)

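# In coordinator mode only the primary rank feeds prompts and collects outputs.
if llm.is_primary_rank:
    # Blocks until the final outputs for all prompts are available.
    outputs = llm.generate(prompts, SamplingParams(num_tokens_to_generate=128))
    for output in outputs:
        print(output.generated_text)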

llm.shutdown()

Concurrent Async Generation With Multiple Prompts

from megatron.inference import MegatronAsyncLLM, SamplingParams

llm = MegatronAsyncLLM(
    ...,
    use_coordinator=True,
    coordinator_host="10.0.0.1",
    coordinator_port=6000,
)

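# `await` must run inside a coroutine (for example one driven by asyncio.run);
# it is shown at top level here for brevity.
if llm.is_primary_rank:
    sampling_params = SamplingParams(num_tokens_to_generate=64)
    # A single generate() call takes the whole prompt list.
    results = await llm.generate(prompts, sampling_params)
    for result in results:
        print(result.generated_text)

await llm.shutdown()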
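
For readers less familiar with asyncio, the sketch below shows the general pattern for running several awaitable generation calls concurrently. fake_generate is a hypothetical stub that sleeps instead of calling a model, so this illustrates asyncio.gather itself rather than MegatronAsyncLLM.

```python
import asyncio


async def fake_generate(prompt: str, delay: float) -> str:
    """Hypothetical stub: waits for `delay` seconds instead of running a model."""
    await asyncio.sleep(delay)
    return f"{prompt} -> done"


async def main():
    requests = [("slow prompt", 0.02), ("fast prompt", 0.0)]
    # Both coroutines run concurrently; gather returns results in input order,
    # regardless of which one finishes first.
    return await asyncio.gather(*(fake_generate(p, d) for p, d in requests))


results = asyncio.run(main())
print(results)
```

Because gather preserves input order, results can be matched back to prompts by position.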

Programmatic OpenAI-Compatible Server

from megatron.inference import MegatronAsyncLLM, ServeConfig

llm = MegatronAsyncLLM(
    ...,
    use_coordinator=True,
    coordinator_host="10.0.0.1",  # Internal/routable host for coordinator ZMQ.
    coordinator_port=6000,
)

# All ranks enter llm.serve, but only the primary rank hosts the HTTP server.
# `blocking=True` (default) keeps serve() awaiting until the server stops.
await llm.serve(
    ServeConfig(
        host="0.0.0.0",  # HTTP bind host.
        port=5000,
    ),
)

# Users can send OpenAI-compatible requests to the primary rank's HTTP endpoint.
# For example, from another process:
#
# from openai import OpenAI
#
# client = OpenAI(api_key="EMPTY", base_url="http://<primary-host>:5000/v1")
# response = client.chat.completions.create(
#     model="megatron-gpt",
#     messages=[{"role": "user", "content": "Explain Megatron inference."}],
#     max_tokens=128,
#     temperature=0.8,
#     top_p=0.95,
#     extra_body={"top_k": 40},  # Sent to the server as top-level {"top_k": 40}.
# )
# print(response.choices[0].message.content)
#
# Equivalent raw HTTP request:
#
# curl http://<primary-host>:5000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d '{
#     "model": "megatron-gpt",
#     "messages": [{"role": "user", "content": "Explain Megatron inference."}],
#     "max_tokens": 128,
#     "temperature": 0.8,
#     "top_p": 0.95,
#     "top_k": 40
#   }'
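
The relationship between the two commented requests above (extra_body fields ending up at the top level of the JSON body) can be checked without a running server. This sketch uses only the Python standard library to build the request and does not send it. build_chat_request is a hypothetical helper, and localhost stands in for the elided primary host.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, extra_body=None, **params):
    """Build (but do not send) a POST request for the chat completions route."""
    payload = {"model": "megatron-gpt", "messages": messages, **params}
    # Mirror the client behavior noted above: extra_body keys become top-level keys.
    payload.update(extra_body or {})
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:5000/v1",  # assumption: placeholder for the primary host
    [{"role": "user", "content": "Explain Megatron inference."}],
    extra_body={"top_k": 40},
    max_tokens=128,
    temperature=0.8,
    top_p=0.95,
)
body = json.loads(request.data)
print(request.get_method(), request.full_url)
print(body)
```

Passing the request to urllib.request.urlopen would send it once a server is listening.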

PR review

The main files to review are the newly added examples in examples/inference and the high-level implementations in megatron/inference. The rest is mostly test coverage and documentation changes.

Contribution process

Pre-checks

I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code (see Typing guidelines)
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

Feel free to message or tag @mcore-oncall in a comment to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

When your PR is ready, click Ready for Review.
An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
- Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.
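
The path rule for Step 2 can be expressed as a small check. needs_final_review is a hypothetical helper written for illustration; the actual automation behind the label is not shown in this PR.

```python
from pathlib import PurePosixPath


def needs_final_review(changed_files):
    """Return True if any changed path lies under megatron/core."""
    return any(
        PurePosixPath(path).parts[:2] == ("megatron", "core")
        for path in changed_files
    )


print(needs_final_review(["megatron/core/inference/engine.py", "README.md"]))
print(needs_final_review(["megatron/inference/api.py", "examples/inference/run.py"]))
```

Comparing path components rather than string prefixes avoids matching a sibling such as megatron/core_utils.py by accident.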

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into the `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

copy-pr-bot · 2026-05-08T06:30:29Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-05-08T06:30:35Z

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

Add the oncall reviewer (optional reviewer)
Add required review teams based on your changes

See the contribution guide for more details.

YangFei1990 · 2026-05-08T06:32:02Z

/ok to test 5f651b7

YangFei1990 · 2026-05-08T19:41:18Z

/ok to test 963f663

ko3n1g

This will require changes on the NeMo side. Can we work with @oyilmaz-nvidia before the merge?

ko3n1g

Thanks for sharing more background in our offline discussion. All good now.

ko3n1g · 2026-05-28T21:48:27Z

/ok to test 27b920e

svcnvidia-nemo-ci · 2026-05-28T23:27:02Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26608290361

YangFei1990 · 2026-05-29T01:20:48Z

/ok to test 4e3269d

YangFei1990 · 2026-05-29T02:29:02Z

/ok to test 87ba21c

svcnvidia-nemo-ci · 2026-05-29T04:18:20Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26617664504

YangFei1990 and others added 14 commits May 6, 2026 20:03

Every commit in this group carries the trailer Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>.

feat(inference): add ServeConfig and _EventLoopManager

8debc5e

feat(inference): add coordinator runtime and _MegatronLLMBase

4ca3c95

feat(inference): add MegatronAsyncLLM, slim base class

d7e68f1

feat(inference): add MegatronLLM

c03ab48

refactor(inference): drop model_name fields from ServeConfig

5c38044

feat(inference): add MegatronAsyncLLM.serve(), drop ServeConfig.role

6331cc0

feat(inference): add offline_inference example, fix high-level API bugs

b6448e8

feat(inference): add launch_inference_server example, fix daemon-thread CUDA device

44ec5a9

fix(tests): repoint inference recipes and cuda_graphs.sh to examples/inference/legacy/

0d8ae8b

test(inference): add unit tests for the high-level inference API

589626d

test(inference): add functional tests for offline_inference 4 modes with reused legacy goldens

2eb2e81

test(inference): add HTTP smoke test for launch_inference_server with bespoke driver

739f3b0

docs(inference): add README for the high-level inference API

8ff9a26

docs(inference): rewrite examples README and remove stale llama_mistral inference sections

9e7fae3

YangFei1990 requested a review from a team as a code owner May 8, 2026 06:30

svcnvidia-nemo-ci marked this pull request as draft May 8, 2026 06:30

Merge branch 'main' into inference_apis

5f651b7

YangFei1990 marked this pull request as ready for review May 8, 2026 06:31

svcnvidia-nemo-ci requested a review from a team May 8, 2026 06:31

svcnvidia-nemo-ci added the complexity: high label May 8, 2026

YangFei1990 and others added 3 commits May 8, 2026 11:48

ci(inference): satisfy linting, copyright-check, and build-docs

34494c5

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge branch 'inference_apis' of https://github.com/YangFei1990/Megatron-LM into inference_apis

8caa84d

Merge branch 'main' into inference_apis

963f663

copy-pr-bot Bot temporarily deployed to test May 8, 2026 19:42 Inactive

dimapihtar requested a review from chtruong814 May 12, 2026 13:57

dimapihtar assigned chtruong814 May 12, 2026

copy-pr-bot Bot temporarily deployed to public May 28, 2026 17:31 Inactive

copy-pr-bot Bot temporarily deployed to public May 28, 2026 17:39 Inactive

ko3n1g approved these changes May 28, 2026

svcnvidia-nemo-ci added the Approved All necessary approvals have been made label May 28, 2026

YangFei1990 enabled auto-merge May 28, 2026 18:01

YangFei1990 added 2 commits May 28, 2026 11:04

Merge branch 'main' into inference_apis

d76fca4

Merge branch 'main' into inference_apis

27b920e

ko3n1g requested changes May 28, 2026

shanmugamr1992 requested a review from ko3n1g May 28, 2026 21:07

ko3n1g approved these changes May 28, 2026

copy-pr-bot Bot temporarily deployed to public May 28, 2026 21:48 Inactive

copy-pr-bot Bot temporarily deployed to test May 28, 2026 21:49 Inactive

copy-pr-bot Bot temporarily deployed to public May 28, 2026 21:52 Inactive

copy-pr-bot Bot temporarily deployed to public May 28, 2026 22:00 Inactive

YangFei1990 added this pull request to the merge queue May 28, 2026

YangFei1990 and others added 2 commits May 28, 2026 18:19

increase test coverage

0298310

Merge branch 'main' into inference_apis

4e3269d

YangFei1990 added 3 commits May 28, 2026 19:23

fix import

42a9ead

Revert "increase test coverage"

f85df57

This reverts commit 0298310.

Merge branch 'inference_apis' of https://github.com/YangFei1990/Megatron-LM into inference_apis

87ba21c

YangFei1990 mentioned this pull request May 29, 2026

Fix test failures for new inference APIs #5068 (Open, 6 tasks)

Simple and stable Inference APIs #4697
YangFei1990 merged 49 commits into NVIDIA:main from YangFei1990:inference_apis

12 participants
