
Conversation

@yhtang (Contributor) commented Nov 11, 2025

No description provided.

@yhtang yhtang marked this pull request as ready for review November 12, 2025 05:52
@yhtang yhtang requested a review from Copilot November 12, 2025 05:52
Copilot finished reviewing on behalf of yhtang November 12, 2025 05:54
Copilot AI left a comment

Pull Request Overview

This PR introduces a JAX-vLLM rollout offloading bridge that enables efficient coupling between JAX training and vLLM inference for reinforcement learning post-training workloads. The bridge offloads rollout generation to vLLM while keeping training in JAX, using NCCL for direct GPU-to-GPU weight transfers.

  • Implements a lightweight RPC gateway for control plane coordination between trainer and rollout engine
  • Provides NCCL-based data plane for fast GPU-to-GPU weight streaming with tensor resharding
  • Supports multiple transfer modes (fused, unfused, grouped) and flexible parallelism configurations (FSDP/TP)
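To make the resharding step above concrete, here is a minimal, hypothetical sketch (not the bridge's actual API) of how a weight tensor replicated on the trainer side might be split into per-rank tensor-parallel shards before being streamed over NCCL; the helper name `shard_for_tp` and the toy shapes are illustrative assumptions.

```python
import numpy as np


def shard_for_tp(weight: np.ndarray, tp_size: int, axis: int = 0) -> list[np.ndarray]:
    """Split a full weight tensor into equal tensor-parallel shards along `axis`.

    Hypothetical helper for illustration only; a real bridge would perform
    this on-GPU so each rollout rank receives only its own shard.
    """
    assert weight.shape[axis] % tp_size == 0, "dimension must divide evenly"
    return np.split(weight, tp_size, axis=axis)


# A toy projection weight of shape (out_features, in_features).
fused = np.arange(16, dtype=np.float32).reshape(8, 2)

# Column-parallel layers shard the output dimension across TP ranks.
shards = shard_for_tp(fused, tp_size=4, axis=0)
print([s.shape for s in shards])  # each of the 4 ranks gets a (2, 2) slice
```

The "fused" vs. "unfused" transfer modes mentioned above would differ in whether several such tensors are concatenated into one buffer before the NCCL send or streamed individually.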

Reviewed Changes

Copilot reviewed 59 out of 59 changed files in this pull request and generated 35 comments.

Summary per file:
  • setup.py: Package setup with protobuf build hooks and dependencies
  • pyproject.toml: Build system configuration with ruff linting rules
  • pep517_backend.py: Custom PEP 517 build backend for protobuf compilation
  • jax_inference_offloading/vllm/: vLLM wrapper and worker extension for weight updates
  • jax_inference_offloading/transport/: NCCL transport implementations (star topology, tensor/model transports)
  • jax_inference_offloading/controller/: Gateway server and client implementations for trainer/rollout coordination
  • jax_inference_offloading/models/: Model parameter mappings for Llama3 and Gemma families
  • jax_inference_offloading/jax/: Offloading bridge API for JAX integration
  • jax_inference_offloading/tunix/: Tunix-specific rollout and model loading utilities
  • examples/: Example scripts for single-node and multi-node deployments
Comments suppressed due to low confidence (1)

jax-inference-offloading/jax_inference_offloading/controller/rollout_client.py:139

  • This assignment to 'shutdown' is unnecessary as it is redefined before this value is used.
  def shutdown(self):
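For context, the pattern Copilot flagged is a dead store: a value bound to a name that is unconditionally rebound before it is ever read. A hypothetical minimal reproduction (not the actual rollout_client.py code) looks like this:

```python
class RolloutClientSketch:
    """Hypothetical minimal reproduction of the flagged dead-store pattern."""

    # An attribute named `shutdown` assigned at class-definition time...
    shutdown = None

    # ...is immediately rebound by the method definition below, so the
    # assignment above is never observable and can simply be removed.
    def shutdown(self):
        return "stopped"


client = RolloutClientSketch()
print(client.shutdown())  # prints "stopped"; the earlier None is never used
```

Removing the redundant assignment leaves behavior unchanged, which is why such findings are usually safe cleanups.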


@mjsML mjsML removed request for mjsML and nouiz November 12, 2025 09:50
@mjsML (Member) commented Nov 12, 2025

@jreiffers PTAL

@mjsML mjsML requested review from jreiffers and olupton November 12, 2025 09:55
yhtang and others added 3 commits November 12, 2025 16:57
@yhtang (Contributor, Author) commented Nov 13, 2025

The new CI workflow for the jax-inference-offloading subfolder is currently failing because the base container image does not yet include the changes (git-clone.sh) from this PR. Given that cyclic dependency, I suggest we treat this workflow as a pilot and proceed with merging as-is, as long as the main CI passes, then iterate to fix the workflow and address any issues once the updated base container is available.

@Steboss (Contributor) commented Nov 14, 2025

> The new CI workflow for the jax-inference-offloading subfolder is currently failing because the base container image does not yet include the changes (git-clone.sh) from this PR. Given that cyclic dependency, I suggest we treat this workflow as a pilot and proceed with merging as-is, as long as the main CI passes, then iterate to fix the workflow and address any issues once the updated base container is available.

@yhtang Isn't JIO part of the main CI? And I can't understand why the CI doesn't pick up your changes to git-clone.sh; the base container should already include them. Indeed, I can see your new sparse option in the error: https://github.com/NVIDIA/JAX-Toolbox/actions/runs/19322673777/job/55267036826?pr=1775#step:4:1184
Maybe I am wrong.

@Steboss (Contributor) commented Nov 14, 2025

@yhtang, another question:
I can see there's a lot of code. Does it come from a specific library or GitHub repo? How likely is it that this code will change (many times) in the next few months? Are we introducing any technical debt? Might it be hard to keep the code and its changes under control?

@aybchan (Member) left a comment

@yhtang Could you clarify what you would like the CI here to achieve? Is it to build and publish an image with JAX inference offloading based on nightly JAX (i.e. similar to what we do with maxtext)? Is there also an intention to run the examples you've added in the CI with this image, or some other way to test it?

aybchan previously approved these changes Nov 14, 2025
@yhtang (Contributor, Author) commented Nov 14, 2025

> The new CI workflow for the jax-inference-offloading subfolder is currently failing because the base container image does not yet include the changes (git-clone.sh) from this PR. Given that cyclic dependency, I suggest we treat this workflow as a pilot and proceed with merging as-is, as long as the main CI passes, then iterate to fix the workflow and address any issues once the updated base container is available.
>
> @yhtang JIO is not part of the main ci? and I can't understand why the CI doesn't pick up your changes to git-clone.sh. the base container should start working and having it. I can see indeed in the error your new sparse option https://github.com/NVIDIA/JAX-Toolbox/actions/runs/19322673777/job/55267036826?pr=1775#step:4:1184 maybe I am wrong

I think the issue is that this workflow is still using the vanilla CUDA DL base image rather than the JAX-Toolbox base image built in our workflow. As a result, it doesn’t see the updated git-clone.sh even though it’s included in this PR. Nevertheless, I've reverted git-clone.sh and will add it back when we merge the offloading CI workflow with the main workflow.

My plan is to introduce two Dockerfiles: one for pure OSS installation and another based on the JAX-Toolbox base image and JAX builds. I’d handle the second Dockerfile and the corresponding CI wiring in a follow-up PR to keep the scope of this change focused.

@yhtang (Contributor, Author) commented Nov 15, 2025

The standalone JIO CI workflow is now passing, and the remaining main CI failures (e.g. https://github.com/NVIDIA/JAX-Toolbox/actions/runs/19385835327/job/55473082584?pr=1775#step:4:2129) match those on the main branch (e.g. https://github.com/NVIDIA/JAX-Toolbox/actions/runs/19386305470/job/55474369243#step:4:2407).

Per our earlier agreement that this PR is scoped to changes within the jax-inference-offloading folder, I’ll go ahead and merge this PR and keep subsequent updates similarly contained until we work together to integrate the JIO CI into the main CI.

@yhtang yhtang merged commit 03c29c6 into main Nov 15, 2025
70 of 79 checks passed
@yhtang yhtang deleted the yhtang/jax-inference-offloading branch November 15, 2025 08:30