[4/5] feat: CUDA IPC for teacher logits transfer by avenkateshha · Pull Request #2350 · NVIDIA-NeMo/RL

avenkateshha · 2026-04-27T03:56:14Z

Adds the IPC plumbing that lets a teacher policy worker hand its logits to the student worker without going through Ray's serialization path — required for cross-tokenizer distillation where teacher full-vocab logits are too big to pickle per step.

- nemo_rl/distributed/ipc_utils.py: get_handle_from_tensor and rebuild_cuda_tensor_from_ipc helpers wrapping CUDA IPC handles.
- nemo_rl/models/automodel/train.py: two new post-processors: XTokenTeacherIPCExportPostProcessor (teacher side, allocates a pre-sized CUDA buffer and exports the IPC handle per microbatch) and XTokenTeacherIPCLossPostProcessor (student side, rebuilds the tensor from the handle and feeds it to the loss fn). Existing post-processors are untouched.

What does this PR do?

Adds CUDA-IPC-based teacher → student logit transfer so off-policy distillation can pass full-vocab teacher logits between Ray workers without serialization overhead.
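To give a feel for why per-step serialization is a problem, here is a back-of-the-envelope size calculation for a dense bf16 logits tensor. The shapes are hypothetical and not taken from this PR; `logits_bytes` is an illustrative helper, not part of nemo_rl.

```python
# Hypothetical shapes for illustration only; none of these numbers come from the PR.
BYTES_PER_BF16 = 2


def logits_bytes(batch: int, seq_len: int, vocab: int,
                 bytes_per_elem: int = BYTES_PER_BF16) -> int:
    """Size in bytes of a dense [batch, seq_len, vocab] logits tensor."""
    return batch * seq_len * vocab * bytes_per_elem


size = logits_bytes(batch=4, seq_len=4096, vocab=128_000)
print(f"{size / 2**30:.2f} GiB per microbatch")  # prints "3.91 GiB per microbatch"
```

At sizes like this, pickling the tensor every step would mean copying gigabytes through host memory, whereas an IPC handle is a small fixed-size description of memory that stays on the device.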

Issues

None linked yet.

Usage

import torch

from nemo_rl.distributed.ipc_utils import get_handle_from_tensor, rebuild_cuda_tensor_from_ipc

# Teacher worker (sender); B, T, V are the batch, sequence, and vocab sizes:
buf = torch.empty(B, T, V, dtype=torch.bfloat16, device="cuda")  # pre-sized buffer
buf.copy_(teacher_logits)
handle = get_handle_from_tensor(buf)  # pickle-safe; keep buf alive while the student uses it

# Student worker (receiver, given handle + device id):
teacher_logits = rebuild_cuda_tensor_from_ipc(handle, device_id=local_rank)

The two XTokenTeacherIPC*PostProcessor classes wire this into the automodel forward/backward path automatically.
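The handle-passing pattern can be tried without a GPU using a CPU analogy. The sketch below swaps CUDA IPC for the standard library's `multiprocessing.shared_memory`; it is not the PR's implementation, and `export_handle` and `rebuild_from_handle` are hypothetical names. It shows the same shape of protocol: allocate a buffer once, copy each microbatch into it, and send only a small pickle-safe handle to the receiver.

```python
import pickle
from multiprocessing import shared_memory


def export_handle(shm: shared_memory.SharedMemory, nbytes: int) -> dict:
    """Return a small, pickle-safe description of the shared buffer."""
    return {"name": shm.name, "nbytes": nbytes}


def rebuild_from_handle(handle: dict) -> bytes:
    """Attach to the buffer by name and read the payload back."""
    shm = shared_memory.SharedMemory(name=handle["name"])
    try:
        return bytes(shm.buf[: handle["nbytes"]])
    finally:
        shm.close()  # detach only; the sender still owns the buffer


# "Teacher" side: allocate once, then reuse the buffer for every microbatch.
buf = shared_memory.SharedMemory(create=True, size=1024)
try:
    received = []
    for microbatch in (b"logits-0", b"logits-1"):
        buf.buf[: len(microbatch)] = microbatch  # copy into the pre-sized buffer
        wire = pickle.dumps(export_handle(buf, len(microbatch)))  # only this is sent
        # "Student" side (same process here for brevity): rebuild from the handle.
        received.append(rebuild_from_handle(pickle.loads(wire)))
finally:
    buf.close()
    buf.unlink()  # the sender frees the buffer once the receiver is done

print(received)  # [b'logits-0', b'logits-1']
```

As in the Usage snippet above, the payload never goes through pickle. The sender has to keep the buffer alive until the receiver has finished reading, which is also a requirement when sharing CUDA tensors between processes in PyTorch.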

Before your PR is "Ready for review"

Read Contributor guidelines
No new tests in this PR. IPC paths are exercised by PR 5 (multi-teacher requires use_ipc=true).
Static py_compile confirmed clean.
No docs entry — added alongside PR 5.

Additional Information

Draft. Stacked on PR 3 — #2349. IPC infrastructure is independent of the loss/collator and could ship in any order, but is sequenced here so PR 5 can rely on it.

Full chain:

TokenAligner + projection utilities — [1/5] feat: add TokenAligner and cross-tokenizer projection utilities #2347
Collator + Arrow dataset + eval datasets — [2/5] feat: cross-tokenizer collator, Arrow dataset, and eval datasets #2348
CT distillation loss + multi-teacher aggregator — [3/5] feat: cross-tokenizer distillation loss and multi-teacher aggregator #2349
(this PR) CUDA IPC for teacher logits transfer
Algorithm + worker integration — [5/5] feat: off-policy distillation algorithm and worker integration #2351

copy-pr-bot · 2026-04-27T03:56:19Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Foundational library code for cross-tokenizer distillation. No algorithm or training-loop integration yet; those follow in subsequent PRs.

- nemo_rl/algorithms/x_token/tokenalign.py: TokenAligner(nn.Module) with Numba-accelerated DP alignment, projection-matrix loading (dense and sparse COO), and the project_token_likelihoods_instance forward path used by the cross-tokenizer loss.
- nemo_rl/algorithms/x_token/__init__.py: package init.
- nemo_rl/utils/x_token/{minimal_projection_generator,minimal_projection_via_multitoken,reapply_exact_map,sort_and_cut_projection_matrix}.py: standalone CLI scripts (argparse-driven, __main__ entrypoints) for one-time projection-matrix preparation. Not on the training import path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Adithyakrishna Hanasoge <avenkateshha@nvidia.com>

Data-layer plumbing for cross-tokenizer off-policy distillation, plus in-training eval datasets. Builds on the TokenAligner package from the prior PR.

- nemo_rl/data/cross_tokenizer_collate.py: CrossTokenizerCollator and TeacherCTSpec. Runs in StatefulDataLoader worker processes and does per-teacher tokenize + DP alignment up front, so the train loop only consumes pre-built per_teacher_ct_data. Lazy-imports TokenAligner so workers that don't need cross-tokenizer never touch x_token.
- nemo_rl/data/__init__.py: add NotRequired prefetch_factor to DataConfig.
- nemo_rl/data/datasets/response_datasets/arrow_text_dataset.py: ArrowTextDataset with lazy packing, registered as "arrow_text" in DATASET_REGISTRY.
- nemo_rl/data/datasets/eval_datasets/{humaneval_plus,mbpp_plus,mmlu}.py and registry entries: in-training eval datasets. mmlu.py adds an optional num_few_shot argument with a static _build_few_shot_prefixes helper; default of 0 preserves existing behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Adithyakrishna Hanasoge <avenkateshha@nvidia.com>

Adds the loss-fn layer for cross-tokenizer distillation. Builds on the TokenAligner package (PR 1).

- CrossTokenizerDistillationLossFn: per-token KL/CE loss over 1:1 aligned positions, with optional gold-loss path. Holds a reference to a TokenAligner; teacher data (input_ids, aligned_pairs, optional chunked COO masks) is set per-step via set_cross_tokenizer_data.
- CrossTokenizerDistillationLossConfig and CrossTokenizerDistillationLossDataDict TypedDicts.
- MultiTeacherLossAggregator: wraps a list of optional CrossTokenizerDistillationLossFn instances with per-teacher weights. N=1 is a degenerate case used by the unified single-/multi-teacher worker path; the algorithm-layer multi-teacher orchestration comes in a later PR.
- _scatter_chunk_mask_from_coo helper for the chunked-CE path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Adithyakrishna Hanasoge <avenkateshha@nvidia.com>
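The "list of optional loss functions with per-teacher weights" idea can be sketched in a few lines. This is only an illustration under stated assumptions: it assumes a plain weighted sum that skips empty teacher slots (the commit message does not say whether weights are normalized), and `LossFn` and `aggregate_losses` are hypothetical names, not the MultiTeacherLossAggregator API.

```python
from typing import Callable, Optional, Sequence

# Hypothetical stand-in for a per-teacher loss: maps one input to a scalar loss.
LossFn = Callable[[float], float]


def aggregate_losses(loss_fns: Sequence[Optional[LossFn]],
                     weights: Sequence[float],
                     x: float) -> float:
    """Weighted sum over teachers that have a loss function; None slots are skipped."""
    if len(loss_fns) != len(weights):
        raise ValueError("need exactly one weight per teacher slot")
    return sum(w * fn(x) for fn, w in zip(loss_fns, weights) if fn is not None)


# Three teacher slots, the middle one empty: 0.5 * 4.0 + 0.25 * 8.0 = 4.0
total = aggregate_losses([lambda x: x, None, lambda x: 2 * x], [0.5, 0.3, 0.25], 4.0)

# N=1 degenerates to the single teacher's loss scaled by its weight.
single = aggregate_losses([lambda x: x + 1.0], [1.0], 2.0)
```

With one teacher and a weight of 1.0 the aggregate equals that teacher's loss, which is why a single code path can serve both the single- and multi-teacher cases.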

Adds the IPC plumbing that lets a teacher policy worker hand its logits to the student worker without going through Ray's serialization path. This is required for cross-tokenizer distillation, where teacher full-vocab logits are too big to pickle per step.

- nemo_rl/distributed/ipc_utils.py: get_handle_from_tensor and rebuild_cuda_tensor_from_ipc helpers wrapping CUDA IPC handles.
- nemo_rl/models/automodel/train.py: two new post-processors: XTokenTeacherIPCExportPostProcessor (teacher side, allocates a pre-sized CUDA buffer and exports the IPC handle per microbatch) and XTokenTeacherIPCLossPostProcessor (student side, rebuilds the tensor from the handle and feeds it to the loss fn). Existing post-processors are untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Adithyakrishna Hanasoge <avenkateshha@nvidia.com>

avenkateshha mentioned this pull request Apr 27, 2026

[1/5] feat: add TokenAligner and cross-tokenizer projection utilities #2347 (Closed, 4 tasks)

This was referenced Apr 27, 2026

[2/5] feat: cross-tokenizer collator, Arrow dataset, and eval datasets #2348 (Closed)
[3/5] feat: cross-tokenizer distillation loss and multi-teacher aggregator #2349 (Closed)
[5/5] feat: off-policy distillation algorithm and worker integration #2351 (Closed)

github-actions Bot added the community-request label Apr 27, 2026

avenkateshha changed the title ~~feat: CUDA IPC for teacher logits transfer~~ [4/5] feat: CUDA IPC for teacher logits transfer Apr 27, 2026

avenkateshha force-pushed the avenkateshha/xtoken-off-policy-distillation/04-ipc branch from a772094 to fbe9827 Compare April 27, 2026 10:21

avenkateshha and others added 3 commits April 27, 2026 03:25

avenkateshha force-pushed the avenkateshha/xtoken-off-policy-distillation/04-ipc branch from fbe9827 to a146953 Compare April 27, 2026 10:25

avenkateshha closed this May 16, 2026

avenkateshha deleted the avenkateshha/xtoken-off-policy-distillation/04-ipc branch May 16, 2026 01:47

avenkateshha wants to merge 4 commits into NVIDIA-NeMo:main from avenkateshha:avenkateshha/xtoken-off-policy-distillation/04-ipc

2 participants
