[4/5] feat: CUDA IPC for teacher logits transfer#2350

Closed
avenkateshha wants to merge 4 commits into
NVIDIA-NeMo:mainfrom
avenkateshha:avenkateshha/xtoken-off-policy-distillation/04-ipc

Conversation

@avenkateshha

@avenkateshha avenkateshha commented Apr 27, 2026

Adds the IPC plumbing that lets a teacher policy worker hand its logits to the student worker without going through Ray's serialization path — required for cross-tokenizer distillation where teacher full-vocab logits are too big to pickle per step.

  • nemo_rl/distributed/ipc_utils.py: get_handle_from_tensor and rebuild_cuda_tensor_from_ipc helpers wrapping CUDA IPC handles.
  • nemo_rl/models/automodel/train.py: two new post-processors — XTokenTeacherIPCExportPostProcessor (teacher side, allocates a pre-sized CUDA buffer and exports the IPC handle per microbatch) and XTokenTeacherIPCLossPostProcessor (student side, rebuilds the tensor from the handle and feeds it to the loss fn). Existing post-processors are untouched.

What does this PR do?

Adds CUDA-IPC-based teacher → student logit transfer so off-policy distillation can pass full-vocab teacher logits between Ray workers without serialization overhead.
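To give a rough sense of why pickling full-vocab logits every step is impractical, the arithmetic below estimates the size of one dense logits tensor. The shape used (4 x 4096 x 128,000 in bfloat16) is a hypothetical example, not a configuration taken from this PR, and `logits_bytes` is an illustrative helper that does not exist in the codebase.

```python
def logits_bytes(batch: int, seq_len: int, vocab: int, bytes_per_elem: int = 2) -> int:
    """Size in bytes of a dense [batch, seq_len, vocab] tensor.

    bytes_per_elem defaults to 2, which matches bfloat16.
    """
    return batch * seq_len * vocab * bytes_per_elem


# Hypothetical microbatch: 4 sequences, 4096 tokens each, 128,000-entry vocabulary.
size = logits_bytes(batch=4, seq_len=4096, vocab=128_000)
print(f"{size / 2**30:.2f} GiB per microbatch")  # prints "3.91 GiB per microbatch"
```

At that scale, serializing and copying the tensor for every microbatch would dominate the step, whereas an IPC handle is a small fixed-size descriptor regardless of the tensor's shape.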

Issues

None linked yet.

Usage

import torch

from nemo_rl.distributed.ipc_utils import get_handle_from_tensor, rebuild_cuda_tensor_from_ipc

# Teacher worker (sender). B, T, V are batch size, sequence length, and vocab size.
buf = torch.empty(B, T, V, dtype=torch.bfloat16, device="cuda")
buf.copy_(teacher_logits)  # copy logits into the pre-sized buffer
handle = get_handle_from_tensor(buf)  # pickle-safe

# Student worker (receiver, given handle + device id):
teacher_logits = rebuild_cuda_tensor_from_ipc(handle, device_id=local_rank)

The two XTokenTeacherIPC*PostProcessor classes wire this into the automodel forward/backward path automatically.
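The handle-passing pattern in the usage snippet above can be tried without a GPU. The sketch below is a CPU-side analogy only: it swaps CUDA IPC for Python's standard `multiprocessing.shared_memory` and is not how `ipc_utils.py` is implemented. The names `export_handle` and `rebuild_from_handle` are hypothetical. The point it illustrates is that only a small descriptor gets pickled, while the payload stays in memory that both sides can map.

```python
import pickle
import struct
from multiprocessing import shared_memory


def export_handle(values: list[float]) -> tuple[shared_memory.SharedMemory, dict]:
    """Sender side: copy values into a shared buffer, return (owner, handle)."""
    shm = shared_memory.SharedMemory(create=True, size=8 * len(values))
    struct.pack_into(f"{len(values)}d", shm.buf, 0, *values)
    # The handle is a tiny picklable descriptor, not the data itself.
    return shm, {"name": shm.name, "count": len(values)}


def rebuild_from_handle(handle: dict) -> list[float]:
    """Receiver side: attach to the shared buffer by name and read the values."""
    shm = shared_memory.SharedMemory(name=handle["name"])
    try:
        return list(struct.unpack_from(f"{handle['count']}d", shm.buf, 0))
    finally:
        shm.close()


owner, handle = export_handle([0.5, -1.25, 3.0])
wire = pickle.dumps(handle)  # only this small blob would cross the process boundary
restored = rebuild_from_handle(pickle.loads(wire))

# The sender owns the buffer and must keep it alive until the receiver is done.
owner.close()
owner.unlink()
```

Unlike this analogy, which copies the values out into a list, a rebuilt CUDA tensor would typically alias the sender's device memory, so the sender's buffer has to outlive every use on the receiving side.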

Before your PR is "Ready for review"

  • Read Contributor guidelines
  • No new tests in this PR. IPC paths are exercised by PR 5 (multi-teacher requires use_ipc=true).
  • Static py_compile confirmed clean.
  • No docs entry — added alongside PR 5.

Additional Information

Draft. Stacked on PR 3 — #2349. IPC infrastructure is independent of the loss/collator and could ship in any order, but is sequenced here so PR 5 can rely on it.

Full chain:

  1. TokenAligner + projection utilities — [1/5] feat: add TokenAligner and cross-tokenizer projection utilities #2347
  2. Collator + Arrow dataset + eval datasets — [2/5] feat: cross-tokenizer collator, Arrow dataset, and eval datasets #2348
  3. CT distillation loss + multi-teacher aggregator — [3/5] feat: cross-tokenizer distillation loss and multi-teacher aggregator #2349
  4. (this PR) CUDA IPC for teacher logits transfer
  5. Algorithm + worker integration — [5/5] feat: off-policy distillation algorithm and worker integration #2351

@copy-pr-bot

copy-pr-bot Bot commented Apr 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Foundational library code for cross-tokenizer distillation. No algorithm
or training-loop integration yet — those follow in subsequent PRs.

- nemo_rl/algorithms/x_token/tokenalign.py: TokenAligner(nn.Module) with
  Numba-accelerated DP alignment, projection-matrix loading
  (dense and sparse COO), and the project_token_likelihoods_instance
  forward path used by the cross-tokenizer loss.
- nemo_rl/algorithms/x_token/__init__.py: package init.
- nemo_rl/utils/x_token/{minimal_projection_generator,
  minimal_projection_via_multitoken,reapply_exact_map,
  sort_and_cut_projection_matrix}.py: standalone CLI scripts
  (argparse-driven, __main__ entrypoints) for one-time projection-matrix
  preparation. Not on the training import path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Adithyakrishna Hanasoge <avenkateshha@nvidia.com>
@avenkateshha avenkateshha force-pushed the avenkateshha/xtoken-off-policy-distillation/04-ipc branch from a772094 to fbe9827 Compare April 27, 2026 10:21
avenkateshha and others added 3 commits April 27, 2026 03:25
Data-layer plumbing for cross-tokenizer off-policy distillation, plus
in-training eval datasets. Builds on the TokenAligner package from the
prior PR.

- nemo_rl/data/cross_tokenizer_collate.py: CrossTokenizerCollator and
  TeacherCTSpec. Runs in StatefulDataLoader worker processes — does
  per-teacher tokenize + DP alignment up front so the train loop only
  consumes pre-built per_teacher_ct_data. Lazy-imports TokenAligner so
  workers that don't need cross-tokenizer never touch x_token.
- nemo_rl/data/__init__.py: add NotRequired prefetch_factor to DataConfig.
- nemo_rl/data/datasets/response_datasets/arrow_text_dataset.py:
  ArrowTextDataset with lazy packing, registered as "arrow_text" in
  DATASET_REGISTRY.
- nemo_rl/data/datasets/eval_datasets/{humaneval_plus,mbpp_plus,mmlu}.py
  and registry entries: in-training eval datasets. mmlu.py adds an
  optional num_few_shot argument with a static _build_few_shot_prefixes
  helper; default of 0 preserves existing behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Adithyakrishna Hanasoge <avenkateshha@nvidia.com>
Adds the loss-fn layer for cross-tokenizer distillation. Builds on the
TokenAligner package (PR 1).

- CrossTokenizerDistillationLossFn: per-token KL/CE loss over 1:1 aligned
  positions, with optional gold-loss path. Holds a reference to a
  TokenAligner; teacher data (input_ids, aligned_pairs, optional chunked
  COO masks) is set per-step via set_cross_tokenizer_data.
- CrossTokenizerDistillationLossConfig and
  CrossTokenizerDistillationLossDataDict TypedDicts.
- MultiTeacherLossAggregator: wraps a list of optional
  CrossTokenizerDistillationLossFn instances with per-teacher weights.
  N=1 is a degenerate case used by the unified single-/multi-teacher
  worker path; the algorithm-layer multi-teacher orchestration comes in
  a later PR.
- _scatter_chunk_mask_from_coo helper for the chunked-CE path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Adithyakrishna Hanasoge <avenkateshha@nvidia.com>
Adds the IPC plumbing that lets a teacher policy worker hand its logits
to the student worker without going through Ray's serialization path —
required for cross-tokenizer distillation where teacher full-vocab logits
are too big to pickle per step.

- nemo_rl/distributed/ipc_utils.py: get_handle_from_tensor and
  rebuild_cuda_tensor_from_ipc helpers wrapping CUDA IPC handles.
- nemo_rl/models/automodel/train.py: two new post-processors —
  XTokenTeacherIPCExportPostProcessor (teacher side, allocates a
  pre-sized CUDA buffer and exports the IPC handle per microbatch) and
  XTokenTeacherIPCLossPostProcessor (student side, rebuilds the tensor
  from the handle and feeds it to the loss fn). Existing post-processors
  are untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Adithyakrishna Hanasoge <avenkateshha@nvidia.com>
@avenkateshha avenkateshha force-pushed the avenkateshha/xtoken-off-policy-distillation/04-ipc branch from fbe9827 to a146953 Compare April 27, 2026 10:25
@avenkateshha avenkateshha deleted the avenkateshha/xtoken-off-policy-distillation/04-ipc branch May 16, 2026 01:47
2 participants