feat: TP-aware KDLoss with distributed softmax and T² scaling #1499
Merged
Conversation
Add tensor-parallel support to KDLoss via two new module-level helpers:

- _infer_tp_group_from_dtensor: extracts the TP ProcessGroup from a vocab-sharded DTensor logit, avoiding an explicit tp_group argument in most cases.
- _kl_forward_tp: computes per-token KL with a numerically stable global softmax/log-softmax built on all_reduce, keeping logits on their local shards so the full vocabulary is never gathered.

KDLoss.forward gains a tp_group parameter (default None, backward-compatible) and auto-detects a TP group from DTensor student_logits. T² loss scaling (Hinton et al., 2015) is applied when temperature != 1 so that gradient magnitudes stay independent of the chosen temperature.

Tests are extended with single-process gloo-backed fixtures that verify the TP path matches the non-TP path at world_size=1, plus dedicated tests for T² scaling and _infer_tp_group_from_dtensor.

Signed-off-by: Sepehr Sameni <ssameni@nvidia.com>
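A minimal sketch of how the two helpers could look, assuming standard PyTorch DTensor and ProcessGroup APIs. The function names come from this PR; the bodies below are illustrative, not the merged implementation:

```python
import torch
import torch.distributed as dist
# Autograd-aware all_reduce, so gradients flow back to the local shards.
from torch.distributed.nn.functional import all_reduce as diff_all_reduce
from torch.distributed.tensor import DTensor, Shard


def _infer_tp_group_from_dtensor(logits):
    """Return the TP ProcessGroup if `logits` is a DTensor sharded on the
    vocab (last) dim, otherwise None."""
    if not isinstance(logits, DTensor):
        return None
    for mesh_dim, placement in enumerate(logits.placements):
        if isinstance(placement, Shard) and placement.dim == logits.ndim - 1:
            # The mesh dimension that shards the vocab axis is the TP group.
            return logits.device_mesh.get_group(mesh_dim)
    return None


def _kl_forward_tp(student_logits, teacher_logits, tp_group, temperature=1.0):
    """Per-token KL(teacher || student) over a vocab-sharded last dim.
    Inputs are the local vocab shards (e.g. dtensor.to_local()); the full
    vocabulary is never materialized on any rank."""

    def dist_log_softmax(x):
        # Numerically stable log-softmax whose max and sum-exp are
        # reduced across the TP group instead of taken locally. The max
        # is only a stability shift, so it is safe to detach it.
        m = x.max(dim=-1, keepdim=True).values.detach()
        dist.all_reduce(m, op=dist.ReduceOp.MAX, group=tp_group)
        z = x - m
        s = diff_all_reduce(z.exp().sum(dim=-1, keepdim=True), group=tp_group)
        return z - s.log()

    t_logprob = dist_log_softmax(teacher_logits / temperature)
    s_logprob = dist_log_softmax(student_logits / temperature)
    # Each rank sums p * (log p - log q) over its vocab shard; the final
    # all_reduce completes the sum over the full vocabulary.
    kl_local = (t_logprob.exp() * (t_logprob - s_logprob)).sum(dim=-1)
    return diff_all_reduce(kl_local, group=tp_group)
```

With this shape, the collectives only ever move tensors of size [..., 1] (plus one per-token KL tensor), which is far cheaper than all-gathering a vocab-sized dimension.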
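The T² factor compensates for the 1/T² shrinkage of soft-target gradients noted by Hinton et al. (2015). Where this lands in forward is illustrative:

```python
# Hypothetical tail of KDLoss.forward: soft-target gradients scale as
# 1/T^2, so multiplying the loss by T^2 keeps gradient magnitudes
# independent of the chosen temperature.
if temperature != 1.0:
    loss = loss * temperature**2
```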
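The world_size=1 equivalence test could be shaped roughly like this, reusing the _kl_forward_tp sketch above (fixture and test names are placeholders, not the ones in the PR):

```python
import os
import pytest
import torch
import torch.distributed as dist


@pytest.fixture
def single_rank_gloo_group():
    # With world_size=1 the TP all_reduces are no-ops, so the TP path
    # must reproduce the plain (non-TP) KD loss exactly.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group("gloo", rank=0, world_size=1)
    yield dist.group.WORLD
    dist.destroy_process_group()


def test_tp_path_matches_non_tp(single_rank_gloo_group):
    torch.manual_seed(0)
    student = torch.randn(4, 32)
    teacher = torch.randn(4, 32)
    kl_tp = _kl_forward_tp(student, teacher, single_rank_gloo_group, temperature=2.0)
    # Dense reference over the full (unsharded) vocabulary.
    t_lp = torch.log_softmax(teacher / 2.0, dim=-1)
    s_lp = torch.log_softmax(student / 2.0, dim=-1)
    kl_ref = (t_lp.exp() * (t_lp - s_lp)).sum(dim=-1)
    torch.testing.assert_close(kl_tp, kl_ref)
```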
Author (Contributor): @akoumpa for visibility
Contributor: /ok to test 8ffe1e7
akoumpa approved these changes on Mar 9, 2026