[FEAT][distributed]: implement Sequence-Parallel aware LogProb and Loss reductions #49

## Description:
### Context:
Training on extremely long-context tasks (32k+ tokens) requires Sequence Parallelism or Context Parallelism, where a single sequence is sharded across multiple GPUs. Our current loss functions (LogP, KL, Masked-Mean) assume the full sequence resides on a single device.

## Tasks:
1. Update the `selected_logprobs` and KL operators to accept sharded input sequences.
2. Implement cross-rank reductions (e.g., partial sum of masked tokens -> AllReduce -> global division) for metrics like `masked_mean` and `masked_sum`.
3. Ensure that, during the backward pass, gradients are routed back to the correct sequence shards.
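
A minimal sketch of the reduction in task 2, in pure Python so it runs without a GPU or a process group. The function names (`selected_logprobs`, `local_partials`, `all_reduce_sum`, `sp_masked_mean`) and the toy data are hypothetical and do not reflect the repository's actual API. `all_reduce_sum` is only a stand-in for a real SUM all-reduce collective, and the sketch assumes each shard already holds the labels and mask for its own tokens.

```python
import math

def selected_logprobs(logits, labels):
    """Per-token log p(label). Each token depends only on its own logits,
    so a shard can compute this without talking to other ranks."""
    out = []
    for row, label in zip(logits, labels):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        lse = m + math.log(sum(math.exp(v - m) for v in row))
        out.append(row[label] - lse)
    return out

def local_partials(values, mask):
    """Partial masked sum and masked-token count for one shard."""
    return sum(v * m for v, m in zip(values, mask)), sum(mask)

def all_reduce_sum(per_rank):
    """Stand-in for a SUM all-reduce: every rank receives the same totals."""
    total = tuple(sum(col) for col in zip(*per_rank))
    return [total for _ in per_rank]

def sp_masked_mean(shard_values, shard_masks):
    """partial sums -> all-reduce -> global division, returned per rank."""
    partials = [local_partials(v, m) for v, m in zip(shard_values, shard_masks)]
    reduced = all_reduce_sum(partials)
    return [s / max(n, 1) for s, n in reduced]  # max() guards an all-masked batch

# Toy sequence: 4 tokens, vocabulary of 3, third token masked out.
logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [1.5, -0.5, 0.0], [0.0, 3.0, 1.0]]
labels = [0, 2, 1, 1]
mask = [1, 1, 0, 1]

# Reference: the whole sequence on one device.
full_lp = selected_logprobs(logits, labels)
full_mean = sum(v * m for v, m in zip(full_lp, mask)) / sum(mask)

# Sequence-parallel: split the sequence dimension across 2 simulated ranks.
world_size = 2
chunk = len(labels) // world_size
spans = [(r * chunk, (r + 1) * chunk) for r in range(world_size)]
shard_lp = [selected_logprobs(logits[a:b], labels[a:b]) for a, b in spans]
shard_mask = [mask[a:b] for a, b in spans]
sp_means = sp_masked_mean(shard_lp, shard_mask)

# Averaging per-rank means instead is wrong when shards hold unequal mask counts.
local_means = [s / max(n, 1) for s, n in map(local_partials, shard_lp, shard_mask)]
naive = sum(local_means) / world_size
```

Under this formulation, the derivative of the global mean with respect to token `i` is `mask[i]` divided by the global count, so each shard's backward pass needs only the reduced count, not the other shards' values. Whether the collective itself participates in autograd depends on the framework and is not covered by this sketch.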

