
Conversation

kshitij12345 (Collaborator) commented Sep 5, 2025

Related #2338

TODO:

  • Figure out why we see `Error from segmentation group 1: The singleton Communicator isn't available. This is most likely because the instance wasn't successfully initialized due to lack of a multi-process running (e.g. mpirun or torchrun).` only when running this primitive. Need to set environment variables for nvFuser multi-device to work; see changes to helper.py and the sketch below.
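
For context, a minimal sketch of what setting those environment variables can look like, assuming the nvFuser Communicator picks up the standard torch.distributed rendezvous variables that torchrun/mpirun would normally provide (the exact variables set in helper.py are not reproduced here):

```python
# Hedged sketch: manually provide the rendezvous variables that torchrun would
# normally set, so a single-process test run can still initialize the
# communicator. The variables actually changed in helper.py may differ.
import os

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("LOCAL_RANK", "0")
```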

@kshitij12345 changed the title from "[WIP] DTensor: Add _grouped_mm torch and prim" to "[WIP] DTensor: Add torch symbol and prim for _grouped_mm" on Sep 5, 2025
@github-actions bot added the ci label on Sep 24, 2025
@kshitij12345 force-pushed the dtensor-prims._grouped_mm branch from 7d785d7 to 9221690 on October 2, 2025 10:21
@github-actions bot removed the ci label on Oct 2, 2025
@kshitij12345 changed the title from "[WIP] DTensor: Add torch symbol and prim for _grouped_mm" to "[DTensor] Add torch symbol and prim for _grouped_mm" on Oct 2, 2025
@kshitij12345 self-assigned this on Oct 2, 2025
@kshitij12345 added the DTensor label (Issues about DTensor support in Thunder) on Oct 2, 2025
@kshitij12345 marked this pull request as ready for review on October 2, 2025 10:32

t-vi (Collaborator) commented Oct 2, 2025

We need to make the access to torch._grouped_mm conditional or bump the min torch version.

@kshitij12345 requested a review from crcrpar on October 2, 2025 12:31

kshitij12345 (Collaborator, Author) replied:

> We need to make the access to torch._grouped_mm conditional or bump the min torch version.

Have made the access conditional, thanks @t-vi
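
For reference, a minimal sketch of one way such a guard can look (an illustration, not necessarily the exact change in this PR):

```python
# Hedged sketch: only register the torch._grouped_mm symbol/prim when the
# installed torch build actually exposes the operator.
import torch

if hasattr(torch, "_grouped_mm"):
    # register the thunder torch symbol and DTensor prim for _grouped_mm here
    ...
```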

kshitij12345 (Collaborator, Author) commented Oct 2, 2025

I have pushed a couple of commits after changing the PR status from draft to ready, but the Lit job hasn't been triggered.


Comment on lines +255 to +268
"input_shardings",
[
(
[
Shard(
-1,
)
],
[
Shard(1),
],
[Replicate()],
),
],
Collaborator

QQ: what's the type of input_shardings? A tuple of two lists of Shard/Replicate placements?

Collaborator Author

It is:

[
  ([Shard(-1)], [Shard(1)], [Replicate()]),
]

NOTE: Each element of the tuple is a Sequence[Placement], as expected by distribute_tensor.

Doc: https://docs.pytorch.org/docs/stable/distributed.tensor.html#torch.distributed.tensor.distribute_tensor
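
For illustration, a hedged sketch of how each of the three placement sequences could be passed to distribute_tensor, assuming they correspond to the two matrix inputs and the group-offsets tensor; the mesh size, tensor names, and shapes below are assumptions, not the test's actual code:

```python
# Hedged sketch: distribute the three grouped-mm inputs according to one
# parametrization. Assumes a 2-rank torchrun launch; names/shapes are made up.
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate, Shard, distribute_tensor

mesh = init_device_mesh("cuda", (2,))

a = torch.randn(8, 16, device="cuda", dtype=torch.bfloat16)          # (M, K)
b = torch.randn(4, 16, 32, device="cuda", dtype=torch.bfloat16)      # (G, K, N)
offs = torch.tensor([2, 4, 6, 8], device="cuda", dtype=torch.int32)  # group offsets into M

placements = ([Shard(-1)], [Shard(1)], [Replicate()])

d_a = distribute_tensor(a, mesh, placements[0])        # shard a along its last dim
d_b = distribute_tensor(b, mesh, placements[1])        # shard b along dim 1
d_offs = distribute_tensor(offs, mesh, placements[2])  # replicate the offsets
```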

Labels
DTensor (Issues about DTensor support in Thunder)

3 participants