
Conversation

@wujingyue (Collaborator)

so nvFuser gets a larger portion of the MoE layer. Context: NVIDIA/Fuser#4866 (comment)

@kshitij12345 (Collaborator) left a comment

The CI tests for cumsum are failing due to accuracy issues.

I think we can look at this decomposition -
https://github.com/pytorch/pytorch/blob/17b9c618ddaaffbf07f9231d7cd421fdf76462dc/torch/_refs/__init__.py#L4668-L4689

However, I was wondering whether nvFuser can just add fd.ops.cumsum.

Thanks!

@wujingyue (Collaborator, Author)

Side note: the sample data is non-deterministic, making debugging hard. In nvFuser, we call torch.manual_seed before each test so things like torch.randn and torch.testing.make_tensor will generate deterministic values. cc @IvanYashchuk
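For illustration only, here is a minimal sketch (not the actual thunder test harness; the fixture name is made up) of what per-test seeding could look like with pytest:

```python
import pytest
import torch


# Hypothetical autouse fixture: reseed the global RNG before every test so that
# torch.randn and torch.testing.make_tensor produce the same sample data on
# every run, which makes failures reproducible.
@pytest.fixture(autouse=True)
def _seed_torch_rng():
    torch.manual_seed(0)
    yield
```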

@wujingyue (Collaborator, Author)

> The CI tests for cumsum are failing due to accuracy issues.

Thanks for pointing that out!

nvFuser actually computes more accurate results than torch here. This is because torch.cumsum calls cub::DeviceScan, which in this case accumulates in float16.

https://github.com/Lightning-AI/lightning-thunder/blob/main/thunder/tests/test_ops.py#L91 fails to detect this because it doesn't upcast the dtype for the comparison. I'm experimenting with aa1126d...
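To illustrate the kind of check that would catch this (assuming a CUDA device; this is not the actual test_ops.py logic), one can measure the half-precision result against a reference that accumulates in a wider dtype:

```python
import torch

# Illustrative only: evaluate cumsum accuracy against a float64 reference that
# accumulates in wide precision, instead of against another float16 result.
x = torch.randn(1 << 14, device="cuda", dtype=torch.float16)

result = torch.cumsum(x, dim=0)              # low-precision accumulation on this path
reference = torch.cumsum(x.double(), dim=0)  # wide-accumulation reference

# Maximum absolute error of the float16 result relative to the reference.
max_error = (result.double() - reference).abs().max()
print(max_error)
```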

@wujingyue (Collaborator, Author)

> However, I was wondering whether nvFuser can just add fd.ops.cumsum.

cc @jacobhinkle: I believe this is getting slightly closer with your recent ScanOp?

Anyhow, I don't expect a cumsum over 128 integers (i.e., the number of experts) to be a bottleneck, so triu+matmul or triu+where+sum as you suggested should be fine; both are sketched below.
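For reference, a rough sketch of both decompositions in plain PyTorch (last dimension only; the function names are mine, not the PR's):

```python
import torch


def cumsum_via_matmul(x: torch.Tensor) -> torch.Tensor:
    # y[..., i] = sum_{j <= i} x[..., j], written as x @ upper-triangular ones.
    n = x.size(-1)
    mask = torch.triu(torch.ones(n, n, dtype=x.dtype, device=x.device))
    return x @ mask


def cumsum_via_where(x: torch.Tensor) -> torch.Tensor:
    # The triu + where + sum variant: broadcast, zero out j > i, reduce over j.
    n = x.size(-1)
    keep = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device))
    zero = torch.zeros((), dtype=x.dtype, device=x.device)
    expanded = x.unsqueeze(-1).expand(*x.shape, n)  # index order: (..., j, i)
    return torch.where(keep, expanded, zero).sum(dim=-2)


x = torch.arange(1, 6, dtype=torch.float32)
assert torch.equal(cumsum_via_matmul(x), torch.cumsum(x, dim=-1))
assert torch.equal(cumsum_via_where(x), torch.cumsum(x, dim=-1))
```

Both are O(n²) in the scanned length, which shouldn't matter for n = 128.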

@wujingyue (Collaborator, Author)

The only CI failures are caused by cumsum_transform sometimes receiving a torch.dtype and sometimes a thunder.dtypes.dtype. I think it's related to register_supported(ltorch.cumsum, ...) and the fact that dtype is an argument. Any idea how to solve this? cc @IvanYashchuk and @kshitij12345

@kshitij12345 (Collaborator)

> The only CI failures are caused by cumsum_transform sometimes receiving a torch.dtype and sometimes a thunder.dtypes.dtype. I think it's related to register_supported(ltorch.cumsum, ...) and the fact that dtype is an argument. Any idea how to solve this?

You can use lcdtype_to_nvdtype(dtypes.to_dtype(dtype)): dtypes.to_dtype converts a torch.dtype to thunder's dtype and passes a thunder dtype through unchanged. This should unblock the PR.
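A minimal sketch of that normalization on its own, outside the actual cumsum_transform (the helper name is mine):

```python
import torch
from thunder.core import dtypes


def normalize_dtype(dtype):
    # dtypes.to_dtype accepts either a torch.dtype or a thunder dtype and
    # returns thunder's dtype, so a downstream call such as the nvFuser
    # executor's lcdtype_to_nvdtype only ever sees one representation.
    return None if dtype is None else dtypes.to_dtype(dtype)


print(normalize_dtype(torch.float16))   # thunder's float16
print(normalize_dtype(dtypes.float16))  # unchanged thunder dtype
print(normalize_dtype(None))            # passthrough for the default
```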

But I am surprised that dtype can be either a torch.dtype or a thunder.dtypes.dtype; I will check why that is the case and file an issue accordingly.

@wujingyue (Collaborator, Author)

> You can use lcdtype_to_nvdtype(dtypes.to_dtype(dtype))

That works -- thanks!

The PR is ready to review now.

@wujingyue requested a review from @kshitij12345 on July 31, 2025 at 14:15.
@kshitij12345 (Collaborator) left a comment

LGTM, thanks @wujingyue

@wujingyue (Collaborator, Author)

@t-vi and @mruberry, this is ready to merge.


@t-vi merged commit 24f1d00 into main on Aug 1, 2025; 50 checks passed.
@t-vi deleted the wjy/cumsum branch on August 1, 2025 at 13:44.