
Conversation

@wujingyue (Collaborator)

so nvFuser gets a larger portion of the MoE layer. Context: NVIDIA/Fuser#4866 (comment)

@kshitij12345 (Collaborator) left a comment

The CI tests for cumsum are failing due to accuracy issues.

I think we can look at this decomposition -
https://github.com/pytorch/pytorch/blob/17b9c618ddaaffbf07f9231d7cd421fdf76462dc/torch/_refs/__init__.py#L4668-L4689

However, I was wondering whether nvFuser can just add fd.ops.cumsum.

Thanks!

@wujingyue (Collaborator, Author)

Side note: the sample data is non-deterministic, making debugging hard. In nvFuser, we call torch.manual_seed before each test so things like torch.randn and torch.testing.make_tensor will generate deterministic values. cc @IvanYashchuk
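For illustration only, here is a minimal sketch (not the actual thunder test harness; the fixture name is made up) of what per-test seeding could look like with pytest:

```python
import pytest
import torch


# Hypothetical autouse fixture: reseed the global RNG before every test so that
# torch.randn and torch.testing.make_tensor produce the same sample data on
# every run, which makes failures reproducible.
@pytest.fixture(autouse=True)
def _seed_torch_rng():
    torch.manual_seed(0)
    yield
```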

@wujingyue (Collaborator, Author)

> The CI tests for cumsum are failing due to accuracy issues.

Thanks for pointing that out!

nvFuser actually computes more accurate results than torch here. This is because torch.cumsum calls cub::DeviceScan, which in this case accumulates in float16.

https://github.com/Lightning-AI/lightning-thunder/blob/main/thunder/tests/test_ops.py#L91 fails to detect this because it doesn't upcast the dtype for the comparison. I'm experimenting with aa1126d...
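To illustrate the kind of check that would catch this (assuming a CUDA device; this is not the actual test_ops.py logic), one can measure the half-precision result against a reference that accumulates in a wider dtype:

```python
import torch

# Illustrative only: evaluate cumsum accuracy against a float64 reference that
# accumulates in wide precision, instead of against another float16 result.
x = torch.randn(1 << 14, device="cuda", dtype=torch.float16)

result = torch.cumsum(x, dim=0)              # low-precision accumulation on this path
reference = torch.cumsum(x.double(), dim=0)  # wide-accumulation reference

# Maximum absolute error of the float16 result relative to the reference.
max_error = (result.double() - reference).abs().max()
print(max_error)
```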

@wujingyue (Collaborator, Author)

> However, I was wondering whether nvFuser can just add fd.ops.cumsum.

cc @jacobhinkle: I believe this is getting slightly closer with your recent ScanOp?

Anyhow, I don't expect a cumsum over 128 integers (i.e., the number of experts) to be a bottleneck, so triu+matmul or triu+where+sum as you suggested should be fine; both are sketched below.
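For reference, a rough sketch of both decompositions in plain PyTorch (last dimension only; the function names are mine, not the PR's):

```python
import torch


def cumsum_via_matmul(x: torch.Tensor) -> torch.Tensor:
    # y[..., i] = sum_{j <= i} x[..., j], written as x @ upper-triangular ones.
    n = x.size(-1)
    mask = torch.triu(torch.ones(n, n, dtype=x.dtype, device=x.device))
    return x @ mask


def cumsum_via_where(x: torch.Tensor) -> torch.Tensor:
    # The triu + where + sum variant: broadcast, zero out j > i, reduce over j.
    n = x.size(-1)
    keep = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device))
    zero = torch.zeros((), dtype=x.dtype, device=x.device)
    expanded = x.unsqueeze(-1).expand(*x.shape, n)  # index order: (..., j, i)
    return torch.where(keep, expanded, zero).sum(dim=-2)


x = torch.arange(1, 6, dtype=torch.float32)
assert torch.equal(cumsum_via_matmul(x), torch.cumsum(x, dim=-1))
assert torch.equal(cumsum_via_where(x), torch.cumsum(x, dim=-1))
```

Both are O(n²) in the scanned length, which shouldn't matter for n = 128.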

@wujingyue (Collaborator, Author)

The only CI failures are caused by cumsum_transform sometimes receiving a torch.dtype and sometimes a thunder.dtypes.dtype. I think it's related to register_supported(ltorch.cumsum, ...) and the fact that dtype is an argument. Any idea how to solve this? cc @IvanYashchuk and @kshitij12345

@kshitij12345 (Collaborator)

> The only CI failures are caused by cumsum_transform sometimes receiving a torch.dtype and sometimes a thunder.dtypes.dtype. I think it's related to register_supported(ltorch.cumsum, ...) and the fact that dtype is an argument. Any idea how to solve this?

You can use lcdtype_to_nvdtype(dtypes.to_dtype(dtype)): dtypes.to_dtype converts a torch.dtype to thunder's dtype and passes a thunder dtype through unchanged. This should unblock the PR.
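A minimal sketch of that normalization on its own, outside the actual cumsum_transform (the helper name is mine):

```python
import torch
from thunder.core import dtypes


def normalize_dtype(dtype):
    # dtypes.to_dtype accepts either a torch.dtype or a thunder dtype and
    # returns thunder's dtype, so a downstream call such as the nvFuser
    # executor's lcdtype_to_nvdtype only ever sees one representation.
    return None if dtype is None else dtypes.to_dtype(dtype)


print(normalize_dtype(torch.float16))   # thunder's float16
print(normalize_dtype(dtypes.float16))  # unchanged thunder dtype
print(normalize_dtype(None))            # passthrough for the default
```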

But I am surprised that dtype can be either a torch.dtype or a thunder.dtypes.dtype; I will check why that is the case and file an issue accordingly.

@wujingyue (Collaborator, Author)

> You can use lcdtype_to_nvdtype(dtypes.to_dtype(dtype))

That works -- thanks!

The PR is ready to review now.

@wujingyue requested a review from @kshitij12345 on July 31, 2025 at 14:15.
@kshitij12345 (Collaborator) left a comment

LGTM, thanks @wujingyue

@wujingyue (Collaborator, Author)

@t-vi and @mruberry, this is ready to merge.


@t-vi merged commit 24f1d00 into main on Aug 1, 2025; 50 checks passed.
@t-vi deleted the wjy/cumsum branch on August 1, 2025 at 13:44.