Lower cumsum to nvfuser #2374
Conversation
kshitij12345
left a comment
The CI tests for cumsum are failing due to accuracy issues.
I think we can look at this decomposition -
https://github.com/pytorch/pytorch/blob/17b9c618ddaaffbf07f9231d7cd421fdf76462dc/torch/_refs/__init__.py#L4668-L4689
However, I was wondering whether nvFuser can just add fd.ops.cumsum.
Thanks!
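For reference, a minimal sketch of one way cumsum can be decomposed into primitives such as broadcast, comparison, where, and sum. This is an illustration only, not necessarily the exact _refs decomposition linked above:

```python
import torch

def cumsum_decomposed(a: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Move the scanned dimension last for simplicity.
    a = a.movedim(dim, -1)
    n = a.shape[-1]
    idx = torch.arange(n, device=a.device)
    # mask[i, j] is True when j <= i, i.e. element j contributes to prefix i.
    mask = idx.unsqueeze(-1) >= idx.unsqueeze(0)
    # Broadcast the input against the mask, zero out excluded elements,
    # and reduce over the source axis.
    zero = torch.zeros((), dtype=a.dtype, device=a.device)
    out = torch.where(mask, a.unsqueeze(-2), zero).sum(-1)
    return out.movedim(-1, dim)

x = torch.randn(4, 8)
assert torch.allclose(cumsum_decomposed(x), torch.cumsum(x, dim=-1))
```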
Side note: the sample data is non-deterministic, making debugging hard. In nvFuser, we call torch.manual_seed before each test so that issues like this are reproducible.
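A minimal sketch of that practice, assuming a pytest-style suite (the fixture name is illustrative):

```python
import pytest
import torch

# Seed PyTorch's RNG before every test so a failing sample can be reproduced.
@pytest.fixture(autouse=True)
def _manual_seed():
    torch.manual_seed(0)
```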
Thanks for pointing that out! nvFuser actually computes more accurate results than torch. This is because torch.cumsum calls cub::DeviceScan, which in this case accumulates in float16. https://github.com/Lightning-AI/lightning-thunder/blob/main/thunder/tests/test_ops.py#L91 fails to detect this because it doesn't upcast the dtype. I'm experimenting with aa1126d...
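To illustrate the point (requires a CUDA device; the size is made up for the example), a reference accumulated in float32 exposes the float16 accumulation error:

```python
import torch

x = torch.rand(4096, dtype=torch.float16, device="cuda")
lowp = torch.cumsum(x, dim=0)                   # CUDA float16 path; may accumulate in float16
ref = torch.cumsum(x.to(torch.float32), dim=0)  # accumulate in float32
# The gap grows with sequence length; a test comparing in float16 without
# upcasting can miss (or misattribute) this error.
print((lowp.to(torch.float32) - ref).abs().max())
```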
cc @jacobhinkle I believe this is getting slightly closer with your recent ScanOp? Anyhow, I don't expect a cumsum of 128 integers (i.e. the number of experts) to be a bottleneck, so triu+matmul or triu+where+sum as you suggested should be fine.
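A sketch of the triu+matmul alternative mentioned above (the where+sum variant is sketched earlier in the thread):

```python
import torch

def cumsum_via_matmul(a: torch.Tensor) -> torch.Tensor:
    # Cumulative sum along the last dim as a matmul with an upper-triangular
    # matrix of ones: out[..., i] = sum_{j <= i} a[..., j].
    n = a.shape[-1]
    tri = torch.ones(n, n, dtype=a.dtype, device=a.device).triu()
    return a @ tri

x = torch.randn(2, 128)
assert torch.allclose(cumsum_via_matmul(x), torch.cumsum(x, dim=-1), atol=1e-5)
```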
The only CI failures are caused by cumsum_transform sometimes receiving torch.dtype and sometimes thunder.dtypes.dtype. I think it's related to ...
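A hypothetical sketch of the kind of normalization being discussed; the actual fix and helper used in the PR are not shown in this thread, and the thunder helper path below is an assumption:

```python
import torch
from thunder.core.dtypes import to_torch_dtype  # assumed helper location

def normalize_dtype(dtype):
    # cumsum_transform can receive either a torch.dtype or a thunder dtype;
    # map both to torch.dtype so downstream comparisons see one kind of object.
    if isinstance(dtype, torch.dtype):
        return dtype
    return to_torch_dtype(dtype)
```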
You can use ... But I am surprised that ...
That works -- thanks! The PR is ready to review now.
kshitij12345
left a comment
LGTM, thanks @wujingyue
t-vi
left a comment
Thank you @wujingyue @kshitij12345
So nvFuser gets a larger portion of the MoE layer. Context: NVIDIA/Fuser#4866 (comment)