
[Inductor] Masked tl.load operations should explicitly include other if the masked out values are expected to be used #126535

Closed
alexbaden opened this issue May 17, 2024 · 3 comments
Labels: module: inductor, oncall: pt2, triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@alexbaden (Contributor) commented May 17, 2024

🐛 Describe the bug

The expected semantics of a Triton `tl.load` call with a mask supplied but no `other` parameter are to leave the masked-out values (values where the mask is false) undefined. However, the CUDA backend in Triton explicitly zero-initializes masked-out values as part of the predicated load instruction generated by the compiler. It appears that this behavior is being relied upon in Inductor (e.g. #126173), which results in undefined behavior from Inductor-generated kernels on other Triton backends. In particular, the Intel backend used an undefined LLVM value, rather than zero-initializing, when a masked load occurred without `other`.
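To make the difference concrete, here is a minimal, hypothetical Triton kernel (not taken from Inductor output) contrasting the two forms of a masked load:

```python
import triton
import triton.language as tl

@triton.jit
def masked_load_example(in_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    offsets = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements

    # Per the Triton language spec, lanes where `mask` is false are
    # undefined in `a`; the CUDA backend merely happens to zero-fill them.
    a = tl.load(in_ptr + offsets, mask=mask)

    # With `other`, the masked-out lanes are well-defined on every backend.
    b = tl.load(in_ptr + offsets, mask=mask, other=0.0)

    tl.store(out_ptr + offsets, a + b, mask=mask)
```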

The suggested solution is to add `other` wherever Inductor generates masked loads, but looking at the code I noticed commentary around issues with masked loads when `other` is present. While the corresponding issue in Triton has been closed for some time, I am not sure what the side effects of adding `other` back to these areas could be. This issue is opened to investigate them.

In the meantime, we have decided to follow the CUDA backend implementation and explicitly zero-initialize in the Intel XPU backend. It is possible this carries some performance benefit in addition to fixing the bug above (though any benefit may just be a side effect of not comparing against undefined values; I have not investigated). But our reasoning is that CUDA is the reference backend for Triton and, regardless of what the language spec says, users are going to expect us to follow the CUDA backend semantics, particularly users who started with CUDA (like Inductor and PyTorch). It appears that the AMD Triton backend also zero-initializes, but I don't have hardware to test that.

I am happy to investigate adding `other` to the masked load operations (a quick survey last night turned up 3 or 4 of them), but wanted to open this issue for discussion first.

Versions

main

cc @ezyang @msaroufim @bdhirsh @anijain2305 @chauhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire

@ezyang (Contributor) commented May 21, 2024

cc @jansel

@jansel (Contributor) commented May 21, 2024

Given that triton-lang/triton#2813 is fixed, I think it is fine to add `other=...` to Inductor's generated code and remove that old workaround.

@alexbaden want to submit a PR to add that?
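For illustration, the change to Inductor's generated loads would look roughly like the sketch below (variable names such as `in_ptr0` and `xmask` follow Inductor's codegen style, but the lines are illustrative, not actual Inductor output):

```python
# Before: masked-out lanes of tmp0 are undefined per the Triton spec.
tmp0 = tl.load(in_ptr0 + x0, xmask)

# After: masked-out lanes are explicitly zero on every backend.
tmp0 = tl.load(in_ptr0 + x0, xmask, other=0.0)
```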

@alexbaden (Contributor, Author)

Will do!

@xmfan added the triaged label and removed the triage review label on May 21, 2024
pytorch-bot bot pushed a commit that referenced this issue May 30, 2024
For a masked `tl.load` operation, the Triton language specifies that values masked out (i.e. where the mask evaluates to false) are undefined in the output of the load. Triton provides an optional `other` parameter which, when included, provides an explicit value to use for masked out values from the load. If the output from a masked load without the `other` parameter is used in a conditional, unexpected behavior can occur.

Despite the language specification, all Triton backends currently in use by PyTorch Inductor (NVIDIA, AMD, and Intel) zero-initialize masked loads if `other` is not present (we recently changed the Intel backend behavior to match NVIDIA and AMD because that's what our users expect, even if we are not following the Triton spec to the letter). This PR attempts to "future-proof" Inductor for new backends (or perhaps changes in the current backends: we did not see any performance change from zero-initializing in the Intel XPU backend, but one could imagine compiler optimizations that remove paths depending on undefined values) by adding an explicit `other` in instances where later conditionals depend on the `tl.load` output. I also removed an exception to the `other` behavior for boolean loads, which was put in place for a Triton bug that should now be fixed. I added `other` to the getting-started documentation as a clue that masked load behavior requires explicit initialization, even though I don't expect `undef` values to cause the example code to fail if the underlying output is not zero-initialized. Finally, I added `other` to the `make_load` function in `select_algorithm.py`, though I wasn't able to determine whether that function is actually being called.
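As a hypothetical illustration of the conditional hazard described above (the kernel is invented for this example, not taken from the PR):

```python
import triton
import triton.language as tl

@triton.jit
def block_abs_sum(in_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    offsets = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements

    # Without `other`, lanes where `mask` is false are undefined.
    x = tl.load(in_ptr + offsets, mask=mask)

    # The comparison reads every lane, including the undefined ones. A
    # backend that zero-fills evaluates them as 0.0 > 0.0 == False; a
    # backend that leaves them as `undef` produces unspecified results,
    # and the reduction below then sums garbage into the output.
    y = tl.where(x > 0.0, x, -x)
    acc = tl.sum(y, axis=0)
    tl.store(out_ptr + tl.program_id(0), acc)
```

Passing `other=0.0` to the load would make the masked-out lanes contribute zero to the sum on every backend.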

Fixes #126535

Pull Request resolved: #127311
Approved by: https://github.com/jansel