Port https://github.com/pytorch/pytorch/pull/149738 to ROCm (part 1). #2000
Conversation
Jenkins build for 50407c216a8b7d66f0ff670f419f2ad469fdc87d commit finished as FAILURE. Detected error during PyTorch building:
This PR ports to release/2.5 the generalization of vectorized elementwise kernels for multiple heterogeneous tensor types. It's still missing the reverted threadblock mapping present in the original PR, which will come in a later PR. Co-authored-by: Jerry Mannil <Jerry.Mannil@amd.com>
Force-pushed from 50407c2 to b287fde
Latest PR version includes handling of tensor output types other than float, which was failing in some elementwise kernel calls, such as in-place a += b with a of type Half and b of type float. @jerrymannil thanks for catching the bug and fixing it.
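For illustration, a minimal sketch of the mixed-dtype in-place case mentioned above (assumes a ROCm/CUDA device is available; the dtype behavior is the same on CPU):

import torch

# In-place add with heterogeneous dtypes: the output tensor is `a`,
# so the result dtype stays Half even though `b` is float32.
a = torch.randn(1024, dtype=torch.half, device="cuda")
b = torch.randn(1024, dtype=torch.float, device="cuda")
a += b
print(a.dtype)  # torch.float16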
Jenkins build for b287fde55b004220299cfc6137b03cedf7658c64 commit finished as FAILURE
c10::CppTypeToScalarType<float>::value}),
std::array<c10::ScalarType, 3>(
    {c10::CppTypeToScalarType<BFloat16>::value,
     c10::CppTypeToScalarType<BFloat16>::value,
Should we support (BFloat16, float, BFloat16) and similar cases, given that dtypes can be in any order?
See e.g. https://pytorch.org/docs/stable/generated/torch.add.html
I am not sure that makes sense: we see an output tensor type of Half when we do an in-place operation whose first argument is Half, i.e.
a += b, where a has type Half and b has type float.
However, if you have
a += b, where a has type float and b has type BFloat16,
the output tensor type is float, which is already captured by the first case.
I was not able to get the second example above to produce an output tensor type of BFloat16, but please let me know if you can. Maybe a non-in-place operation like
a = b + c
where a is BFloat16 or Half?
I don't have a good test case for that. If you do, please share.
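For illustration, a minimal sketch of the promotion behavior described above (standard PyTorch type-promotion rules; the float/BFloat16 combinations end up as float32 outputs unless an explicit out tensor is used):

import torch

a = torch.randn(8, dtype=torch.float, device="cuda")
b = torch.randn(8, dtype=torch.bfloat16, device="cuda")

a += b            # in-place: output dtype follows `a`
print(a.dtype)    # torch.float32

c = torch.randn(8, dtype=torch.float, device="cuda")
out = b + c       # out-of-place: promotion gives float32
print(out.dtype)  # torch.float32, not torch.bfloat16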
We can create such cases using torch.add(x, y, out=z), but I am not sure if this is a widely used case.
So what we have is good for now; we can update the rules later if needed.
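For illustration, a sketch of the out= construction mentioned above; the explicit out tensor pins the output dtype to BFloat16 for (BFloat16, float) inputs:

import torch

x = torch.randn(8, dtype=torch.bfloat16, device="cuda")
y = torch.randn(8, dtype=torch.float, device="cuda")
z = torch.empty(8, dtype=torch.bfloat16, device="cuda")

# Produces the (BFloat16, float, BFloat16) dtype combination discussed above.
torch.add(x, y, out=z)
print(z.dtype)  # torch.bfloat16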
output_offset_calculator,
loader,
storer);
return;
Since we are doing an unconditional return, we don't fall back to the legacy path for (Half, float, Half), etc.
The check at line 727,
check_binary_rt_types_for_specialization(iter),
makes sure that we only enter the statically unrolled list of if statements when we know one of them is going to succeed; hence the direct return.
Please note that the check above now includes the output tensor type.
If you have an example that needs to fall back but instead returns, please share.
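To make the control flow concrete, here is a hypothetical Python sketch of the argument above; the function names and dtype triples are illustrative only, not the actual C++ helpers in this PR:

# Dtype triples (input0, input1, output) with a specialized vectorized kernel.
SPECIALIZED_TRIPLES = {
    ("Half", "float", "Half"),
    ("BFloat16", "BFloat16", "BFloat16"),
}

def launch_specialized(triple):
    print("specialized vectorized kernel for", triple)

def launch_legacy(triple):
    print("legacy kernel for", triple)

def launch_binary_elementwise(in0, in1, out):
    triple = (in0, in1, out)
    # Guard analogous to check_binary_rt_types_for_specialization(iter):
    # the unrolled if-chain is entered only when one branch is known to
    # match, so its unconditional `return` never skips the legacy fallback.
    if triple in SPECIALIZED_TRIPLES:
        launch_specialized(triple)
        return
    launch_legacy(triple)

launch_binary_elementwise("Half", "float", "Half")        # specialized path
launch_binary_elementwise("double", "double", "double")   # legacy fallback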
OK, looks good then.