
Conversation

@carlobertolli

This PR ports to release/2.5 the generalization of vectorized elementwise kernels to multiple heterogeneous tensor types. It does not yet include the threadblock mapping from the original PR, which was reverted and will come in a later PR.

@rocm-repo-management-api

rocm-repo-management-api bot commented Mar 26, 2025

Jenkins build for 50407c216a8b7d66f0ff670f419f2ad469fdc87d commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

Detected error during Pytorch building:

[8113/8684] Building CXX object caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[8114/8684] Building CXX object caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/hip/CuSparseHandlePool.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[8115/8684] Building CXX object caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/distributed/c10d/UCCTracing.cpp.o
FAILED: caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/distributed/c10d/UCCTracing.cpp.o 
/opt/cache/bin/sccache /opt/cache/bin/c++ -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DIDEEP_USE_MKL -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DPYTORCH_LAYERNORM_FAST_RECIPROCAL -DROCM_VERSION=60304 -DTORCH_ENABLE_LLVM -DTORCH_HIP_BUILD_MAIN_LIB -DTORCH_HIP_VERSION=603 -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_PROF_API=1 -DUSE_ROCM -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -D__HIP_PLATFORM_AMD__ -D__HIP_PLATFORM_AMD__=1 -Dtorch_hip_EXPORTS -I/var/lib/jenkins/pytorch/build/aten/src -I/var/lib/jenkins/pytorch/aten/src -I/var/lib/jenkins/pytorch/build -I/var/lib/jenkins/pytorch -I/var/lib/jenkins/pytorch/cmake/../third_party/benchmark/include -I/opt/llvm/include -I/var/lib/jenkins/pytorch/third_party/onnx -I/var/lib/jenkins/pytorch/build/third_party/onnx -I/var/lib/jenkins/pytorch/nlohmann -I/opt/rocm/hcc/include -I/opt/rocm/rocblas/include -I/opt/rocm/hipsparse/include -I/var/lib/jenkins/pytorch/aten/src/THH -I/var/lib/jenkins/pytorch/aten/src/ATen/hip -I/var/lib/jenkins/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include -I/var/lib/jenkins/pytorch/aten/src/ATen/../../../third_party/composable_kernel/library/include -I/var/lib/jenkins/pytorch/third_party/fmt/include -I/var/lib/jenkins/pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck -I/var/lib/jenkins/pytorch/build/caffe2/aten/src -I/var/lib/jenkins/pytorch/aten/src/ATen/.. -I/var/lib/jenkins/pytorch/torch/include -I/var/lib/jenkins/pytorch/c10/hip/../.. -I/var/lib/jenkins/pytorch/c10/.. -I/var/lib/jenkins/pytorch/torch/csrc/api -I/var/lib/jenkins/pytorch/torch/csrc/api/include -I/var/lib/jenkins/pytorch/build/third_party/gloo/hip -isystem /opt/rocm-6.3.4/include -isystem /var/lib/jenkins/pytorch/build/third_party/gloo -isystem /var/lib/jenkins/pytorch/cmake/../third_party/gloo -isystem /var/lib/jenkins/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /var/lib/jenkins/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /var/lib/jenkins/pytorch/cmake/../third_party/googletest/googletest/include -isystem /var/lib/jenkins/pytorch/third_party/protobuf/src -isystem /opt/conda/envs/py_3.10/include -isystem /var/lib/jenkins/pytorch/third_party/XNNPACK/include -isystem /var/lib/jenkins/pytorch/third_party/ittapi/include -isystem /var/lib/jenkins/pytorch/cmake/../third_party/eigen -isystem /var/lib/jenkins/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /var/lib/jenkins/pytorch/third_party/ideep/include -isystem /var/lib/jenkins/pytorch/INTERFACE -isystem /var/lib/jenkins/pytorch/third_party/nlohmann/include -isystem /opt/rocm/include -isystem /opt/rocm-6.3.4/include/hiprand -isystem /opt/rocm-6.3.4/include/rocrand -isystem /opt/rocm/magma/include -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi 
-Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -DHAVE_AVX512_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O3 -DNDEBUG -DNDEBUG -std=gnu++17 -fPIC -DMKL_HAS_SBGEMM -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -fvisibility=hidden -O2 -fPIC -D__HIP_PLATFORM_AMD__=1 -DCUDA_HAS_FP16=1 -DUSE_ROCM -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DTORCH_HIP_VERSION=603 -Wno-shift-count-negative -Wno-shift-count-overflow -Wno-duplicate-decl-specifier -DCAFFE2_USE_MIOPEN -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP -std=c++17 -DHIPBLAS_V2 -D_GLIBCXX_USE_CXX11_ABI=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS -DHIP_NEW_TYPE_ENUMS -DHIP_VERSION=6 -DUSE_MIOPEN -MD -MT caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/distributed/c10d/UCCTracing.cpp.o -MF caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/distributed/c10d/UCCTracing.cpp.o.d -o caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/distributed/c10d/UCCTracing.cpp.o -c /var/lib/jenkins/pytorch/torch/csrc/distributed/c10d/UCCTracing.cpp
sccache: encountered fatal error
sccache: error : Invalid checksum
sccache:  cause: Invalid checksum
[8116/8684] Building CXX object caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/hip/impl/HIPGuardImplMasqueradingAsCUDA.cpp.o

@rocm-repo-management-api

rocm-repo-management-api bot commented Mar 26, 2025

Jenkins build for 50407c216a8b7d66f0ff670f419f2ad469fdc87d commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@rocm-repo-management-api

rocm-repo-management-api bot commented Mar 27, 2025

Jenkins build for 50407c216a8b7d66f0ff670f419f2ad469fdc87d commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@rocm-repo-management-api

rocm-repo-management-api bot commented Mar 27, 2025

Jenkins build for 50407c216a8b7d66f0ff670f419f2ad469fdc87d commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

This PR ports to release/2.5 the generalization of vectorized elementwise kernels to multiple heterogeneous tensor types.
It does not yet include the threadblock mapping from the original PR, which was reverted and will come in a later PR.

Co-authored-by: Jerry Mannil <Jerry.Mannil@amd.com>
@carlobertolli force-pushed the GeneralizedVectorizedTemplatedEWK.rocm branch from 50407c2 to b287fde on April 1, 2025 16:08
@carlobertolli
Author

carlobertolli commented Apr 1, 2025

The latest version of this PR adds handling for output tensor types other than float, which was failing in some elementwise kernel calls, such as in-place a += b where a has type Half and b has type float. @jerrymannil thanks for catching the bug and fixing it.
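For reference, a minimal snippet (illustrative shape and device, not taken from the PR) reproducing the mixed-dtype in-place case described above:

```python
import torch

# In-place add with mixed dtypes: the math is done in the promoted
# type (float), but the result is stored in the dtype of `a` (Half).
a = torch.randn(1024, dtype=torch.half, device="cuda")
b = torch.randn(1024, dtype=torch.float, device="cuda")
a += b
print(a.dtype)  # torch.float16
```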

@rocm-repo-management-api

rocm-repo-management-api bot commented Apr 1, 2025

Jenkins build for b287fde55b004220299cfc6137b03cedf7658c64 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

c10::CppTypeToScalarType<float>::value}),
std::array<c10::ScalarType, 3>(
{c10::CppTypeToScalarType<BFloat16>::value,
c10::CppTypeToScalarType<BFloat16>::value,
Collaborator

@jerrymannil jerrymannil Apr 1, 2025

Should we support (BFloat16, float, BFloat16) and similar cases, given that dtypes can be in any order?
See, e.g., https://pytorch.org/docs/stable/generated/torch.add.html

Author

I am not sure that makes sense: we see the output tensor type as Half when we do an in-place operation and the first argument is Half, e.g.
a += b, where a has type Half and b has type float.
However, if you have
a += b, where a has type float and b has type BFloat16,
the output tensor type is float, which is captured by the first case.
I was not able to get the second example above to produce an output tensor type of BFloat16, but please let me know if you can. Maybe a non-in-place operation like
a = b + c
where a is BFloat16 or Half?
I don't have a good test case for that; if you do, please share.
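For illustration, a quick sketch (hypothetical values) of the second ordering:

```python
import torch

# float += bfloat16: the promoted type is float, and an in-place op
# keeps the dtype of the left operand, so the output stays float.
a = torch.ones(4, dtype=torch.float)
b = torch.ones(4, dtype=torch.bfloat16)
a += b
print(a.dtype)  # torch.float32
```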

Collaborator

We can create such cases using torch.add(x, y, out=z).
But I am not sure this is a widely used case,
so what we have is good for now; we can update the rules later if needed.
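For example, a minimal sketch (illustrative shapes, not from the PR) of how out= pins the output dtype:

```python
import torch

# torch.add with an explicit out= tensor forces the output dtype,
# producing a (BFloat16, float, BFloat16) input/output combination.
x = torch.randn(1024, dtype=torch.bfloat16, device="cuda")
y = torch.randn(1024, dtype=torch.float, device="cuda")
z = torch.empty(1024, dtype=torch.bfloat16, device="cuda")
torch.add(x, y, out=z)
print(z.dtype)  # torch.bfloat16
```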

output_offset_calculator,
loader,
storer);
return;
Collaborator

Since we are doing an unconditional return, we don't fall back to the legacy path for (Half, float, Half), etc.

Author

The check at line 727,
check_binary_rt_types_for_specialization(iter),
makes sure that we only enter the statically unrolled list of if statements when we know one of them is going to succeed; hence the direct return.
Please note that the check now includes the output tensor type.
If you have an example that needs to fall back but returns instead, please share.
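As a rough sketch of the control flow being described (Python pseudocode; the dtype triples and helper names are illustrative, with only check_binary_rt_types_for_specialization mirroring an identifier from the PR):

```python
# Hypothetical sketch: the unrolled branch list is entered only when
# the runtime-type check guarantees a match, so each branch can
# return unconditionally without losing the legacy fallback.
SPECIALIZED = {("Half", "float", "Half"), ("BFloat16", "float", "BFloat16")}

def check_binary_rt_types_for_specialization(types):
    # True only if one of the unrolled branches below will match.
    return types in SPECIALIZED

def launch(types):
    if not check_binary_rt_types_for_specialization(types):
        return "legacy kernel"  # fallback path remains reachable
    if types == ("Half", "float", "Half"):
        return "specialized Half kernel"  # direct return is safe here
    if types == ("BFloat16", "float", "BFloat16"):
        return "specialized BFloat16 kernel"

print(launch(("Half", "float", "Half")))       # specialized Half kernel
print(launch(("double", "double", "double")))  # legacy kernel
```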

Collaborator

OK, looks good then.

@jerrymannil jerrymannil merged commit 640334b into ROCm:release/2.5 Apr 2, 2025
3 of 7 checks passed
