
Conversation

@carlobertolli

This PR ports to release/2.5 the generalization of vectorized elementwise kernels to multiple heterogeneous tensor types. It does not yet include the threadblock mapping from the original PR, which was reverted and will come in a later PR.

@rocm-repo-management-api

rocm-repo-management-api bot commented Mar 26, 2025

Jenkins build for 50407c216a8b7d66f0ff670f419f2ad469fdc87d commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

Detected error during Pytorch building:

[8113/8684] Building CXX object caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[8114/8684] Building CXX object caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/hip/CuSparseHandlePool.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[8115/8684] Building CXX object caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/distributed/c10d/UCCTracing.cpp.o
FAILED: caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/distributed/c10d/UCCTracing.cpp.o 
/opt/cache/bin/sccache /opt/cache/bin/c++ -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DIDEEP_USE_MKL -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DPYTORCH_LAYERNORM_FAST_RECIPROCAL -DROCM_VERSION=60304 -DTORCH_ENABLE_LLVM -DTORCH_HIP_BUILD_MAIN_LIB -DTORCH_HIP_VERSION=603 -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_PROF_API=1 -DUSE_ROCM -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -D__HIP_PLATFORM_AMD__ -D__HIP_PLATFORM_AMD__=1 -Dtorch_hip_EXPORTS -I/var/lib/jenkins/pytorch/build/aten/src -I/var/lib/jenkins/pytorch/aten/src -I/var/lib/jenkins/pytorch/build -I/var/lib/jenkins/pytorch -I/var/lib/jenkins/pytorch/cmake/../third_party/benchmark/include -I/opt/llvm/include -I/var/lib/jenkins/pytorch/third_party/onnx -I/var/lib/jenkins/pytorch/build/third_party/onnx -I/var/lib/jenkins/pytorch/nlohmann -I/opt/rocm/hcc/include -I/opt/rocm/rocblas/include -I/opt/rocm/hipsparse/include -I/var/lib/jenkins/pytorch/aten/src/THH -I/var/lib/jenkins/pytorch/aten/src/ATen/hip -I/var/lib/jenkins/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include -I/var/lib/jenkins/pytorch/aten/src/ATen/../../../third_party/composable_kernel/library/include -I/var/lib/jenkins/pytorch/third_party/fmt/include -I/var/lib/jenkins/pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck -I/var/lib/jenkins/pytorch/build/caffe2/aten/src -I/var/lib/jenkins/pytorch/aten/src/ATen/.. -I/var/lib/jenkins/pytorch/torch/include -I/var/lib/jenkins/pytorch/c10/hip/../.. -I/var/lib/jenkins/pytorch/c10/.. -I/var/lib/jenkins/pytorch/torch/csrc/api -I/var/lib/jenkins/pytorch/torch/csrc/api/include -I/var/lib/jenkins/pytorch/build/third_party/gloo/hip -isystem /opt/rocm-6.3.4/include -isystem /var/lib/jenkins/pytorch/build/third_party/gloo -isystem /var/lib/jenkins/pytorch/cmake/../third_party/gloo -isystem /var/lib/jenkins/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /var/lib/jenkins/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /var/lib/jenkins/pytorch/cmake/../third_party/googletest/googletest/include -isystem /var/lib/jenkins/pytorch/third_party/protobuf/src -isystem /opt/conda/envs/py_3.10/include -isystem /var/lib/jenkins/pytorch/third_party/XNNPACK/include -isystem /var/lib/jenkins/pytorch/third_party/ittapi/include -isystem /var/lib/jenkins/pytorch/cmake/../third_party/eigen -isystem /var/lib/jenkins/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /var/lib/jenkins/pytorch/third_party/ideep/include -isystem /var/lib/jenkins/pytorch/INTERFACE -isystem /var/lib/jenkins/pytorch/third_party/nlohmann/include -isystem /opt/rocm/include -isystem /opt/rocm-6.3.4/include/hiprand -isystem /opt/rocm-6.3.4/include/rocrand -isystem /opt/rocm/magma/include -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi 
-Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -DHAVE_AVX512_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O3 -DNDEBUG -DNDEBUG -std=gnu++17 -fPIC -DMKL_HAS_SBGEMM -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -fvisibility=hidden -O2 -fPIC -D__HIP_PLATFORM_AMD__=1 -DCUDA_HAS_FP16=1 -DUSE_ROCM -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DTORCH_HIP_VERSION=603 -Wno-shift-count-negative -Wno-shift-count-overflow -Wno-duplicate-decl-specifier -DCAFFE2_USE_MIOPEN -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP -std=c++17 -DHIPBLAS_V2 -D_GLIBCXX_USE_CXX11_ABI=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS -DHIP_NEW_TYPE_ENUMS -DHIP_VERSION=6 -DUSE_MIOPEN -MD -MT caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/distributed/c10d/UCCTracing.cpp.o -MF caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/distributed/c10d/UCCTracing.cpp.o.d -o caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/distributed/c10d/UCCTracing.cpp.o -c /var/lib/jenkins/pytorch/torch/csrc/distributed/c10d/UCCTracing.cpp
sccache: encountered fatal error
sccache: error : Invalid checksum
sccache:  cause: Invalid checksum
[8116/8684] Building CXX object caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/hip/impl/HIPGuardImplMasqueradingAsCUDA.cpp.o

@rocm-repo-management-api

rocm-repo-management-api bot commented Mar 26, 2025

Jenkins build for 50407c216a8b7d66f0ff670f419f2ad469fdc87d commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@rocm-repo-management-api

rocm-repo-management-api bot commented Mar 27, 2025

Jenkins build for 50407c216a8b7d66f0ff670f419f2ad469fdc87d commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@rocm-repo-management-api

rocm-repo-management-api bot commented Mar 27, 2025

Jenkins build for 50407c216a8b7d66f0ff670f419f2ad469fdc87d commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

This PR ports to release/2.5 the generalization of vectorized elementwise kernels to multiple heterogeneous tensor types.
It does not yet include the threadblock mapping from the original PR, which was reverted and will come in a later PR.

Co-authored-by: Jerry Mannil <Jerry.Mannil@amd.com>
@carlobertolli force-pushed the GeneralizedVectorizedTemplatedEWK.rocm branch from 50407c2 to b287fde on April 1, 2025 16:08
@carlobertolli
Author

carlobertolli commented Apr 1, 2025

The latest version of this PR adds handling for output tensor types other than float, which was failing in some elementwise kernel calls, such as in-place a += b where a has type Half and b has type float. @jerrymannil thanks for catching the bug and fixing it.
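For reference, a minimal snippet (illustrative shape and device, not taken from the PR) reproducing the mixed-dtype in-place case described above:

```python
import torch

# In-place add with mixed dtypes: the math is done in the promoted
# type (float), but the result is stored in the dtype of `a` (Half).
a = torch.randn(1024, dtype=torch.half, device="cuda")
b = torch.randn(1024, dtype=torch.float, device="cuda")
a += b
print(a.dtype)  # torch.float16
```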

@rocm-repo-management-api

rocm-repo-management-api bot commented Apr 1, 2025

Jenkins build for b287fde55b004220299cfc6137b03cedf7658c64 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

c10::CppTypeToScalarType<float>::value}),
std::array<c10::ScalarType, 3>(
{c10::CppTypeToScalarType<BFloat16>::value,
c10::CppTypeToScalarType<BFloat16>::value,
Collaborator

@jerrymannil jerrymannil Apr 1, 2025

Should we support (BFloat16, float, BFloat16) and similar cases, given that dtypes can be in any order?
See, e.g., https://pytorch.org/docs/stable/generated/torch.add.html

Author

I am not sure that makes sense: we see the output tensor type as Half when we do an in-place operation and the first argument is Half, e.g.
a += b, where a has type Half and b has type float.
However, if you have
a += b, where a has type float and b has type BFloat16,
the output tensor type is float, which is captured by the first case.
I was not able to get the second example above to produce an output tensor type of BFloat16, but please let me know if you can. Maybe a non-in-place operation like
a = b + c
where a is BFloat16 or Half?
I don't have a good test case for that; if you do, please share.
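For illustration, a quick sketch (hypothetical values) of the second ordering:

```python
import torch

# float += bfloat16: the promoted type is float, and an in-place op
# keeps the dtype of the left operand, so the output stays float.
a = torch.ones(4, dtype=torch.float)
b = torch.ones(4, dtype=torch.bfloat16)
a += b
print(a.dtype)  # torch.float32
```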

Collaborator

We can create such cases using torch.add(x, y, out=z).
But I am not sure this is a widely used case,
so what we have is good for now; we can update the rules later if needed.
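For example, a minimal sketch (illustrative shapes, not from the PR) of how out= pins the output dtype:

```python
import torch

# torch.add with an explicit out= tensor forces the output dtype,
# producing a (BFloat16, float, BFloat16) input/output combination.
x = torch.randn(1024, dtype=torch.bfloat16, device="cuda")
y = torch.randn(1024, dtype=torch.float, device="cuda")
z = torch.empty(1024, dtype=torch.bfloat16, device="cuda")
torch.add(x, y, out=z)
print(z.dtype)  # torch.bfloat16
```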

output_offset_calculator,
loader,
storer);
return;
Collaborator

Since we are doing an unconditional return, we don't fall back to the legacy path for (Half, float, Half), etc.

Author

The check at line 727,
check_binary_rt_types_for_specialization(iter),
makes sure that we only enter the statically unrolled list of if statements when we know one of them is going to succeed; hence the direct return.
Please note that the check now includes the output tensor type.
If you have an example that needs to fall back but returns instead, please share.
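As a rough sketch of the control flow being described (Python pseudocode; the dtype triples and helper names are illustrative, with only check_binary_rt_types_for_specialization mirroring an identifier from the PR):

```python
# Hypothetical sketch: the unrolled branch list is entered only when
# the runtime-type check guarantees a match, so each branch can
# return unconditionally without losing the legacy fallback.
SPECIALIZED = {("Half", "float", "Half"), ("BFloat16", "float", "BFloat16")}

def check_binary_rt_types_for_specialization(types):
    # True only if one of the unrolled branches below will match.
    return types in SPECIALIZED

def launch(types):
    if not check_binary_rt_types_for_specialization(types):
        return "legacy kernel"  # fallback path remains reachable
    if types == ("Half", "float", "Half"):
        return "specialized Half kernel"  # direct return is safe here
    if types == ("BFloat16", "float", "BFloat16"):
        return "specialized BFloat16 kernel"

print(launch(("Half", "float", "Half")))       # specialized Half kernel
print(launch(("double", "double", "double")))  # legacy kernel
```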

Collaborator

OK, looks good then.

@jerrymannil jerrymannil merged commit 640334b into ROCm:release/2.5 Apr 2, 2025
3 of 7 checks passed
