@pruthvistony
Collaborator

@rocm-mici

Jenkins build for ed0e6e5e9af1fe15a82aa2fa53f510c97318b6b6 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@pruthvistony
Collaborator Author

pruthvistony commented Oct 14, 2024

I am not sure why this dependency is needed, but the build fails if I remove the additional HSA line for torch_cpu:
https://github.com/ROCm/pytorch/pull/1633/files#diff-c5ee05f1e918772792ff6f2a3f579fc2f182e57b1709fd786ef6dc711fd68b27R1420

I get this error -

[2024-10-14T06:44:26.129Z] FAILED: bin/BackoffTest
[2024-10-14T06:44:26.129Z] : && /opt/cache/bin/c++ -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -DHAVE_AVX512_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O3 -DNDEBUG -DNDEBUG -rdynamic -Wl,-rpath-link,/usr/lib/x86_64-linux-gnu -Wl,--no-as-needed test_cpp_c10d/CMakeFiles/BackoffTest.dir/BackoffTest.cpp.o -o bin/BackoffTest -L/lib/intel64 -L/lib/intel64_win -L/lib/win-x64 -Wl,-rpath,/lib/intel64:/lib/intel64_win:/lib/win-x64:/var/lib/jenkins/pytorch/build/lib:/opt/conda/envs/py_3.12/lib lib/libtorch_cpu.so lib/libgtest_main.a -lpthread lib/libprotobuf.a lib/libc10.so /opt/conda/envs/py_3.12/lib/libmkl_intel_lp64.so /opt/conda/envs/py_3.12/lib/libmkl_gnu_thread.so /opt/conda/envs/py_3.12/lib/libmkl_core.so -fopenmp /usr/lib/x86_64-linux-gnu/libpthread.a -lm /usr/lib/x86_64-linux-gnu/libdl.a lib/libgtest.a && /opt/conda/envs/py_3.12/bin/cmake -E __run_co_compile --lwyu="ldd;-u;-r" --source=bin/BackoffTest && :
[2024-10-14T06:44:26.129Z] /usr/bin/ld: /opt/rocm/lib/libamdhip64.so.6: undefined reference to `hsa_amd_vmem_map@ROCR_1'
[2024-10-14T06:44:26.129Z] /usr/bin/ld: /opt/rocm/lib/libamdhip64.so.6: undefined reference to `hsa_amd_vmem_address_reserve@ROCR_1'
[2024-10-14T06:44:26.129Z] /usr/bin/ld: /opt/rocm/lib/libamdhip64.so.6: undefined reference to `hsa_amd_vmem_address_free@ROCR_1'
[2024-10-14T06:44:26.129Z] /usr/bin/ld: /opt/rocm/lib/libamdhip64.so.6: undefined reference to `hsa_amd_vmem_handle_release@ROCR_1'
[2024-10-14T06:44:26.129Z] /usr/bin/ld: /opt/rocm/lib/libamdhip64.so.6: undefined reference to `hsa_amd_vmem_export_shareable_handle@ROCR_1'
[2024-10-14T06:44:26.129Z] /usr/bin/ld: /opt/rocm/lib/libamdhip64.so.6: undefined reference to `hsa_amd_vmem_import_shareable_handle@ROCR_1'
[2024-10-14T06:44:26.130Z] /usr/bin/ld: /opt/rocm/lib/libamdhip64.so.6: undefined reference to `hsa_amd_vmem_handle_create@ROCR_1'
[2024-10-14T06:44:26.130Z] /usr/bin/ld: /opt/rocm/lib/libamdhip64.so.6: undefined reference to `hsa_amd_vmem_unmap@ROCR_1'
[2024-10-14T06:44:26.130Z] /usr/bin/ld: /opt/rocm/lib/libamdhip64.so.6: undefined reference to `hsa_amd_vmem_set_access@ROCR_1'
[2024-10-14T06:44:26.130Z] /usr/bin/ld: /opt/rocm/lib/libamdhip64.so.6: undefined reference to `hsa_amd_vmem_get_access@ROCR_1'

for [2024-10-14T06:44:26.130Z] FAILED: bin/test_edge_op_registration

link - http://rocm-ci.amd.com/job/mainline-framework-pytorch-2.4-ub24-py3.12-ci/21/
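
To see at a glance which symbols the linker could not resolve, a log like the one above can be reduced to a list of names. This is only a minimal sketch: the function name `extract_undefined_symbols` is hypothetical, and it assumes the usual GNU ld wording `undefined reference to`.

```shell
# Hypothetical helper: read a linker log on stdin and print each distinct
# symbol named in an "undefined reference to" message, without the
# @VERSION suffix (for example @ROCR_1).
extract_undefined_symbols() {
  sed -n 's/.*undefined reference to [^A-Za-z_]*\([A-Za-z_][A-Za-z0-9_]*\).*/\1/p' | sort -u
}
```

Each reported name can then be looked up in a candidate library, for example with `nm -D --defined-only /opt/rocm/lib/libhsa-runtime64.so.1 | grep hsa_amd_vmem`, to check whether that library exports it.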

Collaborator

@jeffdaily jeffdaily left a comment

Why is ROCM_HSART_LIB suddenly needed?

@rocm-mici

Jenkins build for ed0e6e5e9af1fe15a82aa2fa53f510c97318b6b6 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@pruthvistony
Collaborator Author

docker image - compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-dkms-no-npi-hipclang:14929_ubuntu24.04_py3.12_pytorch_rocm6.3_ub24_ed0e6e5

Collaborator

@jithunnair-amd jithunnair-amd left a comment

Suggested some changes and asked for another look at the ROCM_HSART_LIB changes.

fi

if [ "$ANACONDA_PYTHON_VERSION" = "3.12" ] ; then
conda_install_through_forge libstdcxx-ng=14
Collaborator

Is the libstdcxx-ng version dependent on the Ubuntu version or the Python version?

If I recall correctly, this is the lowest version available for Ubuntu 24.04; Python has no strict requirement for this version. I'm not 100% sure, though, since I worked on this some time ago.

Collaborator

It sounds like, at a minimum, this condition needs to check the OS version. I assume the reason is similar to pytorch#121556, where we get symbol version errors when building PyTorch?

Collaborator Author

This depends on Ubuntu 24.04, where the minimum Python version is 3.12, so I will update the condition.

Collaborator Author

Updated the condition.
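
The updated condition itself is not shown in this thread. As a minimal sketch, assuming the check ends up keyed on the OS version alone (an assumption; the merged change may also test the Python version), it could look something like this. `needs_newer_libstdcxx` is a hypothetical helper name; `UBUNTU_VERSION` and `conda_install_through_forge` come from the snippets under review.

```shell
# Hypothetical helper: succeeds only for the Ubuntu release that needs the
# newer libstdcxx-ng from conda-forge (assumed here to be 24.04 only).
needs_newer_libstdcxx() {
  [ "$1" = "24.04" ]
}

# Sketch of the call site. It is left as a comment because
# conda_install_through_forge is defined by the surrounding install script:
# if needs_newer_libstdcxx "$UBUNTU_VERSION"; then
#   conda_install_through_forge libstdcxx-ng=14
# fi
```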

if [[ $UBUNTU_VERSION == 24.04 ]]; then
    apt-get install -y --no-install-recommends gpg-agent
    echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' \
        | sudo tee /etc/apt/preferences.d/rocm-pin-600
Collaborator

According to DevOps, this is needed for Ubuntu 22.04 onwards (as also confirmed by https://github.com/ROCm/ROCm-docker/blob/master/build_all.sh). So we should:

  1. Update this condition to apply to 22.04 onwards.
  2. Cherry-pick this change into all release branches.
  3. Remove the corresponding patch logic in DevOps's build_pytorch.bash.
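
For the first item, an exact string match on the version no longer works once the condition has to cover 22.04 and everything after it. One possible way to express "22.04 onwards" in POSIX shell is sketched below; `ubuntu_at_least_2204` is a hypothetical helper, not something taken from the PR.

```shell
# Hypothetical helper: succeeds when the given Ubuntu version (e.g. "22.04")
# is 22.04 or newer. Only the major and minor fields are compared.
# Note: POSIX sh has no local variables, so major/rest/minor are global.
ubuntu_at_least_2204() {
  major="${1%%.*}"      # text before the first dot, e.g. "22"
  rest="${1#*.}"        # text after the first dot, e.g. "04" or "04.3"
  minor="${rest%%.*}"   # drop any point-release suffix, e.g. "04"
  [ "$major" -gt 22 ] || { [ "$major" -eq 22 ] && [ "$minor" -ge 4 ]; }
}
```

A call site could then read `if ubuntu_at_least_2204 "$UBUNTU_VERSION"; then ... fi` in place of the `== 24.04` comparison.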

if [[ $UBUNTU_VERSION == 24.04 ]]; then
    # touch is used to disable harmless error message
    touch /var/mail/ubuntu && chown ubuntu /var/mail/ubuntu && userdel -r ubuntu
fi
Collaborator

This would seem to be a problem for upstream PyTorch as well. Can we file an upstream GitHub issue with logs and error snippets so that they're aware of this and might come up with a different way to address this? It's okay to merge this patch in rocm6.3_internal_testing though.

Collaborator

@iupaikov-amd Can you please file a GitHub issue on pytorch/pytorch for this? The upstream PyTorch team would like some more details. Please discuss with Pruthvi or me if you have questions about what information to include in the issue.

Upstream issue about install_user.sh: pytorch#138812
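
The snippet under review removes the `ubuntu` account unconditionally on 24.04. A more defensive variant might first check that the account exists. The sketch below assumes the goal is simply to drop a pre-existing `ubuntu` account when the base image has one; the thread does not state the motivation, so treat that as an assumption. `user_exists` is a hypothetical helper.

```shell
# Hypothetical helper: succeeds if the named account exists on this system.
user_exists() {
  id -u "$1" >/dev/null 2>&1
}

# Sketch of the guarded removal, left commented out because it is
# destructive and needs root:
# if user_exists ubuntu; then
#   touch /var/mail/ubuntu && chown ubuntu /var/mail/ubuntu && userdel -r ubuntu
# fi
```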

using aotriton::v2::flash::attn_bwd;
using sdp::aotriton_adapter::mk_aotensor;
using sdp::aotriton_adapter::cast_dtype;
using sdp::aotriton_adapter::mk_aoscalartensor;
Collaborator

This change is already in rocm6.3_internal_testing via 7ac294f.
Need a rebase.

target_link_libraries(torch_cpu PUBLIC ${Caffe2_PUBLIC_DEPENDENCY_LIBS})
target_link_libraries(torch_cpu PRIVATE ${Caffe2_DEPENDENCY_LIBS})
target_link_libraries(torch_cpu PRIVATE ${Caffe2_DEPENDENCY_WHOLE_LINK_LIBS})
target_link_libraries(torch_cpu PUBLIC ${ROCM_HSART_LIB})
Collaborator

The other ROCM_HSART_LIB usages seem like they should be avoidable as well, but this one seems the most egregious, being a torch_cpu dependency? I wonder if @naromero77amd's latest refactor in https://github.com/pytorch/pytorch/pull/137112/files might help with all the ROCM_HSART_LIB occurrences because it uses CMake targets instead of paths to .so files.

Collaborator Author

@pruthvistony pruthvistony Oct 22, 2024

@jithunnair-amd, torch_cpu already depended on libamdhip64.so before this change:

root@ctr-ubbsmc12:/var/lib/jenkins/pytorch/build/lib# ldd libtorch_cpu.so
linux-vdso.so.1 (0x00007fff799f6000)
libc10.so (0x00007fd3ae78e000)
libgcc_s.so.1 => /opt/conda/envs/py_3.10/lib/libgcc_s.so.1 (0x00007fd3ae775000)
libmkl_intel_lp64.so.1 => /opt/conda/envs/py_3.10/lib/libmkl_intel_lp64.so.1 (0x00007fd3adbd6000)
libmkl_gnu_thread.so.1 => /opt/conda/envs/py_3.10/lib/libmkl_gnu_thread.so.1 (0x00007fd3ac04b000)
libmkl_core.so.1 => /opt/conda/envs/py_3.10/lib/libmkl_core.so.1 (0x00007fd3a7bdb000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fd3a7ae6000)
libgomp.so.1 => /opt/conda/envs/py_3.10/lib/libgomp.so.1 (0x00007fd3a7aad000)
libroctracer64.so.4 => /opt/rocm/lib/libroctracer64.so.4 (0x00007fd3a7a44000)
libamdhip64.so.6 => /opt/rocm/lib/libamdhip64.so.6 (0x00007fd3a6181000)
libstdc++.so.6 => /opt/conda/envs/py_3.10/lib/libstdc++.so.6 (0x00007fd3a5fcd000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fd3a5da2000)
/lib64/ld-linux-x86-64.so.2 (0x00007fd3bb09b000)
libnuma.so.1 => /lib/x86_64-linux-gnu/libnuma.so.1 (0x00007fd3a5d95000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fd3a5d90000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fd3a5d8b000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fd3a5d86000)
libhsa-runtime64.so.1 => /opt/rocm/lib/libhsa-runtime64.so.1 (0x00007fd3a5a49000)
librocprofiler-register.so.0 => /opt/rocm/lib/librocprofiler-register.so.0 (0x00007fd3a59c7000)
libamd_comgr.so.2 => /opt/rocm/lib/libamd_comgr.so.2 (0x00007fd39b4e1000)
libelf.so.1 => /lib/x86_64-linux-gnu/libelf.so.1 (0x00007fd39b4c3000)
libdrm.so.2 => /opt/amdgpu/lib/x86_64-linux-gnu/libdrm.so.2 (0x00007fd39b4a9000)
libdrm_amdgpu.so.1 => /opt/amdgpu/lib/x86_64-linux-gnu/libdrm_amdgpu.so.1 (0x00007fd39b497000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fd39b47b000)
libzstd.so.1 => /lib/x86_64-linux-gnu/libzstd.so.1 (0x00007fd39b3ac000)
libtinfo.so.6 => /lib/x86_64-linux-gnu/libtinfo.so.6 (0x00007fd39b37a000)
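
Note that `ldd` prints the full transitive set of libraries that would be loaded, so a listing like this shows what gets pulled in but not which libraries `libtorch_cpu.so` names directly. As a small sketch for pulling the ROCm entries out of such a listing (the helper name `rocm_deps` is hypothetical, and the `/opt/rocm/` prefix is taken from the output above):

```shell
# Hypothetical helper: read `ldd` output on stdin and print the sonames of
# the libraries that were resolved to a path under /opt/rocm/.
rocm_deps() {
  awk '$2 == "=>" && index($3, "/opt/rocm/") == 1 { print $1 }'
}
```

Typical use would be `ldd libtorch_cpu.so | rocm_deps`. To see only the direct dependencies recorded in the library itself, `readelf -d libtorch_cpu.so | grep NEEDED` lists its NEEDED entries.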

@pruthvistony
Collaborator Author

pruthvistony commented Oct 15, 2024

There is a HIP sample at /opt/rocm/share/hip/samples/2_Cookbook/15_static_library/host_functions/CMakeLists.txt:

target_link_libraries(test_opt_static PRIVATE amdhip64 amd_comgr hsa-runtime64::hsa-runtime64)

I removed amd_comgr and hsa-runtime64::hsa-runtime64 from that line, and test_opt_static still built fine, so this problem does not seem to be universal.

@rocm-mici

Jenkins build for ed0e6e5e9af1fe15a82aa2fa53f510c97318b6b6 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@rocm-mici

Jenkins build for 243f1f9e0b400a38c09d0a59b74adb1b77e99b3f commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@pruthvistony
Collaborator Author

@pruthvistony pruthvistony merged commit 8e3ef93 into rocm6.3_internal_testing Oct 23, 2024
@pruthvistony pruthvistony deleted the rocm6.3_ub24 branch October 23, 2024 05:45
@jithunnair-amd
Collaborator

docker image - compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-dkms-no-npi-hipclang:14929_ubuntu24.04_py3.12_pytorch_rocm6.3_ub24_ed0e6e5

Generated by CI build: http://rocm-ci.amd.com/job/mainline-framework-pytorch-2.4-ub24-py3.12-ci/22/

@jithunnair-amd jithunnair-amd changed the title from "Changes to support UB 24.04 build" to "[rocm6.3_internal_testing] Changes to support UB 24.04 build" Jan 6, 2025