[rocm6.3_internal_testing] Changes to support UB 24.04 build #1633
Jenkins build for ed0e6e5e9af1fe15a82aa2fa53f510c97318b6b6 commit finished as FAILURE
I'm not sure why this dependency is needed, but if I remove the additional HSA line for torch_cpu, I get these errors:
[2024-10-14T06:44:26.129Z] FAILED: bin/BackoffTest
[2024-10-14T06:44:26.130Z] FAILED: bin/test_edge_op_registration
Link: http://rocm-ci.amd.com/job/mainline-framework-pytorch-2.4-ub24-py3.12-ci/21/
jeffdaily left a comment
Why is ROCM_HSART_LIB suddenly needed?
Jenkins build for ed0e6e5e9af1fe15a82aa2fa53f510c97318b6b6 commit finished as FAILURE
Docker image: compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-dkms-no-npi-hipclang:14929_ubuntu24.04_py3.12_pytorch_rocm6.3_ub24_ed0e6e5
jithunnair-amd left a comment
Suggested some changes and requested another look at the ROCM_HSART_LIB changes.
    fi

    if [ "$ANACONDA_PYTHON_VERSION" = "3.12" ] ; then
      conda_install_through_forge libstdcxx-ng=14
Is the libstdcxx-ng version dependent on Ubuntu version or python version?
If I recall correctly, this is the lowest version available for Ubuntu 24.04; Python doesn't strictly require this version. I'm not 100% sure, though, since I worked on this some time ago.
Sounds like at the minimum, this condition needs to be on OS version. I assume the reason is similar to pytorch#121556 where we get symbol version errors when building PyTorch?
This is dependent on Ubuntu 24.04, and the minimum Python version there is 3.12, so I will update the condition.
Updated the condition.
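For reference, a minimal sketch of what keying this condition on the OS version (rather than the Python version) might look like. It is not necessarily the condition that was merged. It assumes `UBUNTU_VERSION` is set in the Docker build environment, as in the other snippets quoted in this review; the helper names `needs_libstdcxx_14` and `libstdcxx_install_cmd` are hypothetical, and the sketch only prints the install command instead of running it.

```shell
# Hypothetical helper: decide from the OS version (not the Python version)
# whether the newer libstdcxx-ng should be installed.
needs_libstdcxx_14() {
  case "$1" in
    24.04*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Dry run: print the command that would be executed instead of running it.
libstdcxx_install_cmd() {
  if needs_libstdcxx_14 "$1"; then
    echo "conda_install_through_forge libstdcxx-ng=14"
  fi
}

# Falls back to 24.04 when UBUNTU_VERSION is unset (assumption for the demo).
libstdcxx_install_cmd "${UBUNTU_VERSION:-24.04}"
```

Printing the command keeps the sketch runnable outside the CI image, where `conda_install_through_forge` is not defined.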
    if [[ $UBUNTU_VERSION == 24.04 ]]; then
      apt-get install -y --no-install-recommends gpg-agent
      echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' \
        | sudo tee /etc/apt/preferences.d/rocm-pin-600
According to DevOps, this is needed for Ubuntu 22.04 onwards (as also confirmed by https://github.com/ROCm/ROCm-docker/blob/master/build_all.sh). So we should:
- update this condition to apply to 22.04 onwards,
- cherry-pick this change into all release branches, and
- remove the corresponding patch logic in DevOps's build_pytorch.bash.
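A rough sketch of how the "22.04 onwards" condition could be expressed, assuming `UBUNTU_VERSION` holds a string such as `24.04`. The helper names `ubuntu_at_least_22_04` and `rocm_pin_contents` are hypothetical, and the sketch prints the pin file contents rather than writing them under `/etc/apt`.

```shell
# Hypothetical helper: succeed when an Ubuntu version string such as "24.04"
# is 22.04 or newer. Plain POSIX sh, no external tools.
ubuntu_at_least_22_04() {
  major=${1%%.*}
  case "$1" in
    *.*) rest=${1#*.}; minor=${rest%%.*} ;;
    *)   minor=0 ;;
  esac
  minor=${minor#0}   # drop one leading zero ("04" -> "4")
  minor=${minor:-0}
  [ "$major" -gt 22 ] || { [ "$major" -eq 22 ] && [ "$minor" -ge 4 ]; }
}

# Contents of the apt pin file quoted in the diff above.
rocm_pin_contents() {
  printf 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600\n'
}

if ubuntu_at_least_22_04 "${UBUNTU_VERSION:-24.04}"; then
  # In the real script this output would be piped to:
  #   sudo tee /etc/apt/preferences.d/rocm-pin-600
  rocm_pin_contents
fi
```

Comparing major and minor numerically avoids having to extend a string match each time a new Ubuntu release is added.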
    if [[ $UBUNTU_VERSION == 24.04 ]]; then
      # touch is used to disable harmless error message
      touch /var/mail/ubuntu && chown ubuntu /var/mail/ubuntu && userdel -r ubuntu
    fi
This would seem to be a problem for upstream PyTorch as well. Can we file an upstream GitHub issue with logs and error snippets so that they're aware of this and might come up with a different way to address this? It's okay to merge this patch in rocm6.3_internal_testing though.
@iupaikov-amd Can you please file a GitHub issue on pytorch/pytorch for this? The upstream PyTorch team would like some more details. Please discuss with Pruthvi or me if you have questions about what information to include in the issue.
Upstream issue about install_user.sh: pytorch#138812
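One possible variation on the `userdel` step discussed in this thread, shown only as a sketch: guard the removal on the user actually existing, so the step is a no-op on images that do not ship a default `ubuntu` account. The helper names `user_exists` and `remove_default_user_cmd` are hypothetical, and the sketch prints the commands instead of running them, since they require root.

```shell
# Succeeds when the named account exists on this system.
user_exists() {
  id -u "$1" >/dev/null 2>&1
}

# Dry run: print the cleanup commands rather than executing them.
remove_default_user_cmd() {
  if user_exists "$1"; then
    echo "touch /var/mail/$1 && chown $1 /var/mail/$1 && userdel -r $1"
  else
    echo "# user $1 not present, nothing to do"
  fi
}

remove_default_user_cmd ubuntu
```

Whether an existence check or the explicit OS version check is preferable here is a judgment call for the upstream discussion.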
    using aotriton::v2::flash::attn_bwd;
    using sdp::aotriton_adapter::mk_aotensor;
    using sdp::aotriton_adapter::cast_dtype;
    using sdp::aotriton_adapter::mk_aoscalartensor;
This change is already in rocm6.3_internal_testing via 7ac294f. Need a rebase.
    target_link_libraries(torch_cpu PUBLIC ${Caffe2_PUBLIC_DEPENDENCY_LIBS})
    target_link_libraries(torch_cpu PRIVATE ${Caffe2_DEPENDENCY_LIBS})
    target_link_libraries(torch_cpu PRIVATE ${Caffe2_DEPENDENCY_WHOLE_LINK_LIBS})
    target_link_libraries(torch_cpu PUBLIC ${ROCM_HSART_LIB})
The other ROCM_HSART_LIB usages seem like they should be avoidable as well, but this one seems the most egregious, being a torch_cpu dependency? I wonder if @naromero77amd's latest refactor in https://github.com/pytorch/pytorch/pull/137112/files might help with all the ROCM_HSART_LIB occurrences because it uses CMake targets instead of paths to .so files.
@jithunnair-amd,
torch_cpu was already dependent on libamdhip64.so before this change.
root@ctr-ubbsmc12:/var/lib/jenkins/pytorch/build/lib# ldd libtorch_cpu.so
linux-vdso.so.1 (0x00007fff799f6000)
libc10.so (0x00007fd3ae78e000)
libgcc_s.so.1 => /opt/conda/envs/py_3.10/lib/libgcc_s.so.1 (0x00007fd3ae775000)
libmkl_intel_lp64.so.1 => /opt/conda/envs/py_3.10/lib/libmkl_intel_lp64.so.1 (0x00007fd3adbd6000)
libmkl_gnu_thread.so.1 => /opt/conda/envs/py_3.10/lib/libmkl_gnu_thread.so.1 (0x00007fd3ac04b000)
libmkl_core.so.1 => /opt/conda/envs/py_3.10/lib/libmkl_core.so.1 (0x00007fd3a7bdb000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fd3a7ae6000)
libgomp.so.1 => /opt/conda/envs/py_3.10/lib/libgomp.so.1 (0x00007fd3a7aad000)
libroctracer64.so.4 => /opt/rocm/lib/libroctracer64.so.4 (0x00007fd3a7a44000)
libamdhip64.so.6 => /opt/rocm/lib/libamdhip64.so.6 (0x00007fd3a6181000)
libstdc++.so.6 => /opt/conda/envs/py_3.10/lib/libstdc++.so.6 (0x00007fd3a5fcd000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fd3a5da2000)
/lib64/ld-linux-x86-64.so.2 (0x00007fd3bb09b000)
libnuma.so.1 => /lib/x86_64-linux-gnu/libnuma.so.1 (0x00007fd3a5d95000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fd3a5d90000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fd3a5d8b000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fd3a5d86000)
libhsa-runtime64.so.1 => /opt/rocm/lib/libhsa-runtime64.so.1 (0x00007fd3a5a49000)
librocprofiler-register.so.0 => /opt/rocm/lib/librocprofiler-register.so.0 (0x00007fd3a59c7000)
libamd_comgr.so.2 => /opt/rocm/lib/libamd_comgr.so.2 (0x00007fd39b4e1000)
libelf.so.1 => /lib/x86_64-linux-gnu/libelf.so.1 (0x00007fd39b4c3000)
libdrm.so.2 => /opt/amdgpu/lib/x86_64-linux-gnu/libdrm.so.2 (0x00007fd39b4a9000)
libdrm_amdgpu.so.1 => /opt/amdgpu/lib/x86_64-linux-gnu/libdrm_amdgpu.so.1 (0x00007fd39b497000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fd39b47b000)
libzstd.so.1 => /lib/x86_64-linux-gnu/libzstd.so.1 (0x00007fd39b3ac000)
libtinfo.so.6 => /lib/x86_64-linux-gnu/libtinfo.so.6 (0x00007fd39b37a000)
There is a hip sample - removed
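A note on reading the listing above: `ldd` prints the full transitive closure of shared-library dependencies, so a library appearing there does not by itself show that `libtorch_cpu.so` links to it directly. The direct dependencies are the `NEEDED` entries in the dynamic section, which `readelf -d` prints. The sketch below filters those entries; the helper name `direct_deps` is hypothetical, and the sample input is hand-written to mimic readelf's format, not taken from the actual build.

```shell
# Print the direct (DT_NEEDED) dependencies from `readelf -d` output on stdin.
direct_deps() {
  sed -n 's/.*(NEEDED).*\[\(.*\)\].*/\1/p'
}

# Usage against a real build (the path is an assumption):
#   readelf -d build/lib/libtorch_cpu.so | direct_deps

# Self-contained demonstration on an illustrative, hand-written sample.
sample=' 0x0000000000000001 (NEEDED)             Shared library: [libc10.so]
 0x0000000000000001 (NEEDED)             Shared library: [libamdhip64.so.6]
 0x000000000000000e (SONAME)             Library soname: [libtorch_cpu.so]'
printf '%s\n' "$sample" | direct_deps
```

Running this against builds with and without the `ROCM_HSART_LIB` line would show whether `libhsa-runtime64.so.1` is a direct dependency of `torch_cpu` or only pulled in through `libamdhip64.so`.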
Jenkins build for ed0e6e5e9af1fe15a82aa2fa53f510c97318b6b6 commit finished as FAILURE
ed0e6e5 to 243f1f9
Jenkins build for 243f1f9e0b400a38c09d0a59b74adb1b77e99b3f commit finished as FAILURE
Successful build - http://rocm-ci.amd.com/job/mainline-framework-pytorch-2.4-ub24-py3.12-ci/37/
Changes are based on branch by @iupaikov-amd - https://github.com/ROCm/pytorch/tree/ypajkov/ubuntu_noble_build
Generated by CI build: http://rocm-ci.amd.com/job/mainline-framework-pytorch-2.4-ub24-py3.12-ci/22/ |
Changes are based on branch by @iupaikov-amd - https://github.com/ROCm/pytorch/tree/ypajkov/ubuntu_noble_build
Validation: http://rocm-ci.amd.com/view/Release-6.4/job/framework-pytorch-2.3-ub24-py3.12-ci_rel-6.4/3/
Fixes https://ontrack-internal.amd.com/browse/SWDEV-520708
Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>