@amd-sriram amd-sriram commented Mar 19, 2025

Change the preprocessor flag so that CUDAPluggableAllocator uses the correct streamType. The current flag breaks Distributed Fused Adam in ROCm/apex.

See PR ROCm/apex#184

Related issue: https://ontrack-internal.amd.com/browse/SWDEV-519796

This fixes the following error when building apex:
2025-03-25T14:19:25.0187731Z #5 1815.5 FAILED: /root/dockerbuild/pytorch.1.7.0a0/apex/build/temp.linux-x86_64-cpython-312/apex/contrib/csrc/nccl_allocator/NCCLAllocator_hip.o

torch/include/torch/csrc/cuda/CUDAPluggableAllocator.h:125:8: error: ‘void torch::cuda::CUDAPluggableAllocator::CUDAPluggableAllocator::recordStream(const c10::DataPtr&, torch::cuda::CUDAPluggableAllocator::streamType)’ marked ‘override’, but does not override
2025-03-25T14:19:25.0192670Z #5 1815.5   125 |   void recordStream(const c10::DataPtr&, streamType stream) override;
2025-03-25T14:19:25.0192737Z #5 1815.5       |        ^~~~~~~~~~~~
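To illustrate the class of error in the log above, here is a minimal sketch of how a mismatched type alias makes a member function fail to override. All type names (`CudaStreamStub`, `HipStreamStub`, `DataPtrStub`, `BaseAllocatorStub`) are hypothetical stand-ins, not the real c10 or torch classes, and which side ends up with which stream type is an assumption made only for the demonstration.

```cpp
#include <cassert>  // used only by the checks shown in the usage note
#include <string>

// Hypothetical stand-ins; these are NOT the real c10 stream or DataPtr classes.
struct CudaStreamStub {};
struct HipStreamStub {};
struct DataPtrStub {};

// Base interface: recordStream takes one concrete stream type.
struct BaseAllocatorStub {
  virtual ~BaseAllocatorStub() = default;
  virtual std::string recordStream(const DataPtrStub&, HipStreamStub) {
    return "base";
  }
};

// The alias resolves to the same type as the base parameter, so the
// signatures match and `override` is accepted.
struct MatchingAllocator : BaseAllocatorStub {
  using streamType = HipStreamStub;
  std::string recordStream(const DataPtrStub&, streamType) override {
    return "derived";
  }
};

// The alias resolves to a different type, so this declares a new overload
// instead of overriding. Adding `override` to it would be rejected with
// "marked 'override', but does not override".
struct MismatchedAllocator : BaseAllocatorStub {
  using streamType = CudaStreamStub;
  using BaseAllocatorStub::recordStream;  // keep the base overload visible
  std::string recordStream(const DataPtrStub&, streamType) {
    return "derived";
  }
};

// Calls recordStream through the base interface, as a framework would.
std::string dispatch(BaseAllocatorStub& alloc) {
  return alloc.recordStream(DataPtrStub{}, HipStreamStub{});
}
```

Calling `dispatch` on a `MatchingAllocator` reaches the derived implementation, while calling it on a `MismatchedAllocator` silently falls back to the base one. The `override` keyword turns that silent fallback into the compile-time error seen in the build log.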

@amd-sriram amd-sriram self-assigned this Mar 19, 2025

rocm-repo-management-api bot commented Mar 19, 2025

Jenkins build for 34a22804e52a055998431cbb4830953ba4a3eb3a commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@amd-sriram amd-sriram force-pushed the fix_nccl_ub_fail_apex_distributed_fused_adam branch from 34a2280 to f12755b Compare March 20, 2025 18:41

rocm-repo-management-api bot commented Mar 20, 2025

Jenkins build for f12755bad0567dcaf30304e55c7a870af87f9caa commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@amd-sriram amd-sriram changed the title from "[ROCm6.4_internal_testing]" to "[ROCm6.4_internal_testing] Update CUDAPluggableAllocator.h" Mar 24, 2025

rocm-repo-management-api bot commented Mar 24, 2025

Jenkins build for f12755bad0567dcaf30304e55c7a870af87f9caa commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@pruthvistony pruthvistony merged commit 5f50cdd into rocm6.4_internal_testing Mar 24, 2025
0 of 3 checks passed
@pruthvistony pruthvistony deleted the fix_nccl_ub_fail_apex_distributed_fused_adam branch March 24, 2025 17:23
akashveramd pushed a commit that referenced this pull request Apr 1, 2025
Change the flag so that the CUDAPluggableAllocator class uses the correct streamType on ROCm GPUs. The TORCH_HIP_VERSION flag does not work as intended on ROCm, so it is replaced with USE_ROCM. The problem affects Distributed Fused Adam in ROCm/apex when the nccl_ub feature is used. The change has been tested with ROCm/apex.

See PR ROCm/apex#184

Pull Request resolved: pytorch#150010
Approved by: https://github.com/jeffdaily
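
The flag swap described in this commit message can be sketched as a macro-guarded type alias. This is only an illustration under stated assumptions: `CudaStreamStub`, `HipStreamStub`, `backendName`, and `selectedBackend` are hypothetical names, and the real stream types used in CUDAPluggableAllocator.h are not shown in this thread. Only the macro names `TORCH_HIP_VERSION` and `USE_ROCM` come from the commit message.

```cpp
#include <cassert>  // used only by the checks shown in the usage note
#include <string>

// Hypothetical stand-ins for the CUDA and HIP stream classes.
struct CudaStreamStub {};
struct HipStreamStub {};

// Guard on the build-level macro named in the commit message. The earlier
// guard, per the commit message, tested TORCH_HIP_VERSION instead.
#if defined(USE_ROCM)
using streamType = HipStreamStub;
#else
using streamType = CudaStreamStub;
#endif

// Overloads that report which stub the alias resolved to.
inline std::string backendName(CudaStreamStub) { return "cuda"; }
inline std::string backendName(HipStreamStub) { return "hip"; }

// Returns "hip" when compiled with -DUSE_ROCM, "cuda" otherwise.
inline std::string selectedBackend() { return backendName(streamType{}); }
```

Compiling this sketch with and without `-DUSE_ROCM` and calling `selectedBackend()` shows which branch the guard took. The point of the design is that the alias should key off a macro that is reliably defined in every ROCm translation unit that includes the header, including out-of-tree extensions such as apex.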
jeffdaily pushed a commit that referenced this pull request Jun 4, 2025
[ROCm] Update CUDAPluggableAllocator.h (#1984) (pytorch#150010)

Change the flag so that the CUDAPluggableAllocator class uses the correct streamType on ROCm GPUs. The TORCH_HIP_VERSION flag does not work as intended on ROCm, so it is replaced with USE_ROCM. The problem affects Distributed Fused Adam in ROCm/apex when the nccl_ub feature is used. The change has been tested with ROCm/apex.

See PR ROCm/apex#184

Pull Request resolved: pytorch#150010
Approved by: https://github.com/jeffdaily

(cherry picked from commit a19b667)

Co-authored-by: Sriram Kumar <skishore@amd.com>