
TE building over TheRock#511

Merged
ipanfilo merged 6 commits into dev from ipanfilo/therock_wheels
Apr 1, 2026

Conversation

@ipanfilo
Collaborator

Description

Support TE wheels building over ROCm TheRock

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Update wheels Docker file to use TheRock
  • Update different components build and initialization to use TheRock default location.
  • Update CI scripts
  • Make RTC code changes to be aware of TheRock default layout
  • Support a comma-separated list of GPU architectures
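
As a rough illustration of the last item, the sketch below accepts either a comma-separated or a semicolon-separated architecture string and normalizes it to the semicolon form that CMake lists use. The helper names `parse_gpu_archs` and `to_cmake_list` are hypothetical and are not taken from this PR; the actual parsing in the build scripts may differ.

```python
import re


def parse_gpu_archs(value: str) -> list[str]:
    """Split an arch string on commas or semicolons (hypothetical helper).

    Blank entries are dropped and duplicates are removed while
    preserving the original order.
    """
    archs: list[str] = []
    for item in re.split(r"[;,]", value):
        item = item.strip()
        if item and item not in archs:
            archs.append(item)
    return archs


def to_cmake_list(value: str) -> str:
    """Join the parsed architectures with ';', the CMake list separator."""
    return ";".join(parse_gpu_archs(value))
```

With this sketch, `to_cmake_list("gfx942, gfx950")` and `to_cmake_list("gfx942;gfx950")` both yield the same semicolon-separated string.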

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@ipanfilo ipanfilo added the ci-level 1 CI test level 1 label Mar 30, 2026
@ipanfilo ipanfilo mentioned this pull request Mar 30, 2026
13 tasks
@Micky774
Contributor

May we also swap these two lines, or modify the second into an append:

TransformerEngine/setup.py

Lines 102 to 103 in 2cb27a3

cmake_flags.append("-DUSE_ROCM=OFF")
cmake_flags = ["-DCMAKE_CUDA_ARCHITECTURES={}".format(archs)]
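
For context, in the two lines as quoted, the assignment rebinds `cmake_flags` to a fresh list, so the `-DUSE_ROCM=OFF` flag appended just before it is discarded. A minimal sketch of the swapped order follows; the value of `archs` is purely illustrative, and the surrounding `setup.py` code is not shown in this thread.

```python
archs = "80;90"  # illustrative value only, not taken from setup.py

# Create the list first, then append, so that no flag is overwritten.
cmake_flags = ["-DCMAKE_CUDA_ARCHITECTURES={}".format(archs)]
cmake_flags.append("-DUSE_ROCM=OFF")
```

The alternative mentioned above, turning the second line into an `append`, would also preserve any flags added to `cmake_flags` earlier in the function.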

Comment on lines +15 to +22
if [ -n "$ROCM_PATH" ]; then
    true # Use provided ROCM_PATH
elif [ -d "/opt/rocm/core" ]; then
    ROCM_PATH="/opt/rocm/core"
else
    ROCM_PATH="/opt/rocm"
fi
export ROCM_PATH
Contributor

This fallback logic from $ROCM_PATH -> /opt/rocm/core -> /opt/rocm appears multiple times in multiple files. Not a blocker, but if the fallback order changes there are multiple places to update. Maybe some refactoring to share this logic in one place could help.

Collaborator Author

I thought about it. Python and shell scripts need it for different purposes, and they are also totally independent entities that do not share any common code.
The good thing is that changes such as switching to a different distribution model are extremely rare.
For CMake files, however, it might be possible to have a shared rocm_utils.cmake that configures ROCM_PATH.

Contributor

@Micky774 Micky774 Mar 31, 2026

I second Allen's comment. I think it would be good to do so at least for the CMake files. For the Python files, we could maybe move rocm_path() into the actual runtime library and have the build tools import from it?
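
A minimal sketch of what such a shared Python helper could look like, mirroring the shell fallback order discussed in this thread ($ROCM_PATH, then /opt/rocm/core, then /opt/rocm). The signature, including the injectable `environ` and `isdir` parameters, is an assumption made for testability and is not the `rocm_path()` that exists in the repository.

```python
import os


def rocm_path(environ=None, isdir=os.path.isdir) -> str:
    """Resolve the ROCm root directory (hypothetical shared helper).

    Order: a non-empty ROCM_PATH variable, then the TheRock default
    /opt/rocm/core if that directory exists, then the legacy /opt/rocm.
    """
    env = os.environ if environ is None else environ
    provided = env.get("ROCM_PATH")
    if provided:  # an unset or empty variable falls through to the defaults
        return provided
    if isdir("/opt/rocm/core"):
        return "/opt/rocm/core"
    return "/opt/rocm"
```

Build tools could then import this single function instead of repeating the fallback chain, so a change in the order would touch one place on the Python side.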

Comment on lines +26 to +28
RUN dnf group install -y "Development Tools" && dnf install -y git cmake llvm-toolset gcc-toolset-12
RUN dnf install -y --disablerepo=epel amdrocm-core-devel-gfx950

#Uncomment the next line for ROCm 6.4 cmake workaround: remove newer incompatible cmake preinstalled on base image
#RUN rm /usr/local/bin/cmake || true
# xz-devel installs lzma needed by AOTriton
RUN dnf install -y gcc-toolset-12 xz-devel
Contributor

Previously, git, cmake, llvm-toolset and gcc-toolset-12 were installed explicitly, and now the only explicit install is for gcc-toolset-12. Are the others provided by amdrocm-core-devel-gfx950 or by the base image now?

Collaborator Author

They are part of the base image now.


ENV NVTE_RELEASE_BUILD=1

ARG GPU_TARGETS="gfx942;gfx950"
Contributor

Since these are the supported GPU targets, does the previously installed amdrocm-core-devel-gfx950 still support gfx942?

Collaborator Author

This is quite counterintuitive, but for building it does not actually matter which GPU's ROCm package is installed, as long as all builds of the libraries provide the same API: the ROCm .so files are needed only for linking TE. Thus, we build against the package for the latest released hardware, and the runtime will use the amdrocm-core package corresponding to the actual GPU.

Comment on lines 183 to 198
"""
Determines which build platform to use:

- If `NVTE_USE_ROCM` is set:
    - Non-zero value: Use ROCm, if hipcc is detected.
    - Zero value: Use CUDA, if nvcc is detected.
- If `NVTE_USE_ROCM` is not set:
    - Attempt to auto-detect: Check for ROCm first, then CUDA.

Returns:
    bool: `True` for ROCm, `False` for CUDA.

Raises:
    ValueError: If NVTE_USE_ROCM is set to invalid value.
    FileNotFoundError: If required tools (hipcc or nvcc) are not found.
"""
Contributor

The docstring should be updated to reflect the changes below. Something like:

"""
Determines which build platform to use:

- If `NVTE_USE_ROCM` is set:
    - "0": Use CUDA, if nvcc is detected.
    - Any other value: Require ROCm and use it if hipcc is detected.
- If `NVTE_USE_ROCM` is not set:
    - If `HIP_PLATFORM=amd`, require ROCm and use it if hipcc is detected.
    - Otherwise, attempt to auto-detect: Check for ROCm first, then CUDA.

Returns:
    bool: `True` for ROCm, `False` for CUDA.

Raises:
    FileNotFoundError: If ROCm is required but hipcc is not found, or if
        neither ROCm nor CUDA can be detected.
"""
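
To make the decision order in the suggested docstring concrete, here is a minimal sketch of that logic. The function name `use_rocm`, its injectable parameters, and the use of `shutil.which` to look for `hipcc` and `nvcc` on PATH are assumptions; the real code may locate the compilers differently (for example under ROCM_PATH). Raising `FileNotFoundError` when `NVTE_USE_ROCM=0` but `nvcc` is missing is also an assumption, since the docstring does not spell out that case.

```python
import os
import shutil


def use_rocm(environ=None, which=shutil.which) -> bool:
    """Return True for a ROCm build, False for a CUDA build (sketch)."""
    env = os.environ if environ is None else environ
    has_hipcc = which("hipcc") is not None
    has_nvcc = which("nvcc") is not None

    flag = env.get("NVTE_USE_ROCM")
    if flag is not None:
        if flag == "0":
            # CUDA explicitly requested.
            if not has_nvcc:
                raise FileNotFoundError("NVTE_USE_ROCM=0 but nvcc was not found")
            return False
        rocm_required = True  # any other value requires ROCm
    else:
        # Without NVTE_USE_ROCM, HIP_PLATFORM=amd also requires ROCm.
        rocm_required = env.get("HIP_PLATFORM") == "amd"

    if rocm_required:
        if not has_hipcc:
            raise FileNotFoundError("ROCm is required but hipcc was not found")
        return True

    # Auto-detect: check for ROCm first, then CUDA.
    if has_hipcc:
        return True
    if has_nvcc:
        return False
    raise FileNotFoundError("Neither hipcc nor nvcc was found")
```

Passing `environ` and `which` explicitly keeps the sketch testable without touching the real environment.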

# Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved.
#
# See LICENSE for license information.
set -euo pipefail
Contributor

Bug: The -u option tells bash to treat unset variables as an error and exit immediately. Line 15 below, if [ -n "$ROCM_PATH" ]; then, expands ROCM_PATH before it is set, so the script exits immediately when ROCM_PATH is unset.

Using if [ -n "${ROCM_PATH:-}" ]; then is safer since it expands to an empty string instead of erroring, so the fallback logic can run.


- If `NVTE_USE_ROCM` is set:
- Non-zero value: Use ROCm, if hipcc is detected.
- Any value excet "0": Use ROCm, if hipcc is detected.
Contributor

Typo: "except"

Contributor

@Micky774 Micky774 left a comment

Overall looks good, just a comment on env variable usage and some small nits.

Contributor

@Micky774 Micky774 left a comment

Looks good, thanks!

@ipanfilo ipanfilo merged commit 67b93a8 into dev Apr 1, 2026
3 checks passed
@ipanfilo ipanfilo deleted the ipanfilo/therock_wheels branch April 1, 2026 01:52
wangye805 pushed a commit that referenced this pull request Apr 2, 2026
* Update Dockerfile to use ROCm TheRock
* Update wheels building script to work with ROCm TheRock and the latest Manylinux image
* Support default ROCm location /opt/rocm/core
* Fix UB code build on TheRock
* Support comma separated list of target GPU architectures
* Guess ROCm build from HIP_PLATFORM

Labels

ci-level 1 CI test level 1


3 participants