TE building over TheRock by ipanfilo · Pull Request #511 · ROCm/TransformerEngine

ipanfilo · 2026-03-30T17:14:18Z

Description

Support building TE wheels on top of ROCm TheRock

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Update the wheels Dockerfile to use TheRock
Update the build and initialization of the different components to use the default TheRock location
Update CI scripts
Make the RTC code aware of the default TheRock layout
Support a comma-separated list of GPU architectures
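
A minimal sketch of what accepting a comma-separated architecture list could look like on the Python side. The helper name `parse_gpu_archs` is hypothetical and not taken from this PR. Accepting semicolons as well is an assumption, made because the Dockerfile's GPU_TARGETS default uses that separator.

```python
import re


def parse_gpu_archs(value):
    """Split a GPU architecture string into an ordered list without duplicates.

    Hypothetical helper for illustration only; not the PR's actual code.
    """
    archs = []
    # Accept both "," and ";" as separators and tolerate surrounding spaces.
    for item in re.split(r"[;,]", value):
        arch = item.strip()
        if arch and arch not in archs:
            archs.append(arch)
    return archs


# Normalize a comma-separated list into the semicolon form CMake lists use.
print(";".join(parse_gpu_archs("gfx942, gfx950")))
```

Normalizing to a single internal form keeps the downstream build scripts independent of which separator the user typed.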

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Micky774 · 2026-03-30T19:50:53Z

May we also swap these two lines, or modify the second into an append:

TransformerEngine/setup.py

Lines 102 to 103 in 2cb27a3

    
    cmake_flags.append("-DUSE_ROCM=OFF")
    cmake_flags = ["-DCMAKE_CUDA_ARCHITECTURES={}".format(archs)]
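
To illustrate why the order of the two lines matters, here is a small sketch with stand-in values. The function names and the `archs` value are placeholders, not taken from setup.py. Rebinding `cmake_flags` to a fresh list after the append discards `-DUSE_ROCM=OFF`, whereas appending both entries keeps it.

```python
def flags_as_written(archs):
    cmake_flags = []
    cmake_flags.append("-DUSE_ROCM=OFF")
    # Rebinding the name to a new list drops the flag appended above.
    cmake_flags = ["-DCMAKE_CUDA_ARCHITECTURES={}".format(archs)]
    return cmake_flags


def flags_with_append(archs):
    cmake_flags = []
    cmake_flags.append("-DUSE_ROCM=OFF")
    # Appending keeps everything that was added earlier.
    cmake_flags.append("-DCMAKE_CUDA_ARCHITECTURES={}".format(archs))
    return cmake_flags


print(flags_as_written("80;90"))
print(flags_with_append("80;90"))
```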

AllenFarcas · 2026-03-30T20:05:03Z

.github/scripts/aiter_prebuild_upload.sh

+if [ -n "$ROCM_PATH" ]; then
+    true # Use provided ROCM_PATH
+elif [ -d "/opt/rocm/core" ]; then
+    ROCM_PATH="/opt/rocm/core"
+else
+    ROCM_PATH="/opt/rocm"
+fi
+export ROCM_PATH


This fallback logic from $ROCM_PATH -> /opt/rocm/core -> /opt/rocm appears multiple times in multiple files. Not a blocker, but if the fallback order changes there are multiple places to update. Maybe some refactoring to share this logic in one place could help.

I thought about it. Python and shell scripts need it for different purposes, and they are independent entities that share no common code.
The good thing is that changes such as switching to a different distribution model are extremely rare.
For the CMake files, however, it might be possible to have a rocm_utils.cmake that configures ROCM_PATH.

I second Allen's comment. I think it would be good to do so at least for the CMake files. For the Python files, we could maybe move rocm_path() into the actual runtime library and have the build tools import from it?
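
For reference, the fallback order under discussion could be sketched in Python roughly as follows. This is an illustration only: the signature, including the injectable `env` and `isdir` parameters, is an assumption made for testability, and the real rocm_path() in the repository may differ.

```python
import os


def rocm_path(env=None, isdir=os.path.isdir):
    """Resolve the ROCm root: $ROCM_PATH, then /opt/rocm/core, then /opt/rocm.

    Illustrative sketch; not the repository's actual implementation.
    """
    env = os.environ if env is None else env
    # .get() with a default mirrors "${ROCM_PATH:-}" in shell: unset becomes "".
    provided = env.get("ROCM_PATH", "")
    if provided:
        return provided
    # TheRock default layout takes precedence over the legacy location.
    if isdir("/opt/rocm/core"):
        return "/opt/rocm/core"
    return "/opt/rocm"


print(rocm_path())
```

Keeping the order in one function like this means a future change to the fallback chain touches a single place on the Python side.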

AllenFarcas · 2026-03-30T20:22:09Z

build_tools/wheel_utils/Dockerfile.rocm.manylinux.x86

-RUN dnf group install -y "Development Tools" && dnf install -y git cmake llvm-toolset gcc-toolset-12
+RUN dnf install -y --disablerepo=epel amdrocm-core-devel-gfx950

-#Uncomment the next line for ROCm 6.4 cmake workaround: remove newer incomnpatible cmake preinstalled on base image
-#RUN rm /usr/local/bin/cmake || true
+# xz-devel installs lzma needed by AOTriton
+RUN dnf install -y gcc-toolset-12 xz-devel


Previously, git, cmake, llvm-toolset, and gcc-toolset-12 were installed explicitly; now the only one of them still installed explicitly is gcc-toolset-12. Are the others provided by amdrocm-core-devel-gfx950 or by the base image now?

They are part of the base image now.

AllenFarcas · 2026-03-30T20:57:34Z

build_tools/wheel_utils/Dockerfile.rocm.manylinux.x86

+
 ENV NVTE_RELEASE_BUILD=1

 ARG GPU_TARGETS="gfx942;gfx950"


Since these are the supported GPU targets, does the previously installed amdrocm-core-devel-gfx950 still support gfx942?

This is quite counterintuitive, but for building it does not actually matter which GPU the installed ROCm targets, as long as all builds of the libraries provide the same API: the ROCm .so files are only needed for linking TE. Thus, we build against the package for the latest released hardware, and at runtime the amdrocm-core package corresponding to the actual GPU is used.

AllenFarcas · 2026-03-30T21:04:18Z

build_tools/utils.py

    """
    Determines which build platform to use:

    - If `NVTE_USE_ROCM` is set:
        - Non-zero value: Use ROCm, if hipcc is detected.
        - Zero value: Use CUDA, if nvcc is detected.
    - If `NVTE_USE_ROCM` is not set:
        - Attempt to auto-detect: Check for ROCm first, then CUDA.

    Returns:
        bool: `True` for ROCm, `False` for CUDA.

    Raises:
        ValueError: If NVTE_USE_ROCM is set to invalid value.
        FileNotFoundError: If required tools (hipcc or nvcc) are not found.
    """


The docstring should be updated to reflect the changes below. Something like:

    """
    Determines which build platform to use:

    - If `NVTE_USE_ROCM` is set:
        - "0": Use CUDA, if nvcc is detected.
        - Any other value: Require ROCm and use it if hipcc is detected.
    - If `NVTE_USE_ROCM` is not set:
        - If `HIP_PLATFORM=amd`, require ROCm and use it if hipcc is detected.
        - Otherwise, attempt to auto-detect: Check for ROCm first, then CUDA.

    Returns:
        bool: `True` for ROCm, `False` for CUDA.

    Raises:
        FileNotFoundError: If ROCm is required but hipcc is not found, or if neither ROCm nor CUDA can be detected.
    """
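
The decision order in the suggested docstring can be sketched as below. This is a hypothetical illustration: the function name `use_rocm`, its injectable parameters, and the use of `shutil.which` to detect hipcc and nvcc are assumptions, not the code in build_tools/utils.py, which may locate the compilers differently (for example under ROCM_PATH).

```python
import os
import shutil


def use_rocm(env=None, has_tool=None):
    """Return True for ROCm, False for CUDA, following the suggested docstring.

    Hypothetical sketch; names and detection method are assumptions.
    """
    env = os.environ if env is None else env
    if has_tool is None:
        has_tool = lambda name: shutil.which(name) is not None

    def require(tool):
        if not has_tool(tool):
            raise FileNotFoundError(f"{tool} not found")

    flag = env.get("NVTE_USE_ROCM")
    if flag is not None:
        if flag == "0":
            require("nvcc")   # explicit CUDA request
            return False
        require("hipcc")      # any other value requires ROCm
        return True
    if env.get("HIP_PLATFORM") == "amd":
        require("hipcc")      # ROCm inferred from HIP_PLATFORM
        return True
    # Auto-detect: ROCm first, then CUDA.
    if has_tool("hipcc"):
        return True
    if has_tool("nvcc"):
        return False
    raise FileNotFoundError("neither hipcc nor nvcc was found")


# Example with injected inputs, so it runs without either toolchain installed.
print(use_rocm({"HIP_PLATFORM": "amd"}, lambda name: name == "hipcc"))
```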

AllenFarcas · 2026-03-30T21:34:43Z

.github/scripts/aiter_prebuild_upload.sh

 # Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved.
 #
 # See LICENSE for license information.
 set -euo pipefail


Bug: The -u option tells bash to treat unset variables as an error and exit immediately. Line 15 below, if [ -n "$ROCM_PATH" ]; then, expands ROCM_PATH before it is set, so the script exits immediately when ROCM_PATH is unset.

Using if [ -n "${ROCM_PATH:-}" ]; then is safer since it expands to an empty string instead of erroring, so the fallback logic can run.

Micky774 · 2026-03-31T14:35:47Z

build_tools/utils.py


    - If `NVTE_USE_ROCM` is set:
-        - Non-zero value: Use ROCm, if hipcc is detected.
+        - Any value excet "0": Use ROCm, if hipcc is detected.


Typo: "except"

Micky774

Overall looks good, just a comment on env variable usage and some small nits.

Micky774

Looks good, thanks!

* Update Dockerfile to use ROCm TheRock
* Update wheels building script to work with ROCm TheRock and the latest Manylinux image
* Support default ROCm location /opt/rocm/core
* Fix UB code build on TheRock
* Support comma separated list of target GPU architectures
* Guess ROCm build from HIP_PLATFORM

ipanfilo added 2 commits March 29, 2026 23:50

Integrate fix for aiter build w/o torch

0710d2f

TE wheels build on ROCm TheRock

483267e

ipanfilo requested review from AllenFarcas, Micky774 and alextmagro March 30, 2026 17:14

ipanfilo requested review from wangye805 and wenchenvincent as code owners March 30, 2026 17:14

ipanfilo added the ci-level 1 CI test level 1 label Mar 30, 2026

ipanfilo mentioned this pull request Mar 30, 2026

FIX Wheel building #512

Closed

13 tasks

AllenFarcas reviewed Mar 30, 2026

Address review comments. Switch docker to gcc-tools-13

c8e14d4

Micky774 reviewed Mar 31, 2026

Address comments. Create utility cmake with reusable ROCm build steps

ecd6a11

ipanfilo requested review from AllenFarcas and Micky774 March 31, 2026 17:22

Micky774 approved these changes Mar 31, 2026

ipanfilo added 2 commits March 31, 2026 18:39

Merge branch 'dev' into ipanfilo/therock_wheels

9eb924a

Fix UB code build on TheRock

b70356a

ipanfilo merged commit 67b93a8 into dev Apr 1, 2026
3 checks passed

ipanfilo deleted the ipanfilo/therock_wheels branch April 1, 2026 01:52
