Conversation
|
May we also swap these two lines, or modify the second into an append: Lines 102 to 103 in 2cb27a3 |
| if [ -n "$ROCM_PATH" ]; then | ||
| true # Use provided ROCM_PATH | ||
| elif [ -d "/opt/rocm/core" ]; then | ||
| ROCM_PATH="/opt/rocm/core" | ||
| else | ||
| ROCM_PATH="/opt/rocm" | ||
| fi | ||
| export ROCM_PATH |
There was a problem hiding this comment.
This fallback logic from $ROCM_PATH -> /opt/rocm/core -> /opt/rocm appears multiple times in multiple files. Not a blocker, but if the fallback order changes there are multiple places to update. Maybe some refactoring to share this logic in one place could help.
There was a problem hiding this comment.
I thought about it. Python and shell scripts need it for different purposes and they are also totally independent entities that do not use common shared code.
Good thing is that such changes like switching to different distribution model are extremely rare.
For CMake files it might however be possible to have some rocm_utils.cmake with ROCM_PATH configuring.
There was a problem hiding this comment.
I ditto Allen's comment. I think it would be good to do so at least for the cmake files. For the python files, we could maybe move rocm_path() into the actual runtime library and have the build tools import from it?
| RUN dnf group install -y "Development Tools" && dnf install -y git cmake llvm-toolset gcc-toolset-12 | ||
| RUN dnf install -y --disablerepo=epel amdrocm-core-devel-gfx950 | ||
|
|
||
| #Uncomment the next line for ROCm 6.4 cmake workaround: remove newer incomnpatible cmake preinstalled on base image | ||
| #RUN rm /usr/local/bin/cmake || true | ||
| # xz-devel installs lzma needed by AOTriton | ||
| RUN dnf install -y gcc-toolset-12 xz-devel |
There was a problem hiding this comment.
Previously, git, cmake llvm-toolset and gcc-toolset-12 were installed explicitly and now the only explicity install is for gcc-toolset-12. Are they provided by the amdrocm-core-devel-gfx950 or the base image now?
There was a problem hiding this comment.
They are part of base image now
|
|
||
| ENV NVTE_RELEASE_BUILD=1 | ||
|
|
||
| ARG GPU_TARGETS="gfx942;gfx950" |
There was a problem hiding this comment.
Since these are the supported GPU targets, does the previously installed amdrocm-core-devel-gfx950 still support gfx942?
There was a problem hiding this comment.
This is quite counterintuitive but for building it does not actually make difference which GPU ROCm is installed if all builds libraries provide the same API - ROCm .so are just needed for TE linking. Thus, we use the latest released HW for building and runtime will use amdrocm-core corresponding to actual GPU.
| """ | ||
| Determines which build platform to use: | ||
|
|
||
| - If `NVTE_USE_ROCM` is set: | ||
| - Non-zero value: Use ROCm, if hipcc is detected. | ||
| - Zero value: Use CUDA, if nvcc is detected. | ||
| - If `NVTE_USE_ROCM` is not set: | ||
| - Attempt to auto-detect: Check for ROCm first, then CUDA. | ||
|
|
||
| Returns: | ||
| bool: `True` for ROCm, `False` for CUDA. | ||
|
|
||
| Raises: | ||
| ValueError: If NVTE_USE_ROCM is set to invalid value. | ||
| FileNotFoundError: If required tools (hipcc or nvcc) are not found. | ||
| """ |
There was a problem hiding this comment.
The docstring should be updated to reflect the changes below. Something like:
"""
Determines which build platform to use:
- If `NVTE_USE_ROCM` is set:
- "0": Use CUDA, if nvcc is detected.
- Any other value: Require ROCm and use it if hipcc is detected.
- If `NVTE_USE_ROCM` is not set:
- If `HIP_PLATFORM=amd`, require ROCm and use it if hipcc is detected.
- Otherwise, attempt to auto-detect: Check for ROCm first, then CUDA.
Returns:
bool: `True` for ROCm, `False` for CUDA.
Raises:
FileNotFoundError: If ROCm is required but hipcc is not found, or if
neither ROCm nor CUDA can be detected.
"""
| # Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved. | ||
| # | ||
| # See LICENSE for license information. | ||
| set -euo pipefail |
There was a problem hiding this comment.
Bug: The -u options tells bash to treat unset variables as an error and exit immediately. The line 15 below if [ -n "$ROCM_PATH" ]; then expands ROCM_PATH before it is set, so the script exits immediately when ROCM_PATH is unset.
Using if [ -n "${ROCM_PATH:-}" ]; then is safer since it expands to an empty string instead of erroring, so the fallback logic can run.
build_tools/utils.py
Outdated
|
|
||
| - If `NVTE_USE_ROCM` is set: | ||
| - Non-zero value: Use ROCm, if hipcc is detected. | ||
| - Any value excet "0": Use ROCm, if hipcc is detected. |
Micky774
left a comment
There was a problem hiding this comment.
Overall looks good, just a comment on env variable usage and some small nits.
* Update Dockerfile to use ROCm TheRock * Update wheels building script to work with ROCm TheRock and the latest Manylinux image * Support default ROCm location /opt/rocm/core * Fix UB code build on TheRock * Support comma separated list of target GPU architectures * Guess ROCm build from HIP_PLATFORM
Description
Support TE wheels building over ROCm TheRock
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: