Problems building apex with ROCm-5.4, 5.5, and 5.6 #115

Open
adammoody opened this issue Aug 25, 2023 · 3 comments
Labels
bug Something isn't working

Comments

adammoody commented Aug 25, 2023

Describe the Bug
The latest master branch fails to build with several ROCm versions, including 5.4, 5.5, and 5.6.

Rolling back to the commit made on June 20 (git checkout 10c7482) allows ROCm-5.4 to build. The build still fails for 5.5 and 5.6 but with a different error.

Minimal Steps/Code to Reproduce the Bug

For ROCm-5.4.3, I use the following to build:

virtualenv --system-site-packages env
source env/bin/activate

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2

git clone --recursive https://github.com/ROCmSoftwarePlatform/apex.git
cd apex

export DISTUTILS_DEBUG=1
export __HIP_PLATFORM_HCC__
export __HIP_PLATFORM_AMD__
export HCC_AMDGPU_TARGET=gfx90a
export PYTORCH_ROCM_ARCH=gfx90a
export ROCM_HOME=/opt/rocm-5.4.3
export CC=gcc
export CXX=g++
pip3 install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

The build fails when compiling csrc/mlp_hip.hip with errors like the following:

  csrc/mlp_hip.hip:65:53: error: unknown type name 'hipblasOperation_t'; did you mean 'hipsparseOperation_t'?
  static rocblas_operation hipOperationToRocOperation(hipblasOperation_t op)
                                                      ^~~~~~~~~~~~~~~~~~
                                                      hipsparseOperation_t
  /opt/rocm-5.4.3/include/hipsparse/hipsparse.h:317:3: note: 'hipsparseOperation_t' declared here
  } hipsparseOperation_t;
    ^
  csrc/mlp_hip.hip:69:10: error: use of undeclared identifier 'HIPBLAS_OP_N'
      case HIPBLAS_OP_N:
           ^
  csrc/mlp_hip.hip:71:10: error: use of undeclared identifier 'HIPBLAS_OP_T'
      case HIPBLAS_OP_T:
           ^
  csrc/mlp_hip.hip:73:10: error: use of undeclared identifier 'HIPBLAS_OP_C'
      case HIPBLAS_OP_C:
           ^
  csrc/mlp_hip.hip:79:8: error: unknown type name 'hipblasStatus_t'; did you mean 'hipsparseStatus_t'?
  static hipblasStatus_t rocBLASStatusToHIPStatus(rocblas_status error)
         ^~~~~~~~~~~~~~~
         hipsparseStatus_t
  /opt/rocm-5.4.3/include/hipsparse/hipsparse.h:188:3: note: 'hipsparseStatus_t' declared here
  } hipsparseStatus_t;
    ^

Rolling back to the commit from June 20 allows the build to complete:

cd apex

git checkout 10c7482
git submodule init
git submodule update

export DISTUTILS_DEBUG=1
export __HIP_PLATFORM_HCC__
export __HIP_PLATFORM_AMD__
export HCC_AMDGPU_TARGET=gfx90a
export PYTORCH_ROCM_ARCH=gfx90a
export ROCM_HOME=/opt/rocm-5.4.3
export CC=gcc
export CXX=g++
pip3 install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

Building apex from master with ROCm-5.5 or ROCm-5.6 fails with errors that are similar to each other but distinct from the ROCm-5.4 errors. Here are the steps I used to build with ROCm-5.6:

virtualenv --system-site-packages env
source env/bin/activate

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.6

git clone --recursive https://github.com/ROCmSoftwarePlatform/apex.git
cd apex

export DISTUTILS_DEBUG=1
export __HIP_PLATFORM_HCC__
export __HIP_PLATFORM_AMD__
export HCC_AMDGPU_TARGET=gfx90a
export PYTORCH_ROCM_ARCH=gfx90a
export ROCM_HOME=/opt/rocm-5.6.0
export CC=gcc
export CXX=g++
pip3 install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

That fails with the following error:

  csrc/mlp_hip.hip:91:10: error: use of undeclared identifier 'rocblas_status_excluded_from_build'
      case rocblas_status_excluded_from_build:
           ^
  csrc/mlp_hip.hip:104:10: error: use of undeclared identifier 'rocblas_status_arch_mismatch'; did you mean 'rocblas_status_size_query_mismatch'?
      case rocblas_status_arch_mismatch:
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
           rocblas_status_size_query_mismatch
  /opt/rocm-5.6.0/include/rocblas/internal/rocblas-types.h:212:5: note: 'rocblas_status_size_query_mismatch' declared here
      rocblas_status_size_query_mismatch = 8, /**< Unmatched start/stop size query */
      ^
  csrc/mlp_hip.hip:104:10: error: duplicate case value 'rocblas_status_size_query_mismatch'
      case rocblas_status_arch_mismatch:
           ^
  csrc/mlp_hip.hip:96:10: note: previous case defined here
      case rocblas_status_size_query_mismatch:
           ^

In this case, rolling back to the June 20 commit fails with a different error:

  csrc/mlp_hip.hip:89:7: error: use of undeclared identifier 'rocblas_datatype_f64_r'
        rocblas_datatype_f64_r,
        ^
  csrc/mlp_hip.hip:92:7: error: use of undeclared identifier 'rocblas_datatype_f64_r'
        rocblas_datatype_f64_r,
        ^
  csrc/mlp_hip.hip:96:7: error: use of undeclared identifier 'rocblas_datatype_f64_r'
        rocblas_datatype_f64_r,
        ^
  csrc/mlp_hip.hip:99:7: error: use of undeclared identifier 'rocblas_datatype_f64_r'
        rocblas_datatype_f64_r,
        ^
  csrc/mlp_hip.hip:101:7: error: use of undeclared identifier 'rocblas_datatype_f64_r'
        rocblas_datatype_f64_r,
        ^
  csrc/mlp_hip.hip:102:7: error: use of undeclared identifier 'rocblas_gemm_algo_standard'
        rocblas_gemm_algo_standard,
        ^

When building with the June 20 commit, I see that the csrc/mlp_hip.hip file contains the following for ROCm-5.5 and ROCm-5.6 (where the build fails):

/* Includes, cuda */
#include <hipblas/hipblas.h>
#include <hip/hip_runtime.h>

but it has the following for ROCm-5.4 (which builds):

/* Includes, cuda */
#include <rocblas/rocblas.h>
#include <hip/hip_runtime.h>
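
For anyone who wants to compare their own build against the two include variants above, a minimal sketch follows. The helper name `blas_header_of` is hypothetical, and the sketch assumes the file is at csrc/mlp_hip.hip in the apex checkout after a build attempt.

```shell
# Hypothetical helper: print the hipblas/rocblas include line(s) found in a
# source file, so the two variants shown above can be told apart quickly.
blas_header_of() {
    # -E: extended regex, -o: print only the matching part of each line
    grep -E -o '#include <(hipblas|rocblas)/[a-z]+\.h>' "$1"
}

# Example usage from the apex checkout (assumes a build has been attempted):
#   blas_header_of csrc/mlp_hip.hip
```

If it prints the hipblas include, the file matches the variant that failed for me on ROCm-5.5 and ROCm-5.6; the rocblas include matches the variant that built on ROCm-5.4.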

Expected Behavior

Environment

adammoody added the bug label on Aug 25, 2023
adammoody changed the title from "Problems building apex main branch with ROCm-5.4, 5.5, and 5.6" to "Problems building apex with ROCm-5.4, 5.5, and 5.6" on Aug 25, 2023

loadams commented Aug 29, 2023

I'm seeing this as well: a number of errors like those above while building the cuda_ext.

/apex/csrc/mlp_hip.hip:65:53: error: unknown type name 'hipblasOperation_t'; did you mean 'hipsparseOperation_t'?
  static rocblas_operation hipOperationToRocOperation(hipblasOperation_t op)

loadams commented Aug 29, 2023

FYI @jithunnair-amd

hliuca commented Sep 11, 2023

Hi @adammoody and @loadams, if you are using PyTorch 2.0 or earlier, please use the master branch of apex. If you are using PyTorch 2.1+, please use the torch_2.1_higher branch.

There are some changes related to CUDA to HIP conversion in PyTorch.

These two commands are not needed:

export __HIP_PLATFORM_HCC__
export __HIP_PLATFORM_AMD__

I am not an apex developer.
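
The branch advice above could be scripted roughly as follows. This is only a sketch: `pick_apex_branch` is a hypothetical helper name, the version parsing is an assumption, and treating 2.1 development builds (such as a nightly) as 2.1+ is a guess rather than something confirmed in this thread.

```shell
# Hypothetical helper: map a PyTorch version string to the apex branch
# named in the comment above (master for <= 2.0, torch_2.1_higher for 2.1+).
pick_apex_branch() {
    version="${1%%+*}"      # drop a local suffix such as "+rocm5.6"
    major="${version%%.*}"  # text before the first dot
    rest="${version#*.}"    # text after the first dot
    minor="${rest%%.*}"     # text between the first and second dot
    if [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 1 ]; }; then
        echo "torch_2.1_higher"
    else
        echo "master"
    fi
}

# Example usage (assumes torch is installed and you are inside the apex clone):
#   branch=$(pick_apex_branch "$(python3 -c 'import torch; print(torch.__version__)')")
#   git checkout "$branch"
```

The comparison is done on the numeric major and minor fields rather than on the raw string, so a version like 2.10 is still treated as newer than 2.1.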
