Can't build jaxlib in GH200  #21299

Open
@giladqm

Description

I'm trying to run code on my GH200, without success: I am unable to build jaxlib for this GPU.

System info (python version, jaxlib version, accelerator, etc.)

root@470c73980644:~/jax# nvidia-smi
Sun May 19 12:13:00 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GH200 480GB             On  |   00000009:01:00.0 Off |                   On |
| N/A   23C    P0             62W /  900W |       5MiB /  97871MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  No MIG devices found                                                                   |
+-----------------------------------------------------------------------------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@470c73980644:~/jax# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:24:28_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
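
For anyone reproducing this, the toolkit release reported by nvcc (12.4) matches the CUDA version shown by nvidia-smi above. Below is a minimal sketch of how that check could be scripted. The helper name `parse_nvcc_release` is hypothetical, and the sample text is the `nvcc --version` output pasted above, not a live call to nvcc.

```python
import re


def parse_nvcc_release(text):
    """Return (major, minor) parsed from `nvcc --version` output, or None."""
    # nvcc prints a line such as: "Cuda compilation tools, release 12.4, V12.4.131"
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Output copied from this report; on a real machine it would come from
# running `nvcc --version` and capturing its stdout.
NVCC_OUTPUT = """\
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:24:28_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
"""

release = parse_nvcc_release(NVCC_OUTPUT)
print(release)
```

A matching release only rules out one simple cause of failure; it does not explain the compile errors reported below.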

The error I get:
Error limit reached.
100 errors detected in the compilation of "external/xla/xla/stream_executor/gpu/redzone_allocator_kernel_cuda.cc".
Compilation terminated.
Target //jaxlib/tools:build_gpu_plugin_wheel failed to build
INFO: Elapsed time: 7.262s, Critical Path: 4.88s
INFO: 73 processes: 73 internal.
FAILED: Build did NOT complete successfully
ERROR: Build failed. Not running target
Traceback (most recent call last):
  File "/root/jax/build/build.py", line 733, in <module>
    main()
  File "/root/jax/build/build.py", line 727, in main
    shell(build_pjrt_plugin_command)
  File "/root/jax/build/build.py", line 45, in shell
    output = subprocess.check_output(cmd)
  File "/usr/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/local/bin/bazel', 'run', '--verbose_failures=true', '//jaxlib/tools:build_gpu_plugin_wheel', '--', '--output_path=/root/jax/dist', '--jaxlib_git_hash=45a7c22e932fee257016bf0da1022be146ed6095', '--cpu=aarch64', '--cuda_version=12']' returned non-zero exit status 1.
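
The traceback only shows that the wrapped bazel command exited with status 1, and the log above ends at "Error limit reached", so the first compiler errors (usually the informative ones) are not visible here. A rough sketch of how they might be collected follows. The helper name `first_error_lines` is hypothetical and not part of build.py, and the demo command is a stand-in Python one-liner that fails on purpose, not the real build.

```python
import subprocess
import sys


def first_error_lines(cmd, limit=10):
    """Run cmd; return (exit_code, first `limit` output lines mentioning 'error')."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout to preserve ordering
        text=True,
        check=False,  # do not raise CalledProcessError; inspect the exit code instead
    )
    hits = [line for line in result.stdout.splitlines() if "error" in line.lower()]
    return result.returncode, hits[:limit]


# Stand-in for a failing build step so the sketch runs anywhere.
demo_cmd = [
    sys.executable,
    "-c",
    "import sys; print('compiling'); print('a.cc(1): error: first'); "
    "print('a.cc(2): error: second'); sys.exit(1)",
]
code, errors = first_error_lines(demo_cmd, limit=1)
print(code, errors)
```

In principle the argument list from the CalledProcessError message above could be passed as `cmd` to see the earliest errors from redzone_allocator_kernel_cuda.cc, though that has not been tried here.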

Labels: NVIDIA GPU, bug
