Short description
Switching to a device ≠ 1 on a multi-GPU node (LUMI MI250x) results in Julia hanging. Switching via default_device_id! at Julia startup appears to succeed until the first allocation, while switching via device! fails right away, even at startup.
This behaviour occurs on AMDGPU v0.4.13 and on AMDGPU#master
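For reference, a minimal reproducer distilled from the transcripts below (it requires a multi-GPU ROCm node, so it cannot be run elsewhere; all calls are the AMDGPU.jl API used in this report):

```julia
using AMDGPU

AMDGPU.devices()               # 8 × gfx90a on a LUMI MI250x node
AMDGPU.ones(4, 4)              # works on the default device (1)
AMDGPU.default_device_id!(2)   # switch the default device
AMDGPU.ones(4, 4)              # memory fault, then Julia hangs
```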
julia> AMDGPU.versioninfo()
Using ROCm provided by: System
HSA Runtime (ready)
- Path: /opt/rocm/lib/libhsa-runtime64.so.1
- Version: 1.1.0
ld.lld (ready)
- Path: /opt/rocm/llvm/bin/ld.lld
ROCm-Device-Libs (ready)
- Path: /opt/rocm/amdgcn/bitcode
HIP Runtime (ready)
- Path: /opt/rocm/lib/libamdhip64.so.5
rocBLAS (ready)
- Path: /opt/rocm/lib/librocblas.so
rocSOLVER (ready)
- Path: /opt/rocm/lib/librocsolver.so
rocALUTION (ready)
- Path: /opt/rocm/lib/librocalution.so
rocSPARSE (ready)
- Path: /opt/rocm/lib/librocsparse.so
rocRAND (ready)
- Path: /opt/rocm/lib/librocrand.so
rocFFT (ready)
- Path: /opt/rocm/lib/librocfft.so
MIOpen (ready)
- Path: /opt/rocm/lib/libMIOpen.so
HSA Agents (12):
- CPU-XX [AMD EPYC 7A53 64-Core Processor]
- CPU-XX [AMD EPYC 7A53 64-Core Processor]
- CPU-XX [AMD EPYC 7A53 64-Core Processor]
- CPU-XX [AMD EPYC 7A53 64-Core Processor]
- GPU-c96383daa806fd67 [gfx90a]
- GPU-9a4356d61a3e2421 [gfx90a]
- GPU-5b303e9096b8783a [gfx90a]
- GPU-0b64175228ee7027 [gfx90a]
- GPU-929a3274a9566ca5 [gfx90a]
- GPU-2b0e86a577d2d025 [gfx90a]
- GPU-1b0f4c6aeb8256b1 [gfx90a]
- GPU-9f26db390c83e0eb [gfx90a]
Details
Allocating memory on the default device 1 works as expected:
julia> AMDGPU.devices()
8-element Vector{ROCDevice}:
GPU-c96383daa806fd67 [gfx90a]
GPU-9a4356d61a3e2421 [gfx90a]
GPU-5b303e9096b8783a [gfx90a]
GPU-0b64175228ee7027 [gfx90a]
GPU-929a3274a9566ca5 [gfx90a]
GPU-2b0e86a577d2d025 [gfx90a]
GPU-1b0f4c6aeb8256b1 [gfx90a]
GPU-9f26db390c83e0eb [gfx90a]
julia> AMDGPU.default_device_id()
1
julia> AMDGPU.ones(4,4)
4×4 ROCMatrix{Float32}:
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
Switching the default device to device 2 and repeating the allocation fails:
julia> AMDGPU.default_device_id!(2)
GPU-9a4356d61a3e2421 [gfx90a]
julia> AMDGPU.ones(4,4)
┌ Error: Memory Fault on GPU-c96383daa806fd67 [gfx90a] at 0x000015266ffe4000:
│ PAGE_NOT_PRESENT, READ_ONLY, NX, HOST_ONLY, DRAMECC, IMPRECISE, SRAMECC, HANG
│ GPU kernels are now hung, please restart Julia
└ @ AMDGPU.Runtime /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/runtime/fault.jl:25
^CERROR: InterruptException:
Stacktrace:
[1] wait(kersig::AMDGPU.Runtime.ROCKernelSignal; check_exceptions::Bool, cleanup::Bool, signal_kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ AMDGPU.Runtime /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/runtime/kernel-signal.jl:63
[2] wait
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/runtime/kernel-signal.jl:32 [inlined]
[3] gpu_call(::AMDGPU.ROCArrayBackend, f::Function, args::Tuple{ROCMatrix{Float32}, Float32}, threads::Int64, blocks::Int64; name::Nothing)
@ AMDGPU /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:13
[4] gpu_call
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:11 [inlined]
[5] #gpu_call#1
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/device/execution.jl:65 [inlined]
[6] gpu_call
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/device/execution.jl:34 [inlined]
[7] fill!
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/host/construction.jl:14 [inlined]
[8] ones
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:383 [inlined]
[9] ones(::Int64, ::Int64)
@ AMDGPU /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:382
[10] top-level scope
@ REPL[11]:1
julia>
Using device! instead of default_device_id! to switch from the default device fails right away:
julia> AMDGPU.devices()
8-element Vector{ROCDevice}:
GPU-c96383daa806fd67 [gfx90a]
GPU-9a4356d61a3e2421 [gfx90a]
GPU-5b303e9096b8783a [gfx90a]
GPU-0b64175228ee7027 [gfx90a]
GPU-929a3274a9566ca5 [gfx90a]
GPU-2b0e86a577d2d025 [gfx90a]
GPU-1b0f4c6aeb8256b1 [gfx90a]
GPU-9f26db390c83e0eb [gfx90a]
julia> AMDGPU.device!(AMDGPU.devices()[2])
GPU-9a4356d61a3e2421 [gfx90a]
julia> AMDGPU.ones(4,4)
┌ Error: Memory Fault on GPU-9a4356d61a3e2421 [gfx90a] at 0x00001501ec20c000:
│ PAGE_NOT_PRESENT, READ_ONLY, NX, HOST_ONLY, DRAMECC, IMPRECISE, SRAMECC, HANG
│ GPU kernels are now hung, please restart Julia
└ @ AMDGPU.Runtime /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/runtime/fault.jl:25
^CERROR: InterruptException:
Stacktrace:
[1] wait(kersig::AMDGPU.Runtime.ROCKernelSignal; check_exceptions::Bool, cleanup::Bool, signal_kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ AMDGPU.Runtime /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/runtime/kernel-signal.jl:63
[2] wait
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/runtime/kernel-signal.jl:32 [inlined]
[3] gpu_call(::AMDGPU.ROCArrayBackend, f::Function, args::Tuple{ROCMatrix{Float32}, Float32}, threads::Int64, blocks::Int64; name::Nothing)
@ AMDGPU /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:13
[4] gpu_call
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:11 [inlined]
[5] #gpu_call#1
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/device/execution.jl:65 [inlined]
[6] gpu_call
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/device/execution.jl:34 [inlined]
[7] fill!
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/host/construction.jl:14 [inlined]
[8] ones
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:383 [inlined]
[9] ones(::Int64, ::Int64)
@ AMDGPU /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:382
[10] top-level scope
@ REPL[5]:1
julia>
After the failure, exiting Julia produces the following stack trace:
julia>
^C^C^C^C^CWARNING: Force throwing a SIGINT
error in running finalizer: InterruptException()
__sched_yield at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x150418209464)
unknown function (ip: 0x15041820cace)
unknown function (ip: 0x1504182190f8)
unknown function (ip: 0x15041816f1ed)
hipStreamDestroy at /opt/rocm/lib/libamdhip64.so.5 (unknown line)
hipStreamDestroy at /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/hip/libhip.jl:59
#7 at /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/hip/HIP.jl:111
unknown function (ip: 0x1501f3a97332)
_jl_invoke at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2940
run_finalizer at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gc.c:417
jl_gc_run_finalizers_in_list at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gc.c:507
run_finalizers at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gc.c:553
ijl_atexit_hook at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/init.c:299
jl_repl_entrypoint at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/jlapi.c:718
main at julia (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x401098)
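A possible workaround to try (untested here; whether it interacts with LUMI's SLURM GPU binding is an assumption) is to restrict GPU visibility at the ROCm runtime level before Julia starts, so that AMDGPU's default device 1 already maps to the desired GCD and no in-process device switch is needed:

```shell
# Hypothetical workaround sketch: expose only one GCD to the ROCm runtime
# before launching Julia. The index is an example; the mapping of indices
# to physical GPUs depends on the node's enumeration.
export ROCR_VISIBLE_DEVICES=1
# julia -e 'using AMDGPU; @show AMDGPU.devices()'   # would now list a single device
```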