Switching to device ≠ 1 hangs on multi-GPU node #425

@luraess

Short description

Switching to any device other than device 1 on a multi-GPU node (LUMI, MI250x) results in Julia hanging. Changing the default device allows switching devices at Julia startup, whereas changing the device with device! fails even at startup.

This behaviour occurs on AMDGPU v0.4.13 and on AMDGPU#master.

julia> AMDGPU.versioninfo()
Using ROCm provided by: System
HSA Runtime (ready)
- Path: /opt/rocm/lib/libhsa-runtime64.so.1
- Version: 1.1.0
ld.lld (ready)
- Path: /opt/rocm/llvm/bin/ld.lld
ROCm-Device-Libs (ready)
- Path: /opt/rocm/amdgcn/bitcode
HIP Runtime (ready)
- Path: /opt/rocm/lib/libamdhip64.so.5
rocBLAS (ready)
- Path: /opt/rocm/lib/librocblas.so
rocSOLVER (ready)
- Path: /opt/rocm/lib/librocsolver.so
rocALUTION (ready)
- Path: /opt/rocm/lib/librocalution.so
rocSPARSE (ready)
- Path: /opt/rocm/lib/librocsparse.so
rocRAND (ready)
- Path: /opt/rocm/lib/librocrand.so
rocFFT (ready)
- Path: /opt/rocm/lib/librocfft.so
MIOpen (ready)
- Path: /opt/rocm/lib/libMIOpen.so
HSA Agents (12):
- CPU-XX [AMD EPYC 7A53 64-Core Processor]
- CPU-XX [AMD EPYC 7A53 64-Core Processor]
- CPU-XX [AMD EPYC 7A53 64-Core Processor]
- CPU-XX [AMD EPYC 7A53 64-Core Processor]
- GPU-c96383daa806fd67 [gfx90a]
- GPU-9a4356d61a3e2421 [gfx90a]
- GPU-5b303e9096b8783a [gfx90a]
- GPU-0b64175228ee7027 [gfx90a]
- GPU-929a3274a9566ca5 [gfx90a]
- GPU-2b0e86a577d2d025 [gfx90a]
- GPU-1b0f4c6aeb8256b1 [gfx90a]
- GPU-9f26db390c83e0eb [gfx90a]

Details

Allocating memory on the default device (device 1) works as expected:

julia> AMDGPU.devices()
8-element Vector{ROCDevice}:
 GPU-c96383daa806fd67 [gfx90a]
 GPU-9a4356d61a3e2421 [gfx90a]
 GPU-5b303e9096b8783a [gfx90a]
 GPU-0b64175228ee7027 [gfx90a]
 GPU-929a3274a9566ca5 [gfx90a]
 GPU-2b0e86a577d2d025 [gfx90a]
 GPU-1b0f4c6aeb8256b1 [gfx90a]
 GPU-9f26db390c83e0eb [gfx90a]

julia> AMDGPU.default_device_id()
1

julia> AMDGPU.ones(4,4)
4×4 ROCMatrix{Float32}:
 1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0

Switching the default device to device 2 and repeating the allocation fails:

julia> AMDGPU.default_device_id!(2)
GPU-9a4356d61a3e2421 [gfx90a]

julia> AMDGPU.ones(4,4)
┌ Error: Memory Fault on GPU-c96383daa806fd67 [gfx90a] at 0x000015266ffe4000:
│ PAGE_NOT_PRESENT, READ_ONLY, NX, HOST_ONLY, DRAMECC, IMPRECISE, SRAMECC, HANG
│ GPU kernels are now hung, please restart Julia
└ @ AMDGPU.Runtime /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/runtime/fault.jl:25
^CERROR: InterruptException:
Stacktrace:
  [1] wait(kersig::AMDGPU.Runtime.ROCKernelSignal; check_exceptions::Bool, cleanup::Bool, signal_kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ AMDGPU.Runtime /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/runtime/kernel-signal.jl:63
  [2] wait
    @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/runtime/kernel-signal.jl:32 [inlined]
  [3] gpu_call(::AMDGPU.ROCArrayBackend, f::Function, args::Tuple{ROCMatrix{Float32}, Float32}, threads::Int64, blocks::Int64; name::Nothing)
    @ AMDGPU /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:13
  [4] gpu_call
    @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:11 [inlined]
  [5] #gpu_call#1
    @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/device/execution.jl:65 [inlined]
  [6] gpu_call
    @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/device/execution.jl:34 [inlined]
  [7] fill!
    @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/host/construction.jl:14 [inlined]
  [8] ones
    @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:383 [inlined]
  [9] ones(::Int64, ::Int64)
    @ AMDGPU /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:382
 [10] top-level scope
    @ REPL[11]:1

julia> 

Using device! instead of default_device_id! to switch away from the default device fails right away:

julia> AMDGPU.devices()
8-element Vector{ROCDevice}:
 GPU-c96383daa806fd67 [gfx90a]
 GPU-9a4356d61a3e2421 [gfx90a]
 GPU-5b303e9096b8783a [gfx90a]
 GPU-0b64175228ee7027 [gfx90a]
 GPU-929a3274a9566ca5 [gfx90a]
 GPU-2b0e86a577d2d025 [gfx90a]
 GPU-1b0f4c6aeb8256b1 [gfx90a]
 GPU-9f26db390c83e0eb [gfx90a]

julia> AMDGPU.device!(AMDGPU.devices()[2])
GPU-9a4356d61a3e2421 [gfx90a]

julia> AMDGPU.ones(4,4)
┌ Error: Memory Fault on GPU-9a4356d61a3e2421 [gfx90a] at 0x00001501ec20c000:
│ PAGE_NOT_PRESENT, READ_ONLY, NX, HOST_ONLY, DRAMECC, IMPRECISE, SRAMECC, HANG
│ GPU kernels are now hung, please restart Julia
└ @ AMDGPU.Runtime /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/runtime/fault.jl:25
^CERROR: InterruptException:
Stacktrace:
  [1] wait(kersig::AMDGPU.Runtime.ROCKernelSignal; check_exceptions::Bool, cleanup::Bool, signal_kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ AMDGPU.Runtime /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/runtime/kernel-signal.jl:63
  [2] wait
    @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/runtime/kernel-signal.jl:32 [inlined]
  [3] gpu_call(::AMDGPU.ROCArrayBackend, f::Function, args::Tuple{ROCMatrix{Float32}, Float32}, threads::Int64, blocks::Int64; name::Nothing)
    @ AMDGPU /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:13
  [4] gpu_call
    @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:11 [inlined]
  [5] #gpu_call#1
    @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/device/execution.jl:65 [inlined]
  [6] gpu_call
    @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/device/execution.jl:34 [inlined]
  [7] fill!
    @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/host/construction.jl:14 [inlined]
  [8] ones
    @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:383 [inlined]
  [9] ones(::Int64, ::Int64)
    @ AMDGPU /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:382
 [10] top-level scope
    @ REPL[5]:1

julia> 

After the failure, exiting Julia produces the following stack trace:

julia> 
^C^C^C^C^CWARNING: Force throwing a SIGINT
error in running finalizer: InterruptException()
__sched_yield at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x150418209464)
unknown function (ip: 0x15041820cace)
unknown function (ip: 0x1504182190f8)
unknown function (ip: 0x15041816f1ed)
hipStreamDestroy at /opt/rocm/lib/libamdhip64.so.5 (unknown line)
hipStreamDestroy at /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/hip/libhip.jl:59
#7 at /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/hip/HIP.jl:111
unknown function (ip: 0x1501f3a97332)
_jl_invoke at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2940
run_finalizer at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gc.c:417
jl_gc_run_finalizers_in_list at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gc.c:507
run_finalizers at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gc.c:553
ijl_atexit_hook at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/init.c:299
jl_repl_entrypoint at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/jlapi.c:718
main at julia (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x401098)
