
Test multigpu on CI #2348

Merged: maleadt merged 3 commits into master from tb/multigpu on Apr 26, 2024

Conversation

@maleadt (Member) commented Apr 25, 2024

No description provided.

@maleadt added the ci (Everything related to continuous integration) label on Apr 25, 2024
[skip julia]
[skip cuda]
[skip subpackages]
[skip downstream]
@giordano commented:

I'm getting

julia> Pkg.test("CUDA"; test_args=`--gpu=0,1`);
     Testing CUDA
[...]
  [052768ef] CUDA v5.4.0 `https://github.com/JuliaGPU/CUDA.jl#tb/multigpu`
[...]
     Testing Running tests...
┌ Info: System information:
│ CUDA runtime 12.4, artifact installation
│ CUDA driver 12.4
│ NVIDIA driver 550.54.14
│ 
│ CUDA libraries: 
│ - CUBLAS: 12.4.5
│ - CURAND: 10.3.5
│ - CUFFT: 11.2.1
│ - CUSOLVER: 11.6.1
│ - CUSPARSE: 12.3.1
│ - CUPTI: 22.0.0
│ - NVML: 12.0.0+550.54.14
│ 
│ Julia packages: 
│ - CUDA: 5.4.0
│ - CUDA_Driver_jll: 0.8.1+0
│ - CUDA_Runtime_jll: 0.12.1+0
│ 
│ Toolchain:
│ - Julia: 1.10.2
│ - LLVM: 15.0.7
│ 
│ 2 devices:
│   0: NVIDIA A100 80GB PCIe (sm_80, 78.998 GiB / 80.000 GiB available)
└   1: NVIDIA A100 80GB PCIe (sm_80, 79.135 GiB / 80.000 GiB available)
[ Info: Testing using device 0 (NVIDIA A100 80GB PCIe) and 1 (NVIDIA A100 80GB PCIe). To change this, specify the `--gpu` argument to the tests, or set the `CUDA_VISIBLE_DEVICES` environment variable.

and looking at btop seems to confirm that only the first device is being used.
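
For what it's worth, a minimal way to double-check from the REPL which devices are visible and which one the current task is bound to (a sketch using CUDA.jl's `devices()`/`device()` API; the output shown in the comments is illustrative):

```julia
using CUDA

# List every device CUDA.jl can see (after any CUDA_VISIBLE_DEVICES filtering).
for dev in CUDA.devices()
    println(dev)    # e.g. "CuDevice(0): NVIDIA A100 80GB PCIe"
end

# The device the current task is executing on.
@show CUDA.device()
```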

@giordano commented:

Hmm, perhaps the log is misleading, because one test (and only one) is failing with

Some tests did not pass: 408 passed, 0 failed, 1 errored, 0 broken.
base/array: Error During Test at /home/cceamgi/.julia/packages/CUDA/54m3h/test/base/array.jl:842
  Got exception outside of a @test
  ArgumentError: cannot take the GPU address of inaccessible device memory.
  
  You are trying to use memory from GPU 1 while executing on GPU 0.
  P2P access between these devices is not possible; either switch execution to GPU 1
  by calling `CUDA.device!(1)`, or copy the data to an array allocated on device 0.
  Stacktrace:
    [1] convert(::Type{CuPtr{Float64}}, managed::CUDA.Managed{CUDA.DeviceMemory})
      @ CUDA ~/.julia/packages/CUDA/54m3h/src/memory.jl:540
    [2] unsafe_convert
      @ ~/.julia/packages/CUDA/54m3h/src/array.jl:429 [inlined]
    [3] #pointer#1109
      @ ~/.julia/packages/CUDA/54m3h/src/array.jl:387 [inlined]
    [4] pointer
      @ ~/.julia/packages/CUDA/54m3h/src/array.jl:379 [inlined]
    [5] (::CUDA.var"#1115#1116"{Float64, CuArray{Float64, 2, CUDA.DeviceMemory}, Int64, CuArray{Float64, 2, CUDA.DeviceMemory}, Int64, Int64})()
      @ CUDA ~/.julia/packages/CUDA/54m3h/src/array.jl:569
    [6] #context!#978
      @ ~/.julia/packages/CUDA/54m3h/lib/cudadrv/state.jl:170 [inlined]
    [7] context!
      @ ~/.julia/packages/CUDA/54m3h/lib/cudadrv/state.jl:165 [inlined]
    [8] unsafe_copyto!(dest::CuArray{Float64, 2, CUDA.DeviceMemory}, doffs::Int64, src::CuArray{Float64, 2, CUDA.DeviceMemory}, soffs::Int64, n::Int64)
      @ CUDA ~/.julia/packages/CUDA/54m3h/src/array.jl:567
    [9] copyto!
      @ ~/.julia/packages/CUDA/54m3h/src/array.jl:512 [inlined]
   [10] copyto!(dest::CuArray{Float64, 2, CUDA.DeviceMemory}, src::CuArray{Float64, 2, CUDA.DeviceMemory})
      @ CUDA ~/.julia/packages/CUDA/54m3h/src/array.jl:516
   [11] macro expansion
      @ ~/.julia/packages/CUDA/54m3h/test/base/array.jl:850 [inlined]
   [12] macro expansion
      @ ~/.julia/juliaup/julia-1.10.2+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [13] macro expansion
      @ ~/.julia/packages/CUDA/54m3h/test/base/array.jl:843 [inlined]
   [14] macro expansion
      @ ~/.julia/juliaup/julia-1.10.2+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [15] top-level scope
      @ ~/.julia/packages/CUDA/54m3h/test/base/array.jl:776
   [16] include
      @ ./client.jl:489 [inlined]
   [17] #11
      @ ~/.julia/packages/CUDA/54m3h/test/runtests.jl:87 [inlined]
   [18] macro expansion
      @ ~/.julia/packages/CUDA/54m3h/test/setup.jl:60 [inlined]
   [19] macro expansion
      @ ~/.julia/juliaup/julia-1.10.2+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [20] macro expansion
      @ ~/.julia/packages/CUDA/54m3h/test/setup.jl:60 [inlined]
   [21] macro expansion
      @ ~/.julia/packages/CUDA/54m3h/src/utilities.jl:35 [inlined]
   [22] macro expansion
      @ ~/.julia/packages/CUDA/54m3h/src/memory.jl:813 [inlined]
   [23] top-level scope
      @ ~/.julia/packages/CUDA/54m3h/test/setup.jl:59
   [24] eval
      @ ./boot.jl:385 [inlined]
   [25] runtests(f::Function, name::String, time_source::Symbol)
      @ Main ~/.julia/packages/CUDA/54m3h/test/setup.jl:71
   [26] invokelatest(::Any, ::Any, ::Vararg{Any}; kwargs::@Kwargs{})
      @ Base ./essentials.jl:892
   [27] invokelatest(::Any, ::Any, ::Vararg{Any})
      @ Base ./essentials.jl:889
   [28] (::Distributed.var"#110#112"{Distributed.CallMsg{:call_fetch}})()
      @ Distributed ~/.julia/juliaup/julia-1.10.2+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:287
   [29] run_work_thunk(thunk::Distributed.var"#110#112"{Distributed.CallMsg{:call_fetch}}, print_error::Bool)
      @ Distributed ~/.julia/juliaup/julia-1.10.2+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:70
   [30] (::Distributed.var"#109#111"{Distributed.CallMsg{:call_fetch}, Distributed.MsgHeader, Sockets.TCPSocket})()
      @ Distributed ~/.julia/juliaup/julia-1.10.2+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:287

which suggests this is trying to run some code on both GPUs.
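
For context, here is a hypothetical reduction of the situation the error message describes, together with the two remedies it suggests (illustrative only; `a`, `b`, `host`, and `b0` are made-up names, not the arrays from `test/base/array.jl`):

```julia
using CUDA

CUDA.device!(1)
b = CUDA.rand(Float64, 16)   # memory owned by GPU 1
host = Array(b)              # host copy, taken while GPU 1 is still current

CUDA.device!(0)              # now executing on GPU 0, where `b` may be inaccessible

# Remedy 1: switch execution back to the device that owns the memory.
CUDA.device!(1)
sum(b)

# Remedy 2: copy the data into an array allocated on device 0 and work there.
CUDA.device!(0)
b0 = CuArray(host)
sum(b0)
```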

@maleadt (Member, Author) commented Apr 25, 2024

99% of the tests are only going to be using device 0; the fact that multiple devices are available only enables certain tests that require them. We don't do load balancing over multiple devices or anything (typically the CPU is the bottleneck anyway).
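
In case it helps, a rough sketch of how a device-count-gated test can look (my own illustration, not code from the test suite; staging the copy through the host is just one way to move data between the two devices):

```julia
using CUDA, Test

if length(CUDA.devices()) < 2
    @info "Skipping multi-GPU test: fewer than two devices visible"
else
    dev0, dev1 = collect(CUDA.devices())[1:2]

    CUDA.device!(dev0)
    a = CUDA.rand(Float64, 32)    # allocated on the first device

    CUDA.device!(dev1)
    b = CuArray(Array(a))         # staged through the host onto the second device

    @test Array(b) == Array(a)
end
```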

@giordano commented:

I restarted the Julia session (I think I messed up the value of CUDA_VISIBLE_DEVICES before) and now I get

│ 2 devices:
│   0: NVIDIA A100 80GB PCIe (sm_80, 78.998 GiB / 80.000 GiB available)
└   1: NVIDIA A100 80GB PCIe (sm_80, 79.135 GiB / 80.000 GiB available)
[ Info: Testing using device 0 (NVIDIA A100 80GB PCIe) and 1 (NVIDIA A100 80GB PCIe). To change this, specify the `--gpu` argument to the tests, or set the `CUDA_VISIBLE_DEVICES` environment variable.

which is more promising, but I still get the test failure above.
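
Side note, for anyone following along: as far as I understand it, `CUDA_VISIBLE_DEVICES` is only read when the CUDA driver initializes, so changing it in a session that has already used the GPU has no effect, hence the restart. A sketch of the two device-selection mechanisms the log message mentions (the device indices are just examples):

```julia
using Pkg

# 1. Via the test runner's `--gpu` argument:
Pkg.test("CUDA"; test_args=`--gpu=0,1`)

# 2. Via CUDA_VISIBLE_DEVICES, set in a fresh session before CUDA initializes
#    (or in the shell before starting Julia):
# ENV["CUDA_VISIBLE_DEVICES"] = "0,1"
# using CUDA
```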

codecov bot commented Apr 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 58.54%. Comparing base (5dd6bb2) to head (efe63d6).
Report is 1 commit behind head on master.

❗ Current head efe63d6 differs from pull request most recent head 0d661b7. Consider uploading reports for the commit 0d661b7 to get more accurate results.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #2348       +/-   ##
===========================================
- Coverage   71.86%   58.54%   -13.33%     
===========================================
  Files         155      155               
  Lines       15072    14964      -108     
===========================================
- Hits        10832     8760     -2072     
- Misses       4240     6204     +1964     


@maleadt (Member, Author) commented Apr 26, 2024

Pushed a fix for that issue; @giordano can you try again?

@giordano commented:

Test Summary: |  Pass  Broken  Total  Time
  Overall     | 24156       9  24165      
    SUCCESS
     Testing CUDA tests passed 

All green now, thanks!

@maleadt merged commit dd9ff2f into master on Apr 26, 2024 (1 check was pending).
@maleadt deleted the tb/multigpu branch on Apr 26, 2024 at 11:28.