
Test multigpu on CI #2348

Merged: maleadt merged 3 commits into master from tb/multigpu on Apr 26, 2024

Conversation

@maleadt (Member) commented Apr 25, 2024

No description provided.

@maleadt added the ci (Everything related to continuous integration) label on Apr 25, 2024
[skip julia]
[skip cuda]
[skip subpackages]
[skip downstream]
@giordano commented:

I'm getting

julia> Pkg.test("CUDA"; test_args=`--gpu=0,1`);
     Testing CUDA
[...]
  [052768ef] CUDA v5.4.0 `https://github.com/JuliaGPU/CUDA.jl#tb/multigpu`
[...]
     Testing Running tests...
┌ Info: System information:
│ CUDA runtime 12.4, artifact installation
│ CUDA driver 12.4
│ NVIDIA driver 550.54.14
│ 
│ CUDA libraries: 
│ - CUBLAS: 12.4.5
│ - CURAND: 10.3.5
│ - CUFFT: 11.2.1
│ - CUSOLVER: 11.6.1
│ - CUSPARSE: 12.3.1
│ - CUPTI: 22.0.0
│ - NVML: 12.0.0+550.54.14
│ 
│ Julia packages: 
│ - CUDA: 5.4.0
│ - CUDA_Driver_jll: 0.8.1+0
│ - CUDA_Runtime_jll: 0.12.1+0
│ 
│ Toolchain:
│ - Julia: 1.10.2
│ - LLVM: 15.0.7
│ 
│ 2 devices:
│   0: NVIDIA A100 80GB PCIe (sm_80, 78.998 GiB / 80.000 GiB available)
└   1: NVIDIA A100 80GB PCIe (sm_80, 79.135 GiB / 80.000 GiB available)
[ Info: Testing using device 0 (NVIDIA A100 80GB PCIe) and 1 (NVIDIA A100 80GB PCIe). To change this, specify the `--gpu` argument to the tests, or set the `CUDA_VISIBLE_DEVICES` environment variable.

and looking at btop seems to confirm that only the first device is being used.
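
For what it's worth, a minimal way to double-check from the REPL which devices are visible and which one the current task is bound to (a sketch using CUDA.jl's `devices()`/`device()` API; the output shown in the comments is illustrative):

```julia
using CUDA

# List every device CUDA.jl can see (after any CUDA_VISIBLE_DEVICES filtering).
for dev in CUDA.devices()
    println(dev)    # e.g. "CuDevice(0): NVIDIA A100 80GB PCIe"
end

# The device the current task is executing on.
@show CUDA.device()
```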

@giordano commented:

Hmm, perhaps the log is misleading, because one test (and only one) is failing with

Some tests did not pass: 408 passed, 0 failed, 1 errored, 0 broken.
base/array: Error During Test at /home/cceamgi/.julia/packages/CUDA/54m3h/test/base/array.jl:842
  Got exception outside of a @test
  ArgumentError: cannot take the GPU address of inaccessible device memory.
  
  You are trying to use memory from GPU 1 while executing on GPU 0.
  P2P access between these devices is not possible; either switch execution to GPU 1
  by calling `CUDA.device!(1)`, or copy the data to an array allocated on device 0.
  Stacktrace:
    [1] convert(::Type{CuPtr{Float64}}, managed::CUDA.Managed{CUDA.DeviceMemory})
      @ CUDA ~/.julia/packages/CUDA/54m3h/src/memory.jl:540
    [2] unsafe_convert
      @ ~/.julia/packages/CUDA/54m3h/src/array.jl:429 [inlined]
    [3] #pointer#1109
      @ ~/.julia/packages/CUDA/54m3h/src/array.jl:387 [inlined]
    [4] pointer
      @ ~/.julia/packages/CUDA/54m3h/src/array.jl:379 [inlined]
    [5] (::CUDA.var"#1115#1116"{Float64, CuArray{Float64, 2, CUDA.DeviceMemory}, Int64, CuArray{Float64, 2, CUDA.DeviceMemory}, Int64, Int64})()
      @ CUDA ~/.julia/packages/CUDA/54m3h/src/array.jl:569
    [6] #context!#978
      @ ~/.julia/packages/CUDA/54m3h/lib/cudadrv/state.jl:170 [inlined]
    [7] context!
      @ ~/.julia/packages/CUDA/54m3h/lib/cudadrv/state.jl:165 [inlined]
    [8] unsafe_copyto!(dest::CuArray{Float64, 2, CUDA.DeviceMemory}, doffs::Int64, src::CuArray{Float64, 2, CUDA.DeviceMemory}, soffs::Int64, n::Int64)
      @ CUDA ~/.julia/packages/CUDA/54m3h/src/array.jl:567
    [9] copyto!
      @ ~/.julia/packages/CUDA/54m3h/src/array.jl:512 [inlined]
   [10] copyto!(dest::CuArray{Float64, 2, CUDA.DeviceMemory}, src::CuArray{Float64, 2, CUDA.DeviceMemory})
      @ CUDA ~/.julia/packages/CUDA/54m3h/src/array.jl:516
   [11] macro expansion
      @ ~/.julia/packages/CUDA/54m3h/test/base/array.jl:850 [inlined]
   [12] macro expansion
      @ ~/.julia/juliaup/julia-1.10.2+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [13] macro expansion
      @ ~/.julia/packages/CUDA/54m3h/test/base/array.jl:843 [inlined]
   [14] macro expansion
      @ ~/.julia/juliaup/julia-1.10.2+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [15] top-level scope
      @ ~/.julia/packages/CUDA/54m3h/test/base/array.jl:776
   [16] include
      @ ./client.jl:489 [inlined]
   [17] #11
      @ ~/.julia/packages/CUDA/54m3h/test/runtests.jl:87 [inlined]
   [18] macro expansion
      @ ~/.julia/packages/CUDA/54m3h/test/setup.jl:60 [inlined]
   [19] macro expansion
      @ ~/.julia/juliaup/julia-1.10.2+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [20] macro expansion
      @ ~/.julia/packages/CUDA/54m3h/test/setup.jl:60 [inlined]
   [21] macro expansion
      @ ~/.julia/packages/CUDA/54m3h/src/utilities.jl:35 [inlined]
   [22] macro expansion
      @ ~/.julia/packages/CUDA/54m3h/src/memory.jl:813 [inlined]
   [23] top-level scope
      @ ~/.julia/packages/CUDA/54m3h/test/setup.jl:59
   [24] eval
      @ ./boot.jl:385 [inlined]
   [25] runtests(f::Function, name::String, time_source::Symbol)
      @ Main ~/.julia/packages/CUDA/54m3h/test/setup.jl:71
   [26] invokelatest(::Any, ::Any, ::Vararg{Any}; kwargs::@Kwargs{})
      @ Base ./essentials.jl:892
   [27] invokelatest(::Any, ::Any, ::Vararg{Any})
      @ Base ./essentials.jl:889
   [28] (::Distributed.var"#110#112"{Distributed.CallMsg{:call_fetch}})()
      @ Distributed ~/.julia/juliaup/julia-1.10.2+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:287
   [29] run_work_thunk(thunk::Distributed.var"#110#112"{Distributed.CallMsg{:call_fetch}}, print_error::Bool)
      @ Distributed ~/.julia/juliaup/julia-1.10.2+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:70
   [30] (::Distributed.var"#109#111"{Distributed.CallMsg{:call_fetch}, Distributed.MsgHeader, Sockets.TCPSocket})()
      @ Distributed ~/.julia/juliaup/julia-1.10.2+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:287

which suggests this is trying to run some code on both GPUs.
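
For context, here is a hypothetical reduction of the situation the error message describes, together with the two remedies it suggests (illustrative only; `a`, `b`, `host`, and `b0` are made-up names, not the arrays from `test/base/array.jl`):

```julia
using CUDA

CUDA.device!(1)
b = CUDA.rand(Float64, 16)   # memory owned by GPU 1
host = Array(b)              # host copy, taken while GPU 1 is still current

CUDA.device!(0)              # now executing on GPU 0, where `b` may be inaccessible

# Remedy 1: switch execution back to the device that owns the memory.
CUDA.device!(1)
sum(b)

# Remedy 2: copy the data into an array allocated on device 0 and work there.
CUDA.device!(0)
b0 = CuArray(host)
sum(b0)
```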

@maleadt (Member, Author) commented Apr 25, 2024

99% of the tests are only going to be using device 0; the fact that multiple devices are available only enables certain tests that require them. We don't do load balancing over multiple devices or anything (typically the CPU is the bottleneck anyway).
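
In case it helps, a rough sketch of how a device-count-gated test can look (my own illustration, not code from the test suite; staging the copy through the host is just one way to move data between the two devices):

```julia
using CUDA, Test

if length(CUDA.devices()) < 2
    @info "Skipping multi-GPU test: fewer than two devices visible"
else
    dev0, dev1 = collect(CUDA.devices())[1:2]

    CUDA.device!(dev0)
    a = CUDA.rand(Float64, 32)    # allocated on the first device

    CUDA.device!(dev1)
    b = CuArray(Array(a))         # staged through the host onto the second device

    @test Array(b) == Array(a)
end
```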

@giordano commented:

I restarted the Julia session (I think I messed up the value of CUDA_VISIBLE_DEVICES before) and now I get

│ 2 devices:
│   0: NVIDIA A100 80GB PCIe (sm_80, 78.998 GiB / 80.000 GiB available)
└   1: NVIDIA A100 80GB PCIe (sm_80, 79.135 GiB / 80.000 GiB available)
[ Info: Testing using device 0 (NVIDIA A100 80GB PCIe) and 1 (NVIDIA A100 80GB PCIe). To change this, specify the `--gpu` argument to the tests, or set the `CUDA_VISIBLE_DEVICES` environment variable.

which is more promising, but I still get the test failure above.
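
Side note, for anyone following along: as far as I understand it, `CUDA_VISIBLE_DEVICES` is only read when the CUDA driver initializes, so changing it in a session that has already used the GPU has no effect, hence the restart. A sketch of the two device-selection mechanisms the log message mentions (the device indices are just examples):

```julia
using Pkg

# 1. Via the test runner's `--gpu` argument:
Pkg.test("CUDA"; test_args=`--gpu=0,1`)

# 2. Via CUDA_VISIBLE_DEVICES, set in a fresh session before CUDA initializes
#    (or in the shell before starting Julia):
# ENV["CUDA_VISIBLE_DEVICES"] = "0,1"
# using CUDA
```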

codecov bot commented Apr 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 58.54%. Comparing base (5dd6bb2) to head (efe63d6).
Report is 1 commit behind head on master.

❗ Current head efe63d6 differs from pull request most recent head 0d661b7. Consider uploading reports for the commit 0d661b7 to get more accurate results.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #2348       +/-   ##
===========================================
- Coverage   71.86%   58.54%   -13.33%     
===========================================
  Files         155      155               
  Lines       15072    14964      -108     
===========================================
- Hits        10832     8760     -2072     
- Misses       4240     6204     +1964     


@maleadt (Member, Author) commented Apr 26, 2024

Pushed a fix for that issue; @giordano can you try again?

@giordano commented:

Test Summary: |  Pass  Broken  Total  Time
  Overall     | 24156       9  24165      
    SUCCESS
     Testing CUDA tests passed 

All green now, thanks!

@maleadt merged commit dd9ff2f into master on Apr 26, 2024 (1 check was pending).
@maleadt deleted the tb/multigpu branch on Apr 26, 2024 at 11:28.