Failure on GPU on Caltech cluster #670

Closed
kmdeck opened this issue Jun 21, 2024 · 2 comments
Labels
bug Something isn't working

Comments

kmdeck commented Jun 21, 2024

Describe the bug

Currently, our global run (in experiments) fails on GPU on the Caltech cluster, but it runs on the Clima cluster (land benchmark). The failure seems to come from this line:
Y.canopy.hydraulics.ϑ_l.:1 .= plant_ν

To Reproduce

julia --color=yes --project=.buildkite experiments/integrated/global/global_soil_canopy.jl
ERROR: LoadError: Scalar indexing is disallowed.
Invocation of setindex! resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore should be avoided.
If you want to allow scalar iteration, use `allowscalar` or `@allowscalar`
to enable scalar iteration globally or for the operations in question.
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] errorscalar(op::String)
    @ GPUArraysCore /central/scratch/esm/slurm-buildkite/climaland-ci/depot/default/packages/GPUArraysCore/GMsgk/src/GPUArraysCore.jl:155
  [3] _assertscalar(op::String, behavior::GPUArraysCore.ScalarIndexing)
    @ GPUArraysCore /central/scratch/esm/slurm-buildkite/climaland-ci/depot/default/packages/GPUArraysCore/GMsgk/src/GPUArraysCore.jl:128
  [4] assertscalar(op::String)
    @ GPUArraysCore /central/scratch/esm/slurm-buildkite/climaland-ci/depot/default/packages/GPUArraysCore/GMsgk/src/GPUArraysCore.jl:116
  [5] setindex!(A::CUDA.CuArray{Float64, 4, CUDA.DeviceMemory}, v::Float64, I::Int64)
    @ GPUArrays /central/scratch/esm/slurm-buildkite/climaland-ci/depot/default/packages/GPUArrays/OqrUV/src/host/indexing.jl:56
  [6] setindex!
    @ ./subarray.jl:343 [inlined]
  [7] _setindex!
    @ ./abstractarray.jl:1419 [inlined]
  [8] setindex!
    @ ./abstractarray.jl:1396 [inlined]
  [9] set_struct!
    @ /central/scratch/esm/slurm-buildkite/climaland-ci/depot/default/packages/ClimaCore/Bf3Rx/src/DataLayouts/struct.jl:257 [inlined]
 [10] setindex!
    @ /central/scratch/esm/slurm-buildkite/climaland-ci/depot/default/packages/ClimaCore/Bf3Rx/src/DataLayouts/DataLayouts.jl:784 [inlined]
 [11] fill!(data::ClimaCore.DataLayouts.IJF{Float64, 2, SubArray{Float64, 3, CUDA.CuArray{Float64, 4, CUDA.DeviceMemory}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Int64}, true}}, val::Float64)
    @ ClimaCore.DataLayouts /central/scratch/esm/slurm-buildkite/climaland-ci/depot/default/packages/ClimaCore/Bf3Rx/src/DataLayouts/DataLayouts.jl:734
 [12] fill!(data::ClimaCore.DataLayouts.IJFH{Float64, 2, SubArray{Float64, 4, CUDA.CuArray{Float64, 4, CUDA.DeviceMemory}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}}, val::Float64)
    @ ClimaCore.DataLayouts /central/scratch/esm/slurm-buildkite/climaland-ci/depot/default/packages/ClimaCore/Bf3Rx/src/DataLayouts/DataLayouts.jl:316
 [13] copyto!
    @ /central/scratch/esm/slurm-buildkite/climaland-ci/depot/default/packages/ClimaCore/Bf3Rx/src/DataLayouts/broadcast.jl:479 [inlined]
 [14] copyto!
    @ /central/scratch/esm/slurm-buildkite/climaland-ci/depot/default/packages/ClimaCore/Bf3Rx/src/Fields/broadcast.jl:408 [inlined]
 [15] materialize!
    @ ./broadcast.jl:914 [inlined]
 [16] materialize!(dest::ClimaCore.Fields.Field{ClimaCore.DataLayouts.IJFH{Float64, 2, SubArray{Float64, 4, CUDA.CuArray{Float64, 4, CUDA.DeviceMemory}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}}, ClimaCore.Spaces.SpectralElementSpace2D{ClimaCore.Grids.SpectralElementGrid2D{ClimaCore.Topologies.Topology2D{ClimaComms.SingletonCommsContext{ClimaComms.CUDADevice}, ClimaCore.Meshes.EquiangularCubedSphere{ClimaCore.Domains.SphereDomain{Float64}, ClimaCore.Meshes.NormalizedBilinearMap}, CartesianIndices{3, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, Base.OneTo{Int64}}}, LinearIndices{3, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, Base.OneTo{Int64}}}, CUDA.CuArray{Tuple{Int64, Int64, Int64, Int64, Bool}, 1, CUDA.DeviceMemory}, Vector{Tuple{Int64, Int64, Int64, Int64, Bool}}, CUDA.CuArray{Tuple{Int64, Int64}, 1, CUDA.DeviceMemory}, CUDA.CuArray{Int64, 1, CUDA.DeviceMemory}, CUDA.CuArray{Tuple{Bool, Int64, Int64}, 1, CUDA.DeviceMemory}, CUDA.CuArray{Int64, 1, CUDA.DeviceMemory}, CUDA.CuArray{Int64, 1, CUDA.DeviceMemory}, @NamedTuple{}, CUDA.CuArray{Tuple{Int64, Int64}, 1, CUDA.DeviceMemory}}, ClimaCore.Quadratures.GLL{2}, ClimaCore.Geometry.SphericalGlobalGeometry{Float64}, ClimaCore.DataLayouts.IJFH{ClimaCore.Geometry.LocalGeometry{(1, 2), ClimaCore.Geometry.LatLongPoint{Float64}, Float64, StaticArraysCore.SMatrix{2, 2, Float64, 4}}, 2, CUDA.CuArray{Float64, 4, CUDA.DeviceMemory}}, ClimaCore.DataLayouts.IJFH{Float64, 2, CUDA.CuArray{Float64, 4, CUDA.DeviceMemory}}, ClimaCore.DataLayouts.IFH{ClimaCore.Geometry.SurfaceGeometry{Float64, ClimaCore.Geometry.UVVector{Float64}}, 2, CUDA.CuArray{Float64, 3, CUDA.DeviceMemory}}, @NamedTuple{}}}}, bc::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(identity), Tuple{Float64}})
    @ Base.Broadcast ./broadcast.jl:911
 [17] top-level scope
    @ /central/scratch/esm/slurm-buildkite/climaland-ci/4335/climaland-ci/experiments/integrated/global/global_soil_canopy.jl:335
in expression starting at /central/scratch/esm/slurm-buildkite/climaland-ci/4335/climaland-ci/experiments/integrated/global/global_soil_canopy.jl:335
🚨 Error: The command exited with status 1
kmdeck added the bug label Jun 21, 2024

Sbozzolo commented Aug 1, 2024

I get this error with your reproducer:

CLIMACOMMS_DEVICE="CUDA" CLIMACOMMS_CONTEXT="SINGLETON" julia --color=yes --project=.buildkite experiments/integrated/global/global_soil_canopy.jl
ERROR: LoadError: InvalidIRError: compiling MethodInstance for ClimaCoreCUDAExt.knl_copyto!(::ClimaCore.DataLayouts.VIJFH{Float64, 10, 2, CUDA.CuDeviceArray{Float64, 5, 1}}, ::Base.Broadcast.Broadcasted{ClimaCore.DataLayouts.VIJFHStyle{10, 2, CUDA.CuArray{Float64, N, CUDA.DeviceMemory} where N}, NTuple{5, Base.OneTo{Int64}}, typeof(ClimaCore.RecursiveApply.rmul), Tuple{ClimaCore.DataLayouts.IJFH{Float64, 2, SubArray{Float64, 4, CUDA.CuDeviceArray{Float64, 4, 1}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}}, Base.Broadcast.Broadcasted{ClimaCore.DataLayouts.VIJFHStyle{10, 2, CUDA.CuArray{Float64, N, CUDA.DeviceMemory} where N}, Nothing, typeof(ClimaLand.Canopy.PlantHydraulics.flux), Tuple{ClimaCore.DataLayouts.VIJFH{Float64, 10, 2, SubArray{Float64, 5, CUDA.CuDeviceArray{Float64, 5, 1}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}}, Float64, ClimaCore.DataLayouts.VIJFH{Float64, 10, 2, CUDA.CuDeviceArray{Float64, 5, 1}}, ClimaCore.DataLayouts.IJFH{Float64, 2, SubArray{Float64, 4, CUDA.CuDeviceArray{Float64, 4, 1}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}}, Base.Broadcast.Broadcasted{ClimaCore.DataLayouts.VIJFHStyle{10, 2, CUDA.CuArray{Float64, N, CUDA.DeviceMemory} where N}, Nothing, typeof(ClimaLand.Canopy.PlantHydraulics.hydraulic_conductivity), Tuple{Tuple{ClimaLand.Canopy.PlantHydraulics.Weibull{Float64}}, ClimaCore.DataLayouts.VIJFH{Float64, 10, 2, CUDA.CuDeviceArray{Float64, 5, 1}}}}, Base.Broadcast.Broadcasted{ClimaCore.DataLayouts.IJFHStyle{2, CUDA.CuArray{Float64, N, CUDA.DeviceMemory} where N}, Nothing, typeof(ClimaLand.Canopy.PlantHydraulics.hydraulic_conductivity), Tuple{Tuple{ClimaLand.Canopy.PlantHydraulics.Weibull{Float64}}, ClimaCore.DataLayouts.IJFH{Float64, 2, SubArray{Float64, 
4, CUDA.CuDeviceArray{Float64, 4, 1}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}}}}}}, Base.Broadcast.Broadcasted{ClimaCore.DataLayouts.VIJFHStyle{10, 2, CUDA.CuArray{Float64, N, CUDA.DeviceMemory} where N}, Nothing, typeof(root_distribution), Tuple{ClimaCore.DataLayouts.VIJFH{Float64, 10, 2, SubArray{Float64, 5, CUDA.CuDeviceArray{Float64, 5, 1}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}}}}}}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to /)
Stacktrace:
 [1] #root_distribution#18
   @ /central/home/gbozzola/ClimaLand.jl/experiments/integrated/global/global_soil_canopy.jl:247
 [2] root_distribution
   @ /central/home/gbozzola/ClimaLand.jl/experiments/integrated/global/global_soil_canopy.jl:246
 [3] _broadcast_getindex_evalf
   @ ./broadcast.jl:709
 [4] _broadcast_getindex
   @ ./broadcast.jl:682
 [5] _getindex
   @ ./broadcast.jl:706
 [6] _getindex (repeats 2 times)
   @ ./broadcast.jl:705
 [7] _broadcast_getindex
   @ ./broadcast.jl:681
 [8] getindex
   @ ./broadcast.jl:636
 [9] knl_copyto!
   @ ~/.julia/packages/ClimaCore/fqQdO/ext/cuda/data_layouts_copyto.jl:13
Reason: unsupported dynamic function invocation (call to Float64)
Stacktrace:
 [1] #root_distribution#18
   @ /central/home/gbozzola/ClimaLand.jl/experiments/integrated/global/global_soil_canopy.jl:247
 [2] root_distribution
   @ /central/home/gbozzola/ClimaLand.jl/experiments/integrated/global/global_soil_canopy.jl:246
 [3] _broadcast_getindex_evalf
   @ ./broadcast.jl:709
 [4] _broadcast_getindex
   @ ./broadcast.jl:682
 [5] _getindex
   @ ./broadcast.jl:706
 [6] _getindex (repeats 2 times)
   @ ./broadcast.jl:705
 [7] _broadcast_getindex
   @ ./broadcast.jl:681
 [8] getindex
   @ ./broadcast.jl:636
 [9] knl_copyto!
   @ ~/.julia/packages/ClimaCore/fqQdO/ext/cuda/data_layouts_copyto.jl:13
Reason: unsupported dynamic function invocation (call to exp)
Stacktrace:
 [1] #root_distribution#18
   @ /central/home/gbozzola/ClimaLand.jl/experiments/integrated/global/global_soil_canopy.jl:247
 [2] root_distribution
   @ /central/home/gbozzola/ClimaLand.jl/experiments/integrated/global/global_soil_canopy.jl:246
 [3] _broadcast_getindex_evalf
   @ ./broadcast.jl:709
 [4] _broadcast_getindex
   @ ./broadcast.jl:682
 [5] _getindex
   @ ./broadcast.jl:706
 [6] _getindex (repeats 2 times)
   @ ./broadcast.jl:705
 [7] _broadcast_getindex
   @ ./broadcast.jl:681
 [8] getindex
   @ ./broadcast.jl:636
 [9] knl_copyto!
   @ ~/.julia/packages/ClimaCore/fqQdO/ext/cuda/data_layouts_copyto.jl:13
Reason: unsupported dynamic function invocation (call to *)
Stacktrace:
 [1] #root_distribution#18
   @ /central/home/gbozzola/ClimaLand.jl/experiments/integrated/global/global_soil_canopy.jl:247
 [2] root_distribution
   @ /central/home/gbozzola/ClimaLand.jl/experiments/integrated/global/global_soil_canopy.jl:246
 [3] _broadcast_getindex_evalf
   @ ./broadcast.jl:709
 [4] _broadcast_getindex
   @ ./broadcast.jl:682
 [5] _getindex
   @ ./broadcast.jl:706
 [6] _getindex (repeats 2 times)
   @ ./broadcast.jl:705
 [7] _broadcast_getindex
   @ ./broadcast.jl:681
 [8] getindex
   @ ./broadcast.jl:636
 [9] knl_copyto!
   @ ~/.julia/packages/ClimaCore/fqQdO/ext/cuda/data_layouts_copyto.jl:13
Reason: unsupported dynamic function invocation (call to _broadcast_getindex_evalf)
Stacktrace:
 [1] _broadcast_getindex
   @ ./broadcast.jl:682
 [2] getindex
   @ ./broadcast.jl:636
 [3] knl_copyto!
   @ ~/.julia/packages/ClimaCore/fqQdO/ext/cuda/data_layouts_copyto.jl:13
Reason: unsupported dynamic function invocation (call to setindex!)

Can you check that it runs on the Clima cluster?

There is probably an extra step needed with root_distribution.
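
For context, "unsupported dynamic function invocation" means the GPU compiler could not resolve a call to a concrete method at compile time, which usually points to a type instability inside the broadcasted function. The sketch below is an assumption about the kind of pattern that can cause this, not the actual code in global_soil_canopy.jl: the names rooting_depth_global, root_distribution_unstable, and root_distribution_stable are hypothetical. It runs on the CPU and only compares inferred return types.

```julia
# Hypothetical stand-ins; NOT the actual code from global_soil_canopy.jl.

# A non-const global: its type may change at any time, so the compiler
# treats every read of it as `Any`.
rooting_depth_global = 0.5

# Reads the non-const global, so the calls to `/`, `exp`, and `*` cannot be
# resolved at compile time and must be dispatched dynamically.
root_distribution_unstable(z) =
    (1 / rooting_depth_global) * exp(z / rooting_depth_global)

# Takes the depth as a typed argument, so every call is statically resolved.
root_distribution_stable(z::FT, rooting_depth::FT) where {FT <: AbstractFloat} =
    (1 / rooting_depth) * exp(z / rooting_depth)

# Ask inference for the return type of each version.
# An abstract result such as `Any` signals dynamic dispatch.
unstable_type =
    only(Base.return_types(root_distribution_unstable, (Float64,)))
stable_type =
    only(Base.return_types(root_distribution_stable, (Float64, Float64)))

println("unstable version infers: ", unstable_type)
println("stable version infers:   ", stable_type)
```

If this is indeed the cause, passing the parameter as an argument or declaring the global `const` would be possible ways to make the function compile inside a GPU kernel. `@code_warntype` on the function gives a more detailed view of where the inference breaks down.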

@Sbozzolo

I checked out the most recent main, fixed the issue with root_distribution, and tested it on the Caltech cluster. It works, so maybe this is fixed now?
