Conversation

@giordano
Member

Let's see how this fares. The idea is to reduce the pressure on the juliaecosystem runners a bit.

@giordano giordano requested a review from mofeing January 16, 2025 18:37
@giordano giordano marked this pull request as draft January 16, 2025 21:11
@giordano
Member Author

The queue is quite long at the moment; we aren't going to gain much for the time being 🥲

@mofeing
Collaborator

mofeing commented Jan 16, 2025

A lot of people might be trying to use it too. Let's give it a couple of days and trigger another run then.

Contributor

@imciner2 imciner2 left a comment


It doesn't have the 64 suffix. So, the long wait time is because it is trying to run a job on a runner that doesn't exist.
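(For context: a GitHub Actions job whose `runs-on` label no runner advertises never fails fast, it just queues forever. An illustrative fragment — the label names here are examples, not necessarily this repo's actual config:)

```yaml
jobs:
  test:
    # GitHub's hosted arm runners use a label without a "64" suffix
    # (e.g. `ubuntu-24.04-arm`); a label like `ubuntu-24.04-arm64`
    # matches no hosted runner, so the job waits in the queue indefinitely.
    runs-on: ubuntu-24.04-arm
```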

@giordano
Member Author

🤦

@giordano
Member Author

https://github.com/EnzymeAD/Reactant.jl/actions/runs/12820824033/job/35751035540#step:9:778

     CondaPkg Creating environment
             │ /home/runner/.julia/artifacts/528c740035ea96b4091a42660cac663aa0a3f036/bin/micromamba
             │ -r /home/runner/.julia/scratchspaces/0b3b1443-0f03-428d-bdfb-f27f9c1191ea/root
             │ create
             │ -y
             │ -p /tmp/jl_BCzCK3/.CondaPkg/env
             │ --override-channels
             │ --no-channel-priority
             │ libstdcxx-ng[version='>=3.4,<13.0']
             │ python[version='>=3.8,<4',channel='conda-forge',build='*cpython*']
             │ uv[version='>=0.4']
             └ -c conda-forge
info     libmamba ****************** Backtrace Start ******************
debug    libmamba Loading configuration
trace    libmamba Compute configurable 'create_base'
trace    libmamba Compute configurable 'no_env'
trace    libmamba Compute configurable 'no_rc'
trace    libmamba Compute configurable 'rc_files'
trace    libmamba Compute configurable 'root_prefix'
trace    libmamba Get RC files configuration from locations up to HomeDir
trace    libmamba Configuration not found at '/home/runner/.mambarc'
trace    libmamba Configuration not found at '/home/runner/.mamba/mambarc.d'
trace    libmamba Configuration not found at '/home/runner/.mamba/mambarc'
trace    libmamba Configuration not found at '/home/runner/.mamba/.mambarc'
critical libmamba filesystem error: status: Permission denied [/home/runneradmin/.config/mamba/mambarc.d]
info     libmamba ****************** Backtrace End ********************

wut

@mofeing
Collaborator

mofeing commented Jan 19, 2025

current failing tests are due to some bug in the CUDA integration. the buildkite job was failing before this PR, but the one "CI / Julia 1.11 - integration - ubuntu-24.04-arm - aarch64 - packaged libReactant - assertions=false" could be a spurious error?

@giordano
Member Author

I don't think that's spurious, I've got similar errors locally on my laptop the other day but didn't have the time to investigate.

@mofeing
Collaborator

mofeing commented Jan 19, 2025

cc @avik-pal

@giordano
Member Author

Same error happens also on main on Buildkite, looks very real to me: https://buildkite.com/julialang/reactant-dot-jl/builds/3529#01947d51-fb10-4f9c-b3ed-7e4897ed7dd8/286-1084. However, it doesn't seem to be systematic.

@wsmoses
Member

wsmoses commented Jan 19, 2025

@maleadt @vchuravy @gbaraldi if you have any insights here, for some reason GPUCompiler.load_runtime of a cuda config is throwing "Not implemented" on aarch64

@giordano
Member Author

My understanding is that it's happening at line 29 of

@testset "Square Kernel" begin
    oA = collect(1:1:64)
    A = Reactant.to_rarray(oA)
    B = Reactant.to_rarray(100 .* oA)
    if CUDA.functional()
        @jit square!(A, B)
        @test all(Array(A) .≈ (oA .* oA .* 100))
        @test all(Array(B) .≈ (oA .* 100))
    else
        @code_hlo optimize = :before_kernel square!(A, B)
    end
end
in the !CUDA.functional() branch, which makes sense since we don't have a GPU in this setup.

@wsmoses
Member

wsmoses commented Jan 19, 2025

yeah but we should still be able to compile the code successfully (and this works in a similar no-GPU case on x86)

@vchuravy
Member

No not really. Are you doing any world-age shenanigans? You might be executing in a world before CUDA.jl got loaded?

@wsmoses
Member

wsmoses commented Jan 19, 2025

well for one thing CUDA.functional() is false (as there is no GPU on the machine). but no, CUDA should've already been loaded (and as an example we've already called GPUCompiler.compile to generate LLVM from CUDA.jl)

@vchuravy
Member

You may be in a scenario where GPUCompiler is loaded and thus most of the compiler functionality is there, but you are then calling something that CUDA.jl is supposed to implement, and you are executing in a world before CUDA.jl was loaded?

That's really the only way you could get the "not implemented" error
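The world-age scenario described above can be reproduced in a few lines of plain Julia (a minimal sketch, independent of GPUCompiler/CUDA.jl; `Base.invoke_in_world` is internal but illustrates the mechanism):

```julia
f() = error("Not implemented")     # generic fallback, akin to GPUCompiler's stub
g() = f()
world = Base.get_world_counter()   # capture the current world age

f() = "implemented"                # the "CUDA.jl" method, defined in a later world

g()                                # fine: runs in the latest world, hits the new method
Base.invoke_in_world(world, g)     # throws "Not implemented": that world predates it
```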

@maleadt

maleadt commented Jan 20, 2025

That's really the only way you could get a the "not implemented" error

There's a bug in 1.10/1.11 that can result in this happening (as observed in SciML code). JuliaLang/julia#57077 should fix it.

@giordano
Member Author

Oh, cool, I can see if I can find an aarch64 machine to reproduce this on and see if that PR fixes it. Thanks for the heads up!

@giordano
Member Author

I was able to successfully run the CUDA integration tests with JuliaLang/julia#57077 12 times in a row on a Grace-Grace system, while they failed on the first try with Julia v1.11.2, so it looks like that PR was indeed the fix.

I guess this PR is ready to go then; the remaining failure is unrelated (also because it was happening with buildkite too).

@giordano giordano marked this pull request as ready for review January 20, 2025 16:01
@wsmoses
Member

wsmoses commented Jan 20, 2025

@giordano can you add a guard around the CUDA tests: if we're on aarch64 and the Julia version wouldn't have the fix, don't run them?
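Something like the following could work (a sketch: the include path and the version cutoff are assumptions — JuliaLang/julia#57077 would need to reach a released 1.11.x, so the exact bound should be checked before use):

```julia
# Skip CUDA integration tests on aarch64 builds that lack the world-age fix
# (JuliaLang/julia#57077); the v"1.11.3" cutoff is a guess, not verified.
has_world_age_fix = Sys.ARCH != :aarch64 || VERSION >= v"1.11.3"

if has_world_age_fix
    include("integration/cuda.jl")   # hypothetical path to the CUDA tests
else
    @info "Skipping CUDA tests on aarch64: Julia $VERSION lacks the #57077 fix"
end
```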

giordano and others added 2 commits January 20, 2025 16:07
Collaborator

@mofeing mofeing left a comment


lgtm

@wsmoses
Member

wsmoses commented Jan 20, 2025

CI has some seemingly new red?

@giordano
Member Author

  • Julia 1.10 - core - ubuntu-24.04-arm - aarch64 - packaged libReactant - assertions=false failed to check out the repository, will restart it 🫠
  • Julia 1.11 - core - ubuntu-20.04 - x64 - packaged libReactant - assertions=false (note it's x86-64, not aarch64) is failing a test not touched here:
    fn = tanpi: Test Failed at /home/runner/work/Reactant.jl/Reactant.jl/test/basic.jl:1012
      Expression: #= /home/runner/work/Reactant.jl/Reactant.jl/test/basic.jl:1012 =# @jit(fn.(x_ra)) ≈ fn.(x)
       Evaluated: ConcreteRArray{Float32, 2}(Float32[-0.21030955 6.786836 … 25.12034 -14.385436; -0.19476584 1.6611387 … -0.19030015 14.128226; -19.919302 4.171895 … 0.16302171 2.0547147; 0.40969867 0.09946294 … 1.0452943 1.3147937]) ≈ Float32[-0.21030961 6.7868366 … 25.120296 -14.385441; -0.19476582 1.6611385 … -0.19030032 14.128206; -19.919342 4.1718936 … 0.16302171 2.0547144; 0.40969867 0.099462934 … 1.0452943 1.3147937]
    

@wsmoses
Member

wsmoses commented Jan 20, 2025

ah hm, maybe we should lower the tolerance for the tan test or something?


@mofeing
Collaborator

mofeing commented Jan 20, 2025

ah hm, maybe we should lower the tolerance for the tan test or something?

yeah, i agree
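For reference, Test.jl lets you loosen the comparison in place (a sketch; the rtol value is illustrative, not a vetted choice — the arrays reuse a few values from the failure above):

```julia
using Test
# Float32 tanpi is ill-conditioned near half-integer inputs, so bitwise-identical
# results across implementations aren't guaranteed; compare with an explicit rtol.
a = Float32[-0.21030955, 6.786836, 25.12034]    # values from one implementation
b = Float32[-0.21030961, 6.7868366, 25.120296]  # and from the other
@test a ≈ b rtol = 1f-5
```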

@giordano giordano merged commit 29627db into main Jan 21, 2025
31 of 33 checks passed
@giordano giordano deleted the mg/gha-aarch64 branch January 21, 2025 00:50