Attempt to unclog documentation builds on Buildkite #1561

Closed
wants to merge 5 commits into from


ali-ramadhan
Member

No description provided.

@ali-ramadhan
Member Author

@mukund-gupta Maybe try this branch to see if it fixes the issue you encountered with this error: https://buildkite.com/clima/oceananigans/builds/1947#b5157111-6448-4db8-bc8b-8a777f85001a/19-300

@ali-ramadhan
Member Author

Seems that many builds are now getting stuck on both CPU and GPU. I wonder if it's a new package version causing problems, since this started happening after I updated the Manifest.toml.

And the docs job updates docs/Manifest.toml before building, which could explain why the docs build was getting stuck while the others seemed fine:

- "$TARTARUS_HOME/julia-$JULIA_VERSION/bin/julia --color=yes --project=docs/ -e 'using Pkg; Pkg.instantiate(); Pkg.develop(PackageSpec(path=pwd()))'"

@ali-ramadhan
Member Author

Can confirm that tests get stuck when I manually run ] test on Tartarus. It first gets stuck at

[2021/04/09 09:02:15.825] INFO    Testing budgets with Flux boundary conditions [GPU]...
[2021/04/09 09:02:15.825] INFO      Testing budgets with Flux boundary conditions [GPU, (Periodic, Bounded, Bounded), u, north]...

so presumably whatever is causing it to hang is a commonly used function/bit of code...

@ali-ramadhan
Member Author

When I killed it I got

fatal: error thrown and no exception handler available.
InterruptException()
jl_mutex_unlock at /buildworker/worker/package_linux64/build/src/locks.h:143 [inlined]
jl_task_get_next at /buildworker/worker/package_linux64/build/src/partr.c:476
^Cpoptask at ./task.jl:704
wait at ./task.jl:712 [inlined]
task_done_hook at ./task.jl:442
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1690 [inlined]
jl_finish_task at /buildworker/worker/package_linux64/build/src/task.c:198
start_task at /buildworker/worker/package_linux64/build/src/task.c:717
unknown function (ip: (nil))
WARNING: Force throwing a SIGINT

so maybe we have a bad KernelAbstractions.jl wait?

@glwagner
Member

glwagner commented Apr 9, 2021

Ah, interesting. I think you're right that it's probably a KernelAbstractions issue (though I want to point out that wait can also be called in other contexts and packages, not only in KernelAbstractions).

That would also explain why it mysteriously started happening: docs/Manifest.toml is updated automatically. Should we stop updating docs/Manifest.toml? Is that possible?

@ali-ramadhan
Member Author

I was able to reproduce the hang by running the test manually in the REPL. It gets stuck somewhere in run!(simulation), but I couldn't get a useful stacktrace out.

It does not hang in v0.54.0.
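Checking whether a given version hangs can be done without watching the REPL by running a reproducer in a child process with a timeout. The following is only a sketch: `finishes_within` is a hypothetical helper name, and the snippet passed to it stands in for whatever code reproduces the hang. A child process is used because a hang in compilation or in a blocking call may never yield to a watchdog task in the same process.

```julia
# Run a snippet of Julia code in a child process and report whether it exits
# successfully within `timeout` seconds. Returns false on timeout or on error.
function finishes_within(code::AbstractString, timeout::Real)
    # Base.julia_cmd() is the command line of the currently running Julia.
    cmd = `$(Base.julia_cmd()) --startup-file=no -e $code`
    # Start the child without blocking and discard its output.
    proc = run(pipeline(cmd; stdout=devnull, stderr=devnull); wait=false)
    # Poll until the child exits or the timeout elapses.
    status = timedwait(() -> process_exited(proc), Float64(timeout); pollint=0.1)
    if status === :timed_out
        kill(proc)   # sends SIGTERM; a truly stuck process may need a stronger signal
        return false
    end
    return success(proc)
end

println(finishes_within("sleep(0.01)", 60.0))
println(finishes_within("sleep(120)", 1.0))
```

In practice the snippet would load the package and call the reproducer, and the timeout would be set well above the normal run time of the test.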

I tried downgrading and pinning KernelAbstractions.jl and CUDA.jl back down to the version used in the v0.54.0 Manifest.toml but it still got stuck... Could be some other package.
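For reference, the downgrade-and-pin step might look roughly like the sketch below. `pinned_specs` is a hypothetical helper, and the version numbers are placeholders (the ones reported for the successful build elsewhere in this thread); the versions in the v0.54.0 Manifest.toml are not stated here and would need to be substituted.

```julia
using Pkg

# Build one PackageSpec per package from a name => version mapping,
# in alphabetical order so the result is deterministic.
function pinned_specs(versions::AbstractDict)
    return [Pkg.PackageSpec(name=name, version=versions[name])
            for name in sort(collect(keys(versions)))]
end

# Placeholder versions; replace with those from the known-good Manifest.toml.
specs = pinned_specs(Dict("CUDA" => "2.4.1", "GPUArrays" => "6.2.1"))
println([s.name for s in specs])

# With the right project activated, this would switch to and pin those versions:
# Pkg.pin(specs)
```

The `Pkg.pin` call is left commented out because it modifies the active project's manifest and needs registry access.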

> Should we stop updating docs/Manifest.toml? Is that possible?

Couldn't find anything in the Pkg.jl docs that would help but maybe we should switch the order of the instantiate and develop calls here?

- "$TARTARUS_HOME/julia-$JULIA_VERSION/bin/julia --color=yes --project=docs/ -e 'using Pkg; Pkg.instantiate(); Pkg.develop(PackageSpec(path=pwd()))'"

@glwagner glwagner mentioned this pull request Apr 9, 2021
@navidcy
Collaborator

navidcy commented Apr 9, 2021

I notice that the Documenter.jl docs suggest:

julia --project=docs/ -e 'using Pkg; Pkg.develop(PackageSpec(path=pwd())); Pkg.instantiate()'

@glwagner
Member

Here are the differences in how docs/Manifest.toml is updated for a successful build and the current build:

| build 1872 | build 1983 (from this PR) |
| --- | --- |
| Docs build successfully! | Build is hanging? |
| (image) | (image) |

The differences are:

  • ArrayInterface (3.1.6 -> 3.1.7)
  • CUDA (2.4.1 -> 2.4.3)
  • ChainRulesCore (0.9.36 -> 0.9.37)
  • GPUArrays (6.2.1 -> 6.2.2)
  • NNlib (0.7.17 -> 0.7.18)
  • StructTypes (1.5.0 -> 1.5.2)
  • TaylorSeries (0.10.11 -> 0.10.12)
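A list like the one above can also be produced by comparing two manifests programmatically instead of reading screenshots. This is a minimal sketch, assuming the `TOML` standard library (Julia 1.6 or later); `manifest_versions` and `changed_versions` are hypothetical helper names, and the inline manifests are cut down to a few packages, with FFTW included only as an unchanged entry for illustration.

```julia
using TOML  # standard library in Julia 1.6 and later

# Map package name => version string for a parsed manifest. Handles the flat
# layout (package names at top level) and the layout that nests them under "deps".
function manifest_versions(manifest::AbstractDict)
    deps = get(manifest, "deps", manifest)
    versions = Dict{String,String}()
    for (name, entries) in deps
        # Package entries are arrays of tables; skip anything else (metadata keys).
        (entries isa AbstractVector && !isempty(entries)) || continue
        entry = first(entries)
        # Standard libraries may have no "version" field.
        if entry isa AbstractDict && haskey(entry, "version")
            versions[name] = entry["version"]
        end
    end
    return versions
end

# Packages present in both manifests whose versions differ: name => (old, new).
function changed_versions(old::AbstractDict, new::AbstractDict)
    a, b = manifest_versions(old), manifest_versions(new)
    changes = Dict{String,Tuple{String,String}}()
    for (name, v) in b
        if haskey(a, name) && a[name] != v
            changes[name] = (a[name], v)
        end
    end
    return changes
end

# Tiny inline manifests for illustration only.
good = TOML.parse("""
[[CUDA]]
version = "2.4.1"

[[FFTW]]
version = "1.3.2"

[[GPUArrays]]
version = "6.2.1"
""")

hanging = TOML.parse("""
[[CUDA]]
version = "2.4.3"

[[FFTW]]
version = "1.3.2"

[[GPUArrays]]
version = "6.2.2"
""")

changes = changed_versions(good, hanging)
for name in sort(collect(keys(changes)))
    println(name, ": ", changes[name][1], " -> ", changes[name][2])
end
```

For real files, `TOML.parsefile` on the two Manifest.toml paths would replace the inline strings.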

@glwagner
Member

@navidcy I think it's a red herring. The issue is that an update of some package is causing run!(simulation) to hang. In prior builds this problem only affected the documentation build (because only docs/Manifest.toml was updated, not Oceananigans.jl/Manifest.toml). Here, we have updated all packages and now the issue has infected the entire build.

@glwagner
Member

Actually, I believe that @ali-ramadhan found that the issue is compilation of run!(simulation). For some reason the compiler chokes on it. The weird thing is that time_step!(model, dt) works (which is where I'd naively have expected the issue to be).

@navidcy
Collaborator

navidcy commented Apr 10, 2021

omg... I'm thoroughly confused.

Also, what puzzles me even more is that in the list on the left (the one you say is all "ok") I see
CUDA v2.4.1 + FFTW v1.3.2
But FFTW v1.3.2 requires AbstractFFTs v1, while the compat entry of CUDA v2.4.1 only allows AbstractFFTs 0.4, 0.5. How could that be?

@glwagner
Member

glwagner commented Apr 10, 2021

I see

AbstractFFTs = "0.4, 0.5, 1"

https://github.com/JuliaGPU/CUDA.jl/blob/5767efb7fa65c3811b187ab310f0bb5e484bc2e4/Project.toml#L31

What does that mean?
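As a rough illustration of how such a compat string is read: each comma-separated entry allows a range of versions, and the string as a whole allows their union. The sketch below assumes Pkg's default caret rule for plain entries (the leftmost nonzero component may not change) and does not handle operators such as `~`, `>=`, or hyphen ranges. `caret_range` and `allows` are hypothetical names, not part of Pkg.

```julia
# Half-open range [lower, upper) allowed by one plain compat entry such as "0.4" or "1".
# Under the caret rule the leftmost nonzero component may not change.
function caret_range(entry::AbstractString)
    parts = parse.(Int, split(strip(entry), "."))
    padded = vcat(parts, zeros(Int, 3 - length(parts)))
    i = findfirst(!iszero, parts)
    i === nothing && (i = length(parts))   # "0" allows < 1.0.0, "0.0" allows < 0.1.0
    upper = copy(padded)
    upper[i] += 1
    upper[i+1:end] .= 0
    return VersionNumber(padded...), VersionNumber(upper...)
end

# A comma-separated compat string is the union of its entries.
function allows(compat::AbstractString, v::VersionNumber)
    return any(split(compat, ",")) do entry
        lower, upper = caret_range(entry)
        lower <= v < upper
    end
end

println(allows("0.4, 0.5, 1", v"1.0.0"))  # true
println(allows("0.4, 0.5", v"1.0.0"))     # false
```

Under this reading, "0.4, 0.5, 1" accepts any 0.4.x, 0.5.x, or 1.x.y release, while "0.4, 0.5" rejects AbstractFFTs v1.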

@navidcy
Collaborator

navidcy commented Apr 10, 2021

@glwagner
Member

Touché.

@glwagner
Member

PencilFFTs 0.12.2 also requires AbstractFFTs v1:

https://github.com/jipolanco/PencilFFTs.jl/blob/master/Project.toml

@navidcy
Collaborator

navidcy commented Apr 10, 2021

> PencilFFTs 0.12.2 also requires AbstractFFTs v1:
>
> https://github.com/jipolanco/PencilFFTs.jl/blob/master/Project.toml

Indeed....! But again the url you provided is from PencilFFTs#master and not v0.12.2... :)

@navidcy
Collaborator

navidcy commented Apr 10, 2021

I'm simply confused with the whole shenanigans of this issue. I need a fresh start perhaps :)

@glwagner
Member

> > PencilFFTs 0.12.2 also requires AbstractFFTs v1:
> > https://github.com/jipolanco/PencilFFTs.jl/blob/master/Project.toml
>
> Indeed....! But again the url you provided is from PencilFFTs#master and not v0.12.2... :)

True, it's still tagged as 0.12.2, but I guess the compat could have been updated without bumping the version.

@glwagner
Member

glwagner commented Apr 10, 2021

The first build to exhibit the issue (all tests pass except docs) appears to be 1881, and the last build to pass before it is 1878. Two packages are different:

No idea if this is progress...

@navidcy
Collaborator

navidcy commented Apr 10, 2021

Nice. This might help our investigations. Will get to it tomorrow.

@ali-ramadhan
Member Author

Still also confused here... Just merged master into this branch. Maybe things magically got fixed over the weekend 😄

@navidcy
Collaborator

navidcy commented Apr 12, 2021

> Still also confused here... Just merged master into this branch. Maybe things magically got fixed over the weekend 😄

Wishful thinking..

@glwagner
Member

Can close this now since #1573 ...

@navidcy navidcy deleted the ali/unclog-docs branch April 25, 2021 04:19

3 participants