[NCCL] 2.19.4 build from source #7794

Merged
maleadt merged 13 commits into JuliaPackaging:master on Dec 15, 2023


simonbyrne
Contributor

It was suggested in #7491 (comment) to try building this from source, instead of repackaging the binaries.

@simonbyrne changed the title from "build latest NCCL from source" to "[NCCL] 2.19.4 build from source" on Dec 11, 2023
@simonbyrne
Contributor Author

catastrophic error: error while writing generated C file: No space left on device

Any ideas?

@giordano
Member

giordano commented Dec 12, 2023

Run df -hT after the failure: #7779 (comment)

@simonbyrne
Contributor Author

How do I run something after the failure?

@giordano
Member

command_that_fails || df -hT

like in

bazel build --test_output=all --spawn_strategy=local --verbose_failures //xla/... || df -hT
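
The `||` idiom above runs the right-hand command only when the left-hand one exits non-zero. One caveat is that the combined command then takes the exit status of `df`, so a failed build can look successful to the caller. A minimal sketch that keeps the original status, assuming hypothetical `failing_step` and `run_with_df` functions, with `failing_step` standing in for the real build command:

```shell
# Hypothetical stand-in for the build command that fails.
failing_step() {
    echo "compiling..."
    return 3
}

# Run a command; on failure print disk usage, then hand back the original status.
run_with_df() {
    "$@" && return 0
    status=$?
    echo "command failed with status ${status}; disk usage follows:" >&2
    # -T adds the filesystem type column (GNU and BusyBox df).
    df -hT >&2 || true
    return "${status}"
}

rc=0
run_with_df failing_step || rc=$?
echo "exit status seen by the caller: ${rc}"
```

In a build script this would be used as `run_with_df make ...`, so the build still fails after the diagnostic output is printed.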

@simonbyrne
Contributor Author

The non-x86_64 builds are failing because the build tries to run fmt. I added coreutils as a BuildDependency, but that installs the target-architecture version. Is there a way to get the x86_64 (host) version?

@giordano
Member

HostBuildDependency: https://docs.binarybuilder.org/stable/build_tips/#Dependencies-for-the-target-system-vs-host-system

@simonbyrne
Contributor Author

In the cross-compiled targets (aarch64 and powerpc64le) I get

[20:04:32] bash: line 1: /workspace/destdir/cuda/bin/nvcc: cannot execute binary file: Exec format error

Are we picking up the wrong nvcc?
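
"Exec format error" usually means the kernel was asked to run a binary built for a different architecture than the machine running the build. One way to check which architecture an executable such as nvcc targets is to read the `e_machine` field of its ELF header. The sketch below assumes a little-endian ELF file and uses a synthesized 20-byte header as a stand-in for a real binary; `elf_machine` is a hypothetical helper name.

```shell
# Print the low byte of e_machine (offset 18 in the ELF header) as a decimal number.
# For little-endian ELF files: 62 = x86-64, 183 = AArch64, 21 = 64-bit PowerPC.
elf_machine() {
    od -An -tu1 -j18 -N1 "$1" | tr -d ' \n'
}

# Synthesized header standing in for a real binary: e_ident (16 bytes),
# e_type = 2 (executable), e_machine = 62 (x86-64).
sample="$(mktemp)"
printf '\177ELF\002\001\001\000\000\000\000\000\000\000\000\000\002\000\076\000' > "${sample}"

echo "sample e_machine: $(elf_machine "${sample}")"
echo "host architecture: $(uname -m)"
```

Inside the build sandbox the same helper could be pointed at `/workspace/destdir/cuda/bin/nvcc`; a value that does not correspond to the `uname -m` output would be consistent with the error above.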

@simonbyrne
Contributor Author

As for the space issue:
https://buildkite.com/julialang/yggdrasil/builds/7099#018c5f9f-b77a-41ed-8340-0095d6fbe0c7/6-2661

[20:06:02] ./prims_ll128.h(222): catastrophic error: error while writing generated C file: No space left on device
[20:06:02] 
[20:06:02] 1 catastrophic error detected in the compilation of "/workspace/destdir/obj/device/gensrc/reduce_scatter_sum_u8.cu".
[20:06:02] Compilation terminated.

and
https://buildkite.com/julialang/yggdrasil/builds/7099#018c5f9f-b77a-41ed-8340-0095d6fbe0c7/6-7520

[20:11:56]  ---> df -hT
[20:11:56] Filesystem           Type            Size      Used Available Use% Mounted on
[20:11:56] overlay              overlay         1.0G    184.0K   1023.8M   0% /
[20:11:56] df: /opt/x86_64-linux-musl/GCCBootstrap-4.8.5: No such file or directory
[20:11:56] df: /opt/x86_64-linux-musl/PlatformSupport-2023.6.10: No such file or directory
[20:11:56] df: /opt/x86_64-linux-musl/LLVMBootstrap-16.0.6: No such file or directory
[20:11:56] df: /opt/x86_64-linux-gnu/GCCBootstrap-4.8.5: No such file or directory
[20:11:56] df: /opt/x86_64-linux-gnu/PlatformSupport-2023.6.10: No such file or directory
[20:11:56] overlay              overlay         1.0G    184.0K   1023.8M   0% /opt/x86_64-linux-gnu
[20:11:56] overlay              overlay         1.0G    184.0K   1023.8M   0% /opt/x86_64-linux-musl
[20:11:56] udev                 devtmpfs      251.8G         0    251.8G   0% /dev/null
[20:11:56] udev                 devtmpfs      251.8G         0    251.8G   0% /dev/tty
[20:11:56] udev                 devtmpfs      251.8G         0    251.8G   0% /dev/urandom
[20:11:56] tank/root            zfs           228.6G     24.8G    203.8G  11% /etc/resolv.conf
[20:11:56] /dev/nvme0n1p1       ext4            3.4T      2.3T    938.6G  72% /root/.ccache
[20:11:56] tank/root            zfs           228.6G     24.8G    203.8G  11% /opt/toolchains
[20:11:56] tank/root            zfs           228.6G     24.8G    203.8G  11% /opt/bin
[20:11:56] /dev/nvme0n1p1       ext4            3.4T      2.3T    938.6G  72% /meta
[20:11:56] /dev/nvme0n1p1       ext4            3.4T      2.3T    938.6G  72% /workspace

so it doesn't look full?
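
In the `df` output above, `/workspace` has hundreds of gigabytes free, but the root overlay is only 1.0G, and `/tmp` is not listed as a separate mount, so it presumably lives on that overlay. A quick way to see which filesystem backs a given directory is to pass the directory to `df`. A minimal sketch, with `avail_kb` as a hypothetical helper name:

```shell
# Print the free space, in KiB, of the filesystem that holds the given path.
# -P forces the POSIX one-line-per-filesystem layout; -k fixes the unit.
avail_kb() {
    df -Pk "$1" | awk 'NR==2 { print $4 }'
}

# Compare the temporary directory with the current working directory.
for dir in "${TMPDIR:-/tmp}" "$(pwd)"; do
    printf '%s: %s KiB available\n' "${dir}" "$(avail_kb "${dir}")"
done
```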

@simonbyrne
Contributor Author

Okay, changing TMPDIR seemed to fix it.
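
The thread does not show the exact change, but the idea can be sketched as pointing `TMPDIR` at a directory on the large workspace filesystem before invoking the compiler. Tools that honor `TMPDIR`, as `mktemp` does and as nvcc appears to here given that the change fixed the error, then write their temporary files there instead of `/tmp`. The `WORKSPACE` variable below is a hypothetical stand-in for the sandbox's `/workspace` mount:

```shell
# Hypothetical stand-in for the large /workspace mount seen in the df output.
WORKSPACE="${WORKSPACE:-$(mktemp -d)}"

export TMPDIR="${WORKSPACE}/tmp"
mkdir -p "${TMPDIR}"

# mktemp with no template creates its file under TMPDIR.
scratch="$(mktemp)"
echo "temporary file created at: ${scratch}"
rm -f "${scratch}"
```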

Now I get https://buildkite.com/julialang/yggdrasil/builds/7100#018c6005-3d85-477f-9cbc-ca9413d67490/6-7523:

/opt/x86_64-linux-gnu/bin/../lib/gcc/x86_64-linux-gnu/4.8.5/../../../../x86_64-linux-gnu/bin/ld: cannot find -lcudart_static
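
The linker error means that neither a `libcudart_static.a` nor a `libcudart_static.so` was found in any directory on the link line's search path. A small sketch for checking which flavors of a library a directory provides; `has_lib` is a hypothetical helper, and the temporary directory stands in for the CUDA library directory in the sandbox:

```shell
# Report whether DIR holds a static (.a) or shared (.so) build of lib<NAME>,
# which are the two file names the linker looks for when given -l<NAME>.
has_lib() {
    name="$1"; dir="$2"
    kinds=""
    if [ -e "${dir}/lib${name}.a" ]; then kinds="static"; fi
    if [ -e "${dir}/lib${name}.so" ]; then kinds="${kinds:+${kinds},}shared"; fi
    echo "${name}: ${kinds:-missing}"
}

# Stand-in directory that only ships the shared runtime.
libdir="$(mktemp -d)"
touch "${libdir}/libcudart.so"

has_lib cudart "${libdir}"
has_lib cudart_static "${libdir}"
```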

@simonbyrne
Contributor Author

Okay, I think this is now working. @maleadt @vchuravy can you take a look?

Some questions:

  • Can nvcc cross compile? I don't see any other examples where we do.
  • Should I link against the cudart_static (the default) or cudart (what I've done above)?

@simonbyrne marked this pull request as ready for review on December 13, 2023 04:37
@simonbyrne
Contributor Author

Also, should we delete the static library from the artifact? It's pretty big (200 MB).

@maleadt
Collaborator

maleadt commented Dec 13, 2023

> Can nvcc cross compile? I don't see any other examples where we do.

No. You can try using clang for that, but I'm not sure it'll manage to compile all of NCCL's CUDA code.

> Should I link against the cudart_static (the default) or cudart (what I've done above)?

The dynamic one is fine, and keeps the generated binaries smaller.
It does of course mean that you need to depend on the CUDA_Runtime_jll, but that should be fine here, I guess.

@simonbyrne
Contributor Author

Okay, I think this is ready. I'll leave the cross compilation part for now.

@simonbyrne
Contributor Author

Can we merge this?

@maleadt dismissed imciner2’s stale review on December 15, 2023 08:07

LICENSE has been re-added

@maleadt merged commit c8669ac into JuliaPackaging:master on Dec 15, 2023
12 checks passed
@simonbyrne
Contributor Author

simonbyrne commented Dec 15, 2023

Hmm, it looks like it isn't loading correctly on Linux when no CUDA is installed (JuliaRegistries/General#97147 (comment))

(nccl) pkg> add NCCL_jll#main
...

julia> using NCCL_jll
ERROR: InitError: could not load library "/home/spjbyrne/.julia/artifacts/c99fd5bde3173b2e3f53a67a1383b5bb611c8267/lib/libnccl.so"
libcudart.so.12: cannot open shared object file: No such file or directory
Stacktrace:
  [1] dlopen(s::String, flags::UInt32; throw_error::Bool)
    @ Base.Libc.Libdl ./libdl.jl:117
  [2] dlopen(s::String, flags::UInt32)
    @ Base.Libc.Libdl ./libdl.jl:116
  [3] macro expansion
    @ ~/.julia/packages/JLLWrappers/pG9bm/src/products/library_generators.jl:63 [inlined]
  [4] __init__()
    @ NCCL_jll ~/.julia/packages/NCCL_jll/qlpSf/src/wrappers/x86_64-linux-gnu-cuda+12.3.jl:10
  [5] register_restored_modules(sv::Core.SimpleVector, pkg::Base.PkgId, path::String)
    @ Base ./loading.jl:1115
  [6] _include_from_serialized(pkg::Base.PkgId, path::String, ocachepath::String, depmods::Vector{Any})
    @ Base ./loading.jl:1061
  [7] _require_search_from_serialized(pkg::Base.PkgId, sourcepath::String, build_id::UInt128)
    @ Base ./loading.jl:1506
  [8] _require(pkg::Base.PkgId, env::String)
    @ Base ./loading.jl:1783
  [9] _require_prelocked(uuidkey::Base.PkgId, env::String)
    @ Base ./loading.jl:1660
 [10] macro expansion
    @ ./loading.jl:1648 [inlined]
 [11] macro expansion
    @ ./lock.jl:267 [inlined]
 [12] require(into::Module, mod::Symbol)
    @ Base ./loading.jl:1611
during initialization of module NCCL_jll
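
The error shows that `libnccl.so` has a dynamic dependency on `libcudart.so.12` that the loader could not resolve. Outside Julia, unresolved dependencies of a shared object can be listed with `ldd`. A sketch, where `missing_libs` is a hypothetical helper and the shell binary stands in for the path to `libnccl.so`:

```shell
# Read ldd output on stdin and print the names of libraries it could not resolve.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Stand-in target; substitute the path to libnccl.so from the error message.
lib="$(command -v sh)"

if command -v ldd >/dev/null 2>&1; then
    ldd "${lib}" 2>/dev/null | missing_libs
else
    echo "ldd is not available on this system" >&2
fi
```

An empty result means every dependency resolved; any name printed is a library that must be provided by another package or by the system.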

4 participants