Segmentation fault with Distributed when --threads is set #145

@Socob

I’m getting segmentation faults when using Distributed while passing --threads to Julia, even when I’m not actually using any of those threads (see the MWE below). Needless to say, this is a huge problem when doing hybrid distributed- and shared-memory parallelization!

$ julia test.jl
start
      From worker 12:	
      From worker 12:	[58424] signal (11.1): Segmentation fault
      From worker 12:	in expression starting at none:1
      From worker 12:	Allocations: 101999211 (Pool: 93311196; Big: 8688015); GC: 1591
Worker 12 terminated.
Unhandled Task ERROR: EOFError: read end of file
Stacktrace:
 [1] (::Base.var"#wait_locked#739")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
   @ Base ./stream.jl:947
 [2] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
   @ Base ./stream.jl:955
 [3] unsafe_read
   @ ./io.jl:774 [inlined]
 [4] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
   @ Base ./io.jl:773
 [5] read!
   @ ./io.jl:775 [inlined]
 [6] deserialize_hdr_raw
   @ ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/messages.jl:167 [inlined]
 [7] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:172
 [8] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:133
 [9] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
   @ Distributed ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:121
ERROR: LoadError: ProcessExitedException(12)
Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:448
 [2] macro expansion
   @ ./task.jl:480 [inlined]
 [3] remotecall_eval(m::Module, procs::Vector{Int64}, ex::Expr)
   @ Distributed ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/macros.jl:219
 [4] macro expansion
   @ ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/macros.jl:203 [inlined]
 [5] main()
   @ Main ~/test.jl:9
 [6] top-level scope
   @ ~/keeper/Documents/docs/postdocs/work/parity_violation/analytic4PC/run3.jl:30
in expression starting at ~/test.jl:29

With the commented-out `exeflags` line from the MWE below (that is, without --threads), I’m not getting any segmentation faults.

Whether the segfault is triggered does seem to depend on the number of worker processes: with a small number of workers, the issue is not triggered (or at least not consistently). It also doesn’t appear immediately, but only after a varying, non-deterministic amount of time. The details may be machine-specific, but I’ve reproduced this on several different machines.

I don’t have an explanation to offer, since I don’t see how merely setting the number of Julia threads would affect this code.


  1. The output of versioninfo():
    Julia Version 1.10.2
    Commit bd47eca2c8a (2024-03-01 10:14 UTC)
    Build Info:
      Official https://julialang.org/ release
    Platform Info:
      OS: Linux (x86_64-linux-gnu)
      CPU: 16 × AMD Ryzen 7 4800H with Radeon Graphics
      WORD_SIZE: 64
      LIBM: libopenlibm
      LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
    Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)
    
  2. How you installed Julia: juliaup
  3. A minimal working example (MWE), also known as a minimum reproducible example:
    using Distributed
    
    function main()
        arr = zeros(1000, 10000)
        arr .= 1.0
        println("start"); flush(stdout)
        @everywhere workers() begin
            # dummy calculation
            arr = $arr
            for i in 1:size(arr, 2)
                sum(
                    sum(1.1 .* @view arr[:, i])
                    for _ in 1:5000
                )
            end
        end
        println("DONE"); flush(stdout)
    end
    
    addprocs(
        15;
        # results in segfault
        exeflags=`--startup-file=no --threads=16`
        # no segfault!
        # exeflags=`--startup-file=no`
    )
    main()
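
To confirm that the `exeflags` actually reach the workers, a check along these lines can be run separately from the MWE. It is only a minimal sketch: the two local workers and two threads each are arbitrary assumptions chosen to keep it cheap, and it does nothing except query `Threads.nthreads()` on every worker.

```julia
using Distributed

# Assumption: 2 workers with 2 threads each, chosen only to keep the check small.
addprocs(2; exeflags=`--startup-file=no --threads=2`)

# Ask every worker how many threads it was started with.
thread_counts = Dict(p => remotecall_fetch(() -> Threads.nthreads(), p) for p in workers())

for (p, n) in sort(collect(thread_counts))
    println("worker ", p, ": ", n, " thread(s)")
end

# Shut the workers down again so the check leaves nothing behind.
rmprocs(workers())
```

Repeating the check with the `--threads` flag removed from `exeflags` shows the thread count of the configuration that does not segfault.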
