Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

segmentation fault with multi-threading #219

Open
lmiq opened this issue Sep 9, 2022 · 24 comments
Open

segmentation fault with multi-threading #219

lmiq opened this issue Sep 9, 2022 · 24 comments

Comments

@lmiq
Copy link

lmiq commented Sep 9, 2022

I have this MWE, where I get segmentation faults (frequently, but not deterministically), when trying to run some script that uses multi-threading on the Julia side.

I have used before launching ipython3:

export JULIA_NUM_THREADS=4

(my computer has 4 cores - 8 threads).

The MWE is:

Python 3.10.4 (main, Jun 29 2022, 12:14:53) [GCC 11.2.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.31.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from juliacall import Main as jl

In [2]: import numpy as np

In [3]: jl.seval("""
   ...: function test(x)
   ...:     partial = zeros(Threads.nthreads())
   ...:     Threads.@threads for i in 1:Threads.nthreads()
   ...:         for j in i:Threads.nthreads():length(x)
   ...:             partial[i] += x[j]
   ...:         end
   ...:     end
   ...:     return sum(partial)
   ...: end
   ...: """)
Out[3]: test (generic function with 1 method)

In [4]: x = np.random.random((10_000,))

In [5]: %timeit jl.test(x)
72.2 µs ± 35.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit jl.test(x)
Segmentation fault (core dumped)

Here I have emulated the error using the %timeit macro from ipython, but my actual error I get after some runs of a function of my package:

In [1]: from juliacall import Main as jl

In [2]: jl.seval("using CellListMap")

In [3]: import numpy as np

In [5]: x = np.random.random((10_000,3))

In [6]: nb = jl.neighborlist(x.transpose(), 0.05)

In [7]: nb = jl.neighborlist(x.transpose(), 0.05)

In [8]: nb = jl.neighborlist(x.transpose(), 0.05)

In [9]: nb = jl.neighborlist(x.transpose(), 0.05)

In [10]: nb = jl.neighborlist(x.transpose(), 0.05)

In [11]: nb = jl.neighborlist(x.transpose(), 0.05)

In [12]: nb = jl.neighborlist(x.transpose(), 0.05)

In [13]: nb = jl.neighborlist(x.transpose(), 0.05)

In [14]: nb = jl.neighborlist(x.transpose(), 0.05)

In [15]: nb = jl.neighborlist(x.transpose(), 0.05)

In [16]: nb = jl.neighborlist(x.transpose(), 0.05)

In [17]: nb = jl.neighborlist(x.transpose(), 0.05)

In [18]: nb = jl.neighborlist(x.transpose(), 0.05)

In [19]: nb = jl.neighborlist(x.transpose(), 0.05)
Segmentation fault (core dumped)

%timeit runs the function multiple times, there seems to be some memory corruption, or memory overflow, causing the error.

Anyway, even if you have only some hint on how to debug this, I will be very thankful.

(even in the simplest example above, the segfaults only occur with multi-threading).

@cjdoris
Copy link
Collaborator

cjdoris commented Sep 9, 2022

This is probably the same as #201 and #202.

The Python interpreter must only be used on thread 1. If you use Julia's multithreading, then even if it doesn't explicitly use Python, at some point the GC may free a Python object. If this GC runs on a thread other than 1, Python will crash.

The solution is to guard multithreaded Julia code with calls to PythonCall.GC.disable() and PythonCall.GC.enable(), so the Julia code in your MWE becomes:

function test(x)
    partial = zeros(Threads.nthreads())
    PythonCall.GC.disable()
    try
        Threads.@threads for i in 1:Threads.nthreads()
            for j in i:Threads.nthreads():length(x)
                partial[i] += x[j]
            end
        end
    finally
        PythonCall.GC.enable()
    end
    return sum(partial)
end

This ensures that no Python objects are freed during the multithreaded portion: if Julia's GC collects any Python objects, they will be cached and freed at PythonCall.GC.enable().

@lmiq
Copy link
Author

lmiq commented Sep 9, 2022

Not sure if I did something wrong, but I still get the segfaults:

In [1]: from juliacall import Main as jl

In [2]: jl.seval("""
   ...: function test(x)
   ...:     partial = zeros(Threads.nthreads())
   ...:     PythonCall.GC.disable()
   ...:     try
   ...:         Threads.@threads for i in 1:Threads.nthreads()
   ...:             for j in i:Threads.nthreads():length(x)
   ...:                 partial[i] += x[j]
   ...:             end
   ...:         end
   ...:     finally
   ...:         PythonCall.GC.enable()
   ...:     end
   ...:     return sum(partial)
   ...: end
   ...: """)
Out[2]: test (generic function with 1 method)

In [3]: import numpy as np

In [4]: x = np.random.random((10_000,));

In [6]: %timeit jl.test(x)
60.4 µs ± 30.7 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: %timeit jl.test(x)
Segmentation fault (core dumped)

(the same for my package)

@cjdoris
Copy link
Collaborator

cjdoris commented Sep 9, 2022

In that case I don't know what's going on. I can reproduce the error but don't get a core dump. Can you poke around the core dump with gdb and see what's going on? Maybe get a backtrace?

@cjdoris
Copy link
Collaborator

cjdoris commented Sep 9, 2022

If I do jl.GC.enable(False) before %timeit then the problem stops.

The PythonCall GC code is itself not thread safe, but I assumed that that wasn't an issue since Julia's GC can only run from one thread at a time, so I assumed that finalizers only ran in one thread at a time. Maybe that's not the case. I think with a little more digging I can figure this out.

@cjdoris
Copy link
Collaborator

cjdoris commented Sep 9, 2022

I'm flummoxed!

Here's a slightly smaller MWE:

In [1]: import os

In [2]: os.environ['PYTHON_JULIACALL_THREADS'] = '4'

In [3]: from juliacall import Main as jl

In [4]: jl.seval("""
   ...: function test()
   ...:     partial = zeros(Threads.nthreads())
   ...:     x = zeros(10_000)
   ...:     Threads.@threads for i in 1:Threads.nthreads()
   ...:         for j in i:Threads.nthreads():length(x)
   ...:             partial[i] += x[j]
   ...:         end
   ...:     end
   ...:     return sum(partial)
   ...: end
   ...: """)
Out[4]: test (generic function with 1 method)

In [5]: %timeit jl.test()
The slowest run took 10.87 times longer than the fastest. This could mean that an intermediate result is being cached.
150 µs ± 103 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit jl.test()
Segmentation fault

Things tried:

  • Deleting all the finalizers/GC stuff from PythonCall, so when a Python object is done with, it is never freed. Segfault still occurs.
  • Running with 1 thread (either by setting the number of threads to 1 or removing @threads) fixes it.
  • Calling jl.GC.enable(False) before %timeit ... fixes it.
  • Doing %timeit jl.seval("test()") instead fixes it. Weird!

So it looks like some combination of garbage collection and threading are the issue. But I've turned off the finalizers in PythonCall so I don't know what else could be the issue.

@nmheim
Copy link

nmheim commented Jan 11, 2023

If I am using this approach to disable/enable GC I am running out of memory at some point. Would it help if I make a minimal example that reproduces this behaviour?

@cjdoris
Copy link
Collaborator

cjdoris commented Jan 12, 2023

It's not surprising that disabling GC entirely means you run out of memory. You could maybe do a GC.collect() periodically (outside of any threaded code).

@nmheim
Copy link

nmheim commented Jan 13, 2023

Hmm, but I am enabling GC again after the threaded call. I cant find a GC.collect function. I tried GC.gc (outside threaded calls) but that results in a segfault. Thank you for your help!:)

@vchuravy
Copy link

As said on discourse: https://discourse.julialang.org/t/floop-threading-and-juliacall-produce-segmentation-fault/100717/11?u=vchuravy

Julia intentionally causes segmentation faults as part of the GC safepoint mechanism. This MPI.jl issue is somewhat related
JuliaParallel/MPI.jl#725

PythonCall starts Julia with --handle-signals=no

https://github.com/cjdoris/PythonCall.jl/blob/e374e2503b7b843c457d30596a2428ebfc26bb0b/pysrc/juliacall/__init__.py#L132

Which then causes benign segmentation faults to become spurios ones.

@giordano
Copy link

To be clear, you're suggesting to change that option to "yes" (or just remove it)? According to git blame, it was introduced last year in bef6afa without much explanation (just "better compatibility"), it's unclear what was the motivation for that change.

@vchuravy
Copy link

vchuravy commented Jun 24, 2023

Yes, changing it from no to yes will at least no longer cause faults due to GC (and it looks like most of the recorded issues here worked around it by running with GC off.

This issue was opened shortly after that change aswell.

@giordano
Copy link

I see that the offending commit is part of the branch https://github.com/cjdoris/PythonCall.jl/commits/gc referenced in issues #201 and #202 which was meant to deal with other segmentation faults in the GC in multithreading programs, but it sounds like this specific change just worsened the situation. Is that an accurate reconstruction of the events @cjdoris?

@vchuravy
Copy link

Hangs could come about from one thread wanting to stop the world and another executing Python code? The code would need to manually transition to GC safe before calling into Python.

@brian-dellabetta
Copy link
Contributor

brian-dellabetta commented Jun 28, 2023

@vchuravy thanks for the tip here! Just wanted to post that this works perfectly for me, after setting CONFIG['opt_handle_signals'] = 'yes', with the caveat that we still need to wrap threaded code in PythonCall.GC.disable() / PythonCall.GC.enable(). Rather than segfaulting, it now hangs at this line if PythonCall.GC is not disabled in some of our threaded code.

This is significantly better than the workaround i posted here, as we don't have to worry about huge memory allocation issues in the threaded code or temporarily calling Base.GC.gc on the main thread. Let's wait to hear what the maintainer thinks, at the very least if this could be set by an env var and default to no to retain backwards compatibility, we could drop a lot of hack code and get a big performance benefit

@cjdoris
Copy link
Collaborator

cjdoris commented Jul 4, 2023

Thank you for looking into this. I had no idea there was such a thing as a benign segfault! Given it is so important for Julia's GC I guess we should turn Julia's signal handling back on.

I'm not sure what that terrible "better compatibility" comment was about. I suspect that it's because Python also wants to handle signals, and letting them both do it interferes with each other. For example, I think allowing Julia's signal handler prevented you from doing keyboard interrupts (Ctrl-C) in Python code.

@brian-dellabetta
Copy link
Contributor

For example, I think allowing Julia's signal handler prevented you from doing keyboard interrupts (Ctrl-C) in Python code.

@cjdoris in my experience, allowing Julia's signal handler allows keyboard interrupts (Ctrl-C) to work in Python code. Using branch from #333 :

PYTHON_JULIACALL_HANDLE_SIGNALS=no python
Python 3.10.11 | packaged by conda-forge | (main, May 10 2023, 19:07:22) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from juliacall import Main as jl
>>> jl.seval("sleep(10)")
^C>>>

^C above corresponds to my keyboard interrupt. You can see it just continues on.

PYTHON_JULIACALL_HANDLE_SIGNALS=yes python
Python 3.10.11 | packaged by conda-forge | (main, May 10 2023, 19:07:22) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from juliacall import Main as jl
>>> jl.seval("sleep(10)")
^C
[73719] signal (2): Interrupt: 2
in expression starting at none:1
kevent at /usr/lib/system/libsystem_kernel.dylib (unknown line)
uv__io_poll at /Users/brian/.julia/juliaup/julia-1.9.2+0.x64.apple.darwin14/lib/julia/libjulia-internal.1.9.dylib (unknown line)
uv_run at /Users/brian/.julia/juliaup/julia-1.9.2+0.x64.apple.darwin14/lib/julia/libjulia-internal.1.9.dylib (unknown line)
ijl_task_get_next at /Users/brian/.julia/juliaup/julia-1.9.2+0.x64.apple.darwin14/lib/julia/libjulia-internal.1.9.dylib (unknown line)
poptask at ./task.jl:974
wait at ./task.jl:983
task_done_hook at ./task.jl:672
jfptr_task_done_hook_33190.clone_1 at /Users/brian/.julia/juliaup/julia-1.9.2+0.x64.apple.darwin14/lib/julia/sys.dylib (unknown line)
ijl_apply_generic at /Users/brian/.julia/juliaup/julia-1.9.2+0.x64.apple.darwin14/lib/julia/libjulia-internal.1.9.dylib (unknown line)
jl_finish_task at /Users/brian/.julia/juliaup/julia-1.9.2+0.x64.apple.darwin14/lib/julia/libjulia-internal.1.9.dylib (unknown line)
start_task at /Users/brian/.julia/juliaup/julia-1.9.2+0.x64.apple.darwin14/lib/julia/libjulia-internal.1.9.dylib (unknown line)
unknown function (ip: 0x0)
Allocations: 1974217 (Pool: 1972735; Big: 1482); GC: 3

@github-actions
Copy link
Contributor

github-actions bot commented Sep 8, 2023

This issue has been marked as stale because it has been open for 30 days with no activity. If the issue is still relevant then please leave a comment, or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues about to be auto-closed label Sep 8, 2023
@lmiq
Copy link
Author

lmiq commented Sep 8, 2023

I think it is still relevant. The workaround of disabling GC in the internal Julia code can cause memory build up, making threading unusable in some cases.

I don't know if this can be solved by modifications of this package alone, that's another question.

@vchuravy
Copy link

vchuravy commented Sep 8, 2023

So there are a couple of interactions here:

  1. Julia uses signals internally for multiple purposes, starting Julia with signal handlers off and multi-threading on is an unsupported configuration (I am working on a change that will throw an error if this is detecteed)
  2. When python code is executing and another thread wants to run GC we may appear to hang. In order to avoid this before calling into python the thread must be marked as GC safe and unmarked when returning to Julia.

If you want the python signal handlers to work as well then someone would need to implement signal chaining which might be a bit fiddly but shouldn't be to complicated. I briefly chatted with @vtjnash about this a while ago.

@github-actions github-actions bot removed the stale Issues about to be auto-closed label Sep 9, 2023
@cjdoris
Copy link
Collaborator

cjdoris commented Sep 13, 2023

I think this was largely fixed (well, worked-around) by setting the env var PYTHON_JULIACALL_HANDLE_SIGNALS=yes? This installs the Julia signal handler, so that GC works as expected. The downside is that you don't get the Python signal handler, which breaks some things like Ctrl-C being caught and turned into KeyboardInterrupt.

@vchuravy signal chaining would be cool - presumably this requires some level of co-operation by either the core Julia or Python implementation?

@MilesCranmer
Copy link
Contributor

MilesCranmer commented Jan 26, 2024

@cjdoris could PYTHON_JULIACALL_HANDLE_SIGNALS=yes be used as default? Tons of Julia code is multithreaded under-the-hood so this sort of thing will inevitably be triggered at some point. The user current experience is to get a bus error without further explanation which is not great.

Alternatively is there any way one could guarantee that all objects passed to a Julia call are copied and no longer accessible to Python? Sort of like how Rust guarantees thread safety. I think allowing mutation by default is going to be dangerous for my application so would like a way to force it to copy objects when passed one way or the other. (I'm not even sure jl.deepcopy() on arguments would work here, but maybe it would?)

@vchuravy what is the ETA of signal chaining and who should I poke to help get this working?

@vchuravy
Copy link

I don't know that anyone is working on signal chaining.

@cjdoris
Copy link
Collaborator

cjdoris commented Jan 27, 2024

The reason why PYTHON_JULIACALL_HANDLE_SIGNALS=no is the default is because setting it to yes messes with ordinary Python behaviour. For example, Ctrl-C no longer raises a KeyboardInterrupt but just crashes Python immediately.

However what would probably be a good idea is to raise a warning if juliacall is being started with multiple threads and signal handling turned off, suggesting the user may want to enable julia signal handling.

@hhaensel
Copy link

Maybe my solution in #201 (comment) could help you?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants