Interchange: launch as a fork/exec not Python multiprocessing #3373

Closed
benclifford opened this issue Apr 20, 2024 · 3 comments

@benclifford
Collaborator

Is your feature request related to a problem? Please describe.
The interchange is launched using Python multiprocessing but mostly wants to be its own freestanding process: for example, it awkwardly works around inheriting the main process logger by avoiding the "parsl" logging topic; and it communicates with the main process mostly via ZMQ (with only a small bit of multiprocessing-based communication near the start).

Multiprocessing-fork is a Bad Thing (see issue #2343) so launching the interchange this way should go away.

Launching as a fresh process would also allow more normal logging (for example, logging everything to the interchange log file rather than carefully avoiding what was inherited).

The only use of multiprocessing queues is to communicate the ZMQ port numbers to be used by workers; that information could be transmitted in a different way: for example, by adding a WORKER_PORTS command to the command channel.

Describe the solution you'd like
Use normal fork/exec so this behaves like a freestanding process.
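
As a rough sketch of that direction (the module path is Parsl's real interchange module, but the flag and wiring here are illustrative assumptions, not the implementation):

```python
# Illustrative sketch only: launch the interchange as a fresh process via
# fork/exec (subprocess) rather than multiprocessing, so nothing Python-level
# (loggers, locks) is inherited from the parent interpreter.
import subprocess
import sys

def launch_interchange(logdir: str) -> subprocess.Popen:
    # --logdir is an assumed flag, for illustration
    return subprocess.Popen(
        [sys.executable, "-m",
         "parsl.executors.high_throughput.interchange",
         "--logdir", logdir],
    )

# contrast with the multiprocessing-fork style this issue wants to remove:
#
#   from multiprocessing import Process
#   p = Process(target=starter, kwargs=interchange_config)
#   p.start()
```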

@benclifford
Collaborator Author

> Launching as a fresh process would also allow more normal logging (for example, logging everything to the interchange log file rather than carefully avoiding what was inherited).

Regarding this point: I think I have encountered a situation (although it's quite awkward to trace) where some changes I would like to contribute to Parsl make use of regular Parsl code that logs to a parsl.* logging channel, and then hang. I think (although I am not completely convinced) that this is because a lock used by some part of the regular Parsl logging setup is held across the fork-but-no-exec.

Implementing this issue would remove that possibility.
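
For context, the hazard described here is easy to reproduce outside Parsl. This deliberately contrived, Parsl-free sketch shows how a lock held by another thread at the moment of a fork-without-exec leaves the child permanently blocked:

```python
# Contrived demo of the fork-but-no-exec hazard: a worker thread holds a
# lock when the main thread forks; the child inherits the lock in its
# locked state but not the thread that would release it, so it deadlocks.
import os
import threading
import time

lock = threading.Lock()  # stands in for a logging / buffered-I/O lock

def hold_lock_briefly():
    with lock:
        time.sleep(1)  # the fork happens inside this window

t = threading.Thread(target=hold_lock_briefly)
t.start()
time.sleep(0.1)  # ensure the worker thread has taken the lock

pid = os.fork()  # fork, no exec: child inherits the locked lock
if pid == 0:
    lock.acquire()  # child hangs here forever: nobody can release it
    os._exit(0)
else:
    t.join()
    os.waitpid(pid, 0)  # parent in turn hangs waiting for the child
```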

@benclifford
Collaborator Author

In the immediately preceding comment, the hang looks like this:

```
(gdb) bt
#0  __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x6473d2908b30)
    at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (futex_word=futex_word@entry=0x6473d2908b30, expected=expected@entry=0, clockid=clockid@entry=0, 
    abstime=abstime@entry=0x0, private=<optimized out>, cancel=cancel@entry=true) at ./nptl/futex-internal.c:87
#2  0x00007b717460befb in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x6473d2908b30, expected=expected@entry=0, 
    clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at ./nptl/futex-internal.c:139
#3  0x00007b7174616c3f in do_futex_wait (sem=sem@entry=0x6473d2908b30, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:111
#4  0x00007b7174616cd0 in __new_sem_wait_slow64 (sem=sem@entry=0x6473d2908b30, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:183
#5  0x00007b7174616d39 in __new_sem_wait (sem=sem@entry=0x6473d2908b30) at ./nptl/sem_wait.c:42
#6  0x00006473d0fed3b8 in PyThread_acquire_lock_timed (lock=0x6473d2908b30, microseconds=<optimized out>, intr_flag=intr_flag@entry=0)
    at Python/thread_pthread.h:486
#7  0x00006473d0fed63c in PyThread_acquire_lock (lock=<optimized out>, waitflag=waitflag@entry=1) at Python/thread_pthread.h:744
#8  0x00006473d103305a in _enter_buffered_busy (self=self@entry=0x7b716c364720) at ./Modules/_io/bufferedio.c:300
#9  0x00006473d1037c6b in _io_BufferedWriter_write_impl (buffer=0x7ffd754ba520, self=0x7b716c364720) at ./Modules/_io/bufferedio.c:2028
#10 _io_BufferedWriter_write (self=0x7b716c364720, arg=0x7b716c3144b0) at ./Modules/_io/clinic/bufferedio.c.h:955
#11 0x00006473d0e6ce31 in method_vectorcall_O (func=0x7b717423f240, args=0x7ffd754ba640, nargsf=<optimized out>, kwnames=0x0)
    at Objects/descrobject.c:482
#12 0x00006473d0e5fda5 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7ffd754ba640, callable=0x7b717423f240, 
    tstate=0x6473d13b1828 <_PyRuntime+459656>) at ./Include/internal/pycore_call.h:92
#13 PyObject_VectorcallMethod (name=name@entry=0x6473d1350580 <_PyRuntime+61664>, args=args@entry=0x7ffd754ba640, nargsf=<optimized out>, 
    nargsf@entry=9223372036854775810, kwnames=kwnames@entry=0x0) at Objects/call.c:887
#14 0x00006473d10398b4 in PyObject_CallMethodOneArg (arg=0x7b716c3144b0, name=0x6473d1350580 <_PyRuntime+61664>, self=<optimized out>)
    at ./Include/cpython/abstract.h:103
#15 _textiowrapper_writeflush (self=self@entry=0x7b716eca5460) at ./Modules/_io/textio.c:1629
#16 0x00006473d103acf2 in _io_TextIOWrapper_flush_impl (self=0x7b716eca5460) at ./Modules/_io/textio.c:3095
#17 _io_TextIOWrapper_flush (self=0x7b716eca5460, _unused_ignored=<optimized out>) at ./Modules/_io/clinic/textio.c.h:966
#18 0x00006473d0df7e0f in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x7b717484a560, throwflag=<optimized out>)
    at Python/bytecodes.c:3150
#19 0x00006473d0e62731 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=4, args=0x7ffd754ba860, callable=0x7b71736191c0, 
    tstate=0x6473d13b1828 <_PyRuntime+459656>) at ./Include/internal/pycore_call.h:92
#20 method_vectorcall (method=<optimized out>, args=0x7b716c1e96d8, nargsf=<optimized out>, kwnames=0x0) at Objects/classobject.c:91
#21 0x00006473d0df72e1 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x7b717484a130, throwflag=<optimized out>)
    at Python/bytecodes.c:3254
#22 0x00006473d0e60fdf in _PyObject_FastCallDictTstate (kwargs=0x7b716c3c8d40, nargsf=<optimized out>, args=0x7ffd754baa80, 
    callable=0x7b716f236340, tstate=0x6473d13b1828 <_PyRuntime+459656>) at Objects/call.c:144
#23 _PyObject_Call_Prepend (tstate=tstate@entry=0x6473d13b1828 <_PyRuntime+459656>, callable=callable@entry=0x7b716f236340, 
    obj=obj@entry=0x7b717287eb10, args=args@entry=0x6473d1353e88 <_PyRuntime+76264>, kwargs=kwargs@entry=0x7b7172895b80)
    at Objects/call.c:508
#24 0x00006473d0ee9ead in slot_tp_init (self=0x7b717287eb10, args=0x6473d1353e88 <_PyRuntime+76264>, kwds=0x7b7172895b80)
    at Objects/typeobject.c:9013
#25 0x00006473d0ed75d7 in type_call (type=<optimized out>, args=0x6473d1353e88 <_PyRuntime+76264>, kwds=0x7b7172895b80)
    at Objects/typeobject.c:1673
#26 0x00006473d0e60b89 in PyObject_Call () at Objects/call.c:376
#27 0x00006473d0df72e1 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x7b7174849f90, throwflag=<optimized out>)
    at Python/bytecodes.c:3254
#28 0x00006473d0e60f34 in _PyObject_FastCallDictTstate (kwargs=0x0, nargsf=2, args=0x7ffd754bad70, callable=0x7b716ed62e80, 
    tstate=0x6473d13b1828 <_PyRuntime+459656>) at Objects/call.c:133
#29 _PyObject_Call_Prepend (tstate=tstate@entry=0x6473d13b1828 <_PyRuntime+459656>, callable=callable@entry=0x7b716ed62e80, 
    obj=obj@entry=0x7b716c2267b0, args=args@entry=0x7b716c21ae00, kwargs=kwargs@entry=0x0) at Objects/call.c:508
#30 0x00006473d0ee9ead in slot_tp_init (self=0x7b716c2267b0, args=0x7b716c21ae00, kwds=0x0) at Objects/typeobject.c:9013
#31 0x00006473d0ed75d7 in type_call (type=<optimized out>, args=0x7b716c21ae00, kwds=0x0) at Objects/typeobject.c:1673
#32 0x00006473d0e5e3d4 in _PyObject_MakeTpCall (tstate=0x6473d13b1828 <_PyRuntime+459656>, callable=callable@entry=0x6473d2f4fc80, 
    args=args@entry=0x7b7174849c80, nargs=<optimized out>, keywords=0x0) at Objects/call.c:240
#33 0x00006473d0e5ecff in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=0x7b7174849c80, 
    callable=0x6473d2f4fc80, tstate=<optimized out>) at ./Include/internal/pycore_call.h:90
#34 0x00006473d0df7a07 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x7b7174849c18, throwflag=<optimized out>)
    at Python/bytecodes.c:2706
#35 0x00006473d0e60f34 in _PyObject_FastCallDictTstate (kwargs=0x0, nargsf=2, args=0x7ffd754bb050, callable=0x7b716f19c540, 
    tstate=0x6473d13b1828 <_PyRuntime+459656>) at Objects/call.c:133
#36 _PyObject_Call_Prepend (tstate=tstate@entry=0x6473d13b1828 <_PyRuntime+459656>, callable=callable@entry=0x7b716f19c540, 
    obj=obj@entry=0x7b716c39a2d0, args=args@entry=0x7b716c1f83a0, kwargs=kwargs@entry=0x0) at Objects/call.c:508
#37 0x00006473d0ee9ead in slot_tp_init (self=0x7b716c39a2d0, args=0x7b716c1f83a0, kwds=0x0) at Objects/typeobject.c:9013
#38 0x00006473d0ed75d7 in type_call (type=<optimized out>, args=0x7b716c1f83a0, kwds=0x0) at Objects/typeobject.c:1673
#39 0x00006473d0e5e3d4 in _PyObject_MakeTpCall (tstate=0x6473d13b1828 <_PyRuntime+459656>, callable=callable@entry=0x6473d2e1d240, 
    args=args@entry=0x7b71748497c0, nargs=<optimized out>, keywords=0x0) at Objects/call.c:240
#40 0x00006473d0e5ecff in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=0x7b71748497c0, 
    callable=0x6473d2e1d240, tstate=<optimized out>) at ./Include/internal/pycore_call.h:90
#41 0x00006473d0df7a07 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x7b7174849758, throwflag=<optimized out>)
    at Python/bytecodes.c:2706
#42 0x00006473d0e60fdf in _PyObject_FastCallDictTstate (kwargs=0x7b716c1f8580, nargsf=<optimized out>, args=0x7ffd754bb330, 
    callable=0x7b7173f21940, tstate=0x6473d13b1828 <_PyRuntime+459656>) at Objects/call.c:144
#43 _PyObject_Call_Prepend (tstate=tstate@entry=0x6473d13b1828 <_PyRuntime+459656>, callable=callable@entry=0x7b7173f21940, 
    obj=obj@entry=0x7b7173628d60, args=args@entry=0x6473d1353e88 <_PyRuntime+76264>, kwargs=kwargs@entry=0x7b716c23e100)
    at Objects/call.c:508
#44 0x00006473d0ee476d in slot_tp_call (self=0x7b7173628d60, args=0x6473d1353e88 <_PyRuntime+76264>, kwds=0x7b716c23e100)
    at Objects/typeobject.c:8769
#45 0x00006473d0e5e3d4 in _PyObject_MakeTpCall (tstate=0x6473d13b1828 <_PyRuntime+459656>, callable=callable@entry=0x7b7173628d60, 
    args=args@entry=0x7b7174849210, nargs=<optimized out>, keywords=0x7b71736ffeb0) at Objects/call.c:240
#46 0x00006473d0e5ecff in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=0x7b7174849210, 
    callable=0x7b7173628d60, tstate=<optimized out>) at ./Include/internal/pycore_call.h:90
#47 0x00006473d0df7a07 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x7b71748491b0, throwflag=<optimized out>)
    at Python/bytecodes.c:2706
#48 0x00006473d0e60fdf in _PyObject_FastCallDictTstate (kwargs=0x7b716c1f8df0, nargsf=<optimized out>, args=0x7ffd754bb5f0, 
    callable=0x7b7173f21940, tstate=0x6473d13b1828 <_PyRuntime+459656>) at Objects/call.c:144
#49 _PyObject_Call_Prepend (tstate=tstate@entry=0x6473d13b1828 <_PyRuntime+459656>, callable=callable@entry=0x7b7173f21940, 
    obj=obj@entry=0x7b7173628f40, args=args@entry=0x6473d1353e88 <_PyRuntime+76264>, kwargs=kwargs@entry=0x7b716c3a8800)
    at Objects/call.c:508
#50 0x00006473d0ee476d in slot_tp_call (self=0x7b7173628f40, args=0x6473d1353e88 <_PyRuntime+76264>, kwds=0x7b716c3a8800)
    at Objects/typeobject.c:8769
#51 0x00006473d0e60b89 in PyObject_Call () at Objects/call.c:376
#52 0x00006473d0df72e1 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x7b7174848e48, throwflag=<optimized out>)
    at Python/bytecodes.c:3254
#53 0x00006473d0e60fdf in _PyObject_FastCallDictTstate (kwargs=0x7b716ed57a80, nargsf=<optimized out>, args=0x7ffd754bb8c0, 
    callable=0x7b7173f21940, tstate=0x6473d13b1828 <_PyRuntime+459656>) at Objects/call.c:144
#54 _PyObject_Call_Prepend (tstate=tstate@entry=0x6473d13b1828 <_PyRuntime+459656>, callable=callable@entry=0x7b7173f21940, 
    obj=obj@entry=0x7b71736290d0, args=args@entry=0x6473d1353e88 <_PyRuntime+76264>, kwargs=kwargs@entry=0x7b716c1cbfc0)
    at Objects/call.c:508
#55 0x00006473d0ee476d in slot_tp_call (self=0x7b71736290d0, args=0x6473d1353e88 <_PyRuntime+76264>, kwds=0x7b716c1cbfc0)
    at Objects/typeobject.c:8769
#56 0x00006473d0e5e3d4 in _PyObject_MakeTpCall (tstate=0x6473d13b1828 <_PyRuntime+459656>, callable=callable@entry=0x7b71736290d0, 
    args=args@entry=0x7b71748488f0, nargs=<optimized out>, keywords=0x7b71737e6c00) at Objects/call.c:240
#57 0x00006473d0e5ecff in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=0x7b71748488f0, 
    callable=0x7b71736290d0, tstate=<optimized out>) at ./Include/internal/pycore_call.h:90
#58 0x00006473d0df7a07 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x7b7174848870, throwflag=<optimized out>)
    at Python/bytecodes.c:2706
#59 0x00006473d0e60fdf in _PyObject_FastCallDictTstate (kwargs=0x7b716effca30, nargsf=<optimized out>, args=0x7ffd754bbb80, 
    callable=0x7b7173f21940, tstate=0x6473d13b1828 <_PyRuntime+459656>) at Objects/call.c:144
#60 _PyObject_Call_Prepend (tstate=tstate@entry=0x6473d13b1828 <_PyRuntime+459656>, callable=callable@entry=0x7b7173f21940, 
    obj=obj@entry=0x7b71736291c0, args=args@entry=0x6473d1353e88 <_PyRuntime+76264>, kwargs=kwargs@entry=0x7b716f04c4c0)
    at Objects/call.c:508
#61 0x00006473d0ee476d in slot_tp_call (self=0x7b71736291c0, args=0x6473d1353e88 <_PyRuntime+76264>, kwds=0x7b716f04c4c0)
    at Objects/typeobject.c:8769
#62 0x00006473d0e5e3d4 in _PyObject_MakeTpCall (tstate=0x6473d13b1828 <_PyRuntime+459656>, callable=callable@entry=0x7b71736291c0, 
    args=args@entry=0x7b7174848600, nargs=<optimized out>, keywords=0x7b7173a6b2e0) at Objects/call.c:240
#63 0x00006473d0e5ecff in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=0x7b7174848600, 
    callable=0x7b71736291c0, tstate=<optimized out>) at ./Include/internal/pycore_call.h:90
#64 0x00006473d0df7a07 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x7b7174848598, throwflag=<optimized out>)
    at Python/bytecodes.c:2706
#65 0x00006473d0e60fdf in _PyObject_FastCallDictTstate (kwargs=0x7b7172c7fd30, nargsf=<optimized out>, args=0x7ffd754bbe40, 
    callable=0x7b7173f21940, tstate=0x6473d13b1828 <_PyRuntime+459656>) at Objects/call.c:144
#66 _PyObject_Call_Prepend (tstate=tstate@entry=0x6473d13b1828 <_PyRuntime+459656>, callable=callable@entry=0x7b7173f21940, 
    obj=obj@entry=0x7b7173628400, args=args@entry=0x6473d1353e88 <_PyRuntime+76264>, kwargs=kwargs@entry=0x7b716f1b6040)
    at Objects/call.c:508
#67 0x00006473d0ee476d in slot_tp_call (self=0x7b7173628400, args=0x6473d1353e88 <_PyRuntime+76264>, kwds=0x7b716f1b6040)
    at Objects/typeobject.c:8769
#68 0x00006473d0e5e3d4 in _PyObject_MakeTpCall (tstate=0x6473d13b1828 <_PyRuntime+459656>, callable=callable@entry=0x7b7173628400, 
    args=args@entry=0x7b71748481d0, nargs=<optimized out>, keywords=0x7b7173a6ba60) at Objects/call.c:240
#69 0x00006473d0e5ecff in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=0x7b71748481d0, 
    callable=0x7b7173628400, tstate=<optimized out>) at ./Include/internal/pycore_call.h:90
#70 0x00006473d0df7a07 in _PyEval_EvalFrameDefault (tstate=tstate@entry=0x6473d13b1828 <_PyRuntime+459656>, frame=0x7b7174848120, 
    frame@entry=0x7b7174848020, throwflag=throwflag@entry=0) at Python/bytecodes.c:2706
#71 0x00006473d0f7b617 in _PyEval_EvalFrame (throwflag=0, frame=0x7b7174848020, tstate=0x6473d13b1828 <_PyRuntime+459656>)
    at ./Include/internal/pycore_ceval.h:89
#72 _PyEval_Vector (args=0x0, argcount=0, kwnames=0x0, locals=0x7b717428dac0, func=0x7b7174055120, 
    tstate=0x6473d13b1828 <_PyRuntime+459656>) at Python/ceval.c:1683
#73 PyEval_EvalCode (co=co@entry=0x7b7174214630, globals=globals@entry=0x7b717428dac0, locals=locals@entry=0x7b717428dac0)
    at Python/ceval.c:578
#74 0x00006473d0fd7aa8 in run_eval_code_obj (tstate=tstate@entry=0x6473d13b1828 <_PyRuntime+459656>, co=co@entry=0x7b7174214630, 
    globals=globals@entry=0x7b717428dac0, locals=locals@entry=0x7b717428dac0) at Python/pythonrun.c:1722
#75 0x00006473d0fd7b97 in run_mod (mod=<optimized out>, filename=filename@entry=0x7b71740cc870, globals=globals@entry=0x7b717428dac0, 
    locals=locals@entry=0x7b717428dac0, flags=flags@entry=0x7ffd754bc278, arena=arena@entry=0x7b71741afe30) at Python/pythonrun.c:1743
#76 0x00006473d0fda74c in pyrun_file (flags=0x7ffd754bc278, closeit=1, locals=0x7b717428dac0, globals=0x7b717428dac0, start=257, 
    filename=0x7b71740cc870, fp=0x6473d14c5c50) at Python/pythonrun.c:1643
#77 _PyRun_SimpleFileObject (fp=fp@entry=0x6473d14c5c50, filename=filename@entry=0x7b71740cc870, closeit=closeit@entry=1, 
    flags=flags@entry=0x7ffd754bc278) at Python/pythonrun.c:433
#78 0x00006473d0fdad2c in _PyRun_AnyFileObject (fp=0x6473d14c5c50, filename=filename@entry=0x7b71740cc870, closeit=closeit@entry=1, 
    flags=flags@entry=0x7ffd754bc278) at Python/pythonrun.c:78
#79 0x00006473d10019e9 in pymain_run_file_obj (skip_source_first_line=0, filename=0x7b71740cc870, program_name=0x7b71740cc8d0)
    at Modules/main.c:360
#80 pymain_run_file (config=0x6473d1354408 <_PyRuntime+77672>) at Modules/main.c:379
#81 pymain_run_python (exitcode=exitcode@entry=0x7ffd754bc3b0) at Modules/main.c:629
#82 0x00006473d100202a in Py_RunMain () at Modules/main.c:709
#83 pymain_main (args=0x7ffd754bc370) at Modules/main.c:739
#84 Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:763
#85 0x00007b71745ad24a in __libc_start_call_main (main=main@entry=0x6473d0df6740 <main>, argc=argc@entry=5, argv=argv@entry=0x7ffd754bc4e8)
    at ../sysdeps/nptl/libc_start_call_main.h:58
#86 0x00007b71745ad305 in __libc_start_main_impl (main=0x6473d0df6740 <main>, argc=5, argv=0x7ffd754bc4e8, init=<optimized out>, 
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffd754bc4d8) at ../csu/libc-start.c:360
#87 0x00006473d0e03011 in _start ()
(gdb)
```

The hang is always on the lock belonging to ./Modules/_io/bufferedio.c (in the Python source code).

This failure is extremely sensitive to perturbation: for example, adding a log statement right before launching the interchange makes the hang not happen.

I could go further and modify the C source code to track which process actually took that lock, to confirm that it really was locked by the parent; or I could add an assert near startup that it is not locked. I'm not going to dig into that further: based on what I've seen debugging this sort of thing previously, this is good enough evidence for me that this is a logging lock held across a fork.
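
As a sketch of that "assert near startup" idea - assuming, purely for illustration, that the lock were visible from Python; the real culprit lives inside CPython's _io module, so a faithful check would need the C-level instrumentation mentioned above:

```python
# Shape of the "assert it is not locked" check, using os.register_at_fork.
# suspect_lock is a hypothetical Python-level stand-in for the C-level
# buffered-I/O lock that the backtrace shows.
import os
import threading

suspect_lock = threading.Lock()

def check_not_locked():
    # Runs in the child immediately after fork(): if the lock was held by
    # a parent thread at fork time, the child will deadlock the moment it
    # tries to log, so fail loudly at startup instead.
    assert not suspect_lock.locked(), "lock inherited locked across fork"

os.register_at_fork(after_in_child=check_not_locked)
```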

So, I'll proceed to work on this issue #3373 because I need to be able to work on the interchange for further DESC monitoring work.

benclifford added a commit that referenced this issue May 17, 2024
This is in preparation in an upcoming PR for replacing the current
multiprocessing.Queue based report of interchange ports (which has a 120s
timeout) with a command client based retrieval of that information
(so now the command client needs to implement a 120s timeout to closely
replicate that behaviour). That work is itself part of using fork/exec to
launch the interchange, rather than multiprocessing.fork (issue #3373)

When the command client times out after sending a command, it sets itself as
permanently bad: this is because the state of the channel is now unknown.
E.g. should the next action be to receive a response from a previously timed out
command that was eventually executed? Should the channel be recreated, assuming
a previously sent command was never sent?

Tagging issue #3376 (command client is not thread safe) because I feel like
reworking this timeout behaviour and reworking that thread safety might be a
single piece of deeper work.
benclifford added a commit that referenced this issue May 27, 2024
This is in preparation in an upcoming PR for replacing the current
multiprocessing.Queue based report of interchange ports (which has a 120s
timeout) with a command client based retrieval of that information
(so now the command client needs to implement a 120s timeout to closely
replicate that behaviour). That work is itself part of using fork/exec to
launch the interchange, rather than multiprocessing.fork (issue #3373)

When the command client times out after sending a command, it sets itself as
permanently bad: this is because the state of the channel is now unknown.
E.g. should the next action be to receive a response from a previously timed out
command that was eventually executed? Should the channel be recreated, assuming
a previously sent command was never sent?

Co-authored-by: rjmello <30907815+rjmello@users.noreply.github.com>
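
The timeout-then-permanently-bad behaviour described in this commit message might look roughly like the following sketch (pyzmq REQ socket; the class and method names are assumptions, not Parsl's actual CommandClient):

```python
# Minimal sketch: a command client that times out once and then refuses
# all further use, because a REQ/REP channel that has timed out is in an
# unknown state (a late reply may still arrive).
import zmq

class SketchCommandClient:
    def __init__(self, address: str, timeout_s: float = 120.0):
        self._socket = zmq.Context().socket(zmq.REQ)
        self._socket.connect(address)
        self._timeout_ms = int(timeout_s * 1000)
        self._bad = False

    def run(self, command: bytes) -> bytes:
        if self._bad:
            raise RuntimeError("channel state unknown after earlier timeout")
        self._socket.send(command)
        if self._socket.poll(self._timeout_ms, zmq.POLLIN):
            return self._socket.recv()
        self._bad = True  # poison the client: don't guess at channel state
        raise TimeoutError(f"no reply to {command!r} in {self._timeout_ms} ms")
```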
benclifford added a commit that referenced this issue May 27, 2024
Before this PR, the interchange used a multiprocessing.Queue to send a single
message containing the ports it is listening on back to the submitting
process. This ties the interchange into being forked via multiprocessing,
even though the rest of the interchange is architected, as part of earlier
remote-interchange work, to not need that.

After this PR, the CommandClient used for other submit-side to interchange
communication is used to retrieve this information, removing that reliance
on multiprocessing and reducing the number of different communication channels
used between the interchange and submit side by one.

See issue #3373 for more context on launching the interchange via fork/exec
rather than using multiprocessing.
benclifford added a commit that referenced this issue May 28, 2024

Before this PR, the interchange used a multiprocessing.Queue to send a single
message containing the ports it is listening on back to the submitting
process. This ties the interchange into being forked via multiprocessing,
even though the rest of the interchange is architected, as part of earlier
remote-interchange work, to not need that.

After this PR, the CommandClient used for other submit-side to interchange
communication is used to retrieve this information, removing that reliance
on multiprocessing and reducing the number of different communication channels
used between the interchange and submit side by one.

See issue #3373 for more context on launching the interchange via fork/exec
rather than using multiprocessing.

The CommandClient is not thread-safe - see #3376 - and it is possible that this will introduce a new thread-unsafety, as this will be called from the main thread of execution, while most invocations happen (later on) from the status poller thread.
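
Both sides of that exchange might look like this sketch (the WORKER_PORTS command name comes from the issue description; the pickled wire format and the surrounding names are illustrative assumptions):

```python
# Sketch: worker ports travel over the existing command channel instead of
# a multiprocessing.Queue.
import pickle

# interchange side: one more case in the command-handling loop
def handle_command(command: bytes, task_port: int, result_port: int) -> bytes:
    if command == b"WORKER_PORTS":
        return pickle.dumps((task_port, result_port))
    return pickle.dumps(None)  # handling of other commands elided

# submit side: replaces multiprocessing.Queue.get(timeout=120); the
# command client's own 120 s timeout now plays that role
def get_worker_ports(command_client) -> tuple:
    return pickle.loads(command_client.run(b"WORKER_PORTS"))
```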
benclifford added a commit that referenced this issue May 28, 2024
…l, not multiprocessing

any downstream packaging will need to be aware of the presence of interchange.py as a new command-line invocable script;
this might break some build instructions which do not configure installed scripts onto the path.

this PR replaces keyword arguments with argparse command line parameters. it does not attempt to make those
command line arguments differently optional from the constructor of the Interchange class (for example, worker_ports and
worker_port_range are both mandatory, because they are both specified before this PR)

i'm somewhat uncomfortable with this seeming like an ad-hoc serialise/deserialise protocol for what was previously
effectively a dict of typed python objects... but it's what the process worker pool does.

see issue #3373 for interchange specific issue

see issue #2343 for parsl general fork vs threads issue

see possibly issue #3378?
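
The entry point described here might take a shape like this (flag names are assumptions for illustration; see the PR for the real set):

```python
# Sketch of an argparse-based command-line entry point for interchange.py:
# what were keyword arguments to the Interchange constructor arrive as
# string-typed command-line parameters instead.
import argparse

def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Parsl HTEX interchange")
    parser.add_argument("--client-address", required=True)
    parser.add_argument("--logdir", required=True)
    # mandatory, matching how they were always passed before this PR,
    # rather than mirroring the constructor's own optionality
    parser.add_argument("--worker-ports", required=True)
    parser.add_argument("--worker-port-range", required=True)
    return parser.parse_args(argv)

if __name__ == "__main__":
    args = parse_args()
    # Interchange(**vars(args)).start()  # hypothetical wiring
```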
@benclifford
Collaborator Author

Implemented by #3463
