
preloading '__main__' with forkserver has been broken for a long time #98552

Open
aggieNick02 opened this issue Oct 22, 2022 · 4 comments
Labels
3.11 only security fixes 3.12 bugs and security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error


aggieNick02 commented Oct 22, 2022

Bug report

The forkserver start method lets you call set_forkserver_preload on a multiprocessing context to load modules into, and otherwise configure, the forkserver process. Done carefully, this avoids repeating module loading and other setup each time the forkserver is forked to create a new process. Without that, forkserver can be far slower than the traditional fork start method.

You can include the module '__main__' in the set_forkserver_preload list, and the forkserver source has special-case code for it: it ensures that the main file path does not have to be configured/loaded after each fork. To do this, multiprocessing.forkserver.ensure_running calls multiprocessing.spawn.get_preparation_data and then uses the main_path entry that may be present in the returned dict.

Unfortunately, three months after it was introduced, this functionality was broken by commit 9a76735. That commit renamed the main_path dictionary entry returned by get_preparation_data to init_main_from_path, but didn't update the use in multiprocessing.forkserver.

Not being able to load and configure main in the forkserver turned out to be unusually painful in my recent scenario, which is what led to tracking this down. I have a Python program on a share that spawns short-lived processes at a high rate, and multiple machines run this program from the share. A huge slowdown ensues as smbd processes on the server go crazy responding to every new process on every client reading the file and stat-ing its containing directory.

A simple fix in multiprocessing.forkserver accounting for the renamed key rectifies the problem. I'll work on putting that PR together. Any thoughts on a workaround that doesn't require modifying the Python source are welcome, as I imagine it will be a while until I'm on a Python with the fix.

Here's a simple repro.

import multiprocessing
import time

print('hi from forkserver_repro')

def _silly(i):
    time.sleep(0.2)
    return i % 2

def run_subprocesses():
    process_list = []
    for i in range(10):
        process_list.append(multiprocessing.Process(target=_silly, args=(i,)))
    for process in process_list:
        process.start()
    for process in process_list:
        process.join()
    return 0

if __name__ == '__main__':
    multiprocessing.set_start_method('forkserver')
    ctx = multiprocessing.get_context('forkserver')
    ctx.set_forkserver_preload(['__main__'])
    print('Only one more "hi from forkserver_repro" should print! More means a bug!')
    run_subprocesses()

Your environment

Ubuntu 20.04.4 LTS, CPython 3.8
Python source code examination indicates this bug is still present in the current version of CPython.

@aggieNick02 aggieNick02 added the type-bug An unexpected behavior, bug, or error label Oct 22, 2022
@gpshead gpshead added 3.11 only security fixes 3.12 bugs and security fixes labels Oct 23, 2022

gpshead commented Oct 23, 2022

A PR fixing this and adding unittest coverage would be great.

BTW, as you are using forkserver, be aware of the recently disclosed #97514 if you ever run on 3.9 or later. (That issue can be worked around.)

aggieNick02 (Author) commented:

Thanks for the pointer on the CVE.

The narrow fix is straightforward and I'll try to get it together quickly (this is my first CPython PR, so it may take a little longer). It just deals with the renamed dict entry.

However, I've discovered some other "quirks" with forkserver, and was wondering whether they are known or being worked on, whether they could use improvement, or whether they are expected and it's just my newness in looking at this. (I'm on Linux but trying to use multiprocessing in programs that also use threading, and I've been burned a few times by deadlocks because of it, which led me to move to forkserver. But maybe I should be considering another way forward?)

Anyway, with the break fixed by using the right dict key to get main_path into main, preloading __main__ works, but imports in the main module can still fail, because sys.path is not set in forkserver's main before importing main. (The correct sys_path is actually fed into forkserver.main, but not used?! It's been that way since forkserver was added nine years ago...)

Not having sys.path set leads to other weirdness even when not importing main. The modules specified for preload will succeed or fail depending on the cwd of the process, so python a/foo.py behaves differently than cd a; python foo.py. And I'm pretty sure I've seen the same module imported twice, once under the name given as a preload and once under the name used by another module (e.g., a.my_module and my_module).

The try/except around each preloaded module hides all the failures, too. In my case, I'd much prefer to get the exception: if the preload fails, every process created by the forkserver has to do expensive module loading itself, and I don't find out until I notice the performance problem.

When spawn is used, spawn._main sets/clears _inheriting and calls prepare with the result of get_preparation_data. My completely uninformed inclination is to do the same in forkserver.main, instead of what's currently happening. Is that a reasonable thing to look into?

aggieNick02 (Author) commented:

I did some experiments. Using spawn.prepare in forkserver.py to consume the dict from get_preparation_data seems to work well. It fixes sys.path so imports work regardless of cwd, fixes preloading __main__ in a less brittle way, and does a bunch of other stuff that I'm guessing should be happening too. But I don't know enough about forkserver and spawn to be certain.

I'll put together a PR for this approach too, to solicit input. forkserver.main gets its parameters from the string created in forkserver.ensure_running, so the dict from get_preparation_data has to be turned into a string. A quick way to do this without worrying about escaping quotes etc. is to remove the authkey from the dict, json.dumps the rest, encode it as UTF-8, and b64encode that. main then reverses the steps to recover the dict. (I thought about pickle but don't know enough about the security implications here.)

The downside (for me) of this more general solution is that it alters code on the new-process side, whereas the less complete fix lives on the original-process side. The latter can therefore be applied by monkey-patching, but the former cannot, as far as I understand.

aggieNick02 (Author) commented:

Updating the tests for these changes led to some comical confusion. It turns out another bug was waiting: forkserver does not flush its stdout/stderr before each fork, and that makes things really confusing.

I was using print statements in the preloaded files to make sure they were imported only in the forkserver and not in each new forked process. But the output said otherwise, even though everything was actually working: the unflushed output in the forkserver was inherited and re-printed by the forked processes.

aggieNick02 added a commit to PCPartPicker/cpython that referenced this issue Nov 15, 2022
…n for a long time (9 years).

We do this by using spawn.get_preparation_data() and then using the
values in that dictionary appropriately. This lets us also fix the
setting of sys.path (which never worked) and other minor properties of
the forkserver process.

While updating the test to verify these fixes, we also discovered that
forkserver was not flushing stdout/stderr before it forked to create a
new process. This caused the updated test to fail because unflushed
output in the forkserver would then be printed by the forked process as
well. This is now fixed too.
@iritkatriel iritkatriel added the stdlib Python modules in the Lib dir label Nov 28, 2023