Add python bindings to the global thread pool functionality #24238
Conversation
We don't have a good way to shut down the thread pool. It should be done before the Python process starts to exit and unload DLLs. See this doc: https://learn.microsoft.com/en-us/windows/win32/dlls/dynamic-link-library-best-practices#deadlocks-caused-by-lock-order-inversion
Like this: apache/mxnet#11163
Would adding … EDIT: I see, the native implementation would just let the Environment destruct at the end of the program, and that handles shutting down. But the Python bindings have a static …
On Windows, when a thread shuts down, the OS also needs to notify all loaded DLLs in case they need to do any cleanup work, and this cannot happen too late. So, in general, all user-created threads should be shut down before unloading DLLs. Theoretically speaking, there is another way of doing this: the cleanup functions could test whether the process is currently shutting down and, if so, skip the cleanup. We do not have such logic in ORT yet. We may consider adding it later.
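For illustration, here is a minimal Python analog of that ordering concern (the worker itself is hypothetical): user-created threads are stopped and joined from an atexit hook, which runs while native modules are still loaded. The thread is marked as a daemon so the interpreter does not try to join it on its own before the hook fires.

```python
import atexit
import threading

_stop_event = threading.Event()

def _worker() -> None:
    # Pretend this thread periodically calls into a native DLL.
    while not _stop_event.wait(timeout=0.1):
        pass

# daemon=True so the interpreter's own thread-join at shutdown does not
# block on this thread before our atexit hook has a chance to stop it.
_thread = threading.Thread(target=_worker, daemon=True)
_thread.start()

@atexit.register
def _shutdown_worker() -> None:
    # Stop and join the thread while native modules are still loaded,
    # i.e. before the interpreter begins tearing everything down.
    _stop_event.set()
    _thread.join()
```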
Avoids compilation errors involving the affinity string.
So, one proposal here: similar to the C++ API, where you create an Env at the top of main() and use it throughout the program so that it destructs right before main returns, have a … (We could even hide it from the Python users by moving the global methods to Python implementations that use a Python-module global.) @yuslepukhin, is there any reason this wouldn't work over the current approach?
For the environment-level thread pool, it would be helpful to review how this is handled in the C# API, as it highlights several important issues. The key concerns are: …
What is also needed here is unit tests. The issue of shutting down the global threads on Python module unload mentioned by @snnn is probably still valid, and it is likely to be uncovered (or not) by the unit tests. The idea of shutting down the TP does not sit well in GC-based environments like Python: we can get objects that are still alive and reachable from Python's point of view, but whose internals have been shuttered by a shutdown API. It requires discipline, and Python code is not typically written in this manner. That said, there is a chance we can get it to work, but not as presented in this PR.
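To make that hazard concrete, here is a hedged illustration (all names hypothetical, not ORT code): a Python object that is still reachable after an explicit shutdown of the native resource it wraps.

```python
class ThreadPoolHandle:
    """Stand-in for a native object with an explicit shutdown API."""
    def __init__(self) -> None:
        self._alive = True

    def shutdown(self) -> None:
        self._alive = False  # native internals are torn down here

    def run(self, task) -> None:
        if not self._alive:
            raise RuntimeError("handle is still reachable, but internals are gone")
        task()

pool = ThreadPoolHandle()
held_by_session = pool      # some other object still references the pool
pool.shutdown()             # explicit shutdown, independent of the GC

try:
    held_by_session.run(lambda: print("work"))
except RuntimeError as exc:
    print(exc)              # alive from Python's point of view, dead inside
```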
Looking at the C# a bit, it seems like they go with roughly what I'm suggesting of having a static singleton instance on the C#-defined …

```python
import atexit
import gc
import threading

# EnvHandle and native_module stand in for the pybind-exported type/functions.
_ENV: EnvHandle | None = None
_ENV_LOCK = threading.Lock()

def get_env() -> EnvHandle:
    global _ENV
    if (env := _ENV) is not None:
        return env
    with _ENV_LOCK:
        if _ENV is not None:
            return _ENV
        _ENV = native_module.GetEnv()
        return _ENV

@atexit.register
def _destroy_env() -> None:
    global _ENV
    with _ENV_LOCK:
        _ENV = None
        # Force the handle to be collected, if it was the last reference.
        gc.collect()
```

and then global python-exported functions like …
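Presumably those module-level functions would look something like the sketch below, built on the get_env() above; the native_module entry-point names are assumptions for illustration, not real ORT bindings.

```python
# Hypothetical module-level wrappers around the lazily created environment.
def set_global_intra_op_num_threads(n: int) -> None:
    env = get_env()  # creates the environment on first use
    native_module.SetGlobalIntraOpNumThreads(env, n)

def get_global_thread_pool_options() -> dict[str, int]:
    env = get_env()
    return native_module.GetGlobalThreadPoolOptions(env)
```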
We will need a couple of things: …
Sorry, I think I was overconcerned. I couldn't reproduce the deadlock problem anymore. Here is what I just learnt: …
Continuing to look at the C# implementation, two more things: …
With that in mind, maybe we don't have to change anything? The environment will get destroyed when all the inference sessions and the … I do need a bit of help figuring out how to test this, though; I've found the test/python directory, …
The two TensorRT pipelines are failing.
The environment will be destroyed automatically when all InferenceSessions and the `_pybind_state` module are destroyed. The current state of the world almost mimics the C# behaviour, except that the environment lives until all sessions are destroyed (whereas C# allows the environment to be destroyed before all sessions are).
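A minimal sketch of that lifetime rule, relying on CPython's reference counting (all names here are stand-ins, not the real bindings): the environment is destroyed only once the module-level reference and every session's reference are gone.

```python
class Env:
    """Stand-in for the native environment handle."""
    def __del__(self) -> None:
        print("environment destroyed")

class Session:
    def __init__(self, env: Env) -> None:
        self._env = env  # every session keeps the environment alive

_module_env = Env()                      # reference held by the module
a, b = Session(_module_env), Session(_module_env)

del _module_env   # module "unloaded": env survives via the sessions
del a             # still one session left
del b             # last reference dropped: "environment destroyed" prints
```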
But the TensorRT tests could not pass.
Do it early because some of the EP specifics need an environment.
@yuslepukhin the issue should be resolved now. Looking at the Windows CUDA check that failed, it seems the error occurred while downloading https://github.com/onnx/onnx/archive/refs/heads/rel-1.15.0.zip: …
I'd like to think this is ready; the WASM failure is almost certainly unrelated.
@snnn I've tried to incorporate the changes from #24891 here, but it mostly amounts to removing the at-import env creation and adding a bunch of … An alternative would be to allow late-binding global thread pools if the env has been initialized without them. Then the Python bindings can continue creating the env at import time, and the global thread pools can be attached after the fact. I don't believe there's any code that would break if we did this.
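From user code, the late-binding alternative could look something like this hedged sketch; `attach_global_thread_pools` and the config key are assumed names for illustration, not the actual API (only `SessionOptions.add_session_config_entry` is an existing binding).

```python
import onnxruntime as ort  # env is created at import time, without global pools

# Attach the global thread pools after the fact (hypothetical function name).
ort.attach_global_thread_pools(intra_op_num_threads=4, inter_op_num_threads=1)

opts = ort.SessionOptions()
# Hypothetical config key opting this session into the shared pools.
opts.add_session_config_entry("session.use_global_thread_pool", "1")
sess = ort.InferenceSession("model.onnx", sess_options=opts)
```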
I don't see where you tried to uninit an OrtEnv. Did I miss something?
Yeah, if you look at the merge, it also includes all those changes. The two PRs after it revert them and go with the post-init global thread pools approach.
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline
Azure Pipelines successfully started running 5 pipeline(s).
Thanks!
Description
Allows users to configure and enable the global thread pool via Python, and have inference sessions use it instead of session-local thread pools.
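As a hedged sketch of the intended flow, usage might look like the following; the exact binding names (`set_global_thread_pool_options`, `use_global_thread_pool`) are assumptions based on the discussion above, not confirmed API.

```python
import onnxruntime as ort

# Configure the process-wide pools once, before creating any sessions
# (hypothetical binding name and signature).
ort.set_global_thread_pool_options(intra_op_num_threads=4, inter_op_num_threads=1)

opts = ort.SessionOptions()
opts.use_global_thread_pool = True  # hypothetical flag on SessionOptions

# Many small sessions now share the same worker threads instead of each
# spinning up its own pool.
model_paths = ["model_a.onnx", "model_b.onnx"]
sessions = [ort.InferenceSession(p, sess_options=opts) for p in model_paths]
```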
Motivation and Context
Forked off of #23495 to take over implementation, see issue #23523.
Our particular use case involves a single service instance serving thousands of individual models, each relatively small (e.g. small decision trees). Creating an individual service for each model is too much overhead, and attempting to start several thousand thread pools is a non-starter. We could have each session be single-threaded, but we would like to be able to separate the request-handler thread count from the compute thread count (e.g. 2 handler threads but 4 intra-op ones).