fix/workflow llm provider isolation#300
Merged
Merged
Conversation
…der singleton The workflow `llm.ask` path shares the process-wide `Provider._providers` registry with the session/agent runner. Under concurrent load each workflow LLM call could destabilize an in-flight session call: * `_prepare_provider` unconditionally rewrote `provider._config` and forced `provider._client = None`, dropping the session's live httpx connection pool and silently flipping `custom_settings` (notably `trust_env`) that the session had set. * `Provider._ensure_initialized` flipped `_initialized = True` before the registry was populated, so a concurrent caller could take the fast path and observe `Provider.get(...) is None` for built-in providers. * `Provider.apply_config` is called on every session step and on every workflow `llm.ask`; it unconditionally re-`configure`d the provider and rebuilt `provider._config_models`, racing readers on the mutable list across event loops. * The same `httpx.AsyncClient` could be inherited across event loops (session: uvicorn main loop, workflow: `flocks-workflow-llm-loop`), triggering "got Future attached to a different loop" or silent hangs. Changes ------- `flocks/workflow/llm.py`: * Serialize `_prepare_provider` per-provider via a `threading.Lock` keyed by `provider_id`. * Make reconfigure idempotent: skip `provider.configure(...)` and the client reset when the desired `ProviderConfig` (api_key / base_url / custom_settings) already matches what the provider holds. * Only override `custom_settings['trust_env']` when the user explicitly set `workflow.llm.trust_env` in flocks config. * Track the owning event loop of `provider._client` in a `WeakKeyDictionary` and force a client reset when the workflow loop id differs from the marker, even if the config is unchanged. `flocks/provider/provider.py`: * `_ensure_initialized` uses double-checked locking with a `threading.Lock` and flips `_initialized = True` only after `_load_dynamic_providers()` returns, so concurrent callers never observe a partially populated registry. * `apply_config` compares the desired `ProviderConfig` against the existing `provider._config` and skips `provider.configure(...)` when unchanged. The `_config_models` rebuild is gated by a signature-based equality check and assigned atomically. Tests ----- 13 new tests, all passing: * `tests/workflow/test_llm_provider_isolation.py`: idempotency, trust_env inheritance vs override, api_key / base_url change triggering reconfigure, cross-loop client reset, same-loop client reuse, per-provider lock identity. * `tests/provider/test_provider_lazy_init_thread_safe.py`: races 20 threads through `_ensure_initialized` behind a `Barrier` and asserts every observer sees a fully populated registry. * `tests/provider/test_provider_apply_config_idempotent.py`: no-op path on unchanged config; mutation path on api_key change. Co-authored-by: Cursor <cursoragent@cursor.com>
ba0f158 to
ca40ece
Compare
xiami762
approved these changes
May 21, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
isolate workflow LLMClient from session
Provider singleton"). It closes three more concurrency holes that surfaced
during the second-pass review:
A. Cross-loop httpx.AsyncClient reuse (regression risk introduced by the
idempotent reset in 3c7b300d).
httpx.AsyncClientbinds to whicheverevent loop awaits it first. Session runs on uvicorn's main loop; the
workflow runs on the dedicated flocks-workflow-llm-loop. If session
created the client first, workflow inheriting it on a different loop
raises "got Future attached to a different loop" or silently hangs.
_prepare_providernow tracks the owning loop ofprovider._clientvia a module-level
WeakKeyDictionaryand forces a reset whenever theworkflow loop id differs from the marker — even if the config itself
is unchanged. Same-loop back-to-back calls still reuse the client.
B.
Provider._ensure_initialized()lazy init was not thread-safe: the_initializedflag was flipped beforeproviders_to_registerand_load_dynamic_providers()finished, so a concurrent caller couldtake the fast path and observe an empty
_providersregistry. Theworkflow
_warmup_llm()helper papered over this for itself but thesession had no such guard — concurrent session/agent boot would hit
"Provider X not found" intermittently.
Now uses double-checked locking with
threading.Lockand only flips_initialized = Trueafter the registry is fully populated.C.
Provider.apply_config()is called on every session step andinside workflow
llm.ask(). Previously each call unconditionallyrewrote
provider._configand rebuiltprovider._config_models,causing race-prone reads on the mutable model list across loops.
apply_configis now idempotent: it skips bothprovider.configure(...)and the
_config_modelsrebuild when the desired values alreadymatch what the provider holds. Model-list rebuild remains atomic
(build full list then assign) so readers never observe a half-built
list.