Skip to content

fix/workflow llm provider isolation#300

Merged
xiami762 merged 1 commit into
devfrom
fix/workflow-llm-provider-isolation
May 21, 2026
Merged

fix/workflow llm provider isolation#300
xiami762 merged 1 commit into
devfrom
fix/workflow-llm-provider-isolation

Conversation

@duguwanglong
Copy link
Copy Markdown
Contributor

isolate workflow LLMClient from session
Provider singleton"). It closes three more concurrency holes that surfaced
during the second-pass review:

A. Cross-loop httpx.AsyncClient reuse (regression risk introduced by the
idempotent reset in 3c7b300d). httpx.AsyncClient binds to whichever
event loop awaits it first. Session runs on uvicorn's main loop; the
workflow runs on the dedicated flocks-workflow-llm-loop. If session
created the client first, workflow inheriting it on a different loop
raises "got Future attached to a different loop" or silently hangs.

_prepare_provider now tracks the owning loop of provider._client
via a module-level WeakKeyDictionary and forces a reset whenever the
workflow loop id differs from the marker — even if the config itself
is unchanged. Same-loop back-to-back calls still reuse the client.

B. Provider._ensure_initialized() lazy init was not thread-safe: the
_initialized flag was flipped before providers_to_register and
_load_dynamic_providers() finished, so a concurrent caller could
take the fast path and observe an empty _providers registry. The
workflow _warmup_llm() helper papered over this for itself but the
session had no such guard — concurrent session/agent boot would hit
"Provider X not found" intermittently.

Now uses double-checked locking with threading.Lock and only flips
_initialized = True after the registry is fully populated.

C. Provider.apply_config() is called on every session step and
inside workflow llm.ask(). Previously each call unconditionally
rewrote provider._config and rebuilt provider._config_models,
causing race-prone reads on the mutable model list across loops.

apply_config is now idempotent: it skips both provider.configure(...)
and the _config_models rebuild when the desired values already
match what the provider holds. Model-list rebuild remains atomic
(build full list then assign) so readers never observe a half-built
list.

@duguwanglong duguwanglong requested a review from xiami762 May 21, 2026 08:00
…der singleton

The workflow `llm.ask` path shares the process-wide `Provider._providers`
registry with the session/agent runner. Under concurrent load each
workflow LLM call could destabilize an in-flight session call:

* `_prepare_provider` unconditionally rewrote `provider._config` and
  forced `provider._client = None`, dropping the session's live
  httpx connection pool and silently flipping `custom_settings`
  (notably `trust_env`) that the session had set.
* `Provider._ensure_initialized` flipped `_initialized = True`
  before the registry was populated, so a concurrent caller could
  take the fast path and observe `Provider.get(...) is None` for
  built-in providers.
* `Provider.apply_config` is called on every session step and on
  every workflow `llm.ask`; it unconditionally re-`configure`d the
  provider and rebuilt `provider._config_models`, racing readers
  on the mutable list across event loops.
* The same `httpx.AsyncClient` could be inherited across event loops
  (session: uvicorn main loop, workflow: `flocks-workflow-llm-loop`),
  triggering "got Future attached to a different loop" or silent
  hangs.

Changes
-------

`flocks/workflow/llm.py`:
* Serialize `_prepare_provider` per-provider via a `threading.Lock`
  keyed by `provider_id`.
* Make reconfigure idempotent: skip `provider.configure(...)` and the
  client reset when the desired `ProviderConfig` (api_key / base_url
  / custom_settings) already matches what the provider holds.
* Only override `custom_settings['trust_env']` when the user
  explicitly set `workflow.llm.trust_env` in flocks config.
* Track the owning event loop of `provider._client` in a
  `WeakKeyDictionary` and force a client reset when the workflow
  loop id differs from the marker, even if the config is unchanged.

`flocks/provider/provider.py`:
* `_ensure_initialized` uses double-checked locking with a
  `threading.Lock` and flips `_initialized = True` only after
  `_load_dynamic_providers()` returns, so concurrent callers
  never observe a partially populated registry.
* `apply_config` compares the desired `ProviderConfig` against the
  existing `provider._config` and skips `provider.configure(...)`
  when unchanged. The `_config_models` rebuild is gated by a
  signature-based equality check and assigned atomically.

Tests
-----

13 new tests, all passing:

* `tests/workflow/test_llm_provider_isolation.py`: idempotency,
  trust_env inheritance vs override, api_key / base_url change
  triggering reconfigure, cross-loop client reset, same-loop
  client reuse, per-provider lock identity.
* `tests/provider/test_provider_lazy_init_thread_safe.py`: races
  20 threads through `_ensure_initialized` behind a `Barrier` and
  asserts every observer sees a fully populated registry.
* `tests/provider/test_provider_apply_config_idempotent.py`:
  no-op path on unchanged config; mutation path on api_key change.

Co-authored-by: Cursor <cursoragent@cursor.com>
@duguwanglong duguwanglong force-pushed the fix/workflow-llm-provider-isolation branch from ba0f158 to ca40ece Compare May 21, 2026 08:10
@xiami762 xiami762 merged commit 602fa04 into dev May 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants