fix/workflow llm provider isolation by duguwanglong · Pull Request #300 · AgentFlocks/flocks

duguwanglong · 2026-05-21T08:00:12Z

isolate workflow LLMClient from session
Provider singleton"). It closes three more concurrency holes that surfaced
during the second-pass review:

A. Cross-loop httpx.AsyncClient reuse (regression risk introduced by the
idempotent reset in 3c7b300d). httpx.AsyncClient binds to whichever
event loop awaits it first. Session runs on uvicorn's main loop; the
workflow runs on the dedicated flocks-workflow-llm-loop. If session
created the client first, workflow inheriting it on a different loop
raises "got Future attached to a different loop" or silently hangs.

_prepare_provider now tracks the owning loop of provider._client
via a module-level WeakKeyDictionary and forces a reset whenever the
workflow loop id differs from the marker — even if the config itself
is unchanged. Same-loop back-to-back calls still reuse the client.

B. Provider._ensure_initialized() lazy init was not thread-safe: the
_initialized flag was flipped before providers_to_register and
_load_dynamic_providers() finished, so a concurrent caller could
take the fast path and observe an empty _providers registry. The
workflow _warmup_llm() helper papered over this for itself but the
session had no such guard — concurrent session/agent boot would hit
"Provider X not found" intermittently.

Now uses double-checked locking with threading.Lock and only flips
_initialized = True after the registry is fully populated.

C. Provider.apply_config() is called on every session step and
inside workflow llm.ask(). Previously each call unconditionally
rewrote provider._config and rebuilt provider._config_models,
causing race-prone reads on the mutable model list across loops.

apply_config is now idempotent: it skips both provider.configure(...)
and the _config_models rebuild when the desired values already
match what the provider holds. Model-list rebuild remains atomic
(build full list then assign) so readers never observe a half-built
list.

…der singleton The workflow `llm.ask` path shares the process-wide `Provider._providers` registry with the session/agent runner. Under concurrent load each workflow LLM call could destabilize an in-flight session call: * `_prepare_provider` unconditionally rewrote `provider._config` and forced `provider._client = None`, dropping the session's live httpx connection pool and silently flipping `custom_settings` (notably `trust_env`) that the session had set. * `Provider._ensure_initialized` flipped `_initialized = True` before the registry was populated, so a concurrent caller could take the fast path and observe `Provider.get(...) is None` for built-in providers. * `Provider.apply_config` is called on every session step and on every workflow `llm.ask`; it unconditionally re-`configure`d the provider and rebuilt `provider._config_models`, racing readers on the mutable list across event loops. * The same `httpx.AsyncClient` could be inherited across event loops (session: uvicorn main loop, workflow: `flocks-workflow-llm-loop`), triggering "got Future attached to a different loop" or silent hangs. Changes ------- `flocks/workflow/llm.py`: * Serialize `_prepare_provider` per-provider via a `threading.Lock` keyed by `provider_id`. * Make reconfigure idempotent: skip `provider.configure(...)` and the client reset when the desired `ProviderConfig` (api_key / base_url / custom_settings) already matches what the provider holds. * Only override `custom_settings['trust_env']` when the user explicitly set `workflow.llm.trust_env` in flocks config. * Track the owning event loop of `provider._client` in a `WeakKeyDictionary` and force a client reset when the workflow loop id differs from the marker, even if the config is unchanged. `flocks/provider/provider.py`: * `_ensure_initialized` uses double-checked locking with a `threading.Lock` and flips `_initialized = True` only after `_load_dynamic_providers()` returns, so concurrent callers never observe a partially populated registry. * `apply_config` compares the desired `ProviderConfig` against the existing `provider._config` and skips `provider.configure(...)` when unchanged. The `_config_models` rebuild is gated by a signature-based equality check and assigned atomically. Tests ----- 13 new tests, all passing: * `tests/workflow/test_llm_provider_isolation.py`: idempotency, trust_env inheritance vs override, api_key / base_url change triggering reconfigure, cross-loop client reset, same-loop client reuse, per-provider lock identity. * `tests/provider/test_provider_lazy_init_thread_safe.py`: races 20 threads through `_ensure_initialized` behind a `Barrier` and asserts every observer sees a fully populated registry. * `tests/provider/test_provider_apply_config_idempotent.py`: no-op path on unchanged config; mutation path on api_key change. Co-authored-by: Cursor <cursoragent@cursor.com>

duguwanglong requested a review from xiami762 May 21, 2026 08:00

duguwanglong force-pushed the fix/workflow-llm-provider-isolation branch from ba0f158 to ca40ece Compare May 21, 2026 08:10

xiami762 approved these changes May 21, 2026

View reviewed changes

xiami762 merged commit 602fa04 into dev May 21, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix/workflow llm provider isolation#300

fix/workflow llm provider isolation#300
xiami762 merged 1 commit into
devfrom
fix/workflow-llm-provider-isolation

duguwanglong commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

duguwanglong commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants