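Overview
PR #57 in agentrun-sdk v0.0.22 (the fix for Issue #55) introduced a Playwright connection cache (_get_playwright + self._playwright_sync). The cache implicitly assumes that every tool call runs on the same OS thread. LangGraph's ToolNode, however, creates a new ContextThreadPoolExecutor (thread pool) on every invocation, so consecutive browser tool calls actually run on different, short-lived worker threads.
Result: after the first browser tool call succeeds, the greenlet of the cached Playwright instance remains bound to a worker thread that has already terminated. Every subsequent call fails with "cannot switch to a different thread (which happens to have exited)".
Affected versions: v0.0.21 is not affected (it creates a new Playwright instance on every call); v0.0.22 and v0.0.23 are both affected.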
| Environment | Value |
|---|---|
| agentrun-sdk | 0.0.22 |
| Python | 3.12 |
| LangGraph | 1.0.6 (upgrading to v1.0.10 confirmed not to help) |
| LangChain | 1.2.4 |
| Playwright | 1.57.0 |
| greenlet | 3.3.0 |
| OS | macOS (darwin 23.6.0) |
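Bug 1 (P0, root cause): the _get_playwright cache is unaware of changes in the calling thread
Root cause
PR #57 changed every tool method from "create a new Playwright instance on each call" to "reuse a cached instance":

    # v0.0.21: created and destroyed on every call
    def browser_navigate(self, url, ...):
        def inner(sb):
            with sb.sync_playwright() as p:  # ← new instance each time
                response = p.goto(url, ...)
                return {...}
        return self._run_in_sandbox(inner)

    # v0.0.22: cached and reused
    def browser_navigate(self, url, ...):
        def inner(sb):
            p = self._get_playwright(sb)  # ← created on first call, reused afterwards
            response = p.goto(url, ...)
            return {...}
        return self._run_in_sandbox(inner)

On its first call, _get_playwright creates a BrowserPlaywrightSync instance and caches it in self._playwright_sync. Internally, Playwright's sync_playwright().start() creates a MainGreenlet, which is bound to the OS thread that created it (a constraint of the Python greenlet library: a greenlet can only be switched to on the thread where it was created).
When an external caller (such as LangGraph's ToolNode) issues later tool calls from a different thread, _get_playwright returns the cached instance, but its internal greenlet is still bound to the old thread, which no longer exists. The result is a greenlet.error.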
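The failure mode can be reproduced without Playwright or the SDK. The sketch below uses only the standard library: ThreadBoundSession is a hypothetical stand-in that, like a greenlet, may only be used on the thread that created it, and CachingToolSet is a hypothetical stand-in for the "create once, reuse forever" pattern. Neither name exists in agentrun-sdk or Playwright, and the error text is only modelled on the real one.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ThreadBoundSession:
    """Hypothetical stand-in for a Playwright sync session (not a real API).

    Like a greenlet, it may only be used on the thread that created it.
    """

    def __init__(self):
        self._owner = threading.current_thread()

    def call(self, action):
        if threading.current_thread() is not self._owner:
            state = "alive" if self._owner.is_alive() else "exited"
            raise RuntimeError(f"cannot switch to a different thread ({state})")
        return f"{action}: ok"


class CachingToolSet:
    """Hypothetical stand-in for the caching pattern: create once, reuse forever."""

    def __init__(self):
        self._session = None

    def run(self, action):
        if self._session is None:  # no check of which thread is calling
            self._session = ThreadBoundSession()
        try:
            return self._session.call(action)
        except RuntimeError as exc:
            return f"{action}: error: {exc}"


def call_in_fresh_executor(toolset, action):
    """Run one tool call on a brand-new single-worker pool."""
    with ThreadPoolExecutor(max_workers=1) as executor:
        return executor.submit(toolset.run, action).result()


toolset = CachingToolSet()
first = call_in_fresh_executor(toolset, "navigate")   # session created on worker 1
second = call_in_fresh_executor(toolset, "snapshot")  # worker 1 has already exited
print(first)
print(second)
```

The first call succeeds because the session is created and used on the same worker. The second call fails because the cached session still points at a worker that was joined when the first pool shut down.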
LangGraph ToolNode threading model
LangGraph's ToolNode._func() creates a new ContextThreadPoolExecutor on every invocation, and tools run on the pool's worker threads:

    # langgraph/prebuilt/tool_node.py
    def _func(self, input, config, runtime):
        # ...
        with get_executor_for_config(config) as executor:  # new executor every time
            outputs = list(
                executor.map(self._run_one, tool_calls, ...)
            )
        # executor.__exit__() → shutdown(wait=True) → worker threads terminate

get_executor_for_config (from langchain_core.runnables.config) returns a new ContextThreadPoolExecutor each time:

    @contextmanager
    def get_executor_for_config(config):
        with ContextThreadPoolExecutor(max_workers=config.get("max_concurrency")) as executor:
            yield executor

ToolNode in the latest LangGraph release, v1.0.10, has been confirmed to use exactly the same pattern. The changes from v1.0.6 to v1.0.10 are concentrated in the CLI, documentation, and dependency updates; the threading model is unchanged. Upgrading LangGraph does not resolve this issue.
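The thread lifetimes described above can be observed directly. This is a minimal sketch that uses the standard library's ThreadPoolExecutor as a dependency-free stand-in for ContextThreadPoolExecutor; worker_thread is a hypothetical helper introduced only for this demonstration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def worker_thread(_):
    """Return the Thread object that executed this task."""
    return threading.current_thread()


# A new pool per call, closed before the next call starts.
with ThreadPoolExecutor(max_workers=1) as executor:
    thread_b = list(executor.map(worker_thread, [None]))[0]
with ThreadPoolExecutor(max_workers=1) as executor:
    thread_c = list(executor.map(worker_thread, [None]))[0]

print(thread_b is thread_c)  # are the two calls served by the same worker?
print(thread_b.is_alive())   # is the first worker still running?

# Control: one pool with a single worker serves both calls.
with ThreadPoolExecutor(max_workers=1) as executor:
    first = list(executor.map(worker_thread, [None]))[0]
    second = list(executor.map(worker_thread, [None]))[0]

print(first is second)
```

Thread objects are compared rather than numeric identifiers, because an identifier can be reused once a thread has exited.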
Conflict timeline

    LangGraph agent execution flow:

    ToolNode._func() [call #1: browser_navigate]
      └─ ContextThreadPoolExecutor-1 created
           └─ Thread-B runs browser_navigate
                └─ _get_playwright(sb)
                     └─ first creation → sync_playwright().start()
                          └─ MainGreenlet created, bound to Thread-B ✓
                          └─ cached in self._playwright_sync
                └─ p.goto("baidu.com") → success ✅
      └─ executor.__exit__() → Thread-B terminates 💀

    LLM call (~2s)

    ToolNode._func() [call #2: browser_snapshot]
      └─ ContextThreadPoolExecutor-2 created
           └─ Thread-C runs browser_snapshot
                └─ _get_playwright(sb)
                     └─ self._playwright_sync is not None → cached instance returned
                └─ p.evaluate(...)
                     └─ dispatcher_fiber.switch()
                          └─ target greenlet is bound to Thread-B, but Thread-B has exited
                               └─ 💥 "cannot switch to a different thread (which happens to have exited)"

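Why v0.0.21 does not trigger the bug
In v0.0.21, every tool call uses with sb.sync_playwright() as p: to create and destroy its own Playwright instance. Even though ToolNode runs each call on a different thread, each call creates a new greenlet bound to the current thread and does not depend on any earlier thread.
The "intermittent failures" reported by v0.0.21 users are a separate issue (the transient CDP errors described in Issue #55, combined with error handling that did not distinguish error types and therefore rebuilt the sandbox unnecessarily). They are not the greenlet thread-binding problem.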
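Steps to reproduce
- Create a BrowserToolSet and convert it to LangChain tools with to_langchain()
- Build a LangGraph agent with create_agent
- Send the agent an instruction that requires the browser (e.g. "open baidu.com and take a snapshot")
- The agent's first call to browser_navigate → succeeds
- The agent's second call to any browser tool → always fails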
Minimal reproduction code

    from agentrun.integration.builtin.sandbox import BrowserToolSet
    from langchain_core.runnables.config import ContextThreadPoolExecutor

    toolset = BrowserToolSet(template_name="<your-template>", config=None)
    tools = toolset.to_langchain()
    tools_map = {t.name: t for t in tools}

    # Simulate ToolNode: every tool call uses a new ContextThreadPoolExecutor

    # Step 1: first executor (the tool runs on Thread-B)
    with ContextThreadPoolExecutor(max_workers=1) as executor:
        r = list(executor.map(
            lambda _: tools_map["browser_navigate"].invoke({"url": "https://www.baidu.com"}),
            [None]
        ))[0]
    print(r)  # {'url': 'https://www.baidu.com', 'success': True, 'status': 200}
    # executor closes → Thread-B terminates → the cached greenlet's thread is dead

    # Step 2: second executor (the tool runs on Thread-C)
    with ContextThreadPoolExecutor(max_workers=1) as executor:
        r = list(executor.map(
            lambda _: tools_map["browser_snapshot"].invoke({}),
            [None]
        ))[0]
    print(r)  # {'error': 'cannot switch to a different thread (which happens to have exited)'}

Control experiment (same executor → works)

    # The worker thread is reused, so the greenlet's thread binding is unchanged → works
    with ContextThreadPoolExecutor(max_workers=1) as executor:
        r1 = list(executor.map(
            lambda _: tools_map["browser_navigate"].invoke({"url": "https://www.baidu.com"}),
            [None]
        ))[0]
        print(r1)  # {'success': True} ✅
        r2 = list(executor.map(
            lambda _: tools_map["browser_snapshot"].invoke({}),
            [None]
        ))[0]
        print(r2)  # {'html': '...'} ✅

Production evidence
The following data was queried from the project database tables tool_call_record and llm_call_record (dev environment, target site baidu.com):
| # | Tool | Duration (ms) | Timestamp |
|---|---|---|---|
| 1 | health | 4,119 | 19:11:25 |
| 2 | browser_navigate | 2,703 | 19:11:29 |
| 3 | browser_snapshot | 4 | 19:11:31 |
| 4 | browser_snapshot | 8 | 19:11:32 |
| 5 | browser_navigate | 4 | 19:11:34 |
- browser_navigate (#2) took 2,703 ms (including sandbox creation) → success (baidu.com, status 200)
- browser_snapshot (#3) took 4 ms → immediate failure (dead greenlet)
- The ~1.7 s gap between the two calls is the LLM call time, during which Executor-1 was shut down and Thread-B terminated
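The pattern in the table can also be checked mechanically. This is a minimal sketch, assuming the records are available as (index, tool, duration_ms) tuples and that a browser call finishing in under 50 ms cannot have completed a real browser round trip. Both the record shape and the threshold are illustrative assumptions, not the project's schema.

```python
# Rows copied from the table above: (index, tool, duration in ms).
records = [
    (1, "health", 4119),
    (2, "browser_navigate", 2703),
    (3, "browser_snapshot", 4),
    (4, "browser_snapshot", 8),
    (5, "browser_navigate", 4),
]

FAST_FAIL_MS = 50  # assumed threshold, chosen only for illustration


def suspicious_calls(rows, threshold_ms=FAST_FAIL_MS):
    """Return browser tool calls that finished faster than the threshold."""
    return [
        (index, tool, duration)
        for index, tool, duration in rows
        if tool.startswith("browser_") and duration < threshold_ms
    ]


for index, tool, duration in suspicious_calls(records):
    print(f"#{index} {tool}: {duration} ms")
```

A short duration is only a hint. The actual result column of each record would be needed to confirm that a given call failed.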
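Hypotheses ruled out
Systematic testing ruled out the following hypotheses:
| Hypothesis | Test method | Result |
|---|---|---|
| asyncio event loop conflict | anyio.to_thread.run_sync | ✅ Works; not the root cause |
| FastAPI BackgroundTask | anyio + HTTP gap | ✅ Works; not the root cause |
| Triggered by page timeout | baidu.com (fast-loading) | ❌ Still fails; ruled out |
| Wrong URL / DNS failure | Nonexistent domain / connection refused | ✅ Does not trigger; recoverable |
| PostgresSaver checkpointer | psycopg + checkpoint writes | ✅ Works; not the root cause |
| Direct LLM call | AzureChatOpenAI.invoke() | ✅ Works; not the root cause |
| Checkpointer type | MemorySaver / None | ❌ Still fails; ruled out |
| Thread switch across executors | Simulated ToolNode behavior | ❌ Confirmed to trigger |
| Same executor | Control experiment | ✅ Works |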
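Bug 2 (P1, independent): wait_until="load" causes guaranteed timeouts on certain sites
Problem description
The browser_navigate tool defaults to wait_until="load". On some sites, external subresources cannot be loaded from the sandbox network, so the load event never fires.
Diagnostic data from inside the sandbox
| Site | commit | domcontentloaded | load |
|---|---|---|---|
| Baidu | 0.16s | 0.70s | 0.73s |
| WuXi AppTec | 0.45s | 0.65s | 90s timeout ❌ |
| Bing | 0.43s | 0.40s | 0.44s |
For WuXi AppTec, the Navigation Timing API reports dom_complete = -1ms and load_event = -1ms (never fired), while ttfb = 128ms and dom_interactive = 327ms are completely normal.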
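Relationship to Bug 1
Bug 2 aggravates Bug 1 but is not its root cause:
- Even without a timeout (e.g. baidu.com), Bug 1 always triggers
- But Bug 2 makes the very first browser_navigate call time out and fail (e.g. wuxiapptec.cn), so the agent cannot complete even its first navigation
Recommendation
Change the default wait_until of browser_navigate from "load" to "domcontentloaded".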
Bug 3 (P2, related): _run_in_sandbox error handling does not cover greenlet death
Problem
When Bug 1 triggers, greenlet.error inherits from Exception rather than PlaywrightError, so it falls into the catch-all branch of _run_in_sandbox, which never calls _reset_playwright() to clear the dead cached instance:

    def _run_in_sandbox(self, callback):
        try:
            return callback(sb)
        except PlaywrightError as e:
            if self._is_infrastructure_error(str(e)):
                self._reset_playwright()  # ← resets
            else:
                return {"error": f"{e!s}"}  # ← does not reset
        except Exception as e:
            return {"error": f"{e!s}"}  # ← greenlet.error lands here, no reset!

This means that even if the external caller catches the error and retries, the dead Playwright instance stays cached and every later call keeps failing.
Recommendation
In the except Exception branch, detect greenlet death and reset:

    except Exception as e:
        if "cannot switch to" in str(e) or isinstance(e, greenlet.error):
            self._reset_playwright()
            self.sandbox = None
        return {"error": f"{e!s}"}

This fix is a fallback defense for Bug 1. Even once _get_playwright gains thread detection (the fix for Bug 1), this branch should still be covered for robustness.
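The effect of resetting in the catch-all branch can be shown with a simplified model. Everything in this sketch is hypothetical: FakeGreenletError stands in for greenlet.error so that no dependency is needed, and ToolSetSketch models only the cache and the error branch, not the real _run_in_sandbox.

```python
class FakeGreenletError(Exception):
    """Hypothetical stand-in for greenlet.error so the sketch needs no dependency."""


class ToolSetSketch:
    """Hypothetical, simplified model of the cache plus catch-all error handling."""

    def __init__(self):
        self._session = None
        self.created = 0

    def _get_session(self):
        if self._session is None:
            self.created += 1
            # The first session simulates one whose greenlet has died.
            self._session = {"id": self.created, "dead": self.created == 1}
        return self._session

    def _reset_session(self):
        self._session = None

    def run(self, action):
        try:
            session = self._get_session()
            if session["dead"]:
                raise FakeGreenletError("cannot switch to a different thread")
            return {"result": f"{action} ok", "session": session["id"]}
        except Exception as exc:
            # Check the type first, the message second, then drop the dead instance.
            if isinstance(exc, FakeGreenletError) or "cannot switch to" in str(exc):
                self._reset_session()
            return {"error": str(exc)}


tools = ToolSetSketch()
first = tools.run("snapshot")  # fails, but clears the cached session
retry = tools.run("snapshot")  # a fresh session is created on retry
print(first)
print(retry)
```

Without the reset, the retry would pick up the same dead session and fail in the same way, which is the behavior described for the current catch-all branch.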
Proposed fixes
Option 1 (SDK side, strongly recommended, smallest change): detect thread changes in _get_playwright
Keep the caching optimization from PR #57 and make it thread-aware. When the calling thread is found to have changed, rebuild the Playwright instance automatically:

    def _get_playwright(self, sb):
        current_thread_id = threading.current_thread().ident
        if (self._playwright_sync is not None
                and self._playwright_thread_id != current_thread_id):
            logger.debug(
                "Thread changed from %s to %s, recreating Playwright connection",
                self._playwright_thread_id, current_thread_id
            )
            self._reset_playwright()
        if self._playwright_sync is None:
            with self.lock:
                if self._playwright_sync is None:
                    playwright_sync = sb.sync_playwright()
                    playwright_sync.open()
                    self._playwright_sync = playwright_sync
                    self._playwright_thread_id = current_thread_id
                    return self._playwright_sync
        return self._playwright_sync

Advantages:
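- Minimal change (only a few lines of thread detection added to _get_playwright)
- Fully backward compatible with all of the fixes in PR #57 (error classification, caching)
- Consecutive calls on the same thread still reuse the connection
- The instance is rebuilt only when a thread switch actually occurs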
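A generic, dependency-free sketch of the thread-aware cache idea. It deliberately differs from the proposal in one detail: it keeps the owning Thread object and compares by identity instead of comparing ident values, because the Python documentation notes that thread identifiers may be recycled after a thread exits. ThreadAwareCache, make_resource, and the close callback are hypothetical names standing in for opening and resetting a Playwright connection.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ThreadAwareCache:
    """Cache one resource and rebuild it when the calling thread changes.

    Hypothetical helper, not part of agentrun-sdk.
    """

    def __init__(self, factory, close=lambda resource: None):
        self._factory = factory
        self._close = close
        self._lock = threading.Lock()
        self._resource = None
        self._owner = None  # Thread object that created the cached resource

    def get(self):
        current = threading.current_thread()
        with self._lock:
            if self._resource is not None and self._owner is not current:
                self._close(self._resource)  # stale: bound to another thread
                self._resource = None
            if self._resource is None:
                self._resource = self._factory()
                self._owner = current
            return self._resource


created = []


def make_resource():
    """Record which thread built the resource, then return a dummy object."""
    created.append(threading.current_thread().name)
    return object()


cache = ThreadAwareCache(make_resource)

# Two separate pools: the second call runs on a new thread, so it rebuilds.
for _ in range(2):
    with ThreadPoolExecutor(max_workers=1) as executor:
        executor.submit(cache.get).result()
rebuilds_across_pools = len(created)

# One pool, one worker: the second call reuses what the first call built.
with ThreadPoolExecutor(max_workers=1) as executor:
    a = executor.submit(cache.get).result()
    b = executor.submit(cache.get).result()

print(rebuilds_across_pools, len(created), a is b)
```

Whether a stale Playwright instance can be closed cleanly from a thread other than its creator is not covered by this sketch and would need to be verified against the real reset logic.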
Option 2 (LangGraph side): have ToolNode reuse its executor
Modify ToolNode._func() so that it reuses an executor instead of creating a new one on every call:

    class ToolNode:
        def __init__(self, tools):
            self._executor = None

        def _func(self, input, config, runtime):
            if self._executor is None:
                self._executor = ContextThreadPoolExecutor(
                    max_workers=config.get("max_concurrency")
                )
            outputs = list(
                self._executor.map(self._run_one, tool_calls, input_types, tool_runtimes)
            )

Option 3 (project-side workaround): a custom ToolNode that avoids the thread switch
Until the SDK or LangGraph is fixed, a project can define a custom ToolNode that runs tools directly on the current thread:

    from langgraph.prebuilt.tool_node import ToolNode

    class DirectToolNode(ToolNode):
        """Run tools directly on the current thread, avoiding the ThreadPoolExecutor thread switch."""
        def _func(self, input, config, runtime):
            tool_calls, input_type = self._parse_input(input)
            outputs = [
                self._run_one(call, input_type, ToolRuntime(...))
                for call in tool_calls
            ]
            return self._combine_tool_outputs(outputs, input_type)

Impact
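- Bug 1 (P0): affects every scenario that uses BrowserToolSet through a LangGraph agent. After the first browser operation, all later operations fail regardless of the target site. BrowserToolSet is completely unusable inside a LangGraph agent.
- Bug 2 (P1): affects sandbox access to sites that include unreachable external resources (many corporate and e-commerce sites)
- Bug 3 (P2): after greenlet death the cached instance is not cleared, so even external retries cannot recover
- Affected versions: v0.0.21 is not affected by Bug 1 (it creates a new Playwright instance on every call); v0.0.22 and v0.0.23 are both affected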
Appendix
A. Relevant source locations
AgentRun SDK (BrowserToolSet):
- agentrun/integration/builtin/sandbox.py
  - BrowserToolSet._get_playwright: Playwright instance cache (no thread check; root cause of Bug 1)
  - BrowserToolSet._run_in_sandbox: error handling logic (Bug 3)
  - BrowserToolSet._reset_playwright: Playwright reset
- agentrun/sandbox/api/playwright_sync.py
  - BrowserPlaywrightSync.open(): calls sync_playwright().start(), which creates the greenlet
LangGraph (ToolNode):
- langgraph/prebuilt/tool_node.py: ToolNode._func() uses get_executor_for_config() to create the thread pool
- langchain_core/runnables/config.py: ContextThreadPoolExecutor (a new instance is created each time)
Playwright:
- playwright/sync_api/_context_manager.py: MainGreenlet is created and bound to the current thread
- playwright/_impl/_sync_base.py: the _sync() method uses dispatcher_fiber.switch()