fix: DingTalk auto-reconnect fails after connection timeout #7924
Merged
Soulter merged 4 commits into AstrBotDevs:master on May 3, 2026
Contributor
Hey - I've found 1 issue, and left some high level feedback:
## Overall Comments
- In `start_client`, the `finally` block always runs and will cancel the task when `shutdown_event` is set even in the ‘normal close’ path where you just returned because the task is still running; this contradicts the comment and likely prevents a clean shutdown. Consider only cancelling when you explicitly want to abort (e.g., on retry/failure paths).
- The `task` variable is referenced in the `finally` block but may not be defined if an exception is raised before `loop.create_task` is called; initialize `task` to `None` before the `try` and guard the `task.cancel()`/`task.done()` checks accordingly.
- Calling `task.cancel()` from a worker thread can be unsafe; instead of directly cancelling the task from `start_client`, use `loop.call_soon_threadsafe(task.cancel)` to interact with the asyncio loop from another thread.
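The thread-safety point can be illustrated with a minimal, self-contained sketch. It is not the adapter's code: `cancel_from_thread` and `demo` are hypothetical names, and `asyncio.sleep(3600)` merely stands in for the long-running `client_.start()` coroutine.

```python
import asyncio
import threading


def cancel_from_thread(loop: asyncio.AbstractEventLoop, task: asyncio.Task) -> None:
    """Request cancellation of `task` from a thread that does not run `loop`."""
    # Task.cancel() is not thread-safe, so hand it to the loop's own thread.
    loop.call_soon_threadsafe(task.cancel)


async def demo() -> bool:
    loop = asyncio.get_running_loop()
    # Hypothetical stand-in for the SDK's long-running start() coroutine.
    task = loop.create_task(asyncio.sleep(3600))
    worker = threading.Thread(target=cancel_from_thread, args=(loop, task))
    worker.start()
    try:
        await task
    except asyncio.CancelledError:
        pass  # expected: the worker thread asked the loop to cancel the task
    worker.join()
    return task.cancelled()


cancelled = asyncio.run(demo())
```

The worker thread never touches the task directly; it only schedules a callback, which the loop then runs on its own thread.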
## Individual Comments
### Comment 1
<location path="astrbot/core/platform/sources/dingtalk/dingtalk_adapter.py" line_range="756-753" />
<code_context>
+ MAX_RETRIES = 5
+ retry_count = 0
+
  def start_client(loop: asyncio.AbstractEventLoop) -> None:
-     try:
-         self._shutdown_event = threading.Event()
-         task = loop.create_task(self.client_.start())
-         self._shutdown_event.wait()
-         if task.done():
</code_context>
<issue_to_address>
**issue (bug_risk):** Guard against `task` being referenced in `finally` before assignment if `create_task` raises.
If `loop.create_task(self.client_.start())` raises (e.g., due to a closing loop), control flows to `except` and then `finally`, where `task` is referenced without being defined, leading to an `UnboundLocalError` that masks the original failure. Consider initializing `task: asyncio.Task | None = None` before the `try` and guarding with `if task is not None and not task.done():` in `finally`.
</issue_to_address>
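A small sketch of the failure mode described in this comment, and of the suggested guard. The function names and `failing_create_task` are hypothetical; the latter only simulates `loop.create_task(...)` raising, and the `except` clause of the real code is omitted for brevity.

```python
import asyncio


def unguarded(create_task) -> None:
    try:
        task = create_task()
    finally:
        # If create_task() raised, `task` was never bound, so this line raises
        # UnboundLocalError and replaces the original exception.
        if not task.done():
            task.cancel()


def guarded(create_task) -> None:
    task: asyncio.Task | None = None  # bound before `try`, always readable
    try:
        task = create_task()
    finally:
        if task is not None and not task.done():
            task.cancel()


def failing_create_task() -> asyncio.Task:
    # Hypothetical stand-in for loop.create_task(...) failing on a closing loop.
    raise RuntimeError("Event loop is closed")


def raised_type(fn) -> str:
    """Return the name of the exception type that escapes fn(failing_create_task)."""
    try:
        fn(failing_create_task)
    except Exception as exc:
        return type(exc).__name__
    return "none"
```

With these definitions, `raised_type(unguarded)` reports `UnboundLocalError`, while `raised_type(guarded)` reports the original `RuntimeError`.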
Contributor
Code Review
This pull request implements a retry loop for the DingTalk adapter to handle unexpected client task exits. The review feedback identifies a potential deadlock where the thread could block indefinitely if the task crashes, a risk of UnboundLocalError for the task variable, and the need to handle CancelledError when retrieving task exceptions.
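Two of these points can be sketched in isolation, under the assumption that the waiting thread blocks on a `threading.Event`: a done callback that sets the event when the task ends for any reason, and a result check that tolerates `CancelledError`. The names `classify`, `crash`, and `demo` are illustrative only.

```python
import asyncio
import threading


def classify(task: asyncio.Task) -> str:
    """Describe how a finished task ended without letting CancelledError escape."""
    try:
        exc = task.exception()  # raises CancelledError if the task was cancelled
    except asyncio.CancelledError:
        return "cancelled"
    return "ok" if exc is None else f"error: {exc}"


async def crash() -> None:
    raise ConnectionError("socket closed")  # hypothetical SDK failure


async def demo() -> tuple[bool, str, str]:
    wake = threading.Event()
    crashed = asyncio.ensure_future(crash())
    # Without this callback, a thread blocked in wake.wait() would never
    # learn that the task had died.
    crashed.add_done_callback(lambda _: wake.set())
    cancelled = asyncio.ensure_future(asyncio.sleep(3600))
    cancelled.cancel()
    await asyncio.wait([crashed, cancelled])
    return wake.is_set(), classify(crashed), classify(cancelled)


woke, crashed_result, cancelled_result = asyncio.run(demo())
```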
…ndle UnboundLocalError
Contributor
Author
@sourcery-ai review
Contributor
Hey - I've found 1 issue, and left some high level feedback:
## Overall Comments
- The retry logic in `start_client` duplicates the same logging, counter increment, and `time.sleep(10)` handling in both the normal exception path and the `task.exception()` path; consider extracting a small helper to encapsulate the retry step and reduce complexity.
- Using `time.sleep(10)` inside the `start_client` loop blocks the worker thread entirely; if you expect this logic to evolve or run alongside other threaded work, consider using a non-blocking wait (e.g., another `threading.Event().wait(timeout)` or a configurable backoff) to keep it more flexible and testable.
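As a rough illustration of the second point, an interruptible delay can be built on `threading.Event.wait(timeout)`. The helper name `wait_before_retry` and the `stop` event are assumptions made for this sketch, not names from the adapter.

```python
import threading
import time


def wait_before_retry(stop: threading.Event, delay: float) -> bool:
    """Wait up to `delay` seconds; return True if a stop was requested meanwhile."""
    # Event.wait returns True as soon as the event is set, False on timeout.
    return stop.wait(timeout=delay)


stop = threading.Event()
timed_out_normally = not wait_before_retry(stop, 0.05)  # nobody set the event

threading.Timer(0.05, stop.set).start()  # simulate a shutdown request
started = time.monotonic()
interrupted = wait_before_retry(stop, 10.0)  # returns long before 10 s elapse
elapsed = time.monotonic() - started
```

Unlike `time.sleep(10)`, the wait ends as soon as another thread sets the event, so a shutdown does not have to sit out the full retry delay.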
## Individual Comments
### Comment 1
<location path="astrbot/core/platform/sources/dingtalk/dingtalk_adapter.py" line_range="756" />
<code_context>
+ MAX_RETRIES = 5
+ retry_count = 0
+
  def start_client(loop: asyncio.AbstractEventLoop) -> None:
-     try:
-         self._shutdown_event = threading.Event()
</code_context>
<issue_to_address>
**issue (complexity):** Consider refactoring the retry and shutdown handling into a small helper and a flat retry loop to simplify control flow and remove duplicated logic.
You can keep the same behavior while simplifying control flow, removing duplication, and avoiding `nonlocal` by:
- Classifying a single `client_.start()` run into a small result enum.
- Putting retry policy in a tight loop that uses that result.
- Centralizing the graceful-shutdown check and retry increment/sleep.
For example:
```python
def _is_graceful_shutdown(exc: BaseException) -> bool:
    return "Graceful shutdown" in str(exc)
```
Split “run once” from “retry loop” and keep retry state local:
```python
from typing import Literal

# Relies on `self` and `logger` from the enclosing scope.
def _run_client_once(
    loop: asyncio.AbstractEventLoop,
    shutdown_event: threading.Event,
) -> Literal["ok", "graceful_shutdown", "error"]:
    task = loop.create_task(self.client_.start())
    # Wake the waiting thread as soon as the task finishes for any reason.
    task.add_done_callback(lambda _: shutdown_event.set())
    shutdown_event.wait()
    if not task.done():
        # Normal shutdown path: the thread was woken but the task is still running
        return "ok"
    try:
        exc = task.exception()
    except asyncio.CancelledError:
        logger.info("DingTalk adapter task was cancelled")
        return "graceful_shutdown"
    if exc is None:
        return "ok"
    if _is_graceful_shutdown(exc):
        logger.info("DingTalk adapter has been shut down")
        return "graceful_shutdown"
    logger.error(f"DingTalk SDK task exited unexpectedly: {exc}")
    return "error"
```
Then the retry loop becomes a single, flat structure with no `nonlocal` and shared retry handling:
```python
def start_client(loop: asyncio.AbstractEventLoop) -> None:
    MAX_RETRIES = 5
    RETRY_DELAY = 10  # seconds
    for attempt in range(1, MAX_RETRIES + 1):
        shutdown_event = threading.Event()
        # NOTE: `task` is never reassigned in this function (the task is created
        # inside _run_client_once), so the `finally` guard below is a no-op as written.
        task: asyncio.Task | None = None
        try:
            self._shutdown_event = shutdown_event
            result = _run_client_once(loop, shutdown_event)
            if result in ("ok", "graceful_shutdown"):
                return  # normal exit or graceful shutdown
            # result == "error": shared retry logic
            if attempt < MAX_RETRIES:
                logger.info(f"DingTalk adapter attempting to reconnect ({attempt}/{MAX_RETRIES})...")
                time.sleep(RETRY_DELAY)
                continue
            logger.error("DingTalk adapter reconnect failed: maximum retries reached")
            return
        except Exception as e:
            if _is_graceful_shutdown(e):
                logger.info("DingTalk adapter has been shut down")
                return
            logger.error(f"DingTalk bot failed to start: {e}")
            if attempt < MAX_RETRIES:
                logger.info(f"DingTalk adapter attempting to reconnect ({attempt}/{MAX_RETRIES})...")
                time.sleep(RETRY_DELAY)
            else:
                logger.error("DingTalk adapter reconnect failed: maximum retries reached")
                return
        finally:
            if task is not None and not task.done():
                task.cancel()
```
Key benefits while preserving functionality:
- No `nonlocal` or shared `retry_count`.
- A single, explicit retry loop with one place handling retries.
- Graceful shutdown detection centralized in `_is_graceful_shutdown`.
- Task lifecycle and classification confined to `_run_client_once`, making it easier to reason about all exit paths.
</issue_to_address>
Soulter approved these changes May 2, 2026
Soulter reviewed May 2, 2026
Contributor
Author
Hello, a note on the retry logic:
The current approach is reasonable and does not need to be changed to unlimited retries.
Soulter pushed a commit that referenced this pull request May 3, 2026
* fix: improve DingTalk adapter error handling in run() method
* fix: add retry logic for DingTalk SDK task unexpected exit
* fix: use task.add_done_callback to wake thread on task completion, handle UnboundLocalError
* refactor: extract retry logic into handle_retry helper function
---------
Co-authored-by: Blueteemo <Blueteemo@users.noreply.github.com>
Soulter added a commit that referenced this pull request May 3, 2026
* feat: supports plugin to add skills
* fix tests
* fix: fs tools
* Add tests for plugin skills handling and improve skill management
- Implement test for restricted local member reading plugin skill inventory even if the plugin is inactive.
- Ensure that the skill synchronization process retains built-in skills when local skills are empty, including proper handling of plugin paths.
- Update dashboard tests to verify that plugin details include components when requested.
- Enhance skill metadata enrichment tests to include inactive plugin-provided skills for inventory.
- Add filtering tests for plugin skills based on current configuration, ensuring only allowed plugins are considered and inactive plugins are skipped.
Co-authored-by: Copilot <copilot@github.com>
* fix: handle PPIO platform context-length error messages (#7888)
* fix: bug where the compression algorithm deleted user messages
* perf: improve truncate algo
* fix: improve context length error detection for PPIO platform compatibility
- Extend error detection to handle PPIO's error message format: 'The input is longer than the model's context length'
- Add case-insensitive matching using .lower() for robustness
- Maintain backward compatibility with existing 'maximum context length' check
This fixes the issue where PPIO platform models (e.g., ppio/zai-org/glm-5-turbo) would fail with AgentState.ERROR due to unrecognized context length errors.
---------
Co-authored-by: Soulter <905617992@qq.com>
* fix: support WeChat Customer Service file messages (#7923)
* fix: support WeChat Customer Service file messages
* fix: remove WeCom file message placeholder
* fix(provider): fix Anthropic custom headers and system prompt compatibility (#7587)
* fix(provider): fix Anthropic custom headers and system prompt compatibility
- Pass custom_headers via AsyncAnthropic's `default_headers` parameter instead of creating a separate httpx.AsyncClient. This avoids `isinstance` check failures when multiple httpx installations exist on sys.path (e.g. bundled Python + system Python).
- Use list format for the `system` parameter (`[{"type": "text", ...}]`) instead of a plain string. The list format is supported by the official Anthropic API and is also compatible with third-party API proxies that reject the string format.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix(provider): fix Anthropic custom headers and system prompt compatibility
- Pass custom_headers via AsyncAnthropic's `default_headers` parameter instead of creating a separate httpx.AsyncClient. This avoids `isinstance` check failures when multiple httpx installations exist on sys.path (e.g. bundled Python + system Python).
- Use list format for the `system` parameter (`[{"type": "text", ...}]`) instead of a plain string. The list format is supported by the official Anthropic API and is also compatible with third-party API proxies that reject the string format.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add test unit
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* perf: improve logic of adding models
Co-authored-by: piexian <piexian@users.noreply.github.com>
* chore: remove redundant logger messages and improve log clarity
Co-authored-by: Copilot <copilot@github.com>
* chore: ruff format
* docs: update knowledge base docs
closes: #7962
* fix(#7907): session fault tolerance for send_message_to_user in cron scenarios (#7911)
* fix: session fault tolerance for send_message_to_user in cron scenarios (#7907)
- In proactive scenarios the LLM may pass only session_id rather than the full three-part form; when from_str fails, the first two parts are filled in from current_session.
Co-authored-by: Copilot <copilot@github.com>
* fix: restrict session completion to bare session_id values, to avoid wrongly repairing malformed input that contains colons (#7907)
* feat: add session information to cron job payload
Co-authored-by: Copilot <copilot@github.com>
* fix: improve clarity and consistency of safety mode prompts
Co-authored-by: Copilot <copilot@github.com>
---------
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Weilong Liao <37870767+Soulter@users.noreply.github.com>
Co-authored-by: Soulter <905617992@qq.com>
* perf: tool rendering in conversation page (#7937)
* fix(dashboard): route conversation history tool messages through ToolCallCard
When viewing conversation history, large tool outputs (e.g. a single git log --stat producing tens of KB) caused the browser renderer to freeze. Root cause: formattedMessages mapped every role (including tool / system / _checkpoint) into user/bot bubbles, and bot plain strings went through markstream-vue's MarkdownRender. Single 88KB tool messages plus 88-of-them adding up to ~349KB of synchronous markdown parsing was enough to block the main thread for 5+ seconds.
This patch:
- Indexes tool-role messages by tool_call_id
- Filters formattedMessages to user/assistant only: tool, system and _checkpoint roles no longer render as standalone bubbles
- Converts assistant.tool_calls (OpenAI shape, with tc.name/tc.arguments fallbacks) into the existing tool_call MessagePart, attaching the paired result so MessageList's ToolCallCard renders it (default collapsed, no longer feeds large strings into the markdown renderer)
- Drops empty placeholder plain parts when an assistant message only carries tool_calls
- Sets ts/finished_ts to 0 as a sentinel: ToolCallCard.toolCallDuration returns "" when startTime <= 0, suppressing a misleading "0ms" duration that would otherwise appear because conversation history has no real timing data
Behavior change: tool results are now embedded in their assistant's ToolCallCard.result instead of appearing as separate bot bubbles. This matches the main chat UI's behavior.
Fixes #7929 Refs #7372 #7456
* style(dashboard): use single scrollbar in conversation history preview
ToolCallCard's result/args panes have their own max-height + overflow, which produced a nested scrollbar when nested inside the history preview's already-scrollable .conversation-messages-container. Override those constraints inside the preview only; the outer 500px-bounded container already provides scroll bounds, so a single scrollbar feels cleaner. The main chat UI is unaffected.
---------
Co-authored-by: wanger <wanger@example.com>
* fix: ruff format
* feat: add python tool timeout param (#7953)
* feat: add python tool timeout param
* Update python.py
---------
Co-authored-by: Weilong Liao <37870767+Soulter@users.noreply.github.com>
* fix: DingTalk auto-reconnect fails after connection timeout (#7924)
* fix: improve DingTalk adapter error handling in run() method
* fix: add retry logic for DingTalk SDK task unexpected exit
* fix: use task.add_done_callback to wake thread on task completion, handle UnboundLocalError
* refactor: extract retry logic into handle_retry helper function
---------
Co-authored-by: Blueteemo <Blueteemo@users.noreply.github.com>
---------
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: leonforcode <leonbeyourside01@gmail.com>
Co-authored-by: AstralSolipsism <134063164+AstralSolipsism@users.noreply.github.com>
Co-authored-by: Pink YuDeer <wer00001@outlook.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: piexian <piexian@users.noreply.github.com>
Co-authored-by: NayukiMeko <ChibaNayuki@163.com>
Co-authored-by: wanger <122891289+10knamesmore@users.noreply.github.com>
Co-authored-by: wanger <wanger@example.com>
Co-authored-by: Haoran Xu <3230105281@zju.edu.cn>
Co-authored-by: 千岚之夏 <108566281+Blueteemo@users.noreply.github.com>
Co-authored-by: Blueteemo <Blueteemo@users.noreply.github.com>
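Problem description
The DingTalk connection times out every 12 hours, and after the SDK reconnects the bot can no longer reply to messages.
Fix
Improve the run() method by adding retry logic. When the SDK task exits unexpectedly, it is restarted automatically and retried (up to 5 times, at 10-second intervals).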
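The retry policy described here can be sketched generically as follows. This is not the adapter's `handle_retry` helper, whose signature is not shown on this page: `run_with_retries`, `run_once`, and the injectable `sleep` parameter are hypothetical, and the sketch assumes "up to 5 times" means five retries after the first failed attempt.

```python
import time
from typing import Callable

MAX_RETRIES = 5     # from the PR description
RETRY_DELAY = 10.0  # seconds, from the PR description


def run_with_retries(
    run_once: Callable[[], None],
    max_retries: int = MAX_RETRIES,
    delay: float = RETRY_DELAY,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call run_once until it returns normally; return the number of retries used.

    Re-raises the last exception once max_retries retries have been used up.
    """
    retries = 0
    while True:
        try:
            run_once()
            return retries
        except Exception:
            if retries >= max_retries:
                raise
            retries += 1
            sleep(delay)


# Usage: a client that fails twice and then succeeds (no real sleeping here).
calls = []


def flaky() -> None:
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("timeout")


used = run_with_retries(flaky, sleep=lambda _seconds: None)
```

Passing `sleep` as a parameter keeps the policy testable without waiting through real 10-second delays.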
Related Issue
Fixes #7861
Summary by Sourcery
Improve the DingTalk adapter run loop to make the SDK task resilient to unexpected termination and connection timeouts.
Bug Fixes:
Enhancements: