fix: DingTalk auto-reconnect fails after connection timeout by Blueteemo · Pull Request #7924 · AstrBotDevs/AstrBot

Blueteemo · 2026-04-30T17:25:00Z

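Problem description

The DingTalk connection times out every 12 hours, and after the SDK reconnects the bot can no longer reply to messages.

Fix

Improve the run() method by adding retry logic. When the SDK task exits unexpectedly, it is automatically restarted and retried (up to 5 times, with a 10-second interval between attempts).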
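
The retry policy described above (at most 5 restarts, 10 seconds apart) could look roughly like the sketch below. This is not the adapter's actual code: `run_with_retries`, `run_once`, and the injectable `sleep` parameter are hypothetical names used only for illustration; only the two constants come from the description.

```python
import time
from typing import Callable

MAX_RETRIES = 5      # value taken from the PR description
RETRY_DELAY = 10.0   # seconds, value taken from the PR description


def run_with_retries(
    run_once: Callable[[], None],
    max_retries: int = MAX_RETRIES,
    delay: float = RETRY_DELAY,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call run_once until it returns normally or the retries run out.

    Returns the number of retries that were needed. Re-raises the last
    error once max_retries restarts have already been attempted.
    """
    retries = 0
    while True:
        try:
            run_once()
            return retries
        except Exception:
            if retries >= max_retries:
                raise  # budget exhausted, surface the failure
            retries += 1
            sleep(delay)  # fixed delay between attempts
```

Passing `sleep` as a parameter is only a testing convenience, so the delay can be replaced with a recorder instead of actually waiting.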

Related Issue

Fixes #7861

Summary by Sourcery

Improve the DingTalk adapter run loop to make the SDK task resilient to unexpected termination and connection timeouts.

Bug Fixes:

Ensure the DingTalk SDK task is automatically restarted with limited retries when it exits unexpectedly after connection timeouts.

Enhancements:

Add retry counting, logging, and controlled backoff delays to DingTalk client startup to avoid silent failures and repeated crash loops.

sourcery-ai

Hey - I've found 1 issue, and left some high level feedback:

- In `start_client`, the `finally` block always runs and will cancel the task when `shutdown_event` is set even in the ‘normal close’ path where you just returned because the task is still running; this contradicts the comment and likely prevents a clean shutdown. Consider only cancelling when you explicitly want to abort (e.g., on retry/failure paths).
- The `task` variable is referenced in the `finally` block but may not be defined if an exception is raised before `loop.create_task` is called; initialize `task` to `None` before the `try` and guard the `task.cancel()`/`task.done()` checks accordingly.
- Calling `task.cancel()` from a worker thread can be unsafe; instead of directly cancelling the task from `start_client`, use `loop.call_soon_threadsafe(task.cancel)` to interact with the asyncio loop from another thread.
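
The last point can be illustrated with a small standalone sketch. It assumes nothing about the adapter itself: `long_running` is a hypothetical stand-in for the SDK coroutine and `cancel_from_thread` is a made-up helper name. The only library call it relies on is `loop.call_soon_threadsafe`, which schedules the cancel on the loop's own thread.

```python
import asyncio
import threading


async def long_running() -> None:
    # Hypothetical stand-in for a long-lived SDK coroutine.
    await asyncio.sleep(3600)


def cancel_from_thread(loop: asyncio.AbstractEventLoop, task: asyncio.Task) -> None:
    # Task.cancel() is not thread-safe, so hand it to the loop thread
    # instead of calling it directly from this worker thread.
    loop.call_soon_threadsafe(task.cancel)


async def main() -> bool:
    loop = asyncio.get_running_loop()
    task = loop.create_task(long_running())
    worker = threading.Thread(target=cancel_from_thread, args=(loop, task))
    worker.start()
    try:
        await task
    except asyncio.CancelledError:
        pass  # expected: the worker thread requested cancellation
    worker.join()
    return task.cancelled()


cancelled = asyncio.run(main())
```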

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- In `start_client`, the `finally` block always runs and will cancel the task when `shutdown_event` is set even in the ‘normal close’ path where you just returned because the task is still running; this contradicts the comment and likely prevents a clean shutdown—consider only cancelling when you explicitly want to abort (e.g., on retry/failure paths).
- The `task` variable is referenced in the `finally` block but may not be defined if an exception is raised before `loop.create_task` is called; initialize `task` to `None` before the `try` and guard the `task.cancel()`/`task.done()` checks accordingly.
- Calling `task.cancel()` from a worker thread can be unsafe; instead of directly cancelling the task from `start_client`, use `loop.call_soon_threadsafe(task.cancel)` to interact with the asyncio loop from another thread.

## Individual Comments

### Comment 1
<location path="astrbot/core/platform/sources/dingtalk/dingtalk_adapter.py" line_range="756-753" />
<code_context>
+        MAX_RETRIES = 5
+        retry_count = 0
+
         def start_client(loop: asyncio.AbstractEventLoop) -> None:
-            try:
-                self._shutdown_event = threading.Event()
-                task = loop.create_task(self.client_.start())
-                self._shutdown_event.wait()
-                if task.done():
</code_context>
<issue_to_address>
**issue (bug_risk):** Guard against `task` being referenced in `finally` before assignment if `create_task` raises.

If `loop.create_task(self.client_.start())` raises (e.g., due to a closing loop), control flows to `except` and then `finally`, where `task` is referenced without being defined, leading to an `UnboundLocalError` that masks the original failure. Consider initializing `task: asyncio.Task | None = None` before the `try` and guarding with `if task is not None and not task.done():` in `finally`.
</issue_to_address>
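
A minimal reproduction of this failure mode, using a hypothetical `FakeTask` stand-in rather than a real `asyncio.Task`, might look like the following. `start_unguarded` and `start_guarded` are illustrative names, not functions from the adapter.

```python
from typing import Callable, Optional


class FakeTask:
    """Minimal stand-in for asyncio.Task (hypothetical, illustration only)."""

    def __init__(self) -> None:
        self.was_cancelled = False

    def done(self) -> bool:
        return False

    def cancel(self) -> None:
        self.was_cancelled = True


def start_unguarded(create_task: Callable[[], FakeTask]) -> str:
    try:
        task = create_task()
        return "started"
    except RuntimeError:
        return "failed"
    finally:
        # If create_task() raised, `task` was never bound, so this line
        # raises UnboundLocalError and hides the original failure.
        if not task.done():
            task.cancel()


def start_guarded(create_task: Callable[[], FakeTask]) -> str:
    task: Optional[FakeTask] = None  # bound before the try block
    try:
        task = create_task()
        return "started"
    except RuntimeError:
        return "failed"
    finally:
        # Safe even when create_task() raised before assignment.
        if task is not None and not task.done():
            task.cancel()
```

With a `create_task` that raises `RuntimeError`, the unguarded version ends in `UnboundLocalError`, while the guarded version returns `"failed"`.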


gemini-code-assist

Code Review

This pull request implements a retry loop for the DingTalk adapter to handle unexpected client task exits. The review feedback identifies a potential deadlock where the thread could block indefinitely if the task crashes, a risk of UnboundLocalError for the task variable, and the need to handle CancelledError when retrieving task exceptions.
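
The blocking and `CancelledError` concerns can be sketched in isolation as below. This is an assumption-laden illustration, not the adapter's code: `wait_for_exit`, `classify`, and `crash` are hypothetical names, and the worker thread is started with `run_in_executor` purely to keep the example self-contained.

```python
import asyncio
import threading
from typing import Callable, Coroutine


def classify(task: asyncio.Task) -> str:
    """Describe how a finished task ended."""
    try:
        exc = task.exception()
    except asyncio.CancelledError:
        # Task.exception() raises CancelledError for a cancelled task.
        return "cancelled"
    return "ok" if exc is None else f"error: {exc}"


async def crash() -> None:
    # Hypothetical stand-in for an SDK coroutine that dies unexpectedly.
    raise RuntimeError("connection lost")


def wait_for_exit(
    loop: asyncio.AbstractEventLoop,
    coro_fn: Callable[[], Coroutine],
) -> str:
    """Runs in a worker thread: start the task, then block until it ends."""
    woke = threading.Event()
    holder = {}

    def start() -> None:
        task = loop.create_task(coro_fn())
        # Without this callback the thread would block forever if the
        # task crashed, because nothing else sets the event.
        task.add_done_callback(lambda _t: woke.set())
        holder["task"] = task

    loop.call_soon_threadsafe(start)
    if not woke.wait(timeout=5):
        return "timeout"
    return classify(holder["task"])


async def main() -> str:
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, wait_for_exit, loop, crash)


result = asyncio.run(main())
```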

Blueteemo · 2026-04-30T17:37:13Z

@sourcery-ai review

sourcery-ai

Hey - I've found 1 issue, and left some high level feedback:

- The retry logic in `start_client` duplicates the same logging, counter increment, and `time.sleep(10)` handling in both the normal exception path and the `task.exception()` path; consider extracting a small helper to encapsulate the retry step and reduce complexity.
- Using `time.sleep(10)` inside the `start_client` loop blocks the worker thread entirely; if you expect this logic to evolve or run alongside other threaded work, consider using a non-blocking wait (e.g., another `threading.Event().wait(timeout)` or a configurable backoff) to keep it more flexible and testable.
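
One way to read the second suggestion is an interruptible delay built on `threading.Event.wait(timeout)`. The helper name `backoff_wait` and the `stop` event are hypothetical; the sketch only shows that the wait returns early once a stop is requested.

```python
import threading


def backoff_wait(stop: threading.Event, delay: float) -> bool:
    """Wait up to `delay` seconds; return True if a stop was requested.

    Unlike time.sleep(delay), this returns as soon as `stop` is set,
    so a shutdown does not have to sit out the full retry delay.
    """
    return stop.wait(timeout=delay)


stop = threading.Event()
timed_out_result = backoff_wait(stop, 0.01)  # nobody set the event

stop.set()
interrupted_result = backoff_wait(stop, 10)  # returns without waiting 10 s
```

A retry loop could then stop retrying whenever `backoff_wait` returns `True`.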

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- The retry logic in `start_client` duplicates the same logging, counter increment, and `time.sleep(10)` handling in both the normal exception path and the `task.exception()` path; consider extracting a small helper to encapsulate the retry step and reduce complexity.
- Using `time.sleep(10)` inside the `start_client` loop blocks the worker thread entirely; if you expect this logic to evolve or run alongside other threaded work, consider using a non-blocking wait (e.g., another `threading.Event().wait(timeout)` or a configurable backoff) to keep it more flexible and testable.

## Individual Comments

### Comment 1
<location path="astrbot/core/platform/sources/dingtalk/dingtalk_adapter.py" line_range="756" />
<code_context>
+        MAX_RETRIES = 5
+        retry_count = 0
+
         def start_client(loop: asyncio.AbstractEventLoop) -> None:
-            try:
-                self._shutdown_event = threading.Event()
</code_context>
<issue_to_address>
**issue (complexity):** Consider refactoring the retry and shutdown handling into a small helper and a flat retry loop to simplify control flow and remove duplicated logic.

You can keep the same behavior while simplifying control flow, removing duplication, and avoiding `nonlocal` by:

- Classifying a single `client_.start()` run into a small result enum.
- Putting retry policy in a tight loop that uses that result.
- Centralizing the graceful-shutdown check and retry increment/sleep.

For example:

```python
def _is_graceful_shutdown(exc: BaseException) -> bool:
    return "Graceful shutdown" in str(exc)
```

Split “run once” from “retry loop” and keep retry state local:

```python
def _run_client_once(
    loop: asyncio.AbstractEventLoop,
    shutdown_event: threading.Event,
) -> Literal["ok", "graceful_shutdown", "error"]:
    task = loop.create_task(self.client_.start())
    task.add_done_callback(lambda _: shutdown_event.set())
    shutdown_event.wait()

    if not task.done():
        # Normal shutdown path: the thread was woken but the task is still running
        return "ok"

    try:
        exc = task.exception()
    except asyncio.CancelledError:
        logger.info("钉钉适配器 task 已取消")
        return "graceful_shutdown"

    if exc is None:
        return "ok"

    if _is_graceful_shutdown(exc):
        logger.info("钉钉适配器已被关闭")
        return "graceful_shutdown"

    logger.error(f"钉钉 SDK task 异常退出: {exc}")
    return "error"
```

Then the retry loop becomes a single, flat structure with no `nonlocal` and shared retry handling:

```python
def start_client(loop: asyncio.AbstractEventLoop) -> None:
    MAX_RETRIES = 5
    RETRY_DELAY = 10

    for attempt in range(1, MAX_RETRIES + 1):
        shutdown_event = threading.Event()
        task: asyncio.Task | None = None
        try:
            self._shutdown_event = shutdown_event
            result = _run_client_once(loop, shutdown_event)

            if result in ("ok", "graceful_shutdown"):
                return  # normal exit or graceful shutdown

            # result == "error": shared retry logic
            if attempt < MAX_RETRIES:
                logger.info(f"钉钉适配器尝试重连 ({attempt}/{MAX_RETRIES})...")
                time.sleep(RETRY_DELAY)
                continue

            logger.error("钉钉适配器重连失败，已达最大重试次数")
            return

        except Exception as e:
            if _is_graceful_shutdown(e):
                logger.info("钉钉适配器已被关闭")
                return

            logger.error(f"钉钉机器人启动失败: {e}")
            if attempt < MAX_RETRIES:
                logger.info(f"钉钉适配器尝试重连 ({attempt}/{MAX_RETRIES})...")
                time.sleep(RETRY_DELAY)
            else:
                logger.error("钉钉适配器重连失败，已达最大重试次数")
                return
        finally:
            if task is not None and not task.done():
                task.cancel()
```

Key benefits while preserving functionality:

- No `nonlocal` or shared `retry_count`.
- A single, explicit retry loop with one place handling retries.
- Graceful shutdown detection centralized in `_is_graceful_shutdown`.
- Task lifecycle and classification confined to `_run_client_once`, making it easier to reason about all exit paths.
</issue_to_address>


Blueteemo · 2026-05-02T05:07:56Z

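Hello, a note on the retry logic:

- The SDK already has an internal `while True` reconnect loop, so ordinary disconnects are reconnected automatically.
- Our retry is a fallback for cases where the SDK task crashes unexpectedly (for example, from an uncaught exception).
- `MAX_RETRIES = 5` is enough to handle transient problems (such as network fluctuations).
- If the SDK crashes after running for 12 hours, the retry counter is reset (because the previous run succeeded).

The current approach is reasonable and does not need to be changed to unlimited retries.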
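
The counter-reset behaviour described in this comment could be modelled as in the sketch below. How the merged code decides that a run "succeeded" is not shown in this thread, so the `healthy_after` threshold, the `RetryBudget` class, and `record_exit` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RetryBudget:
    max_retries: int = 5          # matches MAX_RETRIES in the comment above
    healthy_after: float = 60.0   # hypothetical threshold, in seconds
    used: int = 0

    def record_exit(self, ran_for: float) -> bool:
        """Record an unexpected exit; return True if a restart is allowed."""
        if ran_for >= self.healthy_after:
            # The previous run lasted long enough to count as a success,
            # so earlier failures no longer count against the budget.
            self.used = 0
        if self.used >= self.max_retries:
            return False
        self.used += 1
        return True
```

Under this model, five quick crashes in a row exhaust the budget, while a crash after a 12-hour run starts again from a fresh count.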

* fix: improve DingTalk adapter error handling in run() method
* fix: add retry logic for DingTalk SDK task unexpected exit
* fix: use task.add_done_callback to wake thread on task completion, handle UnboundLocalError
* refactor: extract retry logic into handle_retry helper function

---------

Co-authored-by: Blueteemo <Blueteemo@users.noreply.github.com>

Blueteemo added 2 commits May 1, 2026 00:54

fix: improve DingTalk adapter error handling in run() method

ea88a3f

fix: add retry logic for DingTalk SDK task unexpected exit

7a5d2e1

auto-assign Bot requested review from Fridemn and advent259141 April 30, 2026 17:25

dosubot Bot added size:M This PR changes 30-99 lines, ignoring generated files. area:platform The bug / feature is about IM platform adapter, such as QQ, Lark, Telegram, WebChat and so on. labels Apr 30, 2026

sourcery-ai Bot reviewed Apr 30, 2026

Comment thread astrbot/core/platform/sources/dingtalk/dingtalk_adapter.py

gemini-code-assist Bot reviewed Apr 30, 2026

Comment thread astrbot/core/platform/sources/dingtalk/dingtalk_adapter.py

fix: use task.add_done_callback to wake thread on task completion, handle UnboundLocalError

b3587ae

sourcery-ai Bot reviewed Apr 30, 2026

Comment thread astrbot/core/platform/sources/dingtalk/dingtalk_adapter.py

refactor: extract retry logic into handle_retry helper function

af75448

Soulter approved these changes May 2, 2026

dosubot Bot added the lgtm This PR has been approved by a maintainer label May 2, 2026

Soulter reviewed May 2, 2026

Comment thread astrbot/core/platform/sources/dingtalk/dingtalk_adapter.py

Soulter merged commit 859ab28 into AstrBotDevs:master May 3, 2026
21 checks passed

Blueteemo deleted the fix/issue-7861-dingtalk-reconnect branch May 3, 2026 07:10

Soulter merged 4 commits into AstrBotDevs:master from Blueteemo:fix/issue-7861-dingtalk-reconnect
