fix: DingTalk auto-reconnect fails after connection timeout #7924

Merged
Soulter merged 4 commits into AstrBotDevs:master from Blueteemo:fix/issue-7861-dingtalk-reconnect
May 3, 2026

Conversation

@Blueteemo (Contributor) commented Apr 30, 2026

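Problem description

The DingTalk connection times out every 12 hours, and after the SDK reconnects, replies to messages no longer go out.

Fix

Improve the run() method by adding retry logic. When the SDK task exits unexpectedly, it is restarted automatically and retried (up to 5 times, 10 seconds apart).

Related issue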

Fixes #7861

Summary by Sourcery

Improve the DingTalk adapter run loop to make the SDK task resilient to unexpected termination and connection timeouts.

Bug Fixes:

  • Ensure the DingTalk SDK task is automatically restarted with limited retries when it exits unexpectedly after connection timeouts.

Enhancements:

  • Add retry counting, logging, and controlled backoff delays to DingTalk client startup to avoid silent failures and repeated crash loops.

@auto-assign auto-assign Bot requested review from Fridemn and advent259141 April 30, 2026 17:25
@dosubot dosubot Bot added size:M This PR changes 30-99 lines, ignoring generated files. area:platform The bug / feature is about IM platform adapter, such as QQ, Lark, Telegram, WebChat and so on. labels Apr 30, 2026

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've found 1 issue, and left some high level feedback:

  • In start_client, the finally block always runs and will cancel the task when shutdown_event is set even in the ‘normal close’ path where you just returned because the task is still running; this contradicts the comment and likely prevents a clean shutdown—consider only cancelling when you explicitly want to abort (e.g., on retry/failure paths).
  • The task variable is referenced in the finally block but may not be defined if an exception is raised before loop.create_task is called; initialize task to None before the try and guard the task.cancel()/task.done() checks accordingly.
  • Calling task.cancel() from a worker thread can be unsafe; instead of directly cancelling the task from start_client, use loop.call_soon_threadsafe(task.cancel) to interact with the asyncio loop from another thread.
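
The third point can be illustrated with a minimal, self-contained sketch. It is not the adapter's code: `long_running` and `cancel_from_thread` are hypothetical names, and the setup simply assumes a worker thread that needs to cancel a task owned by an event loop running in another thread.

```python
import asyncio
import threading


async def long_running() -> None:
    # Stand-in for a long-lived coroutine such as an SDK client's start().
    await asyncio.sleep(3600)


def cancel_from_thread(loop: asyncio.AbstractEventLoop, task: asyncio.Task) -> None:
    # Task.cancel() is not thread-safe, so hand it to the loop's own thread.
    loop.call_soon_threadsafe(task.cancel)


async def main() -> bool:
    loop = asyncio.get_running_loop()
    task = loop.create_task(long_running())
    # Simulate a worker thread requesting cancellation.
    worker = threading.Thread(target=cancel_from_thread, args=(loop, task))
    worker.start()
    try:
        await task
    except asyncio.CancelledError:
        pass
    worker.join()
    return task.cancelled()


cancelled = asyncio.run(main())
```

Here the worker never touches the task directly; it only schedules the cancellation, and the loop performs it on its next iteration.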
## Individual Comments

### Comment 1
<location path="astrbot/core/platform/sources/dingtalk/dingtalk_adapter.py" line_range="756-753" />
<code_context>
+        MAX_RETRIES = 5
+        retry_count = 0
+
         def start_client(loop: asyncio.AbstractEventLoop) -> None:
-            try:
-                self._shutdown_event = threading.Event()
-                task = loop.create_task(self.client_.start())
-                self._shutdown_event.wait()
-                if task.done():
</code_context>
<issue_to_address>
**issue (bug_risk):** Guard against `task` being referenced in `finally` before assignment if `create_task` raises.

If `loop.create_task(self.client_.start())` raises (e.g., due to a closing loop), control flows to `except` and then `finally`, where `task` is referenced without being defined, leading to an `UnboundLocalError` that masks the original failure. Consider initializing `task: asyncio.Task | None = None` before the `try` and guarding with `if task is not None and not task.done():` in `finally`.
</issue_to_address>
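
A small sketch of the guard described in this comment, under the assumption that `create_task` is called on a loop that may already be closed. `fake_start` and `run_once` are hypothetical stand-ins, not names from the adapter.

```python
import asyncio


async def fake_start() -> None:
    # Hypothetical stand-in for the SDK client's start() coroutine.
    await asyncio.sleep(0)


def run_once(loop: asyncio.AbstractEventLoop) -> str:
    task = None  # bound before the try, so the finally block can always read it
    coro = fake_start()
    try:
        task = loop.create_task(coro)  # raises RuntimeError on a closed loop
        loop.run_until_complete(task)
        return "finished"
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return f"failed: {exc}"
    finally:
        if task is not None and not task.done():
            task.cancel()


closed_loop = asyncio.new_event_loop()
closed_loop.close()
closed_result = run_once(closed_loop)

open_loop = asyncio.new_event_loop()
open_result = run_once(open_loop)
open_loop.close()
```

Without the `task = None` line, the closed-loop call would reach `finally` with `task` unbound and the `UnboundLocalError` would replace the original `RuntimeError`.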


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements a retry loop for the DingTalk adapter to handle unexpected client task exits. The review feedback identifies a potential deadlock where the thread could block indefinitely if the task crashes, a risk of UnboundLocalError for the task variable, and the need to handle CancelledError when retrieving task exceptions.
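
Two of these points can be shown in a short sketch, assuming a waiter blocked on a `threading.Event`: a done callback wakes the waiter when the task ends for any reason, and `Task.exception()` is wrapped because it raises `CancelledError` for a cancelled task. `crashing_start` and `classify` are hypothetical names used only for illustration.

```python
import asyncio
import threading


async def crashing_start() -> None:
    # Hypothetical stand-in for an SDK task that dies unexpectedly.
    raise ConnectionError("connection timed out")


def classify(task: asyncio.Task) -> str:
    # Task.exception() raises CancelledError for a cancelled task
    # instead of returning it, so it needs its own except clause.
    try:
        exc = task.exception()
    except asyncio.CancelledError:
        return "cancelled"
    return "ok" if exc is None else f"error: {exc}"


async def main() -> tuple:
    wake = threading.Event()
    task = asyncio.get_running_loop().create_task(crashing_start())
    # Without this callback a crashed task would never set the event,
    # and a thread blocked in wake.wait() would hang forever.
    task.add_done_callback(lambda _t: wake.set())
    woke = await asyncio.to_thread(wake.wait, 5.0)
    crashed = classify(task)

    sleeper = asyncio.ensure_future(asyncio.sleep(3600))
    sleeper.cancel()
    try:
        await sleeper
    except asyncio.CancelledError:
        pass
    return woke, crashed, classify(sleeper)


woke, crashed, cancelled = asyncio.run(main())
```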

@Blueteemo (Contributor Author) commented

@sourcery-ai review


@sourcery-ai sourcery-ai Bot left a comment

Hey - I've found 1 issue, and left some high level feedback:

  • The retry logic in start_client duplicates the same logging, counter increment, and time.sleep(10) handling in both the normal exception path and the task.exception() path; consider extracting a small helper to encapsulate the retry step and reduce complexity.
  • Using time.sleep(10) inside the start_client loop blocks the worker thread entirely; if you expect this logic to evolve or run alongside other threaded work, consider using a non-blocking wait (e.g., another threading.Event().wait(timeout) or a configurable backoff) to keep it more flexible and testable.
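
The non-blocking wait suggested in the second point might look like the sketch below. It assumes a dedicated stop event that a shutdown path would set; `backoff_wait` and `stop` are hypothetical names, and the timer only simulates a shutdown request arriving during the delay.

```python
import threading
import time


def backoff_wait(stop: threading.Event, delay: float) -> bool:
    """Wait up to `delay` seconds; return True if `stop` was set meanwhile."""
    return stop.wait(timeout=delay)


stop = threading.Event()
# Simulate a shutdown request arriving shortly after the backoff starts.
threading.Timer(0.05, stop.set).start()

started = time.monotonic()
interrupted = backoff_wait(stop, delay=10.0)
elapsed = time.monotonic() - started
```

Unlike `time.sleep(10)`, the wait returns as soon as the event is set, so a retry loop can check the return value and exit instead of sleeping out the full delay.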
## Individual Comments

### Comment 1
<location path="astrbot/core/platform/sources/dingtalk/dingtalk_adapter.py" line_range="756" />
<code_context>
+        MAX_RETRIES = 5
+        retry_count = 0
+
         def start_client(loop: asyncio.AbstractEventLoop) -> None:
-            try:
-                self._shutdown_event = threading.Event()
</code_context>
<issue_to_address>
**issue (complexity):** Consider refactoring the retry and shutdown handling into a small helper and a flat retry loop to simplify control flow and remove duplicated logic.

You can keep the same behavior while simplifying control flow, removing duplication, and avoiding `nonlocal` by:

- Classifying a single `client_.start()` run into a small result enum.
- Putting retry policy in a tight loop that uses that result.
- Centralizing the graceful-shutdown check and retry increment/sleep.

For example:

```python
def _is_graceful_shutdown(exc: BaseException) -> bool:
    return "Graceful shutdown" in str(exc)
```

Split “run once” from “retry loop” and keep retry state local:

```python
def _run_client_once(
    loop: asyncio.AbstractEventLoop,
    shutdown_event: threading.Event,
) -> Literal["ok", "graceful_shutdown", "error"]:
    task = loop.create_task(self.client_.start())
    task.add_done_callback(lambda _: shutdown_event.set())
    shutdown_event.wait()

    if not task.done():
        # Normal shutdown path: the thread was woken but the task is still running
        return "ok"

    try:
        exc = task.exception()
    except asyncio.CancelledError:
        logger.info("钉钉适配器 task 已取消")
        return "graceful_shutdown"

    if exc is None:
        return "ok"

    if _is_graceful_shutdown(exc):
        logger.info("钉钉适配器已被关闭")
        return "graceful_shutdown"

    logger.error(f"钉钉 SDK task 异常退出: {exc}")
    return "error"
```

Then the retry loop becomes a single, flat structure with no `nonlocal` and shared retry handling:

```python
def start_client(loop: asyncio.AbstractEventLoop) -> None:
    MAX_RETRIES = 5
    RETRY_DELAY = 10  # seconds

    for attempt in range(1, MAX_RETRIES + 1):
        shutdown_event = threading.Event()
        self._shutdown_event = shutdown_event
        try:
            result = _run_client_once(loop, shutdown_event)
        except Exception as e:
            if _is_graceful_shutdown(e):
                logger.info("钉钉适配器已被关闭")
                return
            logger.error(f"钉钉机器人启动失败: {e}")
            result = "error"

        if result in ("ok", "graceful_shutdown"):
            return  # normal exit or graceful shutdown

        # result == "error": the single, shared retry path
        if attempt < MAX_RETRIES:
            logger.info(f"钉钉适配器尝试重连 ({attempt}/{MAX_RETRIES})...")
            time.sleep(RETRY_DELAY)
        else:
            logger.error("钉钉适配器重连失败,已达最大重试次数")

Key benefits while preserving functionality:

- No `nonlocal` or shared `retry_count`.
- A single, explicit retry loop with one place handling retries.
- Graceful shutdown detection centralized in `_is_graceful_shutdown`.
- Task lifecycle and classification confined to `_run_client_once`, making it easier to reason about all exit paths.
</issue_to_address>

@dosubot dosubot Bot added the lgtm This PR has been approved by a maintainer label May 2, 2026
@Blueteemo (Contributor Author) commented

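Hello, a few notes on the retry logic:

  1. The SDK already has an internal while True reconnect loop, so ordinary disconnects are reconnected automatically.
  2. Our retry is a fallback for the case where the SDK task crashes unexpectedly (for example, from an uncaught exception).
  3. MAX_RETRIES = 5 is enough to handle transient problems (such as network fluctuations).
  4. If the SDK crashes after running for 12 hours, the retry counter is reset (because the previous run succeeded).

The current approach is reasonable and does not need to be changed to unlimited retries.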
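
The thread does not show how the merged code resets the counter, so the following is only a sketch of the policy described in point 4, under stated assumptions. `supervise`, `healthy_after`, and the injected `sleep` and `clock` parameters are hypothetical; the 60-second threshold for treating a run as successful is an assumption, not a value from the PR.

```python
import time
from typing import Callable


def supervise(
    run_once: Callable[[], None],
    max_retries: int = 5,
    retry_delay: float = 10.0,
    healthy_after: float = 60.0,  # assumption: a run this long counts as successful
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Call run_once until it returns normally; return the number of crashes seen."""
    retries = 0
    crashes = 0
    while True:
        started = clock()
        try:
            run_once()
            return crashes  # clean exit, stop supervising
        except Exception:
            crashes += 1
            if clock() - started >= healthy_after:
                retries = 0  # the previous run was healthy, so count afresh
            retries += 1
            if retries > max_retries:
                raise  # too many consecutive quick failures
            sleep(retry_delay)


# Usage with a fake clock: seven crashes, each after a simulated 12-hour run.
now = [0.0]
outcomes = iter(["crash"] * 7 + ["ok"])


def fake_run() -> None:
    if next(outcomes) == "crash":
        now[0] += 12 * 3600  # pretend the client ran for 12 hours first
        raise ConnectionError("persistent connection is timeout")


crashes = supervise(fake_run, sleep=lambda _s: None, clock=lambda: now[0])
```

Because every simulated crash follows a long healthy run, the counter never exceeds one and the loop keeps restarting; only back-to-back quick failures exhaust the five retries.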

@Soulter Soulter merged commit 859ab28 into AstrBotDevs:master May 3, 2026
21 checks passed
@Blueteemo Blueteemo deleted the fix/issue-7861-dingtalk-reconnect branch May 3, 2026 07:10
Soulter pushed a commit that referenced this pull request May 3, 2026
* fix: improve DingTalk adapter error handling in run() method

* fix: add retry logic for DingTalk SDK task unexpected exit

* fix: use task.add_done_callback to wake thread on task completion, handle UnboundLocalError

* refactor: extract retry logic into handle_retry helper function

---------

Co-authored-by: Blueteemo <Blueteemo@users.noreply.github.com>
Soulter added a commit that referenced this pull request May 3, 2026
* feat: supports plugin to add skills

* fix tests

* fix: fs tools

* Add tests for plugin skills handling and improve skill management

- Implement test for restricted local member reading plugin skill inventory even if the plugin is inactive.
- Ensure that the skill synchronization process retains built-in skills when local skills are empty, including proper handling of plugin paths.
- Update dashboard tests to verify that plugin details include components when requested.
- Enhance skill metadata enrichment tests to include inactive plugin-provided skills for inventory.
- Add filtering tests for plugin skills based on current configuration, ensuring only allowed plugins are considered and inactive plugins are skipped.

Co-authored-by: Copilot <copilot@github.com>

* fix: handle PPIO platform context-length error messages (#7888)

* fix: compression algorithm deleting user messages

* perf: improve truncate algo

* fix: improve context length error detection for PPIO platform compatibility

- Extend error detection to handle PPIO's error message format:
  'The input is longer than the model's context length'
- Add case-insensitive matching using .lower() for robustness
- Maintain backward compatibility with existing 'maximum context length' check

This fixes the issue where PPIO platform models (e.g., ppio/zai-org/glm-5-turbo)
would fail with AgentState.ERROR due to unrecognized context length errors.

---------

Co-authored-by: Soulter <905617992@qq.com>

* fix: support WeChat customer service file messages (#7923)

* fix: support WeChat customer service file messages

* fix: remove WeCom file message placeholder

* fix(provider): fix Anthropic custom headers and system prompt compatibility (#7587)

* fix(provider): fix Anthropic custom headers and system prompt compatibility

- Pass custom_headers via AsyncAnthropic's `default_headers` parameter
  instead of creating a separate httpx.AsyncClient. This avoids
  `isinstance` check failures when multiple httpx installations exist
  on sys.path (e.g. bundled Python + system Python).

- Use list format for the `system` parameter (`[{"type": "text", ...}]`)
  instead of a plain string. The list format is supported by the official
  Anthropic API and is also compatible with third-party API proxies that
  reject the string format.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add test unit

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* perf: improve logic of adding models

Co-authored-by: piexian <piexian@users.noreply.github.com>

* chore: remove redundant logger messages and improve log clarity

Co-authored-by: Copilot <copilot@github.com>

* chore: ruff format

* docs: update knowledge base docs

closes: #7962

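* fix(#7907): session fault tolerance for send_message_to_user in cron scenarios (#7911)

* fix: session fault tolerance for send_message_to_user in cron scenarios (#7907)

- In proactive scenarios the LLM may pass only a session_id rather than the full three-part form;
when from_str fails, the first two parts are filled in from current_session.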

Co-authored-by: Copilot <copilot@github.com>

* fix: restrict session completion to bare session_id values, to avoid wrongly repairing malformed input that contains colons (#7907)

* feat: add session information to cron job payload

Co-authored-by: Copilot <copilot@github.com>

* fix: improve clarity and consistency of safety mode prompts

Co-authored-by: Copilot <copilot@github.com>

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Weilong Liao <37870767+Soulter@users.noreply.github.com>
Co-authored-by: Soulter <905617992@qq.com>

* perf: tool rendering in conversation page (#7937)

* fix(dashboard): route conversation history tool messages through ToolCallCard

When viewing conversation history, large tool outputs (e.g. a single
git log --stat producing tens of KB) caused the browser renderer to
freeze. Root cause: formattedMessages mapped every role (including
tool / system / _checkpoint) into user/bot bubbles, and bot plain
strings went through markstream-vue's MarkdownRender. Single 88KB
tool messages plus 88-of-them adding up to ~349KB of synchronous
markdown parsing was enough to block the main thread for 5+ seconds.

This patch:

- Indexes tool-role messages by tool_call_id
- Filters formattedMessages to user/assistant only — tool, system and
  _checkpoint roles no longer render as standalone bubbles
- Converts assistant.tool_calls (OpenAI shape, with tc.name/tc.arguments
  fallbacks) into the existing tool_call MessagePart, attaching the
  paired result so MessageList's ToolCallCard renders it (default
  collapsed, no longer feeds large strings into the markdown renderer)
- Drops empty placeholder plain parts when an assistant message only
  carries tool_calls
- Sets ts/finished_ts to 0 as a sentinel: ToolCallCard.toolCallDuration
  returns "" when startTime <= 0, suppressing a misleading "0ms"
  duration that would otherwise appear because conversation history
  has no real timing data

Behavior change: tool results are now embedded in their assistant's
ToolCallCard.result instead of appearing as separate bot bubbles.
This matches the main chat UI's behavior.

Fixes #7929
Refs #7372 #7456

* style(dashboard): use single scrollbar in conversation history preview

ToolCallCard's result/args panes have their own max-height + overflow,
which produced a nested scrollbar when nested inside the history
preview's already-scrollable .conversation-messages-container. Override
those constraints inside the preview only — the outer 500px-bounded
container already provides scroll bounds, so a single scrollbar feels
cleaner. The main chat UI is unaffected.

---------

Co-authored-by: wanger <wanger@example.com>

* fix: ruff format

* feat: add python tool timeout param (#7953)

* feat: add python tool timeout param

* Update python.py

---------

Co-authored-by: Weilong Liao <37870767+Soulter@users.noreply.github.com>

* fix: DingTalk auto-reconnect fails after connection timeout (#7924)

* fix: improve DingTalk adapter error handling in run() method

* fix: add retry logic for DingTalk SDK task unexpected exit

* fix: use task.add_done_callback to wake thread on task completion, handle UnboundLocalError

* refactor: extract retry logic into handle_retry helper function

---------

Co-authored-by: Blueteemo <Blueteemo@users.noreply.github.com>

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: leonforcode <leonbeyourside01@gmail.com>
Co-authored-by: AstralSolipsism <134063164+AstralSolipsism@users.noreply.github.com>
Co-authored-by: Pink YuDeer <wer00001@outlook.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: piexian <piexian@users.noreply.github.com>
Co-authored-by: NayukiMeko <ChibaNayuki@163.com>
Co-authored-by: wanger <122891289+10knamesmore@users.noreply.github.com>
Co-authored-by: wanger <wanger@example.com>
Co-authored-by: Haoran Xu <3230105281@zju.edu.cn>
Co-authored-by: 千岚之夏 <108566281+Blueteemo@users.noreply.github.com>
Co-authored-by: Blueteemo <Blueteemo@users.noreply.github.com>
Labels

area:platform The bug / feature is about IM platform adapter, such as QQ, Lark, Telegram, WebChat and so on. lgtm This PR has been approved by a maintainer size:M This PR changes 30-99 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

persistent connection is timeout

2 participants