Skip to content

fix: keep DingTalk stream reconnecting#8610

Merged
Soulter merged 2 commits into
masterfrom
fix/dingtalk-reconnect-backoff
Jun 5, 2026
Merged

fix: keep DingTalk stream reconnecting#8610
Soulter merged 2 commits into
masterfrom
fix/dingtalk-reconnect-backoff

Conversation

@Soulter
Copy link
Copy Markdown
Member

@Soulter Soulter commented Jun 5, 2026

Summary

  • keep the DingTalk stream client reconnecting indefinitely after unexpected SDK task exits
  • add exponential backoff with a 10s minimum delay and 300s cap to avoid tight retry loops
  • reset the backoff after a stable run and preserve graceful shutdown behavior

Closes #8504

Tests

  • uv run pytest tests/test_dingtalk_adapter.py
  • uv run ruff check astrbot/core/platform/sources/dingtalk/dingtalk_adapter.py tests/test_dingtalk_adapter.py

Summary by Sourcery

Keep the DingTalk stream client reconnecting reliably with exponential backoff and stable-run reset logic after unexpected SDK task exits.

Bug Fixes:

  • Prevent DingTalk stream client from permanently stopping after unexpected SDK task termination by retrying indefinitely with delays.
  • Preserve graceful shutdown behaviour while differentiating it from error-triggered restarts.

Enhancements:

  • Introduce configurable exponential backoff for DingTalk reconnects with minimum and maximum delay bounds to avoid tight retry loops.
  • Reset reconnect backoff after a sufficiently long stable run to reduce recovery latency after transient issues.

Tests:

  • Add unit tests to validate DingTalk reconnect backoff behaviour, including exponential growth, minimum delay, and maximum cap.

@dosubot dosubot Bot added size:M This PR changes 30-99 lines, ignoring generated files. area:platform The bug / feature is about IM platform adapter, such as QQ, Lark, Telegram, WebChat and so on. labels Jun 5, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an exponential backoff reconnection strategy for the DingTalk adapter, replacing the previous fixed retry limit, and adds corresponding unit tests. The review feedback highlights a potential issue where using time.sleep(delay) can block the graceful shutdown of the application for up to 5 minutes, and re-creating self._shutdown_event in each loop iteration may cause race conditions. A solution using a thread-safe threading.Event to wait during the delay and monkey-patching self.terminate is suggested to allow immediate exit upon shutdown.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +769 to 782
def handle_retry(error_msg: str, run_seconds: float) -> None:
nonlocal retry_count
logger.error(error_msg)
if run_seconds >= DINGTALK_RECONNECT_STABLE_SECONDS:
retry_count = 0
retry_count += 1
if retry_count < MAX_RETRIES:
logger.info(f"钉钉适配器尝试重连 ({retry_count}/{MAX_RETRIES})...")
time.sleep(RETRY_INTERVAL)
return True
logger.error("钉钉适配器重连失败,已达最大重试次数")
return False

while retry_count < MAX_RETRIES:
delay = _dingtalk_reconnect_delay(retry_count)
logger.info(
f"钉钉适配器将在 {delay} 秒后重连 (第 {retry_count} 次)...",
)
time.sleep(delay)

while True:
task = None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using time.sleep(delay) inside the background thread can block the graceful shutdown of the application for up to 300 seconds (5 minutes) if a reconnect delay is active when terminate() is called. Additionally, re-creating self._shutdown_event = threading.Event() in every iteration of the while True loop can lead to race conditions where terminate() sets an old event, causing the new event to wait indefinitely.

We can resolve both issues elegantly by monkey-patching self.terminate to set a thread-safe terminated event, and using terminated.wait(delay) instead of time.sleep(delay). This allows the background thread to wake up and exit immediately upon shutdown.

            def handle_retry(error_msg: str, run_seconds: float) -> None:
                nonlocal retry_count
                logger.error(error_msg)
                if run_seconds >= DINGTALK_RECONNECT_STABLE_SECONDS:
                    retry_count = 0
                retry_count += 1
                delay = _dingtalk_reconnect_delay(retry_count)
                logger.info(
                    f"钉钉适配器将在 {delay} 秒后重连 (第 {retry_count} 次)...",
                )
                terminated.wait(delay)

            terminated = threading.Event()
            original_terminate = self.terminate
            async def new_terminate():
                terminated.set()
                await original_terminate()
            self.terminate = new_terminate

            while not terminated.is_set():
                task = None

Copy link
Copy Markdown
Contributor

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've left some high level feedback:

  • The handle_retry docstring still describes a boolean return contract, but the function now returns None; updating or removing the docstring will avoid confusion for future readers.
  • Because the backoff can now reach 300 seconds, consider replacing time.sleep(delay) with a wait on the shutdown event (e.g., self._shutdown_event.wait(delay)) so that reconnect sleeps can be interrupted promptly during graceful shutdown.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The `handle_retry` docstring still describes a boolean return contract, but the function now returns `None`; updating or removing the docstring will avoid confusion for future readers.
- Because the backoff can now reach 300 seconds, consider replacing `time.sleep(delay)` with a wait on the shutdown event (e.g., `self._shutdown_event.wait(delay)`) so that reconnect sleeps can be interrupted promptly during graceful shutdown.

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@Soulter Soulter merged commit ef53a93 into master Jun 5, 2026
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:platform The bug / feature is about IM platform adapter, such as QQ, Lark, Telegram, WebChat and so on. size:M This PR changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]钉钉机器人在网络不佳的情况下掉线频繁,但重试只有5次,导致机器人频繁失效,建议可以配置重试次数

1 participant