feat: add deduplication for WeChat kefu text messages within 15 seconds#7788
feat: add deduplication for WeChat kefu text messages within 15 seconds#7788
Conversation
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- The
_is_duplicate_wechat_kf_text_messagemethod does a full scan of_wechat_kf_seen_text_messageson every text message to evict expired keys; if this dict can grow large in production, consider a lighter-weight eviction strategy (e.g., periodic cleanup, capped size, or using an ordered structure). - Right now deduplication is based on
session_idandtext.strip()only; if messages differing only in trivial ways (e.g., whitespace changes or common punctuation variants) should also be treated as duplicates, consider centralizing a stronger normalization function for the text content before computing the dedup key.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The `_is_duplicate_wechat_kf_text_message` method does a full scan of `_wechat_kf_seen_text_messages` on every text message to evict expired keys; if this dict can grow large in production, consider a lighter-weight eviction strategy (e.g., periodic cleanup, capped size, or using an ordered structure).
- Right now deduplication is based on `session_id` and `text.strip()` only; if messages differing only in trivial ways (e.g., whitespace changes or common punctuation variants) should also be treated as duplicates, consider centralizing a stronger normalization function for the text content before computing the dedup key.
## Individual Comments
### Comment 1
<location path="astrbot/core/platform/sources/wecom/wecom_adapter.py" line_range="420-421" />
<code_context>
if msgtype == "text":
text = msg.get("text", {}).get("content", "").strip()
+ if self._is_duplicate_wechat_kf_text_message(abm.session_id, text):
+ logger.debug(
+ "忽略 15 秒内重复微信客服文本消息 session_id=%s text=%s",
+ abm.session_id,
+ text,
</code_context>
<issue_to_address>
**suggestion:** Avoid hard-coding the TTL value in the log message.
The TTL is defined as `WECHAT_KF_TEXT_CONTENT_DEDUP_TTL_SECONDS = 15`, but the log message hard-codes `15 秒`. If the TTL changes, the log will be inaccurate.
Please format the message using the constant (e.g., `%d` or an f-string) so the log always reflects the configured TTL.
Suggested implementation:
```python
if self._is_duplicate_wechat_kf_text_message(abm.session_id, text):
logger.debug(
"忽略 %d 秒内重复微信客服文本消息 session_id=%s text=%s",
WECHAT_KF_TEXT_CONTENT_DEDUP_TTL_SECONDS,
abm.session_id,
text,
)
```
If `WECHAT_KF_TEXT_CONTENT_DEDUP_TTL_SECONDS` is not defined in this module, make sure it is either:
1. Imported at the top of `wecom_adapter.py` from the module where it is defined, or
2. Defined in this module if that is where it logically belongs.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| logger.debug( | ||
| "忽略 15 秒内重复微信客服文本消息 session_id=%s text=%s", |
There was a problem hiding this comment.
suggestion: Avoid hard-coding the TTL value in the log message.
The TTL is defined as WECHAT_KF_TEXT_CONTENT_DEDUP_TTL_SECONDS = 15, but the log message hard-codes 15 秒. If the TTL changes, the log will be inaccurate.
Please format the message using the constant (e.g., %d or an f-string) so the log always reflects the configured TTL.
Suggested implementation:
if self._is_duplicate_wechat_kf_text_message(abm.session_id, text):
logger.debug(
"忽略 %d 秒内重复微信客服文本消息 session_id=%s text=%s",
WECHAT_KF_TEXT_CONTENT_DEDUP_TTL_SECONDS,
abm.session_id,
text,
)If WECHAT_KF_TEXT_CONTENT_DEDUP_TTL_SECONDS is not defined in this module, make sure it is either:
- Imported at the top of
wecom_adapter.pyfrom the module where it is defined, or - Defined in this module if that is where it logically belongs.
There was a problem hiding this comment.
Code Review
This pull request implements a deduplication mechanism for WeChat KF text messages using a 15-second TTL to prevent redundant processing. Feedback includes suggestions to use a tuple for the deduplication key to avoid collisions, updating the type hint accordingly, and addressing potential performance issues in the cleanup logic which currently iterates over the entire cache for every message. Additionally, the reviewer requested unit tests to verify the new functionality.
|
|
||
| self.server = WecomServer(self._event_queue, self.config) | ||
| self.agent_id: str | None = None | ||
| self._wechat_kf_seen_text_messages: dict[str, float] = {} |
|
|
||
| self.server.callback = callback | ||
|
|
||
| def _is_duplicate_wechat_kf_text_message(self, session_id: str, text: str) -> bool: |
| expired_keys = [ | ||
| key | ||
| for key, expires_at in self._wechat_kf_seen_text_messages.items() | ||
| if expires_at <= now | ||
| ] | ||
| for key in expired_keys: | ||
| self._wechat_kf_seen_text_messages.pop(key, None) |
There was a problem hiding this comment.
The current cleanup logic iterates through the entire _wechat_kf_seen_text_messages dictionary on every incoming text message. This is an O(N) operation that runs on the main event loop. While the 15-second TTL likely keeps the dictionary small, a high volume of unique messages could lead to performance degradation and block the event loop. Consider cleaning up only periodically or limiting the maximum size of the deduplication cache.
| for key in expired_keys: | ||
| self._wechat_kf_seen_text_messages.pop(key, None) | ||
|
|
||
| dedup_key = f"{session_id}:{normalized_text}" |
There was a problem hiding this comment.
Using string concatenation with a colon as a key can lead to collisions if the session_id (which is the external_userid) contains a colon. Using a tuple (session_id, normalized_text) is a safer and more idiomatic way to create a composite key in Python.
| dedup_key = f"{session_id}:{normalized_text}" | |
| dedup_key = (session_id, normalized_text) |
Modifications / 改动点
Screenshots or Test Results / 运行截图或测试结果
Checklist / 检查清单
😊 If there are new features added in the PR, I have discussed it with the authors through issues/emails, etc.
/ 如果 PR 中有新加入的功能,已经通过 Issue / 邮件等方式和作者讨论过。
👀 My changes have been well-tested, and "Verification Steps" and "Screenshots" have been provided above.
/ 我的更改经过了良好的测试,并已在上方提供了“验证步骤”和“运行截图”。
🤓 I have ensured that no new dependencies are introduced, OR if new dependencies are introduced, they have been added to the appropriate locations in
requirements.txtandpyproject.toml./ 我确保没有引入新依赖库,或者引入了新依赖库的同时将其添加到
requirements.txt和pyproject.toml文件相应位置。😮 My changes do not introduce malicious code.
/ 我的更改没有引入恶意代码。
Summary by Sourcery
New Features: