
fix(listener): block sticker/GIF aggregator sites + bare media files to keep Discord stickers out of the share library #2

Merged
longsizhuo merged 3 commits into main from
fix/listener-sticker-gif-blocklist
Apr 25, 2026

Conversation

@longsizhuo
Member

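Incident

A user posted a Discord sticker (a klipy GIF) in the share channel, so `message.content` contained nothing but a bare `https://klipy.com/gifs/...` URL. The listener treated it as a normal share, ran the full OG fetch + classification, and it was marked APPROVED and published as #18.

A similar case is #5 (`mmbiz.qpic.cn/...640.jpg`, a direct WeChat image link).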

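Root cause

The original `_SKIP_HOSTS` only blocked Discord's own domains such as `discord.com` / `cdn.discordapp.com`, and did not account for:

  • The Discord sticker panel defaults to tenor / klipy / giphy (the bare URL lands in message.content)
  • Plain direct image links (`.jpg` / `.gif` / `.png`) should not be ingested either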

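Fix

  1. Add the full tenor / klipy / giphy set (including media subdomains) to `_SKIP_HOSTS`
  2. As a fallback, match media extensions on the path (hosts can never be enumerated exhaustively): `.gif/.png/.jpg/.jpeg/.webp/.bmp/.svg/.ico/.mp4/.webm/.mov/.m4v/.mp3/.wav/.ogg/.flac`
  3. Match on the path only; a .jpg appearing in the query does not count (avoids false positives on normal API links with `?file=foo.jpg`)
  4. Tests: +19 cases covering the klipy/tenor/giphy domains, various bare direct image links, case-insensitivity, and query-only media extensions (which should pass through)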
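The two layers described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of `listener.py`: the host set is an assumed subset (the PR text does not list the exact domains), and `should_skip` is a hypothetical name. Only the extension list and the two lines of host/path logic come from the description and the review excerpts.

```python
from urllib.parse import urlparse

# Assumed, illustrative subset; the real _SKIP_HOSTS is not shown in this PR text.
_SKIP_HOSTS = frozenset({
    "discord.com",
    "cdn.discordapp.com",
    "tenor.com",
    "media.tenor.com",
    "klipy.com",
    "giphy.com",
    "media.giphy.com",
})

# Extensions listed in step 2; a tuple so it can be passed to str.endswith.
_MEDIA_EXTENSIONS = (
    ".gif", ".png", ".jpg", ".jpeg", ".webp", ".bmp", ".svg", ".ico",
    ".mp4", ".webm", ".mov", ".m4v",
    ".mp3", ".wav", ".ogg", ".flac",
)


def should_skip(url: str) -> bool:
    """Return True when the URL should not be ingested (hypothetical helper)."""
    parsed = urlparse(url)
    # Layer 1: host blocklist (port stripped, case-insensitive).
    host = parsed.netloc.lower().split(":")[0]
    if host in _SKIP_HOSTS:
        return True
    # Layer 2: extension check on the path only, so ?file=foo.jpg is ignored.
    return parsed.path.lower().endswith(_MEDIA_EXTENSIONS)
```

Because `urlparse` separates the path from the query, a link such as `https://example.com/api?file=foo.jpg` has the path `/api` and passes through, while `https://example.com/path/photo.PNG` is skipped after lowercasing.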

DB cleanup

`#5` and `#18` have been rejected directly in the prod DB by running `UPDATE shared_links SET status = 'REJECTED' WHERE id IN (5, 18)`; the frontend no longer shows them.

Deployment

The systemd service for this repo has already been reloaded with `restart chat-bot` to pick up the new code (systemd reads from disk).

Test

  • `uv run pytest tests/`: 79/79 pass (the 19 new cases are in `test_listener_skip.py`)
  • `uv run ruff check src/ tests/`: clean

🤖 Generated with Claude Code

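Incident: user yhn posted a Discord sticker (a klipy GIF) in the share channel, so message.content contained nothing but a bare https://klipy.com/gifs/... URL. The listener treated it as a normal share, ran the full OG fetch + classification, and it was marked APPROVED and published as #18.

The original _SKIP_HOSTS only blocked Discord's own domains such as discord.com / cdn.discordapp.com, and did not account for the sticker panel defaulting to tenor / klipy / giphy. A related problem: pure direct image links such as mmbiz.qpic.cn (#5) should not be ingested either.

The fix has two layers: (1) add the full tenor / klipy / giphy set to _SKIP_HOSTS; (2) as a fallback, match media extensions (.gif/.png/.jpg/.mp4/...) on the path, since hosts can never be enumerated exhaustively. Matching looks at the path only; a .jpg in the query does not count (avoids false positives on normal API links with ?file=foo.jpg). Covered by 19 new test cases.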
Copilot AI review requested due to automatic review settings April 25, 2026 07:47
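/share is the single-page submission entry point (prefilled via ?url=..., for the bookmarklet); /feed is the wall that shows approved items. The bot pointed its 'Click to view / Added to the Involution Hell share library' links at /share in three places: listener.py (the first reply and the final APPROVED reply) and commands.py (the success receipt for the /share slash command). As a result, users who clicked through saw an empty submission form rather than the content they had just shared.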

Copilot AI left a comment


Pull request overview

This PR hardens the Discord share listener’s URL filtering so sticker/GIF aggregator links and direct (bare) media-file URLs don’t get treated as “share submissions” and ingested into the backend.

Changes:

  • Expand _SKIP_HOSTS to include tenor/klipy/giphy domains commonly emitted by Discord sticker/GIF features.
  • Add a path-based media extension fallback (.gif/.png/.jpg/.../.mp3/...) to skip bare media links regardless of host, while ignoring query-only matches.
  • Add unit tests covering aggregator domains, bare media links, and non-skipped “normal article” URLs.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/chat_bot/cogs/listener.py | Extends skip logic to include sticker/GIF aggregators and path-based media extension filtering. |
| tests/test_listener_skip.py | Adds test cases validating the new skip behavior and non-regressions. |


Comment on lines +95 to +96
# Match on the lowercased path, decoupled from the query: ?foo=bar.jpg will not match by mistake
return parsed.path.lower().endswith(_MEDIA_EXTENSIONS)

Copilot AI Apr 25, 2026


Media-extension skipping is based on `parsed.path.endswith(...)`, but the listener's URL regex can still capture trailing punctuation such as `,` or `.` after a URL in chat. In that case the path becomes `/file.jpg,` and won't match, defeating the new safeguard. Consider normalizing before the check (e.g., stripping common trailing punctuation from the URL/path) and adding a regression test for it.
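One way the suggested normalization could look is sketched below. This is an assumption-laden illustration rather than the change that was merged: `strip_trailing_punct`, `is_bare_media`, and the `_TRAILING_PUNCT` character set are hypothetical, and the extension tuple is abbreviated.

```python
from urllib.parse import urlparse

# Abbreviated for the example; the PR description lists the full set.
_MEDIA_EXTENSIONS = (".gif", ".png", ".jpg", ".jpeg", ".webp", ".mp4", ".mp3")

# Assumed set of characters that commonly trail a URL in chat,
# including a few full-width (Chinese) punctuation marks.
_TRAILING_PUNCT = ",.;:!?)]}>'\"" + ",。;:!?)】》、"


def strip_trailing_punct(url: str) -> str:
    """Drop punctuation that a permissive URL regex may have captured."""
    return url.rstrip(_TRAILING_PUNCT)


def is_bare_media(url: str) -> bool:
    """Check the media extension on the path after normalizing the URL."""
    parsed = urlparse(strip_trailing_punct(url))
    return parsed.path.lower().endswith(_MEDIA_EXTENSIONS)
```

Stripping `)` can, in principle, alter a URL that legitimately ends with a closing parenthesis. Applying the normalization only to the skip check, and not to the URL that is stored, keeps that risk contained.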

Comment on lines +53 to +66
@pytest.mark.parametrize(
    "url",
    [
        # Bare images (WeChat image host, direct image links on any host)
        "https://mmbiz.qpic.cn/mmbiz_jpg/abc/640.jpg",
        "https://example.com/path/photo.PNG",
        "https://i.example.com/cat.gif",
        "https://example.com/foo.webp",
        # Direct video/audio links
        "https://example.com/clip.mp4",
        "https://example.com/audio.mp3",
        # SVG (blocked even when the host is not on the blocklist, in line with the server-side SVG upload blocklist)
        "https://example.com/icon.svg",
    ],

Copilot AI Apr 25, 2026


Consider adding at least one regression case where a bare media URL is followed by trailing punctuation (e.g. `https://example.com/a.jpg,` or Chinese punctuation) to reflect how URLs appear in real Discord messages; otherwise the new path-based extension filter can be bypassed if the extracted URL includes that punctuation.

Comment on lines +92 to +93
host = parsed.netloc.lower().split(":")[0]
if host in _SKIP_HOSTS:

Copilot AI Apr 25, 2026


host = parsed.netloc.lower().split(":")[0] is a bit fragile (doesn’t handle IPv6 literals like [::1]:443 and can be confused by userinfo in the URL). Prefer parsed.hostname (already lowercased by urlparse) and then compare against _SKIP_HOSTS.

Suggested change
- host = parsed.netloc.lower().split(":")[0]
- if host in _SKIP_HOSTS:
+ host = parsed.hostname
+ if host is not None and host in _SKIP_HOSTS:
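The difference between the two approaches can be seen with a short comparison. The helper names `host_via_split` and `host_via_hostname` are hypothetical and exist only for this illustration; both rely solely on the standard library's `urllib.parse.urlparse`.

```python
from urllib.parse import urlparse


def host_via_split(url: str) -> str:
    """The original approach: lowercase netloc, cut at the first colon."""
    return urlparse(url).netloc.lower().split(":")[0]


def host_via_hostname(url: str):
    """The suggested approach: urlparse's hostname (lowercased, or None)."""
    return urlparse(url).hostname


# Plain host with a port: both approaches agree.
print(host_via_split("https://KLIPY.com:8443/gifs/a"))       # klipy.com
print(host_via_hostname("https://KLIPY.com:8443/gifs/a"))    # klipy.com

# Userinfo in the URL: the split returns the username, not the host.
print(host_via_split("https://user:pw@klipy.com/gifs/a"))    # user
print(host_via_hostname("https://user:pw@klipy.com/gifs/a")) # klipy.com

# IPv6 literal: the split returns only the opening bracket.
print(host_via_split("https://[::1]:443/x"))                 # [
print(host_via_hostname("https://[::1]:443/x"))              # ::1
```

`hostname` is `None` when the input has no network location, which is why the suggested change guards with `host is not None`.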

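A user posted an announcement of their own PR (#2) in the share channel, and the bot ingested it as a 'community share', creating #19. The same will happen with dev sub-paths such as issue/commit/compare/actions/releases/discussions/blob/tree.

Strategy: skip when the path has at least 3 segments (/<org>/<repo>/<sub>) and org=involutionhell; repository home pages and third-party repositories all pass through. This targets only the dev self-referential noise and does not affect legitimate shares. +11 test cases.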
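The strategy in this commit message might be expressed along these lines. This is a sketch under assumptions, not the merged code: `is_own_org_dev_link` is a hypothetical name, restricting the check to `github.com` hosts and comparing the org case-insensitively are both guesses, and the repository name in the usage note is made up.

```python
from urllib.parse import urlparse

_OWN_ORG = "involutionhell"  # org named in the commit message
_GITHUB_HOSTS = frozenset({"github.com", "www.github.com"})  # assumption


def is_own_org_dev_link(url: str) -> bool:
    """True for /<org>/<repo>/<sub>... links under the project's own org."""
    parsed = urlparse(url)
    if parsed.hostname not in _GITHUB_HOSTS:
        return False
    # Ignore empty segments produced by leading/trailing slashes.
    segments = [s for s in parsed.path.split("/") if s]
    # Case-insensitive org comparison is an assumption of this sketch.
    return len(segments) >= 3 and segments[0].lower() == _OWN_ORG
```

With a hypothetical repository name, `https://github.com/InvolutionHell/some-repo/pull/2` would be skipped, while the repository home page `https://github.com/InvolutionHell/some-repo` and any third-party link such as `https://github.com/python/cpython/pull/1` would still pass through.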
@longsizhuo longsizhuo merged commit 9d77952 into main Apr 25, 2026