feat: context token counting support for multimodal content (images, audio, and chain-of-thought) #6596
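EstimateTokenCounter previously counted only TextPart and ignored ImageURLPart, AudioURLPart, and ThinkPart entirely. In multimodal conversations an image costs 500-2000 tokens; leaving it uncounted makes context compression trigger too late, so the API returns context_length_exceeded first.
Changes:
- ImageURLPart is estimated at 765 tokens (the median of OpenAI vision's low- and high-resolution costs)
- AudioURLPart is estimated at 500 tokens
- ThinkPart text is counted normally
- 10 new tests cover each type on its own and in mixed scenarios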
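For context on the 765 figure: under the tile-based formula OpenAI has documented for GPT-4-class vision models (an assumption here, since the PR does not name a model), a low-detail image costs a flat 85 tokens and a high-detail image costs 85 plus 170 per 512-pixel tile after resizing. A 1024×1024 high-detail image comes out at exactly 765. The sketch below reproduces that arithmetic; the function name is illustrative, and the constants should be checked against current provider documentation before relying on them.

```python
import math

BASE_TOKENS = 85   # assumed flat cost per image (and the whole cost in low detail)
TILE_TOKENS = 170  # assumed cost per 512x512 tile in high detail


def openai_vision_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image tokens with the tile formula (illustrative helper)."""
    if detail == "low":
        return BASE_TOKENS

    # Step 1: shrink the image so it fits inside a 2048 x 2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale

    # Step 2: shrink again so the shortest side is at most 768 pixels.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale

    # Step 3: count the 512-pixel tiles needed to cover the resized image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return BASE_TOKENS + TILE_TOKENS * tiles


print(openai_vision_tokens(1024, 1024))         # high detail, 4 tiles
print(openai_vision_tokens(1024, 1024, "low"))  # low detail, flat cost
```

Under the same formula a 2048×4096 image works out to 1105 tokens, so a fixed 765 is a middle-of-the-road estimate rather than an upper bound.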
Summary of Changes (Gemini Code Assist)
This pull request aims to strengthen the system's context management by introducing accurate token estimates for multimodal content (such as images, audio, and chain-of-thought), resolving the existing …
    for part in content:
        if isinstance(part, TextPart):
            total += self._estimate_tokens(part.text)
        elif isinstance(part, ThinkPart):
            total += self._estimate_tokens(part.think)
        elif isinstance(part, ImageURLPart):
            total += IMAGE_TOKEN_ESTIMATE
        elif isinstance(part, AudioURLPart):
            total += AUDIO_TOKEN_ESTIMATE
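To make the effect of this hunk concrete, here is a minimal, self-contained sketch comparing the old text-only count with the new multimodal count. The part classes are simplified stand-ins for the real ones, and `estimate_text` is a placeholder heuristic (roughly four characters per token), not the project's `_estimate_tokens`; only the two constants are taken from this PR.

```python
from dataclasses import dataclass

IMAGE_TOKEN_ESTIMATE = 765  # value from this PR
AUDIO_TOKEN_ESTIMATE = 500  # value from this PR


# Simplified stand-ins for the real content part classes.
@dataclass
class TextPart:
    text: str


@dataclass
class ThinkPart:
    think: str


@dataclass
class ImageURLPart:
    url: str


@dataclass
class AudioURLPart:
    url: str


def estimate_text(text: str) -> int:
    """Placeholder heuristic: about four characters per token."""
    return len(text) // 4


def count_text_only(parts) -> int:
    """Old behaviour: every part that is not a TextPart is ignored."""
    return sum(estimate_text(p.text) for p in parts if isinstance(p, TextPart))


def count_multimodal(parts) -> int:
    """New behaviour: the same branches as the hunk above."""
    total = 0
    for part in parts:
        if isinstance(part, TextPart):
            total += estimate_text(part.text)
        elif isinstance(part, ThinkPart):
            total += estimate_text(part.think)
        elif isinstance(part, ImageURLPart):
            total += IMAGE_TOKEN_ESTIMATE
        elif isinstance(part, AudioURLPart):
            total += AUDIO_TOKEN_ESTIMATE
    return total


# One short caption followed by ten images.
parts = [TextPart("describe these")] + [ImageURLPart(f"img{i}") for i in range(10)]
print(count_text_only(parts), count_multimodal(parts))
```

With ten images the two counts differ by 7650 tokens, which is the kind of gap that lets a conversation overrun the model's window before compression is considered.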
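Using an if/elif chain to handle the different ContentPart types works, but every new content type will require modifying this function. To improve extensibility and follow the open/closed principle, consider a more dynamic dispatch mechanism, such as a dictionary that maps content types to their handling logic.
This keeps the count_tokens method stable: adding a new type only requires updating the mapping dictionary, not the core loop.
For example, you could define a mapping at class level: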
    class EstimateTokenCounter:
        _PART_HANDLERS = {
            TextPart: lambda self, p: self._estimate_tokens(p.text),
            ThinkPart: lambda self, p: self._estimate_tokens(p.think),
            ImageURLPart: lambda self, p: IMAGE_TOKEN_ESTIMATE,
            AudioURLPart: lambda self, p: AUDIO_TOKEN_ESTIMATE,
        }

        def count_tokens(self, messages: list[Message], trusted_token_usage: int = 0) -> int:
            # ...
            elif isinstance(content, list):
                for part in content:
                    # Note: using type() might not work with inheritance, isinstance is safer.
                    # This is a conceptual example.
                    handler = self._PART_HANDLERS.get(type(part))
                    if handler:
                        total += handler(self, part)
            # ...

Since this is only a suggested improvement, the current implementation is also perfectly acceptable.
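The caveat about `type()` and inheritance can be handled by walking the part's method resolution order, so a subclass resolves to its nearest registered base class. The following is a rough sketch of that variant, not the project's code: the part classes are simplified stand-ins, `InlineImagePart` and `count_parts` are hypothetical names, and `_estimate_tokens` is a placeholder heuristic.

```python
from dataclasses import dataclass

IMAGE_TOKEN_ESTIMATE = 765
AUDIO_TOKEN_ESTIMATE = 500


# Simplified stand-ins for the real content part classes.
@dataclass
class TextPart:
    text: str


@dataclass
class ThinkPart:
    think: str


@dataclass
class ImageURLPart:
    url: str


@dataclass
class AudioURLPart:
    url: str


class InlineImagePart(ImageURLPart):
    """Hypothetical subclass, used only to show that inheritance resolves."""


class EstimateTokenCounter:
    # Type-to-handler table; adding a part type means adding one entry here.
    _PART_HANDLERS = {
        TextPart: lambda self, p: self._estimate_tokens(p.text),
        ThinkPart: lambda self, p: self._estimate_tokens(p.think),
        ImageURLPart: lambda self, p: IMAGE_TOKEN_ESTIMATE,
        AudioURLPart: lambda self, p: AUDIO_TOKEN_ESTIMATE,
    }

    def _estimate_tokens(self, text: str) -> int:
        # Placeholder heuristic (about four characters per token).
        return len(text) // 4

    def _count_part(self, part) -> int:
        # Walk the MRO so subclasses match their closest registered base,
        # which a plain dict lookup on type(part) would miss.
        for cls in type(part).__mro__:
            handler = self._PART_HANDLERS.get(cls)
            if handler is not None:
                return handler(self, part)
        return 0  # unregistered types contribute nothing in this sketch

    def count_parts(self, parts) -> int:
        return sum(self._count_part(p) for p in parts)


counter = EstimateTokenCounter()
print(counter.count_parts([TextPart("abcdefgh"), InlineImagePart("x")]))
```

The MRO walk keeps the lookup close to what `isinstance` would match while preserving the table-driven structure; whether that is worth the indirection for four part types is a judgment call.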
Hey - I've found 1 issue, and left some high level feedback:
- The hard-coded IMAGE_TOKEN_ESTIMATE and AUDIO_TOKEN_ESTIMATE values might be better surfaced as configuration (or at least module-level constants with a clear link to the current OpenAI pricing/docs) so they can be adjusted without code changes if model pricing or behavior shifts.
- In the multimodal loop you only handle TextPart, ThinkPart, ImageURLPart, and AudioURLPart; consider either raising/logging on unknown part types or adding a default branch so new Part subclasses don't get silently ignored in token accounting.
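Both points could be addressed together along the following lines. This is only a sketch under stated assumptions: `TokenEstimates`, its `unknown` field, `VideoPart`, and `count_parts` are hypothetical names introduced for illustration, the part classes are simplified stand-ins, and the text heuristic is a placeholder.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class TokenEstimates:
    """Hypothetical settings object; defaults mirror the constants in this PR."""
    image: int = 765
    audio: int = 500
    unknown: int = 0  # fallback charged for unrecognised part types


# Simplified stand-ins for the real content part classes.
@dataclass
class TextPart:
    text: str


@dataclass
class ImageURLPart:
    url: str


@dataclass
class AudioURLPart:
    url: str


@dataclass
class VideoPart:
    """Hypothetical future part type that the counter does not know about."""
    url: str


def count_parts(parts, estimates: TokenEstimates = TokenEstimates()) -> int:
    total = 0
    for part in parts:
        if isinstance(part, TextPart):
            total += len(part.text) // 4  # placeholder text heuristic
        elif isinstance(part, ImageURLPart):
            total += estimates.image
        elif isinstance(part, AudioURLPart):
            total += estimates.audio
        else:
            # Default branch: surface the gap instead of silently skipping it.
            logger.warning("Uncounted content part type: %s", type(part).__name__)
            total += estimates.unknown
    return total


# Estimates can now be tuned per deployment without touching the loop.
print(count_parts([ImageURLPart("a"), VideoPart("v")], TokenEstimates(image=1000, unknown=300)))
```

Logging rather than raising keeps token counting from failing a request outright, at the cost of relying on someone reading the logs; raising would be the stricter choice.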
## Individual Comments
### Comment 1
<location path="tests/agent/test_token_counter.py" line_range="82-90" />
<code_context>
+        assert tokens == IMAGE_TOKEN_ESTIMATE * 3
+
+
+class TestTrustedUsage:
+    def test_trusted_overrides(self):
+        """If the API returned a token count, use it directly instead of estimating."""
+        msg = _msg("user", [
+            TextPart(text="hello"),
+            ImageURLPart(image_url=ImageURLPart.ImageURL(url="data:image/png;base64,x")),
+        ])
+        tokens = counter.count_tokens([msg], trusted_token_usage=42)
+        assert tokens == 42
+
+
+
</code_context>
<issue_to_address>
**suggestion (testing):** Consider adding a trusted_token_usage test that includes tool_calls to verify the override in a more complex message
Since `trusted_token_usage` also affects `tool_calls`, please add a companion test where the message includes both multimodal parts and non-empty `tool_calls`, and assert `count_tokens(..., trusted_token_usage=42) == 42`. This will verify the override still applies when tool calls are present and guard against regressions that reintroduce tool_call token summation on top of the trusted value.
Suggested implementation:
```python
class TestTrustedUsage:
    def test_trusted_overrides(self):
        """If the API returned a token count, use it directly instead of estimating."""
        msg = _msg("user", [
            TextPart(text="hello"),
            ImageURLPart(image_url=ImageURLPart.ImageURL(url="data:image/png;base64,x")),
        ])
        tokens = counter.count_tokens([msg], trusted_token_usage=42)
        assert tokens == 42

    def test_trusted_overrides_with_tool_calls(self):
        """If the API returned a token count, use it directly even when tool_calls are present."""
        msg = _msg(
            "assistant",
            [
                TextPart(text="Calling tool..."),
                ImageURLPart(
                    image_url=ImageURLPart.ImageURL(
                        url="data:image/png;base64,y"
                    )
                ),
            ],
            tool_calls=[
                {
                    "id": "call_1",
                    "type": "function",
                    "function": {
                        "name": "dummy_tool",
                        "arguments": '{"foo": "bar"}',
                    },
                },
            ],
        )
        tokens = counter.count_tokens([msg], trusted_token_usage=42)
        assert tokens == 42
```
1. This change assumes the `_msg` helper accepts extra keyword arguments (like `tool_calls=...`) and forwards them into the `Message` constructor. If that is not the case, update `_msg` accordingly, or construct `Message` directly in this test with the correct `tool_calls` field.
2. The exact structure of `tool_calls` (dict shape or specific classes) should match what `astrbot.core.agent.message.Message` expects elsewhere in your codebase. If you already have a canonical example of `tool_calls` in other tests, mirror that structure here for consistency.
</issue_to_address>
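Problem
EstimateTokenCounter only counts tokens for TextPart and completely ignores image, audio, and chain-of-thought content. In a multimodal conversation a single image costs 500-2000 tokens in the OpenAI vision API, but the counter never sees them. As a result, ContextManager believes the context is not yet full and does not trigger compression until the API returns context_length_exceeded.
Changes
astrbot/core/agent/context/token_counter.py:
- ImageURLPart → 765 tokens (the median of OpenAI vision's low- and high-resolution costs; deliberately erring high rather than low)
- AudioURLPart → 500 tokens
- ThinkPart → estimated as regular text (previously skipped)
Tests
Added tests/agent/test_token_counter.py, with 10 tests covering:
Summary by Sourcery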
Add multimodal token counting to the context token estimator so that images, audio, and thinking content are included in context length estimation.
New Features:
Tests: