feat: context token counting support for multimodal content (images, audio, and chain-of-thought) #6596

Merged
Soulter merged 1 commit into AstrBotDevs:master from he-yufeng:feat/multimodal-token-counting
Mar 19, 2026

Conversation

@he-yufeng he-yufeng commented Mar 19, 2026

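Problem

EstimateTokenCounter only counts tokens for TextPart and completely ignores image, audio, and chain-of-thought content.

In a multimodal conversation, a single image costs 500-2000 tokens in the OpenAI vision API, but the counter cannot see them. As a result, ContextManager believes the context is not yet full and does not trigger compression until the API returns context_length_exceeded.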
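
To make the failure mode concrete, here is a rough sketch of the arithmetic. The context limit, the compression threshold, and the message mix are illustrative assumptions rather than AstrBot defaults; only the 765-token per-image figure comes from this PR.

```python
# Illustrative numbers only: the limit and threshold below are assumptions,
# not AstrBot defaults. 765 is the per-image estimate introduced by this PR.
CONTEXT_LIMIT = 16_000        # hypothetical model context window
COMPRESS_PERCENT = 80         # hypothetical: compress at 80% of the limit
IMAGE_TOKEN_ESTIMATE = 765

text_tokens = 9_000           # tokens a text-only counter can see
image_count = 8               # images a text-only counter ignores

old_estimate = text_tokens
new_estimate = text_tokens + image_count * IMAGE_TOKEN_ESTIMATE


def should_compress(estimate: int) -> bool:
    """Return True once the estimate reaches the compression threshold."""
    return estimate * 100 >= CONTEXT_LIMIT * COMPRESS_PERCENT


print(old_estimate, should_compress(old_estimate))
print(new_estimate, should_compress(new_estimate))
```

With these made-up numbers the text-only estimate stays under the threshold, while the estimate that includes images crosses it and would trigger compression before the request is sent.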

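Changes

astrbot/core/agent/context/token_counter.py:

  • ImageURLPart → 765 tokens (the median of OpenAI vision low/high resolution; better to overestimate than to underestimate)
  • AudioURLPart → 500 tokens
  • ThinkPart → estimated as normal text (previously skipped)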
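
The three rules above can be sketched as a single loop over content parts. This is a minimal, self-contained illustration, not the code from token_counter.py: the part classes are simplified stand-ins, and `estimate_text_tokens` is a placeholder heuristic (about 4 characters per token) that stands in for the real `_estimate_tokens`, which, per the test list, weights Chinese and English text differently.

```python
from dataclasses import dataclass

IMAGE_TOKEN_ESTIMATE = 765  # value stated in the PR description
AUDIO_TOKEN_ESTIMATE = 500  # value stated in the PR description


# Simplified stand-ins for AstrBot's part classes (assumption: the real
# classes carry more fields and a nested image_url object).
@dataclass
class TextPart:
    text: str


@dataclass
class ThinkPart:
    think: str


@dataclass
class ImageURLPart:
    url: str


@dataclass
class AudioURLPart:
    url: str


def estimate_text_tokens(text: str) -> int:
    # Placeholder heuristic (assumption): about 4 characters per token,
    # rounded up. The real estimator is language-aware.
    return (len(text) + 3) // 4


def count_parts(parts: list) -> int:
    """Sum token estimates over a list of content parts."""
    total = 0
    for part in parts:
        if isinstance(part, TextPart):
            total += estimate_text_tokens(part.text)
        elif isinstance(part, ThinkPart):
            total += estimate_text_tokens(part.think)
        elif isinstance(part, ImageURLPart):
            total += IMAGE_TOKEN_ESTIMATE
        elif isinstance(part, AudioURLPart):
            total += AUDIO_TOKEN_ESTIMATE
    return total
```

Fixed per-item constants keep the estimator cheap and free of network or image decoding work, at the cost of precision for very small or very large images.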

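Tests

Added tests/agent/test_token_counter.py, with 10 tests covering:

  • Plain text / weighting differences between Chinese and English
  • Images, audio, and chain-of-thought counted individually
  • Mixed text + image content
  • Multiple images adding up
  • trusted_token_usage precedence
  • tool_calls counting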

Summary by Sourcery

Add multimodal token counting to the context token estimator so that images, audio, and thinking content are included in context length estimation.

New Features:

  • Estimate token usage for ImageURLPart and AudioURLPart using fixed per-item token estimates to reflect multimodal context cost.
  • Include ThinkPart content in token estimation alongside regular text parts.

Tests:

  • Add comprehensive tests for the EstimateTokenCounter covering text-only, multimodal (image/audio/think), mixed content, multiple images, trusted_token_usage overrides, and tool call token counting.

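Previously, EstimateTokenCounter counted only TextPart and completely ignored ImageURLPart,
AudioURLPart, and ThinkPart. In multimodal conversations an image costs 500-2000 tokens;
leaving it uncounted triggers context compression too late, so the API reports context_length_exceeded first.

Changes:
- Estimate ImageURLPart at 765 tokens (median of OpenAI vision low/high resolution)
- Estimate AudioURLPart at 500 tokens
- Count the text content of ThinkPart normally
- 10 new tests covering each type individually and in mixed scenarios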
@auto-assign auto-assign bot requested review from Soulter and anka-afk March 19, 2026 02:55
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Mar 19, 2026
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

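This pull request strengthens the system's context management by introducing accurate token estimation for multimodal content (images, audio, and chain-of-thought), fixing the issue that the existing EstimateTokenCounter could not count tokens for such content correctly. The improvement ensures that context compression is triggered in time, which avoids API errors caused by undercounting tokens and improves the stability and reliability of multimodal conversations.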

Highlights

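  • Bug fix: EstimateTokenCounter previously counted only text tokens and ignored image, audio, and chain-of-thought content; fixing this avoids context_length_exceeded errors caused by inaccurate context length estimates.
  • Multimodal support: Added token estimation logic for ImageURLPart (estimated at 765 tokens), AudioURLPart (estimated at 500 tokens), and ThinkPart (estimated as text).
  • Test coverage: Added tests/agent/test_token_counter.py with 10 test cases covering token counting for plain text, Chinese/English weighting, images, audio, chain-of-thought, mixed content, multiple images adding up, trusted_token_usage precedence, and tool_calls.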

@dosubot dosubot bot added area:core The bug / feature is about astrbot's core, backend area:provider The bug / feature is about AI Provider, Models, LLM Agent, LLM Agent Runner. labels Mar 19, 2026

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

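This PR does a good job of fixing the problem that multimodal content (images, audio, chain-of-thought) was ignored in token estimation. By assigning reasonable token estimates to the different content part types, context compression can be triggered more accurately and API errors avoided. The new test cases cover a range of scenarios, including plain text, multimodal content, mixed content, and tool_calls, which helps ensure the code is robust. Overall the change is clear and well scoped.

I have one suggestion about extensibility; see the inline comment.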

Comment on lines 58 to +66
for part in content:
    if isinstance(part, TextPart):
        total += self._estimate_tokens(part.text)
    elif isinstance(part, ThinkPart):
        total += self._estimate_tokens(part.think)
    elif isinstance(part, ImageURLPart):
        total += IMAGE_TOKEN_ESTIMATE
    elif isinstance(part, AudioURLPart):
        total += AUDIO_TOKEN_ESTIMATE

medium

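Handling the different ContentPart types with an if/elif chain works, but this function will need to be modified every time a new content type has to be supported. To improve extensibility and follow the open/closed principle, consider a more dynamic dispatch mechanism, such as a dictionary that maps content types to their handling logic.

That keeps the count_tokens method stable: adding a new type only requires updating the mapping dictionary, without touching the core loop.

For example, a mapping could be defined at class level: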

class EstimateTokenCounter:
    _PART_HANDLERS = {
        TextPart: lambda self, p: self._estimate_tokens(p.text),
        ThinkPart: lambda self, p: self._estimate_tokens(p.think),
        ImageURLPart: lambda self, p: IMAGE_TOKEN_ESTIMATE,
        AudioURLPart: lambda self, p: AUDIO_TOKEN_ESTIMATE,
    }

    def count_tokens(self, messages: list[Message], trusted_token_usage: int = 0) -> int:
        # ...
        elif isinstance(content, list):
            for part in content:
                # Note: using type() might not work with inheritance, isinstance is safer.
                # This is a conceptual example.
                handler = self._PART_HANDLERS.get(type(part))
                if handler:
                    total += handler(self, part)
        # ...

Since this is only a suggested improvement, the current implementation is also perfectly acceptable.
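
The inline note in the example above points out that a lookup keyed on `type(part)` misses subclasses. One way to keep dictionary dispatch while respecting inheritance is to walk the part's method resolution order. The following is only a sketch under assumptions: the part classes are simplified stand-ins, `ThumbnailPart` is a hypothetical subclass invented for the demonstration, and the text heuristic is a placeholder.

```python
from dataclasses import dataclass
from typing import Callable

IMAGE_TOKEN_ESTIMATE = 765  # value stated in the PR description


# Simplified stand-ins for AstrBot's part classes (assumption).
@dataclass
class TextPart:
    text: str


@dataclass
class ImageURLPart:
    url: str


class ThumbnailPart(ImageURLPart):
    """Hypothetical subclass, used only to exercise inheritance."""


def _text_tokens(part) -> int:
    # Placeholder heuristic (assumption): about 4 characters per token.
    return (len(part.text) + 3) // 4


PART_HANDLERS: dict[type, Callable[[object], int]] = {
    TextPart: _text_tokens,
    ImageURLPart: lambda part: IMAGE_TOKEN_ESTIMATE,
}


def tokens_for(part) -> int:
    # Walk the MRO so a subclass falls back to its nearest registered base,
    # which a plain PART_HANDLERS.get(type(part)) lookup would not do.
    for cls in type(part).__mro__:
        handler = PART_HANDLERS.get(cls)
        if handler is not None:
            return handler(part)
    return 0  # unknown part types contribute nothing in this sketch


print(tokens_for(ThumbnailPart(url="data:image/png;base64,x")))
```

The standard library's `functools.singledispatch` offers similar subclass-aware dispatch; the explicit MRO walk is shown here because it stays closest to the dictionary in the suggestion.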

@sourcery-ai sourcery-ai bot left a comment

Hey - I've found 1 issue, and left some high level feedback:

  • The hard-coded IMAGE_TOKEN_ESTIMATE and AUDIO_TOKEN_ESTIMATE values might be better surfaced as configuration (or at least module-level constants with a clear link to the current OpenAI pricing/docs) so they can be adjusted without code changes if model pricing or behavior shifts.
  • In the multimodal loop you only handle TextPart, ThinkPart, ImageURLPart, and AudioURLPart; consider either raising/logging on unknown part types or adding a default branch so new Part subclasses don't get silently ignored in token accounting.
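
Both points can be illustrated together. The sketch below is hypothetical and not part of the PR: `MediaTokenCounter`, its field names, and `VideoPart` are invented for illustration, and the part classes are simplified stand-ins. It shows per-item estimates passed in as configuration and a default branch that logs unknown part types instead of skipping them silently.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("token_counter_sketch")


# Simplified stand-ins for AstrBot's part classes (assumption).
@dataclass
class ImageURLPart:
    url: str


@dataclass
class AudioURLPart:
    url: str


@dataclass
class VideoPart:
    """Hypothetical part type the counter does not know about."""
    url: str


@dataclass
class MediaTokenCounter:
    # Defaults mirror the PR's constants; callers may override them.
    image_tokens: int = 765
    audio_tokens: int = 500
    unknown_tokens: int = 0  # assumption: what to charge for unknown parts

    def count(self, parts: list) -> int:
        total = 0
        for part in parts:
            if isinstance(part, ImageURLPart):
                total += self.image_tokens
            elif isinstance(part, AudioURLPart):
                total += self.audio_tokens
            else:
                # Default branch: surface the gap instead of ignoring it.
                logger.warning("No token estimate for %s", type(part).__name__)
                total += self.unknown_tokens
        return total


counter = MediaTokenCounter(image_tokens=1000)
print(counter.count([ImageURLPart("a"), AudioURLPart("b"), VideoPart("c")]))
```

Charging a non-zero `unknown_tokens` would be the conservative choice if overestimating is preferred, in line with the PR's stated bias toward higher estimates.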

Comment on lines +82 to +90
class TestTrustedUsage:
    def test_trusted_overrides(self):
        """If the API returned a token count, use it directly without estimating."""
        msg = _msg("user", [
            TextPart(text="hello"),
            ImageURLPart(image_url=ImageURLPart.ImageURL(url="data:image/png;base64,x")),
        ])
        tokens = counter.count_tokens([msg], trusted_token_usage=42)
        assert tokens == 42

suggestion (testing): Consider adding a trusted_token_usage test that includes tool_calls to verify the override in a more complex message

Since trusted_token_usage also affects tool_calls, please add a companion test where the message includes both multimodal parts and non-empty tool_calls, and assert count_tokens(..., trusted_token_usage=42) == 42. This will verify the override still applies when tool calls are present and guard against regressions that reintroduce tool_call token summation on top of the trusted value.

Suggested implementation:

class TestTrustedUsage:
    def test_trusted_overrides(self):
        """If the API returned a token count, use it directly without estimating."""
        msg = _msg("user", [
            TextPart(text="hello"),
            ImageURLPart(image_url=ImageURLPart.ImageURL(url="data:image/png;base64,x")),
        ])
        tokens = counter.count_tokens([msg], trusted_token_usage=42)
        assert tokens == 42

    def test_trusted_overrides_with_tool_calls(self):
        """If the API returned a token count, use it directly even when tool_calls are present."""
        msg = _msg(
            "assistant",
            [
                TextPart(text="调用工具中……"),
                ImageURLPart(
                    image_url=ImageURLPart.ImageURL(
                        url="data:image/png;base64,y"
                    )
                ),
            ],
            tool_calls=[
                {
                    "id": "call_1",
                    "type": "function",
                    "function": {
                        "name": "dummy_tool",
                        "arguments": '{"foo": "bar"}',
                    },
                },
            ],
        )
        tokens = counter.count_tokens([msg], trusted_token_usage=42)
        assert tokens == 42
  1. This change assumes the _msg helper accepts extra keyword arguments (like tool_calls=...) and forwards them into the Message constructor. If that is not the case, update _msg accordingly, or construct Message directly in this test with the correct tool_calls field.
  2. The exact structure of tool_calls (dict shape or specific classes) should match what astrbot.core.agent.message.Message expects elsewhere in your codebase. If you already have a canonical example of tool_calls in other tests, mirror that structure here for consistency.

@Soulter Soulter changed the title from "feat: 上下文 token 计数支持多模态内容(图片/音频/思考链)" to "feat: context token counting support for multimodal content (images, audio, and chain-of-thought)" Mar 19, 2026
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Mar 19, 2026
@Soulter Soulter merged commit 2cb6c84 into AstrBotDevs:master Mar 19, 2026
6 checks passed