fix: align mimo tts style payload with official docs #6814

Merged
RC-CHN merged 1 commit into AstrBotDevs:master from RichardLiuda:codex/fix-mimo-tts-style-docs
Mar 23, 2026

@RichardLiuda RichardLiuda commented Mar 22, 2026

Aligns the request construction for MiMo TTS style control with the official MiMo speech synthesis documentation.

Resolves #6815

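Overview

This PR fixes how MiMo TTS style control is sent.

Previously, style-related content such as mimo-tts-style-prompt was sent as a separate user message, which does not match the usage recommended in the official MiMo speech synthesis documentation. According to the official documentation, style control should be prepended directly to the text to be synthesized in the assistant role, using a <style>...</style> tag.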

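Changes

  • Adjust astrbot/core/provider/sources/mimo_tts_api_source.py
    • Prepend style control to the text to be synthesized in assistant.content
    • Merge mimo-tts-style-prompt and mimo-tts-dialect into a single leading <style>...</style> tag
    • Keep mimo-tts-seed-text as an optional user message; it is no longer concatenated into the text to be synthesized
    • Special-case the "唱歌" (singing) style so that only <style>唱歌</style> is generated at the very start of the text
  • Update the MiMo TTS configuration hint text in the following files:
    • astrbot/core/config/default.py
    • dashboard/src/i18n/locales/zh-CN/features/config-metadata.json
    • dashboard/src/i18n/locales/en-US/features/config-metadata.json
    • dashboard/src/i18n/locales/ru-RU/features/config-metadata.json
  • Extend tests/test_mimo_api_sources.py
    • Verify that the style tag is prepended to assistant.content
    • Verify that seed_text is sent only as an optional user message
    • Verify that plain text is still sent correctly when no style is set
    • Verify the special handling of the "唱歌" (singing) style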

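Alignment with the official documentation

Reference documentation:

The key points of this alignment are:

  • The text to be synthesized must be placed in an assistant role message
  • Style control must be added to the start of the target text using a <style>...</style> tag
  • The user role message is optional and serves only as auxiliary guidance
  • When the "唱歌" (singing) style is used, only <style>唱歌</style> is kept at the start of the target text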
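
The points above can be sketched as a small message builder. This is a minimal illustration, not the code merged in this PR: the function names `build_style_prefix` and `build_messages` are hypothetical, and joining style and dialect with a single space is an assumption based on the `<style>开心 四川话</style>hello` value asserted in the tests.

```python
def build_style_prefix(style_prompt: str = "", dialect: str = "") -> str:
    """Return a leading <style>...</style> tag, or "" when no style is set."""
    # Hypothetical helper: drop empty or whitespace-only parts.
    parts = [p.strip() for p in (style_prompt, dialect) if p and p.strip()]
    if not parts:
        return ""
    style_content = " ".join(parts)  # assumption: space-separated
    # Singing is special-cased: only the singing tag is kept.
    if "唱歌" in style_content:
        return "<style>唱歌</style>"
    return f"<style>{style_content}</style>"


def build_messages(
    text: str, style_prompt: str = "", dialect: str = "", seed_text: str = ""
) -> list[dict[str, str]]:
    """Build the messages list: optional user hint, then assistant text."""
    messages: list[dict[str, str]] = []
    seed = seed_text.strip()
    if seed:
        # The user message is optional guidance only.
        messages.append({"role": "user", "content": seed})
    # The text to synthesize always lives in the assistant message.
    messages.append(
        {
            "role": "assistant",
            "content": build_style_prefix(style_prompt, dialect) + text,
        }
    )
    return messages
```

With no style and no seed text, the sketch produces a single assistant message that contains only the plain text.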

Compatibility

  • This is NOT a breaking change.

Verification

The following checks were run locally:

uv run pytest tests/test_mimo_api_sources.py
uv run ruff check astrbot/core/provider/sources/mimo_tts_api_source.py tests/test_mimo_api_sources.py astrbot/core/config/default.py

Results:

  • tests/test_mimo_api_sources.py: all 10 tests passed
  • ruff check: passed

Summary by Sourcery

Align MiMo TTS request construction with the official speech synthesis specification by moving style control into assistant messages and treating seed text as an optional user hint.

Bug Fixes:

  • Correct MiMo TTS style and dialect handling so they are applied via a <style>...</style> tag at the beginning of the assistant text instead of as a separate user message.

Enhancements:

  • Refactor MiMo TTS payload building to separate seed text handling from style prefix composition and ensure assistant content is always the actual synthesis text with any style tags prepended.

Documentation:

  • Update MiMo TTS configuration hints in default settings and dashboard i18n metadata to describe the new style, dialect, and seed text behavior.

Tests:

  • Extend MiMo TTS tests to cover style tag prefixing, omission of user messages without seed text, singing-style special handling, plain-text behavior without style, and ensuring seed text is not prepended to assistant content.

@dosubot dosubot bot added the size:L label (This PR changes 100-499 lines, ignoring generated files) on Mar 22, 2026
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

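This pull request fixes how MiMo TTS style control is sent so that it matches the official MiMo speech synthesis documentation. Previously, style-related content was sent as a separate user message; it is now prepended directly to the text to be synthesized in the assistant role using a <style>...</style> tag, which ensures correct style control behavior and improves compatibility with the official API.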

Highlights

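  • MiMo TTS style control aligned with the official documentation: The request construction for MiMo TTS style control now follows the usage recommended in the official documentation, with style control prepended to the text to be synthesized in assistant.content.
  • Style and dialect merged into a <style> tag: mimo-tts-style-prompt and mimo-tts-dialect are merged into a single <style>...</style> tag inserted at the start of the text to be synthesized.
  • Seed text handling changed: mimo-tts-seed-text is sent as an optional user message that helps adjust tone and style, and is no longer concatenated into the text to be synthesized.
  • Special handling of the "唱歌" (singing) style: When this style is detected, only <style>唱歌</style> is generated at the very start of the text.
  • Multilingual configuration hints updated: The MiMo TTS configuration hint text in the Chinese, English, and Russian configuration files was updated to reflect the new implementation.
  • Test coverage extended: New test cases verify style tag prefixing, seed text sent as an optional user message, plain-text sending when no style is set, and the special handling of the "唱歌" (singing) style.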

@dosubot dosubot bot added the area:provider label (The bug / feature is about AI Provider, Models, LLM Agent, LLM Agent Runner) on Mar 22, 2026

@sourcery-ai sourcery-ai bot left a comment

Hey - I've found 1 issue, and left some high level feedback:

  • The singing-style detection in _build_style_prefix uses a loose substring check (if "唱歌" in style_content), which may unintentionally trigger for phrases that merely contain those characters; consider a stricter condition (e.g., tokenized match or startswith) to avoid accidental singing mode.
  • The special handling for the singing style is currently hard-coded to the literal string "唱歌"; if MiMo later adds more singing-related tags or localized variants, this might be brittle—consider centralizing these style keywords or making them configurable to keep the mapping easier to extend.
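
One way to act on both points is sketched below, assuming style keywords are separated by whitespace or common punctuation. The names `SINGING_KEYWORDS` and `is_singing_style` are hypothetical and do not exist in the PR; the sketch replaces the substring check with a token match against a centralized keyword set.

```python
import re

# Hypothetical central registry of singing-style keywords; extend as needed.
SINGING_KEYWORDS = frozenset({"唱歌"})

# Split on whitespace and common ASCII / full-width separators.
_SEPARATORS = re.compile(r"[\s,，、;；]+")


def is_singing_style(style_content: str, keywords=SINGING_KEYWORDS) -> bool:
    """Return True only when a whole token equals a singing keyword."""
    tokens = [t for t in _SEPARATORS.split(style_content.strip()) if t]
    return any(token in keywords for token in tokens)
```

Because Chinese text has no word boundaries, this only helps when users delimit keywords explicitly; a phrase such as "不要唱歌的感觉" is treated as a single token and no longer triggers singing mode.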
Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="tests/test_mimo_api_sources.py" line_range="46-55" />
<code_context>
+def test_mimo_tts_assistant_content_prefixes_style_and_dialect():
</code_context>
<issue_to_address>
**suggestion (testing):** Strengthen this test to fully validate the assistant message structure and messages length.

Currently this only asserts the full first `user` message and the `content` of the second `assistant` message. To make the regression test more robust for the new protocol, also assert that the second message has `role == "assistant"` and that `len(payload["messages"]) == 2` so future refactors can’t change the ordering or add extra messages unnoticed.

Suggested implementation:

```python
def test_mimo_tts_assistant_content_prefixes_style_and_dialect():
    provider = _make_tts_provider(
        {
            "mimo-tts-style-prompt": "开心",
            "mimo-tts-dialect": "四川话",
            "mimo-tts-seed-text": "You are chatting with a close friend.",
        }
    )
    try:
        # Build the payload used to call the model
        payload = provider._build_payload()

        # Ensure we only ever send the expected two messages
        assert len(payload["messages"]) == 2

        # First message should still be the full user seed text as before
        user_msg = payload["messages"][0]
        assert user_msg["role"] == "user"
        assert user_msg["content"] == "You are chatting with a close friend."

        # Second message should be an assistant message whose content prefixes style and dialect
        assistant_msg = payload["messages"][1]
        assert assistant_msg["role"] == "assistant"
        assert "开心" in assistant_msg["content"]
        assert "四川话" in assistant_msg["content"]
    finally:
        # Release the provider even when an assertion above fails
        asyncio.run(provider.terminate())

```

Because I can only see part of the file, you may need to align the helper names and patterns:

1. Replace `provider._build_payload()` with the actual helper you use to construct the chat payload that currently produces `payload["messages"]` (for example, it might be named `_build_chat_request`, `_build_mimo_payload`, etc.).
2. If the test already builds `payload` and asserts on the first user message and the assistant content, you should:
   - Keep the existing `payload` construction.
   - Insert the following assertions around where you currently assert on `payload["messages"][1]["content"]`:
   ```python
   assert len(payload["messages"]) == 2
   assert payload["messages"][1]["role"] == "assistant"
   ```
   and, if not already present, factor out `user_msg`/`assistant_msg` as shown so the test is clearer and more robust.
3. Ensure the `asyncio.run(provider.terminate())` call is not duplicated if there is already a termination call in a `finally` block or elsewhere in this test.
</issue_to_address>

Comment on lines +46 to 55
 def test_mimo_tts_assistant_content_prefixes_style_and_dialect():
     provider = _make_tts_provider(
         {
-            "mimo-tts-style-prompt": "Please sound cheerful and lively.",
-            "mimo-tts-dialect": "Sichuan dialect",
+            "mimo-tts-style-prompt": "开心",
+            "mimo-tts-dialect": "四川话",
             "mimo-tts-seed-text": "You are chatting with a close friend.",
         }
     )
     try:
         payload = provider._build_payload("hello")

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aligns the MiMo TTS style payload with the official documentation by adjusting the request construction to include style control within assistant messages and treating seed text as an optional user hint. The changes involve modifications to the MiMo TTS API source code, updates to configuration hints in default settings and dashboard i18n metadata, and extensions to MiMo TTS tests. The review focuses on correctness and maintainability, ensuring the changes adhere to the intended functionality and coding best practices. All comments provided are valid and do not contradict the repository's rules regarding asyncio atomicity.


-    def _build_user_prompt(self) -> str:
-        prompt_parts: list[str] = []
+    def _build_user_prompt(self) -> str | None:

medium

The function _build_user_prompt is now returning None in some cases. It would be helpful to add a docstring to explain the function's purpose, return value, and the conditions under which it returns None.

-        prompt_parts: list[str] = []
+    def _build_user_prompt(self) -> str | None:
+        seed_text = self.seed_text.strip()
+        return seed_text or None

medium

This line returns the seed_text if it's not empty, otherwise it returns None. Consider adding a comment to clarify this behavior, as it might not be immediately obvious to someone reading the code.

Suggested change
-        return seed_text or None
+        return seed_text or None  # Returns seed_text if not empty, otherwise None
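
The docstring and inline-comment suggestions for this method could be combined as in the sketch below. The `SeedTextHolder` class is a hypothetical stand-in that models only the `seed_text` attribute; it is not the provider class from the PR.

```python
from __future__ import annotations


class SeedTextHolder:
    """Hypothetical stand-in for the provider; only `seed_text` is modeled."""

    def __init__(self, seed_text: str = "") -> None:
        self.seed_text = seed_text

    def _build_user_prompt(self) -> str | None:
        """Return the optional user-role guidance text.

        Returns the stripped seed text, or None when the seed text is
        empty or whitespace-only, in which case the caller should omit
        the user message entirely.
        """
        seed_text = self.seed_text.strip()
        # An empty string is falsy, so `or None` maps "" to None.
        return seed_text or None
```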

        seed_text = self.seed_text.strip()
        return seed_text or None

    def _build_style_prefix(self) -> str:

medium

It would be helpful to add a docstring to explain the purpose of the _build_style_prefix function and what the returned string represents.

Comment on lines +64 to +65
        if "唱歌" in style_content:
            return "<style>唱歌</style>"

medium

This logic handles the special case where the style includes "唱歌". It would be beneficial to add a comment explaining why this special handling is necessary, referencing the MiMo documentation if possible.

Suggested change
-        if "唱歌" in style_content:
-            return "<style>唱歌</style>"
+        # MiMo recommends using only the singing style tag at the very beginning.
+        if "唱歌" in style_content:
+            return "<style>唱歌</style>"  # Special case for singing style


-        if self.seed_text.strip():
-            prompt_parts.append(self.seed_text.strip())
+        return f"<style>{style_content}</style>"

medium

Consider adding a comment here to explain that the <style> tag is being constructed and that this is the standard way to apply styles according to the MiMo TTS API.

Suggested change
-        return f"<style>{style_content}</style>"
+        # Construct the style tag with the combined style content
+        return f"<style>{style_content}</style>"

Comment on lines +56 to +60
        assert payload["messages"][0] == {
            "role": "user",
            "content": "You are chatting with a close friend.",
        }
        assert payload["messages"][1]["content"] == "<style>开心 四川话</style>hello"

medium

The assertion payload["messages"][1]["content"] checks the content of the assistant message. It would be more robust to also assert the role to ensure the message is correctly assigned to the assistant.

Suggested change
-        assert payload["messages"][0] == {
-            "role": "user",
-            "content": "You are chatting with a close friend.",
-        }
-        assert payload["messages"][1]["content"] == "<style>开心 四川话</style>hello"
+        assert payload["messages"][0] == {
+            "role": "user",
+            "content": "You are chatting with a close friend.",
+        }
+        assert payload["messages"][1]["role"] == "assistant"
+        assert payload["messages"][1]["content"] == "<style>开心 四川话</style>hello"

Comment on lines +74 to +78
        assert payload["messages"] == [
            {
                "role": "assistant",
                "content": "<style>开心</style>hello",
            }

medium

The assertion payload["messages"] checks the entire messages array. It would be more robust to assert the role and content of the assistant message separately to ensure the message is correctly constructed.

        assert len(payload["messages"]) == 1
        assert payload["messages"][0]["role"] == "assistant"
        assert payload["messages"][0]["content"] == "<style>开心</style>hello"

    )
    try:
        payload = provider._build_payload("歌词")
        assert payload["messages"][1]["content"] == "<style>唱歌</style>歌词"

medium

The assertion payload["messages"][1]["content"] checks the content of the assistant message. It would be more robust to also assert the role to ensure the message is correctly assigned to the assistant.

        assert payload["messages"][1]["role"] == "assistant"
        assert payload["messages"][1]["content"] == "<style>唱歌</style>歌词"

Comment on lines +106 to +110
        assert payload["messages"] == [
            {
                "role": "assistant",
                "content": "hello",
            }

medium

The assertion payload["messages"] checks the entire messages array. It would be more robust to assert the role and content of the assistant message separately to ensure the message is correctly constructed.

        assert len(payload["messages"]) == 1
        assert payload["messages"][0]["role"] == "assistant"
        assert payload["messages"][0]["content"] == "hello"

Comment on lines +125 to +127
        assert payload["messages"][0]["content"] == "reference text"
        assert payload["messages"][1]["content"] == "<style>开心</style>明天就是周五了"
        assert "reference text" not in payload["messages"][1]["content"]

medium

These assertions check the content of the user and assistant messages. It would be more robust to also assert the role to ensure the messages are correctly assigned.

        assert payload["messages"][0]["role"] == "user"
        assert payload["messages"][0]["content"] == "reference text"
        assert payload["messages"][1]["role"] == "assistant"
        assert payload["messages"][1]["content"] == "<style>开心</style>明天就是周五了"
        assert "reference text" not in payload["messages"][1]["content"]

@dosubot dosubot bot added the lgtm label (This PR has been approved by a maintainer) on Mar 23, 2026
@RC-CHN RC-CHN merged commit 04b7618 into AstrBotDevs:master Mar 23, 2026
7 checks passed

Labels

  • area:provider: The bug / feature is about AI Provider, Models, LLM Agent, LLM Agent Runner.
  • lgtm: This PR has been approved by a maintainer
  • size:L: This PR changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

[Bug] MiMo TTS style control is not prepended to the assistant synthesis text as the official documentation requires

2 participants