Skip to content
This repository was archived by the owner on May 13, 2026. It is now read-only.

fix(toolcall): eliminate strings.ToLower panics from Unicode case folding#460

Merged
CJackHwang merged 3 commits into
CJackHwang:devfrom
waiwaic:main
May 9, 2026
Merged

fix(toolcall): eliminate strings.ToLower panics from Unicode case folding#460
CJackHwang merged 3 commits into
CJackHwang:devfrom
waiwaic:main

Conversation

@waiwaic
Copy link
Copy Markdown
Contributor

@waiwaic waiwaic commented May 9, 2026

Summary

Replace all strings.ToLower(text) usage in the toolcall scanner with ASCII case-insensitive matching helpers (hasASCIIPrefixFoldAt, indexASCIIFold, hasDSMLPrefix) to prevent slice bounds out of range panics when Unicode characters change byte length after case folding.

Root cause: Code created a strings.ToLower(text) copy, found byte positions in that copy, then used those positions to slice the original text. When case folding changed byte lengths (e.g., Turkish İ U+0130 → i + combining dot: 2 bytes → 3 bytes), offsets valid in the lowercased copy became out-of-bounds in the original.

Impact: Panic at consumeToolMarkupNamePrefixOnce (toolcalls_scan.go:240) when processing tool markup containing certain Unicode characters.

Changes

File Change
internal/toolcall/toolcalls_scan.go Remove 5 lower usages, add hasDSMLPrefix helper
internal/toolcall/toolcalls_parse_markup.go Remove 3 lower usages, add indexASCIIFold helper
internal/toolcall/toolcalls_markup.go SanitizeLooseCDATA lower removal
internal/toolcall/toolcalls_parse.go updateCDATAStateForStrip lower removal
internal/toolcall/tool_prompt.go Align DSML pipe characters with tool call spec
internal/toolcall/tool_prompt_test.go Fix pre-existing test character mismatch
internal/promptcompat/prompt_build_test.go Align test expectations with DSML fullwidth pipe
internal/httpapi/claude/handler_util_test.go Align test expectations with DSML fullwidth pipe

Test Plan

  • go test ./internal/toolcall/... — all 55+ tests pass
  • go test ./... — full suite passes
  • bash scripts/lint.sh — 0 issues
  • npm run build --prefix webui — builds successfully

CJackHwang and others added 2 commits May 8, 2026 17:13
Avoid lowercasing ignored XML tails in toolcall
…ding

Replace all strings.ToLower usage with ASCII case-insensitive matching
(hasASCIIPrefixFoldAt, indexASCIIFold, hasDSMLPrefix) to prevent slice
bounds errors when Unicode characters change byte length after case
folding (e.g., Turkish İ U+0130 → i + combining dot: 2 bytes → 3 bytes).

Root cause: code created a strings.ToLower(text) copy, found byte
positions in that copy, then used those positions to slice the
original text — byte offsets that were valid in the lowercased copy
became out-of-bounds in the original when case folding changed byte
lengths.

Files changed:
- toolcalls_scan.go: remove 5 lower usages, add hasDSMLPrefix
- toolcalls_parse_markup.go: remove 3 lower usages, add indexASCIIFold
- toolcalls_markup.go: SanitizeLooseCDATA lower removal
- toolcalls_parse.go: updateCDATAStateForStrip lower removal
- tool_prompt.go: align DSML pipe characters with tool call spec
- tool_prompt_test.go: fix pre-existing test character mismatch
@vercel
Copy link
Copy Markdown

vercel Bot commented May 9, 2026

@waiwaic is attempting to deploy a commit to the cjack's projects Team on Vercel.

A member of the Team first needs to authorize it.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1e00e482a6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread internal/toolcall/tool_prompt.go Outdated
@waiwaic
Copy link
Copy Markdown
Contributor Author

waiwaic commented May 9, 2026

感谢您的项目!这个 PR 修复了 strings.ToLower 在 Unicode 字符(如土耳其语 İ)上可能导致字节长度变化引发的 panic(slice bounds out of range),并通过 ASCII 大小写不敏感匹配替代了原来的小写处理,预期能大幅改善现有 toolcall 解析中的乱码问题和部分信息泄露。

如果方便的话,希望能给我 dev 分支的推送权限,这样后续贡献修复和功能时可以更高效,不必每次都走 fork PR 流程。非常感谢!

@vercel
Copy link
Copy Markdown

vercel Bot commented May 9, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
ds2api Ready Ready Preview, Comment May 9, 2026 8:53am

The closing tag format was <|/DSML|tag> but must be </|DSML|tag>.
The scanner's closing-tag detection checks text[1] == '/', so the
slash must come immediately after '<', before the first full-width
pipe (U+FF5C). Tags like <|/DSML|tool_calls> would not set
closing=true and would not match any tool markup name.

Files fixed:
- internal/toolcall/tool_prompt.go: all closing tags
- internal/promptcompat/prompt_build_test.go: 1 test expectation
@CJackHwang
Copy link
Copy Markdown
Owner

感谢您的项目!这个 PR 修复了 strings.ToLower 在 Unicode 字符(如土耳其语 İ)上可能导致字节长度变化引发的 panic(slice bounds out of range),并通过 ASCII 大小写不敏感匹配替代了原来的小写处理,预期能大幅改善现有 toolcall 解析中的乱码问题和部分信息泄露。

如果方便的话,希望能给我 dev 分支的推送权限,这样后续贡献修复和功能时可以更高效,不必每次都走 fork PR 流程。非常感谢!

感谢您的PR
本项目dev分支走流程是有必要的 尤其是当前在完全AI开发的情况下是保证代码质量稳定可控的重要方式。我需要对项目全部PR重定向到dev分支,且此时codex自动初审+我的人工质量评估。在合并多个PR或者进行review的修复,统一推送到main分支时候会触发codex的二次审核,可以最大程度保证代码的高质量,以及版本回溯

正如上面的review,对于标签格式的误改动可以提前发现

@CJackHwang CJackHwang merged commit 239c4fa into CJackHwang:dev May 9, 2026
7 checks passed
@waiwaic
Copy link
Copy Markdown
Contributor Author

waiwaic commented May 9, 2026

hasASCIIPrefixFoldAt和hasDSMLPrefix函数有重复,可以提出来合并一起

CJackHwang added a commit that referenced this pull request May 10, 2026
PR #460 introduced fullwidth pipe characters (|) in DSML tool call formatting
to improve parsing robustness, but models exposed to these fullwidth pipes in
system prompts exhibit significantly higher rates of tool output hallucinations.
Reverting to halfwidth pipes (|) drastically reduces tokenizer/perplexity-driven
hallucinations while retaining the existing confusable-hardening in the parser.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants