
[Optimization] Accelerate the speed of tokenizer.#7544

Merged
Jiang-Jia-Jun merged 3 commits into PaddlePaddle:develop from K11OntheBoat:develop_tokenzier
Apr 24, 2026

Conversation

@K11OntheBoat
Collaborator

Motivation

For long inputs at single concurrency, replacing .convert_tokens_to_ids with .encode speeds up preprocessing by 1.24x.
For GLM4.5-Air, TPS improves by 1.04x in the long-input, single-concurrency scenario.

Modifications

Replace tokenizer.convert_tokens_to_ids with tokenizer.encode.

Accuracy Tests

The output token ids are identical before and after the change.
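The substitution and its accuracy check can be sketched with a toy tokenizer (ToyTokenizer is a hypothetical stand-in, not FastDeploy's actual class): the old path tokenizes and then looks up ids in two passes, while encode fuses both into one call, which is where the speedup comes from.

```python
class ToyTokenizer:
    """Hypothetical whitespace tokenizer illustrating both APIs."""

    def __init__(self, vocab):
        self.vocab = vocab  # token -> id

    def tokenize(self, text):
        return text.split()

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab[t] for t in tokens]

    def encode(self, text, add_special_tokens=False):
        # Single-pass path: real tokenizers fuse tokenization and id lookup here.
        return [self.vocab[t] for t in text.split()]

tok = ToyTokenizer({"hello": 1, "world": 2})
text = "hello world"

# Old two-step path vs. new single-step path:
old_ids = tok.convert_tokens_to_ids(tok.tokenize(text))
new_ids = tok.encode(text, add_special_tokens=False)
assert old_ids == new_ids  # the PR's accuracy criterion: identical token ids
```

With a real Hugging Face-style tokenizer the equality holds as long as `encode` is called with `add_special_tokens=False`, matching what `convert_tokens_to_ids` produced.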

@paddle-bot

paddle-bot Bot commented Apr 21, 2026

Thanks for your contribution!

@paddle-bot paddle-bot Bot added the contributor External developers label Apr 21, 2026
@CLAassistant

CLAassistant commented Apr 21, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
2 out of 3 committers have signed the CLA.

✅ K11OntheBoat
✅ ZhangX-21
❌ “liuruian”


“liuruian” does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

PaddlePaddle-bot

This comment was marked as outdated.


@codecov-commenter

codecov-commenter commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 63.63636% with 4 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@8883757). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| fastdeploy/input/base_processor.py | 63.63% | 2 Missing and 2 partials ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             develop    #7544   +/-   ##
==========================================
  Coverage           ?   71.71%
==========================================
  Files              ?      419
  Lines              ?    57788
  Branches           ?     9063
==========================================
  Hits               ?    41445
  Misses             ?    13522
  Partials           ?     2821
```
| Flag | Coverage Δ |
| --- | --- |
| GPU | 71.71% <63.63%> (?) |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.



@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 AI Code Review | 2026-04-23 21:14:15

📋 Review Summary

PR overview: replaces the two-step tokenizer preprocessing (.tokenize() + .convert_tokens_to_ids()) with a single .encode() call, keeping the old path as a workaround for ernie4_5; preprocessing is 1.24x faster for long inputs at single concurrency.
Scope of changes: fastdeploy/input/base_processor.py (core logic), tests/input/test_text_processor.py (added tests)
Impact tag: DataProcessor


Issues

| Level | File | Summary |
| --- | --- | --- |
| 🔴 Bug | base_processor.py:171-172 | When `hasattr(token_ids, "input_ids")` is true, `token_ids["input_ids"]` is used; an object with the attribute but no `__getitem__` raises `TypeError` |
| 🟡 Suggestion | base_processor.py:173 | The `ndim` check only covers tensors; a native Python `list[list[int]]` is not handled and may return `[[1,2,3]]` instead of `[1,2,3]` |
| 🟡 Suggestion | base_processor.py:167 | The ernie4_5 hang workaround has no linked issue; add one for future tracking |
| 🟡 Suggestion | tests/input/test_text_processor.py | The new tests only cover the three return types of the else branch; the `tokenizer_type == "ernie4_5"` path has no unit test |

Overall Assessment

The optimization is well-motivated, the performance gain is backed by data, and the compatibility handling covers several encode return formats. However, the inconsistent hasattr + [] access carries a latent exception risk, and the ernie4_5 branch lacks a corresponding test. Please fix before merging.

```python
else:
    token_ids = self.tokenizer.encode(spliced_message, add_special_tokens=False)
    if hasattr(token_ids, "input_ids") or (isinstance(token_ids, dict) and "input_ids" in token_ids):
        token_ids = token_ids["input_ids"]
```

🔴 Bug: the hasattr check is inconsistent with the [] subscript access

When hasattr(token_ids, "input_ids") is True but the object has an input_ids attribute without implementing __getitem__ (i.e. it is not dict-like), token_ids["input_ids"] raises TypeError.

Suggestion: handle attribute access and dict access separately:

```python
if isinstance(token_ids, dict) and "input_ids" in token_ids:
    token_ids = token_ids["input_ids"]
elif hasattr(token_ids, "input_ids"):
    token_ids = token_ids.input_ids  # attribute access instead of subscripting
```
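The failure mode can be reproduced in a few lines (EncodingLike is a hypothetical stand-in for a tokenizer return object, not a real FastDeploy or Hugging Face class): the object exposes input_ids as an attribute but is not subscriptable, so the original check path raises TypeError.

```python
class EncodingLike:
    """Hypothetical encode() result: has input_ids, but no __getitem__."""

    def __init__(self, ids):
        self.input_ids = ids

token_ids = EncodingLike([1, 2, 3])

try:
    # The PR's original check: hasattr is True, so subscripting is attempted...
    if hasattr(token_ids, "input_ids") or (isinstance(token_ids, dict) and "input_ids" in token_ids):
        token_ids = token_ids["input_ids"]  # raises: object is not subscriptable
except TypeError:
    # ...the reviewer's fix: fall back to attribute access for non-dict objects.
    token_ids = token_ids.input_ids

print(token_ids)  # [1, 2, 3]
```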

```python
token_ids = self.tokenizer.encode(spliced_message, add_special_tokens=False)
if hasattr(token_ids, "input_ids") or (isinstance(token_ids, dict) and "input_ids" in token_ids):
    token_ids = token_ids["input_ids"]
if hasattr(token_ids, "ndim") and token_ids.ndim > 1:
```

🟡 Suggestion: native nested Python lists are not handled

hasattr(token_ids, "ndim") and token_ids.ndim > 1 only handles the 2D case for numpy/paddle/torch tensors. If input_ids is a plain Python list[list[int]] (e.g. [[1, 2, 3]]), it has no ndim attribute, so the check is skipped; the later isinstance(token_ids, list) then returns [[1, 2, 3]] directly, which breaks downstream processing.

Suggestion: after the ndim check, add:

```python
if isinstance(token_ids, list) and len(token_ids) == 1 and isinstance(token_ids[0], list):
    token_ids = token_ids[0]
```
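Both review suggestions can be consolidated into one normalization helper — a sketch only, and normalize_encode_output is a hypothetical name, not a function from the PR. It covers the dict result, the attribute-style result, tensors with ndim > 1, and the plain nested list the ndim check misses.

```python
def normalize_encode_output(token_ids):
    """Normalize the various shapes encode() can return into a flat list of ids."""
    # Dict-like result (e.g. {"input_ids": [...]}): prefer key access.
    if isinstance(token_ids, dict) and "input_ids" in token_ids:
        token_ids = token_ids["input_ids"]
    # Object result exposing input_ids as an attribute (not subscriptable).
    elif hasattr(token_ids, "input_ids"):
        token_ids = token_ids.input_ids
    # Tensor case: numpy / paddle / torch arrays expose ndim.
    if hasattr(token_ids, "ndim") and token_ids.ndim > 1:
        token_ids = token_ids[0]
    # Native nested Python list, which has no ndim attribute.
    if isinstance(token_ids, list) and len(token_ids) == 1 and isinstance(token_ids[0], list):
        token_ids = token_ids[0]
    return token_ids

assert normalize_encode_output({"input_ids": [1, 2, 3]}) == [1, 2, 3]
assert normalize_encode_output([[1, 2, 3]]) == [1, 2, 3]
assert normalize_encode_output([1, 2, 3]) == [1, 2, 3]
```

Centralizing the checks in one place would also make the ernie4_5 branch and the else branch easier to test against the same set of return shapes.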

```python
tokens = self.tokenizer.tokenize(spliced_message)
token_ids = self.tokenizer.convert_tokens_to_ids(tokens)
if self.tokenizer_type == "ernie4_5":
# NOTE: ernie4_5 tokenizer will hang when meet long input when use .encode()
```

🟡 Suggestion: link the ernie4_5 hang workaround to an issue

The comment only says the tokenizer "hangs", without the cause, affected versions, or an expected fix timeline, so future maintainers cannot tell when the workaround can be safely removed.

Suggestion: add an issue link, for example:

```python
# NOTE: ernie4_5 tokenizer hangs on long input with .encode(); see issue #XXXX
# TODO: remove this branch once the upstream issue is fixed
```

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit 978b813 into PaddlePaddle:develop Apr 24, 2026
40 of 46 checks passed
@EmmonsCurse
Collaborator

❌ Cherry-pick failed: Conflicts detected when cherry-picking to release/2.6. Please resolve manually.

Jiang-Jia-Jun added a commit that referenced this pull request Apr 27, 2026
xiaoguoguo626807 pushed a commit to xiaoguoguo626807/FastDeploy that referenced this pull request May 7, 2026
* Change default workers and max-concurrency when launch api-server

* Change convert_tokens_to_ids to encode to get token ids

---------

Co-authored-by: zhangxiao35 <zhangxiao35@baidu.com>
Co-authored-by: “liuruian” <liuruian@baidu.com>
xiaoguoguo626807 pushed a commit to xiaoguoguo626807/FastDeploy that referenced this pull request May 7, 2026
xiaoguoguo626807 pushed a commit to xiaoguoguo626807/FastDeploy that referenced this pull request May 7, 2026
sunlei1024 pushed a commit to sunlei1024/FastDeploy that referenced this pull request May 7, 2026
sunlei1024 pushed a commit to sunlei1024/FastDeploy that referenced this pull request May 7, 2026
sunlei1024 pushed a commit to sunlei1024/FastDeploy that referenced this pull request May 7, 2026


7 participants