[Optimization] Accelerate the speed of tokenizer. #7544
Jiang-Jia-Jun merged 3 commits into PaddlePaddle:develop
Conversation
Thanks for your contribution!
“liuruian” does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have already signed the CLA but the status is still pending? Let us recheck it.
Codecov Report ❌ Patch coverage is
Additional details and impacted files — coverage diff (develop → #7544):

| Metric | develop | #7544 |
|---|---|---|
| Coverage | ? | 71.71% |
| Files | ? | 419 |
| Lines | ? | 57788 |
| Branches | ? | 9063 |
| Hits | ? | 41445 |
| Misses | ? | 13522 |
| Partials | ? | 2821 |

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
PaddlePaddle-bot left a comment
🤖 AI Code Review | 2026-04-23 21:14:15
📋 Review Summary
PR overview: switches tokenizer preprocessing from the two-step .tokenize() + .convert_tokens_to_ids() to a single .encode() call, keeping the old path for ernie4_5 as a workaround; preprocessing is 1.24x faster in the long-input, single-concurrency scenario.
Scope of change: fastdeploy/input/base_processor.py (core logic), tests/input/test_text_processor.py (added tests)
Impact tag: DataProcessor
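The core change summarized above can be sketched with a toy whitespace tokenizer (a hypothetical stand-in — `ToyTokenizer`, its vocab, and `msg` are illustrations, not the real FastDeploy tokenizer):

```python
# Minimal sketch of the optimization: replace the two-step
# .tokenize() + .convert_tokens_to_ids() with a single .encode() call.
# ToyTokenizer is a hypothetical stand-in for the real tokenizer.

class ToyTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def tokenize(self, text):
        return text.split()

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab[t] for t in tokens]

    def encode(self, text, add_special_tokens=False):
        # One pass over the text instead of tokenize + id lookup.
        return [self.vocab[t] for t in text.split()]

tok = ToyTokenizer({"hello": 1, "world": 2})
msg = "hello world"

# Old path (two steps):
old_ids = tok.convert_tokens_to_ids(tok.tokenize(msg))
# New path (one step):
new_ids = tok.encode(msg, add_special_tokens=False)
assert old_ids == new_ids == [1, 2]
```

The speedup comes from avoiding the intermediate token-string list; real tokenizers can fuse tokenization and id lookup inside `encode()`.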
Issues
| Level | File | Summary |
|---|---|---|
| 🔴 Bug | base_processor.py:171-172 | When hasattr(token_ids, "input_ids") triggers, token_ids["input_ids"] is used; if the object has the attribute but no __getitem__, this raises a TypeError |
| 🟡 Suggestion | base_processor.py:173 | The ndim check only covers tensors; a native Python nested list[list[int]] is not handled and may return [[1,2,3]] instead of [1,2,3] |
| 🟡 Suggestion | base_processor.py:167 | The ernie4_5 hang workaround has no linked issue; add one for future tracking |
| 🟡 Suggestion | tests/input/test_text_processor.py | The new tests only cover the three return types of the else branch; there is no unit test for the tokenizer_type == "ernie4_5" path |
Overall assessment
The optimization is well-motivated, the performance gain is backed by data, and multiple encode() return formats are handled for compatibility. However, the inconsistent hasattr + [] access logic carries a latent exception risk, and the ernie4_5 branch lacks a corresponding test; please fix before merging.
```python
else:
    token_ids = self.tokenizer.encode(spliced_message, add_special_tokens=False)
    if hasattr(token_ids, "input_ids") or (isinstance(token_ids, dict) and "input_ids" in token_ids):
        token_ids = token_ids["input_ids"]
```
🔴 Bug: hasattr check is inconsistent with [] subscript access
When hasattr(token_ids, "input_ids") is True but the object has the attribute without implementing __getitem__ (i.e. it is not dict-like), token_ids["input_ids"] raises a TypeError.
Handle attribute access and dict access separately:

```python
if isinstance(token_ids, dict) and "input_ids" in token_ids:
    token_ids = token_ids["input_ids"]
elif hasattr(token_ids, "input_ids"):
    token_ids = token_ids.input_ids  # attribute access instead of subscripting
```

```python
token_ids = self.tokenizer.encode(spliced_message, add_special_tokens=False)
if hasattr(token_ids, "input_ids") or (isinstance(token_ids, dict) and "input_ids" in token_ids):
    token_ids = token_ids["input_ids"]
if hasattr(token_ids, "ndim") and token_ids.ndim > 1:
```
🟡 Suggestion: native nested Python lists are not handled
hasattr(token_ids, "ndim") and token_ids.ndim > 1 only handles the 2D case for numpy/paddle/torch tensors. If input_ids is a plain Python list[list[int]] (e.g. [[1, 2, 3]]), there is no ndim attribute, the check is skipped, and the later isinstance(token_ids, list) returns [[1, 2, 3]] as-is, breaking downstream processing.
Add after the ndim check:

```python
if isinstance(token_ids, list) and len(token_ids) == 1 and isinstance(token_ids[0], list):
    token_ids = token_ids[0]
```

```python
tokens = self.tokenizer.tokenize(spliced_message)
token_ids = self.tokenizer.convert_tokens_to_ids(tokens)
if self.tokenizer_type == "ernie4_5":
    # NOTE: ernie4_5 tokenizer will hang when meet long input when use .encode()
```
🟡 Suggestion: link an issue for the ernie4_5 hang
The comment only says the tokenizer "will hang" without the cause, affected versions, or an expected fix date, so future maintainers cannot tell when it is safe to remove this workaround.
Add an issue link, for example:

```python
# NOTE: ernie4_5 tokenizer hangs on long input with .encode(); see issue #XXXX
# TODO: remove this branch once the upstream issue is fixed
```
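The review's two suggested fixes can be folded into one normalization step; a minimal sketch under the assumption that `.encode()` may return a dict, an object carrying an `input_ids` attribute, or a (possibly nested) list (`normalize_token_ids` and `HasAttrOnly` are hypothetical names, and plain lists stand in for tensor returns):

```python
# Hypothetical helper combining both review suggestions: normalize the
# possible return shapes of .encode() into a flat list of ids.

def normalize_token_ids(token_ids):
    # dict-like return, e.g. {"input_ids": [...]}
    if isinstance(token_ids, dict) and "input_ids" in token_ids:
        token_ids = token_ids["input_ids"]
    # object return with an input_ids attribute: attribute access, not []
    elif hasattr(token_ids, "input_ids"):
        token_ids = token_ids.input_ids
    # tensor-like values with a batch dimension expose ndim
    if hasattr(token_ids, "ndim") and token_ids.ndim > 1:
        token_ids = token_ids[0]
    # plain nested Python list, e.g. [[1, 2, 3]]
    if isinstance(token_ids, list) and len(token_ids) == 1 and isinstance(token_ids[0], list):
        token_ids = token_ids[0]
    return token_ids

class HasAttrOnly:
    """Object with an input_ids attribute but no __getitem__."""
    input_ids = [1, 2, 3]

assert normalize_token_ids({"input_ids": [1, 2, 3]}) == [1, 2, 3]
assert normalize_token_ids(HasAttrOnly()) == [1, 2, 3]   # would TypeError with []
assert normalize_token_ids([[1, 2, 3]]) == [1, 2, 3]
assert normalize_token_ids([1, 2, 3]) == [1, 2, 3]
```

Checking `isinstance(..., dict)` before `hasattr` matters: a dict never has an `input_ids` attribute, and an attribute-bearing object may not support subscripting, so each branch uses only the access style its type guarantees.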
❌ Cherry-pick failed: conflicts detected when cherry-picking to
* Change default workers and max-concurrency when launch api-server
* Change convert_tokens_to_ids to encode to get token ids
---------
Co-authored-by: zhangxiao35 <zhangxiao35@baidu.com>
Co-authored-by: “liuruian” <liuruian@baidu.com>
…le#7544)" (PaddlePaddle#7630) This reverts commit df4ed5a.
…ddlePaddle#7544)" (#…" (PaddlePaddle#7634) This reverts commit 701d268.
* Change default workers and max-concurrency when launch api-server
* Change convert_tokens_to_ids to encode to get token ids
---------
Co-authored-by: zhangxiao35 <zhangxiao35@baidu.com>
Co-authored-by: “liuruian” <liuruian@baidu.com>
…le#7544)" (PaddlePaddle#7630) This reverts commit 978b813.
…ddlePaddle#7544)" (#…" (PaddlePaddle#7634) This reverts commit 4ca1064.
Motivation
In the long-input, single-concurrency scenario, replacing .convert_tokens_to_ids with .encode speeds up preprocessing by 1.24x.
For GLM4.5-Air in the same scenario, TPS improves by 1.04x.
Modifications
Replace tokenizer.convert_tokens_to_ids with tokenizer.encode.
Accuracy Tests
Output token ids are identical before and after the change.
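That equivalence check can be expressed as a small parity test; a sketch, assuming a hypothetical `check_parity` helper and a whitespace tokenizer as a stand-in for the real one:

```python
# Sketch of the accuracy test: the single-step encode() path must yield
# exactly the same ids as the old two-step path on every input.
# WhitespaceTokenizer is a hypothetical stand-in for the real tokenizer.

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def tokenize(self, text):
        return text.split()

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab[t] for t in tokens]

    def encode(self, text, add_special_tokens=False):
        return [self.vocab[t] for t in text.split()]

def check_parity(tokenizer, texts):
    """Return True when both preprocessing paths agree on every input."""
    for text in texts:
        old_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
        new_ids = tokenizer.encode(text, add_special_tokens=False)
        if old_ids != new_ids:
            return False
    return True

tok = WhitespaceTokenizer({"a": 0, "b": 1, "c": 2})
assert check_parity(tok, ["a b c", "c b", "a"])
```

With the real tokenizer the same loop would run over a corpus of long inputs, since the long-input case is where the two paths are most likely to diverge.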