
[Feature] Add server-level token limits and prompt truncation control#7842

Merged
Jiang-Jia-Jun merged 27 commits into PaddlePaddle:develop from luukunn:length
Jun 2, 2026
Conversation

Collaborator

@luukunn luukunn commented May 18, 2026

Motivation

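This PR adds unified, server-level defaults for the length parameters, so that generation-length behavior can be controlled through server configuration even when a request does not pass the request-level parameters explicitly. It also adds an input token length limit that rejects over-long requests early.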

Modifications

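  • Add a server-level length control config, ServingLimitsConfig, attached to FDConfig for unified management.
  • Add the following server-level parameters to the CLI / config options:
    • max_completion_tokens
    • reasoning_max_tokens
    • response_max_tokens
    • min_completion_tokens
    • input_max_tokens
  • During initialization of async_llm, common_engine, and engine_client, inject the server-level default length config into data_processor.
  • Update the text and multimodal request handling logic:
    • When a request does not explicitly set max_tokens, default to the server-level max_completion_tokens, bounded by the remaining context length;
    • When a request explicitly sets max_tokens, the value is bounded by both the server-level cap and the remaining context length;
    • reasoning_max_tokens / response_max_tokens are constrained to not exceed the effective max_tokens;
    • min_tokens follows the rule max(server_value, request_value), and an error is raised if the result exceeds max_tokens.
  • Add input_max_tokens validation:
    • Check the input length before the prompt is truncated;
    • Reject the request outright when the input token count exceeds input_max_tokens.
  • Adjust the default max_tokens handling in engine / engine_client:
    • If max_completion_tokens is configured, use it as the default generation length;
    • Otherwise keep the existing default behavior based on max_model_len.
  • Add parameter documentation in English and Chinese:
    • docs/parameters.md
    • docs/zh/parameters.md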
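
A minimal sketch of the "check input length before truncation" ordering described above. The function name `check_and_truncate_prompt` is hypothetical and not taken from this PR; the `truncate_prompt_tokens` switch mirrors the CLI flag quoted later in the review, and keeping the first `max_model_len` tokens on truncation is an assumption made only for illustration.

```python
from typing import List, Optional


def check_and_truncate_prompt(
    prompt_token_ids: List[int],
    max_model_len: int,
    input_max_tokens: Optional[int] = None,
    truncate_prompt_tokens: bool = True,
) -> List[int]:
    """Hypothetical helper: reject on input_max_tokens first, then handle max_model_len."""
    num_tokens = len(prompt_token_ids)

    # The server-level input limit is checked on the untruncated prompt.
    if input_max_tokens is not None and num_tokens > input_max_tokens:
        raise ValueError(
            f"prompt has {num_tokens} tokens, exceeding input_max_tokens ({input_max_tokens})"
        )

    if num_tokens > max_model_len:
        if not truncate_prompt_tokens:
            raise ValueError(
                f"prompt has {num_tokens} tokens, exceeding max_model_len ({max_model_len})"
            )
        # Assumption: truncation keeps the leading tokens.
        return prompt_token_ids[:max_model_len]

    return prompt_token_ids
```

Because the limit is checked first, a 10-token prompt with `input_max_tokens=9` and `max_model_len=8` is rejected rather than quietly truncated to 8 tokens.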

Usage or Command

Example launch arguments:

--max-completion-tokens 1024 \
--reasoning-max-tokens 512 \
--response-max-tokens 512 \
--min-completion-tokens 1 \
--input-max-tokens 4096

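Behavior:

  • When a request does not set max_tokens, the server-level max_completion_tokens is used by default, bounded by the remaining context length;
  • When a request sets max_tokens, the final value is limited to min(request value, server-level cap, remaining context length);
  • When a request does not set reasoning_max_tokens / response_max_tokens, the server-level defaults can be used;
  • The final reasoning_max_tokens / response_max_tokens never exceed max_tokens;
  • The final min_tokens is the larger of the server-side setting and the request value, and an error is raised if it exceeds max_tokens;
  • When the number of input prompt tokens exceeds input_max_tokens, the request is rejected outright;
  • When the input exceeds max_model_len, an error is raised.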
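
The max_tokens rules above can be sketched as follows. This is an illustration of the stated min(...) rule only, not the code in base_processor.py; the function names `resolve_max_tokens` and `cap_sub_limit` are hypothetical, and "remaining context length" is assumed to mean `max_model_len - prompt_len`.

```python
from typing import Optional


def resolve_max_tokens(
    request_max_tokens: Optional[int],
    server_max_completion_tokens: Optional[int],
    prompt_len: int,
    max_model_len: int,
) -> int:
    """Hypothetical helper: min(request value, server cap, remaining context)."""
    # Assumption: remaining context is whatever the prompt has not used.
    candidates = [max_model_len - prompt_len]
    if server_max_completion_tokens is not None:
        candidates.append(server_max_completion_tokens)
    if request_max_tokens is not None:
        candidates.append(request_max_tokens)
    return min(candidates)


def cap_sub_limit(
    request_value: Optional[int],
    server_default: Optional[int],
    max_tokens: int,
) -> Optional[int]:
    """Hypothetical helper for reasoning_max_tokens / response_max_tokens."""
    # Fall back to the server default only when the request omits the field.
    value = request_value if request_value is not None else server_default
    if value is None:
        return None
    # The effective value never exceeds the resolved max_tokens.
    return min(value, max_tokens)
```

With the example launch arguments (`--max-completion-tokens 1024`, `--reasoning-max-tokens 512`), a request that omits both fields on a 100-token prompt and an 8192-token context would resolve to `max_tokens=1024` and `reasoning_max_tokens=512` under this sketch.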

Accuracy Tests

This PR does not modify model forward computation or kernel behavior, so accuracy tests are not affected.

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings May 18, 2026 06:33

paddle-bot Bot commented May 18, 2026

Thanks for your contribution!

Contributor

Copilot AI left a comment


Pull request overview

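This PR introduces several server-side "default token length limit" settings (max_completion_tokens / reasoning_max_tokens / response_max_tokens / min_completion_tokens / input_max_tokens) that allow server-level defaults to be set via the CLI. The defaults apply when a request omits the corresponding field, and requests exceeding input_max_tokens are rejected.

Changes:

  • Add 5 length-related parameters to EngineArgs/ModelConfig and expose them in the CLI and docs
  • Add set_server_defaults to BaseDataProcessor and call it at each entry point (engine_client / async_llm / engine / common_engine) to sync the server defaults
  • Add "reject over-long input" and "take the minimum of user value / server default / context limit" merge logic to base_processor.py and multimodal_processor.py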
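
A rough sketch of the injection pattern this overview describes: a config object holding the five limits and a processor that receives it through `set_server_defaults`. The classes `ServerTokenDefaults` and `SketchProcessor` are hypothetical stand-ins, not the PR's ServingLimitsConfig or BaseDataProcessor, and the field defaults (None, with 1 for the minimum) are assumed from the "default None / 1" note in the per-file summary.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ServerTokenDefaults:
    """Hypothetical stand-in for the server-level limits; defaults are assumptions."""

    max_completion_tokens: Optional[int] = None
    reasoning_max_tokens: Optional[int] = None
    response_max_tokens: Optional[int] = None
    min_completion_tokens: int = 1
    input_max_tokens: Optional[int] = None


@dataclass
class SketchProcessor:
    """Hypothetical processor that stores the defaults for later request handling."""

    server_defaults: ServerTokenDefaults = field(default_factory=ServerTokenDefaults)

    def set_server_defaults(self, defaults: ServerTokenDefaults) -> None:
        # Called once by each entry point right after the processor is created.
        self.server_defaults = defaults


processor = SketchProcessor()
processor.set_server_defaults(
    ServerTokenDefaults(max_completion_tokens=1024, input_max_tokens=4096)
)
```

Injecting after construction keeps the processor constructor unchanged, at the cost that every entry point has to remember to make the call.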

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| fastdeploy/engine/args_utils.py | Adds 5 server-level token length parameters and their CLI options |
| fastdeploy/config.py | ModelConfig initializes the new fields (default None / 1) to accept the new parameters |
| fastdeploy/input/base_processor.py | Adds set_server_defaults and the length merge/reject logic in process_request_dict |
| fastdeploy/input/multimodal_processor.py | Adds the same length merge/reject logic to the multimodal processing flow |
| fastdeploy/entrypoints/engine_client.py | Calls set_server_defaults and falls back to max_completion_tokens when max_tokens is missing |
| fastdeploy/engine/engine.py | Same as above: injects server defaults and prefers max_completion_tokens as the default |
| fastdeploy/engine/common_engine.py | Injects server defaults after creating data_processor |
| fastdeploy/engine/async_llm.py | Injects server defaults after creating data_processor |
| docs/parameters.md / docs/zh/parameters.md | Documents the 5 new parameters |

Comment thread fastdeploy/input/base_processor.py Outdated
Comment thread fastdeploy/input/multimodal_processor.py Outdated
@luukunn luukunn changed the title from "Length" to "[Feature] Add server-level token length defaults and input token limit" May 18, 2026
PaddlePaddle-bot

This comment was marked as outdated.


PaddlePaddle-bot commented May 18, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-01 12:56:31

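The CI report is generated from the following code (refreshed every 30 minutes):


1 Task overview

All Required tasks passed ✅; the PR can be merged.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 42(0) | 42 | 39 | 3 | 0 | 0 | 0 |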

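2 Task status summary

2.1 Required tasks: 10/10 passed

Required tasks block merging; failures must be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
| | The remaining 10 required tasks all passed | - | - | - | - | - |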

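2.2 Optional tasks: 29/32 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| ❌ | Run iluvatar Tests / run_iluvatar_cases | 1m45s | Job | - |
| ❌ | Trigger Jenkins for PR | 16s | Job | - |
| ❌ | CI_HPU | 1h4m | Job | - |
| | The remaining 29 optional tasks passed | - | - | - |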

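3 Failure details (required only)

No required tasks failed.

💡 Three optional tasks currently fail (ILUVATAR-CI, CI_METAX, CI_HPU). All are optional and do not block merging. The failures look like environment issues (custom container error, Jenkins trigger failure, non-zero exit code in the HPU environment); a rerun is recommended to confirm.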


Copilot AI review requested due to automatic review settings May 18, 2026 07:17


Copilot AI review requested due to automatic review settings May 18, 2026 07:54
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Comment thread fastdeploy/input/base_processor.py Outdated
Comment thread fastdeploy/input/multimodal_processor.py Outdated
Comment thread fastdeploy/engine/args_utils.py Outdated
Comment on lines +830 to +837
model_group.add_argument(
    "--truncate-prompt-tokens",
    type=lambda x: x.lower() in ("true", "1", "yes"),
    default=EngineArgs.truncate_prompt_tokens,
    help="Whether to truncate prompts that exceed max_model_len. "
    "If True (default), prompts are silently truncated. "
    "If False, a ValueError is raised.",
)
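
A small standalone sketch of how the `type=` callable in the snippet above behaves under argparse. The parser here is a plain `argparse.ArgumentParser`, not FastDeploy's `model_group`, and `default=True` is assumed from the help text ("If True (default)") rather than read from `EngineArgs`.

```python
import argparse


def parse_bool(value: str) -> bool:
    # Same predicate as the lambda in the reviewed snippet.
    return value.lower() in ("true", "1", "yes")


parser = argparse.ArgumentParser()
parser.add_argument(
    "--truncate-prompt-tokens",
    type=parse_bool,
    default=True,  # Assumption: taken from the help text, not from EngineArgs.
)

args_default = parser.parse_args([])
args_false = parser.parse_args(["--truncate-prompt-tokens", "false"])
# Any string outside the accepted set maps to False, including typos.
args_typo = parser.parse_args(["--truncate-prompt-tokens", "ture"])
```

One consequence of this predicate is that a misspelled value such as `ture` parses as False without any error, which would switch the server from silent truncation to raising a ValueError.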
Comment thread fastdeploy/entrypoints/engine_client.py
Comment thread fastdeploy/input/base_processor.py Outdated

Copilot AI review requested due to automatic review settings May 28, 2026 13:07


Copilot AI review requested due to automatic review settings May 28, 2026 13:54
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.

)
# Create data processor
self.data_processor = self.input_processor.create_processor()
self.data_processor.set_server_defaults(cfg.serving_limits_config)

    enable_mm_runtime=self.cfg.enable_mm_runtime,
)
self.data_processor = self.input_processor.create_processor()
self.data_processor.set_server_defaults(self.cfg.serving_limits_config)

)
self.enable_logprob = self.fd_config.model_config.enable_logprob
self.data_processor = input_processor.create_processor()
self.data_processor.set_server_defaults(self.fd_config.serving_limits_config)

Copilot AI review requested due to automatic review settings May 29, 2026 06:50
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 5 comments.

)
self.enable_logprob = self.fd_config.model_config.enable_logprob
self.data_processor = input_processor.create_processor()
self.data_processor.set_server_defaults(self.fd_config.serving_limits_config)

    enable_mm_runtime=self.cfg.enable_mm_runtime,
)
self.data_processor = self.input_processor.create_processor()
self.data_processor.set_server_defaults(self.cfg.serving_limits_config)

)
# Create data processor
self.data_processor = self.input_processor.create_processor()
self.data_processor.set_server_defaults(cfg.serving_limits_config)
Comment on lines +515 to +517
if effective_min > max_tokens:
    raise ValueError(f"min_tokens ({effective_min}) must not exceed max_tokens ({max_tokens})")
request["min_tokens"] = effective_min
Comment on lines +470 to +472
if effective_min > max_tokens:
    raise ValueError(f"min_tokens ({effective_min}) must not exceed max_tokens ({max_tokens})")
request["min_tokens"] = effective_min
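
The same three-line check is quoted at two locations (lines +515 to +517 and +470 to +472). A sketch of how it might be factored into one shared function, assuming `effective_min` is the max(server_value, request_value) described in the PR body. The name `resolve_min_tokens` is hypothetical, and falling back to 1 when neither side supplies a value is an assumption.

```python
from typing import Optional


def resolve_min_tokens(
    request_min_tokens: Optional[int],
    server_min_completion_tokens: Optional[int],
    max_tokens: int,
) -> int:
    """Hypothetical shared helper: larger of request/server value, validated against max_tokens."""
    provided = [
        value
        for value in (request_min_tokens, server_min_completion_tokens)
        if value is not None
    ]
    # Assumption: default to 1 when neither the request nor the server sets a minimum.
    effective_min = max(provided, default=1)
    if effective_min > max_tokens:
        raise ValueError(
            f"min_tokens ({effective_min}) must not exceed max_tokens ({max_tokens})"
        )
    return effective_min


# Usage on a request dict, mirroring the quoted snippet.
request = {"min_tokens": 5}
request["min_tokens"] = resolve_min_tokens(request.get("min_tokens"), 1, max_tokens=10)
```

Both processors could then call the one function instead of carrying their own copy of the check.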


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-06-01 10:21:23

📋 Review summary

PR overview: adds server-level token length limits and a prompt truncation control setting
Scope of changes: config, engine, entrypoints, input processor, docs, tests
Impact tags: [FDConfig] [Engine] [APIServer] [DataProcessor] [Docs]

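Issues

No new blocking issues found. The status of earlier findings is listed below.

Status of earlier findings

| Finding | Issue | Status |
| --- | --- | --- |
| F1 | Over-long prompts changed from silent truncation to hard rejection (breaking behavior change) | ⚠️ Still present |
| F2 | Length-limit logic in base_processor and multimodal_processor is fully duplicated | ⚠️ Still present |
| F3 | Default max_tokens logic in engine_client duplicates the processor's | ⚠️ Still present |
| F4 | Phantom reasoning_max_tokens/response_max_tokens attributes in test_engine_client | ⚠️ Still present |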

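📝 PR convention check

The PR title [Feature] Add server-level token limits and prompt truncation control is correctly formatted, and the tag matches the changes. The description is structurally complete and includes all required sections: Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist. ✓ Compliant.

Overall assessment

The feature is fully implemented, the logic is correct, and test coverage is adequate. The code duplication and behavior change raised in earlier reviews remain unresolved; a follow-up iteration should extract a shared method to remove the duplicated logic between base_processor and multimodal_processor.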

Collaborator

@LiqinruiG LiqinruiG left a comment


LGTM

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit 42c66a7 into PaddlePaddle:develop Jun 2, 2026
40 of 43 checks passed

6 participants