[Optim] Parallel BOS feature download #8018
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## release/online/20260415 #8018 +/- ##
==========================================================
Coverage ? 72.90%
==========================================================
Files ? 388
Lines ? 54158
Branches ? 8497
==========================================================
Hits ? 39486
Misses ? 11944
Partials ? 2728
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-06-08 19:23:20
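📋 Review summary
PR overview: optimizes MM chunk splitting for multimodal prefill and parallelizes BOS feature downloads.
Scope of changes: fastdeploy/engine/common_engine.py, fastdeploy/envs.py, fastdeploy/utils.py
Impact tags: [Engine] [DataProcessor]
Issues
| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/engine/common_engine.py:64 | The Python balanced split treats a contiguous video patch run as a single atomic block and loses the cut points at grid_thw row boundaries, so long videos may produce an oversized chunk |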
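📝 PR convention check
The title does not follow the convention: [Optim] is not an official tag in the checklist, and because the target branch is release/online/20260415, the Cherry-Pick title format should be used. The description also does not use the required template, and the files listed under "Others" (fastdeploy/model_executor/layers/moe/ep.py, fastdeploy/worker/gpu_model_runner.py) do not appear in this diff, which is inconsistent with the implementation.
Suggested title (can be copied directly):
[Cherry-Pick][Optimization] Balanced MM chunking and parallel BOS feature download(#8018)
Suggested PR description (can be copied directly)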
## Motivation
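Addresses high TTFT in multimodal prefill caused by unbalanced MM chunking and serial downloading of BOS multimodal features.
## Modifications
- `fastdeploy/engine/common_engine.py`: adds `FD_MM_CHUNK_STEP` to control the MM chunk step, and adds a Python balanced-splitting path gated by `FD_MM_BALANCED_CHUNKING=1`; the default path still uses the `get_mm_split_fuse` kernel.
- `fastdeploy/envs.py`: adds the `FD_BOS_DOWNLOAD_PARALLEL` environment variable, which controls the BOS feature download concurrency within a single request.
- `fastdeploy/utils.py`: `download_from_bos` accepts `max_workers`; with multiple links it downloads concurrently through a `ThreadPoolExecutor` and yields results in submission order.
## Usage or Command
- Set `FD_MM_BALANCED_CHUNKING=1` to enable Python balanced MM chunking; use `FD_MM_CHUNK_STEP` to adjust the MM chunk step.
- Set `FD_BOS_DOWNLOAD_PARALLEL=1` to fall back to serial BOS downloads; the default is 8.
## Accuracy Tests
N/A. This PR mainly changes performance optimization and the scheduling/download paths; no accuracy data is provided.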
## Checklist
- [ ] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
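- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall assessment
The direction of the optimization is clear, but the new Python balanced split does not currently reproduce the old kernel's semantics of legal cut points derived from grid_thw rows, which defeats the balancing goal of chunked prefill for video / contiguous visual-token inputs. Fix the cut-point generation logic first, and add unit tests that cover contiguous video patch runs.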
# an image patch span: not (is_patch[p-1] and is_patch[p]).
splittable_mask = np.ones(n + 1, dtype=bool)
if n >= 2:
    inside = is_patch[:-1] & is_patch[1:]
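🔴 Bug: this treats any run of consecutive image_patch_id tokens as one unsplittable atomic span, which drops the legal cut points inside a video at grid_thw row boundaries.
When the processor handles a video, it writes the visual tokens contiguously and uses grid_thw / image_type_ids to represent the multiple frames; the caller of this function also splits grid_thw into multiple [2, h, w] rows. The old kernel builds cut points per row using (h * w) // 4, so a single contiguous visual-token run can be cut at frame-group boundaries. With inside = is_patch[:-1] & is_patch[1:], all of those boundaries become unsplittable, so a long video degenerates into one oversized chunk that exceeds target/mm_chunk_step and breaks the token budget and balancing goal of chunked prefill.
Suggested fix:
Use grid_thw_np to generate the cut points inside visual spans: scan input_ids_np to find each patch run, then mark span_start + cumulative as splittable at the running sums of per_img = (h * w) // 4. Forbid only the cut points that fall inside a single grid_thw row, rather than every cut point inside the whole contiguous patch run.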
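The suggested fix can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function names `tokens_per_grid_row` and `splittable_mask_with_rows` are hypothetical, the `(h * w) // 4` rule is taken from the review comment above, and it assumes the grid_thw rows appear in the same order as the patch tokens and account for all of them.

```python
import numpy as np


def tokens_per_grid_row(grid_thw):
    """Visual tokens per grid_thw row, using the (h * w) // 4 rule quoted in the review."""
    grid = np.asarray(grid_thw, dtype=np.int64).reshape(-1, 3)
    return (grid[:, 1] * grid[:, 2]) // 4


def splittable_mask_with_rows(is_patch, grid_thw):
    """Return a bool mask of length n + 1; mask[p] allows a cut between tokens p-1 and p."""
    is_patch = np.asarray(is_patch, dtype=bool)
    n = is_patch.size
    mask = np.ones(n + 1, dtype=bool)
    if n >= 2:
        # Baseline rule from the diff: no cut between two adjacent patch tokens.
        mask[1:n] = ~(is_patch[:-1] & is_patch[1:])

    patch_pos = np.flatnonzero(is_patch)
    row_ends = np.cumsum(tokens_per_grid_row(grid_thw))
    total = int(row_ends[-1]) if row_ends.size else 0
    if total != patch_pos.size:
        raise ValueError("grid_thw rows do not account for all patch tokens")

    # Re-enable the cut in front of the first token of every row after the first,
    # so that only cuts strictly inside a single row stay forbidden.
    for end in row_ends[:-1]:
        if end < patch_pos.size:
            mask[patch_pos[end]] = True
    return mask


# One text token, two 4-token video rows written back to back, one text token.
is_patch = [False] + [True] * 8 + [False]
mask = splittable_mask_with_rows(is_patch, [[2, 4, 4], [2, 4, 4]])
print(np.flatnonzero(mask).tolist())
```

In this toy input the boundary between the two rows (position 5) becomes a legal cut again, whereas the adjacency-only rule would leave the whole 8-token run atomic.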
CI report generated from the following code (updated every 30 minutes):
1 Required tasks: 6/7 passed
2 Failure details
🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage
Error type: PR issue | Confidence: high
Key logs:
Suggested fix:
Related changes:
# Sequential path: keep behavior identical for single link or when parallel disabled.
if max_workers <= 1 or len(bos_links) <= 1:
    for link in bos_links:
        ok, data = _fetch_one(link)
self.partial_chunked_tokens[1],
2048,
)
# Step dedicated to MM splitting: defaults to the same value as partial_chunked_tokens[1];
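# Option 2: global balanced splitting (Python layer). When enabled, it bypasses the get_mm_split_fuse kernel
# and uses binary search + greedy to pick K cut points among all splittable points so that the maximum chunk length is minimized.
# Enabled with FD_MM_BALANCED_CHUNKING=1; off by default.
use_balanced = os.getenv("FD_MM_BALANCED_CHUNKING", "0") == "1"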
Does this also apply to open-source models such as 45-vl and qwen-vl?
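For context on the snippet under discussion, the "binary search + greedy" selection named in its comments can be sketched as below. This is an assumption-laden illustration rather than the PR's implementation: `balanced_boundaries` is a hypothetical name, `points` is assumed to be the sorted list of splittable positions including both ends of the sequence, and "K" is read as an upper bound on the number of chunks.

```python
def balanced_boundaries(points, max_chunks):
    """Pick chunk boundaries from sorted splittable positions so that the
    largest chunk is as small as possible, using at most max_chunks chunks."""
    if len(points) < 2:
        return list(points)

    def greedy(limit):
        # Extend each chunk as far as possible without exceeding `limit` tokens.
        cuts = [points[0]]
        i, last = 0, len(points) - 1
        while i < last:
            j = i
            while j + 1 <= last and points[j + 1] - points[i] <= limit:
                j += 1
            if j == i:
                return None  # one atomic span is longer than `limit`
            cuts.append(points[j])
            i = j
        return cuts

    # The answer lies between the largest atomic span and the full length.
    lo = max(b - a for a, b in zip(points, points[1:]))
    hi = points[-1] - points[0]
    while lo < hi:
        mid = (lo + hi) // 2
        cuts = greedy(mid)
        if cuts is not None and len(cuts) - 1 <= max_chunks:
            hi = mid  # feasible: try a smaller maximum chunk length
        else:
            lo = mid + 1
    return greedy(lo)


# Splittable positions for a 10-token sequence, split into at most 2 chunks.
print(balanced_boundaries([0, 1, 5, 9, 10], 2))
```

The sketch only depends on the list of splittable positions, so whether it carries over to another model would hinge on how that model's visual tokens map to legal cut points, which this example does not address.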
2160527 into PaddlePaddle:release/online/20260415
Background
Optimizes performance bottlenecks in the multimodal prefill scenario:
Network wait time cannot be hidden, so end-to-end TTFT is high.
Main changes
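1. Parallel BOS feature download (fastdeploy/utils.py + fastdeploy/envs.py)
- download_from_bos is refactored into _fetch_one + ThreadPoolExecutor and yields results in submission order, so callers' ordered assembly logic (e.g. video shards) is unchanged.
- max_workers<=1 takes the original sequential path, with identical behavior.
- FD_BOS_DOWNLOAD_PARALLEL (default 8; set to 1 to fall back to sequential).
Compatibility
- FD_BOS_DOWNLOAD_PARALLEL defaults to 8 and has no effect on single-link requests.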
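
The ordering guarantee described above can be illustrated with a small standalone sketch. It is not the actual `download_from_bos`: `fetch_all_in_order`, `workers_from_env`, and the `fetch_one` callable are hypothetical stand-ins, and the environment variable is read directly here only for illustration (the PR defines it in fastdeploy/envs.py).

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor


def workers_from_env():
    # Default of 8 follows the PR description; direct os.getenv use is an assumption.
    return int(os.getenv("FD_BOS_DOWNLOAD_PARALLEL", "8"))


def fetch_all_in_order(links, fetch_one, max_workers=8):
    """Yield fetch_one(link) for every link, always in the order of `links`."""
    links = list(links)
    if max_workers <= 1 or len(links) <= 1:
        # Sequential path: same behavior as a plain loop.
        for link in links:
            yield fetch_one(link)
        return
    with ThreadPoolExecutor(max_workers=min(max_workers, len(links))) as pool:
        futures = [pool.submit(fetch_one, link) for link in links]
        # Iterating the futures in submission order (not completion order)
        # keeps the output ordered even when later downloads finish first.
        for future in futures:
            yield future.result()


def fake_fetch(link):
    # Stand-in for a network download; earlier links are deliberately slower.
    time.sleep(0.01 * (3 - int(link[-1])))
    return link.upper()


print(list(fetch_all_in_order(["part0", "part1", "part2"], fake_fetch, max_workers=3)))
```

Waiting on futures in submission order means a slow first link delays the first yielded item, but the remaining downloads still proceed in the background, which is where the overlap of network wait time comes from.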