[Optimization] 移除 num_blocks 上限限制#7241
Conversation
|
Thanks for your contribution! |
There was a problem hiding this comment.
Pull request overview
该 PR 修改了 PaddleDisWorkerProc.initialize_kv_cache() 在 profiling 阶段计算 KV cache block 数量的逻辑,当前变更主要是将 num_blocks_local 的 40000 上限保护逻辑注释掉。与此同时,PR 标题未按 [CLASS]Title 格式填写,且描述中的 Motivation/Modifications 等内容为空,难以判断变更动机与预期影响范围。
Changes:
- 注释掉
num_blocks_local > 40000时的上限截断逻辑与日志输出(worker 进程侧)。 - 将原本的安全保护逻辑保留为“注释代码”,而非通过配置/开关控制。
| num_blocks_local = int(available_kv_cache_memory // model_block_memory_used) | ||
| # NOTE(liuzichang): Too many block will lead to illegal memory access | ||
| # We will develop dynamic limits in future. | ||
| if num_blocks_local > 40000: | ||
| logger.info(f"------- Reset num_blocks_local {num_blocks_local} to 40000") | ||
| num_blocks_local = min(40000, num_blocks_local) | ||
| # # NOTE(liuzichang): Too many block will lead to illegal memory access | ||
| # # We will develop dynamic limits in future. | ||
| # if num_blocks_local > 40000: | ||
| # logger.info(f"------- Reset num_blocks_local {num_blocks_local} to 40000") | ||
| # num_blocks_local = min(40000, num_blocks_local) |
There was a problem hiding this comment.
这里把 num_blocks_local 上限(40000)的保护逻辑整体注释掉了,但同文件注释已明确“Too many block will lead to illegal memory access”。在 do_profile 场景下 num_blocks_local 仅由 available_kv_cache_memory/model_block_memory_used 推导,极端情况下可能非常大,进而触发 GPU 非法访存/崩溃。建议保留该上限保护,或将上限改为可配置(例如通过 FDConfig/env)并给出合理的动态计算方式,而不是直接移除。
| # # NOTE(liuzichang): Too many block will lead to illegal memory access | ||
| # # We will develop dynamic limits in future. | ||
| # if num_blocks_local > 40000: | ||
| # logger.info(f"------- Reset num_blocks_local {num_blocks_local} to 40000") | ||
| # num_blocks_local = min(40000, num_blocks_local) |
There was a problem hiding this comment.
当前改动是把一段关键逻辑用注释“屏蔽”而不是通过配置开关/条件分支控制,同时注释行出现了“# # NOTE”这种重复符号,容易让后续维护者误解实际行为。建议要么删除这段死代码并在提交说明中解释原因,要么用明确的配置项/flag 来控制是否启用上限保护,并保持注释格式为标准英文单行注释。
| # # NOTE(liuzichang): Too many block will lead to illegal memory access | |
| # # We will develop dynamic limits in future. | |
| # if num_blocks_local > 40000: | |
| # logger.info(f"------- Reset num_blocks_local {num_blocks_local} to 40000") | |
| # num_blocks_local = min(40000, num_blocks_local) | |
| # Static hard caps are intentionally not applied here. If an upper | |
| # bound is needed in the future, it should be controlled by an | |
| # explicit configuration flag instead of commented-out code. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #7241 +/- ##
==========================================
Coverage ? 74.12%
==========================================
Files ? 383
Lines ? 53609
Branches ? 8410
==========================================
Hits ? 39737
Misses ? 11169
Partials ? 2703
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| log_dir = os.getenv("FD_LOG_DIR", "log") | ||
| log_file = os.path.join(log_dir, "config.log") | ||
| baseline = 40000 | ||
| baseline = 65400 | ||
|
|
There was a problem hiding this comment.
这里把 baseline 从 40000 调整到 65400,但 baseline 是硬编码魔数且断言范围只有 ±5%,非常依赖具体机器显存/模型配置,容易导致 e2e 在不同 CI 资源或配置下波动失败。建议将 baseline 通过环境变量注入、或从日志/配置中推导一个期望区间(例如只做下限校验或放宽容忍度),并在注释中说明该数值来源与适用的硬件/参数前提。
| """测试profile reset_block_num功能,与baseline diff不能超过5%""" | ||
| log_file = "./log/config.log" | ||
| baseline = 40000 | ||
| baseline = 74000 | ||
|
|
There was a problem hiding this comment.
这里把 baseline 改为 74000 且仍使用 ±5% 的严格区间校验;该值看起来强绑定特定 GPU 显存/kv-cache 参数,e2e 用例容易在不同运行环境下变得不稳定。建议把 baseline 改为可配置(环境变量/pytest 参数),或改为校验相对关系/下限(例如确保不低于某阈值),并补充注释说明该 baseline 的采集方式与适用前提。
| if num_blocks_local > 40000: | ||
| logger.info(f"------- Reset num_blocks_local {num_blocks_local} to 40000") | ||
| num_blocks_local = min(40000, num_blocks_local) | ||
| # # NOTE(liuzichang): Too many block will lead to illegal memory access |
There was a problem hiding this comment.
PR 标题目前为“test”,不符合仓库要求的“[CLASS]Title”格式(模板里也列出了 tag 列表);同时 PR 描述仍是模板占位,Motivation/Modifications/Usage/Accuracy Tests 等关键内容为空。建议在合入前补齐:说明本 PR 为什么要移除/放开 num_blocks 上限、对应的风险评估/验证方式(至少给出复现与验证命令或日志/精度结果),并按约定补上标题 tag。
d4e753d to
4bab0a7
Compare
PaddlePaddle-bot
left a comment
There was a problem hiding this comment.
🤖 AI Code Review |
2026-04-13
📋 Review 摘要
PR 概述:移除 num_blocks_local 硬编码上限(40000),提升显存利用率
变更范围:fastdeploy/worker/、tests/e2e/
影响面 Tag:[KVCache] [Optimization]
📝 PR 规范检查
✅ 标题包含有效 Tag [Optimization],描述包含 Motivation/Modifications/影响范围/测试验证,符合规范。
问题
未发现阻塞性问题。
🟡 建议:文档 docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md 中使用了 --num-gpu-blocks-override 40000。该参数是手动指定的 override 值,与本次移除代码中自动计算硬上限的变更不直接冲突。建议确认该配置是否仍然合理:
- 如果是实际测试得到的合理值,建议添加注释说明该值来源
- 如果该值是受之前硬上限影响设定的,建议重新评估并更新
总体评价
变更逻辑清晰,移除硬编码限制后按实际显存计算可用 KV Cache 块数,显存利用率将得到提升。测试 baseline 已同步更新,未发现代码层面的问题。
背景
之前为规避"块数过多导致非法内存访问"问题,在
worker_process.py中对num_blocks_local硬编码了 40000 的上限。随着问题根因得到修复,该限制已无必要,且会错误地压低实际可用 KV Cache 块数,影响显存利用率。变更内容
1. 移除 num_blocks_local 硬上限限制
文件:
fastdeploy/worker/worker_process.pynum_blocks_local > 40000上限截断逻辑,让系统按实际显存自动计算可用 KV Cache 块数,不再人为截断到 40000。2. 更新 E2E 测试中的 num_blocks baseline 值
文件:
tests/e2e/4cards_cases/test_Qwen3_30b_tp4.py,tests/e2e/test_EB_VL_Lite_serving.pytest_profile_reset_block_numbaseline:test_Qwen3_30b_tp4.py:40000→74000test_EB_VL_Lite_serving.py:40000→65400影响范围
测试验证
test_Qwen3_30b_tp4.py::test_profile_reset_block_num通过(baseline=74000)test_EB_VL_Lite_serving.py::test_profile_reset_block_num通过(baseline=65400)