[Feature] implement log channel separation and request log level system by xyxinyang · Pull Request #7190 · PaddlePaddle/FastDeploy

xyxinyang · 2026-04-03T09:45:02Z

Motivation

This PR is part of an optimization of FastDeploy's logging system, which is planned to land in 4 PRs.

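PR	Content	Status
1	Add logging-related parameters; also output errors to the terminal	Merged
2	Log channel separation; request.log level classification and aggregation	This PR
3	Consolidate and simplify worker_process.log, cache_manager.log, comm.log, and Paddle logs	To be submitted
4	Standardize and consolidate trace.log	To be submitted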

Modifications

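Fewer log files: more than a dozen scattered files → a few core files (fastdeploy.log, request.log, error.log, comm.log)
Request log aggregation: logs for the full request path are written to request.log instead of being scattered across api_server.log, data_processor.log, scheduler.log, and other files. FD_LOG_REQUESTS controls whether this log file is produced.
Request log levels: a new FD_LOG_REQUESTS_LEVEL environment variable supports 4 request log levels (0-LIFECYCLE / 1-STAGES / 2-CONTENT / 3-FULL), so verbosity can be adjusted as needed.
Better error logs: request-related errors carry the request_id, making problems easier to trace.
Unified logger code: the logging implementation is consolidated into the fastdeploy/logger/ module while remaining backward compatible.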

Usage or Command

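Log channel separation

FastDeploy divides logs into three channels:

Channel	Logger name	Output file	Description
main	`fastdeploy.main.*`	`fastdeploy.log`	Main log; records system configuration, startup information, etc.
request	`fastdeploy.request.*`	`request.log`	Request log; records the request lifecycle and processing details
console	`fastdeploy.console.*`	`console.log`	Console log; output to the terminal and console.log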
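
The channel table can be illustrated with a minimal sketch built on the standard library's `logging.config.dictConfig`. This is not the PR's `setup_logging` implementation: the function `build_channel_config`, the handler names, and the format string are hypothetical, and `logging.FileHandler` with `delay=True` is used as a stand-in for lazy file creation.

```python
import logging
import logging.config
import os
import tempfile


def build_channel_config(log_dir: str) -> dict:
    """Return a dictConfig mapping for the three channels (illustrative only)."""

    def file_handler(filename: str) -> dict:
        return {
            "class": "logging.FileHandler",
            "filename": os.path.join(log_dir, filename),
            "formatter": "plain",
            "delay": True,  # the file is only created on the first write
        }

    return {
        "version": 1,
        "disable_existing_loggers": False,
        "formatters": {"plain": {"format": "%(levelname)s %(name)s %(message)s"}},
        "handlers": {
            "main_file": file_handler("fastdeploy.log"),
            "request_file": file_handler("request.log"),
            "console_file": file_handler("console.log"),
            "terminal": {"class": "logging.StreamHandler", "formatter": "plain"},
        },
        "loggers": {
            # propagate=False keeps each channel's records out of the other files
            "fastdeploy.main": {"handlers": ["main_file"], "level": "INFO", "propagate": False},
            "fastdeploy.request": {"handlers": ["request_file"], "level": "INFO", "propagate": False},
            "fastdeploy.console": {
                "handlers": ["console_file", "terminal"],
                "level": "INFO",
                "propagate": False,
            },
        },
    }


log_dir = tempfile.mkdtemp()
logging.config.dictConfig(build_channel_config(log_dir))

# Child loggers inherit the handlers of their channel's parent logger.
logging.getLogger("fastdeploy.main.engine").info("engine started")
logging.getLogger("fastdeploy.request.scheduler").info("request enqueued")
```

Because each channel logger stops propagation, a record logged under `fastdeploy.request.*` in this sketch lands only in `request.log`, which mirrors the separation the table describes.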

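Request log levels

The request log (request.log) supports 4 levels, controlled by the FD_LOG_REQUESTS_LEVEL environment variable:

Level	Description	Example content
L0	Key lifecycle events	Request creation/initialization, completion statistics (InputToken/OutputToken/elapsed time), first and last send of a streaming response, request abort
L1	Processing stage details	Semaphore acquire/release, first-token time recording, signal handling (preemption/abortion/recovery), cache tasks, preprocessing time, parameter adjustment warnings
L2	Request/response content and scheduling	Scheduling information (enqueue/pull/complete), request and response content (overly long content is truncated)
L3	Full data (formerly debug level)	Complete request and response data

The default level is L2 (CONTENT), which records request parameters, scheduling information, and response content. For leaner logs, set FD_LOG_REQUESTS_LEVEL=0 to record only key lifecycle events.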
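
As a rough sketch of how the two environment variables might gate a message, the snippet below resolves the configured level and compares it with the level of each message. The helper names `resolve_request_level` and `should_log` are hypothetical, and clamping to 0..3 and falling back to the default on invalid input are assumptions, not behavior confirmed by this PR.

```python
import os

DEFAULT_LEVEL = 2  # CONTENT, the default described above


def resolve_request_level(environ=os.environ) -> int:
    """Read FD_LOG_REQUESTS_LEVEL; clamp to 0..3 and fall back on bad input (assumed)."""
    raw = environ.get("FD_LOG_REQUESTS_LEVEL", str(DEFAULT_LEVEL))
    try:
        level = int(raw)
    except ValueError:
        return DEFAULT_LEVEL
    return max(0, min(3, level))


def should_log(message_level: int, environ=os.environ) -> bool:
    """A message is kept when request logging is on and its level is not above the configured one."""
    if environ.get("FD_LOG_REQUESTS", "1") == "0":
        return False
    return message_level <= resolve_request_level(environ)


# With nothing set, L0-L2 messages pass and L3 (full data) is filtered out.
print([should_log(level, {}) for level in range(4)])
```

Passing a plain dict instead of `os.environ` keeps the sketch easy to try without changing the process environment.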

Accuracy Tests

1

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-04-03T09:45:11Z

Thanks for your contribution!

codecov-commenter · 2026-04-03T11:35:21Z

Codecov Report

❌ Patch coverage is 87.50000% with 47 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@dec0b06). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/entrypoints/engine_client.py	70.96%	9 Missing ⚠️
fastdeploy/entrypoints/openai/serving_engine.py	50.00%	6 Missing ⚠️
fastdeploy/entrypoints/api_server.py	0.00%	4 Missing ⚠️
...deploy/entrypoints/openai/v1/serving_completion.py	55.55%	4 Missing ⚠️
fastdeploy/utils.py	80.00%	4 Missing ⚠️
...i/tool_parsers/ernie_45_vl_thinking_tool_parser.py	40.00%	3 Missing ⚠️
...astdeploy/entrypoints/openai/serving_completion.py	91.30%	2 Missing ⚠️
fastdeploy/input/tokenizer_client.py	50.00%	2 Missing ⚠️
fastdeploy/logger/__init__.py	90.90%	1 Missing and 1 partial ⚠️
fastdeploy/output/token_processor.py	90.00%	2 Missing ⚠️
... and 8 more

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7190   +/-   ##
==========================================
  Coverage           ?   73.75%           
==========================================
  Files              ?      397           
  Lines              ?    54828           
  Branches           ?     8587           
==========================================
  Hits               ?    40439           
  Misses             ?    11682           
  Partials           ?     2707

Flag	Coverage Δ
GPU	`73.75% <87.50%> (?)`

Copilot

Pull request overview

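This PR introduces a unified mechanism of log channels (main/request/console) and L0-L3 request log levels on top of FastDeploy's existing logging system, migrates many request lifecycle, scheduling, and output logs to the request channel, and adds the corresponding unit tests and documentation.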

Changes:

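Adds fastdeploy.logger.request_logger (L0-L3 request logs) and fastdeploy.logger.config (parsing of log levels and request log default configuration).
setup_logging supports configuring loggers for the three channels main/request/console, and many business modules replace their scattered logger calls with log_request/log_request_error.
Updates and adds logging-related unit tests, and updates the Chinese and English logging docs accordingly.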

Reviewed changes

Copilot reviewed 48 out of 48 changed files in this pull request and generated 4 comments.

File	Description
tests/scheduler/test_local_scheduler.py	Migrates assertions from scheduler_logger to verifying `log_request` calls
tests/scheduler/test_dp_scheduler.py	Mocks `log_request` to adapt the DP scheduler's request log assertions
tests/output/test_process_batch_output.py	Replaces assertions on llm_logger with `log_request`, covering the preemption scenario
tests/output/test_process_batch_output_use_zmq.py	Replaces assertions on llm_logger with `log_request`, covering the abort recycle scenario
tests/logger/test_setup_logging.py	Updates default handler assertions and adds tests for channel logger/handler configuration
tests/logger/test_request_logger.py	Adds tests for request_logger's _should_log/_truncate/log_request
tests/logger/test_logging_config.py	Adds tests for resolve_log_level and request log default value parsing
tests/logger/test_logger.py	Updates tests for the unified logger naming rules (main channel by default)
tests/input/test_video_utils.py	Points assertLogs at the main channel logger name
tests/entrypoints/test_llm.py	Changes exception log assertions to verify `log_request_error` calls
tests/entrypoints/test_engine_client.py	Changes parameter warning assertions to verify `log_request` calls; fixes the getenv mock signature
tests/entrypoints/test_abort.py	Changes abort log assertions to verify `log_request` calls
tests/entrypoints/openai/v1/test_serving_completion_v1.py	Changes exception log assertions to verify `log_request_error` calls
fastdeploy/utils.py	Adds channel to get_logger; global loggers default to channels such as main/console/comm
fastdeploy/scheduler/splitwise_scheduler.py	Changes several error logs to `log_request_error`
fastdeploy/scheduler/local_scheduler.py	Migrates request enqueue/dequeue/completion logs to `log_request`
fastdeploy/scheduler/global_scheduler.py	Adds `log_request/log_request_error` and adjusts some request-related logs
fastdeploy/scheduler/dp_scheduler.py	Migrates request/result logs to `log_request`
fastdeploy/output/token_processor.py	Migrates many request lifecycle/exception logs to `log_request/log_request_error`
fastdeploy/logger/setup_logging.py	Changes the default dictConfig to a three-channel configuration using LazyFileHandler
fastdeploy/logger/request_logger.py	Adds the request log module (L0-L3, truncation, error log interface)
fastdeploy/logger/logger.py	Supports getting a logger by channel and makes the unified logger default to the main channel
fastdeploy/logger/config.py	Adds log level parsing and request log default configuration parsing
fastdeploy/input/tokenizer_client.py	Migrates polling/decoding exception logs to the request channel log interface
fastdeploy/input/qwen3_vl_processor/qwen3_vl_processor.py	Migrates the "Processed request" log to the request channel
fastdeploy/input/qwen_vl_processor/qwen_vl_processor.py	Migrates the "Processed request" log to the request channel
fastdeploy/input/ernie4_5_vl_processor/ernie4_5_vl_processor.py	Migrates the "Processed request" log to the request channel
fastdeploy/input/base_processor.py	Migrates token_ids/request processing logs to the request channel and assigns levels
fastdeploy/entrypoints/openai/v1/serving_completion.py	Migrates exception and key streaming-point logs to the request channel
fastdeploy/entrypoints/openai/v1/serving_chat.py	Migrates exception and key streaming-point logs to the request channel
fastdeploy/entrypoints/openai/tool_parsers/ernie_x1_tool_parser.py	Migrates error logs to `log_request_error`
fastdeploy/entrypoints/openai/tool_parsers/ernie_45_vl_thinking_tool_parser.py	Migrates error logs to `log_request_error`
fastdeploy/entrypoints/openai/serving_reward.py	Migrates key output logs to `log_request`
fastdeploy/entrypoints/openai/serving_models.py	Migrates error logs to `log_request_error`
fastdeploy/entrypoints/openai/serving_engine.py	Migrates semaphore/initialization/error logs to the request channel interface
fastdeploy/entrypoints/openai/serving_embedding.py	Migrates key output logs to `log_request`
fastdeploy/entrypoints/openai/serving_completion.py	Migrates exception and key streaming-point logs to the request channel interface
fastdeploy/entrypoints/openai/serving_chat.py	Migrates exception and key streaming-point logs to the request channel interface
fastdeploy/entrypoints/openai/protocol.py	Migrates the metadata obsolete warning log to `log_request`
fastdeploy/entrypoints/openai/api_server.py	Migrates request-received/exception logs to the request channel interface
fastdeploy/entrypoints/llm.py	Migrates _receive_output exception logs to `log_request_error`
fastdeploy/entrypoints/engine_client.py	Migrates validation/receive/abort/exception logs to the request channel interface
fastdeploy/entrypoints/api_server.py	Migrates request-received and exception logs to the request channel interface
fastdeploy/engine/request.py	Migrates from_dict/mm_positions exception logs to the request channel interface
fastdeploy/engine/engine.py	Migrates add_requests/generate logs to the request channel interface
fastdeploy/engine/async_llm.py	Migrates abort/generate exception logs to `log_request_error`
docs/zh/usage/log.md	Adds descriptions of log channels, request log levels, and environment variables (Chinese)
docs/usage/log.md	Adds descriptions of log channels, request log levels, and environment variables (English)

Jiang-Jia-Jun · 2026-04-10T06:18:01Z

 from fastdeploy.engine.expert_service import start_data_parallel_service
 from fastdeploy.engine.request import Request
 from fastdeploy.inter_communicator import EngineWorkerQueue, IPCSignal
+from fastdeploy.logger.request_logger import log_request, log_request_error


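Is logging errors separately a common practice? Will this produce something like a request_error.log?

They are separate because log_request has 4 levels and filters logs against the currently configured level, while log_request_error has no levels. Both write to request.log, and log_request_error output is additionally routed to the terminal and error.log.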
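
The routing described in this reply can be sketched as follows. These are simplified stand-ins that only borrow the names and keyword style (`level=`, `message=`, extra fields) seen in the diff snippets; the in-memory `ListHandler`, the `sketch.request` logger name, and `CURRENT_LEVEL` are hypothetical, and the second handler stands in for both error.log and the terminal.

```python
import logging


class ListHandler(logging.Handler):
    """Collects formatted records in memory so the routing is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


request_sink = ListHandler()  # stand-in for request.log
error_sink = ListHandler()  # stand-in for error.log and the terminal
error_sink.setLevel(logging.ERROR)  # only error records reach this sink

request_logger = logging.getLogger("sketch.request")
request_logger.setLevel(logging.INFO)
request_logger.propagate = False
request_logger.addHandler(request_sink)
request_logger.addHandler(error_sink)

CURRENT_LEVEL = 0  # hypothetical configured request log level


def log_request(level: int, message: str, **fields) -> None:
    """Level-filtered request log: dropped when above the configured level."""
    if level <= CURRENT_LEVEL:
        request_logger.info(message.format(**fields))


def log_request_error(message: str, **fields) -> None:
    """Unleveled error log: always written, and also picked up by the error sink."""
    request_logger.error(message.format(**fields))


log_request(level=1, message="Cache task with request_id ({request_id})", request_id="req-1")
log_request(level=0, message="Request finished: request_id={request_id}", request_id="req-1")
log_request_error(
    message="request_id={request_id}: tool call parsing failed: {error}",
    request_id="req-1",
    error="bad json",
)
```

In this sketch the level 1 message is filtered out, the level 0 message reaches only the request sink, and the error reaches both sinks regardless of the configured level.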

Jiang-Jia-Jun · 2026-04-10T06:19:35Z

-        llm_logger.info(f"Cache task with request_id ({request.get('request_id')})")
-        llm_logger.debug(f"cache task: {request}")
+        log_request(
+            level=1,


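The docs say level=1~n here corresponds to different levels of detail. Could these be changed to enum values so that the meaning is clearer in the code?

OK. I plan to switch to 4 enum values whose names match the corresponding levels of detail, so the meaning is clear.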

Jiang-Jia-Jun · 2026-04-10T06:20:45Z


            if not function_call_arr:
-                data_processor_logger.error("No valid tool calls found")
+                log_request_error(message="No valid tool calls found")


If a log like this is for a specific request, I suggest including the request_id.

Jiang-Jia-Jun · 2026-04-10T06:20:51Z


        except Exception as e:
-            data_processor_logger.error(f"Error in extracting tool call from response: {str(e)}")
+            log_request_error(message="Error in extracting tool call from response: {error}", error=str(e))


Same as above, and likewise elsewhere.

Jiang-Jia-Jun · 2026-04-10T06:23:29Z

-                            api_server_logger.info(f"Chat Streaming response last send: {chunk.model_dump_json()}")
+                            log_request(
+                                level=0,
+                                message="Chat Streaming response last send: request_id={request_id}, finish_reason={finish_reason}, completion_tokens={completion_tokens}",


If logprob is present, log it as well; training relies on this value to determine whether NaN has occurred.

Jiang-Jia-Jun · 2026-04-10T06:26:15Z

+    """
+    return {
+        "enabled": int(os.getenv("FD_LOG_REQUESTS", "1")),
+        "level": int(os.getenv("FD_LOG_REQUESTS_LEVEL", "0")),


The default level is 0. Does that leave the current log output unchanged?

level=2 aligns with FastDeploy's current info-level request-related log output.

Jiang-Jia-Jun · 2026-04-10T06:29:21Z

-scheduler_logger = get_logger("scheduler", "scheduler.log")
-api_server_logger = get_logger("api_server", "api_server.log")
-console_logger = get_logger("console", "console.log", print_to_console=True)
+llm_logger = get_logger("fastdeploy", channel="main")


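Logging currently has multiple implementations. They can all be unified in this upgrade, so there is no need to keep one under logger and another similar one under utils.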

Jiang-Jia-Jun · 2026-04-10T06:30:11Z

For all error logs, please review whether each one is request-level; if so, add the request_id for traceability.

xyxinyang · 2026-04-13T11:50:45Z

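Changes and clarifications in response to the review comments:

1. log_request has 4 levels and filters logs against the currently configured level. log_request_error has no levels and logs errors. Both write to request.log; log_request_error is additionally routed to the terminal and error.log.

2. Each of the 4 log_request levels has a corresponding enum value:

Level	Enum name	Description	Example content
0	LIFECYCLE	Lifecycle start and end	Request creation/initialization, completion statistics (InputToken/OutputToken/elapsed time), first and last send of a streaming response, request abort
1	STAGES	Processing stages	Semaphore acquire/release, first-token time recording, signal handling (preemption/abortion/recovery), cache tasks, preprocessing time, parameter adjustment warnings
2	CONTENT	Content and scheduling	Request parameters, processed requests, scheduling information (enqueue/pull/complete), response content (overly long content is truncated)
3	FULL	Full data	Complete request and response data, raw received requests

3. Request-level error logs now include the request_id for traceability.

4. logprob is now logged as well.

5. The logging implementations are consolidated into fastdeploy/logger, with compatibility handling in utils.

6. The default of FD_LOG_REQUESTS_LEVEL is set to 2, which corresponds to the info-level logs of the relevant FastDeploy modules before this change.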
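
A minimal sketch of how the four enum values and the truncation of long content could fit together is shown below. The member names come from the table in this comment; the class name `RequestLogLevel`, the helpers `truncate` and `render`, the 64-character limit, and the rule that only FULL messages skip truncation are all hypothetical choices made for illustration.

```python
from enum import IntEnum
from typing import Optional


class RequestLogLevel(IntEnum):
    LIFECYCLE = 0  # lifecycle start and end
    STAGES = 1  # processing stages
    CONTENT = 2  # content and scheduling (long content truncated)
    FULL = 3  # complete request and response data


MAX_CONTENT_CHARS = 64  # hypothetical limit, chosen only for this sketch


def truncate(text: str, limit: int = MAX_CONTENT_CHARS) -> str:
    """Shorten long payloads and note how many characters were dropped."""
    if len(text) <= limit:
        return text
    return text[:limit] + f"...[truncated {len(text) - limit} chars]"


def render(level: RequestLogLevel, configured: RequestLogLevel, payload: str) -> Optional[str]:
    """Return the text to log, or None when the message is filtered out."""
    if level > configured:
        return None
    # Assumption: only FULL messages carry the untruncated payload.
    return payload if level == RequestLogLevel.FULL else truncate(payload)


# IntEnum members compare like integers, so call sites can use names instead of bare numbers.
print(render(RequestLogLevel.CONTENT, RequestLogLevel.CONTENT, "x" * 100))
```

Using `IntEnum` keeps integer comparison against a value parsed from FD_LOG_REQUESTS_LEVEL working while making call sites such as `level=RequestLogLevel.STAGES` self-describing.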

…ger implementation from utils to logger module

PaddlePaddle-bot

🤖 AI Code Review | 2026-04-16

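📋 Review summary

PR overview: implements log channel separation and a request log level system, dividing logs into three channels (main/request/console) and supporting four request log verbosity levels, L0-L3.

Scope of changes: fastdeploy/logger/ (new module), entrypoints/, engine/, scheduler/, docs/ (documentation updates)

Impact tags: [Feature] [APIServer] [Engine] [Scheduler] [Docs]

Issues

Severity	File	Summary
🟡 Suggestion	`docs/usage/environment_variables.md:28`	The change to the FD_LOG_REQUESTS_LEVEL default is a breaking change and should be called out explicitly

Overall assessment

The logging refactor is well designed overall, with clear code structure and fairly thorough test coverage. It implements log channel separation and level control, and the core implementation is of good quality. However, the default value change is a breaking change; consider adding a prominent migration note to the docs.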

PaddlePaddle-bot · 2026-04-16T05:58:45Z


    # Request logging detail level (0-3). Higher level means more verbose output.
-    "FD_LOG_REQUESTS_LEVEL": lambda: int(os.getenv("FD_LOG_REQUESTS_LEVEL", "0")),
+    "FD_LOG_REQUESTS_LEVEL": lambda: int(os.getenv("FD_LOG_REQUESTS_LEVEL", "2")),


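🟡 Suggestion: the FD_LOG_REQUESTS_LEVEL default changes from 0 to 2, which is a breaking change.

Impact:

Old default 0 (LIFECYCLE): records only key lifecycle events such as request creation/completion/abort

New default 2 (CONTENT): records request parameters, scheduling information, and response content

Potential impact:

Existing deployments that do not set the FD_LOG_REQUESTS_LEVEL environment variable will produce more log output

Log file size may increase significantly, which may affect disk I/O and storage costs

Suggestions:

Add an "Important" or "Breaking Change" notice to log.md to alert users to the change in logging behavior

Consider adding a breaking change note to the PR description

Or set a migration period and change the default in the next release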

xyxinyang had a problem deploying to Metax_ci April 3, 2026 09:45 — with GitHub Actions Failure

xyxinyang force-pushed the dev-log-v2 branch from 8d44029 to ec6f2f8 Compare April 8, 2026 03:56

xyxinyang had a problem deploying to Metax_ci April 8, 2026 03:56 — with GitHub Actions Error

xyxinyang force-pushed the dev-log-v2 branch from ec6f2f8 to 9c79d9a Compare April 8, 2026 03:58

xyxinyang had a problem deploying to Metax_ci April 8, 2026 03:58 — with GitHub Actions Failure