[Metric] Support custom metric labels #7865
…e interface Introduce MetricsManagerInterface with unified set_value/inc_value/dec_value/obs_value methods. When FD_DEFAULT_METRIC_LABEL_VALUES is set to a valid non-empty JSON dict, metric labels (e.g. model_id) are automatically applied. Otherwise, operations fall back to the raw prometheus_client calls. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
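The "valid non-empty JSON dict" rule in the commit message can be sketched as follows. This is a minimal illustration, not the PR's actual code: the helper name `parse_default_labels` is hypothetical, and coercing values to strings is an assumption made because Prometheus label values are strings.

```python
import json
import os
from typing import Dict, Optional


def parse_default_labels(raw: Optional[str]) -> Dict[str, str]:
    """Return label values parsed from a JSON string, or {} if unusable.

    Hypothetical helper: the name and the exact fallback rules are
    assumptions, not taken from the PR.
    """
    if not raw:
        return {}
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Invalid JSON: behave as if the variable were unset.
        return {}
    if not isinstance(parsed, dict):
        # Valid JSON but not an object (e.g. a list): also ignored.
        return {}
    # Prometheus label values are strings, so coerce defensively.
    return {str(key): str(value) for key, value in parsed.items()}


default_labels = parse_default_labels(os.getenv("FD_DEFAULT_METRIC_LABEL_VALUES"))
# Labels are enabled only when the dict is non-empty.
enable_labels = bool(default_labels)
```

With the variable unset, empty, malformed, or set to `{}`, `enable_labels` stays `False` and the raw `prometheus_client` calls would be used.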
Thanks for your contribution!
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-20 18:25:06
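📋 Review summary
PR overview: adds custom label support (such as model_id) to Prometheus metrics, injected through the FD_DEFAULT_METRIC_LABEL_VALUES environment variable, and migrates all metric.set/inc/dec/observe() calls to the unified interface methods set_value/inc_value/dec_value/obs_value.
Scope of changes: fastdeploy/metrics/, fastdeploy/engine/, fastdeploy/cache_manager/, fastdeploy/entrypoints/, fastdeploy/output/, fastdeploy/splitwise/
Impact tags: [Engine] [KVCache] [Scheduler] [APIServer] [DataProcessor] [PD Disaggregation]
Issues
| Severity | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/metrics/metrics.py:814 | Label migration for spec_decode_draft_single_head_acceptance_rate is self-contradictory; the patched labelnames are silently discarded |
| 📝 PR convention | - | Title uses the unofficial tag [Metric]; it should be [Feature] |
Mandatory rule: every issue in the table that has a file:line reference has a corresponding inline comment created in comments[].
📝 PR convention check
The title tag [Metric] is not in the official tag list (there is no official [Metric] tag). Based on the diff, this PR adds a custom label injection feature, so it should use [Feature]. In addition, "Add unit tests" is unchecked in the Checklist, but the PR body does not explain why no unit tests were written; please add an explanation (for example, "Only affects metrics formatting, no model logic, so no unit tests are added for now").
Suggested title (can be copied directly):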
[Feature] Support custom metric labels
Suggested PR description (can be copied directly; it must reproduce the full structure of the checklist §D2 template):
## Motivation
Re-implement PR #4480 on current develop branch. The original PR introduced `MetricsManagerInterface` to support custom labels (e.g., `model_id`) on Prometheus metrics, but the codebase has changed significantly since then (`WorkMetricsManager` removed, new `v1/serving_chat.py` added, `internal_adapter_utils.py` no longer imports metrics, etc.).
## Modifications
1. **New file `fastdeploy/metrics/interface.py`**: Define `MetricsManagerInterface` with 4 abstract methods: `set_value`, `inc_value`, `dec_value`, `obs_value`.
2. **`fastdeploy/metrics/metrics.py`**:
- `MetricsManager` inherits from `MetricsManagerInterface`
- Parse `FD_DEFAULT_METRIC_LABEL_VALUES` env var; when set to a valid non-empty JSON dict, enable metric labels
- `_patch_labelnames()`: add label keys from `_default_labelvalues` to all metrics' `labelnames`
- Implement the 4 interface methods: when labels enabled, call `metric.labels(**merged).set()/inc()/dec()/observe()`; otherwise, call `metric.set()/inc()/dec()/observe()` directly
- Handle `set_cache_config_info()`, `record_zmq_stats()`, `init_zmq_metrics()`, `_init_speculative_metrics()` with label support
3. **`fastdeploy/envs.py`**: Add `FD_DEFAULT_METRIC_LABEL_VALUES` environment variable
4. **14 call-site files**: Migrate all `main_process_metrics.<metric>.set()/inc()/dec()/observe()` calls to `set_value()/inc_value()/dec_value()/obs_value()`
5. **`fastdeploy/metrics/metrics_middleware.py`**: Migrate HTTP metric `.labels().inc()/.observe()` to `inc_value()/obs_value()` with `labelvalues` parameter
## Usage or Command
```bash
# Enable custom labels on all metrics
export FD_DEFAULT_METRIC_LABEL_VALUES='{"model_id":"qwen3-30b"}'
# Or with multiple labels
export FD_DEFAULT_METRIC_LABEL_VALUES='{"model_id":"qwen3-30b","version":"v2"}'
```
When not set (default `{}`), behavior is identical to current code — no labels are added.
## Accuracy Tests
No model output changes. This only affects Prometheus metric formatting.
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
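- [ ] Add unit tests. Please write the reason in this PR if no unit tests. <!-- Please explain why no unit tests were added, for example: only affects Prometheus metrics formatting, does not touch model inference logic, so no unit tests are added for now -->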
- [x] Provide accuracy results.
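- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall assessment
The overall implementation approach is clear and the interface abstraction is reasonable, with no change to default behavior (fully compatible with the old code when FD_DEFAULT_METRIC_LABEL_VALUES is not set). It is recommended to fix the label-migration consistency issue for spec_decode_draft_single_head_acceptance_rate and to complete the PR conventions (title tag + unit-test explanation).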
patched_spec_metrics = self._patch_labelnames(self.SPECULATIVE_METRICS)
for metric_name, config in patched_spec_metrics.items():
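🟡 Suggestion: the label migration for spec_decode_draft_single_head_acceptance_rate is logically contradictory
_patch_labelnames injects labelnames into every SPECULATIVE_METRICS entry (including this one), but the subsequent if branch builds the per-head Gauge list with hard-coded construction and never uses config["kwargs"], so the patched labelnames are silently discarded.
Actual effect: when _enable_labels=True, the other speculative-decoding metrics (such as spec_decode_draft_acceptance_rate) carry model_id and the other default labels, but spec_decode_draft_single_head_acceptance_rate_0/1/... do not, which makes the metric set inconsistent.
If leaving this metric unlabeled is intentional, consider skipping it explicitly in _patch_labelnames (rather than patching and then discarding), for example: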
    def _patch_labelnames(self, metrics_dict: dict, skip_names: set = None) -> dict:
        skip_names = skip_names or set()
        patched = {}
        for name, config in metrics_dict.items():
            if name in skip_names:
                # Copy the entry unchanged so no default labelnames are injected.
                patched[name] = copy.deepcopy(config)
                continue
            ...

Then call it as: self._patch_labelnames(self.SPECULATIVE_METRICS, skip_names={"spec_decode_draft_single_head_acceptance_rate"})
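A self-contained version of the skip-list idea might look like the sketch below. It is written as a free function rather than a method, and it assumes each metric spec is a dict with a `kwargs` sub-dict that may hold `labelnames`; only `config["kwargs"]` is confirmed by the review, the rest of the shape is an assumption.

```python
import copy
from typing import Any, Dict, Iterable, Optional


def patch_labelnames(
    metrics_dict: Dict[str, Dict[str, Any]],
    default_labelvalues: Dict[str, str],
    skip_names: Optional[Iterable[str]] = None,
) -> Dict[str, Dict[str, Any]]:
    """Return a deep copy of the specs with default label keys appended.

    Entries named in skip_names are copied unchanged, so it is explicit
    which metrics never receive the default labels.
    """
    skip = set(skip_names or ())
    patched: Dict[str, Dict[str, Any]] = {}
    for name, config in metrics_dict.items():
        new_config = copy.deepcopy(config)
        if name not in skip:
            kwargs = new_config.setdefault("kwargs", {})
            existing = list(kwargs.get("labelnames", []))
            # Append only keys that are not already declared.
            extra = [key for key in default_labelvalues if key not in existing]
            kwargs["labelnames"] = existing + extra
        patched[name] = new_config
    return patched
```

Because the input is deep-copied, class-level spec dicts such as SPECULATIVE_METRICS would not be mutated by repeated initialization.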
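CI report generated from the following code (refreshed every 30 minutes):
1 Task overview
Required tasks have 3 failures (main unit tests, Approval, Pre Commit), so merging is not recommended for now; fix the unit tests and code style and complete manual approval first. Optional tasks have another 3 failures, for reference only.
2 Task status summary
Note on the log column: failed tasks link directly to the log; running tasks show the Job link.
2.1 Required tasks: 7/10 passed
2.2 Optional tasks: 28/31 passed
3 Failure details (required only)
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: test failure (confidence: high)
Failing cases:
Root cause details: Key logs: Fix suggestions:
Fix summary: sync the test mocks/stubs to the set_value family of interfaces
Related changes: Link: View logs
Pre Commit: code style (confidence: high)
Root cause details: Key logs: Fix suggestions:
Fix summary: run pre-commit and commit the isort fixes
Related changes: Link: View logs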
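As an illustration of the first fix summary, a test that previously asserted on `metric.set(...)` could assert on `set_value(...)` instead. Everything here is hypothetical: the call-site function, the metric name `num_requests_running`, and the `(name, value)` argument order are assumptions, since the failing tests are not shown in the report.

```python
from unittest.mock import MagicMock


def record_running_requests(metrics, count: int) -> None:
    """Hypothetical call site after migrating to the unified interface."""
    metrics.set_value("num_requests_running", count)


def test_call_site_uses_set_value() -> None:
    metrics = MagicMock()
    record_running_requests(metrics, 3)
    # Before the migration the assertion would have targeted
    # metrics.num_requests_running.set; now it targets set_value.
    metrics.set_value.assert_called_once_with("num_requests_running", 3)
    metrics.num_requests_running.set.assert_not_called()
```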
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7865 +/- ##
==========================================
Coverage ? 62.71%
==========================================
Files ? 463
Lines ? 64506
Branches ? 9898
==========================================
Hits ? 40456
Misses ? 21282
Partials ? 2768
Motivation
Re-implement PR #4480 on current develop branch. The original PR introduced `MetricsManagerInterface` to support custom labels (e.g., `model_id`) on Prometheus metrics, but the codebase has changed significantly since then (`WorkMetricsManager` removed, new `v1/serving_chat.py` added, `internal_adapter_utils.py` no longer imports metrics, etc.).
Modifications
1. New file `fastdeploy/metrics/interface.py`: Define `MetricsManagerInterface` with 4 abstract methods: `set_value`, `inc_value`, `dec_value`, `obs_value`.
2. `fastdeploy/metrics/metrics.py`:
   - `MetricsManager` inherits from `MetricsManagerInterface`
   - Parse `FD_DEFAULT_METRIC_LABEL_VALUES` env var; when set to a valid non-empty JSON dict, enable metric labels
   - `_patch_labelnames()`: add label keys from `_default_labelvalues` to all metrics' `labelnames`
   - Implement the 4 interface methods: when labels enabled, call `metric.labels(**merged).set()/inc()/dec()/observe()`; otherwise, call `metric.set()/inc()/dec()/observe()` directly
   - Handle `set_cache_config_info()`, `record_zmq_stats()`, `init_zmq_metrics()`, `_init_speculative_metrics()` with label support
3. `fastdeploy/envs.py`: Add `FD_DEFAULT_METRIC_LABEL_VALUES` environment variable
4. 14 call-site files: Migrate all `main_process_metrics.<metric>.set()/inc()/dec()/observe()` calls to `set_value()/inc_value()/dec_value()/obs_value()`
5. `fastdeploy/metrics/metrics_middleware.py`: Migrate HTTP metric `.labels().inc()/.observe()` to `inc_value()/obs_value()` with `labelvalues` parameter
Usage or Command
When not set (default `{}`), behavior is identical to current code: no labels are added.
Accuracy Tests
No model output changes. This only affects Prometheus metric formatting.
Checklist
- Format your code, run `pre-commit` before commit.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
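The four interface methods listed under Modifications could be sketched roughly as below. The method signatures, the `LabelAwareMetrics` class name, and the name-keyed metric registry are assumptions for illustration; the PR text only confirms the method names and the `metric.labels(**merged)` dispatch. The metric objects are duck-typed, so anything exposing `labels(**kwargs)`, `set`, `inc`, `dec`, and `observe` (as `prometheus_client` metrics do) would fit.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional

Labels = Optional[Dict[str, str]]


class MetricsManagerInterface(ABC):
    """The four unified operations; signatures here are assumed."""

    @abstractmethod
    def set_value(self, name: str, value: float, labelvalues: Labels = None) -> None: ...

    @abstractmethod
    def inc_value(self, name: str, value: float = 1, labelvalues: Labels = None) -> None: ...

    @abstractmethod
    def dec_value(self, name: str, value: float = 1, labelvalues: Labels = None) -> None: ...

    @abstractmethod
    def obs_value(self, name: str, value: float, labelvalues: Labels = None) -> None: ...


class LabelAwareMetrics(MetricsManagerInterface):
    """Hypothetical manager that merges default and per-call labels."""

    def __init__(self, metrics: Dict[str, Any], default_labelvalues: Labels = None):
        self._metrics = metrics
        self._default_labelvalues = dict(default_labelvalues or {})
        self._enable_labels = bool(self._default_labelvalues)

    def _resolve(self, name: str, labelvalues: Labels) -> Any:
        metric = self._metrics[name]
        if not self._enable_labels and not labelvalues:
            # No labels anywhere: fall back to the raw metric call.
            return metric
        # Per-call labels take precedence over the defaults.
        merged = {**self._default_labelvalues, **(labelvalues or {})}
        return metric.labels(**merged)

    def set_value(self, name, value, labelvalues=None):
        self._resolve(name, labelvalues).set(value)

    def inc_value(self, name, value=1, labelvalues=None):
        self._resolve(name, labelvalues).inc(value)

    def dec_value(self, name, value=1, labelvalues=None):
        self._resolve(name, labelvalues).dec(value)

    def obs_value(self, name, value, labelvalues=None):
        self._resolve(name, labelvalues).observe(value)
```

Routing every call through one `_resolve` step keeps the labeled and unlabeled paths in a single place, which is the property the 14 call-site migrations rely on.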