[Cherry-Pick][BugFix] Fix clear_parameters hang issue in MTP during weight cleanup in RL (#7522) by Deleter-D · Pull Request #7523 · PaddlePaddle/FastDeploy

Deleter-D · 2026-04-20T13:06:09Z

Motivation

Clearing CUDA Graphs after the weights have already been released can cause a hang.

Modifications

This PR resolves the problem by moving the CUDA Graph cleanup logic in MTP to occur before weight cleanup, preventing hangs observed in RL scenarios.
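The ordering described above can be illustrated with a small stand-alone sketch. The classes here are hypothetical stubs, not FastDeploy code; only the method name `clear_graph_opt_backend` and the flag names `use_cudagraph`, `speculative_decoding`, and `draft_model_use_cudagraph` are taken from the review discussion of this PR. The stubs record the order of calls so the intended sequence is visible.

```python
class StubModel:
    """Hypothetical stand-in for a model that owns captured CUDA Graphs."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def clear_graph_opt_backend(self):
        # A real backend would destroy captured graphs here; the stub only logs.
        self.log.append(f"{self.name}: clear graphs")


class StubRunner:
    """Hypothetical stand-in for the model runner; not the FastDeploy class."""

    def __init__(self, use_cudagraph, speculative_decoding, draft_model_use_cudagraph):
        self.log = []
        self.use_cudagraph = use_cudagraph
        self.speculative_decoding = speculative_decoding
        self.draft_model_use_cudagraph = draft_model_use_cudagraph
        self.model = StubModel("main", self.log)
        self.draft_model = StubModel("draft", self.log)

    def clear_parameters(self):
        # Graphs are cleared first, for the main model and then the draft model.
        if self.use_cudagraph:
            self.model.clear_graph_opt_backend()
            if self.speculative_decoding and self.draft_model_use_cudagraph:
                self.draft_model.clear_graph_opt_backend()
        # Only afterwards are the weights released (logged here, not performed).
        self.log.append("release weights")
        return self.log
```

Calling `StubRunner(True, True, True).clear_parameters()` records both graph cleanups before the weight release, which is the order this PR aims for.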

Usage or Command

Accuracy Tests

Checklist

Add at least one tag to the PR title.
- Tag list: [[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code and run pre-commit before committing.
Add unit tests. If no unit tests are added, explain why in this PR.
Provide accuracy results.
If the current PR targets the release branch, make sure it has first been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-04-20T13:06:15Z

Thanks for your contribution!

codecov-commenter · 2026-04-20T14:35:33Z

Codecov Report

❌ Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.6@be2fd17). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/worker/gpu_model_runner.py	0.00%	3 Missing ⚠️

Additional details and impacted files

@@              Coverage Diff               @@
##             release/2.6    #7523   +/-   ##
==============================================
  Coverage               ?   73.55%           
==============================================
  Files                  ?      376           
  Lines                  ?    53033           
  Branches               ?     8288           
==============================================
  Hits                   ?    39011           
  Misses                 ?    11274           
  Partials               ?     2748

Flag	Coverage Δ
GPU	`73.55% <0.00%> (?)`

gongshaotian

LGTM

PaddlePaddle-bot

🤖 AI Code Review | 2026-04-21 21:05:12

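📋 Review Summary

PR overview: Moves the CUDA Graph cleanup in the MTP Speculative Decoding path ahead of weight release, fixing a hang in RL scenarios.
Scope of change: fastdeploy/worker/gpu_model_runner.py, the clear_parameters method
Affected tags: [Speculative Decoding] [RL]

Issues

Severity	File	Summary
❓ Question	`gpu_model_runner.py:2700`	`proposer.model.clear_graph_opt_backend()` is nested inside `if self.use_cudagraph:`; if the main model does not use cudagraph but the draft model has cudagraph enabled, resources may not be released
🟡 Suggestion	-	There is no regression test for this hang; consider adding a test case to guard against future regressions

Overall assessment

The fix is sound: moving proposer.model.clear_graph_opt_backend() ahead of clear_parameters() avoids the hang, and the new speculative_decoding and draft_model_use_cudagraph guards make the logic more rigorous. Please confirm the configuration constraint between use_cudagraph and draft_model_use_cudagraph, and consider adding a regression test.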
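A regression test along the lines suggested above could check call order with `unittest.mock`. This is only a sketch: `clear_parameters` below is a simplified, hypothetical reimplementation of the cleanup path rather than the FastDeploy method, and `weight_manager` is an assumed name for whatever component releases the weights. A single `MagicMock` parent records calls on all of its children in `mock_calls`, which makes the relative order easy to assert.

```python
from unittest import mock


def clear_parameters(runner):
    """Simplified, hypothetical version of the cleanup path under test."""
    if runner.use_cudagraph:
        runner.model.clear_graph_opt_backend()
        if runner.speculative_decoding and runner.draft_model_use_cudagraph:
            runner.proposer.model.clear_graph_opt_backend()
    # weight_manager is an assumed name, used here for illustration only.
    runner.weight_manager.clear_parameters()


def test_graphs_cleared_before_weights():
    runner = mock.MagicMock()
    runner.use_cudagraph = True
    runner.speculative_decoding = True
    runner.draft_model_use_cudagraph = True

    clear_parameters(runner)

    # Both graph cleanups must be recorded before the weight release.
    assert runner.mock_calls == [
        mock.call.model.clear_graph_opt_backend(),
        mock.call.proposer.model.clear_graph_opt_backend(),
        mock.call.weight_manager.clear_parameters(),
    ]
```

A test like this only pins down the ordering; reproducing the actual hang would still need a GPU run in an RL setup.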

PaddlePaddle-bot · 2026-04-21T13:08:29Z

        # Clear CUDAGraph
        if self.use_cudagraph:
            self.model.clear_graph_opt_backend()
+            if (


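❓ Question: proposer.model.clear_graph_opt_backend() is now nested inside the if self.use_cudagraph: block

In the original code, MTP's proposer.model.clear_graph_opt_backend() was called separately under if self.spec_method == SpecMethod.MTP:, independent of the main model's use_cudagraph switch.

The new code nests it inside if self.use_cudagraph:, so when the main model has cudagraph disabled (use_cudagraph=False) but the draft model enables it independently (draft_model_use_cudagraph=True), proposer.model.clear_graph_opt_backend() will never be called, which may leak the draft model's CUDA graph resources.

Please confirm: by design, does draft_model_use_cudagraph=True always require use_cudagraph=True to take effect? If so, consider adding a constraint or assertion at the configuration level to make the assumption explicit.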
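If the answer to the question above is yes, the assumption could be made explicit with a check like the following. This is a hedged sketch: `validate_cudagraph_flags` is a hypothetical helper, not an existing FastDeploy function, and whether the constraint actually holds is exactly what the question asks the author to confirm.

```python
def validate_cudagraph_flags(use_cudagraph: bool, draft_model_use_cudagraph: bool) -> None:
    """Hypothetical config check: reject a draft-only cudagraph setup.

    Assumes (unconfirmed) that the draft model's cudagraph setting is only
    meaningful when the main model also uses cudagraph.
    """
    if draft_model_use_cudagraph and not use_cudagraph:
        raise ValueError(
            "draft_model_use_cudagraph=True requires use_cudagraph=True"
        )
```

Running such a check once when the configuration is built would turn a silent resource leak into an immediate, readable error.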

EmmonsCurse

LGTM～ Skip coverage check as it mainly relies on tests with RL.

Deleter-D had a problem deploying to Metax_ci April 20, 2026 13:06 — with GitHub Actions Failure

EmmonsCurse had a problem deploying to Metax_ci April 20, 2026 13:11 — with GitHub Actions Failure

Deleter-D added 2 commits April 21, 2026 20:59

fix mtp clear graph bugs in rl

2a0f99f

fix

d58843f

Deleter-D force-pushed the 2.6_fix_mtp_clear_graph branch from 2ffb8ec to d58843f Compare April 21, 2026 12:59

Deleter-D had a problem deploying to Metax_ci April 21, 2026 12:59 — with GitHub Actions Failure

gongshaotian approved these changes Apr 21, 2026

PaddlePaddle-bot reviewed Apr 21, 2026

EmmonsCurse approved these changes Apr 22, 2026

EmmonsCurse added the skip-ci: coverage label Apr 22, 2026

Deleter-D merged commit 2961400 into PaddlePaddle:release/2.6 Apr 22, 2026
51 of 56 checks passed

Deleter-D merged 2 commits into PaddlePaddle:release/2.6 from Deleter-D:2.6_fix_mtp_clear_graph

5 participants
