
[Cherry-Pick][RL] change glm rope_emb calculation #7316 #7318

Merged
zoooo0820 merged 3 commits into PaddlePaddle:release/2.6 from zoooo0820:cp26_rope
Apr 11, 2026

Conversation

@zoooo0820
Collaborator

@zoooo0820 zoooo0820 commented Apr 10, 2026

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented Apr 10, 2026

Thanks for your contribution!

PaddlePaddle-bot

This comment was marked as outdated.

@codecov-commenter

codecov-commenter commented Apr 10, 2026

Codecov Report

❌ Patch coverage is 14.28571% with 6 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.6@c756038). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...stdeploy/model_executor/layers/rotary_embedding.py | 14.28% | 6 Missing ⚠️ |

Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.6    #7318   +/-   ##
==============================================
  Coverage               ?   73.84%           
==============================================
  Files                  ?      376           
  Lines                  ?    52960           
  Branches               ?     8268           
==============================================
  Hits                   ?    39110           
  Misses                 ?    11115           
  Partials               ?     2735           

| Flag | Coverage Δ |
|---|---|
| GPU | 73.84% <14.28%> (?) |

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.


@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 AI Code Review | 2026-04-11 14:34 CST

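📋 Review Summary

PR overview: Cherry-pick PR that adjusts the RoPE embedding calculation for GLM models and adds the environment variable FD_ENABLE_RL to align precision with training.

Scope of changes: custom_ops/gpu_ops/append_attn/, fastdeploy/envs.py, fastdeploy/model_executor/layers/rotary_embedding.py

Impact tags: [RL] [OP]

📝 PR Convention Check

The Motivation and Modifications sections of the PR description are empty. Please fill them in.

Description template (can be copied directly):

## Motivation

To align GLM inference results with training results, the precision of the RoPE embedding calculation needs to be adjusted.

## Modifications

1. Add the environment variable `FD_ENABLE_RL` to control whether RL alignment mode is enabled
2. In the Python layer (GlmRotaryEmbedding), select the calculation method based on `FD_ENABLE_RL`
3. In the CUDA kernels, hard-code `EnforceFmulRN` to `false` to avoid using IEEE-754 compliant rounding

Issues

| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | custom_ops/gpu_ops/append_attn/decoder_write_cache_with_rope_kernel.cu:152 | Hard-coded EnforceFmulRN=false has too broad an impact |
| 🔴 Bug | fastdeploy/model_executor/layers/rotary_embedding.py:91 | Python and CUDA layers use inconsistent logic |
| 🔴 Bug | fastdeploy/envs.py:272 | Environment variable is not used in the CUDA layer |

Overall assessment

The PR aims to align the precision of the RoPE embedding calculation for GLM models, but the implementation has problems. Hard-coding EnforceFmulRN=false in the CUDA kernels affects every model that uses those kernels, not only GLM. In addition, the Python layer is controlled by an environment variable while the CUDA layer is hard-coded, so the two layers are inconsistent. Consider narrowing the scope of the change or using a finer-grained control mechanism.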

auto* kernelFn = append_decode_cache_T_neox_partial_rope_kernel<
T,
PackSize,
false>; // GLM use EnforceFmulRN=false

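🔴 Bug Hard-coding EnforceFmulRN=false affects every model that uses this kernel, not just GLM.

append_decode_cache_rope() is called by several models (for example Qwen and Ernie), but the comment only says "GLM use EnforceFmulRN=false". If this change is meant only for GLM, it will also alter the precision of the other models.

Suggestions:

  1. If the change targets only GLM, check the model type at the call site or use a separate kernel.
  2. Alternatively, make EnforceFmulRN configurable and set it from the environment variable read by getEnvEnableRL().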
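
To illustrate why a flag like this can change results for every model sharing the kernel, here is a minimal pure-Python sketch. It assumes (the thread does not confirm this) that EnforceFmulRN decides whether each multiply is rounded to float32 on its own before the add, as CUDA's `__fmul_rn` intrinsic does, instead of being eligible for contraction into a fused multiply-add. The helper names `to_f32`, `mul_then_add`, and `fused_mul_add` are hypothetical and are not part of FastDeploy.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mul_then_add(a: float, b: float, c: float) -> float:
    """Round the product to float32 first, then do a float32 add (two roundings)."""
    return to_f32(to_f32(a * b) + c)


def fused_mul_add(a: float, b: float, c: float) -> float:
    """Emulated fused multiply-add with a single rounding at the end.

    This is exact for the inputs below; for arbitrary inputs the
    intermediate float64 add may itself round, so treat it as an approximation.
    """
    return to_f32(a * b + c)


# Inputs chosen so that the float32 product lands exactly on a rounding tie.
a = b = to_f32(1.0 + 2.0 ** -12)
c = to_f32(-(1.0 + 2.0 ** -11))

separate = mul_then_add(a, b, c)
fused = fused_mul_add(a, b, c)
print(separate, fused)
```

With these inputs the separately rounded path returns 0.0 while the fused path returns 2^-24. Differences of this size are the kind that can surface when inference and training kernels round differently, which is why the scope of the flag matters.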

T,
PackSize,
QKV_TYPE,
false> // GLM use EnforceFmulRN=false

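🔴 Bug Hard-coding EnforceFmulRN=false affects every model that uses this kernel.

Same issue as above: this function is not used only by GLM models, so the hard-coded value affects other models as well.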

bsz, max_seq_len = position_ids.shape[:2]
inv_freq = self.base ** (-paddle.arange(0, self.rotary_dim, 2, dtype="float32") / self.rotary_dim)
freqs = paddle.einsum("ij,k->ijk", position_ids.cast("float32"), inv_freq)
if envs.FD_ENABLE_RL == 1:

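🔴 Bug The Python layer is controlled by the environment variable FD_ENABLE_RL, but the CUDA layer hard-codes false, so the two are inconsistent.

The Python code selects a different calculation depending on FD_ENABLE_RL, whereas the CUDA kernel hard-codes EnforceFmulRN to false and never reads FD_ENABLE_RL.

custom_ops/gpu_ops/helper.cu defines getEnvEnableRL() to read FD_ENABLE_RL, but the function is never called.

Suggestion: if an environment variable is the intended control, the CUDA kernel should also set EnforceFmulRN from the return value of getEnvEnableRL().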
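
For readers unfamiliar with the quoted snippet, the frequency computation before the `FD_ENABLE_RL` branch can be reproduced without Paddle. The following is a dependency-free sketch, not the FastDeploy implementation: the function name `rope_freqs` and the default `base=10000.0` are assumptions, the arithmetic runs in Python's float64 and not float32, and the body of the `FD_ENABLE_RL` branch is not shown in this thread, so it is not reproduced here.

```python
def rope_freqs(position_ids, rotary_dim, base=10000.0):
    """Return (inv_freq, freqs) with freqs[i][j][k] = position_ids[i][j] * inv_freq[k].

    Mirrors the einsum "ij,k->ijk" in the quoted snippet using nested lists.
    """
    # One inverse frequency per pair of rotary dimensions: base ** (-2k / rotary_dim).
    inv_freq = [base ** (-i / rotary_dim) for i in range(0, rotary_dim, 2)]
    # Outer product of every position with every inverse frequency.
    freqs = [[[float(p) * f for f in inv_freq] for p in row] for row in position_ids]
    return inv_freq, freqs


# Two sequences of three positions each, with a rotary dimension of 4.
position_ids = [[0, 1, 2], [0, 1, 2]]
inv_freq, freqs = rope_freqs(position_ids, rotary_dim=4)
print(len(freqs), len(freqs[0]), len(freqs[0][0]))
```

The result has shape (batch, sequence length, rotary_dim / 2), and the angles it holds are the values whose sine and cosine a RoPE kernel would apply.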

Comment thread fastdeploy/envs.py
int(os.getenv("FD_SAVE_OUTPUT_CACHE_FOR_PREEMPTED_REQUEST", "1"))
),
# Whether to align RoPE and moe gate precision with training
"FD_ENABLE_RL": lambda: int(os.getenv("FD_ENABLE_RL", "0")),

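🔴 Bug The environment variable FD_ENABLE_RL is not actually used in the CUDA layer.

custom_ops/gpu_ops/helper.cu:160-166 defines getEnvEnableRL() to read the FD_ENABLE_RL environment variable, but the kernel launch does not use its return value and hard-codes EnforceFmulRN=false instead.

Suggestion: either use the return value of getEnvEnableRL() in the CUDA kernel or remove the unused function.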
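
As a small illustration of how the quoted `FD_ENABLE_RL` entry behaves on the Python side, the sketch below evaluates the same lambda through a stand-in registry. The names `environment_variables` and `read_env` are hypothetical, and how `fastdeploy/envs.py` actually resolves attribute access is not shown in this thread.

```python
import os

# Hypothetical registry holding the same getter as the entry quoted above.
environment_variables = {
    # Whether to align RoPE and moe gate precision with training
    "FD_ENABLE_RL": lambda: int(os.getenv("FD_ENABLE_RL", "0")),
}


def read_env(name: str) -> int:
    """Call the getter on each read so later changes to os.environ are picked up."""
    return environment_variables[name]()


# Unset: the getter falls back to the default string "0".
os.environ.pop("FD_ENABLE_RL", None)
default_value = read_env("FD_ENABLE_RL")

# Set to "1": the check `envs.FD_ENABLE_RL == 1` from the snippet would now pass.
os.environ["FD_ENABLE_RL"] = "1"
enabled_value = read_env("FD_ENABLE_RL")

print(default_value, enabled_value)
```

Because the value goes through `int()`, a setting such as `FD_ENABLE_RL=true` raises `ValueError` instead of enabling the mode, so only integer strings are valid.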

Collaborator

@EmmonsCurse EmmonsCurse left a comment

LGTM~ Skip coverage check as it mainly relies on tests with RL.

@zoooo0820 zoooo0820 merged commit 42b0f59 into PaddlePaddle:release/2.6 Apr 11, 2026
52 of 57 checks passed
@zoooo0820 zoooo0820 deleted the cp26_rope branch April 11, 2026 10:38

7 participants