feat: add profiler_device speedup metric for CUDA benchmarks #693
Merged
Xreki merged 6 commits into PaddlePaddle:develop on Apr 17, 2026
Add PyTorch Profiler device time measurement and speedup calculation for CUDA benchmarks. This provides a more accurate kernel-level performance comparison between eager and compiled models.

- Add --profiler-device-time CLI argument
- Measure profiler_device via torch.profiler with CUDA activity
- Print [Speedup][profiler_device] in benchmark results
- Propagate flag through multi-model test runners

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
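To illustrate the measurement described above, here is a minimal sketch of how device time can be collected with torch.profiler. It is not the code from this PR: the helper names `total_device_time_ms` and `profile_device_time_ms` are hypothetical, and summing the self device time of all profiled events is an assumption about how a `profiler_device` figure could be derived.

```python
from types import SimpleNamespace


def total_device_time_ms(events):
    """Sum self device time over profiler events and convert microseconds to ms."""
    total_us = 0.0
    for evt in events:
        # Newer PyTorch releases expose self_device_time_total; older ones
        # use self_cuda_time_total. Try the newer name first.
        t = getattr(evt, "self_device_time_total", None)
        if t is None:
            t = getattr(evt, "self_cuda_time_total", 0.0)
        total_us += t
    return total_us / 1000.0


def profile_device_time_ms(model, inputs):
    """Run one forward pass under torch.profiler and return device time in ms.

    Hypothetical helper: requires PyTorch and a CUDA device.
    """
    import torch  # imported lazily so the pure helper above works without it
    from torch.profiler import ProfilerActivity, profile

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        with torch.no_grad():
            model(*inputs)
        torch.cuda.synchronize()  # make sure all kernels finished before exit
    return total_device_time_ms(prof.key_averages())


# Stand-in events (made-up values, in microseconds) to exercise the helper
# without a GPU.
fake_events = [
    SimpleNamespace(self_device_time_total=30.0),
    SimpleNamespace(self_cuda_time_total=12.0),
]
demo_ms = total_device_time_ms(fake_events)
```

Because only kernel execution time is summed, launch overhead and host-side work are excluded, which is why a figure like this can be much smaller than a wall-clock or CUDA-event measurement of the same run.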
PR Category
Feature Enhancement
Description
Add GPU time measurement.
Details
graph-net-test-compiler-log [Processing] /work/ai4c/samples/hf_subgraphs_v2/fusible_subgraphs/c4/f4/c4f4baceed0eed457d338798e78817bd018b4cfe4a762fc95022e94a742dd3f1/graphs/hf_subgraphs_v2/fusible_subgraphs/float16/7/samples/transformers-auto-model/zuppif_maskformer-swin-small-ade/_decomposed/zuppif_maskformer-swin-small-ade_start261_end269_22
graph-net-test-compiler-log [Config] model: zuppif/maskformer-swin-small-ade
graph-net-test-compiler-log [Config] device: cuda
graph-net-test-compiler-log [Config] hardware: NVIDIA A100-SXM4-80GB
graph-net-test-compiler-log [Config] compiler: inductor
graph-net-test-compiler-log [Config] warmup: 5
graph-net-test-compiler-log [Config] trials: 10
graph-net-test-compiler-log [Config] compile_framework_version: 2.9.1+cu126
[Profiling] Using device: cuda NVIDIA A100-SXM4-80GB, warm up 5, trials 10
Trial 1: e2e=0.42009 ms, gpu=0.30118 ms
Trial 2: e2e=0.34142 ms, gpu=0.26333 ms
Trial 3: e2e=0.33832 ms, gpu=0.27142 ms
Trial 4: e2e=0.32544 ms, gpu=0.25709 ms
Trial 5: e2e=0.30947 ms, gpu=0.24982 ms
Trial 6: e2e=0.33855 ms, gpu=0.27082 ms
Trial 7: e2e=0.32330 ms, gpu=0.25642 ms
Trial 8: e2e=0.31805 ms, gpu=0.25197 ms
Trial 9: e2e=0.31590 ms, gpu=0.25078 ms
Trial 10: e2e=0.31233 ms, gpu=0.25248 ms
Trial 1: profiler_device=0.04288 ms
Trial 2: profiler_device=0.04198 ms
Trial 3: profiler_device=0.04186 ms
Trial 4: profiler_device=0.04202 ms
Trial 5: profiler_device=0.04208 ms
Trial 6: profiler_device=0.04198 ms
Trial 7: profiler_device=0.04189 ms
Trial 8: profiler_device=0.04186 ms
Trial 9: profiler_device=0.04218 ms
Trial 10: profiler_device=0.04205 ms
[Profiling] Using device: cuda NVIDIA A100-SXM4-80GB, warm up 5, trials 10
Trial 1: e2e=0.35548 ms, gpu=0.27174 ms
Trial 2: e2e=0.33617 ms, gpu=0.27578 ms
Trial 3: e2e=0.29993 ms, gpu=0.24602 ms
Trial 4: e2e=0.29445 ms, gpu=0.24115 ms
Trial 5: e2e=0.30160 ms, gpu=0.24838 ms
Trial 6: e2e=0.29421 ms, gpu=0.24150 ms
Trial 7: e2e=0.28825 ms, gpu=0.23693 ms
Trial 8: e2e=0.36645 ms, gpu=0.30074 ms
Trial 9: e2e=0.32663 ms, gpu=0.26058 ms
Trial 10: e2e=0.33236 ms, gpu=0.25965 ms
Trial 1: profiler_device=0.00733 ms
Trial 2: profiler_device=0.00723 ms
Trial 3: profiler_device=0.00698 ms
Trial 4: profiler_device=0.00698 ms
Trial 5: profiler_device=0.00694 ms
Trial 6: profiler_device=0.00698 ms
Trial 7: profiler_device=0.00691 ms
Trial 8: profiler_device=0.00694 ms
Trial 9: profiler_device=0.00698 ms
Trial 10: profiler_device=0.00698 ms
graph-net-test-compiler-log [Datatype][eager]: float16 float16
graph-net-test-compiler-log [Datatype][compiled]: float16 float16
graph-net-test-compiler-log [DataType] eager:['float16', 'float16'] compiled:['float16', 'float16'] match:True
graph-net-test-compiler-log [Correctness][equal]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-10_rtol_1.00E-06]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-10_rtol_2.56E-04]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-10_rtol_1.69E-12]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-14_rtol_1.00E-14]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-09_rtol_3.98E-06]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-09_rtol_5.85E-04]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-09_rtol_2.54E-11]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_2.51E-13_rtol_2.51E-13]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-08_rtol_1.58E-05]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-08_rtol_1.34E-03]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-08_rtol_3.82E-10]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_6.31E-12_rtol_6.31E-12]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-07_rtol_6.31E-05]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-07_rtol_3.06E-03]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-07_rtol_5.75E-09]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_1.58E-10_rtol_1.58E-10]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-06_rtol_2.51E-04]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-06_rtol_7.00E-03]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-06_rtol_8.65E-08]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_3.98E-09_rtol_3.98E-09]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-05_rtol_1.00E-03]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-05_rtol_1.60E-02]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-05_rtol_1.30E-06]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-07_rtol_1.00E-07]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-04_rtol_3.98E-03]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-04_rtol_3.66E-02]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-04_rtol_1.96E-05]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_2.51E-06_rtol_2.51E-06]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-03_rtol_1.58E-02]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-03_rtol_8.36E-02]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-03_rtol_2.94E-04]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_6.31E-05_rtol_6.31E-05]: 1 0
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-02_rtol_6.31E-02]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-02_rtol_1.91E-01]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-02_rtol_4.42E-03]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.58E-03_rtol_1.58E-03]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-01_rtol_2.51E-01]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-01_rtol_4.37E-01]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E-01_rtol_6.65E-02]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_3.98E-02_rtol_3.98E-02]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E+00_rtol_1.00E+00]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E+00_rtol_1.00E+00]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E+00_rtol_1.00E+00]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E+00_rtol_1.00E+00]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E+01_rtol_3.98E+00]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E+01_rtol_2.29E+00]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E+01_rtol_1.50E+01]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_2.51E+01_rtol_2.51E+01]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E+02_rtol_1.58E+01]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E+02_rtol_5.23E+00]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E+02_rtol_2.26E+02]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_6.31E+02_rtol_6.31E+02]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E+03_rtol_6.31E+01]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E+03_rtol_1.20E+01]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E+03_rtol_3.40E+03]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.58E+04_rtol_1.58E+04]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E+04_rtol_2.51E+02]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E+04_rtol_2.73E+01]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E+04_rtol_5.11E+04]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_3.98E+05_rtol_3.98E+05]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E+05_rtol_1.00E+03]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E+05_rtol_6.25E+01]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E+05_rtol_7.69E+05]: 1 1
graph-net-test-compiler-log [Correctness][all_close_atol_1.00E+07_rtol_1.00E+07]: 1 1
graph-net-test-compiler-log [Correctness][max_diff]: 0.0 0.00390625
graph-net-test-compiler-log [Correctness][mean_diff]: 0.0 0.00014134598313830793
graph-net-test-compiler-log [Result] status: success
graph-net-test-compiler-log [Performance][eager]: {"e2e": {"mean": 0.334287, "std": 0.0305726, "min": 0.309467, "max": 0.420094}, "gpu": {"mean": 0.262531, "std": 0.0149047, "min": 0.249824, "max": 0.301184}, "profiler_device": {"mean": 0.0420766, "std": 0.000284628, "min": 0.041855, "max": 0.04288}}
graph-net-test-compiler-log [Performance][compiled]: {"e2e": {"mean": 0.319552, "std": 0.0263446, "min": 0.288248, "max": 0.366449}, "gpu": {"mean": 0.258246, "std": 0.0189105, "min": 0.236928, "max": 0.300736}, "profiler_device": {"mean": 0.007024, "std": 0.000131356, "min": 0.006912, "max": 0.007328}}
graph-net-test-compiler-log [Speedup][e2e]: 1.04611
graph-net-test-compiler-log [Speedup][gpu]: 1.01659
graph-net-test-compiler-log [Speedup][profiler_device]: 5.99040
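The three speedup figures at the end of the log are consistent with dividing the eager mean by the compiled mean for each metric. The sketch below reproduces them from the reported `[Performance]` means; the `speedups` helper is hypothetical, and the mean-ratio formula is inferred from the numbers rather than taken from the PR's code.

```python
# Mean times in ms, copied from the [Performance] lines of the log above.
eager = {"e2e": 0.334287, "gpu": 0.262531, "profiler_device": 0.0420766}
compiled = {"e2e": 0.319552, "gpu": 0.258246, "profiler_device": 0.007024}


def speedups(eager_means, compiled_means):
    """Return eager/compiled mean-time ratios per metric (assumed formula)."""
    return {name: eager_means[name] / compiled_means[name] for name in eager_means}


result = speedups(eager, compiled)
for name, value in result.items():
    print(f"[Speedup][{name}]: {value:.5f}")
# [Speedup][e2e]: 1.04611
# [Speedup][gpu]: 1.01659
# [Speedup][profiler_device]: 5.99040
```

The gap between the roughly 1.02x `gpu` speedup and the roughly 6x `profiler_device` speedup suggests that for a subgraph this small the e2e and gpu timings are dominated by overhead outside the kernels themselves, which is the motivation given for the kernel-level metric.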