Large performance fluctuation with NvFuserScheduler_SBR_Norm_fp32/8/640/128 #456

@naoyam

The performance of NvFuserScheduler_SBR_Norm_fp32/8/640/128 fluctuates by more than 10% between invocations on an A100 PCIe 80GB board. For consistency, I locked the core clock at 1380 MHz, because the default 1410 MHz caused the frequency to be throttled automatically.

Measured at 48a2aab. Here is the result of one invocation. Note that the measured time is consistently around 2350 us across all 10 repetitions.

$ ./bin/nvfuser_bench --benchmark_filter=NvFuserScheduler_SBR_Norm_fp32/8/640/128 --benchmark_min_time=0.01 --benchmark_repetitions=10
The number of inputs is very large. NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16 will be repeated at least 270 times.
The number of inputs is very large. NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16 will be repeated at least 120 times.
The number of inputs is very large. NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16 will be repeated at least 270 times.
The number of inputs is very large. NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16 will be repeated at least 120 times.
The number of inputs is very large. NvFuserScheduler_TIMM_LayerNorm_fp16___GRAPH/NvFuserScheduler_TIMM_LayerNorm_fp16 will be repeated at least 270 times.
The number of inputs is very large. NvFuserScheduler_TIMM_LayerNorm_fp16___GRAPH/NvFuserScheduler_TIMM_LayerNorm_fp16 will be repeated at least 120 times.
The number of inputs is very large. Baseline_TIMM_LayerNorm_fp16 will be repeated at least 270 times.
The number of inputs is very large. Baseline_TIMM_LayerNorm_fp16 will be repeated at least 120 times.
2023-06-05T17:48:53-07:00
Running ./bin/nvfuser_bench
Run on (64 X 3593.24 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x32)
  L1 Instruction 32 KiB (x32)
  L2 Unified 512 KiB (x32)
  L3 Unified 16384 KiB (x8)
Load Average: 1.39, 23.79, 21.70
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                   Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------------------------------------------------------
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time              2356 us         2430 us            6 bytes_per_second=1.29547T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time              2354 us         2429 us            6 bytes_per_second=1.29641T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time              2355 us         2430 us            6 bytes_per_second=1.29566T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time              2352 us         2426 us            6 bytes_per_second=1.29745T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time              2353 us         2427 us            6 bytes_per_second=1.29698T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time              2354 us         2497 us            6 bytes_per_second=1.29622T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time              2353 us         2488 us            6 bytes_per_second=1.29698T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time              2350 us         2488 us            6 bytes_per_second=1.29858T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time              2355 us         2497 us            6 bytes_per_second=1.29566T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time              2353 us         2490 us            6 bytes_per_second=1.29669T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time_mean         2354 us         2460 us           10 bytes_per_second=1.29661T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time_median       2354 us         2459 us           10 bytes_per_second=1.29655T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time_stddev       1.72 us         33.7 us           10 bytes_per_second=996.643M/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time_cv           0.07 %          1.37 %            10 bytes_per_second=0.07% 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]

Here is the result of another invocation. This time, all 10 repetitions took around 2620 us.

$ ./bin/nvfuser_bench --benchmark_filter=NvFuserScheduler_SBR_Norm_fp32/8/640/128 --benchmark_min_time=0.01 --benchmark_repetitions=10
The number of inputs is very large. NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16 will be repeated at least 270 times.
The number of inputs is very large. NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16 will be repeated at least 120 times.
The number of inputs is very large. NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16 will be repeated at least 270 times.
The number of inputs is very large. NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16 will be repeated at least 120 times.
The number of inputs is very large. NvFuserScheduler_TIMM_LayerNorm_fp16___GRAPH/NvFuserScheduler_TIMM_LayerNorm_fp16 will be repeated at least 270 times.
The number of inputs is very large. NvFuserScheduler_TIMM_LayerNorm_fp16___GRAPH/NvFuserScheduler_TIMM_LayerNorm_fp16 will be repeated at least 120 times.
The number of inputs is very large. Baseline_TIMM_LayerNorm_fp16 will be repeated at least 270 times.
The number of inputs is very large. Baseline_TIMM_LayerNorm_fp16 will be repeated at least 120 times.
2023-06-05T17:54:09-07:00
Running ./bin/nvfuser_bench
Run on (64 X 3592.74 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x32)
  L1 Instruction 32 KiB (x32)
  L2 Unified 512 KiB (x32)
  L3 Unified 16384 KiB (x8)
Load Average: 0.09, 8.30, 15.45
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                   Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------------------------------------------------------
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time              2621 us         2736 us            5 bytes_per_second=1.16434T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time              2622 us         2695 us            5 bytes_per_second=1.16397T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time              2621 us         2764 us            5 bytes_per_second=1.16424T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time              2622 us         2696 us            5 bytes_per_second=1.16406T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time              2619 us         2730 us            5 bytes_per_second=1.16525T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time              2617 us         2692 us            5 bytes_per_second=1.16598T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time              2620 us         2693 us            5 bytes_per_second=1.16479T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time              2624 us         2698 us            5 bytes_per_second=1.16297T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time              2622 us         2696 us            5 bytes_per_second=1.16406T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time              2623 us         2698 us            5 bytes_per_second=1.16334T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time_mean         2621 us         2710 us           10 bytes_per_second=1.1643T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time_median       2621 us         2697 us           10 bytes_per_second=1.16415T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time_stddev       1.96 us         24.8 us           10 bytes_per_second=915.461M/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time_cv           0.07 %          0.91 %            10 bytes_per_second=0.07% 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
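
To put a number on the gap between the two invocations above, here is a minimal Python sketch that compares the two reported manual_time means (2354 us and 2621 us). The helper name `relative_slowdown` is made up for illustration and is not part of nvfuser_bench.

```python
# Mean manual_time (us) copied from the two benchmark outputs above.
fast_mean_us = 2354.0
slow_mean_us = 2621.0


def relative_slowdown(fast: float, slow: float) -> float:
    """Fractional increase of `slow` over `fast` (hypothetical helper)."""
    return (slow - fast) / fast


gap = relative_slowdown(fast_mean_us, slow_mean_us)
print(f"slow invocation is {gap:.1%} slower")  # prints "slow invocation is 11.3% slower"
```

The within-invocation variation is only about 0.07% (the manual_time_cv rows), so the roughly 11% gap is between invocations, not between repetitions inside one invocation.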

The same performance difference also occurred when the core clock was locked at 1050 MHz.
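
Since the time is stable within an invocation but differs across invocations, one way to reproduce this is to launch the benchmark several times and record the mean of each launch. The following is a rough sketch, not something from this issue: it assumes nvfuser_bench accepts the standard Google Benchmark flags `--benchmark_out` and `--benchmark_out_format=json`, and that the JSON report carries the manual time in the `real_time` field of the `mean` aggregate row. The function names are hypothetical.

```python
import json
import subprocess
import tempfile
from pathlib import Path


def mean_manual_time(report: dict) -> float:
    """Return real_time of the 'mean' aggregate row of a Google Benchmark JSON report."""
    for row in report["benchmarks"]:
        if row.get("run_type") == "aggregate" and row.get("aggregate_name") == "mean":
            return row["real_time"]
    raise ValueError("no 'mean' aggregate row found in report")


def run_once(binary: str, bench_filter: str, repetitions: int = 10) -> float:
    """Launch the benchmark once and return its mean time (assumed flags, see above)."""
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "report.json"
        subprocess.run(
            [
                binary,
                f"--benchmark_filter={bench_filter}",
                "--benchmark_min_time=0.01",
                f"--benchmark_repetitions={repetitions}",
                f"--benchmark_out={out}",
                "--benchmark_out_format=json",
            ],
            check=True,
            stdout=subprocess.DEVNULL,  # the console table is not needed here
        )
        return mean_manual_time(json.loads(out.read_text()))


def spread(means: list[float]) -> float:
    """Relative gap between the slowest and fastest invocation."""
    return (max(means) - min(means)) / min(means)
```

A possible usage would be `means = [run_once("./bin/nvfuser_bench", "NvFuserScheduler_SBR_Norm_fp32/8/640/128") for _ in range(20)]` followed by `print(spread(means))`. If the behavior reported here reproduces, the means should cluster around two values rather than one.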
