Closed
Description
The performance of NvFuserScheduler_SBR_Norm_fp32/8/640/128 fluctuates by more than 10% between invocations on an A100 PCIe 80GB board. For consistency, I locked the core clock at 1380 MHz, since at the default 1410 MHz the frequency was automatically lowered.
Measurements were taken at commit 48a2aab. Here is the result of one invocation. Note that the measured time is consistently around 2350 us across all 10 repetitions.
$ ./bin/nvfuser_bench --benchmark_filter=NvFuserScheduler_SBR_Norm_fp32/8/640/128 --benchmark_min_time=0.01 --benchmark_repetitions=10
The number of inputs is very large. NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16 will be repeated at least 270 times.
The number of inputs is very large. NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16 will be repeated at least 120 times.
The number of inputs is very large. NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16 will be repeated at least 270 times.
The number of inputs is very large. NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16 will be repeated at least 120 times.
The number of inputs is very large. NvFuserScheduler_TIMM_LayerNorm_fp16___GRAPH/NvFuserScheduler_TIMM_LayerNorm_fp16 will be repeated at least 270 times.
The number of inputs is very large. NvFuserScheduler_TIMM_LayerNorm_fp16___GRAPH/NvFuserScheduler_TIMM_LayerNorm_fp16 will be repeated at least 120 times.
The number of inputs is very large. Baseline_TIMM_LayerNorm_fp16 will be repeated at least 270 times.
The number of inputs is very large. Baseline_TIMM_LayerNorm_fp16 will be repeated at least 120 times.
2023-06-05T17:48:53-07:00
Running ./bin/nvfuser_bench
Run on (64 X 3593.24 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x32)
L1 Instruction 32 KiB (x32)
L2 Unified 512 KiB (x32)
L3 Unified 16384 KiB (x8)
Load Average: 1.39, 23.79, 21.70
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------------------------------------------------------
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time 2356 us 2430 us 6 bytes_per_second=1.29547T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time 2354 us 2429 us 6 bytes_per_second=1.29641T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time 2355 us 2430 us 6 bytes_per_second=1.29566T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time 2352 us 2426 us 6 bytes_per_second=1.29745T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time 2353 us 2427 us 6 bytes_per_second=1.29698T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time 2354 us 2497 us 6 bytes_per_second=1.29622T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time 2353 us 2488 us 6 bytes_per_second=1.29698T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time 2350 us 2488 us 6 bytes_per_second=1.29858T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time 2355 us 2497 us 6 bytes_per_second=1.29566T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time 2353 us 2490 us 6 bytes_per_second=1.29669T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time_mean 2354 us 2460 us 10 bytes_per_second=1.29661T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time_median 2354 us 2459 us 10 bytes_per_second=1.29655T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time_stddev 1.72 us 33.7 us 10 bytes_per_second=996.643M/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time_cv 0.07 % 1.37 % 10 bytes_per_second=0.07% 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
Here is the result of another invocation of the same command. This time, all 10 repetitions came in at around 2620 us.
$ ./bin/nvfuser_bench --benchmark_filter=NvFuserScheduler_SBR_Norm_fp32/8/640/128 --benchmark_min_time=0.01 --benchmark_repetitions=10
The number of inputs is very large. NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16 will be repeated at least 270 times.
The number of inputs is very large. NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16 will be repeated at least 120 times.
The number of inputs is very large. NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16 will be repeated at least 270 times.
The number of inputs is very large. NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16 will be repeated at least 120 times.
The number of inputs is very large. NvFuserScheduler_TIMM_LayerNorm_fp16___GRAPH/NvFuserScheduler_TIMM_LayerNorm_fp16 will be repeated at least 270 times.
The number of inputs is very large. NvFuserScheduler_TIMM_LayerNorm_fp16___GRAPH/NvFuserScheduler_TIMM_LayerNorm_fp16 will be repeated at least 120 times.
The number of inputs is very large. Baseline_TIMM_LayerNorm_fp16 will be repeated at least 270 times.
The number of inputs is very large. Baseline_TIMM_LayerNorm_fp16 will be repeated at least 120 times.
2023-06-05T17:54:09-07:00
Running ./bin/nvfuser_bench
Run on (64 X 3592.74 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x32)
L1 Instruction 32 KiB (x32)
L2 Unified 512 KiB (x32)
L3 Unified 16384 KiB (x8)
Load Average: 0.09, 8.30, 15.45
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------------------------------------------------------
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time 2621 us 2736 us 5 bytes_per_second=1.16434T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time 2622 us 2695 us 5 bytes_per_second=1.16397T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time 2621 us 2764 us 5 bytes_per_second=1.16424T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time 2622 us 2696 us 5 bytes_per_second=1.16406T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time 2619 us 2730 us 5 bytes_per_second=1.16525T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time 2617 us 2692 us 5 bytes_per_second=1.16598T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time 2620 us 2693 us 5 bytes_per_second=1.16479T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time 2624 us 2698 us 5 bytes_per_second=1.16297T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time 2622 us 2696 us 5 bytes_per_second=1.16406T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time 2623 us 2698 us 5 bytes_per_second=1.16334T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time_mean 2621 us 2710 us 10 bytes_per_second=1.1643T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time_median 2621 us 2697 us 10 bytes_per_second=1.16415T/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time_stddev 1.96 us 24.8 us 10 bytes_per_second=915.461M/s 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
NvFuserScheduler_SBR_Norm_fp32___GRAPH/NvFuserScheduler_SBR_Norm_fp32/8/640/128/manual_time_cv 0.07 % 0.91 % 10 bytes_per_second=0.07% 2D Schedule at 2/Vectorize, Factor: 4/Launch_Parameters[block(1/1/128)/grid(1/160/5120)/0]
The same difference between invocations also appeared with the core clock locked at 1050 MHz.
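To put a number on the gap, here is a minimal sketch that compares the two invocations using the `manual_time_mean` and `manual_time_stddev` values copied from the outputs above. The variable and function names are mine, chosen for illustration; nothing here is part of nvfuser_bench.

```python
# Values copied from the manual_time_mean / manual_time_stddev rows above (us).
mean_fast_us = 2354.0    # first invocation
mean_slow_us = 2621.0    # second invocation
stddev_fast_us = 1.72
stddev_slow_us = 1.96


def relative_gap(fast: float, slow: float) -> float:
    """Slowdown of `slow` relative to `fast`, as a fraction of `fast`."""
    return (slow - fast) / fast


gap = relative_gap(mean_fast_us, mean_slow_us)

# Express the absolute gap in units of the larger within-invocation stddev,
# to show how far outside the repetition-to-repetition noise it lies.
gap_in_stddevs = (mean_slow_us - mean_fast_us) / max(stddev_fast_us, stddev_slow_us)

print(f"relative gap: {gap:.1%}")                              # relative gap: 11.3%
print(f"gap / within-invocation stddev: {gap_in_stddevs:.0f}")  # 136
```

The repetitions within one invocation agree to about 0.07%, while the two invocations differ by about 11%, so the variation is between process launches rather than between repetitions.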