Description
This is a bizarre issue.
The observed behavior is that our pytest benchmark using torch.profiler.profile seems to non-deterministically drop events across consecutive benchmark runs.
In PR branch #3743, running the backward benchmarks as a whole generates numbers like this (on H100):
running NVFUSER_DISABLE=kernel_reuse pytest --benchmark-thunder test_rope.py -k bwd
Name (time in us) Mean Median
-----------------------------------------------------------------------------------------------------------------------
test_rope_bwd_benchmark[executor='thunder'-variation='llama_2_7b_hf_rope'] 871.5996 (5.23) 871.6050 (5.24)
test_rope_bwd_benchmark[executor='thunder'-variation='llama_3_8B_rope'] 1,443.0095 (8.66) 1,442.9955 (8.67)
test_rope_bwd_benchmark[executor='thunder'-variation='hf_mistral_nemo_rope'] 166.5515 (1.0) 166.4480 (1.0)
test_rope_bwd_benchmark[executor='thunder'-variation='hf_qwen2_rope'] 386.4463 (2.32) 386.5565 (2.32)
test_rope_bwd_benchmark[executor='thunder'-variation='hf_phi3_rope'] 452.3351 (2.72) 452.0685 (2.72)
-----------------------------------------------------------------------------------------------------------------------
In that example, if we comment out the other variants inside benchmarks/python/test_rope.py and run only hf_phi3, for example, we get numbers like:
test_rope_bwd_benchmark[executor='thunder'-variation='hf_phi3_rope'] 514.7512 514.4900
Further debugging narrowed it down to here:
Fuser/benchmarks/python/core.py, lines 156 to 157 at 8ea30c7:

prof_averages = self.prof.key_averages()
elapsed_cuda_time = self._get_kernel_time(prof_averages)
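To make the role of that measurement concrete: key_averages() aggregates the recorded profiler events, and the kernel time is then summed over the events that ran on the GPU. A minimal, self-contained sketch of that aggregation, using a hypothetical EventAvg dataclass standing in for torch's FunctionEventAvg entries (the real _get_kernel_time may filter and sum differently):

```python
from dataclasses import dataclass

# Hypothetical stand-in for torch.autograd.profiler's FunctionEventAvg;
# the real objects carry many more fields.
@dataclass
class EventAvg:
    key: str                   # event name, e.g. a kernel symbol
    device_time_total: float   # accumulated device (CUDA) time in us
    is_device_event: bool      # stand-in for "this event ran on the GPU"

def get_kernel_time(prof_averages):
    """Sum device time over GPU events (sketch of what a helper like
    _get_kernel_time would do with the key_averages() output)."""
    return sum(e.device_time_total for e in prof_averages if e.is_device_event)

averages = [
    EventAvg("nvfuser_kernel_0", 120.0, True),
    EventAvg("nvfuser_kernel_1", 80.5, True),
    EventAvg("aten::copy_", 15.0, False),   # host-side event, excluded
]
print(get_kernel_time(averages))  # 200.5
```

If one kernel's events are dropped by the profiler, its device_time_total never enters this sum, which is consistent with the lower numbers seen in the combined run.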
We noticed that the benchmark discrepancy comes from CUDA events being dropped in consecutive runs: when 6 kernels run in the backward pass, only 5 of them are recorded by the profiler.
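One way to confirm that events are being dropped (rather than kernels not launching) is to compare the recorded kernel names and their counts across repeated profiled runs. A hedged sketch of that check in plain Python; in the real benchmark the per-run name lists would come from prof.key_averages(), and the kernel names below are illustrative:

```python
from collections import Counter

def diff_runs(run_a, run_b):
    """Return kernel names whose recorded event counts differ between two
    profiled runs (each run is a list of recorded kernel names)."""
    ca, cb = Counter(run_a), Counter(run_b)
    return {name: (ca[name], cb[name])
            for name in ca.keys() | cb.keys()
            if ca[name] != cb[name]}

# Illustrative data: 6 kernels launch each run, but the second run's
# profile only records 5 of them.
run_a = [f"kernel_{i}" for i in range(6)]
run_b = [f"kernel_{i}" for i in range(6) if i != 3]
print(diff_runs(run_a, run_b))  # {'kernel_3': (1, 0)}
```

A non-empty diff across back-to-back runs of the same workload would point at the profiler dropping events rather than at a genuine change in the launched kernels.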