Fixing rope pytest benchmark grad accumulation #3743
Conversation
PR Reviewer Guide 🔍 (Review updated until commit 378979b). Here are some key observations to aid the review process:
phi3 now shows 450 us vs 494 us (ref), which looks strange. I'll pull a profile to investigate. Even stranger: running bwd with only phi3 gives slower kernels, but with qwen2 before phi3 we get faster kernels... looks like we are picking up some cache here and our heuristic isn't making the best decision 😢
Following up on #3394 (comment).
Two separate issues I'm seeing with the benchmark numbers:
Investigating... cc'ing @naoyam in case you are looking at backward time.
Yes, I'm looking at Phi3, but still focusing only on the forward fusion. I wonder if it's specific to Phi3? Or could it be just because it was executed last? For example, if Qwen2 were executed after Phi3, would Qwen2 see a similar discrepancy?
Strangely, that seems to only affect phi3. I swapped it to run phi3 before qwen2 and haven't noticed any dropped events in qwen2. But given that it's not deterministic in phi3 in the first place, I'm not sure this means there's anything specific to phi3.
Re: missing events.
So I'm now wondering if we can actually run the profiler this way at all. The patterns I see in pytorch all use the context manager; not sure if we are hitting a bug in the profiler?
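The difference between the two usage patterns can be sketched with a toy profiler (the `ToyProfiler` class below is hypothetical, not `torch.profiler` itself): the context-manager form guarantees `__exit__` runs and events get flushed, while manual start/stop can silently drop events if `stop()` is never reached.

```python
class ToyProfiler:
    """Toy stand-in for a profiler; illustrative only, not torch.profiler."""

    def __init__(self):
        self.events = []
        self._active = False
        self.flushed = None  # events become visible only after stop()

    def start(self):
        self._active = True

    def stop(self):
        self._active = False
        self.flushed = list(self.events)

    def record(self, name):
        if self._active:
            self.events.append(name)

    # Context-manager form: stop() always runs, even on an exception.
    def __enter__(self):
        self.start()
        return self

    def __exit__(self, *exc):
        self.stop()
        return False


# Manual start/stop: if an early return or exception skips stop(),
# recorded events are never flushed and appear to be "dropped".
p1 = ToyProfiler()
p1.start()
p1.record("kernel_a")
# ... suppose stop() is never reached: p1.flushed stays None

# Context manager: the flush is guaranteed on block exit.
with ToyProfiler() as p2:
    p2.record("kernel_a")
```

This is only a sketch of why the context-manager pattern is the safer default; whether that is the actual failure mode in the benchmark profiler is exactly what's being investigated here.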
I don't think this is specific to phi3. At least I'm seeing the same thing happening to qwen2 when it's not running as the first benchmark. So there's indeed something wrong with the benchmark profiler and pytest.
Can you open an issue for this, mentioning the order of benchmarks needed to reproduce it?
```diff
  # a reference point for torchcompile and eager executor for comparison.
  run_benchmark(
-     benchmark, unary_bwd_torch, [output, grad(), fwd_inputs()], iobytes=iobytes()
+     benchmark, unary_bwd_torch, [output, grad(), *fwd_inputs], iobytes=iobytes()
```
The reason I need to expand fwd_inputs is to make it work with the infra code that clears grad from torch.Tensor inputs: fwd_inputs is a sequence of torch.Tensor, so unpacking exposes each tensor to that code directly.
An alternative is to flatten the inputs here instead.
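A minimal sketch of why the unpacking matters, using a toy `FakeTensor` class and a hypothetical `clear_grads` helper in place of the real infra code: if `fwd_inputs` stays nested as a list, a loop over the argument list never sees the tensors inside it, so their grads are never cleared.

```python
class FakeTensor:
    """Toy stand-in for torch.Tensor with a .grad attribute."""

    def __init__(self, name):
        self.name = name
        self.grad = "stale"


def clear_grads(args):
    # Sketch of the infra pattern: clear .grad on every tensor-like
    # argument before the next measured iteration.
    for a in args:
        if isinstance(a, FakeTensor):
            a.grad = None


output, grad = FakeTensor("out"), FakeTensor("grad")
fwd_inputs = [FakeTensor("x"), FakeTensor("y")]

# Nested: the inner list is not a tensor, so its elements are skipped.
clear_grads([output, grad, fwd_inputs])
nested_result = fwd_inputs[0].grad  # still "stale"

# Unpacked with *: every tensor is a top-level argument and gets cleared.
clear_grads([output, grad, *fwd_inputs])
flat_result = fwd_inputs[0].grad  # None
```

The alternative mentioned above, flattening inside the infra code, would move this unpacking logic out of each benchmark's call site.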
!test --diff-bench
Failure doesn't look related. Merged as-is.
#3349 removed grad accumulation, but the rope benchmark implementation needs an update to get that working.
Reference implementation.
after l2_cache clear
Before this PR:
In this PR:
With the existing issue on pytest/torch.profiler, if I instead run each benchmark separately,
So these numbers do match the manual benchmark with l2_cache cleared. I think that justifies this PR.
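For reference, the idea behind an l2_cache clear between timed iterations is to evict the previous run's working set so every measurement starts cold. On GPU this is typically done by zeroing a buffer larger than L2 between iterations; the sketch below is a CPU analog with a hypothetical `flush_cache` helper and an assumed 32 MB flush size, purely to illustrate the pattern, not the actual benchmark infra.

```python
import time

# Assumed to exceed the last-level cache; tune for the target machine.
CACHE_FLUSH_BYTES = 32 * 1024 * 1024
flush_buf = bytearray(CACHE_FLUSH_BYTES)


def flush_cache():
    # Touch one byte per 64-byte cache line so prior data is evicted.
    for i in range(0, len(flush_buf), 64):
        flush_buf[i] = (flush_buf[i] + 1) & 0xFF


def timed(fn, iters=3):
    # Flush before each iteration so fn() is measured with a cold cache.
    times = []
    for _ in range(iters):
        flush_cache()
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return min(times)
```

Without a flush like this, whichever benchmark runs second can inherit warm cache state from the first, which is consistent with the ordering-dependent numbers discussed above.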