
Couple of thoughts on benchmark results in PRs #23085

Open
moorepants opened this issue Feb 16, 2022 · 9 comments

@moorepants
Member

This is what we now see:

[screenshot: the benchmark results comment as it currently appears on a PR]

This is a great new feature, but there are a couple of things that confuse me or don't seem ideal.

  1. Green is usually associated with positive or good results, but the good results are shown in red. I recommend swapping the colors.
  2. The output shows slowdowns with respect to the tip of master and with respect to the prior released version. The output for the prior released version is mostly just noise for the PR developer, because the vast majority of the time those slowdowns/speedups are due to something else that has been merged into master since the prior release. I think we should only show the slowdowns compared to master in the PR report, as that is what is most relevant, i.e. "is my PR slowing things down". The benchmarks comparing to the prior release only need to be computed when we are making a new release, to see if any slowdowns snuck in.
moorepants added the CI label on Feb 16, 2022
@oscarbenjamin
Contributor

  1. Green is usually associated with positive or good results, but the good results are shown in red. I recommend swapping the colors.

This can be done, but it requires something a bit more than the current approach, which just formats the asv output as a diff like

    ```diff
    <asv output>
    ```

body: |
Benchmark results from GitHub Actions
Lower numbers are good, higher numbers are bad. A ratio less than 1
means a speed up and greater than 1 means a slowdown. Green lines
beginning with `+` are slowdowns (the PR is slower than master or
master is slower than the previous release). Red lines beginning
with `-` are speedups.
Significantly changed benchmark results (PR vs master)
```diff
${{ steps.pr_vs_master_changed.outputs.content }}
```
Significantly changed benchmark results (master vs previous release)
```diff
${{ steps.master_vs_release_changed.outputs.content }}
```
Full benchmark results can be found as artifacts in GitHub Actions
(click on checks at the top of the PR).
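
One way to get the colors swapped, while keeping the diff-based formatting above, would be a small post-processing step that flips the leading `+`/`-` markers before the output is pasted into the diff block. This is only a sketch of the idea, not part of the existing workflow; the function name is made up:

```python
def flip_diff_markers(asv_output: str) -> str:
    """Swap leading '+'/'-' so GitHub's diff colouring shows speedups
    (which asv marks with '-') in green and slowdowns ('+') in red.

    Hypothetical helper, not part of the current workflow.
    """
    flipped = []
    for line in asv_output.splitlines():
        if line.startswith("+"):
            flipped.append("-" + line[1:])
        elif line.startswith("-"):
            flipped.append("+" + line[1:])
        else:
            flipped.append(line)
    return "\n".join(flipped)
```

Flipping the markers also inverts their meaning, so the explanatory text in the comment body above would need to be updated to match.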

2. The benchmarks comparing to the prior release only need to be computed when we are making a new release to see if any slowdowns snuck in.

Ideally everyone would be paying attention to these slowdowns and doing something about them before the release comes. For previous releases I took the approach you describe but then had to spend a lot of time investigating and trying to address possible slowdowns before the release which added a huge amount of work to the process of actually making the release. I do not want to commit to doing that as part of the release process: the task of keeping on top of performance should be a continuous process with the responsibility distributed among all contributors.

One simple thing that we could do to reduce the output is to adjust the thresholds so that less significant changes don't show up. The output you show does represent real changes in some sense, but most of them are not especially significant.
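
As a rough illustration of that kind of threshold tweak (the parsing of asv's output into per-benchmark ratios is assumed to have happened already, and the factor value is just an example), a post-filter could look like:

```python
def filter_changes(rows, factor=1.5):
    """Keep only rows whose new/old time ratio falls outside [1/factor, factor].

    `rows` is assumed to be an iterable of (ratio, line) pairs already
    extracted from the asv comparison output; a larger `factor` hides more
    of the marginal changes.
    """
    return [line for ratio, line in rows if ratio > factor or ratio < 1 / factor]


# Example: with factor=1.5 only the 1.75x slowdown is reported.
rows = [(0.95, "noise"), (1.20, "mild slowdown"), (1.75, "real slowdown")]
print(filter_changes(rows))  # ['real slowdown']
```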

@moorepants
Member Author

Ok, I'm not sure how to fix point 1 offhand with the diff, but maybe there is some way to do so. I'll think about it.

For the latter, I think having people pay attention to whether their PR slows things down with respect to the tip of master is sufficient. Collectively, we should then end up with not much to fix at release time. The slowdowns shown with respect to the prior release can't be fixed in each PR because the cause is code merged since the release, right? We aren't expecting new PRs to fix those slowdowns, are we?

@oscarbenjamin
Contributor

The slowdowns shown with respect to the prior release can't be fixed in each PR because the cause is code merged since the release, right?

I'm not sure how often people check the benchmark results before merging. The PR comment auto-updates but doesn't give any notification that it has done so. For example, there was clearly a PR at some point that led to slowdowns for certain operations with matrices; that's what the result is showing you. I think that whoever wrote that PR and merged it probably did not check the benchmark results at the time. We can notice its effect, though, because it shows up on other PRs as well. If we only showed the change for each individual PR with respect to master, then that change would go unnoticed until the release.

Also there is a threshold for reporting any slowdown. The change since the previous release will show the cumulative effect of all PRs.

@oscarbenjamin
Contributor

For `det` it looks like the change happened around #7708, but it's not visible in the PR itself. I think it may be the combined effect of a few changes: the benchmark was already running 25% slower before that PR was merged, the PR made it run a bit slower still, and other changes elsewhere added to that, so after a while the cumulative effect exceeded the 50% slower threshold (e.g. 1.25 × 1.2 = 1.5).

A bigger problem is that the benchmark output is not really representative of an actual slowdown because it's reporting the time taken with the cache enabled. This is what the `time_det(4, 2)` benchmark is timing:

```python
from sympy import *

x = symbols('x')

A = Matrix([
    [ x, 8, 10, 5],
    [10, 9,  3, 7],
    [ 5, 9,  x, 0],
    [ 1, 8,  0, 7]])

import time
t = time.time()
A.det()
print(time.time() - t)
```

For that I don't see a significant difference between 1.9 and master, but there does look to be a difference when you use timeit, e.g.

`python -m timeit -s 'from t import A' 'A.det()'`
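
One rough way to probe the cache effect being described here is to time `A.det()` repeatedly with and without clearing SymPy's global cache in between. This is just a sketch for local experimentation; the repetition count is arbitrary and `clear_cache` wipes the whole cache, so it is a blunt instrument:

```python
import time

from sympy import Matrix, symbols
from sympy.core.cache import clear_cache

x = symbols('x')
A = Matrix([
    [ x, 8, 10, 5],
    [10, 9,  3, 7],
    [ 5, 9,  x, 0],
    [ 1, 8,  0, 7]])

def best_time(clear_between_runs, repeats=10):
    """Return the best of `repeats` timings of A.det(), optionally
    clearing SymPy's cache before each call."""
    times = []
    for _ in range(repeats):
        if clear_between_runs:
            clear_cache()
        t = time.time()
        A.det()
        times.append(time.time() - t)
    return min(times)

print("warm cache:", best_time(clear_between_runs=False))
print("cold cache:", best_time(clear_between_runs=True))
```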

@asmeurer
Member

asmeurer commented Oct 6, 2023

I was about to open a new issue for this but I found this one.

I think the benchmark comments could be improved a lot. Right now they dump a lot of information, which means they are likely to be ignored. The green/red thing has actually been "fixed" because asv now puts a vertical line to the left of the results, so it no longer looks like a diff. However, I think we can go much further:

  • I don't see the benefit of including the "master vs. previous release" comparison in the comment. That's completely irrelevant to the PR and only confuses things. It can be included in the CI output, but it shouldn't be in the comment.

  • The present output of asv is a Markdown table. We should render it as such, instead of putting it in a code block. This would make the results significantly easier to read and much more likely to get attention when there are significant changes from master. For example

```
| Change   | Before [8059df73] <sympy-1.12^0>   | After [91b92eb1]    |   Ratio | Benchmark (Parameter)                                                |
|----------|------------------------------------|---------------------|---------|----------------------------------------------------------------------|
| -        | 86.1±1ms                           | 56.5±0.3ms          |    0.66 | integrate.TimeIntegrationRisch02.time_doit(10)                       |
| -        | 85.7±0.3ms                         | 55.5±0.3ms          |    0.65 | integrate.TimeIntegrationRisch02.time_doit_risch(10)                 |
| +        | 22.4±0.08μs                        | 39.3±0.2μs          |    1.75 | integrate.TimeIntegrationRisch03.time_doit(1)                        |
```

renders as

| Change | Before [8059df7] <sympy-1.12^0> | After [91b92eb] | Ratio | Benchmark (Parameter) |
|--------|---------------------------------|-----------------|-------|-----------------------|
| - | 86.1±1ms | 56.5±0.3ms | 0.66 | integrate.TimeIntegrationRisch02.time_doit(10) |
| - | 85.7±0.3ms | 55.5±0.3ms | 0.65 | integrate.TimeIntegrationRisch02.time_doit_risch(10) |
| + | 22.4±0.08μs | 39.3±0.2μs | 1.75 | integrate.TimeIntegrationRisch03.time_doit(1) |

We could additionally do some post-processing of this to make it easier to read, like

| Change | Before [8059df7] <sympy-1.12^0> | After [91b92eb] | Ratio | Benchmark (Parameter) |
|--------|---------------------------------|-----------------|-------|-----------------------|
| 🟢 | 86.1±1ms | 56.5±0.3ms | 0.66 | integrate.TimeIntegrationRisch02.time_doit(10) |
| 🟢 | 85.7±0.3ms | 55.5±0.3ms | 0.65 | integrate.TimeIntegrationRisch02.time_doit_risch(10) |
|  | 22.4±0.08μs | 39.3±0.2μs | 1.75 | integrate.TimeIntegrationRisch03.time_doit(1) |

With this kind of change, we should also be able to cut down on the boilerplate text. I would cut it to no more than a sentence. If there's more to say about benchmarks, it should be put in the docs and just linked to.
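
As a sketch of the kind of post-processing suggested above (the red marker for slowdowns is my own addition, since the example only shows the green case, and the cell layout is assumed to match the asv table shown earlier):

```python
def decorate_asv_table(table_md: str) -> str:
    """Replace '-'/'+' in the Change column of an asv Markdown table with
    emoji so speedups and slowdowns stand out when the table is rendered."""
    out = []
    for line in table_md.splitlines():
        cells = line.split("|")
        # Data rows look like "| - | 86.1±1ms | ... |"; the header row and
        # the |----| separator row are left untouched.
        if len(cells) > 2 and cells[1].strip() in ("-", "+"):
            cells[1] = " 🟢 " if cells[1].strip() == "-" else " 🔴 "
            line = "|".join(cells)
        out.append(line)
    return "\n".join(out)
```

Applied to the table above, the two speedup rows would get 🟢 and the 1.75 slowdown row would get 🔴.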

@oscarbenjamin
Contributor

I think the main thing would be to make it so that no comment gets posted at all if there are no significant changes, or at least the comment could just say "no significant changes". Then most of the time it would not show up on a PR at all, or at least it would not be distracting.

Perhaps then if the PR does have significant changes then a new comment should be posted after each benchmark run (rather than editing the existing comment). Then if there are new changes that are relevant to the PR they can be reported very loudly.

Otherwise though I think that showing master vs previous release is useful because we just won't see this very often if it is only shown in CI.
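
A minimal sketch of the "only post when something changed" idea, assuming the significantly changed results have already been written to a file by an earlier step (the file name mirrors the `pr_vs_master_changed` step id quoted above but is otherwise hypothetical):

```python
from pathlib import Path

def has_significant_changes(changed_results: str) -> bool:
    """True if the captured 'significantly changed' asv output is non-empty
    once whitespace is stripped."""
    return bool(changed_results.strip())

if __name__ == "__main__":
    # Hypothetical file produced by the comparison step; a later workflow
    # step could post or update the comment only when this prints "post".
    text = Path("pr_vs_master_changed.txt").read_text()
    print("post" if has_significant_changes(text) else "skip")
```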

@asmeurer
Member

asmeurer commented Oct 6, 2023

Yeah, no comment would be nice, but at the same time it would then be hard to tell whether it's actually working, especially given how delayed it tends to be. It doesn't help that the benchmarks status check is not required.

Otherwise though I think that showing master vs previous release is useful because we just won't see this very often if it is only shown in CI.

Is it that important? I guess we could include it under a `<details>` so it doesn't take up much space, if you really feel it should be there.

@oscarbenjamin
Contributor

The problem at the moment is that a comment is added with likely irrelevant information but then gets quietly updated after each push. That makes it quite likely that a subsequent push could cause a significant change without any noisy notification of that change. The master vs previous release comparison is useful because, if there were a significant slowdown in any benchmark, it would definitely get noticed in a subsequent PR even if it was missed when merging the PR that introduced it. Without the master vs previous release comparison, as soon as a measurable slowdown is merged it will be forgotten, regardless of how big the slowdown is.

If we could make it so that any possible slowdowns are always noisy (not just a silent PR comment edit), then maybe it would be less necessary to have the master vs previous release comparison.

@asmeurer
Member

asmeurer commented Oct 6, 2023

I agree, we should try to avoid regressions in the first place. Also, it's much less necessary to see the speedups than the slowdowns when comparing to the last release. Right now master has a lot of speedups, which is very nice, but it's not necessary to see that all the time.
