
Would making diff_* thresholds a percentage instead of absolute values make more sense? #206

Open
zakkak opened this issue Sep 21, 2023 · 4 comments


@zakkak
Collaborator

zakkak commented Sep 21, 2023

Right now, diff_* thresholds for performance regression testing are defined as absolute numbers, e.g.:

linux.diff_native.time.to.first.ok.request.threshold.ms=80

This sometimes results in test failures when running on machines other than the one used to tune the thresholds.

However, I am thinking that checking whether the increase is within an acceptable range, e.g. 5%, would probably make more sense. After all, a 50 ms increase on a 10 ms run is huge, while on a 5 s run it's negligible.

I wonder if switching to percentages would also allow us to perform the regression testing (only for diffs between runs) on various machines (including GitHub runners) without losing accuracy.
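As a sketch of what this could look like in `threshold.properties`, the existing absolute key could get a percentage-based counterpart (the `.percent` key name below is purely illustrative, not an existing property):

```
# Existing absolute threshold (milliseconds):
linux.diff_native.time.to.first.ok.request.threshold.ms=80
# Hypothetical relative variant: fail if the diff exceeds 5% of the baseline.
linux.diff_native.time.to.first.ok.request.threshold.percent=5
```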

cc @Karm @jerboaa

@roberttoyonaga
Collaborator

Hi @zakkak, just chiming in here: the JFR perf test thresholds are specified as a relative change (|new - old| / old). Maybe something similar could make sense elsewhere too.
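A minimal sketch of such a relative-change check, using the formula above; the class and method names are hypothetical, not taken from the test suite:

```java
public class RelativeThreshold {

    /**
     * Returns true if the new measurement is within the allowed relative
     * change from the baseline, computed as |new - old| / old.
     * E.g. maxRelativeChange = 0.05 allows a 5% deviation. Note that the
     * absolute value means a large *improvement* is also flagged, which can
     * be useful for spotting broken measurements.
     */
    static boolean withinThreshold(double baselineMs, double newMs, double maxRelativeChange) {
        return Math.abs(newMs - baselineMs) / baselineMs <= maxRelativeChange;
    }

    public static void main(String[] args) {
        // A 50 ms increase is huge on a 10 ms baseline (500% change)...
        System.out.println(withinThreshold(10, 60, 0.05));     // false
        // ...but negligible on a 5 s baseline (1% change).
        System.out.println(withinThreshold(5000, 5050, 0.05)); // true
    }
}
```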

@Karm
Owner

Karm commented Oct 3, 2023

This definitely makes sense. It requires recording a JVM run as a baseline, but that already happens; see the diff_jvm and diff_native suffixes in threshold.properties.

@zakkak
Collaborator Author

zakkak commented Oct 4, 2023

@Karm what percentage would you consider acceptable?

@Karm
Owner

Karm commented Oct 4, 2023

> @Karm what percentage would you consider acceptable?

There are two things:

  1. The % difference between JVM (time-to-first-ok-request, time-to-complete, RSS) and Native, i.e. is it acceptable that Native's time-to-complete is 10% worse, etc.

  2. And then there is the deviation from some hardcoded value.

I'd focus on 2) and I'd hardcode values from a Q 2.13.8.Final, M 22.3.3.1-Final run on a reference system.
I'd then run again with Q 2.16.9.Final, M 22.3.3.1-Final on the reference system and record the percentage difference.
That is what I'd use as the acceptable percentage to judge the success or failure of Quarkus 3.x and M 23.x.
By reference system I mean one of the stock 8-core, 16 GB RAM RHEL 8 VMs backed by contemporary Xeons that I use, as they have a pretty stable profile.
