Improve BenchmarkIndex to land device targeting #9085
After discussion and further investigation at a local Best Buy, I've concluded that the inconsistencies he found come down to the fact that Chrome on Windows criminally underperforms on this specific benchmark compared to Edge and similarly spec'd Chromebook or Mac devices. My proposal for moving forward to address all concerns:
I have added a few notable device stats to the benchmarks datasheet that illustrate the difference between Chrome on Windows, Edge on Windows, and Chrome on *nix.
OK, so after far too much time with Chrome on Windows and questions from blue shirts, I've concluded it just has different characteristics, which even vary significantly by processor arch, that are too difficult to flawlessly identify with a single, small benchmark. My new proposal would be to loosely abandon the automatic throttling multiplier selection and just add a

Ideas for solutions:
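(Editor's sketch, not from the thread: one way to quantify the "different characteristics" observed across machines is to run the benchmark repeatedly and compute its coefficient of variation. `measureVariance` is a hypothetical helper, not part of Lighthouse.)

```javascript
// Hedged sketch: quantify run-to-run variance of a benchmark function.
// `benchmarkFn` is assumed to return a numeric score (like BenchmarkIndex).
function measureVariance(benchmarkFn, runs = 10) {
  const scores = [];
  for (let i = 0; i < runs; i++) scores.push(benchmarkFn());
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const variance =
    scores.reduce((a, b) => a + (b - mean) ** 2, 0) / scores.length;
  // Coefficient of variation: stddev relative to the mean,
  // so scores of different magnitudes are comparable.
  const coefficientOfVariation = Math.sqrt(variance) / mean;
  return {mean, coefficientOfVariation};
}
```

A high coefficient of variation on a given machine would be a signal that a single small benchmark run can't be trusted there.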
Any feelings here @exterkamp?
@exterkamp how do you feel about the above proposal? Specifically:
I like the idea, but I wonder if the calibration mechanism might need to come first to help alleviate our Lightrider benchmark problems? It's not unusual for an LR machine to cross below 500. Would we need to prioritize fleet optimization before we can start to surface this kind of warning, if we think <500 is a huge problem? I'm not a huge fan of surfacing reports from PSI that have a
You're saying we improve the variability of the existing BenchmarkIndex in LR first? This wfm.
I'm not sure any of this applies to LR. From the beginning I was already on board with basically ignoring any of this in LR since it's already using hard-coded thresholds, consistent hardware, different config, etc. It's still worth exploring how seriously slow the LR hardware is for better calibration (have we tried running https://browserbench.org/Speedometer2.0/ in WRS?), but I wasn't trying to propose we randomly add runWarnings to PSI results :)
Disclaimer: I think we might be trying to handle two different problems. I have no problem with the above idea for benchmark index in general; I'm just thinking about the problem space in LR now.
IMO yes (or in parallel): I think we should focus on getting the LR benchmark index to be +/- X points of some value; then it's worth trying to build a better benchmark index for LR at least.
At scale nothing is consistent. We have LR runs with a benchmark index below 100. It happens. So I think this might need to be dealt with before we can do anything from the LR side based on benchmark index. Or is the idea that we can eventually calibrate to any power of machine?
It's also unfair to give ourselves a pass. Maybe we should have some retry logic in PSI if the index is out of spec, so that we don't surface bad results, but we also don't give ourselves a pass? #variance Maybe something like this?
Would be easier if we were asynchronous 😉 ⌚️
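(Editor's sketch of the retry idea floated above; `runWithRetry`, the `minIndex` threshold, and the result shape are all hypothetical, not actual PSI or LR internals.)

```javascript
// Hedged sketch: re-run when BenchmarkIndex is out of spec, but cap retries
// so we eventually surface the result instead of hiding variance forever.
function runWithRetry(runOnce, {minIndex = 500, maxAttempts = 3} = {}) {
  let result;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    result = runOnce();
    // In spec: return immediately with no retry penalty.
    if (result.benchmarkIndex >= minIndex) return result;
  }
  // Every attempt was below spec; flag it rather than silently passing.
  return {...result, benchmarkIndexOutOfSpec: true};
}
```

The flag keeps the "don't give ourselves a pass" property: out-of-spec runs are still surfaced, just labeled.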
Pretty much all agreed. We've conflated two separate issues multiple times in this journey, haha, maybe we could split this to track those separate efforts? I think the situation in LR will end up needing to be handled completely differently. We should actually have significantly greater control over the flow there: retry availability, advanced knowledge about what hardware we should be seeing, a different level of user actionability, etc. It's just a completely different ballgame than being randomly invoked a single time in a completely unknown environment. To be clear, I wasn't trying to suggest we give up on LR variance, just that my suggestions thus far have been separate from whatever we do there.
My next steps here:
future steps (lower priority):
I have been really struggling with...
No benchmark I have tested so far (Richards, Deltablue, crypto, Raytrace, EarleyBoyer, Regexp, Splay, NavierStokes, pdf.js, Mandreel, CodeLoad, zlib, typescript, Octane 2.0, Speedometer 2.0, Geekbench 4.0, Geekbench 5.0, and ULTRADUMB) can accurately capture how much script execution for a visit to theverge.com will increase. Some very interesting data thus far, though. It turns out a lot has changed in 4 years, and the correct multiplier from a modern 2020 MacBook down to a Moto G4 is more like 10x throttling, not 4x. This might be a larger conversation regarding our targets and whether we want to truly match a Moto G4 or just the ballpark of "a mobile phone". I've updated the benchmark stats spreadsheet with the data.
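(Editor's note: the 10x-vs-4x observation is just a ratio of benchmark scores; a hedged sketch of the arithmetic, where the scores below are illustrative placeholders, not values from the benchmarks spreadsheet.)

```javascript
// Hedged sketch: derive a CPU throttling multiplier from two benchmark scores.
// A device that scores N times lower needs roughly N-times CPU throttling
// for the reference machine to approximate it.
function throttlingMultiplier(referenceDeviceScore, targetDeviceScore) {
  return referenceDeviceScore / targetDeviceScore;
}

// Placeholder scores for a modern laptop vs a low-end phone:
const multiplier = throttlingMultiplier(1500, 150); // 10x, not the old 4x
```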
What we discussed in the meeting today:
Specific action items to consider this closed:
To make this even more fun... Chrome Canary m86 has a significant regression in BenchmarkIndex performance (~2x on my Macbook), which I bisected to r787210, a v8 roll of 8.6.106. It contains several memory-related changes, so it seems reasonable that our memory-allocation-based benchmark would be affected. Given that this had been stable for over 2 years, it's definitely unfortunate to have such a massive change now. I think we should ask the v8 team if this is a signal of anything bad for real-world perf and whether it should be fixed.
Wow, benchmark index might actually be the most useful useless performance index!
FWIW, if doing this, it would be best to send

The generated optimized code is identical (modulo memory addresses), the optimization timing appears to be the same, and GC time seems to only change by up to 10% in a quick profile, so it'll be interesting to hear what changed and whether there was an intentional tradeoff for more realistic code/allocation/whatever.
Fascinating, I actually observe the opposite on my machine. Using v8 8.6.106 alone yields the higher bucket value, which is ~15% faster than v8 8.6.105.

Repro script:

```sh
cat > benchmark.js <<EOF
function ultradumbBenchmark() {
  const start = Date.now();
  let iterations = 0;
  while (Date.now() - start < 500) {
    let s = ''; // eslint-disable-line no-unused-vars
    for (let j = 0; j < 100000; j++) s += 'a';
    iterations++;
  }
  const durationInSeconds = (Date.now() - start) / 1000;
  return Math.round(iterations / durationInSeconds);
}
console.log(ultradumbBenchmark());
EOF
npm install -g jsvu
jsvu v8@8.6.105
~/.jsvu/engines/v8-8.6.105/v8-8.6.105 benchmark.js
jsvu v8@8.6.106
~/.jsvu/engines/v8-8.6.106/v8-8.6.106 benchmark.js
jsvu v8@8.6.342
~/.jsvu/engines/v8-8.6.342/v8-8.6.342 benchmark.js
```
Maybe this, combined with the bimodality, suggests it's something specific about the way Chrome is running v8? Perhaps there are alternate modes or flags that can be flipped?
Whoops, missed that the result is
Summary
@exterkamp brought up some inconsistencies in our ULTRADUMB™ benchmark. We need to fix these before landing #6162.