
Improve BenchmarkIndex to land device targeting #9085

Closed
patrickhulce opened this issue May 30, 2019 · 19 comments · Fixed by #11350
@patrickhulce
Collaborator

Summary
@exterkamp brought up some inconsistencies in our ULTRADUMB™ benchmark. We need to fix these before landing #6162.

@patrickhulce patrickhulce self-assigned this May 30, 2019
@patrickhulce
Collaborator Author

patrickhulce commented May 30, 2019

After discussion and further investigation at a local Best Buy, the inconsistencies he found come down to the fact that Chrome on Windows criminally underperforms on this specific benchmark compared to Edge and similarly spec'd Chromebook or Mac devices.

My proposal for moving forward to address all concerns:

  • Do not use any sort of auto-detection like #6162 ("core(throttling): add option to slow down to a device class") in consistent environments like PSI. This was never the intention of the benchmark, and I agree with all the concerns @exterkamp has raised here. Separately tuning our exact throttling multiplier for PSI to better account for the longer task lengths we see there is still advised.
  • Maintain the policy from #6162 that 750+ is all "fast" and will not be adjusted.
    ULTRADUMB™ was never intended to distinguish between a 10-30% perf differential, and these fluctuations on devices all within the "fast" device class are not a concern.
  • Investigate whether the significant underperformance of Chrome in all JS benchmarks on Windows is at all reflective of reduced JS performance. (AI @patrickhulce)
  • Develop a slimmed-down version of the Regexp benchmark from Octane 2.0, and use it in combination with or instead of ULTRADUMB™. Regexp is the only JS benchmark tested that was observed to maintain its differentiability on Windows. The only reason it was not used to begin with is that it is quite large at 125KB. (AI @patrickhulce)
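For illustration, a slimmed-down regexp microbenchmark could look something like the sketch below. This is purely a hypothetical shape in the spirit of Octane's Regexp test, NOT the actual 125KB Octane workload; the patterns and test text here are made up:

```javascript
// Illustrative sketch only: count how many regexp-matching passes complete
// in a fixed time window, analogous to how ultradumbBenchmark counts
// string-building iterations. Patterns and input text are invented.
function regexpBenchmark() {
  const start = Date.now();
  const patterns = [
    /\b\w+@\w+\.\w+\b/g,            // email-ish
    /(\d{1,3}\.){3}\d{1,3}/g,       // IPv4-ish
    /<([a-z]+)[^>]*>.*?<\/\1>/g,    // paired-tag with backreference
  ];
  const text = 'contact alice@example.com or 192.168.0.1 <b>bold</b> '.repeat(100);
  let iterations = 0;

  while (Date.now() - start < 500) {
    for (const pattern of patterns) text.match(pattern);
    iterations++;
  }

  const durationInSeconds = (Date.now() - start) / 1000;
  return Math.round(iterations / durationInSeconds);
}
```

A real version would presumably sample its patterns and inputs from real-world pages, as Octane's does.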

I have added a few notable device stats to the benchmarks datasheet that illustrate the differences between Chrome on Windows, Edge on Windows, and Chrome on *nix.

@patrickhulce
Collaborator Author

OK, so after far too much time with Chrome on Windows and questions from blue shirts, I've concluded it just has different performance characteristics, varying significantly even by processor architecture, that are too difficult to flawlessly identify with a single, small benchmark.

My new proposal: loosely abandon the automatic throttling-multiplier selection and just add a runWarning when the default throttling settings are in use and the BenchmarkIndex is ~500 or lower, saying the machine might be underpowered and suggesting the user adjust the multiplier. This would solve the borderline-variance issue, lessen the impact of a mistaken identification, and pull nice double duty of warning in situations like #9691. Unfortunately, this kind of leaves DevTools users on weaker machines out in the cold, since we're removing the throttling customization ability (though they're likely already kinda broken being in this position without knowing it).

Ideas for solutions:

  • Offer a separate, sticky DevTools calibration process that saves a calibrated multiplier used for all future runs, and that can be more expensive/comprehensive than a brief ULTRADUMB on every run. (Could also be a nice bonus in CLI environments where accuracy matters :))
  • Live with occasional false positive warnings in DevTools and push weaker machine users toward CLI/PSI

Any feelings here @exterkamp?

@patrickhulce
Collaborator Author

@exterkamp how do you feel about the above proposal?

Specifically:

  1. Add a runWarning if the default throttling settings are being used and BenchmarkIndex is <500 that the machine might be underpowered.
  2. Write a doc that summarizes how one might calibrate the correct throttling multiplier to use... and come up with how we suggest doing this :)
  3. Eventually implement this calibration mechanism in DevTools to make the warning go away
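For concreteness, item 1 could look something like the sketch below. The threshold, the default-settings check, the warning text, and every name here are hypothetical, not Lighthouse's actual implementation:

```javascript
// Hypothetical sketch of the proposed runWarning logic. Assumes `settings`
// and `benchmarkIndex` are provided by the run's context; the default
// throttling check (simulate + 4x multiplier) is an assumption.
const BENCHMARK_INDEX_THRESHOLD = 500; // below this, the machine may be underpowered

function getSlowHostWarnings(settings, benchmarkIndex) {
  const warnings = [];
  const usingDefaultThrottling =
    settings.throttlingMethod === 'simulate' &&
    settings.throttling.cpuSlowdownMultiplier === 4;

  if (usingDefaultThrottling && benchmarkIndex < BENCHMARK_INDEX_THRESHOLD) {
    warnings.push(
      'The tested device appears to have a slower CPU than Lighthouse expects. ' +
      'This can negatively affect your performance score. ' +
      'Consider calibrating --throttling.cpuSlowdownMultiplier for this machine.'
    );
  }
  return warnings;
}
```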

@exterkamp
Member

I like the idea, but I wonder if the calibration mechanism might need to come first to help alleviate our Lightrider benchmark problems? It's not unusual for an LR machine to cross below 500. Would we need to prioritize fleet optimization before we can start to surface this kind of warning, if we think <500 is a huge problem?

I'm not a huge fan of surfacing reports from PSI that have a runWarning about the machine being slow. So I like the idea, but it seems like a non-starter to me?

@paulirish
Member

You're saying we improve the variability of existing BenchmarkIndex in LR first?
Once we feel better about that, then we can explore the LH side of changes?

this wfm.

@patrickhulce
Collaborator Author

I'm not sure any of this applies to LR. From the beginning I was already on board with basically ignoring any of this in LR since it's already using hard-coded thresholds, consistent hardware, different config, etc. It's still worth exploring how seriously slow the LR hardware is for better calibration (have we tried running https://browserbench.org/Speedometer2.0/ in WRS?), but I wasn't trying to propose we randomly add runWarnings to PSI results :)

@exterkamp
Member

Disclaimer: I think we might be trying to handle two diff problems. I have no problems with the above idea for benchmark index in general, I'm just thinking about the problem space in LR now.

You're saying we improve the variability of existing BenchmarkIndex in LR first?

IMO yes (or in parallel). I think we should focus on getting that to within +/- X points of some value; then it's worth trying to make a better benchmark index, for LR at least.

consistent hardware

At scale nothing is consistent. We have LR runs on <100 benchmark index. It happens. So I think that this might need to be dealt with before we can do anything from an LR side based on benchmark index. Or is the idea that we can eventually calibrate to any power of machine?

I wasn't trying to propose we randomly add runWarnings to PSI results :)

It's also unfair to give ourselves a pass. I think we should maybe have some retry logic in PSI if the index is out of spec, so that we don't surface bad results but also don't give ourselves a pass? #variance

Maybe something like this?

  • start a run
  • run benchmark index
  • <500
  • bail out
  • connect to a new WRS and try again

Would be easier if we were asynchronous 😉 ⌚️
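That flow could be sketched roughly as below. All function names here are stand-ins injected as parameters; the real LR plumbing would of course look different:

```javascript
// Hypothetical sketch of the bail-out-and-retry flow: measure BenchmarkIndex
// on a WRS, and if it's below spec, connect to a fresh WRS and try again.
const MIN_BENCHMARK_INDEX = 500;
const MAX_ATTEMPTS = 3;

async function runWithRetry(url, {connectToFreshWRS, measureBenchmarkIndex, runLighthouse}) {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    const connection = await connectToFreshWRS();
    const benchmarkIndex = await measureBenchmarkIndex(connection);
    // Bail out on an underpowered machine and try a new WRS.
    if (benchmarkIndex < MIN_BENCHMARK_INDEX) continue;
    return runLighthouse(url, connection);
  }
  throw new Error(`No sufficiently fast WRS found after ${MAX_ATTEMPTS} attempts`);
}
```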

@patrickhulce
Collaborator Author

Pretty much all agreed. We've conflated two separate issues multiple times in this journey, haha, maybe we could split this to track those separate efforts?

I think the situation in LR will end up needing to be handled completely differently. We should actually have significantly greater control over the flow there: retry availability, advance knowledge about what hardware we should be seeing, a different level of user actionability, etc. It's just a completely different ballgame than being randomly invoked a single time in a completely unknown environment. To be clear, I wasn't trying to suggest we give up on LR variance, just that my suggestions thus far have been separate from whatever we do there.

@patrickhulce
Collaborator Author

patrickhulce commented Jun 23, 2020

my next steps here:

  • develop the hybrid, longer benchmark that works on windows+mac into a script that someone could run
  • add docs in the throttling guide on how to run that script and save the result for use with --throttling.cpuSlowdownMultiplier

future steps (lower priority):

  • offer to compute and automatically save this benchmark first time CLI is run (like the sentry prompt)
  • enable devtools to run and save this multiplier

@patrickhulce
Collaborator Author

patrickhulce commented Jul 14, 2020

I have been really struggling with...

develop the hybrid, longer benchmark that works on windows+mac into a script that someone could run

No benchmark I have tested so far (Richards, Deltablue, crypto, Raytrace, EarleyBoyer, Regexp, Splay, NavierStokes, pdf.js, Mandreel, CodeLoad, zlib, typescript, Octane 2.0, Speedometer 2.0, Geekbench 4.0, Geekbench 5.0, and ULTRADUMB) can accurately capture how much script execution a visit to theverge.com will increase by.

Some very interesting data thus far, though. It turns out a lot has changed in 4 years, and the correct multiplier from a modern 2020 MacBook down to a Moto G4 is more like 10x throttling, not 4x. This might be a larger conversation regarding our targets and whether we want to truly match a Moto G4 or just the ballpark of "a mobile phone".

I've updated the benchmark stats spreadsheet with the data. (screenshot)

@patrickhulce
Collaborator Author

patrickhulce commented Jul 27, 2020

What we discussed in the meeting today:

  • No available benchmark today accurately captures the real-world multiplier of JS execution cross-device. GeekBench and Speedometer are the closest but they underpredict the correct multiple by ~64% on the Windows desktop example above and overpredict on the Moto G4 example by ~42%.
  • We can't really be in the business of trying to create better benchmarks than GeekBench and Speedometer 2.0.
  • Therefore, we can't really create a script that magically figures out the multiplier across cross-environment variance.
  • What we can do is...
    • Document the "target device" based on the baseline device specs. As of today that would be a Moto G4 and a GCP n1-standard-2 instance (which is what we base our Lantern accuracy numbers on).
    • Document the spreadsheet data above into a rough matrix of "Device A with default multipliers translates roughly to a Device B"
    • Collect more CPU and memory information about the device to add to BenchmarkIndex diagnostic data
    • Provide basic warnings in the report if the diagnostic information indicates the device is significantly faster/slower than the baseline device that links to our documentation.

Specific action items to consider this closed:

  • Collect the diagnostic data for n1-standard-2 machines to see where we fall today
  • Convert the spreadsheet data into a matrix to add to our variability.md
  • Add a toplevel warning to Lighthouse when BenchmarkIndex is significantly underpowered

@patrickhulce
Collaborator Author

To make this even more fun...

Chrome Canary m86 has a significant regression in BenchmarkIndex performance (~2x on my Macbook) which I bisected to r787210, a v8 roll of 8.6.106. It contains several memory related changes, so seems reasonable our memory allocation-based benchmark would be affected.

Given that this had been stable for over 2 years, it's definitely unfortunate to have such a massive change now. I think we should ask the v8 team if this is a signal of anything bad for real-world perf and it should be fixed.

  • If yes, then great, we can assume it will continue to be stable and we helped find a bug.
  • If no, then crap, we need to re-evaluate our life choices if there's no hope of BenchmarkIndex staying even order-of-magnitude consistent.

@connorjclark
Collaborator

If yes, then great, we can assume it will continue to be stable and we helped find a bug.

Wow, benchmark index might actually be the most useful useless performance index!

@patrickhulce
Collaborator Author

The plot thickens.

tl;dr

  • BenchmarkIndex is now bimodal, sometimes ~66% lower than the old value, sometimes higher.
  • Affects most devices and OSes
  • On the first page load after tab creation (Lighthouse CLI's situation and my bisect method), BenchmarkIndex was never observed to be the higher value across 1100 trials.
  • The longer a tab lives, the more likely the BenchmarkIndex seems to flip to the higher value, but it can still flip back; a page required 3-10 refreshes before the higher value was observed.

It appears the benchmark became bimodal in this same change. It's unclear to me what determines which bucket of the distribution occurs, other than that it's some attribute of a toplevel task. On the first page load after the tab is created, it's always the lower number (which is exactly the situation Lighthouse CLI finds itself in).

For example, I can run BenchmarkIndex 50 times for 25s straight in a loop and get all 50 values at the low number, then immediately refresh, run another 50 times for 25s straight, and get all 50 values at the high number. They are never mixed within the same task, which is a very different effect from normal CPU contention from other processes on the machine (which manifests as a lower score within the same task while the CPU is occupied, returning to normal when the CPU becomes available). If, however, I break the benchmark index up into different tasks, I can observe the bimodal behavior.

The devices I've retested so far...

  • Macbook Pro: ~1500 to ~700, ~5% of the time it's ~1800
  • Mac Mini: ~1100 to ~450, ~10% of the time it's ~1400
  • Samsung S10: ~800 to ~550, ~30% of the time it's ~1400
  • Middle-level Windows Desktop: ~1200 to ~400, 20% of the time ~1300
  • Moto G4: ~35 to ~35, very very rarely it's 160 but on both Canary and Stable

Distribution of Benchmark Index (histogram screenshots):

  • Stable New Tab
  • Canary New Tab
  • Stable Refreshed Tab (long-lived, same tab used for all trials)
  • Canary Refreshed Tab (short-lived, ~just a few refreshes after tab creation)
  • Canary Refreshed Tab (long-lived, same tab used for all trials)

Just to be sure my bisect was correct and the lower value never happens on a fresh tab, I ran 1000 more trials of the "Canary New Tab" case to be safe. (screenshot)

@brendankenny
Member

brendankenny commented Aug 11, 2020

I think we should ask the v8 team if this is a signal of anything bad for real-world perf and it should be fixed.

FWIW if doing this, it would be best to send ultradumbBenchmark on its own so it's runnable in d8. I'm able to reproduce a regression on my machine of about 20% between d8 from 8.6.105 and 8.6.106 and it has remained slower since, including in the latest build (8.6.342).

The generated optimized code is identical (modulo memory addresses), the optimization timing appears to be the same, and GC time seems to only change by up to 10% in a quick profile, so it'll be interesting to hear what changed and if there was an intentional tradeoff for more realistic code/allocation/whatever.

@patrickhulce
Collaborator Author

I'm able to reproduce a regression on my machine of about 20% between d8 from 8.6.105 and 8.6.106 and it has remained slower since, including in the latest build (8.6.342).

Fascinating! I actually observe the opposite on my machine. Using v8 8.6.106 alone yields the higher bucket value, which is ~15% faster than v8 8.6.105.

Repro Script

cat > benchmark.js <<EOF
function ultradumbBenchmark() {
  const start = Date.now();
  let iterations = 0;

  while (Date.now() - start < 500) {
    let s = ''; // eslint-disable-line no-unused-vars
    for (let j = 0; j < 100000; j++) s += 'a';

    iterations++;
  }

  const durationInSeconds = (Date.now() - start) / 1000;
  return Math.round(iterations / durationInSeconds);
}

console.log(ultradumbBenchmark());
EOF

npm install -g jsvu
jsvu v8@8.6.105
~/.jsvu/engines/v8-8.6.105/v8-8.6.105 benchmark.js
jsvu v8@8.6.106
~/.jsvu/engines/v8-8.6.106/v8-8.6.106 benchmark.js
jsvu v8@8.6.342
~/.jsvu/engines/v8-8.6.342/v8-8.6.342 benchmark.js

@patrickhulce
Collaborator Author

Maybe this, combined with the bimodality, suggests it's something specific to the way Chrome is running V8? Perhaps there are alternate modes or flags being flipped?

@brendankenny
Member

Using v8 8.6.106 alone yields the higher bucket value which is ~15% faster than v8 8.6.105.

whoops, missed that the result is iterations / durationInSeconds. I see an improvement as well, then (~20%).

@patrickhulce
Collaborator Author

So some good stuff came out of this and we might have finally accomplished the title of the issue :)

tl;dr - V8 team gave us advice on how to tweak our microbenchmark to be more resilient, a simple average of the two tweaked benchmarks now correlates with JS execution time better than any other JS benchmark tested, and they're even going to add it to their waterfall to be alerted about major changes to it 🎉

Root Cause

The bimodality appears to be caused by GC heuristics used by Chrome. The identified V8 CL changes the page size, which normally increases GC performance but in Chrome's slow path causes far more GC interruptions.

  • Before: 34% of time spent in GC
  • After (slow): 71% of time spent in GC
  • After (fast): 24% of time spent in GC

(profiler screenshots)

The Fix

Our benchmark builds a string of length 100k, which just pushes past a threshold that triggers this crazy-slow GC path. By reporting the iterations on a shorter string of length 10k and dividing the resulting index by 10, we end up with nearly identical benchmark results to the fast path on our length-100k string, but now we always fall into the fast GC path :)
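A sketch of what that fix describes, assuming the same shape as the repro script above (the exact constants and normalization in the real PR may differ):

```javascript
// Sketch: same shape as ultradumbBenchmark, but builds a 10k-character string
// (staying on V8's fast GC path) and divides the iteration count by 10 so
// scores remain roughly comparable to the original 100k-character version.
function stringBenchmark() {
  const start = Date.now();
  let iterations = 0;

  while (Date.now() - start < 500) {
    let s = ''; // eslint-disable-line no-unused-vars
    for (let j = 0; j < 10000; j++) s += 'a';
    iterations++;
  }

  const durationInSeconds = (Date.now() - start) / 1000;
  return Math.round(iterations / 10 / durationInSeconds);
}
```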

The Improvement

The allocation/GC-dependence of this benchmark was sort of a feature since cheap devices tend to struggle with memory ops, but V8 team suggested trying a benchmark that preallocates an array of 100k and just copies elements into it. By combining the results of this tweaked benchmark with our previous one, we actually get a new benchmark that correlates with JS execution time on sites better than every other web benchmark we've tested and is only beaten out by GeekBench 🎉
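A sketch of the preallocated-array variant and the simple average, per the description above (array size matches the description; the combination function and any normalization are assumptions):

```javascript
// Sketch of the V8-suggested variant: preallocate 100k-element arrays and
// copy elements between them, avoiding string allocation entirely.
function arrayCopyBenchmark() {
  const start = Date.now();
  const src = new Array(100000).fill(1);
  const dst = new Array(100000);
  let iterations = 0;

  while (Date.now() - start < 500) {
    for (let j = 0; j < src.length; j++) dst[j] = src[j];
    iterations++;
  }

  const durationInSeconds = (Date.now() - start) / 1000;
  return Math.round(iterations / durationInSeconds);
}

// "A simple average of the two tweaked benchmarks" per the comment above;
// how the two sub-scores are scaled relative to each other is an assumption.
function computeBenchmarkIndex(stringIndex, arrayIndex) {
  return (stringIndex + arrayIndex) / 2;
}
```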

I'll open a PR for the tweaked combo benchmark and we can continue with our previous plans :)
