Project Lantern is an ongoing effort to reduce the run time of Lighthouse and improve audit quality by modeling page activity and simulating browser execution. This document details the accuracy of these models and captures the expected natural variability.
All of the following accuracy stats are reported on a set of 300 URLs sampled from the Alexa top 1000, the HTTPArchive dataset, and miscellaneous ad landing pages. For each URL, the median of 9 runs in one environment was compared to the median of 9 runs in a second environment.
Lantern Accuracy Stats
| Lantern predicting Default LH | .811 : 23.1% | .811 : 23.6% | .869 : 42.5% |
| Lantern predicting LH on WPT | .785 : 28.3% | .761 : 33.7% | .854 : 45.4% |
| Unthrottled LH predicting Default LH | .738 : 27.1% | .694 : 33.8% | .743 : 62.0% |
| Unthrottled LH predicting WPT | .691 : 33.8% | .635 : 33.7% | .712 : 66.4% |
| Default LH predicting WPT | .855 : 22.3% | .813 : 27.0% | .889 : 32.3% |
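As a rough illustration of how a cell like `.811 : 23.1%` could be produced, the sketch below takes per-URL medians from two environments and computes a rank correlation and a mean absolute percentage error. This is only a sketch under stated assumptions: the table does not name its statistics here, so reading the first number as a Spearman-style rank correlation and the second as a mean absolute percentage error is an assumption, and the function names and sample data are hypothetical.

```python
from statistics import median


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def rank_correlation(xs, ys):
    """Rank correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


def mean_abs_pct_error(predicted, actual):
    """Mean absolute percentage error of `predicted` relative to `actual`."""
    errors = [abs(p - a) / a for p, a in zip(predicted, actual)]
    return 100 * sum(errors) / len(errors)


def compare_environments(runs_a, runs_b):
    """runs_a / runs_b map URL -> list of metric values (ms), one per run."""
    urls = sorted(set(runs_a) & set(runs_b))
    medians_a = [median(runs_a[url]) for url in urls]
    medians_b = [median(runs_b[url]) for url in urls]
    return (rank_correlation(medians_a, medians_b),
            mean_abs_pct_error(medians_a, medians_b))


# Made-up data for three hypothetical URLs, three runs each.
predicting = {"a": [100, 110, 120], "b": [200, 220, 210], "c": [300, 330, 360]}
reference = {"a": [100, 100, 100], "b": [200, 200, 200], "c": [300, 300, 300]}
rho, mape = compare_environments(predicting, reference)
print(f"{rho:.3f} : {mape:.1f}%")
```

Taking the median per URL before comparing keeps a single outlier run from dominating either statistic.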
Lantern Accuracy Conclusions
We conclude that Lantern is ~6-13% less accurate than DevTools throttling. When evaluated on how well it ranks pages, Lantern achieves correlations within ~.04-.07 of those achieved by DevTools throttling.
- For the single view use case, our original conclusion that Lantern's inaccuracy is roughly equal to the inaccuracy introduced by expected variance seems to hold. The standard deviation of single observations from DevTools throttling is ~9-13%, and given Lantern's much lower variance, single observations from Lantern are not significantly more inaccurate on average than single observations from DevTools throttling.
- For the repeat view use case, we can conclude that Lantern is systematically off by ~6-13% more than DevTools throttling.
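The ~6-13% and ~.04-.07 figures can be recovered from the stats table with simple subtraction. The sketch below assumes the comparison is between the "Lantern predicting LH on WPT" and "Default LH predicting WPT" rows, that "Default LH" is the DevTools-throttled configuration, and that each cell holds a correlation followed by a percentage error. The variable and function names are hypothetical.

```python
# (correlation, percent error) pairs copied from the stats table,
# one pair per metric column, in table order.
lantern_vs_wpt = [(0.785, 28.3), (0.761, 33.7), (0.854, 45.4)]
devtools_vs_wpt = [(0.855, 22.3), (0.813, 27.0), (0.889, 32.3)]


def gaps(candidate, baseline):
    """Per-column (correlation shortfall, extra percent error) of candidate."""
    return [
        (round(base_corr - cand_corr, 3), round(cand_err - base_err, 1))
        for (cand_corr, cand_err), (base_corr, base_err) in zip(candidate, baseline)
    ]


for corr_gap, err_gap in gaps(lantern_vs_wpt, devtools_vs_wpt):
    print(f"correlation lower by {corr_gap:.3f}, error higher by {err_gap:.1f} points")
```

Under these assumptions, the extra error is a difference in percentage points between the two rows rather than a relative increase.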
Metric Variability Conclusions
The reference stats demonstrate that there is a high degree of variability in the user-centric metrics. They strengthen the position that every load is just an observation of a point drawn from a distribution, and that understanding the entire experience requires multiple draws, i.e. multiple runs are needed to have sufficiently small error bounds on the median load experience.
The current sizes of the confidence intervals for DevTools throttled performance scores are as follows.
- 95% confidence interval for 1 run of a site at the median: 50 +/- 15 (35-65)
- 95% confidence interval for 3 runs of a site at the median: 50 +/- 11 (39-61)
- 95% confidence interval for 5 runs of a site at the median: 50 +/- 8 (42-58)
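To illustrate why taking the median of more runs tightens the interval, the sketch below simulates scores and measures the empirical 95% interval of the median of n runs. It is a minimal sketch, not the method used to produce the figures above: it assumes single-run scores are normally distributed with a spread chosen so that one run gives +/- 15, which is a modeling assumption and is not expected to reproduce the listed figures exactly. The function name and parameters are hypothetical.

```python
import random
from statistics import median


def interval_half_width(n_runs, single_run_half_width=15.0, trials=20000, seed=0):
    """Empirical 95% interval half-width for the median of `n_runs` scores.

    Assumes (hypothetically) normally distributed single-run scores whose
    own 95% interval is +/- `single_run_half_width`.
    """
    rng = random.Random(seed)
    sigma = single_run_half_width / 1.96  # 95% of a normal lies within 1.96 sigma
    medians = sorted(
        median(rng.gauss(0.0, sigma) for _ in range(n_runs))
        for _ in range(trials)
    )
    low = medians[int(0.025 * trials)]
    high = medians[int(0.975 * trials) - 1]
    return (high - low) / 2


for n in (1, 3, 5):
    print(f"{n} run(s): 50 +/- {interval_half_width(n):.1f}")
```

A fixed seed keeps the simulation repeatable. Real score distributions may be skewed or bounded, in which case the intervals would differ from this model.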