This repository has been archived by the owner on Jan 4, 2023. It is now read-only.

Integrate Lighthouse results #87

Closed
rviscomi opened this issue Apr 6, 2017 · 37 comments

Comments

@rviscomi
Member

rviscomi commented Apr 6, 2017

No description provided.

@igrigorik
Contributor

Collating notes from previous threads and conversations:

  1. Prerequisite: we need to move agents to the new Linux agent.
  2. LH results will be restricted to mobile crawl - what's the limitation here?
  3. We'll append LH's JSON report directly into the HAR
    • Report should be O(100kb) and shouldn't need to be truncated or split
    • Existing BigQuery pipeline should be sufficient, results will end up in <date>_pages table
      • Do we have an example HAR with LH output that we can validate?

In terms of deployment: gathering LH data will require adding another run at the end of the current set, which will increase the overall load on the infrastructure. On the other hand, it sounds like new agents should consume fewer cycles, so hopefully we'll come out ahead. That said, we're running close to the limit on capacity, so we should monitor things closely.

@rviscomi @pmeenan anything else we're missing here or need to think about?

In terms of timelines:

  • when should we target the agent upgrade?
  • what do we need to implement to add the LH step?

@pmeenan
Member

pmeenan commented Apr 7, 2017

Here is a sample HAR from this test. It should mirror what we'd get since it was run on desktop Linux with mobile emulation (and the new agent).

Adding the LH step is a single additional test option (lighthouse=1) and I can have it applied automatically at the same point where I turn on tracing (2 minutes of work).
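For anyone unfamiliar with how that option gets passed, here's a minimal sketch of kicking off such a test against a private WPT instance. The host and key are placeholders, and the parameter names other than lighthouse=1 follow the public WebPageTest API, so treat it as illustrative rather than exactly what the agents do:

```python
# Illustrative only: requesting a WebPageTest run with the Lighthouse pass
# enabled. Host and API key are placeholders; lighthouse=1 is the option
# mentioned above, the other parameters follow the public WPT API.
import requests

WPT_HOST = "https://wpt.example.org"   # hypothetical private instance
API_KEY = "YOUR_API_KEY"               # placeholder

params = {
    "url": "https://www.example.com/",
    "k": API_KEY,
    "f": "json",          # ask for a JSON response
    "mobile": 1,          # mobile emulation, matching the mobile crawl
    "lighthouse": 1,      # the single extra option that enables the LH pass
}

resp = requests.get(f"{WPT_HOST}/runtest.php", params=params, timeout=30)
resp.raise_for_status()
print(resp.json()["data"]["testId"])   # test ID of the queued run
```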

My proposal:

  • May 1 crawl: Switch all agents to the new Linux agent (without Lighthouse enabled). Use the duration of the crawl to find any bottlenecks and determine how many agents we can run per server (based on early CPU data from the public instance, I expect we can run a lot more than we do for Windows)

  • May 15 crawl: Don't change the crawl itself; take the time to analyze the results of the May 1 crawl and see if there were any shifts in the metrics or issues with the data. I'm particularly worried that some of the bytes metrics may shift since we're moving from direct measurement to translating netlogs and devtools events.

  • June 1 crawl: Enable Lighthouse for the mobile crawl

As far as desktop/mobile restrictions, technically I can run Lighthouse on the desktop crawl as well, but a lot of the audits are mobile/PWA-specific (add to homescreen, address bar color, mobile-friendly design). That said, there may be enough value in collecting the best-practices parts for desktop to make it worth running in all cases.

@igrigorik
Contributor


Looking at the example HAR..

  • _lighthouse key is living at the top level. Conceptually, I guess that makes sense, since it's a separate run, but it does mean we'll have to update our DataFlow pipeline to pull it out and either (a) merge it into pages table, or (b) write it out into a separate table ~date_lighthouse or some such. I can see pros and cons for each.. WDYT?
  • What's our plan for network / CPU throttling that LH enables? Are we going to noop those?

Timeline sounds good. Is there anything we can do upfront to sanity check and validate the bytes numbers? This part sounds pretty meaty on its own and probably deserves a separate bug.

@pmeenan
Member

pmeenan commented Apr 8, 2017

I was a bit split because at the top-level there aren't really pages either. If I strip out the help text (which I'm considering doing anyway), duplicating the lighthouse JSON across all 3 page load data sets could work. I can make that change when I get back (4/17) and send over a new HAR.

As far as throttling:

  • network is already disabled (actually, it is done externally by dummynet). I can monkey-patch the result data with the throttle settings from the test so it is accurate.

  • CPU throttling is enabled when emulating mobile devices on a desktop machine (made the change after I generated the sample HAR). Disabled when device emulation is disabled (either because we are testing desktop or on a physical device).

@igrigorik
Contributor

duplicating the lighthouse JSON across all 3 page load data sets could work.

Wait, you mean duplicating the same report in each pages entry? That's probably unnecessary. It's a trivial change in our DataFlow pipeline to pull out the top-level key and stuff it into the pages table (if that's what we decide to go with). I'm OK with the current setup, just mentioned it to surface that we'll have to do some work.
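To make that "trivial change" concrete, here's a rough Python sketch of pulling out the top-level key; the real pipeline is Java Dataflow, and the output field names here are hypothetical:

```python
# Sketch only: split the top-level _lighthouse key out of a HAR blob and
# attach it to the pages row. Not the actual Dataflow code; "lighthouse"
# as an output field name is a hypothetical choice.
import json

def split_har(har_json: str) -> dict:
    har = json.loads(har_json)
    lighthouse = har.pop("_lighthouse", None)  # LH report sits at the top level
    page = har["log"]["pages"][0]              # first page load in the HAR

    row = {
        "url": page.get("title"),              # page identifier from the HAR
        "payload": json.dumps(har),            # HAR minus the LH report
    }
    if lighthouse is not None:
        row["lighthouse"] = json.dumps(lighthouse)
    return row
```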

network is already disabled (actually, it is done externally by dummynet). I can monkey-patch the result data with the throttle settings from the test so it is accurate.

So, to confirm: you do want to enable network throttling, but via dummynet? If that's the case, do we trust the results? Previously we said that we shouldn't rely on any CPU/network timing data coming back from our agents..

@pmeenan
Member

pmeenan commented Apr 9, 2017

We already enable network throttling (Cable for desktop crawls, 3G for mobile). The Lighthouse testing just inherits whatever the test uses.

As far as timing goes, things are better than they used to be, with the machines not as oversubscribed, but they may still be running hotter than they should for performance measurement. When we added the new hardware and increased the mobile testing, I set things up so that the host servers ran ~95% CPU utilized, but individual runs may still be starved.

Since then, changes to make the agents cycle faster may have pushed things hotter but we are in the neighborhood of being able to measure perf. Early testing shows the new agent on Linux uses significantly less CPU (30-40% less) so the headroom may allow us to both add the lighthouse pass and keep the servers from running too hot to measure perf.

@igrigorik
Contributor

We already enable network throttling (Cable for desktop crawls, 3G for mobile). The Lighthouse testing just inherits whatever the test uses.

Funny, I didn't realize that's the case. Given that we'll want to update these settings at some point in the future, should we also log what BW/RTT settings we're using -- similar to what LH reports?

Early testing shows the new agent on Linux uses significantly less CPU (30-40% less) so the headroom may allow us to both add the lighthouse pass and keep the servers from running too hot to measure perf.

Fingers crossed. :-)

@rviscomi
Member Author

there may be enough value in collecting the best practices parts for desktop to make it worth running in all cases.

Yeah, I think so. One less difference between test configs.

Is there any separate process for updating LH or is that included in the general WPT update? And is that automatic?

it does mean we'll have to update our DataFlow pipeline to pull it out and either (a) merge it into pages table, or (b) write it out into a separate table

I'm partial to having its own table but don't feel strongly. Whatever is easiest for analysis.

@pmeenan
Member

pmeenan commented Apr 10, 2017

It will be "automatic" in the sense that it is manual and needs to be turned on but I'll take care of it ;-)

@igrigorik
Contributor

I'm partial to having its own table but don't feel strongly. Whatever is easiest for analysis.

Curious, why? FWIW, the reason I'm leaning towards same table is precisely because doing cross-table joins is both more complicated and expensive in terms of overall query cost + runtime. I think a strong argument in favor of a separate table could be if we could define a full schema for the results.. but I'm not sure how stable that is. My guess is that we shouldn't.

@pmeenan
Member

pmeenan commented Apr 10, 2017

The lighthouse schema is decidedly not stable and is expected to change in the next couple of releases.

@igrigorik
Contributor

In that case, let's stick it into _pages under a top-level "lighthouse" key. The payload will be a serialized JSON string. We can always go back and reprocess; this is not a once-and-forever decision.
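Just to illustrate what querying that serialized string could look like later (the table and column names are assumptions based on this thread, not a final schema):

```python
# Illustration only: reading the serialized LH report back out of the pages
# table. The "lighthouse" column name and the table name are assumptions.
from google.cloud import bigquery

client = bigquery.Client(project="httparchive")

query = """
SELECT
  url,
  JSON_EXTRACT_SCALAR(lighthouse, '$.lighthouseVersion') AS lh_version
FROM `httparchive.har.2017_06_01_android_pages`
WHERE lighthouse IS NOT NULL
LIMIT 10
"""

for row in client.query(query).result():
    print(row.url, row.lh_version)
```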

@rviscomi
Member Author

Curious, why? FWIW, the reason I'm leaning towards same table is precisely because doing cross-table joins is both more complicated and expensive in terms of overall query cost + runtime. I think a strong argument in favor of a separate table could be if we could define a full schema for the results.. but I'm not sure how stable that is. My guess is that we shouldn't.

Some joins could be avoided by duplicating page/test metadata in the lighthouse table. But I agree that overall it's more convoluted. My preference was mostly just for separation so it's clearer where the data came from and not commingling/flattening results. Having it in _pages under a lighthouse column SGTM too.

@igrigorik
Contributor

Cool. Let's start with stuffing it into _pages, as that's the simplest route. We can always update the pipeline later and rebuild the dataset if we decide that a separate table is preferable.

@rviscomi rviscomi added this to TODO in HTTP Archive Reboot Apr 11, 2017
@rviscomi rviscomi added the P0 label Apr 11, 2017
@rviscomi
Member Author

Blocked by #98 (in progress)

@igrigorik
Contributor

We have Lighthouse data in the latest mobile run -- woot! Kudos to @pmeenan for wiring that up.

@rviscomi can you take a run at updating our DF pipeline to import LH results?

@rviscomi
Member Author

See HTTPArchive/bigquery#10

If we merge and update the GCE worker soon the 6/1 pipeline can pick up the changes.

@rviscomi
Member Author

Updated GCE. So we should see LH in the 6/1 HAR tables whenever they're available.

@rviscomi
Member Author

Dataflow is failing with this error message:

(b65680b98c1ce59b): Workflow failed. Causes: (b65680b98c1ce76e): S01:read-har+split-har+write-entries+write-pages+write-bodies failed., (b06eec4c78f63bb6): BigQuery import job "dataflow_job_9101323334612128427-B" failed., (b06eec4c78f637a3): BigQuery job "dataflow_job_9101323334612128427-B" in project "httparchive" finished with error(s): errorResult: JSON table encountered too many errors, giving up. Rows: 1; errors: 1., error: JSON table encountered too many errors, giving up. Rows: 1; errors: 1., error: JSON parsing error in row starting at position 0: . Row size is larger than: 10485760.

The addition of the Lighthouse data must be spilling us over the row size limit.

@rviscomi
Member Author

On a related note, if the limit is 10 MB, the request_bodies tables are over-truncating at 2 MB. Filed HTTPArchive/bigquery#13

@igrigorik
Contributor

Wowza.. how large are the LH reports? Can we pull out the report that's causing the exception?

@rviscomi
Member Author

This LH report from mobile.twitter.com is 4.1 MB so 10+ MB wouldn't be so unusual.

It looks like $.log.entries makes up the majority of that data. For example, deleting that object from the file reduces the file size to 179 KB. The contents are redundant anyway, as we already collect HAR data in the requests and request_bodies tables.

I'd suggest pruning entries from the Lighthouse results in the dataflow pipeline so that the full results are still available in GCS but the data in BQ is deduped and fits well below quota limitations.

@rviscomi
Member Author

Hah, disregard that. That file is the entire HAR returned from WPT, not just the LH report.

@igrigorik
Contributor

So now I'm confused.. The LH report in the Twitter case is ~69KB, right? So, where are the exceptions coming from?

@rviscomi
Member Author

Here are the actual mobile.twitter.com LH results from the 6/1 crawl. The JSON report is 376 KB.

But the report size is very site-dependent. For example, Reddit is 2.8 MB.

Those are just a couple of examples from the Alexa Top 10. I sampled some of the largest gzipped HAR files using this command: gsutil ls -lh gs://httparchive/android-Jun_1_2017 | sort -h | head -10. But none of them managed to cause a dataflow failure. The HAR size is not really a good indicator of LH size, because Reddit was still the largest LH report, even compared to the largest HARs.

Given that this is difficult to reproduce, I'll run another test using a similar method to request_bodies truncation. I'll measure the row size before appending and set a bool field if it exceeds the limit.
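For reference, the check would look roughly like this (Python stand-in for the Java Dataflow logic; the skipped-flag field name is hypothetical):

```python
# Sketch of the row-size check described above: measure before appending the
# LH report, and set a bool field instead if it would exceed the limit.
import json

MAX_ROW_SIZE = 10 * 1024 * 1024  # BigQuery row limit from the error above

def maybe_add_lighthouse(row: dict, lh_report: dict) -> dict:
    lh_json = json.dumps(lh_report)
    current_size = sum(len(str(v)) for v in row.values())

    if current_size + len(lh_json) > MAX_ROW_SIZE:
        row["lighthouse"] = None
        row["lighthouse_skipped"] = True   # bool marker instead of the report
    else:
        row["lighthouse"] = lh_json
        row["lighthouse_skipped"] = False
    return row
```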

Anecdotally, LH significantly adds to the dataflow processing time. A normal job would take ~45 minutes, but the normal+LH jobs are now running ~150 minutes just to fail.

If you want to see the temporary results, check out https://bigquery.cloud.google.com/table/httparchive:har.2017_06_20_android_pages?tab=preview while it lasts.

@igrigorik
Contributor

So if max report is ~2.8MB.. that should fit in the 10MB row -- what's the issue? Is it 1MB?

Looking at the Reddit report, I'm wondering if a wholesale import is the right approach. For example, it looks like there are some base64'd screenshots in there, and lots of duplicate helpText? The a11y report, in particular, is massive. All this is making me think that we might be better off splitting this data into a standalone table with a more crafted schema and filter logic on what we import?

@rviscomi
Member Author

rviscomi commented Jun 21, 2017

The 2.8 MB Reddit report actually fits in the row without an error (see the temp 6/20 table). The row limit applies to the sum of all columns: page URL (negligibly small), HAR payload, and LH report. So some combination of large HAR payload and large LH report is overflowing the row. The max gzipped HAR file size in GCS is 1 MB, and none of the 10 largest files I manually ran through dataflow exceeded the limit either.

I've added some logging to dataflow and am now rerunning the 6/1 mobile results. It's been going for 60 minutes and has skipped 430 LH reports so far. The LOG.error calls don't seem to be logging to any of the consoles I've been watching (local NetBeans IDE and Dataflow job logs), so I'm not sure which pages are actually overflowing.

I still think there is at least one HAR file that triggers the error. I ran tests without the custom LH counter and they also failed around 2.5 hours in. When I get more time (I'll be unavailable until Tuesday starting in a couple of hours) I'll try bisecting the dataset to try to pin down exactly which HAR is responsible.

Edit: Yeah, it failed again after 2.5 hours. There were 1,112 LH reports skipped.

@igrigorik
Contributor

Hmm, interesting.. As an experiment, can we remove some unknown variables and split the LH data into a separate table?

@rviscomi
Member Author

The last thing I did before leaving for my break was kick off the pipeline with the 6/1 mobile results and the original row size quota of 2 MB (rather than 10 MB). Checked on it today and it turns out that run was successful!

I kept the skippedLighthouse counter in dataflow and it shows that for this run there were 12,121 reports skipped. It'd be nice not to lose out on all that data, but it seems necessary for now.

I'll be committing my changes to get it in for the 6/15 crawl to be processed.

@igrigorik
Contributor

Hmm, 12K pages is ~bearable, but that also depends on the rank of those pages -- e.g. if we're dropping amazon then that's a problem. Two routes:

  • What happens if we split LH into separate table?
  • Can we prune some of the data inside of LH reports to reduce size?

@pmeenan
Member

pmeenan commented Jun 27, 2017

They used to embed full-resolution screenshots of the filmstrip, and I strip those out, but I think they may have snuck in thumbnails of the filmstrip or moved the location. There's nothing in the actual data that should get near 1 MB. I'll track down the binary data bloating the files and strip them out of the HAR JSON before the 7/1 crawl starts.

@rviscomi
Member Author

Yes, it's definitely embedding base64-encoded images in the report. They're hiding under the "screenshot-thumbnails" audit, which is in both the Performance report category and the audits array:

$.reportCategories[1].audits[16].result.details.items
$.audits['screenshot-thumbnails'].details.items

@patrickhulce

We brought back embedding screenshots but just 10 thumbnails (max 60px wide) for the visualization in the timeline :)

For httparchive I'm not sure it makes sense to save the reportCategories.audits property at all, since it's used for display in the HTML report and all of the audits data is duplicated there.

@rviscomi
Member Author

Another contributing factor is UGC like selectors failing the color-contrast or link-name audits. For example, here's a snippet from Reddit's 2.8 MB report:

"#container > .App > .AppMainPage > .NavFrame > .DualPartInterstitial > .DualPartInterstitial__content > .DualPartInterstitial__common > .DualPartInterstitialHeader.plain > .TransparentOverlay > .TransparentOverlay__childContainer > .PostsFromSubredditPage > .PostsList.PostAndCommentList > article:nth-of-type(1) > .Post__header-wrapper > .PostHeader.size-compact.m-thumbnail-margin > .PostHeader__metadata-container > .PostHeader__post-descriptor-line > .PostHeader__post-descriptor-line-overflow > span > .PostHeader__subreddit-link"

And that's just one of 300 selectors in the $.audits['color-contrast'].details.items array, which itself is ~200 KB!

Pruning reportCategories.audits sounds like a good idea and should help a lot. If that's not enough, we may also want to go a step further and strip out UGC.
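For illustration, pruning along those lines could look roughly like this (Python sketch, not the actual pipeline change; the paths follow the audit locations listed above):

```python
# Sketch only: drop the duplicated per-category audit objects and the base64
# screenshot thumbnails before the report is written to BigQuery.
def prune_lighthouse(report: dict) -> dict:
    # reportCategories[*].audits duplicates the top-level audits data and is
    # only needed to render the HTML report.
    for category in report.get("reportCategories", []):
        category.pop("audits", None)

    # $.audits['screenshot-thumbnails'].details.items holds base64 images.
    thumbs = report.get("audits", {}).get("screenshot-thumbnails", {})
    details = thumbs.get("details")
    if isinstance(details, dict):
        details.pop("items", None)

    return report
```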

@rviscomi
Member Author

HTTPArchive/bigquery#16 does the pruning. Running the change over the 6/1 data and will report back with the difference in skipped rows. I'll also be looking into Ilya's suggestion to break LH results out into a separate table.

@rviscomi
Member Author

Dropped rows
Before pruning: 12,121
After pruning: 4,882

@rviscomi
Member Author

Lighthouse is fully integrated so marking this as closed.
