WPT started getting a lot of PROTOCOL_TIMEOUT errors #12835
Howdy! Appreciate you filing this bug. 👏 We think this is the same root issue as #6512. So, we'll automatically mark this as a duplicate. Thanks! |
Pretty sure this isn't a duplicate of that 3-year-old issue unless you want to track all PROTOCOL_TIMEOUT issues there. Here is a (non-verbose) log where the page loads but gets a different PROTOCOL_TIMEOUT waiting on an emulation command (verbose logging is still rolling out):
|
Looks like the log lines get a bit truncated in the screen session that WPT uses, but here is a verbose log of the CNN page. Doesn't look like there is a race with the traces; it's in the shutdown after:
|
I'm rolling out a test option to use devtools emulation so we can continue to debug and test it while the main user flow uses simulate. |
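For context, toggling between the two modes from the Lighthouse Node API is roughly this; a sketch under assumptions (the actual WPT agent wiring and option plumbing will differ):

```ts
import lighthouse from 'lighthouse';

// Illustrative sketch: run the same URL with either throttling method against
// a Chrome instance that is already listening on `port` (e.g. the one WPT launched).
async function runWithThrottling(url: string, port: number,
                                 throttlingMethod: 'simulate' | 'devtools') {
  const result = await lighthouse(url, {
    port,
    output: 'json',
    logLevel: 'verbose',   // matches the verbose logging being rolled out
    throttlingMethod,      // 'simulate' skips the runtime emulation commands
  });
  return result?.lhr;
}
```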
(also updating all of the agents to Node 14 in case it helps but it feels like a Chrome 92 change may have pushed something over the edge) |
Thanks @pmeenan! We started seeing this exact behavior on our test infrastructure, but only on ToT Chromium on GitHub Actions, and haven't been able to reproduce it in any local or Docker environment. (I'm not sure if our stable set has reached 92 yet.) AFAICT in my reproduction attempts with extra Chromium logging, the renderer is crashing but we're failing to get
Sounds like if WPT is reliably reproing, I might be able to spin up a WPT VM on EC2 and investigate directly? |
I'll see if I can reproduce it on a non-prod VM on EC2 and make it available. I can't reproduce it in my VMWare dev environment :( |
Does Chrome normally output stderr/out logging in the renderer crash case or is it something you instrumented for a local build? I can have WPT collect the chrome output as part of the test. |
I just turned on some verbose logging |
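For reference, a minimal sketch of capturing Chrome's own verbose log output alongside a run; the helper name and paths here are hypothetical, but `--enable-logging=stderr` and `--v=1` are the standard switches for verbose Chromium logging:

```ts
import {spawn} from 'child_process';
import {createWriteStream} from 'fs';

// Sketch: launch Chrome with its verbose logging routed to stderr and tee it
// into a file the test agent can attach to the result.
function launchChromeWithLogs(chromePath: string, debugPort: number, logFile: string) {
  const sink = createWriteStream(logFile);
  const chrome = spawn(chromePath, [
    `--remote-debugging-port=${debugPort}`,
    '--enable-logging=stderr', // send Chromium logs to stderr instead of a file
    '--v=1',                   // verbose level 1: renderer/GPU crash breadcrumbs
    'about:blank',
  ]);
  chrome.stderr.pipe(sink);
  return chrome;
}
```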
FWIW, I can reproduce the renderer crash in the VM I just set up. Full log here.
It looks like the Inspector.targetCrashed {} message came in but it still tried to gracefully continue. That said, I can't get it to crash on the WPT part of the test (separate browser instance) even if I use devtools CPU and network emulation, and it doesn't crash in Lighthouse if I don't use the devtools throttling; in both cases WPT launches the browser with the same command-line flags and passes it over to Lighthouse. Not sure what would be causing the renderer crash, but maybe something about the trace events that are collected, and how, is causing it to OOM. |
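For anyone reproducing this, a small sketch of watching for that event over the DevTools protocol (using the chrome-remote-interface package, so treat the wiring as an assumption rather than how Lighthouse handles it internally):

```ts
import CDP from 'chrome-remote-interface';

// Illustrative only: subscribe to the crash event so a run can bail out
// immediately instead of hanging until the next protocol command hits
// PROTOCOL_TIMEOUT.
async function watchForRendererCrash(port: number) {
  const client = await CDP({port});
  await client.Inspector.enable();
  client.Inspector.targetCrashed(() => {
    console.error('Inspector.targetCrashed received -- renderer is gone, abort the run');
  });
  return client;
}
```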
If it helps, I'm happy to provide access to the VM to debug. If you send me a public SSH key I can install it on the VM and provide information on how to debug in the WPT agent config |
Yeah, that's the case until #11840 lands. We suspect most of our PROTOCOL_TIMEOUT issues today are crashes. @paulirish was investigating the situation on Windows and found some GPU crashes that ultimately caused a timeout without a renderer crash; this could be related to the situations without the
That'd be great! Finally having a repro environment is amazing. I think I just need to get chrome's stderr with the verbose logging to get some extra details? |
As @patrickhulce discovered, using
Would still be great to figure out what in Chrome 92 started crashing when running Lighthouse with a headless XOrg display, but this works well enough for now (the main concern is any differences in operation when running --headless). |
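For anyone hitting the same thing, the workaround amounts to launching the Lighthouse browser headless; a sketch with chrome-launcher (the actual WPT agent flag list is different):

```ts
import * as chromeLauncher from 'chrome-launcher';

// Sketch: the headless workaround is a one-flag change when Chrome is launched
// for Lighthouse. Only --headless is the point; the rest is boilerplate.
async function launchHeadless() {
  const chrome = await chromeLauncher.launch({
    chromeFlags: ['--headless'],
  });
  console.log(`Headless Chrome debugging on port ${chrome.port}`);
  return chrome; // remember to call chrome.kill() when the run finishes
}
```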
I've been unsuccessful at securing the crash ID for this case, and this time around in that VM the Chrome log is very unhelpful, with no signs of a crash despite the Lighthouse logs indicating there was one :/ Here are the logs of what I discovered so far. Unclear what the next steps would be to try and get a crash ID to file upstream at this point (perhaps whip out the Linux debugging tools and get raw crash details?). It's crippling our test suite though, so I'll definitely look into it more next week 👍
Chrome Logs During Crash Run
Lighthouse Logs During Crash Run
|
The plot thickens... First of all, I noticed today that the reproducible cases we are getting have
Second, there have been many different root causes throughout my more casual investigation attempts that I had been conflating, which I straightened out this morning. Taking both together...

Docker Environment
Verdict: whatever this is does not affect Docker on Mac; all crashes were real renderer crashes, dev-shm related.

WPT Instance
Verdict: whatever this is landed in 92.0.4477.0 and does not appear to affect the URL used by our tests in this environment.
Sidenote: this environment does not appear to be flaky at all; 14/14 fail and 12/12 pass on the correct side of the bisect. Redid the bisect from scratch just to make sure I got the exact same revision range, and I did.

GitHub Actions
Verdict: whatever this is does affect both URLs and, so far, appears to be VM-environment related, because they either both fail or both pass (which would explain why retries have not helped at all).

Next step: examine commits in https://chromium.googlesource.com/chromium/src/+log/107c2a09ba3efd7638ebe612db97ca978dd26730..8a6d807a6c7a682431c8ff619654542ecd5e524 more closely for anything that might relate to hangs. |
Alright, slowly narrowing the case here. I'm trying to build a minimal repro, and so far a Lighthouse run of webpagetest.org will start to work if one of the following changes is made...
Notably, @pmeenan, you probably know the makeup of the site better than I do; any immediate thoughts as to what could be going on there? My hunch that started this was something around the bframe / worker from recaptcha, but blocking recaptcha alone was not enough :/ |
The more confusing part to me is that it works fine without headless when using WebPageTest and devtools throttling. Not sure if there are some other command-line flag differences or if some trace event is triggering it (and when testing LH, the catchpoint and cnn sites both fail as well). In the Lighthouse case when run under WPT, it uses the exact same command-line flags since WPT launches the browser in both cases. Here is the full list of options WPT uses (some may be invalid and still around just because they haven't been cleaned up):
|
Honestly, my best guess is that there is either something racy that only gets triggered with CPU emulation (but gets triggered reliably), or a busted trace event in a category that WPT doesn't enable by default, one that tries to write a bad trace event (accesses invalid memory or something), causing the renderer crash. |
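One way to chase that theory would be to bisect the trace category list itself; a rough sketch over the DevTools protocol (chrome-remote-interface again; the category list passed in is whatever set you want to test, not a verified minimal one):

```ts
import CDP from 'chrome-remote-interface';

// Sketch for testing the "busted trace event" theory: start tracing with an
// explicit category list so categories can be removed one at a time while
// watching for the renderer crash.
async function traceWithCategories(port: number, categories: string[]) {
  const client = await CDP({port});
  const events: unknown[] = [];
  client.Tracing.dataCollected(({value}: {value: unknown[]}) => events.push(...value));
  await client.Tracing.start({
    traceConfig: {recordMode: 'recordAsMuchAsPossible', includedCategories: categories},
  });
  // ...drive the page load here, then stop and wait for the buffered events.
  await client.Tracing.end();
  await new Promise<void>(resolve => client.Tracing.tracingComplete(() => resolve()));
  return events;
}
```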
Huzzah! It's the CPU profiler and we don't even need to use it right now! 🎉 |
Here was my bisect setup on the WebPageTest instance @pmeenan so kindly set up for us.
In another terminal
If it tries to load but stops at
It never flaked in the WPT instance environment, but we observe flaky behavior in GitHub Actions, so better safe than sorry. At this point I've bisected 3 times to the exact same range, and I'm 1000% sure it's in between r871421 and r871818. |
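For posterity, the per-revision pass/fail probe for a bisect like this can be a single Lighthouse call; this is a reconstruction under assumptions, not the exact commands used on the WPT box:

```ts
import * as chromeLauncher from 'chrome-launcher';
import lighthouse from 'lighthouse';

// Sketch: launch the candidate Chromium binary headed (headless masks the
// crash, per the earlier comment), run Lighthouse with devtools throttling,
// and report good/bad for the bisect.
async function probe(chromePath: string, url: string): Promise<boolean> {
  const chrome = await chromeLauncher.launch({chromePath});
  try {
    const result = await lighthouse(url, {
      port: chrome.port,
      throttlingMethod: 'devtools', // the failing configuration
      logLevel: 'info',
    });
    return !result?.lhr.runtimeError; // good revision if the run completed cleanly
  } catch {
    return false; // PROTOCOL_TIMEOUT (or any fatal error) marks the revision bad
  } finally {
    await chrome.kill();
  }
}
```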
FAQ
URL
https://www.catchpoint.com/
What happened?
Somewhere last week we started seeing a lot of failures on WebPageTest when running Lighthouse audits. Not sure if it is a Chrome change or a Lighthouse change, since the agents auto-update both.
The logs are all failing with PROTOCOL_TIMEOUT, usually when issuing an Emulation command of some kind. In this example it was trying to disable JS on a hung page, but there have been other cases where it happens after collecting the trace, when it is resetting the network emulation (the throttling method is devtools with no network throttling but a CPU throttle).
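For illustration, the protocol traffic behind those steps looks roughly like this; a sketch with chrome-remote-interface (Lighthouse issues the equivalent commands through its own driver):

```ts
import CDP from 'chrome-remote-interface';

// Illustrative sketch of the kinds of Emulation/Network commands that were
// timing out.
async function resetEmulation(port: number) {
  const client = await CDP({port});
  // CPU throttling on/off -- the setting that correlates with the failures.
  await client.Emulation.setCPUThrottlingRate({rate: 1});
  // Re-enable JS (the hung example in the logs was disabling it).
  await client.Emulation.setScriptExecutionDisabled({value: false});
  // Clear any network emulation.
  await client.Network.enable({});
  await client.Network.emulateNetworkConditions({
    offline: false, latency: 0, downloadThroughput: -1, uploadThroughput: -1,
  });
  await client.close();
}
```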
What did you expect?
I expected it to run the test and collect the audits.
What have you tried?
It seems to only happen when devtools CPU throttling is enabled. If I switch to "provided" or don't provide a CPU throttle, then the tests run fine.
Best guess is that the emulation commands are run in parallel with other commands (like collecting the trace) that take longer when CPU throttling is enabled, and something pushed it over the edge to where it is now timing out.
I'm rolling out verbose logging now so I should have better logs shortly that should show what else is going on when the errors occur.
How were you running Lighthouse?
CLI, WebPageTest
Lighthouse Version
8.1.0
Chrome Version
92.0.4515.107
Node Version
12.22.2
Relevant log output