Flaky E2E: Jest failed to execute suite #70674
Comments
Enabled the This is the resulting log:
|
Further logs:
|
Searching the web for the error messages doesn't lead to much, but here are some possibly related issues:
I wonder if this line is related. |
I have a potential fix in #70970. After merging them, I will have to monitor for a bit. |
Reopening to monitor. |
Looks like #70970 did not really have the intended effect. For example, in the Even today (December 13, 2022) there have been multiple occurrences of this across a variety of specs. |
@dpasque I think it's time we try increasing the shm allocation for docker. I'm going to spin up a new PR for that. |
Let's see if #71356 helps - all failures we've had today (2022/12/21) have been this one. |
Still happening occasionally after increasing the shared memory to 4GB, albeit at a lower rate than at 3GB. |
Just adding what the newest trace stacks look like:
|
I've been playing around with this on my 🎩 rotation, and I think I have some next steps to try...
|
Adding some more logs! Got it to fail, here is what the console logging showed:
And here is what the Playwright API logs showed:
|
For comparison, here is a console log from a healthy executing test:
|
So comparing the two, this seems to be the real error:
So, this confirms my suspicion that the screenshot taking process is not actually the root issue! It's just exposing a |
So, I'm not entirely sure what these errors mean haha... But here's where the error is thrown in the Chromium source code: Gonna keep digging |
So, I'm not entirely sure how the IPC / Mojo communication (which appears to be how Chromium processes talk to each other) plays a role... But this might be suggesting that some iframe on our login page is violating our frame security (same-origin) policy? But this only happens very rarely? I'm not sure what that might be 🤔 |
This is the only iframe I'm seeing on the login page:
|
Going back through a bunch of recent versions of this crash, I was able to find in the Playwright traces that they are all similarly crashes during a navigation, often the same So, until we discover what is causing some initial navigations to fail in these specific conditions, I wonder if we can get around this with some more defensive coding. For one, we should be using try/catches in our teardown code to catch any failed page operations. This way, we'll make sure we are seeing the real errors that crash the page, not the errors thrown by trying to act on a failed page. Secondly, we may want to wrap our initial authentication or landing in a couple of retry loops, so that if there's an unexpected crash right on the first navigation, we just try again. |
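A minimal sketch of the first idea (try/catches in teardown), in case it helps picture it. Everything here is hypothetical: `TeardownStep` and `runTeardownSafely` are names made up for illustration, not anything from our e2e code. The assumption is that each teardown action (screenshot, trace save, page close) can be expressed as an async function.

```typescript
// Hypothetical helper: these names are illustrative, not from the Calypso e2e code.
type TeardownStep = { name: string; run: () => Promise<void> };

// Runs every teardown step and records failures instead of throwing, so that
// acting on an already-crashed page cannot mask the error that crashed it.
async function runTeardownSafely(steps: TeardownStep[]): Promise<string[]> {
  const failures: string[] = [];
  for (const step of steps) {
    try {
      await step.run();
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      failures.push(`${step.name}: ${message}`);
    }
  }
  return failures;
}
```

The returned list could simply be logged, so teardown problems stay visible without becoming the reported failure.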
@noahtallen or @WunderBart, I'm wondering if you have any quick insight on the error as outlined in this comment. No worries if not, just curious if that rings any initial bells as you might be more familiar with the Calypso / DotCom security policies than I am 😁 |
No quick insights unfortunately, but a big fan of capturing as much log info as we can in normal runs! You're right, the only iframe in that scenario I can think of is the rest proxy. 🤔 Though a side point is that calypso.live really doesn't work well when 3rd party cookies are disabled, just because *.calypso.live is a weird different origin from normal production. (When testing calypso.live, I have to turn off enhanced tracking protection in firefox, for example) |
Sounds good, thanks for weighing in @noahtallen! 🙇 You bring up a good point, and I'm also realizing we tend to only see these errors on our TC builds that use Calypso.live. That makes me think that maybe a good next step is adding some retry logic to our initial navigation/login, and see if that smooths things out! |
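For the retry idea, a rough sketch of what a generic wrapper could look like. `withRetries`, `attempts`, and `delayMs` are hypothetical names and defaults picked for illustration; where this would actually hook into our login flow is still an open question.

```typescript
// Hypothetical helper; the default attempt count and delay are illustrative only.
async function withRetries<T>(
  action: () => Promise<T>,
  attempts = 3,
  delayMs = 1000
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await action();
    } catch (error) {
      lastError = error;
      // Pause before the next attempt, but not after the final one.
      if (attempt < attempts) {
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // Every attempt failed: surface the most recent error.
  throw lastError;
}
```

Usage would be something like wrapping the first `page.goto(...)` call in `withRetries(() => ...)`. One caveat: if the page itself has crashed, retrying on the same page object may not be enough, and the action might need to open a fresh page on each attempt.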
Just catching up on this conversation and adding a few of my observations.
On retries, I am not sure where we can add retries. The browser startup is handled in |
👋 Sorry for such a late response, somehow I missed this one. I'm wondering (and please ignore if you've already checked this scenario) if it's not the same case as microsoft/playwright#18137. From the explaining comment:
...which seems to be describing our problem as well. The proposed solution looks pretty straightforward:
So it should be sufficient to reload the page before calling the
await authenticate(page);
await page.waitForURL('/home');
await page.goto('/bar');
await baz();
Not sure which of the above would be more viable in our case, but maybe we could try with simple |
@WunderBart ahhh, I didn't see that issue, that seems exactly like it!!! 😄 Nice find! I think either approach would be viable, and hopefully that does the trick! CC @karenroldan who is 🎩 rotation this week, in case you have some time to squeeze in this fix! Otherwise, I can pick it up later 👍 |
Excellent find @WunderBart! We can give this a try later today or next week, and see if it helps reduce or eliminate this intermittent failure. |
I can make time for this next week! |
Spec file
various
TeamCity ID
9073509
Logs