Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky E2E: Jest failed to execute suite #70674

Open
worldomonation opened this issue Dec 1, 2022 · 28 comments · Fixed by #70970 or #71154
Open

Flaky E2E: Jest failed to execute suite #70674

worldomonation opened this issue Dec 1, 2022 · 28 comments · Fixed by #70970 or #71154
Assignees
Labels
[Type] Flaky E2E Denotes an issue created for fixing flaky e2e tests [Type] Monitoring Issues or tasks that are being actively monitored.

Comments

@worldomonation
Copy link
Contributor

Spec file

various

TeamCity ID

9073509

Logs

page.screenshot: Target crashed
=========================== logs ===========================
taking page screenshot
============================================================
page.screenshot: Target crashed
=========================== logs ===========================
taking page screenshot
============================================================
    at JestEnvironmentPlaywright.handleTestEvent (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/packages/calypso-e2e/src/jest-playwright-config/environment.ts:344:28)
    at dispatch (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/jest-circus/build/state.js:75:5)
    at runAndTransformResultsToJestFormat (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/jest-circus/build/legacy-code-todo-rewrite/jestAdapterInit.js:234:3)
    at jestAdapter (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/jest-circus/build/legacy-code-todo-rewrite/jestAdapter.js:82:19)
    at runTestInternal (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/jest-runner/build/runTest.js:389:16)
    at runTest (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/jest-runner/build/runTest.js:475:34)
    at Object.worker (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/jest-runner/build/testWorker.js:133:12)
@worldomonation worldomonation added the [Type] Flaky E2E Denotes an issue created for fixing flaky e2e tests label Dec 1, 2022
@worldomonation worldomonation self-assigned this Dec 1, 2022
@worldomonation worldomonation moved this from Backlog to Needs Review in KitKat Project Board Dec 1, 2022
@worldomonation worldomonation moved this from Needs Review to Backlog in KitKat Project Board Dec 1, 2022
@worldomonation worldomonation moved this from Backlog to In Progress in KitKat Project Board Dec 1, 2022
@worldomonation
Copy link
Contributor Author

This has been an ongoing issue since the early days of the framework, but it's gotten a lot more frequent in the last few months.

There have been 8 occurrences for the editor/editor__revision.ts spec alone for the last week:
image

and 13 cases in the editor__navbar.ts spec in the same time period:
image

@worldomonation
Copy link
Contributor Author

Enabled the pw:browser debug level in Playwright and after the same build several times, I came across the crash.

This is the resulting log:

01:29:45 
  2022-12-03T01:29:45.152Z pw:browser [pid=4174][err] [1203/012945.152220:ERROR:render_process_host_impl.cc(5157)] Terminating render process for bad Mojo message: Received bad user message: Prefetch/IsValidCrossOrigin: frame origin mismatch

01:29:45 
  2022-12-03T01:29:45.152Z pw:browser [pid=4174][err] [1203/012945.152304:ERROR:bad_message.cc(29)] Terminating renderer for bad IPC message, reason 123

01:29:45 
  2022-12-03T01:29:45.153Z pw:browser writing frame 1670030985.146939

01:29:45 
  2022-12-03T01:29:45.153Z pw:browser writing frame 1670030985.148676

01:29:45 
  2022-12-03T01:29:45.153Z pw:browser writing frame 1670030985.148827

01:29:45 
  2022-12-03T01:29:45.155Z pw:browser writing frame 1670030985.151879

01:29:45 
  2022-12-03T01:29:45.158Z pw:browser writing frame 1670030985.151862

01:29:45 
  2022-12-03T01:29:45.162Z pw:browser writing frame 1670030985.156923

01:29:45 
  Event for import__site: hook_failure

01:29:45 
  ERROR: Site Import > beforeAll

01:29:45 
  

01:29:45 
   page.goto: net::ERR_ABORTED; maybe frame was detached?

01:29:45 
  =========================== logs ===========================

01:29:45 
  navigating to "https://container-vigorous-khayyam.calypso.live/", waiting until "load"

01:29:45 
  ============================================================

01:29:45 
      at TestAccount.authenticate (/home/teamcity-0/buildAgent/work/c4a9d5b38c1dacde/packages/calypso-e2e/src/lib/test-account.ts:41:15)

01:29:45 
      at /home/teamcity-0/buildAgent/work/c4a9d5b38c1dacde/test/e2e/specs/tools/import__site.ts:18:3 {

01:29:45 
    name: 'Error'

01:29:45 
  }

@worldomonation
Copy link
Contributor Author

Further logs:

01:29:45 
  Event for import__site: test_start

01:29:45 
  In test_fn_start, is browser connected: true

01:29:45 
  Event for import__site: test_fn_start

01:29:45 
  In test_fn_start, is browser connected: true

01:29:45 
  Event for import__site: test_done

01:29:45 
  Event for import__site: test_start

01:29:45 
  In test_fn_start, is browser connected: true

01:29:45 
  Event for import__site: test_fn_start

01:29:45 
  In test_fn_start, is browser connected: true

01:29:45 
  Event for import__site: test_done

01:29:45 
  Event for import__site: test_start

01:29:45 
  In test_fn_start, is browser connected: true

01:29:45 
  Event for import__site: test_fn_start

01:29:45 
  In test_fn_start, is browser connected: true

01:29:45 
  Event for import__site: test_done

01:29:45 
  Event for import__site: test_start

01:29:45 
  In test_fn_start, is browser connected: true

01:29:45 
  Event for import__site: test_fn_start

01:29:45 
  In test_fn_start, is browser connected: true

01:29:45 
  Event for import__site: test_done

01:29:45 
  Event for import__site: test_start

01:29:45 
  In test_fn_start, is browser connected: true

01:29:45 
  Event for import__site: test_fn_start

01:29:45 
  In test_fn_start, is browser connected: true

01:29:45 
  Event for import__site: test_done

01:29:45 
  Event for import__site: test_start

01:29:45 
  In test_fn_start, is browser connected: true

01:29:45 
  Event for import__site: test_fn_start

01:29:45 
  In test_fn_start, is browser connected: true

01:29:45 
  Event for import__site: test_done

01:29:45 
  Event for import__site: run_describe_finish

01:29:45 
  Event for import__site: run_describe_finish

@worldomonation
Copy link
Contributor Author

@worldomonation
Copy link
Contributor Author

I have a potential fix in #70970. After merging them, I will have to monitor for a bit.

Repository owner moved this from In Progress to Done🎊 in KitKat Project Board Dec 9, 2022
@worldomonation worldomonation reopened this Dec 9, 2022
Repository owner moved this from Done🎊 to In Progress in KitKat Project Board Dec 9, 2022
@worldomonation
Copy link
Contributor Author

Reopening to monitor.

@worldomonation
Copy link
Contributor Author

Looks like #70970 did not really have the intended effect.

For example, in the support__popover-invalid spec alone, we have seen 4 new instances of browser crashing:
image

Even today (December 13, 2022) there has been multiple occurrences of this across a variety of specs.
image

@worldomonation
Copy link
Contributor Author

@dpasque I think it's time we try increasing the shm allocation for docker. I'm going to spin up a new PR for that.

@worldomonation
Copy link
Contributor Author

Let's see if #71356 helps - all failure we've had today (2022/12/21) have been this failure.

@worldomonation
Copy link
Contributor Author

Incidents of this issue have dropped off quite significantly, but still occurs occasionally.

Looking at the trend, this issue appears to occur with media related tests.

image

image

@worldomonation worldomonation added the [Type] Monitoring Issues or tasks that are being actively monitored. label Jan 24, 2023
@worldomonation
Copy link
Contributor Author

Still happening occasionally after increasing the shared memory to 4GB, albeit at a rate lower than it used to at 3GB.

@dpasque
Copy link
Contributor

dpasque commented Apr 6, 2023

Just adding what the newest trace stacks look like:

page.screenshot: Target crashed
      =========================== logs ===========================
      taking page screenshot
      ============================================================
 page.screenshot: Target crashed
      =========================== logs ===========================
      taking page screenshot
      ============================================================
          at JestEnvironmentPlaywright.teardown (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/packages/calypso-e2e/src/jest-playwright-config/environment.ts:165:22)
          at runTestInternal (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/jest-runner/build/runTest.js:461:5)
          at runTest (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/jest-runner/build/runTest.js:475:34)
          at Object.worker (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/jest-runner/build/testWorker.js:133:12)

@dpasque
Copy link
Contributor

dpasque commented Apr 7, 2023

I've been playing around with this on my 🎩 rotation, and I think I have some next steps to try...

  • Get a fresh set of logs. I'm pushing up a PR that captures browser logging and am going to run tests on CI till I get the crash. I also added some other logs here. I'm actually a tiny suspicious the browser may already be crashed by the time we are taking the screenshot?
  • Do a deep dive on machine resources during these test failures. I've been noticing some sluggishness during tests -- maybe looks like congested network? I wonder if any resource constraints are lining up with these failures?
  • With all that info, if we still don't know... I think it's time to ask again on the Playwright repo 😅
  • We could add a try-catch here so that we stop crashing the whole suite -- that might actually make the failures more meaningful to people (we would see what test actually failed). But, it would make detecting these crashes harder on CI.

@dpasque
Copy link
Contributor

dpasque commented Apr 7, 2023

Adding some more logs!

Got it to fail, here is what the console logging showed:

2023-04-07T16:06:28.554Z pw:browser <launching> /home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/playwright-core/.local-browsers/chromium-1055/chrome-linux/chrome --disable-field-trial-config --disable-background-networking --enable-features=NetworkService,NetworkServiceInProcess --disable-background-timer-throttling --disable-backgrounding-occluded-windows --disable-back-forward-cache --disable-breakpad --disable-client-side-phishing-detection --disable-component-extensions-with-background-pages --disable-component-update --no-default-browser-check --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-features=ImprovedCookieControls,LazyFrameLoading,GlobalMediaControls,DestroyProfileOnBrowserClose,MediaRouter,DialMediaRouteProvider,AcceptCHFrame,AutoExpandDetailsElement,CertificateTransparencyComponentUpdater,AvoidUnnecessaryBeforeUnloadCheckSync,Translate --allow-pre-commit-input --disable-hang-monitor --disable-ipc-flooding-protection --disable-popup-blocking --disable-prompt-on-repost --disable-renderer-backgrounding --disable-sync --force-color-profile=srgb --metrics-recording-only --no-first-run --enable-automation --password-store=basic --use-mock-keychain --no-service-autorun --export-tagged-pdf --headless --hide-scrollbars --mute-audio --blink-settings=primaryHoverType=2,availableHoverTypes=2,primaryPointerType=4,availablePointerTypes=4 --no-sandbox --user-data-dir=/home/teamcity-2/buildAgent/temp/buildTmp/playwright_chromiumdev_profile-ae3r7K --remote-debugging-pipe --no-startup-window
2023-04-07T16:06:28.573Z pw:browser <launched> pid=3191
2023-04-07T16:06:28.757Z pw:browser [pid=3191][err] [0407/160628.755709:ERROR:bus.cc(399)] Failed to connect to the bus: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2023-04-07T16:06:28.837Z pw:browser [pid=3191][err] [0407/160628.836220:ERROR:bus.cc(399)] Failed to connect to the bus: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2023-04-07T16:06:28.837Z pw:browser [pid=3191][err] [0407/160628.800476:WARNING:sandbox_linux.cc(393)] InitializeSandbox() called with multiple threads in process gpu-process.
2023-04-07T16:06:28.838Z pw:browser [pid=3191][err] [0407/160628.836346:ERROR:bus.cc(399)] Failed to connect to the bus: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2023-04-07T16:06:28.844Z pw:browser [pid=3191][err] [0407/160628.840861:WARNING:bluez_dbus_manager.cc(247)] Floss manager not present, cannot set Floss enable/disable.
2023-04-07T16:06:33.693Z pw:browser <launching> /home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/playwright-core/.local-browsers/ffmpeg-1008/ffmpeg-linux -loglevel error -f image2pipe -avioflags direct -fpsprobesize 0 -probesize 32 -analyzeduration 0 -c:v mjpeg -i - -y -an -r 25 -c:v vp8 -qmin 0 -qmax 50 -crf 8 -deadline realtime -speed 8 -b:v 1M -threads 1 -vf pad=800:450:0:0:gray,crop=800:450:0:0 /home/teamcity-2/buildAgent/temp/buildTmp/431fea4297650193e2792c6b1385ee6c.webm
2023-04-07T16:06:33.770Z pw:browser <launched> pid=4686
  console.log
    TestAccount:gutenbergSimpleSiteUser => Logging in via Login Page

      at TestAccount.log (../../packages/calypso-e2e/src/lib/test-account.ts:205:12)

2023-04-07T16:06:34.383Z pw:browser [pid=3191][err] [0407/160634.383470:ERROR:render_process_host_impl.cc(5252)] Terminating render process for bad Mojo message: Received bad user message: Prefetch/IsValidCrossOrigin: frame origin mismatch
2023-04-07T16:06:34.384Z pw:browser [pid=3191][err] [0407/160634.383549:ERROR:bad_message.cc(29)] Terminating renderer for bad IPC message, reason 123
ERROR: Theme: Preview And Activate > beforeAll

 page.goto: net::ERR_ABORTED; maybe frame was detached?
=========================== logs ===========================
navigating to "https://container-flamboyant-hoover.calypso.live/log-in", waiting until "load"
============================================================
    at LoginPage.visit (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/packages/calypso-e2e/src/lib/pages/login-page.ts:34:26)
    at TestAccount.logInViaLoginPage (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/packages/calypso-e2e/src/lib/test-account.ts:64:3)
    at TestAccount.authenticate (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/packages/calypso-e2e/src/lib/test-account.ts:47:4)
    at /home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/test/e2e/specs/appearance/theme__details-preview-activate.ts:38:3 {
  name: 'Error'
} 

Beginning page.screenshot()...
2023-04-07T16:06:34.582Z pw:browser [pid=3191] <gracefully close start>
2023-04-07T16:06:34.600Z pw:browser [pid=4686] <gracefully close start>
2023-04-07T16:06:34.617Z pw:browser Closing stdin...
2023-04-07T16:06:34.617Z pw:browser ffmpeg finished input.
2023-04-07T16:06:34.658Z pw:browser [pid=3191] <process did exit: exitCode=0, signal=null>
2023-04-07T16:06:34.658Z pw:browser [pid=3191] starting temporary directories cleanup
2023-04-07T16:06:34.668Z pw:browser [pid=3191] finished temporary directories cleanup
2023-04-07T16:06:34.668Z pw:browser [pid=3191] <gracefully close end>
2023-04-07T16:06:34.796Z pw:browser [pid=4686] <process did exit: exitCode=0, signal=null>
2023-04-07T16:06:34.796Z pw:browser ffmpeg onkill exitCode=0 signal=null
2023-04-07T16:06:34.796Z pw:browser [pid=4686] starting temporary directories cleanup
2023-04-07T16:06:34.796Z pw:browser [pid=4686] finished temporary directories cleanup
2023-04-07T16:06:34.796Z pw:browser [pid=4686] <gracefully close end>

And here is what the Playwright API logs showed:

2023-04-07T16:06:33.631Z 946 api info: => browser.newPage started
2023-04-07T16:06:33.659Z 946 api info: <= browser.newPage succeeded
2023-04-07T16:06:33.661Z 946 api info: => browser.newPage started
2023-04-07T16:06:33.805Z 946 api info: <= browser.newPage succeeded
2023-04-07T16:06:33.807Z 946 api info: => page.on started
2023-04-07T16:06:33.808Z 946 api info: => tracing.start started
2023-04-07T16:06:33.812Z 946 api info: <= page.on succeeded
2023-04-07T16:06:33.901Z 946 api info: <= tracing.start succeeded
2023-04-07T16:06:33.923Z 946 api info: => browserContext.clearCookies started
2023-04-07T16:06:33.929Z 946 api info: <= browserContext.clearCookies succeeded
2023-04-07T16:06:33.991Z 946 api info: => page.route started
2023-04-07T16:06:33.995Z 946 api info: <= page.route succeeded
2023-04-07T16:06:33.996Z 946 api info: => page.goto started
2023-04-07T16:06:34.406Z 946 api info: <= page.goto failed
2023-04-07T16:06:34.470Z 946 api info: => tracing.stop started
2023-04-07T16:06:34.573Z 946 api info: <= tracing.stop succeeded
2023-04-07T16:06:34.576Z 946 api info: => page.screenshot started
2023-04-07T16:06:34.580Z 946 api info: <= page.screenshot failed
2023-04-07T16:06:34.581Z 946 api info: => browser.close started
2023-04-07T16:06:34.668Z 946 api info: <= browser.close succeeded

@dpasque
Copy link
Contributor

dpasque commented Apr 7, 2023

For comparison, here is a console log from a healthy executing test:

2023-04-07T16:06:28.476Z pw:browser <launching> /home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/playwright-core/.local-browsers/chromium-1055/chrome-linux/chrome --disable-field-trial-config --disable-background-networking --enable-features=NetworkService,NetworkServiceInProcess --disable-background-timer-throttling --disable-backgrounding-occluded-windows --disable-back-forward-cache --disable-breakpad --disable-client-side-phishing-detection --disable-component-extensions-with-background-pages --disable-component-update --no-default-browser-check --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-features=ImprovedCookieControls,LazyFrameLoading,GlobalMediaControls,DestroyProfileOnBrowserClose,MediaRouter,DialMediaRouteProvider,AcceptCHFrame,AutoExpandDetailsElement,CertificateTransparencyComponentUpdater,AvoidUnnecessaryBeforeUnloadCheckSync,Translate --allow-pre-commit-input --disable-hang-monitor --disable-ipc-flooding-protection --disable-popup-blocking --disable-prompt-on-repost --disable-renderer-backgrounding --disable-sync --force-color-profile=srgb --metrics-recording-only --no-first-run --enable-automation --password-store=basic --use-mock-keychain --no-service-autorun --export-tagged-pdf --headless --hide-scrollbars --mute-audio --blink-settings=primaryHoverType=2,availableHoverTypes=2,primaryPointerType=4,availablePointerTypes=4 --no-sandbox --user-data-dir=/home/teamcity-2/buildAgent/temp/buildTmp/playwright_chromiumdev_profile-6bb9AZ --remote-debugging-pipe --no-startup-window
2023-04-07T16:06:28.511Z pw:browser <launched> pid=3045
2023-04-07T16:06:28.741Z pw:browser [pid=3045][err] [0407/160628.740159:ERROR:bus.cc(399)] Failed to connect to the bus: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2023-04-07T16:06:28.741Z pw:browser [pid=3045][err] [0407/160628.740703:ERROR:bus.cc(399)] Failed to connect to the bus: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2023-04-07T16:06:28.742Z pw:browser [pid=3045][err] [0407/160628.740781:ERROR:bus.cc(399)] Failed to connect to the bus: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2023-04-07T16:06:28.854Z pw:browser [pid=3045][err] [0407/160628.854251:WARNING:bluez_dbus_manager.cc(247)] Floss manager not present, cannot set Floss enable/disable.
2023-04-07T16:06:33.657Z pw:browser [pid=3045][err] [0407/160628.886265:WARNING:sandbox_linux.cc(393)] InitializeSandbox() called with multiple threads in process gpu-process.
2023-04-07T16:06:33.686Z pw:browser <launching> /home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/playwright-core/.local-browsers/ffmpeg-1008/ffmpeg-linux -loglevel error -f image2pipe -avioflags direct -fpsprobesize 0 -probesize 32 -analyzeduration 0 -c:v mjpeg -i - -y -an -r 25 -c:v vp8 -qmin 0 -qmax 50 -crf 8 -deadline realtime -speed 8 -b:v 1M -threads 1 -vf pad=800:450:0:0:gray,crop=800:450:0:0 /home/teamcity-2/buildAgent/temp/buildTmp/cae3e409edeb015dcf966d7ca4024c9c.webm
2023-04-07T16:06:33.717Z pw:browser <launched> pid=4575
  console.log
    TestAccount:siteEditorSimpleSiteUser => Logging in via Login Page

      at TestAccount.log (../../packages/calypso-e2e/src/lib/test-account.ts:205:12)

2023-04-07T16:07:02.556Z pw:browser [pid=3045] <gracefully close start>
2023-04-07T16:07:02.560Z pw:browser [pid=4575] <gracefully close start>
2023-04-07T16:07:02.560Z pw:browser Closing stdin...
2023-04-07T16:07:02.560Z pw:browser ffmpeg finished input.
2023-04-07T16:07:02.578Z pw:browser [pid=4575] <process did exit: exitCode=0, signal=null>
2023-04-07T16:07:02.578Z pw:browser ffmpeg onkill exitCode=0 signal=null
2023-04-07T16:07:02.578Z pw:browser [pid=4575] starting temporary directories cleanup
2023-04-07T16:07:02.579Z pw:browser [pid=4575] finished temporary directories cleanup
2023-04-07T16:07:02.579Z pw:browser [pid=4575] <gracefully close end>
2023-04-07T16:07:02.588Z pw:browser [pid=3045] <process did exit: exitCode=0, signal=null>
2023-04-07T16:07:02.588Z pw:browser [pid=3045] starting temporary directories cleanup
2023-04-07T16:07:02.627Z pw:browser [pid=3045] finished temporary directories cleanup
2023-04-07T16:07:02.627Z pw:browser [pid=3045] <gracefully close end>

@dpasque
Copy link
Contributor

dpasque commented Apr 7, 2023

So comparing the two, this seems to be the real error:

2023-04-07T16:06:34.383Z pw:browser [pid=3191][err] [0407/160634.383470:ERROR:render_process_host_impl.cc(5252)] Terminating render process for bad Mojo message: Received bad user message: Prefetch/IsValidCrossOrigin: frame origin mismatch
2023-04-07T16:06:34.384Z pw:browser [pid=3191][err] [0407/160634.383549:ERROR:bad_message.cc(29)] Terminating renderer for bad IPC message, reason 123
ERROR: Theme: Preview And Activate > beforeAll

 page.goto: net::ERR_ABORTED; maybe frame was detached?
=========================== logs ===========================
navigating to "https://container-flamboyant-hoover.calypso.live/log-in", waiting until "load"
============================================================
    at LoginPage.visit (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/packages/calypso-e2e/src/lib/pages/login-page.ts:34:26)
    at TestAccount.logInViaLoginPage (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/packages/calypso-e2e/src/lib/test-account.ts:64:3)
    at TestAccount.authenticate (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/packages/calypso-e2e/src/lib/test-account.ts:47:4)
    at /home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/test/e2e/specs/appearance/theme__details-preview-activate.ts:38:3 {
  name: 'Error'
} 

So, this confirms my suspicion that the screenshot taking process is not actually the root issue! It's just exposing a page that is in a bad state from an earlier error.

@dpasque
Copy link
Contributor

dpasque commented Apr 7, 2023

So, I'm not entirely sure what these errors mean haha... But here's where the error is thrown in the Chromium source code:

https://source.chromium.org/chromium/chromium/src/+/main:content/browser/loader/prefetch_url_loader_service.cc;l=258

Gonna keep digging

@dpasque
Copy link
Contributor

dpasque commented Apr 7, 2023

So, I'm not entirely sure how the IPC / Mojo communication (which appears to be how Chromium processes talk to each other) plays a roll...

But this might be suggesting that some iframe on our login page is violating our frame security (same-origin) policy? But this only happens very rarely? I'm not sure what that might be 🤔

@dpasque
Copy link
Contributor

dpasque commented Apr 7, 2023

This is the only iframe I'm seeing on the login page:

https://public-api.wordpress.com/wp-admin/rest-proxy/?v=2.0#https://container-flamboyant-hoover.calypso.live

@dpasque
Copy link
Contributor

dpasque commented Apr 7, 2023

Going back through a bunch of recent versions of this crash, I actually was able to find in the Playwright traces that they are all similarly crashes when trying to do a navigation, often the same ERR_ABORTED error, sometimes it manifests in other ways. It always seems to happen in the beforeAll hooks of scripts too.

So, pending discovering what it causing some initial navigations to fail in these specific conditions, I wonder if we can get around this with just some more defensive coding.

For one, we should be using try/catches in our teardown code to catch any failed page operations. This way, we'll make sure we are actually seeing the real errors that crash the page, not the errors thrown by trying to act on a failed page.

Secondly, we may want to wrap our initial authentication or landing in a couple of retry loops, so that if there's an unexpected crash right on the first navigation, we just try again.

@dpasque
Copy link
Contributor

dpasque commented Apr 7, 2023

@noahtallen or @WunderBart, I'm wondering if you have any quick insight on the error as outlined in this comment. No worries if not, just curious if that rings any initial bells as you might be more familiar with the Calypso / DotCom security policies than I am 😁

@noahtallen
Copy link
Contributor

No quick insights unfortunately, but a big fan of capturing as much log info as we can in normal runs!

You're right, the only iframe in that scenario I can think of is the rest proxy. 🤔 Though a side point is that calypso.live really doesn't work well when 3rd party cookies are disabled, just because *.calypso.live is a weird different origin from normal production. (When testing calypso.live, I have to turn off enhanced tracking protection in firefox, for example)

@dpasque
Copy link
Contributor

dpasque commented Apr 7, 2023

Sounds good, thanks for weighing in @noahtallen! 🙇 You bring up a good point, and I'm also realizing we tend to only see these errors on our TC builds that use Calypso.live. That makes me think that maybe a good next step is adding some retry logic to our initial navigation/login, and see if that smooths things out!

@worldomonation
Copy link
Contributor Author

Just catching up on this conversation and adding a few of my observations.

  1. The observation that the crash only appears to happen on builds that use the calypso.live target is true. I don't think I've ever encountered this crash locally (as it hits staging when proxied/production if not proxied) and in the Pre-Release Tests (as it hits wpcalypso.wordpress.com).
  2. Playwright runs with third-party cookies completely disabled and cannot be enabled. I'm not certain whether this is a legitimate bug in Chromium and I am more than happy to report this upstream, though the challenge in that is they typically ask us for a minimally reproducible snippet...which in theory is our login process. But there may be additional parts involved (eg. our particular CI/docker image setup) that they may not be able to replicate.
  3. Defensive coding seems to be the right way to solve this problem, especially with the time risk of going upstream and digging really deep.

On retries, I am not sure where we can add retrie. The browser startup is handled in environment.ts while the initial navigation to the login page is handled in test-accounts.ts. Since the actual page object itself crashes upon navigation to the /log-in endpoint, I'm not sure how to "cross bridges" from POMs to the framework code itself.

@WunderBart
Copy link
Member

WunderBart commented May 11, 2023

👋 Sorry for such a late response, somehow I missed this one.

I'm wondering (and please ignore if you've already checked this scenario) if it's not the same case as microsoft/playwright#18137.

From the explaining comment:

The intent was to authenticate user and redirect to them to /bar.

However, what happens in practice, the polling mechanism of the app detects that the user was > authenticated, and attempts to redirect them to post-login page.

The end result was that sometimes when baz() executes we would be on /home and sometimes on > /baz.

...which seems to be describing our problem as well. The proposed solution looks pretty straightforward:

The reason for adding page.reload() after authenticate() is to ensure that before we are calling goto there are no more outstanding redirects.

So it should be sufficient to reload the page before calling the goto. There's also another solution proposed right in the next comment:

await authenticate(page);
await page.waitForURL('/home');
await page.goto('/bar');
await baz();

Not sure which of the above would be more viable in our case, but maybe we could try with simple reload before goto to clear the page's state and see what happens?

@dpasque
Copy link
Contributor

dpasque commented May 11, 2023

@WunderBart ahhh, I didn't see that issue, that seems exactly like it!!! 😄 Nice find!

I think either approach would be viable, and hopefully that does the trick! CC @karenroldan who is 🎩 rotation this week, in case you have some time to squeeze in this fix! Otherwise, I can pick it up later 👍

@worldomonation
Copy link
Contributor Author

Excellent find @WunderBart! We can give this a try later today or next week, and see if it helps reduce or eliminate this intermittent failure.

@karenroldan
Copy link
Contributor

CC @karenroldan who is 🎩 rotation this week, in case you have some time to squeeze in this fix

I can make time for this next week!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[Type] Flaky E2E Denotes an issue created for fixing flaky e2e tests [Type] Monitoring Issues or tasks that are being actively monitored.
Projects
Status: Backlog
6 participants