Notification: Broken Jetpack connection banner doesn't dismiss #90758

ebinnion · 2024-05-15T14:43:17Z

Quick summary

See p1715718239728809-slack-C029GN3KD

The summary of that thread is that some set of users see a notification that describes the Jetpack connection as being broken because the plugin is deactivated. But, even after fixing the connection, the notice doesn't dismiss.

In working with an HE on this issue, we fixed the issue by clearing IndexedDB and localStorage from the application tab of Chrome, after we noticed that the issue only showed for the HE in Chrome and not in Safari.

In talking to @supernovia, she estimates that roughly 2-3% of her interactions involve this issue, and that they ask users to log out and back in.

Steps to reproduce

Based on the reports, I would imagine that this is due to an intermittent connection issue that then persists. Based on that, I'm not sure what the repro steps are. This is how I would start though.

Break the Jetpack connection by going to /_cli for an atomic site and removing the blog or user connection secret
Load WordPress.com
Verify the notice shows
Fix the connection in Jetpack debug
Verify the notice still shows

What you expected to happen

The notice to disappear after the connection is fixed.

What actually happened

The notice persists and requires HE intervention and the user logging out.

Impact

Some (< 50%)

Available workarounds?

Yes, easy to implement

Platform (Simple and/or Atomic)

No response

Logs or notes

No response

zaguiini · 2024-05-17T00:14:39Z

Maybe it's not a false positive. See: https://github.com/Automattic/dotcom-forge/issues/7234#issuecomment-2116409760

supernovia · 2024-05-17T03:26:00Z

@zaguiini

Maybe it's not a false positive. See: Automattic/dotcom-forge#7234 (comment)

We definitely run into actual broken connections to fix, too, and it would be good to address the issues behind those.

But there are also quite a few cases where the user will see connection errors even when a debug reveals the connection is fine. If the user tries with another browser or an incognito window with these cases, it works, so in these cases it's something stuck in the browser itself. Logging out and back in seems to fix it, but I'm not sure folks would think to try that in their troubleshooting steps, so the error can be frustrating for them.

mrfoxtalbot · 2024-05-17T11:05:09Z

Related? #79324

paulopmt1 · 2024-05-17T18:38:20Z

We're seeing a similar case on A4A sites, but the root cause shouldn't be the same. Maybe the reason why it doesn't clear the error message is the same bug.

supernovia · 2024-05-17T19:08:36Z

Thanks @paulopmt1 - we just ran into that today in our WooCommerce tinkering at a meetup; same situation! And yes the error is stuck. That must be what's causing us to get new users with this error when everything seems fine.

paulopmt1 · 2024-05-17T19:19:37Z

Thanks for sharing one more example of that @supernovia.

when everything seems fine

I also noticed that everything was working as expected.
What keeps the error visible seems to be a cache issue after a broken connection. We still don't know why the connection breaks in the first place, though.

paulopmt1 · 2024-05-22T17:40:56Z

Until now, Luis and I have been working on fixing the A4A site creation issue, which is similar to this issue but not the same.

Investigating this issue further, I found two interesting things:

We have a 2-minute cache on Calypso when we find a Jetpack error.
We also have a 5-minute cache in the Backend: fbhepr%2Skers%2Sjcpbz%2Sjc%2Qpbagrag%2Serfg%2Qncv%2Qcyhtvaf%2Sraqcbvagf%2Swrgcnpx%2Qpbaarpgvba%2Qurnygu.cuc%3Se%3Q6s5866nq%26zb%3Q4091%26sv%3Q168%23202-og

Note: Our current cache clear CTA on the /hosting page won't clear that backend cache, so it will always be 5 minutes.

So even when we (or Jetpack itself) fix the connection, it can take up to 7 minutes (2 + 5) for the user to see the fix, which is not ideal and can lead to the issues we see here.

We could reduce that to roughly 1 minute by updating our backend to cache a failed Jetpack connection for only 1 minute and a successful Jetpack connection for 5 minutes (as it does currently). In that case, our frontend would have a ~15s cache only for failed Jetpack connections (so multiple requests would still benefit from it) and 5 minutes for successful connections (as it does currently).
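To make the timing concrete, here is a minimal sketch of the proposal, not the actual Calypso or backend code. All names (`BACKEND_TTL_MS`, `FRONTEND_TTL_MS`, `worstCaseStaleErrorMs`) are hypothetical; only the durations come from the comment above. It assumes the worst case is a frontend request that picks up a stale backend result just before the backend cache expires, so the two TTLs add up.

```typescript
// Hypothetical names; only the durations are taken from the proposal above.
type ConnectionStatus = 'healthy' | 'unhealthy';

const SECOND = 1000;
const MINUTE = 60 * SECOND;

// Proposed backend TTLs: failures expire quickly, successes stay at 5 minutes.
const BACKEND_TTL_MS: Record<ConnectionStatus, number> = {
	healthy: 5 * MINUTE,
	unhealthy: 1 * MINUTE,
};

// Proposed frontend TTLs: ~15s for failures, 5 minutes for successes.
const FRONTEND_TTL_MS: Record<ConnectionStatus, number> = {
	healthy: 5 * MINUTE,
	unhealthy: 15 * SECOND,
};

// Worst case for a stale error: the frontend caches a backend result
// that was itself about to expire, so the two TTLs are summed.
function worstCaseStaleErrorMs(backendTtlMs: number, frontendTtlMs: number): number {
	return backendTtlMs + frontendTtlMs;
}

// Current behavior: 5-minute backend cache + 2-minute frontend cache.
const currentWorstCaseMs = worstCaseStaleErrorMs(5 * MINUTE, 2 * MINUTE);

// Proposed behavior: 1-minute backend cache + 15-second frontend cache.
const proposedWorstCaseMs = worstCaseStaleErrorMs(
	BACKEND_TTL_MS.unhealthy,
	FRONTEND_TTL_MS.unhealthy
);
```

Under these assumptions the stale-error window drops from 7 minutes to 1 minute and 15 seconds, while successful checks keep the longer cache.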

paulopmt1 · 2024-05-24T00:24:58Z

Nice, after this HE interaction (p1716493834606609/1716465093.319339-slack-CB0B2G43X), I found a way to simulate the bug:

New monthly explorer site using credits
Buy a creator plan without migrating to Atomic
Activate theme StarAce
Attach a domain to the site using the “Attach to an existing site” (we can release an existing domain first. There is no need to buy a new one)
Go to the /hosting page and enable the hosting service (initiate Atomic migration)
Run window.postMessage( [ { message: 'site is inaccessible' }, 500 ] ) in the client browser, which will trigger a /jetpack-connection-health check, and that will fail since we don’t have a jetpack_connection_active_plugins for that site

Next steps:

Understand why we don't always store a jetpack_connection_active_plugins blog_option during an Atomic transfer.
Understand what triggers the /jetpack-connection-health in the first place (once it's triggered, its banner will continue to trigger it). This will explain why the bug doesn't always show up.

paulopmt1 · 2024-05-24T14:14:45Z

Found the minimum steps to reproduce the bug:

Create a new site and buy a creator plan without migrating to Atomic using credits
Open the pages section (this is important)
Without being proxied, go to the /hosting-config/{domain} page and enable the hosting service (initiate Atomic migration)
You'll see a jetpack connection error when accessing /home:

Screen.Recording.2024-05-24.at.11.10.32.mov

paulopmt1 · 2024-05-25T17:11:52Z

Why does this flow trigger the Jetpack connection validation?

The "Pages" menu isn't the only one that triggers a failure. In fact, any page that tries to load fetchModuleList will trigger it, since that call fails and then calls setJetpackConnectionMaybeUnhealthy. The bug is not that common because few places in Calypso call it.

Why does the fetchModuleList call fail? Because it doesn't support simple sites (which is the state of our site in that flow) and will always return rest_no_route in those cases, triggering the setJetpackConnectionMaybeUnhealthy check.

Solution for this trigger: we could check whether the current site is_atomic before calling that jetpack-blogs API. That wouldn't fix the root cause of the issue, but it would avoid one important trigger of it.
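A rough sketch of that guard, assuming a simplified site shape. The `Site` interface, the `ModuleFetcher` type, and `fetchModuleListIfAtomic` are hypothetical and only illustrate the idea; the real Calypso selectors and thunks look different.

```typescript
// Hypothetical, simplified site shape; not the real Calypso site object.
interface Site {
	ID: number;
	is_atomic: boolean;
}

// Hypothetical stand-in for the fetchModuleList request.
type ModuleFetcher = (siteId: number) => Promise<string[]>;

// Only call the module list endpoint for Atomic sites. Simple sites would get
// rest_no_route, which is what flags the connection as maybe unhealthy.
function fetchModuleListIfAtomic(
	site: Site,
	fetchModuleList: ModuleFetcher
): Promise<string[]> | null {
	if (!site.is_atomic) {
		// Skip the request entirely, so no failure is ever recorded.
		return null;
	}
	return fetchModuleList(site.ID);
}
```

With a guard like this, a site that is still simple never reaches the endpoint, so the failed call can't start the connection health check.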

The root cause question

Why don't we always have a jetpack_connection_active_plugins blog_option for new Atomic sites on the Dotcom database?

We introduced the jetpack_connection_active_plugins check inside the has_missing_plugin here: D119629-code

We update this option in a couple of places and have a dedicated endpoint that does that: fbhepr%2Skers%2Sjcpbz%2Sjc%2Qpbagrag%2Serfg%2Qncv%2Qcyhtvaf%2Sraqcbvagf%2Swrgcnpx%2Qnpgvir%2Qpbaarpgrq%2Qcyhtvaf.cuc%3Se%3Qqo5rp541%2310-og

Ask more about it here: p1716657490806279-slack-CBG1CP4EN

jeherve · 2024-05-27T07:29:25Z

it doesn't support simple sites (which is the state of our site on that flow)

Isn't the site an Atomic site by then? Since you triggered the transfer from the hosting page, the primary URL was changed to a *.wpcomstaging.com one; at that point I would consider the site to be a WoA site. Am I missing something?

Why don't we always have a jetpack_connection_active_plugins blog_option for new Atomic sites on the Dotcom database?

cc'ing @Automattic/jetpack-vulcan on this, so they can look at the flow when this is triggered and the option populated.

fgiannar · 2024-05-27T07:54:37Z

Why don't we always have a jetpack_connection_active_plugins blog_option for new Atomic sites on the Dotcom database?

We recently changed the trigger for updating the jetpack_connection_active_plugins to rely on plugin updates instead of checking it on every request.

Full Context: p9o2xV-46w-p2#comment-9261

Since there's no update_plugin hook fired when we activate Jetpack on WoA sites, we should make sure to populate this option during the AT site creation process.

Please give us a ping if you need further assistance/clarifications/reviews related to the above!

github-actions · 2024-05-27T07:55:59Z

Support References

This comment is automatically generated. Please do not edit it.

p9o2xV-46w-p2#comment-9261

paulopmt1 · 2024-05-27T17:36:27Z

Isn't the site an Atomic site by then?

Not yet. The user navigates the site before it goes Atomic, so at that moment (second 20 of the video) the navigation is still on the simple site.

we should make sure to populate this option during the AT site creation process.

I see, so this is the change we made to its behavior.
Here's the code we worked on to fix the issue (we can only test it on prod, so lots of diffs for a simple fix):

First POC: D150074-code
Tested another approach adding the code in another async job, but it didn't work as expected: D150079-code
Found the ideal place for the code: D150091-code

We learned that we need to set jetpack_connection_active_plugins right after we finish the transfer, because once the transfer_status is complete, Calypso will fire a /jetpack-connection-health verification, and that may occur after other async jobs like woa_jetpack_sync finish: D150079-code.

In this diff we're releasing the feature to all new Atomic sites: D150120-code

Here's the code in action:

Screen.Recording.2024-05-27.at.11.38.43.mov

paulopmt1 · 2024-05-28T19:17:38Z

We deployed the fix for this problem. So, new sites won't be affected anymore.
We still need to fix the sites created between May 1 and May 28 and are defining how to do it here: pet6gk-19m-p2

ebinnion · 2024-06-05T17:50:08Z

Reading through the p2 post:

We reduced the Jetpack error message cache from 7 minutes to ~1 minute. That means that if a connection is restored, users will only see the error for a maximum of 1 minute and 15 seconds now.

Did we also consider just clearing the error message cache once an HE fixes the connection via the Jetpack Debugger? Preferably, when an HE fixes the connection issue, they should then immediately be able to see the connection error message resolve.
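For illustration only, this is roughly what I mean by clearing the cache on fix. It is a sketch under assumptions: `ConnectionHealthCache` and its methods are hypothetical names, and I don't know how the real backend cache is keyed or stored.

```typescript
// Hypothetical in-memory cache keyed by site ID; the real cache may differ.
interface HealthEntry {
	isHealthy: boolean;
	expiresAt: number;
}

class ConnectionHealthCache {
	private entries = new Map<number, HealthEntry>();

	// Store a health result that stays valid until now + ttlMs.
	set(siteId: number, isHealthy: boolean, ttlMs: number, now: number): void {
		this.entries.set(siteId, { isHealthy, expiresAt: now + ttlMs });
	}

	// Return the cached result, or undefined when it is missing or expired.
	get(siteId: number, now: number): boolean | undefined {
		const entry = this.entries.get(siteId);
		if (!entry || now >= entry.expiresAt) {
			return undefined;
		}
		return entry.isHealthy;
	}

	// Drop the entry right away, e.g. after a fix in the Jetpack Debugger.
	invalidate(siteId: number): void {
		this.entries.delete(siteId);
	}
}
```

The idea is that the fix action would call something like `invalidate( siteId )`, so the next health check gets a fresh result instead of waiting for the TTL to expire.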

ebinnion added [Type] Bug Needs triage Ticket needs to be triaged labels May 15, 2024

github-actions bot added the [Pri] Low label May 15, 2024

autumnfjeld assigned paulopmt1 May 17, 2024

mrfoxtalbot removed the Needs triage Ticket needs to be triaged label May 17, 2024

paulopmt1 mentioned this issue May 23, 2024

Reduced Jetpack connection short check from 2 minutes to 15 seconds #91049

Merged

github-actions bot added the Customer Report Issues or PRs that were reported via Happiness. Previously known as "Happiness Request". label May 27, 2024

paulopmt1 mentioned this issue May 27, 2024

Restricting fetchModuleList calls to Atomic sites only #91170

Merged
