Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Notification: Broken Jetpack connection banner doesn't dismiss #90758

Open
ebinnion opened this issue May 15, 2024 · 16 comments
Open

Notification: Broken Jetpack connection banner doesn't dismiss #90758

ebinnion opened this issue May 15, 2024 · 16 comments
Assignees
Labels
Customer Report Issues or PRs that were reported via Happiness. Previously known as "Happiness Request". [Pri] Low [Type] Bug

Comments

@ebinnion
Copy link
Contributor

ebinnion commented May 15, 2024

Quick summary

See p1715718239728809-slack-C029GN3KD

The summary of that thread is that some set of users see a notification that describes the Jetpack connection as being broken because the plugin is deactivated. But, even after fixing the connection, the notice doesn't dismiss.

CleanShot 2024-05-15 at 09 39 17@2x

In working with an HE on this issue, we fixed the issue by clearing IndexedDB and localStorage from the application tab of Chrome, after we noticed that the issue only showed for the HE in Chrome and not in Safari.

In talking to @supernovia, she suggests that somewhere around 2-3% of her interactions are this and that they ask users to log out and back in.

Steps to reproduce

Based on the reports, I would imagine that this is due to an intermittent connection issue that then persists. Based on that, I'm not sure what the repro steps are. This is how I would start though.

  1. Break the Jetpack connection by going to /_cli for an atomic site and remove the blog or user connection secret
  2. Load WordPress.com
  3. Verify the notice shows
  4. Fix the connection in Jetpack debug
  5. Verify the notice still shows

What you expected to happen

The notice to disappear after the connection is broken.

What actually happened

The notice persists and requires HE intervention and the user logging out.

Impact

Some (< 50%)

Available workarounds?

Yes, easy to implement

Platform (Simple and/or Atomic)

No response

Logs or notes

No response

@ebinnion ebinnion added [Type] Bug Needs triage Ticket needs to be triaged labels May 15, 2024
@zaguiini
Copy link
Contributor

Maybe it's not a false positive. See: https://github.com/Automattic/dotcom-forge/issues/7234#issuecomment-2116409760

@supernovia
Copy link
Contributor

supernovia commented May 17, 2024

@zaguiini

Maybe it's not a false positive. See: Automattic/dotcom-forge#7234 (comment)

We definitely run into actual broken connections to fix, too, and it would be good to address the issues behind those.

But there are also quite a few cases where the user will see connection errors even when a debug reveals the connection is fine. If the user tries with another browser or an incognito window with these cases, it works, so in these cases it's something stuck in the browser itself. Logging out and back in seems to fix it, but I'm not sure folks would think to try that in their troubleshooting steps, so the error can be frustrating for them.

@mrfoxtalbot
Copy link

Related? #79324

@mrfoxtalbot mrfoxtalbot removed the Needs triage Ticket needs to be triaged label May 17, 2024
@paulopmt1
Copy link
Contributor

We're seeing a similar case on A4A sites, but the root cause shouldn't be the same. Maybe the reason why it doesn't clear the error message is the same bug.

@supernovia
Copy link
Contributor

Thanks @paulopmt1 - we just ran into that today in our WooCommerce tinkering at a meetup; same situation! And yes the error is stuck. That must be what's causing us to get new users with this error when everything seems fine.

@paulopmt1
Copy link
Contributor

Thanks for sharing one more example of that @supernovia.

when everything seems fine

I also noticed that everything was working as expected.
The responsible for keeping the error visible seems to be a cache issue after a broken connection. We still don't know why the connection breaks in the first place, though.

@paulopmt1
Copy link
Contributor

Until now, Luis and I have been working on fixing the A4A site creation issue, which is similar to this issue but it's not the same.

Investigating this issue further, I found two interesting things:

  • We have a 2-minute cache on Calypso when we find a Jetpack error.
  • We also have a 5-minute cache in the Backend: fbhepr%2Skers%2Sjcpbz%2Sjc%2Qpbagrag%2Serfg%2Qncv%2Qcyhtvaf%2Sraqcbvagf%2Swrgcnpx%2Qpbaarpgvba%2Qurnygu.cuc%3Se%3Q6s5866nq%26zb%3Q4091%26sv%3Q168%23202-og

Note: Our current cache clear CTA on the /hosting page won't clear that backend cache, so it will always be 5 minutes.

So even when we (or the Jetpack by itself) fix the connection, it can take up to 7 minutes for the user to notice that which is not ideal and can lead to the issues we see here.

We could change it to 1 minute if we update our backend to only cache a failed Jetpack connection for 1 minute and a success Jetpack connection for 5 minutes (as it is currently). In that case, our frontend would have a ~15s cache only for failed Jetpack connections (so multiple requests would benefit from that) and 5 minutes for successful connections (as it's currently).

@paulopmt1
Copy link
Contributor

paulopmt1 commented May 24, 2024

Nice, after this HE interaction (p1716493834606609/1716465093.319339-slack-CB0B2G43X), I found a way to simulate the bug:

  • New monthly explorer site using credits
  • Buy a creator plan without migrating to Atomic
  • Activate theme StarAce
  • Attach a domain to the site using the “Attach to an existing site” (we can release an existing domain first. There is no need to buy a new one)
  • Go to the /hosting page and enable the hosting service (initiate Atomic migration)
  • Run window.postMessage( [ { message: 'site is inaccessible' }, 500 ] ) in the client browser, which will trigger a /jetpack-connection-health check, and that will fail since we don’t have a jetpack_connection_active_plugins for that site

Next step:

  • Understand why we don't always store a jetpack_connection_active_plugins blog_option during an Atomic transfer.
  • Understand what triggers the /jetpack-connection-health in the first place (once it's triggered, its banner will continue to trigger it). This will explain why the bug doesn't always show up.

@paulopmt1
Copy link
Contributor

Found the minimum steps to reproduce the bug:

  • Create a new site and buy a creator plan without migrating to Atomic using credits
  • Open the pages section (this is important)
  • Without being proxied, go to the /hosting-config/{domain} page and enable the hosting service (initiate Atomic migration)
  • You'll see a jetpack connection error when accessing /home:
Screen.Recording.2024-05-24.at.11.10.32.mov

@paulopmt1
Copy link
Contributor

paulopmt1 commented May 25, 2024

Why does this flow trigger the Jetpack connection validation?

The "Pages" menu isn't the only one that triggers a failure. In fact, any page that tries to load fetchModuleList will trigger it since that call fails and calls the setJetpackConnectionMaybeUnhealthy. This bug is not so usual because fewer places in the Calypso call it.

Why the fetchModuleList call fails? Because it doesn't support simple sites (which is the state of our site on that flow) and will always return rest_no_route for those cases, triggering the setJetpackConnectionMaybeUnhealthy check.

image

Solution for this trigger: We could validate if the current site is_atomic before calling that jetpack-blogs API. We wouldn't fix the root cause of the issue but will avoid one important trigger of it.

The root cause question

Why don't we always have a jetpack_connection_active_plugins blog_option for new Atomic sites on the Dotcom database?

We introduced the jetpack_connection_active_plugins check inside the has_missing_plugin here: D119629-code

We update this option in a couple of places and have a dedicated endpoint that does that: fbhepr%2Skers%2Sjcpbz%2Sjc%2Qpbagrag%2Serfg%2Qncv%2Qcyhtvaf%2Sraqcbvagf%2Swrgcnpx%2Qnpgvir%2Qpbaarpgrq%2Qcyhtvaf.cuc%3Se%3Qqo5rp541%2310-og

Ask more about it here: p1716657490806279-slack-CBG1CP4EN

@jeherve
Copy link
Member

jeherve commented May 27, 2024

it doesn't support simple sites (which is the state of our site on that flow)

Isn't the site an Atomic site by then? Since you triggered the transfer from the hosting page, the primary URL was changed to a *.wpcomstaging.com one; at that point I would consider the site to be a WoA site. Am I missing something?

Why don't we always have a jetpack_connection_active_plugins blog_option for new Atomic sites on the Dotcom database?

cc'ing @Automattic/jetpack-vulcan on this, so they can look at the flow when this is triggered and the option populated.

@fgiannar
Copy link

Why don't we always have a jetpack_connection_active_plugins blog_option for new Atomic sites on the Dotcom database?

We recently changed the trigger for updating the jetpack_connection_active_plugins to rely on plugin updates instead of checking it on every request.

Full Context: p9o2xV-46w-p2#comment-9261

Since there's no update_plugin hook fired when we activate Jetpack on WoA sites, we should make sure to populate this option during the AT site creation process.

Please give us a ping if you need further assistance/clarifications/reviews related to the above!

Copy link

Support References

This comment is automatically generated. Please do not edit it.

  • p9o2xV-46w-p2#comment-9261

@github-actions github-actions bot added the Customer Report Issues or PRs that were reported via Happiness. Previously known as "Happiness Request". label May 27, 2024
@paulopmt1
Copy link
Contributor

paulopmt1 commented May 27, 2024

Isn't the site an Atomic site by then?

Not yet, since the user navigates on it before going Atomic, at that moment (second 20 of the video), the navigation is in the simple site.

we should make sure to populate this option during the AT site creation process.

I see, so this is the change we did on its behavior.
Here's the code we worked on to fix the issue (we can only test it on prod, so lots of diffs for a simple fix):

  • First POC: D150074-code
  • Tested another approach adding the code in another async job, but it didn't work as expected: D150079-code
  • Found the ideal place for the code: D150091-code

Learn that we need to set jetpack_connection_active_plugins right after we finish the transfer because once the transfer_status is complete, Calypso will fire a /jetpack-connection-healt verification and that may occur after other async jobs like woa_jetpack_sync finishes: D150079-code.

In this diff we're releasing the feature to all new Atomic sites: D150120-code

Here's the code in action:

Screen.Recording.2024-05-27.at.11.38.43.mov

@paulopmt1
Copy link
Contributor

We deployed the fix for this problem. So, new sites won't be affected anymore.
We still need to fix the sites created between May 1 and May 28 and are defining how to do it here: pet6gk-19m-p2

@ebinnion
Copy link
Contributor Author

ebinnion commented Jun 5, 2024

Reading through the p2 post:

We reduced the Jetpack error message cache from 7 minutes to ~1 minute. That means that if a connection is restored, users will only see the error for a maximum of 1 minute and 15 seconds now.

Did we also consider just clearing the error message cache once an HE fixes the connection via the Jetpack Debugger? Preferably, when an HE fixes the connection issue, they should then immediately be able to see the connection error message resolve.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Customer Report Issues or PRs that were reported via Happiness. Previously known as "Happiness Request". [Pri] Low [Type] Bug
Projects
Development

No branches or pull requests

7 participants