
feat: unpeering force #3054


Open · wants to merge 12 commits into master

Conversation

giuliafalaschi

Description

Two flags have been added to perform a forced unpeering:
--force, a Boolean flag that forces the unpeering even if the provider cluster is unreachable, and --remote-cluster-id, a flag that must be provided together with --force and must contain the cluster ID of the provider cluster.
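
For reference, a minimal sketch of how the two flags could be wired with cobra; the flag names match the PR, but the command and option struct below are illustrative placeholders, not the actual liqoctl code:

```go
// Illustrative sketch only: the flag names match the PR, but the command
// and option struct are placeholders, not the actual liqoctl code.
package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

type unpeerOptions struct {
	Force           bool   // proceed even if the provider cluster is unreachable
	RemoteClusterID string // provider cluster ID, required together with --force
}

func newUnpeerCommand() *cobra.Command {
	opts := &unpeerOptions{}
	cmd := &cobra.Command{
		Use:   "unpeer",
		Short: "Disable a peering with a provider cluster",
		RunE: func(_ *cobra.Command, _ []string) error {
			if opts.Force && opts.RemoteClusterID == "" {
				return fmt.Errorf("--remote-cluster-id is required when --force is set")
			}
			// ... perform the (possibly forced) unpeering ...
			return nil
		},
	}
	cmd.Flags().BoolVar(&opts.Force, "force", false,
		"Force the unpeering even if the provider cluster is unreachable")
	cmd.Flags().StringVar(&opts.RemoteClusterID, "remote-cluster-id", "",
		"Cluster ID of the provider cluster, used together with --force")
	return cmd
}

func main() {
	if err := newUnpeerCommand().Execute(); err != nil {
		os.Exit(1)
	}
}
```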

Fixes #3051

How Has This Been Tested?

I created two local clusters using kind, installed Liqo on them, and established a peering relationship between the two. To simulate a catastrophic scenario, I modified the IP address of the provider cluster to make it unreachable. Finally, I executed the unpeer command with the options I implemented, specifically:
liqoctl unpeer --force --remote-cluster-id <cluster-id>
In this way, the unpeering was successfully performed on the consumer cluster, even though the provider was no longer reachable.

@adamjensenbot
Collaborator

Hi @giuliafalaschi. Thanks for your PR!

I am @adamjensenbot.
You can interact with me by issuing a slash command in the first line of a comment.
Currently, I understand the following commands:

  • /rebase: Rebase this PR onto the master branch (You can add the option test=true to launch the tests
    when the rebase operation is completed)
  • /merge: Merge this PR into the master branch
  • /build Build Liqo components
  • /test Launch the E2E and Unit tests
  • /hold, /unhold Add/remove the hold label to prevent merging with /merge

Make sure this PR appears in the liqo changelog by adding one of the following labels:

  • feat: 🚀 New Feature
  • fix: 🐛 Bug Fix
  • refactor: 🧹 Code Refactoring
  • docs: 📝 Documentation
  • style: 💄 Code Style
  • perf: 🐎 Performance Improvement
  • test: ✅ Tests
  • chore: 🚚 Dependencies Management
  • build: 📦 Builds Management
  • ci: 👷 CI/CD
  • revert: ⏪ Reverts Previous Changes

@github-actions github-actions bot added the feat (Adds a new feature to the codebase) label on May 27, 2025
@claudiolor
Contributor

Hi @giuliafalaschi, thanks for this PR. Is this ready for review?

@giuliafalaschi
Author

> Hi @giuliafalaschi, thanks for this PR. Is this ready for review?

Yes, I tested everything on a local cluster.

@claudiolor
Contributor

Hi Giulia 😊

Thank you so much for your contribution and the effort you put into this PR, it's really appreciated! 🙏
Apologies for the delay in getting back to you; I got caught up with other tasks, but I’ve now had the chance to review and test your changes.

That said, I noticed some issues with how the forced unpeering is handled. Specifically, there are challenges that arise when the remote cluster becomes unreachable which are not addressed in your implementation. The two main components that cause problems in such scenarios are:

  • CRDReplicator: its role is to reflect Liqo CRs (e.g., ResourceSlice) on the remote cluster, and it attempts to clean them up when they are deleted. If the remote cluster is no longer reachable, the finalizers are never removed, preventing the resources from being deleted (see the sketch after this list).

  • VirtualNode: when a ResourceSlice is deleted, the associated virtual node is also removed, triggering the VirtualNode controller, which starts evicting the pods scheduled on the virtual node. The Virtual Kubelet's reflector takes charge of evicting those pods, but if the remote cluster is unreachable, the eviction gets stuck.
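
To make the finalizer issue concrete, here is a deliberately simplified sketch of the pattern that gets stuck (this is not the real CRDReplicator code; names and types are placeholders): the finalizer is removed only after the remote cleanup succeeds, so an unreachable provider blocks the local deletion indefinitely.

```go
// Simplified illustration of the stuck deletion, not the actual
// CRDReplicator reconciler: names and types are placeholders.
package replication

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

const replicationFinalizer = "example.liqo.io/replication" // placeholder name

type replicatorReconciler struct {
	localClient  client.Client // consumer cluster
	remoteClient client.Client // provider cluster
}

func (r *replicatorReconciler) reconcileDeletion(ctx context.Context, obj client.Object) error {
	if !controllerutil.ContainsFinalizer(obj, replicationFinalizer) {
		return nil
	}
	// Delete the replicated copy on the provider cluster first
	// (same name/namespace, simplified here).
	if err := r.remoteClient.Delete(ctx, obj); err != nil && !apierrors.IsNotFound(err) {
		// If the provider API server is unreachable this fails on every
		// retry, the finalizer below is never removed, and the local
		// resource (e.g. a ResourceSlice) stays in Terminating forever.
		return err
	}
	controllerutil.RemoveFinalizer(obj, replicationFinalizer)
	return r.localClient.Update(ctx, obj)
}
```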

Because of this, if you peer two kind clusters, delete one of them, and then run the forced unpeer, with your version of the code you'll notice that it hangs on the ResourceSlice deletion.

To properly handle these situations, we need to introduce a way to signal that the remote cluster is no longer expected to respond. That way, components like the CRDReplicator or the Virtual Kubelet can stop trying to reach it endlessly. Here are a couple of ideas I could come up with:

  1. Use the controller manager’s API server checker to flag a ForeignCluster as "dead" after a number of failed health checks. However, this might introduce issues in case of temporary downtime: if the cluster is considered "dead", the Virtual Kubelet will mark the pods scheduled on the virtual node as terminated, but if the connection comes back up, the pods running remotely will never be stopped.

  2. Explicitly mark a cluster as unreachable in the ForeignCluster spec (for example, when forcing the unpeer and the cluster doesn’t respond), as sketched below. The components listed above can then skip trying to delete remote resources, cleanly removing the peering on the consumer cluster.
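
Just to sketch option 2 (the marker used here is purely hypothetical; the real mechanism — a spec field, a condition, an annotation — is still to be decided, and the ForeignCluster GVK below is written from memory), liqoctl could mark the ForeignCluster when it falls back to the forced unpeer, and controllers like the CRDReplicator would check that marker before attempting any remote cleanup:

```go
// Hypothetical sketch of option 2: the annotation key and the ForeignCluster
// GVK are illustrative only; check the installed CRDs before relying on them.
package forcedunpeer

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const unreachableAnnotation = "liqo.io/force-unpeer-unreachable" // hypothetical marker

// markClusterUnreachable flags the ForeignCluster so that controllers can
// skip any operation that requires contacting the provider cluster.
func markClusterUnreachable(ctx context.Context, cl client.Client, fcName string) error {
	fc := &unstructured.Unstructured{}
	fc.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "core.liqo.io", Version: "v1beta1", Kind: "ForeignCluster",
	})
	if err := cl.Get(ctx, types.NamespacedName{Name: fcName}, fc); err != nil {
		return err
	}
	ann := fc.GetAnnotations()
	if ann == nil {
		ann = map[string]string{}
	}
	ann[unreachableAnnotation] = "true"
	fc.SetAnnotations(ann)
	return cl.Update(ctx, fc)
}
```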

Once we handle this properly in the Liqo core, we can refine the behavior of the liqoctl unpeer command. I saw you modified the flow a bit, but I would suggest keeping the original flow and falling back to the forced unpeer only if the standard one fails. For example, when retrieving the cluster ID of the provider during the unpeering, we can attempt it with a short timeout; if it fails and force is enabled, we fall back to the user-supplied cluster ID and perform the operations required to force the unpeer (for example, following the second option above, marking the ForeignCluster as "dead").
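
In pseudo-code the flow could look roughly like this (everything below is a placeholder, not an existing liqoctl helper; it is only meant to convey the order of the two attempts):

```go
// Placeholder sketch of the suggested fallback flow, not actual liqoctl code.
package forcedunpeer

import (
	"context"
	"fmt"
	"time"
)

// resolveProviderClusterID first tries to read the cluster ID from the
// provider with a short timeout; only if that fails and --force was given
// does it fall back to the ID supplied by the user.
func resolveProviderClusterID(ctx context.Context, force bool, fallbackID string,
	fetchFromProvider func(context.Context) (string, error)) (string, error) {
	shortCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	id, err := fetchFromProvider(shortCtx)
	if err == nil {
		return id, nil
	}
	if force && fallbackID != "" {
		// Provider unreachable: use the user-supplied cluster ID and mark
		// the ForeignCluster as "dead" (see the option above) before
		// cleaning up the consumer side.
		return fallbackID, nil
	}
	return "", fmt.Errorf("unable to retrieve the provider cluster ID: %w", err)
}
```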
Take care to keep the utility functions as "agnostic" as possible, without introducing the concept of "force" into them. For example, you could add a flag to skip remote operations, or give callers the possibility to manually provide the data the functions would normally fetch from the remote cluster via the API server.
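
As a sketch of what that could look like (names are hypothetical, not existing liqoctl types):

```go
// Hypothetical options for the shared helpers: they stay agnostic of the
// "force" concept and only know whether to skip remote operations and
// which data the caller has already provided.
type unpeerHelperOptions struct {
	SkipRemoteOperations bool   // do not contact the provider cluster at all
	ProvidedClusterID    string // value normally fetched from the provider API server
}
```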

Lastly, regarding the CLI flags: instead of having both --force and --remote-cluster-id (where the latter only makes sense together with --force), it might be cleaner to have a single flag like --force-with-cluster-id that directly provides the fallback cluster ID.

Thanks again for your work! Let me know your thoughts on this, and feel free to ask if you have any questions 😊

Labels: feat (Adds a new feature to the codebase), size/L
Development
Successfully merging this pull request may close the following issue: Force unpeering
3 participants