clustermesh: fix rare panic due to race condition on stop #32513

Merged (1 commit) on May 16, 2024

Conversation

giorio94
Member

The clustermesh logic is currently affected by a rare race condition, which can occur if the cluster configuration is being retrieved while the connection to the remote cluster is stopped. Stopping the connection removes two controllers: the one handling the connection to the remote cluster and the one responsible for retrieving the cluster config. As a result, the getRemoteCluster function may return before the second controller has terminated, which in turn leads to a panic due to a send on a closed channel. Let's fix this issue by explicitly removing only the first controller and letting the other terminate normally once the parent context has been cancelled. This ensures that the controller has always terminated before the cfgch channel is closed.

Fixes: #32179

Fix rare race condition afflicting clustermesh when disconnecting from a remote cluster, possibly causing the agent to panic

Fixes: cilium#32179
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. area/clustermesh Relates to multi-cluster routing functionality in Cilium. needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels May 13, 2024
@giorio94 giorio94 requested a review from a team as a code owner May 13, 2024 14:06
@giorio94
Member Author

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 16, 2024
@julianwiedmann julianwiedmann added this pull request to the merge queue May 16, 2024
Merged via the queue into cilium:main with commit 104a302 May 16, 2024
66 checks passed
@YutaroHayakawa YutaroHayakawa mentioned this pull request May 23, 2024
15 tasks
@YutaroHayakawa YutaroHayakawa added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels May 23, 2024
@YutaroHayakawa YutaroHayakawa mentioned this pull request May 24, 2024
12 tasks
@YutaroHayakawa YutaroHayakawa added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels May 24, 2024
@github-actions github-actions bot added backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels May 25, 2024
Labels
area/clustermesh Relates to multi-cluster routing functionality in Cilium. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Development

Successfully merging this pull request may close these issues.

CI: kvstoremesh unit test ClusterMeshServicesTestSuite/TestRemoteServiceObserver panic
3 participants