
CI: kvstoremesh unit test ClusterMeshServicesTestSuite/TestRemoteServiceObserver panic #32179

Closed
squeed opened this issue on Apr 25, 2024 · 0 comments · Fixed by #32513

Labels
area/CI (Continuous Integration testing issue or flake) · area/clustermesh (Relates to multi-cluster routing functionality in Cilium.) · ci/flake (This is a known failure that occurs in the tree. Please investigate me!)

Comments

squeed commented Apr 25, 2024

CI failure

In an unrelated PR, I saw a panic in a kvstoremesh privileged unit test:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓                                           
┃   PANIC  package: github.com/cilium/cilium/pkg/clustermesh • Test/ClusterMeshServicesTestSuite/TestRemoteServiceObserver   ┃                                           
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛                                           
panic: send on closed channel                                                                                                                                            
                                                                                                                                                                         
goroutine 1132 [running]:                                                                                                                                                
github.com/cilium/cilium/pkg/clustermesh/common.(*remoteCluster).getClusterConfig.func1({0x2931a90, 0x400101a5f0})                                                       
	/home/runner/work/cilium/cilium/pkg/clustermesh/common/remote_cluster.go:293 +0xe0                                                                                       
github.com/cilium/cilium/pkg/controller.(*controller).runController(0x400074f2c0, {{{0x25638fb, 0x1a}}, {0x0, 0x0}, 0x4001453230, 0x0, 0x2690a18, 0x0, 0x6fc23ac00, ...})
	/home/runner/work/cilium/cilium/pkg/controller/controller.go:254 +0x108                                                                                                  
created by github.com/cilium/cilium/pkg/controller.(*Manager).createControllerLocked in goroutine 1065                                                                   
	/home/runner/work/cilium/cilium/pkg/controller/manager.go:111 +0x3d0                                                                                                     
FAIL	github.com/cilium/cilium/pkg/clustermesh	0.935s                                                                                                                       
                                                                                                                                                                         
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃   FAIL  package: github.com/cilium/cilium/pkg/clustermesh/kvstoremesh   ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

--- FAIL: TestRemoteClusterRun (120.00s)

time="2024-04-24T17:34:09Z" level=info msg="Creating etcd client" ConfigPath= KeepAliveHeartbeat=15s KeepAliveTimeout=25s ListLimit=256 MaxInflight=100 RateLimit=100 subsys=kvstore
time="2024-04-24T17:34:09Z" level=info msg="Connecting to etcd server..." config= endpoints="[http://127.0.0.1:4002/]" subsys=kvstore
time="2024-04-24T17:34:09Z" level=info msg="New lease successfully acquired" LeaseID=5326789763116035469 TTL=25s config= endpoints="[http://127.0.0.1:4002/]" subsys=kvstore
time="2024-04-24T17:34:09Z" level=info msg="Got lock lease ID 49ec8f1127b7ed8d" config= endpoints="[http://127.0.0.1:4002/]" subsys=kvstore
time="2024-04-24T17:34:09Z" level=info msg="Initial etcd session established" config= endpoints="[http://127.0.0.1:4002/]" subsys=kvstore
time="2024-04-24T17:34:09Z" level=info msg="New lease successfully acquired" LeaseID=5326789763116035478 TTL=15m0s config= endpoints="[http://127.0.0.1:4002/]" subsys=kvstore
    dummy.go:84: Timed out waiting to acquire the kvstore lock
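
For the record, Go raises "panic: send on closed channel" whenever any goroutine sends on a channel that has already been closed, buffered or not. A minimal, self-contained sketch of the failure mode (illustrative only, not Cilium code; getConfig and the timings are made up, and cfgch is simply named after the channel mentioned in the fix below) is:

package main

import (
	"context"
	"time"
)

// getConfig mimics the problematic shape: a helper goroutine keeps
// trying to deliver a result on cfgch, while the caller may give up
// (e.g. the connection to the remote cluster is stopped), return, and
// close the channel while a send is still pending.
func getConfig(ctx context.Context) {
	cfgch := make(chan struct{})
	defer close(cfgch) // closed as soon as getConfig returns

	go func() {
		time.Sleep(10 * time.Millisecond) // stand-in for the retrying controller
		cfgch <- struct{}{}               // panics if getConfig has already returned
	}()

	select {
	case <-cfgch:
	case <-ctx.Done():
		return // returning runs close(cfgch) while the goroutine may still send
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond)
	defer cancel()
	getConfig(ctx)
	time.Sleep(50 * time.Millisecond) // give the helper goroutine time to hit the send
}
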
squeed added the area/CI, area/clustermesh, and ci/flake labels on Apr 25, 2024
giorio94 added a commit to giorio94/cilium that referenced this issue May 13, 2024
The clustermesh logic is currently affected by a possible, although
rare, race condition that occurs if the cluster configuration is being
retrieved while the connection to the remote cluster is stopped.
Stopping the connection stops two controllers: the one handling the
connection to the remote cluster and the one responsible for retrieving
the cluster configuration. Because of this, the getRemoteCluster
function may return before the second controller has terminated, which
in turn leads to a panic due to a send on a closed channel. Let's fix
this issue by explicitly removing only the first controller and letting
the other terminate normally once the parent context has been
cancelled, thus ensuring that the controller has always terminated
before the cfgch channel is closed.

Fixes: cilium#32179
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
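
In other words, the fix enforces an ordering invariant: cfgch may only be closed once the goroutine that sends on it is known to have terminated. A hedged, generic sketch of that invariant (again illustrative only, not Cilium's actual controller implementation; names are made up) is:

package main

import (
	"context"
	"sync"
	"time"
)

// getConfig again stands in for the config retrieval, but now the
// producer stops on its own when the parent context is cancelled, and
// the caller waits for it before closing cfgch, so a send can never
// race with the close.
func getConfig(ctx context.Context) {
	cfgch := make(chan struct{}, 1)

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		select {
		case cfgch <- struct{}{}: // deliver the result
		case <-ctx.Done(): // parent context cancelled: terminate without sending
		}
	}()

	select {
	case <-cfgch:
	case <-ctx.Done():
	}

	wg.Wait()    // the producer has terminated by now...
	close(cfgch) // ...so this close can no longer race with a send
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond)
	defer cancel()
	getConfig(ctx) // safe regardless of whether the context fires first
}

Here the producer exits on its own once the parent context is cancelled, and the caller waits for it before closing the channel, so the send and the close can no longer happen concurrently.
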
github-merge-queue bot pushed a commit that referenced this issue on May 16, 2024, carrying the same commit message (Fixes: #32179).

YutaroHayakawa pushed eight backport commits ([ upstream commit 104a302 ]) that referenced this issue between May 23 and May 25, 2024, and aanm pushed one on May 27, 2024; each repeats the commit message above.