
CI: kvstoremesh unit test ClusterMeshServicesTestSuite/TestRemoteServiceObserver panic #32179

Closed
squeed opened this issue on Apr 25, 2024 · 0 comments · Fixed by #32513

Labels
area/CI (Continuous Integration testing issue or flake) · area/clustermesh (Relates to multi-cluster routing functionality in Cilium.) · ci/flake (This is a known failure that occurs in the tree. Please investigate me!)

Comments

squeed commented Apr 25, 2024

CI failure

In an unrelated PR, I saw a panic in a kvstoremesh privileged unit test:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓                                           
┃   PANIC  package: github.com/cilium/cilium/pkg/clustermesh • Test/ClusterMeshServicesTestSuite/TestRemoteServiceObserver   ┃                                           
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛                                           
panic: send on closed channel                                                                                                                                            
                                                                                                                                                                         
goroutine 1132 [running]:                                                                                                                                                
github.com/cilium/cilium/pkg/clustermesh/common.(*remoteCluster).getClusterConfig.func1({0x2931a90, 0x400101a5f0})                                                       
	/home/runner/work/cilium/cilium/pkg/clustermesh/common/remote_cluster.go:293 +0xe0                                                                                       
github.com/cilium/cilium/pkg/controller.(*controller).runController(0x400074f2c0, {{{0x25638fb, 0x1a}}, {0x0, 0x0}, 0x4001453230, 0x0, 0x2690a18, 0x0, 0x6fc23ac00, ...})
	/home/runner/work/cilium/cilium/pkg/controller/controller.go:254 +0x108                                                                                                  
created by github.com/cilium/cilium/pkg/controller.(*Manager).createControllerLocked in goroutine 1065                                                                   
	/home/runner/work/cilium/cilium/pkg/controller/manager.go:111 +0x3d0                                                                                                     
FAIL	github.com/cilium/cilium/pkg/clustermesh	0.935s                                                                                                                       
                                                                                                                                                                         
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃   FAIL  package: github.com/cilium/cilium/pkg/clustermesh/kvstoremesh   ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

--- FAIL: TestRemoteClusterRun (120.00s)

time="2024-04-24T17:34:09Z" level=info msg="Creating etcd client" ConfigPath= KeepAliveHeartbeat=15s KeepAliveTimeout=25s ListLimit=256 MaxInflight=100 RateLimit=100 subsys=kvstore
time="2024-04-24T17:34:09Z" level=info msg="Connecting to etcd server..." config= endpoints="[http://127.0.0.1:4002/]" subsys=kvstore
time="2024-04-24T17:34:09Z" level=info msg="New lease successfully acquired" LeaseID=5326789763116035469 TTL=25s config= endpoints="[http://127.0.0.1:4002/]" subsys=kvstore
time="2024-04-24T17:34:09Z" level=info msg="Got lock lease ID 49ec8f1127b7ed8d" config= endpoints="[http://127.0.0.1:4002/]" subsys=kvstore
time="2024-04-24T17:34:09Z" level=info msg="Initial etcd session established" config= endpoints="[http://127.0.0.1:4002/]" subsys=kvstore
time="2024-04-24T17:34:09Z" level=info msg="New lease successfully acquired" LeaseID=5326789763116035478 TTL=15m0s config= endpoints="[http://127.0.0.1:4002/]" subsys=kvstore
    dummy.go:84: Timed out waiting to acquire the kvstore lock
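
For the record, Go raises "panic: send on closed channel" whenever any goroutine sends on a channel that has already been closed, buffered or not. A minimal, self-contained sketch of the failure mode (illustrative only, not Cilium code; getConfig and the timings are made up, and cfgch is simply named after the channel mentioned in the fix below) is:

package main

import (
	"context"
	"time"
)

// getConfig mimics the problematic shape: a helper goroutine keeps
// trying to deliver a result on cfgch, while the caller may give up
// (e.g. the connection to the remote cluster is stopped), return, and
// close the channel while a send is still pending.
func getConfig(ctx context.Context) {
	cfgch := make(chan struct{})
	defer close(cfgch) // closed as soon as getConfig returns

	go func() {
		time.Sleep(10 * time.Millisecond) // stand-in for the retrying controller
		cfgch <- struct{}{}               // panics if getConfig has already returned
	}()

	select {
	case <-cfgch:
	case <-ctx.Done():
		return // returning runs close(cfgch) while the goroutine may still send
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond)
	defer cancel()
	getConfig(ctx)
	time.Sleep(50 * time.Millisecond) // give the helper goroutine time to hit the send
}
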
squeed added the area/CI, area/clustermesh, and ci/flake labels on Apr 25, 2024
giorio94 added a commit to giorio94/cilium that referenced this issue May 13, 2024
The clustermesh logic is currently affected by a possible, although
rare, race condition that occurs if the cluster configuration is being
retrieved while the connection to the remote cluster is stopped.
Stopping the connection stops two controllers: the one handling the
connection to the remote cluster and the one responsible for retrieving
the cluster configuration. Because of this, the getRemoteCluster
function may return before the second controller has terminated, which
in turn leads to a panic due to a send on a closed channel. Let's fix
this issue by explicitly removing only the first controller and letting
the other terminate normally once the parent context has been
cancelled, thus ensuring that the controller has always terminated
before the cfgch channel is closed.

Fixes: cilium#32179
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
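
In other words, the fix enforces an ordering invariant: cfgch may only be closed once the goroutine that sends on it is known to have terminated. A hedged, generic sketch of that invariant (again illustrative only, not Cilium's actual controller implementation; names are made up) is:

package main

import (
	"context"
	"sync"
	"time"
)

// getConfig again stands in for the config retrieval, but now the
// producer stops on its own when the parent context is cancelled, and
// the caller waits for it before closing cfgch, so a send can never
// race with the close.
func getConfig(ctx context.Context) {
	cfgch := make(chan struct{}, 1)

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		select {
		case cfgch <- struct{}{}: // deliver the result
		case <-ctx.Done(): // parent context cancelled: terminate without sending
		}
	}()

	select {
	case <-cfgch:
	case <-ctx.Done():
	}

	wg.Wait()    // the producer has terminated by now...
	close(cfgch) // ...so this close can no longer race with a send
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond)
	defer cancel()
	getConfig(ctx) // safe regardless of whether the context fires first
}

Here the producer exits on its own once the parent context is cancelled, and the caller waits for it before closing the channel, so the send and the close can no longer happen concurrently.
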
github-merge-queue bot pushed a commit that referenced this issue on May 16, 2024, carrying the same commit message (Fixes: #32179).

YutaroHayakawa pushed eight backport commits ([ upstream commit 104a302 ]) that referenced this issue between May 23 and May 25, 2024, and aanm pushed one on May 27, 2024; each repeats the commit message above.