
Plugin fails with "Store is locked" and "network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input" #186

@w3s7y

Description


Is this a request for help?:
Issue/bug, plus a request for confirmation that a later version fixes it.

Is this an ISSUE or FEATURE REQUEST? (choose one):

ISSUE

Which release version?:

v1.0.2

Which component (CNI/IPAM/CNM/CNS):

CNI & IPAM

Which Operating System (Linux/Windows):

Linux (Ubuntu 16.04) 64bit
Linux k8s-agentpool1-62076056-0 4.15.0-1014-azure #14~16.04.1-Ubuntu SMP Thu Jun 14 15:42:55 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

We observed that pods could neither start nor terminate on one of our development clusters since moving from ACS to acs-engine:

store-weu-acc1     downloadservice-345209866-0q1rx                 0/1       ContainerCreating   0          2d
store-weu-acc1     downloadservice-345209866-3mf0h                 0/1       ContainerCreating   0          1d
store-weu-acc1     downloadservice-345209866-6g1s9                 1/1       Terminating         0          2d
store-weu-acc1     downloadservice-345209866-9lsft                 1/1       Terminating         0          2d

Looking through the logs, we found the following errors in the kubelet journal:

Jul 02 11:27:25 k8s-agentpool1-62076056-0 kubelet[95278]: 2018/07/02 11:27:25 Going to send Telemetry report to hostnetagent http://169.254.169.254/machine/plugins?comp=netagent&type=cnireport
Jul 02 11:27:25 k8s-agentpool1-62076056-0 kubelet[95278]: 2018/07/02 11:27:25 "Start Flag false Name CNI Version v1.0.2 ErrorMessage Store is locked vnet []
Jul 02 11:27:25 k8s-agentpool1-62076056-0 kubelet[95278]:                                 Context  SubContext "
Jul 02 11:27:25 k8s-agentpool1-62076056-0 kubelet[95278]: 2018/07/02 11:27:25 OrchestratorDetails &{Kubernetes v1.7.16
Jul 02 11:27:25 k8s-agentpool1-62076056-0 kubelet[95278]:  }
Jul 02 11:27:25 k8s-agentpool1-62076056-0 kubelet[95278]: 2018/07/02 11:27:25 OSDetails &{linux 16.04.4 4.15.0-1013-azure Ubuntu }
Jul 02 11:27:25 k8s-agentpool1-62076056-0 kubelet[95278]: 2018/07/02 11:27:25 SystemDetails &{6946 3192 0 29715 6375 2 }
Jul 02 11:27:25 k8s-agentpool1-62076056-0 kubelet[95278]: 2018/07/02 11:27:25 InterfaceDetails &{Primary 10.240.0.0/12 10.240.2.2 00:0d:3a:29:83:17 azure0 127 0 }
Jul 02 11:27:25 k8s-agentpool1-62076056-0 kubelet[95278]: 2018/07/02 11:27:25 BridgeDetails <nil>
Jul 02 11:27:25 k8s-agentpool1-62076056-0 kubelet[95278]: 2018/07/02 11:27:25 Send telemetry success 200
Jul 02 11:27:25 k8s-agentpool1-62076056-0 kubelet[95278]: 2018/07/02 11:27:25 SetReportState succeeded
Jul 02 11:27:25 k8s-agentpool1-62076056-0 kubelet[95278]: E0702 11:27:25.437772   95278 cni.go:312] Error deleting network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
Jul 02 11:27:25 k8s-agentpool1-62076056-0 docker[79932]: time="2018-07-02T11:27:25.438398234Z" level=error msg="Handler for POST /v1.26/containers/f3fd5317fafcb25f515b0e2052120c6eb843540d2d3d262bc5ebaa67de262040/stop returned error: Container f3fd5317fafcb25f515b0e2052120c6eb843540d2d3d262bc5ebaa67de262040 is already stopped"
Jul 02 11:27:25 k8s-agentpool1-62076056-0 kubelet[95278]: E0702 11:27:25.441303   95278 remote_runtime.go:114] StopPodSandbox "f3fd5317fafcb25f515b0e2052120c6eb843540d2d3d262bc5ebaa67de262040" from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "heapster-186967039-cqmjk_kube-system" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
Jul 02 11:27:25 k8s-agentpool1-62076056-0 kubelet[95278]: E0702 11:27:25.446164   95278 kuberuntime_gc.go:156] Failed to stop sandbox "f3fd5317fafcb25f515b0e2052120c6eb843540d2d3d262bc5ebaa67de262040" before removing: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "heapster-186967039-cqmjk_kube-system" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
Jul 02 11:27:25 k8s-agentpool1-62076056-0 kubelet[95278]: W0702 11:27:25.448217   95278 docker_sandbox.go:342] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "heapster-186967039-cqmjk_kube-system": CNI failed to retrieve network namespace path: Cannot find network namespace for the terminated container "b9baf578c7996418ebad86e062b0d515f7ce2efa4a3e0e12a624c9c410a66d73"
Jul 02 11:27:25 k8s-agentpool1-62076056-0 kubelet[95278]: W0702 11:27:25.448875   95278 cni.go:258] CNI failed to retrieve network namespace path: Cannot find network namespace for the terminated container "b9baf578c7996418ebad86e062b0d515f7ce2efa4a3e0e12a624c9c410a66d73"

And in the CNI logs (forgive the discrepancy in timestamps; the issue was also occurring at the same time as the journal errors, but this was the exact moment the lock issue became apparent):

2018/07/02 08:16:42 [cni-net] Plugin azure-vnet version v1.0.2.
2018/07/02 08:16:42 [cni-net] Running on Linux version 4.15.0-1013-azure (buildd@lcy01-amd64-006) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9)) #13~16.04.2-Ubuntu SMP Wed May 30 01:39:27 UTC 2018
2018/07/02 08:16:42 [net] Network interface: {Index:1 MTU:65536 Name:lo HardwareAddr: Flags:up|loopback} with IP addresses: [127.0.0.1/8 ::1/128]
2018/07/02 08:16:42 [net] Network interface: {Index:2 MTU:1500 Name:eth0 HardwareAddr:00:0d:3a:29:83:17 Flags:up|broadcast} with IP addresses: [fe80::20d:3aff:fe29:8317/64]
2018/07/02 08:16:42 [net] Network interface: {Index:3 MTU:1500 Name:docker0 HardwareAddr:02:42:ea:72:50:50 Flags:up|broadcast|multicast} with IP addresses: [172.17.0.1/16]
2018/07/02 08:16:42 [net] Network interface: {Index:4 MTU:1500 Name:azure0 HardwareAddr:00:0d:3a:29:83:17 Flags:up|broadcast|multicast} with IP addresses: [10.240.2.2/12 fe80::20d:3aff:fe29:8317/64]
2018/07/02 08:16:42 [net] Network interface: {Index:5 MTU:1500 Name:azvethd3b5ed9 HardwareAddr:f2:29:d3:58:12:9f Flags:up|broadcast} with IP addresses: [fe80::f029:d3ff:fe58:129f/64]
2018/07/02 08:16:42 [net] Store timestamp is 2018-06-29 22:55:32.841685859 +0000 UTC.
2018/07/02 08:16:42 [net] Restored state, &{Version:v1.0.2 TimeStamp:2018-06-29 22:55:32.848144792 +0000 UTC ExternalInterfaces:map[eth0:0xc420136000] store:0xc420016ed0 Mutex:{state:0 sema:0}}
2018/07/02 08:16:42 [cni-net] Plugin started.
2018/07/02 08:16:42 [cni-net] Processing DEL command with args {ContainerID:ee4e979f2a289c49e6bc5b5d83744763461036664405a51196394bd93812ba29 Netns: IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=sstore-weu-dev;K8S_POD_NAME=ui-3042035097-1lwcf;K8S_POD_INFRA_CONTAINER_ID=ee4e979f2a289c49e6bc5b5d83744763461036664405a51196394bd93812ba29 Path:/opt/azure-vnet/bin:/opt/cni/bin}.
2018/07/02 08:16:42 [cni-net] Read network configuration &{CNIVersion:0.3.0 Name:azure Type:azure-vnet Mode:bridge Master: Bridge:azure0 LogLevel: LogTarget: Ipam:{Type:azure-vnet-ipam Environment: AddrSpace: Subnet: Address: QueryInterval:}}.
2018/07/02 08:16:42 [azure-vnet] Failed to query endpoint: Endpoint not found.
2018/07/02 08:16:42 [cni-net] DEL command completed with err:<nil>.
2018/07/02 08:16:42 [cni-net] Plugin stopped.
2018/07/02 08:16:42 [cni-net] Plugin azure-vnet version v1.0.2.
2018/07/02 08:16:42 [cni-net] Running on Linux version 4.15.0-1013-azure (buildd@lcy01-amd64-006) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9)) #13~16.04.2-Ubuntu SMP Wed May 30 01:39:27 UTC 2018
2018/07/02 08:16:42 [net] Network interface: {Index:1 MTU:65536 Name:lo HardwareAddr: Flags:up|loopback} with IP addresses: [127.0.0.1/8 ::1/128]
2018/07/02 08:16:42 [net] Network interface: {Index:2 MTU:1500 Name:eth0 HardwareAddr:00:0d:3a:29:83:17 Flags:up|broadcast} with IP addresses: [fe80::20d:3aff:fe29:8317/64]
2018/07/02 08:16:42 [net] Network interface: {Index:3 MTU:1500 Name:docker0 HardwareAddr:02:42:ea:72:50:50 Flags:up|broadcast|multicast} with IP addresses: [172.17.0.1/16]
2018/07/02 08:16:42 [net] Network interface: {Index:4 MTU:1500 Name:azure0 HardwareAddr:00:0d:3a:29:83:17 Flags:up|broadcast|multicast} with IP addresses: [10.240.2.2/12 fe80::20d:3aff:fe29:8317/64]
2018/07/02 08:16:42 [net] Network interface: {Index:5 MTU:1500 Name:azvethd3b5ed9 HardwareAddr:f2:29:d3:58:12:9f Flags:up|broadcast} with IP addresses: [fe80::f029:d3ff:fe58:129f/64]
2018/07/02 08:16:42 [net] Store timestamp is 2018-06-29 22:55:32.841685859 +0000 UTC.
2018/07/02 08:16:42 [net] Restored state, &{Version:v1.0.2 TimeStamp:2018-06-29 22:55:32.848144792 +0000 UTC ExternalInterfaces:map[eth0:0xc420136000] store:0xc420016ed0 Mutex:{state:0 sema:0}}
2018/07/02 08:16:42 [cni-net] Plugin started.
2018/07/02 08:25:26 [cni] Timed out on locking store, err:Store is locked.
2018/07/02 08:25:26 [cni-net] Failed to initialize base plugin, err:Store is locked.
2018/07/02 08:25:26 Failed to start network plugin, err:Store is locked.
2018/07/02 08:25:26 Report plugin error
2018/07/02 08:25:29 [cni] Timed out on locking store, err:Store is locked.
2018/07/02 08:25:29 [cni-net] Failed to initialize base plugin, err:Store is locked.
2018/07/02 08:25:29 Failed to start network plugin, err:Store is locked.
2018/07/02 08:25:29 Report plugin error
2018/07/02 08:25:29 [cni] Timed out on locking store, err:Store is locked.
2018/07/02 08:25:29 [cni-net] Failed to initialize base plugin, err:Store is locked.
2018/07/02 08:25:29 Failed to start network plugin, err:Store is locked.
2018/07/02 08:25:29 Report plugin error

This error continues until we restart the node.
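For context on the failure mode: the plugin guards its on-disk state with a lock file, so if an invocation is killed while holding it (e.g. during the memory-exhaustion episodes mentioned later in this report), every subsequent ADD/DEL times out with "Store is locked" until the lock is cleared. A hedged sketch of checking for a stale lock on the node; the `/var/run/azure-vnet.json.lock` path and the PID-in-lock-file format are assumptions, not confirmed from the plugin source, so verify them against your plugin version before deleting anything:

```shell
#!/bin/sh
# Sketch (paths and lock-file format are assumptions): remove the plugin's
# lock file only if the process that took it is no longer running.

clear_stale_lock() {
  lock_file="$1"
  [ -f "$lock_file" ] || return 0
  # Assumption: the lock file contains the PID of the lock holder.
  holder_pid="$(cat "$lock_file")"
  # kill -0 checks whether the PID exists; it sends no signal.
  if ! kill -0 "$holder_pid" 2>/dev/null; then
    echo "lock holder $holder_pid is gone; removing stale lock"
    rm -f "$lock_file"
  fi
}

clear_stale_lock "/var/run/azure-vnet.json.lock"
```

A node reboot achieves the same thing, which matches our observation that restarting the node clears the error.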

Is this something an upgrade may address? I looked through the release notes for versions later than v1.0.2 but did not see any bug fixes regarding "Store is locked" or similar.

Which Orchestrator and version (e.g. Kubernetes, Docker):
Kubernetes 1.7.16 (built with acs-engine 0.16.2)
Docker 1.13.1

What happened:

CNI Failed with "err:Store is locked."

What you expected to happen:

CNI plugin to ADD & DEL interfaces as needed.

How to reproduce it (as minimally and precisely as possible):

Build a Kubernetes cluster using acs-engine 0.16.2 with Kubernetes version 1.7.
Custom build parameters: max-pods=110 and "ipAddressCount": 128 for masters and nodes.

Anything else we need to know:

The cluster has been built and running for 41 days now; however, this issue only appeared on Monday 2nd July. The cluster has also had intermittent faults with memory exhaustion and has required a cluster restart to recover.
