-
Notifications
You must be signed in to change notification settings - Fork 260
Description
Is this an ISSUE or FEATURE REQUEST? (choose one): Issue
Which release version?: 1.0.29
Which component (CNI/IPAM/CNM/CNS): CNI
Which Operating System (Linux/Windows): Windows
For windows: provide output of "$(Get-ItemProperty -Path "C:\windows\system32\hal.dll").VersionInfo.FileVersion"
Which Orchestrator and version (e.g. Kubernetes, Docker): Kubernetes 1.17
What happened:
Under a heavily loaded system - azure-vnet.exe terminated without logging any result or deleting the lock file. All azure-vnet calls since then have failed.
azure-vnet.json.lock contents are 7740
These are all the logs from the process 7740 that terminated without removing the lock.
2019/12/19 20:02:10 [7740] Connected to telemetry service
2019/12/19 20:02:10 [7740] [cni-net] Plugin azure-vnet version v1.0.29.
2019/12/19 20:02:10 [7740] [cni-net] Running on windows
2019/12/19 20:02:16 [7740] [net] Network interface: {Index:6 MTU:1500 Name:vEthernet (Ethernet 2) HardwareAddr:00:0d:3a:c2:a0:b5 Flags:up|broadcast|multicast} with IP: [fe80::8c4c:911b:131f:132a/64 10.240.0.128/12]
2019/12/19 20:02:17 [7740] [net] Network interface: {Index:1 MTU:-1 Name:Loopback Pseudo-Interface 1 HardwareAddr: Flags:up|loopback|multicast} with IP: [::1/128 127.0.0.1/8]
2019/12/19 20:02:19 [7740] [net] Network interface: {Index:12 MTU:1500 Name:vEthernet (nat) HardwareAddr:00:15:5d:76:8d:b5 Flags:up|broadcast|multicast} with IP: [fe80::dc51:edd1:aefe:f886/64 172.25.16.1/20]
Other processes following that one also logged the same failure:
2019/12/19 15:31:26 [21564] Formatted Boot time: 2019-12-12 22:58:19
2019/12/19 15:31:26 [21564] [Azure-Utils] ver
2019/12/19 15:31:26 [21564] Connected to telemetry service
2019/12/19 15:31:26 [21564] [cni-net] Plugin azure-vnet version v1.0.29.
2019/12/19 15:31:26 [21564] [cni-net] Running on windows
2019/12/19 15:31:26 [21564] [net] Network interface: {Index:6 MTU:1500 Name:vEthernet (Ethernet 2) HardwareAddr:00:0d:3a:c2:a0:b5 Flags:up|broadcast|multicast} with IP: [fe80::8c4c:911b:131f:132a/64 10.240.0.128/12]
2019/12/19 15:31:26 [21564] [net] Network interface: {Index:1 MTU:-1 Name:Loopback Pseudo-Interface 1 HardwareAddr: Flags:up|loopback|multicast} with IP: [::1/128 127.0.0.1/8]
2019/12/19 15:31:26 [21564] [net] Network interface: {Index:12 MTU:1500 Name:vEthernet (nat) HardwareAddr:00:15:5d:76:8d:b5 Flags:up|broadcast|multicast} with IP: [fe80::dc51:edd1:aefe:f886/64 172.25.16.1/20]
2019/12/19 15:31:26 [21564] [net] network store key not found
2019/12/19 15:31:26 [21564] [cni-net] Plugin started.
2019/12/19 15:31:26 [21564] [cni-net] Plugin stopped.
2019/12/19 15:31:26 [21564] Sending report succeeded
Once the system load was relieved, from this point on, all azure-vnet calls are failing:
2019/12/19 20:13:43 [16676] Formatted Boot time: 2019-12-12 22:58:19
2019/12/19 20:13:43 [16676] [Azure-Utils] ver
2019/12/19 20:14:11 [16676] Formatted Boot time: 2019-12-12 22:58:19
2019/12/19 20:14:11 [16676] [cni] reboot time 2019-12-12 22:58:19 +0000 UTC storeLockFile mod time 2019-12-19 20:02:05.2168854 +0000 UTC
2019/12/19 20:17:04 [16676] [cni] Failed to lock store: timed out locking store.
2019/12/19 20:17:04 [16676] Failed to initialize key-value store of network plugin, err:timed out locking store.
2019/12/19 20:17:04 [16676] Report plugin error
What you expected to happen: No lock leaks
How to reproduce it (as minimally and precisely as possible):
Deploy this and scale to > 4x the number of CPUs the node has
apiVersion: apps/v1
kind: Deployment
metadata:
name: burn500m
labels:
app: burn500m
spec:
replicas: 3
template:
metadata:
name: burn500m
labels:
app: burn500m
spec:
containers:
- name: windowswebserver
resources:
limits:
cpu: .5
memory: 1Gi
requests:
cpu: .25
memory: 1Gi
image: mcr.microsoft.com/windows/servercore:ltsc2019
command:
- powershell.exe
- -command
- "foreach ($loopnumber in 1..8) { Start-Job -ScriptBlock { $result = 1; foreach($mm in 1..2147483647){$res1=1;foreach($num in 1..2147483647){$res1=$mm*$num*1340371};$res1} } } ; get-job | wait-job"
nodeSelector:
"beta.kubernetes.io/os": windows
selector:
matchLabels:
app: burn500m
Anything else we need to know:
This also causes some difficult to understand errors in kubelet logs
W1219 23:13:16.676189 14772 cni.go:237] Unable to update cni config: no valid networks found in c:\k\azurecni\netconf
E1219 23:13:29.295807 14772 cni.go:364] Error adding default_burn2-7b86d4db4f-m2zbc/2a108f19f55a5fd4e2572182c3cdb90280309c4626a35fe31c4a1c1173897c49 to network azure-vnet/azure: unexpected end of JSON input
W1219 23:13:43.276619 14772 cni.go:202] Error validating CNI config list {
"cniVersion": "0.3.0",
"name": "azure",
"plugins": [
{
"type": "azure-vnet",
"mode": "bridge",
"bridge": "azure0",
"capabilities": {
"portMappings": true,
"dns": true
},
"ipam": {
"type": "azure-vnet-ipam"
},
"dns": {
"Nameservers": [
"10.0.0.10",
"168.63.129.16"
],
"Search": [
"svc.cluster.local"
]
},
"AdditionalArgs": [
{
"Name": "EndpointPolicy",
"Value": {
"Type": "OutBoundNAT",
"ExceptionList": [
"10.0.0.0/8",
"10.240.0.0/12"
]
}
},
{
"Name": "EndpointPolicy",
"Value": {
"Type": "ROUTE",
"DestinationPrefix": "10.0.0.0/16",
"NeedEncap": true
}
}
]
}
]
}
: [decoding version info: unexpected end of JSON input]
W1219 23:13:43.276619 14772 cni.go:237] Unable to update cni config: no valid networks found in c:\k\azurecni\netconf
E1219 23:13:51.423887 14772 remote_runtime.go:105] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to set up sandbox container "2a108f19f55a5fd4e2572182c3cdb90280309c4626a35fe31c4a1c1173897c49" network for pod "burn2-7b86d4db4f-m2zbc": networkPlugin cni failed to set up pod "burn2-7b86d4db4f-m2zbc_default" network: unexpected end of JSON input
And these events are passed back to Kubernetes, visible in kubectl describe pod ...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> default-scheduler 0/6 nodes are available: 6 Insufficient cpu.
Warning FailedScheduling <unknown> default-scheduler 0/6 nodes are available: 6 Insufficient cpu.
Normal Scheduled <unknown> default-scheduler Successfully assigned default/burn2-7b86d4db4f-m2zbc to 2758k8s010
Warning FailedCreatePodSandBox 9m1s kubelet, 2758k8s010 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "8931a83501dcb167bf10e9365d2a7f8fe4bd125ddc04f8370b08bd7c3a799e50" network for pod "burn2-7b86d4db4f-m2zbc": networkPlugin cni failed to set up pod "burn2-7b86d4db4f-m2zbc_default" network: unexpected end of JSON input
Warning FailedCreatePodSandBox 6m2s kubelet, 2758k8s010 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "2044513a9f09b702c426085091d420d129292e62979d10b77b053f3002e27ebc" network for pod "burn2-7b86d4db4f-m2zbc": networkPlugin cni failed to set up pod "burn2-7b86d4db4f-m2zbc_default" network: unexpected end of JSON input
Warning FailedSync 5m18s (x4 over 5m41s) kubelet, 2758k8s010 error determining status: rpc error: code = Unknown desc = Error: No such container: 8931a83501dcb167bf10e9365d2a7f8fe4bd125ddc04f8370b08bd7c3a799e50
Warning FailedCreatePodSandBox 4m12s kubelet, 2758k8s010 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "b32c54cb368f0c37f769a795ceb1c8a10dfb58db54dc3152c2f71cf9caeaff54" network for pod "burn2-7b86d4db4f-m2zbc": networkPlugin cni failed to set up pod "burn2-7b86d4db4f-m2zbc_default" network: unexpected end of JSON input
Warning FailedCreatePodSandBox 2m43s kubelet, 2758k8s010 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "af3d169eb3390c9692d78fef343dbb9959fcb523d981000af0127f79a29c5467" network for pod "burn2-7b86d4db4f-m2zbc": networkPlugin cni failed to set up pod "burn2-7b86d4db4f-m2zbc_default" network: unexpected end of JSON input
Warning FailedCreatePodSandBox 74s kubelet, 2758k8s010 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "4cd552d3e6df57530227ea9a4a633eea05c62d87cf53ba765164545124e79391" network for pod "burn2-7b86d4db4f-m2zbc": networkPlugin cni failed to set up pod "burn2-7b86d4db4f-m2zbc_default" network: unexpected end of JSON input
Normal SandboxChanged 52s (x5 over 7m28s) kubelet, 2758k8s010 Pod sandbox changed, it will be killed and re-created.