
Windows nodes become Not Ready occasionally #1765

Closed
zhiweiv opened this issue Aug 5, 2020 · 16 comments

Comments

@zhiweiv

zhiweiv commented Aug 5, 2020

What happened:
We have Windows pools that run job pods; a high volume of job pods run, complete, and are then deleted on them. Windows nodes in these pools occasionally become Not Ready, and pods on those nodes hang in the Terminating state.

I can see a lot of errors in the kubelet logs. For example:
E0803 17:19:44.164137 4368 remote_runtime.go:261] RemoveContainer "2a57a5795b0b4d8fb673696b5525f36070ad20e5cbc09a95dceda6c90272fb64" from runtime service failed: rpc error: code = Unknown desc = failed to remove container "2a57a5795b0b4d8fb673696b5525f36070ad20e5cbc09a95dceda6c90272fb64": Error response from daemon: unable to remove filesystem for 2a57a5795b0b4d8fb673696b5525f36070ad20e5cbc09a95dceda6c90272fb64: CreateFile C:\ProgramData\docker\containers\2a57a5795b0b4d8fb673696b5525f36070ad20e5cbc09a95dceda6c90272fb64\2a57a5795b0b4d8fb673696b5525f36070ad20e5cbc09a95dceda6c90272fb64-json.log: Access is denied.

fatal error: concurrent map writes

goroutine 13660836 [running]:
runtime.throw(0x40932d0, 0x15)
	/usr/local/go/src/runtime/panic.go:774 +0x79 fp=0xc00274b7c0 sp=0xc00274b790 pc=0x4334a9
runtime.mapassign_faststr(0x3772920, 0xc00084b0b0, 0xc002d2edc0, 0x40, 0xc000da0480)
	/usr/local/go/src/runtime/map_faststr.go:211 +0x41e fp=0xc00274b828 sp=0xc00274b7c0 pc=0x41450e
k8s.io/kubernetes/pkg/kubelet/dockershim.(*dockerService).CreateContainer(0xc00091c0b0, 0x4790180, 0xc00263eb70, 0xc00263eba0, 0xc00091c0b0, 0xc00263eb70, 0xc000c2fa80)
	/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/kubelet/dockershim/docker_container.go:186 +0xae9 fp=0xc00274ba20 sp=0xc00274b828 pc=0x311e199

Full logs:
kubelet.err.log.zip

What you expected to happen:
Windows nodes remain Ready.

How to reproduce it (as minimally and precisely as possible):
Run a high volume of pods on Windows nodes and delete them once they complete.
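
For illustration, a rough churn sketch of this workload (assuming client-go v0.18+ and a kubeconfig at the default location; the namespace, pod names, and image are only examples):

package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the kubeconfig from the default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	ctx := context.Background()

	// Churn many short-lived pods on Windows nodes: create, let them finish,
	// delete. Running several copies of this loop in parallel increases the
	// chance of hitting concurrent create/remove on the same node.
	for i := 0; i < 500; i++ {
		name := fmt.Sprintf("win-churn-%d", i)
		pod := &corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: "default"},
			Spec: corev1.PodSpec{
				NodeSelector:  map[string]string{"kubernetes.io/os": "windows"},
				RestartPolicy: corev1.RestartPolicyNever,
				Containers: []corev1.Container{{
					Name:    "work",
					Image:   "mcr.microsoft.com/windows/nanoserver:1809",
					Command: []string{"cmd", "/c", "echo done"},
				}},
			},
		}
		if _, err := client.CoreV1().Pods("default").Create(ctx, pod, metav1.CreateOptions{}); err != nil {
			fmt.Println("create:", err)
			continue
		}
		time.Sleep(2 * time.Second) // give the container time to run and exit
		if err := client.CoreV1().Pods("default").Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
			fmt.Println("delete:", err)
		}
	}
}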

Anything else we need to know?:
I guess this may be caused by concurrent pod creation and deletion.

Environment:

  • Kubernetes version (use kubectl version):
    1.17.7

  • Size of cluster (how many worker nodes are in the cluster?):
    10 nodes with autoscaling

  • General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.)
    .NET Framework app to process background tasks

@ghost ghost added the triage label Aug 5, 2020
@ghost

ghost commented Aug 5, 2020

Triage required from @Azure/aks-pm

@ghost ghost removed the triage label Aug 5, 2020
@TomGeske

TomGeske commented Aug 5, 2020

@keikhara something you can help with?

@ghost

ghost commented Aug 5, 2020

@immuzz, @keikhara would you be able to assist?

@keikhara keikhara self-assigned this Aug 5, 2020
@AbelHu
Member

AbelHu commented Aug 7, 2020

I found a similar issue fatal error: concurrent map writes in golang/go#20060.
From the logs, it seems that the error was caused by a race condition between CreateContainer and RemoveContainer.

https://github.com/kubernetes/kubernetes/blob/b4455102ef392bf7d594ef96b97a4caa79d729d9/pkg/kubelet/dockershim/docker_container.go#L186

func (ds *dockerService) CreateContainer(_ context.Context, r *runtimeapi.CreateContainerRequest) (*runtimeapi.CreateContainerResponse, error) {
        ...
	if createResp != nil {
		containerID := createResp.ID

		if cleanupInfo != nil {
			// we don't perform the clean up just yet at that could destroy information
			// needed for the container to start (e.g. Windows credentials stored in
			// registry keys); instead, we'll clean up when the container gets removed
			ds.containerCleanupInfos[containerID] = cleanupInfo
		}
		return &runtimeapi.CreateContainerResponse{ContainerId: containerID}, nil
	}

https://github.com/kubernetes/kubernetes/blob/b4455102ef392bf7d594ef96b97a4caa79d729d9/pkg/kubelet/dockershim/docker_container.go#L463

func (ds *dockerService) performPlatformSpecificContainerForContainer(containerID string) (errors []error) {
	if cleanupInfo, present := ds.containerCleanupInfos[containerID]; present {
		errors = ds.performPlatformSpecificContainerCleanupAndLogErrors(containerID, cleanupInfo)

		if len(errors) == 0 {
			delete(ds.containerCleanupInfos, containerID)
		}
	}

	return
}
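
Both paths touch ds.containerCleanupInfos without any locking, so a create racing with a remove can trigger the concurrent map writes panic. A minimal sketch of how the map access could be serialized (the names cleanupInfosLock, setContainerCleanupInfo and takeContainerCleanupInfo are hypothetical; the actual upstream fix may look different):

package dockershim

import "sync"

// containerCleanupInfo stands in for the real dockershim type.
type containerCleanupInfo struct{}

type dockerService struct {
	cleanupInfosLock      sync.Mutex
	containerCleanupInfos map[string]*containerCleanupInfo
	// ... other fields omitted
}

// setContainerCleanupInfo is what CreateContainer would call instead of
// writing to the map directly.
func (ds *dockerService) setContainerCleanupInfo(containerID string, info *containerCleanupInfo) {
	ds.cleanupInfosLock.Lock()
	defer ds.cleanupInfosLock.Unlock()
	ds.containerCleanupInfos[containerID] = info
}

// takeContainerCleanupInfo is what the removal path would call: it reads and
// deletes the entry under the same lock. (The real code only deletes the
// entry once cleanup succeeded; this sketch only shows how to serialize the
// map access.)
func (ds *dockerService) takeContainerCleanupInfo(containerID string) (*containerCleanupInfo, bool) {
	ds.cleanupInfosLock.Lock()
	defer ds.cleanupInfosLock.Unlock()
	info, present := ds.containerCleanupInfos[containerID]
	if present {
		delete(ds.containerCleanupInfos, containerID)
	}
	return info, present
}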

@AbelHu
Member

AbelHu commented Aug 7, 2020

@zhiweiv I suggest you file an issue in https://github.com/kubernetes/kubernetes

@zhiweiv
Author

zhiweiv commented Aug 7, 2020

For the kubelet panic issue, I will file an issue in k/k.

But for the error:
E0803 17:19:44.164137 4368 remote_runtime.go:261] RemoveContainer "2a57a5795b0b4d8fb673696b5525f36070ad20e5cbc09a95dceda6c90272fb64" from runtime service failed: rpc error: code = Unknown desc = failed to remove container "2a57a5795b0b4d8fb673696b5525f36070ad20e5cbc09a95dceda6c90272fb64": Error response from daemon: unable to remove filesystem for 2a57a5795b0b4d8fb673696b5525f36070ad20e5cbc09a95dceda6c90272fb64: CreateFile C:\ProgramData\docker\containers\2a57a5795b0b4d8fb673696b5525f36070ad20e5cbc09a95dceda6c90272fb64\2a57a5795b0b4d8fb673696b5525f36070ad20e5cbc09a95dceda6c90272fb64-json.log: Access is denied.

I guess it is related to Windows itself, or to hcsshim?

@AbelHu
Member

AbelHu commented Aug 7, 2020

From the logs, it seems that RemoveContainer is called multiple times for the same container ID.

W0803 11:41:27.191465    4368 docker_sandbox.go:394] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "alita-job-15378e2651014ac080747d95023dff9e-spop08031140-000_default": A virtual machine or container with the specified identifier does not exist.
I0803 11:41:27.515618    4368 fake_cpu_manager.go:47] [fake cpumanager] RemoveContainer (container id: 486f88a538086c73fcf8e43e847a3cfc302427576744fa228a4e85a29dabbe45)

I0803 11:42:11.334173    4368 fake_cpu_manager.go:47] [fake cpumanager] RemoveContainer (container id: 486f88a538086c73fcf8e43e847a3cfc302427576744fa228a4e85a29dabbe45)
E0803 11:42:11.354174    4368 remote_runtime.go:261] RemoveContainer "486f88a538086c73fcf8e43e847a3cfc302427576744fa228a4e85a29dabbe45" from runtime service failed: rpc error: code = Unknown desc = failed to remove container "486f88a538086c73fcf8e43e847a3cfc302427576744fa228a4e85a29dabbe45": Error response from daemon: unable to remove filesystem for 486f88a538086c73fcf8e43e847a3cfc302427576744fa228a4e85a29dabbe45: CreateFile C:\ProgramData\docker\containers\486f88a538086c73fcf8e43e847a3cfc302427576744fa228a4e85a29dabbe45\486f88a538086c73fcf8e43e847a3cfc302427576744fa228a4e85a29dabbe45-json.log: Access is denied.

From the remarks here, the operating system returns ERROR_ACCESS_DENIED for every request except the first one. I guess CreateFile is called by RemoveContainer inside Docker.

If you call CreateFile on a file that is pending deletion as a result of a previous call to DeleteFile, the function fails. The operating system delays file deletion until all handles to the file are closed. GetLastError returns ERROR_ACCESS_DENIED.
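
If that is what happens, the error is transient: once the last handle on the json.log is closed, the pending delete completes. A minimal, hypothetical Go sketch of retrying such a removal (not kubelet or Docker code; the helper name and path are made up for illustration):

package main

import (
	"fmt"
	"os"
	"time"
)

// removeWithRetry retries os.Remove for a while. On Windows, a file that is
// pending deletion (another process still holds an open handle) causes
// subsequent CreateFile/DeleteFile calls to fail with "Access is denied"
// until the last handle is closed, so a short retry loop often succeeds.
func removeWithRetry(path string, attempts int, delay time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = os.Remove(path)
		if err == nil || os.IsNotExist(err) {
			return nil
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("giving up on %s: %w", path, err)
}

func main() {
	// Hypothetical path mirroring the container log file from the error above.
	logPath := `C:\ProgramData\docker\containers\<id>\<id>-json.log`
	if err := removeWithRetry(logPath, 10, 500*time.Millisecond); err != nil {
		fmt.Println(err)
	}
}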

@palma21
Member

palma21 commented Aug 11, 2020

@craiglpeters FYI

@zhiweiv
Author

zhiweiv commented Sep 4, 2020

The fix has been merged into the k8s 1.17 branch (kubernetes/kubernetes#94042). The next release should be 1.17.12 later this month; do you know when it will be included in AKS?

@zhiweiv
Author

zhiweiv commented Sep 17, 2020

@palma21 @jluk
k8s 1.17.12 was released today and contains the fix for this issue. Do you know when it will arrive in AKS?

@ghost ghost added the action-required label Oct 12, 2020
@GerryWilko

GerryWilko commented Oct 12, 2020

Does Microsoft have any update on when this fix will be included in AKS? It regularly wastes node resources that require manual cleanup on my AKS clusters.

This bug is still present on 1.18.6

@zhiweiv
Author

zhiweiv commented Oct 12, 2020

1.18.9 fixed this. I guess it should arrive soon; it has been 20 days since the last release.

@AbelHu
Member

AbelHu commented Oct 12, 2020

Correct. We plan to release it with the Windows 10C patch.

@zhiweiv
Author

zhiweiv commented Oct 22, 2020

@AbelHu
Will this be available in release 20201019 too? Our Ops team is going crazy restarting broken nodes every day.

@AbelHu
Member

AbelHu commented Oct 22, 2020

@zhiweiv new Kubernetes versions were not enabled in 20201019 due to some issues. We are trying to add 1.17.13, 1.18.10, and 1.19.3 in 20201026.

@ghost ghost removed the action-required label Oct 22, 2020
@zhiweiv
Author

zhiweiv commented Nov 3, 2020

New k8s versions are available now.

@zhiweiv zhiweiv closed this as completed Nov 3, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 3, 2020