
Windows nodes become Not Ready occasionally #1765

Closed
zhiweiv opened this issue Aug 5, 2020 · 16 comments

Comments

@zhiweiv

zhiweiv commented Aug 5, 2020

What happened:
We have Windows pools that run job pods; a high volume of job pods run, complete, and are then deleted on them. Windows nodes in these pools occasionally become Not Ready, and pods on those nodes hang in the Terminating state.

I can see a lot of errors in the kubelet logs. For example:
E0803 17:19:44.164137 4368 remote_runtime.go:261] RemoveContainer "2a57a5795b0b4d8fb673696b5525f36070ad20e5cbc09a95dceda6c90272fb64" from runtime service failed: rpc error: code = Unknown desc = failed to remove container "2a57a5795b0b4d8fb673696b5525f36070ad20e5cbc09a95dceda6c90272fb64": Error response from daemon: unable to remove filesystem for 2a57a5795b0b4d8fb673696b5525f36070ad20e5cbc09a95dceda6c90272fb64: CreateFile C:\ProgramData\docker\containers\2a57a5795b0b4d8fb673696b5525f36070ad20e5cbc09a95dceda6c90272fb64\2a57a5795b0b4d8fb673696b5525f36070ad20e5cbc09a95dceda6c90272fb64-json.log: Access is denied.

fatal error: concurrent map writes

goroutine 13660836 [running]:
runtime.throw(0x40932d0, 0x15)
	/usr/local/go/src/runtime/panic.go:774 +0x79 fp=0xc00274b7c0 sp=0xc00274b790 pc=0x4334a9
runtime.mapassign_faststr(0x3772920, 0xc00084b0b0, 0xc002d2edc0, 0x40, 0xc000da0480)
	/usr/local/go/src/runtime/map_faststr.go:211 +0x41e fp=0xc00274b828 sp=0xc00274b7c0 pc=0x41450e
k8s.io/kubernetes/pkg/kubelet/dockershim.(*dockerService).CreateContainer(0xc00091c0b0, 0x4790180, 0xc00263eb70, 0xc00263eba0, 0xc00091c0b0, 0xc00263eb70, 0xc000c2fa80)
	/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/kubelet/dockershim/docker_container.go:186 +0xae9 fp=0xc00274ba20 sp=0xc00274b828 pc=0x311e199

Full logs:
kubelet.err.log.zip

What you expected to happen:
Windows nodes remain Ready.

How to reproduce it (as minimally and precisely as possible):
Run a high volume of pods on Windows nodes and delete them once they complete.
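
For illustration, a rough churn sketch of this workload (assuming client-go v0.18+ and a kubeconfig at the default location; the namespace, pod names, and image are only examples):

package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the kubeconfig from the default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	ctx := context.Background()

	// Churn many short-lived pods on Windows nodes: create, let them finish,
	// delete. Running several copies of this loop in parallel increases the
	// chance of hitting concurrent create/remove on the same node.
	for i := 0; i < 500; i++ {
		name := fmt.Sprintf("win-churn-%d", i)
		pod := &corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: "default"},
			Spec: corev1.PodSpec{
				NodeSelector:  map[string]string{"kubernetes.io/os": "windows"},
				RestartPolicy: corev1.RestartPolicyNever,
				Containers: []corev1.Container{{
					Name:    "work",
					Image:   "mcr.microsoft.com/windows/nanoserver:1809",
					Command: []string{"cmd", "/c", "echo done"},
				}},
			},
		}
		if _, err := client.CoreV1().Pods("default").Create(ctx, pod, metav1.CreateOptions{}); err != nil {
			fmt.Println("create:", err)
			continue
		}
		time.Sleep(2 * time.Second) // give the container time to run and exit
		if err := client.CoreV1().Pods("default").Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
			fmt.Println("delete:", err)
		}
	}
}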

Anything else we need to know?:
I guess this may be caused by concurrent pod creation and deletion.

Environment:

  • Kubernetes version (use kubectl version):
    1.17.7

  • Size of cluster (how many worker nodes are in the cluster?):
    10 nodes with autoscaling

  • General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.)
    .NET Framework app to process background tasks

@ghost ghost added the triage label Aug 5, 2020
@ghost

ghost commented Aug 5, 2020

Triage required from @Azure/aks-pm

@ghost ghost removed the triage label Aug 5, 2020
@TomGeske

TomGeske commented Aug 5, 2020

@keikhara something you can help with?

@ghost

ghost commented Aug 5, 2020

@immuzz, @keikhara would you be able to assist?

@keikhara keikhara self-assigned this Aug 5, 2020
@AbelHu
Member

AbelHu commented Aug 7, 2020

I found a similar issue fatal error: concurrent map writes in golang/go#20060.
From the logs, it seems that the error was caused by a race condition between CreateContainer and RemoveContainer.

https://github.com/kubernetes/kubernetes/blob/b4455102ef392bf7d594ef96b97a4caa79d729d9/pkg/kubelet/dockershim/docker_container.go#L186

func (ds *dockerService) CreateContainer(_ context.Context, r *runtimeapi.CreateContainerRequest) (*runtimeapi.CreateContainerResponse, error) {
        ...
	if createResp != nil {
		containerID := createResp.ID

		if cleanupInfo != nil {
			// we don't perform the clean up just yet at that could destroy information
			// needed for the container to start (e.g. Windows credentials stored in
			// registry keys); instead, we'll clean up when the container gets removed
			ds.containerCleanupInfos[containerID] = cleanupInfo
		}
		return &runtimeapi.CreateContainerResponse{ContainerId: containerID}, nil
	}

https://github.com/kubernetes/kubernetes/blob/b4455102ef392bf7d594ef96b97a4caa79d729d9/pkg/kubelet/dockershim/docker_container.go#L463

func (ds *dockerService) performPlatformSpecificContainerForContainer(containerID string) (errors []error) {
	if cleanupInfo, present := ds.containerCleanupInfos[containerID]; present {
		errors = ds.performPlatformSpecificContainerCleanupAndLogErrors(containerID, cleanupInfo)

		if len(errors) == 0 {
			delete(ds.containerCleanupInfos, containerID)
		}
	}

	return
}
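
Both paths touch ds.containerCleanupInfos without any locking, so a create racing with a remove can trigger the concurrent map writes panic. A minimal sketch of how the map access could be serialized (the names cleanupInfosLock, setContainerCleanupInfo and takeContainerCleanupInfo are hypothetical; the actual upstream fix may look different):

package dockershim

import "sync"

// containerCleanupInfo stands in for the real dockershim type.
type containerCleanupInfo struct{}

type dockerService struct {
	cleanupInfosLock      sync.Mutex
	containerCleanupInfos map[string]*containerCleanupInfo
	// ... other fields omitted
}

// setContainerCleanupInfo is what CreateContainer would call instead of
// writing to the map directly.
func (ds *dockerService) setContainerCleanupInfo(containerID string, info *containerCleanupInfo) {
	ds.cleanupInfosLock.Lock()
	defer ds.cleanupInfosLock.Unlock()
	ds.containerCleanupInfos[containerID] = info
}

// takeContainerCleanupInfo is what the removal path would call: it reads and
// deletes the entry under the same lock. (The real code only deletes the
// entry once cleanup succeeded; this sketch only shows how to serialize the
// map access.)
func (ds *dockerService) takeContainerCleanupInfo(containerID string) (*containerCleanupInfo, bool) {
	ds.cleanupInfosLock.Lock()
	defer ds.cleanupInfosLock.Unlock()
	info, present := ds.containerCleanupInfos[containerID]
	if present {
		delete(ds.containerCleanupInfos, containerID)
	}
	return info, present
}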

@AbelHu
Member

AbelHu commented Aug 7, 2020

@zhiweiv I suggest you file an issue in https://github.com/kubernetes/kubernetes

@zhiweiv
Author

zhiweiv commented Aug 7, 2020

For the kubelet panic issue, I will file an issue in k/k.

But for the error:
E0803 17:19:44.164137 4368 remote_runtime.go:261] RemoveContainer "2a57a5795b0b4d8fb673696b5525f36070ad20e5cbc09a95dceda6c90272fb64" from runtime service failed: rpc error: code = Unknown desc = failed to remove container "2a57a5795b0b4d8fb673696b5525f36070ad20e5cbc09a95dceda6c90272fb64": Error response from daemon: unable to remove filesystem for 2a57a5795b0b4d8fb673696b5525f36070ad20e5cbc09a95dceda6c90272fb64: CreateFile C:\ProgramData\docker\containers\2a57a5795b0b4d8fb673696b5525f36070ad20e5cbc09a95dceda6c90272fb64\2a57a5795b0b4d8fb673696b5525f36070ad20e5cbc09a95dceda6c90272fb64-json.log: Access is denied.

I guess it is related to Windows itself, or to hcsshim?

@AbelHu
Member

AbelHu commented Aug 7, 2020

From the logs, it seems that RemoveContainer is called multiple times for the same container ID.

W0803 11:41:27.191465    4368 docker_sandbox.go:394] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "alita-job-15378e2651014ac080747d95023dff9e-spop08031140-000_default": A virtual machine or container with the specified identifier does not exist.
I0803 11:41:27.515618    4368 fake_cpu_manager.go:47] [fake cpumanager] RemoveContainer (container id: 486f88a538086c73fcf8e43e847a3cfc302427576744fa228a4e85a29dabbe45)

I0803 11:42:11.334173    4368 fake_cpu_manager.go:47] [fake cpumanager] RemoveContainer (container id: 486f88a538086c73fcf8e43e847a3cfc302427576744fa228a4e85a29dabbe45)
E0803 11:42:11.354174    4368 remote_runtime.go:261] RemoveContainer "486f88a538086c73fcf8e43e847a3cfc302427576744fa228a4e85a29dabbe45" from runtime service failed: rpc error: code = Unknown desc = failed to remove container "486f88a538086c73fcf8e43e847a3cfc302427576744fa228a4e85a29dabbe45": Error response from daemon: unable to remove filesystem for 486f88a538086c73fcf8e43e847a3cfc302427576744fa228a4e85a29dabbe45: CreateFile C:\ProgramData\docker\containers\486f88a538086c73fcf8e43e847a3cfc302427576744fa228a4e85a29dabbe45\486f88a538086c73fcf8e43e847a3cfc302427576744fa228a4e85a29dabbe45-json.log: Access is denied.

From the remarks here, the operating system returns ERROR_ACCESS_DENIED for every request except the first one. I guess CreateFile is called by RemoveContainer inside Docker.

If you call CreateFile on a file that is pending deletion as a result of a previous call to DeleteFile, the function fails. The operating system delays file deletion until all handles to the file are closed. GetLastError returns ERROR_ACCESS_DENIED.
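
If that is what happens, the error is transient: once the last handle on the json.log is closed, the pending delete completes. A minimal, hypothetical Go sketch of retrying such a removal (not kubelet or Docker code; the helper name and path are made up for illustration):

package main

import (
	"fmt"
	"os"
	"time"
)

// removeWithRetry retries os.Remove for a while. On Windows, a file that is
// pending deletion (another process still holds an open handle) causes
// subsequent CreateFile/DeleteFile calls to fail with "Access is denied"
// until the last handle is closed, so a short retry loop often succeeds.
func removeWithRetry(path string, attempts int, delay time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = os.Remove(path)
		if err == nil || os.IsNotExist(err) {
			return nil
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("giving up on %s: %w", path, err)
}

func main() {
	// Hypothetical path mirroring the container log file from the error above.
	logPath := `C:\ProgramData\docker\containers\<id>\<id>-json.log`
	if err := removeWithRetry(logPath, 10, 500*time.Millisecond); err != nil {
		fmt.Println(err)
	}
}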

@palma21
Member

palma21 commented Aug 11, 2020

@craiglpeters FYI

@zhiweiv
Author

zhiweiv commented Sep 4, 2020

The fix has been merged into the k8s 1.17 branch (kubernetes/kubernetes#94042). The next release should be 1.17.12 later this month; do you know when it will be included in AKS?

@zhiweiv
Author

zhiweiv commented Sep 17, 2020

@palma21 @jluk
k8s 1.17.12 was released today and contains the fix for this issue. Do you know when it will arrive in AKS?

@ghost ghost added the action-required label Oct 12, 2020
@GerryWilko

GerryWilko commented Oct 12, 2020

Does Microsoft have any update on when this fix will be included in AKS? It regularly wastes node resources that require manual cleanup on my AKS clusters.

This bug is still present on 1.18.6

@zhiweiv
Author

zhiweiv commented Oct 12, 2020

1.18.9 fixed this. I guess it should arrive soon; it has been 20 days since the last release.

@AbelHu
Member

AbelHu commented Oct 12, 2020

Correct. We plan to release it with the Windows 10C patch.

@zhiweiv
Author

zhiweiv commented Oct 22, 2020

@AbelHu
Will this be available in release 20201019 too? Our Ops team is going crazy restarting broken nodes every day.

@AbelHu
Member

AbelHu commented Oct 22, 2020

@zhiweiv new Kubernetes versions were not enabled in 20201019 due to some issues. We are trying to add 1.17.13, 1.18.10, and 1.19.3 in 20201026.

@ghost ghost removed the action-required label Oct 22, 2020
@zhiweiv
Author

zhiweiv commented Nov 3, 2020

New k8s versions are available now.

@zhiweiv zhiweiv closed this as completed Nov 3, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 3, 2020