Issue with stop/start container on WS2k19 #1822

Open
dardelean opened this issue Jun 21, 2023 · 2 comments

@dardelean

Containers (process or Hyper-V isolation) fail to start after a stop, or to restart, on WS2k19. The issue is easy to reproduce on a standard WS2k19 deployment with nerdctl and containerd (v1.7.0-339-g87dbdd2ca). This is the latest version of containerd as of today (07.06.2023), but the issue reproduces on older versions as well.

The specific error is:
errors: failed to create shim task: hcs::CreateComputeSystem 7741aa979c8a1ef17659b625d73418b28421be780e848e12d82edd5c6b76312e: The requested operation for attach namespace failed.: unknown"
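
For anyone who wants to script the reproduction, the run, stop, start cycle can be sketched roughly as below. This is only an illustration and not part of the original report: the helper names, the container name `repro-1822`, the `mcr.microsoft.com/windows/nanoserver:ltsc2019` image and the `ping -t localhost` keep-alive command are all assumptions, and only basic `nerdctl run`, `nerdctl stop` and `nerdctl start` invocations are used.

```python
import subprocess

# Substring of the HCS error quoted in this issue.
ATTACH_NAMESPACE_ERROR = "The requested operation for attach namespace failed"


def repro_commands(name, image):
    """Return the nerdctl invocations for a run -> stop -> start cycle.

    The keep-alive command (ping -t localhost) is an assumption; any
    long-running process in the image would do.
    """
    return [
        ["nerdctl", "run", "-d", "--name", name, image, "ping", "-t", "localhost"],
        ["nerdctl", "stop", name],
        ["nerdctl", "start", name],
    ]


def is_attach_namespace_failure(output):
    """True if the output contains the attach-namespace error from this issue."""
    return ATTACH_NAMESPACE_ERROR in output


def run_repro(name="repro-1822",
              image="mcr.microsoft.com/windows/nanoserver:ltsc2019"):
    """Run the cycle and report whether the final start hit the error.

    Hypothetical helper: the name and image defaults are placeholders,
    and nerdctl must be on PATH for this to run.
    """
    result = None
    for cmd in repro_commands(name, image):
        result = subprocess.run(cmd, capture_output=True, text=True)
    combined = result.stdout + result.stderr
    return result.returncode != 0 and is_attach_namespace_failure(combined)
```

Calling `run_repro()` on an affected WS2k19 host would be expected to return `True` if the final `nerdctl start` fails with the error above. Matching on the message substring rather than the container ID keeps the check independent of which container is used.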

This is how the Cirrus CI uses WS2k19:
https://github.com/containerd/nerdctl/blob/main/.cirrus.yml#L26

It uses an image built on top of "windows-2019-core-for-containers":
https://github.com/cirruslabs/vm-images/blob/master/googlecompute/windows_images.json#L8

And this is how the image is configured:
https://github.com/containerd/nerdctl/blob/main/hack/configure-windows-ci.ps1

We saw that if we remove the endpoint while the container is stopped, the container starts successfully, but it then has no network endpoint. We suspect the problem lies there. containerd and the shim send correct information to HCS: while debugging, we compared the Go structures with those from a WS2k22 deployment, which works. One thing we did not understand was the endpoint states, for example state 4 (observed after the container failed to start).

@acobaugh

I'm seeing the exact same thing with:

  • datadog agent: gcr.io/datadoghq/agent:7.43.1
  • containerd 1.6.6
  • EKS 1.24 v1.24.13-eks-0a21954
  • Host AMI: Windows_Server-2019-English-Core-EKS_Optimized-1.24-2023.06.14, Windows Server 2019 Datacenter 10.0.17763.4499

I did not see this on dockerd and EKS 1.23.

Every once in a while a container starts up just fine.

Other containers start fine on these hosts; only this datadog agent image consistently fails to start with this error.

@jterry75
Contributor

AttachNamespace is a networking failure. @kevpar - Could you add the right people for that? I don't remember if networking should be here or on WinContainers.
