
ECS agent disconnects instances but autoscaling not working properly after that #4082

Open
xploshioOn opened this issue Jan 30, 2024 · 3 comments

xploshioOn commented Jan 30, 2024

Summary

We have a cluster with some GPU instances. They normally work as expected, but every now and then instances start disconnecting from the cluster while remaining up in EC2; they just stop reporting anything to the cluster. For example, when the only running instance gets disconnected this way, we see a gap in the resource-usage report:

[Screenshot 2024-01-30 at 12:26:01 — gap in the cluster resource-usage graph while the instance was disconnected]

If we connect to the instance, the only Docker container running is the ECS agent, not the task assigned to that instance.

The problem is that these instances seem to affect autoscaling: while they are not working as expected, the autoscaling group does not start more instances, and we can go an hour without any instance running.

We are using the latest ECS agent version and the latest ECS-optimized AMI.

Looking at the ECS agent logs, we don't have enough information to figure out what might be happening:

level=info time=2024-01-30T11:03:13Z msg="End of eligible images for deletion" managedImagesRemaining=1
level=info time=2024-01-30T11:11:11Z msg="TCS Websocket connection closed for a valid reason"
level=info time=2024-01-30T11:11:11Z msg="Using cached DiscoverPollEndpoint" containerInstanceARN="arn:aws:ecs:us-east-1:4875549089326412330:container-instance/production-ecs-cluster/0079405150f5345f4bf1324234gb5h6446bf3bb8273" endpoint="https://ecs-a.us-east-1.amazonaws.com/acs/31/" telemetryEndpoint="https://ecs-t.us-east-1.amazonaws.com/tcs/31/" serviceConnectEndpoint="https://ecs-a.us-east-1.amazonaws.com"
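
When the agent drops off like this, the instance-local introspection endpoint can show whether the agent still believes it is registered. A minimal check from the instance, assuming the defaults on the ECS-optimized AMI (introspection on port 51678, agent container named `ecs-agent`, logs under `/var/log/ecs/`):

```shell
# Query the ECS agent introspection API (default port 51678 on the instance).
# A healthy agent returns its Cluster, ContainerInstanceArn, and Version.
curl -s http://localhost:51678/v1/metadata

# Confirm the agent container is actually up
docker ps --filter name=ecs-agent

# Capture the agent's recent log output for the disconnect window
tail -n 100 /var/log/ecs/ecs-agent.log
```

If the metadata call hangs or returns no ContainerInstanceArn while the container is still up, that points at the agent's control-plane connection rather than the instance itself.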

docker info

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc., v0.0.0+unknown)

Server:
 Containers: 3
  Running: 1
  Paused: 0
  Stopped: 2
 Images: 5
 Server Version: 20.10.25
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: amazon-ecs-volume-plugin local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 1e1ea6e986c6c86565bc33d52e34b81b3e2bc71f
 runc version: f19387a6bec4944c770f7668ab51c4348d9c2f38
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.14.334-252.552.amzn2.x86_64
 Operating System: Amazon Linux 2
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 62.23GiB
 Name: ip-10-30-1-250.ec2.internal
 ID: MU72:QWOM:NEUO:JDMS:Y47J:VEY6:IU3F:4CMD:KBBR:ESHN:PFMI:75E3
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Not sure if the issue is the ECS agent getting disconnected or something else. The logs don't give us enough information to debug.
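
One way to confirm from outside the instance whether ECS itself considers the agent disconnected is the DescribeContainerInstances API, whose `agentConnected` field flips to false when the agent's websocket connection is lost. A sketch, using the cluster name from the logs above and a placeholder container-instance ID:

```shell
# Check whether ECS still sees the agent as connected
# (cluster name and container-instance ID are placeholders -- substitute your own)
aws ecs describe-container-instances \
  --cluster production-ecs-cluster \
  --container-instances <container-instance-id> \
  --query 'containerInstances[].{arn:containerInstanceArn,agentConnected:agentConnected,status:status}'
```

Polling this during one of the gaps would show whether the control plane and the instance disagree about the agent's state.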

@FelipeGXavier

I had a past issue where the ecs-agent got disconnected when the EC2 host machine was under heavy load (CPU, memory). Could that be the case here?


gpkc commented Mar 20, 2024

Same thing is happening to me. I don't think it's related to high load in my case, as my workloads are fairly lightweight and this is happening at seemingly random times.

We're not using GPU instances either, so the issue doesn't seem GPU-related.

It started happening when we upgraded to the latest ECS-optimized AMI; before that we were running one that was a few months old.
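
If the regression tracks the AMI upgrade, one mitigation while debugging is to pin the launch template to a known-good AMI instead of re-resolving the latest on every scale-out. AWS publishes the recommended ECS-optimized AMI ID under a public SSM parameter path; a sketch:

```shell
# Resolve the currently recommended ECS-optimized Amazon Linux 2 AMI ID
aws ssm get-parameters \
  --names /aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id \
  --query 'Parameters[0].Value' --output text
# Record the returned ami-... ID and pin it in the launch template,
# rather than resolving "recommended" dynamically.
```

That at least makes it possible to roll back to the last AMI that didn't exhibit the disconnects.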

amogh09 (Contributor) commented May 1, 2024

Thanks for opening this issue @xploshioOn, and sorry that you are having trouble using ECS. Apologies for the delay in our response.

> If we connect to the instance, the only docker running is the ECS agent, but not the task I have assigned to that one.

What is the task's state in the ECS console, or in the response to the DescribeTasks API, when this happens? It would be very helpful to see the describe-tasks response for such tasks (please redact any sensitive information before sharing it here).
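
For anyone gathering this, the request above can be collected with the AWS CLI. A sketch with placeholder cluster and task identifiers:

```shell
# Find tasks that ECS recently stopped on this cluster (names are placeholders)
aws ecs list-tasks --cluster production-ecs-cluster --desired-status STOPPED

# Describe a task; lastStatus, stoppedReason, and per-container reasons
# usually explain why the task is no longer running on the instance
aws ecs describe-tasks \
  --cluster production-ecs-cluster \
  --tasks <task-arn> \
  --query 'tasks[].{lastStatus:lastStatus,stoppedReason:stoppedReason,containers:containers[].{name:name,reason:reason}}'
```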
