
ECS agent disconnects instances but autoscaling not working properly after that #4082

Open
xploshioOn opened this issue Jan 30, 2024 · 3 comments

xploshioOn commented Jan 30, 2024

Summary

We have a cluster with some GPU instances. They normally work as expected, but every now and then instances start disconnecting from the cluster while remaining up in EC2; they just stop reporting anything to the cluster. For example, when the only running instance gets disconnected this way, we see a gap in the resource-usage report:

[Screenshot 2024-01-30 at 12:26:01 — gap in the cluster resource-usage graph while the instance was disconnected]

If we connect to the instance, the only Docker container running is the ECS agent, not the task assigned to that instance.

The problem is that these instances seem to affect autoscaling: while they are not working as expected, the autoscaling group does not start more instances, and we can go an hour without any instance running.

We are using the latest ECS agent version and the latest ECS-optimized AMI.

Looking at the ECS agent logs, we don't have enough information to figure out what might be happening:

level=info time=2024-01-30T11:03:13Z msg="End of eligible images for deletion" managedImagesRemaining=1
level=info time=2024-01-30T11:11:11Z msg="TCS Websocket connection closed for a valid reason"
level=info time=2024-01-30T11:11:11Z msg="Using cached DiscoverPollEndpoint" containerInstanceARN="arn:aws:ecs:us-east-1:4875549089326412330:container-instance/production-ecs-cluster/0079405150f5345f4bf1324234gb5h6446bf3bb8273" endpoint="https://ecs-a.us-east-1.amazonaws.com/acs/31/" telemetryEndpoint="https://ecs-t.us-east-1.amazonaws.com/tcs/31/" serviceConnectEndpoint="https://ecs-a.us-east-1.amazonaws.com"
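
When the agent drops off like this, the instance-local introspection endpoint can show whether the agent still believes it is registered. A minimal check from the instance, assuming the defaults on the ECS-optimized AMI (introspection on port 51678, agent container named `ecs-agent`, logs under `/var/log/ecs/`):

```shell
# Query the ECS agent introspection API (default port 51678 on the instance).
# A healthy agent returns its Cluster, ContainerInstanceArn, and Version.
curl -s http://localhost:51678/v1/metadata

# Confirm the agent container is actually up
docker ps --filter name=ecs-agent

# Capture the agent's recent log output for the disconnect window
tail -n 100 /var/log/ecs/ecs-agent.log
```

If the metadata call hangs or returns no ContainerInstanceArn while the container is still up, that points at the agent's control-plane connection rather than the instance itself.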

docker info

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc., v0.0.0+unknown)

Server:
 Containers: 3
  Running: 1
  Paused: 0
  Stopped: 2
 Images: 5
 Server Version: 20.10.25
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: amazon-ecs-volume-plugin local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 1e1ea6e986c6c86565bc33d52e34b81b3e2bc71f
 runc version: f19387a6bec4944c770f7668ab51c4348d9c2f38
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.14.334-252.552.amzn2.x86_64
 Operating System: Amazon Linux 2
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 62.23GiB
 Name: ip-10-30-1-250.ec2.internal
 ID: MU72:QWOM:NEUO:JDMS:Y47J:VEY6:IU3F:4CMD:KBBR:ESHN:PFMI:75E3
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Not sure if the issue is the ECS agent getting disconnected or something else. The logs don't give us enough information to debug.
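
One way to confirm from outside the instance whether ECS itself considers the agent disconnected is the DescribeContainerInstances API, whose `agentConnected` field flips to false when the agent's websocket connection is lost. A sketch, using the cluster name from the logs above and a placeholder container-instance ID:

```shell
# Check whether ECS still sees the agent as connected
# (cluster name and container-instance ID are placeholders -- substitute your own)
aws ecs describe-container-instances \
  --cluster production-ecs-cluster \
  --container-instances <container-instance-id> \
  --query 'containerInstances[].{arn:containerInstanceArn,agentConnected:agentConnected,status:status}'
```

Polling this during one of the gaps would show whether the control plane and the instance disagree about the agent's state.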

@FelipeGXavier

I had a past issue where the ecs-agent got disconnected when the EC2 host machine was under heavy load (CPU, memory). Could that be the case here?


gpkc commented Mar 20, 2024

Same thing is happening to me. I don't think it's related to high load in my case, as my workloads are fairly lightweight and this is happening at seemingly random times.

We're not using GPU instances either, so the issue doesn't seem GPU-related.

It started happening when we upgraded to the latest ECS-optimized AMI; before that we were running one that was a few months old.
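
If the regression tracks the AMI upgrade, one mitigation while debugging is to pin the launch template to a known-good AMI instead of re-resolving the latest on every scale-out. AWS publishes the recommended ECS-optimized AMI ID under a public SSM parameter path; a sketch:

```shell
# Resolve the currently recommended ECS-optimized Amazon Linux 2 AMI ID
aws ssm get-parameters \
  --names /aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id \
  --query 'Parameters[0].Value' --output text
# Record the returned ami-... ID and pin it in the launch template,
# rather than resolving "recommended" dynamically.
```

That at least makes it possible to roll back to the last AMI that didn't exhibit the disconnects.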

amogh09 (Contributor) commented May 1, 2024

Thanks for opening this issue @xploshioOn, and sorry that you are having trouble using ECS. Apologies for the delay in our response.

> If we connect to the instance, the only docker running is the ECS agent, but not the task I have assigned to that one.

What is the task's state in the ECS console, or in the response to the DescribeTasks API, when this happens? It would be very helpful to see the describe-tasks response for such tasks (please redact any sensitive information before sharing it here).
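
For anyone gathering this, the request above can be collected with the AWS CLI. A sketch with placeholder cluster and task identifiers:

```shell
# Find tasks that ECS recently stopped on this cluster (names are placeholders)
aws ecs list-tasks --cluster production-ecs-cluster --desired-status STOPPED

# Describe a task; lastStatus, stoppedReason, and per-container reasons
# usually explain why the task is no longer running on the instance
aws ecs describe-tasks \
  --cluster production-ecs-cluster \
  --tasks <task-arn> \
  --query 'tasks[].{lastStatus:lastStatus,stoppedReason:stoppedReason,containers:containers[].{name:name,reason:reason}}'
```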
