We have a cluster with some GPU instances. They normally work as expected, but every now and then instances disconnect from the cluster while still running in EC2; they simply stop reporting anything to the cluster. For example, when the only running instance disconnects in this way, we get a gap in the resource usage reports.
If we connect to the instance, the only Docker container running is the ECS agent, not the task assigned to that instance.
These instances also seem to affect autoscaling: while they are in this state, autoscaling does not start more instances, and we can go an hour without any instance running.
We are using the latest ECS agent version and the latest AMI for it.
Looking at the ECS logs, we don't have enough information to debug what is happening:
level=info time=2024-01-30T11:03:13Z msg="End of eligible images for deletion" managedImagesRemaining=1
level=info time=2024-01-30T11:11:11Z msg="TCS Websocket connection closed for a valid reason"
level=info time=2024-01-30T11:11:11Z msg="Using cached DiscoverPollEndpoint" containerInstanceARN="arn:aws:ecs:us-east-1:4875549089326412330:container-instance/production-ecs-cluster/0079405150f5345f4bf1324234gb5h6446bf3bb8273" endpoint="https://ecs-a.us-east-1.amazonaws.com/acs/31/" telemetryEndpoint="https://ecs-t.us-east-1.amazonaws.com/tcs/31/" serviceConnectEndpoint="https://ecs-a.us-east-1.amazonaws.com"
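One way to confirm the symptom from outside the instance is to ask ECS which container instances it currently considers disconnected. The following is a minimal sketch, not a diagnosis of the root cause: it assumes boto3 is installed and credentials are configured, and the helper name `find_disconnected_instances` is hypothetical. It relies on the `agentConnected` field returned by `describe_container_instances`.

```python
from typing import Any, Dict, List


def find_disconnected_instances(ecs_client: Any, cluster: str) -> List[Dict[str, Any]]:
    """Return a summary of container instances whose agent is not connected.

    `ecs_client` is expected to behave like boto3.client("ecs").
    """
    disconnected: List[Dict[str, Any]] = []
    paginator = ecs_client.get_paginator("list_container_instances")
    for page in paginator.paginate(cluster=cluster):
        arns = page.get("containerInstanceArns", [])
        if not arns:
            continue  # describe_container_instances rejects an empty list
        detail = ecs_client.describe_container_instances(
            cluster=cluster, containerInstances=arns
        )
        for ci in detail.get("containerInstances", []):
            # agentConnected is False when ECS has lost contact with the agent
            if not ci.get("agentConnected", False):
                disconnected.append(
                    {
                        "ec2InstanceId": ci.get("ec2InstanceId"),
                        "containerInstanceArn": ci.get("containerInstanceArn"),
                        "status": ci.get("status"),
                        "runningTasksCount": ci.get("runningTasksCount"),
                    }
                )
    return disconnected


# Example usage (cluster name and region taken from the log above):
#   import boto3
#   client = boto3.client("ecs", region_name="us-east-1")
#   for item in find_disconnected_instances(client, "production-ecs-cluster"):
#       print(item)
```

Running this on a schedule while the problem is occurring would at least show when an instance flips to disconnected, which can then be matched against the agent log timestamps.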
Same thing is happening to me. I don't think it's related to high load in my case, as my workloads are fairly lightweight and this is happening seemingly at random times.
We're not using GPU instances either, so this doesn't seem to be GPU-related.
It started happening when we upgraded to the latest ECS-optimized AMI. Before that, we were running one that was a few months old.
Thanks for opening this issue @xploshioOn and sorry that you are having trouble using ECS. Sorry for the delay in our response.
If we connect to the instance, the only docker running is the ECS agent, but not the task I have assigned to that one.
What is the task's state in the ECS console, or in the response to the ecs describe-tasks API, when this happens? It would be very helpful to see the ecs describe-tasks response for such tasks (please redact any sensitive information before sharing the response here).
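For anyone sharing a describe-tasks response here, the sketch below shows one possible way to redact it first. It is an illustration only: the `redact` helper is hypothetical, it assumes the response was saved as JSON (for example from the AWS CLI), and it masks only 12-digit account IDs inside ARNs and the values of `environment` entries. Review the output by hand, since other fields may also be sensitive.

```python
import json
import re
from typing import Any, Optional

# Matches the account-id field of an ARN: arn:partition:service:region:ACCOUNT:...
ACCOUNT_ID_IN_ARN = re.compile(r"(arn:aws[a-z-]*:[a-z0-9-]*:[a-z0-9-]*:)\d{12}(?=:)")


def redact(obj: Any, parent_key: Optional[str] = None) -> Any:
    """Return a copy of a parsed JSON document with common secrets masked."""
    if isinstance(obj, dict):
        # Entries of an "environment" list look like {"name": ..., "value": ...}
        if parent_key == "environment" and "name" in obj and "value" in obj:
            return {**obj, "value": "REDACTED"}
        return {key: redact(value, key) for key, value in obj.items()}
    if isinstance(obj, list):
        # List items inherit the key of the list that contains them
        return [redact(item, parent_key) for item in obj]
    if isinstance(obj, str):
        return ACCOUNT_ID_IN_ARN.sub(r"\g<1>REDACTED", obj)
    return obj


# Example usage, assuming the response was saved to a file named tasks.json:
#   with open("tasks.json") as f:
#       print(json.dumps(redact(json.load(f)), indent=2))
```

The function returns a new structure and leaves the input untouched, so the original response can be kept locally for comparison.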
We are not sure whether the issue is the ECS agent getting disconnected or something else. The logs don't give us enough information to debug it.