Unexpected SIGTERM to dy-sidecar and dy-proxy pair #3832

Closed
GitHK opened this issue Feb 3, 2023 · 7 comments
Labels: a:autoscaling (autoscaling service in simcore's stack), a:dynamic-sidecar (dynamic-sidecar service), bug (buggy, it does not work as expected), High Priority (a totally crucial bug/feature to be fixed asap), t:maintenance (some planned maintenance work)

@GitHK
Contributor

GitHK commented Feb 3, 2023

It was observed that, on autoscaled nodes, a dy-sidecar and dy-proxy pair sometimes receives a SIGTERM out of the blue.

I am sure this is not initiated by the director-v2, since it would first try to save the state and push the outputs before trying to remove both services.

For reference, have a look at the logs from 67fa84ec-5a78-49b8-8aca-0ce179d84021 on AWS prod. They clearly show that the sidecar and proxy receive a SIGTERM and shut down.
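To make such events easier to correlate with other activity on the node, a process can record the exact moment it receives the signal. The following is only a minimal sketch using Python's standard `signal` module, not how the dy-sidecar actually handles shutdown; the names `received`, `_on_sigterm` and `install_sigterm_logger` are made up for illustration.

```python
import os
import signal
import time

# (signal name, unix timestamp) tuples, filled in by the handler below
received = []


def _on_sigterm(signum, frame):
    # Record which signal arrived and when. A real service would write
    # this to its log and then start its graceful shutdown.
    received.append((signal.Signals(signum).name, time.time()))


def install_sigterm_logger():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, _on_sigterm)


# Usage: simulate an external SIGTERM sent to this process (POSIX only)
install_sigterm_logger()
os.kill(os.getpid(), signal.SIGTERM)
time.sleep(0.1)  # give the interpreter a chance to run the handler
print(received[-1][0])
```

The recorded timestamp can then be compared against the time at which the node was drained or removed.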

@GitHK added the bug and High Priority labels Feb 3, 2023
@GitHK
Contributor Author

GitHK commented Feb 17, 2023

This was observed again on a node with 3 dy-services running on it. All 3 pairs of dy-sidecar and dy-proxy received the SIGTERM at the same time.

The obvious culprit would be autoscaling draining the node too early.
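One way to test this hypothesis is to check, at the time of an incident, which swarm nodes are set to drain. The sketch below is an assumption about how that check could look, not part of the autoscaling service: `drained_node_ids` is a hypothetical helper, and the sample data is invented but shaped like the node objects returned by the Docker Engine API (`ID` and `Spec.Availability`).

```python
from typing import Any


def drained_node_ids(nodes: list[dict[str, Any]]) -> list[str]:
    """Return the IDs of swarm nodes whose availability is 'drain'."""
    return [
        node["ID"]
        for node in nodes
        if node.get("Spec", {}).get("Availability") == "drain"
    ]


# Hypothetical sample, shaped like Docker Engine API node objects
sample_nodes = [
    {"ID": "node-a", "Spec": {"Availability": "active"}},
    {"ID": "node-b", "Spec": {"Availability": "drain"}},
    {"ID": "node-c", "Spec": {"Availability": "pause"}},
]
print(drained_node_ids(sample_nodes))
```

On a swarm manager, the input could come from the Docker SDK for Python, for example `[n.attrs for n in docker.from_env().nodes.list()]`.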

@sanderegg added the a:dynamic-sidecar and a:autoscaling labels Feb 28, 2023
@sanderegg modified the milestone: Mithril Feb 28, 2023
@pcrespov added the t:maintenance label Mar 5, 2023
@sanderegg
Member

@GitHK could you please add the query I should use in Graylog to find these cases? Thanks.

@GitHK
Contributor Author

GitHK commented Apr 18, 2023

Related to this case: we are currently seeing unexpected behaviour on AWS staging. Containers are in an orphaned state. The details in the screenshots state that the task belongs to a node which was deleted.

Unfortunately the logs are partial. If you filter by container_name: /.05827b71-86b1-48b4-9f5a-8ee5c4d545e8./ the only thing I see is that the sidecar was trying to shut down. I don't see many logs related to starting, so I don't think these are that helpful.

@matusdrobuliak66 was also involved in this investigation.

Image

Image

Image

@matusdrobuliak66
Contributor

matusdrobuliak66 commented Apr 18, 2023

Just so we don't forget: around 17.4. we also turned off the autoscaling e2e tests, so this might be an outlier:
image

@sanderegg
Member

These services were manually deleted. @GitHK, as I already asked: do you have a link to such an error?
If not, I will unassign/close the case.

@GitHK
Contributor Author

GitHK commented Apr 24, 2023

We have been monitoring the staging deployment and there was not a single incident involving sidecars in the last week, let alone on an autoscaled node. I'd say we can close the case for now.

@sanderegg
Member

OK, then I'll close it.

@sanderegg closed this as not planned Apr 24, 2023