For failed deploys, add the information about the k8s node #762

Open
ayatsynych opened this issue Oct 27, 2020 · 3 comments

Comments

@ayatsynych

Feature request

Proposal: If possible, when a deploy fails, add information about the k8s node. Some deployment failures are caused by issues with the underlying node. Outputting the node information (name) for each failed resource would make these causes much easier to identify.
Having this information in the logs also helps with audits and debugging.
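
As a rough illustration of the kind of output being requested, here is a minimal sketch, not the project's implementation. It assumes pod data shaped like the output of `kubectl get pods -o json` and reads the standard `spec.nodeName` and `status.phase` fields. The sample data, the function name `describe_unready_pods`, and the output format are all hypothetical, and treating every pod that is not `Running` or `Succeeded` as failed is a simplification.

```python
import json

# Hypothetical sample, shaped like the output of `kubectl get pods -o json`.
SAMPLE = json.loads("""
{"items": [
  {"metadata": {"name": "web-1"}, "spec": {"nodeName": "node-a"},
   "status": {"phase": "Running"}},
  {"metadata": {"name": "web-2"}, "spec": {"nodeName": "node-b"},
   "status": {"phase": "Pending"}},
  {"metadata": {"name": "web-3"}, "spec": {},
   "status": {"phase": "Pending"}}
]}
""")


def describe_unready_pods(pod_list):
    """Return one log line per pod that is not Running or Succeeded."""
    lines = []
    for pod in pod_list.get("items", []):
        phase = pod.get("status", {}).get("phase", "Unknown")
        # Simplification: a Running pod can still be unhealthy.
        if phase in ("Running", "Succeeded"):
            continue
        # spec.nodeName is absent until the pod has been scheduled.
        node = pod.get("spec", {}).get("nodeName") or "<unscheduled>"
        name = pod["metadata"]["name"]
        lines.append(f"Pod/{name}: phase={phase}, node={node}")
    return lines


for line in describe_unready_pods(SAMPLE):
    print(line)
```
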

@dturn
Contributor

dturn commented Oct 27, 2020

In general, this sounds like it would be helpful. Though, can you be a bit more specific about the types of failures you're seeing? E.g. DS pods not scheduling, container problems, ...

The other big question: are you interested in PRing this, or is this just a request?

@ayatsynych
Author

> The other big question: are you interested in PRing this, or is this just a request?

This is just a request, but if we find the time, we will consider implementing it ourselves.

> Though, can you be a bit more specific about the types of failures you're seeing?

A few specific examples we have seen in the past:

  1. The underlying node has Docker daemon issues, and all pods scheduled on that node end up in a bad state (stuck in "Terminating" or "Initializing").
  2. The underlying node has performance issues that cause a timeout. In the specific case we saw, image pulls on one of the nodes were extremely slow. Surfacing the node name for all the failed resources would have pointed to the node-specific problem right away.
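
To show why the node name helps in cases like these, here is a small sketch that groups failed resources by node. Everything in it is an assumption made for illustration: the `FAILED` sample data, the helper names, and the `min_share` threshold, which is an arbitrary cutoff rather than a recommended value.

```python
from collections import Counter

# Hypothetical (pod name, node name) pairs for resources that failed a deploy.
FAILED = [
    ("web-1", "node-b"),
    ("web-2", "node-b"),
    ("worker-1", "node-b"),
    ("web-3", "node-a"),
]


def failures_by_node(failed_pods):
    """Count failed pods per node."""
    return Counter(node for _, node in failed_pods)


def suspect_nodes(failed_pods, min_share=0.5):
    """Return nodes that account for more than min_share of all failures."""
    counts = failures_by_node(failed_pods)
    total = sum(counts.values())
    if total == 0:
        return []
    return [node for node, n in counts.most_common() if n / total > min_share]


print(failures_by_node(FAILED))
print(suspect_nodes(FAILED))
```

With the sample data, three of the four failures are on `node-b`, so it is the only node reported as a suspect.
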

@ajshepley
Contributor

cc @Shopify/pipeline
