
Add incident report: 2023-02-01 leap hub, dask-gateway induced outage #4


@consideRatio (Member) commented on Mar 16, 2023:

Marked as draft pending time available to refine this further based on feedback!


@pnasrat left a comment:


I've left some comments. In general an incident report should contain more than just links back to the issue.

@@ -0,0 +1,20 @@
# 2023-02-01 Heavy use of dask-gateway induced critical pod evictions

@pnasrat: Please add a summary statement and impact.


All times in UTC+1

- 2023-02-01 - [Summary of issue updated between ~8-9 PM](https://github.com/2i2c-org/infrastructure/issues/2126#issuecomment-1412554908)
@pnasrat: Generally I'd expect to see a full timeline here, not just a link, e.g. https://sre.google/sre-book/example-postmortem/

However, looking at the other incident reports in this repo, it doesn't seem like we have a clear shared expectation of what should be here.


## What went wrong

- I believe various critical pods on core nodes got evicted when prometheus started scraping metrics exporters on ~200 nodes
@pnasrat: Can we do better than belief? Is there no evidence to support your hypothesis here? That's what I'd expect the incident report/postmortem to reveal through analysis of the contributing factors, root causes, and triggers.
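A sketch of the kind of evidence-gathering this asks for (the kube context and the `support` namespace below are assumptions for illustration, not details from the incident): the Kubernetes Python client can list pods the kubelet evicted together with the kubelet's stated reason.

```python
# Sketch: report pods the kubelet evicted, with its stated reason.
# The "support" namespace is an assumption for illustration.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod(namespace="support").items:
    if pod.status.reason == "Evicted":
        # status.message carries the kubelet's explanation,
        # e.g. "The node was low on resource: memory."
        print(pod.metadata.name, pod.status.message)
```

Correlating that output with prometheus memory use around the time the scrape-target count jumped to ~200 nodes would turn the belief into a documented contributing factor.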


## Follow-up improvements

- #2127
@pnasrat: Please make these full markdown links with a summary line.

## What went wrong

- I believe various critical pods on core nodes got evicted when prometheus started scraping metrics exporters on ~200 nodes
- I think it's likely, but I can't say for sure, that the dask scheduler pod would also run into resource limitations with this number of workers
@pnasrat: Is this what went wrong? I'm not sure what this speculation is for in terms of analyzing the incident. Is this a lesson learned?
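A sketch of where that speculation could lead as a follow-up (assuming dask-gateway's per-cluster resource limit options; the numbers are illustrative, not the hub's actual configuration): the gateway's server config can reserve resources for each scheduler pod and cap how large a single cluster can grow.

```python
# Illustrative dask-gateway server config (e.g. dask_gateway_config.py, where `c`
# is the config object the gateway server provides). The values are assumptions
# for the sketch, not settings taken from the LEAP hub.
c.ClusterConfig.scheduler_cores = 2           # CPU reserved for each scheduler pod
c.ClusterConfig.scheduler_memory = "4 G"      # memory reserved for each scheduler pod
c.ClusterConfig.cluster_max_workers = 100     # ceiling on workers per cluster
c.ClusterConfig.cluster_max_memory = "500 G"  # total memory ceiling per cluster
```

Capping per-cluster size bounds the load any single scheduler pod has to absorb, which would address the second bullet without needing to prove it contributed to this particular outage.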

@consideRatio (Member, Author) replied:

@pnasrat thank you for the review! These are all relevant points; I'll leave them unanswered for now since I'm feeling time pressure to focus on other things.

@consideRatio consideRatio marked this pull request as draft March 16, 2023 18:58
@damianavila damianavila marked this pull request as ready for review October 6, 2023 15:31
@damianavila left a comment:

This is not going to receive any updates after 8 months, so let's get it merged as is.

@consideRatio consideRatio reopened this Oct 9, 2023
@consideRatio consideRatio merged commit dcf8eca into 2i2c-org:main Oct 9, 2023
1 check passed