
Add incident report: 2023-02-01 leap hub, dask-gateway induced outage #4


@consideRatio (Member) commented on Mar 16, 2023:

Marked as draft pending time available to refine this further based on feedback!


@pnasrat left a comment:


I've left some comments. In general an incident report should contain more than just links back to the issue.

@@ -0,0 +1,20 @@
# 2023-02-01 Heavy use of dask-gateway induced critical pod evictions

@pnasrat: Please add a summary statement and impact.


All times in UTC+1

- 2023-02-01 - [Summary of issue updated between ~8-9 PM](https://github.com/2i2c-org/infrastructure/issues/2126#issuecomment-1412554908)
@pnasrat: Generally I'd expect to see a full timeline here, not just a link, e.g. https://sre.google/sre-book/example-postmortem/

However, looking at the other incident reports in this repo, it doesn't seem like we have a clear shared expectation of what should be here.


## What went wrong

- I believe various critical pods on core nodes got evicted when prometheus started scraping metrics exporters on ~200 nodes
@pnasrat: Can we do better than belief? Is there no evidence to support your hypothesis here? That's what I'd expect the incident report/postmortem to reveal through analysis of the contributing factors, root causes, and triggers.
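A sketch of the kind of evidence-gathering this asks for (the kube context and the `support` namespace below are assumptions for illustration, not details from the incident): the Kubernetes Python client can list pods the kubelet evicted together with the kubelet's stated reason.

```python
# Sketch: report pods the kubelet evicted, with its stated reason.
# The "support" namespace is an assumption for illustration.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod(namespace="support").items:
    if pod.status.reason == "Evicted":
        # status.message carries the kubelet's explanation,
        # e.g. "The node was low on resource: memory."
        print(pod.metadata.name, pod.status.message)
```

Correlating that output with prometheus memory use around the time the scrape-target count jumped to ~200 nodes would turn the belief into a documented contributing factor.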


## Follow-up improvements

- #2127
@pnasrat: Please make these full markdown links with a summary line.

## What went wrong

- I believe various critical pods on core nodes got evicted when prometheus started scraping metrics exporters on ~200 nodes
- I think it's likely, but I can't say for sure, that the dask scheduler pod would also run into resource limitations with this number of workers
@pnasrat: Is this what went wrong? I'm not sure what this speculation is for in terms of analyzing the incident. Is this a lesson learned?
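A sketch of where that speculation could lead as a follow-up (assuming dask-gateway's per-cluster resource limit options; the numbers are illustrative, not the hub's actual configuration): the gateway's server config can reserve resources for each scheduler pod and cap how large a single cluster can grow.

```python
# Illustrative dask-gateway server config (e.g. dask_gateway_config.py, where `c`
# is the config object the gateway server provides). The values are assumptions
# for the sketch, not settings taken from the LEAP hub.
c.ClusterConfig.scheduler_cores = 2           # CPU reserved for each scheduler pod
c.ClusterConfig.scheduler_memory = "4 G"      # memory reserved for each scheduler pod
c.ClusterConfig.cluster_max_workers = 100     # ceiling on workers per cluster
c.ClusterConfig.cluster_max_memory = "500 G"  # total memory ceiling per cluster
```

Capping per-cluster size bounds the load any single scheduler pod has to absorb, which would address the second bullet without needing to prove it contributed to this particular outage.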

@consideRatio (Member, Author) replied:

@pnasrat thank you for the review! These are all relevant points; I'll leave them unanswered for now since I'm feeling time pressure to focus on other things.

@consideRatio consideRatio marked this pull request as draft March 16, 2023 18:58
@damianavila damianavila marked this pull request as ready for review October 6, 2023 15:31
@damianavila left a comment:

This is not going to receive any updates after 8 months, so let's get it merged as is.

@consideRatio consideRatio reopened this Oct 9, 2023
@consideRatio consideRatio merged commit dcf8eca into 2i2c-org:main Oct 9, 2023
1 check passed