keps/sig-api-machinery/5366-graceful-leader-transition/README.md
Lines changed: 14 additions & 6 deletions
@@ -293,19 +293,26 @@ Risk 1: Resource exhaustion: Memory leaks may exist in the processes that were
 previously masked by doing a full shutdown and restart loop.

 - Severity: Medium high
-- Controllers will continue to function (potentially in degraded state due to lack of resources), and may be restarted frequently. However, cluster should continue to function.
+- Controllers will continue to function (potentially in degraded state due to
+lack of resources), and may be restarted frequently. However, cluster should
+continue to function.

-Risk 2: Wedged KCM: There is a risk that controllers and the
-scheduler are not properly respecting context shutdowns. This can either result in multiple instances of controllers running or no instances running despite the lock being held.
+Risk 2: Wedged KCM: There is a risk that controllers and the scheduler are not
+properly respecting context shutdowns. This can either result in multiple
+instances of controllers running or no instances running despite the lock being
+held.

-- Severity: Extreme
-- Breaking mutual exclusion guarantees can put the cluster into a non-desirable state. A manual user intervention is possible but if the problem is triggered due to a problematic component, the issue will resurface and the best path for mitigation is to turn off the feature.
+- Severity: High
+- Breaking mutual exclusion guarantees can put the cluster into a non-desirable
+state. A manual user intervention is possible but if the problem is triggered
+due to a problematic component, the issue will resurface and the best path for
+mitigation is to turn off the feature.

 Risk 3: Futureproofing: An additional risk is that even if all the current code
 is safe and respects shutting down gracefully, new controllers/modifications to
 kcm or scheduler could create subtle problems in shutdown and transition.

-- Severity: Medium
+- Severity: High
 - Leads to either risk 1 or 2.

@@ -447,6 +454,7 @@ Will test that feature enablement will still result in a functional cluster.
 #### Beta

 - e2e tests
+- Address how to minimize risks of putting KCM or scheduler in a "wedged" state