
fix: use notifyContext to manage the operator exit #2463


Open
wants to merge 5 commits into master

Conversation

jiuker
Contributor

@jiuker jiuker commented Jun 5, 2025

fix: use notifyContext to manage the operator exit

Description

When the operator loses its leader role, it looks like the operator cannot exit.
Fixes #2458

Related Issue

Type of Change

  • Bug fix 🐛
  • New feature 🚀
  • Breaking change 🚨
  • Documentation update 📖
  • Refactor 🔨
  • Other (please describe) ⬇️

Screenshots (if applicable e.g before/after)

Checklist

  • I have tested these changes
  • I have updated relevant documentation (if applicable)
  • I have added necessary unit tests (if applicable)

Test Steps

Additional Notes / Context

fix: use notifyContext to manage the operator exit
@jiuker jiuker requested review from pjuarezd and ramondeklein and removed request for pjuarezd June 5, 2025 07:52
lint
@jiuker jiuker requested review from cniackz and allanrogerr June 5, 2025 07:58
@ramondeklein
Contributor

Can you please add the test steps, so this can be tested?

@jiuker
Contributor Author

jiuker commented Jun 5, 2025

@ramondeklein Since we don't support infrastructure-disconnect testing, I verified this by adding the following temporary code:

// StartOperator starts the MinIO Operator controller
func StartOperator(kubeconfig string) {
	_ = v2.AddToScheme(scheme.Scheme)
	_ = stsv1beta1.AddToScheme(scheme.Scheme)
	_ = stsv1alpha1.AddToScheme(scheme.Scheme)
	klog.Info("Starting MinIO Operator")

	// set up signals, so we handle the first shutdown signal gracefully
	ctx, cancel := setupSignalHandler(context.Background())
	defer cancel()
	done := ctx.Done()

	// test-only addition: force a cancellation after 5 seconds to
	// exercise the graceful-shutdown path
	go func() {
		time.Sleep(time.Second * 5)
		cancel()
	}()

With this change, I can see the pod restart count increase by 1.
cniackz
cniackz previously approved these changes Jun 6, 2025
Contributor

@cniackz cniackz left a comment

I’m approving this PR because it provides a pragmatic improvement to how the operator handles leadership loss, which currently results in a stuck and resource-hungry state. While it doesn’t solve the root cause (lack of probes or retry logic), it introduces a clean and safe exit mechanism using notifyContext, allowing Kubernetes to restart the pod. This is a net gain in resilience, especially given that the current behavior requires manual intervention. The code is minimal, targeted, and does not introduce complexity or regression risk.

@cniackz cniackz added the enhancement New feature or request label Jun 6, 2025
Contributor

@allanrogerr allanrogerr left a comment

Minor change for clarity

Co-authored-by: Allan Roger Reid <allanrogerreid@gmail.com>
@jiuker jiuker requested review from allanrogerr and cniackz June 11, 2025 03:31
cniackz
cniackz previously approved these changes Jun 18, 2025
harshavardhana
harshavardhana previously approved these changes Jun 18, 2025
Contributor

@ramondeklein ramondeklein left a comment

I did test this PR and the pod now properly terminates when it loses the lease. I do think we need to refactor this code, because it uses two mechanisms:

  • a context.Context that is initialized when starting the operator.
  • a chan struct{} that is triggered when the context is cancelled.

I think it's better to have just a single mechanism: Controller.Start now receives both the context and the channel, which serve exactly the same purpose. We need to fix that.

refactor
@jiuker jiuker dismissed stale reviews from harshavardhana and cniackz via cbb5570 June 19, 2025 01:42
Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

Operator is not resilient to a short API-Server downtime
5 participants