Job reconcile only for jobs in opt-in namespaces #5638
Conversation
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected. Please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: PannagaRao. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing …
Hi @PannagaRao. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with … Once the patch is verified, the new status will be reflected by the … label.
/ok-to-test
Please sign the CLA.
@@ -22,6 +22,11 @@ controller:
  clientConnection:
    qps: 50
    burst: 100
managedJobsNamespaceSelector:
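For context, a full opt-in selector configuration might look like the following sketch. The `managed-by-kueue: true` label key/value comes from the PR description; the exact field layout around `managedJobsNamespaceSelector` is an assumption here, not taken from the diff.

```yaml
# Hypothetical sketch of an opt-in namespace selector in the Kueue
# configuration. The label key/value follow the PR description; the
# surrounding structure is assumed.
managedJobsNamespaceSelector:
  matchLabels:
    managed-by-kueue: "true"
```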
I don't know if we should make this the default behavior.
I would prefer to see this as optional and documented as a possible configuration.
Yes, I think this is something that can be customized by an admin. Also, it would be a breaking change.
@@ -311,6 +311,18 @@ func (r *JobReconciler) ReconcileGenericJob(ctx context.Context, req ctrl.Reques
	return ctrl.Result{}, client.IgnoreNotFound(err)
}

if features.Enabled(features.ManagedJobsNamespaceSelector) {
It would be great to see tests around this functionality.
Does this behavior work the same for JobSet, RayJob, RayCluster, Kubeflow jobs and AppWrappers?
It would be great to see tests around this functionality.
Yes, I expect some unit and e2e tests for this (for Job, in customconfigs, would be enough).
The behavior works the same for JobSet, RayJob, RayCluster, Kubeflow jobs and AppWrappers without any other modifications.
Verified it manually. Will work on having e2e tests for jobs.
if features.Enabled(features.ManagedJobsNamespaceSelector) {
	ns := corev1.Namespace{}
	if err := r.client.Get(ctx, client.ObjectKey{Name: req.Namespace}, &ns); err != nil {
		log.Error(err, "failed to get namespace for selector check")
		return ctrl.Result{}, err
	}
	if !r.managedJobsNamespaceSelector.Matches(labels.Set(ns.GetLabels())) {
		log.V(2).Info("Namespace not opted in for Kueue management", "namespace", ns.Name)
		return ctrl.Result{}, nil
	}
}
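The gate above can be modeled as a small, stdlib-only sketch. This is not the real `k8s.io/apimachinery` `labels.Selector` (it covers only `matchLabels` semantics), and the function name `namespaceOptedIn` is invented for illustration:

```go
package main

import "fmt"

// namespaceOptedIn models the check added in this PR: a Job is reconciled
// only if its namespace's labels satisfy the configured selector. The
// selector is simplified here to a set of required key/value pairs
// (matchLabels semantics only).
func namespaceOptedIn(nsLabels, matchLabels map[string]string) bool {
	for k, v := range matchLabels {
		if nsLabels[k] != v {
			// Namespace not opted in; the reconciler would return early.
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{"managed-by-kueue": "true"}
	fmt.Println(namespaceOptedIn(map[string]string{"managed-by-kueue": "true"}, selector)) // true
	fmt.Println(namespaceOptedIn(nil, selector))                                          // false
}
```

In the real code, an empty selector matches every namespace, which this simplified model also preserves (an empty matchLabels map returns true).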
We already have this logic in kueue/pkg/controller/jobframework/reconciler.go (lines 381 to 394 in f6dbdee):
// when manageJobsWithoutQueueName is enabled, standalone jobs without queue names
// are still not managed if they don't match the namespace selector.
if features.Enabled(features.ManagedJobsNamespaceSelector) && r.manageJobsWithoutQueueName && QueueName(job) == "" {
	ns := corev1.Namespace{}
	err := r.client.Get(ctx, client.ObjectKey{Name: job.Object().GetNamespace()}, &ns)
	if err != nil {
		log.Error(err, "failed to get job namespace")
		return ctrl.Result{}, err
	}
	if !r.managedJobsNamespaceSelector.Matches(labels.Set(ns.GetLabels())) {
		log.V(3).Info("namespace selector does not match, ignoring the job", "namespace", ns.Name)
		return ctrl.Result{}, nil
	}
}
Why are we duplicating it?
I think if we pull the namespace check up higher and do it independently of the value of r.manageJobsWithoutQueueName, then we can delete this second namespace check.
I would like to first understand what the motivation for the change is and what the affected configurations are. IIUC, currently in Kueue we have … I would even consider adding validation inside Kueue to prevent mixing … I think if a user wants to use …
So IIUC the intention of the PR is to skip reconciliation of Jobs with queue-name if they don't belong to the opt-in namespaces. I think this is a big conceptual change in the Kueue model, and it requires a KEP update first. However, I would rather suggest implementing a VAP against creating Jobs with queue-name in namespaces which are not maintained by Kueue.
This is already the pattern for Pods, Deployments, and StatefulSets. I argue that job-based integrations should be changed to match this.
Interesting, I didn't know that. So IIUC when: … then for Pods, STS, and LWS we already skip reconciliation? It is possibly an omission (I haven't checked experimentally), but the KEP is pretty clear: "If manageJobsWithoutQueueName is false, managedJobsNamespaceSelector has no effect: Kueue will manage exactly those instances of supported Kinds that have a queue-name label." So I would like to first understand the use-case and update the KEP.
Agree, the KEP should probably be updated to reflect what happened afterwards. This was a change made in Kueue 0.11 to prepare for removing …
@dgrove-oss correctly calls out the root cause of this: #4256. And to answer your question, yes. Pod-based integrations can work in an opt-in namespace approach with managedNamespaceSelector with either … This is why we are interested in getting similar behavior for Jobs, as the current difference is odd. @PannagaRao and I followed up and she will update the KEP with these changes.
PannagaRao force-pushed the branch from f6dbdee to 3bf7c63. Signed-off-by: Pannaga Rao Bhoja Ramamanohara
@PannagaRao: The following tests failed: … Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
@PannagaRao please rebase the PR and make the tests pass.
@PannagaRao please also add a release note section so that the robot removes the …
Aside from changes to the logic, please also add an integration or e2e test which demonstrates the fix. It is really important to have a higher-level test, because the interactions between webhooks and reconcilers can be tricky. I foresee tests for Pods, LWS (retroactively, because currently it works just by omission, I think), and Job.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Following the discussion in #5260, this PR introduces a fix to prevent the Job reconciler from managing and unsuspending workloads solely based on the presence of the queue-name label, without considering the namespace.
This change ensures that a Job is only managed by Kueue if it resides in a namespace explicitly labeled with managed-by-kueue: true. As a result, Jobs in unmanaged namespaces will be ignored by the reconciler, even if they specify a queue-name.
Note that this logic is evaluated at creation time. If the namespace is labeled after the Job is created, the existing Job will remain unmanaged. However, new Jobs created after labeling the namespace will be properly managed by Kueue.
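The opt-in labeling described above can be illustrated with a minimal namespace manifest. The `managed-by-kueue: true` key/value is taken from this description; the namespace name `team-a` is invented, and whether the configured selector matches exactly this key depends on the cluster's Kueue configuration.

```yaml
# Hypothetical namespace opting into Kueue management.
# Label key/value follow the PR description; name is illustrative.
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
  labels:
    managed-by-kueue: "true"
```

Per the note above, labeling the namespace only affects Jobs created after the label is applied; existing Jobs remain unmanaged.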