Skip to content

Job reconcile only for jobs in opt-in namespaces #5638

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

PannagaRao
Copy link

Signed-off-by: Pannaga Rao Bhoja Ramamanohara

What type of PR is this?

/kind bug

What this PR does / why we need it:

Following the discussion in #5260, this PR introduces a fix to prevent the Job reconciler from managing and unsuspending workloads solely based on the presence of the queue-name label, without considering the namespace.

This change ensures that a Job is only managed by Kueue if it resides in a namespace explicitly labeled with managed-by-kueue: true. As a result, Jobs in unmanaged namespaces will be ignored by the reconciler, even if they specify a queue-name.

Note that this logic is evaluated at creation time. If the namespace is labeled after the Job is created, the existing Job will remain unmanaged. However, new Jobs created after labeling the namespace will be properly managed by Kueue.

@k8s-ci-robot
Copy link
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jun 12, 2025
Copy link

linux-foundation-easycla bot commented Jun 12, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: PannagaRao / name: Pannaga Rao Bhoja Ramamanohara (3bf7c63)

Copy link

netlify bot commented Jun 12, 2025

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit 3bf7c63
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6855c3d9b6bd62000877b34d
😎 Deploy Preview https://deploy-preview-5638--kubernetes-sigs-kueue.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: PannagaRao
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Jun 12, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @PannagaRao. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 12, 2025
@kannon92
Copy link
Contributor

/ok-to-test

Please sign the CLA.

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 12, 2025
@@ -22,6 +22,11 @@ controller:
clientConnection:
qps: 50
burst: 100
managedJobsNamespaceSelector:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if we should make this the default behavior.

I would prefer to see this as optional and documented as a possible configuration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this is something that can be customized by an admin. Also, it would be a breaking change.

@@ -311,6 +311,18 @@ func (r *JobReconciler) ReconcileGenericJob(ctx context.Context, req ctrl.Reques
return ctrl.Result{}, client.IgnoreNotFound(err)
}

if features.Enabled(features.ManagedJobsNamespaceSelector) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to see tests around this functionality.

Does this behavior work the same for JobSet, RayJob, RayCluster, Kubeflow jobs and AppWrappers?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to see tests around this functionality.

Yes, I expect some unit and e2e tests (for Job would be enough in customconfigs) for this.

Copy link
Author

@PannagaRao PannagaRao Jun 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior works the same for JobSet, RayJob, RayCluster, Kubeflow jobs and AppWrappers without any other modifications.

Verified it manually. Will work on having e2e tests for jobs.

@kannon92
Copy link
Contributor

cc @dgrove-oss @mimowo

Comment on lines +314 to +324
if features.Enabled(features.ManagedJobsNamespaceSelector) {
ns := corev1.Namespace{}
if err := r.client.Get(ctx, client.ObjectKey{Name: req.Namespace}, &ns); err != nil {
log.Error(err, "failed to get namespace for selector check")
return ctrl.Result{}, err
}
if !r.managedJobsNamespaceSelector.Matches(labels.Set(ns.GetLabels())) {
log.V(2).Info("Namespace not opted in for Kueue management", "namespace", ns.Name)
return ctrl.Result{}, nil
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have this logic

// when manageJobsWithoutQueueName is enabled, standalone jobs without queue names
// are still not managed if they don't match the namespace selector.
if features.Enabled(features.ManagedJobsNamespaceSelector) && r.manageJobsWithoutQueueName && QueueName(job) == "" {
ns := corev1.Namespace{}
err := r.client.Get(ctx, client.ObjectKey{Name: job.Object().GetNamespace()}, &ns)
if err != nil {
log.Error(err, "failed to get job namespace")
return ctrl.Result{}, err
}
if !r.managedJobsNamespaceSelector.Matches(labels.Set(ns.GetLabels())) {
log.V(3).Info("namespace selector does not match, ignoring the job", "namespace", ns.Name)
return ctrl.Result{}, nil
}
}

Why we are duplicate it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we pull the namespace check up higher and do it independent of the value of r.manageJobsWithoutQueueName, then we can delete this second namespace check.

@mimowo
Copy link
Contributor

mimowo commented Jun 16, 2025

I would like to first understand what is the motivation for the change and what are the affected configuration.

IIUC, currently in Kueue we have manageJobsWithoutQueueName and ManagedJobsNamespaceSelector, and by design we should only be able to use ManagedJobsNamespaceSelector when manageJobsWithoutQueueName: true, see kep: "If manageJobsWithoutQueueName is false, managedJobsNamespaceSelector has no effect: Kueue will manage exactly those instances of supported Kinds that have a queue-name label.".

I would even consider adding validation inside Kueue to prevent mixing managedJobsNamespaceSelector with manageJobsWithoutQueueName: false. wdyt @dgrove-oss ?

I think if a user wants to use manageJobsWithoutQueueName: false, then it is enough not to add queue-name to skip the reconciliation.

@mimowo
Copy link
Contributor

mimowo commented Jun 16, 2025

So IIUC the intention of the PR is to skip reconciliation of Jobs with queue-name if they don't belong to the opt-in namepsaces. I think this is a big conceptual change in Kueue model, and requires KEP update first. However, I would suggest to rather implement VAP against creating Jobs with queue-name in the namespaces which are not maintained by Kueue.

@kannon92
Copy link
Contributor

So IIUC the intention of the PR is to skip reconciliation of Jobs with queue-name if they don't belong to the opt-in namepsaces. I think this is a big conceptual change in Kueue model, and requires KEP update first. However, I would suggest to rather implement VAP against creating Jobs with queue-name in the namespaces which are not maintained by Kueue.

This is already the pattern for pod, deployments and statefulsets. I argue that we are matching job-based integrations to match this.

@mimowo
Copy link
Contributor

mimowo commented Jun 17, 2025

This is already the pattern for pod, deployments and statefulsets. I argue that we are matching job-based integrations to match this.

Interesting, I didn't know that. So IIUC when:

  • manageJobsWithoutQueueName: false
  • Job has queue-name in the namespace excluded by managedJobsNamespaceSelector

then for Pods, STS and LWS we already skip reconciliation?

It is possible omission (I haven't checked experimentally), but the KEP is pretty clear "If manageJobsWithoutQueueName is false, managedJobsNamespaceSelector has no effect: Kueue will manage exactly those instances of supported Kinds that have a queue-name label." so I would like to first understand the use-case and update KEP.

@dgrove-oss
Copy link
Contributor

Agree the KEP should probably be updated to reflect what happened afterwards. This was change made in Kueue 0.11 to prepare for removing integrations.podOptions from the config (#4256) in some future version.

@kannon92
Copy link
Contributor

This is already the pattern for pod, deployments and statefulsets. I argue that we are matching job-based integrations to match this.

Interesting, I didn't know that. So IIUC when:

  • manageJobsWithoutQueueName: false
  • Job has queue-name in the namespace excluded by managedJobsNamespaceSelector

then for Pods, STS and LWS we already skip reconciliation?

It is possible omission (I haven't checked experimentally), but the KEP is pretty clear "If manageJobsWithoutQueueName is false, managedJobsNamespaceSelector has no effect: Kueue will manage exactly those instances of supported Kinds that have a queue-name label." so I would like to first understand the use-case and update KEP.

@dgrove-oss correctly calls out the root cause of this. #4256

And to answer your question, yes. Pod based integrations can work in an opt-in namespace approach with managedNamespaceSelector with either managedJobWithoutQueueName = false or true.

This is why we are interested in trying to get a similar behavior for Jobs as its odd.

@PannagaRao and I followed up and she will update the KEP with these changes.

Signed-off-by: Pannaga Rao Bhoja Ramamanohara
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 20, 2025
@k8s-ci-robot
Copy link
Contributor

@PannagaRao: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kueue-test-integration-extended-main 3bf7c63 link true /test pull-kueue-test-integration-extended-main
pull-kueue-test-integration-baseline-main 3bf7c63 link true /test pull-kueue-test-integration-baseline-main
pull-kueue-test-integration-multikueue-main 3bf7c63 link true /test pull-kueue-test-integration-multikueue-main
pull-kueue-test-unit-main 3bf7c63 link true /test pull-kueue-test-unit-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@mimowo
Copy link
Contributor

mimowo commented Jun 25, 2025

@PannagaRao please rebase the PR and make the tests pass

@mimowo
Copy link
Contributor

mimowo commented Jun 25, 2025

@PannagaRao please also add release note section so that the robot removes the do-not-merge/release-note-label-needed label

@mimowo
Copy link
Contributor

mimowo commented Jun 25, 2025

Aside from changes to the logic please also add an integration or e2e test which demonstrates the fix. It is really important to have a higher level test, because the interactions between webhooks and reconcilers can be tricky. I foresee the tests for Pods, LWS (retroactively, because currently it works just by omission I think), and Job.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. kind/bug Categorizes issue or PR as related to a bug. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants