Skip to content

scheduler: add 'none' type that bypasses PodGroup creation (#936)#975

Draft
jiaenren wants to merge 1 commit into
mainfrom
jiaenr/default-scheduler-control
Draft

scheduler: add 'none' type that bypasses PodGroup creation (#936)#975
jiaenren wants to merge 1 commit into
mainfrom
jiaenr/default-scheduler-control

Conversation

@jiaenren
Copy link
Copy Markdown
Collaborator

@jiaenren jiaenren commented May 8, 2026

Summary

Resolves #936 by adopting Option C — Add 'none': a new BackendSchedulerType.NONE that bypasses PodGroup CR creation entirely and lets the cluster's default scheduler place pods. Removes the hard dependency on kai-scheduler for backends that don't need gang scheduling.

  • New NoneK8sObjectFactory in src/utils/job/kb_objects.py: returns pods only (no PodGroup), leaves schedulerName unset, emits no kai/runai labels, returns empty scheduler-resource specs.
  • New enum value BackendSchedulerType.NONE = 'none' in src/utils/connectors/postgres.py.
  • get_k8s_object_factory() dispatches NONENoneK8sObjectFactory.

priority_supported() and topology_supported() correctly return False for NONE, so the existing validation in Pool.validate and Workflow.submit already produces clear errors when users try to use those features against a none backend (no extra plumbing needed).

Trade-offs

  • Sacrifices: gang scheduling, priority classes, topology constraints.
  • Gains: workflows submit and run on vanilla Kubernetes with no kai-scheduler installed.

Test plan

  • Added 14 unit tests in src/utils/job/tests/test_kb_objects.py covering enum parsing, factory dispatch, absence of schedulerName, absence of kai.scheduler/* and runai/* labels, no PodGroup in output, no pod-group-name annotation, empty scheduler/cleanup specs, priority_supported()/topology_supported() returning False, and topology-keys-ignored.
  • Removed pre-existing tags = [\"manual\"] on test_kb_objects so the suite runs in CI; fixed 2 pre-existing tests using deprecated assertEquals.
  • bazel test //src/utils/job/... — 9/9 passing.
  • bazel test //src/utils/connectors/... — passing.
  • bazel test //src/service/core/config:config-pylint //src/service/core/config/tests:test_config_history — passing.
  • End-to-end on a real kind cluster with no kai-scheduler installed (run/tests/test_scheduler_none_kind.py):
    • RED phase: KaiK8sObjectFactory's PodGroup is rejected by the API server with ensure CRDs are installed first — proves the cluster genuinely lacks kai and the test can fail.
    • GREEN phase: NoneK8sObjectFactory's pod applies cleanly, reaches Running, and is scheduled by default-scheduler on the kind node.

The e2e harness is wired as osmo_py_binary //run/tests:test_scheduler_none_kind and runs against a user-supplied KIND_CONTEXT env var, so it can be invoked locally or hooked into a kind-only CI lane later.

Summary by CodeRabbit

  • New Features

    • Added support for a new "NONE" scheduler type option. When enabled, pods use the cluster's default scheduler instead of custom gang-scheduling, eliminating the need for scheduler-specific resource constructs.
  • Tests

    • Added comprehensive test coverage for the new scheduler type, validating pod scheduling behavior and resource generation.

Adds BackendSchedulerType.NONE so backends can run without kai-scheduler.
The new NoneK8sObjectFactory skips PodGroup CR creation, leaves
schedulerName unset (use the cluster's default scheduler), and emits no
kai/runai labels or scheduler-side resources (Queue/Topology). Sacrifices
gang scheduling and priority/topology constraints to remove the
kai-scheduler dependency.

Tests:
- 14 new unit tests in src/utils/job/tests/test_kb_objects.py covering
  enum parsing, factory dispatch, pod spec absence-of-fields, no PodGroup
  in output, no kai/runai labels, empty scheduler resources, and
  priority/topology unsupported.
- New run/tests/test_scheduler_none_kind.py e2e harness that, on a kind
  cluster with no kai-scheduler installed, asserts (a) the kai factory's
  PodGroup is rejected by the API server (proving the negative) and
  (b) the none factory's pod applies cleanly and reaches Running on
  default-scheduler.
@jiaenren jiaenren requested a review from a team as a code owner May 8, 2026 07:31
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 8, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR implements the NONE scheduler type (Option C from issue #936) to allow workflows to run on clusters without kai-scheduler by falling back to Kubernetes' default scheduler. Changes span enum definition, factory routing, comprehensive unit tests, and an end-to-end kind cluster validation script.

Changes

NONE Scheduler Type Implementation

Layer / File(s) Summary
Scheduler Type Contract
src/utils/connectors/postgres.py
BackendSchedulerType.NONE = 'none' enum value added; docstring clarifies that NONE scheduler leaves scheduler_name unused and creates no PodGroup CRs.
Factory Implementation & Routing
src/utils/job/kb_objects.py
NoneK8sObjectFactory introduced; overrides update_pod_k8s_resource to no-op (leaving schedulerName unset). get_k8s_object_factory routes BackendSchedulerType.NONE to factory instance.
Unit Tests: Setup & Helpers
src/utils/job/tests/test_kb_objects.py
Module imports connectors and topology; adds _make_backend() and _make_pod() helpers to construct test fixtures with NONE scheduler type.
Unit Tests: NONE Scheduler Validation
src/utils/job/tests/test_kb_objects.py
NoneSchedulerSettingsTest validates enum parsing; NoneK8sObjectFactoryTest and NoneFactoryWithTopologyKeysTest assert: group resources are Pods only (no PodGroup), schedulerName and kai/runai labels are absent, pod updates are no-ops, cleanup excludes PodGroup, scheduler resources are empty, and topology keys are ignored.
Unit Tests: Host Mount Improvements
src/utils/job/tests/test_kb_objects.py
Replaces deprecated assertEquals with assertEqual and enhances host-mount test assertions for src_path, dest_path, volume(), and volume_mount() outputs.
Unit Test Build Config
src/utils/job/tests/BUILD
Removes tags = ["manual"] from test_kb_objects target to enable automated test execution.
E2E Test: Kind Cluster Validation
run/tests/test_scheduler_none_kind.py
End-to-end script validates NONE scheduler on kind without kai-scheduler: verifies no kai CRDs, resets namespace, confirms KAI path fails (PodGroup rejection), confirms NONE path succeeds (Pod only, default-scheduler, node assignment), and cleans up in finally block.
E2E Test Build Config
run/tests/BUILD
Loads osmo_py_binary macro; defines test_scheduler_none_kind binary target with main, srcs, dependencies (priority, connectors, job), and public visibility.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A bunny hops through schedulers three,
With Kai now optional—hooray, we're free!
The default scheduler hums its tune,
No PodGroup dance, just Pods that swoon,
Kind clusters leap without a care,
The NONE scheduler floats on air! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 9.68% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding a 'none' scheduler type that bypasses PodGroup creation, directly addressing issue #936.
Linked Issues check ✅ Passed The PR fully implements Option C from #936: adds BackendSchedulerType.NONE, creates NoneK8sObjectFactory to bypass PodGroup creation, includes comprehensive unit and E2E tests, and enables workflows to run on vanilla Kubernetes without kai-scheduler.
Out of Scope Changes check ✅ Passed All changes are directly aligned with implementing the 'none' scheduler type: enum addition, factory implementation, test coverage, and Bazel build configuration. No unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch jiaenr/default-scheduler-control

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 8, 2026

Codecov Report

❌ Patch coverage is 87.50000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 44.59%. Comparing base (01a90b5) to head (08c893b).

Files with missing lines Patch % Lines
src/utils/job/kb_objects.py 85.71% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #975      +/-   ##
==========================================
+ Coverage   44.53%   44.59%   +0.06%     
==========================================
  Files         218      218              
  Lines       28537    28545       +8     
  Branches     4260     4261       +1     
==========================================
+ Hits        12708    12730      +22     
+ Misses      15208    15194      -14     
  Partials      621      621              
Flag Coverage Δ
backend 45.49% <87.50%> (+0.07%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
src/utils/connectors/postgres.py 71.50% <100.00%> (+0.01%) ⬆️
src/utils/job/kb_objects.py 86.91% <85.71%> (+8.10%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@run/tests/test_scheduler_none_kind.py`:
- Around line 1-2: The SPDX header in this new test file exceeds 100 characters
and needs the same pylint waiver used elsewhere; edit the file-level header (the
top triple-quoted string containing the SPDX line) to append the inline pylint
disable for line-too-long (i.e., add the comment "# pylint:
disable=line-too-long" alongside the SPDX copyright statement) so the long
copyright line is exempted from the linter.

In `@src/utils/job/kb_objects.py`:
- Around line 620-623: The override update_pod_k8s_resource currently
intentionally ignores all parameters but doesn't suppress pylint
unused-argument; modify the method to silence the linter by either renaming
unused parameters with a leading underscore (e.g., _pod, _group_uuid,
_pool_name, _priority) or add a pylint disable comment (e.g., # pylint:
disable=unused-argument) on the method signature, keeping the no-op pass body
and the method name unchanged so it still overrides the base implementation.

In `@src/utils/job/tests/test_kb_objects.py`:
- Around line 1-2: Add the inline pylint waiver to the existing SPDX copyright
header line (the long SPDX-FileCopyrightText header string) so it won't trigger
line-too-long checks; specifically append the comment "# pylint:
disable=line-too-long" to that SPDX header line in test_kb_objects.py (the
top-of-file SPDX header) without splitting the copyright text across multiple
lines.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4674f97e-9cd6-4e53-b5b8-39e8214e33b1

📥 Commits

Reviewing files that changed from the base of the PR and between 01a90b5 and 08c893b.

📒 Files selected for processing (6)
  • run/tests/BUILD
  • run/tests/test_scheduler_none_kind.py
  • src/utils/connectors/postgres.py
  • src/utils/job/kb_objects.py
  • src/utils/job/tests/BUILD
  • src/utils/job/tests/test_kb_objects.py
💤 Files with no reviewable changes (1)
  • src/utils/job/tests/BUILD

Comment on lines +1 to +2
"""
SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add the long-line pylint waiver to this SPDX header too.

The new file has the same over-100-character copyright line, so it needs the inline line-too-long disable to stay consistent with repo policy.

As per coding guidelines, "If copyright lines exceed 100 characters, add # pylint: disable=line-too-long comment instead of breaking into multiple lines".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@run/tests/test_scheduler_none_kind.py` around lines 1 - 2, The SPDX header in
this new test file exceeds 100 characters and needs the same pylint waiver used
elsewhere; edit the file-level header (the top triple-quoted string containing
the SPDX line) to append the inline pylint disable for line-too-long (i.e., add
the comment "# pylint: disable=line-too-long" alongside the SPDX copyright
statement) so the long copyright line is exempted from the linter.

Comment on lines +620 to +623
def update_pod_k8s_resource(self, pod: Dict, group_uuid: str, pool_name: str,
priority: wf_priority.WorkflowPriority):
# No-op: leave schedulerName unset so the default scheduler picks it up.
pass
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Suppress the intentional unused parameters in the no-op override.

This override ignores every argument, but unlike the similar base-class methods it does not disable unused-argument, so pylint will flag it.

Proposed fix
 def update_pod_k8s_resource(self, pod: Dict, group_uuid: str, pool_name: str,
                             priority: wf_priority.WorkflowPriority):
+    # pylint: disable=unused-argument
     # No-op: leave schedulerName unset so the default scheduler picks it up.
     pass
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/utils/job/kb_objects.py` around lines 620 - 623, The override
update_pod_k8s_resource currently intentionally ignores all parameters but
doesn't suppress pylint unused-argument; modify the method to silence the linter
by either renaming unused parameters with a leading underscore (e.g., _pod,
_group_uuid, _pool_name, _priority) or add a pylint disable comment (e.g., #
pylint: disable=unused-argument) on the method signature, keeping the no-op pass
body and the method name unchanged so it still overrides the base
implementation.

Comment on lines 1 to +2
"""
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add the long-line pylint waiver to the SPDX header.

This copyright line is over the repo’s 100-character limit, so it should carry the same inline waiver used in the other Python files touched by this PR.

As per coding guidelines, "If copyright lines exceed 100 characters, add # pylint: disable=line-too-long comment instead of breaking into multiple lines".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/utils/job/tests/test_kb_objects.py` around lines 1 - 2, Add the inline
pylint waiver to the existing SPDX copyright header line (the long
SPDX-FileCopyrightText header string) so it won't trigger line-too-long checks;
specifically append the comment "# pylint: disable=line-too-long" to that SPDX
header line in test_kb_objects.py (the top-of-file SPDX header) without
splitting the copyright text across multiple lines.

@jiaenren jiaenren marked this pull request as draft May 11, 2026 19:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug/feat: BACKEND scheduler_type enum hardcoded to 'kai' — no fallback to K8s default-scheduler in 6.2.10

1 participant