scheduler: add 'none' type that bypasses PodGroup creation (#936) by jiaenren · Pull Request #975 · NVIDIA/OSMO

jiaenren · 2026-05-08T07:31:03Z

Summary

Resolves #936 by adopting Option C — Add 'none': a new BackendSchedulerType.NONE that bypasses PodGroup CR creation entirely and lets the cluster's default scheduler place pods. Removes the hard dependency on kai-scheduler for backends that don't need gang scheduling.

New NoneK8sObjectFactory in src/utils/job/kb_objects.py: returns pods only (no PodGroup), leaves schedulerName unset, emits no kai/runai labels, returns empty scheduler-resource specs.
New enum value BackendSchedulerType.NONE = 'none' in src/utils/connectors/postgres.py.
get_k8s_object_factory() dispatches NONE → NoneK8sObjectFactory.

priority_supported() and topology_supported() correctly return False for NONE, so the existing validation in Pool.validate and Workflow.submit already produces clear errors when users try to use those features against a none backend (no extra plumbing needed).

Trade-offs

Sacrifices: gang scheduling, priority classes, topology constraints.
Gains: workflows submit and run on vanilla Kubernetes with no kai-scheduler installed.

Test plan

Added 14 unit tests in src/utils/job/tests/test_kb_objects.py covering enum parsing, factory dispatch, absence of schedulerName, absence of kai.scheduler/* and runai/* labels, no PodGroup in output, no pod-group-name annotation, empty scheduler/cleanup specs, priority_supported()/topology_supported() returning False, and topology-keys-ignored.
Removed pre-existing tags = [\"manual\"] on test_kb_objects so the suite runs in CI; fixed 2 pre-existing tests using deprecated assertEquals.
bazel test //src/utils/job/... — 9/9 passing.
bazel test //src/utils/connectors/... — passing.
bazel test //src/service/core/config:config-pylint //src/service/core/config/tests:test_config_history — passing.
End-to-end on a real kind cluster with no kai-scheduler installed (run/tests/test_scheduler_none_kind.py):
- RED phase: KaiK8sObjectFactory's PodGroup is rejected by the API server with ensure CRDs are installed first — proves the cluster genuinely lacks kai and the test can fail.
- GREEN phase: NoneK8sObjectFactory's pod applies cleanly, reaches Running, and is scheduled by default-scheduler on the kind node.

The e2e harness is wired as osmo_py_binary //run/tests:test_scheduler_none_kind and runs against a user-supplied KIND_CONTEXT env var, so it can be invoked locally or hooked into a kind-only CI lane later.

Summary by CodeRabbit

New Features
- Added support for a new "NONE" scheduler type option. When enabled, pods use the cluster's default scheduler instead of custom gang-scheduling, eliminating the need for scheduler-specific resource constructs.
Tests
- Added comprehensive test coverage for the new scheduler type, validating pod scheduling behavior and resource generation.

Adds BackendSchedulerType.NONE so backends can run without kai-scheduler. The new NoneK8sObjectFactory skips PodGroup CR creation, leaves schedulerName unset (use the cluster's default scheduler), and emits no kai/runai labels or scheduler-side resources (Queue/Topology). Sacrifices gang scheduling and priority/topology constraints to remove the kai-scheduler dependency. Tests: - 14 new unit tests in src/utils/job/tests/test_kb_objects.py covering enum parsing, factory dispatch, pod spec absence-of-fields, no PodGroup in output, no kai/runai labels, empty scheduler resources, and priority/topology unsupported. - New run/tests/test_scheduler_none_kind.py e2e harness that, on a kind cluster with no kai-scheduler installed, asserts (a) the kai factory's PodGroup is rejected by the API server (proving the negative) and (b) the none factory's pod applies cleanly and reaches Running on default-scheduler.

coderabbitai · 2026-05-08T07:31:16Z

📝 Walkthrough

Walkthrough

This PR implements the NONE scheduler type (Option C from issue #936) to allow workflows to run on clusters without kai-scheduler by falling back to Kubernetes' default scheduler. Changes span enum definition, factory routing, comprehensive unit tests, and an end-to-end kind cluster validation script.

Changes

NONE Scheduler Type Implementation

Layer / File(s)	Summary
Scheduler Type Contract `src/utils/connectors/postgres.py`	`BackendSchedulerType.NONE = 'none'` enum value added; docstring clarifies that NONE scheduler leaves `scheduler_name` unused and creates no PodGroup CRs.
Factory Implementation & Routing `src/utils/job/kb_objects.py`	`NoneK8sObjectFactory` introduced; overrides `update_pod_k8s_resource` to no-op (leaving `schedulerName` unset). `get_k8s_object_factory` routes `BackendSchedulerType.NONE` to factory instance.
Unit Tests: Setup & Helpers `src/utils/job/tests/test_kb_objects.py`	Module imports `connectors` and `topology`; adds `_make_backend()` and `_make_pod()` helpers to construct test fixtures with NONE scheduler type.
Unit Tests: NONE Scheduler Validation `src/utils/job/tests/test_kb_objects.py`	`NoneSchedulerSettingsTest` validates enum parsing; `NoneK8sObjectFactoryTest` and `NoneFactoryWithTopologyKeysTest` assert: group resources are Pods only (no PodGroup), `schedulerName` and kai/runai labels are absent, pod updates are no-ops, cleanup excludes PodGroup, scheduler resources are empty, and topology keys are ignored.
Unit Tests: Host Mount Improvements `src/utils/job/tests/test_kb_objects.py`	Replaces deprecated `assertEquals` with `assertEqual` and enhances host-mount test assertions for `src_path`, `dest_path`, `volume()`, and `volume_mount()` outputs.
Unit Test Build Config `src/utils/job/tests/BUILD`	Removes `tags = ["manual"]` from `test_kb_objects` target to enable automated test execution.
E2E Test: Kind Cluster Validation `run/tests/test_scheduler_none_kind.py`	End-to-end script validates NONE scheduler on kind without kai-scheduler: verifies no kai CRDs, resets namespace, confirms KAI path fails (PodGroup rejection), confirms NONE path succeeds (Pod only, default-scheduler, node assignment), and cleans up in `finally` block.
E2E Test Build Config `run/tests/BUILD`	Loads `osmo_py_binary` macro; defines `test_scheduler_none_kind` binary target with `main`, `srcs`, dependencies (`priority`, `connectors`, `job`), and public visibility.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A bunny hops through schedulers three,
With Kai now optional—hooray, we're free!
The default scheduler hums its tune,
No PodGroup dance, just Pods that swoon,
Kind clusters leap without a care,
The NONE scheduler floats on air! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 9.68% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely summarizes the main change: adding a 'none' scheduler type that bypasses PodGroup creation, directly addressing issue `#936`.
Linked Issues check	✅ Passed	The PR fully implements Option C from `#936`: adds BackendSchedulerType.NONE, creates NoneK8sObjectFactory to bypass PodGroup creation, includes comprehensive unit and E2E tests, and enables workflows to run on vanilla Kubernetes without kai-scheduler.
Out of Scope Changes check	✅ Passed	All changes are directly aligned with implementing the 'none' scheduler type: enum addition, factory implementation, test coverage, and Bazel build configuration. No unrelated modifications detected.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch jiaenr/default-scheduler-control

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-05-08T07:36:38Z

Codecov Report

❌ Patch coverage is 87.50000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 44.59%. Comparing base (01a90b5) to head (08c893b).

Files with missing lines	Patch %	Lines
src/utils/job/kb_objects.py	85.71%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #975      +/-   ##
==========================================
+ Coverage   44.53%   44.59%   +0.06%     
==========================================
  Files         218      218              
  Lines       28537    28545       +8     
  Branches     4260     4261       +1     
==========================================
+ Hits        12708    12730      +22     
+ Misses      15208    15194      -14     
  Partials      621      621

Flag	Coverage Δ
backend	`45.49% <87.50%> (+0.07%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
src/utils/connectors/postgres.py	`71.50% <100.00%> (+0.01%)`	⬆️
src/utils/job/kb_objects.py	`86.91% <85.71%> (+8.10%)`	⬆️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@run/tests/test_scheduler_none_kind.py`:
- Around line 1-2: The SPDX header in this new test file exceeds 100 characters
and needs the same pylint waiver used elsewhere; edit the file-level header (the
top triple-quoted string containing the SPDX line) to append the inline pylint
disable for line-too-long (i.e., add the comment "# pylint:
disable=line-too-long" alongside the SPDX copyright statement) so the long
copyright line is exempted from the linter.

In `@src/utils/job/kb_objects.py`:
- Around line 620-623: The override update_pod_k8s_resource currently
intentionally ignores all parameters but doesn't suppress pylint
unused-argument; modify the method to silence the linter by either renaming
unused parameters with a leading underscore (e.g., _pod, _group_uuid,
_pool_name, _priority) or add a pylint disable comment (e.g., # pylint:
disable=unused-argument) on the method signature, keeping the no-op pass body
and the method name unchanged so it still overrides the base implementation.

In `@src/utils/job/tests/test_kb_objects.py`:
- Around line 1-2: Add the inline pylint waiver to the existing SPDX copyright
header line (the long SPDX-FileCopyrightText header string) so it won't trigger
line-too-long checks; specifically append the comment "# pylint:
disable=line-too-long" to that SPDX header line in test_kb_objects.py (the
top-of-file SPDX header) without splitting the copyright text across multiple
lines.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4674f97e-9cd6-4e53-b5b8-39e8214e33b1

📥 Commits

Reviewing files that changed from the base of the PR and between 01a90b5 and 08c893b.

📒 Files selected for processing (6)

run/tests/BUILD
run/tests/test_scheduler_none_kind.py
src/utils/connectors/postgres.py
src/utils/job/kb_objects.py
src/utils/job/tests/BUILD
src/utils/job/tests/test_kb_objects.py

💤 Files with no reviewable changes (1)

src/utils/job/tests/BUILD

coderabbitai · 2026-05-08T07:37:08Z

+"""
+SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add the long-line pylint waiver to this SPDX header too.

The new file has the same over-100-character copyright line, so it needs the inline line-too-long disable to stay consistent with repo policy.

As per coding guidelines, "If copyright lines exceed 100 characters, add # pylint: disable=line-too-long comment instead of breaking into multiple lines".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@run/tests/test_scheduler_none_kind.py` around lines 1 - 2, The SPDX header in this new test file exceeds 100 characters and needs the same pylint waiver used elsewhere; edit the file-level header (the top triple-quoted string containing the SPDX line) to append the inline pylint disable for line-too-long (i.e., add the comment "# pylint: disable=line-too-long" alongside the SPDX copyright statement) so the long copyright line is exempted from the linter.

coderabbitai · 2026-05-08T07:37:09Z

+    def update_pod_k8s_resource(self, pod: Dict, group_uuid: str, pool_name: str,
+                                priority: wf_priority.WorkflowPriority):
+        # No-op: leave schedulerName unset so the default scheduler picks it up.
+        pass


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Suppress the intentional unused parameters in the no-op override.

This override ignores every argument, but unlike the similar base-class methods it does not disable unused-argument, so pylint will flag it.

Proposed fix

def update_pod_k8s_resource(self, pod: Dict, group_uuid: str, pool_name: str, priority: wf_priority.WorkflowPriority): + # pylint: disable=unused-argument # No-op: leave schedulerName unset so the default scheduler picks it up. pass

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/utils/job/kb_objects.py` around lines 620 - 623, The override update_pod_k8s_resource currently intentionally ignores all parameters but doesn't suppress pylint unused-argument; modify the method to silence the linter by either renaming unused parameters with a leading underscore (e.g., _pod, _group_uuid, _pool_name, _priority) or add a pylint disable comment (e.g., # pylint: disable=unused-argument) on the method signature, keeping the no-op pass body and the method name unchanged so it still overrides the base implementation.

coderabbitai · 2026-05-08T07:37:09Z

 """
-SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add the long-line pylint waiver to the SPDX header.

This copyright line is over the repo’s 100-character limit, so it should carry the same inline waiver used in the other Python files touched by this PR.

As per coding guidelines, "If copyright lines exceed 100 characters, add # pylint: disable=line-too-long comment instead of breaking into multiple lines".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/utils/job/tests/test_kb_objects.py` around lines 1 - 2, Add the inline pylint waiver to the existing SPDX copyright header line (the long SPDX-FileCopyrightText header string) so it won't trigger line-too-long checks; specifically append the comment "# pylint: disable=line-too-long" to that SPDX header line in test_kb_objects.py (the top-of-file SPDX header) without splitting the copyright text across multiple lines.

jiaenren requested a review from a team as a code owner May 8, 2026 07:31

jiaenren had a problem deploying to internal-ci May 8, 2026 07:31 — with GitHub Actions Failure

coderabbitai Bot reviewed May 8, 2026

View reviewed changes

jiaenren marked this pull request as draft May 11, 2026 19:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scheduler: add 'none' type that bypasses PodGroup creation (#936)#975

scheduler: add 'none' type that bypasses PodGroup creation (#936)#975
jiaenren wants to merge 1 commit into
mainfrom
jiaenr/default-scheduler-control

jiaenren commented May 8, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 8, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

codecov Bot commented May 8, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 8, 2026

Uh oh!

coderabbitai Bot May 8, 2026

Uh oh!

coderabbitai Bot May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		"""
		SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Conversation

jiaenren commented May 8, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Trade-offs

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

codecov Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 8, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 8, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 8, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jiaenren commented May 8, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 8, 2026 •

edited

Loading

codecov Bot commented May 8, 2026 •

edited

Loading