@amaslenn

Summary

Two issues were discovered:

  1. Job name wasn't sanitized (addresses internal bug).
  2. Job deletion crashed if job was already gone.

Test Plan

  1. CI (extended)
  2. Manual runs on k8s system.

Additional Notes

@coderabbitai

coderabbitai bot commented Jan 20, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Adds Kubernetes job-name sanitization and validation in JSON generation; tightens sanitization (max length, empty-result check); improves Kubernetes system error handling/logging for job deletion and missing pods; and adds unit and integration tests covering sanitization behavior and JSON output.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Kubernetes system error handling: src/cloudai/systems/kubernetes/kubernetes_system.py | _delete_batch_job now catches ApiException, treats 404 as debug and returns, logs/re-raises other errors with status/reason; casts success to V1Status and logs deletion status; store_logs_for_job logs missing pods at debug level. |
| Job metadata generation: src/cloudai/workloads/sleep/kubernetes_json_gen_strategy.py | Job metadata name now uses self.sanitize_k8s_job_name(self.test_run.name) (namespace unchanged); copyright year updated. |
| Sanitization logic: src/cloudai/_core/json_gen_strategy.py | sanitize_k8s_job_name applies early truncation to 253 chars, performs trailing cleanup after truncation, and raises ValueError if the sanitized result is empty. |
| Tests (sanitization unit tests): tests/json_gen_strategy/test_common_kubernetes.py | New parameterized tests for sanitize_k8s_job_name covering underscores, dots, special chars, uppercase, length truncation, leading/trailing hyphen trimming, and invalid-only-special-chars case (raises). |
| Tests (integration for sleep strategy): tests/json_gen_strategy/test_sleep_kubernetes_json_gen_strategy.py | New test ensuring SleepKubernetesJsonGenStrategy produces JSON metadata name equal to sanitize_k8s_job_name(TestRun.name) for a mangled test-run name. |
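
The sanitization behavior described under "Sanitization logic" can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name mirrors sanitize_k8s_job_name, but the regular expressions, the choice to replace dots and underscores with hyphens, and the error message are assumptions.

```python
import re

MAX_K8S_NAME_LEN = 253  # length limit cited in the walkthrough


def sanitize_k8s_job_name(job_name: str) -> str:
    """Illustrative stand-in for the method discussed in this PR."""
    name = job_name.lower()
    # Assumption: anything outside [a-z0-9-] (including '_' and '.') becomes '-'.
    name = re.sub(r"[^a-z0-9-]", "-", name)
    # Drop leading characters that are not alphanumeric.
    name = re.sub(r"^[^a-z0-9]+", "", name)
    # Truncate first, then clean the tail, so the cut cannot leave a trailing '-'.
    name = name[:MAX_K8S_NAME_LEN]
    name = re.sub(r"[^a-z0-9]+$", "", name)
    if not name:
        raise ValueError(f"Job name '{job_name}' contains no valid characters.")
    return name
```

Under these assumptions, a name such as "My_Test.Run" becomes "my-test-run", while a name made only of special characters raises ValueError instead of producing an empty job name.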

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I nibble names until they're lean,
Dots and caps turned calm and clean,
If jobs vanish, logs speak low,
Tests hop in to check the flow,
Small hops, tidy runs — I preen.

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Address issues with Sleep test over K8s' clearly summarizes the main changes: fixing Sleep test issues on Kubernetes by sanitizing job names and handling job deletion edge cases. |
| Description check | ✅ Passed | The description clearly relates to the changeset by identifying the two specific issues being addressed: job name sanitization and job deletion crash handling, which match the changes shown in the raw summary. |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/cloudai/systems/kubernetes/kubernetes_system.py (1)

2-2: Update copyright year to fix CI failure.

The pipeline indicates the copyright year is invalid. Update from 2024-2025 to 2024-2026 to match policy.

Proposed fix
-# Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
src/cloudai/workloads/sleep/kubernetes_json_gen_strategy.py (1)

2-2: Update copyright year to fix CI failure.

The pipeline indicates the copyright year is invalid. Update from 2024-2025 to 2024-2026 to match policy.

Proposed fix
-# Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
🤖 Fix all issues with AI agents
In `@tests/json_gen_strategy/test_common_kubernetes.py`:
- Around line 40-43: The test test_job_name_sanitization injects base_tr which
pulls in a SlurmSystem, causing base_tr.output_path and setup to reference SLURM
instead of Kubernetes; make base_tr independent or use a Kubernetes-compatible
fixture so the Kubernetes-only test doesn't instantiate SlurmSystem. Update the
test to either create a minimal TestRun instance locally (set name and
output_path appropriate for Kubernetes) or add/use a fixture like k8s_base_tr
that constructs TestRun without SlurmSystem dependency, then pass that into
MyJsonGenStrategy and assert sanitize_k8s_job_name(base_tr.name) == expected;
ensure references to KubernetesSystem, MyJsonGenStrategy, sanitize_k8s_job_name
and base_tr.output_path are updated accordingly.

In `@tests/json_gen_strategy/test_sleep_kubernetes_json_gen_strategy copy.py`:
- Around line 1-28: The test file was accidentally saved with a " copy" suffix
causing whitespace in the filename which breaks pytest discovery and imports;
rename the file to remove the space/suffix (use
test_sleep_kubernetes_json_gen_strategy.py) and ensure any test references or CI
configs point to the new name; verify the test function
test_job_name_sanitization and imports for SleepKubernetesJsonGenStrategy,
SleepTestDefinition, TestRun, and KubernetesSystem still resolve after renaming.

@greptile-apps

greptile-apps bot commented Jan 20, 2026

Greptile Overview

Greptile Summary

Fixes two K8s-related issues in the Sleep test workload: job name sanitization order and graceful handling of already-deleted jobs.

  • Job name sanitization: Fixed the order of operations in sanitize_k8s_job_name to truncate names to 253 characters before removing trailing hyphens. The previous implementation could produce invalid K8s job names when truncation resulted in trailing hyphens.
  • Job deletion robustness: Added try-catch handling in _delete_batch_job to gracefully handle 404 errors when attempting to delete jobs that have already been removed, preventing crashes.
  • Applied sanitization: Updated SleepKubernetesJsonGenStrategy to use sanitize_k8s_job_name when creating jobs, ensuring all Sleep test jobs have valid K8s names.
  • Test coverage: Added comprehensive tests covering edge cases including truncation with trailing hyphens and invalid character handling.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk.
  • The changes are targeted bug fixes with comprehensive test coverage. The sanitization logic fix follows K8s naming requirements correctly, and the exception handling prevents crashes without changing behavior for successful deletions. All changes are well-tested and address real production issues.
  • No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/cloudai/_core/json_gen_strategy.py | Fixed job name sanitization order: truncate before removing trailing hyphens to prevent invalid K8s names. |
| src/cloudai/systems/kubernetes/kubernetes_system.py | Added proper exception handling for 404 errors when deleting already-deleted jobs, and fixed cast type from V1Job to V1Status. |
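
The cast fix noted for kubernetes_system.py affects type checking only, because typing.cast performs no runtime check or conversion. The sketch below illustrates this with stand-in classes; V1Job and V1Status here are hypothetical placeholders for the Kubernetes client models, and fake_delete_namespaced_job is an invented helper, not a client API.

```python
from typing import cast


class V1Job:
    """Hypothetical stand-in for the Kubernetes client's V1Job model."""


class V1Status:
    """Hypothetical stand-in for the Kubernetes client's V1Status model."""

    def __init__(self, status: str) -> None:
        self.status = status


def fake_delete_namespaced_job() -> object:
    # Invented helper: mimics a delete call that returns a status object.
    return V1Status(status="Success")


response = fake_delete_namespaced_job()

# The wrong cast does not fail at runtime; it only misleads type checkers.
as_job = cast(V1Job, response)

# The corrected cast matches the object actually returned.
as_status = cast(V1Status, response)
```

Both as_job and as_status refer to the same V1Status instance, which is why the earlier cast did not crash but would hide attribute errors from static analysis.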

Sequence Diagram

sequenceDiagram
    participant User
    participant SleepJsonGen as SleepKubernetesJsonGenStrategy
    participant JsonGenStrategy
    participant K8sSystem as KubernetesSystem
    participant K8sAPI as Kubernetes API

    User->>SleepJsonGen: gen_json()
    SleepJsonGen->>JsonGenStrategy: sanitize_k8s_job_name(test_run.name)
    Note over JsonGenStrategy: 1. Convert to lowercase<br/>2. Replace invalid chars with '-'<br/>3. Remove leading non-alphanumeric<br/>4. Truncate to 253 chars (NEW ORDER)<br/>5. Remove trailing non-alphanumeric<br/>6. Validate not empty
    JsonGenStrategy-->>SleepJsonGen: sanitized_name
    SleepJsonGen->>K8sSystem: create_job(job_spec)
    K8sSystem->>K8sAPI: create_namespaced_job()
    K8sAPI-->>K8sSystem: job created
    
    Note over User,K8sAPI: Later: Job Deletion Flow
    
    User->>K8sSystem: delete_job(job_name, "Job")
    K8sSystem->>K8sSystem: _delete_batch_job(job_name)
    K8sSystem->>K8sAPI: delete_namespaced_job()
    alt Job exists
        K8sAPI-->>K8sSystem: V1Status
        K8sSystem-->>User: success
    else Job already deleted (404)
        K8sAPI-->>K8sSystem: ApiException(status=404)
        Note over K8sSystem: Log debug message<br/>Return gracefully (NEW)
        K8sSystem-->>User: success
    else Other error
        K8sAPI-->>K8sSystem: ApiException
        K8sSystem-->>User: raise exception
    end
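
The deletion flow in the diagram can be sketched as follows. This is a minimal sketch under stated assumptions: ApiException is a local stand-in for the Kubernetes client exception (only its status and reason attributes are modeled), FakeBatchApi is a hypothetical in-memory replacement for the batch API, and the boolean return value is added here for illustration and is not claimed to match the PR.

```python
import logging


class ApiException(Exception):
    """Local stand-in for the Kubernetes client's ApiException (illustration only)."""

    def __init__(self, status: int, reason: str = "") -> None:
        super().__init__(f"({status}) {reason}")
        self.status = status
        self.reason = reason


class FakeBatchApi:
    """Hypothetical in-memory API that knows a fixed set of job names."""

    def __init__(self, existing_jobs: set[str]) -> None:
        self.existing_jobs = set(existing_jobs)

    def delete_namespaced_job(self, name: str, namespace: str) -> dict:
        if name not in self.existing_jobs:
            raise ApiException(404, "Not Found")
        self.existing_jobs.remove(name)
        return {"status": "Success"}


def delete_batch_job(api, namespace: str, job_name: str) -> bool:
    """Delete a job; return True if deleted, False if it was already gone."""
    try:
        response = api.delete_namespaced_job(name=job_name, namespace=namespace)
    except ApiException as e:
        if e.status == 404:
            # Already deleted: log quietly and return instead of crashing.
            logging.debug("Job '%s' not found; it may have already been deleted.", job_name)
            return False
        # Any other API error is logged with status/reason and re-raised.
        logging.error("Failed to delete job '%s': %s %s", job_name, e.status, e.reason)
        raise
    logging.debug("Job '%s' deleted with status: %s", job_name, response["status"])
    return True
```

Calling delete_batch_job twice for the same name against FakeBatchApi returns True and then False, while an error with any status other than 404 still propagates to the caller.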

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments

@greptile-apps

greptile-apps bot commented Jan 20, 2026

Additional Comments (1)

src/cloudai/systems/kubernetes/kubernetes_system.py
Inconsistent error handling between _delete_mpi_job and _delete_batch_job. MPI job deletion doesn't return early on 404, so it may try to log the success message for a non-existent job.

        except lazy.k8s.client.ApiException as e:
            if e.status == 404:
                logging.debug(f"MPIJob '{job_name}' not found. It may have already been deleted.")
                return

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@src/cloudai/systems/kubernetes/kubernetes_system.py`:
- Around line 434-453: The delete_namespaced_job call's response is being cast
incorrectly to k8s.client.V1Job; update the cast for api_response returned by
batch_v1.delete_namespaced_job to k8s.client.V1Status (or the appropriate
V1Status type used in your kube client) and ensure any necessary
imports/reference for V1Status are present so the variable type matches the
actual API response from delete_namespaced_job.

In `@src/cloudai/workloads/sleep/kubernetes_json_gen_strategy.py`:
- Around line 33-39: The code builds a Job manifest using
sanitize_k8s_job_name(self.test_run.name) but doesn’t guard against the
sanitizer returning an empty string; update kubernetes_json_gen_strategy.py to
call sanitize_k8s_job_name first, check its return value, and if it is empty
raise a clear exception (e.g., ValueError) that includes the original
self.test_run.name and explains the name contained no valid characters so
Kubernetes job creation would fail; reference the sanitize_k8s_job_name call and
the Job creation block so the check happens before inserting the name into
job_spec.
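
A caller-side guard of the kind suggested here might look like the sketch below. It is an illustration only: build_job_metadata and its sanitize parameter are hypothetical names, and per the walkthrough summary the check appears to live inside sanitize_k8s_job_name itself rather than in the strategy.

```python
from typing import Callable


def build_job_metadata(raw_name: str, namespace: str, sanitize: Callable[[str], str]) -> dict:
    """Return Job metadata, refusing to emit an empty name."""
    name = sanitize(raw_name)
    if not name:
        # Fail early with the original name so the cause is easy to trace.
        raise ValueError(
            f"Test run name '{raw_name}' contains no valid characters for a Kubernetes job name."
        )
    return {"name": name, "namespace": namespace}
```

Passing the sanitizer in as a callable keeps the sketch independent of any particular implementation; a sanitizer that already raises on empty output makes the guard a harmless second line of defense.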

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/cloudai/_core/json_gen_strategy.py (1)

2-2: Update header year to satisfy CI.

The pipeline failure indicates the header should be “2024-2026,” but this file still shows “2024-2025.” Please update to unblock CI.

🛠️ Proposed fix
-# Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

@greptile-apps greptile-apps bot left a comment

No files reviewed, no comments

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/cloudai/_core/json_gen_strategy.py`:
- Around line 59-64: The current truncation of final_name to 253 chars can
reintroduce trailing hyphens and produce an invalid K8s name; after you truncate
(the assignment to final_name = sanitized_name[:253]) re-apply the
trailing-character strip that was done earlier (remove trailing hyphens/dots as
used in the earlier sanitization step), then re-check that final_name is not
empty and raise the same ValueError referencing job_name if it becomes empty;
ensure the final validation happens after the re-strip so the returned
final_name always complies with K8s naming rules.
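
The failure mode described here is easy to reproduce in isolation. The sketch below assumes the input has already been lowercased and had invalid characters replaced; truncate_and_strip is a hypothetical helper name, and stripping both hyphens and dots follows the wording of the comment rather than the merged code.

```python
MAX_LEN = 253


def truncate_and_strip(sanitized_name: str, job_name: str) -> str:
    """Truncate to MAX_LEN, then remove trailing '-' or '.' left by the cut."""
    final_name = sanitized_name[:MAX_LEN].rstrip("-.")
    if not final_name:
        raise ValueError(f"Job name '{job_name}' contains no valid characters.")
    return final_name


# A name whose 253rd character is a hyphen.
name = "a" * 252 + "-suffix"

naive = name[:MAX_LEN]                   # ends with '-', invalid as a K8s name
fixed = truncate_and_strip(name, name)   # trailing '-' removed after truncation
```

Here naive ends with a hyphen, whereas fixed is 252 characters long and ends with an alphanumeric character.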

@greptile-apps greptile-apps bot left a comment

No files reviewed, no comments

@amaslenn amaslenn added the bug Something isn't working label Jan 20, 2026
@jeffnvidia jeffnvidia left a comment

LGTM

Just a general comment: this file, json_gen_strategy.py, seems to be used only for Kubernetes. It seems odd (to me) that it lives in the _core folder and that its name doesn't reflect that it is Kubernetes-specific. Only when you look at a specific class inside do you see that it is. What do you think?

@amaslenn
Contributor Author

There is one under _core/ (https://github.com/NVIDIA/cloudai/blob/main/src/cloudai/_core/json_gen_strategy.py) and several per-workload ones where k8s support exists (same as cmd gen for Slurm). Yeah, I guess the name could be more specific, but that is not the goal of this PR.

@jeffnvidia

Sure, just something to keep in mind for the future.

@amaslenn amaslenn merged commit 1409ba7 into main Jan 20, 2026
5 checks passed
@amaslenn amaslenn deleted the am/bug-4844156 branch January 20, 2026 15:25