-
Notifications
You must be signed in to change notification settings - Fork 42
Address issues with Sleep test over K8s #779
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. 📝 WalkthroughWalkthroughAdds Kubernetes job-name sanitization and validation in JSON generation; tightens sanitization (max length, empty-result check); improves Kubernetes system error handling/logging for job deletion and missing pods; and adds unit and integration tests covering sanitization behavior and JSON output. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/cloudai/systems/kubernetes/kubernetes_system.py (1)
2-2: Update copyright year to fix CI failure.The pipeline indicates the copyright year is invalid. Update from
2024-2025to2024-2026to match policy.Proposed fix
-# Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.src/cloudai/workloads/sleep/kubernetes_json_gen_strategy.py (1)
2-2: Update copyright year to fix CI failure.The pipeline indicates the copyright year is invalid. Update from
2024-2025to2024-2026to match policy.Proposed fix
-# Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
🤖 Fix all issues with AI agents
In `@tests/json_gen_strategy/test_common_kubernetes.py`:
- Around line 40-43: The test test_job_name_sanitization injects base_tr which
pulls in a SlurmSystem, causing base_tr.output_path and setup to reference SLURM
instead of Kubernetes; make base_tr independent or use a Kubernetes-compatible
fixture so the Kubernetes-only test doesn't instantiate SlurmSystem. Update the
test to either create a minimal TestRun instance locally (set name and
output_path appropriate for Kubernetes) or add/use a fixture like k8s_base_tr
that constructs TestRun without SlurmSystem dependency, then pass that into
MyJsonGenStrategy and assert sanitize_k8s_job_name(base_tr.name) == expected;
ensure references to KubernetesSystem, MyJsonGenStrategy, sanitize_k8s_job_name
and base_tr.output_path are updated accordingly.
In `@tests/json_gen_strategy/test_sleep_kubernetes_json_gen_strategy` copy.py:
- Around line 1-28: The test file was accidentally saved with a " copy" suffix
causing whitespace in the filename which breaks pytest discovery and imports;
rename the file to remove the space/suffix (use
test_sleep_kubernetes_json_gen_strategy.py) and ensure any test references or CI
configs point to the new name; verify the test function
test_job_name_sanitization and imports for SleepKubernetesJsonGenStrategy,
SleepTestDefinition, TestRun, and KubernetesSystem still resolve after renaming.
Greptile OverviewGreptile SummaryFixes two K8s-related issues in the Sleep test workload: job name sanitization order and graceful handling of already-deleted jobs.
Confidence Score: 5/5
Important Files Changed
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 files reviewed, 2 comments
Additional Comments (1)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/cloudai/systems/kubernetes/kubernetes_system.py`:
- Around line 434-453: The delete_namespaced_job call's response is being cast
incorrectly to k8s.client.V1Job; update the cast for api_response returned by
batch_v1.delete_namespaced_job to k8s.client.V1Status (or the appropriate
V1Status type used in your kube client) and ensure any necessary
imports/reference for V1Status are present so the variable type matches the
actual API response from delete_namespaced_job.
In `@src/cloudai/workloads/sleep/kubernetes_json_gen_strategy.py`:
- Around line 33-39: The code builds a Job manifest using
sanitize_k8s_job_name(self.test_run.name) but doesn’t guard against the
sanitizer returning an empty string; update kubernetes_json_gen_strategy.py to
call sanitize_k8s_job_name first, check its return value, and if it is empty
raise a clear exception (e.g., ValueError) that includes the original
self.test_run.name and explains the name contained no valid characters so
Kubernetes job creation would fail; reference the sanitize_k8s_job_name call and
the Job creation block so the check happens before inserting the name into
job_spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/cloudai/_core/json_gen_strategy.py (1)
2-2: Update header year to satisfy CI.The pipeline failure indicates the header should be “2024-2026,” but this file still shows “2024-2025.” Please update to unblock CI.
🛠️ Proposed fix
-# Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No files reviewed, no comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/cloudai/_core/json_gen_strategy.py`:
- Around line 59-64: The current truncation of final_name to 253 chars can
reintroduce trailing hyphens and produce an invalid K8s name; after you truncate
(the assignment to final_name = sanitized_name[:253]) re-apply the
trailing-character strip that was done earlier (remove trailing hyphens/dots as
used in the earlier sanitization step), then re-check that final_name is not
empty and raise the same ValueError referencing job_name if it becomes empty;
ensure the final validation happens after the re-strip so the returned
final_name always complies with K8s naming rules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No files reviewed, no comments
jeffnvidia
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
just a general comment, this file json_gen_strategy.py seems to be uniquely for kubernetes. It seems odd (to me) that it's in the _core folder and that the name doesn't reflect the fact that it's just for kubernetes. It's only when you look at a specific class inside that you understand it's specific. What do you think ?
There is one under |
sure, just to keep in mind for the future maybe. |
Summary
Two issues were discovered:
Test Plan
Additional Notes
—