Add documentation for Systems #826

Merged
amaslenn merged 4 commits into main from am/systems-doc
Mar 10, 2026

@amaslenn
Contributor

Summary

Add documentation for Systems: list only properties relevant for TOML configs.

Test Plan

  1. CI
  2. Manual doc build and review.

Additional Notes

@coderabbitai
Contributor

coderabbitai bot commented Mar 10, 2026

📝 Walkthrough

Adds a new "systems" documentation page and TOC entry; updates Sphinx autodoc skip logic to skip non-attribute members for pydantic models when rendering the systems page; shortens RunAISystem docstring; updates LSFSystem docstring and adds a description to its queues Field and copyright year range.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation config & TOC: doc/conf.py, doc/index.rst | Imported inspect and added an autodoc_skip_member guard that, for pydantic models on the systems page, skips non-attribute members (functions, methods, classmethod, staticmethod, property). Added systems to the toctree. |
| Systems documentation: doc/systems.rst | Added a new systems documentation page with tables and per-system autopydantic_model sections and selective member exclusions. |
| LSF system file: src/cloudai/systems/lsf/lsf_system.py | Updated copyright header year range, replaced a verbose class docstring with a concise one-line docstring, and added a description to the queues Field (Field(default_factory=list, description="A list of queues in the LSF system, filled in automatically")). |
| RunAI system file: src/cloudai/systems/runai/runai_system.py | Shortened RunAISystem class docstring to a one-line summary; no functional changes. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I hopped through docs to tidy a page,
Skipped noisy members, set doc structure's stage.
LSF's queues now labeled with care,
RunAI's blurb trimmed light as air.
A tiny hop forward — code and carrot shared!

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding documentation for the Systems feature, which matches the primary objective and the bulk of changes. |
| Description check | ✅ Passed | The description is directly related to the changeset, explaining the purpose (document Systems with TOML-relevant properties) and outlining the test plan. |

@greptile-apps
Contributor

greptile-apps bot commented Mar 10, 2026

Greptile Summary

This PR adds a new doc/systems.rst documentation page that lists all CloudAI-supported systems (Slurm, Kubernetes, RunAI, LSF, Standalone) with their TOML-configurable attributes, rendered via sphinxcontrib.autodoc_pydantic. A targeted autodoc-skip-member hook is added to conf.py to suppress methods, properties, and classmethods from Pydantic models exclusively on the systems page, ensuring only data fields are shown. Class-level docstrings in lsf_system.py and runai_system.py are trimmed to one-liners as the attribute-level documentation now lives in field description strings or on the docs page itself.

Key changes:

  • doc/systems.rst: New page with autopydantic_model directives for all five systems; runtime-only fields (cmd_shell, lsf_nodes, slurm_nodes, nodes, group_allocated, supports_gpu_directives_cache) are excluded via :exclude-members:.
  • doc/conf.py: Per-page autodoc-skip-member hook using app.env.docname == "systems" to filter out non-field members only when building the systems page.
  • src/cloudai/systems/lsf/lsf_system.py: LSFSystem.queues gains a description="A list of queues in the LSF system, filled in automatically" — the wording "filled in automatically" may be confusing in a TOML-documentation context since the page states these are attributes users can set via TOML.
  • src/cloudai/systems/runai/runai_system.py: Docstring simplified; no functional changes.

Confidence Score: 4/5

  • This PR is safe to merge; changes are documentation-only except for a non-functional field description addition in lsf_system.py.
  • The autodoc hook logic is sound and well-scoped. The RST directives correctly exclude runtime-only fields. The one minor concern is the "filled in automatically" description on LSFSystem.queues, which creates slightly contradictory messaging on a page that claims to show TOML-configurable attributes. No functional code is changed.
  • src/cloudai/systems/lsf/lsf_system.py — the queues field description wording warrants review for consistency with the page's stated documentation goal.

Important Files Changed

| Filename | Overview |
|---|---|
| doc/conf.py | Adds import inspect and a per-page autodoc hook that skips methods/properties from Pydantic models only when building the systems page. Logic is correct and well-scoped. |
| doc/systems.rst | New documentation page listing all systems and their TOML-configurable fields via autopydantic_model. Runtime fields are properly excluded. The LSFSystem.queues field is shown but described as "filled in automatically", which slightly contradicts the page's stated purpose. |
| doc/index.rst | Adds systems entry to the toctree, placed logically between DEV and workloads/index. |
| src/cloudai/systems/lsf/lsf_system.py | Simplifies class docstring to a one-liner and adds a description to the queues field ("filled in automatically") for autodoc display. The description is accurate at runtime but may be misleading in a TOML docs context. |
| src/cloudai/systems/runai/runai_system.py | Simplifies class docstring to a one-liner. No functional changes; nodes field exclusion is handled in systems.rst via :exclude-members: directive. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Sphinx doc build] --> B{docname == systems?}
    B -- Yes --> C[autodoc_skip_member hook]
    B -- No --> D[Normal autodoc processing]
    C --> E{what == pydantic_model?}
    E -- No --> F[Skip by default rules only]
    E -- Yes --> G{obj is function / method / property?}
    G -- Yes --> H[Skip member]
    G -- No --> I[Show member as TOML field]
    I --> J[systems.rst rendered\nwith fields only]

Comments Outside Diff (1)

  1. src/cloudai/systems/lsf/lsf_system.py, lines 55-57

    queues description contradicts TOML documentation intent

    The doc/systems.rst page explicitly states "The attributes shown for each system can be set in TOML configuration files." However, the description "filled in automatically" on the queues field communicates the opposite — that users should not (or need not) set this field via TOML.

    At runtime, parse_bhosts_output resets self.queues = [] and rebuilds it entirely from bhosts output, so any user-provided TOML value would be overwritten. This is consistent with how lsf_nodes is excluded from LSFQueue in the docs.

    For consistency with the page's stated goal, consider either:

    1. Excluding queues from LSFSystem in systems.rst (via :exclude-members: queues) similarly to how slurm_nodes/lsf_nodes are excluded from their queue/partition models, while still keeping LSFQueue/LSFGroup separately documented for structure reference, or
    2. Rewording the description to something neutral like "List of queues in the LSF system." and adding a note about automatic population in the docstring instead.

    Compare with SlurmSystem.partitions which carries no such "filled in automatically" annotation in its field definition.
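
To make the overwrite concern concrete, here is a small sketch of the reset-then-rebuild pattern described above. `DemoSystem` and `refresh_from_scheduler` are hypothetical names for illustration only; this is not the real LSFSystem, and it uses a plain dataclass instead of a pydantic model.

```python
from dataclasses import dataclass, field


@dataclass
class DemoSystem:
    """Hypothetical stand-in for a system whose queues are rebuilt at runtime."""

    name: str
    queues: list = field(default_factory=list)

    def refresh_from_scheduler(self, discovered):
        # Mirrors the described behavior: reset first, then rebuild from live data.
        self.queues = []
        for queue_name in discovered:
            self.queues.append(queue_name)


# A value that a user might have set in a config file...
system = DemoSystem(name="demo", queues=["set-in-toml"])
# ...is discarded as soon as the runtime refresh runs.
system.refresh_from_scheduler(["normal", "gpu"])
```

After the refresh, `system.queues` holds only the discovered names, which is why documenting the field as user-settable could mislead readers.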

Last reviewed commit: 692b7f1

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@doc/systems.rst`:
- Around line 43-44: The RunAISystem schema currently exposes a nodes:
List[RunAINode] but the docs only render RunAISystem; either document the nested
RunAINode model by adding an autopydantic_model entry for
cloudai.systems.runai.runai_node.RunAINode (or the correct import path for
RunAINode) below the RunAISystem entry so the TOML shape of each node is shown,
or hide the nodes field from the parent model by adding nodes to the
:exclude-members: option of the RunAISystem autopydantic_model directive, so
that nodes is not rendered if it is not user-settable.
- Around line 50-56: The docs page is TOML-only and must not expose
runtime-derived types LSFQueue and LSFGroup; remove the two autopydantic_model
directives for LSFQueue and LSFGroup from doc/systems.rst (leave only the
LSFSystem entry), since in src/cloudai/systems/lsf/lsf_system.py the queues are
rebuilt from live bhosts output and should not be presented as user-configurable
types.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 03b5c9b4-5bc9-48e8-bd00-08c2f730ae90

📥 Commits

Reviewing files that changed from the base of the PR and between 9b84632 and 866d001.

📒 Files selected for processing (5)
  • doc/conf.py
  • doc/index.rst
  • doc/systems.rst
  • src/cloudai/systems/lsf/lsf_system.py
  • src/cloudai/systems/runai/runai_system.py

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/cloudai/systems/lsf/lsf_system.py (3)

69-79: ⚠️ Potential issue | 🟠 Major

Do not rebuild self.queues after a failed shell command.

Line 127 only logs command errors, but update() still parses the outputs and Line 161 clears self.queues before repopulating it. A transient bhosts/bjobs failure can therefore replace the last good topology with an empty or partial snapshot. Fail the refresh before mutating state once command execution reports an error.

Suggested direction
     def update(self) -> None:
-        bhosts_output, _ = self.fetch_command_output("bhosts")
-        bjobs_output, _ = self.fetch_command_output("bjobs -u all")
+        bhosts_output, bhosts_error = self.fetch_command_output("bhosts")
+        bjobs_output, bjobs_error = self.fetch_command_output("bjobs -u all")
+        if bhosts_error or bjobs_error:
+            raise RuntimeError("Failed to refresh LSF system state")
+
         node_user_map = self.parse_bjobs_output(bjobs_output)
         self.parse_bhosts_output(bhosts_output, node_user_map)

Also applies to: 117-131, 161-178

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cloudai/systems/lsf/lsf_system.py` around lines 69 - 79, The update()
method currently ignores command failures from fetch_command_output and proceeds
to parse and then clear/rebuild self.queues; change it to check the second
return value (error/exit status) from fetch_command_output for both "bhosts" and
"bjobs -u all", and if either reports an error, log and return early without
calling parse_bjobs_output/parse_bhosts_output or mutating self.queues; also
move the clearing/repopulation of self.queues (the code that empties and
repopulates queues) so it only runs after successful command execution and
parsing (i.e., after parse_bhosts_output returns successfully).
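
One way to picture the "fail before mutating state" direction is the sketch below. It is an assumption-laden illustration, not the project's code: `refresh_queues`, `failing_fetch`, `ok_fetch`, and `fake_parse` are hypothetical names, and the second value returned by the fetch function is assumed to be error text that is empty on success.

```python
def refresh_queues(current_queues, fetch, parse):
    """Return a new queue list, or the old one unchanged if any command fails."""
    outputs = []
    for command in ("bhosts", "bjobs -u all"):
        stdout, stderr = fetch(command)
        if stderr:
            # Fail before touching state so the last good snapshot survives.
            return current_queues
        outputs.append(stdout)
    # The replacement is built separately and only returned after parsing succeeds.
    return parse(*outputs)


# Hypothetical fetchers and parser, used only to exercise the function.
def failing_fetch(command):
    return "", f"{command}: command failed"


def ok_fetch(command):
    return f"output of {command}", ""


def fake_parse(bhosts_output, bjobs_output):
    return ["rebuilt-queue"]


last_good = ["normal"]
after_failure = refresh_queues(last_good, failing_fetch, fake_parse)
after_success = refresh_queues(last_good, ok_fetch, fake_parse)
```

Raising an exception instead of returning the old list, as in the suggested diff, would work equally well; the point is that nothing is cleared until both commands have succeeded.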

75-78: ⚠️ Potential issue | 🔴 Critical

Fix parse_bhosts_output(): the sixth column in bhosts output is RUN (job slots), not a queue name.

The plain bhosts command reports columns HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV. Line 168 incorrectly treats the sixth token (RUN, a numeric count of running job slots) as a queue name, causing hosts to be bucketed under values like 0, 1, or 2 instead of actual queue names. LSF queues are a separate configuration surface—queue-to-host mappings are not available from the plain bhosts command. A different source (such as lshosts with -q flag, queue configuration files, or LSF API) must be used to correctly associate hosts with queues before constructing LSFQueue objects.
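
A short sketch of why position-based parsing goes wrong here: keying each row by the header names shows that the sixth column is RUN. The sample text is illustrative data laid out with the column names listed above, not captured output, and `parse_bhosts_by_header` is a hypothetical helper.

```python
# Illustrative sample only; real bhosts output may differ in spacing and rows.
SAMPLE_BHOSTS = """\
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
hostA              ok              -      8      2      2      0      0      0
hostB              closed          -      8      8      8      0      0      0
"""


def parse_bhosts_by_header(text):
    """Map each host row to a dict keyed by the header's column names."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        values = line.split()
        if len(values) != len(header):
            continue  # skip rows that do not match the header layout
        rows.append(dict(zip(header, values)))
    return rows


hosts = parse_bhosts_by_header(SAMPLE_BHOSTS)
# Zero-based position of the RUN column in the header.
run_index = list(hosts[0]).index("RUN")
```

With named columns, a host can never be bucketed under a numeric RUN value by accident, although the queue-to-host mapping still has to come from another source.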

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cloudai/systems/lsf/lsf_system.py` around lines 75 - 78,
parse_bhosts_output currently treats the 6th token of bhosts lines as a queue
name (causing hosts to be grouped under numeric RUN values); update
parse_bhosts_output to treat the 6th column as RUN (a numeric count) and not a
queue identifier, stop using that token to construct LSFQueue objects, and
instead either (a) leave queue association empty/None or (b) accept an explicit
queue-to-host mapping injected from a proper source (e.g., lshosts -q or LSF
API) before creating LSFQueue instances; adjust any logic that uses
parse_bhosts_output (including calls in the same flow after parse_bjobs_output
and references to LSFQueue) to handle hosts without queue assignments or to use
the injected mapping.

75-77: ⚠️ Potential issue | 🟠 Major

Use explicit output format for bjobs instead of relying on fixed column positions.

The parse_bjobs_output() method assumes token 6 is always EXEC_HOST, but the default bjobs output leaves EXEC_HOST empty for pending jobs. This causes unstable field alignment when parsing—non-running jobs have fewer columns or different column order, causing the method to either skip them or misalign field assignments, resulting in an incomplete or inaccurate node_user_map.

Request only the fields you need via -o and filter by job status before populating node_user_map. This is consistent with how is_job_running() and is_job_completed() already use -noheader -o stat in the same file.

Suggested direction
-        bjobs_output, _ = self.fetch_command_output("bjobs -u all")
+        bjobs_output, _ = self.fetch_command_output('bjobs -noheader -o "user stat exec_host" -u all')
...
-        for line in bjobs_output.splitlines():
-            parts = line.split()
-            if len(parts) < 6:
+        for line in bjobs_output.splitlines():
+            parts = line.split()
+            if len(parts) < 3:
                 continue
-            _, user, _, _, _, exec_host = parts[:6]
-            if exec_host not in node_user_map:
+            user, stat, exec_host = parts[:3]
+            if stat != "RUN" or not exec_host or exec_host == "-":
+                continue
+            if exec_host not in node_user_map:
                 node_user_map[exec_host] = user
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cloudai/systems/lsf/lsf_system.py` around lines 75 - 77,
parse_bjobs_output is relying on fixed column positions from a plain "bjobs"
call which breaks when EXEC_HOST is empty; instead change the call in the code
that invokes fetch_command_output to request explicit fields and filter statuses
(e.g. use "bjobs -noheader -o <fields> -u all" where <fields> includes JOBID,
USER, STAT, EXEC_HOST) and then update parse_bjobs_output to parse those
tab/space-separated ordered fields (or split by the chosen delimiter) and only
populate node_user_map for jobs with running STAT values; reference
fetch_command_output, parse_bjobs_output, and the places where
is_job_running()/is_job_completed() already use "-noheader -o stat" to mirror
that approach.
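
As a rough, runnable counterpart to the suggested diff, the sketch below parses three whitespace-separated fields per line and keeps only running jobs. The sample lines are made up and assume the shape that `bjobs -noheader -o "user stat exec_host"` is expected to produce; `parse_bjobs_fields` is a hypothetical name, and multi-host EXEC_HOST values are not handled.

```python
# Illustrative sample only, in "user stat exec_host" order (assumption).
SAMPLE_BJOBS = """\
alice RUN hostA
bob PEND -
carol RUN hostB
dave RUN hostA
"""


def parse_bjobs_fields(text):
    """Build a host-to-user map from running jobs only."""
    node_user_map = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # malformed or empty line
        user, stat, exec_host = parts[:3]
        if stat != "RUN" or exec_host == "-":
            continue  # pending jobs have no execution host
        # Keep the first user seen per host, as in the suggested diff.
        node_user_map.setdefault(exec_host, user)
    return node_user_map


node_user_map = parse_bjobs_fields(SAMPLE_BJOBS)
```

Filtering on the status field before touching the map means pending jobs can no longer shift or pollute the host assignments.
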
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/systems/lsf/lsf_system.py`:
- Around line 52-53: Update the LSFSystem class docstring to clearly describe
the configuration surface and behavior: state whether the class represents TOML
configuration, runtime state, or both; list the expected config keys/sections or
their purpose (e.g., host, port, credentials, scheduling options) and the format
(TOML) if applicable; summarize runtime behavior and side effects (e.g.,
establishes connections, starts jobs, mutates system state) and any validation
performed by LSFSystem methods; place this expanded one-paragraph docstring
directly on the LSFSystem class to replace the current generic line.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 160c1219-cbcd-40cd-802e-22a0ffd747c1

📥 Commits

Reviewing files that changed from the base of the PR and between 866d001 and 716c8fa.

📒 Files selected for processing (2)
  • doc/conf.py
  • src/cloudai/systems/lsf/lsf_system.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (1)
doc/systems.rst (1)

51-57: ⚠️ Potential issue | 🟡 Minor

LSFQueue and LSFGroup should not be documented on this TOML-only page.

These models represent runtime-derived state (queues are rebuilt from live bhosts output) and should not be presented as user-configurable TOML attributes. This contradicts the page introduction stating that the documented attributes "can be set in TOML configuration files."

📚 Suggested fix
 .. autopydantic_model:: cloudai.systems.lsf.lsf_system.LSFSystem
-   :exclude-members: cmd_shell
+   :exclude-members: cmd_shell, queues

-.. autopydantic_model:: cloudai.systems.lsf.lsf_system.LSFQueue
-   :exclude-members: lsf_nodes
-
-.. autopydantic_model:: cloudai.systems.lsf.lsf_system.LSFGroup
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@doc/systems.rst` around lines 51 - 57, Remove the runtime-only models
LSFQueue and LSFGroup from this TOML-focused documentation page: delete the
autopydantic_model entries for cloudai.systems.lsf.lsf_system.LSFQueue and
cloudai.systems.lsf.lsf_system.LSFGroup so only LSFSystem remains documented,
and add a brief one-line note (if desired) stating that queues and groups are
runtime-derived (from bhosts) and are not configurable via TOML to avoid user
confusion.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 44600d20-9b69-403b-956f-6409127e14b2

📥 Commits

Reviewing files that changed from the base of the PR and between 716c8fa and 874af2f.

📒 Files selected for processing (1)
  • doc/systems.rst

Contributor

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (1)
doc/systems.rst (1)

53-59: ⚠️ Potential issue | 🟡 Minor

Hide runtime-derived LSF queue models from this TOML-only page.

Line 4 says this page lists attributes that can be set in TOML, but LSFSystem.queues is rebuilt from live bhosts output in parse_bhosts_output(), and the field itself is described as auto-filled. Rendering LSFQueue and LSFGroup here makes runtime state look user-configurable. Exclude queues from LSFSystem and drop the nested LSF model directives from this page.

Suggested doc change
 .. autopydantic_model:: cloudai.systems.lsf.lsf_system.LSFSystem
-   :exclude-members: cmd_shell
+   :exclude-members: cmd_shell, queues
 
-.. autopydantic_model:: cloudai.systems.lsf.lsf_system.LSFQueue
-   :exclude-members: lsf_nodes
-
-.. autopydantic_model:: cloudai.systems.lsf.lsf_system.LSFGroup
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@doc/systems.rst` around lines 53 - 59, Update the doc to stop exposing
runtime-derived LSF models: in the autopydantic_model directive for LSFSystem,
add queues to :exclude-members: (so the auto-generated docs do not show the
runtime-filled queues field) and delete the separate autopydantic_model blocks
for LSFQueue and LSFGroup from this TOML-only page; this keeps
parse_bhosts_output()/LSFSystem.queues (the auto-filled runtime data) out of the
configuration documentation and leaves only TOML-settable attributes documented.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c981c38a-028e-418e-a37b-6d93d9c251b3

📥 Commits

Reviewing files that changed from the base of the PR and between 874af2f and 692b7f1.

📒 Files selected for processing (2)
  • doc/systems.rst
  • src/cloudai/systems/lsf/lsf_system.py

@amaslenn amaslenn requested a review from podkidyshev March 10, 2026 12:41
@amaslenn amaslenn merged commit 18f4254 into main Mar 10, 2026
5 checks passed
@amaslenn amaslenn deleted the am/systems-doc branch March 10, 2026 14:03