Conversation
📝 Walkthrough: Adds a new "NixlEP" Slurm workload: documentation, command-argument model, test definition, Slurm command generator, report generation, log-parsing utilities, package registration, reference SBATCH, and comprehensive unit/acceptance tests.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes
🚥 Pre-merge checks: ✅ Passed checks (2 passed)
Force-pushed 879dcf7 to 6fa76db.
Force-pushed 6fa76db to deb15b5.
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@doc/workloads/nixl_ep.rst`:
- Around line 20-27: The example and accompanying comment are ambiguous about
whether omitting a rank in a phase implies it is removed; update the doc text
around the ``plan`` example to state the exact semantics (e.g., "ranks not
listed in a phase are considered inactive/removed; negative indices explicitly
mark removals") and either modify the sample phase 2 to explicitly show rank 5's
status (e.g., use -5 if it was intended removed) or explicitly call out that
omission of 5 was intentional and means it is inactive; reference the ``plan``
field and the phase lists in the example to ensure readers understand omission
vs explicit negative indices.
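The semantics the comment asks the docs to spell out can be sketched as follows. This is a hypothetical `active_ranks` helper written for illustration, not code from the PR; it assumes (per the doc's description) that negative indices explicitly mark removals and that omission implies inactivity:

```python
import json


def active_ranks(phase):
    """Return the ranks that remain active in one phase of a plan.

    Positive entries are active ranks; a negative entry explicitly marks
    that rank's removal. A rank simply omitted from a phase is likewise
    inactive, so listing -5 and omitting 5 yield the same active set --
    the negative form only makes the removal explicit to the reader.
    """
    return {r for r in phase if r >= 0}


plan = json.loads("[[0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, -5]]")
print(active_ranks(plan[0]))  # {0, 1, 2, 3, 4, 5}
print(active_ranks(plan[1]))  # {0, 1, 2, 3, 4}
# Omitting 5 entirely produces the same active set as listing -5:
print(active_ranks([0, 1, 2, 3, 4]) == active_ranks(plan[1]))  # True
```

Note that this encoding cannot distinguish "remove rank 0" from "keep rank 0" (since -0 == 0), which is another reason the docs should state the omission semantics explicitly.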
In `@src/cloudai/workloads/nixl_ep/nixl_ep.py`:
- Around line 186-200: The check in _check_benchmark_output currently uses
any(parse_nixl_ep_bandwidth_samples(path) for path in expected_node_logs) which
treats a partial run as successful if one node emitted summary lines; change
this to require every expected node log to contain samples by using
all(parse_nixl_ep_bandwidth_samples(path) for path in expected_node_logs) (or
otherwise verify each Path via parse_nixl_ep_bandwidth_samples) so the function
only returns success when all expected_node_logs have summary output; keep the
existing tail/error_message behavior and ensure you still handle an empty
expected_node_logs list in the same way that was_run_successful expects.
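The difference between the two checks can be demonstrated in isolation. Here `parse_samples` is a stand-in for `parse_nixl_ep_bandwidth_samples` (it just collects summary-looking lines), and the empty-list behavior is an assumption for illustration:

```python
import tempfile
from pathlib import Path


def parse_samples(path):
    # Stand-in for parse_nixl_ep_bandwidth_samples: one "sample" per
    # line that looks like a bandwidth summary.
    return [line for line in path.read_text().splitlines() if line.startswith("BW")]


def all_nodes_reported(expected_node_logs):
    # all(...) requires every expected node log to contain samples;
    # any(...) would mark a partial run successful as soon as a single
    # node emitted summary lines. An empty list is treated as failure here.
    return bool(expected_node_logs) and all(parse_samples(p) for p in expected_node_logs)


with tempfile.TemporaryDirectory() as d:
    good = Path(d, "node0.log")
    good.write_text("BW 12.3 GB/s\n")
    empty = Path(d, "node1.log")
    empty.write_text("no summary here\n")

    print(any(parse_samples(p) for p in [good, empty]))  # True  -- old check passes
    print(all_nodes_reported([good, empty]))             # False -- new check fails
    print(all_nodes_reported([good]))                    # True
```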
- Around line 97-101: The validation loop over parsed plan phases currently uses
isinstance(rank, int), which accepts JSON booleans because bool subclasses int;
update the check in the loop that iterates over parsed/phase/rank to use strict
type comparison (e.g., type(rank) is int) so that True/False are rejected as
non-integer ranks, and keep raising ValueError("Each plan rank must be an
integer.") when the strict check fails.
In `@src/cloudai/workloads/nixl_ep/report_generation_strategy.py`:
- Around line 80-91: The table-building currently collapses samples per node and
ignores sample.rank; update _build_table to preserve rank granularity by adding
a "Rank" column (use Table.add_column like the existing columns) and change the
grouping/rendering logic to iterate and merge/render samples per (node, rank)
instead of per node. Use parse_nixl_ep_bandwidth_samples' sample.rank field when
constructing rows so each (node, rank) produces its own row (including the
Dispatch/Combine BW and Avg/Min/Max columns when has_combined/has_kineto are
set) and ensure any mean calculations are applied per (node, rank) not across
all ranks on a node.
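The grouping change can be sketched without the rich `Table` machinery. `Sample` and `build_rows` below are illustrative stand-ins (the real sample type comes from `parse_nixl_ep_bandwidth_samples`); the point is keying the aggregation on `(node, rank)` rather than `node`:

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    # Stand-in for the parsed bandwidth sample; the real one also carries
    # combine BW and avg/min/max fields.
    node: str
    rank: int
    dispatch_bw: float


def build_rows(samples):
    # Group per (node, rank) so each rank keeps its own row; means are
    # computed within a (node, rank) group, never across a node's ranks.
    grouped = defaultdict(list)
    for s in samples:
        grouped[(s.node, s.rank)].append(s.dispatch_bw)
    return [(node, rank, mean(vals)) for (node, rank), vals in sorted(grouped.items())]


rows = build_rows([
    Sample("n0", 0, 10.0),
    Sample("n0", 0, 12.0),
    Sample("n0", 1, 20.0),
    Sample("n1", 0, 30.0),
])
print(rows)  # [('n0', 0, 11.0), ('n0', 1, 20.0), ('n1', 0, 30.0)]
```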
In `@src/cloudai/workloads/nixl_ep/slurm_command_gen_strategy.py`:
- Around line 245-251: The _render_launch method currently joins arguments
unsafely and fails to quote the log file and certain shell-sensitive values;
update _render_launch to build the benchmark command by quoting each token
returned from _build_benchmark_command(launch) with shlex.quote(), but when an
argument equals the literal "$master_ip" emit it as a double-quoted "$master_ip"
(to preserve runtime expansion), then construct the script without the brittle
.replace('"','\\"') step; also wrap the log_file shown in the --output option
with shlex.quote(str(log_file)) and keep the srun prefix via
_launch_srun_prefix(launch.node_idx) and open_mode_arg as before so the final
returned string safely embeds the quoted bash -c "<script>" invocation.
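The quoting rule being requested -- `shlex.quote()` for every token except the literal `$master_ip`, which must stay double-quoted so the shell expands it at runtime -- can be isolated like this (hypothetical `render_command` helper, not the PR's `_render_launch`):

```python
import shlex


def render_command(tokens):
    # shlex.quote("$master_ip") would yield '$master_ip' in single quotes,
    # suppressing expansion; emitting it double-quoted preserves it.
    parts = []
    for tok in tokens:
        parts.append('"$master_ip"' if tok == "$master_ip" else shlex.quote(tok))
    return " ".join(parts)


cmd = render_command(["python", "bench.py", "--host", "$master_ip", "--tag", "a b"])
print(cmd)  # python bench.py --host "$master_ip" --tag 'a b'
```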
- Around line 189-198: _build_benchmark_command currently passes
NixlEPCmdArgs.elastic_script through unchanged, which breaks when elastic_script
is a path relative to the container runtime root; resolve the elastic_script to
the container runtime absolute path before building the command (use the same
resolution mechanism you use for plan paths — e.g. mirror resolve_plan_path or
add a resolve_script_path helper) and replace cmd_args.elastic_script in the
command array with the resolved path so relative values like
"examples/device/ep/tests/elastic/elastic.py" work inside the container.
In `@tests/test_test_scenario.py`:
- Line 626: The parametrized test test_custom_reporters is missing the new
NixlEPTestDefinition entry: add (NixlEPTestDefinition,
{NixlEPReportGenerationStrategy}) to its parameter list and update the
assertions accordingly so Registry().reports_map still validates mapping for
NixlEPTestDefinition; also add imports for NixlEPTestDefinition and
NixlEPReportGenerationStrategy at the top of the test file so the symbols
resolve.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: fc20196a-68a9-4029-975f-839c2691690d
📒 Files selected for processing (16)
- doc/workloads/index.rst
- doc/workloads/nixl_ep.rst
- src/cloudai/registration.py
- src/cloudai/workloads/nixl_ep/__init__.py
- src/cloudai/workloads/nixl_ep/log_parsing.py
- src/cloudai/workloads/nixl_ep/nixl_ep.py
- src/cloudai/workloads/nixl_ep/report_generation_strategy.py
- src/cloudai/workloads/nixl_ep/slurm_command_gen_strategy.py
- tests/ref_data/nixl-ep.sbatch
- tests/test_acceptance.py
- tests/test_init.py
- tests/test_test_scenario.py
- tests/workloads/nixl_ep/__init__.py
- tests/workloads/nixl_ep/test_command_gen_strategy_slurm.py
- tests/workloads/nixl_ep/test_job_status_retrieval_strategy.py
- tests/workloads/nixl_ep/test_log_parsing.py
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/cloudai/workloads/nixl_ep/nixl_ep.py`:
- Around line 105-108: The type-check in parse_plan currently raises ValueError
for a non-string self.plan; change this to raise TypeError instead. Update the
exception in the parse_plan method (the check on self.plan) to raise a TypeError
with a clear message (mentioning parse_plan and expected string) and keep the
rest of the logic that calls self._parse_plan(self.plan) unchanged.
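The convention at issue: `TypeError` for a wrong argument type, `ValueError` for a well-typed value whose content is invalid. A minimal sketch (standalone function for illustration; the real `parse_plan` is a method that delegates to `_parse_plan`):

```python
import json


def parse_plan(plan):
    # Wrong type of argument -> TypeError; malformed content in a
    # correctly typed string would instead surface as a ValueError.
    if not isinstance(plan, str):
        raise TypeError(f"parse_plan expects a JSON string, got {type(plan).__name__}")
    return json.loads(plan)


print(parse_plan("[[0, 1], [0, -1]]"))  # [[0, 1], [0, -1]]

try:
    parse_plan([[0, 1]])
except TypeError as e:
    print(e)  # parse_plan expects a JSON string, got list
```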
In `@tests/workloads/nixl_ep/test_job_status_retrieval_strategy.py`:
- Around line 196-220: The parametrize call uses a comma-separated string for
argument names which Ruff PT006 flags; update the pytest.mark.parametrize
invocation to pass a tuple of parameter names instead (e.g., replace the string
"log_content, expected_fragment" with a tuple containing the two identifiers) so
the decorator uses a tuple of names for log_content and expected_fragment in the
pytest.mark.parametrize call.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8afaf7aa-8afc-4700-be1d-33eb1ac6abfd
📒 Files selected for processing (4)
- doc/workloads/nixl_ep.rst
- src/cloudai/workloads/nixl_ep/nixl_ep.py
- tests/test_test_scenario.py
- tests/workloads/nixl_ep/test_job_status_retrieval_strategy.py
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_acceptance.py`:
- Around line 415-417: The third phase in the test data for "plan" contains an
invalid negative rank `-6`; update the JSON payload used in the test (the "plan"
value) to replace `-6` with the correct non-negative rank (likely `5` to
continue the ascending sequence) so the phase reads `[0, 1, 2, 3, 4, 5, 7]` (or
`6` if you intended to skip 5), and run the test to ensure the phase ordering is
correct; locate the "plan" JSON assignment in the test_acceptance.py snippet to
make this change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: b06809fe-c13c-4b25-b32c-369976dfcd49
📒 Files selected for processing (1)
- tests/test_acceptance.py
Summary
Add NIXL EP workload.
Test Plan
Additional Notes
–