Add NIXL EP workload by amaslenn · Pull Request #845 · NVIDIA/cloudai

amaslenn · 2026-03-24T13:02:25Z

Summary

Add NIXL EP workload.

Test Plan

CI (extended)
Manual runs for 1 and 2 nodes

Additional Notes

–

coderabbitai · 2026-03-24T13:02:32Z

📝 Walkthrough

Adds a new "NixlEP" Slurm workload: documentation, command-argument model, test definition, Slurm command generator, report generation, log-parsing utilities, package registration, reference SBATCH, and comprehensive unit/acceptance tests.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation: `doc/workloads/index.rst`, `doc/workloads/nixl_ep.rst` | Added `NixlEP` entry to the workloads index and a new Sphinx page describing the JSON `plan` format, TOML examples, runtime semantics, reporting output, and API doc hooks. |
| Package & Registration: `src/cloudai/workloads/nixl_ep/__init__.py`, `src/cloudai/registration.py` | New package initializer exporting NixlEP symbols; registration now registers `NixlEP` test definition, Slurm command-gen strategy, and report generation strategy. |
| Core Workload & Parsing: `src/cloudai/workloads/nixl_ep/nixl_ep.py`, `src/cloudai/workloads/nixl_ep/log_parsing.py` | Added `NixlEPCmdArgs` (plan/process validation and parsing), `NixlEPTestDefinition` (docker/installables and log-based `was_run_successful` checks), and log-parsing utilities/dataclass for completed phases and bandwidth samples. |
| Slurm Command Generation: `src/cloudai/workloads/nixl_ep/slurm_command_gen_strategy.py` | Added `NixlEPSlurmCommandGenStrategy` that generates SBATCH scripts, writes generated-plan and env artifacts, validates packing/phase transitions, and emits per-node `srun` launches with master/follower coordination and wait helpers. |
| Report Generation: `src/cloudai/workloads/nixl_ep/report_generation_strategy.py` | Added `NixlEPReportGenerationStrategy` that detects NixlEP output, loads generated plan, parses completed phases and bandwidth samples, renders rich tables, and exposes a `default` metric aggregation. |
| Reference SBATCH: `tests/ref_data/nixl-ep.sbatch` | Added reference Slurm batch script illustrating multi-node multi-phase launches, master IP resolution, readiness/phase-wait helpers, and per-node log routing. |
| Tests, integration & harness updates: `tests/test_acceptance.py`, `tests/test_init.py`, `tests/test_test_scenario.py` | Wired `nixl-ep` into acceptance mapping (3-node test run), added registration/assertions for NixlEP components, and updated expected registry/report counts. |
| Tests, workload unit tests: `tests/workloads/nixl_ep/test_command_gen_strategy_slurm.py`, `tests/workloads/nixl_ep/test_job_status_retrieval_strategy.py`, `tests/workloads/nixl_ep/test_log_parsing.py`, `tests/workloads/nixl_ep/__init__.py` | Added extensive unit tests for command generation, SBATCH normalization, job-status detection via logs, and log-parsing edge cases; new test fixtures and reference comparisons included. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Poem

🐇 I hopped through plans of phases and nodes,
I stitched SBATCH lines and counted the modes,
Master hums softly, followers align,
Logs pour out bandwidth in neat little lines,
A carrot for code — elastic runs shine! 🥕✨

🚥 Pre-merge checks | ✅ 2

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Add NIXL EP workload' directly and clearly summarizes the main change: introducing a new workload type to the repository. |
| Description check | ✅ Passed | The description is related to the changeset, providing a summary of the addition, test plan, and noting that it restores accidentally removed code. |

coderabbitai

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@doc/workloads/nixl_ep.rst`:
- Around line 20-27: The example and accompanying comment are ambiguous about
whether omitting a rank in a phase implies it is removed; update the doc text
around the ``plan`` example to state the exact semantics (e.g., "ranks not
listed in a phase are considered inactive/removed; negative indices explicitly
mark removals") and either modify the sample phase 2 to explicitly show rank 5's
status (e.g., use -5 if it was intended removed) or explicitly call out that
omission of 5 was intentional and means it is inactive; reference the ``plan``
field and the phase lists in the example to ensure readers understand omission
vs explicit negative indices.

In `@src/cloudai/workloads/nixl_ep/nixl_ep.py`:
- Around line 186-200: The check in _check_benchmark_output currently uses
any(parse_nixl_ep_bandwidth_samples(path) for path in expected_node_logs) which
treats a partial run as successful if one node emitted summary lines; change
this to require every expected node log to contain samples by using
all(parse_nixl_ep_bandwidth_samples(path) for path in expected_node_logs) (or
otherwise verify each Path via parse_nixl_ep_bandwidth_samples) so the function
only returns success when all expected_node_logs have summary output; keep the
existing tail/error_message behavior and ensure you still handle an empty
expected_node_logs list in the same way was_run_successful expects.
- Around line 97-101: The validation loop over parsed plan phases currently uses
isinstance(rank, int), which accepts JSON booleans because bool subclasses int;
update the check in the loop that iterates over parsed/phase/rank to use strict
type comparison (e.g., type(rank) is int) so that True/False are rejected as
non-integer ranks, and keep raising ValueError("Each plan rank must be an
integer.") when the strict check fails.
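
The two `nixl_ep.py` findings above can be illustrated with a minimal sketch. It is not the code from the PR: `is_strict_int`, `validate_phase`, and `every_log_has_samples` are hypothetical names, and treating an empty log list as a failure is an assumption made only for this example.

```python
from typing import Callable, Sequence


def is_strict_int(value: object) -> bool:
    """Return True only for real ints; bool is rejected although it subclasses int."""
    return type(value) is int


def validate_phase(phase: Sequence[object]) -> None:
    """Hypothetical helper mirroring the rank check described in the finding."""
    for rank in phase:
        if not is_strict_int(rank):
            raise ValueError("Each plan rank must be an integer.")


def every_log_has_samples(logs: Sequence[str], parse: Callable[[str], list]) -> bool:
    """Require parsed samples in every expected log (hypothetical helper)."""
    # all() over an empty iterable is True, so the empty case is guarded
    # explicitly; failing on an empty list is an assumption of this sketch.
    return bool(logs) and all(parse(path) for path in logs)
```

With `any(...)` a single node that printed summary lines would be enough to pass; `all(...)` fails as soon as one expected log parses to an empty list.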

In `@src/cloudai/workloads/nixl_ep/report_generation_strategy.py`:
- Around line 80-91: The table-building currently collapses samples per node and
ignores sample.rank; update _build_table to preserve rank granularity by adding
a "Rank" column (use Table.add_column like the existing columns) and change the
grouping/rendering logic to iterate and merge/render samples per (node, rank)
instead of per node. Use parse_nixl_ep_bandwidth_samples' sample.rank field when
constructing rows so each (node, rank) produces its own row (including the
Dispatch/Combine BW and Avg/Min/Max columns when has_combined/has_kineto are
set) and ensure any mean calculations are applied per (node, rank) not across
all ranks on a node.
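
As a rough illustration of the per-(node, rank) grouping this finding asks for, the sketch below averages bandwidth per pair instead of per node. `Sample` and `mean_bw_per_node_rank` are hypothetical stand-ins; the real sample dataclass in `log_parsing.py` may carry different fields.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Sample:
    """Hypothetical stand-in for one parsed bandwidth sample."""

    node: str
    rank: int
    dispatch_bw: float


def mean_bw_per_node_rank(samples: list[Sample]) -> dict[tuple[str, int], float]:
    """Average dispatch bandwidth per (node, rank) rather than per node."""
    grouped: dict[tuple[str, int], list[float]] = defaultdict(list)
    for sample in samples:
        grouped[(sample.node, sample.rank)].append(sample.dispatch_bw)
    # Sorted keys give a stable row order when rendering a table.
    return {key: mean(values) for key, values in sorted(grouped.items())}
```

Each key of the returned mapping would correspond to one table row, so two ranks on the same node are no longer merged into a single mean.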

In `@src/cloudai/workloads/nixl_ep/slurm_command_gen_strategy.py`:
- Around line 245-251: The _render_launch method currently joins arguments
unsafely and fails to quote the log file and certain shell-sensitive values;
update _render_launch to build the benchmark command by quoting each token
returned from _build_benchmark_command(launch) with shlex.quote(), but when an
argument equals the literal "$master_ip" emit it as a double-quoted "$master_ip"
(to preserve runtime expansion), then construct the script without the brittle
.replace('"','\\"') step; also wrap the log_file shown in the --output option
with shlex.quote(str(log_file)) and keep the srun prefix via
_launch_srun_prefix(launch.node_idx) and open_mode_arg as before so the final
returned string safely embeds the quoted bash -c "<script>" invocation.
- Around line 189-198: _build_benchmark_command currently passes
NixlEPCmdArgs.elastic_script through unchanged, which breaks when elastic_script
is a path relative to the container runtime root; resolve the elastic_script to
the container runtime absolute path before building the command (use the same
resolution mechanism you use for plan paths — e.g. mirror resolve_plan_path or
add a resolve_script_path helper) and replace cmd_args.elastic_script in the
command array with the resolved path so relative values like
"examples/device/ep/tests/elastic/elastic.py" work inside the container.

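The token-quoting part of the `_render_launch` finding can be sketched as follows. This is a minimal example under stated assumptions: `quote_token` and `render_command` are hypothetical helpers, the flag names in the usage note are invented for illustration, and the outer `bash -c` wrapping from the real method is not covered.

```python
import shlex

# Placeholder that must stay expandable by the shell at run time.
MASTER_IP_TOKEN = "$master_ip"


def quote_token(token: str) -> str:
    """Quote one argument for the shell, keeping $master_ip expandable."""
    if token == MASTER_IP_TOKEN:
        # Double quotes still allow variable expansion; shlex.quote would
        # single-quote the token and turn it into a literal string.
        return f'"{MASTER_IP_TOKEN}"'
    return shlex.quote(token)


def render_command(tokens: list[str]) -> str:
    """Join already-tokenized arguments into one shell-safe command string."""
    return " ".join(quote_token(token) for token in tokens)
```

For example, `render_command(["python3", "elastic.py", "--server", "$master_ip", "--plan", "/tmp/my plan.json"])` leaves the plain tokens untouched, double-quotes the master IP placeholder, and single-quotes the path containing a space.
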
In `@tests/test_test_scenario.py`:
- Line 626: The parametrized test test_custom_reporters is missing the new
NixlEPTestDefinition entry: add (NixlEPTestDefinition,
{NixlEPReportGenerationStrategy}) to its parameter list and update the
assertions accordingly so Registry().reports_map still validates mapping for
NixlEPTestDefinition; also add imports for NixlEPTestDefinition and
NixlEPReportGenerationStrategy at the top of the test file so the symbols
resolve.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: fc20196a-68a9-4029-975f-839c2691690d

📥 Commits

Reviewing files that changed from the base of the PR and between 43f24aa and 7dfb84d.

📒 Files selected for processing (16)

doc/workloads/index.rst
doc/workloads/nixl_ep.rst
src/cloudai/registration.py
src/cloudai/workloads/nixl_ep/__init__.py
src/cloudai/workloads/nixl_ep/log_parsing.py
src/cloudai/workloads/nixl_ep/nixl_ep.py
src/cloudai/workloads/nixl_ep/report_generation_strategy.py
src/cloudai/workloads/nixl_ep/slurm_command_gen_strategy.py
tests/ref_data/nixl-ep.sbatch
tests/test_acceptance.py
tests/test_init.py
tests/test_test_scenario.py
tests/workloads/nixl_ep/__init__.py
tests/workloads/nixl_ep/test_command_gen_strategy_slurm.py
tests/workloads/nixl_ep/test_job_status_retrieval_strategy.py
tests/workloads/nixl_ep/test_log_parsing.py

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/workloads/nixl_ep/nixl_ep.py`:
- Around line 105-108: The type-check in parse_plan currently raises ValueError
for a non-string self.plan; change this to raise TypeError instead. Update the
exception in the parse_plan method (the check on self.plan) to raise a TypeError
with a clear message (mentioning parse_plan and expected string) and keep the
rest of the logic that calls self._parse_plan(self.plan) unchanged.
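
A minimal sketch of the distinction this finding draws: a wrong argument type raises `TypeError`, while a string with bad content raises `ValueError`. The standalone `parse_plan` function below is hypothetical and only mirrors the shape of the method named in the comment.

```python
import json


def parse_plan(plan: object) -> list:
    """Parse a JSON plan string (hypothetical standalone variant)."""
    if not isinstance(plan, str):
        # Wrong type of argument: TypeError, not ValueError.
        raise TypeError(f"parse_plan expected a string, got {type(plan).__name__}")
    try:
        return json.loads(plan)
    except json.JSONDecodeError as exc:
        # Right type, invalid content: ValueError.
        raise ValueError(f"plan is not valid JSON: {exc}") from exc
```

Callers can then tell a programming error (passing a list instead of a string) apart from a malformed plan supplied by a user.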

In `@tests/workloads/nixl_ep/test_job_status_retrieval_strategy.py`:
- Around line 196-220: The parametrize call uses a comma-separated string for
argument names which Ruff PT006 flags; update the pytest.mark.parametrize
invocation to pass a tuple of parameter names instead (e.g., replace the string
"log_content, expected_fragment" with a tuple containing the two identifiers) so
the decorator uses a tuple of names for log_content and expected_fragment in the
pytest.mark.parametrize call.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8afaf7aa-8afc-4700-be1d-33eb1ac6abfd

📥 Commits

Reviewing files that changed from the base of the PR and between 7dfb84d and 18aafa6.

📒 Files selected for processing (4)

doc/workloads/nixl_ep.rst
src/cloudai/workloads/nixl_ep/nixl_ep.py
tests/test_test_scenario.py
tests/workloads/nixl_ep/test_job_status_retrieval_strategy.py

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_acceptance.py`:
- Around line 415-417: The third phase in the test data for "plan" contains an
invalid negative rank `-6`; update the JSON payload used in the test (the "plan"
value) to replace `-6` with the correct non-negative rank (likely `5` to
continue the ascending sequence) so the phase reads `[0, 1, 2, 3, 4, 5, 7]` (or
`6` if you intended to skip 5), and run the test to ensure the phase ordering is
correct; locate the "plan" JSON assignment in the test_acceptance.py snippet to
make this change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: b06809fe-c13c-4b25-b32c-369976dfcd49

📥 Commits

Reviewing files that changed from the base of the PR and between 18aafa6 and c005fa2.

📒 Files selected for processing (1)

tests/test_acceptance.py

amaslenn force-pushed the am/cloudai-6/nixl-ep branch 2 times, most recently from 879dcf7 to 6fa76db, on March 24, 2026 13:14

Add NIXL EP workload

deb15b5

amaslenn force-pushed the am/cloudai-6/nixl-ep branch from 6fa76db to deb15b5 on March 24, 2026 13:15

amaslenn added 5 commits March 24, 2026 14:18

Update report

e2a2649

Draw phases info in report

c484e60

Align left

d6fa2a8

Address issues

5092cd7

Make ruff happy

7dfb84d

amaslenn marked this pull request as ready for review March 24, 2026 13:44

amaslenn requested review from jeffnvidia and srivatsankrishnan as code owners March 24, 2026 13:44

coderabbitai bot reviewed Mar 24, 2026

Address review comments

18aafa6

coderabbitai bot reviewed Mar 24, 2026

src/cloudai/workloads/nixl_ep/nixl_ep.py (resolved)

tests/workloads/nixl_ep/test_job_status_retrieval_strategy.py (resolved)

amaslenn requested a review from podkidyshev March 24, 2026 14:24

podkidyshev reviewed Mar 24, 2026

src/cloudai/workloads/nixl_ep/nixl_ep.py (resolved)

tests/test_acceptance.py (resolved)

Restore accidentally removed code

c005fa2

coderabbitai bot reviewed Mar 24, 2026

tests/test_acceptance.py (resolved)

podkidyshev approved these changes Mar 24, 2026

amaslenn merged commit e330d84 into main Mar 25, 2026
5 checks passed

amaslenn deleted the am/cloudai-6/nixl-ep branch March 25, 2026 07:25

amaslenn merged 8 commits into main from am/cloudai-6/nixl-ep
