Conversation

@nsarka
Member

@nsarka nsarka commented Oct 16, 2025

DDLB workload integration

Contributor

@amaslenn amaslenn left a comment

Thank you for your contribution!

Please also:

  1. Extend test_acceptance.py to cover the sbatch generation logic.
  2. Add a documentation page for this workload (see doc/workloads for examples) and link it from the main page.

Contributor

@greptile-apps greptile-apps bot left a comment

6 files reviewed, no comments

@nsarka
Member Author

nsarka commented Oct 22, 2025

Thanks for the review. I will update with these changes.

In the meantime, when I tried running this change, I found that srun hangs when I use --container-image. Do you have any ideas on how to troubleshoot this? @amaslenn

Here is the output of a manual run:

srun --export=ALL --mpi=pmix --container-image=gitlab-master.nvidia.com/dl/pytorch/update-scripts:pjnl-latest hostname
pyxis: importing docker image ...
pyxis: importing docker image ...

I figured since the container is ~9 GB, I should wait a little bit. But it's been about 4 hours, so I think it's safe to assume it's a hang.

@amaslenn
Contributor

Depending on the system, it can take some time, but 4 hours for 9 GB is too long.

Have you tried enabling local caching in the system configuration with cache_docker_images_locally = true (https://nvidia.github.io/cloudai/USER_GUIDE.html#step-4-system-configuration) and running cloudai install? This runs srun ... enroot ... to cache the image explicitly. Once that is done, subsequent runs will use the .sqsh file instead of pulling the image every time.

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The most recent changes address previously raised issues about copyright dates and commented code. The developer has updated copyright headers in newly added files to use only "2025" (instead of "2024-2025") and removed commented pre_test and post_test lines from the test scenario configuration, streamlining the DDLB integration. These are minor cleanup changes that improve code consistency with project conventions.

Important Files Changed

Filename | Score | Overview
conf/common/test_scenario/ddlb_test.toml | 5/5 | Removed commented pre_test and post_test hook lines, leaving clean minimal configuration
src/cloudai/workloads/ddlb/__init__.py | 5/5 | Updated copyright year from "2024-2025" to "2025" only
src/cloudai/registration.py | 5/5 | Updated copyright year from "2024-2025" to "2025" only
src/cloudai/workloads/ddlb/ddlb.py | 5/5 | Updated copyright year from "2024-2025" to "2025" only
src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 5/5 | Updated copyright year from "2024-2025" to "2025" only
conf/common/test/ddlb_test.toml | 5/5 | Updated copyright year from "2024-2025" to "2025" only

Confidence score: 5/5

  • These changes are safe to merge as they only address formatting and consistency issues raised in previous reviews
  • The score reflects that these are purely cosmetic/metadata changes with no functional impact on code behavior
  • No files require special attention; all changes are straightforward corrections to copyright headers and removal of commented placeholder code

6 files reviewed, 7 comments

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This review covers changes made to the DDLB workload since the last review, not the entire PR. The developer addressed most of the critical feedback from prior reviews by fixing the unreachable code bug, removing dead configuration comments, simplifying validation logic, and standardizing copyright headers to "2025" for newly added files. The key fix removes the duplicate "Error" check in ddlb.py lines 58/68 that made success validation unreachable, and eliminates the unused missing_indicators list. PEP 8 formatting was also corrected. The test scenario timeout was extended from 10 to 30 minutes to allow DDLB benchmarks to complete. These changes clean up the DDLB integration while addressing previously flagged code quality issues.

Important Files Changed

Filename | Score | Overview
src/cloudai/workloads/ddlb/__init__.py | 5/5 | Updated copyright year from "2024-2025" to "2025" (administrative only)
conf/common/test_scenario/ddlb_test.toml | 4.5/5 | Extended test time limit from 10 to 30 minutes and removed dead commented-out fields
src/cloudai/workloads/ddlb/ddlb.py | 5/5 | Fixed critical duplicate error check bug making validation unreachable; removed unused missing_indicators list
src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 4/5 | Simplified command generation by removing intermediate variable; copyright updated

Confidence score: 4/5

  • This PR addresses critical bugs but one code smell remains that should be resolved before merging
  • Score reflects that the duplicate error check bug was fixed and copyright headers were standardized, but the unused tdef variable in slurm_command_gen_strategy.py still exists from prior reviews, and concerns about the relative path safety raised in previous review ("Is it safe to use relative path? We can introduce a field in the test definition for this workload to hold path_to_script.") remain unaddressed
  • Review slurm_command_gen_strategy.py carefully—the unused tdef variable suggests the test definition may need to be used for configuration in the future, and the hardcoded relative path "scripts/run_benchmark.py" may cause failures if executed from unexpected working directories
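
The relative-path concern could be handled along the following lines. This is only a minimal sketch: the `path_to_script` field comes from the quoted review suggestion, while `DDLBScriptConfig`, `base_dir`, and `build_command` are hypothetical names introduced here for illustration and are not part of CloudAI.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import List


@dataclass
class DDLBScriptConfig:
    """Hypothetical holder for the script location (not a CloudAI class)."""

    # Assumed default: the relative path mentioned in the review.
    path_to_script: str = "scripts/run_benchmark.py"
    # Assumed optional base directory inside the container.
    base_dir: str = ""


def build_command(cfg: DDLBScriptConfig) -> List[str]:
    """Build the test command, anchoring a relative script path to base_dir if set."""
    script = PurePosixPath(cfg.path_to_script)
    if not script.is_absolute() and cfg.base_dir:
        script = PurePosixPath(cfg.base_dir) / script
    return ["python", str(script)]
```

With an explicit base directory the command no longer depends on the working directory in which the job step starts.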

4 files reviewed, no comments

@nsarka
Member Author

nsarka commented Oct 27, 2025

Thanks. I opted to try it on another cluster, and it failed there too with:

slurmstepd: error: pyxis:     [INFO] Creating squashfs filesystem...
slurmstepd: error: pyxis:     Write failed because No space left on device
slurmstepd: error: pyxis:     FATAL ERROR: Failed to write to output filesystem
slurmstepd: error: pyxis:     Parallel mksquashfs: Using 32 processors
slurmstepd: error: pyxis:     Creating 4.0 filesystem on /run/pyxis/47469/846367.2.squashfs, block size 131072.

It seems like the container is too big to convert to a .sqsh file with the scratch space available in enroot's tmpfs. Is there a way to pass the .sqsh file directly to CloudAI without caching? I want to make the .sqsh file and copy it over to the machine I'm testing on.
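
A quick way to confirm the scratch-space theory is to compare the free space on the target filesystem with the image size before starting the conversion. The sketch below is an assumption-laden helper, not part of CloudAI, pyxis, or enroot: the function name, the `headroom` factor, and the choice of directory to inspect (for example the `/run/pyxis` path seen in the log) are all hypothetical.

```python
import shutil


def has_room_for_image(scratch_dir: str, image_bytes: int, headroom: float = 1.5) -> bool:
    """Return True if scratch_dir has at least image_bytes * headroom bytes free.

    The headroom factor is a guess; the real space needed for a squashfs
    conversion depends on compression and on intermediate layers.
    """
    free = shutil.disk_usage(scratch_dir).free
    return free >= image_bytes * headroom
```

For the ~9 GB image discussed here, a call such as `has_room_for_image("/run/pyxis", 9 * 1024**3)` on the compute node would show whether the tmpfs is the bottleneck.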

@amaslenn
Contributor

If this image is too big, how will you create the .sqsh file?
You can specify a local file in the docker_image_url field; it will bypass enroot.

@amaslenn
Contributor

@nsarka please merge the latest main branch into your PR to align the list of checks.

@nsarka nsarka force-pushed the nsarka/ddlb-integration branch from 1171554 to 2bf56e4 Compare October 31, 2025 15:12
Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

Adds DDLB (Distributed Deep Learning Benchmark) workload integration to CloudAI. The implementation follows established patterns from other workloads like NCCL and ChakraReplay.

Key Changes:

  • New test definition (DDLBTestDefinition) with Docker image support
  • Slurm command generation strategy that executes python scripts/run_benchmark.py
  • Configuration files for test setup (single node, 30-minute timeout)
  • Success validation checking for "Benchmark Results" in stdout
  • Registration in the main registry alongside other test definitions

Observations:

  • The error detection uses a generic "Error" string check which may produce false positives
  • The implementation is minimal but functional, delegating most logic to the container's benchmark script
  • No unit tests included for the new workload (though other workloads have test coverage)
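
One way to reduce false positives from a bare "Error" substring check is to match only lines that look like actual failures. The sketch below assumes Python-style failure output (a traceback header or a line beginning with an exception name followed by a colon); the real DDLB output format is not shown in this thread, so the pattern and the `has_error` helper are illustrative assumptions.

```python
import re

# Assumed failure markers: a traceback header, or a line that starts with an
# exception-style name such as "RuntimeError:" or "Error:".
ERROR_LINE = re.compile(
    r"^(?:Traceback \(most recent call last\):|\w*Error:)",
    re.MULTILINE,
)


def has_error(stdout: str) -> bool:
    """Return True only when a line starts with one of the assumed failure markers."""
    return ERROR_LINE.search(stdout) is not None
```

A line such as `max abs Error vs reference: 0.0` contains the substring "Error" but does not match this pattern, whereas `RuntimeError: ...` at the start of a line does.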

Confidence Score: 4/5

  • This PR is safe to merge with minor refinements recommended
  • The implementation follows existing patterns closely (NCCL, ChakraReplay) and integrates cleanly into the registry. The main concern is the generic error detection string which could cause false positives. The code is well-structured and mirrors established workload patterns, making it maintainable. No breaking changes or security issues identified.
  • Primary attention needed on src/cloudai/workloads/ddlb/ddlb.py for error detection refinement

Important Files Changed

File Analysis

Filename | Score | Overview
src/cloudai/workloads/ddlb/ddlb.py | 4/5 | Core DDLB test definition with generic error detection ('Error' string may match false positives), success validation checks for 'Benchmark Results'
src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 5/5 | Slurm command generation for DDLB, returns static test command, success check validates 'Benchmark Results' in output

Sequence Diagram

sequenceDiagram
    participant User
    participant Registry
    participant DDLBTestDefinition
    participant DDLBTestSlurmCommandGenStrategy
    participant SlurmSystem
    participant DockerImage
    participant OutputFile

    User->>Registry: Register DDLB workload
    Registry->>Registry: Add DDLBTestDefinition
    Registry->>Registry: Add DDLBTestSlurmCommandGenStrategy
    
    User->>DDLBTestDefinition: Create test with docker_image_url
    DDLBTestDefinition->>DockerImage: Initialize DockerImage(url)
    
    User->>DDLBTestSlurmCommandGenStrategy: Generate test command
    DDLBTestSlurmCommandGenStrategy->>DDLBTestDefinition: Get docker_image.installed_path
    DDLBTestDefinition-->>DDLBTestSlurmCommandGenStrategy: Return image path
    DDLBTestSlurmCommandGenStrategy->>SlurmSystem: Generate srun command with container
    DDLBTestSlurmCommandGenStrategy-->>User: Return ["python scripts/run_benchmark.py"]
    
    User->>SlurmSystem: Execute test via Slurm
    SlurmSystem->>OutputFile: Write stdout.txt
    
    User->>DDLBTestDefinition: Check was_run_successful()
    DDLBTestDefinition->>OutputFile: Read stdout.txt
    alt Contains "Error"
        DDLBTestDefinition-->>User: JobStatusResult(False, error details)
    else Missing "Benchmark Results"
        DDLBTestDefinition-->>User: JobStatusResult(False, missing indicators)
    else Success
        DDLBTestDefinition-->>User: JobStatusResult(True)
    end

6 files reviewed, 1 comment

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

adds DDLB (Distributed Deep Learning Benchmark) workload support following the existing CloudAI workload pattern with test definition, Slurm command generation strategy, and configuration files.

Key changes:

  • registered DDLB workload in src/cloudai/registration.py
  • created DDLBTestDefinition with Docker image management and success validation based on "Benchmark Results" pattern
  • implemented DDLBTestSlurmCommandGenStrategy to generate mpirun commands
  • added test configuration (conf/common/test/ddlb_test.toml) and test scenario

Issues found:

  • critical command generation bug in slurm_command_gen_strategy.py:36 that produces malformed commands
  • unused imports in ddlb.py

Confidence Score: 2/5

  • critical bug in command generation will cause runtime failures when executing DDLB tests
  • the generate_test_command method in slurm_command_gen_strategy.py:36 constructs a malformed command list with "mpirun -np " (trailing space) as a single element, which when joined with spaces produces "mpirun -np  8 python..." (double space). This breaks command parsing and will cause test execution failures
  • src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py requires immediate fix to command generation logic
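
The effect described above can be reproduced in isolation. This is a small sketch using the token list quoted in the review; the one-token-per-argument list is a suggested shape, not the code that was eventually merged.

```python
import shlex

# Token list as described in the review: the first element carries a trailing space.
malformed = ["mpirun -np ", "8", "python scripts/run_benchmark.py"]
joined = " ".join(malformed)  # two spaces end up between "-np" and "8"

# One token per argument avoids the stray whitespace.
tokens = ["mpirun", "-np", "8", "python", "scripts/run_benchmark.py"]
clean = " ".join(tokens)

# POSIX-style splitting collapses repeated whitespace, so both strings
# tokenize identically here.
same_tokens = shlex.split(joined) == shlex.split(clean)
```

Since shell-style splitting treats both strings the same way, the practical impact depends on how the joined command is consumed, for example by exact string comparison against reference sbatch output in tests.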

Important Files Changed

File Analysis

Filename | Score | Overview
conf/common/test/ddlb_test.toml | 4/5 | configuration file with hardcoded path, standard structure matches other test configs
src/cloudai/workloads/ddlb/ddlb.py | 3/5 | test definition with unused imports, follows established patterns for workload definitions
src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 2/5 | command generation with critical bug in generate_test_command list structure (line 36) that will produce malformed command

Sequence Diagram

sequenceDiagram
    participant User
    participant Registry
    participant DDLBTestDefinition
    participant SlurmCommandGenStrategy
    participant DockerImage
    participant SlurmSystem

    User->>Registry: register DDLB workload
    Registry->>DDLBTestDefinition: register test definition
    Registry->>SlurmCommandGenStrategy: register command gen strategy
    
    User->>DDLBTestDefinition: load test config (ddlb_test.toml)
    DDLBTestDefinition->>DockerImage: initialize docker_image from docker_image_url
    
    User->>SlurmCommandGenStrategy: generate execution command
    SlurmCommandGenStrategy->>DDLBTestDefinition: get test definition
    DDLBTestDefinition->>DockerImage: get installed_path
    SlurmCommandGenStrategy->>SlurmCommandGenStrategy: generate_test_command() -> ["mpirun -np ", "8", "python scripts/run_benchmark.py"]
    SlurmCommandGenStrategy->>SlurmSystem: create sbatch script with srun command
    
    User->>SlurmSystem: execute job
    SlurmSystem->>SlurmSystem: run mpirun with DDLB benchmark
    SlurmSystem-->>User: output to stdout.txt
    
    User->>DDLBTestDefinition: was_run_successful(test_run)
    DDLBTestDefinition->>DDLBTestDefinition: check stdout.txt for "Error" or "Benchmark Results"
    DDLBTestDefinition-->>User: JobStatusResult

3 files reviewed, 2 comments

This reverts commit eda5d0e.
Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

Adds DDLB (Distributed Deep Learning Benchmark) workload integration to CloudAI, following the existing pattern for workload registration with Slurm systems.

Key Changes

  • Added DDLB test definition with Docker image support and job success validation
  • Implemented Slurm command generation strategy for DDLB workloads
  • Created configuration files for test and test scenario definitions
  • Registered DDLB workload in the global registry alongside other workloads like NCCL and UCC

Issues Identified

  • Critical: Duplicate error checking logic in ddlb.py:59-68 makes second condition unreachable
  • Unused imports (Literal, Union) in ddlb.py:17
  • Unused tdef variable in slurm_command_gen_strategy.py:35
  • Generic "Error" pattern may cause false positives
  • Potential None handling issue in image_path() when installed_path is None

Confidence Score: 2/5

  • Not safe to merge - contains critical logic bug that prevents proper error detection
  • The duplicate error check at lines 59-68 in ddlb.py creates unreachable code that will prevent the success indicator check from ever executing. This is a critical bug that breaks the test validation logic. Additionally, the generic "Error" pattern is prone to false positives, and the image_path() method may return string "None" instead of handling None properly.
  • src/cloudai/workloads/ddlb/ddlb.py requires immediate attention due to unreachable code, and slurm_command_gen_strategy.py needs review for None handling
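
The unreachable-code problem comes down to the order and number of checks. The sketch below shows one possible shape for the validation flow, with each condition tested once. `JobStatusResult` here is a simplified stand-in whose fields are assumed, and `check_stdout` is a hypothetical helper rather than the actual `was_run_successful` implementation.

```python
from dataclasses import dataclass


@dataclass
class JobStatusResult:
    """Simplified stand-in for CloudAI's JobStatusResult (fields assumed)."""

    is_successful: bool
    error_message: str = ""


def check_stdout(stdout: str) -> JobStatusResult:
    """Classify a run from its stdout text."""
    # Test the error indicator exactly once...
    if "Error" in stdout:
        return JobStatusResult(False, "Error found in stdout")
    # ...so the success-indicator check below stays reachable.
    if "Benchmark Results" not in stdout:
        return JobStatusResult(False, "Missing 'Benchmark Results' in stdout")
    return JobStatusResult(True)
```

Each branch returns early, so no later condition can be shadowed by a repeated earlier one.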

Important Files Changed

File Analysis

Filename | Score | Overview
src/cloudai/workloads/ddlb/ddlb.py | 2/5 | DDLB test definition with critical logic error in duplicate error checking (lines 59-68 unreachable), unused imports, and overly generic error pattern matching
src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 3/5 | Command generation strategy with unused tdef variable and potential None handling issue in image_path() method

Sequence Diagram

sequenceDiagram
    participant User
    participant Registry
    participant TestRunner
    participant DDLBTestDefinition
    participant DDLBTestSlurmCommandGenStrategy
    participant SlurmSystem
    participant DockerImage

    User->>Registry: Register DDLB workload
    Registry->>Registry: Add DDLBTestDefinition
    Registry->>Registry: Add DDLBTestSlurmCommandGenStrategy

    User->>TestRunner: Execute DDLB test
    TestRunner->>DDLBTestDefinition: Load test configuration
    DDLBTestDefinition->>DockerImage: Initialize docker_image from URL
    
    TestRunner->>DDLBTestSlurmCommandGenStrategy: Generate command
    DDLBTestSlurmCommandGenStrategy->>DDLBTestDefinition: Get docker_image.installed_path
    DDLBTestSlurmCommandGenStrategy->>DDLBTestSlurmCommandGenStrategy: Generate test command
    DDLBTestSlurmCommandGenStrategy-->>TestRunner: Return ["python scripts/run_benchmark.py"]
    
    TestRunner->>SlurmSystem: Submit job with srun command
    SlurmSystem-->>TestRunner: Job execution
    
    TestRunner->>DDLBTestDefinition: was_run_successful()
    DDLBTestDefinition->>DDLBTestDefinition: Read stdout.txt
    DDLBTestDefinition->>DDLBTestDefinition: Check for "Error" pattern
    DDLBTestDefinition->>DDLBTestDefinition: Check for "Benchmark Results"
    DDLBTestDefinition-->>TestRunner: Return JobStatusResult

3 files reviewed, no comments

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

Integrates DDLB (Distributed Deep Learning Benchmark) workload with configuration files, test definitions, command generation strategy, and registration in the CloudAI framework.

Key Changes

  • Added new workload module at src/cloudai/workloads/ddlb/ with test definition and Slurm command generation
  • Configured DDLB test parameters including primitives, matrix dimensions (m, n, k), implementations, and Docker image
  • Registered DDLB test definition and command generation strategy in registration.py

Issues Found

  • Critical: Command generation uses is for string comparison instead of ==, causing Python SyntaxWarning
  • Critical: List values in configuration (e.g., m = [1024,8192], impl = [...]) will be formatted as Python literals instead of proper CLI arguments
  • Logic error: Duplicate error check makes lines 68-77 in ddlb.py unreachable
  • Multiple style issues from previous review (unused imports, alphabetical ordering)

Confidence Score: 2/5

  • Not safe to merge—critical command generation bugs will cause runtime failures
  • Two critical logic errors in command generation strategy: using is instead of == for string comparison (Python SyntaxWarning) and no handling of list values which are defined in the config. List parameters like m = [1024,8192] will generate malformed commands like -m [1024, 8192] instead of proper CLI format. These will cause the DDLB benchmark to fail at runtime.
  • Pay close attention to src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py which has critical command generation bugs
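
The difference between `is` and `==` for strings can be seen with a string assembled at runtime. This is a generic Python illustration, not code from the PR; the value "primitive" is used only because it is one of the parameter names mentioned in the review.

```python
expected = "primitive"

# Build an equal string at runtime so it is a separate object from `expected`.
built = "".join(["prim", "itive"])

# Value comparison: this is what an argument-name check needs.
same_value = built == expected

# Identity comparison: depends on whether the interpreter happens to reuse
# the same object, so it should not be used to compare string contents.
same_object = built is expected
```

Writing `name is "literal"` directly additionally triggers a SyntaxWarning on Python 3.8 and later, which is the warning the review refers to.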

Important Files Changed

File Analysis

Filename | Score | Overview
src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 2/5 | Command generation has critical logic error using is for string comparison and doesn't handle list values which will cause malformed commands
src/cloudai/workloads/ddlb/ddlb.py | 3/5 | Validation logic has duplicate error checks and unused imports; error detection pattern may produce false positives
src/cloudai/registration.py | 4/5 | Registration follows standard pattern but import placement breaks alphabetical ordering

Sequence Diagram

sequenceDiagram
    participant User
    participant CloudAI
    participant Registry
    participant DDLBTestDef
    participant SlurmCmdGen
    participant SlurmSystem
    participant Docker

    User->>CloudAI: Load test config (ddlb_test.toml)
    CloudAI->>Registry: Lookup DDLBTest
    Registry->>DDLBTestDef: Create test definition
    DDLBTestDef->>Docker: Initialize docker image
    Docker-->>DDLBTestDef: Return image handle
    
    User->>CloudAI: Run test scenario
    CloudAI->>Registry: Get command gen strategy
    Registry->>SlurmCmdGen: Create strategy for Slurm+DDLB
    SlurmCmdGen->>DDLBTestDef: Get cmd_args (m, n, k, impl, etc)
    SlurmCmdGen->>SlurmCmdGen: Generate test command
    Note over SlurmCmdGen: Build CLI args from config<br/>Handle list params (BUG HERE)
    SlurmCmdGen->>SlurmSystem: Submit slurm job
    
    SlurmSystem->>Docker: Run container with benchmark
    Docker->>Docker: Execute python ddlb/cli/benchmark.py
    Docker-->>SlurmSystem: Write stdout.txt
    
    SlurmSystem-->>CloudAI: Job complete
    CloudAI->>DDLBTestDef: was_run_successful()
    DDLBTestDef->>DDLBTestDef: Check stdout.txt for "Benchmark Results"
    DDLBTestDef-->>User: Return success/failure

6 files reviewed, 2 comments

nsarka and others added 2 commits November 5, 2025 23:52
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

Adds DDLB (Distributed Deep Learning Benchmark) workload integration to CloudAI, enabling benchmarking of tensor parallel operations within the framework.

Key additions:

  • New DDLBTestDefinition and DDLBTestSlurmCommandGenStrategy classes for test definition and Slurm command generation
  • Configuration files for test scenarios and experimental tests
  • Integration with the registry system and test acceptance suite

Critical issues identified:

  • List parameter handling in command generation will produce malformed CLI commands (e.g., -m [1024, 8192] instead of proper format)
  • String comparison using is operator instead of ==
  • Duplicate error checking logic making code unreachable
  • Several style issues including unused imports and minor formatting inconsistencies

Confidence Score: 2/5

  • This PR has critical bugs that will cause runtime failures with list parameters
  • The list handling bug in slurm_command_gen_strategy.py:46-52 will cause command generation failures when using the experimental config with list values for m, n, k, or impl parameters. The is operator issue and duplicate error check are also problematic. These are not style issues but actual logic errors that need resolution before merge.
  • Primary attention needed: src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py (list handling and string comparison), src/cloudai/workloads/ddlb/ddlb.py (duplicate error check)

Important Files Changed

File Analysis

Filename | Score | Overview
src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 2/5 | Command generation strategy with critical list handling bug and string comparison issue
src/cloudai/workloads/ddlb/ddlb.py | 3/5 | Test definition with duplicate error check and unused imports
conf/experimental/test/ddlb_test.toml | 4/5 | Test config with list parameters that will trigger command generation bug

Sequence Diagram

sequenceDiagram
    participant User
    participant Registry
    participant DDLBTestDefinition
    participant DDLBTestSlurmCommandGenStrategy
    participant SlurmSystem
    participant DockerImage
    participant TestRun

    User->>Registry: Register DDLBTest workload
    Registry->>DDLBTestDefinition: Create test definition from config
    DDLBTestDefinition->>DockerImage: Initialize docker_image with URL
    User->>TestRun: Execute test
    TestRun->>DDLBTestSlurmCommandGenStrategy: generate_test_command()
    DDLBTestSlurmCommandGenStrategy->>DDLBTestDefinition: Get cmd_args
    DDLBTestSlurmCommandGenStrategy->>DDLBTestSlurmCommandGenStrategy: Build command with args
    DDLBTestSlurmCommandGenStrategy->>SlurmSystem: Generate sbatch script
    SlurmSystem->>SlurmSystem: Execute srun with container
    SlurmSystem->>TestRun: Write stdout.txt
    TestRun->>DDLBTestDefinition: was_run_successful(tr)
    DDLBTestDefinition->>DDLBTestDefinition: Check stdout.txt for errors
    DDLBTestDefinition->>DDLBTestDefinition: Check for "Benchmark Results"
    DDLBTestDefinition->>TestRun: Return JobStatusResult

9 files reviewed, no comments

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

Integrates DDLB (Distributed Deep Learning Benchmarks) as a new workload in CloudAI, enabling benchmarking of distributed primitives like tensor parallelism.

Key changes:

  • Added DDLBTestDefinition and DDLBCmdArgs classes with support for configurable parameters (primitive, m/n/k dimensions, dtype, implementations)
  • Implemented DDLBTestSlurmCommandGenStrategy for Slurm command generation
  • Registered DDLB workload in the CloudAI registry
  • Added comprehensive documentation with usage examples

Critical issues found:

  • Missing Union import in ddlb.py will cause immediate runtime NameError when class is loaded
  • List parameter handling in command generation produces malformed CLI syntax (e.g., -m [1024, 8192] instead of -m 1024,8192)

These issues will prevent the workload from functioning correctly and must be fixed before merge.
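
Both issues can be illustrated together in a short sketch. The class below is a hypothetical subset of `DDLBCmdArgs` (only the `m`, `n`, `k` fields named in the review, with assumed defaults), and `format_value` and `cli_pair` are illustrative helpers, not functions from the PR. The comma-separated target format follows the `-m 1024,8192` example given in the review.

```python
from dataclasses import dataclass
from typing import List, Union  # Union must be imported before it is used in annotations


@dataclass
class CmdArgsSketch:
    """Hypothetical subset of DDLBCmdArgs; defaults are placeholders."""

    m: Union[int, List[int]] = 1024
    n: Union[int, List[int]] = 1024
    k: Union[int, List[int]] = 1024


def format_value(value: Union[int, str, List[int], List[str]]) -> str:
    """Render lists as comma-separated values instead of Python list literals."""
    if isinstance(value, list):
        return ",".join(str(v) for v in value)
    return str(value)


def cli_pair(flag: str, value: Union[int, str, List[int], List[str]]) -> List[str]:
    """Return a flag and its formatted value as two separate tokens."""
    return [flag, format_value(value)]
```

For example, `cli_pair("-m", CmdArgsSketch(m=[1024, 8192]).m)` yields the tokens `-m` and `1024,8192`, whereas calling `str()` on the list directly would give `[1024, 8192]`.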

Confidence Score: 1/5

  • This PR has critical bugs that will cause runtime failures and cannot be merged safely
  • Two syntax/logic errors will break functionality: (1) missing Union import causes NameError on module load, (2) list parameters generate malformed CLI commands. Both prevent workload from executing. Documentation and integration structure are solid, but code issues are blocking.
  • Critical attention needed for src/cloudai/workloads/ddlb/ddlb.py (missing import) and src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py (list handling bug)

Important Files Changed

File Analysis

Filename | Score | Overview
src/cloudai/workloads/ddlb/ddlb.py | 2/5 | Core DDLB test definition with missing Union import causing runtime error; success checking logic is functional but has generic error pattern
src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 1/5 | Command generation strategy with critical list handling bug that will produce malformed CLI commands for list-type parameters

Sequence Diagram

sequenceDiagram
    participant User
    participant CloudAI
    participant Registry
    participant DDLBTestDefinition
    participant DDLBTestSlurmCommandGenStrategy
    participant SlurmSystem
    participant DockerImage
    participant DDLB

    User->>CloudAI: Define DDLB test (TOML config)
    CloudAI->>Registry: Register DDLBTestDefinition
    Registry->>DDLBTestDefinition: Create test definition
    DDLBTestDefinition->>DockerImage: Initialize docker_image from URL
    
    User->>CloudAI: Run test scenario
    CloudAI->>Registry: Get command generation strategy
    Registry->>DDLBTestSlurmCommandGenStrategy: Instantiate strategy
    
    DDLBTestSlurmCommandGenStrategy->>DDLBTestDefinition: Get cmd_args
    DDLBTestSlurmCommandGenStrategy->>DDLBTestSlurmCommandGenStrategy: generate_test_command()
    Note over DDLBTestSlurmCommandGenStrategy: Build Python command with DDLB CLI args
    
    DDLBTestSlurmCommandGenStrategy->>SlurmSystem: Generate sbatch script
    Note over SlurmSystem: Include srun with container mounts
    
    SlurmSystem->>DDLB: Execute via srun in container
    DDLB->>DDLB: Run benchmark (primitive, m, n, k, dtype, impl)
    DDLB-->>SlurmSystem: Write stdout.txt with results
    
    CloudAI->>DDLBTestDefinition: was_run_successful(test_run)
    DDLBTestDefinition->>DDLBTestDefinition: Check stdout.txt for "Benchmark Results"
    DDLBTestDefinition->>DDLBTestDefinition: Check for "Error" patterns
    DDLBTestDefinition-->>CloudAI: Return JobStatusResult

4 files reviewed, 2 comments

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

Adds DDLB (Distributed Deep Learning Benchmarks) workload integration to CloudAI framework for Slurm systems. The implementation includes test definitions, command generation strategy, registration, documentation, and test coverage.

Key Changes:

  • New DDLBTestDefinition and DDLBCmdArgs classes with support for benchmark parameters (m, n, k, dtype, num_iterations, num_warmups, impl, primitive)
  • DDLBTestSlurmCommandGenStrategy generates Slurm commands for running DDLB benchmarks in Docker containers
  • Success checking based on "Benchmark Results" presence in stdout.txt
  • Registration in CloudAI's workload registry
  • Test fixtures and reference data for validation

Critical Issue:
The command generation strategy does not handle list values correctly. When m, n, k, or impl are lists (as shown in conf/experimental/test/ddlb_test.toml), they will be formatted as Python list literals (e.g., -m [1024, 8192]) instead of proper CLI format. This will cause runtime errors when the DDLB CLI attempts to parse these arguments.
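
A minimal sketch of the failure mode described above, and one possible fix. The helper names (`naive_args`, `format_cli_value`, `build_args`) and the comma separator are assumptions for illustration only; this review does not state which list format the DDLB CLI expects, so the separator would need to be checked against the actual CLI.

```python
from typing import List, Union

Value = Union[int, str, List[int], List[str]]


def naive_args(flag: str, value: Value) -> List[str]:
    """Reproduce the bug: interpolating a list yields its Python repr."""
    return [flag, f"{value}"]


def format_cli_value(value: Value, sep: str = ",") -> str:
    """Join list items with `sep` (an assumed separator); pass scalars through."""
    if isinstance(value, list):
        return sep.join(str(item) for item in value)
    return str(value)


def build_args(flag: str, value: Value) -> List[str]:
    """Hypothetical fixed variant of the argument builder."""
    return [flag, format_cli_value(value)]


print(naive_args("-m", [1024, 8192]))  # list repr leaks into the command
print(build_args("-m", [1024, 8192]))  # list joined with the assumed separator
print(build_args("-n", 128))           # scalars are unchanged
```

If the CLI instead expects a repeated flag or space-separated values, only `format_cli_value` would need to change.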

Confidence Score: 3/5

  • This PR has a critical bug in list handling that will cause runtime failures
  • Score of 3 reflects proper framework integration and test coverage, but the unresolved list formatting bug in slurm_command_gen_strategy.py will cause failures when using list values for m, n, k, or impl parameters as demonstrated in the example configuration
  • src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py requires immediate attention to fix list argument handling

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 2/5 | Command generation strategy with list handling bug that will break CLI arguments when lists are used |
| src/cloudai/workloads/ddlb/ddlb.py | 4/5 | Core test definition with proper type annotations and success checking logic |

Sequence Diagram

sequenceDiagram
    participant User
    participant Registry
    participant DDLBTestDefinition
    participant SlurmCommandGenStrategy
    participant DockerImage
    participant SlurmSystem

    User->>Registry: Register DDLBTest workload
    Registry->>Registry: add_test_definition("DDLBTest", DDLBTestDefinition)
    Registry->>Registry: add_command_gen_strategy(SlurmSystem, DDLBTestDefinition, DDLBTestSlurmCommandGenStrategy)
    
    User->>DDLBTestDefinition: Load test configuration from TOML
    DDLBTestDefinition->>DDLBTestDefinition: Initialize DDLBCmdArgs (m, n, k, impl, etc.)
    DDLBTestDefinition->>DockerImage: Create DockerImage from docker_image_url
    
    User->>SlurmSystem: Execute test run
    SlurmSystem->>SlurmCommandGenStrategy: Get command generation strategy
    SlurmCommandGenStrategy->>DDLBTestDefinition: Get cmd_args
    SlurmCommandGenStrategy->>SlurmCommandGenStrategy: generate_test_command()
    SlurmCommandGenStrategy->>SlurmCommandGenStrategy: Format CLI arguments (m, n, k, impl)
    Note over SlurmCommandGenStrategy: BUG: List values formatted as Python literals
    SlurmCommandGenStrategy->>SlurmSystem: Return srun command
    SlurmSystem->>DockerImage: Execute in container
    
    DockerImage->>DockerImage: Run DDLB benchmark
    DockerImage-->>SlurmSystem: Write stdout.txt
    
    SlurmSystem->>DDLBTestDefinition: was_run_successful(TestRun)
    DDLBTestDefinition->>DDLBTestDefinition: Check for "Error" in stdout.txt
    DDLBTestDefinition->>DDLBTestDefinition: Check for "Benchmark Results" in stdout.txt
    DDLBTestDefinition-->>User: Return JobStatusResult

4 files reviewed, no comments

@nsarka changed the title from "Draft: Add DDLB workload" to "Add DDLB workload" on Nov 5, 2025
Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

Adds DDLB (Distributed Deep Learning Benchmarks) workload integration to the CloudAI framework, including test definitions, a Slurm command generation strategy, configuration files, and documentation.

Key changes:

  • New DDLBTestDefinition and DDLBCmdArgs classes with support for matrix multiplication benchmarking parameters
  • Slurm command generation strategy for executing DDLB benchmarks in containerized environments
  • Test configuration files with examples of single and list-valued parameters
  • Registration of DDLB workload in the CloudAI registry

Critical issues found:

  • List parameter handling in command generation will produce malformed CLI commands (e.g., -m [1024, 8192] instead of proper format)
  • Duplicate error checking logic makes code unreachable
  • String comparison using is operator instead of ==
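
To illustrate the last item: `is` tests object identity, while `==` tests value equality, so two equal strings are not guaranteed to satisfy `is`. The sketch below uses hypothetical names (`DEFAULT_IMPL`, `is_default_impl`) and a placeholder value; it is not taken from the PR.

```python
DEFAULT_IMPL = "pytorch"  # hypothetical value, for illustration only


def is_default_impl(impl: str) -> bool:
    """Compare by value; `impl is DEFAULT_IMPL` would compare object identity."""
    return impl == DEFAULT_IMPL


# Build an equal string at runtime rather than reusing the literal above.
runtime_value = "".join(["py", "torch"])

print(is_default_impl(runtime_value))  # value comparison
print(runtime_value is DEFAULT_IMPL)   # identity comparison; implementation-dependent
```

Since Python 3.8, writing `is` against a string literal also emits a `SyntaxWarning`, which matches the warning mentioned in this review.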

Confidence Score: 2/5

  • This PR contains critical bugs that will cause runtime failures when list parameters are used
  • The list parameter handling bug in slurm_command_gen_strategy.py will generate malformed commands that will fail at runtime when users attempt to use list values for parameters like m, n, k, or impl. The duplicate error check makes significant code unreachable, though this may not cause immediate failures. The string comparison issue using is will trigger Python warnings and may cause incorrect behavior.
  • src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py requires immediate attention to fix list parameter handling; src/cloudai/workloads/ddlb/ddlb.py needs duplicate error check removed

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| conf/experimental/test/ddlb_test.toml | 4/5 | Configuration file defining DDLB test parameters with list values for m and impl arguments |
| src/cloudai/workloads/ddlb/ddlb.py | 3/5 | Test definition with duplicate error check at line 67, making lines 68-77 unreachable, and generic error pattern matching that may cause false positives |
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 2/5 | Command generation strategy with critical list handling bug causing malformed CLI commands, and string comparison using 'is' operator instead of '==' |

Sequence Diagram

sequenceDiagram
    participant User
    participant Registry
    participant DDLBTestDefinition
    participant SlurmCommandGenStrategy
    participant SlurmSystem
    participant Container

    User->>Registry: Register DDLB workload
    Registry->>Registry: Add DDLBTestDefinition
    Registry->>Registry: Add DDLBTestSlurmCommandGenStrategy
    
    User->>DDLBTestDefinition: Create test with cmd_args
    DDLBTestDefinition->>DDLBTestDefinition: Validate parameters (m, n, k, impl, etc)
    DDLBTestDefinition->>DockerImage: Initialize docker_image
    
    User->>SlurmCommandGenStrategy: Generate test command
    SlurmCommandGenStrategy->>DDLBTestDefinition: Get cmd_args
    SlurmCommandGenStrategy->>SlurmCommandGenStrategy: Build command parts
    Note over SlurmCommandGenStrategy: Formats parameters as CLI args<br/>(-m, -n, -k, --impl, etc)
    SlurmCommandGenStrategy-->>User: Return command list
    
    User->>SlurmSystem: Submit sbatch script
    SlurmSystem->>Container: Execute srun with container-image
    Container->>Container: Run python ddlb/cli/benchmark.py
    Container->>Container: Execute benchmark
    Container-->>SlurmSystem: Write stdout.txt
    
    SlurmSystem->>DDLBTestDefinition: Check was_run_successful()
    DDLBTestDefinition->>DDLBTestDefinition: Read stdout.txt
    DDLBTestDefinition->>DDLBTestDefinition: Check for "Error" pattern
    DDLBTestDefinition->>DDLBTestDefinition: Check for "Benchmark Results"
    DDLBTestDefinition-->>User: Return JobStatusResult

1 file reviewed, 1 comment

@nsarka
Member Author

nsarka commented Nov 5, 2025

@amaslenn I unmarked the PR as draft. I believe it should be ready for merging once there's a clear answer on the greptile comment about the list argument. Please let me know what you think.

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

Added experimental test configuration for DDLB workload with parameters for tensor parallelism benchmarking.

Key changes:

  • New ddlb_test.toml configuration file with test parameters (primitive type, matrix dimensions m/n/k, dtype, iterations)
  • Configured with two implementation variants: PyTorch with NCCL backend and fuser with CUDA backend
  • Uses list values for m parameter [1024, 8192] and multiple impl configurations to test different scenarios
  • References Docker image from gitlab-master.nvidia.com/nsarkauskas/ddlb:latest

Configuration structure:
The file follows standard CloudAI test configuration format with cmd_args section matching the DDLBCmdArgs schema defined in src/cloudai/workloads/ddlb/ddlb.py. Parameters include matrix dimensions, data type, warmup/iteration counts, and implementation variants.

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk—it only adds a new configuration file.
  • The configuration file is well-structured and follows established patterns. However, there are existing issues in the command generation strategy (already commented on) that affect list parameter handling. The config itself is correct, but runtime behavior depends on fixes to slurm_command_gen_strategy.py.
  • No files in this PR require special attention. However, note that src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py (not in this PR) needs fixes for proper list parameter handling.

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| conf/experimental/test/ddlb_test.toml | 4/5 | New DDLB test configuration with proper structure, valid parameters, and clear documentation. Uses list values for m and impl parameters which align with the workload's type annotations. |

Sequence Diagram

sequenceDiagram
    participant User
    participant CloudAI
    participant Config as ddlb_test.toml
    participant TestDef as DDLBTestDefinition
    participant CmdGen as SlurmCommandGenStrategy
    participant Slurm
    participant Docker as DDLB Container

    User->>CloudAI: Load test configuration
    CloudAI->>Config: Read ddlb_test.toml
    Config-->>CloudAI: Test parameters (primitive, m, n, k, dtype, impl, etc.)
    CloudAI->>TestDef: Create DDLBTestDefinition with cmd_args
    TestDef->>TestDef: Instantiate DockerImage from docker_image_url
    CloudAI->>CmdGen: Generate Slurm commands
    CmdGen->>TestDef: Get cmd_args and docker_image
    CmdGen->>CmdGen: Build command: python ddlb/cli/benchmark.py with args
    CmdGen-->>CloudAI: Return srun command with container params
    CloudAI->>Slurm: Submit job with generated sbatch script
    Slurm->>Docker: Launch container with DDLB benchmark
    Docker->>Docker: Execute benchmark with specified parameters
    Docker-->>Slurm: Write results to stdout.txt
    Slurm-->>CloudAI: Job completion signal
    CloudAI->>TestDef: Check job success (was_run_successful)
    TestDef->>TestDef: Parse stdout.txt for "Benchmark Results"
    TestDef-->>CloudAI: JobStatusResult

1 file reviewed, no comments

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR relocates the DDLB test scenario configuration file from conf/common/test_scenario/ to conf/experimental/test_scenario/ with no content modifications. This is a pure file move (git rename) to organize the DDLB workload under the experimental configurations directory.

Key Changes:

  • Moved ddlb_test.toml from common to experimental directory
  • No modifications to file content
  • Aligns with the experimental status of the DDLB workload integration

Analysis:
The test scenario configuration remains unchanged and properly defines a single DDLB test with 1 node and 30-minute timeout. The file references ddlb_test which exists in conf/experimental/test/ddlb_test.toml with proper command arguments defined.

Confidence Score: 5/5

  • This PR is completely safe to merge - it's a pure file relocation with zero content changes or functional impact
  • Perfect score reflects that this is a simple git rename operation with no code modifications, no logical changes, and no risk of introducing bugs. The file move from common to experimental is organizationally appropriate for the DDLB workload integration.
  • No files require special attention

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| conf/experimental/test_scenario/ddlb_test.toml | 5/5 | File moved from conf/common/ to conf/experimental/ with no content changes - simple relocation to appropriate directory for experimental DDLB workload |

Sequence Diagram

sequenceDiagram
    participant Dev as Developer
    participant Git as Git Repository
    participant Common as conf/common/test_scenario/
    participant Exp as conf/experimental/test_scenario/

    Dev->>Git: Move ddlb_test.toml
    Git->>Common: Remove ddlb_test.toml
    Git->>Exp: Add ddlb_test.toml
    Note over Exp: Same content, new location
    Dev->>Git: Commit rename operation
    Note over Git: Pure file relocation<br/>No content changes

1 file reviewed, no comments

@amaslenn
Contributor

@nsarka please resolve conflicts and I'll review the PR.

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

Adds DDLB (Distributed Deep Learning Benchmarks) workload integration including test definitions, Slurm command generation, configuration files, and documentation.

Key changes:

  • New DDLB workload implementation in src/cloudai/workloads/ddlb/
  • Registration of DDLB test definition and command generation strategy
  • Configuration files for testing with support for list-valued parameters (m, n, k, impl)
  • Reference sbatch script and acceptance tests

Critical issue:

  • Command generator doesn't handle list parameters correctly - will produce Python literal syntax like -m [1024, 8192] instead of proper CLI format, causing runtime failures when using list values as configured in conf/experimental/test/ddlb_test.toml

Confidence Score: 2/5

  • This PR has a critical bug that will cause failures when using list parameters as configured
  • The command generation logic in slurm_command_gen_strategy.py:46-51 will produce malformed CLI commands when parameters like m, n, k, or impl are lists (Python literal strings like [1024, 8192] instead of proper CLI syntax). The configuration file explicitly uses list values, so this will fail at runtime.
  • src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py requires immediate attention to fix list parameter handling before merge

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 2/5 | New DDLB Slurm command generator with critical list handling bug that will produce malformed CLI commands |
| src/cloudai/workloads/ddlb/ddlb.py | 4/5 | New DDLB test definition with success checking logic, minor style issues but functionally sound |
| conf/experimental/test/ddlb_test.toml | 3/5 | DDLB test configuration with list values that may not be handled correctly by command generator |

Sequence Diagram

sequenceDiagram
    participant User
    participant CloudAI
    participant Registry
    participant DDLBTestDef
    participant SlurmCmdGen
    participant Slurm
    participant Container

    User->>CloudAI: Load DDLB test config
    CloudAI->>Registry: Register DDLBTestDefinition
    CloudAI->>Registry: Register DDLBTestSlurmCommandGenStrategy
    
    User->>CloudAI: Execute DDLB test
    CloudAI->>DDLBTestDef: Parse cmd_args (m, n, k, impl, etc.)
    DDLBTestDef->>DDLBTestDef: Create DockerImage from docker_image_url
    
    CloudAI->>SlurmCmdGen: Generate test command
    SlurmCmdGen->>SlurmCmdGen: Build srun command with container args
    SlurmCmdGen->>SlurmCmdGen: Format CLI args (m, n, k, dtype, impl)
    Note over SlurmCmdGen: BUG: Lists formatted as Python literals
    
    SlurmCmdGen->>Slurm: Submit sbatch script
    Slurm->>Container: Launch DDLB container
    Container->>Container: Execute python ddlb/cli/benchmark.py
    Container->>Container: Run benchmarks, output results
    
    Container-->>CloudAI: stdout.txt with results
    CloudAI->>DDLBTestDef: Check was_run_successful()
    DDLBTestDef->>DDLBTestDef: Verify "Benchmark Results" in stdout
    DDLBTestDef-->>User: Return success/failure status

11 files reviewed, 2 comments

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

Adds DDLB (Distributed Deep Learning Benchmarks) workload integration with Slurm command generation and test acceptance coverage.

Major changes:

  • New DDLBTestSlurmCommandGenStrategy class to generate DDLB benchmark CLI commands from test configurations
  • Test acceptance coverage for DDLB workload with sample configuration

Critical issues found:

  • List parameters (m, n, k, impl) are not handled correctly and will produce malformed CLI commands (e.g., -m [1024, 8192] instead of expected format)
  • image_path() method doesn't handle None case for installed_path, which would return string "None"
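
The second item comes down to `str(None)` evaluating to the string `"None"` rather than raising or returning `None`. A minimal sketch of a guard follows; it uses a standalone function with a hypothetical signature and a made-up file name, and is not the PR's actual method.

```python
from pathlib import Path
from typing import Optional


def image_path(installed_path: Optional[Path]) -> Optional[str]:
    """Return the path as a string, or None when no image is installed."""
    if installed_path is None:
        return None
    return str(installed_path)


print(repr(str(None)))                # the pitfall: a non-empty string
print(image_path(None))               # guarded variant keeps None as None
print(image_path(Path("ddlb.sqsh")))  # hypothetical file name
```

Returning `None` lets the caller decide whether to omit the container argument or fail early, instead of passing a literal "None" as the image path.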

Confidence Score: 1/5

  • This PR contains critical logic bugs that will cause runtime failures and cannot be merged as-is
  • Previous review comments identified critical list handling bugs in lines 46-52 that remain unfixed. When list parameters like m = [1024, 8192] or impl = ["pytorch;...", "fuser;..."] are used (as shown in conf/experimental/test/ddlb_test.toml), the code will generate invalid CLI syntax like -m [1024, 8192] instead of proper format. This will cause command execution to fail.
  • src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py requires immediate fixes for list handling (lines 46-52) and None check in image_path (line 32)

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 1/5 | Critical list handling bugs that will cause CLI command failures; image_path may return string "None" |
| tests/test_acceptance.py | 5/5 | Clean addition of DDLB test to test matrix with proper imports and test definition |

Sequence Diagram

sequenceDiagram
    participant User
    participant TestRunner
    participant DDLBTestDefinition
    participant DDLBTestSlurmCommandGenStrategy
    participant SlurmSystem
    participant DDLBContainer

    User->>TestRunner: Request DDLB workload execution
    TestRunner->>DDLBTestDefinition: Load test config (cmd_args)
    DDLBTestDefinition->>DDLBTestDefinition: Parse parameters (m, n, k, impl, etc.)
    TestRunner->>DDLBTestSlurmCommandGenStrategy: generate_test_command()
    DDLBTestSlurmCommandGenStrategy->>DDLBTestDefinition: Get cmd_args
    DDLBTestSlurmCommandGenStrategy->>DDLBTestSlurmCommandGenStrategy: Build CLI command parts
    Note over DDLBTestSlurmCommandGenStrategy: Format: python ddlb/cli/benchmark.py -m X -n Y --impl Z
    DDLBTestSlurmCommandGenStrategy->>DDLBTestDefinition: image_path()
    DDLBTestDefinition-->>DDLBTestSlurmCommandGenStrategy: Docker image path
    DDLBTestSlurmCommandGenStrategy-->>TestRunner: Return command parts list
    TestRunner->>SlurmSystem: Submit job with srun command
    SlurmSystem->>DDLBContainer: Execute benchmark.py with args
    DDLBContainer->>DDLBContainer: Run DDLB benchmark
    DDLBContainer-->>SlurmSystem: Write stdout.txt with "Benchmark Results"
    SlurmSystem-->>TestRunner: Job completed
    TestRunner->>DDLBTestDefinition: was_run_successful()
    DDLBTestDefinition->>DDLBTestDefinition: Check for "Benchmark Results" in stdout.txt
    DDLBTestDefinition-->>TestRunner: JobStatusResult
    TestRunner-->>User: Test results

2 files reviewed, no comments

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

Adds Slurm command generation strategy for DDLB (Distributed Deep Learning Benchmark) workload integration. The implementation provides container-based execution with Docker image support and custom CLI argument handling for DDLB's benchmark parameters.

Key changes:

  • Implements DDLBTestSlurmCommandGenStrategy extending SlurmCommandGenStrategy
  • Provides Docker container path resolution via image_path() method
  • Generates DDLB benchmark CLI commands with parameters like matrix dimensions (m, n, k), implementation types (impl), and iteration counts
  • Includes success check by grepping for "Benchmark Results" in stdout

Critical issues requiring attention:

  • List-typed arguments (m, n, k, impl) will produce malformed CLI syntax (e.g., -m [1024, 8192] instead of proper format)
  • str(None) conversion on line 32 may produce string "None" instead of None value if docker image path is not set
  • Missing proper list handling for multi-value parameters defined in DDLBCmdArgs

Confidence Score: 2/5

  • Not safe to merge - contains critical logic errors that will cause runtime failures
  • The list handling bug (lines 46-52) will produce malformed CLI commands for DDLB benchmark when list values are used (as shown in conf/experimental/test/ddlb_test.toml). This will cause the benchmark to fail at runtime with invalid argument syntax. Previous comments have identified this issue but it remains unresolved.
  • src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py requires immediate attention for list argument handling fix

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 2/5 | New command generation strategy for DDLB workload; contains critical logic errors in list handling for CLI arguments and potential None string conversion issue |

Sequence Diagram

sequenceDiagram
    participant User
    participant SlurmSystem
    participant DDLBTestSlurmCommandGenStrategy
    participant DDLBTestDefinition
    participant DockerImage
    participant SlurmJob

    User->>SlurmSystem: Submit DDLB test
    SlurmSystem->>DDLBTestSlurmCommandGenStrategy: generate_test_command()
    DDLBTestSlurmCommandGenStrategy->>DDLBTestDefinition: Get test definition
    DDLBTestDefinition-->>DDLBTestSlurmCommandGenStrategy: Return test config
    DDLBTestSlurmCommandGenStrategy->>DDLBTestDefinition: Get cmd_args (m, n, k, impl, etc)
    DDLBTestDefinition-->>DDLBTestSlurmCommandGenStrategy: Return command arguments
    DDLBTestSlurmCommandGenStrategy->>DDLBTestSlurmCommandGenStrategy: Build CLI command
    Note over DDLBTestSlurmCommandGenStrategy: Format args for ddlb/cli/benchmark.py
    DDLBTestSlurmCommandGenStrategy->>DDLBTestDefinition: image_path()
    DDLBTestDefinition->>DockerImage: Get installed_path
    DockerImage-->>DDLBTestDefinition: Return path
    DDLBTestDefinition-->>DDLBTestSlurmCommandGenStrategy: Return image path
    DDLBTestSlurmCommandGenStrategy-->>SlurmSystem: Return complete command
    SlurmSystem->>SlurmJob: Execute with srun + container
    SlurmJob->>SlurmJob: Run python ddlb/cli/benchmark.py
    SlurmJob-->>SlurmSystem: Write stdout.txt
    SlurmSystem->>DDLBTestSlurmCommandGenStrategy: gen_srun_success_check()
    DDLBTestSlurmCommandGenStrategy-->>SlurmSystem: grep for "Benchmark Results"
    SlurmSystem->>DDLBTestDefinition: was_run_successful()
    DDLBTestDefinition->>DDLBTestDefinition: Check stdout.txt for errors/results
    DDLBTestDefinition-->>User: Return success/failure status

1 file reviewed, no comments

@greptile-apps
Contributor

greptile-apps bot commented Nov 13, 2025

Greptile Overview

Greptile Summary

This PR integrates the DDLB (Distributed Deep Learning Benchmarks) workload into CloudAI. The implementation follows the established pattern for workloads with Docker container support, including test definitions, Slurm command generation, configuration files, documentation, and tests.

Key additions:

  • New DDLBTestDefinition and DDLBCmdArgs classes supporting matrix dimensions (m, n, k), primitive types, implementations, and benchmark parameters
  • DDLBTestSlurmCommandGenStrategy for generating Slurm commands that execute the DDLB benchmark CLI
  • Configuration examples demonstrating list parameter usage for grid searches across multiple values
  • Comprehensive documentation with usage examples
  • Test coverage in acceptance and initialization tests
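
The grid search mentioned above can be sketched as a Cartesian product over the list-valued arguments. This is an illustration under assumptions: `expand_grid` is a hypothetical helper, only the `m` values come from this review, and the review does not say whether CloudAI or the DDLB CLI performs the expansion.

```python
from itertools import product
from typing import Any, Dict, Iterator, List


def expand_grid(cmd_args: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one dict of scalar arguments per combination of list-valued entries."""
    keys = list(cmd_args)
    # Wrap scalars in a single-item list so every key contributes one axis.
    axes: List[List[Any]] = [v if isinstance(v, list) else [v] for v in cmd_args.values()]
    for combo in product(*axes):
        yield dict(zip(keys, combo))


# m mirrors the list quoted in the review; n and impl are placeholder values.
cmd_args = {"m": [1024, 8192], "n": 128, "impl": ["impl_a", "impl_b"]}
for point in expand_grid(cmd_args):
    print(point)
```

Each yielded dict holds only scalar values, so a command generator that formats one value per flag would work on it unchanged.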

Critical issue identified:
The command generation strategy at slurm_command_gen_strategy.py:46-52 does not handle list values for parameters like m, n, k, and impl. When these are lists (as shown in conf/experimental/test/ddlb_test.toml:24,31-34), the code produces invalid CLI syntax like -m [1024, 8192] instead of the proper format expected by DDLB's CLI. This will cause runtime failures when executing tests with list parameters.

Confidence Score: 2/5

  • This PR has a critical bug in list parameter handling that will cause runtime failures
  • The list parameter handling bug in slurm_command_gen_strategy.py is a blocking issue that will prevent the workload from functioning correctly with the configuration examples provided in this PR. The config file demonstrates list usage (m = [1024, 8192], impl = [...]), but the command generator will format these incorrectly. Additionally, the generic "Error" detection pattern may produce false positives in production.
  • src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py requires immediate attention to fix list parameter handling before merge
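
On the false-positive concern: a bare substring test for "Error" also fails runs whose output merely mentions the word, for example in a metric name. The sketch below contrasts that with a line-anchored pattern. The regular expression, function names, and sample outputs are all assumptions; the real DDLB output format is not shown in this conversation.

```python
import re
from typing import Tuple

# Assumed pattern: a traceback header, or a line starting with "<Name>Error:".
ERROR_PATTERN = re.compile(
    r"^(Traceback \(most recent call last\)|\w*Error:)", re.MULTILINE
)
SUCCESS_MARKER = "Benchmark Results"


def naive_check(text: str) -> bool:
    """Generic check: any 'Error' substring marks the run as failed."""
    return "Error" not in text and SUCCESS_MARKER in text


def check_stdout(text: str) -> Tuple[bool, str]:
    """Stricter check; returns (is_successful, message)."""
    if ERROR_PATTERN.search(text):
        return False, "Error line found in output"
    if SUCCESS_MARKER not in text:
        return False, f"'{SUCCESS_MARKER}' not found in output"
    return True, ""


benign = "Benchmark Results\nMean Absolute Error: 0.0\n"  # hypothetical output
failed = "RuntimeError: NCCL failure\n"                   # hypothetical output

print(naive_check(benign))   # substring check rejects a benign run
print(check_stdout(benign))  # anchored pattern accepts it
print(check_stdout(failed))  # genuine error line is still caught
```

The anchored pattern trades recall for precision: indented or prefixed error lines would slip through, so the pattern would need tuning against real benchmark logs.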

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 2/5 | Command generation strategy with critical list handling bug that produces invalid CLI syntax for m, n, k, and impl parameters |
| src/cloudai/workloads/ddlb/ddlb.py | 3/5 | Core test definition with overly generic error detection pattern that may produce false positives |
| src/cloudai/registration.py | 4/5 | Registration updates with minor import ordering style issue |
| conf/experimental/test/ddlb_test.toml | 5/5 | Well-documented test configuration with proper list parameter usage examples |

Sequence Diagram

sequenceDiagram
    participant User
    participant CloudAI
    participant Registry
    participant DDLBTestDefinition
    participant DDLBTestSlurmCommandGenStrategy
    participant DockerImage
    participant SlurmSystem
    participant Container

    User->>CloudAI: Submit DDLB test configuration
    CloudAI->>Registry: Load DDLBTest definition
    Registry->>DDLBTestDefinition: Instantiate with CmdArgs
    DDLBTestDefinition->>DockerImage: Create docker_image from URL
    DDLBTestDefinition-->>CloudAI: Test definition ready
    
    CloudAI->>Registry: Get command generation strategy
    Registry->>DDLBTestSlurmCommandGenStrategy: Instantiate for Slurm
    
    CloudAI->>DDLBTestSlurmCommandGenStrategy: generate_test_command()
    DDLBTestSlurmCommandGenStrategy->>DDLBTestDefinition: Get cmd_args
    DDLBTestSlurmCommandGenStrategy->>DDLBTestSlurmCommandGenStrategy: Build CLI arguments
    Note over DDLBTestSlurmCommandGenStrategy: Formats m, n, k, primitive, dtype,<br/>num_iterations, num_warmups, impl
    DDLBTestSlurmCommandGenStrategy-->>CloudAI: Return command parts list
    
    CloudAI->>DDLBTestSlurmCommandGenStrategy: image_path()
    DDLBTestSlurmCommandGenStrategy->>DockerImage: Get installed_path
    DockerImage-->>DDLBTestSlurmCommandGenStrategy: Return path
    DDLBTestSlurmCommandGenStrategy-->>CloudAI: Return image path
    
    CloudAI->>SlurmSystem: Generate sbatch script
    SlurmSystem->>Container: Execute srun with container-image
    Container->>Container: Run python ddlb/cli/benchmark.py
    Container-->>SlurmSystem: Write stdout.txt
    
    SlurmSystem-->>CloudAI: Job complete
    CloudAI->>DDLBTestDefinition: was_run_successful()
    DDLBTestDefinition->>DDLBTestDefinition: Check stdout.txt exists
    DDLBTestDefinition->>DDLBTestDefinition: Check for "Error" pattern
    DDLBTestDefinition->>DDLBTestDefinition: Check for "Benchmark Results"
    DDLBTestDefinition-->>CloudAI: Return JobStatusResult
    CloudAI-->>User: Report test results

Contributor

@greptile-apps greptile-apps bot left a comment

11 files reviewed, no comments

@amaslenn
Contributor

@nsarka let me know if we can merge it.

@amaslenn mentioned this pull request on Nov 17, 2025
@amaslenn merged commit 0f69871 into NVIDIA:main on Nov 20, 2025
3 checks passed
2 participants