Add DDLB workload #711
amaslenn
left a comment
Thank you for your contribution!
Please also:
- Extend `test_acceptance.py` to cover sbatch generation logic.
- Add a documentation page for this workload (see `doc/workloads` for examples) and link this page to the main one.
6 files reviewed, no comments
Thanks for the review. I will update with these changes. In the meantime, when I tried running this change, I found that Here the output of a manual run: I figured since the container is ~9 GB, I should wait a little bit. But it's been about 4 hours, so I think it's safe to assume it's a hang.
Depending on the system, it can take some time, but 4h for 9GB is too much. Have you tried enabling local caching in system with
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. The most recent changes address previously raised issues about copyright dates and commented code. The developer has updated copyright headers in newly added files to use only "2025" (instead of "2024-2025") and removed commented pre_test and post_test lines from the test scenario configuration, streamlining the DDLB integration. These are minor cleanup changes that improve code consistency with project conventions.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| conf/common/test_scenario/ddlb_test.toml | 5/5 | Removed commented pre_test and post_test hook lines, leaving clean minimal configuration |
| src/cloudai/workloads/ddlb/__init__.py | 5/5 | Updated copyright year from "2024-2025" to "2025" only |
| src/cloudai/registration.py | 5/5 | Updated copyright year from "2024-2025" to "2025" only |
| src/cloudai/workloads/ddlb/ddlb.py | 5/5 | Updated copyright year from "2024-2025" to "2025" only |
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 5/5 | Updated copyright year from "2024-2025" to "2025" only |
| conf/common/test/ddlb_test.toml | 5/5 | Updated copyright year from "2024-2025" to "2025" only |
Confidence score: 5/5
- These changes are safe to merge as they only address formatting and consistency issues raised in previous reviews
- The score reflects that these are purely cosmetic/metadata changes with no functional impact on code behavior
- No files require special attention; all changes are straightforward corrections to copyright headers and removal of commented placeholder code
6 files reviewed, 7 comments
Greptile Overview
Greptile Summary
This review covers changes made to the DDLB workload since the last review, not the entire PR. The developer addressed most of the critical feedback from prior reviews by fixing the unreachable code bug, removing dead configuration comments, simplifying validation logic, and standardizing copyright headers to "2025" for newly added files. The key fix removes the duplicate "Error" check in ddlb.py lines 58/68 that made success validation unreachable, and eliminates the unused missing_indicators list. PEP 8 formatting was also corrected. The test scenario timeout was extended from 10 to 30 minutes to allow DDLB benchmarks to complete. These changes clean up the DDLB integration while addressing previously flagged code quality issues.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| src/cloudai/workloads/ddlb/__init__.py | 5/5 | Updated copyright year from "2024-2025" to "2025" (administrative only) |
| conf/common/test_scenario/ddlb_test.toml | 4.5/5 | Extended test time limit from 10 to 30 minutes and removed dead commented-out fields |
| src/cloudai/workloads/ddlb/ddlb.py | 5/5 | Fixed critical duplicate error check bug making validation unreachable; removed unused missing_indicators list |
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 4/5 | Simplified command generation by removing intermediate variable; copyright updated |
Confidence score: 4/5
- This PR addresses critical bugs but one code smell remains that should be resolved before merging
- Score reflects that the duplicate error check bug was fixed and copyright headers were standardized, but the unused `tdef` variable in `slurm_command_gen_strategy.py` still exists from prior reviews, and concerns about the relative path safety raised in previous review ("Is it safe to use relative path? We can introduce a field in the test definition for this workload to hold `path_to_script`.") remain unaddressed
- Review `slurm_command_gen_strategy.py` carefully: the unused `tdef` variable suggests the test definition may need to be used for configuration in the future, and the hardcoded relative path "scripts/run_benchmark.py" may cause failures if executed from unexpected working directories
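The relative-path concern quoted above could be handled along the lines the reviewer suggested. The following is only a minimal sketch: the class name `DDLBScriptConfig`, the `container_workdir` field, and its default value are hypothetical and do not come from the PR, and CloudAI's real test definition classes are not reproduced here. Only `path_to_script` and the `scripts/run_benchmark.py` value are taken from the review text.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import List


@dataclass
class DDLBScriptConfig:
    """Hypothetical stand-in for the part of a test definition that locates the script."""

    # Default mirrors the hardcoded value flagged in the review.
    path_to_script: str = "scripts/run_benchmark.py"
    # Assumed working directory inside the container (illustrative only).
    container_workdir: str = "/workspace/ddlb"

    def resolved_script(self) -> str:
        """Return an absolute path so the command does not depend on the cwd."""
        script = PurePosixPath(self.path_to_script)
        if script.is_absolute():
            return str(script)
        return str(PurePosixPath(self.container_workdir) / script)


def generate_test_command(cfg: DDLBScriptConfig) -> List[str]:
    """Build the benchmark command from the configured script location."""
    return ["python", cfg.resolved_script()]
```

Keeping the path in the test definition also gives the otherwise unused `tdef` variable a purpose, since the strategy would read the script location from it.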
4 files reviewed, no comments
Thanks. I opted to try it on another cluster, and it failed there too with It seems like the container is too big to convert to a .sqsh file with the scratch space available in enroot's
If this image is too big, how will you create
@nsarka please merge your PR with the latest main branch to align check list.
Force-pushed from 1171554 to 2bf56e4
Greptile Overview
Greptile Summary
Adds DDLB (Distributed Deep Learning Benchmark) workload integration to CloudAI. The implementation follows established patterns from other workloads like NCCL and ChakraReplay.
Key Changes:
- New test definition (`DDLBTestDefinition`) with Docker image support
- Slurm command generation strategy that executes `python scripts/run_benchmark.py`
- Configuration files for test setup (single node, 30-minute timeout)
- Success validation checking for "Benchmark Results" in stdout
- Registration in the main registry alongside other test definitions
Observations:
- The error detection uses a generic `"Error"` string check which may produce false positives
- The implementation is minimal but functional, delegating most logic to the container's benchmark script
- No unit tests included for the new workload (though other workloads have test coverage)
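The false-positive concern can be illustrated with a stricter check that only treats a line as an error when it looks like one. This is a sketch under assumptions: the regular expression below is illustrative, since the actual DDLB output format is not shown in this thread, and `classify_stdout` is a hypothetical helper, not a CloudAI function. Only the "Benchmark Results" marker comes from the review.

```python
import re
from typing import Tuple

SUCCESS_MARKER = "Benchmark Results"  # success marker named in the review

# Assumed error shapes: a Python traceback header, an exception line such as
# "RuntimeError: ...", or a line starting with an upper-case "ERROR" tag.
ERROR_LINE = re.compile(
    r"^\s*(Traceback \(most recent call last\):|\w*Error:|ERROR\b)",
    re.MULTILINE,
)


def classify_stdout(text: str) -> Tuple[bool, str]:
    """Return (success, reason); reason is empty on success."""
    match = ERROR_LINE.search(text)
    if match:
        return False, f"error line found: {match.group(0).strip()}"
    if SUCCESS_MARKER not in text:
        return False, f"missing '{SUCCESS_MARKER}'"
    return True, ""
```

With this pattern a line such as `Error tolerance: 1e-3` does not fail the run, while `RuntimeError: ...` does. A plain `"Error" in text` check would reject both.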
Confidence Score: 4/5
- This PR is safe to merge with minor refinements recommended
- The implementation follows existing patterns closely (NCCL, ChakraReplay) and integrates cleanly into the registry. The main concern is the generic error detection string which could cause false positives. The code is well-structured and mirrors established workload patterns, making it maintainable. No breaking changes or security issues identified.
- Primary attention needed on `src/cloudai/workloads/ddlb/ddlb.py` for error detection refinement
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| src/cloudai/workloads/ddlb/ddlb.py | 4/5 | Core DDLB test definition with generic error detection ('Error' string may match false positives), success validation checks for 'Benchmark Results' |
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 5/5 | Slurm command generation for DDLB, returns static test command, success check validates 'Benchmark Results' in output |
Sequence Diagram
sequenceDiagram
participant User
participant Registry
participant DDLBTestDefinition
participant DDLBTestSlurmCommandGenStrategy
participant SlurmSystem
participant DockerImage
participant OutputFile
User->>Registry: Register DDLB workload
Registry->>Registry: Add DDLBTestDefinition
Registry->>Registry: Add DDLBTestSlurmCommandGenStrategy
User->>DDLBTestDefinition: Create test with docker_image_url
DDLBTestDefinition->>DockerImage: Initialize DockerImage(url)
User->>DDLBTestSlurmCommandGenStrategy: Generate test command
DDLBTestSlurmCommandGenStrategy->>DDLBTestDefinition: Get docker_image.installed_path
DDLBTestDefinition-->>DDLBTestSlurmCommandGenStrategy: Return image path
DDLBTestSlurmCommandGenStrategy->>SlurmSystem: Generate srun command with container
DDLBTestSlurmCommandGenStrategy-->>User: Return ["python scripts/run_benchmark.py"]
User->>SlurmSystem: Execute test via Slurm
SlurmSystem->>OutputFile: Write stdout.txt
User->>DDLBTestDefinition: Check was_run_successful()
DDLBTestDefinition->>OutputFile: Read stdout.txt
alt Contains "Error"
DDLBTestDefinition-->>User: JobStatusResult(False, error details)
else Missing "Benchmark Results"
DDLBTestDefinition-->>User: JobStatusResult(False, missing indicators)
else Success
DDLBTestDefinition-->>User: JobStatusResult(True)
end
6 files reviewed, 1 comment
Greptile Overview
Greptile Summary
adds DDLB (Distributed Deep Learning Benchmark) workload support following the existing CloudAI workload pattern with test definition, Slurm command generation strategy, and configuration files.
Key changes:
- registered DDLB workload in `src/cloudai/registration.py`
- created `DDLBTestDefinition` with Docker image management and success validation based on "Benchmark Results" pattern
- implemented `DDLBTestSlurmCommandGenStrategy` to generate mpirun commands
- added test configuration (`conf/common/test/ddlb_test.toml`) and test scenario
Issues found:
- critical command generation bug in `slurm_command_gen_strategy.py:36` that produces malformed commands
- unused imports in `ddlb.py`
Confidence Score: 2/5
- critical bug in command generation will cause runtime failures when executing DDLB tests
- the `generate_test_command` method in `slurm_command_gen_strategy.py:36` constructs a malformed command list with `"mpirun -np "` (trailing space) as a single element, which when joined with spaces produces `"mpirun -np  8 python..."` (double space). This breaks command parsing and will cause test execution failures
- `src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py` requires immediate fix to command generation logic
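The list-structure issue can be reproduced in isolation. The sketch below assumes a hypothetical free function `generate_test_command(num_procs)`; the real method lives on the strategy class and its signature is not shown in this thread. The `flagged` list is the shape quoted in the review.

```python
import shlex
from typing import List


def generate_test_command(num_procs: int) -> List[str]:
    """Return one token per element, with no embedded or trailing spaces."""
    return ["mpirun", "-np", str(num_procs), "python", "scripts/run_benchmark.py"]


# The list shape described in the review, for comparison.
flagged = ["mpirun -np ", "8", "python scripts/run_benchmark.py"]

joined_flagged = " ".join(flagged)
joined_fixed = " ".join(generate_test_command(8))

print(repr(joined_flagged))  # contains a doubled space after "-np"
print(repr(joined_fixed))
# Tokenizing both strings shows how a POSIX shell would split them.
print(shlex.split(joined_flagged) == shlex.split(joined_fixed))
```

One token per element keeps the joined string free of doubled spaces and makes the list usable wherever argv-style tokens are expected. Whether the doubled space is harmful depends on how the string is consumed: a POSIX shell collapses repeated whitespace, whereas an exact string comparison in a reference test would not.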
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| conf/common/test/ddlb_test.toml | 4/5 | configuration file with hardcoded path, standard structure matches other test configs |
| src/cloudai/workloads/ddlb/ddlb.py | 3/5 | test definition with unused imports, follows established patterns for workload definitions |
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 2/5 | command generation with critical bug in generate_test_command list structure (line 36) that will produce malformed command |
Sequence Diagram
sequenceDiagram
participant User
participant Registry
participant DDLBTestDefinition
participant SlurmCommandGenStrategy
participant DockerImage
participant SlurmSystem
User->>Registry: register DDLB workload
Registry->>DDLBTestDefinition: register test definition
Registry->>SlurmCommandGenStrategy: register command gen strategy
User->>DDLBTestDefinition: load test config (ddlb_test.toml)
DDLBTestDefinition->>DockerImage: initialize docker_image from docker_image_url
User->>SlurmCommandGenStrategy: generate execution command
SlurmCommandGenStrategy->>DDLBTestDefinition: get test definition
DDLBTestDefinition->>DockerImage: get installed_path
SlurmCommandGenStrategy->>SlurmCommandGenStrategy: generate_test_command() -> ["mpirun -np ", "8", "python scripts/run_benchmark.py"]
SlurmCommandGenStrategy->>SlurmSystem: create sbatch script with srun command
User->>SlurmSystem: execute job
SlurmSystem->>SlurmSystem: run mpirun with DDLB benchmark
SlurmSystem-->>User: output to stdout.txt
User->>DDLBTestDefinition: was_run_successful(test_run)
DDLBTestDefinition->>DDLBTestDefinition: check stdout.txt for "Error" or "Benchmark Results"
DDLBTestDefinition-->>User: JobStatusResult
3 files reviewed, 2 comments
This reverts commit eda5d0e.
Greptile Overview
Greptile Summary
Adds DDLB (Distributed Deep Learning Benchmark) workload integration to CloudAI, following the existing pattern for workload registration with Slurm systems.
Key Changes
- Added DDLB test definition with Docker image support and job success validation
- Implemented Slurm command generation strategy for DDLB workloads
- Created configuration files for test and test scenario definitions
- Registered DDLB workload in the global registry alongside other workloads like NCCL and UCC
Issues Identified
- Critical: Duplicate error checking logic in `ddlb.py:59-68` makes second condition unreachable
- Unused imports (`Literal`, `Union`) in `ddlb.py:17`
- Unused `tdef` variable in `slurm_command_gen_strategy.py:35`
- Generic "Error" pattern may cause false positives
- Potential `None` handling issue in `image_path()` when `installed_path` is `None`
Confidence Score: 2/5
- Not safe to merge - contains critical logic bug that prevents proper error detection
- The duplicate error check at lines 59-68 in `ddlb.py` creates unreachable code that will prevent the success indicator check from ever executing. This is a critical bug that breaks the test validation logic. Additionally, the generic "Error" pattern is prone to false positives, and the `image_path()` method may return string `"None"` instead of handling `None` properly.
- `src/cloudai/workloads/ddlb/ddlb.py` requires immediate attention due to unreachable code, and `slurm_command_gen_strategy.py` needs review for None handling
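The control flow the review asks for can be sketched as follows. Everything here is a simplified stand-in: `JobStatusResult` is modelled as a small named tuple whose field names are assumptions, `evaluate_stdout` takes the file content directly instead of a test run object, and `image_path` is a free function rather than the strategy method. The PR's own code is not reproduced.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class JobStatusResult(NamedTuple):
    """Simplified stand-in for CloudAI's result type (field names are assumptions)."""

    is_successful: bool
    error_message: str = ""


def evaluate_stdout(content: str) -> JobStatusResult:
    """Check for errors once, then for the success marker, so every branch is reachable."""
    if "Error" in content:
        return JobStatusResult(False, "stdout.txt contains 'Error'")
    if "Benchmark Results" not in content:
        return JobStatusResult(False, "stdout.txt is missing 'Benchmark Results'")
    return JobStatusResult(True)


def image_path(installed_path: Optional[Path]) -> str:
    """Fail loudly instead of silently returning the string 'None'."""
    if installed_path is None:
        raise ValueError("Docker image has no installed_path; was it installed?")
    return str(installed_path)
```

Raising on a missing path is one possible choice; returning a documented fallback would also avoid passing the literal string `"None"` to `srun`.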
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| src/cloudai/workloads/ddlb/ddlb.py | 2/5 | DDLB test definition with critical logic error in duplicate error checking (lines 59-68 unreachable), unused imports, and overly generic error pattern matching |
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 3/5 | Command generation strategy with unused tdef variable and potential None handling issue in image_path() method |
Sequence Diagram
sequenceDiagram
participant User
participant Registry
participant TestRunner
participant DDLBTestDefinition
participant DDLBTestSlurmCommandGenStrategy
participant SlurmSystem
participant DockerImage
User->>Registry: Register DDLB workload
Registry->>Registry: Add DDLBTestDefinition
Registry->>Registry: Add DDLBTestSlurmCommandGenStrategy
User->>TestRunner: Execute DDLB test
TestRunner->>DDLBTestDefinition: Load test configuration
DDLBTestDefinition->>DockerImage: Initialize docker_image from URL
TestRunner->>DDLBTestSlurmCommandGenStrategy: Generate command
DDLBTestSlurmCommandGenStrategy->>DDLBTestDefinition: Get docker_image.installed_path
DDLBTestSlurmCommandGenStrategy->>DDLBTestSlurmCommandGenStrategy: Generate test command
DDLBTestSlurmCommandGenStrategy-->>TestRunner: Return ["python scripts/run_benchmark.py"]
TestRunner->>SlurmSystem: Submit job with srun command
SlurmSystem-->>TestRunner: Job execution
TestRunner->>DDLBTestDefinition: was_run_successful()
DDLBTestDefinition->>DDLBTestDefinition: Read stdout.txt
DDLBTestDefinition->>DDLBTestDefinition: Check for "Error" pattern
DDLBTestDefinition->>DDLBTestDefinition: Check for "Benchmark Results"
DDLBTestDefinition-->>TestRunner: Return JobStatusResult
3 files reviewed, no comments
Greptile Overview
Greptile Summary
Integrates DDLB (Distributed Deep Learning Benchmark) workload with configuration files, test definitions, command generation strategy, and registration in the CloudAI framework.
Key Changes
- Added new workload module at `src/cloudai/workloads/ddlb/` with test definition and Slurm command generation
- Configured DDLB test parameters including primitives, matrix dimensions (m, n, k), implementations, and Docker image
- Registered DDLB test definition and command generation strategy in `registration.py`
Issues Found
- Critical: Command generation uses `is` for string comparison instead of `==`, causing Python SyntaxWarning
- Critical: List values in configuration (e.g., `m = [1024,8192]`, `impl = [...]`) will be formatted as Python literals instead of proper CLI arguments
- Logic error: Duplicate error check makes lines 68-77 in `ddlb.py` unreachable
- Multiple style issues from previous review (unused imports, alphabetical ordering)
Confidence Score: 2/5
- Not safe to merge—critical command generation bugs will cause runtime failures
- Two critical logic errors in command generation strategy: using `is` instead of `==` for string comparison (Python SyntaxWarning) and no handling of list values which are defined in the config. List parameters like `m = [1024,8192]` will generate malformed commands like `-m [1024, 8192]` instead of proper CLI format. These will cause the DDLB benchmark to fail at runtime.
- Pay close attention to `src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py` which has critical command generation bugs
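Both command-generation issues can be shown in a few lines. This is a sketch, not the PR's code: `format_cli_args` is a hypothetical helper, the dictionary input stands in for the real `cmd_args` object, and the comma-separated rendering follows the `-m 1024,8192` form that a later review in this thread names as the expected syntax. The implementation names in the usage example are placeholders.

```python
from typing import Dict, List, Union

ArgValue = Union[int, str, List[int], List[str]]


def format_cli_args(cmd_args: Dict[str, ArgValue]) -> List[str]:
    """Render arguments as CLI tokens, joining list values with commas."""
    parts: List[str] = []
    for name, value in cmd_args.items():
        # Single-letter names get one dash (-m), longer names two (--impl).
        flag = f"-{name}" if len(name) == 1 else f"--{name}"
        if isinstance(value, list):
            rendered = ",".join(str(v) for v in value)
        else:
            rendered = str(value)
        # Compare strings by value with ==; `is` tests object identity and
        # triggers a SyntaxWarning when used with a literal.
        if rendered == "":
            continue
        parts.extend([flag, rendered])
    return parts


print(format_cli_args({"m": [1024, 8192], "n": 128, "impl": ["impl_a", "impl_b"]}))
```

Calling `str()` on a list is what yields the `[1024, 8192]` literal, so the `isinstance` branch is the part that matters for the list bug.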
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 2/5 | Command generation has critical logic error using is for string comparison and doesn't handle list values which will cause malformed commands |
| src/cloudai/workloads/ddlb/ddlb.py | 3/5 | Validation logic has duplicate error checks and unused imports; error detection pattern may produce false positives |
| src/cloudai/registration.py | 4/5 | Registration follows standard pattern but import placement breaks alphabetical ordering |
Sequence Diagram
sequenceDiagram
participant User
participant CloudAI
participant Registry
participant DDLBTestDef
participant SlurmCmdGen
participant SlurmSystem
participant Docker
User->>CloudAI: Load test config (ddlb_test.toml)
CloudAI->>Registry: Lookup DDLBTest
Registry->>DDLBTestDef: Create test definition
DDLBTestDef->>Docker: Initialize docker image
Docker-->>DDLBTestDef: Return image handle
User->>CloudAI: Run test scenario
CloudAI->>Registry: Get command gen strategy
Registry->>SlurmCmdGen: Create strategy for Slurm+DDLB
SlurmCmdGen->>DDLBTestDef: Get cmd_args (m, n, k, impl, etc)
SlurmCmdGen->>SlurmCmdGen: Generate test command
Note over SlurmCmdGen: Build CLI args from config<br/>Handle list params (BUG HERE)
SlurmCmdGen->>SlurmSystem: Submit slurm job
SlurmSystem->>Docker: Run container with benchmark
Docker->>Docker: Execute python ddlb/cli/benchmark.py
Docker-->>SlurmSystem: Write stdout.txt
SlurmSystem-->>CloudAI: Job complete
CloudAI->>DDLBTestDef: was_run_successful()
DDLBTestDef->>DDLBTestDef: Check stdout.txt for "Benchmark Results"
DDLBTestDef-->>User: Return success/failure
6 files reviewed, 2 comments
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Greptile Overview
Greptile Summary
Adds DDLB (Distributed Deep Learning Benchmark) workload integration to CloudAI, enabling benchmarking of tensor parallel operations within the framework.
Key additions:
- New `DDLBTestDefinition` and `DDLBTestSlurmCommandGenStrategy` classes for test definition and Slurm command generation
- Configuration files for test scenarios and experimental tests
- Integration with the registry system and test acceptance suite
Critical issues identified:
- List parameter handling in command generation will produce malformed CLI commands (e.g., `-m [1024, 8192]` instead of proper format)
- String comparison using `is` operator instead of `==`
- Duplicate error checking logic making code unreachable
- Several style issues including unused imports and minor formatting inconsistencies
Confidence Score: 2/5
- This PR has critical bugs that will cause runtime failures with list parameters
- The list handling bug in `slurm_command_gen_strategy.py:46-52` will cause command generation failures when using the experimental config with list values for `m`, `n`, `k`, or `impl` parameters. The `is` operator issue and duplicate error check are also problematic. These are not style issues but actual logic errors that need resolution before merge.
- Primary attention needed: `src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py` (list handling and string comparison), `src/cloudai/workloads/ddlb/ddlb.py` (duplicate error check)
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 2/5 | Command generation strategy with critical list handling bug and string comparison issue |
| src/cloudai/workloads/ddlb/ddlb.py | 3/5 | Test definition with duplicate error check and unused imports |
| conf/experimental/test/ddlb_test.toml | 4/5 | Test config with list parameters that will trigger command generation bug |
Sequence Diagram
sequenceDiagram
participant User
participant Registry
participant DDLBTestDefinition
participant DDLBTestSlurmCommandGenStrategy
participant SlurmSystem
participant DockerImage
participant TestRun
User->>Registry: Register DDLBTest workload
Registry->>DDLBTestDefinition: Create test definition from config
DDLBTestDefinition->>DockerImage: Initialize docker_image with URL
User->>TestRun: Execute test
TestRun->>DDLBTestSlurmCommandGenStrategy: generate_test_command()
DDLBTestSlurmCommandGenStrategy->>DDLBTestDefinition: Get cmd_args
DDLBTestSlurmCommandGenStrategy->>DDLBTestSlurmCommandGenStrategy: Build command with args
DDLBTestSlurmCommandGenStrategy->>SlurmSystem: Generate sbatch script
SlurmSystem->>SlurmSystem: Execute srun with container
SlurmSystem->>TestRun: Write stdout.txt
TestRun->>DDLBTestDefinition: was_run_successful(tr)
DDLBTestDefinition->>DDLBTestDefinition: Check stdout.txt for errors
DDLBTestDefinition->>DDLBTestDefinition: Check for "Benchmark Results"
DDLBTestDefinition->>TestRun: Return JobStatusResult
9 files reviewed, no comments
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Greptile Overview
Greptile Summary
Integrates DDLB (Distributed Deep Learning Benchmarks) as a new workload in CloudAI, enabling benchmarking of distributed primitives like tensor parallelism.
Key changes:
- Added `DDLBTestDefinition` and `DDLBCmdArgs` classes with support for configurable parameters (primitive, m/n/k dimensions, dtype, implementations)
- Implemented `DDLBTestSlurmCommandGenStrategy` for Slurm command generation
- Registered DDLB workload in the CloudAI registry
- Added comprehensive documentation with usage examples
Critical issues found:
- Missing `Union` import in `ddlb.py` will cause immediate runtime `NameError` when class is loaded
- List parameter handling in command generation produces malformed CLI syntax (e.g., `-m [1024, 8192]` instead of `-m 1024,8192`)
These issues will prevent the workload from functioning correctly and must be fixed before merge.
Confidence Score: 1/5
- This PR has critical bugs that will cause runtime failures and cannot be merged safely
- Two syntax/logic errors will break functionality: (1) missing `Union` import causes `NameError` on module load, (2) list parameters generate malformed CLI commands. Both prevent workload from executing. Documentation and integration structure are solid, but code issues are blocking.
- Critical attention needed for `src/cloudai/workloads/ddlb/ddlb.py` (missing import) and `src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py` (list handling bug)
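The missing-import issue is easy to picture with a stripped-down argument class. This is a sketch only: `DDLBCmdArgsSketch`, the `IntOrList` alias, the `as_list` helper, and the default values are all hypothetical, and a plain dataclass is used here in place of whatever base class the real `DDLBCmdArgs` derives from.

```python
from dataclasses import dataclass
from typing import List, Union  # dropping Union here makes the alias below raise NameError

# Each dimension may be a single value or a list of values to sweep over.
IntOrList = Union[int, List[int]]


@dataclass
class DDLBCmdArgsSketch:
    """Hypothetical subset of the DDLB arguments (defaults are assumptions)."""

    m: IntOrList = 1024
    n: IntOrList = 1024
    k: IntOrList = 1024


def as_list(value: IntOrList) -> List[int]:
    """Normalize a scalar-or-list field to a list for uniform handling."""
    return list(value) if isinstance(value, list) else [value]


args = DDLBCmdArgsSketch(m=[1024, 8192])
print(as_list(args.m), as_list(args.n))
```

Because the name is looked up when the module is executed, the failure would surface as soon as the workload module is imported, before any test runs.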
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| src/cloudai/workloads/ddlb/ddlb.py | 2/5 | Core DDLB test definition with missing Union import causing runtime error; success checking logic is functional but has generic error pattern |
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 1/5 | Command generation strategy with critical list handling bug that will produce malformed CLI commands for list-type parameters |
Sequence Diagram
sequenceDiagram
participant User
participant CloudAI
participant Registry
participant DDLBTestDefinition
participant DDLBTestSlurmCommandGenStrategy
participant SlurmSystem
participant DockerImage
participant DDLB
User->>CloudAI: Define DDLB test (TOML config)
CloudAI->>Registry: Register DDLBTestDefinition
Registry->>DDLBTestDefinition: Create test definition
DDLBTestDefinition->>DockerImage: Initialize docker_image from URL
User->>CloudAI: Run test scenario
CloudAI->>Registry: Get command generation strategy
Registry->>DDLBTestSlurmCommandGenStrategy: Instantiate strategy
DDLBTestSlurmCommandGenStrategy->>DDLBTestDefinition: Get cmd_args
DDLBTestSlurmCommandGenStrategy->>DDLBTestSlurmCommandGenStrategy: generate_test_command()
Note over DDLBTestSlurmCommandGenStrategy: Build Python command with DDLB CLI args
DDLBTestSlurmCommandGenStrategy->>SlurmSystem: Generate sbatch script
Note over SlurmSystem: Include srun with container mounts
SlurmSystem->>DDLB: Execute via srun in container
DDLB->>DDLB: Run benchmark (primitive, m, n, k, dtype, impl)
DDLB-->>SlurmSystem: Write stdout.txt with results
CloudAI->>DDLBTestDefinition: was_run_successful(test_run)
DDLBTestDefinition->>DDLBTestDefinition: Check stdout.txt for "Benchmark Results"
DDLBTestDefinition->>DDLBTestDefinition: Check for "Error" patterns
DDLBTestDefinition-->>CloudAI: Return JobStatusResult
4 files reviewed, 2 comments
Greptile Overview
Greptile Summary
Adds DDLB (Distributed Deep Learning Benchmarks) workload integration to CloudAI framework for Slurm systems. The implementation includes test definitions, command generation strategy, registration, documentation, and test coverage.
Key Changes:
- New `DDLBTestDefinition` and `DDLBCmdArgs` classes with support for benchmark parameters (`m`, `n`, `k`, `dtype`, `num_iterations`, `num_warmups`, `impl`, `primitive`)
- `DDLBTestSlurmCommandGenStrategy` generates Slurm commands for running DDLB benchmarks in Docker containers
- Success checking based on "Benchmark Results" presence in stdout.txt
- Registration in CloudAI's workload registry
- Test fixtures and reference data for validation
Critical Issue:
The command generation strategy does not handle list values correctly. When m, n, k, or impl are lists (as shown in conf/experimental/test/ddlb_test.toml), they will be formatted as Python list literals (e.g., -m [1024, 8192]) instead of proper CLI format. This will cause runtime errors when the DDLB CLI attempts to parse these arguments.
Confidence Score: 3/5
- This PR has a critical bug in list handling that will cause runtime failures
- Score of 3 reflects proper framework integration and test coverage, but the unresolved list formatting bug in slurm_command_gen_strategy.py will cause failures when using list values for m, n, k, or impl parameters as demonstrated in the example configuration
- src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py requires immediate attention to fix list argument handling
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 2/5 | Command generation strategy with list handling bug that will break CLI arguments when lists are used |
| src/cloudai/workloads/ddlb/ddlb.py | 4/5 | Core test definition with proper type annotations and success checking logic |
Sequence Diagram
sequenceDiagram
participant User
participant Registry
participant DDLBTestDefinition
participant SlurmCommandGenStrategy
participant DockerImage
participant SlurmSystem
User->>Registry: Register DDLBTest workload
Registry->>Registry: add_test_definition("DDLBTest", DDLBTestDefinition)
Registry->>Registry: add_command_gen_strategy(SlurmSystem, DDLBTestDefinition, DDLBTestSlurmCommandGenStrategy)
User->>DDLBTestDefinition: Load test configuration from TOML
DDLBTestDefinition->>DDLBTestDefinition: Initialize DDLBCmdArgs (m, n, k, impl, etc.)
DDLBTestDefinition->>DockerImage: Create DockerImage from docker_image_url
User->>SlurmSystem: Execute test run
SlurmSystem->>SlurmCommandGenStrategy: Get command generation strategy
SlurmCommandGenStrategy->>DDLBTestDefinition: Get cmd_args
SlurmCommandGenStrategy->>SlurmCommandGenStrategy: generate_test_command()
SlurmCommandGenStrategy->>SlurmCommandGenStrategy: Format CLI arguments (m, n, k, impl)
Note over SlurmCommandGenStrategy: BUG: List values formatted as Python literals
SlurmCommandGenStrategy->>SlurmSystem: Return srun command
SlurmSystem->>DockerImage: Execute in container
DockerImage->>DockerImage: Run DDLB benchmark
DockerImage-->>SlurmSystem: Write stdout.txt
SlurmSystem->>DDLBTestDefinition: was_run_successful(TestRun)
DDLBTestDefinition->>DDLBTestDefinition: Check for "Error" in stdout.txt
DDLBTestDefinition->>DDLBTestDefinition: Check for "Benchmark Results" in stdout.txt
DDLBTestDefinition-->>User: Return JobStatusResult
4 files reviewed, no comments
Greptile Overview
Greptile Summary
Adds DDLB (Distributed Deep Learning Benchmarks) workload integration to CloudAI framework, including test definitions, Slurm command generation strategy, configuration files, and documentation.
Key changes:
- New `DDLBTestDefinition` and `DDLBCmdArgs` classes with support for matrix multiplication benchmarking parameters
- Slurm command generation strategy for executing DDLB benchmarks in containerized environments
- Test configuration files with examples of single and list-valued parameters
- Registration of DDLB workload in the CloudAI registry
Critical issues found:
- List parameter handling in command generation will produce malformed CLI commands (e.g., `-m [1024, 8192]` instead of proper format)
- Duplicate error checking logic makes code unreachable
- String comparison using `is` operator instead of `==`
Confidence Score: 2/5
- This PR contains critical bugs that will cause runtime failures when list parameters are used
- The list parameter handling bug in slurm_command_gen_strategy.py will generate malformed commands that will fail at runtime when users attempt to use list values for parameters like `m`, `n`, `k`, or `impl`. The duplicate error check makes significant code unreachable, though this may not cause immediate failures. The string comparison issue using `is` will trigger Python warnings and may cause incorrect behavior.
- src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py requires immediate attention to fix list parameter handling; src/cloudai/workloads/ddlb/ddlb.py needs duplicate error check removed
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| conf/experimental/test/ddlb_test.toml | 4/5 | Configuration file defining DDLB test parameters with list values for m and impl arguments |
| src/cloudai/workloads/ddlb/ddlb.py | 3/5 | Test definition with duplicate error check at line 67, making lines 68-77 unreachable, and generic error pattern matching that may cause false positives |
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 2/5 | Command generation strategy with critical list handling bug causing malformed CLI commands, and string comparison using 'is' operator instead of '==' |
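The `is` versus `==` point can be illustrated with a small Python sketch. The string `"gemm"` is a placeholder value chosen for illustration, not a name taken from the PR.

```python
def same_object(a: str, b: str) -> bool:
    """Identity check: True only when both names refer to one object."""
    return a is b


def same_value(a: str, b: str) -> bool:
    """Equality check: True when the two strings have equal contents."""
    return a == b


expected = "gemm"  # placeholder value
# Build an equal string at runtime so it is not the same literal as `expected`.
actual = "".join(["ge", "mm"])

by_value = same_value(actual, expected)       # True: contents match
by_identity = same_object(actual, expected)   # implementation-dependent
```

Whether two equal strings share one object depends on interpreter details such as interning, so an identity check on strings can pass in one situation and fail in another. Comparing against a string literal with `is` also raises a `SyntaxWarning` on Python 3.8 and later.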
Sequence Diagram
sequenceDiagram
participant User
participant Registry
participant DDLBTestDefinition
participant SlurmCommandGenStrategy
participant SlurmSystem
participant Container
User->>Registry: Register DDLB workload
Registry->>Registry: Add DDLBTestDefinition
Registry->>Registry: Add DDLBTestSlurmCommandGenStrategy
User->>DDLBTestDefinition: Create test with cmd_args
DDLBTestDefinition->>DDLBTestDefinition: Validate parameters (m, n, k, impl, etc)
DDLBTestDefinition->>DockerImage: Initialize docker_image
User->>SlurmCommandGenStrategy: Generate test command
SlurmCommandGenStrategy->>DDLBTestDefinition: Get cmd_args
SlurmCommandGenStrategy->>SlurmCommandGenStrategy: Build command parts
Note over SlurmCommandGenStrategy: Formats parameters as CLI args<br/>(-m, -n, -k, --impl, etc)
SlurmCommandGenStrategy-->>User: Return command list
User->>SlurmSystem: Submit sbatch script
SlurmSystem->>Container: Execute srun with container-image
Container->>Container: Run python ddlb/cli/benchmark.py
Container->>Container: Execute benchmark
Container-->>SlurmSystem: Write stdout.txt
SlurmSystem->>DDLBTestDefinition: Check was_run_successful()
DDLBTestDefinition->>DDLBTestDefinition: Read stdout.txt
DDLBTestDefinition->>DDLBTestDefinition: Check for "Error" pattern
DDLBTestDefinition->>DDLBTestDefinition: Check for "Benchmark Results"
DDLBTestDefinition-->>User: Return JobStatusResult
1 file reviewed, 1 comment
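The final steps of the diagram (read `stdout.txt`, look for "Error", look for "Benchmark Results") can be sketched as follows. This is an assumed shape, not the code in `ddlb.py`: `StatusSketch` is a hypothetical stand-in for CloudAI's `JobStatusResult`, and `check_stdout` is an illustrative name.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class StatusSketch:
    """Hypothetical stand-in for CloudAI's JobStatusResult."""

    is_successful: bool
    error_message: str = ""


def check_stdout(output_dir: Path) -> StatusSketch:
    """Classify a run from its stdout.txt, checking each condition once."""
    stdout_path = output_dir / "stdout.txt"
    if not stdout_path.is_file():
        return StatusSketch(False, f"{stdout_path} not found")
    content = stdout_path.read_text()
    if "Error" in content:
        return StatusSketch(False, "stdout.txt contains 'Error'")
    if "Benchmark Results" in content:
        return StatusSketch(True)
    return StatusSketch(False, "'Benchmark Results' not found in stdout.txt")


# Exercise the three outcomes against a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    missing = check_stdout(run_dir)
    (run_dir / "stdout.txt").write_text("Benchmark Results\n")
    passed = check_stdout(run_dir)
    (run_dir / "stdout.txt").write_text("Error: example failure\nBenchmark Results\n")
    failed = check_stdout(run_dir)
```

A plain substring test for "Error" also matches benign lines that merely contain that word, which is the false-positive risk noted in the file analysis above.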
@amaslenn I unmarked the PR as draft. I believe it should be ready for merging once there's a clear answer on the greptile comment about the list argument. Please let me know what you think.
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Greptile Overview
Greptile Summary
Added experimental test configuration for DDLB workload with parameters for tensor parallelism benchmarking.
Key changes:
- New `ddlb_test.toml` configuration file with test parameters (primitive type, matrix dimensions m/n/k, dtype, iterations)
- Configured with two implementation variants: PyTorch with NCCL backend and fuser with CUDA backend
- Uses list values for `m` parameter `[1024, 8192]` and multiple `impl` configurations to test different scenarios
- References Docker image from `gitlab-master.nvidia.com/nsarkauskas/ddlb:latest`
Configuration structure:
The file follows standard CloudAI test configuration format with cmd_args section matching the DDLBCmdArgs schema defined in src/cloudai/workloads/ddlb/ddlb.py. Parameters include matrix dimensions, data type, warmup/iteration counts, and implementation variants.
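As a rough illustration of a schema whose fields accept either a scalar or a list, here is a minimal Python sketch. `CmdArgsSketch` is a hypothetical stand-in for `DDLBCmdArgs` written with plain dataclasses, the field names follow the review text, and the values for `primitive`, `n`, `k`, and `dtype` are placeholders rather than values from `ddlb_test.toml`.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class CmdArgsSketch:
    """Hypothetical stand-in for DDLBCmdArgs; not the class from the PR."""

    primitive: str
    m: Union[int, List[int]]
    n: Union[int, List[int]]
    k: Union[int, List[int]]
    dtype: str
    impl: Union[str, List[str]]

    def list_valued_fields(self) -> List[str]:
        """Return the names of fields that currently hold a list."""
        return [name for name, value in vars(self).items() if isinstance(value, list)]


args = CmdArgsSketch(
    primitive="<primitive>",            # placeholder
    m=[1024, 8192],                     # list value quoted in the review
    n=4096,                             # placeholder
    k=4096,                             # placeholder
    dtype="<dtype>",                    # placeholder
    impl=["pytorch;...", "fuser;..."],  # elided in the review, kept elided here
)
list_fields = args.list_valued_fields()
```

Knowing which fields hold lists is the information a command generator needs before it decides how to render them.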
Confidence Score: 4/5
- This PR is safe to merge with minimal risk: it only adds a new configuration file.
- The configuration file is well-structured and follows established patterns. However, there are existing issues in the command generation strategy (already commented on) that affect list parameter handling. The config itself is correct, but runtime behavior depends on fixes to `slurm_command_gen_strategy.py`.
- No files in this PR require special attention. However, note that `src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py` (not in this PR) needs fixes for proper list parameter handling.
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| conf/experimental/test/ddlb_test.toml | 4/5 | New DDLB test configuration with proper structure, valid parameters, and clear documentation. Uses list values for m and impl parameters which align with the workload's type annotations. |
Sequence Diagram
sequenceDiagram
participant User
participant CloudAI
participant Config as ddlb_test.toml
participant TestDef as DDLBTestDefinition
participant CmdGen as SlurmCommandGenStrategy
participant Slurm
participant Docker as DDLB Container
User->>CloudAI: Load test configuration
CloudAI->>Config: Read ddlb_test.toml
Config-->>CloudAI: Test parameters (primitive, m, n, k, dtype, impl, etc.)
CloudAI->>TestDef: Create DDLBTestDefinition with cmd_args
TestDef->>TestDef: Instantiate DockerImage from docker_image_url
CloudAI->>CmdGen: Generate Slurm commands
CmdGen->>TestDef: Get cmd_args and docker_image
CmdGen->>CmdGen: Build command: python ddlb/cli/benchmark.py with args
CmdGen-->>CloudAI: Return srun command with container params
CloudAI->>Slurm: Submit job with generated sbatch script
Slurm->>Docker: Launch container with DDLB benchmark
Docker->>Docker: Execute benchmark with specified parameters
Docker-->>Slurm: Write results to stdout.txt
Slurm-->>CloudAI: Job completion signal
CloudAI->>TestDef: Check job success (was_run_successful)
TestDef->>TestDef: Parse stdout.txt for "Benchmark Results"
TestDef-->>CloudAI: JobStatusResult
1 file reviewed, no comments
Greptile Overview
Greptile Summary
This PR relocates the DDLB test scenario configuration file from conf/common/test_scenario/ to conf/experimental/test_scenario/ with no content modifications. This is a pure file move (git rename) to organize the DDLB workload under the experimental configurations directory.
Key Changes:
- Moved `ddlb_test.toml` from common to experimental directory
- No modifications to file content
- Aligns with the experimental status of the DDLB workload integration
Analysis:
The test scenario configuration remains unchanged and properly defines a single DDLB test with 1 node and 30-minute timeout. The file references ddlb_test which exists in conf/experimental/test/ddlb_test.toml with proper command arguments defined.
Confidence Score: 5/5
- This PR is completely safe to merge - it's a pure file relocation with zero content changes or functional impact
- Perfect score reflects that this is a simple git rename operation with no code modifications, no logical changes, and no risk of introducing bugs. The file move from common to experimental is organizationally appropriate for the DDLB workload integration.
- No files require special attention
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| conf/experimental/test_scenario/ddlb_test.toml | 5/5 | File moved from conf/common/ to conf/experimental/ with no content changes - simple relocation to appropriate directory for experimental DDLB workload |
Sequence Diagram
sequenceDiagram
participant Dev as Developer
participant Git as Git Repository
participant Common as conf/common/test_scenario/
participant Exp as conf/experimental/test_scenario/
Dev->>Git: Move ddlb_test.toml
Git->>Common: Remove ddlb_test.toml
Git->>Exp: Add ddlb_test.toml
Note over Exp: Same content, new location
Dev->>Git: Commit rename operation
Note over Git: Pure file relocation<br/>No content changes
1 file reviewed, no comments
@nsarka please resolve conflicts and I'll review the PR.
Greptile Overview
Greptile Summary
Adds DDLB (Distributed Deep Learning Benchmarks) workload integration including test definitions, Slurm command generation, configuration files, and documentation.
Key changes:
- New DDLB workload implementation in `src/cloudai/workloads/ddlb/`
- Registration of DDLB test definition and command generation strategy
- Configuration files for testing with support for list-valued parameters (`m`, `n`, `k`, `impl`)
- Reference sbatch script and acceptance tests
Critical issue:
- Command generator doesn't handle list parameters correctly - will produce Python literal syntax like `-m [1024, 8192]` instead of proper CLI format, causing runtime failures when using list values as configured in `conf/experimental/test/ddlb_test.toml`
Confidence Score: 2/5
- This PR has a critical bug that will cause failures when using list parameters as configured
- The command generation logic in `slurm_command_gen_strategy.py:46-51` will produce malformed CLI commands when parameters like `m`, `n`, `k`, or `impl` are lists (Python literal strings like `[1024, 8192]` instead of proper CLI syntax). The configuration file explicitly uses list values, so this will fail at runtime.
- `src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py` requires immediate attention to fix list parameter handling before merge
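One possible reading of list values is a sweep, where each combination becomes its own scalar-only command. The Python sketch below shows that interpretation only; the thread does not say whether the PR intends it. `expand_sweep` and `to_cli` are hypothetical helpers, the `impl` values are placeholders, and the flag names `-m`, `-n`, `-k`, and `--impl` come from the sequence diagram earlier in this conversation.

```python
from itertools import product
from typing import Any, Dict, Iterator, List


def expand_sweep(params: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one scalar-only parameter set per combination of list entries."""
    names = list(params)
    # Wrap scalars in a one-element list so every field can be iterated.
    choices = [v if isinstance(v, list) else [v] for v in params.values()]
    for combo in product(*choices):
        yield dict(zip(names, combo))


def to_cli(params: Dict[str, Any], flags: Dict[str, str]) -> List[str]:
    """Render a scalar-only parameter set as a flat list of CLI tokens."""
    parts: List[str] = []
    for name, value in params.items():
        parts += [flags[name], str(value)]
    return parts


FLAGS = {"m": "-m", "n": "-n", "k": "-k", "impl": "--impl"}
# `impl_a` and `impl_b` are placeholder names, not real DDLB implementations.
config = {"m": [1024, 8192], "n": 4096, "k": 4096, "impl": ["impl_a", "impl_b"]}
commands = [to_cli(p, FLAGS) for p in expand_sweep(config)]
```

With two values for `m` and two for `impl`, this produces four commands, none of which contains a Python list literal.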
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 2/5 | New DDLB Slurm command generator with critical list handling bug that will produce malformed CLI commands |
| src/cloudai/workloads/ddlb/ddlb.py | 4/5 | New DDLB test definition with success checking logic, minor style issues but functionally sound |
| conf/experimental/test/ddlb_test.toml | 3/5 | DDLB test configuration with list values that may not be handled correctly by command generator |
Sequence Diagram
sequenceDiagram
participant User
participant CloudAI
participant Registry
participant DDLBTestDef
participant SlurmCmdGen
participant Slurm
participant Container
User->>CloudAI: Load DDLB test config
CloudAI->>Registry: Register DDLBTestDefinition
CloudAI->>Registry: Register DDLBTestSlurmCommandGenStrategy
User->>CloudAI: Execute DDLB test
CloudAI->>DDLBTestDef: Parse cmd_args (m, n, k, impl, etc.)
DDLBTestDef->>DDLBTestDef: Create DockerImage from docker_image_url
CloudAI->>SlurmCmdGen: Generate test command
SlurmCmdGen->>SlurmCmdGen: Build srun command with container args
SlurmCmdGen->>SlurmCmdGen: Format CLI args (m, n, k, dtype, impl)
Note over SlurmCmdGen: BUG: Lists formatted as Python literals
SlurmCmdGen->>Slurm: Submit sbatch script
Slurm->>Container: Launch DDLB container
Container->>Container: Execute python ddlb/cli/benchmark.py
Container->>Container: Run benchmarks, output results
Container-->>CloudAI: stdout.txt with results
CloudAI->>DDLBTestDef: Check was_run_successful()
DDLBTestDef->>DDLBTestDef: Verify "Benchmark Results" in stdout
DDLBTestDef-->>User: Return success/failure status
11 files reviewed, 2 comments
Greptile Overview
Greptile Summary
Adds DDLB (Distributed Deep Learning Benchmarks) workload integration with Slurm command generation and test acceptance coverage.
Major changes:
- New `DDLBTestSlurmCommandGenStrategy` class to generate DDLB benchmark CLI commands from test configurations
- Test acceptance coverage for DDLB workload with sample configuration
Critical issues found:
- List parameters (`m`, `n`, `k`, `impl`) are not handled correctly and will produce malformed CLI commands (e.g., `-m [1024, 8192]` instead of expected format)
- `image_path()` method doesn't handle `None` case for `installed_path`, which would return string `"None"`
Confidence Score: 1/5
- This PR contains critical logic bugs that will cause runtime failures and cannot be merged as-is
- Previous review comments identified critical list handling bugs in lines 46-52 that remain unfixed. When list parameters like `m = [1024, 8192]` or `impl = ["pytorch;...", "fuser;..."]` are used (as shown in conf/experimental/test/ddlb_test.toml), the code will generate invalid CLI syntax like `-m [1024, 8192]` instead of proper format. This will cause command execution to fail.
- src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py requires immediate fixes for list handling (lines 46-52) and None check in image_path (line 32)
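The `image_path()` concern comes down to what `str()` does with `None`. The sketch below uses hypothetical standalone functions rather than the method from the PR, and whether `installed_path` can actually be `None` in CloudAI is not established in this thread.

```python
from pathlib import Path
from typing import Optional


def naive_image_path(installed_path: Optional[Path]) -> str:
    """Reproduce the concern: str(None) silently yields the text 'None'."""
    return str(installed_path)


def checked_image_path(installed_path: Optional[Path]) -> str:
    """Fail loudly instead of handing the string 'None' to the container runtime."""
    if installed_path is None:
        raise ValueError("Docker image has no installed path")
    return str(installed_path)


silent = naive_image_path(None)                  # "None"
valid = checked_image_path(Path("image.sqsh"))   # "image.sqsh"
```

Raising early puts the failure at command-generation time, before a job is submitted with a container image literally named `None`.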
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 1/5 | Critical list handling bugs that will cause CLI command failures; image_path may return string "None" |
| tests/test_acceptance.py | 5/5 | Clean addition of DDLB test to test matrix with proper imports and test definition |
Sequence Diagram
sequenceDiagram
participant User
participant TestRunner
participant DDLBTestDefinition
participant DDLBTestSlurmCommandGenStrategy
participant SlurmSystem
participant DDLBContainer
User->>TestRunner: Request DDLB workload execution
TestRunner->>DDLBTestDefinition: Load test config (cmd_args)
DDLBTestDefinition->>DDLBTestDefinition: Parse parameters (m, n, k, impl, etc.)
TestRunner->>DDLBTestSlurmCommandGenStrategy: generate_test_command()
DDLBTestSlurmCommandGenStrategy->>DDLBTestDefinition: Get cmd_args
DDLBTestSlurmCommandGenStrategy->>DDLBTestSlurmCommandGenStrategy: Build CLI command parts
Note over DDLBTestSlurmCommandGenStrategy: Format: python ddlb/cli/benchmark.py -m X -n Y --impl Z
DDLBTestSlurmCommandGenStrategy->>DDLBTestDefinition: image_path()
DDLBTestDefinition-->>DDLBTestSlurmCommandGenStrategy: Docker image path
DDLBTestSlurmCommandGenStrategy-->>TestRunner: Return command parts list
TestRunner->>SlurmSystem: Submit job with srun command
SlurmSystem->>DDLBContainer: Execute benchmark.py with args
DDLBContainer->>DDLBContainer: Run DDLB benchmark
DDLBContainer-->>SlurmSystem: Write stdout.txt with "Benchmark Results"
SlurmSystem-->>TestRunner: Job completed
TestRunner->>DDLBTestDefinition: was_run_successful()
DDLBTestDefinition->>DDLBTestDefinition: Check for "Benchmark Results" in stdout.txt
DDLBTestDefinition-->>TestRunner: JobStatusResult
TestRunner-->>User: Test results
2 files reviewed, no comments
Greptile Overview
Greptile Summary
Adds Slurm command generation strategy for DDLB (Distributed Deep Learning Benchmarks) workload integration. The implementation provides container-based execution with Docker image support and custom CLI argument handling for DDLB's benchmark parameters.
Key changes:
- Implements `DDLBTestSlurmCommandGenStrategy` extending `SlurmCommandGenStrategy`
- Provides Docker container path resolution via `image_path()` method
- Generates DDLB benchmark CLI commands with parameters like matrix dimensions (`m`, `n`, `k`), implementation types (`impl`), and iteration counts
- Includes success check by grepping for "Benchmark Results" in stdout
Critical issues requiring attention:
- List-typed arguments (`m`, `n`, `k`, `impl`) will produce malformed CLI syntax (e.g., `-m [1024, 8192]` instead of proper format)
- `str(None)` conversion on line 32 may produce string `"None"` instead of None value if docker image path is not set
- Missing proper list handling for multi-value parameters defined in `DDLBCmdArgs`
Confidence Score: 2/5
- Not safe to merge - contains critical logic errors that will cause runtime failures
- The list handling bug (lines 46-52) will produce malformed CLI commands for DDLB benchmark when list values are used (as shown in conf/experimental/test/ddlb_test.toml). This will cause the benchmark to fail at runtime with invalid argument syntax. Previous comments have identified this issue but it remains unresolved.
- src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py requires immediate attention for list argument handling fix
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 2/5 | New command generation strategy for DDLB workload; contains critical logic errors in list handling for CLI arguments and potential None string conversion issue |
Sequence Diagram
sequenceDiagram
participant User
participant SlurmSystem
participant DDLBTestSlurmCommandGenStrategy
participant DDLBTestDefinition
participant DockerImage
participant SlurmJob
User->>SlurmSystem: Submit DDLB test
SlurmSystem->>DDLBTestSlurmCommandGenStrategy: generate_test_command()
DDLBTestSlurmCommandGenStrategy->>DDLBTestDefinition: Get test definition
DDLBTestDefinition-->>DDLBTestSlurmCommandGenStrategy: Return test config
DDLBTestSlurmCommandGenStrategy->>DDLBTestDefinition: Get cmd_args (m, n, k, impl, etc)
DDLBTestDefinition-->>DDLBTestSlurmCommandGenStrategy: Return command arguments
DDLBTestSlurmCommandGenStrategy->>DDLBTestSlurmCommandGenStrategy: Build CLI command
Note over DDLBTestSlurmCommandGenStrategy: Format args for ddlb/cli/benchmark.py
DDLBTestSlurmCommandGenStrategy->>DDLBTestDefinition: image_path()
DDLBTestDefinition->>DockerImage: Get installed_path
DockerImage-->>DDLBTestDefinition: Return path
DDLBTestDefinition-->>DDLBTestSlurmCommandGenStrategy: Return image path
DDLBTestSlurmCommandGenStrategy-->>SlurmSystem: Return complete command
SlurmSystem->>SlurmJob: Execute with srun + container
SlurmJob->>SlurmJob: Run python ddlb/cli/benchmark.py
SlurmJob-->>SlurmSystem: Write stdout.txt
SlurmSystem->>DDLBTestSlurmCommandGenStrategy: gen_srun_success_check()
DDLBTestSlurmCommandGenStrategy-->>SlurmSystem: grep for "Benchmark Results"
SlurmSystem->>DDLBTestDefinition: was_run_successful()
DDLBTestDefinition->>DDLBTestDefinition: Check stdout.txt for errors/results
DDLBTestDefinition-->>User: Return success/failure status
1 file reviewed, no comments
Greptile Overview
Greptile Summary
This PR integrates the DDLB (Distributed Deep Learning Benchmarks) workload into CloudAI. The implementation follows the established pattern for workloads with Docker container support, including test definitions, Slurm command generation, configuration files, documentation, and tests.
Key additions:
Critical issue identified:
Confidence Score: 2/5
Important Files Changed
File Analysis
Sequence Diagram
sequenceDiagram
participant User
participant CloudAI
participant Registry
participant DDLBTestDefinition
participant DDLBTestSlurmCommandGenStrategy
participant DockerImage
participant SlurmSystem
participant Container
User->>CloudAI: Submit DDLB test configuration
CloudAI->>Registry: Load DDLBTest definition
Registry->>DDLBTestDefinition: Instantiate with CmdArgs
DDLBTestDefinition->>DockerImage: Create docker_image from URL
DDLBTestDefinition-->>CloudAI: Test definition ready
CloudAI->>Registry: Get command generation strategy
Registry->>DDLBTestSlurmCommandGenStrategy: Instantiate for Slurm
CloudAI->>DDLBTestSlurmCommandGenStrategy: generate_test_command()
DDLBTestSlurmCommandGenStrategy->>DDLBTestDefinition: Get cmd_args
DDLBTestSlurmCommandGenStrategy->>DDLBTestSlurmCommandGenStrategy: Build CLI arguments
Note over DDLBTestSlurmCommandGenStrategy: Formats m, n, k, primitive, dtype,<br/>num_iterations, num_warmups, impl
DDLBTestSlurmCommandGenStrategy-->>CloudAI: Return command parts list
CloudAI->>DDLBTestSlurmCommandGenStrategy: image_path()
DDLBTestSlurmCommandGenStrategy->>DockerImage: Get installed_path
DockerImage-->>DDLBTestSlurmCommandGenStrategy: Return path
DDLBTestSlurmCommandGenStrategy-->>CloudAI: Return image path
CloudAI->>SlurmSystem: Generate sbatch script
SlurmSystem->>Container: Execute srun with container-image
Container->>Container: Run python ddlb/cli/benchmark.py
Container-->>SlurmSystem: Write stdout.txt
SlurmSystem-->>CloudAI: Job complete
CloudAI->>DDLBTestDefinition: was_run_successful()
DDLBTestDefinition->>DDLBTestDefinition: Check stdout.txt exists
DDLBTestDefinition->>DDLBTestDefinition: Check for "Error" pattern
DDLBTestDefinition->>DDLBTestDefinition: Check for "Benchmark Results"
DDLBTestDefinition-->>CloudAI: Return JobStatusResult
CloudAI-->>User: Report test results
11 files reviewed, no comments
@nsarka let me know if we can merge it.
DDLB workload integration