Extend DSE for Env Vars along with Cmd Args #408

srivatsankrishnan · 2025-03-11T19:39:45Z

Summary

Many workloads defined in Test TOML files also require sweeps over environment variables (extra_env_vars). Previously, only cmd_args was exposed as DSE'ble. This PR extends DSE to parameters defined in the extra_env_vars field of the Test TOML.

Users can already specify any parameter under cmd_args as a list, which activates DSE. Likewise, they can now specify any parameter in extra_env_vars as a list; CloudAI will activate DSE for these parameters and add them to the agent's action space.

Example

name = "dse_nemo_run_llama3_8b"
description = "dse_nemo_run_llama3_8b"
test_template_name = "NeMoRun"

[cmd_args]
docker_image_url = "nvcr.io/nvidia/nemo:24.12.rc3"
task = "pretrain"
recipe_name = "llama3_8b"
...
...

[extra_env_vars]
NCCL_P2P_NET_CHUNKSIZE = ["2097152", "4194304"]
NVTE_FUSED_ATTN = ["1", "0"]
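
The expansion of list-valued entries into an action space can be sketched as follows. This is a minimal illustration, not CloudAI's actual implementation: the helper names action_space and grid are hypothetical, and the only detail taken from this PR is the 'extra_env_vars.<NAME>' key format visible in the dry-run logs.

```python
from itertools import product
from typing import Dict, Iterator, List, Union

# Hypothetical stand-in for the parsed [extra_env_vars] table from the example.
extra_env_vars: Dict[str, Union[str, List[str]]] = {
    "NCCL_P2P_NET_CHUNKSIZE": ["2097152", "4194304"],
    "NVTE_FUSED_ATTN": ["1", "0"],
}


def action_space(env_vars: Dict[str, Union[str, List[str]]]) -> Dict[str, List[str]]:
    """Collect list-valued entries, keyed as 'extra_env_vars.<NAME>'."""
    return {
        f"extra_env_vars.{name}": value
        for name, value in env_vars.items()
        if isinstance(value, list)  # plain strings are fixed, not swept
    }


def grid(space: Dict[str, List[str]]) -> Iterator[Dict[str, str]]:
    """Yield one action per point of the Cartesian product of all candidates."""
    keys = list(space)
    for combo in product(*(space[k] for k in keys)):
        yield dict(zip(keys, combo))


actions = list(grid(action_space(extra_env_vars)))
for action in actions:
    print(action)
```

With two candidates for each of the two variables, this sketch enumerates four distinct actions.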

Note:

Global env vars are defined per System TOML in CloudAI, so this PR focuses on extra_env_vars, which users control in the Test TOML. If there is a requirement for the System TOML, we should be able to figure something out too.
cmd_args is a nice object, whereas extra_env_vars was defined as a dict from the get-go. Let's revisit this and see whether it makes sense to move towards an object, modulo user experience requirements. This is largely an internal change and can be planned or done in phases.

Auxiliary inconsistency fixes

It looks like most local development uses Python 3.10, while the GitHub CI uses Python 3.9. One corner case is when local checks pass and the upstream ones fail.

An example is the failed CI test caused by a Python 3.10 feature.

In Python 3.10, the | syntax below is a valid alternative to Union from typing. Although this is now fixed to use Union for backward compatibility, moving the CI flow to 3.10 would in general save a lot of time, since that is what we use in the private repo as well as on most clusters.

Dict[str, str | List[str]]

Dict[str, Union[str, List[str]]]

In 3.9, the first form results in the following error:

src/cloudai/workloads/jax_toolbox/slurm_command_gen_strategy.py:82: in JaxToolboxSlurmCommandGenStrategy
    self, env_vars: Dict[str, str | List[str]], cmd_args: Dict[str, Any], num_nodes: int
E   TypeError: unsupported operand type(s) for |: 'type' and '_GenericAlias'
Error: Process completed with exit code 4.
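
The version difference can be checked directly. The snippet below is a small sketch under stated assumptions: the names PortableType and pep604_supported are hypothetical and not part of CloudAI. It shows the Union spelling, which evaluates on both versions, next to a runtime probe of the | spelling.

```python
import sys
from typing import List, Union

# Portable spelling: evaluates on Python 3.9 and on 3.10+ alike.
PortableType = Union[str, List[str]]


def pep604_supported() -> bool:
    """Return True if `str | List[str]` can be evaluated at runtime."""
    try:
        _ = str | List[str]  # raises TypeError on Python 3.9
    except TypeError:
        return False
    return True


print(sys.version_info[:2], pep604_supported())
```

Because annotations in a function signature are evaluated when the function is defined, the failure on 3.9 shows up at import time, which matches the traceback above.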

@amaslenn FYI

Discussion on March 14th (with @amaslenn @TaekyungHeo): Some older clusters might need 3.9. But cloudaix already uses 3.10, so it is not clear why we can't upgrade, given that cloudaix also needs to run on older clusters.

This needs a broader, separate discussion, as agreed.

Test Plan

CI/CD
Dry-run

Nemo2.0 Llama3-8b Model

cloudai dry-run --system-config ../cloudaix/conf/common/system/xxxx.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/dse_nemo_run_llama3_8b.toml 
[INFO] System Name: xxxx
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: nemo_run_llama3_8b
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: nemo_run_llama3_8b

Section Name: dse_nemo_run_llama3_8b_1
  Test Name: dse_nemo_run_llama3_8b
  Description: dse_nemo_run_llama3_8b
  No dependencies
[INFO] Initializing Runner [DRY-RUN] mode
[INFO] Creating SlurmRunner
[INFO] Running step 0 with action {'extra_env_vars.NCCL_P2P_NET_CHUNKSIZE': '2097152', 'extra_env_vars.NVTE_FUSED_ATTN': '1', ...obfuscated...}
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: 0
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[INFO] Step 1: Observation: [34.15068181818182], Reward: 0.029281992240272052
[INFO] Running step 0 with action {'extra_env_vars.NCCL_P2P_NET_CHUNKSIZE': '2097152', 'extra_env_vars.NVTE_FUSED_ATTN': '1', ...obfuscated...}
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: 0
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[WARNING] Skipping 'results/nemo_run_llama3_8b/dse_nemo_run_llama3_8b_1/0/2', can't handle with strategy=<class 'cloudai.workloads.nemo_run.report_generation_strategy.NeMoRunReportGenerationStrategy'>.
[INFO] Step 2: Observation: [-1.0], Reward: -1.0
[INFO] Running step 0 with action {'extra_env_vars.NCCL_P2P_NET_CHUNKSIZE': '4194304', 'extra_env_vars.NVTE_FUSED_ATTN': '1', ...obfuscated...}
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: 0
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[WARNING] Skipping 'results/nemo_run_llama3_8b/dse_nemo_run_llama3_8b_1/0/3', can't handle with strategy=<class 'cloudai.workloads.nemo_run.report_generation_strategy.NeMoRunReportGenerationStrategy'>.
[INFO] Step 3: Observation: [-1.0], Reward: -1.0
[INFO] Running step 0 with action {'extra_env_vars.NCCL_P2P_NET_CHUNKSIZE': '4194304', 'extra_env_vars.NVTE_FUSED_ATTN': '0', ...obfuscated...}
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: 0
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[WARNING] Skipping 'results/nemo_run_llama3_8b/dse_nemo_run_llama3_8b_1/0/4', can't handle with strategy=<class 'cloudai.workloads.nemo_run.report_generation_strategy.NeMoRunReportGenerationStrategy'>.
[INFO] Step 4: Observation: [-1.0], Reward: -1.0
[INFO] Running step 0 with action {'extra_env_vars.NCCL_P2P_NET_CHUNKSIZE': '2097152', 'extra_env_vars.NVTE_FUSED_ATTN': '1', ...obfuscated...}
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: 0
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[WARNING] Skipping 'results/nemo_run_llama3_8b/dse_nemo_run_llama3_8b_1/0/5', can't handle with strategy=<class 'cloudai.workloads.nemo_run.report_generation_strategy.NeMoRunReportGenerationStrategy'>.
[INFO] Step 5: Observation: [-1.0], Reward: -1.0
[INFO] Running step 0 with action {'extra_env_vars.NCCL_P2P_NET_CHUNKSIZE': '2097152', 'extra_env_vars.NVTE_FUSED_ATTN': '0', ...obfuscated...}
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: 0
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[WARNING] Skipping 'results/nemo_run_llama3_8b/dse_nemo_run_llama3_8b_1/0/6', can't handle with strategy=<class 'cloudai.workloads.nemo_run.report_generation_strategy.NeMoRunReportGenerationStrategy'>.
[INFO] Step 6: Observation: [-1.0], Reward: -1.0
[INFO] Running step 0 with action {'extra_env_vars.NCCL_P2P_NET_CHUNKSIZE': '4194304', 'extra_env_vars.NVTE_FUSED_ATTN': '1', ...obfuscated...}
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: 0
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[WARNING] Skipping 'results/nemo_run_llama3_8b/dse_nemo_run_llama3_8b_1/0/7', can't handle with strategy=<class 'cloudai.workloads.nemo_run.report_generation_strategy.NeMoRunReportGenerationStrategy'>.
[INFO] Step 7: Observation: [-1.0], Reward: -1.0
[INFO] Running step 0 with action {'extra_env_vars.NCCL_P2P_NET_CHUNKSIZE': '4194304', 'extra_env_vars.NVTE_FUSED_ATTN': '0', ...obfuscated...}
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: 0
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[WARNING] Skipping 'results/nemo_run_llama3_8b/dse_nemo_run_llama3_8b_1/0/8', can't handle with strategy=<class 'cloudai.workloads.nemo_run.report_generation_strategy.NeMoRunReportGenerationStrategy'>.
[INFO] Step 8: Observation: [-1.0], Reward: -1.0

NCCL Test All-Gather

$ cloudai dry-run --system-config ../cloudaix/conf/common/system/xxxx.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/dse_nccl_all_gather.toml 
[INFO] System Name: xxx
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: dse-nccl-test
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: dse-nccl-test

Section Name: Tests.1
  Test Name: dse_nccl_all_gather
  Description: all_gather
  No dependencies
[INFO] Initializing Runner [DRY-RUN] mode
[INFO] Creating SlurmRunner
[INFO] Running step 0 with action {'docker_image_url': 'nvcr.io/nvidia/pytorch:24.02-py3', 'subtest_name': 'all_gather_perf_mpi', 'nthreads': 1, 'ngpus': 1, 'minbytes': '128', 'maxbytes': '4G', 'stepbytes': '1M', 'op': 'sum', 'datatype': 'float', 'root': 0, 'iters': 100, 'warmup_iters': 50, 'agg_iters': 1, 'average': 1, 'parallel_init': 0, 'check': 1, 'blocking': 0, 'cudagraph': 0, 'extra_env_vars.NCCL_TESTS_SPLIT_MASK': '0x7'}
[INFO] Starting test scenario execution.
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Submitted slurm job: 0
[INFO] Job completed: Tests.1
[WARNING] Skipping 'results/dse-nccl-test/Tests.1/0/1', can't handle with strategy=<class 'cloudai.workloads.nccl_test.report_generation_strategy.NcclTestReportGenerationStrategy'>.
[INFO] Step 1: Observation: [-1.0], Reward: -1.0
[INFO] Running step 0 with action {'docker_image_url': 'nvcr.io/nvidia/pytorch:24.02-py3', 'subtest_name': 'all_gather_perf_mpi', 'nthreads': 1, 'ngpus': 1, 'minbytes': '128', 'maxbytes': '4G', 'stepbytes': '1M', 'op': 'sum', 'datatype': 'float', 'root': 0, 'iters': 100, 'warmup_iters': 50, 'agg_iters': 1, 'average': 1, 'parallel_init': 0, 'check': 1, 'blocking': 0, 'cudagraph': 0, 'extra_env_vars.NCCL_TESTS_SPLIT_MASK': '0x0'}
[INFO] Starting test scenario execution.
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Submitted slurm job: 0
[INFO] Job completed: Tests.1
[WARNING] Skipping 'results/dse-nccl-test/Tests.1/0/2', can't handle with strategy=<class 'cloudai.workloads.nccl_test.report_generation_strategy.NcclTestReportGenerationStrategy'>.
[INFO] Step 2: Observation: [-1.0], Reward: -1.0

Real system testing

NCCL AG with Extra Env Args

cloudai run --system-config ../cloudaix/conf/common/system/cxxx --tests-dir conf/common/test --test-scenario conf/common/test_scenario/dse_nccl_all_gather.toml 
[INFO] System Name: xxx
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: dse-nccl-test
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: dse-nccl-test

Section Name: Tests.1
  Test Name: dse_nccl_all_gather
  Description: all_gather
  No dependencies
[INFO] Initializing Runner [RUN] mode
[INFO] Creating SlurmRunner
[INFO] Running step 0 with action {...obfuscated...,'extra_env_vars.NCCL_TESTS_SPLIT_MASK': '0x7'}
[INFO] Starting test scenario execution.
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Submitted slurm job: 1881080
[INFO] Job completed: Tests.1
[INFO] Step 1: Observation: [-1.0], Reward: -1.0
[INFO] Running step 0 with action {...obfuscated.... 'extra_env_vars.NCCL_TESTS_SPLIT_MASK': '0x0'}
[INFO] Starting test scenario execution.
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Submitted slurm job: 1881207
[INFO] Job completed: Tests.1
[INFO] Step 2: Observation: [-1.0], Reward: -1.0

Output (representative). Please ping me offline for more details.

NCCL AG with Extra Env Args and CmdArgs

cloudai run --system-config ../cloudaix/conf/common/system/xxxx.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/dse_nccl_all_gather.toml 
[INFO] System Name: xxxx
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: dse-nccl-test
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: dse-nccl-test

Section Name: Tests.1
  Test Name: dse_nccl_all_gather
  Description: all_gather
  No dependencies
[INFO] Initializing Runner [RUN] mode
[INFO] Creating SlurmRunner
[INFO] Running step 0 with action {output redacted.... 'warmup_iters': 5, 'extra_env_vars.NCCL_TESTS_SPLIT_MASK': '0x7'}
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Submitted slurm job: 1971248
[INFO] Job completed: Tests.1
[INFO] Generated scenario report at results/dse-nccl-test/dse-nccl-test.html
[INFO] Step 1: Observation: [-1.0], Reward: -1.0
[INFO] Running step 0 with action {output redacted...'warmup_iters', 'extra_env_vars.NCCL_TESTS_SPLIT_MASK': '0x0'}
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Submitted slurm job: 1971250
[INFO] Job completed: Tests.1
[INFO] Generated scenario report at results/dse-nccl-test/dse-nccl-test.html
[INFO] Step 2: Observation: [-1.0], Reward: -1.0
[INFO] Running step 0 with action {output redacted....'warmup_iters': 50, 'extra_env_vars.NCCL_TESTS_SPLIT_MASK': '0x7'}
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Submitted slurm job: 1971252
[INFO] Job completed: Tests.1
[INFO] Generated scenario report at results/dse-nccl-test/dse-nccl-test.html
[INFO] Step 3: Observation: [-1.0], Reward: -1.0
[INFO] Running step 0 with action {output redacted...'warmup_iters': 50, 'extra_env_vars.NCCL_TESTS_SPLIT_MASK': '0x0'}
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Submitted slurm job: 1971317
[INFO] Job completed: Tests.1
[INFO] Generated scenario report at results/dse-nccl-test/dse-nccl-test.html
[INFO] Step 4: Observation: [-1.0], Reward: -1.0

Results
Representative output. Ping me on Slack for more details.

Additional Notes

More unit tests and workloads (e.g., NCCL, NeMo) will be tested. UCC will be tested after syncing up with Sergey, but that can happen outside of this PR. Note that reward generation etc. depends on the objects, so a generic report will be useful. We can fix this once we understand the objective.

src/cloudai/_core/test.py

TaekyungHeo

Let's have two approvals. Please review my comments. Some are minor.

src/cloudai/_core/configurator/cloudai_gym.py

TaekyungHeo

LGTM. Let's have two approvals.

src/cloudai/_core/configurator/cloudai_gym.py

.github/workflows/ci.yml

src/cloudai/_core/configurator/grid_search.py

src/cloudai/workloads/jax_toolbox/slurm_command_gen_strategy.py

src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py

TaekyungHeo

LGTM. Let's remove pyright ignore gradually.

src/cloudai/_core/configurator/grid_search.py

src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py

TaekyungHeo

Minor comment, which is not a blocker: You replaced Dict[str, str] with Union[str, List[str]]. You might consider introducing a new type, such as DSEableParam = Union[str, List[str]], or creating a new class to improve readability and maintainability.
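
For reference, the alias proposed in this comment could look roughly like the sketch below. Only the name DSEableParam comes from the comment; EnvVars and is_dse_enabled are hypothetical names used for illustration, and none of this is code from the PR.

```python
from typing import Dict, List, Union

# Alias named as proposed in the review comment.
DSEableParam = Union[str, List[str]]

# Hypothetical alias for an env-var mapping whose values may be swept.
EnvVars = Dict[str, DSEableParam]


def is_dse_enabled(env_vars: EnvVars) -> bool:
    """Return True if at least one value is a list of candidates."""
    return any(isinstance(value, list) for value in env_vars.values())


print(is_dse_enabled({"NVTE_FUSED_ATTN": ["1", "0"]}))
print(is_dse_enabled({"NVTE_FUSED_ATTN": "1"}))
```

The alias adds no runtime behavior; it only gives the Union a single name that signatures can share.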

srivatsankrishnan · 2025-03-17T13:51:04Z

Minor comment, which is not a blocker: You replaced Dict[str, str] with Union[str, List[str]]. You might consider introducing a new type, such as DSEableParam = Union[str, List[str]], or creating a new class to improve readability and maintainability.

I don't think this is required for such a small change. This is basic Python, and there is no need for a class definition. In fact, I think it makes things worse.

support for env_vars for DSE

33f1539

srivatsankrishnan requested review from TaekyungHeo, amaslenn and srinivas212 as code owners March 11, 2025 19:39

TaekyungHeo added the enhancement label Mar 11, 2025

srivatsankrishnan commented Mar 11, 2025

src/cloudai/_core/test.py Outdated

TaekyungHeo requested changes Mar 11, 2025

src/cloudai/_core/configurator/cloudai_gym.py Outdated

src/cloudai/_core/configurator/cloudai_gym.py Outdated

srivatsankrishnan marked this pull request as draft March 11, 2025 19:51

srivatsankrishnan added 8 commits March 11, 2025 19:16

combine it into a single dictionary

79e10d6

update type, fix unit tests etc

6fe110b

Merge branch 'main' into main

efeeabb

more unit test

8785b4a

ruff fixes

a8f86e8

nccl dse test/test scenarios

cc8951c

fix copyright header year + taplo

00d5ee9

taplo

1e683cb

srivatsankrishnan requested a review from TaekyungHeo March 13, 2025 00:08

srivatsankrishnan marked this pull request as ready for review March 13, 2025 00:08

TaekyungHeo requested changes Mar 13, 2025

src/cloudai/_core/configurator/cloudai_gym.py Outdated

fix

23aa211

TaekyungHeo previously approved these changes Mar 13, 2025

amaslenn reviewed Mar 13, 2025

src/cloudai/_core/configurator/cloudai_gym.py

amaslenn reviewed Mar 13, 2025

src/cloudai/_core/configurator/cloudai_gym.py

Merge branch 'NVIDIA:main' into main

752f2f5

srivatsankrishnan mentioned this pull request Mar 13, 2025

Reporting for encoded logs #412

Merged

propate env_vars type to whole universe

2b408fd

srivatsankrishnan dismissed TaekyungHeo’s stale review via 2b408fd March 14, 2025 00:30

srivatsankrishnan added 3 commits March 13, 2025 17:39

fix

f293943

fix python3.9 vs python 3.10 feature differences

3092a1c

fix github CI to use python 3.10

13d8177

srivatsankrishnan added 2 commits March 13, 2025 18:01

ruff fixes python3.9 --> 3.10

bc7e10d

fix copyright year

a94de5a

srivatsankrishnan mentioned this pull request Mar 14, 2025

Explore Action space class #413

Closed

amaslenn reviewed Mar 14, 2025

srivatsankrishnan added 6 commits March 14, 2025 12:10

revert python version + stric checking for combinations

7e886e5

Merge branch 'main' into main

bc6d7e1

revert ruff version to use python 3.9

82691de

make ucc/nccl cmd_args dseble

659f1b9

cmd_args dse'ble

467b8bd

taplo

6595f41

srivatsankrishnan requested review from TaekyungHeo and amaslenn March 15, 2025 03:20

TaekyungHeo approved these changes Mar 15, 2025

amaslenn reviewed Mar 17, 2025

src/cloudai/_core/configurator/grid_search.py Outdated

TaekyungHeo reviewed Mar 17, 2025

src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py

TaekyungHeo requested changes Mar 17, 2025

TaekyungHeo previously approved these changes Mar 17, 2025

remove strict

bc5e8e4

srivatsankrishnan dismissed TaekyungHeo’s stale review via bc5e8e4 March 17, 2025 14:10

srivatsankrishnan requested review from TaekyungHeo and amaslenn March 17, 2025 14:11

Merge branch 'main' into main

46619a9

amaslenn approved these changes Mar 17, 2025

TaekyungHeo approved these changes Mar 17, 2025

srivatsankrishnan merged commit 10a866c into NVIDIA:main Mar 17, 2025
2 checks passed
