@srivatsankrishnan
Contributor

@srivatsankrishnan srivatsankrishnan commented Mar 11, 2025

Summary

Many workloads defined in Test TOML files also require sweeps over environment variables (extra_env_vars). Previously, only cmd_args was exposed as DSE-able. This PR expands the scope to parameters defined in the extra_env_vars field of the Test TOML.

Just as users can specify any parameter under cmd_args as a list to activate DSE, they can now also specify any parameter in extra_env_vars as a list. CloudAI will activate DSE for these parameters and add them to the agent's action space.

Example

name = "dse_nemo_run_llama3_8b"
description = "dse_nemo_run_llama3_8b"
test_template_name = "NeMoRun"

[cmd_args]
docker_image_url = "nvcr.io/nvidia/nemo:24.12.rc3"
task = "pretrain"
recipe_name = "llama3_8b"
...
...

[extra_env_vars]
NCCL_P2P_NET_CHUNKSIZE = ["2097152", "4194304"]
NVTE_FUSED_ATTN = ["1", "0"]
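
For illustration only, here is a minimal sketch of how list-valued entries like the ones above could be turned into actions. The helper names (dse_axes, enumerate_actions, apply_action) and the EXAMPLE_FIXED_VAR entry are hypothetical and are not CloudAI APIs; only the 'extra_env_vars.<NAME>' key format is taken from the dry-run logs in this PR. It says nothing about the order in which the agent actually visits these combinations.

```python
from itertools import product
from typing import Dict, Iterator, List, Union

# A parameter is either a fixed string or a list of candidate strings.
ParamValue = Union[str, List[str]]


def dse_axes(extra_env_vars: Dict[str, ParamValue]) -> Dict[str, List[str]]:
    """Keep only list-valued entries, keyed as 'extra_env_vars.<NAME>'."""
    return {
        f"extra_env_vars.{name}": value
        for name, value in extra_env_vars.items()
        if isinstance(value, list)
    }


def enumerate_actions(axes: Dict[str, List[str]]) -> Iterator[Dict[str, str]]:
    """Yield every combination of candidate values (Cartesian product)."""
    keys = list(axes)
    for combo in product(*(axes[key] for key in keys)):
        yield dict(zip(keys, combo))


def apply_action(
    extra_env_vars: Dict[str, ParamValue], action: Dict[str, str]
) -> Dict[str, str]:
    """Resolve each env var to a single string for one concrete run."""
    resolved: Dict[str, str] = {}
    for name, value in extra_env_vars.items():
        if isinstance(value, list):
            # Raises KeyError if the action does not cover a swept parameter.
            resolved[name] = action[f"extra_env_vars.{name}"]
        else:
            resolved[name] = value
    return resolved


# Values copied from the example TOML; EXAMPLE_FIXED_VAR is a made-up
# scalar entry that should pass through unchanged.
env = {
    "NCCL_P2P_NET_CHUNKSIZE": ["2097152", "4194304"],
    "NVTE_FUSED_ATTN": ["1", "0"],
    "EXAMPLE_FIXED_VAR": "x",
}
actions = list(enumerate_actions(dse_axes(env)))
print(len(actions))
print(apply_action(env, actions[0]))
```

With two candidates for each of the two swept variables, the sketch produces four distinct actions; scalar entries are left untouched.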

Note:

  • Global env vars are defined per System TOML in CloudAI, so this PR focuses on extra_env_vars, which users control in the Test TOML. If there is a requirement for the System TOML, we should be able to figure something out too.
  • cmd_args is a proper object, whereas extra_env_vars was defined as a dict from the get-go. Let's revisit this and see whether it makes sense to move towards an object, modulo user-experience requirements. This is largely an internal change and can be planned or done in phases.

Auxiliary inconsistency fixes

It looks like most local development uses Python 3.10, while the GitHub CI uses Python 3.9. One corner case is that local checks pass but the upstream CI fails.

An example is the failed CI test caused by a Python 3.10 feature.

In Python 3.10, the first form below is a valid alternative to Union from typing. Although this PR now uses Union (the second form) for backward compatibility, in general it would save a lot of time to move the CI flow to 3.10, since that is what we use in the private repo as well as on most clusters.

Dict[str, str | List[str]]
Dict[str, Union[str, List[str]]]

In 3.9, the first form results in the following error:

src/cloudai/workloads/jax_toolbox/slurm_command_gen_strategy.py:82: in JaxToolboxSlurmCommandGenStrategy
    self, env_vars: Dict[str, str | List[str]], cmd_args: Dict[str, Any], num_nodes: int
E   TypeError: unsupported operand type(s) for |: 'type' and '_GenericAlias'
Error: Process completed with exit code 4.
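
A minimal sketch of the version difference, using only the standard library. The names EnvVars and pep604_union_supported are hypothetical and are not part of CloudAI; the point is just that the Union spelling works on both interpreters, while the | spelling is evaluated when the def statement runs and fails on 3.9.

```python
import sys
from typing import Dict, List, Union

# Portable spelling: valid on Python 3.9 and on 3.10+.
EnvVars = Dict[str, Union[str, List[str]]]


def pep604_union_supported() -> bool:
    """Return True if `X | Y` between a class and a typing alias works here."""
    try:
        # On 3.9 this raises:
        #   TypeError: unsupported operand type(s) for |: 'type' and '_GenericAlias'
        str | List[str]
    except TypeError:
        return False
    return True


print(sys.version_info[:2], pep604_union_supported())
```

Adding `from __future__ import annotations` at the top of a module defers evaluation of annotations, which avoids this failure at definition time on 3.9, but anything that resolves the annotations at runtime (for example typing.get_type_hints) would still hit the same error there. Sticking with Union is the simpler fix while CI stays on 3.9.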

@amaslenn FYI

Discussion on March 14th (with @amaslenn @TaekyungHeo): Some older clusters might need 3.9. But cloudaix already uses 3.10, so it is not clear why we can't upgrade, given that cloudaix also needs to run on older clusters.

This needs a broader, separate discussion, as agreed.

Test Plan

  • CI/CD
  • Dry-run

Nemo2.0 Llama3-8b Model

cloudai dry-run --system-config ../cloudaix/conf/common/system/xxxx.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/dse_nemo_run_llama3_8b.toml 
[INFO] System Name: xxxx
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: nemo_run_llama3_8b
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: nemo_run_llama3_8b

Section Name: dse_nemo_run_llama3_8b_1
  Test Name: dse_nemo_run_llama3_8b
  Description: dse_nemo_run_llama3_8b
  No dependencies
[INFO] Initializing Runner [DRY-RUN] mode
[INFO] Creating SlurmRunner
[INFO] Running step 0 with action {'extra_env_vars.NCCL_P2P_NET_CHUNKSIZE': '2097152', 'extra_env_vars.NVTE_FUSED_ATTN': '1', ...obfuscated...}
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: 0
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[INFO] Step 1: Observation: [34.15068181818182], Reward: 0.029281992240272052
[INFO] Running step 0 with action {'extra_env_vars.NCCL_P2P_NET_CHUNKSIZE': '2097152', 'extra_env_vars.NVTE_FUSED_ATTN': '1', ...obfuscated...}
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: 0
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[WARNING] Skipping 'results/nemo_run_llama3_8b/dse_nemo_run_llama3_8b_1/0/2', can't handle with strategy=<class 'cloudai.workloads.nemo_run.report_generation_strategy.NeMoRunReportGenerationStrategy'>.
[INFO] Step 2: Observation: [-1.0], Reward: -1.0
[INFO] Running step 0 with action {'extra_env_vars.NCCL_P2P_NET_CHUNKSIZE': '4194304', 'extra_env_vars.NVTE_FUSED_ATTN': '1', ...obfuscated...}
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: 0
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[WARNING] Skipping 'results/nemo_run_llama3_8b/dse_nemo_run_llama3_8b_1/0/3', can't handle with strategy=<class 'cloudai.workloads.nemo_run.report_generation_strategy.NeMoRunReportGenerationStrategy'>.
[INFO] Step 3: Observation: [-1.0], Reward: -1.0
[INFO] Running step 0 with action {'extra_env_vars.NCCL_P2P_NET_CHUNKSIZE': '4194304', 'extra_env_vars.NVTE_FUSED_ATTN': '0', ...obfuscated...}
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: 0
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[WARNING] Skipping 'results/nemo_run_llama3_8b/dse_nemo_run_llama3_8b_1/0/4', can't handle with strategy=<class 'cloudai.workloads.nemo_run.report_generation_strategy.NeMoRunReportGenerationStrategy'>.
[INFO] Step 4: Observation: [-1.0], Reward: -1.0
[INFO] Running step 0 with action {'extra_env_vars.NCCL_P2P_NET_CHUNKSIZE': '2097152', 'extra_env_vars.NVTE_FUSED_ATTN': '1', ...obfuscated...}
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: 0
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[WARNING] Skipping 'results/nemo_run_llama3_8b/dse_nemo_run_llama3_8b_1/0/5', can't handle with strategy=<class 'cloudai.workloads.nemo_run.report_generation_strategy.NeMoRunReportGenerationStrategy'>.
[INFO] Step 5: Observation: [-1.0], Reward: -1.0
[INFO] Running step 0 with action {'extra_env_vars.NCCL_P2P_NET_CHUNKSIZE': '2097152', 'extra_env_vars.NVTE_FUSED_ATTN': '0', ...obfuscated...}
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: 0
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[WARNING] Skipping 'results/nemo_run_llama3_8b/dse_nemo_run_llama3_8b_1/0/6', can't handle with strategy=<class 'cloudai.workloads.nemo_run.report_generation_strategy.NeMoRunReportGenerationStrategy'>.
[INFO] Step 6: Observation: [-1.0], Reward: -1.0
[INFO] Running step 0 with action {'extra_env_vars.NCCL_P2P_NET_CHUNKSIZE': '4194304', 'extra_env_vars.NVTE_FUSED_ATTN': '1', ...obfuscated...}
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: 0
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[WARNING] Skipping 'results/nemo_run_llama3_8b/dse_nemo_run_llama3_8b_1/0/7', can't handle with strategy=<class 'cloudai.workloads.nemo_run.report_generation_strategy.NeMoRunReportGenerationStrategy'>.
[INFO] Step 7: Observation: [-1.0], Reward: -1.0
[INFO] Running step 0 with action {'extra_env_vars.NCCL_P2P_NET_CHUNKSIZE': '4194304', 'extra_env_vars.NVTE_FUSED_ATTN': '0', ...obfuscated...}
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: 0
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[WARNING] Skipping 'results/nemo_run_llama3_8b/dse_nemo_run_llama3_8b_1/0/8', can't handle with strategy=<class 'cloudai.workloads.nemo_run.report_generation_strategy.NeMoRunReportGenerationStrategy'>.
[INFO] Step 8: Observation: [-1.0], Reward: -1.0
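
A side note on the numbers in the log above: the only non-sentinel pair (observation 34.15068181818182, reward 0.029281992240272052) is numerically consistent with the reward being the reciprocal of the observation, with -1.0 passed through when no report could be parsed. The sketch below encodes that reading. It is an inference from this one log, not a statement about how CloudAI computes rewards, and reward_from_observation is a hypothetical name.

```python
import math

# Sentinel seen in the log when the report strategy cannot handle a run.
FAILED = -1.0


def reward_from_observation(observation: float) -> float:
    """Assumed mapping: reciprocal for valid observations, sentinel otherwise."""
    if observation <= 0:
        return FAILED
    return 1.0 / observation


# Compare against the Step 1 values printed in the dry-run log.
logged_observation = 34.15068181818182
logged_reward = 0.029281992240272052
print(math.isclose(reward_from_observation(logged_observation), logged_reward, rel_tol=1e-6))
```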

NCCL Test All-Gather

$ cloudai dry-run --system-config ../cloudaix/conf/common/system/xxxx.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/dse_nccl_all_gather.toml 
[INFO] System Name: xxx
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: dse-nccl-test
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: dse-nccl-test

Section Name: Tests.1
  Test Name: dse_nccl_all_gather
  Description: all_gather
  No dependencies
[INFO] Initializing Runner [DRY-RUN] mode
[INFO] Creating SlurmRunner
[INFO] Running step 0 with action {'docker_image_url': 'nvcr.io/nvidia/pytorch:24.02-py3', 'subtest_name': 'all_gather_perf_mpi', 'nthreads': 1, 'ngpus': 1, 'minbytes': '128', 'maxbytes': '4G', 'stepbytes': '1M', 'op': 'sum', 'datatype': 'float', 'root': 0, 'iters': 100, 'warmup_iters': 50, 'agg_iters': 1, 'average': 1, 'parallel_init': 0, 'check': 1, 'blocking': 0, 'cudagraph': 0, 'extra_env_vars.NCCL_TESTS_SPLIT_MASK': '0x7'}
[INFO] Starting test scenario execution.
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Submitted slurm job: 0
[INFO] Job completed: Tests.1
[WARNING] Skipping 'results/dse-nccl-test/Tests.1/0/1', can't handle with strategy=<class 'cloudai.workloads.nccl_test.report_generation_strategy.NcclTestReportGenerationStrategy'>.
[INFO] Step 1: Observation: [-1.0], Reward: -1.0
[INFO] Running step 0 with action {'docker_image_url': 'nvcr.io/nvidia/pytorch:24.02-py3', 'subtest_name': 'all_gather_perf_mpi', 'nthreads': 1, 'ngpus': 1, 'minbytes': '128', 'maxbytes': '4G', 'stepbytes': '1M', 'op': 'sum', 'datatype': 'float', 'root': 0, 'iters': 100, 'warmup_iters': 50, 'agg_iters': 1, 'average': 1, 'parallel_init': 0, 'check': 1, 'blocking': 0, 'cudagraph': 0, 'extra_env_vars.NCCL_TESTS_SPLIT_MASK': '0x0'}
[INFO] Starting test scenario execution.
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Submitted slurm job: 0
[INFO] Job completed: Tests.1
[WARNING] Skipping 'results/dse-nccl-test/Tests.1/0/2', can't handle with strategy=<class 'cloudai.workloads.nccl_test.report_generation_strategy.NcclTestReportGenerationStrategy'>.
[INFO] Step 2: Observation: [-1.0], Reward: -1.0
  • Real system testing

NCCL AG with Extra Env Args

cloudai run --system-config ../cloudaix/conf/common/system/cxxx --tests-dir conf/common/test --test-scenario conf/common/test_scenario/dse_nccl_all_gather.toml 
[INFO] System Name: xxx
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: dse-nccl-test
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: dse-nccl-test

Section Name: Tests.1
  Test Name: dse_nccl_all_gather
  Description: all_gather
  No dependencies
[INFO] Initializing Runner [RUN] mode
[INFO] Creating SlurmRunner
[INFO] Running step 0 with action {...obfuscated...,'extra_env_vars.NCCL_TESTS_SPLIT_MASK': '0x7'}
[INFO] Starting test scenario execution.
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Submitted slurm job: 1881080
[INFO] Job completed: Tests.1
[INFO] Step 1: Observation: [-1.0], Reward: -1.0
[INFO] Running step 0 with action {...obfuscated.... 'extra_env_vars.NCCL_TESTS_SPLIT_MASK': '0x0'}
[INFO] Starting test scenario execution.
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Submitted slurm job: 1881207
[INFO] Job completed: Tests.1
[INFO] Step 2: Observation: [-1.0], Reward: -1.0

Output (representative). Please ping me offline for more details.

image

NCCL AG with Extra Env Args and CmdArgs

cloudai run --system-config ../cloudaix/conf/common/system/xxxx.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/dse_nccl_all_gather.toml 
[INFO] System Name: xxxx
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: dse-nccl-test
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: dse-nccl-test

Section Name: Tests.1
  Test Name: dse_nccl_all_gather
  Description: all_gather
  No dependencies
[INFO] Initializing Runner [RUN] mode
[INFO] Creating SlurmRunner
[INFO] Running step 0 with action {output redacted.... 'warmup_iters': 5, 'extra_env_vars.NCCL_TESTS_SPLIT_MASK': '0x7'}
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Submitted slurm job: 1971248
[INFO] Job completed: Tests.1
[INFO] Generated scenario report at results/dse-nccl-test/dse-nccl-test.html
[INFO] Step 1: Observation: [-1.0], Reward: -1.0
[INFO] Running step 0 with action {output redacted...'warmup_iters', 'extra_env_vars.NCCL_TESTS_SPLIT_MASK': '0x0'}
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Submitted slurm job: 1971250
[INFO] Job completed: Tests.1
[INFO] Generated scenario report at results/dse-nccl-test/dse-nccl-test.html
[INFO] Step 2: Observation: [-1.0], Reward: -1.0
[INFO] Running step 0 with action {output redacted....'warmup_iters': 50, 'extra_env_vars.NCCL_TESTS_SPLIT_MASK': '0x7'}
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Submitted slurm job: 1971252
[INFO] Job completed: Tests.1
[INFO] Generated scenario report at results/dse-nccl-test/dse-nccl-test.html
[INFO] Step 3: Observation: [-1.0], Reward: -1.0
[INFO] Running step 0 with action {output redacted...'warmup_iters': 50, 'extra_env_vars.NCCL_TESTS_SPLIT_MASK': '0x0'}
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Submitted slurm job: 1971317
[INFO] Job completed: Tests.1
[INFO] Generated scenario report at results/dse-nccl-test/dse-nccl-test.html
[INFO] Step 4: Observation: [-1.0], Reward: -1.0

Results
Representative output. Ping me on Slack for more details.

image

Additional Notes

More unit tests and workloads (e.g., NCCL, NeMo) will be tested. UCC will be tested after a sync-up with Sergey, but that can happen outside of this PR. Note that reward generation etc. depends upon the objects, so a generic report will be useful. We can fix this once we understand the objective.

@TaekyungHeo TaekyungHeo added the enhancement New feature or request label Mar 11, 2025
Member

@TaekyungHeo TaekyungHeo left a comment

Let's have two approvals. Please review my comments. Some are minor.

@srivatsankrishnan srivatsankrishnan marked this pull request as draft March 11, 2025 19:51
@srivatsankrishnan srivatsankrishnan marked this pull request as ready for review March 13, 2025 00:08
TaekyungHeo previously approved these changes Mar 13, 2025
Member

@TaekyungHeo TaekyungHeo left a comment

LGTM. Let's have two approvals.

Member

@TaekyungHeo TaekyungHeo left a comment

LGTM. Let's remove pyright ignore gradually.

Member

@TaekyungHeo TaekyungHeo left a comment

  • Minor comment, which is not a blocker: You replaced Dict[str, str] with Union[str, List[str]]. You might consider introducing a new type, such as DSEableParam = Union[str, List[str]], or creating a new class to improve readability and maintainability.
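
For reference, a sketch of what the suggested alias might look like. DSEableParam is the name proposed in the comment above; EnvVars and is_swept are hypothetical names added here for illustration, and none of this is code from the PR.

```python
from typing import Dict, List, Union

# Alias proposed in the review: a value is either fixed or a list of candidates.
DSEableParam = Union[str, List[str]]

# Hypothetical companion alias for the extra_env_vars mapping.
EnvVars = Dict[str, DSEableParam]


def is_swept(value: DSEableParam) -> bool:
    """A list means the parameter takes part in the DSE sweep."""
    return isinstance(value, list)


example: EnvVars = {"NVTE_FUSED_ATTN": ["1", "0"], "NCCL_TESTS_SPLIT_MASK": "0x7"}
print({name: is_swept(value) for name, value in example.items()})
```

The alias changes nothing at runtime; it only gives the Union a name so that signatures read Dict[str, DSEableParam] instead of repeating the nested type.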

TaekyungHeo previously approved these changes Mar 17, 2025
@srivatsankrishnan
Contributor Author

  • Minor comment, which is not a blocker: You replaced Dict[str, str] with Union[str, List[str]]. You might consider introducing a new type, such as DSEableParam = Union[str, List[str]], or creating a new class to improve readability and maintainability.

I don't think this is required for such a small change. This is basic Python, and there is no need for a class definition. In fact, I think it makes things worse.

@srivatsankrishnan srivatsankrishnan merged commit 10a866c into NVIDIA:main Mar 17, 2025
2 checks passed
3 participants