Use a shell script as the entry point for AI Dynamo #615

TaekyungHeo · 2025-07-31T03:36:21Z

Summary

The goal of this PR is to support a custom run.sh to launch AI Dynamo.

This PR meets the following requirements from Kapil:

Add a configurable run.sh template
Pass arbitrary flags by removing: model_config = ConfigDict(extra="forbid", populate_by_name=True)
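
As a rough illustration of what passing arbitrary flags through to run.sh could look like, the sketch below renders a free-form dictionary of options as command-line arguments without checking them against a schema. The helper name `to_cli_args`, the example flag names, and the `./run.sh` path are hypothetical and are not taken from this PR.

```python
import shlex


def to_cli_args(flags: dict[str, object]) -> list[str]:
    """Render a free-form mapping of options as ``--key value`` arguments.

    Hypothetical helper for illustration; not the code added by this PR.
    """
    args: list[str] = []
    for key, value in flags.items():
        name = "--" + key.replace("_", "-")
        if isinstance(value, bool):
            # Booleans become bare switches; False values are omitted.
            if value:
                args.append(name)
        elif value is not None:
            args.extend([name, str(value)])
    return args


# Example flag names are made up; nothing here is validated against a schema.
flags = {"model": "deepseek-r1", "tensor_parallel_size": 2, "enable_foo": True, "skip": False}
command = shlex.join(["./run.sh", *to_cli_args(flags)])
print(command)
```

With no validation step, a misspelled flag is passed to the script unchanged, which is the trade-off of this approach: the user owns the correctness of the flags.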

RM4548255

Test Plan

CI passes
Run on CW (CoreWeave) using the changes from https://github.com/Mellanox/cloudaix/pull/319.

$ python cloudaix.py run --system-config conf/common/system/cw.toml --tests-dir conf/staging/ai_dynamo/test --test-scenario conf/staging/ai_dynamo/test_scenario/deepseek_r1_distill_llama_8b.toml   

[INFO] System Name: Coreweave
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: deepseek_r1_distill_llama_8b
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: deepseek_r1_distill_llama_8b

Section Name: Tests.1
  Test Name: vllm
  Description: vllm
  No dependencies
[INFO] Initializing Runner [RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Submitted slurm job: 4899741
[INFO] Job completed: Tests.1 (iteration 1 of 1)
[INFO] All test scenario results stored at: results/deepseek_r1_distill_llama_8b_2025-08-08_04-21-04
[WARNING] Error generating report for 'results/deepseek_r1_distill_llama_8b_2025-08-08_04-21-04/Tests.1/0' with strategy=AIDynamoReportGenerationStrategy: could not convert string to float: '"1'
[INFO] Generated scenario report at results/deepseek_r1_distill_llama_8b_2025-08-08_04-21-04/deepseek_r1_distill_llama_8b.html
[INFO] All jobs are complete.

https://drive.google.com/drive/folders/1e5L80zLqZUywJ0SpCgT-s-tJyid13zBf?usp=sharing
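
The report-generation warning in the log above (`could not convert string to float: '"1'`) is the error Python raises when a value with a leading quote is passed to `float`. One possible cause, assumed here and not confirmed by the log, is a quoted CSV field containing a comma that was split naively. The sketch below uses a made-up row and only the standard `csv` module.

```python
import csv
import io

# Made-up row for illustration: the second field is quoted and contains a comma.
line = 'latency_ms,"1,234.5"'

# Naive splitting breaks the quoted field apart, leaving a stray quote.
naive_fields = line.split(",")
try:
    float(naive_fields[1])
    naive_error = None
except ValueError as exc:
    naive_error = str(exc)
print(naive_error)

# csv.reader honours the quoting, so the field survives intact.
row = next(csv.reader(io.StringIO(line)))
value = float(row[1].replace(",", ""))
print(value)
```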

TaekyungHeo · 2025-07-31T14:12:07Z

@karya0, please review and provide feedback when you have a chance.

src/cloudai/workloads/ai_dynamo/ai_dynamo.sh

src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py

srivatsankrishnan · 2025-08-07T18:48:03Z

Given that AI Dynamo changes a lot and is under active development, Kapil's script encapsulates and absorbs these changes without requiring many code changes on the CloudAI side. It provides more flexibility and value to the end user. We saw similar issues with Grok/Nemo and now with AI Dynamo. The common need across all of them is flexibility and fast iteration. The user is fully aware of and responsible for the flags that get passed. Maybe once AI Dynamo stabilizes and the benchmark matures, the verification/validation aspects can be brought back.

srivatsankrishnan · 2025-08-07T18:51:35Z

Re: #4554587: AI-Dynamo: Dry-Run with DSE fails

@TaekyungHeo What is the fix for this? I don't see any handler or dry-run related changes in this PR.

TaekyungHeo · 2025-08-07T18:53:44Z

@srivatsankrishnan, explicit code changes were not needed to fix #4554587 (AI-Dynamo: Dry-Run with DSE fails).

Please find the log below. It works after merging the changes.

$ python cloudaix.py dry-run --system-config conf/common/system/cw.toml --tests-dir conf/staging/ai_dynamo/test --test-scenario conf/staging/ai_dynamo/test_scenario/deepseek_r1_distill_llama_8b.toml
[INFO] System Name: Coreweave
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: deepseek_r1_distill_llama_8b
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: deepseek_r1_distill_llama_8b

Section Name: Tests.1
  Test Name: vllm
  Description: vllm
  No dependencies
[INFO] Initializing Runner [DRY-RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Submitted slurm job: 0
[INFO] Job completed: Tests.1 (iteration 1 of 1)
[INFO] All test scenario results stored at: results/deepseek_r1_distill_llama_8b_2025-08-07_11-53-04
[INFO] All jobs are complete.

srivatsankrishnan · 2025-08-07T18:56:36Z

What caused this issue for Kapil, then? Was it a misattribution?

TaekyungHeo · 2025-08-07T18:59:07Z

@srivatsankrishnan, maybe. The bug was actually reported from his fork, not the public main branch. It seems to have originated from his local changes.

srivatsankrishnan · 2025-08-07T19:00:56Z

We are also not testing with a DSE config, correct? I think dry-run works for non-DSE cases. Wasn't it the DSE case that caused this issue?
https://github.com/Mellanox/cloudaix/blob/main/conf/staging/ai_dynamo/test_scenario/deepseek_r1_distill_llama_8b.toml

TaekyungHeo · 2025-08-07T19:11:48Z

Had a call with @srivatsankrishnan.

The following item is not satisfied by this PR and has been removed from the summary:

#4554587: AI-Dynamo: Dry-Run with DSE fails

To validate it, I needed to run the command with a DSE-enabled test configuration.

TaekyungHeo · 2025-08-08T09:00:08Z

Waiting for @srivatsankrishnan's approval. I spoke with @karya0 yesterday. This is not the final version of run.sh. I will match the functionality of his run.sh by merging other PRs and creating additional ones.

srivatsankrishnan · 2025-08-08T15:05:09Z

A clarification on this: ideally, most of his changes should be contained within his run.sh, and we should mostly be agnostic to them. The requirement from the CloudAI side would be a standalone, working run.sh that keeps up with changes in AI Dynamo. @karya0, is this understanding correct?

TaekyungHeo added the feature label Jul 31, 2025

TaekyungHeo changed the title ~~Use custom run.sh for AI dynamo~~ Use a shell script as the entry point for AI Dynamo Jul 31, 2025

TaekyungHeo force-pushed the custom-run-sh branch 5 times, most recently from a6aa3d9 to 43d2fbb Compare July 31, 2025 14:08

TaekyungHeo mentioned this pull request Aug 1, 2025

Handle multi-section CSV format in AI Dynamo report generation #620

Merged

TaekyungHeo added 2 commits August 6, 2025 06:27

Use a shell script as the entry point for AI Dynamo

f96ba8f

Fix AI Dynamo command args parsing issue

9f1b97c

TaekyungHeo force-pushed the custom-run-sh branch from b5a2eaa to 9f1b97c Compare August 6, 2025 10:27

TaekyungHeo marked this pull request as ready for review August 6, 2025 10:30

TaekyungHeo requested review from amaslenn, srinivas212 and srivatsankrishnan as code owners August 6, 2025 10:30

TaekyungHeo added 2 commits August 7, 2025 05:23

Merge branch 'main' into custom-run-sh

a0eb5cd

Merge branch 'main' into custom-run-sh

ebe3533

amaslenn reviewed Aug 7, 2025

src/cloudai/workloads/ai_dynamo/ai_dynamo.sh (resolved)

src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py (outdated, resolved)

TaekyungHeo added 3 commits August 7, 2025 09:22

Reflect Andrei's comment

757a8c1

Reflect Andrei's comment

ebea76d

Reflect Andrei's comment

990c735

Merge branch 'main' into custom-run-sh

ca1d622

TaekyungHeo added 2 commits August 7, 2025 15:22

Merge branch 'main' into custom-run-sh

de2c77d

Merge branch 'main' into custom-run-sh

39eba96

amaslenn approved these changes Aug 8, 2025

srivatsankrishnan approved these changes Aug 8, 2025

TaekyungHeo merged commit 6204400 into NVIDIA:main Aug 8, 2025
2 checks passed
