@TaekyungHeo
Member

@TaekyungHeo TaekyungHeo commented Jul 31, 2025

Summary

The goal of this PR is to support a custom run.sh to launch AI Dynamo.

Kapil's requirements met by this PR:

  1. Add a configurable run.sh template
  2. Pass arbitrary flags by removing: model_config = ConfigDict(extra="forbid", populate_by_name=True)
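For reference, a minimal sketch of how the `extra` setting changes what a Pydantic v2 model does with unknown keys. The class names, the `name` field, and the `my-new-flag` key are hypothetical and are not taken from the CloudAI code. Whether the real class inherits `extra="allow"` from a base class after the line is removed, or falls back to Pydantic's default of `extra="ignore"`, is not stated in this thread, so all three cases are shown.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class StrictArgs(BaseModel):
    # extra="forbid": unknown keys raise a ValidationError.
    model_config = ConfigDict(extra="forbid", populate_by_name=True)
    name: str


class DefaultArgs(BaseModel):
    # No model_config here: Pydantic's default is extra="ignore",
    # so unknown keys are silently dropped.
    name: str


class OpenArgs(BaseModel):
    # extra="allow": unknown keys are kept and can be forwarded later.
    model_config = ConfigDict(extra="allow")
    name: str


# Hypothetical payload with one flag the schema does not know about.
payload = {"name": "vllm", "my-new-flag": "1"}

try:
    StrictArgs.model_validate(payload)
    strict_rejected = False
except ValidationError:
    strict_rejected = True

default_dump = DefaultArgs.model_validate(payload).model_dump()
open_dump = OpenArgs.model_validate(payload).model_dump()

print(strict_rejected)
print(default_dump)
print(open_dump)
```

Only the `extra="allow"` variant keeps the unknown flag in the dumped data, which is what a pass-through to run.sh would need.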

RM4548255

Test Plan

  1. CI passes
  2. Run on CW.

Take https://github.com/Mellanox/cloudaix/pull/319.

$ python cloudaix.py run --system-config conf/common/system/cw.toml --tests-dir conf/staging/ai_dynamo/test --test-scenario conf/staging/ai_dynamo/test_scenario/deepseek_r1_distill_llama_8b.toml   

[INFO] System Name: Coreweave
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: deepseek_r1_distill_llama_8b
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: deepseek_r1_distill_llama_8b

Section Name: Tests.1
  Test Name: vllm
  Description: vllm
  No dependencies
[INFO] Initializing Runner [RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Submitted slurm job: 4899741
[INFO] Job completed: Tests.1 (iteration 1 of 1)
[INFO] All test scenario results stored at: results/deepseek_r1_distill_llama_8b_2025-08-08_04-21-04
[WARNING] Error generating report for 'results/deepseek_r1_distill_llama_8b_2025-08-08_04-21-04/Tests.1/0' with strategy=AIDynamoReportGenerationStrategy: could not convert string to float: '"1'
[INFO] Generated scenario report at results/deepseek_r1_distill_llama_8b_2025-08-08_04-21-04/deepseek_r1_distill_llama_8b.html
[INFO] All jobs are complete.
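The cause of the report warning above is not identified in this thread. As a sketch only, one input that reproduces the same message is a CSV line with a quoted, comma-grouped number that gets split on every comma. The CSV line below is hypothetical and is not taken from the actual results.

```python
import csv
import io

# Hypothetical CSV row; the real report input is not shown in this thread.
line = 'Request Latency,"1,234.50",ms'

# Naive splitting cuts the quoted field in half:
# ['Request Latency', '"1', '234.50"', 'ms']
naive_fields = line.split(",")
try:
    float(naive_fields[1])
    naive_error = ""
except ValueError as exc:
    naive_error = str(exc)

# csv.reader honors the quotes, so the field arrives whole as '1,234.50'.
row = next(csv.reader(io.StringIO(line)))
# Drop the thousands separator before converting.
value = float(row[1].replace(",", ""))

print(naive_error)
print(value)
```

If the report generator does parse CSV by hand, switching to the `csv` module would be one way to avoid this class of error. That is an assumption about the generator, not something confirmed here.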

https://drive.google.com/drive/folders/1e5L80zLqZUywJ0SpCgT-s-tJyid13zBf?usp=sharing

@TaekyungHeo TaekyungHeo changed the title Use custom run.sh for AI dynamo Use a shell script as the entry point for AI Dynamo Jul 31, 2025
@TaekyungHeo TaekyungHeo force-pushed the custom-run-sh branch 5 times, most recently from a6aa3d9 to 43d2fbb Compare July 31, 2025 14:08
@TaekyungHeo
Member Author

@karya0, please review and provide feedback when you have a chance.

@TaekyungHeo TaekyungHeo marked this pull request as ready for review August 6, 2025 10:30
@srivatsankrishnan
Contributor

Given that AI Dynamo changes a lot and is under active development, Kapil's script encapsulates and absorbs these changes without requiring many code changes on the CloudAI side. It provides more flexibility and value to the actual end user. We saw similar issues with Grok/Nemo and now with AI Dynamo. The common need across all of them is flexibility and fast iteration. The user is fully aware of and responsible for the flags that get passed. Once AI Dynamo stabilizes and the benchmark matures, the verification/validation aspects can perhaps be brought back.
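To illustrate the division of responsibility described here, a rough sketch of rendering a run.sh from a template and forwarding user-supplied flags without validating them. The template text, the `render_run_sh` helper, and the `my-launcher` command are hypothetical; the actual template in this PR may look quite different.

```python
import shlex
from string import Template

# Hypothetical run.sh template; not the template shipped in this PR.
RUN_SH_TEMPLATE = Template(
    "#!/bin/bash\n"
    "set -euo pipefail\n"
    "exec $launcher $flags\n"
)


def render_run_sh(launcher: str, extra_args: dict[str, object]) -> str:
    """Render the script, forwarding every extra argument as --key value."""
    argv: list[str] = []
    for key, value in extra_args.items():
        argv.extend([f"--{key}", str(value)])
    # shlex.join quotes each token so spaces and shell metacharacters
    # are passed through literally instead of being interpreted.
    return RUN_SH_TEMPLATE.substitute(
        launcher=shlex.quote(launcher),
        flags=shlex.join(argv),
    )


script = render_run_sh("my-launcher", {"model": "my model", "port": 8000})
print(script)
```

The caller decides which flags exist; the rendering code only quotes and forwards them. That keeps this side agnostic to launcher changes, at the cost of no validation.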

@srivatsankrishnan
Contributor

Re: #4554587: AI-Dynamo: Dry-Run with DSE fails

@TaekyungHeo What is the fix for this? I don't see any handler or dry-run related changes in this PR.

@TaekyungHeo
Member Author

TaekyungHeo commented Aug 7, 2025

@srivatsankrishnan, no explicit code changes were needed to fix #4554587: AI-Dynamo: Dry-Run with DSE fails.

Please find the log below. It works after merging the changes.

$ python cloudaix.py dry-run --system-config conf/common/system/cw.toml --tests-dir conf/staging/ai_dynamo/test --test-scenario conf/staging/ai_dynamo/test_scenario/deepseek_r1_distill_llama_8b.toml
[INFO] System Name: Coreweave
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: deepseek_r1_distill_llama_8b
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: deepseek_r1_distill_llama_8b

Section Name: Tests.1
  Test Name: vllm
  Description: vllm
  No dependencies
[INFO] Initializing Runner [DRY-RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Submitted slurm job: 0
[INFO] Job completed: Tests.1 (iteration 1 of 1)
[INFO] All test scenario results stored at: results/deepseek_r1_distill_llama_8b_2025-08-07_11-53-04
[INFO] All jobs are complete.

@srivatsankrishnan
Contributor

What caused this issue for Kapil then? Was it a misattribution?

@TaekyungHeo
Member Author

@srivatsankrishnan, maybe. The bug was actually reported from his fork, not the public main branch. It seems to have originated from his local changes.

@srivatsankrishnan
Contributor

We are also not testing with a DSE config, correct? I think dry-run works for non-DSE cases. Was it the DSE case that caused this issue?
https://github.com/Mellanox/cloudaix/blob/main/conf/staging/ai_dynamo/test_scenario/deepseek_r1_distill_llama_8b.toml

@TaekyungHeo
Member Author

Had a call with @srivatsankrishnan.

The following item is not satisfied by this PR and has been removed:

  1. #4554587: AI-Dynamo: Dry-Run with DSE fails

To validate this, I would have needed to run the command with a DSE-enabled test configuration.

@TaekyungHeo
Member Author

Waiting for @srivatsankrishnan's approval. I spoke with @karya0 yesterday. This is not the final version of run.sh. I will match the functionality of his run.sh by merging other PRs and creating additional ones.

@srivatsankrishnan
Contributor

A clarification on this: ideally, most of his changes should be contained within his run.sh, and we should be mostly agnostic to them. The requirement from the CloudAI side would be a standalone, working run.sh that handles the various changes in AI Dynamo. @karya0, is this understanding correct?

@TaekyungHeo TaekyungHeo merged commit 6204400 into NVIDIA:main Aug 8, 2025
2 checks passed