Conversation

TaekyungHeo (Member) commented Aug 8, 2025

Summary

The goal of this PR is to update ai_dynamo.sh to support running multiple AI Dynamo workers on a single physical node. Previously, only one worker would run per node, but there is now a need to run several workers on the same node. To enable this, the number of GPUs allocated per worker can be reduced: for example, if a node has 8 GPUs and each worker uses 4 GPUs, then two workers will be instantiated per node (see the sketch after the branch list below). This PR builds on changes made by @karya0, our primary customer. You can find @karya0's branches below:

  1. CloudAI: https://github.com/karya0/cloudaix/blob/disagg_inf_new_vllm_refactor/conf/staging/ai_dynamo/test/run.sh#L159-L174
  2. CloudAIX: https://github.com/karya0/cloudaix/blob/disagg_inf_new_vllm_refactor/conf/staging/ai_dynamo/test_scenario/dsr1_70b_3k_150.toml
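As a rough illustration of the per-node worker math this enables (a minimal sketch only; the variable names and the launch command are hypothetical, not the actual ai_dynamo.sh contents):

```bash
# Hypothetical sketch: derive the per-node worker count from the GPU allocation
# and give each worker its own slice of GPUs.
GPUS_PER_NODE=${GPUS_PER_NODE:-8}
GPUS_PER_WORKER=${GPUS_PER_WORKER:-4}
NUM_WORKERS=$(( GPUS_PER_NODE / GPUS_PER_WORKER ))   # 8 / 4 = 2 workers per node

for (( i = 0; i < NUM_WORKERS; i++ )); do
    # Assign each worker a contiguous GPU slice, e.g. 0-3 and 4-7.
    first=$(( i * GPUS_PER_WORKER ))
    last=$(( first + GPUS_PER_WORKER - 1 ))
    gpus=$(seq -s, "$first" "$last")
    echo "worker $i -> CUDA_VISIBLE_DEVICES=$gpus"
    # CUDA_VISIBLE_DEVICES="$gpus" <worker launch command> &   # placeholder
done
wait
```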

RM4554439

Related PR: https://github.com/Mellanox/cloudaix/pull/325

Test Plan

  1. CI passes
  2. Run on CW

Using https://github.com/Mellanox/cloudaix/pull/325:

```
$ python cloudaix.py run --system-config conf/common/system/cw.toml --tests-dir conf/staging/ai_dynamo/test --test-scenario conf/staging/ai_dynamo/test_scenario/dsr1_70b_3k_150.toml
[INFO] System Name: Coreweave
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: dsr1_70b_3k_150
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: dsr1_70b_3k_150

Section Name: Tests.1
  Test Name: vllm
  Description: vllm
  No dependencies
[INFO] Initializing Runner [RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Submitted slurm job: 4909218
[INFO] Job completed: Tests.1 (iteration 1 of 1)
[INFO] All test scenario results stored at: results/dsr1_70b_3k_150_2025-08-08_13-59-56
[INFO] Generated scenario report at results/dsr1_70b_3k_150_2025-08-08_13-59-56/dsr1_70b_3k_150.html
[INFO] All jobs are complete.
```

Confirmed that multiple workers are instantiated by reviewing the logs below:
https://drive.google.com/drive/folders/12syZTI5LTGOI9x_CuDMDNTnM28sDfUjX?usp=drive_link

TaekyungHeo marked this pull request as ready for review on August 8, 2025 at 21:29.
TaekyungHeo (Member Author) commented:

@karya0 & @jeffnvidia: The PR is ready. Please review.

TaekyungHeo (Member Author) commented Aug 8, 2025

@srivatsankrishnan & @amaslenn: Please review. Two approvals are needed since this is a feature.

Let's allow for some clumsiness in the shell script. This is my plan:

  1. Support the requested AI Dynamo functionalities immediately, without spending too much time cleaning up the shell script. It will be updated anyway to support the pending features.
  2. Clean up the shell script and rerun the related tests manually (Draft PR)
  3. Add support for DeepSeekR1 in the AI Dynamo shell script. For this, the shell-script cleanup in step 2 is essential.

karya0 (Contributor) commented Aug 8, 2025

I looked at the gdrive files and the logs look appropriate. There are 4 decode workers being launched with two GPUs each and TP=2. Thanks @TaekyungHeo
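As a quick illustrative check of that arithmetic (this assumes 8 GPUs per node, which these numbers imply but the comment does not state; the variables are hypothetical):

```bash
# With 8 GPUs per node and TP=2 (2 GPUs per decode worker),
# we expect 8 / 2 = 4 decode workers per node.
gpus_per_node=8
tensor_parallel=2
echo "expected decode workers per node: $(( gpus_per_node / tensor_parallel ))"
```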

srivatsankrishnan (Contributor) left a comment:

Based on Kapil's comments and Taekyung's logs, this seems to be working.

TaekyungHeo merged commit c51d190 into NVIDIA:main on Aug 11, 2025. 2 checks passed.