Conversation

TaekyungHeo (Member) commented Aug 8, 2025

Summary

The goal of this PR is to update ai_dynamo.sh to support running multiple AI Dynamo workers on a single physical node. Previously, only one worker would run per node, but there is now a need to run several workers on the same node. To enable this, the number of GPUs allocated per worker can be reduced: for example, if a node has 8 GPUs and each worker uses 4 GPUs, then two workers will be instantiated per node (see the sketch after the branch list below). This PR builds on changes made by @karya0, our primary customer. You can find @karya0's branches below:

  1. CloudAI: https://github.com/karya0/cloudaix/blob/disagg_inf_new_vllm_refactor/conf/staging/ai_dynamo/test/run.sh#L159-L174
  2. CloudAIX: https://github.com/karya0/cloudaix/blob/disagg_inf_new_vllm_refactor/conf/staging/ai_dynamo/test_scenario/dsr1_70b_3k_150.toml
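As a rough illustration of the per-node worker math this enables (a minimal sketch only; the variable names and the launch command are hypothetical, not the actual ai_dynamo.sh contents):

```bash
# Hypothetical sketch: derive the per-node worker count from the GPU allocation
# and give each worker its own slice of GPUs.
GPUS_PER_NODE=${GPUS_PER_NODE:-8}
GPUS_PER_WORKER=${GPUS_PER_WORKER:-4}
NUM_WORKERS=$(( GPUS_PER_NODE / GPUS_PER_WORKER ))   # 8 / 4 = 2 workers per node

for (( i = 0; i < NUM_WORKERS; i++ )); do
    # Assign each worker a contiguous GPU slice, e.g. 0-3 and 4-7.
    first=$(( i * GPUS_PER_WORKER ))
    last=$(( first + GPUS_PER_WORKER - 1 ))
    gpus=$(seq -s, "$first" "$last")
    echo "worker $i -> CUDA_VISIBLE_DEVICES=$gpus"
    # CUDA_VISIBLE_DEVICES="$gpus" <worker launch command> &   # placeholder
done
wait
```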

RM4554439

Related PR: https://github.com/Mellanox/cloudaix/pull/325

Test Plan

  1. CI passes
  2. Run on CW

Using https://github.com/Mellanox/cloudaix/pull/325:

```
$ python cloudaix.py run --system-config conf/common/system/cw.toml --tests-dir conf/staging/ai_dynamo/test --test-scenario conf/staging/ai_dynamo/test_scenario/dsr1_70b_3k_150.toml
[INFO] System Name: Coreweave
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: dsr1_70b_3k_150
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: dsr1_70b_3k_150

Section Name: Tests.1
  Test Name: vllm
  Description: vllm
  No dependencies
[INFO] Initializing Runner [RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Submitted slurm job: 4909218
[INFO] Job completed: Tests.1 (iteration 1 of 1)
[INFO] All test scenario results stored at: results/dsr1_70b_3k_150_2025-08-08_13-59-56
[INFO] Generated scenario report at results/dsr1_70b_3k_150_2025-08-08_13-59-56/dsr1_70b_3k_150.html
[INFO] All jobs are complete.
```

Confirmed that multiple workers are instantiated by reviewing the logs below:
https://drive.google.com/drive/folders/12syZTI5LTGOI9x_CuDMDNTnM28sDfUjX?usp=drive_link

TaekyungHeo marked this pull request as ready for review on August 8, 2025 at 21:29.
TaekyungHeo (Member Author) commented:

@karya0 & @jeffnvidia: The PR is ready. Please review.

TaekyungHeo (Member Author) commented Aug 8, 2025

@srivatsankrishnan & @amaslenn: Please review. Two approvals are needed since this is a feature.

Let's allow for some clumsiness in the shell script. This is my plan:

  1. Support the requested AI Dynamo functionalities immediately, without spending too much time cleaning up the shell script. It will be updated anyway to support the pending features.
  2. Clean up the shell script and rerun the related tests manually (Draft PR)
  3. Add support for DeepSeekR1 in the AI Dynamo shell script. For this, the shell-script cleanup in step 2 is essential.

karya0 (Contributor) commented Aug 8, 2025

I looked at the gdrive files and the logs look appropriate. There are 4 decode workers being launched with two GPUs each and TP=2. Thanks @TaekyungHeo
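As a quick illustrative check of that arithmetic (this assumes 8 GPUs per node, which these numbers imply but the comment does not state; the variables are hypothetical):

```bash
# With 8 GPUs per node and TP=2 (2 GPUs per decode worker),
# we expect 8 / 2 = 4 decode workers per node.
gpus_per_node=8
tensor_parallel=2
echo "expected decode workers per node: $(( gpus_per_node / tensor_parallel ))"
```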

srivatsankrishnan (Contributor) left a comment:

Based on Kapil's comments and Taekyung's logs, this seems to be working.

TaekyungHeo merged commit c51d190 into NVIDIA:main on Aug 11, 2025. 2 checks passed.