test: add vLLM deployment tests for checkpoint robustness#1656

Merged
akoumpa merged 15 commits into main from adil-a/checkpoint-vllm-deploy
Apr 9, 2026

adil-a (Collaborator) commented Apr 2, 2026

Summary

  • vLLM deployment verification tests that load consolidated checkpoints and compare greedy output token-for-token against HuggingFace
  • Supports both full token comparison and smoke test mode (--vllm_smoke_test)
  • 14 model configs covering Llama, Qwen, Nemotron, Phi-4, Gemma, Baichuan, Mistral, GPT-OSS
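
The token-for-token comparison described above can be sketched in plain Python. This is a minimal illustration, not the test code from this PR: the helper names `first_mismatch` and `check_deployment` are hypothetical, and the smoke-test behavior shown (only requiring that vLLM returns some tokens) is an assumption about what `--vllm_smoke_test` does. In a real run the two token lists would presumably come from greedy decoding on each side, for example HuggingFace `generate(..., do_sample=False)` and vLLM with `SamplingParams(temperature=0)`.

```python
from typing import Optional, Sequence


def first_mismatch(reference: Sequence[int], candidate: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if both sequences match."""
    for i, (ref_tok, cand_tok) in enumerate(zip(reference, candidate)):
        if ref_tok != cand_tok:
            return i
    # One sequence is a strict prefix of the other: they diverge at the shorter length.
    if len(reference) != len(candidate):
        return min(len(reference), len(candidate))
    return None


def check_deployment(
    hf_tokens: Sequence[int],
    vllm_tokens: Sequence[int],
    smoke_test: bool = False,
) -> None:
    """Hypothetical check: raise AssertionError if the vLLM output is unacceptable."""
    if len(vllm_tokens) == 0:
        raise AssertionError("vLLM produced no tokens")
    if smoke_test:
        # Assumed smoke-test semantics: any non-empty output counts as a pass.
        return
    idx = first_mismatch(hf_tokens, vllm_tokens)
    if idx is not None:
        raise AssertionError(f"greedy outputs diverge at token index {idx}")


if __name__ == "__main__":
    check_deployment([11, 22, 33], [11, 22, 33])                 # identical outputs pass
    check_deployment([11, 22, 33], [99, 98], smoke_test=True)    # smoke mode skips comparison
    print(first_mismatch([11, 22, 33], [11, 99, 33]))
```

Reporting the first diverging index, rather than only pass or fail, makes it easier to tell a checkpoint that is wrong from the first token apart from one that drifts late in the sequence.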

Test plan

🤖 Generated with Claude Code

vLLM deployment verification tests that load consolidated checkpoints
and compare greedy output token-for-token against HuggingFace.
Supports both full comparison and smoke test mode.

Depends on checkpoint robustness PR #1606.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
copy-pr-bot (bot) commented Apr 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
@thomasdhc thomasdhc requested a review from a team as a code owner April 3, 2026 19:32
thomasdhc (Contributor) commented:

@adil-a I refactored this along the lines of what we have for checkpoint robustness; have a look when you get a chance. Ministral-3-3B-Instruct-2512, Llama-3.1-Nemotron-Nano-8B-v1, and Qwen3-30B-A3B aren't recipes we have in examples. Are these needed?

TEST_FOLDER = "checkpoint_robustness"


class TestVLLMDeploy:

I've updated the checkpoints to be the ones from the finetune job. Should they come from the robustness job instead?

After we make that update I think we can delete this file

akoumpa (Contributor) commented Apr 6, 2026

@adil-a can we merge this?

thomasdhc (Contributor) commented:

@akoumpa Not yet; this pathway has not been tested.

@thomasdhc thomasdhc added the r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge. label Apr 7, 2026
akoumpa and others added 11 commits April 6, 2026 23:38
Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
thomasdhc (Contributor) commented:

/ok to test 26054f2

@akoumpa akoumpa merged commit 3a3f685 into main Apr 9, 2026
56 checks passed
@akoumpa akoumpa deleted the adil-a/checkpoint-vllm-deploy branch April 9, 2026 07:03
svcnvidia-nemo-ci pushed a commit that referenced this pull request Apr 9, 2026
* test: add vLLM deployment tests for checkpoint robustness

vLLM deployment verification tests that load consolidated checkpoints
and compare greedy output token-for-token against HuggingFace.
Supports both full comparison and smoke test mode.

Depends on checkpoint robustness PR #1606.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* Create deploy-test dependency group

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>

* Revert deploy test group

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>

* Move configs to recipes and create vllm_launcher

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>

* Setup deploy environment

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>

* Remove duplicate keys

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>

* Add scope to vllm deploy test

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>

* Drop needs dependency

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>

* Use finetune test name for ckpt dir

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>

* Make ckpt checking more robust

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>

* Pass arguments correctly

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>

* Update arguments

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>

* Remove unused file

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>

---------

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
akoumpa added a commit that referenced this pull request Apr 9, 2026
test: add vLLM deployment tests for checkpoint robustness (#1656)

* test: add vLLM deployment tests for checkpoint robustness

vLLM deployment verification tests that load consolidated checkpoints
and compare greedy output token-for-token against HuggingFace.
Supports both full comparison and smoke test mode.

Depends on checkpoint robustness PR #1606.

* Create deploy-test dependency group

* Revert deploy test group

* Move configs to recipes and create vllm_launcher

* Setup deploy environment

* Remove duplicate keys

* Add scope to vllm deploy test

* Drop needs dependency

* Use finetune test name for ckpt dir

* Make ckpt checking more robust

* Pass arguments correctly

* Update arguments

* Remove unused file

---------

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Adil <47084919+adil-a@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
akoumpa added a commit that referenced this pull request Apr 10, 2026
Labels

r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.

3 participants