test: add checkpoint robustness functional tests #1606
Merged
Conversation
Add end-to-end checkpoint robustness tests that verify checkpoint save/load round-trips produce bitwise-identical logits. Tests cover both SFT and PEFT workflows:
- Phase 1: Train for N steps and save checkpoint
- Phase 2: Capture reference logits
- Phase 3: Reload automodel from consolidated checkpoint (SFT) or auto-resume from checkpoint dir (PEFT), assert zero KL divergence
- Phase 4: Load into vanilla HF, assert KL within relaxed threshold (accounts for kernel/attention implementation differences)

Also adds a vLLM deployment smoke test that verifies greedy decoding matches between HF and vLLM for consolidated checkpoints.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
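The per-position KL comparison the phases rely on can be sketched in plain Python. This is illustrative only: the actual harness presumably computes KL over full logit tensors in torch, and the function name here is an assumption.

```python
import math

def kl_from_logits(p_logits, q_logits):
    """KL(P || Q) for one token position, computed from raw logits.

    Pure-Python sketch of the per-position check; illustrative, not the
    harness's actual implementation.
    """
    def log_softmax(logits):
        # Subtract the max for numerical stability before the log-sum-exp.
        m = max(logits)
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        return [x - lse for x in logits]

    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

# Bitwise-identical logits give exactly zero KL (the Phase 3 expectation);
# Phase 4 only requires the value to stay under a relaxed threshold.
```

Note that KL computed this way is shift-invariant in the logits, so it compares the induced distributions rather than raw activations.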
Add Phase 5 that reloads consolidated checkpoint with a different TP size (e.g., train at TP=1, reload at TP=2). Exercises FSDP2 DTensor resharding and QKV interleaving under different sharding layouts. Opt-in via --cross_tp_size <int> with separate --cross_tp_kl_threshold (default 5e-3) since TP resharding introduces forward pass numerical differences similar to the HF comparison. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
- Add GPT-OSS 20B SFT and PEFT checkpoint robustness shell scripts with hf_kl_threshold=5e-2 (higher for MoE due to expert routing numerical divergence from RoPE precision and attention kernel diffs)
- Add vLLM PEFT support via native LoRA (enable_lora + LoRARequest)
- Add --vllm_smoke_test mode for models where model_impl="transformers" is unavailable (e.g., MoE with transformers<5.0): loads model into vLLM native backend and verifies non-empty output without HF comparison
- Add vLLM step to Llama PEFT shell script
- Handle models returning raw tensors instead of CausalLMOutput in _get_logits

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Add LoRA support to vLLM smoke test path (enable_lora + LoRARequest). Fix GPT-OSS model name to openai/gpt-oss-20b in PEFT script and add vLLM deployment step. Update hf_kl_threshold to 5e-2 for MoE. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Update all checkpoint robustness shell scripts to use 8 GPUs (CUDA_VISIBLE_DEVICES=0-7, nproc_per_node=8). Add cross-TP test (--cross_tp_size 2) to Llama SFT script. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Combine separate SFT/PEFT scripts into one per model. Add dedicated vLLM deployment scripts that reuse checkpoints from robustness runs.

Shell scripts:
- L2_Checkpoint_Robustness_Llama3_2_3B.sh (SFT + cross-TP + PEFT)
- L2_Checkpoint_Robustness_GPT_OSS_20B.sh (SFT + PEFT, ep_size=8)
- L2_vLLM_Deploy_Llama3_2_3B.sh (SFT greedy + PEFT LoRA)
- L2_vLLM_Deploy_GPT_OSS_20B.sh (SFT smoke + PEFT LoRA smoke)

All scripts use 8 GPUs, hardcoded /adasif/checkpoints/ paths, and a LATEST symlink for step dir resolution. vLLM scripts must run in an environment with vllm installed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Add SFT and PEFT checkpoint robustness tests for Nemotron Nano V3 (hybrid Mamba2+Attention+MoE, 30B/3B-active). Uses experts_implementation=grouped_mm for HF comparison to match automodel's batched GEMM backend, reducing KL divergence from bf16 numerical noise. Also fixes transformers >= 5.2 compatibility, where check_model_inputs was split into merge_with_config_defaults + capture_outputs but the deprecated import still exists with a different signature.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
- Dynamic tokenizer: --tokenizer_name flag for non-Llama models
- Memory tracking: --max_vram_gb / --max_cpu_gb with peak VRAM and RSS assertions
- Phantom key check: --check_phantom_keys scans consolidated safetensors for leaked _blocks/_scales keys (GPT-OSS mxFP4)
- Fused QKV check: --check_fused_qkv_keys verifies the PEFT adapter has split q/k/v projections
- Resume loss continuity: --check_resume trains baseline + resumed run, compares per-step losses (disabled for MoE due to DeepEP non-determinism)
- vLLM token comparison: assert length equality before content comparison
- Audit fixes: no vacuous passes for phantom keys, resume, or vLLM checks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
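The phantom-key scan amounts to a name filter over the tensor keys of the consolidated checkpoint. A minimal sketch, where the helper name and list-based interface are hypothetical (the real check would read key names from the consolidated safetensors file(s)):

```python
# Suffixes of leaked mxFP4 quantization tensors, per the GPT-OSS case above.
PHANTOM_SUFFIXES = ("_blocks", "_scales")

def find_phantom_keys(keys):
    """Return tensor names ending in a quantization suffix; empty means clean."""
    # str.endswith accepts a tuple of suffixes, so one pass suffices.
    return sorted(k for k in keys if k.endswith(PHANTOM_SUFFIXES))
```

A non-empty result would fail the check, so a checkpoint that silently carried quantization blocks past consolidation cannot pass vacuously.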
New models: Nemotron Flash 1B, Gemma 3 270m, Phi-4, Nemotron Nano V2 9B, Baichuan 2 7B, Qwen2.5 7B, Qwen3-MoE 30B, Nemotron Super 120B, Llama-3.3-Super-49B, Mistral3 3B, Nemotron-Nano-8B-v1, llama-nemotron-embed-1b-v2

- 12 robustness shell scripts (SFT + PEFT per model)
- 11 vLLM deploy shell scripts (no vLLM for biencoder)
- 5 new YAML configs (Mistral3 SFT/PEFT, Nano-8B-v1 SFT/PEFT, Qwen3-MoE SFT)
- Biencoder test (test_checkpoint_robustness_biencoder.py) with cosine similarity
- 12 test methods in TestCheckpointRobustness + 11 in TestVLLMDeploy

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add --dataset.limit_dataset_samples 500 / --dataset.num_samples_limit 500 to all robustness scripts (squad and hellaswag respectively) to cut dataset mapping time from ~60s to ~1s per run
- Add --max_vram_gb / --max_cpu_gb thresholds to Gemma 3 and Phi-4 based on observed peak usage (~1.2x headroom)
- Fix Gemma 3 to TP=1 (1 KV head not divisible by TP=2)
- Fix Phi-4 to TP=1 (DTensor redistribution assertion with TP=2)
- Tighten HF KL thresholds based on observed values:
  - Gemma 3 SFT: 6e-3, PEFT: 8e-3
  - Phi-4 SFT: 1.2e-3, PEFT: 1e-3
- Register dataset.num_samples_limit in conftest.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Qwen2.5 7B: tighten SFT KL to 9e-3, PEFT to 8e-2, add cross-TP, memory limits
- Qwen3-MoE 30B: tighten SFT KL to 1e-4, add memory limits
- Nemotron-Nano-8B-v1: tighten SFT KL to 7e-4, add cross-TP, disable resume (Mamba hybrid non-determinism)
- Baichuan/Mistral3: add cross-TP to SFT step
- Add __main__ block to test_checkpoint_robustness_llm.py for direct execution

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Super-49B confirmed multi-node only (OOM on 8 GPUs with TP=4 PP=2). Updated all model results including vLLM pass/fail status. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Observed peak memory (1.2x headroom applied):
- Llama 3.2 3B: SFT 3.91→5 GB VRAM, PEFT 3.89→5 GB VRAM
- GPT-OSS 20B: SFT 19.24→24 GB VRAM, PEFT 9.49→12 GB VRAM
- Nemotron Nano V3: SFT 29.02→35 GB VRAM, PEFT 12.47→15 GB VRAM

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
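The ceilings above are consistent with "observed peak × 1.2 headroom, rounded up to a whole GB". A sketch of that derivation, with a hypothetical helper name:

```python
import math

def vram_threshold_gb(observed_peak_gb, headroom=1.2):
    """Derive a --max_vram_gb ceiling from an observed peak (illustrative)."""
    # Round up so the assertion threshold is always a whole-GB value
    # at or above the padded peak.
    return math.ceil(observed_peak_gb * headroom)
```

For example, the Llama 3.2 3B SFT peak of 3.91 GB pads to 4.692 GB and rounds up to the 5 GB ceiling listed above.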
…lerance

Allows MoE and Mamba hybrid models to use a looser threshold for training resumption loss continuity checks (default: 5e-3 for dense SFT).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phi-4 DTensor bug at TP=2 fixed on main. Both SFT and PEFT pass. Added configurable --resume_loss_threshold CLI arg (default 5e-3). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Enables device_map="auto" in Phase 4 to spread vanilla HF model across all GPUs on rank 0's node. Required for 49B+ models that don't fit on 1 GPU (98GB at bf16 > 80GB H100). Validated under torchrun on 8 GPUs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
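A minimal sketch of how such a flag could select the Phase 4 HF loading kwargs; the helper and the exact kwargs are assumptions for illustration, not the PR's actual code:

```python
# Hypothetical helper: choose loading kwargs for the Phase 4 vanilla HF model.
def phase4_load_kwargs(hf_device_map_auto: bool) -> dict:
    kwargs = {"torch_dtype": "bfloat16"}
    # device_map="auto" lets HF shard the model across all visible GPUs;
    # otherwise everything is placed on a single device (GPU 0 here).
    kwargs["device_map"] = "auto" if hf_device_map_auto else {"": 0}
    return kwargs
```

With sharding enabled, a 98 GB bf16 model that cannot fit on one 80 GB H100 can still be materialized across the node for the comparison pass.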
Results:
- Super-120B SFT: PASS (4 nodes, EP=32, device_map=auto for Phase 4)
- Super-49B SFT: Phase 1-3 PASS (2 nodes, TP=4), Phase 4 FAIL (combined QKV keys)
- Super-49B/120B PEFT: Phase 1-3 PASS, Phase 4 FAIL (combined QKV in adapter)
- Embed-1B-v2: PASS (cosine=1.0, resume with t=2e-2)

Changes:
- Add --hf_device_map_auto flag for Phase 4 large model HF loading
- Fix biencoder import (recipes.biencoder -> recipes.retrieval)
- Fix biencoder tokenizer compatibility (NeMoAutoTokenizer + return_tensors)
- Add --resume_loss_threshold to biencoder test
- Register new CLI flags in conftest.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
adil-a added a commit that referenced this pull request on Apr 2, 2026
vLLM deployment verification tests that load consolidated checkpoints and compare greedy output token-for-token against HuggingFace. Supports both full comparison and smoke test mode. Depends on checkpoint robustness PR #1606. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
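The token-for-token comparison can be sketched as below. Length is asserted before content (mirroring the vLLM token comparison fix described earlier), so a truncated generation fails clearly instead of as a confusing content diff. The function name is illustrative:

```python
def compare_greedy_tokens(hf_tokens, vllm_tokens):
    """Assert two greedy decodes produced identical token sequences."""
    # Check length first: zip() would silently drop a truncated tail.
    assert len(hf_tokens) == len(vllm_tokens), (
        f"length mismatch: HF={len(hf_tokens)} vLLM={len(vllm_tokens)}"
    )
    for i, (h, v) in enumerate(zip(hf_tokens, vllm_tokens)):
        assert h == v, f"first divergence at position {i}: {h} != {v}"
```

Greedy decoding is what makes an exact match meaningful here: with temperature 0 both backends should pick the same argmax token at every step if the checkpoint loaded correctly.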
Remove vLLM deploy test module, 14 shell scripts, and TestVLLMDeploy runner class. Remove vLLM-specific conftest entries and STATUS.md sections. vLLM tests will land in a follow-up PR. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Add ci.checkpoint_robustness section to 28 recipe YAMLs with model-specific test args (KL thresholds, TP overrides, tokenizer names). Common args (max_steps=5, dataset_limit=500, etc.) handled in launcher. Append robustness test block to finetune_launcher.sh that runs after finetune completes, gated by presence of ci.checkpoint_robustness. Add 20 missing model configs to nightly_recipes.yml. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
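A hypothetical shape for such a recipe section; the field names below are illustrative guesses based on the flags described in this PR, not the actual schema:

```yaml
# Illustrative only: model-specific robustness args in a recipe YAML.
ci:
  checkpoint_robustness:
    hf_kl_threshold: 5e-3              # KL threshold vs vanilla HF (Phase 4)
    tp_size: 1                         # TP override (e.g., few KV heads)
    tokenizer_name: meta-llama/Llama-3.2-3B
```

Common args such as max_steps=5 and the dataset limit stay in the launcher, so each recipe only declares what actually differs per model.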
These scripts are superseded by ci.checkpoint_robustness sections in recipe YAMLs. Kept locally for manual debugging. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
TestCheckpointRobustness class called the removed .sh scripts via run_test_script(). No longer needed — CI runs robustness tests directly from finetune_launcher.sh. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Move YAML config parsing for model-specific robustness args from finetune_launcher.sh into test_checkpoint_robustness_llm.py. The launcher now only detects if ci.checkpoint_robustness exists and passes common args. The test script reads model-specific values (KL thresholds, TP overrides, tokenizer names, etc.) directly from the YAML's ci.checkpoint_robustness section, with CLI args taking precedence. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
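The precedence rule described here (CLI args win over YAML values) can be sketched as a dict merge; the helper name and None-means-unset convention are illustrative assumptions:

```python
def resolve_args(cli_args: dict, yaml_args: dict) -> dict:
    """Merge YAML-derived defaults with CLI overrides; CLI takes precedence."""
    resolved = dict(yaml_args)
    # Only CLI values that were actually supplied (non-None) override YAML.
    resolved.update({k: v for k, v in cli_args.items() if v is not None})
    return resolved
```

This keeps the launcher trivial: it only detects that ci.checkpoint_robustness exists, while the test script owns the model-specific resolution.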
Collaborator (Author)
/ok to test d77ea17
Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Contributor
/ok to test 321ac05
thomasdhc previously approved these changes on Apr 6, 2026
Contributor
/ok to test 3bec651
akoumpa approved these changes on Apr 6, 2026
Summary
Comprehensive checkpoint robustness testing for all supported models. Tests the full lifecycle: load → SFT/PEFT (few steps) → save → reload → verify correctness.
Tracks #1586.
Test Infrastructure
- test_checkpoint_robustness_llm.py — 6-phase test harness: reload checkpoints into AutoModelForCausalLM, assert KL < threshold
- test_checkpoint_robustness_biencoder.py — Biencoder variant using cosine similarity for embedding models (Embed-1B-v2).

CI Integration
Robustness tests run automatically after finetune in the same Slurm allocation, configured via the ci.checkpoint_robustness section in recipe YAMLs:
- finetune_launcher.sh appends a robustness test block that runs after finetune completes
- the block is gated on the presence of the ci: section

Features
- --hf_device_map_auto: Spread Phase 4 HF model across all GPUs for large models (49B+)
- --resume_loss_threshold: Configurable resume loss comparison threshold
- --tokenizer_name: Dynamic tokenization for non-Llama models
- --max_vram_gb / --max_cpu_gb: Peak memory regression assertions
- --check_fused_qkv_keys: Verify PEFT adapter has split q/k/v projections
- --check_phantom_keys: Scan for leaked mxFP4 keys in consolidated checkpoints

Results
Passing Models (8 single-node + 3 multi-node)
*Phase 4 failures due to combined QKV projection keys in consolidated checkpoints — vanilla HF can't load them. Phases 1-3 (training + automodel reload) all pass.
Known Issues
- --check_resume disabled for MoE.

Test plan
- ci.checkpoint_robustness in recipe YAMLs
- --hf_device_map_auto for large model Phase 4
- --resume_loss_threshold flag

🤖 Generated with Claude Code