Qwen-VL LoRA Support#4277

Merged
holgerroth merged 44 commits into NVIDIA:main from holgerroth:qwen_lora
Mar 13, 2026

Conversation

@holgerroth
Collaborator

@holgerroth holgerroth commented Mar 6, 2026

Fixes # .

Description

Qwen3-VL example: LoRA training fixes

Summary

This PR makes the Qwen3-VL NVFlare example self-contained and stabilizes both full-model and LoRA federated training paths.

Key outcomes:

  • Removes dependency on cloning external Qwen repos by vendoring required qwenvl train/data code.
  • Fixes LoRA adapter-only FL training failures (trainable params: 0, empty optimizer groups, grad_fn/backward issues).
  • Improves runtime behavior by using in-memory parameter exchange for default single-process runs.
  • Keeps checkpoint-based exchange for multi-rank runs, with optimized adapter artifact handling.

Changes

1) Self-contained Qwen3-VL example

  • Vendored qwenvl/train/* and qwenvl/data/* into the example.
  • Removed external patch/repo dependency path and updated docs/NOTICE.

2) LoRA training correctness

  • In adapter-dir load path (train_qwen.py), explicitly enables LoRA params for training.
  • Ensures PEFT path runs in train mode and avoids gradient-checkpointing/PEFT backward failures.
  • Uses PEFT-aware optimizer construction with trainable params only.

3) Single-process in-memory exchange

  • For default WORLD_SIZE=1, client no longer relies on checkpoint round-trip between receive/train/send.
  • Loads received FL params into model in-process and returns trained params directly from memory.
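A minimal sketch of the single-process round, assuming hypothetical helper names (`strip_model_prefix`, `run_round` are illustrative, not the PR's actual functions): the received params never touch disk and the trained state dict is handed straight back to `flare.send(...)`:

```python
# In-memory exchange sketch: no checkpoint round-trip for WORLD_SIZE=1.
def strip_model_prefix(state_dict: dict) -> dict:
    """Drop a leading 'model.' from keys so they match the local module."""
    return {k[len("model."):] if k.startswith("model.") else k: v
            for k, v in state_dict.items()}

def run_round(received_params: dict, train_fn) -> dict:
    initial_state_dict = strip_model_prefix(received_params)
    # train_fn loads the state dict in-process and returns trained params
    # directly from memory (return_state_dict=True in the vendored trainer).
    return train_fn(initial_state_dict)

params = {"model.lora_A.weight": [1.0], "other.weight": [2.0]}
out = run_round(params, lambda sd: sd)
print(sorted(out))
```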

4) Multi-rank path performance and safety

  • Keeps checkpoint-based exchange for multi-rank synchronization.
  • Optimizes LoRA adapter prep by writing adapter_model.safetensors + adapter_config.json directly, removing unnecessary extra base-model load in adapter save helper.

5) Optimizer robustness and config alignment

  • Adds empty-group filtering/guard in non-PEFT optimizer path.
  • Aligns LoRA dropout defaults across model/train paths (0.0) to avoid mode-dependent mismatch.

Testing

  • Targeted formatting/syntax checks on touched files (isort --check-only, black --check, py_compile) passed.
  • LoRA path validated for: non-zero trainable adapter params, no empty optimizer group errors, and no PEFT grad_fn/backward failure.
  • Non-LoRA and LoRA flows both keep expected FL send/receive behavior.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

@greptile-apps
Contributor

greptile-apps Bot commented Mar 6, 2026

Greptile Summary

This PR makes the Qwen3-VL NVFlare example self-contained by vendoring qwenvl/train and qwenvl/data from the upstream Qwen3-VL repo (eliminating the external clone + patch requirement), and stabilises both full-model and LoRA federated training. It introduces an in-memory parameter exchange path for single-process runs, a dedicated Qwen3VLLoRAModel that exposes only adapter weights for FL aggregation, a PEFT-aware QwenTrainer.create_optimizer, and gradient-checkpointing suppression for PeftModel+FlashAttention2 to resolve the known backward grad_fn failure.

Key changes:

  • Vendored qwenvl/: data/__init__.py, data/data_processor.py, data/rope2d.py, train/argument.py, train/train_qwen.py, train/trainer.py are all new files adapted from the upstream repo with FL-specific changes (adapter-dir loading, initial_state_dict injection, return_state_dict path).
  • In-memory exchange (use_in_memory_exchange = not _is_multi_rank): for single-process clients, the received FL params are kept in-memory and passed directly to training; the trained state dict is returned from train() without touching disk. This eliminates a full checkpoint round-trip and reduces timeout risk.
  • LoRA adapter exchange: Qwen3VLLoRAModel reduces communicated payload from ~4651 MB to ~98 MB. _save_lora_adapter_for_training writes adapter_model.safetensors + adapter_config.json directly from FL params (no base-model reload). train_qwen.py detects the adapter dir, loads base + adapter, and explicitly enables LoRA params for training.
  • _save_lora_adapter_for_training hardcodes DEFAULT_LORA_TARGET_MODULES without accepting a target_modules parameter, while Qwen3VLLoRAModel.__init__ supports custom lora_target_modules. Using non-default target modules would cause a RuntimeError in the validation at line 143, breaking multi-rank LoRA exchange entirely.
  • run_inference.py LoRA error message lacks guidance when adapter keys are missing due to a target-module mismatch, leaving users without a path to recover.
  • The lora_dropout or 0.05 silent-override and _original_create_optimizer dead-code issues flagged in previous review threads are resolved in the vendored code.

Confidence Score: 3/5

  • Safe to merge for the default LoRA config; one latent runtime error exists if custom LoRA target modules are used.
  • The core logic for in-memory exchange, barrier ordering, error recovery, and PEFT optimizer construction is sound. The main concern is that _save_lora_adapter_for_training hardcodes DEFAULT_LORA_TARGET_MODULES while Qwen3VLLoRAModel supports custom target modules — using non-default modules would silently break multi-rank LoRA exchange with a RuntimeError at validation. Since the current CLI only exposes the default modules, this does not affect the documented workflow, but the inconsistency in the public API is a real latent bug. The inference LoRA error message is also missing actionable recovery guidance. Both issues are scoped to the example directory with no impact on the broader NVFlare framework.
  • Pay close attention to examples/advanced/qwen3-vl/client.py (_save_lora_adapter_for_training, lines 104–164) for the target_modules inconsistency, and examples/advanced/qwen3-vl/run_inference.py (lines 304–310) for the LoRA mismatch error message.

Important Files Changed

Filename Overview
examples/advanced/qwen3-vl/client.py Core FL client with new in-memory exchange path, LoRA checkpoint prep, and multi-rank coordination. The _save_lora_adapter_for_training function hardcodes DEFAULT_LORA_TARGET_MODULES for both validation and the written adapter_config.json, creating a latent inconsistency with Qwen3VLLoRAModel's optional custom lora_target_modules. Logic for barrier ordering and error-recovery paths is otherwise sound.
examples/advanced/qwen3-vl/model.py Adds Qwen3VLLoRAModel, adapter key normalization, and load_state_dict_from_checkpoint(lora_only=True). Qwen3VLLoRAModel.load_state_dict strips "model." from all keys regardless of whether they are adapter keys, then passes them through map_adapter_state_dict_for_peft_model; non-adapter keys end up as unmatched and cause a RuntimeError when strict=True, which is correct guard behavior. Overall well-structured.
examples/advanced/qwen3-vl/qwenvl/train/train_qwen.py Vendored training entry point with new in-memory exchange support (initial_state_dict, return_state_dict), adapter-dir detection, and LoRA path. Gradient checkpointing is correctly disabled for PeftModel+flash_attention_2. The _load_initial_state_dict non-PeftModel path uses strict=False without logging unmatched keys, which could silently hide FL key mismatches.
examples/advanced/qwen3-vl/qwenvl/train/trainer.py New vendored QwenTrainer with PEFT-aware create_optimizer that correctly filters to requires_grad parameters only and drops empty parameter groups before optimizer construction. Non-PEFT path also filters with empty-group guard. Flash-attention forward overrides for Qwen2-VL/2.5-VL/3-VL/MoE are structurally sound.
examples/advanced/qwen3-vl/run_inference.py Adds LoRA-aware inference path that auto-detects checkpoint type by checking full-model key overlap. The detection heuristic (full_match_count > 0) is safe in practice since LoRA adapter keys contain lora_A/lora_B markers that never appear in base model keys. Strict error on missing adapter keys gives clear failures.
examples/advanced/qwen3-vl/job.py Adds --lora, --lora_r, --lora_alpha, --lora_dropout CLI args, conditionally picks Qwen3VLLoRAModel vs Qwen3VLModel as initial model, and refactors timeout configuration into _configure_timeouts. Clean and correct.
examples/advanced/qwen3-vl/qwenvl/data/__init__.py Vendored data registry with fl_site branch reading FL_SITE_DATA_DIR / PUBMEDVISION_IMAGE_ROOT. Clean direct port from the patch that was previously applied to the external Qwen3-VL clone.
examples/advanced/qwen3-vl/qwenvl/train/argument.py Defines training argument dataclasses including lora_enable, lora_r, lora_alpha, and lora_dropout (default 0.0). Clean and consistent with what train_qwen.py uses directly.
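The checkpoint-type heuristic described for run_inference.py can be sketched as follows (function and variable names are illustrative): any full-key overlap with the base model implies a full-model checkpoint, while `lora_A`/`lora_B` markers identify an adapter:

```python
# Sketch of the full-vs-LoRA checkpoint detection heuristic.
def detect_checkpoint_type(ckpt_keys, base_model_keys) -> str:
    full_match_count = sum(1 for k in ckpt_keys if k in base_model_keys)
    if full_match_count > 0:
        # LoRA adapter keys contain lora_A/lora_B markers that never appear
        # among base-model keys, so any exact overlap means a full model.
        return "full"
    if any(("lora_A" in k) or ("lora_B" in k) for k in ckpt_keys):
        return "lora"
    return "unknown"

base = {"model.layers.0.q_proj.weight", "model.layers.0.k_proj.weight"}
print(detect_checkpoint_type({"model.layers.0.q_proj.weight"}, base))
print(detect_checkpoint_type({"layers.0.q_proj.lora_A.weight"}, base))
```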

Sequence Diagram

sequenceDiagram
    participant Server as FL Server
    participant R0 as Client rank-0
    participant Rn as Client rank-N

    loop FL Round
        Server->>R0: "FLModel(params) via flare.receive()"
        Note over R0: "Determine exchange mode"

        alt "In-memory single-rank (full or LoRA)"
            R0->>R0: "round_initial_state_dict = _strip_model_prefix(params)"
            R0->>R0: "train(base_hf_id, initial_state_dict, return_state_dict=True)"
            R0->>Server: "flare.send(trained params from memory)"
        else "Multi-rank full-model checkpoint"
            R0->>R0: "save_pretrained(input_model_dir)"
            R0-->>Rn: "_dist_barrier"
            R0->>Rn: "torchrun train(input_model_dir)"
            Rn-->>R0: "DDP training complete"
            R0->>R0: "load_state_dict_from_checkpoint(output_model_dir)"
            R0->>Server: "flare.send(full params ~4651 MB)"
        else "Multi-rank LoRA checkpoint"
            R0->>R0: "_save_lora_adapter_for_training(adapter_model.safetensors)"
            R0-->>Rn: "_dist_barrier"
            R0->>Rn: "torchrun train(adapter_dir, lora_enable=True)"
            Rn-->>R0: "DDP training complete"
            R0->>R0: "load_state_dict_from_checkpoint(lora_only=True)"
            R0->>Server: "flare.send(adapter params ~98 MB)"
        end
    end

Comments Outside Diff (2)

  1. examples/advanced/qwen3-vl/client.py, line 104-157 (link)

    target_modules hardcoded — inconsistent with Qwen3VLLoRAModel

    _save_lora_adapter_for_training accepts lora_r, lora_alpha, and lora_dropout as explicit parameters but hardcodes DEFAULT_LORA_TARGET_MODULES in both the validation loop (lines 133–148) and the written adapter_config.json (line 157). Meanwhile, Qwen3VLLoRAModel.__init__ (in model.py) already accepts an optional lora_target_modules parameter:

    # model.py
    def __init__(self, ..., lora_target_modules: Optional[list] = None, ...):
        lora_config = LoraConfig(
            ...
            target_modules=lora_target_modules or DEFAULT_LORA_TARGET_MODULES,
        )

    If Qwen3VLLoRAModel is initialised with custom target modules, the server's state_dict() will contain adapter keys only for those modules — but the validation at line 143 checks for all four default modules (q_proj, k_proj, v_proj, o_proj). Any custom module set that omits even one default module will raise a RuntimeError in the multi-rank checkpoint-based exchange path, making it entirely non-functional with non-default LoRA configs.

    The fix is to add a target_modules parameter, use it for validation, and pass it to the saved LoraConfig:

    def _save_lora_adapter_for_training(
        lora_state_dict: dict,
        save_dir: str,
        base_model_name_or_path: str,
        lora_r: int,
        lora_alpha: int,
        lora_dropout: float,
        target_modules: list = DEFAULT_LORA_TARGET_MODULES,
    ) -> None:

    The call site at line 451 should then forward the target_modules that were used when building Qwen3VLLoRAModel.

  2. examples/advanced/qwen3-vl/run_inference.py, line 304-310 (link)

    Missing recovery guidance in LoRA mismatch error

    The error message for missing adapter keys tells the user what went wrong but not why or how to fix it. The expected keys are determined by DEFAULT_LORA_TARGET_MODULES, which is hardcoded in the get_peft_model call above. If the FL job was run with a custom set of target modules (which Qwen3VLLoRAModel in model.py supports via the lora_target_modules parameter), the user would receive this opaque error with no actionable path forward.

    Consider augmenting the error message to hint that LoRA rank, alpha, and target modules in the inference call must match the configuration used during the FL job. This is especially important since neither run_inference.py nor job.py currently expose a target_modules flag, making the default a silent assumption that is easy to violate when Qwen3VLLoRAModel is constructed directly with custom target modules.
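A hypothetical sketch of the more actionable error suggested above; the wording and the helper name are illustrative, not from the PR:

```python
# Sketch: LoRA mismatch error that tells the user how to recover.
def raise_adapter_mismatch(missing_keys, target_modules):
    raise RuntimeError(
        f"{len(missing_keys)} LoRA adapter keys are missing from the checkpoint. "
        f"Inference builds the adapter with target_modules={list(target_modules)}; "
        "make sure lora_r, lora_alpha, and target_modules here match the "
        "configuration used during the FL job (e.g. the arguments passed to "
        "Qwen3VLLoRAModel)."
    )
```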

Last reviewed commit: f20e316

@holgerroth holgerroth marked this pull request as draft March 7, 2026 00:01
@holgerroth
Collaborator Author

@greptileai review again

@holgerroth
Collaborator Author

@greptileai review latest changes

@holgerroth
Collaborator Author

@greptileai review the latest version of this PR

@holgerroth holgerroth marked this pull request as ready for review March 9, 2026 22:23
@holgerroth
Collaborator Author

/build

@holgerroth
Collaborator Author

@greptileai review

@holgerroth
Collaborator Author

/build

@holgerroth holgerroth enabled auto-merge (squash) March 13, 2026 20:52
@holgerroth holgerroth merged commit 9c68d55 into NVIDIA:main Mar 13, 2026
28 checks passed
@holgerroth holgerroth deleted the qwen_lora branch March 14, 2026 00:54