Qwen-VL LoRA Support#4277

Merged
holgerroth merged 44 commits into NVIDIA:main from holgerroth:qwen_lora
Mar 13, 2026

Conversation

@holgerroth
Collaborator

@holgerroth holgerroth commented Mar 6, 2026

Fixes # .

Description

Qwen3-VL example: LoRA training fixes

Summary

This PR makes the Qwen3-VL NVFlare example self-contained and stabilizes both full-model and LoRA federated training paths.

Key outcomes:

  • Removes dependency on cloning external Qwen repos by vendoring required qwenvl train/data code.
  • Fixes LoRA adapter-only FL training failures (trainable params: 0, empty optimizer groups, grad_fn/backward issues).
  • Improves runtime behavior by using in-memory parameter exchange for default single-process runs.
  • Keeps checkpoint-based exchange for multi-rank runs, with optimized adapter artifact handling.

Changes

1) Self-contained Qwen3-VL example

  • Vendored qwenvl/train/* and qwenvl/data/* into the example.
  • Removed external patch/repo dependency path and updated docs/NOTICE.

2) LoRA training correctness

  • In adapter-dir load path (train_qwen.py), explicitly enables LoRA params for training.
  • Ensures PEFT path runs in train mode and avoids gradient-checkpointing/PEFT backward failures.
  • Uses PEFT-aware optimizer construction with trainable params only.

3) Single-process in-memory exchange

  • For default WORLD_SIZE=1, client no longer relies on checkpoint round-trip between receive/train/send.
  • Loads received FL params into model in-process and returns trained params directly from memory.
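A minimal sketch of the single-process round, assuming hypothetical helper names (`strip_model_prefix`, `run_round` are illustrative, not the PR's actual functions): the received params never touch disk and the trained state dict is handed straight back to `flare.send(...)`:

```python
# In-memory exchange sketch: no checkpoint round-trip for WORLD_SIZE=1.
def strip_model_prefix(state_dict: dict) -> dict:
    """Drop a leading 'model.' from keys so they match the local module."""
    return {k[len("model."):] if k.startswith("model.") else k: v
            for k, v in state_dict.items()}

def run_round(received_params: dict, train_fn) -> dict:
    initial_state_dict = strip_model_prefix(received_params)
    # train_fn loads the state dict in-process and returns trained params
    # directly from memory (return_state_dict=True in the vendored trainer).
    return train_fn(initial_state_dict)

params = {"model.lora_A.weight": [1.0], "other.weight": [2.0]}
out = run_round(params, lambda sd: sd)
print(sorted(out))
```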

4) Multi-rank path performance and safety

  • Keeps checkpoint-based exchange for multi-rank synchronization.
  • Optimizes LoRA adapter prep by writing adapter_model.safetensors + adapter_config.json directly, removing unnecessary extra base-model load in adapter save helper.

5) Optimizer robustness and config alignment

  • Adds empty-group filtering/guard in non-PEFT optimizer path.
  • Aligns LoRA dropout defaults across model/train paths (0.0) to avoid mode-dependent mismatch.

Testing

  • Targeted formatting/syntax checks on touched files (isort --check-only, black --check, py_compile) passed.
  • LoRA path validated for: non-zero trainable adapter params, no empty optimizer group errors, and no PEFT grad_fn/backward failure.
  • Non-LoRA and LoRA flows both keep expected FL send/receive behavior.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

@greptile-apps
Contributor

greptile-apps Bot commented Mar 6, 2026

Greptile Summary

This PR makes the Qwen3-VL NVFlare example self-contained by vendoring qwenvl/train and qwenvl/data from the upstream Qwen3-VL repo (eliminating the external clone + patch requirement), and stabilises both full-model and LoRA federated training. It introduces an in-memory parameter exchange path for single-process runs, a dedicated Qwen3VLLoRAModel that exposes only adapter weights for FL aggregation, a PEFT-aware QwenTrainer.create_optimizer, and gradient-checkpointing suppression for PeftModel+FlashAttention2 to resolve the known backward grad_fn failure.

Key changes:

  • Vendored qwenvl/: data/__init__.py, data/data_processor.py, data/rope2d.py, train/argument.py, train/train_qwen.py, train/trainer.py are all new files adapted from the upstream repo with FL-specific changes (adapter-dir loading, initial_state_dict injection, return_state_dict path).
  • In-memory exchange (use_in_memory_exchange = not _is_multi_rank): for single-process clients, the received FL params are kept in-memory and passed directly to training; the trained state dict is returned from train() without touching disk. This eliminates a full checkpoint round-trip and reduces timeout risk.
  • LoRA adapter exchange: Qwen3VLLoRAModel reduces communicated payload from ~4651 MB to ~98 MB. _save_lora_adapter_for_training writes adapter_model.safetensors + adapter_config.json directly from FL params (no base-model reload). train_qwen.py detects the adapter dir, loads base + adapter, and explicitly enables LoRA params for training.
  • _save_lora_adapter_for_training hardcodes DEFAULT_LORA_TARGET_MODULES without accepting a target_modules parameter, while Qwen3VLLoRAModel.__init__ supports custom lora_target_modules. Using non-default target modules would cause a RuntimeError in the validation at line 143, breaking multi-rank LoRA exchange entirely.
  • run_inference.py LoRA error message lacks guidance when adapter keys are missing due to a target-module mismatch, leaving users without a path to recover.
  • The lora_dropout or 0.05 silent-override and _original_create_optimizer dead-code issues flagged in previous review threads are resolved in the vendored code.

Confidence Score: 3/5

  • Safe to merge for the default LoRA config; one latent runtime error exists if custom LoRA target modules are used.
  • The core logic for in-memory exchange, barrier ordering, error recovery, and PEFT optimizer construction is sound. The main concern is that _save_lora_adapter_for_training hardcodes DEFAULT_LORA_TARGET_MODULES while Qwen3VLLoRAModel supports custom target modules — using non-default modules would silently break multi-rank LoRA exchange with a RuntimeError at validation. Since the current CLI only exposes the default modules, this does not affect the documented workflow, but the inconsistency in the public API is a real latent bug. The inference LoRA error message is also missing actionable recovery guidance. Both issues are scoped to the example directory with no impact on the broader NVFlare framework.
  • Pay close attention to examples/advanced/qwen3-vl/client.py (_save_lora_adapter_for_training, lines 104–164) for the target_modules inconsistency, and examples/advanced/qwen3-vl/run_inference.py (lines 304–310) for the LoRA mismatch error message.

Important Files Changed

Filename Overview
examples/advanced/qwen3-vl/client.py Core FL client with new in-memory exchange path, LoRA checkpoint prep, and multi-rank coordination. The _save_lora_adapter_for_training function hardcodes DEFAULT_LORA_TARGET_MODULES for both validation and the written adapter_config.json, creating a latent inconsistency with Qwen3VLLoRAModel's optional custom lora_target_modules. Logic for barrier ordering and error-recovery paths is otherwise sound.
examples/advanced/qwen3-vl/model.py Adds Qwen3VLLoRAModel, adapter key normalization, and load_state_dict_from_checkpoint(lora_only=True). Qwen3VLLoRAModel.load_state_dict strips "model." from all keys regardless of whether they are adapter keys, then passes them through map_adapter_state_dict_for_peft_model; non-adapter keys end up as unmatched and cause a RuntimeError when strict=True, which is correct guard behavior. Overall well-structured.
examples/advanced/qwen3-vl/qwenvl/train/train_qwen.py Vendored training entry point with new in-memory exchange support (initial_state_dict, return_state_dict), adapter-dir detection, and LoRA path. Gradient checkpointing is correctly disabled for PeftModel+flash_attention_2. The _load_initial_state_dict non-PeftModel path uses strict=False without logging unmatched keys, which could silently hide FL key mismatches.
examples/advanced/qwen3-vl/qwenvl/train/trainer.py New vendored QwenTrainer with PEFT-aware create_optimizer that correctly filters to requires_grad parameters only and drops empty parameter groups before optimizer construction. Non-PEFT path also filters with empty-group guard. Flash-attention forward overrides for Qwen2-VL/2.5-VL/3-VL/MoE are structurally sound.
examples/advanced/qwen3-vl/run_inference.py Adds LoRA-aware inference path that auto-detects checkpoint type by checking full-model key overlap. The detection heuristic (full_match_count > 0) is safe in practice since LoRA adapter keys contain lora_A/lora_B markers that never appear in base model keys. Strict error on missing adapter keys gives clear failures.
examples/advanced/qwen3-vl/job.py Adds --lora, --lora_r, --lora_alpha, --lora_dropout CLI args, conditionally picks Qwen3VLLoRAModel vs Qwen3VLModel as initial model, and refactors timeout configuration into _configure_timeouts. Clean and correct.
examples/advanced/qwen3-vl/qwenvl/data/__init__.py Vendored data registry with fl_site branch reading FL_SITE_DATA_DIR / PUBMEDVISION_IMAGE_ROOT. Clean direct port from the patch that was previously applied to the external Qwen3-VL clone.
examples/advanced/qwen3-vl/qwenvl/train/argument.py Defines training argument dataclasses including lora_enable, lora_r, lora_alpha, and lora_dropout (default 0.0). Clean and consistent with what train_qwen.py uses directly.
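The checkpoint-type heuristic described for run_inference.py can be sketched as follows (function and variable names are illustrative): any full-key overlap with the base model implies a full-model checkpoint, while `lora_A`/`lora_B` markers identify an adapter:

```python
# Sketch of the full-vs-LoRA checkpoint detection heuristic.
def detect_checkpoint_type(ckpt_keys, base_model_keys) -> str:
    full_match_count = sum(1 for k in ckpt_keys if k in base_model_keys)
    if full_match_count > 0:
        # LoRA adapter keys contain lora_A/lora_B markers that never appear
        # among base-model keys, so any exact overlap means a full model.
        return "full"
    if any(("lora_A" in k) or ("lora_B" in k) for k in ckpt_keys):
        return "lora"
    return "unknown"

base = {"model.layers.0.q_proj.weight", "model.layers.0.k_proj.weight"}
print(detect_checkpoint_type({"model.layers.0.q_proj.weight"}, base))
print(detect_checkpoint_type({"layers.0.q_proj.lora_A.weight"}, base))
```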

Sequence Diagram

sequenceDiagram
    participant Server as FL Server
    participant R0 as Client rank-0
    participant Rn as Client rank-N

    loop FL Round
        Server->>R0: "FLModel(params) via flare.receive()"
        Note over R0: "Determine exchange mode"

        alt "In-memory single-rank (full or LoRA)"
            R0->>R0: "round_initial_state_dict = _strip_model_prefix(params)"
            R0->>R0: "train(base_hf_id, initial_state_dict, return_state_dict=True)"
            R0->>Server: "flare.send(trained params from memory)"
        else "Multi-rank full-model checkpoint"
            R0->>R0: "save_pretrained(input_model_dir)"
            R0-->>Rn: "_dist_barrier"
            R0->>Rn: "torchrun train(input_model_dir)"
            Rn-->>R0: "DDP training complete"
            R0->>R0: "load_state_dict_from_checkpoint(output_model_dir)"
            R0->>Server: "flare.send(full params ~4651 MB)"
        else "Multi-rank LoRA checkpoint"
            R0->>R0: "_save_lora_adapter_for_training(adapter_model.safetensors)"
            R0-->>Rn: "_dist_barrier"
            R0->>Rn: "torchrun train(adapter_dir, lora_enable=True)"
            Rn-->>R0: "DDP training complete"
            R0->>R0: "load_state_dict_from_checkpoint(lora_only=True)"
            R0->>Server: "flare.send(adapter params ~98 MB)"
        end
    end

Comments Outside Diff (2)

  1. examples/advanced/qwen3-vl/client.py, line 104-157 (link)

    target_modules hardcoded — inconsistent with Qwen3VLLoRAModel

    _save_lora_adapter_for_training accepts lora_r, lora_alpha, and lora_dropout as explicit parameters but hardcodes DEFAULT_LORA_TARGET_MODULES in both the validation loop (lines 133–148) and the written adapter_config.json (line 157). Meanwhile, Qwen3VLLoRAModel.__init__ (in model.py) already accepts an optional lora_target_modules parameter:

    # model.py
    def __init__(self, ..., lora_target_modules: Optional[list] = None, ...):
        lora_config = LoraConfig(
            ...
            target_modules=lora_target_modules or DEFAULT_LORA_TARGET_MODULES,
        )

    If Qwen3VLLoRAModel is initialised with custom target modules, the server's state_dict() will contain adapter keys only for those modules — but the validation at line 143 checks for all four default modules (q_proj, k_proj, v_proj, o_proj). Any custom module set that omits even one default module will raise a RuntimeError in the multi-rank checkpoint-based exchange path, making it entirely non-functional with non-default LoRA configs.

    The fix is to add a target_modules parameter, use it for validation, and pass it to the saved LoraConfig:

    def _save_lora_adapter_for_training(
        lora_state_dict: dict,
        save_dir: str,
        base_model_name_or_path: str,
        lora_r: int,
        lora_alpha: int,
        lora_dropout: float,
        target_modules: list = DEFAULT_LORA_TARGET_MODULES,
    ) -> None:

    The call site at line 451 should then forward the target_modules that were used when building Qwen3VLLoRAModel.

  2. examples/advanced/qwen3-vl/run_inference.py, line 304-310 (link)

    Missing recovery guidance in LoRA mismatch error

    The error message for missing adapter keys tells the user what went wrong but not why or how to fix it. The expected keys are determined by DEFAULT_LORA_TARGET_MODULES, which is hardcoded in the get_peft_model call above. If the FL job was run with a custom set of target modules (which Qwen3VLLoRAModel in model.py supports via the lora_target_modules parameter), the user would receive this opaque error with no actionable path forward.

    Consider augmenting the error message to hint that LoRA rank, alpha, and target modules in the inference call must match the configuration used during the FL job. This is especially important since neither run_inference.py nor job.py currently expose a target_modules flag, making the default a silent assumption that is easy to violate when Qwen3VLLoRAModel is constructed directly with custom target modules.
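A hypothetical sketch of the more actionable error suggested above; the wording and the helper name are illustrative, not from the PR:

```python
# Sketch: LoRA mismatch error that tells the user how to recover.
def raise_adapter_mismatch(missing_keys, target_modules):
    raise RuntimeError(
        f"{len(missing_keys)} LoRA adapter keys are missing from the checkpoint. "
        f"Inference builds the adapter with target_modules={list(target_modules)}; "
        "make sure lora_r, lora_alpha, and target_modules here match the "
        "configuration used during the FL job (e.g. the arguments passed to "
        "Qwen3VLLoRAModel)."
    )
```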

Last reviewed commit: f20e316

@holgerroth holgerroth marked this pull request as draft March 7, 2026 00:01
@holgerroth
Collaborator Author

@greptileai review again

@holgerroth
Collaborator Author

@greptileai review latest changes

@holgerroth
Collaborator Author

@greptileai review the latest version of this PR

@holgerroth holgerroth marked this pull request as ready for review March 9, 2026 22:23
@holgerroth
Collaborator Author

/build

@holgerroth
Collaborator Author

@greptileai review

@holgerroth
Collaborator Author

/build

@holgerroth holgerroth enabled auto-merge (squash) March 13, 2026 20:52
@holgerroth holgerroth merged commit 9c68d55 into NVIDIA:main Mar 13, 2026
28 checks passed
@holgerroth holgerroth deleted the qwen_lora branch March 14, 2026 00:54