Add MedGemma fine-tuning example with QLoRA #4359
Conversation
Greptile Summary

This PR adds a new federated MedGemma QLoRA fine-tuning example, closely modelled on the Qwen3-VL example and the official Google Health notebook. All six issues raised in the previous review rounds have been addressed: the empty-dataset guard, the zip-slip sanitisation, the broken alt-label f-string, the hardcoded num_train_epochs, the device_map derivation, and the peft version pin.

Confidence Score: 5/5

Safe to merge — all previously raised P0/P1 issues are resolved. The only remaining finding is a P2 best-practice suggestion to explicitly flush the CUDA cache between the two sequential model loads in run_evaluation.py, which affects convenience on constrained hardware but not correctness. Per the confidence guidance, all-P2 findings yield a score of 5.

run_evaluation.py — minor GPU memory hygiene between two sequential model loads.

Important Files Changed
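The zip-slip sanitisation the summary refers to can be sketched as a path check before extraction. This is a minimal illustration, not the actual helper in download_data.py; the name `safe_extract` and its signature are assumptions.

```python
import os
import zipfile


def safe_extract(zip_path: str, dest_dir: str) -> None:
    """Extract a zip archive, rejecting entries that would escape
    dest_dir via ".." components or absolute paths (Zip Slip)."""
    dest_dir = os.path.realpath(dest_dir)
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.namelist():
            # Resolve where the entry would actually land on disk.
            target = os.path.realpath(os.path.join(dest_dir, member))
            if not target.startswith(dest_dir + os.sep):
                raise ValueError(f"Blocked path traversal attempt: {member!r}")
        zf.extractall(dest_dir)
```

The key point is validating every archive member's resolved path before any extraction happens, so a single malicious entry aborts the whole operation.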
Sequence Diagram

```mermaid
sequenceDiagram
    participant J as job.py (SimEnv)
    participant S as NVFlare Server (FedAvgRecipe)
    participant C as client.py (SFTTrainer)
    participant M as model.py (MedGemmaLoRAModel)
    J->>S: recipe.execute(env)
    S->>M: MedGemmaLoRAModel.state_dict() → initial LoRA adapter weights
    loop Each FL Round
        S->>C: flare.send(FLModel with adapter params)
        C->>C: apply_adapter_state(model, params)
        C->>C: SFTTrainer.train() — QLoRA fine-tuning on local data
        C->>C: get_adapter_state_dict(model) → updated LoRA weights
        C->>S: flare.send(FLModel with updated params + metrics)
        S->>S: FedAvg aggregation of adapter weights
    end
    S->>J: run.get_result() → FL_global_model.pt
    J-->>J: inference / evaluation via run_inference.py / run_evaluation.py
```
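The server-side FedAvg step in the loop above amounts to a sample-weighted average over the adapter-only state dicts the clients return. A minimal sketch, using plain Python lists in place of tensors; `fedavg_adapters` is a hypothetical name for illustration, not NVFlare's actual aggregator.

```python
from typing import Dict, List

# Each client state dict maps an adapter parameter name (e.g. a LoRA
# matrix) to its values, flattened here to a list of floats.
def fedavg_adapters(
    client_states: List[Dict[str, List[float]]],
    num_samples: List[int],
) -> Dict[str, List[float]]:
    """Weighted FedAvg: each client's contribution is proportional
    to the number of local training samples it reported."""
    total = sum(num_samples)
    agg: Dict[str, List[float]] = {}
    for key in client_states[0]:
        length = len(client_states[0][key])
        agg[key] = [
            sum(state[key][i] * n / total
                for state, n in zip(client_states, num_samples))
            for i in range(length)
        ]
    return agg
```

Because only adapter parameters appear in these dicts, the frozen 4-bit base weights never leave the clients, which is what makes the exchange pattern suitable for clinical data.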
Reviews (16): Last reviewed commit: "Merge branch 'main' into codex/medgemma-..."
/build

/build
The adapter-only exchange pattern here is the correct architecture for clinical FL: base model weights stay local, and only the LoRA adapter weights cross the wire. The Zip Slip mitigation in download_data.py is necessary given external dataset sources. Good.

One substantive concern: FedAvg over LoRA adapters is not neutral with respect to rank. When you average low-rank matrices across heterogeneous clients, the effective rank of the aggregated adapter can collapse below the configured lora_r, particularly under the non-IID data distributions common in multi-site clinical imaging. This is a known failure mode (see Cho et al. and the broader heterogeneous LoRA aggregation literature). At minimum, the documentation should note this risk and suggest monitoring validation loss divergence as a diagnostic. Ideally, a future iteration would support a rank-aware aggregation scheme.

Otherwise, well-structured and properly tested.
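A related effect to the rank concern above can be illustrated with a small numpy experiment: averaging the LoRA factors per-parameter (as parameter-wise FedAvg does) caps the aggregate update at rank r, whereas the true average of the per-client updates B_i @ A_i can reach rank 2r, so the two disagree. The dimensions, rank, and seed below are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
r, d = 4, 32  # LoRA rank and hidden dimension (illustrative values)

# Two clients' LoRA factors; each client's weight update is B_i @ A_i.
A1, B1 = rng.normal(size=(r, d)), rng.normal(size=(d, r))
A2, B2 = rng.normal(size=(r, d)), rng.normal(size=(d, r))

# Parameter-wise averaging of the factors (what naive FedAvg over the
# adapter state dict computes): the product is still at most rank r.
naive = ((B1 + B2) / 2) @ ((A1 + A2) / 2)

# Averaging the effective weight updates themselves: generically rank 2r.
ideal = (B1 @ A1 + B2 @ A2) / 2

rank_naive = np.linalg.matrix_rank(naive)  # at most r
rank_ideal = np.linalg.matrix_rank(ideal)  # up to 2r
```

So even before any data heterogeneity, the parameter-wise average is not the average of the clients' updates, which is why the documentation note and validation-loss monitoring suggested above are worthwhile.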
YuanTingHsieh
left a comment
added one comment
/build

/build

/build