
feat: Add Prov-GigaPath linear probe test workflows and prediction-map utilities #15

Merged
vojtech-cifka merged 129 commits into master from feature/provgigapath-metrics-test
May 19, 2026

@vojtech-cifka vojtech-cifka commented May 19, 2026

Description
This branch extends the linear-probe workflow to cover Prov-GigaPath alongside Virchow2 and cleans up experiment naming.

Added:

  • Prov-GigaPath k-fold training configs for AdamW and LBFGS.
  • Prov-GigaPath final training configs with selected weight decay values.
  • Prov-GigaPath held-out test configs loading final checkpoints.
  • Backbone-aware linear head input dimension via ${embedding_dim}.
  • Slide-name metadata in per-slide test accuracy exports.
  • Optional slide-budget sampling for embedding generation.
  • Safer prediction-map filenames.
  • Renamed ML and preprocessing experiment configs for clearer backbone/model naming.

Summary by CodeRabbit

  • New Features

    • Added support for ProvGigaPath embeddings across training and testing workflows
    • Slide-level metadata (slide names) are now included in per-slide test outputs and logs
  • Improvements

    • Prediction output filenames now include a short digest for stable, sanitized names
    • Slide sampling introduced with configurable tile-budget and seed for reproducible preprocessing
    • Checkpointing now retains the last checkpoint automatically


vojtech-cifka and others added 30 commits May 7, 2026 21:43
Use the `label` and `fold` columns produced by the upstream k-fold split
instead of deriving labels from coverage columns and randomly splitting
val. Memory-mapped via HuggingFace datasets so the full embedding parquet
no longer has to fit in numpy.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ddings

Datamodule downloads embeddings + kfold artifacts from MLflow, joins on
(slide_id, x, y) via pyarrow, applies class mapping, tissue/class coverage
filters, and exposes per-fold splits via set_val_fold(). Training script
loops folds in a single run and logs per-fold + aggregate metrics. Probe
adds per-class F1, confusion matrix figures, optional input L2-norm and
class weights.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The experiment file was declaring /class_mapping as a fresh default
while configs/ml/linear_probe.yaml already had one, which Hydra rejects
as a duplicate. Mark it as an override so the experiment replaces the
base default.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ml/train.py uses @with_cli_args(["+ml=linear_probe"]), so the decorator
already injects that arg. Passing it again on the command line caused
Hydra to load configs/ml/linear_probe.yaml twice and reject duplicate
defaults. Rely on the decorator and pass only +experiment=...

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ng refs

Two interpolation problems prevented Hydra from resolving the linear-probe
config:

1. configs/ml.yaml uses ${random_seed:} and configs/ml/linear_probe.yaml
   uses ${len:...}, but neither resolver is registered anywhere. Register
   both at module import time in ml/train.py.

2. The class_mapping yamls use # @package _global_, so class_mapping,
   class_indices, and class_names land at the config root. The references
   in linear_probe.yaml were doubly nested (e.g. class_mapping.class_mapping).
   Drop the prefix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The filtered tiles parquet collapses ROI columns at tiling time, so
kfold writes canonical names ("Epithelium", etc.) directly into `label`.
The raw→canonical lookup built from the BB-suffixed YAML lists matched
none of these and dropped the entire 1.1M-tile dataset under
drop_unmapped=True.

Extend _raw_to_canonical with identity entries for every canonical class
so modern parquets pass through while legacy un-collapsed labels still
collapse correctly. "background" stays unmapped → dropped, as intended.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
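
The identity-entry fix described in the commit above can be sketched as follows. This is a minimal illustration, not the repository's `_raw_to_canonical`: the raw label strings, the "Stroma" class, and the helper names `build_raw_to_canonical` and `map_labels` are assumptions made for the example.

```python
# Hypothetical class mapping: canonical class -> raw (BB-suffixed) labels.
# The real lists live in the class_mapping YAML files and are not shown here.
CLASS_MAPPING = {
    "Epithelium": ["Epithelium BB", "Epithelium_BB"],
    "Stroma": ["Stroma BB"],
}


def build_raw_to_canonical(class_mapping):
    """Map every raw label to its canonical class, plus canonical -> itself."""
    lookup = {}
    for canonical, raw_labels in class_mapping.items():
        for raw in raw_labels:
            lookup[raw] = canonical
        # Identity entry: labels that were already collapsed pass through.
        lookup[canonical] = canonical
    return lookup


def map_labels(labels, lookup, drop_unmapped=True):
    """Translate labels; unmapped labels are dropped or returned as None."""
    mapped = [lookup.get(label) for label in labels]
    if drop_unmapped:
        return [m for m in mapped if m is not None]
    return mapped


lookup = build_raw_to_canonical(CLASS_MAPPING)
# A modern (canonical) label, a legacy raw label, and an unmapped label.
result = map_labels(["Epithelium", "Stroma BB", "background"], lookup)
```

Without the identity entries, the canonical "Epithelium" label would miss the lookup and be dropped along with "background", which is the failure mode the commit message describes.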
- Add EmbeddingsDataModule.compute_class_weights("balanced"|"inverse")
  using sklearn-style weights from the current train fold.
- train.py resolves class_weights="balanced"/"inverse" via the
  datamodule and passes the resulting list to LinearProbe at instantiate
  time (per-fold, since splits change).
- Bump class_coverage_min from 0.0 to 0.5 to drop mosaic tiles.
- Drop the redundant /class_mapping default from configs/ml/linear_probe.yaml;
  experiment files now own the choice.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extract derive_labels logic to shared preprocessing/_labels.py, then use it in
both split/kfold_split.py and the new embedding_dataset pipeline. The new
pipeline joins k-fold (train) / filter_tiles (test) tile metadata with
precomputed embeddings after applying tissue + per-dominant-class ROI thresholds,
and emits a SlidesTilesLoader-compatible Parquet dataset as an MLflow artifact.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Joining 1M+ rows of list<double> embeddings was either OOMing on
to_pandas() or hitting int32 list-offset overflow inside take(). The fix:
- read embeddings into Arrow only and cast each chunk to large_list so
  take() concatenation uses int64 offsets;
- run the join on keys plus a synthetic row index because Acero refuses
  list columns in non-key fields, then pull embeddings via take();
- combine_chunks() before take() for an O(N) single-pass copy;
- write the parquet straight from Arrow, never materialising the
  embedding column in pandas.

Also bumps the kube job memory to 64Gi to give the combined-chunks +
take() peak some headroom, and trims the verbose [timing] prints down
to one progress line per split.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
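
The int32 overflow mentioned above follows from simple arithmetic: an Arrow `list` array indexes its child values with 32-bit offsets, while `large_list` uses 64-bit offsets. The sketch below checks whether a single concatenated embedding column fits. The row count is the 1.1M-tile figure quoted in an earlier commit message; the embedding width of 2560 is an assumption for illustration and is not stated in this PR.

```python
INT32_MAX = 2**31 - 1  # largest offset a 32-bit Arrow list array can hold


def total_child_values(num_rows, embedding_dim):
    """Number of float values stored once all chunks are concatenated."""
    return num_rows * embedding_dim


def fits_int32_offsets(num_rows, embedding_dim):
    """True if one list<double> array of this shape stays within int32 offsets."""
    return total_child_values(num_rows, embedding_dim) <= INT32_MAX


NUM_ROWS = 1_100_000   # tile count quoted in an earlier commit message
EMBEDDING_DIM = 2560   # assumed embedding width, for illustration only

total = total_child_values(NUM_ROWS, EMBEDDING_DIM)
# Largest row count that still fits in one int32-offset array at this width.
max_rows = INT32_MAX // EMBEDDING_DIM
```

Under these assumptions the concatenated column holds about 2.8 billion values, above the roughly 2.1 billion an int32 offset can address, so casting to `large_list` before `take()` is needed once the row count passes `max_rows`.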
Without this guard a malformed train artifact would crash deep inside
apply_thresholds with a confusing KeyError. Surface a clear error that
points at the expected upstream artifact instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
vojtech-cifka and others added 15 commits May 18, 2026 20:12
setup(stage="fit") replaces criterion with class-weighted CrossEntropyLoss,
adding a criterion.weight buffer that gets saved to checkpoints. At test,
Lightning restores the checkpoint before setup() runs, so the model still
has the unweighted criterion from __init__ and strict load fails with
"Unexpected key(s) in state_dict: criterion.weight". Affected both adamw
and lbfgs test runs.

Initialize criterion with a placeholder ones-weight sized num_classes so
the criterion.weight key always exists; setup(fit) still overrides it with
the real class-balanced weights.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Clear the batch buffer only on rank!=0 or after a successful write so the
on_test_end fallback no longer hits an always-empty buffer. Add diagnostic
prints to the silent early-return guards and an idempotency flag so the
two write hooks cooperate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
# Conflicts:
#	configs/experiment/ml/linear_classifier_test_adamw.yaml
#	configs/experiment/ml/linear_classifier_test_lbfgs.yaml
#	configs/experiment/preprocessing/embeddings_virchow2_tissue_tiles_05mpp.yaml
#	configs/ml/task/final_linear_classifier.yaml
#	configs/preprocessing/embeddings.yaml
#	ml/callbacks/tiff_prediction_map_writer.py
#	preprocessing/embeddings.py
@vojtech-cifka vojtech-cifka requested review from Adames4 and vejtek May 19, 2026 09:33
@vojtech-cifka vojtech-cifka self-assigned this May 19, 2026
@vojtech-cifka vojtech-cifka requested a review from a team May 19, 2026 09:33

coderabbitai Bot commented May 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 215fe430-98e6-4aa2-9400-28cc84003653

📥 Commits

Reviewing files that changed from the base of the PR and between 67d3ef4 and 191892c.

📒 Files selected for processing (5)
  • configs/experiment/ml/test_linear_virchow2_lbfgs.yaml
  • configs/experiment/preprocessing/embeddings_virchow2_tissue_tiles_0_5mpp.yaml
  • ml/callbacks/tiff_prediction_map_writer.py
  • ml/data/datasets/embedding_tiles.py
  • preprocessing/embeddings.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • preprocessing/embeddings.py
  • ml/data/datasets/embedding_tiles.py
  • configs/experiment/preprocessing/embeddings_virchow2_tissue_tiles_0_5mpp.yaml

📝 Walkthrough


Parameterizes embedding selection and dims, wires slide metadata into embedding datasets, enriches per‑slide MLflow logs with slide names, adds deterministic slide-budget sampling to preprocessing, sanitizes TIFF output names, and introduces ProvGigaPath train/final/test experiment configs plus related defaults updates and a legacy config removal.

Changes

Embedding Model Parameterization and Slide Metadata Infrastructure

| Layer | File(s) | Summary |
|---|---|---|
| Task and model config parameterization | configs/ml/task/final_linear_classifier.yaml, configs/ml/task/kfold_linear_classifier.yaml, configs/ml/model/linear_classifier.yaml, configs/ml/trainer/early_stopping.yaml, configs/preprocessing/embeddings.yaml | Task configs now accept embedding_model_name and embedding_dim; model decode_head.in_features references ${embedding_dim}; early_stopping now sets save_last: true; preprocessing adds slide_sample_max_tiles and slide_sample_seed. |
| Slide metadata loading and wiring | ml/data/datasets/embedding_tiles.py, configs/ml/data/final_embedding_tiles.yaml, configs/experiment/preprocessing/embeddings_virchow2_tissue_tiles_0_5mpp.yaml | _BaseEmbeddingTilesDataset and subclasses accept optional slide_metadata_uri; _load_slide_names() reads slide id↔basename mappings from parquet; final embedding dataset config wires test_slide_metadata_uri; experiment preprocessing config documents sampling controls. |
| Per-slide accuracy logging with slide names | ml/meta_arch.py | _log_per_slide_accuracy uses _test_slide_names_by_id() to obtain slide name mapping and adds slide_name to per-slide MLflow table rows when available. |
| Slide budget sampling | preprocessing/embeddings.py | New select_slide_budget() deterministically selects slides under a tile-budget using a seed and slide ordering; main() reads slide_sample_max_tiles/slide_sample_seed, builds a PyArrow tiles dataset, selects slides when set, filters slides and Ray dataset rows to the selected set. |
| TIFF filename sanitization | ml/callbacks/tiff_prediction_map_writer.py | _slide_prediction_filename now composes a {stem}-{blake2b_digest}.tiff basename and normalizes it via _safe_filename. |
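
The digest-suffixed naming scheme summarized above can be sketched as follows. This is a hedged illustration rather than the PR's code: `safe_filename` is a hypothetical stand-in for `_safe_filename` (whose implementation is not shown here), and the 4-byte digest size, which yields 8 hex characters, is an assumption based on the 8-character prefix suggested in the review.

```python
import hashlib
import re
from pathlib import Path


def safe_filename(name):
    """Hypothetical stand-in for _safe_filename: keep only safe characters."""
    return re.sub(r"[^A-Za-z0-9._-]+", "_", name)


def slide_prediction_filename(path):
    """Build '{stem}-{digest}.tiff' from a slide path, then sanitize it."""
    text = str(path)
    stem = Path(text).stem
    # Hash the unsanitized path so distinct inputs keep distinct names
    # even when sanitization collapses their stems to the same string.
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=4).hexdigest()
    return safe_filename(f"{stem}-{digest}.tiff")


# Two slide paths whose stems sanitize to the same string "case_1".
name_a = slide_prediction_filename("/data/slides/case 1.svs")
name_b = slide_prediction_filename("/data/slides/case_1.svs")
```

Because the digest is computed before sanitization and is deterministic, repeated runs give the same filename for a slide, while two slides that differ only in characters removed by sanitization no longer overwrite each other's output.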

ProvGigaPath Experiment Configurations

| Layer | File(s) | Summary |
|---|---|---|
| Final ProvGigaPath linear classifiers | configs/experiment/ml/final_linear_provgigapath_adamw.yaml, configs/experiment/ml/final_linear_provgigapath_lbfgs.yaml | New final experiment configs for ProvGigaPath embeddings that inherit Virchow2 final configs and set embedding model name/dim, kfold wiring, MLflow paths, model weight decay, and run metadata. |
| ProvGigaPath training configurations | configs/experiment/ml/train_linear_provgigapath_adamw_group_kfold.yaml, configs/experiment/ml/train_linear_provgigapath_lbfgs_group_kfold.yaml | Adds AdamW group-kfold training config and an LBFGS variant that references the AdamW config; templated metadata includes dataset, kfold, fold, optimizer, and weight decay; LBFGS config sets data.num_workers: 0. |
| ProvGigaPath and Virchow2 test configurations | configs/experiment/ml/test_linear_provgigapath_adamw.yaml, configs/experiment/ml/test_linear_provgigapath_lbfgs.yaml, configs/experiment/ml/test_linear_virchow2_adamw.yaml, configs/experiment/ml/test_linear_virchow2_lbfgs.yaml | Adds ProvGigaPath test configs with MLflow checkpoint URIs and test-mode settings; updates Virchow2 test defaults to reference final_linear_virchow2_*. |
| Training defaults updates and legacy removal | configs/experiment/ml/train_linear_virchow2_lbfgs_group_kfold.yaml, configs/experiment/ml/linear_classifier_adamw_stratified_kfold.yaml | Updates Virchow2 train defaults reference; clears legacy linear_classifier_adamw_stratified_kfold.yaml contents. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • vejtek
  • Adames4

Poem

🐰 In fields of slides I hop and plot,

Names and budgets tied in a knot,

ProvGigaPath seeds take their place,

Per‑slide logs wear a friendlier face,

TIFFs named safe — a tidy spot.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 11.76% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and accurately summarizes the main changes: adding Prov-GigaPath support with new test workflows and prediction-map utilities, which aligns with the primary additions throughout the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@configs/experiment/preprocessing/embeddings_virchow2_tissue_tiles_0_5mpp.yaml`:
- Around line 14-15: Update the top-of-file header comment to reflect that
embeddings are generated using slide-budget sampling (i.e., sampled per-slide up
to slide_sample_max_tiles with deterministic seed slide_sample_seed) instead of
saying embeddings are generated for “every tile”; edit the header text wherever
it mentions “every tile” (also at the other occurrence noted around line 19) to
clearly state the sampled-slide behavior and reference the
slide_sample_max_tiles and slide_sample_seed parameters.

In `@ml/callbacks/tiff_prediction_map_writer.py`:
- Line 520: The current return call uses
_safe_filename(Path(str(path)).with_suffix(".tiff").name) which can collapse
distinct slide basenames into the same sanitized name; change this to produce a
stable, collision-resistant filename by incorporating a short deterministic hash
of the original path (or original basename) into the filename before
sanitization — for example, compute an sha256 or blake2b of str(path) (or
Path(path).name), take an 8-character prefix, append or insert it into the
basename (e.g., original_basename + "-" + hash + ".tiff"), then pass that
combined string through _safe_filename so the function (and its callers) produce
unique, stable TIFF filenames that prevent silent overwrites.

In `@preprocessing/embeddings.py`:
- Around line 98-108: The budget check currently skips enforcement for the first
sampled slide because the condition uses "if selected and selected_tiles +
tile_count > max_tiles:" so an empty selected list lets an oversized first slide
through; change that condition to always check the budget (e.g., "if
selected_tiles + tile_count > max_tiles: continue") so you never append a slide
that would push selected_tiles over max_tiles, keeping the fallback that uses
counts.sort_values("tile_count") intact if nothing is selected.
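
The budget rule requested in the last comment can be sketched without pandas. This is an illustration under stated assumptions, not the PR's `select_slide_budget()`: the dict-based signature, the seeded shuffle as the slide ordering, and a fallback that picks the single smallest slide are all assumptions made for the example.

```python
import random


def select_slide_budget(tile_counts, max_tiles, seed):
    """Pick slides in a seeded random order without exceeding max_tiles.

    tile_counts maps slide id -> number of tiles (hypothetical input shape).
    """
    slide_ids = sorted(tile_counts)  # stable order before shuffling
    random.Random(seed).shuffle(slide_ids)

    selected, selected_tiles = [], 0
    for slide_id in slide_ids:
        tile_count = tile_counts[slide_id]
        # Always enforce the budget, including for the first candidate.
        if selected_tiles + tile_count > max_tiles:
            continue
        selected.append(slide_id)
        selected_tiles += tile_count

    if not selected and tile_counts:
        # Assumed fallback: take the smallest slide so the result is not empty.
        selected = [min(sorted(tile_counts), key=tile_counts.get)]
    return selected


counts = {"a": 500, "b": 30, "c": 40, "d": 50}
picked = select_slide_budget(counts, max_tiles=100, seed=0)
fallback = select_slide_budget({"x": 500, "y": 300}, max_tiles=100, seed=0)
```

With the `if selected and ...` guard from the original code, slide "a" could be accepted whenever the shuffle placed it first, which would exceed the budget fivefold. Checking the budget unconditionally restricts the over-budget case to the explicit fallback.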

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ab2803b2-bc8c-46f5-adb9-6738e7db1023

📥 Commits

Reviewing files that changed from the base of the PR and between 2344026 and 67d3ef4.

📒 Files selected for processing (31)
  • configs/experiment/ml/final_linear_provgigapath_adamw.yaml
  • configs/experiment/ml/final_linear_provgigapath_lbfgs.yaml
  • configs/experiment/ml/final_linear_virchow2_adamw.yaml
  • configs/experiment/ml/final_linear_virchow2_lbfgs.yaml
  • configs/experiment/ml/linear_classifier_adamw_stratified_kfold.yaml
  • configs/experiment/ml/predict_linear_virchow2_lbfgs_tissue_tiles.yaml
  • configs/experiment/ml/test_linear_provgigapath_adamw.yaml
  • configs/experiment/ml/test_linear_provgigapath_lbfgs.yaml
  • configs/experiment/ml/test_linear_virchow2_adamw.yaml
  • configs/experiment/ml/test_linear_virchow2_lbfgs.yaml
  • configs/experiment/ml/train_linear_provgigapath_adamw_group_kfold.yaml
  • configs/experiment/ml/train_linear_provgigapath_lbfgs_group_kfold.yaml
  • configs/experiment/ml/train_linear_virchow2_adamw_group_kfold.yaml
  • configs/experiment/ml/train_linear_virchow2_lbfgs_group_kfold.yaml
  • configs/experiment/preprocessing/embeddings_provgigapath_0_5mpp.yaml
  • configs/experiment/preprocessing/embeddings_virchow2_0_5mpp.yaml
  • configs/experiment/preprocessing/embeddings_virchow2_tissue_tiles_0_5mpp.yaml
  • configs/experiment/preprocessing/tile_masks_0_5mpp.yaml
  • configs/experiment/preprocessing/tiling_0_5mpp.yaml
  • configs/experiment/preprocessing/tissue_masks_2mpp.yaml
  • configs/experiment/preprocessing/tissue_stats_0_5mpp.yaml
  • configs/ml/data/final_embedding_tiles.yaml
  • configs/ml/model/linear_classifier.yaml
  • configs/ml/task/final_linear_classifier.yaml
  • configs/ml/task/kfold_linear_classifier.yaml
  • configs/ml/trainer/early_stopping.yaml
  • configs/preprocessing/embeddings.yaml
  • ml/callbacks/tiff_prediction_map_writer.py
  • ml/data/datasets/embedding_tiles.py
  • ml/meta_arch.py
  • preprocessing/embeddings.py
💤 Files with no reviewable changes (1)
  • configs/experiment/ml/linear_classifier_adamw_stratified_kfold.yaml

Comment thread ml/callbacks/tiff_prediction_map_writer.py Outdated
Comment thread preprocessing/embeddings.py Outdated

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds support for ProvGigaPath embeddings, implements a slide sampling strategy in the embedding preprocessing pipeline to respect tile budgets, and improves test logging by including slide names. It also parameterizes model dimensions and updates several training and testing configurations. Feedback highlights a potential runtime error due to an unsupported parameter in a YAML config and suggests a minor optimization to remove redundant column selection in a data loading loop.

Comment thread configs/experiment/ml/test_linear_virchow2_lbfgs.yaml Outdated
Comment thread ml/data/datasets/embedding_tiles.py Outdated
@vojtech-cifka vojtech-cifka merged commit 8a8f76e into master May 19, 2026
2 of 3 checks passed
@vojtech-cifka vojtech-cifka deleted the feature/provgigapath-metrics-test branch May 19, 2026 17:21

3 participants