feat: Add Prov-GigaPath linear probe test workflows and prediction-map utilities by vojtech-cifka · Pull Request #15 · RationAI/tissue-classification

vojtech-cifka · 2026-05-19T09:33:16Z

Description
This branch extends the linear-probe workflow to cover Prov-GigaPath alongside Virchow2 and cleans up experiment naming.

Added:

Prov-GigaPath k-fold training configs for AdamW and LBFGS.
Prov-GigaPath final training configs with selected weight decay values.
Prov-GigaPath held-out test configs loading final checkpoints.
Backbone-aware linear head input dimension via ${embedding_dim}.
Slide-name metadata in per-slide test accuracy exports.
Optional slide-budget sampling for embedding generation.
Safer prediction-map filenames.
Renamed ML and preprocessing experiment configs for clearer backbone/model naming.

Summary by CodeRabbit

New Features
- Added support for ProvGigaPath embeddings across training and testing workflows
- Slide-level metadata (slide names) is now included in per-slide test outputs and logs
Improvements
- Prediction output filenames now include a short digest for stable, sanitized names
- Slide sampling introduced with configurable tile-budget and seed for reproducible preprocessing
- Checkpointing now retains the last checkpoint automatically

Use the `label` and `fold` columns produced by the upstream k-fold split instead of deriving labels from coverage columns and randomly splitting val. Memory-mapped via HuggingFace datasets so the full embedding parquet no longer has to fit in numpy. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ddings Datamodule downloads embeddings + kfold artifacts from MLflow, joins on (slide_id, x, y) via pyarrow, applies class mapping, tissue/class coverage filters, and exposes per-fold splits via set_val_fold(). Training script loops folds in a single run and logs per-fold + aggregate metrics. Probe adds per-class F1, confusion matrix figures, optional input L2-norm and class weights. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The experiment file was declaring /class_mapping as a fresh default while configs/ml/linear_probe.yaml already had one, which Hydra rejects as a duplicate. Mark it as an override so the experiment replaces the base default. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ml/train.py uses @with_cli_args(["+ml=linear_probe"]), so the decorator already injects that arg. Passing it again on the command line caused Hydra to load configs/ml/linear_probe.yaml twice and reject duplicate defaults. Rely on the decorator and pass only +experiment=... Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@Package

…ng refs Two interpolation problems prevented Hydra from resolving the linear-probe config:
1. configs/ml.yaml uses ${random_seed:} and configs/ml/linear_probe.yaml uses ${len:...}, but neither resolver is registered anywhere. Register both at module import time in ml/train.py.
2. The class_mapping yamls use `# @package _global_`, so class_mapping, class_indices, and class_names land at the config root. The references in linear_probe.yaml were doubly nested (e.g. class_mapping.class_mapping). Drop the prefix.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The filtered tiles parquet collapses ROI columns at tiling time, so kfold writes canonical names ("Epithelium", etc.) directly into `label`. The raw→canonical lookup built from the BB-suffixed YAML lists matched none of these and dropped the entire 1.1M-tile dataset under drop_unmapped=True. Extend _raw_to_canonical with identity entries for every canonical class so modern parquets pass through while legacy un-collapsed labels still collapse correctly. "background" stays unmapped → dropped, as intended. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
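The identity-entry fix described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repository's `_raw_to_canonical`: the function names, the shape of the mapping (canonical class to a list of raw labels), and the `_BB`-suffixed raw names are hypothetical.

```python
def build_raw_to_canonical(class_mapping: dict[str, list[str]]) -> dict[str, str]:
    """Build a raw-label -> canonical-class lookup with identity entries."""
    lookup: dict[str, str] = {}
    for canonical, raw_labels in class_mapping.items():
        for raw in raw_labels:
            lookup[raw] = canonical
        # Identity entry: labels that are already canonical pass through.
        lookup.setdefault(canonical, canonical)
    return lookup


def map_labels(labels: list[str], lookup: dict[str, str], drop_unmapped: bool = True) -> list[str]:
    """Map labels to canonical names, optionally dropping unmapped ones."""
    mapped = []
    for label in labels:
        canonical = lookup.get(label)
        if canonical is None:
            if drop_unmapped:
                continue  # e.g. "background" has no entry and is dropped
            canonical = label
        mapped.append(canonical)
    return mapped


# Hypothetical mapping: raw names with a "_BB" suffix collapse to canonical ones.
mapping = {"Epithelium": ["Epithelium_BB"], "Stroma": ["Stroma_BB"]}
lookup = build_raw_to_canonical(mapping)
```

With this lookup, a parquet that already stores `"Epithelium"` keeps its rows, a legacy `"Stroma_BB"` label still collapses to `"Stroma"`, and `"background"` stays unmapped and is dropped when `drop_unmapped=True`.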

- Add EmbeddingsDataModule.compute_class_weights("balanced"|"inverse") using sklearn-style weights from the current train fold.
- train.py resolves class_weights="balanced"/"inverse" via the datamodule and passes the resulting list to LinearProbe at instantiate time (per-fold, since splits change).
- Bump class_coverage_min from 0.0 to 0.5 to drop mosaic tiles.
- Drop the redundant /class_mapping default from configs/ml/linear_probe.yaml; experiment files now own the choice.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Extract derive_labels logic to shared preprocessing/_labels.py, then use it in both split/kfold_split.py and the new embedding_dataset pipeline. The new pipeline joins k-fold (train) / filter_tiles (test) tile metadata with precomputed embeddings after applying tissue + per-dominant-class ROI thresholds, and emits a SlidesTilesLoader-compatible Parquet dataset as an MLflow artifact. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ataset

Joining 1M+ rows of list<double> embeddings was either OOMing on to_pandas() or hitting int32 list-offset overflow inside take(). The fix:
- read embeddings into Arrow only and cast each chunk to large_list so take() concatenation uses int64 offsets;
- run the join on keys plus a synthetic row index because Acero refuses list columns in non-key fields, then pull embeddings via take();
- combine_chunks() before take() for an O(N) single-pass copy;
- write the parquet straight from Arrow, never materialising the embedding column in pandas.
Also bumps the kube job memory to 64Gi to give the combined-chunks + take() peak some headroom, and trims the verbose [timing] prints down to one progress line per split.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Without this guard a malformed train artifact would crash deep inside apply_thresholds with a confusing KeyError. Surface a clear error that points at the expected upstream artifact instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

setup(stage="fit") replaces criterion with class-weighted CrossEntropyLoss, adding a criterion.weight buffer that gets saved to checkpoints. At test, Lightning restores the checkpoint before setup() runs, so the model still has the unweighted criterion from __init__ and strict load fails with "Unexpected key(s) in state_dict: criterion.weight". Affected both adamw and lbfgs test runs. Initialize criterion with a placeholder ones-weight sized num_classes so the criterion.weight key always exists; setup(fit) still overrides it with the real class-balanced weights. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Clear the batch buffer only on rank!=0 or after a successful write so the on_test_end fallback no longer hits an always-empty buffer. Add diagnostic prints to the silent early-return guards and an idempotency flag so the two write hooks cooperate. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…-test

# Conflicts:
# configs/experiment/ml/linear_classifier_test_adamw.yaml
# configs/experiment/ml/linear_classifier_test_lbfgs.yaml
# configs/experiment/preprocessing/embeddings_virchow2_tissue_tiles_05mpp.yaml
# configs/ml/task/final_linear_classifier.yaml
# configs/preprocessing/embeddings.yaml
# ml/callbacks/tiff_prediction_map_writer.py
# preprocessing/embeddings.py

…h-metrics-test

coderabbitai · 2026-05-19T09:33:32Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 215fe430-98e6-4aa2-9400-28cc84003653

📥 Commits

Reviewing files that changed from the base of the PR and between 67d3ef4 and 191892c.

📒 Files selected for processing (5)

configs/experiment/ml/test_linear_virchow2_lbfgs.yaml
configs/experiment/preprocessing/embeddings_virchow2_tissue_tiles_0_5mpp.yaml
ml/callbacks/tiff_prediction_map_writer.py
ml/data/datasets/embedding_tiles.py
preprocessing/embeddings.py

🚧 Files skipped from review as they are similar to previous changes (3)

preprocessing/embeddings.py
ml/data/datasets/embedding_tiles.py
configs/experiment/preprocessing/embeddings_virchow2_tissue_tiles_0_5mpp.yaml

📝 Walkthrough


Parameterizes embedding selection and dims, wires slide metadata into embedding datasets, enriches per‑slide MLflow logs with slide names, adds deterministic slide-budget sampling to preprocessing, sanitizes TIFF output names, and introduces ProvGigaPath train/final/test experiment configs plus related defaults updates and a legacy config removal.

Changes

Embedding Model Parameterization and Slide Metadata Infrastructure

| Layer / File(s) | Summary |
| --- | --- |
| Task and model config parameterization `configs/ml/task/final_linear_classifier.yaml`, `configs/ml/task/kfold_linear_classifier.yaml`, `configs/ml/model/linear_classifier.yaml`, `configs/ml/trainer/early_stopping.yaml`, `configs/preprocessing/embeddings.yaml` | Task configs now accept `embedding_model_name` and `embedding_dim`; model `decode_head.in_features` references `${embedding_dim}`; `early_stopping` now sets `save_last: true`; preprocessing adds `slide_sample_max_tiles` and `slide_sample_seed`. |
| Slide metadata loading and wiring `ml/data/datasets/embedding_tiles.py`, `configs/ml/data/final_embedding_tiles.yaml`, `configs/experiment/preprocessing/embeddings_virchow2_tissue_tiles_0_5mpp.yaml` | `_BaseEmbeddingTilesDataset` and subclasses accept optional `slide_metadata_uri`; `_load_slide_names()` reads slide id↔basename mappings from parquet; final embedding dataset config wires `test_slide_metadata_uri`; experiment preprocessing config documents sampling controls. |
| Per-slide accuracy logging with slide names `ml/meta_arch.py` | `_log_per_slide_accuracy` uses `_test_slide_names_by_id()` to obtain slide name mapping and adds `slide_name` to per-slide MLflow table rows when available. |
| Slide budget sampling `preprocessing/embeddings.py` | New `select_slide_budget()` deterministically selects slides under a tile-budget using a seed and slide ordering; `main()` reads `slide_sample_max_tiles`/`slide_sample_seed`, builds a PyArrow tiles dataset, selects slides when set, filters `slides` and Ray dataset rows to the selected set. |
| TIFF filename sanitization `ml/callbacks/tiff_prediction_map_writer.py` | `_slide_prediction_filename` now composes a `{stem}-{blake2b_digest}.tiff` basename and normalizes it via `_safe_filename`. |
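
The `{stem}-{blake2b_digest}.tiff` naming scheme from the last row can be sketched with the standard library. This is a hedged illustration rather than the code in `ml/callbacks/tiff_prediction_map_writer.py`: the helper names, the character whitelist used for sanitization, and the 8-hex-character digest length are assumptions.

```python
import hashlib
import re
from pathlib import Path


def safe_filename(name: str) -> str:
    """Replace characters outside a conservative whitelist with underscores.

    Assumption: the real _safe_filename may keep a different character set.
    """
    return re.sub(r"[^A-Za-z0-9._-]", "_", name)


def slide_prediction_filename(path: str) -> str:
    """Build a stable '{stem}-{digest}.tiff' name from the original slide path."""
    stem = Path(path).stem
    # digest_size=4 yields 8 hex characters; hashing the unsanitized path
    # keeps names distinct even when sanitization collapses the stems.
    digest = hashlib.blake2b(path.encode("utf-8"), digest_size=4).hexdigest()
    return safe_filename(f"{stem}-{digest}.tiff")


name_with_space = slide_prediction_filename("/data/slide 1.svs")
name_with_underscore = slide_prediction_filename("/data/slide_1.svs")
```

Both example paths sanitize to the stem `slide_1`, so without the digest they would write to the same TIFF. The digest is computed before sanitization, which is what keeps the two outputs apart while remaining deterministic across runs.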

ProvGigaPath Experiment Configurations

| Layer / File(s) | Summary |
| --- | --- |
| Final ProvGigaPath linear classifiers `configs/experiment/ml/final_linear_provgigapath_adamw.yaml`, `configs/experiment/ml/final_linear_provgigapath_lbfgs.yaml` | New final experiment configs for ProvGigaPath embeddings that inherit Virchow2 final configs and set embedding model name/dim, kfold wiring, MLflow paths, model weight decay, and run metadata. |
| ProvGigaPath training configurations `configs/experiment/ml/train_linear_provgigapath_adamw_group_kfold.yaml`, `configs/experiment/ml/train_linear_provgigapath_lbfgs_group_kfold.yaml` | Adds AdamW group-kfold training config and an LBFGS variant that references the AdamW config; templated metadata includes dataset, kfold, fold, optimizer, and weight decay; LBFGS config sets `data.num_workers: 0`. |
| ProvGigaPath and Virchow2 test configurations `configs/experiment/ml/test_linear_provgigapath_adamw.yaml`, `configs/experiment/ml/test_linear_provgigapath_lbfgs.yaml`, `configs/experiment/ml/test_linear_virchow2_adamw.yaml`, `configs/experiment/ml/test_linear_virchow2_lbfgs.yaml` | Adds ProvGigaPath test configs with MLflow checkpoint URIs and test-mode settings; updates Virchow2 test defaults to reference `final_linear_virchow2_*`. |
| Training defaults updates and legacy removal `configs/experiment/ml/train_linear_virchow2_lbfgs_group_kfold.yaml`, `configs/experiment/ml/linear_classifier_adamw_stratified_kfold.yaml` | Updates Virchow2 train defaults reference; clears legacy `linear_classifier_adamw_stratified_kfold.yaml` contents. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

RationAI/tissue-classification#13: Related ProvGigaPath embedding generation and downstream linear-probe configs.
RationAI/tissue-classification#14: Overlaps changes to TIFF prediction map writer naming/sanitization.
RationAI/tissue-classification#11: Earlier changes to EmbeddingTilesDataset and MetaArch that this PR extends with slide metadata and logging.

Suggested reviewers

vejtek
Adames4

Poem

🐰 In fields of slides I hop and plot,

Names and budgets tied in a knot,

ProvGigaPath seeds take their place,

Per‑slide logs wear a friendlier face,

TIFFs named safe — a tidy spot.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 11.76% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and accurately summarizes the main changes: adding Prov-GigaPath support with new test workflows and prediction-map utilities, which aligns with the primary additions throughout the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@configs/experiment/preprocessing/embeddings_virchow2_tissue_tiles_0_5mpp.yaml`:
- Around line 14-15: Update the top-of-file header comment to reflect that
embeddings are generated using slide-budget sampling (i.e., sampled per-slide up
to slide_sample_max_tiles with deterministic seed slide_sample_seed) instead of
saying embeddings are generated for “every tile”; edit the header text wherever
it mentions “every tile” (also at the other occurrence noted around line 19) to
clearly state the sampled-slide behavior and reference the
slide_sample_max_tiles and slide_sample_seed parameters.

In `@ml/callbacks/tiff_prediction_map_writer.py`:
- Line 520: The current return call uses
_safe_filename(Path(str(path)).with_suffix(".tiff").name) which can collapse
distinct slide basenames into the same sanitized name; change this to produce a
stable, collision-resistant filename by incorporating a short deterministic hash
of the original path (or original basename) into the filename before
sanitization — for example, compute an sha256 or blake2b of str(path) (or
Path(path).name), take an 8-character prefix, append or insert it into the
basename (e.g., original_basename + "-" + hash + ".tiff"), then pass that
combined string through _safe_filename so the function (and its callers) produce
unique, stable TIFF filenames that prevent silent overwrites.

In `@preprocessing/embeddings.py`:
- Around line 98-108: The budget check currently skips enforcement for the first
sampled slide because the condition uses "if selected and selected_tiles +
tile_count > max_tiles:" so an empty selected list lets an oversized first slide
through; change that condition to always check the budget (e.g., "if
selected_tiles + tile_count > max_tiles: continue") so you never append a slide
that would push selected_tiles over max_tiles, keeping the fallback that uses
counts.sort_values("tile_count") intact if nothing is selected.
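
The budget rule requested in the last comment can be sketched in plain Python. This is an illustration under stated assumptions, not the repository's `select_slide_budget()`: the dict-of-counts input is hypothetical (the comment suggests the real function works on a table of per-slide tile counts), and the fallback is assumed to keep the single smallest slide, which the review text implies but does not spell out.

```python
import random


def select_slide_budget(tile_counts: dict[str, int], max_tiles: int, seed: int) -> list[str]:
    """Pick slides in a seeded random order without exceeding max_tiles."""
    # Sort first so the shuffle does not depend on the input ordering.
    slide_ids = sorted(tile_counts)
    random.Random(seed).shuffle(slide_ids)

    selected: list[str] = []
    selected_tiles = 0
    for slide_id in slide_ids:
        tile_count = tile_counts[slide_id]
        # Always enforce the budget, including for the first candidate.
        if selected_tiles + tile_count > max_tiles:
            continue
        selected.append(slide_id)
        selected_tiles += tile_count

    if not selected and tile_counts:
        # Assumed fallback: no slide fits, so keep the smallest one.
        selected = [min(slide_ids, key=lambda s: (tile_counts[s], s))]
    return selected


counts = {"a": 50, "b": 30, "c": 500}
picked = select_slide_budget(counts, max_tiles=100, seed=0)
```

Dropping the `selected and` guard means an oversized slide is skipped even when it is drawn first, so the budget can only be exceeded through the explicit fallback.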


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ab2803b2-bc8c-46f5-adb9-6738e7db1023

📥 Commits

Reviewing files that changed from the base of the PR and between 2344026 and 67d3ef4.

📒 Files selected for processing (31)

configs/experiment/ml/final_linear_provgigapath_adamw.yaml
configs/experiment/ml/final_linear_provgigapath_lbfgs.yaml
configs/experiment/ml/final_linear_virchow2_adamw.yaml
configs/experiment/ml/final_linear_virchow2_lbfgs.yaml
configs/experiment/ml/linear_classifier_adamw_stratified_kfold.yaml
configs/experiment/ml/predict_linear_virchow2_lbfgs_tissue_tiles.yaml
configs/experiment/ml/test_linear_provgigapath_adamw.yaml
configs/experiment/ml/test_linear_provgigapath_lbfgs.yaml
configs/experiment/ml/test_linear_virchow2_adamw.yaml
configs/experiment/ml/test_linear_virchow2_lbfgs.yaml
configs/experiment/ml/train_linear_provgigapath_adamw_group_kfold.yaml
configs/experiment/ml/train_linear_provgigapath_lbfgs_group_kfold.yaml
configs/experiment/ml/train_linear_virchow2_adamw_group_kfold.yaml
configs/experiment/ml/train_linear_virchow2_lbfgs_group_kfold.yaml
configs/experiment/preprocessing/embeddings_provgigapath_0_5mpp.yaml
configs/experiment/preprocessing/embeddings_virchow2_0_5mpp.yaml
configs/experiment/preprocessing/embeddings_virchow2_tissue_tiles_0_5mpp.yaml
configs/experiment/preprocessing/tile_masks_0_5mpp.yaml
configs/experiment/preprocessing/tiling_0_5mpp.yaml
configs/experiment/preprocessing/tissue_masks_2mpp.yaml
configs/experiment/preprocessing/tissue_stats_0_5mpp.yaml
configs/ml/data/final_embedding_tiles.yaml
configs/ml/model/linear_classifier.yaml
configs/ml/task/final_linear_classifier.yaml
configs/ml/task/kfold_linear_classifier.yaml
configs/ml/trainer/early_stopping.yaml
configs/preprocessing/embeddings.yaml
ml/callbacks/tiff_prediction_map_writer.py
ml/data/datasets/embedding_tiles.py
ml/meta_arch.py
preprocessing/embeddings.py

💤 Files with no reviewable changes (1)

configs/experiment/ml/linear_classifier_adamw_stratified_kfold.yaml

gemini-code-assist

Code Review

This pull request adds support for ProvGigaPath embeddings, implements a slide sampling strategy in the embedding preprocessing pipeline to respect tile budgets, and improves test logging by including slide names. It also parameterizes model dimensions and updates several training and testing configurations. Feedback highlights a potential runtime error due to an unsupported parameter in a YAML config and suggests a minor optimization to remove redundant column selection in a data loading loop.

vojtech-cifka and others added 30 commits May 7, 2026 21:43

feat: create ml pipeline for linear probe

24668c3

fix: sort only tiles parquet

894c27b

fix: log join types of tile keys

fc824ad

fix: remove embeddings from the join

11931d1

fix: remove label column

fb6b320

fix: prevent overflow

7434ae9

Merge remote-tracking branch 'origin/master' into feature/linear-probe

1b18daa

feat: add class tresholds and run ids

911bec2

fix: wrong run id

1a02395

Merge remote-tracking branch 'origin/master' into feature/embedding-d…

08d7ba5

…ataset

feat: add timing

b38465e

refactor: use pyarrow to avoid to pandas conversion

bfc9578

fix: join on keys only

eb213c6

fix: typing

c92d9a1

fix: add prints

01cc394

refactor: use combine chunks

cad0d37

chore: remove time

3b0137f

feat: add timing

8df47aa

chore: revert to the previous state

926753d

feat: add prints

b0e9ba4

vojtech-cifka and others added 15 commits May 18, 2026 20:12

fix: criterion weight

e370417

fix: keep space in MUG prediction masks names

632a8f6

fix: log test accuracy as jsons

3cd0243

chore: remove username from the submission script

76e4194

fix: force the entering of the write phase of the prediction maps

597e348

fix: remove username

3829ebd

feat: generate embeddings up to a budget

0b2d38e

Merge branch 'feature/ml-test-mode' into feature/provgigapath-metrics…

ff7c06e

…-test

chore: rename ml experiments for clarity

b3d803a

feat: add original slide name in the per slide statistics

1703c01

Merge remote-tracking branch 'origin/master' into feature/provgigapat…

3c72c4f

…h-metrics-test

refactor: simplify the preprocessing name scripts

67d3ef4

vojtech-cifka requested review from Adames4 and vejtek May 19, 2026 09:33

vojtech-cifka self-assigned this May 19, 2026

vojtech-cifka requested a review from a team May 19, 2026 09:33

coderabbitai Bot reviewed May 19, 2026


Comment thread configs/experiment/preprocessing/embeddings_virchow2_tissue_tiles_0_5mpp.yaml

Comment thread ml/callbacks/tiff_prediction_map_writer.py Outdated

Comment thread preprocessing/embeddings.py Outdated

fix: commnets, generate safer filenames

cd4b19a

gemini-code-assist Bot reviewed May 19, 2026


Comment thread configs/experiment/ml/test_linear_virchow2_lbfgs.yaml Outdated

Comment thread ml/data/datasets/embedding_tiles.py Outdated

vojtech-cifka added 3 commits May 19, 2026 11:45

fix: remove erorr masks

c9b4c67

fix: remove rendundant column selection

df7888f

fix: format

191892c

vejtek approved these changes May 19, 2026


Adames4 approved these changes May 19, 2026


vojtech-cifka merged commit 8a8f76e into master May 19, 2026
2 of 3 checks passed

vojtech-cifka deleted the feature/provgigapath-metrics-test branch May 19, 2026 17:21

vojtech-cifka merged 129 commits into master from feature/provgigapath-metrics-test
