feat: Add Prov-GigaPath linear probe test workflows and prediction-map utilities#15
Conversation
Use the `label` and `fold` columns produced by the upstream k-fold split instead of deriving labels from coverage columns and randomly splitting val. Memory-mapped via HuggingFace datasets so the full embedding parquet no longer has to fit in numpy. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ddings Datamodule downloads embeddings + kfold artifacts from MLflow, joins on (slide_id, x, y) via pyarrow, applies class mapping, tissue/class coverage filters, and exposes per-fold splits via set_val_fold(). Training script loops folds in a single run and logs per-fold + aggregate metrics. Probe adds per-class F1, confusion matrix figures, optional input L2-norm and class weights. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The experiment file was declaring /class_mapping as a fresh default while configs/ml/linear_probe.yaml already had one, which Hydra rejects as a duplicate. Mark it as an override so the experiment replaces the base default. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ml/train.py uses @with_cli_args(["+ml=linear_probe"]), so the decorator already injects that arg. Passing it again on the command line caused Hydra to load configs/ml/linear_probe.yaml twice and reject duplicate defaults. Rely on the decorator and pass only +experiment=... Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ng refs
Two interpolation problems prevented Hydra from resolving the linear-probe
config:
1. configs/ml.yaml uses ${random_seed:} and configs/ml/linear_probe.yaml
uses ${len:...}, but neither resolver is registered anywhere. Register
both at module import time in ml/train.py.
2. The class_mapping yamls use # @Package _global_, so class_mapping,
class_indices, and class_names land at the config root. The references
in linear_probe.yaml were doubly nested (e.g. class_mapping.class_mapping).
Drop the prefix.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The filtered tiles parquet collapses ROI columns at tiling time, so
kfold writes canonical names ("Epithelium", etc.) directly into `label`.
The raw→canonical lookup built from the BB-suffixed YAML lists matched
none of these and dropped the entire 1.1M-tile dataset under
drop_unmapped=True.
Extend _raw_to_canonical with identity entries for every canonical class
so modern parquets pass through while legacy un-collapsed labels still
collapse correctly. "background" stays unmapped → dropped, as intended.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add EmbeddingsDataModule.compute_class_weights("balanced"|"inverse")
using sklearn-style weights from the current train fold.
- train.py resolves class_weights="balanced"/"inverse" via the
datamodule and passes the resulting list to LinearProbe at instantiate
time (per-fold, since splits change).
- Bump class_coverage_min from 0.0 to 0.5 to drop mosaic tiles.
- Drop the redundant /class_mapping default from configs/ml/linear_probe.yaml;
experiment files now own the choice.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extract derive_labels logic to shared preprocessing/_labels.py, then use it in both split/kfold_split.py and the new embedding_dataset pipeline. The new pipeline joins k-fold (train) / filter_tiles (test) tile metadata with precomputed embeddings after applying tissue + per-dominant-class ROI thresholds, and emits a SlidesTilesLoader-compatible Parquet dataset as an MLflow artifact. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Joining 1M+ rows of list<double> embeddings was either OOMing on to_pandas() or hitting int32 list-offset overflow inside take(). The fix: - read embeddings into Arrow only and cast each chunk to large_list so take() concatenation uses int64 offsets; - run the join on keys plus a synthetic row index because Acero refuses list columns in non-key fields, then pull embeddings via take(); - combine_chunks() before take() for an O(N) single-pass copy; - write the parquet straight from Arrow, never materialising the embedding column in pandas. Also bumps the kube job memory to 64Gi to give the combined-chunks + take() peak some headroom, and trims the verbose [timing] prints down to one progress line per split. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without this guard a malformed train artifact would crash deep inside apply_thresholds with a confusing KeyError. Surface a clear error that points at the expected upstream artifact instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
setup(stage="fit") replaces criterion with class-weighted CrossEntropyLoss, adding a criterion.weight buffer that gets saved to checkpoints. At test, Lightning restores the checkpoint before setup() runs, so the model still has the unweighted criterion from __init__ and strict load fails with "Unexpected key(s) in state_dict: criterion.weight". Affected both adamw and lbfgs test runs. Initialize criterion with a placeholder ones-weight sized num_classes so the criterion.weight key always exists; setup(fit) still overrides it with the real class-balanced weights. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Clear the batch buffer only on rank!=0 or after a successful write so the on_test_end fallback no longer hits an always-empty buffer. Add diagnostic prints to the silent early-return guards and an idempotency flag so the two write hooks cooperate. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
# Conflicts: # configs/experiment/ml/linear_classifier_test_adamw.yaml # configs/experiment/ml/linear_classifier_test_lbfgs.yaml # configs/experiment/preprocessing/embeddings_virchow2_tissue_tiles_05mpp.yaml # configs/ml/task/final_linear_classifier.yaml # configs/preprocessing/embeddings.yaml # ml/callbacks/tiff_prediction_map_writer.py # preprocessing/embeddings.py
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (5)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 WalkthroughWalkthroughParameterizes embedding selection and dims, wires slide metadata into embedding datasets, enriches per‑slide MLflow logs with slide names, adds deterministic slide-budget sampling to preprocessing, sanitizes TIFF output names, and introduces ProvGigaPath train/final/test experiment configs plus related defaults updates and a legacy config removal. ChangesEmbedding Model Parameterization and Slide Metadata Infrastructure
ProvGigaPath Experiment Configurations
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@configs/experiment/preprocessing/embeddings_virchow2_tissue_tiles_0_5mpp.yaml`:
- Around line 14-15: Update the top-of-file header comment to reflect that
embeddings are generated using slide-budget sampling (i.e., sampled per-slide up
to slide_sample_max_tiles with deterministic seed slide_sample_seed) instead of
saying embeddings are generated for “every tile”; edit the header text wherever
it mentions “every tile” (also at the other occurrence noted around line 19) to
clearly state the sampled-slide behavior and reference the
slide_sample_max_tiles and slide_sample_seed parameters.
In `@ml/callbacks/tiff_prediction_map_writer.py`:
- Line 520: The current return call uses
_safe_filename(Path(str(path)).with_suffix(".tiff").name) which can collapse
distinct slide basenames into the same sanitized name; change this to produce a
stable, collision-resistant filename by incorporating a short deterministic hash
of the original path (or original basename) into the filename before
sanitization — for example, compute an sha256 or blake2b of str(path) (or
Path(path).name), take an 8-character prefix, append or insert it into the
basename (e.g., original_basename + "-" + hash + ".tiff"), then pass that
combined string through _safe_filename so the function (and its callers) produce
unique, stable TIFF filenames that prevent silent overwrites.
In `@preprocessing/embeddings.py`:
- Around line 98-108: The budget check currently skips enforcement for the first
sampled slide because the condition uses "if selected and selected_tiles +
tile_count > max_tiles:" so an empty selected list lets an oversized first slide
through; change that condition to always check the budget (e.g., "if
selected_tiles + tile_count > max_tiles: continue") so you never append a slide
that would push selected_tiles over max_tiles, keeping the fallback that uses
counts.sort_values("tile_count") intact if nothing is selected.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ab2803b2-bc8c-46f5-adb9-6738e7db1023
📒 Files selected for processing (31)
configs/experiment/ml/final_linear_provgigapath_adamw.yamlconfigs/experiment/ml/final_linear_provgigapath_lbfgs.yamlconfigs/experiment/ml/final_linear_virchow2_adamw.yamlconfigs/experiment/ml/final_linear_virchow2_lbfgs.yamlconfigs/experiment/ml/linear_classifier_adamw_stratified_kfold.yamlconfigs/experiment/ml/predict_linear_virchow2_lbfgs_tissue_tiles.yamlconfigs/experiment/ml/test_linear_provgigapath_adamw.yamlconfigs/experiment/ml/test_linear_provgigapath_lbfgs.yamlconfigs/experiment/ml/test_linear_virchow2_adamw.yamlconfigs/experiment/ml/test_linear_virchow2_lbfgs.yamlconfigs/experiment/ml/train_linear_provgigapath_adamw_group_kfold.yamlconfigs/experiment/ml/train_linear_provgigapath_lbfgs_group_kfold.yamlconfigs/experiment/ml/train_linear_virchow2_adamw_group_kfold.yamlconfigs/experiment/ml/train_linear_virchow2_lbfgs_group_kfold.yamlconfigs/experiment/preprocessing/embeddings_provgigapath_0_5mpp.yamlconfigs/experiment/preprocessing/embeddings_virchow2_0_5mpp.yamlconfigs/experiment/preprocessing/embeddings_virchow2_tissue_tiles_0_5mpp.yamlconfigs/experiment/preprocessing/tile_masks_0_5mpp.yamlconfigs/experiment/preprocessing/tiling_0_5mpp.yamlconfigs/experiment/preprocessing/tissue_masks_2mpp.yamlconfigs/experiment/preprocessing/tissue_stats_0_5mpp.yamlconfigs/ml/data/final_embedding_tiles.yamlconfigs/ml/model/linear_classifier.yamlconfigs/ml/task/final_linear_classifier.yamlconfigs/ml/task/kfold_linear_classifier.yamlconfigs/ml/trainer/early_stopping.yamlconfigs/preprocessing/embeddings.yamlml/callbacks/tiff_prediction_map_writer.pyml/data/datasets/embedding_tiles.pyml/meta_arch.pypreprocessing/embeddings.py
💤 Files with no reviewable changes (1)
- configs/experiment/ml/linear_classifier_adamw_stratified_kfold.yaml
There was a problem hiding this comment.
Code Review
This pull request adds support for ProvGigaPath embeddings, implements a slide sampling strategy in the embedding preprocessing pipeline to respect tile budgets, and improves test logging by including slide names. It also parameterizes model dimensions and updates several training and testing configurations. Feedback highlights a potential runtime error due to an unsupported parameter in a YAML config and suggests a minor optimization to remove redundant column selection in a data loading loop.
Description
This branch extends the linear-probe workflow to cover Prov-GigaPath alongside Virchow2 and cleans up experiment naming.
Added:
Summary by CodeRabbit
New Features
Improvements