feat: add embedding dataset build pipeline#10
Conversation
Extract derive_labels logic to shared preprocessing/_labels.py, then use it in both split/kfold_split.py and the new embedding_dataset pipeline. The new pipeline joins k-fold (train) / filter_tiles (test) tile metadata with precomputed embeddings after applying tissue + per-dominant-class ROI thresholds, and emits a SlidesTilesLoader-compatible Parquet dataset as an MLflow artifact. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Joining 1M+ rows of list<double> embeddings was either OOMing on to_pandas() or hitting int32 list-offset overflow inside take(). The fix: - read embeddings into Arrow only and cast each chunk to large_list so take() concatenation uses int64 offsets; - run the join on keys plus a synthetic row index because Acero refuses list columns in non-key fields, then pull embeddings via take(); - combine_chunks() before take() for an O(N) single-pass copy; - write the parquet straight from Arrow, never materialising the embedding column in pandas. Also bumps the kube job memory to 64Gi to give the combined-chunks + take() peak some headroom, and trims the verbose [timing] prints down to one progress line per split. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without this guard a malformed train artifact would crash deep inside apply_thresholds with a confusing KeyError. Surface a clear error that points at the expected upstream artifact instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThis PR adds a complete embedding dataset preprocessing pipeline that joins precomputed tile embeddings with k-fold split metadata and filtered tile artifacts. It introduces a reusable label/tissue-proportion helper, defines Hydra configuration for experiment-specific tissue filtering thresholds, implements the core processing logic with PyArrow joins and per-class filtering, and provides a Kubernetes job submission script. ChangesEmbedding Dataset Preprocessing Pipeline
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a new preprocessing module, embedding_dataset.py, designed to join tile metadata with precomputed embeddings. It also refactors the label and tissue proportion calculation into a shared utility and adds the necessary Hydra configurations and job submission scripts. Feedback focuses on improving the scalability and memory efficiency of the PyArrow operations, specifically by using 64-bit integers for row indices to prevent overflow and avoiding unnecessary memory copies when processing large chunked arrays.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
preprocessing/embedding_dataset.py (1)
80-85: 💤 Low valueConsider handling already-large_list embeddings explicitly.
The current check
if pa.types.is_list(emb_col.type)only enters the casting block for regularlisttypes. If the embeddings are alreadylarge_list, they skip the casting logic, which is correct. However, for clarity and future maintainability, consider also checking for and preservinglarge_listexplicitly or adding a comment explaining why onlylistis checked.📝 Optional clarification
emb_col = emb_table.column("embedding") if pa.types.is_list(emb_col.type): + # Cast regular list to large_list to avoid int32 offset overflow in take() target_type = pa.large_list(emb_col.type.value_type) emb_col = pa.chunked_array( [c.cast(target_type) for c in emb_col.chunks], type=target_type ) + elif pa.types.is_large_list(emb_col.type): + # Already large_list, no casting needed + pass🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@preprocessing/embedding_dataset.py` around lines 80 - 85, The embedding column handling only tests pa.types.is_list and skips pa.types.is_large_list which is unclear; update the emb_col logic around emb_table.column("embedding") to explicitly detect pa.types.is_large_list (or pa.types.is_list) and either preserve the large_list as-is or perform the same casting behavior, and add a concise comment explaining why list vs large_list are treated differently; reference emb_col, emb_table.column("embedding"), pa.types.is_list, pa.types.is_large_list, pa.large_list and pa.chunked_array to locate and update the code.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@preprocessing/_labels.py`:
- Around line 10-24: Add a validation at the start of
compute_label_and_tissue_prop to ensure roi_cols is non-empty; explicitly check
"if not roi_cols" and raise a clear ValueError (e.g. "roi_cols must be a
non-empty list of column names") or alternatively return labels of "background"
and zero tissue_prop for each row, so that downstream calls to roi_df =
pd.DataFrame(...) and operations on roi_df (idxmax, sum) cannot fail; reference
the function compute_label_and_tissue_prop and variables roi_cols, roi_df, tp,
lbl when making the change.
---
Nitpick comments:
In `@preprocessing/embedding_dataset.py`:
- Around line 80-85: The embedding column handling only tests pa.types.is_list
and skips pa.types.is_large_list which is unclear; update the emb_col logic
around emb_table.column("embedding") to explicitly detect pa.types.is_large_list
(or pa.types.is_list) and either preserve the large_list as-is or perform the
same casting behavior, and add a concise comment explaining why list vs
large_list are treated differently; reference emb_col,
emb_table.column("embedding"), pa.types.is_list, pa.types.is_large_list,
pa.large_list and pa.chunked_array to locate and update the code.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 44745c5a-5b3a-4578-9628-2aa0b4dcb7ca
📒 Files selected for processing (7)
configs/data/dataset.yamlconfigs/experiment/preprocessing/embedding_dataset.yamlconfigs/preprocessing/embedding_dataset.yamlpreprocessing/_labels.pypreprocessing/embedding_dataset.pyscripts/submit_embedding_dataset.pysplit/kfold_split.py
Summary
Adds a Hydra/MLflow pipeline that joins precomputed tile embeddings with k-fold
(train) and filter_tiles (test) metadata to produce a training-ready Parquet
dataset consumable by
SlidesTilesLoader.Changes
preprocessing/embedding_dataset.py— main pipeline:apply_thresholds: filters bytissue_prop_min, drops tiles covered bytwo or more distinct ROI classes, then applies per-class argmax-threshold.
join_embeddings: Arrow-native join on(slide_id, x, y)using asynthetic index to avoid Acero's limitation with list columns; casts
embedding column to
large_listto prevent int32 offset overflow ontake().process_split: orchestrates download → filter → join → write for onesplit; logs tile counts at each filtering stage as MLflow metrics.
log_label_distributions: logs per-label and per-fold label counts asMLflow tables.
preprocessing/_labels.py— shared helpercompute_label_and_tissue_prop(argmax over
roi_coverage_*columns, background fallback when all coveragesare zero); used by the test split where source metadata has no pre-derived
labels.
configs/preprocessing/embedding_dataset.yaml— Hydra config skeleton(
tissue_prop_min,thresholds, MLflow metadata).configs/data/dataset.yaml— addsembedding_run_idartifact reference.scripts/submit_embedding_dataset.py— Kubernetes job submission script.Override the embedding run ID at submission time via
dataset.mlflow_artifacts.embedding_run_id=<run_id>appended to the command.split/kfold_split.py— utlize the helper functionLabeling strategy
Train split reuses labels written by
kfold_split. Test split derives labelson-the-fly via
compute_label_and_tissue_prop. In both paths, tiles with twoor more non-zero ROI coverages are dropped before the per-class threshold is
applied.
Summary by CodeRabbit
New Features
Refactor