feat: add embedding dataset build pipeline by vojtech-cifka · Pull Request #10 · RationAI/tissue-classification

vojtech-cifka · 2026-05-11T13:24:36Z

Summary

Adds a Hydra/MLflow pipeline that joins precomputed tile embeddings with k-fold
(train) and filter_tiles (test) metadata to produce a training-ready Parquet
dataset consumable by SlidesTilesLoader.

Changes

preprocessing/embedding_dataset.py — main pipeline:
- apply_thresholds: filters by tissue_prop_min, drops tiles covered by
  two or more distinct ROI classes, then applies per-class argmax-threshold.
- join_embeddings: Arrow-native join on (slide_id, x, y) using a
  synthetic index to avoid Acero's limitation with list columns; casts
  embedding column to large_list to prevent int32 offset overflow on
  take().
- process_split: orchestrates download → filter → join → write for one
  split; logs tile counts at each filtering stage as MLflow metrics.
- log_label_distributions: logs per-label and per-fold label counts as
  MLflow tables.
preprocessing/_labels.py — shared helper compute_label_and_tissue_prop
(argmax over roi_coverage_* columns, background fallback when all coverages
are zero); used by the test split where source metadata has no pre-derived
labels.
configs/preprocessing/embedding_dataset.yaml — Hydra config skeleton
(tissue_prop_min, thresholds, MLflow metadata).
configs/data/dataset.yaml — adds embedding_run_id artifact reference.
scripts/submit_embedding_dataset.py — Kubernetes job submission script.
Override the embedding run ID at submission time via
dataset.mlflow_artifacts.embedding_run_id=<run_id> appended to the command.
split/kfold_split.py — utilizes the shared helper function.

Labeling strategy

Train split reuses labels written by kfold_split. Test split derives labels
on-the-fly via compute_label_and_tissue_prop. In both paths, tiles with two
or more non-zero ROI coverages are dropped before the per-class threshold is
applied.
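
The filtering order described above can be sketched roughly as follows. This is a minimal pandas illustration, not the code from the PR: the column names (`tissue_prop`, `label`, `roi_coverage_*`), the function signature, and the rule that the dominant class's coverage must reach its per-class threshold are all assumptions, as is the choice to keep classes that have no configured threshold.

```python
import pandas as pd


def apply_thresholds(
    df: pd.DataFrame,
    roi_cols: list[str],
    tissue_prop_min: float,
    thresholds: dict[str, float],
) -> pd.DataFrame:
    """Hypothetical sketch of the three filtering stages, applied in order."""
    # 1. Drop tiles with too little tissue.
    df = df[df["tissue_prop"] >= tissue_prop_min]

    # 2. Drop tiles covered by two or more distinct ROI classes.
    n_classes = (df[roi_cols] > 0).sum(axis=1)
    df = df[n_classes < 2]

    # 3. Per-class threshold on the dominant (argmax) coverage.
    #    Assumption: labels without a configured threshold are kept.
    dominant = df[roi_cols].max(axis=1)
    required = df["label"].map(thresholds).fillna(0.0)
    return df[dominant >= required]
```

Because each stage narrows the same DataFrame, the row count after each step can be logged directly, which matches the per-stage tile counts the description says are reported as MLflow metrics.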

Summary by CodeRabbit

New Features
- Added embedding dataset pipeline to build training datasets by combining tile embeddings with metadata, applying tissue filtering and label assignment rules.
Refactor
- Extracted label and tissue proportion computation into a shared utility function for reuse across modules.

Extract derive_labels logic to shared preprocessing/_labels.py, then use it in both split/kfold_split.py and the new embedding_dataset pipeline. The new pipeline joins k-fold (train) / filter_tiles (test) tile metadata with precomputed embeddings after applying tissue + per-dominant-class ROI thresholds, and emits a SlidesTilesLoader-compatible Parquet dataset as an MLflow artifact.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ataset

Joining 1M+ rows of list<double> embeddings was either OOMing on to_pandas() or hitting int32 list-offset overflow inside take(). The fix:
- read embeddings into Arrow only and cast each chunk to large_list so take() concatenation uses int64 offsets;
- run the join on keys plus a synthetic row index because Acero refuses list columns in non-key fields, then pull embeddings via take();
- combine_chunks() before take() for an O(N) single-pass copy;
- write the parquet straight from Arrow, never materialising the embedding column in pandas.

Also bumps the kube job memory to 64Gi to give the combined-chunks + take() peak some headroom, and trims the verbose [timing] prints down to one progress line per split.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Without this guard a malformed train artifact would crash deep inside apply_thresholds with a confusing KeyError. Surface a clear error that points at the expected upstream artifact instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
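
A guard of the kind this commit describes might look like the sketch below. The helper name `require_columns`, its parameters, and the wording of the error are hypothetical; the PR does not show the actual check or which columns it validates.

```python
from collections.abc import Iterable


def require_columns(columns: Iterable[str], required: Iterable[str], source: str) -> None:
    """Fail early with a readable message instead of a KeyError further downstream."""
    present = set(columns)
    missing = [name for name in required if name not in present]
    if missing:
        raise ValueError(
            f"{source} is missing required column(s) {missing}; "
            "check that the upstream run produced the expected artifact."
        )
```

Calling it right after the train metadata is loaded, for example `require_columns(df.columns, ["label", "tissue_prop"], "train tiles")` with assumed column names, surfaces the problem before any filtering starts.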

coderabbitai · 2026-05-11T13:24:50Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5322b776-97a7-4fcf-a0ee-f6875aa2915c

📥 Commits

Reviewing files that changed from the base of the PR and between 11ed4e3 and d59425a.

📒 Files selected for processing (1)

preprocessing/embedding_dataset.py

📝 Walkthrough

Walkthrough

This PR adds a complete embedding dataset preprocessing pipeline that joins precomputed tile embeddings with k-fold split metadata and filtered tile artifacts. It introduces a reusable label/tissue-proportion helper, defines Hydra configuration for experiment-specific tissue filtering thresholds, implements the core processing logic with PyArrow joins and per-class filtering, and provides a Kubernetes job submission script.

Changes

Embedding Dataset Preprocessing Pipeline

| Layer / File(s) | Summary |
| --- | --- |
| Shared Label/Tissue Proportion Helper `preprocessing/_labels.py`, `split/kfold_split.py` | New module `preprocessing/_labels.py` adds `compute_label_and_tissue_prop()` function to derive tissue labels and proportions from ROI coverage columns. `split/kfold_split.py` refactored to delegate label/tissue-proportion derivation to this helper instead of computing inline. |
| Configuration & Metadata Wiring `configs/data/dataset.yaml`, `configs/preprocessing/embedding_dataset.yaml`, `configs/experiment/preprocessing/embedding_dataset.yaml` | `configs/data/dataset.yaml` adds `kfold_run_id` and `embedding_run_id` MLflow artifact identifiers. New `configs/preprocessing/embedding_dataset.yaml` defines base config with template parameters and MLflow artifact path. New `configs/experiment/preprocessing/embedding_dataset.yaml` composes dataset defaults and defines tissue-specific filtering thresholds with hyperparameter metadata. |
| Core Embedding Dataset Processing `preprocessing/embedding_dataset.py` | Implements complete per-split pipeline: downloads tile metadata and embedding artifacts via MLflow, applies tissue proportion and ROI class filtering via `apply_thresholds()`, joins embeddings with metadata using synthetic index via `join_embeddings()`, orchestrates filtering and joining in `process_split()`, and logs label distributions and split-level metrics via `log_label_distributions()`. |
| Kubernetes Job Submission `scripts/submit_embedding_dataset.py` | New script submits Kubernetes job with 8 CPU, 64Gi memory, secured storage mount, and shell command sequence to clone repo, sync dependencies, and execute embedding dataset preprocessing with experiment override. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

RationAI/tissue-classification#7: Related — both PRs add and wire up embedding preprocessing configs and submission scripts and update configs/data/dataset.yaml with MLflow artifact run IDs.
RationAI/tissue-classification#3: Main PR refactors label/tissue_prop derivation into a shared helper that is called by kfold_split, which directly relates to the label/tissue_prop helper introduced and integrated in this PR.

Suggested reviewers

vejtek
ejdam87
matejpekar

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding an embedding dataset build pipeline with Hydra/MLflow configuration, Python modules for filtering and joining embeddings, and Kubernetes job submission. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

gemini-code-assist

Code Review

This pull request introduces a new preprocessing module, embedding_dataset.py, designed to join tile metadata with precomputed embeddings. It also refactors the label and tissue proportion calculation into a shared utility and adds the necessary Hydra configurations and job submission scripts. Feedback focuses on improving the scalability and memory efficiency of the PyArrow operations, specifically by using 64-bit integers for row indices to prevent overflow and avoiding unnecessary memory copies when processing large chunked arrays.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

preprocessing/embedding_dataset.py (1)

80-85: 💤 Low value

Consider handling already-large_list embeddings explicitly.

The current check if pa.types.is_list(emb_col.type) only enters the casting block for regular list types. If the embeddings are already large_list, they skip the casting logic, which is correct. However, for clarity and future maintainability, consider also checking for and preserving large_list explicitly or adding a comment explaining why only list is checked.

📝 Optional clarification

     emb_col = emb_table.column("embedding")
     if pa.types.is_list(emb_col.type):
+        # Cast regular list to large_list to avoid int32 offset overflow in take()
         target_type = pa.large_list(emb_col.type.value_type)
         emb_col = pa.chunked_array(
             [c.cast(target_type) for c in emb_col.chunks], type=target_type
         )
+    elif pa.types.is_large_list(emb_col.type):
+        # Already large_list, no casting needed
+        pass

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@preprocessing/embedding_dataset.py` around lines 80 - 85, The embedding
column handling only tests pa.types.is_list and skips pa.types.is_large_list
which is unclear; update the emb_col logic around emb_table.column("embedding")
to explicitly detect pa.types.is_large_list (or pa.types.is_list) and either
preserve the large_list as-is or perform the same casting behavior, and add a
concise comment explaining why list vs large_list are treated differently;
reference emb_col, emb_table.column("embedding"), pa.types.is_list,
pa.types.is_large_list, pa.large_list and pa.chunked_array to locate and update
the code.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@preprocessing/_labels.py`:
- Around line 10-24: Add a validation at the start of
compute_label_and_tissue_prop to ensure roi_cols is non-empty; explicitly check
"if not roi_cols" and raise a clear ValueError (e.g. "roi_cols must be a
non-empty list of column names") or alternatively return labels of "background"
and zero tissue_prop for each row, so that downstream calls to roi_df =
pd.DataFrame(...) and operations on roi_df (idxmax, sum) cannot fail; reference
the function compute_label_and_tissue_prop and variables roi_cols, roi_df, tp,
lbl when making the change.
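
The suggested validation could be sketched as below. This is a hedged illustration rather than the helper from the PR: the signature, the return shape, the `roi_coverage_` prefix stripping, and the definition of tissue proportion as the sum of ROI coverages are assumptions. The argmax labelling, the background fallback, and the names `roi_cols`, `roi_df`, `tp`, and `lbl` come from the review comment and the PR description.

```python
import pandas as pd

ROI_PREFIX = "roi_coverage_"


def compute_label_and_tissue_prop(
    df: pd.DataFrame, roi_cols: list[str]
) -> tuple[pd.Series, pd.Series]:
    """Return (label, tissue_prop) per tile from ROI coverage columns."""
    if not roi_cols:
        raise ValueError("roi_cols must be a non-empty list of column names")

    roi_df = df[roi_cols].fillna(0.0)
    # Assumption: tissue proportion is the total ROI coverage of the tile.
    tp = roi_df.sum(axis=1)
    # Label is the class with the largest coverage ...
    lbl = roi_df.idxmax(axis=1).str.removeprefix(ROI_PREFIX)
    # ... falling back to "background" when every coverage is zero.
    lbl = lbl.where(roi_df.max(axis=1) > 0, "background")
    return lbl, tp
```

Raising early keeps the failure next to its cause; returning all-background labels, the alternative the comment mentions, would instead hide a misconfigured column list.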

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 44745c5a-5b3a-4578-9628-2aa0b4dcb7ca

📥 Commits

Reviewing files that changed from the base of the PR and between da61791 and 11ed4e3.

📒 Files selected for processing (7)

configs/data/dataset.yaml
configs/experiment/preprocessing/embedding_dataset.yaml
configs/preprocessing/embedding_dataset.yaml
preprocessing/_labels.py
preprocessing/embedding_dataset.py
scripts/submit_embedding_dataset.py
split/kfold_split.py

vojtech-cifka and others added 22 commits May 8, 2026 21:27

feat: add class tresholds and run ids

911bec2

fix: wrong run id

1a02395

Merge remote-tracking branch 'origin/master' into feature/embedding-d…

08d7ba5

…ataset

feat: add timing

b38465e

refactor: use pyarrow to avoid to pandas conversion

bfc9578

fix: join on keys only

eb213c6

fix: typing

c92d9a1

fix: add prints

01cc394

refactor: use combine chunks

cad0d37

chore: remove time

3b0137f

feat: add timing

8df47aa

chore: revert to the previous state

926753d

feat: add prints

b0e9ba4

refactor: use discusssed thresholds

6a915de

refactor: use different labeling strategy

0f50307

refactor: drop tiles that are covered by two or more distinct labels

c421c74

fix: format

718ec08

chore: update embeddings run id

389a0a5

chore: remove timing prints

11ed4e3

vojtech-cifka requested review from Adames4 and vejtek May 11, 2026 13:24

vojtech-cifka self-assigned this May 11, 2026

vojtech-cifka requested a review from a team May 11, 2026 13:24

gemini-code-assist Bot reviewed May 11, 2026

Comment thread preprocessing/embedding_dataset.py Outdated

Comment thread preprocessing/embedding_dataset.py

coderabbitai Bot reviewed May 11, 2026

Comment thread preprocessing/_labels.py

refactor: use int64

d59425a

vejtek approved these changes May 11, 2026

vojtech-cifka closed this May 11, 2026

vojtech-cifka deleted the feature/embedding-dataset branch May 12, 2026 08:36

vojtech-cifka wants to merge 23 commits into master from feature/embedding-dataset

2 participants
