feat: add embedding dataset build pipeline #10

Closed
vojtech-cifka wants to merge 23 commits into master from feature/embedding-dataset

Collaborator

@vojtech-cifka vojtech-cifka commented May 11, 2026

Summary

Adds a Hydra/MLflow pipeline that joins precomputed tile embeddings with k-fold
(train) and filter_tiles (test) metadata to produce a training-ready Parquet
dataset consumable by SlidesTilesLoader.

Changes

  • preprocessing/embedding_dataset.py — main pipeline:

    • apply_thresholds: filters by tissue_prop_min, drops tiles covered by
      two or more distinct ROI classes, then applies per-class argmax-threshold.
    • join_embeddings: Arrow-native join on (slide_id, x, y) using a
      synthetic index to avoid Acero's limitation with list columns; casts
      embedding column to large_list to prevent int32 offset overflow on
      take().
    • process_split: orchestrates download → filter → join → write for one
      split; logs tile counts at each filtering stage as MLflow metrics.
    • log_label_distributions: logs per-label and per-fold label counts as
      MLflow tables.
  • preprocessing/_labels.py — shared helper compute_label_and_tissue_prop
    (argmax over roi_coverage_* columns, background fallback when all coverages
    are zero); used by the test split where source metadata has no pre-derived
    labels.

  • configs/preprocessing/embedding_dataset.yaml — Hydra config skeleton
    (tissue_prop_min, thresholds, MLflow metadata).

  • configs/data/dataset.yaml — adds embedding_run_id artifact reference.

  • scripts/submit_embedding_dataset.py — Kubernetes job submission script.
    Override the embedding run ID at submission time via
    dataset.mlflow_artifacts.embedding_run_id=<run_id> appended to the command.

  • split/kfold_split.py — now uses the shared compute_label_and_tissue_prop helper instead of deriving labels inline.

Labeling strategy

Train split reuses labels written by kfold_split. Test split derives labels
on-the-fly via compute_label_and_tissue_prop. In both paths, tiles with two
or more non-zero ROI coverages are dropped before the per-class threshold is
applied.
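
As a rough illustration of the labeling strategy, the sketch below derives an argmax label with a background fallback, drops multi-ROI tiles, and applies a per-class threshold on the winning coverage. Several details are assumptions rather than the PR's behaviour: the function name, a pre-existing tissue_prop column, stripping the roi_coverage_ prefix from labels, and letting classes without a configured threshold (including background) pass.

```python
import pandas as pd

ROI_PREFIX = "roi_coverage_"


def label_and_filter_sketch(df: pd.DataFrame, tissue_prop_min: float, thresholds: dict) -> pd.DataFrame:
    roi_cols = [c for c in df.columns if c.startswith(ROI_PREFIX)]
    roi = df[roi_cols]

    # Argmax label, falling back to "background" when every coverage is zero.
    label = roi.idxmax(axis=1).str[len(ROI_PREFIX):]
    label = label.where(roi.sum(axis=1) > 0, "background")

    # Keep tiles with enough tissue and at most one non-zero ROI class.
    keep = (df["tissue_prop"] >= tissue_prop_min) & ((roi > 0).sum(axis=1) <= 1)

    # Per-class threshold on the winning coverage (assumption: missing classes pass).
    min_cov = label.map(thresholds).fillna(0.0).astype(float)
    keep &= roi.max(axis=1) >= min_cov

    return df.assign(label=label)[keep]
```

For example, with tissue_prop_min=0.5 and thresholds={"tumor": 0.5}, a tile with tumor coverage 0.8 and no other ROI is kept, while a tile with both tumor and stroma coverage is dropped before the threshold is consulted.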

Summary by CodeRabbit

  • New Features

    • Added embedding dataset pipeline to build training datasets by combining tile embeddings with metadata, applying tissue filtering and label assignment rules.
  • Refactor

    • Extracted label and tissue proportion computation into a shared utility function for reuse across modules.

Review Change Stack

vojtech-cifka and others added 22 commits May 8, 2026 21:27
Extract derive_labels logic to shared preprocessing/_labels.py, then use it in
both split/kfold_split.py and the new embedding_dataset pipeline. The new
pipeline joins k-fold (train) / filter_tiles (test) tile metadata with
precomputed embeddings after applying tissue + per-dominant-class ROI thresholds,
and emits a SlidesTilesLoader-compatible Parquet dataset as an MLflow artifact.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Joining 1M+ rows of list<double> embeddings was either OOMing on
to_pandas() or hitting int32 list-offset overflow inside take(). The fix:
- read embeddings into Arrow only and cast each chunk to large_list so
  take() concatenation uses int64 offsets;
- run the join on keys plus a synthetic row index because Acero refuses
  list columns in non-key fields, then pull embeddings via take();
- combine_chunks() before take() for an O(N) single-pass copy;
- write the parquet straight from Arrow, never materialising the
  embedding column in pandas.

Also bumps the kube job memory to 64Gi to give the combined-chunks +
take() peak some headroom, and trims the verbose [timing] prints down
to one progress line per split.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Without this guard a malformed train artifact would crash deep inside
apply_thresholds with a confusing KeyError. Surface a clear error that
points at the expected upstream artifact instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
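
A guard of the kind this commit describes might look like the following sketch. The helper name require_columns and the column names in REQUIRED_TRAIN_COLUMNS are hypothetical; the real guard checks whatever apply_thresholds actually reads.

```python
import pandas as pd

# Hypothetical column names, used only for illustration.
REQUIRED_TRAIN_COLUMNS = ("label", "tissue_prop")


def require_columns(df: pd.DataFrame, required, artifact_hint: str) -> None:
    """Fail early with an actionable message instead of a later KeyError."""
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise ValueError(
            f"Train metadata is missing column(s) {missing}. "
            f"Expected them in the upstream artifact: {artifact_hint}"
        )
```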
@vojtech-cifka vojtech-cifka requested review from Adames4 and vejtek May 11, 2026 13:24
@vojtech-cifka vojtech-cifka self-assigned this May 11, 2026
@vojtech-cifka vojtech-cifka requested a review from a team May 11, 2026 13:24

coderabbitai Bot commented May 11, 2026

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5322b776-97a7-4fcf-a0ee-f6875aa2915c

📥 Commits

Reviewing files that changed from the base of the PR and between 11ed4e3 and d59425a.

📒 Files selected for processing (1)
  • preprocessing/embedding_dataset.py

Walkthrough

This PR adds a complete embedding dataset preprocessing pipeline that joins precomputed tile embeddings with k-fold split metadata and filtered tile artifacts. It introduces a reusable label/tissue-proportion helper, defines Hydra configuration for experiment-specific tissue filtering thresholds, implements the core processing logic with PyArrow joins and per-class filtering, and provides a Kubernetes job submission script.

Changes

Embedding Dataset Preprocessing Pipeline

| Layer | File(s) | Summary |
|---|---|---|
| Shared Label/Tissue Proportion Helper | preprocessing/_labels.py, split/kfold_split.py | New module preprocessing/_labels.py adds compute_label_and_tissue_prop() function to derive tissue labels and proportions from ROI coverage columns. split/kfold_split.py refactored to delegate label/tissue-proportion derivation to this helper instead of computing inline. |
| Configuration & Metadata Wiring | configs/data/dataset.yaml, configs/preprocessing/embedding_dataset.yaml, configs/experiment/preprocessing/embedding_dataset.yaml | configs/data/dataset.yaml adds kfold_run_id and embedding_run_id MLflow artifact identifiers. New configs/preprocessing/embedding_dataset.yaml defines base config with template parameters and MLflow artifact path. New configs/experiment/preprocessing/embedding_dataset.yaml composes dataset defaults and defines tissue-specific filtering thresholds with hyperparameter metadata. |
| Core Embedding Dataset Processing | preprocessing/embedding_dataset.py | Implements complete per-split pipeline: downloads tile metadata and embedding artifacts via MLflow, applies tissue proportion and ROI class filtering via apply_thresholds(), joins embeddings with metadata using synthetic index via join_embeddings(), orchestrates filtering and joining in process_split(), and logs label distributions and split-level metrics via log_label_distributions(). |
| Kubernetes Job Submission | scripts/submit_embedding_dataset.py | New script submits Kubernetes job with 8 CPU, 64Gi memory, secured storage mount, and shell command sequence to clone repo, sync dependencies, and execute embedding dataset preprocessing with experiment override. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • RationAI/tissue-classification#7: Related — both PRs add and wire up embedding preprocessing configs and submission scripts and update configs/data/dataset.yaml with MLflow artifact run IDs.
  • RationAI/tissue-classification#3: Main PR refactors label/tissue_prop derivation into a shared helper that is called by kfold_split, which directly relates to the label/tissue_prop helper introduced and integrated in this PR.

Suggested reviewers

  • vejtek
  • ejdam87
  • matejpekar

Poem

🐰 Embeddings join with tiles so neat,
k-folds and filters skip and greet,
Labels bloom from ROI's art,
Shared helpers play their part,
Kubernetes hops—the pipeline's complete! 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding an embedding dataset build pipeline with Hydra/MLflow configuration, Python modules for filtering and joining embeddings, and Kubernetes job submission. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a new preprocessing module, embedding_dataset.py, designed to join tile metadata with precomputed embeddings. It also refactors the label and tissue proportion calculation into a shared utility and adds the necessary Hydra configurations and job submission scripts. Feedback focuses on improving the scalability and memory efficiency of the PyArrow operations, specifically by using 64-bit integers for row indices to prevent overflow and avoiding unnecessary memory copies when processing large chunked arrays.

Comment thread preprocessing/embedding_dataset.py Outdated
Comment thread preprocessing/embedding_dataset.py

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
preprocessing/embedding_dataset.py (1)

80-85: 💤 Low value

Consider handling already-large_list embeddings explicitly.

The current check if pa.types.is_list(emb_col.type) only enters the casting block for regular list types. If the embeddings are already large_list, they skip the casting logic, which is correct. However, for clarity and future maintainability, consider also checking for and preserving large_list explicitly or adding a comment explaining why only list is checked.

📝 Optional clarification
     emb_col = emb_table.column("embedding")
     if pa.types.is_list(emb_col.type):
+        # Cast regular list to large_list to avoid int32 offset overflow in take()
         target_type = pa.large_list(emb_col.type.value_type)
         emb_col = pa.chunked_array(
             [c.cast(target_type) for c in emb_col.chunks], type=target_type
         )
+    elif pa.types.is_large_list(emb_col.type):
+        # Already large_list, no casting needed
+        pass
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@preprocessing/embedding_dataset.py` around lines 80 - 85, The embedding
column handling only tests pa.types.is_list and skips pa.types.is_large_list
which is unclear; update the emb_col logic around emb_table.column("embedding")
to explicitly detect pa.types.is_large_list (or pa.types.is_list) and either
preserve the large_list as-is or perform the same casting behavior, and add a
concise comment explaining why list vs large_list are treated differently;
reference emb_col, emb_table.column("embedding"), pa.types.is_list,
pa.types.is_large_list, pa.large_list and pa.chunked_array to locate and update
the code.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@preprocessing/_labels.py`:
- Around line 10-24: Add a validation at the start of
compute_label_and_tissue_prop to ensure roi_cols is non-empty; explicitly check
"if not roi_cols" and raise a clear ValueError (e.g. "roi_cols must be a
non-empty list of column names") or alternatively return labels of "background"
and zero tissue_prop for each row, so that downstream calls to roi_df =
pd.DataFrame(...) and operations on roi_df (idxmax, sum) cannot fail; reference
the function compute_label_and_tissue_prop and variables roi_cols, roi_df, tp,
lbl when making the change.
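
The validation requested here could be sketched as follows. This is a stand-in, not the helper in preprocessing/_labels.py: the signature, the use of the row sum as tissue_prop, and returning raw column names as labels are assumptions inferred from the identifiers mentioned in the comment (roi_df, tp, lbl, idxmax, sum).

```python
import pandas as pd


def compute_label_and_tissue_prop_sketch(df: pd.DataFrame, roi_cols):
    """Hypothetical stand-in showing only the non-empty roi_cols guard."""
    if not roi_cols:
        raise ValueError("roi_cols must be a non-empty list of column names")
    roi_df = df[list(roi_cols)]
    tp = roi_df.sum(axis=1)  # assumption: tissue proportion is the summed ROI coverage
    lbl = roi_df.idxmax(axis=1).where(tp > 0, "background")
    return lbl, tp
```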

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 44745c5a-5b3a-4578-9628-2aa0b4dcb7ca

📥 Commits

Reviewing files that changed from the base of the PR and between da61791 and 11ed4e3.

📒 Files selected for processing (7)
  • configs/data/dataset.yaml
  • configs/experiment/preprocessing/embedding_dataset.yaml
  • configs/preprocessing/embedding_dataset.yaml
  • preprocessing/_labels.py
  • preprocessing/embedding_dataset.py
  • scripts/submit_embedding_dataset.py
  • split/kfold_split.py

Comment thread preprocessing/_labels.py
@vojtech-cifka vojtech-cifka deleted the feature/embedding-dataset branch May 12, 2026 08:36