feat: add filter_tiles preprocessing step by vojtech-cifka · Pull Request #8 · RationAI/tissue-classification

vojtech-cifka · 2026-05-05T18:22:35Z

Summary

Adds a dedicated filter_tiles preprocessing step that produces a canonical
filtered tile set for downstream steps (threshold_stats, embeddings).

- Drops tiles with zero annotation coverage (PyArrow predicate pushdown on tile_coverage_* columns, so only matching rows are materialised from the 80M-row parquet)
- Joins against tissue stats to drop tiles with zero tissue coverage (pandas merge replaces the PyArrow hash join, which hung on 45M string-keyed rows)
- Outputs {train,test}_tiles.parquet to MLflow under filter_tiles/
- Logs per-split row counts at each filter stage as MLflow metrics

Pipeline order: tiling → tissue_masks → coverage_stats → filter_tiles → threshold_stats / embeddings

New files

preprocessing/filter_tiles.py
configs/preprocessing/filter_tiles.yaml
configs/experiment/preprocessing/filter_tiles.yaml
scripts/submit_filter_tiles.py

Config changes

configs/data/dataset.yaml: added tissue_stats_run_id

Summary by CodeRabbit

Release Notes

New Features
- Tile filtering preprocessing step now available. Retains tiles with annotation coverage and positive tissue coverage, with MLflow integration for artifact tracking and reproducibility.

Drops tiles with no annotation coverage or no tissue coverage by joining tiling output with tissue stats. Carries through annotation and tissue coverage columns so downstream stages can consume the filtered set without re-joining. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Adds progress prints between read/join/write and restricts the tissue table to columns and slides that can possibly join, cutting peak memory and join time on the 80M-row input. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

isin against ~200 string slide ids forced per-row Python checks across the 80M-row tissue parquet, dwarfing the tiny saving. Drop it and rely on column projection plus the tissue>0 predicate. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

PyArrow's hash join on 45M string-keyed rows hangs in practice. Pandas merge on the same data takes ~30s. Frees intermediate tables to keep peak memory bounded. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai · 2026-05-05T18:22:49Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 87a25215-b9c6-4015-92b7-f511b2e4c733

📥 Commits

Reviewing files that changed from the base of the PR and between 90435e9 and eda39fe.

📒 Files selected for processing (1)

preprocessing/filter_tiles.py

📝 Walkthrough

Walkthrough

A new tile filtering preprocessing stage is introduced. It extends dataset MLflow artifact metadata, adds experiment and global configurations, implements PyArrow-based filtering logic that retains tiles with annotation and tissue coverage, and provides a Kubernetes job submission script.

Changes

Tile Filtering Preprocessing Feature

| Layer / File(s) | Summary |
|---|---|
| Dataset Artifact Metadata<br/>`configs/data/dataset.yaml` | Adds `tissue_stats_run_id` MLflow artifact identifier to dataset configuration. |
| Preprocessing Configuration<br/>`configs/preprocessing/filter_tiles.yaml`, `configs/experiment/preprocessing/filter_tiles.yaml` | Global and experiment-level configs wire `filter_tiles` to tissue statistics artifacts from MLflow, expose `tissue_coverage_column` as a hyperparameter, and define MLflow metadata. |
| Core Filtering Logic<br/>`preprocessing/filter_tiles.py` | Implements `filter_split` to download tiles and tissue parquet from MLflow, filter by annotation coverage (`tile_coverage_*` columns) and tissue coverage, join filtered results, and return per-stage counts. `main` runs filtering for both train/test splits, logs metrics, and uploads results as MLflow artifacts. |
| Job Submission<br/>`scripts/submit_filter_tiles.py` | Kubernetes job submission script configures and launches `tissue-classification-filter-tiles` with resource limits, clones the repository, installs dependencies, and executes the preprocessing module. |

Sequence Diagram

sequenceDiagram
    participant User
    participant K8sJob as Kubernetes Job
    participant Repo as Git Repository
    participant MLflow
    participant PyArrow as PyArrow/Pandas
    participant Output as Output Storage

    User->>K8sJob: Submit tissue-classification-filter-tiles job
    K8sJob->>Repo: Clone tissue-classification repository
    K8sJob->>Repo: uv sync (install dependencies)
    K8sJob->>MLflow: Download tiles parquet (tiling_run_id)
    K8sJob->>MLflow: Download tissue_stats parquet (tissue_stats_run_id)
    PyArrow->>PyArrow: Filter tiles by annotation coverage<br/>(tile_coverage_* > 0)
    PyArrow->>PyArrow: Filter tissue by tissue coverage<br/>(tissue_column > 0)
    PyArrow->>PyArrow: Inner join on slide_id/x/y
    K8sJob->>Output: Write filtered tiles
    K8sJob->>MLflow: Upload output directory as artifacts
    K8sJob->>MLflow: Log per-split metrics<br/>(original_count, after_annotation, after_tissue)
    MLflow-->>User: Job complete with artifact tracking
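
The per-split metrics named in the diagram (original_count, after_annotation, after_tissue) could be flattened into a single dictionary before logging, along the lines of this sketch. The helper `stage_metrics`, the `<split>_<stage>` key format, and the counts are placeholders invented for illustration; the actual metric names used by the PR are not shown in this conversation.

```python
STAGES = ("original_count", "after_annotation", "after_tissue")


def stage_metrics(counts_by_split):
    """Flatten {split: {stage: count}} into {"<split>_<stage>": count} (hypothetical helper)."""
    metrics = {}
    for split, counts in counts_by_split.items():
        for stage in STAGES:
            metrics[f"{split}_{stage}"] = float(counts[stage])
    return metrics


# Placeholder counts, not results from the real dataset.
metrics = stage_metrics(
    {
        "train": {"original_count": 100, "after_annotation": 60, "after_tissue": 55},
        "test": {"original_count": 40, "after_annotation": 25, "after_tissue": 20},
    }
)
# With an active MLflow run, this dict could be passed to mlflow.log_metrics(metrics).
```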

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

RationAI/tissue-classification#6: Extends dataset MLflow artifact IDs and adds preprocessing consuming tissue and tile artifacts; the coverage_stats module in that PR complements the filter_tiles preprocessing introduced here.

Suggested reviewers

vejtek

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and accurately summarizes the main changeset: adding a new filter_tiles preprocessing step with associated configs and infrastructure. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

gemini-code-assist

Code Review

This pull request introduces a tile filtering preprocessing step that removes tiles lacking annotation or tissue coverage. It includes new configuration files, a PyArrow-based filtering script, and a Kubernetes job submission script. The review feedback recommends enhancing the script's flexibility by removing restrictive column filtering and adding a runtime check to ensure the final filtered dataset is not empty.
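
The recommended runtime check for an empty result could be as small as the following sketch. The helper name `require_non_empty` and the error message are assumptions; the only detail taken from this PR is that a later commit raises RuntimeError when the tissue filter drops all tiles.

```python
def require_non_empty(n_rows, split):
    """Fail fast if a split has no tiles left after filtering (hypothetical helper)."""
    if n_rows == 0:
        raise RuntimeError(f"filter_tiles produced no tiles for split {split!r}")
    return n_rows


remaining = require_non_empty(3, "train")
```

Raising here, rather than writing an empty parquet file, stops downstream steps from silently consuming an empty tile set.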

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

scripts/submit_filter_tiles.py (1)

11-12: ⚡ Quick win

Pin the repository revision for reproducible jobs.

Line 11 clones whatever is current at runtime, which makes results non-reproducible across reruns.

Proposed fix

-        "git clone https://github.com/RationAI/tissue-classification.git workdir",
-        "cd workdir",
+        "git clone https://github.com/RationAI/tissue-classification.git workdir",
+        "cd workdir && git checkout <commit-or-tag>",

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/submit_filter_tiles.py` around lines 11 - 12, Replace the unpinned
clone command string "git clone
https://github.com/RationAI/tissue-classification.git workdir" (and its
subsequent "cd workdir") with a reproducible sequence that checks out a specific
revision: either clone and checkout a fixed commit/tag (e.g., "git clone
https://github.com/RationAI/tissue-classification.git workdir && cd workdir &&
git checkout <REV>") or use "git clone --branch <TAG> --single-branch ...
workdir && cd workdir". Make <REV> configurable via an environment variable or
constant (e.g., GIT_REVISION) so the job is reproducible without changing code;
update the command strings used by the code that builds the job to reference
that variable.
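
One way to make the revision configurable, as the comment suggests, is to build the command list from a revision string. This is only a sketch: `clone_commands` is a hypothetical helper, not part of scripts/submit_filter_tiles.py, and the commit used in the example is simply the merge commit of this PR. A full clone is used because a `--depth 1` clone contains only the tip of the default branch, so checking out an older commit from it would fail.

```python
import shlex

REPO_URL = "https://github.com/RationAI/tissue-classification.git"


def clone_commands(revision, workdir="workdir"):
    """Build shell commands that clone the repo and pin it to one revision (hypothetical helper)."""
    if not revision:
        raise ValueError("a commit hash or tag is required for a reproducible job")
    return [
        f"git clone {REPO_URL} {shlex.quote(workdir)}",
        f"cd {shlex.quote(workdir)}",
        # --detach makes it explicit that the job runs on a fixed commit, not a branch.
        f"git checkout --detach {shlex.quote(revision)}",
    ]


# In a real script the revision might come from os.environ["GIT_REVISION"].
commands = clone_commands("7293555")
```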

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@preprocessing/filter_tiles.py`:
- Around line 59-67: Before using tissue_column in the filter, verify that it
exists in the dataset schema: check that tissue_column is present in
tissue_ds.schema (use the field names from tissue_ds.schema, e.g., via f.name)
and, if it is not, raise a ValueError with a clear message naming tissue_column
and the available columns; then build tissue_cols and call tissue_ds.to_table
with the pads.field(tissue_column) > 0 filter as before (locate variables
tissue_column, tissue_ds, tissue_cols, tissue_table, and pads.field).

In `@scripts/submit_filter_tiles.py`:
- Around line 4-15: The submit_job invocation contains unresolved placeholders
(username=... and "+experiment=...") which break execution; update the
submit_job call in submit_filter_tiles.py to supply real values or read them
from environment/arguments (e.g., use a USER/EXPERIMENT variable or argparse)
and interpolate those into the script command instead of literal "..."; ensure
the variable names used match the submit_job parameters and add a quick
validation check that username and experiment are non-empty before calling
submit_job to fail fast with a clear error.
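
The two inline comments above both ask for fail-fast validation, which could be sketched as follows. Both helpers and the names passed to them are hypothetical; with PyArrow, the list of available columns would come from `tissue_ds.schema.names`, and the real values for the job's username and experiment are not given in this conversation, so none are filled in here.

```python
def validate_tissue_column(tissue_column, available):
    """Raise a clear error if the configured column is missing (hypothetical helper)."""
    available = list(available)
    if tissue_column not in available:
        raise ValueError(
            f"tissue_column {tissue_column!r} not found; available columns: {available}"
        )
    return tissue_column


def require_setting(name, value):
    """Reject empty values and literal '...' placeholders (hypothetical helper)."""
    cleaned = (value or "").strip()
    if not cleaned or cleaned == "...":
        raise ValueError(f"{name} must be set before submitting the job")
    return cleaned


# Example column names are made up; pass tissue_ds.schema.names in practice.
column = validate_tissue_column("tissue_coverage", ["slide_id", "x", "y", "tissue_coverage"])
```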

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 91cf61e6-c8fb-444b-8cc8-d3ac523c2b53

📥 Commits

Reviewing files that changed from the base of the PR and between eddafdd and 90435e9.

📒 Files selected for processing (5)

configs/data/dataset.yaml
configs/experiment/preprocessing/filter_tiles.yaml
configs/preprocessing/filter_tiles.yaml
preprocessing/filter_tiles.py
scripts/submit_filter_tiles.py

vojtech-cifka and others added 7 commits May 5, 2026 19:07

chore: add submission script and experiment config for filter_tiles

1f86829

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: remove experiment name

832a2ca

fix: switch tissue join to pandas merge

f3a9148

PyArrow's hash join on 45M string-keyed rows hangs in practice. Pandas merge on the same data takes ~30s. Frees intermediate tables to keep peak memory bounded. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chore: remove debug prints from filter_tiles

90435e9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

vojtech-cifka requested a review from vejtek May 5, 2026 18:22

vojtech-cifka self-assigned this May 5, 2026

vojtech-cifka requested review from a team and JakubPekar May 5, 2026 18:22

gemini-code-assist Bot reviewed May 5, 2026

Comment thread preprocessing/filter_tiles.py Outdated

Comment thread preprocessing/filter_tiles.py

coderabbitai Bot reviewed May 5, 2026

Comment thread preprocessing/filter_tiles.py Outdated

Comment thread scripts/submit_filter_tiles.py

vojtech-cifka and others added 2 commits May 5, 2026 20:44

refactor: load all tissue stats columns and validate tissue_column ex…

f6e8eac

…ists Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: raise RuntimeError when tissue filter drops all tiles

eda39fe

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

vejtek approved these changes May 5, 2026

vojtech-cifka requested a review from ejdam87 May 6, 2026 12:33

ejdam87 approved these changes May 6, 2026

vojtech-cifka merged commit 7293555 into master May 6, 2026
3 checks passed

vojtech-cifka deleted the feature/filter-tiles branch May 6, 2026 14:20

coderabbitai Bot mentioned this pull request May 13, 2026

# feat: linear classifier training pipeline on precomputed embeddings #11

Merged

vojtech-cifka merged 9 commits into master from feature/filter-tiles
