
feat: add filter_tiles preprocessing step #8

Merged
vojtech-cifka merged 9 commits into master from feature/filter-tiles
May 6, 2026

Conversation

@vojtech-cifka
Collaborator

@vojtech-cifka vojtech-cifka commented May 5, 2026

Summary

Adds a dedicated filter_tiles preprocessing step that produces a canonical
filtered tile set for downstream steps (threshold_stats, embeddings).

  • Drops tiles with zero annotation coverage (PyArrow predicate pushdown on
    tile_coverage_* columns — only matching rows materialised from 80M-row parquet)
  • Joins against tissue stats to drop tiles with zero tissue coverage
    (pandas merge replaces PyArrow hash join, which hung on 45M string-keyed rows)
  • Outputs {train,test}_tiles.parquet to MLflow under filter_tiles/
  • Logs per-split row counts at each filter stage as MLflow metrics

Pipeline order: tiling → tissue_masks → coverage_stats → filter_tiles → threshold_stats / embeddings

New files

  • preprocessing/filter_tiles.py
  • configs/preprocessing/filter_tiles.yaml
  • configs/experiment/preprocessing/filter_tiles.yaml
  • scripts/submit_filter_tiles.py

Config changes

  • configs/data/dataset.yaml: added tissue_stats_run_id

Summary by CodeRabbit

Release Notes

  • New Features
    • Tile filtering preprocessing step now available. Retains tiles with annotation coverage and positive tissue coverage, with MLflow integration for artifact tracking and reproducibility.

vojtech-cifka and others added 7 commits May 5, 2026 19:07
Drops tiles that have no annotation coverage or no tissue coverage by joining
the tiling output with tissue stats. Carries through annotation and tissue
coverage columns so downstream stages can consume the filtered set
without re-joining.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds progress prints between read/join/write and restricts the tissue
table to columns and slides that can possibly join, cutting peak memory
and join time on the 80M-row input.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
isin against ~200 string slide ids forced per-row Python checks across
the 80M-row tissue parquet, dwarfing the tiny saving. Drop it and rely
on column projection plus the tissue>0 predicate.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PyArrow's hash join on 45M string-keyed rows hangs in practice. Pandas
merge on the same data takes ~30s. Frees intermediate tables to keep
peak memory bounded.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@vojtech-cifka vojtech-cifka requested a review from vejtek May 5, 2026 18:22
@vojtech-cifka vojtech-cifka self-assigned this May 5, 2026
@vojtech-cifka vojtech-cifka requested review from a team and JakubPekar May 5, 2026 18:22
@coderabbitai

coderabbitai Bot commented May 5, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 87a25215-b9c6-4015-92b7-f511b2e4c733

📥 Commits

Reviewing files that changed from the base of the PR and between 90435e9 and eda39fe.

📒 Files selected for processing (1)
  • preprocessing/filter_tiles.py
Walkthrough

A new tile filtering preprocessing stage is introduced. It extends dataset MLflow artifact metadata, adds experiment and global configurations, implements PyArrow-based filtering logic that retains tiles with annotation and tissue coverage, and provides a Kubernetes job submission script.

Changes

Tile Filtering Preprocessing Feature

| Layer / File(s) | Summary |
|---|---|
| Dataset Artifact Metadata: configs/data/dataset.yaml | Adds tissue_stats_run_id MLflow artifact identifier to dataset configuration. |
| Preprocessing Configuration: configs/preprocessing/filter_tiles.yaml, configs/experiment/preprocessing/filter_tiles.yaml | Global and experiment-level configs wire filter_tiles to tissue statistics artifacts from MLflow, expose tissue_coverage_column as a hyperparameter, and define MLflow metadata. |
| Core Filtering Logic: preprocessing/filter_tiles.py | Implements filter_split to download tiles and tissue parquet from MLflow, filter by annotation coverage (tile_coverage_* columns) and tissue coverage, join filtered results, and return per-stage counts. main runs filtering for both train/test splits, logs metrics, and uploads results as MLflow artifacts. |
| Job Submission: scripts/submit_filter_tiles.py | Kubernetes job submission script configures and launches tissue-classification-filter-tiles with resource limits, clones the repository, installs dependencies, and executes the preprocessing module. |

Sequence Diagram

sequenceDiagram
    participant User
    participant K8sJob as Kubernetes Job
    participant Repo as Git Repository
    participant MLflow
    participant PyArrow as PyArrow/Pandas
    participant Output as Output Storage

    User->>K8sJob: Submit tissue-classification-filter-tiles job
    K8sJob->>Repo: Clone tissue-classification repository
    K8sJob->>Repo: uv sync (install dependencies)
    K8sJob->>MLflow: Download tiles parquet (tiling_run_id)
    K8sJob->>MLflow: Download tissue_stats parquet (tissue_stats_run_id)
    PyArrow->>PyArrow: Filter tiles by annotation coverage<br/>(tile_coverage_* > 0)
    PyArrow->>PyArrow: Filter tissue by tissue coverage<br/>(tissue_column > 0)
    PyArrow->>PyArrow: Inner join on slide_id/x/y
    K8sJob->>Output: Write filtered tiles
    K8sJob->>MLflow: Upload output directory as artifacts
    K8sJob->>MLflow: Log per-split metrics<br/>(original_count, after_annotation, after_tissue)
    MLflow-->>User: Job complete with artifact tracking
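The filter, join, and count steps in the diagram can be sketched on in-memory frames as follows. This is a hedged illustration rather than the merged implementation: it follows the PR description in using a pandas merge for the join (the diagram attributes the join to PyArrow), and the function name `filter_frames`, the default column name `tissue_coverage`, and the demo data are assumptions. The join keys and count names are taken from the diagram.

```python
import pandas as pd

# Join keys as named in the diagram above.
JOIN_KEYS = ["slide_id", "x", "y"]


def filter_frames(
    tiles: pd.DataFrame,
    tissue: pd.DataFrame,
    tissue_column: str = "tissue_coverage",  # hypothetical column name
    prefix: str = "tile_coverage_",
) -> tuple[pd.DataFrame, dict[str, int]]:
    """Keep tiles with annotation coverage and positive tissue coverage."""
    coverage_cols = [c for c in tiles.columns if c.startswith(prefix)]
    # Stage 1: keep tiles where any annotation coverage column is positive.
    annotated = tiles[(tiles[coverage_cols] > 0).any(axis=1)]
    # Stage 2: keep only tissue rows that can survive the join, and only
    # the columns the join needs.
    tissue_pos = tissue.loc[tissue[tissue_column] > 0, [*JOIN_KEYS, tissue_column]]
    # Stage 3: the inner merge drops tiles without positive tissue coverage.
    filtered = annotated.merge(tissue_pos, on=JOIN_KEYS, how="inner")
    counts = {
        "original_count": len(tiles),
        "after_annotation": len(annotated),
        "after_tissue": len(filtered),
    }
    return filtered, counts


def demo() -> tuple[pd.DataFrame, dict[str, int]]:
    """Run the sketch on four hypothetical tiles."""
    tiles = pd.DataFrame(
        {
            "slide_id": ["s1", "s1", "s2", "s2"],
            "x": [0, 256, 0, 256],
            "y": [0, 0, 0, 0],
            "tile_coverage_tumor": [0.0, 0.4, 0.7, 0.2],
        }
    )
    tissue = pd.DataFrame(
        {
            "slide_id": ["s1", "s1", "s2", "s2"],
            "x": [0, 256, 0, 256],
            "y": [0, 0, 0, 0],
            "tissue_coverage": [0.9, 0.5, 0.0, 0.3],
        }
    )
    return filter_frames(tiles, tissue)
```

The returned counts dictionary has the shape that `mlflow.log_metrics` accepts; any per-split prefix on the metric names would be a further assumption.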

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • RationAI/tissue-classification#6: Extends dataset MLflow artifact IDs and adds preprocessing consuming tissue and tile artifacts; the coverage_stats module in that PR complements the filter_tiles preprocessing introduced here.

Suggested reviewers

  • vejtek

Poem

🐰 A filtering tale, hops the tiles so fine,
With tissue coverage, the boundaries align,
From PyArrow's whisper to MLflow's keep,
The annotations are counted, the coverage runs deep!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and accurately summarizes the main changeset: adding a new filter_tiles preprocessing step with associated configs and infrastructure. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a tile filtering preprocessing step that removes tiles lacking annotation or tissue coverage. It includes new configuration files, a PyArrow-based filtering script, and a Kubernetes job submission script. The review feedback recommends enhancing the script's flexibility by removing restrictive column filtering and adding a runtime check to ensure the final filtered dataset is not empty.

Comment thread preprocessing/filter_tiles.py Outdated
Comment thread preprocessing/filter_tiles.py

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
scripts/submit_filter_tiles.py (1)

11-12: ⚡ Quick win

Pin the repository revision for reproducible jobs.

Line 11 clones whatever is current at runtime, which makes results non-reproducible across reruns.

Proposed fix
-        "git clone https://github.com/RationAI/tissue-classification.git workdir",
+        "git clone --depth 1 https://github.com/RationAI/tissue-classification.git workdir",
+        "cd workdir && git checkout <commit-or-tag>",
-        "cd workdir",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/submit_filter_tiles.py` around lines 11 - 12, Replace the unpinned
clone command string "git clone
https://github.com/RationAI/tissue-classification.git workdir" (and its
subsequent "cd workdir") with a reproducible sequence that checks out a specific
revision: either clone and checkout a fixed commit/tag (e.g., "git clone
https://github.com/RationAI/tissue-classification.git workdir && cd workdir &&
git checkout <REV>") or use "git clone --branch <TAG> --single-branch ...
workdir && cd workdir". Make <REV> configurable via an environment variable or
constant (e.g., GIT_REVISION) so the job is reproducible without changing code;
update the command strings used by the code that builds the job to reference
that variable.
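The pinned-revision suggestion can be sketched as below. This is only an illustration, not the repository's submit script: `clone_commands` and `commands_from_env` are hypothetical helpers, and `GIT_REVISION` is the environment variable name floated in the comment above. The sketch uses a full clone rather than `--depth 1`, because a depth-1 clone contains only the tip of the default branch, so checking out an older commit from it would fail.

```python
import shlex
from collections.abc import Mapping

REPO_URL = "https://github.com/RationAI/tissue-classification.git"


def clone_commands(revision: str, workdir: str = "workdir") -> list[str]:
    """Return shell commands that clone the repo and check out `revision`."""
    revision = revision.strip()
    # Fail fast on an empty value or an unfilled "<REV>"-style placeholder.
    if not revision or revision.startswith("<"):
        raise ValueError("revision must be a commit SHA or tag, e.g. via GIT_REVISION")
    return [
        f"git clone {REPO_URL} {shlex.quote(workdir)}",
        f"cd {shlex.quote(workdir)}",
        # --detach makes it explicit that the job runs on a fixed commit.
        f"git checkout --detach {shlex.quote(revision)}",
    ]


def commands_from_env(env: Mapping[str, str]) -> list[str]:
    """Read the revision from a mapping such as os.environ."""
    return clone_commands(env.get("GIT_REVISION", ""))
```

Passing `os.environ` to `commands_from_env` lets a rerun target the same commit without a code change; how the resulting strings are handed to the job builder depends on the submit script and is not shown here.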
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@preprocessing/filter_tiles.py`:
- Around line 59-67: Before using tissue_column in the filter, verify that it
exists in the dataset schema: check that tissue_column is present in
tissue_ds.schema (use the field names from tissue_ds.schema, e.g., via f.name)
and, if it is not, raise a ValueError with a clear message referencing
tissue_column and the available columns; then proceed to build tissue_cols and
call tissue_ds.to_table with the pads.field(tissue_column) > 0 filter as before
(locate variables tissue_column, tissue_ds, tissue_cols, tissue_table, and
pads.field).

In `@scripts/submit_filter_tiles.py`:
- Around line 4-15: The submit_job invocation contains unresolved placeholders
(username=... and "+experiment=...") which break execution; update the
submit_job call in submit_filter_tiles.py to supply real values or read them
from environment/arguments (e.g., use a USER/EXPERIMENT variable or argparse)
and interpolate those into the script command instead of literal "..."; ensure
the variable names used match the submit_job parameters and add a quick
validation check that username and experiment are non-empty before calling
submit_job to fail fast with a clear error.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 91cf61e6-c8fb-444b-8cc8-d3ac523c2b53

📥 Commits

Reviewing files that changed from the base of the PR and between eddafdd and 90435e9.

📒 Files selected for processing (5)
  • configs/data/dataset.yaml
  • configs/experiment/preprocessing/filter_tiles.yaml
  • configs/preprocessing/filter_tiles.yaml
  • preprocessing/filter_tiles.py
  • scripts/submit_filter_tiles.py

Comment thread preprocessing/filter_tiles.py Outdated
Comment thread scripts/submit_filter_tiles.py
vojtech-cifka and others added 2 commits May 5, 2026 20:44
…ists

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@vojtech-cifka vojtech-cifka requested a review from ejdam87 May 6, 2026 12:33
@vojtech-cifka vojtech-cifka merged commit 7293555 into master May 6, 2026
3 checks passed
@vojtech-cifka vojtech-cifka deleted the feature/filter-tiles branch May 6, 2026 14:20

3 participants