feat: add filter_tiles preprocessing step #8
Drops tiles with no annotation coverage and no tissue coverage by joining tiling output with tissue stats. Carries through annotation and tissue coverage columns so downstream stages can consume the filtered set without re-joining. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds progress prints between read/join/write and restricts the tissue table to columns and slides that can possibly join, cutting peak memory and join time on the 80M-row input. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
An `isin` filter against ~200 string slide IDs forced per-row Python checks across the 80M-row tissue parquet, which cost far more than the small saving it bought. Drop it and rely on column projection plus the tissue>0 predicate. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PyArrow's hash join on 45M string-keyed rows hangs in practice. Pandas merge on the same data takes ~30s. Frees intermediate tables to keep peak memory bounded. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
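A minimal sketch of the pandas merge described in this commit, on toy data. The join keys `slide_id`, `x`, `y` come from the review walkthrough; the column names `tile_coverage_tumor` and `tissue_coverage` are hypothetical stand-ins, and the real tables are far larger.

```python
import pandas as pd

# Hypothetical toy frames standing in for the tiling and tissue-stats tables.
tiles = pd.DataFrame({
    "slide_id": ["s1", "s1", "s2"],
    "x": [0, 256, 0],
    "y": [0, 0, 0],
    "tile_coverage_tumor": [0.4, 0.1, 0.9],
})
tissue = pd.DataFrame({
    "slide_id": ["s1", "s2", "s3"],
    "x": [0, 0, 0],
    "y": [0, 0, 0],
    "tissue_coverage": [0.8, 0.5, 0.7],
})

# Inner join: keep only tiles that also have a tissue-stats row.
filtered = tiles.merge(tissue, on=["slide_id", "x", "y"], how="inner")

# Drop the references to the inputs so their memory can be reclaimed
# once the joined frame exists.
del tiles, tissue
```

Only the two tiles present in both frames survive the join; the tissue coverage column is carried through alongside the annotation coverage column.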
📝 Walkthrough
A new tile filtering preprocessing stage is introduced. It extends dataset MLflow artifact metadata, adds experiment and global configurations, implements PyArrow-based filtering logic that retains tiles with annotation and tissue coverage, and provides a Kubernetes job submission script.
Sequence Diagram
sequenceDiagram
participant User
participant K8sJob as Kubernetes Job
participant Repo as Git Repository
participant MLflow
participant PyArrow as PyArrow/Pandas
participant Output as Output Storage
User->>K8sJob: Submit tissue-classification-filter-tiles job
K8sJob->>Repo: Clone tissue-classification repository
K8sJob->>Repo: uv sync (install dependencies)
K8sJob->>MLflow: Download tiles parquet (tiling_run_id)
K8sJob->>MLflow: Download tissue_stats parquet (tissue_stats_run_id)
PyArrow->>PyArrow: Filter tiles by annotation coverage<br/>(tile_coverage_* > 0)
PyArrow->>PyArrow: Filter tissue by tissue coverage<br/>(tissue_column > 0)
PyArrow->>PyArrow: Inner join on slide_id/x/y
K8sJob->>Output: Write filtered tiles
K8sJob->>MLflow: Upload output directory as artifacts
K8sJob->>MLflow: Log per-split metrics<br/>(original_count, after_annotation, after_tissue)
MLflow-->>User: Job complete with artifact tracking
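The filter steps and per-split metrics in the diagram can be sketched roughly as follows. This is an illustration in pandas rather than the PR's PyArrow code; `filter_split` is a hypothetical helper, and it assumes a tile counts as annotated when any `tile_coverage_*` column is positive.

```python
import pandas as pd

JOIN_KEYS = ["slide_id", "x", "y"]

def filter_split(tiles: pd.DataFrame, tissue: pd.DataFrame, tissue_column: str):
    """Return the filtered tiles and the three counts named in the diagram."""
    counts = {"original_count": len(tiles)}

    # Annotation filter: assumed to mean "any tile_coverage_* column > 0".
    coverage_cols = [c for c in tiles.columns if c.startswith("tile_coverage_")]
    annotated = tiles[(tiles[coverage_cols] > 0).any(axis=1)]
    counts["after_annotation"] = len(annotated)

    # Tissue filter: keep only the join keys and the tissue column.
    with_tissue = tissue.loc[tissue[tissue_column] > 0, JOIN_KEYS + [tissue_column]]

    filtered = annotated.merge(with_tissue, on=JOIN_KEYS, how="inner")
    counts["after_tissue"] = len(filtered)
    return filtered, counts

# Hypothetical toy split.
tiles = pd.DataFrame({
    "slide_id": ["s1", "s1", "s2", "s2"],
    "x": [0, 256, 0, 256],
    "y": [0, 0, 0, 0],
    "tile_coverage_tumor": [0.5, 0.0, 0.0, 0.2],
    "tile_coverage_stroma": [0.0, 0.0, 0.3, 0.0],
})
tissue = pd.DataFrame({
    "slide_id": ["s1", "s1", "s2", "s2"],
    "x": [0, 256, 0, 256],
    "y": [0, 0, 0, 0],
    "tissue_coverage": [0.9, 0.8, 0.0, 0.4],
})
filtered, counts = filter_split(tiles, tissue, "tissue_coverage")
```

The `counts` dictionary holds the values that the job would log to MLflow per split.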
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Code Review
This pull request introduces a tile filtering preprocessing step that removes tiles lacking annotation or tissue coverage. It includes new configuration files, a PyArrow-based filtering script, and a Kubernetes job submission script. The review feedback recommends enhancing the script's flexibility by removing restrictive column filtering and adding a runtime check to ensure the final filtered dataset is not empty.
Actionable comments posted: 2
🧹 Nitpick comments (1)
scripts/submit_filter_tiles.py (1)
11-12: ⚡ Quick win: Pin the repository revision for reproducible jobs.
Line 11 clones whatever is current at runtime, which makes results non-reproducible across reruns.
Proposed fix
- "git clone https://github.com/RationAI/tissue-classification.git workdir",
+ "git clone --depth 1 https://github.com/RationAI/tissue-classification.git workdir",
+ "cd workdir && git checkout <commit-or-tag>",
- "cd workdir",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/submit_filter_tiles.py` around lines 11 - 12, Replace the unpinned clone command string "git clone https://github.com/RationAI/tissue-classification.git workdir" (and its subsequent "cd workdir") with a reproducible sequence that checks out a specific revision: either clone and checkout a fixed commit/tag (e.g., "git clone https://github.com/RationAI/tissue-classification.git workdir && cd workdir && git checkout <REV>") or use "git clone --branch <TAG> --single-branch ... workdir && cd workdir". Make <REV> configurable via an environment variable or constant (e.g., GIT_REVISION) so the job is reproducible without changing code; update the command strings used by the code that builds the job to reference that variable.
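One way the pinned checkout could be built, as a sketch only. `clone_commands` and the `GIT_REVISION` variable are hypothetical; how the resulting strings are passed to the job submission call is not shown in this thread and is left out.

```python
import shlex

REPO_URL = "https://github.com/RationAI/tissue-classification.git"

def clone_commands(revision: str) -> list[str]:
    """Build shell commands that clone the repo and check out a fixed revision."""
    if not revision:
        # Fail fast instead of silently running whatever is current.
        raise ValueError("GIT_REVISION must be set to a commit hash or tag")
    return [
        f"git clone {REPO_URL} workdir",
        "cd workdir",
        # Quote the revision so unusual characters cannot break the command.
        f"git checkout {shlex.quote(revision)}",
    ]

# Hypothetical usage; in a real script the value might come from
# os.environ["GIT_REVISION"].
commands = clone_commands("abc123")
```

The sketch uses a full clone on purpose: `git clone --depth 1` fetches only the tip of the default branch, so a later `git checkout` of an older commit would not find it.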
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@preprocessing/filter_tiles.py`:
- Around line 59-67: Before using tissue_column in the filter, verify it exists
in the dataset schema: check tissue_column is present in tissue_ds.schema (use
the field names from tissue_ds.schema, e.g., via f.name) and if not raise/raise
ValueError with a clear message referencing tissue_column and available columns;
then proceed to build tissue_cols and call tissue_ds.to_table with the
pads.field(tissue_column) > 0 filter as before (locate variables tissue_column,
tissue_ds, tissue_cols, tissue_table, and pads.field).
In `@scripts/submit_filter_tiles.py`:
- Around line 4-15: The submit_job invocation contains unresolved placeholders
(username=... and "+experiment=...") which break execution; update the
submit_job call in submit_filter_tiles.py to supply real values or read them
from environment/arguments (e.g., use a USER/EXPERIMENT variable or argparse)
and interpolate those into the script command instead of literal "..."; ensure
the variable names used match the submit_job parameters and add a quick
validation check that username and experiment are non-empty before calling
submit_job to fail fast with a clear error.
📒 Files selected for processing (5)
- configs/data/dataset.yaml
- configs/experiment/preprocessing/filter_tiles.yaml
- configs/preprocessing/filter_tiles.yaml
- preprocessing/filter_tiles.py
- scripts/submit_filter_tiles.py
…ists Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Adds a dedicated `filter_tiles` preprocessing step that produces a canonical filtered tile set for downstream steps (threshold_stats, embeddings).
- Filters tiles by annotation coverage (`tile_coverage_*` columns)
- Filters by tissue coverage (only matching rows materialised from the 80M-row parquet)
- Inner-joins tiles with tissue stats (pandas merge replaces the PyArrow hash join, which hung on 45M string-keyed rows)
- Uploads `{train,test}_tiles.parquet` to MLflow under `filter_tiles/`
Pipeline order: tiling → tissue_masks → coverage_stats → filter_tiles → threshold_stats / embeddings
New files
- preprocessing/filter_tiles.py
- configs/preprocessing/filter_tiles.yaml
- configs/experiment/preprocessing/filter_tiles.yaml
- scripts/submit_filter_tiles.py
Config changes
- configs/data/dataset.yaml: added `tissue_stats_run_id`