# feat: linear classifier training pipeline on precomputed embeddings (#11)
Extract derive_labels logic to shared preprocessing/_labels.py, then use it in both split/kfold_split.py and the new embedding_dataset pipeline. The new pipeline joins k-fold (train) / filter_tiles (test) tile metadata with precomputed embeddings after applying tissue + per-dominant-class ROI thresholds, and emits a SlidesTilesLoader-compatible Parquet dataset as an MLflow artifact.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Joining 1M+ rows of list<double> embeddings was either OOMing on to_pandas() or hitting int32 list-offset overflow inside take(). The fix:
- read embeddings into Arrow only and cast each chunk to large_list so take() concatenation uses int64 offsets;
- run the join on keys plus a synthetic row index because Acero refuses list columns in non-key fields, then pull embeddings via take();
- combine_chunks() before take() for an O(N) single-pass copy;
- write the parquet straight from Arrow, never materialising the embedding column in pandas.
Also bumps the kube job memory to 64Gi to give the combined-chunks + take() peak some headroom, and trims the verbose [timing] prints down to one progress line per split.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without this guard a malformed train artifact would crash deep inside apply_thresholds with a confusing KeyError. Surface a clear error that points at the expected upstream artifact instead.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
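A guard of this kind might look roughly like the sketch below. The helper name `require_columns`, the column names, and the artifact hint string are all hypothetical; the commit message does not say which columns are checked or how the error is worded.

```python
from typing import Iterable


def require_columns(
    available: Iterable[str], required: Iterable[str], artifact_hint: str
) -> None:
    """Fail early with a descriptive error if expected columns are absent.

    `artifact_hint` is free text naming the upstream artifact that should
    have produced the columns (hypothetical; adapt to the real pipeline).
    """
    missing = sorted(set(required) - set(available))
    if missing:
        raise ValueError(
            f"Train artifact is missing required column(s) {missing}; "
            f"expected them to come from the upstream artifact: {artifact_hint}"
        )


# Passes silently when every required column is present.
require_columns(["slide_id", "fold", "tissue_prop"], ["slide_id", "fold"], "k-fold split")
```

Calling this before the thresholding step turns a late `KeyError` into an immediate error message that names the missing columns and where they were expected to come from.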
📝 Walkthrough
This PR introduces an end-to-end linear classifier training pipeline for tissue classification. It adds type aliases and a shared label-computation helper, implements a PyArrow-backed embedding dataset with metadata filtering and k-fold support, provides a Lightning-based DataModule and MetaArch model architecture with class-weighted training and comprehensive MLflow logging, creates a parquet prediction callback, defines Hydra configurations for model/data/trainer/experiments, and wires everything through a Hydra entrypoint that orchestrates training via a Kubernetes job submission script.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
The PR introduces substantial new logic across multiple domains: embedding dataset with complex filtering/joining logic, Lightning MetaArch with multi-metric tracking and MLflow integration, extensive Hydra configuration wiring, and a new training entrypoint. While individual files are coherent, the cross-file dependencies, data flow complexity, and MLflow/Lightning integration patterns demand careful review of assumptions around tensor shapes, metric computation, and artifact logging.
Train loss ~0.02 vs val loss ~0.32 indicated severe overfit on the linear probe. AdamW weight_decay was 0; bump to 1e-3 to regularize the head.
This reverts commit c5bab90.
Summary
Adds an end-to-end ML training pipeline for linear probing on precomputed tile
embeddings. Introduces the embedding dataset preprocessing step, a PyTorch
Lightning training module, and all supporting configs and submission scripts.
Changes
Preprocessing
- `preprocessing/_labels.py`: shared label/tissue-prop derivation logic.
ML training
- `ml/meta_arch.py`: the `MetaArch` Lightning module (backbone + decode head + `CrossEntropyLoss` with balanced class weights computed from the train fold). Logs per-class metrics, confusion matrices, and per-slide accuracy.
- `ml/data/datasets/embedding_tiles.py`: `EmbeddingTilesDataset` loads the embedding parquet, inner-joins with metadata, and serves `(embedding, label, slide_id)` triples. Stays in Arrow for the join to avoid large-list → pandas conversion overhead.
- `ml/data/data_module.py`: Lightning `DataModule` wrapping train/val/test splits.
- `ml/callbacks/parquet_prediction_writer.py`: writes model predictions to Parquet.
- `configs/experiment/ml/linear_classifier.yaml`: full experiment config.
- `configs/ml/`: model, data, and trainer sub-configs.
- `scripts/submit_train_linear.py`: MLflow submission script.
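The description does not spell out how the balanced class weights in `ml/meta_arch.py` are computed. One common convention, the one scikit-learn uses for `class_weight="balanced"`, is `n_samples / (n_classes * count_c)`. The sketch below assumes that convention; the function name and the handling of empty classes are illustrative choices, not taken from the PR.

```python
from collections import Counter
from typing import List, Sequence


def balanced_class_weights(labels: Sequence[int], num_classes: int) -> List[float]:
    """Per-class weights inversely proportional to class frequency.

    Uses n_samples / (num_classes * count_c). A class with no samples in
    the train fold gets weight 0.0 here (an assumption made to avoid
    division by zero).
    """
    counts = Counter(labels)
    n_samples = len(labels)
    weights = []
    for cls in range(num_classes):
        count = counts.get(cls, 0)
        weights.append(n_samples / (num_classes * count) if count else 0.0)
    return weights


# Three tiles of class 0 and one of class 1: the rare class is up-weighted.
train_fold_labels = [0, 0, 0, 1]
weights = balanced_class_weights(train_fold_labels, num_classes=2)
```

The resulting list can be converted to a tensor and passed as the `weight` argument of `torch.nn.CrossEntropyLoss`, so that errors on rare classes contribute more to the loss.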