Profile and optimize Phase 2 species classification performance #72
Closed
Description
Problem
Phase 2 species classification is the pipeline bottleneck. ADMS (11k segments, 5 species) takes ~70 min in sequential mode, or ~14 min per species. The expensive operation is frs_classify(), which runs the fwa_upstream() network traversal for every segment against every break point.
With workers = 4 this parallelizes but is still ~18 min.
Profiling needed
- frs_classify() SQL: the fwa_upstream() call is O(segments × breaks). With 11k segments × 32k breaks that's on the order of 352 million ltree comparisons
- Index strategy on breaks table (now has label + source columns)
- Whether batching or partitioning the classify query would help
- Phase 1 is fast (<80s even with 533k crossings) — no optimization needed there
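To put numbers on the O(segments × breaks) cost before profiling, here is a minimal Python sketch that compares the full cross product with the count you would get if segments were only compared against breaks on the same blue_line_key. The function name comparison_counts and the toy key values are hypothetical, and whether per-key matching gives the same classification depends on how fwa_upstream() traverses across blue_line_keys, which this issue does not state.

```python
from collections import Counter


def comparison_counts(segment_keys, break_keys):
    """Return (full_cross_product, per_key_product) comparison counts.

    segment_keys / break_keys: one blue_line_key per segment / break row.
    """
    seg = Counter(segment_keys)
    brk = Counter(break_keys)
    # Every segment against every break, as in the current query shape.
    full = len(segment_keys) * len(break_keys)
    # Only pairs that share a blue_line_key (Counter returns 0 for missing keys).
    per_key = sum(n * brk[key] for key, n in seg.items())
    return full, per_key


# Sizes quoted in this issue: 11k segments against 32k breaks.
print(11_000 * 32_000)  # 352000000

# Toy data with made-up blue_line_key values.
segments = [1, 1, 2, 3]
breaks = [1, 2, 2, 9]
print(comparison_counts(segments, breaks))  # (16, 4)
```

Running the same count on the real tables would show how much a blue_line_key partition could save, if that partitioning turns out to be valid for the traversal.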
Potential approaches
- Add indexes on breaks table blue_line_key column
- Batch frs_classify() by blue_line_key groups
- Pre-filter breaks to only BLKs present in the working table
- Use spatial index instead of ltree for segment-break matching
- Docker PG tuning (already high work_mem/parallel workers, but could profile)
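The pre-filter and batching ideas above can be sketched in memory as follows. This is only an illustration in Python, not the frs_classify() implementation: the real change would presumably live in the SQL, and the row layout, the label values, and the helper names prefilter_breaks and batch_by_key are assumptions made for the example.

```python
from collections import defaultdict


def prefilter_breaks(breaks, working_keys):
    """Keep only breaks whose blue_line_key appears in the working table."""
    keys = set(working_keys)
    return [b for b in breaks if b["blue_line_key"] in keys]


def batch_by_key(rows):
    """Group rows by blue_line_key so each group can be classified separately."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["blue_line_key"]].append(row)
    return dict(groups)


# Hypothetical rows; only blue_line_key is a name taken from the issue.
breaks = [
    {"blue_line_key": 10, "label": "a"},
    {"blue_line_key": 20, "label": "b"},
    {"blue_line_key": 30, "label": "c"},
    {"blue_line_key": 10, "label": "d"},
]
working_keys = [10, 10, 30]

kept = prefilter_breaks(breaks, working_keys)
print(len(kept))  # 3

batches = batch_by_key(kept)
print(sorted(batches))  # [10, 30]
```

Whether per-key batches are safe depends on whether a break on one blue_line_key can affect segments on another through the upstream traversal, so results should be checked against the unbatched output before adopting this.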
Context
Phase 2 conn fix in v0.7.0 means sequential mode reuses conn correctly. Parallel mode (workers > 1) works but requires PG_*_SHARE env vars pointing at the target DB.
Relates to #70