Profile and optimize Phase 2 species classification performance #72

@NewGraphEnvironment

Problem

Phase 2 species classification is the pipeline bottleneck. ADMS (11k segments, 5 species) takes ~70 min when run sequentially, or ~14 min per species. The expensive operation is frs_classify(), which runs an fwa_upstream() network traversal for every segment against every break point.

With workers = 4 the run parallelizes but still takes ~18 min.
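As a rough sanity check, the timings quoted above can be turned into a speedup and efficiency figure. This is only arithmetic on the approximate numbers in this issue. How work is actually split across workers is not stated here, so the result says nothing about where the remaining time goes.

```python
# Approximate timings quoted in this issue (minutes); all values are rounded.
sequential_min = 70.0   # ADMS, 5 species, sequential
parallel_min = 18.0     # same run with workers = 4
workers = 4
species = 5

per_species_min = sequential_min / species   # average sequential cost per species
speedup = sequential_min / parallel_min      # observed speedup from parallel mode
efficiency = speedup / workers               # fraction of ideal linear scaling

print(f"per species: {per_species_min:.1f} min")
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
```

If these figures hold, adding workers already scales close to linearly, which suggests further gains would have to come from reducing the per-species cost itself.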

Profiling needed

  • frs_classify() SQL: the fwa_upstream() call is O(segments × breaks). With 11k segments × 32k breaks, that is up to ~352 million ltree comparisons
  • Index strategy on breaks table (now has label + source columns)
  • Whether batching or partitioning the classify query would help
  • Phase 1 is fast (<80s even with 533k crossings) — no optimization needed there

Potential approaches

  1. Add an index on the blue_line_key column of the breaks table
  2. Batch frs_classify() by blue_line_key groups
  3. Pre-filter breaks to only the blue_line_key values (BLKs) present in the working table
  4. Use a spatial index instead of ltree for segment-break matching
  5. Docker PG tuning (already high work_mem/parallel workers, but could profile)
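Approaches 2 and 3 can be sketched on toy data to show how they shrink the segments × breaks product. This is a minimal illustration, not the package's implementation: the row shapes, the helper names `prefilter_breaks` and `batch_by_blk`, and the `blks_per_batch` parameter are all hypothetical. Whether the pre-filter preserves classification results depends on whether breaks on BLKs outside the working table can affect segments inside it, which this issue does not settle.

```python
from collections import defaultdict
from itertools import islice

# Hypothetical toy rows; the real tables have more columns and far more rows.
segments = [
    {"id": 1, "blue_line_key": 100},
    {"id": 2, "blue_line_key": 100},
    {"id": 3, "blue_line_key": 200},
    {"id": 4, "blue_line_key": 300},
]
breaks = [
    {"id": "a", "blue_line_key": 100},
    {"id": "b", "blue_line_key": 200},
    {"id": "c", "blue_line_key": 999},  # BLK absent from the working table
]

def prefilter_breaks(segments, breaks):
    """Approach 3: keep only breaks whose BLK appears in the working table."""
    present = {s["blue_line_key"] for s in segments}
    return [b for b in breaks if b["blue_line_key"] in present]

def batch_by_blk(segments, blks_per_batch):
    """Approach 2: group segments by BLK, then chunk the groups into batches."""
    groups = defaultdict(list)
    for s in segments:
        groups[s["blue_line_key"]].append(s)
    keys = iter(sorted(groups))
    while chunk := list(islice(keys, blks_per_batch)):
        yield [s for k in chunk for s in groups[k]]

kept = prefilter_breaks(segments, breaks)
batches = list(batch_by_blk(segments, blks_per_batch=2))

# Candidate pairs before and after the pre-filter.
pairs_before = len(segments) * len(breaks)
pairs_after = len(segments) * len(kept)

print([b["id"] for b in kept])
print([[s["id"] for s in batch] for batch in batches])
print(pairs_before, "->", pairs_after)
```

Each batch would then be classified against the filtered breaks only, so the cost of one query is bounded by the batch size rather than by the full working table.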

Context

The Phase 2 conn fix in v0.7.0 means sequential mode now reuses conn correctly. Parallel mode (workers > 1) works but requires the PG_*_SHARE env vars to point at the target DB.

Relates to #70
