Optimize FREIGHT multi-pass streaming evaluation (1.8x) #2

Merged: schulzchristian merged 1 commit into main from opt/freight-multipass-eval on Apr 3, 2026

schulzchristian commented Apr 3, 2026

Summary

  • Optimize multi-pass evaluation in bindings/freight_binding.cpp for ~1.8x overall speedup
  • Replace expensive net_to_nodes reverse mapping + std::set-per-net evaluation with in-memory bit vectors (connectivity) and direct CUT_NET counting (cut-net)
  • Eliminate the valid_neighboring_nets vector; use memcpy for snapshots and output; skip redundant copies
  • All results remain bit-identical to the FREIGHT CLI (freight_con_opt, freight_cut_opt)
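
To make the connectivity change concrete, here is a minimal sketch of per-net bit vectors evaluated with a population count. It is not the code from bindings/freight_binding.cpp: the names `NetBlockBits`, `mark`, `blocks_touched`, `connectivity`, and `popcount64` are hypothetical, and it assumes unit net weights and the usual (λ − 1) connectivity metric, where λ is the number of distinct blocks a net touches.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Portable population count (Kernighan's loop): each iteration clears the lowest set bit.
inline std::size_t popcount64(std::uint64_t x) {
    std::size_t n = 0;
    while (x != 0) {
        x &= x - 1;
        ++n;
    }
    return n;
}

// Hypothetical container, not the binding's actual type:
// one bit per block for every net, packed into ceil(k / 64) words per net.
struct NetBlockBits {
    std::size_t k;
    std::size_t words_per_net;
    std::vector<std::uint64_t> bits;

    NetBlockBits(std::size_t num_nets, std::size_t num_blocks)
        : k(num_blocks),
          words_per_net((num_blocks + 63) / 64),
          bits(num_nets * words_per_net, 0) {}

    // Record that a pin of `net` was assigned to `block`. Repeated calls are harmless.
    void mark(std::size_t net, std::size_t block) {
        assert(block < k);
        bits[net * words_per_net + block / 64] |= std::uint64_t{1} << (block % 64);
    }

    // lambda(net): number of distinct blocks that contain a pin of `net`.
    std::size_t blocks_touched(std::size_t net) const {
        std::size_t lambda = 0;
        for (std::size_t w = 0; w < words_per_net; ++w) {
            lambda += popcount64(bits[net * words_per_net + w]);
        }
        return lambda;
    }
};

// Sum of (lambda - 1) over all nets, assuming unit net weights.
inline std::size_t connectivity(const NetBlockBits& nb, std::size_t num_nets) {
    std::size_t total = 0;
    for (std::size_t net = 0; net < num_nets; ++net) {
        const std::size_t lambda = nb.blocks_touched(net);
        if (lambda > 1) total += lambda - 1;
    }
    return total;
}
```

Compared with a std::set per net, this layout keeps all state in one contiguous array and turns evaluation into a linear scan. The sketch does not model how the bits are reset or updated between passes.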

Benchmark (ibm18, k=8, --ram_stream, wall-clock including evaluation)

| Passes | Objective | CLI wall | Binding wall | Speedup |
|---|---|---|---|---|
| 1 | connectivity | 101ms | 36ms | 2.8x |
| 5 | connectivity | 366ms | 262ms | 1.4x |
| 10 | connectivity | 653ms | 515ms | 1.3x |
| 1 | cut-net | 102ms | 37ms | 2.8x |
| 5 | cut-net | 371ms | 242ms | 1.5x |
| 10 | cut-net | 685ms | 494ms | 1.4x |

Quality and balance match exactly across all tested configs (ibm01/ibm05/ibm18, k=8, passes 1-10, both objectives, seed=0).

Test plan

  • Verify bit-identical results with CLI on ISPD98 instances
  • Build on macOS (x86 + ARM) -- no platform-specific code used
  • Run existing tests/test_freight.py

Replace expensive per-pass evaluation with efficient in-memory alternatives,
producing bit-identical results with the FREIGHT CLI.

Changes to bindings/freight_binding.cpp:

- Connectivity evaluation: replace vector-of-vectors reverse mapping +
  std::set-per-net with per-net bit vectors (ceil(k/64) words per net),
  set incrementally during the main partitioning loop, evaluated via popcount

- Cut-net evaluation: count CUT_NET entries in stream_edges_assign directly
  with a single O(num_nets) sequential scan instead of building a reverse
  mapping

- Eliminate valid_neighboring_nets vector; re-iterate CSR edges for per-net
  tracking update (edge data is in L1 cache from prior accumulation scan)

- Pre-allocate best partition vectors to avoid dynamic reallocation

- Skip best-partition snapshot on the last pass (read from stream_nodes_assign
  directly if last pass is best); use memcpy for intermediate snapshots

- Copy result directly from best/current assignment to numpy output via memcpy,
  skipping the intermediate restore step

- Replace /dev/null file open with lightweight null_buf for output suppression
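
As an illustration of the cut-net item in the list above, the following sketch keeps one state entry per net and counts the entries marked as cut in a single pass. It is an assumption-laden sketch rather than the binding's code: the sentinel values `kUnassigned` and `kCutNet`, the helper names `note_pin` and `count_cut_nets`, and the `std::int32_t` element type are all hypothetical, since the PR does not show how stream_edges_assign is declared.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sentinels; the actual values used by FREIGHT are not shown in this PR.
constexpr std::int32_t kUnassigned = -1;
constexpr std::int32_t kCutNet = -2;

// Update one net's entry when one of its pins lands in `block`:
// the first pin records the block, a pin in a different block marks the net as cut.
inline void note_pin(std::vector<std::int32_t>& net_assign,
                     std::size_t net, std::int32_t block) {
    assert(block >= 0);
    std::int32_t& state = net_assign[net];
    if (state == kUnassigned) {
        state = block;
    } else if (state != kCutNet && state != block) {
        state = kCutNet;
    }
}

// Cut-net objective with unit weights: one O(num_nets) sequential scan,
// with no reverse net-to-nodes mapping needed.
inline std::size_t count_cut_nets(const std::vector<std::int32_t>& net_assign) {
    return static_cast<std::size_t>(
        std::count(net_assign.begin(), net_assign.end(), kCutNet));
}
```

Because the per-net state is maintained while nodes are assigned, evaluation reduces to counting sentinel entries in a contiguous array.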

Verified bit-identical against FREIGHT CLI (freight_con_opt, freight_cut_opt)
on ISPD98 ibm01/ibm05/ibm18, k=8, passes 1-10, both objectives.

Binding vs CLI wall-clock (--ram_stream, includes evaluation):

  ibm18 connectivity k=8:
    1 pass:  CLI 101ms  Bind  36ms  (2.8x faster)
    5 pass:  CLI 366ms  Bind 262ms  (1.4x faster)
   10 pass:  CLI 653ms  Bind 515ms  (1.3x faster)

  ibm18 cut-net k=8:
    1 pass:  CLI 102ms  Bind  37ms  (2.8x faster)
    5 pass:  CLI 371ms  Bind 242ms  (1.5x faster)
   10 pass:  CLI 685ms  Bind 494ms  (1.4x faster)

schulzchristian merged commit 28fbb1d into main on Apr 3, 2026