Optimize FREIGHT multi-pass streaming evaluation (1.8x) by schulzchristian · Pull Request #2 · CHSZLab/CHSZLabLib

schulzchristian · 2026-04-03T08:47:47Z

Summary

- Optimize multi-pass evaluation in bindings/freight_binding.cpp for ~1.8x overall speedup
- Replace expensive net_to_nodes reverse mapping + std::set-per-net evaluation with in-memory bit vectors (connectivity) and direct CUT_NET counting (cut-net)
- Eliminate valid_neighboring_nets vector, use memcpy for snapshots and output, skip redundant copies
- All results remain bit-identical with FREIGHT CLI (freight_con_opt, freight_cut_opt)
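The bit-vector approach to connectivity evaluation can be illustrated with a minimal sketch. It is not the code from bindings/freight_binding.cpp: the `NetBlockBits` type and its member names are hypothetical, unit net weights are assumed, and `std::bitset::count` stands in for whatever popcount primitive the binding actually uses.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the PR): one bit per (net, block) pair,
// stored as ceil(k / 64) 64-bit words per net.
struct NetBlockBits {
    std::size_t words_per_net;        // ceil(k / 64)
    std::vector<std::uint64_t> bits;  // num_nets * words_per_net words, all zero initially

    NetBlockBits(std::size_t num_nets, std::size_t k)
        : words_per_net((k + 63) / 64), bits(num_nets * words_per_net, 0) {}

    // Record that a pin of `net` was assigned to `block`.
    // Intended to be called incrementally while nodes are being assigned.
    void mark(std::size_t net, std::size_t block) {
        bits[net * words_per_net + block / 64] |= std::uint64_t{1} << (block % 64);
    }

    // Number of distinct blocks that `net` touches (its lambda value).
    std::size_t lambda(std::size_t net) const {
        std::size_t count = 0;
        for (std::size_t w = 0; w < words_per_net; ++w) {
            count += std::bitset<64>(bits[net * words_per_net + w]).count();
        }
        return count;
    }

    // Connectivity objective under the unit-weight assumption:
    // sum over all nets of (lambda - 1); nets with no marked pins contribute 0.
    std::uint64_t connectivity() const {
        std::uint64_t total = 0;
        const std::size_t num_nets = words_per_net ? bits.size() / words_per_net : 0;
        for (std::size_t net = 0; net < num_nets; ++net) {
            const std::size_t l = lambda(net);
            if (l > 1) total += l - 1;
        }
        return total;
    }
};
```

With k=8 each net needs a single 64-bit word, so evaluation reduces to one popcount per net instead of building a std::set per net.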

Benchmark (ibm18, k=8, --ram_stream, wall-clock including evaluation)

passes	objective	CLI wall	Binding wall	speedup
1	connectivity	101ms	36ms	2.8x
5	connectivity	366ms	262ms	1.4x
10	connectivity	653ms	515ms	1.3x
1	cut-net	102ms	37ms	2.8x
5	cut-net	371ms	242ms	1.5x
10	cut-net	685ms	494ms	1.4x

Quality and balance match exactly across all tested configs (ibm01/ibm05/ibm18, k=8, passes 1-10, both objectives, seed=0).

Test plan

- Verify bit-identical results with CLI on ISPD98 instances
- Build on macOS (x86 + ARM) -- no platform-specific code used
- Run existing tests/test_freight.py

Replace expensive per-pass evaluation with efficient in-memory alternatives, producing bit-identical results with the FREIGHT CLI.

Changes to bindings/freight_binding.cpp:
- Connectivity evaluation: replace vector-of-vectors reverse mapping + std::set-per-net with per-net bit vectors (ceil(k/64) words per net), set incrementally during the main partitioning loop, evaluated via popcount
- Cut-net evaluation: count CUT_NET entries in stream_edges_assign directly instead of building a reverse mapping (O(num_nets) sequential scan)
- Eliminate valid_neighboring_nets vector; re-iterate CSR edges for per-net tracking update (edge data is in L1 cache from prior accumulation scan)
- Pre-allocate best partition vectors to avoid dynamic reallocation
- Skip best-partition snapshot on the last pass (read from stream_nodes_assign directly if last pass is best); use memcpy for intermediate snapshots
- Copy result directly from best/current assignment to numpy output via memcpy, skipping the intermediate restore step
- Replace /dev/null file open with lightweight null_buf for output suppression

Verified bit-identical against FREIGHT CLI (freight_con_opt, freight_cut_opt) on ISPD98 ibm01/ibm05/ibm18, k=8, passes 1-10, both objectives.

Binding vs CLI wall-clock (--ram_stream, includes evaluation):

ibm18 connectivity k=8:
1 pass: CLI 101ms Bind 36ms (2.8x faster)
5 pass: CLI 366ms Bind 262ms (1.4x faster)
10 pass: CLI 653ms Bind 515ms (1.3x faster)

ibm18 cut-net k=8:
1 pass: CLI 102ms Bind 37ms (2.8x faster)
5 pass: CLI 371ms Bind 242ms (1.5x faster)
10 pass: CLI 685ms Bind 494ms (1.4x faster)
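The direct cut-net count described in the commit message can be sketched as follows, assuming a per-net state array in which each entry is either "unassigned", a single block ID, or a cut sentinel. The names `BlockID`, `kUnassigned`, `kCutNet`, `note_pin`, and `count_cut_nets` are hypothetical, as are the sentinel values; the actual layout of stream_edges_assign and the value of CUT_NET in FREIGHT may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

using BlockID = std::uint32_t;

// Assumed sentinel values, chosen only for this sketch.
constexpr BlockID kUnassigned = std::numeric_limits<BlockID>::max();
constexpr BlockID kCutNet = std::numeric_limits<BlockID>::max() - 1;

// Update the per-net state when one pin of `net` is assigned to `block`.
// A net keeps its block ID while all seen pins agree, and becomes kCutNet
// as soon as a second, different block shows up.
inline void note_pin(std::vector<BlockID>& net_assign, std::size_t net, BlockID block) {
    BlockID& state = net_assign[net];
    if (state == kUnassigned) {
        state = block;
    } else if (state != kCutNet && state != block) {
        state = kCutNet;
    }
}

// Cut-net objective with unit net weights: one sequential O(num_nets) scan,
// with no net-to-nodes reverse mapping required.
inline std::uint64_t count_cut_nets(const std::vector<BlockID>& net_assign) {
    return static_cast<std::uint64_t>(
        std::count(net_assign.begin(), net_assign.end(), kCutNet));
}
```

Because the per-net state is already maintained while nodes are assigned, evaluation after each pass is a single linear scan over contiguous memory.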

schulzchristian merged commit 28fbb1d into main Apr 3, 2026

schulzchristian merged 1 commit into main from opt/freight-multipass-eval
