Phase 1 of `mesh_ifc_streaming_framed` (entity-table walk +
per-product parsing + placement resolve) was the serial Amdahl
tail that capped the rayon speedup at ~2×. It is now split into
three sub-phases, two of them parallel:
- 1a (parallel): shard `table.order()` across rayon workers. Each
worker filters for products and parses each product's args far
enough to extract guid, entity_name, placement_id, repr_id.
Output: `Vec<PartialWork>`.
- 1b (serial): warm a single `PlacementResolver` against every
placement_id from 1a. Chain caching makes this near-free because
placements share long parent tails (every product under a
building reuses the same IfcLocalPlacement chain). Freeze the
cache into `Arc<HashMap<u64, DMat4>>`.
- 1c (parallel): finalize PartialWork → Work by looking up the
world matrix in the frozen cache.
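
The 1a/1b/1c split can be sketched roughly as below. This is an
illustration under stated assumptions, not the code in this commit:
it swaps rayon for `std::thread::scope` so it builds without
dependencies, uses a translation-only `Offset` in place of `DMat4`,
keeps only guid and placement_id in `PartialWork`, and the
`Placements`/`products` maps and `phase1` function are hypothetical
stand-ins for the real entity table and parser.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::thread;

type Offset = [f64; 3]; // stand-in for DMat4 (translation only)
type Placements = HashMap<u64, (Option<u64>, Offset)>; // id -> (parent, local)

#[derive(Debug, Clone, PartialEq)]
struct PartialWork { guid: String, placement_id: u64 }

#[derive(Debug, Clone, PartialEq)]
struct Work { guid: String, world: Offset }

// Memoizing resolver: each chain link is computed once, then cached.
// Assumes the placement graph is acyclic.
struct PlacementResolver<'a> { placements: &'a Placements, cache: HashMap<u64, Offset> }

impl<'a> PlacementResolver<'a> {
    fn new(placements: &'a Placements) -> Self {
        Self { placements, cache: HashMap::new() }
    }
    fn resolve(&mut self, id: u64) -> Offset {
        if let Some(w) = self.cache.get(&id) { return *w; }
        let (parent, local) = self.placements[&id];
        let base = match parent { Some(p) => self.resolve(p), None => [0.0; 3] };
        let world = [base[0] + local[0], base[1] + local[1], base[2] + local[2]];
        self.cache.insert(id, world);
        world
    }
    fn into_cache(self) -> HashMap<u64, Offset> { self.cache }
}

fn phase1(
    order: &[u64],
    products: &HashMap<u64, (String, u64)>, // id -> (guid, placement_id)
    placements: &Placements,
    threads: usize,
) -> Vec<Work> {
    let threads = threads.max(1);
    // 1a (parallel): shard the id order; each worker keeps only products.
    let chunk = ((order.len() + threads - 1) / threads).max(1);
    let partial: Vec<PartialWork> = thread::scope(|s| {
        let handles: Vec<_> = order.chunks(chunk).map(|shard| {
            s.spawn(move || {
                shard.iter().filter_map(|id| {
                    products.get(id).map(|(guid, pid)| PartialWork {
                        guid: guid.clone(),
                        placement_id: *pid,
                    })
                }).collect::<Vec<_>>()
            })
        }).collect();
        // Joining in spawn order preserves the original product order.
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    });

    // 1b (serial): warm one resolver, then freeze its cache.
    let mut resolver = PlacementResolver::new(placements);
    for pw in &partial { resolver.resolve(pw.placement_id); }
    let frozen = Arc::new(resolver.into_cache());

    // 1c (parallel): read-only lookups against the frozen map.
    let chunk = ((partial.len() + threads - 1) / threads).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = partial.chunks(chunk).map(|shard| {
            let cache = Arc::clone(&frozen);
            s.spawn(move || {
                shard.iter().map(|pw| Work {
                    guid: pw.guid.clone(),
                    // Present by construction: 1b resolved every placement_id.
                    world: cache[&pw.placement_id],
                }).collect::<Vec<_>>()
            })
        }).collect();
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}
```

Because workers in 1a and 1c only read shared data and write to
their own output vectors, no locking is needed in either pass; the
only mutation happens on the single thread running 1b.
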
Why not merge 1a/1c into one pass: workers would each need either
`&mut PlacementResolver` (impossible, the borrow is exclusive) or
a thread-safe DashMap cache (contention on shared parent chains).
Two share-nothing passes with a frozen Arc map in between match
the resolver's "walk each chain once" contract.
Adds `EntityTable::order(&self) -> &[u64]` (public sharding
accessor) and
`PlacementResolver::into_cache(self) -> HashMap<u64, DMat4>`.
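
A minimal sketch of how the two accessors are meant to be used,
assuming simplified struct layouts: the fields, the `DMat4` alias,
and the `shard_sizes`/`freeze` helpers are hypothetical, and only
the two method signatures come from this commit. Returning a
borrowed slice lets callers shard with `chunks` without copying,
and taking `self` by value means the resolver cannot be mutated
after its cache has been frozen.

```rust
use std::collections::HashMap;
use std::sync::Arc;

type DMat4 = [[f64; 4]; 4]; // stand-in for the crate's DMat4

pub struct EntityTable { order: Vec<u64> } // hypothetical layout

impl EntityTable {
    /// Entity ids in table order, borrowed so callers can shard them.
    pub fn order(&self) -> &[u64] { &self.order }
}

pub struct PlacementResolver { cache: HashMap<u64, DMat4> } // hypothetical layout

impl PlacementResolver {
    /// Consumes the resolver; no further resolves are possible afterwards.
    pub fn into_cache(self) -> HashMap<u64, DMat4> { self.cache }
}

/// Hypothetical helper: shard lengths when splitting across `workers`.
fn shard_sizes(table: &EntityTable, workers: usize) -> Vec<usize> {
    let workers = workers.max(1);
    let chunk = ((table.order().len() + workers - 1) / workers).max(1);
    table.order().chunks(chunk).map(|shard| shard.len()).collect()
}

/// Hypothetical helper: freeze the warmed cache for read-only sharing.
fn freeze(resolver: PlacementResolver) -> Arc<HashMap<u64, DMat4>> {
    Arc::new(resolver.into_cache())
}
```
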
Bench (8-core, release, real Skiplum files):
- LBK_RIBp_C 41 MB, 34k products, T=8: v0.4.21 ~315 ms median →
v0.4.22 ~245 ms (~22% less time, ~1.3×).
- LBK_ARK_C 192 MB, 19k products, T=8: v0.4.21 ~887 ms median →
v0.4.22 ~605 ms (~32% less time, ~1.5×).
- T=1 fast path unchanged (par_iter at 1 worker = serial, no
overhead).
- 13/13 mesh_reveal + 7/7 cut_openings_integration + 76/76 Python
tests pass; same triangle/product counts.
Bumps to v0.4.22. No cache schema change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>