A high-performance C++ / Python library for constructing balanced, randomly mixed bipartite routings using only deterministic bitwise operations and permutations.
The Phase Router deterministically mixes two sets of connections to produce a balanced, predictable routing, avoiding hotspots and enforcing per-row limits—useful for evenly distributing work in large-scale systems.
The Phase Router implements a seed-controlled phase-space mixing operator, also referred to as an Orthogonal Load-Balanced Incidence Operator (OLBIO) (see details), that converts two degree-specified binary matrices into a sparse bipartite coupling which removes input-order bias and phase alignment effects under typical conditions.
It is best understood as a randomized analogy to convolution in cyclic phase space: row-degree mass from both sides is embedded on a ring, independently mixed by permutations and phase shifts, and intersected to produce a sparse routing whose first-order statistics match the expected-degree law of a Chung–Lu bipartite model.
Unlike Chung–Lu, this is not an independent-edge sampler. It is a constructive, seed-randomized, deterministic transport operator implemented entirely with bitwise AND, shifts, popcount, and permutations, making it cache-efficient, SIMD-friendly, and fully reproducible.
This is:
- A deterministic degree-weighted mixing operator
- A fast expected-degree bipartite transport operator
- A scalable routing primitive with hard fan-out limits
This is not:
- A learned router
- A greedy load balancer
- A hash
- It is not a hard real-time load balancer
- It does not eliminate Poisson extremes
- It does not guarantee per-column caps
- It is not optimal for k ≪ √N
Given two binary matrices
S, T ∈ {0,1}^(N×N)
with row sums
s_i = Σ_j S_ij , t_j = Σ_j T_ij
the Phase Router constructs a sparse bipartite routing matrix O with at most k outputs per source row by embedding both matrices into a shared cyclic phase space and intersecting them:
O = S' ∧ (T')^T
Here S' and T' are degree-preserving, seed-randomized, deterministic phase-mixed transforms of the original inputs.
-
Row-degree mass from both sides is packed into contiguous arcs on a ring of size
N. -
Independent permutations and phase shifts distribute arcs quasi-randomly.
-
Bitwise intersection produces a sparse bipartite coupling with first-order statistics:
E[O_ij] ≈ s_i · t_j / N
This matches the expected-degree law of a Chung–Lu bipartite model, but the joint distribution is fundamentally different: edges arise from structured phase overlap, not independent Bernoulli trials.
The Phase Router is therefore a low-discrepancy randomized coupling, not a probabilistic graph sampler.
The phase-mixed transforms S' and T' are constructed as follows:
-
Row Alignment
Left-align all 1-bits in each row of
SandTinto a contiguous block, preserving row sums. -
Phase Offset Computation (Independent for S and T)
Compute cumulative row offsets from the original matrix order:
φ_i^S = Σ_{r<i} s_r (mod N) φ_i^T = Σ_{r<i} t_r (mod N)These offsets determine how much each row will be cyclically rotated.
-
Cyclic row-wise bit rotation (Global N-bit Row-wise)
Apply independent cumulative cyclic rotations to each row:
S_rot[i] = RotateLeft(S[i], φ_i^S) T_rot[i] = RotateLeft(T[i], φ_i^T)This embeds each row's contiguous arc on a ring of size
N, creating low-discrepancy phase spreading that reduces collisions. -
Column Permutations
Apply independent random column permutations to
S_rotandT_rot:S_shuf = PermuteColumns(S_rot, col_perm_S) T_shuf = PermuteColumns(T_rot, col_perm_T)Column permutations redistribute bits horizontally without altering the row-wise phase spreading effect. This preserves row sums while destroying geometric correlations. Each matrix uses a different seed-derived permutation.
-
OPTIONAL: Global Row Permutations (Independent for S and T)
Apply independent row permutations to both matrices:
S_final[i] = S_shuf[row_perm[i]] T_prepared[i] = T_shuf[row_perm_T[i]]This removes input-order bias. Note that
SandTuse different row permutations to ensure independent mixing.The row permutation step is not part of the core algorithm. Its main purposes are:
- Anonymization / obfuscation of input order
- Minor reduction of skew for very small k values
Trade-offs:
- Computational cost: Each row permutation adds memory accesses and bookkeeping overhead, which can noticeably increase runtime for large N.
- Marginal benefit: In two-phase adversarial or large-scale scenarios, row permutations have little to no impact on column skew or maximum loads.
Usage:
- Row permutations can be toggled via the
ENABLE_ROW_PERMcompile-time flag (#ifdef ENABLE_ROW_PERM) or the corresponding runtime API argument. - By default, they are disabled to maximize performance.
- Recommended only for testing, anonymization, or seed-independent input-order mixing.
Performance Note:
Empirical tests show that disabling row permutations:
- Reduces runtime by up to 30–50% for large N (≥1024)
- Has negligible effect on skew for k ≥ 256
- Preserves the Phase Router’s deterministic, low-discrepancy behavior
-
Transpose T 90° Clockwise
Rotate T 90° clockwise (transpose) to align its columns with the output rows:
T_final[j][i] = T_prepared[i][j]This preserves the phase offsets applied in Step 3 while preparing
Tfor the intersection. -
Bitwise Intersection & Seed-deterministic Randomized Top-k Extraction
Boolean AND of S and T:
O[i] = S_final[i] & T_final[i]Only up to k intersections per row are retained. By default, these are chosen via seed-deterministic random selection among candidate overlaps, rather than always taking the top-k. This reduces column skew and distributes load more evenly, especially for small k, while hard fan-out bound is enforced. If fewer than k overlaps exist, the remaining slots are filled with -1.
-
Hard Fan-Out Bound:
Σ_j O_ij ≤ kDeterministically enforced for every source row.
-
Degree-Weighted Load:
Seed-averaged column load satisfies
E[c_j] ∝ t_j , where c_j = Σ_i O_ij -
No Geometric Bias: Columns cannot become hotspots due to input ordering, contiguity, or phase alignment.
-
Deterministic Reproducibility:
Fixed seeds produce identical routings.
The Phase Router is:
-
A degree-weighted bipartite transport operator
-
Seed-randomized but deterministic
-
CPU-efficient and bit-parallel
It does not solve a global optimization problem or sample edges independently.
Practical use cases include Mixture-of-Experts (MoE), sparse attention, and load-balanced routing:
-
Enforces capacity constraints
-
Avoids hotspots
-
Prevents feedback loops
-
Requires no training or coordination
-
Provides predictable first-order statistical behavior
| Router Type | Learned | Deterministic | Load Guarantees | Fan-Out Bound | Global State |
|---|---|---|---|---|---|
| Softmax MoE | ✓ | ✗ | Weak | ✗ | ✓ |
| Hash Router | ✗ | ✓ | Poor | ✓ | ✗ |
| Greedy Balancer | ✗ | ✓ | Strong | ✓ | ✓ |
| Phase Router | ✗ | ✓ | Statistical | ✓ | ✗ |
Hash Router vs Phase Router
| Feature | Hash Router | Phase Router |
|---|---|---|
| Column Skew | Higher (worse) | Lower (better) |
| Max Column Load | Higher / bursty | Lower / balanced |
| Runtime | Faster | Slower |
| Determinism | Seed-deterministic | Seed-deterministic |
| Handling Structured Inputs | Prone to hotspots | Smooth, decorrelated |
For more detail, see phase_router_vs_hash_router_tables.md
Chung–Lu vs Phase Router:
| Feature | Chung–Lu | Phase Router |
|---|---|---|
| Randomness | Independent edges | Structured phase overlap |
| Edge Caps | None | Hard k-cap |
| Reproducibility | ✗ | ✓ (seed-deterministic) |
| Geometry Bias | ✗ | Explicit phase mixing |
The Phase Router produces lower column skew and more balanced routing compared with a hash router, while remaining deterministic and enforcing a hard per-row fan-out limit. It is well-suited for large-scale MoE routing, sparse attention, or other bipartite routing tasks where predictable load distribution and handling of structured inputs are important, even if it is somewhat slower than a simple hash-based approach.
The Phase Router achieves exceptional performance through bit-packed operations and cache-optimized permutations:
- Sub-millisecond routing for N ≤ 512 (typical MoE token routing scale)
- ~16 ms for N=1024, k=256 (production MoE configurations)
- ~78 ms for N=4096, k=512 (large-scale sparse routing)
- Linear scaling with matrix size for fixed k
- Memory-bandwidth limited, not compute-limited
| N | k | Routing Time | Total Time |
|---|---|---|---|
| 256 | 64 | 0.55 ms | 0.68 ms |
| 512 | 256 | 2.35 ms | 2.84 ms |
| 1024 | 256 | 15.61 ms | 17.58 ms |
| 2048 | 512 | 29.71 ms | 42.15 ms |
| 4096 | 512 | 77.95 ms | 110.76 ms |
Note: PBM dumping disabled; PBM output adds (O(N^2)) memory and file I/O overhead that dominates runtime at large (N).
Key observations:
- Disabling debug dumps yields 30-157× speedup (dumping incurs PBM I/O overhead)
- Routing time dominates for large N (memory bandwidth)
- Packing overhead visible only for small N (< 1 ms)
- Suitable for amortized or batch-level routing in real-time MoE and sparse attention systems
- No GPU required: optimized for cache-coherent SIMD hardware
See
why_cpu.md
- Bit-parallel operations using 64-bit words + popcount/ctz intrinsics
- OpenMP parallelization for row-level operations
- Permutations are cache-friendly (sequential reads after shuffle)
- Phase rotations wrap across word boundaries with minimal branching
git clone https://github.com/TSltd/bitpacked_phase_router.gitcd bitpacked_phase_routerNo local Python, virtualenv, or system dependencies required.
docker build -t phase_router_tests .docker run --rm -v "$(pwd -W 2>/dev/null || pwd)/results:/app/results" phase_router_tests- Runs all evaluation scripts back-to-back
- Builds the C++ / pybind backend inside the container
- Saves all CSV, JSON, plots, and Markdown summaries to
./results/ - Uses all memory available to Docker (no artificial limits)
Windows (Docker Desktop): Ensure Docker Desktop → Settings → Resources → Memory is set sufficiently high.
Remove Docker containers/images:
docker container prune # remove stopped containers
docker image rm phase_router_testsRemove cloned repo:
rm -rf bitpacked_phase_router
If you want to modify or debug the code outside Docker.
-
Python: 3.10+ (tested with 3.10/3.12)
-
Core Python packages: (all listed in
requirements.txt)- numpy ≥ 1.21
- pandas ≥ 1.3
- matplotlib ≥ 3.4
- pillow ≥ 8.0
- scipy ≥ 1.7
- torch ≥ 2.0
- tabulate
-
Build dependencies:
- pybind11 ≥ 2.6
- setuptools ≥ 58
-
System / Compiler:
- C++ compiler:
g++,clang++, or MSVC - OpenMP support for multi-threading (optional but recommended)
- psutil ≥ 5.9 (for system benchmarking, optional)
- C++ compiler:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txtpython setup.py build_ext --inplace- Single test (N = 256, k = 32):
python evaluation/phase_router_test.py- Batch execution script (full scaling sweep):
python evaluation/phase_router_run.py- Stress test: Phase Router vs Hash Router
python evaluation/phase_router_vs_hash.py --skip-plots- 10-point multi-test probing load balance, determinism, composability, and failure modes:
python evaluation/phase_router_test_matrix.py- Random binary matrix routing demo:
python examples/demo_router.py- MoE capacity planning & overflow analysis:
python examples/moe_routing_demo.pystats = router.pack_and_route(S, T, k, routes, seed=42)-
Runs
align → phase → permutations → transpose → AND → top-k extraction -
Deterministic given a seed
Advanced API: Users can provide their own packed arrays and permutations:
route_packed_with_stats(...)- Each seed produces a fully deterministic routing:
fixed `(S, T, k)` + seed → deterministic routing `O(seed)`
- Averaging over seeds approximates expected-degree behavior.
- Statistical quantities are seed-averaged, not independent-edge estimates
-
monte_carlo_stats()→ seed-averaged loads, biases, skew -
estimate_expert_capacity()→ percentile-based capacity planning -
suggest_k_for_balance()→ parameter search over k -
analyze_routing_distribution()→ theoretical vs empirical densities
Key Features:
-
Vectorized column load computation
-
Binary search for optimal k
-
Tail-risk estimation for MoE
-
Deterministic reproducibility
| Metric | Formula | Interpretation |
|---|---|---|
mean_load[j] |
E[L_j] | Expected load on column j |
ideal_mean[j] |
k·t_j/∑t | Seed-averaged expected-degree baseline |
bias[j] |
mean_load[j] - ideal_mean[j] | Deviation from theory |
global_skew |
max(mean_load)/mean(mean_load) | MoE overload risk |
temporal_skew[j] |
max(L_j)/mean(L_j) | Per-column variability |
edge_density |
∑mean_load/(N²) | Empirical sparsity |
theoretical_density |
k/N | Expected sparsity |
Note: These are seed-averaged and not Monte-Carlo independent-edge estimates
-
This is not a cryptographic primitive
-
Guarantees apply to benign or random-like degree inputs
-
Adversarially constructed inputs may violate mixing assumptions if seed is known
| Symbol / Expression | Meaning |
|---|---|
S, T ∈ {0,1}^(N×N) |
Binary source and target matrices of size N×N |
s_i = Σ_j S_ij |
Row sum of source row i |
t_j = Σ_i T_ij |
Column sum of target column j |
O = S' ∧ (T')^T |
Output routing via bitwise AND of phase-mixed S' and transposed T' |
E[O_ij] ≈ s_i * t_j / N |
Expected output bit; first-order Chung–Lu degree approximation |
Pr(S'_ik = 1) |
Probability that the k-th bit of row i in S' is set |
O_j = Σ_i O_ij |
Column load induced by structured phase overlap |
φ_i^S = Σ_{r<i} s_r |
Phase offset for row i in S |
fill = (total active routes) / (N * k) |
Fill ratio |
max(c_j) / mean(c_j) |
Load balance ratio across columns |
The theoretical construction is documented in theory.md
The testing suite is documented in Testing Suite.md
Empirical performance and load-balance results are documented in evaluation.md
Potential applications are suggested in applications.md
