BalanceRoute: DP Load-Balanced Routing for MoE LLM Serving

Implementation and study of request-routing algorithms for data-parallel (DP) Mixture-of-Experts (MoE) LLM serving, based on "Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving" (Bu et al., 2026). Includes BR-0, BR-H, and our extensions (JSQ-Load, Fast-Phi), validated on real multi-node GPU hardware using vLLM's native DP with expert-parallel (EP) barrier synchronization.

Blog post: blog_post.md — detailed writeup with examples, math, and experimental results.

Why This Matters

MoE models (DeepSeek-V3, Qwen3-30B-A3B, Mixtral) use Expert Parallelism (EP) across DP workers. The EP all-to-all creates a synchronization barrier at every decode step, so the slowest worker sets the step time for all of them. Poor routing leaves the other GPUs idle at this barrier. In our experiments, switching from vLLM's default queue-count routing to KV-load-aware routing improves throughput by 7-15% (see Key Results).
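
To make the barrier cost concrete, here is a minimal sketch using made-up per-worker step times. The numbers and function names are illustrative assumptions, not measurements or code from this repo. It models one decode step as finishing when the slowest worker finishes and reports the share of worker time spent idle.

```python
def barrier_step_time(worker_times_ms):
    """A step under a synchronization barrier ends when the slowest worker ends."""
    return max(worker_times_ms)


def idle_fraction(worker_times_ms):
    """Fraction of total worker-time spent waiting at the barrier."""
    step = barrier_step_time(worker_times_ms)
    busy = sum(worker_times_ms)
    return 1.0 - busy / (step * len(worker_times_ms))


# Hypothetical per-worker compute times for one decode step (milliseconds).
balanced = [10.0, 10.0, 10.0, 10.0]
skewed = [6.0, 8.0, 10.0, 16.0]  # same total work, uneven split

print(barrier_step_time(balanced), idle_fraction(balanced))  # 10.0 0.0
print(barrier_step_time(skewed), idle_fraction(skewed))      # 16.0 0.375
```

With the same total work, the skewed split takes 16 ms per step instead of 10 ms, and 37.5% of worker time is spent waiting at the barrier.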

Two Architectures

This repo contains two approaches. See each directory's README for details.

1. Proxy-based routing — `balanceroute/` (dense models)

An external FastAPI proxy sits in front of independent vLLM instances and routes requests. No vLLM source code is modified. Suitable for dense models where DP workers are independent.

Clients → [Proxy :9000] → vLLM :8000 (GPU 0)
                         → vLLM :8001 (GPU 1)
                         → ...
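
The routing decision a proxy like this makes can be sketched without any web framework. The `BackendPool` class below is a hypothetical helper, not the code in `balanceroute/proxy.py`; it assumes the proxy tracks in-flight requests per backend and sends each new request to the least busy one.

```python
from dataclasses import dataclass, field


@dataclass
class BackendPool:
    """Tracks in-flight requests per backend (hypothetical, for illustration)."""
    urls: list
    in_flight: dict = field(default_factory=dict)

    def __post_init__(self):
        self.in_flight = {url: 0 for url in self.urls}

    def acquire(self):
        # Pick the backend with the fewest in-flight requests.
        # Ties go to the backend listed first.
        url = min(self.urls, key=lambda u: self.in_flight[u])
        self.in_flight[url] += 1
        return url

    def release(self, url):
        # Call when the backend has finished responding.
        self.in_flight[url] -= 1


pool = BackendPool(["http://localhost:8000", "http://localhost:8001"])
first = pool.acquire()   # both idle, so the first backend is chosen
second = pool.acquire()  # first backend is busy, so the second is chosen
pool.release(first)
third = pool.acquire()   # first backend is idle again
print(first, second, third)
```

A real proxy would call `acquire` before forwarding a request and `release` once the response completes, including on errors.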

2. Native DP routing — `scripts/` (MoE models, recommended)

Monkey-patches vLLM's internal DP router (DPLBAsyncMPClient) to inject custom routing while preserving the real EP barrier. This is the main experiment setup and captures the true cost of load imbalance.

Clients → vLLM (patched DP router) → Engine 0 ─┐
                                    → Engine 1 ─┤ EP all-to-all barrier
                                    → ...       ─┘
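
The patching pattern itself can be sketched on a toy class. `ToyDPClient`, `pick_engine`, and `install_router` are hypothetical names chosen for illustration; vLLM's `DPLBAsyncMPClient` has a different interface, and the actual patch lives in `scripts/vllm_patched_dp.py`.

```python
class ToyDPClient:
    """Stand-in for a DP client. This is NOT vLLM's DPLBAsyncMPClient."""

    def __init__(self, loads):
        self.loads = loads  # hypothetical per-engine load numbers

    def pick_engine(self):
        # Original behaviour of this toy class: always engine 0.
        return 0


def install_router(cls, router):
    """Replace cls.pick_engine with a custom router; return the original method."""
    original = cls.pick_engine

    def patched(self):
        return router(self.loads)

    patched.__wrapped__ = original  # keep the original reachable for debugging
    cls.pick_engine = patched
    return original


def lowest_load(loads):
    # Index of the engine with the smallest load.
    return min(range(len(loads)), key=loads.__getitem__)


client = ToyDPClient([5, 2, 9])
before = client.pick_engine()
original = install_router(ToyDPClient, lowest_load)
after = client.pick_engine()  # existing instances see the patched method
print(before, after)  # 0 1
```

Because only the selection method is replaced, everything downstream of the choice (in the real system, the engines and their EP barrier) runs unchanged.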

Quick Start

Simulation (no GPUs needed)

pip install -e .
python run_simulation.py --num-workers 8 --num-requests 5000

Live experiment on a GPU cluster

# 1. Install
pip install vllm
pip install -e .

# 2. Download a model and dataset
# (adjust paths for your cluster)
python scripts/prepare_data.py

# 3. Launch vLLM with patched DP router
python scripts/vllm_patched_dp.py \
    --model <path-to-model> \
    --data-parallel-size 4 --tensor-parallel-size 2 \
    --dp-router jsq_load \
    --host 0.0.0.0 --port 8000

# 4. Benchmark
python run_benchmark.py \
    --target http://localhost:8000/v1/chat/completions \
    --num-requests 500 --request-rate 3.0 \
    --model <model-name> --dataset sharegpt

For multi-node experiments with Ray, see scripts/README.md.

Routing Algorithms

| Router | Description | Strengths |
| --- | --- | --- |
| `vllm_default` | `score = 4*waiting + running` | Simple, reliable, zero failures |
| `jsq_load` | Route to worker with lowest KV load | Best throughput and tail latency among zero-failure routers |
| `br0` | BR-0 F-score with JSQ-Load tie-breaker | Highest peak throughput, targets overflow |
| `fast_phi` | K-point quadrature of overflow integral | Best TTFT, zero failures, scales at large G |
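
The difference between the first two routers can be sketched as follows. The `vllm_default` score is the formula quoted above; the `WorkerStats` fields, the unit of `kv_load`, and the example numbers are assumptions made for illustration and are not taken from `balanceroute/routers.py`.

```python
from dataclasses import dataclass


@dataclass
class WorkerStats:
    waiting: int    # queued requests
    running: int    # requests currently being decoded
    kv_load: float  # KV-cache usage; the unit (tokens here) is an assumption


def vllm_default_score(w):
    # score = 4*waiting + running, as quoted in the table above.
    return 4 * w.waiting + w.running


def route_vllm_default(workers):
    # Lowest queue-count score wins; ties go to the lowest index.
    return min(range(len(workers)), key=lambda i: vllm_default_score(workers[i]))


def route_jsq_load(workers):
    # Lowest KV load wins, regardless of how many requests produce it.
    return min(range(len(workers)), key=lambda i: workers[i].kv_load)


# Hypothetical snapshot: worker 0 has few requests but long contexts.
workers = [
    WorkerStats(waiting=0, running=3, kv_load=9000.0),
    WorkerStats(waiting=0, running=4, kv_load=2500.0),
    WorkerStats(waiting=1, running=1, kv_load=4000.0),
]
print(route_vllm_default(workers), route_jsq_load(workers))  # 0 1
```

In this snapshot the queue-count score sends the next request to worker 0, which already holds the most KV cache, while the KV-load rule sends it to worker 1.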

Key Results (DP=8, 2 nodes, Qwen3-30B-A3B, real EP barrier)

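| Router | Throughput | TPOT P95 | TTFT | Fail% |
| --- | --- | --- | --- | --- |
| `vllm_default` | 69.6 tok/s | 3,381 ms | 12.8 s | 0.0% |
| `jsq_load` | 74.8 tok/s | 3,047 ms | 10.5 s | 0.0% |
| `fast_phi` | 70.1 tok/s | 3,385 ms | 9.1 s | 0.0% |
| `br0` | 79.9 tok/s | 3,012 ms | 11.3 s | 3.6% |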

(Rate = 3.0 req/s, 500 ShareGPT requests, max 1024 output tokens)
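
The relative throughput change over `vllm_default` can be recomputed from the table. The snippet below uses only the throughput numbers reported above; the variable names are arbitrary.

```python
# Throughput in tok/s, copied from the Key Results table.
baseline = 69.6  # vllm_default
throughput = {"jsq_load": 74.8, "fast_phi": 70.1, "br0": 79.9}

# Percentage gain over the baseline, rounded to one decimal place.
gains = {name: round(100 * (tps / baseline - 1), 1) for name, tps in throughput.items()}
print(gains)  # {'jsq_load': 7.5, 'fast_phi': 0.7, 'br0': 14.8}
```

Note that the `br0` gain comes with a 3.6% failure rate, while `jsq_load` reaches its gain with no failed requests.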

Key Files

| File | Description |
| --- | --- |
| `balanceroute/routers.py` | All routing algorithms (BR-0, BR-H, JSQ-Load, baselines) |
| `balanceroute/proxy.py` | FastAPI proxy server with W&B logging |
| `balanceroute/simulator.py` | Offline trace-driven simulator |
| `balanceroute/benchmark.py` | Async benchmark client |
| `scripts/vllm_patched_dp.py` | Monkey-patches vLLM's DP router (main experiment code) |
| `scripts/vllm_ray_dp.py` | Multi-node launcher via Ray |
| `scripts/plot_gantt.py` | Visualization (active requests, imbalance, Gantt charts) |
| `blog_post.md` | Full writeup with math, examples, and analysis |

Configuration

Config files in configs/:

| Config | Model | Use case |
| --- | --- | --- |
| `small.yaml` | Qwen3-0.6B | Quick testing, 1 node |
| `medium.yaml` | Qwen3-30B-A3B | MoE experiments, 1-2 nodes |
| `default.yaml` | DeepSeek-V4-Flash | Full-scale, 4-8 nodes |

References

Bu et al. (2026). Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving. arXiv:2605.06113v2
Chen et al. (2026). A Universal Load Balancing Principle and Its Application to Large Language Model Serving. arXiv:2601.17855v2
