Implementation and study of request routing algorithms for data-parallel MoE LLM serving, based on "Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving" (Bu et al., 2026). Includes BR-0, BR-H, and our extensions (JSQ-Load, Fast-Phi), with real multi-node GPU validation using vLLM's native DP with EP barrier synchronization.
Blog post: blog_post.md — detailed writeup with examples, math, and experimental results.
MoE models (DeepSeek-V3, Qwen3-30B-A3B, Mixtral) use Expert Parallelism (EP) across DP workers. The EP all-to-all creates a synchronization barrier at every decode step — the slowest worker determines throughput for everyone. Bad routing wastes GPU cycles at this barrier. Our experiments show that switching from vLLM's default queue-count routing to KV-load-aware routing improves throughput by 7-15%.
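
The barrier effect can be illustrated with a small sketch. The per-worker step times below are made-up numbers and the helper names are hypothetical; the only assumption taken from the text is that each decode step finishes when the slowest worker finishes.

```python
def barrier_step_time(per_worker_ms):
    """With a barrier, every worker waits for the slowest one."""
    return max(per_worker_ms)


def wasted_fraction(per_worker_ms):
    """Share of total GPU time spent idle at the barrier during one step."""
    step = barrier_step_time(per_worker_ms)
    busy = sum(per_worker_ms)
    return 1.0 - busy / (step * len(per_worker_ms))


# Same total work (120 ms) distributed two ways across 4 workers.
balanced = [30.0, 30.0, 30.0, 30.0]
skewed = [20.0, 20.0, 20.0, 60.0]

print(barrier_step_time(balanced), wasted_fraction(balanced))  # 30.0 0.0
print(barrier_step_time(skewed), wasted_fraction(skewed))      # 60.0 0.5
```

In this toy case the skewed assignment doubles the step time even though the total work is unchanged, which is the cost a load-aware router tries to avoid.
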
This repo contains two approaches. See each directory's README for details.
An external FastAPI proxy sits in front of independent vLLM instances and routes requests. No vLLM source code is modified. Suitable for dense models where DP workers are independent.
Clients → [Proxy :9000] → vLLM :8000 (GPU 0)
→ vLLM :8001 (GPU 1)
→ ...
Monkey-patches vLLM's internal DP router (DPLBAsyncMPClient) to inject custom routing while preserving the real EP barrier. This is the main experiment setup and captures the true cost of load imbalance.
Clients → vLLM (patched DP router) → Engine 0 ─┐
→ Engine 1 ─┤ EP all-to-all barrier
→ ... ─┘
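
The monkey-patching pattern can be shown in isolation. The sketch below uses a hypothetical stand-in class (`StandInDPClient`, with an invented `choose_engine` method and `kv_loads` attribute); it is not vLLM's `DPLBAsyncMPClient` and does not reflect that class's real interface. It only shows how a method can be swapped at runtime so the rest of the serving path stays untouched.

```python
class StandInDPClient:
    """Hypothetical stand-in for a DP client; not a real vLLM class."""

    def __init__(self, kv_loads):
        # Assumed per-engine KV load, e.g. cached tokens per engine.
        self.kv_loads = kv_loads

    def choose_engine(self):
        # Placeholder default policy: always pick engine 0.
        return 0


def jsq_load_choose_engine(self):
    """Replacement policy: pick the engine with the lowest KV load."""
    return min(range(len(self.kv_loads)), key=self.kv_loads.__getitem__)


# Keep a handle on the original so the patch can be reverted.
original_choose_engine = StandInDPClient.choose_engine
StandInDPClient.choose_engine = jsq_load_choose_engine

client = StandInDPClient([900, 150, 400])
chosen = client.choose_engine()
print(chosen)  # 1

# Restore the original behaviour.
StandInDPClient.choose_engine = original_choose_engine
restored = client.choose_engine()
print(restored)  # 0
```
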
pip install -e .
python run_simulation.py --num-workers 8 --num-requests 5000

# 1. Install
pip install vllm
pip install -e .
# 2. Download a model and dataset
# (adjust paths for your cluster)
python scripts/prepare_data.py
# 3. Launch vLLM with patched DP router
python scripts/vllm_patched_dp.py \
--model <path-to-model> \
--data-parallel-size 4 --tensor-parallel-size 2 \
--dp-router jsq_load \
--host 0.0.0.0 --port 8000
# 4. Benchmark
python run_benchmark.py \
--target http://localhost:8000/v1/chat/completions \
--num-requests 500 --request-rate 3.0 \
--model <model-name> --dataset sharegpt

For multi-node experiments with Ray, see scripts/README.md.
| Router | Description | Strengths |
|---|---|---|
| vllm_default | score = 4*waiting + running | Simple, reliable, zero failures |
| jsq_load | Route to worker with lowest KV load | Best throughput and tail latency |
| br0 | BR-0 F-score with JSQ-Load tie-breaker | Highest peak throughput, targets overflow |
| fast_phi | K-point quadrature of overflow integral | Best TTFT, zero failures, scales at large G |
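
The first two rules in the table can be sketched as follows. This is a minimal illustration, not the code in balanceroute/routers.py: `WorkerState`, its fields, and the choice of "KV tokens in use" as the load measure are assumptions, as is picking the worker with the lowest score for `vllm_default`.

```python
from dataclasses import dataclass


@dataclass
class WorkerState:
    waiting: int  # requests queued on the worker
    running: int  # requests currently decoding
    kv_load: int  # assumed load measure: KV-cache tokens in use


def pick_vllm_default(workers):
    """Queue-count rule: minimise score = 4*waiting + running."""
    scores = [4 * w.waiting + w.running for w in workers]
    return min(range(len(workers)), key=scores.__getitem__)


def pick_jsq_load(workers):
    """KV-load rule: route to the worker with the lowest KV load."""
    return min(range(len(workers)), key=lambda i: workers[i].kv_load)


# Worker 0 has fewer requests, but they are long ones holding much more KV cache.
workers = [
    WorkerState(waiting=0, running=3, kv_load=9000),
    WorkerState(waiting=0, running=4, kv_load=2500),
]

print(pick_vllm_default(workers))  # 0
print(pick_jsq_load(workers))      # 1
```

The two rules disagree whenever request counts and KV footprint diverge, which is the case a count-based score cannot see.
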
| Router | Throughput | TPOT P95 | TTFT | Fail% |
|---|---|---|---|---|
| vllm_default | 69.6 tok/s | 3,381 ms | 12.8 s | 0.0% |
| jsq_load | 74.8 tok/s | 3,047 ms | 10.5 s | 0.0% |
| fast_phi | 70.1 tok/s | 3,385 ms | 9.1 s | 0.0% |
| br0 | 79.9 tok/s | 3,012 ms | 11.3 s | 3.6% |
(Rate = 3.0 req/s, 500 ShareGPT requests, max 1024 output tokens)
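
The relative throughput gains implied by the table can be checked with a few lines of arithmetic. The numbers are copied from the table above; the variable names are only for illustration.

```python
# Throughput in tok/s, taken from the results table.
baseline = 69.6  # vllm_default
throughput = {"jsq_load": 74.8, "fast_phi": 70.1, "br0": 79.9}

# Percentage gain over the vllm_default baseline.
gains = {name: (value / baseline - 1.0) * 100.0 for name, value in throughput.items()}

for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}%")
# jsq_load: +7.5%
# fast_phi: +0.7%
# br0: +14.8%
```

The jsq_load and br0 rows give the lower and upper ends of the 7-15% range; the br0 figure comes with a 3.6% failure rate.
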
| File | Description |
|---|---|
| balanceroute/routers.py | All routing algorithms (BR-0, BR-H, JSQ-Load, baselines) |
| balanceroute/proxy.py | FastAPI proxy server with W&B logging |
| balanceroute/simulator.py | Offline trace-driven simulator |
| balanceroute/benchmark.py | Async benchmark client |
| scripts/vllm_patched_dp.py | Monkey-patches vLLM's DP router (main experiment code) |
| scripts/vllm_ray_dp.py | Multi-node launcher via Ray |
| scripts/plot_gantt.py | Visualization (active requests, imbalance, Gantt charts) |
| blog_post.md | Full writeup with math, examples, and analysis |
Config files in configs/:
| Config | Model | Use case |
|---|---|---|
| small.yaml | Qwen3-0.6B | Quick testing, 1 node |
| medium.yaml | Qwen3-30B-A3B | MoE experiments, 1-2 nodes |
| default.yaml | DeepSeek-V4-Flash | Full-scale, 4-8 nodes |
- Bu et al. (2026). Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving. arXiv:2605.06113v2
- Chen et al. (2026). A Universal Load Balancing Principle and Its Application to Large Language Model Serving. arXiv:2601.17855v2