JasonJiaxiangLi/DP-BalanceRouter

BalanceRoute: DP Load-Balanced Routing for MoE LLM Serving

Implementation and study of request routing algorithms for data-parallel MoE LLM serving, based on "Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving" (Bu et al., 2026). Includes BR-0, BR-H, and our extensions (JSQ-Load, Fast-Phi), with real multi-node GPU validation using vLLM's native DP with EP barrier synchronization.

Blog post: blog_post.md — detailed writeup with examples, math, and experimental results.

Why This Matters

MoE models (DeepSeek-V3, Qwen3-30B-A3B, Mixtral) use Expert Parallelism (EP) across DP workers. The EP all-to-all creates a synchronization barrier at every decode step, so the slowest worker determines throughput for everyone, and bad routing wastes GPU cycles at this barrier. In our experiments (DP=8, Qwen3-30B-A3B; see Key Results), switching from vLLM's default queue-count routing to KV-load-aware routing improves throughput by 7-15%.
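
To make the barrier cost concrete, here is a minimal sketch of one barrier-synchronized decode step. The per-worker step times are made-up illustrative numbers, not measurements from this repo, and the helper names are hypothetical.

```python
def step_time(worker_times_ms):
    """Wall-clock time of one barrier-synchronized step: the slowest worker."""
    return max(worker_times_ms)


def wasted_fraction(worker_times_ms):
    """Fraction of total GPU time spent idle waiting at the barrier."""
    t = step_time(worker_times_ms)
    busy = sum(worker_times_ms)
    return 1.0 - busy / (t * len(worker_times_ms))


# Hypothetical per-worker compute times (ms) for a single decode step.
# Both lists carry the same total work (40 ms); only its distribution differs.
balanced = [10.0, 10.0, 10.0, 10.0]
skewed = [6.0, 8.0, 10.0, 16.0]

balanced_step = step_time(balanced)
skewed_step = step_time(skewed)
balanced_waste = wasted_fraction(balanced)
skewed_waste = wasted_fraction(skewed)
```

With the same total work, the skewed assignment takes 16 ms per step instead of 10 ms, and 37.5% of GPU time is spent waiting at the barrier.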

Two Architectures

This repo contains two approaches. See each directory's README for details.

1. Proxy-based routing — balanceroute/ (dense models)

An external FastAPI proxy sits in front of independent vLLM instances and routes requests. No vLLM source code is modified. Suitable for dense models where DP workers are independent.

Clients → [Proxy :9000] → vLLM :8000 (GPU 0)
                         → vLLM :8001 (GPU 1)
                         → ...

2. Native DP routing — scripts/ (MoE models, recommended)

Monkey-patches vLLM's internal DP router (DPLBAsyncMPClient) to inject custom routing while preserving the real EP barrier. This is the main experiment setup and captures the true cost of load imbalance.

Clients → vLLM (patched DP router) → Engine 0 ─┐
                                    → Engine 1 ─┤ EP all-to-all barrier
                                    → ...       ─┘
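
As a rough illustration of the monkey-patching idea, the sketch below swaps the engine-selection method on a stand-in class. `StandInDPClient`, `choose_engine`, and `install_router` are hypothetical names invented for this example; they are not vLLM's real `DPLBAsyncMPClient` interface, and the actual patch lives in scripts/vllm_patched_dp.py.

```python
import functools


class StandInDPClient:
    """Hypothetical stand-in for a DP client; NOT vLLM's real class."""

    def __init__(self, num_engines):
        self.num_engines = num_engines
        self._next = 0

    def choose_engine(self, loads):
        # Default policy in this toy class: round-robin, ignoring load.
        idx = self._next % self.num_engines
        self._next += 1
        return idx


def install_router(cls, method_name, policy):
    """Replace cls.<method_name> with a wrapper that delegates to `policy`."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def patched(self, loads):
        return policy(loads)

    setattr(cls, method_name, patched)
    return original  # returned so the caller can restore it later


def least_loaded(loads):
    """Pick the index of the engine with the smallest reported load."""
    return min(range(len(loads)), key=loads.__getitem__)


original = install_router(StandInDPClient, "choose_engine", least_loaded)
patched_choice = StandInDPClient(num_engines=3).choose_engine([5, 2, 9])

# Restore the original method to show the patch is reversible.
setattr(StandInDPClient, "choose_engine", original)
restored_choice = StandInDPClient(num_engines=3).choose_engine([5, 2, 9])
```

Because only the selection method is replaced, everything downstream of it (in the real setup, the engines and their EP barrier) runs unchanged.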

Quick Start

Simulation (no GPUs needed)

pip install -e .
python run_simulation.py --num-workers 8 --num-requests 5000

Live experiment on a GPU cluster

# 1. Install
pip install vllm
pip install -e .

# 2. Download a model and dataset
# (adjust paths for your cluster)
python scripts/prepare_data.py

# 3. Launch vLLM with patched DP router
python scripts/vllm_patched_dp.py \
    --model <path-to-model> \
    --data-parallel-size 4 --tensor-parallel-size 2 \
    --dp-router jsq_load \
    --host 0.0.0.0 --port 8000

# 4. Benchmark
python run_benchmark.py \
    --target http://localhost:8000/v1/chat/completions \
    --num-requests 500 --request-rate 3.0 \
    --model <model-name> --dataset sharegpt

For multi-node experiments with Ray, see scripts/README.md.

Routing Algorithms

| Router | Description | Strengths |
| --- | --- | --- |
| vllm_default | score = 4*waiting + running | Simple, reliable, zero failures |
| jsq_load | Route to worker with lowest KV load | Best throughput and tail latency |
| br0 | BR-0 F-score with JSQ-Load tie-breaker | Highest peak throughput, targets overflow |
| fast_phi | K-point quadrature of overflow integral | Best TTFT, zero failures, scales at large G |
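
The difference between the first two routers can be sketched as follows. The `4*waiting + running` score comes from the table above; the `WorkerState` fields, and measuring KV load as a token count, are assumptions made for illustration rather than the repo's actual data structures.

```python
from dataclasses import dataclass


@dataclass
class WorkerState:
    waiting: int  # queued requests
    running: int  # requests currently decoding
    kv_load: int  # assumed unit: KV-cache tokens held by this worker


def default_score(w):
    """Queue-count score from the table: 4*waiting + running."""
    return 4 * w.waiting + w.running


def route_default(workers):
    """Pick the worker with the lowest queue-count score."""
    return min(range(len(workers)), key=lambda i: default_score(workers[i]))


def route_jsq_load(workers):
    """Pick the worker with the lowest KV load."""
    return min(range(len(workers)), key=lambda i: workers[i].kv_load)


# Hypothetical snapshot: worker 0 has few but long requests,
# worker 1 has more but much shorter ones.
workers = [
    WorkerState(waiting=0, running=2, kv_load=9000),
    WorkerState(waiting=0, running=5, kv_load=2500),
]

default_choice = route_default(workers)
jsq_choice = route_jsq_load(workers)
```

In this snapshot the queue-count score sends the next request to worker 0, which already holds the most KV cache, while the KV-load rule sends it to worker 1.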

Key Results (DP=8, 2 nodes, Qwen3-30B-A3B, real EP barrier)

| Router | Throughput | TPOT P95 | TTFT | Fail% |
| --- | --- | --- | --- | --- |
| vllm_default | 69.6 tok/s | 3,381 ms | 12.8 s | 0.0% |
| jsq_load | 74.8 tok/s | 3,047 ms | 10.5 s | 0.0% |
| fast_phi | 70.1 tok/s | 3,385 ms | 9.1 s | 0.0% |
| br0 | 79.9 tok/s | 3,012 ms | 11.3 s | 3.6% |

(Rate = 3.0 req/s, 500 ShareGPT requests, max 1024 output tokens)
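
The relative throughput gains over vllm_default can be recomputed directly from the table. This snippet uses only the numbers reported above; the variable names are illustrative.

```python
# Throughput in tok/s, copied from the Key Results table.
baseline = 69.6  # vllm_default
throughput = {"jsq_load": 74.8, "fast_phi": 70.1, "br0": 79.9}

# Percentage improvement of each router over the vllm_default baseline.
gains = {
    name: 100.0 * (tps - baseline) / baseline
    for name, tps in throughput.items()
}

for name, pct in gains.items():
    print(f"{name}: {pct:+.1f}% vs vllm_default")
```

This gives roughly +7.5% for jsq_load, +0.7% for fast_phi, and +14.8% for br0. The br0 figure comes with a 3.6% failure rate.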

Key Files

| File | Description |
| --- | --- |
| balanceroute/routers.py | All routing algorithms (BR-0, BR-H, JSQ-Load, baselines) |
| balanceroute/proxy.py | FastAPI proxy server with W&B logging |
| balanceroute/simulator.py | Offline trace-driven simulator |
| balanceroute/benchmark.py | Async benchmark client |
| scripts/vllm_patched_dp.py | Monkey-patches vLLM's DP router (main experiment code) |
| scripts/vllm_ray_dp.py | Multi-node launcher via Ray |
| scripts/plot_gantt.py | Visualization (active requests, imbalance, Gantt charts) |
| blog_post.md | Full writeup with math, examples, and analysis |

Configuration

Config files in configs/:

| Config | Model | Use case |
| --- | --- | --- |
| small.yaml | Qwen3-0.6B | Quick testing, 1 node |
| medium.yaml | Qwen3-30B-A3B | MoE experiments, 1-2 nodes |
| default.yaml | DeepSeek-V4-Flash | Full-scale, 4-8 nodes |

References

  • Bu et al. (2026). Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving. arXiv:2605.06113v2
  • Chen et al. (2026). A Universal Load Balancing Principle and Its Application to Large Language Model Serving. arXiv:2601.17855v2
