Merged
28 changes: 25 additions & 3 deletions docs/online_serving/router.md
@@ -190,7 +190,7 @@ server:
splitwise: true # true enables PD disaggregation; false disables it

scheduler:
policy: "power_of_two" # Scheduling policy (optional): random, power_of_two, round_robin, process_tokens, request_num, cache_aware, fd_metrics_score
policy: "power_of_two" # Scheduling policy (optional): random, power_of_two, round_robin, process_tokens, request_num, cache_aware, remote_cache_aware, fd_metrics_score, fd_remote_metrics_score
prefill-policy: "cache_aware" # Prefill scheduling policy in PD mode
decode-policy: "fd_metrics_score" # Decode scheduling policy in PD mode
eviction-interval-secs: 60 # Cache eviction interval for CacheAware scheduling
@@ -199,9 +199,13 @@ scheduler:
hit-ratio-weight: 1.0 # Cache hit ratio weight
load-balance-weight: 0.05 # Load balancing weight
cache-block-size: 4 # Cache block size
tokenizer-url: "http://0.0.0.0:8098" # Tokenizer service endpoint (optional)
tokenizer-timeout-secs: 2 # Tokenizer service timeout
# tokenizer-url: "http://0.0.0.0:8098" # Tokenizer service endpoint (optional); when not configured, cache_aware uses character-level tokenization.
# Note: Enabling this option causes a synchronous remote tokenizer call on every scheduling decision,
# introducing additional network latency. Only enable it when precise token-level tokenization
# is needed to improve cache hit rate.
# tokenizer-timeout-secs: 2 # Tokenizer service timeout; default: 2
waiting-weight: 10 # Waiting weight for CacheAware scheduling
stats-interval-secs: 5 # Stats logging interval in seconds, includes load and cache hit rate statistics; default: 5

manager:
health-failure-threshold: 3 # Number of failed health checks before marking unhealthy
@@ -254,6 +258,24 @@ Instance Registration Parameters:

Among these, `role`, `host_ip`, and `port` are required; all other parameters are optional.
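
The required-field rule above can be sketched as a small pre-registration check. This is an illustrative sketch only: the `missing_fields` helper and the dictionary payload shape are hypothetical, not part of the Router. The extra `metrics_port` check reflects the requirement stated for the remote strategies under Scheduling Strategies.

```python
# Fields the document lists as required for instance registration.
REQUIRED_FIELDS = ("role", "host_ip", "port")

# Strategies documented as needing `metrics_port` at registration time.
REMOTE_POLICIES = {"fd_remote_metrics_score", "remote_cache_aware"}


def missing_fields(payload, policy=None):
    """Return the names of required fields that are absent or empty.

    Hypothetical helper: `payload` is assumed to be a plain dict built
    from the registration parameters.
    """
    required = list(REQUIRED_FIELDS)
    if policy in REMOTE_POLICIES:
        required.append("metrics_port")
    return [name for name in required if payload.get(name) in (None, "")]
```

For example, `missing_fields({"role": "decode"})` reports `host_ip` and `port` as missing, so the payload can be rejected before it is sent to the Router.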

## Scheduling Strategies

The Router supports the following scheduling strategies, configurable via the `policy` field (mixed mode) and the `prefill-policy` and `decode-policy` fields (PD disaggregated mode) in the configuration file.

**Default strategies**: When no policy is configured, prefill nodes default to `process_tokens`, and mixed and decode nodes default to `request_num`.

| Strategy | Applicable Scenario | Implementation |
|----------|---------------------|----------------|
| `random` | General | Randomly selects one available instance; stateless and suitable for lightweight scenarios. |
| `round_robin` | General | Uses an atomic counter to cycle through the instance list, distributing requests evenly in order. |
| `power_of_two` | General | Randomly picks two instances, compares their concurrent request counts, and selects the one with the lower load. |
| `process_tokens` | **prefill (default)** | Iterates over all instances and selects the one with the fewest tokens currently being processed (in-memory counting); suitable for load-balancing long prefill requests. |
| `request_num` | **mixed / decode (default)** | Iterates over all instances and selects the one with the fewest concurrent requests (in-memory counting); suitable for decode and mixed scenarios. |
| `fd_metrics_score` | mixed / decode | Reads running/waiting request counts from in-memory counters, scores each instance by `running + waiting × waitingWeight`, and selects the instance with the lowest score. |
| `fd_remote_metrics_score` | mixed / decode | Fetches running/waiting request counts from each instance's remote `/metrics` endpoint in real time, scores each instance by `running + waiting × waitingWeight`, and selects the instance with the lowest score. Requires `metrics_port` in instance registration. **Note: A synchronous remote HTTP request is issued on every scheduling decision. With a large number of instances or poor network conditions, this can significantly increase scheduling latency. Evaluate your deployment conditions carefully before enabling this strategy.** |
| `cache_aware` | prefill | Maintains per-instance KV Cache prefix hit information in a Radix Tree and selects an instance by combining hit ratio and load scores (in-memory counting); automatically falls back to `process_tokens` when load is severely imbalanced. |
| `remote_cache_aware` | prefill | Same cache-aware strategy as `cache_aware`, but reads instance load data from the remote `/metrics` endpoint. Requires `metrics_port` in instance registration. **Note: A synchronous remote HTTP request is issued on every scheduling decision. With a large number of instances or poor network conditions, this can significantly increase scheduling latency. Evaluate your deployment conditions carefully before enabling this strategy.** |
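
To make three of the rows above concrete, here is a minimal sketch of `request_num`, `power_of_two`, and `fd_metrics_score` selection over in-memory counters. It is not the Router's implementation: the `Worker` type, the function signatures, sampling two distinct instances, and treating `running` as the concurrent request count are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical in-memory view of one instance."""
    url: str
    running: int = 0  # requests currently being processed
    waiting: int = 0  # requests queued on the instance


def request_num(workers):
    """Select the instance with the fewest concurrent requests."""
    return min(workers, key=lambda w: w.running)


def power_of_two(workers, rng=random):
    """Sample two distinct instances and keep the less loaded one."""
    if len(workers) == 1:
        return workers[0]
    a, b = rng.sample(workers, 2)
    return a if a.running <= b.running else b


def fd_metrics_score(workers, waiting_weight=10):
    """Select the lowest `running + waiting * waiting_weight` score.

    The default of 10 mirrors the sample `waiting-weight` value; whether
    that setting feeds this formula is an assumption.
    """
    return min(workers, key=lambda w: w.running + w.waiting * waiting_weight)
```

With workers at `(running=5, waiting=0)`, `(1, 1)`, and `(3, 0)`, `request_num` picks the second instance, while `fd_metrics_score` picks the third because one queued request costs as much as ten running ones.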

## Troubleshooting

If you encounter issues while using the Router, please refer to the [Router Troubleshooting Guide](router_faq.md), which covers common log analysis, response output interpretation, and troubleshooting methods.
51 changes: 48 additions & 3 deletions docs/online_serving/router_faq.md
@@ -23,14 +23,22 @@ For basic Router usage, please refer to [Load-Balancing Scheduling Router](route
| `Failed to register instance from index {index}: {error}` | Instance at index {index} in config file failed to register | That instance was not registered | Health status, registration parameters |
| `failed to send request to {url} with error: {error}` | Health check request failed to send | The instance may be marked as unhealthy | Network connectivity, proxy settings |
| `scanner error: {error}` | Error occurred while reading backend streaming response | The current request may fail | Backend instance status |
| `[prefill] scanner error: {error}, message={message}` | Error occurred while reading Prefill backend streaming response | The current Prefill request may fail | Backend instance status |
| `[prefill] copy error: {error}, message={message}` | Error occurred while copying Prefill response data | The current Prefill request may fail | Backend instance status |
| `Panic recovered: {error}` | A panic occurred during request processing and was recovered | The current request fails, but the service continues running | Backend instance status, request content |
| `empty baseURL provided` | Health check received an empty base URL | Health check cannot be performed | Registration parameters |
| `failed to create request: {error}` | Failed to create health check request | The instance may be marked as unhealthy | Network environment |
| `failed to read response body: {error}` | Failed to read health check response body | The instance may be marked as unhealthy | Backend instance status |

### Warn-Level Logs

| Log Message | Meaning | Impact | What to Check |
| :--- | :--- | :--- | :--- |
| `Server {url} is not healthy` | The instance at this URL failed health check | Router cannot register the instance, or will remove it from the registered list | Health status |
| `Instance {url} role is unknown` | Instance role cannot be recognized | The instance will not be added to the scheduling list | Registration parameters |
| `cache-aware prefill: tokenizer failed, fallback to process_tokens: {error}` | Tokenizer service call failed, automatically falling back to process_tokens strategy | Prefill scheduling temporarily does not use cache_aware strategy; normal request processing is not affected | Tokenizer service status |
| `cache-aware prefill: tokenizer failed, fallback to char tokens: {error}` | Tokenizer service call failed, automatically falling back to character-based tokenization | cache_aware strategy remains active, using character-based tokenization for cache matching instead of the Tokenizer; normal request processing is not affected | Tokenizer service status |
| `cache-aware prefill: tokenize failed, fallback to process_tokens: {error}` | Tokenization completely failed (e.g., empty input), falling back to process_tokens strategy | Prefill scheduling temporarily does not use cache_aware strategy; normal request processing is not affected | Request content, Tokenizer service status |
| `cache-aware prefill: final strategy: process_tokens, reason: tokenize failed: {error}. ts_ms={ts}` | Tokenization failed (new format), falling back to process_tokens strategy | Prefill scheduling temporarily does not use cache_aware strategy; normal request processing is not affected | Request content, Tokenizer service status |

### Info-Level Logs

@@ -42,6 +50,42 @@ For basic Router usage, please refer to [Load-Balancing Scheduling Router](route
| `No instances found in config file {path}` | No instances found in the registration config file | Check whether register.yaml is empty |
| `Request completed successfully.` | Request processing completed | Normal operation log |
| `Request failed, retrying...` | Request failed, retrying | Router will retry up to 3 times |
| `select worker (prefill): {url}, tokens: {tokens}` | Prefill scheduler selected a worker, showing current token processing count | Normal operation log |
| `select worker ({type}): {url}, count: {count}` | Decode/Mixed scheduler selected a worker, showing current request concurrency | Normal operation log |
| `release worker: {url}, count: {count}` | Request ended, worker counter released | Normal operation log |
| `release prefill tokens: {url}, tokens: {tokens}` | Prefill request ended, token load released | Normal operation log |
| `cleanup unhealthy worker counter: {url}` | Cleaned up counter for unhealthy worker | Normal operation log |
| `removed counters for {count} unhealthy workers: {urls}` | Batch cleanup of counters for unhealthy workers | Normal operation log |
| `[stats] total_running={n}, workers: [{loads}], cache_hit_rate={rate}% (hits={hits}/total={total})` | Periodic stats: total running requests, per-worker loads, and cache hit rate | Normal operation log; useful for monitoring and tuning |
| `Parsing completed; starting worker selection.` | Request parsing completed, starting worker selection | Normal operation log |
| `Request completed with an error.` | Request processing completed with an error | Check backend instance status |
| `[SelectWorkerPair] decode selection failed, releasing prefill counter url={url}` | Decode selection failed in PD disaggregated mode, releasing Prefill counter | Error handling log |
| `[prefill] first chunk received, release counter url={url}` | Prefill streaming response received first chunk, counter released | Normal operation log |
| `[prefill] non-stream prefill response done, release counter url={url}` | Prefill non-streaming response completed, counter released | Normal operation log |
| `[prefill] backendResp is nil or backendResp.Body is nil, url={url}` | Prefill backend response is nil | May indicate backend connection issue |
| `[prefill] release in defer (fallback) url={url}, isStream={bool}` | Fallback resource release when Prefill request exits abnormally | Error handling log |
| `[prefill] release in CommonCompletions defer (error path) url={url}` | Prefill resource release on error path | Error handling log |
| `cache-aware prefill: final strategy: process_tokens, reason: strategy not initialized` | cache_aware strategy not initialized, falling back to process_tokens | Check cache_aware configuration |
| `cache-aware prefill: final strategy: process_tokens, reason: load imbalanced, loads={loads}. ts_ms={ts}` | Load imbalanced across instances, falling back to process_tokens strategy | Normal operation log, automatic load balancing switch |
| `cache-aware prefill: final strategy: cache_aware_scoring, selected={url}, loads={loads}, hitRatios={ratios}. ts_ms={ts}` | cache_aware scoring strategy selected a worker | Normal operation log, showing loads and hit ratios |
| `[{method}] {path} {proto} {status} {latency} {clientIP}` | HTTP request access log | Normal operation log, records basic info for each request |
| `before SelectWorker prefill. ts_ms={ts}` | Starting Prefill worker selection in PD disaggregated mode | Normal operation log, for performance tracing |
| `before SelectWorker decode, after prefill. ts_ms={ts}` | Starting Decode worker selection after Prefill selection | Normal operation log, for performance tracing |
| `after SelectWorker decode, before return. ts_ms={ts}` | Decode worker selection completed | Normal operation log, for performance tracing |
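
For monitoring, the `[stats]` line in the table above can be parsed with a regular expression. The sketch below follows the documented template. The `parse_stats` helper is hypothetical, and because the layout of `{loads}` is not specified here, it is kept as a raw string rather than parsed.

```python
import re

# Pattern derived from the documented `[stats]` log template.
STATS_RE = re.compile(
    r"\[stats\] total_running=(?P<running>\d+), "
    r"workers: \[(?P<loads>.*)\], "
    r"cache_hit_rate=(?P<rate>[\d.]+)% "
    r"\(hits=(?P<hits>\d+)/total=(?P<total>\d+)\)"
)


def parse_stats(line):
    """Extract the numeric fields from a `[stats]` line, or return None."""
    match = STATS_RE.search(line)
    if match is None:
        return None
    return {
        "total_running": int(match["running"]),
        "loads": match["loads"],  # raw text: per-worker format is not documented
        "cache_hit_rate": float(match["rate"]),
        "hits": int(match["hits"]),
        "total": int(match["total"]),
    }
```

Lines that do not match the template yield `None`, so the function can be applied to an unfiltered log stream.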

### Debug-Level Logs

> Debug-level logs are emitted only when the log level is set to `debug`; they are typically used during development.

| Log Message | Meaning | Description |
| :--- | :--- | :--- |
| `Healthy instances: prefill={urls}, decode={urls}, mixed={urls}` | Lists healthy instances for each role | Useful for verifying instance discovery |
| `cache-aware prefill: hashes={n} workers={n} load={loads} hit={hits}` | Hash count, worker count, and load info for cache_aware strategy | Useful for debugging cache hits |
| `cache-aware prefill: tokenizer tokens={tokens}` | Tokenizer tokenization result | Useful for debugging tokenization results |
| `cache-aware score: worker={url} hit={hit} loadRatio={ratio} score={score}` | Scoring details for cache_aware strategy | Useful for debugging scheduling decisions |
| `radix match: hashes={n} matched_len={n} node_children={n}` | Radix tree match details | Useful for debugging cache matching |
| `radix record: worker={url} hashes={n} node_depth={n}` | Radix tree record details | Useful for debugging cache recording |
| `radix eviction: removed={n} nodeCount={n}` | Radix tree eviction details | Useful for debugging cache eviction |
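
The `cache-aware score` log above reports a hit value, a load ratio, and a resulting score per worker, but this document does not give the formula. The sketch below shows one plausible combination, purely as an assumption: a weighted hit ratio minus a weighted load ratio, with the highest score winning. The function names, the sign convention, and the use of the sample `hit-ratio-weight` and `load-balance-weight` values as defaults are all hypothetical.

```python
def cache_aware_score(hit_ratio, load_ratio,
                      hit_ratio_weight=1.0, load_balance_weight=0.05):
    """Assumed scoring form: reward prefix hits, penalize relative load."""
    return hit_ratio_weight * hit_ratio - load_balance_weight * load_ratio


def pick_worker(stats, **weights):
    """Return the URL with the highest assumed score.

    `stats` maps a worker URL to a `(hit_ratio, load_ratio)` tuple.
    """
    return max(stats, key=lambda url: cache_aware_score(*stats[url], **weights))
```

Under this assumed form, the default weights let a strong prefix hit outweigh a moderate load difference, which is consistent with the sample configuration weighting hit ratio twenty times more heavily than load.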

## Common Response Output Analysis

@@ -189,9 +233,10 @@ If `Failed to start server` appears in startup logs, check:

### Tokenizer Service (cache_aware Strategy)

When using the `cache_aware` scheduling strategy, the Router calls a Tokenizer service to tokenize requests for cache hit ratio computation. When the Tokenizer service is unavailable, the Router will log a Warn-level message: `tokenizer failed, fallback to process_tokens`.
When using the `cache_aware` scheduling strategy, the Router calls a Tokenizer service to tokenize requests for cache hit ratio computation. When the Tokenizer service is unavailable, the Router has a two-level degradation mechanism:

**This does not affect normal request processing** — the Router has a built-in degradation mechanism that automatically falls back to the `process_tokens` strategy for continued scheduling. The only impact is the temporary loss of cache-aware optimization.
1. **Fallback to character-based tokenization** (common case): The log will show `tokenizer failed, fallback to char tokens`. The cache_aware strategy remains active, using character-based tokenization for cache matching instead of the Tokenizer. Cache hit accuracy may decrease, but normal request processing is not affected.
2. **Fallback to process_tokens strategy** (extreme case): When tokenization completely fails (e.g., empty request content), the log will show `tokenize failed, fallback to process_tokens`. The cache_aware strategy temporarily becomes inactive, and scheduling falls back to selecting the instance with the fewest tokens currently being processed. Normal request processing is not affected.
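
The two-level degradation described above can be sketched as follows. This is a simplified model, not the Router's code: the `tokenize_for_cache` function, the `remote_tokenize` callable, and the representation of character-level tokens as a list of characters are assumptions.

```python
def tokenize_for_cache(text, remote_tokenize=None):
    """Return `(tokens, mode)` for cache matching.

    `mode` is "tokenizer", "char", or "process_tokens". In the last
    case `tokens` is None and the caller should schedule with the
    `process_tokens` strategy instead of cache_aware.
    """
    if remote_tokenize is not None:
        try:
            tokens = remote_tokenize(text)
            if tokens:
                return tokens, "tokenizer"
        except Exception:
            # Level 1: tokenizer failed, fall back to char tokens.
            pass
    chars = list(text)  # assumed character-level tokenization
    if chars:
        return chars, "char"
    # Level 2: nothing to tokenize, fall back to process_tokens.
    return None, "process_tokens"
```

Passing `remote_tokenize=None` models the case where `tokenizer-url` is not configured, in which cache_aware uses character-level tokenization directly.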

To restore full cache_aware functionality:

19 changes: 12 additions & 7 deletions docs/zh/online_serving/router.md
@@ -190,7 +190,7 @@ server:
splitwise: true # true enables PD disaggregated mode; false enables non-PD-disaggregated mode

scheduler:
policy: "power_of_two" # Scheduling policy (optional): random, power_of_two, round_robin, process_tokens, request_num, cache_aware, fd_metrics_score; default: request_num
policy: "power_of_two" # Scheduling policy (optional): random, power_of_two, round_robin, process_tokens, request_num, cache_aware, remote_cache_aware, fd_metrics_score, fd_remote_metrics_score; default: request_num
prefill-policy: "cache_aware" # Scheduling policy for prefill nodes in PD disaggregated mode; default: process_tokens
decode-policy: "fd_metrics_score" # Scheduling policy for decode nodes in PD disaggregated mode; default: request_num
eviction-interval-secs: 60 # Interval at which the cache-aware policy evicts expired cache entries
@@ -199,9 +199,12 @@ scheduler:
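hit-ratio-weight: 1.0 # Hit ratio weight for the cache-aware policy
load-balance-weight: 0.05 # Load balancing weight for the cache-aware policy
cache-block-size: 4 # Cache block size for the cache-aware policy
tokenizer-url: "http://0.0.0.0:8098" # Tokenizer service address (optional)
tokenizer-timeout-secs: 2 # Tokenizer service timeout
# tokenizer-url: "http://0.0.0.0:8098" # Tokenizer service address (optional); when not configured, the cache_aware policy automatically uses character-level tokenization.
# Note: Configuring this option causes a synchronous call to the remote tokenizer service on every scheduling decision, introducing additional network latency.
# Enable it only when precise token-level tokenization is needed to improve the cache hit rate.
# tokenizer-timeout-secs: 2 # Tokenizer service timeout; default: 2
waiting-weight: 10 # Waiting weight for the cache-aware policy
stats-interval-secs: 5 # Interval (in seconds) for printing stats logs, including load and cache hit rate statistics; default: 5

manager:
health-failure-threshold: 3 # Number of failed health checks after which a node is considered unhealthy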
@@ -265,10 +268,12 @@ The Router supports the following scheduling strategies, configurable in the configuration file via `policy` (mixed
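| `random` | General | Randomly selects one of the available instances; not state-aware, suitable for lightweight scenarios. |
| `round_robin` | General | Uses an atomic counter, taken modulo the instance list, to cycle through instances and distribute requests evenly in order. |
| `power_of_two` | General | Randomly picks two instances, compares their current concurrent request counts, and selects the one with the lower load. |
| `process_tokens` | **prefill (default)** | Iterates over all instances and selects the one with the fewest tokens currently being processed; suitable for load-balancing long requests in the prefill stage. |
| `request_num` | **mixed / decode (default)** | Iterates over all instances and selects the one with the fewest current concurrent requests; suitable for balancing requests in decode and mixed scenarios. |
| `fd_metrics_score` | mixed / decode | Fetches running/waiting request counts from each instance's metrics endpoint in real time, scores each instance by `running + waiting × waitingWeight`, and selects the instance with the lowest score. |
| `cache_aware` | prefill | Maintains per-instance KV Cache prefix hit information in a Radix Tree and selects an instance by combining hit ratio and load scores; automatically falls back to `process_tokens` when load is severely imbalanced. |
| `process_tokens` | **prefill (default)** | Iterates over all instances and selects the one with the fewest tokens currently being processed (in-memory counting); suitable for load-balancing long requests in the prefill stage. |
| `request_num` | **mixed / decode (default)** | Iterates over all instances and selects the one with the fewest current concurrent requests (in-memory counting); suitable for balancing requests in decode and mixed scenarios. |
| `fd_metrics_score` | mixed / decode | Reads running/waiting request counts from in-memory counters, scores each instance by `running + waiting × waitingWeight`, and selects the instance with the lowest score. |
| `fd_remote_metrics_score` | mixed / decode | Fetches running/waiting request counts from each instance's remote `/metrics` endpoint in real time, scores each instance by `running + waiting × waitingWeight`, and selects the instance with the lowest score. Requires `metrics_port` in instance registration. **Note: A synchronous remote HTTP request is issued on every scheduling decision. With a large number of instances or poor network conditions, this can significantly increase scheduling latency; evaluate your deployment conditions before enabling it.** |
| `cache_aware` | prefill | Maintains per-instance KV Cache prefix hit information in a Radix Tree and selects an instance by combining hit ratio and load scores (in-memory counting); automatically falls back to `process_tokens` when load is severely imbalanced. |
| `remote_cache_aware` | prefill | Same cache-aware strategy as `cache_aware`, but reads instance load data from the remote `/metrics` endpoint. Requires `metrics_port` in instance registration. **Note: A synchronous remote HTTP request is issued on every scheduling decision. With a large number of instances or poor network conditions, this can significantly increase scheduling latency; evaluate your deployment conditions before enabling it.** |

## Troubleshooting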
