KV Cache Memory Estimation Error for GLM-4.7-Flash-AWQ on V100 #4366

@windreamer

Description
When running GLM-4.7-Flash-AWQ on a single V100-32G-SXM2, the KV Cache memory estimation appears to be incorrect, causing premature context length truncation.

Environment

Observed Behavior

  • Context length initialized: Only 4928 tokens (severely truncated)
  • Expected behavior: Should support much longer context with 32GB VRAM

Error Logs

[TM][WARNING] [TM] `max_context_token_num` is not set, default to 202752.
[TM][WARNING] [SegMgr] prefix caching is enabled
[TM][WARNING] `session_len` truncated to 4928 due to limited KV cache memory
[TM][ERROR] [Engine] Warm-up for 6144 tokens failed with status 6
[TM][ERROR] [Engine] Warm-up for 8192 tokens failed with status 6
[TM][ERROR] [Engine] Warm-up for 8320 tokens failed with status 6

Root Cause Analysis

The issue appears to be in the block manager's maximum block calculation logic. The session length is truncated based on:

const auto max_cached_tokens = seq_mgr_->max_block_count() * (size_t)cache_block_seq_len * param_.attn_cp_size;
session_len_trunc_ = std::min(max_cached_tokens, (size_t)param_.session_len);

Located at: lmdeploy/src/turbomind/engine/engine.cc (lines 248-253)
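As a sanity check on the reported 2.6 MB/token figure, the per-token KV cache footprint is roughly 2 (K and V) × layers × kv_heads × head_dim × bytes-per-element, and the engine only allocates whole cache blocks. The sketch below mirrors that arithmetic; the layer/head numbers are placeholders, not the real GLM-4.7-Flash-AWQ config, so substitute values from the model's `config.json`:

```python
# Hypothetical sanity check: estimate KV cache bytes/token and the resulting
# session length. All model dimensions below are PLACEHOLDERS, not the actual
# GLM-4.7-Flash config.

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """K and V tensors for every layer, fp16 by default (2 bytes/elem)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def max_session_len(free_bytes, per_token_bytes, block_seq_len=64):
    """Mirror of the engine's truncation: only whole cache blocks count."""
    blocks = free_bytes // (per_token_bytes * block_seq_len)
    return blocks * block_seq_len

# Placeholder config: 47 layers, 32 KV heads, head_dim 128, fp16.
per_token = kv_bytes_per_token(num_layers=47, num_kv_heads=32, head_dim=128)
print(f"{per_token / 2**20:.2f} MiB/token")
print(max_session_len(free_bytes=12 * 2**30, per_token_bytes=per_token))
```

If the computed MiB/token from the real config is far below 2.6 MB, that would point to the block manager over-reserving per token (e.g. not accounting for GQA, or sizing blocks for unquantized fp16 KV) rather than to genuinely limited memory.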

Questions

  1. Is the 2.6 MB/token KV cache usage expected for this model configuration?
  2. Why does the block manager underestimate available memory on V100?

@windreamer I tested this PR on a single V100-32G-SXM2. It runs, but the KV cache is far too large: the GLM-4.7-Flash-AWQ weights are 18.4 GB, total usage after startup is 31 GB, yet only a 4928-token context was initialized. Isn't 2.6 MB/token of cache usage excessive?
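The figures in this report are internally consistent: 4928 tokens at about 2.6 MB/token accounts for roughly 12.5 GB of cache, which together with the 18.4 GB of weights matches the observed 31 GB footprint (the remainder being activations and CUDA context overhead). A quick arithmetic check:

```python
# Cross-check the reported numbers: weights + KV cache should roughly
# account for the observed 31 GB of VRAM usage.
weights_gb = 18.4          # reported AWQ weight size
tokens = 4928              # initialized context length
mb_per_token = 2.6         # reported KV cache cost per token

cache_gb = tokens * mb_per_token / 1024
total_gb = weights_gb + cache_gb
print(f"cache ≈ {cache_gb:.1f} GB, weights + cache ≈ {total_gb:.1f} GB")
```

So the engine really is spending ~12.5 GB on only ~5K tokens of cache, which is why the per-token cost, not the total budget, is the suspicious quantity.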

Active code page: 65001
Add dll path C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9\bin, please note cuda version should >= 11.3 when compiled with cuda 11
The following generation flags are not valid and may be ignored: ['top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
2026-02-24 15:35:59,808 - lmdeploy - WARNING - converter.py:67 - data type fallback to float16 since torch.cuda.is_bf16_supported is False
[TM][WARNING] [TM] `max_context_token_num` is not set, default to 202752.
2026-02-24 15:36:01,275 - lmdeploy - WARNING - turbomind.py:246 - get 27431 model params
[TM][WARNING] [SegMgr] prefix caching is enabled
[TM][WARNING] `session_len` truncated to 4928 due to limited KV cache memory
[TM][ERROR] [Engine] Warm-up for 6144 tokens failed with status 6
[TM][ERROR] [Engine] Warm-up for 8192 tokens failed with status 6
[TM][ERROR] [Engine] Warm-up for 8320 tokens failed with status 6
HINT:    Please open http://127.0.0.1:10002 in a browser for detailed api usage!!!
INFO:     Started server process [7100]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:10002 (Press CTRL+C to quit)

Originally posted by @lingyezhixing in #4283
