KV Cache Memory Estimation Error for GLM-4.7-Flash-AWQ on V100 #4366

@windreamer

Description
When running GLM-4.7-Flash-AWQ on a single V100-32G-SXM2, the KV Cache memory estimation appears to be incorrect, causing premature context length truncation.

Environment

Observed Behavior

  • Context length initialized: Only 4928 tokens (severely truncated)
  • Expected behavior: Should support much longer context with 32GB VRAM

Error Logs

[TM][WARNING] [TM] `max_context_token_num` is not set, default to 202752.
[TM][WARNING] [SegMgr] prefix caching is enabled
[TM][WARNING] `session_len` truncated to 4928 due to limited KV cache memory
[TM][ERROR] [Engine] Warm-up for 6144 tokens failed with status 6
[TM][ERROR] [Engine] Warm-up for 8192 tokens failed with status 6
[TM][ERROR] [Engine] Warm-up for 8320 tokens failed with status 6

Root Cause Analysis

The issue appears to be in the block manager's maximum block calculation logic. The session length is truncated based on:

const auto max_cached_tokens = seq_mgr_->max_block_count() * (size_t)cache_block_seq_len * param_.attn_cp_size;
session_len_trunc_ = std::min(max_cached_tokens, (size_t)param_.session_len);

Located at: lmdeploy/src/turbomind/engine/engine.cc (lines 248-253)
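As a sanity check on the reported 2.6 MB/token figure, the per-token KV cache footprint is roughly 2 (K and V) × layers × kv_heads × head_dim × bytes-per-element, and the engine only allocates whole cache blocks. The sketch below mirrors that arithmetic; the layer/head numbers are placeholders, not the real GLM-4.7-Flash-AWQ config, so substitute values from the model's `config.json`:

```python
# Hypothetical sanity check: estimate KV cache bytes/token and the resulting
# session length. All model dimensions below are PLACEHOLDERS, not the actual
# GLM-4.7-Flash config.

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """K and V tensors for every layer, fp16 by default (2 bytes/elem)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def max_session_len(free_bytes, per_token_bytes, block_seq_len=64):
    """Mirror of the engine's truncation: only whole cache blocks count."""
    blocks = free_bytes // (per_token_bytes * block_seq_len)
    return blocks * block_seq_len

# Placeholder config: 47 layers, 32 KV heads, head_dim 128, fp16.
per_token = kv_bytes_per_token(num_layers=47, num_kv_heads=32, head_dim=128)
print(f"{per_token / 2**20:.2f} MiB/token")
print(max_session_len(free_bytes=12 * 2**30, per_token_bytes=per_token))
```

If the computed MiB/token from the real config is far below 2.6 MB, that would point to the block manager over-reserving per token (e.g. not accounting for GQA, or sizing blocks for unquantized fp16 KV) rather than to genuinely limited memory.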

Questions

  1. Is the 2.6 MB/token KV cache usage expected for this model configuration?
  2. Why does the block manager underestimate available memory on V100?

@windreamer I tested this PR on a single V100-32G-SXM2. It runs, but the KV cache is far too large: the GLM-4.7-Flash-AWQ weights are 18.4 GB, total usage after startup is 31 GB, yet only a 4928-token context was initialized. Isn't 2.6 MB/token of cache usage excessive?
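The figures in this report are internally consistent: 4928 tokens at about 2.6 MB/token accounts for roughly 12.5 GB of cache, which together with the 18.4 GB of weights matches the observed 31 GB footprint (the remainder being activations and CUDA context overhead). A quick arithmetic check:

```python
# Cross-check the reported numbers: weights + KV cache should roughly
# account for the observed 31 GB of VRAM usage.
weights_gb = 18.4          # reported AWQ weight size
tokens = 4928              # initialized context length
mb_per_token = 2.6         # reported KV cache cost per token

cache_gb = tokens * mb_per_token / 1024
total_gb = weights_gb + cache_gb
print(f"cache ≈ {cache_gb:.1f} GB, weights + cache ≈ {total_gb:.1f} GB")
```

So the engine really is spending ~12.5 GB on only ~5K tokens of cache, which is why the per-token cost, not the total budget, is the suspicious quantity.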

Active code page: 65001
Add dll path C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9\bin, please note cuda version should >= 11.3 when compiled with cuda 11
The following generation flags are not valid and may be ignored: ['top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
2026-02-24 15:35:59,808 - lmdeploy - WARNING - converter.py:67 - data type fallback to float16 since torch.cuda.is_bf16_supported is False
[TM][WARNING] [TM] `max_context_token_num` is not set, default to 202752.
2026-02-24 15:36:01,275 - lmdeploy - WARNING - turbomind.py:246 - get 27431 model params
[TM][WARNING] [SegMgr] prefix caching is enabled
[TM][WARNING] `session_len` truncated to 4928 due to limited KV cache memory
[TM][ERROR] [Engine] Warm-up for 6144 tokens failed with status 6
[TM][ERROR] [Engine] Warm-up for 8192 tokens failed with status 6
[TM][ERROR] [Engine] Warm-up for 8320 tokens failed with status 6
HINT:    Please open http://127.0.0.1:10002 in a browser for detailed api usage!!!
INFO:     Started server process [7100]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:10002 (Press CTRL+C to quit)

Originally posted by @lingyezhixing in #4283
