Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a multi-turn dialogue benchmark tool, a quantization configuration for Qwen 3.5 122B, and a server startup script. The benchmark script simulates concurrent multi-turn sessions to measure metrics such as TTFT, TPOT, and cache hit ratios. Review feedback highlights a potential issue with SSE message parsing in the benchmark script, suggesting the use of readline() for robustness. Other recommendations include optimizing prompt length calculations to prevent performance bottlenecks in long sessions, improving the handling of existing output files, and using script-relative paths in the shell script to increase portability.
| async for raw in response.content: | ||
| line = raw.strip() | ||
| if not line or not line.startswith(b"data:"): | ||
| continue | ||
| data_str = line[len(b"data:"):].strip() | ||
| if data_str == b"[DONE]": | ||
| break |
There was a problem hiding this comment.
The current implementation iterates over response.content using async for, which in aiohttp yields arbitrary chunks of data as they arrive from the network. This is not guaranteed to align with SSE message boundaries (lines starting with data:). If a chunk contains multiple lines or a partial line, json.loads will fail or data will be skipped. Using readline() ensures that each iteration processes a complete line.
| async for raw in response.content: | |
| line = raw.strip() | |
| if not line or not line.startswith(b"data:"): | |
| continue | |
| data_str = line[len(b"data:"):].strip() | |
| if data_str == b"[DONE]": | |
| break | |
| while True: | |
| line = await response.content.readline() | |
| if not line: | |
| break | |
| line = line.strip() | |
| if not line or not line.startswith(b"data:"): | |
| continue | |
| data_str = line[len(b"data:"):].strip() | |
| if data_str == b"[DONE]": | |
| break |
| else: | ||
| new_text = "" | ||
| new_prompt = prompt + generated_text + new_text | ||
| new_len = len(tokenizer.encode(new_prompt, add_special_tokens=False)) |
There was a problem hiding this comment.
Re-encoding the entire prompt on every turn leads to prompt_tokens from the previous turn to avoid redundant computation.
| if args.dump_file and os.path.exists(args.dump_file) and os.path.getsize(args.dump_file) > 0: | ||
| with open(args.dump_file, "r") as f: | ||
| print(json.dumps(json.load(f), indent=4)) | ||
| return |
There was a problem hiding this comment.
The script exits immediately if the dump_file already exists and is non-empty. This behavior is counter-intuitive for a benchmark tool; users typically expect the script to run a new benchmark and either overwrite or append to the output file. It would be better to provide a warning or a flag to control this behavior (e.g., --force to overwrite or a separate command to view results).
| ) | ||
|
|
||
| if [[ "${ENABLE_DEEPEP:-0}" == "1" || "${ENABLE_DEEPEP:-}" == "true" ]]; then | ||
| ARGS+=(--quant_cfg ../../advanced_config/mixed_quantization/qwen3_5-122b-moe-only-fp8.yaml --enable_ep_moe --dp 8 --batch_max_tokens 4096 --graph_max_batch_size 64 --chunked_prefill_size 2048 --mem_fraction 0.8 --linear_att_cache_size 300) |
There was a problem hiding this comment.
Using a relative path for --quant_cfg makes the script fragile as it depends on the current working directory from which the script is executed. It is safer to resolve the path relative to the script's own location using $(dirname "$0").
| ARGS+=(--quant_cfg ../../advanced_config/mixed_quantization/qwen3_5-122b-moe-only-fp8.yaml --enable_ep_moe --dp 8 --batch_max_tokens 4096 --graph_max_batch_size 64 --chunked_prefill_size 2048 --mem_fraction 0.8 --linear_att_cache_size 300) | |
| ARGS+=(--quant_cfg "$(dirname "$0")/../../advanced_config/mixed_quantization/qwen3_5-122b-moe-only-fp8.yaml" --enable_ep_moe --dp 8 --batch_max_tokens 4096 --graph_max_batch_size 64 --chunked_prefill_size 2048 --mem_fraction 0.8 --linear_att_cache_size 300) |
No description provided.