[Enhancement] support online quantization#653

Merged
valarLip merged 2 commits into
ROCm:mainfrom
haoyangli0109:lhy/online_quantization
May 19, 2026

@haoyangli0109 haoyangli0109 commented Apr 28, 2026

Users configure online quantization through a JSON object, online_quant_config, which has three fields: global_quant_config, layer_quant_config, and exclude_layer. The patterns in these fields use wildcards and are matched against each layer's full_name.

When a name matches, PTPC or MXFP4 quantization is applied to that layer, the value is modified in place, and the layer's quantization attribute is updated. If the value cannot be retrieved, the layer keeps its original quantization type or falls back to a high-precision type.

For more details, see "3.7 Online Quantization at Load Time" in docs/configuration_guide.md.
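
To make the matching rules concrete, here is a minimal sketch of how a layer name might be resolved to a quantization type. It is not the code from this PR: the function name resolve_quant_type, the use of Python's fnmatch for wildcards, the precedence order (exclude_layer first, then layer_quant_config, then global_quant_config), and the example layer names are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Config copied from the DeepSeek-R1 command in this PR description.
CONFIG = {
    "global_quant_config": "ptpc_fp8",
    "layer_quant_config": {"*expert*": "mxfp4"},
    "exclude_layer": ["lm_head", "*.gate.*"],
}


def resolve_quant_type(full_name: str, config: dict) -> Optional[str]:
    """Return the quantization type for one layer, or None to leave it untouched.

    Hypothetical helper: the precedence below is an assumption, not
    confirmed behavior of the implementation.
    """
    # Assumption: an exclusion wins over every other rule.
    for pattern in config.get("exclude_layer", []):
        if fnmatchcase(full_name, pattern):
            return None
    # Assumption: a matching per-layer pattern overrides the global default.
    for pattern, quant_type in config.get("layer_quant_config", {}).items():
        if fnmatchcase(full_name, pattern):
            return quant_type
    # Fall back to the global setting (None if it is absent).
    return config.get("global_quant_config")


if __name__ == "__main__":
    # Hypothetical layer names, chosen only to exercise each rule.
    for name in (
        "lm_head",
        "model.layers.3.mlp.gate.weight",
        "model.layers.3.mlp.experts.7.down_proj",
        "model.layers.3.self_attn.o_proj",
    ):
        print(name, "->", resolve_quant_type(name, CONFIG))
```

Under these assumptions, the pattern "*.gate.*" excludes the MoE router but not a projection named gate_proj, because the pattern requires a dot on both sides of "gate".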

Features

  1. Support linear layers with mixed MXFP4 and PTPC FP8 quantization.
  2. Support MoE layers with mixed MXFP4 and PTPC FP8 quantization.
  3. For the PTPC format and certain other necessary cases, gather all weights before quantization.
  4. Support DQ and Q for DeepSeek (dpsk).
  5. Check EP (expert-parallel) mode.
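
The list does not say why PTPC needs the full weight before quantization. One plausible reason, assuming PTPC keeps one FP8 scale per output channel computed over the whole input dimension, is that a rank holding only a slice of the input dimension would compute a different scale than the full weight requires. The sketch below illustrates that with plain Python lists; the weight values, the two-way split, and FP8_MAX are assumptions for illustration, not values taken from this PR.

```python
# Assumption: largest finite value of the OCP FP8 E4M3 format.
# Other FP8 variants use a different maximum.
FP8_MAX = 448.0


def per_channel_scales(weight):
    """One scale per output channel (row): max |w| in the row / FP8_MAX."""
    return [max(abs(v) for v in row) / FP8_MAX for row in weight]


# Hypothetical weight with 2 output channels and 4 input features.
full_weight = [
    [0.5, -1.0, 0.25, 4.0],
    [2.0, 0.1, -0.3, 0.2],
]

# Split along the input dimension, as a sharded layer on 2 ranks might hold it.
shard_0 = [row[:2] for row in full_weight]
shard_1 = [row[2:] for row in full_weight]

scales_full = per_channel_scales(full_weight)
scales_0 = per_channel_scales(shard_0)
scales_1 = per_channel_scales(shard_1)

if __name__ == "__main__":
    print("full   :", scales_full)
    print("shard 0:", scales_0)
    print("shard 1:", scales_1)
```

In this toy case each shard sees a different row maximum, so quantizing shards independently would give the two halves of one channel inconsistent scales. Gathering the weight first avoids that.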

| model | TTFT offline | TTFT online | TPOT offline | TPOT online | gsm8k offline | gsm8k online | elapsed_seconds (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-30B-A3B-Thinking-2507-ptpc | 126.81 | 126.93 | 11.65 | 11.55 | 0.6861 | 0.6971 | 0.834 |
| Qwen3-235B-A22B-Instruct-2507-MXFP4 | 450.51 | 445.52 | 34.16 | 34.03 | 0.8961 | 0.8976 | 1.398 |
| DeepSeek-R1-0528-attn-ptpc-moe-mxfp4 | 296.06 | 296.39 | 65.46 | 65.22 | 0.9462 | 0.9484 | 8.148 |
| DeepSeek-R1-0528-attn-ptpc-moe-mxfp4 with mtp | 339.86 | 339.78 | 26.39 | 26.47 | 0.9439 | 0.9439 | 8.975 |
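
One way to read the table above is to compute the relative gap between the online and offline runs for each metric. The sketch below does that with the values copied from the table; the helper name relative_delta_percent is hypothetical, and a positive result means the online value is higher than the offline one.

```python
# (offline, online) pairs copied from the results table in this PR.
results = {
    "Qwen3-30B-A3B-Thinking-2507-ptpc": {
        "TTFT": (126.81, 126.93), "TPOT": (11.65, 11.55), "gsm8k": (0.6861, 0.6971),
    },
    "Qwen3-235B-A22B-Instruct-2507-MXFP4": {
        "TTFT": (450.51, 445.52), "TPOT": (34.16, 34.03), "gsm8k": (0.8961, 0.8976),
    },
    "DeepSeek-R1-0528-attn-ptpc-moe-mxfp4": {
        "TTFT": (296.06, 296.39), "TPOT": (65.46, 65.22), "gsm8k": (0.9462, 0.9484),
    },
    "DeepSeek-R1-0528-attn-ptpc-moe-mxfp4 with mtp": {
        "TTFT": (339.86, 339.78), "TPOT": (26.39, 26.47), "gsm8k": (0.9439, 0.9439),
    },
}


def relative_delta_percent(offline: float, online: float) -> float:
    """Relative difference of online vs. offline, in percent."""
    return (online - offline) / offline * 100.0


deltas = {
    model: {metric: relative_delta_percent(*pair) for metric, pair in metrics.items()}
    for model, metrics in results.items()
}

if __name__ == "__main__":
    for model, metrics in deltas.items():
        for metric, delta in metrics.items():
            print(f"{model} {metric}: {delta:+.2f}%")
```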

Reproduction
aiter: d6e73f9
atom: 81054f9

command:

Qwen3-30B PTPC commands (online first, then offline)
python3 -m atom.entrypoints.openai_server --model /shareddata/Qwen/Qwen3-30B-A3B-Thinking-2507 \
  -tp 4 --port 5679 --server-port 7778 \
  --online_quant_config '{"global_quant_config":"ptpc_fp8","layer_quant_config":{"*expert*":"ptpc_fp8"},"exclude_layer":["lm_head","*.gate.*"]}' 

python3 -m atom.entrypoints.openai_server --model /shareddata/amd/Qwen3-30B-A3B-Thinking-2507-ptpc \
  -tp 4 --port 5679 --server-port 7778

deepseek-r1-0528 online & offline command
python3 -m atom.entrypoints.openai_server --model /shareddata/deepseek-ai/DeepSeek-R1-0528 \
  --enforce-eager -tp 8 \
  --port 5679 --server-port 7778 \
  --online_quant_config '{"global_quant_config":"ptpc_fp8","layer_quant_config":{"*expert*":"mxfp4"},"exclude_layer":["lm_head","*.gate.*"]}' \
  --method mtp --num-speculative-tokens 3 

Qwen3-235B-A22B-Instruct-2507 mxfp4 online & offline command
python -m atom.entrypoints.openai_server \
  --model /shareddata/Qwen/Qwen3-235B-A22B-Instruct-2507 \
  -tp 2 --enable-expert-parallel \
  --port 5679 --server-port 7778 \
  --online_quant_config '{"global_quant_config":"mxfp4","exclude_layer":["lm_head","*.gate.*"]}'
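
Hand-writing the JSON argument inside shell quotes is easy to get wrong. As a sketch, the argument can be built from a Python dict with json.dumps and the whole command quoted with shlex.join, both from the standard library. The flags and model path are copied from the DeepSeek-R1 command above; only the variable names are introduced here.

```python
import json
import shlex

# Same settings as the DeepSeek-R1 command in this PR description.
online_quant_config = {
    "global_quant_config": "ptpc_fp8",
    "layer_quant_config": {"*expert*": "mxfp4"},
    "exclude_layer": ["lm_head", "*.gate.*"],
}

# Compact separators reproduce the whitespace-free form used in the commands.
config_json = json.dumps(online_quant_config, separators=(",", ":"))

command = [
    "python3", "-m", "atom.entrypoints.openai_server",
    "--model", "/shareddata/deepseek-ai/DeepSeek-R1-0528",
    "-tp", "8",
    "--online_quant_config", config_json,
]

if __name__ == "__main__":
    # shlex.join adds the shell quoting needed to paste this into a terminal.
    print(shlex.join(command))
```

Passing the list directly to subprocess.run(command) would avoid shell quoting altogether.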

  

**Accuracy & performance commands**
lm_eval \
  --model local-completions \
  --model_args "model=model_path,base_url=http://localhost:7778/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size auto
  
python -m atom.benchmarks.benchmark_serving \
  --model=model_path --backend=vllm --base-url=http://localhost:7778 \
  --dataset-name=random \
  --random-input-len=1024 --random-output-len=1024 \
  --random-range-ratio=0.8 \
  --num-prompts=1280 --max-concurrency=128 \
  --request-rate=inf --ignore-eos \
  --save-result --percentile-metrics="ttft,tpot,itl,e2el"

@haoyangli0109 haoyangli0109 force-pushed the lhy/online_quantization branch from efba94e to e8fca54 Compare April 28, 2026 05:51
@haoyangli0109 haoyangli0109 force-pushed the lhy/online_quantization branch 2 times, most recently from 9abf8bf to 92ec964 Compare May 7, 2026 08:30
@haoyangli0109 haoyangli0109 marked this pull request as ready for review May 7, 2026 08:54
@lihaoyang-amd lihaoyang-amd requested a review from valarLip May 8, 2026 11:10
@haoyangli0109 haoyangli0109 force-pushed the lhy/online_quantization branch from 92ec964 to 21baff7 Compare May 12, 2026 09:39
Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com>
Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com>
@haoyangli0109 haoyangli0109 force-pushed the lhy/online_quantization branch from 21baff7 to 6a6c873 Compare May 12, 2026 11:31
lihaoyang-amd commented May 15, 2026

Hi, @valarLip
I think it’s necessary to provide CI tests for online quantization.
I suggest using the following configuration:

python3 -m atom.entrypoints.openai_server --model /shareddata/deepseek-ai/DeepSeek-R1-0528 \
  -tp 8 \
  --online_quant_config '{"global_quant_config":"ptpc_fp8","layer_quant_config":{"*expert*":"mxfp4"},"exclude_layer":["lm_head","*.gate.*"]}' \
  --method mtp --num-speculative-tokens 3 
  
python -m atom.entrypoints.openai_server \
  --model /shareddata/Qwen/Qwen3-235B-A22B-Instruct-2507 \
  -tp 2 --enable-expert-parallel \
  --port 5679 --server-port 7778 \
  --online_quant_config '{"global_quant_config":"mxfp4","exclude_layer":["lm_head","*.gate.*"]}'
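
A CI job could also sanity-check the --online_quant_config string before launching the server, since typographic quotes pasted from a rendered page make the JSON invalid. The following is a minimal sketch of such a check, not part of this PR: the function name check_online_quant_config is hypothetical, and the set of accepted keys is assumed to be exactly the three fields named in the PR description.

```python
import json

# Assumption: these three fields are the only top-level keys.
KNOWN_KEYS = {"global_quant_config", "layer_quant_config", "exclude_layer"}


def check_online_quant_config(raw: str) -> dict:
    """Parse the flag value and reject malformed input (hypothetical helper)."""
    # json.JSONDecodeError is a ValueError subclass, so curly quotes or
    # single-quoted strings surface as ValueError here.
    config = json.loads(raw)
    if not isinstance(config, dict):
        raise ValueError("online_quant_config must be a JSON object")
    unknown = set(config) - KNOWN_KEYS
    if unknown:
        raise ValueError(f"unknown keys: {sorted(unknown)}")
    if not isinstance(config.get("layer_quant_config", {}), dict):
        raise ValueError("layer_quant_config must map patterns to types")
    if not isinstance(config.get("exclude_layer", []), list):
        raise ValueError("exclude_layer must be a list of patterns")
    return config


if __name__ == "__main__":
    good = '{"global_quant_config":"mxfp4","exclude_layer":["lm_head","*.gate.*"]}'
    print(check_online_quant_config(good))
```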

@valarLip valarLip merged commit d734536 into ROCm:main May 19, 2026
51 of 58 checks passed
sijyang pushed a commit that referenced this pull request May 24, 2026
* WIP

Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com>

* add docs

Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com>

---------

Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com>