[Enhancement] support online quantization by haoyangli0109 · Pull Request #653 · ROCm/ATOM

haoyangli0109 · 2026-04-28T04:58:28Z

Users configure the feature through a JSON-based online_quant_config object with three fields: global_quant_config, layer_quant_config, and exclude_layer. Wildcard patterns in these fields are matched against full_name values.

Once a name is matched, PTPC or MXFP4 quantization is applied, the value is modified in place, and the quantization attribute is updated. If the value cannot be retrieved, the original quantization type or a high-precision type is used instead.
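
The matching described above can be sketched as follows. This is only an illustration, not the code in this PR: it assumes shell-style wildcards (Python's `fnmatch`), assumes that exclude_layer takes precedence over layer_quant_config, which in turn takes precedence over global_quant_config, and uses a hypothetical helper `resolve_quant_type` with made-up layer names.

```python
from fnmatch import fnmatchcase


def resolve_quant_type(full_name, config, default=None):
    """Return the quantization type for one layer name (hypothetical helper).

    Assumed precedence: exclude_layer > layer_quant_config > global_quant_config.
    `default` stands in for "keep the original or a high-precision type".
    """
    for pattern in config.get("exclude_layer", []):
        if fnmatchcase(full_name, pattern):
            return default
    for pattern, quant_type in config.get("layer_quant_config", {}).items():
        if fnmatchcase(full_name, pattern):
            return quant_type
    return config.get("global_quant_config", default)


config = {
    "global_quant_config": "ptpc_fp8",
    "layer_quant_config": {"*expert*": "mxfp4"},
    "exclude_layer": ["lm_head", "*.gate.*"],
}

# Illustrative layer names only; the real full_name values depend on the model.
for name in [
    "model.layers.3.mlp.experts.0.down_proj",
    "model.layers.3.mlp.gate.weight",
    "model.layers.0.self_attn.q_proj",
    "lm_head",
]:
    print(name, "->", resolve_quant_type(name, config))
```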

For more details, see "3.7 Online Quantization at Load Time" in docs/configuration_guide.md

Feature

Support mixed mxfp4 and ptpc_fp8 quantization for linear layers
Support mixed mxfp4 and ptpc_fp8 quantization for MoE layers
For the PTPC format and certain other necessary cases, gather all weights before quantization
Support dpsk DQ and Q
Check EP mode
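
One plausible reason for gathering weights before PTPC quantization is that a per-channel scale depends on the whole channel, so a scale computed on a single shard can differ from the scale of the full weight. The sketch below illustrates only that arithmetic in plain Python. It assumes one scale per output row, an FP8 E4M3 maximum of 448 (other FP8 variants use a different maximum), and sharding along the input dimension; none of these details are confirmed by this PR, and the helper names are hypothetical.

```python
FP8_E4M3_MAX = 448.0  # assumed finite maximum; hardware-specific FP8 variants may differ


def per_channel_scales(weight):
    """One scale per output channel (row): max(|row|) / FP8 max."""
    return [max(abs(v) for v in row) / FP8_E4M3_MAX for row in weight]


def gather_columns(shards):
    """Concatenate shards that were split along the input dimension."""
    return [sum((shard[r] for shard in shards), []) for r in range(len(shards[0]))]


# Two hypothetical shards of a 2 x 4 weight, each holding half of every row.
shard0 = [[0.5, -1.0], [2.0, 0.25]]
shard1 = [[4.0, 0.1], [-0.5, 1.0]]

full = gather_columns([shard0, shard1])
scales_full = per_channel_scales(full)
scales_shard0 = per_channel_scales(shard0)

# Row 0 has its largest magnitude in shard1, so shard0 alone underestimates the scale.
print(scales_full, scales_shard0)
```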

| model | TTFT offline | TTFT online | TPOT offline | TPOT online | gsm8k offline | gsm8k online | elapsed_seconds/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-30B-A3B-Thinking-2507-ptpc | 126.81 | 126.93 | 11.65 | 11.55 | 0.6861 | 0.6971 | 0.834 |
| Qwen3-235B-A22B-Instruct-2507-MXFP4 | 450.51 | 445.52 | 34.16 | 34.03 | 0.8961 | 0.8976 | 1.398 |
| DeepSeek-R1-0528-attn-ptpc-moe-mxfp4 | 296.06 | 296.39 | 65.46 | 65.22 | 0.9462 | 0.9484 | 8.148 |
| DeepSeek-R1-0528-attn-ptpc-moe-mxfp4 with mtp | 339.86 | 339.78 | 26.39 | 26.47 | 0.9439 | 0.9439 | 8.975 |
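
To compare the offline and online columns, the relative change of each metric can be computed from the reported numbers. The snippet below is a small convenience sketch: the values are copied from the results above, while the `relative_change` helper and the output layout are illustrative choices.

```python
# (offline, online) pairs copied from the reported results.
results = {
    "Qwen3-30B-A3B-Thinking-2507-ptpc": {
        "TTFT": (126.81, 126.93), "TPOT": (11.65, 11.55), "gsm8k": (0.6861, 0.6971)},
    "Qwen3-235B-A22B-Instruct-2507-MXFP4": {
        "TTFT": (450.51, 445.52), "TPOT": (34.16, 34.03), "gsm8k": (0.8961, 0.8976)},
    "DeepSeek-R1-0528-attn-ptpc-moe-mxfp4": {
        "TTFT": (296.06, 296.39), "TPOT": (65.46, 65.22), "gsm8k": (0.9462, 0.9484)},
    "DeepSeek-R1-0528-attn-ptpc-moe-mxfp4 with mtp": {
        "TTFT": (339.86, 339.78), "TPOT": (26.39, 26.47), "gsm8k": (0.9439, 0.9439)},
}


def relative_change(offline, online):
    """Signed change of the online value relative to the offline value."""
    return (online - offline) / offline


changes = {
    (model, metric): relative_change(*pair)
    for model, metrics in results.items()
    for metric, pair in metrics.items()
}

# The entry with the largest absolute relative change.
worst = max(changes, key=lambda key: abs(changes[key]))

for (model, metric), change in changes.items():
    print(f"{model:48s} {metric:6s} {change:+.2%}")
```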

Reproduction
aiter: d6e73f9
atom: 81054f9

command:

qwen3-30B ptpc online & offline command
python3 -m atom.entrypoints.openai_server --model /shareddata/Qwen/Qwen3-30B-A3B-Thinking-2507 \
  -tp 4 --port 5679 --server-port 7778 \
  --online_quant_config '{"global_quant_config":"ptpc_fp8","layer_quant_config":{"*expert*":"ptpc_fp8"},"exclude_layer":["lm_head","*.gate.*"]}' 

python3 -m atom.entrypoints.openai_server --model /shareddata/amd/Qwen3-30B-A3B-Thinking-2507-ptpc \
  -tp 4 --port 5679 --server-port 7778

deepseek-r1-0528 online & offline command
python3 -m atom.entrypoints.openai_server --model /shareddata/deepseek-ai/DeepSeek-R1-0528 \
  --enforce-eager -tp 8 \
  --port 5679 --server-port 7778 \
  --online_quant_config '{"global_quant_config":"ptpc_fp8","layer_quant_config":{"*expert*":"mxfp4"},"exclude_layer":["lm_head","*.gate.*"]}' \
  --method mtp --num-speculative-tokens 3 

Qwen3-235B-A22B-Instruct-2507 mxfp4 online & offline command
 python -m atom.entrypoints.openai_server \
  --model /shareddata/Qwen/Qwen3-235B-A22B-Instruct-2507 \
  -tp 2 --enable-expert-parallel \
  --port 5679 --server-port 7778 \
  --online_quant_config '{"global_quant_config":"mxfp4","exclude_layer":["lm_head","*.gate.*"]}'
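
Because the config is passed as inline JSON, quoting mistakes are easy to make when the commands above are edited by hand. One way to avoid them, sketched here with only the Python standard library, is to build the dictionary in Python and let `json.dumps` and `shlex.quote` produce the argument. The dictionary mirrors the DeepSeek-R1 command above; generating the argument this way is a suggestion, not part of this PR.

```python
import json
import shlex

online_quant_config = {
    "global_quant_config": "ptpc_fp8",
    "layer_quant_config": {"*expert*": "mxfp4"},
    "exclude_layer": ["lm_head", "*.gate.*"],
}

# Compact separators reproduce the single-line form used in the commands above.
config_json = json.dumps(online_quant_config, separators=(",", ":"))

# shlex.quote wraps the JSON in single quotes so the shell passes it through unchanged.
cli_argument = "--online_quant_config " + shlex.quote(config_json)
print(cli_argument)
```

Parsing the string back with `json.loads` before launching the server is a cheap way to catch typographic quotes or other invalid JSON early.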

  

**ACC & performance command**
lm_eval \
  --model local-completions \
  --model_args "model=model_path,base_url=http://localhost:7778/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size auto
  
python -m atom.benchmarks.benchmark_serving \
  --model=model_path --backend=vllm --base-url=http://localhost:7778 \
  --dataset-name=random \
  --random-input-len=1024 --random-output-len=1024 \
  --random-range-ratio=0.8 \
  --num-prompts=1280 --max-concurrency=128 \
  --request-rate=inf --ignore-eos \
  --save-result --percentile-metrics="ttft,tpot,itl,e2el"

Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com>

lihaoyang-amd · 2026-05-15T07:47:50Z

Hi, @valarLip
I think it’s necessary to provide CI tests for online quantization.
I suggest using the following configuration:

python3 -m atom.entrypoints.openai_server --model /shareddata/deepseek-ai/DeepSeek-R1-0528 \
  -tp 8 \
  --online_quant_config '{"global_quant_config":"ptpc_fp8","layer_quant_config":{"*expert*":"mxfp4"},"exclude_layer":["lm_head","*.gate.*"]}' \
  --method mtp --num-speculative-tokens 3 
  
 python -m atom.entrypoints.openai_server \
  --model /shareddata/Qwen/Qwen3-235B-A22B-Instruct-2507 \
  -tp 2 --enable-expert-parallel \
  --port 5679 --server-port 7778 \
  --online_quant_config '{"global_quant_config":"mxfp4","exclude_layer":["lm_head","*.gate.*"]}'

haoyangli0109 force-pushed the lhy/online_quantization branch from efba94e to e8fca54 Compare April 28, 2026 05:51

haoyangli0109 force-pushed the lhy/online_quantization branch 2 times, most recently from 9abf8bf to 92ec964 Compare May 7, 2026 08:30

haoyangli0109 marked this pull request as ready for review May 7, 2026 08:54

lihaoyang-amd requested a review from valarLip May 8, 2026 11:10

haoyangli0109 force-pushed the lhy/online_quantization branch from 92ec964 to 21baff7 Compare May 12, 2026 09:39

haoyangli0109 added 2 commits May 12, 2026 11:31

WIP

2c2978e

Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com>

add docs

6a6c873

Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com>

haoyangli0109 force-pushed the lhy/online_quantization branch from 21baff7 to 6a6c873 Compare May 12, 2026 11:31

valarLip approved these changes May 15, 2026

valarLip merged commit d734536 into ROCm:main May 19, 2026
51 of 58 checks passed

valarLip merged 2 commits into ROCm:main from haoyangli0109:lhy/online_quantization

4 participants
