# FP8 Quantization with Per-Channel Static Weights and Per-Token Dynamic Activations

## Installation

To get started, install AMD Quark and Transformers:

`pip install amd-quark transformers`

## Quickstart

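The example includes an end-to-end script that applies the `weight_static_fp8_activation_dynamic_fp8` quantization algorithm: weights are quantized to FP8 with static per-channel scales, and activations are quantized to FP8 with per-token scales computed dynamically at runtime.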

`python3 qwen3_example.py`


The resulting model, `Qwen3-8B-FP8-ptpc`, is ready to be loaded into vLLM.
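
As a quick smoke test of the exported checkpoint, the following sketch loads it with vLLM's offline `LLM` API and generates one completion. The helper name `generate_sample`, the prompt, and the sampling settings are illustrative assumptions, and the sketch assumes vLLM picks up the quantization settings from the exported checkpoint without extra arguments.

```python
def generate_sample(model_dir="./Qwen3-8B-FP8-ptpc", prompt="What is 12 * 7?"):
    """Load the exported checkpoint in vLLM and return one completion (hypothetical helper)."""
    # Imported lazily so this function can be defined without vLLM installed.
    from vllm import LLM, SamplingParams

    # Assumption: the quantization settings are read from the checkpoint itself.
    llm = LLM(model=model_dir)
    # Greedy decoding with a short output keeps the check fast and repeatable.
    params = SamplingParams(temperature=0.0, max_tokens=64)
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text
```

Calling `print(generate_sample())` on a machine with a supported GPU should print a short answer; an error at load time usually points to a problem with the exported files or the installed vLLM version.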

## Code Overview

Typically, quantizing a floating-point model with AMD Quark involves the following steps:

### 1) Load the original floating-point model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hugging Face model ID of the original floating-point checkpoint
ckpt_path = "Qwen/Qwen3-8B"
model = AutoModelForCausalLM.from_pretrained(ckpt_path)
model.eval()  # switch to inference mode
tokenizer = AutoTokenizer.from_pretrained(ckpt_path)
```

### 2) Set the quantization configuration

```python
from quark.torch.quantization import FP8E4M3PerChannelSpec
from quark.torch.quantization.config.config import Config, QuantizationConfig

# Weights: static FP8 (E4M3) scales, one per channel along axis 0
FP8_PER_CHANNEL_SPEC = FP8E4M3PerChannelSpec(is_dynamic=False, ch_axis=0).to_quantization_spec()
# Activations: dynamic FP8 (E4M3) scales along axis 1, used here as the per-token spec
FP8_PER_TOKEN_DYNAMIC_SPEC = FP8E4M3PerChannelSpec(is_dynamic=True, ch_axis=1).to_quantization_spec()
W_FP8_PER_CHANNEL_STATIC_A_FP8_PER_TOKEN_DYNAMIC_CONFIG = QuantizationConfig(
    input_tensors=FP8_PER_TOKEN_DYNAMIC_SPEC, weight=FP8_PER_CHANNEL_SPEC
)
# Apply the scheme to every supported layer except the output head
quant_config = Config(
    global_quant_config=W_FP8_PER_CHANNEL_STATIC_A_FP8_PER_TOKEN_DYNAMIC_CONFIG,
    exclude=["lm_head"],
)
```
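
To make the two specs more concrete, the NumPy sketch below shows what "per-channel static" and "per-token dynamic" scaling mean numerically. It is not AMD Quark's implementation: the function names are hypothetical, the `scale = amax / 448` convention is an assumption (448 is the largest finite E4M3 value), and rounding onto the FP8 grid is omitted, so only the scaling and clipping steps are illustrated.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def per_channel_weight_scales(weight):
    """One scale per output channel (row) of a [out_features, in_features] weight."""
    amax = np.abs(weight).max(axis=1, keepdims=True)
    return np.maximum(amax, 1e-12) / FP8_E4M3_MAX

def per_token_activation_scales(x):
    """One scale per token (row) of a [num_tokens, hidden] activation."""
    amax = np.abs(x).max(axis=-1, keepdims=True)
    return np.maximum(amax, 1e-12) / FP8_E4M3_MAX

rng = np.random.default_rng(0)
weight = rng.standard_normal((4, 8)).astype(np.float32)  # toy linear layer
x = rng.standard_normal((3, 8)).astype(np.float32)       # three toy tokens

# Static: weight scales are computed once, offline, and stored with the model.
w_scale = per_channel_weight_scales(weight)  # shape (4, 1)
# Dynamic: activation scales are recomputed for every incoming batch of tokens.
x_scale = per_token_activation_scales(x)     # shape (3, 1)

# Scale into the FP8 range and clip (a real kernel would also round to FP8 here).
w_scaled = np.clip(weight / w_scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
x_scaled = np.clip(x / x_scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)

# Multiply in the scaled domain, then undo both scales on the output.
y = (x_scaled @ w_scaled.T) * (x_scale * w_scale.T)
```

Because each output element `y[t, c]` only needs the token scale `x_scale[t]` and the channel scale `w_scale[c]`, the rescaling can be applied after the matrix multiply, which is what makes this pairing of granularities convenient.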

### 3) Use the AMD Quark API to perform an in-place replacement of the model's modules with quantized modules

```python
from quark.torch import ModelQuantizer

quantizer = ModelQuantizer(quant_config)
model = quantizer.quantize_model(model)
```

### 4) (Optional) Export the quantized model

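```python
from quark.torch import export_safetensors

# Name the output directory after the model, e.g. "Qwen3-8B-FP8-ptpc"
output_dir = ckpt_path.rstrip("/").split("/")[-1] + "-FP8-ptpc"
# Freeze the quantized model before exporting it
model = quantizer.freeze(model)
export_safetensors(model, output_dir)
```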
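
After exporting, it can be useful to check what was written. The sketch below uses only the standard library; the helper name `summarize_export` is hypothetical, and it assumes the export produces `*.safetensors` files alongside a Hugging Face style `config.json`. Whether that file carries a `quantization_config` entry is also an assumption, so the helper returns `None` when the key is absent.

```python
import json
from pathlib import Path

def summarize_export(output_dir):
    """Return weight-file sizes in MiB and any quantization metadata in config.json."""
    out = Path(output_dir)
    sizes = {
        p.name: round(p.stat().st_size / 2**20, 2)
        for p in sorted(out.glob("*.safetensors"))
    }
    quant_cfg = None
    config_path = out / "config.json"
    if config_path.is_file():
        # Assumption: quantization metadata, if present, is stored under this key.
        quant_cfg = json.loads(config_path.read_text()).get("quantization_config")
    return {"safetensors_mib": sizes, "quantization_config": quant_cfg}
```

For example, `print(summarize_export(output_dir))` lists the exported weight shards with their sizes, which gives a rough check that the FP8 checkpoint is smaller than the original.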

### 5) (Optional) Evaluate accuracy

Install `vllm` and `lm-evaluation-harness`:

`pip install vllm lm_eval`

Evaluate accuracy with `lm_eval` (for example on 200 samples of `gsm8k`):

```
lm_eval \
  --model vllm \
  --model_args pretrained=./Qwen3-8B-FP8-ptpc,add_bos_token=True \
  --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 200
```