Skip to content

Quantization

Gaurav14cs17 edited this page Jun 21, 2026 · 1 revision

Quantization

Quantization reduces model precision from FP32 to lower bit-widths (INT8, FP16), dramatically reducing model size and inference latency.

Post-Training Quantization (PTQ)

PTQ quantizes a pre-trained model without retraining. It requires a small calibration dataset.

from flashoptim import FlashOptim, PTQuantizer

model = FlashOptim("pretrained/model.pth")

quantizer = PTQuantizer(
    dtype="int8",
    calibration_samples=500,
    per_channel=True,
    symmetric=True,
)
quantized = quantizer.quantize(model, calibration_data="data/calibration/")

Calibration Strategies

Strategy Description Best For
minmax Track min/max activation ranges Speed
histogram Entropy-based calibration Accuracy
percentile Clip outliers at percentile Robustness

Quantization-Aware Training (QAT)

QAT simulates quantization during training, enabling the model to adapt to reduced precision.

from flashoptim import FlashOptim, QATTrainer

model = FlashOptim("pretrained/model.pth")

trainer = QATTrainer(
    dtype="int8",
    epochs=10,
    lr=0.001,
)
qat_model = trainer.train(model, train_data="data/train/", val_data="data/val/")

Mixed Precision

quantizer = PTQuantizer(
    dtype="mixed",
    sensitive_layers=["backbone.layer1", "head.cls"],  # Keep FP32
)

CLI Usage

flashoptim quantize --config configs/flashoptim_quantize_int8.yaml

Tips

  • Use at least 100-500 calibration samples representative of your data
  • Histogram calibration is slower but often more accurate
  • QAT typically recovers 0.5-1% accuracy compared to PTQ
  • Per-channel quantization is more accurate than per-tensor

Clone this wiki locally