-
Notifications
You must be signed in to change notification settings - Fork 0
Quantization
Gaurav14cs17 edited this page Jun 21, 2026
·
1 revision
Quantization reduces model precision from FP32 to lower bit-widths (INT8, FP16), dramatically reducing model size and inference latency.
PTQ quantizes a pre-trained model without retraining. It requires a small calibration dataset.
from flashoptim import FlashOptim, PTQuantizer
model = FlashOptim("pretrained/model.pth")
quantizer = PTQuantizer(
dtype="int8",
calibration_samples=500,
per_channel=True,
symmetric=True,
)
quantized = quantizer.quantize(model, calibration_data="data/calibration/")| Strategy | Description | Best For |
|---|---|---|
minmax |
Track min/max activation ranges | Speed |
histogram |
Entropy-based calibration | Accuracy |
percentile |
Clip outliers at percentile | Robustness |
QAT simulates quantization during training, enabling the model to adapt to reduced precision.
from flashoptim import FlashOptim, QATTrainer
model = FlashOptim("pretrained/model.pth")
trainer = QATTrainer(
dtype="int8",
epochs=10,
lr=0.001,
)
qat_model = trainer.train(model, train_data="data/train/", val_data="data/val/")quantizer = PTQuantizer(
dtype="mixed",
sensitive_layers=["backbone.layer1", "head.cls"], # Keep FP32
)flashoptim quantize --config configs/flashoptim_quantize_int8.yaml- Use at least 100-500 calibration samples representative of your data
- Histogram calibration is slower but often more accurate
- QAT typically recovers 0.5-1% accuracy compared to PTQ
- Per-channel quantization is more accurate than per-tensor
FlashOptim — Model optimization toolkit | PyPI | MIT License