Model Quantization

Model quantization is an efficient model optimization technique that accelerates inference and reduces memory load while maintaining model accuracy.

Intel® Neural Compressor provides a convenient model quantization API to quantize an already-trained Lightning module with Post-training Quantization or Quantization Aware Training. This extension API combines an easy-to-use coding interface with multiple quantization options, so users can quantize their fine-tuned model by adding only a few lines to their original code. This document covers only post-training quantization.
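
For illustration, here is a minimal sketch of post-training static quantization, assuming the `neural_compressor` 2.x API (`PostTrainingQuantConfig` and `quantization.fit`). The toy model and random calibration data are stand-ins for an actual fine-tuned Lightning module and dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from neural_compressor import PostTrainingQuantConfig, quantization

# Stand-in for an already fine-tuned LightningModule (a LightningModule
# is a torch.nn.Module, so it can be passed to the same API).
model = torch.nn.Sequential(
    torch.nn.Linear(16, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 2),
)
model.eval()

# Static quantization needs a calibration dataloader to collect
# activation ranges; random tensors stand in for real data here.
calib_set = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
calib_dataloader = DataLoader(calib_set, batch_size=8)

conf = PostTrainingQuantConfig(approach="static")
q_model = quantization.fit(model, conf, calib_dataloader=calib_dataloader)
q_model.save("./quantized_model")
```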

Intel® Neural Compressor supports two post-training quantization types: post-training static quantization and post-training dynamic quantization. Post-training dynamic quantization is the recommended starting point because it reduces memory usage and speeds up computation without requiring an additional calibration dataset. It statically quantizes only the weights from floating point to integer at conversion time, while activations are quantized on the fly at runtime. This yields latencies close to post-training static quantization, but because the outputs of ops are still stored in floating point, dynamically quantized ops gain less speed than statically quantized computation.
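
Under the same assumptions as the sketch above, dynamic quantization drops the calibration step entirely, since only the weights are converted ahead of time:

```python
from neural_compressor import PostTrainingQuantConfig, quantization

# Only the weights are quantized at conversion time, so no
# calibration dataloader is required; `model` is the fine-tuned
# module from the sketch above.
conf = PostTrainingQuantConfig(approach="dynamic")
q_model = quantization.fit(model, conf)
q_model.save("./quantized_model_dynamic")
```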
