
Quantization
============

Quantization refers to compressing models by reducing the number of bits required to represent weights or activations, which can reduce both computation and inference time. In the context of deep neural networks, the dominant numerical format for model weights is 32-bit float (FP32). Many studies have demonstrated that weights and activations can be represented with 8-bit integers without significant loss in accuracy. Quantization to even lower bit-widths, such as 4, 2, or 1 bit, is an active area of research.
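
To make the idea concrete, the sketch below shows symmetric per-tensor INT8 quantization in plain NumPy. This illustrates the general technique only; it is not NNI code, and the function names are hypothetical.

.. code-block:: python

    import numpy as np

    def quantize_int8(weights):
        """Map FP32 values onto the signed 8-bit range [-127, 127]."""
        # One scale for the whole tensor; guard against all-zero weights.
        scale = max(float(np.max(np.abs(weights))) / 127.0, 1e-12)
        q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        """Recover an FP32 approximation of the original weights."""
        return q.astype(np.float32) * scale

    weights = np.random.randn(4, 4).astype(np.float32)
    q, scale = quantize_int8(weights)
    print("max error:", np.max(np.abs(weights - dequantize(q, scale))))

Because no value exceeds the chosen range, the per-weight reconstruction error is bounded by half the quantization step (scale / 2), which is why 8-bit weights usually preserve accuracy while lower bit-widths require more careful training.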

A quantizer is an implementation of a quantization algorithm in NNI. NNI provides the quantizers listed below; you can also create your own quantizer using the NNI model compression interface.
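
As a rough usage sketch (assuming the NNI 2.x PyTorch compression interface; the exact import path, constructor arguments, and config keys may differ across NNI versions), applying a built-in quantizer looks approximately like this:

.. code-block:: python

    import torch
    from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer

    # A toy model to quantize (illustrative only).
    model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU())
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Quantize the weights of all Conv2d layers to 8 bits.
    config_list = [{
        'quant_types': ['weight'],
        'quant_bits': {'weight': 8},
        'op_types': ['Conv2d'],
    }]

    quantizer = QAT_Quantizer(model, config_list, optimizer)
    # Wraps the model so quantization is simulated during training.
    quantizer.compress()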

..  toctree::
    :maxdepth: 1

    Quantizers <Quantizer>
    Quantization Speedup <QuantizationSpeedup>