This is an implementation of various gradient quantization schemes aimed at minimizing quantization variance. The code implements a range of quantizers and a harness that trains a given set of models using quantized gradients.
On each SGD iteration the harness takes the gradient, quantizes it, and only then applies it to the model, thereby emulating training with quantized gradients.
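As a minimal sketch of this emulation, assuming the parameters live in a flat numpy array and that `grad_fn` / `quantize` are hypothetical callables standing in for the framework's gradient computation and the quantizers described below (not the repository's actual API):

```python
def quantized_sgd_step(params, grad_fn, quantize, lr=0.1):
    """One emulated SGD step: compute the gradient, quantize it, then apply it."""
    grad = grad_fn(params)       # full-precision gradient
    q_grad = quantize(grad)      # emulate gradient quantization
    return params - lr * q_grad  # plain SGD update using the quantized gradient
```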
The quantization is specified by the bucket size and the quantization `level`. The gradient is split into buckets, and the chosen quantization algorithm is applied to each bucket, quantizing it to `level` distinct values.
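A minimal sketch of this bucketing step, assuming a flat numpy gradient and a per-bucket quantizer like the ones described below (`quantize_by_buckets` and `quantize_bucket` are hypothetical names):

```python
import numpy as np

def quantize_by_buckets(grad, quantize_bucket, bucket_size=512):
    """Split the gradient into buckets and quantize each bucket independently."""
    flat = grad.ravel()
    out = np.empty_like(flat, dtype=float)
    for start in range(0, flat.size, bucket_size):
        bucket = flat[start:start + bucket_size]       # last bucket may be shorter
        out[start:start + bucket_size] = quantize_bucket(bucket)
    return out.reshape(grad.shape)
```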
There are four different quantization algorithms available:
- Standard. Take the min and max of the bucket, draw `level` target points uniformly at random from the [min, max] range, and assign each value in the bucket to the nearest target point (a sketch is given below).
- Variance reduction ("smart"). Choose max(P% of the bucket size, 512) candidates from the bucket and find `level` target points that reduce the variance (see https://arxiv.org/abs/1611.05402 for more details); each value in the bucket is then assigned to the nearest target point (only the candidate-selection step is sketched below). P = 1% is chosen to amortize the complexity of the target-point algorithm, which is O(n · level · log n), where n is the number of candidate points.
- Hadamard. Almost the same as the standard scheme: apply the Walsh-Hadamard transform to the bucket and then quantize it with the standard scheme (sketched below).
- Exponential. Calculate the norm of the gradient, `norm`. For each element of the gradient, divide its absolute value by `norm` and round the result to the nearest power of two. Note that the number of distinct powers of two is `level`/2, since one bit is spent to store the sign of each element (sketched below).
By default the bucket size is 512 and `level` is 16.
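A sketch of the standard per-bucket quantizer described above (random target points in the [min, max] range, nearest-point assignment); this is an illustration, not the repository's exact code:

```python
import numpy as np

def standard_quantize(bucket, level=16, rng=None):
    """Snap every bucket value to the nearest of `level` random target points."""
    rng = np.random.default_rng() if rng is None else rng
    lo, hi = bucket.min(), bucket.max()
    if lo == hi:                     # constant bucket: nothing to quantize
        return bucket.astype(float)
    targets = np.sort(rng.uniform(lo, hi, size=level))
    # index of the nearest target for every element of the bucket
    idx = np.abs(bucket[:, None] - targets[None, :]).argmin(axis=1)
    return targets[idx]
```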
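For the variance-reduction ("smart") scheme, only the candidate-selection step is sketched; the target-point search itself follows the referenced paper and is left as a placeholder (`find_targets` is hypothetical, and random candidate selection is an assumption):

```python
import numpy as np

def smart_quantize(bucket, find_targets, level=16, p=0.01, rng=None):
    """Pick max(p * len(bucket), 512) candidates, find `level` target points
    among them with `find_targets` (the paper's algorithm, not shown here),
    then snap every value in the bucket to its nearest target."""
    rng = np.random.default_rng() if rng is None else rng
    k = min(max(int(p * len(bucket)), 512), len(bucket))
    candidates = rng.choice(bucket, size=k, replace=False)
    targets = np.sort(find_targets(candidates, level))
    idx = np.abs(bucket[:, None] - targets[None, :]).argmin(axis=1)
    return targets[idx]
```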
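A sketch of the Hadamard variant, reusing `standard_quantize` from the sketch above: rotate the bucket with a fast Walsh-Hadamard transform, quantize, and (an assumption for emulation purposes) rotate back. Zero-padding to a power-of-two length is also an assumption:

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform (input length must be a power of two)."""
    x = x.astype(float).copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def hadamard_quantize(bucket, level=16):
    """Rotate the bucket with Walsh-Hadamard, quantize, rotate back."""
    n = 1 << (len(bucket) - 1).bit_length()          # next power of two
    padded = np.zeros(n)
    padded[:len(bucket)] = bucket
    rotated = fwht(padded) / np.sqrt(n)              # orthonormal forward transform
    quantized = standard_quantize(rotated, level)    # reuse the standard quantizer
    return (fwht(quantized) / np.sqrt(n))[:len(bucket)]
```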
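A sketch of the exponential scheme; the exact set of representable powers of two (here the `level`/2 largest ones, i.e. exponents clipped to {0, -1, ..., -(level/2 - 1)}) is an assumption:

```python
import numpy as np

def exponential_quantize(grad, level=16):
    """Round |x| / norm to the nearest power of two, keeping the sign separately."""
    norm = np.linalg.norm(grad)
    if norm == 0:
        return grad.astype(float)
    out = np.zeros(grad.shape)
    nz = grad != 0
    exponents = np.round(np.log2(np.abs(grad[nz]) / norm))
    exponents = np.clip(exponents, -(level // 2) + 1, 0)  # level/2 distinct magnitudes (assumed range)
    out[nz] = np.sign(grad[nz]) * (2.0 ** exponents) * norm
    return out
```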
Usage with the different quantization algorithms:
cifar.py --model resnet_20 --optimizer quantization --quantization-type none
cifar.py --model resnet_20 --optimizer quantization --quantization-type standard
cifar.py --model resnet_20 --optimizer quantization --quantization-type smart --bucket-size 51200
cifar.py --model resnet_20 --optimizer quantization --quantization-type hadamard
cifar.py --model resnet_20 --optimizer quantization --quantization-type exponential
The following table reports timing for resnet-20 on cifar-10 in the convnet regime, i.e., 200 epochs
with a decaying learning rate, using three quantization algorithms: standard, smart, and exponential, all with
`level` 16.
| | standard | exponential | smart 1% | smart 5% | smart 15% | smart 25% |
| --- | --- | --- | --- | --- | --- | --- |
| time per epoch | ~130s | ~330s | ~350s | ~1225s | ~ | ~ |
The quantization variance is measured as follows. We find the `level` target points; then, for each value x in the bucket, we take the nearest target point l below it and the nearest target point r above it, and add (x - l)(r - x) to the variance.
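A sketch of this variance computation, assuming the target points are sorted; skipping values that fall outside the target range is an assumption:

```python
import numpy as np

def quantization_variance(bucket, targets):
    """Accumulate (x - l)(r - x) over the bucket, with l, r the nearest targets below/above x."""
    targets = np.sort(np.asarray(targets, dtype=float))
    total = 0.0
    for x in bucket:
        i = np.searchsorted(targets, x)   # first target >= x
        if i == 0 or i == len(targets):
            continue                      # x outside the target range (assumed: skipped)
        l, r = targets[i - 1], targets[i]
        total += (x - l) * (r - x)
    return total
```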
To calculate the average variance we run standard SGD for the first 200 steps: at each step we take the gradient, compute its quantization variance, and update the model with the original (unquantized) gradient.