Commit

added info on fusing batch-norm to conv/fc layers
Naveen Suda committed Jun 25, 2018
1 parent ae31ce1 commit e3f1685
Showing 1 changed file with 9 additions and 0 deletions.
9 changes: 9 additions & 0 deletions Deployment/Quant_guide.md
@@ -27,3 +27,12 @@ python quant_test.py --model_architecture dnn --model_size_info 144 144 144 --dc
--window_stride_ms 40 --checkpoint <path to trained model checkpoint>
```
After quantizing the weights, the next step is quantizing the activation data, i.e. the layer inputs/outputs. [quant_models.py](../quant_models.py) has a fake_quant_with_min_max_args Op inserted, with its min/max ranges exposed as command-line arguments. The optimal activation range that maximizes accuracy can be determined progressively for each layer, starting from the input layer and working towards the output layer. For example, to quantize the neural network input data (i.e. the MFCC features) to the range [-32,32) while leaving the other layer activations as floating point, run `python quant_test.py <model-specific arguments and checkpoint file> --act_max 32 0 0 0 0`. Once the activations of all layers are quantized and the accuracy is satisfactory, the model can be deployed by calling the optimized neural network kernels from [CMSIS-NN](https://github.com/ARM-software/CMSIS_5) with the appropriate scaling parameters obtained from the quantization steps. For example, from the quantization sweeps we find that the input to the first fully-connected layer in the DNN model is in Q5.2 format, i.e. 2 bits after the decimal point (range [-32, 31.75]), and that the expected output format for maximum accuracy is also Q5.2. From quant_test.py, the quantized weights are in Q0.7 format (i.e. in the range [-1, 1)) and the quantized biases have 8 fractional bits (i.e. in the range [-0.5, 0.5)). So the product (layer_input x weights) has 9 fractional bits (Q5.2 x Q0.7 = Qx.9), and the biases need to be shifted left by 1 to reach the same representation (i.e. Qx.8 << 1 = Qy.9). The layer output has 2 fractional bits (Q5.2), so the accumulated result needs to be shifted right by 7 to get back to Q5.2 format.
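To make the bit bookkeeping concrete, here is a minimal NumPy sketch of one such fully-connected computation. This is an illustration only, not the CMSIS-NN kernel: the function name, the truncating right-shift and the saturation to int8 are simplifying assumptions.
```python
import numpy as np

# Illustrative only: emulate a fully-connected layer with the Q-formats derived
# above (input Q5.2, weights Q0.7, biases with 8 fractional bits, output Q5.2).
# x_q5_2: int8 vector of shape (num_inputs,)
# w_q0_7: int8 matrix of shape (num_outputs, num_inputs)
# b_q_8:  int8 vector of shape (num_outputs,)
def fc_q5_2(x_q5_2, w_q0_7, b_q_8, bias_left_shift=1, out_right_shift=7):
    # Q5.2 x Q0.7 accumulates with 2 + 7 = 9 fractional bits (Qx.9)
    acc = np.dot(w_q0_7.astype(np.int32), x_q5_2.astype(np.int32))
    # align the biases to 9 fractional bits: Qx.8 << 1 = Qy.9
    acc += b_q_8.astype(np.int32) << bias_left_shift
    # drop 7 fractional bits to return to Q5.2, then saturate to int8
    out = acc >> out_right_shift
    return np.clip(out, -128, 127).astype(np.int8)
```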

## Fusing batch-norm layers
The parameters of a batch normalization layer that follows a convolution or fully-connected layer (i.e. the mean, variance, scale and offset factors: see [this](https://www.tensorflow.org/api_docs/python/tf/nn/batch_normalization) for more details) can be fused into the corresponding conv/fc layer weights and biases, which saves both memory and inference time.
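The folding itself is simple arithmetic; the NumPy sketch below illustrates it under assumed parameter names and a placeholder epsilon (see fold_batchnorm.py for the actual implementation):
```python
import numpy as np

# Minimal sketch (names and eps are illustrative): fold the batch-norm transform
#   y = gamma * (x - mean) / sqrt(variance + eps) + beta
# into the weights and bias of the preceding conv/fc layer.
def fold_batchnorm_params(weights, bias, gamma, beta, mean, variance, eps=1e-3):
    scale = gamma / np.sqrt(variance + eps)   # one scale per output channel
    folded_weights = weights * scale          # broadcasts over the output-channel axis
    folded_bias = (bias - mean) * scale + beta
    return folded_weights, folded_bias
```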
```bash
# generate a new checkpoint (*_bnfused) with batch-norm fused to conv/fc layers
python fold_batchnorm.py --model_architecture ds_cnn --model_size_info 5 64 10 4 2 2 64 3 3 1 1 64 3 3 1 1 64 3 3 1 1 64 3 3 1 1 \
--dct_coefficient_count 10 --window_size_ms 40 --window_stride_ms 20 \
--checkpoint work/DS_CNN/DS_CNN1/training/best/<checkpoint name>
# continue with quantizing weights and activations as shown in previous steps, but with the new checkpoint
```