Description
I am running inference on a model that I have built from Python code through the TensorRT API. All layers are forced to run convolution in INT8 precision.
The minimal network that reproduces the problem consists of just one convolutional layer and looks like this (a builder-API sketch follows the list):
- INT8 input tensor
- DQ layer (removed by TensorRT during the build)
- INT8 convolution - the weights are initialized through a Q/DQ pair, with FP32 scales
- Q layer
- INT8 output tensor
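For context, here is a minimal sketch of how such a network can be assembled with the Python builder API. The calls are illustrative rather than a copy of my actual code; in particular the `fp32_const` helper, the scale values, and feeding the Q/DQ'd weights to the convolution via `set_input(1, ...)` are assumptions:

```python
import numpy as np
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

def fp32_const(net, values):
    # Illustrative helper: FP32 constant tensor used as a Q/DQ scale.
    arr = np.asarray(values, dtype=np.float32)
    return net.add_constant(arr.shape, trt.Weights(arr)).get_output(0)

# INT8 input tensor, NCHW 1x128x8x8
inp = network.add_input("input", trt.int8, (1, 128, 8, 8))

# DQ layer on the input (the layer that gets removed/fused during the build)
dq = network.add_dequantize(inp, fp32_const(network, [1.0]))

# Weights initialized through a Q/DQ pair, scales in FP32 (KCRS layout here)
w = np.zeros((1, 128, 5, 5), dtype=np.float32)
w[0, 0, 2, 2] = 1.0                                    # identity tap
w_const = network.add_constant(w.shape, trt.Weights(w)).get_output(0)
w_q = network.add_quantize(w_const, fp32_const(network, [1.0]))
w_dq = network.add_dequantize(w_q.get_output(0), fp32_const(network, [1.0]))

# INT8 convolution; the kernel comes from the Q/DQ branch via set_input(1, ...)
conv = network.add_convolution_nd(dq.get_output(0), 1, (5, 5), trt.Weights())
conv.set_input(1, w_dq.get_output(0))

# Q layer producing the INT8 output tensor, scale = 252.0 in the repro
q = network.add_quantize(conv.get_output(0), fp32_const(network, [252.0]))
network.mark_output(q.get_output(0))
```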
I have written a number of unit tests that validate various scenarios, and almost everything works as expected. However, there is an issue with rounding values of 0.5. According to the TensorRT Developer Guide, round-with-ties-to-even is used.
Now, depending on the kernel size, the accumulated and scaled value of 0.5 gets rounded to 1 or to 0.
Here, 0.5 is the sum of the Hadamard product computed by the convolution, divided by the scale passed at Q-layer creation (in network.add_quantize).
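For clarity, this is what round-with-ties-to-even means for this exact case (plain NumPy, values taken from the reproduction below):

```python
import numpy as np

# The convolution accumulates 126 in INT32; the Q layer divides by the
# scale 252.0, which gives exactly 0.5.
acc = np.int32(126)
scale = np.float32(252.0)
scaled = acc / scale                                   # exactly 0.5

# Round half to even, as the developer guide documents for INT8 quantization:
# 0.5 is equidistant from 0 and 1, and the even neighbour is 0.
quantized = np.clip(np.rint(scaled), -128, 127).astype(np.int8)
print(quantized)                                       # 0
```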
Environment
TensorRT Version: 8.6.0
NVIDIA GPU: RTX 3090Ti (desktop), RTX 3070 (laptop), GTX 1050 (laptop)
NVIDIA Driver Version: 525.105.17
CUDA Version: 11.7
CUDNN Version: 8.9
Operating System: Ubuntu 20.04
Python Version (if applicable): 3.8
Tensorflow Version (if applicable): TF 2.11.0
Relevant Files
See here for the TRT builder output of the built engine.
Steps To Reproduce
- Create an INT8 input tensor and fill it with the value 126 - e.g. NCHW 1x128x8x8, all values set to 126.
- Create an identity kernel that simply copies the input value - e.g. RSCK 5x5x128x1. Fill the kernel with zeros and set a single value to 1 (e.g. the middle of the kernel), so that the input value passes through the convolution unchanged.
- Set the scale of the Q layer to 252.0.
126 divided by 252.0 is 0.5, which should be rounded to 0, but it is rounded to 0 or 1 depending on which tactic is picked during the build (or at least that is what it looks like). The expected value is 0; a reference calculation follows below.
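A plain-NumPy reference calculation for these steps (the shapes and the single output position are illustrative, and KCRS layout is used here instead of RSCK):

```python
import numpy as np

# Input: NCHW 1x128x8x8, every value 126 (widened to INT32 for accumulation).
x = np.full((1, 128, 8, 8), 126, dtype=np.int32)

# Identity kernel, KCRS 1x128x5x5: all zeros except a single centre tap.
w = np.zeros((1, 128, 5, 5), dtype=np.int32)
w[0, 0, 2, 2] = 1

# Accumulate one valid output position (the 5x5 window centred at (4, 4)).
acc = np.sum(x[0, :, 2:7, 2:7] * w[0])                 # = 126

out_scale = 252.0
q = np.rint(acc / out_scale)                           # 0.5 -> 0 with ties-to-even
print(int(np.clip(q, -128, 127)))                      # expected INT8 output: 0
```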
When the kernel size is 3x3x128x1, the value 0 is returned.
Tactic: "sm80_xmma_fprop_implicit_gemm_interleaved_i8i8_i8i32_f32_nchw_vect_c_32kcrs_vect_c_32_nchw_vect_c_32_tilesize32x32x64_stage6_warpsize2x1x1_g1_tensor16x8x32_t1r3s3"
When the kernel size is 5x5x128x1, the value 1 is returned.
Tactic: "sm80_xmma_fprop_implicit_gemm_interleaved_i8i8_i8i32_f32_nchw_vect_c_32kcrs_vect_c_32_nchw_vect_c_32_tilesize32x32x64_stage6_warpsize2x1x1_g1_tensor16x8x32_t1r5s5"
When the kernel size is 7x7x128x1, the value 1 is returned.
Tactic: "sm80_xmma_fprop_implicit_gemm_interleaved_indexed_i8i8_i8i32_f32_nchw_vect_c_32kcrs_vect_c_32_nchw_vect_c_32_tilesize32x64x64_stage6_warpsize2x2x1_g1_tensor16x8x32"
This may seem like a small difference, but we need the values in the output tensor to always exactly match the expected result for a particular input.
- What is the correct rounding?
- Any ideas why some tactics return 0 and others return 1?
- How can this behavior be made consistent?
When I run the same inputs and weights (both restricted to INT8 values) in TensorFlow through the nn.conv2d layer in FP32 precision, and then scale and round in the code, I get the correct value of 0 (see the sketch below).
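Roughly like this (a sketch of the TensorFlow reference with the repro values; the names are illustrative):

```python
import numpy as np
import tensorflow as tf

# FP32 convolution on INT8-valued data, NHWC input and RSCK (HWIO) kernel.
x = tf.constant(np.full((1, 8, 8, 128), 126.0, dtype=np.float32))
w = np.zeros((5, 5, 128, 1), dtype=np.float32)
w[2, 2, 0, 0] = 1.0                                    # identity tap
y = tf.nn.conv2d(x, tf.constant(w), strides=1, padding="VALID")

scaled = y / 252.0                                     # every output is exactly 0.5
rounded = np.rint(scaled.numpy())                      # round half to even -> 0
print(np.unique(rounded))                              # [0.]
```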