Description
Hi,
I removed the token embedding layer from BERT and built TensorRT engines to compare inference performance in int8 and fp16 modes, but found that int8 mode is slower than fp16.
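For context, the engines were built roughly as in the sketch below (TensorRT 6 Python API; the ONNX export path, helper name, and calibrator are assumptions for illustration, not my exact code):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path, use_int8=False, calibrator=None):
    # Hypothetical helper: assumes the BERT model (minus the token
    # embedding layer) was exported to ONNX beforehand.
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        parser.parse(f.read())

    builder.max_workspace_size = 1 << 30
    if use_int8:
        builder.int8_mode = True
        builder.int8_calibrator = calibrator  # e.g. an entropy calibrator over sample inputs
    else:
        builder.fp16_mode = True
    return builder.build_cuda_engine(network)
```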
I used nvprof to view the GPU activity of the two modes, as follows:
fp16:
Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  99.87%   22.158ms  6      3.6930ms  1.7280us  22.148ms  [CUDA memcpy HtoD]
                 0.06%    13.376us  8      1.6720us  1.6000us  1.9520us  [CUDA memset]
                 0.05%    10.688us  1      10.688us  10.688us  10.688us  void cuGatherLayer::gatherGeneric<float, int=32>(void*, cuGatherLayer::StrideArray, cuGatherLayer::gatherGeneric<float, int=32>, void*, int*, void*, cuGatherLayer::ShapeArray, int*, int*, int, cuGatherLayer::ReducedDivisorArray, int, int, int, int, cuGatherLayer::CoefficientData, cuGatherLayer::CoefficientIndices)
                 0.02%    4.1600us  1      4.1600us  4.1600us  4.1600us  [CUDA memcpy DtoH]
                 0.01%    1.6320us  1      1.6320us  1.6320us  1.6320us  [CUDA memcpy DtoD]
int8:
Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  99.84%   20.210ms  6      3.3683ms  1.6950us  20.201ms  [CUDA memcpy HtoD]
                 0.07%    13.536us  8      1.6920us  1.6000us  1.9840us  [CUDA memset]
                 0.07%    13.311us  1      13.311us  13.311us  13.311us  void cuGatherLayer::gatherAxisZeroPartition<float, int=64, int=256>(void*, cuGatherLayer::StrideArray, cuGatherLayer::gatherAxisZeroPartition<float, int=64, int=256>, void*, int*, void*, cuGatherLayer::ShapeArray, int*, int*, int, cuGatherLayer::ReducedDivisorArray, cuGatherLayer::ShapeArray, cuGatherLayer::ShapeArray, int, int, int, int, int, int, nvinfer1::rt::reduced_divisor)
                 0.02%    3.7120us  1      3.7120us  3.7120us  3.7120us  [CUDA memcpy DtoH]
                 0.01%    1.7280us  1      1.7280us  1.7280us  1.7280us  [CUDA memcpy DtoD]
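In both traces, [CUDA memcpy HtoD] accounts for more than 99.8% of GPU time, so the end-to-end numbers are dominated by input copies rather than by kernel execution. For reference, the sketch below shows one way to time only the execution phase once the inputs are already on the device (pycuda CUDA events; `context` and `bindings` are assumed to be set up beforehand):

```python
import pycuda.autoinit  # noqa: F401 -- creates a CUDA context
import pycuda.driver as cuda

# `context` is the TensorRT IExecutionContext and `bindings` holds the
# device-buffer addresses; inputs were copied host-to-device beforehand,
# so the loop below measures kernel time only.
start, end = cuda.Event(), cuda.Event()

start.record()
for _ in range(100):
    context.execute_v2(bindings)  # explicit-batch execution
end.record()
end.synchronize()

print("avg execution time: %.3f ms" % (start.time_till(end) / 100))
```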
I want to know if there is something wrong with the int8 quantization.
Thanks!
TensorRT Version: 6.0.1.5
GPU Type: V100
Nvidia Driver Version: 418.39
CUDA Version: 10.1
Operating System: Ubuntu 18.04