INT8 wrong results for max batch size > 1 #3103

@AdrianoSimonetto

Description

Hi, I have an issue with an INT8 quantized model fine-tuned with QAT. The fine-tuning procedure works correctly, and converting the model to ONNX raises no issues. The problem arises when I build the engine from the ONNX file: the build finishes without errors, but inference produces correct results only with a batch size of 1.

If I increase the batch size, only the first sample in the batch produces a correct output; all the others give wrong predictions.
I run the identical C++ code for both the quantized model and a non-quantized one with FP16 precision (both are exported from the same training run). The non-quantized FP16 model works perfectly, and so does a model exported with FP32 precision.

In the figure below you can see a sample output of the INT8 and FP16 models.

[Figures: INT8 model prediction, batch_size=2; FP16 model prediction, batch_size=2]
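
The comparison described above can also be made numerically, sample by sample, which shows exactly which batch indices diverge. The following is a minimal sketch, not the code used in this report: the function name `mismatchedSamples`, the flat contiguous output layout (sample 0 first, then sample 1, and so on) and the tolerance value are all assumptions. It uses no TensorRT calls and only expects the two output buffers to have already been copied to host memory as `float`.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: returns the indices of the batch samples whose outputs
// differ between a reference run (e.g. FP16) and a candidate run (e.g. INT8)
// by more than `tolerance` in any element.
// Assumes both buffers are flat and laid out contiguously, one sample after another.
std::vector<std::size_t> mismatchedSamples(const std::vector<float>& reference,
                                           const std::vector<float>& candidate,
                                           std::size_t batchSize,
                                           float tolerance) {
    if (batchSize == 0 || reference.size() != candidate.size() ||
        reference.size() % batchSize != 0) {
        throw std::invalid_argument("buffers must have equal size divisible by batchSize");
    }
    const std::size_t perSample = reference.size() / batchSize;
    std::vector<std::size_t> bad;
    for (std::size_t b = 0; b < batchSize; ++b) {
        // Largest absolute element-wise difference within sample b.
        float maxDiff = 0.0f;
        for (std::size_t i = 0; i < perSample; ++i) {
            const std::size_t idx = b * perSample + i;
            maxDiff = std::max(maxDiff, std::fabs(reference[idx] - candidate[idx]));
        }
        if (maxDiff > tolerance) {
            bad.push_back(b);
        }
    }
    return bad;
}
```

Because quantization changes the outputs slightly even when everything works, the tolerance has to be loose enough to absorb normal INT8 error. If the reported indices are always 1 and above while index 0 passes, that matches the behaviour described here.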

• I’m using TensorRT 8.6.1.6, CUDA 11.8 and cuDNN 8.9.0.131
• I have checked the INT8 sample here: https://github.com/NVIDIA/TensorRT/blob/master/samples/sampleINT8/sampleINT8.cpp, but I did not find any mistake in my implementation.
• The quantized model is exported from the Python side, with the batch dimension set to “None” or “-1”.

Note: I also tried skipping the call to “setBindingDimensions” that sets the batch dimension just before the inference call (“executeV2”). In that case the FP16 model shows the same issue too.
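
Since the model is exported with a dynamic batch dimension, the shape bookkeeping behind this note can be sketched as follows. This is an illustration under stated assumptions, not the code from the application: the helper names `resolveBatchDim`, `volume` and `bufferBytes` are hypothetical, the batch is assumed to be dimension 0, and no TensorRT API is called. The resolved shape is what would be handed to “setBindingDimensions”, and the byte count is what each device buffer would need to hold for the full batch.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: replace a dynamic batch dimension (-1 at index 0)
// with the batch size actually used for this inference call.
std::vector<int64_t> resolveBatchDim(std::vector<int64_t> dims, int64_t batchSize) {
    if (dims.empty()) {
        throw std::invalid_argument("dims must not be empty");
    }
    if (batchSize <= 0) {
        throw std::invalid_argument("batchSize must be positive");
    }
    if (dims[0] == -1) {
        dims[0] = batchSize;
    }
    return dims;
}

// Number of elements in a fully resolved shape.
// Throws if any dimension is still dynamic (negative).
std::size_t volume(const std::vector<int64_t>& dims) {
    std::size_t n = 1;
    for (int64_t d : dims) {
        if (d < 0) {
            throw std::invalid_argument("unresolved dynamic dimension");
        }
        n *= static_cast<std::size_t>(d);
    }
    return n;
}

// Bytes needed for one binding at the given batch size.
std::size_t bufferBytes(const std::vector<int64_t>& dims,
                        int64_t batchSize,
                        std::size_t elemSize) {
    return volume(resolveBatchDim(dims, batchSize)) * elemSize;
}
```

For example, an input declared as {-1, 3, 224, 224} with a batch size of 2 and `float` elements resolves to {2, 3, 224, 224}, so the buffer has to be twice as large as for a batch size of 1. A buffer or copy sized for a single sample would leave every sample after the first without valid data.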
Have you encountered this issue before? Thank you

Labels

triaged: Issue has been triaged by maintainers