FP16 Accuracy failure of TensorRT 8.6.3 when running trtexec built engine on GPU RTX4090 #3893
Comments
Hi, thank you for the detailed information.
Can you please let me know what you mean by this? Was there an older version of TRT with which you got reasonable FP16 results, but not with the version you are using now? FWIW, I've tried running your model with polygraphy; will need to investigate further.
Thanks for the quick response.
To clarify, we are using the same version of TRT, the one provided in the nvcr.io/nvidia/tensorrt:24.02-py3 container, both before and now. Perhaps this information is more confusing than enlightening, so it may be better to focus on the reproducible error we have now: the discrepancy between FP32 and FP16. How do we go about isolating this problem further? This is how we built the engine:
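For reference, a typical FP16 trtexec build for this ONNX file might look like the sketch below; the exact flags and output file name here are assumptions, not the command actually used in this report.

```sh
# Hypothetical sketch only -- not necessarily the reporters' actual command.
# Build an FP16 engine from the authors' ONNX file with trtexec and capture the build log.
trtexec --onnx=l2_encoder.onnx \
        --fp16 \
        --saveEngine=l2_encoder_fp16.engine \
        --verbose 2>&1 | tee l2_fp16.log
```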
polygraphy run l2_encoder.onnx --trt --onnxrt --fp16 \
    --trt-outputs mark all \
    --onnx-outputs mark all
to see which layer diverges first; it may be an FP16 overflow.
Here is the polygraphy log. There were other processes running on the GPU in parallel, which explains the long inference latencies. Below is a snippet that includes the pass rate and a comparison of the network outputs.
Regarding the tolerance threshold: is it too strict? All layers fail. What conclusion can we draw from this?
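Not from the thread, but for reference: Polygraphy lets you set the comparison tolerances explicitly instead of relying on the defaults. The values below are only illustrative of tolerances commonly tried for FP16, not values suggested in this issue.

```sh
# Illustrative only: compare the TRT FP16 engine against ONNX Runtime with looser tolerances.
# 1e-2 is an assumed example value, not a recommendation from this thread.
polygraphy run l2_encoder.onnx --trt --onnxrt --fp16 \
    --atol 1e-2 --rtol 1e-2
```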
I think Polygraphy is mainly meant for accuracy debugging and not latency debugging. Here are some insights from a colleague who specializes in quantization:
What log file are you referring to? If it is polygraphy.log in 3893#issuecomment-2131240255, then the --fp16 flag is set (first row).
I agree, sorry for the confusing statement. Here are the results for the auto-reduce model and for debug precision: I think I found the smallest failing subgraph, as seen in the attached text, and debug precision failed. Having analyzed the network graph, what is the next step? Is the solution likely to be setting precision constraints on the failing nodes? Are there other options that would still let us run the model in FP16?
Auto-reduce model
Debug precision
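On the precision-constraint question, one possible approach, sketched here only as an assumption, is to keep the engine in FP16 overall but pin the failing layers to FP32 via trtexec's layer-precision flags. The layer name below is a placeholder, not one of the actual failing nodes.

```sh
# Sketch: pin a hypothetical failing layer to FP32 while the rest of the engine stays FP16.
# "some/failing/LayerName" is a placeholder; substitute the layers flagged by Polygraphy.
trtexec --onnx=l2_encoder.onnx --fp16 \
        --precisionConstraints=obey \
        --layerPrecisions="some/failing/LayerName":fp32 \
        --layerOutputTypes="some/failing/LayerName":fp32 \
        --saveEngine=l2_encoder_mixed.engine
```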
This will go away if you use golden inputs/outputs produced separately, rather than passing
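A minimal sketch of that separate golden-data workflow, with arbitrary file names: generate reference inputs/outputs once with ONNX Runtime, then compare the TRT FP16 engine against them.

```sh
# Sketch: produce golden inputs/outputs with ONNX Runtime, then validate the FP16 engine against them.
# inputs.json / golden.json are arbitrary names chosen for this example.
polygraphy run l2_encoder.onnx --onnxrt \
    --save-inputs inputs.json --save-outputs golden.json

polygraphy run l2_encoder.onnx --trt --fp16 \
    --load-inputs inputs.json --load-outputs golden.json
```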
Description
We are trying to recreate the results from: https://arxiv.org/abs/2402.05008.
Using the .onnx file provided by the authors to compile engines, we find that the FP16 model has 0% accuracy while the FP32 model has the expected accuracy. This previously worked for us, but compiling the model in FP16 no longer gives usable results.
The measurements are mIoU scores for different-sized objects.
L2 - FP16
{"all": 0.0, "large": 0.0, "medium": 0.0, "small": 0.0}
L2 - FP32
{"all": 79.12385607181146, "large": 83.05853600575689, "medium": 81.50597370444349, "small": 74.8830670481846}
Environment
Baremetal or Container (if so, version): Container, nvcr.io/nvidia/tensorrt:24.02-py3
Relevant Files
engine inspection:
l2__fp16_inspect.txt
l2_fp32_inspect.txt
build log:
l2_fp32.log
l2_fp16.log
onnx link: https://drive.google.com/drive/folders/1Yt8xDfdkmL6W-IO-KhUhR_J-_2ion-v5?usp=sharing