slower inference speed of TensorRT 10.0 on GPU Tesla T4 #3896

Closed
HSDai opened this issue May 24, 2024 · 10 comments
Assignees: zerollzeng
Labels: internal-bug-tracked, triaged (Issue has been triaged by maintainers)

Comments


HSDai commented May 24, 2024

Description

I converted NAFNet from ONNX to TensorRT on a Tesla T4 with TensorRT 10.0. However, inference is much slower than with the engine converted using TensorRT 8.6.

TensorRT 10.0:
[05/24/2024-14:43:21] [I] === Trace details ===
[05/24/2024-14:43:21] [I] Trace averages of 10 runs:
[05/24/2024-14:43:21] [I] Average on 10 runs - GPU latency: 539.803 ms - Host latency: 546.901 ms (enqueue 4.92217 ms)
[05/24/2024-14:43:21] [I]
[05/24/2024-14:43:21] [I] === Performance summary ===
[05/24/2024-14:43:21] [I] Throughput: 1.64966 qps
[05/24/2024-14:43:21] [I] Latency: min = 542.295 ms, max = 550.235 ms, mean = 546.901 ms, median = 546.891 ms, percentile(90%) = 550.032 ms, percentile(95%) = 550.235 ms, percentile(99%) = 550.235 ms
[05/24/2024-14:43:21] [I] Enqueue Time: min = 3.92992 ms, max = 5.48389 ms, mean = 4.92217 ms, median = 5.14417 ms, percentile(90%) = 5.33893 ms, percentile(95%) = 5.48389 ms, percentile(99%) = 5.48389 ms
[05/24/2024-14:43:21] [I] H2D Latency: min = 3.60913 ms, max = 4.47997 ms, mean = 3.70715 ms, median = 3.62408 ms, percentile(90%) = 3.63037 ms, percentile(95%) = 4.47997 ms, percentile(99%) = 4.47997 ms
[05/24/2024-14:43:21] [I] GPU Compute Time: min = 535.282 ms, max = 543.216 ms, mean = 539.803 ms, median = 539.882 ms, percentile(90%) = 543.027 ms, percentile(95%) = 543.216 ms, percentile(99%) = 543.216 ms
[05/24/2024-14:43:21] [I] D2H Latency: min = 3.38086 ms, max = 3.40747 ms, mean = 3.3907 ms, median = 3.38916 ms, percentile(90%) = 3.39551 ms, percentile(95%) = 3.40747 ms, percentile(99%) = 3.40747 ms
[05/24/2024-14:43:21] [I] Total Host Walltime: 6.06185 s
[05/24/2024-14:43:21] [I] Total GPU Compute Time: 5.39803 s
[05/24/2024-14:43:21] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/24/2024-14:43:21] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v100001] # ./trtexec --loadEngine=nafnetcc75_t4_float32_v10.trtmodel --shapes=input:1x1920x1920x3 --device=3

TensorRT 8.6:
[05/24/2024-14:44:43] [I] === Trace details ===
[05/24/2024-14:44:43] [I] Trace averages of 10 runs:
[05/24/2024-14:44:43] [I] Average on 10 runs - GPU latency: 143.531 ms - Host latency: 150.62 ms (enqueue 4.77478 ms)
[05/24/2024-14:44:43] [I] Average on 10 runs - GPU latency: 141.829 ms - Host latency: 148.839 ms (enqueue 5.34015 ms)
[05/24/2024-14:44:43] [I]
[05/24/2024-14:44:43] [I] === Performance summary ===
[05/24/2024-14:44:43] [I] Throughput: 6.59775 qps
[05/24/2024-14:44:43] [I] Latency: min = 147.611 ms, max = 165.985 ms, mean = 149.754 ms, median = 148.669 ms, percentile(90%) = 151.169 ms, percentile(95%) = 151.494 ms, percentile(99%) = 165.985 ms
[05/24/2024-14:44:43] [I] Enqueue Time: min = 2.2744 ms, max = 5.82202 ms, mean = 5.09928 ms, median = 5.2124 ms, percentile(90%) = 5.76062 ms, percentile(95%) = 5.77234 ms, percentile(99%) = 5.82202 ms
[05/24/2024-14:44:43] [I] H2D Latency: min = 3.60007 ms, max = 4.53885 ms, mean = 3.65205 ms, median = 3.61035 ms, percentile(90%) = 3.63367 ms, percentile(95%) = 3.63477 ms, percentile(99%) = 4.53885 ms
[05/24/2024-14:44:43] [I] GPU Compute Time: min = 140.629 ms, max = 158.058 ms, mean = 142.711 ms, median = 141.668 ms, percentile(90%) = 144.174 ms, percentile(95%) = 144.487 ms, percentile(99%) = 158.058 ms
[05/24/2024-14:44:43] [I] D2H Latency: min = 3.38074 ms, max = 3.40759 ms, mean = 3.3908 ms, median = 3.38867 ms, percentile(90%) = 3.40186 ms, percentile(95%) = 3.40405 ms, percentile(99%) = 3.40759 ms
[05/24/2024-14:44:43] [I] Total Host Walltime: 3.48604 s
[05/24/2024-14:44:43] [I] Total GPU Compute Time: 3.28235 s
[05/24/2024-14:44:43] [W] * GPU compute time is unstable, with coefficient of variance = 2.41332%.
[05/24/2024-14:44:43] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[05/24/2024-14:44:43] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/24/2024-14:44:43] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8601] # ./trtexec --loadEngine=nafnetcc75_t4_float32_v86.trtmodel --shapes=input:1x1920x1920x3 --device=3

Detailed logs:
trt10.log

trt8.6.log

Environment

(environment details attached as a screenshot)

TensorRT Version: 10.0

NVIDIA GPU: Tesla T4

NVIDIA Driver Version:

CUDA Version:

CUDNN Version:

Operating System:

Python Version (if applicable):

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Relevant Files

Model link:
onnx.zip

trt10.zip

trt86.zip

Steps To Reproduce

./trtexec --onnx=color_consistency_nafnet.onnx --saveEngine=nafnetcc75_t4_float32_v10.trtmodel --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --device=3 --minShapes=input:1x64x64x3 --optShapes=input:1x1024x1024x3 --maxShapes=input:1x1920x1920x3

./trtexec --loadEngine=nafnetcc75_t4_float32_v10.trtmodel --shapes=input:1x1920x1920x3 --device=3

Commands or scripts:

Have you tried the latest release?:

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

@lix19937

You can compare the per-layer time profiles and fusion tactics of the two engines.
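
For example (a sketch; the exported profile file name is a placeholder), the per-layer timings of each engine can be dumped with trtexec and then compared side by side:

./trtexec --loadEngine=nafnetcc75_t4_float32_v86.trtmodel --shapes=input:1x1920x1920x3 --device=3 --dumpProfile --separateProfileRun --exportProfile=profile_v86.json

Running the same command against nafnetcc75_t4_float32_v10.trtmodel and diffing the two exported profiles should show which layers (or missed fusions) account for the extra GPU time.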


HSDai commented May 28, 2024

> You can compare the per-layer time profiles and fusion tactics of the two engines.

I can't get a performance profile because trtexec reports errors when I run it with --dumpProfile.
dumpProfile.log
without dumpProfile.log

But it works fine with the engine built by TensorRT 8.6.
dumpProfile_v86.log

Could this be related to the slower inference speed? How can I find out the reason? Thank you very much.

@zerollzeng
Collaborator

Thanks, I can reproduce the issue and have filed internal bug 4672320 to track this.

zerollzeng self-assigned this May 29, 2024
zerollzeng added the triaged (Issue has been triaged by maintainers) and internal-bug-tracked labels May 29, 2024
@zerollzeng
Collaborator

You can try adding --builderOptimizationLevel=5 to work around (WAR) this; we are still working on the real fix.
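
For example (a sketch based on the build command from the original report; the output engine name is a placeholder), the TensorRT 10.0 engine can be rebuilt with the higher optimization level:

./trtexec --onnx=color_consistency_nafnet.onnx --saveEngine=nafnetcc75_t4_float32_v10_opt5.trtmodel --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --device=3 --minShapes=input:1x64x64x3 --optShapes=input:1x1024x1024x3 --maxShapes=input:1x1920x1920x3 --builderOptimizationLevel=5

Note that level 5 usually makes the engine build itself noticeably slower, since the builder searches a larger tactic space.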


HSDai commented Jun 7, 2024

> You can try adding --builderOptimizationLevel=5 to work around (WAR) this; we are still working on the real fix.

Thank you, that's helpful!

@geraldstanje

Hi, is there a profiler you can run for Triton Inference Server?


geraldstanje1 commented Jun 10, 2024 via email


HSDai commented Jun 11, 2024

> Hi, is there a profiler you can run for Triton Inference Server?

No, I haven't used Triton Inference Server before.

@nvpohanh
Collaborator

We are actively investigating this issue. Meanwhile, you can work around the regression by setting the builder optimization level to 5 in the builder config, or by adding the --builderOptimizationLevel=5 flag to the trtexec command. Thanks

@zerollzeng
Collaborator

Fixed in TensorRT 10.3; closing.
