slower inference speed of TensorRT 10.0 on GPU Tesla T4 #3896

Closed
HSDai opened this issue May 24, 2024 · 10 comments
Assignees: zerollzeng
Labels: internal-bug-tracked, triaged (Issue has been triaged by maintainers)

Comments


HSDai commented May 24, 2024

Description

I converted NAFNet from ONNX to TensorRT on a Tesla T4 with TensorRT 10.0. However, inference is much slower than with the engine converted using TensorRT 8.6.

TensorRT 10.0:
[05/24/2024-14:43:21] [I] === Trace details ===
[05/24/2024-14:43:21] [I] Trace averages of 10 runs:
[05/24/2024-14:43:21] [I] Average on 10 runs - GPU latency: 539.803 ms - Host latency: 546.901 ms (enqueue 4.92217 ms)
[05/24/2024-14:43:21] [I]
[05/24/2024-14:43:21] [I] === Performance summary ===
[05/24/2024-14:43:21] [I] Throughput: 1.64966 qps
[05/24/2024-14:43:21] [I] Latency: min = 542.295 ms, max = 550.235 ms, mean = 546.901 ms, median = 546.891 ms, percentile(90%) = 550.032 ms, percentile(95%) = 550.235 ms, percentile(99%) = 550.235 ms
[05/24/2024-14:43:21] [I] Enqueue Time: min = 3.92992 ms, max = 5.48389 ms, mean = 4.92217 ms, median = 5.14417 ms, percentile(90%) = 5.33893 ms, percentile(95%) = 5.48389 ms, percentile(99%) = 5.48389 ms
[05/24/2024-14:43:21] [I] H2D Latency: min = 3.60913 ms, max = 4.47997 ms, mean = 3.70715 ms, median = 3.62408 ms, percentile(90%) = 3.63037 ms, percentile(95%) = 4.47997 ms, percentile(99%) = 4.47997 ms
[05/24/2024-14:43:21] [I] GPU Compute Time: min = 535.282 ms, max = 543.216 ms, mean = 539.803 ms, median = 539.882 ms, percentile(90%) = 543.027 ms, percentile(95%) = 543.216 ms, percentile(99%) = 543.216 ms
[05/24/2024-14:43:21] [I] D2H Latency: min = 3.38086 ms, max = 3.40747 ms, mean = 3.3907 ms, median = 3.38916 ms, percentile(90%) = 3.39551 ms, percentile(95%) = 3.40747 ms, percentile(99%) = 3.40747 ms
[05/24/2024-14:43:21] [I] Total Host Walltime: 6.06185 s
[05/24/2024-14:43:21] [I] Total GPU Compute Time: 5.39803 s
[05/24/2024-14:43:21] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/24/2024-14:43:21] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v100001] # ./trtexec --loadEngine=nafnetcc75_t4_float32_v10.trtmodel --shapes=input:1x1920x1920x3 --device=3

TensorRT 8.6:
[05/24/2024-14:44:43] [I] === Trace details ===
[05/24/2024-14:44:43] [I] Trace averages of 10 runs:
[05/24/2024-14:44:43] [I] Average on 10 runs - GPU latency: 143.531 ms - Host latency: 150.62 ms (enqueue 4.77478 ms)
[05/24/2024-14:44:43] [I] Average on 10 runs - GPU latency: 141.829 ms - Host latency: 148.839 ms (enqueue 5.34015 ms)
[05/24/2024-14:44:43] [I]
[05/24/2024-14:44:43] [I] === Performance summary ===
[05/24/2024-14:44:43] [I] Throughput: 6.59775 qps
[05/24/2024-14:44:43] [I] Latency: min = 147.611 ms, max = 165.985 ms, mean = 149.754 ms, median = 148.669 ms, percentile(90%) = 151.169 ms, percentile(95%) = 151.494 ms, percentile(99%) = 165.985 ms
[05/24/2024-14:44:43] [I] Enqueue Time: min = 2.2744 ms, max = 5.82202 ms, mean = 5.09928 ms, median = 5.2124 ms, percentile(90%) = 5.76062 ms, percentile(95%) = 5.77234 ms, percentile(99%) = 5.82202 ms
[05/24/2024-14:44:43] [I] H2D Latency: min = 3.60007 ms, max = 4.53885 ms, mean = 3.65205 ms, median = 3.61035 ms, percentile(90%) = 3.63367 ms, percentile(95%) = 3.63477 ms, percentile(99%) = 4.53885 ms
[05/24/2024-14:44:43] [I] GPU Compute Time: min = 140.629 ms, max = 158.058 ms, mean = 142.711 ms, median = 141.668 ms, percentile(90%) = 144.174 ms, percentile(95%) = 144.487 ms, percentile(99%) = 158.058 ms
[05/24/2024-14:44:43] [I] D2H Latency: min = 3.38074 ms, max = 3.40759 ms, mean = 3.3908 ms, median = 3.38867 ms, percentile(90%) = 3.40186 ms, percentile(95%) = 3.40405 ms, percentile(99%) = 3.40759 ms
[05/24/2024-14:44:43] [I] Total Host Walltime: 3.48604 s
[05/24/2024-14:44:43] [I] Total GPU Compute Time: 3.28235 s
[05/24/2024-14:44:43] [W] * GPU compute time is unstable, with coefficient of variance = 2.41332%.
[05/24/2024-14:44:43] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[05/24/2024-14:44:43] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/24/2024-14:44:43] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8601] # ./trtexec --loadEngine=nafnetcc75_t4_float32_v86.trtmodel --shapes=input:1x1920x1920x3 --device=3

Detailed logs:
trt10.log

trt8.6.log

Environment

(environment details attached as a screenshot)

TensorRT Version: 10.0

NVIDIA GPU: Tesla T4

NVIDIA Driver Version:

CUDA Version:

CUDNN Version:

Operating System:

Python Version (if applicable):

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Relevant Files

Model link:
onnx.zip

trt10.zip

trt86.zip

Steps To Reproduce

./trtexec --onnx=color_consistency_nafnet.onnx --saveEngine=nafnetcc75_t4_float32_v10.trtmodel --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --device=3 --minShapes=input:1x64x64x3 --optShapes=input:1x1024x1024x3 --maxShapes=input:1x1920x1920x3

./trtexec --loadEngine=nafnetcc75_t4_float32_v10.trtmodel --shapes=input:1x1920x1920x3 --device=3

Commands or scripts:

Have you tried the latest release?:

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

@lix19937

You can compare the per-layer time profiles and fusion tactics of the two engines.
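
For example (a sketch; the exported profile file name is a placeholder), the per-layer timings of each engine can be dumped with trtexec and then compared side by side:

./trtexec --loadEngine=nafnetcc75_t4_float32_v86.trtmodel --shapes=input:1x1920x1920x3 --device=3 --dumpProfile --separateProfileRun --exportProfile=profile_v86.json

Running the same command against nafnetcc75_t4_float32_v10.trtmodel and diffing the two exported profiles should show which layers (or missed fusions) account for the extra GPU time.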


HSDai commented May 28, 2024

> You can compare the per-layer time profiles and fusion tactics of the two engines.

I can't get a performance profile because trtexec reports errors when I run it with --dumpProfile.
dumpProfile.log
without dumpProfile.log

But it works fine with the engine built by TensorRT 8.6.
dumpProfile_v86.log

Could this be related to the slower inference speed? How can I find out the reason? Thank you very much.

@zerollzeng
Collaborator

Thanks, I can reproduce the issue and have filed internal bug 4672320 to track this.

zerollzeng self-assigned this May 29, 2024
zerollzeng added the triaged (Issue has been triaged by maintainers) and internal-bug-tracked labels May 29, 2024
@zerollzeng
Collaborator

You can try adding --builderOptimizationLevel=5 to work around (WAR) this; we are still working on the real fix.
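
For example (a sketch based on the build command from the original report; the output engine name is a placeholder), the TensorRT 10.0 engine can be rebuilt with the higher optimization level:

./trtexec --onnx=color_consistency_nafnet.onnx --saveEngine=nafnetcc75_t4_float32_v10_opt5.trtmodel --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --device=3 --minShapes=input:1x64x64x3 --optShapes=input:1x1024x1024x3 --maxShapes=input:1x1920x1920x3 --builderOptimizationLevel=5

Note that level 5 usually makes the engine build itself noticeably slower, since the builder searches a larger tactic space.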


HSDai commented Jun 7, 2024

> You can try adding --builderOptimizationLevel=5 to work around (WAR) this; we are still working on the real fix.

Thank you, that's helpful!

@geraldstanje

Hi, is there a profiler you can run for Triton Inference Server?


geraldstanje1 commented Jun 10, 2024 via email


HSDai commented Jun 11, 2024

> Hi, is there a profiler you can run for Triton Inference Server?

No, I haven't used Triton Inference Server before.

@nvpohanh
Collaborator

We are actively investigating this issue. Meanwhile, you can work around the regression by setting the builder optimization level to 5 in the builder config, or by adding the --builderOptimizationLevel=5 flag to the trtexec command. Thanks

@zerollzeng
Collaborator

Fixed in TensorRT 10.3; closing.
