[Performance] Regression observed when using CUDA execution provider #20712
Comments
Which cuDNN version are you using?
@gedoensmax I am using cuDNN 8.7.0.84. I tried to use cuDNN 9 with onnxruntime-gpu 1.17.1, but it still finds cuDNN 8.
Hi team, I was wondering if there is any update on this issue?
Hello, do you have any idea about the performance degradation? I have tested the performance of onnxruntime 1.17; its performance is even worse than torch 2.0.1.
@tianleiwu, can you help out with this? My initial guess was that there might be regressions due to cuDNN shipping fewer kernels, but it looks like the cuDNN version was the same across the different versions.
@krishung5 I would recommend trying a CUDA graph; that might help reduce the execution time for such small networks.
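A minimal sketch of what that could look like, assuming the resnet50-v1-12 model from the ONNX model zoo (the model path, input name "data", output name "resnetv17_dense0_fwd", and shapes are assumptions, not taken from this thread). CUDA graph replay needs stable device buffers, hence the IO binding:

```python
# Sketch: enabling CUDA graph capture in the CUDA execution provider.
# Model path, input/output names, and shapes are assumptions.
import numpy as np
import onnxruntime as ort

providers = [("CUDAExecutionProvider", {"enable_cuda_graph": "1"})]
sess = ort.InferenceSession("resnet50-v1-12.onnx", providers=providers)

# CUDA graph replay requires the same device buffers on every run,
# so inputs/outputs are pre-allocated on GPU and bound via IOBinding.
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
x_ort = ort.OrtValue.ortvalue_from_numpy(x, "cuda", 0)
y_ort = ort.OrtValue.ortvalue_from_shape_and_type((1, 1000), np.float32, "cuda", 0)

binding = sess.io_binding()
binding.bind_ortvalue_input("data", x_ort)
binding.bind_ortvalue_output("resnetv17_dense0_fwd", y_ort)

sess.run_with_iobinding(binding)  # first run captures the graph
sess.run_with_iobinding(binding)  # later runs replay the captured graph
```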
@gedoensmax One thing I am confused about: if I install onnxruntime via pip install onnxruntime-gpu==1.17, will the package be the optimal one (i.e., will it match the CUDA 11.8 install on my machine and the corresponding cuBLAS/cuDNN libraries)? Can you explain that? Thanks a lot!
The default 1.17 package ships with CUDA 11. To install onnxruntime with CUDA 12, there is a separate package: https://onnxruntime.ai/docs/install/#install-onnx-runtime-gpu-cuda-11x
OK, thank you very much. Can you please also take a look at this issue about dynamic quantization? There are some problems with dynamically quantizing the vicuna-7b model from fp16 to int8.
Hi @pranavsharma, just wanted to follow up and see if we have any update on this, thank you! |
I reproduced the issue with https://github.com/onnx/models/blob/main/validated/vision/classification/resnet/model/resnet50-v1-12.onnx on an A100. The average latency (ms) output:
The root cause seems to be the change of the default value of cudnn_conv_use_max_workspace from 0 to 1. The solution is to set the value to 0 for ResNet:
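For example, a sketch assuming the resnet50-v1-12 model referenced above (the provider option value is passed as a string):

```python
# Sketch: turning cudnn_conv_use_max_workspace back off via CUDA EP options.
import onnxruntime as ort

providers = [
    ("CUDAExecutionProvider", {"cudnn_conv_use_max_workspace": "0"}),
    "CPUExecutionProvider",
]
sess = ort.InferenceSession("resnet50-v1-12.onnx", providers=providers)
```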
For debugging, setting an environment variable to limit the cuDNN workspace (in MiB) could help:
@gedoensmax, do you know why a larger workspace causes a performance drop in some convolution networks (we've enabled conv algo tuning by default)?
@tianleiwu I just saw that conv algo tuning is now set to exhaustive search. This should guarantee the best possible perf, but usually using the heuristics is sufficient.
The Nsight trace files:
@gedoensmax I think using a CUDA graph indeed helps with the performance. I wasn't able to run the model used by the RIVA team due to the issue, but with the resnet model, I'm seeing an approximate improvement of 19.18% in average latency.
ONNX Runtime 1.18 with CUDA Graph:
ONNX Runtime 1.18 without CUDA Graph:
Describe the issue
We are seeing a regression when using onnxruntime with the CUDA execution provider, starting from version 1.14.1; earlier versions do not show it. The regression persists in subsequent versions, including the latest, 1.17.1.
To reproduce
pip install onnxruntime-gpu for different versions, and run the script below
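The original script is not included here; a minimal latency-benchmark sketch, with placeholder model path, input name, shape, and iteration counts, might look like:

```python
# Minimal latency benchmark sketch. "model.onnx", the input name "data",
# the input shape, and the iteration counts are placeholders, not the
# original reproduction script.
import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
x = np.random.rand(1, 3, 224, 224).astype(np.float32)

for _ in range(10):             # warm-up runs
    sess.run(None, {"data": x})

n = 100
start = time.perf_counter()
for _ in range(n):
    sess.run(None, {"data": x})
print(f"average latency: {(time.perf_counter() - start) / n * 1000:.2f} ms")
```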
Sharing some numbers:
onnxruntime-gpu == 1.13.1
onnxruntime-gpu == 1.14.1
onnxruntime-gpu == 1.17.1
The latency increase from version 1.13.1 to version 1.14.1 is approximately 14.29%.
The latency increase from version 1.14.1 to version 1.17.1 is approximately 4.18%.
Note that when using the CPU execution provider, there is no regression. It only occurs when using the CUDA execution provider.
For a simpler reproduction, I’m using a resnet50 model. We observe a latency increase of more than 20x with our model:
onnxruntime-gpu == 1.13.1
onnxruntime-gpu == 1.16.3
The regression can be observed using the C++ code as well.
Urgency
High
Platform
Linux
OS Version
22.04
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.17.1
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 12.2 and 11.8
Model File
No response
Is this a quantized model?
No