
Performance improvements by calling cuDNN API #321

Merged
merged 4 commits into JuliaGPU:master from gartangh:conv_bias_act on Sep 21, 2020

Conversation

gartangh
Contributor

  • Selects the fastest convolutional algorithm
  • Faster activations, if supported
  • Faster softmax
  • Merges convolution, bias addition, and activation function if possible (see the sketch below)

Depends on:
FluxML/NNlib.jl#228
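For reference, a rough sketch of what the fused path computes; plain CPU arrays are used here so it runs anywhere, and the conv_bias_act name and signature are placeholders for illustration, not the final NNlib API:

```julia
using NNlib  # conv, DenseConvDims, relu

# Unfused reference path: separate convolution, bias broadcast, and activation kernels.
x = rand(Float32, 224, 224, 3, 1)   # input  (W, H, C_in, N)
w = rand(Float32, 3, 3, 3, 64)      # kernel (W, H, C_in, C_out)
b = rand(Float32, 1, 1, 64, 1)      # per-channel bias, broadcastable over the output
cdims = DenseConvDims(x, w)

y_ref = relu.(conv(x, w, cdims) .+ b)

# Fused path (placeholder signature): on the GPU, a single call to
# cudnnConvolutionBiasActivationForward can compute the same result.
# y = conv_bias_act(x, w, cdims, b, relu)
```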

@gartangh added the cuda libraries (Stuff about CUDA library wrappers.) and performance (How fast can we go?) labels on Jul 26, 2020
@gartangh requested a review from @maleadt on Jul 26, 2020
@gartangh self-assigned this on Jul 26, 2020
Review thread on lib/cudnn/conv.jl (outdated, resolved)
@maleadt
Member

maleadt commented Jul 26, 2020

cc @denizyuret
Also see #318; it looks like the Find* methods would be even better, but this is a good start.

@denizyuret
Contributor

  1. Can we also do algo selection for RNNs?
  2. And can we make the fast modes always the default?
  3. To use the Find* methods, I have a bunch of design issues listed in Performance: cudnn algorithm selection #318; I am not happy with my Knet implementation. If you have suggestions, I'd be happy to implement them.

@gartangh linked an issue (Performance: cudnn algorithm selection #318) on Jul 27, 2020 that may be closed by this pull request
@gartangh
Contributor Author

@maleadt @denizyuret
I just implemented wrappers for:

  • cudnnGetConvolutionForwardAlgorithm_v7
  • cudnnFindConvolutionForwardAlgorithm
  • cudnnFindConvolutionForwardAlgorithmEx
  • cudnnGetConvolutionForwardAlgorithmMaxCount
  1. In cuDNN 8, the following functions will be deprecated:

    • cudnnFindRNNForwardInferenceAlgorithmEx
    • cudnnFindRNNForwardTrainingAlgorithmEx
    • cudnnFindRNNBackwardDataAlgorithmEx
    • cudnnFindRNNBackwardWeightsAlgorithmEx

    so I do not think it is worth the effort to implement them at the moment.

  2. I am actually in favor of using accurate algorithms over fast ones by default.

  3. Any suggestions on how to cache the algorithm for certain input sizes? And how to limit memory usage? I am currently proposing cudnnGetConvolutionForwardAlgorithm_v7 until we have a way of caching, as cudnnGetConvolutionForwardAlgorithm will be removed in cuDNN 8. This can, however, still select algorithms that use lots of memory.

@denizyuret
Contributor

@maleadt @denizyuret
I just implemented wrappers for:

  • cudnnGetConvolutionForwardAlgorithm_v7
  • cudnnFindConvolutionForwardAlgorithm
  • cudnnFindConvolutionForwardAlgorithmEx
  • cudnnGetConvolutionForwardAlgorithmMaxCount

Great!

  1. In cuDNN 8, the following functions will be deprecated:

    • cudnnFindRNNForwardInferenceAlgorithmEx
    • cudnnFindRNNForwardTrainingAlgorithmEx
    • cudnnFindRNNBackwardDataAlgorithmEx
    • cudnnFindRNNBackwardWeightsAlgorithmEx

    so I do not think it is worth the effort to implement them at the moment.

Good catch, I agree.

  2. I am actually in favor of using accurate algorithms over fast ones by default.

What is your reasoning here? Making accurate algorithms the default may keep us behind in benchmarks and discourage people from adopting Julia as their deep learning language of choice. Accuracy is almost never important in deep learning (I have worked with 8-bit floats with little degradation in performance), and low accuracy sometimes helps as a regularizer and improves generalization.

  3. Any suggestions on how to cache the algorithm for certain input sizes? And how to limit memory usage? I am currently proposing cudnnGetConvolutionForwardAlgorithm_v7 until we have a way of caching, as cudnnGetConvolutionForwardAlgorithm will be removed in cuDNN 8. This can, however, still select algorithms that use lots of memory.

As mentioned in #318, my experience with Knet is:

  • "Get" functions did not have very good heuristics and I could not beat the benchmarks with them. Things may have improved since then; this was a couple of years ago.
  • "Find" functions are better; they try each possibility, after all. However, as you mentioned, they require caching and workspace memory.
  • cudnnFindConvolutionForwardAlgorithm() does the required memory allocation for its experiments on its own, which makes things simpler. But if the CUDA pool does not leave enough memory for an external cudaMalloc, you risk running out of memory and getting CUDNN_STATUS_ALLOC_FAILED.
  • Thus my current implementation uses cudnnFindConvolutionForwardAlgorithmEx(), which lets you allocate the workspace memory yourself (which, when you allocate it with CuArrays, avoids the out-of-memory problem). However, this makes things a bit complicated, because you need to figure out how much workspace memory to allocate for a given array size to try all the algorithms you want.
  • I cache results in global dictionaries (input sizes -> algo). This is probably not good for multi-threading, multi-GPU, etc., and needs to be fixed. I cache a maximum number of keys and just return the default algo (0) if more requests keep coming.

The code is in https://github.com/denizyuret/Knet.jl/blob/master/src/conv.jl
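For illustration, a minimal sketch of that FindEx pattern; find_conv_fwd_algos is a hypothetical stand-in for a wrapper around cudnnFindConvolutionForwardAlgorithmEx, and the result fields mirror cudnnConvolutionFwdAlgoPerf_t:

```julia
using CUDA

function pick_fwd_algo(x::CuArray, w::CuArray, y::CuArray, cdims)
    # Allocate the workspace through the CUDA.jl pool ourselves, sized from the
    # input; a real heuristic would also cap it at a fraction of free device memory.
    workspace = CuArray{UInt8}(undef, 2 * sizeof(x))
    # Hypothetical wrapper around cudnnFindConvolutionForwardAlgorithmEx:
    perfs = find_conv_fwd_algos(x, w, y, cdims, workspace)
    # cuDNN returns the results sorted by execution time; take the first one
    # that succeeded (status 0 == CUDNN_STATUS_SUCCESS).
    return first(p.algo for p in perfs if p.status == 0)
end
```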

@gartangh
Contributor Author

@denizyuret
I also added wrappers for these functions, as they are fairly similar to the ones I implemented before:

  • cudnnFindConvolutionBackwardFilterAlgorithm
  • cudnnFindConvolutionBackwardFilterAlgorithmEx
  • cudnnGetConvolutionBackwardFilterAlgorithm
  • cudnnGetConvolutionBackwardFilterAlgorithm_v7
  • cudnnGetConvolutionBackwardFilterAlgorithmMaxCount
  • cudnnFindConvolutionBackwardDataAlgorithm
  • cudnnFindConvolutionBackwardDataAlgorithmEx
  • cudnnGetConvolutionBackwardDataAlgorithm
  • cudnnGetConvolutionBackwardDataAlgorithm_v7
  • cudnnGetConvolutionBackwardDataAlgorithmMaxCount

@gartangh
Contributor Author

@denizyuret

What is your reasoning here? Making accurate algorithms the default may keep us behind in benchmarks and discourage people from adopting Julia as their deep learning language of choice. Accuracy is almost never important in deep learning (I have worked with 8-bit floats with little degradation in performance), and low accuracy sometimes helps as a regularizer and improves generalization.

I feel like Julia and its packages always favor accuracy over performance. So for the sake of consistency, we could do the same here. I completely agree with your reasoning that accuracy is not that important for deep learning, so I would be happy to use the most performant code by default, but that is not a decision I want to make by myself.

@denizyuret
Contributor

@denizyuret

What is your reasoning here? Making accurate algorithms the default may keep us behind in benchmarks and discourage people from adopting Julia as their deep learning language of choice. Accuracy is almost never important in deep learning (I have worked with 8-bit floats with little degradation in performance), and low accuracy sometimes helps as a regularizer and improves generalization.

I feel like Julia and its packages always favor accuracy over performance. So for the sake of consistency, we could do the same here. I completely agree with your reasoning that accuracy is not that important for deep learning, so I would be happy to use the most performant code by default, but that is not a decision I want to make by myself.

@ViralBShah what do you think? Can we break with tradition and make the most performant variants of cuDNN algorithms the default for the sake of deep learning packages? We can add keyword options to choose the more precise algorithms and document them, but I think it is important (esp. for new users) that the defaults work fast.

@gartangh
Contributor Author

@maleadt @denizyuret

As mentioned in #318, my experience with Knet is:
  • "Get" functions did not have very good heuristics and I could not beat the benchmarks with them. Things may have improved since then; this was a couple of years ago.
  • "Find" functions are better; they try each possibility, after all. However, as you mentioned, they require caching and workspace memory.
  • cudnnFindConvolutionForwardAlgorithm() does the required memory allocation for its experiments on its own, which makes things simpler. But if the CUDA pool does not leave enough memory for an external cudaMalloc, you risk running out of memory and getting CUDNN_STATUS_ALLOC_FAILED.
  • Thus my current implementation uses cudnnFindConvolutionForwardAlgorithmEx(), which lets you allocate the workspace memory yourself (which, when you allocate it with CuArrays, avoids the out-of-memory problem). However, this makes things a bit complicated, because you need to figure out how much workspace memory to allocate for a given array size to try all the algorithms you want.
  • I cache results in global dictionaries (input sizes -> algo). This is probably not good for multi-threading, multi-GPU, etc., and needs to be fixed. I cache a maximum number of keys and just return the default algo (0) if more requests keep coming.
The code is in https://github.com/denizyuret/Knet.jl/blob/master/src/conv.jl

I just added a way of caching the algorithm returned from cudnnFindConvolutionForwardAlgorithmEx, similar to how you did it in Knet.jl. I opted for a dictionary with limited capacity and Least Recently Used (LRU) replacement. I would love to have some feedback on this. (Tomorrow, I will run some benchmarks and profile a Deep Neural Network to measure the differences.)
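For illustration, a toy version of that capacity-limited LRU idea (this is not the actual CircularDict code in the PR, and find_algo in the usage comment is hypothetical):

```julia
struct LRUCache{K,V}
    capacity::Int
    data::Dict{K,V}
    order::Vector{K}   # least recently used first
end
LRUCache{K,V}(capacity::Int) where {K,V} = LRUCache{K,V}(capacity, Dict{K,V}(), K[])

function Base.get!(f, c::LRUCache{K,V}, key::K) where {K,V}
    if haskey(c.data, key)
        # Cache hit: drop the key from its old position; it is re-appended below.
        deleteat!(c.order, findfirst(==(key), c.order))
    else
        # Cache miss: evict the least recently used entry if we are at capacity.
        length(c.data) >= c.capacity && delete!(c.data, popfirst!(c.order))
        c.data[key] = f()
    end
    push!(c.order, key)   # mark as most recently used
    return c.data[key]
end

# Usage sketch: memoize the algorithm chosen for a given problem size.
# conv_forward = LRUCache{Tuple,Int32}(100)
# algo = get!(() -> find_algo(x, w, cdims), conv_forward, (Float32, size(x), cdims))
```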

@denizyuret
Contributor

@maleadt @denizyuret

As mentioned in #318, my experience with Knet is:
  • "Get" functions did not have very good heuristics and I could not beat the benchmarks with them. Things may have improved since then; this was a couple of years ago.
  • "Find" functions are better; they try each possibility, after all. However, as you mentioned, they require caching and workspace memory.
  • cudnnFindConvolutionForwardAlgorithm() does the required memory allocation for its experiments on its own, which makes things simpler. But if the CUDA pool does not leave enough memory for an external cudaMalloc, you risk running out of memory and getting CUDNN_STATUS_ALLOC_FAILED.
  • Thus my current implementation uses cudnnFindConvolutionForwardAlgorithmEx(), which lets you allocate the workspace memory yourself (which, when you allocate it with CuArrays, avoids the out-of-memory problem). However, this makes things a bit complicated, because you need to figure out how much workspace memory to allocate for a given array size to try all the algorithms you want.
  • I cache results in global dictionaries (input sizes -> algo). This is probably not good for multi-threading, multi-GPU, etc., and needs to be fixed. I cache a maximum number of keys and just return the default algo (0) if more requests keep coming.
The code is in https://github.com/denizyuret/Knet.jl/blob/master/src/conv.jl

I just added a way of caching the algorithm returned from cudnnFindConvolutionForwardAlgorithmEx, similar to how you did it in Knet.jl. I opted for a dictionary with limited capacity and Least Recently Used (LRU) replacement. I would love to have some feedback on this. (Tomorrow, I will run some benchmarks and profile a Deep Neural Network to measure the differences.)

That's great. About the LRU: my purpose in limiting the number was not to conserve memory: the entries do not take that much space. It was more about not wasting too much time with Find calls if the array sizes keep changing, which LRU will not limit. Although in practice this probably never happens so it doesn't matter that much as long as we have a cache size that can handle training common deep models. And I can see the logic of LRU if the user switches between different models during the same session.

@@ -67,6 +94,7 @@ fix1d(cdims::DenseConvDims{1,K,C_in,C_out,S,P,D,F}) where {K,C_in,C_out,S,P,D,F}
fix1d(pdims::PoolDims{1,K,S,P,D}) where {K,S,P,D,F} =
PoolDims{2,(K...,1),(S...,1),(P...,0,0),(D...,1)}((pdims.I..., 1), pdims.C_in)

conv_forward = CircularDict{Tuple, Int32}(100)
maleadt (Member)


Why only cache a limited amount? If you key with not much uniqueness, only (T, size, DenseConvDims), we should have a high enough hit rate? The entries are small, too, and don't keep anything alive.

gartangh (Contributor Author)


Cf. my comment here: #321 (comment)

Three review threads on lib/cudnn/nnlib.jl (outdated, resolved)
@gartangh
Contributor Author

@denizyuret

That's great. About the LRU: my purpose in limiting the number was not to conserve memory: the entries do not take that much space. It was more about not wasting too much time with Find calls if the array sizes keep changing, which LRU will not limit. Although in practice this probably never happens so it doesn't matter that much as long as we have a cache size that can handle training common deep models. And I can see the logic of LRU if the user switches between different models during the same session.

@maleadt

Why only cache a limited amount? If you key with not much uniqueness, only (T, size, DenseConvDims), we should have a high enough hit rate? The entries are small, too, and don't keep anything alive.

I was not really aiming to conserve memory, since (like you say) the entries do not take that much space.
I was really thinking about "what if we switch between large models?".
I think the conclusion from your comments is that we can use a regular dictionary without a capacity limit.
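Something like the following unbounded version, keyed the way @maleadt suggests (find_fwd_algo is a hypothetical helper that runs the cuDNN search once):

```julia
using NNlib  # DenseConvDims

const FWD_ALGO_CACHE = Dict{Tuple{DataType,Dims,DenseConvDims},Int32}()

function cached_fwd_algo(x, w, cdims::DenseConvDims)
    # One cache entry per (element type, input size, convolution configuration).
    get!(FWD_ALGO_CACHE, (eltype(x), size(x), cdims)) do
        find_fwd_algo(x, w, cdims)   # hypothetical: run the cuDNN Find* search once
    end
end
```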

@gartangh
Contributor Author

gartangh commented Jul 29, 2020

So I just did the measurements I promised (all measurements are in milliseconds):

Network      Input size        #master   #conv_bias_act   #experimentation
VGG19        (224,224,3,64)    263       160              154
Darknet-19   (224,224,3,128)   120       90               84
ResNet-50    (224,224,3,64)    253       285              197
Darknet-53   (256,256,3,64)    194       164              148

The reason that ResNet-50 is slower is that there are lots of memory issues. I believe that if we can come up with a better heuristic for the workspace size, we can speed up all these networks.

@denizyuret
Contributor

The reason that ResNet-50 is slower is that there are lots of memory issues. I believe that if we can come up with a better heuristic for the workspace size, we can speed up all these networks.

These look very good. Can you give a bit more detail about ResNet-50? Is it the workspace size used during algorithm discovery, or the one used during training/inference, that you think is the problem? Is it running close to the GPU memory limit? About workspace-size heuristics, I use two:

(1) Limit the maximum workspace size during discovery to a reasonable percentage of available memory and/or a reasonable multiple of the input array size: https://github.com/denizyuret/Knet.jl/blob/12430600ea0782cb16cd1aa5887796a5f0d4359f/src/conv.jl#L538

(2) When you are picking an algorithm by looking at the Find results, prefer ones with lower memory, up to a 10% speed penalty: https://github.com/denizyuret/Knet.jl/blob/12430600ea0782cb16cd1aa5887796a5f0d4359f/src/conv.jl#L617
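For illustration, a rough sketch of those two heuristics (the cutoffs are just the examples mentioned above, the result fields mirror cudnnConvolutionFwdAlgoPerf_t, and the free-memory query assumes CUDA.Mem.info()):

```julia
using CUDA

# (1) Cap the workspace size used during algorithm discovery.
function workspace_limit(x::CuArray)
    free_mem, _total = CUDA.Mem.info()         # free and total device memory in bytes
    return min(free_mem ÷ 2, 10 * sizeof(x))   # e.g. half of free memory, 10x the input
end

# (2) Among the benchmarked algorithms, trade at most 10% speed for lower memory use.
function pick_algo(perfs)
    best_time = minimum(p.time for p in perfs)
    candidates = [p for p in perfs if p.time <= 1.1 * best_time]
    return candidates[argmin([p.memory for p in candidates])].algo
end
```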

@gartangh
Contributor Author

gartangh commented Jul 30, 2020

@denizyuret, I was already using your first heuristic but overlooked the second.
I also found some speedups for my implementations of activation functions on the GPU.
ResNet-50 now uses 25% less memory and clocks in below 200 ms.

I have updated my previous comment with new measurements from the experimentation branch.

@denizyuret
Contributor

@gartangh, please be careful when adding bias, activation functions, etc.: I have observed inferior performance relative to handwritten kernels in previous versions of cuDNN, so I would not make these the default without benchmarking.

@gartangh
Contributor Author

@denizyuret, as per your request:

[Two benchmark charts: "Activations" and "Activations Speedup"]

Here, "before" is the master branch and "after" is the conv_bias_act branch.
These cuDNN kernels seem to outperform Julia broadcast kernels by quite some margin.

@gartangh force-pushed the conv_bias_act branch 2 times, most recently from 9558736 to d7cf913 on July 30, 2020
@maleadt
Member

maleadt commented Jul 30, 2020

These cuDNN kernels seem to outperform Julia broadcast kernels by quite some margin.

That's with all the recent broadcast optimizations, I take it?

@gartangh force-pushed the conv_bias_act branch 2 times, most recently from 6dcdf63 to 137ff45 on July 30, 2020
@denizyuret
Contributor

Something is not quite right with cudnnAddTensor

@denizyuret, I have made some adjustments to the code in 233ebe6.
This should make sure that cudnnAddTensor is only called when the inputs have the right dimensions.
For all other cases, Julia broadcasting is used as before.

Interestingly, the cuDNN documentation suggests all Julia-broadcastable arrays should work: "Each dimension of the bias tensor A must match the corresponding dimension of the destination tensor C or must be equal to 1." We should file a bug report :)
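A hedged sketch of the guard described in that commit, assuming the shape that works in practice is the per-channel bias (1, 1, C, 1); cudnn_add_bias! is a placeholder for the low-level wrapper:

```julia
using CUDA

function add_bias!(y::CuArray{T,4}, b::CuArray{T,4}) where {T}
    if size(b) == (1, 1, size(y, 3), 1)
        cudnn_add_bias!(y, b)   # placeholder for the cudnnAddTensor path
    else
        y .+= b                 # generic Julia broadcast fallback
    end
    return y
end
```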

@gartangh
Contributor Author

gartangh commented Aug 4, 2020

Interestingly, the cuDNN documentation suggests all Julia-broadcastable arrays should work: "Each dimension of the bias tensor A must match the corresponding dimension of the destination tensor C or must be equal to 1." We should file a bug report :)

@denizyuret, here we go: https://developer.nvidia.com/nvidia_bug/3084210.

@JeffBezanson

I feel like Julia and its packages always favor accuracy over performance.

No, there is no such rule or "tradition". Packages should do whatever's best for them, and authors should make independent decisions about what is best.

@DhairyaLGandhi
Member

This is indeed very promising, thank you!

I'm still interested in diving deeper into ResNet; is there a profile of the call graph with this branch?

Also, it seems that all activations are now constant time; can we verify this against cuDNN implementations elsewhere?

@DhairyaLGandhi
Member

Also, are there updates to the numbers posted in #321 (comment)?

@gartangh
Contributor Author

gartangh commented Aug 6, 2020

No, there is no such rule or "tradition". Packages should do whatever's best for them, and authors should make independent decisions about what is best.

@JeffBezanson, I am selecting the fast algorithms over the accurate ones by default now.

Also, are there updates to the numbers posted in #321 (comment)?

@DhairyaLGandhi, here you go (I made some changes to my benchmark setup, so the numbers are not exactly the same as in #321 (comment)). All numbers are still in milliseconds, and this was tested on an NVIDIA V100 16GB.

I'm still interested in diving deeper into ResNet; is there a profile of the call graph with this branch?

@DhairyaLGandhi, the problem with ResNet-50 is gone after the changes to my benchmark setup.

Network      Input size        CuArrays.jl #master (JuliaGPU/CuArrays.jl@6ebe463)   CUDA.jl #master (8c61a05)   CUDA.jl #conv_bias_act (gartangh@a1e83ee)
VGG19        (224,224,3,64)    308                                                  267                         127
Darknet-19   (224,224,3,128)   155                                                  114                         67
ResNet-50    (224,224,3,64)    412                                                  193                         83
Darknet-53   (256,256,3,64)    238                                                  187                         95

Network      Input size        CuArrays.jl #grouped_convolutions (gartangh/CuArrays.jl@99ed388)   CUDA.jl #grouped_conv (gartangh@feb3572)
ResNeXt-50   (224,224,3,32)    245                                                                45

@denizyuret
Contributor

I'd like to run some benchmarks but could not find the right combination of CUDA, GPUArrays, NNlib, etc. that works with your branch (checking out the masters of all the others did not work). @gartangh, can you advise?

@gartangh
Contributor Author

gartangh commented Aug 6, 2020

@denizyuret

  • CUDA v1.2.0 (#conv_bias_act from https://github.com/gartangh/CUDA.jl)
  • Flux v0.11.0 (#conv_bias_act from https://github.com/gartangh/Flux.jl)
  • GPUArrays v4.0.1 (#56bc004 from https://github.com/JuliaGPU/GPUArrays.jl)
  • Metalhead v0.5.0 (#1d9db5e from https://github.com/FluxML/Metalhead.jl)
  • NNlib v0.7.3 (#conv_bias_act from https://github.com/gartangh/NNlib.jl)
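For anyone reproducing this, one way to get that combination from the Pkg API (a sketch; the branch and commit specs are the ones listed above):

```julia
using Pkg
Pkg.add(PackageSpec(url="https://github.com/gartangh/CUDA.jl", rev="conv_bias_act"))
Pkg.add(PackageSpec(url="https://github.com/gartangh/Flux.jl", rev="conv_bias_act"))
Pkg.add(PackageSpec(url="https://github.com/gartangh/NNlib.jl", rev="conv_bias_act"))
Pkg.add(PackageSpec(url="https://github.com/JuliaGPU/GPUArrays.jl", rev="56bc004"))
Pkg.add(PackageSpec(url="https://github.com/FluxML/Metalhead.jl", rev="1d9db5e"))
```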

@DhairyaLGandhi
Member

bump

@maleadt
Member

maleadt commented Sep 21, 2020

bors try

bors bot added a commit that referenced this pull request Sep 21, 2020
@DhairyaLGandhi
Member

Needs the different branch from NNlib, otherwise we wouldn't get the benefit, I would imagine.

@maleadt
Member

maleadt commented Sep 21, 2020

Needs the different branch from NNlib, otherwise we wouldn't get the benefit, I would imagine.

There's more than just the conv_bias_act! method here, e.g., algorithm selection, broadcast changes. Those have an effect already.

@bors
Contributor

bors bot commented Sep 21, 2020

try

Build succeeded.

@maleadt
Member

maleadt commented Sep 21, 2020

bors r+

@bors
Contributor

bors bot commented Sep 21, 2020

Build succeeded.

bors bot merged commit 353eb9c into JuliaGPU:master on Sep 21, 2020
@denizyuret mentioned this pull request on Dec 23, 2020