
Performance improvements by calling cuDNN API #321

Merged
merged 4 commits into JuliaGPU:master from gartangh:conv_bias_act on Sep 21, 2020

Conversation

gartangh
Contributor

  • Selects the fastest convolutional algorithm
  • Faster activations, if supported
  • Faster softmax
  • Merges convolution, bias addition, and activation function if possible (see the sketch below)

Depends on:
FluxML/NNlib.jl#228
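For reference, a rough sketch of what the fused path computes; plain CPU arrays are used here so it runs anywhere, and the conv_bias_act name and signature are placeholders for illustration, not the final NNlib API:

```julia
using NNlib  # conv, DenseConvDims, relu

# Unfused reference path: separate convolution, bias broadcast, and activation kernels.
x = rand(Float32, 224, 224, 3, 1)   # input  (W, H, C_in, N)
w = rand(Float32, 3, 3, 3, 64)      # kernel (W, H, C_in, C_out)
b = rand(Float32, 1, 1, 64, 1)      # per-channel bias, broadcastable over the output
cdims = DenseConvDims(x, w)

y_ref = relu.(conv(x, w, cdims) .+ b)

# Fused path (placeholder signature): on the GPU, a single call to
# cudnnConvolutionBiasActivationForward can compute the same result.
# y = conv_bias_act(x, w, cdims, b, relu)
```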

@gartangh added the cuda libraries (Stuff about CUDA library wrappers.) and performance (How fast can we go?) labels on Jul 26, 2020
@gartangh requested a review from @maleadt on Jul 26, 2020
@gartangh self-assigned this on Jul 26, 2020
Review thread on lib/cudnn/conv.jl (outdated, resolved)
@maleadt
Member

maleadt commented Jul 26, 2020

cc @denizyuret
Also see #318; it looks like the Find* methods would be even better, but this is a good start.

@denizyuret
Contributor

  1. Can we also do algo selection for RNNs?
  2. And can we make the fast modes always the default?
  3. To use the Find* methods, I have a bunch of design issues listed in Performance: cudnn algorithm selection #318; I am not happy with my Knet implementation. If you have suggestions, I'd be happy to implement them.

@gartangh linked an issue (Performance: cudnn algorithm selection #318) on Jul 27, 2020 that may be closed by this pull request
@gartangh
Contributor Author

@maleadt @denizyuret
I just implemented wrappers for:

  • cudnnGetConvolutionForwardAlgorithm_v7
  • cudnnFindConvolutionForwardAlgorithm
  • cudnnFindConvolutionForwardAlgorithmEx
  • cudnnGetConvolutionForwardAlgorithmMaxCount
  1. In cuDNN 8, the following functions will be deprecated:

    • cudnnFindRNNForwardInferenceAlgorithmEx
    • cudnnFindRNNForwardTrainingAlgorithmEx
    • cudnnFindRNNBackwardDataAlgorithmEx
    • cudnnFindRNNBackwardWeightsAlgorithmEx

    so I do not think it is worth the effort to implement them at the moment.

  2. I am actually in favor of using accurate algorithms over fast ones by default.

  3. Any suggestions on how to cache the algorithm for certain input sizes? And how to limit memory usage? I am currently proposing cudnnGetConvolutionForwardAlgorithm_v7 until we have a way of caching, as cudnnGetConvolutionForwardAlgorithm will be removed in cuDNN 8. This can, however, still select algorithms that use lots of memory.

@denizyuret
Contributor

@maleadt @denizyuret
I just implemented wrappers for:

  • cudnnGetConvolutionForwardAlgorithm_v7
  • cudnnFindConvolutionForwardAlgorithm
  • cudnnFindConvolutionForwardAlgorithmEx
  • cudnnGetConvolutionForwardAlgorithmMaxCount

Great!

  1. In cuDNN 8, the following functions will be deprecated:

    • cudnnFindRNNForwardInferenceAlgorithmEx
    • cudnnFindRNNForwardTrainingAlgorithmEx
    • cudnnFindRNNBackwardDataAlgorithmEx
    • cudnnFindRNNBackwardWeightsAlgorithmEx

    so I do not think it is worth the effort to implement them at the moment.

Good catch, I agree.

  2. I am actually in favor of using accurate algorithms over fast ones by default.

What is your reasoning here? Making accurate algorithms the default may keep us behind in benchmarks and discourage people from adopting Julia as their deep learning language of choice. Accuracy is almost never important in deep learning (I have worked with 8-bit floats with little degradation in performance), and low accuracy sometimes helps as a regularizer and improves generalization.

  3. Any suggestions on how to cache the algorithm for certain input sizes? And how to limit memory usage? I am currently proposing cudnnGetConvolutionForwardAlgorithm_v7 until we have a way of caching, as cudnnGetConvolutionForwardAlgorithm will be removed in cuDNN 8. This can, however, still select algorithms that use lots of memory.

As mentioned in #318, my experience with Knet is:

  • "Get" functions did not have very good heuristics and I could not beat the benchmarks with them. Things may have improved since then; this was a couple of years ago.
  • "Find" functions are better; they try each possibility, after all. However, as you mentioned, they require caching and workspace memory.
  • cudnnFindConvolutionForwardAlgorithm() does the required memory allocation for its experiments on its own, which makes things simpler. But if the CUDA pool does not leave enough memory for an external cudaMalloc, you risk running out of memory and getting CUDNN_STATUS_ALLOC_FAILED.
  • Thus my current implementation uses cudnnFindConvolutionForwardAlgorithmEx(), which lets you allocate the workspace memory yourself (which, when you allocate it with CuArrays, avoids the out-of-memory problem). However, this makes things a bit complicated, because you need to figure out how much workspace memory to allocate for a given array size to try all the algorithms you want.
  • I cache results in global dictionaries (input sizes -> algo). This is probably not good for multi-threading, multi-GPU, etc., and needs to be fixed. I cache a maximum number of keys and just return the default algo (0) if more requests keep coming.

The code is in https://github.com/denizyuret/Knet.jl/blob/master/src/conv.jl
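For illustration, a minimal sketch of that FindEx pattern; find_conv_fwd_algos is a hypothetical stand-in for a wrapper around cudnnFindConvolutionForwardAlgorithmEx, and the result fields mirror cudnnConvolutionFwdAlgoPerf_t:

```julia
using CUDA

function pick_fwd_algo(x::CuArray, w::CuArray, y::CuArray, cdims)
    # Allocate the workspace through the CUDA.jl pool ourselves, sized from the
    # input; a real heuristic would also cap it at a fraction of free device memory.
    workspace = CuArray{UInt8}(undef, 2 * sizeof(x))
    # Hypothetical wrapper around cudnnFindConvolutionForwardAlgorithmEx:
    perfs = find_conv_fwd_algos(x, w, y, cdims, workspace)
    # cuDNN returns the results sorted by execution time; take the first one
    # that succeeded (status 0 == CUDNN_STATUS_SUCCESS).
    return first(p.algo for p in perfs if p.status == 0)
end
```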

@gartangh
Contributor Author

@denizyuret
I also added wrappers for these functions, as they are fairly similar to the ones I implemented before:

  • cudnnFindConvolutionBackwardFilterAlgorithm
  • cudnnFindConvolutionBackwardFilterAlgorithmEx
  • cudnnGetConvolutionBackwardFilterAlgorithm
  • cudnnGetConvolutionBackwardFilterAlgorithm_v7
  • cudnnGetConvolutionBackwardFilterAlgorithmMaxCount
  • cudnnFindConvolutionBackwardDataAlgorithm
  • cudnnFindConvolutionBackwardDataAlgorithmEx
  • cudnnGetConvolutionBackwardDataAlgorithm
  • cudnnGetConvolutionBackwardDataAlgorithm_v7
  • cudnnGetConvolutionBackwardDataAlgorithmMaxCount

@gartangh
Contributor Author

@denizyuret

What is your reasoning here? Making accurate algorithms the default may keep us behind in benchmarks and discourage people from adopting Julia as their deep learning language of choice. Accuracy is almost never important in deep learning (I have worked with 8-bit floats with little degradation in performance), and low accuracy sometimes helps as a regularizer and improves generalization.

I feel like Julia and its packages always favor accuracy over performance. So for the sake of consistency, we could do the same here. I completely agree with your reasoning that accuracy is not that important for deep learning, so I would be happy to use the most performant code by default, but that is not a decision I want to make by myself.

@denizyuret
Contributor

@denizyuret

What is your reasoning here? Making accurate algorithms the default may keep us behind in benchmarks and discourage people from adopting Julia as their deep learning language of choice. Accuracy is almost never important in deep learning (I have worked with 8-bit floats with little degradation in performance), and low accuracy sometimes helps as a regularizer and improves generalization.

I feel like Julia and its packages always favor accuracy over performance. So for the sake of consistency, we could do the same here. I completely agree with your reasoning that accuracy is not that important for deep learning, so I would be happy to use the most performant code by default, but that is not a decision I want to make by myself.

@ViralBShah what do you think? Can we break with tradition and make the most performant variants of cuDNN algorithms the default for the sake of deep learning packages? We can add keyword options to choose the more precise algorithms and document them, but I think it is important (esp. for new users) that the defaults work fast.

@gartangh
Contributor Author

@maleadt @denizyuret

As mentioned in #318, my experience with Knet is:
  • "Get" functions did not have very good heuristics and I could not beat the benchmarks with them. Things may have improved since then; this was a couple of years ago.
  • "Find" functions are better; they try each possibility, after all. However, as you mentioned, they require caching and workspace memory.
  • cudnnFindConvolutionForwardAlgorithm() does the required memory allocation for its experiments on its own, which makes things simpler. But if the CUDA pool does not leave enough memory for an external cudaMalloc, you risk running out of memory and getting CUDNN_STATUS_ALLOC_FAILED.
  • Thus my current implementation uses cudnnFindConvolutionForwardAlgorithmEx(), which lets you allocate the workspace memory yourself (which, when you allocate it with CuArrays, avoids the out-of-memory problem). However, this makes things a bit complicated, because you need to figure out how much workspace memory to allocate for a given array size to try all the algorithms you want.
  • I cache results in global dictionaries (input sizes -> algo). This is probably not good for multi-threading, multi-GPU, etc., and needs to be fixed. I cache a maximum number of keys and just return the default algo (0) if more requests keep coming.
The code is in https://github.com/denizyuret/Knet.jl/blob/master/src/conv.jl

I just added a way of caching the algorithm returned from cudnnFindConvolutionForwardAlgorithmEx, similar to how you did it in Knet.jl. I opted for a dictionary with limited capacity and Least Recently Used (LRU) replacement. I would love to have some feedback on this. (Tomorrow, I will run some benchmarks and profile a Deep Neural Network to measure the differences.)
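For illustration, a toy version of that capacity-limited LRU idea (this is not the actual CircularDict code in the PR, and find_algo in the usage comment is hypothetical):

```julia
struct LRUCache{K,V}
    capacity::Int
    data::Dict{K,V}
    order::Vector{K}   # least recently used first
end
LRUCache{K,V}(capacity::Int) where {K,V} = LRUCache{K,V}(capacity, Dict{K,V}(), K[])

function Base.get!(f, c::LRUCache{K,V}, key::K) where {K,V}
    if haskey(c.data, key)
        # Cache hit: drop the key from its old position; it is re-appended below.
        deleteat!(c.order, findfirst(==(key), c.order))
    else
        # Cache miss: evict the least recently used entry if we are at capacity.
        length(c.data) >= c.capacity && delete!(c.data, popfirst!(c.order))
        c.data[key] = f()
    end
    push!(c.order, key)   # mark as most recently used
    return c.data[key]
end

# Usage sketch: memoize the algorithm chosen for a given problem size.
# conv_forward = LRUCache{Tuple,Int32}(100)
# algo = get!(() -> find_algo(x, w, cdims), conv_forward, (Float32, size(x), cdims))
```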

@denizyuret
Contributor

@maleadt @denizyuret

As mentioned in #318, my experience with Knet is:
  • "Get" functions did not have very good heuristics and I could not beat the benchmarks with them. Things may have improved since then; this was a couple of years ago.
  • "Find" functions are better; they try each possibility, after all. However, as you mentioned, they require caching and workspace memory.
  • cudnnFindConvolutionForwardAlgorithm() does the required memory allocation for its experiments on its own, which makes things simpler. But if the CUDA pool does not leave enough memory for an external cudaMalloc, you risk running out of memory and getting CUDNN_STATUS_ALLOC_FAILED.
  • Thus my current implementation uses cudnnFindConvolutionForwardAlgorithmEx(), which lets you allocate the workspace memory yourself (which, when you allocate it with CuArrays, avoids the out-of-memory problem). However, this makes things a bit complicated, because you need to figure out how much workspace memory to allocate for a given array size to try all the algorithms you want.
  • I cache results in global dictionaries (input sizes -> algo). This is probably not good for multi-threading, multi-GPU, etc., and needs to be fixed. I cache a maximum number of keys and just return the default algo (0) if more requests keep coming.
The code is in https://github.com/denizyuret/Knet.jl/blob/master/src/conv.jl

I just added a way of caching the algorithm returned from cudnnFindConvolutionForwardAlgorithmEx, similar to how you did it in Knet.jl. I opted for a dictionary with limited capacity and Least Recently Used (LRU) replacement. I would love to have some feedback on this. (Tomorrow, I will run some benchmarks and profile a Deep Neural Network to measure the differences.)

That's great. About the LRU: my purpose in limiting the number was not to conserve memory: the entries do not take that much space. It was more about not wasting too much time with Find calls if the array sizes keep changing, which LRU will not limit. Although in practice this probably never happens so it doesn't matter that much as long as we have a cache size that can handle training common deep models. And I can see the logic of LRU if the user switches between different models during the same session.

@@ -67,6 +94,7 @@ fix1d(cdims::DenseConvDims{1,K,C_in,C_out,S,P,D,F}) where {K,C_in,C_out,S,P,D,F}
fix1d(pdims::PoolDims{1,K,S,P,D}) where {K,S,P,D,F} =
PoolDims{2,(K...,1),(S...,1),(P...,0,0),(D...,1)}((pdims.I..., 1), pdims.C_in)

conv_forward = CircularDict{Tuple, Int32}(100)
maleadt (Member)


Why only cache a limited amount? If you key with not much uniqueness, only (T, size, DenseConvDims), we should have a high enough hit rate? The entries are small, too, and don't keep anything alive.

gartangh (Contributor Author)


Cf. my comment here: #321 (comment)

Three review threads on lib/cudnn/nnlib.jl (outdated, resolved)
@gartangh
Contributor Author

@denizyuret

That's great. About the LRU: my purpose in limiting the number was not to conserve memory: the entries do not take that much space. It was more about not wasting too much time with Find calls if the array sizes keep changing, which LRU will not limit. Although in practice this probably never happens so it doesn't matter that much as long as we have a cache size that can handle training common deep models. And I can see the logic of LRU if the user switches between different models during the same session.

@maleadt

Why only cache a limited amount? If you key with not much uniqueness, only (T, size, DenseConvDims), we should have a high enough hit rate? The entries are small, too, and don't keep anything alive.

I was not really aiming to conserve memory, since (like you say) the entries do not take that much space.
I was really thinking about "what if we switch between large models?".
I think the conclusion from your comments is that we can use a regular dictionary without a capacity limit.
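Something like the following unbounded version, keyed the way @maleadt suggests (find_fwd_algo is a hypothetical helper that runs the cuDNN search once):

```julia
using NNlib  # DenseConvDims

const FWD_ALGO_CACHE = Dict{Tuple{DataType,Dims,DenseConvDims},Int32}()

function cached_fwd_algo(x, w, cdims::DenseConvDims)
    # One cache entry per (element type, input size, convolution configuration).
    get!(FWD_ALGO_CACHE, (eltype(x), size(x), cdims)) do
        find_fwd_algo(x, w, cdims)   # hypothetical: run the cuDNN Find* search once
    end
end
```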

@gartangh
Contributor Author

gartangh commented Jul 29, 2020

So I just did the measurements I promised (all measurements are in milliseconds):

Network      Input size        #master   #conv_bias_act   #experimentation
VGG19        (224,224,3,64)    263       160              154
Darknet-19   (224,224,3,128)   120       90               84
ResNet-50    (224,224,3,64)    253       285              197
Darknet-53   (256,256,3,64)    194       164              148

The reason that ResNet-50 is slower is that there are lots of memory issues. I believe that if we can come up with a better heuristic for the workspace size, we can speed up all these networks.

@denizyuret
Contributor

The reason that ResNet-50 is slower is that there are lots of memory issues. I believe that if we can come up with a better heuristic for the workspace size, we can speed up all these networks.

These look very good. Can you give a bit more detail about ResNet-50? Is it the workspace size used during algorithm discovery, or the one used during training/inference, that you think is the problem? Is it running close to the GPU memory limit? About workspace-size heuristics, I use two:

(1) Limit the maximum workspace size during discovery to a reasonable percentage of available memory and/or a reasonable multiple of the input array size: https://github.com/denizyuret/Knet.jl/blob/12430600ea0782cb16cd1aa5887796a5f0d4359f/src/conv.jl#L538

(2) When you are picking an algorithm by looking at the Find results, prefer ones with lower memory, up to a 10% speed penalty: https://github.com/denizyuret/Knet.jl/blob/12430600ea0782cb16cd1aa5887796a5f0d4359f/src/conv.jl#L617
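For illustration, a rough sketch of those two heuristics (the cutoffs are just the examples mentioned above, the result fields mirror cudnnConvolutionFwdAlgoPerf_t, and the free-memory query assumes CUDA.Mem.info()):

```julia
using CUDA

# (1) Cap the workspace size used during algorithm discovery.
function workspace_limit(x::CuArray)
    free_mem, _total = CUDA.Mem.info()         # free and total device memory in bytes
    return min(free_mem ÷ 2, 10 * sizeof(x))   # e.g. half of free memory, 10x the input
end

# (2) Among the benchmarked algorithms, trade at most 10% speed for lower memory use.
function pick_algo(perfs)
    best_time = minimum(p.time for p in perfs)
    candidates = [p for p in perfs if p.time <= 1.1 * best_time]
    return candidates[argmin([p.memory for p in candidates])].algo
end
```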

@gartangh
Contributor Author

gartangh commented Jul 30, 2020

@denizyuret, I was already using your first heuristic but overlooked the second.
I also found some speedups for my implementations of activation functions on the GPU.
ResNet-50 now uses 25% less memory and clocks in below 200 ms.

I have updated my previous comment with new measurements from the experimentation branch.

@denizyuret
Contributor

@gartangh, please be careful when adding bias, activation functions, etc.: I have observed inferior performance relative to handwritten kernels in previous versions of cuDNN, so I would not make these the default without benchmarking.

@gartangh
Contributor Author

@denizyuret, as per your request:

[Two benchmark charts: "Activations" and "Activations Speedup"]

Here, "before" is the master branch and "after" is the conv_bias_act branch.
These cuDNN kernels seem to outperform Julia broadcast kernels by quite some margin.

@gartangh force-pushed the conv_bias_act branch 2 times, most recently from 9558736 to d7cf913 on July 30, 2020
@maleadt
Member

maleadt commented Jul 30, 2020

These cuDNN kernels seem to outperform Julia broadcast kernels by quite some margin.

That's with all the recent broadcast optimizations, I take it?

@gartangh force-pushed the conv_bias_act branch 2 times, most recently from 6dcdf63 to 137ff45 on July 30, 2020
@denizyuret
Contributor

Something is not quite right with cudnnAddTensor

@denizyuret, I have made some adjustments to the code in 233ebe6.
This should make sure that cudnnAddTensor is only called when the inputs have the right dimensions.
For all other cases, Julia broadcasting is used as before.

Interestingly, the cuDNN documentation suggests all Julia-broadcastable arrays should work: "Each dimension of the bias tensor A must match the corresponding dimension of the destination tensor C or must be equal to 1." We should file a bug report :)
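A hedged sketch of the guard described in that commit, assuming the shape that works in practice is the per-channel bias (1, 1, C, 1); cudnn_add_bias! is a placeholder for the low-level wrapper:

```julia
using CUDA

function add_bias!(y::CuArray{T,4}, b::CuArray{T,4}) where {T}
    if size(b) == (1, 1, size(y, 3), 1)
        cudnn_add_bias!(y, b)   # placeholder for the cudnnAddTensor path
    else
        y .+= b                 # generic Julia broadcast fallback
    end
    return y
end
```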

@gartangh
Contributor Author

gartangh commented Aug 4, 2020

Interestingly, the cuDNN documentation suggests all Julia-broadcastable arrays should work: "Each dimension of the bias tensor A must match the corresponding dimension of the destination tensor C or must be equal to 1." We should file a bug report :)

@denizyuret, here we go: https://developer.nvidia.com/nvidia_bug/3084210.

@JeffBezanson

I feel like Julia and its packages always favor accuracy over performance.

No, there is no such rule or "tradition". Packages should do whatever's best for them, and authors should make independent decisions about what is best.

@DhairyaLGandhi
Member

This is indeed very promising, thank you!

I'm still interested in diving deeper into ResNet; is there a profile of the call graph with this branch?

Also, it seems that all activations are now constant time; can we verify this against cuDNN implementations elsewhere?

@DhairyaLGandhi
Member

Also, are there updates to the numbers posted in #321 (comment)?

@gartangh
Contributor Author

gartangh commented Aug 6, 2020

No, there is no such rule or "tradition". Packages should do whatever's best for them, and authors should make independent decisions about what is best.

@JeffBezanson, I am selecting the fast algorithms over the accurate ones by default now.

Also, are there updates to the numbers posted in #321 (comment)?

@DhairyaLGandhi, here you go (I made some changes to my benchmark setup, so the numbers are not exactly the same as in #321 (comment)). All numbers are still in milliseconds, and this was tested on an NVIDIA V100 16GB.

I'm still interested in diving deeper into ResNet; is there a profile of the call graph with this branch?

@DhairyaLGandhi, the problem with ResNet-50 is gone after the changes to my benchmark setup.

Network      Input size        CuArrays.jl #master (JuliaGPU/CuArrays.jl@6ebe463)   CUDA.jl #master (8c61a05)   CUDA.jl #conv_bias_act (gartangh@a1e83ee)
VGG19        (224,224,3,64)    308                                                  267                         127
Darknet-19   (224,224,3,128)   155                                                  114                         67
ResNet-50    (224,224,3,64)    412                                                  193                         83
Darknet-53   (256,256,3,64)    238                                                  187                         95

Network      Input size        CuArrays.jl #grouped_convolutions (gartangh/CuArrays.jl@99ed388)   CUDA.jl #grouped_conv (gartangh@feb3572)
ResNeXt-50   (224,224,3,32)    245                                                                45

@denizyuret
Contributor

I'd like to run some benchmarks but could not find the right combination of CUDA, GPUArrays, NNlib, etc. that works with your branch (checking out the masters of all the others did not work). @gartangh, can you advise?

@gartangh
Contributor Author

gartangh commented Aug 6, 2020

@denizyuret

  • CUDA v1.2.0 (#conv_bias_act from https://github.com/gartangh/CUDA.jl)
  • Flux v0.11.0 (#conv_bias_act from https://github.com/gartangh/Flux.jl)
  • GPUArrays v4.0.1 (#56bc004 from https://github.com/JuliaGPU/GPUArrays.jl)
  • Metalhead v0.5.0 (#1d9db5e from https://github.com/FluxML/Metalhead.jl)
  • NNlib v0.7.3 (#conv_bias_act from https://github.com/gartangh/NNlib.jl)
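For anyone reproducing this, one way to get that combination from the Pkg API (a sketch; the branch and commit specs are the ones listed above):

```julia
using Pkg
Pkg.add(PackageSpec(url="https://github.com/gartangh/CUDA.jl", rev="conv_bias_act"))
Pkg.add(PackageSpec(url="https://github.com/gartangh/Flux.jl", rev="conv_bias_act"))
Pkg.add(PackageSpec(url="https://github.com/gartangh/NNlib.jl", rev="conv_bias_act"))
Pkg.add(PackageSpec(url="https://github.com/JuliaGPU/GPUArrays.jl", rev="56bc004"))
Pkg.add(PackageSpec(url="https://github.com/FluxML/Metalhead.jl", rev="1d9db5e"))
```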

@DhairyaLGandhi
Member

bump

@maleadt
Member

maleadt commented Sep 21, 2020

bors try

bors bot added a commit that referenced this pull request Sep 21, 2020
@DhairyaLGandhi
Member

Needs the different branch from NNlib, otherwise we wouldn't get the benefit, I would imagine.

@maleadt
Member

maleadt commented Sep 21, 2020

Needs the different branch from NNlib, otherwise we wouldn't get the benefit, I would imagine.

There's more than just the conv_bias_act! method here, e.g., algorithm selection, broadcast changes. Those have an effect already.

@bors
Contributor

bors bot commented Sep 21, 2020

try

Build succeeded.

@maleadt
Member

maleadt commented Sep 21, 2020

bors r+

@bors
Contributor

bors bot commented Sep 21, 2020

Build succeeded.

bors bot merged commit 353eb9c into JuliaGPU:master on Sep 21, 2020
@denizyuret mentioned this pull request on Dec 23, 2020