
cuDNN: Store convolution algorithm choice to disk. #1947

Closed
RomeoV opened this issue Jun 11, 2023 · 8 comments · Fixed by #1948
Labels: performance (How fast can we go?)

Comments

@RomeoV
Contributor

RomeoV commented Jun 11, 2023

Problem description.

In deep learning (DL), we usually define a computation using high-level instructions (e.g. with Flux.jl) and then map the computation over many elements.
It's the compiler's job to figure out how to actually execute this in a performant manner, which has led to the invention of e.g. XLA, and is also central to the motivation behind the recently announced Mojo programming language.

Usually, the only way to find out the details of the execution is to run micro-benchmarks.
In the CPU world, for example, ATLAS has long been a way to automatically tune execution details to new hardware.
(PS: I believe LoopVectorization.jl does something similar, but looking at the code I didn't actually find any "on-the-fly benchmarking"? @chriselrod)
cuDNN.jl currently does something similar only for convolutions, delegating the specifics to the actual cuDNN library (e.g. here).
Performance results are cached in a local dictionary and discarded at the end of the session.
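To make the pattern concrete, here is a minimal sketch of the lookup-or-benchmark caching described above; it is not the actual cuDNN.jl code, and `benchmark_algorithms` is a hypothetical stand-in for the call into cuDNN's algorithm-finding routine.

```julia
# Minimal sketch of the caching pattern described above (not the actual cuDNN.jl code).
# The Dict plays the role of e.g. cudnnConvolutionFwdAlgoPerfCache, and
# `benchmark_algorithms` stands in for the micro-benchmarking call into cuDNN.
const FWD_ALGO_CACHE = Dict{Any,Any}()

function pick_algorithm(key, benchmark_algorithms)
    get!(FWD_ALGO_CACHE, key) do
        perfs = benchmark_algorithms(key)  # run the micro-benchmarks once per key
        first(perfs)                       # keep the fastest reported algorithm
    end
end
```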

Currently, I'm not sure how long the microbenchmarks take compared to the startup time in general (I found it a bit hard to measure).
In the future, however, we may see more involved performance optimizations, and I think it would be interesting to (optionally) store the benchmark results to disk.

Solution proposal.

We already have three cache variables cudnnConvolutionFwdAlgoPerfCache, cudnnConvolutionBwdDataAlgoPerfCache, and cudnnConvolutionBwdFilterAlgoPerfCache.
Conceptually, they map from a tuple of descriptors (basically structs describing the convolution parameters, weight tensor, and data tensor) to performance results for different implementations.
In general, it should be relatively straightforward to repopulate them from disk.
However, the cache key is currently not the Julia struct/tuple itself, but a pointer to a C struct, which dies at the end of the Julia session.

This should be an easy fix though, if we change the key from the pointer to the actual descriptor.
A proof-of-concept implementation is provided in #1948.
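For concreteness, here is a minimal sketch of what the (de)serialization could look like, assuming the caches become plain `Dict`s keyed by Julia-side descriptor tuples; `save_algo_cache`/`load_algo_cache!` are illustrative names, not existing cuDNN.jl functions.

```julia
# Hypothetical sketch of persisting an algorithm cache between sessions.
# Assumes the cache is a plain Dict keyed by Julia descriptor tuples (as proposed above);
# the function names are illustrative, not cuDNN.jl API.
using Serialization

function save_algo_cache(path::AbstractString, cache::AbstractDict)
    open(io -> serialize(io, cache), path, "w")
end

function load_algo_cache!(cache::AbstractDict, path::AbstractString)
    isfile(path) || return cache          # no cache file yet: nothing to restore
    merge!(cache, deserialize(path))      # repopulate from the previous session
end
```

One could call `load_algo_cache!` during module initialization and `save_algo_cache` via `atexit`, though cache invalidation (e.g. across cuDNN versions or different GPUs) would need some thought.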

Broader impact.

I think if Julia wants to stay competitive in this domain, especially with the Mojo language coming up, more optimizations for GPU execution must be available without paying a high compile-time cost.
In the future, I hope there can be further research projects exploring different ways of executing on the hardware.

I'm happy to hear any feedback on the topic in general, and on this issue in particular.

@chriselrod

chriselrod commented Jun 11, 2023

PS: I believe LoopVectorization.jl does something similar, but looking at the code I didn't actually find any "on-the-fly benchmarking"?

I hope the compile times don't feel like it is doing on-the-fly benchmarking!
Tuning with the likes of TVM is not fast.

It uses a cost model to choose register tile sizes.
It does not tile at higher levels.

I am working on a replacement, where I plan on having a later release support both tiling for cache sizes and auto-tuning on top of using a cost model by default.
Models are much quicker, but auto-tuning results could (a) be cached (and saved to disk, like you say here, or also distributed with a dictionary mapping specific hardware to kernel, if the hardware [e.g. A100, H100] and kernels are common enough) and (b) inform and improve the model.
Additionally, if users specify the search space, it is likely that auto-tuning will get worse results. When a coworker was demoing TVM for me, he didn't include LoopVectorization's answer in the search space, and ended up with much worse performance than it achieved.

I also intend to have the rewrite support targeting accelerators, so long as those accelerators are supported by LLVM.

@ToucheSir
Contributor

ToucheSir commented Jun 11, 2023

Instead of creating a mapping from cuDNN struct pointers to Julia-side info, it would probably be better to extract all the relevant info from the descriptor(s) and use some subset of that as the cache key. Some previous discussion in #1461 (comment) and #1461 (comment) (which references microsoft/CNTK#3362).

Currently, I'm not sure how long the microbenchmarks take compared to the startup time in general (I found it a bit hard to measure).

You can check this using Nsight; I find the benchmarking phase pretty easy to distinguish from the actual execution.

@RomeoV
Contributor Author

RomeoV commented Jun 11, 2023

Thanks @chriselrod, I also had another look at the LoopVectorization docs and indeed they do mention it being model-based; I must have missed that the first time ;)

As an aside: Here's an API example of Mojo's approach to determining the tile size, which seems to be (for now) basically defining a search space and benchmarking. I'm not sure they're doing anything further at this point @lattner? I suppose we won't know for some time.

I also intend to have the rewrite support targeting accelerators, so long as those accelerators are supported by LLVM.

That would certainly be exciting. If something like this can become competitive with vendor implementations of common kernels, it would be a remarkable result, and certainly a show-piece for the Julia community.
Either way, if you also consider implementing autotuning, perhaps there is space for a shared abstraction for storing autotuning results across CPU/CUDA/XPU, i.e. what this issue is about in the first place.

@RomeoV
Contributor Author

RomeoV commented Jun 11, 2023

Thanks @ToucheSir for the feedback! Here's what the keys in the pull request example actually look like:

# cudnnFilterDescriptor
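# presumably (data type, memory format, ndims, filter dims)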
(cuDNN.CUDNN_DATA_FLOAT, cuDNN.CUDNN_TENSOR_NCHW, 4, Int32[32, 64, 3, 3])
# cudnnTensorDescriptor
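# presumably (memory format, data type, ndims, tensor dims)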
(cuDNN.CUDNN_TENSOR_NCHW, cuDNN.CUDNN_DATA_FLOAT, 4, Int32[7, 32, 30, 30])
# cudnnConvDescriptor
# (padding, stride, dilation, ..., beta) where beta is some scaling factor(?)
(Int32[0, 0], Int32[1, 1], Int32[1, 1], cuDNN.CUDNN_CONVOLUTION, cuDNN.CUDNN_DATA_FLOAT, cuDNN.CUDNN_TENSOR_OP_MATH, cuDNN.CUDNN_DEFAULT_REORDER, 1)

I don't think it would be wise to strip much away from this. Even for the cudnnConvDescriptor, I could only imagine (maybe) stripping the padding, but otherwise I think there will always be cases where you end up needing different execution implementations.
In the issue you linked, removing the batch size was considered, for example. But the batch size in particular can be quite significant for the concrete choice of implementation/algorithm.

Therefore, IMO, using the descriptors as-is is quite reasonable. For the case where the batch size varies all the time, one could consider filling the dictionary with copies of the same keys (but with different batch sizes), or providing an algorithm override.
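A rough sketch of that first idea, assuming the cache is a plain Dict keyed by (filter, tensor, conv) descriptor tuples as shown above, with the batch size as the first element of the tensor dims; `prefill_batch_sizes!` is a made-up name for illustration:

```julia
# Rough sketch (not cuDNN.jl API): reuse one benchmarked result for several batch
# sizes by duplicating the cache entry with the batch dimension swapped out.
# Assumes keys are (filter, tensor, conv) descriptor tuples as shown above,
# with the batch size N as the first element of the tensor dims (NCHW layout).
function prefill_batch_sizes!(cache::AbstractDict, key, perf, batch_sizes)
    wkey, xkey, ckey = key
    format, dtype, nd, dims = xkey
    for n in batch_sizes
        dims_n = copy(dims)
        dims_n[1] = Int32(n)              # swap out the batch dimension
        cache[(wkey, (format, dtype, nd, dims_n), ckey)] = perf
    end
    return cache
end
```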

@lattner

lattner commented Jun 11, 2023

That code is a simple example meant to motivate builtin language autotuning. Our production matmul algorithms use that functionality, but their implementation is much different and more complicated.

@RomeoV
Contributor Author

RomeoV commented Jun 11, 2023

Thanks @lattner, excited to see more details as they become available, especially for the gpu/tpu/xpu autotuning!

@lattner

lattner commented Jun 11, 2023

Me too! We're seeing amazing results, can't wait to share them.

@ToucheSir
Contributor

Therefore, IMO, using the descriptors as-is is quite reasonable. For the case where the batch size varies all the time, one could consider filling the dictionary with copies of the same keys (but with different batch sizes), or providing an algorithm override.

I think my point was unclear, sorry. The main observation is that it is not necessary to cache the descriptors themselves but just their contents, which should make deserialization much easier. Will go into more detail in #1948.
