This issue (motivated by https://discourse.julialang.org/t/flux-vs-pytorch-cpu-performance/42667/25) is intended to be a high-level overview of the bottlenecks that show up in common models. This is a non-exhaustive list and will be expanded as more suggestions and use cases come along.
I think it could make sense to split GPU and CPU dispatches if you wanted to take the time to write out the adjoints and add CUDA.free expressions. If you write out the adjoint by hand, you have lots of intermediate calculations that you know won't escape, so it's safe to free those variables the moment they are consumed in the backwards pass. That only makes sense for GPUs of course, while everything in the forwards passes should probably get some @avx magic on it.
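A hypothetical sketch of that idea (the fused `dense_relu` kernel is made up for illustration, and this only makes sense on `CuArray` inputs): because we created the pre-activation `z` ourselves and know it cannot escape the pullback, we can free it eagerly instead of waiting for the GC.

```julia
using Zygote, CUDA

# Hypothetical fused dense+relu kernel (illustration only).
dense_relu(W, x) = max.(W * x, 0f0)

Zygote.@adjoint function dense_relu(W, x)
    z = W * x               # intermediate: known not to escape this adjoint
    y = max.(z, 0f0)
    function pullback(ȳ)
        z̄ = ȳ .* (z .> 0f0)
        CUDA.unsafe_free!(z)    # `z` is consumed; release its GPU memory now
        W̄ = z̄ * x'
        x̄ = W' * z̄
        CUDA.unsafe_free!(z̄)    # likewise for the intermediate gradient
        return (W̄, x̄)
    end
    return y, pullback
end
```

The point is that the hand-written adjoint knows the lifetime of every intermediate, which Zygote's generic pullbacks cannot assume.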
For data parallelism, we might want to just smack it dead center and have a tutorial in the docs titled "Multi-GPU on Clusters" that shows vmap, tmap, pmap, and then setting up multiple GPUs + pmap, all inside of gradients, with a link to ClusterManagers.jl. It should make it extremely obvious that Flux works with huge compute. Not necessarily a "bottleneck", but it's a common enough question that anyone who searches for it should easily find that page.
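Something like the following could anchor that tutorial page (a rough sketch only: `data_shards` is assumed to exist, the device-pinning arithmetic assumes one worker process per local GPU, and worker ids starting at 2 is a Distributed.jl detail):

```julia
using Distributed
addprocs(2)                        # e.g. one worker per GPU
@everywhere using CUDA, Flux

# Pin each worker to its own device, then pmap gradients over data shards.
@everywhere function shard_gradient(shard)
    CUDA.device!((myid() - 2) % length(CUDA.devices()))
    m = Dense(10, 2) |> gpu        # in practice, replicate the real model
    x, y = shard .|> gpu
    gradient(() -> Flux.mse(m(x), y), Flux.params(m))
end

grads = pmap(shard_gradient, data_shards)
```

With ClusterManagers.jl the `addprocs` call is the only line that changes when moving from one box to a cluster, which is exactly the story the tutorial should tell.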
We should look at doing 5-argument mul! in things like the Dense kernel. I get inconsistent results on OpenBLAS, but we should get some measurements on MKL.
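For reference, the 5-argument form lets us fuse the bias add into the GEMM instead of allocating `W*x` and then broadcasting `+ b`. A minimal sketch (helper name `dense5!` is made up):

```julia
using LinearAlgebra

# Fused dense layer: y = W*x .+ b with no intermediate allocation.
function dense5!(y, W, x, b)
    y .= b                        # broadcast the bias into the output buffer
    mul!(y, W, x, true, true)     # 5-arg mul!: y = W*x*1 + y*1
    return y
end

W = randn(Float32, 32, 64); x = randn(Float32, 64, 16)
b = randn(Float32, 32);     y = similar(W, 32, 16)
dense5!(y, W, x, b) ≈ W * x .+ b   # true
```

Whether this wins over the broadcast depends on how well the BLAS handles β = 1, hence the need for OpenBLAS vs MKL measurements.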
It would be good to time @avx broadcasting against https://github.com/JuliaMath/IntelVectorMath.jl which is probably the fastest vector math library out there. If we're at least close, I think we can say that @avx is good enough to at least match everyone else.
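A rough benchmark along those lines could look like this (assuming IntelVectorMath's exported `IVM` alias and its in-place `exp!`; exact timings will of course vary by CPU):

```julia
using BenchmarkTools, LoopVectorization, IntelVectorMath

x = rand(Float32, 2^16)
y = similar(x)

@btime $y .= exp.($x)          # Base broadcast
@btime @avx $y .= exp.($x)     # LoopVectorization
@btime IVM.exp!($y, $x)        # IntelVectorMath (MKL VML)
```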
- `Base.tanh` is slower compared to some SIMD'd versions (for example from SLEEFPirates) (replace Base.tanh with faster tanh #1272)
- Add `@avx` to our activation functions to help with SIMD'ing
- `softmax` via LoopVectorization.jl (use LoopVectorization to vectorize activation functions and softmax NNlib.jl#199)

cc @CarloLucibello @ChrisRackauckas @ViralBShah
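For the softmax item, a minimal sketch of what the LoopVectorization version could look like (function name `softmax_avx!` is made up; it follows the usual subtract-max-for-stability recipe, and `@avx` handles the `max`/`+=` reductions):

```julia
using LoopVectorization

function softmax_avx!(y, x)
    T = eltype(x)
    m = typemin(T)
    @avx for i in eachindex(x)      # reduction: running maximum
        m = max(m, x[i])
    end
    s = zero(T)
    @avx for i in eachindex(x)      # vectorised exp + sum reduction
        yi = exp(x[i] - m)
        y[i] = yi
        s += yi
    end
    @avx for i in eachindex(y)      # normalise
        y[i] /= s
    end
    return y
end
```

The exp loop is where `@avx` should pay off most, since `Base.exp` doesn't SIMD on its own.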