
Find out common bottlenecks #1273

Open
4 tasks
DhairyaLGandhi opened this issue Jul 9, 2020 · 1 comment

Comments

@DhairyaLGandhi
Member

This issue (motivated by https://discourse.julialang.org/t/flux-vs-pytorch-cpu-performance/42667/25) is intended as a high-level overview of the bottlenecks that show up in common models. The list is non-exhaustive and will be expanded as more suggestions and use cases come in.

cc @CarloLucibello @ChrisRackauckas @ViralBShah

@ChrisRackauckas
Member

I think it could make sense to split the GPU and CPU dispatches if you wanted to take the time to write out the adjoints and add CUDA.free expressions: once you write out an adjoint by hand, you have lots of intermediate calculations that you know won't escape, so it's safe to free those variables the moment the backward pass has consumed them. That only makes sense for GPUs, of course, while everything in the forward passes should probably get some @avx magic on it.
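A minimal sketch of that pattern, assuming CUDA.jl's `CUDA.unsafe_free!` (the current name for eagerly freeing a GPU array) and Zygote's `@adjoint`; the fused layer `dense_act` is illustrative, not Flux API:

```julia
using Zygote, CUDA

# A fused dense layer: tanh.(W*x .+ b).
dense_act(W, x, b) = tanh.(W * x .+ b)

Zygote.@adjoint function dense_act(W, x, b)
    pre = W * x .+ b              # intermediate pre-activation
    y = tanh.(pre)
    function back(Δ)
        # tanh' expressed via the output, so `pre` is no longer needed:
        # we know it can't escape, so free it immediately on the GPU.
        CUDA.unsafe_free!(pre)
        dpre = Δ .* (1 .- y .^ 2)
        dW = dpre * x'
        dx = W' * dpre
        db = vec(sum(dpre, dims = 2))
        CUDA.unsafe_free!(dpre)   # consumed above; safe to free now
        return (dW, dx, db)
    end
    return y, back
end
```

The point is that a hand-written adjoint knows the lifetime of every intermediate, so it can release GPU memory long before the garbage collector would.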

For data parallelism, we might want to just smack it dead center and have a tutorial in the docs titled "Multi-GPU on Clusters" that shows vmap, tmap, pmap, and then setting up multiple GPUs + pmap, all inside of gradients, with a link to ClusterManagers.jl. It should make it extremely obvious that Flux works with huge compute. Not necessarily a "bottleneck", but it's a common enough question that anyone who searches for it should easily find that page.
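The "multiple GPUs + pmap" part of that tutorial could be sketched roughly as follows; `model`, `loss`, and `shards` are placeholders here, and the one-worker-per-GPU pinning is an assumption about how such a setup would be wired:

```julia
using Distributed, CUDA

addprocs(length(CUDA.devices()))      # one worker per visible GPU
@everywhere using CUDA, Flux

# Pin worker i to GPU i-1.
for (i, w) in enumerate(workers())
    remotecall_wait(CUDA.device!, w, i - 1)
end

# Data-parallel gradients: each worker moves its shard onto its own
# GPU and computes a gradient for that shard.
grads = pmap(shards) do (x, y)
    gradient(m -> loss(m(gpu(x)), gpu(y)), model)
end
```

From here, ClusterManagers.jl only changes how `addprocs` is called, which is why a single docs page could cover laptops through clusters.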

We should look at using 5-argument mul! in kernels like Dense. I get inconsistent results on OpenBLAS, but we should get some measurements on MKL.
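For reference, the 5-argument mul! computes `C .= α*A*B + β*C` in place, which lets a layer accumulate into an existing buffer instead of allocating `W*x` and then adding. A minimal stdlib-only illustration (not Flux's actual kernel):

```julia
using LinearAlgebra

W = randn(128, 64)
x = randn(64, 32)
C = zeros(128, 32)

# C = 1.0*W*x + 0.0*C, written into C with no temporary array.
mul!(C, W, x, 1.0, 0.0)

C ≈ W * x   # true
```

With `β = 1.0` the same call accumulates on top of whatever is already in `C`, which is the case that matters for fusing the bias/accumulation step.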

It would be good to time @avx broadcasting against https://github.com/JuliaMath/IntelVectorMath.jl, which is probably the fastest vector math library out there. If we're at least close, I think we can say that @avx is good enough to at least match everyone else.
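A benchmarking sketch for that comparison, assuming LoopVectorization (which provides @avx) and IntelVectorMath are installed; `IVM.exp!` as the in-place two-argument VML call is an assumption about that package's API:

```julia
using BenchmarkTools, LoopVectorization, IntelVectorMath

x = rand(10^6)
y = similar(x)

# @avx-vectorized elementwise exp, writing into y.
avx_exp!(y, x) = @avx y .= exp.(x)

@btime avx_exp!($y, $x)       # LoopVectorization
@btime IVM.exp!($y, $x)       # Intel VML, same in-place shape
```

Running both over a sweep of array sizes (small arrays stress call overhead, large ones memory bandwidth) would make the "at least close" claim concrete.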
