Blocked computations on the cpu #333
Hey @LaurentPlagne, thank you for your kind words! The code and documentation definitely need to be improved, and the CUDAnative.jl, CuArrays.jl, and GPUifyLoops.jl packages have helped a lot. Julia has been perfect for this package.

We haven't really considered improving CPU performance yet, as we've been focusing mostly on single-GPU performance (and we've worked a little bit on multi-GPU stuff with MPI). As a result, running the model on a CPU isn't very useful since it only uses one core :( Julia 1.3 seems to have some really promising multithreading support, so it might become easy to parallelize across multiple cores of one CPU. I think when we start improving CPU performance, we'll probably look to Julia 1.3 first. Or maybe it'll be implemented in GPUifyLoops.jl (or a renamed version of the package).

I didn't know about blocking stencil computations and had to look them up. I might be wrong, but it sounds like utilizing shared memory like you suggested. We haven't implemented anything like that yet, but we've been playing around with an …
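To make the multithreading idea concrete, here is a minimal sketch (not Oceananigans code; the function and field names are made up) of how a tendency-style update could be parallelized across CPU cores with the `Threads.@threads` macro that Julia 1.3's multithreading work builds on:

```julia
# Hypothetical sketch: parallelize a simple field update over z-slabs,
# giving each thread a contiguous chunk of the outermost loop.
using Base.Threads

function update_field!(out, in, dt)
    Nx, Ny, Nz = size(out)
    @threads for k in 1:Nz            # threads split the z-slabs
        for j in 1:Ny, i in 1:Nx
            @inbounds out[i, j, k] = in[i, j, k] + dt * in[i, j, k]
        end
    end
    return out
end

A = rand(16, 16, 8)
B = similar(A)
update_field!(B, A, 0.1)
```

Run with `JULIA_NUM_THREADS=4 julia ...` (or `julia -t 4` on newer versions) to actually use multiple cores; with one thread the loop runs serially.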
Thank you for the links! The Oceananigans source code is very clearly written, so the reading is relatively OK (though I could use some explanations of your closure usage). If it were up to me, I would prefer extended documentation for GPUifyLoops ;) I only understand how it works by reading what you do with it.

GPU shared memory is basically a programmable cache, while a CPU's cache can't (easily) be controlled. In both cases there is a cache, so if you compute multiple partial derivatives of a given (set of) field(s) (d/dx, d/dy, d2/dx2, ...), once a block has been fetched into the cache the memory operations are cheap. The cache is also useful for performing tiny transpositions, enabling fast access and vectorized (SIMD) CPU or GPU operations in the X, Y, and Z directions. I hope that obtaining efficient code for both (multicore SIMD) CPUs and GPUs may be possible by adjusting the (recursive?) block sizes (i.e. controlling the data layout and adapting it to the computing target).

I will try to use part of your code to rewrite the toy 2D CFD solver I translated from Matlab (https://discourse.julialang.org/t/asynchronous-makie/27127/9?u=laurentplagne). Kudos again to your team for this inspiring package.

Laurent
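A minimal sketch of the cache-blocking idea described above, applied to a centered x-derivative (the tile sizes, function name, and periodic boundary treatment are all illustrative assumptions, not Oceananigans API):

```julia
# Hypothetical cache-blocking sketch: sweep the field tile by tile so each
# block stays in cache while its points are reused by the stencil.
function ddx_blocked!(out, f, dx; bi=64, bj=64)
    Nx, Ny = size(f)
    for j0 in 1:bj:Ny, i0 in 1:bi:Nx          # loop over tiles
        for j in j0:min(j0 + bj - 1, Ny), i in i0:min(i0 + bi - 1, Nx)
            ip = i == Nx ? 1 : i + 1          # periodic in x
            im = i == 1  ? Nx : i - 1
            @inbounds out[i, j] = (f[ip, j] - f[im, j]) / (2dx)
        end
    end
    return out
end
```

The same tiling loop can host several derivative kernels at once (d/dx, d/dy, d2/dx2, ...) so each tile is fetched into cache only once, which is where the blocking payoff comes from.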
@glwagner is the mastermind behind the implementation of the LES closures. They're still not documented here, but he's written up some documentation about them in a different package: https://dedales.readthedocs.io/en/latest/summaryclosures.html
Yeah, GPUifyLoops could do with an extra example or two... I have an old, still-open PR showing an example of a 3D stencil computation, but you probably already know how to do this: vchuravy/GPUifyLoops.jl#18

Ah interesting, yeah, I found papers online that described a 50-80% speedup, but it sounds like a lot of manual work which Julia may be able to do for us :) We've barely thought about CPU performance, let alone SIMD, but maybe something like GPUifyLoops.jl can figure out the multicore SIMD code.
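For reference, a serial plain-Julia version of the kind of 3D stencil kernel being discussed (a 7-point Laplacian over interior points). This is a sketch only, not the code from vchuravy/GPUifyLoops.jl#18:

```julia
# Plain-Julia 7-point 3D Laplacian over interior points, as a serial
# reference for a 3D stencil kernel; h is the (uniform) grid spacing.
function laplacian3d!(out, f, h)
    Nx, Ny, Nz = size(f)
    for k in 2:Nz-1, j in 2:Ny-1, i in 2:Nx-1
        @inbounds out[i, j, k] = (f[i+1, j, k] + f[i-1, j, k] +
                                  f[i, j+1, k] + f[i, j-1, k] +
                                  f[i, j, k+1] + f[i, j, k-1] -
                                  6f[i, j, k]) / h^2
    end
    return out
end
```

The GPUifyLoops approach replaces the plain `for` loops with annotated loops so the same kernel body can be launched on either the CPU or a GPU.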
That's awesome! Do let us know if you have any questions. That's a pretty cool use of Makie with …
Hi,
Very impressed by and interested in this project!
I wonder if you have plans to execute complex operators with some blocking in order to accelerate CPU computations. I have the feeling that it could also be beneficial for GPUs via shared memory.
Anyway I am impressed by the neat design and the multitarget code architecture.
I hope that the Oceananigans example will help the HPC community to switch to Julia 😀