
Blocked computations on the cpu #333

Closed
LaurentPlagne opened this issue Aug 5, 2019 · 3 comments
Labels
question 💭 No such thing as a stupid question

Comments

@LaurentPlagne

Hi,
Very impressed by and interested in this project!
I wonder if you have plans to execute complex operators with some blocking in order to accelerate CPU computations. I have the feeling that it could also be beneficial for GPUs via shared memory.
Anyway, I am impressed by the neat design and the multi-target code architecture.
I hope that the Oceananigans example will help the HPC community to switch to Julia 😀

@ali-ramadhan ali-ramadhan added the question 💭 No such thing as a stupid question label Aug 5, 2019
@ali-ramadhan
Member

Hey @LaurentPlagne thank you for your kind words!

The code and documentation definitely need to be improved, and the CUDAnative.jl, CuArrays.jl, and GPUifyLoops.jl packages have helped a lot. Julia has been perfect for this package.

We haven't really considered improving CPU performance yet, as we've been focusing mostly on single-GPU performance (and we've worked a little bit on multi-GPU stuff with MPI). As a result, running the model on a CPU isn't very useful, as it only uses one core :(

Julia 1.3 seems to have some really promising multithreading so it might become easy to parallelize across multiple cores of one CPU. I think when we start improving CPU performance, we'll probably look to Julia 1.3 first. Or maybe it'll be implemented in GPUifyLoops.jl (or a renamed version of the package).
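To give an idea, here is a minimal sketch of what threading a simple kernel might look like with Julia 1.3's `Threads.@threads` (the kernel and names below are made up for illustration, not Oceananigans code):

```julia
# Toy sketch (not Oceananigans code): double every element of a 3D field,
# splitting the work over the vertical index k across the available threads.
using Base.Threads

function double_field!(dest, src)
    nx, ny, nz = size(dest)
    @threads for k in 1:nz              # one chunk of k-slices per thread
        for j in 1:ny
            @inbounds @simd for i in 1:nx
                dest[i, j, k] = 2 * src[i, j, k]
            end
        end
    end
    return dest
end

A = rand(64, 64, 32)
B = similar(A)
double_field!(B, A)   # start Julia with e.g. JULIA_NUM_THREADS=4 to use 4 threads
```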

I didn't know about blocking stencil computations and had to look them up. I might be wrong, but it sounds similar to utilizing shared memory as you suggested. We haven't implemented anything like that yet, but we've been playing around with an @stencil macro abstraction implemented in GPUifyLoops.jl (vchuravy/GPUifyLoops.jl#81) in PR #293.

@LaurentPlagne
Author

Thank you for the links!

The Oceananigans source code is very clearly written, so reading it is relatively easy (though I could use some explanation of your closure usage). If it were up to me, I would prefer extended documentation of GPUifyLoops ;) I only understand how it works by reading what you do with it.

GPU shared memory is basically a programmable cache, while the CPU cache can't (easily) be controlled. In both cases there is a cache, so if you compute multiple partial derivatives of a given (set of) field(s) (d/dx, d/dy, d²/dx², ...), then once a block has been fetched into the cache the memory operations are cheap. The cache is also useful for performing small transpositions, enabling fast access and vectorized (SIMD) CPU or GPU operations in the X, Y, or Z directions.
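To make the idea concrete, here is a toy cache-blocking sketch in plain Julia (nothing from Oceananigans or GPUifyLoops, all names made up): once a tile of `f` is resident in cache, several derivatives are computed from it before the next tile is fetched.

```julia
# Toy cache-blocking sketch (illustrative only): compute two centered
# derivatives of a 2D field tile by tile, so each tile of `f` is reused
# from cache for both outputs. Interior points only; boundaries untouched.
function derivatives_blocked!(dfdx, dfdy, f, Δx, Δy; bi = 64, bj = 64)
    nx, ny = size(f)
    for j0 in 2:bj:ny-1, i0 in 2:bi:nx-1                  # loop over tiles
        for j in j0:min(j0 + bj - 1, ny - 1)
            @inbounds @simd for i in i0:min(i0 + bi - 1, nx - 1)
                dfdx[i, j] = (f[i+1, j] - f[i-1, j]) / (2Δx)
                dfdy[i, j] = (f[i, j+1] - f[i, j-1]) / (2Δy)
            end
        end
    end
    return dfdx, dfdy
end

f = rand(512, 512)
dfdx, dfdy = similar(f), similar(f)
derivatives_blocked!(dfdx, dfdy, f, 1.0, 1.0)
```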

I hope that obtaining efficient code for both (multicore SIMD) CPUs and GPUs may be possible by adjusting the (recursive?) block sizes (i.e. controlling the data layout and adapting it to the computing target).

I will try to use part of your code to rewrite the toy 2D CFD solver I have translated from Matlab (https://discourse.julialang.org/t/asynchronous-makie/27127/9?u=laurentplagne).

Kudos again to your team for this inspiring package.

Laurent

@ali-ramadhan
Member

> I could use some explanation of your closure usage

@glwagner is the mastermind behind the implementation of the LES closures. They're not documented here yet, but he's written up some documentation about them for a different package: https://dedales.readthedocs.io/en/latest/summaryclosures.html

> If it were up to me, I would prefer extended documentation of GPUifyLoops ;) I only understand how it works by reading what you do with it.

Yeah, GPUifyLoops could do with an extra example or two... I have an old, still-open PR showing an example of a 3D stencil computation, but you probably already know how to do this: vchuravy/GPUifyLoops.jl#18

Ah interesting, yeah I found papers online that described a 50-80% speedup, but it sounds like a lot of manual work that Julia may be able to do for us :) We've barely thought about CPU performance, let alone SIMD, but maybe something like GPUifyLoops.jl can figure out the multicore SIMD code.

> I will try to use part of your code to rewrite the toy 2D CFD solver I have translated from Matlab (https://discourse.julialang.org/t/asynchronous-makie/27127/9?u=laurentplagne).

That's awesome! Do let us know if you have any questions. That's a pretty cool use of Makie with fillrange=true!
