Blocked computations on the cpu #333
Hey @LaurentPlagne, thank you for your kind words! The code and documentation definitely need to be improved, and the CUDAnative.jl, CuArrays.jl, and GPUifyLoops.jl packages have helped a lot. Julia has been perfect for this package.

We haven't really considered improving CPU performance yet, as we've been focusing mostly on single-GPU performance (and we've worked a little bit on multi-GPU stuff with MPI). As a result, running the model on a CPU isn't very useful since it only uses one core :( Julia 1.3 seems to have some really promising multithreading support, so it might become easy to parallelize across multiple cores of one CPU. I think when we start improving CPU performance, we'll probably look to Julia 1.3 first. Or maybe it'll be implemented in GPUifyLoops.jl (or a renamed version of the package).

I didn't know about blocking stencil computations and had to look them up. I might be wrong, but it sounds like utilizing shared memory like you suggested. We haven't implemented anything like that yet, but we've been playing around with an …
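To make the multithreading idea concrete, here is a minimal sketch (not Oceananigans code; the function and field names are made up) of how a tendency-style update could be parallelized across CPU cores with the `Threads.@threads` macro that Julia 1.3's multithreading work builds on:

```julia
# Hypothetical sketch: parallelize a simple field update over z-slabs,
# giving each thread a contiguous chunk of the outermost loop.
using Base.Threads

function update_field!(out, in, dt)
    Nx, Ny, Nz = size(out)
    @threads for k in 1:Nz            # threads split the z-slabs
        for j in 1:Ny, i in 1:Nx
            @inbounds out[i, j, k] = in[i, j, k] + dt * in[i, j, k]
        end
    end
    return out
end

A = rand(16, 16, 8)
B = similar(A)
update_field!(B, A, 0.1)
```

Run with `JULIA_NUM_THREADS=4 julia ...` (or `julia -t 4` on newer versions) to actually use multiple cores; with one thread the loop runs serially.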
Thank you for the links! The Oceananigans source code is very clearly written, so the reading is relatively OK (though I could use some explanations of your closure usage). If it were up to me, I would prefer extended documentation for GPUifyLoops ;) I only understand how it works by reading what you do with it.

GPU shared memory is basically a programmable cache, while a CPU's cache can't (easily) be controlled. In both cases there is a cache, so if you compute multiple partial derivatives of a given (set of) field(s) (d/dx, d/dy, d2/dx2, ...), once a block has been fetched into the cache the memory operations are cheap. The cache is also useful for performing tiny transpositions, enabling fast access and vectorized (SIMD) CPU or GPU operations in the X, Y, and Z directions. I hope that obtaining efficient code for both (multicore SIMD) CPUs and GPUs may be possible by adjusting the (recursive?) block sizes (i.e. controlling the data layout and adapting it to the computing target).

I will try to use part of your code to rewrite the toy 2D CFD solver I translated from Matlab (https://discourse.julialang.org/t/asynchronous-makie/27127/9?u=laurentplagne). Kudos again to your team for this inspiring package.

Laurent
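A minimal sketch of the cache-blocking idea described above, applied to a centered x-derivative (the tile sizes, function name, and periodic boundary treatment are all illustrative assumptions, not Oceananigans API):

```julia
# Hypothetical cache-blocking sketch: sweep the field tile by tile so each
# block stays in cache while its points are reused by the stencil.
function ddx_blocked!(out, f, dx; bi=64, bj=64)
    Nx, Ny = size(f)
    for j0 in 1:bj:Ny, i0 in 1:bi:Nx          # loop over tiles
        for j in j0:min(j0 + bj - 1, Ny), i in i0:min(i0 + bi - 1, Nx)
            ip = i == Nx ? 1 : i + 1          # periodic in x
            im = i == 1  ? Nx : i - 1
            @inbounds out[i, j] = (f[ip, j] - f[im, j]) / (2dx)
        end
    end
    return out
end
```

The same tiling loop can host several derivative kernels at once (d/dx, d/dy, d2/dx2, ...) so each tile is fetched into cache only once, which is where the blocking payoff comes from.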
@glwagner is the mastermind behind the implementation of the LES closures. They're still not documented here, but he's written up some documentation about them in a different package: https://dedales.readthedocs.io/en/latest/summaryclosures.html
Yeah, GPUifyLoops could do with an extra example or two... I have an old, still-open PR showing an example of a 3D stencil computation, but you probably already know how to do this: vchuravy/GPUifyLoops.jl#18

Ah interesting, yeah, I found papers online that described a 50-80% speedup, but it sounds like a lot of manual work which Julia may be able to do for us :) We've barely thought about CPU performance, let alone SIMD, but maybe something like GPUifyLoops.jl can figure out the multicore SIMD code.
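For reference, a serial plain-Julia version of the kind of 3D stencil kernel being discussed (a 7-point Laplacian over interior points). This is a sketch only, not the code from vchuravy/GPUifyLoops.jl#18:

```julia
# Plain-Julia 7-point 3D Laplacian over interior points, as a serial
# reference for a 3D stencil kernel; h is the (uniform) grid spacing.
function laplacian3d!(out, f, h)
    Nx, Ny, Nz = size(f)
    for k in 2:Nz-1, j in 2:Ny-1, i in 2:Nx-1
        @inbounds out[i, j, k] = (f[i+1, j, k] + f[i-1, j, k] +
                                  f[i, j+1, k] + f[i, j-1, k] +
                                  f[i, j, k+1] + f[i, j, k-1] -
                                  6f[i, j, k]) / h^2
    end
    return out
end
```

The GPUifyLoops approach replaces the plain `for` loops with annotated loops so the same kernel body can be launched on either the CPU or a GPU.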
That's awesome! Do let us know if you have any questions. That's a pretty cool use of Makie with …
Hi,
Very impressed by and interested in this project!
I wonder if you have plans to execute complex operators with some blocking in order to accelerate CPU computations. I have the feeling that it could also be beneficial for GPUs via shared memory.
Anyway I am impressed by the neat design and the multitarget code architecture.
I hope that the Oceananigans example will help the HPC community to switch to Julia 😀