
Mutable arrays / lower level interface #86

Open
js6i opened this Issue · 11 comments

3 participants

@js6i

Hi,

I figured this would be a good place to post this.

To allow more memory reuse, one could use a lower-level interface bridging between Accelerate and CUDA. Something along the lines of:

ptr1 <- getDevicePtr $ use initialData
ptr2 <- getDevicePtr newDeviceArray
krn <- getKernel algorithm
result <- sequence . take steps . cycle $ [krn ptr1 ptr2, krn ptr2 ptr1]

Here krn p1 p2 would use p1 as the input array and p2 as the output array.
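For illustration, the ping-pong idiom itself can be sketched in plain Haskell, with no CUDA involved -- here `step` is a hypothetical stand-in for a kernel launch, and two IORefs play the role of the device pointers (a sketch of the idiom, not Accelerate's actual API):

```haskell
import Data.IORef

-- Hypothetical "kernel": a pure function on lists, standing in for a
-- CUDA kernel that reads one device buffer and writes another.
step :: [Double] -> [Double]
step = map (* 2)

-- Run n steps, alternating which buffer is input and which is output,
-- mirroring the krn ptr1 ptr2 / krn ptr2 ptr1 cycle above. No fresh
-- buffer is ever allocated inside the loop.
runSteps :: Int -> [Double] -> IO [Double]
runSteps n initial = do
  buf1 <- newIORef initial
  buf2 <- newIORef (replicate (length initial) 0)
  let go 0 src _   = readIORef src
      go k src dst = do
        x <- readIORef src
        writeIORef dst (step x)  -- "launch": src is input, dst is output
        go (k - 1) dst src       -- swap roles for the next iteration
  go n buf1 buf2
```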

In my case this is motivated by the fact that my kernel takes at most a few hundred ms to execute, while a whole step including allocation takes a few seconds. With hundreds of steps, and enough samples for statistics, this is unacceptable (and obviously I'd love to do my stuff in Haskell and not C++).

Would it be a difficult task to add this feature? Or maybe there's another way of reusing memory? I tried keeping my loop in Acc, but that was no good either (according to the -ddump-gc output, arrays were reallocated anyway, and I couldn't do anything between steps).

Thanks for your consideration
Jan Sikorski

@mchakravarty
Owner

I can imagine that what you'd like to do is a common idiom. Hence, I wonder whether we can find a better high-level way to specify this idiom in Accelerate, such that it can generate efficient code for you.

Have you maybe got an example program (perhaps a cut-down version of your application) that exemplifies your problem? (I find it much easier to work from a concrete example.)

@js6i

A slightly more general case is when the next step depends on a set of the N previous steps. Then you would cycle [[1..N] -> o, [2..N, o] -> 1, [3..N, o, 1] -> 2, etc.]. Although it appears that stencils currently only support N = 1, 2. For non-stencil operations with N = 1, it would also allow modifying the array in place. So a somewhat higher-level function could take the inputs, and maybe an action to execute in between steps, and return the result.
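The N-history variant can likewise be sketched as a pure rotation of buffers -- a toy model where `combine` is a hypothetical step building the next state from the N previous ones, and the oldest entry is dropped (in the device version, its buffer would be recycled as the new output):

```haskell
-- hist holds the N most recent states, newest first. Each step computes
-- a new state from the whole history and retires the oldest one.
stepN :: ([[Double]] -> [Double])  -- next state from the N previous states
      -> Int                       -- number of steps to run
      -> [[Double]]                -- initial history, newest first
      -> [[Double]]
stepN _       0 hist = hist
stepN combine k hist = stepN combine (k - 1) (combine hist : init hist)
```

For example, with N = 2 and a combine that sums the two most recent states element-wise, this iteration produces Fibonacci numbers.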

Anyway, given that with a C-like interface it is trivial to implement such schemes (or maybe I'm missing something), I'd suggest exporting appropriate building blocks. Especially since it will probably be rather difficult, if not impossible, to anticipate all possible schemes involving mutation and put them behind a pure, non-monadic interface.

As for a concrete example -- any iterated kernel would probably benefit from this approach. My code solves a 3+1D PDE for relativistic hydrodynamics and is rather crappy at the moment (thrown together in a couple of days), but if you wish I can send it to you.

JS

@mchakravarty
Owner

> Anyway given that with a C-like interface it is trivial to implement such schemes (or maybe I'm missing something) I'd suggest exporting appropriate building blocks. Especially that it will probably be rather difficult if not impossible to anticipate all possible schemes involving mutation and put it in a pure, non monadic interface.

Possibly. However, this is what we do (and the whole point of Accelerate): get out of the morass of imperative, stateful interfaces and find suitable high-level abstractions.

Having said that, we (to be precise, @robeverest) recently added support for a foreign interface to use native CUDA code within Accelerate. We might want to think about the converse, too: using Accelerate computations from native CUDA. @robeverest, what do you think?

Nevertheless, even if we can "foreign export" Accelerate computations to CUDA, I'd still like to find a high-level pattern for your use case, because, I think, it is a common one, worth abstracting.

@tmcdonell What do you think? Would this situation fit into any of the thoughts we had about a combinator for iteration? In a sense, maybe, we'd like a kind of converge operator that repeats a stencil (or similar) until a condition is satisfied.

@js6i

Fair enough. I agree that such an abstraction is desirable. So the most pragmatic question: can I somehow achieve this effect quickly (even with an ugly hack), without major changes to Accelerate? My stuff is a university project and also the topic of my master's thesis, so I'd like to advertise Haskell and Accelerate a bit -- the Accelerate kernel is as fast as, or even faster than, our current C++ implementation, and thanks to JIT compilation there are more nice runtime features as well.

JS

@mchakravarty

@tmcdonell Have you got any ideas for a quick hack to get @js6i going?

@tmcdonell
Collaborator

Hi Jan, sorry for the slow reply...

To try and narrow down the problem a bit first:

It looks like for each step of your solver you need to iterate a bit until convergence, and the convergence iteration is where you would have used the pointer-swapping technique -- is that correct? I'm wondering how your program is set up at the moment, and specifically how many times you call run*. Note that run* will copy its result back to the host, so if this happens as part of the convergence iteration, that is not going to be good. We could provide a version of run* that doesn't do the final copy, but it would be better if the Accelerate program fired off steps kernels in a single call to run*.

If your program is the latter (a single call to run* launches many kernels), then the problem really is device-side memory allocation. I noticed this recently, actually; it seems that sometimes the CUDA API will pause for a while trying to allocate memory. Running your program through the CUDA profiler will tell you whether this is happening in your program. I think it wouldn't be too hard to be a bit cleverer with memory de/allocation --- basically, don't free device memory straight away, but keep it in a "nursery" of sorts so it can be reused instead of allocating a fresh array every time.
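The nursery idea can be modelled as a pool of retired buffers keyed by allocation size -- a toy sketch, not accelerate-cuda's actual implementation, and the buffers here are just placeholder values:

```haskell
import qualified Data.Map.Strict as Map

-- A pool of retired buffers, keyed by size in bytes.
type Nursery b = Map.Map Int [b]

-- Try to reuse a retired buffer of the requested size; Nothing means
-- a real device allocation is required.
allocFrom :: Int -> Nursery b -> (Maybe b, Nursery b)
allocFrom sz pool = case Map.lookup sz pool of
  Just (b : bs) -> (Just b, Map.insert sz bs pool)
  _             -> (Nothing, pool)

-- Instead of freeing device memory straight away, stash it for reuse.
stash :: Int -> b -> Nursery b -> Nursery b
stash sz b = Map.insertWith (++) sz [b]
```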

If you could provide a (preferably trimmed down) example program, I think that would help a lot.

Additionally, using run1 in your program instead of run is a lot quicker if you are repeatedly applying the same computation (which it sounds like you are doing).
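As a sketch of that suggestion (assuming the accelerate and accelerate-cuda packages are installed; `update` and the array types are placeholders for the actual solver step, so this is illustrative rather than runnable as-is):

```haskell
import qualified Data.Array.Accelerate      as A
import qualified Data.Array.Accelerate.CUDA as CUDA

-- Placeholder for the real solver step.
update :: A.Acc (A.Vector Double) -> A.Acc (A.Vector Double)
update = A.map (* 2)

-- run1 compiles 'update' once and returns a reusable array function,
-- so repeated applications skip code generation entirely.
step :: A.Vector Double -> A.Vector Double
step = CUDA.run1 update

simulate :: Int -> A.Vector Double -> A.Vector Double
simulate n x0 = iterate step x0 !! n
```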

@js6i

Hi,

OK so I made a little example. Here it is:
http://www.fileswap.com/dl/EW0i7HOfty/

It basically does two loops -- one inside, and one outside, run1.
The results are such that the outside loop takes roughly 1.5 times longer to run than steps x kernel time, due (mainly) to memory copying and reallocation.
On the other hand, the big kernel with the inside loop has a rather big gap before executing, and then proceeds as expected (faster than the outside loop, but still with memory reallocation).
It's not compilation time, since the next execution of the same big kernel on different data has this gap again.

So in my program it's probably a combination of the two -- my kernel is quite big (and does some iteration inside, if that's relevant), and I iterate it outside of run1 (for the time propagation of the system).

Btw. sorry for the duplicate post on the mailing list -- I thought it had been deleted and would not appear.

JS

@tmcdonell
Collaborator

I pushed some patches that attempt to reduce the number of allocations. For your example it doesn't change things too much, but once you scale up the number of iterations good things happen. For the fluid simulation program in accelerate-examples, which (from memory) computes 120 stencil operations per frame, the runtime drops ~20%.

If you compile accelerate-cuda in debug mode (cabal install -fdebug), you'll see it in action. The first call to loop1 will spit out lots of lines containing "malloc/new", and subsequent calls should be all "malloc/nursery". The latter isn't calling out to the device to get fresh memory; it's reusing an old pointer.

Let me know how it goes for your program!

@js6i

Hi,

I just tested your changes (thanks for that, nice work!) and confirmed that it works nicely for the example program.

Sadly, it did not fix the problem in my real program (though the gap may be smaller now). I tried cranking up the iteration steps in the example and noticed that all the bad stuff comes back -- the long gap before execution, and output array reallocation. At 1000 iterations I even get a CUDA out-of-memory error at some point.
Do you have any idea where that may come from?

Another problem I noticed (not directly related to this one) is iteration deeper inside the kernel. In the example I made a little 10-step iteration of sin and cos. With 100 steps or so, nvcc already takes ridiculously long to compile. I believe adding an iteration primitive was discussed somewhere -- is that still planned?

JS

@tmcdonell
Collaborator

I see what you mean about cranking up the iterations. As a workaround, if you change the definition of loop1 to

loop1 steps = run1 $ foldl1 (>->) (P.replicate steps update)

then I don't get the out-of-memory problem anymore and the program runs a bit quicker. The (>->) operator can be used to separate parts of the program. We have that (acc1 >-> acc2) is similar to (acc2 . acc1), except that acc1 and acc2 will not share any subcomputations. This gives my crappy simplifier smaller chunks to work with (which will be fixed at some point… hopefully), but it also helps the GC know when it can clear out old, unused arrays.
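For intuition, here is a pure analogue of that loop1 definition (assumption: update here is just an ordinary function, whereas in Accelerate each stage additionally becomes a separate compilation unit):

```haskell
-- foldl1 (>->) (replicate steps update) chains 'steps' copies of the
-- stage left to right, exactly like chaining pure functions with
-- flip (.): the output of one stage feeds the input of the next.
pipeline :: Int -> (a -> a) -> (a -> a)
pipeline steps update = foldl1 (flip (.)) (replicate steps update)
```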

An iteration primitive is still on the cards, yes, but I haven't had a chance to work on that in a while, sorry.
