
implement simple memory "working space" #138

Open · VinInn opened this issue Aug 19, 2018 · 31 comments

@VinInn commented Aug 19, 2018

Many GPU algorithms require global data structures as a "working space".
In some cases the data structures are used to communicate between the various kernels that compose a more complex algo (encapsulated in an EDProducer).
At the moment these data structures are allocated at the beginning of the job by each EDProducer.
It should be easy to create just one arena, large enough for the most greedy algo, and then allocate those data structures in it. The arena will be local to each stream (as the current data structures are).
Concurrent access is not possible as kernels are sequential in each stream.
No host2dev or dev2host memcpy will be supported (even if it should be safe, as again any previous activity must have finished before those operations can be scheduled on the stream).

The interface would be trivial: an "init" (or clear, or acquire) method that zeroes the allocated-memory counter, and an "alloc(nBytes)" method that returns a pointer (8-byte aligned?) and increments the counter.
It throws bad_alloc if the preallocated working space is too small.
In principle the counter can be local; the only global quantities shall be the pointer to the working space and its size.
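
A minimal sketch of such an arena, assuming a buffer pre-allocated once per stream (class and member names here are illustrative, not an existing API):

#include <cstddef>
#include <new>  // std::bad_alloc

class Workspace {
public:
  // "buffer" is the pre-allocated memory (e.g. from a single cudaMalloc per stream).
  Workspace(void* buffer, std::size_t size) : base_(static_cast<char*>(buffer)), size_(size) {}

  // "init" / "clear" / "acquire": zero the allocated-memory counter.
  void clear() { offset_ = 0; }

  // Return an 8-byte aligned pointer and advance the counter;
  // throw bad_alloc if the pre-allocated working space is too small.
  void* alloc(std::size_t nBytes) {
    std::size_t aligned = (offset_ + 7) & ~std::size_t(7);
    if (aligned + nBytes > size_) throw std::bad_alloc();
    offset_ = aligned + nBytes;
    return base_ + aligned;
  }

private:
  char* base_;              // the only "global" quantities: pointer and size
  std::size_t size_;
  std::size_t offset_ = 0;  // the allocated-memory counter
};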

This scheme can go wrong only if we allow independent EDProducers that launch kernels on the same stream to be scheduled concurrently (on different cpu-threads). This is in principle safe in itself, as the memcpys and the kernels will happily queue in the cuda stream while the parent EDProducers continue their async activity. In case of a shared working space, though, they will overwrite each other's data structures (unless each algo is made of just one kernel and no memcpy into the working space is allowed).

Any more complex solution would immediately require a fully fledged malloc with a garbage collector, etc.

@fwyzard commented Aug 19, 2018

I'd rather see a full clean up of the data structures being used to pass data among different kernels of the same producer, and across different producers, first.

@makortel

I fully agree that we will eventually need a more dynamic approach for the device memory.

Let me ask for a clarification to see if I understood correctly:

The arena will be local to each stream (as the current data structures are). Concurrent access is not possible as kernels are sequential in each stream.
...
This scheme can go wrong only if we allow independent EDProducers that launch kernels on the same stream to be scheduled concurrently (on different cpu-threads).

Does "stream" refer to EDM stream in all these cases?

So essentially each EDProducer, at the beginning of acquire(), first resets the "workspace" for that EDM stream and then allocates from it the scratch memory that will be released at the end of produce()?

@VinInn commented Aug 20, 2018

Does "stream" refer to EDM stream in all these cases?

Not really: to the CUDA stream, but I think it applies to the EDM stream as well.

So essentially each EDProducer, at the beginning of acquire(), first resets the "workspace" for that EDM stream and then allocates from it the scratch memory that will be released at the end of produce()?

Yes. In reality there is no need to communicate anything back to anybody: an EDProducer just receives a pointer to the arena and its size. A utility class will "help" allocate memory for the required data structures (just to avoid trivial byte-counting errors).
In the good old days there was a common block, say /WS/ws(5000), and every routine just declared its own arrays in it... /WS/myI(1000),myf(1000)
(the only shared info was the name of the common block).
Same here.
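
For illustration only, a hypothetical helper in the spirit of the common-block analogy, building on the Workspace sketch above (none of these names exist in the framework):

#include <cuda_runtime.h>

// Typed wrapper around Workspace::alloc, to avoid trivial byte-counting errors.
template <typename T>
T* allocArray(Workspace& ws, std::size_t n) {
  return static_cast<T*>(ws.alloc(n * sizeof(T)));
}

void acquire(Workspace& ws, cudaStream_t stream) {
  ws.clear();                                // reset the arena for this module
  int*   myI = allocArray<int>(ws, 1000);    // the modern /WS/myI(1000)
  float* myF = allocArray<float>(ws, 1000);  // the modern /WS/myf(1000)
  // ... launch kernels on "stream" that use myI and myF as scratch space ...
  // Nothing is handed back to anybody: the next clear() reclaims everything.
}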

@fwyzard commented Aug 20, 2018

So this is only to avoid calling cudaMalloc and cudaMallocHost (or equivalent) for every event ?
Whatever memory is being obtained by the pool is not used to hold data which is shared across different EDM modules ?

@VinInn commented Aug 20, 2018

So this is only to avoid calling cudaMalloc and cudaMallocHost (or equivalent) for every event ?

Yes. My understanding is that it is expensive. If not, there is no need for such an old-school solution.

Whatever memory is being obtained by the pool is not used to hold data which is shared across different EDM modules ?

Definitely NOT. Data shared across different EDM modules are "GPU-specific" event data.

@fwyzard commented Aug 20, 2018

OK, understood, thanks.

@makortel

Thanks for the clarifications. Below I'm thinking out loud.

Currently each EDProducer has a CUDA stream for each EDM stream. In #100 a "chain of modules" in an EDM stream shares a CUDA stream. So there the assumption "only one module per EDM stream doing GPU work" may not hold anymore. Well, actually it does not hold in the current system either, as there is nothing preventing two independent EDProducers (doing GPU work) in a single EDM stream from being run in parallel.

But do we actually have to tie the "workspace" to EDM/CUDA streams? Couldn't we (rather easily) go one step further and provide N workspaces of size M MB (or split a given total memory into slices of size M?), so that an EDProducer in acquire() just requests a workspace and is given one if any is free (and an exception is thrown if not)?

In the long term I'm a bit concerned about the very different allocation mechanisms for the "workspace" and the "products to event". I'm sure we can manage it, but it is an additional source of easy mistakes.

In the context of #100 another downside is that naively it prevents the "streaming mode" (if we ever really make use of it...), because the EDProducer should not "return" before all the kernels have finished. A possible way to overcome this limitation would be to enqueue on the CUDA stream, after the kernels, a callback function that releases the workspace. I believe these details could even be abstracted behind CUDAScopedContext.
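
A rough sketch of that last idea, assuming a hypothetical WorkspacePool holding N pre-allocated fixed-size workspaces; the release is enqueued on the CUDA stream, so the module can return before the kernels finish (illustrative only, not the CUDAScopedContext API):

#include <cuda_runtime.h>
#include <mutex>
#include <stdexcept>
#include <vector>

struct WorkspacePool {
  std::mutex mtx;
  std::vector<void*> freeList;  // filled at startup with N buffers of M bytes (cudaMalloc)

  void* acquire() {
    std::lock_guard<std::mutex> lock(mtx);
    if (freeList.empty()) throw std::runtime_error("no free workspace");
    void* ws = freeList.back();
    freeList.pop_back();
    return ws;
  }

  void release(void* ws) {
    std::lock_guard<std::mutex> lock(mtx);
    freeList.push_back(ws);
  }
};

struct ReleaseArgs { WorkspacePool* pool; void* ws; };

// Host callback: runs only after all work previously queued on the stream has completed.
void CUDART_CB releaseWorkspace(cudaStream_t, cudaError_t, void* userData) {
  auto* args = static_cast<ReleaseArgs*>(userData);
  args->pool->release(args->ws);
  delete args;
}

void produceAsync(WorkspacePool& pool, cudaStream_t stream) {
  void* ws = pool.acquire();  // throws if none is free
  // ... launch kernels on "stream" that use ws as scratch space ...
  cudaStreamAddCallback(stream, releaseWorkspace, new ReleaseArgs{&pool, ws}, 0);
}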

@cmsbot commented Aug 21, 2018

A new Issue was created by @VinInn Vincenzo Innocente.

can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@fwyzard commented Aug 22, 2018

Rather than reinventing a (simple) memory allocator, what about reusing something like https://github.com/FelixPetriconi/AllocatorBuilder ?
Or any other found on the web and decently supported ...

@VinInn commented Aug 22, 2018

As said in my original posting:
"Any more complex solution would immediately require a fully fledged malloc with a garbage collector, etc."

If one finds something that suits our needs and fits our framework, I am not against it.

@makortel commented Sep 4, 2018

I took a look at a couple of options.

[edited by @fwyzard]

[/edited]

So far I haven't encountered anything that sounds like a perfect fit for us. That may be because I don't know (or have a wrong idea of) what exactly we want. Some random thoughts below:

  • Do we want to allocate device memory only on the host, only on the device, or on both? (so far I've assumed the first)
    • For event data, memory allocated on the host sounds easiest to deal with
      • we will anyway need some device memory allocated from the host to be able to deliver data from a kernel
    • For "workspace" data the allocation location may not matter
      • if on-device allocations become possible, I'm sure we'll see imaginative uses of that
    • If both, the two allocator instances must communicate (= synchronize) with each other, which sounds costly
  • Event data has a lifetime of (at most) the event, while "workspace" data has a lifetime of a module
    • If the memory for these is served from a common (per EDM stream) pool of memory, that pool can be fully reset at the end of the event

@VinInn commented Sep 10, 2018

I think we should test the "CUB caching allocator".
From the description it behaves as a typical allocator in limited memory.
It will suffer from "random-time garbage collection": constant latency is not a real requirement for us, though.

I will try to implement it in one of my "Unit" tests...

@makortel commented Sep 10, 2018 via email

@VinInn commented Sep 10, 2018

@makortel,
great!
I think we can wait for your full show-case in raw2cluster

@makortel

My experiment with the CachingDeviceAllocator can be found (finally) here #172.

@fwyzard commented Dec 3, 2018

Summary of the chat with @makortel regarding the behaviour of the caching allocator, after looking at the code for the cub::CachingDeviceAllocator.

For large memory chunks (bigger than the largest bin):

  • allocations are synchronous, using cudaMalloc
  • deallocations are synchronous, using cudaFree

For small memory chunks (up to the size of the largest bin):

  • allocations can be synchronous, using cudaMalloc or memory returned to the pool from an idle CUDA stream, or asynchronous, reusing memory "freed" in the current CUDA stream
  • deallocations are asynchronous, marking the memory as reusable by the current CUDA stream

Since work within each CUDA stream is serialised, it is possible to do something along the lines of (pseudocode by @makortel):

dev_mem = allocate(1024, stream);
kernel<<<1,1, 0, stream>>>(dev_mem);
cudaMemcpyAsync(host_mem, dev_mem, 1024, cudaMemcpyDeviceToHost, stream);
free(dev_mem);
dev_mem2 = allocate(1024, stream);
kernel2<<<1,1, 0, stream>>>(dev_mem2);

Here free(dev_mem) marks the chunk used by dev_mem as "available within the stream stream".
dev_mem2 is likely to receive a pointer to the same chunk.
Launching kernel2 using dev_mem2 is assumed to be fine, because the operations associated with the CUDA stream stream are serial:

  • execute kernel that possibly writes to the chunk via dev_mem;
  • read the memory via dev_mem;
  • execute kernel2 that possibly writes to the chunk via dev_mem2.

If the allocator is replaced by direct calls to cudaMalloc/cudaFree, the behaviour changes, and the assumption is no longer valid.
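
For reference, the pseudocode above mapped onto the actual cub::CachingDeviceAllocator calls (kernels and sizes are illustrative):

#include <cub/util_allocator.cuh>

__global__ void kernel(void*) {}
__global__ void kernel2(void*) {}

cub::CachingDeviceAllocator g_allocator;  // default bin configuration

void example(cudaStream_t stream, char* host_mem) {
  void* dev_mem = nullptr;
  g_allocator.DeviceAllocate(&dev_mem, 1024, stream);
  kernel<<<1, 1, 0, stream>>>(dev_mem);
  cudaMemcpyAsync(host_mem, dev_mem, 1024, cudaMemcpyDeviceToHost, stream);
  g_allocator.DeviceFree(dev_mem);        // only marks the chunk as reusable by "stream"

  void* dev_mem2 = nullptr;
  g_allocator.DeviceAllocate(&dev_mem2, 1024, stream);  // likely gets the same chunk back
  kernel2<<<1, 1, 0, stream>>>(dev_mem2);
  g_allocator.DeviceFree(dev_mem2);
}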

@fwyzard commented Dec 3, 2018

I think we should settle on the semantic we want, and then update the allocator to make it consistent with it. I can think of three options:

  1. synchronous: memory operations on the device are synchronous with the host: memory is allocated and deallocated immediately (à la cudaMalloc()/cudaFree())
  2. asynchronous: memory operations on the device are asynchronous from the host, and synchronous with a CUDA stream (or other constructs like CUDA graphs)
  3. lazy: memory allocations on the device are synchronous with the host (cudaMalloc()), while memory deallocations are asynchronous from the host, and synchronous with a CUDA stream.

The cub::CachingDeviceAllocator is doing a mixture of all three: 1. for large memory chunks, 2. for small memory chunks recycled from the pool, 3. for small memory chunks when the allocation pool is grown.

I suspect that what we want for temporary buffers is more along the lines of 2, to avoid issuing a synchronisation every time.
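
As a concrete illustration of option 3, a deallocation can be deferred by recording a CUDA event at the host-side free point and only calling cudaFree() once that event has completed; this is a sketch of the idea under those assumptions, not the CUB implementation:

#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

struct PendingFree {
  void* ptr;
  cudaEvent_t event;
};

std::vector<PendingFree> pendingFrees;

void* lazyAlloc(std::size_t bytes) {
  void* p = nullptr;
  cudaMalloc(&p, bytes);  // the allocation stays synchronous with the host
  return p;
}

void lazyFree(void* p, cudaStream_t stream) {
  cudaEvent_t ev;
  cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
  cudaEventRecord(ev, stream);  // marks the point after which p is no longer used
  pendingFrees.push_back({p, ev});
}

void reclaim() {  // e.g. called before the next allocation
  for (auto it = pendingFrees.begin(); it != pendingFrees.end();) {
    if (cudaEventQuery(it->event) == cudaSuccess) {  // the stream has passed the free point
      cudaFree(it->ptr);
      cudaEventDestroy(it->event);
      it = pendingFrees.erase(it);
    } else {
      ++it;
    }
  }
}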

@fwyzard commented Dec 3, 2018

A few more things to consider:

  • do we want to be able to allocate and deallocate memory from device code ?
  • do we want to be able to allocate memory on the host, and deallocate it from the device (or the opposite) ?
  • do we want to support managed memory ?

@fwyzard commented Dec 4, 2018

Here is an attempt at sketching a possible behaviour of the CachingDeviceAllocator:

host cpu   ----1-2-3-4-5-6----7-8-9-10-11-------------12-13-14-15-16------------
stream 1   ----------------3-4-5-6---7--8-9--10-11------------------------------
stream 2   --------------------------------------------------13-14--15-16-------

host mem   ----1---------------5-------------10---------------------15----------
device mem ------2---------3-4-5----6+7-8-9--10----11+12-----13-14--15----------

 1 prepare data on the host

 2 allocate device memory (associated to stream 1)      <-- cudaMalloc
 3 copy to device memory (async on stream 1)
 4 use device memory (e.g. async kernel on stream 1)
 5 copy from device memory (async on stream 1)
 6 release device memory (associated to stream 1)       <-- happens when reused by 7

 7 allocate device memory (associated to stream 1)      <-- reuse in the same stream
 8 copy to device memory (async on stream 1)
 9 use device memory (e.g. async kernel on stream 1)
10 copy from device memory (async on stream 1)
11 release device memory (associated to stream 1)       <-- happens when reused by 12

12 allocate device memory (associated to stream 2)      <-- reuse in a different stream
13 copy to device memory (async on stream 2)
14 use device memory (e.g. async kernel on stream 2)
15 copy from device memory (async on stream 2)
16 release device memory (associated to stream 2)

The key points are that

  • 7 is delayed until stream 1 is done with 6, because they happen on the same stream
  • 8-11 are delayed until after 7, because they happen on the same stream
  • 12 can reuse the memory released by 11 even though it is on a different stream, because stream 1 is idle (and the allocator has already synchronised with 11) when the request is made

@makortel

I came across Umpire (https://github.com/LLNL/Umpire), which is a "resource management library that allows the discovery, provision, and management of memory on next-generation architectures". I took a quick look, but I am not really convinced (e.g. I didn't see any notes about asynchronous copies).

@VinInn commented Dec 27, 2018 via email

@makortel

I noticed that cutorch (https://github.com/torch/cutorch/) is using a caching allocator for both device and pinned host memory. The exact logic is different from CUB's, but their allocator also considers device/host memory to be in use until all operations queued on the CUDA stream at the point of the host-side free have finished.

@fwyzard commented Dec 29, 2018

I do like that it is supposed to be a drop-in replacement for cudaMalloc()/cudaFree():

By default, cutorch calls cudaMalloc and cudaFree when CUDA tensors are allocated and freed. This is expensive because cudaFree synchronizes the CPU with the GPU. Setting THC_CACHING_ALLOCATOR=1 will cause cutorch to cache and re-use CUDA device and pinned memory allocations to avoid synchronizations.

@makortel commented Jan 2, 2019

Hmm, they write

cudaFree synchronizes the CPU with the GPU

which makes me wonder about the failure of #205. If cudaFree/cudaFreeHost indeed implicitly call cudaDeviceSynchronize (a simple test supports that), #205 should be safe (and my conclusion in #205 (comment) was too quick). Maybe I'll try to play around with #205 again.

Cutorch's caching allocators use similar logic to CUB's caching allocator: host-side-freed memory is kept reserved until a CUDA event, recorded on a stream at the point of the host-side free, has occurred. (They actually go a bit beyond CUB's allocator by having a mechanism to do the same for all CUDA streams that read from the memory block, in addition to the one associated at the time of allocation; we would actually need that too for non-ExternalWork modules (which resemble TBB's streaming_node) if we ever get the improved "early deletion" of event data products, cms-sw#16481.)
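
For the record, the kind of "simple test" mentioned above could look like the sketch below: launch a long-running kernel asynchronously, then time a cudaFree() of an unrelated buffer; if the measured time is comparable to the kernel duration, cudaFree() is synchronising with the device (illustrative code, not the actual test):

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spin(long long cycles) {
  long long start = clock64();
  while (clock64() - start < cycles) {}
}

int main() {
  void *a, *b;
  cudaMalloc(&a, 1024);
  cudaMalloc(&b, 1024);

  spin<<<1, 1>>>(1000000000LL);  // long-running kernel, launched asynchronously
  auto t0 = std::chrono::high_resolution_clock::now();
  cudaFree(b);                   // free an *unrelated* buffer while the kernel runs
  auto t1 = std::chrono::high_resolution_clock::now();

  // If cudaFree implicitly synchronizes with the device, this prints roughly
  // the kernel duration rather than a few microseconds.
  printf("cudaFree took %f ms\n", std::chrono::duration<double, std::milli>(t1 - t0).count());
  cudaFree(a);
  return 0;
}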

@makortel

Adding here from #306 (comment) by @fwyzard

By the way, during the E4 Hackathon, an NVIDIA guy mentioned the new device-side RAPIDS Memory Manager.

Interesting, thanks for sharing. Do you know if there is any documentation beyond the (rather terse) README (and code, of course)?

@fwyzard commented Apr 17, 2019

Unfortunately no, it was mentioned only en passant.

@makortel commented May 1, 2019

I haven't had time to investigate further at all, but ArrayFire https://github.com/arrayfire/arrayfire appears to have some sort of memory manager.

@makortel

There is now some more information on the RAPIDS Memory Manager
https://developer.nvidia.com/blog/fast-flexible-allocation-for-cuda-with-rapids-memory-manager/

It could be interesting to give it a try at some point.

@makortel

Just to document here as well: @fwyzard gave cudaMallocAsync() + cudaFreeAsync() a try in https://github.com/fwyzard/pixeltrack-standalone/tree/cudaMallocAsync.

The results with the caching allocator were:

Found 1 devices
Processing 10000 events, of which 10 concurrently, with 10 threads.
Processed 10000 events in 1.071067e+01 seconds, throughput 933.648 events/s.

and with cudaMallocAsync() + cudaFreeAsync():

Found 1 devices
Processing 10000 events, of which 10 concurrently, with 10 threads.
Processed 10000 events in 1.376592e+01 seconds, throughput 726.432 events/s.
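
For context, the stream-ordered allocation pattern benchmarked above looks roughly like this (cudaMallocAsync()/cudaFreeAsync() are available from CUDA 11.2; the kernel and sizes are illustrative):

#include <cuda_runtime.h>

__global__ void kernel(float*) {}

void processEvent(cudaStream_t stream) {
  float* scratch = nullptr;
  // Stream-ordered allocation: usable by work queued after this point on "stream".
  cudaMallocAsync(&scratch, 1024 * sizeof(float), stream);
  kernel<<<1, 1, 0, stream>>>(scratch);
  // Stream-ordered free: the memory can be reused by later allocations on the stream
  // without any host-side synchronisation.
  cudaFreeAsync(scratch, stream);
}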
