Next prototype of the framework integration #100

Merged

Conversation

@makortel commented Jul 23, 2018

Despite my original plan of not proceeding with the framework side before the demonstrator, here is a prototype of the CUDA algorithm integration, based on my discussions with @Dr15Jones and @wddgit. See the included README.md for more technical details.

I'm marking the PR as RFC because we first need to discuss the details and understand whether it could make sense to deploy it already for the demonstrator. Otherwise the PR can serve as a discussion forum on the topic until the demonstrator is finished.

  • Pros w.r.t. HeterogeneousEDProducer
    • Very flexible
    • Provides "streaming node" functionality out of the box
    • GPU->CPU transfers are done on demand, also when edm::Refs need to be made
    • Allows running both the CPU and CUDA versions of the algorithms in the same job
      • E.g. for validation/debugging
    • Simpler code (both infrastructure and use)
  • Cons
    • More verbose and more boilerplate
    • Need to add cms.Paths in the cff files (not needed with SwitchProducer)
    • Duplication in module customizations

Fixes #133.

@felicepantaleo @fwyzard @VinInn @rovere

@makortel (author) commented Jul 24, 2018

Summarizing here the outcome of today's meeting (*). We chose to continue with HeterogeneousEDProducer for the demonstrator, and to leave this PR open for discussion (and possible further development) in the meantime. Questions raised:

  1. Can CUDADeviceChooser and CUDADeviceFilter be combined?
  2. The configuration side is awful, especially for the end developer. One way to seek improvement would be to first think about how one would like to configure things, and only then about how to implement that.
    • The main problem lies in all the boilerplate of the pattern needed for "GPU or CPU"
  3. Adding new Paths in the configuration doesn't (currently) work with HLT
    • HLT is still run in scheduled mode (with process.Schedule and Paths containing all producers)
    • The configuration-editing system does not currently support Tasks etc.
    • Non-physics Paths are disliked

(*) https://indico.cern.ch/event/746161/contributions/3084531/attachments/1692036/2722511/slides_mk_20180724.pdf

@makortel (author) commented Aug 7, 2018

Rebased on top of the head of CMSSW_10_2_X_Patatrack (c2aba96).

Regarding point 1 in #100 (comment), the choice of splitting the logic into an EDProducer and an EDFilter was based on earlier experience that combining them usually (though not always) leads to problems. Thinking further about this particular case,

  • the producer+filter functionality is "instruct the downstream to run on GPU if possible, otherwise on CPU", while
  • the producer-only functionality is "instruct the downstream to run on GPU; if that is not possible, throw an error",

so it seems that the producer side of the two cases is fundamentally different. Therefore I added a commit toying with the idea of providing

  • CUDADeviceChooserFilter: if it decides to run on GPU, return true and produce a CUDAToken; otherwise return false and produce nothing
  • CUDADeviceChooserProducer: if it decides to run on GPU, produce a CUDAToken; otherwise throw an exception

In short, yes, the producer and the filter can be combined (and it makes sense to do so) for the case where(/if) we want to be able to decide dynamically whether a chain of CUDA EDModules should run on a GPU or on the CPU.
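
As an illustration of the combined approach, here is a minimal sketch of what such a CUDADeviceChooserFilter could look like. This is my own sketch under assumptions, not the code in this PR: the CUDAToken stand-in and the device-selection logic are placeholders.

```cpp
#include <memory>

#include <cuda_runtime.h>

#include "FWCore/Framework/interface/stream/EDFilter.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"

// Stand-in for the CUDAToken of this PR: here it just records the chosen device.
struct CUDAToken {
  explicit CUDAToken(int device) : device_(device) {}
  int device_;
};

class CUDADeviceChooserFilter : public edm::stream::EDFilter<> {
public:
  explicit CUDADeviceChooserFilter(edm::ParameterSet const&) { produces<CUDAToken>(); }

  bool filter(edm::Event& iEvent, edm::EventSetup const&) override {
    int nDevices = 0;
    if (cudaGetDeviceCount(&nDevices) != cudaSuccess or nDevices == 0) {
      // No usable GPU: return false and produce nothing, so the downstream
      // CUDA chain is skipped and the CPU branch runs instead.
      return false;
    }
    // Placeholder device choice; the real logic could e.g. balance the load over devices.
    int device = 0;
    iEvent.put(std::make_unique<CUDAToken>(device));
    return true;
  }
};
```

The producer variant would look the same, except that where the filter returns false it would instead throw an exception.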

@makortel (author) commented Aug 13, 2018

Some random thoughts about CUDA streams:

  • I'm now starting to think that it would probably be better for CUDADeviceChooserFilter and CUDADeviceChooserProducer to own the CUDA stream (per EDM stream); a sketch of this idea follows the list
    • In principle a tiny bit faster, as the streams are not created and destroyed for each event and each "chain of CUDA modules"
    • My main motivation comes from profiling, though. I believe it would be clearer in nvvp if a single CUDA stream id were always associated with the same EDM stream and the same chain of modules (across events)
  • I believe (it needs to be tested, of course) that overlapping the GPU->CPU transfer with kernels (on the same "computation CUDA stream") using another CUDA stream comes out in a straightforward way. We just need to add a variant of CUDADeviceChooserProducer (named e.g. CUDAStreamInDevice) which reads a CUDAToken and produces a new CUDAToken on the same CUDA device but with a new CUDA stream
  • We are currently using CUDA streams in beginStream() (where we do the block memory allocations) to "asynchronously" set memory or transfer some constant data. I used quotes because, given the "global synchronization" nature of cudaMalloc, I'm not sure how much we actually benefit from the CUDA streams there, and whether it would be good enough to do all memsets and transfers there synchronously (they happen only once per job anyway).
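
A minimal sketch of the first bullet (module-owned CUDA streams), assuming a stream EDProducer and a simplified CUDAToken; this is an illustration of the idea under my own assumptions, not the code of this PR. Since edm::stream modules are instantiated once per EDM stream, a member cudaStream_t created once and reused for every event gives a stable EDM stream -> CUDA stream association.

```cpp
#include <memory>

#include <cuda_runtime.h>

#include "FWCore/Framework/interface/stream/EDProducer.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"

// Stand-in for a CUDAToken carrying both the device and the (non-owning) stream.
struct CUDAToken {
  CUDAToken(int device, cudaStream_t stream) : device_(device), stream_(stream) {}
  int device_;
  cudaStream_t stream_;
};

class CUDADeviceChooserProducer : public edm::stream::EDProducer<> {
public:
  explicit CUDADeviceChooserProducer(edm::ParameterSet const&) {
    produces<CUDAToken>();
    cudaSetDevice(device_);      // device choice elided in this sketch
    cudaStreamCreate(&stream_);  // created once per EDM stream, reused for all events
  }
  ~CUDADeviceChooserProducer() override { cudaStreamDestroy(stream_); }

  void produce(edm::Event& iEvent, edm::EventSetup const&) override {
    // Every event of this EDM stream hands the same CUDA stream to the downstream chain.
    iEvent.put(std::make_unique<CUDAToken>(device_, stream_));
  }

private:
  int device_ = 0;
  cudaStream_t stream_;
};
```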

@makortel (author) commented:
Except that the first two points (a CUDA stream owned by a module, and the ability to create additional CUDA streams on a given device) are in conflict, because (in general) I can't communicate the chosen device from one module to the beginStream() of another, and in any case the first bullet assumes the current EDM stream -> CUDA device mapping.

@fwyzard commented Aug 14, 2018

@makortel

I believe (it needs to be tested, of course) that overlapping the GPU->CPU transfer with kernels (on the same "computation CUDA stream") using another CUDA stream comes out in a straightforward way.

I am not sure I understand this point. Are you suggesting to use one CUDA stream to compute, and a separate CUDA stream for the transfer of the results from GPU to CPU?

Do we submit the kernels in acquire() and run the transfer in produce(), relying on the explicit synchronisation from the framework (produce() runs after the callback from acquire())?

Or do we submit them all in acquire() but interleave them with CUDA events to enforce that the transfer waits for the kernel to have completed?

I think the former is easier - but what do we gain from reusing the same CUDA stream for the chain of modules?

@makortel (author) commented:
@fwyzard

I am not sure I understand this point. Are you suggesting to use one CUDA stream to compute, and a separate CUDA stream for the transfer of the results from GPU to CPU?

To my understanding that is the standard "trick" to compute and transfer data in parallel. My main motivation was to think about how that could be done within the context of this PR (regardless of whether we want to do that or not).

There would be some benefits (even under the assumption that "we achieve the parallelism with EDM streams"):

  • it exposes more parallel work for the GPU/driver
  • on-demand-scheduled GPU->CPU transfers would not incur additional latency on the "compute stream" of a chain of GPU modules

but what do we gain from reusing the same CUDA stream for the chain of modules?

We get behaviour equivalent to TBB flow graph's "streaming_node". I.e., if an EDProducer does not have to transfer anything back to the CPU for subsequent work (such as the number of digis/clusters/hits/quadruplets), it can be a regular EDProducer that just queues more kernels to the CUDA stream. The performance benefit would come from running the GPU computations in parallel with the "framework overhead".

The "streaming_node" behaviour is not enforced, though; it just emerges automatically if an EDProducer meets the necessary constraints.

Do we submit the kernels in acquire() and run the transfer in produce(), relying on the explicit synchronisation from the framework (produce() runs after the callback from acquire())?

Or do we submit them all in acquire() but interleave them with CUDA events to enforce that the transfer waits for the kernel to have completed?

Closer to the latter. As a concrete example, let's take raw2cluster. The chain of events would be the following:

  1. The raw2cluster EDProducer acquire() queues all kernels to the CUDA stream (that it got from the input CUDAToken)
  2. The raw2cluster EDProducer acquire() queues the transfer of the number of active modules and the number of clusters from GPU to CPU
    • needed by subsequent modules for their kernel launches
    • if subsequent modules did not need them, raw2cluster could be a regular EDProducer and queue all work in its produce()
  3. The raw2cluster produce() puts a CUDA<T> in the event containing all the pointers to GPU memory and the two numbers mentioned in 2
  4. rechit etc. queue their work
  5. The cluster GPU->CPU transfer EDProducer acquire() queues all GPU->CPU transfers for clusters
    • this module is run only if some module consumes() the CPU clusters
  6. The cluster GPU->CPU transfer EDProducer produce() converts the CPU SoA to the legacy formats
    • or puts the CPU SoA in the event, and we have yet another EDProducer for the conversion to the legacy format

If points 4 and 5 use the same CUDA stream, they will run serially (5 gets inserted somewhere in the middle of the subsequent work of 4, or after it). They can be made to run in parallel by introducing an additional CUDA stream, and the mechanism I described on slide 15 of #100 (comment) will take care of the synchronization with a CUDA event.
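
For illustration, a standalone CUDA sketch of that synchronization (my assumption of how it could look, not the framework code): the transfer stream waits on a CUDA event recorded on the compute stream, so the copy of step 5 can overlap with the kernels queued in step 4.

```cpp
#include <cstddef>

#include <cuda_runtime.h>

// Queue a GPU->CPU copy on transferStream so that it starts only after the work
// already queued on computeStream (e.g. the cluster kernels of step 1) has finished,
// while any later kernels on computeStream (step 4) can proceed in parallel.
void scheduleClusterTransfer(cudaStream_t computeStream,
                             cudaStream_t transferStream,
                             const void* d_clusters, void* h_clusters, std::size_t nBytes) {
  cudaEvent_t clustersReady;
  cudaEventCreateWithFlags(&clustersReady, cudaEventDisableTiming);

  // Mark the point on the compute stream after which the cluster data is complete.
  cudaEventRecord(clustersReady, computeStream);

  // The transfer stream waits for that point, then copies asynchronously;
  // h_clusters should be pinned host memory for the copy to be truly asynchronous.
  cudaStreamWaitEvent(transferStream, clustersReady, 0);
  cudaMemcpyAsync(h_clusters, d_clusters, nBytes, cudaMemcpyDeviceToHost, transferStream);

  // Safe to destroy immediately; resources are released once the event completes.
  cudaEventDestroy(clustersReady);
}
```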

fwyzard pushed a commit that referenced this pull request Oct 23, 2020
Provide a mechanism for a chain of modules to share a resource, that can be e.g. CUDA device memory or a CUDA stream.
Minimize data movements between the CPU and the device, and support multiple devices.
Allow the same job configuration to be used on all hardware combinations.

See HeterogeneousCore/CUDACore/README.md for a more detailed description and examples.
fwyzard pushed a commit that referenced this pull request Nov 27, 2020
Remove SiPixelDigiHeterogeneousConverter as obsolete, should have been removed as part of #100.

Address review comments for SiPixelClustersCUDA:
  - remove commented out default constructor and private: from DeviceConstView;
    this is perhaps the best compromise between non-default constructors not
    being preferred for device allocations, and the use case in
    SiPixelRecHitSoAFromLegacy (for the expected life time of this class)
  - remove const getters with c_ prefix
  - improve constructor parameter name
  - use more initializer list
  - initialize nClusters_h

Address review comments for SiPixelDigiErrorsCUDA:
  - use type alias
  - remove const getters with c_ prefix and other unnecessary methods
  - use more initializer list

Address review comments for SiPixelDigisCUDA:
  - remove const getters with c_ prefix and other unnecessary methods
  - remove commented out default constructor and private: from DeviceConstView
  - add comments for remaining SiPixelDigisCUDA member arrays

Move PixelErrorsCompact and SiPixelDigiErrorsSoa to DataFormats/SiPixelRawData, rename classes

Address review comments for SiPixelErrorsSoA
  - remove redundant assert
  - move constructor inline

Address review comments for SiPixelDigisSoA
  - remove redundant assert
  - add comments

Enable if constexpr also for CUDA in TrackingRecHit2DHeterogeneous

Move dictionary of HostProduct<unsigned int[]> to CUDADataFormats/Common