
Engine Abstraction in Layers #4187

Open · rddesmond opened this issue May 20, 2016 · 19 comments

@rddesmond commented May 20, 2016

I have been trying to build a build-once, run-anywhere Caffe with CuDNN. Unfortunately, the CuDNN engine's layers call the GPU during layer setup, which means the CuDNN engine cannot be used in a CPU-only environment. There is also no easy path for CUDA fallback (i.e., falling back to the plain CUDA implementations when we build with CuDNN).

This is an attempt to revisit some of the efforts which were unfortunately abandoned 2 years ago in "Device Abstraction" #610.

I would propose something that focuses on the layers and is less disruptive to the internal API:

  • Extend "Mode" so that it has entries beyond CPU/GPU: CPU, CUDA, CuDNN, etc.
  • Layers are initialized with a specific mode.
  • When layers register with the layer factory, they do so for one or more specific modes (that is, the layer factory keeps a map of the classes to instantiate for the various engines at each layer).
  • Add a new static method to each layer, canInstantiate(const LayerParameter& param), to eliminate the layer-specific logic that is currently hardcoded for each engine.
  • Make an abstract Caffe layer which supports CPU and GPU, just like right now. That would allow the same design pattern to be carried forward, since this default works well and is widely adopted.
  • The layer factory selects the "best" class for each layer by ranking the engines and calling canInstantiate down the list (a sketch follows this list).
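
To make that concrete, here is a minimal C++ sketch of such an engine-aware registry. Everything in it is hypothetical (the Engine enum, the ranking order, the registration signatures); it only assumes Caffe's existing Layer<Dtype> and LayerParameter types:

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

template <typename Dtype> class Layer;  // Caffe's existing layer base
class LayerParameter;                   // Caffe's existing proto message

enum class Engine { CUDNN, CUDA, CPU };  // ranked best-first

template <typename Dtype>
class EngineLayerRegistry {
 public:
  using Creator = std::function<std::shared_ptr<Layer<Dtype>>(const LayerParameter&)>;
  using Checker = std::function<bool(const LayerParameter&)>;  // wraps canInstantiate()

  // Each layer implementation registers once per engine it supports.
  static void Register(const std::string& type, Engine engine,
                       Creator create, Checker can_instantiate) {
    registry()[type][engine] = std::make_pair(create, can_instantiate);
  }

  // Walk the engine ranking and return the first implementation whose
  // canInstantiate() accepts these parameters. Note that CPU is just
  // another engine in the ranking, not a special-cased fallback.
  static std::shared_ptr<Layer<Dtype>> Create(const std::string& type,
                                              const LayerParameter& param) {
    for (Engine e : {Engine::CUDNN, Engine::CUDA, Engine::CPU}) {
      auto& impls = registry()[type];
      auto it = impls.find(e);
      if (it != impls.end() && it->second.second(param)) {
        return it->second.first(param);
      }
    }
    return nullptr;  // no registered engine can handle these parameters
  }

 private:
  using Entry = std::pair<Creator, Checker>;
  static std::map<std::string, std::map<Engine, Entry>>& registry() {
    static std::map<std::string, std::map<Engine, Entry>> r;
    return r;
  }
};
```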

(I'm also creating a PR for a quicker CuDNN workaround, to make it so that it can be turned off at runtime, either because the mode is not GPU or via a separate flag.)

@naibaf7 (Member) commented Jun 2, 2016

@rddesmond
https://github.com/BVLC/caffe/tree/opencl
The OpenCL branch has device abstraction to some extent (since it covers both OpenCL and CUDA compatibility).
It may not cover all the points you mention yet, but I think it's a good start.

If you are interested, I'm looking for help on fully abstracting the interface to be OpenCL/CUDA/CPU agnostic.

The issue is even more pressing there since it has 4 convolution engines, of which some are CUDA+OpenCL, some CUDA only, and some OpenCL only.

@bhack (Contributor) commented Jun 2, 2016

@naibaf7 We are also interested in this. We want to plug your libdnn effort and other engines into tiny-cnn. I also think we need to consider how to support multi-engine binaries for distribution packagers. @cdluminate Do you have any feedback on this topic? /cc @mtamburrano @edgarriba

@naibaf7 (Member) commented Jun 2, 2016

@bhack
LibDNN will require part of Caffe's device abstraction, which wraps OpenCL and CUDA devices and gives each device a system-wide and a Caffe-wide integer ID to easily switch between devices at runtime, as well as a thread-local variable which selects a global per-thread device.

Furthermore, I constructed the abstraction so that each layer "owns" a device and executes its methods on that particular device. So far, this approach works quite well and is not very intrusive/conflicting with the existing CUDA master branch. A rough sketch of the ownership scheme follows.
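
(Class and member names in this sketch are invented for illustration; the opencl branch's actual code differs.)

```cpp
// Wraps one OpenCL or CUDA device; id() stands in for the Caffe-wide
// integer ID described above (all names are illustrative only).
class Device {
 public:
  explicit Device(int id) : id_(id) {}
  int id() const { return id_; }
 private:
  int id_;
};

// Thread-local selection: 3rd-party libraries that act on "the currently
// active device" pick this up without extra plumbing.
thread_local Device* current_device = nullptr;

// Each layer "owns" a device and binds it before executing its methods.
class LayerBase {
 public:
  explicit LayerBase(Device* dev) : device_(dev) {}
  void Forward() {
    current_device = device_;  // select this layer's device for this thread
    // ... enqueue this layer's kernels / BLAS calls on device_ ...
  }
 private:
  Device* device_;
};
```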

But it's also not perfect/complete yet. Contributions/ideas are welcome.

@bhack (Contributor) commented Jun 2, 2016

@naibaf7 So with this ID, is there "something similar" to device placement in TF?

@naibaf7 (Member) commented Jun 2, 2016

@bhack
Exactly.
At the moment, the CUDA devices are put first. So assuming we have a system with these devices (my system as an example):

  • nVidia GTX 980
  • AMD W9100
  • Intel HD4600
  • Intel i7-4790K (OpenCL)
  • Intel i7-4790K (native)

then the devices get the following global IDs:
0: GTX 980 CUDA (CUDA devices are put first for setDevice compatibility)
1: GTX 980 OpenCL
2: i7-4790K OpenCL
3: W9100 OpenCL
4: HD4600 OpenCL

A selection of these devices (let's say 0 and 3) can then be initialized and remapped to Caffe IDs:
0: GTX 980 (CUDA)
1: W9100 (OpenCL)
These devices can then be used together in multi-GPU networks, or in multiple networks within the same Caffe instance.

As you can see, CPU devices are not included in that scheme yet, but they should/will be included in a full abstraction as well. A toy sketch of the enumeration and remapping follows.
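
(Struct and function names here are made up; only the ordering rule, CUDA devices first, comes from the scheme above.)

```cpp
#include <string>
#include <vector>

enum class Backend { CUDA, OPENCL };

struct Device {
  int global_id;     // position in the system-wide list, CUDA devices first
  Backend backend;
  std::string name;
};

// Build the system-wide list: CUDA devices first (keeps setDevice-style
// numbering intact), then every OpenCL device.
std::vector<Device> EnumerateDevices(const std::vector<std::string>& cuda,
                                     const std::vector<std::string>& opencl) {
  std::vector<Device> all;
  for (const auto& n : cuda)
    all.push_back({static_cast<int>(all.size()), Backend::CUDA, n});
  for (const auto& n : opencl)
    all.push_back({static_cast<int>(all.size()), Backend::OPENCL, n});
  return all;
}

// Initialize a selection of global IDs (e.g. {0, 3}) and remap them to
// dense Caffe IDs 0..k-1 for use within one Caffe instance.
std::vector<Device> SelectDevices(const std::vector<Device>& all,
                                  const std::vector<int>& global_ids) {
  std::vector<Device> selected;
  for (int gid : global_ids) selected.push_back(all[gid]);
  return selected;  // the index in this vector is the Caffe ID
}
```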

@bhack (Contributor) commented Jun 2, 2016

@ajtulloch Do you have any feedback on this? Are you still working on integrating the NNPACK engine into Caffe?

@bhack (Contributor) commented Jun 2, 2016

@naibaf7 Seems good. Device placement is still an open topic on TF; see the thread at tensorflow/tensorflow#2126.

@naibaf7 (Member) commented Jun 2, 2016

@bhack
Yeah, I'm not sure how clean/good it is, but it works for me (I had no time to discuss these solutions, so I just went ahead with it in Caffe when I did my thesis). I understand that things like device abstraction / device placement / OpenCL take a bit longer to implement in TensorFlow (more people involved).

I also offer device placement by setting the device on layers explicitly in the protobuf or the Python layer interface. However, this does not give full support yet, as multi-device Blobs (for sharing a single Blob across devices) are not finished yet.

@edgarriba commented

@naibaf7 What is a clean/good solution for you? We are starting to design the device abstraction in tiny-cnn, so we would like to think it through well before we start coding.

@naibaf7 (Member) commented Jun 2, 2016

@edgarriba
Hard to say, but here are some considerations I came across:

  • Binding a device to a thread is sometimes required, especially with 3rd-party libraries that just execute on the currently active device. So there must be a thread-local variable identifying the current device.
  • For OpenCL it is useful to keep a global list of the initialized contexts.
  • Consider whether it is useful to allow multiple devices per layer or memory object. In my opinion, this makes templating classes on a device a poor option. Rather, memory objects and layers should maintain a list of the devices that participate in or belong to them.
  • The device lists should probably be maintained as pointers to initialized devices. The devices themselves can be stored in a global list/vector.
  • Layers can have different ways of handling multiple assigned devices, such as using them jointly to compute the operation.
  • Blobs/memory objects should internally trigger a device-to-device copy if a device other than the current head device accesses them. The head device is the device that made the last writing kernel call on that object. If multithreading on the CPU side should work, memory objects also need to be locked (RW locks, allowing multiple queued read accesses while a write is exclusive) while a layer is using them in read/write mode.
  • Basic device operations (such as the math functions in Caffe) can be abstracted on the device using polymorphism to wrap libraries such as clBLAS and cuBLAS (a sketch follows this list).
  • Consider abstracting kernels by using NVRTC on CUDA and runtime compilation on OpenCL. That way, the kernels only have to be written once and can be dynamically submitted, compiled, and launched on devices through a polymorphic device abstraction that hides CUDA, OpenCL and possibly CPU execution (if an OpenMP backend with runtime compilation is to be considered).
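
For the BLAS-wrapping bullet, a minimal sketch of what such a polymorphic math abstraction could look like (all names invented; the actual clBLAS/cuBLAS calls are only hinted at in comments):

```cpp
// Device-agnostic math interface: layers and solvers call this and never
// name a backend (illustrative only, not OpenCL Caffe's actual API).
class MathBackend {
 public:
  virtual ~MathBackend() {}
  // y = alpha * x + y on this device's memory.
  virtual void axpy(int n, float alpha, const void* x, void* y) = 0;
};

class CudaMath : public MathBackend {
 public:
  void axpy(int n, float alpha, const void* x, void* y) override {
    // would forward to cublasSaxpy(handle, n, &alpha, x, 1, y, 1)
  }
};

class OpenClMath : public MathBackend {
 public:
  void axpy(int n, float alpha, const void* x, void* y) override {
    // would forward to the clBLAS equivalent on this device's queue
  }
};

// The caffe_gpu_axpy()-style entry point then just dispatches:
void device_axpy(MathBackend* dev, int n, float alpha,
                 const void* x, void* y) {
  dev->axpy(n, alpha, x, y);
}
```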

To the last bullet point above: most CUDA/OpenCL code can be cross-compiled by adding a few macros and inline helper functions to the source code (Greentea/libDNN does this). Creating an OpenMP version of it may be a bit trickier.

Especially if you don't have a GPU backend yet, I can strongly recommend the cross-compile scheme, as it will save a lot of work and duplicated code. The performance-critical kernels (mostly the convolutions and BLAS calls) will probably be handled by 3rd-party libraries anyway. A rough illustration of the scheme follows.
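
(The macro names in this sketch are invented, not libDNN's actual ones; it only shows how a single kernel source can target both backends.)

```cpp
// Compiled either as CUDA (e.g. through NVRTC, which defines __CUDACC__)
// or as OpenCL C; the macros paper over the syntactic differences.
#ifdef __CUDACC__
#define KERNEL extern "C" __global__ void
#define GLOBAL_MEM  // CUDA pointers need no address-space qualifier
#define GET_GLOBAL_ID() (blockIdx.x * blockDim.x + threadIdx.x)
#else
#define KERNEL __kernel void
#define GLOBAL_MEM __global
#define GET_GLOBAL_ID() get_global_id(0)
#endif

// One source, one kernel definition, two backends.
KERNEL axpy(const int n, const float alpha,
            GLOBAL_MEM const float* x, GLOBAL_MEM float* y) {
  const int i = GET_GLOBAL_ID();
  if (i < n) y[i] += alpha * x[i];
}
```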

Not all of these properties are currently implemented in OpenCL Caffe, but hopefully they will be at some point.

@naibaf7 (Member) commented Jun 2, 2016

@rddesmond
Back to engine abstraction:

  • It would be nice if auto-selecting the fastest implementation for a given device were possible, perhaps via a combined cost function that depends on the available device memory and the speed of the implementations.
  • It is probably necessary that the engines specify compatibility with:

a.) the selected convolution settings (some engines do not support dilation), and

b.) the selected device, which is a combination of device backend (CUDA/OpenCL/CPU) and device property support (i.e. some convolutions only support AMD or Intel hardware). A sketch of such a check follows.
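
(Types and fields here are made up for illustration; in the scheme proposed above, such a check would live behind each engine's canInstantiate().)

```cpp
#include <string>

// Hypothetical descriptors for (a) the convolution settings and (b) the
// selected device; neither is Caffe's actual API.
struct DeviceInfo {
  enum class Backend { CUDA, OPENCL, CPU } backend;
  std::string vendor;  // e.g. "NVIDIA", "AMD", "Intel"
};

struct ConvSettings {
  int dilation = 1;
  // ... kernel size, stride, group, ...
};

// Example: an engine that cannot do dilated convolutions and only runs
// on AMD or Intel OpenCL devices.
bool CanInstantiate(const ConvSettings& conv, const DeviceInfo& dev) {
  if (conv.dilation != 1) return false;                          // (a) settings
  if (dev.backend != DeviceInfo::Backend::OPENCL) return false;  // (b) backend
  return dev.vendor == "AMD" || dev.vendor == "Intel";           // (b) hardware
}
```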

@bhack (Contributor) commented Jun 2, 2016

Also, not all layers and math functions (ops) are accelerated/implemented by all backends.

@rddesmond (Author) commented

I think the naive first approach is to have the devices ranked, but it would be nice to do more than that.

As for layers and ops not being supported by all backends: I think asking the layer whether it can be instantiated for a given set of parameters would simplify the factory and make the heuristics much easier.

Getting the blobs to handle the memory movement is a big piece, but I don't think it is broken; it just needs to be expanded -- we can already mix and match CUDA/CuDNN/CPU in the same model without a problem.

This whole thing started because CuDNN can't fall back to CPU (or, at runtime and without changing the model, to CUDA). The biggest problem I had was that the CPU was treated as a fallback for each engine, and not as an engine in itself.

@bhack (Contributor) commented Jun 2, 2016

Yes. For example, NNPACK, which is a CPU engine, only covers some critical layers. And as you can see from its README, forward kernels for training and for inference could be differentiated in some engines.

@shelhamer (Member) commented

Regarding device placement and such, it might be worth looking back over the sync SGD / layer parallelism PR by @longjon, #2219. Returning to that would have more generality than the current data parallelism, but the interface is more low-level, since each device is annotated in the net definition.

@ajtulloch (Contributor) commented

Re: NNPACK, there's nothing fundamental that Caffe needed to change to implement that; the patch is very non-invasive. I haven't proposed it as a PR since adding an (optional) NNPACK dependency is a bit of a hassle from a maintenance perspective (especially since the NNPACK interface isn't stable and there's no versioning policy). It's easy for anyone motivated to take the patch if they want it, though.

@shelhamer (Member) commented

@ajtulloch It might be helpful if you post an issue specifically about NNPACK and link to your patch so that it's easily discoverable, even if it's not actually merged into master for the maintenance reasons you mentioned.

@bhack (Contributor) commented Jun 3, 2016

#2219 has some interesting bullet points. See also sections 3.2 and 10 of https://github.com/samjabrahams/tensorflow-white-paper-notes/blob/master/README.md. The "9.2 Performance Tracing" section also describes some profiling tools that have not been released yet.

@ajtulloch (Contributor) commented

@shelhamer OK sure, I'll do that.
