Engine Abstraction in Layers #4187
@rddesmond If you are interested, I'm looking for help on going forward to fully abstract the interface to be OpenCL/CUDA/CPU agnostic. The issue is even more pressing there since it has 4 convolution engines, of which some are CUDA+OpenCL, some are CUDA only, some OpenCL only.
@naibaf7 We are also interested in this. We will want to plug your libdnn effort and other engines into tiny-cnn. I also think we need to consider how to support multi-engine binaries for distribution packagers. @cdluminate Do you have any feedback on this topic? /cc @mtamburrano @edgarriba
@bhack Furthermore, I constructed the abstraction so that each layer "owns" a device and executes its methods on that particular device. So far, this approach works quite well and is not very intrusive/conflicting with the existing CUDA master branch. But it's also not perfect/complete yet. Contributions/ideas are welcome.
@naibaf7 So with this ID scheme, is there "something similar" to device placement in TF?
@bhack
Then the following devices will be in the global ID list, and a selection of these devices (let's say 0 and 3) can be initialized and remapped to Caffe IDs. As you can see, CPU devices are not included in that scheme yet, but should/will be included in a full abstraction as well.
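As a rough illustration of the enumerate-then-remap scheme described above (this is not the actual Caffe code; `Device`, `SelectDevices`, and the backend strings are assumptions made for this sketch):

```cpp
#include <map>
#include <string>
#include <vector>

// Sketch: every device of every backend is enumerated into one global list;
// the user then selects a subset, which is remapped to dense Caffe IDs.
struct Device {
  std::string backend;  // e.g. "CUDA", "OpenCL"
  int native_id;        // the device's ID within its own backend
};

std::map<int, Device> SelectDevices(const std::vector<Device>& global_list,
                                    const std::vector<int>& selection) {
  std::map<int, Device> caffe_ids;
  int next_id = 0;
  for (int global_id : selection) {
    // Remap: global enumeration ID -> dense Caffe device ID (0, 1, ...).
    caffe_ids[next_id++] = global_list.at(global_id);
  }
  return caffe_ids;
}
```

So selecting global devices 0 and 3 from a list of two CUDA and two OpenCL devices would yield Caffe IDs 0 and 1 pointing at the first CUDA and the second OpenCL device, respectively.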
@ajtulloch Do you have any feedback on this? Are you still working on integrating the NNPACK engine into Caffe?
@naibaf7 Seems good. Device placement is still an open topic in TF; see the thread at tensorflow/tensorflow#2126
@bhack One way I offer device placement as well is by setting the device on layers explicitly in the protobuf or the Python Layer interface. However, this does not give full support yet, as multi-device Blobs (for sharing a single Blob across devices) are not finished yet.
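As a sketch of what per-layer placement in the protobuf could look like (the `device` field here is hypothetical and not an existing field in Caffe's `LayerParameter`; it only illustrates the idea of pinning one layer to one Caffe device ID):

```protobuf
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  device: 1  # hypothetical: run this layer's Forward/Backward on Caffe device 1
}
```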
@naibaf7 What's a clean/good solution for you? We are starting to design the device abstraction in tiny-cnn, so we would like to think it through well before we start coding.
@edgarriba
To the last point: most CUDA/OpenCL code can be cross-compiled by adding a few macros and inline helper functions to the source code (Greentea-libDNN is doing this). Creating an OpenMP version of it may be a bit more tricky. Especially if you don't have a GPU backend yet, I can strongly recommend using the cross-compile scheme, as it will save a lot of work and duplicated code. The performance-critical kernels will probably be handled by third-party libraries anyway (mostly the convolutions and BLAS calls). Not all of these properties are currently implemented in OpenCL Caffe, but hopefully will be at some point.
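A hedged sketch of the cross-compile idea (loosely in the spirit of what Greentea/libDNN does, but with invented helper names): the same kernel body can be emitted as either CUDA or OpenCL source by prepending a small macro prelude that maps OpenCL keywords onto their CUDA equivalents, so only one copy of the kernel is maintained.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: build the kernel source string for either backend.
// The kernel body is written once, in OpenCL style; for CUDA, a macro
// prelude redefines the OpenCL keywords so nvcc/NVRTC can compile it.
std::string kernel_source(bool use_cuda) {
  std::ostringstream ss;
  if (use_cuda) {
    ss << "#define __kernel extern \"C\" __global__\n"
       << "#define __global\n"
       << "#define get_global_id(i) (blockIdx.x * blockDim.x + threadIdx.x)\n";
  }
  // Single shared kernel body (OpenCL syntax):
  ss << "__kernel void axpy(__global float* y, __global const float* x, "
        "float a) {\n"
        "  int i = get_global_id(0);\n"
        "  y[i] += a * x[i];\n"
        "}\n";
  return ss.str();
}
```

The returned string would then be handed to the respective runtime compiler (NVRTC for CUDA, `clBuildProgram` for OpenCL); the prelude trick keeps the kernel itself backend-agnostic.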
@rddesmond
a.) The convolution settings selected. Some engines do not support dilation.
b.) The device selected. This is a combination of device backend (CUDA/OpenCL/CPU) and device property support (i.e. some convolutions only support AMD or Intel hardware).
Also, not all the layers and math functions (ops) are accelerated/implemented by all backends. |
I think the naive first approach is to have the devices ranked, but it would be nice to do more than that. As for the layers and ops not being supported by all backends: I think asking the layer whether it can be instantiated for a given set of parameters would simplify the factory and make the heuristics much easier. Getting the Blobs to handle the memory movement is a big piece, but I don't think it is broken; it just needs to be expanded -- we can already mix and match CUDA/CuDNN/CPU in the same model without a problem. This whole thing started because CuDNN can't fall back to CPU (or, at runtime without changing the model, to CUDA). The biggest problem I had was that the CPU was treated as a fallback for each engine, and not as an engine itself.
Yes, for example NNPACK, which is a CPU engine, only covers some critical layers. As you can see from its readme, forward kernels for training and for inference could be differentiated in some engines.
re. NNPACK, there's nothing fundamental that Caffe needed to change to implement that, the patch is very non-invasive. I haven't proposed it as a PR since adding an (optional) NNPACK dependency is a bit of a hassle from a maintenance perspective (especially since the NNPACK interface isn't stable and there's no versioning policy). It's easy for anyone motivated to take the patch if they want it though. |
@ajtulloch it might be helpful if you post an issue specifically about NNPACK and link to your patch so that it's easily discoverable, if not actually merged in master for the present maintenance reasons you mentioned. |
#2219 has some interesting bullet points. See also sections 3.2 and 10 of https://github.com/samjabrahams/tensorflow-white-paper-notes/blob/master/README.md. Section 9.2 (Performance Tracing) also describes some profiling tools that have not been released yet.
@shelhamer OK sure, I'll do that. |
I have been trying to have a build-once, run-anywhere Caffe with CuDNN. Unfortunately, the CuDNN engine's layers call the GPU during layer setup. This means that the CuDNN engine cannot be used in a CPU-only environment. There is also no easy path for CUDA fallback (i.e. when we build with CuDNN).
This is an attempt to revisit some of the efforts which were unfortunately abandoned 2 years ago in "Device Abstraction" #610.
I would propose something that is focused on the Layers and is less abrupt on the internal API:
Each engine's layer would expose a static check, e.g. canInstantiate(const LayerParameter& param), to eliminate the need for layer-specific logic that is hardcoded for each engine. The layer factory would then ask each engine's canInstantiate in order of preference and, when an engine cannot handle the parameters, fall back down the list.
(I'm also creating a PR for a quicker CuDNN workaround, to make it so it can be turned off at runtime, either because the mode is not GPU or via a separate flag.)
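The proposed factory fallback could be sketched as follows. This is a hypothetical, self-contained illustration, not Caffe's actual code: `LayerParameter` is reduced to two fields, and the engine names and their predicates are invented for the example.

```cpp
#include <functional>
#include <string>
#include <vector>

// Reduced stand-in for Caffe's LayerParameter, keeping only the two
// dimensions discussed in this thread.
struct LayerParameter {
  int dilation = 1;
  bool gpu_available = false;
};

// Each engine registers a canInstantiate predicate with the factory.
struct Engine {
  std::string name;
  std::function<bool(const LayerParameter&)> canInstantiate;
};

// Walk the ranked engine list; the first engine that can handle the
// parameters wins, others are fallbacks. CPU is a real engine at the
// bottom of the ranking, not a special case.
std::string SelectEngine(const std::vector<Engine>& ranked,
                         const LayerParameter& param) {
  for (const Engine& e : ranked) {
    if (e.canInstantiate(param)) return e.name;
  }
  return "NONE";
}

// Illustrative ranking (the support rules are assumptions, e.g. "CuDNN
// does not support dilation" stands in for real capability checks).
std::vector<Engine> DefaultEngines() {
  return {
      {"CUDNN",
       [](const LayerParameter& p) { return p.gpu_available && p.dilation == 1; }},
      {"CUDA", [](const LayerParameter& p) { return p.gpu_available; }},
      {"CPU", [](const LayerParameter&) { return true; }},
  };
}
```

With this shape, a dilated convolution on a GPU machine would skip CuDNN and land on CUDA, and a CPU-only build would land on CPU without any engine-specific hardcoding in the factory.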