
Device Abstraction #610

Closed
wants to merge 62 commits into from

Conversation

shelhamer
Member

CPU / GPU device separation and abstraction will:

  1. simplify the code
  2. make a CPU-only build possible with less mess -- compare CPU-only build #561
  3. prepare layers and blobs for multi-device parallelism
  4. pave the way for different GPU backends such as OpenCL

and so improve Caffe all around. That is, provided there is little to no overhead in both performance and coding. Since this requires a non-trivial set of coordinated changes, it has been promoted to the BVLC:device-abstraction feature branch. To contribute to this development, open a PR against BVLC:device-abstraction. When you rebase your feature branch on dev, comment with a link to your fork to notify the maintainers to update.

See #415 for history.

This PR tracks the progress and integration of this branch for its final merge to dev.

@shelhamer
Member Author

@robwhess #555 has been merged, so let me know once you have pushed a rebase of this branch on dev so that it can be reset to your work.

@kloudkl
Contributor

kloudkl commented Jul 4, 2014

@shelhamer, I'm a little confused about the most suitable workflow to contribute to a feature branch with an open PR. Is it the case that you will rebase this PR against device-abstraction from time to time while @robwhess's #587 and any other similar PRs should rebase against both dev and device-abstraction before being merged into the device-abstraction branch?

@shelhamer
Member Author

@kloudkl this PR and the device-abstraction PR are one and the same. Any change to BVLC:device-abstraction is automatically reflected in this PR. It's just like when you push further commits to a branch on your fork while a PR is open.

So the workflow to contribute to this PR is to make a PR on BVLC:device-abstraction. The usual rules for PRs hold except the base is device-abstraction instead of dev: you should have a clean merge, PRs should be short and to the point, etc.

The one complication for a feature branch that you are right to point out is that device-abstraction must itself track dev and be rebased from time to time. @robwhess has volunteered to do the first rebase of this kind. You and any other contributors can also help by rebasing when the feature branch has fallen behind. Once rebased, the contributor should push to their fork and comment for a BVLC member with push rights to update BVLC:device-abstraction to their rebased fork.

This is what was done for #165 except that I did too much of the rebasing then.

Let me know if that's not clear, since BVLC feature branch PRs are a little different in this respect.

@kloudkl
Contributor

kloudkl commented Jul 4, 2014

Very clear. Thanks a lot!

@jeffdonahue
Contributor

Interesting that the Travis build totally fails... I at least compiled @ e316202 on Linux with gcc successfully (though with a few warnings).

edit: oh right -- it's because of CPU_ONLY. If I do make clean && export CPU_ONLY=1 and then rebuild, I get the same results as Travis.
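
For context, here is a minimal sketch (illustrative only, not the actual Caffe sources; the function and message below are hypothetical) of how a CPU_ONLY-style compile-time flag typically strips GPU code paths, which is why a CPU_ONLY rebuild exercises different code than the default build:

```cpp
// Hedged illustration of a CPU_ONLY-style guard; names are hypothetical and
// do not reproduce the branch's actual macros.
#include <cstdlib>
#include <iostream>

void forward_gpu() {
#ifdef CPU_ONLY
  // GPU entry points collapse to a hard error in a CPU-only build.
  std::cerr << "Cannot use the GPU in a CPU-only build." << std::endl;
  std::abort();
#else
  // ... launch CUDA kernels here in a full build ...
  std::cout << "GPU path taken." << std::endl;
#endif
}

int main() {
  forward_gpu();  // aborts when compiled with -DCPU_ONLY, otherwise runs the GPU stub
  return 0;
}
```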

@jeffdonahue
Contributor

I did another rebase onto the latest dev (just another few days' worth of commits) and force pushed again. This rebase was relatively easy -- maybe the trick is to just not let too much time pass between rebases, or maybe these changes just happened to be easier than most.

@jeffdonahue
Contributor

Did one more rebase and force pushed after seeing Travis pass on my fork. This one was a bit more painful due to a couple interface-breaking PRs, but manageable; took around an hour. I've tested ./train_imagenet.sh @ dev and @ device-abstraction with a seed, and get the same results over 20 iterations, so between that and the unit tests I don't think this breaks any functionality. From that test there was a 0.8% performance hit vs. dev (28.72s @ device-abstraction vs 28.48s @ dev). I wonder where the slowdown comes from (assuming my test even shows a significant difference) -- could it help to inline all the new GetDevice type calls and math functions which would previously have been called directly?

@robwhess

@jeffdonahue yeah, the GetDevice() calls are obvious candidates for inlining and happen all over the place. I'll look into that to see if it saves any time.

Thanks for doing the rebases and getting things working with Travis.
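
To make the inlining question concrete, here is a hypothetical sketch (the class and function names are illustrative, not the branch's actual API) of a device-dispatch wrapper like the one being discussed. A thin accessor such as GetDevice() can be inlined away when defined in a header, while the virtual math call behind it generally cannot unless the compiler devirtualizes it:

```cpp
// Hypothetical device-dispatch sketch; not the device-abstraction branch's real API.
#include <vector>

template <typename Dtype>
class Device {
 public:
  virtual ~Device() {}
  // One virtual call per math op replaces a direct caffe_cpu_* / caffe_gpu_* call.
  virtual void axpy(int n, Dtype alpha, const Dtype* x, Dtype* y) const = 0;
};

template <typename Dtype>
class CpuDevice : public Device<Dtype> {
 public:
  virtual void axpy(int n, Dtype alpha, const Dtype* x, Dtype* y) const {
    for (int i = 0; i < n; ++i) y[i] += alpha * x[i];  // stand-in for a BLAS call
  }
};

// Defining the accessor in a header makes it implicitly inline, so the extra
// indirection costs only the (possibly devirtualizable) virtual call itself.
template <typename Dtype>
inline const Device<Dtype>& GetDevice() {
  static const CpuDevice<Dtype> cpu;  // CPU-only stand-in for brevity
  return cpu;
}

int main() {
  std::vector<float> x(4, 1.0f), y(4, 2.0f);
  GetDevice<float>().axpy(4, 0.5f, x.data(), y.data());  // each y[i] becomes 2.5f
  return 0;
}
```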

@robertsdionne

Perhaps once device abstraction is complete (which I see might take a while), the OpenCL portions of the math code could be implemented using parts of VexCL (https://github.com/ddemidov/vexcl). It's a template expression library that generates OpenCL C kernels from C++ expressions involving vex::vector<T> (a device analogue of std::vector<T>). Incidentally, it also has a CUDA backend.

For instance, a ReLU kernel might be as simple as the following C++ code:
top[i] = max(0.0f, bottom[i]);

or caffe_copy:
top[i] = bottom[i];

I believe it can be mixed with typical OpenCL-style libraries like clBLAS or clFFT.

However, it uses C++11 features. I'm not sure which versions of clang/g++ caffe uses.
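
A minimal self-contained sketch of the VexCL usage described above (illustrative only; it assumes VexCL's headers, a working OpenCL device, and that VexCL's builtin-function wrappers accept mixed scalar/vector arguments as written):

```cpp
// Illustrative VexCL sketch, not part of Caffe.
#include <vexcl/vexcl.hpp>

int main() {
  vex::Context ctx(vex::Filter::Any);   // attach to any available OpenCL device
  const size_t n = 1 << 20;

  vex::vector<float> bottom(ctx, n), top(ctx, n);
  bottom = 1.0f;                        // fill the input on the device

  top = max(0.0f, bottom);              // ReLU: VexCL generates and runs an OpenCL kernel
  top = bottom;                         // element-wise copy, i.e. the caffe_copy case
  return 0;
}
```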

@kloudkl
Contributor

kloudkl commented Sep 28, 2014

VexCL and ViennaCL were once considered as candidates. Out of concern for performance, clBLAS was preferred. Before making further decisions, please benchmark how they perform in Caffe in the most computationally expensive parts, such as the convolution layer.

@bhack
Contributor

bhack commented Sep 28, 2014

Yes, VexCL can be used with clBLAS and cuBLAS, but you could try benchmarking it in the convolution layer.

@leonardt

leonardt commented Oct 7, 2014

Hi all,
I'm interested in furthering the work done on device abstraction (and towards OpenCL support). Is there any ongoing work on this branch? I noticed @robwhess commented that he would attempt to do a rebase when he had time. I'd be interested in any input from the core Caffe devs on how to approach getting this branch back up and running and working towards full device abstraction.

@kloudkl
Contributor

kloudkl commented Oct 22, 2014

First things first. Git rebase and then add your own code.

@bhack
Contributor

bhack commented Nov 26, 2014

I don't know whether this is permanently stalled, but consider this news as well: ArrayFire is now under a BSD license (http://developer.amd.com/community/blog/2014/11/21/arrayfire-now-open-source/)

https://github.com/arrayfire/arrayfire

@shelhamer
Member Author

Device abstraction is still a worthy goal but has stalled for the moment. Once the interface is in place for abstracting between CPU, (CUDA) GPU, and other devices, ArrayFire seems like a potential choice for de-duplication. Thanks for the pointer.


@bhack
Contributor

bhack commented Nov 26, 2014

@robwhess @kloudkl Do you think you will really be able to come back to this? I think ArrayFire could help with this task (if its performance is confirmed). I've noticed you are no longer very active in the Caffe workflow, and I don't know whether you have a short/mid-term plan to come back to this.

@robwhess

Well, it is still my intention to come back to this. We'll eventually need the device abstraction at Flickr, so there should be an opportunity for us to work on this again, but that time just hasn't come yet, unfortunately. I'll keep ArrayFire in mind for when I finally do get the chance to work on this again.

@bhack
Contributor

bhack commented Nov 30, 2014

If anyone is interested, this is an example of a transparent multi-device gemv.
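
Since the original link is not reproduced here, the following is only a rough sketch of what a device-transparent GEMV looks like with ArrayFire's C++ API; the backend the program links against (CPU, CUDA, or OpenCL) determines where the multiplication actually runs:

```cpp
// Hedged ArrayFire sketch; not the example referenced above.
#include <arrayfire.h>

int main() {
  const int m = 512, n = 256;

  af::info();                       // report which device/backend ArrayFire selected
  af::array A = af::randu(m, n);    // m x n single-precision matrix on the device
  af::array x = af::randu(n, 1);    // n x 1 vector
  af::array y = af::matmul(A, x);   // y = A * x -- the GEMV itself
  af_print(y);                      // copy the result back and print it
  return 0;
}
```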

@bhack
Contributor

bhack commented Nov 30, 2014

I would also ask @pavanky whether he could give us an opinion on the role ArrayFire could play in device abstraction, and on what the performance impact would be if we no longer called cuBLAS/OpenBLAS/ATLAS and cuFFT directly.

@kloudkl
Contributor

kloudkl commented Dec 1, 2014

@bhack No. I can no longer open source any code.

@bhack
Contributor

bhack commented Dec 1, 2014

@kloudkl I'm sorry that we have lost a very active member.

@kloudkl
Contributor

kloudkl commented Dec 1, 2014

Caffe has been among the top 15 most forked C++ projects on GitHub. How could there not be enough contributions?

At the same time, many other organizations, e.g. Google, Baidu, Microsoft, Tencent, Samsung and GraphLab Inc, have all published or even open sourced various other (distributed) deep learning frameworks, some of which may pose serious disruptive threats in the coming year.

@kloudkl
Contributor

kloudkl commented Dec 1, 2014

#1148 (comment)

@bhack
Contributor

bhack commented Mar 28, 2015

Referencing #2195

@futurely

#2577

@shelhamer
Member Author

While device abstraction is still a good direction, this PR is closed since it targets the deprecated dev branch.

@shelhamer shelhamer closed this Aug 26, 2015
@bhack
Contributor

bhack commented Aug 26, 2015

@shelhamer Probably @naibaf7 has a plan to supersede this.

@shelhamer shelhamer deleted the device-abstraction branch April 10, 2017 17:55