
Device Abstraction #610

Closed
wants to merge 62 commits into from

Conversation

shelhamer
Member

CPU / GPU device separation and abstraction will:

  1. simplify the code
  2. make a CPU-only build possible with less mess -- compare CPU-only build #561
  3. prepare layers and blobs for multi-device parallelism
  4. pave the way for different GPU backends such as OpenCL

and so improve Caffe all around. That is, provided there is little to no overhead in both performance and coding. Since this requires a non-trivial set of coordinated changes, it has been promoted to the BVLC:device-abstraction feature branch. To contribute to this development, open a PR against BVLC:device-abstraction. When you rebase your feature branch on dev, comment with a link to your fork to notify the maintainers to update.

See #415 for history.

This PR tracks the progress and integration of this branch for its final merge to dev.

@shelhamer
Member Author

@robwhess #555 has been merged, so let me know once you have pushed a rebase of this branch on dev so that it can be reset to your work.

@kloudkl
Contributor

kloudkl commented Jul 4, 2014

@shelhamer, I'm a little confused about the most suitable workflow to contribute to a feature branch with an open PR. Is it the case that you will rebase this PR against device-abstraction from time to time while @robwhess's #587 and any other similar PRs should rebase against both dev and device-abstraction before being merged into the device-abstraction branch?

@shelhamer
Member Author

@kloudkl this PR and the device-abstraction PR are one and the same. Any change to BVLC:device-abstraction is automatically reflected in this PR. It's just like when you push further commits to a branch on your fork while a PR is open.

So the workflow to contribute to this PR is to make a PR on BVLC:device-abstraction. The usual rules for PRs hold except the base is device-abstraction instead of dev: you should have a clean merge, PRs should be short and to the point, etc.

The one complication for a feature branch that you are right to point out is that device-abstraction must itself track dev and be rebased from time to time. @robwhess has volunteered to do the first rebase of this kind. You and any other contributors can also help by rebasing when the feature branch has fallen behind. Once rebased, the contributor should push to their fork and comment for a BVLC member with push rights to update BVLC:device-abstraction to their rebased fork.

This is what was done for #165 except that I did too much of the rebasing then.

Let me know if that's not clear, since BVLC feature branch PRs are a little different in this respect.

@kloudkl
Contributor

kloudkl commented Jul 4, 2014

Very clear. Thanks a lot!

@jeffdonahue
Contributor

Interesting that the Travis build totally fails... I at least compiled @ e316202 on Linux with gcc successfully (though with a few warnings).

edit: oh right -- it's because of CPU_ONLY. If I do make clean && export CPU_ONLY=1 and then rebuild, I get the same results as Travis.
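
For context, here is a minimal sketch (illustrative only, not the actual Caffe sources; the function and message below are hypothetical) of how a CPU_ONLY-style compile-time flag typically strips GPU code paths, which is why a CPU_ONLY rebuild exercises different code than the default build:

```cpp
// Hedged illustration of a CPU_ONLY-style guard; names are hypothetical and
// do not reproduce the branch's actual macros.
#include <cstdlib>
#include <iostream>

void forward_gpu() {
#ifdef CPU_ONLY
  // GPU entry points collapse to a hard error in a CPU-only build.
  std::cerr << "Cannot use the GPU in a CPU-only build." << std::endl;
  std::abort();
#else
  // ... launch CUDA kernels here in a full build ...
  std::cout << "GPU path taken." << std::endl;
#endif
}

int main() {
  forward_gpu();  // aborts when compiled with -DCPU_ONLY, otherwise runs the GPU stub
  return 0;
}
```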

@jeffdonahue
Contributor

I did another rebase onto the latest dev (just another few days' worth of commits) and force pushed again. This rebase was relatively easy -- maybe the trick is to just not let too much time pass between rebases, or maybe these changes just happened to be easier than most.

@jeffdonahue
Contributor

Did one more rebase and force pushed after seeing Travis pass on my fork. This one was a bit more painful due to a couple interface-breaking PRs, but manageable; took around an hour. I've tested ./train_imagenet.sh @ dev and @ device-abstraction with a seed, and get the same results over 20 iterations, so between that and the unit tests I don't think this breaks any functionality. From that test there was a 0.8% performance hit vs. dev (28.72s @ device-abstraction vs 28.48s @ dev). I wonder where the slowdown comes from (assuming my test even shows a significant difference) -- could it help to inline all the new GetDevice type calls and math functions which would previously have been called directly?

@robwhess

@jeffdonahue yeah, the GetDevice() calls are obvious candidates for inlining and happen all over the place. I'll look into that to see if it saves any time.

Thanks for doing the rebases and getting things working with Travis.
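
To make the inlining question concrete, here is a hypothetical sketch (the class and function names are illustrative, not the branch's actual API) of a device-dispatch wrapper like the one being discussed. A thin accessor such as GetDevice() can be inlined away when defined in a header, while the virtual math call behind it generally cannot unless the compiler devirtualizes it:

```cpp
// Hypothetical device-dispatch sketch; not the device-abstraction branch's real API.
#include <vector>

template <typename Dtype>
class Device {
 public:
  virtual ~Device() {}
  // One virtual call per math op replaces a direct caffe_cpu_* / caffe_gpu_* call.
  virtual void axpy(int n, Dtype alpha, const Dtype* x, Dtype* y) const = 0;
};

template <typename Dtype>
class CpuDevice : public Device<Dtype> {
 public:
  virtual void axpy(int n, Dtype alpha, const Dtype* x, Dtype* y) const {
    for (int i = 0; i < n; ++i) y[i] += alpha * x[i];  // stand-in for a BLAS call
  }
};

// Defining the accessor in a header makes it implicitly inline, so the extra
// indirection costs only the (possibly devirtualizable) virtual call itself.
template <typename Dtype>
inline const Device<Dtype>& GetDevice() {
  static const CpuDevice<Dtype> cpu;  // CPU-only stand-in for brevity
  return cpu;
}

int main() {
  std::vector<float> x(4, 1.0f), y(4, 2.0f);
  GetDevice<float>().axpy(4, 0.5f, x.data(), y.data());  // each y[i] becomes 2.5f
  return 0;
}
```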

@robertsdionne

Perhaps once device abstraction is complete (which I see might take a while), the OpenCL portions of the math code could be implemented using parts of VexCL (https://github.com/ddemidov/vexcl). It's a template expression library that generates OpenCL C kernels from C++ expressions involving vex::vector<T> (a device analogue of std::vector<T>). Incidentally, it also has a CUDA backend.

For instance, a ReLU kernel might be as simple as the following C++ code:
top[i] = max(0.0f, bottom[i]);

or caffe_copy:
top[i] = bottom[i];

I believe it can be mixed with typical OpenCL-style libraries like clBLAS or clFFT.

However, it uses C++11 features. I'm not sure which versions of clang/g++ caffe uses.
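
A minimal self-contained sketch of the VexCL usage described above (illustrative only; it assumes VexCL's headers, a working OpenCL device, and that VexCL's builtin-function wrappers accept mixed scalar/vector arguments as written):

```cpp
// Illustrative VexCL sketch, not part of Caffe.
#include <vexcl/vexcl.hpp>

int main() {
  vex::Context ctx(vex::Filter::Any);   // attach to any available OpenCL device
  const size_t n = 1 << 20;

  vex::vector<float> bottom(ctx, n), top(ctx, n);
  bottom = 1.0f;                        // fill the input on the device

  top = max(0.0f, bottom);              // ReLU: VexCL generates and runs an OpenCL kernel
  top = bottom;                         // element-wise copy, i.e. the caffe_copy case
  return 0;
}
```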

@kloudkl
Contributor

kloudkl commented Sep 28, 2014

VexCL and ViennaCL were once considered as candidates. Out of concern for performance, clBLAS was preferred. Before making further decisions, please benchmark how they perform in Caffe in the most computationally expensive parts, such as the convolution layer.

@bhack
Contributor

bhack commented Sep 28, 2014

Yes, VexCL can be used with clBLAS and cuBLAS, but you could try benchmarking it in the convolution layer.

@leonardt

leonardt commented Oct 7, 2014

Hi all,
I'm interested in furthering the work done on device abstraction (and towards OpenCL support). Is there any ongoing work on this branch? I noticed @robwhess commented that he would attempt to do a rebase when he had time. I'd be interested in any input from the core Caffe devs on how to approach getting this branch back up and running and working towards full device abstraction.

@kloudkl
Contributor

kloudkl commented Oct 22, 2014

First things first. Git rebase and then add your own code.

@bhack
Contributor

bhack commented Nov 26, 2014

I don't know whether this is permanently stalled, but consider this news as well: ArrayFire is now under a BSD license (http://developer.amd.com/community/blog/2014/11/21/arrayfire-now-open-source/)

https://github.com/arrayfire/arrayfire

@shelhamer
Member Author

Device abstraction is still a worthy goal but has stalled for the moment. Once the interface is in place for abstracting between CPU, (CUDA) GPU, and other devices, ArrayFire seems like a potential choice for de-duplication. Thanks for the pointer.


@bhack
Contributor

bhack commented Nov 26, 2014

@robwhess @kloudkl Do you think you will really be able to come back to this? I think ArrayFire could help with this task (if its performance is confirmed). I've noticed you are no longer very active in the Caffe workflow, and I don't know whether you have a short/mid-term plan to come back to this.

@robwhess

Well, it is still my intention to come back to this. We'll eventually need the device abstraction at Flickr, so there should be an opportunity for us to work on this again, but that time just hasn't come yet, unfortunately. I'll keep ArrayFire in mind for when I finally do get the chance to work on this again.

@bhack
Contributor

bhack commented Nov 30, 2014

If anyone is interested, this is an example of a transparent multi-device gemv.
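
Since the original link is not reproduced here, the following is only a rough sketch of what a device-transparent GEMV looks like with ArrayFire's C++ API; the backend the program links against (CPU, CUDA, or OpenCL) determines where the multiplication actually runs:

```cpp
// Hedged ArrayFire sketch; not the example referenced above.
#include <arrayfire.h>

int main() {
  const int m = 512, n = 256;

  af::info();                       // report which device/backend ArrayFire selected
  af::array A = af::randu(m, n);    // m x n single-precision matrix on the device
  af::array x = af::randu(n, 1);    // n x 1 vector
  af::array y = af::matmul(A, x);   // y = A * x -- the GEMV itself
  af_print(y);                      // copy the result back and print it
  return 0;
}
```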

@bhack
Contributor

bhack commented Nov 30, 2014

I would also ask @pavanky whether he could give us an opinion on the role ArrayFire could play in device abstraction, and on what the performance impact would be if we no longer called cuBLAS/OpenBLAS/ATLAS and cuFFT directly.

@kloudkl
Contributor

kloudkl commented Dec 1, 2014

@bhack No. I can no longer open source any code.

@bhack
Contributor

bhack commented Dec 1, 2014

@kloudkl I'm sorry that we have lost a very active member.

@kloudkl
Contributor

kloudkl commented Dec 1, 2014

Caffe has been among the top 15 most forked C++ projects on GitHub. How could there not be enough contributions?

At the same time, many other organizations, e.g. Google, Baidu, Microsoft, Tencent, Samsung and GraphLab Inc, have all published or even open sourced various other (distributed) deep learning frameworks, some of which may pose serious disruptive threats in the coming year.

@kloudkl
Contributor

kloudkl commented Dec 1, 2014

#1148 (comment)

@bhack
Contributor

bhack commented Mar 28, 2015

Referencing #2195

@futurely

#2577

@shelhamer
Member Author

While device abstraction is still a good direction, this PR is closed since it targets the deprecated dev branch.

@shelhamer shelhamer closed this Aug 26, 2015
@bhack
Contributor

bhack commented Aug 26, 2015

@shelhamer Probably @naibaf7 has a plan to supersede this.

@shelhamer shelhamer deleted the device-abstraction branch April 10, 2017 17:55