
Parallelize Forward / Backward by Depth #547

Open
shelhamer opened this issue Jun 26, 2014 · 15 comments

@shelhamer
Member

Forward and Backward are done in sequence by layer ID at the moment. In principle, all Forward / Backward steps at the same depth in the DAG can be executed in parallel.

In DAG models where single layer operations do not saturate the host / device, this should improve performance.

As I understand it, this would be done with batched cuBLAS and CUDA streams for parallel kernel execution at each depth in the model.
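Caffe itself is C++/CUDA, but the depth-synchronous schedule can be sketched in Python, with a thread pool standing in for CUDA streams (`forward_by_depth` and `forward_fn` are hypothetical names for illustration, not Caffe API):

```python
from concurrent.futures import ThreadPoolExecutor

def forward_by_depth(depth_to_layers, forward_fn, num_workers=4):
    """Run all layers at the same depth in parallel, one depth at a time.

    Threads stand in for CUDA streams here; in Caffe itself each layer's
    Forward would be launched on its own stream / cuBLAS handle.
    """
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for depth in sorted(depth_to_layers):
            # Layers at the same depth have no data dependencies on each
            # other, so they may execute concurrently; pool.map blocks
            # until the whole depth is finished before advancing.
            list(pool.map(forward_fn, depth_to_layers[depth]))
```

The blocking `pool.map` per depth is the synchronization point: no layer at depth d+1 starts before every layer at depth d has completed.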

@bhack
Contributor

bhack commented Jun 26, 2014

One design that this feature could speed up, I think, is the model in the diagram on page 13 of this publication:
http://arxiv.org/abs/1312.6082v4

@shelhamer
Member Author

Pyramids and any model with late fusion [1, plus others and more to come] should likewise benefit.

[1] Large-Scale Video Classification with Convolutional Neural Networks
Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, Li Fei-Fei
CVPR 2014. http://cs.stanford.edu/people/karpathy/deepvideo/deepvideo_cvpr2014.pdf

@bhack
Contributor

bhack commented Jun 26, 2014

Fresh meat from CVPR 🍖

@sguada
Contributor

sguada commented Jun 27, 2014

Actually any two non-overlapping paths could be run in parallel, even if they have different lengths.


@shelhamer
Member Author

@sguada right, advancing by depth covers that case too: execute in parallel depth-by-depth, and if any particular path completes early that's fine; just keep going until the deepest layer has executed. There's no requirement for equal lengths.

There has to be some logic to decide the number of streams / handles though. To start, this could simply be selected manually.

@kloudkl
Contributor

kloudkl commented Jun 27, 2014

The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data.
http://tez.incubator.apache.org/

@bhack
Contributor

bhack commented Jun 27, 2014

@kloudkl I don't know whether GraphLab and GraphChi could also be useful:
http://graphlab.org/projects/index.html
http://docs.graphlab.org/index.html

@shelhamer
Member Author

A simple graph traversal to make a depth -> layers mapping should suffice for our purposes.

Thanks for the project pointers all the same.
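That traversal is short enough to sketch directly; here is a minimal Python version (`layers_by_depth` and the `bottoms` structure are hypothetical names for illustration, not Caffe's actual `Net` internals), assuming the layer list is already in topological order, as Caffe's layer IDs are:

```python
def layers_by_depth(layers, bottoms):
    """Map depth -> list of layer names, where a layer's depth is
    1 + the maximum depth of the layers whose outputs it consumes.

    `bottoms[name]` lists the predecessor layers of `name`; layers with
    no predecessors (e.g. data layers) get depth 1. `layers` must be in
    topological order so every predecessor's depth is known when needed.
    """
    depth = {}
    for name in layers:
        preds = bottoms.get(name, [])
        depth[name] = 1 + max((depth[p] for p in preds), default=0)
    # Invert the per-layer depths into the depth -> layers mapping.
    depth_to_layers = {}
    for name, d in depth.items():
        depth_to_layers.setdefault(d, []).append(name)
    return depth_to_layers
```

All layers sharing a depth bucket are then candidates for concurrent execution at that step.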

@bhack
Contributor

bhack commented Jun 27, 2014

Yes, but what kind of parallelization paradigm? Multi-thread, multi-device, (long term) distributed, or any combination of these?

@sguada
Contributor

sguada commented Jun 27, 2014

My only concern with paths of different lengths is that one can be faster or slower than another, so computing by depth will make all the paths run at the speed of the longest. But it probably doesn't matter much if the paths take similar times and have to merge at some point and wait anyway.


@shelhamer
Member Author

@bhack

Multi thread, multi device, (long term) distributed, or any combination of this.

Our parallelization goal is entirely single node. We have single process multi-thread / multi-device parallelism in mind.

Distributed computation has its place, but in my opinion there's no point pursuing it while there are still important single node gains to be made.

Of course anyone is free to pursue whatever parallelization they want, but this is the present direction of the project.

@sguada

But probably it doesn't matter much if paths have similar times and will have to merge at point and wait.

That was my thinking. We can always engage in fancier parallelization later if need be, but depth ordering should suffice.

@kloudkl
Contributor

kloudkl commented Jun 30, 2014

Have you tried to parallelize on a multi-device node using NVBLAS (#194), which only requires dynamically linking the shared library?

@shelhamer
Member Author

@kloudkl no, because I want to control the communication and only distribute layer-wise. The only time a parallelized forward / backward pass needs to communicate the data / diff is when a DAG model forks. At that point one path can keep computing while the data / diff are communicated to devices for the other path, which "hides" the communication while useful work is done.

It could be interesting to try distributing all BLAS operations with NVBLAS, but I expect it to not be worth the communication at standard input sizes. Worth noting all the same, since only benchmarks will tell.
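The communication hiding at a fork can be sketched in Python, with a background thread standing in for an asynchronous device transfer such as `cudaMemcpyAsync` on its own stream (`forward_fork` and `send_to_device` are hypothetical names for illustration):

```python
import threading

def forward_fork(compute_path_a, send_to_device, compute_path_b, payload):
    """At a DAG fork, overlap the copy for path B with compute on path A.

    `send_to_device` stands in for an asynchronous device-to-device
    transfer; here a thread plays that role so the transfer proceeds
    while path A does useful work.
    """
    xfer = threading.Thread(target=send_to_device, args=(payload,))
    xfer.start()                      # begin communicating the shared data/diff
    a_out = compute_path_a(payload)   # useful work hides the transfer
    xfer.join()                       # path B may start only once its copy is done
    b_out = compute_path_b(payload)
    return a_out, b_out
```

If path A's compute takes at least as long as the transfer, the communication cost disappears entirely from the critical path.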


@kloudkl
Contributor

kloudkl commented Jun 30, 2014

NVBLAS is such a low-hanging fruit that it is really worth some benchmarks. But I won't have access to multi-GPU devices in the near future. I hope someone interested will be able to do so.

@shelhamer
Member Author

I'm a little skeptical because it has a shared host memory model that doesn't mesh with Caffe's lazy allocation and communication-minimizing design. It doesn't seem like you can just give it a GPU memory pointer and accelerate away.

That said, I've only given a cursory look at cublasXt and would welcome example code and benchmarking that turn my impression on its head.

