
Parallelize Forward / Backward by Depth #547

Open
shelhamer opened this issue Jun 26, 2014 · 15 comments

@shelhamer
Member

Forward and Backward are done in sequence by layer ID at the moment. In principle, all Forward / Backward steps at the same depth in the DAG can be executed in parallel.

In DAG models where single layer operations do not saturate the host / device, this should improve performance.

As I understand it, this would be done with batched cuBLAS and CUDA streams for parallel kernel execution at each depth in the model.
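Caffe itself is C++/CUDA, but the depth-synchronous schedule can be sketched in Python, with a thread pool standing in for CUDA streams (`forward_by_depth` and `forward_fn` are hypothetical names for illustration, not Caffe API):

```python
from concurrent.futures import ThreadPoolExecutor

def forward_by_depth(depth_to_layers, forward_fn, num_workers=4):
    """Run all layers at the same depth in parallel, one depth at a time.

    Threads stand in for CUDA streams here; in Caffe itself each layer's
    Forward would be launched on its own stream / cuBLAS handle.
    """
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for depth in sorted(depth_to_layers):
            # Layers at the same depth have no data dependencies on each
            # other, so they may execute concurrently; pool.map blocks
            # until the whole depth is finished before advancing.
            list(pool.map(forward_fn, depth_to_layers[depth]))
```

The blocking `pool.map` per depth is the synchronization point: no layer at depth d+1 starts before every layer at depth d has completed.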

@bhack
Contributor

bhack commented Jun 26, 2014

One design that this feature could speed up, I think, is the model in the diagram on page 13 of this publication:
http://arxiv.org/abs/1312.6082v4

@shelhamer
Member Author

Pyramids and any model with late fusion [1, plus others and more to come] should likewise benefit.

[1] Large-Scale Video Classification with Convolutional Neural Networks
Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, Li Fei-Fei
CVPR 2014. http://cs.stanford.edu/people/karpathy/deepvideo/deepvideo_cvpr2014.pdf

@bhack
Contributor

bhack commented Jun 26, 2014

Fresh meat from CVPR 🍖

@sguada
Contributor

sguada commented Jun 27, 2014

Actually any two non-overlapping paths could be run in parallel, even if they have different lengths.


@shelhamer
Member Author

@sguada right, advancing by depth covers that case too: execute in parallel depth-by-depth, and if any particular path completes early that's fine; just keep going until the deepest layer has executed. There's no requirement for equal lengths.

There has to be some logic to decide the number of streams / handles though. To start, this could simply be selected manually.

@kloudkl
Contributor

kloudkl commented Jun 27, 2014

The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data.
http://tez.incubator.apache.org/

@bhack
Contributor

bhack commented Jun 27, 2014

@kloudkl I don't know whether GraphLab and GraphChi could also be useful:
http://graphlab.org/projects/index.html
http://docs.graphlab.org/index.html

@shelhamer
Member Author

A simple graph traversal to make a depth -> layers mapping should suffice for our purposes.

Thanks for the project pointers all the same.
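That traversal is short enough to sketch directly; here is a minimal Python version (`layers_by_depth` and the `bottoms` structure are hypothetical names for illustration, not Caffe's actual `Net` internals), assuming the layer list is already in topological order, as Caffe's layer IDs are:

```python
def layers_by_depth(layers, bottoms):
    """Map depth -> list of layer names, where a layer's depth is
    1 + the maximum depth of the layers whose outputs it consumes.

    `bottoms[name]` lists the predecessor layers of `name`; layers with
    no predecessors (e.g. data layers) get depth 1. `layers` must be in
    topological order so every predecessor's depth is known when needed.
    """
    depth = {}
    for name in layers:
        preds = bottoms.get(name, [])
        depth[name] = 1 + max((depth[p] for p in preds), default=0)
    # Invert the per-layer depths into the depth -> layers mapping.
    depth_to_layers = {}
    for name, d in depth.items():
        depth_to_layers.setdefault(d, []).append(name)
    return depth_to_layers
```

All layers sharing a depth bucket are then candidates for concurrent execution at that step.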

@bhack
Contributor

bhack commented Jun 27, 2014

Yes, but what kind of parallelization paradigm? Multi-thread, multi-device, (long term) distributed, or any combination of these?

@sguada
Contributor

sguada commented Jun 27, 2014

My only concern with paths of different lengths is that one can be faster or slower than another, so computing by depth will make all the paths run at the speed of the longest. But it probably doesn't matter much if the paths take similar times and have to merge at some point and wait anyway.


@shelhamer
Member Author

@bhack

Multi thread, multi device, (long term) distributed, or any combination of this.

Our parallelization goal is entirely single node. We have single process multi-thread / multi-device parallelism in mind.

Distributed computation has its place, but in my opinion there's no point pursuing it while there are still important single node gains to be made.

Of course anyone is free to pursue whatever parallelization they want, but this is the present direction of the project.

@sguada

But probably it doesn't matter much if paths have similar times and will have to merge at point and wait.

That was my thinking. We can always engage in fancier parallelization later if need be, but depth ordering should suffice.

@kloudkl
Contributor

kloudkl commented Jun 30, 2014

Have you tried to parallelize on a multi-device node using NVBLAS (#194), which only requires dynamically linking the shared library?

@shelhamer
Member Author

@kloudkl no, because I want to control the communication and only distribute layer-wise. The only time a parallelized forward / backward pass needs to communicate the data / diff is when a DAG model forks. At that point one path can keep computing while the data / diff are communicated to devices for the other path, which "hides" the communication while useful work is done.

It could be interesting to try distributing all BLAS operations with NVBLAS, but I expect it to not be worth the communication at standard input sizes. Worth noting all the same, since only benchmarks will tell.
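The communication hiding at a fork can be sketched in Python, with a background thread standing in for an asynchronous device transfer such as `cudaMemcpyAsync` on its own stream (`forward_fork` and `send_to_device` are hypothetical names for illustration):

```python
import threading

def forward_fork(compute_path_a, send_to_device, compute_path_b, payload):
    """At a DAG fork, overlap the copy for path B with compute on path A.

    `send_to_device` stands in for an asynchronous device-to-device
    transfer; here a thread plays that role so the transfer proceeds
    while path A does useful work.
    """
    xfer = threading.Thread(target=send_to_device, args=(payload,))
    xfer.start()                      # begin communicating the shared data/diff
    a_out = compute_path_a(payload)   # useful work hides the transfer
    xfer.join()                       # path B may start only once its copy is done
    b_out = compute_path_b(payload)
    return a_out, b_out
```

If path A's compute takes at least as long as the transfer, the communication cost disappears entirely from the critical path.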


@kloudkl
Contributor

kloudkl commented Jun 30, 2014

NVBLAS is such a low-hanging fruit that it is really worth some benchmarks. But I won't have access to multi-GPU devices in the near future. I hope someone interested will be able to do so.

@shelhamer
Member Author

I'm a little skeptical because it has a shared host memory model that doesn't mesh with Caffe's lazy allocation and communication-minimizing design. It doesn't seem like you can just give it a GPU memory pointer and accelerate away.

That said, I've only given a cursory look at cublasXt and would welcome example code and benchmarking that turn my impression on its head.

