# [Draft] doc for refactoring Paddle Ops/Graph/Tensor, etc. #2445

Closed. Wants to merge 5 commits.

reyoung (Collaborator) commented Jun 12, 2017:

This might be easier to review here.


A Graph is simply composed of Ops: it is an array of Ops. A Graph can additionally carry a `tensors_` field listing all tensors in the graph. This field is derived from `ops_` and exists only to make it convenient for users to access all the information in the graph.

## Graph Compiler
Member:

Just for the record and for discussion: in a SQL engine, the counterpart of a graph is called a plan (logical plan, physical plan), the counterpart of the compiler is called an optimizer, and each GraphCompilerFN is called a pass (an optimization pass).

reyoung (Collaborator, Author):

Indeed, MXNet calls this concept a pass. But "pass" conflicts with Paddle's existing notion of a pass, i.e. one run of computation over all the data.

Collaborator:

SQL engine => SQL execution engine

plan => execution plan

What the optimizer optimizes is the execution plan corresponding to a SQL program. That execution plan is indeed generated by a SQL interpreter.


1. Inplace Compiler: when an Op's input and output have the same size, and the input Tensor is depended on only by this Op, the graph can be simplified by merging the input and output into a single TensorAttr.
2. Backward Compiler: back-propagate starting from Loss.mean. First set the gradient `Loss.mean.grad` to `1.0` by default, then traverse the graph in reverse, inserting the gradient Ops. If a TensorAttr does not need a gradient, no gradient is generated for it.
3. Optimizer Compiler: find every parameter and its gradient, insert an SGD op between the two, and write the result back to the parameter's TensorAttr (see the sketch after this list).
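For concreteness, a minimal sketch (assumptions, not the actual design) of what the Optimizer Compiler in item 3 could do, built on the `Graph`/`Op`/`TensorAttrPtr` structs sketched elsewhere in this document. The `attrs_` field on Op and the way parameter/gradient pairs are discovered are assumptions left to the caller:

```cpp
// Hypothetical sketch of the Optimizer Compiler pass described in item 3:
// for each (parameter, gradient) pair, append an "sgd" Op whose output is the
// parameter itself, so the updated value is written back in place.
void optimizerCompiler(Graph* graph,
                       const Vec<std::pair<TensorAttrPtr, TensorAttrPtr>>& paramsAndGrads,
                       float learningRate) {
  for (const auto& pg : paramsAndGrads) {
    Op sgd;
    sgd.type_ = "sgd";
    sgd.inputs_ = {pg.first, pg.second};        // {parameter, gradient}
    sgd.outputs_ = {pg.first};                  // write the result back to the parameter
    sgd.attrs_["learning_rate"] = learningRate; // attrs_ field on Op is an assumption
    graph->ops_.push_back(sgd);
  }
}
```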
Member:

Typos: 讲输出结果 ==> 将输出结果, 保存会 ==> 保存为.

Member:

If the SGD ops are added to the graph like this, then, given the earlier conclusion that ops are invoked linearly, the SGD ops are also spread along the ops chain, so they get invoked interleaved with the other ops, right? Each one is executed when the traversal reaches it.

reyoung (Collaborator, Author):

> If the SGD ops are added to the graph like this, then, given the earlier conclusion that ops are invoked linearly, the SGD ops are also spread along the ops chain, so they get invoked interleaved with the other ops, right? Each one is executed when the traversal reaches it.

Yes, linear invocation. `Ops_` is the result of topologically sorting the DAG; the only requirement is that dependencies are computed first. Many different orders are valid.

```cpp
Graph graph;
auto input = graph.createTensor("input", {1000, 784}, float);
auto hidden0 = graph.fullyConnected("hidden0", input, 200);
auto hidden0Sigmoid = graph.fullyConnected("hidden0_sigmoid", fc);
```
Member:

The hidden0 and hidden0Sigmoid here do not seem to be used.

reyoung (Collaborator, Author), Jun 13, 2017:

Typo, will fix it today.


## Engine

The Ops fed into the Engine are assumed to already include the Ops required for multi-device synchronization, so the Engine implementation is fairly simple:
Member:

Typo: 加定 ==> 假定.


```cpp
graph.backward(loss);
graph.optimize("sgd", 1e-4);
graph.multiCpu(4);
```
Member:

So is this what compiling a graph refers to, i.e. turning one graph into another graph?

Member:

Is the binding between the workspace and the graph done inside the engine?

reyoung (Collaborator, Author):

> So is this what compiling a graph refers to, i.e. turning one graph into another graph?

Yes, but for efficiency the graph itself is modified in place. If you need the result as a separate graph, you can do:

```cpp
Graph other = graph;
other.backward(...);
```

reyoung (Collaborator, Author):

> Is the binding between the workspace and the graph done inside the engine?

Yes, but there is actually no "binding". The engine simply uses the names in the graph to fetch the memory from the workspace, and hands it to the Op's kernel function for computation.

Graph and Workspace are not bound to each other.
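A minimal sketch, assuming the `Workspace`, `TensorAttr`, and `TensorBuffer` types described in this document, of how that name-based lookup could work; the helper names (`makeBufferFor`, `numel`, `sizeOf`) are hypothetical:

```cpp
// Hypothetical sketch: the engine resolves each TensorAttr to a buffer in the
// Workspace by name, then hands a Tensor (buffer + attributes) to the kernel.
Tensor resolve(Workspace& w, const TensorAttrPtr& attr) {
  TensorBufferPtr& buf = w.allBuffers_[attr->name_];  // created lazily on first use
  if (!buf) buf = makeBufferFor(*attr);               // hypothetical factory, e.g. CpuTensorBuffer
  buf->resize(numel(*attr) * sizeOf(attr->type_));    // hypothetical size helpers
  return Tensor{buf, attr};                           // the kernel never sees the Workspace itself
}
```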

## Envisioned user-facing usage

```cpp
Workspace w;
```
Member:

Does the user need to create the workspace, or is there one global workspace per process by default? It looks like multiple graphs can exist.

reyoung (Collaborator, Author):

> Does the user need to create the workspace?

Yes, the user creates it. It is still better not to use a global variable.

```cpp
auto hidden0Sigmoid = graph.fullyConnected("hidden0_sigmoid", fc);
auto hidden1 = graph.fullyConnected("hidden1", input, 10);
auto prediction = graph.softmax("prediction", hidden1);
```

Member:

What type is auto here?

Member:

TensorAttrPtr

Member:

It is a bit odd that this is TensorAttrPtr; that concept should probably be hidden. I think we should abstract out a layer similar to Expression in DyNet or symbol in MXNet: when configuring a network, the user only writes expressions or symbols, and the graph is built internally. TensorAttrPtr is a fairly low-level concept, and exposing it to users does not seem appropriate.

reyoung (Collaborator, Author):

The original intent was exactly that people should not have to care what type this is. It is just a handle that other Ops can take as input.

For now it is TensorAttrPtr.

```cpp
struct Op {
  std::string type_;
  Vec<TensorAttrPtr> inputs_;
  Vec<TensorAttrPtr> outputs_;
  // ...
```
Member:

1. Is outputs_ derived from inputs_?
2. When two ops are placed next to each other in the graph, is the previous op's outputs_ equal to the next op's inputs_?

reyoung (Collaborator, Author), Jun 13, 2017:

> 1. Is outputs_ derived from inputs_?
> 2. When two ops are placed next to each other in the graph, is the previous op's outputs_ equal to the next op's inputs_?

Two consecutive Ops may indeed be placed next to each other, but:

  1. The previous Op's outputs are not necessarily exactly the next Op's inputs. For example, the inputs of the cross-entropy Op are the FC output plus a separate label object.
  2. The only requirement is that the order of the Ops is a valid topological ordering of the Graph (a DAG); consecutive Ops in the array do not have to be directly connected.

All parameters are stored in a single WorkSpace. Allocating, freeing, and resizing `TensorBuffer`s is delegated to this global WorkSpace. The benefits of this WorkSpace are:

1. Memory can be shared across different topologies (just share the same `Workspace`).
Contributor:

> (just share the same Workspace)

Couldn't it be that different topologies share only "some segments" of memory, i.e. some of the TensorBufferPtrs? If so, shouldn't different topologies use different workspaces and share a few buffers between workspaces, rather than "just share the same Workspace"?

reyoung (Collaborator, Author):

> Different topologies may share only "some segments" of memory.

The intent here is that this is achieved by matching names between the Workspace and the Graph.

> Shouldn't different topologies use different workspaces and share a few buffers between workspaces, rather than "just share the same Workspace"?

Using fewer Workspaces makes other work easier. For example:

  • workspace.checkpoint()
  • Only this one workspace needs to communicate with the pserver.

Otherwise, if two networks use two workspaces, which workspace should be used to synchronize with the other machines in multi-machine training?

```cpp
struct Workspace {
  Map<string, TensorBufferPtr> allBuffers_;
};
```
All parameters are stored in a single WorkSpace. Allocating, freeing, and resizing `TensorBuffer`s is delegated to this global WorkSpace. The benefits of this WorkSpace are:
Contributor:

> Allocating, freeing, and resizing TensorBuffers is delegated to this global WorkSpace.

Are these operations member functions of Workspace?

reyoung (Collaborator, Author):

I will flesh this part out today; sorry it was not written clearly.


* shape. What output shapes this Op produces for given input shapes. This function can report errors by throwing an exception.
* grad. Which gradient Op(s) correspond to this Op. grad may be empty; empty means the Op does not support backward propagation.
* kernels. How this Op is computed on each device. (See the sketch after this list.)
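A minimal sketch, not the actual definition, of how the OpMeta struct these three members belong to might be declared. The exact function signatures are assumptions; `Vec`, `Tensor`, `TensorAttrPtr`, `Op`, `AttributeMap`, and `kNUM_DEVICES` are the names used elsewhere in this document:

```cpp
#include <functional>
#include <string>

// Infers output shapes from input shapes; may throw on invalid inputs.
using ShapeInferFN = std::function<void(const Vec<TensorAttrPtr>& inputs,
                                        const Vec<TensorAttrPtr>& outputs)>;

// Derives the gradient Op(s) of a forward Op; an empty function means "no backward".
using GradFN = std::function<Vec<Op>(const Op& forwardOp)>;

// Computes the Op on one concrete device.
using KernelFN = std::function<void(const Vec<Tensor>& inputs,
                                    const Vec<Tensor>& outputs,
                                    const AttributeMap& attrs)>;

struct OpMeta {
  std::string type_;                // e.g. "fc", "softmax"
  ShapeInferFN shape_;              // one shape-inference function per Op type
  GradFN grad_;                     // gradient-Op derivation; empty if unsupported
  KernelFN kernels_[kNUM_DEVICES];  // one kernel per device type
};
```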
Contributor:

Please make shape, grad, and kernels match the member names in the struct above by adding the trailing underscore (_).

Member:

What is the relationship between OpMeta and TensorAttr? Why can't TensorAttr also be treated as part of OpMeta? Information such as device and need_gradient could also live in OpMeta, couldn't it?

Collaborator:

I don't quite understand the main criterion for splitting Op and OpMeta. If Op represents the inputs/outputs and connections between operations while OpMeta represents the properties of the operation itself, then it seems TensorAttr could indeed go into OpMeta.

reyoung (Collaborator, Author):

> Please make shape, grad, and kernels match the member names in the struct above by adding the trailing underscore (_).


The main concepts include:

* Workspace: the parameters and inputs of all neural network layers are managed by one global object. Allocation, freeing, and resizing of memory/GPU memory all go through this object.
Contributor:

The later definition of Workspace does not show what this management actually includes:

```cpp
struct Workspace {
  Map<string, TensorBufferPtr> allBuffers_;
};
```
reyoung (Collaborator, Author):

Will flesh this out today.


* name_ Every Tensor has a unique name within one computation graph.
* type_ Paddle Tensors are classified only by data type: float, double, int, and so on.
* needGrad_ Whether a gradient should be computed for this Tensor during backward. (See the sketch after this list.)
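For reference, a minimal sketch of how a TensorAttr holding these fields might look. The `dims_` and `device_` member names, the `DataType` enum, and the `SmallVec` alias are assumptions; `DeviceType` is the enum shown later in this thread:

```cpp
#include <memory>
#include <string>

// Hypothetical sketch of TensorAttr: the per-graph description of one tensor.
struct TensorAttr {
  std::string name_;       // unique within one Graph
  DataType type_;          // float, double, int, ... (enum assumed)
  bool needGrad_;          // compute a gradient for this tensor during backward?
  SmallVec<size_t> dims_;  // shape; field name is an assumption
  DeviceType device_;      // which device this tensor lives on
};

using TensorAttrPtr = std::shared_ptr<TensorAttr>;
```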
Contributor:

Based on what is needGrad_ set to false or true here? And at what point is this value checked?

reyoung (Collaborator, Author):

  1. The user sets it.
  2. Usually the user sets it according to whether the tensor is input data: needGrad_=false for input data, needGrad_=true otherwise.
  3. The value is checked in the backward Graph compiler. If it is false, that branch is not back-propagated.

* Op covers all operations in the neural network. An Op consists of:
  * Meta information, i.e. information belonging to an operation type: kernel functions for each device, the shape inference function, and the gradient-Op derivation function.
  * Configuration information, i.e. which TensorAttrs are the inputs and outputs of, say, one particular FC layer.
* Graph represents the complete computation of a neural network, mainly as an array of Ops. To make it convenient to look up tensor information, it also carries a `map<string, TensorAttr>` field for fast access to tensors.
Contributor:

"fast access to tensors" should probably be "fast access to TensorAttrs"; the document describes Tensor and TensorAttr as two different things.

Contributor:

Shouldn't Graph be represented as a DAG? Is it possible to build Op nodes that merge or split data flow? For example, when a network has multiple costs, a tensor needs to be split across branches.

reyoung (Collaborator, Author):

> Shouldn't Graph be represented as a DAG? Is it possible to build Op nodes that merge or split data flow, e.g. when a network has multiple costs?

Ops is the result of topologically sorting the DAG, and each Op holds pointers to its inputs and outputs, so the graph is in fact fully recorded.

```cpp
auto crossEntropy = graph.crossEntropy("xe_loss", prediction, label);
auto loss = graph.mean("mean_loss", crossEntropy);

graph.backward(loss);
```
Contributor:

What kind of operation is graph.backward? It looks like it generates the backward ops from the forward ops, but in that case what is the loss argument used for?

JiayiFeng (Collaborator), Jun 12, 2017:

loss is the final output of the whole network, from which the entire network can be found by searching backwards. So I guess graph.backward means "generate the whole computation graph by backward search", perhaps also including "add the back-propagation part to the graph along the way".

reyoung (Collaborator, Author):

Passing in loss serves two purposes:

  1. It marks the starting point of backward: all Ops are traversed in reverse starting from loss.
  2. The gradient of the loss buffer is set to 1.0, not 0.0, when the engine starts executing. (See the sketch after this list.)
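A rough sketch, under assumptions, of the two points above, using the `Graph`/`Op` structs in this document; `metaOf`, `gradOf`, `markInitialGradient`, and the exact way the initial 1.0 is injected are hypothetical:

```cpp
// Hypothetical sketch of the Backward Compiler: traverse the topologically
// sorted forward ops in reverse and append the derived gradient ops.
void backward(Graph* graph, TensorAttrPtr loss) {
  // The gradient buffer of `loss` is initialized to 1.0 (not 0.0) when the
  // engine starts executing, so back-propagation has a starting value.
  markInitialGradient(graph, gradOf(loss), 1.0);  // hypothetical helper
  Vec<Op> gradOps;
  // Walk the forward ops in reverse (ops_ is already a topological order).
  for (auto it = graph->ops_.rbegin(); it != graph->ops_.rend(); ++it) {
    const OpMeta& meta = metaOf(it->type_);       // lookup by Op::type_
    if (!meta.grad_) continue;                    // op does not support backward
    // Gradient ops for TensorAttrs with needGrad_ == false are simply not produced.
    for (const Op& g : meta.grad_(*it)) gradOps.push_back(g);
  }
  graph->ops_.insert(graph->ops_.end(), gradOps.begin(), gradOps.end());
}
```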

Contributor:

So this loss is only used when the backward computation actually starts executing, rather than for generating the backward computation graph. And I do not quite understand point 2, setting it to 1.0.

```cpp
graph.backward(loss);
graph.optimize("sgd", 1e-4);
graph.multiCpu(4);
Engine engine(w, graph, cpuNum = 4);
```
Contributor:

Is cpuNum=4 here related to the earlier graph.multiCpu(4)?

reyoung (Collaborator, Author):

Yes. The graph is compiled into a 4-CPU computation graph, including a "SplitDataOp" and a "MergeGradientOp", while setting the CPU count on the engine makes it use five threads: four compute threads and one control thread.

That said, they should indeed be merged into a single call, with graph.multiCpu(4) done inside the engine. They were separated here only to make the explanation clearer.

Will change this today.

Contributor:

> Setting the CPU count on the engine makes it use five threads: four compute threads and one control thread.

Then is the corresponding Engine engine(w, graph, cpuNum = 4); a mistake? Should it be Engine engine(w, graph, cpuNum = 5);? In that case the user has no way of knowing how to configure this 5.

## Workspace

```cpp
class TensorBuffer {
```
Contributor:

How does TensorBuffer represent GPU memory? Line 40 above writes w[input] = [...]; directly, which does not seem to cover the GPU case either.

reyoung (Collaborator, Author):

  1. TensorBuffer is a base class; its subclasses are CpuTensorBuffer and GpuTensorBuffer. (See the sketch after this list.)
  2. What users operate on directly is always a CpuTensorBuffer. Before GPU computation, the Engine compiles the original Graph into a multi-device Graph; even single-GPU computation needs an extra "CopyHost2DeviceOp".
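A hypothetical sketch of what the GpuTensorBuffer subclass could look like, assuming the TensorBuffer base class with `buf_`, `size`, and `resize()` shown later in this thread; this is not the actual implementation:

```cpp
#include <cuda_runtime.h>

// Hypothetical GpuTensorBuffer: owns device memory via cudaMalloc/cudaFree.
// Host code never dereferences buf_ directly; kernels or explicit copy ops do.
class GpuTensorBuffer : public TensorBuffer {
 public:
  GpuTensorBuffer() { buf_ = nullptr; size = 0; }
  void resize(size_t bytes) override {
    if (bytes <= size) return;           // grow-only, like a capacity
    if (buf_ != nullptr) cudaFree(buf_);
    cudaMalloc(&buf_, bytes);
    size = bytes;
  }
  ~GpuTensorBuffer() {
    if (buf_ != nullptr) cudaFree(buf_);
  }
};
```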

## Graph

```cpp
struct Graph {
```
Contributor:

Judging from the example starting at line 21, this Graph should contain many more methods.

reyoung (Collaborator, Author):

Yes, more methods are needed. Strictly speaking, though, those methods should belong to something else, e.g. a GraphBuilder, because Graph should stay simple and contain only member variables.

I will also add GraphBuilder to this document today.


```cpp
graph.backward(loss);
graph.optimize("sgd", 1e-4);
graph.multiCpu(4);
```
Collaborator:

Is the graph compile process accomplished jointly by graph.backward, graph.optimize, and graph.multiCpu(4)?

reyoung (Collaborator, Author):

Yes, behind the scenes they call compile.


![alt](./pics/step1.png)

In this figure, three compilations are performed. They are:
Contributor:

> Graph Compiler is a set of functions that take some arguments, modify the graph, and write the result back into the original graph.

From this sentence, the compiler is not an actual compilation process in the usual sense; won't the name be a bit confusing? Or is there precedent for this usage, or am I misunderstanding?

Also, as a design document: I only see a line connecting Graph Compiler and Graph in the figure, and I did not get from the document how it is actually used in the system. Does it have no relation to the Engine?

reyoung (Collaborator, Author):

If the term "Compile" feels confusing, we can switch to another term, but I have not come up with a more understandable name.

Actually the name is not really confusing. In compiler theory, compilation means transforming language A into language B; a language is just an agreed-upon form of description, not necessarily a programming language or machine code, and not every compilation produces an executable.

> Also, as a design document: I only see a line connecting Graph Compiler and Graph in the figure, and I did not get how it is actually used in the system. Does it have no relation to the Engine?

A compiler is simply one modification pass over a Graph, used roughly like:

```cpp
compile(graph, "backward", {"from", avg_loss});
```

The Engine can invoke Compilers to modify a Graph so that it is easier to schedule.
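A minimal sketch of what such a `compile()` entry point could dispatch to, assuming a registry keyed by pass name; `GraphCompilerFN` is the name used earlier in this thread, while the registry itself is an assumption:

```cpp
#include <functional>
#include <string>

// Each graph compiler (pass) mutates the Graph in place.
using GraphCompilerFN = std::function<void(Graph* graph, const AttributeMap& attrs)>;

// Hypothetical registry, filled at startup, e.g. with "backward", "sgd", "multi_cpu".
Map<std::string, GraphCompilerFN>& compilerRegistry();

void compile(Graph* graph, const std::string& name, const AttributeMap& attrs) {
  compilerRegistry().at(name)(graph, attrs);  // look up the pass by name and run it
}
```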

```cpp
graph.optimize("sgd", 1e-4);
graph.multiCpu(4);
Engine engine(w, graph, cpuNum = 4);
for (;;) {
```
Contributor:

If this is the intended user-facing usage, is it Python pseudocode? If so, please use correct syntax.

reyoung (Collaborator, Author):

This is C++ code.

The Python API is not related to this work for now. Once Op and Graph are wrapped up, it should be possible to configure and build a neural network on the C++ side as well.

```cpp
struct Graph {
  Vec<Op> ops_;

  Map<string, TensorAttr> tensors_;
};
```
Collaborator:

Could tensors_ be placed in the Workspace, like Map<string, TensorBufferPtr> allBuffers_? If so, is there any advantage in keeping it here?

reyoung (Collaborator, Author):

In theory tensors_ could also be placed in the Workspace, but doing so would blur the concepts:

  • Workspace handles only memory management and contains no description of the graph.
  • Graph contains only the description of the graph and owns no memory.

1. One compute thread per device; if an Op's inputs and outputs are not on the same device, that Op is executed by a separate control thread.
1. Each thread executes its Ops in order; when all threads have finished, one `Engine.run` is complete. (See the sketch below.)

As shown in the figure above, the blue, green, and yellow parts run in three different threads.
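A simplified sketch, under assumptions, of the per-device execution loop described above. The grouping of ops per device (`opsPerDevice_`), the `metaOf`/`deviceOf`/`toTensors` helpers, the `attrs_` field on Op, and the Engine members are all hypothetical:

```cpp
#include <thread>
#include <vector>

// Hypothetical sketch: one worker thread per device executes its ops in order;
// Engine.run finishes when every worker has finished.
void Engine::run() {
  std::vector<std::thread> workers;
  for (auto& deviceOps : opsPerDevice_) {        // ops already grouped per device
    workers.emplace_back([this, &deviceOps] {
      for (Op& op : deviceOps) {                 // topological order within the device
        const OpMeta& meta = metaOf(op.type_);   // hypothetical registry lookup
        KernelFN kernel = meta.kernels_[deviceOf(op)];
        kernel(toTensors(op.inputs_), toTensors(op.outputs_), op.attrs_);
      }
    });
  }
  for (auto& worker : workers) worker.join();
}
```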
JiayiFeng (Collaborator), Jun 12, 2017:

What is the relationship between Engine and Workspace? Are the buffers in the Workspace created and manipulated by the Engine? If so, it would be better to draw the line between Workspace and Engine in the first figure as bidirectional.

The main concepts include:

* Workspace: the parameters and inputs of all neural network layers are managed by one global object. Allocation, freeing, and resizing of memory/GPU memory all go through this object.
* The type of allocated memory is `TensorBuffer`, which is a `void*` pointer plus a size.
Collaborator:

typedef void* TensorBuffer looks like an unacceptable over-simplification.

Neither TensorFlow's tensor nor MXNet's Blob is a void*.

TensorFlow's Tensor has the following data members:

```cpp
TensorShape shape_;
TensorBuffer* buf_;
```

where TensorBuffer is an interface that derives from RefCounted.

reyoung (Collaborator, Author):

There is a misunderstanding here; I should have written out the full definitions of TensorBuffer and Workspace:

```cpp
class TensorBuffer {
public:
  void* buf_;
  size_t size;
  virtual void resize(size_t) = 0;
};

class CpuTensorBuffer : public TensorBuffer {
public:
  // ...
};

using TensorBufferPtr = std::shared_ptr<TensorBuffer>;

class Workspace {
public:
  Map<string, TensorBufferPtr> buffers_;
};
```

Here, TensorBufferPtr is indeed a ref-counted pointer, and it also carries attributes such as size.


* Workspace: the parameters and inputs of all neural network layers are managed by one global object. Allocation, freeing, and resizing of memory/GPU memory all go through this object.
* The type of allocated memory is `TensorBuffer`, which is a `void*` pointer plus a size.
* Tensor is the argument format of the compute kernel functions; it consists of a `TensorBuffer` and a `TensorAttr`. The `TensorAttr` contains the Tensor's `Shape`, whether it needs a gradient, device information, and so on.
wangkuiyi (Collaborator), Jun 12, 2017:

Obviously, a "tensor" is a high-dimensional matrix, and cannot "include gradient", because a gradient is another tensor.

Do you want to say Variable instead of tensor here?

reyoung (Collaborator, Author):

Indeed, Variable is better. I will change it today.

* Tensor is the argument format of the compute kernel functions; it consists of a `TensorBuffer` and a `TensorAttr`. The `TensorAttr` contains the Tensor's `Shape`, whether it needs a gradient, device information, and so on.
* Op covers all operations in the neural network. An Op consists of:
  * Meta information, i.e. information belonging to an operation type: kernel functions for each device, the shape inference function, and the gradient-Op derivation function.
  * Configuration information, i.e. which TensorAttrs are the inputs and outputs of, say, one particular FC layer.
Collaborator:

By definition, meta-data is the data used to describe data, so the meta information (元信息) should include the configuration attributes (配置信息).

reyoung (Collaborator, Author):

Sorry, I did not explain this clearly.

The meta data describes the intrinsic properties of "a kind of" Op, while the configuration describes the concrete properties of "one particular" Op. Only a concrete FC layer actually knows what its inputs and outputs are.

The relationship between Meta and configuration is that of a type to a value.

* The type of allocated memory is `TensorBuffer`, which is a `void*` pointer plus a size.
* Tensor is the argument format of the compute kernel functions; it consists of a `TensorBuffer` and a `TensorAttr`. The `TensorAttr` contains the Tensor's `Shape`, whether it needs a gradient, device information, and so on.
* Op covers all operations in the neural network. An Op consists of:
  * Meta information, i.e. information belonging to an operation type: kernel functions for each device, the shape inference function, and the gradient-Op derivation function.
Contributor:

What does "gradient-Op derivation function" mean? Why does a gradient Op need to be derived? I assumed each Op would simply hold a reference to its corresponding gradient Op, with no derivation needed.

reyoung (Collaborator, Author):

Every Op does have corresponding gradient Op(s), but the correspondence has to be described by a function rather than a simple map, hence the "gradient-Op derivation function". The derivation is nontrivial because:

  • A function is needed to describe how the gradient Op's inputs and outputs correspond, one by one, to the forward Op's inputs and outputs.
    • See that code: the FC Op's gradient takes the FC Op's first two inputs plus the gradient of the FC Op's output, and returns the gradients of the three inputs. (See the sketch below.)
  • A forward Op maps to gradient Ops one-to-many, not one-to-one. For example, an Op may have three inputs of which one does not need a gradient; that Op's gradient can then correspond to two gradient Ops.
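A hedged sketch of what such a derivation function could return for the FC example above, assuming the `Op`/`GradFN` shapes sketched earlier in this thread; `gradOf()` and the `"fc_grad"` type name are hypothetical:

```cpp
// Hypothetical GradFN for the FC op: one gradient op that consumes the first
// two forward inputs plus the gradient of the forward output, and produces
// the gradients of all three forward inputs.
Vec<Op> fcGrad(const Op& fwd) {
  // fwd.inputs_ = {input, weight, bias}, fwd.outputs_ = {output}
  Op g;
  g.type_ = "fc_grad";
  g.inputs_  = {fwd.inputs_[0], fwd.inputs_[1], gradOf(fwd.outputs_[0])};
  g.outputs_ = {gradOf(fwd.inputs_[0]), gradOf(fwd.inputs_[1]), gradOf(fwd.inputs_[2])};
  // If one forward input has needGrad_ == false, the derivation could instead
  // emit a different (or an additional) gradient op that skips that output,
  // which is why one forward op may map to several gradient ops.
  return {g};
}
```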


The main concepts include:

* Workspace: the parameters and inputs of all neural network layers are managed by one global object. Allocation, freeing, and resizing of memory/GPU memory all go through this object.
Contributor:

Could we avoid a global variable? A context could be used, or a session as in TF.

reyoung (Collaborator, Author):

Right, a global variable really should not be used; the wording here is a typo.

When I first implemented a sample myself, I made it a global variable and forgot to fix the wording here.

The Workspace in this document is a relatively long-lived local variable, not a global variable. Sorry about that.

```cpp
Workspace w;
Graph graph;
auto input = graph.createTensor("input", {1000, 784}, float);
```
Contributor:

A Tensor's size is fixed and cannot change with the mini-batch size. Should a new concept be introduced here, a placeholder whose shape leaves the mini-batch dimension unspecified?

reyoung (Collaborator, Author):

Indeed, a Tensor's size in the Graph can change at any time.

This only specifies the size when the tensor is first created; the size can be modified later.

Perhaps I should change it to createOrResizeTensor("input", {1000, 784}, float);

In fact my experimental implementation uses createOrResizeTensor.

```cpp
auto loss = graph.mean("mean_loss", crossEntropy);

graph.backward(loss);
graph.optimize("sgd", 1e-4);
```
Contributor:

These two feel like attributes that are only needed at training time and are not really related to the graph:

```cpp
graph.backward(loss);
graph.optimize("sgd", 1e-4);
```

Member:

Currently these two functions are actually compilers; they modify the graph.

reyoung (Collaborator, Author):

backward and optimize are both modifications of the Graph, not attributes of the Engine.

See the figure above.

In the current design there is no Trainer concept; everything just executes a graph. Of course, a higher-level user API could wrap up a Trainer, and then backward and optimize would become attributes of the Trainer.

All parameters are stored in a single WorkSpace. Allocating, freeing, and resizing `TensorBuffer`s is delegated to this global WorkSpace. The benefits of this WorkSpace are:

1. Memory can be shared across different topologies (just share the same `Workspace`).
2. A checkpoint mechanism can be implemented: serializing all buffers completes a checkpoint. `Workspace.checkpoint()`
Contributor:

My understanding is that only Parameter and Optimizer carry state to serialize; serializing through the workspace does not seem to have any big advantage over serializing Parameter and Optimizer separately.

reyoung (Collaborator, Author):

From

```cpp
optimizer.check_point();
for (auto& param : params) {
  param.check_point();
}
```

to

```cpp
Workspace workspace;
// ...

workspace.checkpoint();
```

The Workspace manages all stateful memory, so everything can be checkpointed in one go.

Besides, Parameter and Optimizer are not the only stateful things. Some Layers also carry state, e.g. the moving mean and variance in BatchNorm, which are state but neither parameters nor Optimizer state.
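A minimal sketch, under assumptions, of what `Workspace::checkpoint()` could look like on top of the `Map<string, TensorBufferPtr>` member; the output-stream parameter and the record format are invented for illustration, and CPU-only buffers are assumed (GPU buffers would be copied to host first):

```cpp
#include <cstdint>
#include <ostream>
#include <string>

// Hypothetical checkpoint: serialize every named buffer as
// [name length][name][byte count][raw bytes].
void Workspace::checkpoint(std::ostream& os) const {
  for (const auto& kv : allBuffers_) {
    const std::string& name = kv.first;
    const TensorBufferPtr& buf = kv.second;
    uint64_t nameLen = name.size();
    uint64_t bytes = buf->size;
    os.write(reinterpret_cast<const char*>(&nameLen), sizeof(nameLen));
    os.write(name.data(), static_cast<std::streamsize>(nameLen));
    os.write(reinterpret_cast<const char*>(&bytes), sizeof(bytes));
    os.write(reinterpret_cast<const char*>(buf->buf_), static_cast<std::streamsize>(bytes));
  }
}
```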

reyoung (Collaborator, Author):

Also, this makes things clearer for the PServer too: everything that truly needs cross-node state synchronization lives in the Workspace, so

```cpp
workspace.synchronize();
```

is enough to complete the communication.


TensorAttr is the description, recorded in the computation graph, of one of an Op's input/output arguments. It includes the input/output name, data type, shape, device, and so on.

* name_ Every Tensor has a unique name within one computation graph.
helinwang (Contributor), Jun 12, 2017:

I do not understand why a Tensor needs a name. If the user wants the output of the layer named "cost", the Engine can just execute up to the cost layer and fetch the output. And if Tensors are looked up through a map anyway, the Tensor itself does not seem to need a name either?

reyoung (Collaborator, Author):

Giving Tensors names lets the same Tensor be shared across multiple Graphs.

A Tensor's memory is managed centrally by the Workspace, while its Shape and other attributes are managed by each Graph. Shape and memory are matched by name, and different Graphs can share the same piece of memory through the same name as well.

Also, as @wangkuiyi pointed out above, the term Tensor is easy to confuse; it will be changed to Variable later.

```cpp
const AttributeMap& attrs,
)>;

using KernelFN = std::function<void(const Vec<Tensor>& inputs,
```
Contributor:

What would the KernelFN of an RNN look like?

reyoung (Collaborator, Author):

The same as this one.

For a fixed RNN such as LSTM, it is just a normal KernelFN.
For a dynamic RNN, the whole network is dynamic, so the kernel function of each time step also looks like this.

```cpp
enum DeviceType {
  kDEVICE_CPU = 0,
  kDEVICE_GPU,
  kNUM_DEVICES
};
```
helinwang (Contributor), Jun 12, 2017:

Might kDEVICES be more appropriate than kNUM_DEVICES? The NUM in kNUM_DEVICES is a bit confusing.

```cpp
  KernelFN kernels_[kNUM_DEVICES];
};

struct Op {
```
helinwang (Contributor), Jun 12, 2017:

Op is usually understood as an operation, or a kernel, yet currently only OpMeta contains the kernels and Op does not. Would it be better to rename Op to something like OpRuntimeInfo and rename OpMeta to Op?

reyoung (Collaborator, Author):

  1. Strictly speaking, both Op and OpMeta are produced at run time; OpMeta is registered at run time.

  2. The fundamental difference is that one describes the intrinsic properties of "a kind of Op", while the other describes the actual properties of "one particular Op".

  3. Looking at other frameworks, their naming is not that clear either, but the structure representing the Meta is usually called a Schema or a Def. We could pick a more accurate name:

    • TensorFlow: OpDef, Op (all in protobuf)
    • Caffe2: OperatorSchema, OperatorDef
    • MxNet: Op, nnvm::NodeAttrs
```cpp
};

struct Op {
  std::string type_;
```
Contributor:

Here the corresponding OpMeta is found via type_; wouldn't it be more convenient to store a reference to the OpMeta directly?

reyoung (Collaborator, Author):

A brief argument for why OpMeta cannot be stored inside Op:

  1. The Ops are the entire content of the computation Graph, and the initial computation graph is the user's configuration.
  2. The user's configuration must be serializable to a string or a file.
  3. Most of what OpMeta stores is functions, which cannot be serialized.

Therefore OpMeta cannot be placed inside Op.

Conceptually, moreover, OpMeta describes the intrinsic properties of "a kind of Op": for example, all FC Ops share the same shape inference function and kernel functions. Op describes the properties of "one particular Op", e.g. what the concrete inputs and outputs of the first hidden layer are.

This is the relationship between a type and a value, between "int" and "3".
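A minimal sketch of that type/value split, with the OpMeta looked up by `Op::type_` at run time; the registry function names are assumptions:

```cpp
#include <string>

// Hypothetical registry: OpMeta objects are registered once per Op type at
// startup, and every Op of that type shares the same OpMeta.
Map<std::string, OpMeta>& opMetaRegistry() {
  static Map<std::string, OpMeta> registry;
  return registry;
}

const OpMeta& metaOf(const Op& op) {
  return opMetaRegistry().at(op.type_);  // e.g. every "fc" Op maps to one OpMeta
}
```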


## Recurrent neural networks (RNN)

Since all configuration types in this design are C++ objects, manipulating them is fast, and since memory management (`TensorBuffer`) is completely separated from the computation graph `Graph`, RNNs can be designed entirely as dynamic graphs. There are two kinds of usage:
helinwang (Contributor), Jun 12, 2017:

In this example the Graph is compiled and optimized for every mini-batch, which is not necessarily fast, especially when the unit of computation is the Op rather than the Layer: there will be many Ops and the optimization should be fairly complex.

Contributor:

layer + tensors -> big op -> rnn op?

Member:

And multi-machine optimization would be involved as well.

reyoung (Collaborator, Author), Jun 13, 2017:

  1. A mature RNN can be turned into a single Op, like Paddle's current LSTMemory, to improve performance.

  2. For dynamic RNN networks, the comparison on page 24 of the DyNet paper shows that as the batch size grows, a fully dynamic network is faster than a static RNN that uses padding. The main reasons are:

    • The Graph representation and the Compile logic must be implemented very carefully. The Graph representation must be a C++ object, and modifying the Graph must be fast.
    • Overall the Compile work is O(n); if modifications are fast enough, the constant factor k can be made small enough.
    • Moreover, compared with the computation itself, the cost of building and modifying the graph should not be a large fraction, and it can be hidden (pipelined).

reyoung (Collaborator, Author):

> layer + tensors -> big op -> rnn op?

Indeed, some compilers can first compile fragmented Ops into one big Op.

```cpp
graph.backward(loss);
graph.optimize("sgd", 1e-4);
graph.multiCpu(4);
Engine engine(w, graph, cpuNum = 4);
```
Superjomn (Contributor), Jun 12, 2017:

When the engine actually runs, can the graph's concrete inputs/outputs be specified? Then, without modifying the graph, different minimal subgraphs could be executed just by specifying different outputs. That is:

```cpp
Engine engine(w, graph, input_tensors, output_tensors, cpuNum = 4);
```

The engine would work out the minimal subgraph by itself rather than having it done outside, analogous to tf.Session.run. If multiple subgraphs need to run, multiple engines maintain multiple minimal subgraphs.

The actual execution resources (threads/devices) are shared globally at a lower level, similar to the relationship between mxnet::executor and mxnet::engine.


## Engine

The Ops fed into the Engine are assumed to already include the Ops required for multi-device synchronization, so the Engine implementation is fairly simple:
Contributor:

Does the Engine here correspond to MXNet's graph_executor? Is a lower-level component with the functionality of mxnet::engine still needed?
There should be a bit more detail, e.g. scheduling of parallel tensor dependencies, separate thread pools per device, and so on. I can help fill in part of this.

reyoung (Collaborator, Author):

I have actually been wondering whether Executor and Engine could be merged into one thing. Either way, the functionality of both has to be covered by this one component.

I indeed have not thought this part through very much, especially scheduling. The simplest approach for now is to execute the Ops linearly, but that requires the Ops to be sorted already. In practice a user could also configure Ops that are not sorted, or whose ordering is not optimal for performance.

So this part can still be refined.

reyoung (Collaborator, Author), Jun 13, 2017:

In fact, simplifying the problem, it comes down to:

Given a DAG, how do we assign it to different devices, and how do we topologically sort it, so that performance is optimal?

```cpp
auto label = graph.createTensor("label", {1000, 10}, int);

auto crossEntropy = graph.crossEntropy("xe_loss", prediction, label);
auto loss = graph.mean("mean_loss", crossEntropy);
```
Superjomn (Contributor), Jun 13, 2017:

Is this graph.min or graph.mean? mean feels like an ordinary operation, and the loss needs to specify the direction of optimization, e.g. minimize:

```python
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
```

Should an optimizer concept be added as well? optimizer + ordinary ops = loss

reyoung (Collaborator, Author):

It is indeed optimize(loss.mean()).

Optimize means minimize by default, and what is minimized is the mean of the loss over all data.
This way, each parameter's gradient also equals the mean of its gradient over all data.

See here.

```cpp
auto hidden0Sigmoid = graph.fullyConnected("hidden0_sigmoid", fc);
auto hidden1 = graph.fullyConnected("hidden1", input, 10);
auto prediction = graph.softmax("prediction", hidden1);
```
Contributor:

If this can be mapped to the Python interface, can we guarantee that every op returns a tensor? A uniform style would lower the cost for users to understand the logic.

reyoung (Collaborator, Author):

We do need to guarantee that the return value is a TensorAttrPtr, i.e. a shared_ptr.

wangkuiyi (Collaborator) commented Jun 13, 2017:

I think it would be easier to subdivide this document into the following topics, and try to figure out an optimal design for each topic:

  1. memory management
    1. class Place, inspired by Majel
    2. Allocation and Allocators
    3. unified malloc and free API for GPU and CPU --
      p=malloc(Place pl, ...);
      used(pl);
      free(p); 
  2. Tensor
    1. consider re-using existing ones, including mshadow, Eigen, porting Majel, or writing a wrapper of these libraries.
  3. Expression Template
    1. Expression Template is important for performance optimization.
    2. Is there any (and what is the) difference between the above libraries -- mshadow and Eigen? (Majel doesn't have a complete implementation of Expression Template.)
    3. Which is better?
  4. Ops and Variables
    1. TensorFlow's Ops take Tensor, a wrapper of Eigen, as its inputs and outputs.
    2. Caffe2 and PyTorch take Variable as Ops' inputs and outputs.
    3. A Variable includes a tensor for the forward algorithm and another gradient tensor for the backward algorithm. The difference seems to be that TensorFlow is a general-purpose framework, whereas Caffe2 and PyTorch are DL-specific.
    4. Which approach should PaddlePaddle take?
  5. Ops and Gradient Ops
    1. In TensorFlow, an Op is a general concept -- all computations are represented by ops.
    2. In Caffe2, each Op has one or more corresponding GradientOps.
    3. Which approach should PaddlePaddle take?
  6. Ops and Kernels
    1. TensorFlow separates an Op's signature (as OpDef) from its implementations (as OpKernel).
    2. Others might not have such a clear separation.
    3. Which approach should PaddlePaddle take?
  7. Execution engine
    1. How do TensorFlow and other solutions parse the network definition and create a network?
    2. How do TensorFlow and other solutions execute the training algorithm over the network while creating/managing the memory of Variables?

luotao1 (Contributor) commented Dec 8, 2017:

@reyoung How about this pull request: updated or closed? There is an extra branch in the Paddle repo.

luotao1 closed this Dec 22, 2017.
luotao1 deleted the refactor_docs branch December 22, 2017 06:32.
heavengate pushed a commit to heavengate/Paddle that referenced this pull request Aug 16, 2021
* add Distill model
* add distill+prune