
Lod Tensor design doc #3746

Merged (2 commits, Sep 3, 2017)

Conversation

wangkuiyi
Collaborator

@wangkuiyi wangkuiyi commented Aug 29, 2017

This design doc PR is going to replace #3454

@wangkuiyi wangkuiyi requested review from Superjomn, backyes, jacquesqiao, qingqing01, JiayiFeng and lcy-seso and removed request for backyes August 29, 2017 18:45

```
3
3 1 2
```

Contributor

Could we store offset information here instead, i.e. 0, 3, 4, 6? Then the top-level 3 would not need to be stored either.
Paddle currently stores offsets, which makes sequence-related layers such as MaxLayer and SequenceLastInstanceLayer convenient to implement.

Contributor

@lcy-seso lcy-seso Aug 30, 2017

The top-level 3 should always equal the size of the tensor's first dimension (i.e. the batch size), so there should be no need to store it explicitly?

On second thought that is not quite the case (it depends on how batch size is defined): the top-level "3" is always a scalar equal to the number of elements at the next level (or that number minus 1 if start positions are stored), but it can indeed be derived automatically (see the sketch below).
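A minimal sketch (hypothetical helper, not part of the design doc) of how the top-level count could be derived instead of stored, for both representations discussed here:

```c++
#include <cstddef>
#include <vector>

// Hypothetical helper: with the length representation the top-level count is
// simply the number of entries at the first stored level; with the offset
// representation it is that size minus one (because of the leading 0).
size_t TopLevelCount(const std::vector<std::vector<int>>& lod, bool stores_offsets) {
  if (lod.empty()) return 0;
  return stores_offsets ? lod.front().size() - 1 : lod.front().size();
}
```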

Contributor

If we store the top-level 3, then for multi-level sequences only the topmost one is ever used; the other levels never need it. So I think there is no need to store it, and then the operation on every level is exactly the same.

Contributor

@lcy-seso lcy-seso Aug 30, 2017

Agreed, this 3 has no practical use; its apparent benefit is only that the conceptual structure looks more uniform.

Contributor

We discussed this and still feel lengths are better: first, slicing does not require adjusting offsets; second, lengths are easier to read; third, access is mostly sequential in practice, so random lookup is rare.

Contributor

> First, slicing does not require adjusting offsets; second, lengths are easier to read; third, access is mostly sequential, so random lookup is rare.

  1. Slicing is only used inside the RecurrentOp framework, while every other sequence-related op would have to recompute the offsets each time, which is quite unnecessary. @lcy-seso
  2. Offsets are just as easy to read; they are no harder to understand than lengths.
  3. The third point applies equally to lengths and offsets.

Collaborator Author

"其他所有sequence相关的Op" 指的是什么?

Contributor

It refers to the ops that need sequence information, including the following 20+ ops @wangkuiyi

[image: screenshot listing the sequence-related layers]

```
3
3 1 2
3 2 4 1 2 3
||| || |||| | || |||
```
Contributor

@luotao1 luotao1 Aug 30, 2017

What @Superjom @qingqing01 @hedaoyuan and I discussed earlier was:
for an N-level sequence, LoD.size() = N; LoD[0] stores the offsets of sentences, LoD[1] stores the offsets of paragraphs, and so on (similar to how Paddle stores it today):

```
0, 3, 5, 9, 10, 12, 15
0,       9, 10,      15
```

There are two reasons for storing it this way:

  • Put the finer granularity first: whether the configuration is a single-level or a multi-level RNN, sequence-related layers mostly work at sentence granularity, so taking the first element of the vector is enough to get all sentence-level information.
  • Store offsets rather than lengths. If 3 1 2 is used to record the paragraph information, it has to be combined with 3 2 4 1 2 3 to obtain each paragraph's length. That is inconvenient in two ways (a length-to-offset conversion sketch follows below):
    • For sequence-related layers that only process paragraph-level information, e.g. using MaxLayer to take the max of each paragraph, with 0, 9, 10, 15 one pass is enough to find the max element inside [0,9], [9,10] and [10,15].
    • For GPU kernels, only pointer-typed data can be passed to a CUDA function, not a vector. With offsets a single argument suffices; otherwise the lengths would have to be converted into offsets before every call. @hedaoyuan
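To make the conversion cost concrete, here is a minimal sketch (hypothetical helper, not Paddle code) of turning the per-level lengths 3 2 4 1 2 3 into the offsets 0, 3, 5, 9, 10, 12, 15 used above; with the length representation, something like this would have to run inside every op that needs positions:

```c++
#include <numeric>
#include <vector>

// Hypothetical helper: prefix-sum the lengths into start offsets.
std::vector<int> LengthsToOffsets(const std::vector<int>& lengths) {
  std::vector<int> offsets(lengths.size() + 1, 0);
  std::partial_sum(lengths.begin(), lengths.end(), offsets.begin() + 1);
  return offsets;  // {3, 2, 4, 1, 2, 3} -> {0, 3, 5, 9, 10, 12, 15}
}
```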

Collaborator Author

That sounds reasonable. Shouldn't we write the two versions (plus calling examples) and compare them, so that we know which way is actually better?

Contributor

@luotao1 luotao1 Sep 3, 2017

Take the forward pass of SequenceLastInstanceLayer as an example: this layer takes the last element of each sequence (or of each paragraph). Paddle's current source code:

```c++
  // The kernel of SequenceLastInstanceLayer does not need to know whether
  // startPositions_ refers to paragraphs or sequences, because the core
  // computation is identical.
  // startPositions_ =
  //    type_ ? input.subSequenceStartPositions : input.sequenceStartPositions;
  auto starts = startPositions_->getData(false);
  MatrixPtr inputValue = getInputValue(0);
  MatrixPtr outputValue = getOutputValue();
  instanceIds_.clear();  // instanceIds_ records the selected indices so the backward pass can reuse them.
  for (size_t seqId = 0; seqId < newBatchSize_; ++seqId) {
    int insId = reversed_ ? starts[seqId] : starts[seqId + 1] - 1;
    instanceIds_.push_back(insId);
    outputValue->subMatrix(seqId, 1, tmpDest_)
        ->assign(*(inputValue->subMatrix(insId, 1, tmpSrc_)));
  }
```

If lengths were stored instead:

```c++
  // lengthPositions_ =
  //    type_ ? input.subLengthPositions : input.lengthPositions;
  auto length = lengthPositions_->getData(false);
  MatrixPtr inputValue = getInputValue(0);
  MatrixPtr outputValue = getOutputValue();
  instanceIds_.clear();
  int offset = 0;  // extra line: running offset
  for (size_t seqId = 0; seqId < newBatchSize_; ++seqId) {
    int insId = reversed_ ? offset : offset + length[seqId] - 1;  // extra addition per iteration
    instanceIds_.push_back(insId);
    outputValue->subMatrix(seqId, 1, tmpDest_)
        ->assign(*(inputValue->subMatrix(insId, 1, tmpSrc_)));
    offset += length[seqId];  // extra line: accumulate the offset
  }
```

Because outputValue is laid out in contiguous memory, reading an element now involves the extra step of turning lengths into offsets.

Contributor

@wangkuiyi @Superjom Regarding the two comparison examples above, I assumed that when lengths are stored the structure looks like this:

```
9           1  5
3   2  4    1  2  3
```

If it were stored as:

```
3           1  2
3   2  4    1  2  3
```

then computing subSequenceLength would require first combining 3 1 2 with 3 2 4 1 2 3 to compute lengthPositions. That is no longer the three extra lines above, but at least ten more.

Contributor

It seems so.

It should be stored in the first form above.

Contributor

@qingqing01 qingqing01 left a comment

Slice usually also needs [start, end] positions to be specified. I suggest looking at how SubSequenceLayer and SliceProjection actually use slicing, or at the slice ops of other frameworks.


## Challenge of Variable-length Inputs

People usually represent a mini-batch by a Tensor. For example, a mini-batch of 32 images, each of size 32x32, is a 10x32x32 Tensor. So a transformation, T, of all images can be a matrix multiplication of the 32x32xO-dimensional tensor T and the 10x32x32 Tensor.
Contributor

32 images -> 10 images

```c++
typedef std::vector<std::vector<int> > LoD;
```

- The LoD index can is not necessary when there are only two levels and all elements of the second level have length 1.
Contributor

can is not -> is not


## Slicing of LoD Tensor

Consider that we have a network with three levels of RNN: the top level one handles articles, the second level one handles sentences, and the basic level one handles words. This network requires that mini-batches represented by 4 level LoD Tensor, for example,
Contributor

  1. Usually an RNN only handles sequences (a non-sequence can be regarded as a sequence with a single element, but hardly anyone uses an RNN on non-sequence data). I think there are two levels of RNN here: the top level handles articles (each a sequence of sentences), the second level handles sentences (each a sequence of words), and the basic-level elements are word embedding vectors.

  2. I do not quite understand why a 4-level LoD tensor rather than a 3-level one is needed to represent a nested sequence.

Contributor

It seems that 3 levels are enough.

Collaborator Author

@wangkuiyi wangkuiyi Sep 2, 2017

It should be 3. Changing.


```c++
typedef std::vector<std::vector<int> > LoD;
```
Contributor

@lcy-seso lcy-seso Aug 30, 2017

Does the definition (line 57) of the LoD tensor mean that the current design only supports slicing three-dimensional data, or in other words slicing at two levels?

Contributor

Each level is stored in a vector<int>, so a vector<vector<int>> is enough to store any number of levels.


- The LoD index can is not necessary when there are only two levels and all elements of the second level have length 1.

Contributor

an extra "can" in line 60.

In summary, as long as the essential elements (words or images) have the same size, we can represent mini-batches by a LoD Tensor:

- The underlying tensor has size LxD1xD2x..., where D1xD2... is the size of the essential elements, and
- the first dimension size L has an additional property -- a LoD index as a nested vector:
Contributor

Here is what I understand about the LoD tensor; please help me figure out whether I am right:

  • One input batch (a tensor) has only one LoD index (if needed).
  • It can be regarded as splitting information attached to the first dimension of the input tensor.
  • It is not the case that each dimension has its own LoD.

Contributor

yes, some split information + a tensor = LoD tensor

```
3
3 1 2
口口口 口 口口
```
Contributor

@lcy-seso lcy-seso Aug 30, 2017

  • Maybe it is better to store the start positions, as Paddle currently does; that is more convenient for slicing the original input, otherwise every layer that processes sequences has to compute the sequence start positions within a batch.
  • On the other hand, with the sequence start positions in hand (they are offsets of each sequence within a batch), it is very easy to get a sequence's length (see the sketch below).
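As a minimal illustration of the second point (hypothetical helper, not an existing Paddle function), recovering a length from stored start offsets is a single subtraction:

```c++
#include <vector>

// With start offsets such as {0, 3, 4, 6}, the length of sequence i is the
// difference of two neighbouring offsets: 3, 1, 2 for i = 0, 1, 2.
int SequenceLength(const std::vector<int>& starts, int i) {
  return starts[i + 1] - starts[i];
}
```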

@qingqing01
Contributor

qingqing01 commented Aug 30, 2017

My understanding is that the storage format should still follow the format in @Superjom's earlier document. For example:

Single level:

```
3 (batch size, 3 samples)
2  1 3 (2, 1 and 3 words respectively)
|| | ||| (word level)
```

Stored with typedef vector<vector<int>> LoD:

```c++
std::vector<std::vector<int>>  LoD = {{0,2,3,6}}
```

  • 1 = LoD.size(): the number of levels
  • 3 = LoD[0].size() - 1: the batch size

Two levels:

```
3 (batch size, 3 samples)
2       1    3 (sentence level: 2, 1 and 3 sentences respectively)
2  3    2    3  2  1 (word level)
|| |||  ||  ||| || |
```

Stored with typedef vector<vector<int>> LoD:

```c++
// LoD[0]: sentence information
// LoD[1]: word information
std::vector<std::vector<int>>  LoD = {{0,5,7,13}, {0,2,5,7,10,12,13}}
```

or

```c++
// LoD[0]: word information
// LoD[1]: sentence information
std::vector<std::vector<int>>  LoD = {{0,2,5,7,10,12,13}, {0,5,7,13}}
```

  • 2 = LoD.size(): the number of levels
  • 3 = LoD[0].size() - 1: the batch size, for the first storage order
    • or 3 = LoD[LoD.size() - 1].size() - 1: the batch size, for the second storage order (see the sketch after this list)
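A minimal sketch (illustration only; variable names are hypothetical) of reading the counts in the two-level example back from this representation:

```c++
#include <cstddef>
#include <vector>

int main() {
  // The two-level example above: LoD[0] holds sentence-level offsets,
  // LoD[1] holds word-level offsets.
  std::vector<std::vector<int>> lod = {{0, 5, 7, 13}, {0, 2, 5, 7, 10, 12, 13}};

  size_t num_levels = lod.size();          // 2 = number of levels
  size_t batch_size = lod[0].size() - 1;   // 3 = number of samples
  int num_words = lod.back().back();       // 13 = rows of the underlying tensor
  (void)num_levels; (void)batch_size; (void)num_words;
  return 0;
}
```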

@lcy-seso
Contributor

@qingqing01 I also support this one; otherwise every sequence-related layer would potentially have to recompute by itself the memory offset of the piece of data it needs to fetch, and that computation is unnecessary.

@lcy-seso
Contributor

lcy-seso commented Aug 30, 2017

@qingqing01 I have one question about the format you posted above: when batch size = 1, for a piece of data like

```c++
std::vector<std::vector<int>>  LoD = {{0,2,3,6}}
```

how do we tell whether it is a single-level sequence, or a two-level sequence whose outer nested sequence contains only one sequence?

@luotao1
Contributor

luotao1 commented Aug 30, 2017

> How do we tell whether it is a single-level sequence, or a two-level sequence whose outer nested sequence contains only one sequence?

That is easy: just check LoD.size(). For a single-level sequence, LoD.size() = 1.
For two levels it would be `std::vector<std::vector<int>> LoD = {{0,2,3,6}, {0,6}}`.

@lcy-seso
Contributor

I see now, I had overlooked the LoD.size() part.

```c++
typedef vector<vector<int> > LoD;

struct LoDTensor {
```

Member

Is LodTensor a composition of Lod and Tensor* or a derived class from Tensor?

Contributor

yes

@wangkuiyi
Collaborator Author

I'd like to change to save start offsets instead of lengths in LoD as explained in #3746 (comment). However, it seems that those 0's in LoD are not necessary? @qingqing01

```c++
// LoD[0]: sentence information
// LoD[1]: word information
std::vector<std::vector<int>>  LoD = {{0,5,7,13}, {0,2,5,7,10,12,13}}
```

or

```c++
// LoD[0]: word information
// LoD[1]: sentence information
std::vector<std::vector<int>>  LoD = {{0,2,5,7,10,12,13}, {0,5,7,13}}
```

@Superjomn
Contributor

Superjomn commented Sep 3, 2017

I read the code in Luotao's comment; it seems that both approaches need about the same number of lines of code.

The length approach does poorly when elements are accessed randomly, while the offset approach does poorly at slicing and is less concise.

We do need a concise data structure to make our new LoD concept easier to understand, so I prefer the length approach with some performance improvements.

Currently, LoDTensor will only be used by RNNOp, which accesses a LoD Tensor's elements at some level sequentially, so we can add an iterator to the length approach and make it faster (a rough sketch follows below).

@luotao1 @wangkuiyi
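A rough sketch of that iterator idea (illustration only, not the actual Paddle API): walk one LoD level sequentially while keeping a running offset, so the length representation never needs random lookups:

```c++
#include <cstddef>
#include <vector>

// Sequential iterator over one level of a length-based LoD.
class SeqIterator {
 public:
  explicit SeqIterator(const std::vector<int>& lengths) : lengths_(lengths) {}

  bool Done() const { return idx_ >= lengths_.size(); }
  // [Begin(), End()) is the row range of the current sequence in the tensor.
  int Begin() const { return offset_; }
  int End() const { return offset_ + lengths_[idx_]; }
  void Next() { offset_ += lengths_[idx_]; ++idx_; }

 private:
  const std::vector<int>& lengths_;
  size_t idx_ = 0;
  int offset_ = 0;
};

// Usage: for (SeqIterator it(lengths); !it.Done(); it.Next())
//          process rows [it.Begin(), it.End()) of the underlying tensor.
```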

Contributor

@Superjomn Superjomn left a comment

LGTM

@wangkuiyi
Collaborator Author

I agree with @Superjom that when multiple designs have similar complexity of implementations, we'd prefer the design that is the easiest to understand.

@wangkuiyi wangkuiyi merged commit 5e78359 into PaddlePaddle:develop Sep 3, 2017
@wangkuiyi
Collaborator Author

wangkuiyi commented Sep 3, 2017

I forgot to follow up on some comments before merging this PR. I created #3837 to remind myself to do it.

@luotao1
Contributor

luotao1 commented Sep 4, 2017

@Superjom LoDTensor will be used in more than 20 ops (#3746 (comment)). In those ops, using offsets is more convenient than using lengths, and in the future we will add more sequence-related ops.

Though lengths are more convenient in RecurrentOp, weighing 20+ ops against 1 RecurrentOp, we should choose offsets.

@luotao1
Contributor

luotao1 commented Sep 6, 2017

@wangkuiyi @Superjom For CPU computation, storing lengths or offsets makes little difference; the former only adds an O(1) computation per access. For GPU computation, however, offsets must be stored. Take MaxLayer's forward GPU kernel as an example (the code is in hl_cuda_sequence.cu):

```c++
__global__ void KeMaxSequenceForward(real* input,
                                     const int* sequence,
                                     real* output,
                                     int* index,
                                     int numSequences,
                                     int dim) {
  int dimIdx = threadIdx.x;
  int sequenceId = blockIdx.x;  // each thread block handles one sequence
  if (sequenceId >= numSequences) return;
  int start = sequence[sequenceId];      // start position of this block's sequence
  int end = sequence[sequenceId + 1];    // end position of this block's sequence

  for (int i = dimIdx; i < dim; i += blockDim.x) {
    real tmp = -HL_FLOAT_MAX;
    int tmpId = -1;
    for (int insId = start; insId < end; insId++) {
      if (tmp < input[insId * dim + i]) {
        tmp = input[insId * dim + i];
        tmpId = insId;
      }
    }
    output[sequenceId * dim + i] = tmp;
    index[sequenceId * dim + i] = tmpId;
  }
}
```

The code above shows that:

  • If offsets are stored, each thread block simply reads the start and end positions of its sequence.
  • If lengths are stored, each thread block needs extra work to recover the start and end positions of its sequence, O(n) per block and O(n^2) across all blocks; alternatively, the lengths could be converted into offsets before the GPU kernel, which adds an extra O(n) conversion pass and a device copy before every such kernel (see the sketch after this list). Here n is the number of sequences.
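For comparison, here is a hypothetical variant of the kernel above if per-sequence lengths were passed instead of offsets (a sketch only; `real` and `HL_FLOAT_MAX` are the same definitions used by the original kernel). Every block first has to re-derive its start position by summing all preceding lengths, which is the extra per-block work described above:

```c++
__global__ void KeMaxSequenceForwardFromLengths(real* input,
                                                const int* lengths,
                                                real* output,
                                                int* index,
                                                int numSequences,
                                                int dim) {
  int dimIdx = threadIdx.x;
  int sequenceId = blockIdx.x;  // each thread block still handles one sequence
  if (sequenceId >= numSequences) return;

  // Recover [start, end) from lengths; with offsets this is two array reads.
  int start = 0;
  for (int s = 0; s < sequenceId; ++s) start += lengths[s];
  int end = start + lengths[sequenceId];

  for (int i = dimIdx; i < dim; i += blockDim.x) {
    real tmp = -HL_FLOAT_MAX;
    int tmpId = -1;
    for (int insId = start; insId < end; insId++) {
      if (tmp < input[insId * dim + i]) {
        tmp = input[insId * dim + i];
        tmpId = insId;
      }
    }
    output[sequenceId * dim + i] = tmp;
    index[sequenceId * dim + i] = tmpId;
  }
}
```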
