
add MKLDNN_DEVICE #3712

Merged: 13 commits, Aug 30, 2017

Conversation

tensor-tang (Contributor):
add MKLDNN_DEVICE and MKLDNNMatrix to handle internal mkldnn::memory

luotao1 (Contributor) left a comment:

This PR is still too long. Of the 11 commits, the following 3 are only loosely related to MKLDNN_DEVICE and could have been sent as a separate PR (we'll leave it as-is this time):

if (useGpu_ == true) {
LOG(WARNING) << "Do not support GPU yet, will change to useGpu = false";
useGpu_ = false;
}
Contributor:

Lines 89-92: when useGpu_ is true, just fail directly instead of converting it to false:
CHECK(!useGpu_);

Contributor Author:

OK, done


/**
* Is previous layer MKLDNN type.
* Otherwise, only support otherdevice CPU device.
Contributor:

Otherwise, only support the previous layer using CPU device.

Contributor Author:

Thanks, done

}
real* iData = getInputValue(0, CPU_DEVICE)->getData();
// update input data
// since it might be changed if this is after data layer
Contributor:

  • In the last PR I didn't read this comment carefully: what does "it might be changed if this is after data layer" mean? That input_data may change?
  • Does updateData really need a wrapper? Can't set_data_handle be used directly?

Contributor Author:

In the last PR I didn't read this comment carefully: what does "it might be changed if this is after data layer" mean? That input_data may change?

That's because the data layer's output buffer keeps changing, i.e. the pointer changes, so the data handle needs to be refreshed. I discovered this quite a while ago. If the previous layer's output buffer address were stable, this wouldn't be needed.

Does updateData really need a wrapper? Can't set_data_handle be used directly?

I still think it's better to wrap it, mainly to keep the function-name style consistent.
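The buffer-refresh issue above can be sketched with a plain-C++ stand-in (FakeMemory, FakeMKLDNNMatrix and updateData are illustrative names, not the real MKL-DNN or Paddle API): the handle captured at setup goes stale once the data layer reallocates its output, so it must be re-set each forward pass.

```cpp
#include <cassert>

// Minimal stand-in for mkldnn::memory: it only stores a raw data handle.
class FakeMemory {
public:
    explicit FakeMemory(void* handle) : handle_(handle) {}
    void set_data_handle(void* handle) { handle_ = handle; }
    void* get_data_handle() const { return handle_; }

private:
    void* handle_;
};

// The PR wraps set_data_handle as updateData purely for naming consistency
// with the rest of Paddle; the wrapper does no size checking (that is left
// to the caller, as the code comment in the diff warns).
class FakeMKLDNNMatrix : public FakeMemory {
public:
    using FakeMemory::FakeMemory;
    void updateData(void* data) { set_data_handle(data); }
};
```

A caller would invoke updateData at the start of each forward pass whenever the upstream layer may have swapped its output buffer.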

/**
* Set deviceId of the params used in this layer.
*/
void setParamsDevice(int id, const ParameterMap& parameterMap) {
Contributor:

This function is very similar to Layer::init, which has

CHECK_EQ(parameter->getDeviceId(), getDeviceId());

The parameter already carries deviceId information, so after setDevice, is setParamsDevice still needed?

Contributor Author:

It's precisely because Layer::init does CHECK_EQ(parameter->getDeviceId(), getDeviceId()); that the params need to be set up in advance.

setDevice here really means setLayerDevice, but since it is already inside the layer, I didn't use that name.

Contributor:

I don't quite understand this. Paddle also supports mixed CPU+GPU configurations, and they don't need setParamsDevice in advance. Why is it needed here?

Contributor Author:

First, consider this:

if (useGpu_ && FLAGS_parallel_nn) {
    /* gpu environment is specified by device property */
    deviceId_ = config_.device();
    if (deviceId_ < 0) {
      useGpu_ = false;
    }
  }

So deviceId is only set when using GPU together with FLAGS_parallel_nn; in the common case it is -1.

Also, the MKLDNN params are in fact just the CPU params, so there's no need to set them to -2 on the Python side; setting them uniformly inside MKLDNNLayer avoids failing the check in Layer::init.
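The idea can be sketched with stand-in types (FakeParameter, the map alias, and the -2 value for MKLDNN_DEVICE are assumptions for illustration): before Layer::init runs its CHECK_EQ, the MKLDNN layer stamps each of its parameters with its own device id.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Stand-ins for Paddle's Parameter / ParameterMap.
struct FakeParameter {
    int deviceId = -1;  // the common default when FLAGS_parallel_nn is off
    void setDevice(int id) { deviceId = id; }
    int getDeviceId() const { return deviceId; }
};
using FakeParameterMap = std::map<std::string, std::shared_ptr<FakeParameter>>;

// Stamp every parameter of this layer with the layer's device id, so a later
// CHECK_EQ(parameter->getDeviceId(), getDeviceId()) in Layer::init passes.
void setParamsDevice(int id, const FakeParameterMap& parameterMap) {
    for (const auto& kv : parameterMap) {
        kv.second->setDevice(id);
    }
}
```

The real function additionally filters to the parameters this layer actually uses; this sketch stamps the whole map for brevity.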

Contributor:

In other words, when using mkldnn, setting paddle.init(use_mkldnn=True) is enough; there is no need to set FLAGS_parallel_nn or to manually set a deviceId on every Python layer.

if (m == nullptr) {
size_t height = dims[0];
size_t width = cnts / dims[0];
// LOG(INFO) << height << "," << width;
Contributor:

Remove line 34 (the commented-out LOG).

Contributor Author:

Thanks, done

<< "Only support other device is CPU yet";
}
return outputOtherDevice_.size() == 0;
}
Contributor:

The nextIsMKLDNN function should be placed after prevIsMKLDNN.

Contributor Author:

OK, done


// fc cpu output value do not need convert
cpuOutput.value = output_.value;
Contributor:

  1. In Paddle, Layer::copyOutputToOtherDevice automatically does this conversion, so why write it again separately here? Besides, a lot of information is missing in this version, such as SequenceStartPosition.
  2. Is nextIsMKLDNN needed? Conversion between layers with different deviceIds is done in the Layer base class.

Contributor Author:

Layer::copyOutputToOtherDevice automatically does this conversion

Layer only provides that interface; it is not called automatically. It is only used in ParallelNeuralNetwork.cpp, and DataLayer.h overrides it.

Also, the image height/width information isn't set there, and importantly I am assigning the pointer here (because FC's output needs no reorder; otherwise an mkldnn reorder would have to be added, e.g. conv needs extra handling here), not calling copyFrom, which differs from the base class.

Since mkldnn doesn't handle sequence data yet, that information isn't copied, but I can certainly add it; it will be useful later.

So I'll add a copyOutputInfoToOtherDevice interface in MKLDNNLayer, responsible for copying everything except the data, and a convertOutputToOtherDevice that actually creates the conversion; for FC that is just sharing the pointer.

done

hasBias ? MKLDNNMatrix::create(bias, {oc_}, format::x, engine_) : nullptr;
outVal_ = MKLDNNMatrix::create(out, {bs_, oc_}, format::nc, engine_);

// change original output value to mkldnn output value
Contributor:

"original output value"? What format does that refer to?

Contributor Author:

Here "original output" means changing the original output into a pointer that can be cast to MKLDNNMatrixPtr, so that if the next layer is an MKLDNN layer it can cast directly and obtain the information it needs.
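The cast the author describes can be illustrated with a minimal stand-in hierarchy (FakeMatrix, FakeMKLDNNMatrix and the format field are hypothetical, not Paddle's real classes): the output is stored as a base pointer, but a downstream MKLDNN layer can recover the derived type and its extra format info.

```cpp
#include <cassert>
#include <memory>

// Base matrix type, as the output value is stored in the layer graph.
struct FakeMatrix {
    virtual ~FakeMatrix() = default;
};

// Derived type carrying extra MKL-DNN info (here just a format tag).
struct FakeMKLDNNMatrix : FakeMatrix {
    explicit FakeMKLDNNMatrix(int fmt) : format(fmt) {}
    int format;
};

using MatrixPtr = std::shared_ptr<FakeMatrix>;

// A downstream MKLDNN layer tries to recover the derived pointer; if the
// previous layer was not an MKLDNN layer, the cast yields nullptr.
int formatOf(const MatrixPtr& out) {
    auto dnn = std::dynamic_pointer_cast<FakeMKLDNNMatrix>(out);
    return dnn ? dnn->format : -1;
}
```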

const MatrixPtr& out = getOutput(CPU_DEVICE).grad;
// fc do not need to convert from cpu device since output always nc
// only need create from cpu device
outGrad_ = MKLDNNMatrix::create(out, outVal_->getPD());
}
Contributor:

  • Merging top diffs is done in Layer::waitAndMergeOutputGrad; why write it separately here?
  • "top diffs" is Caffe terminology; please update the comment, likewise below.

Contributor Author:

Merging top diffs is done in Layer::waitAndMergeOutputGrad; why write it separately here?

The base-class waitAndMergeOutputGrad is specifically for the ParallelNeuralNetwork case, so it can't be used here; mine doesn't wait, while the base class's works together with threads. Besides, what I wrote here will eventually be implemented with mkldnn::sum, so the base-class function can't be reused directly.

"top diffs" is Caffe terminology; please update the comment, likewise below.

done

// TODO(TJ): use outputMaps_ ways when merge topdiff done
} else {
inGrad_ = MKLDNNMatrix::create(in, inVal_->getPD());
}
}
Contributor:

Lines 230-249: the two branches can be merged:

bool device = prevIsMKLDNN() ? MKLDNN_DEVICE : CPU_DEVICE;
const MatrixPtr& in = getInputGrad(0,  device);
if (in == nullptr) return;
if (getInput(0, device).getAllCount() > 1) {
// TODO(TJ): use outputMaps_ ways when merge topdiff done
} else {
     inGrad_ = MKLDNNMatrix::create(in, inVal_->getPD());
}

Contributor Author:

Done, but device can't be a bool; it should be an int.
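The fix can be shown in isolation (the device-id values here are illustrative stand-ins; the real constants live in Paddle's headers): a bool would collapse any non-zero id to true, so the merged branch keeps device as an int.

```cpp
#include <cassert>

// Illustrative device ids; in Paddle these are defined elsewhere
// (CPU uses the -1 default discussed above, MKLDNN uses its own id).
constexpr int CPU_DEVICE = -1;
constexpr int MKLDNN_DEVICE = -2;

// The merged-branch selection: an int, not a bool, since both ids are
// genuine integer values and a bool would lose them.
int pickDevice(bool prevIsMKLDNN) {
    return prevIsMKLDNN ? MKLDNN_DEVICE : CPU_DEVICE;
}
```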

tensor-tang (Contributor Author) left a comment:

Thanks. done


protected:
/**
* If next layer only has MKLDNN type.
* Otherwise, only support otherdevice CPU device.
Contributor Author:

Originally I wanted the name to mirror prevIsMKLDNN; it's not meant to use the next layer's information, so the comment does need fixing.
That said, reflecting "only" in the name is also good, though in that case I'd rather rename prevIsMKLDNN together with it.
Renamed to: prevIsOnlyMKLDNN and nextIsOnlyMKLDNN.

/**
* Update the memory data handle.
* Caution: This will not check the buffer size of the data,
* it should be covered by the user.
Contributor Author:

Same as above.

/**
* Get primitive descriptor.
*/
mkldnn::memory::primitive_desc getPD() { return this->get_primitive_desc(); }
Contributor Author:

My original thinking was:

  1. PD and MD are used widely in mkldnn and are already fairly standard vocabulary there. Each function also has a corresponding comment above it.
  2. With more verbose names, many of these lines would get very long and wrap several times, which doesn't look good.

Still, I'll just change them.
done

/**
* Get memory descriptor.
*/
mkldnn::memory::desc getMD() { return getPD().desc(); }
Contributor Author:

done


protected:
/**
* Do once reorder supported inplace.
Contributor Author:

done



const MatrixPtr& bias = hasBias ? biases_->getW() : nullptr;
const MatrixPtr& out = output_.value;

if (prevIsOnlyMKLDNN()) {
Contributor:

How about renaming these to inputIsOnlyMKLDNN and outputIsOnlyMKLDNN?

  • From a layer's point of view, there is no "next" information.
  • prev pairs with next, so rename both to input/output.

Contributor Author:

Sure, done.

const MatrixPtr& out = getOutput(CPU_DEVICE).grad;
// fc do not need to convert from cpu device since output always nc
// only need create from cpu device
outGrad_ = MKLDNNMatrix::create(out, outVal_->getPrimitiveDesc());
Contributor:

Lines 197-207: these two branches can also be merged into the form used at line 241:

int device = nextIsOnlyMKLDNN() ? MKLDNN_DEVICE : CPU_DEVICE;
const MatrixPtr& out = getOutput(device).grad;
outGrad_ = MKLDNNMatrix::create(out, outVal_->getPrimitiveDesc());

Also, what does "nc" at the end of the comment on line 205 mean? The nc format?

Contributor Author:

OK, thanks.

Yes, it's the format; I'll make the comment more detailed.

/**
* copy image size and sequence info to other device
*/
void copyOutputInfoToOtherDevice() {
Contributor:

It would be good to add a @note to this comment explaining why Layer::copyOutputToOtherDevice isn't used. My understanding is that the value can't be copied over directly because the formats differ.

Contributor Author:

OK. This copies only the basic info, not the data, so it isn't quite the same as Layer::copyOutputToOtherDevice.
But a note can certainly be added.


// fc cpu output value do not need convert
// just share point
outputOtherDevice_[i].value = output_.value;
++cnt;
Contributor:

For mkldnn layers, are there many places where the value needs converting? If most layers just do

outputOtherDevice_[i].value = output_.value;

then put that in a base-class function, and have only the layers that need a real conversion write their own; that would be cleaner.

The check that there can't be more than one CPU device should probably also go into the base-class function. And why can't there be more than one?

Contributor Author:

It's not that most layers can do outputOtherDevice_[i].value = output_.value;; the other layers need different handling. FC can share directly because its output is always in nc format, which matches Paddle's CPU-device format.
Once there are more layers, this can be cleaned up in another pass.

As for allowing at most one CPU device: in theory I don't think more than one should appear, but I'm not sure my analysis covers everything (e.g. whether RNN cases are affected), so it's only a warning. Even if there were several, each would still share the pointer. This point is also specific to FCLayer.

Contributor:

If a check is needed, it should be done in MKLDNNLayer's convertOutputToOtherDevice; that can be changed in the next PR.

tensor-tang (Contributor Author) left a comment:
Thanks. done


luotao1 (Contributor) left a comment:

LGTM. The last two commit messages ("refine" and "rename") are a bit terse; next time please write them in more detail.


@luotao1 luotao1 merged commit 322d9ad into PaddlePaddle:develop Aug 30, 2017
@tensor-tang tensor-tang moved this from Doing to Done in Optimization on Intel Platform Aug 30, 2017