
add MKLDNN_DEVICE #3712

Merged: 13 commits, Aug 30, 2017

Conversation

tensor-tang (Contributor):
add MKLDNN_DEVICE and MKLDNNMatrix to handle internal mkldnn::memory

luotao1 (Contributor) left a comment:

This PR is still too long. Of the 11 commits, the following 3 are only loosely related to MKLDNN_DEVICE and could have been sent as a separate PR (we'll leave it as-is this time):

if (useGpu_ == true) {
LOG(WARNING) << "Do not support GPU yet, will change to useGpu = false";
useGpu_ = false;
}
Contributor:

Lines 89-92: when useGpu_ is true, just fail directly instead of converting it to false:
CHECK(!useGpu_);

Contributor Author:

OK, done


/**
* Is previous layer MKLDNN type.
* Otherwise, only support otherdevice CPU device.
Contributor:

Otherwise, only support the previous layer using CPU device.

Contributor Author:

Thanks, done

}
real* iData = getInputValue(0, CPU_DEVICE)->getData();
// update input data
// since it might be changed if this is after data layer
Contributor:

  • In the last PR I didn't read this comment carefully: what does "it might be changed if this is after data layer" mean? That input_data may change?
  • Does updateData really need a wrapper? Can't set_data_handle be used directly?

Contributor Author:

In the last PR I didn't read this comment carefully: what does "it might be changed if this is after data layer" mean? That input_data may change?

That's because the data layer's output buffer keeps changing, i.e. the pointer changes, so the data handle needs to be refreshed. I discovered this quite a while ago. If the previous layer's output buffer address were stable, this wouldn't be needed.

Does updateData really need a wrapper? Can't set_data_handle be used directly?

I still think it's better to wrap it, mainly to keep the function-name style consistent.
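The buffer-refresh issue above can be sketched with a plain-C++ stand-in (FakeMemory, FakeMKLDNNMatrix and updateData are illustrative names, not the real MKL-DNN or Paddle API): the handle captured at setup goes stale once the data layer reallocates its output, so it must be re-set each forward pass.

```cpp
#include <cassert>

// Minimal stand-in for mkldnn::memory: it only stores a raw data handle.
class FakeMemory {
public:
    explicit FakeMemory(void* handle) : handle_(handle) {}
    void set_data_handle(void* handle) { handle_ = handle; }
    void* get_data_handle() const { return handle_; }

private:
    void* handle_;
};

// The PR wraps set_data_handle as updateData purely for naming consistency
// with the rest of Paddle; the wrapper does no size checking (that is left
// to the caller, as the code comment in the diff warns).
class FakeMKLDNNMatrix : public FakeMemory {
public:
    using FakeMemory::FakeMemory;
    void updateData(void* data) { set_data_handle(data); }
};
```

A caller would invoke updateData at the start of each forward pass whenever the upstream layer may have swapped its output buffer.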

/**
* Set deviceId of the params used in this layer.
*/
void setParamsDevice(int id, const ParameterMap& parameterMap) {
Contributor:

This function is very similar to Layer::init, which has

CHECK_EQ(parameter->getDeviceId(), getDeviceId());

The parameter already carries deviceId information, so after setDevice, is setParamsDevice still needed?

Contributor Author:

It's precisely because Layer::init does CHECK_EQ(parameter->getDeviceId(), getDeviceId()); that the params need to be set up in advance.

setDevice here really means setLayerDevice, but since it is already inside the layer, I didn't use that name.

Contributor:

I don't quite understand this. Paddle also supports mixed CPU+GPU configurations, and they don't need setParamsDevice in advance. Why is it needed here?

Contributor Author:

First, consider this:

if (useGpu_ && FLAGS_parallel_nn) {
    /* gpu environment is specified by device property */
    deviceId_ = config_.device();
    if (deviceId_ < 0) {
      useGpu_ = false;
    }
  }

So deviceId is only set when using GPU together with FLAGS_parallel_nn; in the common case it is -1.

Also, the MKLDNN params are in fact just the CPU params, so there's no need to set them to -2 on the Python side; setting them uniformly inside MKLDNNLayer avoids failing the check in Layer::init.
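The idea can be sketched with stand-in types (FakeParameter, the map alias, and the -2 value for MKLDNN_DEVICE are assumptions for illustration): before Layer::init runs its CHECK_EQ, the MKLDNN layer stamps each of its parameters with its own device id.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Stand-ins for Paddle's Parameter / ParameterMap.
struct FakeParameter {
    int deviceId = -1;  // the common default when FLAGS_parallel_nn is off
    void setDevice(int id) { deviceId = id; }
    int getDeviceId() const { return deviceId; }
};
using FakeParameterMap = std::map<std::string, std::shared_ptr<FakeParameter>>;

// Stamp every parameter of this layer with the layer's device id, so a later
// CHECK_EQ(parameter->getDeviceId(), getDeviceId()) in Layer::init passes.
void setParamsDevice(int id, const FakeParameterMap& parameterMap) {
    for (const auto& kv : parameterMap) {
        kv.second->setDevice(id);
    }
}
```

The real function additionally filters to the parameters this layer actually uses; this sketch stamps the whole map for brevity.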

Contributor:

In other words, when using mkldnn, setting paddle.init(use_mkldnn=True) is enough; there is no need to set FLAGS_parallel_nn or to manually set a deviceId on every Python layer.

if (m == nullptr) {
size_t height = dims[0];
size_t width = cnts / dims[0];
// LOG(INFO) << height << "," << width;
Contributor:

Remove line 34 (the commented-out LOG).

Contributor Author:

Thanks, done

<< "Only support other device is CPU yet";
}
return outputOtherDevice_.size() == 0;
}
Contributor:

The nextIsMKLDNN function should be placed after prevIsMKLDNN.

Contributor Author:

OK, done


// fc cpu output value do not need convert
cpuOutput.value = output_.value;
Contributor:

  1. In Paddle, Layer::copyOutputToOtherDevice automatically does this conversion, so why write it again separately here? Besides, a lot of information is missing in this version, such as SequenceStartPosition.
  2. Is nextIsMKLDNN needed? Conversion between layers with different deviceIds is done in the Layer base class.

Contributor Author:

Layer::copyOutputToOtherDevice automatically does this conversion

Layer only provides that interface; it is not called automatically. It is only used in ParallelNeuralNetwork.cpp, and DataLayer.h overrides it.

Also, the image height/width information isn't set there, and importantly I am assigning the pointer here (because FC's output needs no reorder; otherwise an mkldnn reorder would have to be added, e.g. conv needs extra handling here), not calling copyFrom, which differs from the base class.

Since mkldnn doesn't handle sequence data yet, that information isn't copied, but I can certainly add it; it will be useful later.

So I'll add a copyOutputInfoToOtherDevice interface in MKLDNNLayer, responsible for copying everything except the data, and a convertOutputToOtherDevice that actually creates the conversion; for FC that is just sharing the pointer.

done

hasBias ? MKLDNNMatrix::create(bias, {oc_}, format::x, engine_) : nullptr;
outVal_ = MKLDNNMatrix::create(out, {bs_, oc_}, format::nc, engine_);

// change original output value to mkldnn output value
Contributor:

"original output value"? What format does that refer to?

Contributor Author:

Here "original output" means changing the original output into a pointer that can be cast to MKLDNNMatrixPtr, so that if the next layer is an MKLDNN layer it can cast directly and obtain the information it needs.
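The cast the author describes can be illustrated with a minimal stand-in hierarchy (FakeMatrix, FakeMKLDNNMatrix and the format field are hypothetical, not Paddle's real classes): the output is stored as a base pointer, but a downstream MKLDNN layer can recover the derived type and its extra format info.

```cpp
#include <cassert>
#include <memory>

// Base matrix type, as the output value is stored in the layer graph.
struct FakeMatrix {
    virtual ~FakeMatrix() = default;
};

// Derived type carrying extra MKL-DNN info (here just a format tag).
struct FakeMKLDNNMatrix : FakeMatrix {
    explicit FakeMKLDNNMatrix(int fmt) : format(fmt) {}
    int format;
};

using MatrixPtr = std::shared_ptr<FakeMatrix>;

// A downstream MKLDNN layer tries to recover the derived pointer; if the
// previous layer was not an MKLDNN layer, the cast yields nullptr.
int formatOf(const MatrixPtr& out) {
    auto dnn = std::dynamic_pointer_cast<FakeMKLDNNMatrix>(out);
    return dnn ? dnn->format : -1;
}
```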

const MatrixPtr& out = getOutput(CPU_DEVICE).grad;
// fc do not need to convert from cpu device since output always nc
// only need create from cpu device
outGrad_ = MKLDNNMatrix::create(out, outVal_->getPD());
}
Contributor:

  • Merging top diffs is done in Layer::waitAndMergeOutputGrad; why write it separately here?
  • "top diffs" is Caffe terminology; please update the comment, likewise below.

Contributor Author:

Merging top diffs is done in Layer::waitAndMergeOutputGrad; why write it separately here?

The base-class waitAndMergeOutputGrad is specifically for the ParallelNeuralNetwork case, so it can't be used here; mine doesn't wait, while the base class's works together with threads. Besides, what I wrote here will eventually be implemented with mkldnn::sum, so the base-class function can't be reused directly.

"top diffs" is Caffe terminology; please update the comment, likewise below.

done

// TODO(TJ): use outputMaps_ ways when merge topdiff done
} else {
inGrad_ = MKLDNNMatrix::create(in, inVal_->getPD());
}
}
Contributor:

Lines 230-249: the two branches can be merged:

bool device = prevIsMKLDNN() ? MKLDNN_DEVICE : CPU_DEVICE;
const MatrixPtr& in = getInputGrad(0,  device);
if (in == nullptr) return;
if (getInput(0, device).getAllCount() > 1) {
// TODO(TJ): use outputMaps_ ways when merge topdiff done
} else {
     inGrad_ = MKLDNNMatrix::create(in, inVal_->getPD());
}

Contributor Author:

Done, but device can't be a bool; it should be an int.
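The fix can be shown in isolation (the device-id values here are illustrative stand-ins; the real constants live in Paddle's headers): a bool would collapse any non-zero id to true, so the merged branch keeps device as an int.

```cpp
#include <cassert>

// Illustrative device ids; in Paddle these are defined elsewhere
// (CPU uses the -1 default discussed above, MKLDNN uses its own id).
constexpr int CPU_DEVICE = -1;
constexpr int MKLDNN_DEVICE = -2;

// The merged-branch selection: an int, not a bool, since both ids are
// genuine integer values and a bool would lose them.
int pickDevice(bool prevIsMKLDNN) {
    return prevIsMKLDNN ? MKLDNN_DEVICE : CPU_DEVICE;
}
```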

tensor-tang (Contributor Author) left a comment:

Thanks. done


protected:
/**
* If next layer only has MKLDNN type.
* Otherwise, only support otherdevice CPU device.
Contributor Author:

Originally I wanted the name to mirror prevIsMKLDNN; it's not meant to use the next layer's information, so the comment does need fixing.
That said, reflecting "only" in the name is also good, though in that case I'd rather rename prevIsMKLDNN together with it.
Renamed to: prevIsOnlyMKLDNN and nextIsOnlyMKLDNN.

/**
* Update the memory data handle.
* Caution: This will not check the buffer size of the data,
* it should be covered by the user.
Contributor Author:

Same as above.

/**
* Get primitive descriptor.
*/
mkldnn::memory::primitive_desc getPD() { return this->get_primitive_desc(); }
Contributor Author:

My original thinking was:

  1. PD and MD are used widely in mkldnn and are already fairly standard vocabulary there. Each function also has a corresponding comment above it.
  2. With more verbose names, many of these lines would get very long and wrap several times, which doesn't look good.

Still, I'll just change them.
done

/**
* Get memory descriptor.
*/
mkldnn::memory::desc getMD() { return getPD().desc(); }
Contributor Author:

done


protected:
/**
* Do once reorder supported inplace.
Contributor Author:

done



const MatrixPtr& bias = hasBias ? biases_->getW() : nullptr;
const MatrixPtr& out = output_.value;

if (prevIsOnlyMKLDNN()) {
Contributor:

How about renaming these to inputIsOnlyMKLDNN and outputIsOnlyMKLDNN?

  • From a layer's point of view, there is no "next" information.
  • prev pairs with next, so rename both to input/output.

Contributor Author:

Sure, done.

const MatrixPtr& out = getOutput(CPU_DEVICE).grad;
// fc do not need to convert from cpu device since output always nc
// only need create from cpu device
outGrad_ = MKLDNNMatrix::create(out, outVal_->getPrimitiveDesc());
Contributor:

Lines 197-207: these two branches can also be merged into the form used at line 241:

int device = nextIsOnlyMKLDNN() ? MKLDNN_DEVICE : CPU_DEVICE;
const MatrixPtr& out = getOutput(device).grad;
outGrad_ = MKLDNNMatrix::create(out, outVal_->getPrimitiveDesc());

Also, what does "nc" at the end of the comment on line 205 mean? The nc format?

Contributor Author:

OK, thanks.

Yes, it's the format; I'll make the comment more detailed.

/**
* copy image size and sequence info to other device
*/
void copyOutputInfoToOtherDevice() {
Contributor:

It would be good to add a @note to this comment explaining why Layer::copyOutputToOtherDevice isn't used. My understanding is that the value can't be copied over directly because the formats differ.

Contributor Author:

OK. This copies only the basic info, not the data, so it isn't quite the same as Layer::copyOutputToOtherDevice.
But a note can certainly be added.


// fc cpu output value do not need convert
// just share point
outputOtherDevice_[i].value = output_.value;
++cnt;
Contributor:

For mkldnn layers, are there many places where the value needs converting? If most layers just do

outputOtherDevice_[i].value = output_.value;

then put that in a base-class function, and have only the layers that need a real conversion write their own; that would be cleaner.

The check that there can't be more than one CPU device should probably also go into the base-class function. And why can't there be more than one?

Contributor Author:

It's not that most layers can do outputOtherDevice_[i].value = output_.value;; the other layers need different handling. FC can share directly because its output is always in nc format, which matches Paddle's CPU-device format.
Once there are more layers, this can be cleaned up in another pass.

As for allowing at most one CPU device: in theory I don't think more than one should appear, but I'm not sure my analysis covers everything (e.g. whether RNN cases are affected), so it's only a warning. Even if there were several, each would still share the pointer. This point is also specific to FCLayer.

Contributor:

If a check is needed, it should be done in MKLDNNLayer's convertOutputToOtherDevice; that can be changed in the next PR.

tensor-tang (Contributor Author) left a comment:
Thanks. done


luotao1 (Contributor) left a comment:

LGTM. The last two commit messages ("refine" and "rename") are a bit terse; next time please write them in more detail.


@luotao1 luotao1 merged commit 322d9ad into PaddlePaddle:develop Aug 30, 2017
@tensor-tang tensor-tang moved this from Doing to Done in Optimization on Intel Platform Aug 30, 2017