
Add bfloat16 data type #25402

Merged
merged 15 commits from bfloat16-dt into PaddlePaddle:develop on Sep 3, 2020

Conversation

wozna
Contributor

@wozna wozna commented Jul 6, 2020

PR types

Others

PR changes

Others

Describe

PR adds bfloat16 data type implementation with a test for this type.
Formats of float32 and bfloat16 are presented in the picture below.
[figure: bit layouts of float32 and bfloat16]
The implementation of bfloat16 focuses on the conversion from float to bfloat16, which copies the two most significant bytes of the float and saves them as bfloat16. Conversion from other types goes through the default conversion to float and then to bfloat16.
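
For illustration, here is a minimal, self-contained sketch of such a truncating conversion (not the actual Paddle implementation; the type and function names are hypothetical). Using a bit shift instead of a raw byte copy keeps the sketch independent of endianness:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical standalone bfloat16: the 16 most significant bits of a
// float32 (sign bit, 8 exponent bits, top 7 mantissa bits).
struct bfloat16 {
  uint16_t x;
};

inline bfloat16 FloatToBfloat16(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));  // reinterpret the float bits safely
  return bfloat16{static_cast<uint16_t>(bits >> 16)};  // drop low 16 bits
}

inline float Bfloat16ToFloat(bfloat16 b) {
  uint32_t bits = static_cast<uint32_t>(b.x) << 16;  // low mantissa bits = 0
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}
```

Truncation like this simply discards the low 16 mantissa bits; production implementations may additionally round to nearest even, so treat this only as a sketch of the idea described above.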

This is the first step toward model inference with bfloat16 using the oneDNN library on devices that support it, such as Cooper Lake.

This PR does not need verification on Cooper Lake, because it is a universal implementation of the bfloat16 data type.

@paddle-bot-old

paddle-bot-old bot commented Jul 6, 2020

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@wozna wozna added the Intel label Jul 6, 2020
@wozna wozna force-pushed the bfloat16-dt branch 2 times, most recently from cb12a84 to eeae10e on July 8, 2020 at 07:20
jczaja
jczaja previously approved these changes Jul 13, 2020
Contributor

@jczaja jczaja left a comment

LGTM

@jczaja jczaja requested a review from grygielski July 13, 2020 13:42
grygielski
grygielski previously approved these changes Jul 14, 2020
Contributor

@grygielski grygielski left a comment

I've posted one question. Other than that, LGTM.

[Review thread on paddle/fluid/pybind/tensor_py.h: outdated, resolved]
@wozna
Contributor Author

wozna commented Jul 27, 2020

@luotao1 @wzzju Can I ask you for review and approval of framework.proto? PR-CI-CPU-Py2 asks for it.
In PR-CI-Coverage there is a build error saying that BF16 is not a proto VarType, although it is defined in framework.proto. I wonder if this is also due to the lack of approval or if it is a bug in the code.

@luotao1
Contributor

luotao1 commented Jul 28, 2020

> In PR-CI-Coverage there is a build error saying that BF16 is not a proto VarType, although it is defined in framework.proto. I wonder if this is also due to the lack of approval or if it is a bug in the code.

It is a bug in the code, not due to a lack of approval.

@wozna
Contributor Author

wozna commented Aug 3, 2020

@luotao1 You're right, but I still have a problem with PR-CI-Coverage: BF16 is not visible as a proto VarType although it is defined in framework.proto. The problem appeared when PR-CI-Coverage started to be built with -DWITH_LITE=ON. Could you ask someone related to Lite to help me with this problem?

@luotao1
Contributor

luotao1 commented Aug 5, 2020

> The problem appeared when PR-CI-Coverage started to be built with -DWITH_LITE=ON. Could you ask someone related to Lite to help me with this problem?

WITH_LITE=ON does not seem to affect BF16. @lidanqing-intel Could you help @wozna internally?

@lidanqing-intel
Contributor

lidanqing-intel commented Aug 7, 2020

Adam Osewski is looking at this WITH_LITE=ON issue. @arogowie-intel

@wozna
Contributor Author

wozna commented Aug 11, 2020

@luotao1 We found out that the problem with -DWITH_LITE=ON is caused by https://github.com/PaddlePaddle/Paddle-Lite/blob/develop/lite/core/framework.proto, which is a copy of the same file kept in both Paddle and Paddle-Lite. BF16 wasn't visible in Paddle-Lite because Lite was using its own framework.proto, which does not define BF16.
The simplest solution is to first add BF16 to framework.proto in Paddle-Lite and then in Paddle.
What do you think about it? Is there any procedure for this?

@luotao1
Contributor

luotao1 commented Aug 12, 2020

@wozna The update of framework.proto should be done cautiously. I discussed this with @Superjomn, who is responsible for inference: to avoid any incompatibility among different release versions, you should first provide two experiment results.

  • You could create a separate PR that changes only framework.proto, such as PR1.
  • Train a model with develop+PR1, then do C++ inference with the 1.8.x library.
  • Train a model with the 1.8.x whl, then do C++ inference with the develop+PR1 library.

If both experiment results are OK, you can first add BF16 to framework.proto in Paddle-Lite and then in Paddle.
If not, it is an incompatible update, and we will hold a thorough review first.

@wozna
Contributor Author

wozna commented Aug 12, 2020

Thank you @luotao1. I will run these experiments.

> Train a model with develop+PR1, then do C++ inference with the 1.8.x library.

@luotao1 or @Superjomn Do you have a particular model in mind for the training?

@luotao1
Contributor

luotao1 commented Aug 12, 2020

> Do you have a particular model in mind for the training?

@wozna You can choose any model.

@arogowie-intel
Contributor

Hi @luotao1
I'm going to continue the work on this PR since @wozna is going on vacation next week.

(Please correct me if I get something fundamentally wrong; I'm a newbie here :) )

Let me note that Paddle-Lite already contains 3 copies of the framework.proto file:

I'm not sure that keeping all those files in sync with Paddle's framework.proto is a good solution. I wonder why Paddle-Lite is not using the framework.proto protobuf message definitions already available in Paddle. As I can see, Paddle builds its framework_proto target (paddle/fluid/framework/CMakeLists.txt#L29), and this target is used as a dependency (paddle/fluid/inference/lite/CMakeLists.txt#L8) of the file where the compiler error is reported. Maybe this is the problem that should be fixed? I checked, and both framework.proto files generate correct sources (*.pb.h, *.pb.cc) at different paths, so one is not overwriting the other. Apparently Paddle-Lite uses only this proto file: lite/core/framework.proto.

Regarding the PaddleLite experiments you asked for, I also wonder whether they're necessary, since:

  • PR1 would only add a new value to the protobuf message, with no additional changes; nothing would use it yet.
  • A model trained with either develop+PR1 or the 1.8.x whl would, I suppose, still be FP32, not BF16, so there should be no difference: there are no passes yet that could change anything with BF16.

@arogowie-intel
Contributor

@luotao1
@luotao1
After my investigation, the problem is solved by updating the Paddle-Lite commit hash used in Paddle to the latest develop. Can we update the Paddle-Lite version?

[Review thread on paddle/fluid/framework/data_layout_transform.cc: outdated, resolved]
@@ -38,3 +38,25 @@ TEST(DataType, float16) {
std::string type = "::paddle::platform::float16";
EXPECT_STREQ(f::DataTypeToString(dtype).c_str(), type.c_str());
}

TEST(DataType, bfloat16) {

The first test could be parameterized and reused here, to avoid duplicating code.
This can be done later as a refactoring.

Contributor Author

You are right. I will refactor it later.
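
As a rough illustration of that future refactoring (a sketch only: the helper name is hypothetical, and the expected strings follow the test shown above), the duplicated float16/bfloat16 bodies could share one parameterized check:

```cpp
#include <gtest/gtest.h>
#include <string>

#include "paddle/fluid/framework/data_type.h"  // DataTypeToString

namespace f = paddle::framework;

// One helper checks the type-name mapping for any VarType, so the
// float16 and bfloat16 tests no longer duplicate the same body.
void CheckDataTypeName(f::proto::VarType::Type dtype,
                       const std::string& expected) {
  EXPECT_STREQ(f::DataTypeToString(dtype).c_str(), expected.c_str());
}

TEST(DataType, type_names) {
  CheckDataTypeName(f::proto::VarType::FP16, "::paddle::platform::float16");
  CheckDataTypeName(f::proto::VarType::BF16, "::paddle::platform::bfloat16");
}
```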

@@ -189,4 +194,120 @@ TEST(DataTypeTransform, CPUTransform) {
static_cast<paddle::platform::float16>(in_data_bool[i]).x);
}
}

// data type transform from/to bfloat16
{

The previous block could be parameterized as a function and reused here, to avoid duplication of code.
This can be done later as a refactoring.

Contributor Author

You are right. I will refactor it later.

@wozna
Contributor Author

wozna commented Aug 28, 2020

@luotao1
Unfortunately, PR #25402 didn't resolve the problem with Paddle-Lite.
framework.proto in Paddle is still somehow replaced by framework.proto from Paddle-Lite.

For now, my last proposal that resolves this error for sure is to add BF16 to framework.proto in Paddle-Lite, and then to update the Paddle-Lite hash in Paddle.
Right now I am going to run the same experiments that you asked @arogowie-intel for before.

Do I need to do anything more?

@luotao1
Contributor

luotao1 commented Aug 28, 2020

> Unfortunately, PR #25402 didn't resolve the problem with Paddle-Lite.
> framework.proto in Paddle is still somehow replaced by framework.proto from Paddle-Lite.

@arogowie-intel Did the earlier fix in #25402 (comment) not solve the problem? Where is the omission in the earlier investigation?

@arogowie-intel
Contributor

Hi @luotao1

> @arogowie-intel Did the earlier fix in #25402 (comment) not solve the problem? Where is the omission in the earlier investigation?

I'm truly sorry, but it looks like that was my fault. Just before proposing the solution in that comment, I had been experimenting with Paddle-Lite by adding the BF16 data type definition to its framework.proto file. I was compiling Paddle-Lite locally with this change and instructed Paddle to link against it rather than download and compile it by itself. I suspect that after those experiments my working environment was not cleaned up entirely, which caused Paddle to still use my local version of Paddle-Lite with the updated framework.proto.

After all, it still seems somewhat unexpected that the Paddle-Lite proto files take priority over the Paddle ones; or is that correct? I have been trying to find out why this happens, but searching through the CMake files in Paddle and Paddle-Lite I couldn't find anything suspicious. Updating the framework.proto of an external project used in Paddle is rather a workaround for this issue. Please help us explain this.

@luotao1
Contributor

luotao1 commented Aug 31, 2020

> Updating the framework.proto of an external project used in Paddle is rather a workaround for this issue.

Got it. I will discuss this with @Superjomn.

@luotao1
Contributor

luotao1 commented Sep 1, 2020

@wozna @arogowie-intel You can create a PR against the Paddle-Lite repo to update its framework.proto.

@jczaja
Contributor

jczaja commented Sep 2, 2020

@luotao1, @Superjomn We looked into the problem of this PR not building with the -DWITH_LITE=ON option. The root cause is that some test files, e.g. /Paddle/paddle/fluid/inference/lite/test_engine.cc, indirectly include two framework.pb.h files when PaddlePaddle is built with Paddle-Lite (one framework.pb.h is generated from Paddle's framework.proto and the other from Paddle-Lite's). Whichever of the two headers the compiler processes first is the one that gets used, so if the framework.proto files are not in sync, there is a build error. In this PR we changed the include order so that Paddle's framework.pb.h is included before Paddle-Lite's, which makes this PR buildable. This is a workaround. Ultimately it would be good to use only one framework.proto, not two, or to guarantee that both framework.proto files are always in sync.

Contributor

@grygielski grygielski left a comment

LGTM

Contributor

@luotao1 luotao1 left a comment

LGTM

@luotao1 luotao1 merged commit 95e1434 into PaddlePaddle:develop Sep 3, 2020
luotao1 added a commit that referenced this pull request Sep 3, 2020
@wozna wozna deleted the bfloat16-dt branch February 24, 2023 15:43