
Add bfloat16 data type #25402

Merged
merged 15 commits from bfloat16-dt into PaddlePaddle:develop on Sep 3, 2020

Conversation

wozna
Contributor

@wozna wozna commented Jul 6, 2020

PR types

Others

PR changes

Others

Describe

PR adds bfloat16 data type implementation with a test for this type.
Formats of float32 and bfloat16 are presented in the picture below.
[figure: bit layouts of float32 and bfloat16]
The implementation of bfloat16 focuses on the conversion from float to bfloat16, which copies the two most significant bytes of the float and saves them as bfloat16. Conversion from other types goes through the default conversion to float and then to bfloat16.
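
For illustration, here is a minimal, self-contained sketch of such a truncating conversion (not the actual Paddle implementation; the type and function names are hypothetical). Using a bit shift instead of a raw byte copy keeps the sketch independent of endianness:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical standalone bfloat16: the 16 most significant bits of a
// float32 (sign bit, 8 exponent bits, top 7 mantissa bits).
struct bfloat16 {
  uint16_t x;
};

inline bfloat16 FloatToBfloat16(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));  // reinterpret the float bits safely
  return bfloat16{static_cast<uint16_t>(bits >> 16)};  // drop low 16 bits
}

inline float Bfloat16ToFloat(bfloat16 b) {
  uint32_t bits = static_cast<uint32_t>(b.x) << 16;  // low mantissa bits = 0
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}
```

Truncation like this simply discards the low 16 mantissa bits; production implementations may additionally round to nearest even, so treat this only as a sketch of the idea described above.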

This is the first step toward model inference with bfloat16 using the oneDNN library on devices that support it, such as Cooper Lake.

This PR does not need verification on Cooper Lake, because it is a universal implementation of the bfloat16 data type.

@paddle-bot-old

paddle-bot-old bot commented Jul 6, 2020

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@wozna wozna added the Intel label Jul 6, 2020
@wozna wozna force-pushed the bfloat16-dt branch 2 times, most recently from cb12a84 to eeae10e on July 8, 2020 at 07:20
jczaja
jczaja previously approved these changes Jul 13, 2020
Contributor

@jczaja jczaja left a comment

LGTM

@jczaja jczaja requested a review from grygielski July 13, 2020 13:42
grygielski
grygielski previously approved these changes Jul 14, 2020
Contributor

@grygielski grygielski left a comment

I've posted one question. Other than that, LGTM.

[Review thread on paddle/fluid/pybind/tensor_py.h: outdated, resolved]
@wozna
Contributor Author

wozna commented Jul 27, 2020

@luotao1 @wzzju Can I ask you for review and approval of framework.proto? PR-CI-CPU-Py2 asks for it.
In PR-CI-Coverage there is a build error saying that BF16 is not a proto VarType, although it is defined in framework.proto. I wonder if this is also due to the lack of approval or if it is a bug in the code.

@luotao1
Contributor

luotao1 commented Jul 28, 2020

> In PR-CI-Coverage there is a build error saying that BF16 is not a proto VarType, although it is defined in framework.proto. I wonder if this is also due to the lack of approval or if it is a bug in the code.

It is a bug in the code, not due to a lack of approval.

@wozna
Contributor Author

wozna commented Aug 3, 2020

@luotao1 You're right, but I still have a problem with PR-CI-Coverage: BF16 is not visible as a proto VarType although it is defined in framework.proto. The problem appeared when PR-CI-Coverage started to be built with -DWITH_LITE=ON. Could you ask someone related to Lite to help me with this problem?

@luotao1
Contributor

luotao1 commented Aug 5, 2020

> The problem appeared when PR-CI-Coverage started to be built with -DWITH_LITE=ON. Could you ask someone related to Lite to help me with this problem?

WITH_LITE=ON does not seem to affect BF16. @lidanqing-intel Could you help @wozna internally?

@lidanqing-intel
Contributor

lidanqing-intel commented Aug 7, 2020

Adam Osewski is looking at this WITH_LITE=ON issue. @arogowie-intel

@wozna
Contributor Author

wozna commented Aug 11, 2020

@luotao1 We found out that the problem with -DWITH_LITE=ON is caused by https://github.com/PaddlePaddle/Paddle-Lite/blob/develop/lite/core/framework.proto, which is a copy of the same file kept in both Paddle and Paddle-Lite. BF16 wasn't visible in Paddle-Lite because Lite was using its own framework.proto, which does not define BF16.
The simplest solution is to first add BF16 to framework.proto in Paddle-Lite and then in Paddle.
What do you think about it? Is there any procedure for this?

@luotao1
Contributor

luotao1 commented Aug 12, 2020

@wozna The update of framework.proto should be done cautiously. I discussed this with @Superjomn, who is responsible for inference: to avoid any incompatibility among different release versions, you should first provide two experiment results.

  • You could create a separate PR that changes only framework.proto, such as PR1.
  • Train a model with develop+PR1, then do C++ inference with the 1.8.x library.
  • Train a model with the 1.8.x whl, then do C++ inference with the develop+PR1 library.

If both experiment results are OK, you can first add BF16 to framework.proto in Paddle-Lite and then in Paddle.
If not, it is an incompatible update, and we will hold a thorough review first.

@wozna
Contributor Author

wozna commented Aug 12, 2020

Thank you @luotao1. I will run these experiments.

> Train a model with develop+PR1, then do C++ inference with the 1.8.x library.

@luotao1 or @Superjomn Do you have a particular model in mind for the training?

@luotao1
Contributor

luotao1 commented Aug 12, 2020

> Do you have a particular model in mind for the training?

@wozna You can choose any model.

@arogowie-intel
Contributor

Hi @luotao1
I'm going to continue the work on this PR since @wozna is going on vacation next week.

(Please correct me if I get something fundamentally wrong; I'm a newbie here :) )

Let me note that Paddle-Lite already contains 3 copies of the framework.proto file:

I'm not sure that keeping all those files in sync with Paddle's framework.proto is a good solution. I wonder why Paddle-Lite is not using the framework.proto protobuf message definitions already available in Paddle. As I can see, Paddle builds its framework_proto target (paddle/fluid/framework/CMakeLists.txt#L29), and this target is used as a dependency (paddle/fluid/inference/lite/CMakeLists.txt#L8) of the file where the compiler error is reported. Maybe this is the problem that should be fixed? I checked, and both framework.proto files generate correct sources (*.pb.h, *.pb.cc) at different paths, so one is not overwriting the other. Apparently Paddle-Lite uses only this proto file: lite/core/framework.proto.

Regarding the PaddleLite experiments you asked for, I also wonder whether they're necessary, since:

  • PR1 would only add a new value to the protobuf message, with no additional changes; nothing would use it yet.
  • A model trained with either develop+PR1 or the 1.8.x whl would, I suppose, still be FP32, not BF16, so there should be no difference: there are no passes yet that could change anything with BF16.

@arogowie-intel
Contributor

@luotao1
@luotao1
After my investigation, the problem is solved by updating the Paddle-Lite commit hash used in Paddle to the latest develop. Can we update the Paddle-Lite version?

[Review thread on paddle/fluid/framework/data_layout_transform.cc: outdated, resolved]
@@ -38,3 +38,25 @@ TEST(DataType, float16) {
std::string type = "::paddle::platform::float16";
EXPECT_STREQ(f::DataTypeToString(dtype).c_str(), type.c_str());
}

TEST(DataType, bfloat16) {

The first test could be parameterized and reused here, to avoid duplicating code.
This can be done later as a refactoring.

Contributor Author

You are right. I will refactor it later.
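
As a rough illustration of that future refactoring (a sketch only: the helper name is hypothetical, and the expected strings follow the test shown above), the duplicated float16/bfloat16 bodies could share one parameterized check:

```cpp
#include <gtest/gtest.h>
#include <string>

#include "paddle/fluid/framework/data_type.h"  // DataTypeToString

namespace f = paddle::framework;

// One helper checks the type-name mapping for any VarType, so the
// float16 and bfloat16 tests no longer duplicate the same body.
void CheckDataTypeName(f::proto::VarType::Type dtype,
                       const std::string& expected) {
  EXPECT_STREQ(f::DataTypeToString(dtype).c_str(), expected.c_str());
}

TEST(DataType, type_names) {
  CheckDataTypeName(f::proto::VarType::FP16, "::paddle::platform::float16");
  CheckDataTypeName(f::proto::VarType::BF16, "::paddle::platform::bfloat16");
}
```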

@@ -189,4 +194,120 @@ TEST(DataTypeTransform, CPUTransform) {
static_cast<paddle::platform::float16>(in_data_bool[i]).x);
}
}

// data type transform from/to bfloat16
{

The previous block could be parameterized as a function and reused here, to avoid duplication of code.
This can be done later as a refactoring.

Contributor Author

You are right. I will refactor it later.

@wozna
Contributor Author

wozna commented Aug 28, 2020

@luotao1
Unfortunately, PR #25402 didn't resolve the problem with Paddle-Lite.
framework.proto in Paddle is still somehow replaced by framework.proto from Paddle-Lite.

For now, my last proposal that resolves this error for sure is to add BF16 to framework.proto in Paddle-Lite, and then to update the Paddle-Lite hash in Paddle.
Right now I am going to run the same experiments that you asked @arogowie-intel for before.

Do I need to do anything more?

@luotao1
Contributor

luotao1 commented Aug 28, 2020

> Unfortunately, PR #25402 didn't resolve the problem with Paddle-Lite.
> framework.proto in Paddle is still somehow replaced by framework.proto from Paddle-Lite.

@arogowie-intel Did the earlier fix in #25402 (comment) not solve the problem? Where is the omission in the earlier investigation?

@arogowie-intel
Contributor

Hi @luotao1

> @arogowie-intel Did the earlier fix in #25402 (comment) not solve the problem? Where is the omission in the earlier investigation?

I'm truly sorry, but it looks like that was my fault. Just before proposing the solution in that comment, I had been experimenting with Paddle-Lite by adding the BF16 data type definition to its framework.proto file. I was compiling Paddle-Lite locally with this change and instructed Paddle to link against it rather than download and compile it by itself. I suspect that after those experiments my working environment was not cleaned up entirely, which caused Paddle to still use my local version of Paddle-Lite with the updated framework.proto.

After all, it still seems somewhat unexpected that the Paddle-Lite proto files take priority over the Paddle ones; or is that correct? I have been trying to find out why this happens, but searching through the CMake files in Paddle and Paddle-Lite I couldn't find anything suspicious. Updating the framework.proto of an external project used in Paddle is rather a workaround for this issue. Please help us explain this.

@luotao1
Contributor

luotao1 commented Aug 31, 2020

> Updating the framework.proto of an external project used in Paddle is rather a workaround for this issue.

Got it. I will discuss this with @Superjomn.

@luotao1
Contributor

luotao1 commented Sep 1, 2020

@wozna @arogowie-intel You can create a PR against the Paddle-Lite repo to update its framework.proto.

@jczaja
Contributor

jczaja commented Sep 2, 2020

@luotao1, @Superjomn We looked into the problem of this PR not building with the -DWITH_LITE=ON option. The root cause is that some test files, e.g. /Paddle/paddle/fluid/inference/lite/test_engine.cc, indirectly include two framework.pb.h files when PaddlePaddle is built with Paddle-Lite (one framework.pb.h is generated from Paddle's framework.proto and the other from Paddle-Lite's). Whichever of the two headers the compiler processes first is the one that gets used, so if the framework.proto files are not in sync, there is a build error. In this PR we changed the include order so that Paddle's framework.pb.h is included before Paddle-Lite's, which makes this PR buildable. This is a workaround. Ultimately it would be good to use only one framework.proto, not two, or to guarantee that both framework.proto files are always in sync.

Contributor

@grygielski grygielski left a comment

LGTM

Contributor

@luotao1 luotao1 left a comment

LGTM

@luotao1 luotao1 merged commit 95e1434 into PaddlePaddle:develop Sep 3, 2020
luotao1 added a commit that referenced this pull request Sep 3, 2020
@wozna wozna deleted the bfloat16-dt branch February 24, 2023 15:43