Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Update TF to 2.4.1. #6545

Closed
wants to merge 8 commits into from
Closed

Conversation

riga
Copy link
Contributor

@riga riga commented Jan 12, 2021

This PR is meant to update TensorFlow to version 2.4.1 which seems to be the first release that is compatible with CUDA 11 in our software stack, so we can eventually aim for GPU support.

However, in terms of the usual manual patches we apply during the integration process, 2.4 drops SWIG but uses pybind11 instead to generate its bindings. This change interferes with some of the patches in our build process (no more use_default_shell_env @smuzaffar @mrodozov) which is why I opened the PR as "work-in-progress" to store discussions early on for later reference.

So far, the changes to cmsdist are minimal and most of the work needs to be done in the external tensorflow fork.

  • I started porting the cms-specific changes here but further adjustments seem necessary (see below).
  • For reference, the changes needed for 2.3.1 can be found here.

The general idea would be to enable GPU support in a second PR after 2.4.1 is successfully integrated.

@mialiu149 @vlimant @gkasieczka

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @riga (Marcel R.) for branch IB/CMSSW_11_3_X/master.

@cmsbuild, @smuzaffar, @mrodozov can you please review it and eventually sign? Thanks.
cms-bot commands are listed here

@mrodozov
Copy link
Contributor

last issue we had with use_default_shell_env was actually with bazel.
In general it's good to update TF more frequently, even only for integration purposes.
I'll check latest bazel and then push to get 240 asap (so that it doesn't become time wasting black hole again)
I hope they ditched swig also for bazel

@riga
Copy link
Contributor Author

riga commented Jan 12, 2021

Great, thank you!

I gave you full access to my tensorflow v2.4.0 branch in case you want to test things on top.
In the current state, I see the "GLIBCXX_... not found" error we had before,

ERROR: /build/mrieger/tf24/BUILD/slc7_amd64_gcc900/external/tensorflow-sources/2.4.0/tensorflow-2.4.0/tensorflow/lite/python/BUILD:11:22: error executing shell command: '/bin/bash -c find 'bazel-out/k8-opt/bin/tensorflow/lite/python/schema_py_srcs_no_include_all' -name '*.py' -exec cat {} + | sed '/import flatbuffers/d' | sed '1s/^/import flatbuffers\'$'\n/' > baze...' failed (Exit 1): bash failed: error executing command
  (cd /build/mrieger/tf24/BUILD/slc7_amd64_gcc900/external/tensorflow-sources/2.4.0/build/cc9d7d342a1e7379d0513a09b90d108b/execroot/org_tensorflow && \
  exec env - \
  /bin/bash -c 'find '\''bazel-out/k8-opt/bin/tensorflow/lite/python/schema_py_srcs_no_include_all'\'' -name '\''*.py'\'' -exec cat {} + | sed '\''/import flatbuffers/d'\'' | sed '\''1s/^/import flatbuffers\'\''$'\''\n/'\'' > bazel-out/k8-opt/bin/tensorflow/lite/python/schema_py_generated.py')
Execution platform: @local_execution_config_platform//:platform
/build/mrieger/tf24/BUILD/slc7_amd64_gcc900/external/tensorflow-sources/2.4.0/build/install/2e09b417cdf204992c32c4cbc610e6a7/process-wrapper: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /build/mrieger/tf24/BUILD/slc7_amd64_gcc900/external/tensorflow-sources/2.4.0/build/install/2e09b417cdf204992c32c4cbc610e6a7/process-wrapper)
/build/mrieger/tf24/BUILD/slc7_amd64_gcc900/external/tensorflow-sources/2.4.0/build/install/2e09b417cdf204992c32c4cbc610e6a7/process-wrapper: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /build/mrieger/tf24/BUILD/slc7_amd64_gcc900/external/tensorflow-sources/2.4.0/build/install/2e09b417cdf204992c32c4cbc610e6a7/process-wrapper)
Target //tensorflow/tools/pip_package:build_pip_package failed to build

since bazel uses its process wrapper again, but I couldn't yet spot the place where to disable that.

@smuzaffar
Copy link
Contributor

@riga, https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_11_3_X/master/bazel-3.7.0-patches.patch should disable the bazel process wrapper use

@mrodozov
Copy link
Contributor

I tried bazel 3.7.2 in the morning. it builds with the existing patch. haven't tried TF 24 yet

@riga
Copy link
Contributor Author

riga commented Jan 15, 2021

@smuzaffar This was my expectation as well. Maybe there is another place where a patch is needed, I'll have a look.

@smuzaffar
Copy link
Contributor

looks like you also need the following

--- a/third_party/flatbuffers/build_defs.bzl
+++ b/third_party/flatbuffers/build_defs.bzl
@@ -382,6 +382,7 @@ def _concat_flatbuffer_py_srcs_impl(ctx):
     command = "find '%s' -name '*.py' -exec cat {} + | sed '/import flatbuffers/d'"
     command += " | sed '1s/^/import flatbuffers\\'$'\\n/' > %s"
     ctx.actions.run_shell(
+        use_default_shell_env = True,
         inputs = ctx.attr.deps[0].files,
         outputs = [ctx.outputs.out],
         command = command % (

@smuzaffar
Copy link
Contributor

smuzaffar commented Jan 15, 2021

@riga, also remove https://github.com/riga/tensorflow/blob/cms/v2.4.0/tensorflow/workspace.bzl#L393-L402 which should be picked up from cms

@riga
Copy link
Contributor Author

riga commented Jan 15, 2021

looks like you also need the following

I think I even had that at some point, but ran into another problem (will post it).

also remove https://github.com/riga/tensorflow/blob/cms/v2.4.0/tensorflow/workspace.bzl#L393-L402 which should be picked up from cms

Right, thank you.

@riga
Copy link
Contributor Author

riga commented Jan 15, 2021

The py2-keras-applications package seems to require cuda, which appears to fail for me, even when trying to build it with the current cmsdist master:

> pkgtools/cmsBuild -i cuda_test_820 -a slc7_amd64_gcc820 -j 8 build cuda

...

Creating directory pkg
Log file not open.
/build/mrieger/cuda_test_820/SOURCES/external/cuda/11.2.0/cuda_11.2.0_460.27.04_linux.run: line 513: 10898 Segmentation fault      ./cuda-installer --silent --override --installpath=/build/mrieger/cuda_test_820/BUILD/slc7_amd64_gcc820/external/cuda/11.2.0/build --toolkit
error: Bad exit status from /build/mrieger/cuda_test_820/tmp/rpm-tmp.MkCF3n (%install)

(both 820 and 900). Do you know if I'm missing something?

@smuzaffar
Copy link
Contributor

smuzaffar commented Jan 15, 2021

gcc version in this branch is GCC 9, so please use -a slc7_amd64_gcc900. This will avoid re-building of externals including cuda

@riga
Copy link
Contributor Author

riga commented Jan 15, 2021

I'm experiencing the same issue with gcc900 based the latest commit (ddec832).

> pkgtools/cmsBuild -i cuda_test_900 -a slc7_amd64_gcc900 -j 8 build py2-keras-applications

...

Starting to process package external+cuda+11.2.0
Checking if cuda is cached.
Package external+cuda+11.2.0 not found in repository. Queuing for build.

...

Creating directory pkg
Log file not open.
/build/mrieger/cuda_test_900/SOURCES/external/cuda/11.2.0/cuda_11.2.0_460.27.04_linux.run: line 513: 26324 Segmentation fault      ./cuda-installer --silent --override --installpath=/build/mrieger/cuda_test_900/BUILD/slc7_amd64_gcc900/external/cuda/11.2.0/build --toolkit
error: Bad exit status from /build/mrieger/cuda_test_900/tmp/rpm-tmp.1TiPXG (%install)

I'm wondering why cuda isn't picked up from the cache.

@smuzaffar
Copy link
Contributor

smuzaffar commented Jan 15, 2021

are you sure you have latest cmsdist IB/CMSSW_11_3_X/master branch + your changes? Also try to build in a fresh area. Use --weekly flag for cmsBuild too, this will download the latest packages from this week's IBs area

@smuzaffar
Copy link
Contributor

was there any progress on this?

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 9, 2021

Pull request #6545 was updated.

Conflicts:
	pip/requirements.txt
	tensorflow-requires.file
@cmsbuild
Copy link
Contributor

Pull request #6545 was updated.

@riga riga changed the title [WIP] Update TF to 2.4.0. [WIP] Update TF to 2.4.1. Feb 10, 2021
@riga
Copy link
Contributor Author

riga commented Feb 10, 2021

Hi @smuzaffar,
I ran into a segfault in the build of target //tensorflow/python/keras/api:keras_python_api_gen_compat_v2. I updated to 2.4.1, hoping that it would fix the issue but with no luck.

/build/mrieger/tf241/BUILD/slc7_amd64_gcc900/external/tensorflow-sources/2.4.1/cms-pytool/python: line 3: 30361 Segmentation fault      /build/mrieger/tf241/slc7_amd64_gcc900/external/python/2.7.15-bcolbf2/bin/python2 "$@"
Target //tensorflow/python/keras/api:keras_python_api_gen_compat_v2 failed to build

Unfortunately, nothing else is logged to track down the error. Did you perhaps see something similar in the 2.3 integration?

@smuzaffar
Copy link
Contributor

smuzaffar commented Feb 15, 2021

@riga, I think issue here is that grpc does not support c++17 standard. We also have grpc in our software stack and latest version of it also does not build with c++17.

@riga
Copy link
Contributor Author

riga commented Feb 15, 2021

Thanks for the hint. Do you think we can use the version of gprc then that's currently in the stack? It's currently built as a third_party dependency in tensorflow/workspace.bzl.

@smuzaffar
Copy link
Contributor

yes we can and we should. I had updated TF to use grpc from our stack but I am afraid we are stuck with TF 2.3 unless grpc supports c++17

@makortel
Copy link
Contributor

I had updated TF to use grpc from our stack but I am afraid we are stuck with TF 2.3 unless grpc supports c++17

Just curious, why couldn't we build grpc with C++14?

@smuzaffar
Copy link
Contributor

we need to build with c++17 otheriwse it using absl instead of string_view

@davidlange6
Copy link
Contributor

davidlange6 commented Feb 15, 2021 via email

@mrodozov
Copy link
Contributor

c++11… Perhaps this thread? https://stackoverflow.com/questions/59939678/grpc-auth-context-h-using-stditerator-produces-deprecated-warning-with-latest

On Feb 15, 2021, at 3:21 PM, Matti Kortelainen @.***> wrote: I had updated TF to use grpc from our stack but I am afraid we are stuck with TF 2.3 unless grpc supports c++17 Just curious, why couldn't we build grpc with C++14? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.

I built grpc with
-DCMAKE_CXX_STANDARD=17
which is resulting in:
-std=gnu++17
It builds. I'm trying if it's going to help with building TF, but I can't find this reference

I had updated TF to use grpc from our stack

@smuzaffar
Copy link
Contributor

c++11… Perhaps this thread? https://stackoverflow.com/questions/59939678/grpc-auth-context-h-using-stditerator-produces-deprecated-warning-with-latest

On Feb 15, 2021, at 3:21 PM, Matti Kortelainen @.***> wrote: I had updated TF to use grpc from our stack but I am afraid we are stuck with TF 2.3 unless grpc supports c++17 Just curious, why couldn't we build grpc with C++14? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.

I built grpc with
-DCMAKE_CXX_STANDARD=17
which is resulting in:
-std=gnu++17
It builds. I'm trying if it's going to help with building TF, but I can't find this reference

I had updated TF to use grpc from our stack

I had done all that but it still fails. Even latest version of grpc with protobuf 3.12 fails.

@mrodozov
Copy link
Contributor

mrodozov commented Feb 16, 2021

I got somewhere using only the py3 recipe, now I'm at:

ImportError: /build/mrodozov/tf24/build_tf24_cpp17/BUILD/slc7_amd64_gcc900/external/tensorflow-sources/2.4.1/build/70ce6c5a16d70ca60b901f2e5732d6d2/execroot/org_tenso\
rflow/bazel-out/k8-opt/bin/tensorflow/python/keras/api/create_tensorflow.python_api_keras_python_api_gen_compat_v1.runfiles/org_tensorflow/tensorflow/python/_pywrap_t\
ensorflow_internal.so: undefined symbol: _ZN10tensorflow6StatusC1ENS_5error4CodeEN4absl14lts_2020_02_2511string_viewEOSt6vectorINS_10StackFrameESaIS7_EE

@smuzaffar would you tell us if what you tried failed again with segfault or with this same missing symbol (which has numerous issues related to it and looks like it can be fixed, or at least it's better than the segfault)
Measured by the great method of how big the log file is I'm at 84M (very close to the end)
Edit: I used c++17 for this build.
With c++14 it actually built, but if I remember right the c++14 recipe doesn't run :/

@smuzaffar
Copy link
Contributor

undefined symbol: _ZN10tensorflow6StatusC1ENS_5error4CodeEN4absl14lts_2020_02_2511string_viewEOSt6vectorINS_10StackFrameESaIS7_EE

this was the exact error I got for py3

@smuzaffar
Copy link
Contributor

@mrodozov , note that TF uses its internal grpc, so building our grpc is not going to change TF build. I will push the changes for TF to use our GRPC.

@cmsbuild
Copy link
Contributor

Pull request #6545 was updated.

@cmsbuild
Copy link
Contributor

Pull request #6545 was updated.

@cmsbuild
Copy link
Contributor

Pull request #6545 was updated.

@smuzaffar
Copy link
Contributor

Ok, I have updated this PR to use grpc from cms externals. TF sources are not downloaded from cms-externals organization.

@riga
Copy link
Contributor Author

riga commented Feb 16, 2021

Ah, GitHub did not update the comments above while my tab was open. I roughly did the same + I managed to built the lastest grpc (1.35.0) after patching the parts of the code that were not C++17 compatible (at least the lines that error'd in the build) and updating protobuf to the lastest version (3.14.0).

I couldn't fully test this yet as TF is still compiling but I'll keep you posted.

Edit: the same error persists.

@cmsbuild
Copy link
Contributor

Pull request #6545 was updated.

@riga
Copy link
Contributor Author

riga commented Feb 19, 2021

I went a bit down the rabbit whole to track down the missing symbol. First of all, all links in both _pywrap_tensorflow_internal.so and libtensorflow_framework.so.2 look okay in the build environment, with the former linking to the latter as intended.

The missing symbol _ZN10tensorflow6StatusC1ENS_5error4CodeEN4absl14lts_2020_02_2511string_viewEOSt6vectorINS_10StackFrameESaIS7_EE resolves to

tensorflow::Status::Status(tensorflow::error::Code, absl::lts_2020_02_25::string_view, std::vector<tensorflow::StackFrame, std::allocator<tensorflow::StackFrame> >&&)

which is indeed missing in libtensorflow_framework.so.2. The absl string_view rings a bell. Didn't we see this also during the first integration attempts?

Also I noted that both gRPC and TF bring their own absl library, potentially in different versions, so we might consider building absl in cmsdist and linking it.

@mrodozov
Copy link
Contributor

Hmm since the last commits we are using gprc from our externals, but grpc is installing absl as a submodule, and also compiles it with C++14 by default (I think)
The shortest thing to do would be add the CMAKE_CXX_STANDARD=17 to the grpc.spec and rebuild because if the grpc is passing this C++14 flag to the absl it builds well you might have the missing symbols ...
We are using two externals which both build absl and they build it on their own so I guess the absl troll grew too big :/

@riga
Copy link
Contributor Author

riga commented Feb 19, 2021

The shortest thing to do would be add the CMAKE_CXX_STANDARD=17 to the grpc.spec and rebuild

That's actually what I'm doing :) I updated gPRC to 1.35.0 and build with C++17.
I'll try building absl externally next to have better control of the version and C++ standard. Ok, this would require tons of changes and is not worth the effort.

@smuzaffar
Copy link
Contributor

closing in favor of #6674

@smuzaffar smuzaffar closed this Mar 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants