Support for Redhat, Centos and many superclusters #110
Comments
Since we depend on bazel, this sounds like a bazel issue. Feel free to re-open if bazel ends up supporting 2.12 or lower, and we can see what we can do. |
Am I right that you depend on bazel only at build-time? If this is true then it can be viewed as something you could do something about too... You could also release static-linked packages that would be very useful to people stuck on clusters with old libraries... |
So did anyone find some way past this problem? I'm using redhat 6.4, as is my entire corporation. We're stuck on redhat 6.4. I'm not sure how to end up running tensorflow on such a machine... |
I managed to have it running on a CentOS 6.7 : http://stackoverflow.com/a/34897674/1990516 :) Edit: I proposed an alternative solution also: http://stackoverflow.com/a/34900471/1990516 |
Thanks man! I'll look into it as soon as I can. |
Could you let me know if this worked? I can't seem to get any of these other solutions working. |
@ttrouill only says he got it working on 6.7, so I haven't actually checked whether this works on 6.4... |
Both solutions seem to work, but they're not optimal. TensorFlow and Python seem to run okay, but if I try and run IPython, then with the first solution I get an Invalid ELF error, and with the second solution there is a memory leak and IPython continues to absorb all memory with time. I believe that this can also happen with other Python imports that rely on libraries that were compiled using the older libc. I'd love to see a straightforward how-to-compile-bazel-with-old-glibc guide, but I haven't come across one yet. |
Also bazelbuild/bazel#760 is relevant, but it's far from straightforward and my attempt to build bazel using this guide failed. Hopefully within the next few weeks I can give it some more time and continue that thread with the errors I end up getting. |
Compiling on CentOS still isn't all that straightforward, but I figured I'd give an overview here for now. This works for me with The instructions here are specific to my base Paths for reference: Bazel
TensorFlow
|
Update: Previous process was for a commit after release 7. Here are necessary changes for commit 1d4fd06, which is after release 8:
|
Our administrator managed to run the pip-installed tensorflow package on a RHEL 6.7 server (without building Bazel or TensorFlow from source). The core idea is to get a separate, newer version of glibc:
Fast test:
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
a = tf.constant(10)
b = tf.constant(32)
print(sess.run(a + b))
Note: this approach is only for running Python scripts. Remember that every time you add $libcroot to your path, all other shell commands break (i.e. you cannot use ls, cd, ...). You might want to start bash -l, screen, or byobu before you try this so you don't mess up your own session. |
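One way to avoid breaking the whole shell session is to start only the Python process through the newer glibc's dynamic loader, passing the library directories with the loader's --library-path option. The following is a minimal sketch of building such a command line, not the administrator's actual setup: the paths, the script name, and the assumption that the loader lives at lib/ld-linux-x86-64.so.2 under the glibc root are all hypothetical.

```python
import os


def build_loader_command(libc_root, python_exe, script_args, extra_lib_dirs=()):
    """Return an argv list that runs python_exe under an alternate glibc.

    libc_root is a hypothetical directory holding the newer glibc; the loader
    location below is an assumption and may differ on your system.
    """
    lib_dir = os.path.join(libc_root, "lib")
    # Name of the glibc dynamic loader on x86-64 Linux.
    loader = os.path.join(lib_dir, "ld-linux-x86-64.so.2")
    # The new glibc comes first so it wins over the system copy.
    library_path = ":".join([lib_dir] + list(extra_lib_dirs))
    return [loader, "--library-path", library_path, python_exe] + list(script_args)


if __name__ == "__main__":
    cmd = build_loader_command(
        "/opt/glibc-2.17",      # hypothetical install location
        "/usr/bin/python",
        ["hello_tf.py"],        # hypothetical script
        extra_lib_dirs=["/usr/lib64"],
    )
    print(" ".join(cmd))
    # To launch it for real: subprocess.call(cmd)
```

Because the library path is passed to one process only, ls, cd and the rest of the session keep using the system glibc. Python extension modules built against the old glibc may still misbehave, as reported for IPython earlier in this thread.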
Yeah that was described here a while back, but as you mention, it's not ideal. For example if you run Jupyter it'll lead to a memory leak / crash (at least on the system I tried it with). |
Should these lines be added after every occurrence of the toolpath containing the gcc path, i.e. twice, wherever I changed /usr/bin/gcc? |
I don't know what you mean by twice. I'm pretty sure I only inserted those lines once, although if you were to insert them in multiple places it probably wouldn't do any harm. |
@kskp @rdipietro : is that still needed with latest version of Bazel? If yes then we have an issue in the C++ detection code. |
Bazel compiles out of the box as long as I set |
You mean change to the cuda crosstool file? |
Yes. My May 17 comment above includes everything I needed to do. Specifically, needed to edit CROSSTOOL and needed to introduce two hacks to get bazel to find things outside of its isolated environment. |
@rdipietro Thanks for your reply. Sorry for my ignorance, but could you please tell me what toolpath is? I am assuming it is the block of code where the gcc path had to be changed. I did that twice in the entire file (since it said to replace all occurrences of /usr/bin/gcc). So do I have to add those lines after each block of code where I changed the /usr/bin/gcc path? |
@rdipietro @damienmg I am not using the latest version of Bazel. I need version 0.2.2b. I ultimately have to run SyntaxNet on CentOS 6.7. |
0.2.2b should work too. |
Oh, I tried a couple of weeks ago but it did not work. Will do it again today. Thanks for your reply. |
Note that you still have to make the CUDA CROSSTOOL modification when building with --config cuda. |
I built the latest TensorFlow (GitHub master branch) with GPU support at a supercomputing center (CentOS 6.7 with gcc 4.9.2; in general, a customized cc toolchain). I point out some of the environment variable settings that are necessary for a successful build, documented here for future reference: |
Thanks @rdipietro ! I have been able to successfully install r0.12 with Bazel 0.4.3 on a cluster. Some of your suggestions needed to be modified to cater to the changes in the new version of TF and Bazel. But, your suggestions provided a solid starting point. When I get the time, I will write up the changes that I had to make. |
You're welcome @VittalP :) I have an updated set of notes that works as of 1.0.0 alpha: First of all Bazel finally just works. Can download the newest 0.4.x source code (dist zip version), run TensorFlow unfortunately still doesn't just work. So (replacing my paths with yours):
linker_flag: "-Wl,-R/cm/shared/apps/gcc/4.8.2/lib64"
I configured with cuda 7.5, cudnn 5, compute compatibility 3.5 and built with |
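The linker_flag line above embeds a runtime library search path (-Wl,-R) so that binaries find the custom gcc's libraries. If several library directories are involved, the lines can be generated rather than typed by hand. This is only an illustrative sketch: the helper name is made up, and where the lines belong in CROSSTOOL depends on the TensorFlow version.

```python
def rpath_linker_flags(lib_dirs):
    """Return CROSSTOOL-style linker_flag lines, one per library directory.

    Each line asks the linker to record the directory as a runtime
    search path, matching the form used in the comment above.
    """
    return ['linker_flag: "-Wl,-R{}"'.format(d) for d in lib_dirs]


if __name__ == "__main__":
    # Replace with the lib64 directory of your own gcc installation.
    for line in rpath_linker_flags(["/cm/shared/apps/gcc/4.8.2/lib64"]):
        print(line)
```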
@rdipietro @VittalP I wrote an explanation of how to install the latest TensorFlow right before @VittalP's post, but you guys simply ignored it... As a JHUer, I kindly note that I have sent my instructions to the MARCC staff, and there is already a tensorflow module on MARCC. You may want to read my post to see where it differs: http://biophysics.med.jhmi.edu/~yliu120/tensorflow.html If something needs to be updated, please let me know. |
Sorry! I didn't notice that you had posted here. But note that you are making changes that I didn't need to make. Probably depends on specific versions of TF / cuda / gcc / whatever. Side note: I still compile on MARCC because they only installed TF for Python 2.x, whereas I'm using 3.x. |
I have updated my webpage for building tensorflow 1.0.0 with python 3.5.2. I provided two wheels on the webpage as well. Please refer to: |
For whoever wants to compile TensorFlow 1.0 on RedHat 6 and with Python 2.7, I provide a detailed step-by-step guide here: https://www.linkedin.com/pulse/compiling-tensorflow-10-python-27-redhat-6-florian-raudies |
And here we go again for r1.2. (Note: since r1.0, the Bazel configuration file organization has been mucked with.)
Bazel: Need a newish version. 0.4.3 did not work, 0.4.5 did. Again, Bazel now compiles easily even with older CentOS / glibc, so this is straightforward.
Required edits for TensorFlow:
vim third_party/gpus/crosstool/CROSSTOOL_nvcc.tpl
vim third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl
Final notes: Wouldn't work with Cuda 7.5, CuDNN 5 (cuda compilation errors). Success with Cuda 8.0, CuDNN 5. |
I am working on a CentOS 6 cluster which uses the Lustre filesystem. I am unable to make Bazel work on it, since Bazel can't use file locking there. Refer to this issue. So would it be possible for TensorFlow to support other build tools? Edit: the error is: unexpected result from F_SETLK: Function not implemented. Also refer to the hyperlink above. |
@JoyChopra1298 |
@yliu120 Thank you, using Bazel's output_user_root option worked. |
I have a similar problem here, specifically in step 2 for TF. /home2/my_name/.cache/bazel/_bazel_my_name/b9c3b9594c932d1e804df44467c1c0d2/external/boringssl/BUILD:115:1: C++ compilation of rule '@boringssl//:crypto' failed (Exit 1) When I used --verbose_failures to monitor the build process, I obtained the output collected in error_records.txt. Can anyone help with this issue? |
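For readers hitting the same F_SETLK error: --output_user_root is a Bazel startup option, so it goes before the build command. The sketch below assembles such a command with the output root on node-local storage. It assumes the fix was to move Bazel's output tree off Lustre, which the thread does not spell out; the default directory and the build target are examples only.

```python
import os
import tempfile


def bazel_build_command(target, local_root=None, build_args=()):
    """Return an argv list for `bazel build` with a relocated output root.

    local_root should be a directory on a filesystem that supports file
    locking (for example node-local /tmp); the default below is a guess.
    """
    if local_root is None:
        user = os.environ.get("USER", "user")
        local_root = os.path.join(tempfile.gettempdir(), "bazel_" + user)
    # Startup options must precede the command name ("build").
    return (["bazel", "--output_user_root=" + local_root, "build"]
            + list(build_args) + [target])


if __name__ == "__main__":
    cmd = bazel_build_command(
        "//tensorflow/tools/pip_package:build_pip_package",  # example target
        build_args=["-c", "opt"],
    )
    print(" ".join(cmd))
    # To run it for real: subprocess.check_call(cmd)
```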
@owenyoung75 Did you solve this problem? I'm facing similar situation. |
…r_is_not_gcc (#3077) Many superclusters need to compile TensorFlow from source due to an outdated glibc version (see #110). In @rdipietro's excellent workaround post (tensorflow/tensorflow#110 (comment)) he mentions issues with the referenced Python version in this file. I have issues as well, but of a different nature. In my case the build script is unable to find `libpython2.7.so.1.0`, since only Python 3 is present on my machine. The issue originates from `crosstool_wrapper_driver_is_not_gcc` where the only Python 2.7 exclusive feature is the `print` statement. By `import`ing `print_function from __future__` the explicit dependency can be dropped and both versions of Python are supported.
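The change described above can be illustrated with a few lines. This is a generic sketch of the technique, not the actual contents of crosstool_wrapper_driver_is_not_gcc; the log helper and its message prefix are invented for the example.

```python
# The __future__ import must be the first statement in the file.
# On Python 2.7 it turns print into a function; on Python 3 it is a no-op.
from __future__ import print_function

import sys


def log(message, stream=sys.stderr):
    """Write a diagnostic line using the function form of print."""
    print("crosstool wrapper:", message, file=stream)


if __name__ == "__main__":
    log("running under Python %d.%d" % sys.version_info[:2])
```

With the import in place, the same file runs unchanged under both interpreters, so the build no longer needs a Python 2.7 installation just for the print statement.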
Many cluster systems that use modules run Red Hat or CentOS < 7, which ships glibc 2.12.
Since Bazel requires glibc 2.14 and the prebuilt Linux version requires glibc 2.17, it is hopeless to run TensorFlow on such clusters.
See the related issue reported on Bazel: bazelbuild/bazel#583
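To check which side of these thresholds a given machine falls on, the glibc version can be read from Python's standard library. A small sketch follows; the helper names are made up, and the thresholds are the 2.14 and 2.17 figures quoted above.

```python
import platform


def parse_version(text):
    """Turn a dotted version such as '2.12' into a tuple of ints: (2, 12)."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_at_least(required, detected=None):
    """Return True if the detected glibc version is >= required.

    When detected is None, the version is taken from platform.libc_ver(),
    which inspects the running Python executable.
    """
    if detected is None:
        lib, detected = platform.libc_ver()
        if lib != "glibc" or not detected:
            return False  # not glibc, or the version could not be determined
    # Compare as integer tuples so that 2.9 sorts before 2.12.
    return parse_version(detected) >= parse_version(required)


if __name__ == "__main__":
    print("libc:", platform.libc_ver())
    for needed in ("2.14", "2.17"):
        print("glibc >=", needed, ":", glibc_at_least(needed))
```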