Programs that use ocaml-torch with GPU acceleration segfault right before terminating #43

Open
jonathan-laurent opened this issue May 10, 2020 · 5 comments


jonathan-laurent commented May 10, 2020

Programs I write using ocaml-torch that use GPU acceleration segfault right before terminating:

Segmentation fault (core dumped)

This is not a huge deal as it happens when the program is about to terminate anyway but I was wondering if you had observed the same phenomenon.

In particular, I replicated the problem on your mnist/conv and char_rnn examples.

jonathan-laurent changed the title from "Programs that use ocaml-torch segfault right before terminating" to "Programs that use ocaml-torch with GPU acceleration segfault right before terminating" on May 10, 2020
@LaurentMazare
Owner

That's strange, I don't get any such error. Does it also happen when running examples/basics/basics.exe? Could you try running it within gdb if you have it installed?

@jonathan-laurent
Author

jonathan-laurent commented May 11, 2020

The bug does not happen with basics/basics.exe.

I ran the mnist/conv.exe example within GDB and got the following backtrace:

(gdb) run
Starting program: /home/jonathan/neurarith/_build/default/deps/ocaml-torch/examples/mnist/conv.exe 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fff9bcdd700 (LWP 8205)]
[New Thread 0x7fff9a5e6700 (LWP 8206)]
[New Thread 0x7fff99de5700 (LWP 8207)]
[New Thread 0x7fff995e4700 (LWP 8208)]
[New Thread 0x7fff98de3700 (LWP 8209)]
[New Thread 0x7fff93fff700 (LWP 8210)]
[New Thread 0x7fff91a15700 (LWP 8211)]
[New Thread 0x7fff91214700 (LWP 8212)]
[New Thread 0x7fff90a13700 (LWP 8213)]
[New Thread 0x7fff5dfff700 (LWP 8214)]
50 0.268205 94.07%
...
4950 0.000706 99.09%
5000 0.005436 98.99%
[Thread 0x7fff90a13700 (LWP 8213) exited]
[Thread 0x7fff5dfff700 (LWP 8214) exited]

Thread 1 "conv.exe" received signal SIGSEGV, Segmentation fault.
0x00007fffe817c23e in ?? () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2

(gdb) backtrace
#0  0x00007fffe817c23e in ?? () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2
#1  0x00007fffe818170b in ?? () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2
#2  0x00007fffe81ae2d0 in cudaStreamDestroy () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2
#3  0x00007fffa7fdb51d in cudnnDestroy () from /home/jonathan/Software/libtorch/lib/libtorch_cuda.so
#4  0x00007fffa72dda05 in at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cudnnContext*, &at::native::(anonymous namespace)::createCuDNNHandle, &at::native::(anonymous namespace)::destroyCuDNNHandle>::~DeviceThreadHandlePool() () from /home/jonathan/Software/libtorch/lib/libtorch_cuda.so
#5  0x00007fffe5a22615 in __cxa_finalize (d=0x7fffe1e763c0) at cxa_finalize.c:83
#6  0x00007fffa4b3ea83 in __do_global_dtors_aux () from /home/jonathan/Software/libtorch/lib/libtorch_cuda.so
#7  0x00007fffffffdad0 in ?? ()
#8  0x00007ffff7de5b73 in _dl_fini () at dl-fini.c:138
Backtrace stopped: frame did not save the PC

I suspect this is not very useful as GDB is missing some debug symbols. Would you be able to recommend some build options to get a more useful backtrace?

@zbroyar

zbroyar commented May 12, 2020

I had a similar issue and fixed it by calling Caml.Gc.full_major() after each epoch.
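
A minimal sketch of that workaround, for readers who want to see where the call would go. It uses only the OCaml standard library; `train_epoch` and `train` are hypothetical stand-ins for a real training step and are not part of ocaml-torch. The sketch assumes, as is common for OCaml bindings to native libraries, that native resources are released by finalisers once the GC finds the owning OCaml values unreachable.

```ocaml
(* Sketch only: [train_epoch] is a hypothetical stand-in for one pass over
   the training data; it is not part of ocaml-torch. *)
let train_epoch ~epoch =
  (* Allocate some short-lived data to mimic per-batch temporaries. *)
  let batch = Array.init 1_000 (fun i -> float_of_int (i + epoch)) in
  Array.fold_left ( +. ) 0. batch

let train ~epochs =
  let losses = ref [] in
  for epoch = 1 to epochs do
    let loss = train_epoch ~epoch in
    losses := loss :: !losses;
    (* Force a full major collection so that finalisers attached to
       values that became unreachable during this epoch get a chance
       to run before the next epoch starts. *)
    Gc.full_major ()
  done;
  List.rev !losses

let () =
  let losses = train ~epochs:3 in
  List.iteri (fun i l -> Printf.printf "epoch %d: %.1f\n" (i + 1) l) losses
```

Calling the collector once per epoch rather than once per batch keeps the overhead small; whether that is frequent enough likely depends on how much GPU memory each epoch allocates.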

@jonathan-laurent
Author

I have also observed that not calling the GC often enough during training can result in segfaults, but I suspect the problem is different here. For example, mnist/conv.exe already calls the GC after each epoch but still displays the problem on my machine.

@Kwonsoo

Kwonsoo commented Dec 17, 2020

Hi all,

I also ran into a segmentation fault and spent some time over the last two days trying to resolve it.

I was running some optimization procedures other than the basic examples, which I cannot share here, and after several epochs the process terminated with the Segmentation fault (core dumped) error message. In the /var/log/syslog file, I found relevant lines like the following:
Dec 17 12:08:55 h02 kernel: [1880432.552416] main.exe[29569]: segfault at 7fd86f5db908 ip 00007fd86153e7ee sp 00007fff73c98f00 error 4 in libtorch.so[7fd86037b000+e8d1000]
or
Dec 17 11:34:34 h02 kernel: [1878371.127518] traps: main.exe[18968] general protection ip:14bc7b71d19c sp:7ffd41c02ed8 error:0 in libc10.so[14bc7b701000+43000].
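
In kernel messages of this form, `ip` is the faulting instruction pointer and the bracketed pair after the library name is the base address and size of the mapping, so the position of the crash inside the library can be estimated as ip minus base. A small sketch of that arithmetic, using the values from the libtorch.so line; the result is an offset relative to the start of the mapping and is only a starting point for a symbol lookup, which would also need a libtorch build with symbols.

```ocaml
(* Values copied from the libtorch.so syslog line above. *)
let ip = 0x7fd86153e7ee      (* faulting instruction pointer *)
let base = 0x7fd86037b000    (* start of the libtorch.so mapping *)
let size = 0xe8d1000         (* length of that mapping *)

(* Offset of the faulting instruction relative to the start of the mapping. *)
let offset = ip - base

let () =
  (* Sanity check: the instruction pointer must fall inside the mapping. *)
  assert (offset >= 0 && offset < size);
  Printf.printf "offset in libtorch.so: 0x%x\n" offset
```

This prints `offset in libtorch.so: 0x11c37ee`.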

I tried the following before running dune exec ...:
(1) ran ulimit -s unlimited
(2) ran sudo apt update; sudo apt upgrade
(3) ran opam update; opam upgrade
Attempts (1) and (2) did not help, but attempt (3) resolved the issue. I think upgrading the libtorch version is what helped. For completeness, here is what opam upgrade did on my computer:

The following actions will be performed:
  ↗ upgrade   num                     1.3 to 1.4
  ↗ upgrade   dune                    2.4.0 to 2.7.1
  ↗ upgrade   conf-openblas           0.2.0 to 0.2.1
  ↗ upgrade   conf-pkg-config         1.1 to 1.3
  ↗ upgrade   libtorch                1.4.0 to 1.7.0+linux-x86_64
  ↗ upgrade   topkg                   1.0.1 to 1.0.3
  ↗ upgrade   batteries               3.0.0 to 3.2.0
  ∗ install   trie                    1.0.0                       [required by mew]
  ∗ install   octavius                1.2.2                       [required by ppx_js_style]
  ∗ install   jane-street-headers     v0.14.0                     [required by time_now]
  ↗ upgrade   sexplib0                v0.13.0 to v0.14.0
  ↗ upgrade   ocaml-migrate-parsetree 1.6.0 to 2.1.0
  ↗ upgrade   ocaml-compiler-libs     v0.12.1 to v0.12.3
  ↗ upgrade   integers                0.3.0 to 0.4.0
  ↗ upgrade   dune-private-libs       2.4.0 to 2.7.1
  ↻ recompile stdlib-shims            0.1.0                       [uses dune]
  ↻ recompile result                  1.5                         [uses dune]
  ↻ recompile re                      1.9.0                       [uses dune]
  ↻ recompile ppx_derivers            1.2.1                       [uses dune]
  ↻ recompile npy                     0.0.9                       [uses dune]
  ↻ recompile mmap                    1.1.0                       [uses dune]
  ↻ recompile csv                     2.4                         [uses dune]
  ↻ recompile cppo                    1.6.6                       [uses dune]
  ↻ recompile camomile                1.0.2                       [uses dune]
  ∗ install   conf-libffi             2.0.0                       [required by ctypes-foreign]
  ↗ upgrade   astring                 0.8.3 to 0.8.5
  ↻ recompile b0                      0.0.1                       [uses topkg]
  ↻ recompile owl-base                0.9.0                       [uses dune]
  ∗ install   mew                     0.1.0                       [required by mew_vi]
  ∗ install   csexp                   1.3.2                       [required by dune-configurator]
  ↻ recompile tyxml                   4.4.0                       [uses dune]
  ↗ upgrade   ppxlib                  0.12.0 to 0.17.0
  ↗ upgrade   ocplib-endian           1.0 to 1.1
  ↻ recompile charInfo_width          1.1.0                       [uses dune]
  ↻ recompile ctypes-foreign          0.4.0                       [upstream changes]
  ↗ upgrade   fpath                   0.7.2 to 0.7.3
  ∗ install   mew_vi                  0.5.0                       [required by lambda-term]
  ↗ upgrade   dune-configurator       2.4.0 to 2.7.1
  ↗ upgrade   zed                     2.0.6 to 3.1.0
  ↻ recompile ctypes                  0.17.1                      [uses integers, conf-pkg-config, ctypes-foreign]
  ↗ upgrade   odoc                    1.5.0 to 1.5.2
  ↗ upgrade   lwt                     5.2.0 to 5.3.0
  ↗ upgrade   base                    v0.13.1 to v0.14.0
  ↗ upgrade   eigen                   0.2.0 to 0.3.0
  ↻ recompile odig                    0.0.5                       [uses odoc, topkg]
  ↻ recompile lwt_react               1.1.3                       [uses dune, lwt]
  ↻ recompile lwt_log                 1.1.1                       [uses dune, lwt]
  ∗ install   ppx_js_style            v0.14.0                     [required by ppx_base]
  ∗ install   ppx_enumerate           v0.14.0                     [required by ppx_base]
  ↗ upgrade   variantslib             v0.13.0 to v0.14.0
  ↗ upgrade   stdio                   v0.13.0 to v0.14.0
  ↗ upgrade   ppx_sexp_conv           v0.13.0 to v0.14.1
  ↗ upgrade   ppx_here                v0.13.0 to v0.14.0
  ↗ upgrade   ppx_compare             v0.13.0 to v0.14.0
  ↗ upgrade   ppx_cold                v0.13.0 to v0.14.0
  ↗ upgrade   parsexp                 v0.13.0 to v0.14.0
  ↗ upgrade   fieldslib               v0.13.0 to v0.14.0
  ↗ upgrade   lambda-term             2.0.3 to 3.1.0
  ↗ upgrade   ppx_variants_conv       v0.13.0 to v0.14.1
  ∗ install   ppx_optcomp             v0.14.0                     [required by time_now]
  ↻ recompile owl                     0.9.0*                      [uses eigen, dune, base, etc.]
  ↗ upgrade   ppx_custom_printf       v0.13.0 to v0.14.0
  ∗ install   ppx_hash                v0.14.0                     [required by ppx_base]
  ↗ upgrade   ppx_assert              v0.13.0 to v0.14.0
  ↗ upgrade   sexplib                 v0.13.0 to v0.14.0
  ↗ upgrade   ppx_fields_conv         v0.13.0 to v0.14.1
  ↗ upgrade   utop                    2.4.3 to 2.6.0
  ∗ install   ppx_base                v0.14.0                     [required by time_now]
  ∗ install   jst-config              v0.14.0                     [required by time_now]
  ∗ install   time_now                v0.14.0                     [required by ppx_inline_test]
  ↗ upgrade   ppx_inline_test         v0.13.0 to v0.14.1
  ↗ upgrade   ppx_expect              v0.13.0 to v0.14.0
  ↗ upgrade   torch                   0.8 to 0.11

I hope this saves people some debugging time in the future.

Thanks,
Gwonsoo
