Ginkgo + CUDA Tests Fail on Marianas #540

Closed
cameronrutherford opened this issue Sep 1, 2022 · 7 comments

@cameronrutherford
Collaborator

cameronrutherford commented Sep 1, 2022

The error message is here:

27: Setting up Ginkgo solver ... 
27: terminate called after throwing an instance of 'gko::CudaError'
27:   what():  /tmp/ruth521/spack-stage/spack-stage-ginkgo-glu_experimental-dbmokiqc3tlyvnwehe546lb25lrnuaod/spack-src/cuda/base/executor.cpp:192: raw_copy_to: cudaErrorInvalidValue: invalid argument
27: [dlt03:33417] *** Process received signal ***

Marianas runs CentOS 7 and its GPUs have a maximum CUDA compute capability of 60 (that is, 6.0), so the current assumption is that the bug is specific to that build combination (see the discussion in #521).

The current ctest configuration in CI disables the relevant tests, so CI may appear to be passing even though these tests fail: https://github.com/LLNL/hiop/blob/develop/.gitlab-ci.yml#L164

@cameronrutherford
Collaborator Author

Relevant failing test logs in PNNL CI:

NlpSparse1_6 logs:

PNNL Pipeline

test 22
      Start 22: NlpSparse1_6
22: Test command: /share/apps/openmpi/4.1.0/gcc/10.2.0/bin/mpirun "-n" "1" "/people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx1.exe" "500" "-ginkgo_cuda" "-selfcheck"
22: Test timeout computed to be: 1800
22: [1659713290.594617] [dlt03:33270:0]    ucp_context.c:1470 UCX  WARN  UCP version is incompatible, required: 1.10, actual: 1.8 (release 0 /usr/lib64/libucp.so.0)
22: [1659713290.621398] [dlt03:33270:0]       ib_iface.c:665  UCX  ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory
22: [dlt03.local:33270] pml_ucx.c:273  Error: Failed to create UCP worker
22: ===============
22: Hiop SOLVER
22: ===============
22: Using 1 MPI ranks.
22: ---------------
22: Problem Summary
22: ---------------
22: Total number of variables: 500
22:      lower/upper/lower_and_upper bounds: 499 / 1 / 1
22: Total number of equality constraints: 1
22: Total number of inequality constraints: 498
22:      lower/upper/lower_and_upper bounds: 498 / 497 / 497
22: LSQ linear solver --- KKT_SPARSE_XDYcYd linsys: MA57 size 1497 cons 499 nnz 3991 (option 'duals_init_linear_solver_sparse' 'auto')
22: iter    objective     inf_pr     inf_du   lg(mu)  alpha_du   alpha_pr linesrch
22:    0  7.6705009e+00 9.980e+00  1.118e+00  -1.00  0.000e+00  0.000e+00  -(-)
22: Setting up Ginkgo solver ... 
22: terminate called after throwing an instance of 'gko::CudaError'
22:   what():  /tmp/ruth521/spack-stage/spack-stage-ginkgo-glu_experimental-dbmokiqc3tlyvnwehe546lb25lrnuaod/spack-src/cuda/base/executor.cpp:192: raw_copy_to: cudaErrorInvalidValue: invalid argument
22: [dlt03:33270] *** Process received signal ***
22: [dlt03:33270] Signal: Aborted (6)
22: [dlt03:33270] Signal code:  (-6)
22: [dlt03:33270] [ 0] /usr/lib64/libpthread.so.0(+0xf630)[0x7f9f37a82630]
22: [dlt03:33270] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x7f9f27bd8387]
22: [dlt03:33270] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x7f9f27bd9a78]
22: [dlt03:33270] [ 3] /share/apps/gcc/10.2.0/lib64/libstdc++.so.6(+0x9934c)[0x7f9f2852334c]
22: [dlt03:33270] [ 4] /share/apps/gcc/10.2.0/lib64/libstdc++.so.6(+0xa4656)[0x7f9f2852e656]
22: [dlt03:33270] [ 5] /share/apps/gcc/10.2.0/lib64/libstdc++.so.6(+0xa46c1)[0x7f9f2852e6c1]
22: [dlt03:33270] [ 6] /share/apps/gcc/10.2.0/lib64/libstdc++.so.6(+0xa4955)[0x7f9f2852e955]
22: [dlt03:33270] [ 7] /qfs/projects/exasgd/src/cameron/spack/opt/spack/linux-centos7-zen2/gcc-10.2.0/ginkgo-glu_experimental-dbmokiqc3tlyvnwehe546lb25lrnuaod/lib64/libginkgo_cuda.so.1.5.0(+0x18cb28)[0x7f9f29858b28]
22: [dlt03:33270] [ 8] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx1.exe(_ZNK3gko8Executor9copy_fromIdEEvPKS0_mPKT_PS4_+0x150)[0x64d6d6]
22: [dlt03:33270] [ 9] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx1.exe(_ZN3gko5arrayIdEaSERKS1_+0x2ca)[0x648daa]
22: [dlt03:33270] [10] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx1.exe[0x646a54]
22: [dlt03:33270] [11] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx1.exe[0x64353d]
22: [dlt03:33270] [12] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx1.exe[0x63ab67]
22: [dlt03:33270] [13] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx1.exe[0x635579]
22: [dlt03:33270] [14] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx1.exe[0x6261dc]
22: [dlt03:33270] [15] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx1.exe[0x6274b4]
22: [dlt03:33270] [16] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx1.exe[0x627743]
22: [dlt03:33270] [17] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx1.exe[0x5c82b7]
22: [dlt03:33270] [18] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx1.exe[0x5c8473]
22: [dlt03:33270] [19] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx1.exe[0x5ca579]
22: [dlt03:33270] [20] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx1.exe[0x5bc02b]
22: [dlt03:33270] [21] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx1.exe[0x4d7c60]
22: [dlt03:33270] [22] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f9f27bc4555]
22: [dlt03:33270] [23] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx1.exe[0x4d5237]
22: [dlt03:33270] *** End of error message ***
22: --------------------------------------------------------------------------
22: Primary job  terminated normally, but 1 process returned
22: a non-zero exit code. Per user-direction, the job has been aborted.
22: --------------------------------------------------------------------------
22: --------------------------------------------------------------------------
22: mpirun noticed that process rank 0 with PID 33270 on node dlt03 exited on signal 6 (Aborted).
22: --------------------------------------------------------------------------
NlpSparse2_5 logs:

PNNL Pipeline

Start 27: NlpSparse2_5
27: Test command: /share/apps/openmpi/4.1.0/gcc/10.2.0/bin/mpirun "-n" "1" "/people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx2.exe" "500" "-ginkgo_cuda" "-inertiafree" "-selfcheck"
27: Test timeout computed to be: 1800
27: [1659713336.290306] [dlt03:33417:0]    ucp_context.c:1470 UCX  WARN  UCP version is incompatible, required: 1.10, actual: 1.8 (release 0 /usr/lib64/libucp.so.0)
27: [1659713336.318631] [dlt03:33417:0]       ib_iface.c:665  UCX  ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory
27: [dlt03.local:33417] pml_ucx.c:273  Error: Failed to create UCP worker
27: ===============
27: Hiop SOLVER
27: ===============
27: Using 1 MPI ranks.
27: ---------------
27: Problem Summary
27: ---------------
27: Total number of variables: 500
27:      lower/upper/lower_and_upper bounds: 499 / 1 / 1
27: Total number of equality constraints: 2
27: Total number of inequality constraints: 499
27:      lower/upper/lower_and_upper bounds: 498 / 498 / 497
27: iter    objective     inf_pr     inf_du   lg(mu)  alpha_du   alpha_pr linesrch
27:    0  6.4379656e+01 9.980e+00  1.010e+00   0.00  0.000e+00  0.000e+00  -(-)
27: Setting up Ginkgo solver ... 
27: terminate called after throwing an instance of 'gko::CudaError'
27:   what():  /tmp/ruth521/spack-stage/spack-stage-ginkgo-glu_experimental-dbmokiqc3tlyvnwehe546lb25lrnuaod/spack-src/cuda/base/executor.cpp:192: raw_copy_to: cudaErrorInvalidValue: invalid argument
27: [dlt03:33417] *** Process received signal ***
27: [dlt03:33417] Signal: Aborted (6)
27: [dlt03:33417] Signal code:  (-6)
27: [dlt03:33417] [ 0] /usr/lib64/libpthread.so.0(+0xf630)[0x7f715e9aa630]
27: [dlt03:33417] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x7f714eb00387]
27: [dlt03:33417] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x7f714eb01a78]
27: [dlt03:33417] [ 3] /share/apps/gcc/10.2.0/lib64/libstdc++.so.6(+0x9934c)[0x7f714f44b34c]
27: [dlt03:33417] [ 4] /share/apps/gcc/10.2.0/lib64/libstdc++.so.6(+0xa4656)[0x7f714f456656]
27: [dlt03:33417] [ 5] /share/apps/gcc/10.2.0/lib64/libstdc++.so.6(+0xa46c1)[0x7f714f4566c1]
27: [dlt03:33417] [ 6] /share/apps/gcc/10.2.0/lib64/libstdc++.so.6(+0xa4955)[0x7f714f456955]
27: [dlt03:33417] [ 7] /qfs/projects/exasgd/src/cameron/spack/opt/spack/linux-centos7-zen2/gcc-10.2.0/ginkgo-glu_experimental-dbmokiqc3tlyvnwehe546lb25lrnuaod/lib64/libginkgo_cuda.so.1.5.0(+0x18cb28)[0x7f7150780b28]
27: [dlt03:33417] [ 8] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx2.exe(_ZNK3gko8Executor9copy_fromIdEEvPKS0_mPKT_PS4_+0x150)[0x64de3a]
27: [dlt03:33417] [ 9] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx2.exe(_ZN3gko5arrayIdEaSERKS1_+0x2ca)[0x64950e]
27: [dlt03:33417] [10] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx2.exe[0x6471b8]
27: [dlt03:33417] [11] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx2.exe[0x643ca1]
27: [dlt03:33417] [12] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx2.exe[0x63b2cb]
27: [dlt03:33417] [13] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx2.exe[0x635cdd]
27: [dlt03:33417] [14] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx2.exe[0x626940]
27: [dlt03:33417] [15] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx2.exe[0x627c18]
27: [dlt03:33417] [16] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx2.exe[0x627ea7]
27: [dlt03:33417] [17] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx2.exe[0x5c8a1b]
27: [dlt03:33417] [18] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx2.exe[0x5c8bd7]
27: [dlt03:33417] [19] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx2.exe[0x5cacdd]
27: [dlt03:33417] [20] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx2.exe[0x5bc78f]
27: [dlt03:33417] [21] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx2.exe[0x4d7fed]
27: [dlt03:33417] [22] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f714eaec555]
27: [dlt03:33417] [23] /people/svcexasgd/gitlab/97388/build/src/Drivers/Sparse/NlpSparseEx2.exe[0x4d51e7]
27: [dlt03:33417] *** End of error message ***
27: --------------------------------------------------------------------------
27: Primary job  terminated normally, but 1 process returned
27: a non-zero exit code. Per user-direction, the job has been aborted.
27: --------------------------------------------------------------------------
27: --------------------------------------------------------------------------
27: mpirun noticed that process rank 0 with PID 33417 on node dlt03 exited on signal 6 (Aborted).
27: --------------------------------------------------------------------------
27/43 Test #27: NlpSparse2_5 ......................***Failed    9.74 sec
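
For anyone reading the traces above: frames 8 and 9 carry mangled C++ symbols, and demangling them shows which Ginkgo call was active when the CudaError was thrown. A minimal sketch, assuming GCC or Clang (it relies on abi::__cxa_demangle from <cxxabi.h>); the helper name demangle is hypothetical and not part of HiOp or Ginkgo:

```cpp
#include <cxxabi.h>  // abi::__cxa_demangle (GCC/Clang, Itanium C++ ABI)

#include <cstdlib>   // std::free
#include <memory>
#include <string>

// Hypothetical helper (not part of HiOp or Ginkgo): returns the demangled
// form of an Itanium-ABI symbol, or the input unchanged if demangling fails.
std::string demangle(const std::string& mangled)
{
    int status = 0;
    // __cxa_demangle allocates its result with malloc, so it must be
    // released with free; the unique_ptr takes care of that.
    std::unique_ptr<char, void (*)(void*)> out(
        abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status),
        std::free);
    return (status == 0 && out) ? std::string(out.get()) : mangled;
}
```

Passing _ZNK3gko8Executor9copy_fromIdEEvPKS0_mPKT_PS4_ (frame 8) should give a gko::Executor::copy_from<double> signature, and _ZN3gko5arrayIdEaSERKS1_ (frame 9) should give the gko::array<double> copy assignment operator. That suggests the exception is raised while a gko::array<double> is being copy-assigned, which is consistent with the raw_copy_to message.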

Tagging relevant developers from offline discussion: @pelesh @nkoukpaizan @cnpetra @fritzgoebel @nychiang

@pelesh
Collaborator

pelesh commented Sep 2, 2022

CC @maksud

@pelesh
Collaborator

pelesh commented Sep 27, 2022

In PR #548, the tests are still failing on Marianas, but now with a different error message (PNNL Pipeline):

NlpSparse1_6 logs:
      Start 22: NlpSparse1_6

22: Test command: /share/apps/openmpi/4.1.0/gcc/10.2.0/bin/mpirun "-n" "1" "/people/svcexasgd/gitlab/102551/build/src/Drivers/Sparse/NlpSparseEx1.exe" "500" "-ginkgo_cuda" "-selfcheck"
22: Test timeout computed to be: 10000000
22: [1664307938.160667] [dlt02:239553:0]       ib_iface.c:964  UCX  ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory
22: [dlt02.local:239553] pml_ucx.c:273  Error: Failed to create UCP worker
22: ===============
22: Hiop SOLVER
22: ===============
22: Using 1 MPI ranks.
22: ---------------
22: Problem Summary
22: ---------------
22: Total number of variables: 500
22:      lower/upper/lower_and_upper bounds: 499 / 1 / 1
22: Total number of equality constraints: 1
22: Total number of inequality constraints: 498
22:      lower/upper/lower_and_upper bounds: 498 / 497 / 497
22: LSQ linear solver --- KKT_SPARSE_XDYcYd linsys: MA57 size 1497 cons 499 nnz 3991 (option 'duals_init_linear_solver_sparse' 'auto')
22: iter    objective     inf_pr     inf_du   lg(mu)  alpha_du   alpha_pr linesrch
22:    0  7.6705009e+00 9.980e+00  1.118e+00  -1.00  0.000e+00  0.000e+00  -(-)
22: Setting up Ginkgo solver ... 
22: terminate called after throwing an instance of 'std::length_error'
22:   what():  cannot create std::vector larger than max_size()
22: [dlt02:239553] *** Process received signal ***
22: [dlt02:239553] Signal: Aborted (6)
22: [dlt02:239553] Signal code:  (-6)
22: [dlt02:239553] [ 0] /usr/lib64/libpthread.so.0(+0xf630)[0x7fed38db6630]
22: [dlt02:239553] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x7fed28f0c387]
22: [dlt02:239553] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x7fed28f0da78]
22: [dlt02:239553] [ 3] /share/apps/gcc/10.2.0/lib64/libstdc++.so.6(+0x9934c)[0x7fed2985734c]
22: [dlt02:239553] [ 4] /share/apps/gcc/10.2.0/lib64/libstdc++.so.6(+0xa4656)[0x7fed29862656]
22: [dlt02:239553] [ 5] /share/apps/gcc/10.2.0/lib64/libstdc++.so.6(+0xa46c1)[0x7fed298626c1]
22: [dlt02:239553] [ 6] /share/apps/gcc/10.2.0/lib64/libstdc++.so.6(+0xa4955)[0x7fed29862955]
22: [dlt02:239553] [ 7] /share/apps/gcc/10.2.0/lib64/libstdc++.so.6(_ZSt20__throw_length_errorPKc+0x3d)[0x7fed29859bfa]
22: [dlt02:239553] [ 8] /qfs/projects/exasgd/src/cameron/spack/opt/spack/linux-centos7-zen2/gcc-10.2.0/ginkgo-glu_experimental-dbmokiqc3tlyvnwehe546lb25lrnuaod/lib64/libglu.so(_ZN15Symbolic_Matrix7fill_inEPjS0_+0x800)[0x7fed29b91070]
22: [dlt02:239553] [ 9] /qfs/projects/exasgd/src/cameron/spack/opt/spack/linux-centos7-zen2/gcc-10.2.0/ginkgo-glu_experimental-dbmokiqc3tlyvnwehe546lb25lrnuaod/lib64/libginkgo.so.1.5.0(_ZN3gko12experimental13factorization3GluIdiE15ReusableFactory22symbolic_factorizationEPKNS_5LinOpE+0x43c)[0x7fed2f7c321c]
22: [dlt02:239553] [10] /people/svcexasgd/gitlab/102551/build/src/Drivers/Sparse/NlpSparseEx1.exe(_ZN3gko12experimental13factorization3GluIdiE15ReusableFactoryC1ESt10shared_ptrIKNS_8ExecutorEEPKNS_5LinOpEPKNS_7reorder14ReorderingBaseERKNS3_25ReusableFactoryParametersE+0x23b)[0x6407ab]
22: [dlt02:239553] [11] /people/svcexasgd/gitlab/102551/build/src/Drivers/Sparse/NlpSparseEx1.exe(_ZNK3gko12experimental13factorization3GluIdiE25ReusableFactoryParameters2onESt10shared_ptrIKNS_8ExecutorEEPKNS_5LinOpEPKNS_7reorder14ReorderingBaseE+0x6f)[0x63a1c1]
22: [dlt02:239553] [12] /people/svcexasgd/gitlab/102551/build/src/Drivers/Sparse/NlpSparseEx1.exe[0x62be3a]
22: [dlt02:239553] [13] /people/svcexasgd/gitlab/102551/build/src/Drivers/Sparse/NlpSparseEx1.exe[0x62ca01]
22: [dlt02:239553] [14] /people/svcexasgd/gitlab/102551/build/src/Drivers/Sparse/NlpSparseEx1.exe[0x62cc33]
22: [dlt02:239553] [15] /people/svcexasgd/gitlab/102551/build/src/Drivers/Sparse/NlpSparseEx1.exe[0x5ca30f]
22: [dlt02:239553] [16] /people/svcexasgd/gitlab/102551/build/src/Drivers/Sparse/NlpSparseEx1.exe[0x5ca5a2]
22: [dlt02:239553] [17] /people/svcexasgd/gitlab/102551/build/src/Drivers/Sparse/NlpSparseEx1.exe[0x5cc989]
22: [dlt02:239553] [18] /people/svcexasgd/gitlab/102551/build/src/Drivers/Sparse/NlpSparseEx1.exe[0x5bdc49]
22: [dlt02:239553] [19] /people/svcexasgd/gitlab/102551/build/src/Drivers/Sparse/NlpSparseEx1.exe[0x4d7d30]
22: [dlt02:239553] [20] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fed28ef8555]
22: [dlt02:239553] [21] /people/svcexasgd/gitlab/102551/build/src/Drivers/Sparse/NlpSparseEx1.exe[0x4d5307]
22: [dlt02:239553] *** End of error message ***
22: --------------------------------------------------------------------------
22: Primary job  terminated normally, but 1 process returned
22: a non-zero exit code. Per user-direction, the job has been aborted.
22: --------------------------------------------------------------------------
22: --------------------------------------------------------------------------
22: mpirun noticed that process rank 0 with PID 239553 on node dlt02 exited on signal 6 (Aborted).
22: --------------------------------------------------------------------------
22/43 Test #22: NlpSparse1_6 ......................***Failed   15.30 sec
NlpSparse2_5 logs:
test 27
      Start 27: NlpSparse2_5

27: Test command: /share/apps/openmpi/4.1.0/gcc/10.2.0/bin/mpirun "-n" "1" "/people/svcexasgd/gitlab/102551/build/src/Drivers/Sparse/NlpSparseEx2.exe" "500" "-ginkgo_cuda" "-inertiafree" "-selfcheck"
27: Test timeout computed to be: 10000000
27: [1664307956.157689] [dlt02:239651:0]       ib_iface.c:964  UCX  ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory
27: [dlt02.local:239651] pml_ucx.c:273  Error: Failed to create UCP worker
27: ===============
27: Hiop SOLVER
27: ===============
27: Using 1 MPI ranks.
27: ---------------
27: Problem Summary
27: ---------------
27: Total number of variables: 500
27:      lower/upper/lower_and_upper bounds: 499 / 1 / 1
27: Total number of equality constraints: 2
27: Total number of inequality constraints: 499
27:      lower/upper/lower_and_upper bounds: 498 / 498 / 497
27: iter    objective     inf_pr     inf_du   lg(mu)  alpha_du   alpha_pr linesrch
27:    0  6.4379656e+01 9.980e+00  1.010e+00   0.00  0.000e+00  0.000e+00  -(-)
27: Setting up Ginkgo solver ... 
27: terminate called after throwing an instance of 'std::length_error'
27:   what():  cannot create std::vector larger than max_size()
27: [dlt02:239651] *** Process received signal ***
27: [dlt02:239651] Signal: Aborted (6)
27: [dlt02:239651] Signal code:  (-6)
27: [dlt02:239651] [ 0] /usr/lib64/libpthread.so.0(+0xf630)[0x7efec26e2630]
27: [dlt02:239651] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x7efeb2838387]
27: [dlt02:239651] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x7efeb2839a78]
27: [dlt02:239651] [ 3] /share/apps/gcc/10.2.0/lib64/libstdc++.so.6(+0x9934c)[0x7efeb318334c]
27: [dlt02:239651] [ 4] /share/apps/gcc/10.2.0/lib64/libstdc++.so.6(+0xa4656)[0x7efeb318e656]
27: [dlt02:239651] [ 5] /share/apps/gcc/10.2.0/lib64/libstdc++.so.6(+0xa46c1)[0x7efeb318e6c1]
27: [dlt02:239651] [ 6] /share/apps/gcc/10.2.0/lib64/libstdc++.so.6(+0xa4955)[0x7efeb318e955]
27: [dlt02:239651] [ 7] /share/apps/gcc/10.2.0/lib64/libstdc++.so.6(_ZSt20__throw_length_errorPKc+0x3d)[0x7efeb3185bfa]
27: [dlt02:239651] [ 8] /qfs/projects/exasgd/src/cameron/spack/opt/spack/linux-centos7-zen2/gcc-10.2.0/ginkgo-glu_experimental-dbmokiqc3tlyvnwehe546lb25lrnuaod/lib64/libglu.so(_ZN15Symbolic_Matrix7fill_inEPjS0_+0x800)[0x7efeb34bd070]
27: [dlt02:239651] [ 9] /qfs/projects/exasgd/src/cameron/spack/opt/spack/linux-centos7-zen2/gcc-10.2.0/ginkgo-glu_experimental-dbmokiqc3tlyvnwehe546lb25lrnuaod/lib64/libginkgo.so.1.5.0(_ZN3gko12experimental13factorization3GluIdiE15ReusableFactory22symbolic_factorizationEPKNS_5LinOpE+0x43c)[0x7efeb90ef21c]
27: [dlt02:239651] [10] /people/svcexasgd/gitlab/102551/build/src/Drivers/Sparse/NlpSparseEx2.exe(_ZN3gko12experimental13factorization3GluIdiE15ReusableFactoryC1ESt10shared_ptrIKNS_8ExecutorEEPKNS_5LinOpEPKNS_7reorder14ReorderingBaseERKNS3_25ReusableFactoryParametersE+0x23b)[0x640f0f]
27: [dlt02:239651] [11] /people/svcexasgd/gitlab/102551/build/src/Drivers/Sparse/NlpSparseEx2.exe(_ZNK3gko12experimental13factorization3GluIdiE25ReusableFactoryParameters2onESt10shared_ptrIKNS_8ExecutorEEPKNS_5LinOpEPKNS_7reorder14ReorderingBaseE+0x6f)[0x63a925]
27: [dlt02:239651] [12] /people/svcexasgd/gitlab/102551/build/src/Drivers/Sparse/NlpSparseEx2.exe[0x62c59e]
27: [dlt02:239651] [13] /people/svcexasgd/gitlab/102551/build/src/Drivers/Sparse/NlpSparseEx2.exe[0x62d165]
27: [dlt02:239651] [14] /people/svcexasgd/gitlab/102551/build/src/Drivers/Sparse/NlpSparseEx2.exe[0x62d397]
27: [dlt02:239651] [15] /people/svcexasgd/gitlab/102551/build/src/Drivers/Sparse/NlpSparseEx2.exe[0x5caa73]
27: [dlt02:239651] [16] /people/svcexasgd/gitlab/102551/build/src/Drivers/Sparse/NlpSparseEx2.exe[0x5cad06]
27: [dlt02:239651] [17] /people/svcexasgd/gitlab/102551/build/src/Drivers/Sparse/NlpSparseEx2.exe[0x5cd0ed]
27: [dlt02:239651] [18] /people/svcexasgd/gitlab/102551/build/src/Drivers/Sparse/NlpSparseEx2.exe[0x5be3ad]
27: [dlt02:239651] [19] /people/svcexasgd/gitlab/102551/build/src/Drivers/Sparse/NlpSparseEx2.exe[0x4d80bd]
27: [dlt02:239651] [20] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7efeb2824555]
27: [dlt02:239651] [21] /people/svcexasgd/gitlab/102551/build/src/Drivers/Sparse/NlpSparseEx2.exe[0x4d52b7]
27: [dlt02:239651] *** End of error message ***
27: --------------------------------------------------------------------------
27: Primary job  terminated normally, but 1 process returned
27: a non-zero exit code. Per user-direction, the job has been aborted.
27: --------------------------------------------------------------------------
27: --------------------------------------------------------------------------
27: mpirun noticed that process rank 0 with PID 239651 on node dlt02 exited on signal 6 (Aborted).
27: --------------------------------------------------------------------------
27/43 Test #27: NlpSparse2_5 ......................***Failed   14.67 sec
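
For context on the new failure: std::length_error with this message is what libstdc++ raises when a std::vector is constructed with more elements than max_size(), and the trace places the throw inside Symbolic_Matrix::fill_in(unsigned int*, unsigned int*) in libglu.so. One common way to get there is a size that was computed in unsigned arithmetic and wrapped around. The sketch below only illustrates that general mechanism; it is an assumption, not a confirmed diagnosis of GLU, and both function names are hypothetical:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper, not GLU code: allocates one slot per entry in the
// half-open range [first, last). If last < first, the unsigned subtraction
// wraps around to a huge value instead of going negative.
std::vector<unsigned> make_workspace(std::size_t first, std::size_t last)
{
    const std::size_t count = last - first;  // wraps when last < first
    // The constructor throws std::length_error if count > max_size().
    return std::vector<unsigned>(count);
}

// Returns the exception message, or an empty string if allocation succeeded.
std::string try_make_workspace(std::size_t first, std::size_t last)
{
    try {
        make_workspace(first, last);
        return "";
    } catch (const std::length_error& e) {
        return e.what();
    }
}
```

Calling try_make_workspace(1, 0) requests a wrapped-around element count and returns the length_error message, while try_make_workspace(0, 4) returns an empty string. If something similar happens in fill_in, checking the sizes derived from its two unsigned int* inputs before the vector is built would be a reasonable first step.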

@cnpetra
Collaborator

cnpetra commented Oct 21, 2022

Are the tests still failing?

@cameronrutherford
Collaborator Author

Yes, see #548 for the current state of the work.

@fritzgoebel

Is there a way for me to get access to Marianas? I have not been able to reproduce this so far.

@pelesh
Collaborator

pelesh commented Nov 30, 2022

Fixed in #548 and #551.

pelesh closed this as completed on Nov 30, 2022