Tests working on gadi #36

Open
nichannah opened this issue Jan 9, 2020 · 7 comments
@nichannah commented Jan 9, 2020

Do the same as COSIMA/access-om2#182 for the libaccessom2 tests.

@aekiss commented Feb 5, 2020

@nichannah I get a segfault running the tests with commit a9e2883:

export LIBACCESSOM2_DIR=$(pwd)
module load openmpi
cd tests/
./copy_test_data_from_gadi.sh
cd JRA55_IAF
rm -rf log ; mkdir log ; rm -f accessom2_restart_datetime.nml ; cp ../test_data/i2o.nc ./ ; cp ../test_data/o2i.nc ./
mpirun -np 1 $LIBACCESSOM2_DIR/build/bin/yatm.exe : -np 1 $LIBACCESSOM2_DIR/build/bin/ice_stub.exe : -np 1 $LIBACCESSOM2_DIR/build/bin/ocean_stub.exe

yields

 YATM_COMMIT_HASH=575fb04771e5442e19654c27b183e92d8b205f3f
 OCEAN_STUB_COMMIT_HASH=575fb04771e5442e19654c27b183e92d8b205f3f
 ICE_STUB_COMMIT_HASH=575fb04771e5442e19654c27b183e92d8b205f3f
 mom5xx: LIBACCESSOM2_COMMIT_HASH=575fb04771e5442e19654c27b183e92d8b205f3f
 cicexx: LIBACCESSOM2_COMMIT_HASH=575fb04771e5442e19654c27b183e92d8b205f3f
 matmxx: LIBACCESSOM2_COMMIT_HASH=575fb04771e5442e19654c27b183e92d8b205f3f
[gadi-login-04:6441 :0:6441] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7ffdb447bbc0)
==== backtrace (tid:   6441) ====
 0 0x0000000000012d80 .annobin_sigaction.c()  sigaction.c:0
 1 0x000000000069f46b m_attrvect_mp_sort__.V()  /home/156/aek156/github/COSIMA/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/Linux/build/lib/mctdir/mct/m_AttrVect.F90:3455
 2 0x000000000062f1f0 sort_()  /home/156/aek156/github/COSIMA/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/Linux/build/lib/mctdir/mct/m_SparseMatrix.F90:2637
 3 0x000000000062f1f0 sortpermute_()  /home/156/aek156/github/COSIMA/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/Linux/build/lib/mctdir/mct/m_SparseMatrix.F90:2750
 4 0x00000000006341b5 m_sparsematrixtomaps_mp_sparsematrixtoxglobalsegmap__()  /home/156/aek156/github/COSIMA/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/Linux/build/lib/mctdir/mct/m_SparseMatrixToMaps.F90:150
 5 0x00000000006338ab m_sparsematrixplus_mp_initdistributed__()  /home/156/aek156/github/COSIMA/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/Linux/build/lib/mctdir/mct/m_SparseMatrixPlus.F90:516
 6 0x00000000005956c0 mod_oasis_coupler_mp_oasis_coupler_setup_.V()  /home/156/aek156/github/COSIMA/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/lib/psmile/src/mod_oasis_coupler.F90:943
 7 0x000000000044b023 mod_oasis_method_mp_oasis_enddef_.V()  /home/156/aek156/github/COSIMA/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/lib/psmile/src/mod_oasis_method.F90:741
 8 0x000000000041f3d1 coupler_mod_mp_coupler_init_end_()  /home/156/aek156/github/COSIMA/libaccessom2/libcouple/src/coupler.F90:149
 9 0x000000000040e74c MAIN__.V()  /home/156/aek156/github/COSIMA/libaccessom2/ice_stub/src/ice.F90:109
10 0x000000000040ce22 main()  ???:0
11 0x0000000000023813 __libc_start_main()  ???:0
12 0x000000000040cd2e _start()  ???:0
=================================
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
yatm.exe           00000000007F8834  Unknown               Unknown  Unknown
libpthread-2.28.s  00007F363EAA5D80  Unknown               Unknown  Unknown
mca_pml_ucx.so     00007F36251EFE20  Unknown               Unknown  Unknown
mca_pml_ucx.so     00007F36251F10F3  mca_pml_ucx_recv      Unknown  Unknown
libmpi.so.40.20.2  00007F363F0A86CD  MPI_Recv              Unknown  Unknown
libmpi_mpifh.so    00007F363F386A10  pmpi_recv_            Unknown  Unknown
yatm.exe           00000000006F66FD  Unknown               Unknown  Unknown
yatm.exe           000000000066D8DA  Unknown               Unknown  Unknown
yatm.exe           00000000005E5AF2  mod_oasis_coupler        1055  mod_oasis_coupler.F90
yatm.exe           000000000049A8B3  mod_oasis_method_         741  mod_oasis_method.F90
yatm.exe           00000000004418E1  coupler_mod_mp_co         149  coupler.F90
yatm.exe           000000000040EF77  MAIN__.V                  108  atm.F90
yatm.exe           000000000040D5E2  Unknown               Unknown  Unknown
libc-2.28.so       00007F363E4EE813  __libc_start_main     Unknown  Unknown
yatm.exe           000000000040D4EE  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
ocean_stub.exe     0000000000797A24  Unknown               Unknown  Unknown
libpthread-2.28.s  00007F2CC137BD80  Unknown               Unknown  Unknown
mca_pml_ucx.so     00007F2CAC0DF0F3  mca_pml_ucx_recv      Unknown  Unknown
libmpi.so.40.20.2  00007F2CC197E6CD  MPI_Recv              Unknown  Unknown
libmpi_mpifh.so    00007F2CC1C5CA10  pmpi_recv_            Unknown  Unknown
ocean_stub.exe     00000000006A226D  Unknown               Unknown  Unknown
ocean_stub.exe     000000000061944A  Unknown               Unknown  Unknown
ocean_stub.exe     0000000000591662  mod_oasis_coupler        1055  mod_oasis_coupler.F90
ocean_stub.exe     0000000000446043  mod_oasis_method_         741  mod_oasis_method.F90
ocean_stub.exe     000000000041A3F1  coupler_mod_mp_co         149  coupler.F90
ocean_stub.exe     000000000040E539  MAIN__.V                   78  ocean.F90
ocean_stub.exe     000000000040CE22  Unknown               Unknown  Unknown
libc-2.28.so       00007F2CC0DC4813  __libc_start_main     Unknown  Unknown
ocean_stub.exe     000000000040CD2E  Unknown               Unknown  Unknown
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node gadi-login-04 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
@russfiedler commented Feb 5, 2020

This is the sort of message that I was getting in my ports of CM4 etc. My guess is that a temporary array is being created for aV%iAttr(iIndex(n),:). Try setting -heap-arrays or -heap-arrays 10 when compiling to put the temporary on the heap rather than the stack.
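
A minimal sketch of how that flag might be wired in, assuming a CMake-driven build (the FCFLAGS / CMAKE_Fortran_FLAGS names below are placeholders for illustration, not necessarily the project's actual build variables):

export FCFLAGS="-heap-arrays 10"                      # Intel Fortran flag: temporaries larger than 10 KB go on the heap
cmake -DCMAKE_Fortran_FLAGS="$FCFLAGS" <source-dir>   # placeholder configure step; adapt to the real build scripts
make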

@aekiss commented Feb 5, 2020

Thanks - I just tried -heap-arrays 10 but got the same error

@russfiedler commented Feb 5, 2020

Try just -heap-arrays to put them all on the heap.
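
With the same caveat as above about the placeholder build variables, the unconditional form is sketched below; since the segfault is inside the OASIS3-MCT sort (m_AttrVect.F90), the flag presumably needs to reach that sub-build as well, not just the libaccessom2 sources:

export FCFLAGS="-heap-arrays"                         # no size threshold: all temporaries go on the heap
cmake -DCMAKE_Fortran_FLAGS="$FCFLAGS" <source-dir>   # placeholder configure step, as above
make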

@aekiss commented Feb 5, 2020

I just tried -heap-arrays - still no luck

@russfiedler commented Feb 5, 2020

Seems to be working fine for me on the express queue without having to invoke -heap-arrays.

Hang on, it's just crashed at the end in the ocean stub with a heap of warnings like

[1580886787.462249] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x8433000 was not returned to mpool ucp_am_bufs
[1580886787.462270] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x8435080 was not returned to mpool ucp_am_bufs
[1580886787.462273] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x1460615b8040 was not returned to mpool ucp_am_bufs
[1580886787.462275] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x1460615ba0c0 was not returned to mpool ucp_am_bufs

 0 0x0000000000051959 ucs_fatal_error_message()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucs/../../../src/ucs/debug/assert.c:36
 1 0x0000000000051a36 ucs_fatal_error_format()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucs/../../../src/ucs/debug/assert.c:52
 2 0x00000000000562f0 ucs_mem_region_destroy_internal()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucs/../../../src/ucs/memory/rcache.c:200
 3 0x000000000005c6c6 ucs_class_call_cleanup_chain()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucs/../../../src/ucs/type/class.c:52
 4 0x0000000000056f38 ucs_rcache_destroy()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucs/../../../src/ucs/memory/rcache.c:729
 5 0x00000000000030f2 uct_knem_md_close()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/uct/sm/knem/../../../../../src/uct/sm/knem/knem_md.c:91
 6 0x000000000000f1c9 ucp_free_resources()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucp/../../../src/ucp/core/ucp_context.c:710
 7 0x000000000000f1c9 ucp_cleanup()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucp/../../../src/ucp/core/ucp_context.c:1266
 8 0x0000000000005bcc mca_pml_ucx_close()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/ompi/mca/pml/ucx/../../../../../ompi/mca/pml/ucx/pml_ucx.c:247
 9 0x0000000000007909 mca_pml_ucx_component_close()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/ompi/mca/pml/ucx/../../../../../ompi/mca/pml/ucx/pml_ucx_component.c:82
10 0x00000000000582b9 mca_base_component_close()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/opal/mca/base/../../../../opal/mca/base/mca_base_components_close.c:53
11 0x0000000000058345 mca_base_components_close()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/opal/mca/base/../../../../opal/mca/base/mca_base_components_close.c:85
12 0x0000000000058345 mca_base_components_close()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/opal/mca/base/../../../../opal/mca/base/mca_base_components_close.c:86
13 0x00000000000621da mca_base_framework_close()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/opal/mca/base/../../../../opal/mca/base/mca_base_framework.c:216
14 0x000000000004f479 ompi_mpi_finalize()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/ompi/../../ompi/runtime/ompi_mpi_finalize.c:363
15 0x000000000004ac29 ompi_finalize_f()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/intel-opt/ompi/mpi/fortran/mpif-h/profile/pfinalize_f.c:71
16 0x0000000000418cb0 accessom2_mod_mp_accessom2_deinit_()  /scratch/p93/raf599/cosima/gaditest/libaccessom2/libcouple/src/accessom2.F90:839
17 0x000000000040ec0a MAIN__.V()  /scratch/p93/raf599/cosima/gaditest/libaccessom2/ocean_stub/src/ocean.F90:114
18 0x000000000040ce22 main()  ???:0
19 0x0000000000023813 __libc_start_main()  ???:0
20 0x000000000040cd2e _start()  ???:0

I also found this in the thousands of messages. A warning in rcache.c and a failed assertion which matches the trace.

[1580886787.458225] [gadi-cpu-clx-2901:94690:0] rcache.c:360 UCX WARN knem rcache device: destroying inuse region 0x1c85a20 [0x1d56c00..0x1e29b00] g- rw ref 1 cookie 10351893497382213308 addr 0x1d56c00
[1580886787.458245] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x887e080 was not returned to mpool ucp_am_bufs
[1580886787.458248] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x8880100 was not returned to mpool ucp_am_bufs
[1580886787.458250] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x8882180 was not returned to mpool ucp_am_bufs
[1580886787.458263] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x8884200 was not returned to mpool ucp_am_bufs
[1580886787.458267] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x8886280 was not returned to mpool ucp_am_bufs
[1580886787.458270] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x8888300 was not returned to mpool ucp_am_bufs
[gadi-cpu-clx-2901:94690:0:94690] rcache.c:200 Assertion `region->refcount == 0' failed

@aekiss commented Feb 5, 2020

Interesting. Do you think that's a related problem or something else?
