ShenEOS run seg-faults on single or distributed runs #549

Closed
vamatya opened this issue Sep 24, 2012 · 13 comments

@vamatya
Member

commented Sep 24, 2012

When running ShenEOS on a single node or on 8 nodes, the application segfaults.

HPX_VERSION: Updated forked version from the GitHub repo (fresh build/install directory).

HPX_LOG:

[03:58:47]:vamatya@deneb:/home/vamatya/packages/inst/hpx/bin:0:$ ./sheneos_test --file /home/vamatya/HShenEOS_rho440_temp360_ye260_version2.0_20120427.h5 -Y 128 -T 256 -R 256 --num-partitions 1 --num-workers 4
Seed: 1348459191
Partition 0: {0000000100ff0001, 0000000000000005}
[stack-trace]: 19 frames:
0x7f91cf09ae7c  : hpx::detail::backtrace() + 0x5c in /home/vamatya/packages/inst/hpx/lib/hpx/libhpx.so.1
0x7f91cf0be25d  : hpx::termination_handler(int) + 0x2d in /home/vamatya/packages/inst/hpx/lib/hpx/libhpx.so.1
0x7f91cb865030  : ??? + 0x7f91cb865030 in /lib/x86_64-linux-gnu/libpthread.so.0
0x7f91ce5c419b  : ??? + 0x7f91ce5c419b in /opt/hdf5/1.8.7-threadsafe/lib/libhdf5.so.7
0x7f91ce5c4c85  : ??? + 0x7f91ce5c4c85 in /opt/hdf5/1.8.7-threadsafe/lib/libhdf5.so.7
0x7f91ce5cbecb  : H5S_select_get_seq_list + 0xdf in /opt/hdf5/1.8.7-threadsafe/lib/libhdf5.so.7
0x7f91ce40ec84  : ??? + 0x7f91ce40ec84 in /opt/hdf5/1.8.7-threadsafe/lib/libhdf5.so.7
0x7f91ce40f1a3  : H5D_select_read + 0xd5 in /opt/hdf5/1.8.7-threadsafe/lib/libhdf5.so.7
0x7f91ce3f27e6  : H5D_contig_read + 0x174 in /opt/hdf5/1.8.7-threadsafe/lib/libhdf5.so.7
0x7f91ce4069dd  : ??? + 0x7f91ce4069dd in /opt/hdf5/1.8.7-threadsafe/lib/libhdf5.so.7
0x7f91ce40578a  : H5Dread + 0x632 in /opt/hdf5/1.8.7-threadsafe/lib/libhdf5.so.7
0x7f91ceaac8bf  : H5::DataSet::read(void*, H5::DataType const&, H5::DataSpace const&, H5::DataSpace const&, H5::DSetMemXferPropList const&) const + 0x9f in /opt/hdf5/1.8.7-threadsafe/lib/libhdf5_cpp.so.7
0x7f91cfa671fd  : ??? + 0x7f91cfa671fd in /home/vamatya/packages/inst/hpx/lib/hpx/libsheneos.so.1
0x7f91cfaa7738  : sheneos::server::partition3d::init(std::string const&, sheneos::dimension const&, sheneos::dimension const&, sheneos::dimension const&) + 0x98 in /home/vamatya/packages/inst/hpx/lib/hpx/libsheneos.so.1
0x7f91cfac8cdd  : ??? + 0x7f91cfac8cdd in /home/vamatya/packages/inst/hpx/lib/hpx/libsheneos.so.1
0x7f91cfb0c1fb  : ??? + 0x7f91cfb0c1fb in /home/vamatya/packages/inst/hpx/lib/hpx/libsheneos.so.1
0x7f91cf33659b  : ??? + 0x7f91cf33659b in /home/vamatya/packages/inst/hpx/lib/hpx/libhpx.so.1
0x7f91cf3368a9  : ??? + 0x7f91cf3368a9 in /home/vamatya/packages/inst/hpx/lib/hpx/libhpx.so.1
[what]: Segmentation fault
[version]: V1.0.0-trunk (AGAS: V2.1), Git: unknown
[boost]: V1.48.0
[build-type]: release
[date]: Sep 24 2012 03:20:55
[platform]: linux
[compiler]: GNU C++ version 4.6.2
[stdlib]: GNU libstdc++ version 20120120
Aborted

Q: How do I upload files to issues on GitHub?

@brycelelbach

Member

commented Sep 24, 2012

Known problem, happens with certain numbers of partitions. If you use 32 partitions, this goes away. This has been a problem since Maciek first reported it when he was doing benchmarks for some paper ages ago.

@brycelelbach

Member

commented Sep 24, 2012

Other numbers of partitions work; do a parameter sweep across 8-64 partitions or so.
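A sweep like the one suggested here could be scripted roughly as follows. This is only a sketch: the option values are copied from the command line in the original report, while the power-of-two partition counts, the `DRY_RUN` switch, and the `EOS_FILE` variable are assumptions made for illustration. By default it only prints the commands it would run.

```bash
#!/bin/bash
# Sketch of a partition-count sweep for sheneos_test.
# APP_PATH and EOS_FILE default to the values from the original report;
# DRY_RUN and the list of partition counts are illustrative assumptions.
APP_PATH=${APP_PATH:-./sheneos_test}
EOS_FILE=${EOS_FILE:-/home/vamatya/HShenEOS_rho440_temp360_ye260_version2.0_20120427.h5}
DRY_RUN=${DRY_RUN:-1}

sweep_partitions() {
    local n
    for n in 8 16 32 64; do
        local cmd=("$APP_PATH" --file "$EOS_FILE" -Y 128 -T 256 -R 256
                   --num-partitions "$n" --num-workers 4)
        if [ "$DRY_RUN" = 1 ]; then
            # Only print the command line that would be executed.
            echo "${cmd[*]}"
        else
            # Run the test and record the exit status so that crashing
            # partition counts stand out in the output.
            "${cmd[@]}"
            echo "partitions=$n exit=$?"
        fi
    done
}

sweep_partitions
```

Setting `DRY_RUN=0` runs the commands for real; as noted later in the thread, that should be done on a compute node rather than the head node.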

@brycelelbach

Member

commented Sep 24, 2012

Actually, that particular seg fault looks like an issue in HDF5. You're using a good version of HDF5 though (that's the one we've been using on hermione for ages).

Please use the ShenEOS tables in /opt/hdf5.

Also, please do not run that on the head node.

@vamatya

Member Author

commented Sep 24, 2012

I don't understand. The same test configuration (8 nodes, 1 partition per node) ran a week or so ago without any segfault.
ShenEOS table: I thought you made a symbolic link to /opt/hdf5 from my folder.

@brycelelbach

Member

commented Sep 24, 2012

Oh, right.

Dunno, then. Attach debug logs. Could be the guard page issue. Try setting HPX_THREAD_GUARD_PAGE=OFF in CMake.

@hkaiser

Member

commented Sep 24, 2012

It could be a stack overflow. We have been chasing this problem for a while but never managed to reproduce it reliably. Thanks, guard pages!

@brycelelbach

Member

commented Sep 24, 2012

Good point, ignore my comment about trying without guard pages.


@vamatya

Member Author

commented Sep 25, 2012

Seed: 1348589310
Partition 0: {0000000100ff0001, 0000000000000005}
Partition 1: {0000000200ff0001, 0000000000000002}
Created interpolator: 2.58098 [s]
[stack-trace]: 17 frames:
0x7fed8bc34e7c : hpx::detail::backtrace() + 0x5c in /home/vamatya/packages/inst/hpx/lib/hpx/libhpx.so.1
0x7fed8bc5825d : hpx::termination_handler(int) + 0x2d in /home/vamatya/packages/inst/hpx/lib/hpx/libhpx.so.1
0x7fed883ff030 : ??? + 0x7fed883ff030 in /lib/x86_64-linux-gnu/libpthread.so.0
0x7fed8afa87bb : ??? + 0x7fed8afa87bb in /opt/hdf5/1.8.7-threadsafe/lib/libhdf5.so.7
0x7fed8afa91a3 : H5D_select_read + 0xd5 in /opt/hdf5/1.8.7-threadsafe/lib/libhdf5.so.7
0x7fed8af8c7e6 : H5D_contig_read + 0x174 in /opt/hdf5/1.8.7-threadsafe/lib/libhdf5.so.7
0x7fed8afa09dd : ??? + 0x7fed8afa09dd in /opt/hdf5/1.8.7-threadsafe/lib/libhdf5.so.7
0x7fed8af9f78a : H5Dread + 0x632 in /opt/hdf5/1.8.7-threadsafe/lib/libhdf5.so.7
0x7fed8b6468bf : H5::DataSet::read(void*, H5::DataType const&, H5::DataSpace const&, H5::DataSpace const&, H5::DSetMemXferPropList const&) const + 0x9f in /opt/hdf5/1.8.7-threadsafe/lib/libhdf5_cpp.so.7
0x7fed8c600cda : ??? + 0x7fed8c600cda in /home/vamatya/packages/inst/hpx/lib/hpx/libsheneos.so.1
0x7fed8c66e218 : sheneos::interpolator::connect(std::string) + 0x2d8 in /home/vamatya/packages/inst/hpx/lib/hpx/libsheneos.so.1
0x41e923 : ??? + 0x41e923 in /home/vamatya/packages/inst/hpx/bin/sheneos_test
0x474888 : ??? + 0x474888 in /home/vamatya/packages/inst/hpx/bin/sheneos_test
0x474a25 : ??? + 0x474a25 in /home/vamatya/packages/inst/hpx/bin/sheneos_test
0x7fed8bed059b : ??? + 0x7fed8bed059b in /home/vamatya/packages/inst/hpx/lib/hpx/libhpx.so.1
0x7fed8bed08a9 : ??? + 0x7fed8bed08a9 in /home/vamatya/packages/inst/hpx/lib/hpx/libhpx.so.1
[what]: Segmentation fault
[version]: V1.0.0-trunk (AGAS: V2.1), Git: unknown

[date]: Sep 24 2012 03:20:55

[compiler]: GNU C++ version 4.6.2
[stdlib]: GNU libstdc++ version 20120120
pbsdsh: task 1 exit status 262
Finished bulk task 0: 7.18904 [s]
Finished bulk task 1: 7.61457 [s]
Finished bulk task 2: 7.80705 [s]
Finished bulk task 3: 7.89825 [s]
=>> PBS: job killed: walltime 1839 exceeded limit 1800

Comment: This is for two localities, with a stack size of 0x400000. I got the same error with a stack size of 0x100000.

@hkaiser

Member

commented Sep 25, 2012

What are the command lines you used?

@vamatya

Member Author

commented Sep 25, 2012

#!/bin/bash

#PBS -N sheneos_test_8d38mil_2_2_weak
#PBS -l nodes=lyra01:ppn=4+beowulf00:ppn=4
#PBS -l walltime=030:00
#PBS -V
#PBS -j oe

APP_PATH=/home/vamatya/packages/inst/hpx/bin/sheneos_test

APP_OPTIONS="
--file /home/vamatya/HShenEOS_rho440_temp360_ye260_version2.0_20120427.h5
-Y 128
-T 256
-R 256
--num-partitions 2
--num-workers 4
"

HPX_OPTIONS="
-Ihpx.default_stack_size=0x4000000

"

pbsdsh -u $APP_PATH --hpx:nodes=`cat $PBS_NODEFILE` $APP_OPTIONS $HPX_OPTIONS
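The stack sizes quoted in this thread are hexadecimal byte counts, which are easy to misread by a digit. As a quick check, the sketch below converts them to MiB using plain bash arithmetic. The helper name `to_mib` is made up for illustration. Note that the comment above mentions 0x400000 and 0x100000, while the job script passes 0x4000000.

```bash
#!/bin/bash
# Convert the hexadecimal stack sizes quoted in this thread to MiB.
# to_mib is a hypothetical helper; bash arithmetic accepts the 0x prefix.
to_mib() {
    echo $(( $1 / 1024 / 1024 ))
}

for size in 0x100000 0x400000 0x4000000; do
    echo "$size = $(to_mib "$size") MiB"
done
# prints:
# 0x100000 = 1 MiB
# 0x400000 = 4 MiB
# 0x4000000 = 64 MiB
```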

@ghost ghost assigned hkaiser Sep 25, 2012

@brycelelbach

Member

commented Oct 2, 2012

Culprit was stack size. Closing.

@hkaiser

Member

commented Oct 2, 2012

We have not quite figured out what needs to be done to fix this reliably. Reopening.

@hkaiser hkaiser reopened this Oct 2, 2012

@hkaiser

Member

commented Oct 3, 2012

That's resolved by db79854

@hkaiser hkaiser closed this Oct 3, 2012
