strange errors appear when requesting performance counters on multiple nodes #479

Closed
maeneas opened this issue Aug 8, 2012 · 7 comments

Contributor

maeneas commented Aug 8, 2012

I see crashes 90% of the time with sheneos when requesting counters on two nodes or more in a distributed run:

For example:

#PBS -l nodes=4:ppn=8,pmem=2gb,walltime=00:10:00,qos=test

 pbsdsh -v -u /fslhome/mwa2/compute/shen_updated/sheneos_test \
    -Y 40 -T 40 -R 40 --num-workers 1 --num-partitions 4 \
    --file /fslhome/mwa2/compute/shen_updated/HShenEOS_rho440_temp360_ye260_version2.0_20120427.h5  \
    --hpx:nodes=`cat $PBS_NODEFILE` --hpx:debug-clp \
    --hpx:print-counter=/messages{locality#0/total}/count/sent \
    --hpx:print-counter=/parcels{locality#0/total}/count/sent \
    --hpx:print-counter=/messages{locality#1/total}/count/sent \
    --hpx:print-counter=/parcels{locality#1/total}/count/sent

reproduced on marylou
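
The repeated `--hpx:print-counter` flags in the command above follow a regular pattern: one `messages` and one `parcels` counter per locality. A minimal sketch of generating them in a loop, assuming a hypothetical `num_localities` variable (not part of the original report) set to the number of localities to query:

```shell
#!/bin/sh
# Hypothetical helper: build one --hpx:print-counter flag per counter
# kind and locality, matching the counter names used in the report.
# num_localities is an assumed variable; set it to the localities to query.
num_localities=2

counter_args=""
i=0
while [ "$i" -lt "$num_localities" ]; do
    for kind in messages parcels; do
        # Double quotes keep '{', '}' and '#' literal in the counter name.
        counter_args="$counter_args --hpx:print-counter=/${kind}{locality#${i}/total}/count/sent"
    done
    i=$((i + 1))
done

# Print one flag per line. $counter_args is deliberately unquoted here
# so that it splits into separate words.
printf '%s\n' $counter_args
```

The resulting `$counter_args` could then be appended, unquoted, to the `pbsdsh` command line in place of the hand-written flags. This only changes how the flags are produced, not what is requested from HPX.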

@ghost ghost assigned hkaiser Aug 8, 2012
Contributor Author

maeneas commented Aug 8, 2012

Here's one of the error messages I get when running the above:
Received Segmentation fault, 3 frames:
0x2b66aa7e4837 : hpx::detail::backtrace() + 0x77 in /fslhome/mwa2/hpxgit/lib/hpx/libhpx.so.1
0x2b66aa888a98 : hpx::termination_handler(int) + 0x18 in /fslhome/mwa2/hpxgit/lib/hpx/libhpx.so.1
0x3d21a0e7c0 : ??? + 0x3d21a0e7c0 in /lib64/libpthread.so.0

I rarely get the same error message twice, though.

Member

hkaiser commented Aug 8, 2012

Hmmm, that looks close to what is reported in #480. I'll investigate this more closely ASAP.

Contributor Author

maeneas commented Aug 8, 2012

Here's another example of an error I get running the above:

Received Segmentation fault, 13 frames:
0x2b4e54660837 : hpx::detail::backtrace() + 0x77 in /fslhome/mwa2/hpxgit/lib/hpx/libhpx.so.1
0x2b4e54704a98 : hpx::termination_handler(int) + 0x18 in /fslhome/mwa2/hpxgit/lib/hpx/libhpx.so.1
0x3854e0e7c0 : ??? + 0x3854e0e7c0 in /lib64/libpthread.so.0
0x69272b : std::vector<double, std::allocator<double> >::operator=(std::vector<double, std::allocator<double> > const&) + 0x2b in /fslhome/mwa2/compute/shen_updated/sheneos_test
0x2b4e539d301c : hpx::util::detail::vtable::type<sheneos::on_completed_bulk, void ()(hpx::lcos::future<std::vector<std::vector<double, std::allocator >, std::allocator<std::vector<double, std::allocator > > >, std::vector<std::vector<double, std::allocator >, std::allocator<std::vector<double, std::allocator > > > >), void, void>::invoke(void**, hpx::lcos::future<std::vector<std::vector<double, std::allocator >, std::allocator<std::vector<double, std::allocator > > >, std::vector<std::vector<double, std::allocator >, std::allocator<std::vector<double, std::allocator > > > >) + 0x14c in /fslhome/mwa2/hpxgit/lib/hpx/libhpx_component_sheneos.so.1
0x2b4e539bbc24 : _ZN3hpx4lcos6detail11future_dataISt6vectorIS3_IdSaIdEESaIS5_EES7_E8set_dataIS7_EEvOT_ + 0xe4 in /fslhome/mwa2/hpxgit/lib/hpx/libhpx_component_sheneos.so.1
0x6d903b : _ZN3hpx7actions14direct_action1INS_4lcos19base_lco_with_valueISt6vectorIS4_IdSaIdEESaIS6_EES8_EELi1EOS8_XadL_ZNS9_17set_value_nonvirtESA_EENS0_6detail9this_typeEE16execute_functionINS_4util6tuple1IS8_EEEEN5boost6fusion11unused_typeEmOT_ + 0x15b in /fslhome/mwa2/compute/shen_updated/sheneos_test
0x6d9224 : _ZNK3hpx7actions21base_lco_continuationISt6vectorIS2_IdSaIdEESaIS4_EEE13trigger_valueEOS6_ + 0x154 in /fslhome/mwa2/compute/shen_updated/sheneos_test
0x2b4e539b2e8d : _ZNK3hpx7actions6actionIN7sheneos6server11partition3dELi3ESt6vectorIS5_IdSaIdEESaIS7_EENS_4util6tuple2IS5_INS2_13sheneos_coordESaISC_EEjEENS0_14result_action2IS4_S9_Li3ERKSE_jXadL_ZNS4_16interpolate_bulkESI_jEELNS_7threads15thread_priorityE0ENS0_6detail9this_typeEEELSK_0EE37continuation_thread_object_function_2clIS4_RSE_RjSI_jEENSJ_17thread_state_enumEN5boost10shared_ptrINS0_12continuationEEEMT_FS9_T2_T3_EPS4_OT0_OT1_ + 0x8d in /fslhome/mwa2/hpxgit/lib/hpx/libhpx_component_sheneos.so.1
0x2b4e539b30a9 : hpx::util::detail::vtable::type<hpx::util::detail::bound_functor5<hpx::actions::action<sheneos::server::partition3d, 3, std::vector<std::vector<double, std::allocator >, std::allocator<std::vector<double, std::allocator > > >, hpx::util::tuple2<std::vector<sheneos::sheneos_coord, std::allocatorsheneos::sheneos_coord >, unsigned int>, hpx::actions::result_action2<sheneos::server::partition3d, std::vector<std::vector<double, std::allocator >, std::allocator<std::vector<double, std::allocator > > >, 3, std::vector<sheneos::sheneos_coord, std::allocatorsheneos::sheneos_coord > const&, unsigned int, &(sheneos::server::partition3d::interpolate_bulk(std::vector<sheneos::sheneos_coord, std::allocatorsheneos::sheneos_coord > const&, unsigned int)), (hpx::threads::thread_priority)0, hpx::actions::detail::this_type>, (hpx::threads::thread_priority)0>::continuation_thread_object_function_2, boost::shared_ptrhpx::actions::continuation, std::vector<std::vector<double, std::allocator >, std::allocator<std::vector<double, std::allocator > > > (sheneos::server::partition3d::)(std::vector<sheneos::sheneos_coord, std::allocatorsheneos::sheneos_coord > const&, unsigned int), sheneos::server::partition3d, std::vector<sheneos::sheneos_coord, std::allocatorsheneos::sheneos_coord >, unsigned int const>, hpx::threads::thread_state_enum ()(hpx::threads::thread_state_ex_enum), void, void>::invoke(void**, hpx::threads::thread_state_ex_enum) + 0x49 in /fslhome/mwa2/hpxgit/lib/hpx/libhpx_component_sheneos.so.1
0x2b4e54940ac9 : boost::coroutines::detail::coroutine_impl_wrapper<hpx::util::function_nonser<hpx::threads::thread_state_enum ()(hpx::threads::thread_state_ex_enum)>, boost::coroutines::coroutine<hpx::threads::thread_state_enum ()(hpx::threads::thread_state_ex_enum), hpx::threads::detail::coroutine_allocator, boost::coroutines::detail::lx::x86_linux_context_impl>, boost::coroutines::detail::lx::x86_linux_context_impl, hpx::threads::detail::coroutine_allocator>::operator()() + 0xd9 in /fslhome/mwa2/hpxgit/lib/hpx/libhpx.so.1
0x2b4e54940f79 : void boost::coroutines::detail::lx::trampoline<boost::coroutines::detail::coroutine_impl_wrapper<hpx::util::function_nonser<hpx::threads::thread_state_enum ()(hpx::threads::thread_state_ex_enum)>, boost::coroutines::coroutine<hpx::threads::thread_state_enum ()(hpx::threads::thread_state_ex_enum), hpx::threads::detail::coroutine_allocator, boost::coroutines::detail::lx::x86_linux_context_impl>, boost::coroutines::detail::lx::x86_linux_context_impl, hpx::threads::detail::coroutine_allocator> >(boost::coroutines::detail::coroutine_impl_wrapper<hpx::util::function_nonser<hpx::threads::thread_state_enum ()(hpx::threads::thread_state_ex_enum)>, boost::coroutines::coroutine<hpx::threads::thread_state_enum ()(hpx::threads::thread_state_ex_enum), hpx::threads::detail::coroutine_allocator, boost::coroutines::detail::lx::x86_linux_context_impl>, boost::coroutines::detail::lx::x86_linux_context_impl, hpx::threads::detail::coroutine_allocator>*) + 0x9 in /fslhome/mwa2/hpxgit/lib/hpx/libhpx.so.1

Tonight the crash rate looks closer to 50% than to 90%.

Member

hkaiser commented Aug 8, 2012

I'm not able to reproduce this on Windows (neither Debug nor Release builds) :(

However, this confirms the suspicion that the problem is specific to Linux, most probably caused by changes to the optimization settings (bff3661?).

Member

hkaiser commented Aug 9, 2012

Is this fixed now after #480 was solved? Can we close this ticket?

Contributor Author

maeneas commented Aug 9, 2012

I don't know yet. The latest update of hpx doesn't compile with g++44, so I can't test the changes.

Contributor Author

maeneas commented Aug 10, 2012

This is fixed now.

@maeneas maeneas closed this as completed Aug 10, 2012