buildbot builds locking up / timing out in MPI initialization #892

Closed
khuck opened this issue Mar 21, 2019 · 2 comments
Comments

@khuck
Contributor

khuck commented Mar 21, 2019

When running the Python tests on the buildbot server, MPI initialization hangs inside OpenMPI when launching non-distributed executions. See: http://ktau.nic.uoregon.edu:8020/#/builders/7/builds/198. The following tests fail:

	 60 - tests.regressions.python.python.empty_list_509 (Timeout)
	 61 - tests.regressions.python.python.empty_list_510 (Timeout)
	 62 - tests.regressions.python.python.exception_swallowed_369 (Timeout)
	 63 - tests.regressions.python.python.for_map_516 (Timeout)
	 64 - tests.regressions.python.python.lambda_492 (Timeout)
	 65 - tests.regressions.python.python.list_iteration_524 (Timeout)
	 66 - tests.regressions.python.python.list_iter_space_429 (Timeout)
	 67 - tests.regressions.python.python.list_slice_assign_528 (Timeout)
	 70 - tests.regressions.python.python.np_sum_489 (Timeout)
	 71 - tests.regressions.python.python.passing_compiler_state_453 (Timeout)
	 72 - tests.regressions.python.python.reassign_512 (Timeout)
	 73 - tests.regressions.python.python.zero_dimensional_array_502 (Timeout)
	208 - tests.unit.python.ast.generate_ast (Timeout)
	209 - tests.unit.python.ast.node (Timeout)
	210 - tests.unit.python.ast.python_builds_ast (Timeout)
	211 - tests.unit.python.ast.traverse_ast (Failed)
	212 - tests.unit.python.execution_tree.dictionary (Failed)
	213 - tests.unit.python.execution_tree.config_hpx (Timeout)
	214 - tests.unit.python.execution_tree.dynamic_init (Timeout)
	215 - tests.unit.python.execution_tree.for (Timeout)
	216 - tests.unit.python.execution_tree.eval (Timeout)
	217 - tests.unit.python.execution_tree.lazy_eval (Timeout)
	218 - tests.unit.python.execution_tree.make_array (Timeout)
	219 - tests.unit.python.execution_tree.map_numpy (Timeout)
	220 - tests.unit.python.execution_tree.map_numpy_constants (Timeout)
	221 - tests.unit.python.execution_tree.multi_init (Timeout)
	223 - tests.unit.python.execution_tree.parallel (Timeout)
	224 - tests.unit.python.execution_tree.set_operation (Timeout)
	225 - tests.unit.python.execution_tree.slice (Timeout)
	226 - tests.unit.python.primitives.lambda (Timeout)
	227 - tests.unit.python.primitives.make_list (Timeout)
	228 - tests.unit.python.primitives.make_vector (Timeout)
	229 - tests.unit.python.primitives.numpy_dtype (Timeout)
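
Of the 33 tests listed above, 31 timed out and 2 failed outright. For triaging a list like this, a minimal sketch along the following lines can tally the entries by status. The `tally` helper and the line pattern are illustrative assumptions based only on the format shown above, not part of the project's tooling.

```python
import re
from collections import Counter

# Assumed line shape, taken from the list above:
#   "<number> - <test name> (<status>)"
LINE_RE = re.compile(r"^\s*(\d+)\s+-\s+(\S+)\s+\((\w+)\)\s*$")


def tally(lines):
    """Count test entries per status (e.g. Timeout, Failed)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:  # silently skip anything that is not a test entry
            counts[match.group(3)] += 1
    return counts


# A few entries copied from the list above.
sample = [
    "\t 60 - tests.regressions.python.python.empty_list_509 (Timeout)",
    "\t211 - tests.unit.python.ast.traverse_ast (Failed)",
    "\t213 - tests.unit.python.execution_tree.config_hpx (Timeout)",
]
print(tally(sample))
```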

An example backtrace (from test 60):

(gdb) bt
#0  0x00007f9469d6e6fd in read () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f94554aa246 in rte_init () from /packages/openmpi/2.0.4_gcc-6.4/lib/libopen-rte.so.20
#2  0x00007f9455468a35 in orte_init () from /packages/openmpi/2.0.4_gcc-6.4/lib/libopen-rte.so.20
#3  0x00007f9458a576a6 in ompi_mpi_init () from /packages/openmpi/2.0.4_gcc-6.4/lib/libmpi.so.20
#4  0x00007f9458a768f3 in PMPI_Init_thread () from /packages/openmpi/2.0.4_gcc-6.4/lib/libmpi.so.20
#5  0x00007f945b973468 in hpx::util::mpi_environment::init (argc=0x7ffda64d1bcc, argv=0x7ffda64d1bc0, 
    cfg=...)
    at /var/lib/buildbot/slaves/phylanx/x86_64-gcc7-debug/build/tools/buildbot/src/hpx/plugins/parcelport/mpi/mpi_environment.cpp:133
#6  0x00007f945b976efb in hpx::traits::plugin_config_data<hpx::parcelset::policies::mpi::parcelport, void>::init (argc=0x7ffda64d1bcc, argv=0x7ffda64d1bc0, cfg=...)
    at /var/lib/buildbot/slaves/phylanx/x86_64-gcc7-debug/build/tools/buildbot/src/hpx/plugins/parcelport/mpi/parcelport_mpi.cpp:272
#7  0x00007f945b977bcf in hpx::plugins::parcelport_factory<hpx::parcelset::policies::mpi::parcelport>::init (
    this=0x7f945c506080 <parcelport_mpi_factory_init(std::vector<hpx::plugins::parcelport_factory_base*, std::allocator<hpx::plugins::parcelport_factory_base*> >&)::factory>, argc=0x7ffda64d1bcc, argv=0x7ffda64d1bc0, 
    cfg=...)
    at /var/lib/buildbot/slaves/phylanx/x86_64-gcc7-debug/build/tools/buildbot/src/hpx/hpx/plugins/parcelport_factory.hpp:125
#8  0x00007f945b43659f in hpx::parcelset::parcelhandler::init (argc=0x7ffda64d1bcc, argv=0x7ffda64d1bc0, 
    cfg=...)
    at /var/lib/buildbot/slaves/phylanx/x86_64-gcc7-debug/build/tools/buildbot/src/hpx/src/runtime/parcelset/parcelhandler.cpp:1559
#9  0x00007f945b773eae in hpx::util::command_line_handling::call (this=0x22ae000, desc_cmdline=..., argc=2, 
    argv=0x7ffda64d4138)
    at /var/lib/buildbot/slaves/phylanx/x86_64-gcc7-debug/build/tools/buildbot/src/hpx/src/util/command_line_handling.cpp:1365
#10 0x00007f945b44cff5 in hpx::resource::detail::partitioner::parse(hpx::util::function<int (boost::program_options::variables_map&), false> const&, boost::program_options::options_description, int, char**, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, hpx::resource::partitioner_mode, hpx::runtime_mode, bool) (this=0x22ae000, f=..., desc_cmdline=..., argc=2, argv=0x7ffda64d4138, ini_config=..., 
    rpmode=hpx::resource::mode_default, mode=hpx::runtime_mode_console, fill_internal_topology=true)
    at /var/lib/buildbot/slaves/phylanx/x86_64-gcc7-debug/build/tools/buildbot/src/hpx/src/runtime/resource/detail/detail_partitioner.cpp:916
#11 0x00007f945b457ff6 in hpx::resource::detail::create_partitioner(hpx::util::function<int (boost::program_options::variables_map&), false> const&, boost::program_options::options_description const&, int, char**, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, hpx::resource::partitioner_mode, hpx::runtime_mode, bool) (f=..., desc_cmdline=..., argc=2, argv=0x7ffda64d4138, ini_config=..., 
    rpmode=hpx::resource::mode_default, mode=hpx::runtime_mode_console, check=false)
    at /var/lib/buildbot/slaves/phylanx/x86_64-gcc7-debug/build/tools/buildbot/src/hpx/src/runtime/resource/partitioner.cpp:236
#12 0x00007f945aedd97a in hpx::detail::run_or_start(hpx::util::function<int (boost::program_options::variables_map&), false> const&, boost::program_options::options_description const&, int, char**, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&&, hpx::util::unique_function<void (), false>, hpx::util::unique_function<void (), false>, hpx::runtime_mode, bool) (f=..., desc_cmdline=..., argc=2, 
    argv=0x7ffda64d4138, 
    ini_config=<unknown type in /var/lib/buildbot/slaves/phylanx/x86_64-gcc7-debug/build/tools/buildbot/build-delphi-x86_64-Linux-gcc/hpx-Debug/lib/libhpxd.so.1, CU 0xb91d5, DIE 0x21b9cc>, startup=..., shutdown=..., 
    mode=hpx::runtime_mode_console, blocking=false)
    at /var/lib/buildbot/slaves/phylanx/x86_64-gcc7-debug/build/tools/buildbot/src/hpx/src/hpx_init.cpp:626
#13 0x00007f94603b6038 in hpx::start(hpx::util::function<int (boost::program_options::variables_map&), false> const&, boost::program_options::options_description const&, int, char**, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, hpx::util::unique_function<void (), false>, hpx::util::unique_function<void (), false>, hpx::runtime_mode) (f=..., desc_cmdline=..., argc=2, argv=0x7ffda64d4138, 
    cfg=..., startup=..., shutdown=..., mode=hpx::runtime_mode_console)
    at /var/lib/buildbot/slaves/phylanx/x86_64-gcc7-debug/build/tools/buildbot/src/hpx/hpx/hpx_start_impl.hpp:77
#14 0x00007f94603b627a in hpx::start(hpx::util::function<int (int, char**), false> const&, int, char**, std::v
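
A hang like the one in this backtrace can be detected outside of the test harness by running a single test command under a timeout. The following is a rough sketch, not the buildbot's actual mechanism: `run_with_timeout` and `gdb_backtrace_cmd` are hypothetical helper names, and the gdb invocation assumes a gdb that supports `-p`, `-batch`, and `-ex`. Batch mode also avoids the interactive pager prompt when capturing a backtrace.

```python
import subprocess
import sys


def gdb_backtrace_cmd(pid):
    """Build a non-interactive gdb command that dumps all thread backtraces."""
    return ["gdb", "-p", str(pid), "-batch",
            "-ex", "set pagination off",
            "-ex", "thread apply all bt"]


def run_with_timeout(cmd, timeout_s):
    """Run cmd and classify the outcome as "ok", "failed", or "timeout"."""
    proc = subprocess.Popen(cmd)
    try:
        returncode = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The process is still alive here; gdb_backtrace_cmd(proc.pid)
        # could be run at this point before the process is killed.
        proc.kill()
        proc.wait()
        return "timeout"
    return "ok" if returncode == 0 else "failed"


if __name__ == "__main__":
    # Stand-in for a hanging test: a child process that sleeps too long.
    hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(hanging, timeout_s=1))
```

In practice the command would be one of the failing test executables, and the timeout would match the harness's own limit.
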
@hkaiser
Member

hkaiser commented Mar 31, 2019

@khuck this should be fine now as the HPX PR was merged.

@khuck
Contributor Author

khuck commented Mar 31, 2019

@hkaiser no - the build after that PR was merged (three hours ago?) failed the same way....
