Distributed problem #696

Closed
hkaiser opened this issue Feb 7, 2013 · 2 comments

@hkaiser
Member

commented Feb 7, 2013

I'm seeing this with both jacobi and gtcx, in both release and debug builds. I've tried pulling the latest from trunk and completely rebuilding from scratch, and the problem persists. I suspect a race condition in the parcelport, as the failure doesn't always show up.

Example (in debug) with gtcx, running manually on 4 localities:

[18:26:46]:wash@beowulf00:/home/wash:0:$ ~/install/hpx/gcc-4.6.2-debug/bin/gtcx_client -a beowulf00 -x beowulf00 -c -t4 -l4
num_partitions = 16
[snip, normal output]
  step            4

[18:26:46]:wash@beowulf01:/home/wash:0:$ ~/install/hpx/gcc-4.6.2-debug/bin/gtcx_client -a beowulf00 -x beowulf01 -w -t4 -l4

[18:26:46]:wash@beowulf02:/home/wash:0:$ ~/install/hpx/gcc-4.6.2-debug/bin/gtcx_client -a beowulf00 -x beowulf02 -w -t4 -l4

{what}: std::exception: HPX(unhandled_exception)
{version}: V1.0.0-trunk (AGAS: V2.1), Git: 624b56d04cf07fd419d276db1a5322a5b1b86af0
{boost}: V1.52.0
{build-type}: debug
{date}: Feb  6 2013 17:41:35
{platform}: linux
{compiler}: GNU C++ version 4.6.2
{stdlib}: GNU libstdc++ version 20120120

Aborted

[18:26:46]:wash@beowulf03:/home/wash:0:$ ~/install/hpx/gcc-4.6.2-debug/bin/gtcx_client -a beowulf00 -x beowulf03 -w -t4 -l4

{what}: remote_endpoint: Transport endpoint is not connected
{version}: V1.0.0-trunk (AGAS: V2.1), Git: 624b56d04cf07fd419d276db1a5322a5b1b86af0
{boost}: V1.52.0
{build-type}: debug
{date}: Feb  6 2013 17:41:35
{platform}: linux
{compiler}: GNU C++ version 4.6.2
{stdlib}: GNU libstdc++ version 20120120

Aborted

Example (in debug) with jacobi, launched through PBS. The PBS script is:

#! /bin/bash
#
# Copyright (c) 2009-2011 Bryce Lelbach
#
# Distributed under the Boost Software License, Version 1.0. (See accompanying
# file BOOST_LICENSE_1_0.rst or copy at http://www.boost.org/LICENSE_1_0.txt)
#
#PBS -l nodes=4:beowulf:ppn=4,walltime=00:30:00

APP_PATH=/home/wash/install/hpx/gcc-4.6.2-debug/bin/jacobi
APP_OPTIONS="--nx 100 --ny 100"

time pbsdsh -u $APP_PATH --hpx:nodes=`cat $PBS_NODEFILE` $APP_OPTIONS 

Result:

{what}: std::exception: HPX(unhandled_exception)
{version}: V1.0.0-trunk (AGAS: V2.1), Git: 624b56d
{boost}: V1.52.0
{build-type}: debug
{date}: Feb 6 2013 17:41:35
{platform}: linux
{compiler}: GNU C++ version 4.6.2
{stdlib}: GNU libstdc++ version 20120120

pbsdsh: task 3 exit status 262

{what}: remote_endpoint: Transport endpoint is not connected
{version}: V1.0.0-trunk (AGAS: V2.1), Git: 624b56d
{boost}: V1.52.0
{build-type}: debug
{date}: Feb 6 2013 17:41:35
{platform}: linux
{compiler}: GNU C++ version 4.6.2
{stdlib}: GNU libstdc++ version 20120120

{what}: remote_endpoint: Transport endpoint is not connected
{version}: V1.0.0-trunk (AGAS: V2.1), Git: 624b56d
{boost}: V1.52.0
{build-type}: debug
{date}: Feb 6 2013 17:41:35
{platform}: linux
{compiler}: GNU C++ version 4.6.2
{stdlib}: GNU libstdc++ version 20120120

pbsdsh: task 0 exit status 262

Logs (from GTCX) indicate where the problem may lie:

(T00000000/00007f8f47cb2f00.363/00007f8f4824c010) P00000000/00007f8f47cab660.16 18:44.22.299 [0000000000006202] [ERR] created exception: timed out while trying to find room in the connection cache: HPX(network_error)
(T00000000/00007f8f47cb2ea0.302/00007f8f4824c008) P00000000/00007f8f47cab660.16 18:44.22.364 [00000000000065cd] [ERR] created exception: timed out while trying to find room in the connection cache: HPX(network_error)
(T00000000/00007f8f47cb2000.2ff/00007f8f4824c018) P00000000/00007f8f47cab660.18 18:44.22.371 [000000000000665c] [ERR] created exception: timed out while trying to find room in the connection cache: HPX(network_error)
(T00000000/00007f8f47cb2f00.3cf/00007f8f4824c010) P00000000/00007f8f47cab660.16 18:44.22.374 [0000000000006698] [ERR] created exception: timed out while trying to find room in the connection cache: HPX(network_error)
(T00000000/00007f8f47cb2ea0.391/00007f8f4824c008) P00000000/00007f8f47cab660.16 18:44.22.696 [000000000000709c] [ERR] created exception: timed out while trying to find room in the connection cache: HPX(network_error)
(T00000000/00007f8f47cb2f00.481/00007f8f4824c010) P00000000/00007f8f47cab660.16 18:44.22.697 [00000000000070b7] [ERR] created exception: timed out while trying to find room in the connection cache: HPX(network_error)
(T00000000/00007f8f47cb2f60.28f/00007f8f4824c000) P00000000/00007f8f47cab660.16 18:44.22.706 [0000000000007163] [ERR] created exception: timed out while trying to find room in the connection cache: HPX(network_error)
(T00000000/----------------.--/----------------) P--------/----------------.-- 18:44.22.708 [000000000000716c] [PT] handle read operation completion: error: Connection reset by peer
(T00000000/----------------.--/----------------) P--------/----------------.-- 18:44.22.708 [000000000000716d] [PT] handle read operation completion: error: Connection reset by peer
(T00000000/----------------.--/----------------) P--------/----------------.-- 18:44.22.708 [0000000000007171] [PT] handle read operation completion: error: Connection reset by peer
(T00000000/----------------.--/----------------) P--------/----------------.-- 18:44.22.709 [0000000000000001] [ERR] tfunc(2): caught boost::system::system_error: remote_endpoint: Transport endpoint is not connected, aborted thread execution

That seems like a bug...

@ghost ghost assigned brycelelbach Feb 7, 2013

hkaiser added a commit that referenced this issue Feb 7, 2013

@sithhell

Member

commented Feb 7, 2013

This latest commit does not entirely fix the issue. I am still getting "remote_endpoint: Transport endpoint is not connected" errors.

@hkaiser

Member Author

commented Feb 10, 2013

How can I reproduce this problem?

@hkaiser hkaiser closed this in add286b Feb 12, 2013

sithhell added a commit that referenced this issue Feb 13, 2013

Fixing parcelports:
    - Changing connection to only have an upper bound for maximum
      connections per locality
    - Changing the parcelport to not bail out when no connection
      was available
    - Removing max_connections from the configuration
    - Adding option to configure the data buffer cache size

This fixes #710 and #696
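
For readers who want a concrete picture of the first two bullets in the commit message above, here is a minimal sketch of a connection cache whose only limit is an upper bound per locality, and which reports "no connection available" to the caller instead of throwing. This is an illustration under stated assumptions, not the actual HPX code: the names connection, bounded_connection_cache, try_acquire, release and in_use are all hypothetical, and the real parcelport is considerably more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a parcelport connection; not an HPX type.
struct connection
{
    std::string locality;
};

// Sketch only: a cache that enforces an upper bound per locality and
// nothing else (no global max_connections setting).
class bounded_connection_cache
{
public:
    explicit bounded_connection_cache(std::size_t max_per_locality)
      : max_per_locality_(max_per_locality)
    {
    }

    // Returns a connection to the locality, or nullptr if that locality
    // is already at its limit. A nullptr is not an error: the caller is
    // expected to keep the parcel queued and try again later.
    std::shared_ptr<connection> try_acquire(std::string const& locality)
    {
        std::lock_guard<std::mutex> lock(mtx_);
        entry& e = entries_[locality];

        // Prefer reusing an idle connection over opening a new one.
        if (!e.idle.empty())
        {
            std::shared_ptr<connection> conn = e.idle.back();
            e.idle.pop_back();
            ++e.in_use;
            return conn;
        }

        // No idle connection and the limit is reached: do not bail out.
        if (e.in_use >= max_per_locality_)
            return nullptr;

        // Below the limit: open a new connection (simulated here).
        std::shared_ptr<connection> conn = std::make_shared<connection>();
        conn->locality = locality;
        ++e.in_use;
        return conn;
    }

    // Hands a connection back so a later try_acquire can reuse it.
    void release(std::shared_ptr<connection> conn)
    {
        std::lock_guard<std::mutex> lock(mtx_);
        entry& e = entries_[conn->locality];
        assert(e.in_use > 0);
        --e.in_use;
        e.idle.push_back(std::move(conn));
    }

    // Number of connections currently handed out for a locality.
    std::size_t in_use(std::string const& locality) const
    {
        std::lock_guard<std::mutex> lock(mtx_);
        auto it = entries_.find(locality);
        return it == entries_.end() ? 0 : it->second.in_use;
    }

private:
    struct entry
    {
        std::size_t in_use = 0;
        std::vector<std::shared_ptr<connection>> idle;
    };

    mutable std::mutex mtx_;
    std::size_t max_per_locality_;
    std::map<std::string, entry> entries_;
};
```

With a limit of 2, a third try_acquire for the same locality returns nullptr while the first two connections are still in use, and a request for a different locality is unaffected. Returning nullptr rather than throwing is the design choice that matters in this sketch: the logs above show the exception from the cache ending in an aborted thread, whereas here a full cache only means the send is deferred.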