I'm seeing this with both jacobi and gtcx, in release and in debug builds. I've tried pulling the latest from trunk and rebuilding completely from scratch. I suspect a race condition in the parcelport, since the failure doesn't always show up.
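Since the failure is intermittent, one rough way to gauge how often it appears is to rerun the same launch command several times and count the non-zero exits. A minimal sketch, assuming a POSIX shell; `count_failures` is a hypothetical helper written for illustration and is not part of HPX:

```shell
# count_failures N CMD [ARGS...]
# Runs CMD N times and prints how many of those runs exited non-zero.
# Hypothetical helper for illustration only; not part of HPX.
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard the command's output; only the exit status matters here.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}
```

For example, `count_failures 20 some_launch_command` would print how many of 20 runs failed. This only helps if the launch command itself returns a non-zero status when a locality aborts, which is an assumption here.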
Example (in debug) with gtcx, running manually on 4 localities:
[18:26:46]:wash@beowulf00:/home/wash:0:$ ~/install/hpx/gcc-4.6.2-debug/bin/gtcx_client -a beowulf00 -x beowulf00 -c -t4 -l4
num_partitions = 16
[snip, normal output]
step 4
[18:26:46]:wash@beowulf01:/home/wash:0:$ ~/install/hpx/gcc-4.6.2-debug/bin/gtcx_client -a beowulf00 -x beowulf01 -w -t4 -l4
[18:26:46]:wash@beowulf02:/home/wash:0:$ ~/install/hpx/gcc-4.6.2-debug/bin/gtcx_client -a beowulf00 -x beowulf02 -w -t4 -l4
{what}: std::exception: HPX(unhandled_exception)
{version}: V1.0.0-trunk (AGAS: V2.1), Git: 624b56d04cf07fd419d276db1a5322a5b1b86af0
{boost}: V1.52.0
{build-type}: debug
{date}: Feb 6 2013 17:41:35
{platform}: linux
{compiler}: GNU C++ version 4.6.2
{stdlib}: GNU libstdc++ version 20120120
Aborted
[18:26:46]:wash@beowulf03:/home/wash:0:$ ~/install/hpx/gcc-4.6.2-debug/bin/gtcx_client -a beowulf00 -x beowulf03 -w -t4 -l4
{what}: remote_endpoint: Transport endpoint is not connected
{version}: V1.0.0-trunk (AGAS: V2.1), Git: 624b56d04cf07fd419d276db1a5322a5b1b86af0
{boost}: V1.52.0
{build-type}: debug
{date}: Feb 6 2013 17:41:35
{platform}: linux
{compiler}: GNU C++ version 4.6.2
{stdlib}: GNU libstdc++ version 20120120
Aborted
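The four manual invocations above follow one pattern: every process passes the first host to `-a`, its own host to `-x`, and the first host uses `-c` where the others use `-w`. A small sketch that prints those command lines for a list of hosts, so they can be pasted into separate terminals. `print_launch_cmds` is a hypothetical helper, and the flag pattern is copied from the commands above rather than taken from the HPX documentation:

```shell
# print_launch_cmds BINARY HOST...
# Prints one command line per host, mirroring the manual runs above.
# The first host is passed to -a for every process and gets -c;
# all other hosts get -w. -l is set to the number of hosts.
# Hypothetical helper for illustration only.
print_launch_cmds() {
    bin=$1
    shift
    first=$1
    count=$#
    for host in "$@"; do
        if [ "$host" = "$first" ]; then
            role=-c
        else
            role=-w
        fi
        printf '%s -a %s -x %s %s -t4 -l%s\n' \
            "$bin" "$first" "$host" "$role" "$count"
    done
}
```

Calling `print_launch_cmds ~/install/hpx/gcc-4.6.2-debug/bin/gtcx_client beowulf00 beowulf01 beowulf02 beowulf03` should reproduce the four command lines used in this report.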
Example (in debug) with jacobi, using PBS. The PBS file is:
#! /bin/bash
#
# Copyright (c) 2009-2011 Bryce Lelbach
#
# Distributed under the Boost Software License, Version 1.0. (See accompanying
# file BOOST_LICENSE_1_0.rst or copy at http://www.boost.org/LICENSE_1_0.txt)
#
#PBS -l nodes=4:beowulf:ppn=4,walltime=00:30:00
APP_PATH=/home/wash/install/hpx/gcc-4.6.2-debug/bin/jacobi
APP_OPTIONS="--nx 100 --ny 100"
time pbsdsh -u $APP_PATH --hpx:nodes=`cat $PBS_NODEFILE` $APP_OPTIONS
Result:
{what}: std::exception: HPX(unhandled_exception)
{version}: V1.0.0-trunk (AGAS: V2.1), Git: 624b56d
{boost}: V1.52.0
{build-type}: debug
{date}: Feb 6 2013 17:41:35
{platform}: linux
{compiler}: GNU C++ version 4.6.2
{stdlib}: GNU libstdc++ version 20120120
pbsdsh: task 3 exit status 262
{what}: remote_endpoint: Transport endpoint is not connected
{version}: V1.0.0-trunk (AGAS: V2.1), Git: 624b56d
{boost}: V1.52.0
{build-type}: debug
{date}: Feb 6 2013 17:41:35
{platform}: linux
{compiler}: GNU C++ version 4.6.2
{stdlib}: GNU libstdc++ version 20120120
{what}: remote_endpoint: Transport endpoint is not connected
{version}: V1.0.0-trunk (AGAS: V2.1), Git: 624b56d
{boost}: V1.52.0
{build-type}: debug
{date}: Feb 6 2013 17:41:35
{platform}: linux
{compiler}: GNU C++ version 4.6.2
{stdlib}: GNU libstdc++ version 20120120
pbsdsh: task 0 exit status 262
Logs (from GTCX) indicate where the problem may lie:
(T00000000/00007f8f47cb2f00.363/00007f8f4824c010) P00000000/00007f8f47cab660.16 18:44.22.299 [0000000000006202] [ERR] created exception: timed out while trying to find room in the connection cache: HPX(network_error)
(T00000000/00007f8f47cb2ea0.302/00007f8f4824c008) P00000000/00007f8f47cab660.16 18:44.22.364 [00000000000065cd] [ERR] created exception: timed out while trying to find room in the connection cache: HPX(network_error)
(T00000000/00007f8f47cb2000.2ff/00007f8f4824c018) P00000000/00007f8f47cab660.18 18:44.22.371 [000000000000665c] [ERR] created exception: timed out while trying to find room in the connection cache: HPX(network_error)
(T00000000/00007f8f47cb2f00.3cf/00007f8f4824c010) P00000000/00007f8f47cab660.16 18:44.22.374 [0000000000006698] [ERR] created exception: timed out while trying to find room in the connection cache: HPX(network_error)
(T00000000/00007f8f47cb2ea0.391/00007f8f4824c008) P00000000/00007f8f47cab660.16 18:44.22.696 [000000000000709c] [ERR] created exception: timed out while trying to find room in the connection cache: HPX(network_error)
(T00000000/00007f8f47cb2f00.481/00007f8f4824c010) P00000000/00007f8f47cab660.16 18:44.22.697 [00000000000070b7] [ERR] created exception: timed out while trying to find room in the connection cache: HPX(network_error)
(T00000000/00007f8f47cb2f60.28f/00007f8f4824c000) P00000000/00007f8f47cab660.16 18:44.22.706 [0000000000007163] [ERR] created exception: timed out while trying to find room in the connection cache: HPX(network_error)
(T00000000/----------------.--/----------------) P--------/----------------.-- 18:44.22.708 [000000000000716c] [PT] handle read operation completion: error: Connection reset by peer
(T00000000/----------------.--/----------------) P--------/----------------.-- 18:44.22.708 [000000000000716d] [PT] handle read operation completion: error: Connection reset by peer
(T00000000/----------------.--/----------------) P--------/----------------.-- 18:44.22.708 [0000000000007171] [PT] handle read operation completion: error: Connection reset by peer
(T00000000/----------------.--/----------------) P--------/----------------.-- 18:44.22.709 [0000000000000001] [ERR] tfunc(2): caught boost::system::system_error: remote_endpoint: Transport endpoint is not connected, aborted thread execution
That seems like a bug...
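To see at a glance which messages dominate a log like the one above, the entries can be grouped by their category tag and message text. A minimal sketch, assuming the log has been saved to a file and that every entry ends with `[sequence] [TAG] message` as in the excerpt; `summarize_log` is a hypothetical helper, not an HPX tool:

```shell
# summarize_log FILE
# Strips the thread/locality/timestamp prefix from each log line, keeps
# "[TAG] message", and prints each distinct entry with its count,
# most frequent first. Hypothetical helper for illustration only.
summarize_log() {
    sed -n 's/^.*] \[\([A-Z]*\)] \(.*\)$/[\1] \2/p' "$1" \
        | sort | uniq -c | sort -rn
}
```

Run on the excerpt above, this should group the entries into 7 connection cache timeouts, 3 "Connection reset by peer" read errors, and the single `tfunc(2)` abort, which makes the order of events easier to see: the cache timeouts come first and the connection errors follow.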
- Changing connection to only have an upper bound for maximum connections per locality
- Changing the parcelport to not bail out when no connection was available
- Removing max_connections from the configuration
- Adding option to configure the data buffer cache size
This fixes #710 and #696