
Implement local-only primary namespace service #2232

Closed

Conversation

hkaiser
Member

@hkaiser hkaiser commented Jul 2, 2016

This PR implements an optimization of AGAS for local-only operation. In this mode (command line option --hpx:run-locally) all networking is disabled and the local virtual addresses of global objects are directly encoded in their global identifiers. This removes a large part of the AGAS overheads introduced by global reference counting and global address resolution.

This fixes #1591
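
To make the mechanism concrete, here is a minimal sketch of the core idea in C++. It is illustrative only: the gid_type layout is simplified, and is_local_bit, make_local_gid, and resolve_local are hypothetical names, not the actual HPX declarations.

// A global id is two 64-bit words. In local-only mode the local
// virtual address of the object is stored directly in the id, so
// resolving it is a bit test and a cast instead of an AGAS lookup.
// All names in this sketch are hypothetical.
#include <cstdint>

struct gid_type
{
    std::uint64_t msb;   // upper word: locality id, flags, ...
    std::uint64_t lsb;   // lower word
};

constexpr std::uint64_t is_local_bit = 1ull << 63;   // hypothetical flag

inline gid_type make_local_gid(void* lva)
{
    // encode the local virtual address directly in the identifier
    return gid_type{is_local_bit, reinterpret_cast<std::uint64_t>(lva)};
}

inline void* resolve_local(gid_type const& gid)
{
    // no network round-trip and no resolution table involved
    if ((gid.msb & is_local_bit) != 0)
        return reinterpret_cast<void*>(gid.lsb);
    return nullptr;   // would fall back to full AGAS resolution
}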

- added agas::detail::local_primary_namespace
- simplified and unified the generation of gids
- refactored the performance counters for server::primary_namespace
- moved parts of the implementation of agas::detail::hosted_data_type and agas::detail::bootstrap_data_type into source files
- refactored primary_namespace into a base-class component and two derived classes (see the sketch after this list)
- split generate_unique_ids into two implementations
- disabled AGAS caching when in local mode
- hello_world is running, more testing is required
- this also disables all networking; no other localities are expected to connect
- the command line option --hpx:expect-connecting-localities can now take an (optional) argument
- fly-by: removed the documentation of the configuration settings for the IPC and VERBS parcelports
- fly-by: simplified the implementation of the performance counters for the primary AGAS namespaces
- enabled the performance counters for the local primary AGAS namespace
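
The refactoring into a base class and two derived classes could look roughly like the sketch below. Only the class names primary_namespace and local_primary_namespace come from the PR itself; the member functions shown are illustrative assumptions, not the actual HPX interfaces.

// Rough shape of the refactoring: the distributed implementation
// consults the AGAS tables, the local-only one short-circuits both
// address resolution and global reference counting.
#include <cstdint>

struct gid_type { std::uint64_t msb, lsb; };

struct primary_namespace_base
{
    virtual ~primary_namespace_base() = default;
    virtual void* resolve(gid_type const& gid) = 0;
    virtual void increment_credit(gid_type const& gid, std::int64_t n) = 0;
    virtual void decrement_credit(gid_type const& gid, std::int64_t n) = 0;
};

// Distributed case: full address resolution and global reference counts.
struct primary_namespace : primary_namespace_base
{
    void* resolve(gid_type const&) override
    {
        // would consult the AGAS resolution table, possibly remotely
        return nullptr;
    }
    void increment_credit(gid_type const&, std::int64_t) override
    { /* update the global reference count table */ }
    void decrement_credit(gid_type const&, std::int64_t) override
    { /* update the table; sweep objects whose count drops to zero */ }
};

// Local-only case: the address is encoded in the gid itself, and global
// reference counting becomes a no-op (local reference counts suffice).
struct local_primary_namespace : primary_namespace_base
{
    void* resolve(gid_type const& gid) override
    {
        return reinterpret_cast<void*>(gid.lsb);
    }
    void increment_credit(gid_type const&, std::int64_t) override {}
    void decrement_credit(gid_type const&, std::int64_t) override {}
};

Note that a design along these lines adds a virtual dispatch to every AGAS call, a point that comes up again in the benchmark discussion below.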
@sithhell
Member

sithhell commented Jul 3, 2016

Which kind of applications would benefit from such an optimization? What's the actual performance increase when using this?

@hkaiser
Member Author

hkaiser commented Jul 3, 2016

Which kind of applications would benefit from such an optimization?

Any application which is potentially distributed but has to run on a single locality (embedded devices?). This patch allows global addresses to be used more efficiently when running on one locality. It will also benefit our comparisons with libraries like TBB.

What's the actual performance increase when using this?

I have not done any solid performance analysis. Here are the performance counter results for agas/primary_namespace for hello_world with and without this optimization, though:

hello_world.exe -t8 \
    --hpx:print-counter=/agas{locality#*/total}/primary/count \
    --hpx:print-counter=/agas{locality#*/total}/primary/time  
hello world from OS-thread 2 on locality 0
hello world from OS-thread 6 on locality 0
hello world from OS-thread 1 on locality 0
hello world from OS-thread 5 on locality 0
hello world from OS-thread 4 on locality 0
hello world from OS-thread 7 on locality 0
hello world from OS-thread 0 on locality 0
hello world from OS-thread 3 on locality 0
/agas{locality#0/total}/primary/count,1,0.012257,[s],56
/agas{locality#0/total}/primary/time,1,0.010653,[s],110242,[ns]
hello_world.exe -t8 --hpx:run-locally \
    --hpx:print-counter=/agas{locality#*/total}/primary/count \
    --hpx:print-counter=/agas{locality#*/total}/primary/time  
hello world from OS-thread 2 on locality 0
hello world from OS-thread 6 on locality 0
hello world from OS-thread 1 on locality 0
hello world from OS-thread 5 on locality 0
hello world from OS-thread 4 on locality 0
hello world from OS-thread 7 on locality 0
hello world from OS-thread 0 on locality 0
hello world from OS-thread 3 on locality 0
/agas{locality#0/total}/primary/count,1,0.011453,[s],52
/agas{locality#0/total}/primary/time,1,0.011582,[s],45498,[ns]

So it looks like, in this case, both the number of calls and the time required to execute them are reduced.
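
If the [ns] value is the cumulative time spent in all counted calls, normalizing by the call count makes the per-call improvement more visible:

without --hpx:run-locally: 110242 ns / 56 calls ≈ 1968 ns per call
with --hpx:run-locally:     45498 ns / 52 calls ≈  875 ns per call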

@sithhell
Member

sithhell commented Jul 3, 2016

Which kind of applications would benefit from such an optimization?

Any application which is potentially distributed but has to run on a single locality (embedded devices?). This patch allows global addresses to be used more efficiently when running on one locality. It will also benefit our comparisons with libraries like TBB.

That was mainly the reason why I asked for performance data ... I am not sure that this optimization buys us a lot in general, especially in comparison with TBB, where we would compare a solution written for the distributed case with a completely local solution (apples and oranges). Our parallel algorithms etc. don't really depend on AGAS anyway.

Having this option enabled might also paint a distorted picture once you go distributed: given that the overheads are indeed lower, you need to adapt grain sizes etc. once again. I am not sure the optimization is really worth it in the end.

What's the actual performance increase when using this?

I have not done any solid performance analysis. Here are the performance counter results for agas/primary_namespace for hello_world with and without this optimization, though: [...]

So it looks like, in this case, both the number of calls and the time required to execute them are reduced.

Why do we have fewer requests all of a sudden? But the timing looks nice. It would be nice to have a baseline against master.
Hello World is of course not really a meaningful benchmark. In addition, the total runtime seems to be roughly the same.

P.S.: I am not done reading through the code yet ;)

@hkaiser
Member Author

hkaiser commented Jul 3, 2016

Having this option enabled might also bring a distorted picture once you go to distributed

I agree. I'm not sure myself how many applications are written with AGAS in mind but might have to run on a single locality in the end.

Our parallel algorithms etc. don't really depend on AGAS anyway.

Not quite. The segmented algorithms depend on it.

Why do we have fewer requests all of a sudden?

Fewer operations are needed for initializing things in this case, so in the end it's a one-time benefit.

Hello World is of course not really a meaningful benchmark. In addition, the total runtime seems to be roughly the same.

I absolutely agree, all the more as the overall runtime is probably dominated by the console IO anyway.

Overall, this PR has potential benefits; the question is how big the maintenance burden of having it in would be. As I think it wouldn't be too large, I'd like to have it available...

@sithhell sithhell modified the milestones: 0.9.99, 1.0.0 Jul 15, 2016
@hkaiser
Member Author

hkaiser commented Jul 26, 2016

I would like to go ahead and merge this. Are there any objections, still?

@sithhell
Member

sithhell commented Jul 26, 2016 via email

On Tuesday, 26 July 2016, 06:26:44 CEST, Hartmut Kaiser wrote:

I would like to go ahead and merge this. Are there any objections, still?

My main objection still is whether this is really worth it; we add a significant amount of code which has to be maintained. I would really like to see a benchmark (apart from hello world) showing real improvements.

@sithhell
Member

sithhell commented Jul 26, 2016 via email

The transpose example might be a good candidate to give an indication of the performance benefits.


@hkaiser
Member Author

hkaiser commented Jul 26, 2016

Here are some results from running transpose_block:

transpose_block.exe -t12 --iterations=100 --matrix_size=10240 --num_blocks=16
Finding blocks ...
Matrix transpose: B = A^T
Matrix order          = 10240
Matrix local columns  = 640
Number of blocks      = 16
Number of localities  = 1
Untiled
Number of iterations  = 100
Finding blocks A ... done
Finding blocks B ... done
Solution validates
Rate (MB/s): 9154.42, Avg time (s): 0.184647, Min time (s): 0.183269, Max time (s): 0.187131

transpose_block.exe -t12 --iterations=100 --matrix_size=10240 --num_blocks=16 --hpx:run-locally
Finding blocks ...
Matrix transpose: B = A^T
Matrix order          = 10240
Matrix local columns  = 640
Number of blocks      = 16
Number of localities  = 1
Untiled
Number of iterations  = 100
Finding blocks A ... done
Finding blocks B ... done
Solution validates
Rate (MB/s): 9163.28, Avg time (s): 0.185206, Min time (s): 0.183092, Max time (s): 0.196651

So the difference is not significant, but measurable.

@sithhell
Member

sithhell commented Jul 26, 2016

In addition to the posted results, I did a similar run and compared it to master:

I ran the following command:

./bin/transpose_block_numa --transpose-threads=8 --transpose-numa-domains=2 \
        --matrix_size=24000 --num_blocks=120 \
        --hpx:print-counter=/agas{locality#*/total}/primary/count \
        --hpx:print-counter=/agas{locality#*/total}/primary/time

master:

Rate (MB/s): 25510.5, Avg time (s): 0.363383, Min time (s): 0.361263, Max time (s): 0.365022
/agas{locality#0/total}/primary/count,1,6.591869,[s],290462
/agas{locality#0/total}/primary/time,1,6.591697,[s],6.56639e+09,[ns]

fixing_1591:

Rate (MB/s): 25160.4, Avg time (s): 0.366579, Min time (s): 0.366289, Max time (s): 0.366936
/agas{locality#0/total}/primary/count,1,6.621262,[s],290703
/agas{locality#0/total}/primary/time,1,6.621201,[s],6.31083e+09,[ns]

fixing_1591 (with --hpx:run-locally):

Rate (MB/s): 25208.1, Avg time (s): 0.365771, Min time (s): 0.365596, Max time (s): 0.366282
/agas{locality#0/total}/primary/count,1,6.619994,[s],290222
/agas{locality#0/total}/primary/time,1,6.620322,[s],4.0047e+07,[ns]
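
To put the [ns] values side by side (assuming they are the cumulative time spent in primary namespace operations):

master:                              6.56639e+09 ns  (~6.6 s)
fixing_1591:                         6.31083e+09 ns  (~6.3 s)
fixing_1591 with --hpx:run-locally:  4.00470e+07 ns  (~0.04 s)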

In order to increase the load on AGAS, I changed the number of blocks:

./bin/transpose_block_numa --transpose-threads=8 --transpose-numa-domains=2 \
        --matrix_size=24000 --num_blocks=240 \
        --hpx:print-counter=/agas{locality#*/total}/primary/count \
        --hpx:print-counter=/agas{locality#*/total}/primary/time

master:

Rate (MB/s): 19784.4, Avg time (s): 0.470493, Min time (s): 0.465821, Max time (s): 0.475618
/agas{locality#0/total}/primary/count,1,7.713836,[s],1.15689e+06
/agas{locality#0/total}/primary/time,1,7.715651,[s],2.29584e+10,[ns]

fixing_1591:

Rate (MB/s): 18909.5, Avg time (s): 0.489008, Min time (s): 0.487375, Max time (s): 0.490676
/agas{locality#0/total}/primary/count,1,7.738666,[s],1.15738e+06
/agas{locality#0/total}/primary/time,1,7.738564,[s],1.45573e+10,[ns]

fixing_1591 (with --hpx:run-locally):

Rate (MB/s): 19432.5, Avg time (s): 0.478318, Min time (s): 0.474256, Max time (s): 0.479987
/agas{locality#0/total}/primary/count,1,7.628902,[s],1.15642e+06
/agas{locality#0/total}/primary/time,1,7.630093,[s],1.3725e+08,[ns]

So the overall result is that adding --hpx:run-locally slightly increases performance on the fixing_1591 branch, but it still lags a little behind master. That is, there is no performance improvement with respect to what we already have. Without any profiling information, the only explanation I have for this is that the virtual function dispatch is responsible.

The run on fixing_1591 without --hpx:run-locally unfortunately sometimes fails with the following exception:

{stack-trace}: 4 frames:
0x7f737961873d  : hpx::detail::backtrace[abi:cxx11](unsigned long) + 0x9d in /home/inf3/heller/scratch/build/gcc-5.3.0/boost-1.60.0/openmpi-1.10.2/release/hpx/lib/libhpx.so.1
0x7f737964962a  : boost::exception_ptr hpx::detail::get_exception<hpx::exception>(hpx::exception const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xaa in /home/inf3/heller/scratch/build/gcc-5.3.0/boost-1.60.0/openmpi-1.10.2/release/hpx/lib/libhpx.so.1
0x7f7379649b9e  : void hpx::detail::throw_exception<hpx::exception>(hpx::exception const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long) + 0x4e in /home/inf3/heller/scratch/build/gcc-5.3.0/boost-1.60.0/openmpi-1.10.2/release/hpx/lib/libhpx.so.1
0x7f7379655f5e  : hpx::detail::throw_exception(hpx::error, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long) + 0x4e in /home/inf3/heller/scratch/build/gcc-5.3.0/boost-1.60.0/openmpi-1.10.2/release/hpx/lib/libhpx.so.1
{locality-id}: 0
{hostname}: [ (tcp:131.188.33.179:7910) ]
{process-id}: 25975
{function}: primary_namespace::decrement_sweep
{file}: /home/inf3/heller/programming/hpx/src/runtime/agas/server/primary_namespace_server.cpp
{line}: 1019
{os-thread}: 10, worker-thread#10
{thread-id}: 00007f7365223ec0
{thread-description}: <unknown>
{state}: state_running
{auxinfo}:
{config}:
  HPX_HAVE_NATIVE_TLS=ON
  HPX_HAVE_STACKTRACES=ON
  HPX_HAVE_COMPRESSION_BZIP2=OFF
  HPX_HAVE_COMPRESSION_SNAPPY=OFF
  HPX_HAVE_COMPRESSION_ZLIB=OFF
  HPX_HAVE_PARCEL_COALESCING=ON
  HPX_HAVE_PARCELPORT_TCP=ON
  HPX_HAVE_PARCELPORT_MPI=ON (OpenMPI V1.10.2, MPI V3.0)
  HPX_HAVE_PARCELPORT_IPC=OFF
  HPX_HAVE_PARCELPORT_IBVERBS=OFF
  HPX_HAVE_VERIFY_LOCKS=OFF
  HPX_HAVE_HWLOC=ON
  HPX_HAVE_ITTNOTIFY=OFF
  HPX_HAVE_RUN_MAIN_EVERYWHERE=OFF
  HPX_PARCEL_MAX_CONNECTIONS=512
  HPX_PARCEL_MAX_CONNECTIONS_PER_LOCALITY=4
  HPX_AGAS_LOCAL_CACHE_SIZE=4096
  HPX_HAVE_MALLOC=jemalloc
  HPX_PREFIX (configured)=/usr/local
  HPX_PREFIX=/home/inf3/heller/scratch/build/gcc-5.3.0/boost-1.60.0/openmpi-1.10.2/release/hpx
{version}: V1.0.0-trunk (AGAS: V3.0), Git: unknown
{boost}: V1.60.0
{build-type}: release
{date}: Jul 26 2016 17:39:23
{platform}: linux
{compiler}: GNU C++ version 5.3.0
{stdlib}: GNU libstdc++ version 20151204
{what}: negative entry in reference count table, raw({0000000100000001, 00000000006011d6}), refcount(-1073741824): HPX(invalid_data)

{stack-trace}: 6 frames:
0x7f73795d0467  : hpx::termination_handler(int) + 0x117 in /home/inf3/heller/scratch/build/gcc-5.3.0/boost-1.60.0/openmpi-1.10.2/release/hpx/lib/libhpx.so.1
0x7f7377c6d8d0  : ??? + 0x7f7377c6d8d0 in /lib/x86_64-linux-gnu/libpthread.so.0
0x7f737979e5c7  : hpx::threads::coroutines::detail::coroutine_self::set_self(hpx::threads::coroutines::detail::coroutine_self*) + 0x17 in /home/inf3/heller/scratch/build/gcc-5.3.0/boost-1.60.0/openmpi-1.10.2/release/hpx/lib/libhpx.so.1
0x7f737979ee33  : hpx::threads::coroutines::detail::coroutine_impl::operator()() + 0xd3 in /home/inf3/heller/scratch/build/gcc-5.3.0/boost-1.60.0/openmpi-1.10.2/release/hpx/lib/libhpx.so.1
0x7f7379658bc9  : ??? + 0x7f7379658bc9 in /home/inf3/heller/scratch/build/gcc-5.3.0/boost-1.60.0/openmpi-1.10.2/release/hpx/lib/libhpx.so.1
{what}: Segmentation fault
{config}:
  HPX_HAVE_NATIVE_TLS=ON
  HPX_HAVE_STACKTRACES=ON
  HPX_HAVE_COMPRESSION_BZIP2=OFF
  HPX_HAVE_COMPRESSION_SNAPPY=OFF
  HPX_HAVE_COMPRESSION_ZLIB=OFF
  HPX_HAVE_PARCEL_COALESCING=ON
  HPX_HAVE_PARCELPORT_TCP=ON
  HPX_HAVE_PARCELPORT_MPI=ON (OpenMPI V1.10.2, MPI V3.0)
  HPX_HAVE_PARCELPORT_IPC=OFF
  HPX_HAVE_PARCELPORT_IBVERBS=OFF
  HPX_HAVE_VERIFY_LOCKS=OFF
  HPX_HAVE_HWLOC=ON
  HPX_HAVE_ITTNOTIFY=OFF
  HPX_HAVE_RUN_MAIN_EVERYWHERE=OFF
  HPX_PARCEL_MAX_CONNECTIONS=512
  HPX_PARCEL_MAX_CONNECTIONS_PER_LOCALITY=4
  HPX_AGAS_LOCAL_CACHE_SIZE=4096
  HPX_HAVE_MALLOC=jemalloc
  HPX_PREFIX (configured)=/usr/local
  HPX_PREFIX=/home/inf3/heller/scratch/build/gcc-5.3.0/boost-1.60.0/openmpi-1.10.2/release/hpx
{version}: V1.0.0-trunk (AGAS: V3.0), Git: unknown
{boost}: V1.60.0
{build-type}: release
{date}: Jul 26 2016 17:39:23
{platform}: linux
{compiler}: GNU C++ version 5.3.0
{stdlib}: GNU libstdc++ version 20151204
Aborted
