
Implement local-only primary namespace service #2232

Closed

Conversation

hkaiser
Member

@hkaiser hkaiser commented Jul 2, 2016

This PR implements an optimization of AGAS for local-only operation. In this mode (command line option --hpx:run-locally) all networking is disabled and the local virtual addresses of global objects are directly encoded in their global identifiers. This removes a large part of the AGAS overheads introduced by global reference counting and global address resolution.

This fixes #1591
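
To make the mechanism concrete, here is a minimal sketch of the core idea in C++. It is illustrative only: the gid_type layout is simplified, and is_local_bit, make_local_gid, and resolve_local are hypothetical names, not the actual HPX declarations.

// A global id is two 64-bit words. In local-only mode the local
// virtual address of the object is stored directly in the id, so
// resolving it is a bit test and a cast instead of an AGAS lookup.
// All names in this sketch are hypothetical.
#include <cstdint>

struct gid_type
{
    std::uint64_t msb;   // upper word: locality id, flags, ...
    std::uint64_t lsb;   // lower word
};

constexpr std::uint64_t is_local_bit = 1ull << 63;   // hypothetical flag

inline gid_type make_local_gid(void* lva)
{
    // encode the local virtual address directly in the identifier
    return gid_type{is_local_bit, reinterpret_cast<std::uint64_t>(lva)};
}

inline void* resolve_local(gid_type const& gid)
{
    // no network round-trip and no resolution table involved
    if ((gid.msb & is_local_bit) != 0)
        return reinterpret_cast<void*>(gid.lsb);
    return nullptr;   // would fall back to full AGAS resolution
}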

- added agas::detail::local_primary_namespace
- simplified and unified the generation of gids
- refactored the performance counters for server::primary_namespace
- moved parts of the implementation of agas::detail::hosted_data_type and agas::detail::bootstrap_data_type into source files
- refactored primary_namespace into a base-class component and two derived classes (see the sketch after this list)
- split generate_unique_ids into two implementations
- disabled AGAS caching when in local mode
- hello_world is running, more testing is required
- this also disables all networking; no other localities are expected to connect
- the command line option --hpx:expect-connecting-localities can now take an (optional) argument
- fly-by: removed the documentation of the configuration settings for the IPC and VERBS parcelports
- fly-by: simplified the implementation of the performance counters for the primary AGAS namespaces
- enabled the performance counters for the local primary AGAS namespace
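
The refactoring into a base class and two derived classes could look roughly like the sketch below. Only the class names primary_namespace and local_primary_namespace come from the PR itself; the member functions shown are illustrative assumptions, not the actual HPX interfaces.

// Rough shape of the refactoring: the distributed implementation
// consults the AGAS tables, the local-only one short-circuits both
// address resolution and global reference counting.
#include <cstdint>

struct gid_type { std::uint64_t msb, lsb; };

struct primary_namespace_base
{
    virtual ~primary_namespace_base() = default;
    virtual void* resolve(gid_type const& gid) = 0;
    virtual void increment_credit(gid_type const& gid, std::int64_t n) = 0;
    virtual void decrement_credit(gid_type const& gid, std::int64_t n) = 0;
};

// Distributed case: full address resolution and global reference counts.
struct primary_namespace : primary_namespace_base
{
    void* resolve(gid_type const&) override
    {
        // would consult the AGAS resolution table, possibly remotely
        return nullptr;
    }
    void increment_credit(gid_type const&, std::int64_t) override
    { /* update the global reference count table */ }
    void decrement_credit(gid_type const&, std::int64_t) override
    { /* update the table; sweep objects whose count drops to zero */ }
};

// Local-only case: the address is encoded in the gid itself, and global
// reference counting becomes a no-op (local reference counts suffice).
struct local_primary_namespace : primary_namespace_base
{
    void* resolve(gid_type const& gid) override
    {
        return reinterpret_cast<void*>(gid.lsb);
    }
    void increment_credit(gid_type const&, std::int64_t) override {}
    void decrement_credit(gid_type const&, std::int64_t) override {}
};

Note that a design along these lines adds a virtual dispatch to every AGAS call, a point that comes up again in the benchmark discussion below.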
@sithhell
Member

sithhell commented Jul 3, 2016

Which kind of applications would benefit from such an optimization? What's the actual performance increase when using this?

@hkaiser
Member Author

hkaiser commented Jul 3, 2016

Which kind of applications would benefit from such an optimization?

Any application which is potentially distributed but has to run on a single locality (embedded devices?). This patch allows global addresses to be used more efficiently when running on one locality. It will also benefit our comparisons with libraries like TBB.

What's the actual performance increase when using this?

I have not done any solid performance analysis. Here are the performance counter results for agas/primary_namespace for hello_world with and without this optimization, though:

hello_world.exe -t8 \
    --hpx:print-counter=/agas{locality#*/total}/primary/count \
    --hpx:print-counter=/agas{locality#*/total}/primary/time  
hello world from OS-thread 2 on locality 0
hello world from OS-thread 6 on locality 0
hello world from OS-thread 1 on locality 0
hello world from OS-thread 5 on locality 0
hello world from OS-thread 4 on locality 0
hello world from OS-thread 7 on locality 0
hello world from OS-thread 0 on locality 0
hello world from OS-thread 3 on locality 0
/agas{locality#0/total}/primary/count,1,0.012257,[s],56
/agas{locality#0/total}/primary/time,1,0.010653,[s],110242,[ns]
hello_world.exe -t8 --hpx:run-locally \
    --hpx:print-counter=/agas{locality#*/total}/primary/count \
    --hpx:print-counter=/agas{locality#*/total}/primary/time  
hello world from OS-thread 2 on locality 0
hello world from OS-thread 6 on locality 0
hello world from OS-thread 1 on locality 0
hello world from OS-thread 5 on locality 0
hello world from OS-thread 4 on locality 0
hello world from OS-thread 7 on locality 0
hello world from OS-thread 0 on locality 0
hello world from OS-thread 3 on locality 0
/agas{locality#0/total}/primary/count,1,0.011453,[s],52
/agas{locality#0/total}/primary/time,1,0.011582,[s],45498,[ns]

So it looks like, in this case, both the number of calls and the time required to execute them are reduced.
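
If the [ns] value is the cumulative time spent in all counted calls, normalizing by the call count makes the per-call improvement more visible:

without --hpx:run-locally: 110242 ns / 56 calls ≈ 1968 ns per call
with --hpx:run-locally:     45498 ns / 52 calls ≈  875 ns per call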

@sithhell
Member

sithhell commented Jul 3, 2016

Which kind of applications would benefit from such an optimization?

Any application which is potentially distributed but has to run on a single locality (embedded devices?). This patch allows global addresses to be used more efficiently when running on one locality. It will also benefit our comparisons with libraries like TBB.

That was mainly the reason why I asked for performance data ... I am not sure that this optimization buys us a lot in general, especially in comparison with TBB, where we would compare a solution written for the distributed case with a completely local solution (apples and oranges). Our parallel algorithms etc. don't really depend on AGAS anyway.

Having this option enabled might also paint a distorted picture once you go distributed: given that the overheads are indeed lower, you need to adapt grain sizes etc. once again. I am not sure the optimization is really worth it in the end.

What's the actual performance increase when using this?

I have not done any solid performance analysis. Here are the performance counter results for agas/primary_namespace for hello_world with and without this optimization, though: [...]

So it looks like, in this case, both the number of calls and the time required to execute them are reduced.

Why do we have fewer requests all of a sudden? But the timing looks nice. It would be nice to have a baseline against master.
Hello World is of course not really a meaningful benchmark. In addition, the total runtime seems to be roughly the same.

P.S.: I am not done reading through the code yet ;)

@hkaiser
Member Author

hkaiser commented Jul 3, 2016

Having this option enabled might also bring a distorted picture once you go to distributed

I agree. I'm not sure myself how many applications are written with AGAS in mind but might have to run on a single locality in the end.

Our parallel algorithms etc. don't really depend on AGAS anyway.

Not quite. The segmented algorithms depend on it.

Why do we have fewer requests all of a sudden?

Fewer operations are needed for initializing things in this case, so in the end it's a one-time benefit.

Hello World is of course not really a meaningful benchmark. In addition, the total runtime seems to be roughly the same.

I absolutely agree, all the more as the overall runtime is probably dominated by the console IO anyway.

Overall, this PR has potential benefits; the question is how big the maintenance burden of having it in would be. As I think it wouldn't be too large, I'd like to have it available...

@sithhell sithhell modified the milestones: 0.9.99, 1.0.0 Jul 15, 2016
@hkaiser
Member Author

hkaiser commented Jul 26, 2016

I would like to go ahead and merge this. Are there any objections, still?

@sithhell
Member

sithhell commented Jul 26, 2016 via email

On Tuesday, 26 July 2016, 06:26:44 CEST, Hartmut Kaiser wrote:

I would like to go ahead and merge this. Are there any objections, still?

My main objection still is whether this is really worth it; we add a significant amount of code which has to be maintained. I would really like to see a benchmark (apart from hello world) showing real improvements.

@sithhell
Member

sithhell commented Jul 26, 2016 via email

The transpose example might be a good candidate to give an indication of the performance benefits.


@hkaiser
Member Author

hkaiser commented Jul 26, 2016

Here are some results from running transpose_block:

transpose_block.exe -t12 --iterations=100 --matrix_size=10240 --num_blocks=16
Finding blocks ...
Matrix transpose: B = A^T
Matrix order          = 10240
Matrix local columns  = 640
Number of blocks      = 16
Number of localities  = 1
Untiled
Number of iterations  = 100
Finding blocks A ... done
Finding blocks B ... done
Solution validates
Rate (MB/s): 9154.42, Avg time (s): 0.184647, Min time (s): 0.183269, Max time (s): 0.187131

transpose_block.exe -t12 --iterations=100 --matrix_size=10240 --num_blocks=16 --hpx:run-locally
Finding blocks ...
Matrix transpose: B = A^T
Matrix order          = 10240
Matrix local columns  = 640
Number of blocks      = 16
Number of localities  = 1
Untiled
Number of iterations  = 100
Finding blocks A ... done
Finding blocks B ... done
Solution validates
Rate (MB/s): 9163.28, Avg time (s): 0.185206, Min time (s): 0.183092, Max time (s): 0.196651

So the difference is not significant, but measurable.

@sithhell
Member

sithhell commented Jul 26, 2016

In addition to the posted results, I did a similar run and compared it to master:

I ran the following command:

./bin/transpose_block_numa --transpose-threads=8 --transpose-numa-domains=2 \
        --matrix_size=24000 --num_blocks=120 \
        --hpx:print-counter=/agas{locality#*/total}/primary/count \
        --hpx:print-counter=/agas{locality#*/total}/primary/time

master:

Rate (MB/s): 25510.5, Avg time (s): 0.363383, Min time (s): 0.361263, Max time (s): 0.365022
/agas{locality#0/total}/primary/count,1,6.591869,[s],290462
/agas{locality#0/total}/primary/time,1,6.591697,[s],6.56639e+09,[ns]

fixing_1591:

Rate (MB/s): 25160.4, Avg time (s): 0.366579, Min time (s): 0.366289, Max time (s): 0.366936
/agas{locality#0/total}/primary/count,1,6.621262,[s],290703
/agas{locality#0/total}/primary/time,1,6.621201,[s],6.31083e+09,[ns]

fixing_1591 (with --hpx:run-locally):

Rate (MB/s): 25208.1, Avg time (s): 0.365771, Min time (s): 0.365596, Max time (s): 0.366282
/agas{locality#0/total}/primary/count,1,6.619994,[s],290222
/agas{locality#0/total}/primary/time,1,6.620322,[s],4.0047e+07,[ns]
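
To put the [ns] values side by side (assuming they are the cumulative time spent in primary namespace operations):

master:                              6.56639e+09 ns  (~6.6 s)
fixing_1591:                         6.31083e+09 ns  (~6.3 s)
fixing_1591 with --hpx:run-locally:  4.00470e+07 ns  (~0.04 s)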

In order to increase the load on AGAS, I changed the number of blocks:

./bin/transpose_block_numa --transpose-threads=8 --transpose-numa-domains=2 \
        --matrix_size=24000 --num_blocks=240 \
        --hpx:print-counter=/agas{locality#*/total}/primary/count \
        --hpx:print-counter=/agas{locality#*/total}/primary/time

master:

Rate (MB/s): 19784.4, Avg time (s): 0.470493, Min time (s): 0.465821, Max time (s): 0.475618
/agas{locality#0/total}/primary/count,1,7.713836,[s],1.15689e+06
/agas{locality#0/total}/primary/time,1,7.715651,[s],2.29584e+10,[ns]

fixing_1591:

Rate (MB/s): 18909.5, Avg time (s): 0.489008, Min time (s): 0.487375, Max time (s): 0.490676
/agas{locality#0/total}/primary/count,1,7.738666,[s],1.15738e+06
/agas{locality#0/total}/primary/time,1,7.738564,[s],1.45573e+10,[ns]

fixing_1591 (with --hpx:run-locally):

Rate (MB/s): 19432.5, Avg time (s): 0.478318, Min time (s): 0.474256, Max time (s): 0.479987
/agas{locality#0/total}/primary/count,1,7.628902,[s],1.15642e+06
/agas{locality#0/total}/primary/time,1,7.630093,[s],1.3725e+08,[ns]

So the overall result is that adding --hpx:run-locally slightly increases performance on the fixing_1591 branch, but it still lags a little behind master. That is, there is no performance improvement with respect to what we already have. Without any profiling information, the only explanation I have for this is that the virtual function dispatch is responsible.

The run on fixing_1591 without --hpx:run-locally unfortunately sometimes fails with the following exception:

{stack-trace}: 4 frames:
0x7f737961873d  : hpx::detail::backtrace[abi:cxx11](unsigned long) + 0x9d in /home/inf3/heller/scratch/build/gcc-5.3.0/boost-1.60.0/openmpi-1.10.2/release/hpx/lib/libhpx.so.1
0x7f737964962a  : boost::exception_ptr hpx::detail::get_exception<hpx::exception>(hpx::exception const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xaa in /home/inf3/heller/scratch/build/gcc-5.3.0/boost-1.60.0/openmpi-1.10.2/release/hpx/lib/libhpx.so.1
0x7f7379649b9e  : void hpx::detail::throw_exception<hpx::exception>(hpx::exception const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long) + 0x4e in /home/inf3/heller/scratch/build/gcc-5.3.0/boost-1.60.0/openmpi-1.10.2/release/hpx/lib/libhpx.so.1
0x7f7379655f5e  : hpx::detail::throw_exception(hpx::error, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long) + 0x4e in /home/inf3/heller/scratch/build/gcc-5.3.0/boost-1.60.0/openmpi-1.10.2/release/hpx/lib/libhpx.so.1
{locality-id}: 0
{hostname}: [ (tcp:131.188.33.179:7910) ]
{process-id}: 25975
{function}: primary_namespace::decrement_sweep
{file}: /home/inf3/heller/programming/hpx/src/runtime/agas/server/primary_namespace_server.cpp
{line}: 1019
{os-thread}: 10, worker-thread#10
{thread-id}: 00007f7365223ec0
{thread-description}: <unknown>
{state}: state_running
{auxinfo}:
{config}:
  HPX_HAVE_NATIVE_TLS=ON
  HPX_HAVE_STACKTRACES=ON
  HPX_HAVE_COMPRESSION_BZIP2=OFF
  HPX_HAVE_COMPRESSION_SNAPPY=OFF
  HPX_HAVE_COMPRESSION_ZLIB=OFF
  HPX_HAVE_PARCEL_COALESCING=ON
  HPX_HAVE_PARCELPORT_TCP=ON
  HPX_HAVE_PARCELPORT_MPI=ON (OpenMPI V1.10.2, MPI V3.0)
  HPX_HAVE_PARCELPORT_IPC=OFF
  HPX_HAVE_PARCELPORT_IBVERBS=OFF
  HPX_HAVE_VERIFY_LOCKS=OFF
  HPX_HAVE_HWLOC=ON
  HPX_HAVE_ITTNOTIFY=OFF
  HPX_HAVE_RUN_MAIN_EVERYWHERE=OFF
  HPX_PARCEL_MAX_CONNECTIONS=512
  HPX_PARCEL_MAX_CONNECTIONS_PER_LOCALITY=4
  HPX_AGAS_LOCAL_CACHE_SIZE=4096
  HPX_HAVE_MALLOC=jemalloc
  HPX_PREFIX (configured)=/usr/local
  HPX_PREFIX=/home/inf3/heller/scratch/build/gcc-5.3.0/boost-1.60.0/openmpi-1.10.2/release/hpx
{version}: V1.0.0-trunk (AGAS: V3.0), Git: unknown
{boost}: V1.60.0
{build-type}: release
{date}: Jul 26 2016 17:39:23
{platform}: linux
{compiler}: GNU C++ version 5.3.0
{stdlib}: GNU libstdc++ version 20151204
{what}: negative entry in reference count table, raw({0000000100000001, 00000000006011d6}), refcount(-1073741824): HPX(invalid_data)

{stack-trace}: 6 frames:
0x7f73795d0467  : hpx::termination_handler(int) + 0x117 in /home/inf3/heller/scratch/build/gcc-5.3.0/boost-1.60.0/openmpi-1.10.2/release/hpx/lib/libhpx.so.1
0x7f7377c6d8d0  : ??? + 0x7f7377c6d8d0 in /lib/x86_64-linux-gnu/libpthread.so.0
0x7f737979e5c7  : hpx::threads::coroutines::detail::coroutine_self::set_self(hpx::threads::coroutines::detail::coroutine_self*) + 0x17 in /home/inf3/heller/scratch/build/gcc-5.3.0/boost-1.60.0/openmpi-1.10.2/release/hpx/lib/libhpx.so.1
0x7f737979ee33  : hpx::threads::coroutines::detail::coroutine_impl::operator()() + 0xd3 in /home/inf3/heller/scratch/build/gcc-5.3.0/boost-1.60.0/openmpi-1.10.2/release/hpx/lib/libhpx.so.1
0x7f7379658bc9  : ??? + 0x7f7379658bc9 in /home/inf3/heller/scratch/build/gcc-5.3.0/boost-1.60.0/openmpi-1.10.2/release/hpx/lib/libhpx.so.1
{what}: Segmentation fault
{config}:
  HPX_HAVE_NATIVE_TLS=ON
  HPX_HAVE_STACKTRACES=ON
  HPX_HAVE_COMPRESSION_BZIP2=OFF
  HPX_HAVE_COMPRESSION_SNAPPY=OFF
  HPX_HAVE_COMPRESSION_ZLIB=OFF
  HPX_HAVE_PARCEL_COALESCING=ON
  HPX_HAVE_PARCELPORT_TCP=ON
  HPX_HAVE_PARCELPORT_MPI=ON (OpenMPI V1.10.2, MPI V3.0)
  HPX_HAVE_PARCELPORT_IPC=OFF
  HPX_HAVE_PARCELPORT_IBVERBS=OFF
  HPX_HAVE_VERIFY_LOCKS=OFF
  HPX_HAVE_HWLOC=ON
  HPX_HAVE_ITTNOTIFY=OFF
  HPX_HAVE_RUN_MAIN_EVERYWHERE=OFF
  HPX_PARCEL_MAX_CONNECTIONS=512
  HPX_PARCEL_MAX_CONNECTIONS_PER_LOCALITY=4
  HPX_AGAS_LOCAL_CACHE_SIZE=4096
  HPX_HAVE_MALLOC=jemalloc
  HPX_PREFIX (configured)=/usr/local
  HPX_PREFIX=/home/inf3/heller/scratch/build/gcc-5.3.0/boost-1.60.0/openmpi-1.10.2/release/hpx
{version}: V1.0.0-trunk (AGAS: V3.0), Git: unknown
{boost}: V1.60.0
{build-type}: release
{date}: Jul 26 2016 17:39:23
{platform}: linux
{compiler}: GNU C++ version 5.3.0
{stdlib}: GNU libstdc++ version 20151204
Aborted
