
Significant performance mismatch between MPI and HPX in SMP for allgather example #445

Closed
brycelelbach opened this issue Jul 10, 2012 · 9 comments

Comments

@brycelelbach
Member

[reported by manderson] [Trac time Fri Jul 6 16:40:11 2012]
ea07d6f
Boost 1.48.0
g++ 4.4
OpenMPI 1.4.2 (for MPI equivalent code)
Release Mode
examples/allgather
examples/allgather/mpi_equivalent

Comparing MPI allgather against the HPX allgather example shows HPX running, unexpectedly, an order of magnitude slower than MPI in SMP mode for a simple allgather operation.

Performance Results (timings in seconds):

Tasks      MPI       HPX
  1      4.0E-6    1.1E-4
  2      1.3E-5    1.9E-4
  4      1.4E-5    3.7E-4
  8      9.9E-5    8.2E-4

To reproduce:

MPI executable: a.out

Tasks    MPI                        HPX
  1      mpirun -np 1 ./a.out 1     ./ag_client --np 1 -t 1
  2      mpirun -np 2 ./a.out 1     ./ag_client --np 2 -t 2
  4      mpirun -np 4 ./a.out 1     ./ag_client --np 4 -t 4
  8      mpirun -np 8 ./a.out 1     ./ag_client --np 8 -t 8
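
For reference, a minimal sketch of what the MPI side of such a timing run might look like, assuming one double gathered per rank (this is not the repository's actual examples/allgather/mpi_equivalent code):

    // Hypothetical sketch of an MPI allgather timing run (not the repository's
    // examples/allgather/mpi_equivalent code): every rank contributes one double
    // and receives the contributions of all ranks.
    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double sendval = static_cast<double>(rank);
        std::vector<double> recvbuf(size);

        MPI_Barrier(MPI_COMM_WORLD);   // keep startup and skew out of the timing
        double t0 = MPI_Wtime();
        MPI_Allgather(&sendval, 1, MPI_DOUBLE,
                      &recvbuf[0], 1, MPI_DOUBLE, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        if (rank == 0)
            std::printf("allgather time: %e s\n", t1 - t0);

        MPI_Finalize();
        return 0;
    }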
@brycelelbach
Member Author

[comment by manderson] [Trac time Fri Jul 6 17:01:05 2012] The performance mismatch becomes even more significant in distributed runs, and it shows no sign of improving as the number of processors increases:

All timings are in seconds; startup costs for both codes are excluded, so only the actual allgather communication cost is reported.

Nodes (8 cores/node)      MPI        HPX
   2                    1.73E-4    3.69E-2
   4                    2.13E-4    6.92E-2
   8                    5.37E-4    4.62
  16                    2.02E-4    10.6

@brycelelbach
Member Author

[comment by hkaiser] [Trac time Sun Jul 8 20:53:42 2012] The MPI and HPX codes are not comparable. While the MPI version uses MPI_Allgather, which has a complexity of O(N), where N is the number of participants, the algorithm implemented in the HPX example exposes a complexity of O(N*N); it even gathers the local values. What needs to be done is to develop a new algorithm specifically targeted towards HPX (or, in more general terms, towards message-driven models).

Additionally, what's interesting about your numbers is that the 8-worker MPI version runs 20 times slower than the version with 1 worker (which shouldn't have to do anything, btw), while the HPX example's performance only deteriorates by a factor of 8.
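
To make the message-count argument concrete, here is a small stand-alone sketch (plain C++, no HPX or MPI calls; the counting scheme is an illustrative assumption, not taken from either code base) contrasting the example's everyone-pulls-every-value approach with a recursive-doubling exchange of the kind commonly used inside MPI_Allgather implementations:

    // Illustrative message counts only (an assumption for this sketch, not a
    // measurement): the naive scheme has every one of the n participants fetch
    // every value, including its own, while recursive doubling needs log2(n)
    // rounds with one exchange per participant per round.
    #include <cstdio>

    int main()
    {
        for (int n = 2; n <= 64; n *= 2) {
            int naive = n * n;              // n participants x n fetches each

            int rounds = 0;                 // log2(n) rounds for recursive doubling
            for (int k = 1; k < n; k *= 2)
                ++rounds;
            int doubling = n * rounds;      // one exchange per participant per round

            std::printf("n = %2d   naive: %4d requests   recursive doubling: %4d exchanges\n",
                        n, naive, doubling);
        }
        return 0;
    }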

@brycelelbach
Member Author

[comment by manderson] [Trac time Sun Jul 8 21:35:59 2012] The MPI and HPX codes do the same thing and are comparable:

          sendbuff
          ########
          #      #
        0 #  AA  #
          #      #
          ########
     T    #      #
        1 #  BB  #
     a    #      #
          ########
     s    #      #
        2 #  CC  #                                   BEFORE
     k    #      #
          ########
     s    #      #
        3 #  DD  #
          #      #
          ########
          #      #
        4 #  EE  #
          #      #
          ########

            <---------- recvbuff ---------->
          ####################################
          #      #      #      #      #      #
        0 #  AA  #  BB  #  CC  #  DD  #  EE  #
          #      #      #      #      #      #
          ####################################
     T    #      #      #      #      #      #
        1 #  AA  #  BB  #  CC  #  DD  #  EE  #
     a    #      #      #      #      #      #
          ####################################
     s    #      #      #      #      #      #
        2 #  AA  #  BB  #  CC  #  DD  #  EE  #       AFTER
     k    #      #      #      #      #      #
          ####################################
     s    #      #      #      #      #      #
        3 #  AA  #  BB  #  CC  #  DD  #  EE  #
          #      #      #      #      #      #
          ####################################
          #      #      #      #      #      #
        4 #  AA  #  BB  #  CC  #  DD  #  EE  #
          #      #      #      #      #      #
          ####################################

Removing the local gather has no impact on the reported results. Removing the O(N*N) complexity in the HPX call would remove the ability to extract asynchrony and defeat the purpose of using HPX.

It is difficult to draw conclusions about the MPI numbers from the HPX results, since the latter are orders of magnitude slower. If HPX ran as fast as MPI, would its scaling behavior be the same?

@brycelelbach
Member Author

[comment by blelbach] [Trac time Mon Jul 9 14:02:24 2012] In each compute iteration, this code passes the GIDs to all the components as an argument to each future. This is probably significantly affecting performance, as the GIDs end up being split every 8 iterations. These GIDs are never updated throughout the lifetime of the computation, so there's absolutely no need to pass them to each call to compute_async. Instead, they should be copied once into a data member of the allgather component (or some similar approach).
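
A minimal sketch of that suggestion, using stand-in types rather than the example's actual HPX component code: the participant ids are copied into the component once, and compute() no longer takes them as an argument, so they are not serialized with every invocation.

    // Hypothetical sketch of caching the participant GIDs inside the component
    // (stand-in types; this is not the actual allgather example code).
    #include <cstddef>
    #include <vector>

    typedef int gid_type;   // stand-in for the HPX component id type

    struct allgather_component
    {
        // Called once after all components have been created; the participant
        // set never changes for the lifetime of the computation.
        void set_participants(std::vector<gid_type> const& gids)
        {
            gids_ = gids;
        }

        // compute() reads the cached ids instead of receiving them as an
        // argument on every call, so they are not re-serialized each time.
        void compute()
        {
            for (std::size_t i = 0; i != gids_.size(); ++i) {
                // ... issue the asynchronous gather request to gids_[i] here ...
            }
        }

    private:
        std::vector<gid_type> gids_;
    };

    int main()
    {
        std::vector<gid_type> gids(8);   // e.g. eight participating components
        allgather_component c;
        c.set_participants(gids);        // ship the ids exactly once
        c.compute();                     // no ids travel with the call
        return 0;
    }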

@brycelelbach
Member Author

[comment by manderson] [Trac time Mon Jul 9 14:06:35 2012] There is only one iteration in this example. np is the number of components and each component needs to receive the gids of all other components in order to use the stubs in the asynchronous allgather. There are no extraneous gids sent as suggested above. Further, this is not the source of the performance slowdown. You can easily verify this (no need to speculate as above) by simply commenting out the gather in compute. The performance is near optimal then.

@brycelelbach
Member Author

Is this still an issue? Can someone re-run the aforementioned numbers on the top of trunk?

@hkaiser
Member

hkaiser commented Oct 19, 2012

That seems to be resolved now. Here is the message from Matt:

Just wanted to pass along the latest benchmark for GTC:

  Cores    HPX    MPI
    1      540    470
    2      286    238
    4      161    138
    8       98     96
   16       54
   32       29

There were some segfaults in the MPI code at 16 and 32 (trying to resolve). Looks like HPX will beat MPI in GTC by 16 cores. This is quite an accomplishment since GTC is pretty highly optimized and HPX is only able to extract asynchrony intra-timestep.

@hkaiser hkaiser closed this as completed Oct 19, 2012
@maeneas
Contributor

maeneas commented Oct 19, 2012

Those GTC numbers were for distributed runs and incorporate many more collectives than allgather. I have re-run the allgather HPX and MPI comparison, and the significant mismatch remains regardless of the good results when comparing GTC as a whole in HPX and MPI. This ticket was closed prematurely.

@maeneas maeneas reopened this Oct 19, 2012
@sithhell
Member

HPX will always perform worse than MPI with that type of code. It's a matter of the programming-model differences between MPI and HPX, not an intrinsic problem of HPX.
