
Significant performance mismatch between MPI and HPX in SMP for allgather example #445

Closed
brycelelbach opened this issue Jul 10, 2012 · 9 comments

Comments

@brycelelbach
Member

[reported by manderson] [Trac time Fri Jul 6 16:40:11 2012]
ea07d6f
Boost 1.48.0
g++ 4.4
OpenMPI 1.4.2 (for MPI equivalent code)
Release Mode
examples/allgather
examples/allgather/mpi_equivalent

Comparing MPI allgather against the HPX allgather example shows HPX running, unexpectedly, an order of magnitude slower than MPI in SMP mode for a simple allgather operation.

Performance Results (timings in seconds):

Tasks      MPI       HPX
  1      4.0E-6    1.1E-4
  2      1.3E-5    1.9E-4
  4      1.4E-5    3.7E-4
  8      9.9E-5    8.2E-4

To reproduce:

MPI executable: a.out

Tasks    MPI                        HPX
  1      mpirun -np 1 ./a.out 1     ./ag_client --np 1 -t 1
  2      mpirun -np 2 ./a.out 1     ./ag_client --np 2 -t 2
  4      mpirun -np 4 ./a.out 1     ./ag_client --np 4 -t 4
  8      mpirun -np 8 ./a.out 1     ./ag_client --np 8 -t 8
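
For reference, a minimal sketch of what the MPI side of such a timing run might look like, assuming one double gathered per rank (this is not the repository's actual examples/allgather/mpi_equivalent code):

    // Hypothetical sketch of an MPI allgather timing run (not the repository's
    // examples/allgather/mpi_equivalent code): every rank contributes one double
    // and receives the contributions of all ranks.
    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double sendval = static_cast<double>(rank);
        std::vector<double> recvbuf(size);

        MPI_Barrier(MPI_COMM_WORLD);   // keep startup and skew out of the timing
        double t0 = MPI_Wtime();
        MPI_Allgather(&sendval, 1, MPI_DOUBLE,
                      &recvbuf[0], 1, MPI_DOUBLE, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        if (rank == 0)
            std::printf("allgather time: %e s\n", t1 - t0);

        MPI_Finalize();
        return 0;
    }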
@brycelelbach
Member Author

[comment by manderson] [Trac time Fri Jul 6 17:01:05 2012] The performance mismatch becomes even more significant in distributed runs, and it shows no sign of improving as the number of processors increases:

All timings are in seconds; startup costs for both codes are excluded, so only the actual allgather communication cost is reported.

Nodes (8 cores/node)      MPI        HPX
   2                    1.73E-4    3.69E-2
   4                    2.13E-4    6.92E-2
   8                    5.37E-4    4.62
  16                    2.02E-4    10.6

@brycelelbach
Member Author

[comment by hkaiser] [Trac time Sun Jul 8 20:53:42 2012] The MPI and HPX codes are not comparable. While the MPI version uses MPI_Allgather, which has a complexity of O(N), where N is the number of participants, the algorithm implemented in the HPX example exposes a complexity of O(N*N); it even gathers the local values. What needs to be done is to develop a new algorithm specifically targeted towards HPX (or, in more general terms, towards message-driven models).

Additionally, what's interesting about your numbers is that the 8-worker MPI version runs 20 times slower than the version with 1 worker (which shouldn't have to do anything, btw), while the HPX example's performance only deteriorates by a factor of 8.
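
To make the message-count argument concrete, here is a small stand-alone sketch (plain C++, no HPX or MPI calls; the counting scheme is an illustrative assumption, not taken from either code base) contrasting the example's everyone-pulls-every-value approach with a recursive-doubling exchange of the kind commonly used inside MPI_Allgather implementations:

    // Illustrative message counts only (an assumption for this sketch, not a
    // measurement): the naive scheme has every one of the n participants fetch
    // every value, including its own, while recursive doubling needs log2(n)
    // rounds with one exchange per participant per round.
    #include <cstdio>

    int main()
    {
        for (int n = 2; n <= 64; n *= 2) {
            int naive = n * n;              // n participants x n fetches each

            int rounds = 0;                 // log2(n) rounds for recursive doubling
            for (int k = 1; k < n; k *= 2)
                ++rounds;
            int doubling = n * rounds;      // one exchange per participant per round

            std::printf("n = %2d   naive: %4d requests   recursive doubling: %4d exchanges\n",
                        n, naive, doubling);
        }
        return 0;
    }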

@brycelelbach
Member Author

[comment by manderson] [Trac time Sun Jul 8 21:35:59 2012] The MPI and HPX codes do the same thing and are comparable:

          sendbuff
          ########
          #      #
        0 #  AA  #
          #      #
          ########
     T    #      #
        1 #  BB  #
     a    #      #
          ########
     s    #      #
        2 #  CC  #                                   BEFORE
     k    #      #
          ########
     s    #      #
        3 #  DD  #
          #      #
          ########
          #      #
        4 #  EE  #
          #      #
          ########

            <---------- recvbuff ---------->
          ####################################
          #      #      #      #      #      #
        0 #  AA  #  BB  #  CC  #  DD  #  EE  #
          #      #      #      #      #      #
          ####################################
     T    #      #      #      #      #      #
        1 #  AA  #  BB  #  CC  #  DD  #  EE  #
     a    #      #      #      #      #      #
          ####################################
     s    #      #      #      #      #      #
        2 #  AA  #  BB  #  CC  #  DD  #  EE  #       AFTER
     k    #      #      #      #      #      #
          ####################################
     s    #      #      #      #      #      #
        3 #  AA  #  BB  #  CC  #  DD  #  EE  #
          #      #      #      #      #      #
          ####################################
          #      #      #      #      #      #
        4 #  AA  #  BB  #  CC  #  DD  #  EE  #
          #      #      #      #      #      #
          ####################################

Removing the local gather has no impact on the reported results. Removing the O(N*N) complexity in the HPX call would remove the ability to extract asynchrony and defeat the purpose of using HPX.

It is difficult to draw conclusions about the MPI numbers from the HPX results, since the latter are orders of magnitude slower. If HPX ran as fast as MPI, would its scaling behavior be the same?

@brycelelbach
Member Author

[comment by blelbach] [Trac time Mon Jul 9 14:02:24 2012] In each compute iteration, this code passes the GIDs to all the components as an argument to each future. This is probably significantly affecting performance, as the GIDs end up being split every 8 iterations. These GIDs are never updated throughout the lifetime of the computation, so there's absolutely no need to pass them to each call to compute_async. Instead, they should be copied once into a data member of the allgather component (or some similar approach).
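
A minimal sketch of that suggestion, using stand-in types rather than the example's actual HPX component code: the participant ids are copied into the component once, and compute() no longer takes them as an argument, so they are not serialized with every invocation.

    // Hypothetical sketch of caching the participant GIDs inside the component
    // (stand-in types; this is not the actual allgather example code).
    #include <cstddef>
    #include <vector>

    typedef int gid_type;   // stand-in for the HPX component id type

    struct allgather_component
    {
        // Called once after all components have been created; the participant
        // set never changes for the lifetime of the computation.
        void set_participants(std::vector<gid_type> const& gids)
        {
            gids_ = gids;
        }

        // compute() reads the cached ids instead of receiving them as an
        // argument on every call, so they are not re-serialized each time.
        void compute()
        {
            for (std::size_t i = 0; i != gids_.size(); ++i) {
                // ... issue the asynchronous gather request to gids_[i] here ...
            }
        }

    private:
        std::vector<gid_type> gids_;
    };

    int main()
    {
        std::vector<gid_type> gids(8);   // e.g. eight participating components
        allgather_component c;
        c.set_participants(gids);        // ship the ids exactly once
        c.compute();                     // no ids travel with the call
        return 0;
    }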

@brycelelbach
Member Author

[comment by manderson] [Trac time Mon Jul 9 14:06:35 2012] There is only one iteration in this example. np is the number of components and each component needs to receive the gids of all other components in order to use the stubs in the asynchronous allgather. There are no extraneous gids sent as suggested above. Further, this is not the source of the performance slowdown. You can easily verify this (no need to speculate as above) by simply commenting out the gather in compute. The performance is near optimal then.

@brycelelbach
Member Author

Is this still an issue? Can someone re-run the aforementioned numbers on the top of trunk?

@hkaiser
Member

hkaiser commented Oct 19, 2012

That seems to be resolved now. Here is the message from Matt:

Just wanted to pass along the latest benchmark for GTC:

  Cores    HPX    MPI
    1      540    470
    2      286    238
    4      161    138
    8       98     96
   16       54
   32       29

There were some segfaults in the MPI code at 16 and 32 (trying to resolve). Looks like HPX will beat MPI in GTC by 16 cores. This is quite an accomplishment since GTC is pretty highly optimized and HPX is only able to extract asynchrony intra-timestep.

@hkaiser hkaiser closed this as completed Oct 19, 2012
@maeneas
Contributor

maeneas commented Oct 19, 2012

Those GTC numbers were for distributed runs and incorporate many more collectives than allgather. I have re-run the allgather HPX and MPI comparison, and the significant mismatch remains regardless of the good results when comparing GTC as a whole in HPX and MPI. This ticket was closed prematurely.

@maeneas maeneas reopened this Oct 19, 2012
@sithhell
Member

HPX will always perform worse than MPI with that type of code. It's a matter of the programming-model differences between MPI and HPX, not an intrinsic problem of HPX.
