
Get rid of ga_malloc #330

Closed
bjpalmer opened this issue Jul 25, 2024 · 45 comments · Fixed by #348

Comments

@bjpalmer
Member

Should we get rid of ga_malloc? It is used in a few places inside the GA code but not extensively. The main reason for getting rid of it would be to get rid of obscure errors that occur if you have not allocated a lot of memory using MA_Init. The original justification for using it was to improve performance by using pinned memory, but on the few occasions when I've swapped regular malloc for ga_malloc, I haven't seen any difference. @edoapra, @jeffhammond, what do you think?
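
For context, a minimal sketch of the failure mode described above, assuming the standard GA program skeleton (the MA_init sizes here are illustrative, and the exact error text depends on the build):

#include <mpi.h>
#include "ga.h"
#include "macdecls.h"  /* MA interface: MA_init, MT_DBL, ... */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    GA_Initialize();

    /* ga_malloc draws on the pool created here. If the heap argument
       is too small, later internal allocations fail with an MA
       out-of-memory error that looks unrelated to user code. */
    if (!MA_init(MT_DBL, 1000 /* stack */, 1000 /* heap */))
        GA_Error("MA_init failed", 0);

    /* ... GA calls that internally use ga_malloc may now abort
       with an obscure MA allocation error ... */

    GA_Terminate();
    MPI_Finalize();
    return 0;
}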

@bjpalmer changed the title from "Get rid of ga_mallo" to "Get rid of ga_malloc" on Jul 25, 2024
@jeffhammond
Member

Yes, it should go away. I recall I had a PR for this as part of the thread-safety work six years ago.

@edoapra
Contributor

edoapra commented Jul 25, 2024

Unfortunately, ga_malloc() is used in an NWChem routine:
https://github.com/nwchemgit/nwchem/blob/master/src/util/ga_asymmetr.c

Should I simply replace ga_malloc() with vanilla malloc()?

@bjpalmer
Member Author

You could probably still use it, if you want. I'd just propose getting rid of it inside GA. But same comment applies: does it look like it is doing anything for you inside NWChem?

@jeffhammond
Member

Yes, we should replace it with malloc.

edoapra added a commit to edoapra/nwchem that referenced this issue Jul 25, 2024
@jeffhammond
Member

One reason to stop using ga_malloc is that it creates a less-than-obvious dependency on MA, as noted in #211. We either want to use regular malloc or ARMCI_Malloc_local.
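
For contrast, a sketch of the two replacement options (the alloc_* helper names are hypothetical; ARMCI_Malloc_local and ARMCI_Free_local are ARMCI's local-allocation calls):

#include <stdlib.h>
#include "armci.h"  /* ARMCI_Malloc_local, ARMCI_Free_local */

/* Option 1: plain malloc. No MA dependency and no registration. */
static double *alloc_plain(size_t nelem)
{
    return malloc(nelem * sizeof(double));
}

/* Option 2: ARMCI_Malloc_local. Still a purely local allocation,
   but the runtime may return registered (pinned) memory, which can
   matter for one-sided transfers on some networks. Pair with
   ARMCI_Free_local rather than free. */
static double *alloc_armci(size_t nelem)
{
    return ARMCI_Malloc_local((armci_size_t)(nelem * sizeof(double)));
}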

edoapra added a commit to omarkahmed/nwchem that referenced this issue Jul 26, 2024
@bjpalmer
Member Author

bjpalmer commented Aug 6, 2024

I created a branch, feature/malloc, that replaces ga_malloc with malloc. The test suite mostly runs but there may be a few hangs. However, there also appear to be a few hangs in develop, so I need to track those down. I put the changes inside an ifdef preprocessor block, so it should be easy to switch back and forth from ga_malloc to malloc by uncommenting the #define USE_GA_MALLOC line in base.h. If anyone gets a chance, can you try this branch out and see if there are any performance differences?
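
The switch described here presumably looks something like the following sketch (GA_ALLOC and GA_FREE are hypothetical names; the ga_malloc signature of element count, GA type, and name is inferred from the pnga_malloc discussion later in this thread, and GAsizeof is assumed to return the byte size of a GA type):

/* In base.h: uncomment to switch internal allocations back to
   ga_malloc (sketch only; the actual header may differ). */
/* #define USE_GA_MALLOC */

#ifdef USE_GA_MALLOC
#  define GA_ALLOC(nelem, type, name) ga_malloc((nelem), (type), (name))
#  define GA_FREE(ptr)                ga_free(ptr)
#else
#  include <stdlib.h>
#  define GA_ALLOC(nelem, type, name) malloc((size_t)(nelem) * GAsizeof(type))
#  define GA_FREE(ptr)                free(ptr)
#endif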

@edoapra
Contributor

edoapra commented Aug 6, 2024

I have just run the following tests on deception with OpenMPI 4.1.4 on the feature/malloc branch, using MPI-PR, without any issue:

srun --mpi=pmi2 -N 2 --tasks-per-node=3 global/testing/mir_perf2.x
srun --mpi=pmi2 -N 2 --tasks-per-node=3 global/testing/perfmod.x
srun --mpi=pmi2 -N 2 --tasks-per-node=3 global/testing/perf.x
srun --mpi=pmi2 -N 2 --tasks-per-node=3 global/testing/perform.x

@bjpalmer
Member Author

@edoapra, what compiler are you using? I tried running the mir_perf2.x test standalone and it still looks like it is hanging.

@edoapra
Contributor

edoapra commented Aug 12, 2024

gcc

@bjpalmer
Member Author

Which version? I'm using 8.1.0.

@edoapra
Contributor

edoapra commented Aug 12, 2024

The default, 4.8.5. Let me try 8.

@edoapra
Contributor

edoapra commented Aug 12, 2024

The OpenMPI installation for the combination with gcc 8.1 is broken on deception.

@edoapra
Contributor

edoapra commented Aug 12, 2024

srun --mpi=pmi2 -N 2 --tasks-per-node=3 global/testing/mir_perf2.x
[dmi01.local:247150] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
[dmi01.local:247152] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
[dmi01.local:247151] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

@bjpalmer
Member Author

I thought you were running on constance.

@edoapra
Contributor

edoapra commented Aug 12, 2024

I didn't know constance was still around.

@bjpalmer
Member Author

Yep, still is. I tried running on deception with gcc/9.4.0 and openmpi/4.1.4 and got the same hangs. I'll try gcc/4.8.5 and see what happens.

@edoapra
Contributor

edoapra commented Aug 12, 2024

[edo@constance01 bp3.cons]$ module load gcc/8.1.0
[edo@constance01 bp3.cons]$ module load openmpi
	 The combination of compiler and MPI does not appear to be built yet

@bjpalmer
Member Author

Try gcc/9.4.0

@edoapra
Contributor

edoapra commented Aug 12, 2024

gcc/4.4.7 is busted on constance

@edoapra
Contributor

edoapra commented Aug 12, 2024

Try gcc/9.4.0

On constance or deception?

@bjpalmer
Member Author

deception

@edoapra
Contributor

edoapra commented Aug 12, 2024

Works for me. Just a warning in the initialization phase:

module purge
module load gcc/9.4.0
module load openmpi/4.1.4
../ga_feature_malloc/configure --with-mpi-pr
make 
make check-ga
salloc  -N 2
srun --mpi=pmi2 -N 2 --tasks-per-node=3 global/testing/mir_perf2.x
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   dmi03
  Local device: mlx5_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   dmi03
  Local device: mlx5_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   dmi03
  Local device: mlx5_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   dmi01
  Local device: mlx5_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   dmi01
  Local device: mlx5_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   dmi01
  Local device: mlx5_0
--------------------------------------------------------------------------
  
  
  
 Performance of GA get, put & acc for 1-dimensional sections of array[1048576]
  
  
                          Local 1-D Array Section  
     section           get               put           accumulate
  bytes    dim     sec      MB/s     sec      MB/s     sec      MB/s
      8      1  .430D-06 .186D+02 .352D-06 .227D+02 .527D-06 .152D+02
     72      9  .421D-06 .171D+03 .359D-06 .200D+03 .523D-06 .138D+03
    128     16  .421D-06 .304D+03 .371D-06 .345D+03 .523D-06 .245D+03
    648     81  .434D-06 .149D+04 .385D-06 .168D+04 .552D-06 .117D+04
   2048    256  .460D-06 .446D+04 .436D-06 .470D+04 .608D-06 .337D+04
   4608    576  .525D-06 .878D+04 .508D-06 .906D+04 .736D-06 .626D+04
   7200    900  .580D-06 .124D+05 .585D-06 .123D+05 .862D-06 .835D+04
  18432   2304  .944D-06 .195D+05 .995D-06 .185D+05 .140D-05 .132D+05
  32768   4096  .141D-05 .233D+05 .144D-05 .228D+05 .205D-05 .160D+05
  66248   8281  .259D-05 .256D+05 .253D-05 .261D+05 .363D-05 .183D+05
 131072  16384  .456D-05 .287D+05 .456D-05 .288D+05 .673D-05 .195D+05
 233928  29241  .796D-05 .294D+05 .871D-05 .269D+05 .119D-04 .197D+05
 524288  65536  .190D-04 .276D+05 .207D-04 .253D+05 .252D-04 .208D+05
 996872 124609  .343D-04 .291D+05 .337D-04 .295D+05 .498D-04 .200D+05
1548800 193600  .529D-04 .293D+05 .523D-04 .296D+05 .714D-04 .217D+05
2097152 262144  .707D-04 .296D+05 .697D-04 .301D+05 .963D-04 .218D+05
  
                          Remote 1-D Array Section 
     section           get               put           accumulate
  bytes    dim     sec      MB/s     sec      MB/s     sec      MB/s
      8      1  .315D-05 .254D+01 .845D-06 .947D+01 .974D-06 .821D+01
     72      9  .295D-05 .244D+02 .849D-06 .848D+02 .982D-06 .733D+02
    128     16  .303D-05 .422D+02 .853D-06 .150D+03 .989D-06 .129D+03
    648     81  .386D-05 .168D+03 .101D-05 .642D+03 .111D-05 .584D+03
   2048    256  .426D-05 .480D+03 .104D-05 .198D+04 .110D-05 .186D+04
   4608    576  .506D-05 .911D+03 .115D-05 .402D+04 .118D-05 .390D+04
   7200    900  .562D-05 .128D+04 .123D-05 .586D+04 .128D-05 .561D+04
  18432   2304  .786D-05 .234D+04 .177D-05 .104D+05 .182D-05 .101D+05
  32768   4096  .914D-05 .359D+04 .212D-05 .154D+05 .221D-05 .148D+05
  66248   8281  .126D-04 .526D+04 .349D-05 .190D+05 .367D-05 .180D+05
 131072  16384  .186D-04 .706D+04 .566D-05 .232D+05 .595D-05 .220D+05
 233928  29241  .289D-04 .810D+04 .980D-05 .239D+05 .989D-05 .236D+05
 524288  65536  .247D-03 .212D+04 .473D-04 .111D+05 .283D-04 .185D+05
 996872 124609  .236D-03 .423D+04 .657D-04 .152D+05 .554D-04 .180D+05
1548800 193600  .195D-03 .796D+04 .485D-03 .319D+04 .248D-03 .625D+04
2097152 262144  .288D-03 .729D+04 .180D-03 .116D+05 .308D-03 .680D+04
 All tests successful 

@edoapra
Contributor

edoapra commented Aug 12, 2024

I am using the incline queue on deception.
srun -p incline -N 2

@edoapra
Contributor

edoapra commented Aug 12, 2024

setenv MCA_btl_openib_allow_ib true gets rid of the OpenFabrics warning

@bjpalmer
Member Author

Okay, you must be setting something that I am not. I've loaded gcc/4.8.5 (I've also tried gcc/9.4.0) and openmpi/4.1.4, configured using your configure line and run using the submit script

#!/bin/csh
#SBATCH -t 01:30:00
#SBATCH -A panyaya
#SBATCH -p incline
#SBATCH -N 2
#SBATCH -n 6
#SBATCH -o ./test.out
#SBATCH -e ./test.err

source /etc/profile.d/modules.csh

#source ~/set_deception2
module purge

module load gcc/4.8.5
module load openmpi/4.1.4
env | grep PATH
module list

setenv MCA_btl_openib_allow_ib true
srun -N 2 --tasks-per-node=3 mir_perf2.x

and the code still hangs.

@bjpalmer
Member Author

I just built the develop branch and am running the test suite. That looks like it works. There appear to have been a lot of changes since I spun off the ga_malloc branch.

@edoapra
Contributor

edoapra commented Aug 13, 2024

Okay, you must be setting something that I am not. I've loaded gcc/4.8.5 (I've also tried gcc/9.4.0) and openmpi/4.1.4, configured using your configure line and run using the submit script [quoted in full above] and the code still hangs.

This script works for me and I am using the feature/malloc branch.

Do you happen to have some extra OpenMPI configuration under $HOME/.openmpi ?

@edoapra
Contributor

edoapra commented Aug 13, 2024

Could you try to run the binary I have copied to /tmp/edo_mir_perf2.x on deception03?

@bjpalmer
Member Author

Okay, I'm an idiot. I didn't pick up changes to feature/malloc that were made recently. It looks like everything is passing for me on deception. I'll see if it works on constance.

@bjpalmer
Member Author

The full test suite passes for me on deception. It looks like the two-sided runtime passes on constance.

@edoapra
Contributor

edoapra commented Aug 14, 2024

I have also been testing this branch with the full QA test suite in NWChem.
The results are the same as with GA 5.8.2.

@bjpalmer
Member Author

The RMA runtime mostly passes on deception. The only failure is nbtestc.x, which has been consistently failing even with ga_malloc. I'm not sure if that is a problem with the MPI implementation or a problem on our side with how we handle non-blocking handles. At this point, @edoapra, @ajaypanyala, @jeffhammond, do you have an issue with merging this into develop?

@edoapra
Contributor

edoapra commented Aug 26, 2024

I have no issue with the merge.
nbtestc.x works for me on deception with MPI3.

@bjpalmer
Member Author

There is already a pull request for this (#348). @jeffhammond, if you get a chance, can you approve it so we can merge it into develop?

@jeffhammond
Member

I had nontrivial feedback comments on that PR. I'm supportive but I think it can be improved. I'll make the changes if Edo is too busy.

@edoapra
Contributor

edoapra commented Aug 26, 2024

I had nontrivial feedback comments on that PR. I'm supportive but I think it can be improved. I'll make the changes if Edo is too busy.

Could you describe the feedback you are talking about? I am not able to find any comments under your name in #348.

@jeffhammond
Member

I don't know why they aren't visible. When I scroll down on #348, I see
[Four screenshots of the review comments, timestamped 2024-08-27 12:55]

edoapra added a commit that referenced this issue Aug 27, 2024
@bjpalmer
Member Author

I thought about using MA_Sizeof but didn't, so that we wouldn't be adding any additional MA dependencies in case we want to completely divorce GA from MA at some point in the future. I don't know if that is a priority for anyone. It's not a make-or-break item for me, so if anyone else feels strongly about it, let me know.

@jeffhammond
Member

In that case, perhaps we can hide the case-switch inside a GA_SIZEOF macro. I'm not wedded to MA, merely against code duplication.
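
Concretely, the suggested macro might look like this sketch (GA_SIZEOF is the proposed, not-yet-existing name; C_INT through C_DCPL are GA's C type constants and SingleComplex/DoubleComplex its complex typedefs):

/* One central mapping from GA type constants to byte sizes, replacing
   a case-switch duplicated at every allocation site. */
#define GA_SIZEOF(type)                                  \
    ((type) == C_INT   ? sizeof(int)           :         \
     (type) == C_LONG  ? sizeof(long)          :         \
     (type) == C_FLOAT ? sizeof(float)         :         \
     (type) == C_DBL   ? sizeof(double)        :         \
     (type) == C_SCPL  ? sizeof(SingleComplex) :         \
     (type) == C_DCPL  ? sizeof(DoubleComplex) : 0)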

@bjpalmer
Member Author

I modified the code and defined a new function, pnga_malloc, that has the same signature as ga_malloc but can be switched under the hood to use either ga_malloc or malloc. This function replaces ga_malloc everywhere it used to appear, which gets rid of the USE_GA_MALLOC symbol (except where it selects between the two implementations of pnga_malloc). Somebody already created a GAsizeof function inside GA, so I just used that to handle GA types.
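
Based on that description, the wrapper presumably has roughly this shape (a sketch, not the committed code; Integer is GA's integer typedef, and GAsizeof is assumed to return the byte size of a GA type):

#include <stdlib.h>

/* Same signature as ga_malloc; the backend is chosen at compile time. */
void *pnga_malloc(Integer nelem, int type, char *name)
{
#ifdef USE_GA_MALLOC
    return ga_malloc(nelem, type, name);
#else
    (void)name;  /* the name is only meaningful to the MA-backed path */
    return malloc((size_t)nelem * (size_t)GAsizeof(type));
#endif
}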

@bjpalmer
Member Author

@ajaypanyala, @edoapra, @jeffhammond, any final comments on this branch? If not, then we can approve the pull request and merge this into develop.

@edoapra
Contributor

edoapra commented Sep 11, 2024

I modified the code and defined a new function, pnga_malloc, that has the same signature as ga_malloc but can be switched under the hood to use either ga_malloc or malloc. This function replaces ga_malloc everywhere it used to appear, which gets rid of the USE_GA_MALLOC symbol (except where it selects between the two implementations of pnga_malloc). Somebody already created a GAsizeof function inside GA, so I just used that to handle GA types.

Where is this commit?

@bjpalmer
Member Author

Oh duh, I forgot to check it in. Try it now.

@edoapra
Contributor

edoapra commented Sep 11, 2024

Works for me

@ajaypanyala
Member

Looks good to me

bjpalmer pushed a commit that referenced this issue Sep 13, 2024
* Modified code to replace all instances of ga_malloc with regular malloc.

* minor fix

* another fix

* Modified code to replace all instances of ga_malloc with regular malloc.

* minor fix

* another fix

* stop to avoid division by zero

* fix for brew install warning

* Modified code to replace all instances of ga_malloc with regular malloc.

* minor fix

* another fix

* stop to avoid division by zero

* fix for brew install warning

* added #if HAVE_STDLIB_H  #330 (comment)

* fix: replaced MA with malloc/free and fix race condition in LJ_Initialize

* Modified handling of ga_malloc by defining a new function pnga_malloc that can
be converted to use either ga_malloc or regular malloc. This is the only
function that needs to be changed.

---------

Co-authored-by: Bruce J Palmer <d3g293@deception01.pnl.gov>
Co-authored-by: Ajay Panyala <ajay.panyala@gmail.com>
Co-authored-by: Bruce J Palmer <d3g293@deception03.pnl.gov>
Co-authored-by: Bruce J Palmer <d3g293@constance03.pnl.gov>