LAMMPS on multiple GPUs #195

Open
Felixrccs opened this issue Oct 19, 2023 · 35 comments

@Felixrccs

Everything works on one GPU; however, I would like to run my LAMMPS simulation across multiple GPUs.
My LAMMPS submission command for two GPUs:
srun -n 4 lmp -partition 1 1 1 1 -l lammps.log -sc screen -k on g 2 -sf kk -i lammps.in

If I run it on multiple GPUs, LAMMPS returns this error:

Starting MPS on ravg1112
Exception: Specified device cuda:0 does not match device of data cuda:1
Exception raised from make_tensor at aten/src/ATen/Functions.cpp:26 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x14f6528553cb in /u/----/LAMMPS/lammps_mace_v1/libtorch-gpu/lib/libc10.so)
frame #1: at::TensorMaker::make_tensor() + 0xa1d (0x14f604c2641d in /u/----/LAMMPS/lammps_mace_v1/libtorch-gpu/lib/libtorch_cpu.so)
frame #2: torch::from_blob(void*, c10::ArrayRef<long>, c10::TensorOptions const&) + 0xfb (0x14f656649e2b in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #3: LAMMPS_NS::PairMACEKokkos<Kokkos::Cuda>::compute(int, int) + 0xce1 (0x14f6566620f1 in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #4: LAMMPS_NS::VerletKokkos::setup(int) + 0x5b2 (0x14f655d85fb2 in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #5: LAMMPS_NS::Temper::command(int, char**) + 0x6c4 (0x14f655583c84 in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #6: LAMMPS_NS::Input::execute_command() + 0xaec (0x14f6551e55ec in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #7: LAMMPS_NS::Input::file() + 0x155 (0x14f6551e5995 in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #8: /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/lmp() [0x4049e8]
frame #9: __libc_start_main + 0xef (0x14f653e3e24d in /lib64/libc.so.6)
frame #10: /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/lmp() [0x404b9a]

Exception: Specified device cuda:0 does not match device of data cuda:1
Exception raised from make_tensor at aten/src/ATen/Functions.cpp:26 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x14dd2ea553cb in /u/----/LAMMPS/lammps_mace_v1/libtorch-gpu/lib/libc10.so)
frame #1: at::TensorMaker::make_tensor() + 0xa1d (0x14dce0e2641d in /u/----/LAMMPS/lammps_mace_v1/libtorch-gpu/lib/libtorch_cpu.so)
frame #2: torch::from_blob(void*, c10::ArrayRef<long>, c10::TensorOptions const&) + 0xfb (0x14dd32849e2b in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #3: LAMMPS_NS::PairMACEKokkos<Kokkos::Cuda>::compute(int, int) + 0xce1 (0x14dd328620f1 in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #4: LAMMPS_NS::VerletKokkos::setup(int) + 0x5b2 (0x14dd31f85fb2 in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #5: LAMMPS_NS::Temper::command(int, char**) + 0x6c4 (0x14dd31783c84 in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #6: LAMMPS_NS::Input::execute_command() + 0xaec (0x14dd313e55ec in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #7: LAMMPS_NS::Input::file() + 0x155 (0x14dd313e5995 in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #8: /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/lmp() [0x4049e8]
frame #9: __libc_start_main + 0xef (0x14dd3003e24d in /lib64/libc.so.6)
frame #10: /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/lmp() [0x404b9a]

Abort(1) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
Abort(1) on node 3 (rank 3 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
slurmstepd: error: *** STEP 7590002.0 ON ravg1112 CANCELLED AT 2023-10-19T14:03:48 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: ravg1112: tasks 0,2: Killed
srun: launch/slurm: _step_signal: Terminating StepId=7590002.0
srun: error: ravg1112: task 3: Killed
srun: error: ravg1112: task 1: Killed
@ilyes319
Contributor

Multi-GPU LAMMPS is an untested beta feature; hopefully we will make progress on that soon.

@Felixrccs
Author

Felixrccs commented Oct 27, 2023

I was told that I probably did not formulate my issue precisely enough. I want to run LAMMPS REPLICA EXCHANGE simulations, which means LAMMPS creates multiple MD instances and runs them in parallel, so I want to stick with pair_style mace no_domain_decomposition. MACE-LAMMPS manages to distribute the MD instances over one GPU; however, when distributing the MD instances over multiple GPUs I run into this bug. From the error I guess there is something built into the LAMMPS-MACE package that checks whether the memory is identical for both GPUs, which is not necessary for my application.

I would be very happy if you take another look at this.

@Felixrccs
Author

@wcwitt @ilyes319 Any idea how to fix this?
At this point this is the only thing keeping me from starting large scale production runs.

@wcwitt
Collaborator

wcwitt commented Nov 13, 2023

I've never run replica exchange in LAMMPS, so I'll need to dig into the internals to understand why it's breaking. Would you please send a complete (but simple) example, including a model? I'll use that to figure it out. Thanks

@ilyes319
Contributor

The problem seems to be linked to how the model is sent to the different GPUs. If nothing is done, libtorch probably loads the model on the first GPU of the node. I think there is probably a line to add to distribute the models to the right GPUs.

@wcwitt
Collaborator

wcwitt commented Nov 13, 2023

Yeah I'm sure it's not terribly complicated. There is already this logic which assigns the model to the GPU associated with the local MPI rank. That's how the domain decomposition works.

  if (!torch::cuda::is_available()) {
    std::cout << "CUDA unavailable, setting device type to torch::kCPU." << std::endl;
    device = c10::Device(torch::kCPU);
  } else {
    std::cout << "CUDA found, setting device type to torch::kCUDA." << std::endl;
    MPI_Comm local;
    MPI_Comm_split_type(world, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &local);
    int localrank;
    MPI_Comm_rank(local, &localrank);
    device = c10::Device(torch::kCUDA,localrank);
  }

I'd still like a self-contained example.

@Felixrccs
Author

Thanks a lot.
Test.zip
Here is the test case; I simplified it as far as possible.

@wcwitt
Collaborator

wcwitt commented Nov 20, 2023

Thanks for the example - I've looked at this a bit now. What happens if you try without Kokkos? So, adapting your input command:

mpirun -n 2 /work/Software/lammps_mace/lammps/build-kokkos-cuda/lmp -partition 1 1 -l lammps.log -sc screen -in lammps.temper > lammps.out

@Felixrccs
Author

It returns no error. However, it only uses a single GPU for all replicas (instances) instead of distributing them over the GPUs.
--> Without Kokkos it gives me the same result as using Kokkos with one GPU (-k on g 2 -sf kk)

@wcwitt
Collaborator

wcwitt commented Nov 22, 2023 via email

@Felixrccs
Author

I guess your install is missing the REPLICA package

Here is my current install

module purge
module load gcc/10 mkl/2022.2 gsl/2.4 impi/2021.4 fftw-mpi/3.3.9
module load anaconda/3/2023.03
module load cuda/11.6 cudnn/8.8.1
module load cmake/3.18

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MKLROOT/lib/intel64
source ${MKLROOT}/env/vars.sh intel64

cmake \
    -D CMAKE_BUILD_TYPE=Release \
    -D CMAKE_INSTALL_PREFIX=$(pwd) \
    -D BUILD_MPI=yes \
    -D BUILD_OMP=yes \
    -D BUILD_SHARED_LIBS=yes \
    -D LAMMPS_EXCEPTIONS=yes \
    -D PKG_KOKKOS=yes \
    -D Kokkos_ARCH_AMPERE80=yes \
    -D Kokkos_ARCH_AMDAVX=yes \
    -D Kokkos_ENABLE_CUDA=yes \
    -D Kokkos_ENABLE_OPENMP=yes \
    -D Kokkos_ENABLE_DEBUG=no \
    -D Kokkos_ENABLE_DEBUG_BOUNDS_CHECK=no \
    -D Kokkos_ENABLE_CUDA_UVM=no \
    -D CMAKE_CXX_COMPILER=$(pwd)/../lib/kokkos/bin/nvcc_wrapper \
    -D PKG_ML-MACE=yes \
    -D PKG_MOLECULE=yes \
    -D PKG_REPLICA=yes \
    -D PKG_KSPACE=yes \
    -D PKG_RIGID=yes \
    -D CMAKE_PREFIX_PATH=$(pwd)/../../libtorch-gpu \
    ../cmake

make -j 12

I never installed the CPU-only version; I will update you as soon as I have tried it.

@Felixrccs
Author

I've now tried the CPU installation and everything seems to work fine:

I tried it on one node, and I managed to run it both with no domain decomposition and one task per replica, and with domain decomposition and multiple tasks per replica.

@Felixrccs
Author

Have you made any progress and managed to run the simulation on your installation? And is there anything else I should try out?

@wcwitt
Collaborator

wcwitt commented Nov 28, 2023

Thanks - I'm at a conference this week, so I haven't had much time, unfortunately. But there are some people here who may know the answer immediately - I'll track them down.

@Felixrccs
Author

Ok, sounds good. It would be important for me to get this fixed by the 15th of December.

@wcwitt
Collaborator

wcwitt commented Nov 29, 2023

I will try, but can't promise.

@wcwitt
Collaborator

wcwitt commented Nov 29, 2023

Do you know if you are using a CUDA-aware MPI? That's not required in general, but I wonder if it could make a difference here.

@bernstei
Collaborator

Yeah I'm sure it's not terribly complicated. There is already this logic which assigns the model to the GPU associated with the local MPI rank. That's how the domain decomposition works.

  if (!torch::cuda::is_available()) {
    std::cout << "CUDA unavailable, setting device type to torch::kCPU." << std::endl;
    device = c10::Device(torch::kCPU);
  } else {
    std::cout << "CUDA found, setting device type to torch::kCUDA." << std::endl;
    MPI_Comm local;
    MPI_Comm_split_type(world, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &local);
    int localrank;
    MPI_Comm_rank(local, &localrank);
    device = c10::Device(torch::kCUDA,localrank);
  }

I'd still like a self-contained example.

@wcwitt how does this work? It looks to me like it'll split the MPI tasks on the current node into a "local" descriptor, and then use the rank within that local descriptor to pick GPUs. Does it implicitly assume that you're starting exactly one MPI rank per GPU?

If so, this code is effectively assigning unique integers, 0..N-1, to each task on each node, and using those to assign GPU IDs.

I have a hypothesis - I believe that the multiple images are called multiple "world"s in LAMMPS. Seeing this code, I wonder if that's because each gets its own world communicator. If that's right (and my understanding of the code above is too), then the issue is that the rank in this image-specific world communicator is no longer a set of integers that's unique for each task on a given node. We'd need to know their placement in the global world communicator.
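
For concreteness (assuming that reading is right): with srun -n 4 lmp -partition 1 1 1 1 ... -k on g 2, each replica's world contains a single task, so the split above gives every task localrank = 0 and puts every model on cuda:0, while Kokkos spreads the four tasks over the node's two GPUs - which would produce exactly the "Specified device cuda:0 does not match device of data cuda:1" error on the tasks Kokkos placed on GPU 1.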

I'll poke around at the replica exchange code and see if I can find evidence for this.

@wcwitt do you want me to try to play with the example, or is this enough of a clue that you want to investigate it yourself?

@bernstei
Collaborator

OK. I found this in REPLICA/prd.cpp

  // comm_replica = communicator between all proc 0s across replicas

  int color = me;
  MPI_Comm_split(universe->uworld,color,0,&comm_replica);

Looks like universe->uworld is a communicator that includes all the MPI tasks, before they were split into worlds (one per replica). I'm happy to discuss interactively, but I suspect that if the code @wcwitt pasted above switches to using universe->uworld instead of world in the MPI_Comm_split_type, it would provide local ranks that can be used to assign a unique GPU to each MPI rank on a given node.

@Felixrccs
Author

Do you know if you are using a CUDA-aware MPI? That's not required in general, but I wonder if it could make a difference here.

I don't think I do. I'm going to try this out; maybe it speeds things up even if it doesn't solve the issue.

@wcwitt
Collaborator

wcwitt commented Nov 29, 2023

Thanks @bernstei, this is a huge help (especially if it works).

It looks to me like it'll split the MPI tasks on the current node into a "local" descriptor, and then use the rank within that local descriptor to pick GPUs. Does it implicitly assume that you're starting exactly one MPI rank per GPU?

Yes, that's how I've done the domain decomposition so far.

What you suggest sounds plausible; I was hoping it would be something simple like that. I'm in deadline mode, so won't have time until next week. If anyone else wants to try in the meantime, definitely feel free.

@bernstei
Collaborator

If anyone else wants to try in the meantime, definitely feel free.

I have some time, but on the other hand, I don't have LAMMPS compiled appropriately right now. If someone else wants to try it (@Felixrccs ?) I'm happy to discuss the necessary patch. Otherwise, I do in principle want to get LAMMPS+GPU MACE working, so I will set it up eventually, but possibly only on the days-week timescale that @wcwitt will get to it as well.

@Felixrccs
Author

Ok, I tried the suggested code changes out and they work perfectly. I tried it both with and without Kokkos (though with Kokkos it's about twice as fast). So far I've only tested simple systems (4 replicas distributed over 4 GPUs) and I get an equal distribution/workload over all the GPUs.

Here are the changes I made.

  if (!torch::cuda::is_available()) {
    std::cout << "CUDA unavailable, setting device type to torch::kCPU." << std::endl;
    device = c10::Device(torch::kCPU);
  } else {
    std::cout << "CUDA found, setting device type to torch::kCUDA." << std::endl;
    //int worldrank;
    //MPI_Comm_rank(world, &worldrank);
    MPI_Comm local;
    MPI_Comm_split_type(universe->uworld, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &local);
    int localrank;
    MPI_Comm_rank(local, &localrank);
    device = c10::Device(torch::kCUDA,localrank);
  }

I also had to add this line:

#include "universe.h"

I will put my changes into a pull request in the coming days (after I've run some longer simulations to check that everything continues to work as expected). @wcwitt @bernstei again, a thousand thanks for the help and patience in debugging this.

@bernstei
Collaborator

Great, I'm glad it worked. I have to say I'm surprised I figured it out, because normally I find LAMMPS's internals to be pretty opaque.

@wcwitt Would it be useful to test for local_size <= n_gpus_per_node ?
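
A minimal sketch of the kind of check I mean, placed right after the MPI_Comm_rank call (untested; it assumes the local communicator from the snippet above, the CUDA runtime for cudaGetDeviceCount, and the pair style's error pointer):

    int local_size = 1, n_devices = 1;
    MPI_Comm_size(local, &local_size);
    cudaGetDeviceCount(&n_devices);
    // warn rather than abort: extra ranks could in principle share GPUs
    if (local_size > n_devices)
      error->warning(FLERR, "More MPI ranks on this node than visible GPUs");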

@Felixrccs
Author

Ok, sadly I ran into new problems trying to run 8 replicas on 4 GPUs. The error I get is the following:

Exception: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
.....

The full error is here:
error.txt

I tried to google it, and it's apparently related to calling GPUs that do not exist.

Any ideas? I have the feeling this is more difficult to solve than the issue we had before.

@bernstei
Collaborator

bernstei commented Nov 30, 2023

I think that the way it's currently coded implicitly assumes that the number of MPI ranks per node is equal to the number of GPUs. Are you running it that way, or are you running with 8 MPI ranks, one per replica? I have an idea that might fix this, although we might need @wcwitt to confirm it'll work, but first I'd like to know exactly how your run is configured.

@Felixrccs
Author

Yes, I have to use one MPI rank per replica so that the LAMMPS partitioning works. Below you see my run command for 8 replicas on 4 GPUs:

srun -n 8 lmp -partition 1 1 1 1 1 1 1 1 -l lammps.log -i lammps.temper > lammps.out

@bernstei
Collaborator

bernstei commented Nov 30, 2023

OK. In that case, I have a suggestion, if it's possible for two MPI tasks to share a GPU. If that's not possible, you're just stuck - you'll have to run on more nodes so that there's one GPU per replica. If it is possible, here is a proposed (untested) syntax for assigning the same GPU ID to more than one task. Whether that's sufficient, or something else needs to be done to allow two tasks to share a GPU, I just don't know. @wcwitt?

Anyway, the proposed syntax is

#include <cuda_runtime.h>
.
.
.
    MPI_Comm_rank(local, &localrank);
    int nDevices;
    cudaGetDeviceCount(&nDevices);
    device = c10::Device(torch::kCUDA,localrank % nDevices);

This is the sloppy version, without any error checking. Also, my understanding is that you need to link against -lcudart, and I don't know if LAMMPS does that automatically (probably yes, but if you get undefined-symbol errors on the CUDA function, we'll have to figure out how to add that).

[edited - you might need cuda_runtime_api.h instead]
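
For completeness, the pieces combined would look roughly like this (untested sketch, same caveats as above; it just swaps universe->uworld into the existing device-selection block and adds the modulo over the device count):

  // at the top of the file:
  #include <cuda_runtime.h>   // for cudaGetDeviceCount (or cuda_runtime_api.h)
  #include "universe.h"       // for universe->uworld

  // in the device-selection block:
  if (!torch::cuda::is_available()) {
    std::cout << "CUDA unavailable, setting device type to torch::kCPU." << std::endl;
    device = c10::Device(torch::kCPU);
  } else {
    std::cout << "CUDA found, setting device type to torch::kCUDA." << std::endl;
    // split the universe communicator (all replicas) by shared-memory node,
    // so localrank is unique per task on this node even with -partition
    MPI_Comm local;
    MPI_Comm_split_type(universe->uworld, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &local);
    int localrank;
    MPI_Comm_rank(local, &localrank);
    // wrap around the visible device count so several ranks can share one GPU
    int nDevices = 1;
    cudaGetDeviceCount(&nDevices);
    device = c10::Device(torch::kCUDA, localrank % nDevices);
  }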

@Felixrccs
Author

It should be possible to assign a single GPU to multiple MPI tasks (this was the problem that we had before: 8 replicas/MPI processes calculated on a single GPU).

I will try to add your suggestion to the code and play around a bit. However, there is a high chance that this exceeds my C++ skills.

@bernstei
Collaborator

bernstei commented Dec 1, 2023

It should be possible to assign a single GPU to multiple MPI tasks (this was the problem that we had before: 8 replicas/MPI processes calculated on a single GPU).

In that case, I don't think anything complex should be needed - I'm now pretty certain my code will work as is. When you compile it, if you do make VERBOSE=1 ... > make.stdout (where the ... is for whatever else you normally pass to make, if anything), I can look at the output and see what libraries it's linking to.

@Felixrccs
Author

It works :)

So the compilation worked out of the box.

And everything seems to work as expected:

  • I tested without Kokkos [8 MPI ranks on 4 GPUs] and it worked as expected.
  • With Kokkos I tried [8 MPI ranks on 4 GPUs] and [7 MPI ranks on 4 GPUs] and it worked without any problems.

Beginning of next week I'll try a 2 ns run on a bigger system as a final test. But I'm confident that this issue is solved now.

@wcwitt
Collaborator

wcwitt commented Dec 1, 2023

Excellent. Thanks, both of you.

@Felixrccs, you mentioned a PR - do feel free to go ahead with that. As part of it, or in parallel, I'll think through questions like this and other error-checking that will make things more robust.

@wcwitt Would it be useful to test for local_size <= n_gpus_per_node ?

Let's leave this issue open until the changes are merged.

@bernstei
Collaborator

bernstei commented Dec 1, 2023

Would it be useful to test for local_size <= n_gpus_per_node ?

I can't see how this would fail (famous last words), since it should just leave them idle, but I definitely agree about deferring

@wcwitt
Collaborator

wcwitt commented Dec 1, 2023

Wasn't thinking it would fail - more concerned about what else is missing.

@Felixrccs
Author

Here is the pull request; I hope I've done everything right: ACEsuit/lammps#1

wcwitt mentioned this issue Jan 6, 2024