LAMMPS on multiple GPUs #195

Open
Felixrccs opened this issue Oct 19, 2023 · 35 comments

@Felixrccs

Everything works on one GPU; however, I would like to run my LAMMPS simulation across multiple GPUs.
My LAMMPS submission command for two GPUs:
srun -n 4 lmp -partition 1 1 1 1 -l lammps.log -sc screen -k on g 2 -sf kk -i lammps.in

If I run it on multiple GPUs, LAMMPS returns this error:

Starting MPS on ravg1112
Exception: Specified device cuda:0 does not match device of data cuda:1
Exception raised from make_tensor at aten/src/ATen/Functions.cpp:26 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x14f6528553cb in /u/----/LAMMPS/lammps_mace_v1/libtorch-gpu/lib/libc10.so)
frame #1: at::TensorMaker::make_tensor() + 0xa1d (0x14f604c2641d in /u/----/LAMMPS/lammps_mace_v1/libtorch-gpu/lib/libtorch_cpu.so)
frame #2: torch::from_blob(void*, c10::ArrayRef<long>, c10::TensorOptions const&) + 0xfb (0x14f656649e2b in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #3: LAMMPS_NS::PairMACEKokkos<Kokkos::Cuda>::compute(int, int) + 0xce1 (0x14f6566620f1 in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #4: LAMMPS_NS::VerletKokkos::setup(int) + 0x5b2 (0x14f655d85fb2 in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #5: LAMMPS_NS::Temper::command(int, char**) + 0x6c4 (0x14f655583c84 in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #6: LAMMPS_NS::Input::execute_command() + 0xaec (0x14f6551e55ec in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #7: LAMMPS_NS::Input::file() + 0x155 (0x14f6551e5995 in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #8: /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/lmp() [0x4049e8]
frame #9: __libc_start_main + 0xef (0x14f653e3e24d in /lib64/libc.so.6)
frame #10: /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/lmp() [0x404b9a]

Exception: Specified device cuda:0 does not match device of data cuda:1
Exception raised from make_tensor at aten/src/ATen/Functions.cpp:26 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x14dd2ea553cb in /u/----/LAMMPS/lammps_mace_v1/libtorch-gpu/lib/libc10.so)
frame #1: at::TensorMaker::make_tensor() + 0xa1d (0x14dce0e2641d in /u/----/LAMMPS/lammps_mace_v1/libtorch-gpu/lib/libtorch_cpu.so)
frame #2: torch::from_blob(void*, c10::ArrayRef<long>, c10::TensorOptions const&) + 0xfb (0x14dd32849e2b in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #3: LAMMPS_NS::PairMACEKokkos<Kokkos::Cuda>::compute(int, int) + 0xce1 (0x14dd328620f1 in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #4: LAMMPS_NS::VerletKokkos::setup(int) + 0x5b2 (0x14dd31f85fb2 in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #5: LAMMPS_NS::Temper::command(int, char**) + 0x6c4 (0x14dd31783c84 in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #6: LAMMPS_NS::Input::execute_command() + 0xaec (0x14dd313e55ec in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #7: LAMMPS_NS::Input::file() + 0x155 (0x14dd313e5995 in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #8: /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/lmp() [0x4049e8]
frame #9: __libc_start_main + 0xef (0x14dd3003e24d in /lib64/libc.so.6)
frame #10: /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/lmp() [0x404b9a]

Abort(1) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
Abort(1) on node 3 (rank 3 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
slurmstepd: error: *** STEP 7590002.0 ON ravg1112 CANCELLED AT 2023-10-19T14:03:48 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: ravg1112: tasks 0,2: Killed
srun: launch/slurm: _step_signal: Terminating StepId=7590002.0
srun: error: ravg1112: task 3: Killed
srun: error: ravg1112: task 1: Killed
@ilyes319
Contributor

Multi-GPU LAMMPS is an untested beta feature; hopefully we will make progress on that soon.

@Felixrccs
Author

Felixrccs commented Oct 27, 2023

I was told that I probably did not formulate my issue precisely enough. I want to run LAMMPS REPLICA EXCHANGE simulations, which means LAMMPS creates multiple MD instances and runs them in parallel, so I want to stick with pair_style mace no_domain_decomposition. MACE-LAMMPS manages to distribute the MD instances over one GPU; however, when distributing the MD instances over multiple GPUs I run into this bug. From the error I guess there is something built into the LAMMPS-MACE package that checks whether the memory is identical for both GPUs, which is not necessary for my application.

I would be very happy if you take another look at this.

@Felixrccs
Author

@wcwitt @ilyes319 Any idea how to fix this?
At this point this is the only thing keeping me from starting large scale production runs.

@wcwitt
Collaborator

wcwitt commented Nov 13, 2023

I've never run replica exchange in LAMMPS, so I'll need to dig into the internals to understand why it's breaking. Would you please send a complete (but simple) example, including a model? I'll use that to figure it out. Thanks

@ilyes319
Contributor

The problem seems to be linked to how the model is sent to the different GPUs. If nothing is done, libtorch probably loads the model on the first GPU of the node. I think there is probably a line to add to distribute the models to the right GPUs.

@wcwitt
Collaborator

wcwitt commented Nov 13, 2023

Yeah I'm sure it's not terribly complicated. There is already this logic which assigns the model to the GPU associated with the local MPI rank. That's how the domain decomposition works.

  if (!torch::cuda::is_available()) {
    std::cout << "CUDA unavailable, setting device type to torch::kCPU." << std::endl;
    device = c10::Device(torch::kCPU);
  } else {
    std::cout << "CUDA found, setting device type to torch::kCUDA." << std::endl;
    MPI_Comm local;
    MPI_Comm_split_type(world, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &local);
    int localrank;
    MPI_Comm_rank(local, &localrank);
    device = c10::Device(torch::kCUDA,localrank);
  }

I'd still like a self-contained example.

@Felixrccs
Author

Thanks a lot.
Test.zip
Here is the test case; I simplified it as far as possible.

@wcwitt
Collaborator

wcwitt commented Nov 20, 2023

Thanks for the example - I've looked at this a bit now. What happens if you try without Kokkos? So, adapting your input command:

mpirun -n 2 /work/Software/lammps_mace/lammps/build-kokkos-cuda/lmp -partition 1 1 -l lammps.log -sc screen -in lammps.temper > lammps.out

@Felixrccs
Author

It returns no error. However, it only uses a single GPU for all replicas (instances) instead of distributing them over the GPUs.
--> Without Kokkos it gives me the same result as using Kokkos with one GPU (-k on g 2 -sf kk)

@wcwitt
Collaborator

wcwitt commented Nov 22, 2023 via email

@Felixrccs
Author

I guess your install is missing the REPLICA package

Here is my current install

module purge
module load gcc/10 mkl/2022.2 gsl/2.4 impi/2021.4 fftw-mpi/3.3.9
module load anaconda/3/2023.03
module load cuda/11.6 cudnn/8.8.1
module load cmake/3.18

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MKLROOT/lib/intel64
source ${MKLROOT}/env/vars.sh intel64

cmake \
    -D CMAKE_BUILD_TYPE=Release \
    -D CMAKE_INSTALL_PREFIX=$(pwd) \
    -D BUILD_MPI=yes \
    -D BUILD_OMP=yes \
    -D BUILD_SHARED_LIBS=yes \
    -D LAMMPS_EXCEPTIONS=yes \
    -D PKG_KOKKOS=yes \
    -D Kokkos_ARCH_AMPERE80=yes \
    -D Kokkos_ARCH_AMDAVX=yes \
    -D Kokkos_ENABLE_CUDA=yes \
    -D Kokkos_ENABLE_OPENMP=yes \
    -D Kokkos_ENABLE_DEBUG=no \
    -D Kokkos_ENABLE_DEBUG_BOUNDS_CHECK=no \
    -D Kokkos_ENABLE_CUDA_UVM=no \
    -D CMAKE_CXX_COMPILER=$(pwd)/../lib/kokkos/bin/nvcc_wrapper \
    -D PKG_ML-MACE=yes \
    -D PKG_MOLECULE=yes \
    -D PKG_REPLICA=yes \
    -D PKG_KSPACE=yes \
    -D PKG_RIGID=yes \
    -D CMAKE_PREFIX_PATH=$(pwd)/../../libtorch-gpu \
    ../cmake

make -j 12

I never installed the CPU-only version; I will update you as soon as I have tried it.

@Felixrccs
Author

I've now tried the CPU installation and everything seems to work fine:

I tried it on one node, and I managed to run it both with no domain decomposition and one task per replica, and with domain decomposition and multiple tasks per replica.

@Felixrccs
Author

Have you made any progress and managed to run the simulation on your installation? And is there anything else I should try out?

@wcwitt
Collaborator

wcwitt commented Nov 28, 2023

Thanks - I'm at a conference this week, so I haven't had much time, unfortunately. But there are some people here who may know the answer immediately - I'll track them down.

@Felixrccs
Author

Ok, sounds good. It would be important for me to get this fixed by the 15th of December.

@wcwitt
Collaborator

wcwitt commented Nov 29, 2023

I will try, but can't promise.

@wcwitt
Collaborator

wcwitt commented Nov 29, 2023

Do you know if you are using a CUDA-aware MPI? That's not required in general, but I wonder if it could make a difference here.

@bernstei
Collaborator

Yeah I'm sure it's not terribly complicated. There is already this logic which assigns the model to the GPU associated with the local MPI rank. That's how the domain decomposition works.

  if (!torch::cuda::is_available()) {
    std::cout << "CUDA unavailable, setting device type to torch::kCPU." << std::endl;
    device = c10::Device(torch::kCPU);
  } else {
    std::cout << "CUDA found, setting device type to torch::kCUDA." << std::endl;
    MPI_Comm local;
    MPI_Comm_split_type(world, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &local);
    int localrank;
    MPI_Comm_rank(local, &localrank);
    device = c10::Device(torch::kCUDA,localrank);
  }

I'd still like a self-contained example.

@wcwitt how does this work? It looks to me like it'll split the MPI tasks on the current node into a "local" descriptor, and then use the rank within that local descriptor to pick GPUs. Does it implicitly assume that you're starting exactly one MPI rank per GPU?

If so, this code is effectively assigning unique integers, 0..N-1, to each task on each node, and using those to assign GPU IDs.

I have a hypothesis - I believe that the multiple images are called multiple "world"s in LAMMPS. Seeing this code, I wonder if that's because each gets its own world communicator. If that's right (and my understanding of the code above is too), then the issue is that the rank in this image-specific world communicator is no longer a set of integers that's unique for each task on a given node. We'd need to know their placement in the global world communicator.
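
For concreteness (assuming that reading is right): with srun -n 4 lmp -partition 1 1 1 1 ... -k on g 2, each replica's world contains a single task, so the split above gives every task localrank = 0 and puts every model on cuda:0, while Kokkos spreads the four tasks over the node's two GPUs - which would produce exactly the "Specified device cuda:0 does not match device of data cuda:1" error on the tasks Kokkos placed on GPU 1.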

I'll poke around at the replica exchange code and see if I can find evidence for this.

@wcwitt do you want me to try to play with the example, or is this enough of a clue that you want to investigate it yourself?

@bernstei
Collaborator

OK. I found this in REPLICA/prd.cpp

  // comm_replica = communicator between all proc 0s across replicas

  int color = me;
  MPI_Comm_split(universe->uworld,color,0,&comm_replica);

Looks like universe->uworld is a communicator that includes all the MPI tasks, before they were split into worlds (one per replica). I'm happy to discuss interactively, but I suspect that if the code @wcwitt pasted above switches to using universe->uworld instead of world in the MPI_Comm_split_type, it would provide local ranks that can be used to assign a unique GPU to each MPI rank on a given node.

@Felixrccs
Author

Do you know if you are using a CUDA-aware MPI? That's not required in general, but I wonder if it could make a difference here.

I don't think I do. I'm going to try this out; maybe it speeds things up even if it doesn't solve the issue.

@wcwitt
Collaborator

wcwitt commented Nov 29, 2023

Thanks @bernstei, this is a huge help (especially if it works).

It looks to me like it'll split the MPI tasks on the current node into a "local" descriptor, and then use the rank within that local descriptor to pick GPUs. Does it implicitly assume that you're starting exactly one MPI rank per GPU?

Yes, that's how I've done the domain decomposition so far.

What you suggest sounds plausible; I was hoping it would be something simple like that. I'm in deadline mode, so won't have time until next week. If anyone else wants to try in the meantime, definitely feel free.

@bernstei
Collaborator

If anyone else wants to try in the meantime, definitely feel free.

I have some time, but on the other hand, I don't have LAMMPS compiled appropriately right now. If someone else wants to try it (@Felixrccs ?) I'm happy to discuss the necessary patch. Otherwise, I do in principle want to get LAMMPS+GPU MACE working, so I will set it up eventually, but possibly only on the days-week timescale that @wcwitt will get to it as well.

@Felixrccs
Author

Ok, I tried the suggested code changes out and they work perfectly. I tried it both with and without Kokkos (though with Kokkos it's about twice as fast). So far I've only tested simple systems (4 replicas distributed over 4 GPUs) and I get an equal distribution/workload over all the GPUs.

Here are the changes I made.

  if (!torch::cuda::is_available()) {
    std::cout << "CUDA unavailable, setting device type to torch::kCPU." << std::endl;
    device = c10::Device(torch::kCPU);
  } else {
    std::cout << "CUDA found, setting device type to torch::kCUDA." << std::endl;
    //int worldrank;
    //MPI_Comm_rank(world, &worldrank);
    MPI_Comm local;
    MPI_Comm_split_type(universe->uworld, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &local);
    int localrank;
    MPI_Comm_rank(local, &localrank);
    device = c10::Device(torch::kCUDA,localrank);
  }

I also had to add this line:

#include "universe.h"

I will put my changes into a pull request in the coming days (after I've run some longer simulations to check that everything continues to work as expected). @wcwitt @bernstei again, a thousand thanks for the help and patience in debugging this.

@bernstei
Collaborator

Great, I'm glad it worked. I have to say I'm surprised I figured it out, because normally I find LAMMPS's internals to be pretty opaque.

@wcwitt Would it be useful to test for local_size <= n_gpus_per_node ?
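
A minimal sketch of the kind of check I mean, placed right after the MPI_Comm_rank call (untested; it assumes the local communicator from the snippet above, the CUDA runtime for cudaGetDeviceCount, and the pair style's error pointer):

    int local_size = 1, n_devices = 1;
    MPI_Comm_size(local, &local_size);
    cudaGetDeviceCount(&n_devices);
    // warn rather than abort: extra ranks could in principle share GPUs
    if (local_size > n_devices)
      error->warning(FLERR, "More MPI ranks on this node than visible GPUs");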

@Felixrccs
Author

Ok, sadly I ran into new problems trying to run 8 replicas on 4 GPUs. The error I get is the following:

Exception: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
.....

The full error is here:
error.txt

I tried to google it, and it's apparently related to calling GPUs that do not exist.

Any ideas? I have the feeling this is more difficult to solve than the issue we had before.

@bernstei
Collaborator

bernstei commented Nov 30, 2023

I think that the way it's currently coded implicitly assumes that the number of MPI ranks per node is equal to the number of GPUs. Are you running it that way, or are you running with 8 MPI ranks, one per replica? I have an idea that might fix this, although we might need @wcwitt to confirm it'll work, but first I'd like to know exactly how your run is configured.

@Felixrccs
Author

Yes, I have to use one MPI rank per replica so that the LAMMPS partitioning works. Below you see my run command for 8 replicas on 4 GPUs:

srun -n 8 lmp -partition 1 1 1 1 1 1 1 1 -l lammps.log -i lammps.temper > lammps.out

@bernstei
Collaborator

bernstei commented Nov 30, 2023

OK. In that case, I have a suggestion, if it's possible for two MPI tasks to share a GPU. If that's not possible, you're just stuck - you'll have to run on more nodes so that there's one GPU per replica. If it is possible, here is a proposed (untested) syntax for assigning the same GPU ID to more than one task. Whether that's sufficient, or something else needs to be done to allow two tasks to share a GPU, I just don't know. @wcwitt?

Anyway, the proposed syntax is

#include <cuda_runtime.h>
.
.
.
    MPI_Comm_rank(local, &localrank);
    int nDevices;
    cudaGetDeviceCount(&nDevices);
    device = c10::Device(torch::kCUDA,localrank % nDevices);

This is the sloppy version, without any error checking. Also, my understanding is that you need to link against -lcudart, and I don't know if LAMMPS does that automatically (probably yes, but if you get undefined-symbol errors on the CUDA function, we'll have to figure out how to add that).

[edited - you might need cuda_runtime_api.h instead]
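
For completeness, the pieces combined would look roughly like this (untested sketch, same caveats as above; it just swaps universe->uworld into the existing device-selection block and adds the modulo over the device count):

  // at the top of the file:
  #include <cuda_runtime.h>   // for cudaGetDeviceCount (or cuda_runtime_api.h)
  #include "universe.h"       // for universe->uworld

  // in the device-selection block:
  if (!torch::cuda::is_available()) {
    std::cout << "CUDA unavailable, setting device type to torch::kCPU." << std::endl;
    device = c10::Device(torch::kCPU);
  } else {
    std::cout << "CUDA found, setting device type to torch::kCUDA." << std::endl;
    // split the universe communicator (all replicas) by shared-memory node,
    // so localrank is unique per task on this node even with -partition
    MPI_Comm local;
    MPI_Comm_split_type(universe->uworld, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &local);
    int localrank;
    MPI_Comm_rank(local, &localrank);
    // wrap around the visible device count so several ranks can share one GPU
    int nDevices = 1;
    cudaGetDeviceCount(&nDevices);
    device = c10::Device(torch::kCUDA, localrank % nDevices);
  }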

@Felixrccs
Author

It should be possible to assign a single GPU to multiple MPI tasks (this was the problem that we had before: 8 replicas/MPI processes calculated on a single GPU).

I will try to add your suggestion to the code and play around a bit. However, there is a high chance that this exceeds my C++ skills.

@bernstei
Collaborator

bernstei commented Dec 1, 2023

It should be possible to assign a single GPU to multiple MPI tasks (this was the problem that we had before: 8 replicas/MPI processes calculated on a single GPU).

In that case, I don't think anything complex should be needed - I'm now pretty certain my code will work as is. When you compile it, if you do make VERBOSE=1 ... > make.stdout (where the ... is for whatever else you normally pass to make, if anything), I can look at the output and see what libraries it's linking to.

@Felixrccs
Author

It works :)

So the compilation worked out of the box.

And everything seems to work as expected:

  • I tested without Kokkos [8 MPI ranks on 4 GPUs] and it worked as expected.
  • With Kokkos I tried [8 MPI ranks on 4 GPUs] and [7 MPI ranks on 4 GPUs] and it worked without any problems.

Beginning of next week I'll try a 2 ns run on a bigger system as a final test. But I'm confident that this issue is solved now.

@wcwitt
Collaborator

wcwitt commented Dec 1, 2023

Excellent. Thanks, both of you.

@Felixrccs, you mentioned a PR - do feel free to go ahead with that. As part of it, or in parallel, I'll think through questions like this and other error-checking that will make things more robust.

@wcwitt Would it be useful to test for local_size <= n_gpus_per_node ?

Let's leave this issue open until the changes are merged.

@bernstei
Collaborator

bernstei commented Dec 1, 2023

Would it be useful to test for local_size <= n_gpus_per_node ?

I can't see how this would fail (famous last words), since it should just leave them idle, but I definitely agree about deferring

@wcwitt
Collaborator

wcwitt commented Dec 1, 2023

Wasn't thinking it would fail - more concerned about what else is missing.

@Felixrccs
Author

Here is the pull request; I hope I've done everything right: ACEsuit/lammps#1

wcwitt mentioned this issue Jan 6, 2024