Particle Container to Pure SoA Again #4653

Merged
merged 3 commits into ECP-WarpX:development from topic-soa-reintro on Feb 2, 2024

@ax3l ax3l commented Jan 30, 2024

Transition to new, purely SoA particle containers.

This was originally merged in #3850 and reverted in #4652, since we discovered issues losing particles & laser particles on GPU.
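
For readers new to the layout change, here is a minimal Python sketch of the difference between an array-of-structs (AoS) and a struct-of-arrays (SoA) particle store. It is purely illustrative: the names `aos_particles`, `soa_particles` and `push_x` are hypothetical and not part of the WarpX or AMReX API.

```python
from array import array

# AoS: one record per particle; a loop over x strides past the other fields.
aos_particles = [
    {"x": 0.0, "ux": 1.0, "w": 2.0},
    {"x": 1.0, "ux": -1.0, "w": 2.0},
]

# SoA: one contiguous array per component; a loop over x touches only x and ux.
soa_particles = {
    "x": array("d", [0.0, 1.0]),
    "ux": array("d", [1.0, -1.0]),
    "w": array("d", [2.0, 2.0]),
}


def push_x(soa, dt):
    """Advance positions in place using only the x and ux components."""
    x, ux = soa["x"], soa["ux"]
    for i in range(len(x)):
        x[i] += ux[i] * dt


push_x(soa_particles, 0.5)
```

In the SoA form the position update never reads the weight `w`, which is the kind of access pattern that tends to favor contiguous per-component arrays.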

Fun Mini-Benchmarks on CPU, DP

Hardware: 12th Gen Intel(R) Core(TM) i9-12900H

export OMP_NUM_THREADS=1
./warpx.3d ../../Examples/Tests/performance_tests/automated_test_1_uniform_rest_32ppc amr.max_grid_size=64 amr.n_cell=64 64 64 max_step=5 &
taskset -cp 6 $!

cpu_legacy.txt, cpu_soa.txt

Overall speed: similar to noise level of repeated runs (as expected).

Few noteworthy details in top 10 functions by runtime (Excl.):

Fun Mini-Benchmarks on A100 GPU, DP

Hardware: Perlmutter (NERSC) A100 GPU

./warpx.3d ../../Examples/Tests/performance_tests/automated_test_1_uniform_rest_32ppc amr.max_grid_size=256 amr.n_cell=256 256 256 max_step=5

Overall speed: 1.4% faster

Few noteworthy details in top 10 functions by runtime (Excl.):

  • GatherAndPush: 1.2% faster
  • Redistribute_partition: 4% faster
  • AddPlasma: 2.6% faster
  • ApplyBoundaryConditions: 1% faster
  • SortParticlesForDeposition: 231% faster 🚀 🚀 ✨
  • PermutationForDeposition: 3% faster
  • InitData: 15% faster 🚀
  • rest in TOP10 is the same as before
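
The percentages above do not state how "faster" is computed. One common convention is the relative speedup, `(t_old / t_new - 1) * 100`. The sketch below assumes that convention and uses made-up timings, not the measured ones; `percent_faster` is a hypothetical helper.

```python
def percent_faster(t_old, t_new):
    """Relative speedup in percent, assuming 'faster' means t_old / t_new - 1."""
    return (t_old / t_new - 1.0) * 100.0


# Hypothetical timings in seconds, chosen only to illustrate the arithmetic.
example = percent_faster(3.31, 1.0)
```

Under this assumed convention, a function reported as 231% faster would take roughly 3.3 times less time than before.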

Comment on lines 100 to 102
idcpu_data.push_back(0);
amrex::ParticleIDWrapper{idcpu_data.back()} = ParticleType::NextID();
amrex::ParticleCPUWrapper(idcpu_data.back()) = ParallelDescriptor::MyProc();
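
In the pure SoA layout, the id and cpu of a particle share a single 64-bit `idcpu` word, which is what the two wrappers above write into. The following Python sketch shows the idea of such bit packing. It assumes a split of 1 flag bit, 39 id bits and 24 cpu bits; treat that split and the helper names `pack_idcpu` / `unpack_idcpu` as assumptions and check `AMReX_Particle.H` for the authoritative layout.

```python
ID_BITS = 39   # assumed width of the id field
CPU_BITS = 24  # assumed width of the cpu field
ID_MASK = (1 << ID_BITS) - 1
CPU_MASK = (1 << CPU_BITS) - 1


def pack_idcpu(pid, cpu):
    """Pack a positive particle id and an MPI rank into one 64-bit word."""
    if not (0 < pid <= ID_MASK and 0 <= cpu <= CPU_MASK):
        raise ValueError("id or cpu out of range for the assumed layout")
    valid_bit = 1 << 63  # top bit marks a valid (positive) id in this sketch
    return valid_bit | (pid << CPU_BITS) | cpu


def unpack_idcpu(idcpu):
    """Recover (id, cpu) from a word produced by pack_idcpu."""
    return (idcpu >> CPU_BITS) & ID_MASK, idcpu & CPU_MASK


word = pack_idcpu(pid=5_000_000_000, cpu=3)
```

The example id is deliberately larger than a 32-bit integer can hold, which is why the id bookkeeping needs a 64-bit type.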

atmyers added a commit to AMReX-Codes/amrex that referenced this pull request Jan 31, 2024
## Summary

Update `ParticleCopyPlan::build` for pure SoA particle layout.

## Additional background

- [x] testing on GPU in ECP-WarpX/WarpX#4653

## Checklist

The proposed changes:
- [x] fix a bug or incorrect behavior in AMReX
- [ ] add new capabilities to AMReX
- [ ] changes answers in the test suite to more than roundoff level
- [ ] are likely to significantly affect the results of downstream AMReX
users
- [ ] include documentation in the code and/or rst files, if appropriate

---------

Co-authored-by: Andrew Myers <atmyers2@gmail.com>
@ax3l ax3l force-pushed the topic-soa-reintro branch 4 times, most recently from cf9dd03 to 2ac1993 Compare February 2, 2024 04:31
More pure SoA and id handling goodness.
Transition to new, purely SoA particle containers.

This was originally merged in ECP-WarpX#3850 and reverted in ECP-WarpX#4652, since
we discovered issues losing particles & laser particles on GPU.
- faster: fewer emitted operations, no jumps
- cheaper: fewer registers used
- safer: no read-before-write warnings
- cooler: no explanation needed

ax3l commented Feb 2, 2024

GPU Tests (CUDA, A100 on Perlmutter)

diff --git a/Examples/analysis_default_openpmd_regression.py b/Examples/analysis_default_openpmd_regression.py
index 3aadc49ac5..3e9fb98789 100755
--- a/Examples/analysis_default_openpmd_regression.py
+++ b/Examples/analysis_default_openpmd_regression.py
@@ -15,6 +15,6 @@ test_name = os.path.split(os.getcwd())[1]
 
 # Run checksum regression test
 if re.search( 'single_precision', fn ):
-    checksumAPI.evaluate_checksum(test_name, fn, output_format='openpmd', rtol=2.e-6)
+    checksumAPI.evaluate_checksum(test_name, fn, output_format='openpmd', rtol=4.)
 else:
-    checksumAPI.evaluate_checksum(test_name, fn, output_format='openpmd')
+    checksumAPI.evaluate_checksum(test_name, fn, output_format='openpmd', rtol=4.)
diff --git a/Examples/analysis_default_regression.py b/Examples/analysis_default_regression.py
index 453f650be0..6fa855df3d 100755
--- a/Examples/analysis_default_regression.py
+++ b/Examples/analysis_default_regression.py
@@ -15,6 +15,6 @@ test_name = os.path.split(os.getcwd())[1]
 
 # Run checksum regression test
 if re.search( 'single_precision', fn ):
-    checksumAPI.evaluate_checksum(test_name, fn, rtol=2.e-6)
+    checksumAPI.evaluate_checksum(test_name, fn, rtol=4.)
 else:
-    checksumAPI.evaluate_checksum(test_name, fn)
+    checksumAPI.evaluate_checksum(test_name, fn, rtol=4.)
diff --git a/Regression/WarpX-tests.ini b/Regression/WarpX-tests.ini
index 3310e642dd..84133add09 100644
--- a/Regression/WarpX-tests.ini
+++ b/Regression/WarpX-tests.ini
@@ -40,7 +40,7 @@ use_ctools = 0
 # sections.
 
 #MPIcommand = mpiexec -host @host@ -n @nprocs@ @command@
-MPIcommand = mpiexec -n @nprocs@ @command@
+MPIcommand = srun -n @nprocs@ @command@
 MPIhost =
 
 reportActiveTestsOnly = 1
@@ -64,7 +64,7 @@ branch = 24.02
 [source]
 dir = /home/regtester/AMReX_RegTesting/warpx
 branch = development
-cmakeSetupOpts = -DAMReX_ASSERTIONS=ON -DAMReX_TESTING=ON -DWarpX_PYTHON_IPO=OFF -DpyAMReX_IPO=OFF
+cmakeSetupOpts = -DAMReX_ASSERTIONS=ON -DAMReX_TESTING=ON -DWarpX_PYTHON_IPO=OFF -DpyAMReX_IPO=OFF -DWarpX_COMPUTE=CUDA
 # -DPYINSTALLOPTIONS="--disable-pip-version-check"
 
 # individual problems follow
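
The patch above sets `rtol=4.` for all checksum comparisons. To see how loose that is, here is a minimal sketch assuming a NumPy-style relative comparison. `within_tolerance` is a hypothetical helper, not the `checksumAPI` implementation.

```python
def within_tolerance(value, reference, rtol, atol=0.0):
    """Hypothetical check: |value - reference| <= atol + rtol * |reference|."""
    return abs(value - reference) <= atol + rtol * abs(reference)


reference = 1.0
tight = within_tolerance(1.00001, reference, rtol=2.0e-6)  # off by 1e-5
loose = within_tolerance(3.0, reference, rtol=4.0)         # off by 200%
```

With a tolerance this loose, the checksum step would reject only gross differences, so under this assumption the runs mainly distinguish crashes and large deviations from passing tests.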
cat test.sbatch 
#!/bin/bash -l

# Copyright 2021-2023 Axel Huebl, Kevin Gott
#
# This file is part of WarpX.
#
# License: BSD-3-Clause-LBNL

#SBATCH -t 10:00:00
#SBATCH -N 1
#SBATCH -J run_test_soa
#SBATCH -A m4272_g
#SBATCH -q regular
# A100 40GB (most nodes)
#SBATCH -C gpu
# A100 80GB (256 nodes)
#S BATCH -C gpu&hbm80g
#SBATCH --exclusive
# ideally single:1, but NERSC cgroups issue
#SBATCH --gpu-bind=none
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH -o WarpX.o%j
#SBATCH -e WarpX.e%j

# pin to closest NIC to GPU
export MPICH_OFI_NIC_POLICY=GPU

# threads for OpenMP and threaded compressors per MPI rank
#   note: 16 avoids hyperthreading (32 virtual cores, 16 physical)
export SRUN_CPUS_PER_TASK=16

export WARPX_CI_NUM_MAKE_JOBS=32

./run_test.sh

Tests that pass within a 10hr walltime in development

53 (until walltime reached)

Tests that pass within a 10hr walltime with this PR

53 (until walltime reached)

Tests that Already Crash in development

  • ImplicitPicard_VandB_2d CRASHED (backtraces produced)
  • Langmuir_multi_2d_MR CRASHED (backtraces produced)
  • Langmuir_multi_2d_MR_momentum_conserving CRASHED (backtraces produced)
  • Langmuir_multi_2d_MR_psatd CRASHED (backtraces produced)
  • LaserInjection CRASHED (backtraces produced)
  • ...

Some of those crash because of our default warning threshold: a high-priority warning trips the abort assertion shown below. I can retry those with a relaxed threshold.

!!! WARNING : [high][Performance] Too many boxes per GPU!
...
1::Assertion `msg_priority < abort_priority' failed, file "/tmp/ci-bbDk7v5Td5/warpx/Source/ablastr/warn_manager/WarnManager.cpp", line 97, Msg: 

Tests that Crash with this PR

  • ImplicitPicard_VandB_2d CRASHED (backtraces produced)
  • Langmuir_multi_2d_MR CRASHED (backtraces produced)
  • Langmuir_multi_2d_MR_momentum_conserving CRASHED (backtraces produced)
  • Langmuir_multi_2d_MR_psatd CRASHED (backtraces produced)
  • LaserInjection CRASHED (backtraces produced)

rtol=4. Checksums

Tests that Already Fail Analysis in development

  • BTD_rz FAILED
  • Deuterium_Deuterium_Fusion_3D FAILED
  • Deuterium_Deuterium_Fusion_3D_intraspecies FAILED
  • Deuterium_Tritium_Fusion_3D FAILED
  • Deuterium_Tritium_Fusion_RZ FAILED
  • ElectrostaticSphereRZ FAILED
  • FluxInjection FAILED
  • FluxInjection3D FAILED
  • ImplicitPicard_1d FAILED
  • Langmuir_multi_psatd_single_precision FAILED
  • Langmuir_multi_rz FAILED
  • Langmuir_multi_rz_psatd FAILED
  • Langmuir_multi_rz_psatd_current_correction FAILED
  • Langmuir_multi_single_precision FAILED
  • LaserAcceleration_BTD FAILED
  • LaserInjectionFromLASYFile_RZ FAILED
    ...

Tests that Fail Analysis with this PR

  • BTD_rz FAILED
  • Deuterium_Deuterium_Fusion_3D FAILED
  • Deuterium_Deuterium_Fusion_3D_intraspecies FAILED
  • Deuterium_Tritium_Fusion_3D FAILED
  • Deuterium_Tritium_Fusion_RZ FAILED
  • ElectrostaticSphereRZ FAILED
  • FluxInjection FAILED
  • FluxInjection3D FAILED
  • ImplicitPicard_1d FAILED
  • Langmuir_multi_psatd_single_precision FAILED
  • Langmuir_multi_rz FAILED
  • Langmuir_multi_rz_psatd FAILED
  • Langmuir_multi_rz_psatd_current_correction FAILED
  • Langmuir_multi_single_precision FAILED
  • LaserAcceleration_BTD FAILED
  • LaserInjectionFromLASYFile_RZ FAILED
    ...

@RemiLehe RemiLehe closed this Feb 2, 2024
@RemiLehe RemiLehe reopened this Feb 2, 2024
@RemiLehe RemiLehe merged commit 6e332e9 into ECP-WarpX:development Feb 2, 2024
64 of 71 checks passed
@ax3l ax3l deleted the topic-soa-reintro branch February 2, 2024 22:21
currSpecies["position"]["z"].storeChunk(z, {offset}, {numParticleOnTile64});
}

// reconstruct x and y from polar coordinates r, theta

Oopsi, reconstruction re-added in #4686

@@ -1084,7 +1083,7 @@ PhysicalParticleContainer::AddPlasma (PlasmaInjector const& plasma_injector, int
 const int max_new_particles = Scan::ExclusiveSum(counts.size(), counts.data(), offset.data());
 
 // Update NextID to include particles created in this function
-Long pid;
+int pid;

Long!
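
The reason for insisting on `Long`: particle ids can grow beyond the range of a 32-bit `int`. A minimal sketch using `ctypes` to mimic fixed-width integers follows. `as_int32` and `as_int64` are hypothetical helpers, and the sketch assumes `int` is 32 bits wide and `amrex::Long` is 64 bits wide on the target platform.

```python
import ctypes

INT32_MAX = 2**31 - 1


def as_int32(value):
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    return ctypes.c_int32(value).value


def as_int64(value):
    """Reinterpret the low 64 bits of value as a signed 64-bit integer."""
    return ctypes.c_int64(value).value


next_id = INT32_MAX + 1        # first id that no longer fits in 32 bits
wrapped = as_int32(next_id)    # wraps around to a negative number
preserved = as_int64(next_id)  # unchanged in 64 bits
```

A wrapped, negative id would no longer identify the particle it was assigned to, so the 64-bit type is the safer choice for id bookkeeping.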

auto& p = pp[ip];
p.id() = pid+ip;
p.cpu() = cpuid;
auto const new_id = ip + old_size;

Long!
