LUMI (CSC)

The LUMI cluster is located at CSC (Finland). Each node contains 4 AMD MI250X GPUs, each with 2 Graphics Compute Dies (GCDs) for a total of 8 GCDs per node. You can think of the 8 GCDs as 8 separate GPUs, each having 64 GB of high-bandwidth memory (HBM2E).

Introduction

If you are new to this system, please see the following resources:

Preparation

Use the following commands to download the WarpX source code:

git clone https://github.com/ECP-WarpX/WarpX.git $HOME/src/warpx

We use system software modules and add environment hints and further dependencies via the file $HOME/lumi_warpx.profile. Create it now:

cp $HOME/src/warpx/Tools/machines/lumi-csc/lumi_warpx.profile.example $HOME/lumi_warpx.profile

Script Details

../../../../Tools/machines/lumi-csc/lumi_warpx.profile.example

Edit the 2nd line of this script, which sets the export proj="project_..." variable, using a text editor such as nano, emacs, or vim (all available by default on LUMI login nodes). You can find your project name by running lumi-ldap-userinfo on LUMI. For example, if you are a member of the project project_465000559, run nano $HOME/lumi_warpx.profile and edit line 2 to read:

export proj="project_465000559"

In nano, save with Ctrl + O and then exit with Ctrl + X.
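If you prefer a non-interactive edit, the same change to line 2 can be sketched with sed (GNU sed, as found on Linux systems). The example below runs on a throwaway file whose contents are placeholders, not the real profile; once the result looks right, point the profile variable at $HOME/lumi_warpx.profile instead. The project ID is the example used above.

```shell
# Sketch: set the project on line 2 of a profile file without opening an editor.
# The demo file is a stand-in with placeholder contents, NOT the real profile.
profile="$(mktemp)"
printf '%s\n' '# placeholder comment line' 'export proj="project_..."' '# further lines' > "$profile"

my_proj="project_465000559"   # example ID from the text; replace with your own

# Replace the whole of line 2 with the export statement.
sed -i "2s/.*/export proj=\"${my_proj}\"/" "$profile"

# Print line 2 to confirm the change.
sed -n '2p' "$profile"
```

Replacing the entire line by number mirrors the manual instruction; it assumes the export statement really is on line 2 of your copy.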

Important

Now, and as the first step on future logins to LUMI, activate these environment settings:

source $HOME/lumi_warpx.profile
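A quick sanity check after sourcing the profile can catch a forgotten edit of line 2. The sketch below uses a hypothetical helper, check_proj, which is not part of the WarpX tooling, and assumes that project IDs have the form project_ followed by digits, as in the example above.

```shell
# Sketch: check that a project ID has the expected form.
# check_proj is a hypothetical helper; the project_<digits> pattern is an assumption.
check_proj() {
    if printf '%s\n' "$1" | grep -Eq '^project_[0-9]+$'; then
        echo "proj looks set: $1"
    else
        echo "proj is unset or still a placeholder: '$1'" >&2
        return 1
    fi
}

# After sourcing the profile, pass in the current value of proj:
check_proj "${proj:-}" || echo "re-check line 2 of the profile"
```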

Finally, since LUMI does not yet provide software modules for some of our dependencies, install them once:

bash $HOME/src/warpx/Tools/machines/lumi-csc/install_dependencies.sh
source $HOME/sw/lumi/gpu/venvs/warpx-lumi/bin/activate

Script Details

../../../../Tools/machines/lumi-csc/install_dependencies.sh

Compilation

Use the following CMake commands to compile the application executable:

cd $HOME/src/warpx
rm -rf build_lumi

cmake -S . -B build_lumi -DWarpX_COMPUTE=HIP -DWarpX_PSATD=ON -DWarpX_QED_TABLE_GEN=ON -DWarpX_QED_TABLES_GEN_OMP=OFF -DWarpX_DIMS="1;2;RZ;3"
cmake --build build_lumi -j 16

The WarpX application executables are now in $HOME/src/warpx/build_lumi/bin/. Additionally, the following commands will install WarpX as a Python module:

rm -rf build_lumi_py

cmake -S . -B build_lumi_py -DWarpX_COMPUTE=HIP -DWarpX_PSATD=ON -DWarpX_QED_TABLE_GEN=ON -DWarpX_QED_TABLES_GEN_OMP=OFF -DWarpX_APP=OFF -DWarpX_PYTHON=ON -DWarpX_DIMS="1;2;RZ;3"
cmake --build build_lumi_py -j 16 --target pip_install
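Once the application build has finished, it can be worth confirming that executables actually landed in the bin directory before submitting a job. The following is a minimal sketch: list_warpx_bins is a hypothetical helper, and the warpx* glob is an assumption about how the executables are named.

```shell
# Sketch: list the executables a build placed in its bin directory.
# list_warpx_bins is a hypothetical helper; the warpx* glob is an assumed naming pattern.
list_warpx_bins() {
    bin_dir="$1"
    found=0
    for exe in "$bin_dir"/warpx*; do
        # Skip the unexpanded pattern and anything that is not executable.
        [ -x "$exe" ] || continue
        echo "$exe"
        found=$((found + 1))
    done
    # Succeed only if at least one executable was found.
    [ "$found" -gt 0 ]
}

list_warpx_bins "$HOME/src/warpx/build_lumi/bin" || echo "no executables found; did the build finish?"
```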

Update WarpX & Dependencies

If you already installed WarpX in the past and want to update it, start by getting the latest source code:

cd $HOME/src/warpx

# read the output of this command - does it look ok?
git status

# get the latest WarpX source code
git fetch
git pull

# read the output of these commands - do they look ok?
git status
git log     # press q to exit

And, if needed,

  • update the lumi_warpx.profile file (see Preparation),
  • log out and back into the system, then activate the updated environment profile as usual,
  • execute the dependency install script (see Preparation).

As a last step, clean the build directory with rm -rf $HOME/src/warpx/build_lumi and rebuild WarpX.

Running

MI250X GPUs (2x64 GB)

The GPU partition on the supercomputer LUMI at CSC has up to 2978 nodes, each with 8 Graphics Compute Dies (GCDs). WarpX runs one MPI rank per Graphics Compute Die.
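Because WarpX uses one MPI rank per GCD, the total rank count follows directly from the node count. The arithmetic can be sketched as below; the node count is only an example value, and the variable names are illustrative.

```shell
# Sketch: derive the MPI rank count and per-node HBM from the node count.
gcds_per_node=8        # 4 MI250X GPUs x 2 GCDs each
hbm_per_gcd_gb=64      # HBM2E per GCD, as stated in the introduction
nodes=2                # example value; match it to -N in your job script

ranks=$((nodes * gcds_per_node))                   # one MPI rank per GCD
hbm_per_node_gb=$((gcds_per_node * hbm_per_gcd_gb))

echo "nodes=${nodes} ranks=${ranks} hbm_per_node_gb=${hbm_per_node_gb}"
```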

For interactive runs, simply use the aliases getNode or runNode ....

The batch script below can be used to run a WarpX simulation on multiple nodes (change -N accordingly). Replace descriptions between chevrons <> with relevant values, for instance <project id> or the concrete inputs file. Copy the executable or point to it via EXE and adjust the path for the INPUTS variable accordingly.

../../../../Tools/machines/lumi-csc/lumi.sbatch

To run a simulation, copy the lines above to a file lumi.sbatch and run

sbatch lumi.sbatch

to submit the job.

Post-Processing

Note

TODO: Document any Jupyter or data services.

Known System Issues

Warning

December 12th, 2022: There is a caching bug in libfabric that causes WarpX simulations to occasionally hang on LUMI when run on more than one node.

As a work-around, please export the following environment variable in your job scripts until the issue is fixed:

#export FI_MR_CACHE_MAX_COUNT=0  # libfabric disable caching
# or, less invasive:
export FI_MR_CACHE_MONITOR=memhooks  # alternative cache monitor

Warning

January, 2023: We discovered a regression in AMD ROCm, leading to 2x slower current deposition (and other slowdowns) in ROCm 5.3 and 5.4.

June, 2023: Although a fix was planned for ROCm 5.5, we still see the same issue in that release and continue to work with AMD and HPE on it.

Stay with the ROCm 5.2 module to avoid a 2x slowdown.
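To guard a job script against accidentally loading one of the affected releases, a version check along the following lines may help. This is a minimal sketch: rocm_is_affected is a hypothetical helper that only inspects a version string you pass in, and it covers just the releases named in this warning (5.3, 5.4, 5.5). How you obtain the loaded version is left open, and later releases are not assessed here.

```shell
# Sketch: flag ROCm versions that the warning associates with the slowdown.
# rocm_is_affected is a hypothetical helper; it only checks the string it is given.
rocm_is_affected() {
    case "$1" in
        5.3|5.3.*|5.4|5.4.*|5.5|5.5.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example version strings, chosen for illustration only.
for v in 5.2.3 5.4.1 5.5.0; do
    if rocm_is_affected "$v"; then
        echo "ROCm $v: affected by the reported regression"
    else
        echo "ROCm $v: not listed as affected"
    fi
done
```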

Warning

May 2023: rocFFT in ROCm 5.1-5.3 tries to write to a cache in the home area by default. This does not scale; disable it via:

export ROCFFT_RTC_CACHE_PATH=/dev/null