
Satori Cluster


Running ClimateMachine.jl on MIT Satori

The Satori user documentation is available at https://mit-satori.github.io/; this page focuses on running ClimateMachine.jl on Satori. If you run into trouble or have questions, contact either @vchuravy or @christophernhill.

Installing ClimateMachine.jl on Satori

export HOME2=/nobackup/users/`whoami`
cd ${HOME2}
git clone https://github.com/CliMA/ClimateMachine.jl ClimateMachine
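The HOME2 shortcut above only lives in the current shell. If you want it available in future login sessions as well, you can persist it in your shell startup file; a minimal sketch, assuming a bash login shell:

# Make the HOME2 shortcut available in future bash sessions
echo 'export HOME2=/nobackup/users/`whoami`' >> ~/.bashrc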

Annotated SLURM script

Requesting resources

First we need to request resources from SLURM. Our configuration launches one MPI rank per available GPU. Satori has 4 GPUs per node, so we ask for 4 tasks per node. To scale up, increase the number of nodes (see the sketch after the directives below). We request --mem=0 so that large problem sizes do not hit the per-job soft limit on memory.

#!/bin/bash
# Begin SLURM Directives
#SBATCH --job-name=ClimateMachine
#SBATCH --time=30:00
#SBATCH --mem=0
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gres="gpu:4" # GPUs per Node
#SBATCH --cpus-per-task=4
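As a sketch of how to scale out (the node count here is only an example), a job on 4 nodes running 16 MPI ranks on 16 GPUs changes only the node count; the per-node settings stay the same:

# Example scaled-up request: 4 nodes x 4 GPUs = 16 MPI ranks
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --gres="gpu:4" # GPUs per Node
#SBATCH --cpus-per-task=4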

Modules

Load the necessary modules. We purge any previously loaded modules to prevent conflicts. Julia 1.3 has an implicit dependency on GCC 8.3 (needed for libquadmath), and CUDA 10.1 is the most recent CUDA currently supported on Satori.

# Clear the environment from any previously loaded modules
module purge > /dev/null 2>&1

module load spack
module load gcc/8.3.0 # to get libquadmath
module load julia/1.3.0
module load cuda/10.1.243
module load openmpi/3.1.4-pmi-cuda
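To sanity-check the toolchain before a long run, you can list the loaded modules and confirm that the MPI and CUDA installations are on the path; a minimal sketch (run from inside an allocation, since the GPU check only works on a compute node):

module list
which mpirun     # should resolve to the openmpi/3.1.4-pmi-cuda installation
nvcc --version   # should report CUDA 10.1
nvidia-smi       # lists the 4 GPUs on the node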

Julia-specific configuration

Set JULIA_PROJECT to the directory into which you installed ClimateMachine. We use a separate depot (think of it as a global cache). If you plan to do many runs in parallel with different package configurations, you may want to separate the depot out further, but be aware that separate caches lead to higher startup overhead.

We turn off the pre-built binaries for MPI and CUDA so that the system installations are used. Lastly, we allow Julia to use multiple threads, matching the number of CPUs per SLURM task.

export JULIA_PROJECT=${HOME2}/ClimateMachine
export JULIA_DEPOT_PATH=${HOME2}/julia_depot

export JULIA_MPI_BINARY=system
export JULIA_CUDA_USE_BINARYBUILDER=false

export JULIA_NUM_THREADS=${SLURM_CPUS_PER_TASK:=1}

julia -e 'using Pkg; pkg"instantiate"; pkg"build MPI"'
julia -e 'using Pkg; pkg"precompile"'
  • julia -e 'using Pkg; pkg"instantiate"; pkg"build MPI"' instantiates the project and makes sure that MPI.jl picks up any changes to the environment.
  • julia -e 'using Pkg; pkg"precompile"' precompiles the project so that the compilation cache is not contended between the MPI ranks. It needs to be on its own line, otherwise some packages might be missed.

You can comment these two lines out if you are doing many runs with the same configuration and MPI variant.
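If you do want to separate the depot per package configuration, as mentioned above, one option is to key the depot path on a configuration name; a minimal sketch, where CONFIG_NAME is a hypothetical variable you would set per run:

# Hypothetical per-configuration depot to keep caches from clashing
CONFIG_NAME=dycoms-cuda101
export JULIA_DEPOT_PATH=${HOME2}/julia_depot_${CONFIG_NAME}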

Launching the experiment

Set EXPERIMENT to the experiment script you want to run; any additional command-line flags can be appended there as well.

# Resetting `CUDA_VISIBLE_DEVICES` so that every rank sees all local GPUs.
# This is needed to take advantage of faster node-local CUDA-aware communication.

cat > launch.sh << EoF_s
#! /bin/sh
export CUDA_VISIBLE_DEVICES=0,1,2,3
exec \$*
EoF_s
chmod +x launch.sh

EXPERIMENT="${HOME2}/ClimateMachine/experiments/AtmosLES/dycoms.jl --output-dir=${HOME2}/clima-${SLURM_JOB_ID}"

srun --mpi=pmi2 ./launch.sh julia ${EXPERIMENT}
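Assuming the directives and commands above are saved together in a single batch script, e.g. climatemachine.sbatch (the filename is hypothetical), the job is submitted and monitored with the standard SLURM tools:

sbatch climatemachine.sbatch    # submit the batch script
squeue -u `whoami`              # check the state of your jobs
tail -f slurm-<jobid>.out       # follow the job's standard output (default output file name)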