Bebop is a Cray CS400 cluster with Intel Broadwell and Knights Landing compute nodes, available in the Laboratory Computing Resource Center (LCRC) at Argonne National Laboratory.
Begin by loading the Python 3 Anaconda module:
module load anaconda3
Create a conda virtual environment in which to install libEnsemble and all dependencies:
conda config --add channels intel
conda create --name my_env intelpython3_core python=3
source activate my_env
Your prompt should now indicate that the virtual environment is active. Install mpi4py and libEnsemble in this environment, making sure to reference the preinstalled Intel MPI compiler wrapper (mpiicc). The commands and prompt should resemble the following:
(my_env) user@login:~$ CC=mpiicc MPICC=mpiicc pip install mpi4py --no-binary mpi4py
(my_env) user@login:~$ pip install libensemble
Bebop uses Slurm for job submission and management. The two commands you'll likely use the most to run jobs are srun and sbatch, for running interactively and in batch mode, respectively.
libEnsemble node-worker affinity is especially flexible on Bebop. By adjusting srun runtime options, users may assign multiple libEnsemble workers to each allocated node (oversubscription) or assign multiple nodes per worker.
You can allocate four Knights Landing nodes for thirty minutes through the following:
salloc -N 4 -p knl -A [username OR project] -t 00:30:00
With your nodes allocated, launch your job with four MPI ranks:
srun -n 4 python calling.py
mpirun should also work. This line launches libEnsemble with a manager and three workers on one allocated compute node, leaving three nodes available for the workers to launch calculations with the job controller or a job-launch command. This is an example of running in centralized mode and, if using the job_controller, it should be initiated with central_mode=True.
Note
When performing a distributed MPI libEnsemble run and not oversubscribing, specify one more MPI process than the number of allocated nodes. The manager and first worker run together on a node.
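The rank count from the note above can be derived from the allocation instead of being typed by hand. The following is a minimal sketch: the helper name ranks_for_nodes is hypothetical (it is not part of libEnsemble or Slurm), and the fallback of four nodes is an assumption so that the snippet also runs outside a job.

```shell
#!/bin/bash
# Hypothetical helper (not part of libEnsemble or Slurm): number of MPI ranks
# for a distributed run without oversubscription, i.e. one worker per node
# plus one extra rank for the manager.
ranks_for_nodes() {
    echo $(( $1 + 1 ))
}

# Slurm sets SLURM_JOB_NUM_NODES inside an allocation; the fallback of 4 is
# an assumption so that the sketch also runs outside a job.
nodes="${SLURM_JOB_NUM_NODES:-4}"
ntasks="$(ranks_for_nodes "$nodes")"

# Print the launch line rather than running it.
echo "srun --ntasks ${ntasks} python calling_script.py"
```

For a four-node allocation this yields five ranks, which matches the srun line used in the batch script later on this page.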
If you would like to interact directly with the compute nodes via a shell, the following starts a bash session on a Knights Landing node for thirty minutes:
srun --pty -A [username OR project] -p knl -t 00:30:00 /bin/bash
Note
You will need to reactivate your conda virtual environment and reload your modules! Configuring this routine to occur automatically is recommended.
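One possible way to automate this is a small function in ~/.bashrc that runs only inside a Slurm job. This is a sketch rather than an LCRC recommendation: the function name libe_env is hypothetical, and it assumes the module and environment names used earlier on this page (anaconda3 and my_env).

```shell
# Hypothetical function for ~/.bashrc; the name libe_env is an assumption.
libe_env() {
    # Do nothing useful on machines without the module command.
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not found; skipping environment setup" >&2
        return 1
    fi
    # Reload the Anaconda module and reactivate the conda environment.
    module load anaconda3 && source activate my_env
}

# Slurm sets SLURM_JOB_ID in shells started through srun, so the setup
# runs on compute nodes and is skipped elsewhere.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    libe_env
fi
```

Calling libe_env by hand remains possible in any shell where the guard did not trigger.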
Batch scripts specify run settings using #SBATCH statements. A simple example for a libEnsemble use case running in distributed MPI mode on Broadwell nodes resembles the following:
#!/bin/bash
#SBATCH -J myjob
#SBATCH -N 4
#SBATCH -p bdwall
#SBATCH -A myproject
#SBATCH -o myjob.out
#SBATCH -e myjob.error
#SBATCH -t 00:15:00
# These four lines construct a machinefile for the job controller and Slurm
srun hostname | sort -u > node_list
head -n 1 node_list > machinefile.$SLURM_JOBID
cat node_list >> machinefile.$SLURM_JOBID
export SLURM_HOSTFILE=machinefile.$SLURM_JOBID
srun --ntasks 5 python calling_script.py
With this saved as myscript.sh, allocating, configuring, and running libEnsemble on Bebop is achieved by running:
sbatch myscript.sh
Example submission scripts for running on Bebop in distributed and centralized mode are also given in the examples directory.
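To see what the machinefile lines in the batch script above produce, the same steps can be tried outside of a job. In this sketch the output of srun hostname is replaced by made-up node names, and the files are written to a scratch directory; both are assumptions for illustration only.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir="$(mktemp -d)"

# Stand-in for `srun hostname | sort -u`; the node names are made up and
# repeated on purpose to show that sort -u removes duplicates.
printf '%s\n' node-0003 node-0001 node-0002 node-0001 | sort -u > "$workdir/node_list"

# Same steps as the batch script: first node once, then every node once.
head -n 1 "$workdir/node_list" > "$workdir/machinefile.demo"
cat "$workdir/node_list" >> "$workdir/machinefile.demo"

cat "$workdir/machinefile.demo"
```

The resulting file lists the first node twice and every other node once, so it has one more line than there are nodes. Assuming Slurm places one task per line of the file, this is consistent with the manager and first worker sharing a node.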
View the status of your submitted jobs with squeue, and cancel jobs with scancel <Job ID>.
See the LCRC Bebop documentation for more information about Bebop.