Running diagnostics distributedly
Zeshawn Shaheen edited this page Jul 21, 2017
ACME diagnostics can be run in a distributed manner on a cluster, which significantly speeds up a diagnostics run.
Go to the head node (aims4 or acme1) and become root. Then run the following commands:

```
mkdir /p/cscratch/acme/shaheen2/.dask/
cp ~/.dask/config.yaml /p/cscratch/acme/shaheen2/.dask/config.yml
export DASK_CONFIG=/p/cscratch/acme/shaheen2/.dask/config.yml
source /p/cscratch/acme/shaheen2/acme_diags_env/bin/activate /p/cscratch/acme/shaheen2/acme_diags_env
dask-scheduler
```
Go to each of the compute nodes and become root. Make sure that /p/cscratch is accessible from each node. Then run the following commands, replacing SCHEDULER_ADDRESS in the last command with the address printed when you ran dask-scheduler:

```
export DASK_CONFIG=/p/cscratch/acme/shaheen2/
source /p/cscratch/acme/shaheen2/acme_diags_env/bin/activate /p/cscratch/acme/shaheen2/acme_diags_env
srun dask-worker SCHEDULER_ADDRESS:8786
```
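Once the scheduler and workers are up, work is submitted to the cluster through a `dask.distributed` client. The sketch below is illustrative only: `submit_demo` is a hypothetical helper (not part of acme_diags), and `SCHEDULER_ADDRESS` stands in for the address printed by `dask-scheduler`.

```python
# Sketch of how a client submits work to the cluster started above.
# submit_demo is a hypothetical helper; SCHEDULER_ADDRESS stands in for
# the address printed by dask-scheduler.

def submit_demo(client):
    """Submit a few trivial tasks and gather their results."""
    futures = [client.submit(pow, n, 2) for n in range(4)]
    return client.gather(futures)  # -> [0, 1, 4, 9]

# To run against the real cluster (requires the distributed package
# installed in the environment described below):
#   from distributed import Client
#   client = Client("tcp://SCHEDULER_ADDRESS:8786")
#   print(submit_demo(client))
```

The `submit`/`gather` pattern shown here is how dask distributes independent tasks (such as per-variable diagnostics) across the workers started with `dask-worker`.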
- If you get an error like the one below when running dask-worker, make sure you can ping SCHEDULER_ADDRESS. If not, contact your sysadmin.

  ```
  (/p/cscratch/acme/shaheen2/acme_diags_env) [root@greyworm1 ~]# dask-worker 198.128.245.178:8786
  ...
  distributed.worker - INFO - Trying to connect to scheduler: tcp://198.128.245.178:8786
  ```
- When you start dask-scheduler, are there workers created (Starting worker ...) like in the snippet below? If so, run lsof | grep -E 'python2.7.*LISTEN' and kill all of the Python processes (with kill -9 PID) listening on the localhost (those with *:SOMEPORT). Do this with caution.

  ```
  (acme_diags_env) shaheen2@shaheen2ml: dask-scheduler
  distributed.scheduler - INFO - -----------------------------------------------
  distributed.scheduler - INFO - Scheduler at: tcp://128.15.245.24:8786
  distributed.scheduler - INFO - bokeh at: 0.0.0.0:8787
  distributed.scheduler - INFO - http at: 0.0.0.0:9786
  distributed.scheduler - INFO - Local Directory: /var/folders/nl/4tby_mh129g_95fj9dh6cgdh001nkh/T/scheduler-QMmVpu
  distributed.scheduler - INFO - -----------------------------------------------
  distributed.scheduler - INFO - Register tcp://128.15.245.24:50215
  distributed.scheduler - INFO - Register tcp://128.15.245.24:50234
  distributed.scheduler - INFO - Register tcp://128.15.245.24:50145
  distributed.scheduler - INFO - Register tcp://128.15.245.24:50238
  distributed.scheduler - INFO - Register tcp://128.15.245.24:50219
  distributed.scheduler - INFO - Register tcp://128.15.245.24:49982
  distributed.scheduler - INFO - Register tcp://128.15.245.24:50088
  distributed.scheduler - INFO - Register tcp://128.15.245.24:50101
  distributed.scheduler - INFO - Starting worker compute stream, tcp://128.15.245.24:50238
  distributed.scheduler - INFO - Starting worker compute stream, tcp://128.15.245.24:50219
  distributed.scheduler - INFO - Starting worker compute stream, tcp://128.15.245.24:49982
  distributed.scheduler - INFO - Starting worker compute stream, tcp://128.15.245.24:50088
  distributed.scheduler - INFO - Starting worker compute stream, tcp://128.15.245.24:50101
  ```
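As a quick alternative to ping, you can probe the scheduler's TCP port directly from a compute node. This is a minimal stand-alone sketch using only the Python standard library; `port_open` is an illustrative name, not part of acme_diags.

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: probe the scheduler's default port from a compute node,
# using the scheduler address from the error snippet above.
# print(port_open("198.128.245.178", 8786))
```

If this returns False while the scheduler is running, the problem is network reachability (routing or a firewall) rather than dask itself.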
Creating a single Anaconda environment accessible from both the head node and the compute nodes can be difficult due to differing system configurations and security settings. Below is how it was done. Eventually, all of this distributed setup will be included in the default ACME environment.
- Log in to the head node and make sure you have Anaconda installed in a location accessible to the compute nodes. In the case of aims4 and the greyworm cluster, only /p/cscratch is accessible, so we installed Anaconda in /p/cscratch/acme/shaheen2/anaconda2/.
- Create an Anaconda environment in a location accessible to the compute nodes (/p/cscratch/acme/shaheen2/acme_diags_env):

  ```
  /p/cscratch/acme/shaheen2/anaconda2/bin/conda create -p /p/cscratch/acme/shaheen2/acme_diags_env python=2.7 dask distributed -c conda-forge --copy -y
  ```

  Make sure to use --copy; it copies the packages instead of symbolically linking them. Even if you use --copy, activate, create, and conda are still symbolically linked based on which conda was used in conda create. This is why the conda in /p/cscratch/acme/shaheen2/anaconda2/bin/conda needs to be available on both the head and compute nodes.
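To check which files in the environment are still symbolic links after a --copy install, a few lines of standard-library Python suffice. This is a sketch; `find_symlinks` is an illustrative helper, not an acme_diags tool.

```python
import os

def find_symlinks(root):
    """Return the paths under root that are symbolic links."""
    links = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                links.append(path)
    return links

# Example: list what is still symlinked in the copied environment.
# print(find_symlinks("/p/cscratch/acme/shaheen2/acme_diags_env/bin"))
```

Any link whose target lives outside /p/cscratch would be broken on the compute nodes, which is exactly the failure mode the head/compute-node conda requirement above guards against.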