
Running diagnostics distributedly


ACME diagnostics can be run in a distributed fashion on a cluster, which speeds up the run.

Setup

Setting up the head node

Go to the head node (aims4 or acme1) and become root. Then run the following commands:

mkdir /p/cscratch/acme/shaheen2/.dask/
cp ~/.dask/config.yaml /p/cscratch/acme/shaheen2/.dask/config.yml
export DASK_CONFIG=/p/cscratch/acme/shaheen2/.dask/config.yml

source /p/cscratch/acme/shaheen2/acme_diags_env/bin/activate /p/cscratch/acme/shaheen2/acme_diags_env
dask-scheduler --host 10.10.10.1
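
Leave dask-scheduler running in that shell. Before starting any workers, it can help to confirm that the scheduler is reachable. A minimal check from another shell, assuming the scheduler address and default port 8786 shown above, and that nc is installed on the node:

# Check that the scheduler host responds and that port 8786 is open
ping -c 3 10.10.10.1
nc -z 10.10.10.1 8786 && echo "scheduler port is reachable"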

Setting up the compute nodes

Go to each of the compute nodes and become root. Make sure that /p/cscratch is accessible from each node. Then run the following commands.

export DASK_CONFIG=/p/cscratch/acme/shaheen2/
source /p/cscratch/acme/shaheen2/acme_diags_env/bin/activate /p/cscratch/acme/shaheen2/acme_diags_env
srun dask-worker 10.10.10.1:8786

You can also choose the number of worker processes on your machine with something like dask-worker 10.10.10.1:8786 --nprocs 4
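
Once the workers are up, one way to confirm that they registered with the scheduler is to connect a dask.distributed Client from the same activated environment. A minimal sketch, assuming the scheduler address used above:

# Connect to the running scheduler and print the registered workers
python -c "
from dask.distributed import Client
client = Client('10.10.10.1:8786')           # address of the running dask-scheduler
print(client.scheduler_info()['workers'])    # one entry per registered worker
"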

Troubleshooting

  • If you get an error like the one below when running dask-worker, make sure you can ping the scheduler's address (SCHEDULER_ADDRESS). If you can't, contact your sysadmin.
    (/p/cscratch/acme/shaheen2/acme_diags_env) [root@greyworm1 ~]# dask-worker 198.128.245.178:8786
    ...
    distributed.worker - INFO - Trying to connect to scheduler: tcp://198.128.245.178:8786
    
  • When you start dask-scheduler, do workers register right away (Starting worker ...) like in the snippet below? If so, run lsof | grep -E 'python2.7.*LISTEN' and kill (with kill -9 PID) each of the Python processes listening on the localhost (those with *:SOMEPORT); a sketch of these commands follows the snippet below. Do this with caution.
    (acme_diags_env) shaheen2@shaheen2ml: dask-scheduler
    distributed.scheduler - INFO - -----------------------------------------------
    distributed.scheduler - INFO -   Scheduler at:  tcp://128.15.245.24:8786
    distributed.scheduler - INFO -       bokeh at:              0.0.0.0:8787
    distributed.scheduler - INFO -        http at:              0.0.0.0:9786
    distributed.scheduler - INFO - Local Directory: /var/folders/nl/4tby_mh129g_95fj9dh6cgdh001nkh/T/scheduler-QMmVpu
    distributed.scheduler - INFO - -----------------------------------------------
    distributed.scheduler - INFO - Register tcp://128.15.245.24:50215
    distributed.scheduler - INFO - Register tcp://128.15.245.24:50234
    distributed.scheduler - INFO - Register tcp://128.15.245.24:50145
    distributed.scheduler - INFO - Register tcp://128.15.245.24:50238
    distributed.scheduler - INFO - Register tcp://128.15.245.24:50219
    distributed.scheduler - INFO - Register tcp://128.15.245.24:49982
    distributed.scheduler - INFO - Register tcp://128.15.245.24:50088
    distributed.scheduler - INFO - Register tcp://128.15.245.24:50101
    distributed.scheduler - INFO - Starting worker compute stream, tcp://128.15.245.24:50238
    distributed.scheduler - INFO - Starting worker compute stream, tcp://128.15.245.24:50219
    distributed.scheduler - INFO - Starting worker compute stream, tcp://128.15.245.24:49982
    distributed.scheduler - INFO - Starting worker compute stream, tcp://128.15.245.24:50088
    distributed.scheduler - INFO - Starting worker compute stream, tcp://128.15.245.24:50101
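
For the second item above, here is a sketch of the cleanup. kill -9 terminates processes forcefully, so double-check each PID (taken from the lsof output) before killing it:

# Find Python 2.7 processes holding listening sockets
lsof | grep -E 'python2.7.*LISTEN'
# For each process listening on *:SOMEPORT, note its PID and kill it
kill -9 PID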
    

Creating the Anaconda environment (for developers)

Creating a single Anaconda environment accessible from both the head node and the compute nodes can be difficult due to differing system configurations and security settings. Below is how it was done. Eventually, all of this distributed functionality will be included in the default ACME environment.

  1. Log in to the head node and make sure you have Anaconda installed in a location accessible to the compute nodes. In the case of aims4 and the greyworm cluster, only /p/cscratch is accessible, so we installed Anaconda in /p/cscratch/acme/shaheen2/anaconda2/.
  2. Create an Anaconda environment in a location accessible to the compute nodes (/p/cscratch/acme/shaheen2/acme_diags_env).
    /p/cscratch/acme/shaheen2/anaconda2/bin/conda create -p /p/cscratch/acme/shaheen2/acme_diags_env python=2.7 dask distributed -c conda-forge --copy -y
    
    Make sure to use --copy; it copies the packages instead of symbolically linking them. Even with --copy, the activate, create, and conda scripts are still symbolic links back to whichever conda was used in conda create. This is why that conda (/p/cscratch/acme/shaheen2/anaconda2/bin/conda) needs to be available on both the head and compute nodes. (A verification sketch follows.)
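
As a sanity check, you can then verify from a compute node that the copied environment works. A minimal sketch using the paths from this page:

# Activate the shared environment and confirm that distributed imports
source /p/cscratch/acme/shaheen2/acme_diags_env/bin/activate /p/cscratch/acme/shaheen2/acme_diags_env
python -c "import distributed; print(distributed.__version__)"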