
Running diagnostics distributedly

Zeshawn Shaheen edited this page Jul 19, 2017 · 15 revisions

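ACME diagnostics can be run in a distributed fashion on a cluster, with a scheduler on the head node and workers on the compute nodes. This lets the diagnostics finish faster than they would on a single machine.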

Setup

Setting up the head node

Go to the head node (aims4 or acme1) and become root. Then run the following commands:

mkdir /p/cscratch/acme/shaheen2/.dask/
cp ~/.dask/config.yaml /p/cscratch/acme/shaheen2/.dask/config.yml
export DASK_CONFIG=/p/cscratch/acme/shaheen2/.dask/config.yml

source /p/cscratch/acme/shaheen2/acme_diags_env/bin/activate /p/cscratch/acme/shaheen2/acme_diags_env
dask-scheduler

Setting up the compute nodes

Go to each of the compute nodes and become root. Make sure that /p/cscratch is accessible from each node. Then run the following commands, replacing SCHEDULER_ADDRESS in the last command with the address shown when you run dask-scheduler on the head node.

export DASK_CONFIG=/p/cscratch/acme/shaheen2/.dask/config.yml
source /p/cscratch/acme/shaheen2/acme_diags_env/bin/activate /p/cscratch/acme/shaheen2/acme_diags_env
srun dask-worker SCHEDULER_ADDRESS:8786
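
Before starting a worker, it can help to confirm that the shared files used above are actually visible from the compute node. The following is only a minimal sketch: the helper name check_shared_paths is hypothetical, and the paths in the example are the ones used in the setup steps on this page.

```shell
# Hypothetical helper: report which of the given paths are missing or unreadable.
# Returns 0 only if every path is readable from the current node.
check_shared_paths() {
    status=0
    for path in "$@"; do
        if [ -r "$path" ]; then
            echo "ok: $path"
        else
            echo "missing or unreadable: $path" >&2
            status=1
        fi
    done
    return "$status"
}

# Example use on a compute node (paths taken from the setup steps above):
# check_shared_paths /p/cscratch/acme/shaheen2/.dask/config.yml \
#     /p/cscratch/acme/shaheen2/acme_diags_env/bin/activate
```

If the helper reports a missing path, the shared file system is probably not mounted on that node, and the source and dask-worker commands above would fail for the same reason.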

Troubleshooting

  • If you get an error like the one below when running dask-worker, make sure you can ping SCHEDULER_ADDRESS. If not, contact your sysadmin.
    (/p/cscratch/acme/shaheen2/acme_diags_env) [root@greyworm1 ~]# dask-worker 198.128.245.178:8786
    ...
    distributed.worker - INFO - Trying to connect to scheduler: tcp://198.128.245.178:8786
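
Beyond ping, a quick way to check whether the scheduler port itself can be reached from a compute node is to try opening a TCP connection to it. The following is a sketch under a few assumptions: the helper name can_reach_scheduler is hypothetical, bash is available (for its /dev/tcp redirection), and the coreutils timeout command is installed.

```shell
# Hypothetical helper: succeed if a TCP connection to HOST:PORT can be opened.
# Assumes bash (for /dev/tcp) and the coreutils `timeout` command are available.
can_reach_scheduler() {
    host=$1
    port=${2:-8786}   # 8786 is the scheduler port used on this page
    # Inside the bash -c script, $0 is the host and $1 is the port.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example (replace SCHEDULER_ADDRESS with the address shown by dask-scheduler):
# can_reach_scheduler SCHEDULER_ADDRESS 8786 && echo "reachable" || echo "not reachable"
```

If ping succeeds but this check fails, a firewall rule or a scheduler that is not running is a more likely cause than a routing problem.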
    

Creating the Anaconda environment (for developers)

Creating a single Anaconda environment that is accessible from both the head node and the compute nodes can be difficult because of differing system configurations and security settings. The steps below describe how it was done. Eventually, everything needed for distributed runs will be included in the default ACME environment.

  1. Log in to the head node and make sure you have Anaconda installed in a location accessible to the compute nodes. In the case of aims4 and the greyworm cluster, only /p/cscratch is accessible, so we installed Anaconda in /p/cscratch/acme/shaheen2/anaconda2/.
  2. Create an Anaconda environment in a location accessible to the compute nodes (/p/cscratch/acme/shaheen2/acme_diags_env).
    /p/cscratch/acme/shaheen2/anaconda2/bin/conda create -p /p/cscratch/acme/shaheen2/acme_diags_env python=2.7 dask distributed -c conda-forge --copy -y
    
    Make sure to use --copy, which copies the packages instead of symbolically linking them. Even with --copy, activate, create, and conda are still symbolic links that point back to whichever conda was used to run conda create. This is why that conda (/p/cscratch/acme/shaheen2/anaconda2/bin/conda) needs to be available on both the head node and the compute nodes.
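
To see which entries in the environment's bin directory are still symbolic links after creating it with --copy, something like the following could be used. This is only a sketch: the helper name list_symlinks is hypothetical, and it assumes the readlink command is available.

```shell
# Hypothetical helper: print each symbolic link in a directory and its target.
list_symlinks() {
    dir=$1
    for entry in "$dir"/*; do
        if [ -L "$entry" ]; then
            echo "$entry -> $(readlink "$entry")"
        fi
    done
}

# Example, using the environment path from step 2:
# list_symlinks /p/cscratch/acme/shaheen2/acme_diags_env/bin
```

Any link target printed this way has to exist on the compute nodes as well, otherwise the linked command will not work there.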