Running diagnostics distributedly

ACME diagnostics can be run distributedly on a cluster, which makes them finish faster.


Setting up the head node

Go to the head node (aims4 or acme1) and become root. Then run the following commands:

source /p/cscratch/acme/shaheen2/acme_diags_env/bin/activate /p/cscratch/acme/shaheen2/acme_diags_env
dask-scheduler --host SCHEDULER_ADDR
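
Once the scheduler is running, one way to confirm that it accepts connections, from the head node or from a compute node, is a plain TCP connection attempt. The sketch below is only an illustration: the helper name scheduler_reachable is hypothetical, and 8786 is assumed because it is the default dask-scheduler port. Use whatever port your scheduler reports at startup.

```python
import socket


def scheduler_reachable(host, port=8786, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper, not part of ACME diagnostics or dask.
    """
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        # Refused, unreachable, unresolved host name, or timed out.
        return False
    conn.close()
    return True
```

Calling scheduler_reachable("SCHEDULER_ADDR", PORT) with your own values returns False when the node cannot reach the scheduler, which usually points to a network or firewall problem rather than a problem with dask itself.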

Setting up the compute nodes

Go to each of the compute nodes and become root. Make sure that /p/cscratch is accessible from each node. Then run the following commands:

source /p/cscratch/acme/shaheen2/acme_diags_env/bin/activate /p/cscratch/acme/shaheen2/acme_diags_env
dask-worker SCHEDULER_ADDR:PORT --nprocs NUM_WORKERS --nthreads 1

Replace NUM_WORKERS with the number of worker processes to start on that machine, for example: dask-worker SCHEDULER_ADDR:PORT --nprocs 4 --nthreads 1
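
As a rough illustration of choosing the worker count, the sketch below builds the dask-worker command line from the machine's CPU count. The helper name build_worker_command and the one-process-per-core policy are assumptions made for this example, and "scheduler.example:8786" is a placeholder for SCHEDULER_ADDR:PORT.

```python
import multiprocessing


def build_worker_command(scheduler_addr, num_workers=None, nthreads=1):
    """Return the dask-worker argument list for one compute node.

    Hypothetical helper, not part of ACME diagnostics.
    """
    if num_workers is None:
        # Assumption: one single-threaded worker process per CPU core.
        num_workers = multiprocessing.cpu_count()
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return [
        "dask-worker", scheduler_addr,
        "--nprocs", str(num_workers),
        "--nthreads", str(nthreads),
    ]


if __name__ == "__main__":
    # Placeholder address; substitute your own SCHEDULER_ADDR:PORT.
    print(" ".join(build_worker_command("scheduler.example:8786", num_workers=4)))
```

One process per core is only a starting point. Memory-heavy diagnostics may need fewer workers per node.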

Querying the scheduler

Use the cdp-distrib CLI to interact with the scheduler and workers. You can do this even if you don't have access to the compute nodes: run cdp-distrib SCHEDULER_ADDR:PORT -w. Below is a sample output.

(/p/cscratch/acme/shaheen2/acme_diags_env) shaheen2@aims4:~$ cdp-distrib -w
Scheduler has 12 workers attached to it

Information about worker at
	name:		tcp://
	ncores:		1
	executing:	0
	memory_limit:	10146302976.0
	pid:		60292
	last-task:	2017-07-28 16:35:43
	last-seen:	2017-07-28 16:58:38

Information about worker at
	name:		tcp://
	ncores:		1
	executing:	0
	memory_limit:	10146302976.0
	pid:		60290
	last-task:	2017-07-28 16:35:46
	last-seen:	2017-07-28 16:58:38
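
If you want to post-process a report like the one above, for example to list worker PIDs, a small parser is enough. The sketch below assumes the output keeps the layout shown in the sample ("Information about worker at" followed by tab-indented key: value lines). The function name parse_worker_report is hypothetical, and the worker addresses in SAMPLE are left blank exactly as in the sample.

```python
def parse_worker_report(text):
    """Split cdp-distrib -w style output into one dict per worker."""
    header = "Information about worker at"
    workers = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(header):
            # Start a new worker record; the address follows the header text.
            current = {"address": stripped[len(header):].strip()}
            workers.append(current)
        elif current is not None and ":" in stripped:
            # Split on the first colon only, so timestamps stay intact.
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return workers


SAMPLE = (
    "Scheduler has 12 workers attached to it\n"
    "\n"
    "Information about worker at\n"
    "\tname:\t\ttcp://\n"
    "\tncores:\t\t1\n"
    "\texecuting:\t0\n"
    "\tpid:\t\t60292\n"
    "\tlast-seen:\t2017-07-28 16:58:38\n"
)

if __name__ == "__main__":
    for worker in parse_worker_report(SAMPLE):
        print(worker["pid"], worker["executing"], worker["last-seen"])
```

All values are kept as strings. Convert fields such as ncores or memory_limit yourself if you need numbers.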


  • If you get an error like the one below when running dask-worker, make sure you can ping SCHEDULER_ADDR from the compute node. If you cannot, contact your sysadmin.
    (/p/cscratch/acme/shaheen2/acme_diags_env) [root@greyworm1 ~]# dask-worker
    distributed.worker - INFO - Trying to connect to scheduler: tcp://
  • If, after running the diagnostics, you eventually get an IOError: [Errno 24] Too many open files error on the worker or head nodes, follow the steps below on all of the compute and head nodes. For more detail, see the answer in the Dask FAQ.
    • View the old soft and hard limits on open files
      # ulimit -Sn
      # ulimit -Hn
    • Open /etc/security/limits.conf and add the following lines to increase the limit for root to something larger
      root soft nofile 8192
      root hard nofile 20480
    • View the new soft and hard limits
      # exit
      $ sudo su -
      # ulimit -Sn
      # ulimit -Hn
  • When you start dask-scheduler, are workers created (Starting worker ...) as in the snippet below? If so, run lsof | grep -E 'python2.7.*LISTEN' and kill all of the Python processes (with kill -9 PID) listening on localhost (those with *:SOMEPORT). Do this with caution.
    (acme_diags_env) shaheen2@shaheen2ml: dask-scheduler
    distributed.scheduler - INFO - -----------------------------------------------
    distributed.scheduler - INFO -   Scheduler at:  tcp://
    distributed.scheduler - INFO -       bokeh at:    
    distributed.scheduler - INFO -        http at:    
    distributed.scheduler - INFO - Local Directory: /var/folders/nl/4tby_mh129g_95fj9dh6cgdh001nkh/T/scheduler-QMmVpu
    distributed.scheduler - INFO - -----------------------------------------------
    distributed.scheduler - INFO - Register tcp://
    distributed.scheduler - INFO - Register tcp://
    distributed.scheduler - INFO - Register tcp://
    distributed.scheduler - INFO - Register tcp://
    distributed.scheduler - INFO - Register tcp://
    distributed.scheduler - INFO - Register tcp://
    distributed.scheduler - INFO - Register tcp://
    distributed.scheduler - INFO - Register tcp://
    distributed.scheduler - INFO - Starting worker compute stream, tcp://
    distributed.scheduler - INFO - Starting worker compute stream, tcp://
    distributed.scheduler - INFO - Starting worker compute stream, tcp://
    distributed.scheduler - INFO - Starting worker compute stream, tcp://
    distributed.scheduler - INFO - Starting worker compute stream, tcp://
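
For the Too many open files item above, the limits can also be inspected from Python, which shows what a worker process started from the same shell would inherit. This is a minimal sketch using the standard resource module (Unix only). The helper names are hypothetical, and the thresholds are simply the values suggested for /etc/security/limits.conf.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def limit_is_at_least(limit, wanted):
    """Compare an rlimit value with a target; RLIM_INFINITY means unlimited."""
    return limit == resource.RLIM_INFINITY or limit >= wanted


if __name__ == "__main__":
    soft, hard = nofile_limits()
    # 8192 and 20480 are the soft and hard values from the limits.conf step.
    print("soft limit %s, large enough: %s" % (soft, limit_is_at_least(soft, 8192)))
    print("hard limit %s, large enough: %s" % (hard, limit_is_at_least(hard, 20480)))
```

Run it as the same user that starts dask-worker or dask-scheduler, in a fresh login session, so that changes to limits.conf have taken effect.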

Creating the Anaconda environment (for developers)

Creating a single Anaconda environment that is accessible from both the head node and the compute nodes can be difficult because of differing system configurations and security settings. Below is how it was done. Eventually, the distributed functionality will be included in the default ACME environment.

  1. Log in to the head node and make sure you have Anaconda installed in a location accessible to the compute nodes. In the case of aims4 and the greyworm cluster, only /p/cscratch is accessible, so we installed Anaconda in /p/cscratch/acme/shaheen2/anaconda2/.
  2. Create an Anaconda environment in a location accessible to the compute nodes (/p/cscratch/acme/shaheen2/acme_diags_env).
    /p/cscratch/acme/shaheen2/anaconda2/bin/conda create -p /p/cscratch/acme/shaheen2/acme_diags_env python=2.7 dask distributed -c conda-forge --copy -y
    Make sure to use --copy, which copies the packages instead of symbolically linking them. Even with --copy, activate, create, and conda are still symbolic links that point to the conda installation used to run conda create. This is why that conda (in /p/cscratch/acme/shaheen2/anaconda2/bin/conda) must be available on the head and compute nodes.
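
To see which entries in the environment are still symbolic links after using --copy, you can list them and their targets. The sketch below is an illustration only: find_symlinks is a hypothetical helper, and the path is the environment location used in the steps above.

```python
import os


def find_symlinks(directory):
    """Return {name: target} for every symbolic link directly inside directory."""
    links = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.islink(path):
            links[name] = os.readlink(path)
    return links


if __name__ == "__main__":
    # Environment path taken from the steps above; adjust it for your own setup.
    env_bin = "/p/cscratch/acme/shaheen2/acme_diags_env/bin"
    if os.path.isdir(env_bin):
        for name, target in sorted(find_symlinks(env_bin).items()):
            print("%s -> %s" % (name, target))
```

Any target that lies outside a file system shared with the compute nodes will be a broken link on those nodes.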