
Running diagnostics in a distributed manner


ACME diagnostics can be run in a distributed manner on a cluster, which significantly speeds up a diagnostics run.

Setup

Setting up the head node

Go to the head node (aims4 or acme1) and become root. Then run the following commands:

source /p/cscratch/acme/shaheen2/acme_diags_env/bin/activate /p/cscratch/acme/shaheen2/acme_diags_env
export UVCDAT_ANONYMOUS_LOG=False
dask-scheduler --host 10.10.10.1
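
If you want to sanity-check that the scheduler is up before starting any workers, a minimal check from the same environment might look like the sketch below. This is not part of the documented setup; it just connects a dask distributed Client to the scheduler started above.

from distributed import Client

client = Client('10.10.10.1:8786')         # scheduler started above; 8786 is the default port
print(client.scheduler_info()['address'])  # e.g. tcp://10.10.10.1:8786
client.close()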

Setting up the compute nodes

Go to each of the compute nodes and become root. Make sure that /p/cscratch is accessible from each node. Then run the following commands:

source /p/cscratch/acme/shaheen2/acme_diags_env/bin/activate /p/cscratch/acme/shaheen2/acme_diags_env
export UVCDAT_ANONYMOUS_LOG=False
dask-worker 10.10.10.1:8786 --nprocs NUM_WORKERS --nthreads 1

Choose NUM_WORKERS to set the number of worker processes started on each machine, e.g. dask-worker 10.10.10.1:8786 --nprocs 4 --nthreads 1 starts four single-threaded workers.
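
Once the workers have joined, a quick smoke test from any machine that can reach the scheduler might look like the sketch below. This is only an illustration, not part of the setup; the square function is a hypothetical example task.

from distributed import Client

def square(x):
    return x * x

client = Client('10.10.10.1:8786')       # the scheduler the workers connected to
futures = client.map(square, range(10))  # fan trivial tasks out to the workers
print(client.gather(futures))            # [0, 1, 4, ..., 81]
client.close()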

Querying the scheduler

Use the cdp-distrib CLI to interact with the scheduler and workers. You can do this even if you don't have access to the compute nodes; just run cdp-distrib SCHEDULER_ADDR:PORT -w. Below is a sample of the output.

(/p/cscratch/acme/shaheen2/acme_diags_env) shaheen2@aims4:~$ cdp-distrib 10.10.10.1:8786 -w
Scheduler 10.10.10.1:8786 has 12 workers attached to it

Information about worker at greyworm1.llnl.gov(10.10.10.51):37188
	name:		tcp://10.10.10.51:37188
	ncores:		1
	executing:	0
	memory_limit:	10146302976.0
	pid:		60292
	last-task:	2017-07-28 16:35:43
	last-seen:	2017-07-28 16:58:38

Information about worker at greyworm1.llnl.gov(10.10.10.51):38710
	name:		tcp://10.10.10.51:38710
	ncores:		1
	executing:	0
	memory_limit:	10146302976.0
	pid:		60290
	last-task:	2017-07-28 16:35:46
	last-seen:	2017-07-28 16:58:38
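
If you prefer to query the scheduler from Python rather than through cdp-distrib, a rough equivalent is sketched below using the dask distributed Client directly. Treat it as a starting point: the exact field names returned by scheduler_info() can vary between versions of distributed.

from distributed import Client

client = Client('10.10.10.1:8786')
info = client.scheduler_info()
print('Scheduler %s has %d workers attached to it' % (info['address'], len(info['workers'])))
for addr in sorted(info['workers']):
    worker = info['workers'][addr]
    print('  %s  ncores=%s  memory_limit=%s' % (addr, worker.get('ncores'), worker.get('memory_limit')))
client.close()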

Troubleshooting

  • If you get an error like the one below when running dask-worker, make sure you can ping SCHEDULER_ADDRESS. If not, contact your sysadmin.
    (/p/cscratch/acme/shaheen2/acme_diags_env) [root@greyworm1 ~]# dask-worker 198.128.245.178:8786
    ...
    distributed.worker - INFO - Trying to connect to scheduler: tcp://198.128.245.178:8786
    
  • If, after the diagnostics have been running for a while, you get an IOError: [Errno 24] Too many open files error on the worker or head nodes, follow the steps below on all of the compute and head nodes. For more detail, see the corresponding answer in the Dask FAQ. A short Python sketch after this list shows how to check the limits that each worker process actually sees.
    • View the old soft and hard limits on open files
      # ulimit -Sn
      # ulimit -Hn
      
    • Open /etc/security/limits.conf and add the following lines to raise the open-file limits for root
      root soft nofile 8192
      root hard nofile 20480
      
    • Log out of the root shell and back in so the new limits take effect, then view them
      # exit
      $ sudo su -
      # ulimit -Sn
      # ulimit -Hn
      
  • When you start dask-scheduler, do workers register right away (Starting worker ...) like in the snippet below, even though you haven't started any? These are most likely stale worker processes left over from an earlier run. Run lsof | grep -E 'python2.7.*LISTEN' and kill each of the Python processes (with kill -9 PID) listening on the localhost (those with *:SOMEPORT). Do this with caution.
    (acme_diags_env) shaheen2@shaheen2ml: dask-scheduler
    distributed.scheduler - INFO - -----------------------------------------------
    distributed.scheduler - INFO -   Scheduler at:  tcp://128.15.245.24:8786
    distributed.scheduler - INFO -       bokeh at:              0.0.0.0:8787
    distributed.scheduler - INFO -        http at:              0.0.0.0:9786
    distributed.scheduler - INFO - Local Directory: /var/folders/nl/4tby_mh129g_95fj9dh6cgdh001nkh/T/scheduler-QMmVpu
    distributed.scheduler - INFO - -----------------------------------------------
    distributed.scheduler - INFO - Register tcp://128.15.245.24:50215
    distributed.scheduler - INFO - Register tcp://128.15.245.24:50234
    distributed.scheduler - INFO - Register tcp://128.15.245.24:50145
    distributed.scheduler - INFO - Register tcp://128.15.245.24:50238
    distributed.scheduler - INFO - Register tcp://128.15.245.24:50219
    distributed.scheduler - INFO - Register tcp://128.15.245.24:49982
    distributed.scheduler - INFO - Register tcp://128.15.245.24:50088
    distributed.scheduler - INFO - Register tcp://128.15.245.24:50101
    distributed.scheduler - INFO - Starting worker compute stream, tcp://128.15.245.24:50238
    distributed.scheduler - INFO - Starting worker compute stream, tcp://128.15.245.24:50219
    distributed.scheduler - INFO - Starting worker compute stream, tcp://128.15.245.24:49982
    distributed.scheduler - INFO - Starting worker compute stream, tcp://128.15.245.24:50088
    distributed.scheduler - INFO - Starting worker compute stream, tcp://128.15.245.24:50101
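
To confirm that the larger open-file limits are what the worker processes actually see (the check referenced in the "Too many open files" item above), you can ask every worker directly with Client.run. The sketch below is only an illustration; open_file_limits is a hypothetical helper, not something shipped with the diagnostics.

from distributed import Client

def open_file_limits():
    import resource
    # (soft, hard) limits on open file descriptors for this worker process
    return resource.getrlimit(resource.RLIMIT_NOFILE)

client = Client('10.10.10.1:8786')
print(client.run(open_file_limits))  # maps each worker address to its (soft, hard) limits
client.close()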
    

Creating the Anaconda environment (for developers)

Creating a single Anaconda environment accessible from both the head node and the compute nodes can be difficult, due to differing system configurations and security settings. Below is how it was done. Eventually, all of this distributed functionality will be included in the default ACME environment.

  1. Login to the head node and make sure you have Anaconda installed in a location accessible to the compute nodes. In the case of aims4 and the greyworm cluster, only /p/cscratch is accessible, so we installed Anaconda in /p/cscratch/acme/shaheen2/anaconda2/.
  2. Create an Anaconda environment in a location accessible to the compute nodes (/p/cscratch/acme/shaheen2/acme_diags_env).
    /p/cscratch/acme/shaheen2/anaconda2/bin/conda create -p /p/cscratch/acme/shaheen2/acme_diags_env python=2.7 dask distributed -c conda-forge --copy -y
    
    Make sure to use --copy: it copies the packages instead of symbolically linking them. Even with --copy, activate, create, and conda are still symbolic links that point back to the conda installation used in the conda create command. This is why that conda (in /p/cscratch/acme/shaheen2/anaconda2/bin/conda) needs to be available on both the head and compute nodes.
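
A recurring problem with a shared environment like this is version skew between what the head node and the compute nodes see. As a hedged sanity check, and assuming your version of distributed provides Client.get_versions (recent releases do), something like the sketch below can compare package versions across the cluster:

from distributed import Client

client = Client('10.10.10.1:8786')
# Compares package versions reported by the client, the scheduler, and every
# worker; check=True raises an error if they disagree.
print(client.get_versions(check=True))
client.close()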