Running diagnostics distributedly
Zeshawn Shaheen edited this page Jul 21, 2017
ACME diagnostics can be run in a distributed manner on a cluster, which significantly speeds up a diagnostics run.
Go to the head node (aims4 or acme1) and become root. Then run the following commands:

```
mkdir /p/cscratch/acme/shaheen2/.dask/
cp ~/.dask/config.yaml /p/cscratch/acme/shaheen2/.dask/config.yml
export DASK_CONFIG=/p/cscratch/acme/shaheen2/.dask/config.yml
source /p/cscratch/acme/shaheen2/acme_diags_env/bin/activate /p/cscratch/acme/shaheen2/acme_diags_env
dask-scheduler
```
Go to each of the compute nodes and become root. Make sure that /p/cscratch is accessible from each node. Then run the following commands, replacing SCHEDULER_ADDRESS in the last command with the address printed when you ran dask-scheduler:

```
export DASK_CONFIG=/p/cscratch/acme/shaheen2/
source /p/cscratch/acme/shaheen2/acme_diags_env/bin/activate /p/cscratch/acme/shaheen2/acme_diags_env
srun dask-worker SCHEDULER_ADDRESS:8786
```
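Once the scheduler and workers are up, work is submitted to the cluster through a `dask.distributed` client. The sketch below is illustrative only: `submit_demo` is a hypothetical helper (not part of acme_diags), and `SCHEDULER_ADDRESS` stands in for the address printed by `dask-scheduler`.

```python
# Sketch of how a client submits work to the cluster started above.
# submit_demo is a hypothetical helper; SCHEDULER_ADDRESS stands in for
# the address printed by dask-scheduler.

def submit_demo(client):
    """Submit a few trivial tasks and gather their results."""
    futures = [client.submit(pow, n, 2) for n in range(4)]
    return client.gather(futures)  # -> [0, 1, 4, 9]

# To run against the real cluster (requires the distributed package
# installed in the environment described below):
#   from distributed import Client
#   client = Client("tcp://SCHEDULER_ADDRESS:8786")
#   print(submit_demo(client))
```

The `submit`/`gather` pattern shown here is how dask distributes independent tasks (such as per-variable diagnostics) across the workers started with `dask-worker`.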
- If you get an error like the one below when running dask-worker, make sure you can ping SCHEDULER_ADDRESS. If not, contact your sysadmin.

  ```
  (/p/cscratch/acme/shaheen2/acme_diags_env) [root@greyworm1 ~]# dask-worker 198.128.245.178:8786
  ...
  distributed.worker - INFO - Trying to connect to scheduler: tcp://198.128.245.178:8786
  ```
- When you start dask-scheduler, are there workers created (Starting worker ...) like in the snippet below? If so, run lsof | grep -E 'python2.7.*LISTEN' and kill all of the Python processes (with kill -9 PID) listening on the localhost (those with *:SOMEPORT). Do this with caution.

  ```
  (acme_diags_env) shaheen2@shaheen2ml: dask-scheduler
  distributed.scheduler - INFO - -----------------------------------------------
  distributed.scheduler - INFO - Scheduler at: tcp://128.15.245.24:8786
  distributed.scheduler - INFO - bokeh at: 0.0.0.0:8787
  distributed.scheduler - INFO - http at: 0.0.0.0:9786
  distributed.scheduler - INFO - Local Directory: /var/folders/nl/4tby_mh129g_95fj9dh6cgdh001nkh/T/scheduler-QMmVpu
  distributed.scheduler - INFO - -----------------------------------------------
  distributed.scheduler - INFO - Register tcp://128.15.245.24:50215
  distributed.scheduler - INFO - Register tcp://128.15.245.24:50234
  distributed.scheduler - INFO - Register tcp://128.15.245.24:50145
  distributed.scheduler - INFO - Register tcp://128.15.245.24:50238
  distributed.scheduler - INFO - Register tcp://128.15.245.24:50219
  distributed.scheduler - INFO - Register tcp://128.15.245.24:49982
  distributed.scheduler - INFO - Register tcp://128.15.245.24:50088
  distributed.scheduler - INFO - Register tcp://128.15.245.24:50101
  distributed.scheduler - INFO - Starting worker compute stream, tcp://128.15.245.24:50238
  distributed.scheduler - INFO - Starting worker compute stream, tcp://128.15.245.24:50219
  distributed.scheduler - INFO - Starting worker compute stream, tcp://128.15.245.24:49982
  distributed.scheduler - INFO - Starting worker compute stream, tcp://128.15.245.24:50088
  distributed.scheduler - INFO - Starting worker compute stream, tcp://128.15.245.24:50101
  ```
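As a quick alternative to ping, you can probe the scheduler's TCP port directly from a compute node. This is a minimal stand-alone sketch using only the Python standard library; `port_open` is an illustrative name, not part of acme_diags.

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: probe the scheduler's default port from a compute node,
# using the scheduler address from the error snippet above.
# print(port_open("198.128.245.178", 8786))
```

If this returns False while the scheduler is running, the problem is network reachability (routing or a firewall) rather than dask itself.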
Creating a single Anaconda environment accessible from both the head node and the compute nodes can be difficult due to differing system configurations and security settings. Below is how it was done. Eventually, all of this distributed setup will be included in the default ACME environment.
- Log in to the head node and make sure you have Anaconda installed in a location accessible to the compute nodes. In the case of aims4 and the greyworm cluster, only /p/cscratch is accessible, so we installed Anaconda in /p/cscratch/acme/shaheen2/anaconda2/.
- Create an Anaconda environment in a location accessible to the compute nodes (/p/cscratch/acme/shaheen2/acme_diags_env):

  ```
  /p/cscratch/acme/shaheen2/anaconda2/bin/conda create -p /p/cscratch/acme/shaheen2/acme_diags_env python=2.7 dask distributed -c conda-forge --copy -y
  ```

  Make sure to use --copy; it copies the packages instead of symbolically linking them. Even if you use --copy, activate, create, and conda are still symbolically linked based on which conda was used in conda create. This is why the conda in /p/cscratch/acme/shaheen2/anaconda2/bin/conda needs to be available on both the head and compute nodes.
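To check which files in the environment are still symbolic links after a --copy install, a few lines of standard-library Python suffice. This is a sketch; `find_symlinks` is an illustrative helper, not an acme_diags tool.

```python
import os

def find_symlinks(root):
    """Return the paths under root that are symbolic links."""
    links = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                links.append(path)
    return links

# Example: list what is still symlinked in the copied environment.
# print(find_symlinks("/p/cscratch/acme/shaheen2/acme_diags_env/bin"))
```

Any link whose target lives outside /p/cscratch would be broken on the compute nodes, which is exactly the failure mode the head/compute-node conda requirement above guards against.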