Running diagnostics distributedly
Zeshawn Shaheen edited this page Jul 19, 2017 · 15 revisions
ACME diagnostics can be run in a distributed fashion on a cluster, which makes them finish faster.
Go to the head node (aims4 or acme1) and become root.
Then run the following commands:
mkdir /p/cscratch/acme/shaheen2/.dask/
cp ~/.dask/config.yaml /p/cscratch/acme/shaheen2/.dask/config.yml
export DASK_CONFIG=/p/cscratch/acme/shaheen2/.dask/config.yml
source /p/cscratch/acme/shaheen2/acme_diags_env/bin/activate /p/cscratch/acme/shaheen2/acme_diags_env
dask-scheduler
Go to each of the compute nodes and become root. Make sure that /p/cscratch is accessible to each node.
Then run the following commands. Remember to replace SCHEDULER_ADDRESS in the last command with the address shown when you run dask-scheduler.
export DASK_CONFIG=/p/cscratch/acme/shaheen2/
source /p/cscratch/acme/shaheen2/acme_diags_env/bin/activate /p/cscratch/acme/shaheen2/acme_diags_env
srun dask-worker SCHEDULER_ADDRESS:8786
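Before starting the workers, it can help to confirm on each compute node that the shared location is really visible there. The following is a minimal sketch of such a check. The function name `check_shared_path` is hypothetical and is not part of ACME diagnostics or dask.

```shell
# Hypothetical helper for illustration only.
# Succeeds when the given path is a directory that the current user
# can read and traverse; otherwise prints a message to stderr and fails.
check_shared_path() {
    path="$1"
    if [ -d "$path" ] && [ -r "$path" ] && [ -x "$path" ]; then
        echo "ok: $path"
        return 0
    fi
    echo "not accessible: $path" >&2
    return 1
}
```

On a compute node it could be called as `check_shared_path /p/cscratch/acme/shaheen2/acme_diags_env`. A failure suggests that the environment cannot be activated from that node.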
- If you get an error like the one below when running dask-worker, make sure you can ping SCHEDULER_ADDRESS. If you cannot, contact your sysadmin.
(/p/cscratch/acme/shaheen2/acme_diags_env) [root@greyworm1 ~]# dask-worker 198.128.245.178:8786
...
distributed.worker - INFO - Trying to connect to scheduler: tcp://198.128.245.178:8786
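The ping check described above can be scripted, which is convenient when there are many compute nodes. This is only a sketch: the helper names are hypothetical, the address is assumed to be in HOST:PORT form, and the `ping` options assume the Linux `ping` (one packet, `-W` timeout in seconds).

```shell
# Hypothetical helpers for illustration; none of these names come from dask.

# Print the host part of a HOST:PORT string.
scheduler_host() {
    echo "${1%:*}"
}

# Print the port part of a HOST:PORT string.
scheduler_port() {
    echo "${1##*:}"
}

# Succeed if the scheduler host answers a single ping within 5 seconds.
# Assumes Linux ping, where -W is a timeout in seconds.
can_ping_scheduler() {
    ping -c 1 -W 5 "$(scheduler_host "$1")" > /dev/null 2>&1
}
```

For example, `can_ping_scheduler SCHEDULER_ADDRESS:8786 || echo "scheduler host unreachable"`. A successful ping only shows that the host is reachable. It does not prove that the scheduler port itself is open.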
Creating a single Anaconda environment that is accessible from both the head node and the compute nodes can be difficult because of differing system configurations and security settings. Below is how it was done. Eventually, the distributed components will be included in the default ACME environment.
- Log in to the head node and make sure you have Anaconda installed in a location accessible to the compute nodes. In the case of aims4 and the greyworm cluster, only /p/cscratch is accessible, so we installed Anaconda in /p/cscratch/acme/shaheen2/anaconda2/.
- Create an Anaconda environment in a location accessible to the compute nodes (/p/cscratch/acme/shaheen2/acme_diags_env).
/p/cscratch/acme/shaheen2/anaconda2/bin/conda create -p /p/cscratch/acme/shaheen2/acme_diags_env python=2.7 dask distributed -c conda-forge --copy -y
Make sure to use --copy, which copies the packages instead of symbolically linking them. Even with --copy, activate, create, and conda are still symbolically linked to the conda that was used to run conda create. This is why that conda (in /p/cscratch/acme/shaheen2/anaconda2/bin/conda) needs to be available on both the head node and the compute nodes.
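To see which entries in an environment's bin directory are still symbolic links after using --copy, a small helper like the one below can be used. It is a sketch under stated assumptions: `list_symlinks` is a hypothetical name, and `readlink` is assumed to be available, as it is on typical Linux systems.

```shell
# Hypothetical helper for illustration only.
# Prints "name -> target" for every symbolic link directly inside
# the given directory; regular files are skipped.
list_symlinks() {
    dir="$1"
    for entry in "$dir"/*; do
        if [ -L "$entry" ]; then
            echo "$(basename "$entry") -> $(readlink "$entry")"
        fi
    done
}
```

For example, `list_symlinks /p/cscratch/acme/shaheen2/acme_diags_env/bin` would show where each remaining link points. Any target outside the shared location would not resolve on the compute nodes.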