# Distributed IO

We'll see that for applications that are limited by IO bandwidth, a wide distribution across compute nodes can be beneficial.

## Technical preamble

Spin up a Jobqueue cluster whose workers all run on different nodes, with one worker per job.
(We'll ensure different nodes for each job by requesting more than 50% of the available CPUs in each job.)
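The reasoning behind the 50% rule can be sketched with integer division: if every job asks for more than half of a node's CPUs, a second job no longer fits on the same node. This is only an illustration that assumes CPUs are the sole constraint on placement. The helper `max_jobs_per_node` is hypothetical, and the node size of 96 is an assumption borrowed from the `cores` entry of the configuration printed below.

```python
def max_jobs_per_node(cpus_per_node: int, cpus_per_job: int) -> int:
    """Upper bound on jobs sharing one node if only CPUs limit placement."""
    if cpus_per_job <= 0:
        raise ValueError("cpus_per_job must be positive")
    return cpus_per_node // cpus_per_job


# Assumed node size, for illustration only.
CPUS_PER_NODE = 96

for cpus_per_job in (48, 49, 96):
    fit = max_jobs_per_node(CPUS_PER_NODE, cpus_per_job)
    print(f"{cpus_per_job} CPUs per job: at most {fit} job(s) per node")
```

With exactly half of the CPUs two jobs can still share a node, so the request has to be strictly larger than 50%.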

In [1]:
import dask, dask.distributed, os
import dask_jobqueue

In [2]:
# look up further Dask configurations in local directory
additional_config = dask.config.collect(paths=['.'])
dask.config.update(dask.config.config, additional_config, priority='new');

In [3]:
dask.config.get('jobqueue.juwels-jobqueue-config')

{'cores': 96,
 'memory': '90000M',
 'processes': 1,
 'local-directory': '/tmp',
 'death-timeout': 60,
 'extra': ['--host ${SLURMD_NODENAME}.ib.juwels.fzj.de'],
 'interface': None,
 'shebang': '#!/usr/bin/env bash',
 'walltime': '00:15:00',
 'log-directory': 'dask_jobqueue_logs',
 'name': 'dask-worker',
 'queue': None,
 'project': None,
 'job-cpu': None,
 'job-mem': None,
 'job-extra': [],
 'env-extra': []}

In [4]:
cluster = dask_jobqueue.SLURMCluster(
    config_name='juwels-jobqueue-config',
    project='esmtst', # specify budget name associated with project
    queue='esm', # choose queue by available resources
    scheduler_options={"host": os.environ['HOSTNAME']}, # globally visible local scheduler network location
    cores=16  # 16 cores per job, all in one worker process ('processes': 1 in the config)
)

client = dask.distributed.Client(cluster)
cluster.scale(jobs=12)  # not sure we'll get that many jobs

In [5]:
client

| Client | Cluster |
|---|---|
| Scheduler: tcp://10.11.159.193:37283 | Workers: 0 |
| Dashboard: http://10.11.159.193:8787/status | Cores: 0 |
| | Memory: 0 B |


In [15]:
!squeue -u {os.environ["USER"]}

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           2524636       esm dask-wor    rath1 PD       0:00      1 (Resources)
           2524637       esm dask-wor    rath1 PD       0:00      1 (Priority)
           2524638       esm dask-wor    rath1 PD       0:00      1 (Priority)
           2524639       esm dask-wor    rath1 PD       0:00      1 (Priority)
           2524640       esm dask-wor    rath1 PD       0:00      1 (Priority)
           2524641       esm dask-wor    rath1 PD       0:00      1 (Priority)
           2524642       esm dask-wor    rath1 PD       0:00      1 (Priority)
           2524634       esm dask-wor    rath1  R       0:23      1 jwc00n012
           2524635       esm dask-wor    rath1  R       0:23      1 jwc00n015
           2524631       esm dask-wor    rath1  R       0:27      1 jwc00n003
           2524632       esm dask-wor    rath1  R       0:27      1 jwc00n006
           2524633       esm dask-wor    rath1  R
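To check that the running jobs really landed on different nodes, the `squeue` listing can be summarised programmatically. The following is a minimal sketch: `summarize_squeue` is a hypothetical helper, it assumes the default eight-column layout shown above, and the sample text is a shortened copy of that output.

```python
def summarize_squeue(text: str) -> dict:
    """Count jobs per state and collect the nodes of running jobs."""
    states = {}
    nodes = set()
    for line in text.strip().splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 5:
            continue  # ignore malformed lines
        state = fields[4]
        states[state] = states.get(state, 0) + 1
        # Running jobs name their node in the last of the eight columns.
        if state == "R" and len(fields) >= 8:
            nodes.add(fields[7])
    return {"states": states, "running_nodes": sorted(nodes)}


sample = """\
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2524636 esm dask-wor rath1 PD 0:00 1 (Resources)
2524637 esm dask-wor rath1 PD 0:00 1 (Priority)
2524634 esm dask-wor rath1 R 0:23 1 jwc00n012
2524635 esm dask-wor rath1 R 0:23 1 jwc00n015
"""

summary = summarize_squeue(sample)
print(summary)
```

If the number of distinct entries in `running_nodes` equals the number of jobs in state `R`, every running job has a node to itself.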

## Create random data and write them to disk

In [12]:
from dask import array as darr

In [16]:
random_data = darr.random.normal(
    size=(int(500_000_000_000 / 8), ),
    chunks=(int(1_000_000_000 / 8), )
)
random_data

|       | Array | Chunk |
|-------|-------|-------|
| Bytes | 500.00 GB | 1000.00 MB |
| Shape | (62500000000,) | (125000000,) |
| Count | 500 Tasks | 500 Chunks |
| Type  | float64 | numpy.ndarray |
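The shape and chunk numbers follow directly from the byte sizes passed to `darr.random.normal`. A short sketch of the arithmetic, using only the values from the cell above (the variable names are chosen here for illustration):

```python
BYTES_PER_FLOAT64 = 8

total_bytes = 500_000_000_000  # 500 GB of data in total
chunk_bytes = 1_000_000_000    # 1000 MB per chunk

n_elements = total_bytes // BYTES_PER_FLOAT64
chunk_elements = chunk_bytes // BYTES_PER_FLOAT64

# Integer ceiling division: number of chunks needed to cover the array.
n_chunks = (n_elements + chunk_elements - 1) // chunk_elements

print(n_elements, chunk_elements, n_chunks)
```

The results match the array summary: 62500000000 elements, 125000000 elements per chunk, and 500 chunks.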


In [17]:
!rm -rf random_data.zarr/

In [18]:
%time random_data.to_zarr("random_data.zarr")

CPU times: user 6.66 s, sys: 624 ms, total: 7.29 s
Wall time: 1min 13s


In [19]:
!du -sh random_data.zarr/

444G	random_data.zarr/


## Find largest number with disk IO

We'll re-read the data and find the maximum on the fly.

Note in the Dask dashboard that we don't saturate CPU load.
This means we're limited by IO rather than compute.

In [20]:
random_data = darr.from_zarr("random_data.zarr/")
random_data

|       | Array | Chunk |
|-------|-------|-------|
| Bytes | 500.00 GB | 1000.00 MB |
| Shape | (62500000000,) | (125000000,) |
| Count | 501 Tasks | 500 Chunks |
| Type  | float64 | numpy.ndarray |


In [21]:
%time random_data.max().compute()

CPU times: user 5.37 s, sys: 407 ms, total: 5.78 s
Wall time: 28.4 s


6.425670194700115

We've just read 444 GB from disk, decompressed it to 500 GB and found the maximum in just under 30 seconds.

That's approx. 15 GB/s of read bandwidth.
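The estimate can be reproduced from the measured wall time. This is a rough sketch: it treats the `444G` reported by `du -sh` as 444e9 bytes, which is an assumption because `du -h` may report binary units, and `bandwidth_gb_per_s` is a hypothetical helper.

```python
def bandwidth_gb_per_s(n_bytes: float, seconds: float) -> float:
    """Average throughput in GB/s (1 GB = 1e9 bytes)."""
    return n_bytes / 1e9 / seconds


wall_time_s = 28.4       # wall time of the max() computation above
on_disk_bytes = 444e9    # compressed size on disk (assumed decimal units)
in_memory_bytes = 500e9  # decompressed size of the array

disk_rate = bandwidth_gb_per_s(on_disk_bytes, wall_time_s)
memory_rate = bandwidth_gb_per_s(in_memory_bytes, wall_time_s)

print(f"read from disk: {disk_rate:.1f} GB/s")
print(f"decompressed:   {memory_rate:.1f} GB/s")
```

Both figures are averages over the whole computation, including scheduling overhead, so the peak read rate may be higher.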

## Decrease cluster size and see effect on IO bandwidth

In [22]:
cluster.scale(jobs=1)

In [26]:
!squeue -u {os.environ["USER"]}

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           2524634       esm dask-wor    rath1  R       3:06      1 jwc00n012


In [27]:
client

| Client | Cluster |
|---|---|
| Scheduler: tcp://10.11.159.193:37283 | Workers: 1 |
| Dashboard: http://10.11.159.193:8787/status | Cores: 16 |
| | Memory: 90.00 GB |


In [28]:
random_data = darr.from_zarr("random_data.zarr/")
random_data

|       | Array | Chunk |
|-------|-------|-------|
| Bytes | 500.00 GB | 1000.00 MB |
| Shape | (62500000000,) | (125000000,) |
| Count | 501 Tasks | 500 Chunks |
| Type  | float64 | numpy.ndarray |


In [29]:
%time random_data.max().compute()

CPU times: user 9.51 s, sys: 774 ms, total: 10.3 s
Wall time: 1min 47s


6.425670194700115
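To compare the two runs, the wall times printed by `%time` can be converted to seconds and divided. The parser below is a minimal sketch that only handles the two formats appearing in this notebook (`28.4 s` and `1min 47s`); `wall_time_seconds` is a hypothetical helper.

```python
import re


def wall_time_seconds(text: str) -> float:
    """Convert strings like '28.4 s' or '1min 47s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)min\s*)?(\d+(?:\.\d+)?)\s*s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised wall time: {text!r}")
    minutes = int(match.group(1) or 0)
    return 60 * minutes + float(match.group(2))


many_jobs_s = wall_time_seconds("28.4 s")   # run on the scaled-up cluster
one_job_s = wall_time_seconds("1min 47s")   # run with a single job

slowdown = one_job_s / many_jobs_s
print(f"single job is {slowdown:.1f}x slower")
```

With a single job the same reduction takes roughly 3.8 times as long, although the data and the computation are unchanged.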

## Bottom line

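For IO-bound problems, we'd like to be able to scale horizontally (across more nodes) rather than vertically (more cores per node). Reading the same data with a single job took 1min 47s instead of 28.4 s.

This could be addressed in the scheduler configuration, by choosing a placement policy that fills all nodes equally rather than one that keeps as many nodes as possible empty.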
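The two placement policies can be illustrated with a toy model. This is not how any particular batch scheduler is implemented: `place_jobs`, the policy names `pack` and `spread`, and the node and slot counts are all assumptions made for the sketch.

```python
def place_jobs(n_jobs: int, n_nodes: int, slots_per_node: int, policy: str) -> list:
    """Return the number of jobs placed on each node.

    'pack' fills a node completely before using the next one
    (keeps as many nodes as possible empty).
    'spread' always picks the least-loaded node (fills nodes equally).
    """
    if n_jobs > n_nodes * slots_per_node:
        raise ValueError("not enough slots for all jobs")
    load = [0] * n_nodes
    for _ in range(n_jobs):
        if policy == "pack":
            node = next(i for i, used in enumerate(load) if used < slots_per_node)
        elif policy == "spread":
            node = min(range(n_nodes), key=lambda i: load[i])
        else:
            raise ValueError(f"unknown policy: {policy!r}")
        load[node] += 1
    return load


packed = place_jobs(n_jobs=6, n_nodes=6, slots_per_node=6, policy="pack")
spread = place_jobs(n_jobs=6, n_nodes=6, slots_per_node=6, policy="spread")

print("pack:  ", packed, "nodes in use:", sum(1 for n in packed if n))
print("spread:", spread, "nodes in use:", sum(1 for n in spread if n))
```

In this model the `pack` policy puts all six jobs on one node, so they share that node's IO link, while `spread` gives every job its own node.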