# Distributed IO

We'll see that applications limited by IO bandwidth can benefit from being spread widely across compute nodes, provided the data live on a [distributed filesystem](https://en.wikipedia.org/wiki/Clustered_file_system#Distributed_file_systems). (This is true on virtually all HPC systems.)



## Technical preamble

Spin up a Jobqueue cluster that has 6 workers on 6 different nodes.
(We'll ensure that each job lands on a different node by requesting more than 50% of a node's available CPUs per job.)

In [1]:
import dask, dask.distributed
import dask_jobqueue

cluster = dask_jobqueue.SLURMCluster(

    # Dask worker size
    cores=17, memory='100GB',
    processes=1, # Dask workers per job
    
    # SLURM job script things
    queue='cluster', walltime='00:15:00',
    
    # Dask worker network and temporary storage
    interface='ib0', local_directory='$TMPDIR'
)

client = dask.distributed.Client(cluster)
cluster.scale(jobs=6)

In [2]:
client

| Client | Cluster |
|---|---|
| Scheduler: tcp://172.18.4.100:36147 | Workers: 4 |
| Dashboard: http://172.18.4.100:8787/status | Cores: 68 |
| | Memory: 400.00 GB |
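The client summary counts workers, not nodes. One way to check that the workers really sit on different nodes is to group their addresses by host. The following is a minimal sketch: `count_workers_per_host` is a hypothetical helper, the addresses are made up, and it assumes worker addresses of the form `tcp://host:port`, like the scheduler address shown above.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_workers_per_host(worker_addresses):
    """Count Dask workers per host, given addresses like 'tcp://10.0.0.1:40001'."""
    return Counter(urlsplit(address).hostname for address in worker_addresses)


# Made-up addresses for illustration. With a live cluster, one could pass
# client.scheduler_info()["workers"].keys() instead.
example_addresses = [
    "tcp://10.0.0.1:40001",
    "tcp://10.0.0.2:40002",
    "tcp://10.0.0.2:40003",
]
workers_per_host = count_workers_per_host(example_addresses)
print(workers_per_host)
```

If every host shows a count of 1, each worker has its own node. Note also that the summary lists 4 workers although 6 jobs were requested, so some jobs were probably still queued. Calling `client.wait_for_workers(6)` before timing anything would likely make the measurements easier to interpret.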


## Create random data and write them to disk

In [3]:
from dask import array as darr

In [4]:
# 100 GB in chunks of 200 MB
random_data = darr.random.normal(
    size=(int(100_000_000_000 / 8), ),
    chunks=(int(200_000_000 / 8), )
)
random_data

| | Array | Chunk |
|---|---|---|
| Bytes | 100.00 GB | 200.00 MB |
| Shape | (12500000000,) | (25000000,) |
| Count | 500 Tasks | 500 Chunks |
| Type | float64 | numpy.ndarray |


In [5]:
!rm -rf random_data.zarr/

In [6]:
%time random_data.to_zarr("random_data.zarr")

CPU times: user 1.85 s, sys: 126 ms, total: 1.98 s
Wall time: 11.6 s


In [7]:
!du -sh random_data.zarr/

89G	random_data.zarr/


## Find largest number with disk IO

We'll re-read the data and find the maximum on the fly.

Note in the Dask dashboard that we don't saturate CPU load.
This means we're limited by IO rather than compute.

In [8]:
random_data = darr.from_zarr("random_data.zarr/")
random_data

| | Array | Chunk |
|---|---|---|
| Bytes | 100.00 GB | 200.00 MB |
| Shape | (12500000000,) | (25000000,) |
| Count | 501 Tasks | 500 Chunks |
| Type | float64 | numpy.ndarray |


In [9]:
%time random_data.max().compute()

CPU times: user 1.38 s, sys: 77.8 ms, total: 1.46 s
Wall time: 2.96 s


6.50116242127755

We've just read and digested about 90 GB from disk, decompressed it to 100 GB, and found the maximum in about 3 seconds.

That's approx. 30 GB/s.
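The estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch: `throughput_gb_per_s` is a hypothetical helper, and the on-disk size assumes that `du -sh` reports binary units (as GNU `du` does), so `89G` is taken as 89 * 2**30 bytes.

```python
def throughput_gb_per_s(n_bytes, wall_seconds):
    """Return throughput in GB/s (1 GB = 1e9 bytes)."""
    return n_bytes / wall_seconds / 1e9


wall_seconds = 2.96          # wall time of the max() computation
uncompressed_bytes = 100e9   # array size reported by Dask
on_disk_bytes = 89 * 2**30   # assumption: du -sh reports GiB

in_memory_rate = throughput_gb_per_s(uncompressed_bytes, wall_seconds)
from_disk_rate = throughput_gb_per_s(on_disk_bytes, wall_seconds)
print(f"uncompressed: {in_memory_rate:.1f} GB/s, from disk: {from_disk_rate:.1f} GB/s")
```

Either way the result is on the order of 30 GB/s, in line with the estimate above.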

## Decrease cluster size and see effect on IO bandwidth

In [10]:
cluster.scale(jobs=1)

In [12]:
client

| Client | Cluster |
|---|---|
| Scheduler: tcp://172.18.4.100:36147 | Workers: 1 |
| Dashboard: http://172.18.4.100:8787/status | Cores: 17 |
| | Memory: 100.00 GB |


In [13]:
random_data = darr.from_zarr("random_data.zarr/")
random_data

| | Array | Chunk |
|---|---|---|
| Bytes | 100.00 GB | 200.00 MB |
| Shape | (12500000000,) | (25000000,) |
| Count | 501 Tasks | 500 Chunks |
| Type | float64 | numpy.ndarray |


In [14]:
%time random_data.max().compute()

CPU times: user 2.94 s, sys: 100 ms, total: 3.04 s
Wall time: 9.44 s


6.50116242127755

We've just read and digested about 90 GB from disk, decompressed it to 100 GB, and found the maximum in just under 10 seconds.

That's approx. 10 GB/s.

## Increase cluster size again and see effect on IO bandwidth

In [16]:
cluster.scale(jobs=8)

In [17]:
client

| Client | Cluster |
|---|---|
| Scheduler: tcp://172.18.4.100:36147 | Workers: 7 |
| Dashboard: http://172.18.4.100:8787/status | Cores: 119 |
| | Memory: 700.00 GB |


In [18]:
%time random_data.max().compute()

CPU times: user 1.96 s, sys: 74.3 ms, total: 2.04 s
Wall time: 2.89 s


6.50116242127755
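To put the three timings side by side, the following sketch computes the speedup of each run relative to the single-job run, using the wall times reported above. The labels refer to the number of jobs requested with `cluster.scale`. The number of workers actually running during each measurement may have been lower, since `client` reported 4 and 7 workers shortly after 6 and 8 jobs were requested.

```python
# Wall times in seconds, copied from the %time outputs above.
wall_times = {
    "6 jobs requested": 2.96,
    "1 job requested": 9.44,
    "8 jobs requested": 2.89,
}

baseline = wall_times["1 job requested"]
speedups = {label: baseline / seconds for label, seconds in wall_times.items()}

for label, speedup in speedups.items():
    print(f"{label}: speedup {speedup:.2f}x relative to the single-job run")
```

Going from one job to six or more gives a speedup of a bit more than 3, so in these runs the read bandwidth grew with the number of nodes, but not in proportion to it.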

## Bottom line

For IO-bound problems, we'd like to be able to scale horizontally (across more nodes) rather than vertically (more cores per node).

This could be tackled with the scheduler config: fill all nodes equally instead of keeping as many nodes as possible empty.