# Resource Management in Dask

This tutorial summarises how to manage and utilise CPU and memory effectively when working with large datasets and/or computationally intensive workloads. There are several ways to manage larger-than-memory data and improve code efficiency. The approaches below are demonstrated with brief examples.
    
- Partition
    - dask.dataframe
    - dask.dataarray
- Save data onto disk
    - export intermediate data onto disk 
- Scheduler
    - Persist / Compute methods
    - Futures as pointers to remote data
    - Delayed
- Clear data
- Execution in background
    - Network communication example
    
----

- Authors: NCI Virtual Research Environment Team
- Keywords: Dask, Resource Management, Partition, Scheduler
- Creation Date: 2020-Sep
-------

## 1. Partition 

Dask operates on chunks. As the Dask array and dataframe examples show, Dask provides parallel, larger-than-memory computation using blocked algorithms. Simply put, it distributes operations over the pieces of a series, dataframe, or array.

*  **Parallel**: Uses all of the cores on your computer
*  **Larger-than-memory**:  Lets you work on datasets that are larger than your available memory by breaking up your data into many small pieces, operating on those pieces in an order that minimizes the memory footprint of your computation, and effectively streaming data from disk.
*  **Blocked Algorithms**:  Perform large computations by performing many smaller computations

A Dask DataFrame is composed of many pandas DataFrames. For `dask.dataframe` the chunking happens only along the index.

<img src="http://dask.pydata.org/en/latest/_images/dask-dataframe.svg" width="30%">

Dask arrays coordinate many Numpy arrays, arranged into chunks within a grid. They support a large subset of the Numpy API.

<img src="http://dask.pydata.org/en/latest/_images/dask-array-black-text.svg">
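As a small illustration of the chunk grid, the sketch below builds a Dask array of ones and inspects how it is divided into NumPy blocks. The shape and chunk sizes are arbitrary values chosen for the example.

```python
import dask.array as da

# A 1000 x 1000 array split into a 4 x 2 grid of NumPy blocks
x = da.ones((1000, 1000), chunks=(250, 500))
print(x.numblocks)  # number of blocks along each axis
print(x.chunks)     # block sizes along each axis

# Operations are lazy; compute() runs the per-block work and combines the results
total = x.sum().compute()
print(total)
```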

In this notebook, we'll build some understanding of blocked algorithms through brief examples.
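To make the idea of a blocked algorithm concrete, here is a minimal sketch in plain NumPy (it is not Dask's actual implementation). A large sum is computed as many small sums over fixed-size blocks, which are then combined. The function name `blocked_sum` and the block size are illustrative choices.

```python
import numpy as np

def blocked_sum(values, block_size):
    """Sum a 1-D array by summing fixed-size blocks and combining the partial results."""
    partial_sums = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]  # only this block is processed at a time
        partial_sums.append(block.sum())
    return sum(partial_sums)

data = np.arange(1_000_000, dtype=np.int64)
total = blocked_sum(data, block_size=100_000)
print(total, total == data.sum())
```

Dask applies the same pattern, but builds the per-block tasks into a graph and schedules them in parallel.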

### Dask.DataFrame

In [None]:
import os
import dask
filename = os.path.join('/g/data/dk92/notebooks/demo_data/Weather_Stations_ACT','IDCJAC0009_*_*','IDCJAC0009*.csv')
filename

In [None]:
import dask.dataframe as dd
ddf = dd.read_csv(filename)
ddf

In [None]:
ddf.head()

In [None]:
ddf.groupby("Product code")["Rainfall amount (millimetres)"].max().visualize(filename='dataframe_graph.pdf')

### Dask.dataarray

To open multiple files simultaneously in parallel using Dask delayed, use `open_mfdataset()`.

This function will automatically concatenate and merge datasets into one in the simple cases that it understands (see `combine_by_coords()` for the full disclaimer). By default, `open_mfdataset()` will chunk each netCDF file into a single Dask array; again, supply the chunks argument to control the size of the resulting Dask arrays. In more complex cases, you can open each file individually using `open_dataset()` and merge the result, as described in the Combining data xarray tutorial. Passing the keyword argument `parallel=True` to `open_mfdataset()` will speed up the reading of large multi-file datasets by executing those read tasks in parallel using `dask.delayed`.
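For the more complex case mentioned above, the following sketch combines datasets that were created individually. Two small in-memory datasets stand in for files opened with `open_dataset()`; the helper `make_part`, the variable name `pr`, and the coordinate values are made up for illustration.

```python
import numpy as np
import xarray as xr

def make_part(start, n):
    """Build a tiny dataset covering n consecutive time steps (stand-in for one file)."""
    time = np.arange(start, start + n)
    return xr.Dataset(
        {"pr": (("time", "lat"), np.ones((n, 3)))},
        coords={"time": time, "lat": [-10.0, 0.0, 10.0]},
    )

parts = [make_part(0, 5), make_part(5, 5)]

# combine_by_coords orders and concatenates the pieces using their coordinate values
combined = xr.combine_by_coords(parts)
print(combined.sizes)
```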

In [None]:
import xarray as xr
import dask.array as da
path = '/g/data/oi10/replicas/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-CM4/ssp585/r1i1p1f1/day/pr/gr1/v20180701/*'
f_ssp585 = xr.open_mfdataset(path)
# Use Dask.Distributed utility function to display size of each dataset
from distributed.utils import format_bytes
print(
    "ssp585:",
    format_bytes(f_ssp585.nbytes),
)
dsets = xr.open_mfdataset(path,chunks={'time':730},parallel=True)
dsets

You'll notice that printing a dataset still shows a preview of array values, even if they are actually Dask arrays. We can do this quickly with Dask because we only need to compute the first few values (typically from the first block). To reveal the true nature of an array, print a DataArray:

In [None]:
dsets.pr

Once you've manipulated a Dask array, you can still write a dataset too big to fit into memory back to disk by using `to_netcdf()` in the usual way. Be aware that the following cell will take some time to run.

In [None]:
# Use a location you have write access to
dsets.to_netcdf("/g/data/dk92/notebooks/demo_data/cmip6-precipitation-data.nc")

Alternatively, by setting the compute argument to `False`, `to_netcdf()` will return a `dask.delayed` object that can be computed later.

In [None]:
from dask.diagnostics import ProgressBar

# or distributed.progress when using the distributed scheduler
delayed_obj = dsets.to_netcdf("/g/data/dk92/notebooks/demo_data/cmip6-precipitation-data.nc", compute=False)

with ProgressBar():
    results = delayed_obj.compute()

For a small-scale calculation, reading the whole file into memory is intuitive, but this approach often does not scale. Splitting the data into segments and working on it piece by piece is good practice when dealing with larger-than-memory data. Note that there is often a trade-off between time efficiency and memory footprint: chunked processing uses very little memory, but may be slower for files that do not fill a large fraction of memory. In general, one would like chunks small enough not to stress memory, but big enough for efficient use of the CPU.
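When choosing a chunk size, it can help to estimate how much memory one chunk occupies. The sketch below does this arithmetic for a hypothetical chunk of 730 time steps on a 180 x 288 grid of `float32` values; the grid dimensions are an assumption for illustration and are not read from the dataset above. The helper name `chunk_nbytes` is likewise made up.

```python
import math
import numpy as np

def chunk_nbytes(chunk_shape, dtype):
    """Return the in-memory size in bytes of one chunk with the given shape and dtype."""
    return math.prod(chunk_shape) * np.dtype(dtype).itemsize

# Hypothetical chunk: 730 time steps x 180 latitudes x 288 longitudes of float32
size = chunk_nbytes((730, 180, 288), "float32")
print(f"{size / 2**20:.1f} MiB per chunk")
```

Several chunks are typically held in memory at once (roughly one or more per worker thread), so the per-chunk size should be well below the memory available to each worker.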

## 2. Save data onto disk

Each time we operate on our data, Dask reads through all of it from disk again so that we don't fill up RAM. This is very efficient for memory use, but reading through all of the data files every time can be slow.

As you saw in the Dask dataframes example, we stored our data in Parquet, a binary file format that is more efficient for computers to read and write. Parquet stores nested data structures in a flat columnar format. Compared with a traditional row-oriented layout, Parquet is more efficient in terms of both storage and performance.

The following code is copied from that example as a reminder of one way to save data onto disk.

In [None]:
# Write data as parquet format
ddf.to_parquet('/g/data/dk92/notebooks/demo_data/ACT_weather.parquet', engine='pyarrow')
!ls /g/data/dk92/notebooks/demo_data/ACT_weather.parquet

Read the binary files back from disk and run a calculation with better performance:

In [None]:
%%time 
import pandas as pd
df = pd.read_parquet('/g/data/dk92/notebooks/demo_data/ACT_weather.parquet', engine='pyarrow')
df["Rainfall amount (millimetres)"].max()

## 3. Scheduler

### Persist sends work to the scheduler

In [None]:
# If you run this notebook on your local computer or NCI's VDI instance, you can create a local cluster
from dask.distributed import Client
client = Client()
print(client)

In [None]:
# If you run this notebook on Gadi under the Pangeo environment, you can connect to the cluster using the scheduler.json file
from dask.distributed import Client
client = Client(scheduler_file='scheduler.json')
print(client)

<div class="alert alert-info">
<b>Warning: Please make sure you specify the correct path to the scheduler.json file within your environment.</b>  
</div>

Starting the Dask Client also provides a dashboard, which is useful for gaining insight into the computation. The link to the dashboard becomes visible when you create the Client. We recommend keeping the dashboard open on one side of your screen and your notebook on the other, which is helpful while learning.

In [None]:
# read csv files again
import os
import dask
filename = os.path.join('/g/data/dk92/notebooks/demo_data/Weather_Stations_ACT','IDCJAC0009_*_*','IDCJAC0009*.csv')
import dask.dataframe as dd
ddf = dd.read_csv(filename)
ddf.head()

In [None]:
df = ddf[ddf["Rainfall amount (millimetres)"]> 20]
df.head()

In [None]:
len(df)

In [None]:
df = client.persist(df)
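The difference between `persist` and `compute` can be seen on a small Dask array even without a cluster. This is a sketch using Dask's default local scheduler and made-up data. With a distributed `Client`, `persist` would instead return immediately and keep the computed chunks in worker memory.

```python
import dask.array as da

x = da.arange(1_000_000, chunks=100_000, dtype="int64")
doubled = x * 2                    # lazy: only a task graph so far

in_memory = doubled.persist()      # runs the graph; the result is still a Dask array
total = in_memory.sum().compute()  # compute() returns a concrete NumPy value

print(type(in_memory).__name__, in_memory.numblocks, total)
```

Persisting is most useful when several later computations reuse the same intermediate result, since the shared work is then done only once.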

### Futures Point to Remote Data

In [None]:
from dask.distributed import futures_of

futures_of(df)

Dask holds onto data for which a future exists:

In [None]:
df.visualize()

### Delayed feature provides non-Dask results

In [None]:
df.sum().visualize()

In [None]:
df["Rainfall amount (millimetres)"].sum().compute()
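`dask.delayed` wraps ordinary Python functions so that calling them builds a task graph instead of running immediately; `compute()` then returns a plain Python result rather than a Dask collection. A minimal sketch with two hypothetical helper functions, `inc` and `add`:

```python
import dask

@dask.delayed
def inc(x):
    """Return x + 1 (runs only when the graph is computed)."""
    return x + 1

@dask.delayed
def add(a, b):
    """Return a + b (runs only when the graph is computed)."""
    return a + b

# Nothing is executed here; the calls only record tasks and their dependencies
total = add(inc(1), inc(2))

# compute() executes the graph and returns an ordinary Python integer
result = total.compute()
print(result)
```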

## 4. Clearing data

We remove data from distributed RAM by removing the collection from our local process. Remote data is removed once all Futures pointing to that data are removed from all client machines.

In [None]:
del df  # Deleting local data often deletes remote data

If this is the only copy then this will likely trigger the cluster to delete the data as well.

However if we have multiple copies or other collections based on this one then we’ll have to delete them all.

In [None]:
import dask.dataframe as dd
df = dd.read_csv(filename)
df2 = df[df["Rainfall amount (millimetres)"] < 10]
del df2  # does not delete the data behind df, because df still refers to it

In [None]:
df

### Aggressively Clearing Data

To definitely remove a computation and all computations that depend on it you can always cancel the futures/collection.

In [None]:
client.cancel(df)  # kills df, df2, and every other dependent computation

Alternatively, if you want a clean slate, you can restart the cluster. This clears all state and does a hard restart of all worker processes. It generally completes in around a second.

In [None]:
client.restart()

## 5. Execution in the background

There are many tasks that take a while to complete, but don't actually require much of the CPU, for example anything that requires communication over a network, or input from a user. In typical sequential programming, execution would need to halt while the process completes, and then continue execution. That would be dreadful for a user experience (imagine the slow progress bar that locks up the application and cannot be cancelled), and wasteful of time (the CPU could have been doing useful work in the meantime).
For example, we can launch processes and get their output as follows:
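```python
import subprocess

command = ["sleep", "1"]  # placeholder; substitute the command you want to run
p = subprocess.Popen(command, stdout=subprocess.PIPE)
p.returncode  # None: the exit status has not been collected yet
```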
The task runs in a separate process. `p.returncode` stays `None` until the process has finished and its status has been collected, for example by calling `p.poll()` or `p.wait()`; on success it is then `0`. To get the output back, use `out = p.communicate()[0]` (which blocks until the process completes).
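A fully runnable variant of the snippet above is sketched here. It assumes a trivial child Python process as the command, which is a stand-in for any longer-running program.

```python
import subprocess
import sys

# Launch a short-lived child Python process (stand-in for any long-running command)
p = subprocess.Popen(
    [sys.executable, "-c", "print('done')"],
    stdout=subprocess.PIPE,
)

# poll() returns None while the child is still running, and its exit code once finished
status_now = p.poll()

# communicate() waits for the child to finish and returns (stdout, stderr)
out, _ = p.communicate()
print(p.returncode, out.decode().strip())
```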

Similarly, we can launch Python processes and threads in the background. Some methods allow mapping over multiple inputs and gathering the results. The thread starts and the cell completes immediately, but the data associated with the download only appears in the queue object some time later.

In [None]:
# This cell calls data as a remote file using the OPeNDAP protocol

import threading
import queue
import urllib.request  # importing urllib alone does not guarantee the request submodule is loaded

def get_webdata(url, q):
    u = urllib.request.urlopen(url)
    # raise ValueError
    q.put(u.read())

q = queue.Queue()
t = threading.Thread(target=get_webdata, args=('http://dapds00.nci.org.au/thredds/dodsC/rr3/CMIP5/output1/CSIRO-BOM/ACCESS1-0/1pctCO2/day/atmos/day/r1i1p1/latest/pr/pr_day_ACCESS1-0_1pctCO2_r1i1p1_03000101-03241231.nc.html', q))
t.start()

**Note:** the cell above won't work if you run this notebook using Pangeo on Gadi, as the compute nodes do not have external internet access, but it will work on VDI, login nodes, or a local PC.

In [None]:
# Fetch result back into this thread. If the worker thread is not done, this would wait.
q.get()

**Consider:** What would you see if there had been an exception within the `get_webdata` function? You could uncomment the `raise` line, above, and re-execute the two cells. What happens? Is there any way to debug the execution to find the reason?
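One possible way to approach this question is sketched below using the standard library's `concurrent.futures` rather than a bare thread and queue. A `Future` stores any exception raised in the worker, so it can be inspected afterwards or re-raised by calling `result()`. The function `flaky_task` is a made-up stand-in for `get_webdata`.

```python
from concurrent.futures import ThreadPoolExecutor

def flaky_task(value):
    """Stand-in for a background job; fails for negative input."""
    if value < 0:
        raise ValueError(f"bad input: {value}")
    return value * 2

# Leaving the with-block waits for all submitted tasks to finish
with ThreadPoolExecutor(max_workers=2) as pool:
    good = pool.submit(flaky_task, 21)
    bad = pool.submit(flaky_task, -1)

print(good.result())

# exception() returns the exception object, or None if the task succeeded
error = bad.exception()
print(type(error).__name__, error)
```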

### Summary

This example summarizes some basic strategies for using chunks, Dask lazy execution, schedulers, futures, and persist utilities to achieve better performance.

Reference
https://docs.dask.org/