# Cloudify

This notebook series guides you through the *cloudify* service: serving Xarray datasets as Zarr datasets with xpublish, with server-side processing enabled by dask. It introduces the basic concepts with some examples. It was designed to work on DKRZ's HPC.

## Use cases

1. Uniform Zarr across formats

Clients normally have to install software and dependencies for each file format and storage backend, even when the datasets are kerchunked. Cloudify lets them access the data as native Zarr instead: it delivers binary chunks rather than references.

2. Uniform Access across storage backends

Cloudify allows users to access data from tape or cloud storage through one interface without installing software dependencies. It acts as the proxy for heterogeneous storage systems.

3. The data-as-a-service functionality

If disk storage is limited, or if you are not sure whether users really need to retrieve all the data of a specific product, you can cloudify the workflow and create a virtual representation of its output. The product dataset is displayed *as if it were there*; whenever a chunk is retrieved, the workflow is triggered and the output is generated. This is useful when the workflow that generates the product is complex enough that users would struggle to run it themselves, e.g. because they lack the resources. That said, the data provider has to spend some resources for as long as the virtual dataset is hosted.

## 1. Start an app

In the following, you will learn how to start and control the cloudify service.

1. Install a kernel for jupyterhub

```bash
source activate /work/bm0021/conda-envs/cloudify
python -m ipykernel install --user --name cloudify_env
```

-  Choose the kernel

In [3]:
import os
cn=os.environ["HOSTNAME"]

In [2]:
# This produces a self-signed cert for https access
#!openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -days 3650 -nodes -subj "/C=XX/ST=Hamburg/L=Hamburg/O=Test/OU=Test/CN="{cn}

2. From this notebook, we create a Python script for data serving and start hosting an example dataset in a background process. We need to consider some settings:

**Port**

The resulting service listens on a specific *port*. If we share a node, we can only use ports that are not already allocated. To enable everyone to run their own app, we agree to use a port `90XX`, where `XX` are the last two digits of our account.
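
The port convention can be sketched as a small helper. This is only an illustration: the function name `port_for_account` and the `base` parameter are made up for this example and are not part of cloudify.

```python
import re


def port_for_account(account: str, base: int = 9000) -> int:
    """Map an account name to port 90XX, where XX are its last two digits.

    Hypothetical helper for illustration; not part of cloudify.
    """
    match = re.search(r"(\d{2})$", account)
    if match is None:
        raise ValueError(f"account {account!r} does not end in two digits")
    return base + int(match.group(1))


# Example with the account name that appears in this notebook's paths.
print(port_for_account("k204210"))  # 9010
```

Note that the example script in this notebook does not hard-code such a port. It scans the range 9000 to 9100 and takes the first port it can bind.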

**Dask Cluster**

Dask is required for lazy access to the data. In addition, a dask cluster lets us do server-side processing such as uniform encoding. When started, the imported predefined dask cluster uses the following resources:

```python
n_workers=2,
threads_per_worker=8,
memory_limit="16GB"
```

which should be sufficient for at least two clients in parallel. We store the address of its scheduler in an environment variable so that xpublish can find it. Furthermore, we have to align the event loops of dask and xpublish's asyncio with `nest_asyncio.apply()`. An event loop can be seen as a *while* loop for a permanently running main worker.
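
To see why the event loops need aligning, the following standard-library sketch shows the restriction that `nest_asyncio.apply()` lifts: plain asyncio refuses to re-enter an event loop that is already running. The coroutine names `start_cluster` and `server_main` are placeholders invented for this example; they only stand in for a cluster start-up and a running web server.

```python
import asyncio


async def start_cluster() -> str:
    # Placeholder for an asynchronous setup step such as starting a dask cluster.
    await asyncio.sleep(0)
    return "cluster-ready"


async def server_main() -> str:
    # We are now inside a running event loop, as code called by a web server would be.
    loop = asyncio.get_running_loop()
    coro = start_cluster()
    try:
        # Plain asyncio rejects running the loop again from inside itself.
        return loop.run_until_complete(coro)
    except RuntimeError as err:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {err}"


result = asyncio.run(server_main())
print(result)
```

Without patching, the nested call ends in a `RuntimeError`. `nest_asyncio.apply()` patches asyncio so that such nested `run_until_complete` calls are allowed.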


**Plug-ins**

Xpublish finds pre-installed plugins, like the intake plugin, by itself. Custom plugins need to be registered.

Further settings will be discussed later.

In [3]:
xpublish_example_script="xpublish_example.py"

In [8]:
%%writefile {xpublish_example_script}

#ssl_keyfile="/work/bm0021/k204210/cloudify/workshop/key.pem"
#ssl_certfile="/work/bm0021/k204210/cloudify/workshop/cert.pem"

from cloudify.plugins.stacer import *
from cloudify.plugins.geoanimation import *
from cloudify.utils.daskhelper import *
import xarray as xr
import xpublish as xp
import asyncio
import nest_asyncio
import sys
import os
import socket   
from contextlib import closing

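def find_free_port_in_range(start=9000, end=9100):
    # Return the first port in [start, end] that can be bound on this host
    for port in range(start, end + 1):
        try:
            with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
                # Socket options only take effect if set before bind()
                s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
                s.bind(('', port))
            return port
        except OSError:
            # Port is in use or not permitted: try the next one
            continue
    raise RuntimeError("No free port found in the specified range.")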
    
nest_asyncio.apply()
chunks={}
for coord in ["lon","lat"]:
    chunk_size=os.environ.get(f"XPUBLISH_{coord.upper()}_CHUNK_SIZE",None)
    if chunk_size:
        chunks[coord]=int(chunk_size)

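# Any non-empty value of L_LOSSY (even "0" or "False") switches lossy compression on
l_lossy=os.environ.get("L_LOSSY",False)

def lossy_compress(partds):
    # Round each value to 12 mantissa bits; the encode/decode round trip returns the rounded array
    import numcodecs
    rounding = numcodecs.BitRound(keepbits=12)
    return rounding.decode(rounding.encode(partds))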

if __name__ == "__main__":  # This avoids infinite subprocess creation
    import dask
    zarrcluster = asyncio.get_event_loop().run_until_complete(get_dask_cluster())
    os.environ["ZARR_ADDRESS"]=zarrcluster.scheduler._address
    
    dsname=sys.argv[1]
    glob_inp=sys.argv[2:]

    dsdict={}
    ds=xr.open_mfdataset(
        glob_inp,
        compat="override",
        coords="minimal",
        chunks=chunks,
    )
    if "height" in ds:
        del ds["height"]
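    # Load time-related variables eagerly and serve them uncompressed as float64
    for dv in ds.variables:
        if "time" in dv:
            ds[dv]=ds[dv].load()
            ds[dv].encoding["dtype"] = "float64"
            ds[dv].encoding["compressor"] = None
    # Treat bounds variables as coordinates rather than data variables
    ds=ds.set_coords([a for a in ds.data_vars if "bnds" in a])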
    if l_lossy:
        ds = xr.apply_ufunc(
            lossy_compress,
            ds,
            dask="parallelized", 
            keep_attrs="drop_conflicts"
        )
    dsdict[dsname]=ds
    
    collection = xp.Rest(dsdict)
    collection.register_plugin(Stac())
    collection.register_plugin(PlotPlugin())
    freeport=find_free_port_in_range()
    # The empty marker file "<hostname>_<port>" records which port was chosen
    listen_uri_fn=f"{os.environ['HOSTNAME']}_{freeport}"
    with open(listen_uri_fn, "w"):
        collection.serve(
            host="0.0.0.0",
            port=freeport,
            #ssl_keyfile=ssl_keyfile,
            #ssl_certfile=ssl_certfile
        )        

Overwriting xpublish_example.py
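
The optional `L_LOSSY` branch of the script passes the data through `numcodecs.BitRound(keepbits=12)`. To give a feeling for what keeping 12 mantissa bits means, here is a standard-library sketch of bit rounding for a single float32 value. It is a simplified illustration written for this notebook, not the numcodecs implementation, and the function name `bitround_float32` is made up.

```python
import struct


def bitround_float32(value: float, keepbits: int) -> float:
    """Round a float32 to `keepbits` mantissa bits (round to nearest).

    Illustrative sketch only; float32 has 23 explicit mantissa bits.
    """
    if not 1 <= keepbits <= 22:
        raise ValueError("keepbits must be between 1 and 22 for float32")
    maskbits = 23 - keepbits
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    mask = (0xFFFFFFFF >> maskbits) << maskbits
    # Add (almost) half of the discarded range so that truncation rounds to nearest.
    bits += ((bits >> maskbits) & 1) + (1 << (maskbits - 1)) - 1
    bits &= mask
    (rounded,) = struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))
    return rounded


# A near-surface temperature in Kelvin, rounded to 12 mantissa bits.
print(bitround_float32(288.15, keepbits=12))  # 288.125
```

The discarded mantissa bits become zeros, which makes the chunks more compressible. With 12 kept bits the relative rounding error stays below about 0.012 % (2^-13).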


You can run this app, e.g. for:
```python
dsname="example"
glob_inp="/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/*.nc"
```
by applying:

In [6]:
%%bash --bg
# Python variables from the notebook cannot be used here, so everything is hard-coded

source activate /work/bm0021/conda-envs/cloudify
python xpublish_example.py \
    example \
    /work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/*.nc

**Note that we need the port on which xpublish listens; it is printed in the shell output.**
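
Besides the shell output, the script creates an empty file named `<HOSTNAME>_<port>` in its working directory just before it starts serving. The following sketch reads the port back from such a file name and builds a dataset URL. The helper names, the host name `nodeA` and the temporary directory are made up for the demonstration. The URL layout `/datasets/<name>/zarr` is an assumption about how xpublish exposes a dictionary of datasets; please check it against your running app.

```python
import re
import tempfile
from pathlib import Path
from typing import List


def find_served_ports(directory: str, hostname: str) -> List[int]:
    """Return the ports encoded in marker files named '<hostname>_<port>'."""
    pattern = re.compile(rf"^{re.escape(hostname)}_(\d+)$")
    ports = []
    for path in Path(directory).iterdir():
        match = pattern.match(path.name)
        if match and path.is_file():
            ports.append(int(match.group(1)))
    return sorted(ports)


def zarr_url(hostname: str, port: int, dsname: str) -> str:
    # Assumption: each dataset is served under /datasets/<name>/zarr.
    return f"http://{hostname}:{port}/datasets/{dsname}/zarr"


# Demonstration with a fake marker file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "nodeA_9010").touch()
    demo_ports = find_served_ports(tmp, "nodeA")
    print(zarr_url("nodeA", demo_ports[0], "example"))
```

In the notebook you would call `find_served_ports(".", os.environ["HOSTNAME"])` instead. The script does not delete its marker file, so a file left over from an earlier run does not prove that an app is still listening on that port.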

### Stop a running app

Let us try to run only **one** app at a time. Otherwise, we would end up with multiple ports and dask clusters, which would not end well.

You can find the main *cloudify* processes by looking for the dask workers. In the next step, you can *kill* them by process ID.

In [8]:
!ps -ef | grep cloudify

k204210  1200285 1199004  0 07:54 ?        00:00:01 /work/bm0021/conda-envs/cloudify/bin/python -Xfrozen_modules=off -m ipykernel_launcher -f /home/k/k204210/.local/share/jupyter/runtime/kernel-6c60ba00-2b66-4a2a-baa1-69a803c35eee.json
k204210  1200287 1199004  0 07:54 ?        00:00:01 /work/bm0021/conda-envs/cloudify/bin/python -Xfrozen_modules=off -m ipykernel_launcher -f /home/k/k204210/.local/share/jupyter/runtime/kernel-a86acc21-7226-4e4c-9e23-ba1a1455a370.json
k204210  1200288 1199004  0 07:54 ?        00:00:01 /work/bm0021/conda-envs/cloudify/bin/python -Xfrozen_modules=off -m ipykernel_launcher -f /home/k/k204210/.local/share/jupyter/runtime/kernel-03510d0a-e893-46c3-9a10-41f5e4d42762.json
k204210  1200693 1199004  0 09:04 ?        00:00:01 /work/bm0021/conda-envs/cloudify/bin/python -Xfrozen_modules=off -m ipykernel_launcher -f /home/k/k204210/.local/share/jupyter/runtime/kernel-2252b3a8-65fb-4af1-bb8f-b4e50380c7fb.json
k204210  1200694 1199004  0 09:04 ?        00:00:00 /wor

**Important note:**

If you plan to continue with another notebook, do not stop the app now.

In [11]:
!kill 3216188 

/bin/bash: line 0: kill: (3216188) - No such process
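
The `ps` listing above also contains the Jupyter kernels, because their interpreter path includes `cloudify`, and a hard-coded process ID is stale as soon as the app restarts. As a sketch, the process IDs can be filtered from `ps -ef` output by a more specific pattern such as the script name. The helper `pids_matching` is made up for this example, and the second sample line is hypothetical.

```python
from typing import List


def pids_matching(ps_output: str, needle: str) -> List[int]:
    """Return PIDs from `ps -ef` output whose command column contains `needle`."""
    pids = []
    for line in ps_output.splitlines():
        # Columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces)
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # header or malformed line
        if needle in fields[7]:
            pids.append(int(fields[1]))
    return pids


# First line: shortened from the listing above. Second line: hypothetical app process.
sample = """\
k204210  1200285 1199004  0 07:54 ?        00:00:01 /work/bm0021/conda-envs/cloudify/bin/python -m ipykernel_launcher
k204210  1200999 1199004  0 09:10 ?        00:00:05 python xpublish_example.py example data/*.nc
"""
print(pids_matching(sample, "cloudify"))             # matches the Jupyter kernel
print(pids_matching(sample, "xpublish_example.py"))  # matches only the app
```

With real output, for example from `subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout`, each returned ID could then be passed to `os.kill(pid, signal.SIGTERM)`.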
