# Cloudify

This notebook series guides you through the *cloudify* service: serving Xarray datasets as Zarr datasets with xpublish, with server-side processing enabled by dask. It introduces the basic concepts with some examples and was designed to work on DKRZ's HPC.

## 1. Start an app

In the following, you will learn how to start and control the cloudify service.

**Is there any other reason to run cloudify on DKRZ's HPC, which is only accessible internally?**

If you *cloudify* a virtual dataset prepared as a highly aggregated, analysis-ready dataset, clients can subset from this *one* large aggregated dataset instead of searching the file system.

1. Install a kernel for jupyterhub

```bash
source activate /work/bm0021/conda-envs/cloudify
python -m ipykernel install --user --name cloudify_env
```

-  Choose the kernel

2. To allow secure *https* access, we need an SSL certificate. For testing purposes and on Levante, a self-signed one is sufficient. In addition, some applications currently only allow access through https. We can create the certificate like this:

In [1]:
#!openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -days 3650 -nodes -subj "/C=XX/ST=Hamburg/L=Hamburg/O=Test/OU=Test/CN=localhost"

3. We write a cloudify script for data serving and start hosting an example dataset in a background process. We need to consider some settings:

**Port**

The resulting service listens on a specific *port*. If we share a node, we can only use ports that are not already allocated. To enable each of us to run our own app, we agree to use a port `90XX`, where XX are the last two digits of our account.
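
The port convention can be sketched with the standard library alone. The helper names `port_for_account` and `port_is_free` are hypothetical and not part of cloudify; the sketch only assumes that the account name ends in two digits, as in `k204210`.

```python
import socket


def port_for_account(account: str, base: int = 9000) -> int:
    """Return base + the last two digits of the account name (90XX convention)."""
    digits = account[-2:]
    if not digits.isdigit():
        raise ValueError(f"account {account!r} does not end in two digits")
    return base + int(digits)


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Try to bind the port; a failed bind means it cannot be used right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


port = port_for_account("k204210")
print(port, "free" if port_is_free(port) else "in use")
```

A check like this only reflects the moment it runs: another user can still take the port between the check and the start of the app.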

**Dask Cluster**

Dask is necessary for lazy access to the data. Additionally, a dask cluster can help us with server-side processing such as uniform encoding. When started, the imported predefined dask cluster uses the following resources:

```python
n_workers=2,
threads_per_worker=8,
memory_limit="16GB"
```

which should be sufficient for at least two clients in parallel. We store the scheduler address of the cluster in the environment variable `ZARR_ADDRESS` so that xpublish can find it. Furthermore, we have to align the two event loops of dask and xpublish's asyncio with `nest_asyncio.apply()`. An event loop can be seen as a *while* loop of a permanently running main worker.
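
To make the event-loop remark more concrete, here is a minimal sketch using only the standard library. The coroutine `fake_cluster_start` and the address it returns are made up for illustration; they merely stand in for an asynchronous start-up such as the cluster start in the script.

```python
import asyncio


async def fake_cluster_start() -> str:
    """Hypothetical stand-in for an asynchronous cluster start-up."""
    await asyncio.sleep(0.01)  # hands control back to the loop while "starting"
    return "tcp://127.0.0.1:8786"  # made-up scheduler address


loop = asyncio.new_event_loop()
try:
    # run_until_complete keeps the loop spinning until the coroutine has finished
    address = loop.run_until_complete(fake_cluster_start())
finally:
    loop.close()

print(address)
```

A plain asyncio loop refuses a second `run_until_complete` call while it is already running. `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed, which is presumably why the script calls it before starting the cluster and the server.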


**Plug-ins**

Xpublish finds pre-installed plugins like the intake plugin by itself. Your own plugins need to be registered.

Further settings will be discussed later.

In [2]:
xpublish_example_script="xpublish_example.py"

In [3]:
%%writefile {xpublish_example_script}

port=9000
ssl_keyfile="/work/bm0021/k204210/cloudify/workshop/key.pem"
ssl_certfile="/work/bm0021/k204210/cloudify/workshop/cert.pem"

from cloudify.plugins.stacer import *
from cloudify.utils.daskhelper import *
import xarray as xr
import xpublish as xp
import asyncio
import nest_asyncio
import sys
import os

nest_asyncio.apply()
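# Optional chunk sizes for lon/lat, taken from the environment variables
# XPUBLISH_LON_CHUNK_SIZE and XPUBLISH_LAT_CHUNK_SIZE
chunks={}
for coord in ["lon","lat"]:
    chunk_size=os.environ.get(f"XPUBLISH_{coord.upper()}_CHUNK_SIZE",None)
    if chunk_size:
        chunks[coord]=int(chunk_size)

# Any non-empty value of L_LOSSY (even "0" or "False") switches lossy compression on
l_lossy=os.environ.get("L_LOSSY",False)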

def lossy_compress(partds):
    import numcodecs
    rounding = numcodecs.BitRound(keepbits=12)
    return rounding.decode(rounding.encode(partds))

if __name__ == "__main__":  # This avoids infinite subprocess creation
    import dask
    zarrcluster = asyncio.get_event_loop().run_until_complete(get_dask_cluster())
    os.environ["ZARR_ADDRESS"]=zarrcluster.scheduler._address
    
    dsname=sys.argv[1]
    glob_inp=sys.argv[2:]

    dsdict={}
    ds=xr.open_mfdataset(
        glob_inp,
        compat="override",
        coords="minimal",
        chunks=chunks
    )
    ds=ds.set_coords([a for a in ds.data_vars if "bnds" in a])
    if l_lossy:
        ds = xr.apply_ufunc(
            lossy_compress,
            ds,
            dask="parallelized", 
            keep_attrs="drop_conflicts"
        )
    dsdict[dsname]=ds
    
    collection = xp.Rest(dsdict)
    collection.register_plugin(Stac())
    collection.serve(
        host="0.0.0.0",
        port=port,
        ssl_keyfile=ssl_keyfile,
        ssl_certfile=ssl_certfile
    )

Overwriting xpublish_example.py
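
If `L_LOSSY` is set, the script passes the data through `numcodecs.BitRound(keepbits=12)`. The following pure-Python sketch illustrates the idea of bit rounding for a single float32 value. It is an illustration under the assumption of float32 data and round-to-nearest-even, not the numcodecs implementation; the function name `bitround_float32` and the sample value are hypothetical.

```python
import struct


def bitround_float32(value: float, keepbits: int = 12) -> float:
    """Round a float32 so that only `keepbits` mantissa bits can be non-zero."""
    mantissa_bits = 23  # width of the float32 mantissa
    drop = mantissa_bits - keepbits  # number of trailing mantissa bits to zero
    if drop <= 0:
        return struct.unpack("<f", struct.pack("<f", value))[0]
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    mask = (0xFFFFFFFF >> drop) << drop
    # Round to nearest (ties to even) before zeroing the dropped bits
    bits += ((bits >> drop) & 1) + (1 << (drop - 1)) - 1
    bits &= mask
    return struct.unpack("<f", struct.pack("<I", bits))[0]


original = 288.15  # made-up sample value, e.g. a temperature in K
rounded = bitround_float32(original)
print(original, rounded, abs(rounded - original) / original)
```

The trailing zero bits carry no information, so a subsequent lossless compressor can shrink the data more effectively, at the cost of a small, bounded relative error.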


You can run this app e.g. for:
```
dsname="example"
glob_inp="/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/*.nc"
```
by applying:

In [4]:
%%bash --bg
# Python variables cannot be used in this bash cell, so everything is hard-coded

source activate /work/bm0021/conda-envs/cloudify
python xpublish_example.py \
    example \
    /work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/*.nc
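
Once the background process is up, a quick check from Python could look like the sketch below. It assumes the default xpublish routes, that is a `/datasets` listing and a `/datasets/<name>/zarr` endpoint per dataset, and the port `9000` from the script; the helper names are hypothetical.

```python
import json
import ssl
import urllib.request


def dataset_url(dsname: str, port: int = 9000, host: str = "localhost") -> str:
    """Build the assumed zarr endpoint of one served dataset."""
    return f"https://{host}:{port}/datasets/{dsname}/zarr"


def list_datasets(port: int = 9000, host: str = "localhost", timeout: float = 10.0):
    """Fetch the dataset IDs from the assumed /datasets route."""
    # The certificate is self-signed, so verification is switched off here.
    # Only do this for a test service that you started yourself.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    url = f"https://{host}:{port}/datasets"
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        return json.load(response)


print(dataset_url("example"))
# Call list_datasets() once the background process has finished starting up.
```

Opening many files and starting the dask cluster can take a while, so a connection error shortly after launching the cell does not necessarily mean that the app failed.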

### Stop a running app

Let us try to run only **one** app at a time. Otherwise, we would end up with multiple ports and dask clusters, which would not end well.

You can check for the main *cloudify* processes by finding the dask workers. In the next step, you can *kill* a process by its ID.

In [6]:
!ps -ef | grep cloudify

k204210  1101280 1100112  0 07:33 ?        00:00:02 /work/bm0021/conda-envs/cloudify/bin/python -Xfrozen_modules=off -m ipykernel_launcher -f /home/k/k204210/.local/share/jupyter/runtime/kernel-99eb3193-dfc0-45c3-9953-a1f29fb0888d.json
k204210  1101285 1100112  0 07:33 ?        00:00:02 /work/bm0021/conda-envs/cloudify/bin/python -Xfrozen_modules=off -m ipykernel_launcher -f /home/k/k204210/.local/share/jupyter/runtime/kernel-0ac5cfbe-7cc6-4136-870a-cbac9aa18a7b.json
k204210  1101286 1100112  0 07:33 ?        00:00:30 /work/bm0021/conda-envs/cloudify/bin/python -Xfrozen_modules=off -m ipykernel_launcher -f /home/k/k204210/.local/share/jupyter/runtime/kernel-947ea9d3-56f0-426e-9af8-8a5be0180793.json
k204210  1102668 1100112  0 09:10 ?        00:00:21 /work/bm0021/conda-envs/cloudify/bin/python -Xfrozen_modules=off -m ipykernel_launcher -f /home/k/k204210/.local/share/jupyter/runtime/kernel-f59a40b2-6d4d-41dd-b768-7ee9040e362a.json
k204210  1109706 1100112  2 10:42 ?        00:00:00 /wor

**Important note:**

If you plan to continue with another notebook, do not stop the app now.

In [None]:
!kill 813325
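
Because the conda environment path contains `cloudify`, `grep cloudify` also matches the Jupyter kernels, as the output above shows. Filtering on the script name narrows the list down. The sketch below parses text in `ps -ef` layout; the function name `pids_matching` is hypothetical and the sample lines are made up for illustration.

```python
def pids_matching(ps_output: str, needle: str) -> list[int]:
    """Return the PIDs (second column of `ps -ef`) of lines whose command contains `needle`."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 7)  # UID PID PPID C STIME TTY TIME CMD
        if len(fields) == 8 and needle in fields[7] and fields[1].isdigit():
            pids.append(int(fields[1]))
    return pids


# Made-up sample lines in `ps -ef` layout, for illustration only
sample = """\
UID          PID    PPID  C STIME TTY          TIME CMD
user1       4242    4000  0 07:33 ?        00:00:02 python -m ipykernel_launcher -f kernel.json
user1       4300    4242  2 10:42 ?        00:00:05 python xpublish_example.py example /data/*.nc
"""

print(pids_matching(sample, "xpublish_example.py"))
```

On a real node, the input could come from `subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout`, and the resulting IDs are the ones to pass to `kill`.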