## 2. Start an app on levante

**Why?**

You may want to run a *cloudify* app on DKRZ's HPC for

- testing purposes
- providing zarr as a catalog for cdo
    - Users can subset from *one* large aggregated dataset instead of searching the file system.

1. Install a kernel for jupyterhub

```bash
source activate /work/bm0021/conda-envs/cloudify
python -m ipykernel install --user --name cloudify_env
```

- Then choose the `cloudify_env` kernel for this notebook.

2. To allow secure *https* access, we need an SSL certificate. For testing purposes on levante, a self-signed one is sufficient. Note that some applications currently only allow access through https. We can create the certificate like this:

In [1]:
#!openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -days 3650 -nodes -subj "/C=XX/ST=Hamburg/L=Hamburg/O=Test/OU=Test/CN=localhost"

3. We start hosting an example dataset in a background process. The resulting service listens on a specific *port*.

- To enable each of us to run our own app, possibly on a *shared* node, we agree to use a port `90XX`, where XX are the last two digits of our account.
- We name the example dataset "*example*". It is the temperature time series of a CMIP6 ESM.
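
The port convention can be written down as a small helper. This is only a sketch: `port_for_account` is a hypothetical name, not part of cloudify, and it assumes that the account name ends in at least two digits (as `k204210` in the paths of this notebook does).

```python
import re


def port_for_account(account: str, base: int = 9000) -> int:
    """Return base + the last two digits of the account name (90XX convention)."""
    match = re.search(r"(\d\d)$", account)
    if match is None:
        raise ValueError(f"account {account!r} does not end in two digits")
    return base + int(match.group(1))


# Account name taken from the paths used in this notebook.
print(port_for_account("k204210"))  # 9010
```

In a notebook, the account could be passed in via `getpass.getuser()`, assuming the login name follows the same pattern.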

The cloudify script for data serving

- aligns the asyncio event loop with the dask event loop using `nest_asyncio`. An event loop is a *while* loop that keeps a main worker running permanently.
- sets up a small dask cluster for chunk calculation. The address of its scheduler is stored in an environment variable used by xpublish. The cluster uses the following resources:

```python
n_workers=2,
threads_per_worker=8,
memory_limit="16GB"
```
Furthermore, it

- opens all files matching `glob_inp="/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/*.nc"` with xarray
- registers user plugins
- runs the server
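
The event-loop step listed above can be illustrated with the standard library alone. In this sketch, `start_cluster_stub` is a hypothetical stand-in for cloudify's `get_dask_cluster()` and the returned address is made up. It only shows how a synchronous script drives a coroutine to completion with `run_until_complete`.

```python
import asyncio


async def start_cluster_stub() -> str:
    """Hypothetical stand-in for an async cluster start-up routine."""
    await asyncio.sleep(0)  # pretend to wait for the workers to come up
    return "tcp://127.0.0.1:8786"  # made-up scheduler address


# A synchronous script has no running loop, so it creates one and
# blocks until the coroutine has finished.
loop = asyncio.new_event_loop()
try:
    address = loop.run_until_complete(start_cluster_stub())
finally:
    loop.close()

print(address)
```

Inside Jupyter a loop is already running, which is why the script applies `nest_asyncio` before calling `run_until_complete`.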

In [2]:
%%writefile xpublish_example.py

port=9000
ssl_keyfile="/work/bm0021/k204210/cloudify/workshop/key.pem"
ssl_certfile="/work/bm0021/k204210/cloudify/workshop/cert.pem"

from cloudify.plugins.stacer import *
from cloudify.utils.daskhelper import *
import xarray as xr
import xpublish as xp
import asyncio
import nest_asyncio
import sys
import os

nest_asyncio.apply()

if __name__ == "__main__":  # This avoids infinite subprocess creation
    import dask
    zarrcluster = asyncio.get_event_loop().run_until_complete(get_dask_cluster())
    os.environ["ZARR_ADDRESS"]=zarrcluster.scheduler._address
    
    dsname=sys.argv[1]
    glob_inp=sys.argv[2]

    dsdict={}
    ds=xr.open_mfdataset(
        glob_inp,
        compat="override",
        coords="minimal",
        chunks={}
    )
    dsdict[dsname]=ds
    
    collection = xp.Rest(dsdict)
    collection.register_plugin(Stac())
    collection.serve(
        host="0.0.0.0",
        port=port,
        ssl_keyfile=ssl_keyfile,
        ssl_certfile=ssl_certfile
    )

Overwriting xpublish_example.py


You can run this app e.g. for:
```
dsname="example"
glob_inp="/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/*.nc"
```
by applying:

In [3]:
%%bash --bg
source activate /work/bm0021/conda-envs/cloudify
# Quote the glob so that the shell passes it unexpanded as a single argument (sys.argv[2])
python xpublish_example.py example "/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/*.nc"

If something goes wrong, you can check for *cloudify* processes and *kill* them by ID.

In [13]:
!ps -ef | grep cloudify

k204210  1539328 1506563  1 15:58 ?        00:00:02 /work/bm0021/conda-envs/cloudify/bin/python -Xfrozen_modules=off -m ipykernel_launcher -f /home/k/k204210/.local/share/jupyter/runtime/kernel-d97d2c9d-9ab6-4c71-8254-b4916f9233b5.json
k204210  1539513 1539362  0 15:58 ?        00:00:00 /work/bm0021/conda-envs/cloudify/bin/python -c from multiprocessing.resource_tracker import main;main(26)
k204210  1539515 1539362  2 15:58 ?        00:00:04 /work/bm0021/conda-envs/cloudify/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=27, pipe_handle=35) --multiprocessing-fork
k204210  1539518 1539362  2 15:58 ?        00:00:04 /work/bm0021/conda-envs/cloudify/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=27, pipe_handle=35) --multiprocessing-fork
k204210  1539571 1539328  0 16:02 pts/2    00:00:00 /bin/bash -c ps -ef | grep cloudify
k204210  1539573 1539571  0 16:02 pts/2    00:00:00 grep cloudify


In [14]:
!kill 1539362
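
Besides looking at the process list, it can help to check whether anything is listening on the chosen port at all. The following is a minimal sketch using only the standard library. `is_listening` is a hypothetical helper, not part of cloudify, and it only tests whether a TCP connection can be established, not whether the app behind it is healthy.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0


# 9000 is the port used in xpublish_example.py
print(is_listening("127.0.0.1", 9000))
```

If this prints `False` shortly after starting the app, the server has probably not come up yet or has already exited.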

## 3. Data access

You can get the host URL from the hostname of the levante node you work on and the port that you used for the app:

In [3]:
port=9000
hostname=!echo $HOSTNAME
hosturl="https://"+hostname[0]+":"+str(port)
print(hosturl)

https://l40095.lvt.dkrz.de:9000
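
The `!echo` syntax above only works inside IPython. For a plain Python script, the same URL could be assembled as sketched below. `build_host_url` is a hypothetical helper, and it is an assumption that `socket.getfqdn()` returns the same name as `$HOSTNAME` on levante.

```python
import socket


def build_host_url(hostname: str, port: int) -> str:
    """Assemble the https URL of an app listening on hostname:port."""
    return f"https://{hostname}:{port}"


# Hostname and port as shown in the cell output above
print(build_host_url("l40095.lvt.dkrz.de", 9000))
# https://l40095.lvt.dkrz.de:9000

# On the node itself (assumption: the FQDN matches $HOSTNAME)
print(build_host_url(socket.getfqdn(), 9000))
```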


Because the certificate is self-signed, we have to tell the Python programs not to verify SSL certificates for our purposes:

In [1]:
storage_options=dict(verify_ssl=False)

### 3.1. Xarray

Our example dataset is available via its name and the following `zarr_url`:

In [4]:
dsname="example"
zarr_url='/'.join([hosturl,"datasets",dsname,"zarr"])
print(zarr_url)

https://l40095.lvt.dkrz.de:9000/datasets/example/zarr


In [19]:
import xarray as xr
ds=xr.open_zarr(
    zarr_url,
    consolidated=True,
    storage_options=storage_options
)

### 3.2. Intake

**All** datasets available through the app are collected in an intake catalog:

In [None]:
intake_url='/'.join([hosturl,"intake.yaml"])
print(intake_url)

In [None]:
import intake
cat=intake.open_catalog(
    intake_url,
    storage_options=storage_options
)
list(cat)

In [None]:
cat[dsname](method="zarr",storage_options=storage_options).to_dask()

### 3.3. Stac

For each dataset, a stac item is generated with enriched metadata. The URL for this API is similar to the *zarr*-URL:

In [5]:
stac_url=zarr_url.replace('/zarr','/stac')

In [11]:
import pystac
import fsspec
import json
stacitem=pystac.item.Item.from_dict(
    json.load(fsspec.open(stac_url,**storage_options).open())
)
stacitem

In [13]:
stacitem.assets

{'data': <Asset href=https://eerie.cloud.dkrz.de/datasets/example/zarr>,
 'xarray_view': <Asset href=https://eerie.cloud.dkrz.de/datasets/example/>}

The stac API is currently hard-coded for 'eerie.cloud', so the asset *href*s point to that host instead of our levante node. In theory, we could open the data with xarray using the *href* of an asset.
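
Until the host is configurable, one possible workaround is to swap the host of an asset *href* for the local host URL before opening it. This is a sketch only: `rebase_href` is a hypothetical helper, not part of pystac or cloudify, and it assumes that the path below the host is identical on both servers.

```python
from urllib.parse import urlsplit, urlunsplit


def rebase_href(href: str, hosturl: str) -> str:
    """Replace scheme and host of href with those of hosturl, keeping the path."""
    target = urlsplit(hosturl)
    parts = urlsplit(href)
    return urlunsplit(
        (target.scheme, target.netloc, parts.path, parts.query, parts.fragment)
    )


# href as printed in the assets above, host URL as printed in section 3
href = "https://eerie.cloud.dkrz.de/datasets/example/zarr"
print(rebase_href(href, "https://l40095.lvt.dkrz.de:9000"))
# https://l40095.lvt.dkrz.de:9000/datasets/example/zarr
```

The result equals the `zarr_url` from section 3.1, so it could be passed to `xr.open_zarr` in the same way.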

## 4. Granularity of access: Chunks

With the `chunks` keyword in the `open_mfdataset` command, we control how the dataset is chunked. These chunks are mapped to the *zarr* chunks of the *zarr* API. Users of the API can request individual chunks.

By setting different chunks, we can adapt our data provision to the use case. If it is clear how users will access the data, we can chunk the data accordingly when opening it.
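
To get a feeling for what a chunking choice means for API users, the number of chunks and the uncompressed size of one chunk can be estimated with a little arithmetic. The array shape and the 4-byte data type below are illustrative assumptions, not values read from the example dataset, and `chunk_layout` is a hypothetical helper.

```python
import math


def chunk_layout(shape, chunks, itemsize=4):
    """Return (chunks per dimension, uncompressed bytes of one full chunk)."""
    n_chunks = [math.ceil(size / chunk) for size, chunk in zip(shape, chunks)]
    chunk_bytes = math.prod(chunks) * itemsize
    return n_chunks, chunk_bytes


shape = (1032, 192, 384)  # hypothetical (time, lat, lon) of a monthly variable

# One chunk per year of monthly data, full horizontal field
print(chunk_layout(shape, (12, 192, 384)))    # ([86, 1, 1], 3538944)

# Full time series, small horizontal tiles
print(chunk_layout(shape, (1032, 16, 16)))    # ([1, 12, 24], 1056768)
```

The first layout suits users who request maps for a few time steps, while the second suits users who request long time series at a few grid points.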