## 2. Data access

**Requirement:** Run #1 first.

**Note**

On levante, there is normally no reason to open data with xarray through xpublish, because the same *open* command works *without* xpublish. cdo is an exception (see the cdo section).

However, to keep this notebook runnable in one place, we use the node you are working on as the host. The host URL is built from the hostname of the respective levante node and the port that you used for the app. Accessing data through *eerie.cloud* works the same way: just replace `hosturl` with 'https://eerie.cloud.dkrz.de'.

In [1]:
port=9000
hostname=!echo $HOSTNAME
hosturl="https://"+hostname[0]+":"+str(port)
print(hosturl)

https://l40348.lvt.dkrz.de:9000
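
The `!echo $HOSTNAME` line relies on IPython shell magic. Outside a notebook, the same URL could be assembled with the standard library. This is a minimal sketch: `build_hosturl` is a hypothetical helper name, and `socket.getfqdn()` is assumed to return the same fully qualified node name as `$HOSTNAME`, which may not hold on every system.

```python
import socket


def build_hosturl(hostname: str, port: int, scheme: str = "https") -> str:
    """Assemble the base URL of the app from a host name and a port."""
    return f"{scheme}://{hostname}:{port}"


# Assumption: the app runs on the current node and listens on port 9000.
hosturl = build_hosturl(socket.getfqdn(), 9000)
print(hosturl)
```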


For our purposes, we have to tell the Python libraries not to verify SSL certificates:

In [2]:
storage_options=dict(verify_ssl=False)

### Xarray

Our example dataset is available via its name and the following `zarr_url`:

In [3]:
dsname="example"
zarr_url='/'.join([hosturl,"datasets",dsname,"zarr"])
print(zarr_url)

https://l40348.lvt.dkrz.de:9000/datasets/example/zarr


In [4]:
import xarray as xr
ds=xr.open_zarr(
    zarr_url,
    consolidated=True,
    storage_options=storage_options
)

### Intake

**All** datasets available through the app are collected in an intake catalog:

In [5]:
intake_url='/'.join([hosturl,"intake.yaml"])
print(intake_url)

https://l40348.lvt.dkrz.de:9000/intake.yaml


In [6]:
import intake
cat=intake.open_catalog(
    intake_url,
    storage_options=storage_options
)
list(cat)

['example']

In [7]:
cat[dsname](method="zarr",storage_options=storage_options).to_dask()



Dask-backed arrays in the returned dataset:

| Array size | Array shape | Chunk size | Chunk shape | Dask graph | Data type |
|---|---|---|---|---|---|
| 3.02 MiB | (1032, 192, 2) | 180.00 kiB | (60, 192, 2) | 18 chunks in 2 graph layers | float64 numpy.ndarray |
| 6.05 MiB | (1032, 384, 2) | 360.00 kiB | (60, 384, 2) | 18 chunks in 2 graph layers | float64 numpy.ndarray |
| 290.25 MiB | (1032, 192, 384) | 288.00 kiB | (1, 192, 384) | 1032 chunks in 2 graph layers | float32 numpy.ndarray |
| 16.12 kiB | (1032, 2) | 16 B | (1, 2) | 1032 chunks in 2 graph layers | datetime64[ns] numpy.ndarray |
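
The sizes and chunk counts in the array summary above follow directly from shape, chunk shape, and item size. As a worked check, the sketch below reproduces the numbers for the float32 array of shape (1032, 192, 384); the helper names `nbytes` and `n_chunks` are hypothetical and not part of dask or xarray.

```python
import math


def nbytes(shape, itemsize):
    """Size in bytes of an array with the given shape and bytes per element."""
    return math.prod(shape) * itemsize


def n_chunks(shape, chunks):
    """Number of chunks when each dimension is split into pieces of the given chunk length."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))


shape, chunks = (1032, 192, 384), (1, 192, 384)
itemsize = 4  # float32 uses 4 bytes per element

print(nbytes(shape, itemsize) / 2**20)   # array size in MiB: 290.25
print(nbytes(chunks, itemsize) / 2**10)  # chunk size in kiB: 288.0
print(n_chunks(shape, chunks))           # 1032
```

With one chunk per time step, every time step read through the app is a separate request of about 288 kiB.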


### STAC

For each dataset, a STAC item with enriched metadata is generated. The URL of this API is similar to the *zarr* URL:

In [8]:
stac_url=zarr_url.replace('/zarr','/stac')
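
The string replacement works for this dataset, but it would also touch a dataset name that happens to contain `/zarr`. An alternative is to build every endpoint from the host URL and the dataset name. The following is only a sketch: `dataset_urls` is a hypothetical helper, and the paths are the ones that appear in this notebook (`datasets/<name>/zarr`, `datasets/<name>/stac`, `intake.yaml`).

```python
def dataset_urls(hosturl: str, dsname: str) -> dict:
    """Return the endpoints used in this notebook for one dataset."""
    base = hosturl.rstrip("/")
    dataset = "/".join([base, "datasets", dsname])
    return {
        "zarr": dataset + "/zarr",
        "stac": dataset + "/stac",
        "intake": base + "/intake.yaml",
    }


urls = dataset_urls("https://l40348.lvt.dkrz.de:9000", "example")
print(urls["stac"])
```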

In [9]:
import pystac
import fsspec
import json
stacitem=pystac.item.Item.from_dict(
    json.load(fsspec.open(stac_url,**storage_options).open())
)
stacitem

In [10]:
stacitem.assets

{'data': <Asset href=https://eerie.cloud.dkrz.de/datasets/example/zarr>,
 'xarray_view': <Asset href=https://eerie.cloud.dkrz.de/datasets/example/>}

The STAC API is currently hard-coded for 'eerie.cloud', so the asset links point there even when the item is requested from a local host. In theory, we could open the data with xarray using the *href* of the `data` asset.
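
If the data should be read from the local app instead, the host part of the asset `href` would have to be swapped. The following is a minimal sketch using only the standard library; `rebase_href` is a hypothetical helper, and the two URLs are the ones printed earlier in this notebook.

```python
from urllib.parse import urlsplit, urlunsplit


def rebase_href(href: str, hosturl: str) -> str:
    """Replace scheme and host of `href` with those of `hosturl`, keeping the path."""
    target = urlsplit(hosturl)
    source = urlsplit(href)
    return urlunsplit(
        (target.scheme, target.netloc, source.path, source.query, source.fragment)
    )


href = "https://eerie.cloud.dkrz.de/datasets/example/zarr"
print(rebase_href(href, "https://l40348.lvt.dkrz.de:9000"))
# prints https://l40348.lvt.dkrz.de:9000/datasets/example/zarr
```

In the notebook, `href` would come from `stacitem.assets["data"].href`, and the rebased URL could be passed to `xr.open_zarr` as in the Xarray section.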

### cdo

We are developing a cdo version that can read cloudified data with netcdf-zarr via HTTP. The netCDF build it uses is a test version that prints some profiling warnings and HDF errors, which can usually be ignored. As of 01/2025, it lacks port parsing, which is why we cannot use it with internal apps on levante. Until this is solved, we can see how it works with eerie.cloud:

In [3]:
cdo="/work/bm0021/cdo_incl_cmor/cdo-test_cmortest_gcc/bin/cdo"
hosturl="https://eerie.cloud.dkrz.de"
dsname="era5-dkrz.surface_analysis_hourly"
#
zarr_url='/'.join([hosturl,"datasets",dsname,"zarr"])
# netCDF-C mode fragment; the backslashes escape '#' and '=' for the shell call below
zarr_suffix=r"\#mode\=zarr,s3,consolidated"
infile=zarr_url+zarr_suffix
#
!{cdo} sinfo {infile}

   File format : NCZarr filter
    -1 : Institut Source   T Steptype Levels Num    Points Num Dtype : Parameter ID
     1 : unknown  ECMWF    v instant       1   1    542080   1  F64f : -1
     2 : unknown  ECMWF    v instant       1   1    542080   1  F64f : -2
     3 : unknown  ECMWF    v instant       1   1    542080   1  F64f : -3
     4 : unknown  ECMWF    v instant       1   1    542080   1  F64f : -4
     5 : unknown  ECMWF    v instant       1   1    542080   1  F64f : -5
     6 : unknown  ECMWF    v instant       1   1    542080   1  F64f : -6
     7 : unknown  ECMWF    v instant       1   1    542080   1  F64f : -7
     8 :

When working with cdo on cloudified data, we have to select data with the **select** operator:

In [6]:
!{cdo} info -select,name=2t,timestep=1 {infile}

cdo(1) select: Process started
    -1 :       Date     Time   Level Gridsize    Miss :     Minimum        Mean     Maximum : Parameter ID
     1 : 1940-01-01 00:00:00       0   542080       0 :      226.14      284.22      310.38 : -6
cdo    info: Processed 542080 values from 1 variable over 1 timestep [7.16s 3675MB]
profiling:/home/k/k202186/repos:Cannot create directory
profiling:/home/k/k202186/repos/netcdf-c/build/plugins/CMakeFiles/h5deflate.dir/H5Zdeflate.c.gcda:Skip
profiling:/home/k/k202186/repos:Cannot create directory
profiling:/home/k/k202186/repos/netcdf-c/build/plugins/CMakeFiles/h5shuffle.dir/H5Zshuffle.c.gcda:Skip
profiling:/home/k/k202186/repos:Cannot create directory
profiling:/home/k/k202186/repos/netcdf-c/build/plugins/CMakeFiles/h5fletcher32.dir/H5checksum.c.gcda:Skip
profiling:/home/k/k202186/repos:Cannot create directory
profiling:/home/k/k202186/repos/netcdf-c/build/plugins/CMakeFiles/h5fletcher32
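
The `profiling:` lines come from the test build and, as noted above, can usually be ignored. If the cdo output is captured as text, they could be filtered out before display. A minimal sketch, assuming every such line starts with the literal prefix `profiling:` as in the output shown here; `drop_profiling_lines` is a hypothetical helper.

```python
def drop_profiling_lines(output: str) -> str:
    """Remove lines that start with 'profiling:' from captured cdo output."""
    kept = [line for line in output.splitlines() if not line.startswith("profiling:")]
    return "\n".join(kept)


# Two lines copied from the output above.
sample = (
    "cdo    info: Processed 542080 values from 1 variable over 1 timestep [7.16s 3675MB]\n"
    "profiling:/home/k/k202186/repos:Cannot create directory"
)
print(drop_profiling_lines(sample))
```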