# Loading data from the catalog

## Long story short:
```
import intake
try:
    import outtake  # needed for downloads from tape
except ImportError:
    import sys
    print("""Could not load outtake - tape downloads might not work. Try adding

module use /work/k20200/k202134/hsm-tools/outtake/module
module load hsm-tools/unstable

to your ~/.kernel_env file""", file=sys.stderr)


catalog_file = "/work/ka1081/Catalogs/dyamond-nextgems.json"  # nextGEMS and DYAMOND Winter
cat = intake.open_esm_datastore(catalog_file)
hits = cat.search(simulation_id="ngc2009", variable_id="tas", frequency="30minute")
dataset_dict = hits.to_dataset_dict(cdf_kwargs={"chunks": {"time": 1}})
keys = list(dataset_dict.keys())
dataset = dataset_dict[keys[0]]
dataset.tas.isel(time=1).max().values

# use get_from_cat from below to search a catalog
```

## Loading the catalog

The [intake-esm package](https://intake-esm.readthedocs.io/en/stable/) provides a tool for accessing large amounts of data without having to worry about where the data is stored. This section gives a short overview of how to use the catalog to your advantage.
The root of the intake catalog is a `.json` file.

In [1]:
import pandas as pd

pd.set_option("max_colwidth", None)  # makes the tables render better

import intake

try:
    import outtake  # needed for downloads from tape
except ImportError:
    import sys

    print(
        """Could not load outtake - tape downloads might not work. Try adding

module use /work/k20200/k202134/hsm-tools/outtake/module
module load hsm-tools/unstable

to your ~/.kernel_env file""",
        file=sys.stderr,
    )


def get_from_cat(catalog, columns):
    """A helper function for inspecting an intake catalog.

    Call with the catalog to be inspected and a list of columns of interest."""
    import pandas as pd

    pd.set_option("max_colwidth", None)  # makes the tables render better

    if isinstance(columns, str):  # accept a single column name as well as a list
        columns = [columns]
    return (
        catalog.df[columns]
        .drop_duplicates()
        .sort_values(columns)
        .reset_index(drop=True)
    )

In [2]:
catalog_file = "/work/ka1081/Catalogs/dyamond-nextgems.json"

cat = intake.open_esm_datastore(catalog_file)
cat

Unnamed: 0,unique
variable_id,643
project,2
institution_id,13
source_id,21
experiment_id,5
simulation_id,16
realm,6
frequency,16
time_reduction,5
grid_label,11


The meanings of the categories are:

| Info | Description |
|---|---|
| **variable_id** | Short name of the variable(s). |
| **project** | Larger project the simulation belongs to. |
| **source_id** | Model name. |
| **experiment_id** | Class of experiment. |
| **simulation_id** | ID of the run. |
| **realm** | Oceanic or atmospheric data. |
| **frequency** | Temporal frequency of the data points. |
| **time_reduction** | Average, instantaneous, ... |
| **grid_label** | Identifier for the horizontal grid type. |
| **level_type** | Identifier for the vertical grid type. |
| **time_min** | Start time of a specific file. |
| **time_max** | End of the time span covered by a specific file. |
| **grid_id** | Identifier of the horizontal grid. |
| **uri** | Uniform resource identifier, i.e. the location of the data files. |

## Searching the catalog

You can access the underlying pandas [dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) with `cat.df`. Here we show the first two entries with [head()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html):

In [3]:
cat.df.head(n=2)

Unnamed: 0,variable_id,project,institution_id,source_id,experiment_id,simulation_id,realm,frequency,time_reduction,grid_label,level_type,time_min,time_max,grid_id,format,uri
0,"(c, l, i, v, i)",DYAMOND_WINTER,CAMS,GRIST-5km,DW-ATM,r1i1p1f1,atmos,15min,unkonwn,gn,2d,2020-01-20T00:00:00.000,2020-01-20T23:45:00.000,not_implemented,netcdf,/work/ka1081/DYAMOND_WINTER/CAMS/GRIST-5km/DW-ATM/atmos/15min/clivi/r1i1p1f1/2d/gn/clivi_15min_GRIST-5km_DW-ATM_r1i1p1f1_2d_gn_20200120000000-20200120234500.nc
1,"(c, l, t)",DYAMOND_WINTER,CAMS,GRIST-5km,DW-ATM,r1i1p1f1,atmos,15min,unkonwn,gn,2d,2020-01-20T00:00:00.000,2020-01-20T23:45:00.000,not_implemented,netcdf,/work/ka1081/DYAMOND_WINTER/CAMS/GRIST-5km/DW-ATM/atmos/15min/clt/r1i1p1f1/2d/gn/clt_15min_GRIST-5km_DW-ATM_r1i1p1f1_2d_gn_20200120000000-20200120234500.nc


**To reduce the output, we defined the helper function `get_from_cat` at the top of this document.
We can use it to get an overview of the projects, experiments, and models in the catalog.**

In [4]:
get_from_cat(cat, ["project", "experiment_id", "source_id", "simulation_id"])

Unnamed: 0,project,experiment_id,source_id,simulation_id
0,DYAMOND_WINTER,DW-ATM,ARPEGE-NH-2km,r1i1p1f1
1,DYAMOND_WINTER,DW-ATM,GEM,r1i1p1f1
2,DYAMOND_WINTER,DW-ATM,GEOS-1km,r1i1p1f1
3,DYAMOND_WINTER,DW-ATM,GEOS-3km,r1i1p1f1
4,DYAMOND_WINTER,DW-ATM,GRIST-5km,r1i1p1f1
5,DYAMOND_WINTER,DW-ATM,ICON-NWP-2km,r1i1p1f1
6,DYAMOND_WINTER,DW-ATM,ICON-SAP-5km,dpp0014
7,DYAMOND_WINTER,DW-ATM,MPAS-3km,r1i1p1f1
8,DYAMOND_WINTER,DW-ATM,SCREAM-3km,r1i1p1f1
9,DYAMOND_WINTER,DW-ATM,SHiELD-3km,r1i1p1f1
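If you do not have access to the catalog file, the following minimal sketch shows what the helper does on a toy table. The `toy_cat` object and all of its values are made up for illustration; it only mimics the `.df` attribute of a real catalog.

```python
from types import SimpleNamespace

import pandas as pd


def get_from_cat(catalog, columns):
    """Same logic as the helper defined at the top of this document."""
    if isinstance(columns, str):
        columns = [columns]
    return (
        catalog.df[columns]
        .drop_duplicates()
        .sort_values(columns)
        .reset_index(drop=True)
    )


# Hypothetical stand-in for a catalog: any object with a .df DataFrame works here.
toy_cat = SimpleNamespace(
    df=pd.DataFrame(
        {
            "source_id": ["GEM", "GEM", "ARPEGE-NH-2km", "GEM"],
            "frequency": ["15min", "15min", "1hr", "1hr"],
            "uri": ["a.nc", "b.nc", "c.nc", "d.nc"],  # made-up file names
        }
    )
)

# One row per unique (source_id, frequency) combination, sorted by both columns.
overview = get_from_cat(toy_cat, ["source_id", "frequency"])
print(overview)

# A single column name may be passed as a plain string.
models = get_from_cat(toy_cat, "source_id")
print(models)
```

The four toy files collapse to three unique combinations of model and frequency, which is why the helper output is so much shorter than the raw `cat.df`.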


**Let's look into the variables of ICON in NGC2009.
Detailed information about how to search the catalog can be found in the [intake-esm documentation](https://intake-esm.readthedocs.io/en/stable/user-guide/overview.html#finding-unique-entries-for-individual-columns).**

In [5]:
get_from_cat(cat.search(simulation_id="ngc2009"), ["realm", "frequency", "variable_id"])

Unnamed: 0,realm,frequency,variable_id
0,atm,1day,"(clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs)"
1,atm,1day,"(psl, clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs)"
2,atm,1month,"(sfcwind, clivi, cllvi, cptgzvi, hfls, hfss, prlr, pr, prw, qgvi, qrvi, qsvi, rlds, rlus, rlut, rsds, rsdt, rsus, rsut, tauu, tauv, rpds_dir, rpds_dif, rvds_dif, rnds_dif)"
3,atm,1month,"(sfcwind, clivi, cllvi, cptgzvi, hfls, hfss, prlr, pr, prw, qgvi, qrvi, qsvi, rlds, rlus, rlut, rsds, rsdt, rsus, rsut, tauu, tauv, rpds_dir, rpds_dif, rvds_dif, rnds_dif, clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs)"
4,atm,1month,"(sfcwind, clivi, cllvi, cptgzvi, hfls, hfss, prlr, pr, prw, qgvi, qrvi, qsvi, rlds, rlus, rlut, rsds, rsdt, rsus, rsut, tauu, tauv, rpds_dir, rpds_dif, rvds_dif, rnds_dif, psl, clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs)"
5,atm,1month,"(ua, va, wa, ta, hus, rho, clw, cli, pfull, zghalf, zg, dzghalf)"
6,atm,2hour,"(phalf,)"
7,atm,2minute,"(fc, frland, hsurf, p, rnds_dif, rnds_dir, rsds, rvds_dif, rvds_dir, soiltype, t, u, v, w)"
8,atm,30minute,"(hydro_canopy_cond_limited_box, hydro_w_snow_box, hydro_snow_soil_dens_box)"
9,atm,30minute,"(hydro_discharge_ocean_box, hydro_drainage_box, hydro_runoff_box, hydro_transpiration_box, sse_grnd_hflx_old_box)"


**Let's look into surface air temperature (tas)**

In [6]:
get_from_cat(
    cat.search(simulation_id="ngc2009", variable_id="tas"),
    ["realm", "frequency", "level_type", "variable_id"],
)

Unnamed: 0,realm,frequency,level_type,variable_id
0,atm,1day,ml,"(clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs)"
1,atm,1day,ml,"(psl, clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs)"
2,atm,1month,ml,"(sfcwind, clivi, cllvi, cptgzvi, hfls, hfss, prlr, pr, prw, qgvi, qrvi, qsvi, rlds, rlus, rlut, rsds, rsdt, rsus, rsut, tauu, tauv, rpds_dir, rpds_dif, rvds_dif, rnds_dif, clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs)"
3,atm,1month,ml,"(sfcwind, clivi, cllvi, cptgzvi, hfls, hfss, prlr, pr, prw, qgvi, qrvi, qsvi, rlds, rlus, rlut, rsds, rsdt, rsus, rsut, tauu, tauv, rpds_dir, rpds_dif, rvds_dif, rnds_dif, psl, clt, evspsbl, tas, ts, rldscs, rlutcs, rsdscs, rsuscs, rsutcs)"
4,atm,30minute,ml,"(psl, ps, sit, sic, tas, ts, uas, vas, cfh_lnd)"


In [7]:
hits = cat.search(simulation_id="ngc2009", variable_id="tas", frequency="30minute")
# The 1day files would crash the Jupyter kernel because they are inconsistent across the run.
hits

Unnamed: 0,unique
variable_id,9
project,1
institution_id,1
source_id,1
experiment_id,1
simulation_id,1
realm,1
frequency,1
time_reduction,1
grid_label,1


**Note**: `variable_id` still shows 9 unique values because the file(s) containing tas hold 9 variables in total.
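The following sketch illustrates the counting in plain Python. It is not how intake-esm is implemented; it only mimics the observable behaviour, using two of the variable groups listed above as toy catalog rows.

```python
# Each toy row stands for one file entry, which may hold several variables.
rows = [
    {
        "frequency": "30minute",
        "variable_id": ("psl", "ps", "sit", "sic", "tas", "ts", "uas", "vas", "cfh_lnd"),
    },
    {"frequency": "2hour", "variable_id": ("phalf",)},
]

# Searching for "tas" keeps every row whose variable tuple contains it ...
matches = [row for row in rows if "tas" in row["variable_id"]]

# ... and the summary counts all distinct variables in the rows that remain.
unique_vars = {name for row in matches for name in row["variable_id"]}
print(len(matches), len(unique_vars))
```

One row matches, but it carries nine variable names, so the count of unique `variable_id` values is 9 rather than 1.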

## Loading the Data

Once the search has narrowed the catalog down to the entries you want, you can load the actual data.

The option `cdf_kwargs={"chunks": {"time": 1}}` ensures that only reasonably sized chunks of data are loaded at a time. Your kernel WILL break if you try to load the whole dataset at once!

In [8]:
dataset_dict = hits.to_dataset_dict(cdf_kwargs={"chunks": {"time": 1}})


--> The keys in the returned dictionary of datasets are constructed as follows:
	'project.institution_id.source_id.experiment_id.simulation_id.realm.frequency.time_reduction.grid_label.level_type'


We have only one dataset. To access it, we need its key:

In [9]:
keys = list(dataset_dict.keys())
keys

['nextGEMS.MPI-M.ICON-ESM.nextgems_cycle2.ngc2009.atm.30minute.inst.gn.ml']
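Because the key follows the template printed above, it can be split back into its catalog fields. This is a small sketch that assumes none of the field values contains a dot, which holds for the key shown here.

```python
template = (
    "project.institution_id.source_id.experiment_id.simulation_id."
    "realm.frequency.time_reduction.grid_label.level_type"
)
key = "nextGEMS.MPI-M.ICON-ESM.nextgems_cycle2.ngc2009.atm.30minute.inst.gn.ml"

# Pair every field name of the template with the matching part of the key.
fields = dict(zip(template.split("."), key.split(".")))

for name, value in fields.items():
    print(f"{name:15s} {value}")
```

This makes it easy to check, for example, that `fields["simulation_id"]` is `ngc2009` and `fields["frequency"]` is `30minute` when a search returns more than one dataset.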

Now we can finally access the data:

In [10]:
dataset = dataset_dict[keys[0]]
dataset

Unnamed: 0,Array,Chunk
Bytes,2.80 TiB,80.00 MiB
Shape,"(36722, 1, 20971520)","(1, 1, 20971520)"
Count,74261 Tasks,36722 Chunks
Type,float32,numpy.ndarray
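The sizes in this summary follow directly from the array shape and explain why chunking along time matters. A quick back-of-the-envelope check, using only the numbers shown above (the variable names are ours):

```python
# Shape reported for tas: (time, singleton dimension, cells), stored as float32.
n_time, n_single, n_cells = 36722, 1, 20971520
bytes_per_value = 4  # float32

# With chunks={"time": 1}, one chunk holds exactly one time step.
chunk_bytes = 1 * n_single * n_cells * bytes_per_value
total_bytes = n_time * chunk_bytes

chunk_mib = chunk_bytes / 2**20
total_tib = total_bytes / 2**40
print(f"{chunk_mib:.2f} MiB per chunk, {total_tib:.2f} TiB in total")
```

A single time step fits comfortably in memory at 80 MiB, while the full array of about 2.80 TiB does not, so work on one or a few time steps at a time.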


In [11]:
dataset.tas.isel(time=1).min().values
# the first time step just contains zeros, so we take the second by saying isel(time=1)

array(225.27545, dtype=float32)

In [12]:
dataset.tas.isel(time=1).max().values

array(312.81677, dtype=float32)

In [13]:
dataset.tas.max(dim="ncells")  # lazy evaluation - no real work is done yet.

Unnamed: 0,Array,Chunk
Bytes,143.45 kiB,4 B
Shape,"(36722, 1)","(1, 1)"
Count,147705 Tasks,36722 Chunks
Type,float32,numpy.ndarray
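To get a feeling for lazy evaluation without touching the 2.80 TiB dataset, here is a minimal sketch with a tiny dask array. The toy array and its shape are made up; it only imitates the chunk layout of `tas` (one chunk per time step).

```python
import dask.array as da

# Toy stand-in for tas: 4 time steps, 1 singleton dimension, 8 cells.
toy = da.ones((4, 1, 8), chunks=(1, 1, 8), dtype="float32")

# Reducing over the last axis only builds a task graph; nothing is computed yet.
lazy_max = toy.max(axis=-1)
print(lazy_max)

# compute() runs the graph and returns a plain numpy array.
result = lazy_max.compute()
print(result.shape)
```

With xarray, asking for `.values` triggers the computation in the same way that `.compute()` does in this sketch, which is why the cell above returns immediately and the one below can take a long time.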


In [14]:
# evaluate if you have time to spare
# dataset.tas.max(dim="ncells").values