# Quickstart

This page demonstrates how to use intake-dataframe-catalog by building and using a very simple dataframe catalog comprising the intake-esm tutorial catalogs available at https://github.com/intake/intake-esm/raw/main/tutorial-catalogs.

In [1]:
import intake

## Prerequisites

First, we open each of the intake-esm catalogs. Our goal is to create an intake dataframe catalog with these as subcatalogs.

In [3]:
aws_cesm2_lens = intake.open_esm_datastore(
    "https://raw.githubusercontent.com/NCAR/cesm2-le-aws/main/intake-catalogs/aws-cesm2-le.json"
)

In [4]:
aws_cmip6 = intake.open_esm_datastore(
    "https://cmip6-pds.s3.amazonaws.com/pangeo-cmip6.json"
)

In [5]:
google_cmip6 = intake.open_esm_datastore(
    "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
)

Each of these intake-esm catalogs includes a table of remote zarr stores and the metadata associated with each one. For example, the first nine rows and columns of the `aws_cmip6` table are:

In [6]:
aws_cmip6.df.iloc[:9, :9]

Unnamed: 0,activity_id,institution_id,source_id,experiment_id,member_id,table_id,variable_id,grid_label,zstore
0,HighResMIP,CMCC,CMCC-CM2-HR4,highresSST-present,r1i1p1f1,Amon,ta,gn,s3://cmip6-pds/CMIP6/HighResMIP/CMCC/CMCC-CM2-...
1,HighResMIP,CMCC,CMCC-CM2-HR4,highresSST-present,r1i1p1f1,Amon,tauv,gn,s3://cmip6-pds/CMIP6/HighResMIP/CMCC/CMCC-CM2-...
2,HighResMIP,CMCC,CMCC-CM2-HR4,highresSST-present,r1i1p1f1,Amon,zg,gn,s3://cmip6-pds/CMIP6/HighResMIP/CMCC/CMCC-CM2-...
3,HighResMIP,CMCC,CMCC-CM2-HR4,highresSST-present,r1i1p1f1,Amon,vas,gn,s3://cmip6-pds/CMIP6/HighResMIP/CMCC/CMCC-CM2-...
4,HighResMIP,CMCC,CMCC-CM2-HR4,highresSST-present,r1i1p1f1,Amon,tas,gn,s3://cmip6-pds/CMIP6/HighResMIP/CMCC/CMCC-CM2-...
5,HighResMIP,CMCC,CMCC-CM2-HR4,highresSST-present,r1i1p1f1,Amon,rsut,gn,s3://cmip6-pds/CMIP6/HighResMIP/CMCC/CMCC-CM2-...
6,HighResMIP,CMCC,CMCC-CM2-HR4,highresSST-present,r1i1p1f1,Amon,rsus,gn,s3://cmip6-pds/CMIP6/HighResMIP/CMCC/CMCC-CM2-...
7,HighResMIP,CMCC,CMCC-CM2-HR4,highresSST-present,r1i1p1f1,Amon,rsdt,gn,s3://cmip6-pds/CMIP6/HighResMIP/CMCC/CMCC-CM2-...
8,HighResMIP,CMCC,CMCC-CM2-HR4,highresSST-present,r1i1p1f1,Amon,rsds,gn,s3://cmip6-pds/CMIP6/HighResMIP/CMCC/CMCC-CM2-...


While the metadata in each of our intake-esm catalogs may follow a different schema (e.g. the CMIP6 catalogs follow the [CMIP6 controlled vocabulary](https://pyessv.es-doc.org/1/retrieve/wcrp/cmip6)), all catalogs generally include the same sort of metadata. For example, all catalogs include metadata describing the models used, the variables available, the temporal frequency and so on. These are the metadata we'll include in our intake dataframe catalog.
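As a rough illustration of what "the same sort of metadata" means in practice, the sketch below maps schema-specific column names onto the two common keys used later on this page (`model` and `variable`). The `COLUMN_MAP` dictionary and the `to_common_metadata` helper are hypothetical names made up for this illustration and are not part of intake-dataframe-catalog; the rows are toy values, not real catalog entries.

```python
# Hypothetical mapping from common metadata keys to schema-specific columns.
COLUMN_MAP = {
    "cmip6": {"model": "source_id", "variable": "variable_id"},
    # The CESM2-LENS table has no model column, so the model is supplied
    # separately as a default.
    "cesm2_lens": {"variable": "variable"},
}


def to_common_metadata(record, schema, defaults=None):
    """Translate one catalog row (a dict) to the common metadata keys."""
    common = dict(defaults or {})
    for key, column in COLUMN_MAP[schema].items():
        common[key] = record[column]
    return common


# Toy rows standing in for single rows of each catalog's table.
cmip6_row = {"source_id": "CanESM5", "variable_id": "thetao"}
cesm2_row = {"variable": "TEMP"}

cmip6_meta = to_common_metadata(cmip6_row, "cmip6")
cesm2_meta = to_common_metadata(
    cesm2_row, "cesm2_lens", defaults={"model": "CESM2-LENS"}
)
```

Both results use the same keys, which is what lets a single dataframe catalog describe subcatalogs with different schemas.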

## Initialising a dataframe catalog

We'll start by initialising an intake-dataframe-catalog object (`intake_dataframe_catalog.core.DfFileCatalog`):

In [7]:
cat = intake.open_df_catalog(path="./example_catalog.csv", mode="w")

## Adding subcatalogs

We can add subcatalogs to the dataframe catalog using the `.add` method. This method takes as arguments the subcatalog to add and the metadata to associate with that subcatalog. Here, we'll add metadata about the model(s) and variable(s). We'll parse this metadata from each intake-esm catalog. The `aws_cesm2_lens` subcatalog comprises only one model and is the simplest to add:

In [8]:
aws_cesm2_lens.name = "AWS_CESM2_LENS"
aws_cesm2_lens_model = "CESM2-LENS"
aws_cesm2_lens_variables = list(
    set(
        aws_cesm2_lens.df.variable.unique().astype(str)
    )
)

cat.add(
    aws_cesm2_lens,
    metadata={"model": aws_cesm2_lens_model, "variable": aws_cesm2_lens_variables}
)

cat

Unnamed: 0_level_0,model,variable
name,Unnamed: 1_level_1,Unnamed: 2_level_1
AWS_CESM2_LENS,{CESM2-LENS},"{V, PRECSL, hi, TREFMXAV, FSNS, TS, ICEFRAC, FSNSC, SHFLX, VVEL, VNT, Z3, FSNO, PRECSC, PD, LHFLX, WVEL, H2OSNO, FLNS, T, RAIN, UVEL, Q, TEMP, TREFHT, SNOW, VNS, SOILWATER_10CM, FLUT, TREFHTMX, na..."


Both of the CMIP6 catalogs comprise multiple models. To keep track of which variables are available for which model, we must add an entry to our dataframe catalog for each model. For example, for the `aws_cmip6` catalog:

In [9]:
aws_cmip6.name = "AWS_CMIP6"

for model in aws_cmip6.df.source_id.unique():
    variables = list(
        set(
            aws_cmip6.df[aws_cmip6.df.source_id == model].variable_id.unique().astype(str)
        )
    )
    cat.add(
        aws_cmip6,
        metadata={"model": model, "variable": variables}
    )
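The loop above filters the full table once per model. The same per-model variable lists can be built in a single pass; the sketch below shows the idea on a few toy rows using only the standard library. The rows are made-up stand-ins for the `source_id` and `variable_id` columns of `aws_cmip6.df`, and `variables_by_model` and `entries` are names chosen for this illustration.

```python
from collections import defaultdict

# Toy stand-in for the (source_id, variable_id) columns of a CMIP6 table.
rows = [
    ("CanESM5", "thetao"),
    ("CanESM5", "tos"),
    ("GFDL-ESM4", "fFire"),
    ("CanESM5", "thetao"),  # duplicates are expected: one row per store
]

# Collect the unique variables for each model in one pass over the rows.
variables_by_model = defaultdict(set)
for source_id, variable_id in rows:
    variables_by_model[source_id].add(variable_id)

# Each dict here corresponds to the `metadata` of one `cat.add(...)` call.
entries = [
    {"model": model, "variable": sorted(variables)}
    for model, variables in variables_by_model.items()
]
```

With pandas, `df.groupby("source_id")["variable_id"].unique()` produces an equivalent grouping directly from the catalog table.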

And the same for the `google_cmip6` catalog:

In [10]:
google_cmip6.name = "GOOGLE_CMIP6"

for model in google_cmip6.df.source_id.unique():
    variables = list(
        set(
            google_cmip6.df[google_cmip6.df.source_id == model].variable_id.unique().astype(str)
        )
    )
    cat.add(
        google_cmip6,
        metadata={"model": model, "variable": variables}
    )

Note that even though we added a separate row for each model in each CMIP6 catalog, displaying the dataframe catalog in a Jupyter environment shows a convenient summary with only one row per subcatalog (this is the `.df_summary` property):

In [11]:
cat

Unnamed: 0_level_0,model,variable
name,Unnamed: 1_level_1,Unnamed: 2_level_1
AWS_CESM2_LENS,{CESM2-LENS},"{V, PRECSL, hi, TREFMXAV, FSNS, TS, ICEFRAC, FSNSC, SHFLX, VVEL, VNT, Z3, FSNO, PRECSC, PD, LHFLX, WVEL, H2OSNO, FLNS, T, RAIN, UVEL, Q, TEMP, TREFHT, SNOW, VNS, SOILWATER_10CM, FLUT, TREFHTMX, na..."
AWS_CMIP6,"{CESM2-WACCM, MPI-ESM1-2-XR, CMCC-CM2-SR5, BCC-ESM1, GISS-E2-2-H, IPSL-CM5A2-INCA, CNRM-ESM2-1, MPI-ESM1-2-HR, AWI-ESM-1-1-LR, NorESM2-MM, KIOST-ESM, IPSL-CM6A-LR, MIROC6, E3SM-1-0, MPI-ESM-1-2-HA...","{pastureFrac, cltmodis, co3satcalcos, tntrscs, co23D, prw, tauu, fediss, epcalc100, rss, airmass, epn100, pbfe, obvfsq, sfno2, rsus, wo, chlmiscos, hcl, parag, siflsensupbot, cllcalipso, tntrs, zm..."
GOOGLE_CMIP6,"{CESM2-WACCM, MPI-ESM1-2-XR, CMCC-CM2-SR5, BCC-ESM1, GISS-E2-2-H, IPSL-CM5A2-INCA, CNRM-ESM2-1, MPI-ESM1-2-HR, AWI-ESM-1-1-LR, NorESM2-MM, KIOST-ESM, IPSL-CM6A-LR, MIROC6, E3SM-1-0, MPI-ESM-1-2-HA...","{pastureFrac, cltmodis, co3satcalcos, tntrscs, co23D, prw, tauu, fediss, epcalc100, rss, airmass, epn100, pbfe, obvfsq, sfno2, rsus, wo, chlmiscos, hcl, parag, cllcalipso, tntrs, zmesoos, rtmt, ch..."
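To make the relationship between the stored rows and the displayed summary concrete, here is a minimal sketch that collapses one-row-per-model entries into one set-valued row per subcatalog name. It uses toy rows and plain dictionaries, and it is only an illustration of the idea, not how `.df_summary` is implemented.

```python
# Toy rows as they might be stored: one per (subcatalog name, model).
rows = [
    {"name": "AWS_CMIP6", "model": "CanESM5", "variable": ["thetao", "tos"]},
    {"name": "AWS_CMIP6", "model": "GFDL-ESM4", "variable": ["fFire", "tos"]},
    {"name": "GOOGLE_CMIP6", "model": "CanESM5", "variable": ["thetao"]},
]

# Collapse to one entry per name, taking the union of values in each column.
summary = {}
for row in rows:
    entry = summary.setdefault(row["name"], {"model": set(), "variable": set()})
    entry["model"].add(row["model"])
    entry["variable"].update(row["variable"])
```

The underlying per-model rows are unchanged by this kind of aggregation, which is why searches can still distinguish between models within a subcatalog.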


Passing `overwrite=True` to `.add` will overwrite any existing entries with the same name:

In [12]:
cat.add(
    aws_cesm2_lens,
    metadata={
        "model": aws_cesm2_lens_model,
        "variable": aws_cesm2_lens_variables,
    },
    overwrite=True,
)

cat

Unnamed: 0_level_0,model,variable
name,Unnamed: 1_level_1,Unnamed: 2_level_1
AWS_CESM2_LENS,{CESM2-LENS},"{V, PRECSL, hi, TREFMXAV, FSNS, TS, ICEFRAC, FSNSC, SHFLX, VVEL, VNT, Z3, FSNO, PRECSC, PD, LHFLX, WVEL, H2OSNO, FLNS, T, RAIN, UVEL, Q, TEMP, TREFHT, SNOW, VNS, SOILWATER_10CM, FLUT, TREFHTMX, na..."
AWS_CMIP6,"{CESM2-WACCM, MPI-ESM1-2-XR, CMCC-CM2-SR5, BCC-ESM1, GISS-E2-2-H, IPSL-CM5A2-INCA, CNRM-ESM2-1, MPI-ESM1-2-HR, AWI-ESM-1-1-LR, NorESM2-MM, KIOST-ESM, IPSL-CM6A-LR, MIROC6, E3SM-1-0, MPI-ESM-1-2-HA...","{pastureFrac, cltmodis, co3satcalcos, tntrscs, co23D, prw, tauu, fediss, epcalc100, rss, airmass, epn100, pbfe, obvfsq, sfno2, rsus, wo, chlmiscos, hcl, parag, siflsensupbot, cllcalipso, tntrs, zm..."
GOOGLE_CMIP6,"{CESM2-WACCM, MPI-ESM1-2-XR, CMCC-CM2-SR5, BCC-ESM1, GISS-E2-2-H, IPSL-CM5A2-INCA, CNRM-ESM2-1, MPI-ESM1-2-HR, AWI-ESM-1-1-LR, NorESM2-MM, KIOST-ESM, IPSL-CM6A-LR, MIROC6, E3SM-1-0, MPI-ESM-1-2-HA...","{pastureFrac, cltmodis, co3satcalcos, tntrscs, co23D, prw, tauu, fediss, epcalc100, rss, airmass, epn100, pbfe, obvfsq, sfno2, rsus, wo, chlmiscos, hcl, parag, cllcalipso, tntrs, zmesoos, rtmt, ch..."
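The overwrite behaviour described above can be pictured with a small sketch on plain Python lists. The `add_entry` function is hypothetical and only mirrors the statement that existing entries with the same name are replaced; it is not the library's implementation and the rows are toy values.

```python
def add_entry(rows, name, metadata, overwrite=False):
    """Return a new list of rows with one entry added for `name`."""
    if overwrite:
        # Drop every existing row that shares this name before adding.
        rows = [row for row in rows if row["name"] != name]
    return rows + [{"name": name, **metadata}]


rows = [
    {"name": "AWS_CMIP6", "model": "CanESM5"},
    {"name": "AWS_CMIP6", "model": "GFDL-ESM4"},
    {"name": "AWS_CESM2_LENS", "model": "CESM2-LENS"},
]

replaced = add_entry(rows, "AWS_CMIP6", {"model": "MIROC6"}, overwrite=True)
```

Under this model, overwriting a subcatalog that was added with one row per model removes all of those rows, so it may be safest to overwrite only on the first call when re-adding a multi-model subcatalog in a loop.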


## Saving a dataframe catalog

Once we're happy with the subcatalogs we have in our dataframe catalog, we can save it using the `.save` method.

In [13]:
cat.save()

## Loading a dataframe catalog

When reading existing catalogs, it's good practice to use `mode="r"` (default) to avoid accidentally overwriting the catalog.

In [14]:
cat = intake.open_df_catalog(
    path="./example_catalog.csv",
    columns_with_iterables=["variable"],
)
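The `columns_with_iterables` argument matters because a CSV file stores every cell as text, so a list of variables comes back as a string unless it is parsed on load. The sketch below illustrates that round trip with the standard library only; it is an assumption about the general mechanism rather than a description of how intake-dataframe-catalog reads files.

```python
import ast
import csv
import io

# Write one catalog-like row whose "variable" cell holds a list.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "model", "variable"])
writer.writeheader()
writer.writerow(
    {"name": "AWS_CMIP6", "model": "CanESM5", "variable": ["thetao", "tos"]}
)

# Reading it back yields plain strings for every cell.
buffer.seek(0)
raw = next(csv.DictReader(buffer))

# Parse only the columns known to hold iterables back into Python objects.
columns_with_iterables = ["variable"]
parsed = {
    key: ast.literal_eval(value) if key in columns_with_iterables else value
    for key, value in raw.items()
}
```

Here `raw["variable"]` is the string `"['thetao', 'tos']"`, while `parsed["variable"]` is a real list again.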

## Searching in a dataframe catalog

We can use the `.search` method to find subcatalogs that satisfy metadata queries:

In [15]:
new_cat = cat.search(model="CanESM5")

new_cat

Unnamed: 0_level_0,model,variable
name,Unnamed: 1_level_1,Unnamed: 2_level_1
AWS_CMIP6,{CanESM5},"{cct, phycos, tos, evspsblveg, somint, intpp, npp, prw, tauu, cSoil, epcalc100, rsut, epn100, sos, obvfsq, epc100, no3, wo, rsus, simass, mrsos, epfy, clw, rlutcs, rtmt, fgco2nat, zos, detoc, sf6,..."
GOOGLE_CMIP6,{CanESM5},"{cct, phycos, tos, evspsblveg, somint, intpp, npp, cSoil, tauu, prw, epcalc100, rsut, epn100, sos, obvfsq, epc100, no3, rsus, wo, simass, mrsos, epfy, clw, rlutcs, rtmt, fgco2nat, detoc, zos, sf6,..."


We can combine queries for more complex searches:

In [16]:
new_cat = cat.search(model="CanESM5", variable=["thetao", "msftmzmpa"])

new_cat

Unnamed: 0_level_0,model,variable
name,Unnamed: 1_level_1,Unnamed: 2_level_1
AWS_CMIP6,{CanESM5},"{thetao, msftmzmpa}"
GOOGLE_CMIP6,{CanESM5},{thetao}


By default, querying on a list as above returns subcatalogs that match on any of the values in the list. The `.search` method also has an optional `require_all` argument. If this is set to `True`, only subcatalogs that satisfy all of the query criteria are returned.

In [17]:
new_cat = cat.search(model="CanESM5", variable=["thetao", "msftmzmpa"], require_all=True)

new_cat

Unnamed: 0_level_0,model,variable
name,Unnamed: 1_level_1,Unnamed: 2_level_1
AWS_CMIP6,{CanESM5},"{thetao, msftmzmpa}"
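The difference between the default "any" matching and `require_all=True` can be sketched with sets. The `matches` function below is a hypothetical, simplified model of the behaviour for a single column; the real `.search` method also combines queries across columns, and the variable lists are toy values chosen to mirror the results shown above.

```python
def matches(row_values, query_values, require_all=False):
    """Return True if a row's values satisfy the query (any or all)."""
    hits = set(row_values) & set(query_values)
    if require_all:
        # Every queried value must be present.
        return hits == set(query_values)
    # At least one queried value must be present.
    return bool(hits)


query = ["thetao", "msftmzmpa"]
aws_variables = ["thetao", "msftmzmpa", "tos"]  # toy list containing both
google_variables = ["thetao", "tos"]  # toy list containing only one

any_aws = matches(aws_variables, query)
any_google = matches(google_variables, query)
all_aws = matches(aws_variables, query, require_all=True)
all_google = matches(google_variables, query, require_all=True)
```

Both toy lists pass the default check, but only the one containing both variables passes with `require_all=True`.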


Regular expressions can also be used in queries. For example, below we search for subcatalogs with variables whose names contain the string "Fire". We can see that only one model (GFDL-ESM4) in each of the CMIP6 catalogs contains variables matching this criterion.

In [18]:
new_cat = cat.search(variable=".*Fire.*")

new_cat

Unnamed: 0_level_0,model,variable
name,Unnamed: 1_level_1,Unnamed: 2_level_1
AWS_CMIP6,{GFDL-ESM4},"{fFireNat, fFire}"
GOOGLE_CMIP6,{GFDL-ESM4},"{fFireNat, fFire}"
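To see what a pattern like `.*Fire.*` selects, the sketch below applies it to toy variable lists with Python's `re` module. Whether the library matches the whole value or searches within it is not stated here; for this particular pattern the two are equivalent, and the sketch uses `fullmatch`. The variable lists are illustrative only.

```python
import re

pattern = re.compile(".*Fire.*")

# Toy variable lists keyed by model.
variables_by_model = {
    "GFDL-ESM4": ["fFire", "fFireNat", "tas"],
    "CanESM5": ["thetao", "tos"],
}

# Keep only the variables that match, then drop models with no matches.
matched = {
    model: [v for v in variables if pattern.fullmatch(v)]
    for model, variables in variables_by_model.items()
}
matched = {model: hits for model, hits in matched.items() if hits}
```

Note that Python regular expressions are case-sensitive by default, so a lowercase "fire" would not be matched by this pattern.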


## Loading subcatalogs

There are a few options for loading subcatalogs. We can load individual catalogs if we know their name:

In [19]:
cat["AWS_CESM2_LENS"] # This is the aws_cesm2_lens intake-esm catalog

Unnamed: 0,unique
variable,53
long_name,51
component,4
experiment,2
...,...
start_time,4
end_time,7
path,313
derived_variable,0


Or, equivalently, using attribute access:

In [20]:
cat.AWS_CESM2_LENS

Unnamed: 0,unique
variable,53
long_name,51
component,4
experiment,2
...,...
start_time,4
end_time,7
path,313
derived_variable,0


Alternatively, there are `.to_subcatalog` and `.to_subcatalog_dict` methods. The former only works when there is exactly one subcatalog remaining in the dataframe catalog (e.g. after performing `.search` operations). The latter loads all subcatalogs into a dictionary with the corresponding subcatalog names as keys.

In [21]:
subcat = cat.search(variable="TEMP").to_subcatalog()

subcat

Unnamed: 0,unique
variable,53
long_name,51
component,4
experiment,2
...,...
start_time,4
end_time,7
path,313
derived_variable,0


In [22]:
subcat_dict = cat.to_subcatalog_dict()

subcat_dict

{'AWS_CESM2_LENS': <aws-cesm2-le catalog with 40 dataset(s) from 322 asset(s)>,
 'GOOGLE_CMIP6': <pangeo-cmip6 catalog with 7674 dataset(s) from 514818 asset(s)>,
 'AWS_CMIP6': <pangeo-cmip6 catalog with 7780 dataset(s) from 522217 asset(s)>}
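When a search may or may not narrow the catalog down to a single subcatalog, one defensive pattern is to work from a name-to-subcatalog dictionary and check its size explicitly. The `only_subcatalog` helper below is hypothetical and operates on any plain dictionary, such as the one returned by `.to_subcatalog_dict()`; the placeholder strings stand in for real subcatalog objects.

```python
def only_subcatalog(subcatalogs):
    """Return the single value of a name -> subcatalog dict, or raise."""
    if len(subcatalogs) != 1:
        names = sorted(subcatalogs)
        raise ValueError(
            f"Expected exactly one subcatalog, found {len(subcatalogs)}: {names}"
        )
    return next(iter(subcatalogs.values()))


# Placeholder strings stand in for loaded subcatalog objects.
single = only_subcatalog({"AWS_CESM2_LENS": "<subcatalog>"})

try:
    only_subcatalog({"AWS_CMIP6": "<a>", "GOOGLE_CMIP6": "<b>"})
    error_message = None
except ValueError as err:
    error_message = str(err)
```

The error message lists the remaining names, which can help decide how to refine the search further.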

Once subcatalogs are loaded, we can access data in the normal way for that intake source type (e.g. see the [intake-esm documentation](https://intake-esm.readthedocs.io/en/latest/index.html) for how to use intake-esm catalogs like the ones we've been using in this demonstration).