# Quickstart

This page demonstrates how to use intake-dataframe-catalog by building and using a very simple dataframe catalog comprising a small number of publicly available data sources:

- An intake-esm datastore for the [Community Earth System Model Large Ensemble (CESM LENS)](https://registry.opendata.aws/ncar-cesm-lens) data hosted on AWS by NCAR
- An intake-esm datastore for the [Coupled Model Intercomparison Project 6](https://registry.opendata.aws/cmip6/) data hosted on AWS by Pangeo
- An intake-esm datastore for the [Coupled Model Intercomparison Project 6](https://console.cloud.google.com/marketplace/product/noaa-public/cmip6?pli=1) data hosted on Google Cloud by Pangeo
- A CSV file of global annual average temperatures [provided by NOAA](https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series)

In [1]:
import intake

## Getting set up

First, we open each of the data sources. Our goal is to create an intake dataframe catalog with these as the cataloged intake sources.

In [3]:
aws_cesm2_lens = intake.open_esm_datastore(
    "https://raw.githubusercontent.com/NCAR/cesm2-le-aws/main/intake-catalogs/aws-cesm2-le.json"
)

In [4]:
aws_cmip6 = intake.open_esm_datastore(
    "https://cmip6-pds.s3.amazonaws.com/pangeo-cmip6.json"
)

In [5]:
google_cmip6 = intake.open_esm_datastore(
    "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
)

In [6]:
noaa_global_temp = intake.open_csv(
    "https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series/globe/land_ocean/1/4/1850-2023/data.csv",
    csv_kwargs={"skiprows": 4},
)

All of these data sources point to data that share some key attributes. For example, all sources contain timeseries of *climate variables* generated by a *model* at a particular *temporal frequency*. These shared metadata attributes are what we might consider including as columns in our intake dataframe catalog.

## Initialising a dataframe catalog

We'll start by initialising an intake-dataframe-catalog object (`intake_dataframe_catalog.core.DfFileCatalog`). This can be done by instantiating the class directly, or by using the `intake.open_df_catalog` convenience function:

In [7]:
cat = intake.open_df_catalog(path="./example_catalog.csv", mode="w")

## Adding sources

We can add sources to the dataframe catalog using the `.add` method. This method takes as arguments the source to add and the metadata to associate with that source. As a simple demonstration, we'll add metadata about the model(s) and variable(s). The `noaa_global_temp` source is the easiest to add, since it contains only one model and one variable:

In [8]:
noaa_global_temp.name = "noaa_global_temp"

cat.add(
    noaa_global_temp,
    metadata={"model": "NOAAGlobalTemp", "variable": ["global_temp_anom"]}
)

cat

name,model,variable
noaa_global_temp,{NOAAGlobalTemp},{global_temp_anom}


For the intake-esm datastores, we'll parse model and variable metadata from the datastore itself. The `aws_cesm2_lens` datastore comprises only one model:

In [9]:
aws_cesm2_lens.name = "aws_cesm2_lens"
aws_cesm2_lens_model = "CESM2-LENS"
aws_cesm2_lens_variables = list(
    set(
        aws_cesm2_lens.df.variable.unique().astype(str)
    )
)

cat.add(
    aws_cesm2_lens,
    metadata={"model": aws_cesm2_lens_model, "variable": aws_cesm2_lens_variables}
)

cat

name,model,variable
aws_cesm2_lens,{CESM2-LENS},"{FSNO, TS, V, FSNSC, FSNS, PRECL, TREFMXAV, VNS, SALT, FLNS, Q, PRECSC, VNT, PSL, T, UET, TREFHTMN, PD, SNOW, FLUT, FSNTOA, aice, ICEFRAC, UVEL, RAIN, DOC, nan, hi_d, PRECSL, SOILWATER_10CM, TREFH..."
noaa_global_temp,{NOAAGlobalTemp},{global_temp_anom}


Both of the CMIP6 datastores comprise multiple models. To keep track of which variables are available for which models, we must add an entry in our dataframe catalog for each model. To do this, we'll write a simple function for finding which variables are available for a given model:

In [10]:
def get_variables_for_model(datastore, model):
    """
    Returns a list of unique variables for a given model in a CMIP6 intake-esm datastore
    """
    return list(
        set(
            datastore.df[datastore.df.source_id == model].variable_id.unique().astype(str)
        )
    )
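
To see what this helper returns without downloading a full CMIP6 datastore, the sketch below runs the same logic against a small made-up table. The `SimpleNamespace` stand-in and its rows are assumptions for illustration only; a real intake-esm datastore exposes a much larger `.df` with the same `source_id` and `variable_id` columns.

```python
from types import SimpleNamespace

import pandas as pd


def get_variables_for_model(datastore, model):
    """Return the unique variables for a given model (same logic as above)."""
    return list(
        set(
            datastore.df[datastore.df.source_id == model].variable_id.unique().astype(str)
        )
    )


# Hypothetical stand-in for an intake-esm datastore: only the .df attribute is used
toy_datastore = SimpleNamespace(
    df=pd.DataFrame(
        {
            "source_id": ["CanESM5", "CanESM5", "CanESM5", "GFDL-ESM4"],
            "variable_id": ["tas", "thetao", "tas", "fFire"],
        }
    )
)

# Sort for a stable display order; the helper itself returns an unordered list
canesm5_variables = sorted(get_variables_for_model(toy_datastore, "CanESM5"))
print(canesm5_variables)
```

The duplicate `tas` row is collapsed, and variables belonging to other models are excluded.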

Then we can add the `aws_cmip6` datastore to our dataframe catalog:

In [11]:
aws_cmip6.name = "aws_cmip6"

for model in aws_cmip6.df.source_id.unique():
    variables = get_variables_for_model(aws_cmip6, model)
    cat.add(
        aws_cmip6,
        metadata={"model": model, "variable": variables}
    )

And the same for the `google_cmip6` datastore:

In [12]:
google_cmip6.name = "google_cmip6"

for model in google_cmip6.df.source_id.unique():
    variables = get_variables_for_model(google_cmip6, model)
    cat.add(
        google_cmip6,
        metadata={"model": model, "variable": variables}
    )

Even though we added a separate row for each model in each CMIP6 datastore, displaying the dataframe catalog in a Jupyter environment still shows a convenient summary with only one row per source (this is the `.df_summary` property of `cat`):

In [13]:
cat

name,model,variable
aws_cesm2_lens,{CESM2-LENS},"{FSNO, TS, V, FSNSC, FSNS, PRECL, TREFMXAV, VNS, SALT, FLNS, Q, PRECSC, VNT, PSL, T, UET, TREFHTMN, PD, SNOW, FLUT, FSNTOA, aice, ICEFRAC, UVEL, RAIN, DOC, nan, hi_d, PRECSL, SOILWATER_10CM, TREFH..."
aws_cmip6,"{GISS-E2-2-H, IPSL-CM6A-LR-INCA, INM-CM5-H, E3SM-1-1, NorCPM1, SAM0-UNICON, MPI-ESM1-2-LR, CMCC-ESM2, E3SM-1-1-ECA, NESM3, GISS-E2-1-G, ACCESS-CM2, CMCC-CM2-VHR4, EC-Earth3-AerChem, HadGEM3-GC31-L...","{fsitherm, osaltpmdiff, tasLut, wap, ponos, tntrl, wetnoy, opottempdiff, friver, cl, intpbp, fco2antt, phypico, fsfe, emiaoa, intpoc, limirrdiaz, tossq, cRoot, mmrpm2p5, loadso4, epsi100, hus850, ..."
google_cmip6,"{GISS-E2-2-H, IPSL-CM6A-LR-INCA, INM-CM5-H, E3SM-1-1, NorCPM1, SAM0-UNICON, MPI-ESM1-2-LR, CMCC-ESM2, E3SM-1-1-ECA, NESM3, GISS-E2-1-G, ACCESS-CM2, CMCC-CM2-VHR4, EC-Earth3-AerChem, HadGEM3-GC31-L...","{fsitherm, osaltpmdiff, tasLut, wap, ponos, tntrl, wetnoy, opottempdiff, friver, cl, intpbp, phypico, fsfe, emiaoa, intpoc, limirrdiaz, tossq, cRoot, mmrpm2p5, loadso4, epsi100, hus850, ppmisc, bd..."
noaa_global_temp,{NOAAGlobalTemp},{global_temp_anom}
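
The relationship between the per-model rows and the one-row-per-source summary can be sketched with plain pandas. This is not the library's implementation of `.df_summary`, just an illustration of the idea under the assumption that the summary is the union of metadata values across all rows sharing a source name. The table contents and the `union_of_values` helper are made up for the example.

```python
import pandas as pd

# Hypothetical long-form catalog: one row per (source, model), as added above
rows = pd.DataFrame(
    {
        "name": ["aws_cmip6", "aws_cmip6", "noaa_global_temp"],
        "model": ["CanESM5", "GFDL-ESM4", "NOAAGlobalTemp"],
        "variable": [["tas", "thetao"], ["tas", "fFire"], ["global_temp_anom"]],
    }
)


def union_of_values(series):
    """Collect every value in a column into one set, flattening lists."""
    values = set()
    for item in series:
        values.update(item if isinstance(item, list) else [item])
    return values


# Build one summary entry per source name from all of its rows
summary = {}
for name, group in rows.groupby("name"):
    summary[name] = {
        column: union_of_values(group[column]) for column in ["model", "variable"]
    }

for name, entry in summary.items():
    print(name, entry)
```

Each source ends up with a single entry whose `model` and `variable` sets cover every row added for it, which is the shape of the table displayed above.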


Passing `overwrite=True` to `.add` will overwrite any existing source entries with the same name:

In [14]:
cat.add(
    aws_cesm2_lens,
    metadata={
        "model": aws_cesm2_lens_model,
        "variable": aws_cesm2_lens_variables
    },
    overwrite=True
)

cat

name,model,variable
aws_cesm2_lens,{CESM2-LENS},"{FSNO, TS, V, FSNSC, FSNS, PRECL, TREFMXAV, VNS, SALT, FLNS, Q, PRECSC, VNT, PSL, T, UET, TREFHTMN, PD, SNOW, FLUT, FSNTOA, aice, ICEFRAC, UVEL, RAIN, DOC, nan, hi_d, PRECSL, SOILWATER_10CM, TREFH..."
aws_cmip6,"{GISS-E2-2-H, IPSL-CM6A-LR-INCA, INM-CM5-H, E3SM-1-1, NorCPM1, SAM0-UNICON, MPI-ESM1-2-LR, CMCC-ESM2, E3SM-1-1-ECA, NESM3, GISS-E2-1-G, ACCESS-CM2, CMCC-CM2-VHR4, EC-Earth3-AerChem, HadGEM3-GC31-L...","{fsitherm, osaltpmdiff, tasLut, wap, ponos, tntrl, wetnoy, opottempdiff, friver, cl, intpbp, fco2antt, phypico, fsfe, emiaoa, intpoc, limirrdiaz, tossq, cRoot, mmrpm2p5, loadso4, epsi100, hus850, ..."
google_cmip6,"{GISS-E2-2-H, IPSL-CM6A-LR-INCA, INM-CM5-H, E3SM-1-1, NorCPM1, SAM0-UNICON, MPI-ESM1-2-LR, CMCC-ESM2, E3SM-1-1-ECA, NESM3, GISS-E2-1-G, ACCESS-CM2, CMCC-CM2-VHR4, EC-Earth3-AerChem, HadGEM3-GC31-L...","{fsitherm, osaltpmdiff, tasLut, wap, ponos, tntrl, wetnoy, opottempdiff, friver, cl, intpbp, phypico, fsfe, emiaoa, intpoc, limirrdiaz, tossq, cRoot, mmrpm2p5, loadso4, epsi100, hus850, ppmisc, bd..."
noaa_global_temp,{NOAAGlobalTemp},{global_temp_anom}


## Saving a dataframe catalog

Once we're happy with the sources we have in our dataframe catalog, we can save it using the `.save` method:

In [15]:
cat.save()

## Loading a dataframe catalog

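When reading existing catalogs, it's good practice to use `mode="r"` (the default) to avoid accidentally overwriting the catalog. Here we also pass `columns_with_iterables` to indicate that the `variable` column of the saved file holds iterables (lists of variables) rather than single values: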

In [16]:
cat = intake.open_df_catalog(
    path="./example_catalog.csv",
    columns_with_iterables=["variable"],
)
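
To illustrate why a list-valued column needs special handling when a catalog is read back from CSV, here is a minimal pandas-only sketch. It does not use intake-dataframe-catalog, and parsing with `ast.literal_eval` is an assumption made for the example, not a statement of how the library handles `columns_with_iterables` internally.

```python
import ast
import io

import pandas as pd

# A list-valued column is written to CSV as its string representation
original = pd.DataFrame(
    {"name": ["noaa_global_temp"], "variable": [["global_temp_anom"]]}
)
buffer = io.StringIO()
original.to_csv(buffer, index=False)

# Reading it back naively yields a string, not a list
buffer.seek(0)
as_text = pd.read_csv(buffer)

# Parsing the column explicitly recovers the original list
buffer.seek(0)
as_lists = pd.read_csv(buffer, converters={"variable": ast.literal_eval})

print(repr(as_text.loc[0, "variable"]))
print(repr(as_lists.loc[0, "variable"]))
```

Without the explicit conversion, a search for a single variable would be comparing against the text of the whole list rather than its elements.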

## Searching in a dataframe catalog

We can use the `.search` method to find sources that satisfy metadata queries:

In [17]:
new_cat = cat.search(model="CanESM5")

new_cat

name,model,variable
aws_cmip6,{CanESM5},"{osaltpmdiff, zos, epcalc100, tasmax, tosga, nbp, epfz, wap, osaltdiff, rsntds, opottempdiff, sithick, lai, cl, rlds, opottemppmdiff, intpoc, grassFrac, cRoot, treeFrac, mlotst, osaltrmadvect, evs..."
google_cmip6,{CanESM5},"{osaltpmdiff, zos, epcalc100, tasmax, tosga, epfz, wap, osaltdiff, nbp, rsntds, opottempdiff, sithick, lai, cl, rlds, opottemppmdiff, intpoc, grassFrac, cRoot, treeFrac, mlotst, osaltrmadvect, evs..."


We can combine queries for more complex searches:

In [18]:
new_cat = cat.search(model="CanESM5", variable=["thetao", "msftmzmpa"])

new_cat

name,model,variable
aws_cmip6,{CanESM5},"{msftmzmpa, thetao}"
google_cmip6,{CanESM5},{thetao}


By default, querying on a list as above returns sources that match any of the values in the list. The `.search` method also has an optional `require_all` argument. If this is set to `True`, only sources that match all of the values in the list are returned:

In [19]:
new_cat = cat.search(model="CanESM5", variable=["thetao", "msftmzmpa"], require_all=True)

new_cat

name,model,variable
aws_cmip6,{CanESM5},"{msftmzmpa, thetao}"
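
The difference between the default behaviour and `require_all=True` comes down to "any" versus "all" matching. The sketch below shows that distinction in plain Python. The `matches` helper and the shortened variable lists are hypothetical and only mirror the results shown above; they are not part of intake-dataframe-catalog.

```python
def matches(available, wanted, require_all=False):
    """Return True if a source with `available` variables satisfies the query."""
    available = set(available)
    if require_all:
        # Every requested value must be present
        return all(value in available for value in wanted)
    # At least one requested value must be present
    return any(value in available for value in wanted)


# Shortened, illustrative variable lists for CanESM5 in each datastore
aws_canesm5 = ["msftmzmpa", "thetao", "tas"]
google_canesm5 = ["thetao", "tas"]
query = ["thetao", "msftmzmpa"]

for label, variables in [("aws_cmip6", aws_canesm5), ("google_cmip6", google_canesm5)]:
    print(
        label,
        "any:", matches(variables, query),
        "all:", matches(variables, query, require_all=True),
    )
```

Both sources satisfy the "any" query, but only the one holding both `thetao` and `msftmzmpa` satisfies the "all" query.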


Regular expressions can also be used in queries. For example, below we search for sources with variables containing the word "Fire". We can see that only one model (GFDL-ESM4) in each of the CMIP6 datastores contains variables matching this criterion:

In [20]:
new_cat = cat.search(variable=".*Fire.*")

new_cat

name,model,variable
aws_cmip6,{GFDL-ESM4},"{fFire, fFireNat}"
google_cmip6,{GFDL-ESM4},"{fFire, fFireNat}"
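
To check what a pattern such as `.*Fire.*` selects before running a search, it can be tried against a few variable names with Python's `re` module. This is only a stand-in: the `matching_variables` helper and the variable lists are made up, and the exact matching rules applied by `.search` may differ from `re.search`.

```python
import re


def matching_variables(variables, pattern):
    """Return the variables that match the regular expression `pattern`."""
    regex = re.compile(pattern)
    return [variable for variable in variables if regex.search(variable)]


# Illustrative variable lists for two models
gfdl_esm4 = ["fFire", "fFireNat", "tas"]
canesm5 = ["tas", "thetao"]

print(matching_variables(gfdl_esm4, ".*Fire.*"))
print(matching_variables(canesm5, ".*Fire.*"))
```

Note that the match is case-sensitive, so a variable named `fire` would not be selected by this pattern.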


## Loading sources

There are a few options for loading sources. We can load an individual source by indexing the catalog with its name:

In [21]:
cat["aws_cesm2_lens"] # This is the aws_cesm2_lens intake-esm datastore

Unnamed: 0,unique
variable,53
long_name,51
component,4
experiment,2
forcing_variant,2
frequency,3
vertical_levels,3
spatial_domain,3
units,20
start_time,4


Or (if the source name comprises only letters, numbers and underscores):

In [22]:
cat.aws_cesm2_lens

Unnamed: 0,unique
variable,53
long_name,51
component,4
experiment,2
forcing_variant,2
frequency,3
vertical_levels,3
spatial_domain,3
units,20
start_time,4


Alternatively, there are `.to_source` and `.to_source_dict` methods. The former only works when there is only one source remaining in the dataframe catalog (e.g. after performing `.search` operations). The latter loads all sources into a dictionary with the corresponding source names as keys:

In [23]:
source = cat.search(variable="TEMP").to_source()

source

Unnamed: 0,unique
variable,53
long_name,51
component,4
experiment,2
forcing_variant,2
frequency,3
vertical_levels,3
spatial_domain,3
units,20
start_time,4


In [24]:
source_dict = cat.to_source_dict()

source_dict

{'aws_cesm2_lens': <aws-cesm2-le catalog with 40 dataset(s) from 322 asset(s)>,
 'aws_cmip6': <pangeo-cmip6 catalog with 7780 dataset(s) from 522217 asset(s)>,
 'google_cmip6': <pangeo-cmip6 catalog with 7674 dataset(s) from 514818 asset(s)>,
 'noaa_global_temp': sources:
   csv:
     args:
       csv_kwargs:
         skiprows: 4
       urlpath: https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series/globe/land_ocean/1/4/1850-2023/data.csv
     description: ''
     driver: intake.source.csv.CSVSource
     metadata:
       catalog_dir: ''}

Once sources are loaded, we can access data in the normal way for that intake source type. For example, see the [intake-esm documentation](https://intake-esm.readthedocs.io/en/latest/index.html) for how to use intake-esm datastores like the ones we've been using in this demonstration.