# Using and understanding Catalogs

<div class="alert alert-info"> <b>NOTE:</b> Catalogs in `xscen` are built upon Datastores in `intake_esm`. For more information on basic usage, such as the `search()` function, please consult their documentation: <a href="https://intake-esm.readthedocs.io/en/stable/">https://intake-esm.readthedocs.io/en/stable/</a>.</div>

Catalogs are made of two files:

- JSON file containing metadata such as the catalog's title, description etc. It also contains an attribute *catalog_file* that points towards the CSV.
- CSV file containing the catalog itself. This file can be zipped.
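
As a rough illustration of this two-file layout, the sketch below writes a toy JSON/CSV pair to a temporary directory and follows `catalog_file` from one to the other using only the standard library. The file names and column values are made up for the example, and real catalogs should be opened with `DataCatalog` rather than parsed by hand.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical file names and content, for illustration only.
tmp = Path(tempfile.mkdtemp())
csv_path = tmp / "toy-catalog.csv"
json_path = tmp / "toy-catalog.json"

# The CSV holds the catalog itself: one row per file.
with csv_path.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "variable", "path"])
    writer.writeheader()
    writer.writerow({"id": "toy_dataset", "variable": "tas", "path": "/data/tas.nc"})

# The JSON holds the metadata and points towards the CSV.
json_path.write_text(
    json.dumps(
        {
            "title": "toy-catalog",
            "description": "Toy example.",
            "catalog_file": str(csv_path),
        }
    )
)

# Reading goes the other way: open the JSON, then follow catalog_file.
metadata = json.loads(json_path.read_text())
with open(metadata["catalog_file"], newline="") as f:
    rows = list(csv.DictReader(f))

print(metadata["title"], "->", len(rows), "row(s)")
```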

Two types of catalogs have been implemented in `xscen`.

- __Static catalogs:__ A `DataCatalog` is a *read-only* `intake-esm` catalog that contains information on all available data. Usually, this type of catalog should only be consulted at the start of a new project.

- __Updatable catalogs:__ A `ProjectCatalog` is a *DataCatalog* with additional *write* functionalities. This kind of catalog should be used to keep track of the new data created during the course of a project, such as regridded or bias-corrected data, since it can `update` itself and append new information to the associated CSV file.

__NOTE:__ To avoid accidental data loss, neither catalog currently has a function to remove data from the CSV file. However, upon initialisation and when updating or refreshing itself, a catalog validates that all entries still exist and, if files have been manually removed, deletes their entries from the catalog.
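
The validation step can be pictured with the following minimal sketch, which drops entries whose file no longer exists. It assumes local paths only and uses hypothetical entries; it is not the code `xscen` runs, and remote paths such as `gs://` URLs would need different handling.

```python
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())
kept = tmp / "kept.nc"
kept.touch()  # this file exists on disk
removed = tmp / "removed.nc"  # never created, as if it had been deleted by hand

# Hypothetical catalog entries, reduced to two columns.
entries = [
    {"id": "a", "path": str(kept)},
    {"id": "b", "path": str(removed)},
]

# Keep only the entries whose file is still there.
valid_entries = [entry for entry in entries if Path(entry["path"]).exists()]
print([entry["id"] for entry in valid_entries])
```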

Catalogs in `xscen` are made to follow a nomenclature that is as close as possible to the Python Earth Science Standard Vocabulary: [https://github.com/ES-DOC/pyessv](https://github.com/ES-DOC/pyessv). The columns are listed below; for more details and concrete examples of the entries, consult [the relevant page in the documentation](../columns.rst):

| Column name | Description |
| :- | :- |
| id | Unique DatasetID generated by `xscen` based on a subset of columns. |
| type | Type of data: [forecast, station-obs, gridded-obs, reconstruction, simulation] |
| processing_level | Level of post-processing reached: [raw, extracted, regridded, biasadjusted] |
| bias_adjust_institution | Institution that computed the bias adjustment. |
| bias_adjust_project | Name of the project that computed the bias adjustment. |
| mip_era | CMIP Generation associated with the data. |
| activity | Model Intercomparison Project (MIP) associated with the data. |
| driving_institution | Institution of the driver model. |
| driving_model | Name of the driver. |
| institution | Institution associated with the source. |
| source | Name of the model or the dataset. |
| experiment | Name of the experiment of the model. |
| member | Name of the realisation (or of the driving realisation in the case of RCMs). |
| xrfreq | Pandas/xarray frequency. |
| frequency | Frequency in letters (CMIP6 format). |
| variable | Variable(s) in the dataset. |
| domain | Name of the region covered by the dataset. |
| date_start | First date of the dataset. |
| date_end | Last date of the dataset. |
| version | Version of the dataset. |
| format | Format of the dataset. |
| path | Path to the dataset. |

Individual projects may use a different set of columns, but those will always be present in the official Ouranos internal catalogs. Some parts of `xscen` will however expect certain column names, so diverging from the official list is to be done with care.

## Basic Catalog Usage

If an official catalog already exists, it should be opened using `xs.DataCatalog` by pointing it to the JSON file:

In [1]:
from xscen import DataCatalog, ProjectCatalog
from pathlib import Path

DC = DataCatalog(f"{Path().absolute()}/samples/pangeo-cmip6.json")

DC

Unnamed: 0,unique
activity,2
institution,31
source,58
experiment,8
member,198
frequency,3
xrfreq,3
variable,50
domain,4
path,34343


The content of the catalog can be accessed by a call to `df`, which will return a `pandas.DataFrame`.

In [2]:
# Access the catalog
DC.df

Unnamed: 0,activity,institution,source,experiment,member,frequency,xrfreq,variable,domain,path,date_start,date_end,version,id,processing_level,format,mip_era
0,CMIP,NOAA-GFDL,GFDL-CM4,historical,r1i1p1f1,3hr,3H,"(rlus,)",gr1,gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/histo...,1985-01-01 00:00,2014-12-31 00:00,20180701,CMIP_NOAA-GFDL_GFDL-CM4_historical_r1i1p1f1_gr1,raw,zarr,CMIP6
1,CMIP,NOAA-GFDL,GFDL-CM4,historical,r1i1p1f1,3hr,3H,"(rsdsdiff,)",gr2,gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/histo...,1985-01-01 00:00,2014-12-31 00:00,20180701,CMIP_NOAA-GFDL_GFDL-CM4_historical_r1i1p1f1_gr2,raw,zarr,CMIP6
2,CMIP,NOAA-GFDL,GFDL-CM4,historical,r1i1p1f1,3hr,3H,"(tos,)",gr1,gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/histo...,1985-01-01 00:00,2014-12-31 00:00,20180701,CMIP_NOAA-GFDL_GFDL-CM4_historical_r1i1p1f1_gr1,raw,zarr,CMIP6
3,CMIP,NOAA-GFDL,GFDL-CM4,historical,r1i1p1f1,3hr,3H,"(vas,)",gr2,gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/histo...,1985-01-01 00:00,2014-12-31 00:00,20180701,CMIP_NOAA-GFDL_GFDL-CM4_historical_r1i1p1f1_gr2,raw,zarr,CMIP6
4,CMIP,NOAA-GFDL,GFDL-CM4,historical,r1i1p1f1,3hr,3H,"(vas,)",gr1,gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/histo...,1985-01-01 00:00,2014-12-31 00:00,20180701,CMIP_NOAA-GFDL_GFDL-CM4_historical_r1i1p1f1_gr1,raw,zarr,CMIP6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34338,ScenarioMIP,E3SM-Project,E3SM-1-1,ssp585,r1i1p1f1,fx,fx,"(orog,)",gr,gs://cmip6/CMIP6/ScenarioMIP/E3SM-Project/E3SM...,2015-01-01 00:00,2100-12-31 00:00,20201117,ScenarioMIP_E3SM-Project_E3SM-1-1_ssp585_r1i1p...,raw,zarr,CMIP6
34339,ScenarioMIP,EC-Earth-Consortium,EC-Earth3-Veg-LR,ssp585,r3i1p1f1,fx,fx,"(areacella,)",gr,gs://cmip6/CMIP6/ScenarioMIP/EC-Earth-Consorti...,2015-01-01 00:00,2100-12-31 00:00,20201201,ScenarioMIP_EC-Earth-Consortium_EC-Earth3-Veg-...,raw,zarr,CMIP6
34340,ScenarioMIP,EC-Earth-Consortium,EC-Earth3-Veg-LR,ssp585,r2i1p1f1,fx,fx,"(areacella,)",gr,gs://cmip6/CMIP6/ScenarioMIP/EC-Earth-Consorti...,2015-01-01 00:00,2100-12-31 00:00,20201201,ScenarioMIP_EC-Earth-Consortium_EC-Earth3-Veg-...,raw,zarr,CMIP6
34341,ScenarioMIP,EC-Earth-Consortium,EC-Earth3-Veg-LR,ssp585,r1i1p1f1,fx,fx,"(areacella,)",gr,gs://cmip6/CMIP6/ScenarioMIP/EC-Earth-Consorti...,2015-01-01 00:00,2100-12-31 00:00,20201201,ScenarioMIP_EC-Earth-Consortium_EC-Earth3-Veg-...,raw,zarr,CMIP6


The `unique` function lists the unique elements of either the whole catalog or a subset of columns. It can be called in a few different ways, listed below:

In [3]:
# List all unique elements in the catalog, returns a pandas.Series
DC.unique()

activity                                          [CMIP, ScenarioMIP]
institution         [NOAA-GFDL, IPSL, CNRM-CERFACS, NASA-GISS, BCC...
source              [GFDL-CM4, IPSL-CM6A-LR, CNRM-CM6-1, GISS-E2-1...
experiment          [historical, ssp119, ssp126, ssp245, ssp370, s...
member              [r1i1p1f1, r9i1p1f1, r30i1p1f1, r8i1p1f1, r31i...
frequency                                              [3hr, day, fx]
xrfreq                                                    [3H, D, fx]
variable            [rlus, rsdsdiff, tos, vas, uas, tslsi, tas, rs...
domain                                             [gr1, gr2, gr, gn]
path                [gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/hist...
date_start                       [1985-01-01 00:00, 2015-01-01 00:00]
date_end                         [2014-12-31 00:00, 2100-12-31 00:00]
version             [20180701, 20180803, 20180917, 20181015, 20181...
id                  [CMIP_NOAA-GFDL_GFDL-CM4_historical_r1i1p1f1_g...
processing_level    

In [4]:
# List all unique elements in a subset of columns, returns a pandas.Series
DC.unique(["variable", "frequency"])

variable     [rlus, rsdsdiff, tos, vas, uas, tslsi, tas, rs...
frequency                                       [3hr, day, fx]
dtype: object

In [5]:
# List all unique elements in a single column, returns a list
DC.unique("id")[0:5]

['CMIP_NOAA-GFDL_GFDL-CM4_historical_r1i1p1f1_gr1',
 'CMIP_NOAA-GFDL_GFDL-CM4_historical_r1i1p1f1_gr2',
 'CMIP_IPSL_IPSL-CM6A-LR_historical_r9i1p1f1_gr',
 'CMIP_IPSL_IPSL-CM6A-LR_historical_r30i1p1f1_gr',
 'CMIP_IPSL_IPSL-CM6A-LR_historical_r8i1p1f1_gr']

### Basic .search() commands

The `search` function comes from `intake-esm` and allows searching for specific elements in the catalog's columns. It accepts both wildcards and regular expressions (except for *variable*, which must be exact due to being in *tuples*).

While regex isn't great at inverse matching ("does not contain"), it is possible. Here are a few useful commands:

    - ^string            : Starts with string

    - string$            : Ends with string

    - ^(?!string).*$     : Does not start with string

    - .*(?<!string)$     : Does not end with string

    - ^((?!string).)*$   : Does not contain substring

    - ^(?!string$).*$    : Is not that exact string

This website can be used to test regex commands: https://regex101.com/
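
The patterns above can also be tried locally with Python's built-in `re` module before being passed to `search`. The sketch below applies each of them to a short, made-up list of experiment names using `re.search`; how `intake-esm` applies a pattern internally is not shown here, so treat this as a way to check the regex itself.

```python
import re

# Made-up list of values to test against.
experiments = ["historical", "ssp126", "ssp245", "ssp585"]

patterns = {
    "starts with ssp": r"^ssp",
    "ends with 5": r"5$",
    "does not start with ssp": r"^(?!ssp).*$",
    "does not end with 5": r".*(?<!5)$",
    "does not contain 24": r"^((?!24).)*$",
    "is not exactly ssp245": r"^(?!ssp245$).*$",
}

# For each pattern, keep the values for which a match is found.
matches = {
    label: [e for e in experiments if re.search(pattern, e)]
    for label, pattern in patterns.items()
}

for label, found in matches.items():
    print(f"{label:<25} {found}")
```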

In [6]:
# Regex: Find all entries that start with "ssp"
print(DC.search(experiment="^ssp").unique("experiment"))

['ssp119', 'ssp126', 'ssp245', 'ssp370', 'ssp585', 'ssp434', 'ssp460']


In [7]:
# Regex: Exclude all entries that start with "ssp"
print(DC.search(experiment="^(?!ssp).*$").unique("experiment"))

['historical']


In [8]:
# Regex: Find all experiments except the exact string "ssp245"
print(DC.search(experiment="^(?!ssp245$).*$").unique("experiment"))

['historical', 'ssp119', 'ssp126', 'ssp370', 'ssp585', 'ssp434', 'ssp460']


In [9]:
# Wildcard: Find all entries that start with NorESM2
print(DC.search(source="NorESM2.*").unique("source"))

['NorESM2-LM', 'NorESM2-MM']


### Advanced search: xs.search_data_catalogs

`search` has multiple notable limitations for more advanced searches:

- It can't match specific criteria together, such as finding a dataset that would have both 3h precipitation and daily temperature.
- It has no explicit understanding of climate datasets, and thus can't match historical and future simulations together or know how realization members or grid resolutions work.

`xs.search_data_catalogs` was thus created as a more advanced version that is closer to the needs of climate services. It also plays the double role of preparing certain arguments for the extraction function.

Due to how different reference datasets are from climate simulations, this function might have to be called multiple times and the results concatenated into a single dictionary. The main arguments are:

- `variables_and_freqs` is used to indicate which variable and which frequency is required. <b> NOTE:</b> With the exception of fixed fields, where *'fx'* should be used, frequencies here use the `pandas` nomenclature ('D', 'H', '6H', 'MS', etc.).
- `other_search_criteria` is used to search for specific entries in other columns of the catalog, such as *activity*.
- `exclusions` is used to exclude certain simulations or keywords from the results.
- `match_hist_and_fut` is used to indicate that RCP/SSP simulations should be matched with their *historical* counterparts.
- `periods` is used to search for specific time periods.
- `allow_resampling` is used to allow searching for data at higher frequencies than requested.
- `allow_conversion` is used to allow searching for calculable variables, in the case where the requested variable would not be available.
- `restrict_resolution` is used to limit the results to the finest or coarsest resolution available for each source.
- `restrict_members` is used to limit the results to a maximum number of realizations for each source.

Note that compared to `search`, the result of `search_data_catalogs` is a dictionary with one entry per unique ID. A given unique ID might contain multiple datasets as per `intake-esm`'s definition, because it groups catalog lines per *id - domain - processing_level - xrfreq*. Thus, it separates model data that exists at different frequencies.
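
To illustrate why a single ID can hold several datasets, the sketch below groups a few toy catalog lines on the four columns named above. The rows and the `.` separator used to build the keys are assumptions made for the example; the grouping itself is done by `intake-esm` in practice.

```python
from collections import defaultdict

# Toy catalog lines (hypothetical values), reduced to the grouping columns.
rows = [
    {"id": "sim_A", "domain": "gr1", "processing_level": "raw", "xrfreq": "D", "variable": "tas"},
    {"id": "sim_A", "domain": "gr1", "processing_level": "raw", "xrfreq": "3H", "variable": "pr"},
    {"id": "sim_A", "domain": "gr1", "processing_level": "raw", "xrfreq": "fx", "variable": "sftlf"},
    {"id": "sim_A", "domain": "gr1", "processing_level": "raw", "xrfreq": "D", "variable": "tasmax"},
]

group_keys = ["id", "domain", "processing_level", "xrfreq"]

# Lines that share the same four values end up in the same dataset.
datasets = defaultdict(list)
for row in rows:
    key = ".".join(row[k] for k in group_keys)
    datasets[key].append(row["variable"])

for key, variables in datasets.items():
    print(key, variables)
```

Here the four lines of one ID yield three datasets, one per frequency.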


#### Example 1: Simple dataset

Let's start by searching for CMIP6 data that has subdaily precipitation, daily temperature and the land fraction data. The main difference compared to searching for reference datasets is that in most cases, `match_hist_and_fut` will be required to match *historical* simulations to their future counterparts. This works for both CMIP5 and CMIP6 nomenclatures.

In [10]:
import xscen as xs

variables_and_freqs = {"tas": "D", "pr": "3H", "sftlf": "fx"}
other_search_criteria = {"institution": ["NOAA-GFDL"],
                         "experiment": ["ssp585"]}

cat_sim = xs.search_data_catalogs(data_catalogs=[f"{Path().absolute()}/samples/pangeo-cmip6.json"],
                                  variables_and_freqs=variables_and_freqs,
                                  other_search_criteria=other_search_criteria,
                                  match_hist_and_fut=True,
                                 )

cat_sim

2022-11-18 15:09:44,689 - xscen.extract - INFO - Catalog opened: <pangeo-cmip6 catalog with 3156 dataset(s) from 34343 asset(s)> from 1 files.
2022-11-18 15:09:44,690 - xscen.extract - INFO - Dispatching historical dataset to future experiments.
2022-11-18 15:09:51,733 - xscen.extract - INFO - 239 assets matched the criteria : {'institution': ['NOAA-GFDL'], 'experiment': ['ssp585']}.
2022-11-18 15:09:51,849 - xscen.extract - INFO - Iterating over 4 potential datasets.
2022-11-18 15:09:52,377 - xscen.extract - INFO - Found 2 with all variables requested and corresponding to the criteria.


{'ScenarioMIP_NOAA-GFDL_GFDL-CM4_ssp585_r1i1p1f1_gr1': <pangeo-cmip6 catalog with 3 dataset(s) from 5 asset(s)>,
 'ScenarioMIP_NOAA-GFDL_GFDL-CM4_ssp585_r1i1p1f1_gr2': <pangeo-cmip6 catalog with 3 dataset(s) from 5 asset(s)>}

Two simulations correspond to the search criteria, but as can be seen from the results, it is the same simulation on 2 different grids (`gr1` and `gr2`). If desired, `restrict_resolution` can be called to choose the finest or coarsest grid in such cases.

In [11]:
import xscen as xs

variables_and_freqs = {"tas": "D", "pr": "3H", "sftlf": "fx"}
other_search_criteria = {"institution": ["NOAA-GFDL"],
                         "experiment": ["ssp585"]}

cat_sim = xs.search_data_catalogs(data_catalogs=[f"{Path().absolute()}/samples/pangeo-cmip6.json"],
                                  variables_and_freqs=variables_and_freqs,
                                  other_search_criteria=other_search_criteria,
                                  match_hist_and_fut=True,
                                  restrict_resolution="finest"
                                 )

cat_sim

2022-11-18 15:09:58,215 - xscen.extract - INFO - Catalog opened: <pangeo-cmip6 catalog with 3156 dataset(s) from 34343 asset(s)> from 1 files.
2022-11-18 15:09:58,216 - xscen.extract - INFO - Dispatching historical dataset to future experiments.
2022-11-18 15:10:05,168 - xscen.extract - INFO - 239 assets matched the criteria : {'institution': ['NOAA-GFDL'], 'experiment': ['ssp585']}.
2022-11-18 15:10:05,287 - xscen.extract - INFO - Iterating over 4 potential datasets.
2022-11-18 15:10:05,835 - xscen.extract - INFO - Found 2 with all variables requested and corresponding to the criteria.
2022-11-18 15:10:05,849 - xscen.extract - INFO - Dataset GFDL-CM4_ScenarioMIP_CMIP6_NOAA-GFDL_ssp585_r1i1p1f1 appears to have multiple resolutions.
2022-11-18 15:10:05,853 - xscen.extract - INFO - Removing ScenarioMIP_NOAA-GFDL_GFDL-CM4_ssp585_r1i1p1f1_gr2 from the results.


{'ScenarioMIP_NOAA-GFDL_GFDL-CM4_ssp585_r1i1p1f1_gr1': <pangeo-cmip6 catalog with 3 dataset(s) from 5 asset(s)>}

If required, a dataset can be examined in more detail at this stage. Looking at the results (in particular the 'date_start' and 'date_end' columns), we can see that it successfully found historical simulations in the *CMIP* activity and renamed both their *activity* and *experiment* to match the future simulations.

In [12]:
cat_sim['ScenarioMIP_NOAA-GFDL_GFDL-CM4_ssp585_r1i1p1f1_gr1'].df

Unnamed: 0,activity,institution,source,experiment,member,frequency,xrfreq,variable,domain,path,date_start,date_end,version,id,processing_level,format,mip_era
0,ScenarioMIP,NOAA-GFDL,GFDL-CM4,ssp585,r1i1p1f1,day,D,"(tas,)",gr1,gs://cmip6/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-CM...,2015-01-01 00:00,2100-12-31 00:00,20180701,ScenarioMIP_NOAA-GFDL_GFDL-CM4_ssp585_r1i1p1f1...,raw,zarr,CMIP6
1,ScenarioMIP,NOAA-GFDL,GFDL-CM4,ssp585,r1i1p1f1,day,D,"(tas,)",gr1,gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/histo...,1985-01-01 00:00,2014-12-31 00:00,20180701,ScenarioMIP_NOAA-GFDL_GFDL-CM4_ssp585_r1i1p1f1...,raw,zarr,CMIP6
2,ScenarioMIP,NOAA-GFDL,GFDL-CM4,ssp585,r1i1p1f1,3hr,3H,"(pr,)",gr1,gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/histo...,1985-01-01 00:00,2014-12-31 00:00,20180701,ScenarioMIP_NOAA-GFDL_GFDL-CM4_ssp585_r1i1p1f1...,raw,zarr,CMIP6
3,ScenarioMIP,NOAA-GFDL,GFDL-CM4,ssp585,r1i1p1f1,fx,fx,"(sftlf,)",gr1,gs://cmip6/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-CM...,2015-01-01 00:00,2100-12-31 00:00,20180701,ScenarioMIP_NOAA-GFDL_GFDL-CM4_ssp585_r1i1p1f1...,raw,zarr,CMIP6
4,ScenarioMIP,NOAA-GFDL,GFDL-CM4,ssp585,r1i1p1f1,fx,fx,"(sftlf,)",gr1,gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/histo...,1985-01-01 00:00,2014-12-31 00:00,20180701,ScenarioMIP_NOAA-GFDL_GFDL-CM4_ssp585_r1i1p1f1...,raw,zarr,CMIP6


#### Example 2: Advanced search

`allow_resampling` and `allow_conversion` are powerful search tools to find data that doesn't explicitly exist in the catalog, but that can easily be computed.

In [13]:
cat_sim_adv = xs.search_data_catalogs(data_catalogs=[f"{Path().absolute()}/samples/pangeo-cmip6.json"],
                                      variables_and_freqs={"evspsblpot": "D", "tas": "YS"},
                                      other_search_criteria={"source": ["NorESM2-MM"],
                                                             "processing_level": ["raw"]},
                                      match_hist_and_fut=True,
                                      allow_resampling=True,
                                      allow_conversion=True
                                     )
cat_sim_adv

2022-11-18 15:10:14,617 - xscen.extract - INFO - Catalog opened: <pangeo-cmip6 catalog with 3156 dataset(s) from 34343 asset(s)> from 1 files.
2022-11-18 15:10:14,618 - xscen.extract - INFO - Dispatching historical dataset to future experiments.
2022-11-18 15:10:22,618 - xscen.extract - INFO - 221 assets matched the criteria : {'source': ['NorESM2-MM'], 'processing_level': ['raw']}.
2022-11-18 15:10:22,739 - xscen.extract - INFO - Iterating over 5 potential datasets.
2022-11-18 15:10:23,770 - xscen.extract - INFO - Found 5 with all variables requested and corresponding to the criteria.


{'ScenarioMIP_NCC_NorESM2-MM_ssp126_r1i1p1f1_gn': <pangeo-cmip6 catalog with 1 dataset(s) from 6 asset(s)>,
 'ScenarioMIP_NCC_NorESM2-MM_ssp245_r1i1p1f1_gn': <pangeo-cmip6 catalog with 1 dataset(s) from 6 asset(s)>,
 'ScenarioMIP_NCC_NorESM2-MM_ssp245_r2i1p1f1_gn': <pangeo-cmip6 catalog with 1 dataset(s) from 6 asset(s)>,
 'ScenarioMIP_NCC_NorESM2-MM_ssp370_r1i1p1f1_gn': <pangeo-cmip6 catalog with 1 dataset(s) from 6 asset(s)>,
 'ScenarioMIP_NCC_NorESM2-MM_ssp585_r1i1p1f1_gn': <pangeo-cmip6 catalog with 1 dataset(s) from 6 asset(s)>}

If we examine the SSP5-8.5 results, we'll see that while it failed to find *evspsblpot*, it successfully understood that *tasmin* and *tasmax* can be used to compute it. It also understood that daily *tas* is a valid search result for `{tas: YS}`, since it can be aggregated.
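
To make the `{tas: YS}` case concrete, here is a plain-Python sketch of what aggregating daily values to a yearly series labelled by year start amounts to. The data is synthetic and a simple mean is assumed; this is not `xscen`'s implementation, which is not shown in this tutorial.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean

# Synthetic daily "tas" values (made up): 10.0 throughout 2001, 12.0 throughout 2002.
start = date(2001, 1, 1)
daily = {}
for offset in range(730):  # 2001 and 2002 both have 365 days
    day = start + timedelta(days=offset)
    daily[day] = 10.0 if day.year == 2001 else 12.0

# Group the daily values under the first day of their year ("YS" = year start).
by_year_start = defaultdict(list)
for day, value in daily.items():
    by_year_start[date(day.year, 1, 1)].append(value)

# One aggregated value per year.
yearly = {label: mean(values) for label, values in by_year_start.items()}
print(yearly)
```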

In [14]:
cat_sim_adv['ScenarioMIP_NCC_NorESM2-MM_ssp585_r1i1p1f1_gn'].unique()

activity                                                [ScenarioMIP]
institution                                                     [NCC]
source                                                   [NorESM2-MM]
experiment                                                   [ssp585]
member                                                     [r1i1p1f1]
frequency                                                       [day]
xrfreq                                                            [D]
variable                                        [tasmax, tasmin, tas]
domain                                                           [gn]
path                [gs://cmip6/CMIP6/ScenarioMIP/NCC/NorESM2-MM/s...
date_start                       [2015-01-01 00:00, 1985-01-01 00:00]
date_end                         [2100-12-31 00:00, 2014-12-31 00:00]
version                                                    [20191108]
id                    [ScenarioMIP_NCC_NorESM2-MM_ssp585_r1i1p1f1_gn]
processing_level    

#### Derived variables

The `allow_conversion` argument is built upon `xclim`'s virtual indicators module and `intake-esm`'s [DerivedVariableRegistry](https://ncar.github.io/esds/posts/2021/intake-esm-derived-variables/) in a way that should be seamless to the user. It works by using the methods defined in `xscen/xclim_modules/conversions.yml` to add a registry of *derived* variables that exist virtually through computation methods.

In the example above, we can see that the search failed to find *evspsblpot* within *NorESM2-MM*, but understood that *tasmin* and *tasmax* could be used to estimate it using `xclim`'s `potential_evapotranspiration`.

Most use cases should already be covered by the aforementioned file. The preferred way to add new methods is to [submit a new indicator to xclim](https://xclim.readthedocs.io/en/stable/contributing.html), and then to add a call to that indicator in `conversions.yml`. In the case where this is not possible or where the transformation would be out of scope for `xclim`, the calculation can be implemented into `xscen/xclim_modules/conversions.py` instead.

Alternatively, if other functions or other parameters are required for a specific use case (e.g. using `relative_humidity` instead of `relative_humidity_from_dewpoint`, or using a different formula), then a custom YAML file can be used. This custom file can be referred to using the `conversion_yaml` argument of `search_data_catalogs`.

`.derivedcat` can be called on a catalog to obtain the list of `DerivedVariable` entries and the functions associated with them. In addition, `._requested_variables` will display the list of variables that will be opened by the `to_dataset_dict()` function, including *DerivedVariables*.

**NOTE:** `_requested_variables` should NOT be modified under any circumstance, as it is likely to make `to_dataset_dict()` fail. To add some transparency on which variables have been *requested* and which are the *dependent* ones, `xscen` has added `_requested_variables_true` and `_dependent_variables`. This is very likely to be changed in the future.

In [15]:
cat_sim_adv['ScenarioMIP_NCC_NorESM2-MM_ssp585_r1i1p1f1_gn'].derivedcat

DerivedVariableRegistry({'evspsblpot': DerivedVariable(func=functools.partial(<function _derived_func.<locals>.func at 0x7f0dc92a0c10>, ind=<xclim.indicators.conversions.POTENTIAL_EVAPOTRANSPIRATION object at 0x7f0dc8dba490>, nout=0), variable='evspsblpot', query={'variable': ['tasmin', 'tasmax']}, prefer_derived=False)})

In [16]:
print(cat_sim_adv['ScenarioMIP_NCC_NorESM2-MM_ssp585_r1i1p1f1_gn']._requested_variables)
print(f"Requested: {cat_sim_adv['ScenarioMIP_NCC_NorESM2-MM_ssp585_r1i1p1f1_gn']._requested_variables_true}")
print(f"Dependent: {cat_sim_adv['ScenarioMIP_NCC_NorESM2-MM_ssp585_r1i1p1f1_gn']._dependent_variables}")

['tasmin', 'evspsblpot', 'tasmax', 'tas', 'tasmin', 'tasmax']
Requested: ['evspsblpot', 'tas']
Dependent: ['tasmin', 'tasmax', 'tasmin', 'tasmax']


<div class="alert alert-warning"> <b>WARNING:</b> `allow_conversion` currently fails if:
<ul>
<li>The requested DerivedVariable also requires a DerivedVariable itself.</li>
<li>The dependent variables exist at different frequencies (e.g. 'pr @1hr' & 'tas @3hr')</li>
</ul>
</div>

## Creating a New Catalog from a Directory

### Initialisation

The `create` argument of `ProjectCatalog` can be used to create an empty *ProjectCatalog* and a new set of JSON and CSV files. The JSON file follows the ESM Catalog Specification v.0.1.0: https://github.com/NCAR/esm-collection-spec

By default, `xscen` will populate the JSON with generic information, defined in `catalog.esm_col_data`. That metadata can be changed using the `project` argument with entries compatible with the ESM Catalog Specification (refer to the link above). Usually, the most useful and common entries will be: 

- title
- description

`xscen` will also instruct `intake_esm` to group catalog lines per *id - domain - processing_level - xrfreq*. This should be adequate for most uses. In the case that it is not, the following can be added to `project`:

- "aggregation_control": {"groupby_attrs": [list_of_columns]}

Other attributes and behaviours of the project definition can be modified in a similar way.
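
As a sketch, a `project` dictionary that overrides the grouping could look like the following. Adding *experiment* to the grouping columns is a hypothetical choice made only for the example; the dictionary would then be passed to `ProjectCatalog` through the `project` argument, as in the cell below.

```python
project = {
    "title": "tutorial-catalog",
    "description": "Catalog for the tutorial NetCDFs.",
    # Hypothetical choice: group by experiment in addition to the default columns.
    "aggregation_control": {
        "groupby_attrs": ["id", "domain", "processing_level", "xrfreq", "experiment"]
    },
}

print(project["aggregation_control"]["groupby_attrs"])
```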

In [17]:
project = {"title": "tutorial-catalog",
           "description": "Catalog for the tutorial NetCDFs."
          }

PC = ProjectCatalog(f"{Path().absolute()}/samples/tutorial-catalog.json", create=True, project=project, overwrite=True)

Successfully wrote ESM catalog json file to: file:///home/rondeau/python/github/xscen/docs/notebooks/samples/tutorial-catalog.json


In [18]:
# The metadata is stored in PC.esmcat
PC.esmcat

ESMCatalogModel(esmcat_version='0.1.0', attributes=[], assets=Assets(column_name='path', format=None, format_column_name='format'), aggregation_control=AggregationControl(variable_column_name='variable', groupby_attrs=['id', 'domain', 'processing_level', 'xrfreq'], aggregations=[Aggregation(type=<AggregationType.join_existing: 'join_existing'>, attribute_name='date_start', options={'dim': 'time'}), Aggregation(type=<AggregationType.union: 'union'>, attribute_name='variable', options={})]), id='tutorial-catalog', catalog_dict=None, catalog_file='/home/rondeau/python/github/xscen/docs/notebooks/samples/tutorial-catalog.csv', description='Catalog for the tutorial NetCDFs.', title='tutorial-catalog', last_updated=datetime.datetime(2022, 11, 18, 20, 10, 33, tzinfo=datetime.timezone.utc))

### Appending new data to a ProjectCatalog

At this stage, the CSV is still empty. There are two main ways to populate a catalog with data:

- Using `xs.ProjectCatalog.update_from_ds` to append a Dataset and populate the catalog columns using metadata. 

- Using `xs.catalog.parse_directory` to parse through existing NetCDF or Zarr data and decode their information based on file and directory names.

This tutorial will focus on `catalog.parse_directory`, as `update_from_ds` is more of a function that will be called during a climate-scenario-generation workflow. See the [Getting Started](getting_started.ipynb#Updating-the-catalog) tutorial for more details on `update_from_ds`.

#### Parsing a directory 

<div class="alert alert-info"> <b>NOTE:</b> If you are an Ouranos employee, this section should be of limited use (unless you need to retroactively parse a directory containing existing datasets). Please consult the existing Ouranos catalogs using xs.search_data_catalogs instead.</div>

The `parse_directory` function relies on patterns to decode file and directory names and store that information in the catalog.

- Patterns are a succession of column names in curly brackets. See below for an example. The pattern starts where the directory path stops.
- The `parallel_depth` argument can be used to change the level at which to parallelize the file (and thus the `globpattern`) search. A value of 1 (default and minimum), means the subfolders of each directory are searched in parallel, a value of 2 would search the subfolders' subfolders in parallel, and so on.
- If necessary, `read_from_file` can be used to open the files and read metadata from global attributes. Refer to the API for Docstrings and usage.
- In cases where some column information is the same across all data, `homogenous_info` can be used to explicitly give an attribute to the datasets being processed.
- Anything that isn't filled will be marked as `None`. 
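
The idea behind patterns can be sketched with a small regex-based decoder. This is only an illustration of the concept and not how `parse_directory` is implemented: the helper `pattern_to_regex` is hypothetical, `{*}` is assumed to mean "ignore this part", and the relative path is one of the tutorial files parsed below.

```python
import re

pattern = "{activity}/{domain}/{institution}/{source}/{experiment}/{member}/{frequency}/{*}.nc"
relative_path = (
    "ScenarioMIP/example-region/NCC/NorESM2-MM/ssp126/r1i1p1f1/day/"
    "ScenarioMIP_NCC_NorESM2-MM_ssp126_r1i1p1f1_gn_raw.nc"
)


def pattern_to_regex(pattern: str) -> str:
    """Turn each '{column}' into a named group; '{*}' matches a path part that is discarded."""
    parts = re.split(r"(\{[^}]*\})", pattern)
    regex = ""
    for part in parts:
        if part == "{*}":
            regex += r"[^/]+"
        elif part.startswith("{") and part.endswith("}"):
            regex += rf"(?P<{part[1:-1]}>[^/]+)"
        else:
            regex += re.escape(part)
    return f"^{regex}$"


# Decode the path into a dictionary of column name -> value.
match = re.match(pattern_to_regex(pattern), relative_path)
columns = match.groupdict() if match else {}
print(columns)
```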

In [19]:
from xscen.catalog import parse_directory

df = parse_directory(
    directories=['samples/tutorial/'],
    globpattern="*.nc",
    patterns=[
        "{activity}/{domain}/{institution}/{source}/{experiment}/{member}/{frequency}/{*}.nc"
    ],
    parallel_depth=1,
    homogenous_info={
        'mip_era': 'CMIP6',
        'type': 'simulation',
        'processing_level': 'raw'
    },
    read_from_file=["variable", "date_start", "date_end"]
)
df

2022-11-18 15:10:37,093 - xscen.catalog - INFO - Parsing attributes with netCDF4 from samples/tutorial/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp126/r1i1p1f1/day/ScenarioMIP_NCC_NorESM2-MM_ssp126_r1i1p1f1_gn_raw.nc.
2022-11-18 15:10:37,354 - xscen.catalog - INFO - Parsing attributes with netCDF4 from samples/tutorial/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp126/r1i1p1f1/fx/ScenarioMIP_NCC_NorESM2-MM_ssp126_r1i1p1f1_gn_raw.nc.
2022-11-18 15:10:37,369 - xscen.catalog - INFO - Parsing attributes with netCDF4 from samples/tutorial/ScenarioMIP/example-region/NCC/NorESM2-MM/ssp245/r1i1p1f1/day/ScenarioMIP_NCC_NorESM2-MM_ssp245_r1i1p1f1_gn_raw.nc.
2022-11-18 15:10:37,417 - xscen.catalog - INFO - Parsing attributes with netCDF4 from samples/tutorial/ScenarioMIP/example-region/NCC/

Unnamed: 0,id,type,processing_level,bias_adjust_institution,bias_adjust_project,mip_era,activity,driving_institution,driving_model,institution,...,member,xrfreq,frequency,variable,domain,date_start,date_end,version,format,path
0,CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp126_r1i1p1...,simulation,raw,,,CMIP6,ScenarioMIP,,,NCC,...,r1i1p1f1,D,day,"(tas,)",example-region,2000-01-01 00:00,2050-12-31 00:00,,nc,samples/tutorial/ScenarioMIP/example-region/NC...
1,CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp126_r1i1p1...,simulation,raw,,,CMIP6,ScenarioMIP,,,NCC,...,r1i1p1f1,fx,fx,"(sftlf,)",example-region,NaT,NaT,,nc,samples/tutorial/ScenarioMIP/example-region/NC...
2,CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp245_r1i1p1...,simulation,raw,,,CMIP6,ScenarioMIP,,,NCC,...,r1i1p1f1,D,day,"(tas,)",example-region,2000-01-01 00:00,2050-12-31 00:00,,nc,samples/tutorial/ScenarioMIP/example-region/NC...
3,CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp245_r1i1p1...,simulation,raw,,,CMIP6,ScenarioMIP,,,NCC,...,r1i1p1f1,fx,fx,"(sftlf,)",example-region,NaT,NaT,,nc,samples/tutorial/ScenarioMIP/example-region/NC...
4,CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp245_r2i1p1...,simulation,raw,,,CMIP6,ScenarioMIP,,,NCC,...,r2i1p1f1,D,day,"(tas,)",example-region,2000-01-01 00:00,2050-12-31 00:00,,nc,samples/tutorial/ScenarioMIP/example-region/NC...
5,CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp245_r2i1p1...,simulation,raw,,,CMIP6,ScenarioMIP,,,NCC,...,r2i1p1f1,fx,fx,"(sftlf,)",example-region,NaT,NaT,,nc,samples/tutorial/ScenarioMIP/example-region/NC...
6,CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp370_r1i1p1...,simulation,raw,,,CMIP6,ScenarioMIP,,,NCC,...,r1i1p1f1,D,day,"(tas,)",example-region,2000-01-01 00:00,2050-12-31 00:00,,nc,samples/tutorial/ScenarioMIP/example-region/NC...
7,CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp370_r1i1p1...,simulation,raw,,,CMIP6,ScenarioMIP,,,NCC,...,r1i1p1f1,fx,fx,"(sftlf,)",example-region,NaT,NaT,,nc,samples/tutorial/ScenarioMIP/example-region/NC...
8,CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp585_r1i1p1...,simulation,raw,,,CMIP6,ScenarioMIP,,,NCC,...,r1i1p1f1,D,day,"(tas,)",example-region,2000-01-01 00:00,2050-12-31 00:00,,nc,samples/tutorial/ScenarioMIP/example-region/NC...
9,CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp585_r1i1p1...,simulation,raw,,,CMIP6,ScenarioMIP,,,NCC,...,r1i1p1f1,fx,fx,"(sftlf,)",example-region,NaT,NaT,,nc,samples/tutorial/ScenarioMIP/example-region/NC...


#### Unique Dataset ID

In addition to the parse itself, `catalog.parse_directory` will create a unique Dataset ID that can be used to easily distinguish one simulation from another. This can be edited with the `id_columns` argument of `parse_directory`, but by default, IDs are based on CMIP6's ID structure with additions related to regional models and bias adjustment:

- `{bias_adjust_project} _ {mip_era} _ {activity} _ {driving_model} _ {institution} _ {source} _ {experiment} _ {member} _ {domain}`

This utility can also be called by itself through `xs.catalog.generate_id()`.

**NOTE:** When constructing IDs, empty columns are skipped.
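
The ID structure and the skipping of empty columns can be sketched in a few lines. This is a simplified stand-in for illustration, not the code behind `generate_id()`; the row reproduces the first entry of the DataFrame parsed above, with unfilled columns set to `None`.

```python
id_columns = [
    "bias_adjust_project", "mip_era", "activity", "driving_model",
    "institution", "source", "experiment", "member", "domain",
]

# First row of the parsed DataFrame above; unfilled columns are None.
row = {
    "bias_adjust_project": None,
    "mip_era": "CMIP6",
    "activity": "ScenarioMIP",
    "driving_model": None,
    "institution": "NCC",
    "source": "NorESM2-MM",
    "experiment": "ssp126",
    "member": "r1i1p1f1",
    "domain": "example-region",
}

# Join the non-empty columns, in order, with underscores.
dataset_id = "_".join(row[col] for col in id_columns if row.get(col))
print(dataset_id)
```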

In [20]:
df.iloc[0]["id"]

'CMIP6_ScenarioMIP_NCC_NorESM2-MM_ssp126_r1i1p1f1_example-region'

#### Appending data using ProjectCatalog.update()

At this stage, `df` is a `pandas.DataFrame`. `ProjectCatalog.update` is used to append this data to the CSV file and save the results on disk.

In [21]:
PC.update(df)

PC

Unnamed: 0,unique
id,5
type,1
processing_level,1
bias_adjust_institution,0
bias_adjust_project,0
mip_era,1
activity,1
driving_institution,0
driving_model,0
institution,1
