# Demonstrate data catalog functionality

`data_catalog` enables a user to define data catalogs with an input file.


In [109]:
import data_catalog
import importlib
import yaml

## here's what the input file looks like

In [110]:
!head -n 30 collections.yml

cesm2_runs:
  type: cesm
  data_sources:
    ctrl_bdrd_cmip6:
      - case: b.e21.B1850.f09_g17.CMIP6-piControl.001
        root_dir: /glade/collections/cdg/timeseries-cmip6/b.e21.B1850.f09_g17.CMIP6-piControl.001
        path_structure: component/proc/tseries/freq
        ctrl_branch_year: 0
        component_attrs:
          ocn:
            grid: POP_gx1v7

    historical_bdrd_cmip6:
      - case: b.e21.BHIST.f09_g17.CMIP6-historical.001
        root_dir: /glade/collections/cdg/timeseries-cmip6/b.e21.BHIST.f09_g17.CMIP6-historical.001
        path_structure: component/proc/tseries/freq
        ctrl_branch_year: 601
        component_attrs:
          ocn:
            grid: POP_gx1v7

      - case: b.e21.BHIST.f09_g17.CMIP6-historical.002
        root_dir: /glade/collections/cdg/timeseries-cmip6/b.e21.BHIST.f09_g17.CMIP6-historical.002
        path_structure: component/proc/tseries/freq
        ctrl_branch_year: 631
        component_attrs:
          ocn:
            grid: POP_gx1v7
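
The notebook does not show how `data_catalog` interprets `path_structure`, but the file paths returned further down suggest that segments such as `component` and `freq` are replaced by their values while `proc` and `tseries` are kept literally. Here is a minimal sketch of that idea; `expand_path_structure` is a hypothetical helper, not part of `data_catalog`.

```python
from pathlib import PurePosixPath

# One entry copied from the collections.yml excerpt above.
entry = {
    "case": "b.e21.BHIST.f09_g17.CMIP6-historical.001",
    "root_dir": "/glade/collections/cdg/timeseries-cmip6/"
                "b.e21.BHIST.f09_g17.CMIP6-historical.001",
    "path_structure": "component/proc/tseries/freq",
}


def expand_path_structure(entry, **attrs):
    """Hypothetical helper (an assumption, not the data_catalog API).

    Segments of path_structure that match a keyword (e.g. 'component',
    'freq') are replaced by its value; other segments are kept as-is.
    """
    segments = [attrs.get(seg, seg) for seg in entry["path_structure"].split("/")]
    return str(PurePosixPath(entry["root_dir"], *segments))


directory = expand_path_structure(entry, component="ocn", freq="month_1")
print(directory)
```

The resulting directory has the same shape as the `.../ocn/proc/tseries/month_1/` paths that appear in the query results below.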



## build the catalog

In [111]:
importlib.reload(data_catalog)
data_catalog.build_catalog('collections.yml', clobber=True)

active catalog: cesm2_runs
active catalog: cesm1_le




## set the active catalog

This is not yet implemented correctly, but at present you can set the active catalog as follows.

In [130]:
data_catalog.set_catalog('cesm2_runs')

active catalog: cesm2_runs


## find entries matching query

Here's an example where there is one file per variable, per ensemble.

In [131]:
data_catalog.find_in_index(experiment='historical_bdrd_cmip6',
                           component='ocn',
                           variable='FG_CO2')

Unnamed: 0,case,component,date_range,ensemble,experiment,file_basename,files,freq,grid,sequence_order,variable,year_offset,ctrl_branch_year,has_ocean_bgc
12106,b.e21.BHIST.f09_g17.CMIP6-historical.001,ocn,"['185001', '201412']",0,historical_bdrd_cmip6,b.e21.BHIST.f09_g17.CMIP6-historical.001.pop.h...,/glade/collections/cdg/timeseries-cmip6/b.e21....,month_1,POP_gx1v7,0,FG_CO2,,601.0,
8917,b.e21.BHIST.f09_g17.CMIP6-historical.002,ocn,"['185001', '201412']",1,historical_bdrd_cmip6,b.e21.BHIST.f09_g17.CMIP6-historical.002.pop.h...,/glade/collections/cdg/timeseries-cmip6/b.e21....,month_1,POP_gx1v7,0,FG_CO2,,631.0,
5728,b.e21.BHIST.f09_g17.CMIP6-historical.003,ocn,"['185001', '201412']",2,historical_bdrd_cmip6,b.e21.BHIST.f09_g17.CMIP6-historical.003.pop.h...,/glade/collections/cdg/timeseries-cmip6/b.e21....,month_1,POP_gx1v7,0,FG_CO2,,661.0,
2539,b.e21.BHIST.f09_g17.CMIP6-historical.004,ocn,"['185001', '201412']",3,historical_bdrd_cmip6,b.e21.BHIST.f09_g17.CMIP6-historical.004.pop.h...,/glade/collections/cdg/timeseries-cmip6/b.e21....,month_1,POP_gx1v7,0,FG_CO2,,501.0,


The following case happens to have the data divided into two files; a workflow using these would concatenate them in time.

In [132]:
data_catalog.find_in_index(experiment='ctrl_ocean-ice-core',
                           component='ocn',
                           variable='FG_CO2')

Unnamed: 0,case,component,date_range,ensemble,experiment,file_basename,files,freq,grid,sequence_order,variable,year_offset,ctrl_branch_year,has_ocean_bgc
461,g.e21a01d.G1850ECOIAF.T62_g17.extraterr-fe.001,ocn,"['000101', '024012']",0,ctrl_ocean-ice-core,g.e21a01d.G1850ECOIAF.T62_g17.extraterr-fe.001...,/glade/scratch/mclong/archive/g.e21a01d.G1850E...,month_1,POP_gx1v7,0,FG_CO2,1699.0,,
462,g.e21a01d.G1850ECOIAF.T62_g17.extraterr-fe.001,ocn,"['024101', '031012']",0,ctrl_ocean-ice-core,g.e21a01d.G1850ECOIAF.T62_g17.extraterr-fe.001...,/glade/scratch/mclong/archive/g.e21a01d.G1850E...,month_1,POP_gx1v7,0,FG_CO2,1699.0,,
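
Before concatenating in time, it helps to check that the files are in order and that no months are missing between them. The sketch below does this from the file names alone; it assumes the trailing `YYYYMM-YYYYMM` pattern seen in these file names, and `date_range` and `is_contiguous` are hypothetical helpers rather than `data_catalog` functions.

```python
import re

# File names from the 'ctrl_ocean-ice-core' query, deliberately out of order.
basenames = [
    "g.e21a01d.G1850ECOIAF.T62_g17.extraterr-fe.001.pop.h.FG_CO2.024101-031012.nc",
    "g.e21a01d.G1850ECOIAF.T62_g17.extraterr-fe.001.pop.h.FG_CO2.000101-024012.nc",
]

# Assumption: names end in '.<YYYYMM>-<YYYYMM>.nc'.
DATE_RANGE = re.compile(r"\.(\d{6})-(\d{6})\.nc$")


def date_range(basename):
    """Return (start, end) as 'YYYYMM' strings parsed from the file name."""
    match = DATE_RANGE.search(basename)
    if match is None:
        raise ValueError(f"no date range found in {basename!r}")
    return match.group(1), match.group(2)


def is_contiguous(first, second):
    """True if `second` starts in the month right after `first` ends."""
    end = date_range(first)[1]
    start = date_range(second)[0]
    end_months = int(end[:4]) * 12 + int(end[4:])
    start_months = int(start[:4]) * 12 + int(start[4:])
    return start_months == end_months + 1


ordered = sorted(basenames, key=date_range)
print(ordered)
print(is_contiguous(ordered[0], ordered[1]))
```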


The `get_entries` method returns a dictionary with one list per column, holding the values from the rows that match a query.

In [133]:
entries = data_catalog.get_entries(experiment='ctrl_ocean-ice-core',
                                   component='ocn',
                                   variable='FG_CO2')

print(yaml.dump(entries))

case: [g.e21a01d.G1850ECOIAF.T62_g17.extraterr-fe.001, g.e21a01d.G1850ECOIAF.T62_g17.extraterr-fe.001]
component: [ocn, ocn]
ctrl_branch_year: [.nan, .nan]
date_range: ['[''000101'', ''024012'']', '[''024101'', ''031012'']']
ensemble: [0, 0]
experiment: [ctrl_ocean-ice-core, ctrl_ocean-ice-core]
file_basename: [g.e21a01d.G1850ECOIAF.T62_g17.extraterr-fe.001.pop.h.FG_CO2.000101-024012.nc,
  g.e21a01d.G1850ECOIAF.T62_g17.extraterr-fe.001.pop.h.FG_CO2.024101-031012.nc]
files: [/glade/scratch/mclong/archive/g.e21a01d.G1850ECOIAF.T62_g17.extraterr-fe.001/ocn/proc/tseries/month_1/g.e21a01d.G1850ECOIAF.T62_g17.extraterr-fe.001.pop.h.FG_CO2.000101-024012.nc,
  /glade/scratch/mclong/archive/g.e21a01d.G1850ECOIAF.T62_g17.extraterr-fe.001/ocn/proc/tseries/month_1/g.e21a01d.G1850ECOIAF.T62_g17.extraterr-fe.001.pop.h.FG_CO2.024101-031012.nc]
freq: [month_1, month_1]
grid: [POP_gx1v7, POP_gx1v7]
has_ocean_bgc: [.nan, .nan]
sequence_order: [0, 0]
variable: [FG_CO2, FG_CO2]
year_offset: [1699.0, 1699.
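
A dictionary of column lists is easy to turn back into one record per file. Note that the YAML dump above shows each `date_range` value as a quoted string that looks like a list, so the sketch below parses it with `ast.literal_eval`. The `entries` dictionary here is a hand-copied subset of the output above, not a live call to `data_catalog`.

```python
import ast

# Hand-copied subset of the get_entries output shown above.
entries = {
    "date_range": ["['000101', '024012']", "['024101', '031012']"],
    "ensemble": [0, 0],
    "variable": ["FG_CO2", "FG_CO2"],
    "file_basename": [
        "g.e21a01d.G1850ECOIAF.T62_g17.extraterr-fe.001.pop.h.FG_CO2.000101-024012.nc",
        "g.e21a01d.G1850ECOIAF.T62_g17.extraterr-fe.001.pop.h.FG_CO2.024101-031012.nc",
    ],
}

# Transpose column lists into one dict per matching row.
keys = list(entries)
rows = [dict(zip(keys, values)) for values in zip(*(entries[k] for k in keys))]

# date_range is stored as the string form of a list; parse it safely.
for row in rows:
    row["date_range"] = ast.literal_eval(row["date_range"])

for row in rows:
    print(row["date_range"], row["file_basename"])
```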

The `get_files` method returns a (sorted) list of files.

If we don't specify an `ensemble`, we get the files for all ensembles.

In [134]:
data_catalog.get_files(experiment='historical_bdrd_cmip6',
                       component='ocn',
                       variable='FG_CO2')

['/glade/collections/cdg/timeseries-cmip6/b.e21.BHIST.f09_g17.CMIP6-historical.001/ocn/proc/tseries/month_1/b.e21.BHIST.f09_g17.CMIP6-historical.001.pop.h.FG_CO2.185001-201412.nc',
 '/glade/collections/cdg/timeseries-cmip6/b.e21.BHIST.f09_g17.CMIP6-historical.002/ocn/proc/tseries/month_1/b.e21.BHIST.f09_g17.CMIP6-historical.002.pop.h.FG_CO2.185001-201412.nc',
 '/glade/collections/cdg/timeseries-cmip6/b.e21.BHIST.f09_g17.CMIP6-historical.003/ocn/proc/tseries/month_1/b.e21.BHIST.f09_g17.CMIP6-historical.003.pop.h.FG_CO2.185001-201412.nc',
 '/glade/collections/cdg/timeseries-cmip6/b.e21.BHIST.f09_g17.CMIP6-historical.004/ocn/proc/tseries/month_1/b.e21.BHIST.f09_g17.CMIP6-historical.004.pop.h.FG_CO2.185001-201412.nc']

Specifying the `ensemble` returns just that file.

In [135]:
data_catalog.get_files(experiment='historical_bdrd_cmip6',
                       component='ocn',
                       ensemble=0,
                       variable='FG_CO2')

['/glade/collections/cdg/timeseries-cmip6/b.e21.BHIST.f09_g17.CMIP6-historical.001/ocn/proc/tseries/month_1/b.e21.BHIST.f09_g17.CMIP6-historical.001.pop.h.FG_CO2.185001-201412.nc']
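
To make the query semantics concrete, here is a toy sketch of keyword filtering over a list of records: every keyword must match, and omitting a keyword (such as `ensemble`) leaves that field unconstrained. List values are treated as "any of", in line with the list-valued `experiment` query used later in this notebook. This is only an illustration; `matches` and `find` are hypothetical names, and the real `data_catalog` implementation is not shown here and may differ.

```python
# Toy index with the four historical ensemble members seen above.
rows = [
    {"case": f"b.e21.BHIST.f09_g17.CMIP6-historical.{n + 1:03d}",
     "experiment": "historical_bdrd_cmip6",
     "component": "ocn",
     "ensemble": n,
     "variable": "FG_CO2"}
    for n in range(4)
]


def matches(row, **query):
    """True if the row satisfies every key=value (or key=[values]) pair."""
    for key, wanted in query.items():
        allowed = wanted if isinstance(wanted, (list, tuple, set)) else [wanted]
        if row.get(key) not in allowed:
            return False
    return True


def find(rows, **query):
    """Hypothetical stand-in for a catalog query over plain dicts."""
    return [row for row in rows if matches(row, **query)]


all_members = find(rows, experiment="historical_bdrd_cmip6", variable="FG_CO2")
one_member = find(rows, experiment="historical_bdrd_cmip6", ensemble=0)
two_members = find(rows, ensemble=[0, 1])
print(len(all_members), len(one_member), len(two_members))
```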

## the return value of a query can be manipulated to get unique attributes

In [136]:
data_catalog.find_in_index(experiment='historical_bdrd_cmip6').component.unique()

array(['atm', 'ice', 'lnd', 'ocn', 'rof'], dtype=object)

In [137]:
data_catalog.find_in_index(experiment='historical_bdrd_cmip6',
                           component='ocn',
                           variable='FG_CO2').grid.unique()

array(['POP_gx1v7'], dtype=object)

The `grid` column comes from the `collections.yml` input file, so if it wasn't specified for some components, it will be `nan`.

In [138]:
data_catalog.find_in_index(experiment='historical_bdrd_cmip6').grid.unique()

array([nan, 'POP_gx1v7'], dtype=object)
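
When collecting the known grids, the `nan` placeholders usually need to be dropped first. The pure-Python sketch below shows one way to do that on a plain list; `is_missing` is a hypothetical helper. If the query result is a pandas object (which the `.unique()` output above suggests, though that is an assumption), calling `.dropna()` on the column before `.unique()` would be the more natural route.

```python
import math

# Mixed values like those in the grid column: nan where no grid was given.
grids = [float("nan"), "POP_gx1v7", float("nan"), "POP_gx1v7"]


def is_missing(value):
    """True for float nan; strings and other values count as present."""
    return isinstance(value, float) and math.isnan(value)


# nan never compares equal to itself, so test for it explicitly
# instead of relying on membership checks like `value in [nan]`.
known_grids = sorted({g for g in grids if not is_missing(g)})
print(known_grids)
```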

Determine how many ensembles exist.

In [139]:
data_catalog.find_in_index(experiment='historical_bdrd_cmip6',
                           component='ocn',
                           variable='FG_CO2').ensemble.unique()

array([0, 1, 2, 3])

In [140]:
data_catalog.find_in_index(experiment='ctrl_ocean-ice-core',
                           component='ocn',
                           variable='FG_CO2').ensemble.unique()

array([0])

In [141]:
data_catalog.find_in_index(experiment='ctrl_ocean-ice-core').component.unique()

array(['ocn'], dtype=object)

Again, the following dataset has 2 files for a single ensemble.

In [142]:
data_catalog.get_files(experiment='ctrl_ocean-ice-core',
                       component='ocn',
                       ensemble=0,
                       variable='FG_CO2')

['/glade/scratch/mclong/archive/g.e21a01d.G1850ECOIAF.T62_g17.extraterr-fe.001/ocn/proc/tseries/month_1/g.e21a01d.G1850ECOIAF.T62_g17.extraterr-fe.001.pop.h.FG_CO2.000101-024012.nc',
 '/glade/scratch/mclong/archive/g.e21a01d.G1850ECOIAF.T62_g17.extraterr-fe.001/ocn/proc/tseries/month_1/g.e21a01d.G1850ECOIAF.T62_g17.extraterr-fe.001.pop.h.FG_CO2.024101-031012.nc']

## switch catalogs: CESM-LE

In [143]:
data_catalog.set_catalog('cesm1_le')

active catalog: cesm1_le


A key use case is to concatenate transient experiments; passing a list of experiments to the query enables this. The `sequence_order` field enables putting things in the right order.

In [144]:
entries = data_catalog.get_entries(experiment=['20C', 'RCP85'],
                         component='ocn',
                         ensemble=1,
                         variable='FG_CO2')

The `get_files` method should return a sorted list of files with the `sequence_order` applied.

In [145]:
data_catalog.get_files(experiment=['20C', 'RCP85'],
                       component='ocn',
                       ensemble=1,
                       variable='FG_CO2')

['/glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE/ocn/proc/tseries/monthly/FG_CO2/b.e11.B20TRC5CNBDRD.f09_g16.001.pop.h.FG_CO2.185001-200512.nc',
 '/glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE/ocn/proc/tseries/monthly/FG_CO2/b.e11.BRCP85C5CNBDRD.f09_g16.001.pop.h.FG_CO2.200601-208012.nc',
 '/glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE/ocn/proc/tseries/monthly/FG_CO2/b.e11.BRCP85C5CNBDRD.f09_g16.001.pop.h.FG_CO2.208101-210012.nc']
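
The ordering above can be pictured as a sort on `sequence_order` first and on the file name (which carries the date range) second. The sketch below is an illustration only: the `sequence_order` values of 0 for `20C` and 1 for `RCP85` are assumptions, since the actual values for this catalog are not shown in this notebook.

```python
# Records for ensemble 1, deliberately shuffled. The sequence_order values
# are assumed for illustration (20C before RCP85).
records = [
    {"experiment": "RCP85", "sequence_order": 1,
     "file_basename": "b.e11.BRCP85C5CNBDRD.f09_g16.001.pop.h.FG_CO2.208101-210012.nc"},
    {"experiment": "20C", "sequence_order": 0,
     "file_basename": "b.e11.B20TRC5CNBDRD.f09_g16.001.pop.h.FG_CO2.185001-200512.nc"},
    {"experiment": "RCP85", "sequence_order": 1,
     "file_basename": "b.e11.BRCP85C5CNBDRD.f09_g16.001.pop.h.FG_CO2.200601-208012.nc"},
]

# Sort by experiment order first; within one experiment the names differ
# only in the date range, so a name sort puts the files in time order.
ordered = sorted(records, key=lambda r: (r["sequence_order"], r["file_basename"]))

for record in ordered:
    print(record["experiment"], record["file_basename"])
```

Sorting on an explicit `sequence_order` avoids depending on how the experiment names or case names happen to sort alphabetically.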

The `get_entries` method returns a dictionary with lists of all fields from a query.

In [146]:
entries = data_catalog.get_entries(experiment=['20C', 'RCP85'],
                         component='ocn',
                         ensemble=1,
                         variable='FG_CO2')
print(yaml.dump(entries))

case: [b.e11.B20TRC5CNBDRD.f09_g16.001, b.e11.BRCP85C5CNBDRD.f09_g16.001, b.e11.BRCP85C5CNBDRD.f09_g16.001]
component: [ocn, ocn, ocn]
ctrl_branch_year: [.nan, .nan, .nan]
date_range: ['[''185001'', ''200512'']', '[''200601'', ''208012'']', '[''208101'',
    ''210012'']']
ensemble: [1, 1, 1]
experiment: [20C, RCP85, RCP85]
file_basename: [b.e11.B20TRC5CNBDRD.f09_g16.001.pop.h.FG_CO2.185001-200512.nc, b.e11.BRCP85C5CNBDRD.f09_g16.001.pop.h.FG_CO2.200601-208012.nc,
  b.e11.BRCP85C5CNBDRD.f09_g16.001.pop.h.FG_CO2.208101-210012.nc]
files: [/glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE/ocn/proc/tseries/monthly/FG_CO2/b.e11.B20TRC5CNBDRD.f09_g16.001.pop.h.FG_CO2.185001-200512.nc,
  /glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE/ocn/proc/tseries/monthly/FG_CO2/b.e11.BRCP85C5CNBDRD.f09_g16.001.pop.h.FG_CO2.200601-208012.nc,
  /glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE/ocn/proc/tseries/monthly/FG_CO2/b.e11.BRCP85C5CNBDRD.f09_g16.001.pop.h.FG_CO2.208101-210012.nc]
freq: [mont