# Exploring the COSIMA Cookbook

## Statement of problem

COSIMA produces a large volume of data, and that data has to be found before it can be analysed. COSIMA outputs are currently stored in the `outputs` directory of the `ik11` project. That directory contains a subdirectory for each model resolution, and each of those contains a subdirectory for each model configuration.

In [1]:
!ls /g/data/ik11/outputs

access-om2  access-om2-01  access-om2-025  README


In [2]:
!ls /g/data/ik11/outputs/access-om2-01

01deg_jra55_iaf_v2.0.0rc3
01deg_jra55_iaf_v2.0.0rc3_nonuniform_albedo
01deg_jra55v13_ryf9091
01deg_jra55v13_ryf9091_5Kv
01deg_jra55v13_ryf9091_k_smag_iso3
01deg_jra55v13_ryf9091_ndte240
01deg_jra55v13_ryf9091_ndte500
01deg_jra55v13_ryf9091_ndte60
01deg_jra55v13_ryf9091_OFAM3visc
01deg_jra55v13_ryf9091_qian_wp
01deg_jra55v13_ryf9091_tides_control
01deg_jra55v13_ryf9091_tides_fixed
01deg_jra55v140_iaf


All the data is contained in netCDF files, of which there are many!

In [3]:
!find /g/data/ik11/outputs/ -iname '*.nc' | wc -l

86002


**GOAL**: access data by specifying an experiment and a variable

## COSIMA Cookbook solution

To achieve this goal, the COSIMA Cookbook provides tools that crawl directories looking for netCDF data files, read metadata from each file about the data it contains, and save that metadata to an SQL database.
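
The crawl-and-index idea can be illustrated with a minimal sketch that uses only the Python standard library. It is not the Cookbook's implementation: the table layout, the `index_netcdf_files` helper and the rule that the first path component is the experiment name are all assumptions made for illustration, and only file paths are recorded rather than real netCDF metadata.

```python
import os
import sqlite3
import tempfile


def index_netcdf_files(root_dir, conn):
    """Record every *.nc file below root_dir as an (experiment, ncfile) row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ncfiles (experiment TEXT, ncfile TEXT)"
    )
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if not name.lower().endswith(".nc"):
                continue
            relpath = os.path.relpath(os.path.join(dirpath, name), root_dir)
            # Assumption: the first path component names the experiment.
            experiment, _, ncfile = relpath.partition(os.sep)
            if not ncfile:
                continue  # file sits directly in root_dir, no experiment
            conn.execute(
                "INSERT INTO ncfiles VALUES (?, ?)", (experiment, ncfile)
            )
    conn.commit()


# Build a small throwaway directory tree (hypothetical names) and index it.
conn = sqlite3.connect(":memory:")
with tempfile.TemporaryDirectory() as root:
    for relpath in (
        "expt_a/output000/ocean/ocean.nc",
        "expt_a/output001/ocean/ocean.nc",
        "expt_b/output000/ice/iceh.nc",
        "expt_b/README",
    ):
        path = os.path.join(root, *relpath.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    index_netcdf_files(root, conn)

# Query the index by experiment instead of walking the file system again.
counts = dict(
    conn.execute("SELECT experiment, COUNT(*) FROM ncfiles GROUP BY experiment")
)
print(counts)
```

Once the index exists, questions such as "which files belong to this experiment" become database queries rather than file system searches.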

The Cookbook also provides an API to query the database and retrieve data by experiment and variable name.

In [4]:
import cosima_cookbook as cc

In [5]:
session = cc.database.create_session()

In [6]:
cc.querying.getvar(expt='01deg_jra55v140_iaf', variable='u', session=session, n=1)

Unnamed: 0,Array,Chunk
Bytes,5.83 GB,3.69 MB
Shape,"(1, 75, 2700, 3600)","(1, 19, 135, 180)"
Count,1601 Tasks,1600 Chunks
Type,float64,numpy.ndarray


The question then becomes: how do I find out which experiments exist, and which variables are available? Currently the API provides `get_experiments`, which returns a list of experiments, and `get_variables`, which returns a list of variables for a given experiment.

In [7]:
cc.querying.get_experiments(session, all=True)

Unnamed: 0,experiment,contact,email,created,description,notes,root_dir,ncfiles
0,01deg_jra55v13_ryf9091_OFAM3visc,Andrew Kiss,andrew.kiss@anu.edu.au,2020-03-29,0.1 degree ACCESS-OM2 global model configurati...,,/g/data/ik11/outputs/access-om2-01/01deg_jra55...,50
1,01deg_jra55v13_ryf9091_tides_fixed,Adele Morrison,adele.morrison@anu.edu.au,2020-06-11,0.1 degree ACCESS-OM2 global model configurati...,"Mostly 1 month run lengths, but a couple of mo...",/g/data/ik11/outputs/access-om2-01/01deg_jra55...,1851
2,01deg_jra55v13_ryf9091_k_smag_iso3,Andrew Kiss,andrew.kiss@anu.edu.au,2020-03-29,0.1 degree ACCESS-OM2 global model configurati...,,/g/data/ik11/outputs/access-om2-01/01deg_jra55...,128
3,01deg_jra55v13_ryf9091_5Kv,Ryan Holmes,ryan.holmes@unsw.edu.au,2020-03-01,As for 01deg_jra55v13_ryf9091 except with a ba...,,/g/data/ik11/outputs/access-om2-01/01deg_jra55...,19
4,1deg_jra55v131_ryf_nonuniform_albedo,Andrew Kiss,andrew.kiss@anu.edu.au,2020-03-24,1 degree ACCESS-OM2 global model configuration...,,/g/data/ik11/outputs/access-om2/1deg_jra55v131...,260
5,01deg_jra55v13_ryf9091_tides_control,,,NaT,,,/g/data/ik11/outputs/access-om2-01/01deg_jra55...,620
6,1deg_jra55v131_ryf_const_albedo,Andrew Kiss,andrew.kiss@anu.edu.au,2020-03-24,1 degree ACCESS-OM2 global model configuration...,,/g/data/ik11/outputs/access-om2/1deg_jra55v131...,260
7,01deg_jra55v13_ryf9091_tides,,,NaT,,,/g/data/ik11/outputs/access-om2-01/01deg_jra55...,2578
8,025deg_jra55_ryf9091_gadi_noGM,Ryan Holmes,ryan.holmes@unsw.edu.au,2020-04-01,0.25 degree ACCESS-OM2 global model configurat...,,/g/data/ik11/outputs/access-om2-025/025deg_jra...,316
9,1deg_jra55_iaf_v2.0.0rc3_nonuniform_albedo,Andrew Kiss,andrew.kiss@anu.edu.au,2020-05-30,1 degree ACCESS-OM2 global model configuration...,,/g/data/ik11/outputs/access-om2/1deg_jra55_iaf...,4660


In [8]:
variables = cc.querying.get_variables(session, experiment='01deg_jra55v140_iaf')
variables

Unnamed: 0,name,long_name,frequency,ncfile,# ncfiles,time_start,time_end
0,SALT,,,restart239/ice/monthly_sstsss.nc,61,,
1,TEMP,,,restart239/ice/monthly_sstsss.nc,61,,
2,Time,Time,,restart243/ocean/ocean_velocity_advection.res.nc,671,,
3,Tsfcn,,,restart227/ice/iced.2015-01-01-00000.nc,61,,
4,advectionu,advectionu,,restart243/ocean/ocean_velocity_advection.res.nc,61,,
...,...,...,...,...,...,...,...
362,time,time,static,output243/ocean/ocean-2d-drag_coeff.nc,3660,1900-01-01 00:00:00,2019-01-01 00:00:00
363,xt_ocean,tcell longitude,static,output034/ocean/ocean-2d-geolat_t.nc,1708,1900-01-01 00:00:00,1900-01-01 00:00:00
364,xu_ocean,ucell longitude,static,output243/ocean/ocean-2d-drag_coeff.nc,1952,1900-01-01 00:00:00,2019-01-01 00:00:00
365,yt_ocean,tcell latitude,static,output034/ocean/ocean-2d-geolat_t.nc,1708,1900-01-01 00:00:00,1900-01-01 00:00:00


However, the same variable is sometimes saved at more than one frequency:

In [9]:
variables[variables.name == 'surface_salt']

Unnamed: 0,name,long_name,frequency,ncfile,# ncfiles,time_start,time_end
169,surface_salt,Practical Salinity,1 daily,output243/ocean/ocean-2d-surface_salt-1-daily-...,244,1958-01-01 00:00:00,2019-01-01 00:00:00
304,surface_salt,Practical Salinity,1 monthly,output243/ocean/ocean-2d-surface_salt-1-monthl...,244,1958-01-01 00:00:00,2019-01-01 00:00:00


If you try to load this data directly you will get an error, because you will be combining data from different files with different temporal frequencies.

In [10]:
cc.querying.getvar(expt='01deg_jra55v140_iaf', variable='surface_salt', session=session)

ValueError: Resulting object does not have monotonic global indexes along dimension time
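
Why mixed frequencies break the time axis can be shown without any model data. The sketch below uses hypothetical time stamps and plain Python lists in place of netCDF files; `combined_times` and `is_monotonic` are illustrative helpers, not Cookbook functions. Files are ordered by their first time stamp and concatenated, which only yields an increasing time axis when a single frequency is selected.

```python
from datetime import date

# Hypothetical files for one variable saved at two frequencies.
# Each entry is (frequency, time stamps stored in that file).
files = [
    ("1 daily", [date(1958, 1, d) for d in range(1, 32)]),
    ("1 daily", [date(1958, 2, d) for d in range(1, 29)]),
    ("1 monthly", [date(1958, 1, 16), date(1958, 2, 15)]),
]


def combined_times(files, frequency=None):
    """Concatenate time stamps file by file, ordered by first time stamp."""
    selected = [
        stamps for freq, stamps in files if frequency is None or freq == frequency
    ]
    times = []
    for stamps in sorted(selected, key=lambda stamps: stamps[0]):
        times.extend(stamps)
    return times


def is_monotonic(times):
    """True when every time stamp is strictly later than the one before."""
    return all(a < b for a, b in zip(times, times[1:]))


mixed_ok = is_monotonic(combined_times(files))
daily_ok = is_monotonic(combined_times(files, "1 daily"))
monthly_ok = is_monotonic(combined_times(files, "1 monthly"))
print(mixed_ok, daily_ok, monthly_ok)
```

The practical fix is therefore to restrict the load to one frequency. If the installed version of `getvar` accepts a `frequency` argument (an assumption to check against your installation), passing a value such as `'1 monthly'` is the usual way to do this.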

## Exploring a Cookbook Database

The COSIMA Cookbook `explore` submodule addresses the problem of finding relevant experiments and variables within a Cookbook database, and simplifies the process of loading the data.

It does this by providing GUI elements that users can embed in their Jupyter notebooks to filter and query the database.

**Requirements:** The `explore` submodule requires the `cosima-cookbook` version found in the `conda/analysis3-20.07` (or later) kernel on NCI, or your own up-to-date Cookbook installation.

In [11]:
from cosima_cookbook import explore

### Database Explorer

The first component is `DatabaseExplorer`, which is used to find relevant experiments. Pass an existing `session` to re-use it, or omit `session` and the explorer will start with the default database.

Filtering can be applied to narrow down the list of experiments. Select one or more keywords to reduce the listed experiments to those that contain all the selected keywords.

To show only experiments that contain a given variable, select the variable from the list of available variables in the database and push the '>>' button to move it to the right-hand box. When the filter button is then pushed, only experiments that contain the variables in the right-hand box are shown. Variables can be removed from the filtering box by selecting them and pushing '<<'.

Note that the list of available variables contains *all* variables in the database, and filtering by keyword does not change it. Both filtering methods are applied to find the list of matching experiments, but the two methods are independent in all other respects.
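
The filtering rule described above can be sketched with set operations. This is an illustration rather than the widget's actual code: the experiment names, keywords and variable sets are hypothetical, and the sketch assumes that an experiment must contain every selected keyword and every selected variable to be listed.

```python
# Hypothetical catalogue: experiment name -> its keywords and variables.
experiments = {
    "expt_a": {"keywords": {"tenth", "iaf"}, "variables": {"u", "surface_salt"}},
    "expt_b": {"keywords": {"tenth", "ryf"}, "variables": {"u"}},
    "expt_c": {"keywords": {"ryf"}, "variables": {"u", "surface_salt"}},
}


def filter_experiments(experiments, keywords=(), variables=()):
    """Return experiments holding all selected keywords and all variables."""
    wanted_keywords = set(keywords)
    wanted_variables = set(variables)
    return sorted(
        name
        for name, info in experiments.items()
        # "<=" on sets is the subset test; an empty selection matches everything.
        if wanted_keywords <= info["keywords"]
        and wanted_variables <= info["variables"]
    )


everything = filter_experiments(experiments)
by_keyword = filter_experiments(experiments, keywords=["tenth"])
by_both = filter_experiments(
    experiments, keywords=["tenth"], variables=["surface_salt"]
)
print(everything, by_keyword, by_both)
```

Because the two criteria are simply combined, leaving either selection empty means that criterion does not exclude any experiment.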

Note also that the list of available variables is pre-filtered: all variables from restart files and variables that can be unambiguously identified as coordinate variables are not listed. It is possible to remove this pre-filtering by deselecting the checkboxes underneath the variable list.

By default, variables from all model components are shown in the selection box. To display only variables from one model component, select that component from the drop-down menu, which defaults to "All models".

The search box can be used to further narrow the list of available variables. When text is entered into the search box only variables that contain that text in their variable name or their `long_name` attribute will be displayed in the selection box.
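
The search behaviour amounts to a substring match against two fields. The sketch below is a guess at the logic, not the widget's source: the `search_variables` helper is hypothetical, the matching is assumed to be case-insensitive, and the variable names and long names are taken from the `get_variables` output shown earlier.

```python
# (name, long_name) pairs; long_name may be missing for some variables.
variables = [
    ("SALT", None),
    ("surface_salt", "Practical Salinity"),
    ("xt_ocean", "tcell longitude"),
    ("xu_ocean", "ucell longitude"),
]


def search_variables(variables, text):
    """Return names whose name or long_name contains text (case-insensitive)."""
    needle = text.lower()
    return [
        name
        for name, long_name in variables
        if needle in name.lower() or needle in (long_name or "").lower()
    ]


by_name = search_variables(variables, "salt")
by_long_name = search_variables(variables, "salin")
by_coordinate = search_variables(variables, "longitude")
print(by_name, by_long_name, by_coordinate)
```

Searching on the `long_name` is what makes a variable such as `surface_salt` discoverable from a descriptive term that does not appear in its short name.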

When a variable is selected, its `long_name` is displayed below the variable selector box. In some cases when filtering and/or searching, a variable will be automatically selected but may not show as highlighted in the selector box. This is undesirable, but currently unavoidable.

When an experiment is selected and the 'Load Experiment' button is pushed, an Experiment Explorer GUI element opens below the Database Explorer. A detailed explanation of the Experiment Explorer is given in the next section.

**(Note: The widgets have been exported to be usable in an HTML page, but they will ONLY function properly if loaded as a jupyter notebook)**

In [12]:
from cosima_cookbook import explore
dbx = explore.DatabaseExplorer(session=session)
dbx

DatabaseExplorer(children=(HTML(value='<style>.header p{ line-height: 1.4; margin-bottom: 10px }</style>\n    …

### Experiment Explorer

The `ExperimentExplorer` can be used independently of the `DatabaseExplorer` if you already know the experiment you wish to load. 

You can re-use an existing database session, or not supply that argument and a new session will be created automatically with the default database. If you pass an experiment name this experiment will be loaded by default, but it is not necessary to do so, as any experiment present in the database can be selected from a drop-down menu at the top.

The box showing the available variables is the same as the one in the filtering element from `DatabaseExplorer`, with exactly the same functionality to show only variables from selected models, search by variable name and long name, and filter out coordinates and restarts.

When a variable is selected, its long name is displayed below the box as before, and the frequency drop-down and date range slider to the right are populated. The same variable can be present in a data set at different temporal frequencies. In that case a frequency must be chosen, because data at different frequencies cannot be loaded into the same `xarray.DataArray`. Selecting a frequency may change the range of the date slider if the available dates differ between the frequencies.

If you only need data for a limited period, it is advisable to reduce the date range before loading. Fewer files then need to be opened and have their metadata checked, so loading is much quicker.
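
The saving comes from opening only the files whose time span overlaps the requested range. A minimal sketch of that selection is shown below. The file names, their date spans and the `files_in_range` helper are hypothetical, and the Cookbook's own selection logic may differ.

```python
from datetime import date

# Hypothetical index entries: (ncfile, time_start, time_end), one per quarter.
files = [
    ("output000/ocean/ocean.nc", date(1958, 1, 1), date(1958, 3, 31)),
    ("output001/ocean/ocean.nc", date(1958, 4, 1), date(1958, 6, 30)),
    ("output002/ocean/ocean.nc", date(1958, 7, 1), date(1958, 9, 30)),
    ("output003/ocean/ocean.nc", date(1958, 10, 1), date(1958, 12, 31)),
]


def files_in_range(files, start, end):
    """Return the files whose [time_start, time_end] span overlaps [start, end]."""
    return [
        ncfile
        for ncfile, time_start, time_end in files
        if time_start <= end and time_end >= start
    ]


needed = files_in_range(files, date(1958, 5, 1), date(1958, 8, 31))
print(len(needed), "of", len(files), "files need to be opened")
```

With a full year requested all four files would be opened, while the four-month range above touches only two of them.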

Once you have selected a variable and confirmed that the frequency and date range are correct, push the "Load" button and the data will be loaded into an `xarray.DataArray` object. The metadata of the loaded data is then displayed at the end of the cell output.

The relevant command used to load the data is displayed, so that it can be copied, reused, and/or modified.

The loaded data is available as the `.data` attribute of the `ExperimentExplorer` object. At any time a different variable from the same or a different experiment can be loaded, and the `.data` attribute will be updated to reflect the new data.

In [13]:
ee = explore.ExperimentExplorer(session=session, experiment='01deg_jra55v140_iaf')
ee

ExperimentExplorer(children=(HTML(value='\n            <h3>Experiment Explorer</h3>\n\n            <p>Select a…

In [14]:
ee.data

Unnamed: 0,Array,Chunk
Bytes,3.85 GB,1.56 MB
Shape,"(99, 2700, 3600)","(1, 540, 720)"
Count,4983 Tasks,2475 Chunks
Type,float32,numpy.ndarray
