# Using CleF - Climate Finder to discover ESGF data at NCI

This notebook shows examples of how to use the CleF (Climate Finder) Python module to search for ESGF data on the NCI server. <br>
Currently the tool is set up for CMIP5 and CMIP6 data, but other ESGF datasets such as CORDEX will be available in the future. <br> 

CleF is currently installed in the CMS conda module analysis3. This is managed by the CMS and is available simply by running
  >  module use /g/data3/hh5/public/modules <br>
  >  module load conda/analysis3
  
You can also use the module interactively, but for the moment we will use its command-line options. <br>
Let's start!

## Command syntax

In [None]:
# run this if you haven't done so already in the terminal
#module use /g/data3/hh5/public/modules
#module load conda/analysis3

In [1]:
!clef

Usage: clef [OPTIONS] COMMAND [ARGS]...

Options:
  --remote   returns only ESGF search results
  --local    returns only local files matching arguments in the CLEF database
  --missing  returns only missing files matching ESGF search
  --request  send NCI request to download missing files matching ESGF search
  --debug    Show debug info
  --help     Show this message and exit.

Commands:
  cmip5  Search ESGF and local database for CMIP5 files Constraints can be...
  cmip6  Search ESGF and local database for CMIP6 files Constraints can be...
  ds     Search local database for non-ESGF datasets


By simply running the command **clef** with no arguments, the tool shows the help message and then exits; this is equivalent to 
> clef --help <br>

We can see there are currently three sub-commands: **ds** to query non-ESGF collections, and one for each CMIP dataset, **cmip5** and **cmip6**.  <br>
There are also six options that can be passed before the sub-commands; one we have already seen is *--help*. Most of the others modify how the tool deals with the main query output. We will have a look at them and at **ds** later. <br>
Let's start by querying some CMIP5 data. To see what we can pass to the **cmip5** sub-command we can simply run it with its *--help* option.

## CMIP5

In [2]:
!clef cmip5 --help

Usage: clef cmip5 [OPTIONS] [QUERY]...

  Search ESGF and local database for CMIP5 files

  Constraints can be specified multiple times, in which case they are
  combined    using OR: -v tas -v tasmin will return anything matching
  variable = 'tas' or variable = 'tasmin'. The --latest flag will check ESGF
  for the latest version available, this is the default behaviour

Options:
  -e, --experiment x              CMIP5 experiment: piControl, rcp85, amip ...
  --experiment_family [Atmos-only|Control|Decadal|ESM|Historical|Idealized|Paleo|RCP]
                                  CMIP5 experiment family: Decadal, RCP ...
  -m, --model x                   CMIP5 model acronym: ACCESS1.3, MIROC5 ...
  -t, --table, --mip [Amon|Omon|OImon|LImon|Lmon|6hrPlev|6hrLev|3hr|Oclim|Oyr|aero|cfOff|cfSites|cfMon|cfDay|cf3hr|day|fx|grids]
  -v, --variable x                Variable name as shown in filanames: tas,
                                  pr, sic ...
  -en, --ensemble, --member TE

### Passing arguments and options

The *help* shows all the constraints we can pass to the tool. There are also some additional options which can change the way we run our query; for the moment we can ignore these and use their default values. <br>
Some of the constraints can be passed using an abbreviation, like *-v* instead of *--variable*. This is handy once you are more familiar with the tool. <br>
The same option can have more than one name: for example, *--ensemble* can also be passed as *--member*, because the terminology has changed between CMIP5 and CMIP6. <br>
You can pass as many constraints as you want, and pass the same constraint more than once. Let's see what happens, though, if we do not pass any constraints.

In [3]:
!clef cmip5

None
Too many results 3766700, try limiting your search:
  https://esgf.nci.org.au/search/esgf-nci?query=&type=File&distrib=True&replica=False&latest=True&project=CMIP5


In [4]:
!clef cmip5 --variable tasmin --experiment historical --table day --ensemble r2i1p1s

None
No matches found on ESGF, check at https://esgf.nci.org.au/search/esgf-nci?query=&type=File&distrib=True&replica=False&latest=True&project=CMIP5&ensemble=r2i1p1s&experiment=historical&cmor_table=day&variable=tasmin


Oops, that wasn't reasonable! I misspelled the ensemble: "r2i1p1s" does not exist, and the tool is telling me it cannot find any matches.

In [5]:
!clef cmip5 --variable tasmin --experiment historical --table days --ensemble r2i1p1

Usage: clef cmip5 [OPTIONS] [QUERY]...
Try "clef cmip5 --help" for help.

Error: Invalid value for "--table" / "--mip" / "-t": invalid choice: days. (choose from Amon, Omon, OImon, LImon, Lmon, 6hrPlev, 6hrLev, 3hr, Oclim, Oyr, aero, cfOff, cfSites, cfMon, cfDay, cf3hr, day, fx, grids)


I made another spelling mistake. In this case the tool knows that I passed a wrong value and lists all the available options for the CMOR table. Eventually we are aiming to validate all the arguments we can, although for some (ensemble, for example) it is not possible to list all the possible values.
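The check that rejected "days" can be sketched in plain Python. This is only an illustration and not CleF's own code: the helper name `check_table` is hypothetical, and the list of tables is copied from the error message above.

```python
# CMOR tables copied from the error message printed by `clef cmip5` above.
CMIP5_TABLES = [
    "Amon", "Omon", "OImon", "LImon", "Lmon", "6hrPlev", "6hrLev", "3hr",
    "Oclim", "Oyr", "aero", "cfOff", "cfSites", "cfMon", "cfDay", "cf3hr",
    "day", "fx", "grids",
]


def check_table(value, choices=CMIP5_TABLES):
    """Return value unchanged if it is a known table, otherwise raise ValueError.

    Hypothetical helper: it mimics the behaviour seen above, it is not part of clef.
    """
    if value not in choices:
        raise ValueError(
            f"invalid choice: {value}. (choose from {', '.join(choices)})"
        )
    return value
```

A free-form argument such as the ensemble cannot be checked this way, because there is no fixed list of values to compare against.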

In [6]:
!clef cmip5 --variable tasmin --experiment historical --table day --ensemble r2i1p1

None
/g/data1/rr3/publications/CMIP5/output1/CSIRO-QCCCE/CSIRO-Mk3-6-0/historical/day/atmos/day/r2i1p1/files/tasmin_20110518/
/g/data1b/al33/replicas/CMIP5/combined/CCCma/CanCM4/historical/day/atmos/day/r2i1p1/v20120207/tasmin/
/g/data1b/al33/replicas/CMIP5/combined/CCCma/CanCM4/historical/day/atmos/day/r2i1p1/v20120612/tasmin/
/g/data1b/al33/replicas/CMIP5/combined/CCCma/CanESM2/historical/day/atmos/day/r2i1p1/v20120410/tasmin/
/g/data1b/al33/replicas/CMIP5/combined/CNRM-CERFACS/CNRM-CM5/historical/day/atmos/day/r2i1p1/v20120703/tasmin/
/g/data1b/al33/replicas/CMIP5/combined/IPSL/IPSL-CM5A-LR/historical/day/atmos/day/r2i1p1/v20130506/tasmin/
/g/data1b/al33/replicas/CMIP5/combined/IPSL/IPSL-CM5A-MR/historical/day/atmos/day/r2i1p1/v20130506/tasmin/
/g/data1b/al33/replicas/CMIP5/combined/LASG-IAP/FGOALS-s2/historical/day/atmos/day/r2i1p1/v20161204/tasmin/
/g/data1b/al33/replicas/CMIP5/combined/MIROC/MIROC-ESM/historical/day/atmos/day/r2i1p1/v20120710/tasmin/
/g/data1b/al33/replicas/CMIP5

The tool first searches the ESGF for all the files that match the constraints we passed. It then looks for these files locally and, if it finds them, returns their paths on raijin.
For all the files it can't find locally, the tool checks an NCI table listing the downloads NCI is working on. Finally, it lists the missing datasets which are in the download queue, followed by the datasets that are not available locally and that no one has requested yet. <br>

The tool lists the dataset paths and dataset_ids; if you want a more detailed list by file, pass the *--format file* option. <br>

The query by default returns the latest available version. What if we want to have a look at all the available versions?

In [7]:
!clef cmip5 --variable tasmin --experiment historical --table Amon -m ACCESS1.0 --all-versions --format file

None
/g/data1/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-0/historical/mon/atmos/Amon/r1i1p1/files/tasmin_20120115/tasmin_Amon_ACCESS1-0_historical_r1i1p1_185001-200512.nc
/g/data1/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-0/historical/mon/atmos/Amon/r2i1p1/files/tasmin_20130726/tasmin_Amon_ACCESS1-0_historical_r2i1p1_185001-200512.nc
/g/data1/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-0/historical/mon/atmos/Amon/r3i1p1/files/tasmin_20140402/tasmin_Amon_ACCESS1-0_historical_r3i1p1_185001-200512.nc

Everything available on ESGF is also available locally


The option *--all-versions* is the reverse of *--latest*, which is the default, so we get a list of all available versions. <br>
Since all the ACCESS1.0 data is available at NCI (which is the authoritative source for the ACCESS models), the tool shouldn't find any missing datasets. If it does, please let us know.
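If you list all versions and then want to pick the most recent one yourself, a small helper like the one below may be enough. It is a sketch under a stated assumption: version labels are digits with an optional leading "v" (for example "20130726" or "v20180626"). The name `latest_version` is hypothetical, and this is not how clef or the ESGF decide which version is the latest.

```python
def latest_version(versions):
    """Pick the most recent version label from an iterable of strings.

    Assumes labels are digits with an optional leading 'v'; None entries
    (datasets without a recorded version) are ignored.
    """
    known = [v for v in versions if v is not None]
    if not known:
        return None
    # Compare the numeric part, so that '1' sorts before '20120705'.
    return max(known, key=lambda v: int(v.lstrip("v")))
```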

## CMIP6

In [8]:
!clef cmip6 --help

Usage: clef cmip6 [OPTIONS] [QUERY]...

  Search ESGF and local database for CMIP6 files Constraints can be
  specified multiple times, in which case they are combined using OR:  -v
  tas -v tasmin will return anything matching variable = 'tas' or variable =
  'tasmin'. The --latest flag will check ESGF for the latest version
  available, this is the default behaviour

Options:
  -mip, --activity [AerChemMIP|C4MIP|CDRMIP|CFMIP|CMIP|CORDEX|DAMIP|DCPP|DynVarMIP|FAFMIP|GMMIP|GeoMIP|HighResMIP|ISMIP6|LS3MIP|LUMIP|OMIP|PAMIP|PMIP|RFMIP|SIMIP|ScenarioMIP|VIACSAB|VolMIP]
  -e, --experiment x              CMIP6 experiment, list of available depends
                                  on activity
  --source_type [AER|AGCM|AOGCM|BGC|CHEM|ISM|LAND|OGCM|RAD|SLAB]
  -t, --table x                   CMIP6 CMOR table: Amon, SIday, Oday ...
  -m, --model, --source_id x      CMIP6 model id: GFDL-AM4, CNRM-CM6-1 ...
  -v, --variable x                CMIP6 variable name as in filenames
  -mi

The **cmip6** sub-command works in the same way, but some constraints are different. As well as changes in terminology, CMIP6 has more attributes (*facets*) that can be used to select the data. <br>
Examples of these are the **activity** which groups experiments, **resolution** which is an approximation of the actual resolution and **grid**.

### Controlling the output: clef options

In [9]:
!clef --local cmip6 -e 1pctCO2 -t Amon -v tasmax -v tasmin -g gr

/g/data1b/oi10/replicas/CMIP6/CMIP/CNRM-CERFACS/CNRM-CM6-1/1pctCO2/r1i1p1f2/Amon/tasmax/gr/v20180626
/g/data1b/oi10/replicas/CMIP6/CMIP/CNRM-CERFACS/CNRM-ESM2-1/1pctCO2/r1i1p1f2/Amon/tasmax/gr/v20181018
/g/data1b/oi10/replicas/CMIP6/CMIP/EC-Earth-Consortium/EC-Earth3-Veg/1pctCO2/r1i1p1f1/Amon/tasmax/gr/v20190702
/g/data1b/oi10/replicas/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/1pctCO2/r1i1p1f1/Amon/tasmax/gr/v20180727
/g/data1b/oi10/replicas/CMIP6/CMIP/CNRM-CERFACS/CNRM-CM6-1/1pctCO2/r1i1p1f2/Amon/tasmin/gr/v20180626
/g/data1b/oi10/replicas/CMIP6/CMIP/CNRM-CERFACS/CNRM-ESM2-1/1pctCO2/r1i1p1f2/Amon/tasmin/gr/v20181018
/g/data1b/oi10/replicas/CMIP6/CMIP/EC-Earth-Consortium/EC-Earth3-Veg/1pctCO2/r1i1p1f1/Amon/tasmin/gr/v20190702
/g/data1b/oi10/replicas/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/1pctCO2/r1i1p1f1/Amon/tasmin/gr/v20180727


In this example we used the *--local* option of the main command **clef** to get only the paths of the matching local data as output. <br> 
Note also that:
- we are using abbreviations for the options where available; 
- we are passing the variable *-v* option twice; 
- we used the CMIP6-specific option *-g/--grid* to search for all data that is not on the model native grid. The grid label does not indicate a grid common to all CMIP6 output, only one specific to each model; the same is true for member_id and other attributes.<br>

*--local* actually executes the query directly on the CLEF database, which is different from the default query, where the search is executed first on the ESGF and its results are then matched locally.<br>
In the example above the final result is exactly the same whichever way we perform the query. Searching locally can give you more results if a node is offline or if a version has been unpublished from the ESGF but is still available locally.

In [10]:
!clef --missing cmip6 -e 1pctCO2 -v clw -v clwvi -t Amon -g gr

None

Available on ESGF but not locally:
CMIP6.CMIP.CAS.FGOALS-f3-L.1pctCO2.r1i1p1f1.Amon.clwvi.gr.v20191020
CMIP6.CMIP.CAS.FGOALS-f3-L.1pctCO2.r2i1p1f1.Amon.clw.gr.v20191020
CMIP6.CMIP.CAS.FGOALS-f3-L.1pctCO2.r2i1p1f1.Amon.clwvi.gr.v20191020
CMIP6.CMIP.CAS.FGOALS-f3-L.1pctCO2.r3i1p1f1.Amon.clw.gr.v20191020
CMIP6.CMIP.CAS.FGOALS-f3-L.1pctCO2.r3i1p1f1.Amon.clwvi.gr.v20191020
CMIP6.CMIP.CNRM-CERFACS.CNRM-CM6-1.1pctCO2.r1i1p1f2.Amon.clw.gr.v20180626
CMIP6.CMIP.CNRM-CERFACS.CNRM-CM6-1.1pctCO2.r1i1p1f2.Amon.clwvi.gr.v20180626
CMIP6.CMIP.CNRM-CERFACS.CNRM-ESM2-1.1pctCO2.r1i1p1f2.Amon.clw.gr.v20181018
CMIP6.CMIP.CNRM-CERFACS.CNRM-ESM2-1.1pctCO2.r1i1p1f2.Amon.clwvi.gr.v20181018
CMIP6.CMIP.CNRM-CERFACS.CNRM-ESM2-1.1pctCO2.r2i1p1f2.Amon.clw.gr.v20181031
CMIP6.CMIP.CNRM-CERFACS.CNRM-ESM2-1.1pctCO2.r2i1p1f2.Amon.clwvi.gr.v20181031
CMIP6.CMIP.CNRM-CERFACS.CNRM-ESM2-1.1pctCO2.r3i1p1f2.Amon.clw.gr.v20181107
CMIP6.CMIP.CNRM-CERFACS.CNRM-ESM2-1.1pctCO2.r3i1p1f2.Amon.clwvi.gr.v20181107
CMIP6.CMIP.CNRM-C

This time we used the *--missing* option and the tool returned only the results matching the constraints that are available on the ESGF but not locally (we changed variables to make sure to get some missing data back).
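Each dataset_id in the list packs several attributes into one dotted string. To summarise a long list, for example by model, you could split the ids into facets. The sketch below assumes the ten-component CMIP6 layout seen in the output above; the helper name `parse_dataset_id` is hypothetical and not part of clef.

```python
# Facet names in the order they appear in a CMIP6 dataset_id.
CMIP6_FACETS = (
    "mip_era", "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label", "version",
)


def parse_dataset_id(dataset_id):
    """Split a CMIP6 dataset_id into a dict of facets."""
    values = dataset_id.split(".")
    if len(values) != len(CMIP6_FACETS):
        raise ValueError(f"unexpected dataset_id: {dataset_id}")
    return dict(zip(CMIP6_FACETS, values))


facets = parse_dataset_id(
    "CMIP6.CMIP.CAS.FGOALS-f3-L.1pctCO2.r1i1p1f1.Amon.clwvi.gr.v20191020"
)
print(facets["source_id"], facets["variable_id"])
```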

In [11]:
!clef --remote cmip6 -e 1pctCO2 -v tasmin -t Amon -g gr

None
CMIP6.CMIP.CNRM-CERFACS.CNRM-CM6-1.1pctCO2.r1i1p1f2.Amon.tasmin.gr.v20180626
CMIP6.CMIP.CNRM-CERFACS.CNRM-ESM2-1.1pctCO2.r1i1p1f2.Amon.tasmin.gr.v20181018
CMIP6.CMIP.CNRM-CERFACS.CNRM-ESM2-1.1pctCO2.r2i1p1f2.Amon.tasmin.gr.v20181031
CMIP6.CMIP.CNRM-CERFACS.CNRM-ESM2-1.1pctCO2.r3i1p1f2.Amon.tasmin.gr.v20181107
CMIP6.CMIP.CNRM-CERFACS.CNRM-ESM2-1.1pctCO2.r4i1p1f2.Amon.tasmin.gr.v20190328
CMIP6.CMIP.EC-Earth-Consortium.EC-Earth3-Veg.1pctCO2.r1i1p1f1.Amon.tasmin.gr.v20190702
CMIP6.CMIP.IPSL.IPSL-CM6A-LR.1pctCO2.r1i1p1f1.Amon.tasmin.gr.v20180727


The *--remote* option returns the Dataset_ids of the data matching the constraints, regardless of whether they are available locally or not.
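Local results are directory paths, while remote results are dataset_ids. In the CMIP6 outputs of this notebook the two carry the same components, so one can be derived from the other when comparing lists. The sketch below relies on that observation only; the helper name `path_to_dataset_id` is hypothetical, and the mapping is not guaranteed for other collections (the CMIP5 paths shown earlier follow a different layout).

```python
from pathlib import PurePosixPath


def path_to_dataset_id(path, project="CMIP6"):
    """Turn a local CMIP6 directory path into a dotted dataset_id.

    Assumes everything from the project directory onwards maps one-to-one
    onto the dataset_id components, as in the outputs shown in this notebook.
    """
    parts = PurePosixPath(path).parts
    start = parts.index(project)  # raises ValueError if project is not in path
    return ".".join(parts[start:])


local = "/g/data1b/oi10/replicas/CMIP6/CMIP/CNRM-CERFACS/CNRM-CM6-1/1pctCO2/r1i1p1f2/Amon/tasmin/gr/v20180626"
print(path_to_dataset_id(local))
```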

In [12]:
!clef --remote cmip6 -e 1pctCO2 -v tasmin -t Amon -g gr -mi r1i1p1f2 --format file

None
CMIP6.CMIP.CNRM-CERFACS.CNRM-CM6-1.1pctCO2.r1i1p1f2.Amon.tasmin.gr.v20180626.tasmin_Amon_CNRM-CM6-1_1pctCO2_r1i1p1f2_gr_185001-199912.nc
CMIP6.CMIP.CNRM-CERFACS.CNRM-ESM2-1.1pctCO2.r1i1p1f2.Amon.tasmin.gr.v20181018.tasmin_Amon_CNRM-ESM2-1_1pctCO2_r1i1p1f2_gr_185001-199912.nc


Running the same command with the option *--format file* after the sub-command will return the File_ids instead of the default Dataset_ids. <br>
Please note that *--local*, *--remote* and *--missing*, together with *--request*, which we will look at next, are all options of the main command **clef** and need to come before any sub-command.
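If you build **clef** commands from a script, the ordering rule is easy to get wrong. The sketch below keeps the main-command options before the sub-command and everything else after it. The helper `build_clef_command` is hypothetical; it only assembles an argument list and does not call clef.

```python
def build_clef_command(subcommand, constraints, global_flags=(), sub_flags=()):
    """Assemble an argument list such as ['clef', '--local', 'cmip6', '-v', 'tas'].

    constraints  -- list of (option, value) pairs; repeat a pair's option to
                    pass the same constraint more than once
    global_flags -- options of the main command, placed before the sub-command
    sub_flags    -- extra sub-command arguments, e.g. ['--format', 'file']
    """
    cmd = ["clef", *global_flags, subcommand]
    for option, value in constraints:
        cmd += [option, value]
    cmd += list(sub_flags)
    return cmd


cmd = build_clef_command(
    "cmip6",
    [("-e", "1pctCO2"), ("-v", "tasmin"), ("-t", "Amon"), ("-g", "gr")],
    global_flags=["--remote"],
    sub_flags=["--format", "file"],
)
print(" ".join(cmd))
```

On a system where clef is installed, the resulting list could be passed to `subprocess.run`.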

## Requesting new data

What should we do if we find out that some data we are interested in has not been downloaded or requested yet? <br>
This is a complex data collection, so NCI, in consultation with the community, decided the best way to manage it was to have one point of reference. Part of this agreement is that NCI will download the files and update the database that **clef** is interrogating. A priority list was also agreed with the community, and NCI downloads anything that falls into it as soon as it becomes available. <br> <br>
Users can then request from the NCI helpdesk other combinations of variables, experiments etc. that do not fall into this list. <br>
The list is available from the NCI climate confluence website: <br>
Even without consulting the list you can use **clef**, as we demonstrated above, to search for a particular dataset; if it is not queued or downloaded already, **clef** will give you the option to request it from NCI. <br>
Let's see how it works.

In [13]:
%%bash
clef --request cmip6 -e 1pctCO2 -v clw -v clwvi -t Amon -g gr
no

None

Available on ESGF but not locally:
CMIP6.CMIP.CAS.FGOALS-f3-L.1pctCO2.r1i1p1f1.Amon.clwvi.gr.v20191020
CMIP6.CMIP.CAS.FGOALS-f3-L.1pctCO2.r2i1p1f1.Amon.clw.gr.v20191020
CMIP6.CMIP.CAS.FGOALS-f3-L.1pctCO2.r2i1p1f1.Amon.clwvi.gr.v20191020
CMIP6.CMIP.CAS.FGOALS-f3-L.1pctCO2.r3i1p1f1.Amon.clw.gr.v20191020
CMIP6.CMIP.CAS.FGOALS-f3-L.1pctCO2.r3i1p1f1.Amon.clwvi.gr.v20191020
CMIP6.CMIP.CNRM-CERFACS.CNRM-CM6-1.1pctCO2.r1i1p1f2.Amon.clw.gr.v20180626
CMIP6.CMIP.CNRM-CERFACS.CNRM-CM6-1.1pctCO2.r1i1p1f2.Amon.clwvi.gr.v20180626
CMIP6.CMIP.CNRM-CERFACS.CNRM-ESM2-1.1pctCO2.r1i1p1f2.Amon.clw.gr.v20181018
CMIP6.CMIP.CNRM-CERFACS.CNRM-ESM2-1.1pctCO2.r1i1p1f2.Amon.clwvi.gr.v20181018
CMIP6.CMIP.CNRM-CERFACS.CNRM-ESM2-1.1pctCO2.r2i1p1f2.Amon.clw.gr.v20181031
CMIP6.CMIP.CNRM-CERFACS.CNRM-ESM2-1.1pctCO2.r2i1p1f2.Amon.clwvi.gr.v20181031
CMIP6.CMIP.CNRM-CERFACS.CNRM-ESM2-1.1pctCO2.r3i1p1f2.Amon.clw.gr.v20181107
CMIP6.CMIP.CNRM-CERFACS.CNRM-ESM2-1.1pctCO2.r3i1p1f2.Amon.clwvi.gr.v20181107
CMIP6.CMIP.CNRM-C

We ran the same query as before, which returned a list of missing datasets, but this time we used the *--request* option after **clef**.<br>
The tool executes the query remotely, then looks for matches locally and on the NCI download list. Having found none, it gives us the option of putting in a request. <br>
It will accept any of the following as a positive answer:
> Y  YES y yes <br>

With anything else, or if you don't pass anything, it will assume you don't want to put in a request.<br>
It still saves the request in a file we can use later.<br>

In [14]:
!cat CMIP6_*.txt

dataset_id=CMIP6.CMIP.CAS.FGOALS-f3-L.1pctCO2.r1i1p1f1.Amon.clwvi.gr.v20191020
dataset_id=CMIP6.CMIP.CAS.FGOALS-f3-L.1pctCO2.r2i1p1f1.Amon.clw.gr.v20191020
dataset_id=CMIP6.CMIP.CAS.FGOALS-f3-L.1pctCO2.r2i1p1f1.Amon.clwvi.gr.v20191020
dataset_id=CMIP6.CMIP.CAS.FGOALS-f3-L.1pctCO2.r3i1p1f1.Amon.clw.gr.v20191020
dataset_id=CMIP6.CMIP.CAS.FGOALS-f3-L.1pctCO2.r3i1p1f1.Amon.clwvi.gr.v20191020
dataset_id=CMIP6.CMIP.CNRM-CERFACS.CNRM-CM6-1.1pctCO2.r1i1p1f2.Amon.clw.gr.v20180626
dataset_id=CMIP6.CMIP.CNRM-CERFACS.CNRM-CM6-1.1pctCO2.r1i1p1f2.Amon.clwvi.gr.v20180626
dataset_id=CMIP6.CMIP.CNRM-CERFACS.CNRM-ESM2-1.1pctCO2.r1i1p1f2.Amon.clw.gr.v20181018
dataset_id=CMIP6.CMIP.CNRM-CERFACS.CNRM-ESM2-1.1pctCO2.r1i1p1f2.Amon.clwvi.gr.v20181018
dataset_id=CMIP6.CMIP.CNRM-CERFACS.CNRM-ESM2-1.1pctCO2.r2i1p1f2.Amon.clw.gr.v20181031
dataset_id=CMIP6.CMIP.CNRM-CERFACS.CNRM-ESM2-1.1pctCO2.r2i1p1f2.Amon.clwvi.gr.v20181031
dataset_id=CMIP6.CMIP.CNRM-CERFACS.CNRM-ESM2-1.1pctCO2.r3i1p1f2.Amon.clw.gr.v2

If I had answered 'yes', the tool would have sent an e-mail to the NCI helpdesk with the text file attached; NCI can pass that file as input to their download tool and queue your request.
NB: if you are running clef from raijin you cannot send an e-mail, so in that case the tool will remind you that you need to e-mail the NCI helpdesk yourself to finalise the request.
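The saved request file is plain text with one `dataset_id=...` entry per line, so it is easy to read back, for example to check what you are about to request. A minimal sketch, assuming exactly the line format shown by `cat` above; the helper name `read_request_lines` is hypothetical.

```python
def read_request_lines(lines):
    """Return the dataset_ids found in lines of the form 'dataset_id=<id>'."""
    ids = []
    for line in lines:
        key, sep, value = line.strip().partition("=")
        # Skip blank lines and anything that is not a dataset_id entry.
        if sep and key == "dataset_id":
            ids.append(value)
    return ids


sample = [
    "dataset_id=CMIP6.CMIP.CAS.FGOALS-f3-L.1pctCO2.r1i1p1f1.Amon.clwvi.gr.v20191020\n",
    "dataset_id=CMIP6.CMIP.CAS.FGOALS-f3-L.1pctCO2.r2i1p1f1.Amon.clw.gr.v20191020\n",
    "\n",
]
print(read_request_lines(sample))
```

An open file object can be passed in place of the list, since iterating over a file yields its lines.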

## Integrating the local query in your scripts

Until now we have looked at how to run queries from the command line, but you can use the same query run by the *--local* option directly in your Python code. By doing so you also get access to a lot more information on the datasets returned, not only the path.<br>
To do so we first have to import some functions from the clef.code sub-module: in particular the **search()** function, and **connect()** and **Session()**, which we'll use to open a connection to the database.

In [15]:
from clef.code import *
db = connect()
s = Session()

### Running search()

**search()** takes 4 inputs: the db session, the project (i.e. currently 'cmip5' or 'cmip6'), latest (True or False) and a dictionary containing the query constraints:
> search(session, project='CMIP5', latest=True, **kwargs)<br>

Let's start by defining some constraints.

In [16]:
constraints = {'variable': 'tas', 'model': 'MIROC5', 'cmor_table': 'day', 'experiment': 'rcp85'}

The available keys depend on the project you are querying and the attributes stored by the database. You can use any of the *facets* used for ESGF but in future we will be adding other options based on extra fields which are stored as attributes.

In [17]:
results = search(s, project='CMIP5', **constraints)
results

[{'filenames': ['tas_day_MIROC5_rcp85_r1i1p1_20100101-20191231.nc',
   'tas_day_MIROC5_rcp85_r1i1p1_20900101-20991231.nc',
   'tas_day_MIROC5_rcp85_r1i1p1_20300101-20391231.nc',
   'tas_day_MIROC5_rcp85_r1i1p1_20400101-20491231.nc',
   'tas_day_MIROC5_rcp85_r1i1p1_20500101-20591231.nc',
   'tas_day_MIROC5_rcp85_r1i1p1_20800101-20891231.nc',
   'tas_day_MIROC5_rcp85_r1i1p1_21000101-21001231.nc',
   'tas_day_MIROC5_rcp85_r1i1p1_20060101-20091231.nc',
   'tas_day_MIROC5_rcp85_r1i1p1_20600101-20691231.nc',
   'tas_day_MIROC5_rcp85_r1i1p1_20700101-20791231.nc',
   'tas_day_MIROC5_rcp85_r1i1p1_20200101-20291231.nc'],
  'project': 'CMIP5',
  'institute': 'MIROC',
  'model': 'MIROC5',
  'experiment': 'rcp85',
  'frequency': 'day',
  'realm': 'atmos',
  'r': '1',
  'i': '1',
  'p': '1',
  'ensemble': 'r1i1p1',
  'cmor_table': 'day',
  'version': '20120710',
  'variable': 'tas',
  'pdir': '/g/data1b/al33/replicas/CMIP5/combined/MIROC/MIROC5/rcp85/day/atmos/day/r1i1p1/v20120710/tas',
  'periods':

Both the keys and values of the constraints get checked before being passed to the query function. This means that if you pass a key or a value that doesn't exist for the chosen project, the function will print a list of valid values and then exit.<br>
Let's re-write the constraints dictionary to show an example.

In [18]:
constraints = {'v': 'tas', 'm': 'MIROC5', 'table': 'day', 'experiment': 'rcp85', 'activity': 'CMIP'}
results = search(s, **constraints)

ClefException: Warning activity is not a valid constraint nameValid constraints are:
dict_values([['source_id', 'model', 'm'], ['realm'], ['time_frequency', 'frequency', 'f'], ['variable_id', 'variable', 'v'], ['experiment_id', 'experiment', 'e'], ['table_id', 'table', 'cmor_table', 't'], ['member_id', 'member', 'ensemble', 'en', 'mi'], ['institution_id', 'institution', 'institute'], ['experiment_family']])

You can see that the function told us 'activity' is not a valid constraint for CMIP5; in fact it can be used only with CMIP6.<br>
NB: the search accepted all the other abbreviations; there are a few terms that can be used for each key.<br>
The full list of valid keys is available from the GitHub repository:<br>
https://github.com/coecms/clef/blob/master/clef/data/valid_keys.json
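To see how several names can point to the same constraint, here is a small sketch that maps any accepted name to the first name of its group. The groups are copied from the CMIP5 error message above. Treating the first entry as the "canonical" name is an assumption made for this illustration, and `canonical_key` is a hypothetical helper, not clef's own lookup.

```python
# Groups of equivalent constraint names, copied from the ClefException above.
VALID_KEYS = [
    ["source_id", "model", "m"],
    ["realm"],
    ["time_frequency", "frequency", "f"],
    ["variable_id", "variable", "v"],
    ["experiment_id", "experiment", "e"],
    ["table_id", "table", "cmor_table", "t"],
    ["member_id", "member", "ensemble", "en", "mi"],
    ["institution_id", "institution", "institute"],
    ["experiment_family"],
]


def canonical_key(key, groups=VALID_KEYS):
    """Return the first name of the group that contains key."""
    for names in groups:
        if key in names:
            return names[0]
    raise KeyError(f"{key} is not a valid constraint name")


print(canonical_key("v"), canonical_key("table"), canonical_key("ensemble"))
```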

In [19]:
constraints = {'v': 'tas', 'm': 'MIROC5', 'table': 'day', 'experiment': 'rcp85', 'member': 'r1i1p1'}
results = search(s, **constraints)
results[0]

{'filenames': ['tas_day_MIROC5_rcp85_r1i1p1_20100101-20191231.nc',
  'tas_day_MIROC5_rcp85_r1i1p1_20900101-20991231.nc',
  'tas_day_MIROC5_rcp85_r1i1p1_20300101-20391231.nc',
  'tas_day_MIROC5_rcp85_r1i1p1_20400101-20491231.nc',
  'tas_day_MIROC5_rcp85_r1i1p1_20500101-20591231.nc',
  'tas_day_MIROC5_rcp85_r1i1p1_20800101-20891231.nc',
  'tas_day_MIROC5_rcp85_r1i1p1_21000101-21001231.nc',
  'tas_day_MIROC5_rcp85_r1i1p1_20060101-20091231.nc',
  'tas_day_MIROC5_rcp85_r1i1p1_20600101-20691231.nc',
  'tas_day_MIROC5_rcp85_r1i1p1_20700101-20791231.nc',
  'tas_day_MIROC5_rcp85_r1i1p1_20200101-20291231.nc'],
 'project': 'CMIP5',
 'institute': 'MIROC',
 'model': 'MIROC5',
 'experiment': 'rcp85',
 'frequency': 'day',
 'realm': 'atmos',
 'r': '1',
 'i': '1',
 'p': '1',
 'ensemble': 'r1i1p1',
 'cmor_table': 'day',
 'version': '20120710',
 'variable': 'tas',
 'pdir': '/g/data1b/al33/replicas/CMIP5/combined/MIROC/MIROC5/rcp85/day/atmos/day/r1i1p1/v20120710/tas',
 'periods': [('20100101', '20191231')

NB: *project* is 'CMIP5' by default, so it can be omitted when querying CMIP5 data, and *latest* is True by default. Set *latest* to *False* if you want to return all the available versions.
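Each result is a dictionary, so turning it into a list of files to open takes only a few lines. The sketch below assumes the 'pdir' and 'filenames' keys shown in the output above; `full_paths` is a hypothetical helper and the sample record is a shortened copy of that output.

```python
import posixpath


def full_paths(results):
    """List the full path of every file in a list of search()-style results.

    Assumes each result is a dict with 'pdir' and 'filenames' keys.
    """
    paths = []
    for dataset in results:
        # Sort so that the files come out in chronological order.
        for name in sorted(dataset["filenames"]):
            paths.append(posixpath.join(dataset["pdir"], name))
    return paths


sample = [{
    "pdir": "/g/data1b/al33/replicas/CMIP5/combined/MIROC/MIROC5/rcp85/day/atmos/day/r1i1p1/v20120710/tas",
    "filenames": ["tas_day_MIROC5_rcp85_r1i1p1_20100101-20191231.nc",
                  "tas_day_MIROC5_rcp85_r1i1p1_20060101-20091231.nc"],
}]
for path in full_paths(sample):
    print(path)
```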

#### Running search() for different sets of attributes

The **search()** function works for one set of attributes: you can specify only one value for each attribute at a time. If you want to run a query for two or more different sets of attributes you can call **search()** in a loop. If you have a small number of queries this is easy to implement and run. To make **search()** work for an arbitrary number of inputs passed from the command line, we set up a function, **call_local_query()**, that deals with this more efficiently.<br>
The arguments are very similar to those of **search()**, with the important difference that we are passing lists of values instead of strings:<br>
>call_local_query(s, project, oformat, latest, **kwargs)

Let's look at an example:

In [20]:
constraints = {'variable': ['tasmin','tasmax'], 'model': ['MIROC5','MIROC4h'],
               'cmor_table': ['day'], 'experiment': ['rcp85'], 'ensemble': ['r1i1p1']}
results, paths = call_local_query(s, project='CMIP5', oformat='Dataset', latest=True, **constraints)

Because this function was created to deliver results for the command-line local query option, it outputs a list of paths as well as the list of results. Under the hood this function works out all the combinations of the arguments you passed and runs **search()** for each of them; before doing so it also runs other functions that check that the keys and values passed are valid.<br><br>
The extra arguments *oformat* and *latest* are necessary to resolve the command-line *--format* and *--latest* options respectively. The first can be 'file' or 'dataset', with the latter being the default. It influences the *paths* output but not *results*, which will contain all the dataset information including filenames.
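The "all combinations" step can be pictured with `itertools.product`. This is a sketch of the idea only, not the code inside **call_local_query()**; the helper name `expand_constraints` is hypothetical.

```python
from itertools import product


def expand_constraints(constraints):
    """Yield one single-valued dict for every combination of the listed values."""
    keys = list(constraints)
    for values in product(*(constraints[k] for k in keys)):
        yield dict(zip(keys, values))


constraints = {'variable': ['tasmin', 'tasmax'], 'model': ['MIROC5', 'MIROC4h'],
               'cmor_table': ['day'], 'experiment': ['rcp85'], 'ensemble': ['r1i1p1']}
combos = list(expand_constraints(constraints))
print(len(combos))
```

Each dictionary in `combos` has one string per key, which is the form that **search()** expects.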

### AND Filter

We have started adding features to CleF which allow more complex queries, beginning with the following case.
Let's say that you want to find all the CMIP6 models that have both monthly precipitation (pr) and soil moisture (mrso) for a particular experiment (historical). Up to now you would have had to select both variables separately and then work out on your own which models had both.

We will show how this works, starting by using the actual function interactively. There is also a command-line option, but it returns only a list of the models.<br>
First of all, since we are potentially passing more than one value to the query, we use lists in our *constraints* dictionary.<br>
Then we need to define the attributes for which we want all values to be present, only *variable_id* in this case.
Finally we tell the function which attributes define a simulation; this would most often be *model* and *member*.

In [21]:
constraints = {'variable_id': ['pr','mrso'], 'frequency': ['mon'], 'experiment_id': ['historical']}
allvalues = ['variable_id']
fixed = ['source_id', 'member_id']
results, selection = matching(s, allvalues, fixed, project='CMIP6', **constraints)

The function returns the selected model/member combinations that have both variables, and the corresponding subset of the original query *results*.<br>
NB: currently the abbreviated versions of the constraint keys won't work; you will have to use the attributes' full names.<br><br>
By printing the length of both lists and the first item of *selection*, you can see that the results have been grouped by model/ensemble and then filtered.

In [22]:
print(len(results),len(selection))
selection[0]

46 23


{'source_id': 'BCC-CSM2-MR',
 'member_id': 'r1i1p1f1',
 'comb': {('mrso',), ('pr',)},
 'table_id': {'Amon', 'Lmon'},
 'pdir': {'/g/data1b/oi10/replicas/CMIP6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/Amon/pr/gn/v20181126',
  '/g/data1b/oi10/replicas/CMIP6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/Lmon/mrso/gn/v20181114'},
 'version': {'v20181114', 'v20181126'}}

The full definition of **matching()** shows all the function arguments:<br>
>matching(session, cols, fixed, project='CMIP5', local=True, latest=True, **kwargs)

From this you can see that, like **search()**, *project* is 'CMIP5' and *latest* is True by default. We haven't yet had to use the *local* argument, which is True by default; we will see examples later where it is set to False so we can do the same query remotely.
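The grouping and filtering can be pictured with a few lines of plain Python. This is an independent sketch of the idea, not the implementation of **matching()**: the function `and_filter` is hypothetical, and the toy records use made-up models "A" and "B".

```python
from collections import defaultdict
from itertools import product


def and_filter(results, allvalues, fixed, constraints):
    """Keep the simulations that have every requested combination of values.

    results     -- list of dicts, one per dataset
    allvalues   -- attributes for which all requested values must be present
    fixed       -- attributes that together identify a simulation
    constraints -- dict of lists, as passed to the query
    """
    # Every combination a simulation must provide, e.g. {('pr',), ('mrso',)}
    required = set(product(*(constraints[a] for a in allvalues)))

    # Collect the combinations found in the results, grouped by simulation
    found = defaultdict(set)
    for r in results:
        sim = tuple(r[f] for f in fixed)
        found[sim].add(tuple(r[a] for a in allvalues))

    selection = []
    for sim, combs in found.items():
        if required <= combs:  # all required combinations are present
            entry = dict(zip(fixed, sim))
            entry["comb"] = combs
            selection.append(entry)
    return selection


records = [
    {"source_id": "A", "member_id": "r1i1p1f1", "variable_id": "pr"},
    {"source_id": "A", "member_id": "r1i1p1f1", "variable_id": "mrso"},
    {"source_id": "B", "member_id": "r1i1p1f1", "variable_id": "pr"},
]
selection = and_filter(records, ["variable_id"], ["source_id", "member_id"],
                       {"variable_id": ["pr", "mrso"]})
print(selection)
```

Model "B" is dropped because it has only one of the two requested variables.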

#### AND filter on more than one attribute

We can pass more than one value for more than one attribute; let's add *piControl* to the experiment list.

In [23]:
constraints = {'variable_id': ['pr','mrso'], 'frequency': ['mon'], 'experiment_id': ['historical', 'piControl']}
results, selection = matching(s, allvalues, fixed, project='CMIP6', **constraints)
print(len(results),len(selection))
selection[0]

100 29


{'source_id': 'BCC-CSM2-MR',
 'member_id': 'r1i1p1f1',
 'comb': {('mrso',), ('pr',)},
 'table_id': {'Amon', 'Lmon'},
 'pdir': {'/g/data1b/oi10/replicas/CMIP6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/Amon/pr/gn/v20181126',
  '/g/data1b/oi10/replicas/CMIP6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/Lmon/mrso/gn/v20181114',
  '/g/data1b/oi10/replicas/CMIP6/CMIP/BCC/BCC-CSM2-MR/piControl/r1i1p1f1/Amon/pr/gn/v20181016',
  '/g/data1b/oi10/replicas/CMIP6/CMIP/BCC/BCC-CSM2-MR/piControl/r1i1p1f1/Lmon/mrso/gn/v20181012'},
 'version': {'v20181012', 'v20181016', 'v20181114', 'v20181126'}}

As you can see, we now get many more results but only a few more combinations after applying the filter.<br>
This is because we are still defining a simulation by model and member combinations only: we haven't included experiment, so the results for the two experiments are grouped together. To fix this we need to add *experiment_id* to the *fixed* list.

In [24]:
fixed = ['source_id', 'member_id','experiment_id']
results, selection = matching(s, allvalues, fixed, project='CMIP6', **constraints)
print(len(results),len(selection))
selection[0]

98 49


{'source_id': 'BCC-CSM2-MR',
 'member_id': 'r1i1p1f1',
 'experiment_id': 'historical',
 'comb': {('mrso',), ('pr',)},
 'table_id': {'Amon', 'Lmon'},
 'pdir': {'/g/data1b/oi10/replicas/CMIP6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/Amon/pr/gn/v20181126',
  '/g/data1b/oi10/replicas/CMIP6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/Lmon/mrso/gn/v20181114'},
 'version': {'v20181114', 'v20181126'}}

If we wanted to find all model/member combinations which have both variables and both experiments, we should have kept *fixed* as it was and added *experiment_id* to the *allvalues* list instead.

In [25]:
allvalues = ['variable_id', 'experiment_id']
fixed=['source_id','member_id']
results, selection = matching(s, allvalues, fixed, project='CMIP6', **constraints)
print(len(results),len(selection))
selection[0]

80 20


{'source_id': 'BCC-CSM2-MR',
 'member_id': 'r1i1p1f1',
 'comb': {('mrso', 'historical'),
  ('mrso', 'piControl'),
  ('pr', 'historical'),
  ('pr', 'piControl')},
 'table_id': {'Amon', 'Lmon'},
 'pdir': {'/g/data1b/oi10/replicas/CMIP6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/Amon/pr/gn/v20181126',
  '/g/data1b/oi10/replicas/CMIP6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/Lmon/mrso/gn/v20181114',
  '/g/data1b/oi10/replicas/CMIP6/CMIP/BCC/BCC-CSM2-MR/piControl/r1i1p1f1/Amon/pr/gn/v20181016',
  '/g/data1b/oi10/replicas/CMIP6/CMIP/BCC/BCC-CSM2-MR/piControl/r1i1p1f1/Lmon/mrso/gn/v20181012'},
 'version': {'v20181012', 'v20181016', 'v20181114', 'v20181126'}}

#### AND filter applied to remote ESGF query

You can of course do the same query for CMIP5; in that case you can omit *project* when calling the function, since its default value is 'CMIP5'.<br>
Another default option is *local=True*, which tells the function to perform the query directly on the CLEF database. If you want, you can perform the same query on the ESGF database instead, so you can see what has been published.

In [26]:
constraints = {'variable': ['tasmin','tasmax'], 'cmor_table': ['Amon'], 'experiment': ['historical','rcp26', 'rcp85']}
allvalues = ['variable', 'experiment']
fixed=['model','ensemble']
results, selection = matching(s, allvalues, fixed, local=False, **constraints)
print(len(results),len(selection))
selection[0]

None
1488 46


{'model': 'CNRM-CM5',
 'ensemble': 'r1i1p1',
 'comb': {('tasmax', 'historical'),
  ('tasmax', 'rcp26'),
  ('tasmax', 'rcp85'),
  ('tasmin', 'historical'),
  ('tasmin', 'rcp26'),
  ('tasmin', 'rcp85')},
 'cmor_table': {'Amon'},
 'dataset_id': {'cmip5.output1.CNRM-CERFACS.CNRM-CM5.historical.mon.atmos.Amon.r1i1p1.v20110901|esg1.umr-cnrm.fr',
  'cmip5.output1.CNRM-CERFACS.CNRM-CM5.rcp26.mon.atmos.Amon.r1i1p1.v20110629|esg1.umr-cnrm.fr',
  'cmip5.output1.CNRM-CERFACS.CNRM-CM5.rcp85.mon.atmos.Amon.r1i1p1.v20110930|esg1.umr-cnrm.fr'},
 'version': {'v20110629', 'v20110901', 'v20110930'}}

Please note that I used different attribute names because we are querying CMIP5 now. <br>
*comb* highlights all the combinations that have to be present for a model/ensemble to be returned. Note also that we get a dataset_id rather than a directory path.

#### AND filter on the command line

The command line version of **matching** can be called using the *--and* flag followed by the attribute for which we want all values; the flag can be used more than once. By default model/ensemble combinations define a simulation, and only model, ensemble and version are returned as the final result.

In [27]:
!clef --local cmip5 -v tasmin -v tasmax -e rcp26 -e rcp85 -e historical -t Amon --and variable

ACCESS1.0 r1i1p1 {None}
ACCESS1.0 r2i1p1 {None}
ACCESS1.0 r3i1p1 {None}
ACCESS1.3 r1i1p1 {None}
ACCESS1.3 r2i1p1 {None}
ACCESS1.3 r3i1p1 {None}
BCC-CSM1.1 r1i1p1 {'1', '20120705'}
BCC-CSM1.1 r2i1p1 {'1'}
BCC-CSM1.1 r3i1p1 {'1'}
BCC-CSM1.1(m) r1i1p1 {'20120709', '20130405', '20120910'}
BCC-CSM1.1(m) r2i1p1 {'20120709'}
BCC-CSM1.1(m) r3i1p1 {'20120709'}
BNU-ESM r1i1p1 {'20120510'}
CCSM4 r1i1p1 {'20130426', '20160829'}
CCSM4 r1i2p1 {'20130715'}
CCSM4 r1i2p2 {'20130715'}
CCSM4 r2i1p1 {'20121031', '20160829'}
CCSM4 r3i1p1 {'20121031', '20160829'}
CCSM4 r4i1p1 {'20121031', '20160829'}
CCSM4 r5i1p1 {'20121031', '20160829'}
CCSM4 r6i1p1 {'20120709', '20160829'}
CESM1(BGC) r1i1p1 {'20130213', '20130216'}
CESM1(CAM5) r1i1p1 {'20130313'}
CESM1(CAM5) r2i1p1 {'20130313'}
CESM1(CAM5) r3i1p1 {'20130313', '20140310'}
CESM1(WACCM) r1i1p1 {'20130314'}
CESM1(WACCM) r2i1p1 {'20130314'}
CESM1(WACCM) r3i1p1 {'20130314', '20130315'}
CESM1(WACCM) r4i1p1 {'20130314', '20130315'}
CE

The same works with *--remote* and *cmip6*:

In [28]:
!clef --remote cmip6 -v pr -v mrso -e piControl  -mi r1i1p1f1 --frequency mon --and variable_id

None
BCC-CSM2-MR r1i1p1f1 {'v20181016', 'v20181012'}
BCC-ESM1 r1i1p1f1 {'v20181211', 'v20181214'}
CAMS-CSM1-0 r1i1p1f1 {'v20190729'}
CESM2 r1i1p1f1 {'v20190320'}
CESM2-WACCM r1i1p1f1 {'v20190320'}
CanESM5 r1i1p1f1 {'v20190429'}
E3SM-1-0 r1i1p1f1 {'v20190719', 'v20190807'}
EC-Earth3 r1i1p1f1 {'v20190712'}
EC-Earth3-Veg r1i1p1f1 {'v20190619'}
GISS-E2-1-G r1i1p1f1 {'v20180824'}
GISS-E2-1-G-CC r1i1p1f1 {'v20190815'}
GISS-E2-1-H r1i1p1f1 {'v20190410'}
HadGEM3-GC31-LL r1i1p1f1 {'v20190628'}
HadGEM3-GC31-MM r1i1p1f1 {'v20190920'}
IPSL-CM6A-LR r1i1p1f1 {'v20181123'}
MCM-UA-1-0 r1i1p1f1 {'v20191017', 'v20190731'}
MIROC6 r1i1p1f1 {'v20190311', 'v20181212'}
MPI-ESM1-2-HR r1i1p1f1 {'v20190710'}
MRI-ESM2-0 r1i1p1f1 {'v20190603', 'v20190222'}
NorCPM1 r1i1p1f1 {'v20190914'}
SAM0-UNICON r1i1p1f1 {'v20190910'}
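
If you want to reuse this output in a script, each line can be split back into model, ensemble and versions. The following is a minimal sketch that assumes the output keeps the `model ensemble {versions}` layout shown above; `parse_and_output` is a hypothetical helper, not part of CleF.

```python
import ast


def parse_and_output(text):
    """Turn 'model ensemble {versions}' lines into (model, ensemble, versions) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split(maxsplit=2)
        # skip blank lines and stray messages such as 'None'
        if len(parts) != 3 or not parts[2].startswith('{'):
            continue
        model, ensemble, versions = parts
        # the versions are printed as a Python set literal
        rows.append((model, ensemble, ast.literal_eval(versions)))
    return rows


sample = """None
BCC-CSM2-MR r1i1p1f1 {'v20181016', 'v20181012'}
CESM2 r1i1p1f1 {'v20190320'}
"""
rows = parse_and_output(sample)
for model, ensemble, versions in rows:
    print(model, ensemble, sorted(versions))
```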


## New features

We recently added new output features following a user request.<br>
These are currently only available in the analysis3-unstable environment<br>

In [None]:
# !module load conda/analysis3-unstable

### CSV file output

The *--csv* option added to the command line will output the query results to a csv file. Rather than listing only the file paths, it will list all the available attributes.<br>
This currently works only with the *--local* option; it doesn't yet work for the default search or for *--remote*. Both of these perform an ESGF query rather than searching the CLEF database directly as *--local* does, so they need to be treated differently. We are still working on this.

In [29]:
!clef --local cmip6 -v pr -v mrso -e piControl  -mi r1i1p1f1 --frequency mon --and variable_id --csv

BCC-CSM2-MR r1i1p1f1 {'v20181016', 'v20181012'}
BCC-ESM1 r1i1p1f1 {'v20181211', 'v20181214'}
CAMS-CSM1-0 r1i1p1f1 {'v20190729'}
CESM2 r1i1p1f1 {'v20190320'}
CESM2-WACCM r1i1p1f1 {'v20190320'}
CanESM5 r1i1p1f1 {'v20190429'}
EC-Earth3 r1i1p1f1 {'v20190712'}
EC-Earth3-Veg r1i1p1f1 {'v20190619'}
GISS-E2-1-G r1i1p1f1 {'v20180824'}
GISS-E2-1-G-CC r1i1p1f1 {'v20190815'}
GISS-E2-1-H r1i1p1f1 {'v20190410'}
HadGEM3-GC31-LL r1i1p1f1 {'v20190628'}
HadGEM3-GC31-MM r1i1p1f1 {'v20190920'}
IPSL-CM6A-LR r1i1p1f1 {'v20181123'}
MCM-UA-1-0 r1i1p1f1 {'v20190731', 'v20191017'}
MIROC6 r1i1p1f1 {'v20181212', 'v20190311'}
MPI-ESM1-2-HR r1i1p1f1 {'v20190710'}
MRI-ESM2-0 r1i1p1f1 {'v20190603', 'v20190222'}
NorCPM1 r1i1p1f1 {'v20190914'}
NorESM2-LM r1i1p1f1 {'v20190815'}
SAM0-UNICON r1i1p1f1 {'v20190910'}


In [30]:
!head -n 4 CMIP6_query.csv

activity_id,source_id,source_type,experiment_id,sub_experiment_id,frequency,r,i,p,f,variant_label,member_id,variable_id,grid_label,nominal_resolution,table_id,version,variable,pdir,fdate,tdate,time_complete
CMIP,BCC-CSM2-MR,AOGCM,piControl,none,mon,1,1,1,1,r1i1p1f1,r1i1p1f1,pr,gn,100 km,Amon,v20181016,pr,/g/data1b/oi10/replicas/CMIP6/CMIP/BCC/BCC-CSM2-MR/piControl/r1i1p1f1/Amon/pr/gn/v20181016,18500101,24491231,True
CMIP,BCC-ESM1,AER AOGCM BGC,piControl,none,mon,1,1,1,1,r1i1p1f1,r1i1p1f1,pr,gn,250 km,Amon,v20181214,pr,/g/data1b/oi10/replicas/CMIP6/CMIP/BCC/BCC-ESM1/piControl/r1i1p1f1/Amon/pr/gn/v20181214,18500101,23001231,True
CMIP,CAMS-CSM1-0,AOGCM,piControl,none,mon,1,1,1,1,r1i1p1f1,r1i1p1f1,pr,gn,100 km,Amon,v20190729,pr,/g/data1b/oi10/replicas/CMIP6/CMIP/CAMS/CAMS-CSM1-0/piControl/r1i1p1f1/Amon/pr/gn/v20190729,29000101,33991231,True
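
Once the csv file exists it can be read back with the standard library. The sketch below uses the header and the first two rows shown above, held in memory so that it runs anywhere; in practice you would pass `open('CMIP6_query.csv', newline='')` to the reader instead.

```python
import csv
import io

# header and two rows copied from the CMIP6_query.csv output above
sample = io.StringIO("""\
activity_id,source_id,source_type,experiment_id,sub_experiment_id,frequency,r,i,p,f,variant_label,member_id,variable_id,grid_label,nominal_resolution,table_id,version,variable,pdir,fdate,tdate,time_complete
CMIP,BCC-CSM2-MR,AOGCM,piControl,none,mon,1,1,1,1,r1i1p1f1,r1i1p1f1,pr,gn,100 km,Amon,v20181016,pr,/g/data1b/oi10/replicas/CMIP6/CMIP/BCC/BCC-CSM2-MR/piControl/r1i1p1f1/Amon/pr/gn/v20181016,18500101,24491231,True
CMIP,BCC-ESM1,AER AOGCM BGC,piControl,none,mon,1,1,1,1,r1i1p1f1,r1i1p1f1,pr,gn,250 km,Amon,v20181214,pr,/g/data1b/oi10/replicas/CMIP6/CMIP/BCC/BCC-ESM1/piControl/r1i1p1f1/Amon/pr/gn/v20181214,18500101,23001231,True
""")

# every row becomes a dict keyed by the column names in the header
rows = list(csv.DictReader(sample))

# keep the directory of each model whose time axis is flagged as complete
paths = {row['source_id']: row['pdir'] for row in rows if row['time_complete'] == 'True'}
for model, pdir in paths.items():
    print(model, pdir)
```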


### Query summary option

The *--stats* option added to the command line will print a summary of the query results.<br>
It works with both the *--local* and *--remote* options, but not with the default query.<br>
Currently it prints the following:
* total number of models, followed by their names
* total number of unique model-ensembles/members combinations
* number of models that have N ensembles/members, followed by their names

In [31]:
!clef --local cmip5 -v pr -v mrso -e piControl --frequency mon --stats


Query summary

48 model/s are available:
ACCESS1.0 ACCESS1.3 BCC-CSM1.1 BCC-CSM1.1(m) BNU-ESM CCSM4 CESM1(BGC) CESM1(CAM5) CESM1(WACCM) CESM1-CAM5.1-FV2 CESM1-FASTCHEM CMCC-CESM CMCC-CM CMCC-CMS CNRM-CM5 CNRM-CM5-2 CSIRO-Mk3.6.0 CSIRO-Mk3L-1-2 CanESM2 EC-EARTH FGOALS-g2 FGOALS-s2 FGOALS_g2 FIO-ESM GFDL-CM3 GFDL-ESM2G GFDL-ESM2M GISS-E2-H GISS-E2-H-CC GISS-E2-R GISS-E2-R-CC HadGEM2-AO HadGEM2-CC HadGEM2-ES IPSL-CM5A-LR IPSL-CM5A-MR IPSL-CM5B-LR MIROC-ESM MIROC-ESM-CHEM MIROC4h MIROC5 MPI-ESM-LR MPI-ESM-MR MPI-ESM-P MRI-CGCM3 NorESM1-M NorESM1-ME inmcm4 

A total of 59 unique model-member combinations are available.

44 model/s have 1 member/s:
ACCESS1.0 ACCESS1.3 BCC-CSM1.1 BCC-CSM1.1(m) BNU-ESM CESM1(BGC) CESM1(CAM5) CESM1(WACCM) CESM1-CAM5.1-FV2 CESM1-FASTCHEM CMCC-CESM CMCC-CM CMCC-CMS CNRM-CM5 CSIRO-Mk3.6.0 CSIRO-Mk3L-1-2 CanESM2 EC-EARTH FGOALS-g2 FGOALS-s2 FGOALS_g2 FIO-ESM GFDL-CM3 GFDL-ESM2G GFDL-ESM2M GISS-E2-H-CC GISS-E2-R-CC HadGEM2-AO HadGEM2-CC HadGEM2-ES IPSL-CM5
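
The three quantities in this summary are easy to reproduce from a list of (model, member) pairs. This is a rough sketch of the counting only, not the CleF code; `summarise` and the sample pairs are hypothetical.

```python
from collections import defaultdict


def summarise(pairs):
    """Return the model names, the number of unique pairs and models grouped by member count."""
    members = defaultdict(set)
    for model, member in pairs:
        # a set drops repeated model/member pairs
        members[model].add(member)
    by_count = defaultdict(list)
    for model in sorted(members):
        by_count[len(members[model])].append(model)
    total = sum(len(m) for m in members.values())
    return sorted(members), total, dict(by_count)


# hypothetical query results, with one repeated pair
pairs = [('ACCESS1.0', 'r1i1p1'), ('CCSM4', 'r1i1p1'),
         ('CCSM4', 'r2i1p1'), ('CCSM4', 'r1i1p1')]
models, total, by_count = summarise(pairs)
print(f'{len(models)} model/s are available:', *models)
print(f'A total of {total} unique model-member combinations are available.')
for n, names in sorted(by_count.items()):
    print(f'{len(names)} model/s have {n} member/s:', *names)
```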

### Errata and ESDOC

Other new features are functions that retrieve the errata associated with a file and the documents available in the ESDOC system.<br>
We are still working to make these accessible from the command line and also to add tracking_ids to our query outputs.<br>
In the meantime you can load them and use them after having retrieved the tracking_id attribute in another way (for example with a simple ncdump or via xarray if in python).<br>
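
One possible way to get the tracking_id is to read it from the header printed by the `ncdump -h` utility. The sketch below assumes `ncdump` is on your PATH; both helper functions are hypothetical and the sample header is trimmed to the one attribute we need.

```python
import re
import subprocess


def tracking_id_from_header(header):
    """Extract the tracking_id global attribute from `ncdump -h` text."""
    match = re.search(r':tracking_id\s*=\s*"([^"]+)"', header)
    return match.group(1) if match else None


def tracking_id_from_file(path):
    """Run `ncdump -h` on a file and return its tracking_id, or None if absent."""
    # requires the ncdump utility from the netCDF tools
    result = subprocess.run(['ncdump', '-h', path],
                            capture_output=True, text=True, check=True)
    return tracking_id_from_header(result.stdout)


# trimmed example of what a file header looks like
header = '''netcdf example {
// global attributes:
\t\t:tracking_id = "hdl:21.14100/a2c2f719-6790-484b-9f66-392e62cd0eb8" ;
}'''
print(tracking_id_from_header(header))
```

With xarray the same attribute should be available from the `attrs` dictionary of the opened dataset.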
Let's start from the errata:

In [32]:
from clef.esdoc import *
tracking_id = 'hdl:21.14100/a2c2f719-6790-484b-9f66-392e62cd0eb8'
error_ids = errata(tracking_id)
for eid in error_ids:
    print_error(eid)

You can view the full report online:
https://errata.es-doc.org/static/view.html?uid=99f28ccc-53b3-68dc-8fb1-f7ca4a2d3393
Title: pr and prc have incorrect values at daily and monthly timescales due to an incorrect scaling factor
Status: resolved
Description: Within the conversion from CESM's CAM precipitation units (m s-1) to CMIP's units of (kg m-2 s-1) an incorrect scaling factor was applied. The conversion should have been to multiply CAM's values by 1000 kg m-3. Instead, the values were multiplied by 1000 and then divided by 86400, resulting in values that are too small.


As you can see I've chosen a tracking_id that was associated with some errata. First I use the **errata()** function to retrieve any associated error_ids and then I print out the result using the **print_error()** function.<br>
This first retrieves the message associated with each error_id and then prints it in a human-readable form, including the url for the original error report.<br><br>
Let's now have a look at how to retrieve and print some documentation from ESDOC.

In [33]:
doc_url = get_doc(dtype='model', name='MIROC6', project='CMIP6')

MIP Era > CMIP6
Institute > MIROC
Canonical Name > --
Name > MIROC6
Type > GCM
Long Name > --
Overview > --
Keywords > --
name > MIROC6
keywords > CCSR-AGCM, SPRINTARS, COCO, MATSIRO, atmosphere, aerosol, sea-ice ocean, land surface
overview > MIROC6 is a physical climate model mainly composed of three sub-models: atmosphere, land, and sea ice-ocean. The atmospheric model is based on the CCSR-NIES atmospheric general circulation model. The horizontal resolution is a T85 spectral truncation that is an approximately 1.4° grid interval for both latitude and longitude. The vertical grid coordinate is a hybrid σ-p coordinate. The model top is placed at 0.004 hPa, and there are 81 vertical levels. The Spectral Radiation-Transport Model for Aerosol Species (SPRINTARS) is used as an aerosol module for MIROC6 to predict the mass mixing ratios of the main tropospheric aerosols. By coupling the radiation and cloud-precipitation schemes, SPRINTARS calculates not only the aerosol transport processe

aerosols - cloud lifetime effect - RFaci from sulfate only > False
aerosols - dust - provision > M
aerosols - tropospheric volcanic - provision > C
name > SPRINTARS
keywords > aerosol transport,aerosol-radiation interaction,aerosol-cloud interaction
overview > Spectral Radiation-Transport Model for Aerosol Species (SPRINTARS) predicts mass mixing ratios of the main tropospheric aerosols which are black carbon (BC), organic matter (OM), sulfate, soil dust, and sea salt, and the precursor gases of sulfate (sulfur dioxide and dimethylsulfide). SPRINTARS calculates not only the aerosol transport processes of emission, advection, diffusion, sulfur chemistry, wet deposition, dry deposition, and gravitational settling, but also the aerosol-radiation and aerosol-cloud interactions by coupled with the radiation and cloud-precipitation schemes in MIROC. See sections on model description in Takemura (2018, http://www.cger.nies.go.jp/publications/report/i138/i138.pdf) for furtther details.
scheme 

soil map - soil depth > N/A. Soil depth is constant.
snow free albedo - prognostic > True
snow free albedo - functions > Vegetation state
snow free albedo - direct diffuse > Distinction between direct and diffuse albedo
snow free albedo - number of wavelength bands > 3
hydrology - description > The unfrozen soil moisture is predicted by the Richards equation with hydraulic properies based on Clapp and Hornberger (1979).
hydrology - time step > 180
hydrology - tiling > nan
hydrology - vertical discretisation > Soil has six layers with a thickness of 0.05, 0.2, 0.75, 1, 2, and 10 m.
hydrology - number of ground water layers > 6
hydrology - lateral connectivity > Other: No connectivity
hydrology - method > Explicit diffusion
hydrology - freezing - number of ground ice layers > 6
hydrology - freezing - ice storage method > Thermo dynamics
hydrology - freezing - permafrost > There is no specific treatment for permafrost. But near-surface permafrost is represented by soil freezing processes.

resolution - thickness level 1 > 2
tuning applied - description > Our main target is to reproduce reasonable THC (in particular AMOC) strength and volume transport across some key starits/pathways. In addition, we also checked T/S fields in orde to avoid unrealistic long-term trends. We mainly modified ocean bathmetry rather than parameter-level tuning to retain these metrics.
conservation - description > We have checked changes of the properties which should be conserved are in the range of numerical error by calculating the difference of these properties by using multiple snapshots of the modeled ocean.
conservation - scheme > Salt
name > COCO medium resolution model
discretisation - vertical - coordinates > Hybrid / Z+S
discretisation - vertical - partial steps > True
discretisation - horizontal - type > Two north poles (ORCA-style)
discretisation - horizontal - staggering > Arakawa B-grid
name > Staggard timestepping
diurnal cycle > Via coupling: Diurnal cycle via coupling frequenc

salt - has multiple sea ice salinities > False
salt - sea ice salinity thermal impacts > True
salt - mass transport - salinity type > Constant
salt - mass transport - constant salinity value > 5
salt - thermodynamics - salinity type > Constant
salt - thermodynamics - constant salinity value > 5
ice thickness distribution - representation > Explicit
ice floe size distribution - representation > Parameterised
melt ponds - are included > False
melt ponds - formulation > Other
snow processes - has snow aging > False
snow processes - has snow ice formation > True
snow processes - snow ice formation scheme > When snow-ice interface comes below sea level, the snow between the interface and sea level turns into sea ice.
snow processes - redistribution > Snow-ice
surface albedo > Other


This time we can directly use a single function, **get_doc()**. It takes three arguments:
  * the kind of document, which can be model, experiment or mip;
  * the name of the model, experiment or mip;
  * the project for which you want to retrieve the document, by default this is CMIP6.<br><br>
It will retrieve the document online and print out a summary.<br>
It will also return the url for the full document report, shown below.

In [35]:
print(doc_url)

https://api.es-doc.org/2/document/search-name?client=ESDOC-VIEWER-DEMO&encoding=html&project=CMIP6&name=MIROC6&type=CIM.2.SCIENCE.MODEL
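
The url above is a plain query string, so an equivalent one can be assembled by hand. This sketch only reproduces the url printed above for a model document; `esdoc_model_url` is a hypothetical helper, and the `type` values used for experiment or mip documents are not shown here, so they are left out.

```python
from urllib.parse import urlencode


def esdoc_model_url(name, project='CMIP6'):
    """Build the ES-DOC search url for a model document."""
    # parameter names and values are taken from the url printed by get_doc()
    params = {
        'client': 'ESDOC-VIEWER-DEMO',
        'encoding': 'html',
        'project': project,
        'name': name,
        'type': 'CIM.2.SCIENCE.MODEL',
    }
    return 'https://api.es-doc.org/2/document/search-name?' + urlencode(params)


url = esdoc_model_url('MIROC6')
print(url)
```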


ESDOC works only for CMIP6 and newer ESGF datasets. The World Data Center for Climate (WDCC) website holds documentation for both CMIP6 and CMIP5, and the **get_wdcc()** function accesses these documents. In this case, rather than the type of document, you have to use the dataset_id to retrieve the information.<br>

In [36]:
doc_url, response = get_wdcc('cmip5.output1.MIROC.MIROC5.historical.mon.atmos.Amon.r1i1p1.v20111028')
print(doc_url)
print(response['response']['docs'])

https://cera-www.dkrz.de/WDCC/ui/cerasearch/solr/select?rows=1&wt=json&q=entry_name_s:cmip5*output1*MIROC*MIROC5
[{'geo': ['ENVELOPE(-180.00, 180.00, 90.00,-90.00)'], 'accuracy_report_s': 'not filled', 'specification_s': 'not filled', 'completeness_report_s': 'not filled', 'entry_type_s': 'experiment', 'qc_institute_s': 'MIROC', 'summary_s': 'MIROC data of the MIROC5 model as contribution for CMIP5 - Coupled Model\nIntercomparison Project Phase 5 (https://pcmdi.llnl.gov/mips/cmip5).\nExperiment design is described in detail in\nhttps://pcmdi.llnl.gov/mips/cmip5/experiment_design.html and the list of output\nvariables and their temporal resolutions are given in\nhttps://pcmdi.llnl.gov/mips/cmip5/datadescription.html . The output is stored in netCDF\nformat as time series per variable in model grid spatial resolution. For more information\non the Earth System model and the simulation please refer to the CIM repository.', 'general_key_ss': ['CMIP5', 'IPCC', 'IPCC-AR5', 'IPCC-DDC', 'MIROC5

We are still working to add a function that will give a formatted print of the WDCC documents, as for the ESDOC ones.
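
Until then, a few lines of Python are enough to make the response easier to read. This is a minimal sketch that mimics the `key > value` layout of the ESDOC printout; `format_wdcc` is a hypothetical helper and the sample document contains only a few of the fields shown above.

```python
import textwrap


def format_wdcc(doc, keys=('entry_type_s', 'qc_institute_s', 'summary_s'), width=80):
    """Format selected fields of one WDCC document as wrapped 'key > value' lines."""
    lines = []
    for key in keys:
        if key not in doc:
            continue
        value = doc[key]
        if isinstance(value, list):
            value = ', '.join(map(str, value))
        # collapse the embedded newlines before re-wrapping the text
        text = ' '.join(str(value).split())
        lines.append(textwrap.fill(f'{key} > {text}', width=width,
                                   subsequent_indent='    '))
    return '\n'.join(lines)


# a few fields copied from the response printed above
doc = {'entry_type_s': 'experiment', 'qc_institute_s': 'MIROC',
       'accuracy_report_s': 'not filled'}
formatted = format_wdcc(doc)
print(formatted)
```

With the real response you would call it on each element of `response['response']['docs']`.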

### More on queries

#### About experiment_family

Experiment_family is a facet present only for CMIP5; it allows you to select all the experiments falling in the same category. The equivalent in CMIP6 is activity. However, not all experiments belong to a family, and searching for both experiment and experiment_family at the same time can give unexpected results.<br>
Let's look at an example: if I want to get all the RCP experiments and historical, I might be tempted to pass them as constraints in the same query:

In [37]:
!clef cmip5 -m CMCC-CM -e historical --experiment_family RCP -t Omon -v tos -en r1i1p1

None
No matches found on ESGF, check at https://esgf.nci.org.au/search/esgf-nci?query=&type=File&distrib=True&replica=False&latest=True&project=CMIP5&ensemble=r1i1p1&experiment=historical&model=CMCC-CM&cmor_table=Omon&variable=tos&experiment_family=RCP


We couldn't find any matches because both constraints have to be true. Similarly, if we pass rcp45 as experiment as well as the family RCP, we will only get the rcp45 results.

In [38]:
!clef cmip5 -m CMCC-CM -e rcp45 --experiment_family RCP -t Omon -v tos -en r1i1p1

None
/g/data1b/al33/replicas/CMIP5/combined/CMCC/CMCC-CM/rcp45/mon/ocean/Omon/r1i1p1/v20120518/tos/
/g/data1b/al33/replicas/CMIP5/combined/CMCC/CMCC-CM/rcp45/mon/ocean/Omon/r1i1p1/v20170725/tos/

Everything available on ESGF is also available locally
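
The behaviour in the last two queries follows from the constraints being combined with AND. The toy example below imitates that on an in-memory list; `search` is a hypothetical stand-in for the real query, and the family assigned to historical is an assumption made only for illustration.

```python
def search(records, **constraints):
    """Return the records that satisfy ALL constraints (each one a set of allowed values)."""
    return [r for r in records
            if all(r.get(key) in allowed for key, allowed in constraints.items())]


records = [
    # the family value for historical is an assumption, not taken from ESGF
    {'experiment': 'historical', 'experiment_family': 'Historical'},
    {'experiment': 'rcp45', 'experiment_family': 'RCP'},
    {'experiment': 'rcp85', 'experiment_family': 'RCP'},
]

# both constraints in one query: nothing is historical AND in the RCP family
both = search(records, experiment={'historical'}, experiment_family={'RCP'})

# two separate queries joined afterwards give the result we were after
combined = (search(records, experiment={'historical'})
            + search(records, experiment_family={'RCP'}))
print(len(both), [r['experiment'] for r in combined])
```

In practice this means running one query for historical and another for the RCP family, then merging the two lists of results.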


Finally, it is now possible to use experiment_family also in the local search:

In [39]:
!clef --local cmip5 -m CMCC-CM --experiment_family RCP -t Omon -v tos -en r1i1p1

/g/data1b/al33/replicas/CMIP5/combined/CMCC/CMCC-CM/rcp45/mon/ocean/Omon/r1i1p1/v20120518/tos
/g/data1b/al33/replicas/CMIP5/combined/CMCC/CMCC-CM/rcp45/mon/ocean/Omon/r1i1p1/v20170725/tos
/g/data1b/al33/replicas/CMIP5/combined/CMCC/CMCC-CM/rcp85/mon/ocean/Omon/r1i1p1/v20120528/tos
/g/data1b/al33/replicas/CMIP5/combined/CMCC/CMCC-CM/rcp85/mon/ocean/Omon/r1i1p1/v20170725/tos


## Searching for other climate datasets: ds

Let's get back to the command line now and have a look at the third command, **ds**.<br>
This command lets you query a separate database that contains information on other climate datasets which are available on raijin.

In [40]:
!clef ds --help

Usage: clef ds [OPTIONS]

  Search local database for non-ESGF datasets

Options:
  -d, --dataset TEXT              Dataset name
  -v, --version TEXT              Dataset version
  -f, --format [netcdf|grib|HDF5|binary]
                                  Dataset file format as defined in clef.db
                                  Dataset table
  -sn, --standard-name [air_temperature|air_pressure|rainfall_rate]
                                  Variable standard_name this is the most
                                  reliable way to look for a variable across
                                  datasets
  -cn, --cmor-name [ps|pres|psl|tas|ta|pr|tos]
                                  Variable cmor_name useful to look for a
                                  variable across datasets
  -va, --variable [T|U|V|Z]       Variable name as defined in files: tas, pr,
                                  sic, T ...
  --frequency [yr|mon|day|6hr|3hr|1hr]
                                

Running **clef ds** with no other argument will return a list of the local datasets available in the database.<br>
NB this is not an exhaustive list of the climate collections at NCI and not all the datasets already in the database have been completed.

In [41]:
!clef ds

ERA5 v1.0: /g/data/ub4/era5/netcdf/<stream>/<varname>/<year>/
MACC v1.0: /g/data/ub4/macc/grib/<stream>/
YOTC v1.0: /g/data/rq7/yotc
ERAI v1.0: /g/data/ub4/erai/netcdf/<frequency>/<realm>/<stream>/<version>/<varname>/
OSTIA vNA: /g/data/ua8/ostia
TRMM_3B42 v7: /g/data/ua8/NASA_TRMM/TRMM_L3/TRMM_3B42/<YYYY>/
OISST v2.0: /g/data/ua8/NOAA_OISST/AVHRR/v2-0_modified/
MERRA2 v5.12.4: /g/data/rr7/MERRA2/raw/<streamv1>.<version>/<YYYY>/<MM>/
ERAI v1.0: /g/data/ub4/erai/netcdf/<frequency>/<realm>/<stream>/v01/<varname>/
MACC v1.0: /g/data/ub4/macc/netcdf/<frequency>/<realm>/<stream>/v01/<varname>/
YOTC v1.0: /g/data/rq7/yotc


If you specify any of the variable options then the query will return a list of variables rather than datasets.
Since variables can be named differently among datasets, using the *standard_name* or *cmor_name* options is the most reliable way to identify them.

In [42]:
!clef ds -f netcdf --standard-name air_temperature

2T: /g/data/ub4/era5/netcdf/surface/2T/<year>/2T_era5_-90 90 -180 179.75_<YYYYMMDD>_<YYYYMMDD>.nc
T: /g/data/ub4/era5/netcdf/pressure/T/<year>/T_era5_-57 20 78 -140_<YYYYMMDD>_<YYYYMMDD>.nc
2T: /g/data/ub4/era5/netcdf/surface/2T/<year>/2T_era5_-90 90 -180 179.75_<YYYYMMDD>_<YYYYMMDD>.nc
T: /g/data/ub4/era5/netcdf/pressure/T/<year>/T_era5_-57 20 78 -140_<YYYYMMDD>_<YYYYMMDD>.nc
ta: /g/data/ub4/erai/netcdf/6hr/atmos/oper_an_pl/1.0/ta/ta_6hr_ERAI_historical_oper_an_pl_<YYYYMMDD>_<YYYYMMDD>.nc
tas: /g/data/ub4/erai/netcdf/6hr/atmos/oper_an_sfc/1.0/tas/tas_6hr_ERAI_historical_oper_an_sfc_<YYYYMMDD>_<YYYYMMDD>.nc
ta: /g/data/ub4/erai/netcdf/6hr/atmos/oper_an_ml/1.0/ta/ta_6hr_ERAI_historical_oper_an_ml_<YYYYMMDD>_<YYYYMMDD>.nc
mn2t: /g/data/ub4/erai/netcdf/3hr/atmos/oper_fc_sfc/1.0/mn2t/mn2t_3hr_ERAI_historical_oper_fc_sfc_<YYYYMMDD>_<YYYYMMDD>.nc
mx2t: /g/data/ub4/erai/netcdf/3hr/atmos/oper_fc_sfc/1.0/mx2t/mx2t_3hr_ERAI_historical_oper_fc_sfc_<YYYYMMDD>_<YYYYMMDD>.nc
tas: /g/data/ub

This returns all the variables available as netcdf files with air_temperature as standard_name.<br>
NB for each variable a path structure is returned.
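
Because the result is a path structure and not a list of files, the `<...>` placeholders have to be filled in before the path can be used. One simple option, sketched here with a hypothetical helper `template_to_glob`, is to turn each placeholder into a wildcard and let `glob` find the matching files.

```python
import glob
import re


def template_to_glob(template):
    """Replace every <placeholder> in a path structure with a * wildcard."""
    return re.sub(r'<[^>]+>', '*', template)


# one of the path structures returned by the query above
template = ('/g/data/ub4/erai/netcdf/6hr/atmos/oper_an_pl/1.0/ta/'
            'ta_6hr_ERAI_historical_oper_an_pl_<YYYYMMDD>_<YYYYMMDD>.nc')
pattern = template_to_glob(template)
print(pattern)

# on a machine with access to the data this lists the actual files
files = sorted(glob.glob(pattern))
```

The list is simply empty on a machine that has no access to the data.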

In [43]:
!clef ds -f netcdf --cmor-name ta

T: /g/data/ub4/era5/netcdf/pressure/T/<year>/T_era5_-57 20 78 -140_<YYYYMMDD>_<YYYYMMDD>.nc
T: /g/data/ub4/era5/netcdf/pressure/T/<year>/T_era5_-57 20 78 -140_<YYYYMMDD>_<YYYYMMDD>.nc
ta: /g/data/ub4/erai/netcdf/6hr/atmos/oper_an_pl/1.0/ta/ta_6hr_ERAI_historical_oper_an_pl_<YYYYMMDD>_<YYYYMMDD>.nc
ta: /g/data/ub4/erai/netcdf/6hr/atmos/oper_an_ml/1.0/ta/ta_6hr_ERAI_historical_oper_an_ml_<YYYYMMDD>_<YYYYMMDD>.nc


This returns a subset of the previous query using the cmor_name to clearly identify one kind of air_temperature.