ESGF_Search_REST_API

Torsten Rathmann edited this page Jan 21, 2016 · 7 revisions

The ESGF Search RESTful API

The ESGF search service exposes a RESTful URL that can be used by clients (browsers and desktop clients) to query the contents of the underlying search index, and return results matching the given constraints. Because of the distributed capabilities of the ESGF search, the URL at any Index Node can be used to query that Node only, or all Nodes in the ESGF system.

Syntax

The general syntax of the ESGF search service URL is:

http://<base_search_URL>/search?[keyword parameters as (name, value) pairs][facet parameters as (name,value) pairs]

where <base_search_URL> is the base URL of the search service at a given Index Node.

All parameters (keyword and facet) are optional. Also, the value of all parameters must be URL-encoded, so that the complete search URL is well formed.
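
The syntax above can be sketched in Python using only the standard library. This is a minimal illustration, not part of the ESGF software: the host name in BASE_SEARCH_URL and the helper name build_search_url are hypothetical, and a real Index Node URL must be substituted.

```python
from urllib.parse import urlencode

# Hypothetical base URL; substitute the search service URL of a real Index Node.
BASE_SEARCH_URL = "http://esgf-node.example.org/esg-search"


def build_search_url(params, base=BASE_SEARCH_URL):
    """Return a search URL in which every parameter name and value is URL-encoded."""
    # doseq=True expands list values into repeated name=value pairs.
    query = urlencode(params, doseq=True)
    return f"{base}/search?{query}" if query else f"{base}/search"


url = build_search_url({"query": "air temperature", "limit": 5})
print(url)
```

Because all parameters are optional, calling build_search_url({}) simply returns the bare /search URL.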

Keywords

Keyword parameters are query parameters that have reserved names, and are interpreted by the search service to control the fundamental nature of a search request: where to issue the request to, how many results to return, etc.

The following keywords are currently used by the system - see below for usage examples:

  • facets= to return facet values and counts

  • shards= to specify an explicit list of shards to be queried

  • offset= , limit= to paginate through the available results (default: offset=0, limit=10)

  • fields= to return only specific metadata fields for each matching result (default: fields=*)

  • format= to specify the response document output format

Core Facets

Facet parameters are "search categories" that can be used to apply constraints to the search, and thus reduce the number of results returned. Internally, facets are metadata fields (single valued or multi-valued) that are stored for each search record. The search service will select records for which the metadata field values match the corresponding facet constraints.

The following facets are core system facets, and their names are reserved in the system. These facets can be used as valid query parameters at all sites in the federation.

  • query= for free text searches (default: query=*)

  • distrib=true to execute a distributed query, distrib=false to execute a local query (default: distrib=true)

  • id , master_id , instance_id : core record identifiers carrying different semantics - see later for detailed explanation.

  • title : record (short) title

  • description : record (longer) description

  • type : denotes the intrinsic type of the record. Currently supported values: Dataset, File, Aggregation (default: Dataset)

  • replica : indicates whether the record is the "master" copy, or a replica. Use replica=false to return only originals, replica=true to return only replicas (default: no replica flag specified, i.e. return both replicas and originals)

  • latest : indicates whether the record is the latest available version, or a previous version. Use latest=true to return only the latest version of all records, latest=false to return previous versions (default: no latest flag specified, i.e. return all versions)

  • data_node : indicates the Data Node where the data is stored

  • index_node : the Index Node where the data is published

  • version : the record version (a string)

  • timestamp : the date and time when the record was last modified

  • url : specific URL(s) to access the record

  • access : high level access capability available for a record

  • xlink : reference to external record documentation, such as technical notes

  • size : record size (for Datasets or Files)

  • checksum , checksum_type : file checksum value and type

  • number_of_files : number of files contained in a dataset

  • number_of_aggregations : number of aggregations in a dataset

  • dataset_id : the "id" value of the enclosing dataset (Files and Aggregations only)

  • tracking_id : the UUID assigned to a File by some special publication software, if available

  • drs_id : a templated string assigned to a Dataset by some special publication software, if available. Note: this field is deprecated .

  • start= , end= to execute a temporal range query

  • bbox=[west,south,east,north] to execute a spatial coverage query

  • from= , to= to execute a query based on the record last update date and time

Custom Facets

Additionally, each ESGF Index Node can harvest and make available additional custom facets that are relevant to its projects and users. For example, most Index Nodes support the set of CMIP5 facets , plus others. These custom facets are configured by the Node administrator in the file /esgf/config/facets.properties and can be discovered by the user through the following query:

http://<base_search_URL>/search?facets=*&distrib=false&limit=0

Example:
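
A minimal sketch of the discovery query above, assuming a hypothetical Index Node host (esgf-node.example.org) and a hypothetical helper name. It only builds the URL and does not contact any server.

```python
from urllib.parse import urlencode

# Hypothetical base URL; replace with a real Index Node search service URL.
BASE_SEARCH_URL = "http://esgf-node.example.org/esg-search"


def facet_discovery_url(base=BASE_SEARCH_URL):
    """Build the URL that lists the facets configured at a single Index Node."""
    params = {
        "facets": "*",       # request counts for all configured facets
        "distrib": "false",  # query the local Index Node only
        "limit": 0,          # return no records, only the facet counts
    }
    # safe="*" keeps the wildcard literal, matching the form shown above.
    return f"{base}/search?{urlencode(params, safe='*')}"


print(facet_discovery_url())
```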

CMIP5 Facets

The following set of facets is supported by most ESGF Index Nodes in the federation, and can be used to discover/query/retrieve CMIP5 data. Each entry below gives the facet's display name followed by its key.

  • CF Standard Name: cf_standard_name

  • Ensemble: ensemble

  • Experiment: experiment

  • Experiment Family: experiment_family

  • Institute: institute

  • MIP Table: cmor_table

  • Model: model

  • Project: project

  • Product: product

  • Realm: realm

  • Time Frequency: time_frequency

  • Variable: variable

  • Variable Long Name: variable_long_name

  • Instrument: source_id

Example:

Default Query

If no parameters at all are specified, the search service will execute a query using all the default values, specifically:

  • query=* (query all records)
  • distrib=true (execute a distributed search)
  • type=Dataset (return results of type "Dataset")

Example:

Free Text Queries

The keyword parameter query= can be specified to execute a query that matches the given text anywhere in the record's metadata fields. The parameter value can be any expression following the Apache Lucene query syntax (because it is passed "as-is" to the back-end Solr query), and must be URL-encoded.

Examples:
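
As an illustration, a Lucene expression can be URL-encoded before being placed in the query= parameter. The sketch below assumes a hypothetical host name and helper name; the search expression itself is only an example.

```python
from urllib.parse import quote

# Hypothetical base URL; replace with a real Index Node search service URL.
BASE_SEARCH_URL = "http://esgf-node.example.org/esg-search"


def free_text_url(lucene_expr, base=BASE_SEARCH_URL):
    """Build a free text search URL, percent-encoding the Lucene expression."""
    # safe="" forces quotes, spaces and other reserved characters to be encoded.
    return f"{base}/search?query={quote(lucene_expr, safe='')}"


print(free_text_url('"sea ice" AND humidity'))
```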

Facet Queries

A request to the search service can be constrained to return only those records that match specific values for one or more facets. Specifically, a facet constraint is expressed through the general form <facet name>=<facet value>, where <facet name> is chosen from the controlled vocabulary of facet names configured at each site, and <facet value> must match exactly one of the possible values for that particular facet.

When specifying more than one facet constraint in the request, multiple values for the same facet are combined with a logical OR, while multiple values for different facets are combined with a logical AND. For example, experiment=decadal2000&variable=hus will return all records that match experiment=decadal2000 AND variable=hus, while variable=hus&variable=ta will return all records that match variable=hus OR variable=ta.

A facet constraint can be negated by using the != operator. For example, model!=CCSM searches for all items that do NOT match the CCSM model. Note that all negative facets are combined in logical AND; for example, model!=CCSM&model!=HadCAM searches for all items that match neither CCSM nor HadCAM.

By default, no facet counts are returned in the output document. Facet counts must be explicitly requested by specifying the facet names individually (for example: facets=experiment,model) or via the special notation facets=*. The facets list must be comma-separated, and whitespace is ignored. Note also that at this time, the special notation facets=* will only count those facets that are explicitly configured in the file application-context.xml.

If facet counts are requested, facet values are sorted alphabetically (facet.sort=lex), and all facet values are returned (facet.limit=-1), provided they match one or more records (facet.mincount=1).

A value for the type facet always applies to any request to the ESGF search services, so that the appropriate records can be examined and returned. If not specified explicitly, the default value type=Dataset is used.

Examples:
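
The AND/OR and negation rules described above can be sketched as a small query-string builder. The helper name facet_query and its include/exclude arguments are hypothetical; the facet names and values are the ones used in the text.

```python
from urllib.parse import urlencode


def facet_query(include=None, exclude=None):
    """Build a query string from positive and negated facet constraints.

    include and exclude map a facet name to a list of values.
    """
    pairs = []
    # Repeated values of one facet are ORed; different facets are ANDed.
    for name, values in (include or {}).items():
        pairs.extend((name, value) for value in values)
    # A negated constraint is written name!=value, i.e. the name ends in "!".
    for name, values in (exclude or {}).items():
        pairs.extend((name + "!", value) for value in values)
    # safe="!" keeps the negation operator literal in the output.
    return urlencode(pairs, safe="!")


print(facet_query(
    include={"experiment": ["decadal2000"], "variable": ["hus", "ta"]},
    exclude={"model": ["CCSM", "HadCAM"]},
))
```

The resulting string is appended after "/search?" in the search URL.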

Temporal Coverage Queries

The keyword parameters start= and/or end= can be used to query for data with temporal coverage that overlaps the specified range. The parameter values can either be date-times in the format "YYYY-MM-DDTHH:MM:SSZ" (UTC ISO 8601 format), or special values supported by the Solr DateMath syntax.

Examples:
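
A minimal sketch of how the start= and end= values could be formatted from Python datetime objects. The helper names are hypothetical, and the function expects timezone-aware datetimes so that the conversion to UTC is unambiguous.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def iso_utc(dt):
    """Format a timezone-aware datetime as YYYY-MM-DDTHH:MM:SSZ in UTC."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def temporal_query(start=None, end=None):
    """Build the start=/end= part of a query string; either bound may be omitted."""
    params = {}
    if start is not None:
        params["start"] = iso_utc(start)
    if end is not None:
        params["end"] = iso_utc(end)
    return urlencode(params)  # the ":" characters are percent-encoded


print(temporal_query(
    start=datetime(2000, 1, 1, tzinfo=timezone.utc),
    end=datetime(2010, 12, 31, 23, 59, 59, tzinfo=timezone.utc),
))
```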

Spatial Coverage Queries

The keyword parameter bbox=[west, south, east, north] can be used to query for data with spatial coverage that overlaps the given bounding box.

Examples:
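
The bounding box value can be assembled and URL-encoded as sketched below. The helper name bbox_query is hypothetical, and the coordinates are arbitrary illustrative numbers.

```python
from urllib.parse import urlencode


def bbox_query(west, south, east, north):
    """Build the bbox= part of a query string in [west,south,east,north] order."""
    value = f"[{west},{south},{east},{north}]"
    # Brackets and commas are percent-encoded so the final URL is well formed.
    return urlencode({"bbox": value})


print(bbox_query(-10, -10, 10, 10))
```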

Timestamp (aka "last update") Queries

The keyword parameters from= and/or to= can be used to query for data that was last updated in a given time range. These queries are executed against the "timestamp" field of the Solr records, which represents the date and time when the record was last modified. Note that if the timestamp cannot be set from the source metadata for that record, it is left unassigned so as not to bias the query for records that have a valid timestamp.

When parsing THREDDS catalogs, the timestamp is assigned from the value of the properties creation_time (for datasets) and mod_time (for files), which are interpreted in the local time zone (local to the harvesting agent), and converted to UTC for input into the index. For example, the input value of creation_time="2012-03-15 12:59:09" (in the PDT time zone) becomes timestamp="2012-03-15T19:59:09Z".
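
The conversion in the example above can be reproduced with the Python standard library. This is only an illustration of the arithmetic, not the harvester's actual code; the helper name is hypothetical, and PDT is modelled as a fixed UTC-7 offset.

```python
from datetime import datetime, timedelta, timezone

# PDT modelled as a fixed offset of UTC-7 (an assumption for this example).
PDT = timezone(timedelta(hours=-7), "PDT")


def to_index_timestamp(local_string, tz):
    """Interpret 'YYYY-MM-DD HH:MM:SS' in the given time zone and format it in UTC."""
    local = datetime.strptime(local_string, "%Y-%m-%d %H:%M:%S").replace(tzinfo=tz)
    return local.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(to_index_timestamp("2012-03-15 12:59:09", PDT))
```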

The constraint values can either be date-times in the format "YYYY-MM-DDTHH:MM:SSZ" (UTC ISO 8601 format), or special values supported by the Solr DateMath syntax.

Examples:

Distributed Queries

The keyword parameter distrib= can be used to control whether the query is executed against the local Index Node only, or distributed to all other Nodes in the federation. If not specified, the default value distrib=true is assumed.

Examples:

Shard Queries

By default, a distributed query (distrib=true) targets all ESGF Nodes in the current peer group, i.e. all nodes that are listed in the local configuration file /esg/config/esgf_shards.xml, which is continuously updated by the local node manager to reflect the latest state of the federation. It is possible to execute a distributed search that targets only one or more specific nodes, by specifying them in the shards parameter, as such: shards=hostname1:port1/solr,hostname2:port2/solr,... Note that the explicit shards value is ignored if distrib=false (but distrib=true by default if not otherwise specified).

Examples:
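
A sketch of how the shards value could be assembled from a list of nodes. The host names and port numbers below are made up for illustration, as are the helper names; the hostname:port/solr pattern is the one given above.

```python
from urllib.parse import urlencode


def shards_value(nodes):
    """Join (host, port) pairs into hostname1:port1/solr,hostname2:port2/solr,..."""
    return ",".join(f"{host}:{port}/solr" for host, port in nodes)


def shard_query(nodes, **constraints):
    """Build a query string for a distributed search restricted to the given shards."""
    # distrib=true is set explicitly because shards is ignored when distrib=false.
    params = {"distrib": "true", "shards": shards_value(nodes), **constraints}
    return urlencode(params)


# Hypothetical hosts and ports.
nodes = [("index-a.example.org", 8983), ("index-b.example.org", 8983)]
print(shards_value(nodes))
print(shard_query(nodes, project="CMIP5"))
```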

Replica Queries

Replicas (Datasets and Files) are distinguished from the original record (a.k.a. the master) in the Solr index by the value of two special keywords:

  • replica : a flag that is set to false for master records, true for replica records.

  • master_id : a string that is identical for the master and all replicas of a given logical record (Dataset or File).

By default, a query returns all records (masters and replicas) matching the search criteria, i.e. no replica constraint is used. To return only master records, use replica=false; to return only replicas, use replica=true. To search for all identical Datasets or Files (i.e. for the master AND replicas of a Dataset or File), use master_id=... .

Examples:

Latest and Version Queries

By default, a query to the ESGF search services will return all versions of the matching records (Datasets or Files). To return only the latest, up-to-date version, include latest=true. To return a specific version, use the version= parameter. Using latest=false will return only datasets that were superseded by newer versions.

Examples:

Results Pagination

By default, a query to the search service will return the first 10 records matching the given constraints. The offset into the returned results, and the total number of returned results, can be changed through the keyword parameters offset= and limit= respectively. The system imposes a maximum value of limit <= 10,000.

Examples:
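
Paging through a large result set can be sketched as a generator of offset/limit pairs. The helper name page_params is hypothetical, and the total number of matches is assumed to be known already (for instance from an earlier response).

```python
def page_params(total, page_size=10, max_limit=10_000):
    """Yield offset/limit parameter dictionaries that cover `total` results."""
    if not 1 <= page_size <= max_limit:
        raise ValueError(f"page_size must be between 1 and {max_limit}")
    for offset in range(0, total, page_size):
        # The last page may hold fewer than page_size results.
        yield {"offset": offset, "limit": min(page_size, total - offset)}


for params in page_params(25, page_size=10):
    print(params)
```

Each dictionary can be merged with the other query parameters before the search URL is built.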

Sorting

By default, the results returned by a search are unsorted. The query parameter sort=true can be used to sort the returned results in inverse order of last modification time, i.e. to return the most up to date records first.

Example:

Output Format

The keyword parameter format= can be used to request results in a specific output format. Currently the only available options are Solr/XML (the default) and Solr/JSON.

Examples:

Returned Metadata Fields

By default, all available metadata fields are returned for each result. The keyword parameter fields= can be used to limit the number of fields returned in the response document for each matching result. The list must be comma-separated, and whitespace is ignored. Use fields=* to return all fields (same as not specifying it, since it is the default). Note that the pseudo field score is always appended to any fields list.

Examples:

Identifiers

Each search record in the system is assigned the following identifiers (all of type string):

  • id : universally unique for each record across the federation, i.e. specific to each dataset or file, version and replica (and the data node storing the data). It is intended to be "opaque", i.e. it should not be parsed by clients to extract any information.

    • Example: id=obs4MIPs.CNES.AVISO.mon.v1|esg-datanode.jpl.nasa.gov
  • master_id : same for all replicas and versions across the federation. When parsing THREDDS catalogs, it is extracted from the properties "dataset_id" or "file_id".

    • Example: obs4MIPs.CNES.AVISO.mon
  • instance_id : same for all replicas across the federation, but specific to each version. When parsing THREDDS catalogs, it is extracted from ID attribute of tag in THREDDS (for both Datasets and Files).

    • Example: obs4MIPs.CNES.AVISO.mon.v1

Note also that the record version is the same for all replicas of that record, but different across versions. Examples:

  • version=20120201
  • version=1

Access URLs

In the returned Solr XML output document, URLs that are access points for Datasets and Files are encoded as a 3-tuple of the form url|mime type|service name, where the fields are separated by the | character, and the mime type and service name are chosen from the ESGF controlled vocabulary.
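
A client could split such an encoded value into its three parts as sketched below. The type and function names are hypothetical, and the sample value is made up for illustration; it is not taken from a real index.

```python
from typing import NamedTuple


class AccessURL(NamedTuple):
    url: str
    mime_type: str
    service_name: str


def parse_access_url(encoded):
    """Split a 'url|mime type|service name' string into its three fields."""
    parts = encoded.split("|")
    if len(parts) != 3:
        raise ValueError(f"expected 3 '|'-separated fields, got {len(parts)}")
    return AccessURL(*parts)


# Made-up sample value.
sample = "http://data.example.org/thredds/fileServer/file.nc|application/netcdf|HTTPServer"
access = parse_access_url(sample)
print(access.url, access.mime_type, access.service_name)
```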

Examples of Dataset URLs:

Examples of File URLs:

Wget scripting

The same RESTful API that is used to query the ESGF search services can also be used, with minor modifications, to generate a Wget script to download all files matching the given constraints. Specifically, each ESGF Index Node exposes the following URL for generating Wget scripts:

http://<base_search_URL>/wget?[keyword parameters as (name, value) pairs][facet parameters as (name,value) pairs]

where again <base_search_URL> is the base URL of the search service at a given Index Node. The only syntax differences with respect to the search URL are:

A typical workflow pattern consists of first identifying all datasets or files matching some scientific criteria, then changing the request URL from "/search?" to "/wget?" to generate the corresponding shell scripts for bulk download of files.

Example:
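
The switch from "/search?" to "/wget?" can be sketched as a small URL rewrite. The helper name and the host name are hypothetical; the sketch only swaps the path and leaves all query parameters untouched, so any parameter adjustments required by the wget endpoint are not handled here.

```python
from urllib.parse import urlsplit, urlunsplit


def search_to_wget(search_url):
    """Turn a .../search?... URL into the matching .../wget?... URL."""
    parts = urlsplit(search_url)
    if not parts.path.endswith("/search"):
        raise ValueError("not a search URL: path does not end in /search")
    # Replace only the final path segment; the query string is kept as-is.
    wget_path = parts.path[: -len("/search")] + "/wget"
    return urlunsplit(parts._replace(path=wget_path))


# Hypothetical Index Node host.
print(search_to_wget(
    "http://esgf-node.example.org/esg-search/search?experiment=decadal2000&variable=hus"
))
```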

For more information on the Wget script, see ESGF_wget.