
ESGFInterfaceGroups|ThreddsGroup

Stephen Pascoe edited this page Apr 9, 2014 · 8 revisions
Wiki Reorganisation
This page has been classified for reorganisation. It has been given the category REVISE.
This page contains useful content but needs revision. It may contain out of date or inaccurate content.

Thredds / Publishing Group

A definition of an ESGF THREDDS profile is in development; see the document attached below.

Syntactic description of ESG THREDDS XML

The Master Catalog

The ESGCET publisher organizes published data in a single-level hierarchy in which each leaf container of the DRS is represented by a sub-catalog file. These sub-catalogs are collected into a master catalog using catalogRef elements, for example:

<catalog xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns:xlink="http://www.w3.org/1999/xlink"
         xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
         name="Earth System Grid catalog"
         xsi:schemaLocation="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0
                             http://www.unidata.ucar.edu/schemas/thredds/InvCatalog.1.0.2.xsd">
      <catalogRef name="cmip5.output1.CCCma.CanESM2.historical.mon.atmos.Amon.r1i1p1.v1"
                  xlink:title="cmip5.output1.CCCma.CanESM2.historical.mon.atmos.Amon.r1i1p1.v1"
                  xlink:href="1/cmip5.output1.CCCma.CanESM2.historical.mon.atmos.Amon.r1i1p1.v1.xml"/>
      <catalogRef name="cmip5.output1.NOAA-GFDL.gfdl-cm3.historical.mon.atmos.Amon.r1i1p1.v1"
                  xlink:title="cmip5.output1.NOAA-GFDL.gfdl-cm3.historical.mon.atmos.Amon.r1i1p1.v1"
                  xlink:href="1/cmip5.output1.NOAA-GFDL.gfdl-cm3.historical.mon.atmos.Amon.r1i1p1.v1.xml"/>
</catalog>
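
As a rough illustration of how a client might walk the master catalog, the sketch below lists every catalogRef with Python's standard `xml.etree.ElementTree`. The embedded XML is a trimmed copy of the example above, and the function name `list_sub_catalogs` is ours, not part of any ESGF tool.

```python
import xml.etree.ElementTree as ET

THREDDS_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Trimmed copy of the master catalog shown above (one catalogRef only).
MASTER_CATALOG = """\
<catalog xmlns:xlink="http://www.w3.org/1999/xlink"
         xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
         name="Earth System Grid catalog">
  <catalogRef name="cmip5.output1.CCCma.CanESM2.historical.mon.atmos.Amon.r1i1p1.v1"
              xlink:title="cmip5.output1.CCCma.CanESM2.historical.mon.atmos.Amon.r1i1p1.v1"
              xlink:href="1/cmip5.output1.CCCma.CanESM2.historical.mon.atmos.Amon.r1i1p1.v1.xml"/>
</catalog>
"""


def list_sub_catalogs(xml_text):
    """Return (name, href) pairs for every catalogRef in a master catalog."""
    root = ET.fromstring(xml_text)
    refs = []
    for ref in root.iter(f"{{{THREDDS_NS}}}catalogRef"):
        # xlink:href is a namespaced attribute, so it needs the full URI.
        refs.append((ref.get("name"), ref.get(f"{{{XLINK_NS}}}href")))
    return refs


sub_catalogs = list_sub_catalogs(MASTER_CATALOG)
for name, href in sub_catalogs:
    print(name, "->", href)
```

The href values are relative, so a real client would resolve them against the URL the master catalog was fetched from.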

A Sub-Catalog

Each sub-catalog is organized as follows:

Below the catalog element the publisher includes a collection of service elements. The services included in the catalog are determined by configuration elements in the publisher initialization file (esgcet.ini). In this case there are five services: OPeNDAP, LAS, HTTPServer, SRM and GridFTP. [Bob - not sure why there is a Compound service, or what changes might occur when publishing that would use the Compound service. Nor do I understand how the GridFTP or SRM services will be used.]

<catalog ...[all of the namespace declarations as above ...]
  <service serviceType="LAS" base="http://pcmdi3.llnl.gov/las/getUI.do/" name="LASatPCMDI" desc="Live Access Server">
    <property name="requires_authorization" value="false"/>
    <property name="application" value="Web Browser"/>
  </service>
  <service serviceType="OpenDAP" base="/thredds/dodsC/" name="gridded" desc="PCMDI OPeNDAP">
    <property name="requires_authorization" value="false"/>
    <property name="application" value="Web Browser"/>
  </service>
  <service serviceType="Compound" base="" name="fileservice">
    <service serviceType="HTTPServer" base="/thredds/fileServer/" name="HTTPServer" desc="PCMDI TDS">
      <property name="requires_authorization" value="true"/>
      <property name="application" value="Web Browser"/>
      <property name="application" value="Web Script"/>
    </service>
    <service serviceType="GridFTP" base="gsiftp://oberon.llnl.gov:2811/" name="GridFTPTestAtPCMDI" desc="GridFTP">
      <property name="requires_authorization" value="true"/>
      <property name="application" value="DataMover-Lite"/>
    </service>
  </service>
  <service serviceType="SRM" base="srm://host.sample.gov:6288/srm/v2/server?SFN=/archive.sample.gov/"
           name="HRMatPCMDI" desc="SRM">
    <property name="requires_authorization" value="false"/>
  </service>

( StephenPascoe ) TODO: Define semantics of requires_authorization. The above says LAS and OPeNDAP do not require authorization. Is this right?

( StephenPascoe ) What is the purpose of the application property?
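
Until these semantics are defined, a client can at least read the service declarations mechanically. The following sketch flattens the service elements, including those nested in a Compound service, into a dictionary keyed by service name. Treating the string "true" as "authorization required" is our assumption, and the function name `flatten_services` is illustrative only.

```python
import xml.etree.ElementTree as ET

NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"

# Two of the service declarations from the example above.
SERVICES = """\
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
  <service serviceType="OpenDAP" base="/thredds/dodsC/" name="gridded" desc="PCMDI OPeNDAP">
    <property name="requires_authorization" value="false"/>
    <property name="application" value="Web Browser"/>
  </service>
  <service serviceType="Compound" base="" name="fileservice">
    <service serviceType="HTTPServer" base="/thredds/fileServer/" name="HTTPServer" desc="PCMDI TDS">
      <property name="requires_authorization" value="true"/>
      <property name="application" value="Web Browser"/>
      <property name="application" value="Web Script"/>
    </service>
  </service>
</catalog>
"""


def flatten_services(parent, found=None):
    """Map service name -> details, descending into Compound services."""
    found = {} if found is None else found
    for svc in parent.findall(f"{{{NS}}}service"):
        props = svc.findall(f"{{{NS}}}property")
        found[svc.get("name")] = {
            "type": svc.get("serviceType"),
            "base": svc.get("base"),
            # Assumption: the literal value "true" means authorization is needed.
            "requires_authorization": any(
                p.get("name") == "requires_authorization" and p.get("value") == "true"
                for p in props
            ),
            # The "application" property may repeat, so keep every value.
            "applications": [
                p.get("value") for p in props if p.get("name") == "application"
            ],
        }
        flatten_services(svc, found)  # nested services of a Compound service
    return found


services = flatten_services(ET.fromstring(SERVICES))
for name, info in services.items():
    print(name, info)
```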

Below the service elements, the publisher provides a property that records the version of this profile that was used to prepare the catalog.

<property name="catalog_version" value="2"/>

In general, THREDDS catalogs can contain any number of name/value pairs as property elements. THREDDS provides many ways to include metadata in the catalog, and ESG uses those elements wherever possible. For information such as the profile version, where there is no THREDDS-specific place to put it, ESG uses a property element.

Below the version property each sub-catalog contains a container dataset element. It is a container data set since it has no access elements which allow clients access to one or more of the above services. This data set contains most of the metadata which describes the data in this sub-catalog. It also contains data set elements which have access elements to allow clients access to the individual variables through the various services defined in the sub-catalog. The metadata in this data set may or may not apply to all of the sub-data sets. Whether the sub-data sets inherit the metadata is determined by the "inherited" attribute on the metadata element. If you are parsing the XML from one of these catalogs, you need to pay attention to this inheritance. The THREDDS parser in the Java CDM software from Unidata takes care of this for you.

  <dataset restrictAccess="esg-user"
      ID="cmip5.output1.NOAA-GFDL.gfdl-cm3.historical.mon.atmos.Amon.r1i1p1.v1"
      name="project=CMIP5 / IPCC Fifth Assessment Report, model=GFDL, experiment=historical, time_frequency=mon, modeling realm=atmos, run=r1i1p1, version=1">
    <property name="dataset_id" value="cmip5.output1.NOAA-GFDL.gfdl-cm3.historical.mon.atmos.Amon.r1i1p1"/>
    <property name="dataset_version" value="1"/>
    <property name="project" value="cmip5"/>
    <property name="experiment" value="historical"/>
    <property name="product" value="output1"/>
    <property name="model" value="gfdl-cm3"/>
    <property name="time_frequency" value="mon"/>
    <property name="realm" value="atmos"/>
    <property name="cmor_table" value="Amon"/>
    <property name="ensemble" value="r1i1p1"/>
    <property name="institute" value="NOAA-GFDL"/>
    <property name="forcing" value="Ghg,Sa,Oz,Lu,Sl,Vl,Ss,Bc,Md,Oc (Ghg includes CO2, CH4, N2O, CFC11, CFC12, HCFC22, CFC113)"/>
    <property name="title" value="NOAA GFDL GFDL-CM3, historical (run 1) experiment output for CMIP5 AR5"/>
    <property name="creation_time" value="2011-02-04 14:02:09"/>
    <property name="format" value="netCDF, CF-1.4"/>
    <property name="drs_id" value="cmip5.output1.NOAA-GFDL.gfdl-cm3.historical.mon.atmos.Amon.r1i1p1"/>

This collection of properties is used to describe the data as it relates to its place in the DRS, along with other details such as when the data was created and what forcing was used. [Bob - I have no idea really what to say about the specifics of these properties. Please fill this in as needed.]
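
To make the relationship between these properties concrete, here is a small sketch that reads them into a dictionary and rebuilds the identifiers. The facet order in `FACET_ORDER` and the ".v" + version rule are only what can be observed in this one example; neither is stated as a rule by the profile.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0}"

# A subset of the properties from the example above.
DATASET = """\
<dataset xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
         ID="cmip5.output1.NOAA-GFDL.gfdl-cm3.historical.mon.atmos.Amon.r1i1p1.v1">
  <property name="dataset_id" value="cmip5.output1.NOAA-GFDL.gfdl-cm3.historical.mon.atmos.Amon.r1i1p1"/>
  <property name="dataset_version" value="1"/>
  <property name="project" value="cmip5"/>
  <property name="experiment" value="historical"/>
  <property name="product" value="output1"/>
  <property name="model" value="gfdl-cm3"/>
  <property name="time_frequency" value="mon"/>
  <property name="realm" value="atmos"/>
  <property name="cmor_table" value="Amon"/>
  <property name="ensemble" value="r1i1p1"/>
  <property name="institute" value="NOAA-GFDL"/>
  <property name="drs_id" value="cmip5.output1.NOAA-GFDL.gfdl-cm3.historical.mon.atmos.Amon.r1i1p1"/>
</dataset>
"""

# Assumption: facet order inferred from the drs_id in this example only.
FACET_ORDER = [
    "project", "product", "institute", "model", "experiment",
    "time_frequency", "realm", "cmor_table", "ensemble",
]


def direct_properties(dataset):
    """Return the property elements directly under a dataset as a dict."""
    return {p.get("name"): p.get("value") for p in dataset.findall(NS + "property")}


dataset = ET.fromstring(DATASET)
props = direct_properties(dataset)
rebuilt_drs_id = ".".join(props[facet] for facet in FACET_ORDER)
versioned_id = f"{props['dataset_id']}.v{props['dataset_version']}"

print(rebuilt_drs_id == props["drs_id"])
print(versioned_id == dataset.get("ID"))
```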

    <metadata>
      <variables vocabulary="CF-1.0">
        <variable name="ts" vocabulary_name="surface_temperature" units="K">Surface Temperature</variable>
        <variable name="uas" vocabulary_name="eastward_wind" units="m s-1">Eastward Near-Surface Wind</variable>
        <variable name="ccb" vocabulary_name="air_pressure_at_convective_cloud_base" units="Pa">Air Pressure at Convective Cloud Base</variable>
        <variable name="cct" vocabulary_name="air_pressure_at_convective_cloud_top" units="Pa">Air Pressure at Convective Cloud Top</variable>
        <variable name="ci" vocabulary_name="Fraction of Time Convection Occurs" units="1">Fraction of Time Convection Occurs</variable>
        <variable name="clivi" vocabulary_name="atmosphere_cloud_ice_content" units="kg m-2">Ice Water Path</variable>
        <variable name="clt" vocabulary_name="cloud_area_fraction" units="%">Total Cloud Fraction</variable>
        <variable name="clwvi" vocabulary_name="atmosphere_cloud_condensed_water_content" units="kg m-2">Condensed Water Path</variable>
        <variable name="evspsbl" vocabulary_name="water_evaporation_flux" units="kg m-2 s-1">Evaporation</variable>
        <variable name="hfls" vocabulary_name="surface_upward_latent_heat_flux" units="W m-2">Surface Upward Latent Heat Flux</variable>
        <variable name="hfss" vocabulary_name="surface_upward_sensible_heat_flux" units="W m-2">Surface Upward Sensible Heat Flux</variable>
        <variable name="hur" vocabulary_name="relative_humidity" units="%">Relative Humidity</variable>
        <variable name="hurs" vocabulary_name="relative_humidity" units="%">Near-Surface Relative Humidity</variable>
        <variable name="hus" vocabulary_name="specific_humidity" units="1">Specific Humidity</variable>
        <variable name="huss" vocabulary_name="specific_humidity" units="1">Near-Surface Specific Humidity</variable>
        <variable name="pr" vocabulary_name="precipitation_flux" units="kg m-2 s-1">Precipitation</variable>
        <variable name="prc" vocabulary_name="convective_precipitation_flux" units="kg m-2 s-1">Convective Precipitation</variable>
        <variable name="prsn" vocabulary_name="snowfall_flux" units="kg m-2 s-1">Snowfall Flux</variable>
        <variable name="prw" vocabulary_name="atmosphere_water_vapor_content" units="kg m-2">Water Vapor Path</variable>
        <variable name="ps" vocabulary_name="surface_air_pressure" units="Pa">Surface Air Pressure</variable>
        <variable name="psl" vocabulary_name="air_pressure_at_sea_level" units="Pa">Sea Level Pressure</variable>
        <variable name="rlds" vocabulary_name="surface_downwelling_longwave_flux_in_air" units="W m-2">Surface Downwelling Longwave Radiation</variable>
        [... and many more variable elements ...]

The <[metadata](http://www.unidata.ucar.edu/projects/THREDDS/tech/catalog/v1.0.2/InvCatalogSpec.html#threddsMetadataGroup)> and <[variables](http://www.unidata.ucar.edu/projects/THREDDS/tech/catalog/v1.0.2/InvCatalogSpec.html#variablesType)> elements are standard THREDDS elements. They are used by ESG to describe the physical parameters that are contained in the sub-catalog. In this case we are describing the collection as a whole, and so this metadata is _not_ inherited by the sub-data sets.

( StephenPascoe ) NOTE: Current ESGF THREDDS catalogs do not put the variables element inside metadata; it is a child of the top-level dataset element. TODO: clarify the use of the metadata tag.
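
Because the variables element can appear in either position, a tolerant client may want to look in both. The sketch below does that for a single dataset element. The two inline documents are cut-down stand-ins for the two layouts described above, and `dataset_variables` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0}"

# Layout described in this page: variables inside metadata.
IN_METADATA = """\
<dataset xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" ID="a">
  <metadata>
    <variables vocabulary="CF-1.0">
      <variable name="ts" vocabulary_name="surface_temperature" units="K">Surface Temperature</variable>
      <variable name="uas" vocabulary_name="eastward_wind" units="m s-1">Eastward Near-Surface Wind</variable>
    </variables>
  </metadata>
</dataset>
"""

# Layout reported for current catalogs: variables directly under dataset.
IN_DATASET = """\
<dataset xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" ID="b">
  <variables vocabulary="CF-1.0">
    <variable name="hur" vocabulary_name="relative_humidity" units="%">Relative Humidity</variable>
  </variables>
</dataset>
"""


def dataset_variables(dataset):
    """Collect variables declared on a dataset or inside its metadata."""
    found = []
    for path in (NS + "variables", NS + "metadata/" + NS + "variables"):
        for group in dataset.findall(path):
            for var in group.findall(NS + "variable"):
                found.append({
                    "name": var.get("name"),
                    "vocabulary": group.get("vocabulary"),
                    "vocabulary_name": var.get("vocabulary_name"),
                    "units": var.get("units"),
                    "long_name": (var.text or "").strip(),
                })
    return found


from_metadata = dataset_variables(ET.fromstring(IN_METADATA))
from_dataset = dataset_variables(ET.fromstring(IN_DATASET))
print([v["name"] for v in from_metadata], [v["name"] for v in from_dataset])
```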

    <metadata inherited="true">
      <dataType>Grid</dataType>
      <dataFormat>NetCDF</dataFormat>
    </metadata>

THREDDS provides a standard way to describe the [underlying geometry of the data](http://www.unidata.ucar.edu/projects/THREDDS/tech/catalog/v1.0.2/InvCatalogSpec.html#dataType_descrip) and the [format](http://www.unidata.ucar.edu/projects/THREDDS/tech/catalog/v1.0.2/InvCatalogSpec.html#dataFormatType) that is used to store the data. The publisher provides them here, but as far as I know they will all be the same for CMIP5 data.
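
For clients that parse the XML directly rather than through the Java CDM library, the inheritance rule can be sketched as follows: metadata marked inherited="true" applies to the data set and everything nested under it, while other metadata applies to that data set only. This is a simplified sketch that handles only simple text-valued children such as dataType and dataFormat. The `authority` value and the IDs are made up for illustration.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0}"

SUB_CATALOG = """\
<dataset xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" ID="container">
  <metadata>
    <authority>hypothetical.example</authority>
  </metadata>
  <metadata inherited="true">
    <dataType>Grid</dataType>
    <dataFormat>NetCDF</dataFormat>
  </metadata>
  <dataset ID="container.file"/>
</dataset>
"""


def effective_metadata(dataset, inherited=None, out=None):
    """Map dataset ID -> metadata that applies to it, honouring inheritance."""
    inherited = dict(inherited or {})  # copy so sibling datasets are unaffected
    out = {} if out is None else out
    own = {}
    for md in dataset.findall(NS + "metadata"):
        target = inherited if md.get("inherited") == "true" else own
        for child in md:
            if len(child) == 0:  # simplification: simple text elements only
                target[child.tag[len(NS):]] = (child.text or "").strip()
    out[dataset.get("ID")] = {**inherited, **own}
    for sub in dataset.findall(NS + "dataset"):
        effective_metadata(sub, inherited, out)
    return out


resolved = effective_metadata(ET.fromstring(SUB_CATALOG))
for dataset_id, metadata in resolved.items():
    print(dataset_id, metadata)
```

Here the nested data set picks up dataType and dataFormat but not the non-inherited authority.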

    <access urlPath="?catid=0A05774572417402AB51EF2856959E20_ns_cmip5.output1.NOAA-GFDL.gfdl-cm3.historical.mon.atmos.Amon.r1i1p1.v1"
            serviceName="LASatPCMDI"
            dataFormat="NetCDF"/>

LAS attempts to give users access to an entire collection of related variables at one time through its own user interface client. This access element will initialize an LAS UI and give the user access to all of the variables in this data set container.

( StephenPascoe ) NOTE: Current ESGF THREDDS catalogs only use the access elements for LAS endpoints. They use a @urlPath attribute on the dataset element and declare a service element for file datasets and aggregations. TODO: clarify when access is allowed in the ESGF profile.

Finally, a collection of netCDF files that contain a particular variable is described in the catalog in a couple of different ways. These different ways of presenting the data give the THREDDS client scanning the catalog access to different services available from the TDS.

    <dataset name="hur_Amon_gfdl-cm3_historical_r1i1p1_195001-195412.nc"
             ID="cmip5.output1.NOAA-GFDL.gfdl-cm3.historical.mon.atmos.Amon.r1i1p1.v1.hur_Amon_gfdl-cm3_historical_r1i1p1_195001-195412.nc"
             urlPath="home_test/cmip5/gfdl/hur_Amon_gfdl-cm3_historical_r1i1p1_195001-195412.nc"
             serviceName="HTTPServer">
      <property name="file_id" value="cmip5.output1.NOAA-GFDL.gfdl-cm3.historical.mon.atmos.Amon.r1i1p1.hur_Amon_gfdl-cm3_historical_r1i1p1_195001-195412.nc"/>
      <property name="file_version" value="1"/>
      <property name="size" value="71552736"/>
      <property name="tracking_id" value="b965f99f-92fa-41eb-8ad0-f30fe23db4b0"/>
      <property name="mod_time" value="2010-12-06 10:08:35"/>
      <variables vocabulary="CF-1.0">
        <variable name="hur" vocabulary_name="relative_humidity" units="%">Relative Humidity</variable>
      </variables>
      <dataSize units="bytes">71552736</dataSize>
    </dataset>

This data set description gives access to the HTTPServer (bulk data download) service of the TDS for each of the individual files that contain the time series of this variable. (This is a degenerate case where there is only one file, which contains all of the time steps.)
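
A download client has to combine the service base with the urlPath of the file data set to obtain a URL. The sketch below shows that step using `urllib.parse.urljoin`. The catalog URL `http://tds.example.org/...` is a placeholder we invented, and the catalog text is a reduced version of the example above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

NS = "{http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0}"

# Hypothetical URL that the sub-catalog was fetched from.
CATALOG_URL = "http://tds.example.org/thredds/esgcet/1/catalog.xml"

CATALOG = """\
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
  <service serviceType="HTTPServer" base="/thredds/fileServer/" name="HTTPServer"/>
  <dataset name="hur_Amon_gfdl-cm3_historical_r1i1p1_195001-195412.nc"
           urlPath="home_test/cmip5/gfdl/hur_Amon_gfdl-cm3_historical_r1i1p1_195001-195412.nc"
           serviceName="HTTPServer">
    <property name="size" value="71552736"/>
    <dataSize units="bytes">71552736</dataSize>
  </dataset>
</catalog>
"""


def file_url(catalog, dataset, catalog_url):
    """Build the access URL as service base + urlPath, resolved against the catalog URL."""
    # iter() also finds services nested inside a Compound service.
    services = {s.get("name"): s for s in catalog.iter(NS + "service")}
    service = services[dataset.get("serviceName")]
    return urljoin(catalog_url, service.get("base") + dataset.get("urlPath"))


catalog = ET.fromstring(CATALOG)
dataset = catalog.find(NS + "dataset")
url = file_url(catalog, dataset, CATALOG_URL)
size = int(dataset.find(NS + "dataSize").text)
print(url, size)
```

The size is available both as the `size` property and as the standard dataSize element; the sketch reads the latter.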

    <dataset ID="cmip5.output1.NOAA-GFDL.gfdl-cm3.historical.mon.atmos.Amon.r1i1p1.hur.v1.aggregation"
             name="cmip5.output1.NOAA-GFDL.gfdl-cm3.historical.mon.atmos.Amon.r1i1p1.hur.v1.aggregation">
      <property name="aggregation_id" value="cmip5.output1.NOAA-GFDL.gfdl-cm3.historical.mon.atmos.Amon.r1i1p1.hur.v1.aggregation"/>
      <variables vocabulary="CF-1.0">
        <variable name="hur" vocabulary_name="relative_humidity" units="%">Relative Humidity</variable>
      </variables>
      <metadata inherited="true">
        <property name="z_values" value=" 100000.   92500.   85000.   77500.   70000.   60000.   50000.   40000.   30000.   25000.   20000.   15000.   10000.
    7000.    5000.    3000.    2000.    1000.     700.     500.     300.     200.     100."/>
        <geospatialCoverage>
          <northsouth>
            <start>-89.000000</start>
            <size>178.0</size>
            <units>degrees_north</units>
          </northsouth>
          <eastwest>
            <start>1.250000</start>
            <size>357.5</size>
            <units>degrees_east</units>
          </eastwest>
          <updown>
            <start>100000.000000</start>
            <size>-99900.0</size>
            <units>Pa</units>
          </updown>
        </geospatialCoverage>
        <timeCoverage>
          <start>1950-01-16T12:00:00</start>
          <end>1954-12-16T12:00:00</end>
        </timeCoverage>
      </metadata>
      <access urlPath="?catid=0A05774572417402AB51EF2856959E20_ns_cmip5.output1.NOAA-GFDL.gfdl-cm3.historical.mon.atmos.Amon.r1i1p1.hur.v1.aggregation" serviceName="LASatPCMDI" dataFormat="NetCDF"/>
      <dataset urlPath="cmip5.output1.NOAA-GFDL.gfdl-cm3.historical.mon.atmos.Amon.r1i1p1.hur.v1.aggregation.1"
               serviceName="gridded"
               ID="cmip5.output1.NOAA-GFDL.gfdl-cm3.historical.mon.atmos.Amon.r1i1p1.hur.v1.aggregation.1"
               name="cmip5.output1.NOAA-GFDL.gfdl-cm3.historical.mon.atmos.Amon.r1i1p1.hur.v1.aggregation - Subset 1">
        <property name="aggregation_id" value="cmip5.output1.NOAA-GFDL.gfdl-cm3.historical.mon.atmos.Amon.r1i1p1.hur.v1.aggregation.1"/>
        <property name="time_delta" value="1 month"/>
        <property name="calendar" value="noleap"/>
        <property name="start" value="1950-1-1 0:0:0.0"/>
        <property name="time_length" value="60"/>
        <access urlPath="?catid=0A05774572417402AB51EF2856959E20_ns_cmip5.output1.NOAA-GFDL.gfdl-cm3.historical.mon.atmos.Amon.r1i1p1.hur.v1.aggregation.1"
                serviceName="LASatPCMDI" dataFormat="NetCDF"/>
        <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
          <aggregation dimName="time" type="joinExisting">
            <netcdf ncoords="60" location="/home/drach1/data/cmip5/gfdl/hur_Amon_gfdl-cm3_historical_r1i1p1_195001-195412.nc"/>
          </aggregation>
        </netcdf>
      </dataset>
    </dataset>

This data set collects all of the individual files that make up the time series for this variable (in this case there is only one) into a single aggregation. This section also includes metadata elements that describe the real-world coordinates of the data for the entire aggregation. Since there is no standard metadata element to specify the levels of the vertical dimension, ESG uses a special property to include this information. The geospatial and time metadata are used by the LAS service to configure the user interface for these data; if the metadata is incomplete, LAS will skip this aggregation. The LAS access element included with the aggregation will lead to the same user interface as the one above.

This data set will also enable OPeNDAP access to the aggregation via the serviceName="gridded" attribute.
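
As an illustration of how the coverage metadata can be read, the sketch below turns each start/size pair of the geospatialCoverage element into a (start, end, units) range and parses the z_values property into a list of levels. It assumes end = start + size, which is how the THREDDS range elements are defined, and it uses the metadata block from the example above verbatim.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0}"

AGGREGATION_METADATA = """\
<metadata xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" inherited="true">
  <property name="z_values" value=" 100000.   92500.   85000.   77500.   70000.   60000.   50000.   40000.   30000.   25000.   20000.   15000.   10000.
    7000.    5000.    3000.    2000.    1000.     700.     500.     300.     200.     100."/>
  <geospatialCoverage>
    <northsouth><start>-89.000000</start><size>178.0</size><units>degrees_north</units></northsouth>
    <eastwest><start>1.250000</start><size>357.5</size><units>degrees_east</units></eastwest>
    <updown><start>100000.000000</start><size>-99900.0</size><units>Pa</units></updown>
  </geospatialCoverage>
</metadata>
"""


def axis_range(coverage, axis):
    """Return (start, end, units) for one axis, with end = start + size."""
    element = coverage.find(NS + axis)
    start = float(element.find(NS + "start").text)
    size = float(element.find(NS + "size").text)
    return start, start + size, element.find(NS + "units").text


metadata = ET.fromstring(AGGREGATION_METADATA)
coverage = metadata.find(NS + "geospatialCoverage")
ranges = {
    axis: axis_range(coverage, axis)
    for axis in ("northsouth", "eastwest", "updown")
}

# The ESG-specific z_values property is a whitespace-separated list of levels.
z_property = next(
    p for p in metadata.findall(NS + "property") if p.get("name") == "z_values"
)
z_values = [float(v) for v in z_property.get("value").split()]

print(ranges)
print(len(z_values), z_values[0], z_values[-1])
```

In this example the first and last vertical levels agree with the start and end of the updown range.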

( StephenPascoe ) We should be clear that this snippet is of a THREDDS configuration catalog, not what clients will see over HTTP. The NcML definition will not be shown to the client.

Known issues, grey areas and inconsistencies

ESGF THREDDS catalogs are now being consumed by a wide variety of tools:

  • IPSL's download tool

  • MOHC's download tool

  • Estani's Replication script?

  • Ad-hoc replication scripts at BADC

  • IS-ENES portal at DKRZ

This section gathers the experience these tools have had in staying compatible with ESGF THREDDS catalogs.

| Issue | Reported by |
| --- | --- |
| Catalogs can contain files labeled ".nc_0", ".nc_1" etc. This was unexpected and broke scripts. | StephenPascoe |
| Changes to a catalog can only be detected by downloading it, which is inefficient for large collections. A checksum or change date for each catalogRef would be very helpful. | Martin Juckes |
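
For the first issue, a script can be made tolerant of the suffixed names with a small pattern match. This is only a sketch: it assumes the suffix is always an underscore followed by digits, and the second file name in the list is constructed for illustration rather than taken from a real catalog.

```python
import re

# Assumption: a trailing "_<digits>" after ".nc" is the only variant to tolerate.
NETCDF_NAME = re.compile(r"^(?P<stem>.+)\.nc(?:_(?P<suffix>\d+))?$")


def split_netcdf_name(filename):
    """Return (stem, suffix) for *.nc and *.nc_N names, or None otherwise."""
    match = NETCDF_NAME.match(filename)
    if match is None:
        return None
    suffix = match.group("suffix")
    return match.group("stem"), (int(suffix) if suffix is not None else None)


names = [
    "hur_Amon_gfdl-cm3_historical_r1i1p1_195001-195412.nc",
    "hur_Amon_gfdl-cm3_historical_r1i1p1_195001-195412.nc_0",  # illustrative name
    "README.txt",
]
parsed = [split_netcdf_name(name) for name in names]
for name, result in zip(names, parsed):
    print(name, "->", result)
```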

Group Roadmap

To define an ESGF THREDDS XML profile which describes the format clients can expect from THREDDS catalogs used within ESGF.

  1. Clarify the scope of all elements, attributes or properties. Scopes could include:
     * ESGF-internal: Used internally by ESGF software. Content and semantics may change.
     * ESGF-external: Defined semantics. Designed for consumption by clients.
     * General THREDDS: Included for compatibility with general THREDDS clients.
     * DRS: Included to support the Data Reference Syntax (how DRS is used beyond CMIP5 also needs clarifying).
     * Project-specific: Non-DRS items that are specific to certain projects.
  2. Define which elements, attributes or properties are required, recommended and optional.
  3. For external scopes describe the semantics of each element. Document briefly which parts of the ESGF stack rely on each element: e.g. when content is to be displayed in a UI.
  4. Where multiple mechanisms for expressing the same thing exist in the THREDDS schema, define which are supported. E.g.
     * `<access>` element vs. `<dataset@urlPath>`/`<service>`
     * Compound service types
     * `<variables>` element as a child of `<metadata>` or `<dataset>`
     * Different uses of identifiers: `dataset@ID`, `property: drs_id`, `property: dataset_id`
  5. Define mechanisms for improving crawl performance of TDS servers.
     * Timestamping catalogRef elements to prevent re-crawling unchanged catalogs