# The CF Conventions --- A Metadata Standard for netCDF Data Format

## Why should we adopt a metadata standard?

In the earth system science (ESS) community, huge amounts of data are generated, exchanged, and consumed. Finding the right dataset in such an enormous pool is often a time-consuming task for researchers, and it is just as challenging for data providers to bring their data products to users. To meet this challenge in today's research ecosystem, it is highly recommended to enrich and standardize the metadata of all kinds of research outputs. Doing so makes your data more findable, interoperable, and reusable by other stakeholders in the community, and thus contributes to the FAIR principles of open science.

A **metadata standard**, by definition, is a set of rules or guidelines that defines how metadata should be structured, described, and managed. It specifies the elements or attributes to be included, the semantics of those elements, and often the syntax or format in which the metadata should be encoded. 

The CF Conventions is becoming a widely adopted metadata standard for the netCDF data format. It is the successor of the COARDS Conventions and is characterised by its flexibility and compatibility with other metadata standards. Broadly speaking, the CF Conventions does two important things:

1. Providing standards for attributes: What kind of attributes should be included? How are the attributes named?

2. Giving recommendations on the data structure in netCDF for different types of data (e.g. gridded data, discrete points, etc.)

In this chapter, we will have a look at the most foundational and important content of each aspect.


## 1. Standards for Attributes

**Example**
* Sea surface temperatures collected by the Program for Climate Model Diagnosis & Intercomparison (PCMDI): https://pcmdi.llnl.gov/about.html
* Data source: https://www.unidata.ucar.edu/software/netcdf/examples/files.html

In [1]:
import xarray as xr
ds = xr.open_dataset("~/data/tos_O1_2001-2002.nc",
                     decode_cf=False)

In [8]:
ds

In [9]:
ds.info()

xarray.Dataset {
dimensions:
	lon = 180 ;
	bnds = 2 ;
	lat = 170 ;
	time = 24 ;

variables:
	float64 lon(lon) ;
		lon:standard_name = longitude ;
		lon:long_name = longitude ;
		lon:units = degrees_east ;
		lon:axis = X ;
		lon:bounds = lon_bnds ;
		lon:original_units = degrees_east ;
	float64 lon_bnds(lon, bnds) ;
	float64 lat(lat) ;
		lat:standard_name = latitude ;
		lat:long_name = latitude ;
		lat:units = degrees_north ;
		lat:axis = Y ;
		lat:bounds = lat_bnds ;
		lat:original_units = degrees_north ;
	float64 lat_bnds(lat, bnds) ;
	float64 time(time) ;
		time:standard_name = time ;
		time:long_name = time ;
		time:units = days since 2001-1-1 ;
		time:axis = T ;
		time:calendar = 360_day ;
		time:bounds = time_bnds ;
		time:original_units = seconds since 2001-1-1 ;
	float64 time_bnds(time, bnds) ;
	float32 tos(time, lat, lon) ;
		tos:standard_name = sea_surface_temperature ;
		tos:long_name = Sea Surface Temperature ;
		tos:units = K ;
		tos:cell_methods = time: mean (interval: 30 

### tos
- Now, the most important variable: it contains the 3D field (time, lat, lon) of sea surface temperature values
    - the data type is shown in front of the variable name, in this case **float** (32-bit, shown as float32 by xarray)
    - important attributes:
        - **units**, in this case **"K"**, meaning kelvin
        - **standard_name**, here "sea_surface_temperature"
        - **_FillValue**: wherever the 3D field has no value, the placeholder value **1.e+20** is stored instead

```yaml
    float tos(time, lat, lon) ;
		tos:standard_name = "sea_surface_temperature" ;
		tos:long_name = "Sea Surface Temperature" ;
		tos:units = "K" ;
		tos:cell_methods = "time: mean (interval: 30 days)" ;
		tos:_FillValue = 1.e+20f ;
		tos:missing_value = 1.e+20f ;
		tos:original_name = "sosstsst" ;
		tos:original_units = "degC" ;
		tos:history = " At   16:37:23 on 01/11/2005: CMOR altered the data in the following ways: added 2.73150E+02 to yield output units;  Cyclical dimension was output starting at a different lon;" ;
```                
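To make the role of `_FillValue` concrete, here is a minimal sketch in plain Python (no netCDF library involved) of how a reader could treat the placeholder as missing data. The sample values and the helper names `mask_fill` and `mean_ignoring_missing` are hypothetical; only the fill value `1.e+20` comes from the listing above.

```python
import math

FILL_VALUE = 1.0e20  # value of tos:_FillValue in the listing above


def mask_fill(values, fill_value=FILL_VALUE):
    """Replace fill values with NaN so they can be skipped later.

    A relative tolerance is used because a float32 fill value read from
    a file is not bit-identical to the float64 literal 1.0e20.
    """
    return [
        math.nan if math.isclose(v, fill_value, rel_tol=1e-6) else v
        for v in values
    ]


def mean_ignoring_missing(values):
    """Average only the entries that are not NaN."""
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else math.nan


# Hypothetical sample: two valid cells and one missing cell
raw = [271.5, 1.0e20, 273.5]
masked = mask_fill(raw)
print(mean_ignoring_missing(masked))  # 272.5
```

Without the masking step, the placeholder would dominate any statistic computed from the field. Libraries such as xarray can apply this masking for you when CF decoding is left switched on.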

### time
- Next, we look at the time coordinate variable `time`, which has data type double:
    - The main attributes are again
        - **units**, which follows a specific format:
            - it is always a **time interval since a reference time (origin)**, in this case days since 2001-1-1
        - **standard_name**: in this case **"time"**
    - The **calendar** attribute (here "360_day") tells the reader how to convert these offsets into dates

```yaml
    double time(time) ;
		time:standard_name = "time" ;
		time:long_name = "time" ;
		time:units = "days since 2001-1-1" ;
		time:axis = "T" ;
		time:calendar = "360_day" ;
		time:bounds = "time_bnds" ;
		time:original_units = "seconds since 2001-1-1" ;
```                
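The combination of `units` and `calendar` is what lets software turn the stored numbers into dates. The following is a rough sketch, not a full CF time decoder: it assumes the interval is `days`, that the origin has no time-of-day part, and that the calendar is `360_day` (12 months of 30 days each). The function names and the offsets 15 and 375 are hypothetical.

```python
import re


def parse_time_units(units):
    """Split a units string like 'days since 2001-1-1' into its parts.

    Only the date-only form is handled in this sketch.
    """
    match = re.fullmatch(
        r"(\w+) since (\d{1,4})-(\d{1,2})-(\d{1,2})", units.strip()
    )
    if match is None:
        raise ValueError(f"unsupported time units: {units!r}")
    interval = match.group(1)
    year, month, day = (int(g) for g in match.group(2, 3, 4))
    return interval, (year, month, day)


def decode_360_day(offset_days, origin):
    """Convert a whole-day offset to (year, month, day) in a 360_day calendar."""
    year, month, day = origin
    # Count days since year 0, with every month exactly 30 days long
    total = year * 360 + (month - 1) * 30 + (day - 1) + int(offset_days)
    year, rest = divmod(total, 360)
    month, day = divmod(rest, 30)
    return year, month + 1, day + 1


interval, origin = parse_time_units("days since 2001-1-1")
print(interval, origin)             # days (2001, 1, 1)
print(decode_360_day(15, origin))   # (2001, 1, 16)
print(decode_360_day(375, origin))  # (2002, 1, 16)
```

Note that an offset of 375 days lands in January 2002 here, whereas in the standard calendar it would fall a few days earlier; this is why the `calendar` attribute must not be ignored.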

### 1.1. Variable Attributes

* Required:

    * `long_name`/`standard_name`: Describes what the variable represents. A `standard_name` must be taken from the controlled vocabulary defined in the ["CF Standard Name Table"](https://cfconventions.org/Data/cf-standard-names/current/build/cf-standard-name-table.html).

    * `units`: The unit of the variable; it should be parsable by the [UDUNITS library](https://www.unidata.ucar.edu/software/udunits/). If a variable has a `standard_name`, its canonical units can also be looked up in the "CF Standard Name Table".
        * The `units` of a `time` variable must be a string of the form `[time-interval] since YYYY-MM-DD hh:mm:ss`. "seconds", "minutes", "hours", and "days" are the most commonly used time intervals; "months" and "years" are not recommended because their length varies.



* Optional (but can be important):

    * `axis`: Identifies latitude, longitude, vertical or time axes. (`X` for longitude, `Y` for latitude, `Z` for vertical axis, `T` for time)

    * `coordinates`: List of names of auxiliary coordinate variables (and optionally coordinate variables) separated by a blank. There is no restriction on the order in which the variable names appear in the string.

    * `valid_range`: Two numbers specifying the MIN and MAX valid values for this variable. Any values outside this range are treated as missing. Must not be defined if either `valid_min` or `valid_max` is defined.

    * `_FillValue`: Indicates missing data. It should be scalar (only one value) and outside of the `valid_range`. *Not allowed for coordinate variables*.

    * `scale_factor`/`add_offset`: Used to unpack the stored (packed) data, for example for display:

    $$\text{unpackedData} = \text{scaleFactor} \times \text{storedData} + \text{addOffset}$$

    * `actual_range`: Must be exactly equal to the MIN and MAX of the (unpacked) variable.

    





* Specific for Grid Data

    * `cell_methods`: Records the method used to derive the values that represent each cell; given as a string of the form `name: method`, e.g. `time: mean`.

    * `bounds`: The name of the variable that contains the vertices of the cell boundaries.

    * Boundary variables inherit some attributes of their parent variable.
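The unpacking formula and the range attributes above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than a reference implementation: the packed values, `scale_factor`, `add_offset`, and `valid_range` below are invented, and the sketch assumes that `valid_range` is expressed in the packed (stored) data type, so it is checked before unpacking.

```python
def unpack(stored, scale_factor=1.0, add_offset=0.0, valid_range=None):
    """Apply unpacked = scale_factor * stored + add_offset.

    Stored values outside valid_range are treated as missing (None).
    """
    unpacked = []
    for s in stored:
        in_range = valid_range is None or valid_range[0] <= s <= valid_range[1]
        unpacked.append(scale_factor * s + add_offset if in_range else None)
    return unpacked


def actual_range(unpacked):
    """Return (min, max) of the unpacked values, ignoring missing ones."""
    valid = [v for v in unpacked if v is not None]
    return (min(valid), max(valid))


# Hypothetical packed temperatures stored as integers with 0.5 K resolution
stored = [0, 10, 20, -32768]
values = unpack(stored, scale_factor=0.5, add_offset=270.0,
                valid_range=(-1000, 1000))
print(values)                # [270.0, 275.0, 280.0, None]
print(actual_range(values))  # (270.0, 280.0)
```

Packing trades precision for file size: the integers take less space than floats, and the two attributes carry everything a reader needs to recover the physical values.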


### 1.2. Global Attributes

* `title`: Short description of the file contents.

* `institution`: Where the original data was produced.

* `source`: Method used to produce the original data.

* `Conventions`: Name of the metadata standard applied to this dataset.

* `references`: References that describe the data or the data production method, e.g. papers.

* `history`: List of actions taken to modify the original data.

Notes:

1. Including attributes that are not specified in the CF Conventions doesn't make a dataset incompatible with the CF Conventions.

2. For more information about individual attributes, see [Appendix A](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#attribute-appendix) of the CF Conventions.
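A simple way to keep track of the global attributes listed above is to check a dataset's attribute dictionary against them. The sketch below uses a plain `dict` with made-up values as a stand-in for a real file's global attributes; the helper name `missing_global_attrs` is hypothetical.

```python
# Global attributes discussed in this section
GLOBAL_ATTRS = (
    "title", "institution", "source", "Conventions", "references", "history",
)


def missing_global_attrs(attrs):
    """Return the listed attribute names that are absent or empty."""
    return [
        name for name in GLOBAL_ATTRS
        if not str(attrs.get(name, "")).strip()
    ]


# Hypothetical attributes; "project" is not a CF attribute but is harmless
attrs = {
    "title": "Monthly mean sea surface temperature",
    "Conventions": "CF-1.11",
    "history": "2005-11-01: converted to netCDF",
    "project": "demo",
}
print(missing_global_attrs(attrs))  # ['institution', 'source', 'references']
```

The extra `project` key does not trigger a complaint, which mirrors the first note above: additional attributes do not break CF compatibility.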

## 2. File Structure Recommendations for Diverse Data

* **Grid**: Satellite images, climate model data etc.

* Discrete Sampling Geometries (DSG)

    * **Point**: Unconnected points / stations, each contains only a single data element (e.g. Earthquake data, Lightning data).

    * **Time Series**: Data are taken over periods of time at a single station (e.g. Weather station data, Fixed buoys).

    * **Profile**: Data are taken along a vertical line at a single station (e.g. Atmospheric profiles from satellites).

    * **Trajectory**: Data are taken along a path through space, each trajectory contains a set of connected points (e.g. Cruise data, drifting buoys).

    * Combined DSG:
        * **Timeseries of Profiles**: Profiles taken over periods of time for a fixed station.
        * **Trajectory of Profiles**: A collection of profiles along a trajectory (e.g. Ship soundings).

For DSG netCDF files, the global attribute `featureType` should be included. Its value must be one of the following: `point`, `timeSeries`, `profile`, `trajectory`, `timeSeriesProfile`, `trajectoryProfile`.
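As a small illustration, the allowed `featureType` values can be checked with a lookup table. This sketch treats the value as case-insensitive and returns the canonical spelling; that behaviour is an assumption of the example, and the function name `check_feature_type` is hypothetical.

```python
FEATURE_TYPES = (
    "point", "timeSeries", "profile", "trajectory",
    "timeSeriesProfile", "trajectoryProfile",
)

# Map lower-cased spellings to the canonical ones (assumed case-insensitive)
_CANONICAL = {name.lower(): name for name in FEATURE_TYPES}


def check_feature_type(global_attrs):
    """Return the canonical featureType or raise if missing or unknown."""
    value = global_attrs.get("featureType")
    if value is None:
        raise KeyError("featureType is missing from the global attributes")
    try:
        return _CANONICAL[value.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown featureType: {value!r}") from None


print(check_feature_type({"featureType": "timeseries"}))  # timeSeries
```

A value such as `grid` raises a `ValueError` in this sketch, because gridded data is not a discrete sampling geometry and does not use `featureType`.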

![img](https://live.staticflickr.com/65535/53876343350_2425d3064b_o_d.jpg)

Note: Merging everything is not always the best solution; when things get complicated, storing each feature separately is a good alternative.