# NetCDF Metadata and the CF Conventions

In the previous chapter, we learned about the core content structure of a standard netCDF dataset. Metadata is clearly an essential component that makes a netCDF file self-describing. In this chapter, we are going to learn about the CF Conventions, a widely adopted metadata standard for netCDF files. The first question that may arise is: why would we want to standardize metadata at all? To answer it, let's think about a few situations we may encounter when conducting research:

* I would like to use this or that dataset. Where do I find it?

* How do I know if there are any other datasets available that can be valuable for my research? And where do I find such datasets?

* I want the dataset in an analysis-ready state, ideally suited for joint use with other datasets. If it is not, how can I transform it for joint use?

And if you are a data provider, you are very likely concerned about:

* How can I make sure my dataset reaches the users who need it?

Today, the earth system science (ESS) communities are generating, exchanging, and consuming huge amounts of data. On the one hand, it is often challenging for data users to find the right datasets in an enormous pool of data; on the other hand, data providers face difficulties in bringing their datasets to users. In today's research data management (RDM) ecosystem, enriching and standardizing the metadata of all kinds of data is a basic strategy and an effective practice for meeting these challenges, because it greatly improves the findability of a dataset by machines. Moreover, standardized metadata eases the joint use of data and promotes its reusability.

In a broader sense, metadata standardization helps to achieve the FAIR principles. "FAIR" stands for:

1. **Findable**: Data should be easy to find for both *humans* and *computers*. This means data should have a unique identifier and be described with rich metadata.

2. **Accessible**: Once found, data should be easy to access, ideally through standardized methods. Even if the data is private, the metadata should be accessible to show the data exists.

3. **Interoperable**: Data should be compatible with other data and tools. This makes it much easier to deploy data analysis workflows that integrate different data and tools.

4. **Reusable**: Data should be well-described and documented so it can be reused in the future. This involves clear usage licenses and detailed information about the data.

Research data become more useful and valuable to the scientific community and other users when the FAIR principles are implemented, and metadata standardization is a core practice in implementing the FAIR principles. In the case of netCDF data, the CF Conventions is becoming a widely adopted metadata standard.

## What are the CF Conventions?

By definition, a *metadata standard* is a set of rules or guidelines that defines how metadata should be structured, described, and managed. It specifies the elements or attributes to be included, the semantics of those elements, and often the syntax or format in which the metadata should be encoded. As a metadata standard for the netCDF data format, the CF Conventions provide definitions of the different variable types in a netCDF dataset (coordinate variable, auxiliary coordinate variable, boundary variable, etc.), give guidelines on the content structure of different types of data (grid, point, time series, trajectory, profile), and standardize which metadata should be included and how they are annotated.

First, let's make it clear that the CF Conventions is not the only metadata standard for netCDF files. Its predecessor, the COARDS Conventions, is also a widely implemented metadata standard, and big data providers sometimes have their own metadata standards. However, the CF Conventions is becoming more and more popular in the community thanks to its flexibility and its compatibility with other standards. For instance, the COARDS Conventions places a rigid restriction on the order of dimensions while the CF Conventions doesn't. And as the successor of the COARDS Conventions, the CF Conventions is *backward compatible* with COARDS, which means programs that can process CF-conforming datasets should also be able to process COARDS-conforming datasets.
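To make the dimension-order difference concrete: COARDS expects dimensions that represent time (T), height or depth (Z), latitude (Y), and longitude (X) to appear in the relative order T, Z, Y, X, whereas CF only recommends that order. The following is a minimal sketch of such a check; the function name `follows_coards_order`, the `axis_of` mapping, and the dimension names are hypothetical and chosen only for illustration.

```python
COARDS_ORDER = "TZYX"


def follows_coards_order(dims, axis_of):
    """Check that the T, Z, Y, X dimensions appear in that relative order.

    `dims` is the ordered tuple of dimension names of one variable and
    `axis_of` maps a dimension name to "T", "Z", "Y", or "X".
    Dimensions without a known axis are ignored in this sketch.
    """
    ranks = [COARDS_ORDER.index(axis_of[d]) for d in dims if d in axis_of]
    return ranks == sorted(ranks)


# Hypothetical dimension names, used for illustration only.
axis_of = {"time": "T", "lat": "Y", "lon": "X"}

ordered = follows_coards_order(("time", "lat", "lon"), axis_of)
reversed_order = follows_coards_order(("lon", "lat", "time"), axis_of)
print(ordered, reversed_order)
```

A variable that fails this check can still conform to CF, but a tool written strictly against COARDS may not handle it.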

Now let's take a closer look at how the CF Conventions formulates the metadata in a netCDF file with an example [dataset](https://www.unidata.ucar.edu/software/netcdf/examples/tos_O1_2001-2002.nc) that contains sea surface temperatures collected by [PCMDI](https://en.wikipedia.org/wiki/Program_for_Climate_Model_Diagnosis_and_Intercomparison).

In [1]:
import xarray as xr

ds = xr.open_dataset("/Users/icdc/Documents/NFDI/Kemeng/cfbook/src/data/tos_O1_2001-2002.nc",
                     decode_cf=False)
ds.info()

xarray.Dataset {
dimensions:
	lon = 180 ;
	bnds = 2 ;
	lat = 170 ;
	time = 24 ;

variables:
	float64 lon(lon) ;
		lon:standard_name = longitude ;
		lon:long_name = longitude ;
		lon:units = degrees_east ;
		lon:axis = X ;
		lon:bounds = lon_bnds ;
		lon:original_units = degrees_east ;
	float64 lon_bnds(lon, bnds) ;
	float64 lat(lat) ;
		lat:standard_name = latitude ;
		lat:long_name = latitude ;
		lat:units = degrees_north ;
		lat:axis = Y ;
		lat:bounds = lat_bnds ;
		lat:original_units = degrees_north ;
	float64 lat_bnds(lat, bnds) ;
	float64 time(time) ;
		time:standard_name = time ;
		time:long_name = time ;
		time:units = days since 2001-1-1 ;
		time:axis = T ;
		time:calendar = 360_day ;
		time:bounds = time_bnds ;
		time:original_units = seconds since 2001-1-1 ;
	float64 time_bnds(time, bnds) ;
	float32 tos(time, lat, lon) ;
		tos:standard_name = sea_surface_temperature ;
		tos:long_name = Sea Surface Temperature ;
		tos:units = K ;
		tos:cell_methods = time: mean (interval: 30 

## NetCDF Metadata Compliant with the CF Conventions

### Variable Attributes

In the CF Conventions, metadata attributes are classified as either "required" or "optional".

The attribute `units` is required for every variable that represents a dimensional quantity, and at least one of the attributes `long_name` and `standard_name` should be provided. Both `long_name` and `standard_name` tell what the variable is; `standard_name` must be a term from the controlled vocabulary defined in the ["CF Standard Name Table"](https://cfconventions.org/Data/cf-standard-names/current/build/cf-standard-name-table.html), while `long_name` allows data providers to describe the variable in their own words. If a variable has a `standard_name`, its canonical units can be looked up in the "CF Standard Name Table" as well. If a variable cannot be named by a term from the "CF Standard Name Table", data providers should describe it with `long_name` and provide `units` that can be parsed by the UDUNITS library.
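As a rough illustration of the rule above, the sketch below scans a set of variables and reports those that lack `units` or have neither `standard_name` nor `long_name`. It works on plain dictionaries of attributes (the same shape that xarray exposes as `ds[name].attrs`); the helper name `missing_descriptive_attrs` and the variable `mystery` are made up for this example, and this is not a full CF compliance check.

```python
def missing_descriptive_attrs(variables):
    """Return {variable name: [missing items]} for incomplete variables.

    `variables` maps a variable name to its attribute dictionary.
    """
    problems = {}
    for name, attrs in variables.items():
        issues = []
        if "units" not in attrs:
            issues.append("units")
        if "standard_name" not in attrs and "long_name" not in attrs:
            issues.append("standard_name or long_name")
        if issues:
            problems[name] = issues
    return problems


# Attributes copied from the example dataset, plus one hypothetical
# variable ("mystery") invented here to trigger the check.
variables = {
    "lon": {
        "standard_name": "longitude",
        "long_name": "longitude",
        "units": "degrees_east",
    },
    "tos": {
        "standard_name": "sea_surface_temperature",
        "long_name": "Sea Surface Temperature",
        "units": "K",
    },
    "mystery": {"comment": "no units and no name"},
}

report = missing_descriptive_attrs(variables)
print(report)
```

Boundary variables such as `lon_bnds` carry no attributes of their own in the example because they take them from their parent coordinate variable, so a more complete checker would skip them.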

As you can see, the example dataset contains three coordinate variables: `lon`, `lat`, and `time`. In the CF Conventions, a longitude variable is identified by its `units`, for which "degrees_east" is the recommended form, and it may additionally carry "longitude" as `standard_name`; similarly, a latitude variable uses "degrees_north" as `units` and "latitude" as `standard_name`. A time variable is identified by a `units` string of the form "[time-interval] since YYYY-MM-DD hh:mm:ss" and may carry "time" as `standard_name`; in our example the unit of the time variable is "days since 2001-1-1". "seconds", "minutes", "hours", and "days" are the most commonly used time intervals; using "months" or "years" is not recommended because their lengths vary. If you aren't yet familiar with coordinate variables in a netCDF file, you can learn about them in the [previous chapter](netcdf_101.ipynb).
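To see how a time `units` string turns numbers into dates, here is a minimal sketch that parses the "[time-interval] since YYYY-MM-DD" form and decodes a value in the `360_day` calendar used by the example dataset, where every month has exactly 30 days. The function names are hypothetical, the time-of-day part of the reference date is ignored, and the input values are illustrative rather than read from the file. In practice a dedicated library such as `cftime` should be preferred.

```python
import math
import re

# Number of each supported interval that fits into one day.
UNITS_PER_DAY = {"days": 1, "hours": 24, "minutes": 1440, "seconds": 86400}


def parse_time_units(units):
    """Split '<interval> since Y-M-D' into (interval, (year, month, day))."""
    match = re.match(r"^\s*(\w+)\s+since\s+(\d{1,4})-(\d{1,2})-(\d{1,2})", units)
    if match is None:
        raise ValueError(f"not a CF time unit: {units!r}")
    interval = match.group(1)
    if interval not in UNITS_PER_DAY:
        raise ValueError(f"unsupported interval: {interval!r}")
    year, month, day = (int(match.group(i)) for i in (2, 3, 4))
    return interval, (year, month, day)


def decode_360_day(value, units):
    """Convert a numeric time value to (year, month, day) in a 360_day calendar."""
    interval, (year, month, day) = parse_time_units(units)
    # With 30-day months, a date maps onto a simple running day count.
    reference = year * 360 + (month - 1) * 30 + (day - 1)
    total = reference + math.floor(value / UNITS_PER_DAY[interval])
    return total // 360, (total % 360) // 30 + 1, total % 30 + 1


# Illustrative values, not taken from the file.
first = decode_360_day(15, "days since 2001-1-1")
second = decode_360_day(45, "days since 2001-1-1")
print(first, second)
```

The same date results when the offset is expressed in another interval, for example `decode_360_day(1296000, "seconds since 2001-1-1")`, which is why the reference date and the interval must always be read together.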

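The coordinate variables in the example also carry the attributes `axis` and `bounds`. `axis` identifies the spatial or temporal axis that the coordinate represents ("X", "Y", "Z", or "T"), while `bounds` names the boundary variable that stores the limits of each cell, e.g. `lon_bnds` for `lon`.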

Like a coordinate variable, a data variable needs the attribute `units` and should carry at least one of `long_name` and `standard_name`. Other common attributes for a data variable include:

* `cell_methods`: Records the method used to derive the values that represent each cell.

* `_FillValue`: A value used to represent missing or undefined data. Not allowed for coordinate variables but allowed for auxiliary coordinate variables.

* `missing_value`: A value or values used to represent missing or undefined data. Not allowed for coordinate variables but allowed for auxiliary coordinate variables.

* `history`: A list of the applications that have modified the original data. It usually appears as a global attribute.

* `coordinates`: A list of the names of auxiliary coordinate variables (and optionally coordinate variables), separated by blanks. There is no restriction on the order in which the variable names appear in the string. [excerpt](https://docs.unidata.ucar.edu/netcdf-c/current/attribute_conventions.html)
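The attributes `_FillValue` and `missing_value` matter as soon as you compute with the data, because the flagged numbers must be excluded first. The sketch below shows the idea with plain Python lists; the helper name `mask_missing` and the fill value `1e20` are assumptions made for illustration. xarray performs this masking automatically when a file is opened with its default decoding, which was switched off above with `decode_cf=False`.

```python
import math


def mask_missing(values, attrs):
    """Replace values flagged by `_FillValue` or `missing_value` with NaN."""
    flags = set()
    for key in ("_FillValue", "missing_value"):
        if key in attrs:
            flag = attrs[key]
            # `missing_value` may hold several values; normalise to a list.
            flags.update(flag if isinstance(flag, (list, tuple)) else [flag])
    return [math.nan if value in flags else value for value in values]


# Hypothetical temperatures in K with an assumed fill value of 1e20.
attrs = {"_FillValue": 1e20, "missing_value": 1e20}
masked = mask_missing([290.5, 1e20, 285.0], attrs)

valid = [v for v in masked if not math.isnan(v)]
print(sum(valid) / len(valid))
```

Without the masking step, the single fill value would dominate the mean and produce a physically meaningless result.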

Some optional attributes, such as `source` and `references`, can be attached to both data variables and coordinate variables.

### Global Attributes

* `title`: A short description of the file contents.

* `institution`: Where the original data was produced.

* `source`: The method of production of the original data.

* `Conventions`: The name of the conventions followed by the dataset.

* `references`: References that describe the data or the methods used to produce it.

One of the most important global attributes to include is `Conventions`, which names the metadata convention applied to the dataset. A metadata convention standardizes how the metadata in a dataset is formulated, and is therefore crucial for aligning a dataset with the [FAIR principles](https://www.nature.com/articles/sdata201618).
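Software can use this attribute to decide how to interpret a file. As a minimal sketch, the hypothetical helper `cf_version` below reads the CF version from a dictionary of global attributes (the shape xarray exposes as `ds.attrs`); it assumes that several conventions may be listed in one string, separated by commas or blanks, and the attribute values shown are illustrative.

```python
import re


def cf_version(global_attrs):
    """Return the declared CF version as (major, minor), or None."""
    declared = global_attrs.get("Conventions", "")
    # Several conventions may be listed, separated by commas or blanks.
    for token in re.split(r"[,\s]+", declared.strip()):
        match = re.fullmatch(r"CF-(\d+)\.(\d+)", token)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


# Illustrative attribute values, not read from the example file.
single = cf_version({"Conventions": "CF-1.11"})
combined = cf_version({"Conventions": "CF-1.8, ACDD-1.3"})
absent = cf_version({"title": "no conventions declared"})
print(single, combined, absent)
```

Returning the version as a tuple of integers keeps comparisons correct, since `(1, 11) > (1, 8)` holds while the string comparison `"1.11" > "1.8"` does not.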

A full list of the attributes defined by the CF Conventions is available in the [documentation](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#attribute-appendix) of the CF Conventions.

The example dataset also carries attributes that are specific to it rather than defined by the CF Conventions, such as `original_units` on coordinate and data variables, and `contact`, `project_id`, and `cmor_version` among the global attributes.

```{note}
It should be noted that including attributes that are not specified in the CF Conventions doesn't make a dataset incompatible with the CF Conventions.
```
