# The CF Conventions and NetCDF Metadata

## Why Use a Metadata Standard?

<img src="https://openscience.cvut.cz/wp-content/uploads/sites/3/2022/12/RDM.png">

The **FAIR** Principles:

1. **F**indable: Data should be easy to find for both *humans* and *computers*. This means data should have a unique identifier and be described with rich metadata.

2. **A**ccessible: Once found, data should be easy to access, ideally through standardized methods. Even if the data is private, the metadata should be accessible to show the data exists.

3. **I**nteroperable: Data should be compatible with other data and tools. This makes it much easier to build analysis workflows that integrate different datasets and tools.

4. **R**eusable: Data should be well-described and documented so it can be reused in the future. This involves clear usage licenses and detailed information about the data.

Metadata standards help put the FAIR Principles into practice.

By definition, a *metadata standard* is a set of rules or guidelines that defines how metadata should be structured, described, and managed. It specifies the elements or attributes to be included, the semantics of those elements, and often the syntax or format in which the metadata should be encoded.

## Introduction to the CF Conventions

* A metadata standard for NetCDF files;

* Successor of the COARDS Conventions;

* Good flexibility and compatibility with other standards;

**What do the CF Conventions define?**

1. Standards for attributes (metadata).

2. Standards for the file structure of various kinds of data.

### 1. Standards for Attributes

In [1]:
import xarray as xr
ds = xr.open_dataset("/Users/icdc/Documents/NFDI/Kemeng/cfbook/src/data/tos_O1_2001-2002.nc",
                     decode_cf=False)
ds.info()

xarray.Dataset {
dimensions:
	lon = 180 ;
	bnds = 2 ;
	lat = 170 ;
	time = 24 ;

variables:
	float64 lon(lon) ;
		lon:standard_name = longitude ;
		lon:long_name = longitude ;
		lon:units = degrees_east ;
		lon:axis = X ;
		lon:bounds = lon_bnds ;
		lon:original_units = degrees_east ;
	float64 lon_bnds(lon, bnds) ;
	float64 lat(lat) ;
		lat:standard_name = latitude ;
		lat:long_name = latitude ;
		lat:units = degrees_north ;
		lat:axis = Y ;
		lat:bounds = lat_bnds ;
		lat:original_units = degrees_north ;
	float64 lat_bnds(lat, bnds) ;
	float64 time(time) ;
		time:standard_name = time ;
		time:long_name = time ;
		time:units = days since 2001-1-1 ;
		time:axis = T ;
		time:calendar = 360_day ;
		time:bounds = time_bnds ;
		time:original_units = seconds since 2001-1-1 ;
	float64 time_bnds(time, bnds) ;
	float32 tos(time, lat, lon) ;
		tos:standard_name = sea_surface_temperature ;
		tos:long_name = Sea Surface Temperature ;
		tos:units = K ;
		tos:cell_me

#### 1.1. Variable Attributes

* Required or strongly recommended:

    * `long_name`/`standard_name`: Describe what the variable represents. `long_name` is free text, while `standard_name` must be taken from the controlled vocabulary defined in the ["CF Standard Name Table"](https://cfconventions.org/Data/cf-standard-names/current/build/cf-standard-name-table.html). CF strongly recommends providing at least one of the two.

    * `units`: The unit of the variable, which should be parsable by the UDUNITS library. If a variable has a `standard_name`, its canonical units can be looked up in the "CF Standard Name Table" too.


* Optional (but can be important):

    * `valid_range`: Two numbers specifying the MIN and MAX valid values for this variable. Any values outside this range are treated as missing. Must not be defined if either `valid_min` or `valid_max` is defined.

    * `_FillValue`: Indicates missing data. It should be a scalar (a single value), lie outside the `valid_range`, and have the same data type as the variable. NetCDF defines a [default `_FillValue`](https://www.ncl.ucar.edu/Document/Language/fillval.shtml) for each data type. **Not allowed for coordinate variables**.

    * *`missing_value`: Usually indicates user-defined missing data, can be scalar or array. But NetCDF doesn't do anything special with this attribute.*

    * `scale_factor`/`add_offset`: Used for unpacking data for display. $$\text{unpackedData} = \text{scaleFactor} \times \text{storedData} + \text{addOffset}$$
    `valid_range`, `_FillValue` and `missing_value` should have the same data type as the stored data and should refer to the packed values, so that they can be interpreted before the data are unpacked for display.
        
    * `actual_range`: Must exactly equal the MIN and MAX of the (unpacked) variable.

    * `coordinates`: List of names of auxiliary coordinate variables (and optionally coordinate variables) separated by a blank. There is no restriction on the order in which the variable names appear in the string. [excerpt](https://docs.unidata.ucar.edu/netcdf-c/current/attribute_conventions.html)


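    * `axis`: Identifies a coordinate variable as the `X`, `Y`, `Z` or `T` axis, as in `lon:axis = X` in the listing above.

    * `bounds`: On a coordinate variable, names the variable that holds the cell boundaries, as in `lon:bounds = lon_bnds`.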

* Specific to gridded data:
    * `cell_methods`: Records how the value in each cell was derived from the underlying data, for example `time: mean` for a time average.

* Special Note:

    * For the `time` variable: `units` is a string of the form *[time-interval] since YYYY-MM-DD hh:mm:ss*. "seconds", "minutes", "hours", and "days" are the most commonly used time intervals. "months" and "years" are not recommended because their lengths vary.

    * For data variables: the `coordinates` attribute is a blank-separated string naming the auxiliary coordinate variables (and optionally coordinate variables). The order in which the names appear does not matter.
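
To make the time `units` convention concrete, here is a minimal sketch that decodes offsets given as `days since 2001-1-1` under the `360_day` calendar that appears in the listing earlier, where every month has exactly 30 days. The helper name `decode_360_day` and the sample offsets are hypothetical, and only whole-day offsets are handled. In practice a library such as `cftime` would normally do this work.

```python
def decode_360_day(days_since, ref=(2001, 1, 1)):
    """Convert a whole-day offset to (year, month, day) in the 360_day calendar.

    Every year has 12 months of 30 days, so plain integer arithmetic suffices.
    """
    ref_year, ref_month, ref_day = ref
    # Express the reference date as a day count, then add the offset.
    total = ref_year * 360 + (ref_month - 1) * 30 + (ref_day - 1) + int(days_since)
    year, day_of_year = divmod(total, 360)
    month, day = divmod(day_of_year, 30)
    return year, month + 1, day + 1


# Hypothetical offsets, e.g. mid-month values of a monthly series.
for offset in (15, 45, 360):
    print(offset, "->", decode_360_day(offset))
```

The same offset decodes to a different date under a standard Gregorian calendar, which is why the `calendar` attribute has to travel together with `units`.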

Some optional attributes, such as `source` and `references`, can be attached to both data variables and coordinate variables.
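
Returning to `scale_factor` and `add_offset` from the optional attributes: the sketch below applies the unpacking formula in plain Python and checks `_FillValue` and `valid_range` against the packed values before scaling. The function name `unpack`, the sample numbers, and the choice of `None` as the missing marker are all assumptions made for illustration. Readers such as xarray apply the same arithmetic when CF decoding is enabled.

```python
def unpack(stored, scale_factor=1.0, add_offset=0.0, fill_value=None, valid_range=None):
    """Unpack stored values; return None where the stored value is missing.

    fill_value and valid_range are compared against the packed values,
    i.e. before scale_factor and add_offset are applied.
    """
    unpacked = []
    for value in stored:
        is_fill = fill_value is not None and value == fill_value
        out_of_range = (
            valid_range is not None
            and not (valid_range[0] <= value <= valid_range[1])
        )
        if is_fill or out_of_range:
            unpacked.append(None)
        else:
            unpacked.append(scale_factor * value + add_offset)
    return unpacked


# Hypothetical sea surface temperatures packed as 16-bit integers (Kelvin).
stored = [0, 100, -32768, 2500]
values = unpack(stored, scale_factor=0.01, add_offset=273.15,
                fill_value=-32768, valid_range=(-1000, 5000))
print(values)
```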

#### 1.2. Global Attributes

* `title`: Short description of the file contents.

* `institution`: Where the original data was produced.

* `source`: Method of production of the original data.

* `Conventions`: Name of the metadata standard applied to this dataset.

* `references`: References that describe the data or the methods used to produce it.

* `history`: List of the applications that have modified the original data. Usually given as a global attribute.

```{note}
Including attributes that are not specified in the CF Conventions does not make a dataset incompatible with the CF Conventions.
```
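
As a rough illustration of the global attributes listed above, the following sketch keeps them in a plain dictionary, reports which ones are still missing, and adds a timestamped line to `history`. The helper names, the attribute values, and the `CF-1.11` version string are assumptions for the example. Appending (rather than prepending) new history lines is also a choice made here.

```python
from datetime import datetime, timezone

# The global attributes discussed above.
GLOBAL_ATTRIBUTES = ("title", "institution", "source",
                     "Conventions", "references", "history")


def missing_globals(attrs):
    """Return the names from GLOBAL_ATTRIBUTES that are absent or empty."""
    return [name for name in GLOBAL_ATTRIBUTES if not attrs.get(name)]


def append_history(attrs, command, now=None):
    """Add one timestamped line describing a processing step to `history`."""
    now = now or datetime.now(timezone.utc)
    line = f"{now:%Y-%m-%dT%H:%M:%SZ} {command}"
    previous = attrs.get("history", "")
    attrs["history"] = f"{previous}\n{line}" if previous else line
    return attrs


# Hypothetical attribute values for illustration only.
attrs = {
    "title": "Example sea temperature time series",
    "Conventions": "CF-1.11",
}
append_history(attrs, "created with a hypothetical conversion script")
print(missing_globals(attrs))
print(attrs["history"])
```

With xarray, such a dictionary could be merged into `ds.attrs` before the dataset is written.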

### 2. File Structure Recommendations for Diverse Data

* Grid: Satellite images, climate model data etc.

* Discrete Sampling Geometries (DSG) --- Discrete Points

    * Point: Unconnected points / stations, each contains only a single data element. E.g. Earthquake data, Lightning data.

    * Time Series: Data are taken over periods of time at one or more stations. E.g. Weather station data, Fixed buoys.

    * Profile: Connected observations are taken along a vertical line; each profile has a single time, lat and lon. E.g. Atmospheric profiles from satellites.

    * Trajectory: Data are taken along discrete paths through space, each path contains a set of connected points. E.g. Cruise data, drifting buoys.

    * Combined DSG:
        * Timeseries of Profiles: Profiles taken over periods of time at a fixed station.
        * Trajectory of Profiles: A collection of profiles along a trajectory. E.g. Ship soundings.

![img](https://live.staticflickr.com/65535/53865227144_8bfea699c4_c_d.jpg)

#### 2A. Example Template: [Single Time Series (H.2.3)](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#_single_time_series_including_deviations_from_a_nominal_fixed_spatial_location)

![img](https://live.staticflickr.com/65535/53865541569_c2c0846b23_b_d.jpg)

In [5]:
ds_h23 = xr.open_dataset("/Users/icdc/Documents/NFDI/Kemeng/M1-3/workshop2024/workshop_2024_how_to_create_publishable_netcdf_data/DATA/example/KFM_seaTemp_h23.netcdf")
ds_h23

#### 2B. Example Template: [Orthogonal multidimensional array representation of time series (H.2.1)](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#_orthogonal_multidimensional_array_representation_of_time_series)

**Multiple time series, same time series length in each, same time values.**

![img](https://live.staticflickr.com/65535/53865541599_b1709c860e_b_d.jpg)

In [13]:
ds_h21 = xr.open_dataset("/Users/icdc/Documents/NFDI/Kemeng/M1-3/workshop2024/workshop_2024_how_to_create_publishable_netcdf_data/DATA/example/KFM_seaTemp_h21.netcdf")
ds_h21
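
The orthogonal layout can be pictured as one shared time axis plus a two-dimensional data array indexed by station and time. The sketch below uses nested lists with made-up station names and readings. The names `times`, `stations`, `temp`, and `value_at` are hypothetical and not taken from the example file.

```python
# One time axis shared by all stations (e.g. days since a reference date).
times = [0, 1, 2]
stations = ["A", "B"]

# temp[station][time]: every row has exactly len(times) entries.
temp = [
    [14.1, 14.3, 14.2],  # station A
    [15.0, 15.2, 15.1],  # station B
]


def value_at(station, t):
    """Look up the reading of one station at one time on the shared axis."""
    return temp[stations.index(station)][times.index(t)]


print(value_at("B", 1))
```

Because the time values are stored only once, this layout fits only when every station reports at the same times.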

#### 2C. Example Template: [Incomplete multidimensional array representation of time series (H.2.2)](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#_incomplete_multidimensional_array_representation_of_time_series)

**Multiple time series, same time series length in each, different time values.**

![img](https://live.staticflickr.com/65535/53865541579_cd88445c3e_b_d.jpg)

In [25]:
ds_h22 = xr.open_dataset("/Users/icdc/Documents/NFDI/Kemeng/M1-3/workshop2024/workshop_2024_how_to_create_publishable_netcdf_data/DATA/example/KFM_seaTemp_h22.netcdf")
ds_h22
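
In the incomplete multidimensional layout each station keeps its own time values, so the time variable becomes two-dimensional as well, and any row shorter than the longest one is padded with missing values. The sketch below shows that padding step. The fill value `-9999.0`, the function name `to_incomplete_array`, and the sample series are assumptions for illustration.

```python
FILL = -9999.0  # hypothetical fill value marking unused elements


def to_incomplete_array(series, fill=FILL):
    """Pad per-station (times, values) pairs to a common length.

    Returns two 2-D lists, time[station][obs] and data[station][obs].
    """
    width = max(len(times) for times, _ in series)
    time_2d, data_2d = [], []
    for times, values in series:
        pad = [fill] * (width - len(times))
        time_2d.append(list(times) + pad)
        data_2d.append(list(values) + pad)
    return time_2d, data_2d


# Two hypothetical stations sampled at different times.
series = [
    ([0, 1, 2], [1.0, 2.0, 3.0]),
    ([0.5, 1.5], [4.0, 5.0]),
]
time_2d, data_2d = to_incomplete_array(series)
print(time_2d)
print(data_2d)
```

The trade-off is wasted space: every station is allocated as many elements as the longest series.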

#### 2D. Example Template: [Contiguous ragged array representation of time series (H.2.4)](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#_contiguous_ragged_array_representation_of_time_series)

**Multiple time series, each has different length, dataset complete.**

![img](https://live.staticflickr.com/65535/53865434693_78513dbff0_b_d.jpg)

In [26]:
ds_h24 = xr.open_dataset("/Users/icdc/Documents/NFDI/Kemeng/M1-3/workshop2024/workshop_2024_how_to_create_publishable_netcdf_data/DATA/example/KFM_seaTemp_h24.netcdf")
ds_h24
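
A contiguous ragged array stores all observations in one flat array, station after station, together with a count variable giving the number of observations per station (in CF this count variable carries the `sample_dimension` attribute). The sketch below recovers the individual series from such a pair. The function name `split_contiguous` and the numbers are hypothetical; `row_size` follows the naming used in the CF examples.

```python
def split_contiguous(flat, row_size):
    """Split a flat observation array into per-station series.

    row_size[i] is the number of consecutive observations of station i.
    """
    if sum(row_size) != len(flat):
        raise ValueError("row_size must sum to the number of observations")
    series, start = [], 0
    for count in row_size:
        series.append(flat[start:start + count])
        start += count
    return series


# Hypothetical data: three stations with 3, 1 and 2 observations.
obs = [1, 2, 3, 4, 5, 6]
row_size = [3, 1, 2]
print(split_contiguous(obs, row_size))  # [[1, 2, 3], [4], [5, 6]]
```

This layout wastes no space, but all observations of a station must be written together, which is why it suits completed datasets.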

#### 2E. Example Template: [Indexed ragged array representation of time series (H.2.5)](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#_indexed_ragged_array_representation_of_time_series)

**Multiple time series, each has different length, dataset incomplete (additional data anticipated).**

![img](https://live.staticflickr.com/65535/53864279122_0cc95d1b60_b_d.jpg)

In [27]:
ds_h25 = xr.open_dataset("/Users/icdc/Documents/NFDI/Kemeng/M1-3/workshop2024/workshop_2024_how_to_create_publishable_netcdf_data/DATA/example/KFM_seaTemp_h25.netcdf")
ds_h25
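
An indexed ragged array also stores observations in one flat array, but in any order: a parallel index variable records, for each observation, the zero-based index of the station it belongs to (in CF this index variable carries the `instance_dimension` attribute). The sketch below groups observations by that index. The function name `group_by_index` and the sample values are hypothetical.

```python
def group_by_index(flat, station_index, n_stations):
    """Group flat observations by the station index stored alongside them."""
    if len(flat) != len(station_index):
        raise ValueError("each observation needs exactly one station index")
    groups = [[] for _ in range(n_stations)]
    for value, index in zip(flat, station_index):
        groups[index].append(value)
    return groups


# Hypothetical observations arriving interleaved from three stations.
obs = [10, 20, 11, 30, 21]
station_index = [0, 1, 0, 2, 1]
print(group_by_index(obs, station_index, n_stations=3))  # [[10, 11], [20, 21], [30]]
```

Since new observations can simply be appended to the end of both arrays, this layout suits datasets that are still growing.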