## What is netCDF?

netCDF (Network Common Data Form), developed by Unidata, is a data format widely used to create, access, and share array-oriented scientific data. In the earth sciences, datasets are predominantly published in this format, for several reasons:

* A netCDF file includes both the data and the information about the data (metadata), making it *self-describing*.

* A netCDF file can be accessed by different types of computers, and one writer and multiple readers can access the same file at the same time, so it is *portable* and *sharable*.

* Subsets of the data can be extracted easily, and additional data can be appended to an existing file as long as the data structure matches, so it is a *scalable* and *appendable* data format.

* Earlier forms of netCDF data remain supported by current and future versions of the netCDF software, so it's a good option for *data archiving* too.
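
Some of these properties can be tried out without any file at hand. The following is a minimal sketch, assuming `xarray` and `numpy` are installed. The helper `make_block` and every value in it are made up for illustration, and the appending happens in memory rather than in a file on disk.

```python
import numpy as np
import xarray as xr


def make_block(hours):
    """Build a small stand-in dataset for the given time values (hypothetical helper)."""
    data = np.zeros((len(hours), 2, 3), dtype="float32")
    return xr.Dataset(
        data_vars={"air": (("time", "lat", "lon"), data, {"units": "degK"})},
        coords={"time": hours, "lat": [10.0, -10.0], "lon": [0.0, 120.0, 240.0]},
        attrs={"title": "hypothetical example dataset"},
    )


first = make_block([0.0, 24.0])
later = make_block([48.0, 72.0])

# Self-describing: the unit travels with the variable itself.
units = first["air"].attrs["units"]

# Subsetting: pick a single latitude by its coordinate value.
subset = first["air"].sel(lat=10.0)

# Appending: join a second block with the same structure along `time`.
combined = xr.concat([first, later], dim="time")

print(units, subset.dims, combined.sizes["time"])
```

Both blocks share the same dimensions and variables, which is what "the data structure matches" amounts to in this small case.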

A netCDF file can be difficult to interpret, especially if we are new to this data format. So let's walk through the content structure of a netCDF file together in this chapter!

## netCDF Content Structure

A netCDF file typically includes the **data** itself and the information about the data (also called **metadata**). In the earth sciences, the data are often physical variables, e.g. sea water temperature, earth surface temperature, or wind speed. Measurements of these variables commonly span multiple longitudes and latitudes, as well as time and/or vertical dimensions such as altitude, depth, or pressure level. Therefore, the data stored in a netCDF file are usually multi-dimensional.

The metadata, on the other hand, provide attribute information about the data. In a standardized netCDF file, the metadata should describe the individual variables as well as the dataset as a whole. Important questions, such as what the dataset is about, which variables are measured, and in which units they are measured, should be answered by the metadata, so that data users can interpret the dataset solely by inspecting the dataset itself.

To take a closer look at the content structure of a standard netCDF file, we use the NCEP-NCAR reanalysis [dataset](https://downloads.psl.noaa.gov/Datasets/ncep.reanalysis/Dailies/surface_gauss/) of daily mean air temperature at 2 meters above the surface for the year 1948 as an example.

In [2]:
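import xarray as xr

# decode_cf=False keeps the values and attributes exactly as stored in the file
# (no decoding of time units, scale factors, or fill values).
ds = xr.open_dataset("/Users/icdc/Documents/NFDI/Kemeng/cfbook/src/data/air.2m.gauss.1948.nc",
                     decode_cf=False)
ds.info()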

xarray.Dataset {
dimensions:
	lat = 94 ;
	lon = 192 ;
	time = 366 ;
	nbnds = 2 ;

variables:
	float32 lat(lat) ;
		lat:units = degrees_north ;
		lat:actual_range = [ 88.542 -88.542] ;
		lat:long_name = Latitude ;
		lat:standard_name = latitude ;
		lat:axis = Y ;
	float32 lon(lon) ;
		lon:units = degrees_east ;
		lon:long_name = Longitude ;
		lon:actual_range = [  0.    358.125] ;
		lon:standard_name = longitude ;
		lon:axis = X ;
	float64 time(time) ;
		time:long_name = Time ;
		time:delta_t = 0000-00-01 00:00:00 ;
		time:avg_period = 0000-00-01 00:00:00 ;
		time:standard_name = time ;
		time:axis = T ;
		time:units = hours since 1800-01-01 00:00:0.0 ;
		time:actual_range = [1297320. 1306080.] ;
		time:coordinate_defines = start ;
	float32 air(time, lat, lon) ;
		air:long_name = mean Daily Air temperature at 2 m ;
		air:units = degK ;
		air:precision = 2 ;
		air:GRIB_id = 11 ;
		air:GRIB_name = TMP ;
		air:var_desc = Air temperature ;
		air:dataset = NCEP Reanalysis Daily Averages ;
		

As we can see, this dataset is made up of four dimensions (`lat`, `lon`, `time`, `nbnds`) and five variables (`lat`, `lon`, `time`, `air`, `time_bnds`).

Let's take a look at the variable `lat(lat)`. The `lat` on the left is the name of the variable, while the bracketed `lat` on the right refers to the dimension `lat`. This notation indicates that the variable `lat` is a function of the dimension `lat`, in other words, a one-dimensional array along the dimension `lat`. Such a variable is a **coordinate variable**. A coordinate variable must be one-dimensional, must have the same name as the dimension it depends on, must be strictly monotonic (increasing or decreasing), and doesn't allow missing values. It labels the coordinates along the corresponding dimension. In this example, `lat`, `lon`, and `time` are all coordinate variables, labeling the dimensions `lat`, `lon`, and `time` respectively.
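
The requirements for a coordinate variable can be turned into a small check. This is only a sketch: `is_coordinate_variable` is a hypothetical helper and not part of xarray, it assumes numeric values, and `toy` is a small in-memory stand-in rather than the reanalysis file.

```python
import numpy as np
import xarray as xr


def is_coordinate_variable(dataset, name):
    """Check the coordinate-variable requirements for one variable (hypothetical helper)."""
    var = dataset[name]
    # One-dimensional, and named after the dimension it depends on.
    if var.dims != (name,):
        return False
    # No missing values allowed.
    if bool(var.isnull().any()):
        return False
    # Strictly increasing or strictly decreasing.
    steps = np.diff(var.values)
    return bool(np.all(steps > 0) or np.all(steps < 0))


# Stand-in dataset with made-up values that mimics the layout shown above.
toy = xr.Dataset(
    data_vars={"air": (("time", "lat", "lon"), np.zeros((2, 3, 4), dtype="float32"))},
    coords={
        "time": [1297320.0, 1297344.0],
        "lat": [88.542, 0.0, -88.542],
        "lon": [0.0, 90.0, 180.0, 270.0],
    },
)

for name in ("lat", "lon", "time", "air"):
    print(name, is_coordinate_variable(toy, name))
```

Note that `lat` passes the check although its values decrease from north to south: the requirement is monotonicity, not a particular direction.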

However, if a variable that holds coordinate data doesn't fulfill these requirements, it can be added as an **auxiliary coordinate variable**. Unlike a true coordinate variable, an auxiliary coordinate variable can be multidimensional, and there is no relationship between its name and the name(s) of its dimension(s). Its values don't have to be monotonic or unique, and missing values are allowed. In our example, `time_bnds` is an auxiliary coordinate variable.
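
A common case for auxiliary coordinates is a grid whose latitudes and longitudes are two-dimensional. The sketch below shows how this can look in xarray, assuming `xarray` and `numpy` are installed. The names `grid`, `temp`, `lat2d`, and `lon2d` and all values are invented for illustration and do not come from the reanalysis file.

```python
import numpy as np
import xarray as xr

# Two-dimensional latitude/longitude arrays on a (y, x) grid (made-up values).
lat2d = np.array([[10.0, 10.5, 11.0], [20.0, 20.5, 21.0]])
lon2d = np.array([[5.0, 6.0, 7.0], [5.5, 6.5, 7.5]])

grid = xr.Dataset(
    data_vars={"temp": (("y", "x"), np.zeros((2, 3)))},
    coords={
        "lat2d": (("y", "x"), lat2d),
        "lon2d": (("y", "x"), lon2d),
    },
)

# Coordinates whose name is not a dimension name are not coordinate variables.
aux_names = sorted(name for name in grid.coords if name not in grid.sizes)

print(aux_names)
print(grid["lat2d"].dims)
```

Here `lat2d` depends on both `y` and `x`, and its name matches neither dimension, which is exactly what a coordinate variable is not allowed to do.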

Unlike the coordinate variables, `air` is the **data variable** in this example, containing the actual air temperature values. `air(time, lat, lon)` indicates that this variable contains a three-dimensional array that spans the dimensions `time`, `lat`, and `lon`. Thus we know that air temperature is available on a latitude-longitude grid at multiple time steps.
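
To make the role of the three dimensions concrete, the following sketch builds a stand-in for `air` with the same dimension sizes as in the output above, filled with zeros instead of real temperatures (an assumption made so that the example runs without the data file).

```python
import numpy as np
import xarray as xr

# Stand-in for the data variable: same dimension names and sizes, dummy values.
air = xr.DataArray(
    np.zeros((366, 94, 192), dtype="float32"),
    dims=("time", "lat", "lon"),
    name="air",
)

# Fixing one time step leaves a two-dimensional (lat, lon) field.
one_day = air.isel(time=0)

# Fixing one grid cell leaves a one-dimensional time series.
one_cell = air.isel(lat=0, lon=0)

print(air.dims, air.shape)
print(one_day.dims, one_day.shape)
print(one_cell.dims, one_cell.shape)
```

With the real file opened as `ds`, the same `isel` calls could be applied to `ds["air"]`.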




In [6]:
ds.time

In [7]:
ds.time_bnds

Now that we have learned about the data in this dataset, let's have a look at the metadata. As you may have already noticed, attribute information is provided for almost all the variables (**variable attributes**) as well as for the dataset as a whole (**global attributes**).

Variable attributes provide information on individual variables, such as what was measured, in which units the data are presented, etc.
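
The `units` attribute of `time` is a good example of why variable attributes matter: without it, values such as `1297320.` would be meaningless. The sketch below decodes the two `actual_range` values from the output above using only the Python standard library. The helper `decode_hours_since` is hypothetical, and it assumes that the unit is hours, that the reference time is midnight, and that the standard Gregorian calendar applies.

```python
from datetime import datetime, timedelta

# Attribute values copied from the `time` variable in the output above.
time_units = "hours since 1800-01-01 00:00:0.0"
actual_range = [1297320.0, 1306080.0]


def decode_hours_since(value, units):
    """Convert a raw time value to a datetime (hypothetical helper)."""
    unit, _, reference = units.partition(" since ")
    if unit != "hours":
        raise ValueError(f"unsupported time unit: {unit!r}")
    # Only the date part of the reference is parsed; midnight is assumed.
    reference_date = datetime.strptime(reference.split()[0], "%Y-%m-%d")
    return reference_date + timedelta(hours=value)


first_day, last_day = (decode_hours_since(v, time_units) for v in actual_range)
print(first_day.date(), last_day.date())  # 1948-01-01 1948-12-31
```

xarray performs this kind of decoding itself when a file is opened with the default `decode_cf=True`; it is spelled out here only to show what the attribute encodes.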

Global attributes may contain a wide range of information about a dataset, such as its title and description.

One of the important global attributes to include is `Conventions`, which names the metadata convention(s) applied to the dataset. A metadata convention standardizes how the metadata in a dataset are formulated, and is thus a crucial component in making a dataset more aligned with the [FAIR principles](https://www.nature.com/articles/sdata201618). We will dive deeper into metadata conventions in the next chapter.
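
Global attributes behave like a dictionary of names and values, so the conventions a file claims to follow can be read programmatically. The attribute is spelled `Conventions` in netCDF files. The sketch below is a minimal example with a hypothetical helper `list_conventions` and made-up attribute values; it assumes that several convention names are separated by blanks, or by commas if a comma is present.

```python
def list_conventions(global_attrs):
    """Return the convention names listed in a `Conventions` attribute."""
    raw = str(global_attrs.get("Conventions", ""))
    separator = "," if "," in raw else None  # None splits on whitespace
    return [name.strip() for name in raw.split(separator) if name.strip()]


# Hypothetical global attributes; with xarray, `ds.attrs` could be passed instead.
example_attrs = {
    "title": "hypothetical example dataset",
    "Conventions": "CF-1.8 ACDD-1.3",
}

print(list_conventions(example_attrs))
print(list_conventions({"title": "no conventions declared"}))
```

An empty result means the file does not declare any convention, in which case the meaning of its attributes has to be worked out from other documentation.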

In summary, we learned the core content structure of a standardized netCDF file in this chapter. 
In a standard netCDF file, dimensions, variables, and metadata are the three main building blocks. The three types of variables are:

* Coordinate Variable
* Auxiliary Coordinate Variable
* Data Variable

We can recognize two kinds of metadata:
* Variable Attribute
* Global Attribute

In the next chapter, we'll learn about CF-Conventions, a broadly adopted metadata convention in Earth System Science.

[excerpt](https://docs.unidata.ucar.edu/netcdf-c/current/attribute_conventions.html)

Following the CF (Climate and Forecast) conventions for netCDF metadata, we define an auxiliary coordinate variable as any netCDF variable that contains coordinate data, but is not a coordinate variable (See Coordinate Variables). Unlike coordinate variables, there is no relationship between the name of an auxiliary coordinate variable and the name(s) of its dimension(s).

The value of the coordinates attribute is a blank separated list of names of auxiliary coordinate variables and (optionally) coordinate variables. There is no restriction on the order in which the variable names appear in the coordinates attribute string.
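
The excerpt can be illustrated with a small sketch that uses only the Python standard library. The `header` dictionary is a made-up stand-in for a file header (variable name mapped to its dimensions and attributes), and `auxiliary_coordinates` is a hypothetical helper, not a function of any netCDF library.

```python
# Hypothetical header: variable name -> (dimensions, attributes).
header = {
    "x": (("x",), {}),
    "y": (("y",), {}),
    "lat2d": (("y", "x"), {"units": "degrees_north"}),
    "lon2d": (("y", "x"), {"units": "degrees_east"}),
    "temp": (("y", "x"), {"units": "K", "coordinates": "lon2d lat2d"}),
}


def auxiliary_coordinates(variables):
    """Collect names listed in `coordinates` attributes that are not coordinate variables."""
    found = set()
    for _dims, attrs in variables.values():
        # The attribute value is a blank-separated list of variable names.
        for name in attrs.get("coordinates", "").split():
            is_coordinate_variable = name in variables and variables[name][0] == (name,)
            if name in variables and not is_coordinate_variable:
                found.add(name)
    return sorted(found)


print(auxiliary_coordinates(header))
```

Because the order of names inside the attribute carries no meaning, the helper returns them sorted.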

In [32]:
ds.to_dataarray()[1][0][0][0]