# What is netCDF?

netCDF (Network Common Data Form), developed by Unidata, is a data format widely used for creating, accessing, and sharing array-oriented scientific data. In Earth system science (ESS), it is the go-to format for publishing datasets, largely because of several key advantages:

* A netCDF file combines the data itself with crucial information about the data (metadata), which makes the file *self-describing*.

* A netCDF file can be accessed from various computing platforms, and one writer and multiple readers can access the same file at the same time, which makes the format *portable* and *shareable*.

* With netCDF, extracting specific data subsets or adding new data to an existing file is straightforward, thanks to its *scalable* and *appendable* nature.

* Current and future versions of the netCDF software continue to support earlier forms of netCDF data, which makes the format well suited to *data archiving*.

Despite its widespread use, interpreting a netCDF file can be challenging, especially for newcomers to the format. This chapter aims to simplify the understanding of a netCDF file's content structure through a guided exploration.

## NetCDF Content Structure

In general, a netCDF file contains data and attributes about the data, i.e. metadata. The "data" are structured by **dimensions** and variables (**coordinate variables**, **data variables**, and occasionally other types of variables). The `dimensions` define how the data are arranged, while the `coordinate variables` hold the coordinate values along each dimension. The `data variables` are often physical variables in the ESS domain, e.g. sea water temperature, Earth surface temperature, or wind speed. Typically, these variables are measured across multiple dimensions, e.g. longitude, latitude, and time and/or a vertical dimension such as altitude, depth, or pressure level.

Besides the data itself, two types of "metadata" should be made available in a standardized netCDF file. One is the **variable attributes**, giving information about individual coordinate variables as well as data variables; and the other one is the **global attributes**, containing information about the dataset as a whole.
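
The building blocks named above can be sketched with a small in-memory dataset. This is only an illustration: the grid, the temperature values, and all attribute texts below are made up, and `toy` is a hypothetical name that does not refer to the example file used later in this chapter.

```python
import numpy as np
import xarray as xr

# Made-up coordinate values for a tiny 2 x 3 grid at 2 time steps.
lat = np.array([10.0, 0.0], dtype="float32")
lon = np.array([0.0, 1.875, 3.75], dtype="float32")
time = np.array([0.0, 24.0])

# Made-up temperatures in kelvin, arranged as (time, lat, lon).
air = np.full((time.size, lat.size, lon.size), 280.0, dtype="float32")

toy = xr.Dataset(
    # Data variable: (dimension names, values, variable attributes).
    data_vars={
        "air": (("time", "lat", "lon"), air,
                {"long_name": "air temperature", "units": "K"}),
    },
    # Coordinate variables: each is named after the dimension it describes.
    coords={
        "lat": ("lat", lat, {"units": "degrees_north"}),
        "lon": ("lon", lon, {"units": "degrees_east"}),
        "time": ("time", time, {"units": "hours since 2000-01-01 00:00:00"}),
    },
    # Global attributes describe the dataset as a whole.
    attrs={"title": "Toy dataset for illustration"},
)

print(dict(toy.sizes))      # dimensions and their lengths
print(list(toy.coords))     # coordinate variables
print(list(toy.data_vars))  # data variables
print(toy["air"].attrs)     # variable attributes of `air`
print(toy.attrs)            # global attributes
```

Each printed item corresponds to one of the terms introduced in this section: dimensions, coordinate variables, data variables, variable attributes, and global attributes.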

### Data

To ease the understanding of the content structure of a netCDF file, we'll look at an [example dataset](https://downloads.psl.noaa.gov/Datasets/ncep.reanalysis/Dailies/surface_gauss/) from the NCEP-NCAR reanalysis, containing daily mean air temperature at 2 meters above the Earth's surface for the year 1948.

In [1]:
import xarray as xr

ds = xr.open_dataset("/Users/icdc/Documents/NFDI/Kemeng/cfbook/src/data/air.2m.gauss.1948.nc",
                     decode_cf=False)
ds.info()

xarray.Dataset {
dimensions:
	lat = 94 ;
	lon = 192 ;
	time = 366 ;
	nbnds = 2 ;

variables:
	float32 lat(lat) ;
		lat:units = degrees_north ;
		lat:actual_range = [ 88.542 -88.542] ;
		lat:long_name = Latitude ;
		lat:standard_name = latitude ;
		lat:axis = Y ;
	float32 lon(lon) ;
		lon:units = degrees_east ;
		lon:long_name = Longitude ;
		lon:actual_range = [  0.    358.125] ;
		lon:standard_name = longitude ;
		lon:axis = X ;
	float64 time(time) ;
		time:long_name = Time ;
		time:delta_t = 0000-00-01 00:00:00 ;
		time:avg_period = 0000-00-01 00:00:00 ;
		time:standard_name = time ;
		time:axis = T ;
		time:units = hours since 1800-01-01 00:00:0.0 ;
		time:actual_range = [1297320. 1306080.] ;
		time:coordinate_defines = start ;
	float32 air(time, lat, lon) ;
		air:long_name = mean Daily Air temperature at 2 m ;
		air:units = degK ;
		air:precision = 2 ;
		air:GRIB_id = 11 ;
		air:GRIB_name = TMP ;
		air:var_desc = Air temperature ;
		air:dataset = NCEP Reanalysis Daily Averages ;
		

As we see, the example dataset consists of four dimensions (`lat`, `lon`, `time`, `nbnds`) and five variables (`lat`, `lon`, `time`, `air`, `time_bnds`).

By convention, a coordinate variable has the same name as the dimension whose coordinate values it specifies. Moreover, a coordinate variable must be one-dimensional, must have strictly increasing or decreasing values, and must not contain missing values {cite}`Eaton:2023`. Of the five variables in this dataset, `lat`, `lon`, and `time` are coordinate variables.
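
The requirements listed above can be turned into a small check. The following is a minimal sketch, not a full CF compliance test: `is_coordinate_variable` is a hypothetical helper, and it assumes that missing values are represented as NaN (in a real file they are usually marked by a fill value instead).

```python
import numpy as np

def is_coordinate_variable(name, dims, values):
    """Return True if a variable meets the coordinate-variable requirements."""
    values = np.asarray(values, dtype="float64")
    # Must be one-dimensional and share the name of its only dimension.
    if len(dims) != 1 or dims[0] != name or values.ndim != 1:
        return False
    # Must not contain missing values (assumed here to be stored as NaN).
    if np.isnan(values).any():
        return False
    # Must be strictly increasing or strictly decreasing.
    steps = np.diff(values)
    return bool(np.all(steps > 0) or np.all(steps < 0))

# Strictly decreasing latitudes qualify.
print(is_coordinate_variable("lat", ("lat",), [88.54, 86.65, 84.75]))    # True
# A repeated value breaks strict monotonicity.
print(is_coordinate_variable("lon", ("lon",), [0.0, 1.875, 1.875]))      # False
# A two-dimensional variable can never be a coordinate variable.
print(is_coordinate_variable("time_bnds", ("time", "nbnds"),
                             [[0.0, 24.0], [24.0, 48.0]]))               # False
```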


In [12]:
print(ds.lon.coords)
print(ds.lat.coords)
print(ds.time.coords)

Coordinates:
  * lon      (lon) float32 768B 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1
Coordinates:
  * lat      (lat) float32 376B 88.54 86.65 84.75 82.85 ... -84.75 -86.65 -88.54
Coordinates:
  * time     (time) float64 3kB 1.297e+06 1.297e+06 ... 1.306e+06 1.306e+06


```{note}
Variables are displayed as `variable_name(dimension)`. For instance, in `lat(lat)`, the `lat` outside the parentheses is the variable name, while the `lat` inside the parentheses refers to the dimension `lat`. This indicates that the variable `lat` is a function of the dimension `lat`; in other words, the variable `lat` spans the dimension `lat`.
```
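
Because the file was opened with `decode_cf=False`, the `time` values printed above are raw numbers. Their meaning follows from the `time:units` attribute, `hours since 1800-01-01 00:00:0.0`. The sketch below decodes the two values given in `time:actual_range` using only the Python standard library; `decode_hours_since` is a hypothetical helper, and the sketch assumes the standard (Gregorian) calendar. With its default settings, xarray performs this kind of decoding automatically.

```python
from datetime import datetime, timedelta

def decode_hours_since(offset_hours, reference):
    """Convert an offset in hours into a datetime relative to a reference date."""
    return reference + timedelta(hours=offset_hours)

# Reference date taken from the `time:units` attribute.
reference = datetime(1800, 1, 1)

# First and last time values, taken from `time:actual_range`.
first, last = 1297320.0, 1306080.0

print(decode_hours_since(first, reference))  # 1948-01-01 00:00:00
print(decode_hours_since(last, reference))   # 1948-12-31 00:00:00
```

The two results span the leap year 1948, which matches the length of the `time` dimension (366).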

A variable that contains coordinate data but doesn't fulfill the requirements of a coordinate variable is considered an **auxiliary coordinate variable**. Unlike a coordinate variable, an auxiliary coordinate variable can be multidimensional, and there is no relationship between its name and the name(s) of its dimension(s). Its values don't have to be monotonic or unique, and missing values are allowed. In our example, `time_bnds` is an auxiliary coordinate variable.

```{note}
The variable `time_bnds` defines the bounds of the time intervals. For computational convenience, a time coordinate is often the midpoint of a time interval, so some data creators provide more information about the intervals by defining a time boundary variable. Boundary variables are often provided for longitude and latitude as well.
```
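
To illustrate what a variable with dimensions `(time, nbnds)` looks like, the sketch below builds interval bounds from a few time coordinates. It is an assumption-laden example, not the content of the example file: it assumes that each coordinate marks the start of its interval (as the `time:coordinate_defines = start` attribute suggests) and that each interval is 24 hours long; `bounds_from_starts` is a hypothetical helper.

```python
import numpy as np

def bounds_from_starts(starts, step):
    """Build an (n, 2) array of [start, end] pairs from interval start points."""
    starts = np.asarray(starts, dtype="float64")
    # Column 0 holds the lower bound, column 1 the upper bound of each interval.
    return np.stack([starts, starts + step], axis=1)

# Three consecutive daily time steps in "hours since 1800-01-01" (assumed spacing).
starts = np.array([1297320.0, 1297344.0, 1297368.0])
time_bnds = bounds_from_starts(starts, step=24.0)

print(time_bnds.shape)  # (3, 2): one row per time step, two bounds per row
print(time_bnds)
```

The second axis of length 2 plays the role of the `nbnds` dimension, and the upper bound of one interval coincides with the lower bound of the next.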

Unlike the (auxiliary) coordinate variables explained above, `air` is the **data variable** in this dataset and contains the air temperature data. The notation `air(time, lat, lon)` indicates that the data in this variable span three dimensions: `time`, `lat`, and `lon`.
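
To make the role of the three dimensions concrete, here is a minimal sketch with a small stand-in array. The coordinate values are a rounded subset of those in the example file, and the data values are simple made-up counters, so the numbers carry no physical meaning.

```python
import numpy as np
import xarray as xr

# Rounded subset of the coordinates in the example file.
time = np.array([1297320.0, 1297344.0])
lat = np.array([88.54, 86.65, 84.75], dtype="float32")
lon = np.array([0.0, 1.875, 3.75, 5.625], dtype="float32")

# Made-up values 0, 1, 2, ... arranged as (time, lat, lon).
values = np.arange(time.size * lat.size * lon.size, dtype="float32")
values = values.reshape(time.size, lat.size, lon.size)

air = xr.DataArray(
    values,
    dims=("time", "lat", "lon"),
    coords={"time": time, "lat": lat, "lon": lon},
    name="air",
)

# Fixing one dimension by position leaves a 2-D field over the other two.
first_day = air.isel(time=0)
print(first_day.dims)  # ('lat', 'lon')

# Fixing lat and lon by position leaves a 1-D series along time.
series = air.isel(lat=0, lon=0)
print(series.dims)  # ('time',)

# Selecting by coordinate value: the nearest grid point is lat=86.65, lon=1.875.
point = air.sel(time=1297344.0, lat=86.0, lon=2.0, method="nearest")
print(float(point))  # 17.0
```

Every value in a data variable is addressed by one position along each of its dimensions, and the coordinate variables translate those positions into physical locations and times.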

In [14]:
print(ds.air)

<xarray.DataArray 'air' (time: 366, lat: 94, lon: 192)> Size: 26MB
array([[[225.85   , 225.85   , ..., 225.92   , 225.87   ],
        [226.42   , 226.75   , ..., 226.04   , 226.22   ],
        ...,
        [258.57   , 258.7    , ..., 258.4    , 258.5    ],
        [258.02002, 258.02002, ..., 258.03998, 258.03998]],

       [[227.7    , 227.75   , ..., 227.7    , 227.7    ],
        [226.56999, 227.     , ..., 226.     , 226.27   ],
        ...,
        [258.1    , 258.22   , ..., 257.97   , 258.07   ],
        [257.75   , 257.7    , ..., 257.8    , 257.77002]],

       ...,

       [[229.65   , 229.65   , ..., 229.62   , 229.7    ],
        [231.97   , 231.8    , ..., 232.56999, 232.3    ],
        ...,
        [257.62   , 257.57   , ..., 257.77002, 257.72   ],
        [256.77002, 256.72   , ..., 256.95   , 256.9    ]],

       [[227.81999, 227.95   , ..., 227.56999, 227.72   ],
        [228.3    , 228.3    , ..., 228.3    , 228.27   ],
        ...,
        [256.45   , 256.39   , ..., 

### Metadata

Having learned about some basic variable types in a standard netCDF file, let's now take a brief look at the metadata. As mentioned earlier, a netCDF file includes two types of metadata: **variable attributes** and **global attributes**. The attributes of individual variables (e.g. `time`, `lat`, `lon`, `air`) are variable attributes, and the attributes of the whole dataset are global attributes. What kind of information should be included in variable attributes? Do the attributes of a coordinate variable differ from those of a data variable? What about the global attributes? These questions are covered by the CF Conventions, a broadly adopted metadata standard in Earth system science. We will delve deeper into them in the next chapter.
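
As a small preview of why attribute standards matter, the sketch below compares the attributes of two variables against a short checklist. The checklist is a hypothetical, deliberately minimal one chosen for illustration; the actual requirements are defined by the CF Conventions. The attribute dictionaries are copied from the listing earlier in this chapter, which may be incomplete. In xarray, the same kind of dictionary is available as `ds["air"].attrs`.

```python
# Hypothetical minimal checklist, for illustration only.
EXPECTED_VARIABLE_ATTRS = ("long_name", "standard_name", "units")

def missing_attributes(attrs, expected=EXPECTED_VARIABLE_ATTRS):
    """Return the expected attribute names that are absent from `attrs`."""
    return [name for name in expected if name not in attrs]

# Variable attributes as shown in the listing of the example dataset.
lat_attrs = {
    "units": "degrees_north",
    "long_name": "Latitude",
    "standard_name": "latitude",
    "axis": "Y",
}
air_attrs = {
    "long_name": "mean Daily Air temperature at 2 m",
    "units": "degK",
    "precision": 2,
    "GRIB_id": 11,
    "GRIB_name": "TMP",
    "var_desc": "Air temperature",
    "dataset": "NCEP Reanalysis Daily Averages",
}

print(missing_attributes(lat_attrs))  # []
print(missing_attributes(air_attrs))  # ['standard_name']
```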

## Wrap-up

In summary, this chapter covered the core content structure of a standardized netCDF file, where dimensions, variables, and metadata are the three main building blocks. Data are stored in data variables and span one or more dimensions, whose coordinates are given by the corresponding coordinate variables. Auxiliary coordinate variables also contain coordinate data and usually enrich the information about the data. A standard netCDF file is expected to provide metadata about individual variables (variable attributes) as well as about the whole dataset (global attributes).