# What is netCDF?

netCDF stands for Network Common Data Form. It is a data format widely used for creating, accessing, and sharing array-oriented scientific data. In earth system science (ESS), datasets are predominantly published in this format for several reasons:

* **Self-describing**: A netCDF file includes both the data and metadata, which is information about the data.

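* **Portable and sharable**: A netCDF file can be accessed by computers that store integers, characters, and floating-point numbers in different ways. The same file can also be read by several processes while one process writes to it.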

* **Scalable and appendable**: Data subsets can be easily extracted from netCDF files. Additional data can also be appended to an existing file, as long as the structure of both datasets is aligned.

* **Suitable for data archiving**: Data written in earlier versions of the netCDF format remain accessible with current and future versions of the netCDF software.

Despite its widespread use, interpreting a netCDF file can be challenging, especially for newcomers to the format. This chapter will help you to easily understand the content structure of a standard netCDF file.

## NetCDF Content Structure

A netCDF file generally contains **data** and attributes about the data, known as **metadata**. The data is composed of *dimensions* and *variables*, which include coordinate variables, data variables, and sometimes other types of variables.

* *Dimensions* define the dimensional span and the shape of the data arrays.

* *Coordinate variables* specify the coordinates along each dimension.
    
* *Data variables* usually represent physical quantities in the ESS domain, such as air temperature or wind speed. These variables typically span multiple dimensions, usually longitude, latitude, time, and a vertical dimension such as altitude, depth, or pressure level.

In addition to the data, a standardized netCDF file includes two types of metadata:

1. *Variable attributes*: These provide information about individual coordinate and data variables.
    
2. *Global attributes*: These provide information about the entire dataset.
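
The building blocks listed above can be sketched in memory with xarray, without touching a file. This is a minimal illustration rather than a recipe for a standards-compliant file: the variable name `tas`, the coordinate values, the data values, and all attribute values are made up for the example.

```python
import numpy as np
import xarray as xr

# Coordinate values: one 1-D array per dimension (illustrative numbers).
time = np.array([15.0, 45.0])               # days since a reference date
lat = np.array([-45.0, 0.0, 45.0])          # degrees north
lon = np.array([0.0, 90.0, 180.0, 270.0])   # degrees east

# Synthetic data values, shaped (time, lat, lon) = (2, 3, 4).
temperature = 280.0 + np.arange(24, dtype="float32").reshape(2, 3, 4)

ds_demo = xr.Dataset(
    data_vars={
        # Data variable: (dimension names, values, variable attributes)
        "tas": (("time", "lat", "lon"), temperature,
                {"long_name": "air temperature", "units": "K"}),
    },
    coords={
        # Coordinate variables: each has the same name as its dimension
        "time": ("time", time, {"units": "days since 2001-01-01"}),
        "lat": ("lat", lat, {"units": "degrees_north"}),
        "lon": ("lon", lon, {"units": "degrees_east"}),
    },
    # Global attributes: describe the dataset as a whole
    attrs={"title": "Synthetic demo dataset"},
)

print(dict(ds_demo.sizes))        # dimension names and lengths
print(ds_demo["tas"].dims)        # dimensions spanned by the data variable
print(ds_demo["tas"].attrs)       # variable attributes
print(ds_demo.attrs)              # global attributes
```

Each entry maps onto one of the terms above: `sizes` holds the dimensions, `coords` the coordinate variables, `data_vars` the data variables, and the two `attrs` dictionaries the variable and global attributes.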

### Data

To make the structure of a netCDF file easier to understand, let's examine an example dataset, `tos_O1_2001-2002.nc`. It contains sea surface temperature (`tos`) output from the IPSL-CM4 climate model, prepared for the IPCC Fourth Assessment, as 24 consecutive 30-day means covering the years 2001 and 2002. *Please note that `tos:cell_methods = time: mean (interval: 30 minutes)` is considered a mistake in the source dataset; `30 days` instead of `30 minutes` would be correct in this context.*

In [1]:
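import xarray as xr

# decode_cf=False keeps values and attributes exactly as stored in the file
# (no time decoding, no masking of fill values).
ds = xr.open_dataset("/Users/icdc/Documents/NFDI/Kemeng/cfbook/src/data/tos_O1_2001-2002.nc",
                     decode_cf=False)
ds.info()  # print a summary of dimensions, variables, and attributes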

xarray.Dataset {
dimensions:
	lon = 180 ;
	bnds = 2 ;
	lat = 170 ;
	time = 24 ;

variables:
	float64 lon(lon) ;
		lon:standard_name = longitude ;
		lon:long_name = longitude ;
		lon:units = degrees_east ;
		lon:axis = X ;
		lon:bounds = lon_bnds ;
		lon:original_units = degrees_east ;
	float64 lon_bnds(lon, bnds) ;
	float64 lat(lat) ;
		lat:standard_name = latitude ;
		lat:long_name = latitude ;
		lat:units = degrees_north ;
		lat:axis = Y ;
		lat:bounds = lat_bnds ;
		lat:original_units = degrees_north ;
	float64 lat_bnds(lat, bnds) ;
	float64 time(time) ;
		time:standard_name = time ;
		time:long_name = time ;
		time:units = days since 2001-1-1 ;
		time:axis = T ;
		time:calendar = 360_day ;
		time:bounds = time_bnds ;
		time:original_units = seconds since 2001-1-1 ;
	float64 time_bnds(time, bnds) ;
	float32 tos(time, lat, lon) ;
		tos:standard_name = sea_surface_temperature ;
		tos:long_name = Sea Surface Temperature ;
		tos:units = K ;
		tos:cell_methods = time: mean (interval: 30 minutes) ;

The example dataset includes four dimensions (`lon`, `bnds`, `lat`, `time`) and seven variables (`lon`, `lon_bnds`, `lat`, `lat_bnds`, `time`, `time_bnds`, `tos`).

In principle, a coordinate variable is a one-dimensional array with the same name as the dimension it represents. Its values should be strictly increasing or strictly decreasing and contain no missing values {cite}`Eaton:2023`. For example, `time(time)` in the dataset above is a coordinate variable: defined on the dimension `time = 24`, it is a 1D array containing 24 time coordinates. The same applies to longitude and latitude, so the example dataset has three coordinate variables: `lat(lat)`, `lon(lon)`, and `time(time)`.

In [2]:
# ds.time.data returns the raw coordinate values as a NumPy array
print("The time coordinates are: ", ds.time.data)

The time coordinates are:  [ 15.  45.  75. 105. 135. 165. 195. 225. 255. 285. 315. 345. 375. 405.
 435. 465. 495. 525. 555. 585. 615. 645. 675. 705.]
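
The three conditions for a coordinate variable can be expressed as a small check. The helper `is_coordinate_variable` is hypothetical (it is not part of xarray or netCDF), and it assumes that missing values are represented as NaN, whereas a netCDF file usually marks them through a `_FillValue` attribute.

```python
import math


def is_coordinate_variable(name, dims, values):
    """Return True if a variable meets the conditions for a coordinate variable."""
    # 1. One-dimensional and named after its own dimension
    if len(dims) != 1 or dims[0] != name:
        return False
    # 2. No missing values (represented here as NaN; an assumption of this sketch)
    if any(math.isnan(v) for v in values):
        return False
    # 3. Strictly increasing or strictly decreasing
    pairs = list(zip(values, values[1:]))
    increasing = all(a < b for a, b in pairs)
    decreasing = all(a > b for a, b in pairs)
    return increasing or decreasing


# The 24 time coordinates printed above: 15, 45, ..., 705
time_values = [15.0 + 30.0 * i for i in range(24)]

print(is_coordinate_variable("time", ("time",), time_values))                 # True
print(is_coordinate_variable("time_bnds", ("time", "bnds"), [[0.0, 30.0]]))   # False: two dimensions
print(is_coordinate_variable("lat", ("lat",), [0.0, 10.0, 10.0]))             # False: repeated value
```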


```{note}
In programs that can read netCDF files, variables are usually displayed as `variable_name(dimension)`. In the case of `lat(lat)`, for instance, the `lat` outside the parentheses is the variable name, while the `lat` inside the parentheses refers to the dimension "lat".
```

If a variable contains coordinate information but does not fulfill the conditions for a coordinate variable (it is not one-dimensional, not strictly increasing or decreasing, or contains missing values), it is considered an **auxiliary coordinate variable**. In the example above, `time_bnds(time, bnds)` is an auxiliary coordinate variable and, at the same time, a so-called *boundary variable*. It is a two-dimensional array whose pairs of values describe the bounds of each time interval. It is common practice for data publishers to use the midpoint of a time interval as the time coordinate and to define the start and end of the interval with a time boundary variable. In the example above, each time coordinate is the midpoint (day 15) of a 30-day interval, and the time boundary variable holds the start and end of that interval. The same practice is common for longitude and latitude: like `time_bnds(time, bnds)`, `lon_bnds(lon, bnds)` and `lat_bnds(lat, bnds)` are auxiliary coordinate variables and serve as the boundary variables for `lon` and `lat`.

In [3]:
print("The time boundary coordinates are: ", ds.time_bnds.data)

The time boundary coordinates are:  [[  0.  30.]
 [ 30.  60.]
 [ 60.  90.]
 [ 90. 120.]
 [120. 150.]
 [150. 180.]
 [180. 210.]
 [210. 240.]
 [240. 270.]
 [270. 300.]
 [300. 330.]
 [330. 360.]
 [360. 390.]
 [390. 420.]
 [420. 450.]
 [450. 480.]
 [480. 510.]
 [510. 540.]
 [540. 570.]
 [570. 600.]
 [600. 630.]
 [630. 660.]
 [660. 690.]
 [690. 720.]]
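
The relation between the time coordinates and their bounds can be verified with a few lines of plain Python. The sketch below rebuilds the 24 intervals printed above instead of reading the file, so the list `time_bnds` is only a stand-in for `ds.time_bnds.data`.

```python
# Rebuild the 24 time intervals shown above: [0, 30], [30, 60], ..., [690, 720]
time_bnds = [(30.0 * i, 30.0 * (i + 1)) for i in range(24)]

# The midpoint of each interval should reproduce the time coordinate
midpoints = [(start + end) / 2 for start, end in time_bnds]

# Each interval should start where the previous one ends
contiguous = all(
    time_bnds[i][1] == time_bnds[i + 1][0] for i in range(len(time_bnds) - 1)
)

print(midpoints[:3], midpoints[-1])  # [15.0, 45.0, 75.0] 705.0
print(contiguous)                    # True
```

The midpoints match the values of `time(time)` printed earlier, which is what a reader would expect when the time coordinate marks the center of each averaging interval.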


Unlike coordinate variables and auxiliary coordinate variables, a **data variable** holds the actual data values, such as measurements or model output. In the example above, `tos(time, lat, lon)` is a data variable that contains sea surface temperature on a grid defined by the dimensions and coordinates of time, latitude, and longitude.
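
To illustrate how coordinates label the cells of a data variable, the following sketch builds a small stand-in for `tos(time, lat, lon)` with xarray. The array `tos_demo`, its coordinate values, and its data values are synthetic and much smaller than the example dataset.

```python
import numpy as np
import xarray as xr

# Small synthetic stand-in for tos(time, lat, lon); the values are not real data.
time = [15.0, 45.0, 75.0]
lat = [-60.0, -30.0, 0.0, 30.0, 60.0]
lon = [0.0, 90.0, 180.0, 270.0]
values = np.arange(3 * 5 * 4, dtype="float32").reshape(3, 5, 4)

tos_demo = xr.DataArray(
    values,
    coords={"time": time, "lat": lat, "lon": lon},
    dims=("time", "lat", "lon"),
    name="tos",
)

# Positional lookup: first time step, third latitude, second longitude
by_index = tos_demo.isel(time=0, lat=2, lon=1)

# Label-based lookup: the same cell, addressed through its coordinate values
by_label = tos_demo.sel(time=15.0, lat=0.0, lon=90.0)

# A regional subset: all cells between 30°S and 30°N
subset = tos_demo.sel(lat=slice(-30.0, 30.0))

print(by_index.item(), by_label.item())  # 9.0 9.0
print(subset.shape)                      # (3, 3, 4)
```

Both lookups return the same cell, which shows that the coordinate variables act as labels for positions along each dimension of the data variable.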

### Metadata

Now that we have walked through the common variable types in a standard netCDF file, let's take a brief look at metadata, which can generally be divided into **variable attributes** and **global attributes**.

* *Variable attributes* describe a variable by providing information such as its name, units, and missing-value marker.
    
* *Global attributes* provide valuable information about the entire dataset.
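
As one illustration of why variable attributes matter: the `time` variable of the example dataset stores plain numbers, and only its `units` (`days since 2001-1-1`) and `calendar` (`360_day`) attributes turn them into dates. In a 360-day calendar every month has 30 days, so the conversion is simple arithmetic. The function `decode_360_day` is a hypothetical helper written for this sketch, and it ignores fractions of a day.

```python
def decode_360_day(offset_days, ref_year=2001, ref_month=1, ref_day=1):
    """Convert 'days since <reference date>' to (year, month, day) in a 360_day calendar."""
    # Express the reference date plus the offset as a day count from year 0,
    # using 360 days per year and 30 days per month.
    total = ref_year * 360 + (ref_month - 1) * 30 + (ref_day - 1) + int(offset_days)
    year, rest = divmod(total, 360)
    month, day = divmod(rest, 30)
    # Months and days are counted from 1, not 0
    return year, month + 1, day + 1


print(decode_360_day(15.0))    # (2001, 1, 16)
print(decode_360_day(705.0))   # (2002, 12, 16)
```

The first and last time coordinates of the example therefore fall in mid-January 2001 and mid-December 2002. Libraries such as xarray can perform this kind of decoding themselves when the file is opened with CF decoding enabled rather than with `decode_cf=False`.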

In [4]:
print("The variable attributes of the data variable tos: ")
ds.tos.attrs

The variable attributes of the data variable tos: 


{'standard_name': 'sea_surface_temperature',
 'long_name': 'Sea Surface Temperature',
 'units': 'K',
 'cell_methods': 'time: mean (interval: 30 minutes)',
 '_FillValue': 1e+20,
 'missing_value': 1e+20,
 'original_name': 'sosstsst',
 'original_units': 'degC',
 'history': ' At   16:37:23 on 01/11/2005: CMOR altered the data in the following ways: added 2.73150E+02 to yield output units;  Cyclical dimension was output starting at a different lon;'}
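
The `_FillValue` and `units` attributes shown above tell a reader how to interpret the stored numbers. A minimal sketch with NumPy, using three made-up values in place of the real `tos` array: cells equal to the fill value are replaced by NaN, and the remaining values are converted from kelvin to degrees Celsius.

```python
import numpy as np

# Attribute values copied from the output above
attrs = {"units": "K", "_FillValue": 1e20}

# Made-up raw values: two ocean cells and one cell flagged as missing
raw = np.array([290.15, 1e20, 275.65], dtype="float32")

# Replace the fill value with NaN so it cannot be mistaken for a temperature
fill = np.float32(attrs["_FillValue"])
masked = np.where(raw == fill, np.nan, raw)

# Convert from kelvin to degrees Celsius; NaN stays NaN
celsius = masked - 273.15

print(masked)
print(celsius)
```

Without the attributes, the value `1e+20` would look like an extreme temperature instead of a marker for missing data. xarray applies this kind of masking on its own when a file is opened with its default decoding settings.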

In [5]:
print("The global attributes: ")
ds.attrs

The global attributes: 


{'title': 'IPSL  model output prepared for IPCC Fourth Assessment SRES A2 experiment',
 'institution': 'IPSL (Institut Pierre Simon Laplace, Paris, France)',
 'source': 'IPSL-CM4_v1 (2003) : atmosphere : LMDZ (IPSL-CM4_IPCC, 96x71x19) ; ocean ORCA2 (ipsl_cm4_v1_8, 2x2L31); sea ice LIM (ipsl_cm4_v',
 'contact': 'Sebastien Denvil, sebastien.denvil@ipsl.jussieu.fr',
 'project_id': 'IPCC Fourth Assessment',
 'table_id': 'Table O1 (13 November 2004)',
 'experiment_id': 'SRES A2 experiment',
 'realization': 1,
 'cmor_version': 0.96,
 'Conventions': 'CF-1.0',
 'history': 'YYYY/MM/JJ: data generated; YYYY/MM/JJ+1 data transformed  At 16:37:23 on 01/11/2005, CMOR rewrote data to comply with CF standards and IPCC Fourth Assessment requirements',
 'references': 'Dufresne et al, Journal of Climate, 2015, vol XX, p 136',
 'comment': 'Test drive'}

In this chapter, we learned about the basic building blocks of a standard netCDF file: dimensions, variables, and metadata. Dimensions define the dimensional span of a netCDF dataset and are labeled by the corresponding coordinate variables. Auxiliary coordinate variables supply further coordinate information. Data variables hold the actual data values. Metadata is available for individual variables (variable attributes) and for the entire dataset (global attributes).

At this stage, you may wonder which attributes should be defined for variables and for the entire dataset. Do the attributes of a coordinate variable differ from those of a data variable? Moreover, the structure of a netCDF file may vary depending on the data type. These are important subjects addressed by the CF Conventions, a widely adopted metadata standard for netCDF files. We will look at these questions in more detail in the next chapter.