# What is netCDF?

netCDF stands for Network Common Data Form. It is a data format widely used for creating, accessing, and sharing array-oriented scientific data. In earth system science (ESS), datasets are predominantly published in this format for several reasons:

* **Self-describing**: A netCDF file includes both the data and metadata, which is information about the data.

* **Portable and sharable**: A netCDF file can be accessed by computers that store integers, characters, and floating-point numbers in different ways, and one writer and multiple readers can access the same file simultaneously.

* **Scalable and appendable**: Data subsets can be easily extracted from netCDF files. Additional data can also be appended to an existing file, as long as the structure of both datasets is aligned.

* **Suitable for data archiving**: Data written with earlier versions of netCDF remain accessible with current and future versions of the software.

Despite its widespread use, interpreting a netCDF file can be challenging, especially for newcomers to the format. This chapter walks through the content structure of a standard netCDF file.

## NetCDF Content Structure

A netCDF file generally contains **data** and attributes about the data, known as **metadata**. The data is composed of *dimensions* and *variables*, which include coordinate variables, data variables, and sometimes other types of variables.

* *Dimensions* define how the data is organized.

* *Coordinate variables* specify the coordinates along each dimension.
    
* *Data variables* usually represent physical quantities in the ESS domain, such as air temperature or wind speed. These variables typically span multiple dimensions, usually longitude, latitude, time, and a vertical dimension such as altitude, depth, or pressure level.
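As a rough illustration of these building blocks, the sketch below assembles a tiny in-memory dataset with `xarray` and `numpy`. The sizes and values are made up for illustration and do not come from any real file; only the names `time`, `lat`, `lon`, and `air` mirror the example dataset examined later in this chapter.

```python
import numpy as np
import xarray as xr

# Hypothetical toy grid: 2 time steps, 3 latitudes, 4 longitudes
time = np.array([0.0, 24.0])
lat = np.array([10.0, 0.0, -10.0])
lon = np.array([0.0, 90.0, 180.0, 270.0])

ds = xr.Dataset(
    data_vars={
        # A data variable spanning all three dimensions
        "air": (("time", "lat", "lon"), np.full((2, 3, 4), 280.0, dtype="float32")),
    },
    coords={
        # Coordinate variables: one-dimensional, named after their dimension
        "time": ("time", time),
        "lat": ("lat", lat),
        "lon": ("lon", lon),
    },
)

print(dict(ds.sizes))      # dimension names and lengths
print(list(ds.coords))     # coordinate variables
print(list(ds.data_vars))  # data variables
print(ds["air"].dims)      # dimensions the data variable spans
```

Each dimension here has a coordinate variable of the same name, while `air` is organized along all three dimensions.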

In addition to the data, a standardized netCDF file includes two types of metadata:

1. *Variable attributes*: These provide information about individual coordinate and data variables.
    
2. *Global attributes*: These provide information about the entire dataset.

### Data

To make the structure of a netCDF file easier to understand, let's examine an [example dataset](https://downloads.psl.noaa.gov/Datasets/ncep.reanalysis/Dailies/surface_gauss/) from the NCEP-NCAR reanalysis. This dataset contains the daily mean air temperature at 2 meters above the Earth's surface for the year 1948.

In [1]:
import xarray as xr

ds = xr.open_dataset("/Users/icdc/Documents/NFDI/Kemeng/cfbook/src/data/air.2m.gauss.1948.nc",
                     decode_cf=False)
ds.info()

xarray.Dataset {
dimensions:
	lat = 94 ;
	lon = 192 ;
	time = 366 ;
	nbnds = 2 ;

variables:
	float32 lat(lat) ;
		lat:units = degrees_north ;
		lat:actual_range = [ 88.542 -88.542] ;
		lat:long_name = Latitude ;
		lat:standard_name = latitude ;
		lat:axis = Y ;
	float32 lon(lon) ;
		lon:units = degrees_east ;
		lon:long_name = Longitude ;
		lon:actual_range = [  0.    358.125] ;
		lon:standard_name = longitude ;
		lon:axis = X ;
	float64 time(time) ;
		time:long_name = Time ;
		time:delta_t = 0000-00-01 00:00:00 ;
		time:avg_period = 0000-00-01 00:00:00 ;
		time:standard_name = time ;
		time:axis = T ;
		time:units = hours since 1800-01-01 00:00:0.0 ;
		time:actual_range = [1297320. 1306080.] ;
		time:coordinate_defines = start ;
	float32 air(time, lat, lon) ;
		air:long_name = mean Daily Air temperature at 2 m ;
		air:units = degK ;
		air:precision = 2 ;
		air:GRIB_id = 11 ;
		air:GRIB_name = TMP ;
		air:var_desc = Air temperature ;
		air:dataset = NCEP Reanalysis Daily Averages ;
		

The example dataset includes four dimensions (`lat`, `lon`, `time`, `nbnds`) and five variables (`lat`, `lon`, `time`, `air`, `time_bnds`).

By convention, a coordinate variable usually has the same name as the dimension it represents. A coordinate variable should be one-dimensional, with values that are strictly increasing or strictly decreasing, and it should not have any missing values {cite}`Eaton:2023`. In this dataset, latitude (`lat`), longitude (`lon`), and time (`time`) meet all these criteria and are therefore coordinate variables.

In [12]:
print(ds.lon.coords)
print(ds.lat.coords)
print(ds.time.coords)

Coordinates:
  * lon      (lon) float32 768B 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1
Coordinates:
  * lat      (lat) float32 376B 88.54 86.65 84.75 82.85 ... -84.75 -86.65 -88.54
Coordinates:
  * time     (time) float64 3kB 1.297e+06 1.297e+06 ... 1.306e+06 1.306e+06
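
Because the file was opened with `decode_cf=False`, the `time` values appear as raw numbers. Their meaning comes from the `units` attribute, `hours since 1800-01-01 00:00:0.0`. As a minimal sketch using only the Python standard library, and assuming the standard Gregorian calendar (the listing shows no `calendar` attribute), the two `actual_range` values can be converted to dates like this; the helper name `hours_to_datetime` is hypothetical.

```python
from datetime import datetime, timedelta

# Reference time taken from time:units = "hours since 1800-01-01 00:00:0.0"
REFERENCE = datetime(1800, 1, 1)

def hours_to_datetime(hours):
    """Convert an offset in hours since the reference time to a datetime."""
    return REFERENCE + timedelta(hours=hours)

# First and last values from time:actual_range
first = hours_to_datetime(1297320.0)
last = hours_to_datetime(1306080.0)

print(first)  # 1948-01-01 00:00:00
print(last)   # 1948-12-31 00:00:00
```

In practice, opening the file without `decode_cf=False` lets xarray perform this conversion automatically.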


```{note}
Variables are displayed as `variable_name(dimension)`. For example, in `lat(lat)`, the `lat` outside the parentheses is the variable name, while the `lat` inside the parentheses is the dimension name. This means that the variable `lat` is defined along the `lat` dimension.
```
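
The coordinate-variable criteria described above can be sketched as a small check in plain Python. This is only an illustration: the function name `is_coordinate_variable` is hypothetical, missing values are represented here as `None` or NaN (real files typically mark them with a fill value), and `lat2d` with dimensions `y` and `x` is a made-up two-dimensional variable.

```python
import math

def is_coordinate_variable(name, dims, values):
    """Return True if a variable meets the coordinate-variable criteria."""
    # One-dimensional, and named after its single dimension
    if len(dims) != 1 or dims[0] != name:
        return False
    # No missing values (represented here as None or NaN)
    if any(v is None or math.isnan(v) for v in values):
        return False
    # Strictly increasing or strictly decreasing
    pairs = list(zip(values, values[1:]))
    increasing = all(a < b for a, b in pairs)
    decreasing = all(a > b for a, b in pairs)
    return increasing or decreasing

print(is_coordinate_variable("lat", ("lat",), [88.54, 86.65, 84.75]))      # True
print(is_coordinate_variable("lon", ("lon",), [0.0, 1.875, 3.75]))         # True
print(is_coordinate_variable("lat2d", ("y", "x"), [88.54, 86.65]))         # False
print(is_coordinate_variable("lon", ("lon",), [0.0, float("nan"), 3.75]))  # False
```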

If a variable contains coordinate data but doesn't meet the criteria to be a coordinate variable, it is considered an **auxiliary coordinate variable**. Unlike coordinate variables, auxiliary coordinate variables can be multidimensional, their names do not have to match their dimension names, they do not need to have strictly increasing or decreasing values, and they can include missing values. In our example, `time_bnds` is an auxiliary coordinate variable.

```{note}
The variable `time_bnds` defines the bounds of time intervals. Often, time coordinates represent the midpoint of a time interval to simplify calculations. By including a time boundary variable (`time_bnds`), data creators can define the start and the end of the time interval. This practice is also common for longitude and latitude coordinates.
```
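
To make the idea of bounds concrete, here is a small sketch in plain Python. It assumes that each `time` value marks the start of a one-day averaging interval, as the attributes `coordinate_defines = start` and `delta_t = 0000-00-01 00:00:00` in the listing suggest. The helper name `make_bounds` is hypothetical, and the values it produces are illustrative, since the actual contents of `time_bnds` are not shown in this chapter.

```python
def make_bounds(starts, width):
    """Build [start, end] pairs for intervals of equal width."""
    return [[start, start + width] for start in starts]

# First three daily time steps, in hours since 1800-01-01 (24-hour spacing assumed)
time = [1297320.0, 1297344.0, 1297368.0]
time_bnds = make_bounds(time, width=24.0)

for t, (lower, upper) in zip(time, time_bnds):
    print(t, lower, upper)

# Each time step has two bounds, matching a bounds dimension of length 2
print(len(time_bnds), len(time_bnds[0]))  # 3 2
```

With contiguous intervals like these, the upper bound of one time step equals the lower bound of the next.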

The `air` variable is different from the coordinate and auxiliary coordinate variables; it is a **data variable** that contains air temperature data. The notation `air(time, lat, lon)` indicates that the data array in this variable is organized along the dimensions `time`, `lat`, and `lon` in sequence.
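
The order of dimensions determines how an index such as `air[t, j, i]` is interpreted. As a sketch in plain Python, assuming the usual row-major layout in which the last dimension varies fastest (the function name `flat_index` is hypothetical), the position of an element within the flattened array can be computed from the dimension sizes in the listing:

```python
# Dimension sizes from the listing, in the order given by air(time, lat, lon)
SHAPE = {"time": 366, "lat": 94, "lon": 192}

def flat_index(t, j, i, shape=SHAPE):
    """Row-major offset of element (t, j, i); lon varies fastest."""
    return (t * shape["lat"] + j) * shape["lon"] + i

total = SHAPE["time"] * SHAPE["lat"] * SHAPE["lon"]
print(total)                # 6605568 values in total
print(flat_index(0, 0, 1))  # 1: the next longitude is the neighbouring element
print(flat_index(0, 1, 0))  # 192: the next latitude is one row of longitudes away
print(flat_index(1, 0, 0))  # 18048: the next time step is one lat-lon grid away
```

At 4 bytes per `float32` value, the total of 6,605,568 values corresponds to roughly 26 MB.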

In [14]:
print(ds.air)

<xarray.DataArray 'air' (time: 366, lat: 94, lon: 192)> Size: 26MB
array([[[225.85   , 225.85   , ..., 225.92   , 225.87   ],
        [226.42   , 226.75   , ..., 226.04   , 226.22   ],
        ...,
        [258.57   , 258.7    , ..., 258.4    , 258.5    ],
        [258.02002, 258.02002, ..., 258.03998, 258.03998]],

       [[227.7    , 227.75   , ..., 227.7    , 227.7    ],
        [226.56999, 227.     , ..., 226.     , 226.27   ],
        ...,
        [258.1    , 258.22   , ..., 257.97   , 258.07   ],
        [257.75   , 257.7    , ..., 257.8    , 257.77002]],

       ...,

       [[229.65   , 229.65   , ..., 229.62   , 229.7    ],
        [231.97   , 231.8    , ..., 232.56999, 232.3    ],
        ...,
        [257.62   , 257.57   , ..., 257.77002, 257.72   ],
        [256.77002, 256.72   , ..., 256.95   , 256.9    ]],

       [[227.81999, 227.95   , ..., 227.56999, 227.72   ],
        [228.3    , 228.3    , ..., 228.3    , 228.27   ],
        ...,
        [256.45   , 256.39   , ..., 

### Metadata

Now that we've covered the basic types of variables in a standard netCDF file, let's take a brief look at the metadata. The two types of metadata in a netCDF file are **variable attributes** and **global attributes**.

* *Variable attributes* provide information about individual variables, such as `time`, `lat`, `lon`, and `air`.
    
* *Global attributes* provide information about the entire dataset.
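
With `xarray`, both kinds of metadata are exposed through `.attrs`, on a variable and on the dataset respectively. The sketch below uses a made-up in-memory dataset rather than the example file. The attribute values for `air` are copied from the listing above, while the global attribute `title` is a hypothetical placeholder, because the file's global attributes are not shown in this chapter.

```python
import numpy as np
import xarray as xr

# Data variable with variable attributes (values copied from the listing)
air = xr.DataArray(
    np.zeros((1, 2, 2), dtype="float32"),
    dims=("time", "lat", "lon"),
    attrs={"long_name": "mean Daily Air temperature at 2 m", "units": "degK"},
)

# Dataset with a hypothetical global attribute
ds = xr.Dataset({"air": air}, attrs={"title": "hypothetical example title"})

print(ds["air"].attrs["units"])  # variable attribute
print(ds.attrs["title"])         # global attribute

# List the attributes of every variable in the dataset
for name, var in ds.variables.items():
    print(name, dict(var.attrs))
```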

What kind of information should be included in variable attributes? Do the attributes for a coordinate variable differ from those for a data variable? What about global attributes? These questions are addressed by the CF Conventions, a widely adopted metadata standard in earth system science. We will explore this in more detail in the next chapter.

To summarize, this chapter covered the core content structure of a standardized netCDF file, which has three main components: dimensions, variables, and metadata. Data is stored in data variables and organized along dimensions, with coordinates specified by the corresponding coordinate variables. Auxiliary coordinate variables provide additional coordinate information. Metadata describes both individual variables (variable attributes) and the entire dataset (global attributes).