# 🌎 GPGN268 - Geophysical Data Analysis
- **Instructor:** Bia Villas Boas  
- **TA:** Seunghoo Kim

## Lecture 17: Introduction to netCDF

#### 🎯 Learning Objectives from this Lecture:
- Describe the netCDF data format as it is used to store climate data
- Describe how xarray can be used to read netCDF files

## What is netCDF Data?
NetCDF (Network Common Data Form) is a hierarchical data format. It is what is known as a "self-describing" data structure: metadata, or descriptions of the data, are included in the file itself and can be parsed programmatically. This means the metadata can be accessed with code to build automated and reproducible workflows.

The netCDF format can store data with multiple dimensions. It can also store different types of data through arrays that can contain geospatial imagery, rasters, terrain data, climate data, and text. These arrays support metadata, making the netCDF format highly flexible. NetCDF was developed and is supported by [UCAR](https://www.ucar.edu/), which maintains standards and software that support the use of the format.

### Data in netCDF format is:

- **Self-Describing**. A netCDF file includes information about the data it contains.
- **Portable**. A netCDF file can be accessed by computers with different ways of storing integers, characters, and floating-point numbers.
- **Scalable**. Small subsets of large datasets in various formats may be accessed efficiently through netCDF interfaces, even from remote servers.
- **Appendable**. Data may be appended to a properly structured netCDF file without copying the dataset or redefining its structure.
- **Sharable**. One writer and multiple readers may simultaneously access the same netCDF file.
- **Archivable**. Access to all earlier forms of netCDF data will be supported by current and future versions of the software.

### NetCDF4 Format for Climate Data
The hierarchical and flexible nature of netCDF files supports storing data in many different ways. The netCDF4 data standard is used broadly by the climate science community to store climate data. Climate data are:

- often delivered in a time series format (months and years of historic or future projected data).
- spatial in nature, covering regions such as the United States or even the world.
- driven by models that require documentation, which makes the self-describing aspect of netCDF files useful.


## Xarray for multidimensional gridded data

In the previous set of lectures, we saw how Pandas provided a way to keep track of additional "metadata" surrounding tabular datasets, including "indexes" for each row and labels for each column. These features, together with Pandas' many useful routines for all kinds of data munging and analysis, have made Pandas one of the most popular Python packages in the world.

However, not all Earth science datasets easily fit into the "tabular" model (i.e. rows and columns) imposed by Pandas. In particular, we often deal with multidimensional data. By multidimensional data (also often called N-dimensional), I mean data with many independent dimensions or axes. For example, we might represent Earth's surface temperature $T$ as a three-dimensional variable

$$
T(x, y, t)
$$

where 
$x$ is longitude, $y$ is latitude, and $t$ is time.

The point of xarray is to provide pandas-level convenience for working with this type of data.

![](https://docs.xarray.dev/en/stable/_images/dataset-diagram.png)

## Xarray data structures

Like Pandas, xarray has two fundamental data structures:

- a `DataArray`, which holds a single multi-dimensional variable and its coordinates

- a `Dataset`, which holds multiple variables that potentially share the same coordinates

### DataArray
A DataArray has four essential attributes:

- `values`: a `numpy.ndarray` holding the array's values

- `dims`: dimension names for each axis (e.g., ('x', 'y', 'z'))

- `coords`: a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings)

- `attrs`: a dictionary (a plain `dict` in recent xarray versions, an `OrderedDict` in older ones) to hold arbitrary metadata (attributes)

Let's start by constructing some DataArrays manually.
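
Here is a minimal sketch of building a `DataArray` by hand so that each of the four attributes listed above is visible. The numbers, the coordinate values, and the names `depth`, `time`, and `temperature` are made up for illustration; they are not taken from the Argo file used later.

```python
import numpy as np
import xarray as xr

# Made-up temperatures (degrees Celsius): 3 depths (rows) x 4 days (columns)
values = np.array([[20.1, 20.3, 20.0, 19.8],
                   [15.2, 15.4, 15.1, 15.0],
                   [8.9, 9.0, 8.8, 8.7]])

# Coordinates that label each axis (also made up)
depth = np.array([10.0, 100.0, 500.0])
time = np.array(['2007-01-01', '2007-01-02', '2007-01-03', '2007-01-04'],
                dtype='datetime64[ns]')

da = xr.DataArray(
    values,
    dims=('depth', 'time'),                  # one name per axis
    coords={'depth': depth, 'time': time},   # labels along each axis
    attrs={'long_name': 'sea temperature',   # arbitrary metadata
           'units': 'degree_Celsius'},
    name='temperature',
)

print(da.dims)           # dimension names
print(da.values.shape)   # the underlying numpy array
print(list(da.coords))   # coordinate names
print(da.attrs)          # metadata dictionary
```

Calling `da.plot()` on this array would pick up the coordinate names and the `units` attribute for the axis and colorbar labels, which is the same behavior we rely on with the real data below.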

## The Argo program

We will get some practice with `xarray` using data from [Argo floats](https://argo.ucsd.edu/)

![](https://argo.ucsd.edu/wp-content/uploads/sites/361/2020/03/APEX_world_HJF-225x300.jpg)

Data from Argo floats are available from several data centers. Here, we will use the data available from the French Institute for Ocean Research, [IFREMER](https://data-argo.ifremer.fr/).


In [None]:
import numpy as np
import xarray as xr
from matplotlib import pyplot as plt

import cartopy.crs as ccrs

In [None]:
# We use xarray.load_dataset to load our profile data 
ds_raw = xr.load_dataset('/Users/bia/Downloads/5901429_prof.nc')

xarray will read the netCDF data as an `xarray.Dataset` object. Below, we see that our dataset has 64 variables and 5 dimensions. Looking at the file's Attributes, it becomes clear what we mean by metadata and self-describing.

In [None]:
ds_raw
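
To make "parsed programmatically" concrete, here is a small sketch of reading metadata with code instead of by eye. It uses a toy in-memory `Dataset` as a stand-in for a file; the variable names `TEMP` and `PSAL`, the attribute values, and the array contents are assumptions made up for the example, not the contents of the Argo file.

```python
import numpy as np
import xarray as xr

# Toy stand-in for a dataset read from a netCDF file (all values are made up)
toy = xr.Dataset(
    data_vars={
        'TEMP': (('N_PROF', 'N_LEVELS'), np.zeros((2, 3)),
                 {'long_name': 'Sea temperature', 'units': 'degree_Celsius'}),
        'PSAL': (('N_PROF', 'N_LEVELS'), np.full((2, 3), 35.0),
                 {'long_name': 'Practical salinity', 'units': 'psu'}),
    },
    attrs={'title': 'Toy float profiles'},
)

# Global (file-level) attributes live in a dictionary on the dataset
print(toy.attrs['title'])

# Each variable carries its own attributes, so we can loop over them
units = {name: var.attrs.get('units', 'unknown')
         for name, var in toy.data_vars.items()}
print(units)

# Dimension names and sizes are also available without reading the printout
print(dict(toy.sizes))
```

The same three calls (`.attrs`, `.data_vars`, `.sizes`) should work on `ds_raw`, where they return the metadata stored in the file.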

Similarly to `pandas`, we can visualize the data directly from `xarray`. Below, we use the "dot" notation to access the variable `TEMP_ADJUSTED` and make a plot. Note that xarray used the metadata to add information to the plot in the form of axis labels (it even shows the units 🤯).

In [None]:
ds_raw.TEMP_ADJUSTED.plot()

What we have here is sea temperature as a function of `N_LEVELS` and `N_PROF`. This is not very intuitive. It would make more sense to analyze temperature as a function of depth and time. Also, we don't need all the variables in this file, and the variable names are a bit annoying to type. Let's go ahead and do some data cleanup.

In [None]:
# Define a list with the variables that we want to keep
variables = ['PRES_ADJUSTED','TEMP_ADJUSTED', 'PSAL_ADJUSTED', 'LATITUDE', 'LONGITUDE', 'JULD']
# Select only these variables from the whole dataset
ds = ds_raw[variables]
ds

Now, we can rename variables in `xarray`, using the method `rename` and passing the current variable names and the respective new variable names in the form of a dictionary `{'current_name1':'new_name1', 'current_name2':'new_name2'}`. For example

In [None]:
ds = ds.rename({'JULD':'time'})
ds

Now, let's do this for the other variables.

In [None]:
ds = ds.rename({'PRES_ADJUSTED':'pressure', 'TEMP_ADJUSTED':'temperature',
                'PSAL_ADJUSTED':'salinity', 'LATITUDE':'latitude',
                'LONGITUDE':'longitude'})
ds

In [None]:
ds.temperature

We have successfully changed the variable names. Now, we see that the dataset dimensions are profile number and level number (`N_PROF`, `N_LEVELS`), but we would prefer to have time as a dimension. We can swap the dimension `N_PROF` with `time`:

In [None]:
ds = ds.swap_dims({'N_PROF':'time'})
ds

Nice! Now, if we plot temperature, `xarray` will display it as a function of time.

In [None]:
ds.temperature.plot()

### Operations in xarray are dimension aware

- Back when we were using numpy, if we wanted to perform an operation on a given array, we had to specify the axis on which to operate, for example `np.mean(data, axis=1)`. In `xarray` this is much more intuitive: you specify the name of the dimension on which you want to operate.

In [None]:
ds.temperature.mean(dim='N_LEVELS').plot()
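
The difference is easier to see side by side. The sketch below uses a small made-up array (3 times by 4 levels) rather than the Argo data, and shows that the named-dimension version keeps giving the same answer even if the axis order changes.

```python
import numpy as np
import xarray as xr

# Made-up data: 3 times (rows) x 4 levels (columns)
data = np.arange(12.0).reshape(3, 4)
da = xr.DataArray(data, dims=('time', 'N_LEVELS'))

# numpy: we have to remember that the levels are axis 1
mean_np = np.mean(data, axis=1)

# xarray: we name the dimension to average over
mean_xr = da.mean(dim='N_LEVELS')

# If the axes are reordered, axis=1 would now mean "time",
# but dim='N_LEVELS' still averages over the levels
mean_xr_t = da.transpose('N_LEVELS', 'time').mean(dim='N_LEVELS')

print(mean_np)
print(mean_xr.values)
print(mean_xr.dims)   # the remaining dimension keeps its name
```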

### 🤔 Pressure or depth?
- The pressure in this file is given in decibars. Discuss with your peers what a decibar is and how it relates to depth.
- It seems like the pressure values in your dataset are not exactly the same for each profile. Discuss with your peers some strategies that you could use to have a common range of depths for all profiles and map the dimension `N_LEVELS` to depth in your dataset.

In [None]:
# Hide this cell
ds.pressure.mean(dim='time').plot()
ds.pressure.median(dim='time').plot()
ds.pressure.max(dim='time').plot()
ds.pressure.min(dim='time').plot()

In [None]:
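# Use the median pressure of each level (across all profiles) as an approximate depth coordinate
depths = ds.pressure.median(dim='time')
ds['depth'] = depths
# Make depth the dimension in place of N_LEVELS
ds = ds.swap_dims({'N_LEVELS':'depth'})
ds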

In [None]:
ds.temperature.plot()

In [None]:
ds.temperature.plot(x='time', y='depth', yincrease=False)
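
The cells above use the median pressure of each level as a shared depth axis. Another strategy you might consider is to interpolate every profile onto one common pressure grid. The sketch below shows the idea on made-up numbers using `np.interp`. The grid `common_pressure`, the toy profiles, and the dimension names `profile` and `pressure` are all assumptions for illustration, and the sketch assumes each profile's pressures are increasing and contain no missing values, which real Argo profiles may not satisfy.

```python
import numpy as np
import xarray as xr

# Made-up data: 2 profiles x 4 levels, with slightly different pressures (dbar)
pressure = np.array([[5.0, 50.0, 100.0, 200.0],
                     [10.0, 55.0, 110.0, 210.0]])
temperature = np.array([[20.0, 18.0, 15.0, 10.0],
                        [21.0, 19.0, 16.0, 11.0]])

# Hypothetical common grid that every profile is mapped onto
common_pressure = np.array([10.0, 50.0, 100.0, 200.0])

# Interpolate one profile at a time; np.interp needs increasing pressures.
# Grid points outside a profile's measured range are set to NaN.
interpolated = np.array([
    np.interp(common_pressure, p, t, left=np.nan, right=np.nan)
    for p, t in zip(pressure, temperature)
])

# Wrap the result so that pressure becomes a labeled dimension
gridded = xr.DataArray(
    interpolated,
    dims=('profile', 'pressure'),
    coords={'pressure': common_pressure},
    attrs={'units': 'degree_Celsius'},
)
print(gridded)
```

Compared with the median approach, this keeps each measurement tied to its own pressure before regridding, at the cost of an extra interpolation step.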

### Data selection in xarray

Similarly to pandas `iloc` and `loc`, in xarray you can select data by integer position with `isel` or by coordinate value (label) with `sel`. For example:

In [None]:
ds.temperature.isel(time=10).plot(y='depth', yincrease=False)

In [None]:
# You can also ask for a slice
ds.temperature.isel(time=slice(0, 20)).plot()

In [None]:
# Which also works with multiple dimensions
ds.temperature.isel(time=slice(0, 20), depth=slice(0, 10)).plot()

In [None]:
# Select by coordinate value (label) instead of integer position
ds.temperature.sel(time="2007-01-15")

In [None]:
# If no profile falls exactly on that date, pick the one closest in time
ds.temperature.sel(time="2007-01-15", method='nearest')
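
To see the difference between the three kinds of selection without the Argo file, here is a small sketch on a made-up time series with one value every 10 days. The dates and values are assumptions chosen so that the requested date, 2007-01-15, falls between two samples.

```python
import numpy as np
import xarray as xr

# Made-up series: one value every 10 days
time = np.array(['2007-01-01', '2007-01-11', '2007-01-21'],
                dtype='datetime64[ns]')
da = xr.DataArray([10.0, 11.0, 12.0], dims='time', coords={'time': time})

# By integer position (like pandas iloc): the second sample
by_position = da.isel(time=1)

# By label (like pandas loc): the sample taken on 2007-01-11
by_label = da.sel(time='2007-01-11')

# No sample exists on 2007-01-15, so ask for the nearest one in time
nearest = da.sel(time='2007-01-15', method='nearest')

print(by_position.values)
print(by_label.values)
print(nearest.time.values, nearest.values)
```

Here 2007-01-15 is 4 days after the second sample and 6 days before the third, so `method='nearest'` returns the second sample.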