# 3.2 Introduction to xarray - NetCDF and Dataset

prepared by Mathias Hauser

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

## Reading NetCDFs

NetCDF (Network Common Data Form) is a data format that is widely used in the geosciences to save and distribute observational data and climate model output. Its structure is very similar to that of the `DataArray`s we got to know - named dimensions, coordinates, etc. Indeed, xarray was modelled after the NetCDF data model.

We will use a NetCDF file that contains observed annual maximum temperature (TXx) data. The data is described in Dunn et al. ([2020](https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2019JD032263)). The data has already undergone some preprocessing.

NetCDF files can be opened with the `xr.open_dataset` function:

In [None]:
file = "../data/HadEX3_TXx_ANN.nc"
ds = xr.open_dataset(file)

In [None]:
type(ds)

In [None]:
ds

This is not a `DataArray` but a `Dataset`, and its representation looks a bit different from that of the `DataArray` we saw in the last exercise. A `Dataset` is a collection of `DataArray`s. Most operations that work with a `DataArray` also work with a `Dataset`.

The `Dataset` we opened here has 4 dimensions: `time`, `lon`, `lat` and `bnds`. Only the first three also have coordinates. Then there are a number of data variables: `longitude_bnds` and `latitude_bnds` give the bounds of the `lon` and `lat` coords. `TXx` is the time-dependent annual maximum temperature. `trend` and `is_significant` give an estimate of the mean change over time and its significance.

Further, it has a number of `Attributes` that are descriptive - e.g. they indicate the reference of the data. 
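
To make these terms concrete, the following sketch builds a small `Dataset` by hand. The name `toy` and all values are made up for illustration and are not taken from the HadEX3 file; only the layout (dimensions, coordinates, data variables, attributes) mirrors the description above.

```python
import numpy as np
import xarray as xr

# Hypothetical toy data: 3 years on a 2 x 4 lat/lon grid (values are made up)
time = np.array([2000, 2001, 2002])
lat = np.array([-45.0, 45.0])
lon = np.array([0.0, 90.0, 180.0, 270.0])

toy = xr.Dataset(
    data_vars={
        # a time-dependent variable and a purely spatial one
        "TXx": (("time", "lat", "lon"), np.arange(24.0).reshape(3, 2, 4)),
        "trend": (("lat", "lon"), np.zeros((2, 4))),
    },
    coords={"time": time, "lat": lat, "lon": lon},
    attrs={"title": "toy example"},
)

print(dict(toy.sizes))      # dimension names and their lengths
print(list(toy.data_vars))  # names of the data variables
print(toy.attrs)            # descriptive attributes
```

Each entry in `data_vars` is a `DataArray` that shares the coordinates of the `Dataset`.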

The `Dataset` contains a number of data variables - these variables are `DataArray`s and can be accessed either with dot-notation (`ds.variable`) or by indexing (`ds["variable"]`):

In [None]:
TXx = ds.TXx
TXx

In [None]:
type(TXx)
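
Both access styles return the same object. The sketch below checks this on a hypothetical toy `Dataset` (the name `toy`, the variable `TXx_K` and all values are assumptions for illustration). It also shows two cases where index notation is the form to use: selecting several variables at once and adding a new variable.

```python
import numpy as np
import xarray as xr

# Hypothetical toy Dataset with made-up values
toy = xr.Dataset(
    {
        "TXx": ("time", np.array([30.0, 31.5, 29.0])),
        "trend": ("time", np.array([0.1, 0.1, 0.1])),
    },
    coords={"time": [2000, 2001, 2002]},
)

by_dot = toy.TXx
by_index = toy["TXx"]
print(by_dot.identical(by_index))

# Indexing with a list of names returns a (smaller) Dataset
subset = toy[["TXx"]]
print(type(subset).__name__)

# New variables are added with index notation
toy["TXx_K"] = toy["TXx"] + 273.15
print(list(toy.data_vars))
```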

### Exercise

* Read the variable `trend`.

In [None]:
# code here

In [None]:
# solution

trend = ds.trend
trend

* Call `trend.plot()` to create a plot.

In [None]:
# code here

In [None]:
# solution

ds.trend.plot()

## Computation with Datasets

You can do the same computations with a `Dataset` as with a `DataArray`. For example, to compute the mean over the latitude and longitude:

In [None]:
ds.mean(("lat", "lon"))
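
A reduction on a `Dataset` is applied to each data variable separately. The sketch below uses a hypothetical toy `Dataset` (names and values are made up) to show that each variable only loses the dimensions that were reduced.

```python
import numpy as np
import xarray as xr

# Hypothetical toy Dataset: one variable with a time axis, one without
toy = xr.Dataset(
    {
        "TXx": (("time", "lat", "lon"), np.arange(12.0).reshape(3, 2, 2)),
        "trend": (("lat", "lon"), np.array([[1.0, 2.0], [3.0, 4.0]])),
    },
    coords={
        "time": [2000, 2001, 2002],
        "lat": [-10.0, 10.0],
        "lon": [0.0, 180.0],
    },
)

spatial_mean = toy.mean(("lat", "lon"))

print(spatial_mean["TXx"].dims)    # the time dimension is kept
print(spatial_mean["trend"].dims)  # reduced to a single number
```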

### Exercise

* Calculate the mean over time.

In [None]:
# code here

In [None]:
# solution

ds.mean("time")

## High level operations

xarray offers many high level operations that go beyond simple reductions. Many of these rely on the coordinates and make use of the named dimensions. They work for `DataArray` and `Dataset` and include:

* `ds.groupby`
* `ds.resample`
* `ds.rolling`
* `ds.weighted`

We cannot look at all of them but will briefly introduce some of them below.
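
As `rolling` is not covered further in this section, here is a minimal sketch of a centered 3-step running mean. The array `da` and its values are made up for illustration.

```python
import numpy as np
import xarray as xr

# Hypothetical yearly series with made-up values
da = xr.DataArray(
    np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
    dims="time",
    coords={"time": np.arange(2000, 2005)},
)

# Mean over a window of 3 time steps, centered on each step
smooth = da.rolling(time=3, center=True).mean()
print(smooth.values)
```

The first and last value are NaN because the window is incomplete at the edges of the series.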

## Weighted reductions

We calculated the mean over the latitude and longitude above, but we have to be careful with this. The individual grid cells become smaller as we move towards the poles! So we need to give less weight to grid points at high latitudes. If the lat/lon grid is regular, the cosine of the latitude is a good proxy for the area of the grid cell.
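
To get a feeling for the size of this effect, the following sketch prints the cosine of a few latitudes. The latitudes are chosen arbitrarily for illustration.

```python
import numpy as np

# Arbitrary example latitudes in degrees
lat = np.array([0.0, 30.0, 60.0, 85.0])

# Cell area relative to a cell at the equator (regular lat/lon grid)
rel_area = np.cos(np.deg2rad(lat))

for la, w in zip(lat, rel_area):
    print(f"lat = {la:4.0f} deg, relative cell area = {w:.2f}")
```

A grid cell at 60 degrees latitude covers only half the area of a cell at the equator.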

We look at this with a tutorial dataset xarray offers. `air` is a two-year air temperature dataset over the North American continent. The dataset can be accessed like so:

In [None]:
air = xr.tutorial.open_dataset("air_temperature")
air

We first need to calculate the weights. Because `np.cos` expects the data in radians we need to convert latitudes (which are given in degrees) to radians first.

In [None]:
wgt = np.cos(np.deg2rad(air.lat))

wgt

In [None]:
wgt.plot()

This illustrates another helpful property of xarray objects - you can pass them directly to numpy functions. To calculate the weighted mean, we first create a weighted object with `air.weighted(wgt)` and then calculate the mean over `lat` and `lon`:

In [None]:
air_weighted = air.weighted(wgt).mean(("lat", "lon"))

We also compute the unweighted mean and compare it to the weighted mean. Why is the weighted mean warmer than the unweighted one?

In [None]:
air_unweighted = air.mean(("lat", "lon"))

# ===

air_weighted.air.plot(label="weighted")
air_unweighted.air.plot(label="unweighted")

plt.legend()
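
What `weighted(...).mean(...)` computes can be checked by hand as the sum of weight times value divided by the sum of the weights. The sketch below does this for a hypothetical 2 x 2 field (made-up values: a cold row at 60 degrees and a warm row at the equator).

```python
import numpy as np
import xarray as xr

# Hypothetical field: cold at 60 deg latitude, warm at the equator
da = xr.DataArray(
    np.array([[10.0, 10.0], [30.0, 30.0]]),
    dims=("lat", "lon"),
    coords={"lat": [60.0, 0.0], "lon": [0.0, 90.0]},
)

wgt = np.cos(np.deg2rad(da.lat))

unweighted = da.mean(("lat", "lon"))
weighted = da.weighted(wgt).mean(("lat", "lon"))

# The same weighted mean written out explicitly
numerator = (da * wgt).sum(("lat", "lon"))
denominator = (xr.ones_like(da) * wgt).sum(("lat", "lon"))
by_hand = numerator / denominator

print(float(unweighted), float(weighted), float(by_hand))
```

The cold high-latitude cells count only half as much as the warm cells at the equator, so the weighted mean is warmer than the unweighted one.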

### Exercise

* Repeat the calculation from above using `ds`.

Hints: you need to calculate the weights again. If you want to create a plot you will need to select the `DataArray` (`TXx`) first.

In [None]:
# code here

In [None]:
# solution

wgt = np.cos(np.deg2rad(ds.lat))

ds.weighted(wgt).mean(("lat", "lon"))

In [None]:
# solution

wgt = np.cos(np.deg2rad(ds.lat))

ds.weighted(wgt).mean(("lat", "lon")).TXx.plot()

## Time coordinates

Both datasets used here have a time axis. The time axis has some special properties that help us to work with time coordinates.

The `air` dataset is 6-hourly (4 datapoints each day):

In [None]:
air.time.head()

We can select a single timestep with a string:

In [None]:
air.sel(time="2013-01-01T00")

### Exercise

* What happens if you select with `time="2013-01-01"`?

In [None]:
# code here

In [None]:
# solution
air.sel(time="2013-01-01")

# this selects all timesteps of the day
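
The same behaviour can be reproduced without downloading the tutorial data. The sketch below uses a hypothetical 6-hourly series `da` with made-up values covering two days.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical 6-hourly time axis covering two days (8 time steps)
time = pd.date_range("2013-01-01", periods=8, freq=pd.Timedelta(hours=6))
da = xr.DataArray(np.arange(8.0), dims="time", coords={"time": time})

# A string with hour resolution matches exactly one time step
one = da.sel(time="2013-01-01T06")
print(one.dims, float(one))

# A string with day resolution matches all time steps of that day
day = da.sel(time="2013-01-01")
print(day.sizes["time"])
```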

Again, we can select ranges using `slice`. The first five months of 2013 can be selected using:

In [None]:
air.sel(time=slice("2013-01", "2013-05"))
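
Note that, unlike positional slicing in Python, slices passed to `sel` include the end label. A minimal sketch on a hypothetical daily series (made-up values) covering January to March 2013:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical daily time axis from 1 January to 31 March 2013
time = pd.date_range("2013-01-01", "2013-03-31", freq="D")
da = xr.DataArray(np.arange(time.size, dtype=float), dims="time", coords={"time": time})

# Selects all of January and all of February
jan_feb = da.sel(time=slice("2013-01", "2013-02"))

print(jan_feb.sizes["time"])
print(jan_feb.time.values[0], jan_feb.time.values[-1])
```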

### Exercise

* Select the period 1981 to 2010 from `ds`.

In [None]:
# code here

In [None]:
# solution
ds.sel(time=slice("1981", "2010"))

### Exercise

* What does the following command do? Have a closer look at the resulting time coordinate.

In [None]:
air.resample(time="D").mean()

In [None]:
# it calculates ...

In [None]:
# solution

# it calculates a daily mean
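
The effect on the time coordinate is easy to see on a small example. The sketch below resamples a hypothetical 6-hourly series (made-up values, two days long) to daily means.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical 6-hourly series covering two days
time = pd.date_range("2013-01-01", periods=8, freq=pd.Timedelta(hours=6))
da = xr.DataArray(np.arange(8.0), dims="time", coords={"time": time})

daily = da.resample(time="D").mean()

print(daily.values)       # one mean value per day
print(daily.time.values)  # one time stamp per day
```

The result still has a `time` dimension, but with one entry per day instead of four.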

### Exercise

* Use the code snippet from above to calculate the monthly mean.

In [None]:
# code here

In [None]:
# solution

# "MS" (month start) labels each monthly mean with the first day of the month
air.resample(time="MS").mean()

### Exercise

* What does the following command do? Where did the `time` dimension go? How long is the new dimension? Can you see the difference to the computation with `resample`?

In [None]:
air.groupby("time.month").mean()

In [None]:
# it calculates ...

In [None]:
# solution

# it calculates the mean over all Januaries, Februaries, etc.
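
The difference between the two operations shows up in the shape of the result. The sketch below compares them on a hypothetical daily series (made-up values) that spans two full years.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical daily series for 2013 and 2014
time = pd.date_range("2013-01-01", "2014-12-31", freq="D")
da = xr.DataArray(np.arange(time.size, dtype=float), dims="time", coords={"time": time})

# resample: one value per calendar month of each year
monthly = da.resample(time="MS").mean()

# groupby: one value per month of the year, averaged over all years
climatology = da.groupby("time.month").mean()

print(monthly.dims, monthly.sizes["time"])
print(climatology.dims, climatology.sizes["month"])
```

`resample` keeps the `time` dimension (24 months here), while `groupby` replaces it with a `month` dimension of length 12.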

### Exercise

* What does the following command do? 

In [None]:
air.time.dt.hour

In [None]:
# it returns ...

In [None]:
# solution

# it returns the hour of the day
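
One possible use of the `dt` accessor is to select time steps by the hour of the day. A minimal sketch on a hypothetical 6-hourly series with made-up values:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical 6-hourly series covering two days
time = pd.date_range("2013-01-01", periods=8, freq=pd.Timedelta(hours=6))
da = xr.DataArray(np.arange(8.0), dims="time", coords={"time": time})

hours = da.time.dt.hour
print(hours.values)

# Keep only the time steps at 12:00 using a boolean mask
noon = da.sel(time=hours == 12)
print(noon.values)
```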