# 3.1 Introduction to xarray - DataArrays

prepared by Mathias Hauser

xarray builds on numpy - it adds labels in the form of dimensions, coordinates and attributes to numpy-like _multi_-dimensional<sup>*</sup> arrays.

<sup>*</sup>The [pandas](https://pandas.pydata.org/) library offers similar functionality for 2-dimensional tabular data (think: spreadsheets).

First we need some imports. The only new one here is `xarray`, which is abbreviated as `xr`.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

## Temperature data at two stations 

We start the introduction with an example using pure numpy to motivate why labelled arrays with coordinates may be a good thing.

Assume we have some annual-mean temperature data at two stations named "a" and "b" for the years 1999 to 2019. We create the data using numpy - the following commands should already be familiar to you:

In [None]:
year = np.arange(1999, 2020)
station = np.array(["a", "b"])

# random temperature data with a trend
rng = np.random.default_rng(1)
data = np.array([[0.3, 0.5]]).T * (year - year[0]) + rng.standard_normal(size=(2, 21))

We create a small plot to visualize the data:

In [None]:
f, ax = plt.subplots()

ax.plot(year, data[0, :], label="a")
ax.plot(year, data[1, :], label="b")

ax.legend()

ax.set_xlabel("Year")
ax.set_xticks(np.arange(2000, 2021, 5))

### Exercises

You don't need to code this - just think about it.

* Which of the two commands calculates the mean over all years?
  - `np.mean(data, axis=0)`
  - `np.mean(data, axis=1)`
* Which of the two commands selects station `"b"`?
  - `data[1, :]`
  - `data[:, 1]`
* Which of the four commands select the data of the year 2013?
  - `data[:, 12]`
  - `data[:, 13]`
  - `data[:, 14]`
  - `data[:, year == 2013]`

It would be much easier to answer these questions if 
* the two axes (dimensions) of the array were named
* the year and station (coordinates) were directly associated with the data

A `DataArray` provides both.


## Creating an xarray DataArray

A `DataArray` is the xarray equivalent of a numpy `array`. We can create a `DataArray` by passing some data to it:

In [None]:
dta = [1, 0, 3, 9, 7, np.nan]
xr.DataArray(dta)

But this is not very helpful because the array does not have any coordinates and the dimension gets the default name `dim_0`. Dimension names and coordinates have to be passed separately, and they are what makes a `DataArray` powerful:

In [None]:
x = [0, 10, 20, 30, 40, 50]

da0 = xr.DataArray(dta, dims=["x"], coords=dict(x=x))

da0

Now the dimension is called `"x"` and the array gained coordinates.

### Exercise

* Create a `DataArray` using `data` from above as data variable, and `year` and `station` as coordinates. The dimensions should be called `"year"` and `"station"`.

In [None]:
# code here

In [None]:
# solution

xr.DataArray(data, dims=("station", "year"), coords=dict(station=station, year=year))

We now introduce how you can work with `DataArray` objects using `da0`. This array is only one-dimensional, so some of the examples are a bit pointless, but everything also generalizes to multi-dimensional `DataArray`s.


## Attributes

Similar to a numpy array, a `DataArray` has some attributes which describe it. A new one here is `dims`, which lists the names of the dimensions. However, I also find the printed representation already very helpful:

In [None]:
print(f"{da0.ndim = }")
print(f"{da0.shape= }")
print(f"{da0.dims = }")
print(f"{da0.dtype= }")

In [None]:
print(da0)
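The labels can also be read out programmatically. The following is a minimal sketch that rebuilds `da0` so that it runs on its own; it uses the `sizes`, `coords` and `values` attributes of a `DataArray`:

```python
import numpy as np
import xarray as xr

da0 = xr.DataArray(
    [1, 0, 3, 9, 7, np.nan], dims=["x"], coords=dict(x=[0, 10, 20, 30, 40, 50])
)

# mapping from dimension name to its length
sizes = dict(da0.sizes)

# a coordinate is itself a DataArray; da0.x is a shortcut for da0.coords["x"]
x_coord = da0.coords["x"]

# the underlying numpy array (without any labels)
values = da0.values

print(sizes)
print(x_coord.values)
print(type(values))
```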

## Plotting a DataArray

A `DataArray` has a `plot` method which creates a figure of the data it contains. This is super useful because it uses the dimensions and coordinates for the plot. Check the x-axis to see that the x-coords were used. Under the hood, xarray also uses matplotlib to create the plot.

In [None]:
da0.plot(marker="*")
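Because the plot is a regular matplotlib figure, it can be combined with the matplotlib commands used earlier. A small sketch, assuming we want to draw into an axes created with `plt.subplots()` (the title is only for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

da0 = xr.DataArray(
    [1, 0, 3, 9, 7, np.nan], dims=["x"], coords=dict(x=[0, 10, 20, 30, 40, 50])
)

f, ax = plt.subplots()

# draw into an existing axes; extra keyword arguments are passed on to matplotlib
lines = da0.plot(ax=ax, marker="*")

# the usual matplotlib methods still work afterwards
ax.set_title("da0")
```

For a 1-dimensional array the call returns the list of matplotlib line objects, so they can be restyled later if needed.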

## Basic operations

As with numpy arrays, you can perform arithmetic operations on `DataArray`s:

In [None]:
da0 + 1000

This only affects the data - the coordinates are unchanged! You can also multiply two `DataArray`s: 

In [None]:
da0 * da0
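To convince yourself that the labels survive such operations, you can compare the coordinates before and after. The sketch below rebuilds `da0` to be self-contained and also applies a numpy function, which returns a `DataArray` as well:

```python
import numpy as np
import xarray as xr

da0 = xr.DataArray(
    [1, 0, 3, 9, 7, np.nan], dims=["x"], coords=dict(x=[0, 10, 20, 30, 40, 50])
)

result = da0 + 1000

# the coordinates are carried over unchanged
same_coords = result.x.equals(da0.x)
print(same_coords)

# numpy functions can be applied directly and keep the labels
root = np.sqrt(da0)
print(root.dims)
```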

## Alignment and broadcasting

xarray uses the dimension names and coordinates to align & broadcast the data.

In [None]:
da1 = xr.DataArray([1, 0, 3], dims=["x"], coords=dict(x=[0, 10, 20]))
da1

In [None]:
da0 + da1

See how only coordinates that occur in both arrays (0, 10, and 20) are used in the result. xarray does an inner join for arithmetic operations (this behavior can be changed).
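One way to change the join is the `arithmetic_join` option of `xr.set_options`. The following sketch compares the default inner join with an outer join; the arrays are rebuilt here so that the block runs on its own:

```python
import numpy as np
import xarray as xr

da0 = xr.DataArray(
    [1, 0, 3, 9, 7, np.nan], dims=["x"], coords=dict(x=[0, 10, 20, 30, 40, 50])
)
da1 = xr.DataArray([1, 0, 3], dims=["x"], coords=dict(x=[0, 10, 20]))

# default: only the shared coordinates 0, 10 and 20 are kept
inner = da0 + da1

# outer join: all coordinates are kept, missing combinations become NaN
with xr.set_options(arithmetic_join="outer"):
    outer = da0 + da1

print(inner.x.values)
print(outer.x.values)
```

Using the option as a context manager keeps the change local, so the default behavior applies again after the `with` block.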

### Exercise

* Replace `20` by `50` in `da2` - what happens?

In [None]:
# update code
da2 = xr.DataArray([1, 0, 3], dims=["x"], coords=dict(x=[0, 10, 20]))

da0 + da2

# what happens?
#

In [None]:
# solution

da2 = xr.DataArray([1, 0, 3], dims=["x"], coords=dict(x=[0, 10, 50]))

da0 + da2

# what happens?
# x=20 is no longer shared - the result contains x=0, 10 and 50
# (the value at x=50 is NaN because da0 is NaN there)

### Exercise

* Replace `20` by `20 + 1e-14` in `da3` - what happens?

In [None]:
# update code
da3 = xr.DataArray([1, 0, 3], dims=["x"], coords=dict(x=[0, 10, 20]))

da0 + da3

# what happens?
#

In [None]:
# solution
da3 = xr.DataArray([1, 0, 3], dims=["x"], coords=dict(x=[0, 10, 20 + 1e-14]))

da0 + da3

# what happens
# 20 != 20 + 1e-14 -> therefore this element is no longer part of the result
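If coordinates differ only by such rounding errors, one possible remedy is to put one array onto the coordinates of the other before the calculation. The sketch below uses `reindex_like` with `method="nearest"`; the tolerance of `1e-8` is an assumption chosen for this example and has to be adapted to your data:

```python
import numpy as np
import xarray as xr

da0 = xr.DataArray(
    [1, 0, 3, 9, 7, np.nan], dims=["x"], coords=dict(x=[0, 10, 20, 30, 40, 50])
)
da3 = xr.DataArray([1, 0, 3], dims=["x"], coords=dict(x=[0, 10, 20 + 1e-14]))

# naive addition: x=20 drops out because the coordinates are not exactly equal
naive = da0 + da3

# use the coordinates of da0; labels further away than the tolerance become NaN
da3_fixed = da3.reindex_like(da0, method="nearest", tolerance=1e-8)

result = da0 + da3_fixed

print(naive.x.values)
print(result.x.values)
```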

### Exercise

* Rename `"x"` to `"y"` in `da4` - what happens?

In [None]:
# update code
da4 = xr.DataArray([1, 0, 3], dims=["x"], coords=dict(x=[0, 10, 20]))

da0 + da4

# what happens
#

In [None]:
# solution
da4 = xr.DataArray([1, 0, 3], dims=["y"], coords=dict(y=[0, 10, 20]))

da0 + da4

# what happens
# the coordinates are combined - the result now has x & y coords
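Looking at `dims` and `shape` of the result makes the broadcasting explicit. For comparison, the sketch below also shows the same calculation in pure numpy, where the new axes have to be inserted by hand:

```python
import numpy as np
import xarray as xr

dta = [1, 0, 3, 9, 7, np.nan]

da0 = xr.DataArray(dta, dims=["x"], coords=dict(x=[0, 10, 20, 30, 40, 50]))
da4 = xr.DataArray([1, 0, 3], dims=["y"], coords=dict(y=[0, 10, 20]))

# xarray broadcasts by dimension name
result = da0 + da4

print(result.dims)
print(result.shape)

# numpy broadcasts by position - the axes have to be added manually
manual = np.array(dta)[:, np.newaxis] + np.array([1, 0, 3])[np.newaxis, :]
```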

## Reductions

Reductions work similarly to numpy. By default they also reduce over all dimensions:

In [None]:
da0.mean()

One difference to numpy is that we always call the method on the `DataArray` (i.e. `da.mean()`). A second difference concerns the handling of missing values. By default xarray skips missing values in reduction operations. This is (in my experience) almost always what we want. If missing values should be kept, we need to set `skipna=False`:

In [None]:
da0.mean(skipna=False)
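The different handling of missing values can be checked side by side. This sketch rebuilds `da0` and compares the xarray results with `np.mean` and `np.nanmean`:

```python
import numpy as np
import xarray as xr

dta = [1, 0, 3, 9, 7, np.nan]
da0 = xr.DataArray(dta, dims=["x"], coords=dict(x=[0, 10, 20, 30, 40, 50]))

# xarray skips the NaN by default: (1 + 0 + 3 + 9 + 7) / 5
xr_default = da0.mean().item()

# with skipna=False the NaN propagates
xr_keep = da0.mean(skipna=False).item()

# numpy propagates the NaN unless we use the nan-aware function
np_default = np.mean(dta)
np_skip = np.nanmean(dta)

print(xr_default, xr_keep, np_default, np_skip)
```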

However, the biggest change compared to numpy is that instead of reducing over `axis` we reduce over `dim`:

In [None]:
da0.mean("x")

For our 1-dimensional array this is trivial but becomes extremely convenient if the array has more dimensions.
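To illustrate this, here is a minimal sketch with a small 2-dimensional array. The dimension names `"lat"` and `"lon"` are made up for this example:

```python
import numpy as np
import xarray as xr

arr = np.arange(6.0).reshape(2, 3)

# hypothetical dimension names, chosen only for illustration
da2d = xr.DataArray(arr, dims=["lat", "lon"])

# numpy: we have to remember that "lon" is axis 1
by_axis = arr.mean(axis=1)

# xarray: we name the dimension to reduce over
by_name = da2d.mean("lon")

# the named version also works if the order of the dimensions changes
by_name_t = da2d.transpose("lon", "lat").mean("lon")

print(by_name.values)
```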

## Selecting data (Indexing)

You can select data just like in numpy using positional indexing, e.g. `da0[2]`:

In [None]:
da0

In [None]:
da0[2]

However, this does not take advantage of the named dimensions or the coordinates. We have to differentiate between two ways to select data:

* by position: this is done using `da.isel(x=2)` (or `da[2]`)
* by coordinate: this is done using `da.sel(x=20)` (or `da.loc[20]`)

Let's try it out



In [None]:
da0.isel(x=2)

In [None]:
da0.sel(x=20)
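For `da0` all four variants listed above return the same element. The sketch below checks this and also shows that `sel` raises a `KeyError` for a label that does not exist (`x=25` is just an arbitrary missing label):

```python
import numpy as np
import xarray as xr

da0 = xr.DataArray(
    [1, 0, 3, 9, 7, np.nan], dims=["x"], coords=dict(x=[0, 10, 20, 30, 40, 50])
)

# by position
a = da0.isel(x=2)
b = da0[2]

# by coordinate
c = da0.sel(x=20)
d = da0.loc[20]

print(a.item(), b.item(), c.item(), d.item())

# selecting a label that is not part of the coordinates fails
try:
    da0.sel(x=25)
    missing_label_fails = False
except KeyError:
    missing_label_fails = True

print(missing_label_fails)
```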

We can also use slicing to select more than a single value from an array. Unfortunately, we can _not_ use the convenient `1:3` syntax inside `isel` but have to write this as `da.isel(x=slice(1, 3))`:



In [None]:
da0.isel(x=slice(1, 3))
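Slices also work with `sel`, but there is one difference to keep in mind: positional slices exclude the end (as in numpy), while coordinate-based slices include it. A short sketch, again rebuilding `da0`:

```python
import numpy as np
import xarray as xr

da0 = xr.DataArray(
    [1, 0, 3, 9, 7, np.nan], dims=["x"], coords=dict(x=[0, 10, 20, 30, 40, 50])
)

# by position: index 3 is excluded
by_position = da0.isel(x=slice(1, 3))

# by coordinate: the label 30 is included
by_label = da0.sel(x=slice(10, 30))

print(by_position.x.values)
print(by_label.x.values)
```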

## Exercises

Let's use what we learned above to work with our example dataset. First let's create a `DataArray` with the temperature data:

In [None]:
da = xr.DataArray(
    data, dims=["station", "year"], coords=dict(station=station, year=year)
)

da

### Exercise
* Calculate the mean over all years. (Make sure you get two numbers!)

In [None]:
# code here

In [None]:
# solution

da.mean("year")

### Exercise
* Select station `"b"` by its coordinate using `da.sel`.


In [None]:
# code here

In [None]:
# solution

da.sel(station="b")

### Exercise
* Select station `"a"` by position using `da.isel`.

In [None]:
# code here

In [None]:
# solution

da.isel(station=0)

### Exercise
* Select the data of the year 2013.

In [None]:
# code here

In [None]:
# solution
da.sel(year=2013)

### Exercise
* Select the years 2000 to 2010 (use `slice`).

In [None]:
# code here

In [None]:
# solution
da.sel(year=slice(2000, 2010))

### Exercise
* Calculate anomalies. Subtract the mean of the years 2000 to 2010 from the data (make sure you do not average over the two stations!).

In [None]:
# code here

In [None]:
# solution

da - da.sel(year=slice(2000, 2010)).mean("year")
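A quick way to check the anomalies is to average them over the reference period again, which should give (numerically) zero for each station. The sketch below recreates the example data so that it runs on its own; the names `reference`, `anomalies` and `check` are only used for this illustration:

```python
import numpy as np
import xarray as xr

year = np.arange(1999, 2020)
station = np.array(["a", "b"])

rng = np.random.default_rng(1)
data = np.array([[0.3, 0.5]]).T * (year - year[0]) + rng.standard_normal(size=(2, 21))

da = xr.DataArray(
    data, dims=["station", "year"], coords=dict(station=station, year=year)
)

# mean over the reference period - one value per station
reference = da.sel(year=slice(2000, 2010)).mean("year")

# the "station" dimension is matched by name, "year" is broadcast
anomalies = da - reference

# averaging the anomalies over the reference period should give zero
check = anomalies.sel(year=slice(2000, 2010)).mean("year")

print(check.values)
```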

### Exercise
* Create a plot of `da` (you will have to pass `infer_intervals=False`, otherwise it will result in an error). Why does this not result in a line plot?

> Note: depending on the version of xarray and matplotlib this may throw an error - just ignore the exercise in this case.

In [None]:
# code here

In [None]:
# solution
da.plot(infer_intervals=False)

# xarray creates a 2D plot if 2D data is passed

### Exercise
* Select `station="a"` and then create a plot.

In [None]:
# code here

In [None]:
# solution
da.sel(station="a").plot()

### Exercise

* Try `da.plot(hue="station")` (use a translator to find out what "hue" means if you are not sure)

In [None]:
# code here

In [None]:
# solution
da.plot(hue="station")