# Xarray Dataset and DataArray operations

In [15]:
# initialization
import numpy as np
import pandas as pd
import xarray as xr

In [16]:
# Load the B-SOSE dataset
bsose = xr.open_dataset("data/bsose_monthly_velocities.nc")

## Subsetting an xarray Dataset or DataArray

The main tools for subsetting an xarray `Dataset` or `DataArray` are the `.isel()` and `.sel()` methods. Similar to `.iloc[]` for a DataFrame, `.isel()` subsets a Dataset or DataArray by integer position along a dimension. For example, to select the shallowest depth level from the `bsose` Dataset, we can do:

In [3]:
bsose.isel(depth=0)

We can also supply a slice as an argument to `.isel()` and `.sel()`. However, instead of the shorthand _start_:_stop_:_step_, we need an explicit `slice()` call. For example, to select the two shallowest depth levels from the `bsose` Dataset, we can do:

In [4]:
bsose.isel(depth=slice(0, 2))

As with ordinary Python slicing, the slice is *endpoint exclusive* when you use `.isel()`.

Furthermore, we can select along multiple dimensions in a single call, e.g.:

In [5]:
bsose.isel(depth=slice(0,2), time=5)

While `.isel()` is useful in some situations, the more commonly useful subsetting method is `.sel()`, which subsets dimensions by the corresponding coordinate values, e.g.:

In [13]:
bsose.sel(depth=2.1)

And just like `.loc[]` for DataFrame, slicing with `.sel()` is *endpoint inclusive*:

In [15]:
bsose.sel(depth=slice(2.1, 26.25))
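To make the difference between the two slicing conventions concrete, here is a minimal sketch on a small synthetic Dataset. The name `toy` and its depth values are made up for illustration and are not the actual B-SOSE grid.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for bsose; the depth values are illustrative only
depth = np.array([2.1, 6.7, 12.15, 18.55, 26.25, 35.25])
toy = xr.Dataset(
    {"U": ("depth", np.arange(depth.size, dtype=float))},
    coords={"depth": depth},
)

# Positional slice: positions 0 and 1 are kept, position 2 is excluded
by_position = toy.isel(depth=slice(0, 2))

# Label slice: both endpoints (2.1 and 26.25) are kept
by_label = toy.sel(depth=slice(2.1, 26.25))

print(by_position["depth"].values)
print(by_label["depth"].values)
```

The positional slice returns two depth levels, while the label slice returns every level from 2.1 up to and including 26.25.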

Quite often you may not know the exact coordinate values of the dimension you want to subset. Two mechanisms in `.sel()` can help in such cases. First, when selecting a single coordinate, you can pass `method="nearest"` to select the closest match, e.g.:

In [16]:
bsose.sel(depth=5, method="nearest")

Second, when you subset a Dataset or DataArray by slice, the bounds do not need to be exact coordinate values, e.g.:

In [17]:
bsose.sel(depth=slice(0, 100))
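Both mechanisms can be checked on a small synthetic Dataset. This is a sketch only: `toy` and its depth values are hypothetical and chosen so the result is easy to verify by hand.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for bsose; the depth values are illustrative only
depth = np.array([2.1, 6.7, 12.15, 18.55, 26.25, 35.25])
toy = xr.Dataset(
    {"U": ("depth", np.arange(depth.size, dtype=float))},
    coords={"depth": depth},
)

# 5 is not a coordinate value; the closest one is 6.7 (distance 1.7 vs 2.9 for 2.1)
nearest = toy.sel(depth=5, method="nearest")

# Neither 0 nor 20 is a coordinate value; every depth inside [0, 20] is kept
window = toy.sel(depth=slice(0, 20))

print(float(nearest["depth"]))
print(window["depth"].values)
```

Note that `method="nearest"` applies to single-value (or list) lookups; it is not combined with a `slice` here.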

The same also works for time:

In [18]:
bsose.sel(time=slice(pd.to_datetime("2012-01-01"), pd.to_datetime("2012-03-31")))
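For datetime coordinates, xarray also accepts date strings directly in `.sel()`, which can be a little shorter than calling `pd.to_datetime`. The sketch below uses a hypothetical monthly time axis for 2012 (the name `toy` and its values are made up for illustration).

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical monthly time axis: first day of each month in 2012
time = pd.date_range("2012-01-01", periods=12, freq="MS")
toy = xr.Dataset(
    {"U": ("time", np.arange(12, dtype=float))},
    coords={"time": time},
)

# Date strings are parsed for us; both bounds are inclusive
first_quarter = toy.sel(time=slice("2012-01-01", "2012-03-31"))

print(first_quarter["time"].values)
```

With this toy time axis, the selection keeps the January, February, and March entries.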

Finally, just like `.loc[]` for DataFrame, you can subset in `.sel()` using boolean arrays, e.g.:

In [21]:
bsose.sel(lat=(bsose.coords["lat"] < -50) & (bsose.coords["lat"] > -70))
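A closely related alternative is `.where(mask, drop=True)`, which masks by a condition and then drops the coordinate labels where the condition is false everywhere. The following sketch compares the two on a hypothetical latitude axis (the name `toy` and its values are illustrative only).

```python
import numpy as np
import xarray as xr

# Hypothetical latitude axis, in degrees north
lat = np.array([-75.0, -65.0, -55.0, -45.0])
toy = xr.Dataset(
    {"U": ("lat", np.arange(lat.size, dtype=float))},
    coords={"lat": lat},
)

# Boolean mask: True only for latitudes strictly between -70 and -50
mask = (toy.coords["lat"] < -50) & (toy.coords["lat"] > -70)

by_mask = toy.sel(lat=mask)              # boolean selection, as in the cell above
by_where = toy.where(mask, drop=True)    # mask, then drop all-False labels

print(by_mask["lat"].values)
print(by_where["lat"].values)
```

For a one-dimensional condition like this one, both approaches keep the same latitudes. `.where()` becomes more useful when the condition spans several dimensions, since it fills non-matching cells with NaN instead of requiring a rectangular selection.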

## DataArray calculations and Dataset modifications

Arithmetic operators (e.g., `+`, `-`, `*`, `/`) work element-wise on xarray `DataArray` objects, and the result is another `DataArray`. For example, the `U` and `V` variables in `bsose` are the eastward and northward ocean current velocities, respectively. Ignoring the vertical component of the ocean current, the current speed is the magnitude of the horizontal velocity vector, i.e., the square root of `U**2 + V**2`, which can be obtained as:

In [3]:
speed = np.sqrt(bsose["U"]**2 + bsose["V"]**2)

In [4]:
display(speed)

Note that NumPy element-wise functions such as `np.sqrt` can be applied directly to an xarray `DataArray`, and the result is again a `DataArray`.

If we have an xarray `Dataset`, we can assign a new data variable to it using the same `[]` operator that is used to access variables. For example, with our speed calculation, we can do:

In [7]:
bsose["speed"] = np.sqrt(bsose["U"]**2 + bsose["V"]**2)

In [8]:
display(bsose)

Note that the `Dataset` is modified *in-place*. Moreover, if the data variable already exists, it will be overwritten.
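If you would rather leave the original Dataset untouched, `.assign()` returns a new Dataset with the extra variable. A minimal sketch on hypothetical data (the name `toy` and the velocity values are made up so the result is easy to check):

```python
import numpy as np
import xarray as xr

# Hypothetical velocities chosen so the speeds are easy to verify (3-4-5 triangle)
toy = xr.Dataset(
    {
        "U": ("depth", np.array([3.0, 0.0])),
        "V": ("depth", np.array([4.0, 5.0])),
    },
    coords={"depth": [2.1, 6.7]},
)

# .assign() returns a new Dataset; `toy` itself is not modified
with_speed = toy.assign(speed=np.sqrt(toy["U"] ** 2 + toy["V"] ** 2))

print("speed" in toy)          # the original has no "speed" variable
print("speed" in with_speed)   # the returned Dataset does
```

This style also fits well with method chaining, because each step returns a new object instead of mutating an existing one.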

## Xarray statistics functions

In addition to calculations involving arithmetic operators and NumPy functions, xarray also provides a number of statistical methods, which can be applied along specific dimensions. For example, suppose we want to compute the yearly ocean current as the average of the monthly values; we can do:

In [11]:
bsose_yearly = bsose.mean("time")
display(bsose_yearly)

Note that the same method also exists for DataArray, for example:

In [12]:
speed_yearly = speed.mean("time")
display(speed_yearly)

Other statistical methods include:
+ `.median()`: calculate the median along the given dimension(s)
+ `.min()`: calculate the minimum along the given dimension(s)
+ `.max()`: calculate the maximum along the given dimension(s)
+ `.sum()`: calculate the sum along the given dimension(s)
+ `.var()`: calculate the variance along the given dimension(s)
+ `.std()`: calculate the standard deviation along the given dimension(s)
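The methods listed above share the same calling pattern: pass one dimension name, a list of names, or nothing at all to reduce over every dimension. A short sketch on a hypothetical 3x2 DataArray (the name `speed_toy` and its values are illustrative only):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical 3 (time) x 2 (depth) array with values 0..5
speed_toy = xr.DataArray(
    np.arange(6, dtype=float).reshape(3, 2),
    dims=("time", "depth"),
    coords={
        "time": pd.date_range("2012-01-01", periods=3, freq="MS"),
        "depth": [2.1, 6.7],
    },
    name="speed",
)

mean_over_time = speed_toy.mean("time")        # one value per depth
max_overall = speed_toy.max(["time", "depth"])  # reduce over both dimensions
std_over_time = speed_toy.std("time")          # population std (ddof=0) by default

print(mean_over_time.values)
print(float(max_overall))
```

The reduced dimension disappears from the result, so `mean_over_time` has only the `depth` dimension left.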

## Remark: method chaining

Sometimes you want to perform different actions on different dimensions, e.g., extract data at the shallowest depth and also average over time. Since `.isel()`, `.sel()`, `.mean()`, etc. all return an object of the same type as the input, we can easily perform multiple actions using a coding style known as method chaining. For the specific case above, we can do:

In [14]:
bsose.isel(depth=0).mean("time")
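Longer chains are often easier to read when wrapped in parentheses with one method per line. The sketch below uses a hypothetical Dataset (the name `toy` and its values are made up) to show that the chained form gives the same result as doing the steps one at a time.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical 3 (time) x 2 (depth) velocity field with values 0..5
toy = xr.Dataset(
    {"U": (("time", "depth"), np.arange(6, dtype=float).reshape(3, 2))},
    coords={
        "time": pd.date_range("2012-01-01", periods=3, freq="MS"),
        "depth": [2.1, 6.7],
    },
)

# Chained: one step per line, wrapped in parentheses
chained = (
    toy
    .isel(depth=0)   # keep the shallowest level: U = [0, 2, 4]
    .mean("time")    # average those three values
)

# Equivalent step-by-step version with an intermediate variable
surface = toy.isel(depth=0)
stepwise = surface.mean("time")

print(float(chained["U"]))
```

The two versions are interchangeable; chaining simply avoids naming intermediate results that are not needed afterwards.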