# Working with Zarr and Xarray
SeqData is built on two packages: Zarr and Xarray. Most Pythonistas are familiar with Pandas and NumPy but may be less familiar with Zarr or Xarray. This tutorial covers what you need to know about both packages to work with SeqData. More comprehensive tutorials can be found in the [Xarray](https://docs.xarray.dev/en/stable/) and [Zarr](https://zarr.dev/) documentation.

## Why use Xarray?

Genomics data is multidimensional and complex. Pandas is great for 2D data and NumPy can handle n-dimensional arrays, but Xarray is specifically designed to handle n-dimensional data with labeled dimensions and coordinates. We believe this leads to a more intuitive, more concise, and less error-prone developer experience. Xarray is also built with a Pythonic API very similar to Pandas and NumPy, and it can easily convert to and from these libraries (when applicable).

In [2]:
import numpy as np
import pandas as pd
import xarray as xr
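
To make the point about labeled dimensions concrete, here is a minimal sketch that computes the same reduction by axis number in NumPy and by dimension name in Xarray. The array contents and the dimension names `sample` and `position` are invented for this illustration.

```python
import numpy as np
import xarray as xr

# Hypothetical data: 2 samples x 3 positions
values = np.arange(6.0).reshape(2, 3)
labeled = xr.DataArray(values, dims=("sample", "position"))

# NumPy: you have to remember that axis 1 means "position"
numpy_mean = values.mean(axis=1)

# Xarray: the dimension name states the intent directly
xarray_mean = labeled.mean(dim="position")

# The underlying NumPy array is available through .values
same_result = np.array_equal(numpy_mean, xarray_mean.values)
```

The result of the Xarray reduction still carries the remaining `sample` dimension by name, so later steps do not depend on axis order.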

## Xarray Data Structures
Adapted from: https://docs.xarray.dev/en/latest/user-guide/data-structures.html#data-structures

Xarray has two core data structures, both fundamentally N-dimensional. The first is the `DataArray`, a labeled, N-dimensional array. A `DataArray` is an N-D generalization of a `pandas.Series` and works very similarly to a NumPy array:

In [4]:
data = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 20]})
data

We created a 2D array above with the labeled dimensions `x` and `y`. Xarray uses 'coordinates' to provide meaningful labels for positions along the dimensions of a dataset. In this case we gave the `x` dimension the coordinate labels 10 and 20 and left `y` without coordinates. Coordinates are not required, but they enable indexing of data along those dimensions beyond simple integer indexing.
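
The difference between label-based and integer-based lookup can be sketched as follows. The array here is a stand-in with fixed values (named `arr` to avoid clobbering `data`), and it assumes the same layout as above: coordinates on `x` only.

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(
    np.arange(6).reshape(2, 3),
    dims=("x", "y"),
    coords={"x": [10, 20]},  # only "x" gets coordinate labels
)

by_label = arr.sel(x=20)     # look up the row labeled 20
by_position = arr.isel(x=1)  # look up the second row by integer position

# Both calls select the same data
same_selection = by_label.equals(by_position)

# "y" is a dimension but has no coordinate variable attached
y_has_coords = "y" in arr.coords
```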

The second Xarray data structure worth mentioning is the `Dataset`. A `Dataset` is a multi-dimensional, in-memory array database that behaves like a Python dictionary of `DataArray` objects aligned along any number of shared dimensions. It serves a similar purpose in Xarray to a `DataFrame` in pandas.

In [5]:
ds = xr.Dataset(dict(foo=data, bar=("x", [1, 2]), baz=np.pi))
ds

In [6]:
ds["foo"]
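
The dictionary-like behavior of a `Dataset` can be sketched like this. The variable names mirror the cell above, `qux` is a hypothetical extra variable added for illustration, and the dataset is called `example` so it does not overwrite `ds`.

```python
import numpy as np
import xarray as xr

foo = xr.DataArray(np.zeros((2, 3)), dims=("x", "y"), coords={"x": [10, 20]})
example = xr.Dataset(dict(foo=foo, bar=("x", [1, 2]), baz=np.pi))

# Data variables are keyed by name, like dictionary entries
names = set(example.data_vars)
has_foo = "foo" in example

# Each variable keeps its own dimensions; "baz" is a 0-dimensional scalar
bar_dims = example["bar"].dims
baz_dims = example["baz"].dims

# Shared dimension sizes are tracked once for the whole dataset
sizes = dict(example.sizes)

# New variables are added with dictionary-style assignment
example["qux"] = ("y", [7, 8, 9])
```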

## Indexing Xarray
Adapted from: https://docs.xarray.dev/en/stable/user-guide/indexing.html

Xarray supports four different kinds of indexing, as summarized in this table:

| Dimension lookup | Index lookup | `DataArray` syntax            | `Dataset` syntax             |
|------------------|--------------|-------------------------------|------------------------------|
| Positional       | By integer   | `da[:, 0]`                    | *not available*              |
| Positional       | By label     | `da.loc[:, 'IA']`             | *not available*              |
| By name          | By integer   | `da.isel(space=0)` or <br>    | `ds.isel(space=0)` or <br>   |
|                  |              | `da[dict(space=0)]`           | `ds[dict(space=0)]`          |
| By name          | By label     | `da.sel(space='IA')` or <br>  | `ds.sel(space='IA')` or <br> |
|                  |              | `da.loc[dict(space='IA')]`    | `ds.loc[dict(space='IA')]`   |

Let's see how indexing works in practice:

### Positional indexing

In [18]:
da = xr.DataArray(
    np.random.rand(4, 3),
    [
        ("time", pd.date_range("2000-01-01", periods=4)),
        ("space", ["IA", "IL", "IN"]),
    ],
)
da

Indexing a `DataArray` directly works (mostly) just like it does for NumPy arrays, except that the returned object is always another `DataArray`:

In [19]:
# Integer based indexing
da[:2]

Xarray also supports label-based indexing, just like pandas.

In [20]:
# Label based indexing
da.loc["2000-01-01":"2000-01-02", "IA"]

### Indexing with dimension names

With the dimension names, we do not have to rely on dimension order and can use them explicitly to slice data. There are two ways to do this:

1. Use the sel() and isel() convenience methods:

In [21]:
# index by integer array indices
da.isel(space=0, time=slice(None, 2))

In [22]:
# index by dimension coordinate labels
da.sel(time=slice("2000-01-01", "2000-01-02"))

2. Use a dictionary as the argument for positional or label-based indexing:

In [24]:
# index by integer array indices
da[dict(space=0, time=slice(None, 2))]

In [25]:
# index by dimension coordinate labels
da.loc[dict(time=slice("2000-01-01", "2000-01-02"))]
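
To tie the four indexing styles together, this sketch selects the same sub-array in each style and checks that the results agree. It rebuilds an array shaped like `da` with fixed values (named `arr` here) so the outcome is reproducible. Note that label-based slices include both endpoints, as in pandas.

```python
import numpy as np
import pandas as pd
import xarray as xr

arr = xr.DataArray(
    np.arange(12).reshape(4, 3),
    [
        ("time", pd.date_range("2000-01-01", periods=4)),
        ("space", ["IA", "IL", "IN"]),
    ],
)

# Positional dimension lookup: depends on the order (time, space)
positional_int = arr[:2, 0]
positional_label = arr.loc["2000-01-01":"2000-01-02", "IA"]

# Lookup by dimension name: the order of the arguments does not matter
named_int = arr.isel(space=0, time=slice(None, 2))
named_label = arr.sel(space="IA", time=slice("2000-01-01", "2000-01-02"))

all_equal = all(
    positional_int.equals(other)
    for other in (positional_label, named_int, named_label)
)
```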

### Dataset indexing

We can also use these methods to index all variables in a dataset simultaneously, returning a new dataset.

In [27]:
da = xr.DataArray(
    np.random.rand(4, 3),
    [
        ("time", pd.date_range("2000-01-01", periods=4)),
        ("space", ["IA", "IL", "IN"]),
    ],
)
ds = da.to_dataset(name="foo")
ds

Positional indexing on a dataset is not supported because the ordering of dimensions in a dataset is somewhat ambiguous (it can vary between different arrays). However, you can do normal indexing with dimension names:

In [29]:
ds.isel(space=[0], time=[0])

In [30]:
ds.sel(time="2000-01-01")
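
One detail worth knowing when indexing by name: passing a list keeps the dimension (with length 1), while passing a scalar drops it and leaves the label behind as a scalar coordinate. A small sketch, using a dataset built like the one above but with fixed values and the hypothetical name `ds_example`:

```python
import numpy as np
import pandas as pd
import xarray as xr

arr = xr.DataArray(
    np.arange(12).reshape(4, 3),
    [
        ("time", pd.date_range("2000-01-01", periods=4)),
        ("space", ["IA", "IL", "IN"]),
    ],
)
ds_example = arr.to_dataset(name="foo")

# List indexers keep both dimensions, each with length 1
kept = ds_example.isel(space=[0], time=[0])
kept_shape = kept["foo"].shape

# Scalar indexers drop the dimensions entirely
dropped = ds_example.isel(space=0, time=0)
dropped_shape = dropped["foo"].shape

# The selected label survives as a scalar coordinate
space_still_recorded = "space" in dropped.coords
```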

## Other useful Xarray operations
Adapted from: https://docs.xarray.dev/en/latest/user-guide/reshaping.html

### Transposing Xarray objects

We can reorder dimensions in an Xarray object in a manner very similar to NumPy arrays using the `transpose()` method. The main difference is that we can use dimension names instead of axis numbers:

In [34]:
ds = xr.Dataset({"foo": (("x", "y", "z"), [[[42]]]), "bar": (("y", "z"), [[24]])})
ds


In [35]:
ds.transpose("y", "z", "x")
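
For comparison with NumPy, the following sketch performs the same reordering by axis number and by dimension name. The array shape is arbitrary and chosen only so that each axis has a distinct length.

```python
import numpy as np
import xarray as xr

values = np.arange(24).reshape(2, 3, 4)
arr = xr.DataArray(values, dims=("x", "y", "z"))

# NumPy: axes are referred to by number
by_axis = np.transpose(values, (1, 2, 0))

# Xarray: axes are referred to by name
by_name = arr.transpose("y", "z", "x")
same_values = np.array_equal(by_name.values, by_axis)

# An Ellipsis stands for "all remaining dimensions, in their current order"
z_first = arr.transpose("z", ...)
```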

### Concatenating Xarray objects

We can concatenate Xarray objects along a new or existing dimension using the concat() function:

In [38]:
da = xr.DataArray(
    np.arange(6).reshape(2, 3), [("x", ["a", "b"]), ("y", [10, 20, 30])]
)
da

In [40]:
xr.concat([da[:, :1], da[:, 1:]], dim="y")
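
The cell above concatenates along the existing dimension `y`. Concatenating along a new dimension looks like the sketch below, where the dimension name `row` is made up for illustration. The scalar `x` coordinate is dropped from each piece (`drop=True`) to keep the example simple.

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(
    np.arange(6).reshape(2, 3), [("x", ["a", "b"]), ("y", [10, 20, 30])]
)

# Split into 1D pieces along "x", discarding the scalar "x" coordinate
rows = [arr.isel(x=i, drop=True) for i in range(arr.sizes["x"])]

# Stack the pieces along a brand-new dimension called "row"
stacked = xr.concat(rows, dim="row")

same_values = np.array_equal(stacked.values, arr.values)
```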

## Converting Xarray to other formats

In [19]:
# convert to a pandas Series
series = data.to_series()
series

x   y
10  0    0.623860
    1   -0.393178
    2   -1.052273
20  0    1.107031
    1    0.636440
    2    1.454341
dtype: float64

In [None]:
# convert to a pandas DataFrame (an unnamed DataArray needs a column name)
data.to_dataframe(name="foo")

## Zarr stores
Adapted from: https://docs.xarray.dev/en/latest/user-guide/io.html#zarr

Zarr is a Python package that provides an implementation of chunked, compressed, N-dimensional arrays. Zarr can store arrays in a range of ways, including in memory, in files, and in cloud-based object storage such as Amazon S3 and Google Cloud Storage. Xarray’s Zarr backend allows Xarray to leverage these capabilities, including the ability to store and analyze datasets far too large to fit onto disk (particularly in combination with dask).

In [None]:
ds = xr.Dataset(
    {"foo": (("x", "y"), np.random.rand(4, 5))},
    coords={
        "x": [10, 20, 30, 40],
        "y": pd.date_range("2000-01-01", periods=5),
        "z": ("x", list("abcd")),
    },
)

In [None]:
ds.to_zarr("path/to/directory.zarr")  # The .zarr suffix is optional; it is just a reminder that a Zarr store lives there.

__IMPORTANT__: Xarray can’t open just any zarr dataset, because xarray requires special metadata (attributes) describing the dataset dimensions and coordinates. 

If a zarr store is already present at that path, an error will be raised, preventing it from being overwritten. To override this behavior and overwrite an existing store, add mode='w' when invoking to_zarr().

In [None]:
ds.to_zarr("path/to/directory.zarr", mode="w")

To read back a Zarr dataset that has been created this way, we use the `open_zarr()` function:

In [None]:
ds_zarr = xr.open_zarr("path/to/directory.zarr")

Xarray supports several ways of incrementally writing variables to a Zarr store. These options are useful for scenarios when it is infeasible or undesirable to write your entire dataset at once.

1. Use mode='a' to add or overwrite entire variables,
2. Use append_dim to resize and append to existing variables, and
3. Use region to write to limited regions of existing arrays.

In [None]:
ds1 = xr.Dataset(
    {"foo": (("x", "y", "t"), np.random.rand(4, 5, 2))},
    coords={
        "x": [10, 20, 30, 40],
        "y": [1, 2, 3, 4, 5],
        "t": pd.date_range("2001-01-01", periods=2),
    },
)

In [None]:
ds1.to_zarr("path/to/directory.zarr", mode="w")  # overwrite the store written earlier

In [None]:
ds2 = xr.Dataset(
    {"foo": (("x", "y", "t"), np.random.rand(4, 5, 2))},
    coords={
        "x": [10, 20, 30, 40],
        "y": [1, 2, 3, 4, 5],
        "t": pd.date_range("2001-01-03", periods=2),
    },
)

In [None]:
ds2.to_zarr("path/to/directory.zarr", append_dim="t")