# Working with Zarr and Xarray
Most Pythonistas are familiar with Pandas and NumPy (and maybe Torch) for handling their data, and might be less familiar with Zarr and Xarray. This tutorial highlights what you need to know about Zarr and Xarray to work with SeqData. More comprehensive tutorials can be found in the [Xarray](https://docs.xarray.dev/en/stable/) and [Zarr](https://zarr.dev/) documentation.

## Why use Xarray?

Genomics data is multidimensional and complex, and while Pandas is great for 2D data and NumPy can handle n-dimensional arrays, Xarray is specifically designed to handle n-dimensional data with labeled dimensions and coordinates. We believe this leads to a more intuitive, more concise, and less error-prone developer experience. The good thing about Xarray is that it is built with a Pythonic API very similar to Pandas and NumPy, and it converts easily to and from these libraries.

In [1]:
import numpy as np
import pandas as pd
import xarray as xr



Xarray has two core data structures, both fundamentally N-dimensional. The first is the `DataArray`, a labeled, N-dimensional array. A `DataArray` is an N-D generalization of a `pandas.Series` and works very similarly to a NumPy array:

In [2]:
data = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 20]})
data
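To make the NumPy comparison concrete, here is a minimal sketch of a few array-style operations on a `DataArray`. The variable names (`arr`, `doubled`, `row_means`, `raw`) are illustrative and not part of the tutorial, and fixed values replace the random data so the printed results are reproducible.

```python
import numpy as np
import xarray as xr

# Same construction as above, but with fixed values (illustrative name `arr`).
arr = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("x", "y"),
    coords={"x": [10, 20]},
)

# NumPy-style attributes and arithmetic work as expected.
print(arr.shape)  # (2, 3)
print(arr.dims)   # ('x', 'y')
doubled = arr * 2  # elementwise; dimension names and coordinates are kept

# Reductions can name the dimension instead of passing an axis number.
row_means = arr.mean(dim="y")
print(row_means.values)  # [1. 4.]

# The underlying NumPy array is available through `.values`.
raw = arr.values
```

Naming the dimension in `mean(dim="y")` is the main difference from NumPy's `axis=1`: the call keeps working even if the dimension order changes.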

The second is the `Dataset`, a multi-dimensional, in-memory array database. It is a dictionary-like container of `DataArray` objects aligned along any number of shared dimensions, and it serves a similar purpose in Xarray to the `pandas.DataFrame`:

In [3]:
ds = xr.Dataset(dict(foo=data, bar=("x", [1, 2]), baz=np.pi))
ds



The power of the dataset over a plain dictionary is that, in addition to pulling out arrays by name, it is possible to select or combine data along a dimension across all arrays simultaneously. Like a DataFrame, datasets facilitate array operations with heterogeneous data – the difference is that the arrays in a dataset can have not only different data types, but also different numbers of dimensions.



In [4]:
ds["foo"]
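Selecting along a dimension across all arrays at once can be sketched as follows. This is an illustrative example with fixed values; the names `foo`, `example`, and `subset` are placeholders chosen here so the tutorial's own `data` and `ds` are left untouched.

```python
import numpy as np
import xarray as xr

# Illustrative dataset mirroring the one above, with fixed values.
foo = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("x", "y"),
    coords={"x": [10, 20]},
)
example = xr.Dataset(dict(foo=foo, bar=("x", [1, 2]), baz=np.pi))

# Selecting x=10 slices every variable that has an "x" dimension;
# the scalar "baz" has no "x" dimension and is carried along unchanged.
subset = example.sel(x=10)
print(subset["foo"].values)  # [0. 1. 2.]
print(subset["bar"].item())  # 1
print(subset["baz"].item())
```

A plain dictionary of arrays would need one indexing call per entry, each with the correct axis for that array.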

## Working with Xarray

## Indexing Xarray
Adapted from: https://docs.xarray.dev/en/stable/user-guide/indexing.html

Xarray supports four different kinds of indexing, as described below and summarized in this table:

| Dimension lookup | Index lookup | `DataArray` syntax            | `Dataset` syntax             |
|------------------|--------------|-------------------------------|------------------------------|
| Positional       | By integer   | `da[:, 0]`                    | *not available*              |
| Positional       | By label     | `da.loc[:, 'IA']`             | *not available*              |
| By name          | By integer   | `da.isel(space=0)` or <br> `da[dict(space=0)]` | `ds.isel(space=0)` or <br> `ds[dict(space=0)]` |
| By name          | By label     | `da.sel(space='IA')` or <br> `da.loc[dict(space='IA')]` | `ds.sel(space='IA')` or <br> `ds.loc[dict(space='IA')]` |

Let's see how indexing works on some dummy data:

In [5]:
da = xr.DataArray(
    np.random.rand(4, 3),
    [
        ("time", pd.date_range("2000-01-01", periods=4)),
        ("space", ["IA", "IL", "IN"]),
    ],
)
da

Indexing a DataArray directly works (mostly) just like it does for numpy arrays, except that the returned object is always another DataArray:

In [6]:
da[:2]

In [7]:
da[0, 0]

Xarray also supports label-based indexing, just like pandas.

In [8]:
da.loc["2000-01-01":"2000-01-02", "IA"]

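With the dimension names, we do not have to rely on dimension order and can use them explicitly to slice data. There are two ways to do this: by integer position with `isel` (or the equivalent dictionary syntax) and by coordinate label with `sel`: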

In [9]:
# index by integer array indices
da.isel(space=0, time=slice(None, 2))

In [10]:
# index by dimension coordinate labels
da.sel(time=slice("2000-01-01", "2000-01-02"))

In [12]:
# index by integer position using dictionary syntax (equivalent to the isel call above)
da[dict(space=0, time=slice(None, 2))]
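As a sanity check on the four indexing styles, the sketch below selects the same two values in each style and confirms the results are identical. The array `demo` and the `by_*` names are illustrative; fixed values are used instead of random data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative array with the same coordinates as `da`, but fixed values.
demo = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    [
        ("time", pd.date_range("2000-01-01", periods=4)),
        ("space", ["IA", "IL", "IN"]),
    ],
)

# First two time steps of the "IA" column, selected four ways.
by_position = demo[:2, 0]
by_label = demo.loc["2000-01-01":"2000-01-02", "IA"]
by_name_int = demo.isel(time=slice(None, 2), space=0)
by_name_label = demo.sel(time=slice("2000-01-01", "2000-01-02"), space="IA")

# All four selections carry the same values, dimensions, and coordinates.
for other in (by_label, by_name_int, by_name_label):
    xr.testing.assert_identical(by_position, other)

print(by_position.values)  # [0. 3.]
```

Note that label-based slices include both endpoints, which is why the slice ending at `"2000-01-02"` matches the integer slice `:2`.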

We can also use these methods to index all variables in a dataset simultaneously, returning a new dataset:

In [13]:
da = xr.DataArray(
    np.random.rand(4, 3),
    [
        ("time", pd.date_range("2000-01-01", periods=4)),
        ("space", ["IA", "IL", "IN"]),
    ],
)

In [14]:
ds = da.to_dataset(name="foo")

In [15]:
ds

In [16]:
ds.isel(space=[0], time=[0])

In [17]:
ds.sel(time="2000-01-01")

Positional indexing on a dataset is not supported because the ordering of dimensions in a dataset is somewhat ambiguous (it can vary between different arrays). However, you can do normal indexing with dimension names:

In [18]:
ds[dict(space=[0], time=[0])]
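The list indexers used above (`space=[0]` rather than `space=0`) matter: a list keeps the dimension with length one, while a scalar drops it. A small sketch, using an illustrative dataset `demo_ds` with fixed values:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative dataset with the same layout as `ds`, but fixed values.
demo_ds = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    [
        ("time", pd.date_range("2000-01-01", periods=4)),
        ("space", ["IA", "IL", "IN"]),
    ],
).to_dataset(name="foo")

kept = demo_ds.isel(space=[0], time=[0])  # list indexers keep both dimensions
dropped = demo_ds.isel(space=0, time=0)   # scalar indexers drop them

print(kept["foo"].shape)     # (1, 1)
print(dropped["foo"].shape)  # ()

# Dictionary indexing on the dataset gives the same result as isel.
same = demo_ds[dict(space=[0], time=[0])].identical(kept)
print(same)  # True
```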

## Converting between Xarray and the NumPy stack

In [19]:
# convert to a pandas Series
series = data.to_series()
series

x   y
10  0    0.623860
    1   -0.393178
    2   -1.052273
20  0    1.107031
    1    0.636440
    2    1.454341
dtype: float64
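Other conversions follow the same pattern. The sketch below goes from a `DataArray` to NumPy and pandas and back again; the names are illustrative, and the array is given a `name` because `to_dataframe` uses it as the column label.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative array with fixed values and a name.
arr = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("x", "y"),
    coords={"x": [10, 20]},
    name="foo",
)

# DataArray -> NumPy array
as_numpy = arr.values

# DataArray -> pandas Series indexed by an (x, y) MultiIndex
as_series = arr.to_series()

# DataArray -> pandas DataFrame with a single column named "foo"
as_frame = arr.to_dataframe()
print(as_frame.shape)  # (6, 1)

# pandas Series -> DataArray
round_trip = as_series.to_xarray()
print(round_trip.dims)  # ('x', 'y')
```

The round trip is not byte-for-byte identical to the original: `y` had no coordinate in `arr`, and it comes back with the default integer labels that pandas assigned.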

## Writing to Zarr stores

Zarr is a Python package that provides an implementation of chunked, compressed, N-dimensional arrays. Zarr can store arrays in a range of ways, including in memory, in files, and in cloud-based object storage such as Amazon S3 and Google Cloud Storage. Xarray’s Zarr backend allows Xarray to leverage these capabilities, including the ability to store and analyze datasets far too large to fit onto disk (particularly in combination with dask).

## Reading from Zarr stores
