# <div align=center><b>Working with Zarr and Xarray</b></div>
Most Pythonistas are familiar with Pandas and NumPy (and maybe Torch) for handling their data, but may be less familiar with Zarr and Xarray. This tutorial highlights what you need to know about Zarr and Xarray to work with SeqData. More comprehensive tutorials can be found at ...

## Why use Xarray?

Genomics data is multidimensional and complex. Pandas is great for 2D data and NumPy can handle n-dimensional arrays, but Xarray is specifically designed for n-dimensional data with labeled dimensions and coordinates. We believe this leads to a more intuitive, more concise, and less error-prone developer experience. Xarray is also built with a Pythonic API very similar to those of Pandas and NumPy, and it converts easily to and from both libraries.

In [2]:
import numpy as np
import pandas as pd
import xarray as xr
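
As a rough illustration of converting between the three libraries, the sketch below builds a small labeled array and moves it to NumPy and Pandas and back. The `gene` and `sample` labels and the values are made up for this example and are not part of SeqData.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative values only; "gene" and "sample" are hypothetical labels.
arr = xr.DataArray(
    np.arange(6).reshape(2, 3),
    dims=("gene", "sample"),
    coords={"gene": ["g1", "g2"], "sample": ["s1", "s2", "s3"]},
    name="counts",
)

as_numpy = arr.values        # plain numpy.ndarray; the labels are dropped
as_series = arr.to_series()  # pandas.Series indexed by a (gene, sample) MultiIndex
as_frame = arr.to_pandas()   # a 2D DataArray becomes a pandas.DataFrame

# Round trip: a MultiIndexed Series converts back to a labeled array.
round_trip = as_series.to_xarray()
```

The NumPy view loses the dimension names and coordinates, while the Pandas conversions carry them along as index and column labels.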

Xarray has two core data structures, both fundamentally N-dimensional. The first is the `DataArray`, a labeled, N-dimensional array. It is an N-D generalization of a `pandas.Series`. A `DataArray` works very similarly to a NumPy `ndarray`:

In [8]:
data = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 20]})
data
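
To make the comparison with NumPy concrete, here is a minimal sketch of a few common operations on a `DataArray`. It uses fixed values instead of random ones so the results are easy to follow; the variable names are chosen for this example only.

```python
import numpy as np
import xarray as xr

# Same shape and labels as the array above, with deterministic values.
data = xr.DataArray(
    np.arange(6.0).reshape(2, 3), dims=("x", "y"), coords={"x": [10, 20]}
)

doubled = data * 2              # elementwise arithmetic, as with ndarrays
first_row = data[0]             # positional indexing; the result is still a DataArray
by_label = data.sel(x=20)       # label-based selection on the "x" coordinate
by_position = data.isel(y=0)    # integer position along the "y" dimension
row_means = data.mean(dim="y")  # reduce over a named dimension instead of axis=1
```

Referring to dimensions by name (`dim="y"`) rather than by axis number is the main practical difference from plain NumPy.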

The second is the `Dataset`, a multi-dimensional, in-memory array database. It is a dict-like container of `DataArray` objects aligned along any number of shared dimensions, and serves a similar purpose in Xarray to the `pandas.DataFrame`.

In [9]:
ds = xr.Dataset(dict(foo=data, bar=("x", [1, 2]), baz=np.pi))
ds
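
The dict-like behaviour can be sketched as follows. This rebuilds an equivalent dataset with placeholder zeros so the block runs on its own; the helper variable names are illustrative.

```python
import numpy as np
import xarray as xr

data = xr.DataArray(np.zeros((2, 3)), dims=("x", "y"), coords={"x": [10, 20]})
ds = xr.Dataset(dict(foo=data, bar=("x", [1, 2]), baz=np.pi))

names = list(ds.data_vars)  # variable names, in insertion order
has_foo = "foo" in ds       # membership test, as with a dict
# Each variable may use a different subset of the shared dimensions.
dims_by_var = {name: var.dims for name, var in ds.data_vars.items()}
sizes = dict(ds.sizes)      # length of each shared dimension
```

Here `foo` is 2D, `bar` is 1D along `x`, and `baz` is a scalar with no dimensions, yet all three live in the same container.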



The power of the dataset over a plain dictionary is that, in addition to pulling out arrays by name, it is possible to select or combine data along a dimension across all arrays simultaneously. Like a DataFrame, datasets facilitate array operations with heterogeneous data – the difference is that the arrays in a dataset can have not only different data types, but also different numbers of dimensions.



In [10]:
ds["foo"]
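
Beyond pulling out one array by name, a selection along a dimension applies to every array in the dataset that has that dimension. The sketch below shows this with `sel` and `isel`, again rebuilding the dataset with fixed values so it is self-contained.

```python
import numpy as np
import xarray as xr

data = xr.DataArray(
    np.arange(6.0).reshape(2, 3), dims=("x", "y"), coords={"x": [10, 20]}
)
ds = xr.Dataset(dict(foo=data, bar=("x", [1, 2]), baz=np.pi))

# Select x == 10 in every variable that has an "x" dimension.
subset = ds.sel(x=10)
foo_at_10 = subset["foo"]  # now 1D along "y"
bar_at_10 = subset["bar"]  # now a scalar
baz_same = subset["baz"]   # has no "x" dimension, so it is unchanged

# Positional slicing works the same way across all variables.
first_two_y = ds.isel(y=slice(0, 2))
```

Variables without the selected dimension, such as `baz` here, pass through untouched.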

Dot operators
