# xarray in 5 minutes

## Or how to make your numpy code readable while saving yourself time

xarray adds a metadata layer on top of numpy arrays that makes calculations much easier to write and read. The package has two core data structures, the DataArray and the Dataset, which are, respectively, abstractions of numpy arrays and pandas data frames. We'll focus on DataArrays today.

## DataArray
Anywhere you would use a numpy array, you can use a DataArray and get readability benefits.

In [2]:
import numpy as np
import pandas as pd
import xarray as xr

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"



Imagine having data about how people's income breaks down between consumption and savings in every year of their 55-year lives, for 12 income cohorts.

In [3]:
lifespan = 55
cohorts = 12
items = 2
income = xr.DataArray(np.random.normal(0,1,(lifespan, cohorts, items)), dims=["time","cohort","item"], coords={"item":["consumption","savings"]})
income

<xarray.DataArray (time: 55, cohort: 12, item: 2)>
array([[[ 0.603666,  1.010704],
        [-1.196699, -0.346329],
        ...,
        [-0.106378,  0.194449],
        [-0.660774,  2.129403]],

       [[-2.549262, -0.108783],
        [-0.111364, -0.267842],
        ...,
        [-0.491312, -0.087399],
        [-0.576834, -0.206627]],

       ...,

       [[ 0.457655,  1.048237],
        [-0.439872,  1.021856],
        ...,
        [-0.022706, -0.470221],
        [ 1.321579,  0.224391]],

       [[-0.099386,  0.120204],
        [ 1.357281, -0.140589],
        ...,
        [ 0.741628,  1.503845],
        [-2.244713,  0.119958]]])
Coordinates:
  * item     (item) <U11 'consumption' 'savings'
Dimensions without coordinates: time, cohort

DataArrays add two main concepts on top of numpy arrays:
1. Dimensions -- names for the axes of an array
2. Coordinates -- indices within a dimension that can be used to select parts of an array with pandas-like syntax

Note that we do not have to provide coordinates for every dimension. In the example above, only the `item` dimension has coordinates.
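
To make the distinction concrete, here is a minimal sketch of label-based versus position-based selection. The `time` labels (years of life 1 to 55) and the seeded random generator are assumptions added for illustration; the example above leaves `time` without coordinates.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

income = xr.DataArray(
    rng.normal(0, 1, (55, 12, 2)),
    dims=["time", "cohort", "item"],
    coords={
        "time": np.arange(1, 56),  # hypothetical labels: year of life 1..55
        "item": ["consumption", "savings"],
    },
)

# sel looks up coordinate labels; a label slice includes both endpoints
first_decade_savings = income.sel(item="savings", time=slice(1, 10))

# isel uses integer positions, so it works on "cohort" even without coordinates
first_cohort = income.isel(cohort=0)

print(first_decade_savings.dims, first_decade_savings.shape)
print(first_cohort.dims, first_cohort.shape)
```

Selecting a single label or position drops that dimension from the result, which is why `item` and `cohort` no longer appear in the respective `dims`.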

The major benefit of DataArrays is that they make the intent of calculations clear for the reader. We might, for example, want to calculate the average savings across cohorts in every year. xarray syntax doesn't require you to remember which axis or index corresponds to each concept.

In [4]:
# Label-based: select savings, then average over the cohort dimension
income.sel(item="savings").mean("cohort")
# An alternative -- income.loc[:,:,"savings"].mean("cohort")
# Positional numpy syntax still works with a DataArray, but the reader has to know the axis order
income[:,:,1].mean(axis=1)

<xarray.DataArray (time: 55)>
array([ 2.049015e-01, -1.500204e-01, -6.910400e-03, -2.048993e-01,
       -4.110532e-01,  4.449936e-04, -2.063799e-01,  8.628705e-03,
       -2.700546e-01, -2.767655e-02, -2.646690e-01, -8.861646e-02,
       -4.870388e-01,  4.762607e-02, -1.841712e-01,  3.428996e-02,
        1.720702e-01, -2.170771e-01,  4.241526e-01, -1.404080e-01,
       -1.012706e-01, -1.272997e-01,  1.484155e-01,  2.192039e-01,
       -3.217237e-01, -1.819011e-02,  2.407995e-01,  3.465314e-01,
        2.513461e-01, -6.849824e-02, -1.019591e-02, -1.011416e+00,
        2.737511e-01,  7.776690e-02, -2.895052e-01,  8.893490e-02,
       -2.807038e-01,  1.537548e-01,  5.602983e-01,  5.775268e-01,
        1.264659e-01, -8.163092e-02, -3.517645e-01,  5.411058e-01,
       -1.147727e-01, -2.002858e-01, -1.892822e-01, -6.241851e-01,
       -6.786658e-02, -7.448489e-02, -4.529225e-01, -1.757785e-01,
       -1.382956e-04,  6.142747e-01,  1.576083e-01])
Coordinates:
    item     <U11 'savings'
Dimensions without coordinates: time

<xarray.DataArray (time: 55)>
array([ 2.049015e-01, -1.500204e-01, -6.910400e-03, -2.048993e-01,
       -4.110532e-01,  4.449936e-04, -2.063799e-01,  8.628705e-03,
       -2.700546e-01, -2.767655e-02, -2.646690e-01, -8.861646e-02,
       -4.870388e-01,  4.762607e-02, -1.841712e-01,  3.428996e-02,
        1.720702e-01, -2.170771e-01,  4.241526e-01, -1.404080e-01,
       -1.012706e-01, -1.272997e-01,  1.484155e-01,  2.192039e-01,
       -3.217237e-01, -1.819011e-02,  2.407995e-01,  3.465314e-01,
        2.513461e-01, -6.849824e-02, -1.019591e-02, -1.011416e+00,
        2.737511e-01,  7.776690e-02, -2.895052e-01,  8.893490e-02,
       -2.807038e-01,  1.537548e-01,  5.602983e-01,  5.775268e-01,
        1.264659e-01, -8.163092e-02, -3.517645e-01,  5.411058e-01,
       -1.147727e-01, -2.002858e-01, -1.892822e-01, -6.241851e-01,
       -6.786658e-02, -7.448489e-02, -4.529225e-01, -1.757785e-01,
       -1.382956e-04,  6.142747e-01,  1.576083e-01])
Coordinates:
    item     <U11 'savings'
Dimensions without coordinates: time

pandas-like syntax makes the intention of operations clear. Both lines below compute the mean of each item over all years and cohorts.

In [165]:
income.groupby("item").mean()
income.mean(axis=(0,1))

<xarray.DataArray (item: 2)>
array([-0.013371,  0.006306])
Coordinates:
  * item     (item) <U11 'consumption' 'savings'

<xarray.DataArray (item: 2)>
array([-0.013371,  0.006306])
Coordinates:
  * item     (item) <U11 'consumption' 'savings'
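
The same reduction can also be written by naming the dimensions to average over, which sits between the two styles above. This is a minimal sketch on freshly generated data (the seeded generator is an assumption for reproducibility). As far as I know, the default reduction dimensions of `groupby(...).mean()` have changed between xarray releases, so spelling out the dimensions avoids relying on that default.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

income = xr.DataArray(
    rng.normal(0, 1, (55, 12, 2)),
    dims=["time", "cohort", "item"],
    coords={"item": ["consumption", "savings"]},
)

# Average over named dimensions; only "item" is left in the result
by_name = income.mean(["time", "cohort"])

# The positional equivalent, which requires knowing that time and cohort are axes 0 and 1
by_axis = income.mean(axis=(0, 1))

# Raises an AssertionError if the two results differ
xr.testing.assert_allclose(by_name, by_axis)
print(by_name)
```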

But we still get the broadcasting benefits of numpy arrays... except without worrying about shapes, because xarray matches dimensions by name rather than by position.

In [9]:
# We want to broadcast these arrays but their dimensions are in different orders
price = xr.DataArray(np.random.normal(0,1,(lifespan, cohorts, items)), dims=["time","cohort","item"])
quantity = xr.DataArray(np.random.normal(0,1,(lifespan, items, cohorts)), dims=["time","item", "cohort"])
value = price * quantity # works because we have labeled dimensions

price.shape
quantity.shape
value

value = price.values * quantity.values # fails: numpy matches axes by position, not by name

(55, 12, 2)

(55, 2, 12)

<xarray.DataArray (time: 55, cohort: 12, item: 2)>
array([[[ 3.202512e-04, -8.847250e-01],
        [ 9.487290e-01,  6.831409e-03],
        ...,
        [ 4.582729e-01,  1.484235e-01],
        [ 7.324769e-01, -3.675171e-01]],

       [[ 6.672245e-02,  2.305614e-01],
        [ 6.711112e-02,  1.611101e-01],
        ...,
        [-1.056175e-01, -8.822694e-01],
        [-3.000610e-01,  1.505815e+00]],

       ...,

       [[-8.716453e+00,  1.897045e-01],
        [-1.488331e-01, -1.688854e+00],
        ...,
        [-1.262049e-01,  6.951389e-01],
        [ 4.767304e-01, -1.903129e+00]],

       [[ 9.055891e-01, -2.888493e-01],
        [-7.419514e-02, -2.220506e+00],
        ...,
        [-2.357175e-01, -9.338105e-01],
        [ 3.883903e-01,  1.087652e+00]]])
Dimensions without coordinates: time, cohort, item

ValueError: operands could not be broadcast together with shapes (55,12,2) (55,2,12) 
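
For comparison, here is a minimal sketch of what it takes to make the plain numpy version work: the axes of one operand have to be reordered by hand. The seeded generator is an assumption for reproducibility; the dimension names and shapes match the cell above.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

price = xr.DataArray(rng.normal(0, 1, (55, 12, 2)), dims=["time", "cohort", "item"])
quantity = xr.DataArray(rng.normal(0, 1, (55, 2, 12)), dims=["time", "item", "cohort"])

# xarray lines the dimensions up by name
value = price * quantity

# numpy needs the last two axes of quantity swapped so the shapes agree
value_np = price.values * quantity.values.transpose(0, 2, 1)

# xarray can do the same reordering by name instead of by axis number
quantity_aligned = quantity.transpose(*price.dims)

print(value.dims)
print(np.allclose(value.values, value_np))
```

The hand-written `transpose(0, 2, 1)` only stays correct as long as nobody changes the axis order of either array, which is exactly the bookkeeping that named dimensions remove.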

In [11]:
# One more example of broadcasting 
income - income.mean("cohort")
income.values - income.values.mean(axis=1, keepdims=True) # What is axis 1? Why do we have to use keepdims?

<xarray.DataArray (time: 55, cohort: 12, item: 2)>
array([[[ 0.858597,  0.805802],
        [-0.941768, -0.551231],
        ...,
        [ 0.148553, -0.010453],
        [-0.405843,  1.924502]],

       [[-2.498655,  0.041238],
        [-0.060757, -0.117822],
        ...,
        [-0.440705,  0.062622],
        [-0.526227, -0.056606]],

       ...,

       [[ 0.283607,  0.433963],
        [-0.613919,  0.407582],
        ...,
        [-0.196753, -1.084495],
        [ 1.147532, -0.389884]],

       [[ 0.413355, -0.037405],
        [ 1.870022, -0.298198],
        ...,
        [ 1.254369,  1.346237],
        [-1.731972, -0.03765 ]]])
Coordinates:
  * item     (item) <U11 'consumption' 'savings'
Dimensions without coordinates: time, cohort

array([[[ 0.85859715,  0.80580242],
        [-0.94176843, -0.55123084],
        [ 0.69637537,  1.36375417],
        ...,
        [-1.79869726, -1.32846242],
        [ 0.14855308, -0.01045296],
        [-0.4058435 ,  1.92450172]],

       [[-2.49865504,  0.04123793],
        [-0.06075683, -0.1178218 ],
        [ 1.1177054 , -1.29462919],
        ...,
        [ 1.71076806, -0.58373617],
        [-0.44070506,  0.06262156],
        [-0.52622727, -0.05660631]],

       [[ 0.67168269,  0.09760732],
        [-1.34703648,  1.3188391 ],
        [ 0.28481107, -1.04532128],
        ...,
        [ 0.95378183,  0.75702923],
        [ 0.64234387,  0.08348674],
        [ 1.67267935,  0.28837676]],

       ...,

       [[-1.4334678 ,  0.93918592],
        [ 0.19358499,  0.15119702],
        [-0.6154556 , -0.42888821],
        ...,
        [ 0.55380427,  0.90094897],
        [ 2.0311323 , -0.57459321],
        [ 0.00347043, -1.81666729]],

       [[ 0.2836075 ,  0.43396276],
        [-0.61391902,  0.40
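
The two questions in the comment above can be answered with a small numpy-only sketch. The data here is freshly generated with a seeded generator (an assumption for reproducibility), using the same shape as `income`.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
values = rng.normal(0, 1, (55, 12, 2))  # axes: time, cohort, item

# Axis 1 is the cohort axis. Without keepdims, the mean drops it entirely.
mean_dropped = values.mean(axis=1)               # shape (55, 2)
mean_kept = values.mean(axis=1, keepdims=True)   # shape (55, 1, 2)

# (55, 12, 2) - (55, 1, 2) broadcasts, because the length-1 axis stretches to 12
demeaned_np = values - mean_kept

# (55, 12, 2) - (55, 2) does not broadcast: numpy compares 12 against 55
try:
    values - mean_dropped
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

# The xarray version needs neither the axis number nor keepdims
income = xr.DataArray(values, dims=["time", "cohort", "item"])
demeaned_xr = income - income.mean("cohort")

print(mean_dropped.shape, mean_kept.shape, broadcast_failed)
print(np.allclose(demeaned_xr.values, demeaned_np))
```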

### Dataset

xarray also has the Dataset, an extension of the pandas DataFrame that's meant to be useful for multidimensional data, but I haven't worked with it.
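
For orientation only, here is a minimal sketch of what a Dataset looks like: a dict-like container of named DataArrays that share dimensions. The variable names `price`, `quantity` and `value` are reused from the broadcasting example, and the seeded generator is an assumption for reproducibility. Treat this as a starting point rather than a recommended workflow.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

price = xr.DataArray(rng.normal(0, 1, (55, 12, 2)), dims=["time", "cohort", "item"])
quantity = xr.DataArray(rng.normal(0, 1, (55, 2, 12)), dims=["time", "item", "cohort"])

# Variables in a Dataset may store their dimensions in different orders
ds = xr.Dataset({"price": price, "quantity": quantity})

# New variables are added with dict-style assignment
ds["value"] = ds["price"] * ds["quantity"]

# A reduction applies to every variable that has the named dimension
cohort_means = ds.mean("cohort")

print(ds)
print(cohort_means["value"].dims)
```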