# Data Structures

To keep this documentation generic we typically use dimensions `x` or `y`, but this should *not* be seen as a recommendation to use these labels for anything but actual positions or offsets in space.

## Variable

### Basics

[scipp.Variable](../generated/scipp.Variable.rst#scipp.Variable) is a labeled multi-dimensional array.
A variable can be constructed using:

- `values`: a multi-dimensional array of values, e.g., a [numpy.ndarray](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html#numpy.ndarray)
- `variances`: an (optional) multi-dimensional array of variances for the array values
- `dims`: a list of dimension labels (strings) for each axis of the array
- `unit`: an (optional) physical unit of the values in the array

Note that variables, unlike [DataArray](data-structures.ipynb#DataArray) and its eponym [xarray.DataArray](https://xarray.pydata.org/en/stable/generated/xarray.DataArray.rst#xarray.DataArray), do *not* have coordinate arrays.

In [1]:
import numpy as np
import scipp as sc

In [2]:
var = sc.Variable(values=np.random.rand(2, 4), dims=['x', 'y'])

In [3]:
sc.show(var)

In [4]:
var

In [5]:
var.unit

dimensionless

In [6]:
var.values

array([[0.86092829, 0.51027421, 0.24520177, 0.73026643],
       [0.58194138, 0.12089829, 0.08035668, 0.29124704]])

In [7]:
try:
    var.variances
except RuntimeError:
    print('No variances specified, so they do not exist.')

Variances must have the same shape as values, and units are specified using the [scipp.units](../python-reference/units.rst) module:

In [8]:
var = sc.Variable(values=np.random.rand(2, 4),
                  variances=np.random.rand(2, 4),
                  dims=['x', 'y'],
                  unit=sc.units.m/sc.units.s)
sc.show(var)

In [9]:
var

In [10]:
var.variances

array([[0.80408666, 0.1926496 , 0.6713678 , 0.53265249],
       [0.27801794, 0.68528131, 0.18466593, 0.65287974]])

### 0-D variables (scalars)

A 0-dimensional variable contains a single value (and an optional variance).
The most convenient way to create a scalar variable is by multiplying a value by a unit:

In [11]:
scalar = 1.2 * sc.units.m
sc.show(scalar)
scalar

For convenience, singular versions of the `values` and `variances` properties are provided:

In [12]:
print(scalar.value)
print(scalar.variance)

1.2
None


Note that `value` and `variance` include a check ensuring that the data is 0-D.
Using them with, e.g., a 1-D variable with dimension extent 1 will raise an exception.
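
The 0-D check can be pictured with a small plain-Python model. This is only a sketch and not scipp's implementation: the class `TinyVariable` and its fields are hypothetical, and a generic `RuntimeError` stands in for whatever exception scipp actually raises.

```python
from math import prod


class TinyVariable:
    """Hypothetical stand-in for a variable: dimension labels, shape, flat values."""

    def __init__(self, dims, shape, values):
        if len(dims) != len(shape):
            raise ValueError("need exactly one dimension label per axis")
        if prod(shape) != len(values):
            raise ValueError("number of values does not match the shape")
        self.dims = list(dims)
        self.shape = list(shape)
        self.values = list(values)

    @property
    def value(self):
        # The check looks at the number of dimensions, not the number of elements.
        if self.dims:
            raise RuntimeError("value is only available for 0-D variables")
        return self.values[0]


scalar = TinyVariable(dims=[], shape=[], values=[1.2])
one_element = TinyVariable(dims=['x'], shape=[1], values=[1.2])

print(scalar.value)  # 0-D: returns the single value
try:
    one_element.value
except RuntimeError as err:
    print(err)  # 1-D with extent 1: rejected although it holds one element
```

In this model the decision depends only on whether `dims` is empty, which is why a single-element 1-D variable is still rejected.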

Creating scalar variables with variances or with custom `dtype` is possible using the constructor:

In [13]:
var_0d = sc.Variable(variances=True, dtype=sc.dtype.float32, unit=sc.units.kg)
var_0d

In [14]:
var_0d.value = 2.3
var_0d.variance

0.0

An exception is raised from the `value` and `variance` properties if the variable is not 0-dimensional.
Note that a variable with one or more dimension extent(s) of 1 contains just a single value as well, but the `value` property will nevertheless raise an exception.

### Event data

[Variable](../generated/scipp.Variable.rst#scipp.Variable) also supports event data stored as event lists.
In this case it is currently not possible to set data directly in the constructor.
Instead we create it by specifying a shape and a `dtype`:

In [15]:
var = sc.Variable(dims=['x'],
                  shape=[4],
                  variances=True,
                  unit=sc.units.kg,
                  dtype=sc.dtype.event_list_float64)
var

In [16]:
var.shape # The event list "dimension" is not part of the shape

[4]

In [17]:
len(var.values[0]) # Initially every event list is empty

0
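
One way to picture why the event list "dimension" is not part of the shape is a plain-Python model in which every element of the variable is its own list. This sketch does not use scipp, and it assumes that the event lists of different elements may have different lengths.

```python
# Hypothetical model of a 1-D variable with extent 4 holding event lists.
shape = [4]
event_lists = [[] for _ in range(shape[0])]  # one independent, empty list per element

# Filling one element changes the length of that list but not the shape.
event_lists[1].extend([0.5, 1.5, 2.5])

print(shape)                                    # [4]
print([len(events) for events in event_lists])  # [0, 3, 0, 0]
```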

For more details see [Event data](event-data.rst).

## DataArray

### Basics

[scipp.DataArray](../generated/scipp.DataArray.rst#scipp.DataArray) is a labeled array with associated coordinates.
A data array is essentially a [Variable](../generated/scipp.Variable.rst#scipp.Variable) object with attached dicts of coords, masks, and attrs.

A data array has the following key properties:

- `data`: the variable holding the array data.
- `coords`: a dict-like container of coordinates (both dimension and non-dimension) for the array, accessed using a dimension label as dict key.
- `masks`: a dict-like container of masks for the array, accessed using a string as dict key.
- `attrs`: a dict-like container of attributes for the array, accessed using a string as dict key.

Note that `coords` in scipp holds both what xarray calls dimension coordinates and what it calls non-dimension coordinates; the latter are sometimes referred to as labels.
See also the [xarray documentation](http://xarray.pydata.org/en/stable/data-structures.html#coordinates).

The key distinction between `coords` and `attrs` is that the former are required to match in operations between multiple datasets whereas the latter are not.
`masks` allows for storing boolean-valued masks alongside data.
All four have items that are internally a [Variable](../generated/scipp.Variable.rst#scipp.Variable), i.e., they have a physical unit and optionally variances.
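
The distinction between `coords` and `attrs` in operations can be illustrated with a plain-Python model. This is a sketch under simplifying assumptions, not scipp code: data arrays are reduced to dicts, `add` is a hypothetical helper, and a generic `ValueError` stands in for the exception scipp would raise.

```python
def add(a, b):
    """Add two hypothetical data arrays given as dicts with 'data', 'coords', 'attrs'."""
    # Coords must match exactly; otherwise the operation is refused.
    if a['coords'] != b['coords']:
        raise ValueError("coordinate mismatch")
    # Attrs are not compared. How they propagate to the result is left out here.
    return {'data': [x + y for x, y in zip(a['data'], b['data'])],
            'coords': dict(a['coords'])}


a = {'data': [1.0, 2.0], 'coords': {'x': [0.0, 1.0]}, 'attrs': {'run': 1}}
b = {'data': [10.0, 20.0], 'coords': {'x': [0.0, 1.0]}, 'attrs': {'run': 2}}
c = {'data': [10.0, 20.0], 'coords': {'x': [0.0, 9.0]}, 'attrs': {'run': 1}}

print(add(a, b)['data'])  # attrs differ, coords match: [11.0, 22.0]
try:
    add(a, c)
except ValueError as err:
    print(err)            # coords differ: coordinate mismatch
```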

In [18]:
d = sc.DataArray(
    data = sc.Variable(dims=['y', 'x'], values=np.random.rand(2, 3)),
    coords={
        'y': sc.Variable(['y'], values=np.arange(2.0), unit=sc.units.m),
        'x': sc.Variable(['x'], values=np.arange(3.0), unit=sc.units.m),
        'aux': sc.Variable(['x'], values=np.random.rand(3))})
sc.show(d)

Note how the `'aux'` coordinate (sometimes referred to as `labels`) is essentially a secondary coordinate for the x dimension.
The dict-like `coords`, `masks`, and `attrs` properties give access to the respective underlying variables:

In [19]:
d.coords['x']

In [20]:
d.coords['aux']

Just like `coords`, the `masks` and `attrs` properties also require a string as a key.

Further details about data arrays are implicitly discussed in the next section, which covers datasets, since each item in a dataset behaves equivalently to a data array.

### Distinction between dimension coords and non-dimension coords (=labels)

It is important to highlight that for practical purposes (such as matching in operations) **dimension coords and non-dimension coords are handled equivalently**.
Essentially:

- **Non-dimension coordinates are coordinates**.
- The only difference is that non-dimension coordinates provide a way to "label" a dimension of our data with some additional information that can prove very useful in many cases.

- This also implies that there is at most one dimension coord for each dimension, but there can be multiple non-dimension coords (labels).
- In the special case of non-dimension coords that have more than 1 dimension, they are considered to be labels for their inner dimension.
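
As a rough illustration of the rules above, the following plain-Python sketch classifies coordinates from nothing but their names and dimension labels. It does not use scipp: the function `classify_coords` and the dict layout are hypothetical, and treating the last entry of `dims` as the inner dimension is an assumption of this sketch.

```python
def classify_coords(coords):
    """Split coords into dimension coords and non-dimension coords (labels).

    `coords` maps a coord name to the list of its dimension labels.
    Returns the sorted names of the dimension coords and a dict mapping
    each dimension to the names of the labels attached to it.
    """
    dimension_coords = []
    labels = {}
    for name, dims in coords.items():
        if name in dims:
            # Keyed by a dimension label: the one dimension coord for it.
            dimension_coords.append(name)
        elif dims:
            # Assumption: the inner dimension is the last entry of `dims`.
            labels.setdefault(dims[-1], []).append(name)
    return sorted(dimension_coords), labels


coords = {
    'x': ['x'],
    'y': ['y'],
    'aux': ['x'],          # 1-D label for x
    'aux2d': ['y', 'x'],   # 2-D label, attached to its inner dimension x
    'temperature': ['y'],
}
dimension_coords, labels = classify_coords(coords)
print(dimension_coords)  # ['x', 'y']
print(labels)            # {'x': ['aux', 'aux2d'], 'y': ['temperature']}
```

Because the names are dict keys, there can be at most one dimension coord per dimension, while any number of labels may be attached to the same dimension.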

## Dataset

[scipp.Dataset](../generated/scipp.Dataset.rst#scipp.Dataset) is a dict-like container of data arrays.
Individual items of a dataset ("data arrays") are accessed using a string as a dict key.

In a dataset the coordinates of the sub-arrays are enforced to be *aligned*.
That is, a dataset is not actually just a dict of data arrays.
Instead, the individual arrays share coordinates, labels, and attributes.
It is therefore not possible to combine arbitrary data arrays into a dataset.
If, e.g., the extents in a certain dimension mismatch, or if coordinate/label values mismatch, insertion of the mismatching data array will fail.
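
A rough plain-Python model of this alignment rule is shown below. It is only a sketch and does not use scipp: `TinyDataset` and its `insert` method are hypothetical names, coordinates are reduced to plain lists of values, and a generic `ValueError` stands in for whatever exception scipp raises.

```python
class TinyDataset:
    """Hypothetical model: items share one set of coords and dimension extents."""

    def __init__(self):
        self.coords = {}   # coord name -> list of coord values
        self.extents = {}  # dimension label -> extent
        self.items = {}    # item name -> {dimension label: extent}

    def insert(self, name, shape, coords):
        # `shape` maps each dimension label of the new item to its extent.
        for dim, extent in shape.items():
            if self.extents.get(dim, extent) != extent:
                raise ValueError(f"extent mismatch in dimension {dim!r}")
        for key, values in coords.items():
            if key in self.coords and self.coords[key] != values:
                raise ValueError(f"coordinate mismatch for {key!r}")
        # Only modify the dataset after all checks have passed.
        self.extents.update(shape)
        self.coords.update(coords)
        self.items[name] = dict(shape)


ds = TinyDataset()
ds.insert('a', {'x': 2, 'y': 3}, {'x': [0.0, 1.0]})
ds.insert('b', {'x': 2}, {'x': [0.0, 1.0]})      # aligned: accepted
try:
    ds.insert('c', {'x': 2}, {'x': [0.0, 5.0]})  # coord values differ: rejected
except ValueError as err:
    print(err)
print(sorted(ds.items))
```

The checks run before any state is changed, so a rejected insertion leaves the model dataset untouched.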

Typically a dataset is not created from individual data arrays.
Instead we may provide a dict of variables (the data of the items) and a dict of coords:

In [21]:
d = sc.Dataset(
            {'a': sc.Variable(dims=['x', 'y'], values=np.random.rand(2, 3)),
             'b': sc.Variable(dims=['x'], values=np.random.rand(2)),
             'c': sc.Variable(1.0)},
             coords={
                 'x': sc.Variable(['x'], values=np.arange(2.0), unit=sc.units.m),
                 'y': sc.Variable(['y'], values=np.arange(3.0), unit=sc.units.m),
                 'aux': sc.Variable(['y'], values=np.random.rand(3))})
sc.show(d)

In [22]:
d

In [23]:
d.coords['x'].values

array([0., 1.])

The name of a data item serves as a dict key.
Item access returns a view (`DataArrayView`) onto the data in the dataset and its corresponding coordinates, i.e., no copy is made.
Apart from that it behaves exactly like `DataArray`.

In [24]:
sc.show(d['a'])
d['a']

Each data item is linked to its corresponding coordinates, labels, and attributes.
These are accessed using the `coords` and `attrs` properties, in the same way as for `Dataset` itself.
The variable holding the data of the dataset item is accessible via the `data` property:

In [25]:
d['a'].data

For convenience, properties of the data variable are also properties of the data item:

In [26]:
d['a'].values

array([[0.55912001, 0.20432581, 0.698263  ],
       [0.33009462, 0.44695601, 0.85693451]])

In [27]:
d['a'].variances

In [28]:
d['a'].unit

dimensionless

Coordinates and attributes of a data item include only those that are relevant to the item's dimensions; all others are hidden.
For example, when accessing `'b'`, which does not depend on the `'y'` dimension, the coord for `'y'` as well as the `'aux'` labels are not part of the item's `coords`:

In [29]:
sc.show(d['b'])

Similarly, when accessing a 0-dimensional data item, it will have no coordinates or labels:

In [30]:
sc.show(d['c'])
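
The hiding rule for both cases can be modeled in a few lines of plain Python. This sketch does not use scipp: `visible_coords` is a hypothetical helper, and it assumes that a coord is relevant exactly when all of its dimensions are also dimensions of the item.

```python
def visible_coords(item_dims, coords):
    """Return the names of the coords whose dimensions all belong to the item.

    `coords` maps a coord name to the list of its dimension labels.
    """
    return [name for name, dims in coords.items()
            if set(dims) <= set(item_dims)]


# Dimension labels of the coords in the dataset `d` constructed above.
coords = {'x': ['x'], 'y': ['y'], 'aux': ['y']}

print(visible_coords(['x', 'y'], coords))  # item 'a': ['x', 'y', 'aux']
print(visible_coords(['x'], coords))       # item 'b': ['x']
print(visible_coords([], coords))          # item 'c': []
```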

All variables in a dataset must have consistent dimensions.
Thanks to labeled dimensions, transposed data is supported:

In [31]:
d['d'] = sc.Variable(dims=['y', 'x'], values=np.random.rand(3, 2))
sc.show(d)
d

The usual `dict`-like methods are available for `Dataset`:

In [32]:
for name in d:
    print(name)

d
c
a
b


In [33]:
'a' in d

True

In [34]:
'e' in d

False