# Dataset in a Nutshell
This notebook demonstrates key functionality and usage of the `dataset` library.

## Getting started
### What is a `Dataset`?
There are two basic analogies to aid in thinking about a `Dataset`:
1. As a `dict` of `numpy.ndarray`s, with the addition of named dimensions and units.
2. As a table.
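
Both analogies can be made concrete with a toy structure in plain Python. This is only a sketch of the idea, not how the library stores its data: `ToyVariable` and its fields are hypothetical names, and plain lists stand in for `numpy.ndarray`s.

```python
from dataclasses import dataclass


@dataclass
class ToyVariable:
    """Hypothetical stand-in for one dict entry: values plus metadata."""
    dims: tuple    # names of the dimensions, e.g. ("row",)
    values: list   # plain list standing in for a numpy.ndarray
    unit: str      # unit label, e.g. "m"


# Analogy 1: a dict mapping names to arrays that carry dimension names and units.
toy_dataset = {
    "alice": ToyVariable(dims=("row",), values=[1.0, 1.1, 1.2], unit="m"),
    "bob": ToyVariable(dims=("row",), values=[2.0, 2.1, 2.2], unit="m"),
}

# Analogy 2: the same data read as a table, one column per name.
header = [f"{name} [{var.unit}]" for name, var in toy_dataset.items()]
rows = list(zip(*(var.values for var in toy_dataset.values())))

print(header)
for row in rows:
    print(row)
```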

### Creating a dataset

In [1]:
import numpy as np

In [2]:
import sys
sys.path.append('/home/nvaytet/work/code/scipp/install/')
import scipp as sc

In [3]:
d = sc.Dataset()
d

<scipp.Dataset>
Dimensions: {}
Coordinates:
Labels:
Data:
Attributes:


## Using `Dataset` as a table
A dataset is not just analogous to a table; it can also be used as one.
This section demonstrates the basic ways of creating datasets and interacting with them.

In [4]:
d.set_coord(sc.Dim.Row, sc.Variable([sc.Dim.Row], values=np.arange(3)))
d["alice"] = sc.Variable([sc.Dim.Row], values=[1.0,1.1,1.2], variances=[0.01,0.01,0.02], unit=sc.units.m)
d

<scipp.Dataset>
Dimensions: {{Dim.Row, 3}}
Coordinates:
    Dim.Row                   int64     [dimensionless]  (Dim.Row)  [0, 1, 2]
Labels:
Data:
    alice                     double    [m]              (Dim.Row)  [1.000000, 1.100000, 1.200000]  [0.010000, 0.010000, 0.020000]
Attributes:


The datatype (`dtype`) is derived from the provided data, so passing `np.arange(3)` will yield a variable (column) containing 64-bit integers.

Datasets with up to one dimension can be displayed as a simple table:

In [5]:
sc.table(d)

Coord: Row,alice [m],alice [m]
Values,Values,Variances
0,1.0,0.01
1,1.1,0.01
2,1.2,0.02


A variable (column) in a dataset (table) is identified by its name (`"alice"`). A 1D variable will have a coordinate (`Row`), and holds `Values` and optionally `Variances` which are grouped together inside a common structure.

Each variable (column) comes with a physical unit attached to it, which we should set up correctly as early as possible.

In [6]:
d["alice"].unit = sc.units.m
sc.table(d)

Coord: Row,alice [m],alice [m]
Values,Values,Variances
0,1.0,0.01
1,1.1,0.01
2,1.2,0.02


The unit can also be set when constructing the `Variable`, using the `unit` keyword argument:

In [7]:
d["alice"] = sc.Variable([sc.Dim.Row], values=[1.0,1.1,1.2], variances=[0.01,0.01,0.02], unit=sc.units.m)
sc.table(d)

Coord: Row,alice [m],alice [m]
Values,Values,Variances
0,1.0,0.01
1,1.1,0.01
2,1.2,0.02


Units and uncertainties are handled automatically in operations.

In [8]:
d *= d
sc.table(d)

Coord: Row,alice [m^2],alice [m^2]
Values,Values,Variances
0,1.0,0.02
1,1.21,0.024
2,1.44,0.058
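
The variances in the table above are consistent with first-order error propagation for a product of two operands treated as uncorrelated. The following sketch reproduces them in plain Python; `multiply_with_variances` is a hypothetical helper written for illustration, not a library function, and the uncorrelated-operands assumption is inferred from the numbers rather than stated by the library.

```python
def multiply_with_variances(a, var_a, b, var_b):
    """First-order propagation for a product of uncorrelated operands."""
    value = a * b
    variance = var_a * b**2 + var_b * a**2
    return value, variance


values = [1.0, 1.1, 1.2]
variances = [0.01, 0.01, 0.02]

# Multiply each entry by itself, as `d *= d` does for the column "alice".
squared = [multiply_with_variances(v, s, v, s) for v, s in zip(values, variances)]
for value, variance in squared:
    print(f"{value:.4f} {variance:.4f}")
```

For these inputs the sketch gives variances of 0.02, 0.0242 and 0.0576, which match the table once rounded.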


Operations between columns are supported by indexing into a dataset with a name.

In [9]:
d["bob"] = d["alice"]
d

<scipp.Dataset>
Dimensions: {{Dim.Row, 3}}
Coordinates:
    Dim.Row                   int64     [dimensionless]  (Dim.Row)  [0, 1, 2]
Labels:
Data:
    alice                     double    [m^2]            (Dim.Row)  [1.000000, 1.210000, 1.440000]  [0.020000, 0.024200, 0.057600]
    bob                       double    [m^2]            (Dim.Row)  [1.000000, 1.210000, 1.440000]  [0.020000, 0.024200, 0.057600]
Attributes:


In [10]:
sc.table(d)

Coord: Row,alice [m^2],alice [m^2],bob [m^2],bob [m^2]
Values,Values,Variances,Values,Variances
0,1.0,0.02,1.0,0.02
1,1.21,0.024,1.21,0.024
2,1.44,0.058,1.44,0.058


In [11]:
d["bob"] += d["alice"]
sc.table(d)

Coord: Row,alice [m^2],alice [m^2],bob [m^2],bob [m^2]
Values,Values,Variances,Values,Variances
0,1.0,0.02,2.0,0.04
1,1.21,0.024,2.42,0.048
2,1.44,0.058,2.88,0.115


In [12]:
sc.plot(d)

[Interactive plot of `d`. The first trace carries error bars of 0.141, 0.156 and 0.24.]

Operations between rows are supported by indexing into a dataset with a dimension label and an index.

Slicing dimensions behaves similarly to `numpy`:
If a single index is given, the dimension is dropped; if a range is given, the dimension is kept.
For a `Dataset`, the corresponding coordinates are dropped in the former case and preserved in the latter.
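
The rule can be sketched with a toy table in plain Python. The dict layout and the `slice_rows` helper are hypothetical and only mimic the behaviour described here; they are not the library's implementation.

```python
def slice_rows(table, key):
    """Slice a toy table along its only dimension, "row".

    `table` is a hypothetical dict with a "coord" list and a "data" dict of
    equally long lists. An int key drops the dimension and its coordinate;
    a slice key keeps both.
    """
    if isinstance(key, int):
        # Single index: the result has no row dimension and no row coordinate.
        return {"data": {name: col[key] for name, col in table["data"].items()}}
    # Range: the dimension survives, so the matching coordinate values do too.
    return {
        "coord": table["coord"][key],
        "data": {name: col[key] for name, col in table["data"].items()},
    }


toy = {"coord": [0, 1, 2], "data": {"alice": [1.0, 1.21, 1.44]}}

single = slice_rows(toy, 0)               # no "coord" entry
size_one = slice_rows(toy, slice(0, 1))   # "coord" is [0]
print(single)
print(size_one)
```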

In [13]:
a = np.arange(8)

In [14]:
a[4]

4

In [15]:
a[4:5]

array([4])

In [16]:
d[sc.Dim.Row, 1] += d[sc.Dim.Row, 2]
sc.table(d)

Coord: Row,alice [m^2],alice [m^2],bob [m^2],bob [m^2]
Values,Values,Variances,Values,Variances
0,1.0,0.02,2.0,0.04
1,2.65,0.082,5.3,0.164
2,1.44,0.058,2.88,0.115


Note the key advantage over `numpy` or `MATLAB`:
We specify the index dimension, so we always know which dimension we are slicing.
The advantage is not so apparent in 1D, but will become clear once we move to higher-dimensional data.
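
A small two-dimensional sketch in plain Python hints at why naming the dimension helps. `slice_dim` is a hypothetical helper operating on nested lists; the point is only that the caller names the dimension instead of remembering its axis position.

```python
def slice_dim(values, dims, dim, index):
    """Select `index` along the dimension called `dim` of a 2D nested list.

    `dims` names the axes of `values` in storage order, e.g. ("y", "x").
    Returns the 1D result together with the names of the remaining dimensions.
    """
    axis = dims.index(dim)
    if axis == 0:
        selected = list(values[index])                 # pick one outer entry
    else:
        selected = [inner[index] for inner in values]  # pick from every inner list
    remaining = tuple(d for d in dims if d != dim)
    return selected, remaining


image = [[1, 2, 3],
         [4, 5, 6]]
transposed = [list(col) for col in zip(*image)]

# The same request gives the same answer regardless of storage order.
print(slice_dim(image, ("y", "x"), "x", 0))        # ([1, 4], ('y',))
print(slice_dim(transposed, ("x", "y"), "x", 0))   # ([1, 4], ('y',))
```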

### Summary
There are a number of ways to select and operate on a single row, a range of rows, a single variable (column) or multiple variables (columns) of a dataset:

In [17]:
# Single row (dropping corresponding coordinates)
sc.table(d[sc.Dim.Row, 0])
# Size-1 row range (keeping corresponding coordinates)
sc.table(d[sc.Dim.Row, 0:1])
# Range of rows
sc.table(d[sc.Dim.Row, 1:3])
# Single variable
sc.table(d["alice"].data)
# Subset of variables with given name, keeping coordinates
sc.table(d["alice"])
# Subset containing a single (data) variable, in addition to coordinates
# ds.table(d.subset[Data.Value, 'alice'])

alice [m^2],alice [m^2],bob [m^2],bob [m^2]
Values,Variances,Values,Variances
1.0,0.02,2.0,0.04


Coord: Row,alice [m^2],alice [m^2],bob [m^2],bob [m^2]
Values,Values,Variances,Values,Variances
0,1.0,0.02,2.0,0.04


Coord: Row,alice [m^2],alice [m^2],bob [m^2],bob [m^2]
Values,Values,Variances,Values,Variances
1,2.65,0.082,5.3,0.164
2,1.44,0.058,2.88,0.115


alice [m^2],alice [m^2]
Values,Variances
1.0,0.02
2.65,0.082
1.44,0.058


Coord: Row,alice [m^2],alice [m^2]
Values,Values,Variances
0,1.0,0.02
1,2.65,0.082
2,1.44,0.058


### Exercise 1
1. Combining row slicing and "column" slicing, add the last row of the data for Alice to the first row of data for Bob.
2. Using the slice-range notation `a:b`, try adding the last two rows to the first two rows. Why does this fail?

In [18]:
d["bob"][sc.Dim.Row, 0] += d["alice"][sc.Dim.Row, -1]
sc.table(d)

Coord: Row,alice [m^2],alice [m^2],bob [m^2],bob [m^2]
Values,Values,Variances,Values,Variances
0,1.0,0.02,3.44,0.098
1,2.65,0.082,5.3,0.164
2,1.44,0.058,2.88,0.115


If a range is given when slicing, the corresponding coordinate is preserved, and operations between misaligned data are prevented.

In [19]:
try:
    d["bob"][sc.Dim.Row, 0:2] += d["alice"][sc.Dim.Row, 1:3]
except RuntimeError:
    print("Failed as expected!")

Failed as expected!
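
The reason for the failure can be sketched as a coordinate comparison before the operation. The `add_aligned` helper and the dict layout are hypothetical; the sketch only assumes that an operation is refused when the coordinates of the two operands differ.

```python
def add_aligned(lhs, rhs):
    """Add `rhs` to `lhs`, refusing if their coordinates differ.

    Each argument is a hypothetical dict with "coord" and "values" lists.
    """
    if lhs["coord"] != rhs["coord"]:
        raise RuntimeError(f"coordinate mismatch: {lhs['coord']} vs {rhs['coord']}")
    lhs["values"] = [a + b for a, b in zip(lhs["values"], rhs["values"])]
    return lhs


# Rows 0:2 carry coordinates [0, 1]; rows 1:3 carry coordinates [1, 2].
bob_first_two = {"coord": [0, 1], "values": [3.44, 5.3]}
alice_last_two = {"coord": [1, 2], "values": [2.65, 1.44]}

try:
    add_aligned(bob_first_two, alice_last_two)
    outcome = "added"
except RuntimeError as err:
    outcome = f"refused ({err})"
print(outcome)
```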


We can operate on the raw `values` arrays to circumvent the safety catch:

In [20]:
d["bob"][sc.Dim.Row, 0:2].values += d["alice"][sc.Dim.Row, 1:3].values
sc.table(d)

Coord: Row,alice [m^2],alice [m^2],bob [m^2],bob [m^2]
Values,Values,Variances,Values,Variances
0,1.0,0.02,6.09,0.098
1,2.65,0.082,6.74,0.164
2,1.44,0.058,2.88,0.115


but note that the propagation of errors is then not taken into account by the operation as we are simply adding two `numpy` arrays together.
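
To make the difference concrete, the sketch below compares a values-only addition with one that also propagates variances, assuming uncorrelated operands so that variances simply add. `add_with_variances` is a hypothetical helper; the numbers are the rounded entries for row 0 of Bob and row 1 of Alice as displayed before the values-only addition.

```python
def add_with_variances(a, var_a, b, var_b):
    """Sum of uncorrelated operands: values add, and so do variances."""
    return a + b, var_a + var_b


bob_value, bob_variance = 3.44, 0.098
alice_value, alice_variance = 2.65, 0.082

# Full propagation: the variance of the result grows.
propagated = add_with_variances(bob_value, bob_variance, alice_value, alice_variance)

# Values-only addition: the variance is left untouched and is now too small.
values_only = (bob_value + alice_value, bob_variance)

print(f"propagated:  value={propagated[0]:.2f} variance={propagated[1]:.3f}")
print(f"values only: value={values_only[0]:.2f} variance={values_only[1]:.3f}")
```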

We can also imagine ways to explicitly drop coordinates from a subset, e.g., `d['bob'].drop_coords()`, to allow direct operations between subsets. This is currently not supported.

The slicing notation for variables (columns) and rows does not return a copy, but a view object.
This is very similar to how `numpy` operates:

In [21]:
a_slice = a[0:3]
a_slice += 100
a

array([100, 101, 102,   3,   4,   5,   6,   7])
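
The same view-versus-copy distinction can be reproduced with the standard library alone, which may help if the `numpy` behaviour is unfamiliar. This sketch uses `array.array` and `memoryview` as stand-ins; it says nothing about how the library implements its own views.

```python
from array import array

buffer = array("q", range(8))   # eight signed 64-bit integers, 0..7

# A memoryview slice is a view: it shares memory with `buffer`.
view = memoryview(buffer)[0:3]
for i in range(len(view)):
    view[i] += 100

# A list slice is a copy: changing it leaves the original untouched.
as_list = list(range(8))
copied = as_list[0:3]
copied[0] += 100

print(buffer.tolist())   # first three entries changed through the view
print(as_list)           # unchanged, because the slice was a copy
```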

### Exercise 2

Using the slicing notation, create a new table (or replace the existing dataset `d`) that does not contain the first and last row of `d`.

In [23]:
d2 = d[sc.Dim.Row, 1:-1].copy()

# Or:
# from copy import copy
# d2 = copy(d[sc.Dim.Row, 1:-1])

sc.table(d2)

Coord: Row,alice [m^2],alice [m^2],bob [m^2],bob [m^2]
Values,Values,Variances,Values,Variances
1,2.65,0.082,6.74,0.164


## More advanced operations with tables
In addition to binary operators, basic functions like `concatenate`, `sort`, and `merge` are available.

In [24]:
d = sc.concatenate(d[sc.Dim.Row, 0:3], d[sc.Dim.Row, 1:3], sc.Dim.Row)
d = sc.sort(d, sc.Dim.Row)
eve = sc.Dataset()
eve["eve"] = sc.Variable([sc.Dim.Row], values=np.arange(5).astype(np.float64))
d.merge(eve)
sc.table(d)

TypeError: concatenate(): incompatible function arguments. The following argument types are supported:
    1. (arg0: scipp._scipp.Variable, arg1: scipp._scipp.Variable, arg2: scipp._scipp.Dim) -> scipp._scipp.Variable

Invoked with: <scipp.DatasetProxy>
Dimensions: {{Dim.Row, 3}}
Coordinates:
    Dim.Row                   int64     [dimensionless]  (Dim.Row)  [0, 1, 2]
Labels:
Data:
    alice                     double    [m^2]            (Dim.Row)  [1.000000, 2.650000, 1.440000]  [0.020000, 0.081800, 0.057600]
    bob                       double    [m^2]            (Dim.Row)  [6.090000, 6.740000, 2.880000]  [0.097600, 0.163600, 0.115200]
Attributes:

, <scipp.DatasetProxy>
Dimensions: {{Dim.Row, 2}}
Coordinates:
    Dim.Row                   int64     [dimensionless]  (Dim.Row)  [1, 2]
Labels:
Data:
    alice                     double    [m^2]            (Dim.Row)  [2.650000, 1.440000]  [0.081800, 0.057600]
    bob                       double    [m^2]            (Dim.Row)  [6.740000, 2.880000]  [0.163600, 0.115200]
Attributes:

, Dim.Row

### Exercise 3
Add the sum of the data for `alice` and `bob` as a new variable (column) to the dataset.

In [None]:
# Name-based indexing as used in the cells above; `sc` is the imported module
d["sum"] = d["alice"] + d["bob"]
sc.table(d)

### Interaction with `numpy` and scalars
Variables in a dataset are exposed in a `numpy`-compatible buffer format, so we can directly hand them to `numpy` functions.

In [None]:
# Apply the numpy function to the raw values of the column
d["eve"].values = np.exp(d["eve"].values)
sc.table(d)

Direct access to the `numpy`-like underlying data array is possible using the `values` property:

In [None]:
d["eve"].values

### Exercise 4
1. As above for `np.exp` applied to the data for Eve, apply a `numpy` function to the data for Alice.
2. What happens to the unit and uncertainties when modifying data with external code such as `numpy`?

In [None]:
# Only the raw values change; unit and variances are left as they were
d["alice"].values = np.sin(d["alice"].values)
sc.table(d)

`numpy` operations are not aware of units and uncertainties. The result is therefore meaningless unless the user has handled units and uncertainties manually.

Corollary: Whenever available, built-in operators and functions should be preferred over the use of `numpy`.
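
The sketch below illustrates what gets lost, using `math.sin` from the standard library in place of `numpy`. Both helpers are hypothetical. The propagating version uses the usual first-order rule, variance of f(x) ≈ f'(x)² times the variance of x, and assumes the input has been made dimensionless beforehand.

```python
import math


def apply_values_only(func, value, variance, unit):
    """What an external function effectively does: transform the value only."""
    return func(value), variance, unit   # variance and unit are now stale


def sin_with_propagation(value, variance):
    """First-order propagation for the sine of a dimensionless value."""
    return math.sin(value), (math.cos(value) ** 2) * variance


stale = apply_values_only(math.sin, 1.0, 0.02, "m^2")
propagated = sin_with_propagation(1.0, 0.02)

print(stale)        # still claims a variance of 0.02 and a unit of m^2
print(propagated)   # variance scaled by cos(1.0)**2
```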

### Exercise 5
1. Try adding a scalar value such as `1.5` to the data for Eve.
2. Try the same for Alice or Bob. Why is it not working?

In [None]:
d["eve"] += 1.5
sc.table(d)

The data for Alice has a unit, so a direct addition with a dimensionless quantity fails:

In [None]:
try:
    d["alice"] += 1.5
except RuntimeError:
    print("Failed as expected!")

We can use `Variable` to provide a scalar quantity with an attached unit:

In [None]:
d["alice"] += sc.Variable(1.5, unit=sc.units.m*sc.units.m)
sc.table(d)
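
The unit check behind this exercise can be sketched with a toy class. `ToyQuantity` is hypothetical and is not the library's `Variable`; it only assumes that a bare number counts as dimensionless and that addition requires matching units.

```python
class ToyQuantity:
    """Hypothetical value-with-unit used only for illustration."""

    def __init__(self, value, unit="dimensionless"):
        self.value = value
        self.unit = unit

    def __iadd__(self, other):
        # Bare Python numbers are treated as dimensionless quantities.
        if not isinstance(other, ToyQuantity):
            other = ToyQuantity(other)
        if other.unit != self.unit:
            raise RuntimeError(f"cannot add [{other.unit}] to [{self.unit}]")
        self.value += other.value
        return self


eve = ToyQuantity(0.0)             # dimensionless, like the data for Eve
alice = ToyQuantity(1.0, "m^2")    # carries a unit, like the data for Alice

eve += 1.5                         # fine: both sides are dimensionless
try:
    alice += 1.5                   # refused: m^2 versus dimensionless
except RuntimeError as err:
    print(err)
alice += ToyQuantity(1.5, "m^2")   # fine: units match
print(eve.value, alice.value)
```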

Continue to [Dataset in a Nutshell - Part 2](demo-part2.ipynb) to see how datasets are used with multi-dimensional data.