# The NDDataset object

The NDDataset is the main object used by **SpectroChemPy**.

Like numpy ndarrays, NDDatasets can be sliced, sorted and subjected to mathematical operations.

In addition, an NDDataset may have units, and coordinates with units, for all its dimensions. This makes an NDDataset aware of unit compatibility, *e.g.*, for binary operations such as addition or subtraction, or when mathematical functions are applied. In addition to, or in place of, numerical coordinate data, an NDDataset can also have labeled coordinates, where the labels can be different kinds of objects (strings, datetimes, numpy ndarrays, other NDDatasets, etc.).

This offers a lot of flexibility in using NDDatasets which, we hope, will be useful for applications. See the **Tutorials** for more information about such applications.

**Below (and in the next sections), we try to give an almost complete view of the NDDataset features.**

In [1]:
from spectrochempy import *




SpectroChemPy's API - v.0.1a11.dev11+g90996f7.d20190226 © Copyright 2014-2019 - A.Travert & C.Fernandez @ LCS


Multidimensional arrays are defined in SpectroChemPy using the ``NDDataset`` object.

``NDDataset`` objects mostly behave like a `numpy.ndarray`.

However, unlike a raw numpy ndarray, optional properties such
as `mask`, `units`, `axes`, and axes `labels` make them
(hopefully) more appropriate for handling spectroscopic information, one of
the major objectives of the SpectroChemPy package.

Additional metadata can also be added to instances of this class through the
`meta` property.

## Create a ND-Dataset from scratch

In the following example, a minimal 1D dataset is created from a simple list, to which we can add some metadata:

In [2]:
da = NDDataset([1, 2, 3])
da.title = 'intensity'
da.name = 'mydataset'
da.history = 'created from scratch'
da.description = 'Some experimental measurements'
da.units = 'dimensionless'
print_(da)

           id: NDDataset_08d71b64
         name: mydataset
       author: spectrocat@cf-macbookpro.home
      created: 2019-02-26 21:49:39.395645
     modified: 2019-02-26 21:49:39.397400
  description: Some experimental measurements
      history: created from scratch
          DATA
        title: intensity
       values: ...
         [       1        2        3] dimensionless
         size: 3


<div class='alert-info'>

**Note**: In the above code, we use `print_` (with a trailing underscore), not the usual `print` function.
The usual `print` function outputs only a short, one-line summary.

</div>

In [3]:
print(da)

NDDataset: [int64] dimensionless (size: 3)


To get a rich display of the dataset, we can simply type its name on the last line of the cell. This outputs an HTML version of the information string.

In [4]:
da

id: NDDataset_08d71b64
name: mydataset
author: spectrocat@cf-macbookpro.home
created: 2019-02-26 21:49:39.395645
modified: 2019-02-26 21:49:39.397400
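
The difference between the two displays comes from standard Python and Jupyter conventions: `print` relies on `__str__`, while a notebook cell ending with an object looks for a `_repr_html_` method. The sketch below uses a hypothetical `TinyDataset` class (not part of SpectroChemPy) to illustrate the mechanism; how `NDDataset` implements it internally is not shown in this tutorial.

```python
class TinyDataset:
    """Hypothetical stand-in used only to illustrate the two display paths."""

    def __init__(self, values, name):
        self.values = list(values)
        self.name = name

    def __str__(self):
        # Short, one-line summary: this is what print() shows.
        return f"TinyDataset: (size: {len(self.values)})"

    def _repr_html_(self):
        # Rich display hook looked up by Jupyter for the last expression of a cell.
        rows = f"<tr><td>name</td><td>{self.name}</td></tr>"
        rows += f"<tr><td>values</td><td>{self.values}</td></tr>"
        return f"<table>{rows}</table>"


tiny = TinyDataset([1, 2, 3], name="mydataset")
short_text = str(tiny)
html_text = tiny._repr_html_()
print(short_text)
```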


Except for a few additional metadata such as `author`, `created`, etc., there is not much
difference with respect to a conventional `numpy.ndarray`. For example, one
can apply numpy ufuncs directly to an NDDataset or perform basic arithmetic
operations with these objects:

In [5]:
da2 = np.sqrt(da ** 3)
da2

id: NDDataset_0ec4c72e
name: **mydataset
author: spectrocat@cf-macbookpro.home
created: 2019-02-26 21:49:39.395645
modified: 2019-02-26 21:49:49.345662


In [6]:
da3 = da + da / 2.
da3

id: NDDataset_0f254fa4
name: *mydataset
author: spectrocat@cf-macbookpro.home
created: 2019-02-26 21:49:39.395645
modified: 2019-02-26 21:49:49.978775


`da` is a 1D (one-dimensional) dataset.

Some attributes are useful to check this kind of information:

In [7]:
da.shape # the shape of a 1D dataset contains only one dimension size

(3,)

In [8]:
da.ndim # the number of dimensions

1

In [9]:
da.dims # the name of the dimension (it has been automatically attributed)

['x']

**Note**: The names of the dimensions are set automatically. For now, there is no way to change them.

To create an nD NDDataset, we have to provide an nD array-like object to the NDDataset constructor:

In [10]:
arr = np.random.rand(2,4,6) # note that np (the numpy namespace) has been automatically
                            # imported with spectrochempy, so there is no need for the
                            # classical `import numpy as np`
arr

array([[[   0.409,    0.502, ...,    0.669,    0.068],
        [   0.762,    0.272, ...,    0.471,    0.213],
        [   0.459,    0.245, ...,    0.433,    0.019],
        [   0.652,    0.431, ...,    0.951,    0.605]],

       [[   0.880,    0.099, ...,    0.111,    0.344],
        [   0.739,    0.121, ...,    0.816,    0.769],
        [   0.106,    0.541, ...,    0.477,    0.455],
        [   0.046,    0.114, ...,    0.138,    0.284]]])

In [11]:
ds = NDDataset(arr)
ds.title = 'Energy'
ds.name = '3D dataset creation'
ds.history = 'created from scratch'
ds.description = 'Some example'
ds.units = 'eV'
ds

id: NDDataset_11c94fd0
name: 3D dataset creation
author: spectrocat@cf-macbookpro.home
created: 2019-02-26 21:49:54.403430
modified: 2019-02-26 21:49:54.404453


In [12]:
ds.dims # 3 automatic dimension names

['z', 'y', 'x']

In [13]:
ds.ndim

3

In [14]:
ds.shape

(2, 4, 6)
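
Judging only from the two outputs shown above (`['x']` for the 1D dataset and `['z', 'y', 'x']` for the 3D one), the automatic names appear to be assigned from the last dimension backwards. The helper below is a hypothetical illustration of that convention for up to three dimensions; it is not a SpectroChemPy function.

```python
def default_dim_names(ndim):
    """Return dimension names so that the last axis is always 'x' (assumed convention)."""
    names = ["z", "y", "x"]
    if not 0 < ndim <= len(names):
        raise ValueError("this sketch only covers 1 to 3 dimensions")
    return names[-ndim:]


names_1d = default_dim_names(1)
names_3d = default_dim_names(3)
print(names_1d, names_3d)
```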

There are 3 dimensions but no coordinates.

To get the list of all defined coordinates, we can use the `coords` attribute:

In [15]:
ds.coords  # no coordinates, so it returns nothing (None)

In [16]:
ds.x       # the same for coordinate  x, y or z

To add coordinates, one way is to set them one by one:

In [17]:
ds.x = np.arange(6)*.1 # we need a sequence of 6 values for axis x (see shape above)
ds.x.title = 'meters'
ds.coords # now returns the list of coordinates

TypeError: 'tuple' object does not support item assignment

In [18]:
ds.x

size: 6
title: meters
coordinates: [ 0.000 0.100 0.200 0.300 0.400 0.500]


In [19]:
ds.coords[-1]   # ds.x is a faster way to get this information 

size: 6
title: meters
coordinates: [ 0.000 0.100 0.200 0.300 0.400 0.500]


In [20]:
ds.coords('x')  # an alternative way to get a given coordinate

size: 6
title: meters
coordinates: [ 0.000 0.100 0.200 0.300 0.400 0.500]


The two other coordinates are empty:

In [25]:
ds.y

title:
coordinates: Undefined


In [24]:
ds.z

title:
coordinates: Undefined


Programmatically, we can use the attributes `is_empty` or `has_data` to check this:

In [None]:
ds.z.has_data, ds.coords[0].is_empty

It is possible to use labels instead of numerical coordinates. Labels are sequences of objects. The length of each sequence must be equal to the size of the dimension.

In [None]:
from datetime import datetime, timedelta, time
timedelta()

In [None]:
tags = list('abcdef')
start = timedelta(0)
times = [start + timedelta(seconds=x*60) for x in range(6)]
ds.x.labels = (tags, times)
ds.x
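
To make the two label sequences concrete, the following standard-library sketch rebuilds them and checks that each has one entry per point along `x` (six here). The `index_of_label` helper is hypothetical; it only illustrates how a label can be mapped back to a position.

```python
from datetime import timedelta

tags = list("abcdef")
times = [timedelta(seconds=i * 60) for i in range(6)]

# Each label sequence must be as long as the dimension it describes.
assert len(tags) == len(times) == 6


def index_of_label(labels, label):
    """Return the position of `label` in a label sequence (hypothetical helper)."""
    return list(labels).index(label)


pos = index_of_label(tags, "c")
print(pos, times[pos])
```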

## Create a NDDataset : full example

There are many ways to create `NDDataset` objects.

Above we have created a `NDDataset` from a simple list, but it is generally more
convenient to create it from a `numpy.ndarray`.

Below is an example of a 3D dataset created from a ``numpy.ndarray``, to which coordinates for each dimension are added.

Let's first create the 3 one-dimensional coordinates, for which we can define `labels`, `units`, and `masks`.

In [None]:
coord0 = Coord(data=np.linspace(4000., 1000., 100),
               labels=None,
               mask=None,
               units="cm^-1",
               title='wavenumber')

coord1 = Coord(data=np.linspace(0., 60., 60),
               labels=None,
               mask=None,
               units="minutes",
               title='time-on-stream')

coord2 = Coord(data=np.linspace(200., 300., 3),
               labels=['cold', 'normal', 'hot'],
               mask=None,
               units="K",
               title='temperature')

Here is the displayed info for coord1 for instance:

In [None]:
coord1
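
A detail worth noticing in `coord1`: because `np.linspace` includes both end points, 60 points between 0 and 60 are spaced by 60/59 (about 1.017) minutes rather than exactly 1 minute. The plain numpy sketch below checks this; `time_axis` and `one_minute_axis` are illustrative names.

```python
import numpy as np

time_axis = np.linspace(0., 60., 60)   # same values as coord1's data
step = time_axis[1] - time_axis[0]
print(len(time_axis), time_axis[0], time_axis[-1], step)

# 61 points would be needed for an exact 1-minute spacing.
one_minute_axis = np.linspace(0., 60., 61)
```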

Now we create some 3D data (a ``numpy.ndarray``):

In [None]:
nd_data = np.array(
    [np.array([np.sin(coord2.data * 2. * np.pi / 4000.) * np.exp(-y / 60.) for y in coord1.data]) * float(t)
     for t in coord0.data]) ** 2
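
The nested comprehension is compact but hard to read. The sketch below uses plain numpy arrays in place of the `Coord` objects (`c0`, `c1` and `c2` stand in for `coord0.data`, `coord1.data` and `coord2.data`), checks the resulting shape, and shows an equivalent form based on broadcasting.

```python
import numpy as np

c0 = np.linspace(4000., 1000., 100)   # wavenumbers
c1 = np.linspace(0., 60., 60)         # times
c2 = np.linspace(200., 300., 3)       # temperatures

# Same construction as in the cell above.
nested = np.array(
    [np.array([np.sin(c2 * 2. * np.pi / 4000.) * np.exp(-y / 60.) for y in c1]) * float(t)
     for t in c0]) ** 2

# Equivalent form using broadcasting over the three axes.
broadcast = (c0[:, None, None]
             * np.exp(-c1 / 60.)[None, :, None]
             * np.sin(c2 * 2. * np.pi / 4000.)[None, None, :]) ** 2

print(nested.shape)
```

The first axis follows `coord0`, the second `coord1` and the third `coord2`, which is the order in which the coordinates are later passed to the constructor.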

The dataset is now created with these data and coordinates. All the needed information is passed as parameters to the
NDDataset constructor.

In [None]:
mydataset = NDDataset(nd_data,
                      name = 'mydataset',
                      coords=[coord0, coord1, coord2],
                      title='Absorbance',
                      units='absorbance'
                      )

mydataset.description = """Dataset example created for this tutorial. 
It's a 3-D dataset (with dimensionless intensity)"""

mydataset.author = 'Blake & Mortimer'

We can get some information about this object:

In [None]:
mydataset

## Copying existing NDDataset

Copying an existing dataset is as simple as:

In [None]:
da_copy = da.copy()

or alternatively:

In [None]:
da_copy = da[:]
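
One point worth keeping in mind when coming from numpy: for a plain `numpy.ndarray`, `a[:]` returns a view that shares memory with `a`, whereas `a.copy()` returns independent data. The sketch below shows this numpy behaviour only; it makes no claim about what `da[:]` shares internally.

```python
import numpy as np

a = np.array([1, 2, 3])
view = a[:]             # basic slicing: a view on the same buffer
independent = a.copy()  # a new buffer

view[0] = 99            # also changes `a`
print(a, independent)
```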

Finally, it is also possible to initialize a dataset using an existing one:

In [None]:
dc = NDDataset(mydataset, name='duplicate of %s'%mydataset.name , units='absorbance')
dc

### Other ways to create NDDatasets

Some numpy creation functions can be used to set up the initial dataset array (see the
[numpy array creation routines](https://docs.scipy.org/doc/numpy/reference/routines.array-creation.html#routines-array-creation)).


In [None]:
dz = zeros((2, 2), units='meters', title='Datasets with only zeros')
dz

In [None]:
do = ones((2, 2), units='kilograms', title='Datasets with only ones')
do

In [None]:
df = full((2, 2), fill_value=1.25, units='radians',
     title='with only float=1.25')  
df

As with numpy, it is also possible to take another dataset as a template:

In [None]:
do = ones((2, 3), dtype=bool)
do[1,1]=0
do

Now we use the previous dataset ``do`` as a template for the shape, but we change the `dtype`.

In [None]:
df = full_like(do, dtype=np.float64, fill_value=2.5)
df
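
These functions mirror numpy routines of the same name. For comparison, here is the plain numpy counterpart of the template example (without units or titles, which numpy arrays do not carry); `template` and `filled` are illustrative names.

```python
import numpy as np

template = np.ones((2, 3), dtype=bool)
template[1, 1] = False

# Same shape as the template, but a different dtype and a constant fill value.
filled = np.full_like(template, fill_value=2.5, dtype=np.float64)
print(filled.shape, filled.dtype)
```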

## Importing from external dataset

An NDDataset can be created by importing external data.

A **test** data folder contains some data for experimenting with some features of datasets.

In [None]:
# let's check that this directory exists and display its name:
import os

datadir = general_preferences.datadir
if os.path.exists(datadir):
    # let's display only the last part of the path
    print(os.path.basename(datadir))

### Reading an IR dataset saved by OMNIC (.spg extension)

Even if we do not specify the **datadir**, the application first looks in that directory by default.

In [None]:
dataset = NDDataset.read_omnic(os.path.join('irdata', 'NH4Y-activation.SPG'))
dataset

## Slicing a NDDataset

An NDDataset can be sliced like a conventional numpy array,

*e.g.,*:

1. by index, using a slice such as [3], [0:10], [:, 3:4], [..., 5:10], ...

2. by values, using a slice such as [3000.0:3500.0], [..., 300.0], ...

3. by labels, using a slice such as ['monday':'friday'], ...

In [None]:
new = mydataset[..., 0]
new

or using the axis labels:

In [None]:
new = mydataset[..., 'hot']
new

Be sure to use the correct type for slicing.

Floats are used for slicing by values:

In [None]:
correct = mydataset[2000.]
correct

In [None]:
outside_limits = mydataset[2000]

<div class='alert alert-info'>
    
**NOTE:**
If one uses an integer value (2000), then the slicing is made **by index, not by value**. In this particular case, an `Error` is raised because index 2000 does not exist (the size along axis `x` (axis:0) is only 100, so the index varies between 0 and 99).

</div>
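
The distinction between integer and float keys can be pictured with the following numpy sketch. The `locate` helper is hypothetical and only mimics the behaviour described here (nearest coordinate value for a float, plain position for an integer); SpectroChemPy's actual lookup may differ in its details.

```python
import numpy as np

wavenumbers = np.linspace(4000., 1000., 100)   # same values as coord0


def locate(coord, key):
    """Map a slicing key to a position along `coord` (illustrative only)."""
    if isinstance(key, (int, np.integer)):
        # Integer: interpreted as an index, so it must be within bounds.
        if not -len(coord) <= key < len(coord):
            raise IndexError(f"index {key} is out of bounds for size {len(coord)}")
        return int(key)
    # Float: interpreted as a coordinate value; pick the closest point.
    return int(np.argmin(np.abs(coord - key)))


by_value = locate(wavenumbers, 2000.)
print(by_value, wavenumbers[by_value])

try:
    locate(wavenumbers, 2000)
except IndexError as err:
    print("IndexError:", err)
```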

One can mix slicing methods for different dimensions:

In [None]:
new = mydataset[4000.0:2000., 0, 'normal':'hot']
new

## Loading of experimental data


### NMR Data

Now, let's load an NMR dataset (in the Bruker format).

In [None]:
path = os.path.join(datadir, 'nmrdata', 'bruker', 'tests', 'nmr', 'bruker_1d')

# load the data in a new dataset
ndd = NDDataset()
ndd.read_bruker_nmr(path, expno=1, remove_digital_filter=True)
ndd

In [None]:
# view it...
_ = ndd.plot(color='blue')

In [None]:
path = os.path.join(datadir, 'nmrdata', 'bruker', 'tests', 'nmr', 'bruker_2d')

# load the data directly (no need to create the dataset first)
ndd2 = NDDataset.read_bruker_nmr(path, expno=1, remove_digital_filter=True)

# view it...
ndd2.x.to('s')
ndd2.y.to('ms')

ax = ndd2.plot(method='map')
ndd2

### IR data

In [None]:
dataset = NDDataset.read_omnic(os.path.join(datadir, 'irdata', 'NH4Y-activation.SPG'))
dataset

In [None]:
ax = dataset.plot(method='stack')

## Masks

If we try to get, for example, the maximum of the previous dataset, we face a problem due to the saturation around 1100 cm$^{-1}$.

In [None]:
dataset.max()

One way is to apply the max function to only a part of the spectrum. Another way is to mask the undesired data.

Masking values in a dataset is straightforward. Just set the value `MASKED` or `True` for the data you want to mask.

In [None]:
dataset[1290.:890.] = MASKED

Now the max function returns the correct position:

In [None]:
dataset.max().x
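
The effect of a mask on `max` can be reproduced with numpy's own masked arrays. This is a self-contained illustration on synthetic numbers (the values below are made up for the example), not the OMNIC dataset loaded above.

```python
import numpy as np

x = np.array([1400., 1300., 1200., 1100., 1000., 900., 800.])
y = np.array([0.2, 0.5, 4.0, 6.0, 5.0, 0.8, 0.3])   # saturated around 1100

spectrum = np.ma.masked_array(y)
# Mask every point whose coordinate lies in the saturated region.
spectrum[(x <= 1290.) & (x >= 890.)] = np.ma.masked

peak = int(spectrum.argmax())
print(x[peak], spectrum.max())
```

Masked points are ignored by `max` and `argmax`, so the reported maximum comes from the unsaturated part of the signal.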

Here is the figure displayed with the new mask:

In [None]:
_ = dataset.plot_stack()

## Transposition

A dataset can be transposed:

In [None]:
datasetT = dataset.T
datasetT

As can be observed, the dimensions `x` and `y` have been exchanged, *e.g.* the original shape was **(x:5549, y:55)**, and after transposition it is **(y:55, x:5549)**
(the dimension names stay the same, but the indexes of the corresponding axes are exchanged).

Let's visualize the result:

In [None]:
_ = datasetT.plot()

In [None]:
dataset[4000.:3000.], datasetT[:,4000.:3000.]

## Units


SpectroChemPy can do calculations with units: it uses [pint](https://pint.readthedocs.io) to define and perform operations on data with units.

### Create quantities

* To create a quantity, use for instance one of the following expressions:

In [None]:
Quantity('10.0 cm^-1')

In [None]:
Quantity(1.0, 'cm^-1/hour')

In [None]:
Quantity(10.0, ur.cm / ur.km)

or, perhaps simpler,

In [None]:
10.0 * ur.meter / ur.gram / ur.volt

`ur` stands for **unit registry**, which handles many types of units
(and conversions between them).

### Do arithmetic with units

In [None]:
a = 900 * ur.km
b = 4.5 * ur.hours
a / b

Such calculations can also be done using a string expression:

In [None]:
Quantity("900 km / (8 hours)")

### Convert between units

In [None]:
c = a / b
c.to('cm/s')

We can make the conversion *in place* using *ito* instead of *to*:

In [None]:
c.ito('m/s')
c
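
The numbers returned by these conversions can be checked by hand with plain Python, using only the definitions 1 km = 1000 m, 1 m = 100 cm and 1 hour = 3600 s. The variable names are illustrative.

```python
distance_km = 900.0
duration_h = 4.5

speed_km_per_h = distance_km / duration_h          # 200 km/h
speed_m_per_s = speed_km_per_h * 1000.0 / 3600.0   # about 55.56 m/s
speed_cm_per_s = speed_m_per_s * 100.0             # about 5555.6 cm/s

print(speed_km_per_h, round(speed_m_per_s, 2), round(speed_cm_per_s, 1))
```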

### Do math operations with consistent units

In [None]:
x = 10 * ur.radians
np.sin(x)

The consistency of the units is checked. Taking the square root of a length is allowed:

In [None]:
x = 10 * ur.meters
np.sqrt(x)

but this is wrong (a cosine requires a dimensionless argument):

In [None]:
x = 10 * ur.meters
try:
    np.cos(x)
except DimensionalityError as e:
    log.error(e)
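
Why `np.sqrt` accepts metres while `np.cos` rejects them can be understood by tracking the exponent of each base unit. The toy `Qty` class below is a hypothetical, standard-library-only sketch of that bookkeeping; it is not how pint is implemented, and it handles a single base unit (length) for brevity.

```python
import math
from dataclasses import dataclass


@dataclass
class Qty:
    """Toy quantity: a magnitude and the exponent of the length unit (metre)."""
    value: float
    length_exp: float = 0.0   # 0 means dimensionless


def qsqrt(q):
    # A square root is always allowed: the unit exponent is simply halved.
    return Qty(math.sqrt(q.value), q.length_exp / 2)


def qcos(q):
    # A cosine only makes sense for a dimensionless argument (e.g. radians).
    if q.length_exp != 0:
        raise ValueError("cos() requires a dimensionless argument")
    return Qty(math.cos(q.value))


root = qsqrt(Qty(10.0, length_exp=1))   # sqrt(10 m): exponent becomes 0.5
print(root)

try:
    qcos(Qty(10.0, length_exp=1))       # cos(10 m) is rejected
except ValueError as err:
    print("error:", err)
```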

Units can be set for NDDataset data and/or coordinates:

In [None]:
ds = NDDataset([1., 2., 3.], units='g/cm^3', title='concentration')
ds

In [None]:
ds.to('kg/m^3')

In [None]:
Quantity(10.0, 'cm').plus_minus(.2)

## Numpy universal functions (ufunc's)

A numpy universal function (or `numpy.ufunc` for short) is a function that
operates on `numpy.ndarray` in an element-by-element fashion. It's
vectorized and so rather fast.

As SpectroChemPy NDDatasets imitate the behaviour of numpy objects, many numpy
ufuncs can be applied directly.

For example, if you need all the elements of an NDDataset to be replaced by
their square roots, you can use the `numpy.sqrt` function:

In [None]:
da = NDDataset([1., 2., 3.])
da_sqrt = np.sqrt(da)
da_sqrt
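
For reference, this is the plain numpy behaviour that `NDDataset` imitates: `np.sqrt` is an instance of `numpy.ufunc` and acts on every element of the array at once. The names `values` and `roots` are illustrative.

```python
import numpy as np

values = np.array([1., 2., 3.])
roots = np.sqrt(values)   # element-by-element, no explicit Python loop

print(isinstance(np.sqrt, np.ufunc), roots)
```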

### Ufuncs with NDDataset with units

When an NDDataset has units, some restrictions apply to the use of ufuncs:

Some functions accept only dimensionless quantities. This is the
case, for example, of the exponential and logarithmic functions `exp` and `log`.

In [None]:
np.log10(da)

In [None]:
da.units = ur.cm

try:
    np.log10(da)
except DimensionalityError as e:
    log.error(e)

## Complex or hypercomplex NDDatasets


NDDataset objects with complex data are handled differently than in
`numpy.ndarray`.

Complex data are stored by interlacing the real and imaginary parts.
This allows the definition of data that can be complex along several axes and, *e.g.*,
allows 2D-hypercomplex arrays that can be transposed (useful for NMR data).

In [None]:
da = NDDataset([[1. + 2.j, 2. + 0j], [1.3 + 2.j, 2. + 0.5j], [1. + 4.2j, 2. + 3j], [5. + 4.2j, 2. + 3j]])
da

A dataset of type float can be transformed into a complex dataset (using two consecutive values along a dimension to build one complex value):

In [None]:
da = NDDataset(np.arange(40).reshape(10,4))
da

In [None]:
dac = da.set_complex()
dac

Note that the size of the `x` dimension is divided by a factor of two.

A dataset which is complex in two dimensions is called hypercomplex (its datatype in SpectroChemPy is set to quaternion).
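
The halving of the `x` size follows directly from interlacing: each pair of consecutive values along the last axis is read as (real, imaginary). The numpy sketch below illustrates the idea on the same 10 x 4 array; it is an assumption about the layout based on the description above, not the code of `set_complex`.

```python
import numpy as np

raw = np.arange(40, dtype=float).reshape(10, 4)

# Even positions along the last axis hold real parts, odd positions imaginary parts.
as_complex = raw[..., ::2] + 1j * raw[..., 1::2]

print(raw.shape, "->", as_complex.shape)
print(as_complex[0])
```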

In [None]:
daq = da.set_quaternion()   # equivalently one can use the set_hypercomplex method
daq

In [None]:
daq.dtype