# Combining data after pipeline computation

A common use case of TokSearch is to aggregate the data in a results object for visualization, training a machine learning model, or performing statistical analysis. This notebook shows a few examples of how one might go about doing that.

## Creating a simple pipeline

We'll start by creating a very simple pipeline that fetches flux-on-the-grid data from MDSplus.

In [9]:
import numpy as np
import xarray as xr
from toksearch import MdsSignal, Pipeline

pipeline = Pipeline([165920, 165921, 173000])

psirz_sig = MdsSignal(
    r'\psirz',
    'efit01',
    dims=('r', 'z', 'times'),
    data_order=('times', 'r', 'z'),
)

pipeline.fetch_dataset('ds', {'psirz': psirz_sig})

### Aside on ordering of dimensions
Note that in creating the ```MdsSignal``` object, we had to be careful to specify both the ```dims``` and the ```data_order``` keyword arguments. This is necessary because the order in which MDSplus stores the coordinates for a node's dimensions doesn't necessarily match the axis order of the data that is retrieved. In this case, MDSplus is set up such that ```dim_of(0)``` holds the ```r``` coordinates, ```dim_of(1)``` holds the ```z``` coordinates, and ```dim_of(2)``` holds the ```times``` coordinates. However, the axes of the underlying NumPy ndarray are ordered as ```('times', 'r', 'z')```.
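The mismatch described above can be illustrated with plain NumPy. The following is only a sketch using made-up array sizes and synthetic data; it does not show how TokSearch handles the mapping internally, and the variable names (`coords`, `axis_lengths`, and so on) are hypothetical.

```python
import numpy as np

# Synthetic stand-ins for what MDSplus would return (sizes are made up).
r = np.linspace(0.84, 2.54, 5)            # plays the role of dim_of(0)
z = np.linspace(-1.6, 1.6, 4)             # plays the role of dim_of(1)
times = np.array([100.0, 120.0, 140.0])   # plays the role of dim_of(2)
data = np.zeros((3, 5, 4))                # axes ordered as (times, r, z)

dims = ("r", "z", "times")                # order of the coordinate arrays
data_order = ("times", "r", "z")          # order of the ndarray axes

coords = dict(zip(dims, (r, z, times)))

# Pairing axis names with the data shape via data_order is consistent...
axis_lengths = dict(zip(data_order, data.shape))
consistent = all(len(coords[name]) == axis_lengths[name] for name in dims)

# ...whereas naively assuming the data axes follow the dims order is not.
naive_lengths = dict(zip(dims, data.shape))
naive_consistent = all(len(coords[name]) == naive_lengths[name] for name in dims)

print(consistent, naive_consistent)
```

Keeping the two orderings separate is what allows each coordinate array to be matched to the correct axis of the data.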

### Computing the data

Now we go ahead and compute the pipeline. Recall that the object returned from the ```compute_*``` family of methods is a list-like object that can be iterated over. So, we can extract a list of xarray ```Dataset``` objects. This list can be used subsequently as a basis for a few types of aggregations.

In [10]:
recs = pipeline.compute_serial()

datasets = [rec['ds'] for rec in recs]

### Using ```xr.concat```

One option is to create a new dataset that is concatenated along the ```shot``` dimension. Note that if, for example, the timebases differ between shots (as they almost always will), xarray aligns the datasets on the union of their ```times``` coordinates, so this approach will leave you with some ```nan``` values in the data.

In [11]:
combined_along_shot_dim = xr.concat(datasets, dim='shot')
print(combined_along_shot_dim)

<xarray.Dataset> Size: 16MB
Dimensions:  (times: 312, r: 65, z: 65, shot: 3)
Coordinates:
  * times    (times) float32 1kB 100.0 120.0 140.0 ... 6.36e+03 6.38e+03
  * r        (r) float32 260B 0.84 0.8666 0.8931 0.9197 ... 2.487 2.513 2.54
  * z        (z) float32 260B -1.6 -1.55 -1.5 -1.45 -1.4 ... 1.45 1.5 1.55 1.6
  * shot     (shot) int64 24B 165920 165921 173000
Data variables:
    psirz    (shot, times, r, z) float32 16MB -0.2949 -0.2961 -0.297 ... nan nan
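To see where the ```nan``` values come from, here is a small sketch with two synthetic datasets whose timebases only partly overlap. The helper `make_ds`, the shot numbers, and the grid sizes are hypothetical, and the sketch assumes each dataset carries ```shot``` as a scalar coordinate. The `join="outer"` argument is passed explicitly to request alignment on the union of the coordinates.

```python
import numpy as np
import xarray as xr

def make_ds(shot, times):
    """Build a tiny synthetic dataset shaped like the fetched one."""
    data = np.full((len(times), 2, 2), float(shot))
    return xr.Dataset(
        {"psirz": (("times", "r", "z"), data)},
        coords={"times": times, "r": [1.0, 2.0], "z": [-1.0, 1.0], "shot": shot},
    )

a = make_ds(1, [100.0, 120.0, 140.0])
b = make_ds(2, [100.0, 140.0, 160.0])

# The union of the two timebases has four entries, so each shot is
# missing one time slice, which gets filled with NaN.
combined = xr.concat([a, b], dim="shot", join="outer")
n_missing = int(combined["psirz"].isnull().sum())
print(combined.sizes, n_missing)
```

If the padding is undesirable, one alternative is to interpolate each dataset onto a common timebase before concatenating.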


Similarly, we can concatenate along the ```times``` dimension. Here the time slices from each shot are simply placed one after another, so no padding is needed:

In [12]:
combined_along_times_dim = xr.concat(datasets, dim='times')
print(combined_along_times_dim)

<xarray.Dataset> Size: 13MB
Dimensions:  (shot: 3, r: 65, z: 65, times: 758)
Coordinates:
  * shot     (shot) int64 24B 165920 165921 173000
  * r        (r) float32 260B 0.84 0.8666 0.8931 0.9197 ... 2.487 2.513 2.54
  * z        (z) float32 260B -1.6 -1.55 -1.5 -1.45 -1.4 ... 1.45 1.5 1.55 1.6
  * times    (times) float32 3kB 100.0 140.0 160.0 ... 5.4e+03 5.42e+03 5.44e+03
Data variables:
    psirz    (times, r, z) float32 13MB -0.2949 -0.2961 -0.297 ... 0.2966 0.2973
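One consequence worth keeping in mind is that the concatenated ```times``` coordinate is no longer unique or sorted, since each shot contributes its own timebase. The sketch below demonstrates this with synthetic data; `make_ds` and the values used are hypothetical.

```python
import numpy as np
import xarray as xr

def make_ds(times, fill):
    """Build a tiny synthetic dataset with the given timebase."""
    data = np.full((len(times), 2, 2), fill)
    return xr.Dataset(
        {"psirz": (("times", "r", "z"), data)},
        coords={"times": times, "r": [1.0, 2.0], "z": [-1.0, 1.0]},
    )

first = make_ds([100.0, 120.0, 140.0], 1.0)
second = make_ds([100.0, 140.0, 160.0], 2.0)

# Concatenating along an existing dimension appends the slices end to end.
stacked = xr.concat([first, second], dim="times")
n_times = stacked.sizes["times"]
times_unique = stacked.indexes["times"].is_unique
print(n_times, times_unique)
```

Because of the repeated time values, positional indexing with ```isel``` is generally more predictable than label-based selection on such a result.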


### Converting to numpy ndarrays

It is often useful to work with the underlying data directly as NumPy ndarrays.

In [13]:
ndarrays = [ds['psirz'].values for ds in datasets]
ndarrays[0].shape

(303, 65, 65)

For example, the list of ndarrays can then be concatenated along the time axis (axis 0):

In [14]:
big_array = np.concatenate(ndarrays, axis=0)
big_array.shape

(758, 65, 65)
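After concatenating, the array itself no longer records which shot each time slice came from. A minimal sketch of one way to keep track of this, using synthetic arrays in place of the real data (the array sizes and the name `shot_labels` are hypothetical), is to build a parallel array of shot numbers with ```np.repeat```:

```python
import numpy as np

shots = [165920, 165921, 173000]

# Synthetic stand-ins for the per-shot arrays, each shaped (times, r, z).
ndarrays = [np.zeros((n, 4, 4), dtype=np.float32) for n in (3, 2, 5)]

big_array = np.concatenate(ndarrays, axis=0)

# One shot label per time slice, in the same order as big_array's first axis.
shot_labels = np.repeat(shots, [arr.shape[0] for arr in ndarrays])

# Flatten each (r, z) grid into a feature vector, e.g. for model training.
features = big_array.reshape(big_array.shape[0], -1)
print(big_array.shape, shot_labels.shape, features.shape)
```

The labels can then be used to split the samples by shot, for instance to keep whole shots together when building training and validation sets.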