# PAL-flavoured Datatree

SwarmPAL uses the [xarray Datatree](https://xarray-datatree.readthedocs.io) as its core data structure. You can think of it as a file directory (a tree) that contains an arbitrary number of related xarray datasets. Data can be fetched from different sources (including VirES) and stored in a `DataTree`.

`PalDataItem` provides tools to construct an `xarray.Dataset` from different sources (VirES, HAPI, etc.). `create_paldata` helps to construct a `DataTree` from a set of those datasets.
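
To make the directory analogy concrete, here is a minimal sketch that uses plain nested dictionaries in place of tree nodes and placeholder strings in place of datasets. It does not use xarray or SwarmPAL; the `tree` layout and the `resolve` helper are hypothetical and exist only for illustration.

```python
# Analogy only: dicts stand in for tree nodes, strings for xarray datasets.
tree = {
    "ds": None,  # the root holds no dataset in this sketch
    "children": {
        "SW_OPER_MAGA_LR_1B": {
            "ds": "<dataset holding B_NEC>",
            "children": {
                "output": {"ds": "<dataset written by a process>", "children": {}},
            },
        },
    },
}


def resolve(node, path):
    """Walk a slash-separated path, descending one child per segment."""
    for name in path.split("/"):
        node = node["children"][name]
    return node


branch = resolve(tree, "SW_OPER_MAGA_LR_1B/output")
print(branch["ds"])
```

Each node can carry a dataset and any number of named children, which is why a path such as `"SW_OPER_MAGA_LR_1B/output"` addresses a dataset in the same way a file path addresses a file.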

In [None]:
import datetime as dt

## Fetching data

In [None]:
from swarmpal.io import create_paldata, PalDataItem

### from VirES API

In [None]:
# Set of options which are passed to viresclient
data_params = dict(
    collection="SW_OPER_MAGA_LR_1B",
    measurements=["B_NEC"],
    models=["IGRF"],
    start_time="2016-01-01T00:00:00",
    end_time="2016-01-01T03:00:00",
    # start_time=dt.datetime(2016, 1, 1),  # Can use ISO string or datetime
    # end_time=dt.datetime(2016, 1, 1, 3),
    server_url="https://vires.services/ows",
    options=dict(asynchronous=False, show_progress=False),
)
# create_paldata takes an arbitrary number of args & kwargs
# If using args, dataset names will be used as tree names
# If using kwargs, user specifies the tree name/path
data = create_paldata(PalDataItem.from_vires(**data_params))
print(data)

In [None]:
# Interactive view of the datatree
data

In [None]:
# Refer to a branch of the tree like:
data["SW_OPER_MAGA_LR_1B"]

In [None]:
# Note that the above is actually a Datatree object
# To get a view of the Dataset:
data["SW_OPER_MAGA_LR_1B"].ds

## `swarmpal` accessor

The behaviour of the datatree is extended by the addition of an ["accessor"](https://docs.xarray.dev/en/stable/internals/extending-xarray.html) that adds functionality from SwarmPAL under the `.swarmpal` namespace, e.g.:

In [None]:
# Metadata related to the SwarmPAL framework
data.swarmpal.pal_meta

In [None]:
data.swarmpal.magnetic_model_name

The above properties are constructed from metadata which are stored within the datatree itself:

In [None]:
data["SW_OPER_MAGA_LR_1B"].attrs["PAL_meta"]

It is possible to add more complex methods that work on the datasets:

In [None]:
data["SW_OPER_MAGA_LR_1B"].swarmpal.magnetic_residual()

## Defining and running a `PalProcess`

A process can be defined which will act on datatrees obtained as above. Define processes by subclassing the abstract `PalProcess` class.

In [None]:
from swarmpal.io import PalProcess

In [None]:
help(PalProcess)

Here is an example of defining a process. The interface is still subject to change!

Three members must be defined:
- `process_name` (a property) identifies the process, and is used to update the `"PAL_meta"` attribute in the datatree when the process is applied.
- `set_config` takes keyword arguments and stores them as a dict in the `config` property.
- `_call` defines the behaviour of the process itself, and should accept the input datatree and return a modified datatree.

When a process object is instantiated, the user can optionally provide two arguments, which are set as properties of the process:
- `active_tree (str)` selects which branch of the tree is to be used
- `config (dict)` provides parameters to control the behaviour of the process

The config can also be provided using `.set_config()` after the process object is created. This enables the process to provide and document default configurations, as well as allowing the IDE to provide hints for what configuration is available.
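
The two ways of supplying configuration can be sketched without SwarmPAL installed. The `SketchProcess` base class below is a hypothetical stand-in, not the real `PalProcess`: it assumes that a `config` dict passed at construction is forwarded to `set_config`, so defaults declared in that method's signature fill in any missing keys. The `Scale` process and its `factor` option are likewise invented for illustration, and a plain dict stands in for the datatree.

```python
from abc import ABC, abstractmethod


class SketchProcess(ABC):
    """Hypothetical stand-in for PalProcess; not the SwarmPAL implementation."""

    def __init__(self, active_tree="/", config=None):
        self.active_tree = active_tree
        # Assumption: construction-time config is routed through set_config
        self.set_config(**(config or {}))

    @property
    @abstractmethod
    def process_name(self): ...

    @abstractmethod
    def set_config(self, **kwargs): ...

    @abstractmethod
    def _call(self, data): ...

    def __call__(self, data):
        return self._call(data)


class Scale(SketchProcess):
    """Multiply one variable by a constant factor."""

    @property
    def process_name(self):
        return "Scale"

    def set_config(self, parameter="B_NEC", factor=2.0):
        # Defaults live in the signature, where an IDE can display them
        self.config = dict(parameter=parameter, factor=factor)

    def _call(self, data):
        key = self.config["parameter"]
        scaled = [value * self.config["factor"] for value in data[key]]
        return {**data, key: scaled}


# Route 1: pass the configuration when the process is created
at_init = Scale(config={"factor": 10.0})

# Route 2: create with defaults, then adjust with set_config
later = Scale()
later.set_config(factor=10.0)

print(at_init.config == later.config)
```

Under these assumptions both routes end with the same `config`, and any key the user omits falls back to the default in the `set_config` signature.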

In [None]:
from datatree import DataTree
from xarray import Dataset


class MyProcess(PalProcess):
    """Compute the first differences on a given variable"""

    @property
    def process_name(self):
        return "MyProcess"

    def set_config(self, dataset="SW_OPER_MAGA_LR_1B", parameter="B_NEC"):
        self.config = dict(dataset=dataset, parameter=parameter)

    def _call(self, datatree):
        # Identify inputs for algorithm
        subtree = datatree[self.config.get("dataset")]
        dataset = subtree.ds
        parameter = self.config.get("parameter")
        # Apply the algorithm
        output_data = dataset[parameter].diff(dim="Timestamp")
        # Create an output dataset
        data_out = Dataset(
            data_vars={
                f"d/dt ({parameter})": output_data,
            }
        )
        # Write the output into a new path in the datatree and return it
        subtree["output"] = DataTree(data=data_out)
        return datatree

The process can now be created with some configuration:

In [None]:
process = MyProcess(
    config={"dataset": "SW_OPER_MAGA_LR_1B", "parameter": "B_NEC"},
)

...and there is a tool to apply this process to the datatree:

In [None]:
data = data.swarmpal.apply(process)
print(data)

The resulting data can be interrogated with the usual tools (in this case we added a new dataset to the tree under `"SW_OPER_MAGA_LR_1B/output"`):

In [None]:
data["SW_OPER_MAGA_LR_1B/output"].ds["d/dt (B_NEC)"].plot.line(x="Timestamp");

... and the datatree carries with it the metadata about the process which has been applied:

In [None]:
data.swarmpal.pal_meta

## More tricks with `create_paldata`

### Fetching data from HAPI

Two differences from using VirES:
- Parameters follow the scheme in `hapiclient`  
  Example: http://hapi-server.org/servers/#server=VirES-for-Swarm&dataset=SW_OPER_MAGA_LR_1B&parameters=B_NEC&start=2016-01-01T00:00:00&stop=2016-01-01T03:00:00&return=script&format=python
- The output dataset is not identical to that retrieved from VirES (the variables and their content are the same, but there is less metadata)
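
As a small illustration of how the example URL above maps onto the `hapiclient`-style parameters, the sketch below parses the URL fragment with the standard library. It only shows the naming correspondence and makes no network request; the `request` dict and `hapi_keys` tuple are names chosen here, not part of any API.

```python
from urllib.parse import parse_qsl, urlsplit

url = (
    "http://hapi-server.org/servers/#server=VirES-for-Swarm"
    "&dataset=SW_OPER_MAGA_LR_1B&parameters=B_NEC"
    "&start=2016-01-01T00:00:00&stop=2016-01-01T03:00:00"
    "&return=script&format=python"
)

# Everything after "#" is a query-style list of key=value pairs
fragment = dict(parse_qsl(urlsplit(url).fragment))

# Keep only the keys that describe the data request itself
hapi_keys = ("dataset", "parameters", "start", "stop")
request = {key: fragment[key] for key in hapi_keys}
print(request)
```

Note that `server` in the URL fragment is a name from the hapi-server.org listing (`VirES-for-Swarm`), whereas the cell below passes the server address itself.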

In [None]:
data_params = dict(
    server="https://vires.services/hapi",
    dataset="SW_OPER_MAGA_LR_1B",
    parameters="B_NEC",
    start="2016-01-01T00:00:00",
    stop="2016-01-01T03:00:00",
)
data_hapi = create_paldata(alpha_hapi=PalDataItem.from_hapi(**data_params))
print(data_hapi)

### Time padding

A tuple of two `timedelta` objects can be given as the extra parameter `pad_times`. This extends the retrieved time interval, while storing the original time interval in `"analysis_window"` within the `"PAL_meta"` attribute.

In [None]:
data_params = dict(
    server="https://vires.services/hapi",
    dataset="SW_OPER_MAGA_LR_1B",
    parameters="B_NEC",
    start="2016-01-01T00:00:00",
    stop="2016-01-01T03:00:00",
    pad_times=(dt.timedelta(hours=1), dt.timedelta(hours=1)),
)
data_hapi = create_paldata(alpha_hapi=PalDataItem.from_hapi(**data_params))
print(data_hapi)
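
The arithmetic behind the padding can be sketched with the standard library alone. This assumes the first element of `pad_times` extends the interval backwards from `start` and the second extends it forwards from `stop`; that ordering is an assumption to check against the SwarmPAL documentation. The variable names are chosen for illustration.

```python
import datetime as dt

start = dt.datetime.fromisoformat("2016-01-01T00:00:00")
stop = dt.datetime.fromisoformat("2016-01-01T03:00:00")
pad_times = (dt.timedelta(hours=1), dt.timedelta(hours=1))

# Assumed convention: pad_times[0] is subtracted from start,
# pad_times[1] is added to stop
padded_start = start - pad_times[0]
padded_stop = stop + pad_times[1]

# The unpadded interval is what the text calls the analysis window
analysis_window = (start, stop)

print("retrieved:", padded_start.isoformat(), "to", padded_stop.isoformat())
print("analysis: ", start.isoformat(), "to", stop.isoformat())
```

With one hour of padding on each side, the three-hour request would retrieve five hours of data, while the analysis window stays at the original three hours.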