# Benefits of edit

Without a proper framework or tools, loading data means reimplementing the same boilerplate code in every project.

Rather than wasting scientists' time on setting this up, EDIT aims to provide a unified interface to any dataset the user has access to.

Currently, EDIT uses `xarray` as its base-level data loader, so `netcdf`, `zarr`, and any other `xarray`-supported structures are included by default.

## Data Loaders

### Location Specific 
`edit` is designed with the knowledge that each computing infrastructure holds its data in a slightly different way.

Accordingly, an archive must be set up for each location before the data held there can be loaded.

This example notebook uses the `NCI` (National Computational Infrastructure) specific archive, as NCI is the main development environment.

### Other Data 
In addition to these location-specific data loaders, `edit` contains an `Intake` and a `CDS` data loader, allowing data to be downloaded or accessed via an Intake catalog.

Also, various `patterns` are implemented that allow data to be saved out and then loaded as if it were an archive.

## Examples
Let's go through the retrieval of data from two common sources at the Bureau.

### ERA5

In [1]:
import edit.data



`edit` includes an auto import step, which seeks to identify the current computing platform and load the associated archive. 

Therefore, by default, the `NCI` specific archive is loaded upon import of `edit.data`.

In [2]:
print(f"NCI archive loaded: {hasattr(edit.data.archive, 'NCI')}")

NCI archive loaded: True
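
How the auto import identifies the platform is not shown in this notebook. As a purely illustrative sketch, assuming detection is based on the machine's hostname, the idea might look like the following. The `KNOWN_PLATFORMS` table, the hostname pattern, and `detect_archive` are all hypothetical and are not part of `edit`.

```python
import re
import socket

# Hypothetical mapping from hostname patterns to archive names.
# Neither the pattern nor this table comes from `edit` itself.
KNOWN_PLATFORMS = {
    r"^gadi": "NCI",
}


def detect_archive(hostname=None):
    """Return the name of the archive matching `hostname`, or None."""
    if hostname is None:
        hostname = socket.gethostname()
    for pattern, archive_name in KNOWN_PLATFORMS.items():
        if re.search(pattern, hostname):
            return archive_name
    return None


print(detect_archive("gadi-login-01"))  # matches the hypothetical NCI pattern
print(detect_archive("my-laptop"))      # no known platform
```

Passing the hostname as an argument keeps the lookup easy to test away from the target machine.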


In [3]:
help(edit.data.archive.NCI.ERA5.__init__)

Help on function __init__ in module edit_archive_NCI.ERA5:

__init__(self, variables: 'list[str] | str', *, product: "Literal['monthly-averaged', 'monthly-averaged-by-hour', 'reanalysis']" = 'reanalysis', level_value: 'int | float | list[int | float] | tuple[list | int, ...] | None' = None, transforms: 'Transform | TransformCollection' = Transform Collection:
   Empty)
    Setup ERA5 Indexer

    Args:
        variables (list[str] | str):
            Data variables to retrieve
        resolution (Literal[ERA_RES], optional):
            Resolution of data, must be one of 'monthly-averaged','monthly-averaged-by-hour', 'reanalysis'.
            Defaults to 'reanalysis'.
        level_value: (int, optional):
            Level value to select if data contains levels. Defaults to None.
        transforms (Transform | TransformCollection, optional):
            Base Transforms to apply.
            Defaults to TransformCollection().



Every `Index` requires just enough information to identify the data requested by the user. For `ERA5`, that is the `variables` and the `product` (documented as `resolution` in the docstring above).

Let's retrieve `t` (on pressure levels) and `tcwv` (on a single level) to show that both kinds of data can be retrieved together.

In [4]:
ERA5_index = edit.data.archive.NCI.ERA5(('t', 'tcwv'))
ERA5_index

Now that an index is specified, it can be indexed with a time to actually get the data.

In [5]:
ERA5_index['2023-01-01T00']

| | Array | Chunk |
|---|---|---|
| Bytes | 146.54 MiB | 12.01 MiB |
| Shape | (1, 37, 721, 1440) | (1, 6, 486, 1080) |
| Dask graph | 28 chunks in 4 graph layers | 28 chunks in 4 graph layers |
| Data type | float32 numpy.ndarray | float32 numpy.ndarray |

| | Array | Chunk |
|---|---|---|
| Bytes | 3.96 MiB | 255.94 kiB |
| Shape | (1, 721, 1440) | (1, 182, 360) |
| Dask graph | 16 chunks in 4 graph layers | 16 chunks in 4 graph layers |
| Data type | float32 numpy.ndarray | float32 numpy.ndarray |


Note that the `repr` for the index lists a couple of `Transforms`.

These are added by default, some by the `Index` itself to trim the data. Their role is to transform the data into a more usable and universal format.

For example:
- Coordinates are renamed to a standard
- Some ERA5 variables are aligned to the variable name given in the path
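
To make the idea of a transform concrete, here is a minimal sketch that uses plain dictionaries rather than `xarray` objects so that it runs without dependencies. `rename_coords`, `apply_all`, and the coordinate names are hypothetical stand-ins, not the transforms that `edit` actually applies.

```python
from typing import Callable

# A "dataset" here is just a dict mapping coordinate names to values.
Dataset = dict
Transform = Callable[[Dataset], Dataset]


def rename_coords(mapping: dict) -> Transform:
    """Build a transform that renames coordinates according to `mapping`."""
    def transform(data: Dataset) -> Dataset:
        return {mapping.get(name, name): values for name, values in data.items()}
    return transform


def apply_all(transforms: list, data: Dataset) -> Dataset:
    """Apply each transform in order, feeding the result into the next."""
    for transform in transforms:
        data = transform(data)
    return data


raw = {"lat": [-90.0, 90.0], "lon": [0.0, 359.75], "time": ["2023-01-01T00"]}
standardise = [rename_coords({"lat": "latitude", "lon": "longitude"})]
result = apply_all(standardise, raw)
print(sorted(result))
```

Keeping each transform a plain callable means a collection is simply an ordered list, which is one simple way to compose them.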

### ACCESS-G
Now that we have data from ERA5, let's show how to get data from ACCESS-G.

In [6]:
help(edit.data.archive.NCI.ACCESS.__init__)

Help on function __init__ in module edit_archive_NCI.ACCESS:

__init__(self, variables: 'list[str] | str', region: 'str', *, datatype: 'str', level_value: 'Any' = None, transforms: 'Transform | TransformCollection' = Transform Collection:
   Empty, **kwargs)
    Setup ACCESS Index Class

    Args:
        variables (list[str] | str):
            Variables to retrieve
        region (str):
            ACCESS Region Code - ['g','bn','ad','sy','vt','ph','nq','dn']
        datatype (str):
            ACCESS Datatype Code - ['an', 'fc', 'fcmm']
        level_value: (int, optional):
            Level value to select if data contains levels. Defaults to None.
        transforms (Transform | TransformCollection, optional):
            Base Transforms to apply. Defaults to TransformCollection().



In [7]:
ACCESS_index = edit.data.archive.NCI.ACCESS.analysis('pl/air_temp', region='G')
ACCESS_index

Once again, the `Index` can simply be indexed with a time to get the data.

In [8]:
ACCESS_index['2023-01-01T00']

| | Array | Chunk |
|---|---|---|
| Bytes | 324.00 MiB | 65.03 MiB |
| Shape | (1, 27, 1536, 2048) | (1, 15, 924, 1230) |
| Dask graph | 8 chunks in 4 graph layers | 8 chunks in 4 graph layers |
| Data type | float32 numpy.ndarray | float32 numpy.ndarray |


## Comparison
There is no meaningful difference between how the two indexes are used.

Therefore, it is easy for a user to change their data source, because the complexity of locating the data is abstracted away.
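
A rough sketch of why this works: if every index exposes the same `__getitem__` signature, downstream code never needs to know which source it is talking to. The two classes below are hypothetical stand-ins that return strings rather than data, and they do not reflect how `edit` locates files.

```python
from typing import Protocol


class Index(Protocol):
    """Anything that can be indexed with a timestamp string."""

    def __getitem__(self, time: str) -> str: ...


class FakeERA5:
    """Hypothetical stand-in for an ERA5 index."""

    def __init__(self, variables):
        self.variables = list(variables)

    def __getitem__(self, time: str) -> str:
        return f"ERA5 {'+'.join(self.variables)} at {time}"


class FakeACCESS:
    """Hypothetical stand-in for an ACCESS index."""

    def __init__(self, variable, region):
        self.variable = variable
        self.region = region

    def __getitem__(self, time: str) -> str:
        return f"ACCESS-{self.region} {self.variable} at {time}"


def load_day(index: Index, day: str) -> list:
    """Fetch the 00 and 12 UTC steps for `day` from any index."""
    return [index[f"{day}T{hour:02d}"] for hour in (0, 12)]


for source in (FakeERA5(["t", "tcwv"]), FakeACCESS("pl/air_temp", region="G")):
    print(load_day(source, "2023-01-01"))
```

`load_day` is written once and works with either source, which is the property the comparison above relies on.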

## Patterns
A pattern index can be used to save and access data generated by the user.

It behaves like a normal index, but generates paths based on a predefined pattern or one defined by the user.

Several patterns are already implemented and can be easily configured.

In [9]:
edit.data.patterns.__dir__()

['__name__',
 '__doc__',
 '__package__',
 '__loader__',
 '__spec__',
 '__path__',
 '__file__',
 '__cached__',
 '__builtins__',
 'utils',
 'default',
 'PatternIndex',
 'PatternTimeIndex',
 'PatternForecastIndex',
 'PatternVariableAware',
 'argument',
 'Argument',
 'ArgumentExpansion',
 'ArgumentExpansionVariable',
 'ArgumentExpansionFactory',
 'direct',
 'Direct',
 'TemporalDirect',
 'ForecastDirect',
 'DirectVariable',
 'ForecastDirectVariable',
 'TemporalDirectVariable',
 'DirectFactory',
 'expanded_date',
 'ExpandedDate',
 'TemporalExpandedDate',
 'ForecastExpandedDate',
 'ExpandedDateVariable',
 'ForecastExpandedDateVariable',
 'TemporalExpandedDateVariable',
 'ExpandedDateFactory',
 'static',
 'Static',
 'parser',
 'ParsingPattern']

In [10]:
edit.data.patterns.ExpandedDate('temp', prefix='data_').search('2020-01-01T00')

PosixPath('/jobfs/117227791.gadi-pbs/tmpwbp528j7/2020/01/01/data_20200101T0000.nc')
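
The path above suggests a year/month/day directory layout with a timestamped filename. A minimal sketch of generating such a path with the standard library follows. `expanded_date_path` is a hypothetical helper that mimics the layout seen in the output; it is not the implementation of `ExpandedDate`.

```python
from datetime import datetime
from pathlib import PurePosixPath


def expanded_date_path(root, time, prefix="", extension=".nc"):
    """Build root/YYYY/MM/DD/<prefix>YYYYMMDDTHHMM<extension> for `time`."""
    filename = f"{prefix}{time:%Y%m%dT%H%M}{extension}"
    return PurePosixPath(root) / f"{time:%Y}" / f"{time:%m}" / f"{time:%d}" / filename


path = expanded_date_path("/tmp/example", datetime(2020, 1, 1, 0), prefix="data_")
print(path)
```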

Patterns can be set up to automatically split out variables from an `xarray` dataset, or to allow temporal indexing.
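
As an illustration of splitting variables out, the sketch below writes each variable of a dictionary-based stand-in dataset to its own file and reads it back by time. It uses JSON files and a hypothetical `<root>/<variable>/<timestamp>.json` layout purely for demonstration; the real patterns operate on `xarray` datasets and may lay files out differently.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


class VariableSplitPattern:
    """Toy pattern: one JSON file per variable per timestamp (hypothetical layout)."""

    def __init__(self, root):
        self.root = Path(root)

    def search(self, variable, time):
        """Return the path where `variable` at `time` is stored."""
        return self.root / variable / f"{time:%Y%m%dT%H%M}.json"

    def save(self, dataset, time):
        """Write each variable in `dataset` to its own file."""
        for variable, values in dataset.items():
            path = self.search(variable, time)
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(json.dumps(values))

    def load(self, variables, time):
        """Read the requested variables for `time` back into one dict."""
        return {v: json.loads(self.search(v, time).read_text()) for v in variables}


with tempfile.TemporaryDirectory() as tmp:
    pattern = VariableSplitPattern(tmp)
    when = datetime(2020, 1, 1, 0)
    pattern.save({"t": [271.3, 272.1], "tcwv": [12.5, 13.0]}, when)
    files = sorted(p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*.json"))
    loaded = pattern.load(["tcwv"], when)

print(files)
print(loaded)
```

Because `save` and `load` both go through `search`, the path convention lives in one place.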

If no existing pattern fits the data structure in question, a `ParsingPattern` can be made to generate paths either from kwargs or from information in the dataset.

In [11]:
help(edit.data.patterns.ParsingPattern.__init__)

Help on function __init__ in module edit.data.patterns.parser:

__init__(self, root_dir: 'str', parse_str: 'str', *, transforms: 'Transform | TransformCollection' = Transform Collection:
   Empty, add_default_transforms: 'bool' = True, preprocess_transforms: 'Transform | TransformCollection | Callable | None' = None, **kwargs)
    Create pattern from a formatting string

    If being used to retrieve data without saving it first,
    set values in `parse_str` through `kwargs` or when using `search`.

    Args:
        root_dir (str):
            Root directory to begin the path, can be 'temp' for temp directory.
        parse_str (str):
            str to parse to find paths. Use 'variable' for data vars
            E.g. '{level}/{variable}/{time:%Y%M}'.
        transforms (Transform | TransformCollection, optional):
            Transforms to add on retrieval. Defaults to TransformCollection().
        add_default_transforms (bool, optional):
            Whether to add default transfor

In [12]:
edit.data.patterns.ParsingPattern('temp', '{level:04d}/{value}', level=[0, 10]).search(value='dataset')

[PosixPath('/jobfs/117227791.gadi-pbs/tmp_mry287_/0000/dataset'),
 PosixPath('/jobfs/117227791.gadi-pbs/tmp_mry287_/0010/dataset')]
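
The list-valued `level` argument above yields one path per value. Here is a minimal sketch of that expansion, assuming it behaves like a Cartesian product over list-valued arguments fed into `str.format`. `expand_paths` is a hypothetical helper, not part of `edit`.

```python
from itertools import product
from pathlib import PurePosixPath


def expand_paths(root, parse_str, **kwargs):
    """Format `parse_str` once for every combination of list-valued kwargs."""
    # Wrap scalar values in a list so everything can go through `product`.
    options = {k: v if isinstance(v, (list, tuple)) else [v] for k, v in kwargs.items()}
    keys = list(options)
    return [
        PurePosixPath(root) / parse_str.format(**dict(zip(keys, combo)))
        for combo in product(*(options[k] for k in keys))
    ]


paths = expand_paths("/tmp/example", "{level:04d}/{value}", level=[0, 10], value="dataset")
for path in paths:
    print(path)
```

The format specification `04d` is what pads each level to four digits, matching the `0000` and `0010` directories in the output above.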